UNIVERSITÀ DEGLI STUDI DI PAVIA
Facoltà di Lettere e Filosofia
Corso di Laurea Specialistica in Linguistica Teorica e Applicata
TOWARDS A DISCOURSE RESOURCE FOR ITALIAN: DEVELOPING AN ANNOTATION SCHEMA FOR ATTRIBUTION
Supervisor: Prof. Irina Prodanof
Co-supervisor: Dr. Claudia Soria
Co-supervisor: Prof. Cecilia Maria Andorno
Thesis by: Silvia Pareti
Academic Year 2008/2009
Nobody believes the official spokesman...
but everybody trusts an unidentified source.
Ron Nessen
Abstract
This thesis investigates the complex phenomenon of attribution and addresses the
issue of annotating attribution relations, developing, by means of a pilot study, a
possible annotation schema to be applied to the Italian Syntactic Semantic
Treebank corpus of newspaper articles (ISST).
Attribution is the relation between assertions, but also e.g. beliefs,
feelings and intentions, and the agents they belong to (e.g. The minister says that
taxes will rise in 2010). As this relation deeply affects the way we perceive
information, information should not be considered in isolation from its source. It is fundamental to
recognise attribution in order to deal with the reliability of information and with
opinions.
The development of an annotation schema for attribution aims at providing
a resource in which information is overtly linked to its source. Having this
annotated resource could serve a number of purposes especially in the fields of
Information Retrieval, Multi Perspective Question Answering and Opinion Mining.
To date, attribution has only been annotated when associated with a discourse
connective or one of its arguments (Prasad et al., 2007), or only at the document,
sentence (Skadhauge and Hardt, 2005) or even word level (Wiebe, 2002) thus
only partially approaching the phenomenon.
The present study addresses attributions independently and also regards them
as a discourse phenomenon. After analysing the features and issues
connected to attribution, e.g. scope definition, nested attributions, factuality of the
relation and co-reference resolution, an annotation schema will be proposed
following the identification of a set of possible ‘attribution devices’. To test its
feasibility and accuracy against data, a pilot annotation will be performed on a
portion of the ISST corpus. This will allow the definition of annotation guidelines
and the identification of additional issues that remained unnoticed at the theoretical
level. In order to select a suitable tool to perform the pilot annotation, several
available annotation tools will be compared.
This thesis not only constructively contributes to the development of a
discourse resource for Italian, but also approaches attribution relations from a new
independent perspective raising problematic issues and providing a deeper
account of the phenomenon. Further developments of the project should perform a
complete pilot annotation of all the types of attribution and features intended to be
included and develop, together with the appropriate tool, a final annotation schema
to be applied to the whole corpus.
Acknowledgments
It might seem banal; however, every time a challenging project is over it is useful to
look back and consider who made it possible, not only in order to recognise
other people’s merits and efforts, but especially to realise that we have not been
alone. It is because of that very feeling that every time an endeavour finally comes
to its conclusion, I can once again think about starting a new one. I can start by
remembering how much of what I am I owe to the Erasmus Scheme and the
chances it gave me first as a student and recently as an intern to study and
research at two amazing UK universities: Reading University and the University of
Edinburgh. First of all I would like to thank my supervisor at Edinburgh, Bonnie
Webber, for the unforgettable opportunity and the many hours she devoted to
listening to my progress and my many doubts, every time with a solution to
propose or the name of someone who might have one. Echoes of the enlightening
conversations I had there with Theresa Wilson, Janyce Wiebe, Jean Carletta,
Nicoletta Calzolari, John Niekrasz, Katja Markert, and other colleagues, can be
found in this thesis as they were fundamental in shaping my choices and widening
my perspective concerning the topic. A special acknowledgement is also due to
Rashmi Prasad who has patiently answered all my e-mails providing me with
material and clarifications about the PDTB and precious suggestions. Constructive
were also the contacts I had with Roser Saurí, Tommaso Caselli and Massimo
Poesio. Lastly, I cannot forget the contribution of Jasmine to the revision of the
thesis and the technical and loving support Gregor unfailingly provided.
Contents
Abstract
Acknowledgments
List of Figures and Tables
List of Figures
List of Tables
1 Introduction
1.1 An Independent Approach to Attribution
1.2 Methodology
1.3 Terminology
1.4 Outline of the Thesis
2 Discourse and Attribution
2.1 What is Discourse?
2.1.1 Definition
2.1.2 Theories of Discourse Coherence and Cohesion
2.1.3 Constituency vs. Dependency
2.2 Discourse Annotation Projects
2.2.1 RST-DT
2.2.2 The Penn Discourse TreeBank PDTB
2.2.3 Other Projects
2.3 Attribution
2.3.1 Towards a Definition of Attribution
2.3.2 Are Attribution Relations a Discourse Phenomenon?
2.4 Related Studies
2.4.1 GraphBank
2.4.2 Opinion Corpus
2.4.3 PDTB - The Penn Discourse TreeBank
2.5 Summary
3 An Analysis of Attribution
3.1 The Components of Attribution
3.1.1 The Source
3.1.2 The Content
3.1.3 Elements Functioning as Cue
3.2 Some Issues
3.2.1 Nested Attributions
3.2.2 Source of the Source
3.2.3 Multiple Sources, Contents, Cues
3.2.4 Co-reference Resolution
3.2.5 Scope Definition
3.3 Summary
4 Features to Include in the Annotation
4.1 Type
4.1.1 Assertion
4.1.2 Belief
4.1.3 Fact
4.1.4 Eventuality
4.1.5 Issues Concerning Type Definition
4.2 Source
4.2.1 Writer
4.2.2 Arbitrary
4.2.3 Other
4.3 Factuality
4.3.1 Factual
4.3.2 Non-factual
4.4 Scopal Change
4.4.1 Scopal Polarity
4.4.2 Other Elements Affecting the Factuality
4.5 Summary
5 Performing a Pilot Annotation
5.1 Corpus
5.1.1 ISST Architecture
5.1.2 Subcorpus Selection
5.2 Tool Selection
5.2.1 Requirements
5.2.2 Comparison of Available Tools
5.2.3 Selection and Tool Specifics
5.3 Setting MMAX2
5.3.1 Scheme
5.3.2 Customization
5.3.3 Style
5.4 Feasibility of the Schema and Issues
5.5 Summary
6 Annotation Schema and Guidelines
6.1 Text Spans Selection
6.1.1 Source Span
6.1.2 Cue Span
6.1.3 Content Span
6.1.4 Supplement
6.2 Feature Annotation Guidelines
6.2.1 Type Attribute
6.2.2 Factuality Attribute
6.2.3 Scopal Change Attribute
6.2.4 Source Type Attribute
6.3 Collecting a List of Italian Cues
6.3.1 Extracting Verb Cues from the PDTB
6.4 Summary
7 Conclusion
7.1 Future Work
7.1.1 And Beyond
Bibliography
Abbreviations and Acronyms
Appendix 1 – MMAX2 Code
Appendix 2 – Italian Attribution Cues
Appendix 3 – PDTB Verb Cues
List of Figures and Tables
List of Figures
Figure A - Reported news example
Figure B - RST schemas
Figure C - (L-TAG) Tree examples (Cristea and Webber, 1997)
Figure D - Sense classification of discourse connectives in the PDTB
Figure E - Graphic extra-linguistic attribution
Figure F - Newspaper article source
Figure G - Nested attribution schema
Figure H - Truth values of a nested content
Figure I - Design Process
Figure J - ISST orthographic level (sole002)
Figure K - ISST morpho-syntactic level (sole002)
Figure L - ISST syntactic constituent level (sole002)
Figure M - ISST table format
Figure N - GATE annotation environment
Figure O - GATE annotation exported in XML
Figure P - Knowtator annotation environment
Figure Q - Knowtator annotation exported in XML
Figure R - MMAX2 Project Wizard
Figure S - MMAX2 Base Data (ISST cs001)
Figure T - The annotation of cue, content and source as separate levels
Figure U - MMAX2 Annotation window
Figure V - MMAX2 Annotation window (attributes)
Figure W - MMAX2 Annotation of relations
Figure X - Nested attributions visible through handles
Figure Y - Attribution relation components
Figure Z - Annotation, text spans selection
Figure AA - Annotation, elements which could function as a markable
Figure BB - Annotation, attributes selection
List of Tables
Table 1 - Factuality values (Saurí and Pustejovsky, 2008)
Table 2 - N. of articles selected per section
Table 3 - Knowtator/MMAX2 feature comparison
Table 4 - Annotation schema features
Table 5 - Factuality and Scopal change values assignment
1 Introduction
Discourse relations represent a fundamental aspect of discourse understanding
and generation. Therefore research in many areas, such as Information Extraction,
Discourse Generation and Question Answering, would benefit from a discourse
annotated corpus as a basis for their studies.
The aim of this thesis is to contribute towards providing Italian with
complete linguistic resources, in particular by designing and testing the addition
of a discourse level of annotation to the ISST corpus, a multi-level annotated
corpus of Italian newspaper texts. The corpus already consists of five levels of annotation:
orthographic, morpho-syntactic, syntactic (constituents), syntactic (dependencies)
and semantic. The addition of a layer for discourse annotation comes as a natural
development of the ISST corpus.
Most of the work in this frame, to date, concentrates on analysing and
annotating discourse connectives or anaphoric relations. For the purpose of the
present study, however, these issues will not be addressed and the focus will be
on attribution relations. This topic is especially relevant for research dealing with
Information Retrieval, Multi-Perspective Question Answering and Opinion Mining.
Tools able to discern information according to the relevance of its source or
to identify different opinions with regards to a given topic would dramatically
improve the quality of the information we are constantly exposed to. People more
and more refer to the internet as a source of information and knowledge
interrogating search engines instead of encyclopaedias or experts. A number of
projects, most recently the Microsoft search engine ‘Bing’, are trying to outperform ‘Google’
and break its monopoly, with little success, as they introduce small but interesting
changes without noticeably improving the reliability of responses to our queries.
Search engines usually classify the source only at the macro-level, i.e. the
webpage a certain text or information was taken from.
The urge for retrieving answers quickly does not always allow users to take
the context in which the information was found into consideration or to address the
troublesome question: Where does this knowledge come from? Quite often, for
example, we hear people supporting their views by stating that they read
something about them on the internet or even that ‘internet says it’. This
generalisation is also due to the difficulty of linking the information to the exact
source, often hidden by several levels of attribution all nested one in another like a
Matryoshka doll.
The practice of reporting information is particularly pervasive in the
journalistic field and especially in news reviews where what is stated is always
second hand, if it does not originate from even further away. In the example below
(Figure A), the website First Bell reports that the UK has the largest
gender gap in science achievement. This claim, however, is made according to the UK’s
Telegraph, which in turn reports a study from the OECD whose data are taken from
the Program for International Student Assessment (2006).
Figure A - Reported news example
http://links.mkt753.com/servlet/MailView?ms=NDEwMjIzOAS2&r=MTQ3NzY4ODQ3OAS2&j=MTIyNTgxMzg1S0&mt=1&rt=0
In the last few years the Web has become the indistinct repository of all human
knowledge. However, although it surely is the shallowest source of the data we
learn from it, it is never the only one, and knowing all the steps a certain
statement has gone through is as fundamental as knowing, e.g., its temporal
anchor in order to verify its veracity, understand and interpret it. Just consider the
example (1) below:
(1) According to The Times the President wants to buy the Amazon Forest and
turn the trees into toothpicks.
The intention attributed to the President seems to come from a trustworthy
source, ‘The Times’, and would hopefully prompt immediate reactions, at least
from environmentalists. But what if this statement was part of another
attribution relation as in the paragraph that follows (2)?
(2) “According to The Times the President wants to buy the Amazon Forest and
turn the trees into toothpicks.” The comedian pronounced these words,
joking about the President’s disregard for environmental issues.
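The nesting at work in examples (1) and (2) can be made concrete with a small data structure. What follows is only a minimal Python sketch, not the annotation schema developed later in this thesis: the class name `Attribution`, its fields and the helper `source_chain` are hypothetical names chosen for illustration, and the cue strings are simplified.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Attribution:
    """One attribution relation: a source, the cue signalling it, and a content."""
    source: str
    cue: str
    # The content is either plain text or another (nested) attribution.
    content: Union[str, "Attribution"]


def source_chain(attribution: Attribution) -> List[str]:
    """Return the sources from the outermost level to the innermost one."""
    chain = []
    current: Union[str, Attribution] = attribution
    while isinstance(current, Attribution):
        chain.append(current.source)
        current = current.content
    return chain


# The intention attributed to the President.
intention = Attribution(
    source="the President",
    cue="wants",
    content="to buy the Amazon Forest and turn the trees into toothpicks",
)

# Example (1): The Times reports the President's intention.
example_1 = Attribution(source="The Times", cue="According to", content=intention)

# Example (2): the whole of example (1) becomes the content of the comedian's words.
example_2 = Attribution(source="The comedian", cue="pronounced", content=example_1)

print(source_chain(example_1))
print(source_chain(example_2))
```

Walking the chain from the outside in shows why the outermost source matters: in example (2) the apparently trustworthy source of example (1) is itself embedded in a joke.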
A last remark concerns the utility and importance of developing such a project in a
language other than English. First of all, findings and results
from studies employing the English language cannot always be entirely valid
for other languages. Secondly, the importance and life of a language also depend
on these efforts to make it available for every possible use. Having language
resources for Italian means providing support for studies and research and allowing
the development of tools specific to this language, thus enabling its speakers to
rely on it for the full range of their needs. Lastly, developing resources in several
languages provides precious data for cross-linguistic comparison, thus making it
possible to identify aspects which are common and aspects which are instead peculiar to
each language.
1.1 An Independent Approach to Attribution
Being able to automatically link together attributed material and its source would
represent a big advantage for a number of tasks. At present, this is still not
possible. A manually annotated corpus for attribution is surely not the solution;
however, it represents an important step towards it. Studies aiming at developing
tools for the recognition of attribution would in fact need a complete description of
how the phenomenon functions and is expressed, together with an already
annotated corpus to test their reliability.
Although attribution relations have already been annotated in a few other
projects (Wolf and Gibson, 2005; Wiebe, 2002; Prasad et al., 2007), a systematic
and independent account of the phenomenon is still lacking. Studies aiming at
capturing the complexity of discourse relations recognise the importance of
attribution, but reserve a rather secondary role for it (Wolf and Gibson, 2005;
Prasad et al., 2007). Other approaches instead distance themselves from discourse
and assume a more independent perspective or pair attribution with subjective
language (Wiebe, 2002). None of them, however, investigates attribution completely,
as they limit the annotation to only some of the attribution levels: word, clause,
sentence or document.
In the present project, attribution relations will be investigated as the starting
point towards the construction of a discourse resource for Italian and not as an
additional feature of it. Moreover, all levels of attribution will be considered and
annotated. This way of proceeding will allow exploring the topic independently
from other discourse relations and reaching a deeper understanding and a broader
account of the phenomenon.
1.2 Methodology
In order to annotate the corpus for attribution, some preliminary work needs
to be carried out. First of all, attribution relations have to be analysed in order to
identify their characteristics and spot issues which represent an obstacle to the
annotation. Afterwards, possible solutions to these problems will be proposed and
an annotation schema outlined. This will be then applied to a section of the corpus
with the help of an annotation tool.
The tool has been selected after comparing and testing several available
software applications. The choice of the annotation tool places constraints on the
annotation schema, as its limited functionality determines what is feasible and what
is not (e.g. some tools do not allow the selection of overlapping text spans). Although
ideally the tool is determined by the annotation schema and should be developed
according to its requirements, this was not realistic at this stage.
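The constraint on overlapping text spans can be checked mechanically before committing to a tool. The following is a minimal Python sketch under stated assumptions: spans are represented as hypothetical (start, end) character offsets with an exclusive end, and the helper names `span_of`, `overlaps` and `overlapping_pairs` are invented for illustration. This representation does not reproduce the format of any particular annotation tool.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

text = "According to The Times the President wants to buy the Amazon Forest"


def span_of(fragment: str) -> Span:
    """Locate the first occurrence of a fragment in the example text."""
    start = text.index(fragment)
    return (start, start + len(fragment))


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character (nesting included)."""
    return a[0] < b[1] and b[0] < a[1]


def overlapping_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that a tool without overlap support would reject."""
    return [(a, b) for a, b in combinations(spans, 2) if overlaps(a, b)]


cue = span_of("According to")
outer_content = span_of("the President wants to buy the Amazon Forest")
inner_source = span_of("the President")

# The content of the outer attribution contains the source of the inner one,
# so this pair of spans overlaps.
print(overlapping_pairs([cue, outer_content, inner_source]))
```

In this sketch a span nested inside another counts as overlapping, which is the situation that nested attributions produce.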
As the project has to rely on an existing tool, the initial annotation schema proposed
will be adapted to the selected tool. Performing the pilot annotation will
raise additional issues and lead to further changes to the annotation schema. The schema
will then reach its final stage, a proposal for the annotation of attributions, which
should be applicable to the rest of the corpus with the help of annotators, leading
to presumably good inter-annotator agreement.
1.3 Terminology
Before moving on, it is useful to briefly introduce some of the terminology employed.
Although ‘text’ “is used in linguistics to refer to any passage, spoken or written, of
whatever length, that does form a unified whole” (Halliday and Hasan, 1976:1), as
the texts within the scope of this study are solely newspaper articles, the term
will refer to written language only. The account of attribution provided in this thesis
should also hold for spoken language; however, further investigation is
necessary in order to determine to what extent this is true.
When generally discussing attribution, the term ‘writer’ will be mostly
employed to refer to both the writer and the speaker of the text.
Discourse is often characterised as a coherent text, as opposed to text
lacking a semantic unity. As incoherent texts will not be taken into consideration,
both ‘discourse’ and ‘text’ will be generally used to refer to a coherent unit of
language.
The lexical material signalling an attribution relation will be mostly identified
as ‘cue’ or ‘text anchor’.
1.4 Outline of the Thesis
In the second chapter, the framework of discourse studies will be briefly introduced
together with a survey of discourse annotation projects. Afterwards, attribution will
be defined and projects involving its annotation reviewed.
The third chapter presents the phenomenon of attribution and provides an
analysis of its constitutive elements, with particular attention to the elements
expressing them in the text. Some of the most problematic issues connected to
attribution relations and their annotation are also investigated.
A first annotation schema proposal is described in chapter four. The
description focuses on the features to include in the annotation. These attributes
and their possible values are carefully analysed and described with the help of
examples.
The fifth chapter illustrates the stages towards performing a pilot project in
order to test the feasibility of the schema on the corpus. These include the
specification of the tool requirements, the analysis and selection of the most
suitable tool among the ones currently available and the setting of the selected
tool. Afterwards, some additional issues or issues identified through the pilot
annotation are also presented.
In the sixth chapter the final annotation schema proposed for the annotation
of attribution relations is briefly summarised and guidelines concerning the
annotation are provided, as they have been adopted for the pilot, in order to
facilitate the selection of the relevant text spans and the assignment of the
attribute values.
In the last chapter conclusions are drawn and future developments
discussed.
2 Discourse and Attribution
2.1 What is Discourse?
2.1.1 Definition
Aristotle already understood it and warned us in his Metaphysics that “The whole
is more than the sum of its parts.” This also holds for such ‘wholes’ as texts,
where the meaning deriving from the juxtaposition of clauses, as pointed out by
Moore and Wiemer-Hastings (2003), may not coincide with the meaning of the
individual clauses and may imply more than that. Discourse could therefore be
defined as ‘propositions in context’ (Péry-Woodley and Scott, 2007).
Units of language are usually organised in a coherent way and researchers
agree that coherent text has a structure and that understanding the way it
functions is fundamental for the understanding of discourse (Grosz and Sidner,
1986; Hobbs, 1985). This structure needs to be taken into consideration when
dealing with natural language generation but also with tasks such as co-reference
resolution, temporal relations and attribution relations. Coherency not only
depends on the relations holding between strings but has also to do with extra-
linguistic components such as the writer/speaker, the recipient, the knowledge
they share and the communicative situation.
Another concept strongly connected to coherency and contributing to it is
that of cohesion. Cohesive elements are linguistic devices employed to signal
connections between text units. Both coherency and cohesion will be employed in
the next section, as some approaches ground discourse relations semantically
and therefore focus on the elements giving coherency to the discourse, while others try
to account for the cohesive means by which this coherence is linguistically
expressed.
2.1.2 Theories of Discourse Coherence and Cohesion
“Between sentences, there are no structural relations, and this is where the study
of cohesion becomes important.” (Halliday and Hasan, 1976:146).
Two metaphors are usually employed by theories of coherence: that of focus,
which holds between entities referred to in a text and can involve more than two
text spans; and that of relation, binary in nature and linking instead sections of
text. Different theories have taken different approaches to discourse relations
which Knott (1996) describes as ‘deep’ and ‘surface structure’. The ‘deep structure’
theories investigate discourse relations identifying the semantic relations which
underlie ‘surface syntactic relations’ (Grimes, 1975). ‘Surface structure’
approaches, on the contrary, consider the ‘deep’ semantic relations less important
and characterise discourse relations from the outside, identifying possible
resources signalling them on the ‘surface structure’ (Halliday and Hasan, 1976).
Computational models of discourse processing usually employ three types of
structure. The first is the informational structure, “the relation
between the information conveyed in consecutive elements of a coherent
discourse” (Moore and Pollack, 1992:537), which deals with semantic relations,
e.g. the causal relation. The second, the attentional structure (Grosz and Sidner, 1986),
determines instead the ‘focus’ or ‘centre’ of attention: the information or entities
which are most relevant at any given point. The third is the
intentional structure, which deals with the intentions of the speaker/writer and
therefore with what they are trying to accomplish through the communicative act.
This kind of structure underlies Grosz and Sidner’s (1986) concept of
discourse relation. In their theory, relations apply to discourse segments (DS) and
combine them into larger DSs. The ‘intention’ relations are those of ‘dominance’,
when the satisfaction of the subordinate segment contributes to the satisfaction of the
dominant one, and ‘satisfaction-precedes’, in which the satisfaction of one
segment precedes the satisfaction of another segment and together they contribute to
the satisfaction of a third, dominant one. Discourse segments are therefore
organised in a hierarchical structure of goals and sub-goals. Considering
discourse as a composite of linguistic structure, intentional structure and
attentional state, Grosz and Sidner also account for the interaction between
relation and focus, and they present every discourse segment as also having a
focus space determined by dominance relations.
Two additional structures should also be added to the three already
mentioned. One is the information structure, which has to do with the concepts
of theme and rheme, the former being the part connected to the rest of the
discourse and the latter the new information which is introduced about it. The
other one is the rhetorical structure, which defines a set of rhetorical relations
that can connect consecutive discourse elements.
Rhetorical relations are the core of Rhetorical Structure Theory (RST),
formulated by Mann and Thompson (1988). Rhetorical relations (RR) are
functionally defined as the effect the writer intends to achieve and are expressed
by linguistic devices. RRs entail the concept of nuclearity, that is the centrality of
the span with respect to the writer’s purposes. Nucleus and satellite relations, and
less commonly multinuclear relations, structure the text and can be exemplified by
schema applications (Figure B), which can then be mapped onto text. A
hierarchical system of schema applications produces a Rhetorical Structure tree.
For a text to be coherent, it should be possible to represent it with a single RS
tree.
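The nucleus and satellite organisation described above can be pictured as a small tree data structure. The following is a minimal sketch in Python and not part of RST itself: the class names `EDU` and `Relation`, the helper `central_units` and the example sentences are illustrative assumptions. The helper follows only nuclei from the root downwards, which is one possible way of reading off the spans most central to the writer's purposes.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class EDU:
    """A leaf text span (hypothetical name for a minimal unit)."""
    text: str


@dataclass
class Relation:
    """An instantiated schema: a relation label over child spans.

    `nuclei` holds one child for a nucleus-satellite relation and
    several for a multinuclear one; `satellites` may be empty.
    """
    label: str
    nuclei: List["Node"]
    satellites: List["Node"]


Node = Union[EDU, Relation]


def central_units(node: Node) -> List[str]:
    """Collect the leaf texts reached by following only nuclei."""
    if isinstance(node, EDU):
        return [node.text]
    units: List[str] = []
    for child in node.nuclei:
        units.extend(central_units(child))
    return units


# Invented example text, represented as a single tree.
tree = Relation(
    label="evidence",
    nuclei=[EDU("Taxes will rise in 2010.")],
    satellites=[
        Relation(
            label="contrast",  # multinuclear: two equally salient spans
            nuclei=[EDU("Revenue has fallen,"), EDU("but spending has grown.")],
            satellites=[],
        )
    ],
)

print(central_units(tree))
```

Because the whole invented text hangs from one root node, it satisfies the single-tree condition mentioned above; a text whose spans could only be covered by two unconnected trees would not.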
Figure B - RST schemas
2.1.3 Constituency vs. Dependency
Webber (2006) argues that the approaches to discourse structures can also be
grouped according to the concepts of constituency and dependency. The RST
approach is based on constituency, the idea of linguistic units as “parts within
parts”, having “specific roles or functions” (Webber, 2006:340), and considers this
as the only basis for discourse relations. Its instantiated schemas represent the
constituency structure and correspond to discourse relations between consecutive
spans (i.e. clauses or projections of instantiated schemas).
Also based on constituency is Polanyi's Linguistic Discourse Model (LDM).
This is similar to RST; however, it separates discourse structure, formed by a
hierarchy of discourse units, from discourse interpretation. Its discourse parse
tree (DPT) can be described by a context-free grammar consisting of 3 re-write
rules: an N-ary branching rule for discourse coordination, a binary branching rule
for discourse subordination and an N-ary branching rule with sisters related by a
logical or rhetorical relation and contributing to the interpretation of their parent
node. The DPT is right open, which means that every discourse unit resuming an
interrupted constituent also closes it off, thus making it impossible for any
subsequent coordinate or subordinate discourse unit to attach to it. This claim,
which does not allow for incrementation, is similar to the Intention Stack
mechanism depicted by Grosz and Sidner (1986) and the notion of Right Frontier
in Webber (1988).
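The ‘right open’ constraint can be illustrated with a toy tree in which a new unit may attach only to a node on the rightmost path from the root. This is a hedged sketch and not an implementation of the LDM: `DNode`, `right_frontier` and `attach` are hypothetical names, and the three rule types (coordination, subordination, logical or rhetorical relation) are not modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DNode:
    """A discourse unit in a toy parse tree (hypothetical structure)."""
    name: str
    children: List["DNode"] = field(default_factory=list)


def right_frontier(root: DNode) -> List[DNode]:
    """Return the nodes on the path from the root to its rightmost leaf."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def attach(root: DNode, target: DNode, new: DNode) -> None:
    """Attach `new` under `target`, allowed only on the right frontier."""
    if not any(node is target for node in right_frontier(root)):
        raise ValueError(f"{target.name} is closed off: not on the right frontier")
    target.children.append(new)


a, b, c = DNode("a"), DNode("b"), DNode("c")
root = DNode("root", [a])
attach(root, a, b)       # a is on the right frontier, so b may attach
attach(root, root, c)    # resuming the root closes off a and b
print([n.name for n in right_frontier(root)])

try:
    attach(root, a, DNode("d"))  # a is no longer accessible
except ValueError as err:
    print(err)
```

Once `c` has been attached to the root, the earlier units `a` and `b` are no longer reachable, which mirrors the claim that resuming an interrupted constituent closes off the material to its left.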
Another approach to the structure of discourse is that of relating discourse
cohesion to dependency, which can be of three kinds: syntactic, semantic and
anaphoric. In Halliday and Hasan (1976) this is solely anaphoric dependency.
Their idea of cohesion is that of a part whose interpretation requires the
interpretation of another part to be enabled. Five types of cohesion can be
identified on this basis: anaphora, substitution, ellipsis, lexical cohesion (e.g.
repetitions, synonymy) and conjunction, the latter being the only one responsible
for discourse relations. As pointed out by Webber (2006), anaphoric relations
have, however, no constraint on their locality, no constraint on the number of text
parts a given unit can depend on and no constraint on the discourse units that can
be linked together. The lack of constraints in this approach means that embedded
relations and cross-relations are allowed.
Other approaches have taken a perspective that combines constituency
and dependency in order to account for discourse relations. In the mixed
approaches constituency and dependency participate in shaping the discourse
structure and determining its cohesion. Wolf and Gibson’s (2005) discourse
structure relations, a set of informational relations based on Hobbs (1985), are
associated with constituency alone. However, they do not account separately for
anaphoric dependency, which is responsible for relations between non-adjacent discourse segments.
In this way their approach can be seen as one of the mixed approaches.
In their theory of discourse graphs, Wolf and Gibson identify discourse
segments as non-overlapping spans of text constituted either by a clause or an
attribution. Segments which are related on the basis of a common topic or
attribution are grouped together. Groups can also engage in a discourse relation
with a clause or another group. This results in a sort of hierarchical structure which
can be related to constituency. Moreover, Wolf and Gibson argue that tree-
structures are not adequate for accounting for discourse coherence and propose a
chain-graph in order to represent problematic aspects such as nodes with multiple
parents and cross-relations.
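To make the two problematic configurations concrete, the sketch below represents relations as directed arcs between numbered discourse segments and checks for segments with more than one parent and for crossing arcs. The segment numbers, the arcs and the function names are invented for illustration and are not taken from Wolf and Gibson's data; treating the source of an incoming arc as a ‘parent’ is likewise a simplifying assumption.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Arc = Tuple[int, int, str]  # (source segment, target segment, relation label)


def multiple_parents(arcs: List[Arc]) -> Dict[int, List[int]]:
    """Map each segment with more than one incoming arc to its parents."""
    parents: Dict[int, List[int]] = defaultdict(list)
    for src, tgt, _ in arcs:
        parents[tgt].append(src)
    return {tgt: srcs for tgt, srcs in parents.items() if len(srcs) > 1}


def crossing_pairs(arcs: List[Arc]) -> List[Tuple[Arc, Arc]]:
    """Return pairs of arcs whose endpoints interleave in text order."""
    crossings = []
    for first, second in combinations(arcs, 2):
        a_lo, a_hi = sorted(first[:2])
        b_lo, b_hi = sorted(second[:2])
        if a_lo < b_lo < a_hi < b_hi or b_lo < a_lo < b_hi < a_hi:
            crossings.append((first, second))
    return crossings


# Invented arcs over four segments numbered in text order.
arcs = [
    (1, 3, "elaboration"),
    (2, 4, "elaboration"),
    (2, 3, "cause-effect"),
]

print(multiple_parents(arcs))
print(crossing_pairs(arcs))
```

Neither configuration can be drawn as a tree over the four segments: segment 3 would need two parents, and the arcs 1-3 and 2-4 cross.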
Although their approach tends to associate discourse structure solely with
constituency, dependency plays an important role in determining their claims.
Cross-relations, which appear to be quite frequent and mainly associated with the
relation of ‘elaboration’, could be explained through dependency. Webber (2006)
notices that cross-relations, which represent the main argument against the tree-
structure, are often anaphoric dependencies.
Also mixed, although very different, is the Lexicalized Tree-Adjoining
Grammar for Discourse (D-LTAG) approach (Cristea and Webber, 1997, Webber
et al., 2003). Discourse relations are lexicalised in the sense that this theory
provides an account of the lexical anchors bearing them. The arguments to these
relations are also lexicalised. Each lexical entry is associated with a set of tree-
structures specifying its syntactic configuration. In this lexical variant of TAG, the
adjoining operation, which is available at the right frontier, is paired with the
operation of substitution.
Adjoining is the operation of “identifying a discourse relation between the
new material and material in the previous discourse that still is open for
elaboration” (Cristea and Webber, 1997:91). Cristea and Webber introduce
substitution in order to account for discourse features (e.g. although, on the one
hand, suppose) that raise expectations about what is to come in the following
discourse.
Figure C shows above the grammatical categories (where * is the foot of an
auxiliary tree and ↓ a substitution site) and below the adjoining and substitution
operations.
Figure C - (L-TAG) tree examples (Cristea and Webber, 1997)
Structural (i.e. conjunctions, subordinators) and empty connectives are the
anchors of elementary trees. These discourse relations between arguments
produce a compositionally interpreted structure. Discourse adverbials exploit
instead anaphoric dependency, establishing a discourse relation connecting the
interpretation of a clause to the interpretation of a previous clause or group of
clauses.
2.2 Discourse Annotation Projects
Discourse annotation projects have become popular in recent years due to a
growing interest in better understanding discourse structures in order to
interpret or reproduce them automatically. A survey of these projects is presented in
this section.
2.2.1 RST-DT
The Rhetorical Structure Theory Discourse Treebank (Carlson et al., 2003) is a
corpus of 176,000 words from the Penn TreeBank, hence consisting of articles
from the Wall Street Journal (WSJ). Realised in the framework of the RST, the
RST-DT corpus is annotated for rhetorical relations holding between two or more
adjacent and non-overlapping text-spans.
In order to construct the discourse tree, the annotators first identify its
minimal building block, the elementary discourse unit (EDU), which is the clause.
Once the text has been segmented, adjacent EDUs are linked via rhetorical
relations thus creating a hierarchical structure. The inventory of rhetorical relations
they employ consists of 53 mononuclear relations, where one of the spans is more
salient (nucleus) and the other conveys additional information (satellite), and 25
multinuclear relations, with equally salient spans.
2.2.2 The Penn Discourse TreeBank PDTB
The PDTB (Prasad et al., 2004; Webber et al., 2005; Prasad and Dinesh et al.,
2008) represents a fundamental work in the area of discourse for both its unique
approach to discourse relations based on the D-LTAG theory and the echo it has
produced, inspiring a number of recent studies and providing them with a strong
knowledge base. The PDTB is a discourse resource built on top of the PTB, the
Penn Wall Street Journal corpus. It consists of a million words annotated for
discourse connectives and their arguments. The annotation was chosen to be
stand-off, as this is generally clearer than in-line XML annotation and
because the arguments of different connectives could overlap, violating the syntax
of XML.
Although not tied to any particular theory of discourse, the approach taken
is grounded in the D-LTAG approach to discourse. The idea of a lexicalised
grammar for discourse results in a bottom-up approach that avoids resorting to a
pre-defined set of discourse relations, as is done in other theories (e.g. RST). The focus of
the annotation is on discourse connectives, considered as discourse predicates
taking two text spans as their arguments, Arg1 and Arg2. In
example (3) below, Arg1 is in italic and Arg2 in bold while the connective is
underlined. Discourse relations hold between Abstract Objects (AO), such as
propositions, events and states. The annotation proceeded by
annotating a single connective throughout the whole corpus before moving on to
the next one, as this was perceived as an easier task for the
annotators.
(3) Most oil companies, when they set exploration and production budgets
for this year, forecast revenue of $15 for each barrel of crude produced.
(Prasad and Dinesh, 2008:2)
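A stand-off record for example (3) might look like the sketch below, which stores character offsets into the raw text instead of inserting tags into it. The span assignment is an assumption, since the typographic marking of the arguments is not preserved in this transcript: the connective is taken to be ‘when’, Arg2 the clause it introduces and Arg1 the discontinuous main clause. The class and field names are hypothetical and do not reproduce the PDTB file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets into the raw text

TEXT = (
    "Most oil companies, when they set exploration and production budgets "
    "for this year, forecast revenue of $15 for each barrel of crude produced."
)


def span_of(fragment: str) -> Span:
    """Locate a fragment in TEXT and return its (start, end) offsets."""
    start = TEXT.index(fragment)
    return (start, start + len(fragment))


@dataclass
class StandoffRelation:
    """Hypothetical stand-off record: offsets only, the text is untouched."""
    connective: Span
    arg1: List[Span]  # a list, so that an argument may be discontinuous
    arg2: List[Span]


def text_of(spans: List[Span]) -> str:
    """Rebuild the surface string of an argument from its spans."""
    return " ... ".join(TEXT[start:end] for start, end in spans)


rel = StandoffRelation(
    connective=span_of("when"),
    arg1=[
        span_of("Most oil companies"),
        span_of("forecast revenue of $15 for each barrel of crude produced"),
    ],
    arg2=[span_of("they set exploration and production budgets for this year")],
)

print(text_of(rel.arg1))
print(text_of(rel.arg2))
```

Under this reading Arg2 lies between the two parts of Arg1. An in-line encoding would have to split or nest its tags to express this, whereas stand-off offsets simply coexist.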
Discourse connectives belong to three grammatical classes: subordinating
conjunctions (e.g. because, when), coordinating conjunctions (e.g. and, or) and
discourse adverbials (e.g. for example, instead). They can also appear in
modified or conjoined form (e.g. only because, if and when) or in parallel form (e.g.
either…or, on the one hand…on the other hand). The senses of the connectives
are also annotated paying attention to their polysemous nature (e.g. ‘since’ can
have a temporal, causal or temporal-causal sense). Senses are hierarchically
classified according to their class, type and subtype as exemplified in Figure D.
Figure D - Sense classification of discourse connectives in the PDTB
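The three-level classification can be sketched as a nested mapping. Only the labels that survive from Figure D are included, so the hierarchy below is deliberately incomplete. The placement of condition and cause under contingency, and of reason and result under cause, follows the PDTB sense hierarchy; the dotted label format and the function name are assumptions made for illustration.

```python
from typing import Dict, List

# class -> type -> subtypes; only labels recoverable from Figure D are listed.
SENSES: Dict[str, Dict[str, List[str]]] = {
    "temporal": {},
    "contingency": {
        "condition": [],
        "cause": ["reason", "result"],
    },
    "comparison": {},
    "expansion": {},
}


def is_valid_sense(label: str) -> bool:
    """Check a dotted label such as 'contingency.cause.reason'."""
    parts = label.split(".")
    if not 1 <= len(parts) <= 3 or parts[0] not in SENSES:
        return False
    if len(parts) == 1:
        return True
    types = SENSES[parts[0]]
    if parts[1] not in types:
        return False
    return len(parts) == 2 or parts[2] in types[parts[1]]


print(is_valid_sense("contingency.cause.reason"))  # True
print(is_valid_sense("temporal.reason"))           # False
```

A hierarchical label of this kind lets an annotator stop at the class or type level when a finer distinction cannot be made.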
Between adjacent text spans, discourse relations are also annotated when they are not
explicit, that is, when the relation can be inferred although a discourse connective
is lacking. In these cases a presumed connective is added to the annotation, with
the exception of lexicalised discourse relations (AltLex), arguments linked by an
entity-based coherence relation (EntRel), and cases in which no relation is perceived
(NoRel).
Arguments to a connective can be non-consecutive (3) and anywhere in the
text and are constituted of single or multiple clauses or sentences. A principle of
‘minimality’ applies to them, which prescribes for each argument the selection of
the minimum sufficient span. Additional text related to the arguments can also be
included in a discourse relation as ‘supplement’ (Sup1, Sup2).
The annotation in the PDTB contains additional information as it also
specifies the attribution of connectives and their arguments. This aspect of the
annotation will be considered and analysed at a later stage in this thesis.
Figure D (content):
class: temporal, contingency, comparison, expansion
type: condition, cause, …, …
subtype: …, reason, result
2.2.3 Other Projects
The Chinese Discourse Treebank
The Chinese Discourse Treebank (CDTB) project (Xue, 2005) is based on the
same principles as the PDTB. Similarly to the PDTB, and unlike the RST approach,
discourse relations do not represent a predefined inventory but are lexically
grounded and anchored by discourse connectives. Implicit and explicit Chinese
discourse connectives were investigated in order to add a discourse layer of
annotation to the Penn Chinese Treebank. Discourse connectives are also here
regarded as predicates taking two abstract objects as their arguments. In the
CDTB coordinating and subordinating conjunctions as well as discourse adverbials
are annotated.
The main challenges to the realisation of this project were disambiguating
lexical items which in Chinese could function both as discourse connectives and
as non-discourse connectives, determining the sense of polysemous
connectives, and defining the argument scope. Due to the long morphological
evolution of the Chinese language, another issue was posed by discourse
relations realised by more than one discourse connective. Hence, different
morphological forms had to be grouped as diverse realisations of the same
discourse relation. The task of annotating attribution is not included in the CDTB.
Discourse and the Prague Dependency Treebank (PDT)
Also inspired by the PDTB is the initial analysis conducted on the Prague
Dependency Treebank for the addition of a layer of annotation for discourse
(Mladová et al., 2008). The PDT is a corpus of 2 million words of Czech journalistic
texts from the Czech National Corpus. Three levels of annotation are already
available: morphological, superficial syntactic and deep syntactic. In the latter each
sentence is represented by a dependency tree connecting clauses but not
crossing sentence boundaries. In addition, however, some basic co-reference
relations are also marked, and among those some textual co-reference relations
going beyond sentence boundaries.
Discourse relations will be added to PDT 3.0 in a fourth level of annotation,
containing various types of relations going beyond the sentence. The discourse
layer to be added to the PDT will use the PDTB as a background and define a new
hierarchy of discourse sense labels and exploit the discourse information already
carried by the deep syntactic level of annotation. Co-reference relations are
already marked for coordination, dependency and reference to the preceding
context; however, these need to be explicitly marked for discourse, and the PREC
label for relations going beyond sentence boundaries has to be further specified.
Discourse Annotation of the METU Turkish Corpus and the Hindi Discourse
Relation Bank
Still at their early stages are two recent projects aiming at developing a discourse
resource for Turkish and Hindi respectively. Both are based on the theoretical
assumptions postulated by the PDTB and focus on the analysis of discourse
connectives. These studies also point to a certain cross-linguistic validity of
the PDTB schema, and the similar approach adopted makes them a valid source
for cross-linguistic comparison.
Most of the work to date in order to prepare the ground for the discourse
annotation of the METU Turkish Corpus (Zeyrek and Webber, 2008) has been the
identification and classification of discourse connectives together with a
preliminary analysis of the argument scope. The attribution of discourse relations
and other aspects such as the annotation of implicit connectives are still
unexplored.
Similarly, the project for a Hindi Discourse Relation Bank also adopts a
lexically grounded approach to discourse relations and is focusing on the analysis
of different types of discourse connectives and their realisation in the Hindi
language. Implicit connectives and their semantic classification, together with the
attribution of connectives and their arguments are left for future developments of
the present research (Prasad and Husain et al., 2008).
2.3 Attribution
This thesis originates in the framework of developing an Italian Discourse
Treebank which, similarly to the ongoing discourse annotation projects for
Chinese, Czech, Hindi and Turkish, is theoretically inspired by the PDTB. However,
unlike these projects, it does not focus on the classification and annotation of
discourse connectives but on attributions, an aspect included in the PDTB but only
in a subordinate way.
2.3.1 Towards a Definition of Attribution
Defining what attributions are seems a trivial task, yet it proves not at all easy.
Although the annotation scheme for attribution in this project is derived from the
one in the PDTB, the definition they provide of attribution as “a relation of
‘ownership’ between abstract objects and individual or agents” (Prasad and
Miltsakaki, 2008:40) is not suitable to fully describe the relations that will be here
considered and investigated. AOs refer to propositions, events or states and do not
include smaller units such as noun phrases or even single words. Another
definition is given in the RST annotation manual:
“Speech acts—verbs that are used to report both direct and indirect speech--
should be segmented and marked for the rhetorical relation of ATTRIBUTION […]
Cognitive predicates, including verbs that express feelings, thoughts, hopes, etc.,
should also be segmented and marked for the rhetorical relation of
ATTRIBUTION.” (Carlson and Marcu 2001:7, 9)
More than attribution itself, what is defined here, by describing the means by which it is
signalled, is a way of spotting attribution in the text. From this definition, however, it would
follow that attribution is bound only to reporting or cognitive predicates, leaving out
the cases in which attribution is conveyed by prepositions (4) or just punctuation (5).
(4) According to the police, the crime rate has fallen this month.
(5) The Pope: “I will pray for the victims”.
Murphy (2005:131) provides a partial definition of attribution as “the transferral of
responsibility for what is being said to a third party.” This simple explanation,
meant to capture only the attribution of assertions, nevertheless highlights the
embedded nature of attribution, recognising a ‘third party’ in the relation. This
is because any attribution in a text or speech event is already part of a
communicative event having in the writer/speaker its natural primary source. The
insertion of a ‘third party’ allows the writer/speaker to change this default attribution
and transfer the responsibility or ownership of a certain part to another source.
As none of the above-mentioned definitions of attribution is sufficient on its own
to capture the phenomenon under consideration in this thesis, a new definition is
proposed here: attribution in a text is ascribing the ownership of an attitude
towards some linguistic material, i.e. the text itself, a portion of it or their semantic
content, to an entity. This ownership is expressed by explicitly inserting the agent
or experiencer holding the intellectual property of the linguistic material, which can
express an assertion or a mental state such as an opinion, a will or some
knowledge. Attributions as described above will be considered and investigated in
the current research.
2.3.2 Are Attribution Relations a Discourse Phenomenon?
In order to decide if attribution relations are a kind of discourse relation it is
necessary to specify what discourse relations are. The label of discourse holds for
those texts having a structure. This structure originates from cohesive elements.
“Where the interpretation of any item in the discourse requires making reference to
some other item in the discourse, there is cohesion” (Halliday and Hasan,
1976:11). If generating cohesion were the necessary and sufficient
condition for identifying a discourse relation, attribution would surely belong to this
class.
The interpretation of an attributed element is highly dependent on its
source. Bergler (1991) distinguishes between ‘primary’ and ‘circumstantial
information’, the first being the ‘pure’ information and the latter the ‘primary
information’ within a perspective, a belief or a modality, and argues that the
interest of tasks such as knowledge extraction is ‘primary information’. She
however acknowledges the importance of the additional information carried by the
‘circumstantial information’ and stresses the intimacy of this relation. Although
‘primary information’ still is, after 18 years, the focus of knowledge extraction
tasks, the recent flourishing of studies aiming at capturing this intimate relation
shows a general understanding of the fundamental contribution of the
‘circumstantial information’ to the interpretation of the ‘primary’ one.
Attribution relations are therefore undoubtedly cohesive relations. Cohesion,
however, is not enough to specifically describe discourse relations as this could
represent a characteristic of relations in general. Thus a syntactic relation would
also be classified as a discourse one and this should not be the case. The second
necessary condition identifying a discourse relation is that it should hold between
discourse segments. These should be non-overlapping spans of text; however, a single
definition is still lacking in the literature. Different discourse approaches also
adopt different discourse units. These can be intentional units (Grosz and Sidner,
1986), sentences (Hobbs, 1985), clauses or phrasal units (Mann and Thompson,
1988; Webber et al., 1999; Wolf and Gibson, 2005).
Relations of attribution can hold between sentences or, inside them, between
clauses or groups of clauses; attribution could therefore be considered a discourse
phenomenon.
(6) "There's no question that some of those workers and managers
contracted asbestos-related diseases," said Darrell Phillips, vice
president of human resources for Hollingsworth & Vose. "But you have to
recognize that these events took place 35 years ago. It has no bearing
on our work force today." (PDTB 0003)
Skadhauge and Hardt (2005) argue in this respect that attribution is an intra-
sentential relation, referring to the RST Treebank where it is actually treated as
such, and develop a system that they claim can automatically identify it. The
assumption is that, being an intra-sentential relation, attribution is encoded at the
syntactic level. Attribution is also a syntactic phenomenon but surely not only that.
The premises leading Skadhauge and Hardt to this conclusion are grounded in the
RST Treebank approach to attribution, which considers only intra-sentential
instances of it and only under particular conditions (i.e. a verb immediately followed or
preceded by a sentential complement position, and the phrase ‘according to’). The
conclusion, quite obvious, should be that the subset of attribution relations which are
syntactically grounded, those selected by the RST-DT, can be syntactically derived
and automatically identified.
Although a certain number of attributions are expressed at the intra-sentential
level, verbs are not the only cues signalling them (see examples (4) and
(5) above). They are certainly the most common ones; however, the attributed
span is often separated from the verb by intervening material, such as adverbs,
complements or even clauses. Only by eluding the complexity of attribution relations
and considering just a subset of them could Skadhauge and Hardt easily provide a
solution for the automatic identification of this problematic phenomenon. This
very partial solution demonstrates the importance of reaching a better theoretical
description of attribution and a full account of its characteristics.
From the present work attribution also emerged as a discourse
phenomenon. This is because it often operates at a higher level than the sentence,
connecting larger units such as sentences (6), but also clauses in separate
sentences. Moreover, it very frequently involves co-reference relations or, better, is
bound to them ((7), (9)).
(7) LONDRA - Con i soldi della lotteria nazionale sarà creata un’”Accademia
Britannica per lo Sport”. Lo ha deciso il primo ministro, John Major, …
(ISST re050)
LONDON – With the money from the National lottery it will be instituted a
“British Sport Academy”. It was decided by the Prime Minister, John
Major,…
The analysis of attribution also made it clear that it can be a
syntactically encoded phenomenon, intra-sentential and even intra-clausal, with as
little as a single word functioning as the attributed material ((8), (9)).
(8) “Sì”, le risponde convinta un’amichetta. (ISST cs060)
“Yes”, answers to her confident a friend.
(9) “…L’umanità deve proclamare uno storico sciopero ad oltranza fino alla
distruzione di tutti gli armamenti nucleari.” Le parole registrate di
Gheddafi, …(ISST cs039)
“…The world should proclaim a non-stop strike till the destruction of all
nuclear armaments.” Gheddafi’s recorded words,…
On the other hand, attribution relations can involve much larger units than
sentences or clauses and extend to the whole text or speech, reaching the
shallowest level of attribution, the one already easily captured by search
engines, in which the source is the writer of the text or the person giving a
speech, or even the newspaper or website including the article. At this level the
attribution is often conveyed by prosodic or extra-linguistic means, e.g. the
inclusion in the web page/newspaper/book, a graphic pointer (Figure E), the
provenance of the sound.
Figure E - Graphic extra-linguistic attribution
(http://www.metrokitty.com/comics/webcomics/medterms/comic_mterms.png)
For the purpose of the present study attribution will be considered at every level at which it
can be found; however, the main account of it will be as a discourse phenomenon.
It will be considered at the discourse level itself, when sentences, propositions or
clauses or groups of them are attributed, and at the sentence or even clause level,
with single words or noun phrases being attributed. In the latter case, however, the analysis will be
limited to those instances coreferential to a discourse unit ((7), (9));
these could therefore also be considered, in combination with the coreferential
relations, a discourse relation. The shallow level of attribution will also be included
in the annotation as the text, which is always a newspaper article, will have the
writer as its primary source, even when the writer is not directly mentioned in the
article. The attribution of the entire article to its writer will be assumed as default
and left implicit with some exceptions.
2.4 Related Studies
Attribution relations have already been included in some studies. These either
have their focus on some other discourse aspect and account for attribution only
marginally or limit their analysis to some level of attribution, e.g. the macro-level or
the intra-sentential or word level, thus neglecting attribution at the discourse level.
Nonetheless they represent a knowledge base and a starting point for the present
study. The annotation schema is in fact derived from the annotation schemas
proposed by these projects. In this section the most influential ones will be
reviewed.
2.4.1 GraphBank
The relation of attribution is included in the GraphBank (Wolf and Gibson, 2005) as
an asymmetrical or directed relation, together with cause–effect, condition, violated
expectation, elaboration, example and generalization. In contrast to symmetrical or
undirected relations, i.e. similarity, contrast and same, directed relations hold from
satellite to nucleus nodes and are related to Mann and Thompson’s (1988)
mononuclear and multi-nuclear relations. Attribution relations go from the DS
containing the source to the DS which is the content of the attribution. Attributions
in the GraphBank are separated only when the attributed material is a sentence or
group of sentences or a complementizer phrase (10). These DSs are grouped if
they are attributed to the same source. In the other cases they are treated as
single discourse segments (11).
(10) 1. John said that
2. the weather would be nice tomorrow.
(Wolf and Gibson, 2005:254)
(11) 1. The restaurant operator cited transaction costs from its 1988
recapitalization.
(Wolf and Gibson, 2005:251)
Wolf and Gibson added attribution to the relations in Hobbs (1985) as they are
dealing with text taken from news corpora. However, they consider attributions,
more than coherence relations themselves, just as “carriers of coherence
structures” (Wolf and Gibson, 2005:251).
2.4.2 Opinion Corpus
Connected to attribution are also works in the fields of opinion and emotion
annotation and recognition. The most consistent and closely related study in this
respect is the Opinion Corpus (Wiebe, 2002; Wiebe et al., 2005; Wilson and
Wiebe, 2005). It consists of more than 11,000 sentences from the world press,
annotated for ‘private states’. This term covers: opinions, beliefs, thoughts,
feelings, emotions, goals, evaluations and judgements. A private state consists of
“an experiencer holding an attitude, optionally toward an object” (Wiebe, 2002:4).
Private states partly overlap with the types of attribution considered for the present
study. Although feelings and emotions are not part of the present annotation, so that the
‘object’ of the private state is not optional, other categories such as beliefs and
thoughts are included, together with assertions.
For the annotation of private states Wiebe et al. (2005) create three frames,
each corresponding to a type of private state expression: explicit mention of
private states, speech events expressing private states and expressive subjective
elements. Key elements of these frames are: the ‘text anchor’, namely the text
span representing the speech act or the private state; the ‘source’, employed to
refer to both the experiencer of a private state and the writer or speaker of a
speech event; the ‘target’, although this is only included in the first two frames;
and some properties. Properties include the ‘intensity’ of the private state, the
‘expression intensity’, which denotes the contribution of the text anchor to the
intensity of the private state, ‘insubstantial’, when a private state is e.g. in the
scope of a conditional and is therefore not presented as real in the discourse, and
‘attitude type’, accounting for the polarity of the private state.
Assertions are annotated through the ‘objective speech event frame’ if the
target is presented as an objective fact. Another important aspect of their
annotation is the inclusion of an agent frame, which assigns a unique ID to
every source in the text. This feature is particularly useful for dealing with
bridging or pronominal anaphora, that is, cases in which the same source is
mentioned several times through different nouns or pronouns, which makes the
identification of a unique source quite challenging.
Sentences presenting private states and speech events are analysed in
three parts. ‘On’ designates the text anchor corresponding to the private
state or speech event itself. The ‘inside’ is the material in the scope of the
private state or speech event, while the ‘outside’ includes the source and
everything else in the sentence that falls outside that scope.
(12) outside: “On Tuesday, John …while hanging up the phone.”
on: “said that”
inside: “he was leaving”
(Wiebe, 2002:8)
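The three-part analysis in (12) can be pictured as a simple partition of the
sentence. The following is a minimal sketch in Python, not part of Wiebe's
(2002) scheme: the names `Segmentation` and `segment` are hypothetical, and the
full sentence is reassembled here from the fragments given in (12), which is an
assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segmentation:
    outside: str  # source plus everything outside the scope of the anchor
    on: str       # text anchor: the private state or speech event phrase
    inside: str   # material in the scope of the anchor


def segment(sentence: str, on: str, inside: str) -> Segmentation:
    """Partition a sentence, given its 'on' and 'inside' spans as strings."""
    spans = []
    for phrase in (on, inside):
        start = sentence.find(phrase)
        if start < 0:
            raise ValueError(f"span not found in sentence: {phrase!r}")
        spans.append((start, start + len(phrase)))

    # Everything not covered by 'on' or 'inside' belongs to the 'outside'.
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        piece = sentence[cursor:start].strip()
        if piece:
            pieces.append(piece)
        cursor = max(cursor, end)
    tail = sentence[cursor:].strip()
    if tail:
        pieces.append(tail)
    return Segmentation(outside=" … ".join(pieces), on=on, inside=inside)


# Sentence reassembled from the fragments of example (12) (an assumption).
sentence = "On Tuesday, John said that he was leaving while hanging up the phone."
result = segment(sentence, on="said that", inside="he was leaving")
print(result.outside)
```

Because the partition is defined over a single sentence, an encoding of this kind
cannot represent a content that lies in a different sentence from its anchor.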
The Opinion Corpus surely represents a model and a knowledge base for the
present study regarding the annotation of attributions. This model needs however
to be expanded to go beyond the sentence boundaries, in order to avoid
approaching attribution once again merely as a syntactical intra-sentential
phenomenon.
2.4.3 PDTB - The Penn Discourse TreeBank
Apart from annotating lexically grounded discourse relations in the form of
discourse connectives and their arguments, the PDTB goes further and also includes
attribution relations in the annotation. Considered as a “relation of ‘ownership’
between abstract objects and individuals or agents” (Prasad and Miltsakaki et al.,
2008:40), attribution often overlaps with discourse connectives and their
arguments. Discourse connectives also establish relations between AOs
and can therefore hold between attributions (13) or just between the AOs
representing the content of attributions (14). The discourse relation itself can be
the AO representing the content of an attribution relation (15). In the examples that
follow, taken from the PDTB 2.0, the text spans corresponding to Arg1 are shown
in italics, those for Arg2 are in bold, the discourse connectives are underlined and
the attribution phrases are identified by small capitals.
(13) ADVOCATES SAID the 90-cent-an-hour rise, to $4.25 an hour by April 1991, is
too small for the working poor, while OPPONENTS ARGUED that the increase
will still hurt small business and cost many thousands of jobs. (PDTB
0098)
(14) Factory orders and construction outlays were largely flat in December while
PURCHASING AGENTS SAID manufacturing shrank further in October.
(PDTB 0178)
(15) “The public is buying the market when in reality there is plenty of grain to
be shipped,” SAID BILL BIEDERMANN, ALLENDALE INC. DIRECTOR. (PDTB 0192)
Discourse connectives and attribution relations thus appear as separate layers that
can occur independently, overlap, or even be included one in the other.
The approach taken by the PDTB, however, treats attribution as subordinate to
the identification and annotation of discourse connectives. As the focus is on
the latter, attribution appears more as an additional feature of
connectives and their arguments than as an independent phenomenon. Attribution
is in fact annotated in the PDTB only when, and whenever, a discourse relation
exists, thus leaving out those instances of attribution that occur independently.
Moreover, what is actually marked is the attribution of the discourse connective
and of its two arguments Arg1 and Arg2. Therefore, a nested attribution included
e.g. in one of the arguments cannot be accounted for and is left unmarked. In
the example below (16), the discourse relation in quotes is attributed to ‘Gov.
Nelson Rockefeller of New York’ and there is no account of the nested attribution
of an intention expressed by ‘want’ and concerning the span ‘to keep the crime
rates high’.
(16) In 1966, on route to a re-election rout of Democrat Frank O’Connor, GOP
GOV. NELSON ROCKEFELLER OF NEW YORK appeared in person SAYING, “If you
want to keep the crime rates high, O’Connor is your man.” (PDTB 0041)
Key properties of attribution included in the PDTB annotation scheme are: source,
type, scopal polarity and determinacy. The source feature specifies if the source of
the attribution, i.e. the agent in the relation of ownership, is the writer (Wr), another
specific agent (Ot), or an arbitrary source (Arb). The writer is always marked as the
source when no explicit attribution is made (17). While ‘Other’ refers to a
determinate source either explicitly mentioned (15) or inferable from some other
occurrences in the text, ‘Arbitrary’ sources lack a referential agent. This
happens for example in the case of an impersonal source or of an attribution with
an agentless passive verb or an adverb (17) as the reporting phrase. In the following
example, the relation and Arg1 are attributed to the writer, while Arg2 is labelled as
arbitrary.
(17) East Germans rallied as officials REPORTEDLY sought Honecker’s ouster.
(PDTB 2278)
Another feature of attribution in the PDTB is the type. This partly accounts for the
degree of factuality of the AOs. Type can take four values: assertions, beliefs, facts
and eventualities. Assertion propositions (Comm) are generally conveyed by verbs
of communication (18), e.g. ‘say’, ‘explain’, ‘announce’. Implicit attributions to the
writer (19) also take this value. Belief propositions, which partly correspond to the
‘private states’ of opinions, beliefs and thoughts, are instead expressed by
propositional attitude verbs (20), i.e. verbs entailing a mental process such as
‘think’, ‘believe’, ‘doubt’, and are labelled as PAtt.
(18) “We won’t put any burden on Farmers,” HE SAID. (PDTB 2403)
(19) Besides, to a large extent, Mr. Jones may already be getting what he wants
out of the team, even though it keeps losing. (PDTB 1411)
(20) Scientists need to understand that while THEY TEND TO BELIEVE their work is
primarily about establishing new knowledge or doing good, today it is also
about power. (PDTB 1495)
Facts are associated with factive and semi-factive verbs and involve the attribution
of an AO presented as factual. To this type belong verbs of perception and
knowledge such as ‘hear’, ‘know’, ‘remember’. The last type of attribution has to
do instead with agents holding an intention or attitude towards the AO. Prasad and
Miltsakaki et al. (2008) present eventualities (Ctrl) as conveyed by control verbs.
These are: verbs
of influence (21), such as ‘order’, ‘allow’ and ‘persuade’; verbs of commitment,
such as ‘agree’, ‘promise’ and ‘accept’; and verbs of orientation such as ‘hope’,
‘want’ and ‘wish’.
(21) Eward and Whittington had planned to leave the bank earlier, but MR.
CRAVEN HAD PERSUADED THEM to remain until the bank was in a healthy
position. (PDTB 1949)
Another feature added to attribution in the PDTB is scopal polarity. It
identifies cases in which a negation that on the surface appears to scope over
the attribution verb instead reverses the polarity of the
attributed AO (22). It is important to recognise the real scope of the negation, as
this affects the last feature present in the annotation of attribution in the PDTB:
determinacy. Determinacy has to do with the truth value of the attribution. When
an attribution verb is in the scope of a negation (23), or e.g. in a conditional
or infinitival context, the attribution itself is not presented as real and should be
handled as such when drawing conclusions about the AO on the basis of this
relation. This does not mean that the attribution is certainly unreal, as it
may simply be presented as possible (24) or probable.
(22) I DON’T THINK it’s a main consideration. (PDTB 0090)
=
I THINK it’s not a main consideration.
(23) Yet the Soviet leader's readiness to embark on foreign visits and steady
accumulation of personal power, …, DO NOT SUGGEST that Mr. Gorbachev is
on the verge of being toppled; (PDTB 0439)
(24) SOME MAY BE TEMPTED TO ARGUE that the idea of a strategic review merely
resurrects the infamous Zero-Based Budgeting (ZBB) concept of the Carter
administration. (PDTB 0692)
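The four PDTB features described above can be summarised as a small record
type. What follows is an illustrative Python sketch, not the PDTB's own file
format: the class and field names are hypothetical, the label `Ftv` for facts is
not given in the text above and is assumed here, and the verb lists contain only
the examples quoted in this section.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    WRITER = "Wr"
    OTHER = "Ot"
    ARBITRARY = "Arb"


class AttributionType(Enum):
    ASSERTION = "Comm"
    BELIEF = "PAtt"
    FACT = "Ftv"  # assumed label, not stated in the text above
    EVENTUALITY = "Ctrl"


# Example verbs quoted in the text, keyed by lemma (illustrative, not exhaustive).
VERB_TYPES = {
    **dict.fromkeys(["say", "explain", "announce"], AttributionType.ASSERTION),
    **dict.fromkeys(["think", "believe", "doubt"], AttributionType.BELIEF),
    **dict.fromkeys(["hear", "know", "remember"], AttributionType.FACT),
    **dict.fromkeys(
        ["order", "allow", "persuade", "agree", "promise", "accept",
         "hope", "want", "wish"],
        AttributionType.EVENTUALITY,
    ),
}


@dataclass(frozen=True)
class Attribution:
    source: Source
    type: AttributionType
    polarity_reversed: bool = False  # scopal polarity: the negation applies to the AO
    indeterminate: bool = False      # determinacy: attribution not presented as real


def type_of(verb_lemma: str) -> AttributionType:
    """Look up the attribution type signalled by a verb lemma (KeyError if unknown)."""
    return VERB_TYPES[verb_lemma]


# (18) "We won't put any burden on Farmers," HE SAID.
ex18 = Attribution(Source.OTHER, type_of("say"))
# (22) I DON'T THINK it's a main consideration. (source value assumed)
ex22 = Attribution(Source.OTHER, type_of("think"), polarity_reversed=True)

print(ex18.type.value, ex22.type.value, ex22.polarity_reversed)
```

A lexicon lookup of this kind only covers verbal cues; adverbial reporting
phrases such as ‘reportedly’ in (17) would need separate handling.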
A last issue addressed by the PDTB concerns the annotation of the attribution text
span. The attribution span corresponds to the material containing the information
about the source, the type, the scopal polarity and the determinacy of the
attribution. The AO is usually annotated separately. The attribution spans “are
often left unexpressed in the sentence in which the AO is realized, and have to be
inferred from the prior discourse” (Prasad and Miltsakaki et al., 2008:48). When
the attribution is to the writer and implicit, no text span is selected.
The text span also includes, for each of its elements, the non-clausal
modifiers, e.g. adverbs and appositive noun phrases. In some cases the attribution
span is itself a non-clausal phrase: prepositional groups such as
‘in the eyes of’ and ‘according to’ (25), or adverbs like ‘reportedly’ and ‘allegedly’,
can also represent the text anchor of an attribution. When one of these
constructions, and not a verb, signals the attribution relation, the attribution
is included in the argument span corresponding
to its AO, as the PDTB annotation conventions do not allow keeping phrasal
modifiers separate from the span they modify (25).
(25) No foreign companies bid on the Hiroshima project, ACCORDING TO THE
BUREAU. But the Japanese practice of deep discounting often is cited
by Americans as a classic barrier to entry in Japan’s market. (PDTB
0501)
2.5 Summary
This chapter has presented a review of different approaches to discourse structure
and coherence relations, introducing different theories and surveying the main
projects regarding the construction of discourse annotated resources.
Attribution relations have been shown to be not only a syntactic, intra-sentential
phenomenon, as some studies have regarded them, but also to
scope over discourse units and even to relate extra-textual material. A new
definition of attribution has also been proposed, to meet the need for
one adequate to the scope of the present study.
Some annotation projects involving attribution were also reviewed. The
annotation schema developed in this thesis will be grounded on these projects,
though with some modifications. In order to provide a complete account of
attribution, it is necessary to extend and adapt these annotation schemas to the
range of linguistic units between which this relation can hold (i.e. word, clause,
sentence, discourse segment, discourse). Moreover, as the complexity of such a
wide scope suggests, an approach to attribution independent from other syntactic
or discourse phenomena will be adopted.
The benefit of this approach will be a better description of the
phenomenon and the development of a complete resource to be employed for
attribution-related studies.
3 An Analysis of Attribution
Before proceeding with the definition of an annotation schema for attribution, a
deeper understanding and description of the phenomenon is required. Attribution
will be segmented into its constitutive elements, which represent the fundamental
units of the annotation. This will also enable a more considered selection of the
features to be included in the schema. Moreover, the analysis of the different
components playing a role in the attribution relation will provide an account of the
different lexical elements that can realise them.
Finally, some characteristics of attribution and various issues representing a
challenge for the annotation will be discussed and possible solutions proposed.
3.1 The Components of Attribution
Attribution relations are intuitively composed of at least two elements: the
attributed linguistic material and the entity it is attributed to. The latter is usually
referred to as the source (Prasad et al., 2007; Wiebe, 2002), which includes the
experiencer of an emotional state as well as the writer or speaker of a text. The
former, due to the multiplicity of its possible referents, has no unique label.
In the literature the attributed element has been termed the ‘text’ or
‘document’, when dealing with document-level attribution; the ‘AO’ (Prasad et
al., 2007), representing a discourse unit; or interchangeably, when annotating
opinions, the ‘object’, ‘content’, ‘inside’ (Wiebe, 2002) or ‘target’ (Wiebe et al.,
2005) towards which a certain attitude is held by the source. As AO refers to a
discourse segment, which is not always the case in this study, this term will not be
used. The terms proposed for the annotation of opinions are all equally valid;
however, in order to avoid confusion, the attributed linguistic material will be
univocally identified here as the content.
In addition to source and content, a third element is fundamental in the
relation: the lexical anchor signalling the existence of an attribution. In the PDTB
this has been merged with the source and jointly annotated as the ‘attribution
phrase’. In the manual for the sentential annotation of opinions (Wiebe, 2002:6)
“the private-state or speech event phrase itself” is identified as ‘on’. In the
annotation scheme proposed later (Wiebe et al., 2005), this element is included in
both the private state and speech event frames as the ‘text anchor’. Although ‘text
anchor’ and ‘lexical anchor’ will occasionally be employed, the element connecting
source and content will be labelled in this work as the cue.
In the examples that follow, when the cue, the source or the content are
highlighted, this will be done as follows: the source span in bold, the cue
underlined and the text corresponding to the content in italics.
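The three components can be thought of as spans over the text. The following
Python sketch is only one possible encoding and does not reproduce the schema
developed in this thesis: the names `Span`, `AttributionRelation` and `find_span`
are hypothetical, and the span boundaries chosen for the example sentence from
the abstract (e.g. leaving ‘that’ outside both cue and content) are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset

    def text(self, doc: str) -> str:
        return doc[self.start:self.end]


@dataclass(frozen=True)
class AttributionRelation:
    source: Span   # the entity the content is ascribed to
    cue: Span      # the element linking source and content
    content: Span  # the attributed linguistic material


def find_span(doc: str, phrase: str) -> Span:
    """Locate the first occurrence of a phrase (raises ValueError if absent)."""
    start = doc.index(phrase)
    return Span(start, start + len(phrase))


doc = "The minister says that taxes will rise in 2010."
relation = AttributionRelation(
    source=find_span(doc, "The minister"),
    cue=find_span(doc, "says"),
    content=find_span(doc, "taxes will rise in 2010"),
)
print(relation.source.text(doc), "|", relation.cue.text(doc), "|",
      relation.content.text(doc))
```

Storing character offsets rather than strings keeps the three components tied to
their position in the document, which matters when the same words recur.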
3.1.1 The Source
The source of an attribution relation is the entity the content is ascribed to.
Sources are usually the agents (26) of a speech event, when it is a statement to
be attributed, or the experiencers (27), if dealing with a ‘private state’.
(26) Chairman Krebs says the California pension fund is getting a bargain price
that wouldn’t have been offered to others. (PDTB 0331)
(27) Sue thinks that the election was fair. (Wiebe et al., 2005:9)
However, things can get a lot more complicated than this. Quite frequently, the
sources mentioned are not animate agents. Contents are often attributed to
institutions or knowledge repositories, such as law codes, studies, reports and
newspapers. Although these usually stand in a metonymic relation to the actual
animate source, the latter is deliberately left out of the attribution as unknown,
irrelevant or even a plurality (28). In the example below (29) the content is a piece of
information which needs a reliable source to be considered trustworthy. Ascribing
the content to a major newspaper is here more effective than directly citing an
unknown journalist.
(28) La Costituzione prevede la mozione di fiducia per battezzare un governo,
quella di sfiducia per farlo cadere. (ISST els035)
The Constitution provides for a motion of confidence to install a
government, and a motion of no confidence to bring it down.
(29) Il quotidiano Ma’ariv riporta che è stato rafforzato il servizio di
sorveglianza attorno a Rabin, al capo di stato maggiore Shahak, al ministro
degli Esteri Peres, a quello della Polizia Shahal e dell’Ambiente Sarid.
(ISST cs042)
The newspaper Ma’ariv reports that the surveillance service around Rabin,
Chief of Staff Shahak, Foreign Minister Peres, Police Minister Shahal and
Environment Minister Sarid has been strengthened.
In other cases the source is not an agent but a specification or an adjective of its
metonymic referent, e.g. the words for the speaker or the document for the writer, in
agentive position ((30), (31)).
(30) According to John’s declaration, Mary left the party before midnight.
(31) The presidential report announced that the Defence Minister resigned
today.
When a source is adding credibility to the content it is related to, it is usually
explicitly mentioned through the attribution relation. However, especially in
journalistic texts, attribution relations serve another purpose: they remove liability
from the writer, interposing another source. Sometimes this strategy is used when
the provenance of the information in the content is not certain or not known. In this
case the metonymic source deliberately lacks a specific referent (32). In this
way the writer does not assume responsibility for the given statement, without
really attributing it to another specific source.
(32) …secondo indiscrezioni avrebbe sostenuto davanti agli investigatori che
non intendeva fare nulla di male e che per lui si è trattato di un “gioco”.
(ISST cs004)
…according to leaks, he reportedly told the investigators
that he did not intend to do anything wrong and that for him it was just a
“game”.
(33) Secondo anticipazioni l’esame del Consiglio di Stato avrebbe avuto un
esito positivo e il regolamento dovrebbe ricevere il semaforo verde ai primi
di giugno. (ISST sole153)
According to advance reports, the Council of State’s examination reportedly
had a positive outcome and the regulation should get the green light in
early June.
Sources without a corresponding referent can also be indefinite entities, e.g. ‘the
people’, ‘someone’, or impersonal pronouns, e.g. ‘one’, ‘you’. Moreover, an
attribution relation can exist although paradoxically one of its constitutive elements,
the source, is missing. This effect is achieved through the use of e.g. a passive
attribution verb lacking the agent (34), a past participle (35) or an infinitive.
(34) È stato detto che si tratta di sport, non bisogna farne una tragedia; (ISST
els060)
It has been said that we’re dealing with sport, we shouldn’t make a fuss out
of it;
(35) L’accordo annunciato ieri… (ISST sole101)
The agreement announced yesterday…
As Italian is a pro-drop language, the source is quite often left implicit. This,
however, does not mean that it is missing. It corresponds in fact to the implicit
personal pronoun of the attribution verb, usually coreferential with an explicit
entity mentioned somewhere else in the text.
(36) Probabilmente Vialli non ha dimenticato le voci sulla sua presunta vita
allegra durante i Mondiali del 1990 rivelate su Italia1 da Maurizio Mosca. E
Ø non crede che la recente alleanza tra Juventus e Milan possa cambiare
molto il comportamento dei commentatori sulle emittenti di Berlusconi.
(ISST cs043)
Probably Vialli has not forgotten the rumours about his presumed ‘happy
life’ during the 1990 World Cup revealed on Italia1 by Maurizio Mosca. And
(he) doesn’t believe that the recent alliance between Juventus and Milan could
do much to change the commentators’ behaviour on Berlusconi’s channels.
3.1.2 The Content
The content of an attribution could be regarded as the nucleus (Wolf and Gibson,
2005) of the relation. The source and the cue act as satellite elements and
therefore, according to RST (Mann and Thompson, 1988), convey
additional information. As already mentioned, the content can be
constituted by different linguistic units.
Word or phrase
A single word or phrase can already constitute the content of the attribution, as in
(37) and (38). This happens not only when the word represents a complete, albeit
short, directly reported utterance; in that case ‘yes/ no’ function as sentence
substitutes and therefore contribute to textual cohesion (Renzi, 1995). Very often,
what is attributed is not directly the content, but its ‘container’ (39).
As the main reason behind the creation of an annotated resource for
attribution is to link the content with its source, in order to allow a more
correct semantic interpretation of it and to account for its provenance, the
attribution of a ‘container’ of information appears at first not to be relevant.
Such words or clauses would therefore not require annotation. In example (39),
knowing that the ‘press release’ has been issued by ‘Palazzo Chigi’ is not
necessary, as it neither represents linguistic material directly asserted by the
source, nor conveys any piece of information that could be ascribed to the source.
(37) The minister addressed the president calling him “padrino”.
(38) “Sì”, le risponde convinta un’amichetta. (ISST cs060)
“Yes”, a little friend answers her confidently.
(39) Palazzo Chigi emette un nuovo comunicato. (ISST els048)
Palazzo Chigi (seat of the Italian Government) issues a new press release.
However, the situation is different in the case of event anaphora, when the content
is also expressed, although somewhere else in the text (40). The annotation of the
attribution relation binding the source to the ‘container’ of the attributed span would
allow the actual content, once this metonymic co-reference relation is resolved, to
inherit the attribution relation. Similarly, the content can often be found expressed
by just a pronoun (41) co-referentially recalling the attributed utterance. In the
examples below, the content represents an instance of event anaphora, a relation
often intertwined with attribution.
(40) Palazzo Chigi emette UN NUOVO COMUNICATO. <<Sarà il governo>> scrive
<<a prendere una decisione in piena autonomia e responsabilità>>. (ISST
els048)
Palazzo Chigi issues A NEW PRESS RELEASE. <<It will be the Government>>
(it) writes <<to take a decision in full autonomy and responsibility>>.
(41) “…Dobbiamo fare un ulteriore salto di qualità, entrare in una nuova
mentalità”. A dirLO è Giuseppe Signori, … (ISST re126)
“…We have to make a further leap in quality, enter a new mentality”. The
one saying IT is Giuseppe Signori, …
Finally, it is also possible to find a verb as the content, and at the same time cue,
of the attribution. This happens with verbs such as ‘confermare’ (to confirm),
‘accettare’ (to accept), ‘negare’ (to deny), ‘rifiutare’ (to refuse) and ‘smentire’
(to deny, refute), which implicitly involve, because of the semantics of the verb,
the production of a ‘yes/ no’ utterance. In this case, however, it is not necessary
to link source and content, as the verb is already syntactically connected to its
subject, or object in the case of a passive verb.
Clauses
More often, what is attributed is a larger linguistic unit. This can still happen
intra-sententially, when the content is a single clause or more than one (42).
Reported speech, whether direct (43) or indirect, is also usually represented at the
sentence level. Source and verbal cue together often constitute the main clause,
while the content is the direct object (42) of the attribution verb. The attributed
span can be expressed by a subordinate or embedded clause. Alternatively, the
content may represent the main clause, and the attributing span a parenthetical
clause.
(42) Mr. Marcus believes spot steel prices will continue to fall through early 1990
and then reverse themselves. (PDTB 0336)
(43) "Vi daremo le statistiche alla fine", promettono i generali croati. (ISST
cs030)
“We’ll give you the statistics at the end”, the Croatian generals promise.
Sentences and larger units
Nevertheless, it is also common to find one or more clauses in a separate
sentence, or one or more full sentences (44), as the content of an attribution
relation. Discontinuous contents spreading over several sentences are often
associated with interviews (45) or testimonies, where the source and the cue do
not change and do not need to be constantly repeated.
(44) "There's no question that some of those workers and managers contracted
asbestos-related diseases," said Darrell Phillips, vice president of human
resources for Hollingsworth & Vose. "But you have to recognize that these
events took place 35 years ago. It has no bearing on our work force today."
(PDTB 0003)
(45) Dunque, Ghezzi, che cosa significa "non cinema"? “… Per intenderci,
Moretti potrebbe girare tutta la vita ma non arriverebbe mai alla sensuosità
o fatalità cinematografica di un Michael Cimino...”. Sensuosità? E' un
concetto che ha a che fare con la forma? " Fino a Palombella rossa...".
(ISST cs050)
So, Ghezzi, what does “non cinema” mean? “…To make it clear, Moretti
could shoot all his life but he would never reach the cinematic sensuality or
fatality of a Michael Cimino…” Sensuality? Is it a concept that has to do with
form? “Till Palombella rossa…”.
Finally, when dealing with news articles, the article itself represents a content,
whose source is the writer. The content of the article is in fact the responsibility
of its author, who is usually explicitly mentioned (Figure F).
Figure F - Newspaper article source
http://www-1.unipv.it/webbio/labweb/primantr/news/genetre2_giornale.gif
3.1.3 Elements Functioning as Cue
How is it possible to detect the existence of an attribution relation? The simple
juxtaposition of a source and a content would not be enough unless some
other element provides the textual anchor that links them together. This element is
the attribution cue, and it can be realised by different linguistic elements: graphic
elements such as punctuation, or grammatical and lexical devices.
Apart from establishing the relation, the cue also has another function: it
determines the kind of attribution, e.g. a belief, a thought, an assertion, etc. While
punctuation cues always refer to asserted contents and prepositions alone do not
specify the nature of the relation, nouns and verbs can express several types of
attribution.
Punctuation Cues
Punctuation, i.e. double and single quotation marks (‘…’, “…”) and less frequently
double angle brackets (<<…>>) or hyphens (-…-), represents, in Italian as well as
in English, the simplest cue to look for when searching for an attribution, although
it is only rarely the sole cue, as it is in (46). However, this is not a reliable cue,
as it only accounts for the attribution of directly reported assertions, leaving out
indirect speech and the attribution of mental states such as opinions, intentions or
knowledge. Moreover, the same punctuation marks may also be employed in
Italian to mention a word or a title, to signal an unusual usage (47) of one or a few
words, such as an ironic or metaphoric use, or even to give emphasis to them. In
addition, single quotation marks are also used for the apostrophe and in
some cases, in order to avoid using special characters, to render accented glyphs.
(46) Il Papa: “La cultura ha bisogno del genio femminile”. (ISST cs014)
The Pope: “Culture needs the female genius”.
(47) Settembre, mese tradizionalmente <<caldo>>, non fa registrare vistosi
strappi al rialzo, sottolineando l’andamento verso il basso del costo della
vita. (ISST els020)
September, a traditionally <<hot>> month, does not register any marked
upward jumps, underlining the downward trend in the cost of
living.
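The unreliability of punctuation as a cue can be illustrated with a naive
extractor. The following Python sketch is only an illustration and not part of
the annotation procedure: the names `QUOTED` and `quoted_spans` are
hypothetical, single quotation marks and hyphens are deliberately ignored, and
example (47) is shortened.

```python
import re

# Double quotation marks (curly or straight) and double angle brackets.
# Single quotation marks are left out because they clash with the apostrophe.
QUOTED = re.compile(r'“([^”]+)”|"([^"]+)"|<<(.+?)>>')


def quoted_spans(text: str) -> list:
    """Return the material enclosed by each pair of quotation marks."""
    return [
        next(group for group in match.groups() if group is not None)
        for match in QUOTED.finditer(text)
    ]


ex46 = "Il Papa: “La cultura ha bisogno del genio femminile”."
ex47 = "Settembre, mese tradizionalmente <<caldo>>, non fa registrare vistosi strappi al rialzo."

print(quoted_spans(ex46))  # a directly reported assertion
print(quoted_spans(ex47))  # an unusual usage, not an attribution: a false positive
```

The extractor returns a span for both sentences, although only (46) contains an
attribution; a length threshold would not help either, since a single quoted word
can be a genuine content, as in (38).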
Prepositions and Prepositional Groups
Syntactic cues can be expressed by several word classes. Although attribution
verbs are by far the most common signal of the existence of an attribution
relation, nouns, adjectives, prepositions and adverbs can also function as cues.
While only one cue is required, it is common to find two or even more cues
combined (48). A partial account of Italian cues, although only relative to
reported speech, can be found in Renzi (1995). This grammar lists the prepositions
‘per’ (‘for’) and ‘secondo’ (‘according to’) (48). To these should be
added, although they are not very frequent and some do not even occur in the
ISST corpus, the prepositional groups ‘a detta di’ (according to), ‘a parere di’ (in
the opinion of), ‘agli occhi di’ (in the eyes of), ‘nell’ottica di’ (from the perspective
of), ‘per quanto riguarda’ (as far as … is concerned), ‘stando a’ (according to) (49).
(48) Non solo, ma secondo lo stesso Tronchetti Provera “da fornitore di cavi
siamo diventati fornitori di sistemi integrati”. (ISST re062)
Not only that, but according to Tronchetti Provera himself “from a supplier of
cables we have become suppliers of integrated systems”.
(49) Oltre ai missili di questo tipo, stando alle stesse fonti, le navi partite dalla
Corea del Nord ne trasporterebbero:COND altri del tipo Styx... (ISST
els075)
Besides missiles of this kind, according to the same sources, the ships that
left North Korea are reportedly carrying others of the Styx type…
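The prepositional cues listed above lend themselves to a simple lexicon lookup.
The Python sketch below is an illustration only: the names `CUES`, `PATTERNS`
and `find_cues` are hypothetical, and leaving out ‘per’ and ‘per quanto
riguarda’ as too ambiguous for plain string matching is an assumption made
here, not a decision taken in this study. Articulated prepositions (e.g.
‘stando a’ + ‘le’ = ‘stando alle’) are handled explicitly.

```python
import re

# Articulated forms of 'a' and 'di' (e.g. 'stando a' + 'le' -> 'stando alle').
A = r"(?:a|al|allo|alla|ai|agli|alle|all['’])"
DI = r"(?:di|del|dello|della|dei|degli|delle|dell['’])"

# Cue label -> regular expression body. 'secondo' also means 'second',
# so matches are candidates to be checked, not certain cues.
CUES = {
    "secondo": r"secondo",
    "stando a": rf"stando\s+{A}",
    "a detta di": rf"a\s+detta\s+{DI}",
    "a parere di": rf"a\s+parere\s+{DI}",
    "agli occhi di": rf"agli\s+occhi\s+{DI}",
    "nell'ottica di": rf"nell['’]ottica\s+{DI}",
}
PATTERNS = {
    label: re.compile(rf"\b{body}\b", re.IGNORECASE)
    for label, body in CUES.items()
}


def find_cues(text: str) -> list:
    """Return (cue label, matched surface form) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group(0)))
    return [(label, surface) for _, label, surface in sorted(hits)]


ex48 = "Non solo, ma secondo lo stesso Tronchetti Provera ..."
ex49 = ("Oltre ai missili di questo tipo, stando alle stesse fonti, le navi "
        "partite dalla Corea del Nord ne trasporterebbero altri del tipo Styx...")

print(find_cues(ex48))
print(find_cues(ex49))
```

Such a lookup only proposes candidate cues; whether a match actually links a
source to a content still has to be decided in context.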
Adverbials
Prasad and Miltsakaki et al. (2008: 43) identify some adverbials which may
function in English as attribution cues, such as ‘reportedly’ (50), ‘allegedly’,
‘supposedly’, etc. In Italian, however, there is no corresponding class. These
adverbials usually have an equivalent in a prepositional phrase: ‘a quanto si dice’
(according to what is said).
(50) East Germans rallied as officials reportedly sought Honecker’s ouster.
(PDTB 2278)
Nouns and Adjectives
While discussing the way the source can be expressed (3.1.1) it has been shown
how adjectives can assume this function. Adjectives establishing a relation of
possession between possessor and owned entity function as the cue of an attribution
relation if the possessor is the source and the owned entity the content, or the
element coreferential with the content, as in the example below (51).
(51) “The Defence Minister resigned today”. The presidential announcement at
the press conference came unexpected.
Although nouns alone do not establish any relation between source and content,
they can function as ‘introductory elements’ (Renzi, 1995) following or preceding
the attributed material they represent. These nouns or NPs are very informative
about the typology of attribution, e.g. assertion (declaration, release, observation,
etc.), belief (doubt, idea, etc.) or intention (agreement, promise, desire, etc.).
Knowing the type of attribution is highly relevant in order to discern whether the
attributed material is, for example, an opinion, a statement or an intention. In the
following example (52), ‘la dichiarazione’ (the declaration) is the only element
signalling that the following material (highlighted in italics) is not attributed to the
writer but to another source, which is not mentioned at all. The noun itself,
representing a speech act, presupposes the existence of a source.
(52) Mi ha sconvolto la dichiarazione che tutto questo non vale niente. (Renzi,
1995:435)
The declaration that all this is worth nothing upset me.
Renzi (1995) also observes that nouns or NPs functioning as attribution cues
usually have an argument structure or refer to speech acts, but also to acts of
e.g. thought (53) or will.
(53) “It is nice to die for what you believe in; who is afraid, dies every day, who is
not afraid, dies only once”. With this idea the anti-Mafia magistrate Paolo
Borsellino worked till he was assassinated in 1992.
Grammatical Cues: Quotative Conditional
Some languages grammatically mark the fact that the writer/ speaker is not directly
presenting the information but there is an intermediary source. This grammatical
category is called evidentiality, as it accounts for “the evidence a speaker has for
his/ her statement” (De Haan, 2008:77). As the WALS map of the Semantic
distinction of evidentiality (De Haan, 2008:77) shows, the encoding of evidentiality
is a relatively common feature. In Europe, however, it is almost only indirect
evidentiality that can be expressed, without further distinguishing among different
modes of sensory evidence.
De Haan (2008) points out the fact that the languages presenting indirect
evidentials in Europe are mainly Germanic, with the exclusion of English, and
suggests that Finnish and French may have developed evidentiality because of
Germanic influence, as Ugro-Finnic and Romance languages do not present this
feature.
However, this is not accurate, as Italian possesses a grammatical structure to
express hearsay, i.e. the quotative conditional (54), similar to the French
“conditionnel de la rumeur”. Neither language, however, has a dedicated
grammatical category for evidentiality, as the conditional is also used for other
purposes, e.g. unreality, attenuated wish, etc., expressing a number of factuality
degrees and epistemic modality.
Epistemic modality is often associated with evidentiality, as the information
source influences the degree of certainty the speaker expresses towards a
proposition. Although epistemic modality may be intertwined with the conditional,
this Italian mood is reportive and not inferential (Giacalone, 2007).
(54) Un incendio, che si sarebbe sviluppato:COND per cause accidentali, ha
gravemente danneggiato a Fiano (Torino), uno chalet di proprietà di
Umberto Agnelli, attiguo alla sua abitazione. (ISST cs010)
A fire, which (is said to have) developed for accidental causes, has severely
damaged in Fiano (Turin), a chalet belonging to Umberto Agnelli, next to his
residence.
According to Aikhenvald (2004) languages like Italian and French do not have
evidentiality as they do not have dedicated morphemes expressing it, but just
“evidentiality strategies” which originate from the verb mood and represent a
secondary function.
Nonetheless, quotative conditionals are very common in Italian and are an
important indicator of attribution, although, as Knott (1996) also remarks, they can
only be recognised in context, as exemplified in ((55)a-b). Moreover, rather than
attributing the content to a source, quotative conditionals mark that the default
attribution to the writer is not suitable. They always refer to an indeterminate,
unknown source, unless this is explicitly expressed by other means ((49), (56)).
(55) a. Il presidente sarebbe:COND morto.
The president is (said to be) dead.
b. Il presidente sarebbe:COND morto, se non avesse usato la cintura.
The president would have died/ would be dead, if he had not used
the seatbelt.
(56) Secondo anticipazioni l’esame del Consiglio di Stato avrebbe avuto:COND
un esito positivo e il regolamento dovrebbe ricevere il semaforo verde ai
primi di giugno. (ISST sole153)
According to advance reports, (it seems that) the Council of State examination
had a positive result and the regulation should get the green light in
early June.
Verb cues
Verbs are the most significant attribution cue in Italian as well as English. When
occurring at the intra-sentential level, they usually constitute the main clause
together with the source, while the content is expressed by a dependent clause
with (57) or without (58) the complementizer ‘che’ (that). The attribution clause
may not only occur before or after the content text span, but also enclosed in it as
an incidental clause, or even, although it is not a very frequent strategy, around the
content (e.g. Giovanni: “Tutto qui?” chiese con un sorriso./ John: “Is that all?” (he)
asked with a smile.)
Renzi (1995) groups these verbs in three categories: A) verbs expressing a
linguistic action, e.g. ‘raccontare’ (to tell), ‘telefonare’ (to phone), ‘rispondere’ (to
answer), ‘scrivere’ (to write), ‘ordinare’ (to order) etc.; B) verbs expressing the
reception of a linguistic act, e.g. ‘sentire’ (to hear), ‘intendere’ (to understand),
‘leggere’ (to read), etc.; C) verbs conveying a cognitive process, e.g. ‘pensare’ (to
think), ‘ricordare’ (to remember), etc.
The PDTB adopts instead a different and more fine-grained classification
(see 2.4.3). Assertions and eventualities partly overlap with A), facts should
correspond to B), and beliefs to C).
(57) Nella morte di Ivan Ilic, Tolstoj sostiene che in quel momento si va verso
una grande luce. (ISST els034)
In the death of Ivan Ilic, Tolstoj claims that in that moment we go towards a
big light.
(58) The BPC Fine Arts Committee think she had a literal green thumb. (PDTB
0984)
Another category of verbs that does not match any of the above-mentioned ones
can also be found as attribution cues. As they cannot themselves function as
introductory devices for the content, Renzi (1995) suggests that these verbs
should be considered as implicitly presupposing one of the attribution verbs,
probably a hyperonym such as say or think. These verbs can be ascribed to two
different categories.
One includes verbs such as ‘iniziare’ (to begin), ‘continuare’ (to continue),
‘aggiungere’ (to add) (59) and ‘concludere’ (to conclude) which suggest the
existence of another attribution they usually follow, but may also precede.
Therefore these verbs could be considered as inheriting the type from the verb
they are linked to, which is usually an assertion, as they correspond to the
chronological phases of a speech event.
(59) “…Finché c’è chi lo difende e lo incoraggia, lui continuerà a comportarsi
così”, profetizza Storace. E aggiunge: “Ancora più grave poi è
l’atteggiamento del governo, che non prende posizione davanti alle
stronzate di Bossi perché la sua sopravvivenza dipende dai voti della Lega”.
(ISST cs027)
“…Till there is someone protecting and encouraging him, he will go on
behaving like that”, forecasts Storace. And (he) adds: “Even worse is the
attitude of the government, which does not take a stand against Bossi’s
absurdities because its survival depends on the votes of the Lega”.
The other includes verbs such as ‘sorridere’ (to smile) (60), ‘alzare le spalle’ (to
shrug the shoulders), ‘adombrarsi’ (to grow dark) (61), ‘rallegrarsi’ (to rejoice),
‘acquietarsi’ (to calm) etc. These verbs occur mainly in incidental position (Renzi,
1995). Most of them are part of what Levin (1993:219-220) classifies as verbs of
nonverbal expression and of gestures, observing that they are usually associated
with an emotion and mainly involve a facial expression or body parts, e.g. ‘annuire’
(to nod), ‘ammiccare’ (to blink), ‘corrugare’ (to wrinkle), etc. The rest of the verbs in
this group directly refer to an emotional change, often involving a change in the
intonation.
Talmy (2000:152) defines manner as “a subsidiary action or state that a
Patient manifests concurrently with its main action or state”. Therefore manner is
expressed in languages that cannot normally express it on the verb (e.g. Italian) as
two sub-events. As these attribution verbs express the manner of the verbs
conveying the attribution, this could be considered a sort of metonymical use
(manner for the action/ manner). Like the continuative verbs, these
verbs are usually associated with speech acts; they could therefore be seen as
specifying the hyperonym ‘say’, which is left implicit. In the following example the
verb used as cue, ‘sorride’ (smiles), could be replaced with ‘dice sorridendo’ (says
while smiling).
(60) Arlacchi sorride: “Pura paranoia politica. Non ho partecipato ai lavori solo a
causa di un impegno privato…”. (ISST re095)
Arlacchi smiles: “Pure political paranoia. I didn’t participate in the works
only because of a private appointment…” .
(61) E' vero che doveva interpretare lei la parte di Bruce Willis in Pulp Fiction?
"Sì - si adombra Matt - Un ruolo interessante: con Tarantino eravamo a
buon punto, poi é arrivato Bruce. I suoi film incassano un po' più dei miei,
no? Hanno scelto lui", ride nervoso, tormentando il tappo a vite di una
bottiglia d'acqua minerale . (ISST cs060)
Is it true that you were going to play Bruce Willis’s part in Pulp Fiction?
“Yes - Matt grows dark - An interesting role: with Tarantino we were at a
good point, then Bruce arrived. His films make a bit more money than mine,
right? They chose him”, (he) laughs nervously, tormenting the screw top of
a mineral water bottle.
3.2 Some Issues
The annotation of attribution relations raises several questions as to how to deal
with peculiar aspects or issues of attribution which represent a challenge for the
annotation. These aspects arose from theoretical considerations as well as while
performing the pilot annotation, and are particularly important as they determine
the choice of a suitable tool and shape the annotation schema. In this section some
of these features will be presented.
3.2.1 Nested Attributions
A pervasive characteristic of attribution is its recursiveness. Any attribution relation
can constitute the content of another attribution relation and this the content of
another one and so forth. The possibility of nesting an attribution into another
attribution is a potentially never-ending process. This could be exemplified as
follows, the capital letter representing the source and the brackets signalled by the
same small letter its content.
A [B {C (D |…|d )c }b ]a
Although not annotated in the PDTB, nested attributions need to be accounted for
in order to determine the truth or trustworthiness value of the embedded content.
Considering just the shallowest source, that is the leftmost one, or the most
embedded one, hence the rightmost, or even an arbitrary intermediate one, could
lead to ignoring characteristics of the other ones which might call for an
important re-reading of the information in the content. Wiebe (2002) and Wiebe et
al. (2005) include nested sources in their annotation schema, listing in the source
ID all the sources in the sentence, with the addition of the writer, whose content
comprises a certain text span.
(62) [Sue said {that Mary believes (that Gore won the election)}].
Sources: [writer] {writer, Sue} (writer, Sue, Mary)
(Wiebe, 2002:5 - with the addition of brackets)
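The source lists in (62) can be derived mechanically once the nesting is known. The following is only a minimal sketch, assuming a hypothetical `Attribution` record that stores a source, a cue and, optionally, the relation nested in its content; neither the class nor the function belongs to Wiebe's schema or to the schema developed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attribution:
    """One attribution relation; `nested` is the relation inside its content."""
    source: str
    cue: str
    nested: Optional["Attribution"] = None


def nested_sources(outermost: Optional[Attribution]) -> List[List[str]]:
    """Return, from the shallowest span to the deepest, the chain of sources
    whose content includes that span. The writer is always the first source."""
    chain = ["writer"]
    chains = [list(chain)]
    relation = outermost
    while relation is not None:
        chain.append(relation.source)
        chains.append(list(chain))
        relation = relation.nested
    return chains


# Example (62): Sue said that Mary believes that Gore won the election.
example_62 = Attribution("Sue", "said", nested=Attribution("Mary", "believes"))
print(nested_sources(example_62))
# [['writer'], ['writer', 'Sue'], ['writer', 'Sue', 'Mary']]
```

Each inner list corresponds to one bracketed span in (62), so the deepest content carries all the sources it has passed through.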
Formalising the effect the sources determine on the different embedded contents
would allow, once attribution relations have been recognised, the automatic
derivation of the truth value of information at different levels of embedding. This
represents a simplistic abstraction, as sources are almost never so sharply
differentiated as ‘sincere’ or ‘liar’ but usually imply different degrees of
reliability or bias, which they project onto the content. These also vary according
to the content topic, as the source's expertise also varies.
Making use of Boolean logic it is however possible to draw some
considerations. Figure G represents a possible scheme of nested attributions.
Source ‘A’ is related to the content ‘a’ through the attitude it holds towards it (belief,
statement, desire, etc.). The content ‘a’ is formed by the relation ‘Bb’ occurring
between the source ‘B’ and its content ‘b’, which in turn is composed of ‘Cc’ and
so forth, plus optional additional material which is not part of the relation. The
trustworthiness and knowledge of the source ‘A’ determine the truth and reliability
of its content ‘a’, as in a relation of implication ‘A→a’. Substituting ‘a’ with its
corresponding ‘Bb’, the implication becomes ‘A→Bb’, that is, ‘A’ implies the
attribution relation embedded in it. If ‘A’ is trustworthy, the attribution relation
‘Bb’ nested in it is too, and should be considered factual. Similarly, every source
implies its content and therefore the attribution relation included in it: ‘B→b’,
‘C→c’, ‘D→d’, ‘N→n’.
Figure G - Nested attribution schema
However, when deriving the ‘truth’ value of a content ‘d’ all the sources of the
contents it is included in need to be considered. It is not sufficient that ‘D’ is
considered reliable, all sources to its left (i.e. A, B, C) need to be (Figure Ha). They
can therefore be joined with an AND relation (A Λ B Λ C Λ D) → d. To make it
simple, sources and contents which are reliable and taken into consideration are
here labelled as ‘true’, while sources and contents that are not, as ‘false’. Figure H
shows the ‘truth’ values (T/ F) of a nested content ‘d’, the arrows point to the
attribution relation between the content (small letter) and the source (capital letter).
Proceeding from the inside to the outside, in case ‘D’ is ‘false’ (Figure Hb), ‘d’ is
already uncertain, and it is not necessary to also check ‘A’, ‘B’ and ‘C’.
Considering the example (63) and supposing an answer to the question ‘Is John
innocent?’ is required, his mother should probably not be considered a reliable
source. In that case, the piece of information representing the most embedded
content is not relevant and cannot be considered as the answer. Did she really
make such a declaration? If ‘The Times’ and ‘the police’ are considered reliable
sources, and in this case too this is an arbitrary decision, then it should be
assumed that this attribution relation is correct.
(63) The Times writes about the police saying that the murderer’s mother
declared: “John is innocent”.
Moreover, a ‘false’ source implies that everything to its right, that is towards the
more embedded attribution relations, cannot be trusted, even if the other sources
to the right are ‘T’. Figure Hc shows a case in which ‘A’ is ‘true’, and so is its
content ‘a’ and therefore the attribution of ‘b’ to ‘B’. Since ‘B’ is ‘false’, however, the
content ‘b’ and everything contained in it cannot be considered ‘true’.
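The AND relation described above can be sketched as a running conjunction over the T/ F labels of the sources, taken from the outermost to the most embedded. This is only an illustration: the function name `content_truth` is hypothetical, and the labels assigned to the sources are arbitrary inputs, as in the text.

```python
from itertools import accumulate
from typing import List


def content_truth(source_reliable: List[bool]) -> List[bool]:
    """source_reliable[i] is the T/F label of the i-th source, outermost first.
    The i-th result is the label of that source's content: True only if the
    source and every source to its left are labelled True (A AND B AND ...)."""
    return list(accumulate(source_reliable, lambda left, current: left and current))


# Case described for Figure Hc: 'A' is true, 'B' is false, 'C' and 'D' are true.
print(content_truth([True, False, True, True]))  # [True, False, False, False]

# Example (63), assuming The Times and the police are reliable, the mother is not.
print(content_truth([True, True, False]))  # [True, True, False]
```

In the second call the attribution of the declaration to the mother is accepted, while the declaration itself (‘John is innocent’) is not.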
[Figure H, panels a) to c): the sources A, B, C, D, … and their contents a, b, c, d, joined by Λ and labelled with T/ F values]
Figure H - Truth values of a nested content
Discerning between sources which should be considered and sources which
should lead to the rejection of the content depends on subjective and domain
specific considerations. Once the sources have been sorted, determining if the
content is to be taken into consideration can potentially be turned into a Boolean
problem, as presented above. As already anticipated, however, determining the
relevance of a piece of information is not a Boolean problem as it involves
variables with domain sizes greater than the binary ‘true/ false’.
That is, sources are almost never completely reliable or completely
unreliable, but occupy intermediate positions on a continuum between the
‘true’ and ‘false’ poles, according to the field of information under consideration
and the personal orientations and subjective characteristics of both the source and
the person considering the information. Algorithms for non-Boolean problems
would be more appropriate to fully deal with the degree of truth of the
content. This is better captured by fuzzy logic, in which the sources and the content
have a truth value ranging between 0 and 1.
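A graded version of the same computation can be sketched by replacing the Boolean AND with a fuzzy conjunction. The text does not say which conjunction should be used; the sketch below assumes the minimum, a common choice in fuzzy logic, and allows another operator (e.g. the product) to be passed in. The function name and the reliability values are purely hypothetical.

```python
from itertools import accumulate
from typing import Callable, List


def content_degree(reliability: List[float],
                   conjunction: Callable[[float, float], float] = min) -> List[float]:
    """reliability[i] is the degree (0 to 1) assigned to the i-th source,
    outermost first. The i-th result is the degree of that source's content,
    i.e. the fuzzy conjunction of all the degrees up to and including it."""
    for value in reliability:
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliability values must lie between 0 and 1")
    return list(accumulate(reliability, conjunction))


# Four nested sources, outermost first, with invented degrees.
print(content_degree([0.9, 0.8, 0.4, 0.7]))  # [0.9, 0.8, 0.4, 0.4]
```

With the minimum, the least reliable source on the path caps the degree of everything embedded in it; with the product, every additional level of nesting lowers the degree further.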
Example (64), taken from real language use, presents four levels of nested
attributions. First from the outside, the writer of the article, or more generally the
newspaper publishing it of which the whole sentence represents a content. Second
the ‘New York Times’ which is reporting rumours, the third source, and last
‘Blinder’, holding the most internal content. None of these four sources needs to
be a priori discarded, although ‘rumours’ is surely less reliable as it has a non-
specific referent.
(64) Blinder, secondo voci riferite dal New York Times, sperava di succedere
al presidente Greenspan quando a marzo scadrà la sua nomina. (ISST
re070)
Blinder, according to rumours reported by the New York Times, hoped to
succeed President Greenspan when his appointment expires in March.
The more sources there are, the more steps a piece of information has gone
through, and therefore the greater the chance that it was transformed from its
original form, as in the well-known ‘Whisper Game’. Although not so common and
usually shallower than the nesting of an attribution relation in the content of another
attribution relation, the example (64) above shows that the source may also be an
attribution relation itself. ‘Rumours reported by the New York Times’ is in fact the
source of the content ‘Blinder…hoped to (…)’.
Conversely, this could also be interpreted and analysed as a different
perspective on the same problem. A source can present an attribution
relation in its content, and the source of a nested attribution relation can in turn
be an attribution relation whose content is its local source and whose source is the
superordinate one (e.g. Blinder, according to rumours reported by the New York
Times…/ The New York Times reports rumours saying that Blinder…).
3.2.2 Source of the Source
What is here called ‘source of the source’ could be considered a special case of
nesting. In this case, the attribution relation makes explicit the presence of another
source which is not on the same level of embedding as the actual source; otherwise
it would just represent an instance of multiple sources (see 3.2.3). This added
source could be more internal, as in example (65) (the ‘source of the source’ is
in small capitals), where the most embedded source ‘Maurizio Damilano’ is not
directly connected to the content by an attribution relation, although this is
semantically inferable.
This type of additional source is usually dependent on verbs of perception
and knowledge, the ones labelled as ‘facts’ in the PDTB, as these correspond to
the verbs representing the reception of a linguistic act (Renzi, 1995), therefore
they more or less implicitly recall the production of a linguistic act. The sentence in
(65) could be in fact transformed into its reciprocal equivalent ‘Maurizio Damilano
told me about the disqualification of Garciano,…’, the original source becoming the
indirect object, the recipient of the speech act, while the ‘source of the source’,
signalling the provenance of the information, becomes the new source.
(65) (Ø) Ho saputo della squalifica di Garciano DA MAURIZIO DAMILANO, vi giuro,
non pensavo di arrivare primo. (ISST cs071)
(I) heard of the disqualification of Garciano FROM MAURIZIO DAMILANO, I
swear, I didn’t think I would come first.
Both recipient and ‘source of the source’ are relevant to the attribution relation as
they inform about the source of a piece of information and the entity this was
addressed to. Both these elements can influence the way the content is perceived
( (66)a-b).
(66) a. The pope/ scientist says we do not derive from monkeys.
b. The scientist told THE PRESIDENT/ THE SCHOOLCHILDREN that asbestos is
harmful.
Instances where a more external source is mentioned without directly relating it to
its whole content, but just to the source embedded in it, as in ((67)-(68)), could also
be considered ‘source of the source’. This strategy is often adopted when
the intermediary source is not particularly prominent and is not doing
anything other than reporting the most embedded content. Example (68) could be
paraphrased as ‘The president’s spokesman Rossi said that the president
announced that a new anti-Mafia pool has been appointed’. If the attribution
relation involving the ‘source of the source’ were made explicit, the spokesman
would become the subject of the sentence and therefore occupy a prominent role,
attracting unneeded attention and diverting it from the more prominent source ‘the
president’.
(67) Poi però, TRAMITE LA FIGLIA che sta a Santiago, prima limita la portata del
colloquio con Gaston Salvatore (“non è stata una vera intervista, solo una
conversazione”), poi smentisce. (ISST period005)
Afterwards however, THROUGH THE DAUGHTER who lives in Santiago, first
diminishes the importance of the colloquium with Gaston Salvatore (“it
wasn’t a real interview, just a conversation”), then she denies.
(68) The president has announced THROUGH HIS SPOKESMAN ROSSI that a new
anti-Mafia pool has been appointed.
This second type of ‘source of the source’ is expressed as an adjunct indicating
means. In that position such sources are also presented as less relevant, as if they
were neutral and did not affect the content. However, both types should be included
in the annotation, as they signal that the attribution relation is
second-hand material, and they need to be considered when computing the
distorting effect of the ‘Whisper Game’, as they add a level of nesting to the
attribution.
3.2.3 Multiple Sources, Contents, Cues
The main elements involved in an attribution relation are three, however, more
than one of each at a time can be involved in the relation. The most common is the
case when a source is holding the same attitude towards more than one content
as in the examples ( (69), (72)).
(69) (Ø) Ho detto |che ero dalla sua parte| e |che ritenevo giusta la sua protesta|.
(ISST cs063)
(I) said |that I was on his side| and |that I considered his complaint fair|.
Also quite common is the presence of more than one mentioned source (70). This
is different from collective sources, such as institutions, organisations, pluralities or
groups, as multiple sources are separate entities or at least are presented as such,
e.g. John and Mary; the government, the army, and the civilians, etc. Often, as
in the example below, one source semantically includes the other, which
represents a specification of the more general source. Multiple sources are more
common when expressing beliefs or knowledge, as assertions or even opinions
usually belong to a single entity or to an entity presented as unanimous.
(70) Tutti, incluse le autorità, conoscono la loro provenienza, ma nessuno dice
e fa nulla per prevenire il massacro di capi selvatici. (cs.morph020)
Everyone, including the authorities, knows their provenance, but no one
says and does anything to prevent the massacre of wild animals.
Lastly, the cue itself, or the attitude the source is holding towards the content, can
be multiple. Often an attribution relation is signalled by several strategies, e.g.
According to what John suggests, “the market is not ready yet”; however, they do
not interfere, as they all convey that the content is a statement or a belief or
another kind of attribution. When cues instead represent two separate attitudes a
source holds towards the same content, as in example (72), this could be
considered a multiple cue. In (71), instead, both verb cues refer to linguistic
productions and could be grouped together. Multiple cues are not very common;
more frequently the presence of two different cues does not reflect a different
attitude but an evaluation of the content which the writer expresses, suggesting in a
way the key to interpreting the utterance, as in (73), where a directly reported
speech act is bound to a cue labelling it as an opinion.
(71) … <<domani questa stessa gente é pronta a scendere in piazza per
rivendicare>> dicono e scrivono in molti. (ISST els063)
…<<tomorrow the same people are ready to take to the streets to claim>>
many say and write.
(72) The men can defeat immunities that states often assert in court by showing
that officials knew or should have known |that design of the structure was
defective| and |that they failed to make reasonable changes|. (PDTB 1160)
(73) “The journalists shouldn’t morbidly write about people’s sorrow” thinks
Mary.
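The cases discussed in this section suggest that a record for an attribution relation should allow each of its three elements to occur more than once. A minimal sketch follows; the class `AttributionRelation` and its field names are hypothetical and are not the annotation schema itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributionRelation:
    """Hypothetical record in which each constitutive element may be multiple."""
    sources: List[str] = field(default_factory=list)
    cues: List[str] = field(default_factory=list)
    contents: List[str] = field(default_factory=list)

    def multiple_elements(self) -> List[str]:
        """Names of the elements that occur more than once in this relation."""
        counts = {"source": len(self.sources), "cue": len(self.cues),
                  "content": len(self.contents)}
        return [name for name, count in counts.items() if count > 1]


# (69): one implicit source, one cue, two contents.
ex69 = AttributionRelation(["Ø"], ["Ho detto"],
                           ["che ero dalla sua parte",
                            "che ritenevo giusta la sua protesta"])
# (70): two sources presented as separate entities.
ex70 = AttributionRelation(["Tutti", "le autorità"], ["conoscono"],
                           ["la loro provenienza"])
# (71): two verb cues, both referring to linguistic productions.
ex71 = AttributionRelation(["molti"], ["dicono", "scrivono"],
                           ["domani questa stessa gente é pronta a scendere "
                            "in piazza per rivendicare"])

for example in (ex69, ex70, ex71):
    print(example.multiple_elements())
# ['content']
# ['source']
# ['cue']
```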
3.2.4 Co-reference Resolution
Since source and content are often recalled by a pronoun or a coreferential
element, co-reference resolution becomes a fundamental issue when dealing with
attribution relations. Manual annotation could simply mark the coreferential
text span; the automatic capturing of the phenomenon, however, would require
the resolution of anaphora and co-reference relations. Research in this area is
progressing, but a tool able to resolve the kind of co-references involved in
attribution is still lacking.
Co-reference regarding the source is usually either bridging, e.g. El
Sayed….l’arabo…/El Sayed…the Arabian… (ISST els001), or pronominal
anaphora. Source anaphora often involves pronouns (74) recalling full nouns or
NPs, but also in Italian Ø subjects (75). The coreferential source or content is
presented in the examples below in small capitals.
(74) Secondo il governo di Pechino, le accuse in base alle quali due
diplomatici cinesi sono stati espulsi la settimana scorsa dagli Stati Uniti,
sono una montatura. LO ha detto ieri un portavoce del ministero degli
Esteri, IL QUALE ha anche annunciato che il governo cinese ha protestato
con quello degli Stati Uniti e che si riserva il diritto di ulteriori reazioni. (ISST
els075)
According to the Beijing government, the charges on the basis of which two
Chinese diplomats were expelled last week from the United States are
a frame-up. IT was said yesterday by a spokesman of the Foreign Ministry,
WHO has also announced that the Chinese government has complained to
that of the United States and that it reserves the right to
further reactions.
(75) Probabilmente Vialli non ha dimenticato le voci sulla sua presunta vita
allegra durante i Mondiali del 1990 rivelate su Italia1 da Maurizio Mosca. E
Ø non crede che la recente alleanza tra Juventus e Milan possa cambiare
molto il comportamento dei commentatori sulle emittenti di Berlusconi.
(ISST cs043)
Probably Vialli has not forgotten the rumours about his presumed ‘happy
life’ during the 1990 World Cup revealed on Italia1 by Maurizio Mosca. And
(HE) doesn’t believe that the recent alliance Juventus-Milan could really
change the commentators’ behaviour on Berlusconi’s televisions.
The content is instead usually formed by clauses or sentences recalled by a
pronoun (74), but also by a noun of which it represents an elaboration (see 3.1.2),
as in (76), where ‘words’ refers back to the whole direct quotation. Example (74)
contains three attribution relations, two of which involve co-reference. The first
sentence, itself an attribution, is in fact attributed to ‘a spokesman of the Foreign
Ministry’ by recalling it with the personal pronoun ‘it’. The first attribution is nested
in the second not, as usual, by being inside its content span (it is in fact in a
separate sentence), but because of the event anaphora relating it to the content of
the second one. The source, ‘a spokesman of the Foreign Ministry’, is afterwards
recalled by the relative pronoun ‘who’ and becomes part of the last attribution
relation.
(76) “…L’umanità deve proclamare uno storico sciopero ad oltranza fino alla
distruzione di tutti gli armamenti nucleari.” LE PAROLE registrate di Gheddafi,
…(ISST cs039)
“…The world should proclaim a non-stop strike till the destruction of all
nuclear armaments.” Gheddafi’s recorded WORDS,…
While still challenging, anaphoric expressions such as pronouns have been
thoroughly investigated, and some studies are also analysing event co-reference,
which is closely related to studies about temporality and time references. The
co-references included in attribution relations partly overlap with both research
areas, as the source falls in the first group, i.e. bridging and pronominal anaphora,
while the content is partly of interest to the second one, i.e. event anaphora.
The resolution of co-reference is crucial in order to allow retrieving the
specific provenance of information as pronouns alone do not carry information
about reliability, expertise or bias of the source. Similarly, it is necessary to be able
to establish a relation between the source and what it has actually said, thought,
dreamt of, etc... In a sentence like ‘John has an idea’, linking ‘John’ to ‘idea’ is not
informative and would be of no use unless we can retrieve what John’s idea was.
As attribution is part of a bidirectional relation, not only linking linguistic
material to the entity expressing it but also entities to what they express,
co-reference also needs to point in both directions. Only a co-reference tool able
to account for this bidirectionality would make it possible, in example (76), once the
material in quotes has been retrieved, to realise that this is coreferential with an NP
which is part of an attribution relation, from which it should therefore inherit the
source. On the other hand, if the task is retrieving Gheddafi’s declarations, ‘words’
as such, although attributed to him, is not what he said, and it should be possible to
retrieve the coreferential quotation it stands for.
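The two lookups just described for example (76) can be sketched with a co-reference index that is readable in both directions. This is only an illustration under simplifying assumptions: text spans are represented as plain strings, and the names `CorefLinks`, `source_of` and `statements_of` are hypothetical rather than part of any existing tool.

```python
from typing import Dict, List, Optional


class CorefLinks:
    """Co-reference links stored in both directions."""

    def __init__(self) -> None:
        self.antecedent_of: Dict[str, str] = {}
        self.mentions_of: Dict[str, List[str]] = {}

    def link(self, mention: str, antecedent: str) -> None:
        self.antecedent_of[mention] = antecedent
        self.mentions_of.setdefault(antecedent, []).append(mention)


def source_of(span: str, attributed_to: Dict[str, str],
              links: CorefLinks) -> Optional[str]:
    """Source of a span, inherited from a coreferential mention if needed."""
    if span in attributed_to:
        return attributed_to[span]
    for mention in links.mentions_of.get(span, []):
        if mention in attributed_to:
            return attributed_to[mention]
    return None


def statements_of(source: str, attributed_to: Dict[str, str],
                  links: CorefLinks) -> List[str]:
    """Contents attributed to a source, resolved to the spans they stand for."""
    return [links.antecedent_of.get(content, content)
            for content, src in attributed_to.items() if src == source]


# Example (76): 'LE PAROLE' is attributed to Gheddafi and recalls the quotation.
quotation = ("…L’umanità deve proclamare uno storico sciopero ad oltranza "
             "fino alla distruzione di tutti gli armamenti nucleari.")
links = CorefLinks()
links.link("LE PAROLE", quotation)
attributed_to = {"LE PAROLE": "Gheddafi"}

print(source_of(quotation, attributed_to, links))                  # Gheddafi
print(statements_of("Gheddafi", attributed_to, links) == [quotation])  # True
```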
3.2.5 Scope Definition
The main challenge for the annotation of discourse phenomena, and annotation in
general, is reaching a scope definition precise enough not to undermine any
attempt to reach satisfactory inter-annotator agreement scores. As far as attribution
is concerned, it is important to define what to include in the cue and over which
text span the attribution relation holds.
The content is not always as easily detectable as when it is delimited by
quotes. Sometimes it is expressed by a pronoun or full noun recalling it, as
discussed in (3.2.4); other times, due to the ambiguity of language, it is not clear
what exactly the attributed span is and what is possibly just additional material. In
the case of multiple contents, for example, the presence of a conjunction (77) is not
sufficient to assume the second span is also a content, as it often represents some
additional information or even a comment the writer expresses. To be sure that the
second text span is also attributed, the subordinator ‘that’ should also be present.
(77) The president said that the economy is on the verge of a severe crisis AND
|he is going to meet the ministers to talk about possible solutions|.
Concerning the source, this is often a noun phrase; however, attributes,
appositives (78) or relative clauses need to be considered, as they might be
necessary to the characterisation of the entity they refer to. Other times this
material constitutes a colourful description (79) which does not help in identifying
the source referent and would just make the annotation less neat and manageable.
(78) Per il presidente dei deputati progressisti, Luigi Berlinguer, la
maggioranza <<ha fatto una proposta di natura consociativa che abbiamo
rifiutato…>> (ISST sole013)
For the president of the progressive deputies, Luigi Berlinguer, the
majority <<has made an associative proposal that we have
refused…>>
(79) “… Poi stasera torno a Zagabria”, grida Kasim Zdionica, un signore con
una pancia enorme, le ciabatte di gomma e un pugnale infilato nella
cintura. (ISST cs030)
“… Besides, this evening I’ll go back to Zagreb”, shouts Kasim Zdionica, a
man with a huge belly, rubber slippers and a dagger tucked into his
belt.
The span to be included in the cue itself is also sometimes unclear. Although the
verb, noun, preposition, etc., functioning as the textual anchor of the attribution
relation is not difficult to recognise, there might be supplementary information
necessary to the characterisation of the context in which the relation takes place,
such as a temporal specification, a reference to the situation or entity (80) the
content refers to, and so forth.
(80) PARLANDO DI VERGA, Pirandello scriveva: i siciliani, quasi tutti, hanno
un’istintiva paura della vita. (ISST els034)
WHILE TALKING ABOUT VERGA, Pirandello wrote: Sicilians, almost all of them,
have an instinctive fear of life.
Deciding what to include in the annotation and what to leave out must take into
account not only the relevance of each element to the interpretation of the
content, but also the difficulty its inclusion could cause, making the task of
the annotators more complicated and uncertain.
Suggestions concerning how to deal with this issue are reported in (6.1).
3.3 Summary
In this chapter attribution has been analysed in order to highlight its constitutive
elements and some problematic characteristics it possesses which are of
particular interest for the annotation. Attribution relations can be considered as
being composed of three constitutive elements: the content, representing the
attributed material; the source, which is the entity the content is related to; and the
cue, the textual anchor linking source and content together.
Each of these constitutive elements can be expressed by a number of
linguistic structures, which makes it more difficult to describe the phenomenon
in a way that would allow, e.g., its automatic recognition. Sources can be expressed by
proper nouns or pronouns but also be left implicit. The content span can range
from a single word to the entire discourse. The cue is usually a verb, but can also
be a noun, an adjective, an adverb, a preposition or prepositional group, a
graphical device (i.e. punctuation) or even a grammatical one (i.e. quotative
conditional).
To this complex scenario it is also necessary to add some features and
problematic issues that need to be considered when developing the annotation
schema. First of all, attribution relations can recursively nest into each other.
Moreover, it can happen that a level of nesting is not made explicit and the relative
source is added as a ‘source of the source’ representing the means through which
a more embedded source expresses the content. Other times a speech event is
presented from the hearer’s perspective, therefore leading to a change in the roles
with the perception of the information being attributed to the hearer and its source
being expressed as a specification of its provenance.
Another issue involves the occurrence of multiple sources, contents and
even cues as part of the same attribution relation. Furthermore, attribution is
heavily intertwined with co-reference and the understanding of attribution relations
is subsequent to the resolution of anaphora and co-references. A last challenge is
the definition of the scope: the text span to include in each of the
three components of attribution needs to be defined so that elements important for
the interpretation of the content or the identification of the source are not left out,
but without making the annotation too complex or too arbitrary, which would
decrease the inter-annotator agreement.
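As a purely illustrative aid, the three constitutive elements and the nesting summarised above can be pictured as a small data structure. The following Python sketch is not part of the annotation schema: the names `AttributionRelation`, `find_span` and `depth`, and the choice of character offsets as spans, are assumptions made only for the example. The sentence is the English example given in the abstract; whether ‘that’ belongs to the content span is deliberately left open, and the outer relation without source or cue simply stands for the text as a whole.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass
class AttributionRelation:
    """One attribution relation (hypothetical representation)."""
    content: Span                      # the attributed material
    source: Optional[Span] = None      # None when the source is left implicit
    cue: Optional[Span] = None         # textual anchor linking source and content
    parent: Optional["AttributionRelation"] = None  # relation this one is nested in

    def depth(self) -> int:
        """Number of relations this one is nested in."""
        level, node = 0, self.parent
        while node is not None:
            level, node = level + 1, node.parent
        return level


def find_span(text: str, fragment: str) -> Span:
    """Return the span of the first occurrence of fragment in text."""
    start = text.index(fragment)
    return (start, start + len(fragment))


text = "The minister says that taxes will rise in 2010"
article = AttributionRelation(content=(0, len(text)))
relation = AttributionRelation(
    content=find_span(text, "taxes will rise in 2010"),
    source=find_span(text, "The minister"),
    cue=find_span(text, "says"),
    parent=article,
)
```

Storing spans rather than strings keeps each element tied to its position in the text, which matters when the same word occurs more than once.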
4 Features to Include in the Annotation
The annotation of an attribution relation basically requires marking the link between
source and content. However, additional features could also be included in the
annotation which would provide useful information about the nature and veracity of
this relation. These features, or attributes, have been derived from the PDTB
annotation scheme for attribution (Prasad et al., 2007). As presented in (2.4.3), the
scheme includes the attributes of ‘type’, ‘source’, ‘determinacy’ and ‘scopal
polarity’.
After analysing the phenomenon of attribution, however, the PDTB scheme
had to be partially modified and adapted in order to suit the present project. In the
following chapters each feature that has been included in the annotation will be
presented and the values it can assume discussed with the help of examples from
the ISST corpus.
4.1 Type
The feature ‘type’, marking the type of attribution, has been included in the
annotation schema employed for the pilot without any changes from the PDTB.
The type, which is anchored to the cue, determines the kind of attitude the
speaker holds towards the content of the relation. As in the PDTB scheme
(Prasad, Miltsakaki et al., 2008), it can assume four values: ‘assertion’, ‘fact’, ‘belief’
and ‘eventuality’. The distinction seems quite viable, especially if compared to the
more fine-grained categorisation adopted by Wiebe (2002; Wiebe et al., 2005) for
the annotation of speech events and private states: ‘assertions’ (writing or
speaking), ‘opinions’, ‘beliefs’, ‘thoughts’, ‘feelings’, ‘emotions’, ‘goals’,
‘evaluations’ and ‘judgements’. However, some issues arose that would suggest a
revision of this classification before applying it to the whole corpus.
4.1.1 Assertion
Assertions are conveyed by verbs of communication, e.g. ‘dire’ (to say), ‘affermare’
(to claim), ‘riferire’ (to relate), ‘spiegare’ (81) (to explain), and suggest that the
attribution content has been verbally expressed, in writing (82) or speaking (81).
(81) Ha spiegato Sciandri dopo l’arrivo: “Ho imparato dagli errori del passato,
quando spesso esitavo troppo prima di partire…” (ISST cs082)
Sciandri explained after the arrival: “I’ve learnt from past mistakes, as when
I was hesitating too much before starting”.
(82) L’obiettivo, dice sempre il comunicato dell’Olp, <<è quello di assicurare
una gestione trasparente e altamente professionale delle risorse
palestinesi>>. (ISST sole023)
The goal, says the PLO release, <<is that of guaranteeing a transparent
and highly professional management of the Palestinian resources>>.
4.1.2 Belief
Beliefs are associated with verbs expressing a mental attitude, such as ‘pensare’
(to think), ‘credere’ (83) (to believe), ‘immaginare’ (to imagine). The content in this
case reflects a mental orientation more than conveying an event and it also
expresses a slightly lower level of factuality: while the content of assertions is
presented in a factual way, beliefs bind the content to a point of view (83), an
opinion without pretence of being generally valid.
(83) Ø credo che vivesse nella villa dei Pietroiusti anche d’inverno. (ISST re118)
I think that she was living in Pietroiusti’s villa also in winter.
4.1.3 Fact
Facts are the attributions of the reception of a speech act or of the knowledge of
a piece of information whose truth is not questioned. Cues in this category include verbs
of perception, e.g. ‘sentire’ (84) (to hear), ‘vedere’ (84) (to see), and verbs
expressing knowledge such as ‘sapere’ (to know), ‘ricordare’ (85) (to recall),
‘rimpiangere’ (to regret).
(84) Ø abbiamo visto e sentito, assieme, un’antica ira e uno stato di grazia.
(ISST re011)
(We) have seen and heard, together, an ancient anger and a
state of grace.
(85) Era di ottimo umore, ricorda Francesco. (ISST els077)
She was in a very good mood, recalls Francesco.
4.1.4 Eventuality
Eventuality conveys instead an intention the source holds towards the content.
This group is quite heterogeneous and includes, under the label of ‘control verbs’,
these three classes (Sag and Pollard, 1991:65): verbs of the order/ permit type,
with the source trying to influence another agent to perform what is in the content,
e.g. ‘ordinare’ (to order), ‘consentire’ (to allow), ‘proibire’ (86) (to forbid); verbs of
promise, e.g. ‘promettere’ (87) (to promise), ‘accettare’ (to accept), ‘accordarsi’ (to
agree), expressing the commitment of the source towards performing a certain
action; and verbs of the want/ expect type, e.g. ‘desiderare’ (to desire), ‘sperare’
(88) (to hope), ‘volere’ (to want), expressing a mental orientation of the source.
(86) E le autorità di Zagabria hanno proibito ai giornalisti di andare a Petrinja e
nelle altre località appena riconquistate. (ISST cs030)
And Zagreb authorities have forbidden journalists to go to Petrinja and the
other just reconquered places.
(87) Il governo di Zagabria smentisce seccamente e promette di “punire i
responsabili” se venissero portate delle prove del fatto. (ISST cs031)
The Zagreb government sharply denies and promises to “punish the
responsible people” if evidence of the deed were provided.
(88) Gli operatori del mercato fisico sperano che la chiusura americana segni
la fine dell’esplosivo rialzo delle quotazioni. (ISST sole150)
The listed exchange operators hope that the American close could mark
the end of the explosive price rise of the quotations.
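The prototypical cue verbs listed in 4.1.1 to 4.1.4 lend themselves to a simple lookup table. The Python sketch below is only an illustration: the names `AttributionType`, `CUE_LEXICON` and `type_of_cue` are hypothetical, the lexicon contains only the verbs cited above, and a lookup by lemma is a deliberate simplification that ignores the context in which the verb occurs.

```python
from enum import Enum
from typing import Optional


class AttributionType(Enum):
    """The four values of the 'type' feature."""
    ASSERTION = "assertion"
    BELIEF = "belief"
    FACT = "fact"
    EVENTUALITY = "eventuality"


# Lemma -> type, restricted to the cue verbs cited in sections 4.1.1-4.1.4.
CUE_LEXICON = {
    **dict.fromkeys(
        ["dire", "affermare", "riferire", "spiegare"],
        AttributionType.ASSERTION),
    **dict.fromkeys(
        ["pensare", "credere", "immaginare"],
        AttributionType.BELIEF),
    **dict.fromkeys(
        ["sentire", "vedere", "sapere", "ricordare", "rimpiangere"],
        AttributionType.FACT),
    **dict.fromkeys(
        ["ordinare", "consentire", "proibire", "promettere", "accettare",
         "accordarsi", "desiderare", "sperare", "volere"],
        AttributionType.EVENTUALITY),
}


def type_of_cue(lemma: str) -> Optional[AttributionType]:
    """Return the type associated with a verb lemma, or None if unlisted."""
    return CUE_LEXICON.get(lemma.lower())
```

Returning `None` for unlisted verbs, rather than guessing a default, leaves the decision to the annotator.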
4.1.5 Issues Concerning Type Definition
The definition of the ‘type’ feature presents some problems. First of all, it refers
only to verbal cues, while the textual anchor signalling an attribution relation can
be expressed by different means as listed in (3.1.3), e.g. prepositions, nouns,
punctuation. Punctuation is employed to report direct speech and can therefore be
interpreted as indicating an ‘assertion-type’ attribution. Nouns are often deverbal,
e.g. ‘suggerire’ > ‘suggerimento’ (suggestion), ‘permettere’ > ‘permesso’
(permission), ‘comunicare’ > ‘comunicato’ (82) (release), and generally easily
referable to the verb they implicitly involve, e.g. ‘pensiero/ idea’ (thought/ idea) >
‘pensare’ (to think), ‘parola’ (word) > ‘dire/ scrivere’ (to say/ write). Prepositions
instead do not explicitly specify the type of attitude the source holds towards the
content. However, it could be argued that they express an opinion, a point of view
((89), (90)), although derived from an assertion.
(89) …secondo indiscrezioni avrebbe sostenuto davanti agli investigatori che
non intendeva fare nulla di male e che per lui si è trattato di un “gioco”.
(ISST cs004)
…according to indiscretions he would have told the examining magistrates
that he didn’t intend doing anything bad and that for him it was just a
“game”.
(90) Secondo il giornale gli Stati Uniti sperano di siglare un <<memorandum di
intesa>> sul programma <<Sdi>> con Italia, Israele e Giappone entro la fine
del 1986. (ISST els015)
According to the newspaper the United States hope to sign a
<<memorandum of understanding>> concerning the <<Sdi>> program with
Italy, Israel and Japan by the end of 1986.
All types of attribution, however, presuppose some kind of assertion allowing the
entity reporting the attribution relation to acquire the information. In the example
(91) the content represents the thought of some people; this does not
mean, however, that it was acquired through mind-reading techniques. It is implicit that it is
possible to learn about opinions if they are expressed, usually through assertions,
but also using other means of communication, e.g. facial expressions. This bond
connecting assertions and beliefs is even more striking with self-attributions
as in (92). A speaker or writer wanting to express a personal belief has to assert it.
The source in (92) believes the assertion expressed by the content but at the
same time is saying it. Similarly, wills, intentions, orders, etc., also more or less
directly presuppose an assertion.
(91) “…C’è gente che pensa siamo professionisti super pagati e invece la
situazione è molto diversa.” (ISST cs077)
“…There are people who think that we are super paid professionals, whereas
the situation is very different.”
(92) “…Ø credo anche che forse convenga parlarsi tra le parti prima di spedire
lettere”. (ISST re012)
“…(I) also believe that maybe it would be appropriate for the parties to talk
to each other before sending letters”.
On the other hand, assertions quite often reflect what the source is thinking, as in the
two attributions in (93). In the example, what the sources say, in quotes, is also an
expression of their opinion. Less common are attributions like (94) where the
assertion itself is what matters and the content is not an expression of the source’s
thought but just the sequence of words ‘she’ pronounced; that is, the attention is
on the cue rather than on the content. The verbal cue in (93), ‘dicono’ (say), could have
been substituted by the entity reporting these two attributions with ‘pensano’
(think).
(93) “S’é pentita d’aver rotto il silenzio” dicono alcuni. “L’hanno costretta”,
dicono gli altri. (ISST period005)
“She regretted having broken the silence” say some. “She’s been forced”,
say the others.
(94) …Shana, meglio ricordata per la pubblicità dove Ø dice: “Toglietemi tutto
ma non il mio Breil”… (ISST re028)
…Shana, better remembered for the commercial in which she says: “Take
everything away from me but my Breil”…
Another issue is determining the type when different types of cue co-exist. This is
different from multiple cues (95) (3.2.3), which should be analysed as separate
attribution relations. Relatively often a direct quotation occurs combined with a
verbal cue other than assertion, sometimes providing an interpretation of the
content (3.2.3). In the example (96) the quotes suggest that the content
corresponds to reported direct speech, therefore an assertion, while the cue
‘promise’ refers to an eventuality, of the kind expressing a commitment.
(95) The men can defeat immunities that states often assert in court by showing
that officials knew or should have known |that design of the structure was
defective| and |that they failed to make reasonable changes|. (PDTB 1160)
(96) "Vi daremo le statistiche alla fine", promettono i generali croati. (ISST
cs030)
“We’ll give you the statistics at the end”, promise the Croatian generals.
The strategy adopted here for these cases of composite cues of different types is
to give priority to the punctuation. A direct quote is surely the most reliable of the
attributions as the content is reported without any mediation. Moreover, the
assertion precedes the attitude expressed by the other cue as this was derived
from the semantics of the content. In (96) what the ‘Croatian generals’ said was
perceived as a promise, at least by the journalist reporting the information. By
establishing the predominance of punctuation, these instances are classified
as ‘assertions’. Consequently manner verbs, with implicit general reportive verbs,
functioning as cues in combination with quotes (3.1.2), e.g. ‘sorridere’ (97) (to
smile/ to say while smiling), will also be classified as ‘assertions’, avoiding possible
confusion.
(97) Arlacchi sorride: “Pura paranoia politica. Non ho partecipato ai lavori solo a
causa di un impegno privato…”. (ISST re095)
Arlacchi smiles: “Pure political paranoia. I didn’t participate in the works
only because of a private appointment…” .
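The priority rule for composite cues can be stated in a few lines. This is a minimal Python sketch under the assumption that the presence of a direct quote and the type suggested by the verbal cue (if any) have already been determined; the function name `resolve_type` and its parameters are hypothetical.

```python
from typing import Optional


def resolve_type(has_direct_quote: bool,
                 verbal_cue_type: Optional[str]) -> Optional[str]:
    """Resolve the 'type' of an attribution with a composite cue.

    Punctuation takes priority: whenever the content is a direct quote,
    the relation is classified as an assertion, whatever the verbal cue
    (e.g. 'promettere', or a manner verb such as 'sorridere') suggests.
    """
    if has_direct_quote:
        return "assertion"
    return verbal_cue_type
```

For (96), `resolve_type(True, "eventuality")` yields `"assertion"`; for (97), where the manner verb carries no type of its own, `resolve_type(True, None)` yields the same value.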
A last issue is determined by the semantics of the verb cues. On the one hand,
the myriad of attribution verbs cannot always be unquestionably assigned to one
of the four possible types. While verbs like ‘dire’ (to say), ‘pensare’ (to think),
‘sapere’ (to know), ‘volere’ (to want), are quite prototypical and central to their
category, other verbs such as ‘criticare’ (to criticise), ‘avvertire’ (to warn), ‘leggere’
(to read), ‘elogiare’ (to praise), ‘suggerire’ (to suggest), are more peripheral and
uncertain. On the other hand, a conspicuous number of verbs are polysemous and
can belong to one or the other type according to which of their meanings is
in use. This can only be determined by the context as in ((98), (99)). The same
verb cue ‘sostenere’ assumes in (98) an assertive function, corresponding to
‘claim’, while in (99) it expresses a commitment, meaning ‘support’, which
represents an ‘eventuality’.
(98) Il governo di Zagabria, invece, sostiene che sono “solo” 100 mila le
persone in cammino. (ISST cs031)
The Zagreb government claims instead that the people on the move are
“only” 100 thousand.
(99) Ma ieri sera I parlamentari serbi hanno “sostenuto senza riserve” la
decisione di Karadzic. (ISST cs034)
However yesterday evening the Serbian parliamentarians have
“supported wholeheartedly” Karadzic’s decision.
The issues presented in this chapter partly arose from the pilot annotation, partly
from previous considerations and from the attempt to list and classify attribution
cues (6.3). Although the ‘type’ classification has been adopted unchanged for the
pilot in this study, the problems it raises strongly suggest testing its feasibility by
evaluating the inter-annotator agreement it yields and, if necessary, introducing
some changes.
4.2 Source
The source is one of the key components of the attribution relation and as such it
is marked in the annotation. It can occupy any position, i.e. before, around, after or
in between, with respect to its content and can be expressed by a number of
elements (3.1.2). All the variation in their linguistic realisation aside, the entities the
sources refer to can be very different and this deeply affects the content, hence
the need of retrieving this relation. The annotation could therefore mark a basic
distinction of source types which would facilitate evaluating their reliability or
relevance. The source type has been included in the annotation schema and can
assume the same values as in the PDTB. These are: ‘writer’, ‘other’, and
‘arbitrary’.
Aikhenvald (2004:64) distinguishes between QUOTATIVE, that is, reported
information having an overt reference to the source (‘writer’ and ‘other’ are of this
kind), and HEARSAY, referring instead to reported information without an overt
reference to those it was reported by. The source of a hearsay takes the value
‘arbitrary’ in the annotation.
4.2.1 Writer
The writer is the default source of any journalistic text, and he or she holds the
shallowest level of attribution, the content being the entire news article. Relatively
often, at least in Italian newspapers, authors are not even explicitly mentioned, or
they are recalled by just their initials. Even when they are mentioned, writers are
never part of the article body as, similarly to any other attribution relation, the
source is usually not part of the content it holds but occupies an external or
peripheral position with respect to it.
Unlike in the PDTB, where discourse connectives and their arguments are
always attributed even without an explicit attribution relation, so that most of the
attributions are to the writer, attribution to the writer will be left implicit here
in order to simplify the annotation process. The writer is external to the article intended as a discourse
unit and usually not the only external source involved. Apart from the writer of the
article, the newspaper publishing it could be considered another source and even
the website reporting it, in case of news published on the web. The attribution of
the entire news article to the writer and subsequently to the newspaper should be
easily inferable and can be added at a later stage if needed.
Nonetheless, in case the writer is directly and explicitly reporting his or her
opinion or words, the annotation should mark the writer as the source. By explicitly
mentioning himself, the writer presents information in a less factual way making
explicit that it is not shared knowledge but a personal point of view he is
presenting ( (100), (101)).
(100) È questo a mio parere il dato politico-sociale rilevante: … (ISST re085)
It is this in my opinion the relevant socio-political data: …
(101) Un arbitro corrotto caro Brera, è possibile che in tanti anni di calcio non sia
venuto fuori il nome di un arbitro corrotto? Io non ci credo. (ISST els027)
A corrupted referee dear Brera, is it possible that in many years of football
no name of a corrupted referee has come up? I don’t believe it.
4.2.2 Arbitrary
All those sources which do not really attribute the content to a specific entity,
or to an entity having a real referent in the world, should be marked as ‘arbitrary’. In this
category fall impersonal sources such as ‘si’ (102)/ ‘uno’ (one), personal and
indefinite pronouns used as impersonals, e.g. ‘tu’ (you), ‘qualcuno’ (someone) (103),
‘nessuno’ (no one), relative pronouns, e.g. ‘chi’ (who) (104), and missing sources,
as with verbal moods having no explicit subject, e.g. ‘infinito’ (infinitive), ‘gerundio’
(gerund), and passive constructions (3.1.2) with omitted agent.
(102) Spesso in questi casi si dice la mobilitazione popolare é più importante di
mille altre ricerche. (ISST els032)
Often in these cases one says that the popular intervention is more
important than thousands of other investigations.
(103) Qualcuno pensa che questo sia un quartiere privilegiato. (ISST cs092)
Someone thinks that this is a privileged district.
(104) C’è chi sostiene che stiamo vivendo il ritmo giusto di un mercato azionario
come quello italiano, considerato la sua dimensione e le sue strutture.
(ISST els055)
There are those who claim that we are experiencing the right pace for a share market like
the Italian one, considering its size and its structures.
Personal plural pronouns, i.e. ‘noi’ (we), ‘voi’ (you),
‘loro’ (they), or indefinite pronouns, e.g. ‘tutti’ (everyone), ‘molti’ (many) (105), but
also collective nouns, such as ‘la gente’ (the people) (106), referring to an
indistinct plurality, can also be used as ‘arbitrary’ sources. Especially with plural
impersonals the effect achieved is often
that of attributing the content to everyone, as if this was some kind of general truth,
the expression of common sense or general knowledge (107).
(105) … <<domani questa stessa gente é pronta a scendere in piazza per
rivendicare>> dicono e scrivono in molti. (ISST els063)
…<<tomorrow the same people are ready to take to the streets to claim>>
many say and write.
(106) “…C’è gente che pensa siamo professionisti super pagati e invece la
situazione è molto diversa.” (ISST cs077)
“…There are people who think that we are super paid professionals, whereas
the situation is very different.”
(107) Tutti gli esseri umani sanno di poter essere più di ciò che sono. (ISST
cs012)
Every human being knows they can be more than what they are.
Indefinite pronouns however are not ‘arbitrary’ when their referent is restricted by a
specification (108) or they assume an adjectival role as in (109).
(108) Ma, con una reazione molto comune in casi del genere, nessuna delle
vittime ha pensato ... (ISST els072)
However, with a reaction very common in cases of this kind, none of the
victims thought …
(109) La decisione di convocarla fu presa domenica 23 dicembre, dopo che
alcuni ministri affermarono: <<sentiremo l’opinione dei familiari e
decideremo>>. (ISST els048)
The decision to convene it was made on Sunday, 23 December, after
some ministers affirmed: <<we will listen to the relatives’ opinion and
decide>>.
Another group of arbitrary sources is formed by those nouns referring to
‘containers’ or means of information, such as ‘voci’ (voices/rumours) (110),
‘resoconto’ (report), ‘indiscrezione’ (indiscretion) (111), ‘proverbio’ (proverb), etc.,
when the entity producing them is not expressed.
(110) In Italia si è fermi ai progetti e alle intenzioni, nonostante le voci che da
anni pronosticano l’avvento di Warner o Paramount nella gestione di sale.
(ISST sole036)
In Italy we are still at projects and intentions, despite the rumours that have
for years predicted the arrival of Warner or Paramount in the management of
movie theatres.
(111) Secondo indiscrezioni la prima segnalazione è stata inviata alla Procura
della Repubblica. (ISST cs015)
According to indiscretions the first report has been sent to the Public
Prosecutor’s office.
‘Arbitrary’ is a very informative attribute which allows distinguishing between
attributions to a real referent, which are labelled as ‘other’, and attributions whose
source is not really clear. Having this data marked, it is possible to decide whether
to include these attributions when considering the content or just leave them out
as if they were just a device the above source is employing to distance itself
from the content and from the responsibility deriving from being its direct source.
If information with a traceable source is sought, contents having an
‘arbitrary’ referent could be automatically discarded as they do not meet this
requirement and cannot be verified. On the other hand, when looking for general
truths, rumours about forecasts, moods concerning an event and so forth,
‘arbitrary’ sources are particularly relevant.
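The filtering described above could be sketched as follows. The Python code is illustrative only: the record layout `Attribution` and the helper names `traceable` and `untraceable` are assumptions, and the sample records simply restate the sources and contents of examples (103), (111) and (112).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attribution:
    """A simplified annotated attribution (hypothetical record layout)."""
    source: str
    source_type: str  # 'writer', 'other' or 'arbitrary'
    content: str


def traceable(attributions: List[Attribution]) -> List[Attribution]:
    """Keep only attributions whose source is not 'arbitrary'."""
    return [a for a in attributions if a.source_type != "arbitrary"]


def untraceable(attributions: List[Attribution]) -> List[Attribution]:
    """Keep only 'arbitrary' attributions, e.g. when looking for rumours."""
    return [a for a in attributions if a.source_type == "arbitrary"]


annotated = [
    Attribution("Qualcuno", "arbitrary",
                "questo sia un quartiere privilegiato"),
    Attribution("indiscrezioni", "arbitrary",
                "la prima segnalazione è stata inviata alla Procura della Repubblica"),
    Attribution("Kasim Zdionica", "other",
                "Poi stasera torno a Zagabria"),
]
```

With these records, `traceable(annotated)` retains only the attribution to ‘Kasim Zdionica’, while `untraceable(annotated)` retains the other two.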
4.2.3 Other
The value ‘other’ is associated with those sources which refer to a specific entity.
This is often a proper noun of a person, e.g. ‘Kasim Zdionica’ (112), ‘Angela
Merkel’, or an organisation, e.g. ‘the Parliament’, ‘The Times’, etc. The specific
referent can also be mentioned somewhere else in the article and recalled
(bridging anaphora) by a general noun or pronoun in the attribution relation as in
(113).
The borderline between ‘arbitrary’ and ‘other’, however, is far from being
sharp. Sources can be more or less generic and detectable. Common nouns
sometimes refer to an entity whose identity can be more or less easily
reconstructed, such as ‘the president’ when talking about a specific company, ‘the
judge’ referring to a precise trial, ‘Angelina Jolie’s husband’, and so forth, but other
times the term is too generic to really allow identifying its referent.
(112) “… Poi stasera torno a Zagabria”, grida Kasim Zdionica, un signore con
una pancia enorme, le ciabatte di gomma e un pugnale infilato nella
cintura. (ISST cs030)
“… Besides, this evening I’ll go back to Zagreb”, shouts Kasim Zdionica, a
man with a huge belly, plastic slippers and a dagger inserted in the
belt.
(113) La Fermenta, a sentire l'arabo, è organizzata in modo che oggi consegue
un utile pari al 35 per cento del fatturato. Questo il vero traguardo che dovrà
nel tempo raggiungere la Pierrel. Ma come? Con tagli di mano d'opera?
Nemmeno per sogno, dice El Sayed. (ISST els001)
Fermenta, according to the Arabian, is organised so that it earns at present
a profit of 35 per cent of the turnover. This is the real goal that in the long
distance Pierrel will have to achieve. But how? Cutting down on workforce?
No way, says El Sayed.
Although included in the present study as ‘other’, common nouns such as
‘residente’ (resident), ‘passante’ (passer-by), ‘donna/ signora’ (woman) (114),
‘esperti’ (experts), whilst referring to a specific referent in the real world, represent
general terms which do not allow any identification or characterisation of the
source. In the example (115) the journalist is trying to give a characterisation to
the unknown referent he is quoting by adding a detail about the way she was
dressed, i.e. ‘in grey’, as if this would make the lady recognisable. However, these
sources are not to be confused with ‘arbitrary’. ‘A lady in grey’, unless the writer is
lying, is not a generic entity, a plurality, a hearsay, but a specific human being in
the real world, like the man in (112).
It is desirable to provide the final annotation with an additional distinction
that can account for this type of source, introducing an additional value for the
‘source’ feature such as ‘common’.
(114) Una donna afferma di aver assistito all’uccisione a sangue freddo del
marito. (ISST re084)
A woman claims she has witnessed the cold-blooded killing of her husband.
(115) <<Voto no>> diceva una signora in grigio <<tanto c’è già chi ha deciso
per noi>>. (ISST els048)
<<I vote no>> a lady in grey was saying <<anyway there is already someone who
has decided for us>>.
4.3 Factuality
Hunter et al. (2006) remark that many reportative verbs have, in addition to their
intensional use, an evidential use, such as the one making ‘B’ in (116) an
appropriate answer to ‘A’. They argue that theories of discourse interpretation
should account for these different uses. According to their analysis, the intensional
use is conceptually primary and the evidential use derives from it. They therefore
introduce two discourse relations in order to account for the different
interpretations: an evidence relation and an attribution relation. While evidence is a
subordinating relation veridical in both arguments, with attribution it is the embedded
clause that is subordinate to the main claim, and the relation is non-veridical with respect to
the right argument.
(116) A: Why is John absent from the meeting?
B: Sharon said that he is out of town.
(Hunter et al., 2006:99)
When considering the factuality of an attribution relation, it should be clear
which of these two uses it refers to. The evidential relation is true depending on
the veracity of the evidence, which is the information in the content of the
attribution relation (116). The attribution is instead true if the actual relation source-
content via the attitude expressed by the cue is real, e.g. the assertive event in
(116) really took place. The factuality of an attribution relation, however, does not
entail the factuality of its content. Sources can in fact lie or just be wrong.
As the intensional use precedes the evidential use, the content of an
attribution relation can constitute evidence for something else only if the attribution
relation itself is factual. While it is very complex to account for the factuality of the
content, the factuality of the attribution relation can be syntactically computed
considering the cue and the source or other elements scoping over it. Some
information about the factuality of the content can be derived from the type of cue,
as suggested by Prasad, Miltsakaki et al. (2008:44). The feature ‘factuality’
accounts here for the factuality of the attribution relation only, marking
whether this relation really exists, i.e. answering the question: is this content really
presented as attributed to this source?
In their account of event factuality, Saurí and Pustejovsky (2008) distinguish
among situations presented as corresponding to real situations in the world,
situations which instead are unreal, and uncertain situations. They characterise
factuality as involving polarity and epistemic modality, which could be defined as
the commitment of a source towards the content of a proposition. Polarity takes
two values, i.e. positive and negative, while epistemic modality can assume a
range of values varying from absolutely certain to uncertain. The combination of
these two features determines a range of factuality values (Table 1).
            Positive    Negative
Certain     Fact        Counterfact
Probable    Probable    Not probable
Possible    Possible    Not certain
Table 1 - Factuality values (Saurí and Pustejovsky, 2008)
For the present annotation scheme, the factuality of the attribution relation can
assume only two values: ‘factual’ and ‘non-factual’. The first accounts for the
attribution relation being presented as a fact in the world (certain and positive).
‘Non-factual’ represents underspecified factuality and should not be confused with
counterfactual. It accounts in fact not only for attributions presented as not real,
but also includes the intermediate values expressing different degrees of
possibility and probability. Further distinctions of the ‘non-factual’ value of the
factuality attribute are left for future developments of the annotation schema.
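The collapse of the values in Table 1 into the two values adopted here can be made explicit with a short sketch. The Python function below is an illustration only; the name `attribution_factuality` and the string labels for modality and polarity are assumptions, chosen to mirror the rows and columns of Table 1.

```python
MODALITIES = ("certain", "probable", "possible")
POLARITIES = ("positive", "negative")


def attribution_factuality(modality: str, polarity: str) -> str:
    """Map a (modality, polarity) pair onto 'factual' or 'non-factual'.

    Only the certain and positive combination counts as 'factual';
    every other cell of Table 1, including the counterfactual one,
    falls under the underspecified value 'non-factual'.
    """
    if modality not in MODALITIES or polarity not in POLARITIES:
        raise ValueError(f"unknown value: {modality!r}, {polarity!r}")
    if modality == "certain" and polarity == "positive":
        return "factual"
    return "non-factual"
```

Of the six cells in Table 1, a single one maps to ‘factual’ and the remaining five to ‘non-factual’, which is why the latter should not be read as counterfactual.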
4.3.1 Factual
The factuality of the attribution relation is marked in the PDTB with the
‘determinacy’ feature. This can take only two values: ‘indet’, accounting for the
attributions presented not as factual, and ‘null’ for the factual ones. This
substantially corresponds to the present account of factuality of attribution, with
‘null’ corresponding to ‘factual’ and ‘indet’ to ‘non-factual’. The term has, however,
been changed here to ‘factuality’ as this seems to be more specific and easily
recognisable.
In news language factual attributions occur by far most frequently as
journalists tend to report facts and to present information and events as real facts,
more than just making suppositions or hypotheses. An attribution presented as
factual may nonetheless not correspond to a real event. Whether or not to believe
the attribution relation is genuine can be decided only on the basis of the above
source, i.e. the source, or sources, of the content in which the attribution relation is
nested. In the examples (117) and (118) the attributions are presented as factual by
the writer. In order to postulate about the veracity of the content it is instead
necessary to determine whether the source in (117) ‘Evtuscenko’ is mendacious
and the source in (118) ‘the public prosecutors’ really hold the attitude towards the
content or merely feign it. This could be decided with the help of the context, but
common sense and extra-linguistic knowledge also contribute to the conclusion.
(117) Evtuscenko, nel suo articolo, afferma che Pasternak gli fece pervenire una
copia del romanzo poco dopo la sua prima pubblicazione. (ISST els076)
Evtuscenko, in his article, claims that Pasternak sent him a copy of the
novel shortly after its first publication.
(118) Monreale, i pm vogliono Cassisa alla sbarra. (ISST re124)
Monreale, the public prosecutors want Cassisa before the bar.
The analysis of the content factuality as well as of the source trustworthiness is
complex as it is not inferable from the syntax and grammatical features. The
factuality of the attribution itself is instead easily determined: the source should be
an entity and the cue should not be in the scope of a negation or an element
expressing uncertainty or probability.
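The rule of thumb just stated can be written down as a small decision function. This is a minimal sketch under the assumption that the three judgements are supplied by the annotator (or by some upstream component); the function and parameter names are hypothetical. It deliberately ignores the cases discussed in 4.4, where a negation on the surface of the cue actually affects the content.

```python
def factuality(source_is_entity, cue_negated, cue_uncertain):
    """Return 'factual' or 'non-factual' for an attribution relation.

    Hypothetical boolean inputs:
    - source_is_entity: the source refers to an entity (not e.g. 'no one')
    - cue_negated: the cue is in the scope of a negation
    - cue_uncertain: the cue is in the scope of an element expressing
      uncertainty or probability
    """
    if source_is_entity and not cue_negated and not cue_uncertain:
        return "factual"
    return "non-factual"
```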
4.3.2 Non-factual
Non-factual attributions can be considered negated attributions: they
express that there is no link between source and content, or that this link is just
hypothetical. It could be argued that these instances hence do not represent
attribution relations and could be left out of the annotation. However, they
nonetheless convey relevant information and have been for this reason included in
the annotation. Non-attributions can, for example, correct false attributions or just
remark that there is no link between that particular content and a specific source
(119).
(119) John is under investigation. The police, however, haven’t said that he is
presumed guilty.
Non-factual attributions expressing possibility or probability, moreover, are very
useful when there is an interest in retrieving hypotheses or predictions and not just
facts. Modal verbs can be employed to express an attribution which is just
possible, desired, ordered or urged (120). However, this is not the case when the
source is in the first person as it then reflects more an idiomatic use as in (121).
While the source in (120) has never really asserted the content, the one in (121)
necessarily did.
(120) “…L’umanità deve proclamare uno storico sciopero ad oltranza fino alla
distruzione di tutti gli armamenti nucleari.” (ISST cs039)
“…The world should proclaim a non-stop strike till the destruction of all
nuclear armaments.”
(121) No, Ø devo dire anzi che in queste prime due settimane il mondo sindacale
è stato in attesa e mi auguro che sia possibile intessere un dialogo forte.
(ISST sole011)
No, on the contrary (I) have to say that in these first two weeks the union
world has been lying in wait and I wish that it will be possible to intertwine a
strong dialogue.
Similarly, when the cue is in the scope of a conditional or part of a hypothetical
sentence (122), the attribution should be marked as non-factual. Other structures
or contexts making an attribution non-factual are: the imperative, usually with
verbs of belief and assertion such as ‘think’, ‘imagine’, but also ‘say’ and ‘admit’;
interrogative forms, as in (123); the future tense (124), (125), as an event
happening in the future is not yet a real event and it is not certain it will ever
become one; and the infinitive used to make a conjecture as in (126).
(122) Se Ø vuoi che il fast relax sia davvero efficace tieni d’occhio l’orologio e
scegli: l’intervallo di pranzo e il ritorno a casa. (ISST period003)
If (you) want that the ‘fast relax’ is really effective keep an eye on the watch
and choose: the lunch break and the homecoming.
(123) Pensa anche lei come tanti critici che, con il suo romanzo incompiuto, lo
scrittore si trovasse a una svolta esistenziale? (ISST els034)
Do you also think like many literary critics that, with his unfinished novel,
the writer was at an existential turning-point?
(124) E Ø diranno all’ONU che il problema dei profughi non li riguarda. (ISST
re084)
And (they) will tell the UN that the refugee problem does not concern them.
(125) E naturalmente molti diranno che ha usurpato il posto in finale. (ISST
els062)
And of course many will say that it has usurped its place in the final.
(126) It is silly libel on our teachers to think they would educate our children
better if only they got a few thousand dollars a year more. (PDTB 1286)
The presence of a grammatical cue, i.e. the quotative conditional (see 3.1.3), could
also be taken as a sign of uncertainty, as in the example (127). However, although
often related to epistemic modality, and therefore involving some degree of
uncertainty, the quotative conditional is a sign of an additional level of attribution,
namely a level of nesting left implicit. In (127) what the quotative conditional
expresses is not uncertainty about the attribution. The uncertainty is a
consequence of the quotative conditional which, scoping on the cue, presents the
attribution relation as second-hand material, similarly to hearsay. Attributions
including a quotative conditional should therefore be considered factual.
(127) Manlio Averna avrebbe infatti riferito al pm che, in base agli accertamenti
finora effettuati, è molto improbabile che Castellari si sia sparato. (ISST
sole016)
Manlio Averna has told (QUOT.COND) the public prosecutor that,
according to the verifications done till now, it is very unlikely that Castellari
shot himself.
Apart from being connected with the cue, non-factual attributions are also found
when the source is negated as in (128). The attribution to no-source is not linking
the content to any entity and therefore is non-factual.
(128) Nessuno parla più di baratro imminente e di crisi finanziaria. (ISST cs025)
No one is talking anymore about imminent precipice and financial crisis.
4.4 Scopal Change
It is not always the case that an attribution cue in the scope of a negation is non-
factual. It is possible for example that a negative particle affecting a verbal cue on
the surface, reverses instead the polarity of the content. This feature is included in
the PDTB (Prasad, Miltsakaki et al., 2008:46) with the name of ‘scopal polarity’.
Annotating this feature is not essential in order to account for the attribution
relation, however, it is crucial for the interpretation of the content. The feature
takes two values: ‘scopal change’ and ‘none’. In case an attribution is factual, but
its cue is in the scope of a negation, presumably the negation is affecting the
content and not the relation itself. If it were possible to determine separately
the scope of negations and other elements, this would be preferable and the
‘scopal change’ feature would no longer be needed.
4.4.1 Scopal Polarity
Most commonly, the scopal change affects the polarity of the content. The surface
negation can be expressed syntactically (e.g. don’t say, don’t think), or lexically,
e.g. ‘negare’ (to deny), ‘escludere’ (to exclude), ‘smentire’ (to deny). Lexical
negations which are part of the verb semantics, as in the example (129) below, are
always scoping on the content of the relation. The relation between the ‘Croatian
government’ and the contents it holds is factual and it could be changed into:
‘…the Croatian government affirms that they have NOT been banned and affirms
also TO HAVE NO ethnic cleansing intention in the newly conquered areas’ or
alternatively ‘…says it is not true that…’.
(129) Qualunque sia il numero di sfollati, il governo croato nega che siano stati
espulsi e nega anche qualsiasi volontà di pulizia etnica nelle regioni appena
riconquistate. (ISST cs031)
Whatever the number of evacuees, the Croatian government denies that
they have been banned and denies also any ethnic cleansing intention in
the newly conquered areas.
In case of a double negation as in (130), containing a syntactic and a lexical
negation, the first scoping on the verb and the second on the content, the result is
again a positive, and therefore factual, attribution. ‘Not deny’ corresponds to
‘affirm’ and as the negation scoping on the verb is changing its semantics, its
reversed reading can no longer affect the polarity of the content. In these cases
the annotation should assign the feature ‘scopal change’ the value ‘none’.
(130) Ieri circa mille giovani hanno lasciato la città, ma la polizia non esclude che
possa esserci qualche altra esplosione di violenza. (ISST cs037)
Yesterday around a thousand young people left town, but the police
don’t exclude that there could be some other act of violence.
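The interplay of lexical and syntactic negation described for (129) and (130) can be summarised in a small sketch. The function name is hypothetical and the set of lexically negative cues contains only the verbs mentioned above, so it is illustrative rather than an exhaustive lexicon. When the cue is only syntactically negated, the sketch returns no value, since the choice between a non-factual attribution and a scopal change has to be made by the annotator in context.

```python
# Lexically negative cues mentioned in the text (illustrative, not exhaustive).
LEXICAL_NEGATIONS = {"negare", "escludere", "smentire"}


def scopal_change(cue_lemma, syntactic_negation):
    """Suggest a value for the 'scopal change' feature.

    Returns 'scopal change', 'none', or None when this rule of thumb does
    not decide and the annotator has to judge from the context.
    """
    lexical_negation = cue_lemma in LEXICAL_NEGATIONS
    if lexical_negation and syntactic_negation:
        return "none"           # double negation, e.g. 'non esclude' in (130)
    if lexical_negation:
        return "scopal change"  # e.g. 'nega' in (129): negation scopes on the content
    if syntactic_negation:
        return None             # e.g. 'don't say': non-factual or scopal change?
    return "none"
```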
Scopal changes do not occur with the verbs of the ‘fact’ type (131) as noted by
Kiparsky and Kiparsky (1971). They can occur however with the other types of cue
and relatively often with ‘beliefs’. Determining whether an attribution is non-factual
or there is a change in the scope of the polarity is often problematic. In the
example below (132) the attribution relation contains a ‘no-entity’ source, ‘no one’,
and should therefore be non-factual. However, ‘no one would like that to happen in
their town’ could be also rewritten as ‘everyone would like that not to happen in
their town’, involving a change in the polarity from the source to the content. Are
these sentences equivalent? Probably not. The correspondence is especially
difficult with wills or intentions: not wanting something does not exactly correspond
to wanting the opposite.
(131) Ma lui si strapazza, lavora troppo, Ø non ha capito che deve stare più
attento. (ISST cs059)
But he tires himself out, he works too much, (he) hasn’t understood that he
has to take more care of himself.
(132) Strano destino, quello di Civitavecchia: finire spesso, troppo spesso, sulle
pagine dei giornali per eventi misteriosi, oppure per fatti che nessuno
vorrebbe accadessero nella sua città. (ISST cs090)
Strange destiny, that of Civitavecchia: ending up often, too often, in the
news because of mysterious events, or because of events that no one
would like to happen in their town.
Part of the problem derives from the fact that ‘beliefs’ and some ‘eventualities’ do
not refer to events like assertions. While negating an event makes it non-factual, a
negative belief or will does not cancel the attribution relation: a negative mental
state is still a mental state.
By including non-factual attributions in the annotation, the issue of determining
the presence of a ‘scopal change’ in order to account for the veracity of the content
is less crucial. Uncertain instances, those still involving an attribution, therefore not
completely non-factual, and not exactly attributing the negation of the content,
hence also not involving a real scopal change, could be annotated according to
two strategies.
One possible solution would be that of marking them as ‘non-factual’ since
the attribution of the content does not actually take place. In this case, the
‘factuality’ attribute would be restricted to the veracity of the attribution of the
unchanged content to the source. This ‘non-factual’ attribution could still suggest,
however, that the reverse of the content, or a different content is presupposed. In
case this solution is adopted, the content of a non-factual attribution should be
more carefully considered as it could still carry useful information.
On the other hand, the opposite strategy could be adopted and the
attribution marked as ‘factual’ but involving a scopal change. If this solution is
chosen, it should be clear that the ‘scopal change’ attribute does not necessarily
imply the exact reverse of the content polarity, but just that the negation is not
really scoping over the attribution relation itself, and that the content or the
attitude the source holds are affected by it.
Since in some cases it is not possible to determine, despite the help of the
context, if the attribution relation itself is negated or just the attitude the source
holds, e.g. John doesn’t want to become president (he never expressed this
intention/ he expressed a negative intention towards becoming president), the first
strategy seems more appropriate. The annotators should be invited to decide
whether they perceive an existing attitude, positive or negative, the source holds
towards the content and in case this is not clear they should mark the attribution
as ‘non-factual’. By analysing the inter-annotator agreement, it will then be possible
to determine whether the issue of ‘scopal change’ requires further clarification.
4.4.2 Other Elements Affecting the Factuality
‘Scopal polarity’ (PDTB annotation) has been in the present annotation project
labelled as ‘scopal change’ as polarity is not the only element affecting the
factuality, and not the only one which can change in scope and affect the content
of an attribution instead of the attribution itself. Other constructions, although
uncommon, can occur. For example, the cue could be in the scope of a condition
as in (133). However, the condition in the first clause does not mean this is
required for the attribution relation to be factual, namely for the belief event in
(133) to take place. The condition affects instead the content of the attribution and
it is part of the belief: ‘If there is a majority […] the legislature could continue’.
(133) Se c’è, cioè, una maggioranza in Parlamento in grado di affrontare
seriamente una fase di riforme anche elettorali, Ø penso che la legislatura
possa utilmente proseguire. (ISST re075)
If there is a majority at the Parliament able to seriously face a phase of
reforms, also electoral, (I) think that the legislature could usefully continue.
It is possible that other elements or constructions manifest a change in scope,
although further investigations are necessary to detect which ones and how to
recognise them. This is especially difficult because of their infrequency. The
annotation could however allow detecting other changes in scope affecting the
content.
4.5 Summary
Apart from annotating the spans corresponding to the three components of the
attribution relation, i.e. ‘source’, ‘cue’, ‘content’, attributes should be included in the
annotation schema which carry relevant information affecting the relation itself or
the interpretation of its content. In this chapter, these features have been
presented and compared with the features included in the PDTB scheme, adopted
as a model for the present one.
One aspect to annotate is the ‘type’ of the cue (4.1), expressing the kind of
attitude the entity is holding towards the content: ‘assertion’, ‘fact’, ‘belief’ or
‘eventuality’. This feature provides information partially affecting the factuality of
the content and the values other features can assume, e.g. ‘facts’ do not support
any ‘scopal change’. The feature ‘type’, however, is often complex to determine as
this categorisation is partially ambiguous. Before applying it to the whole corpus,
this should be tested for inter-annotator agreement and, in case of poor score,
perfected by changing or reducing the values.
Another useful feature to be marked is the ‘source type’ (4.2). This allows a
basic distinction among: ‘writer’, ‘other’ and ‘arbitrary’. The first (4.2.1) can be
connected to information presented as the personal point of view of the writer.
‘Other’ (4.2.2) stands for a specific source corresponding to a real entity. The latter
(4.2.3), ‘arbitrary’, should be used when referring to sources without a real or
certain referent, thus labelling e.g. general knowledge, hearsays and rumours.
‘Factuality’ (4.3) makes it possible to distinguish between real attributions,
corresponding to a real event or mental attitude in the world, and hypothetical or
unreal attribution events. The annotation of this feature enables keeping these
separate, without losing the information carried by ‘non-factual’ attributions.
Lastly, a change in the scope affecting the content is also annotated and
labelled as ‘scopal change’. This usually affects the polarity of the content although
superficially it appears to involve the cue and make the attribution non-factual.
Determining when it is correct to identify a scopal change is a problematic issue.
Despite the fact that a scopal change cannot occur with cues of the type ‘fact’, this
matter needs to be addressed in context with particular attention to discerning
between negations affecting the existence of the attitude the source holds and
negations reversing instead this attitude.
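The features summarised in this section can be gathered into a single record, sketched below. The class and field names are hypothetical and do not correspond to any existing tool; the value sets are those listed above, and the only cross-feature constraint encoded is that cues of the type ‘fact’ do not support a scopal change.

```python
from dataclasses import dataclass

CUE_TYPES = {"assertion", "fact", "belief", "eventuality"}
SOURCE_TYPES = {"writer", "other", "arbitrary"}
FACTUALITY = {"factual", "non-factual"}
SCOPAL_CHANGE = {"scopal change", "none"}


@dataclass
class AttributionFeatures:
    """Feature bundle for one attribution relation (hypothetical class)."""
    cue_type: str
    source_type: str
    factuality: str
    scopal_change: str

    def __post_init__(self):
        checks = [
            ("type", self.cue_type, CUE_TYPES),
            ("source type", self.source_type, SOURCE_TYPES),
            ("factuality", self.factuality, FACTUALITY),
            ("scopal change", self.scopal_change, SCOPAL_CHANGE),
        ]
        for name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"invalid value for {name}: {value!r}")
        # Cues of the type 'fact' do not support any scopal change.
        if self.cue_type == "fact" and self.scopal_change != "none":
            raise ValueError("'fact' cues cannot carry a scopal change")
```

Restricting each feature to a closed set of values mirrors the requirement that annotators pick predefined values rather than typing them.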
5 Performing a Pilot Annotation
Developing an annotation schema goes hand in hand with testing it on the corpus
that is going to be annotated. Applying the schema to the corpus makes it possible
to assess intuitions and solutions, and thus to make more informed choices based on
the data and not only on theoretical considerations. Moreover, real language
examples on the one hand reflect real language use, and thus contain few or even no
occurrences of some possible but uncommon features; on the other hand, they
represent a repository of special cases which do not match descriptions of general
occurrences and characteristics.
Designing an annotation schema follows a path similar to that of any design
process (Figure I): (1) a preliminary stage in which objectives and requirements
are defined; (2) a phase in which the problem is analysed; (3) a planning phase in
which possible solutions are presented and a subsequent (4) testing phase with
the development of a prototype. The latter leads to the identification of viable or
unfeasible solutions and the discovery of new issues, which in turn lead to a new
planning phase; the process is iterated until a satisfactory solution is reached.
Figure I - Design Process (stages: 1 requirements, 2 analysis, 3 planning, 4 evaluation, 5 release)
In order to perform an annotation, a suitable tool is required. The selection of the
most appropriate one to employ for the pilot annotation is the result of a detailed
analysis of several available tools. To be able to make such a decision, the
characteristics they should possess so as to match the annotation schema
requirements had to be identified, thus allowing the definition of the desired tool
specifications. These represent a basis for the development of appropriate
software specially designed to perform the task of annotating attribution
relations.
In this chapter, the Italian corpus to which a layer for attribution will be
added is presented and a subsection of it is sampled to be employed in the pilot
annotation. Afterwards, several tools for performing annotation are compared in
the light of the specific requirements of the current annotation schema. Finally,
one of these tools is selected and configured before proceeding with the annotation
of a sample of the corpus, thus leading to the identification of new issues and a
partial redesign of the annotation scheme.
5.1 Corpus
The present study originates in the framework of a project aiming at the addition of
a layer for discourse to the ISST corpus. It takes, however, a different perspective,
leaving for later the analysis of discourse relations in general and concentrating
instead on attribution, which is only partially a discourse phenomenon. The ISST
corpus employed for this study is the Italian Syntactic-Semantic Treebank,
developed between 1999 and 2001 in the frame of the SI-TAL project, a
collaboration of several Italian research and university institutions with the purpose
of developing a suite of resources and tools for Natural Language Processing
applications. For the pilot annotation a subcorpus of the ISST has been selected,
as described in section 5.1.2.
5.1.1 ISST Architecture
The ISST corpus (Montemagni et al., 2003) consists of 307,682 word tokens and
was built to reflect contemporary language use. It is formed by a collection of 484
newspaper and periodical articles published between 1985 and 1995. One section
of the corpus, about two thirds, represents general language use and contains
articles about different subjects from ‘Repubblica’ (identified in the examples as
‘re’), ‘Corriere della Sera’ (cs), and other newspapers (els) and periodicals (period).
The other section of the corpus, about 90,000 tokens, is instead specialised as it
deals with the financial domain. Articles in this section are taken from a single
financial newspaper, ‘Il Sole 24 Ore’ (sole), and were all published in 1994.
The ISST has a five-level structure encoding orthographic, morpho-syntactic,
syntactic and semantic information. Only the financial section of the
corpus has been fully annotated with all five levels. The syntactic level is split into
two separate ones so as to separately account for the constituent and dependency
structures, thus providing an independent view of the same surface syntax as one
level does not presuppose the other.
The orthographic level (Figure J) contains the word tokens and information
about lower-case or capital letters and punctuation. A unique ID number is
assigned to each token.
<w id="w_001" case="cap"> Bruxelles </w>
<w id="w_002" case="low"> all' </w>
<w id="w_003" case="cap"> Italia </w>
<w id="w_004"> : </w>
<w id="w_005" case="low"> urgente </w>
<w id="w_006" case="low"> ridurre </w>
<w id="w_007" case="low"> il </w>
<w id="w_008" case="low"> deficit </w>
<w id="w_009"> . </w>
Figure J - ISST orthographic level (sole002)
The morpho-syntactic annotation (Figure K) includes the mark-up of POS, lemma,
number, person, gender, etc. Multi-word expressions are analysed as a whole,
e.g. ‘in_mezzo_a’ (between/ among), while morphologically complex words, such
as cliticised verbs, are instead treated so as to account for their constituent parts,
e.g. impedendoci > impedire + ci (prevent us).
<mw id="mw_001" pos="SP" mfeats="NN" lemma="bruxelles" sfeats="NP"
href="sole.orth002#id(w_001)"> Bruxelles </mw>
<mw id="mw_002" pos="E" mfeats="FS" lemma="a" sfeats="PART"
href="sole.orth002#id(w_002)"> all' </mw>
<mw id="mw_003" pos="SP" mfeats="NN" lemma="italia" sfeats="NP"
href="sole.orth002#id(w_003)"> Italia </mw>
<mw id="mw_004" pos="PU" lemma=":" sfeats="DIRS"
href="sole.orth002#id(w_004)"> : </mw>
<mw id="mw_005" pos="A" mfeats="NS" lemma="urgente" sfeats="AG"
href="sole.orth002#id(w_005)"> urgente </mw>
<mw id="mw_006" pos="V" mfeats="F" lemma="ridurre" sfeats="VIT"
href="sole.orth002#id(w_006)"> ridurre </mw>
<mw id="mw_007" pos="RD" mfeats="MS" lemma="il" sfeats="ART"
href="sole.orth002#id(w_007)"> il </mw>
<mw id="mw_008" pos="S" mfeats="MS" lemma="deficit" sfeats="N"
href="sole.orth002#id(w_008)"> deficit </mw>
<mw id="mw_009" pos="PU" lemma="." sfeats="TIT"
href="sole.orth002#id(w_009)"> . </mw>
Figure K - ISST morpho-syntactic level (sole002)
The ISST takes a distributed approach to syntax, keeping functional annotation
and constituent structure on two separate levels, which can however be combined
if required. This strategy represents a more suitable way (Montemagni et al., 2003)
of describing languages like Italian, whose free constituent order and pro-drop
property would otherwise require the insertion of a number of empty elements,
with a consequent loss of annotation transparency.
The annotation of constituency (Figure L) produces shallow tree structures.
It was performed with a Shallow Parser and then manually revised. The functional
annotation is word-based and includes relations such as dependency, coordination
and intra-sentential co-reference.
[F3 [SN Bruxelles [SP a [SN Italia SN] SP] SN] F3] [CP [SA urgente SA] [F [SV2
ridurre SV2] [COMPT [SN il deficit SN] COMPT] F] CP]
Figure L - ISST syntactic constituent level (sole002)
Lastly, the ISST presents a lexico-semantic level of annotation, assigning
semantic tags. These convey: the sense of each word, based on the ItalWordNet
(IWT) lexical resource; special uses, e.g. idiomatic expressions, proper nouns,
neologisms, etc.; and additional comments of the annotators.
A tool has been especially developed for the task of annotating and
combining the 5 levels of annotation of the ISST: GesTALt. The tool also provides
a visual representation of the annotation, e.g. functional annotation makes use of
graphs, while constituent structure is visualised as a strip tree. This tool is
unfortunately not open-source and could not be tested or employed for the pilot
annotation in the present study. The ISST corpus is available in a number of
formats, i.e. text, XML and CoNLL.
5.1.2 Subcorpus Selection
As attribution is a very pervasive relation in journalistic language, since it is
common in newspaper articles to report opinions, statements and information other
people expressed, only a part of the corpus could be annotated for the present study.
Extending the annotation to the whole ISST represents a subsequent stage which
would require employing annotators and possibly the development of a specific
tool.
In order to test the feasibility and effectiveness of the annotation schema
that is the object of the present study, a pilot annotation was performed on a sample
of the ISST corpus. Since the financial section is the only one that already has all
five levels of annotation, the addition of a sixth level for discourse and attribution
would be better performed on this part of the corpus so as to have a complete
resource. However, in order to avoid interference deriving from the specificity of the
financial domain, the selection of articles for the pilot annotation has not been
drawn only from this part, corresponding to the articles from ‘Il Sole 24 Ore’.
The subcorpus has been designed in order to be balanced with respect to
the language contained in the ISST corpus as articles from every section are
represented. Table 2 reports the total number of articles in each section (first row)
and the number of articles from that section included in the sample (second row).
A total of 50 articles out of the 484 constituting the corpus have been annotated,
representing approximately a tenth of the ISST (roughly 30,000 tokens). The
phenomenon of attribution appeared to be well represented in this subsection,
which contains a wide range of occurrences of attribution relations.
                       Cs    Els   Period   Re    Sole
Articles in section    99    81    13       136   155
Articles in sample     10    9     2        14    15
Table 2 - N. of articles selected per section
The subcorpus was obtained from a single file (Figure M), containing the whole
corpus in table format with each line corresponding to a new token, and each tab-
separated column to a different annotation feature. The first column refers to the
article ID, the second to the sentence number and the third to the word counter in
the relative article. Following columns add information about constituency, POS,
lemma and the seventh contains the tokens.
Figure M - ISST table format
In order to reconstruct the articles, so as to have them available in the text format
the tool required, the table file was split into one file per article, containing the
word tokens only, separated by a single space. This was achieved by writing a few
lines of code in the scripting language Python. Subsequently, it was necessary to
correct some errors detected in the original file leading to an incorrect word order.
Moreover, some characters such as hyphens and angle brackets were
identified as responsible for a crash at the launch of the tool software. In this
case it was necessary to substitute the corresponding ASCII character codes for the
problematic characters.
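The splitting step can be illustrated with a short sketch. It is not the script actually used for the pilot annotation, which is not reproduced here, but a minimal reconstruction based only on the column layout described above (article ID in the first column, token in the seventh); the function and constant names are hypothetical, and the character substitution step is left out.

```python
from collections import defaultdict

ARTICLE_ID_COL = 0  # first column: article ID
TOKEN_COL = 6       # seventh column: word token


def articles_from_table(lines):
    """Group the tokens of a tab-separated table by article ID.

    `lines` is any iterable of text lines (e.g. an open file). Tokens are
    kept in file order and joined by a single space.
    """
    tokens = defaultdict(list)
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= TOKEN_COL:
            continue  # skip blank or incomplete lines
        tokens[row[ARTICLE_ID_COL]].append(row[TOKEN_COL])
    return {article: " ".join(words) for article, words in tokens.items()}
```

Each value of the returned dictionary can then be written to a text file named after its article ID.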
5.2 Tool Selection
A myriad of tools have been developed with the purpose of annotating natural
language (NL), though finding an existing tool perfectly matching the requirements
of a specific annotation project is a search which in most cases is doomed to fail.
One obstacle is the availability of the tool: due to the high costs involved in the
production of software, some tools are only available commercially. Among the many
open-source tools, developed mainly by research and university institutes and
made available for academic purposes in order to promote their use and share
resources, the great majority was developed in the frame of a specific project.
These tools do not support all the annotation requirements of another project and
their code is often difficult or impossible to change in order to adapt it to the new
task.
A last group of open-source tools supports a wider range of annotation
projects and a high level of customizability. These annotation tools were designed
not just for a specific project, but to be able to support the annotation of one or a
group of phenomena, e.g. anaphora relations, speech interactions, temporal
references, etc… However, it is unlikely that a tool generally developed for a
specific phenomenon succeeds in capturing all its possible aspects as it might take
an approach grounded in a specific theory or miss aspects which another project
wishes to consider and include in the annotation.
In the case of attribution relations, a more relevant issue has to be added to
those mentioned above, which make the identification of a suitable tool
challenging: there is no tool especially designed to support the annotation of attribution.
5.2.1 Requirements
In order to find the best matching available tool it is necessary to first define what it
should match in order to support the annotation scheme, i.e. the annotation
requirements. First of all, the tool should be able to take advantage of the other
layers of annotation already available for the ISST corpus, especially to facilitate
the annotators’ task of retrieving possible attribution relations throughout the corpus.
For this reason, the tool should be able to read in a file like the table format (Figure
M) containing information from other layers of annotation. Only the bare text
should be displayed in order to avoid confusion; however, the tool should possess
a search function capable of retrieving e.g. the lemma of a given verb that could
be associated with attribution, such as ‘say’, ‘think’ or ‘order’, or the POS of a token
in order to disambiguate between e.g. a verb and an adjective with words like
‘ordinato’ (ordered/ tidy). This would support proceeding cue by cue to annotate
attribution, a strategy also adopted by the PDTB (Prasad, Miltsakaki et al., 2008) for
the annotation of discourse connectives.
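A search of this kind could look roughly like the sketch below. It assumes, for the sake of illustration, that the input is the morpho-syntactic level shown in Figure K rather than the table format; the function name and the wrapping `<doc>` root element are hypothetical, while the `lemma`, `pos` and `href` attributes are those visible in Figure K.

```python
import xml.etree.ElementTree as ET


def find_cues(mw_xml, lemma, pos=None):
    """Return the word IDs of <mw> elements with the given lemma (and POS).

    `mw_xml` is a string of <mw> elements as in Figure K; it is wrapped in a
    hypothetical <doc> root so that it parses as a single XML document.
    """
    root = ET.fromstring(f"<doc>{mw_xml}</doc>")
    hits = []
    for mw in root.iter("mw"):
        if mw.get("lemma") == lemma and (pos is None or mw.get("pos") == pos):
            # href looks like 'sole.orth002#id(w_006)': keep the part in brackets
            href = mw.get("href", "")
            hits.append(href.partition("#id(")[2].rstrip(")"))
    return hits
```

Restricting the search by POS is what would separate, for instance, verbal from adjectival readings of a form like ‘ordinato’.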
Once a cue is identified it should be possible to select it and mark the
existence of an attribution relation at that point of the text. This should be done on
the cue as it represents the only constituent of attribution which is always
expressed and considered individually (in case of multiple cues separate relations
are annotated). The relation should require the selection of one or multiple text
spans for the content and the optional selection, as the source might be left
implicit, of one or multiple spans corresponding to the source. Each element
constituting a single source, cue or content will be from now on called “markable”
(Mueller and Strube, 2001:48). As source, cue and content (134) might be
fragmented and separated by intervening material, it should also be possible to
select discontinuous text spans as a single markable.
(134) <<La responsabilità è politica – aveva aggiunto il Procuratore capo- ed è il
potere politico che deve far funzionare i servizi>>. (ISST els046)
<<The responsibility is political– had added the Chief Prosecutor– and it is
the political power that has to make services work>>.
Moreover, overlapping text spans should also be selectable, as attribution
relations are often nested into each other (see 3.2.1). It should also be possible
to associate each selected markable with the features it possesses through the
selection of predefined values, thus speeding up the annotation process and
avoiding the spelling errors annotators could make when typing these values
manually. Finally, in case this is not done automatically when adding an
attribution relation, the tool should support linking two or more markables to
establish relations in both directions.
Lastly, concerning the output, the tool should save the annotation as
stand-off in a separate file for each article, identified by the same index as the files
containing the other levels of annotation for the same article (i.e. cs.morph001,
cs.orth001, etc…, cs.attr001). In-line annotation, which consists of adding XML tags
to the original text as in example (135), cannot represent overlapping markables
because of XML syntax, and is therefore not suitable for describing attribution
relations.
The annotation should preferably refer to the word index (136), thus establishing a
unique pointer to each token in the corpus, corresponding to each line in the table
format (Figure M), and not to the byte, since e.g. white spaces and multi-words could
cause a mismatch between the bytes in the original files and those
the tool refers to. Although possible, transforming the byte reference into a word
index reference can introduce additional errors and is best avoided.
(135) <content>“In città non abbiamo uno scippo”</content>, <cue>ha
dichiarato</cue> <source>il sindaco</source>. (ISST re040)
“In town we do not have a single bag-snatching”, declared the mayor.
(136) <markable id="1" span="token_001…token_008" role="content">
<markable id="1" span="token_010…token_011" role="cue">
<markable id="1" span="token_012…token_013" role="source">
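The difference between in-line and stand-off annotation can be sketched in a few lines of Python. The tokens and the three spans are those of examples (135) and (136); the representation of a span as a pair of inclusive word indexes and the helper names are assumptions made for this sketch only.

```python
# Tokens of example (135), numbered from 1 as in example (136).
TOKENS = ["“", "In", "città", "non", "abbiamo", "uno", "scippo", "”", ",",
          "ha", "dichiarato", "il", "sindaco", "."]

# Stand-off markables: (first, last) word indexes, both inclusive.
MARKABLES = {
    "content": (1, 8),
    "cue": (10, 11),
    "source": (12, 13),
}


def span_text(tokens, span):
    """Return the text covered by an inclusive, 1-based word index span."""
    first, last = span
    return " ".join(tokens[first - 1:last])


def overlaps(a, b):
    """True when two inclusive spans share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    for role, span in MARKABLES.items():
        print(role, "->", span_text(TOKENS, span))
```

Because each markable only points to word indexes, a further markable covering, say, tokens 7 to 11 (a purely hypothetical span) could be stored next to the existing ones even though it crosses the boundaries of both content and cue, something that in-line XML tags cannot express.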
5.2.2 Comparison of Available Tools
In order to select the most appropriate software for the pilot
annotation of attribution relations, the features of different available tools were
compared in the light of the requirements listed above (5.2.1). Only open-source
tools were taken into consideration. The analysis that follows is not intended
to provide a full account of every tool described, but just to highlight positive and
negative aspects with respect to the present annotation project. The most
promising tools were tested by setting up a sample annotation schema and
annotating a single file, whereas tools which appeared to be incompatible
with the most important requirements were soon dismissed and not further
investigated, together with those tools that could potentially meet these
requirements but in practice required complex modifications to their code.
A selection of possibly suitable tools was drawn from surveys available
on the internet, such as David Lee’s Corpus-based Linguistics LINKS
(http://personal.cityu.edu.hk/~davidlee/devotedtocorpora/CBLLinks.htm), and
from the tools adopted by similar annotation projects.
Since there is no tool specifically developed for the annotation of attribution,
general annotation tools or tools for the annotation of anaphora or discourse,
phenomena relatively similar or overlapping with attribution and therefore also
likely to require a similar description, have been considered. A brief analysis of the
main tools taken into account is reported below.
GATE
GATE (Cunningham et al., 2002), General Architecture for Text Engineering, is a
very complete architecture (freely available to download from http://gate.ac.uk/)
for developing language processing software. The tool supports a
variety of formats, such as XML, RTF, HTML and plain text, although only the latter
was easily accepted and used for the sample annotation. A set of NLP resources
is provided with the tool, including a POS tagger, a semantic tagger and a
coreferencer. Setting up an annotation schema was a relatively easy task which could
be performed in a few minutes.
Figure N - GATE annotation environment
The tool supports nested annotation, as the same portion of text can be selected
several times; however, the annotation of discontinuous spans is not possible, as
non-adjacent spans cannot be included in the same markable. It also does not
seem possible to establish relations between markables. Moreover, the
annotation itself is quite problematic, as selecting text spans, deleting or
modifying them, and adding features are not intuitive operations. GATE, which
was used for example for the annotation of the MPQA Opinion Corpus (Wiebe et
al., 2005), also includes a query tool. The annotation is stored in XML format with
reference to the byte, as in the example below (Figure O).
<?xml version='1.0' encoding='windows-1252'?>
<GateDocument>
<!-- The document's features-->
<GateDocumentFeatures>
  <Feature>
    <Name className="java.lang.String">gate.SourceURL</Name>
    <Value className="java.lang.String">file:/C:/Documents%20and%20Settings/Prova1.txt</Value>
  </Feature>
  <Feature>
    <Name className="java.lang.String">MimeType</Name>
    <Value className="java.lang.String">text/plain</Value>
  </Feature>
  <Feature>
    <Name className="java.lang.String">docNewLineType</Name>
    <Value className="java.lang.String">CRLF</Value>
  </Feature>
</GateDocumentFeatures>
<!-- The document content area with serialized nodes -->
<AnnotationSet>
  <Annotation Id="8" Type="Source" StartNode="2736" EndNode="2746">
    <Feature>
      <Name className="java.lang.String">Type</Name>
      <Value className="java.lang.String">Arbitrary</Value>
    </Feature>
  </Annotation>
  <Annotation Id="9" Type="cue" StartNode="2747" EndNode="2771">
    <Feature>
      <Name className="java.lang.String">Factuality</Name>
      <Value className="java.lang.String">Non-factual</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">Scopal change</Name>
      <Value className="java.lang.String">None</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">Type</Name>
      <Value className="java.lang.String">Fact</Value>
    </Feature>
  </Annotation>
  <Annotation Id="10" Type="content" StartNode="2772" EndNode="2864">
  </Annotation>
</AnnotationSet>
<!-- Named annotation set -->
<AnnotationSet Name="Original markups">
  <Annotation Id="0" Type="paragraph" StartNode="0" EndNode="2887">
  </Annotation>
</AnnotationSet>
</GateDocument>
Figure O - GATE annotation exported in XML
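As an illustration of how such an export could be read back, the following Python sketch extracts type, offsets and features from an abridged copy of Figure O using only the standard library. The function name and the dictionary layout are assumptions of this sketch and are not part of GATE.

```python
import xml.etree.ElementTree as ET

# Abridged copy of Figure O (XML declaration and document features omitted).
GATE_XML = """<GateDocument>
<AnnotationSet>
  <Annotation Id="8" Type="Source" StartNode="2736" EndNode="2746">
    <Feature>
      <Name className="java.lang.String">Type</Name>
      <Value className="java.lang.String">Arbitrary</Value>
    </Feature>
  </Annotation>
  <Annotation Id="9" Type="cue" StartNode="2747" EndNode="2771">
    <Feature>
      <Name className="java.lang.String">Factuality</Name>
      <Value className="java.lang.String">Non-factual</Value>
    </Feature>
    <Feature>
      <Name className="java.lang.String">Type</Name>
      <Value className="java.lang.String">Fact</Value>
    </Feature>
  </Annotation>
  <Annotation Id="10" Type="content" StartNode="2772" EndNode="2864">
  </Annotation>
</AnnotationSet>
</GateDocument>"""


def read_gate_annotations(xml_text):
    """Return one dictionary per <Annotation> element, in document order."""
    root = ET.fromstring(xml_text)
    annotations = []
    for node in root.iter("Annotation"):
        features = {
            feature.findtext("Name"): feature.findtext("Value")
            for feature in node.findall("Feature")
        }
        annotations.append({
            "id": int(node.get("Id")),
            "type": node.get("Type"),
            "start": int(node.get("StartNode")),   # offset into the text
            "end": int(node.get("EndNode")),
            "features": features,
        })
    return annotations


if __name__ == "__main__":
    for annotation in read_gate_annotations(GATE_XML):
        print(annotation)
```

The three annotations come back as independent records: nothing in the file links source, cue and content to each other, which reflects the missing relation support noted above.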
Knowtator
Meant to serve a wide range of annotation purposes, Knowtator (Ogren, 2006) is a
plug-in for the knowledge representation system Protégé (both freely downloadable
from http://knowtator.sourceforge.net/) which allows the definition of annotation
schemas. Setting up an annotation schema is not particularly complicated; however,
it is not possible to set pre-defined values for the attributes to choose from, but only
one default element. This means that values have to be typed in manually by the
annotators, which represents an additional difficulty and a consequent chance for
errors.
The tool, however, supports establishing a relation between markables as
well as multiple selections. A multiple slot for instances of source or cue could,
for example, be inserted in the cue class as in Figure P (right-hand side, in the middle).
This shows a sample annotation project consisting of a single file and a single
attribution relation. The file containing the annotation is presented in Figure Q.
Nested and discontinuous selections are also supported. A search function,
however, is not available. Another drawback is that, although relatively easy to set up,
the tool is quite complicated to use and requires some training, as markable
selection and the addition of features rely on icon buttons that are not very
user-friendly.
Figure P - Knowtator annotation environment
A collection of texts can be defined for a project. These should be plain text;
however, XML and database table formats should also be supported. The tool
provides stand-off annotation with reference to the byte (Figure Q). The output is
relatively redundant, as every annotation and feature is saved as a separate
annotation instance with explicit mention of the annotator and creation date.
<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="01.txt">
  <annotation>
    <mention id="Attributionprova_Instance_20000" />
    <annotator id="Attributionprova_Instance_6">Pareti, Edinburgh University</annotator>
    <span start="439" end="494" />
    <spannedText>Il presidente della Banca Centrale, Jean-Claude Trichet</spannedText>
    <creationDate>Sun Aug 16 18:24:30 CEST 2009</creationDate>
  </annotation>
  <annotation>
    <mention id="Attributionprova_Instance_20003" />
    <annotator id="Attributionprova_Instance_6">Pareti, Edinburgh University</annotator>
    <span start="496" end="506" />
    <spannedText>ha parlato</spannedText>
    <creationDate>Sun Aug 16 18:24:49 CEST 2009</creationDate>
  </annotation>
  <annotation>
    <mention id="Attributionprova_Instance_20007" />
    <annotator id="Attributionprova_Instance_6">Pareti, Edinburgh University</annotator>
    <span start="511" end="544" />
    <spannedText>grave rallentamento dell’economia</spannedText>
    <creationDate>Sun Aug 16 18:25:24 CEST 2009</creationDate>
  </annotation>
  <classMention id="Attributionprova_Instance_20007">
    <mentionClass id="Content">Content</mentionClass>
  </classMention>
  <classMention id="Attributionprova_Instance_20000">
    <mentionClass id="Source">Source</mentionClass>
  </classMention>
  <classMention id="Attributionprova_Instance_20003">
    <mentionClass id="Cue">Cue</mentionClass>
    <hasSlotMention id="Attributionprova_Instance_20009" />
    <hasSlotMention id="Attributionprova_Instance_20010" />
  </classMention>
  <stringSlotMention id="Attributionprova_Instance_20009">
    <mentionSlot id="type" />
    <stringSlotMentionValue value="Assertion" />
  </stringSlotMention>
  <complexSlotMention id="Attributionprova_Instance_20010">
    <mentionSlot id="Attribution_source" />
    <complexSlotMentionValue value="Attributionprova_Instance_20000" />
    <complexSlotMentionValue value="Attributionprova_Instance_20007" />
  </complexSlotMention>
</annotations>
Figure Q - Knowtator annotation exported in XML
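The redundancy of this output, and the way the relation is encoded, can be seen by reassembling the attribution from an abridged copy of Figure Q. The Python sketch below follows the identifiers from the cue's slot to the source and content mentions; the function name and the returned dictionary are assumptions of this sketch, not Knowtator functionality.

```python
import xml.etree.ElementTree as ET

# Abridged copy of Figure Q (annotator, span and creation date omitted).
KNOWTATOR_XML = """<annotations textSource="01.txt">
  <annotation>
    <mention id="Attributionprova_Instance_20000" />
    <spannedText>Il presidente della Banca Centrale, Jean-Claude Trichet</spannedText>
  </annotation>
  <annotation>
    <mention id="Attributionprova_Instance_20003" />
    <spannedText>ha parlato</spannedText>
  </annotation>
  <annotation>
    <mention id="Attributionprova_Instance_20007" />
    <spannedText>grave rallentamento dell’economia</spannedText>
  </annotation>
  <classMention id="Attributionprova_Instance_20007">
    <mentionClass id="Content">Content</mentionClass>
  </classMention>
  <classMention id="Attributionprova_Instance_20000">
    <mentionClass id="Source">Source</mentionClass>
  </classMention>
  <classMention id="Attributionprova_Instance_20003">
    <mentionClass id="Cue">Cue</mentionClass>
    <hasSlotMention id="Attributionprova_Instance_20010" />
  </classMention>
  <complexSlotMention id="Attributionprova_Instance_20010">
    <mentionSlot id="Attribution_source" />
    <complexSlotMentionValue value="Attributionprova_Instance_20000" />
    <complexSlotMentionValue value="Attributionprova_Instance_20007" />
  </complexSlotMention>
</annotations>"""


def read_relations(xml_text):
    """Rebuild each attribution as a {class name: spanned text} dictionary."""
    root = ET.fromstring(xml_text)
    text_of = {a.find("mention").get("id"): a.findtext("spannedText")
               for a in root.findall("annotation")}
    class_of, owner_of_slot = {}, {}
    for mention in root.findall("classMention"):
        class_of[mention.get("id")] = mention.findtext("mentionClass")
        for slot in mention.findall("hasSlotMention"):
            owner_of_slot[slot.get("id")] = mention.get("id")

    relations = []
    for slot in root.findall("complexSlotMention"):
        owner = owner_of_slot[slot.get("id")]          # the cue mention
        relation = {class_of[owner]: text_of[owner]}
        for value in slot.findall("complexSlotMentionValue"):
            target = value.get("value")                # source or content
            relation[class_of[target]] = text_of[target]
        relations.append(relation)
    return relations


if __name__ == "__main__":
    print(read_relations(KNOWTATOR_XML))
```

Three separate lookups (annotation, class mention, slot mention) are needed to recover a single relation, which is the redundancy noted in the text.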
Callisto
The annotation tool Callisto (open-source, available from http://callisto.mitre.org/)
was adopted for part of the annotation of temporal relations in the course of
developing the ITB, Italian TimeBank (Caselli et al., 2008), on a portion of the ISST
corpus. The tool has a very neat and basic interface, which nonetheless allows
user preferences to be set. Overlapping text spans can be selected, as well as single
characters, by changing the annotation from ‘word’ to ‘character swiping’.
Some annotation schemas, e.g. for POS or coreference, are already available; to
create a new ‘task’, it is necessary to define a DTD. However, this option
does not seem to work easily, and therefore the tool was not set up for the annotation
of attribution on a sample article. This was in any case not necessary, since the tool
does not meet some important requirements: it does not seem possible to select
discontinuous spans as a single markable or to establish relations between
markables. Callisto annotation is saved as stand-off with reference to the byte;
however, conversion into word index reference is supported.
MMAX2
Written in Java, MMAX2 (Mueller and Strube, 2006) is a general purpose tool
(available open-source from http://mmax2.sourceforge.net/) with a special focus
on the annotation of anaphoric/ coreferential expressions, word sense
disambiguation and POS tagging. Starting a project requires some time as the
annotation schema has to be externally specified prior to launching the program.
Nonetheless MMAX2 is a very flexible instrument that allows personalising the
display of the annotation tool using XSL Style Sheets.
The tool requires text input files; however, XML support is under
development and should be available shortly. The tool can be set up so as to guide
the annotation by presenting default and pre-defined values to choose from for the
attributes. MMAX2 allows the selection of overlapping and discontinuous text
spans, as well as the possibility to link markables together using relations.
The stand-off annotation provided by the tool points to the word index and
not to the byte, unlike most other tools. Every markable level is saved in a different
XML file, in which each markable is associated with an ID, the pointer to the text span
and any other feature or relation associated with it. The result is a very compact
and easy-to-read annotation. The tool was employed, among others, for the
annotation of anaphora and deixis in the VENEX corpus (Poesio et al., 2009).
Annotator
Annotator is the tool developed specifically for the annotation of discourse
connectives and their arguments in the PDTB (Prasad, Miltsakaki et al., 2008). The
tool supports the annotation of attribution on ‘raw text’ files according to the
schema adopted by the PDTB (see 2.4.3). The interface is very user-friendly and
guides the annotation by listing the possible values from which to select and by
employing constraints. Unfortunately, however, the tool could not be adapted to the
present annotation schema, as Annotator does not support the setting of different
markables or features. The tool could be adapted by changing the source code;
this, however, was not available, and modifying it would in any case be a
time-consuming task similar to writing a completely new annotation software.
Annotator was not designed to account for attribution relations not occurring
in correspondence with the discourse connective structure. In addition, nested
attributions are not contemplated and it is not possible to specify the role of each
of the three elements constituting an attribution (i.e. source, cue, content) and
establish relations between them. The tool produces stand-off annotation with
reference to the byte. Even though the tool represents a good example of how an
annotation tool for attribution could be designed and implemented, since it
was not possible to adapt Annotator to the annotation schema developed in this
study, it could not be considered a possible candidate for the pilot annotation.
Other tools
Among other tools, NITE and EXMARaLDA were also briefly taken into
consideration. NITE XML Toolkit (open-source at:
http://sourceforge.net/projects/nite/files/) is a very powerful instrument aimed at
software developers which allows building specialised annotation schemas and
interfaces for a wide range of purposes. It is especially intended to support
multimedia language data and it has been employed in a number of meeting and
dialogue corpora. NITE, however, is quite complex to set up and a sample
annotation project could not be developed to test it.
EXMARaLDA (Schmidt, 2001), Extensible Mark-up Language for Discourse
Annotation (available from: http://exmaralda.org/), is a system of Java-based tools
with XML data formats especially designed for the annotation and assisted
transcription of spoken language. The tool is not suitable for the annotation of
attribution relations, as it is not meant to relate markables and it does not support
the definition of an annotation schema such as the one required for attribution.
5.2.3 Selection and Tool Specifics
Concerning the requirements specified in (5.2.1), priority was given first to those
features enabling the selection of the text spans involved in attribution, i.e.
discontinuous and nested markables, together with the possibility of establishing
relations among cue, source and content markables, including the possibility of
having more than one source and/or content per relation. Subsequently, the tool’s
customizability and user-friendliness were also considered, with particular
attention to the possibility of setting guided choices for the markable features. Part
of this second group of requirements was also the tool’s annotation format, ideally
neat and compact stand-off XML annotation with reference to the word index.
Other aspects, such as the possibility of querying the corpus with reference to
other levels of annotation in order to retrieve possible cues, or the support of input
data in a format other than text, were temporarily left aside.
Of the tools considered above (5.2.2), only two, Knowtator and MMAX2,
appeared to meet the first set of requirements and were therefore more closely
compared to check other relevant characteristics.
Supported features            Knowtator          MMAX2
Discontinuous text selection  Yes                Yes
Nested selection              Yes                Yes
Relations                     Yes                Yes
Multiple sources/contents     Yes                Yes

Pre-defined values selection  No (one default)   Yes (menus)
Display customizability       Yes (partial)      Yes (complete)
Ease of setting a scheme      Simple (internal)  Medium (external)
Ease of annotation            Medium             Simple
XML stand-off output          Yes                Yes
Reference to word index       No (byte)          Yes
Table 3 - Knowtator/ MMAX2 feature comparison
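The two-stage selection can be restated as a small Python sketch. The feature values are those reported in 5.2.2 and Table 3 (for GATE and Callisto, relations are recorded as unsupported because they did not seem possible); the dictionary keys, the function names and the simple counting used to rank the shortlist are assumptions of this sketch, since the comparison in the text is qualitative.

```python
# First group of requirements, as reported in 5.2.2 and Table 3.
TOOLS = {
    "GATE":      {"discontinuous": False, "nested": True, "relations": False},
    "Callisto":  {"discontinuous": False, "nested": True, "relations": False},
    "Knowtator": {"discontinuous": True, "nested": True, "relations": True,
                  "predefined_values": False, "word_index": False},
    "MMAX2":     {"discontinuous": True, "nested": True, "relations": True,
                  "predefined_values": True, "word_index": True},
}

PRIORITY = ("discontinuous", "nested", "relations")
SECONDARY = ("predefined_values", "word_index")


def shortlist(tools):
    """Keep only the tools meeting every priority requirement."""
    return [name for name, features in tools.items()
            if all(features[key] for key in PRIORITY)]


def rank(tools, names):
    """Order shortlisted tools by the number of secondary features met."""
    return sorted(names,
                  key=lambda name: sum(tools[name][key] for key in SECONDARY),
                  reverse=True)


if __name__ == "__main__":
    candidates = shortlist(TOOLS)
    print(candidates, "->", rank(TOOLS, candidates))
```

Only two secondary features are modelled here; aspects such as ease of set-up, which favour Knowtator, are left out of this simplification.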
Knowtator and MMAX2 differ in some aspects concerning the second group of
requirements, listed in the lower half of Table 3. Knowtator is certainly
easier to set up, as the annotation schema and customization can be defined
internally through the interface. MMAX2 instead requires editing external XML and
XSL files both to set the annotation schema and to customize the interface and
display of the annotation.
On the other hand, MMAX2 can be more extensively personalised and the annotation
scheme better specified, so as to have pre-defined values to select from, thus
facilitating the annotation by reducing the annotators’ cognitive load. The
annotation itself, i.e. selection, deletion and extension of a text span, is also easier.
Lastly (last row in Table 3), MMAX2 saves the annotation as stand-off with
reference to the word index, whereas Knowtator refers to the byte.
Considering all their characteristics, the higher set-up costs of MMAX2
seem to be well compensated by a subsequently more structured annotation and a
more flexible interface. This, together with the possibility of anchoring the markables’
spans to the original text through references to the word indexes, made this tool
prevail as the most suitable for the present purpose of annotating attribution
relations according to the proposed schema.
5.3 Setting MMAX2
Installing MMAX2 is easy, though it requires a current Java version installed on the
machine to run. Once the program is launched, it is possible to start a project
using the Project Wizard shown in Figure R. In this window, the ‘raw text’ input file
that will be used for the annotation has to be selected. This file is then analysed
by the program and tokenised. The articles from the ISST had already been
tokenised and corrected; it was therefore not necessary to tokenise them again, since
in this case the ‘Input file is one token per line’ box can be ticked.
Next, at least one markable level has to be specified for the
annotation, optionally with some display preferences related to it. The last section
in the window (Figure R) contains the paths where the different project
components are stored and allows selecting a name for the project and the stored
input file.
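What the wizard does with a one-token-per-line input can be approximated as follows. This is a sketch of the mapping only, not MMAX2 code: the function name is invented, and the XML declaration and DOCTYPE line of the actual Base Data files are left out.

```python
from xml.sax.saxutils import escape


def to_base_data(token_lines):
    """Turn one-token-per-line input into a <words> element.

    Each non-empty line becomes a <word> element whose id is a
    progressive word index starting from 1.
    """
    tokens = [line.strip() for line in token_lines if line.strip()]
    rows = [
        f'<word id="word_{index}">{escape(token)}</word>'
        for index, token in enumerate(tokens, start=1)
    ]
    return "<words>\n" + "\n".join(rows) + "\n</words>"


if __name__ == "__main__":
    # First tokens of ISST article cs001.
    print(to_base_data(["LONDRA", ".", "Gas", "dalla", "statua"]))
```

Since the word index is assigned purely by position, any later change to the tokenisation would shift the indexes, which is why the articles have to be tokenised and corrected before the annotation starts.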
Figure R - MMAX2 Project Wizard
Each MMAX2 annotation project has five different components (and a
common_paths file specifying where these are stored):
-the Base Data, that is the data on which the annotation is performed. For the
present project this consists of an XML file for each article, derived
from the ‘raw text’ files provided to the tool as input. The file has one
token per line, to which a progressive word index is assigned
(Figure S).
<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE words SYSTEM "words.dtd">
<words>
<word id="word_1">LONDRA</word>
<word id="word_2">.</word>
<word id="word_3">Gas</word>
<word id="word_4">dalla</word>
<word id="word_5">statua</word>
<word id="word_6">Evacuata</word>
<word id="word_7">la</word>
<word id="word_8">Tate</word>
<word id="word_9">Gallery</word>
<word id="word_10">.</word>
<word id="word_n">…</word>
</words>
Figure S - MMAX2 Base Data (ISST cs001)
-the Scheme, an XML file for each markable level containing the annotation
schema. This file specifies markable attributes and relations.
Attributes represent descriptive information, while relations account
for structural or associative information. Relations in MMAX2 can
be of two kinds: ‘markable-set’, undirected relations between two or
more markables, and ‘markable-pointer’, a directed relation from
one markable to one or more target markables. Attributes can be
simple FREETEXT, thus accepting any string as their value,
NOMINAL_LIST, a pre-defined closed set of possible values
presented as a drop-down menu, or NOMINAL_BUTTON, similar
to the former, but with the values presented as a sequence of
radio buttons. In the Scheme file it is not only possible to set the
type of attributes, with their pre-defined values, and relations, but
also to determine a hierarchy of attributes. Dependencies can be
expressed by adding a ‘next’ value to an attribute specifying, in
case this one is selected, which other attribute or set of attributes to
enable.
-the Style, an XSL file which defines the display. Here the way the text and the
annotation are presented can be modified, for example, by adding
handles to the markables, inserting empty lines or structuring
dialogue turns.
-the Customization file (XML), containing a description of how each markable
should be visualised, i.e. foreground and background colour, size
and font aspect, according to its attributes and relations. A
markable that has not yet been assigned attribute values could be
associated e.g. with a different background colour so as to be
easily spotted as requiring the completion of the annotation.
- the Markable directory, containing the annotation in XML format. The annotation
of each article is stored in a separate file, as it was for the Base
Data, while Scheme, Style, Customization are common to the
entire project. This file represents the stand-off annotation and lists
all the markables for the specific level of annotation. Markables are
assigned a unique ID and a reference span, pointing to the original
text, stored in the Base Data, by pointing to the word index (e.g.
span="word_62..word_78"). For each markable, attribute values
and relations are specified.
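The link between the Markable files and the Base Data can be sketched as a small span resolver in Python. The range notation is the one quoted above (span="word_62..word_78"); the comma used here to join the fragments of a discontinuous markable is an assumption of this sketch, as are the function names.

```python
import re

_WORD_ID = re.compile(r"word_(\d+)$")


def _index(word_id):
    """Extract the numeric word index from an id such as 'word_62'."""
    match = _WORD_ID.match(word_id)
    if match is None:
        raise ValueError(f"unexpected word id: {word_id!r}")
    return int(match.group(1))


def parse_span(span):
    """Expand a span string into the list of word indexes it covers.

    Fragments are assumed to be comma-separated; each fragment is either
    a single word id or a 'first..last' range, both ends inclusive.
    """
    indexes = []
    for fragment in span.split(","):
        first, _, last = fragment.partition("..")
        start = _index(first)
        end = _index(last) if last else start
        indexes.extend(range(start, end + 1))
    return indexes


def resolve(span, words):
    """Return the tokens of a span, given a {word index: token} mapping."""
    return [words[index] for index in parse_span(span)]


if __name__ == "__main__":
    # Two tokens taken from the Base Data in Figure S.
    print(resolve("word_8..word_9", {8: "Tate", 9: "Gallery"}))
```

Resolving a markable therefore needs nothing more than the word identifiers, with no byte arithmetic on the original text.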
In a preliminary stage, cue, content and source were defined as three separate
markable levels. This allows keeping the three components completely distinct
during the annotation process and makes it possible to select immediately the role
of each markable when the text span is selected, as shown in Figure T on the right
hand side.
Figure T - The annotation of cue, content and source as separate levels
On the other hand, however, this results in the annotation of each article being
stored in three separate files, one for each markable level. As having the annotation
in one single file guarantees easier access to it and requires less storage space, the
Scheme was changed and the components of attribution were subsequently
annotated on the same markable level as different attributes.
The pilot annotation project consists of an ‘input’ directory containing the 50
articles, tokenised and in XML format, i.e. the Base Data, and the Markable
directory containing 50 corresponding files where the annotation ‘output’ is stored.
Scheme, Customization and Style are instead common and had to be written only
once. For each article a ‘.mmax’ file is also produced; this contains the reference
to the corresponding Base Data file, and it is this file that needs to be loaded
when opening the relevant annotation project with the tool.
5.3.1 Scheme
The Scheme, included in Appendix 1, is the most interesting component of
MMAX2 as it describes the annotation schema and the way it is presented during
the annotation. After selecting a markable on the text, it is possible to assign
attributes to it through the annotation window. This is initially displayed as in Figure
U, where just the role of the markable in the attribution relation can be selected,
‘none’ being the default value and all the possible values being displayed as radio
buttons. Only when the relevant role has been selected, other features are
activated.
The ‘type’ feature is clearly related to the cue, together with the
‘factuality’. The ‘source type’ is an attribute of the source; however, implicit sources
occur frequently, in which case no source markable is available.
In order not to lose the information about the ‘source type’, this attribute can be made
available when selecting the cue. As the cue is the textual anchor of the attribution
relation, it is never missing and can therefore carry information about ‘weaker’
elements.
As far as the ‘scopal change’ is concerned, since the change in scope
usually involves reversing the polarity or factuality of the content, this attribute could
have been associated with the content. Considering, however, that the element
changing scope is usually included in the ‘cue’ span, e.g. the negation of the
attribution verb, and that it would be easy to forget this relatively infrequent attribute
if it were separate from the other ones, the ‘scopal change’ has also been made
available for selection through the cue.
Figure U - MMAX2 Annotation window
When a markable is defined as the cue, the annotation panel also shows all the
features connected to it (Figure V). The type attribute has by default the value
‘none’. When ‘assertion’, ‘belief’ or ‘eventuality’ is selected, the ‘scopal_change’
feature is also made available. As a change in scope cannot occur with ‘facts’,
this feature is disabled for that type in order to facilitate the annotation.
Figure V - MMAX2 Annotation window (attributes)
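The dependencies between attributes described above can be modelled, independently of the tool, as a table of ‘next’ links. The Python sketch below is not the Scheme file itself (which is reported in Appendix 1); the attribute and value identifiers are illustrative spellings and the function is hypothetical.

```python
# Selecting a given value of an attribute enables further attributes.
NEXT = {
    ("role", "cue"): ["type", "factuality", "source_type"],
    ("type", "assertion"): ["scopal_change"],
    ("type", "belief"): ["scopal_change"],
    ("type", "eventuality"): ["scopal_change"],
    # ("type", "fact") has no entry: 'scopal_change' stays disabled.
}


def enabled_attributes(values):
    """Return the attributes active for the given attribute values.

    'role' is always available; every other attribute is enabled only
    when a selected value points to it through a 'next' link.
    """
    enabled = ["role"]
    queue = ["role"]
    while queue:
        attribute = queue.pop(0)
        for follower in NEXT.get((attribute, values.get(attribute)), []):
            if follower not in enabled:
                enabled.append(follower)
                queue.append(follower)
    return enabled


if __name__ == "__main__":
    print(enabled_attributes({"role": "cue", "type": "belief"}))
    print(enabled_attributes({"role": "cue", "type": "fact"}))
    print(enabled_attributes({"role": "content"}))
```

With this arrangement an annotator marking a content or source span sees only the role selection, while the full set of features appears once the markable is declared a cue.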
Factuality is by default ‘factual’, the ‘unmarked’ value, as this is by far the most
frequent case; ‘non-factual’ therefore has to be deliberately selected. The source
type is by default ‘writer’, since the writer is in fact the shallowest source of any
attribution relation. The annotation scheme also includes a free text slot for the
‘source_ID’ attribute. This was left blank in the pilot; however, it has been included
in the tool Scheme as it represents a highly desirable feature that the final annotation
should possess. The same source can in fact be mentioned in an article in a number
of different ways, e.g. proper name, common name, profession, pronoun, etc…
This feature is included in the Opinion Corpus (Wiebe, 2002) where it not
only provides the source with a unique ID, assigned by the annotator, but it also
accounts for embedded attributions. In this slot the annotator should list, from the
shallowest, i.e. the first source the writer is mentioning or the writer itself when
explicit, to the most embedded one, i.e. the one directly holding the content in the
attribution relation.
The ‘source ID’ slot should ideally be redundant. A coreference tool should
in the future be able to automatically and reliably relate pronouns and alternative
full nouns to the original source, just as it should be possible to do for
coreference relations involving the content. It should also be possible to derive the
additional sources to the left of an attribution by identifying text spans containing
the one corresponding to the attribution. Once an attribution relation is included in
the content of another attribution, it should inherit its source. By performing this
task starting from the ‘outside’, a nested attribution would simply inherit all the
‘external sources’ from the attribution immediately above as they would be all
already listed in its ‘source ID’.
Relations are not established in the annotation window, although they are
shown there as the last element (Figure V), but directly in the window displaying the
text: after selecting a markable and right-clicking on the markable it should be
related to, the option ‘add to markable set’ becomes available, as in Figure W.
When an element that is part of a relation is selected (Figure W below), the markables
belonging to the same relation are shown with a grey background and linked by a red
line.
The type of relation adopted for the annotation of attribution is the ‘markable
set’. This allows relating as many markables as required. As the relation is
undirected, it can be retrieved from the annotation of any markable that is part of the
set, and not only from the annotation of the markable from which the relation
originates, as with the ‘markable pointer’ relation. This was especially important, as
attribution relations are bidirectional: it is in fact necessary to trace the source from
the content, but also vice versa.
Figure W - MMAX2 Annotation of relations
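Why an undirected set suits attribution can be shown with a short Python sketch. The markable identifiers, the ‘set’ key and the function are hypothetical; they model the idea of set membership, not the way MMAX2 stores it.

```python
from collections import defaultdict

# Each markable records the set it belongs to (hypothetical identifiers).
MARKABLES = [
    {"id": "markable_1", "role": "content", "set": "set_1"},
    {"id": "markable_2", "role": "cue", "set": "set_1"},
    {"id": "markable_3", "role": "source", "set": "set_1"},
    {"id": "markable_4", "role": "cue", "set": "set_2"},
]


def related(markables, markable_id):
    """Return the other members of the set a markable belongs to."""
    members = defaultdict(list)
    for markable in markables:
        members[markable["set"]].append(markable)
    own = next(m for m in markables if m["id"] == markable_id)
    return [m for m in members[own["set"]] if m["id"] != markable_id]


if __name__ == "__main__":
    # The same relation is reachable from the content and from the source.
    print([m["role"] for m in related(MARKABLES, "markable_1")])
    print([m["role"] for m in related(MARKABLES, "markable_3")])
```

With a directed pointer stored on the cue alone, the second lookup would first require a search for the markable that points to the source.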
5.3.2 Customization
The Customization file was written as reported in Appendix 1. The third line
specifies the display preferences for all the markables. Every selected span is
highlighted by surrounding it with black handles and showing its text in a blue, bold
font. The following lines in the file define specific display preferences for each
attribute. This could have been done for every different value of every feature
connected to attribution; however, an excessive differentiation, instead of helping
the annotation and the use of the annotation by visually characterising different
elements, would simply confuse. It would in fact be necessary to memorise the
association of many colours and font effects with the different features involved in
the annotation.
Only the components of attribution were therefore visually characterised:
once a span is marked as the cue, its background changes to orange; the
content has instead a cyan background, and the source and the supplement a green
and a light grey one respectively. Apart from allowing the immediate identification
in the text of cue, content, source and supplement, these display settings provide
feedback about the successful annotation of a markable.
In case the annotator forgets to assign a role to a selected markable or, in
case of uncertainty, intentionally leaves that for a later stage, the display will
continue showing the markable (blue, bold font and handles) without a coloured
background, thus making it easier to identify later on when completing the
annotation. Similarly, in case the annotator fails to save the annotation, as it is easy
to forget to select the ‘auto-save’ function every time a different annotation project
is loaded, and even easier to forget to save the annotation manually for every single
markable, this can be noticed immediately. The annotator can then select the
‘auto-save’ option and repeat only the last markable selection.
5.3.3 Style
The Style sheet (also reported in Appendix 1) has not been heavily modified. While
dialogues, for example, are surely better displayed by separating turns and
differentiating the actual text from the speaker, news articles have a simpler
structure. Apart from the body text, the other elements are the title(s) and the
author, when explicitly mentioned. However, distinguishing these elements for the
annotation is not necessary. On the other hand, since attribution relations can
often be found nested one in another, adding handles to the right and to the left of
each markable represents the only way to make it possible to identify these
instances (Figure X). Handles were therefore added in the Style sheet.
Figure X - Nested attributions visible through handles
5.4 Feasibility of the Schema and Issues
While performing the pilot annotation, several issues arose that led to a
reconsideration of the annotation schema, which was partially modified and
reapplied to the sample corpus. Some changes were determined by the
characteristics of the tool, in order to better exploit its potential or make up for its
shortcomings, so that the schema was adequately represented and the annotation
process relatively easy and intuitive. Other issues emerged from real language
occurrences of attribution relations presenting aspects not yet
considered. Finally, doubts and difficulties in applying the schema shed light on
features of the schema requiring further investigation in order to reach a more
appropriate description. These issues, which have already been analysed in the
relevant chapters, are only briefly presented here.
First of all, the annotation process highlighted the need to determine more
precisely the scope of the attribution relation, i.e. the text span to select
as source, cue or content. Adverbs, relative clauses, appositives and other elements
can represent either highly informative material contributing to the interpretation of
the attribution or disruptive additional information which is better left out of
the annotation. In addition, the annotation revealed the need for a solution to
preserve the information carried by ‘source of the source’
elements (3.2.2), namely the provenance of the knowledge acquired through verbs
of the ‘fact’ type (e.g. John knows FROM MARY that…), and by recipients of messages
presupposing a perlocutionary act, i.e. the indirect object of eventualities
(especially influence verbs, e.g. The pope prohibits CATHOLICS to…). In order to
account for these elements, which do not correspond to any of the three
components of an attribution relation, a ‘supplement’ role was added and included
in the annotation.
Moreover, it was necessary to account for instances of multiple sources
belonging to different types. As all attributes have been added on the cue, i.e.
associated with the text span corresponding to the cue, it is not possible to give
different values for the ‘source type’ feature. This would have been possible
by marking the ‘type’ directly on the source, thus assigning a type to each
source in the attribution relation. Null or hidden sources, however, would then
have been problematic, as they have no corresponding text span. Hidden sources
are far more frequent than multiple sources belonging to different source types,
and the former issue was therefore given priority. The solution adopted was to
include a value ‘mixed’ for the ‘source type’ attribute. In addition, the frequency
of coreference relations involving the source led to the addition of a ‘source ID’
attribute, as described in (3.2.1, 5.3.1).
Lastly, assigning a value to the features ‘type’ (4.1) and ‘scopal change’
(4.4) turned out in some cases to be uncertain and dependent on subjective
considerations. Hence the need for a statistical analysis of the problem,
comparing inter-annotator agreement on these features, in order to estimate its
extent and introduce the appropriate changes to the schema if required.
5.5 Summary
In order to develop an annotation schema for the phenomenon considered in this
study and test its efficacy, it was decided to carry out a pilot annotation. The pilot
was performed on a balanced portion of the ISST corpus consisting of 50 articles.
The annotation was carried out with the help of an annotation tool.
In order to select the most suitable tool, specifications for the proposed
annotation schema were listed and compared with the available software. Some
requirements, such as the possibility to select a discontinuous text span for a single
markable and to link markables through relations, were given
priority, and tools not meeting them were discarded. The two remaining tools,
Knowtator and MMAX2, were compared with respect to the additional
requirements, e.g. their customisability, how the annotation is saved and
user-friendliness.
MMAX2 was eventually adopted and set up for the annotation. This required
delineating the annotation ‘Scheme’, i.e. how to organise cue, source and content
together with their attributes and constraints, as well as defining ‘Style’ and
‘Customisation’. The articles had to be prepared, that is corrected and converted to
raw text, to become the XML ‘Base Data’ of the annotation. The annotation of each
article was stored stand-off in a single XML file with reference to the word index.
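The stand-off principle summarised above can be illustrated with a minimal sketch. The XML layout below is a simplified assumption, loosely modelled on a words file plus a markables file whose `span` attribute points at word indices; the element names, the attribute names and the sample sentence are illustrative and do not reproduce the files produced in the pilot.

```python
import xml.etree.ElementTree as ET

# Hypothetical base data: one element per token, each with a word index.
WORDS_XML = """<words>
  <word id="word_1">Il</word>
  <word id="word_2">ministro</word>
  <word id="word_3">ha</word>
  <word id="word_4">detto</word>
  <word id="word_5">che</word>
  <word id="word_6">le</word>
  <word id="word_7">tasse</word>
  <word id="word_8">saliranno</word>
</words>"""

# Hypothetical stand-off annotation: markables refer to word indices only.
MARKABLES_XML = """<markables>
  <markable id="m1" span="word_1..word_2" role="source"/>
  <markable id="m2" span="word_3..word_4" role="cue"/>
  <markable id="m3" span="word_5..word_8" role="content"/>
</markables>"""


def expand_span(span):
    """Expand 'word_1..word_3,word_7' into a list of word IDs.

    Comma-separated fragments allow discontinuous markables.
    """
    ids = []
    for fragment in span.split(","):
        if ".." in fragment:
            start, end = fragment.split("..")
            first = int(start.split("_")[1])
            last = int(end.split("_")[1])
            ids.extend(f"word_{i}" for i in range(first, last + 1))
        else:
            ids.append(fragment)
    return ids


def resolve_markables(words_xml, markables_xml):
    """Return {markable id: (role, text)} by looking up word indices."""
    words = {w.get("id"): w.text for w in ET.fromstring(words_xml)}
    resolved = {}
    for m in ET.fromstring(markables_xml):
        text = " ".join(words[i] for i in expand_span(m.get("span")))
        resolved[m.get("id")] = (m.get("role"), text)
    return resolved
```

Because the markables only store indices, the base text is never modified, and a discontinuous span such as `word_1..word_3,word_7` can be represented without duplicating any token.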
The pilot made it possible to identify some issues by testing the annotation
schema against real language instances. These, together with the constraints
determined by the tool characteristics, resulted in the partial modification of the
annotation schema, in order to account, for example, for the problem of
coreference resolution and for phenomena such as ‘sources of the source’ and ‘mixed
sources’. Although some modifications might still be required, e.g. to the ‘type’ and
‘scopal change’ features, once the applicability of the schema has been
statistically evaluated, the annotation scheme developed so far proved to be
feasible and the annotation, with the help of the tool and of annotation constraints,
rather reliable, although at times problematic.
6 Annotation Schema and Guidelines
The annotation process starts with loading one article at a time in the MMAX2 tool
and is generally performed in five phases. First of all, it is necessary to identify the
presence of an attribution relation. This is usually done starting from the
identification of an attribution cue, typically a punctuation mark or a reportive verb.
However, for the relation to be annotated, it is not enough to find elements linked
by the cue which function as source and content.
The content should in fact express the object of the attribution and not just
its description. An attribution like ‘John said two words’ is not relevant (while ‘John
said: “two words”’ would be), unless it is needed in order to relate ‘John’ to the
actual two words he pronounced, which may be expressed somewhere else in the
article, as happens with coreferential pronouns functioning as content.
Idiomatic or ‘false’ attributions ((137), (138), (139)) should not be
annotated either. These attributions are in fact not meant to establish a relation
between source and content. The source of idiomatic attributions is also generally
hidden. Examples (137) and (138) represent a specification or a concession with
respect to what was previously said. In (139) the reportive verb ‘say’ is just
employed to express an equivalence (Biedermeier = “il buon Meier”).
(137) C'È DA DIRE CHE d'Arminio Monforte non sarà scelto dall'intraprendente El
Sayed ad interpretare ed eseguire le nuove strategie che dovranno portare
a così alti traguardi. (ISST els001)
IT SHOULD BE SAID THAT Arminio Monforte won’t be chosen by the
enterprising El Sayed to interpret and execute the new strategies that will
have to lead to such high achievements.
(138) Perché VA DETTO CHE il signor B. spesso ha una casa fuori porta, in mezzo
al verde. (ISST perod001)
Because IT HAS TO BE SAID THAT mister B. often has a house out of town, in
the countryside.
(139) Biedermeier, COME DIRE "il buon Meier": il cittadino medio del secolo scorso,
protagonista di un'epoca, un gusto, uno stile. (ISST perod001)
Biedermeier, AS IF TO SAY “the good Meier”: the average citizen of the last century,
protagonist of an epoch, a taste, a style.
In the second phase, after an attribution relation has been identified, the relevant
text spans need to be selected and labelled as markables. They are then displayed
as blue bold text between square brackets. Afterwards, a role (Figure Y) is
assigned to each markable, which is consequently shown with a specific background
colour. The next step consists in assigning values to each attribute
in the annotation. Lastly, the markables need to be linked in a relation. This can be
done by selecting a markable and right-clicking on the elements which should be
included in the same set. When a markable in a relation is selected, the markables
belonging to that relation set are displayed joined by red arcs.
Figure Y - Attribution relation components
In this chapter the annotation schema developed in this thesis is summarised
and presented as it was employed in the pilot annotation. Indications are
provided regarding the selection of the relevant text span for each of the
constitutive elements of the attribution relation. With the use of examples from the
corpus, instructions are also given on how to assign the values for each annotated
feature. All the recommendations reported in the following
sections, however, have to be regarded as suggestions, a reference repository of
good-practice examples aimed at facilitating the annotation process, rather
than as prescriptions. The context and a full awareness of the goals to achieve
should alone be sufficient to drive the annotation reliably. Should this prove
incorrect, the strategy adopted here should be abandoned in favour of a more
controlled one.
[Figure Y content: SOURCE(S), CUE and CONTENT(S) linked in a relation, with an optional SUPPLEMENT]
6.1 Text Spans Selection
Once an attribution relation is found, it is first of all necessary to identify its
constitutive elements (Figure Z) and determine which span represents them. Each
relation has three components: the cue, i.e. the textual anchor
signalling the relation; the content, that is the attributed material; and the source,
the entity the content is attributed to. The source can be missing from the text, as it
is sometimes left implicit. When annotating, it should however be clear which implicit
entity the attribution refers to. In some cases it is instead possible to have multiple
instances of ‘source’ and ‘content’. In addition to these three components there is a
fourth one, the ‘supplement’, which can optionally be used to mark additional relevant
information.
Figure Z - Annotation, text spans selection.
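The components just described can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class and field names are hypothetical, spans are represented as lists of token indices, and the checks simply encode the description above (one cue per relation, at least one content, sources possibly absent when implicit, supplements optional).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Markable:
    """A (possibly discontinuous) text span, stored as token indices."""
    role: str          # 'cue', 'source', 'content' or 'supplement'
    tokens: List[int]


@dataclass
class AttributionRelation:
    """One attribution relation anchored on exactly one cue markable."""
    cue: Markable
    contents: List[Markable]
    sources: List[Markable] = field(default_factory=list)      # empty if implicit
    supplements: List[Markable] = field(default_factory=list)  # optional

    def __post_init__(self):
        if self.cue.role != "cue":
            raise ValueError("the anchor of a relation must be a cue")
        if not self.contents:
            raise ValueError("a relation needs at least one content span")
        for group, role in ((self.contents, "content"),
                            (self.sources, "source"),
                            (self.supplements, "supplement")):
            if any(m.role != role for m in group):
                raise ValueError(f"unexpected role among {role} markables")

    @property
    def has_implicit_source(self):
        """True when no source span is marked in the text."""
        return not self.sources
```

Keeping sources and contents as lists reflects the possibility of multiple instances of each, while the empty source list covers null subjects that have no text span.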
The text spans corresponding to cue, source and content should first be selected
(Figure Z), thus enabling the option of creating a markable with the selected text. If
the text span corresponding to a markable needs to be extended or reduced,
this can be done by choosing ‘add’ or ‘remove from this markable’
from the menu on the selected span.
Elements that can constitute each markable type are listed in Figure AA.
Deciding what is in the scope of the attribution relation, i.e. what exactly to
include in each markable, should not be taken for granted. In the following
sections indications are provided about each markable type and what should
be included in or left out of its text span.
Figure AA - Annotation, elements which could function as a markable.
6.1.1 Source Span
In general, the source span should include all those elements relevant to
the identification of the entity having this role. However, what is to be considered
relevant needs to be defined. The source should always comprise the full noun
phrase expressing it ((140), attribution 1) or, when the source is represented by
an adjective or a prepositional phrase (141), the whole of that element.
(140) [Il ministro del Tesoro]1 [ha indicato anche]1 [l'obiettivo del prossimo anno:
4 per cento]1. Ø [Ha anche aggiunto]2 [che i risultati positivi derivano
soprattutto dalla caduta dei prezzi del petrolio, da quello delle altre materie
prime e dal calo del dollaro]2. (els020)
[The Secretary of the Treasury]1 [has also indicated]1 [next year's goal: 4
per cent]1. (He) [has also added]2 [that the positive results mainly derive
from the drop in the price of oil, from that of other raw materials and from the
fall of the dollar]2.
[Figure AA content: elements which could function as each markable type
Cue: verb, noun, adjective, preposition, prepositional group, graphic marker
Source: noun phrase, adjective, prepositional phrase
Content: word, phrase, clause, sentence, entire article
Supplement: cue modifier, indirect object, source of source, event specification]
(141) Le parole registrate di Gheddafi, …(ISST cs039)
Gheddafi’s recorded words,…
Appositives or relative clauses referring to the entity in the noun phrase
and contributing to its characterisation should also be selected together with
the noun phrase, as in example (142). When they instead digress from the task
of identifying the source, as in (143), and constitute a mere description or provide
additional details which are not necessary, they should not be annotated.
(142) …il presidente della casa giapponese, Osamu Suzuki, ha previsto
un'ulteriore flessione dei profitti anche per quest'anno. (ISST sole100)
…the president of the Japanese company, Osamu Suzuki, has predicted a
further fall in profits for this year as well.
(143) <<Un'idea geniale>> l'ha definita Cesare Verlucca, editore piemontese
pronto ad affrontare i salotti dopo il successo di vendite ottenuto dal Salone.
(ISST sole040)
<<A genius idea>> has defined it Cesare Verlucca, publisher from
Piedmont ready to face the ‘salotti’ after the sale success obtained at the
‘Salone’ fair.
When the relation is part of a relative clause with the source expressed by a
relative pronoun, just the pronoun should be annotated, as in (144). The full noun
phrase the relative pronoun refers to, in this case ‘Milan vice-president, Galliani’,
should be syntactically retrievable; moreover, it will be reported in the ‘source ID’
slot. Null or missing subjects, having no corresponding span, should not be marked
in the text ((140), attribution 2).
(144) Una provocazione collegata a un recente colloquio con il vicepresidente del
Milan, Galliani, il quale ha convenuto con me circa l’insostenibilità della
situazione. (ISST cs064)
A provocation connected to a recent conversation with Milan vice-president,
Galliani, who agreed with me about the situation being unbearable.
6.1.2 Cue Span
The cue can be expressed by a considerable number of elements, which makes it
difficult to recognise automatically. Most commonly, however, cues are reportive
verbs. Apart from the particle or expression reversing its polarity (145),
adverbs modifying the attitude ((146), (147)) should also be included in the cue span,
while complements or specifications, which should be considered only if they
provide relevant information, can be included in the supplement.
(145) Ø Non ho mai pensato che <<Il Dottor Zhivago>> potesse essere
considerato un’opera ostile al socialismo. (ISST els076)
(I) have never thought that <<Doctor Zhivago>> could be considered a work
against socialism.
(146) Afferma ufficialmente l’Antitrust : <<Le modalità di pubblicizzazione del
prezzo consigliato…>>. (ISST sole049)
The Antitrust officially affirms: <<The advertisement modalities of the
suggested price…>>.
(147) Ieri sera i segretari generali hanno esplicitamente detto di essere
d’accordo con una delicata proposta contenuta nel documento dei giuristi
sulle sanzioni da applicare ai singoli lavoratori che si rifiutassero di prestare
il lavoro richiesto per garantire il minimo di servizio. (ISST els079)
Yesterday the general secretaries explicitly said that they agree with a
delicate proposal included in the lawyers’ document concerning the
penalties to inflict to the individual workers who would refuse to give the
work required to guarantee the minimum service.
Similarly, ‘cue of the cue’ elements, i.e. usually complements expressing means or
provenance, should not be labelled as ‘cue’ but included in the annotation as
‘supplement’, together with the indirect object, when expressed. In this way these
elements and the information they carry remain retrievable.
When more than one cue belonging to the same type, i.e. conveying the
same attitude held by the source, is expressed, these should all be included in a
single ‘cue’ markable, since each ‘cue’ markable corresponds to exactly one
attribution relation. Cases like (148), with a redundant cue and source (since
Martino and the Foreign Secretary are the same person), are also possible. In this
case the co-referential sources should be included in the same source markable,
shown in the square brackets labelled ‘s’. Similarly, the two cues
constitute a single cue markable, whose text span is marked with ‘c’.
While cues of different types should be split into separate attribution relations,
those of the same type jointly signal the presence of an attribution and
should be grouped. An exception is made only for punctuation cues, which should
be annotated only when the relation is not signalled by any other means, as in (149).
(148) [Secondo]c [il ministro degli Esteri]s, [la prossima ondata di ottimismo ci
sarà], [ha detto]c [Martino]s, [quando comunicheremo le nostre prime
iniziative concrete]. (ISST sole017)
[According to]c [the Foreign Secretary]s, [the next wave of optimism will
take place], [said]c [Martino]s, [when we will announce our first concrete
initiatives].
(149) Il Papa: “La cultura ha bisogno del genio femminile”. (ISST cs014)
The Pope: “Culture needs the female genius”.
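The grouping rule for cues described above can be sketched as follows. This is an illustrative assumption about how the rule might be operationalised, not a procedure used in the pilot: each candidate cue is given as a hypothetical triple of text, attitude type and a flag marking punctuation cues.

```python
from collections import defaultdict


def group_cues(cues):
    """Group the candidate cues of one attribution into cue markables.

    `cues` is a list of (text, attitude_type, is_punctuation) triples.
    Returns {attitude_type: [cue texts]}: cues of the same type share one
    markable, cues of different types end up in separate relations.
    """
    lexical = [c for c in cues if not c[2]]
    # Punctuation is kept only when nothing else signals the relation.
    kept = lexical if lexical else cues
    grouped = defaultdict(list)
    for text, attitude, _ in kept:
        grouped[attitude].append(text)
    return dict(grouped)
```

Applied to (148), the two cues ‘Secondo’ and ‘ha detto’ fall into a single assertion markable, whereas in (149) the colon and quotation marks survive because no other cue is present.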
Lastly, when the attribution is a question, the cue should include the element
giving the utterance its interrogative form, i.e. the question mark in the case of direct
questions (150). This element should not be included in the content: it is in fact the
cue, and therefore the attribution itself, that is questioned; the content of the
attribution is not itself a question.
(150) Pensa anche lei come tanti critici che, con il suo romanzo incompiuto, lo
scrittore si trovasse a una svolta esistenziale? (ISST els034)
Do you also think, like many literary critics, that, with his unfinished novel,
the writer was at an existential turning-point?
6.1.3 Content Span
The selection of the content should follow the principle of limiting the annotation to
that portion of text which is clearly meant to be attributed to the source. This means
that the content span should not include utterances whose attribution is uncertain
because of syntactic ambiguities. An example is when a clause constituting the
content is joined to another utterance via a coordinating conjunction. In this case,
the second clause is certainly attributed only if the complementizer ‘che’ (that) is
included ((151), (152)); otherwise it could represent material added by the outer
source, usually the writer.
(151) Più positive, invece, il giudizio di Fim-Cisl e Uilm-Uil, che hanno annunciato
per oggi una conferenza stampa e che sono favorevoli ad una votazione
referendaria sulla bozza di accordo. (ISST els002)
More positive, instead, the opinion of Fim-Cisl and Uilm-Uil, that have
announced a press release for today and that they are positive about a
referendum poll concerning the agreement draft.
(152) Lo ha detto ieri un portavoce del ministero degli Esteri, il quale ha anche
annunciato che il governo cinese ha protestato con quello degli Stati Uniti e
che si riserva il diritto di ulteriori reazioni. (ISST els075)
It was said yesterday by a spokesman of the Foreign Ministry, who has also
announced that the Chinese government has complained to the one of the
United States and that they reserve themselves the right of further
reactions.
The indirect object (IO) of verbs requiring one, e.g. to order,
to forbid (153), should also be part of the content span. In the example below the
prohibition would in fact be incomplete without the IO to which it is addressed. The
‘Zagreb authorities’ did not prohibit ‘to go to Petrinja’, which could even be
considered an incorrect attribution; they forbade ‘the journalists to go to Petrinja’.
(153) E le autorità di Zagabria hanno proibito ai giornalisti di andare a Petrinja e
nelle altre località appena riconquistate. (ISST cs030)
And Zagreb authorities have forbidden journalists to go to Petrinja and the
other just reconquered places.
When the content span is interrupted by an incidental phrase or clause, it should be
annotated as a single markable, unless, as in (154), the content is also divided by
sentence boundaries. In this case it seems more appropriate to add the
second part of the content to the same relation, though as a second content
markable.
(154) "There's no question that some of those workers and managers contracted
asbestos-related diseases," said Darrell Phillips, vice president of
human resources for Hollingsworth & Vose. "But you have to recognize
that these events took place 35 years ago. It has no bearing on our work
force today." (PDTB 0003)
The complementizer ‘that’ should always be included in the content span, together
with the quotation marks (155) (i.e. “…” or <<…>>). When source and cue are
expressed incidentally, set off by dashes, the dashes should also be included in
the content (155).
(155) E' vero che doveva interpretare lei la parte di Bruce Willis in Pulp Fiction ?
["Sì -] [si adombra] [Matt] [- Un ruolo interessante: con Tarantino eravamo a
buon punto, poi é arrivato Bruce. I suoi film incassano un po' più dei miei,
no? Hanno scelto lui"]…(ISST cs060)
Is it true that you were going to play the part of Bruce Willis in Pulp Fiction?
[“Yes -] [Matt] [grows dark] [- An interesting role: with Tarantino we were at
a good point, then Bruce arrived. His films gross a bit more than mine,
right? They chose him”]…
Punctuation at the end of a content span should only be included if part of the
content itself. This means that for example a full stop at the end should be
included when the content is expressed by a full sentence, a question mark when
the content itself is a question (156) and so forth.
(156) Ø Sospende il racconto e formula una domanda, in inglese: “Sai cos’è un
rabbit?”. (ISST cs030)
(He) holds the narration and poses a question, in English: “Do you know
what a rabbit is?”.
6.1.4 Supplement
The supplement span is a useful device to account for optional additional
elements which, although not fundamental to an attribution relation (they are in
fact often missing), do carry useful information. These can be: elements
contributing to the identification of the source, such as the provenance (157) or
the means by which the information was acquired; further specifications of the
attitude the source holds; the recipient of a reportive verb of the assertion type
(e.g. to tell); and event specifications providing contextual indications that are
decisive for the interpretation and comprehension of the content. The latter also
include instances like (158), where the content has been asserted or expressed about
a certain entity or event (‘it’). In the example this element ‘it’ is required by the
verb and, in the case of an indirect quotation, it could be included in the content.
Here, however, it is not directly part of the content, as the source has been talking
about this event without mentioning it. In the examples below the supplement span
is in small capitals.
(157) (Ø) Ho saputo della squalifica di Garciano DA MAURIZIO DAMILANO, vi giuro,
non pensavo di arrivare primo. (ISST cs071)
(I) heard of the disqualification of Garciano FROM MAURIZIO DAMILANO, I
swear, I didn’t imagine I would come first.
(158) <<Un'idea geniale>> L'ha definita Cesare Verlucca, editore piemontese
pronto ad affrontare i salotti dopo il successo di vendite ottenuto dal Salone.
(ISST sole040)
<<A genius idea>> has defined IT Cesare Verlucca, publisher from
Piedmont ready to face the ‘salotti’ after the sale success obtained at the
‘Salone’ fair.
6.2 Feature Annotation Guidelines
After selecting the text spans corresponding to the elements of an attribution
relation, it is necessary to assign a role to each markable in the ‘annotation
window’. When the role ‘cue’ is chosen, the window (Figure BB) also displays
the attributes whose values need to be assigned.
Figure BB - Annotation, attributes selection.
The features included in the attribution are summarised in Table 4. They are all
marked on the cue, although some refer to characteristics of the source, i.e.
‘source type’ and ‘source ID’.
Role         Attribute       Values
Cue          Type            None*, Assertion, Belief, Fact, Eventuality
             Factuality      Factual*, Non-factual
             Scopal change   None*, Scopal change
             Source type     Writer*, Other, Arbitrary, Mixed
             Source ID       free text
Source       -               -
Content      -               -
Supplement   -               -
* default value (underlined in the original table)
Table 4 - Annotation schema features
It was decided to proceed this way as it prevents a loss of information, or
the need to add dummy elements, when the source is implicit and has no
corresponding text span to which the annotation could be anchored. The feature
‘scopal change’ is disabled when cues of the type ‘fact’ are selected. The
underlined values represent the default values.
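The attribute set and its constraint can be summarised in a short sketch. The dictionary keys and function names are hypothetical; the values and defaults are those described in this chapter, and ‘disabled’ is interpreted here, as an assumption, as ‘must remain at its default value’.

```python
# Allowed values for each attribute marked on the cue (see Table 4).
ALLOWED = {
    "type": ("none", "assertion", "belief", "fact", "eventuality"),
    "factuality": ("factual", "non-factual"),
    "scopal_change": ("none", "scopal change"),
    "source_type": ("writer", "other", "arbitrary", "mixed"),
}

# Default values; 'source_id' is free text and starts empty.
DEFAULTS = {
    "type": "none",
    "factuality": "factual",
    "scopal_change": "none",
    "source_type": "writer",
    "source_id": "",
}


def validate(attrs):
    """Raise ValueError if the attribute values violate the schema."""
    for name, values in ALLOWED.items():
        if attrs[name] not in values:
            raise ValueError(f"invalid value for {name}: {attrs[name]!r}")
    # Assumption: 'scopal change' stays at its default for cues of type 'fact'.
    if attrs["type"] == "fact" and attrs["scopal_change"] != "none":
        raise ValueError("'scopal change' is disabled for cues of type 'fact'")


def make_cue_attributes(**overrides):
    """Return a validated attribute set, starting from the defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown attributes: {sorted(unknown)}")
    attrs = dict(DEFAULTS)
    attrs.update(overrides)
    validate(attrs)
    return attrs
```

For instance, `make_cue_attributes(type="belief", source_type="other", source_id="Galliani")` keeps ‘factual’ and ‘none’ as defaults for the two attributes that were not specified.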
6.2.1 Type Attribute
The type of attitude held by the source is by default ‘None’. In the annotation
window, however, one of the four values this feature can assume, namely
assertion, belief, fact and eventuality, needs to be selected. For a more detailed
analysis of the issues involved in the selection of the type, see (4.1.5). Some
strategies adopted in the pilot are presented here.
A direct quotation should always be marked as ‘assertion’, even when the
punctuation is not the only cue and other cues express a different attitude, as in
(159). The prepositions ‘per’ (for) and ‘secondo’ (according to) have also been
considered assertions, together with prepositional groups such as ‘stando a’
(according to), ‘a detta di’ (according to), and so forth. Other prepositional groups,
e.g. ‘a parere di’ (160) (in the opinion of), ‘agli occhi di’ (in the eyes of) and ‘nell’ottica
di’ (in the perspective of), have instead been marked as ‘belief’.
(159) "Vi daremo le statistiche alla fine", promettono i generali croati. (ISST
cs030)
“We’ll give you the statistics at the end”, promise the Croatian generals.
(160) A suo parere, una particolare follia segnerebbe la continuità della
letteratura siciliana. (ISST els034)
In his opinion, a special folly marks(QUOT.COND.) the continuity of Sicilian
literature.
Verb cues need instead to be considered in context and annotated according to
the attitude they express. A first effort to collect Italian cues and identify their type
is presented in (6.3).
6.2.2 Factuality Attribute
The factuality attribute takes just two values: factual and non-factual. In order to
decide which value to assign, it is necessary to concentrate on the attribution
relation itself, no matter what the content is. ‘Factual’ is the value assigned by
default, as it is more frequent, at least in journalistic texts, and represents
real attributions. When the attribution relation is not a real bond but just a
hypothetical match or the negation of a link between source and content, it takes
the value ‘non-factual’. To summarise the analysis in (4.3.2), factuality can be
compromised by the following elements when they scope over the cue:
- polarity reversing particle (negation, negative pronouns) (161)
- verb mode (conditional, imperative)
- verb tense (future)
- hypothetical (if)
- interrogative form (162)
- modals
(161) Nessuno parla più di baratro imminente e di crisi finanziaria. (ISST cs025)
No one is talking anymore about imminent precipice and financial crisis.
(162) Ø Ti dico una cosa: Ø sai qual è il nostro gioco preferito quando partiamo
per qualche operazione militare? (ISST cs030)
(I) tell you something: do (you) know what’s our favourite game when we
leave for some military operation?
The factuality judgement represents the answer to the following question: is the
content presented as attributed to the source in the real world? However, factuality
should be kept separate from evidentiality, that is from elements like the quotative
conditional or ‘sembra/pare’ (it seems), which are employed by the outer source to
express that he or she has no direct evidence of the reported attribution. Although
evidentiality can often be perceived as affecting the epistemic modality, thus
reflecting a lower degree of certainty that the attribution really took
place, this alone should not be considered enough to reverse the factuality.
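As a rough illustration of the guideline in this section, the decision can be sketched as a simple check over feature labels observed on the cue. The label names are hypothetical, the sketch assumes the features have already been identified as scoping over the cue, and it deliberately ignores scopal change, which is treated in 6.2.3.

```python
# Hypothetical labels for the elements that can compromise factuality
# when they scope over the cue.
FACTUALITY_BLOCKERS = {
    "negation", "conditional", "imperative", "future",
    "hypothetical", "interrogative", "modal",
}

# Evidential markers, which alone should not reverse the factuality.
EVIDENTIAL_MARKERS = {"quotative_conditional", "sembra", "pare"}


def factuality(cue_features):
    """Return 'factual' or 'non-factual' for a set of feature labels."""
    relevant = set(cue_features) - EVIDENTIAL_MARKERS
    return "non-factual" if relevant & FACTUALITY_BLOCKERS else "factual"
```

Under these assumptions the negated cue of (161) yields ‘non-factual’, while a cue marked only by the quotative conditional remains ‘factual’.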
6.2.3 Scopal Change Attribute
The scopal change attribute can also take two values, ‘none’ being the default one
and ‘scopal change’ the other. A change in scope happens relatively seldom;
it is however important to recognise it in order to avoid incorrectly considering it as
affecting the factuality. Scopal change occurs almost exclusively with polarity, and
it is therefore advisable to pay particular attention to those attributions appearing
at first as non-factual because the cue is in the scope of a negation. In these
cases a polarity change can be checked by first determining whether
there is still a perceived attribution and secondly considering whether the reverse of
the content is attributed. Both requirements are satisfied in (163).
(163) Qualunque sia il numero di sfollati, il governo croato nega che siano stati
espulsi e nega anche qualsiasi volontà di pulizia etnica nelle regioni appena
riconquistate. (ISST cs031)
Whatever the number of evacuees, the Croatian government denies that
they have been banned and denies also any ethnic cleansing intention in
the newly conquered areas.
The case in which the first requirement is not satisfied corresponds to a
‘non-factual’ attribution. The way factuality and scopal change are intertwined is
shown in Table 5, where values are assigned to these features according to the
intersection of the above mentioned requirements, i.e. a perceived attribution and
the reverse of the content being the intended attributed material.
Factuality + Scopal change   Attribution                   No attribution
Content                      Factual + None                Non-factual + None
Content reverse              Factual + Scopal change       X
Cue reverse                  Non-factual + Scopal change   X
Table 5 - Factuality and Scopal change values assignment
When only the first requirement is satisfied, that is when there is a perceived intent
of attributing something which, however, corresponds neither to the content nor to its
reverse, as in (164), the annotation should mark the attribution as non-factual and
as having a scopal change. It could in fact be considered a factual attribution of a
negative attitude, in this case ‘not believing’; the attribution of the positive
attitude, however, is not expressed. By marking these instances as ‘non-factual +
scopal change’, they receive a unique combination of attributes which makes them
retrievable and easily distinguishable from simple ‘non-factual’ attributions. Here
‘scopal change’ refers to the fact that it is neither the polarity of the attribution nor
that of the content that is affected, but the polarity of the attitude held by the source.
(164) Le opposizioni non credono alla rinascita del tripartito ed insistono nella
richiesta di autoscioglimento. (ISST els038)
The opposition parties do not believe in the rebirth of the three-party coalition and insist
on their request for self-dissolution.
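The value assignment summarised in Table 5 can be sketched as a small lookup. This is only an illustrative sketch and not part of the annotation tool: the names `Attributed`, `TABLE_5` and `assign_values` are hypothetical, and reading the ‘cue reverse’ row as the case in (164), where the attitude rather than the content is reversed, is an assumption.

```python
from enum import Enum


class Attributed(Enum):
    """What the perceived attribution (if any) is about. Hypothetical labels."""
    CONTENT = "content"
    CONTENT_REVERSE = "content reverse"
    ATTITUDE_REVERSE = "cue (attitude) reverse"


# (attributed material, is an attribution perceived?) -> (factuality, scopal change)
# Cells marked X in Table 5 are simply absent from the mapping.
TABLE_5 = {
    (Attributed.CONTENT, True): ("Factual", "None"),
    (Attributed.CONTENT, False): ("Non-factual", "None"),
    (Attributed.CONTENT_REVERSE, True): ("Factual", "Scopal_change"),
    (Attributed.ATTITUDE_REVERSE, True): ("Non-factual", "Scopal_change"),
}


def assign_values(attributed, perceived_attribution):
    """Return the (factuality, scopal change) pair for one annotation decision."""
    key = (attributed, perceived_attribution)
    if key not in TABLE_5:
        raise ValueError("combination not foreseen by Table 5 (marked X)")
    return TABLE_5[key]


# Example (163): the reverse of the content is attributed.
print(assign_values(Attributed.CONTENT_REVERSE, True))
# Example (164): an attribution is perceived, but of a reversed attitude.
print(assign_values(Attributed.ATTITUDE_REVERSE, True))
```

Keeping the table as data rather than nested conditions makes it easy to adjust if the guidelines change.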
6.2.4 Source Type Attribute
Assigning a value to the ‘source type’ attribute is not a particularly complex task.
The source is by default ‘writer’ and can also take the values ‘other’, ‘arbitrary’
and ‘mixed’. ‘Writer’ should be assigned when the attribution is overtly to the
writer of the article, while ‘other’ refers to another defined entity, including very
general sources like ‘a man’ or ‘experts’. ‘Arbitrary’ marks those
instances without a specific source, i.e. impersonal or hidden sources such as
‘everyone’, ‘the people’, ‘one’, or pronouns like ‘you’ or ‘they’ when used
impersonally. ‘Mixed’ should instead be used when an attribution
has multiple sources of different types, as in (165).
(165) Tutti, incluse le autorità, conoscono la loro provenienza, ma nessuno dice
e fa nulla per prevenire il massacro di capi selvatici. (cs.morph020)
Everyone, including the authorities, knows their provenance, but no one
says or does anything to prevent the massacre of wild animals.
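The choice among the four source type values can be sketched as follows. The function `source_type` and the lower-case input labels are hypothetical, and treating (165) as a combination of an arbitrary source (‘tutti’) and a defined one (‘le autorità’) is one possible reading, assumed here for illustration.

```python
VALID_KINDS = {"writer", "other", "arbitrary"}


def source_type(kinds):
    """Derive the 'source type' value from the kinds of the individual sources.

    `kinds` holds one label per source span of a single attribution.
    """
    distinct = set(kinds)
    unknown = distinct - VALID_KINDS
    if unknown:
        raise ValueError(f"unknown source kind(s): {sorted(unknown)}")
    if not distinct:
        return "Writer"  # the default value stated in the guidelines
    if len(distinct) > 1:
        return "Mixed"   # multiple sources of different types
    return next(iter(distinct)).capitalize()


# Assumed reading of (165): 'tutti' is arbitrary, 'le autorità' is other.
print(source_type(["arbitrary", "other"]))
```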
6.3 Collecting a List of Italian Cues
A list of Italian cues classified according to their type would make it possible,
for example, to perform the annotation of attribution as was done for the
annotation of the PDTB, by searching for each cue, one at a time, throughout the
whole corpus. In addition, this list would serve as a database to guide
the annotators in their task. Moreover, a collection of all the possible cues would
also help the automatic identification of attribution relations by providing tools with a
lexical anchor to look for. Unfortunately, compiling a complete list is not
feasible, for a number of reasons, and such a list therefore cannot serve as a reliable
instrument.
First of all, only the punctuation sequence colon-quotation mark almost
certainly corresponds to an attribution. All other lexical and grammatical categories
can function as cues, but they do not always do so, and some do so
only occasionally. Secondly, although prepositions and prepositional
groups, together with punctuation, represent a closed class, verbs, adjectives and
nouns do not, and it is problematic to determine and list all those
that can function as an attribution device. Lastly, the most substantial class of
attribution cues, the verb, cannot be assigned a type a priori. Many verbs are
polysemous and cover meanings which belong to different attribution types (see
4.1.5). Disambiguation can only rely on the context.
The generic verb ‘fare’ (to do) can also be used as reportive (Renzi, 1995),
although it is quite colloquial and limited to introducing reported direct speech as
for example ‘Mario mi fa: “Come stai?”’ (Mario goes: “How are you?”). More often it
is combined in an expression, e.g. ‘fare il punto’ (to define/clarify), ‘fare notare’
(166) (to point out), ‘fare riferimento’ (167) (to make reference), ‘fare il nome’ (168)
(to mention). This verb cannot be considered as a reliable indicator of an
attribution and is so frequent that it is not feasible to check every instance of it to
ensure it does not entail an attribution relation.
(166) A seguito dell’operazione, l’azionariato della Cementeria di Merone – fa
notare ancora il comunicato diffuso ieri – rimane invariato; (ISST sole103)
Following the transaction, the shareholding structure of the Merone cement works –
the announcement released yesterday further points out – remains unchanged;
(167) Sanpaolo fa riferimento al <<prezzo già concordato con Sasea di 10
miliardi>>… (ISST sole113)
Sanpaolo makes reference to the <<price already agreed with Sasea of 10
billion>>…
(168) …sarebbero interessati acquirenti stranieri e si fa il nome di Bouygues).
(ISST sole117)
…foreign buyers are(QUOT.COND) interested and the name of Bouygues is
mentioned).
Reaching a satisfactory description of possible cues and their types seems at
present, if not unfeasible, quite unlikely to succeed. Nonetheless, a first effort to
collect a partial list of cues was made. Italian cues were partly taken from Renzi
(1995) and Knott (1996). To this first group, which includes some prepositions,
prepositional groups and verbs, were added cues found in the corpus and
deverbal noun cues derived from the listed verbs. In order to enlarge the list of
reportive verbs, English verbal cues were extracted from the PDTB.
The list of Italian cues is reported in Appendix 2. The inventory reports the
Italian cue, followed by its English equivalent in italics and, when possible, the
overall number of occurrences of the lemma in the corpus. For multi-word
expressions and generic verbs this figure was not retrieved, which is signalled with
an ‘N’. Cues are classified according to their class, i.e. verbs, nouns, prepositions,
prepositional groups, grammatical markers, punctuation. Verbs have also been
classified according to their type. Some polysemous verbs are reported in more
than one type group. Their classification, however, is only a suggestion, as the
context has to be considered first. The purpose of the classification is in fact not to
determine once and for all the type of each verb but to provide a list of members in
order to make it easier to identify the different types and compare any verb with
them.
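The structure of an inventory entry as described above can be sketched with a small record type. The class `CueEntry`, the helper names and the field names are hypothetical; the sample figures are copied from Appendix 2, and an ‘N’ count is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CueEntry:
    italian: str
    english: str
    occurrences: Optional[int]   # None stands for 'N' (figure not retrieved)
    cue_class: str               # e.g. "verb", "noun", "preposition"
    types: Tuple[str, ...] = ()  # suggested attribution types (verbs only)


def parse_count(raw):
    """Turn the count column into an int, or None when it is 'N'."""
    raw = raw.strip()
    return None if raw == "N" else int(raw)


INVENTORY = [
    CueEntry("dire", "to say", parse_count("532"), "verb", ("Assertion",)),
    CueEntry("fare", "to do/say", parse_count("N"), "verb", ("Assertion",)),
    # A polysemous verb listed under more than one type group.
    CueEntry("riportare", "to report / to account", parse_count("38"), "verb",
             ("Assertion", "Eventuality")),
    CueEntry("credere", "to believe", parse_count("81"), "verb", ("Belief",)),
    CueEntry("dichiarazione", "declaration", parse_count("59"), "noun"),
]


def candidates(inventory, attribution_type):
    """List the cues suggested for a type; context still decides the final value."""
    return [e.italian for e in inventory if attribution_type in e.types]


print(candidates(INVENTORY, "Assertion"))
```

Storing the types as a tuple reflects that the classification is a suggestion: a verb can belong to several groups at once.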
6.3.1 Extracting Verb Cues from the PDTB
Verbal cues were extracted from the PDTB from files containing the attribution
phrase, i.e. source and cue, of each attribution relation. These files were tagged
with the POS tagger developed by the ‘Center for Sprogteknologi, Københavns
Universitet’, based on the Brill tagger and available open-source
(http://cst.dk/online/pos_tagger/), in order to obtain a list of tokens from a specific
word class: the verb. Afterwards, with the help of the CST’s lemmatizer
(http://cst.dk/online/lemmatiser/), using Celex training data (© Max Planck Institute
for Psycholinguistics, 2001), a list of just the verb lemmas was obtained and
manually reviewed. The resulting list comprises about 470 verbs and is reported in
Appendix 3.
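The extraction steps just described (tag, keep the verbs, lemmatise, deduplicate) can be sketched as below. This does not call the CST tools: it assumes the text has already been tagged with Penn Treebank style tags, where verb tags start with `VB`, and that a lemma lookup is available. The function name, the sample tokens and the lookup table are invented for illustration.

```python
def extract_verb_lemmas(tagged_tokens, lemma_lookup):
    """Collect the distinct lemmas of all verb tokens.

    tagged_tokens: iterable of (token, POS tag) pairs.
    lemma_lookup:  dict mapping lower-cased word forms to lemmas;
                   unknown forms are kept unchanged for manual review.
    """
    lemmas = set()
    for token, tag in tagged_tokens:
        if tag.startswith("VB"):  # assumption: Penn-style verb tags
            form = token.lower()
            lemmas.add(lemma_lookup.get(form, form))
    return sorted(lemmas)


# Invented sample of two attribution phrases, already tagged.
tagged = [
    ("Analysts", "NNS"), ("said", "VBD"),
    ("the", "DT"), ("company", "NN"), ("denies", "VBZ"),
]
lookup = {"said": "say", "denies": "deny"}

print(extract_verb_lemmas(tagged, lookup))
```

The output of such a step would still need the manual review mentioned above, since tagging and lemmatisation errors propagate into the list.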
In order to enlarge the list of Italian reportive verbs, English verb cues from
the PDTB could be usefully employed. By comparing the inventory extracted from
the PDTB with the one already collected for Italian, it would be possible to merge
the two lists.
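One way to prepare such a merge is to check which English verb lemmas are not yet covered by the English glosses of the Italian inventory. The sketch below is a hypothetical procedure, not the one used in the thesis; the function names are invented, the sample PDTB lemmas are made up, and matching on glosses is only a rough first filter that a human would need to review.

```python
def normalise_gloss(gloss):
    """Lower-case a gloss and drop a leading infinitive marker 'to '."""
    gloss = gloss.strip().lower()
    return gloss[3:] if gloss.startswith("to ") else gloss


def missing_from_italian(pdtb_lemmas, italian_glosses):
    """Return PDTB verb lemmas that match no gloss in the Italian list.

    italian_glosses maps an Italian cue to its English gloss; alternatives
    inside a gloss are separated by '/'.
    """
    covered = set()
    for gloss in italian_glosses.values():
        for part in gloss.split("/"):
            covered.add(normalise_gloss(part))
    return sorted(set(pdtb_lemmas) - covered)


# Glosses as given in Appendix 2; the PDTB sample is invented.
italian = {"dire": "to say", "lamentare": "to lament/complain", "negare": "to deny"}
pdtb_sample = ["say", "complain", "estimate", "deny"]

print(missing_from_italian(pdtb_sample, italian))
```

The verbs returned are candidates for which an Italian equivalent would have to be found and checked in the corpus.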
6.4 Summary
The annotation process consists of different steps. First of all it is necessary to
identify in the article, displayed by the tool as plain text, an attribution relation.
Afterwards the text spans corresponding to the components of the attribution need
to be selected. Each selection, or markable, is then assigned a role, i.e. cue,
source, content or supplement. On the source, in the annotation window of the
tool, the attributes have to be specified by selecting the appropriate values. Lastly, the
markables belonging to the same attribution are connected in a relation set.
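The result of these steps can be pictured as a small data structure: markables carrying a role, grouped into one relation set together with the attribute values. The sketch below is independent of the MMAX2 file format; the class names are hypothetical, the sentence is the example from the Abstract, and the exact span boundaries and the attribute values chosen are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ROLES = ("Cue", "Source", "Content", "Supplement")


@dataclass
class Markable:
    span: Tuple[int, int]  # token offsets, end exclusive
    role: str

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")


@dataclass
class AttributionRelation:
    """All markables of one attribution, plus its attribute values."""
    markables: List[Markable] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)

    def text_of(self, role, tokens):
        """Return the text of every markable carrying the given role."""
        return [" ".join(tokens[m.span[0]:m.span[1]])
                for m in self.markables if m.role == role]


tokens = "The minister says that taxes will rise in 2010".split()
relation = AttributionRelation(
    markables=[
        Markable((0, 2), "Source"),
        Markable((2, 3), "Cue"),
        Markable((4, 9), "Content"),
    ],
    attributes={"type": "Assertion", "factuality": "Factual",
                "scopal_change": "None", "source_type": "Other"},
)

print(relation.text_of("Source", tokens), relation.text_of("Content", tokens))
```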
The selection of the text span corresponding to each markable is not free
from obstacles. Some indications were provided which should serve as a
guideline for prospective annotators. Source, content and cue have to include the
elements having that role and any modifiers. Additional useful spans can be
included as supplement.
More indications were given concerning the selection of the values for each
attribute. The selection of the ‘type’ feature cannot rely on a predefined
repository of cues classified according to the attitude they express, as this depends
also on the context. The values for the feature ‘factuality’ should be selected
bearing in mind the question: is there a perceived attribution? ‘Scopal change’ has
instead to be assigned to factual attributions that reverse the polarity of the content,
but also to attributions in which the negation reverses the attitude rather than
negating the existence of a link. The ‘source type’ is probably the least problematic
feature. Its values are assigned on the basis of the source being a determined
entity, either the ‘writer’ or ‘other’, or an impersonal or generic one, the latter being
usually employed to express hearsay. ‘Mixed’ is instead used to label
attributions with multiple sources belonging to different types.
Lastly, an attempt to list and classify Italian cues was presented (Appendix
2). In order to find more verb cues, those in the PDTB have been extracted and
can be compared with the Italian list with the aim of adding the missing ones.
The outcome of this work is reported in Appendix 3.
7 Conclusion
This thesis originates from the intention to compensate for the lack of a discourse
level of annotation in the ISST corpus of Italian. Within the frame of discourse relations,
attribution was chosen as the topic not just because of its relevance, but
also in order to provide this phenomenon with a more complete investigation.
Studies involving attribution, the most relevant being the PDTB (Prasad, Miltsakaki
et al., 2008) and the Opinion Corpus (Wiebe, 2002), have so far approached this
matter only partially. This thesis instead provides a full account of the phenomenon
through an independent approach.
The aim was not only to develop a reliable annotation schema to apply to
the ISST, but also to contribute to the progress of IR and QA studies. The outcome
of this study is particularly relevant, for example, for work dealing with information
and committed to providing software capable of discerning reliable or relevant
sources in order to deliver quality data.
Attribution was analysed in all its linguistic manifestations, comprising
relations at the discourse level as well as intra-sentential ones involving smaller
units such as clauses, phrases and even words. The resulting picture is that of a
very composite phenomenon which, in contrast with the claims of Skadhauge
and Hardt (2005), is only partially syntactically inferable.
On a more tangible side, the present study resulted in the definition of an
annotation schema and the pilot annotation of a portion of the corpus on which its
feasibility was tested. The annotation schema originates from the one adopted in
the PDTB and partially departs from it in order to conform to the present
requirements. An accurate analysis of some available annotation tools also
accompanies the development of the pilot.
In order to facilitate the annotation process and the identification of
attribution a first list of Italian cues, namely the elements functioning as textual
anchor of each relation, was also collected and is included in the thesis.
7.1 Future Work
While this thesis lays the foundations of the attribution annotation project, it surely
does not represent its completion. The proposed annotation schema needs to be
tested and consequently refined. This could be done with the help of one or
more annotators, who should mark the same portion of the corpus already
annotated for the pilot, in order to verify through inter-annotator agreement the clarity of the
schema and amend ambiguous tasks or distinctions. If possible, an ad hoc tool
should be developed in order to facilitate the annotation process even more and
provide all the desired features. Finally, the annotation should be performed on the
whole ISST corpus and statistically evaluated.
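The thesis does not prescribe a particular agreement measure. As one possibility, agreement between two annotators on a nominal attribute such as ‘factuality’ could be computed with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is a minimal implementation under that assumption; the function name and the sample labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented factuality labels for four attributions.
annotator_1 = ["Factual", "Factual", "Non-factual", "Non-factual"]
annotator_2 = ["Factual", "Factual", "Non-factual", "Factual"]

print(cohen_kappa(annotator_1, annotator_2))
```

Low values on a single attribute would point to the distinctions in the schema that need clearer guidelines.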
Some features have also proved to require better investigation. It would be
for example useful to better determine the conditions for a scopal change to
happen and identify other elements potentially able to superficially scope over the
cue or even source span and affecting instead the content. The interaction
between attribution coreference and event anaphora or attribution relations and
other discourse relations should also be further investigated as well as the role of
logical metonymy in some attribution cues.
The project could also be expanded so as to include more features, or
finer feature distinctions. The source type value ‘other’ could in fact be further
specified so as to distinguish determined sources which can be identified, e.g.
proper names, holders of important offices, institutions, from sources having instead a
generic or common referent, e.g. a man, experts, etc. Moreover, attribution could
be expanded so as to comprise feelings and emotions as in the Opinion Corpus
(Wiebe, 2002).
The Italian cue inventory should also be enlarged, first by merging it with
the PDTB verb inventory and subsequently by extracting all the cues in the ISST
corpus. This could be done once the annotation is completed, as it would then be
possible to easily collect the cues already annotated and marked also for their
type.
7.1.1 And Beyond
Another aspect that deserves further investigation emerged during the annotation.
I realised that after a month or two of browsing the corpus or performing the
annotation, and consequently reading the news in it, I was often getting confused. I
was in fact mixing up events from the articles in the corpus, which had therefore happened
about fifteen years ago, with today’s happenings from the online news. The process
I observed could be regarded as evidence of the fact that we tend to remember
contents, or information, and forget how we acquired them. The temporal flattening
I experienced was also a source flattening: I could remember the information I
had read but I was partly no longer able to distinguish between my sources: the ISST
corpus and the Web.
‘Who said that?’ is the question underlying this thesis. However, while trying to
answer it, other questions arise, as it becomes more and more evident that knowing
this answer alone is not sufficient. Knowing in which circumstances the attribution
event was real or took place is also necessary. The same source can think or
assert different things about the same event at different times: just imagine a
politician commenting on an issue before or after his party is elected, or an
expert suggesting a particular investment, and so forth. Sources may also assert or
express the same attitude in different circumstances (e.g. on a formal or humorous
occasion, to audiences differing in expertise or bias, of their own free will or under threat,
etc.), thereby affecting the meaning and the way the content is perceived.
Anchoring attribution to the event it refers to, the circumstances (audience,
situation, etc.) in which it took place and the temporal dimension would allow e.g.
retrieving different opinions expressed by different sources about the same event
(e.g. historical happenings, political issues, etc.) together with the evolution of a
source’s thought concerning the same topic over a period of time. More
importantly, it would lead to a more correct understanding of the content and its
real semantic significance, and consequently to increased precision in the selection
of relevant and trustworthy information.
However far the stream flows, it never forgets its source.
(Nigerian Proverb)
Bibliography
Aikhenvald, A. Y., Evidentiality. Oxford: Oxford University Press, 2004.
Bergler, S., “The semantics of collocational patterns for reporting verbs”. In Proceedings of the Fifth Conference of the European Chapter of the Association for Computational Linguistics, Berlin, Germany, 1991.
Carlson, L., Marcu, D., Discourse tagging manual. ISI Tech Report ISI-TR-545, 2001. Available at: http://www.isi.edu/~marcu/discourse/tagging-ref-manual.pdf
Carlson, L., Marcu, D., Okurowski, M. E., “Building a Discourse-tagged Corpus in the Framework of Rhetorical Structure Theory”. In van Kuppevelt, J., Smith, R., Current Directions in Discourse and Dialogue, pp. 85-112, 2003.
Caselli, T., Ide, N., Bartolini, R., “A Bilingual Corpus of Inter-linked Events”. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.
Cristea, D., Webber, B., “Expectations in Incremental Discourse Processing”. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 88-95, Madrid, Spain, 1997.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., “GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications”. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL ’02). Philadelphia, July 2002.
De Haan, F., “Coding of Evidentiality” / “Semantic distinctions of Evidentiality”. In Haspelmath, M., Dryer, M., Gil, D., Comrie, B. (eds.), The World Atlas of Language Structures Online. Munich: Max Planck Digital Library, chapter 77-78, 2008. Available at: http://wals.info/feature/77 (/78).
Forbes, K., Miltsakaki, E., Prasad, R., Sarkar, A., Joshi, A., Webber, B., “D-LTAG System: Discourse Parsing with a Lexicalized Tree Adjoining Grammar”. Journal of Logic, Language and Information, 12(3), 2003.
Gajewski, J., “Neg-rising Predicates are Definite Plural World Descriptions”. Sinn und Bedeutung, 9, Nijmegen, The Netherlands, 2004.
Giacalone Ramat, A., Topadze, M., “The Coding of Evidentiality: A Comparative Look at Georgian and Italian”. In Rivista di Linguistica Italiana 19,1, special issue on Evidentiality between lexicon and grammar, edited by Mario Squartini, 2007.
Grimes, J., The Thread of Discourse. The Hague: Mouton, 1975.
Grosz, B. J., Sidner, C. L., “Attention, Intention, and the Structure of Discourse”. Computational Linguistics, 12 (3): 175-204, 1986.
Halliday, M. A. K., Hasan, R., Cohesion in English. London: Longman UK group Limited, 1976.
Hobbs, J. R., On the Coherence and Structure of Discourse. Technical Report CSLI-85-37, Center for the Study of Language and Information, Stanford University, 1985.
Hunter, J., Asher, N., Reese, B., Denis, P., “Evidentiality and intensionality: Two uses of reportative constructions in discourse”. Presented at the 2006 Workshop on Constraints in Discourse, Maynooth, Ireland, July 7-9, 2006.
Kiparsky, C., Kiparsky, P., “Fact”. In Jakobovits, L., Steinberg, D. (eds.), Semantics: An Interdisciplinary Reader in Philosophy, Linguistics and Psychology, Cambridge: Cambridge University Press, pp. 345-369, 1971.
Knott, A., A Data-driven Methodology for Motivating a Set of Coherence Relations. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1996.
Lee, A., Joshi, A., “Systematic Mismatches Across Annotations”. ULA Workshop, University of Colorado, Boulder, March, 2008.
Levin, B., English Verb Classes and Alternations: A Preliminary Investigation. Chicago and London: The University of Chicago Press, 1993.
Mann, W. C., Thompson, S. A., “Rhetorical Structure Theory: A theory of text organization”. In Polanyi, L. (ed.) The Structure of Discourse, Ablex, 1988.
Mladová, L., Zikánová, Š., Hajičová, E., “From Sentence to Discourse: Building an Annotation Scheme for Discourse Based on Prague Dependency Treebank”. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.
Montemagni, S., Barsotti, F., Battista, M., Calzolari, N., Corazzari, O., Lenci, A., Zampolli, A., Fanciulli, F., Massetani, M., Raffaelli, R., Basili, R., Pazienza, M. T., Saracino, D., Zanzotto, F., Mana, N., Pianesi, F., Delmonte, R., “Building the Italian Syntactic-Semantic Treebank”. In Anne Abeillé (ed.), Building and using Parsed Corpora, Language and Speech series, Kluwer, Dordrecht, pp. 189-210, 2003.
Moore, J. D., Pollack, M. E., “A Problem for RST: The Need for Multi-Level Discourse Analysis”. Computational Linguistics, 18 (4), 537-544, 1992.
Moore, J. D., Wiemer-Hastings, P., “Discourse in Computational Linguistics and Artificial Intelligence”. In A. C. Graesser, M. A. Gernsbacher, S. R. Goldman (Eds.), Handbook of Discourse Processes, London: Lawrence Erlbaum, pp. 439-485, 2003.
Mueller, C., MMAX2 Annotation Tool - Quickstart Guide and Style Sheet Guide, EML Research gGmbH, 27th – 28th October 2004. Available at: http://mmax2.sourceforge.net/
Mueller, C., MMAXQL The MMAX2 Query Language – Reference Manual (draft), EML Research gGmbH, 12th August 2004. Available at: http://mmax2.sourceforge.net/
Mueller, C., Strube, M., “MMAX: A Tool for the Annotation of Multi-modal Corpora”. In Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Washington, pp. 45-50, August 2001.
Mueller, C., Strube, M., “Multi-level Annotation of Linguistic Data with MMAX2”. In Braun, S., Kohn, K., Mukherjee, J. (Eds.), Corpus Technology and Language Pedagogy. New Resources, New Tools, New Methods (English Corpus Linguistics, vol. 3), Frankfurt: Peter Lang, pp. 197-214, 2006.
Murphy, A. C., “Markers of attribution in English and Italian opinion articles: A comparative corpus-based study”. ICAME Journal vol. 29 pp. 131-150, 2005.
Ogren, P. V., “Knowtator: A Protégé Plug-in for Annotated Corpus Construction”. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Morristown, USA, pp. 273-275, 2006.
Péry-Woodley, M.P., Scott, D., “Computational Approaches to Discourse and Document Processing”. T.A.L. 47(2): 7-19, 2006.
Polanyi, L., Scha, R. J. H., “On the Recursive Structure of Discourse”. In Ehlich, K., van Riemsdijk, H. (eds.), Connectedness in Sentence, Discourse and Text, Tilburg: Tilburg University, pp. 141-178, 1983.
Poesio, M., Delmonte, R., Bristot, A., Chiran, L., Tonelli, S., The VENEX Corpus of Anaphora and Deixis in Spoken and Written Italian (draft), 2009. Available at: http://cswww.essex.ac.uk/staff/poesio/
Prasad, R., Dinesh, N., Lee, A., Joshi, A., Webber, B., “Attribution and its Annotation in the Penn Discourse TreeBank”. In Traitement Automatique des Langues, Special Issue on Computational Approaches to Document and Discourse, vol. 47, no. 2: 43-64, 2007.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B., “The Penn Discourse TreeBank 2.0”. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.
Prasad, R., Husain, S., Sharma, D.M., Joshi, A., “Towards an Annotated Corpus of Discourse Relations in Hindi”. In Proceedings of IJCNLP Workshop on Asian Language Resources, Hyderabad, India, 2008.
Prasad, R., Miltsakaki, E., Dinesh, N., Lee, A., Joshi, A., Robaldo, L., Webber, B., The Penn Discourse Treebank 2.0. Annotation Manual. IRCS Technical Report IRCS-08-01. Institute for Research in Cognitive Science, University of Pennsylvania, 2008.
Prasad, R., Miltsakaki, E., Joshi, A., Webber, B., “Annotation and data mining of the Penn Discourse TreeBank”. In ACL Workshop on Discourse Annotation, Barcelona, Spain, 2004.
Renzi, L., Salvi, G., Cardinaletti, A., Grande Grammatica Italiana di Consultazione. Bologna: Il Mulino, vol. III: 431-436, 1995.
Sag, I. A., Pollard, C., “An Integrated Theory of Complement Control”, Language, vol. 67, n° 1: 63-113, 1991.
Saurí, R., Pustejovsky, J., “From Structure to Interpretation: A Double-layered Annotation for Event Factuality”. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.
Schmidt, T., “The Transcription System EXMARaLDA: An Application of the Annotation Graph Formalism as the Basis of a Database of Multilingual Spoken Discourse”. In Bird, S., Buneman, P., Liberman, M. (eds.), Proceedings of the IRCS Workshop on Linguistic Databases 11-13 December 2001, Philadelphia: Institute for Research in Cognitive Science, University of Pennsylvania, pp. 219-227, 2001. Available at: http://www1.uni-hamburg.de/exmaralda/files/IRCS_Paper.pdf
Skadhauge, P. R., Hardt, D., “Syntactic Identification of Attribution in the RST Treebank”. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, Korea, 11-13 October, 2005.
Soria, C., Ferrari, G., “Lexical marking of discourse relations - some experimental findings”. In Proceedings of the Conference workshop Discourse Relations and Discourse Markers at COLING-ACL'98, pp. 36-42. Montréal, Québec, Canada, 1998.
Talmy, L., Toward a Cognitive Semantics. Cambridge, MA: Massachusetts Institute of Technology, vol. II, 2000.
Webber, B., “Accounting for Discourse Relations: Constituency and Dependency”. In Butt, M., Dalrymple, M. and King, T., Intelligent Linguistic Architectures, Stanford: CSLI Publications, pp. 339-360, 2006.
Webber, B., Joshi, A., Miltsakaki, E., Prasad, R., Dinesh, N., Lee, A., Forbes, K., “A Short Introduction to the Penn Discourse TreeBank”. In Copenhagen Working Papers in Language and Speech Processing, 2005.
Webber, B., Stone, M., Joshi, A., “Anchoring a Lexicalized Tree-Adjoining Grammar for Discourse”. In Coling/ACL Workshop on Discourse Relations and Discourse Markers, Montreal, Canada, pp. 86-92, 1998.
Webber, B., Stone, M., Joshi, A., Knott, A., “Anaphora and Discourse Structure”. Computational Linguistics 29: 545-587, 2003.
Webber, B., Stone, M., Joshi, A., Knott, A., “Discourse Relations: A Structural and Presuppositional Account Using Lexicalised TAG”. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, pp. 41-48, 1999.
Wiebe, J., Instructions for annotating opinions in newspaper articles. Technical report TR-02-101, Department of Computer Science, University of Pittsburgh, 2002.
Wiebe, J., Wilson, T., Cardie, C., “Annotating Expressions of Opinions and Emotions in Language”. Language Resources and Evaluation 1(2), 2005.
Williams, S., Power, R., “Deriving Rhetorical Complexity Data from the RST-DT Corpus”. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, 28-30 May, 2008.
Wilson, T., Wiebe, J., “Annotating Attributions and Private States”. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, Ann Arbor, Michigan, 2005.
Wolf, F., Gibson, E., “Representing Discourse Coherence: A Corpus-based Study”. Computational Linguistics 31: 249-287, 2005.
Xue, N., “Annotating Discourse Connectives in the Chinese Treebank”. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, Ann Arbor, Michigan, 2005.
Zeyrek, D., Webber, B., “A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus”. Paper presented at The 6th workshop on Asian Language Resources, The 3rd International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India, 2008.
Abbreviations and Acronyms
AO          Abstract Object
AR          Attribution Relation
Arb         Arbitrary
Arg1        Argument 1
CDTB        Chinese Discourse Treebank
Comm        Communication (attribution type)
cs          Corriere della Sera
Ctrl        Control (attribution type)
D-LTAG      Lexicalised Tree-Adjoining Grammar for Discourse
DPT         Discourse Parse Tree
DS          Discourse Segment
DTD         Document Type Definition
EDU         Elementary Discourse Unit
els         else
EXMARaLDA   EXtensible MARkup Language for Discourse Annotation
Ftv         Factive (attribution type)
GATE        General Architecture for Text Engineering
IO          Indirect Object
IR          Information Retrieval
ISST        Italian Syntactic-Semantic Treebank
ITB         Italian TimeBank
IWN         ItalWordNet
LDM         Linguistic Discourse Model
MTC         METU Turkish Corpus
NL          Natural Language
NP          Noun Phrase
Ot          Other
PAtt        Propositional Attitude (attribution type)
PDT         Prague Dependency Treebank
PDTB        Penn Discourse TreeBank
period      Periodicals
POS         Part Of Speech
PTB         Penn TreeBank
QA          Question Answering
re          Repubblica
RR          Rhetorical Relation
RST         Rhetorical Structure Theory
RST-DT      Rhetorical Structure Theory Discourse Treebank
sole        Il Sole 24 Ore
Sup1        Supplement 1
TAG         Tree Adjoining Grammar
WALS        World Atlas of Language Structures
Wr          Writer
WSJ         Wall Street Journal
XML         EXtensible Markup Language
XSL         EXtensible Stylesheet Language
Appendix 1 – MMAX2 Code
MMAX2 StyleSheet
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:mmax="org.eml.MMAX2.discourse.MMAX2DiscourseLoader"
    xmlns:Attribution_relation="www.eml.org/NameSpaces/Attribution_relation">
  <xsl:output method="text" indent="no" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="words">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="word">
    <xsl:value-of select="mmax:registerDiscourseElement(@id)"/>
    <xsl:apply-templates select="mmax:getStartedMarkables(@id)" mode="opening"/>
    <xsl:value-of select="mmax:setDiscourseElementStart()"/>
    <xsl:apply-templates/>
    <xsl:value-of select="mmax:setDiscourseElementEnd()"/>
    <xsl:apply-templates select="mmax:getEndedMarkables(@id)" mode="closing"/>
    <xsl:text> </xsl:text>
  </xsl:template>
  <xsl:template match="Attribution_relation:markable" mode="opening">
    <xsl:value-of select="mmax:addLeftMarkableHandle(@mmax_level, @id, '[')"/>
  </xsl:template>
  <xsl:template match="Attribution_relation:markable" mode="closing">
    <xsl:value-of select="mmax:addRightMarkableHandle(@mmax_level, @id, ']')"/>
  </xsl:template>
</xsl:stylesheet>
MMAX2 Scheme
<?xml version="1.0" encoding="UTF-8"?>
<annotationscheme>
  <attribute id="role" name="attribution_role" type="nominal_button">
    <value name="None"/>
    <value name="Cue" next="attribution_type, attribution_factuality, source_type, source_uniID"/>
    <value name="Source"/>
    <value name="Content"/>
    <value name="Supplement"/>
  </attribute>
  <attribute id="attribution_type" name="Type" type="nominal_list">
    <value name="None"/>
    <value name="Assertion" next="attribution_scopal_change"/>
    <value name="Fact"/>
    <value name="Belief" next="attribution_scopal_change"/>
    <value name="Eventuality" next="attribution_scopal_change"/>
  </attribute>
  <attribute id="attribution_factuality" name="Factuality" type="nominal_button">
    <value name="Factual"/>
    <value name="Non-factual"/>
  </attribute>
  <attribute id="attribution_scopal_change" name="Scopal_change" type="nominal_button">
    <value name="None"/>
    <value name="Scopal_change"/>
  </attribute>
  <attribute id="source_type" name="Source" type="nominal_list">
    <value name="Writer"/>
    <value name="Other"/>
    <value name="Arbitrary"/>
    <value name="Mixed"/>
  </attribute>
  <attribute id="source_uniID" name="source_ID" type="freetext">
    <value id="source_ID_uni" name="source_ID"/>
  </attribute>
  <attribute id="Relation" name="Relation" type="markable_set" style="rcurve" color="red">
    <value name="Relation"/>
  </attribute>
</annotationscheme>
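As a quick check of the scheme, the dependencies expressed through the `next` attribute can be listed programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the scheme; the function name `dependencies` is hypothetical, and interpreting `next` as "attributes that become available once this value is selected" is an assumption about how MMAX2 reads the scheme.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the annotation scheme above (only some attributes and values).
SCHEME = """<annotationscheme>
  <attribute id="role" name="attribution_role" type="nominal_button">
    <value name="None"/>
    <value name="Cue" next="attribution_type, attribution_factuality, source_type, source_uniID"/>
    <value name="Source"/>
  </attribute>
  <attribute id="attribution_type" name="Type" type="nominal_list">
    <value name="Assertion" next="attribution_scopal_change"/>
    <value name="Fact"/>
  </attribute>
</annotationscheme>"""


def dependencies(scheme_xml):
    """Map (attribute id, value name) to the attribute ids listed in 'next'."""
    root = ET.fromstring(scheme_xml)
    deps = {}
    for attribute in root.iter("attribute"):
        for value in attribute.iter("value"):
            nxt = value.get("next")
            if nxt:
                key = (attribute.get("id"), value.get("name"))
                deps[key] = [name.strip() for name in nxt.split(",")]
    return deps


for key, followers in dependencies(SCHEME).items():
    print(key, "->", followers)
```

In this abridged copy, ‘Fact’ has no `next` entry, so no scopal change attribute is reachable from it, whereas ‘Assertion’ leads to `attribution_scopal_change`.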
MMAX2 Customization
<?xml version="1.0" encoding="UTF-8"?>
<customization>
  <rule pattern="{all}" style="foreground=blue handles=black bold=true "/>
  <rule pattern="attribution_role={cue}" style="background=orange" />
  <rule pattern="attribution_role={content}" style="background=cyan" />
  <rule pattern="attribution_role={source}" style="background=green" />
  <rule pattern="attribution_role={supplement}" style="background=lightGray" />
</customization>
Appendix 2 – Italian Attribution Cues
Verb Cues
Assertion
Italian cue | English equivalent | Occurrences
accusare | to accuse | 35
affermare | to assert | 41
aggiungere | to add | 77
ammettere | to admit | 40
annunciare | to announce | 72
apostrofare | to address | 0
asserire | to assert | 2
augurare | to wish | 11
avvertire | to warn | 20
avvisare | to warn | 4
bisbigliare | to whisper | 0
borbottare | to mumble | 0
chiacchierare | to chat | 4
chiarire | to clarify | 19
chiedere | to ask | 152
cominciare | to commence | 95
commentare | to comment on | 31
comunicare | to communicate | 15
concludere | to conclude | 56
condividere | to share | 11
confermare | to confirm | 80
continuare | to continue | 107
controbattere | to talk-back | 0
declamare | to rave | 1
denunciare | to denounce | 30
dichiarare | to declare | 69
dire | to say | 532
domandare | to ask | 5
elogiare | to praise | 0
esclamare | to exclaim | 0
esprimere | to express | 39
fare | to do/say | N
gridare | to shout | 15
informare | to inform | 11
iniziare | to start | 35
insinuare | to insinuate | 2
invocare | to invoke | 8
lamentare | to lament/complain | 9
mormorare | to murmur | 3
mostrare | to show | 49
narrare | to narrate | 2
negare (-) | to deny | 14
nominare | to mention | 26
osservare | to observe | 28
parlare | to talk | 181
proporre | to propose | 60
raccontare | to tell | 61
replicare | to reply | 11
riassumere | to sum up | 16
ribattere | to talk-back | 3
ricominciare | to start over again | 6
riconoscere | to acknowledge | 29
riferire | to relate | 42
rimproverare | to reproach | 3
ripetere | to repeat | 36
riportare | to report | 38
riportare | to account | 38
riprendere | to resume | 41
rispondere | to answer | 83
rivelare | to reveal | 34
sbottare | to burst out | 0
scrivere | to write | 112
seguitare | to continue | 1
soggiungere | to add | 0
sostenere | to claim | 75
spiegare | to explain | 115
testimoniare | to testify | 9
urlare | to shout | 8
Belief
Italian cue | English equivalent | Occurrences
credere | to believe | 81
dubitare | to doubt | 2
immaginare | to imagine | 18
pensare | to think | 134
ponderare | to ponder | 0
riflettere | to think | 18
supporre | to assume | 2
Fact
apprendere            to learn              3
capire                to understand         81
constatare            to ascertain          4
dimenticare           to forget             25
dimostrare            to prove              43
essere a conoscenza   to know               N
evidenziare           to point out          28
leggere               to read               48
notare                to note               15
osservare             to observe            28
rendersi conto        to realise            N
ricordare             to remember           105
rilevare              to point out          34
rimpiangere           to regret             2
sapere                to know               182
sentire               to hear/feel          89
udire                 to hear               3
vedere                to see                284
venire a conoscenza   to get to know        N
venire in mente       to remember           N

Eventuality
accettare             to accept             48
acconsentire          to agree              0
accordarsi            to arrange            0
appoggiare            to support            8
aspettarsi            to expect             44
assicurare            to assure             33
augurarsi             to wish oneself       N
bramare               to long for           0
comandare             to command            1
concordare            to agree/arrange      16
condividere           to share              11
consentire            to allow              81
consigliare           to advise             23
convenire             to agree              8
declinare             to refuse             1
desiderare            to wish               6
discordare            to disagree           0
essere d’accordo      to agree              N
implorare             to beg                1
imporre               to impose             42
intendere             to intend             49
invocare              to invoke             8
lasciare              to let                121
minacciare            to threaten           24
ordinare              to order              9
permettere            to allow              46
persuadere            to persuade           1
pregare               to pray               1
promettere            to promise            27
provare               to prove              27
raccomandare          to recommend          1
rassicurare           to reassure           10
rifiutare             to disagree           24
riportare             to account            38
sospettare            to suspect            7
sostenere             to support            75
sperare               to hope               47
suggerire             to suggest            22
supplicare            to plead              0
temere                to fear               30
volere                to want               626
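The verb cue classes listed in this appendix can be read as a small lexicon. The sketch below shows one possible way of loading an excerpt of it and looking up candidate cues in a lemmatised sentence. The names `VERB_CUES`, `build_lexicon` and `candidate_cues` are hypothetical, only a handful of entries are copied from the lists, and the third field is simply the number given next to each cue in the lists. A match only marks a candidate, since a listed verb does not always introduce an attribution.

```python
from collections import defaultdict

# Excerpt only: (lemma, class, number) triples copied from the verb cue lists.
# "sostenere" appears twice because the lists give it under two classes.
VERB_CUES = [
    ("dire", "Assertion", 532),
    ("sostenere", "Assertion", 75),
    ("pensare", "Belief", 134),
    ("sapere", "Fact", 182),
    ("volere", "Eventuality", 626),
    ("sostenere", "Eventuality", 75),
]


def build_lexicon(entries):
    """Group the cue classes by lemma, keeping the order of the entries."""
    lexicon = defaultdict(list)
    for lemma, cue_class, _number in entries:
        lexicon[lemma].append(cue_class)
    return dict(lexicon)


def candidate_cues(lemmas, lexicon):
    """Return (position, lemma, classes) for every lemma found in the lexicon."""
    return [
        (position, lemma, lexicon[lemma])
        for position, lemma in enumerate(lemmas)
        if lemma in lexicon
    ]


if __name__ == "__main__":
    lexicon = build_lexicon(VERB_CUES)
    # Assumption: the input is already lemmatised, one lemma per token.
    sentence = ["il", "ministro", "sostenere", "che", "il", "tassa", "aumentare"]
    print(candidate_cues(sentence, lexicon))
```

Keeping all classes for a lemma, rather than a single one, reflects the fact that some verbs (e.g. sostenere, riportare, condividere) are listed under more than one class.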
Other Cues

Noun Markers
acclamazione          applause              1
accordo               agreement             109
affermazione          assertion             10
ammirazione           admiration            2
ammissione            admission             2
annuncio              announcement          13
appello               appeal                20
appoggio              support               4
apprezzamento         appreciation          4
approvazione          approval              20
aspettativa           expectation           5
augurio               wish                  5
avvertimento          warning               3
certezza              certainty             15
chiarimento           clarification         6
comando               command               14
commento              comment               41
comunicato            release               19
congettura            conjecture            0
conoscenza            knowledge             15
consenso              consensus             26
consiglio             advice                136
constatazione         realization           3
credenza              belief                1
deposizione           deposition            2
desiderio             desire                4
dichiarazione         declaration           59
dimostrazione         demonstration         8
discordanza           disagreement          1
disprezzo             contempt              1
domanda               question              42
dubbio                doubt                 27
esclamazione          exclamation           0
grido                 shout                 8
idea                  idea                  55
illazione             insinuation           0
implorazione          imploration           0
imposizione           imposition            3
informazione          information           89
insinuazione          insinuation           2
intesa                agreement             39
invocazione           invocation            1
lamentela             complaint             2
lode                  praise                0
memoria               memory                19
mormorio              murmuring             0
narrazione            narration             0
nota                  note                  71
opinione              opinion               21
ordine                order                 93
osservazione          observation           12
parola                word                  76
patto                 pact                  19
paura                 fear                  31
pensiero              thought               11
permesso              permission            4
persuasione           persuasion            0
petizione             petition              0
plauso                approval              0
posizione             position              71
preghiera             prayer                1
promessa              promise               9
proposta              proposal              73
punto di vista        point of view         N
raccomandazione       recommendation        2
racconto              story                 12
rassicurazione        reassurance           0
replica               reply                 4
resoconto             account               2
ricordo               memory                9
rifiuto               refusal               6
riflessione           reflection            13
rilevazione           survey                5
rimpianto             regret                0
risposta              answer                43
rivelazione           revelation            8
segnalazione          signalling            9
sensazione            feeling               12
sostegno              support               15
speranza              hope                  23
suggerimento          suggestion            4
supplica              plea                  0
supporto              support               12
supposizione          supposition           0
testimonianza         testimony             9
timore                fear                  13
urlo                  shout                 4
visione               point of view         5
visione               vision                N
volontà               will                  25
Prepositions
secondo (348)         according to
per (3236)            for

Prepositional groups
per quanto riguarda   as far as it concerns
a detta di            according to
agli occhi di         in the eyes of
nell’ottica di        in the perspective of
a parere di           in the opinion of
stando a              according to

Punctuation
<<...>> (824)
“...” (1783)

Mode
quotative conditional
Appendix 3 – PDTB Verb Cues
accept, accompany, accord, accuse, acknowledge, acquire, act, add, address, admit, adopt, advance, advertise, advise, affect, agree, allege, allow, amend, analyse, anger, announce, ansie, anticipate, appear, appoint, appreciate, argue, arise, ask, assert, assist, associate, assume, assure, attach, attend, attest, attribute, auction, avert, avoid
base, be, bear, become, begin, believe, bet, bid, blame, boast, bolster, break, broadcast, brook, build, bury, buy
calculate, call, capture, caution, challenge, characterize, charge, chastise, check, cheque, chuckle, circulate, cite, claim, classify, clear, close, come, comment, compare, compile, complain, concede, concern, conclude, concur, conduct, confess, confide, confirm, consider, consult, contain, contend, contest, continue, control, convince, copy, count, counter, cover, create, credit, criticize, cut
date, dawn, decide, declare, decline, deem, defend, define, demand, deny, describe, design, determine, develop, disappoint, disclose, discover, dismay, dispute, diversify, doubt, draft, draw, dream, drill, drip, drop, dub
earn, eat, echo, elaborate, emphasize, empty, emulate, encourage, end, erupt, establish, estimate, evaluate, evince, examine, exclaim, exclude, exist, expect, expire, explain, express
favour, fear, feed, feel, figure, file, finance, find, finger, fireproof, float, flock, fly, follow, forecast, foresee, form, found, franchise, free, fret, future
gain, get, give, go, gripe, grow, gush
hamper, hand, handle, head, hear, help, highlight, hint, hire, hit, hold, hope
identify, ignore, illustrate, imply, impose, include, increase, indicate, indict, inform, insinuate, insist, inspire, integrate, intend, interject, interpret, interview, introduce, invent, invest, investigate, involve, issue
join, joke, jump
keep, kill, know
lament, laud, laugh, launch, lay, lead, leak, lean, learn, lease, leave, lecture, left, legislate, light, like, link, liquidate, list, lit, locate, lose, love
maintain
make, manage, mark, market, mean, meet, mention, monitor, motor, muse
name, negotiate, nickname, note, notice, notify
observe, offer, operate, oppose, order, organize, originate, oversee, own
participate, pay, perishables, permit, persuade, pick, place, plan, plant, play, pledge, plot, point, poll, ponder, pour, practise, praise, predict, prepare, present, prime, proclaim, produce, profess, programme, project, promise, promote, prompt, propose, prosecute, protect, prove, provide, publish, pull, purchase, push, put
question, quip, quote
raise, rally, rave, reach, read, realize, reason, rebuild, recall, receive, reckon, recognize, recommend, reconstruct, record, recount, recruit, reduce, refer, reflect, regard, reign, reiterate, reject, relate, release, relieve, remark, remember, remind, repeat, reply, represent, request, require, research, resort, respond, restore, retail, retire, reveal, rid, rob, rule, run
say, scare, scoff, score, scotch, scream, see, seek, seem, sell, send, serve, set, shake, shout, shove, show, shrug, sidestep, sigh, sign, signal, slide, snap, sniff, snort, solicit, specialize, specify, speculate, spell, spend, spill, sponsor, spot, spread, spur, stand, start, state, stem, stop, strengthen, stress, strike, stroke, study, stun, suggest, suit, summarize, supply, support, survey, swap, swear, swivel
take, talk, teach, tell, tend, test, testify, theorize, think, threaten, title, tout, track, trade, trail, travel, trouble, trundle, try
understand, unleash, urge, use
vacate, value, verify, view, visit, voice, volunteer, vote, vow
walk, want, warn, watch, waver, wear, welcome, win, wonder, work, worry, write
yell