Corpus-based evaluation of Referring Expression Generation
Albert Gatt, Ielka van der Sluis, Kees van Deemter
Department of Computing Science, University of Aberdeen


Page 1:

Corpus-based evaluation of Referring Expression Generation

Albert Gatt, Ielka van der Sluis, Kees van Deemter

Department of Computing Science, University of Aberdeen

Page 2:

Focus of this talk

● Generation of Referring Expressions (GRE)
● A very large part of this is Content Determination:

  Knowledge Base + intended referent (R)
    → search for distinguishing properties
    → "description" = a semantic representation

● Evaluation challenges:
  - Semantically intensive
  - Pragmatic issues: identify, inform, signal agreement... (cf. Jordan 2000, ...)
  - "Human gold standard": one and only one standard per input?
  - Evaluation metric: an all-or-none affair?

Page 3:

Outline of our proposal

● A large corpus of descriptions (2000+), constructed via a controlled experiment. Part of the TUNA Project.
● Semantic annotation.
● Balance.
● Expressive variety.
● Related proposals on human gold standards:
  - M. Walker: Language Productivity Assumption
  - J. Viethen: GRE resources are difficult to obtain from naturally occurring text.

Page 4:

Corpora and NLG: Transparency

● Requirements for a GRE evaluation corpus:
  - Semantic transparency: linguistic realisation + semantic representation + domain
  - Pragmatic transparency: human intention = algorithmic intention
● These requirements ensure that a match between the output of content determination and a corpus instance is assessed on a level playing field.
● Perhaps the same can be said of other Content Determination tasks.

Page 5:

Example

"the large red sofa"
"the large, bright red settee"
"the red couch which is larger than the rest"

● All of the above are co-extensive.
● An algorithm may generate a logical form that "means" all of the above.
● Corpus annotation should indicate that all realisations of the same property denote that property.

Page 6:

Corpora and NLG: Balance

● Corpora are sources of exclusively positive evidence. If C is not in the corpus, should the generator avoid it?
● Frequency of occurrence: if C' is very frequent, should the generator always use it? (Only if we know that C' is produced to the exclusion of other interesting possibilities.)
● So there is a trade-off between:
  - ecological validity
  - adequacy for the evaluation task
● Partial solution: an experimental design that generates a balanced corpus.

Page 7:

Example (cont'd.)

● Relevant variables:
  - When are A and A' used when not required?
  - When are A and A' omitted when required?
● Ideal setting: A and A' are (not) required in an equal number of instances.
● The same argument applies to, e.g., the communicative setting.

Hypothesis: incremental algorithms with preference order A >> A' are better than those with preference order A' >> A.
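The hypothesis above can be illustrated with a minimal sketch of an incremental content-determination algorithm in the style of Dale and Reiter; the toy domain, entity names, and preference order below are our own illustrative assumptions, not part of the TUNA corpus.

```python
# Minimal sketch of an incremental GRE algorithm: try attributes in a
# fixed preference order, keeping each one that rules out at least one
# distractor, until the referent is uniquely identified.

def incremental(referent, distractors, preference_order, domain):
    """Select attribute-value pairs distinguishing `referent`
    from `distractors`, following `preference_order`."""
    description = []
    remaining = set(distractors)
    for attr in preference_order:
        value = domain[referent].get(attr)
        if value is None:
            continue
        # Distractors whose value differs are ruled out by this attribute.
        ruled_out = {d for d in remaining if domain[d].get(attr) != value}
        if ruled_out:                      # attribute has discriminatory power
            description.append((attr, value))
            remaining -= ruled_out
        if not remaining:                  # referent uniquely identified
            return description
    return description                     # may be non-distinguishing

# Toy furniture domain (hypothetical values):
domain = {
    "e1": {"type": "sofa", "colour": "red", "size": "large"},
    "e2": {"type": "sofa", "colour": "blue", "size": "large"},
    "e3": {"type": "desk", "colour": "red", "size": "small"},
}
# With preference order colour >> size >> type:
print(incremental("e1", ["e2", "e3"], ["colour", "size", "type"], domain))
# [('colour', 'red'), ('size', 'large')]
```

A different preference order (e.g. size >> colour) can yield a different description for the same referent, which is exactly what the hypothesis on preference orders is about.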

Page 8:

The TUNA Reference Corpus

● The corpus meets the transparency and balance requirements.
● Different domains (of different complexity):
  - A domain of simple furniture objects: 4 attributes + horizontal and vertical location
  - A domain of real b&w photographs of people: 9 attributes + horizontal and vertical location
● Different communicative situations: fault-critical vs. non-fault-critical
● Different kinds of attributes:
  - Absolute properties (e.g. colour, baldness)
  - Gradable properties (e.g. size, relative position)
● Different numbers of referents:
  - Reference to individuals ("the red sofa")
  - Reference to sets ("the red and blue sofas")

Page 9:

Web-based corpus collection experiment

Page 10:

With (limited) feedback…

Page 11:

Design

● Balance within subjects:
  - Content: for each attribute combination, there are equal numbers of domains in which the combination is minimally required to distinguish the referents.
  - Cardinality: number of plural and singular references.
● Between subjects:
  - Fault-critical vs. non-fault-critical communicative situation.
  - Use of location.

Page 12:

Corpus annotation

<DOMAIN condition="3">
  <ENTITY type="target">
    <ATTRIBUTE name="type" value="sofa" />
    <ATTRIBUTE name="orientation" value="right" />
    <ATTRIBUTE name="size" value="large" />
    <ATTRIBUTE name="colour" value="red" />
    <ATTRIBUTE name="location">
      <ATTRIBUTE name="y-dimension" value="1" />
      <ATTRIBUTE name="x-dimension" value="3" />
    </ATTRIBUTE>
  </ENTITY>
</DOMAIN>

● Domain representation makes all attributes of all domain entities explicit.
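As a sketch of how such domain markup might be consumed, the following uses Python's standard library to read entity attributes into a dictionary. The parsing code is our assumption, not part of the TUNA tools, and the XML snippet is simplified from the example above.

```python
# Read a DOMAIN annotation into {entity role: {attribute: value}}
# using the element and attribute names from the slide's example.
import xml.etree.ElementTree as ET

xml = """
<DOMAIN condition="3">
  <ENTITY type="target">
    <ATTRIBUTE name="type" value="sofa" />
    <ATTRIBUTE name="colour" value="red" />
    <ATTRIBUTE name="size" value="large" />
  </ENTITY>
</DOMAIN>
"""

def entity_attributes(domain_xml):
    """Map each entity's role to its {attribute: value} dict."""
    root = ET.fromstring(domain_xml)
    entities = {}
    for entity in root.findall("ENTITY"):
        # iter() also picks up nested ATTRIBUTEs (e.g. under location).
        attrs = {a.get("name"): a.get("value")
                 for a in entity.iter("ATTRIBUTE")}
        entities[entity.get("type")] = attrs
    return entities

print(entity_attributes(xml))
# {'target': {'type': 'sofa', 'colour': 'red', 'size': 'large'}}
```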

Page 13:

Corpus annotation

● 2-level annotation for descriptions:
  - <ATTRIBUTE> tags mark up description segments with the domain information they express.
  - The <DESCRIPTION> tag allows compilation of a logical form from the description.

"the large settee at oblique angle"

<DESCRIPTION num="singular">
  <ATTRIBUTE name="size" value="large">large</ATTRIBUTE>
  <ATTRIBUTE name="type" value="sofa">settee</ATTRIBUTE>
  <ATTRIBUTE name="orientation" value="right">at oblique angle</ATTRIBUTE>
</DESCRIPTION>
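A logical form can be compiled from such a <DESCRIPTION> annotation by collecting the attribute-value pairs it contains. The sketch below is our illustration of this step, not the project's own tooling.

```python
# Compile a logical form -- here a set of (attribute, value) pairs --
# from a DESCRIPTION annotation like the slide's example.
import xml.etree.ElementTree as ET

description = """
<DESCRIPTION num="singular">
  <ATTRIBUTE name="size" value="large">large</ATTRIBUTE>
  <ATTRIBUTE name="type" value="sofa">settee</ATTRIBUTE>
  <ATTRIBUTE name="orientation" value="right">at oblique angle</ATTRIBUTE>
</DESCRIPTION>
"""

def logical_form(desc_xml):
    """Collect the (name, value) pairs expressed by a description."""
    root = ET.fromstring(desc_xml)
    return {(a.get("name"), a.get("value"))
            for a in root.iter("ATTRIBUTE")}

print(sorted(logical_form(description)))
# [('orientation', 'right'), ('size', 'large'), ('type', 'sofa')]
```

Because the result is a set of properties rather than a string, an algorithm's output can be compared against it regardless of how each property was worded ("settee" vs. "sofa" both denote type=sofa).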

Page 14:

How feasible is this annotation?

● Evaluated with 2 independent annotators using the same annotation manual.
● Very high inter-annotator agreement:
  - Furniture domain: ca. 75% perfect agreement; mean Dice coefficient 0.92.
  - People domain: ca. 40% perfect agreement; mean Dice coefficient 0.84.
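The Dice coefficient reported here measures the overlap between two annotators' attribute sets. The formula, 2|A∩B| / (|A| + |B|), is standard; the example sets below are illustrative.

```python
# Dice coefficient between two attribute sets, as used for
# inter-annotator agreement over semantic annotations.

def dice(a, b):
    """Dice coefficient between two sets: 2|A∩B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0          # two empty annotations agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))

annotator1 = {("type", "sofa"), ("colour", "red"), ("size", "large")}
annotator2 = {("type", "sofa"), ("colour", "red")}
print(dice(annotator1, annotator2))  # 0.8
```

A value of 1.0 means the two annotators selected identical attribute sets; the coefficient degrades gracefully with partial overlap, unlike an all-or-none "perfect agreement" count.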

Page 15:

State of the corpus

                  -FC     +FC
furniture   -Loc   300     300
            +Loc   300     300
people      -Loc   270     270
            +Loc   270     270
total             1140    1140

Annotation status: fully annotated / evaluation shows high inter-annotator agreement / annotation in progress.

The corpus is currently available on demand and will be in the public domain by May 2007.

Page 16:

Current uses of the corpus

● Two evaluations, comparing some standard GRE algorithms on singulars and plurals.
● Basic procedure:
  - Run the algorithm over a domain.
  - Compile a logical form from a corpus description.
  - Estimate the degree of match between the description and the algorithm's output.

Page 17:

Future uses

● Machine learning approaches to GRE: the corpus contains a mapping between linguistic and semantic representations.
● Extending the remit of GRE to cover realisation and lexicalisation, exploiting the realisation-semantics mapping.
● Investigating the impact of communicative setting on algorithm performance.
● Comparing the outcomes of corpus evaluation to task-oriented (reader) evaluation.

Page 18:

Conclusion

● NLG is not only about surface linguistic form; many choices are made at a different level.
● Evaluation of Content Determination requires adequate resources. Our arguments are strongly related to those of J. Viethen and M. Walker.
● We argue that evaluation in such tasks is more reliable if resources are semantically/pragmatically transparent and balanced.
● This obviously makes the evaluation exercise more expensive, but it ultimately pays off.

Page 19:

Further info

http://www.csd.abdn.ac.uk/research/tuna/corpus

Page 20:

Design: between subjects

● Fault-critical vs. non-fault-critical instructions:

  Fault-critical: "Our program will eventually be used in situations where it is crucial that it understands descriptions accurately, with no option to correct mistakes…"

  vs.

  Non-fault-critical: "If the computer misunderstands your description and removes the wrong objects, you can point out the right objects for it by clicking on the pictures with the red borders."

● +Location vs. -Location:
  - The row/column of each object was determined randomly at runtime. This increases domain variation and offsets the more determinate nature of other attribute combinations.
  - Some participants could use location, others could not.
  - We considered location a good candidate for a gradable property.