Corpus-based evaluation of Referring Expression Generation
Albert Gatt, Ielka van der Sluis, Kees van Deemter
Department of Computing Science, University of Aberdeen


Page 1:

Corpus-based evaluation of Referring Expression Generation

Albert Gatt, Ielka van der Sluis, Kees van Deemter

Department of Computing Science, University of Aberdeen

Page 2:

Focus of this talk

● Generation of Referring Expressions (GRE)
● A very large part of this is Content Determination:

  Knowledge Base + intended referent (R)
    → search for distinguishing properties
    → "description" = a semantic representation

● Evaluation challenges:
  - Semantically intensive
  - Pragmatic issues: identify, inform, signal agreement... (cf. Jordan 2000, ...)
  - "Human gold standard": one and only one standard per input?
  - Evaluation metric: an all-or-none affair?

Page 3:

Outline of our proposal

● A large corpus of descriptions (2000+), constructed via a controlled experiment. Part of the TUNA Project.
● Semantic annotation.
● Balance.
● Expressive variety.
● Related proposals on human gold standards:
  - M. Walker: Language Productivity Assumption
  - J. Viethen: GRE resources are difficult to obtain from naturally occurring text.

Page 4:

Corpora and NLG: Transparency

● Requirements for a GRE evaluation corpus:
  - Semantic transparency: linguistic realisation + semantic representation + domain
  - Pragmatic transparency: human intention = algorithmic intention
● These requirements ensure that a match between the output of content determination and a corpus instance is assessed on a level playing field.
● Perhaps the same can be said of other Content Determination tasks.

Page 5:

Example

"the large red sofa"
"the large, bright red settee"
"the red couch which is larger than the rest"

● All of the above are co-extensive.
● An algorithm may generate a logical form that "means" all of the above.
● Corpus annotation should indicate that all realisations of the same property denote that property.

Page 6:

Corpora and NLG: Balance

● Corpora are sources of exclusively positive evidence. If C is not in the corpus, should the generator avoid it?
● Frequency of occurrence: if C' is very frequent, should the generator always use it? (Only if we know that C' is produced to the exclusion of other interesting possibilities.)
● So there is a trade-off between:
  - ecological validity
  - adequacy for the evaluation task
● Partial solution: an experimental design that generates a balanced corpus.

Page 7:

Example (cont'd.)

● Relevant variables:
  - When are A and A' used when not required?
  - When are A and A' omitted when required?
● Ideal setting: A and A' are (not) required in an equal number of instances.
● The same argument applies to, e.g., the communicative setting.

Hypothesis: incremental algorithms with preference order A >> A' are better than those with preference order A' >> A.
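The hypothesis above can be illustrated with a minimal sketch of an incremental content-determination algorithm in the style of Dale and Reiter; the toy domain, entity names, and preference order below are our own illustrative assumptions, not part of the TUNA corpus.

```python
# Minimal sketch of an incremental GRE algorithm: try attributes in a
# fixed preference order, keeping each one that rules out at least one
# distractor, until the referent is uniquely identified.

def incremental(referent, distractors, preference_order, domain):
    """Select attribute-value pairs distinguishing `referent`
    from `distractors`, following `preference_order`."""
    description = []
    remaining = set(distractors)
    for attr in preference_order:
        value = domain[referent].get(attr)
        if value is None:
            continue
        # Distractors whose value differs are ruled out by this attribute.
        ruled_out = {d for d in remaining if domain[d].get(attr) != value}
        if ruled_out:                      # attribute has discriminatory power
            description.append((attr, value))
            remaining -= ruled_out
        if not remaining:                  # referent uniquely identified
            return description
    return description                     # may be non-distinguishing

# Toy furniture domain (hypothetical values):
domain = {
    "e1": {"type": "sofa", "colour": "red", "size": "large"},
    "e2": {"type": "sofa", "colour": "blue", "size": "large"},
    "e3": {"type": "desk", "colour": "red", "size": "small"},
}
# With preference order colour >> size >> type:
print(incremental("e1", ["e2", "e3"], ["colour", "size", "type"], domain))
# [('colour', 'red'), ('size', 'large')]
```

A different preference order (e.g. size >> colour) can yield a different description for the same referent, which is exactly what the hypothesis on preference orders is about.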

Page 8:

The TUNA Reference Corpus

● The corpus meets the transparency and balance requirements.
● Different domains (of different complexity):
  - A domain of simple furniture objects: 4 attributes + horizontal and vertical location
  - A domain of real b&w photographs of people: 9 attributes + horizontal and vertical location
● Different communicative situations: fault-critical vs. non-fault-critical
● Different kinds of attributes:
  - Absolute properties (e.g. colour, baldness)
  - Gradable properties (e.g. size, relative position)
● Different numbers of referents:
  - Reference to individuals ("the red sofa")
  - Reference to sets ("the red and blue sofas")

Page 9:

Web-based corpus collection experiment

Page 10:

With (limited) feedback…

Page 11:

Design

● Balance within subjects:
  - Content: for each attribute combination, there are equal numbers of domains in which the combination is minimally required to distinguish the referents.
  - Cardinality: number of plural and singular references.
● Between subjects:
  - Fault-critical vs. non-fault-critical communicative situation.
  - Use of location.

Page 12:

Corpus annotation

<DOMAIN condition="3">
  <ENTITY type="target">
    <ATTRIBUTE name="type" value="sofa" />
    <ATTRIBUTE name="orientation" value="right" />
    <ATTRIBUTE name="size" value="large" />
    <ATTRIBUTE name="colour" value="red" />
    <ATTRIBUTE name="location">
      <ATTRIBUTE name="y-dimension" value="1" />
      <ATTRIBUTE name="x-dimension" value="3" />
    </ATTRIBUTE>
  </ENTITY>
</DOMAIN>

● Domain representation makes all attributes of all domain entities explicit.
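As a sketch of how such domain markup might be consumed, the following uses Python's standard library to read entity attributes into a dictionary. The parsing code is our assumption, not part of the TUNA tools, and the XML snippet is simplified from the example above.

```python
# Read a DOMAIN annotation into {entity role: {attribute: value}}
# using the element and attribute names from the slide's example.
import xml.etree.ElementTree as ET

xml = """
<DOMAIN condition="3">
  <ENTITY type="target">
    <ATTRIBUTE name="type" value="sofa" />
    <ATTRIBUTE name="colour" value="red" />
    <ATTRIBUTE name="size" value="large" />
  </ENTITY>
</DOMAIN>
"""

def entity_attributes(domain_xml):
    """Map each entity's role to its {attribute: value} dict."""
    root = ET.fromstring(domain_xml)
    entities = {}
    for entity in root.findall("ENTITY"):
        # iter() also picks up nested ATTRIBUTEs (e.g. under location).
        attrs = {a.get("name"): a.get("value")
                 for a in entity.iter("ATTRIBUTE")}
        entities[entity.get("type")] = attrs
    return entities

print(entity_attributes(xml))
# {'target': {'type': 'sofa', 'colour': 'red', 'size': 'large'}}
```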

Page 13:

Corpus annotation

● 2-level annotation for descriptions:
  - <ATTRIBUTE> tags mark up description segments with the domain information they express.
  - The <DESCRIPTION> tag allows compilation of a logical form from the description.

"the large settee at oblique angle"

<DESCRIPTION num="singular">
  <ATTRIBUTE name="size" value="large">large</ATTRIBUTE>
  <ATTRIBUTE name="type" value="sofa">settee</ATTRIBUTE>
  <ATTRIBUTE name="orientation" value="right">at oblique angle</ATTRIBUTE>
</DESCRIPTION>
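A logical form can be compiled from such a <DESCRIPTION> annotation by collecting the attribute-value pairs it contains. The sketch below is our illustration of this step, not the project's own tooling.

```python
# Compile a logical form -- here a set of (attribute, value) pairs --
# from a DESCRIPTION annotation like the slide's example.
import xml.etree.ElementTree as ET

description = """
<DESCRIPTION num="singular">
  <ATTRIBUTE name="size" value="large">large</ATTRIBUTE>
  <ATTRIBUTE name="type" value="sofa">settee</ATTRIBUTE>
  <ATTRIBUTE name="orientation" value="right">at oblique angle</ATTRIBUTE>
</DESCRIPTION>
"""

def logical_form(desc_xml):
    """Collect the (name, value) pairs expressed by a description."""
    root = ET.fromstring(desc_xml)
    return {(a.get("name"), a.get("value"))
            for a in root.iter("ATTRIBUTE")}

print(sorted(logical_form(description)))
# [('orientation', 'right'), ('size', 'large'), ('type', 'sofa')]
```

Because the result is a set of properties rather than a string, an algorithm's output can be compared against it regardless of how each property was worded ("settee" vs. "sofa" both denote type=sofa).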

Page 14:

How feasible is this annotation?

● Evaluated with 2 independent annotators using the same annotation manual.
● Very high inter-annotator agreement:
  - Furniture domain: ca. 75% perfect agreement; mean Dice coefficient 0.92.
  - People domain: ca. 40% perfect agreement; mean Dice coefficient 0.84.
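The Dice coefficient reported here measures the overlap between two annotators' attribute sets. The formula, 2|A∩B| / (|A| + |B|), is standard; the example sets below are illustrative.

```python
# Dice coefficient between two attribute sets, as used for
# inter-annotator agreement over semantic annotations.

def dice(a, b):
    """Dice coefficient between two sets: 2|A∩B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0          # two empty annotations agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))

annotator1 = {("type", "sofa"), ("colour", "red"), ("size", "large")}
annotator2 = {("type", "sofa"), ("colour", "red")}
print(dice(annotator1, annotator2))  # 0.8
```

A value of 1.0 means the two annotators selected identical attribute sets; the coefficient degrades gracefully with partial overlap, unlike an all-or-none "perfect agreement" count.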

Page 15:

State of the corpus

                  -FC     +FC
furniture   -Loc   300     300
            +Loc   300     300
people      -Loc   270     270
            +Loc   270     270
total             1140    1140

Annotation status: fully annotated / evaluation shows high inter-annotator agreement / annotation in progress.

The corpus is currently available on demand and will be in the public domain by May 2007.

Page 16:

Current uses of the corpus

● Two evaluations, comparing some standard GRE algorithms on singulars and plurals.
● Basic procedure:
  - Run the algorithm over a domain.
  - Compile a logical form from a corpus description.
  - Estimate the degree of match between the description and the algorithm's output.

Page 17:

Future uses

● Machine learning approaches to GRE: the corpus contains a mapping between linguistic and semantic representations.
● Extending the remit of GRE to cover realisation and lexicalisation, exploiting the realisation-semantics mapping.
● Investigating the impact of communicative setting on algorithm performance.
● Comparing the outcomes of corpus evaluation to task-oriented (reader) evaluation.

Page 18:

Conclusion

● NLG is not only about surface linguistic form; many choices are made at a different level.
● Evaluation of Content Determination requires adequate resources. Our arguments are strongly related to those of J. Viethen and M. Walker.
● We argue that evaluation in such tasks is more reliable if resources are semantically/pragmatically transparent and balanced.
● This obviously makes the evaluation exercise more expensive, but it ultimately pays off.

Page 19:

Further info

http://www.csd.abdn.ac.uk/research/tuna/corpus

Page 20:

Design: between subjects

● Fault-critical vs. non-fault-critical instructions:

  Fault-critical: "Our program will eventually be used in situations where it is crucial that it understands descriptions accurately, with no option to correct mistakes…"

  vs.

  Non-fault-critical: "If the computer misunderstands your description and removes the wrong objects, you can point out the right objects for it by clicking on the pictures with the red borders."

● +Location vs. -Location:
  - The row/column of each object was determined randomly at runtime. This increases domain variation and offsets the more determinate nature of other attribute combinations.
  - Some participants could use location, others could not.
  - We considered location a good candidate for a gradable property.