Dialogue Structure and Pronoun Resolution
Joel Tetreault and James Allen, University of Rochester, Department of Computer Science
DAARC, September 23, 2004
Reference in Spoken Dialogue
Resolving anaphoric expressions correctly is critical in task-oriented domains
Makes conversation easier for humans
Reference resolution module (RRM) provides feedback to other components in the system, i.e. Incremental Parsing, Interpretation Module
Investigate how to improve the RRM: discourse structure could be effective in reducing the search space of antecedents and improving accuracy (Grosz and Sidner, 1986)
Paucity of empirical work: Byron and Stent (1998), Eckert and Strube (2001), Byron (2002)
Goal
To evaluate whether shallow approaches to dialogue structure can improve a reference resolution algorithm (LRC used as baseline model to augment)
Investigated two models:
Eckert & Strube (manual and automatic versions)
“Literal QUD” model (manual)
Outline
Background
Dialogue Act synchronization (Eckert and Strube model)
QUD (Craige Roberts)
Monroe Corpus
Algorithm
Results
3rd person pronoun evaluation
Dialogue Structure
Summary
Past approaches in structure and reference
Veins: the nuclei of RST trees are the most salient discourse units; the entities in these units are thus more salient than others
Tetreault (2003): Penn Treebank subset annotated with RST. Used G&S approximations to try to improve on the LRC baseline.
Result: performed the same as baseline
Veins: decreased performance slightly
Problem: fine-grained approaches (RST) are difficult to annotate reliably and to apply in real time.
Perhaps shallow approaches can work?
Literal QUD
Questions Under Discussion (Craige Roberts, Jonathan Ginzburg) – “what are we talking about?”: topics create discourse segments
Literally: questions or modals can be viewed as creating a discourse segment
Result: questions provide a shallow discourse structuring, and that may be enough to improve performance, especially in a task-oriented domain
Entities in the QUD main segment can be viewed as the topic
Segment closed when question is answered (use ack sequences, change in entities used)
Only entities from the answer and entities in the question are accessible
Can be used in TRIPS to reduce the search space of entities (set context size)
QUD Annotation Scheme
Annotate:
Start utterance
End utterance
Type (aside, repeated question, unanswered, open-ended, clarification)
Kappa (compared with reconciled data):
Annotator Start End Type Overall
1 0.86 0.80 0.93 0.73
2 0.86 0.73 0.86 0.73
Example - QUD
utt06 U: Where is it?
utt07 U: Just a second
utt08 U: I can't find the Rochester airport
utt09 S: It's
--------------------------------------------------------
utt10 U: I think I have a disability with maps
utt11 U: Have I ever told you that before
utt12 S: It's located on brooks avenue
utt13 U: Oh thank you
utt14 S: Do you see it?
utt15 U: Yes
(QUD-entry
:start utt06
:end utt13
:type clarification)
(QUD-entry
:start utt10
:end utt11
:type aside)
Example - QUD (utt10-11 processed)
utt06 U: Where is it?
utt07 U: Just a second
utt08 U: I can't find the Rochester airport
utt09 S: It's
[utt10,11 removed]
--------------------------------------------------------
utt12 S: It's located on brooks avenue
utt13 U: Oh thank you
utt14 S: Do you see it?
utt15 U: Yes
(QUD-entry
:start utt06
:end utt13
:type clarification)
(QUD-entry
:start utt10
:end utt11
:type aside)
Example - QUD (s13 processed)
[utt06-13 collapsed: {the Rochester airport, brooks avenue}]
--------------------------------------------------------
utt14 S: Do you see it?
utt15 U: Yes
(QUD-entry
:start utt06
:end utt13
:type clarification)
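The bookkeeping in the three example slides can be sketched roughly as follows. This is a minimal illustration, not the TRIPS implementation: the function `close_qud`, the `answer` parameter, the `(id, entities)` history layout, and the per-utterance entity lists are all assumptions made for the example. It assumes an aside is simply dropped, and that any other segment collapses to the entities of its question (start utterance) and its answer.

```python
from typing import List, Optional, Tuple

# One history node: (utterance id or collapsed span, accessible entities).
Node = Tuple[str, List[str]]


def close_qud(history: List[Node], start: str, end: str, qud_type: str,
              answer: Optional[str] = None) -> List[Node]:
    """Return a new history with the closed QUD segment removed or collapsed."""
    ids = [uid for uid, _ in history]
    i, j = ids.index(start), ids.index(end)
    if qud_type == "aside":
        # Assumption: an aside is dropped outright once it ends.
        return history[:i] + history[j + 1:]
    by_id = dict(history)
    kept: List[str] = []
    # Assumption: only entities of the question and the answer stay accessible.
    for uid in (start, answer):
        if uid is None:
            continue
        for entity in by_id.get(uid, []):
            if entity not in kept:
                kept.append(entity)
    return history[:i] + [(f"{start}-{end}", kept)] + history[j + 1:]


# Hypothetical entity lists (already-resolved referents) for utt06-utt13.
history: List[Node] = [
    ("utt06", ["the Rochester airport"]),
    ("utt07", []),
    ("utt08", ["the Rochester airport"]),
    ("utt09", ["the Rochester airport"]),
    ("utt10", ["maps"]),
    ("utt11", []),
    ("utt12", ["the Rochester airport", "brooks avenue"]),
    ("utt13", []),
]

after_aside = close_qud(history, "utt10", "utt11", "aside")
after_clarification = close_qud(after_aside, "utt06", "utt13",
                                "clarification", answer="utt12")
print(after_clarification)
# [('utt06-utt13', ['the Rochester airport', 'brooks avenue'])]
```

Under these assumptions, the pronoun in utt14 ("Do you see it?") searches a history of a single collapsed node instead of eight utterances.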
QUD Issues
Issue 1: easy to detect Q’s (use Speech-Act information), but how do you know Q is answered?
Cue words, multiple acknowledgements, changes in entities discussed provide strong clues that question is finishing, but general questions such as “how are we going to do this?” can be ambiguous
Issue 2: what is more salient to a QUD pronoun – the QUD topic or a more recent entity?
Dialogue Act Segmentation
E&S: model to resolve all types of pronouns (3rd person and abstract) in spoken dialogue
Intuition: grounding is very important in spoken dialogue
Utterances that are not acknowledged by the listener may not be in common ground and thus not accessible to pronominal reference
Dialogue Act Segmentation
Each utterance marked as:
(I): contains content (initiation), question
(A): acknowledgment
(C): combination of the above
(N): none of the above
Basic algorithm: utterances not ack’d or not in a string of I’s are removed from the discourse before next sentence is processed
Evaluation showed improvement for pronouns referring to abstract entities, and strong annotator reliability
Pronoun performance? Unclear: no comparison with results obtained without the DA model
Example – DA model
(I) utt06 U: Where is it?
(N) utt07 U: Just a second
(I) utt08 U: I can't find the Rochester airport
(N) utt09 S: It's
(I) utt10 U: I think I have a disability with maps (removed)
(I) utt11 U: Have I ever told you that before
(I) utt12 S: It's located on brooks avenue
(A) utt13 U: Oh thank you
(I) utt14 S: Do you see it?
(A) utt15 U: Yes
Parsing Monroe Domain
Domain: Monroe Corpus of 20 transcriptions (Stent, 2001) of human subjects collaborating on Emergency Rescue 911 tasks
Each dialogue was at least 10 minutes long, and most were over 300 utterances long
Work presented here focuses on 5 of the dialogues (1756 utterances, 278 3rd person pronouns)
Goals: develop a corpus of sentences parsed with rich syntactic, semantic, discourse information to
Able to parse the 5-dialogue sub-corpus with 84% accuracy
For more details, see ACL Discourse Annotation ’04
TRIPS Parser
Broad-coverage, deep parser
Uses a bottom-up algorithm with a CFG and a domain-independent ontology combined with a domain model
Flat, unscoped LF with events and labeled semantic roles based on FrameNet
Semantic information for noun phrases based on EuroWordNet
Parser information for Reference
Rich parser output is helpful for discourse annotation and reference resolution:
Referring expressions identified (pronoun, NP, impros)
Verb roles and temporal information (tense, aspect) identified
Noun phrases have semantic information associated with them
Speech act information (question, acknowledgment)
Discourse markers (so, but)
Semi-automatic annotation increases reliability
Semantics Example: “an ambulance”
(TERM :VAR V213818
  :LF (A V213818 (:* LF::LAND-VEHICLE W::AMBULANCE) :INPUT (AN AMBULANCE))
  :SEM ($ F::PHYS-OBJ (SPATIAL-ABSTRACTION SPATIAL-POINT)
          (GROUP -) (MOBILITY LAND-MOVABLE) (FORM ENCLOSURE)
          (ORIGIN ARTIFACT) (OBJECT-FUNCTION VEHICLE) (INTENTIONAL -)
          (INFORMATION -) (CONTAINER (OR + -))
          (TRAJECTORY -)))
Reference Annotation
Annotated dialogues for reference w/undergraduate researchers (created a Java Tool: PronounTool)
Markables determined by LF terms
Identification numbers determined by :VAR field of LF term
Used stand-off file to encode what each pronoun refers to (refers-to) and the relation between pronoun and antecedent (relation)
Post-processing phase assigns a unique identification number to coreference chains
Also annotated coreference between definite noun phrases
Reference Annotation
Used slightly modified MATE scheme: pronouns divided into the following types:
IDENTITY (Coreference) (278)
Includes set constructions (6)
FUNCTIONAL (20)
PROPOSITION/D.DEIXIS (41)
ACTION/EVENT (22)
INDEXICAL (417)
EXPLETIVE (97)
DIFFICULT (5)
LRC Algorithm
LRC: modified centering algorithm (Tetreault ’01) that does not use Cb or transitions, but keeps a Cf-list (history) for each utterance
While processing an utterance’s entities (left to right):
Push each entity onto Cf-list-new; for a pronoun p, attempt to resolve:
Search through Cf-list-new (left to right), taking the first candidate that meets gender, agreement, binding, and semantic feature constraints.
If none is found, search past utterances’ Cf-lists, starting from the previous utterance and working back to the beginning of the discourse
When p is resolved, push the pronoun, with semantic features from its antecedent, onto Cf-list-new
For more details, see SemDial ’04
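The search order described on this slide can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Entity` fields, the `compatible` check, and the toy dialogue are hypothetical, binding constraints are omitted, the semantic check is reduced to comparing one coarse class label, and an unresolved pronoun is pushed unchanged (the slide does not say what happens in that case).

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    gender: str = "n"
    number: str = "sg"
    sem: str = "any"            # coarse semantic class; "any" = unconstrained
    is_pronoun: bool = False
    referent: Optional[str] = None


def compatible(p: Entity, c: Entity, use_sem: bool) -> bool:
    """Gender/number agreement plus an optional semantic class check."""
    if (p.gender, p.number) != (c.gender, c.number):
        return False
    if use_sem and p.sem != "any" and c.sem != "any" and p.sem != c.sem:
        return False
    return True


def lrc(utterances: List[List[Entity]],
        use_sem: bool = True) -> List[Tuple[str, Optional[str]]]:
    history: List[List[Entity]] = []   # Cf-lists of past utterances
    links: List[Tuple[str, Optional[str]]] = []
    for utt in utterances:
        cf_new: List[Entity] = []
        for ent in utt:
            if ent.is_pronoun:
                antecedent = None
                # Current Cf-list first, then past Cf-lists, most recent first;
                # each list is scanned left to right.
                for cf_list in [cf_new] + history[::-1]:
                    antecedent = next(
                        (c for c in cf_list if compatible(ent, c, use_sem)),
                        None)
                    if antecedent is not None:
                        break
                if antecedent is not None:
                    target = antecedent.referent or antecedent.text
                    links.append((ent.text, target))
                    # Pronoun inherits the antecedent's semantic features.
                    ent = replace(ent, sem=antecedent.sem, referent=target)
                else:
                    links.append((ent.text, None))
            cf_new.append(ent)
        history.append(cf_new)
    return links


# Hypothetical two-utterance dialogue; the pronoun is constrained to vehicles.
dialogue = [
    [Entity("an ambulance", sem="vehicle")],
    [Entity("Brooks Avenue", sem="location"),
     Entity("it", sem="vehicle", is_pronoun=True)],
]
with_sem = lrc(dialogue, use_sem=True)
without_sem = lrc(dialogue, use_sem=False)
print(with_sem, without_sem)
```

In this toy case the semantic filter skips the closer but incompatible candidate and reaches back to the previous utterance, while the run without it takes the first agreeing entity in the current utterance.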
LRC Algorithm with Structure Info
Augmented algorithm with extensions to handle QUD and E&S input
For QUD, at the start and end of processing an utterance, QUDs are started (pushed on stack) or ended (entities are collapsed), so the Cf-list history changes
For E&S, each utterance is assigned a DA code and then removed or kept depending on the next utterance (whether it is an acknowledgment, or part of a series of I’s)
Results
Metric   Baseline   QUD     E&S Auto   E&S Manual
+sem     67.9%      67.9%   64.7%      60.4%
-sem     61.5%      61.5%   60.1%      54.7%
Error Analysis
Though QUD and +sem baseline performed the same (89 errors), they each got 3 pronouns right the other did not
Baseline: 3 right where QUD’s collapsing of nodes removed the correct antecedent
QUD: 2 right associated with blocking off an aside
1 associated with collapsing (intervening nodes blocked)
15 pronouns: both got them wrong, but made different predictions
Remaining 71: both made the same error
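As a quick consistency check, the error counts on this slide can be tallied against the +sem accuracy in the Results table. The sketch below assumes the denominator is the 278 3rd person pronouns reported for the 5-dialogue sub-corpus; the variable names are illustrative only.

```python
total_pronouns = 278        # assumption: the 3rd person pronouns of the sub-corpus
errors_per_system = 89      # baseline (+sem) and QUD each made 89 errors

wrong_where_other_right = 3  # each system missed 3 pronouns the other resolved
both_wrong_different = 15    # both wrong, with different predictions
both_wrong_same = 71         # both wrong, with the same prediction

# The three categories should account for every error of a single system.
tally = wrong_where_other_right + both_wrong_different + both_wrong_same
assert tally == errors_per_system

accuracy = (total_pronouns - errors_per_system) / total_pronouns
print(f"{accuracy:.4f}")
```

Under that assumption the accuracy comes out just under 0.68, in line with the 67.9% reported for both the baseline and QUD.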
Issues
Structuring methods are probably more trouble than they are worth with the corpora available right now
Also only affect a few pronouns
Segment ends are least reliable
What constitutes an end?
3 errors show either that boundaries are marked incorrectly, if pronouns are accessing elements in a “closed” DS
Or perhaps the collapsing routine is too harsh
Small corpus size
Hard to draw definite conclusions given only 3 criss-crossed errors; need more data for statistical evaluations
Issues
E&S model has the advantage over QUD of being easier to automate, but fares worse since it takes into account only a small window of utterances (extremely shallow)
QUD model can be semi-automated (detecting question starts is easy), but detecting ends and types is harder
QUD could definitely be improved by taking into account plan initiations and suggestions, instead of limiting to questions only, but tradeoff is reliability