
Page 1:

Identifying Agreement and Disagreement in Conversational Speech:

Use of Bayesian Networks to Model Pragmatic Dependencies

Michel Galley, Kathleen McKeown, Julia Hirschberg (Columbia University)

Elizabeth Shriberg (SRI International)

Page 2:

Motivation

• Problem: identification of agreements and disagreements between participants in meetings.

• Ultimate goal: automatic summarization. This enables us to generate “minutes” of meetings highlighting the debate that affected each decision.

Page 3:

4-way classification: AGREE, DISAGREE, BACKCHANNEL, OTHER

Example

Alex: So. Um, and then the next are two sections in the form - So, one's for native English speakers - with three circle boxes American, British, Indian and Other, for write-in. And at the bottom, I added: “What language was spoken in the home?” → OTHER

Nick: "What language was spok-", yeah, so it's not “where did you grow up”, but what language was spoken in the home between the ages of, what would it be, twelve or something like that? → AGREE

James: Mmm. Yeah. → BACKCHANNEL

Julia: It's a good idea. → AGREE

Alex: Depends of who you ask, what the age range is. → OTHER

Luciana: Well, in the home, the influence of the home is much lower age than that, I mean, once you go beyond the age five or six- → DISAGREE

Page 4:

Previous work

• Decision-tree classifiers [Hillard et al. 03]
  o CART-style tree learner.
  o Features local to the utterance: lexical, durational, and acoustic.
  o Reasonably good accuracy in a 3-way classification (AGREE, DISAGREE, OTHER):
    • 71% with ASR output;
    • 82% with accurate transcription.

Page 5:

Extend [Hillard et al. 03] by investigating the effect of context

• Empirical questions:
  o Are preceding agreements/disagreements good predictors for the classification task?
  o Does the current label (agreement/disagreement) depend on the identity of the addressee?
  o Should we distinguish preceding labels by the identity of their corresponding addressees?

• The studies we report show that preceding context supplies good predictors.
  o Addressee identification is instrumental to analyzing the preceding context.

Page 6:

Agreement/disagreement classification in two steps

1. Addressee identification
  o Large corpus of labeled adjacency pairs (APs): paired utterances A and B, e.g. question-answer, offer-acceptance, apology-downplay.
  o Train a system to determine who is the addressee (A-part) of any given utterance (B-part) in a meeting.

2. Agreement/disagreement classification
  o Features local to the utterance and pertaining to immediately preceding speech and silences.
  o Label-dependency features: dependencies between the current label (agree, disagree, …) and previous labels in a Bayesian network.
  o Addressee identification defines the topology of the Bayesian network.

Page 7:

Corpus annotation

• ICSI meeting corpus: 75 informal meetings recorded at UC Berkeley, averaging one hour in length and ranging from 3 to 9 participants.

• Adjacency pair annotation [Dhillon et al. 04]:
  o All 75 meetings labeled with dialog acts and adjacency pairs.

• Agreement/disagreement annotation [Hillard et al. 03]:
  o Annotation of 4 meeting segments, plus tags for 4 additional meetings obtained with a clustering method [Hillard et al. 03].
  o 8135 labeled utterances: 11.9% agreements, 6.8% disagreements, 23.2% backchannels, 58.1% other.
  o Inter-labeler reliability: kappa coefficient of .63.
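For reference, the kappa coefficient discounts the agreement two labelers would reach by chance; this is the standard definition, not anything specific to this corpus:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is the observed inter-labeler agreement and $P_e$ the agreement expected by chance. On the common Landis and Koch scale, $\kappa = .63$ falls in the "substantial agreement" band (.61 to .80).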

Page 8:

Step 1: Addressee (AP) identification

• Baseline algorithm:
  o Always assume that the addressee in an adjacency pair (A, B) is the party who spoke last before B.
  o Works reasonably well: 79.8% accuracy.

• Our method: speaker ranking
  o Rank all speakers S = (s_1, …, s_N) with probabilities reflecting how likely they are to be speaker A (i.e. the addressee).
  o Log-linear (maximum entropy) probability model for ranking:

$$p(s_i \mid D) = \frac{1}{Z(D)} \exp\left( \sum_{j=1}^{J} \lambda_j f_j(D, s_i) \right)$$

  o The d_i in D = (d_1, …, d_N) are observations pertaining to speaker s_i and to s_i's last utterance.
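To make the ranking step concrete, here is a minimal sketch of the log-linear scorer; it is not the authors' implementation, and the feature names and weights are invented for illustration:

```python
import math

def rank_addressees(candidates, weights):
    """Rank candidate addressees (A-speakers) for one B utterance.

    candidates: list of (speaker_id, features) pairs, where features is a
                dict {feature_name: value} extracted from the observations
                d_i about speaker s_i and s_i's last utterance.
    weights:    dict {feature_name: lambda_j} of learned parameters.
    Returns (speaker_id, probability) pairs sorted by p(s_i | D), best first.
    """
    # Unnormalized score for each speaker: exp(sum_j lambda_j * f_j(D, s_i)).
    scores = {
        spk: math.exp(sum(weights.get(name, 0.0) * value
                          for name, value in feats.items()))
        for spk, feats in candidates
    }
    z = sum(scores.values())  # partition function Z(D)
    return sorted(((spk, score / z) for spk, score in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

# Hypothetical toy input: three candidate speakers, two features.
candidates = [
    ("s1", {"speakers_between": 0.0, "shared_ngrams": 3.0}),
    ("s2", {"speakers_between": 1.0, "shared_ngrams": 0.0}),
    ("s3", {"speakers_between": 2.0, "shared_ngrams": 1.0}),
]
weights = {"speakers_between": -1.2, "shared_ngrams": 0.4}
print(rank_addressees(candidates, weights))  # s1 should rank first
```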

Page 9:

Features for AP identification

• Structural:
  o Number of speakers taking the floor between A and B.
    (We match the baseline with this single feature: 79.8%.)

• Durational:
  o Duration of A: short utterances generally do not elicit responses/reactions.
  o Seconds of overlap with any other speaker: competitive speech is incompatible with AP construction.

• Lexical:
  o Number of n-grams occurring in both A and B (uni- to trigrams): A and B parts often have some words in common.
  o First word of A: to exploit cue words and detect wh- questions.
  o Is the B speaker (addressee) named explicitly in A?
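A rough sketch of how a few of these features could be computed for a candidate pair (A, B); the utterance representation and helper names are assumptions, not the paper's code:

```python
def shared_ngrams(a_words, b_words, n_max=3):
    """Count distinct n-grams (n = 1..n_max) occurring in both A and B."""
    def ngrams(words, n):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return sum(len(ngrams(a_words, n) & ngrams(b_words, n))
               for n in range(1, n_max + 1))

def ap_features(a, b, speakers_between):
    """Feature dict for a candidate adjacency pair (A, B).

    a, b: utterance dicts with 'words' (token list), 'start' and 'end'
          times in seconds, 'overlap' (seconds overlapped with any other
          speaker), and 'speaker' (the participant's name).
    speakers_between: number of speakers taking the floor between A and B.
    """
    return {
        "speakers_between": float(speakers_between),         # structural
        "duration_a": a["end"] - a["start"],                 # durational
        "overlap_a": a["overlap"],                           # durational
        "shared_ngrams": float(shared_ngrams(a["words"], b["words"])),
        # Binary indicator features keyed by value, maxent-style:
        "first_word_a=" + a["words"][0].lower(): 1.0,
        "b_named_in_a": float(b["speaker"].lower()
                              in (w.lower() for w in a["words"])),
    }
```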

Page 10:

Adjacency pair identification: results

Feature sets                                 Accuracy
Baseline (default: most recent speaker)      79.8%
Structural                                   84.0%
Durational                                   84.7%
Lexical                                      75.4%
Structural and durational                    87.9%
All (lexical, structural, and durational)    90.2%

• Experimental setting: 40 meetings used for training (9104 APs), 10 meetings used for testing (1723 APs), and 5 meetings of a held-out set used for forward feature selection and regularization (Gaussian smoothing).

Page 11:

Step 2: Agreement/disagreement classification: local features of the utterance

• Local features of the utterance include the ones used in [Hillard et al. 03] (but no acoustics). Best predictors:

  Lexical features:
  o Agreement and disagreement markers [Cohen 02], adjectives with positive/negative polarity [Hatzivassiloglou and McKeown 97], general cue phrases [Hirschberg and Litman 94].
  o First word of the utterance.
  o Scores according to four LMs (one for each class).

  Structural and durational features:
  o Duration of the utterance.
  o Speech rate.
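The per-class LM scores could be produced along these lines; a unigram model with add-one smoothing is assumed here purely as a stand-in, since the slide does not specify the LM type:

```python
import math
from collections import Counter

class ClassLM:
    """Add-one-smoothed unigram LM trained on one class's utterances."""

    def __init__(self, utterances):
        # utterances: iterable of token lists from one class (e.g. AGREE)
        self.counts = Counter(w for utt in utterances for w in utt)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # reserve one slot for unseen words

    def logprob(self, utterance):
        return sum(math.log((self.counts[w] + 1) / (self.total + self.vocab))
                   for w in utterance)

def lm_scores(models, utterance):
    """One score per class (four here), usable directly as classifier features."""
    return {label: lm.logprob(utterance) for label, lm in models.items()}
```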

Page 12:

Label dependencies in sequence classification

• A previous-tag feature p(c_i | c_{i-1}) is helpful for modeling context in many NLP applications: POS tagging, supertagging, dialog act classification.
  o Various families of Markov models can be trained (e.g. HMMs, CMMs, CRFs).

• Limitations of fixed-order Markov models for representing multi-party conversations:
  o Overlapping speech; no strict label ordering.
  o Multiple speakers with different opinions: the previous tag (speaker A) might affect the current tag (speaker B addressing A), but this is unlikely if B addresses C.

Page 13:

Intuition: previous tag affects current tag

Label dependency: previous-tag

[Diagram: tag sequence indexed left to right, ending …, c_4^{A→C}, c_5^{A→B}; the current tag c_5 (A speaking to B) depends on the immediately preceding tag c_4. Notation: c_i^{X→Y} is the tag of utterance i, spoken by X and addressed to Y.]

Priors: p(AGREE) = .188, p(DISAGREE) = .106

p(AGREE | AGREE) = .213
p(DISAGREE | DISAGREE) = .209
p(AGREE | DISAGREE) = .139
p(DISAGREE | AGREE) = .073

(BACKCHANNEL tags ignored for better interpretability.)

Note that p(DISAGREE | DISAGREE) = .209 is roughly twice the prior p(DISAGREE) = .106: disagreements tend to follow disagreements.

Page 14:

Intuition: If A disagreed with B (when A last spoke to B), then A is likely to disagree with B again.

Label dependency: same-interactants previous tags

[Diagram: tag sequence …, c_2^{A→B}, c_3^{C→B}, c_4^{A→C}, c_5^{A→B}; the current tag c_5^{A→B} depends on c_2^{A→B}, the tag of A's previous utterance addressed to B.]

Priors: p(AGREE) = .188, p(DISAGREE) = .106

p(AGREE | AGREE) = .250
p(DISAGREE | DISAGREE) = .261
p(AGREE | DISAGREE) = .087
p(DISAGREE | AGREE) = .107

Page 15:

Intuition: If B disagreed with A (when B last spoke to A), then A is likely to disagree with B.

Label dependency: symmetry

[Diagram: tag sequence c_1^{B→A}, c_2^{A→B}, c_3^{C→B}, c_4^{A→C}, c_5^{A→B}; the current tag c_5^{A→B} depends on c_1^{B→A}, the tag of B's last utterance addressed to A.]

Priors: p(AGREE) = .188, p(DISAGREE) = .106

p(AGREE | AGREE) = .175
p(DISAGREE | DISAGREE) = .128
p(AGREE | DISAGREE) = .234
p(DISAGREE | AGREE) = .088

Page 16:

Intuition: If A disagrees with C after C agreed with B, then we might expect A to disagree with B as well.

Label dependency: transitivity

[Diagram: tag sequence c_1^{B→A}, c_2^{A→B}, c_3^{C→B}, c_4^{A→C}, c_5^{A→B}; the current tag c_5^{A→B} depends on both c_4^{A→C} and c_3^{C→B}.]

Priors: p(AGREE) = .188, p(DISAGREE) = .106

p(AGREE | AGREE, AGREE) = .225
p(DISAGREE | DISAGREE, AGREE) = .177
p(DISAGREE | AGREE, DISAGREE) = .186
p(DISAGREE | DISAGREE, DISAGREE) = .180

Page 17:

Parameter estimation

• We use (dynamic) Bayes nets to factor the conditional probability distribution:

$$p(C \mid D) = \prod_{i=1}^{L} p\big(c_i \mid \mathrm{pa}(c_i),\, d_i\big)$$

where
C = (c_1, …, c_L): sequence of labels
D = (d_1, …, d_L): sequence of observations
pa(c_i): parents of c_i, i.e. label dependencies as in the networks of pages 13 to 16 (over c_1^{B→A}, c_2^{A→B}, c_3^{C→B}, c_4^{A→C}, c_5^{A→B}).

• A (maximum entropy) log-linear model is used to estimate the probability of the dynamic variable c_i:

$$p\big(c_i \mid \mathrm{pa}(c_i), d_i\big) = \frac{1}{Z\big(\mathrm{pa}(c_i), d_i\big)} \exp\left( \sum_{j=1}^{J} \lambda_j f_j\big(\mathrm{pa}(c_i), d_i, c_i\big) \right)$$
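A minimal sketch of how the factored score would be computed; parents() (the topology from addressee identification) and local_logprob() (the trained maxent model) are assumed interfaces, not the authors' code:

```python
def sequence_logprob(labels, observations, parents, local_logprob):
    """log p(C | D) = sum_i log p(c_i | pa(c_i), d_i) under the Bayes net.

    labels:        tags c_1..c_L (e.g. 'AGREE', 'DISAGREE', ...)
    observations:  per-utterance feature dicts d_1..d_L
    parents:       parents(i) -> indices j < i of the labels c_i depends on
                   (the network topology from addressee identification)
    local_logprob: local_logprob(tag, parent_tags, d) -> log of the maxent
                   estimate p(c_i | pa(c_i), d_i)
    """
    total = 0.0
    for i, (tag, d) in enumerate(zip(labels, observations)):
        parent_tags = [labels[j] for j in parents(i)]
        total += local_logprob(tag, parent_tags, d)
    return total
```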

Page 18:

Decoding of the maximizing sequence

• Beam search

  o Maintain a beam of the B most likely left-to-right partial sequences (as in [Ratnaparkhi 96] for POS tagging).

  o In theory, search errors are possible.

  o In practice, our search is seldom affected by beam size as long as it isn't too small: B = 100 is a reasonable value for any sequence.
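A sketch of that decoder, under the same assumed interfaces (parents, local_logprob) as the previous snippet; the beam parameter corresponds to B above:

```python
import heapq

TAGS = ["AGREE", "DISAGREE", "BACKCHANNEL", "OTHER"]

def beam_decode(observations, parents, local_logprob, beam=100):
    """Left-to-right beam search for the most likely tag sequence,
    as in [Ratnaparkhi 96] for POS tagging."""
    hypotheses = [(0.0, [])]  # (log-probability, partial tag sequence)
    for i, d in enumerate(observations):
        expanded = []
        for score, seq in hypotheses:
            # Parents always lie to the left, so their tags are available.
            parent_tags = [seq[j] for j in parents(i) if j < len(seq)]
            for tag in TAGS:
                expanded.append((score + local_logprob(tag, parent_tags, d),
                                 seq + [tag]))
        # Prune: keep only the `beam` best partial sequences.
        hypotheses = heapq.nlargest(beam, expanded, key=lambda h: h[0])
    return max(hypotheses, key=lambda h: h[0])[1]
```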

Page 19:

Results: comparison to previous work

• 3-way classification (AGREE, DISAGREE, OTHER) as in [Hillard et al. 03]; priors are normalized.

• The best-performing feature set represents a 27.3% error reduction over [Hillard et al. 03].
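As a quick check of that figure: 82% accuracy is an 18% error rate and 86.9% is 13.1%, so the relative reduction is (18 - 13.1) / 18 ≈ 27%, consistent with the reported 27.3% (the exact value presumably uses unrounded accuracies).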

Systems                                Accuracy
Baseline                               50%
[Hillard et al. 03]                    82%

Feature sets                           Accuracy
Structural and durational              71.2%
Lexical                                85.0%
Lexical, structural, and durational    85.6%
All (including label dependencies)     86.9%

Page 20:

Results: comparison to previous work


• In the same 3-way setting, label dependency features reduce error by 9% (85.6% → 86.9%, i.e. from a 14.4% to a 13.1% error rate).


Page 21:

Results: 4-way classification

• 6-fold cross-validation, each fold on one meeting, representing a total of 8135 utterances to classify.

• Contribution of label dependencies across different feature sets:

Feature sets                           No label dependencies    With label dependencies
Baseline                               58.1%                    -
Structural and durational              58.9%                    62.1%
Lexical                                82.6%                    83.5%
Lexical, structural, and durational    83.1%                    84.1%

Page 22:

Results: 4-way classification

• Accuracies by label dependency type (assuming all other features, i.e. structural, durational, and lexical, are used):

Label dependency                  Accuracy
None                              83.1%
Previous tag                      83.8%
Same-interactants previous tag    83.9%
Symmetry                          83.7%
Transitivity                      83.2%
All                               84.1%

Page 23:

Conclusion and future work

• Conclusion:
  o Performed addressee identification as a byproduct of agreement/disagreement classification.
  o AP identification: significantly outperforms a competitive baseline.
  o Compelling evidence that models incorporating label dependency features are superior.

• Future work:
  o Summarization: identifying what propositional content was agreed or disagreed upon.
  o Addressee identification may also be beneficial in DA labeling of multi-party speech.

Page 24:

Thank you

Page 25:

Preceding-tags dependencies

                    Previous tag    Same-interactants previous tag    Symmetry
p(Agr | Agr)        .213            .250                              .175
p(Other | Agr)      .713            .643                              .737
p(Dis | Agr)        .073            .107                              .088
p(Agr | Other)      .187            .115                              .177
p(Other | Other)    .714            .784                              .710
p(Dis | Other)      .098            .100                              .113
p(Agr | Dis)        .139            .087                              .234
p(Other | Dis)      .651            .652                              .638
p(Dis | Dis)        .209            .261                              .128

Priors p(c_i): Agr .188, Other .706, Dis .106

Page 26:

Preceding-tag dependency: transitivity

                       c_i=Agr, c_j=Agr    c_i=Dis, c_j=Agr    c_i=Agr, c_j=Dis    c_i=Dis, c_j=Dis
p(Agr | c_i, c_j)      .225                .147                .131                .152
p(Other | c_i, c_j)    .658                .677                .684                .668
p(Dis | c_i, c_j)      .117                .177                .186                .180

Priors p(c_i): Agr .188, Dis .106, Other .706