Identifying Agreement and Disagreement in Conversational Speech:
Use of Bayesian Networks to Model Pragmatic Dependencies
Michel Galley, Kathleen McKeown, Julia Hirschberg (Columbia University)
Elizabeth Shriberg (SRI International)
2
Motivation
• Problem: identification of agreements and disagreements between participants in meetings.
• Ultimate goal: automatic summarization. This enables us to generate “minutes” of meetings highlighting the debate that affected each decision.
3
4-way classification: AGREE, DISAGREE, BACKCHANNEL, OTHER
Example
Alex So. Um, and then the next are two sections in the form - So, one's for native English speakers - with three circle boxes American, British, Indian and Other, for write-in. And at the bottom, I added: “What language was spoken in the home?”
OTHER
Nick "What language was spok-", yeah, so it's not “where did you grow up”, but what language was spoken in the home between the ages of, what would it be, twelve or something like that?
AGREE
James Mmm. Yeah. BACKCHANNEL
Julia It's a good idea. AGREE
Alex Depends of who you ask, what the age range is. OTHER
Luciana Well, in the home, the influence of the home is much lower age than that, I mean, once you go beyond the age five or six-
DISAGREE
4
Previous work
• Decision-tree classifiers [Hillard et al. 03]
  o CART-style tree learner.
  o Features local to the utterance: lexical, durational, and acoustic.
  o Reasonably good accuracy in a 3-way classification (AGREE, DISAGREE, OTHER):
    • 71% with ASR output
    • 82% with accurate transcription
5
Extend [Hillard et al. 03] by investigating the effect of context
• Empirical questions:
  o Are preceding agreements/disagreements good predictors for the classification task?
  o Does the current label (agreement/disagreement) depend on the identity of the addressee?
  o Should we distinguish preceding labels by the identity of their corresponding addressees?
• The studies we report show that preceding context supplies good predictors.
  o Addressee identification is instrumental to analyzing preceding context.
6
Agreement/disagreement classification in two steps
1. Addressee identification
   o Large corpus of labeled adjacency pairs (APs): paired utterances A and B, e.g. question-answer, offer-acceptance, apology-downplay.
   o Train a system to determine who is the addressee (A-part) of any given utterance (B-part) in a meeting.
2. Agreement/disagreement classification
   o Features local to the utterance and pertaining to immediately preceding speech and silences.
   o Label-dependency features: dependencies between the current label (agree, disagree, …) and previous labels in a Bayesian network.
   o Addressee identification defines the topology of the Bayesian network.
7
Corpus annotation
• ICSI meeting corpus: 75 informal meetings recorded at UC Berkeley, averaging one hour each and ranging from 3 to 9 participants.
• Adjacency pair annotation [Dhillon et al. 04]:
  o All 75 meetings labeled with dialog acts and adjacency pairs.
• Agreement/disagreement annotation [Hillard et al. 03]:
  o Annotation of 4 meeting segments, plus tags for 4 additional meetings obtained with a clustering method [Hillard et al. 03].
  o 8135 labeled utterances:
    11.9% agreements
    6.8% disagreements
    23.2% backchannels
    58.1% other
  o Inter-labeler reliability: kappa coefficient of .63
8
Step 1: Addressee (AP) identification
• Baseline algorithm:
  o Always assume that the addressee in an adjacency pair (A, B) is the party who spoke last before B.
  o Works reasonably well: 79.8% accuracy.
• Our method: speaker ranking
  o Rank all speakers S = (s_1, …, s_N) with probabilities reflecting how likely they are to be speaker A (i.e. the addressee).
  o Log-linear (maximum entropy) probability model for ranking:

    $$p(s_i \mid D) = \frac{1}{Z(D)} \exp\left( \sum_{j=1}^{J} \lambda_j f_j(D, s_i) \right)$$

  o d_i in D = (d_1, …, d_N) are observations pertaining to speaker s_i and to the last utterance of speaker s_i.
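To make the ranking step concrete, here is a minimal sketch (not the authors' code) of scoring candidate addressees with the log-linear model above; the feature names, weights, and speaker data are hypothetical stand-ins for the learned f_j and λ_j:

```python
import math

def rank_speakers(candidates, features, weights):
    """Rank candidate addressees s_i by
    p(s_i | D) = exp(sum_j lambda_j * f_j(D, s_i)) / Z(D)."""
    # Unnormalized score exp(sum_j lambda_j * f_j) per candidate
    scores = {
        s: math.exp(sum(weights[name] * value
                        for name, value in features[s].items()))
        for s in candidates
    }
    z = sum(scores.values())  # partition function Z(D)
    return sorted(((s, score / z) for s, score in scores.items()),
                  key=lambda pair: -pair[1])

# Hypothetical features for three candidate A-speakers of one B utterance
weights = {"spoke_last": 1.2, "overlap_secs": -0.8, "ngram_overlap": 0.5}
features = {
    "alex":  {"spoke_last": 1.0, "overlap_secs": 0.0, "ngram_overlap": 3.0},
    "nick":  {"spoke_last": 0.0, "overlap_secs": 1.5, "ngram_overlap": 1.0},
    "julia": {"spoke_last": 0.0, "overlap_secs": 0.0, "ngram_overlap": 0.0},
}
print(rank_speakers(list(features), features, weights))  # most likely first
```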
9
Features for AP identification
• Structural:
  o Number of speakers taking the floor between A and B.
    (We match the baseline with this single feature: 79.8%.)
• Durational:
  o Duration of A: short utterances generally do not elicit responses/reactions.
  o Seconds of overlap with any other speaker: competitive speech is incompatible with AP construction.
• Lexical:
  o Number of n-grams occurring in both A and B (uni- to trigrams): A and B parts often have some words in common.
  o First word of A: exploits cue words, detects wh- questions.
  o Is the B speaker (addressee) named explicitly in A?
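The feature extraction itself is straightforward; the sketch below is a hypothetical illustration (with an assumed Utterance record, not the authors' code) of how the structural, durational, and lexical features listed above could be computed for a candidate pair (A, B):

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    start: float            # start time in seconds
    end: float              # end time in seconds
    words: list             # tokenized words
    overlap_secs: float     # overlap with any other speaker

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ap_features(a, b, speakers_between):
    """Features for a candidate adjacency pair (A, B), loosely
    following the feature list on this slide."""
    return {
        # structural: speakers taking the floor between A and B
        "speakers_between": speakers_between,
        # durational: short A utterances rarely elicit responses
        "dur_a": a.end - a.start,
        "overlap_secs": a.overlap_secs,
        # lexical: uni- to trigrams occurring in both A and B
        "shared_ngrams": sum(len(ngrams(a.words, n) & ngrams(b.words, n))
                             for n in (1, 2, 3)),
        # lexical: first word of A (cue words, wh- questions)
        "first_word_a": a.words[0] if a.words else "<empty>",
        # lexical: is the B speaker named explicitly in A?
        "b_named_in_a": int(b.speaker.lower() in
                            (w.lower() for w in a.words)),
    }
```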
10
Adjacency pair identification: results
Feature sets                                  Accuracy
Baseline (default: most recent speaker)       79.8%
Structural                                    84.0%
Durational                                    84.7%
Lexical                                       75.4%
Structural and durational                     87.9%
All (lexical, structural, and durational)     90.2%
• Experimental setting:
  o 40 meetings used for training (9104 APs); 10 meetings used for testing (1723 APs).
  o 5 meetings in a held-out set used for forward feature selection and regularization (Gaussian smoothing).
11
Step 2: Agreement/disagreement classification: local features of the utterance
• Local features of the utterance include the ones used in [Hillard et al. 03] (but no acoustics). Best predictors:
• Lexical features:
  o Agreement and disagreement markers [Cohen 02], adjectives with positive/negative polarity [Hatzivassiloglou and McKeown 97], general cue phrases [Hirschberg and Litman 94].
  o First word of the utterance.
  o Scores according to four LMs (one for each class).
• Structural and durational features:
  o Duration of the utterance.
  o Speech rate.
12
Label dependencies in sequence classification
• The previous-tag feature p(c_i | c_{i-1}) is helpful for modeling context in many NLP applications: POS tagging, supertagging, dialog act classification.
  o Various families of Markov models can be trained (e.g. HMMs, CMMs, CRFs).
• Limitations of fixed-order Markov models for representing multi-party conversations:
  o Overlapping speech; no strict label ordering.
  o Multiple speakers with different opinions: the previous tag (speaker A) might affect the current tag (speaker B addressing A), but this is unlikely if B addresses C.
13
Intuition: previous tag affects current tag
Label dependency: previous-tag
[Figure: tag sequence showing c_4 (A addressing C) and c_5 (A addressing B), with the tag index below each utterance]
Priors: p(AGREE) = 0.188, p(DISAGREE) = 0.106
p(AGREE | AGREE) = 0.213
p(DISAGREE | DISAGREE) = 0.209
p(AGREE | DISAGREE) = 0.139
p(DISAGREE | AGREE) = 0.078
(BACKCHANNEL tags ignored for better interpretability)
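Statistics like these can be read directly off the labeled corpus; below is a toy sketch (maximum-likelihood counts over a hypothetical tag sequence, not the authors' code) of estimating the priors and previous-tag conditionals, ignoring BACKCHANNEL as on the slide:

```python
from collections import Counter

def previous_tag_stats(tags):
    """Estimate p(c_i) and p(c_i | c_{i-1}) from a labeled tag
    sequence, skipping BACKCHANNEL as on the slide."""
    seq = [t for t in tags if t != "BACKCHANNEL"]
    priors = Counter(seq)
    bigrams = Counter(zip(seq, seq[1:]))   # (previous, current) pairs
    n = len(seq)
    p_prior = {t: c / n for t, c in priors.items()}
    # Approximate conditional: count(prev, cur) / count(prev)
    p_cond = {(prev, cur): c / priors[prev]
              for (prev, cur), c in bigrams.items()}
    return p_prior, p_cond

# Hypothetical labeled stretch of a meeting
tags = ["OTHER", "AGREE", "BACKCHANNEL", "AGREE", "DISAGREE", "OTHER"]
p_prior, p_cond = previous_tag_stats(tags)
print(p_cond.get(("AGREE", "AGREE")))  # estimate of p(AGREE | AGREE)
```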
14
Intuition: If A disagreed with B (when A last spoke to B), then A is likely to disagree with B again.
Label dependency: same-interactants previous tags
[Figure: tag sequence showing c_2 (A addressing B), c_3 (C addressing B), c_4 (A addressing C), and c_5 (A addressing B)]
Priors: p(AGREE) = 0.188, p(DISAGREE) = 0.106
p(AGREE | AGREE) = 0.250
p(DISAGREE | DISAGREE) = 0.261
p(AGREE | DISAGREE) = 0.087
p(DISAGREE | AGREE) = 0.107
15
Intuition: If B disagreed with A (when B last spoke to A), then A is likely to disagree with B.
Label dependency: symmetry
[Figure: tag sequence showing c_1 (B addressing A), c_2 (A addressing B), c_3 (C addressing B), c_4 (A addressing C), and c_5 (A addressing B)]
Priors: p(AGREE) = 0.188, p(DISAGREE) = 0.106
p(AGREE | AGREE) = 0.175
p(DISAGREE | DISAGREE) = 0.128
p(AGREE | DISAGREE) = 0.234
p(DISAGREE | AGREE) = 0.088
16
Intuition: If A disagrees with C after C agreed with B, then we might expect A to disagree with B as well.
Label dependency: transitivity
[Figure: tag sequence showing c_1 (B addressing A), c_2 (A addressing B), c_3 (C addressing B), c_4 (A addressing C), and c_5 (A addressing B)]
Priors: p(AGREE) = 0.188, p(DISAGREE) = 0.106
p(AGREE | AGREE, AGREE) = 0.225
p(DISAGREE | DISAGREE, AGREE) = 0.177
p(DISAGREE | AGREE, DISAGREE) = 0.186
p(DISAGREE | DISAGREE, DISAGREE) = 0.180
17
Parameter estimation
• We use (dynamic) Bayes nets to factor the conditional probability distribution:

  $$p(C \mid D) = \prod_{i=1}^{L} p\big(c_i \mid \mathrm{pa}(c_i),\, d_i\big)$$

  C = (c_1, …, c_L): sequence of labels
  D = (d_1, …, d_L): sequence of observations
  pa(c_i): parents of c_i, i.e. label dependencies as in the preceding slides
• A (maximum entropy) log-linear model is used to estimate the probability of the dynamic variable c_i:

  $$p\big(c_i \mid \mathrm{pa}(c_i), d_i\big) = \frac{1}{Z\big(\mathrm{pa}(c_i), d_i\big)} \exp\left( \sum_{j=1}^{J} \lambda_j\, f_j\big(\mathrm{pa}(c_i), d_i, c_i\big) \right)$$

[Figure: Bayesian network over the tag sequence c_1 (B→A), c_2 (A→B), c_3 (C→B), c_4 (A→C), c_5 (A→B)]
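As a sketch of how this factorization turns into a sequence score, the code below (an illustration under assumed interfaces, not the authors' implementation) computes log p(C|D) from a local maxent model and a `parents` function that encodes the network topology; the feature and weight values are hypothetical:

```python
import math

LABELS = ["AGREE", "DISAGREE", "BACKCHANNEL", "OTHER"]

def maxent_log_prob(label, parent_labels, obs, weights, feats):
    """log p(c_i | pa(c_i), d_i) under the log-linear model above."""
    def score(c):
        return sum(w * f(parent_labels, obs, c)
                   for w, f in zip(weights, feats))
    # log of the partition function Z(pa(c_i), d_i)
    log_z = math.log(sum(math.exp(score(c)) for c in LABELS))
    return score(label) - log_z

def sequence_log_prob(tags, observations, parents, weights, feats):
    """log p(C|D) = sum_i log p(c_i | pa(c_i), d_i); the `parents`
    function defines the Bayesian network topology."""
    return sum(maxent_log_prob(tags[i], parents(i, tags),
                               observations[i], weights, feats)
               for i in range(len(tags)))

# Hypothetical single feature: does c_i repeat one of its parent labels?
feats = [lambda parent_labels, obs, c: float(c in parent_labels)]
weights = [0.8]
prev = lambda i, tags: tags[max(i - 1, 0):i]   # previous-tag dependency
print(sequence_log_prob(["OTHER", "AGREE", "AGREE"],
                        [None, None, None], prev, weights, feats))
```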
18
Decoding of the maximizing sequence
• Beam search
o Maintain a beam of the B most likely left-to-right partial sequences (as in [Ratnaparkhi 96] for POS tagging).
o In theory, search errors are possible.
o In practice, our search is seldom affected by the beam size as long as it isn't too small: B = 100 is a reasonable value for any sequence.
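A minimal sketch of the decoder (assuming the local model and `parents` function from the previous sketch; not the authors' implementation):

```python
def beam_decode(observations, labels, log_p_local, parents, beam=100):
    """Left-to-right beam search for the most likely label sequence,
    keeping the `beam` best partial hypotheses at each step
    (as in [Ratnaparkhi 96])."""
    hyps = [([], 0.0)]  # (partial tag sequence, log probability)
    for i, obs in enumerate(observations):
        expanded = [(tags + [c], lp + log_p_local(c, parents(i, tags), obs))
                    for tags, lp in hyps
                    for c in labels]
        # prune to the `beam` highest-scoring partial sequences
        hyps = sorted(expanded, key=lambda h: -h[1])[:beam]
    return hyps[0]  # best complete sequence and its log probability
```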
19
Results: comparison to previous work
• 3-way classification (AGREE, DISAGREE, OTHER) as in [Hillard et al. 03]; priors are normalized.
• Best-performing feature set represents a 27.3% error reduction over [Hillard et al. 03].
• Label dependency features alone reduce error by 9%.
Systems                                    Accuracy
Baseline                                   50%
[Hillard et al. 03]                        82%

Feature sets                               Accuracy
Structural and durational                  71.2%
Lexical                                    85.0%
Lexical, structural, and durational        85.6%
All (including label dependencies)         86.9%
21
Results: 4-way classification
• 6-fold cross-validation, each fold on one meeting, representing a total of 8135 utterances to classify.
• Contribution of label dependencies across different feature sets:
Feature sets                               No label dep.   With label dep.
Baseline                                   58.1%           -
Structural and durational                  58.9%           62.1%
Lexical                                    82.6%           83.5%
Lexical, structural, and durational        83.1%           84.1%
22
Results: 4-way classification
• Accuracies by label dependency type (assuming all other features, i.e. structural, durational, and lexical, are used):
Label dependency                           Accuracy
None                                       83.1%
Previous tag                               83.8%
Same-interactants previous tag             83.9%
Symmetry                                   83.7%
Transitivity                               83.2%
All                                        84.1%
23
Conclusion and future work
• Conclusion:
  o Performed addressee identification as a byproduct of agreement/disagreement classification.
  o AP identification: significantly outperforms a competitive baseline.
  o Compelling evidence that models incorporating label dependency features are superior.
• Future work:
  o Summarization: identification of what propositional content was agreed or disagreed upon.
  o Addressee identification may also be beneficial for DA labeling of multi-party speech.
24
Thank you
25
Preceding-tags dependencies
                   Previous tag   Same-interactants   Symmetry
                                  previous tag
p(Agr|Agr)         .213           .250                .175
p(Other|Agr)       .713           .643                .737
p(Dis|Agr)         .073           .107                .088
p(Agr|Other)       .187           .115                .177
p(Other|Other)     .714           .784                .710
p(Dis|Other)       .098           .100                .113
p(Agr|Dis)         .139           .087                .234
p(Other|Dis)       .651           .652                .638
p(Dis|Dis)         .209           .261                .128

Priors p(c_i): Agr .188, Other .706, Dis .106
26
Preceding-tag dependency: transitivity
                   ci=Agr, cj=Agr   ci=Dis, cj=Agr   ci=Agr, cj=Dis   ci=Dis, cj=Dis
p(Agr|ci,cj)       .225             .147             .131             .152
p(Other|ci,cj)     .658             .677             .684             .668
p(Dis|ci,cj)       .117             .177             .186             .180

Priors p(c_i): Agr .188, Dis .106, Other .706