Prosody in Spoken Language Understanding

Gina Anne Levow, University of Chicago

January 4, 2008, NLP Winter School 2008

U: Give me the price for AT&T.



U: Give me the price for American Telephone and Telegraph.

Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½, up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?

Roadmap
Corrections: A motivating example
Defining prosody
Why prosody?
Challenges in prosody
Prosody in language understanding
  Recognizing tone and pitch accent
  Spoken corrections, topic segmentation
Conclusions

Defining Prosody
Prosody: phonetic phenomena in speech that span more than a single segment ("suprasegmental")
Prosody includes: stress, focus, tone, intonation, length/pause, rhythm
Prosodic features include:
  Pitch: perceptual correlate of fundamental frequency
    f0: rate of vocal fold vibration
  Loudness/intensity, duration, segment quality

Why Prosody?
Prosody plays a crucial role
At all levels of language
  Lexical, syntactic, pragmatic/discourse
  Establishes meaning
  Disambiguates sense and structure
Across language families
  Common physiological, articulatory basis
In synthesis and recognition of fluent speech

Prosody and the Lexicon
Lexical: determines word identity
Prosodic effect at the syllable level (minimal unit)
Lexical stress: syllable prominence
  Combination of length, pitch movement, loudness
  REcord (N) vs. reCORD (V)
  Pitch accent can differentiate words in some languages
Lexical tone: tone languages, e.g. Chinese, Punjabi
  Pitch height (register) and/or shape (contour)
  Ma (high): mother
  Ma (rising): hemp
  Ma (low): horse
  Ma (falling): scold

Prosody and Syntax
Prosody can disambiguate structure
  Associated with chunking and attachment
  Not identical with syntactic phrase boundaries
  “Prosody is predictable from syntax, except when it isn’t”
Prosodic phrasing indicated by:
  Some combination of pause, change in pitch

Chunking, or “phrasing”

A1: I met Mary and Elena’s mother at the mall yesterday.

A2: I met Mary and Elena’s mother at the mall yesterday.

[Figure: pitch contours for A1 and A2; vertical axis ticks from 50 to 400. Example from Jennifer Venditti]

Punctuation & Prosody Humor
A panda goes into a restaurant and has a meal. Just before he leaves, he takes out a gun and fires it. The irate restaurant owner says, ‘Why did you do that?’ The panda replies, ‘I'm a panda. Look it up.’ The restaurateur goes to his dictionary and under ‘panda’ finds: ‘black and white, arboreal, bear-like creature; eats, shoots and leaves.’

Prosody in Pragmatics & Discourse
Focus:
  Prominence, new information: pitch accent
  “October eleventh”
Sentence type, dialogue act:
  Statement vs. declarative question: “It’s raining (?)”
Discourse structure (topic), emotion
from Shih, Prosody Learning and Generation

Challenges in Prosody I
Highly variable
  Actual realization differs from ideal
  Speaker variation: gender, vocal tract differences, idiosyncrasy
  Tonal coarticulation
    Neighboring tones influence each other (as with segmental coarticulation)
    Underlying fall can become rise
Parallel encoding
  Effects at multiple levels realized simultaneously

Challenges in Prosody II
Challenges for learning
Lack of training data
  Sparseness: many prosodic phenomena are infrequent
    E.g., non-declarative utterances, topic boundaries, contrastive accents, etc.
    Challenging for machine learning methods
  Costs of labeling:
    Many prosodic events require expert labeling
    Need large corpus to attest
    Time-consuming, expensive

Context and Learning in Multilingual Tone and Pitch Accent Recognition

Strategy: Context
Common model across languages
  Pure acoustic-prosodic model
    No word label, POS, lexical stress info
  English, Mandarin Chinese (also Cantonese, isiZulu)
Exploit contextual information
  Features from adjacent syllables, phrase contour
Analyze impact of
  Context position, context encoding, context type
  > 12.5% reduction in error over no context

Data Collections
English: (Ostendorf et al., 95)
  Boston University Radio News Corpus, f2b
  Manually annotated, aligned, syllabified
  4 pitch accent labels, aligned to syllables
Mandarin:
  TDT2 Voice of America Mandarin Broadcast News
  Automatically aligned, syllabified
  4 main tones, neutral

Local Feature Extraction
Uniform representation for tone, pitch accent
  Motivated by Pitch Target Approximation Model
    Tone/pitch accent target exponentially approached
    Linear target: height, slope (Xu et al., 99)
Base features:
  Pitch, intensity: max, mean, min, range (Praat, speaker normalized)
  Pitch at 5 points across voiced region
  Duration
  Initial, final in phrase
Slope:
  Linear fit to last half of pitch contour
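The base pitch features listed above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the talk: it assumes the pitch values for one syllable's voiced region have already been extracted and speaker normalized, and the function names and the even spacing of the five sample points are assumptions.

```python
from statistics import mean

def linear_slope(values):
    """Least-squares slope of values against their frame index."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def local_pitch_features(f0):
    """Base pitch features for one syllable's voiced region.

    f0: non-empty list of pitch values (assumed already speaker normalized).
    """
    n = len(f0)
    # Five roughly evenly spaced samples, first to last voiced frame (assumed spacing).
    points = [f0[round(k * (n - 1) / 4)] for k in range(5)]
    return {
        "max": max(f0),
        "mean": mean(f0),
        "min": min(f0),
        "range": max(f0) - min(f0),
        "points": points,
        # Slope of a linear fit to the last half of the contour.
        "slope": linear_slope(f0[n // 2:]),
    }
```

Intensity max, mean, min and range could be computed the same way from an intensity track.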

Context Features
Local context:
  Extended features
    Pitch max, mean, adjacent points of preceding, following syllables
  Difference features
    Difference between
      Pitch max, mean, mid, slope
      Intensity max, mean
    Of preceding, following and current syllable
Phrasal context:
  Compute collection average phrase slope
  Compute scalar pitch values, adjusted for slope
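As a rough Python sketch of the two context encodings above: difference features subtract a neighboring syllable's values from the current syllable's, and phrasal compensation removes an average phrase-level trend from a scalar pitch value. The feature key names, the direction of the subtraction (current minus neighbor), and the use of a simple linear trend are assumptions for illustration.

```python
# Hypothetical feature keys; the slides list pitch max, mean, mid, slope
# and intensity max, mean as the quantities that are differenced.
DIFF_KEYS = ("pitch_max", "pitch_mean", "pitch_mid", "slope", "int_max", "int_mean")

def difference_features(current, neighbor, keys=DIFF_KEYS):
    """Difference between the current syllable and one neighbor (current - neighbor)."""
    return {f"d_{k}": current[k] - neighbor[k] for k in keys}

def compensate_phrase_slope(pitch, position, avg_slope):
    """Remove an average linear phrase trend from a scalar pitch value.

    position:  offset of the syllable from the phrase start (any consistent unit).
    avg_slope: collection-average phrase slope, in pitch units per position unit.
    """
    return pitch - avg_slope * position
```

With a falling average phrase slope (negative `avg_slope`), syllables late in the phrase are adjusted upward, so their pitch values become comparable to phrase-initial ones.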

Classification Experiments
Classifier: Support Vector Machine
  Linear kernel
  Multiclass formulation
  SVMlight (Joachims), LibSVM (Chang & Lin 01)
  4:1 training / test splits
Experiments: effects of
  Context position: preceding, following, none, both
  Context encoding: Extended/Difference
  Context type: local, phrasal

Results: Local Context

Context          Mandarin Tone   English Pitch Accent
Full             74.5%           81.3%
Extend PrePost   74%             80.7%
Extend Pre       74%             79.9%
Extend Post      70.5%           76.7%
Diffs PrePost    75.5%           80.7%
Diffs Pre        76.5%           79.5%
Diffs Post       69%             77.3%
Both Pre         76.5%           79.7%
Both Post        71.5%           77.6%
No context       68.5%           75.9%
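The "> 12.5% reduction in error over no context" figure quoted earlier is a relative error reduction. The slides do not say which rows it compares, so the Python sketch below only shows the arithmetic, using the Full and No context accuracies above; the function name is illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Percent of the baseline error removed, given accuracies in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - improved_acc
    return 100.0 * (base_err - new_err) / base_err

# Accuracies copied from the local-context results: (no context, full context).
results = {
    "Mandarin tone": (68.5, 74.5),
    "English pitch accent": (75.9, 81.3),
}
for task, (none_acc, full_acc) in results.items():
    print(task, round(relative_error_reduction(none_acc, full_acc), 1))
```

For these two rows the reduction is roughly 19% for Mandarin tone and 22% for English pitch accent, both above the quoted 12.5%.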

Discussion: Local Context
Any context information improves over none
  Preceding context information consistently improves over none or following context information
    English: generally more context features are better
    Mandarin: following context can degrade
  Little difference in encoding (Extend vs. Diffs)
Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory

Results & Discussion: Phrasal Context

Phrase Context   Mandarin Tone   English Pitch Accent
Phrase           75.5%           81.3%
No Phrase        72%             79.9%

• Phrase contour compensation enhances recognition
• Simple strategy
• Use of non-linear slope compensation may improve

Strategy: Training
Challenge:
  Can we use the underlying acoustic structure of the language, through unlabeled examples, to reduce the need for expensive labeled training data?
Exploit semi-supervised and unsupervised learning
  Semi-supervised Laplacian SVM
  K-means and asymmetric k-lines clustering
  Substantially outperform baselines
  Can approach supervised levels

Semi-supervised Learning
Approach:
  Employ small amount of labeled data
  Exploit information from additional (presumably more available) unlabeled data
  Few prior examples: several weakly supervised (Wong et al., ’05)
Classifier:
  Laplacian SVM (Sindhwani, Belkin & Niyogi ’05)
  Semi-supervised variant of SVM
    Exploits unlabeled examples
  RBF kernel, typically 6 nearest neighbors, transductive
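Laplacian SVM uses unlabeled examples through a neighborhood graph over all points, labeled and unlabeled: its graph Laplacian enters the objective as a penalty on label differences between neighbors. The Python sketch below builds only that graph Laplacian, L = D - W, for a k-nearest-neighbor graph. The binary edge weights and the symmetrization rule are assumptions; the slides give only the neighbor count, and no SVM training is shown.

```python
import math

def knn_graph_laplacian(points, k=6):
    """Return L = D - W for a symmetrized k-nearest-neighbor graph.

    points: list of equal-length numeric tuples (labeled and unlabeled together).
    Edge weights are binary (assumed); an edge exists if either point is
    among the other's k nearest neighbors.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i, p in enumerate(points):
        # Rank the other points by Euclidean distance to p, keep the k nearest.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            W[i][j] = W[j][i] = 1.0
    # Off-diagonal entries are -W; the diagonal holds each node's degree.
    L = [[-W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(W[i])
    return L
```

Every row of L sums to zero, so a labeling that is constant within each connected group of neighbors incurs no penalty.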

Experiments
Pitch accent recognition:
  Binary classification: unaccented/accented
  1000 instances, proportionally sampled
    Labeled training: 200 unaccented, 100 accented
  80% accuracy (cf. 84% w/15x labeled SVM)
Mandarin tone recognition:
  4-way classification: n(n-1)/2 binary classifiers
  400 instances: balanced; 160 labeled
  Clean lab speech, in-focus: 94%
    cf. 99% w/SVM, 1000s of training samples; 85% w/SVM, 160 training samples
  Broadcast news: 70%
    cf. < 50% w/SVM, 160 training samples
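The "n(n-1)/2 binary classifiers" construction for 4-way tone classification is a one-vs-one scheme: one binary classifier per pair of tones, combined by voting. A minimal Python sketch follows, assuming the pairwise classifiers are supplied as plain functions; how the talk's system combined the pairwise decisions is not stated, so majority voting here is an assumption.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_classifiers):
    """Combine pairwise decisions by majority vote.

    binary_classifiers maps each pair (a, b), in the order produced by
    combinations(classes, 2), to a function that returns a or b for input x.
    Ties go to whichever tied class received its first vote earliest.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]

tones = ["high", "rising", "low", "falling"]
pairs = list(combinations(tones, 2))  # 4 * 3 / 2 = 6 pairwise classifiers
```

With four tones this gives six binary classifiers, each trained only on examples of its two tones.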

Unsupervised Learning
Question:
  Can we identify the tone structure of a language from the acoustic space without training?
    Analogous to language acquisition
Significant recent research in unsupervised clustering
  Established approaches: k-means
  Spectral clustering (Shi & Malik ’97, Fischer & Poland 2004): asymmetric k-lines
Little research for tone
  Self-organizing maps (Gauthier et al., 2005)
    Tones identified in lab speech using f0 velocities
  Cluster-based bootstrapping (Narayanan et al., 2006)
  Prominence clustering (Tamburini ’05)

Contrasting Clustering
Contrasts:
  Clustering: 2-16 clusters, label w/most frequent class
  3 spectral approaches:
    Perform spectral decomposition of affinity matrix
    Asymmetric k-lines (Fischer & Poland 2004)
    Symmetric k-lines (Fischer & Poland 2004)
    Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004)
      Binary weights, k-lines clustering
  K-means: standard Euclidean distance
  # of clusters: 2-16
Best results: > 78%
  2 clusters: asymmetric k-lines; > 2 clusters: k-means
  Larger # clusters: all similar
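The evaluation scheme above (cluster, then label each cluster with its most frequent class) can be sketched in Python with a plain k-means baseline using Euclidean distance. This is an illustrative simplification: centers are initialized from the first k points, the iteration count is fixed, and the spectral k-lines methods are not shown.

```python
import math
from collections import Counter

def kmeans(points, k, iterations=20):
    """Lloyd's k-means; centers start at the first k points (a simplification)."""
    centers = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign

def majority_label_accuracy(assign, labels):
    """Label each cluster with its most frequent class, then score accuracy."""
    by_cluster = {}
    for a, y in zip(assign, labels):
        by_cluster.setdefault(a, Counter())[y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(labels)
```

Because every cluster gets its best-matching label, this accuracy can only stay flat or rise as the number of clusters grows, which is worth keeping in mind when comparing results across 2-16 clusters.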

Contrasting Learners

Tone Clustering: I
Mandarin four tones:
  400 samples: balanced
  2-phase clustering: 2-5 clusters each
  Asymmetric k-lines, k-means clustering
Clean read speech:
  In-focus syllables: 87% (cf. 99% supervised)
  In-focus and pre-focus: 77% (cf. 93% supervised)
Broadcast news: 57% (cf. 74% supervised)
K-means requires more clusters to reach k-lines level

Tone Structure

First phase of clustering splits high/rising from low/falling by slope
Second phase by pitch height

Conclusions
Common prosodic framework for tone and pitch accent recognition
Contextual modeling enhances recognition
  Local context and broad phrase contour
  Carryover coarticulation has larger effect for Mandarin
Exploiting unlabeled examples for recognition
  Semi- and unsupervised approaches
    Best cases approach supervised levels with less training
  Exploits acoustic structure of tone and accent space

Recognizing Spoken Corrections

Spoken Corrections
Recognize user attempts to correct ASR failures
  Compare original input to repeat corrections
  Significant differences:
    Corrections: increases in duration, pause #/length, final fall
    Increases in pitch accent for misrecognitions
Automatic recognition with decision trees, boosting
  Distinguish corrective/not (human level)
    Key features: raw/normalized duration, pause
  Identify specific word being corrected
    Key features: highest pitch, widest pitch range

The Problem: Speech Topic Segmentation

Separate audio stream into component topics

On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, Bangalore, in India, sees only profit. ||

Is It Possible in Mandarin?

Recognizing Shifts in Topic & Turn

Topic & Turn boundaries in English & Mandarin
  Initial syllables: significantly higher pitch, loudness than final
Lexical and prosodic cues:
  Cue words, tf*idf similarity; pitch, loudness, silence
Automatic recognition with decision trees, boosting
  Voting to combine text, prosody, silence: 97% accuracy
  Key features: pause; pitch, loudness contrast between syllables
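The tf*idf similarity cue can be illustrated with a short Python sketch: weight the terms in each text segment by tf*idf, then compare adjacent segments by cosine similarity, where a low value is one hint of a topic shift. The weighting (raw term count times log(N/df)), the segment granularity, and the function names are assumptions; the slides do not specify them.

```python
import math
from collections import Counter

def tfidf_vectors(segments):
    """segments: list of token lists. Returns one {term: weight} dict per segment."""
    n = len(segments)
    # Document frequency: number of segments containing each term.
    df = Counter(term for seg in segments for term in set(seg))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(seg).items()}
            for seg in segments]

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```

In a combined system, this text score would be one input alongside the prosodic cues (pause, pitch and loudness contrast) rather than a segmenter on its own.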

Conclusions & Opportunities
Prosody
  Rich source of information for languages
  Challenging due to variation, paucity of data
Can be successfully employed, with learning, to improve language understanding
  Pitch accent, tone, dialogue act, turn, topic, …
Unrestricted conversational, multi-party, multimodal speech much more challenging
  Increased variability, interaction with non-verbal evidence

Thanks
Dinoj Surendran, Siwei Wang, Yi Xu
V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C. Lin

This work supported by NSF Grant #0414919

http://people.cs.uchicago.edu/~levow/tai

Phrasing can disambiguate
I met Mary and Elena’s mother at the mall yesterday
[Figure: pitch contour, vertical axis ticks from 50 to 400; phrase labels: "Mary & Elena’s mother", "mall"]
One intonation phrase with relatively flat overall pitch range.

Phrasing can disambiguate
I met Mary and Elena’s mother at the mall yesterday
[Figure: pitch contour, vertical axis ticks from 50 to 400; phrase labels: "Mary", "Elena’s mother", "mall"]
Separate phrases, with expanded pitch movements.

Lists of numbers, nouns

twenty.eight.five

ninety.four.three

seventy.three.seven

forty.seven.seven

seventy.seven.seven

coffee cake and cream

chocolate ice cream and cake

fish fingers and bottles

cheese sandwiches and milk

cream buns and chocolate

[from Prosody on the Web tutorial on chunking]

Clustering
Pitch accent clustering:
  4-way distinction: 1000 samples, proportional
  2-16 clusters constructed
    Assign most frequent class label to each cluster
Classifier:
  Asymmetric k-lines: context-dependent kernel radii, non-spherical
> 78% accuracy:
  2 clusters: asymmetric k-lines best
Context effects:
  Vector w/preceding context vs. vector with no context comparable
