Prosody in Spoken Language Understanding
Gina Anne Levow, University of Chicago. January 4, 2008. NLP Winter School 2008.
![Page 1: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/1.jpg)
Prosody in Spoken Language Understanding
Gina Anne Levow
University of Chicago
January 4, 2008
NLP Winter School 2008
![Page 2: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/2.jpg)
U: Give me the price for AT&T.
U: Give me the price for AT&T.
U: Give me the price for AT&T.
U: Give me the price for American Telephone and Telegraph.
![Page 3: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/3.jpg)
Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½ up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?
![Page 4: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/4.jpg)
Roadmap
• Corrections: A motivating example
• Defining prosody
• Why prosody?
• Challenges in prosody
• Prosody in language understanding
  • Recognizing tone and pitch accent
  • Spoken corrections, Topic segmentation
• Conclusions
![Page 5: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/5.jpg)
Defining Prosody
• Prosody
  • Phonetic phenomena in speech that span more than a single segment (“suprasegmental”)
• Prosody includes:
  • Stress, focus, tone, intonation, length/pause, rhythm
• Prosodic features include:
  • Pitch: perceptual correlate of fundamental frequency
    • f0: rate of vocal fold vibration
  • Loudness/intensity, duration, segment quality
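As a rough illustration of f0 as the rate of vocal fold vibration, the sketch below estimates f0 for a single voiced frame by picking the lag with the largest autocorrelation. The function name `estimate_f0`, the 50 to 400 Hz search range, and the synthetic sine input are assumptions made for this example, not part of the talk; real pitch trackers add windowing, normalization, and voicing decisions that are omitted here.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate f0 (Hz) of one voiced frame from the strongest autocorrelation lag."""
    min_lag = int(sample_rate / f0_max)   # shortest period considered
    max_lag = int(sample_rate / f0_min)   # longest period considered
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, min(max_lag, len(samples) - 1) + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Hypothetical input: a pure 200 Hz tone, 40 ms at 8 kHz.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(320)]
f0 = estimate_f0(frame, sr)
```

On this idealized input the period is exactly 40 samples, so the estimate lands on 200 Hz; natural speech is far less clean.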
![Page 6: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/6.jpg)
Why Prosody?
• Prosody plays a crucial role
  • At all levels of language
    • Lexical, syntactic, pragmatic/discourse
    • Establishes meaning
    • Disambiguates sense and structure
  • Across language families
    • Common physiological, articulatory basis
  • In synthesis and recognition of fluent speech
![Page 7: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/7.jpg)
Prosody and the Lexicon
• Lexical: Determines word identity
  • Prosodic effect at the syllable level (minimal unit)
• Lexical stress: syllable prominence
  • Combination of length, pitch movement, loudness
  • REcord (N) vs reCORD (V)
• Pitch accent can differentiate words in some languages
• Lexical tone: tone languages, e.g. Chinese, Punjabi
  • Pitch height (register) and/or shape (contour)
  • Ma (high): mother
  • Ma (rising): hemp
  • Ma (low): horse
  • Ma (falling): scold
![Page 8: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/8.jpg)
Prosody and Syntax
• Prosody can disambiguate structure
  • Associated with chunking and attachment
• Not identical with syntactic phrase boundaries
  • “Prosody is predictable from syntax, except when it isn’t”
• Prosodic phrasing indicated by:
  • Some combination of pause, change in pitch
![Page 9: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/9.jpg)
Chunking, or “phrasing”
A1: I met Mary and Elena’s mother at the mall yesterday.
A2: I met Mary and Elena’s mother at the mall yesterday.
[Figure: pitch tracks for utterances A1 and A2; vertical axis 50 to 400]
Example from Jennifer Venditti
![Page 10: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/10.jpg)
Punctuation & Prosody Humor
A panda goes into a restaurant and has a meal. Just before he leaves he takes out a gun and fires it. The irate restaurant owner says, ‘Why did you do that?’ The panda replies, ‘I'm a panda. Look it up.’ The restaurateur goes to his dictionary and under ‘panda’ finds: ‘black and white, arboreal, bear-like creatures; eats, shoots and leaves.’
![Page 11: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/11.jpg)
Prosody in Pragmatics & Discourse
• Focus:
  • Prominence, new information: pitch accent
  • “October eleventh”
• Sentence type, dialogue act:
  • Statement vs. declarative question: “It’s raining (?)”
• Discourse Structure (Topic), Emotion
from Shih, Prosody Learning and Generation
![Page 12: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/12.jpg)
Challenges in Prosody I
• Highly variable
  • Actual realization differs from ideal
  • Speaker variation: gender, vocal tract differences, idiosyncrasy
• Tonal coarticulation
  • Neighboring tones influence each other (as with segmental coarticulation)
  • Underlying fall can become rise
• Parallel encoding
  • Effects at multiple levels realized simultaneously
![Page 13: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/13.jpg)
Challenges in Prosody II
• Challenges for learning
  • Lack of training data
    • Sparseness: Many prosodic phenomena are infrequent
      • E.g., non-declarative utterances, topic boundaries, contrastive accents, etc.
    • Challenging for machine learning methods
  • Costs of labeling:
    • Many prosodic events require expert labeling
    • Need large corpus to attest
    • Time-consuming, expensive
![Page 14: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/14.jpg)
Context and Learning in Multilingual Tone and Pitch Accent Recognition
![Page 15: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/15.jpg)
Strategy: Context
• Common model across languages
  • Pure acoustic-prosodic model
    • No word label, POS, lexical stress info
  • English, Mandarin Chinese (also Cantonese, isiZulu)
• Exploit contextual information
  • Features from adjacent syllables, phrase contour
• Analyze impact of
  • Context position, context encoding, context type
• > 12.5% reduction in error over no context
![Page 16: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/16.jpg)
Data Collections
• English: (Ostendorf et al, 95)
  • Boston University Radio News Corpus, f2b
  • Manually annotated, aligned, syllabified
  • 4 pitch accent labels, aligned to syllables
• Mandarin:
  • TDT2 Voice of America Mandarin Broadcast News
  • Automatically aligned, syllabified
  • 4 main tones, neutral
![Page 17: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/17.jpg)
Local Feature Extraction
• Uniform representation for tone, pitch accent
  • Motivated by Pitch Target Approximation Model
    • Tone/pitch accent target exponentially approached
    • Linear target: height, slope (Xu et al, 99)
• Base features:
  • Pitch, Intensity max, mean, min, range (Praat, speaker normalized)
  • Pitch at 5 points across voiced region
  • Duration
  • Initial, final in phrase
• Slope:
  • Linear fit to last half of pitch contour
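The base pitch features listed above can be sketched as follows. This is an illustrative reading of the slide, not the author's implementation: the function names, the even spacing of the five sample points, and the use of frame index as the time axis are all assumptions, and speaker normalization is left out.

```python
def linear_slope(values):
    """Least-squares slope of values against their index positions."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0

def local_pitch_features(pitch):
    """Base pitch features for one syllable's voiced region (list of f0 values)."""
    n = len(pitch)
    # Five evenly spaced sample points across the voiced region (assumed spacing).
    points = [pitch[round(k * (n - 1) / 4)] for k in range(5)]
    return {
        "max": max(pitch),
        "min": min(pitch),
        "mean": sum(pitch) / n,
        "range": max(pitch) - min(pitch),
        "points": points,
        # Linear fit to the last half of the contour.
        "slope": linear_slope(pitch[n // 2:]),
    }

# Hypothetical rise-fall contour, one f0 value per frame.
features = local_pitch_features([200, 210, 220, 230, 240, 230, 220, 210, 200])
```

The same function shape would apply to intensity; duration and phrase position are scalar annotations that need no extraction step.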
![Page 18: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/18.jpg)
Context Features
• Local context:
  • Extended features
    • Pitch max, mean, adjacent points of preceding, following syllables
  • Difference features
    • Difference between
      • Pitch max, mean, mid, slope
      • Intensity max, mean
    • Of preceding, following and current syllable
• Phrasal context:
  • Compute collection average phrase slope
  • Compute scalar pitch values, adjusted for slope
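A minimal sketch of the two context encodings, under stated assumptions: the slide does not give the sign convention for difference features or the exact form of the phrase-slope adjustment, so the "current minus neighbor" direction, the linear detrending, the feature key names, and the toy values below are all hypothetical.

```python
def difference_features(prev, cur, nxt,
                        keys=("pitch_max", "pitch_mean", "intensity_max")):
    """Differences between the current syllable and its two neighbors."""
    feats = {}
    for key in keys:
        feats[f"{key}_minus_prev"] = cur[key] - prev[key]
        feats[f"{key}_minus_next"] = cur[key] - nxt[key]
    return feats

def compensate_phrase_slope(pitch_values, times, avg_slope):
    """Remove a linear phrase-level trend using a collection-average slope."""
    return [p - avg_slope * t for p, t in zip(pitch_values, times)]

# Hypothetical syllable feature dictionaries.
prev = {"pitch_max": 210.0, "pitch_mean": 200.0, "intensity_max": 70.0}
cur = {"pitch_max": 240.0, "pitch_mean": 225.0, "intensity_max": 75.0}
nxt = {"pitch_max": 200.0, "pitch_mean": 190.0, "intensity_max": 68.0}
diffs = difference_features(prev, cur, nxt)

# A steadily declining phrase flattens out once the average slope is removed.
flattened = compensate_phrase_slope([220.0, 210.0, 200.0], [0, 1, 2], avg_slope=-10.0)
```

Extended features, by contrast, would simply append the neighbors' raw values to the current syllable's vector rather than subtracting them.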
![Page 19: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/19.jpg)
Classification Experiments
• Classifier: Support Vector Machine
  • Linear kernel
  • Multiclass formulation
    • SVMlight (Joachims), LibSVM (Chang & Lin 01)
  • 4:1 training / test splits
• Experiments: Effects of
  • Context position: preceding, following, none, both
  • Context encoding: Extended/Difference
  • Context type: local, phrasal
![Page 20: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/20.jpg)
Results: Local Context

| Context | Mandarin Tone | English Pitch Accent |
| --- | --- | --- |
| Full | 74.5% | 81.3% |
| Extend PrePost | 74% | 80.7% |
| Extend Pre | 74% | 79.9% |
| Extend Post | 70.5% | 76.7% |
| Diffs PrePost | 75.5% | 80.7% |
| Diffs Pre | 76.5% | 79.5% |
| Diffs Post | 69% | 77.3% |
| Both Pre | 76.5% | 79.7% |
| Both Post | 71.5% | 77.6% |
| No context | 68.5% | 75.9% |
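To relate these accuracies to a "reduction in error" figure, the relative error reduction can be worked out directly from the table. The helper below is an illustrative calculation only; pairing the "No context" row with the best-scoring row for each language is a choice made for this example.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Percent of the baseline error removed; accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err

# Mandarin tone: No context (68.5%) vs. Diffs Pre (76.5%).
mandarin = relative_error_reduction(68.5, 76.5)
# English pitch accent: No context (75.9%) vs. Full (81.3%).
english = relative_error_reduction(75.9, 81.3)
```

With these pairings the reductions come out at roughly 25% for Mandarin and 22% for English.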
![Page 23: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/23.jpg)
Discussion: Local Context
• Any context information improves over none
  • Preceding context information consistently improves over none or following context information
    • English: Generally more context features are better
    • Mandarin: Following context can degrade
  • Little difference in encoding (Extend vs Diffs)
• Consistent with phonetic analysis (Xu) that carryover coarticulation is greater than anticipatory
![Page 24: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/24.jpg)
Results & Discussion: Phrasal Context

| Phrase Context | Mandarin Tone | English Pitch Accent |
| --- | --- | --- |
| Phrase | 75.5% | 81.3% |
| No Phrase | 72% | 79.9% |

• Phrase contour compensation enhances recognition
• Simple strategy
• Use of non-linear slope compensation may improve results
![Page 25: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/25.jpg)
Strategy: Training
• Challenge:
  • Can we use the underlying acoustic structure of the language, through unlabeled examples, to reduce the need for expensive labeled training data?
• Exploit semi-supervised and unsupervised learning
  • Semi-supervised Laplacian SVM
  • K-means and asymmetric k-lines clustering
  • Substantially outperform baselines
    • Can approach supervised levels
![Page 26: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/26.jpg)
Semi-supervised Learning
• Approach:
  • Employ small amount of labeled data
  • Exploit information from additional, presumably more available, unlabeled data
    • Few prior examples; several weakly supervised (Wong et al, ’05)
• Classifier:
  • Laplacian SVM (Sindhwani, Belkin & Niyogi ’05)
  • Semi-supervised variant of SVM
    • Exploits unlabeled examples
  • RBF kernel, typically 6 nearest neighbors, transductive
![Page 27: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/27.jpg)
Experiments
• Pitch accent recognition:
  • Binary classification: Unaccented/Accented
  • 1000 instances, proportionally sampled
    • Labeled training: 200 unacc, 100 acc
  • 80% accuracy (cf. 84% w/15x labeled SVM)
• Mandarin tone recognition:
  • 4-way classification: n(n-1)/2 binary classifiers
  • 400 instances: balanced; 160 labeled
  • Clean lab speech, in-focus: 94%
    • cf. 99% w/SVM, 1000s train; 85% w/SVM 160 training samples
  • Broadcast news: 70%
    • cf. < 50% w/SVM 160 training samples
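The "n(n-1)/2 binary classifiers" scheme is the one-vs-one decomposition: one binary decision per pair of classes, combined by voting. The sketch below shows only the voting logic; the pairwise decider is a hypothetical stand-in (a nearest-prototype rule on a single made-up slope value), not the Laplacian SVM used in the experiments.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, pairwise_decide, x):
    """Run every pairwise classifier on x and return the class with most votes."""
    votes = Counter(pairwise_decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

tones = ["high", "rising", "low", "falling"]
pairs = list(combinations(tones, 2))  # 4 classes give 4*3/2 = 6 pairs

# Hypothetical prototype slopes, invented for this toy decider.
prototype_slope = {"high": 0.0, "rising": 1.0, "low": -0.3, "falling": -1.0}

def toy_decide(a, b, x):
    """Pick whichever of the two tones has the closer prototype slope."""
    return a if abs(prototype_slope[a] - x) <= abs(prototype_slope[b] - x) else b

prediction = one_vs_one_predict(tones, toy_decide, 0.9)
```

In a real system each pair would have its own trained binary classifier in place of `toy_decide`.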
![Page 28: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/28.jpg)
Unsupervised Learning
• Question:
  • Can we identify the tone structure of a language from the acoustic space without training?
    • Analogous to language acquisition
• Significant recent research in unsupervised clustering
  • Established approaches: k-means
  • Spectral clustering (Shi & Malik ’97, Fischer & Poland 2004): asymmetric k-lines
• Little research for tone
  • Self-organizing maps (Gauthier et al, 2005)
    • Tones identified in lab speech using f0 velocities
  • Cluster-based bootstrapping (Narayanan et al, 2006)
  • Prominence clustering (Tamburini ’05)
![Page 29: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/29.jpg)
Contrasting Clustering
• Contrasts:
  • Clustering: 2-16 clusters, label w/most freq class
  • 3 Spectral approaches:
    • Perform spectral decomposition of affinity matrix
    • Asymmetric k-lines (Fischer & Poland 2004)
    • Symmetric k-lines (Fischer & Poland 2004)
    • Laplacian Eigenmaps (Belkin, Niyogi, & Sindhwani 2004)
      • Binary weights, k-lines clustering
  • K-means: Standard Euclidean distance
  • # of clusters: 2-16
• Best results: > 78%
  • 2 clusters: asymmetric k-lines; > 2 clusters: k-means
  • Larger # clusters: all similar
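Scoring a clustering by labeling each cluster with its most frequent class can be sketched as below. The function name and the toy cluster assignments are assumptions for illustration; the reference labels are used only for evaluation, not for forming the clusters.

```python
from collections import Counter, defaultdict

def majority_label_accuracy(cluster_ids, true_labels):
    """Label each cluster with its most frequent class and return the accuracy."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    # Items matching their cluster's majority class count as correct.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(true_labels)

# Hypothetical cluster assignments and reference labels.
clusters = [0, 0, 0, 1, 1, 1, 1, 2]
labels = ["acc", "acc", "unacc", "unacc", "unacc", "unacc", "acc", "acc"]
accuracy = majority_label_accuracy(clusters, labels)
```

Because every extra cluster can only raise this score, results at large cluster counts tend to converge, which fits the note that larger numbers of clusters are all similar.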
![Page 30: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/30.jpg)
Contrasting Learners
![Page 31: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/31.jpg)
Tone Clustering: I
• Mandarin four tones:
  • 400 samples: balanced
  • 2-phase clustering: 2-5 clusters each
  • Asymmetric k-lines, k-means clustering
• Clean read speech:
  • In-focus syllables: 87% (cf. 99% supervised)
  • In-focus and pre-focus: 77% (cf. 93% supervised)
• Broadcast news: 57% (cf. 74% supervised)
• K-means requires more clusters to reach k-lines level
![Page 32: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/32.jpg)
Tone Structure
First phase of clustering splits high/rising from low/falling by slope
Second phase by pitch height
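The two-phase idea can be sketched with a deliberately simple stand-in: plain 1-D k-means with two clusters, applied first to slope and then to pitch height within each slope group. This swaps in k-means for the asymmetric k-lines method named in the talk, and the (slope, height) values are invented toy data, so it shows the structure of the procedure rather than reproducing the reported results.

```python
def two_means_1d(values, iterations=20):
    """Split scalars into two groups with 1-D k-means (k=2); returns a 0/1 label each."""
    lo, hi = min(values), max(values)  # initialize centers at the extremes
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not group0 or not group1:
            break
        lo, hi = sum(group0) / len(group0), sum(group1) / len(group1)
    return labels

def two_phase_cluster(syllables):
    """syllables: list of (slope, height). Phase 1 splits by slope, phase 2 by height."""
    slope_labels = two_means_1d([slope for slope, _ in syllables])
    result = [None] * len(syllables)
    for phase1 in (0, 1):
        idx = [i for i, lab in enumerate(slope_labels) if lab == phase1]
        if not idx:
            continue
        height_labels = two_means_1d([syllables[i][1] for i in idx])
        for i, phase2 in zip(idx, height_labels):
            result[i] = (phase1, phase2)
    return result

# Toy (slope, height) pairs: two each of high, rising, low, falling.
toy = [(0.0, 0.9), (0.05, 0.85), (0.5, 0.4), (0.6, 0.45),
       (-0.3, -0.8), (-0.25, -0.9), (-0.7, 0.3), (-0.8, 0.2)]
assignments = two_phase_cluster(toy)
```

On this toy data the first phase groups high with rising and low with falling, and the second phase separates each pair by height, giving four clusters.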
![Page 33: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/33.jpg)
Conclusions
• Common prosodic framework for tone and pitch accent recognition
• Contextual modeling enhances recognition
  • Local context and broad phrase contour
    • Carryover coarticulation has larger effect for Mandarin
• Exploiting unlabeled examples for recognition
  • Semi- and unsupervised approaches
    • Best cases approach supervised levels with less training
  • Exploits acoustic structure of tone and accent space
![Page 34: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/34.jpg)
Error Correction Spiral
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½ since yesterday.
U: Give me the price for AT&T.
S: Data General was at 10 ½ up a quarter.
U: Give me the price for AT&T.
S: Hewlett Packard was 83 ¾, up 2 ½.
U: Give me the price for American Telephone and Telegraph.
S: Excuse me?
![Page 35: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/35.jpg)
Recognizing Spoken Corrections
• Spoken Corrections
  • Recognize user attempts to correct ASR failures
  • Compare original input to repeat corrections
  • Significant differences:
    • Corrections: increases in duration, pause #/length, final fall
    • Increases in pitch accent for misrecognitions
• Automatic recognition with decision trees, boosting
  • Distinguish corrective/not (human level)
    • Key features: raw/normalized duration, pause
  • Identify specific word being corrected
    • Key features: highest pitch, widest pitch range
![Page 36: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/36.jpg)
The Problem: Speech Topic Segmentation
Separate audio stream into component topics
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, Bangalore, in India, sees only profit. ||
![Page 37: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/37.jpg)
Is It Possible in Mandarin?
![Page 38: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/38.jpg)
Recognizing Shifts in Topic & Turn
• Topic & Turn boundaries in English & Mandarin
  • Initial syllables:
    • Significantly higher pitch, loudness than final
  • Lexical and prosodic cues:
    • Cue words, tf*idf similarity; pitch, loudness, silence
• Automatic recognition with decision trees, boosting
  • Voting to combine text, prosody, silence: 97% accuracy
  • Key features:
    • Pause; pitch, loudness contrast between syllables
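The tf*idf similarity cue can be illustrated with a small cosine-similarity sketch: adjacent stretches of text on the same topic share weighted vocabulary, while a topic shift shows up as a drop in similarity. The weighting (raw term count times log inverse document frequency), the function names, and the three toy segments are assumptions made here; the talk does not specify its exact formulation.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {term: tf*idf} dict per document."""
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return [
        {term: count * math.log(n / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in documents
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical segments: two about markets, one about Kosovo.
segments = [
    "stock markets fall again".split(),
    "markets fear stock losses".split(),
    "kosovo massacre allies prepare".split(),
]
vectors = tfidf_vectors(segments)
same_topic = cosine(vectors[0], vectors[1])
topic_shift = cosine(vectors[1], vectors[2])
```

A lexical score like this would be one voter alongside the prosodic and silence cues, which the slide reports combining by voting.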
![Page 39: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/39.jpg)
Conclusions & Opportunities
• Prosody
  • Rich source of information for languages
  • Challenging due to variation, paucity of data
• Can be successfully employed, with learning, to improve language understanding
  • Pitch accent, tone, dialogue act, turn, topic,…
• Unrestricted conversational, multi-party, multimodal speech much more challenging
  • Increased variability, interaction with non-verbal evidence
![Page 40: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/40.jpg)
Thanks
• Dinoj Surendran, Siwei Wang, Yi Xu
• V. Sindhwani, M. Belkin, & P. Niyogi; I. Fischer & J. Poland; T. Joachims; C.-C. Chang & C. Lin
• This work supported by NSF Grant #0414919
http://people.cs.uchicago.edu/~levow/tai
![Page 41: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/41.jpg)
Phrasing can disambiguate
I met Mary and Elena’s mother at the mall yesterday
[Figure: pitch track, vertical axis 50 to 400; phrase labels: “Mary & Elena’s mother”, “mall”]
One intonation phrase with relatively flat overall pitch range.
![Page 42: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/42.jpg)
Phrasing can disambiguate
I met Mary and Elena’s mother at the mall yesterday
[Figure: pitch track, vertical axis 50 to 400; phrase labels: “Mary”, “Elena’s mother”, “mall”]
Separate phrases, with expanded pitch movements.
![Page 43: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/43.jpg)
Lists of numbers, nouns
twenty.eight.five
ninety.four.three
seventy.three.seven
forty.seven.seven
seventy.seven.seven
coffee cake and cream
chocolate ice cream and cake
fish fingers and bottles
cheese sandwiches and milk
cream buns and chocolate
[from Prosody on the Web tutorial on chunking]
![Page 44: Prosody in Spoken Language Understanding](https://reader036.vdocument.in/reader036/viewer/2022062422/56812c2b550346895d90a650/html5/thumbnails/44.jpg)
Clustering
• Pitch accent clustering:
  • 4 way distinction: 1000 samples, proportional
  • 2-16 clusters constructed
    • Assign most frequent class label to each cluster
• Classifier:
  • Asymmetric k-lines:
    • Context-dependent kernel radii, non-spherical
  • > 78% accuracy:
    • 2 clusters: asymmetric k-lines best
• Context effects:
  • Vector w/preceding context vs vector with no context comparable