computing the fundamental frequency variation …heldner/posters/laskowskiacoustics2008.slide… ·...

66
Introduction SC/¬SC Prediction Experiments Conclusions Computing the Fundamental Frequency Variation Spectrum in Conversational Spoken Dialogue Systems Kornel Laskowski a,b , MatthiasW¨olfel b , Mattias Heldner c & Jens Edlund c a CMU, Pittsburgh PA, USA b UKA(TH), Karlsruhe, Germany c KTH, Stockholm, Sweden 2 July, 2008 K. Laskowski, M. W¨ olfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Upload: lamdan

Post on 18-Mar-2018

222 views

Category:

Documents


6 download

TRANSCRIPT

Page 1: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Computing the Fundamental Frequency

Variation Spectrum in ConversationalSpoken Dialogue Systems

Kornel Laskowskia,b,Matthias Wolfelb, Mattias Heldnerc & Jens Edlundc

aCMU, Pittsburgh PA, USAbUKA(TH), Karlsruhe, Germany

cKTH, Stockholm, Sweden

2 July, 2008

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 2: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Fundamental Frequency (F0) Variation (FFV)

how does F0 vary in time?

FFV: ongoing work, building on ICASSP 2008 and SpeechProsody 2008

OUR ULTIMATE GOAL: ability to automatically learnprosodic sequences characterizing various phenomena

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 3: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 4: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 5: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 6: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 7: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 8: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 9: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 10: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Canonical Measurement of F0 Variation

1 estimate frame-level autocorrelation

2 find local maxima

3 identify best maximum via dynamic programming acrossmultiple frames

4 median filter maxima across multiple frames

5 syllabify speech via ASR or landmark detection

6 fit linear model acros multiple frames in same syllable

7 estimate speaker’s baseline pitch across multiple frames

8 normalize out baseline

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 11: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 12: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 13: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 14: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 15: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 16: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 17: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 18: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 19: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Wish List

Would like

a representation which is:

continuous: not undefined in unvoiced regionsinstantaneous: no long-distance constraintsdistributed: vector-valued rather than scalar-valuedsparse: minimally redundant

and which:

exhibits speaker-independence: no normalization necessaryenjoys perceptual relevance: variation in octaves per timelends itself to a wealth of ASR HMM modeling techniques

FFV appears to satisfy all these constraints/requirements

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 20: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 21: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 22: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 23: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 24: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 25: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 26: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Applications in Speech Technology

identification of places to use back-channel feedback

classification of rhetorical relations

interpretation of discourse markers

dialogue act tagging

identification of speech repairs

here, prediction of speaker change in conversational spokendialogue systems

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 27: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Outline

1. Introduction & Motivation

2. Speaker-Change Prediction

3. Windowing Experiments

4. Conclusions

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 28: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Speaker-Change Prediction in Dialogue Systems

in other words: is the speaker finished?

study how humans behave, towards humans

learn from what actually happens: no need to label data

500 ms

f

T tg ,V

g

T tg ,N

t

T tf ,N

Lt =

{

SC if T tf ,N

− T tg ,N < 0

¬SC, otherwise(1)

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 29: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Assessing Performance

FALSE POSITIVE RATE

TR

UE

PO

SIT

IVE

RA

TE

receiver operating characteristic (ROC)curves: true vs false positive rate

performance of random guessing: line ofno discrimination

discrimination: area A below the ROCcurve, 0≤A≤1

in this work: area A between the ROCcurve and the line of no discrimination,0≤A≤1

2

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 30: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Assessing Performance

FALSE POSITIVE RATE

TR

UE

PO

SIT

IVE

RA

TE

receiver operating characteristic (ROC)curves: true vs false positive rate

performance of random guessing: line ofno discrimination

discrimination: area A below the ROCcurve, 0≤A≤1

in this work: area A between the ROCcurve and the line of no discrimination,0≤A≤1

2

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 31: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Assessing Performance

FALSE POSITIVE RATE

TR

UE

PO

SIT

IVE

RA

TE

receiver operating characteristic (ROC)curves: true vs false positive rate

performance of random guessing: line ofno discrimination

discrimination: area A below the ROCcurve, 0≤A≤1

in this work: area A between the ROCcurve and the line of no discrimination,0≤A≤1

2

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 32: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Assessing Performance

FALSE POSITIVE RATE

TR

UE

PO

SIT

IVE

RA

TE

receiver operating characteristic (ROC)curves: true vs false positive rate

performance of random guessing: line ofno discrimination

discrimination: area A below the ROCcurve, 0≤A≤1

in this work: area A between the ROCcurve and the line of no discrimination,0≤A≤1

2

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 33: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Data

interactive human-human dialogues

Swedish Map Task Corpus:

Duration Dialogue role gData Set

(mn:ss) speakers # EOTs # SCs

DevSet 77:40 F4,F5,M2,M3 480 222EvalSet 60:39 F1,F2,F3,M1 317 149

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 34: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

System Architecture

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

FILTERBANK

F0 VARIATION

WINDOWING

LIKELIHOOD

CLASSIFICATION

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 35: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 2: Windowing (& FFT Computation)

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

FILTERBANK

F0 VARIATION

LIKELIHOOD

CLASSIFICATION

WINDOWING

1 Spectral estimation over left and rightportions of analysis frame.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 36: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 2: Windowing (& FFT Computation)

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

FILTERBANK

F0 VARIATION

LIKELIHOOD

CLASSIFICATION

WINDOWING

1 Spectral estimation over left and rightportions of analysis frame.

−0.016 −0.004−0.002 0 0.002 0.004 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 37: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 3: F0 Variation (FFV) Computation

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

FILTERBANK

WINDOWING

LIKELIHOOD

CLASSIFICATION

F0 VARIATION

1 Dilate left FFT, dot product with rightFFT; & vice versa. (ICASSP’2008)

2 Maximum over resulting spectrumrepresents change in octaves per second.

−2 −1 0 +1 +20

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 38: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 3: F0 Variation (FFV) Computation

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

FILTERBANK

WINDOWING

LIKELIHOOD

CLASSIFICATION

F0 VARIATION

1 Dilate left FFT, dot product with rightFFT; & vice versa. (ICASSP’2008)

2 Maximum over resulting spectrumrepresents change in octaves per second.

−2 −1 0 +1 +20

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 39: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 4: Application of Filterbank

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

WINDOWING

LIKELIHOOD

CLASSIFICATION

F0 VARIATION

FILTERBANK

1 Compress spectral representation to7-element vector. (SpeechProsody’2008)

−5.4 −3.4 2.4 −1.0 +1.0 2.4 +3.4 +5.40

0.05

0.1

0.15

0.2

0.25

0.3FBold and FBnewFBnew only

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 40: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 6: Modeling

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

WINDOWING

F0 VARIATION

FILTERBANK

LIKELIHOOD

CLASSIFICATION

1 For each class (SC/¬SC), train 10HMMs.

2 Maximum likelihood classification.

3 → 100 candidate dividing hyperplanes.

4 Compute the mean/min/maxdiscrimination over these 100.

5 Compute the single hyperplane (“prod”)between 2 class products of 10 modelseach.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 41: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 6: Modeling

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

WINDOWING

F0 VARIATION

FILTERBANK

LIKELIHOOD

CLASSIFICATION

1 For each class (SC/¬SC), train 10HMMs.

2 Maximum likelihood classification.

3 → 100 candidate dividing hyperplanes.

4 Compute the mean/min/maxdiscrimination over these 100.

5 Compute the single hyperplane (“prod”)between 2 class products of 10 modelseach.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 42: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 6: Modeling

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

WINDOWING

F0 VARIATION

FILTERBANK

LIKELIHOOD

CLASSIFICATION

1 For each class (SC/¬SC), train 10HMMs.

2 Maximum likelihood classification.

3 → 100 candidate dividing hyperplanes.

4 Compute the mean/min/maxdiscrimination over these 100.

5 Compute the single hyperplane (“prod”)between 2 class products of 10 modelseach.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 43: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 6: Modeling

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

WINDOWING

F0 VARIATION

FILTERBANK

LIKELIHOOD

CLASSIFICATION

1 For each class (SC/¬SC), train 10HMMs.

2 Maximum likelihood classification.

3 → 100 candidate dividing hyperplanes.

4 Compute the mean/min/maxdiscrimination over these 100.

5 Compute the single hyperplane (“prod”)between 2 class products of 10 modelseach.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 44: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Step 6: Modeling

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

WINDOWING

F0 VARIATION

FILTERBANK

LIKELIHOOD

CLASSIFICATION

1 For each class (SC/¬SC), train 10HMMs.

2 Maximum likelihood classification.

3 → 100 candidate dividing hyperplanes.

4 Compute the mean/min/maxdiscrimination over these 100.

5 Compute the single hyperplane (“prod”)between 2 class products of 10 modelseach.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 45: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Focus of This Work

5.

7.

4.

3.

2.

1.

0.

6.

SAD

AUDIO CAPTURE

PREEMPHASIS

WHITENING

FILTERBANK

F0 VARIATION

LIKELIHOOD

CLASSIFICATION

WINDOWING

In this work, investigate sensitivity ofspeaker-change prediction performanceon windowing policy

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 46: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Two Experiments

Observation: Baseline asymmetric windows are known to havepoor frequency resolution.

1 Keep window separation fixed; increase overlap to symmetrize.

2 Keep overlap fixed; increase window separation to symmetrize.

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 47: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1

keep window maxima a constant tsep apart

less asymmetry ↔ more window support overlap

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 48: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1

keep window maxima a constant tsep apart

less asymmetry ↔ more window support overlap

−0.016 −0.004−0.002 0 0.002 0.004 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 49: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1

keep window maxima a constant tsep apart

less asymmetry ↔ more window support overlap

−0.016 −0.004 0 0.004 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 50: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1

keep window maxima a constant tsep apart

less asymmetry ↔ more window support overlap

−0.016 −0.006−0.004 0 0.004 0.006 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 51: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1

keep window maxima a constant tsep apart

less asymmetry ↔ more window support overlap

−0.016 −0.008 −0.004 0 0.004 0.008 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 52: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1

keep window maxima a constant tsep apart

less asymmetry ↔ more window support overlap

−0.016 −0.01 −0.004 0 0.004 0.01 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 53: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1: Results

devSet

baseline −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> symmetric0

5

10

15

20

25

WINDOW SHAPER

OC

dis

crim

inat

ion

abov

e 50

%

meanprod

symmetric windows appear to lead to:

lower ROC discrimination than baseline, in all cases

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 54: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1: Results

devSet

baseline −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> symmetric0

5

10

15

20

25

WINDOW SHAPER

OC

dis

crim

inat

ion

abov

e 50

%

meanprod

symmetric windows appear to lead to:

lower ROC discrimination than baseline, in all cases

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 55: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 1: Results

devSet

baseline −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> symmetric0

5

10

15

20

25

WINDOW SHAPER

OC

dis

crim

inat

ion

abov

e 50

%

meanprod

symmetric windows appear to lead to:

lower ROC discrimination than baseline, in all cases

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 56: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2

keep window support overlap constant

less asymmetry ↔ window maxima further apart

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 57: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2

keep window support overlap constant

less asymmetry ↔ window maxima further apart

−0.016 −0.004 0 0.004 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 58: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2

keep window support overlap constant

less asymmetry ↔ window maxima further apart

−0.016 −0.005−0.004 0 0.0040.005 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 59: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2

keep window support overlap constant

less asymmetry ↔ window maxima further apart

−0.016 −0.006−0.004 0 0.004 0.006 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 60: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2

keep window support overlap constant

less asymmetry ↔ window maxima further apart

−0.016 −0.007 −0.004 0 0.004 0.007 0.0160

0.2

0.4

0.6

0.8

1

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 61: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2: Results

devSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

symmetric windows appear to lead to:

higher ROC discrimination than baseline, in all casessmaller variability between best and worst partitions

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 62: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2: Results

devSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

symmetric windows appear to lead to:

higher ROC discrimination than baseline, in all casessmaller variability between best and worst partitions

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 63: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2: Results

devSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

symmetric windows appear to lead to:

higher ROC discrimination than baseline, in all casessmaller variability between best and worst partitions

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 64: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Experiment 2: Results

devSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

evalSet

baseline −−> more sym −−> symmetric0

5

10

15

20

25

WINDOW SHAPE

RO

C d

iscr

imin

atio

n ab

ove

50%

meanprod

symmetric windows appear to lead to:

higher ROC discrimination than baseline, in all casessmaller variability between best and worst partitions

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 65: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Conclusions

tsep: separation between window maxima

tfra: duration of analysis frame

1 when tsep > 13tfra, symmetric-support windows appear best

2 when tsep < 13tfra, first priority should be to limit overlap in

support to a maximum of tsep at the expense of symmetryif necessary

3 results suggest that better ROC discrimination may bepossible when symmetric-support windows are placed evenfurther apart in time than tried here

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France

Page 66: Computing the Fundamental Frequency Variation …heldner/posters/laskowskiACOUSTICS2008.slide… · interpretation of discourse markers dialogue act tagging identification of speech

Introduction SC/¬SC Prediction Experiments Conclusions

Thanks for attending.

([email protected])

K. Laskowski, M. Wolfel, M. Heldner, J. Edlund Acoustics 2008, Paris, France