CEICES: a “Vertical” Approach
Towards Recognizing Emotion in Speech
Anton Batliner
Lehrstuhl für Mustererkennung (Informatik 5)
(Chair for Pattern Recognition)
Friedrich-Alexander-Universität Erlangen-Nürnberg
HUMAINE Plenary, Paris, June 4th, 2007
Seite 2
A. Batliner
Click to edit Master title style
04/11/23 Ergebnisse Mitarbeiterbefragung 2001
page 2
What is CEICES?
CEICES Initiative
Combining Efforts for Improving automatic Classification of Emotional user States - a "forced co-operation" initiative
Partners active:
  from outside HUMAINE: TUM (Technische Universität München), FBK-irst (inside/outside)
  from within HUMAINE, WP4: FAU, UA, LIMSI, TAU/AFEKA
People: Anton Batliner, Stefan Steidl, Björn Schuller, Dino Seppi, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, Vered Aharonson
Idea Behind
different research traditions at different sites
somehow fossilized approaches at different sites
co-operation pays off: pooling together competence and feature sets
Which data do we use?
Database
German corpus with recordings of 51 children aged ten to twelve communicating with Sony's Aibo pet robot (9.2 hours of speech, 51,393 words)
data ± reverberated, transliterated, annotated: 5 labellers, 11 word-based "emotion" labels
originator site (FAU) provides speech files, phonetic lexicon, definition of train and test samples, etc.
effort for manual “pre-processing” only: ~80 k € researcher, ~80 k € students (conservative estimate)
A “Vertical” Approach
segmentation, transliteration
emotion labelling
annotation of interaction
manual word segmentation
manual correction of F0
syntactic annotation
rule-based chunking system
…
Basics of Chosen Approach
children: not exotic but normal
annotation: with context (it’s speech, not sounds)
majority voting (≥ 3 out of 5 agree)
unit of annotation: the word, because
  link to ASR
  link to higher processing (syntax, dialogue, semantics)
  smallest possible emotional unit
  can be combined into higher units of different size
mapping onto 4 cover classes, due to sparse data:
  Motherese (positive valence)
  default class Neutral
  "pre-stage" to negative: Emphatic
  negative (Angry) "dimension" (smearing fine-grained differences between: touchy, reprimanding, angry)
AMEN sub-sample
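The majority-voting step can be illustrated with a minimal sketch. It assumes that fine-grained labels are first mapped onto the cover classes and that a word keeps a label only if at least 3 of the 5 labellers agree; the order of mapping and voting, the function name, and the dictionary are assumptions for illustration. Only the fine-grained labels named on the slide are mapped; the rest of the 11-label set is not listed in the source.

```python
from collections import Counter

# Hypothetical mapping: only labels named on the slide are included.
COVER_CLASS = {
    "motherese": "M",
    "neutral": "N",
    "emphatic": "E",
    "touchy": "A",        # touchy, reprimanding and angry are merged into Angry
    "reprimanding": "A",
    "angry": "A",
}

def majority_label(votes, min_agree=3):
    """Return the cover class chosen by at least `min_agree` labellers, else None."""
    mapped = [COVER_CLASS[v] for v in votes if v in COVER_CLASS]
    if not mapped:
        return None
    label, count = Counter(mapped).most_common(1)[0]
    return label if count >= min_agree else None

# Three different "negative" labels still yield a majority for Angry.
example = majority_label(["angry", "touchy", "reprimanding", "neutral", "neutral"])
```

With five labellers, at most one class can reach three votes, so the result is unambiguous whenever it is not None.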
The AMEN Sub-sample
syntactically/semantically meaningful chunks with at least one AMEN word
syntactic-prosodic chunking rules: chunk boundary IF (syntactic boundary = sentence/free phrase/between vocatives) OR (pause ≥ 500 ms at any other syntactic boundary)
frequencies: Motherese: 586, Neutral: 1998, Emphatic: 1045, Angry: 914
experiments so far with 2- or 3-fold speaker-independent cross-validation, upsampling for training
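The syntactic-prosodic chunking rule above can be sketched as follows. This is an illustration under stated assumptions, not the rule-based system used at FAU: the `Word` record, the boundary type names, and the inclusive 500 ms threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed names for the boundary types that always trigger a chunk boundary.
STRONG_BOUNDARIES = {"sentence", "free_phrase", "between_vocatives"}

@dataclass
class Word:
    text: str
    boundary_after: Optional[str]  # syntactic boundary type after the word, or None
    pause_after_ms: int            # pause length following the word

def is_chunk_boundary(word: Word, min_pause_ms: int = 500) -> bool:
    """Strong syntactic boundary, or any other syntactic boundary plus a long pause."""
    if word.boundary_after in STRONG_BOUNDARIES:
        return True
    return word.boundary_after is not None and word.pause_after_ms >= min_pause_ms

def chunk(words: List[Word]) -> List[List[str]]:
    """Split a word sequence into chunks at the rule-defined boundaries."""
    chunks, current = [], []
    for w in words:
        current.append(w.text)
        if is_chunk_boundary(w):
            chunks.append(current)
            current = []
    if current:  # trailing words without a closing boundary
        chunks.append(current)
    return chunks

# Made-up utterance: a weak boundary with a long pause, then a sentence boundary.
words = [
    Word("Aibo", "other", 600),
    Word("geh", None, 0),
    Word("nach", None, 0),
    Word("links", "sentence", 0),
    Word("stopp", "other", 100),
    Word("jetzt", None, 0),
]
result = chunk(words)
```

A pause alone does not trigger a boundary in this sketch; it only counts where a syntactic boundary is already present, which follows the wording of the rule.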
Type of Database
Scenario
- acted
- prompted
- real-life
+ elicited/induced
+ volunteering
+ application-oriented
- emotion-oriented
Outcome
+ spontaneous
+ natural
+ realistic
- selected
acted - induced - natural
fully exploiting the state of the art:
relevance of features
Feature Encoding Scheme (WS at FAU 12/06)
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.10.00N00X0000000000T0000000000Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000001000F00.10.00N00X0000000000T__________PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.01N00X0000000000T0000000000PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00C0000010000F10.30.00N00X0000000000T0000000000PTUM_0_fa1_band_stdd
SEM S8I02108M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.91N04X0000000000T0000000000PPOV_positive_valence_norm
BOW S5I01440M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_wieder
E S8I01229M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.10N00X0000000000T0000000000PEnEneAbs0_0_f29_mean
POS S5I00042M1D4R5101L020000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_APN
POS S5I00045M1D4R5101L050000A00.00.00.00.00.00.00.00.00.00C0000010000F10.35.00N00X0000000000T0000000000PTUM_sum_PAJ
SEM S8I02106M1D4R5111L009000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PRES_rest
D S8I01056M1D4R5111L000000A10.00.00.00.00.00.00.00.00.00C0000010000F00.99.01N00X0000000000T0000000000PDurAbsSyl0_0_f56_min
P S8I01263M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F20.61.10N00X0000000000T0000000000PF0RegCoeff0_0_f63_mean
S S4I00080M1D4R5111L000000A00.00.00.10.00.00.04.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvnhr
P S4I01055M1D4R5111L000000A00.00.10.00.00.00.00.00.00.00C0000010000F00.22.00N00X1000000000T0000000000Pprctilep4
AE S4I01001M1D4R5111L000000A00.10.10.00.00.00.00.00.00.00C0000010000F30.02.00N00X1000000000T0000000000Ploud_maxval
V S4I00075M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimapq3
E S8I01029M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.00.01N00X0000000000T0000000000PEnEneAbs0_0_f29_min
V S4I00074M1D4R5111L000000A00.00.00.00.00.00.02.00.00.00C0000010000F00.00.00N00X1000000000T0000000000Pvshimloc
BOW S5I01217M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_halt
BOW S5I01382M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00C0000010000F00.92.00N05X0000000000T0000000000PTUM_logTF_ILS_sollst
C S5I03116M1D4R5102L000000A00.00.00.00.00.10.00.00.00.00C0000010000F10.10.00N00X0000000000T0000000000PTUM_MFCC10Average
C S5I05195M1D4R5102L000000A00.00.00.00.00.12.00.00.00.00C0000010000F11.18.00N00X0000000000T0000000000PTUM_0_mfcc_c12_d_cnt
E S6I01002M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F00.02.00N00X0000000000T0000000000Penergy_max
C S6I04113M1D4R5112L000000A00.00.00.00.00.04.00.00.00.00C0000010000F00.31.00N00X0000000000T0000000000Pmfcc4_var
E S1I00006M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000101010F00.49.00N00X0000000000T__________PeneTau____
S S6I03006M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00C0000010000F00.21.00N00X0000000000T0000000000Pspectral_cog_median
SEM S8I02103M1D4R5111L003000A00.00.00.00.00.00.00.00.00.00C0000010000F00.41.00N00X0000000000T0000000000PNEV_negative_valence
E S6I01114M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00C0000010000F02.21.00N00X0000000000T0000000000Penergy_deltadelta_median
Zoom on Feature Encoding Scheme
SEM S8I02102M1D4R5111L002000A00.00.00.00.00.00.00.00.00.00 .. PPOV_positive_valence
BOW S5I01413M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_und
S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00 .. Pspectral_cog_mean
E S1I00004M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00 .. PeneMean___
BOW S5I01270M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_kom
BOW S5I01344M1D4R5101L200000A00.00.00.00.00.00.00.00.00.00 .. PTUM_logTF_ILS_rum
E S8I01036M1D4R5112L000000A00.10.00.00.00.00.00.00.00.00 .. PEnMax0_0_f36_min
S S5I04027M1D4R5102L000000A00.00.00.00.13.00.00.00.00.00 .. PTUM_0_fa1_band_stdd
linguistic encoding
acoustic encoding
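To make the layout of one full (unabbreviated) code string easier to see, here is a hedged sketch that splits it at its single-letter tags. The slides do not spell out what each tagged field means, so the parser only separates the fields and makes no claim about their semantics; the group names `type` and `name`, the function name, and the fixed field widths are assumptions read off the examples shown.

```python
import re

# Field widths are inferred from the example strings, not from a specification.
FEATURE_CODE = re.compile(
    r"^(?P<type>[A-Z]+) "                      # leading feature-type token, e.g. SEM, BOW, E
    r"S(?P<S>\d)I(?P<I>\d{5})M(?P<M>\d)D(?P<D>\d)R(?P<R>\d{4})"
    r"L(?P<L>\d{6})"                           # six digits after the L tag
    r"A(?P<A>\d{2}(?:\.\d{2}){9})"             # ten dot-separated two-digit slots
    r"C(?P<C>\d{10})F(?P<F>\d{2}(?:\.\d{2}){2})N(?P<N>\d{2})"
    r"X(?P<X>\d{10})T(?P<T>[\d_]{10})"
    r"P(?P<name>.+)$"                          # trailing feature name after the P tag
)

def parse_feature_code(code: str) -> dict:
    """Split one full feature code into its tagged fields."""
    match = FEATURE_CODE.match(code)
    if match is None:
        raise ValueError(f"not a recognised feature code: {code!r}")
    fields = match.groupdict()
    fields["A"] = fields["A"].split(".")       # expose the ten slots as a list
    return fields

code = ("S S6I03001M1D4R5112L000000A00.00.00.10.00.00.00.00.00.00"
        "C0000010000F00.10.00N00X0000000000T0000000000Pspectral_cog_mean")
fields = parse_feature_code(code)
```

In this example the fourth slot of the `A` field is the only non-zero one and the `L` field is all zeros, whereas the bag-of-words and semantic entries carry a non-zero `L` field and an all-zero `A` field.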
Impact of Feature Types (SVM), Separate and Combined Classification of Chunks, F Values: Acoustic and Linguistic Features
feature type          #   all   red. (150)   SFFSsep   SFFScomb
energy              265   58.5     60.0        56.9       56.3
duration            391   55.1     60.6        54.9       49.6
F0                  333   56.1     55.1        46.7       46.8
spectral/formant    656   54.4     56.0        49.9       46.2
cepstral           1699   52.7     57.1        50.4       46.4
voice quality       154   51.5     51.6        41.5       38.7
wavelets            216   56.0     56.3        44.9       35.3
bag of words        476   62.6     62.3        53.2       37.4
part-of-speech       31   54.7      -          54.9       48.1
higher semantics     12   57.6      -          57.9       56.0
non-verbal            8   24.2      -           -          -
disfluencies          4   26.8      -           -          -
Types of Approaches, SFFS, F Values
knowledge-based and sequential (FAU, FBK: 118)*: 58.8
knowledge-based (TAU, LIMSI: 312): 53.3
brute-force (TUM, UA: 3304): 54.9
all acoustic features (3714) 63.4
all linguistic features (531) 62.6
all together (4245) 65.5
* word-based features, using manually corrected word boundaries, combined into chunk-based features
beyond the state of the art:
units of analysis
Performance for Different Chunks, Preliminary Experiments at FBK, Small Feature Set, F Values
chunk type                                        #      F
optimal, i.e. adjacent identical labels        7008   64.0
  (e.g.: NNN EE AAAA NNNNN M N MMM)
turns (pause > 1.5 sec.)                       3990   50.0
syntactic-prosodic rule system                 9152   55.2
words                                         17611   55.0
syntactic rule system (clauses/phrases/ …)     9102   53.9
prosodic rule system (pause > 0.5 sec.)        5129   53.0
LM2 (bi-gram language model)                   5480   52.8
POS-LM (part-of-speech language model)         4637   56.0
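The "optimal" chunking, in which adjacent words with identical labels form one chunk, can be sketched with a run-length grouping. The function name is hypothetical, and the label sequence is the example string from the slide with the spaces removed. Note that this grouping needs the reference labels themselves.

```python
from itertools import groupby

def optimal_chunks(labels):
    """Group adjacent identical word labels into (label, length) chunks."""
    return [(label, len(list(run))) for label, run in groupby(labels)]

# Word-level labels from the slide example: NNN EE AAAA NNNNN M N MMM
labels = list("NNNEEAAAANNNNNMNMMM")
chunks = optimal_chunks(labels)
```

The 19 word labels collapse into 7 chunks, matching the seven groups written out in the example.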
Summing up our Results
impact of acoustic feature types: energy most important, voice quality less important, other types in between (note domain-dependency!)
impact of linguistic feature types: very high - to be checked with real Automatic Speech Recognition (ASR) output
sequential approach promising
chunking is the right way to go
emotion recognition seems to be less prone to noise than comparable speech processing tasks (ICASSP 2007)
PDA (Pitch Detection Algorithm) extraction errors deteriorate performance consistently but not detrimentally (ICPhS 2007)
In a Nutshell
full exploitation of state-of-the-art approaches
  > 4 k features
  knowledge-based vs. brute-force
  selection and classification
and beyond state-of-the-art
  towards new dimensions (UMUAI 2007)
  meaningful units of analysis (chunking)
  interaction/dialogue modelling
  prototyping
  personalization
  …
and the Message of the Day
people stare at classification performance
which is tuned explicitly by highly sophisticated classifiers
and implicitly by settings not obvious to the 'normal' reader, such as
  manual emotion chunking
  using only prototypes
  using acted data
  and other devices
Thank you for your attention