Practical Evaluation of Speech Recognizers for Virtual Human Dialogue Systems

Xuchen Yao (1), Pravin Bhutada (2), Kallirroi Georgila (3), Kenji Sagae (3), Ron Artstein (3) and David Traum (3)

(1) Faculty of Arts, University of Groningen (currently at Saarland University)
(2) Dept. of Computer Science and Center for Computational Learning Systems, Columbia University
(3) Institute for Creative Technologies, University of Southern California

Overview

Evaluation of Automatic Speech Recognition (ASR) systems using data collected from six deployed dialogue systems.

Speech recognizers tested

Cambridge HTK family: HVite (v3.4.1), HDecode & Julius (v4.1.2)

- Two sets of acoustic and language models:
  - Trained directly on TRAIN
  - Adapted with the WSJ training corpus

- Same acoustic and language models for all engines.

CMU Sphinx family: Sphinx 4 & Pocket Sphinx (v0.5)

- Language model trained directly on TRAIN
- Acoustic model adapted with the WSJ training corpus

Data

Five domains of human speech directed at one or more virtual characters:

SGT Blackwell: A question-answering character who answers general questions about the Army, himself, and his technology.

- “When did you join the Army?”
- “How can you understand what I’m saying right now?”

SGT Star: A question-answering character who talks about careers in the Army.

- “Who are you?”
- “Is the pay good in the Army?”

Amani: A bargaining character used as a prototype for training soldiers to perform tactical questioning.

- “Do you know where he lives?”
- “I’ll keep this a secret.”

SASO: A negotiation training prototype in which two virtual characters negotiate with a human “trainee” about moving a medical clinic.

- “I understand, but it is imperative that we move the clinic out of this area.”

- “We can dig a well for you.”

Radiobots: A training prototype that responds to military calls for artillery fire in a virtual reality urban combat environment.

- “M T O kilo alpha four rounds target number alpha bravo one out.”

- “Shot out.”

One additional domain of conversations between two human participants:

IOTA: An extension of the Radiobots system with more varied types of calls for fire (including calls for air support).

- “Roger where do you want to look from now that I’m looking at that building where do you want me to go?”

- “Roger contact on that east west road.”

Corpus Size

Corpora divided into TRAIN (≈ 80%), DEV (≈ 10%) and TEST (≈ 10%)

                   Words                  Turns          Turn Length
            TRAIN   TEST    DEV    TRAIN  TEST  DEV         (TEST)
Blackwell   80901  11520  11141    17755  2500  2499           4.6
Star        16340   2137   2051     2974   400   400           5.3
Amani       15553   1855   1503     1479   188   187           9.9
SASO        22703   3483   2892     3601   510   466           6.8
Radiobots    6841   1163   1325     1082   167   190           7.0
IOTA        49633   5441   6552     4939   650   608           8.4
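
A minimal sketch of this kind of split, assuming it is done randomly at the turn level (the poster only gives the proportions, not the splitting procedure; the function name and seed are illustrative):

```python
import random

def split_corpus(turns, seed=0):
    # 80/10/10 TRAIN/DEV/TEST split at the turn level.
    # A random split is an assumption; the split could also be chronological.
    turns = list(turns)
    random.Random(seed).shuffle(turns)
    n = len(turns)
    train = turns[:int(0.8 * n)]
    dev = turns[int(0.8 * n):int(0.9 * n)]
    test = turns[int(0.9 * n):]
    return train, dev, test
```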

Vocabulary

             Vocabulary (TRAIN)    Out-of-Vocab. Items (DEV)
               Size    Density     Types   Tokens (N)   Tokens (%)
Blackwell      2568       31.5       128          147         1.32
Star            516       31.7        35           37         1.80
Amani          1194       13.0        58           67         4.46
SASO            808       28.1        38           43         1.49
Radiobots       198       34.6        16           18         1.36
IOTA           1878       26.4       114          143         2.18

- Density is tokens divided by vocabulary size (not normalized to corpus size).
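
The statistics in this table can be reproduced directly from token lists. A minimal sketch, assuming whitespace-tokenized transcripts (the poster does not specify the tokenizer):

```python
def vocab_stats(train_tokens, dev_tokens):
    # Vocabulary size and density on TRAIN.
    train_vocab = set(train_tokens)
    size = len(train_vocab)
    density = len(train_tokens) / size  # tokens per vocabulary item

    # Out-of-vocabulary items on DEV: anything unseen in TRAIN.
    oov = [t for t in dev_tokens if t not in train_vocab]
    oov_types = len(set(oov))
    oov_token_pct = 100.0 * len(oov) / len(dev_tokens)
    return size, density, oov_types, len(oov), oov_token_pct
```

For Blackwell, 80901 TRAIN tokens over a vocabulary of 2568 gives a density of 31.5, and 147 OOV tokens out of 11141 DEV tokens gives 1.32%, matching the tables above.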

Perplexity

                  TRAIN              DEV               TEST
            bigram  trigram   bigram  trigram   bigram  trigram
Blackwell     10.8      8.1     11.5      8.9     12.3      9.9
Star           7.1      4.8     10.7      7.9     13.2     10.4
Amani         26.7     17.6     47.1     39.0     52.9     45.0
SASO          10.1      9.9     15.0     13.3     17.8     15.9
Radiobots      4.8      3.7      5.5      4.7      4.9      4.2
IOTA          34.3     24.3     60.4     53.0     34.1     27.7

- Language models were trained on TRAIN, hence the lower perplexity on TRAIN.
- Trigrams lead to lower perplexities than bigrams.
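
For reference, perplexity is the exponential of the average negative log-probability per token under the language model. A minimal bigram sketch with add-alpha smoothing (the smoothing method actually used is not stated on the poster; add-alpha is an assumption to keep the sketch short):

```python
import math
from collections import Counter

def bigram_perplexity(train_sents, eval_sents, alpha=0.1):
    # Collect unigram (context) and bigram counts on TRAIN,
    # with explicit sentence-boundary markers.
    uni, bi, vocab = Counter(), Counter(), set()
    for s in train_sents:
        toks = ["<s>"] + s.split() + ["</s>"]
        vocab.update(toks)
        uni.update(toks[:-1])          # count each token as a context
        bi.update(zip(toks, toks[1:]))
    V = len(vocab)

    # Perplexity = exp(-mean log p(w2 | w1)) over the evaluation set.
    log_p, n = 0.0, 0
    for s in eval_sents:
        toks = ["<s>"] + s.split() + ["</s>"]
        for w1, w2 in zip(toks, toks[1:]):
            log_p += math.log((bi[w1, w2] + alpha) / (uni[w1] + alpha * V))
            n += 1
    return math.exp(-log_p / n)
```

Evaluating on TRAIN itself reproduces the pattern in the table: the model has seen every TRAIN n-gram, so TRAIN perplexity comes out lower than DEV or TEST perplexity.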

Results

Main result: Word Error Rate on TEST

              Non Real-time       Real-time
             HVite   HDecode   Julius   PocketSphinx
Blackwell        —        42       32             53
Star            33        32       36             33
Amani           56        65       50             38
SASO            33        29       33             30
Radiobots       15        12       14             10
IOTA            57        39       42             47

- The difference between recognizers is typically smaller than the differences between the domains.

- Among the HTK recognizers, HDecode had the best word error rates in general. However, HDecode does not run in real-time.
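
For reference, WER is the word-level Levenshtein distance (substitutions + deletions + insertions) between the recognizer hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                dp[i - 1][j] + 1,                               # deletion
                dp[i][j - 1] + 1,                               # insertion
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# e.g. wer("shot out", "shots out") == 0.5: one substitution over two reference words.
```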

Word Error Rate on DEV

              Non Real-time              Real-time
             HVite   HDecode   Sphinx4   Julius   PocketSphinx
Blackwell       34        31        60       35             49
Star            22        20        33       27             25
Amani           47        49        42       54             35
SASO            32        28        32       33             36
Radiobots       10        11         —       17              7
IOTA            66        49        76       61             55

- Between the two Sphinx recognizers, PocketSphinx (real-time) outperformed Sphinx4 (not real-time) in five of the six datasets.

Adaptation Affects Performance

- HTK decoders tested on DEV under two conditions (the language-model side of adaptation is sketched at the end of this section):
  – Unadapted: only the domain-specific dataset was used for training.
  + Adapted: the domain-specific dataset was augmented with a larger WSJ dataset.
  Adaptation was done for both language and acoustic models.

- Some decoders were tested with bigram and trigram language models.

               HVite        HDecode              Julius
Adaptation     +    –     +    +    –    –     +    –    –
N-gram         2    2     2    3    2    3     2    2    3
Blackwell     34   44    34   31   47   45    35   46   53
Star          22   23    21   20   23   22    27   26   27
Amani         56   47    52   50   51   49    54   56   62
SASO          32   32    29   28   36   34    33   38   44
Radiobots     14   10    13   13   12   11    17   18   18
IOTA          71   66    49   49   66   66    61   69   82

- No single setting is best in all data sets:
  - HVite is better with adaptation for Blackwell, but worse for IOTA and Amani.
  - HDecode does best with adapted trigrams for most domains.

- Adaptation decreases decoding speed because the search space is widened.
- Enriched models could compensate for data sparsity: Blackwell, SASO.
- Using the additional WSJ dataset increases the size of models substantially.

- The effect of adaptation is typically smaller than the inherent differences between the domains.
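
A minimal sketch of the language-model side of the adapted condition, assuming plain count pooling of the domain TRAIN set with WSJ text (the poster says only that the domain data was augmented with WSJ; the pooling scheme and the optional weight are assumptions):

```python
from collections import Counter

def pooled_bigram_counts(domain_sents, wsj_sents, wsj_weight=1.0):
    # Bigram counts over the domain corpus pooled with (optionally
    # down-weighted) WSJ data; a smoothed LM would be built on top.
    counts = Counter()
    for weight, sents in ((1.0, domain_sents), (wsj_weight, wsj_sents)):
        for s in sents:
            toks = ["<s>"] + s.split() + ["</s>"]
            for pair in zip(toks, toks[1:]):
                counts[pair] += weight
    return counts
```

Pooling in WSJ greatly enlarges the vocabulary and n-gram inventory, which is consistent with the observations above that the adapted models are substantially larger and slower to decode.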

Conclusion

- No single ASR engine is superior to all others in every domain.

- WER levels for dialogues coming from different systems vary from relatively low to very high in different domains.

- Performance of free off-the-shelf ASR engines is good enough for use in virtual human dialogue systems in some specific domains, but not others.

- Additional considerations are important for practical applications:
  - Design requirements, e.g. whether real-time performance is needed;
  - Development constraints, e.g. ease of integration with other system components.

- In future work we intend to examine the impact of speech recognizers on the performance of natural language understanding modules and on the overall performance of dialogue systems.

Acknowledgments

The work described here has been sponsored by the U.S. Army Research, Development, and Engineering Command (RDECOM). Statements and opinions expressed do not necessarily reflect the position or the policy of the United States Government, and no official endorsement should be inferred.

The bulk of the work was performed at ICT, where the first two authors were interns in the summer of 2009.

The first author acknowledges financial support from the Erasmus Mundus Program through scholarships for the European Masters Program in Language and Communication Technologies (LCT).

Natural Language Dialogue Group, Institute for Creative Technologies, University of Southern California http://ict.usc.edu