

An Analysis of the Aurora Large Vocabulary Evaluation

• Authors: Naveen Parihar and Joseph Picone
  Institute for Signal and Information Processing
  Department of Electrical and Computer Engineering
  Mississippi State University

• Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi 39762
  Tel: 662-325-8335, Fax: 662-325-2298

EUROSPEECH 2003

• URL: isip.msstate.edu/publications/conferences/eurospeech/2003/evaluation/

• Email: {parihar,picone}@isip.msstate.edu


INTRODUCTION: ABSTRACT

In this paper, we analyze the results of the recent Aurora large vocabulary evaluations (ALV). Two consortia submitted proposals on speech recognition front ends for this evaluation: (1) Qualcomm, ICSI, and OGI (QIO), and (2) Motorola, France Telecom, and Alcatel (MFA). These front ends used a variety of noise reduction techniques including discriminative transforms, feature normalization, voice activity detection, and blind equalization. Participants used a common speech recognition engine to post-process their features. In this paper, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Both the mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.


INTRODUCTION: MOTIVATION

• ALV goal was at least a 25% relative improvement over the baseline MFCC front end

• Two consortia participated:
  QIO: Qualcomm, ICSI, OGI
  MFA: Motorola, France Telecom, Alcatel

• Generic baseline LVCSR system with no front end specific tuning

• Would front end specific tuning change the rankings?

ALV Evaluation Results (WER):

Front End   Overall   8 kHz (TS1 / TS2)        16 kHz (TS1 / TS2)
MFCC        50.3%     49.6% (58.1% / 41.0%)    51.0% (62.2% / 39.8%)
QIO         37.5%     38.4% (43.2% / 33.6%)    36.5% (40.7% / 32.4%)
MFA         34.5%     34.5% (37.5% / 31.4%)    34.4% (37.2% / 31.5%)
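All of the numbers in these tables are word error rates. As a reminder of how WER is scored, here is a minimal Levenshtein-based sketch; the `wer` helper is illustrative, not part of the evaluation toolkit:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed by Levenshtein alignment over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn r[:i] into h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))   # 0.0
print(wer("the cat sat", "a cat"))         # one substitution + one deletion -> 2/3
```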


EVALUATION PARADIGM: THE AURORA-4 DATABASE

Acoustic Training:
• Derived from the 5000-word WSJ0 task
• TS1 (clean) and TS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sample frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and Evaluation Sets:
• Derived from the WSJ0 Evaluation and Development sets
• 14 test sets for each
• 7 recorded on the Sennheiser mic; 7 on a secondary mic
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
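The multi-condition sets are built by mixing a noise recording into clean speech at a randomly drawn SNR. A minimal sketch of that mixing step, assuming a simple power-ratio definition of SNR (the `add_noise_at_snr` helper is illustrative, not the actual Aurora-4 tooling):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng=None):
    """Scale a randomly chosen noise segment so the mix hits the requested
    SNR in dB, mimicking the Aurora-4 setup (SNR drawn from a range)."""
    if rng is None:
        rng = np.random.default_rng(0)
    start = rng.integers(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(seg ** 2)
    # solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * seg

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 440 * t)     # 1 s of stand-in "speech" at 8 kHz
noise = rng.normal(size=16000)           # 2 s of noise to draw a segment from
snr = rng.uniform(10, 20)                # random SNR in 10-20 dB, as for TS2
noisy = add_noise_at_snr(speech, noise, snr, rng)
```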


EVALUATION PARADIGM: BASELINE LVCSR SYSTEM

Standard context-dependent cross-word HMM-based system:

• Acoustic models: state-tied 4-mixture cross-word triphones

• Language model: WSJ0 5K bigram

• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding

• Lexicon: based on CMUlex

• Real-time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium

[Figure: acoustic model training flow – Training Data → Monophone Modeling → CD-Triphone Modeling → State-Tying → CD-Triphone Modeling → Mixture Modeling (2, 4)]


EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END

The baseline HMM system used an ETSI standard MFCC-based front end:

• Zero-mean debiasing

• 10 ms frame duration

• 25 ms Hamming window

• Absolute energy

• 12 cepstral coefficients

• First and second derivatives

[Figure: ETSI MFCC front end – Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Cepstral Analysis, with Energy computed in parallel]
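The pipeline above can be sketched end to end in NumPy. This is an illustrative MFCC implementation matching the slide's frame, window, and cepstra settings, not the exact WI007 reference code; the FFT size (256) and filterbank size (23) are assumed values:

```python
import numpy as np

def mfcc(signal, fs=8000, frame_ms=25, shift_ms=10, n_filt=23, n_ceps=12):
    """Minimal MFCC sketch: zero-mean debiasing, 10 ms frame shift, 25 ms
    Hamming window, 12 cepstral coefficients plus log energy (first and
    second derivatives would be appended afterwards)."""
    signal = signal - np.mean(signal)              # zero-mean debiasing
    flen = int(fs * frame_ms / 1000)               # 200 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)              # 80 samples at 8 kHz
    n_frames = 1 + (len(signal) - flen) // shift
    frames = np.stack([signal[i * shift : i * shift + flen]
                       for i in range(n_frames)]) * np.hamming(flen)
    nfft = 256
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2          # power spectrum
    # triangular mel-spaced filterbank
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = np.floor((nfft + 1) *
                     imel(np.linspace(0, mel(fs / 2), n_filt + 2)) / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for k in range(n_filt):
        l, c, r = edges[k], edges[k + 1], edges[k + 2]
        fbank[k, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[k, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies; keep c1..c12
    n = np.arange(n_filt)
    basis = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filt)
    ceps = log_mel @ basis.T
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]  # absolute energy
    return np.hstack([ceps, log_e])

feats = mfcc(np.random.default_rng(0).normal(size=8000))
# one second of 8 kHz audio -> 98 frames x 13 features (12 cepstra + energy)
```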


FRONT END PROPOSALS: QIO FRONT END

Qualcomm, ICSI, OGI (QIO) front end:
• 10 msec frame duration
• 25 msec analysis window
• 15 RASTA-like filtered cepstral coefficients
• MLP-based VAD
• Mean and variance normalization
• First and second derivatives

[Figure: QIO front end – Input Speech → Fourier Transform → Mel-scale Filter Bank → RASTA → DCT → Mean/Variance Normalization, with MLP-based VAD]
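The mean and variance normalization stage is easy to illustrate. A per-utterance sketch (per-utterance statistics are an assumption here; the QIO front end's exact normalization details may differ):

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Cepstral mean and variance normalization: shift each feature
    dimension to zero mean and scale it to unit variance, which removes
    stationary channel offsets and compresses level differences."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)

# toy feature matrix: 100 frames x 15 coefficients with a channel offset
feats = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(100, 15))
norm = cmvn(feats)
# each column of `norm` now has near-zero mean and unit variance
```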


FRONT END PROPOSALS: MFA FRONT END

Motorola, France Telecom, Alcatel (MFA) front end:
• 10 msec frame duration
• 25 msec analysis window
• Mel-warped Wiener filter based noise reduction
• Energy-based VADNest
• Waveform processing to enhance SNR
• Weighted log-energy
• 12 cepstral coefficients
• Blind equalization (cepstral domain)
• VAD based on acceleration of various energy-based measures
• First and second derivatives

[Figure: MFA front end – Input Speech → Noise Reduction (VADNest) → Waveform Processing → Cepstral Analysis → Blind Equalization → Feature Processing, with VAD]
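The Wiener-filter noise-reduction stage can be illustrated with the textbook per-bin gain rule. This is a generic sketch, not the MFA consortium's exact mel-warped two-stage design; the gain floor value is an assumption:

```python
import numpy as np

def wiener_gain(power_spec, noise_psd, floor=0.1):
    """Per-frequency-bin Wiener gain G = SNR / (1 + SNR), with SNR estimated
    by subtracting a noise PSD (e.g. averaged over frames a VAD marks as
    non-speech). The gain floor limits over-suppression, which otherwise
    produces audible 'musical noise'."""
    snr = np.maximum(power_spec / (noise_psd + 1e-12) - 1.0, 0.0)
    return np.maximum(snr / (1.0 + snr), floor)

# a bin at the noise level is floored; a strong speech bin passes almost unchanged
g = wiener_gain(np.array([1.0, 11.0]), np.array([1.0, 1.0]))
# enhanced spectrum would be g * power_spec
```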


EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING

• Pruning beams (word, phone and state) were opened during the tuning process to eliminate search errors.

• Tuning parameters:
  - State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
  - Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
  - Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
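To see how the last two parameters enter the search, here is the usual log-domain score combination. This is a sketch of the common HTK-style convention, not the ISIP decoder's exact formula (sign conventions for the penalty vary); the values 18 and 10 are the baseline settings from the tuning tables:

```python
def hypothesis_score(acoustic_logprob, lm_logprob, n_words,
                     lm_scale=18.0, word_ins_penalty=10.0):
    """Combined Viterbi hypothesis score: the LM scale weights the language
    model against the acoustic model, and the insertion penalty is charged
    once per word to discourage spurious short-word insertions."""
    return acoustic_logprob + lm_scale * lm_logprob - word_ins_penalty * n_words

# a longer hypothesis pays the per-word penalty more often
short = hypothesis_score(-1200.0, -20.0, n_words=5)
long_ = hypothesis_score(-1190.0, -21.0, n_words=7)
```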


EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING - QIO

• Parameters tuned on clean data recorded on the Sennheiser mic. (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database), 8 kHz sampling frequency
• 7.5% relative improvement

QIO FE   # of Tied States   State-Tying Thresholds (Split / Merge / Occu.)   LM Scale   Word Ins. Pen.   WER
Base     3209               165 / 165 / 840                                  18         10               16.1%
Tuned    3512               125 / 125 / 750                                  20         10               14.9%


EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING - MFA

MFA FE   # of Tied States   State-Tying Thresholds (Split / Merge / Occu.)   LM Scale   Word Ins. Pen.   WER
Base     3208               165 / 165 / 840                                  18         10               13.8%
Tuned    4254               100 / 100 / 600                                  18         5                12.5%

• Parameters tuned on clean data recorded on the Sennheiser mic. (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database), 8 kHz sampling frequency
• 9.4% relative improvement
• Ranking is still the same (14.9% vs. 12.5%)!


EXPERIMENTAL RESULTS: COMPARISON OF TUNING

Front End   Train Set   Tuning   Avg. WER over 14 Test Sets
QIO         1           No       43.1%
QIO         2           No       38.1%
QIO         Avg.        No       38.4%
QIO         1           Yes      45.7%
QIO         2           Yes      35.3%
QIO         Avg.        Yes      40.5%
MFA         1           No       37.5%
MFA         2           No       31.8%
MFA         Avg.        No       34.7%
MFA         1           Yes      37.0%
MFA         2           Yes      31.1%
MFA         Avg.        Yes      34.1%

• Same ranking: relative performance gap increased from 9.6% to 15.8%

• On TS1, MFA FE significantly better on all 14 test sets (MAPSSWE p=0.1%)

• On TS2, MFA FE significantly better only on test sets 5 and 14
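The 9.6% and 15.8% figures follow directly from the average WERs in the table, using the relative difference with respect to the worse system:

```python
def relative_gap(wer_worse, wer_better):
    """Relative WER gap as used in the slides: (worse - better) / worse."""
    return (wer_worse - wer_better) / wer_worse

# untuned averages: QIO 38.4% vs. MFA 34.7%
print(round(100 * relative_gap(38.4, 34.7), 1))   # 9.6
# tuned averages: QIO 40.5% vs. MFA 34.1%
print(round(100 * relative_gap(40.5, 34.1), 1))   # 15.8
```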


EXPERIMENTAL RESULTS: MICROPHONE VARIATION

• Train on Sennheiser mic.; evaluate on secondary mic.

• Matched conditions result in optimal performance

• Significant degradation for all front ends on mismatched conditions

• Both QIO and MFA provide improved robustness relative to MFCC baseline

[Figure: WER (%) for the ETSI, QIO, and MFA front ends under Sennheiser vs. secondary microphone conditions]


EXPERIMENTAL RESULTS: ADDITIVE NOISE

[Figure: WER (%) on noisy test sets TS2–TS7 for the ETSI, QIO, and MFA front ends trained on clean data]

• Performance degrades on noisy conditions when systems are trained only on clean data

• Both QIO and MFA deliver improved performance

[Figure: WER (%) on noisy test sets TS2–TS7 for the ETSI, QIO, and MFA front ends trained on multi-condition data (TS2)]

• Exposing systems to noise and microphone variations (TS2) improves performance


SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?

• Front end specific parameter tuning did not result in significant change in overall performance (MFA still outperforms QIO)

• Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline

• Both QIO and MFA front ends achieved the ALV evaluation goal of at least a 25% relative improvement over the ETSI baseline

• WER is still high (~35%); further research on noise-robust front ends is needed


SUMMARY AND CONCLUSIONS: AVAILABLE RESOURCES

• Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit

• ETSI DSR Website: reports and front end standards

• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and performance summary of the baseline MFCC front end


SUMMARY AND CONCLUSIONS: BRIEF BIBLIOGRAPHY

• N. Parihar, Performance Analysis of Advanced Front Ends, M.S. Dissertation, Mississippi State University, December 2003.

• N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, “Performance Analysis of the Aurora Large Vocabulary Baseline System,” submitted to Eurospeech 2003, Geneva, Switzerland, September 2003.

• N. Parihar and J. Picone, “DSR Front End LVCSR Evaluation - AU/384/02,” Aurora Working Group, European Telecommunications Standards Institute, December 06, 2002.

• D. Pearce, “Overview of Evaluation Criteria for Advanced Distributed Speech Recognition,” ETSI STQ-Aurora DSR Working Group, October 2001.

• G. Hirsch, “Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task,” ETSI STQ-Aurora DSR Working Group, December 2002.

• “ETSI ES 201 108 v1.1.2 Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm,” ETSI, April 2000.


SUMMARY AND CONCLUSIONS: BIOGRAPHY

• Naveen Parihar is an M.S. student in Electrical Engineering in the Department of Electrical and Computer Engineering at Mississippi State University. He currently leads the Core Speech Technology team developing a state-of-the-art public-domain speech recognition system. Mr. Parihar’s research interests lie in the development of discriminative algorithms for better acoustic modeling and feature extraction. Mr. Parihar is a student member of the IEEE.

• Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from Illinois Institute of Technology in 1983. He is a Senior Member of the IEEE and a registered Professional Engineer.