Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
• Author: Naveen Parihar, Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, Mississippi State University
• Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi 39762; Tel: 662-325-8335; Fax: 662-325-2298
• URL: http://www.isip.msstate.edu/publications/books/msstate_theses/2003/advanced_frontends
Email: [email protected]
INTRODUCTION: ABSTRACT
The primary objective of this thesis was to analyze the performance of two advanced front ends, referred to as the QIO (Qualcomm, ICSI, and OGI) and MFA (Motorola, France Telecom, and Alcatel) front ends, on a speech recognition task based on the Wall Street Journal database. Though the advanced front ends are shown to achieve a significant improvement over an industry‑standard baseline front end, this improvement is not operationally significant. Further, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end-specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Finally, we also show that mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.
INTRODUCTION: SPEECH RECOGNITION OVERVIEW
A noisy communication channel model for speech production and perception:
Bayesian formulation for speech recognition:
P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize word error rate by maximizing P(W|A)
Approach: maximize P(A|W) during training
• P(A|W): acoustic model (hidden Markov models, Gaussian mixtures)
• P(W): language model (finite state machines, N-grams)
• P(A): acoustics (ignored during maximization, since it is constant over hypotheses)
[Diagram: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; observables: message → words → sounds → features]
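As a toy illustration of the decision rule above, the decoder picks the word sequence maximizing log P(A|W) + log P(W); P(A) can be ignored because it is the same for every hypothesis. The scores below are hypothetical, not taken from the evaluation:

```python
# Toy illustration of the Bayesian decoding rule P(W|A) ∝ P(A|W) P(W),
# with made-up log-probabilities for three candidate word sequences.
log_acoustic = {"dog barks": -12.0, "dog parks": -11.5, "fog parks": -15.0}  # log P(A|W)
log_language = {"dog barks": -2.0, "dog parks": -6.0, "fog parks": -7.0}     # log P(W)

def decode(acoustic, language):
    """Return the hypothesis maximizing log P(A|W) + log P(W); P(A) is constant."""
    return max(acoustic, key=lambda w: acoustic[w] + language[w])

best = decode(log_acoustic, log_language)
```

Note how the language model overrules the slightly better acoustic score of "dog parks": the combined score of "dog barks" (-14.0) beats it (-17.5).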
INTRODUCTION: BLOCK DIAGRAM APPROACH
Core components:
• Transduction
• Feature extraction
• Acoustic modeling (hidden Markov models)
• Language modeling (statistical N-grams)
• Search (Viterbi beam)
• Knowledge sources
INTRODUCTION: AURORA EVALUATION OVERVIEW
• WSJ 5K (closed task) with seven (digitally-added) noise conditions
• Common ASR system
• Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)
• Client/server applications
• Evaluate robustness in noisy environments
• Propose a standard for LVCSR applications
Performance Summary (WER):

Site (Train Set)   Clean   Noise (Sennheiser)   Noise (multiple mics)
Base (TS1)         15%     59%                  75%
Base (TS2)         19%     33%                  50%
QIO (TS2)          17%     26%                  41%
MFA (TS2)          15%     26%                  40%
• Is the 31% relative improvement (34.5% vs. 50.3%) operationally significant?
INTRODUCTION: MOTIVATION
• Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end
ALV Evaluation Results (WER):

Front End   Overall   8 kHz (TS1 / TS2)        16 kHz (TS1 / TS2)
MFCC        50.3%     49.6% (58.1% / 41.0%)    51.0% (62.2% / 39.8%)
QIO         37.5%     38.4% (43.2% / 33.6%)    36.5% (40.7% / 32.4%)
MFA         34.5%     34.5% (37.5% / 31.4%)    34.4% (37.2% / 31.5%)
• Generic baseline LVCSR system with no front end specific tuning
• Would front end specific tuning change the rankings?
EVALUATION PARADIGM: THE AURORA-4 DATABASE
Acoustic Training:
• Derived from the 5000-word WSJ0 task
• TS1 (clean) and TS2 (multi-condition)
• Clean plus 6 noise conditions
• Randomly chosen SNR between 10 and 20 dB
• 2 microphone conditions (Sennheiser and secondary)
• 2 sample frequencies: 16 kHz and 8 kHz
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and Evaluation Sets:
• Derived from the WSJ0 Development and Evaluation sets
• 14 test sets for each
• 7 test sets recorded on the Sennheiser mic; 7 on a secondary mic
• Clean plus 6 noise conditions
• Randomly chosen SNR between 5 and 15 dB
• G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
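A detail worth making concrete is how noise is added at a randomly chosen SNR: the noise signal is scaled so the speech-to-noise power ratio hits the target. A minimal sketch in pure Python, with illustrative signals only (not the Aurora-4 mixing tool):

```python
import math

def noise_scale(speech, noise, target_snr_db):
    """Scale factor for the noise so the mix has the target SNR in dB.

    SNR = 10*log10(P_speech / P_noise_scaled), hence
    scale = sqrt(P_speech / (P_noise * 10**(SNR/10))).
    """
    p_s = sum(x * x for x in speech) / len(speech)   # mean speech power
    p_n = sum(x * x for x in noise) / len(noise)     # mean noise power
    return math.sqrt(p_s / (p_n * 10 ** (target_snr_db / 10.0)))

# Example: toy "speech" and "noise" sequences mixed at 10 dB SNR.
speech = [1.0, -1.0] * 100
noise = [0.5, -0.5] * 100
k = noise_scale(speech, noise, 10.0)
mixed = [s + k * n for s, n in zip(speech, noise)]
```

With unit speech power and 0.25 noise power, the scale works out to sqrt(0.4), leaving the scaled noise at one tenth of the speech power, i.e. 10 dB.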
EVALUATION PARADIGM: BASELINE LVCSR SYSTEM
Standard context-dependent cross-word HMM-based system:
• Acoustic models: state-tied 4-mixture cross-word triphones
• Language model: WSJ0 5K bigram
• Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
• Lexicon: based on CMUlex
• Real time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
[Diagram: training flow — Training Data → Monophone Modeling → CD-Triphone Modeling → State-Tying → CD-Triphone Modeling → Mixture Modeling (2, 4)]
EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END
• Zero-mean debiasing
• 10 ms frame duration
• 25 ms Hamming window
• Absolute energy
• 12 cepstral coefficients
• First and second derivatives
[Diagram: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Cepstral Analysis; energy computed in parallel]
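The 10 ms frame / 25 ms Hamming window analysis above can be sketched as follows. This is only the framing, windowing, and log-energy step, not the full ETSI WI007 chain, and the 440 Hz test tone is purely illustrative:

```python
import math

def frames(signal, fs=8000, frame_ms=10, window_ms=25):
    """Slice a signal into overlapping analysis windows (10 ms shift,
    25 ms length), apply a Hamming window, and return each windowed
    frame together with its log energy."""
    shift = int(fs * frame_ms / 1000)    # 80 samples at 8 kHz
    length = int(fs * window_ms / 1000)  # 200 samples at 8 kHz
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
               for n in range(length)]
    out = []
    for start in range(0, len(signal) - length + 1, shift):
        frame = [s * h for s, h in zip(signal[start:start + length], hamming)]
        energy = math.log(sum(x * x for x in frame) + 1e-10)  # floor avoids log(0)
        out.append((frame, energy))
    return out

# One second of a 440 Hz tone at 8 kHz yields 98 full 25 ms windows.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
analysis = frames(signal)
```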
FRONT END PROPOSALS: QIO FRONT END
• 10 ms frame duration
• 25 ms analysis window
• 15 RASTA-like filtered cepstral coefficients
• MLP-based VAD
• Mean and variance normalization
• First and second derivatives
[Diagram: Input Speech → Fourier Transform → RASTA → Mel-scale Filter Bank → DCT → Mean/Variance Normalization; MLP-based VAD runs in parallel]
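The per-utterance mean and variance normalization listed among the QIO components can be sketched generically (this is a textbook implementation, not the QIO code):

```python
import math

def mvn(features):
    """Per-dimension mean and variance normalization over an utterance:
    subtract each coefficient's mean and divide by its standard deviation."""
    dims = len(features[0])
    n = len(features)
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n) or 1.0
            for d in range(dims)]  # guard: zero variance -> divide by 1
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in features]

# Three toy frames of 2-dimensional features.
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mvn(feats)
```

After normalization each dimension has zero mean and unit variance over the utterance, which removes static channel effects from the cepstra.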
FRONT END PROPOSALS: MFA FRONT END
• 10 ms frame duration
• 25 ms analysis window
• Mel-warped Wiener filter based noise reduction
• Energy-based VADNest
• Waveform processing to enhance SNR
• Weighted log-energy
• 12 cepstral coefficients
• Blind equalization (cepstral domain)
• VAD based on acceleration of various energy-based measures
• First and second derivatives
[Diagram: Input Speech → Noise Reduction → Waveform Processing → Cepstral Analysis → Blind Equalization → Feature Processing; VADNest and VAD run in parallel]
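The energy-based voice activity detection in the MFA chain can be illustrated with a much simplified sketch: a fixed threshold relative to the loudest frame. The actual MFA VAD uses the acceleration of several energy measures, so this is a stand-in, not the proposed algorithm:

```python
import math

def energy_vad(frames, threshold_db=-30.0):
    """Mark each frame as speech (True) when its energy lies within
    threshold_db of the loudest frame in the utterance."""
    energies = [sum(x * x for x in f) + 1e-12 for f in frames]  # avoid log(0)
    peak = max(energies)
    return [10 * math.log10(e / peak) > threshold_db for e in energies]

# Two quiet frames around one loud frame: only the loud one is speech.
loud = [1.0] * 100
quiet = [0.001] * 100
decisions = energy_vad([quiet, loud, quiet])
```

Frame dropping with such a detector removes non-speech frames before decoding, which is one source of the robustness gains in noisy conditions.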
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
• Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors.
• Tuning parameters:
  - State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
  - Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
  - Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
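How the language model scale and word insertion penalty enter the path score can be sketched in the log domain. Sign conventions and exact formulas vary across decoders, so this is a generic form, not the ISIP decoder's implementation:

```python
def hypothesis_score(log_acoustic, log_lm, n_words, lm_scale, word_ins_penalty):
    """Combined path score: acoustic log-likelihood plus a scaled
    language-model log-probability, minus a per-word insertion penalty."""
    return log_acoustic + lm_scale * log_lm - word_ins_penalty * n_words

# Raising the insertion penalty favors the shorter hypothesis even when
# the longer one has a slightly better acoustic score (toy numbers).
shorter = hypothesis_score(-100.0, -5.0, 3, lm_scale=18, word_ins_penalty=10)
longer = hypothesis_score(-95.0, -6.0, 6, lm_scale=18, word_ins_penalty=10)
```

In noisy conditions the decoder tends to insert spurious short words, which is why the insertion penalty is singled out above as a tuning concern.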
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING - QIO
• Parameters tuned on clean data recorded on the Sennheiser mic (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
• 8 kHz sampling frequency
• 7.5% relative improvement
QIO FE   # of Tied States   State-Tying Thresholds (Split/Merge/Occu.)   LM Scale   Word Ins. Pen.   WER
Base     3209               165 / 165 / 840                              18         10               16.1%
Tuned    3512               125 / 125 / 750                              20         10               14.9%
EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING - MFA
MFA FE   # of Tied States   State-Tying Thresholds (Split/Merge/Occu.)   LM Scale   Word Ins. Pen.   WER
Base     3208               165 / 165 / 840                              18         10               13.8%
Tuned    4254               100 / 100 / 600                              18         5                12.5%
• Parameters tuned on clean data recorded on the Sennheiser mic (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
• 8 kHz sampling frequency
• 9.4% relative improvement
• Ranking is still the same (14.9% vs. 12.5%)!
EXPERIMENTAL RESULTS: COMPARISON OF TUNING
Front End   Train Set   Tuning   Avg. WER over 14 Test Sets
QIO         1           No       43.1%
QIO         2           No       38.1%
QIO         Avg.        No       38.4%
QIO         1           Yes      45.7%
QIO         2           Yes      35.3%
QIO         Avg.        Yes      40.5%
MFA         1           No       37.5%
MFA         2           No       31.8%
MFA         Avg.        No       34.7%
MFA         1           Yes      37.0%
MFA         2           Yes      31.1%
MFA         Avg.        Yes      34.1%
• Same ranking: relative performance gap increased from 9.6% to 15.8%
• On TS1, MFA FE significantly better on all 14 test sets (MAPSSWE p=0.1%)
• On TS2, MFA FE significantly better only on test sets 5 and 14
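The relative-gap figures quoted above follow directly from the average WERs in the table:

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Average WERs from the comparison table: QIO vs. MFA, untuned and tuned.
gap_untuned = relative_improvement(38.4, 34.7)  # ≈ 9.6
gap_tuned = relative_improvement(40.5, 34.1)    # ≈ 15.8
```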
EXPERIMENTAL RESULTS: MICROPHONE VARIATION
• Train on Sennheiser mic.; evaluate on secondary mic.
• Matched conditions result in optimal performance
• Significant degradation for all front ends on mismatched conditions
• Both QIO and MFA provide improved robustness relative to MFCC baseline
[Bar chart: WER (%) on Sennheiser vs. secondary microphone for the ETSI, QIO, and MFA front ends]
EXPERIMENTAL RESULTS: ADDITIVE NOISE
[Bar chart: WER (%) on test sets TS2-TS7 for the ETSI, QIO, and MFA front ends, systems trained on clean data]
• Performance degrades on noise conditions when systems are trained only on clean data
• Both QIO and MFA deliver improved performance
[Bar chart: WER (%) on test sets TS2-TS7 for the ETSI, QIO, and MFA front ends, systems trained on multi-condition data]
• Exposing systems to noise and microphone variations (TS2) improves performance
SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?
• Both QIO and MFA front ends achieved ALV evaluation goal of improving performance by at least 25% relative over ETSI baseline
• WER is still high (~35%), while human benchmarks report much lower error rates (~1%); the improvement in performance is not operationally significant
• Front end specific parameter tuning did not result in significant change in overall performance (MFA still outperforms QIO)
• Both QIO and MFA front ends handle convolution and additive noise better than ETSI baseline
SUMMARY AND CONCLUSIONS: FUTURE WORK
• The contribution of each of the advanced noise robust algorithms to the overall improvement in performance can be calibrated in isolation
• Recognition system parameter tuning can be performed on multi-condition training and testing data that are representative of various noise types, microphone types, etc.
• The improvements from the advanced noise robust algorithms need to be verified with a recognition system that uses more state-of-the-art features, such as speaker normalization, speaker and channel adaptation, and discriminative training
SUMMARY AND CONCLUSIONS: ACKNOWLEDGEMENTS
• I would like to thank Dr. Joe Picone for his mentoring and guidance throughout my graduate program
• I wish to acknowledge David Pearce of Motorola Labs, Motorola Ltd., United Kingdom, and Guenter Hirsch of Niederrhein University of Applied Sciences, Germany, for their invaluable collaboration and direction on some portions of this thesis
• I would also like to thank Jon Hamaker for answering my queries on the ISIP recognition software, and Ram Sundaram for introducing me to the art of running a recognition experiment
• I would like to thank Dr. Georgious Lazarou and Dr. Jeff Jonkman for being on my committee
• Finally, I would like to thank my co-workers (former and current) at the Institute for Signal and Information Processing (ISIP) for all their help
APPENDIX: BRIEF BIBLIOGRAPHY
• N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, “Performance Analysis of the Aurora Large Vocabulary Baseline System,” submitted to the ICASSP, Montreal, Canada, May 2004.
• N. Parihar and J. Picone, “An Analysis of the Aurora Large Vocabulary Evaluation,” Proceedings of Eurospeech 2003, Geneva, Switzerland, September 2003.
• N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, “Performance Analysis of the Aurora Large Vocabulary Baseline System,” submitted to the Eurospeech 2003, Geneva, Switzerland, September 2003.
• N. Parihar and J. Picone, “DSR Front End LVCSR Evaluation - AU/384/02,” Aurora Working Group, European Telecommunications Standards Institute, December 06, 2002.
• D. Pearce, “Overview of Evaluation Criteria for Advanced Distributed Speech Recognition,” ETSI STQ-Aurora DSR Working Group, October 2001.
• G. Hirsch, “Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task,” ETSI STQ-Aurora DSR Working Group, December 2002.
• “ETSI ES 201 108 v1.1.2 Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm,” ETSI, April 2000.
APPENDIX: AVAILABLE RESOURCES
• Speech Recognition Toolkits: compare front ends to standard approaches using a state of the art ASR toolkit
• ETSI DSR Website: reports and front end standards
• Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and performance summary of the baseline MFCC front end
APPENDIX: PROGRAM OF STUDY
Course No. Title Semester
ECE 6713 Computer Architecture Fall 2000
ECE 6773 Digital Signal Processing Fall 2000
*ST 8114 Statistical Methods Fall 2000
ECE 8990 Pattern Recognition Spring 2001
ECE 8990 Information Theory Spring 2001
*ST 8253 Regression Analysis Spring 2001
*ST 8413 Multivariate Statistical Methods Fall 2001
ECE 8463 Fundamentals of Speech Recognition Spring 2002
ECE 8000 Research/Thesis
APPENDIX: PUBLICATIONS
• N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, “Performance Analysis of the Aurora Large Vocabulary Baseline System,” submitted to the ICASSP, Montreal, Canada, May 2004.
• N. Parihar and J. Picone, “An Analysis of the Aurora Large Vocabulary Evaluation,” Proceedings of Eurospeech 2003, Geneva, Switzerland, September 2003.
• F. Zheng, J. Hamaker, F. Goodman, B. George, N. Parihar, and J. Picone, “The ISIP 2001 NRL Evaluation for Recognition of Speech in Noisy Environments,” Speech In Noisy Environments (SPINE) Workshop, Orlando, Florida, USA, November 2001.
• B. Jelinek, F. Zheng, N. Parihar, J. Hamaker, and J. Picone, “Generalized Hierarchical Search in the ISIP ASR System,” Proceedings of the Thirty-Fifth Asilomar Conference on Signals, Systems, and Computers, vol. 2, pp. 1553-1556, Pacific Grove, California, USA, November 2001.