TRANSCRIPT
Yanmin Qian, Shanghai Jiao Tong University
New Challenges and Recent Progress in Speech Recognition
Statistical Speech Recognition
Pipeline: Speech Waveforms → Front-End Processing → Acoustic Model + Language Model + Lexicon → Recognition (Inference) → Recognized Hypothesis
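The pipeline realises the standard decision rule of statistical ASR (a textbook formulation, not specific to this talk): search for the word sequence W that maximizes the posterior given the observed feature sequence O, factored by Bayes' rule into the acoustic model and the language model (the lexicon maps words to phone sequences for the acoustic model):

\[
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \; p(O \mid W)\, P(W)
\]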
Speech Recognition in Products
• Apple Siri (2011)
• Google Now (2012)
• Microsoft Cortana (2014)
• Amazon Echo (2014)
• Samsung S Voice (2014)
• Google Home (2016)
• Microsoft Invoke (2017)
• Apple HomePod (2017)
Progress in Speech Recognition
Switchboard (SWB) WER (%) by year and system:
• 2011 IBM, GMM-HMM: 14.5
• 2012 IBM, DNN-HMM: 12.2
• 2013 IBM, CNN-HMM: 11.8
• 2014 IBM, Joint CNN/DNN: 10.4
• 2015 IBM, Joint CNN/DNN + RNN + NNLM: 8.0
• 2016 MSR, ResNet + LACE + BLSTM + RNNLM: 5.9
• 2017 IBM, ResNet + WaveNet + LSTM LM: 5.5
(Source: Microsoft Speech & Dialogue Group)
Good Enough to Deploy ASR Everywhere?
Still Challenging in Many Aspects
• Noise robustness
• Multi-genre
• Low resource
• Multilingual / mixed-lingual
• Low computation
• Rich transcription
Challenge 1: Noise Robustness
Large degradation in noisy scenarios
• Additive noise, reverberation, channel distortion…
Mismatch and Adaptation
Systems are fragile in practice due to the mismatch between training and test conditions
• Background noise
• Channel
• Speaker
• Accent
Adaptation is one effective method to reduce this mismatch
Cluster Adaptive Training (T. Tan, et al. T-ASLP2016)
• DNN: full layer weight matrices as bases
• CNN: feature maps or filters as bases
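A minimal sketch of the cluster-adaptive idea for a DNN layer, assuming NumPy only; the dimensions, cluster count, and ReLU choice are illustrative, not the paper's exact configuration. In practice the interpolation weights are estimated on adaptation data.

```python
import numpy as np

# Cluster adaptive training (CAT) sketch: the adapted layer weight is an
# interpolation of C cluster bases W_c with per-speaker (or per-environment)
# interpolation weights lambda_c:  W(s) = sum_c lambda_c(s) * W_c
class ClusterAdaptiveLayer:
    def __init__(self, in_dim, out_dim, num_clusters, seed=0):
        rng = np.random.default_rng(seed)
        # One weight basis per cluster; the bias is shared.
        self.bases = 0.01 * rng.standard_normal((num_clusters, out_dim, in_dim))
        self.bias = np.zeros(out_dim)

    def forward(self, x, lam):
        # lam: interpolation weights for the current condition (given here;
        # normally re-estimated on a small amount of adaptation data).
        W = np.tensordot(lam, self.bases, axes=1)   # (out_dim, in_dim)
        return np.maximum(W @ x + self.bias, 0.0)   # ReLU

layer = ClusterAdaptiveLayer(in_dim=40, out_dim=64, num_clusters=4)
lam = np.array([0.4, 0.3, 0.2, 0.1])                # hypothetical cluster weights
print(layer.forward(np.ones(40), lam).shape)        # (64,)
```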
Environment-Aware Training (Y. Qian, et al. T-ASLP2016)
DNNs are used to learn all factor representations
• Speaker, phone, environment
• Each with its own specific target and criterion
Factor integration + cross connections
• Information exchange mechanism
• All modules can benefit from each other
Traditional factors are not ruled out
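A hypothetical sketch of the integration step: factor embeddings produced by the speaker and environment modules are concatenated with the acoustic frame before the shared recognition layers. Function names and dimensions are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Factor-aware joint training input: the recognition module sees the factor
# representations alongside the raw acoustic features (a cross connection).
def factor_aware_input(frame, speaker_emb, env_emb):
    return np.concatenate([frame, speaker_emb, env_emb])

x = factor_aware_input(np.zeros(40), np.zeros(32), np.zeros(16))
print(x.shape)  # (88,)
```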
Auto-Noise-Reduction Acoustic Model (Y. Qian, et al. T-ASLP2016)
Very deep CNNs
• Local correlation
• Translational invariance
Very deep CNNs achieve promising results in noisy scenarios
• Much better than RNNs in noisy conditions
• Can reduce noise in the embedding and de-noise gradually across the stacked convolutional layers
• Appropriate pooling, padding and input-channel usage are important for speech
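A minimal sketch of a very deep CNN stack over time-frequency input, assuming PyTorch; the layer sizes, the 3-channel static/delta/delta-delta input, and the frequency-only pooling are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Small 3x3 filters, padding to preserve the time-frequency map, and pooling
# only along frequency, one common choice for speech inputs.
vdcnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),   # 3 input channels:
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),  # static+delta+delta-delta
    nn.MaxPool2d(kernel_size=(1, 2)),                        # pool frequency only
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2)),
)
x = torch.randn(8, 3, 11, 40)   # (batch, channels, context frames, mel bins)
print(vdcnn(x).shape)           # torch.Size([8, 128, 11, 10])
```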
More Advanced with All Techniques (T. Tan, et al. SpeechCom2017)
The system can be further boosted significantly by combining all of these technologies:
• VDCNN + factor-aware training
• ResNet + factor-aware training + cluster adaptive training
• VDCNN + RNN joint decoding
New Milestone on the Aurora4 Task (T. Tan, et al. SpeechCom2017)
Aurora4 WER (%) by year and system:
• 2012 CUED…: 13.4
• 2013 MSR…: 12.4
• 2014 IBM…: 10.0
• 2015 USTC…: 10.3
• 2015 SJTU…: 9.7
• 2016 EU…: 8.7
• 2016 SJTU…: 7.1
• 2017 SJTU…: 5.7
Challenge 2: Multi-Genre
Multi-genre: comedy, drama, children, advice, news…
• YouTube, BBC, etc.
Very high WER on the provided transcripts, 30.0%~40.0%, and no accurate timing
Diverse audio
• Multiple genres
• Different lengths
• Diverse conditions
Alignment: Lightly Supervised (P. Lanchantin, et al. InterSpeech2016)
Lightly supervised alignment
• Lightly supervised decoding
• Split-point detection
• Segment merging
• Non-speech filtering
• Data selection by confidence (a minimal sketch follows below)
Diverse transcription
• Some transcribed words were never spoken
• Some spoken words are missing from the transcript
• High WER, 30.0%~40.0%
• No accurate timing
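As referenced above, a minimal sketch of confidence-style data selection: segments are kept only when the lightly supervised decoding output agrees closely enough with the (imperfect) broadcast transcript. Using difflib's similarity ratio as the agreement measure, the threshold value, and the field names are illustrative assumptions, not the paper's exact method.

```python
import difflib

def agreement(hyp_words, ref_words):
    # Word-level similarity between decoder hypothesis and transcript.
    return difflib.SequenceMatcher(None, hyp_words, ref_words).ratio()

def select_segments(segments, threshold=0.7):
    # Keep segments whose hypothesis matches the claimed transcript well.
    return [s for s in segments
            if agreement(s["hypothesis"].split(), s["transcript"].split()) >= threshold]

segs = [{"hypothesis": "the cat sat", "transcript": "the cat sat down"},
        {"hypothesis": "completely different words", "transcript": "the cat sat"}]
print(len(select_segments(segs)))  # 1 (the second segment is rejected)
```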
Demo: Alignment
Transcription: Acoustic Model + Adaptation (P.C. Woodland, et al. ASRU2016)
Based on parameterised p-sigmoid / p-ReLU activations
• Scales the slope of the activation functions
• Adaptation at the utterance level and layer-by-layer (a minimal sketch follows below)
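A minimal sketch of the slope-scaling idea behind a parameterised ReLU, assuming NumPy; treating a scale alpha as the only adapted parameter captures the general idea, but the exact parameterisation here is illustrative, not the paper's.

```python
import numpy as np

# p-ReLU style adaptation: alpha scales the activation slope and is the only
# parameter re-estimated on adaptation data, so very few parameters are
# speaker- or utterance-specific while the network weights stay fixed.
def p_relu(x, alpha):
    return alpha * np.maximum(x, 0.0)

# alpha can be one scalar per layer or one per hidden unit.
h = p_relu(np.array([-1.0, 0.5, 2.0]), alpha=1.2)
print(h)  # [0.  0.6 2.4]
```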
DNN hybrid system with MPE training
• More advanced: using a stacked hybrid system
DNN tandem system with MPE training
• More advanced: using adaptation
System combination using joint decoding
• More advanced: using a structured log-linear model
Transcription: Language Model + Adaptation (P.C. Woodland, et al. ASRU2016)
Efficient RNNLM training
• Non-class-based, full-vocabulary output
• Training in bunch mode
Efficient RNNLM lattice rescoring
• Better than N-best list rescoring
• Better still with confusion-network (CN) decoding
Topic adaptation via latent Dirichlet allocation (LDA)
• Topic vector fed into both the input and output layers
• Used in RNNLM training and after first-pass decoding (a minimal sketch follows below)
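A hypothetical sketch of the input-side topic adaptation: the LDA topic posterior inferred for the current show or segment is appended to the word representation fed to the RNNLM (the slide notes the output layer is adapted too). Names and dimensions are illustrative assumptions.

```python
import numpy as np

# Topic-adapted RNNLM input: concatenate the word embedding with the
# LDA topic posterior for the current segment.
def topic_adapted_input(word_emb, topic_posterior):
    return np.concatenate([word_emb, topic_posterior])

word_emb = np.zeros(256)          # embedding of the current word
topic = np.full(50, 1.0 / 50)     # posterior over 50 hypothetical topics
print(topic_adapted_input(word_emb, topic).shape)  # (306,)
```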
(Results chart: RNNLM vs. RNNLM + adaptation)
Demo: Transcription
Challenge 3: Cocktail Party Problem
“One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry ’57)
Cocktail Party Problem
This is a key and very difficult problem for real-world speech processing
• Still requires a great deal of research work
Human performance is far superior to machines in this scenario
• “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
Multi-talker tasks
• Speech separation: separate and trace the streams of the mixed speech
• Speech recognition: recognize the streams of the mixed speech
• Speaker identification: identify the speakers in the mixed speech
Multi-Talker Speech Separation (D. Yu, et al. ICASSP2017)
Tradition: CASA, NMF, factorial GMM-HMM, microphone arrays…
Deep learning converts the problem from an unsupervised learning problem into a supervised one
• Simple supervised training does not work, due to the label permutation problem
• Works well only on seen speakers or specific interferences
Label permutation/ambiguity problem: should speaker 1 map to output 1 or to output 2?
Permutation Invariant Training (D. Yu, et al. ICASSP2017)
• Automatically determines the best label assignment based on the current model (a minimal sketch follows below)
• Can be used to separate multiple speech streams
• Only affects training; no extra processing during separation
• Can be easily extended to 3 speakers
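A minimal sketch of the PIT objective, assuming NumPy and an MSE separation loss; the shapes and the brute-force search over permutations are illustrative (for the small number of streams S used in practice, enumerating all S! assignments is cheap).

```python
import itertools
import numpy as np

def pit_mse_loss(outputs, references):
    # outputs, references: (S, T, F) arrays of S estimated / reference streams.
    S = outputs.shape[0]
    # Pairwise MSE between every output stream and every reference stream.
    pair = np.array([[np.mean((outputs[i] - references[j]) ** 2)
                      for j in range(S)] for i in range(S)])
    # Pick the output-to-reference assignment with the lowest total loss.
    best = min(itertools.permutations(range(S)),
               key=lambda p: sum(pair[i, p[i]] for i in range(S)))
    return sum(pair[i, best[i]] for i in range(S)) / S, best

outs = np.random.randn(2, 100, 40)   # 2 estimated streams
refs = np.random.randn(2, 100, 40)   # 2 reference streams
loss, assignment = pit_mse_loss(outs, refs)
print(loss, assignment)              # loss under the best permutation
```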
Multi-Talker Speech Recognition (D. Yu, et al. InterSpeech2017)
Tradition
• Factorial GMM-HMM (IBM 2006): outperformed humans, but only on an easy isolated-word task with seen speakers
• Deep models: explicit speech separation + a normal speech recognizer
PIT is extended to ASR
• Automatically determines the best label assignment
• Does recognition without explicit separation
• Separation, tracing and recognition in one shot
• Can recognize multiple speech streams
Multi-Talker Speaker Identification (X. Zhao, et al. T-ASLP2015)
Tradition
• Speech separation + speaker identification
• GMM-based approach
Deep-learning-based approach
• Input can be raw waveforms or cepstral features
• Easy to extend to the multi-speaker (>2) condition
• Frames are softly aligned to speakers (a minimal sketch follows below)
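A hypothetical sketch of how soft frame-to-speaker alignment can yield mixture-level speaker identities: average the frame-level speaker posteriors and report the top-k speakers for a k-talker mixture. The posterior source and the top-k rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def identify_speakers(frame_posteriors, num_talkers):
    # frame_posteriors: (T, N_speakers) softmax outputs of a speaker DNN.
    avg = frame_posteriors.mean(axis=0)          # soft frame-to-speaker alignment
    return np.argsort(avg)[::-1][:num_talkers]   # top-k speaker identities

post = np.random.dirichlet(np.ones(34), size=200)  # e.g. 34 SSC speakers
print(identify_speakers(post, num_talkers=2))
```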
Multi-Talker Speaker Identification
• Audio samples: 2-speaker, 3-speaker and 4-speaker mixtures
SSC and Chorus Corpora
SSC (Speech Separation Contest)
• English short commands (1 second)
• 34 speakers in total
Chorus
• Chinese song chorus
• 10 kids (8-12 years old)
Performance (identification accuracy)
• For Chorus, the accuracy of correctly identifying 3 speakers out of 4 is 98.0%.

Corpus    2 speakers    3 speakers    4 speakers
SSC       100%          97.8%         80.0%
Chorus    97.0%         85.5%         66.2%
Microsoft Cognitive Toolkit (CNTK)
• From 2014: Computational Network Toolkit
• To now: Cognitive Toolkit
References
• [1] Y. Qian, M. Bi, T. Tan and K. Yu. Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263-2276, 2016.
• [2] Y. Qian, T. Tan and D. Yu. Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2231-2240, 2016.
• [3] T. Tan, Y. Qian and K. Yu. Cluster Adaptive Training for Deep Neural Network Based Acoustic Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 459-468, 2016.
• [4] T. Tan, Y. Qian, H. Hu, Y. Zhou and K. Yu. Adaptive Very Deep Convolutional Residual Network for Noise Robust Speech Recognition. Speech Communication, under review, 2017.
• [5] P. Lanchantin, M.J.F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P.C. Woodland and C. Zhang. Selection of Multi-Genre Broadcast Data for the Training of Automatic Speech Recognition Systems. Interspeech, 2016.
• [6] P.C. Woodland, X. Liu, Y. Qian, C. Zhang, M.J.F. Gales, P. Karanasou, P. Lanchantin and L. Wang. Cambridge University Transcription Systems for the Multi-Genre Broadcast Challenge. ASRU, 2015.
• [7] D. Yu, M. Kolbak, Z.-H. Tan and J. Jensen. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-Talker Speech Recognition. ICASSP, 2017.
• [8] D. Yu, X. Chang and Y. Qian. Recognizing Multi-Talker Speech with Permutation Invariant Training. Interspeech, 2017.
• [9] X. Zhao, Y. Wang and D. Wang. Cochannel Speaker Identification in Anechoic and Reverberant Conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1727-1736, 2015.
Thanks to my colleagues and students at SJTU!
Thanks for the collaboration with Microsoft!
Thanks to the CNTK team!