TRANSCRIPT
Yanmin Qian, Shanghai Jiao Tong University
New Challenges and Recent Progress in Speech Recognition
Statistical Speech Recognition
Pipeline: Speech Waveforms → Front-End Processing → Acoustic Model + Language Model + Lexicon → Recognition (Inference) → Recognized Hypothesis
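The pipeline realises the standard decision rule of statistical ASR (a textbook formulation, not specific to this talk): search for the word sequence W that maximizes the posterior given the observed feature sequence O, factored by Bayes' rule into the acoustic model and the language model (the lexicon maps words to phone sequences for the acoustic model):

\[
\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \; p(O \mid W)\, P(W)
\]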
Speech Recognition in Products
• Apple Siri (2011)
• Google Now (2012)
• Microsoft Cortana (2014)
• Amazon Echo (2014)
• Samsung S Voice (2014)
• Google Home (2016)
• Microsoft Invoke (2017)
• Apple HomePod (2017)
Progress in Speech Recognition
Switchboard (SWB) WER (%) by year and system:
• 2011 IBM, GMM-HMM: 14.5
• 2012 IBM, DNN-HMM: 12.2
• 2013 IBM, CNN-HMM: 11.8
• 2014 IBM, Joint CNN/DNN: 10.4
• 2015 IBM, Joint CNN/DNN + RNN + NNLM: 8.0
• 2016 MSR, ResNet + LACE + BLSTM + RNNLM: 5.9
• 2017 IBM, ResNet + WaveNet + LSTM LM: 5.5
(Source: Microsoft Speech & Dialogue Group)
Good Enough to Deploy ASR Everywhere?
Still Challenging in Many Aspects
• Noise robustness
• Multi-genre
• Low resource
• Multilingual / mixed-lingual
• Low computation
• Rich transcription
Challenge 1: Noise Robustness
Large degradation in noisy scenarios
• Additive noise, reverberation, channel distortion…
Mismatch and Adaptation
Systems are fragile in practice due to the mismatch between training and test conditions
• Background noise
• Channel
• Speaker
• Accent
Adaptation is one effective method to reduce this mismatch
Cluster Adaptive Training (T. Tan, et al. T-ASLP2016)
• DNN: full layer weight matrices as bases
• CNN: feature maps or filters as bases
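A minimal sketch of the cluster-adaptive idea for a DNN layer, assuming NumPy only; the dimensions, cluster count, and ReLU choice are illustrative, not the paper's exact configuration. In practice the interpolation weights are estimated on adaptation data.

```python
import numpy as np

# Cluster adaptive training (CAT) sketch: the adapted layer weight is an
# interpolation of C cluster bases W_c with per-speaker (or per-environment)
# interpolation weights lambda_c:  W(s) = sum_c lambda_c(s) * W_c
class ClusterAdaptiveLayer:
    def __init__(self, in_dim, out_dim, num_clusters, seed=0):
        rng = np.random.default_rng(seed)
        # One weight basis per cluster; the bias is shared.
        self.bases = 0.01 * rng.standard_normal((num_clusters, out_dim, in_dim))
        self.bias = np.zeros(out_dim)

    def forward(self, x, lam):
        # lam: interpolation weights for the current condition (given here;
        # normally re-estimated on a small amount of adaptation data).
        W = np.tensordot(lam, self.bases, axes=1)   # (out_dim, in_dim)
        return np.maximum(W @ x + self.bias, 0.0)   # ReLU

layer = ClusterAdaptiveLayer(in_dim=40, out_dim=64, num_clusters=4)
lam = np.array([0.4, 0.3, 0.2, 0.1])                # hypothetical cluster weights
print(layer.forward(np.ones(40), lam).shape)        # (64,)
```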
Environment-Aware Training (Y. Qian, et al. T-ASLP2016)
DNNs are used to learn all factor representations
• Speaker, phone, environment
• Each with its own specific target and criterion
Factor integration + cross connections
• Information exchange mechanism
• All modules can benefit from each other
Traditional factors are not ruled out
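A hypothetical sketch of the integration step: factor embeddings produced by the speaker and environment modules are concatenated with the acoustic frame before the shared recognition layers. Function names and dimensions are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Factor-aware joint training input: the recognition module sees the factor
# representations alongside the raw acoustic features (a cross connection).
def factor_aware_input(frame, speaker_emb, env_emb):
    return np.concatenate([frame, speaker_emb, env_emb])

x = factor_aware_input(np.zeros(40), np.zeros(32), np.zeros(16))
print(x.shape)  # (88,)
```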
Auto-Noise-Reduction Acoustic Model (Y. Qian, et al. T-ASLP2016)
Very deep CNNs
• Local correlation
• Translational invariance
Very deep CNNs achieve promising results in noisy scenarios
• Much better than RNNs in noisy conditions
• Can reduce noise in the embedding and de-noise gradually across the stacked convolutional layers
• Appropriate pooling, padding and input-channel usage are important for speech
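A minimal sketch of a very deep CNN stack over time-frequency input, assuming PyTorch; the layer sizes, the 3-channel static/delta/delta-delta input, and the frequency-only pooling are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Small 3x3 filters, padding to preserve the time-frequency map, and pooling
# only along frequency, one common choice for speech inputs.
vdcnn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),   # 3 input channels:
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),  # static+delta+delta-delta
    nn.MaxPool2d(kernel_size=(1, 2)),                        # pool frequency only
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2)),
)
x = torch.randn(8, 3, 11, 40)   # (batch, channels, context frames, mel bins)
print(vdcnn(x).shape)           # torch.Size([8, 128, 11, 10])
```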
More Advanced with All Techniques (T. Tan, et al. SpeechCom2017)
The system can be further boosted significantly by combining all of these technologies:
• VDCNN + factor-aware training
• ResNet + factor-aware training + cluster adaptive training
• VDCNN + RNN joint decoding
New Milestone on the Aurora4 Task (T. Tan, et al. SpeechCom2017)
Aurora4 WER (%) by year and system:
• 2012 CUED…: 13.4
• 2013 MSR…: 12.4
• 2014 IBM…: 10.0
• 2015 USTC…: 10.3
• 2015 SJTU…: 9.7
• 2016 EU…: 8.7
• 2016 SJTU…: 7.1
• 2017 SJTU…: 5.7
Challenge 2: Multi-Genre
Multi-genre: comedy, drama, children, advice, news…
• YouTube, BBC, etc.
Very high WER on the provided transcripts, 30.0%~40.0%, and no accurate timing
Diverse audio
• Multiple genres
• Different lengths
• Diverse conditions
Alignment: Lightly Supervised (P. Lanchantin, et al. InterSpeech2016)
Lightly supervised alignment
• Lightly supervised decoding
• Split-point detection
• Segment merging
• Non-speech filtering
• Data selection by confidence (a minimal sketch follows below)
Diverse transcription
• Some transcribed words were never spoken
• Some spoken words are missing from the transcript
• High WER, 30.0%~40.0%
• No accurate timing
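As referenced above, a minimal sketch of confidence-style data selection: segments are kept only when the lightly supervised decoding output agrees closely enough with the (imperfect) broadcast transcript. Using difflib's similarity ratio as the agreement measure, the threshold value, and the field names are illustrative assumptions, not the paper's exact method.

```python
import difflib

def agreement(hyp_words, ref_words):
    # Word-level similarity between decoder hypothesis and transcript.
    return difflib.SequenceMatcher(None, hyp_words, ref_words).ratio()

def select_segments(segments, threshold=0.7):
    # Keep segments whose hypothesis matches the claimed transcript well.
    return [s for s in segments
            if agreement(s["hypothesis"].split(), s["transcript"].split()) >= threshold]

segs = [{"hypothesis": "the cat sat", "transcript": "the cat sat down"},
        {"hypothesis": "completely different words", "transcript": "the cat sat"}]
print(len(select_segments(segs)))  # 1 (the second segment is rejected)
```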
Demo: Alignment
Transcription: Acoustic Model + Adaptation (P.C. Woodland, et al. ASRU2016)
Based on parameterised p-sigmoid / p-ReLU activations
• Scales the slope of the activation functions
• Adaptation at the utterance level and layer-by-layer (a minimal sketch follows below)
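A minimal sketch of the slope-scaling idea behind a parameterised ReLU, assuming NumPy; treating a scale alpha as the only adapted parameter captures the general idea, but the exact parameterisation here is illustrative, not the paper's.

```python
import numpy as np

# p-ReLU style adaptation: alpha scales the activation slope and is the only
# parameter re-estimated on adaptation data, so very few parameters are
# speaker- or utterance-specific while the network weights stay fixed.
def p_relu(x, alpha):
    return alpha * np.maximum(x, 0.0)

# alpha can be one scalar per layer or one per hidden unit.
h = p_relu(np.array([-1.0, 0.5, 2.0]), alpha=1.2)
print(h)  # [0.  0.6 2.4]
```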
DNN hybrid system with MPE training
• More advanced: using a stacked hybrid system
DNN tandem system with MPE training
• More advanced: using adaptation
System combination using joint decoding
• More advanced: using a structured log-linear model
Transcription: Language Model + Adaptation (P.C. Woodland, et al. ASRU2016)
Efficient RNNLM training
• Non-class-based, full-vocabulary output
• Training in bunch mode
Efficient RNNLM lattice rescoring
• Better than N-best list rescoring
• Better still with confusion-network (CN) decoding
Topic adaptation via latent Dirichlet allocation (LDA)
• Topic vector fed into both the input and output layers
• Used in RNNLM training and after first-pass decoding (a minimal sketch follows below)
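A hypothetical sketch of the input-side topic adaptation: the LDA topic posterior inferred for the current show or segment is appended to the word representation fed to the RNNLM (the slide notes the output layer is adapted too). Names and dimensions are illustrative assumptions.

```python
import numpy as np

# Topic-adapted RNNLM input: concatenate the word embedding with the
# LDA topic posterior for the current segment.
def topic_adapted_input(word_emb, topic_posterior):
    return np.concatenate([word_emb, topic_posterior])

word_emb = np.zeros(256)          # embedding of the current word
topic = np.full(50, 1.0 / 50)     # posterior over 50 hypothetical topics
print(topic_adapted_input(word_emb, topic).shape)  # (306,)
```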
(Results chart: RNNLM vs. RNNLM + adaptation)
Demo: Transcription
Challenge 3: Cocktail Party Problem
“One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry ’57)
Cocktail Party Problem
This is a key and very difficult problem for real-world speech processing
• Still requires a great deal of research work
Human performance is far superior to machines in this scenario
• “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
Multi-talker tasks
• Speech separation: separate and trace the streams of the mixed speech
• Speech recognition: recognize the streams of the mixed speech
• Speaker identification: identify the speakers in the mixed speech
Multi-Talker Speech Separation (D. Yu, et al. ICASSP2017)
Tradition: CASA, NMF, factorial GMM-HMM, microphone arrays…
Deep learning converts the problem from an unsupervised learning problem into a supervised one
• Simple supervised training does not work, due to the label permutation problem
• Works well only on seen speakers or specific interferences
Label permutation/ambiguity problem: should speaker 1 map to output 1 or to output 2?
Permutation Invariant Training (D. Yu, et al. ICASSP2017)
• Automatically determines the best label assignment based on the current model (a minimal sketch follows below)
• Can be used to separate multiple speech streams
• Only affects training; no extra processing during separation
• Can be easily extended to 3 speakers
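A minimal sketch of the PIT objective, assuming NumPy and an MSE separation loss; the shapes and the brute-force search over permutations are illustrative (for the small number of streams S used in practice, enumerating all S! assignments is cheap).

```python
import itertools
import numpy as np

def pit_mse_loss(outputs, references):
    # outputs, references: (S, T, F) arrays of S estimated / reference streams.
    S = outputs.shape[0]
    # Pairwise MSE between every output stream and every reference stream.
    pair = np.array([[np.mean((outputs[i] - references[j]) ** 2)
                      for j in range(S)] for i in range(S)])
    # Pick the output-to-reference assignment with the lowest total loss.
    best = min(itertools.permutations(range(S)),
               key=lambda p: sum(pair[i, p[i]] for i in range(S)))
    return sum(pair[i, best[i]] for i in range(S)) / S, best

outs = np.random.randn(2, 100, 40)   # 2 estimated streams
refs = np.random.randn(2, 100, 40)   # 2 reference streams
loss, assignment = pit_mse_loss(outs, refs)
print(loss, assignment)              # loss under the best permutation
```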
Multi-Talker Speech Recognition (D. Yu, et al. InterSpeech2017)
Tradition
• Factorial GMM-HMM (IBM 2006): outperformed humans, but only on an easy isolated-word task with seen speakers
• Deep models: explicit speech separation + a normal speech recognizer
PIT is extended to ASR
• Automatically determines the best label assignment
• Does recognition without explicit separation
• Separation, tracing and recognition in one shot
• Can recognize multiple speech streams
Multi-Talker Speaker Identification (X. Zhao, et al. T-ASLP2015)
Tradition
• Speech separation + speaker identification
• GMM-based approach
Deep-learning-based approach
• Input can be raw waveforms or cepstral features
• Easy to extend to the multi-speaker (>2) condition
• Frames are softly aligned to speakers (a minimal sketch follows below)
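A hypothetical sketch of how soft frame-to-speaker alignment can yield mixture-level speaker identities: average the frame-level speaker posteriors and report the top-k speakers for a k-talker mixture. The posterior source and the top-k rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def identify_speakers(frame_posteriors, num_talkers):
    # frame_posteriors: (T, N_speakers) softmax outputs of a speaker DNN.
    avg = frame_posteriors.mean(axis=0)          # soft frame-to-speaker alignment
    return np.argsort(avg)[::-1][:num_talkers]   # top-k speaker identities

post = np.random.dirichlet(np.ones(34), size=200)  # e.g. 34 SSC speakers
print(identify_speakers(post, num_talkers=2))
```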
Multi-Talker Speaker Identification
• Audio samples: 2-speaker, 3-speaker and 4-speaker mixtures
SSC and Chorus Corpora
SSC (Speech Separation Contest)
• English short commands (1 second)
• 34 speakers in total
Chorus
• Chinese song chorus
• 10 kids (8-12 years old)
Performance (identification accuracy)
• For Chorus, the accuracy of correctly identifying 3 speakers out of 4 is 98.0%.

Corpus    2 speakers    3 speakers    4 speakers
SSC       100%          97.8%         80.0%
Chorus    97.0%         85.5%         66.2%
Microsoft Cognitive Toolkit (CNTK)
• From 2014: Computational Network Toolkit
• To now: Cognitive Toolkit
References
• [1] Y. Qian, M. Bi, T. Tan and K. Yu. Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2263-2276, 2016.
• [2] Y. Qian, T. Tan and D. Yu. Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2231-2240, 2016.
• [3] T. Tan, Y. Qian and K. Yu. Cluster Adaptive Training for Deep Neural Network Based Acoustic Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 459-468, 2016.
• [4] T. Tan, Y. Qian, H. Hu, Y. Zhou and K. Yu. Adaptive Very Deep Convolutional Residual Network for Noise Robust Speech Recognition. Speech Communication, under review, 2017.
• [5] P. Lanchantin, M.J.F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P.C. Woodland and C. Zhang. Selection of Multi-Genre Broadcast Data for the Training of Automatic Speech Recognition Systems. Interspeech, 2016.
• [6] P.C. Woodland, X. Liu, Y. Qian, C. Zhang, M.J.F. Gales, P. Karanasou, P. Lanchantin and L. Wang. Cambridge University Transcription Systems for the Multi-Genre Broadcast Challenge. ASRU, 2015.
• [7] D. Yu, M. Kolbak, Z.-H. Tan and J. Jensen. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-Talker Speech Recognition. ICASSP, 2017.
• [8] D. Yu, X. Chang and Y. Qian. Recognizing Multi-Talker Speech with Permutation Invariant Training. Interspeech, 2017.
• [9] X. Zhao, Y. Wang and D. Wang. Cochannel Speaker Identification in Anechoic and Reverberant Conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1727-1736, 2015.
Thanks to my colleagues and students at SJTU!
Thanks for the collaboration with Microsoft!
Thanks to the CNTK team!