microsoft research academic services gtm draft€¦ · microsoft cortana 2014 amazon echo 2014...

32

Upload: others

Post on 18-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft
Page 2: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Yanmin QianShanghai Jiao Tong University

New Challenges andRecent Progresses inSpeech Recognition

Page 3: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Statistical Speech Recognition

Speech Waveforms

Front End Processing

Acoustic Model

Recognition (Inference)

Language ModelLexicon

RecognizedHypothesis

Page 4: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Speech Recognition in Products

Microsoft

Cortana

2014

Amazon

Echo

2014

Apple

Siri

2011

Samsung

S Voice

2014

Google

Home

2016

Microsoft

Invoke

2017

Apple

HomePod

2017

Google

Now

2012

Page 5: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Progress in Speech Recognition

14.5

12.2 11.810.4

8

5.9 5.5

2011 IBM

GMM-HMM

2012 IBM

DNN-HMM

2013 IBM

CNN-HMM

2014 IBM

Joint CNN/DNN

2015 IBM

Joint CNN/DNN

+RNN+NNLM

2016 MSR

ResNet+LACE

+BLSTM+RNNLM

2017 IBM

ResNet+WaveNet

+LSTMLM

SWB WER(%)

Microsoft speech & dialogue group

Page 6: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Good Enough to Deploy ASR Everywhere?

Page 7: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Still Challenging on Many Aspects

Noise Robust

Multi Genre

Low Resource

Multi/Mix Lingual

Low Computation

Rich Transcription

Page 8: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Challenge 1: Noise RobustLarge degradation exists in noisy scenarios

• Additive noise, reverberation, channel distortion…

Page 9: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Mismatch and Adaption

System is fragile in reality due to themismatch in training and test

• Background noise

• Channel

• Speaker

• Accent

Adaptation is one effective method toreduce the mismatch

Page 10: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Cluster Adaptive Training (T. Tan, et al. T-ASLP2016)

DNN: full layer matrix as bases CNN: feature maps or filters as bases

Page 11: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Environment-Aware Training (Y. Qian, et al. T-ASLP2016)

DNNs are used to do all factorrepresentations• Speaker, phone, environment

• With the specific target and criterion

Factor integration + Cross connection• Information exchange mechanism

• All modules can benefit from each other

Traditional factors are not ruled out

Page 12: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Auto-noise-reduction Acoustic Model(Y. Qian, et al. T-ASLP2016)

Very deep CNNs• Local correlation

• Translational invariance

Very deep CNN achieves promising resultsin noisy scenarios

Much better than RNN in noisy conditions

Can reduce the noise embedding and de-noise gradually across the stackedconvolutional layers

Appropriate pooling, padding and inputchannel usages are important in speech

Page 13: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

More Advanced with All Techs (T. Tian, et al. SpeechCom2017)

System can be further significantly boosted with all these technologies

VDCNN + Factor

-aware Training

ResNet + Factor-aware Training +

Cluster Adaptive Training

VDCNN + RNN

Joint Decoding

Page 14: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

New Milestone on Aurora4 Task (T. Tian, et al.

SpeechCom2017)13.4

12.4

10 10.39.7

8.7

7.1

5.7

2012 CUED… 2013 MSR… 2014 IBM… 2015 USTC… 2015 SJTU… 2016 EU… 2016 SJTU… 2017 SJTU…

Aurora4 WER(%)

Page 15: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Challenge 2: Multi GenreMulti-Genre:Comedy, Drama, Children, Advice, News…

• Youtube, BBC, etc

Very high WER on transcripts, 30.0%~40.0%, and no accurate time

Page 16: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Diverse audio• Multi Genres

• Different length

• Diverse conditions

Alignment:Lightly Supervised(P. Lanchantin, et al. InterSpeech2016)

Lightly supervised alignment• Lightly supervised decoding

• Split point detection

• Segments merging

• Non-speech filter

• Data selection by confidence

Diverse transcription• Words existing were not spoken

• Words missing were spoken

• High WER 30.0%~40.0%

• No accurate time

Page 17: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Demo: Alignment

Page 18: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Transcription: Acoustic Model + Adaptation(P.C. Woodland, et al. ASRU2016)

Based on parameterised p-sigmoid / p-reluactivation

Scales slope of activation functions

Adaptation at utterance level and layer-by-layer

DNN Hybrid system with MPE training• More advanced using stacked hybrid system

DNN Tandem system with MPE training• More advanced using adaptation

System combination using joint decoding• More advanced using structured log-linear model

Page 19: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Transcription: Language Model + Adaptation(P.C. Woodland, et al. ASRU2016)

Efficient RNNLM training• Non-class based, full vocab output

• Training with bunch mode

Efficient RNNLM lattice rescoring• Better than N-best list rescoring

• Better using CN decoding

Topic adaptation via Latent Dirichletallocation

• Fed-into both the input & output layers

• Used in RNNLM training & after 1st-passdecoding

RNNLM RNNLM + Adaptation

Page 20: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Demo: Transcription

Page 21: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Challenge 3: Cocktail Party Problem

“One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of

others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail

party problem’…” (Cherry’ 57)

Page 22: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Cocktail Party Problem

This is a key and very difficult problem for speech processing in reality• Require most research work

Human’s performance is superior to machine in this scenario• “For ‘cocktail party’-like situations…when all voices are equally loud, speech remains intelligible for

normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp’92)

Multi-talker• Speech separation: Separate and trace the streams of the mixed speech

• Speech recognition: Recognize the streams of the mixed speech

• Speaker identification: Identify the speakers of the mixed speech

Page 23: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Multi-Talker Speech Separation (D. Yu, et al. ICASSP2017)

Tradition: CASA, NMF, factorial GMM-HMM,

Microphone-array…

Deep learning: convert the problem from an

unsupervised learning problem to a

supervised one

Simple supervised training does not work

due to label permutation problem

Only work well in the seen speakers or

specific interferences

Label Permutation/Ambiguity Problem Speaker 1 -> output 1? / -> output 2?

Page 24: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Permutation Invariant Training (D. Yu, et al. ICASSP2017)

Automatically determines the best label

assignment based on the current model

Can be used to separate multiple speech

streams

Only affects the training, no extra

processing during separation

Can be easily extended to 3-speakers

Page 25: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Multi-Talker Speech Recognition (D. Yu, et al.

InterSpeech2017)

Tradition• factorial GMM-HMM (IBM2006)

Outperform human, however easy isolated word & only seen speakers

• Deep modelsexplicit speech separation + normal speech recognition

PIT is extended to ASR• Auto-determines the best label assign.

• Do the recognition without separation

• Separation/tracing/recognition in one shot

• Can recognize multiple speech streams

Page 26: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Multi-Talker Speech Recognition (D. Yu, et al.

InterSpeech2017)

Page 27: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Multi-Talker Speaker Identification (X. Zhao, et al. T-

ASLP2015)

Tradition• Speech Separation + Speaker Identification

• GMM-based Approach

Deep Learning based Approach• Input can be raw wave or cepstral features

• Easy to extend for multi-speaker (>2)

condition

• Soft aligned frames to speakers

Page 28: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Multi-Talker Speaker Identification

2 speaker

3 speaker

4 speaker

• 音频样例

ChorusSSC

SSC (Speech Separation Contest)• English short commands (1 second)

• 34 speakers in total

Chorus• Chinese song chorus

• 10 kids (8-12 years old)

Performance• For Chorus, the accuracy for 3 speakers out of 4 is 98.0 % .

Corpus 2 speakers 3 speakers 4 speakers

SSC 100% 97.8% 80.0%

Chorus 97.0% 85.5% 66.2%

Page 29: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Microsoft Cognitive Toolkit (CNTK)

From 2014

ComputationalNetwork Toolkit

To now

Cognitive Toolkit

Page 30: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

References

• [1] Y. Qian, M. Bi, T. Tan and K. Yu. Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 12, 2263-2276, 2016.

• [2] Y. Qian, T. Tan and D. Yu. Neural Network Based Multi-Factor Aware Joint Training for Robust Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, 2231-2240, 2016.

• [3] T. Tan, Y. Qian and K. Yu. Cluster Adaptive Training for Deep Neural Network Based Acoustic Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, 459-468, 2016.

• [4] T. Tan, Y. Qian, H. Hu, Y. Zhou and K. Yu. Adaptive very deep convolutional residual network for noise robust speechrecognition. Speech Communication, under review, 2017.

• [5] P. Lanchantin, M.J.F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P.C. Woodland and C. Zhang. Selection of Multi-genre Broadcast Data for the Training of Automatic Speech Recognition System. Interspeech, 2016.

• [6] P.C. Woodland, X. Liu, Y. Qian, C. Zhang, M.J.F. Gales, P. Karanasou, P. Lanchantin and L. Wang. CambridgeUniversity Transcription Systems for the Multi-Genre Broadcast Challenge. ASRU, 2015.

• [7] D. Yu, M. Kolbak, Z. Hua and J. Jensen. Permutation Invariant Training of Deep Models for Speaker-independentMulti-talker Speech Recognition. ICASSP, 2017.

• [8] D. Yu, X. Chang and Y. Qian. Recognizing Multi-talker Speech with Permutation Invariant Training. Interspeech,2017.

• [9] X. Zhao, Y. Wang and D. Wang. Cochannel Speaker Identification in Anechoic and Reverberant Conditions.IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 11, 1727-1736, 2017.

Page 31: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft

Thanks for my colleague and students in SJTU !

Thanks for the collaboration with Microsoft !

Thanks for the CNTK team !

Page 32: Microsoft Research Academic Services GTM Draft€¦ · Microsoft Cortana 2014 Amazon Echo 2014 Apple Siri 2011 Samsung S Voice 2014 Google Home 2016 Microsoft Invoke ... Microsoft