[PPT] Scaling properties of deep neural network acoustic...

CS 224S / LINGUIST 285 Spoken Language Processing. Andrew Maas, Stanford University. Spring 2017. Lecture 7: Neural network acoustic models in speech recognition.

TRANSCRIPT

Page 1

CS 224S / LINGUIST 285 Spoken Language Processing

Andrew Maas, Stanford University

Spring 2017

Lecture 7: Neural network acoustic models in speech recognition

Page 2

Outline

Hybrid acoustic modeling overview: basic idea, history, recent results

What's different about modern DNNs?
Convolutional networks
Recurrent networks
Alternative loss functions

Page 3

Acoustic Modeling with GMMs

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): a sequence over sub-phone states (942, 942, 6, …)

Acoustic Model: a GMM models P(x|s), where x is the input feature vector and s is the HMM state

Audio Input: frame-level features x1, x2, x3, … aligned to states 942, 942, 6, …
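To make the acoustic model concrete, here is a minimal sketch of how a diagonal-covariance GMM scores one frame for one HMM state, i.e. computes log P(x|s). The feature dimension, component count, and random parameters are illustrative assumptions, not values from the slides.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log P(x|s) for one HMM state s under a diagonal-covariance GMM.

    x:         (D,) feature vector for one frame
    weights:   (K,) mixture weights, summing to 1
    means:     (K, D) component means
    variances: (K, D) diagonal covariances
    """
    diff = x - means                                                    # (K, D)
    # Per-component log Gaussian density with diagonal covariance
    log_comp = -0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances, axis=1)
    # Weighted log-sum-exp over mixture components
    a = np.log(weights) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# Illustrative usage: D = 39 MFCC-style features, K = 8 components (assumed sizes)
D, K = 39, 8
rng = np.random.default_rng(0)
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, D))
var = np.ones((K, D))
print(gmm_log_likelihood(rng.normal(size=D), w, mu, var))
```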

Page 4

DNN Hybrid Acoustic Models

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …

Acoustic Model: a DNN estimates P(s|x) for each frame

Audio Input: features x1, x2, x3 → DNN outputs P(s|x1), P(s|x2), P(s|x3), aligned to states 942, 942, 6

Use a DNN to approximate P(s|x), then apply Bayes' rule to get the quantity the HMM needs:
P(x|s) = P(s|x) * P(x) / P(s), i.e. DNN output * constant / state prior (see the sketch below).
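A minimal sketch of that conversion: the decoder needs something proportional to P(x|s), so the DNN posterior is divided by the state prior, and the P(x) term is dropped because it is constant across states for a given frame. Shapes and names are illustrative.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_state_priors):
    """Convert DNN outputs log P(s|x) into scores usable in place of log P(x|s).

    log P(x|s) = log P(s|x) + log P(x) - log P(s); the log P(x) term is the
    same for every state, so the decoder can ignore it.

    log_posteriors:   (T, S) frame-by-frame DNN log posteriors
    log_state_priors: (S,)   log P(s), usually estimated from alignment counts
    """
    return log_posteriors - log_state_priors

# Illustrative usage: T = 3 frames, S = 4 senones
T, S = 3, 4
logits = np.random.randn(T, S)
log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
priors = np.full(S, 1.0 / S)
print(scaled_log_likelihoods(log_post, np.log(priors)))
```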

Page 5

Noisy Channel Model

Probabilistic implication: pick the highest-probability word sequence W.

We can use Bayes' rule to rewrite this. Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax (equations reconstructed below).
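The equations referenced above appear as images in the original slides; this is the standard noisy-channel decomposition they describe, with W a candidate word sequence (as on the slide) and X the acoustic observations (notation assumed here):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```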

Page 6

Objective Function for Learning

Supervised learning: minimize our classification errors. Standard choice: the cross-entropy loss, a straightforward extension of the logistic loss for binary classification.

This is a frame-wise loss: we use a label for each frame, obtained from a forced alignment (see the sketch below).

Other loss functions are possible and can give deeper integration with the HMM or with word error rate.
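A minimal numpy sketch of the frame-wise cross-entropy loss described above: each frame gets a senone label from the forced alignment, and the loss is the average negative log probability the network assigns to that label. Shapes and names are illustrative.

```python
import numpy as np

def frame_cross_entropy(logits, labels):
    """Mean frame-level cross-entropy.

    logits: (T, S) unnormalized DNN outputs for T frames and S senones
    labels: (T,)   forced-alignment senone index for each frame
    """
    # Numerically stable log-softmax over senones
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log probability of the correct senone, averaged over frames
    return -log_probs[np.arange(len(labels)), labels].mean()

# Illustrative usage
logits = np.random.randn(5, 10)          # 5 frames, 10 senones
labels = np.array([3, 3, 7, 7, 1])       # from a forced alignment
print(frame_cross_entropy(logits, labels))
```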

Page 7

Not Really a New Idea

Renals, Morgan, Bourlard, Cohen, & Franco. 1994.

Page 8

Early neural network approaches

(Waibel, Hanazawa, Hinton, Shikano, & Lang. 1988)

(McClelland & Elman. 1985)

Page 9

Hybrid MLPs on Resource Management

Renals, Morgan, Bourlard, Cohen, & Franco. 1994.

Page 10

Hybrid Systems now Dominate ASR

Hinton et al. 2012.

Page 11

What's Different in Modern DNNs?

Context-dependent HMM states
Deeper nets improve on single-hidden-layer nets
Hidden unit nonlinearity
Many more model parameters (scaling up)
Specific depth (e.g. 3 vs 7 hidden layers)
Fast computers = run many experiments
Architecture choices (easiest is replacing the sigmoid)
Pre-training does not matter. Initially we thought this was the new trick that made things work.

Page 12

Modern Systems use DNNs and Senones

[Charts: Voice Search error rate for monophone vs. triphone (context-dependent) states, and Switchboard WER for 1-, 5-, and 7-hidden-layer deep networks.]

(Dahl, Yu, Deng, & Acero. 2011) (Yu, Seltzer, Li, Huang, & Seide. 2013)

Page 13

Many hidden layers

Warning! Depth can also act as a regularizer because it makes optimization more difficult. This is why you will sometimes see very deep networks perform well on TIMIT or other small tasks.

Page 14

Replacing Sigmoid Hidden Units

Rectified Linear (ReL)

(Glorot & Bengio. 2011)

[Plot: ReL and TanH activations for inputs from -4 to 4; TanH saturates at ±1, while ReL is zero for negative inputs and linear for positive inputs.]
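For reference, the two nonlinearities being compared, written out as plain functions (a trivial sketch, not tied to any toolkit):

```python
import numpy as np

def tanh(z):
    return np.tanh(z)            # saturates at -1 and 1 for large |z|

def relu(z):
    return np.maximum(0.0, z)    # rectified linear: 0 for z < 0, identity for z > 0

z = np.linspace(-4, 4, 9)
print(tanh(z))
print(relu(z))
```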

Page 15

Comparing Nonlinearities

[Chart: Switchboard WER for the GMM baseline, 2-, 3-, and 4-hidden-layer TanH and ReL DNNs, the MSR 9-layer system, and the IBM 7-layer MMI system.]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 16

Rectifier DNNs on Switchboard

Model                        Dev Acc (%)    Switchboard WER
GMM Baseline                 N/A            25.1
2 Layer Tanh                 48.0           21.0
2 Layer ReLU                 51.7           19.1
3 Layer Tanh                 49.8           20.0
3 Layer ReLU                 53.3           18.1
4 Layer Tanh                 49.8           19.5
4 Layer ReLU                 53.9           17.3
9 Layer Sigmoid CE [MSR]     --             17.0
7 Layer Sigmoid MMI [IBM]    --             13.7

Maas, Hannun, & Ng. 2013.

Page 17

Scaling up NN acoustic models in 1999

0.7M total NN parameters [Ellis & Morgan. 1999]

Page 18

Adding More Parameters 15 Years Ago

Size matters: An empirical study of neural network training for LVCSR. Ellis & Morgan. ICASSP. 1999.

Hybrid NN, 1 hidden layer, 54 HMM states, 74-hour broadcast news task.

"…improvements are almost always obtained by increasing either or both of the amount of training data or the number of network parameters … We are now planning to train an 8000 hidden unit net on 150 hours of data … this training will require over three weeks of computation."

Page 19

Adding More Parameters Now

[Chart: x-axis shows total DNN parameters, from 0 to 450 million.]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 20

Scaling Total Parameters on Switchboard

[Chart: frame error rate by model size (GMM baseline; DNNs with 36M, 100M, and 200M parameters, with ±10 or ±20 frames of input context).]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 21

Scaling Total Parameters on Switchboard

[Chart: frame error rate and word error rate by model size (GMM baseline; DNNs with 36M, 100M, and 200M parameters, with ±10 or ±20 frames of input context).]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 22

Experimental Framework

1-4 GPUs, using the infrastructure of Coates et al. (ICML 2013)
Improved baseline GMM system, ~9,000 HMM states
Fix DNN hidden layers to 5
Vary total number of parameters, with the same number of hidden units in each hidden layer
Evaluate input context of 21 and 41 frames
Sizes evaluated: 36M, 100M, 200M
Layer sizes: 2048, 3953, 5984
Output layer: 51% of all parameters for a 36M DNN, 6% of all parameters for a 200M DNN (see the parameter-count sketch below)
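The layer-size and parameter-count bookkeeping above can be checked with a few lines of arithmetic. This sketch assumes 40-dimensional filterbank features (the feature dimension is not stated on the slide) and ignores bias terms, so the totals are only illustrative; with 2048-unit layers, 21 frames of context, and ~9,000 senone outputs it gives a roughly 37M-parameter model whose output layer is about half the weights, in line with the 51% figure quoted above.

```python
def dnn_param_count(feat_dim, context_frames, hidden_size, num_hidden_layers, num_outputs):
    """Rough weight count (biases ignored) for a fully connected DNN acoustic model."""
    input_dim = feat_dim * context_frames
    input_layer = input_dim * hidden_size
    hidden_layers = (num_hidden_layers - 1) * hidden_size * hidden_size
    output_layer = hidden_size * num_outputs
    total = input_layer + hidden_layers + output_layer
    return total, output_layer / total

# Assumed 40-dim filterbank features, +/-10 frames of context (21 frames total)
total, out_frac = dnn_param_count(feat_dim=40, context_frames=21,
                                  hidden_size=2048, num_hidden_layers=5,
                                  num_outputs=9000)
print(f"total ~= {total / 1e6:.1f}M parameters, output layer ~= {out_frac:.0%} of them")
```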

Page 23

Word error rate during training

[Chart: word error rate over training epochs.]

Page 24

Correlating frame-level metrics and WER

[Chart: normalized training-set word accuracy and negative cross entropy over training epochs.]

Normalize training-set word accuracy and negative cross entropy separately. The strong correlation (r = 0.992) suggests that cross entropy is a good predictor of training-set WER (see the sketch below).
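A small sketch of the correlation computation described above, assuming per-epoch arrays of negative cross entropy and training-set word accuracy are available; the variable names and numbers are made up for illustration, not the paper's data.

```python
import numpy as np

def normalized_correlation(neg_cross_entropy, word_accuracy):
    """Normalize each metric to zero mean / unit variance, then take Pearson's r.

    Normalization does not change r; it mirrors how the two curves are put on
    a common scale for plotting.
    """
    a = (neg_cross_entropy - neg_cross_entropy.mean()) / neg_cross_entropy.std()
    b = (word_accuracy - word_accuracy.mean()) / word_accuracy.std()
    return np.corrcoef(a, b)[0, 1]

# Illustrative per-epoch values
nce = np.array([-2.1, -1.8, -1.6, -1.5, -1.45])
acc = np.array([40.0, 46.0, 50.0, 52.0, 53.0])
print(normalized_correlation(nce, acc))
```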

Page 25

Generalization: Early realignment

Page 26

Generalization: Dropout

During training, randomly zero out hidden unit activations with probability p = 0.5 (a minimal sketch follows below)
Cross-validate over p in initial experiments
Previous work found dropout helped on 50hr broadcast news, but did not directly evaluate dropout with control experiments (Dahl et al. 2013, Sainath et al. 2013)
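A minimal sketch of inverted dropout applied to a matrix of hidden activations, using the p = 0.5 setting mentioned above; real toolkits differ in details such as where the 1/(1-p) rescaling happens, so treat this as illustrative.

```python
import numpy as np

def dropout(activations, p=0.5, train=True, rng=None):
    """Randomly zero each hidden unit with probability p during training.

    Inverted dropout: surviving activations are scaled by 1/(1-p) so that
    no rescaling is needed at test time.
    """
    if not train or p == 0.0:
        return activations
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= p       # keep with probability 1-p
    return activations * mask / (1.0 - p)

h = np.random.randn(4, 8)       # e.g. a batch of 4 frames, 8 hidden units
print(dropout(h, p=0.5))
```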

Page 27

Improving Generalization with Dropout

(Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov. 2014)

Page 28

Improving Generalization with Dropout

[Chart: word error rate by model size (GMM, 36M ±10, 100M ±10, 200M ±10), comparing standard training with dropout.]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 29

Combining Speech Corpora

Switchboard: 300 hours, 4,870 speakers
Fisher: 2,000 hours, 23,394 speakers

Combined corpus baseline system now available in Kaldi

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. In Submission)

Page 30

Scaling Total Parameters Revisited

[Chart: frame error rate and RT-03 word error rate by model size (GMM baseline; DNNs with 36M, 100M, 200M, and 400M parameters).]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 31

Scaling Total Parameters Revisited

[Chart: frame error rate and RT-03 word error rate by model size (GMM baseline; DNNs with 36M, 100M, 200M, and 400M parameters).]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 32

Impact of Depth

[Chart: RT-03 word error rate vs. number of hidden layers (1, 3, 5, 7) for 36M- and 100M-parameter DNNs.]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 33

Frame accuracy analysis

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017)

Page 34

Building A Strong DNN Acoustic Model

Large
At least 3 hidden layers
Training data

Less important:
Dropout regularization
Specific optimization algorithm settings
Initialization (we don't need pre-training)

Page 35

Outline

Hybrid acoustic modeling overview: basic idea, history, recent results

What's different about modern DNNs?
Convolutional networks
Recurrent networks
Alternative loss functions

Page 36

Visualizing first layer DNN features

Page 37

Convolutional Networks

Slide your filters along the frequency axis of filterbank features (see the sketch below)
Great for spectral distortions (e.g. short-wave radio)

(Sainath, Mohamed, Kingsbury, & Ramabhadran. 2013)
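A minimal sketch of what sliding a filter along the frequency axis means for a (frames x frequency-bands) block of filterbank features. The filter size, input size, and the plain-numpy loop are illustrative assumptions, not the cited systems' implementation.

```python
import numpy as np

def conv_along_frequency(feats, filt):
    """Slide a small filter over the frequency axis of filterbank features.

    feats: (T, F) log-mel filterbank features (T frames, F frequency bands)
    filt:  (t, f) 2-D filter covering a few frames and a few adjacent bands
    Returns a (T - t + 1, F - f + 1) feature map ('valid' mode, no padding).
    Strictly this is a sliding dot product (cross-correlation), which is what
    CNN layers compute in practice.
    """
    T, F = feats.shape
    t, f = filt.shape
    out = np.zeros((T - t + 1, F - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feats[i:i + t, j:j + f] * filt)
    return out

# Illustrative usage: 40-band filterbank, 11-frame window, 9x9 filter
feats = np.random.randn(11, 40)
filt = np.random.randn(9, 9)
print(conv_along_frequency(feats, filt).shape)   # (3, 32)
```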

Page 38

Learned Convolutional Features

Page 39

Features for convolutional nets

Porting VTLN and fMLLR feature transforms is important

Slide from Tara Sainath

Page 40

Pooling in time

Slide from Tara Sainath

Page 41

Comparing CNNs, DNNs, & GMMs

Slide from Tara Sainath

Page 42

Recurrent DNN Hybrid Acoustic Models

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …

Acoustic Model: a recurrent network estimates P(s|x) for each frame, now using the history of previous frames

Audio Input: features x1, x2, x3 → P(s|x1), P(s|x2), P(s|x3), aligned to states 942, 942, 6

Page 43

Deep Recurrent Network

[Diagram: input layer feeding two stacked recurrent hidden layers, followed by an output layer.]
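A minimal sketch of the forward pass through a stack of simple recurrent layers like the diagram above; the tanh recurrence, layer sizes, and weight names are illustrative assumptions rather than the architecture of any cited system.

```python
import numpy as np

def deep_rnn_forward(xs, layers, W_out):
    """Run a stack of simple recurrent layers over a sequence of feature frames.

    xs:     (T, D) input frames
    layers: list of (W_in, W_rec, b) tuples, one per recurrent hidden layer
    W_out:  (H, S) output projection to senone logits
    """
    seq = xs
    for W_in, W_rec, b in layers:
        h = np.zeros(W_rec.shape[0])
        outputs = []
        for x in seq:
            h = np.tanh(x @ W_in + h @ W_rec + b)   # recurrence within the layer
            outputs.append(h)
        seq = np.stack(outputs)                      # feed this layer's states upward
    return seq @ W_out                               # (T, S) senone logits

# Illustrative sizes: 40-dim input, two 64-unit recurrent layers, 100 senones
D, H, S, T = 40, 64, 100, 5
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(D, H)) * 0.1, rng.normal(size=(H, H)) * 0.1, np.zeros(H)),
          (rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, H)) * 0.1, np.zeros(H))]
print(deep_rnn_forward(rng.normal(size=(T, D)), layers, rng.normal(size=(H, S)) * 0.1).shape)
```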

Page 44

RNNs for acoustic modeling in 1996

(Robinson, Hochberg, & Renals. 1996)

Page 45

LSTM RNN on Google voice search

Best LSTM WER: 10.5%
Best DNN WER: 11.3%
The LSTM required half the training time

(Sak, Senior, & Beaufays. 2014)

Page 46

Alternative loss functions

State-level minimum Bayes risk (sMBR)
Minimum phone error (MPE)
Maximum mutual information (MMI)
Move beyond frame classification to decoder outputs (the MMI criterion is written out below)
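For reference, the MMI criterion mentioned above in its usual lattice-based form (a standard formulation, not copied from these slides), where X_u is the audio for utterance u, W_u its reference transcript, and kappa an acoustic scaling factor:

```latex
\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log
  \frac{p(X_u \mid W_u)^{\kappa}\, P(W_u)}
       {\sum_{W'} p(X_u \mid W')^{\kappa}\, P(W')}
```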

Page 47

Other Current Work

Multi-lingual acoustic modeling
Low-resource acoustic modeling
Learning directly from waveforms without complicated feature transforms