Deep Learning for Speech Recognition
Hung-yi Lee

TRANSCRIPT

Page 1:

Deep Learning for Speech Recognition

Hung-yi Lee

Page 2:

Outline

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Page 3:

Conventional Speech Recognition

Page 4:

Machine Learning helps

Speech Recognition: input $X$ (audio), output $\tilde{W}$ (text), e.g., 天氣真好 ("The weather is really nice")

$\tilde{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_W P(X \mid W)\,P(W)$

$P(X \mid W)$: Acoustic Model, $P(W)$: Language Model

This is a structured learning problem.

Evaluation function: $F(X, W)$

Inference: $\tilde{W} = \arg\max_W F(X, W)$

Page 5:

Input Representation

• Audio is represented by a vector sequence

X: x1 x2 x3 …… (each frame is a 39-dim MFCC vector)

Page 6:

Input Representation - Splice

• To consider some temporal information, each MFCC frame is spliced with its neighboring frames, i.e., concatenated into one longer vector (see the sketch below).
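A minimal numpy sketch of frame splicing, assuming a context of ±4 frames; the context width and the edge handling are illustrative choices, not specified on the slide:

```python
import numpy as np

def splice(frames: np.ndarray, context: int = 4) -> np.ndarray:
    """Concatenate each frame with its +/- context neighbors.

    frames: (T, D) array, e.g., T frames of 39-dim MFCC.
    Returns: (T, D * (2 * context + 1)) array.
    Edges are handled by repeating the first/last frame.
    """
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Example: 100 frames of 39-dim MFCC -> 100 frames of 351-dim spliced features
mfcc = np.random.randn(100, 39)
print(splice(mfcc).shape)  # (100, 351)
```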

Page 7:

Phoneme

• Phoneme: basic unit of pronunciation

Each word corresponds to a sequence of phonemes, listed in a lexicon:

what do you think
hh w aa t d uw y uw th ih ng k

Different words can correspond to the same phonemes (homophones).

Page 8:

State

• Each phoneme corresponds to a sequence of states

what do you think
hh w aa t d uw y uw th ih ng k

Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: t-d+uw1 t-d+uw2 t-d+uw3, d-uw+y1 d-uw+y2 d-uw+y3, ……

Page 9:

State

• Each state has a stationary distribution for acoustic features, e.g., P(x|"t-d+uw1"), P(x|"d-uw+y3")

Each distribution is modeled by a Gaussian Mixture Model (GMM).
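A minimal sketch of a GMM emission probability, the quantity P(x|state) used by the conventional system; the number of mixture components and the dimensions are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, covs):
    """log P(x | state) for a GMM-based state emission model.

    weights: (M,) mixture weights summing to 1
    means:   (M, D) component means
    covs:    (M, D, D) component covariances
    """
    comp = [np.log(w) + multivariate_normal.logpdf(x, m, c)
            for w, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(comp)  # log-sum-exp over mixture components

# Example: a 2-component GMM over 39-dim features
D, M = 39, 2
weights = np.array([0.6, 0.4])
means = np.random.randn(M, D)
covs = np.stack([np.eye(D)] * M)
x = np.random.randn(D)
print(gmm_log_likelihood(x, weights, means, covs))
```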

Page 10:

State

• Each state has a stationary distribution for acoustic features

Tied-state: different states can share the same distribution, e.g., P(x|"t-d+uw1") and P(x|"d-uw+y3") act as pointers to the same address.

Page 11:

Acoustic Model

$\tilde{W} = \arg\max_W P(X \mid W)\,P(W)$

X: x1 x2 x3 x4 x5 ……
W: what do you think?
S: a b c d e ……

Assume we also know the alignment, i.e., which state each frame belongs to, so P(X|W) = P(X|S):

$P(X \mid S) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

$P(s_t \mid s_{t-1})$: transition, $P(x_t \mid s_t)$: emission

Page 12:

Acoustic Model

$\tilde{W} = \arg\max_W P(X \mid W)\,P(W)$

X: x1 x2 x3 x4 x5 ……
W: what do you think?
S: a b c d e ……

P(X|W) = P(X|S). Actually, we don't know the alignment, so we take the best one:

$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$ (Viterbi algorithm)
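A minimal numpy sketch of the Viterbi max-product above, computed in log space; the parameter names and the uniform initial state are illustrative assumptions:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """max over state sequences of prod_t P(s_t|s_{t-1}) P(x_t|s_t), in log space.

    log_trans: (S, S) log P(s_t = j | s_{t-1} = i)
    log_emit:  (T, S) log P(x_t | s_t = j)
    Returns the best log score and the best state sequence.
    """
    T, S = log_emit.shape
    score = log_emit[0].copy()            # assume a uniform initial state, up to a constant
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # (S, S): previous state -> next state
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return score.max(), path[::-1]
```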

Page 13:

How to use Deep Learning?

Page 14:

People imagine ……

…… a DNN that takes all the acoustic features of an utterance as input and directly outputs "Hello".

This cannot be true! A DNN can only take fixed-length vectors as input, but utterances have varying lengths.

Page 15:

What DNN can do is ……

DNN input: one acoustic feature vector xi
DNN output: probability of each state, P(a|xi), P(b|xi), P(c|xi), ……

Size of output layer = number of states
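A minimal PyTorch sketch of such a frame-level classifier, assuming a 351-dim spliced input (as in the splice sketch earlier) and 3000 tied states; both sizes are illustrative:

```python
import torch
import torch.nn as nn

# One spliced acoustic feature vector in, a distribution over states out.
frame_classifier = nn.Sequential(
    nn.Linear(351, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 3000),  # one output unit per (tied) state
)

x = torch.randn(1, 351)                          # one frame
log_post = frame_classifier(x).log_softmax(-1)   # log P(state | x)
print(log_post.shape)  # torch.Size([1, 3000])
```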

Page 16:

Low rank approximation

W: the M x N weight matrix between the last hidden layer and the output layer.
N is the size of the last hidden layer; M is the size of the output layer, i.e., the number of states.

M can be large if the outputs are the states of tri-phones.

Page 17:

Low rank approximation

Approximate the M x N matrix W of the output layer by the product of two smaller matrices: $W \approx UV$, where U is M x K and V is K x N, with a linear layer in between.

With K < M, N (and K small enough), U and V together have K(M + N) parameters, fewer than the MN parameters of W.
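A minimal PyTorch sketch of the factorization, assuming M = 8000 tri-phone states, N = 1024 hidden units, and K = 256; all three sizes are illustrative:

```python
import torch.nn as nn

M, N, K = 8000, 1024, 256

full = nn.Linear(N, M)            # W: M x N -> M*N = 8,192,000 weights
low_rank = nn.Sequential(
    nn.Linear(N, K, bias=False),  # V: K x N (linear bottleneck, no activation)
    nn.Linear(K, M),              # U: M x K
)                                 # -> K*(M+N) = 2,310,144 weights

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(low_rank))  # biases add a little on top of the weight counts
```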

Page 18:

How we use deep learning

• There are three ways to use DNN for acoustic modeling:
  • Way 1. Tandem
  • Way 2. DNN-HMM hybrid
  • Way 3. End-to-end

(the efforts for exploiting deep learning increase from Way 1 to Way 3)

Page 19:

How to use Deep Learning?
Way 1: Tandem

Page 20:

Way 1: Tandem system

The DNN takes xi as input; the size of its output layer = number of states, giving P(a|xi), P(b|xi), P(c|xi), …… This output is used as a new feature for xi, which becomes the input of your original speech recognition system.

The last hidden layer or a bottleneck layer are also possible choices for the new feature.

Page 21:

How to use Deep Learning?

Way 2: DNN-HMM hybrid

Page 22:

Way 2: DNN-HMM Hybrid

$\tilde{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\,P(W)$

$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

The DNN takes $x_t$ as input and outputs $P(s_t \mid x_t)$, from which the emission probability is obtained:

$P(x_t \mid s_t) = \frac{P(x_t, s_t)}{P(s_t)} = \frac{P(s_t \mid x_t)\, P(x_t)}{P(s_t)}$

$P(s_t)$: count from training data ($P(x_t)$ is the same for all states and can be ignored in the maximization).
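A minimal sketch of this posterior-to-likelihood conversion in log space, dropping the constant P(x_t); the array shapes are illustrative:

```python
import numpy as np

def scaled_log_likelihood(log_post, state_prior):
    """log P(x_t | s_t) up to a constant: log P(s_t | x_t) - log P(s_t).

    log_post:    (T, S) frame-wise DNN outputs log P(s_t = j | x_t)
    state_prior: (S,) state priors P(s_t), counted from the training alignments
    """
    return log_post - np.log(state_prior)[None, :]

# Example: plug the result into the HMM/Viterbi decoder as log emission scores
T, S = 5, 3000
log_post = np.log(np.full((T, S), 1.0 / S))  # dummy uniform posteriors
prior = np.full(S, 1.0 / S)
log_emit = scaled_log_likelihood(log_post, prior)
```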

Page 23:

Way 2: DNN-HMM Hybrid

Replacing the emission probability in

$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

with the DNN output gives

$P(X \mid W; \theta) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, \frac{P(s_t \mid x_t)}{P(s_t)}$

$P(s_t \mid s_{t-1})$: from the original HMM; $P(s_t \mid x_t)$: from the DNN (with parameters $\theta$); $P(s_t)$: count from training data.

This assembled vehicle works …….

Page 24:

Way 2: DNN-HMM Hybrid

• Sequential Training

$\tilde{W} = \arg\max_W P(X \mid W; \theta)\, P(W)$

Given training data $(X^r, W^r)$, fine-tune the DNN parameters $\theta$ such that

$P(X^r \mid W^r; \theta)\, P(W^r)$ increases, and
$P(X^r \mid W; \theta)\, P(W)$ decreases
($W$ is any word sequence different from $W^r$).

Page 25:

How to use Deep Learning?

Way 3: End-to-end

Page 26:

Way 3: End-to-end - Character

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.

Input: acoustic features (spectrograms)
Output: characters (and space) + null (~)

No phonemes and no lexicon (so no OOV problem).

Page 27:

Way 3: End-to-end - Character

Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.

[Figure: the network output over time is mostly nulls (~), with the characters of "HIS FRIEND'S" emitted at a few frames in between.]
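A minimal sketch of how such an output sequence becomes text: merge consecutive repeated symbols, then drop the nulls (the CTC collapsing rule; the "~" symbol follows the slide):

```python
from itertools import groupby

def collapse(frames: list[str], null: str = "~") -> str:
    """Merge consecutive duplicates, then remove the null symbol."""
    return "".join(sym for sym, _ in groupby(frames) if sym != null)

print(collapse(list("~~HH~III~~SS~")))  # HIS
```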

Page 28:

Way 3: End-to-end – Word?

Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition.", Interspeech. 2014.

Output layer: ~50k words in the lexicon (apple, bog, cat, ……)

Input layer: the acoustic features of a whole word, with size = 200 times the acoustic feature dimension; shorter words are padded with zeros.

Use other systems to get the word boundaries.

Page 29:

Why Deep Learning?

Page 30:

Deeper is better?

• Word error rate (WER)

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

[Table: WER of 1 hidden layer vs. multiple hidden layers. Deeper is better.]

Page 31:

Deeper is better?

• Word error rate (WER)

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

[Table: WER of 1 hidden layer vs. multiple hidden layers.]

For a fixed number of parameters, a deep model is clearly better than the shallow one. Deeper is better.

Page 32:

What does DNN do?

• Speaker normalization is automatically done in DNN

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.

[Figure: input acoustic features (MFCC) vs. the output of the 1st hidden layer.]

Page 33:

What does DNN do?

• Speaker normalization is automatically done in DNN

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.

[Figure: input acoustic features (MFCC) vs. the output of the 8th hidden layer.]

Page 34:

What does DNN do?

• In ordinary acoustic models, all the states are modeled independently
• This is not an effective way to model the human voice

The sound of a vowel is controlled by only a few factors, e.g., tongue position: high/low and front/back.

[Figure: vowel chart, http://www.ipachart.com/]

Page 35:

What does DNN do?

Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.

[Figure: output of a hidden layer reduced to two dimensions; the vowels /a/, /i/, /u/, /e/, /o/ line up along the high/low and front/back dimensions.]

The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors. The parameters are thus used effectively.

Page 36:

Speaker Adaptation

Page 37:

Speaker Adaptation

• Speaker adaptation: use different models to recognize the speech of different speakers
  • Collect the audio data of each speaker
  • Train a DNN model for each speaker
• Challenge: limited data for training
  • Not enough data for directly training a DNN model
  • Not enough data for even fine-tuning a speaker-independent DNN model

Page 38:

Categories of Methods

• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters

(listed in order of needing less and less training data)

Page 39:

Conservative Training

Train a DNN on the audio data of many speakers and use it as the initialization. Then train on a little data from the target speaker, with constraints that keep the new network's output (or its parameters) close to the initialization.

Page 40:

Transformation methods

• Add an extra layer

Insert an extra layer Wa between layer i and layer i+1 of the network. When adapting with a little data from the target speaker, fix all the other parameters and train only Wa. A sketch follows.
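A minimal PyTorch sketch of this idea, inserting a square linear layer between two hidden layers and freezing everything else; the layer sizes, the insertion point, and the identity initialization are illustrative assumptions:

```python
import torch.nn as nn

# Speaker-independent network, already trained (weights assumed given).
layer_i = nn.Linear(351, 1024)
layer_i1 = nn.Linear(1024, 3000)

# Extra adaptation layer Wa, initialized to the identity mapping.
wa = nn.Linear(1024, 1024)
nn.init.eye_(wa.weight)
nn.init.zeros_(wa.bias)

adapted = nn.Sequential(layer_i, nn.ReLU(), wa, layer_i1)

# Freeze all original parameters; only Wa is trained on the target speaker's data.
for p in adapted.parameters():
    p.requires_grad = False
for p in wa.parameters():
    p.requires_grad = True
```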

Page 41:

Transformation methods

• Add the extra layer between the input and the first layer
• With splicing, the same Wa is applied to each frame of the splice before layer 1

More data: a larger Wa can be trained. Less data: use a smaller Wa.

Page 42:

Transformation methods

• SVD bottleneck adaptation

With the low-rank factorization of the output layer (linear bottleneck V: K x N followed by U: M x K), insert the adaptation layer Wa (K x K) between V and U. K is usually small, so Wa needs few parameters.

Page 43:

Speaker-aware Training

From the data of each speaker (speaker 1, speaker 2, speaker 3, ……), even lots of mismatched data, extract fixed-length, low-dimension vectors as speaker information. Text transcription is not needed for extracting the vectors.

Training can also be noise-aware, device-aware, ……

Page 44:

Speaker-aware Training

Acoustic features are appended with speaker information features, in both training (speaker 1, speaker 2) and testing.

All speakers use the same DNN model; different speakers are augmented with different features. A sketch of the augmentation follows.
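A minimal numpy sketch of the feature augmentation, assuming a 100-dim speaker vector (e.g., an i-vector); the dimensions are illustrative:

```python
import numpy as np

def add_speaker_info(frames: np.ndarray, spk_vec: np.ndarray) -> np.ndarray:
    """Append the same speaker vector to every acoustic frame.

    frames:  (T, D) acoustic features of one utterance
    spk_vec: (S,) fixed-length speaker information vector
    Returns: (T, D + S) augmented features
    """
    return np.hstack([frames, np.tile(spk_vec, (len(frames), 1))])

frames = np.random.randn(100, 39)
spk_vec = np.random.randn(100)                  # e.g., a 100-dim speaker vector
print(add_speaker_info(frames, spk_vec).shape)  # (100, 139)
```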

Page 45:

Multi-task Learning

Page 46:

Multitask Learning

• The multi-layer structure makes DNN suitable for multitask learning

[Figure: two architectures. Left: one input feature with shared lower layers and separate outputs for Task A and Task B. Right: separate input features for task A and task B feeding shared layers, again with separate outputs for Task A and Task B.]

Page 47:

Multitask Learning - Multilingual

Human languages share some common characteristics, so one network takes acoustic features as input and shares its hidden layers across languages, with separate output layers for the states of French, German, Spanish, Italian, and Mandarin. A sketch follows.
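A minimal PyTorch sketch of a multilingual network with shared hidden layers and one output head per language; the sizes and the language set are illustrative:

```python
import torch
import torch.nn as nn

class MultilingualDNN(nn.Module):
    def __init__(self, in_dim=351, hidden=1024, states_per_lang=None):
        super().__init__()
        states_per_lang = states_per_lang or {"french": 3000, "mandarin": 3000}
        # Hidden layers shared across all languages
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One language-specific output layer per language
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(hidden, n) for lang, n in states_per_lang.items()}
        )

    def forward(self, x, lang):
        return self.heads[lang](self.shared(x))

model = MultilingualDNN()
x = torch.randn(8, 351)
print(model(x, "mandarin").shape)  # torch.Size([8, 3000])
```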

Page 48:

Multitask Learning - Multilingual

Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." Acoustics, Speech and Signal Processing (ICASSP), 2013.

[Figure: character error rate (25 to 50) vs. hours of training data for Mandarin (1, 10, 100, 1000); two curves: "Mandarin only" and "With European Language".]

Page 49:

Multitask Learning - Different units

One network, acoustic features as input, with two output tasks A and B:

• A = state, B = phoneme
• A = state, B = gender
• A = state, B = grapheme (character)

Page 50:

Deep Learning for Acoustic Modeling: New acoustic features

Page 51:

MFCC

Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN

Page 52:

Filter-bank Output

Waveform → DFT → spectrogram → filter bank → log → input of DNN (skipping the DCT)

Kind of standard now.
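A minimal sketch of computing log filter-bank (log-mel) features with librosa, assuming a 40-filter mel bank; the parameter choices and the bundled example audio are illustrative:

```python
import numpy as np
import librosa

# Load a waveform and compute log mel filter-bank features:
# DFT -> spectrogram -> mel filter bank -> log
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-10)  # (40, T): the DNN input, one column per frame
print(log_fbank.shape)
```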

Page 53:

Spectrogram

Waveform → DFT → spectrogram → input of DNN (common today)

About 5% relative improvement over filter-bank output.

Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," In Automatic Speech Recognition and Understanding (ASRU), 2013.

Page 54:

Spectrogram

Page 55:

Waveform?

Waveform → input of DNN directly. People tried, but it is not better than the spectrogram yet, so we still need to take Signal & Systems …… (if this succeeds, no more Signal & Systems).

Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," In INTERSPEECH 2014.

Page 56:

Waveform?

Page 57:

Convolutional Neural Network (CNN)

Page 58:

CNN

• Speech can be treated as images: a spectrogram, with time and frequency as its two axes.

Page 59:

CNN

Treat the spectrogram as an image and replace the DNN by a CNN: image → CNN → probabilities of states.

Page 60:

CNN

[Figure: with two filters (a, b) each producing outputs a1, a2 and b1, b2: max pooling takes the maximum over neighboring positions within the same filter (max of a1, a2; max of b1, b2), while maxout takes the maximum across filters at the same position (max of a1, b1; max of a2, b2).]
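A minimal numpy sketch of the two operations on four illustrative values, with the grouping following the figure as read above:

```python
import numpy as np

# Two filters (a, b), each producing two neighboring outputs.
a = np.array([1.0, 5.0])  # a1, a2
b = np.array([3.0, 2.0])  # b1, b2

# Max pooling: max over neighboring positions within the same filter.
pooled = np.array([a.max(), b.max()])  # [max(a1,a2), max(b1,b2)]

# Maxout: max across filters at the same position.
maxout = np.maximum(a, b)              # [max(a1,b1), max(a2,b2)]

print(pooled, maxout)  # [5. 3.] [3. 5.]
```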

Page 61:

CNN

Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition", Interspeech, 2014.

Page 62:

Applications in Acoustic Signal Processing

Page 63:

DNN for Speech Enhancement

Noisy speech → DNN → clean speech, for mobile communication or speech recognition.

Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html

Page 64:

DNN for Voice Conversion

Male speech → DNN → female speech.

Demo for voice conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx

Page 65:

Concluding Remarks

Page 66:

Concluding Remarks

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Page 67:

Thank you for your attention!

Page 68:

More Researches related to Speech

Speech Recognition (e.g., 你好 "Hello") is one of the core techniques; on top of it, applications on spoken content such as lecture recordings include:

• Speech Summarization: produce a summary of the recordings
• Spoken Content Retrieval: find the lectures related to "deep learning"
• Computer Assisted Language Learning
• Dialogue: "Hi" / "Hello"
• Information Extraction: e.g., "I would like to leave Taipei on November 2nd"