Deep Learning for Speech Recognition
Hung-yi Lee

TRANSCRIPT

Page 1:

Deep Learning for Speech Recognition

Hung-yi Lee

Page 2:

Outline

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Page 3:

Conventional Speech Recognition

Page 4:

Machine Learning helps

Speech Recognition: input $X$ (audio), output $\tilde{W}$ (text), e.g., 天氣真好 ("The weather is really nice")

$\tilde{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_W P(X \mid W)\,P(W)$

$P(X \mid W)$: Acoustic Model, $P(W)$: Language Model

This is a structured learning problem.

Evaluation function: $F(X, W)$

Inference: $\tilde{W} = \arg\max_W F(X, W)$

Page 5:

Input Representation

• Audio is represented by a vector sequence

X: x1 x2 x3 …… (each frame is a 39-dim MFCC vector)

Page 6:

Input Representation - Splice

• To consider some temporal information, each MFCC frame is spliced with its neighboring frames, i.e., concatenated into one longer vector (see the sketch below).
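A minimal numpy sketch of frame splicing, assuming a context of ±4 frames; the context width and the edge handling are illustrative choices, not specified on the slide:

```python
import numpy as np

def splice(frames: np.ndarray, context: int = 4) -> np.ndarray:
    """Concatenate each frame with its +/- context neighbors.

    frames: (T, D) array, e.g., T frames of 39-dim MFCC.
    Returns: (T, D * (2 * context + 1)) array.
    Edges are handled by repeating the first/last frame.
    """
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Example: 100 frames of 39-dim MFCC -> 100 frames of 351-dim spliced features
mfcc = np.random.randn(100, 39)
print(splice(mfcc).shape)  # (100, 351)
```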

Page 7:

Phoneme

• Phoneme: basic unit of pronunciation

Each word corresponds to a sequence of phonemes, listed in a lexicon:

what do you think
hh w aa t d uw y uw th ih ng k

Different words can correspond to the same phonemes (homophones).

Page 8:

State

• Each phoneme corresponds to a sequence of states

what do you think
hh w aa t d uw y uw th ih ng k

Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: t-d+uw1 t-d+uw2 t-d+uw3, d-uw+y1 d-uw+y2 d-uw+y3, ……

Page 9:

State

• Each state has a stationary distribution for acoustic features, e.g., P(x|"t-d+uw1"), P(x|"d-uw+y3")

Each distribution is modeled by a Gaussian Mixture Model (GMM).
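A minimal sketch of a GMM emission probability, the quantity P(x|state) used by the conventional system; the number of mixture components and the dimensions are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, covs):
    """log P(x | state) for a GMM-based state emission model.

    weights: (M,) mixture weights summing to 1
    means:   (M, D) component means
    covs:    (M, D, D) component covariances
    """
    comp = [np.log(w) + multivariate_normal.logpdf(x, m, c)
            for w, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(comp)  # log-sum-exp over mixture components

# Example: a 2-component GMM over 39-dim features
D, M = 39, 2
weights = np.array([0.6, 0.4])
means = np.random.randn(M, D)
covs = np.stack([np.eye(D)] * M)
x = np.random.randn(D)
print(gmm_log_likelihood(x, weights, means, covs))
```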

Page 10:

State

• Each state has a stationary distribution for acoustic features

Tied-state: different states can share the same distribution, e.g., P(x|"t-d+uw1") and P(x|"d-uw+y3") act as pointers to the same address.

Page 11:

Acoustic Model

$\tilde{W} = \arg\max_W P(X \mid W)\,P(W)$

X: x1 x2 x3 x4 x5 ……
W: what do you think?
S: a b c d e ……

Assume we also know the alignment, i.e., which state each frame belongs to, so P(X|W) = P(X|S):

$P(X \mid S) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

$P(s_t \mid s_{t-1})$: transition, $P(x_t \mid s_t)$: emission

Page 12:

Acoustic Model

$\tilde{W} = \arg\max_W P(X \mid W)\,P(W)$

X: x1 x2 x3 x4 x5 ……
W: what do you think?
S: a b c d e ……

P(X|W) = P(X|S). Actually, we don't know the alignment, so we take the best one:

$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$ (Viterbi algorithm)
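A minimal numpy sketch of the Viterbi max-product above, computed in log space; the parameter names and the uniform initial state are illustrative assumptions:

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """max over state sequences of prod_t P(s_t|s_{t-1}) P(x_t|s_t), in log space.

    log_trans: (S, S) log P(s_t = j | s_{t-1} = i)
    log_emit:  (T, S) log P(x_t | s_t = j)
    Returns the best log score and the best state sequence.
    """
    T, S = log_emit.shape
    score = log_emit[0].copy()            # assume a uniform initial state, up to a constant
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # (S, S): previous state -> next state
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return score.max(), path[::-1]
```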

Page 13:

How to use Deep Learning?

Page 14:

People imagine ……

…… a DNN that takes all the acoustic features of an utterance as input and directly outputs "Hello".

This cannot be true! A DNN can only take fixed-length vectors as input, but utterances have varying lengths.

Page 15:

What DNN can do is ……

DNN input: one acoustic feature vector xi
DNN output: probability of each state, P(a|xi), P(b|xi), P(c|xi), ……

Size of output layer = number of states
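A minimal PyTorch sketch of such a frame-level classifier, assuming a 351-dim spliced input (as in the splice sketch earlier) and 3000 tied states; both sizes are illustrative:

```python
import torch
import torch.nn as nn

# One spliced acoustic feature vector in, a distribution over states out.
frame_classifier = nn.Sequential(
    nn.Linear(351, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 3000),  # one output unit per (tied) state
)

x = torch.randn(1, 351)                          # one frame
log_post = frame_classifier(x).log_softmax(-1)   # log P(state | x)
print(log_post.shape)  # torch.Size([1, 3000])
```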

Page 16:

Low rank approximation

W: the M x N weight matrix between the last hidden layer and the output layer.
N is the size of the last hidden layer; M is the size of the output layer, i.e., the number of states.

M can be large if the outputs are the states of tri-phones.

Page 17:

Low rank approximation

Approximate the M x N matrix W of the output layer by the product of two smaller matrices: $W \approx UV$, where U is M x K and V is K x N, with a linear layer in between.

With K < M, N (and K small enough), U and V together have K(M + N) parameters, fewer than the MN parameters of W.
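A minimal PyTorch sketch of the factorization, assuming M = 8000 tri-phone states, N = 1024 hidden units, and K = 256; all three sizes are illustrative:

```python
import torch.nn as nn

M, N, K = 8000, 1024, 256

full = nn.Linear(N, M)            # W: M x N -> M*N = 8,192,000 weights
low_rank = nn.Sequential(
    nn.Linear(N, K, bias=False),  # V: K x N (linear bottleneck, no activation)
    nn.Linear(K, M),              # U: M x K
)                                 # -> K*(M+N) = 2,310,144 weights

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(low_rank))  # biases add a little on top of the weight counts
```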

Page 18:

How we use deep learning

• There are three ways to use DNN for acoustic modeling:
  • Way 1. Tandem
  • Way 2. DNN-HMM hybrid
  • Way 3. End-to-end

(the efforts for exploiting deep learning increase from Way 1 to Way 3)

Page 19:

How to use Deep Learning?
Way 1: Tandem

Page 20:

Way 1: Tandem system

The DNN takes xi as input; the size of its output layer = number of states, giving P(a|xi), P(b|xi), P(c|xi), …… This output is used as a new feature for xi, which becomes the input of your original speech recognition system.

The last hidden layer or a bottleneck layer are also possible choices for the new feature.

Page 21:

How to use Deep Learning?

Way 2: DNN-HMM hybrid

Page 22:

Way 2: DNN-HMM Hybrid

$\tilde{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\,P(W)$

$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

The DNN takes $x_t$ as input and outputs $P(s_t \mid x_t)$, from which the emission probability is obtained:

$P(x_t \mid s_t) = \frac{P(x_t, s_t)}{P(s_t)} = \frac{P(s_t \mid x_t)\, P(x_t)}{P(s_t)}$

$P(s_t)$: count from training data ($P(x_t)$ is the same for all states and can be ignored in the maximization).
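A minimal sketch of this posterior-to-likelihood conversion in log space, dropping the constant P(x_t); the array shapes are illustrative:

```python
import numpy as np

def scaled_log_likelihood(log_post, state_prior):
    """log P(x_t | s_t) up to a constant: log P(s_t | x_t) - log P(s_t).

    log_post:    (T, S) frame-wise DNN outputs log P(s_t = j | x_t)
    state_prior: (S,) state priors P(s_t), counted from the training alignments
    """
    return log_post - np.log(state_prior)[None, :]

# Example: plug the result into the HMM/Viterbi decoder as log emission scores
T, S = 5, 3000
log_post = np.log(np.full((T, S), 1.0 / S))  # dummy uniform posteriors
prior = np.full(S, 1.0 / S)
log_emit = scaled_log_likelihood(log_post, prior)
```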

Page 23:

Way 2: DNN-HMM Hybrid

Replacing the emission probability in

$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$

with the DNN output gives

$P(X \mid W; \theta) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, \frac{P(s_t \mid x_t)}{P(s_t)}$

$P(s_t \mid s_{t-1})$: from the original HMM; $P(s_t \mid x_t)$: from the DNN (with parameters $\theta$); $P(s_t)$: count from training data.

This assembled vehicle works …….

Page 24:

Way 2: DNN-HMM Hybrid

• Sequential Training

$\tilde{W} = \arg\max_W P(X \mid W; \theta)\, P(W)$

Given training data $(X^r, W^r)$, fine-tune the DNN parameters $\theta$ such that

$P(X^r \mid W^r; \theta)\, P(W^r)$ increases, and
$P(X^r \mid W; \theta)\, P(W)$ decreases
($W$ is any word sequence different from $W^r$).

Page 25:

How to use Deep Learning?

Way 3: End-to-end

Page 26:

Way 3: End-to-end - Character

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.

Input: acoustic features (spectrograms)
Output: characters (and space) + null (~)

No phonemes and no lexicon (so no OOV problem).

Page 27:

Way 3: End-to-end - Character

Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.

[Figure: the network output over time is mostly nulls (~), with the characters of "HIS FRIEND'S" emitted at a few frames in between.]
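A minimal sketch of how such an output sequence becomes text: merge consecutive repeated symbols, then drop the nulls (the CTC collapsing rule; the "~" symbol follows the slide):

```python
from itertools import groupby

def collapse(frames: list[str], null: str = "~") -> str:
    """Merge consecutive duplicates, then remove the null symbol."""
    return "".join(sym for sym, _ in groupby(frames) if sym != null)

print(collapse(list("~~HH~III~~SS~")))  # HIS
```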

Page 28:

Way 3: End-to-end – Word?

Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition.", Interspeech. 2014.

Output layer: ~50k words in the lexicon (apple, bog, cat, ……)

Input layer: the acoustic features of a whole word, with size = 200 times the acoustic feature dimension; shorter words are padded with zeros.

Use other systems to get the word boundaries.

Page 29:

Why Deep Learning?

Page 30:

Deeper is better?

• Word error rate (WER)

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

[Table: WER of 1 hidden layer vs. multiple hidden layers. Deeper is better.]

Page 31:

Deeper is better?

• Word error rate (WER)

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

[Table: WER of 1 hidden layer vs. multiple hidden layers.]

For a fixed number of parameters, a deep model is clearly better than the shallow one. Deeper is better.

Page 32:

What does DNN do?

• Speaker normalization is automatically done in DNN

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.

[Figure: input acoustic features (MFCC) vs. the output of the 1st hidden layer.]

Page 33:

What does DNN do?

• Speaker normalization is automatically done in DNN

A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.

[Figure: input acoustic features (MFCC) vs. the output of the 8th hidden layer.]

Page 34:

What does DNN do?

• In ordinary acoustic models, all the states are modeled independently
• This is not an effective way to model the human voice

The sound of a vowel is controlled by only a few factors, e.g., tongue position: high/low and front/back.

[Figure: vowel chart, http://www.ipachart.com/]

Page 35:

What does DNN do?

Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.

[Figure: output of a hidden layer reduced to two dimensions; the vowels /a/, /i/, /u/, /e/, /o/ line up along the high/low and front/back dimensions.]

The lower layers detect the manner of articulation, and all the states share the results from the same set of detectors. The parameters are thus used effectively.

Page 36:

Speaker Adaptation

Page 37:

Speaker Adaptation

• Speaker adaptation: use different models to recognize the speech of different speakers
  • Collect the audio data of each speaker
  • Train a DNN model for each speaker
• Challenge: limited data for training
  • Not enough data for directly training a DNN model
  • Not enough data for even fine-tuning a speaker-independent DNN model

Page 38:

Categories of Methods

• Conservative training: re-train the whole DNN with some constraints
• Transformation methods: only train the parameters of one layer
• Speaker-aware training: do not really change the DNN parameters

(listed in order of needing less and less training data)

Page 39:

Conservative Training

Train a DNN on the audio data of many speakers and use it as the initialization. Then train on a little data from the target speaker, with constraints that keep the new network's output (or its parameters) close to the initialization.

Page 40:

Transformation methods

• Add an extra layer

Insert an extra layer Wa between layer i and layer i+1 of the network. When adapting with a little data from the target speaker, fix all the other parameters and train only Wa. A sketch follows.
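A minimal PyTorch sketch of this idea, inserting a square linear layer between two hidden layers and freezing everything else; the layer sizes, the insertion point, and the identity initialization are illustrative assumptions:

```python
import torch.nn as nn

# Speaker-independent network, already trained (weights assumed given).
layer_i = nn.Linear(351, 1024)
layer_i1 = nn.Linear(1024, 3000)

# Extra adaptation layer Wa, initialized to the identity mapping.
wa = nn.Linear(1024, 1024)
nn.init.eye_(wa.weight)
nn.init.zeros_(wa.bias)

adapted = nn.Sequential(layer_i, nn.ReLU(), wa, layer_i1)

# Freeze all original parameters; only Wa is trained on the target speaker's data.
for p in adapted.parameters():
    p.requires_grad = False
for p in wa.parameters():
    p.requires_grad = True
```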

Page 41:

Transformation methods

• Add the extra layer between the input and the first layer
• With splicing, the same Wa is applied to each frame of the splice before layer 1

More data: a larger Wa can be trained. Less data: use a smaller Wa.

Page 42:

Transformation methods

• SVD bottleneck adaptation

With the low-rank factorization of the output layer (linear bottleneck V: K x N followed by U: M x K), insert the adaptation layer Wa (K x K) between V and U. K is usually small, so Wa needs few parameters.

Page 43:

Speaker-aware Training

From the data of each speaker (speaker 1, speaker 2, speaker 3, ……), even lots of mismatched data, extract fixed-length, low-dimension vectors as speaker information. Text transcription is not needed for extracting the vectors.

Training can also be noise-aware, device-aware, ……

Page 44:

Speaker-aware Training

Acoustic features are appended with speaker information features, in both training (speaker 1, speaker 2) and testing.

All speakers use the same DNN model; different speakers are augmented with different features. A sketch of the augmentation follows.
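A minimal numpy sketch of the feature augmentation, assuming a 100-dim speaker vector (e.g., an i-vector); the dimensions are illustrative:

```python
import numpy as np

def add_speaker_info(frames: np.ndarray, spk_vec: np.ndarray) -> np.ndarray:
    """Append the same speaker vector to every acoustic frame.

    frames:  (T, D) acoustic features of one utterance
    spk_vec: (S,) fixed-length speaker information vector
    Returns: (T, D + S) augmented features
    """
    return np.hstack([frames, np.tile(spk_vec, (len(frames), 1))])

frames = np.random.randn(100, 39)
spk_vec = np.random.randn(100)                  # e.g., a 100-dim speaker vector
print(add_speaker_info(frames, spk_vec).shape)  # (100, 139)
```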

Page 45:

Multi-task Learning

Page 46:

Multitask Learning

• The multi-layer structure makes DNN suitable for multitask learning

[Figure: two architectures. Left: one input feature with shared lower layers and separate outputs for Task A and Task B. Right: separate input features for task A and task B feeding shared layers, again with separate outputs for Task A and Task B.]

Page 47:

Multitask Learning - Multilingual

Human languages share some common characteristics, so one network takes acoustic features as input and shares its hidden layers across languages, with separate output layers for the states of French, German, Spanish, Italian, and Mandarin. A sketch follows.
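A minimal PyTorch sketch of a multilingual network with shared hidden layers and one output head per language; the sizes and the language set are illustrative:

```python
import torch
import torch.nn as nn

class MultilingualDNN(nn.Module):
    def __init__(self, in_dim=351, hidden=1024, states_per_lang=None):
        super().__init__()
        states_per_lang = states_per_lang or {"french": 3000, "mandarin": 3000}
        # Hidden layers shared across all languages
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One language-specific output layer per language
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(hidden, n) for lang, n in states_per_lang.items()}
        )

    def forward(self, x, lang):
        return self.heads[lang](self.shared(x))

model = MultilingualDNN()
x = torch.randn(8, 351)
print(model(x, "mandarin").shape)  # torch.Size([8, 3000])
```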

Page 48:

Multitask Learning - Multilingual

Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." Acoustics, Speech and Signal Processing (ICASSP), 2013.

[Figure: character error rate (25 to 50) vs. hours of training data for Mandarin (1, 10, 100, 1000); two curves: "Mandarin only" and "With European Language".]

Page 49:

Multitask Learning - Different units

One network, acoustic features as input, with two output tasks A and B:

• A = state, B = phoneme
• A = state, B = gender
• A = state, B = grapheme (character)

Page 50:

Deep Learning for Acoustic Modeling: New acoustic features

Page 51:

MFCC

Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN

Page 52:

Filter-bank Output

Waveform → DFT → spectrogram → filter bank → log → input of DNN (skipping the DCT)

Kind of standard now.
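A minimal sketch of computing log filter-bank (log-mel) features with librosa, assuming a 40-filter mel bank; the parameter choices and the bundled example audio are illustrative:

```python
import numpy as np
import librosa

# Load a waveform and compute log mel filter-bank features:
# DFT -> spectrogram -> mel filter bank -> log
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-10)  # (40, T): the DNN input, one column per frame
print(log_fbank.shape)
```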

Page 53:

Spectrogram

Waveform → DFT → spectrogram → input of DNN (common today)

About 5% relative improvement over filter-bank output.

Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," In Automatic Speech Recognition and Understanding (ASRU), 2013.

Page 54:

Spectrogram

Page 55:

Waveform?

Waveform → input of DNN directly. People tried, but it is not better than the spectrogram yet, so we still need to take Signal & Systems …… (if this succeeds, no more Signal & Systems).

Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," In INTERSPEECH 2014.

Page 56:

Waveform?

Page 57:

Convolutional Neural Network (CNN)

Page 58:

CNN

• Speech can be treated as images: a spectrogram, with time and frequency as its two axes.

Page 59:

CNN

Treat the spectrogram as an image and replace the DNN by a CNN: image → CNN → probabilities of states.

Page 60:

CNN

[Figure: with two filters (a, b) each producing outputs a1, a2 and b1, b2: max pooling takes the maximum over neighboring positions within the same filter (max of a1, a2; max of b1, b2), while maxout takes the maximum across filters at the same position (max of a1, b1; max of a2, b2).]
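A minimal numpy sketch of the two operations on four illustrative values, with the grouping following the figure as read above:

```python
import numpy as np

# Two filters (a, b), each producing two neighboring outputs.
a = np.array([1.0, 5.0])  # a1, a2
b = np.array([3.0, 2.0])  # b1, b2

# Max pooling: max over neighboring positions within the same filter.
pooled = np.array([a.max(), b.max()])  # [max(a1,a2), max(b1,b2)]

# Maxout: max across filters at the same position.
maxout = np.maximum(a, b)              # [max(a1,b1), max(a2,b2)]

print(pooled, maxout)  # [5. 3.] [3. 5.]
```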

Page 61:

CNN

Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition", Interspeech, 2014.

Page 62:

Applications in Acoustic Signal Processing

Page 63:

DNN for Speech Enhancement

Noisy speech → DNN → clean speech, for mobile communication or speech recognition.

Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html

Page 64:

DNN for Voice Conversion

Male speech → DNN → female speech.

Demo for voice conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx

Page 65:

Concluding Remarks

Page 66:

Concluding Remarks

• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing

Page 67:

Thank you for your attention!

Page 68:

More Researches related to Speech

Speech Recognition (e.g., 你好 "Hello") is one of the core techniques; on top of it, applications on spoken content such as lecture recordings include:

• Speech Summarization: produce a summary of the recordings
• Spoken Content Retrieval: find the lectures related to "deep learning"
• Computer Assisted Language Learning
• Dialogue: "Hi" / "Hello"
• Information Extraction: e.g., "I would like to leave Taipei on November 2nd"