![Page 1: Deep Learning for Speech Recognition Hung-yi Lee](https://reader033.vdocument.in/reader033/viewer/2022061605/56649d6e5503460f94a4fe2f/html5/thumbnails/1.jpg)
Deep Learning for Speech Recognition
Hung-yi Lee
Outline
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Conventional Speech Recognition
Machine Learning helps
Speech Recognition: input speech X, output word sequence W̃, e.g., 天氣真好 ("The weather is really nice").

W̃ = arg max_W P(W|X)
   = arg max_W P(X|W)P(W) / P(X)
   = arg max_W P(X|W)P(W)

P(X|W): Acoustic Model, P(W): Language Model

This is a structured learning problem.
Evaluation function: F(X, W) = P(X|W)P(W)
Inference: W̃ = arg max_W F(X, W)
Input Representation
• Audio is represented by a vector sequence X: x1 x2 x3 …… (each frame a 39-dim MFCC vector)
Input Representation - Splice
• To consider some temporal information, neighbouring MFCC frames are spliced (concatenated) into one input vector.
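Splicing is a simple per-frame concatenation. A minimal numpy sketch (the context width of ±4 frames is an assumption; systems vary):

```python
import numpy as np

def splice(frames, context=4):
    """Concatenate each frame with its +/- `context` neighbours.
    Edge frames are padded by repeating the first/last frame."""
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

# 10 frames of 39-dim MFCC -> 10 spliced vectors of 39 * (2*4+1) = 351 dims
mfcc = np.random.randn(10, 39)
print(splice(mfcc).shape)  # (10, 351)
```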
Phoneme
• Phoneme: the basic unit of pronunciation.
• Each word corresponds to a sequence of phonemes:
  what do you think
  hh w aa t d uw y uw th ih ng k
• Different words can correspond to the same phoneme sequence.
• The word-to-phoneme mapping is given by a lexicon.
State
• Each phoneme corresponds to a sequence of states (in practice, context-dependent tri-phones are used):
  Word: what do you think
  Phone: hh w aa t d uw y uw th ih ng k
  Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
  State: t-d+uw1 t-d+uw2 t-d+uw3, d-uw+y1 d-uw+y2 d-uw+y3
State
• Each state has a stationary distribution for acoustic features, e.g., P(x|"t-d+uw1") and P(x|"d-uw+y3"), each modeled by a Gaussian Mixture Model (GMM).
State
• Each state has a stationary distribution for acoustic features.
• Tied states: some states share the same distribution, like two pointers storing the same address, so P(x|"t-d+uw1") and P(x|"d-uw+y3") may point to the same GMM.
Acoustic Model
W̃ = arg max_W P(X|W)P(W)
W: what do you think?
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
Assume we also know the alignment h (which state produces which frame): x1↔s1, x2↔s2, x3↔s3, x4↔s4, x5↔s5. Then P(X|W) = P(X|S), and
P(X|S, h) = ∏_{t=1}^{T} P(s_t|s_{t-1}) P(x_t|s_t)
P(s_t|s_{t-1}): transition probability; P(x_t|s_t): emission probability
Acoustic Model
W̃ = arg max_W P(X|W)P(W)
W: what do you think?
S: a b c d e ……
X: x1 x2 x3 x4 x5 ……
P(X|W) = P(X|S). Actually, we don't know the alignment, so we take the best one:
P(X|S) ≈ max_{s_1 ⋯ s_T} ∏_{t=1}^{T} P(s_t|s_{t-1}) P(x_t|s_t)   (Viterbi algorithm)
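The max over state sequences is exactly what the Viterbi algorithm computes. A log-domain sketch (assuming a uniform initial state distribution, folded into the first frame):

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """log_trans[i, j] = log P(s_t = j | s_{t-1} = i)
    log_emit[t, j]  = log P(x_t | s_t = j)
    Returns the best log-score (max over state sequences) and the best path."""
    T, N = log_emit.shape
    delta = log_emit[0].copy()               # scores of length-1 paths
    back = np.zeros((T, N), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # score of (prev state, cur state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(delta.max()), path[::-1]
```

Real decoders work over the HMM graph built from the lexicon and language model; this sketch only shows the max-product recursion on a plain state lattice.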
How to use Deep Learning?
People imagine ……
• a DNN that takes the acoustic features of a whole utterance and outputs "Hello".
• This cannot be true! A DNN can only take fixed-length vectors as input.
What DNN can do is ……
• DNN input: one acoustic feature vector xi
• DNN output: the probability of each state, P(a|xi), P(b|xi), P(c|xi), ……
• Size of the output layer = number of states
Low rank approximation
• The weight matrix W between the last hidden layer and the output layer is M × N:
  N: size of the last hidden layer
  M: size of the output layer (number of states)
• M can be large if the outputs are the states of tri-phones.
Low rank approximation
• Approximate W (M × N) by the product of two smaller matrices: W ≈ UV, with U: M × K, V: K × N, and K < M, N.
• The output layer is replaced by a linear layer (V) followed by the output layer (U).
• Fewer parameters: K × (M + N) instead of M × N.
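The factorization can be obtained from a trained W with a truncated SVD, which gives the best rank-K approximation. A sketch with illustrative sizes (M, N, K are arbitrary choices here):

```python
import numpy as np

M, N, K = 1000, 500, 64          # illustrative sizes
W = np.random.randn(M, N)        # stands in for trained output-layer weights

# Truncated SVD: W ~= U V with U: M x K, V: K x N
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :K] * s[:K]        # absorb the top-K singular values into U
V = Vt[:K]

print("parameters before:", M * N)        # 500000
print("parameters after :", K * (M + N))  # 96000
```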
How we use deep learning
• There are three ways to use DNN for acoustic modeling, in increasing order of effort for exploiting deep learning:
  • Way 1. Tandem
  • Way 2. DNN-HMM hybrid
  • Way 3. End-to-end
How to use Deep Learning?
Way 1: Tandem
Way 1: Tandem system
• Use the DNN outputs P(a|xi), P(b|xi), P(c|xi), …… as a new feature, i.e., as the input of your original speech recognition system (size of the output layer = number of states).
• The last hidden layer or a bottleneck layer can also be used as the new feature.
How to use Deep Learning?
Way 2: DNN-HMM hybrid
Way 2: DNN-HMM Hybrid
W̃ = arg max_W P(W|X) = arg max_W P(X|W)P(W)
P(X|W) ≈ max_{s_1 ⋯ s_T} ∏_{t=1}^{T} P(s_t|s_{t-1}) P(x_t|s_t), where P(x_t|s_t) comes from the DNN.
The DNN takes x_t and outputs P(s_t|x_t), so
P(x_t|s_t) = P(x_t, s_t) / P(s_t) = P(s_t|x_t) P(x_t) / P(s_t)
P(s_t|x_t): from the DNN; P(s_t): counted from the training data; P(x_t): the same for every state sequence, so it can be ignored.
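Dropping P(x_t), the emission score used at decoding time is the DNN posterior divided by the state prior. A sketch in the log domain:

```python
import numpy as np

def emission_log_scores(log_posteriors, state_priors):
    """log P(x_t|s) up to a constant: log P(s|x_t) - log P(s).
    `log_posteriors`: T x N DNN outputs; `state_priors`: N state
    frequencies counted from the training alignments."""
    return log_posteriors - np.log(state_priors)

# Dividing by the prior can change which state wins:
log_post = np.log(np.array([[0.6, 0.4]]))  # DNN prefers state 0 ...
priors = np.array([0.9, 0.1])              # ... but state 0 is just frequent
print(emission_log_scores(log_post, priors).argmax())  # 1
```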
Way 2: DNN-HMM Hybrid
Plugging the DNN into the HMM:
P(X|W) ≈ max_{s_1 ⋯ s_T} ∏_{t=1}^{T} P(s_t|s_{t-1}) P(s_t|x_t) / P(s_t)
P(s_t|s_{t-1}): from the original HMM; P(s_t|x_t): from the DNN (with parameters θ, giving P(X|W; θ)); P(s_t): counted from the training data.
This assembled vehicle works …….
Way 2: DNN-HMM Hybrid
• Sequential training: W̃ = arg max_W P(X|W; θ)P(W)
Given training data (X^r, W^r), fine-tune the DNN parameters θ such that
  P(X^r|W^r; θ)P(W^r) increases, and
  P(X^r|W; θ)P(W) decreases (W is any word sequence different from W^r).
How to use Deep Learning?
Way 3: End-to-end
Way 3: End-to-end - Character
• Input: acoustic features (spectrograms)
• Output: characters (and space) + null (~)
• No phoneme and no lexicon (no OOV problem)
A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.
Way 3: End-to-end - Character
Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.
• The network outputs null (~) for most frames and the characters of "HIS FRIEND'S" at a few frames; merging repeats and removing nulls yields the transcription.
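Recovering the transcription from such frame-level output only requires merging repeated symbols and then dropping the nulls. A minimal sketch:

```python
def ctc_collapse(frame_outputs, blank="~"):
    """Merge consecutive repeats, then drop the null/blank symbol."""
    out = []
    prev = None
    for sym in frame_outputs:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse("~~HH~II~SS~~"))  # HIS
```

Note that a null between two identical characters keeps both, which is how repeated letters survive the collapse.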
Way 3: End-to-end – Word?
Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition.", Interspeech. 2014.
• DNN output layer: one unit per word (apple, bog, cat, ……), ~50k words in the lexicon.
• DNN input: the acoustic features of one word; the input size is 200 times the acoustic-feature dimension, padded with zeros.
• Use other systems to get the word boundaries.
Why Deep Learning?
Deeper is better?
• Word error rate (WER)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
[Table: WER of a 1-hidden-layer model vs. models with multiple hidden layers; deeper is better.]
For a fixed number of parameters, a deep model is clearly better than the shallow one.
What does DNN do?
• Speaker normalization is automatically done in DNN
[Figure: input acoustic features (MFCC) vs. the 1-st hidden layer.]
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP, 2012.
[Figure: input acoustic features (MFCC) vs. the 8-th hidden layer.]
What does DNN do?
• In ordinary acoustic models, all the states are modeled independently.
• This is not an effective way to model the human voice: the sound of a vowel is controlled by only a few factors, e.g., tongue height (high/low) and position (front/back); see the IPA vowel chart: http://www.ipachart.com/
What does DNN do?
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.
• Reducing the output of a hidden layer to two dimensions, the vowels /a/, /i/, /u/, /e/, /o/ are arranged by height (high/low) and position (front/back).
• The lower layers detect the manner of articulation.
• All the states share the results from the same set of detectors, so the parameters are used effectively.
Speaker Adaptation
Speaker Adaptation
• Speaker adaptation: use different models to recognize the speech of different speakers.
  • Collect the audio data of each speaker; train a DNN model for each speaker.
• Challenge: limited data for training.
  • Not enough data for directly training a DNN model.
  • Not enough data even for fine-tuning a speaker-independent DNN model.
Categories of Methods
• Conservative training: re-train the whole DNN with some constraints.
• Transformation methods: only train the parameters of one layer.
• Speaker-aware training: do not really change the DNN parameters.
(from top to bottom: less and less training data is needed)
Conservative Training
• Train a speaker-independent DNN on audio data of many speakers and use it as the initialization.
• Fine-tune on a little data from the target speaker, with constraints keeping the outputs (or the parameters) close to those of the initial network.
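A conservative-training objective can be sketched as the adaptation loss plus penalties that keep the adapted network close to the speaker-independent one (the penalty weights here are hypothetical):

```python
import numpy as np

def conservative_loss(adapt_loss, params, init_params,
                      outputs, init_outputs,
                      lam_param=1e-3, lam_out=0.1):
    """adapt_loss: e.g. cross-entropy on the target speaker's little data.
    The penalties keep parameters and outputs close to the
    speaker-independent network used as initialization."""
    param_pen = sum(float(np.sum((p - p0) ** 2))
                    for p, p0 in zip(params, init_params))
    out_pen = float(np.mean((outputs - init_outputs) ** 2))
    return adapt_loss + lam_param * param_pen + lam_out * out_pen
```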
Transformation methods
• Add an extra layer Wa between layer i and layer i+1 of the trained network.
• Fix all the other parameters; learn only Wa from a little data of the target speaker.
Transformation methods
• Add the extra layer between the input and the first layer.
• With splicing: either one larger Wa over the whole spliced input (needs more data), or a smaller Wa shared across the spliced frames (works with less data).
Transformation methods
• SVD bottleneck adaptation: with the low-rank factorization W ≈ UV, insert a K × K matrix Wa between the linear layer V and the output layer U, and adapt only Wa.
• K is usually small, so only a few speaker-specific parameters are needed.
Speaker-aware Training
• From lots of (possibly mismatched) data, the data of each speaker (Speaker 1, 2, 3, ……) is summarized into a fixed-length, low-dimension vector (speaker information).
• Text transcription is not needed for extracting the vectors.
• Can also be noise-aware, device-aware training, ……
Speaker-aware Training
• Acoustic features are appended with speaker-information features, for both training and testing data.
• All the speakers use the same DNN model; different speakers are simply augmented with different features.
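Appending the speaker information is just a per-frame concatenation. A sketch (the 100-dim speaker vector is an arbitrary choice):

```python
import numpy as np

def append_speaker_info(acoustic, speaker_vec):
    """Append the same fixed-length speaker vector (e.g. an i-vector)
    to every acoustic frame."""
    T = acoustic.shape[0]
    return np.hstack([acoustic, np.tile(speaker_vec, (T, 1))])

frames = np.random.randn(200, 39)   # 200 frames of 39-dim features
ivec = np.random.randn(100)         # speaker vector (dimension arbitrary)
print(append_speaker_info(frames, ivec).shape)  # (200, 139)
```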
Multi-task Learning
Multitask Learning
• The multi-layer structure makes DNN suitable for multitask learning.
• Tasks A and B can share lower layers with task-specific output layers, or use task-specific input layers that merge into shared upper layers.
Multitask Learning - Multilingual
• Shared hidden layers take the acoustic features; language-specific output layers predict the states of French, German, Spanish, Italian, and Mandarin.
• Human languages share some common characteristics.
Multitask Learning - Multilingual
Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." Acoustics, Speech and Signal Processing (ICASSP), 2013
[Figure: character error rate (roughly 25% to 50%) vs. hours of Mandarin training data (1 to 1000 hours); training with European languages beats Mandarin only.]
Multitask Learning - Different units
• Predict two kinds of units, A and B, from the same acoustic features, e.g.:
  • A = state, B = phoneme
  • A = state, B = gender
  • A = state, B = grapheme (character)
Deep Learning for Acoustic Modeling
New acoustic features
MFCC
Waveform → DFT → spectrogram → filter bank → log → DCT → MFCC → input of DNN
![Page 52: Deep Learning for Speech Recognition Hung-yi Lee](https://reader033.vdocument.in/reader033/viewer/2022061605/56649d6e5503460f94a4fe2f/html5/thumbnails/52.jpg)
Filter-bank Output
Waveform → DFT → spectrogram → filter bank → log → input of DNN
(kind of standard now)
Spectrogram
Waveform → DFT → spectrogram → input of DNN
(common today; 5% relative improvement over filter-bank output)
Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," in Automatic Speech Recognition and Understanding (ASRU), 2013
Spectrogram
Waveform?
• Use the waveform directly as the input of the DNN: people tried, but not better than spectrogram yet.
• If it succeeds, no Signal & Systems; for now, you still need to take Signal & Systems ……
Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," in INTERSPEECH 2014
Waveform?
Convolutional Neural Network (CNN)
CNN
• Speech can be treated as images: the spectrogram, with time and frequency as its two axes.
CNN
• Replace the DNN with a CNN: the spectrogram is the input image, and the CNN outputs the probabilities of states.
CNN
• Max pooling over pairs of feature-map units (a1, a2 → max; b1, b2 → max) acts like the Maxout activation.
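Both operations keep only the maximum within a group of values. A minimal sketch:

```python
import numpy as np

def max_pool(values, group=2):
    """Max pooling: the max of each non-overlapping group of units."""
    return values.reshape(-1, group).max(axis=1)

def maxout(z):
    """Maxout activation: each output unit is the max over its
    group of linear pieces (one group per column of z)."""
    return z.max(axis=0)

print(max_pool(np.array([1.0, 3.0, 2.0, 5.0])))     # [3. 5.]
print(maxout(np.array([[1.0, -2.0], [0.5, 4.0]])))  # [1. 4.]
```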
CNN
Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition", Interspeech, 2014.
Applications in Acoustic Signal Processing
DNN for Speech Enhancement
• A DNN maps noisy speech to clean speech, for mobile communication or speech recognition.
Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html
DNN for Voice Conversion
• A DNN maps the voice of one speaker (e.g., male) to another (e.g., female).
Demo for Voice Conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx
Concluding Remarks
Concluding Remarks
• Conventional Speech Recognition
• How to use Deep Learning in acoustic modeling?
• Why Deep Learning?
• Speaker Adaptation
• Multi-task Deep Learning
• New acoustic features
• Convolutional Neural Network (CNN)
• Applications in Acoustic Signal Processing
Thank you for your attention!
More Research Related to Speech
• Speech recognition is the core technique, e.g., recognizing 你好 ("Hello").
• Dialogue: "Hi" → "Hello"
• Information Extraction: "I would like to leave Taipei on November 2nd"
• Speech Summarization: lecture recordings → summary
• Spoken Content Retrieval: find the lectures related to "deep learning"
• Computer Assisted Language Learning