
Page 1

End-to-End Speech Recognition using Deep LSTMs, CTC Training and WFST Decoding

Yajie Miao, Carnegie Mellon University

@ MIT, Dec 07 2015

Page 2

Outline

Motivation

End-to-End Speech Recognition

Deep LSTM Models

CTC Training

WFST-based Decoding

Experiments & Analysis

Conclusions


Page 4

Why End-to-End?

End-to-End Image Classification

ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky et al.

Page 5

Why End-to-End?

End-to-End Machine Translation

[Figure: an encoder-decoder model. The encoder consumes "Mary is in love with John"; the decoder emits the Chinese translation "玛丽爱上了约翰".]

Sequence to Sequence Learning with Neural Networks. Sutskever et al. 2014.

Page 6

Complexity of ASR

The HMM/GMM and HMM/DNN pipelines are highly complex:

Multiple training stages: CI phones, CD senones, ...

Various resources: dictionaries, decision trees, ...

Many hyper-parameters: number of senones, number of Gaussians, ...

[Figure: the conventional pipeline, from the GMM stages (CI phone, CD senone, SAT, DT) to the hybrid DNN/LSTM stage, with the dictionary and the phonetic decision tree as side resources.]

Page 7

End-to-End ASR!

ASR is a sequence-to-sequence learning problem: an end-to-end model maps the input speech directly to the transcript, e.g. "I am in Boston today".

A simpler paradigm with a single model (and a single training stage).

Page 8

Outline

Motivation

End-to-End Speech Recognition

Deep LSTM Models

CTC Training

WFST-based Decoding

Experiments & Analysis

Conclusions


Page 10

LSTM Models

RNNs model temporal dependency across speech frames.

[Figure: a recurrent layer unrolled over the inputs x_{t-1} and x_t.]

Long short-term memory (LSTM) units: memory cells store the history information.

Various gates control the information flow inside the LSTM.

Advantageous in learning long-term temporal dependency.

Page 11

LSTM Models

[Figure: an LSTM unit with input gate, forget gate, memory cell and output gate.]

This is a uni-directional LSTM, i.e., a forward LSTM.

LSTMs outperform DNNs in the hybrid approach [Sainath et al., Miao et al.]

Page 12

Bi-directional LSTMs

[Figure: a BiLSTM layer. A forward LSTM and a backward LSTM run over frames t-1, t, t+1, and their outputs are concatenated at every frame:]

h(t) = [h_f(t), h_b(t)]

Page 13

Deep BiLSTM Model

[Figure: multiple BiLSTM layers stacked over frames t-1, t, t+1; the top layer feeds an affine transform followed by a softmax.]
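
As a concrete reference, here is a minimal sketch of such a model in PyTorch (my choice of framework; Eesen itself implements this in C++/CUDA). The sizes mirror the WSJ setup later in the deck: 40 filterbank features plus Δ and ΔΔ (120 inputs), 640 memory cells per direction, 4 layers, 72 CTC labels.

```python
# Hypothetical sketch of the deep BiLSTM acoustic model, written in PyTorch.
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    def __init__(self, num_feats=120, num_cells=640, num_layers=4, num_labels=72):
        super().__init__()
        # Stacked bi-directional LSTM layers; forward and backward outputs are
        # concatenated at every frame, giving 2 * num_cells features.
        self.lstm = nn.LSTM(num_feats, num_cells, num_layers,
                            bidirectional=True, batch_first=True)
        # Affine transform + softmax over the CTC labels (units + blank).
        self.proj = nn.Linear(2 * num_cells, num_labels)

    def forward(self, x):                    # x: (batch, frames, num_feats)
        h, _ = self.lstm(x)                  # h: (batch, frames, 2 * num_cells)
        return self.proj(h).log_softmax(-1)  # frame-level label log-posteriors
```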

Page 14

Outline

Motivation

End-to-End Speech Recognition

Deep LSTM Models

CTC Training

WFST-based Decoding

Experiments & Analysis

Conclusions

Page 15

Connectionist Temporal Classification

CTC is a sequence-to-sequence learning technique [Graves et al.]: it maps the observations X = (x_1, ..., x_T) to the label sequence z = "A B C".

The training objective is L_CTC = ln Pr(z | X)

Page 16

Connectionist Temporal Classification

CTC is a sequence-to-sequence learning technique [Graves et al.]: the network converts the observations X = (x_1, ..., x_T) into frame-level label posteriors Y = (y^1, ..., y^T), from which the likelihood of the label sequence z = "A B C" is computed.

The training objective is L_CTC = ln Pr(z | X)
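
For orientation, the same objective is available off the shelf in modern toolkits; a sketch assuming PyTorch's torch.nn.CTCLoss (which returns the negative of L_CTC) and the DeepBiLSTM sketch from the model slide:

```python
import torch

model = DeepBiLSTM(num_labels=72)      # from the earlier sketch
ctc = torch.nn.CTCLoss(blank=0)        # reserve label id 0 for the blank

x = torch.randn(10, 200, 120)          # 10 utterances of 200 frames each
z = torch.randint(1, 72, (10, 30))     # label sequences z (no blanks)
log_probs = model(x).transpose(0, 1)   # CTCLoss expects (frames, batch, labels)
loss = ctc(log_probs, z,
           input_lengths=torch.full((10,), 200),
           target_lengths=torch.full((10,), 30))   # loss = -ln Pr(z | X)
loss.backward()
```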

Page 17

CTC Paths

CTC paths bridge frame-level labels with the label sequence: a CTC path p = [p_1, ..., p_T] is a sequence of labels on the frame level.

The likelihood of a CTC path decomposes over the frames:

Pr(p | X) = ∏_{t=1}^{T} y^t_{p_t}

Page 18

CTC Paths

CTC paths differ from label sequences in that they:

Allow repetitions of non-blank labels

Add the blank ∅ as an additional label, meaning no (actual) label is emitted

There is a many-to-one mapping from the set of CTC paths Φ(z) to the label sequence z. For example, the paths

A A ∅ ∅ B C ∅        ∅ A A B ∅ C C        ∅ ∅ ∅ A B C ∅

all collapse to "A B C", and "A B C" expands to all such paths:

Pr(z | X) = Σ_{p ∈ Φ(z)} Pr(p | X)

Enumerating the paths directly is computationally intractable!
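
The mapping Φ is easy to state in code. Below is a minimal Python sketch (function names are mine): collapse() merges repetitions and then drops blanks, and brute_force_likelihood() computes Pr(z|X) by literal enumeration, which is exactly the intractable sum the forward-backward algorithm on the next slides avoids.

```python
from itertools import product

BLANK = 0  # reserved blank label id (an assumption)

def collapse(path):
    """Merge repeated labels, then drop blanks: A A 0 0 B C 0 -> A B C."""
    out, prev = [], None
    for p in path:
        if p != BLANK and p != prev:
            out.append(p)
        prev = p
    return out

def brute_force_likelihood(y, z):
    """Pr(z|X) = sum over all paths p in Phi(z) of prod_t y[t][p_t].
    Exponential in T -- usable only on tiny toy inputs."""
    T, K = len(y), len(y[0])
    total = 0.0
    for path in product(range(K), repeat=T):
        if collapse(path) == list(z):
            prob = 1.0
            for t, p in enumerate(path):
                prob *= y[t][p]
            total += prob
    return total
```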

Page 19

Forward-Backward Algorithm

Work on the expanded label sequence l = (∅, A, ∅, B, ∅, C, ∅).

α_t(s): the summed probability of all the CTC paths ending at frame t in position s of l.

[Figure: the trellis of expanded labels (rows ∅, A, ∅, B, ∅, C, ∅) against frames 1, 2, 3, ..., T-2, T-1, T.]

Initialization:

α_1(∅) = y^1_∅        α_1(A) = y^1_A        α_1(s) = 0, ∀ s > 2

Page 20

Forward Computation

α_t(s): the summed probability of all the CTC paths ending at frame t in position s of the expanded labels l.

α_t(s) = y^t_{l_s} (α_{t-1}(s) + α_{t-1}(s-1))        if l_s = ∅ or l_s = l_{s-2}

Page 21

Forward Computation

α_t(s): the summed probability of all the CTC paths ending at frame t in position s of the expanded labels l.

α_t(s) = y^t_{l_s} (α_{t-1}(s) + α_{t-1}(s-1))                        if l_s = ∅ or l_s = l_{s-2}

α_t(s) = y^t_{l_s} (α_{t-1}(s) + α_{t-1}(s-1) + α_{t-1}(s-2))        otherwise
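
The recursion translates directly into code. A plain-Python sketch (probability space for readability; real implementations work in log space or rescale to avoid underflow), reusing BLANK from the earlier sketch:

```python
def expand(labels):
    """Expanded label sequence l: blanks inserted around every label."""
    l = [BLANK]
    for lab in labels:
        l += [lab, BLANK]
    return l

def ctc_forward(y, labels):
    """alpha[t][s]: summed probability of CTC paths ending at frame t in
    position s of the expanded sequence l."""
    l = expand(labels)
    T, S = len(y), len(l)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][l[0]]       # start with the leading blank ...
    alpha[0][1] = y[0][l[1]]       # ... or with the first real label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # Skipping a position is only allowed between distinct non-blank labels.
            if s >= 2 and l[s] != BLANK and l[s] != l[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = y[t][l[s]] * a
    # Valid paths end in the final label or the trailing blank.
    return alpha, alpha[T - 1][S - 1] + alpha[T - 1][S - 2]
```

On toy inputs, the returned likelihood matches brute_force_likelihood() from the earlier sketch, at linear rather than exponential cost in T.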

Page 22

Backward Computation

β_t(s): the summed probability of all the CTC paths starting at frame t in position s of the expanded labels l.

Initialization:

β_T(C) = y^T_C        β_T(∅) = y^T_∅        β_T(s) = 0, ∀ s < |l| - 1

Page 23

Backward Computation

β_t(s): the summed probability of all the CTC paths starting at frame t in position s of the expanded labels l.

β_t(s) = y^t_{l_s} (β_{t+1}(s) + β_{t+1}(s+1))                        if l_s = ∅ or l_s = l_{s+2}

β_t(s) = y^t_{l_s} (β_{t+1}(s) + β_{t+1}(s+1) + β_{t+1}(s+2))        otherwise

Page 24

CTC Training

Evaluation of the objective L_CTC = ln Pr(z | X): at any frame t,

Pr(z | X) = Σ_{s=1}^{|l|} α_t(s) β_t(s) / y^t_{l_s}

(the division by y^t_{l_s} compensates for y^t being counted in both α_t and β_t).

Gradients w.r.t. the pre-softmax network outputs:

CTC ("soft" labels):   ∂L_CTC / ∂a^t_k = y^t_k − (1 / (Pr(z | X) · y^t_k)) Σ_{s ∈ B(l,k)} α_t(s) β_t(s)

CE ("hard" labels):    ∂L_CE / ∂a^t_k = y^t_k − g^t_k

where B(l,k) = {s : l_s = k} and g^t_k is the one-hot frame-level target used in cross-entropy (CE) training.
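
Putting the recursions together, a sketch of the gradient computation (same conventions as the forward sketch; the division by y[t][k] assumes strictly positive posteriors, which a softmax guarantees):

```python
def ctc_backward(y, labels):
    """beta[t][s]: summed probability of CTC paths starting at frame t in position s."""
    l = expand(labels)
    T, S = len(y), len(l)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = y[T - 1][l[S - 1]]   # end in the trailing blank ...
    beta[T - 1][S - 2] = y[T - 1][l[S - 2]]   # ... or in the final label
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s]
            if s + 1 < S:
                b += beta[t + 1][s + 1]
            if s + 2 < S and l[s] != BLANK and l[s] != l[s + 2]:
                b += beta[t + 1][s + 2]
            beta[t][s] = y[t][l[s]] * b
    return beta

def ctc_grad(y, labels):
    """Gradient of -ln Pr(z|X) w.r.t. the pre-softmax outputs a[t][k]."""
    l = expand(labels)
    alpha, prz = ctc_forward(y, labels)
    beta = ctc_backward(y, labels)
    T, K = len(y), len(y[0])
    grad = [[0.0] * K for _ in range(T)]
    for t in range(T):
        for k in range(K):
            # B(l, k): positions s with l_s == k; alpha*beta counts y[t][k]
            # twice, hence the extra division by y[t][k].
            occ = sum(alpha[t][s] * beta[t][s]
                      for s in range(len(l)) if l[s] == k)
            grad[t][k] = y[t][k] - occ / (prz * y[t][k])
    return grad
```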

Page 25

What Happens during CTC Training?

[Figure: frame-level label posteriors for the utterance "P AO R K B EH L IY AY S AH Z F" at initialization and after training iterations 2, 8 and 14; over the iterations the posteriors turn into sharp per-phoneme spikes separated by blanks.]

Page 26

Outline

Motivation

End-to-End Speech Recognition

Deep LSTM Models

CTC Training

WFST-based Decoding

Experiments & Analysis

Conclusions

Page 27

CTC Decoding

Difficulty of CTC decoding:

Previous work proposed beam search for CTC [Graves et al., Hannun et al.]

Incorporating word language models efficiently was difficult

It was challenging to deal with the behaviors of blanks

WFST-based Decoding: 3 WFSTs encode the 3 components required in decoding.

[Figure: the language model WFST G, here a grammar accepting "how are you" and "how is it".]

Page 28

Lexicon -- L

Maps sequences of lexicon units to words.

Phonemes as CTC labels: the standard dictionary WFST, e.g.  is → IH Z

Characters as CTC labels: contains word spellings, e.g.  is → i s; easy to include OOV words

The space <space> between each pair of words is taken as a label

<space> is allowed to appear optionally at the beginning and end of a word

Page 29

Token -- T

Maps a (segment of a) CTC path to a lexicon unit (a phoneme or a character).

Allows occurrences of blanks and repetitions of non-blank labels. For example, the path segments

A A A A A        ∅ ∅ A A ∅        ∅ A A A ∅

all map to the lexicon unit "A".
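
In code, the set of path segments that T maps to a single unit is just blank* unit+ blank*; a small sketch (names are mine):

```python
def maps_to_unit(segment, unit, blank="-"):
    """True iff `segment` (a slice of a CTC path) collapses to exactly `unit`:
    optional blanks, one or more repetitions of the unit, optional blanks."""
    i, n = 0, len(segment)
    while i < n and segment[i] == blank:   # leading blanks
        i += 1
    j = i
    while j < n and segment[j] == unit:    # unit repetitions (at least one)
        j += 1
    if j == i:
        return False
    while j < n and segment[j] == blank:   # trailing blanks
        j += 1
    return j == n

assert maps_to_unit(list("AAAAA"), "A")
assert maps_to_unit(list("--AA-"), "A")
assert maps_to_unit(list("-AAA-"), "A")
```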

Page 30

Search Graph & Posterior Scaling

The 3 WFSTs are composed into a search graph:

S = T ∘ min(det(L ∘ G))

∘ -- composition        det -- determinization        min -- minimization

The search graph encodes the mapping from a sequence of frame-level CTC labels to a sequence of words.

During decoding, scale the label posteriors by their priors:

p(x_t | k) ∝ p(k | x_t) / p(k)

where the prior p(k) is estimated from the expanded label sequences (e.g. ∅ A ∅ B ∅ C ∅) of the training set by simple counting.
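
A sketch of the posterior scaling step (NumPy assumed; function and argument names are mine):

```python
import numpy as np

def scaled_loglikes(log_posteriors, label_counts):
    """Pseudo log-likelihoods log p(x_t|k) = log p(k|x_t) - log p(k),
    where the prior p(k) comes from counting label occurrences in the
    expanded training label sequences."""
    counts = np.asarray(label_counts, dtype=np.float64)
    log_prior = np.log(counts / counts.sum())
    return log_posteriors - log_prior   # broadcasts over the frame axis
```

These scaled scores, not the raw posteriors, are what the decoder pushes through the search graph S.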

Page 31

Eesen Recipe -- Switchboard        https://github.com/yajiemiao/eesen

[Script excerpt: data preparation and FST composition.]

[Script excerpt: feature generation.]

Page 32

Eesen Recipe -- Switchboard        https://github.com/yajiemiao/eesen

[Script excerpt: model training.]

[Script excerpt: decoding.]

Page 33

Eesen Recipe -- Switchboard        https://github.com/yajiemiao/eesen

[Script excerpts: model training and decoding.]

NO:
•  HMM
•  GMM
•  Phonetic decision trees
•  Multiple training stages
•  Dictionaries, if characters are CTC labels
•  ...

Page 34

Outline

Motivation

End-to-End Speech Recognition

Deep LSTM Models

CTC Training

WFST-based Decoding

Experiments & Analysis

Conclusions

Page 35

Results on WSJ

Experimental Setup

4 bi-directional LSTM layers, each with 640 memory cells

40-dimensional log-mel filterbank features, plus Δ and ΔΔ

Training utterances are sorted by their lengths, and 10 utterances are processed as a batch each time (a sketch of this batching follows this slide)

WERs are reported on the eval92 test set

Phoneme-based Systems: the CMU dictionary as the lexicon; 72 labels including phonemes, noise marks and the blank

Character-based Systems: 59 labels including letters, digits, punctuation marks and the blank
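
The length-sorted batching keeps utterances of similar length together, so little computation is wasted on padding; a minimal sketch (names are mine):

```python
def length_sorted_batches(utterances, batch_size=10):
    """Sort utterances by frame count, then cut consecutive batches of 10."""
    by_length = sorted(utterances, key=len)
    return [by_length[i:i + batch_size]
            for i in range(0, len(by_length), batch_size)]
```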

Page 36

Results on WSJ

Phone-based systems:

Models        Vocabulary   LM     WER%
CTC           Original     NIST   7.87
Hybrid DNN    Original     NIST   7.14

Page 37

Results on WSJ

Phone-based systems:

Models          Vocabulary   LM          WER%
CTC             Original     NIST        7.87
Hybrid DNN      Original     NIST        7.14

Character-based systems:

Models          Vocabulary   LM          WER%
CTC             Original     NIST        9.07
CTC             Expanded     Retrained   7.34
Graves et al.   Expanded     Retrained   8.7
Hannun et al.   Original     Unknown     14.1

Page 38

Results on Switchboard

Experimental Setup

300 hours of training speech; tested on the SWBD part of Hub5'00

5 bi-directional LSTM layers, each with 640 memory cells

CTC labels: 46 phonemes (including the blank)

Initialization of the Forget-gate Bias

Initializing the forget-gate bias to a larger value helps LSTMs learn long-term dependency [Jozefowicz et al.]; here the bias of the forget gates is initialized to 1.0 (see the sketch after this slide).
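
A sketch of this initialization, assuming PyTorch's nn.LSTM (which packs its biases as [b_i | b_f | b_g | b_o]); Eesen's own implementation differs:

```python
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = 1.0):
    # nn.LSTM keeps two bias vectors per layer/direction (bias_ih_* and
    # bias_hh_*) that are summed, so put `value` in one and zero the other
    # to make the effective forget-gate bias exactly `value`.
    for name, param in lstm.named_parameters():
        if "bias" in name:
            n = param.size(0) // 4   # one quarter of the vector per gate
            param.data[n:2 * n] = value if name.startswith("bias_ih") else 0.0

init_forget_bias(nn.LSTM(120, 640, num_layers=5, bidirectional=True), 1.0)
```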

Page 39

Results on Switchboard

Models   FG Bias   WER%
CTC      0         15.7
CTC      1.0       15.0

Page 40

Results on Switchboard

Models        #Model Params   WER%
CTC           11M             15.0
Hybrid DNN    40M             16.9
Hybrid LSTM   12M             15.8

For fairness, all the systems use the filterbank features.

Page 41

Results on Switchboard

300-Hour Training:

Models        #Model Params   WER%
CTC           11M             15.0
Hybrid DNN    40M             16.9
Hybrid LSTM   12M             15.8

110-Hour Training:

Models        #Model Params   WER%
CTC           8M              19.9
Hybrid DNN    12M             20.2
Hybrid LSTM   8M              19.2

Page 42

Decoding Efficiency

Models        Decoding Graph   Graph Size   RTF*
CTC           TLG              123M         0.71
Hybrid DNN    HCLG             216M         1.43
Hybrid LSTM   HCLG             216M         1.12

* RTF -- real-time factor

Page 43

Frame Skipping

[Figure: frame-level CTC posteriors over an utterance; most frames are dominated by blanks.]

Page 44

Frame Skipping

In training & decoding, consecutive frames are concatenated (a sketch follows this slide):

original:         x1  x2  x3  x4  x5
frame skipping:   [x1, x2]  [x3, x4]  [x5]

#Frames Skipped   Decoding RTF   WER%
0                 0.71           15.0
1                 0.38           15.8
2                 0.25           16.5
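
A sketch of the frame concatenation (pure Python over lists; with skip=1, every pair of consecutive frames becomes one stacked frame):

```python
def skip_frames(frames, skip=1):
    """Concatenate each frame with its next `skip` frames and advance by
    skip + 1, shrinking the sequence (and the decoding work) accordingly."""
    step = skip + 1
    return [sum(frames[i:i + step], []) for i in range(0, len(frames), step)]

# [x1, ..., x5] with skip=1  ->  [[x1, x2], [x3, x4], [x5]]
print(skip_frames([[1], [2], [3], [4], [5]], skip=1))
```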

Page 45

Frame Skipping

#Frames Skipped       Decoding RTF   WER%
0                     0.71           15.0
1                     0.38           15.8
2                     0.25           16.5
0 (110-hr training)   ----           19.9

With frames skipped, the model effectively sees fewer than 100 hours of frames, yet still outperforms the 110-hour baseline (19.9%).

Page 46

Results on HKUST Mandarin

Experimental Setup

Chinese Mandarin conversational telephone speech [Liu et al.]

175 hours of training

CTC labels: 3600+ Chinese characters

Models       Features      CER%
Hybrid DNN   FBank         39.42
CTC          FBank         39.70
CTC          FBank+Pitch   38.67

Page 47

Results on Multilingual CTC

Experimental Setup

BABEL multilingual corpora

Around 80 hours of training speech per language

Features: filterbank + pitch

Models    Training Hrs   #Labels   CTC WER%
Tagalog   84.5           48        51.5
Turkish   77.2           51        54.5
Pashto    78.4           54        56.6

Page 48

Outline

Motivation

End-to-End Speech Recognition

Deep LSTM Models

CTC Training

WFST-based Decoding

Experiments & Analysis

Conclusions

Page 49

Conclusions

We have presented a complete end-to-end ASR framework.

CTC models achieve performance comparable to the hybrid approach.

There are still a lot of unknowns about CTC.

Open Questions

How can we perform keyword search (KWS) over CTC models?

How can we add context dependency to CTC labels?

How can we estimate i-vectors [Dehak et al.]?

How can we perform pre-training of the deep BiLSTM models?

...

Page 50

References

• Y. Miao, M. Gowayyed, and F. Metze, "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding," in Proc. ASRU. IEEE, 2015.
• T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. ICASSP. IEEE, 2015.
• R. Jozefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in Proc. ICML, 2015.
• A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. ICML, 2014.
• A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, "First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs," arXiv preprint arXiv:1408.2873, 2014.
• Y. Miao and F. Metze, "On speaker adaptation of long short-term memory recurrent neural networks," in Proc. INTERSPEECH. ISCA, 2015.
• Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, "HKUST/MTS: a very large scale Mandarin telephone speech corpus," in Chinese Spoken Language Processing, 2006.
• H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
• N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, "Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification," in Proc. INTERSPEECH, 2009.

Page 51

Questions?

Yajie Miao, Carnegie Mellon University

@ MIT, Dec 07 2015