
1

Discriminative Learning for Hidden Markov Models

Li Deng

Microsoft Research

EE 516; UW Spring 2009

2

Minimum Classification Error (MCE)

The objective function of MCE training is a smoothed recognition error rate.

Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., generalized probabilistic descent, GPD).

In this work we propose a growth-transformation (GT) based method for MCE model estimation.

3

Automatic Speech Recognition (ASR)

Speech signal of the r-th utterance, segmented into frames 1, 2, 3, 4, …, T; spectrum analysis yields the observation sequence Xr = x1, x2, x3, x4, …, xt, …, xT

Transcription: (sil) OH (sil) SIX EIGHT (sil)

Speech recognition (decoding):

s_r^* = \arg\max_{s_r} \log p_\Lambda(s_r \mid X_r) = \arg\max_{s_r} \log p_\Lambda(X_r, s_r)

4

Models (feature functions) in ASR

ASR in the log-linear framework:

p_\Lambda(X_r, s_r) \propto \exp\Big( \sum_{m=1}^{3} \lambda_m h_m(s_r, X_r) \Big)

h_1(s_r, X_r) = \log p(X_r \mid s_r; \Lambda)   (acoustic model, AM)

h_2(s_r, X_r) = \log p(s_r)   (language model, LM)

h_3(s_r, X_r) = |s_r|   (#words)

\lambda_1 = 1,  \lambda_2 = s (LM scale),  \lambda_3 = p (word insertion penalty)

Λ is the parameter set of the acoustic model (HMM), which is the quantity of interest in the MCE training of this work.
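As a concrete illustration, here is a minimal Python sketch of this score combination together with the argmax decoding rule of slide 3; the n-best list, its log-probabilities, and the λ values are hypothetical stand-ins, not values from the talk.

```python
# Minimal sketch of the log-linear score combination (slide 4) and the
# argmax decoding rule (slide 3). All numbers below are hypothetical.

LM_SCALE = 15.0          # lambda_2: language-model scale (assumed value)
WORD_INS_PENALTY = -3.0  # lambda_3: word insertion penalty (assumed value)

def log_linear_score(am_logprob, lm_logprob, num_words):
    """lambda_1*h1 + lambda_2*h2 + lambda_3*h3, with lambda_1 = 1."""
    return am_logprob + LM_SCALE * lm_logprob + WORD_INS_PENALTY * num_words

def decode(nbest):
    """s_r* = argmax over candidate strings of the combined score."""
    return max(nbest, key=lambda c: log_linear_score(c["am"], c["lm"], c["n_words"]))

# Hypothetical n-best list for one utterance:
nbest = [
    {"words": "OH SIX EIGHT", "am": -5210.4, "lm": -11.2, "n_words": 3},
    {"words": "OH SIX THREE", "am": -5216.9, "lm": -10.8, "n_words": 3},
]
print(decode(nbest)["words"])
```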

5

MCE: Misclassification measure

Correct label Sr: OH EIGHT THREE

Competitor sr,1: OH EIGHT SIX

Observation sequence Xr: x1, x2, x3, x4, …, xt, …, xT

Define the misclassification measure (in the case of using the correct and the top-one incorrect competing tokens):

d_r(X_r, \Lambda) = \log p_\Lambda(X_r, s_{r,1}) - \log p_\Lambda(X_r, S_r)

where sr,1 is the top-one incorrect (not equal to Sr) competing string.

6

MCE: Loss function

Loss function: smoothed error-count function,

l_r\big(d_r(X_r, \Lambda)\big) = \frac{1}{1 + e^{-d_r(X_r, \Lambda)}}

[Figure: sigmoid loss curve, rising from 0 toward 1 as d crosses 0]

Classification: s_r^* = \arg\max_{s_r} \log p_\Lambda(X_r, s_r)

Classification error:

d_r(X_r, \Lambda) > 0  ⇒  1 classification error

d_r(X_r, \Lambda) < 0  ⇒  0 classification error

7

MCE: Objective function

MCE objective function:

L_{MCE}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} l_r\big(d_r(X_r, \Lambda)\big)

LMCE(Λ) is the smoothed recognition error rate at the string (token) level.

The (acoustic) model is trained to minimize LMCE(Λ), i.e., Λ* = argmin_Λ {LMCE(Λ)}
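Slides 5-7 fit together as one small computation; a minimal Python sketch, with hypothetical log-probabilities per utterance:

```python
import math

# Misclassification measure d_r (slide 5), sigmoid loss l_r (slide 6),
# and the MCE objective L_MCE (slide 7). The (logp_correct,
# logp_competitor) pairs below are hypothetical numbers.

def d_r(logp_competitor, logp_correct):
    """d_r(X_r, Lambda) = log p(X_r, s_r1) - log p(X_r, S_r)."""
    return logp_competitor - logp_correct

def sigmoid_loss(d):
    """Smoothed 0/1 error count: 1 / (1 + exp(-d))."""
    return 1.0 / (1.0 + math.exp(-d))

# One (logp_correct, logp_competitor) pair per training utterance:
batch = [(-5210.4, -5216.9), (-4102.0, -4098.5), (-3300.2, -3312.7)]

L_MCE = sum(sigmoid_loss(d_r(comp, corr)) for corr, comp in batch) / len(batch)
print(f"L_MCE = {L_MCE:.4f}")  # approx. the fraction of utterances with d > 0
```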

8

MCE: Optimization

Traditional stochastic GD                      New growth transformation
--------------------------------------------   --------------------------------------------
Gradient-descent based online optimization     Extended Baum-Welch based batch-mode method
Convergence is unstable                        Convergence is stable
Training is difficult to parallelize           Readily parallelized

9

MCE: Optimization

Growth-transformation based MCE:

Minimizing L_MCE(Λ) = ∑ l(d(·))
⇔ Maximizing P(Λ) = G(Λ)/H(Λ)
⇐ Maximizing F(Λ;Λ′) = G(Λ) − P(Λ′)H(Λ) + D
  (rewritten as F(Λ;Λ′) = ∑ f(·))
⇐ Maximizing U(Λ;Λ′) = ∑ f′(·) log f(·)

GT formula: ∂U(·)/∂Λ = 0 ⇒ Λ = T(Λ′)

If Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(·) is called a growth transformation of Λ for P(Λ).

10

MCE: Optimization

Rewrite the MCE loss function as

l\big(d_r(X_r, \Lambda)\big) = \frac{p(X_r, s_{r,1} \mid \Lambda)}{p(X_r, s_{r,1} \mid \Lambda) + p(X_r, S_r \mid \Lambda)}

Then minimizing LMCE(Λ) is equivalent to maximizing Q(Λ), where

Q(\Lambda) = R\big(1 - L_{MCE}(\Lambda)\big) = \sum_{r=1}^{R} \frac{\sum_{s_r \in \{s_{r,1}, S_r\}} p(X_r, s_r \mid \Lambda)\,\delta(s_r, S_r)}{\sum_{s_r \in \{s_{r,1}, S_r\}} p(X_r, s_r \mid \Lambda)}

11

MCE: Optimization

Q(Λ) is further reformulated into a single rational function

P(\Lambda) = \frac{G(\Lambda)}{H(\Lambda)}

where

G(\Lambda) = \sum_{s_1} \cdots \sum_{s_R} p(X_1, \ldots, X_R, s_1, \ldots, s_R \mid \Lambda) \sum_{r=1}^{R} \delta(s_r, S_r)

H(\Lambda) = \sum_{s_1} \cdots \sum_{s_R} p(X_1, \ldots, X_R, s_1, \ldots, s_R \mid \Lambda)

12

MCE: Optimization

Increasing P(Λ) can be achieved by maximizing

F(\Lambda;\Lambda') = G(\Lambda) - P(\Lambda')\,H(\Lambda) + D

because

P(\Lambda) - P(\Lambda') = H(\Lambda)^{-1}\big[F(\Lambda;\Lambda') - F(\Lambda';\Lambda')\big]

as long as D is a Λ-independent constant. (Λ′ is the parameter set obtained from the last iteration.)

Substituting G(·) and H(·) into F(·):

F(\Lambda;\Lambda') = \sum_{s} \sum_{q} p(\chi, q, s \mid \Lambda)\big[C(s) - P(\Lambda')\big] + D

where χ = (X1, …, XR), s = (s1, …, sR), and q is the HMM state sequence.
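A one-line check of the claim, using G(Λ′) = P(Λ′)H(Λ′) and H(Λ) > 0:

```latex
% F(Λ;Λ') - F(Λ';Λ') = [G(Λ) - P(Λ')H(Λ)] - [G(Λ') - P(Λ')H(Λ')]
%                    = G(Λ) - P(Λ')H(Λ),     since G(Λ') = P(Λ')H(Λ'),
% and therefore
\begin{aligned}
P(\Lambda) - P(\Lambda')
  = \frac{G(\Lambda) - P(\Lambda')H(\Lambda)}{H(\Lambda)}
  = \frac{F(\Lambda;\Lambda') - F(\Lambda';\Lambda')}{H(\Lambda)},
\end{aligned}
% so any Λ that increases F also increases P, because H(Λ) > 0. The constant
% D cancels in the difference, which is why it only needs to be Λ-independent.
```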

13

MCE: Optimization

Reformulate F(Λ;Λ′) as

F(\Lambda;\Lambda') = \sum_{s} \sum_{q} f(\chi, q, s, \Lambda; \Lambda')

where

f(\chi, q, s, \Lambda; \Lambda') = \Gamma(\Lambda')\,p(\chi, q \mid s, \Lambda) + d(q, s)

Here the Λ-independent weight Γ(Λ′) absorbs p(s)[C(s) − P(Λ′)], the constant D is distributed as D = ∑_s ∑_q d(q, s), and

C(s) = \sum_{r=1}^{R} \delta(s_r, S_r)

F(Λ;Λ′) is now ready for EM-style optimization.

Note: Γ(Λ′) is a constant w.r.t. Λ, and log p(χ, q | s, Λ) is easy to decompose.

14

MCE: Optimization

Increasing F(Λ;Λ′) can be achieved by maximizing

U(\Lambda;\Lambda') = \sum_{s} \sum_{q} f(\chi, q, s, \Lambda'; \Lambda') \log f(\chi, q, s, \Lambda; \Lambda')

So the growth transformation of Λ for the CDHMM is obtained from

\partial U(\Lambda;\Lambda') / \partial \Lambda = 0 \;\Rightarrow\; \Lambda = T(\Lambda')

Use the extended Baum-Welch algorithm for the E-step.

log f(χ, q, s, Λ; Λ′) is decomposable w.r.t. Λ, so the M-step is easy to compute.
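Why maximizing U increases F is the standard EM/Jensen argument; a sketch, assuming f(χ,q,s,Λ′;Λ′) > 0 for all (q, s), which is exactly what the constant D on slide 16 enforces:

```latex
% Let w(q,s) = f(χ,q,s,Λ';Λ') / F(Λ';Λ'), a probability distribution over (q,s).
% By Jensen's inequality (concavity of log):
\begin{aligned}
\log\frac{F(\Lambda;\Lambda')}{F(\Lambda';\Lambda')}
  &= \log \sum_{s,q} w(q,s)\,
     \frac{f(\chi,q,s,\Lambda;\Lambda')}{f(\chi,q,s,\Lambda';\Lambda')} \\
  &\ge \sum_{s,q} w(q,s)\,
     \log\frac{f(\chi,q,s,\Lambda;\Lambda')}{f(\chi,q,s,\Lambda';\Lambda')}
   = \frac{U(\Lambda;\Lambda') - U(\Lambda';\Lambda')}{F(\Lambda';\Lambda')}.
\end{aligned}
% So U(Λ;Λ') > U(Λ';Λ') implies F(Λ;Λ') > F(Λ';Λ'), and by slide 12, P grows.
```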

15

MCE: Model estimation formulas

For the Gaussian-mixture CDHMM, the GT updates of the mean and covariance of Gaussian m are

\mu_m = \frac{\sum_{r,t} \Delta\gamma_{m,r}(t)\,x_{r,t} + D_m \mu_m'}{\sum_{r,t} \Delta\gamma_{m,r}(t) + D_m}

\Sigma_m = \frac{\sum_{r,t} \Delta\gamma_{m,r}(t)\,(x_{r,t} - \mu_m)(x_{r,t} - \mu_m)^T + D_m \Sigma_m' + D_m (\mu_m - \mu_m')(\mu_m - \mu_m')^T}{\sum_{r,t} \Delta\gamma_{m,r}(t) + D_m}

where

\Delta\gamma_{m,r}(t) = p(S_r \mid X_r, \Lambda')\,p(s_{r,1} \mid X_r, \Lambda')\,\big[\gamma_{m,r,S_r}(t) - \gamma_{m,r,s_{r,1}}(t)\big]

with γ_{m,r,s}(t) the occupation probability of Gaussian m at time t given string s, and

p(x \mid \mu, \Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\big( -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \big)
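A minimal numpy sketch of these two updates for a single Gaussian m; the Δγ weights, feature frames, and D_m below are hypothetical toy inputs (in a real system Δγ_{m,r}(t) comes from forward-backward passes over the correct and competing strings):

```python
import numpy as np

def gt_update(delta_gamma, x, mu_prev, sigma_prev, D_m):
    """GT-MCE update of one Gaussian's mean/covariance (slide 15).

    delta_gamma: (N,) weights Delta-gamma_{m,r}(t), flattened over r and t
                 (they may be negative).
    x:           (N, d) corresponding feature frames x_{r,t}.
    mu_prev:     (d,)   previous mean mu'_m.
    sigma_prev:  (d, d) previous covariance Sigma'_m.
    D_m:         smoothing constant.
    """
    denom = delta_gamma.sum() + D_m
    # Mean: (sum dg(t) x_t + D_m mu') / (sum dg(t) + D_m)
    mu = (delta_gamma @ x + D_m * mu_prev) / denom
    # Covariance, with the new mean plugged in:
    diff = x - mu
    scatter = (delta_gamma[:, None] * diff).T @ diff
    shift = (mu - mu_prev)[:, None]
    sigma = (scatter + D_m * sigma_prev + D_m * (shift @ shift.T)) / denom
    return mu, sigma

# Hypothetical toy inputs (d = 2, N = 4 frames):
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
dg = np.array([0.3, -0.1, 0.2, -0.05])  # Delta-gamma can be negative
mu, sigma = gt_update(dg, x, np.zeros(2), np.eye(2), D_m=2.0)
print(mu, sigma, sep="\n")
```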

16

MCE: Model estimation formulas

Setting of Dm

Theoretically: set D_m so that f(χ, q, s, Λ; Λ′) > 0.

Empirically:

D_m = E \sum_{r=1}^{R} \sum_{t} \big[ p(S_r \mid X_r, \Lambda')\,p(s_{r,1} \mid X_r, \Lambda')\,\gamma_{m,r,S_r}(t) + p(s_{r,1} \mid X_r, \Lambda')\,\gamma_{m,r,s_{r,1}}(t) \big]

where E is a tuning constant (the E = 1.0, 2.0, 2.5 settings in the TI-DIGITS plots below).

17

MCE: Workflow

Training utterances → Recognition (with last iteration's model Λ′) → competing strings

Training transcripts + competing strings → GT-MCE → new model Λ → next iteration
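The same workflow as a training loop in Python; recognize and gt_mce_update are hypothetical stand-ins for the decoder and the slide-15 re-estimation step, not real APIs:

```python
def train_gt_mce(model, utterances, transcripts, recognize, gt_mce_update,
                 num_iterations=10):
    """Outer loop of GT-MCE training (slide 17 workflow).

    recognize(model, X, exclude) -> top-one incorrect competing string, and
    gt_mce_update(...) -> new model, are caller-supplied stand-ins for the
    decoder and the growth-transformation re-estimation of slide 15.
    """
    for _ in range(num_iterations):
        # Decode each training utterance with the current model (Lambda')
        # to obtain its top-one incorrect competing string s_{r,1}.
        competitors = [recognize(model, X, exclude=S)
                       for X, S in zip(utterances, transcripts)]
        # Batch-mode GT re-estimation: one closed-form update per pass over
        # the whole training set (parallelizable across utterances).
        model = gt_mce_update(model, utterances, transcripts, competitors)
    return model
```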

18

Experiment: TI-DIGITS

Vocabulary: “1” to “9”, plus “oh” and “zero”

Training set: 8623 utterances / 28329 words

Test set: 8700 utterances / 28583 words

33-dimensional spectral features: energy + 10 MFCCs, plus ∆ and ∆∆ features (11 static features × 3 = 33).

Model: Continuous Density HMMs

Total number of Gaussian components: 3284

19

Experiment: TI-DIGITS

GT-MCE vs. the ML (maximum likelihood) baseline:

Obtains the lowest error rate on this task

Reduces the recognition word error rate (WER) by 23%

Fast and stable convergence

[Figure: "MCE training - TIdigits": loss function (sigmoid error count) vs. MCE iteration, for E = 1.0, 2.0, 2.5]

[Figure: "MCE training - TIdigits": WER (%) vs. MCE iteration, for E = 1.0, 2.0, 2.5]

20

Experiment: Microsoft Tele. ASR

Microsoft Speech Server (ENUTEL), a telephony speech recognition system

Training set: 2000 hours of speech / 2.7 million utterances

33-dimensional spectral features: (E + MFCCs) + ∆ + ∆∆

Acoustic model: Gaussian-mixture HMM

Total number of Gaussian components: 100K

Vocabulary: 120K (delivered vendor lexicon)

CPU cluster: 100 CPUs @ 1.8 GHz to 3.4 GHz

Training cost: 4 to 5 hours per iteration

21

Experiment: Microsoft Tele. ASR

Evaluated on four corpus-independent test sets, collected from sites other than the training-data providers and covering major commercial telephony ASR scenarios:

Name   Voc. size   # words   Description
MSCT   70K         4356      enterprise call center system (the MS call center we use daily)
SA     20K         43966     major commercial applications (includes much cell-phone data)
QSR    55K         5718      name-dialing system (many names are OOV; relies on LTS)
ACNT   20K         3219      foreign-accented speech recognition (designed to test system robustness)

22

Experiment: Microsoft Tele. ASR

Test   ML WER    GT-MCE WER   WER reduction
MSCT   11.59%    9.73%        16.04%
SA     11.24%    10.07%       10.40%
QSR    9.55%     8.58%        10.07%
ACNT   32.68%    29.00%       11.25%

Significant performance improvements across the board

The first time MCE has been successfully applied to a 2000-hour speech database

Growth-transformation based MCE training is well suited to large-scale modeling tasks
