Presented by: Fang-Hui, Chu

Automatic Speech Recognition Based on Weighted Minimum Classification Error Training Method

Qiang Fu, Biing-Hwang Juang
School of Electrical & Computer Engineering, Georgia Institute of Technology

ASRU 2007
Outline
• Introduction
• Weighted word error rate
• The minimum risk decision rule & weighted MCE method
• Training scenarios & weighting strategies in ASR
• Experiment results for weighted MCE
• Conclusion & future work
Review of Bayes decision theory
• A conditional loss for classifying an observation $X$ into a class event $C_i$:

$$R(C_i \mid X) = \sum_{j=1}^{M} e_{ij}\, P(C_j \mid X)$$

• Expected loss function:

$$L = \int_X R(C(X) \mid X)\, p(X)\, dX, \qquad R(C(x) \mid x) = \min_i R(C_i \mid x)$$

• If we impose the assumption that the error loss function is uniform,

$$e_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

• maximum a posteriori (MAP) decision rule:

$$C(X) = C_i \quad \text{if } P(C_i \mid X) = \max_j P(C_j \mid X)$$

  – It transforms the classifier design problem into a distribution estimation problem

Several limitations!!
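Before turning to those limitations, here is a minimal numeric sketch of the two rules above (toy posteriors, numpy only; the numbers are hypothetical, not from the slides): with the uniform 0/1 cost, minimizing the conditional risk reduces exactly to the MAP rule.

```python
import numpy as np

# Toy posteriors P(C_j | X) for M = 3 classes (hypothetical numbers).
posterior = np.array([0.5, 0.3, 0.2])

# Uniform error loss: e_ij = 0 if i == j, 1 otherwise.
M = len(posterior)
e_uniform = 1.0 - np.eye(M)

# Conditional risk R(C_i | X) = sum_j e_ij * P(C_j | X) = 1 - P(C_i | X).
risk = e_uniform @ posterior

# Under uniform loss, argmin-risk coincides with argmax-posterior (MAP).
assert np.argmin(risk) == np.argmax(posterior)
print(risk)   # [0.5 0.7 0.8] -> class 0 wins either way
```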
Introduction
• In a variety of ASR applications, some errors should be considered more critical than others in terms of the system objective
  – Keyword spotting systems, speech understanding systems, ...
  – Differentiating the significance of recognition errors is necessary, and a non-uniform error cost function becomes appropriate
  – This transforms the classifier design into an error cost minimization problem instead of a distribution estimation problem:

$$C(X) = C_i, \qquad i = \arg\min_i \sum_{j=1}^{M} e_{ij}\, P(C_j \mid X)$$
An example for non-uniform error rate
• Here is an example of using a non-uniform error rate:

0 (REF): AT N. E. C. THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
1 (HYP): AT ANY <del> SEE THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
2 (HYP): AT N. E. C. <del> NEEDS FOR INTERNATIONAL MANAGER’S WILL KEEP RISING

Two recognition results with the same equal-significance word error rate. But which is better?
$$\mathrm{WWER} = \frac{\sum_{s=1}^{S} \log P(w_s) + \sum_{d=1}^{D} \log P(w_d) + \sum_{i=1}^{I} \log P(w_i)}{\sum_{n=1}^{N} \log P(w_n)}$$

where the numerator sums the weights of the $S$ substituted, $D$ deleted, and $I$ inserted words, the denominator sums the weights of the $N$ reference words, and each word $w$ carries a weight derived from its log-probability, so rare words weigh more.
An example for non-uniform error rate cont.
• An example of weighted word error rate:
0 (REF): AT N. E. C. THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
         2.317 3.138 3.135 2.784 1.275 3.675 2.027 3.259 3.797 2.481 3.689 3.925 (total: 35.502)
1 (HYP): AT ANY <del> SEE THE NEED FOR INTERNATIONAL MANAGERS WILL KEEP RISING
         2.317 3.038 <del> 3.503 1.275 3.675 2.027 3.259 3.797 2.481 3.689 3.925
2 (HYP): AT N. E. C. <del> NEEDS FOR INTERNATIONAL MANAGER’S WILL KEEP RISING
         2.317 3.138 3.135 2.784 <del> 3.966 2.027 3.259 3.719 2.481 3.689 3.925
$$\mathrm{WWER}_1 = \frac{9.676}{35.502} = 27.25\%, \qquad \mathrm{WWER}_2 = \frac{8.96}{35.502} = 25.24\%$$

Under the weighted measure, result 2 is the better one, even though the two hypotheses have the same unweighted word error rate.
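A short sketch reproducing the two WWER figures from the table above, assuming (as the totals imply) that substituted and inserted words contribute the hypothesis-side weight, deleted words the reference-side weight, and the denominator is the total reference weight 35.502:

```python
ref_total = 35.502                      # sum of the 12 reference word weights

# Hypothesis 1: N. -> ANY (sub), E. deleted, C. -> SEE (sub)
err_1 = 3.038 + 3.135 + 3.503           # = 9.676
# Hypothesis 2: THE deleted, NEED -> NEEDS (sub), MANAGERS -> MANAGER'S (sub)
err_2 = 1.275 + 3.966 + 3.719           # = 8.960

print(f"WWER_1 = {err_1 / ref_total:.2%}")   # 27.25%
print(f"WWER_2 = {err_2 / ref_total:.2%}")   # 25.24%
```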
The Minimum Risk decision rule
• The minimum risk (MR) decision rule:

$$C(X) = C_i, \qquad i = \arg\min_i \sum_{j=1}^{M} e_{ij}\, P(C_j \mid X) = \arg\min_i \sum_{j=1}^{M} e_{ij}\, \frac{P(X \mid C_j)\, P(C_j)}{P(X)}$$

  – It involves a weighted combination of the a posteriori probabilities of all the classes
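To see how the MR rule can differ from MAP, here is a toy sketch (hypothetical numbers) with a non-uniform cost matrix in which one confusion is ten times more expensive than the others:

```python
import numpy as np

posterior = np.array([0.5, 0.3, 0.2])      # P(C_j | X), toy numbers

# Hypothetical cost matrix: deciding class 0 when the truth is class 1
# costs 10x more than any other confusion.
e = np.array([[0.0, 10.0, 1.0],
              [1.0,  0.0, 1.0],
              [1.0,  1.0, 0.0]])

risk = e @ posterior                        # R(C_i | X) = sum_j e_ij P(C_j | X)
print(np.argmax(posterior))                 # 0: the MAP decision
print(np.argmin(risk))                      # 1: the MR decision avoids the costly error
```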
A practical MR rule
• Approximating the expected loss by the empirical loss over the training set:

$$L = \sum_{j=1}^{M} e_{ij} \int_X P(C_j \mid X)\, p(X)\, dX \approx \frac{1}{K} \sum_{k=1}^{K} e_{\,i(X_k)\, j_{X_k}}$$

where
  – $i(X)$ is the identity index of $X$ as decided by the recognizer
  – $j_X$ is the true identity index of $X$
  – $\mathbb{X} = \{X_1, \ldots, X_K\}$ is the set of training tokens
  – $e_{\,i(X)\, j_X}$ is the error cost incurred on token $X$
A practical MR rule cont.
• We can prescribe a discriminant function for each class, $g_j(X;\Lambda)$, and define the practical decision rule for the recognizer as

$$C(X) = C_i, \qquad i = \arg\max_{m \in I_M} g_m(X;\Lambda)$$

• The alternative system loss is then

$$L(\Lambda) = \frac{1}{|\mathbb{X}|} \sum_{X \in \mathbb{X}} \sum_{i \in I_M} e_{\,i\, j_X}\, \mathbf{1}\!\left(i = \arg\max_{m \in I_M} g_m(X;\Lambda)\right)$$

$$= \frac{1}{|\mathbb{X}|} \sum_{X \in \mathbb{X}} \sum_{j \in I_M} \sum_{i \in I_M} e_{ij}\, \mathbf{1}\!\left(i = \arg\max_{m \in I_M} g_m(X;\Lambda)\right) \mathbf{1}(C_X = C_j)$$
A practical MR rule cont.
• An approximation then needs to be made to the summands:

$$e_{ij}\, \mathbf{1}\!\left(i = \arg\max_{m \in I_M} g_m(X;\Lambda)\right) \approx e_{ij}\, \ell\!\left(g_i(X;\Lambda),\, G_i(X;\Lambda)\right)$$

where the hard indicator

$$\mathbf{1}\!\left(i = \arg\max_{m} g_m(X;\Lambda)\right) = \begin{cases} 1, & g_i(X;\Lambda) = \max_m g_m(X;\Lambda) \\ 0, & \text{otherwise} \end{cases}$$

is smoothed through the competitor score

$$G_i(X;\Lambda) = \left[\frac{1}{M-1} \sum_{m \in I_M,\, m \neq i} g_m(X;\Lambda)^{\eta}\right]^{1/\eta}$$

Note that, as $\eta \to \infty$, $G_i(X;\Lambda) \to \max_{m \in I_M,\, m \neq i} g_m(X;\Lambda)$.
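A quick sketch of the smoothing, with an explicit sharpness parameter eta (standard in MCE training; here the max-competitor limit is used directly, which is an assumption about the exact form): the sigmoid of the log-score difference approaches the hard indicator as eta grows.

```python
import numpy as np

def smoothed_indicator(g, i, eta):
    """Sigmoid surrogate for 1(i == argmax_m g_m); hardens as eta grows."""
    G_i = np.max(np.delete(g, i))           # strongest competitor of class i
    return 1.0 / (1.0 + np.exp(-eta * (np.log(g[i]) - np.log(G_i))))

g = np.array([0.6, 0.3, 0.1])               # toy discriminant scores
for eta in (1.0, 5.0, 50.0):
    # Class 0 is the argmax: its surrogate -> 1; class 1's surrogate -> 0.
    print(eta, smoothed_indicator(g, 0, eta), smoothed_indicator(g, 1, eta))
```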
The weighted MCE method
• The objective function of the weighted MCE is
$$L(\Lambda) = \frac{1}{|\mathbb{X}|} \sum_{X \in \mathbb{X}} \sum_{j \in I_M} \sum_{i \in I_M} e_{ij}\, \ell\!\left(g_i(X;\Lambda),\, G_i(X;\Lambda)\right) \mathbf{1}(C_X = C_j)$$

which can be rewritten as

$$L(\Lambda) = \frac{1}{|\mathbb{X}|} \sum_{X \in \mathbb{X}} \sum_{j \in I_M} \sum_{i \in I_M} \frac{e_{ij}}{1 + \exp\!\left(\ln G_i(X;\Lambda) - \ln g_i(X;\Lambda)\right)}\, \mathbf{1}(C_X = C_j)$$
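A compact sketch of this objective on toy tokens, assuming the sigmoid surrogate above (the explicit eta is again an assumption) and the convention e_jj = 0, so only wrong decisions contribute cost:

```python
import numpy as np

def wmce_loss(scores, labels, e, eta=5.0):
    """Weighted-MCE objective: cost-weighted smoothed error count.

    scores: (K, M) toy discriminant values g_i(X; Lambda) per token
    labels: (K,)   true class index j_X per token
    e:      (M, M) error cost matrix with e[j, j] == 0
    """
    total = 0.0
    for g, j in zip(scores, labels):
        for i in range(len(g)):             # sum over competing decisions i
            G_i = np.max(np.delete(g, i))   # strongest competitor of class i
            l_i = 1.0 / (1.0 + np.exp(-eta * (np.log(g[i]) - np.log(G_i))))
            total += e[i, j] * l_i          # cost of deciding i when truth is j
    return total / len(labels)

g = np.array([[0.7, 0.2, 0.1],              # token recognized correctly
              [0.2, 0.5, 0.3]])             # token of class 2 misrecognized as 1
y = np.array([0, 2])
print(wmce_loss(g, y, 1.0 - np.eye(3)))     # uniform cost recovers plain (smoothed) MCE
```

Replacing the uniform matrix with a non-uniform one makes costly confusions dominate the loss, which is the point of the weighting.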
Training Scenarios
• Intra-level training
  – The training and recognition decisions are on the same semantic level as the performance measure
• Inter-level training
  – The training and recognition decisions are on a different semantic level from the performance metric
  – Minimizing the cost of wrong recognition decisions does not directly optimize the recognizer's performance in terms of the evaluation metric
  – To alleviate this inconsistency, the error weighting strategy can be built in a cross-level fashion
Two types of error cost
• User-defined cost
  – Usually characterized by the system requirement, and relatively straightforward
• Data-defined cost
  – More complicated
  – Wrong decisions occur because the underlying data observation deviates from the distribution the models represent
  – "Bad" data? Or "bad" models?
  – It is possible to measure the "reliability" of the errors by introducing data-defined weighting
Error weighting for intra-level training
• In the intra-level training situation, the system performance is directly measured by the loss of wrong recognition decisions
• We can absorb both types of error weighting into the error cost function as one universal functional form
• The objective function for the weighted MCE could be written as:

$$F^{(1)}_{W\text{-}MCE}(\Lambda) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} \ell_{i,k,l}(X_{k,l};\Lambda)\; e_{ij}\, \mathbf{1}(C_{X_{k,l}} = C_j)$$

where

$$\ell_{i,k,l}(X_{k,l};\Lambda) = \frac{1}{1 + \exp\!\left(\ln G_i(X_{k,l};\Lambda) - \ln g_i(X_{k,l};\Lambda)\right)}$$

$$G_i(X_{k,l};\Lambda) = \max_{m \in I_M,\, m \neq i} g_m(X_{k,l};\Lambda), \qquad g_i(X_{k,l};\Lambda) = P(X_{k,l} \mid C_i)\, P(C_i)$$

and each training utterance $X_k = \{X_{k,1}, \ldots, X_{k,L_k}\}$ is segmented into $L_k$ tokens with the label sequence $PH_k = \{ph_{k,1}, ph_{k,2}, \ldots, ph_{k,L_k}\}$.
Error weighting for inter-level training
• We need to use cross-level weighting in this case to break down the high-level cost and impose the appropriate weights upon the low-level models
• The user-defined weighting of the weighted MCE in the inter-level training can be written as:

$$F^{(2)}_{W\text{-}MCE}(\Lambda) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} \ell_{i,k,l}(X_{k,l};\Lambda)\; e^{(u)}_{ij}(w_{k,l})\, \mathbf{1}(C_{X_{k,l}} = C_j)$$

where

$$e^{(u)}_{ij}(w_{k,l}) = \log P(w_{k,l})$$

the word sequence of utterance $k$ is $W_k = \{w_{k,1}, w_{k,2}, \ldots, w_{k,N_k}\}$, each word is composed of a subsequence of the low-level tokens, $w_{k,n} = \{ph_{k,l_1}, ph_{k,l_2}, \ldots\}$, and $w_{k,l}$ denotes the word that covers token $l$.
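A sketch of one way to realize this user-defined weight, assuming base-10 logs (the magnitudes then match the example table: THE ≈ 1.275, MANAGERS ≈ 3.797) and hypothetical unigram probabilities; splitting a word's weight evenly across its phone tokens is also an assumption, not something the slide specifies:

```python
import math

unigram = {"the": 0.053, "managers": 1.6e-4}     # hypothetical unigram LM
phones  = {"the": ["dh", "ah"],
           "managers": ["m", "ae", "n", "ih", "jh", "er", "z"]}

def word_weight(w):
    # Rarer words get larger weights; |log10 P(w)| matches the table's scale.
    return abs(math.log10(unigram[w]))

for w in unigram:
    per_phone = word_weight(w) / len(phones[w])  # spread weight over phone tokens
    print(w, round(word_weight(w), 3), round(per_phone, 3))
```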
Error weighting for inter-level training cont.
• The data-defined weighting of the weighted MCE in the inter-level training can be written as:

$$F^{(2)}_{W\text{-}MCE}(\Lambda) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} \ell_{i,k,l}(X_{k,l};\Lambda)\; e^{(d)}_{ij}(m_{k,l}), \qquad m_{k,l} = n(l,k)$$

• A W-MCE objective function including both weighting functions under the inter-level training scenario can be written as:

$$F^{(2)}_{W\text{-}MCE}(\Lambda) = \sum_{k=1}^{K} \sum_{l=1}^{L_k} \ell_{i,k,l}(X_{k,l};\Lambda)\; e_{ij}(w_{k,l}, m_{k,l})$$

where $e_{ij}(w_{k,l}, m_{k,l})$ is the total weight and $m_{k,l} = n(l,k)$.
Weighted MCE & MPE/MWE method
• The MPE/MWE is a training method with a weighted objective function built to mimic the training errors:

$$F_{MPE/MWE}(\Lambda) = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{u} p(X_k \mid w_u;\Lambda)\, P(w_u)\, A(W_u, W_{c_k})}{\sum_{u} p(X_k \mid w_u;\Lambda)\, P(w_u)}$$

where
  – $A(W_u, W_{c_k})$ is the raw accuracy of the hypothesized string $W_u$ against the correct transcription $W_{c_k}$
  – $P_{k,c} = p(X_k \mid w_{c_k};\Lambda)\, P(w_{c_k})$ is the joint score of the correct string and $P_{k,w} = \sum_{u \neq c_k} p(X_k \mid w_u;\Lambda)\, P(w_u)$ that of the competing strings, so the per-utterance term admits the sigmoid-style smoothing

$$\ell_k(X_k;\Lambda) = \frac{1}{1 + \exp\!\left(\ln P_{k,c} - \ln P_{k,w}\right)}$$
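A toy sketch of the per-utterance MPE/MWE term: a posterior-weighted average of raw accuracies over a hypothetical 3-entry N-best list (joint scores and accuracies are made-up numbers):

```python
import numpy as np

# Joint scores p(X_k | w_u; Lambda) * P(w_u) for three hypotheses (toy values).
joint = np.array([2.0e-5, 1.5e-5, 0.5e-5])
# Raw accuracies A(W_u, W_ref), e.g. correct words out of a 12-word reference.
acc = np.array([11.0, 9.0, 8.0])

# Per-utterance MPE/MWE term: expected accuracy under the hypothesis posterior.
f_k = np.sum(joint * acc) / np.sum(joint)
print(f_k)                                  # 9.875, between min(acc) and max(acc)
```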
Weighted MCE & MPE/MWE method cont.
• Maximizing the original MPE/MWE objective function is equivalent to minimizing the modified objective function:

$$F_{MPE/MWE}(\Lambda) = \sum_{k=1}^{K} \ell_k(X_k;\Lambda)\, A(W_{i_k}, W_{c_k})$$

• In summary, MPE/MWE builds an objective function that incorporates the non-uniform error cost of each training utterance
  – W-MCE and MPE/MWE are both rooted in Bayes decision theory and share the same aim: designing the optimal classifier that minimizes the non-uniform error cost
W-MCE implementation
• In our experiments, we assume that the weighting function only contains the data-defined weighting, for simplicity:

$$F_{W\text{-}MCE}(\Lambda) = \frac{1}{K} \sum_{k=1}^{K} \sum_{l=1}^{L_k} \ell_{i,k,l}(X_{k,l};\Lambda)\; w(X_{k,l})$$

where the weight of token $l$ is the posterior probability of the word $w_{k,n}$ that spans it:

$$w(X_{k,l}) = \Pr(w_{k,n} \mid X_k) = \frac{P(X_k \mid w_{k,n})\, \Pr(w_{k,n})}{P(X_k)}, \qquad n = n(l,k)$$
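One way to realize this data-defined weight, sketched over a toy N-best list (approximating the word posterior over an N-best list rather than a full lattice is my assumption; the scores are made up): the weight of a token is the posterior mass of the hypotheses whose aligned word matches it.

```python
import numpy as np

# Joint scores p(X_k | w_u) * P(w_u) of the N-best hypotheses (toy values).
joint = np.array([2.0e-5, 1.5e-5, 0.5e-5])
# Whether each hypothesis contains the word spanning the token of interest.
contains_word = np.array([True, True, False])

# Data-defined weight: posterior Pr(word | X_k), with P(X_k) approximated
# by the total joint score of the N-best list.
w = joint[contains_word].sum() / joint.sum()
print(w)                                    # 0.875
```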
Experiments
• Database: WSJ0