Confidence Measures for Automatic Speech Recognition

Presented by Tzan-Hwei Chen

National Taiwan Normal University, Spoken Language Processing Lab

Advisors: Hsin-Min Wang, Berlin Chen

2

Outline

• Introduction

• Categories of confidence measure (CM) estimation methods

– Feature based

– Posterior probability based

– Explicit model based

– Incorporation of high-level information for CM*

• The application of CM to improve speech recognition

• Summary

3

Introduction (1/9)

• Because ASR output is error-prone, it is extremely important to be able to make an appropriate and reliable judgement about each recognition result.

• Researchers have proposed computing a score (preferably between 0 and 1), called a confidence measure (CM), that indicates the reliability of any recognition decision made by an ASR system.

4

Introduction (2/9)

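[Figure: ASR pipeline with verification. Speech signal → feature extraction → feature vectors → decoding (using the acoustic model, language model and lexicon) → recognized word sequence → confidence measure → verification.]

Example: the recognizer outputs 臺北到魚籃 ("Taipei to fish basket"); the N-best list is 1. 臺北到魚籃, 2. 臺北到宜蘭 ("Taipei to Yilan").

Some applications of CM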

5

Introduction (3/9)

• Some of the earliest research on CM can be traced back to rejection in word-spotting systems.

• Other early CM-related work addressed automatic detection of new words in LVCSR.

• In the past few years, CM has been applied to more and more research areas, e.g.,

– To improve speech recognition

– Look-ahead algorithms in LVCSR

– To guide the system in unsupervised learning

– …

6

Introduction (4/9)

• The general procedure of CM for verification

[Figure: recognized units → confidence estimation → confidence of unit → judgment against a predefined threshold: acceptance if confidence > threshold, rejection if confidence < threshold.]
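The verification step above can be sketched in a few lines. This is only an illustration: the function name `verify`, the threshold value 0.5 and the confidence values are assumptions, and a confidence exactly equal to the threshold (a case the slide does not specify) is treated as a rejection.

```python
def verify(units, threshold=0.5):
    """Label each (unit, confidence) pair as 'accept' or 'reject'."""
    decisions = []
    for unit, confidence in units:
        # Accept only when the confidence exceeds the predefined threshold.
        # Assumption: equality counts as rejection.
        label = "accept" if confidence > threshold else "reject"
        decisions.append((unit, label))
    return decisions


# Hypothetical confidences for two recognized words.
decisions = verify([("臺北", 0.92), ("魚籃", 0.31)], threshold=0.5)
```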

7

Introduction (5/9)

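• Four situations when judging a hypothesis (reference word 宜蘭, "Yilan"; the misrecognition 魚籃 means "fish basket")

– ref 宜蘭, hyp 宜蘭, accept: correct acceptance

– ref 宜蘭, hyp 魚籃, reject: correct rejection

– ref 宜蘭, hyp 宜蘭, reject: false rejection

– ref 宜蘭, hyp 魚籃, accept: false acceptance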

8

Introduction (6/9)

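• The evaluation metric:

– Confidence error rate (CER):

$$\text{CER} = \frac{\text{number of false acceptances} + \text{number of false rejections}}{\text{total number of recognized words}}$$

Example:

ref:  有 三名 候選人 通過 審查 ("Three candidates passed the review")

hyp:  三民 候選人 通過 審查 了

tag:  FA   CA     FR   CA   FA

$$\text{CER} = \frac{2 + 1}{5}$$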

9

Introduction (7/9)

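• The evaluation metric:

– Confidence error rate of the baseline (it is assumed that every recognized word is correct, i.e. every word is accepted):

$$\text{CER}_{\text{baseline}} = \frac{\text{number of insertions} + \text{number of substitutions}}{\text{total number of recognized words}}$$

Example:

ref:  有 三名 候選人 通過 審查

hyp:  三民 候選人 通過 審查 了

tag:  FA   CA     CA   CA   FA

$$\text{CER}_{\text{baseline}} = \frac{1 + 1}{5}$$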
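The two worked examples can be checked with a short script. It is a sketch under the definitions on these slides; the function names and the tag strings ("FA", "FR", "CA", "CR") are assumptions made for illustration.

```python
def confidence_error_rate(tags):
    """CER = (false acceptances + false rejections) / recognized words."""
    errors = sum(tag in ("FA", "FR") for tag in tags)
    return errors / len(tags)


def baseline_cer(hyp_is_correct):
    """Baseline: every word is accepted, so each wrong word is a false acceptance."""
    return sum(not ok for ok in hyp_is_correct) / len(hyp_is_correct)


# Tags from the CER example: two false acceptances and one false rejection.
cer = confidence_error_rate(["FA", "CA", "FR", "CA", "FA"])

# Same hypothesis with everything accepted: one substitution, one insertion.
base = baseline_cer([False, True, True, True, False])
```

With these inputs `cer` is 3/5 and `base` is 2/5, so in this particular example the confidence measure does worse than accepting everything.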

10

Introduction (8/9)

• The evaluation metric (cont):

– Receiver operating characteristic (ROC) curve: a plot of the false acceptance rate against the detection rate.

$$\text{detection rate} = 1 - \text{false rejection rate}$$

$$\text{false acceptance rate} = \frac{\text{number of false acceptances}}{\text{number of incorrectly recognized words}}$$

$$\text{false rejection rate} = \frac{\text{number of false rejections}}{\text{number of correctly recognized words}}$$
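A minimal sketch of how the ROC points follow from these definitions when the threshold is swept. The function name `roc_points`, the scores and the correctness labels are hypothetical; the sketch assumes the data contain at least one correct and one incorrect word.

```python
def roc_points(scores, is_correct, thresholds):
    """Return (false acceptance rate, detection rate) for each threshold."""
    n_correct = sum(is_correct)
    n_wrong = len(is_correct) - n_correct
    points = []
    for th in thresholds:
        accepted = [s > th for s in scores]
        false_acc = sum(a and not c for a, c in zip(accepted, is_correct))
        false_rej = sum((not a) and c for a, c in zip(accepted, is_correct))
        far = false_acc / n_wrong                 # false acceptance rate
        detection = 1.0 - false_rej / n_correct   # 1 - false rejection rate
        points.append((far, detection))
    return points


scores = [0.9, 0.8, 0.4, 0.3]             # hypothetical confidence scores
is_correct = [True, False, True, False]   # whether each word was recognized correctly
points = roc_points(scores, is_correct, thresholds=[0.2, 0.5, 0.85])
```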

11

Introduction (9/9)

• All methods proposed for computing CMs can be roughly classified into three major categories [7]:

– Feature based

– Posterior probability based

– Explicit model based (utterance verification, UV)

– Incorporation of high-level information for CM*

12

Feature-based confidence measure

13

Feature-based confidence measure (1/8)

• The feature can be collected during the decoding procedure and may include acoustic, language and syntactic information

• Any feature can be called a predictor if its p.d.f. of correctly recognized words is clearly distinct from that of misrecognized words

[Figure: the density $p(f_w)$ plotted against the feature value $f_w$, with one curve for misrecognized words and one for correctly recognized words.]

14


• Some common predictor features

– Pure normalized likelihood score related: acoustic score per frame.

– N-best related: count in the N-best list, N-best homogeneity score

– Duration related: word duration divided by its number of phones

N-best homogeneity score of a word $w$:

$$\frac{\sum_{[w_m,s_m,e_m]_1^M \in N\text{-best},\ w \in [w_m,s_m,e_m]_1^M} p([w_m,s_m,e_m]_1^M \mid X)}{\sum_{[w_n,s_n,e_n]_1^N \in N\text{-best}} p([w_n,s_n,e_n]_1^N \mid X)}$$

$[w_m,s_m,e_m]_1^M$: a word sequence whose number of words is $M$

$s_m$: the start time of the $m$-th word

$e_m$: the end time of the $m$-th word
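As an illustration of the N-best homogeneity score, the sketch below sums the scores of the N-best hypotheses that contain a given word and divides by the sum over the whole list. The function name, the list layout and the scores are assumptions; time boundaries are ignored for brevity.

```python
def nbest_homogeneity(word, nbest):
    """nbest: list of (word_sequence, score) pairs, score proportional to p(W | X)."""
    total = sum(score for _, score in nbest)
    containing = sum(score for words, score in nbest if word in words)
    return containing / total


# Hypothetical 3-best list for the "Taipei to Yilan" example.
nbest = [
    (["臺北", "到", "魚籃"], 0.5),
    (["臺北", "到", "宜蘭"], 0.3),
    (["臺北", "道", "宜蘭"], 0.2),
]
score_yilan = nbest_homogeneity("宜蘭", nbest)
score_taipei = nbest_homogeneity("臺北", nbest)
```

A word that appears in every hypothesis, such as 臺北 here, gets a score of 1.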

15


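• Some common predictor features (cont)

– Hypothesis density:

$$HD(a:[w_a;s_a,e_a]) = \frac{1}{e_a - s_a + 1}\sum_{t=s_a}^{e_a} D(t)$$

$D(t)$: the number of different word arcs at time $t$

$a:[w_a;s_a,e_a]$: a word arc in the word graph

[Figure: word graph with competing arcs such as 靜音 (silence), 結果, 建國, the homophones 有 / 由 / 又, 三名, 候選人, 沒有, 審查, 通過, 候選人通過 and 三名候選人; at the marked frame, $D(t) = 2$.]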
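The hypothesis density can be sketched as follows. The arc layout `(word, start, end)` and the frame indices are hypothetical, and the sketch assumes that $D(t)$ counts distinct word identities among the arcs covering frame $t$.

```python
def hypothesis_density(arc, arcs):
    """Average number of competing words per frame over the span of `arc`."""
    _, start, end = arc
    total = 0
    for t in range(start, end + 1):
        # D(t): number of different words with an arc covering frame t (assumption).
        words_at_t = {w for (w, s, e) in arcs if s <= t <= e}
        total += len(words_at_t)
    return total / (end - start + 1)


# Hypothetical arcs (word, start frame, end frame) for three homophones.
arcs = [("有", 0, 9), ("由", 0, 9), ("又", 0, 4)]
hd = hypothesis_density(("有", 0, 9), arcs)
```

Frames 0 to 4 see three competing words and frames 5 to 9 see two, so the density of the arc 有 is 2.5.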

16


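• Some common predictor features (cont)

– Acoustic stability

[Figure: the utterance is decoded several times with different weightings between the acoustic and language model scores, $p(X\mid W)\,P(W)^{\lambda_1}$, $p(X\mid W)\,P(W)^{\lambda_2}$ and $p(X\mid W)\,P(W)^{\lambda_3}$, giving hypothesized word sequences such as 今天天氣很好 ("The weather is nice today") and 今天天氣不佳 ("The weather is bad today"). Words that stay the same across the hypotheses, here 今天天氣, are acoustically stable; 很好 / 不佳 are not.]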

17


• We can combine the above features with any one of the following classifiers

– Linear discriminant function

– Generalized linear model

– Neural networks

– Decision tree

– Support vector machine

– Boosting

– Naïve Bayes classifier

18


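• Naïve Bayes Classifier [3]

$$P(C_1 \mid f_w, w) = \frac{P(C_1 \mid w)\,P(f_w \mid C_1, w)}{\sum_{C' = C_1 \text{ or } C_2} P(C' \mid w)\,P(f_w \mid C', w)}$$

$$P(f_w \mid C_i, w) = \prod_{d=1}^{K} P(f_{w_d} \mid C_i, w)$$

$$P(C_i \mid w) = \frac{N(C_i, w)}{N(w)}$$

$$P(f_{w_d} \mid C_i, w) = \frac{N(f_{w_d}, C_i, w)}{N(C_i, w)}$$

$C_1$: the recognized word is correct

$C_2$: the recognized word is wrong

$K$: the dimension of the predictor feature vector $f_w$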
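A count-based sketch of this classifier, assuming discretized feature values and no smoothing (both are simplifications, not something the slide states). The function names, the label strings "C1"/"C2" and the training samples are hypothetical.

```python
from collections import Counter


def train(samples):
    """samples: (word, feature_tuple, label) with label 'C1' (correct) or 'C2' (wrong)."""
    n_w, n_cw, n_fcw = Counter(), Counter(), Counter()
    for word, feats, label in samples:
        n_w[word] += 1                        # N(w)
        n_cw[(label, word)] += 1              # N(C_i, w)
        for d, f in enumerate(feats):
            n_fcw[(d, f, label, word)] += 1   # N(f_d, C_i, w)
    return n_w, n_cw, n_fcw


def p_correct(word, feats, model):
    """P(C1 | f_w, w) from relative frequencies."""
    n_w, n_cw, n_fcw = model
    joint = {}
    for label in ("C1", "C2"):
        if n_cw[(label, word)] == 0:
            joint[label] = 0.0
            continue
        p = n_cw[(label, word)] / n_w[word]               # P(C_i | w)
        for d, f in enumerate(feats):                     # product over the K features
            p *= n_fcw[(d, f, label, word)] / n_cw[(label, word)]
        joint[label] = p
    total = joint["C1"] + joint["C2"]
    return joint["C1"] / total if total > 0 else 0.5      # 0.5 when nothing was seen


# Hypothetical samples: (acoustic stability, duration) features for one word.
samples = [
    ("宜蘭", ("high", "long"), "C1"),
    ("宜蘭", ("high", "short"), "C1"),
    ("宜蘭", ("low", "long"), "C1"),
    ("宜蘭", ("low", "short"), "C2"),
]
model = train(samples)
conf_good = p_correct("宜蘭", ("high", "long"), model)
conf_bad = p_correct("宜蘭", ("low", "short"), model)
```

In practice the counts would need smoothing, since a single unseen feature value drives the whole product to zero.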

19


• Experiments [3]

• Corpus: an Italian speech corpus of phone calls to the front desk of a hotel

feature                CER (%)   relative reduction (%)
acoustic stability     16.3      22.4
language model score   18.8      10
hypothesis density     18.9      10.5
duration               19.3      8.1
acoustic score         19.6      6.7
baseline               21

20

Posterior probability based confidence measure

21

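Posterior probability based confidence measure (1/11)

• Posterior probability of a word sequence:

$$P([w_m,s_m,e_m]_1^M \mid X) = \frac{p(X \mid [w_m,s_m,e_m]_1^M)\,P([w_m,s_m,e_m]_1^M)}{p(X)} = \frac{p(X \mid [w_m,s_m,e_m]_1^M)\,P([w_m,s_m,e_m]_1^M)}{\sum_{[w_n,s_n,e_n]_1^N \in \overline{W}} p(X \mid [w_n,s_n,e_n]_1^N)\,P([w_n,s_n,e_n]_1^N)}$$

• The denominator, a sum over the set $\overline{W}$ of all word sequences, is impossible to estimate in a precise manner.

• Some approximation method therefore has to be adopted.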

22

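• Word graph based approximation

$$p(X) = \sum_{[w_n,s_n,e_n]_1^N \in \overline{W}} p(X \mid [w_n,s_n,e_n]_1^N)\,P([w_n,s_n,e_n]_1^N) \approx \sum_{[w_m,s_m,e_m]_1^M \in \text{word graph}} p(X \mid [w_m,s_m,e_m]_1^M)\,P([w_m,s_m,e_m]_1^M)$$

[Figure: word graph with competing arcs such as 靜音 (silence), 結果, 建國, the homophones 有 / 由 / 又, 三名, 候選人, 沒有, 通過, 候選人通過 and 三名候選人.]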

23


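• Posterior probability of a word arc:

$$p(a:[w_a;s_a,e_a] \mid X) = \frac{\sum_{\{[w_m;s_m,e_m]\}_1^M \in \text{word graph},\ \exists m:\ [w_m;s_m,e_m] = [w_a;s_a,e_a]}\ \prod_{m=1}^{M} p(x_{s_m}^{e_m} \mid w_m)\,P(w_m \mid h_m)}{\sum_{\{[w_n;s_n,e_n]\}_1^N \in \text{word graph}}\ \prod_{n=1}^{N} p(x_{s_n}^{e_n} \mid w_n)\,P(w_n \mid h_n)}$$

– Some issues are addressed and the word posterior probability is generalized

• Reduced search space

• Relaxed time registration

• Optimal acoustic and language model weights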

24


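• Posterior probability of a word arc [6]:

$$C_{normal}(a:[w_a;s_a,e_a] \mid X) = \frac{\sum_{\{[w_m;s_m,e_m]\}_1^M \in \text{word graph},\ \exists m:\ [w_m;s_m,e_m] = [w_a;s_a,e_a]}\ \prod_{m=1}^{M} p(x_{s_m}^{e_m} \mid w_m)\,P(w_m \mid h_m)}{\sum_{\{[w_n;s_n,e_n]\}_1^N \in \text{word graph}}\ \prod_{n=1}^{N} p(x_{s_n}^{e_n} \mid w_n)\,P(w_n \mid h_n)}$$

[Figure: word graph with competing arcs such as 靜音 (silence), 結果, 建國, the homophones 有 / 由 / 又, 三名, 候選人, 沒有, 通過, 候選人通過 and 三名候選人; the paths through the arc under consideration are highlighted.]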
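The arc posterior is the total score of all word graph paths through an arc divided by the total score of all paths, which a forward-backward pass computes without enumerating paths. The sketch below assumes each arc already carries the combined score $p(x_s^e \mid w)\,P(w \mid h)$ and that nodes are numbered in topological order; the function name, the graph layout and the scores are hypothetical.

```python
from collections import defaultdict


def arc_posteriors(arcs, start_node, end_node):
    """arcs: list of (word, from_node, to_node, score); returns one posterior per arc."""
    forward = defaultdict(float)
    backward = defaultdict(float)
    forward[start_node] = 1.0
    backward[end_node] = 1.0
    # Forward pass: a node's score is complete before its outgoing arcs are visited.
    for _, u, v, score in sorted(arcs, key=lambda a: a[1]):
        forward[v] += forward[u] * score
    # Backward pass in reverse topological order.
    for _, u, v, score in sorted(arcs, key=lambda a: a[2], reverse=True):
        backward[u] += score * backward[v]
    total = forward[end_node]  # sum over all complete paths
    return [forward[u] * score * backward[v] / total for _, u, v, score in arcs]


# Hypothetical graph: three competing first words, then one shared second word.
arcs = [("有", 0, 1, 0.6), ("由", 0, 1, 0.3), ("又", 0, 1, 0.1), ("三名", 1, 2, 1.0)]
post = arc_posteriors(arcs, start_node=0, end_node=2)
```

Arcs that compete over the same span share the probability mass (0.6, 0.3, 0.1 here), while the arc that every path uses gets a posterior of 1.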

25



$$C_{med}(a:[w_a;s_a,e_a]) = \sum_{r:[w_r;s_r,e_r]:\ w_r = w_a,\ s_r \le (s_a+e_a)/2 \le e_r} C_{normal}(r:[w_r;s_r,e_r] \mid X)$$

[Figure: the same word graph; the 三名 arcs that cover the middle frame of the arc under consideration are highlighted.]

26

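$$C_{max}(a:[w_a;s_a,e_a]) = \max_{t \in \{s_a,\dots,e_a\}}\ \sum_{r:[w_r;s_r,e_r]:\ w_r = w_a,\ s_r \le t \le e_r} C_{normal}(r:[w_r;s_r,e_r] \mid X)$$

[Figure: the same word graph; for each frame $t$ of the arc under consideration, the 三名 arcs covering $t$ are accumulated and the maximum over $t$ is taken.]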

27



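$$C_{sec}(a:[w_a;s_a,e_a]) = \sum_{r:[w_r;s_r,e_r]:\ w_r = w_a,\ \{s_a,\dots,e_a\} \cap \{s_r,\dots,e_r\} \neq \emptyset} C_{normal}(r:[w_r;s_r,e_r] \mid X)$$

[Figure: the same word graph; all 三名 arcs that intersect the arc under consideration in time are highlighted.]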
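The three relaxed measures differ only in which same-word arcs they accumulate. The sketch below assumes arc posteriors ($C_{normal}$) are already available; the tuple layout `(word, start, end, posterior)` and the numbers are hypothetical.

```python
def c_sec(a, arcs):
    """Sum posteriors of same-word arcs that intersect arc `a` in time."""
    w, s, e, _ = a
    return sum(p for (wr, sr, er, p) in arcs if wr == w and sr <= e and s <= er)


def c_med(a, arcs):
    """Sum posteriors of same-word arcs that cover the middle frame of `a`."""
    w, s, e, _ = a
    mid = (s + e) / 2
    return sum(p for (wr, sr, er, p) in arcs if wr == w and sr <= mid <= er)


def c_max(a, arcs):
    """Maximum over the frames of `a` of the summed same-word posteriors."""
    w, s, e, _ = a
    return max(
        sum(p for (wr, sr, er, p) in arcs if wr == w and sr <= t <= er)
        for t in range(s, e + 1)
    )


# Hypothetical arcs with slightly different boundaries for the same word.
arcs = [
    ("三名", 10, 20, 0.30),
    ("三名", 12, 22, 0.40),
    ("三名", 5, 11, 0.10),
    ("三名", 16, 25, 0.05),
    ("候選人", 10, 30, 0.15),
]
a = arcs[0]
sec, med, mx = c_sec(a, arcs), c_med(a, arcs), c_max(a, arcs)
```

Here the three measures give 0.85, 0.70 and 0.75 for the first arc, all higher than its own posterior of 0.30, because probability mass that the word graph splits across near-identical boundaries is collected again.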

28


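• The drawback of the above methods: all need an additional pass.

• In [8], the “local word confidence measure” is proposed:

$$C([w,s,e]) = \frac{\max_{[w,s',e'] \in E}\big(p(x_{s'}^{e'} \mid w)\big)\,p(w)}{\sum_{w'} \max_{[w',s',e'] \in E}\big(p(x_{s'}^{e'} \mid w')\big)\,p(w')}$$

$E$: the set of the alternate words which realize time and length constraints given a relaxation rate.

[Figure: several alternate arcs for 今天 ("today") with slightly different boundaries; the best one gives $\max_{[w,s',e']} p(x_{s'}^{e'} \mid w)$.]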

29


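• Local word confidence measure (cont)

$$C([w,s,e]) = \frac{\max_{[w,s',e'] \in E}\big(p(x_{s'}^{e'} \mid w)\big)\,p(w)}{\sum_{w'} \max_{[w',s',e'] \in E}\big(p(x_{s'}^{e'} \mid w')\big)\,p(w')}$$

Bigram applied:

$$C_h([w,s,e]) = \frac{\max_{[w,s',e'] \in E}\big(p(x_{s'}^{e'} \mid w)\big)\,p(w \mid w_h)}{\sum_{w'} \max_{[w',s',e'] \in E}\big(p(x_{s'}^{e'} \mid w')\big)\,p(w' \mid w'_h)}$$

Forward/backward bigram applied:

$$C_{hf}([w,s,e]) = \frac{\max_{[w,s',e'] \in E}\big(p(x_{s'}^{e'} \mid w)\big)\,\{p(w \mid w_h)\,p(w \mid w_f)\}}{\sum_{w'} \max_{[w',s',e'] \in E}\big(p(x_{s'}^{e'} \mid w')\big)\,\{p(w' \mid w'_h)\,p(w' \mid w'_f)\}}$$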

30


• Impact of word graph density on the quality of posterior probability [9]

Baseline 27.3 15.4

$$WGD = \frac{\text{total number of word arcs}}{\text{the number of spoken words}}$$

31


• Experiments [6]

corpus           baseline   Cnormal   Csec   Cmed   Cmax
ARISE            13.6       11.5      8.9    8.8    8.9
Verbmobil        27.3       23.3      19.0   20.0   18.9
NAB 20k          11.3       10.3      9.2    9.2    9.2
NAB 64k          9.2        8.4       7.2    7.2    7.2
Broadcast news   27.7       23.7      20.6   20.4   20.6

32

Explicit model based confidence measure (1/10)

• The CM problem is formulated as a statistical hypothesis testing problem.

• Under the framework of binary hypothesis testing, there are two complementary hypotheses:

$H_0$ (null hypothesis): $X$ is correctly recognized and truly comes from model $W$

$H_1$ (alternative hypothesis): $X$ is wrongly recognized and is NOT from model $W$

• We test $H_0$ against $H_1$ with a likelihood ratio test (LRT):

$$LRT = \frac{P(X \mid H_0)}{P(X \mid H_1)}\ \underset{H_1}{\overset{H_0}{\gtrless}}\ \tau$$

where $\tau$ is the decision threshold.

33


• The above LRT score can be transformed to a CM based on a monotonic 1-1 mapping function.

• The major difficulty with LRT is how to model the alternative hypothesis.

• In practice, the same HMM structure is adopted to model the alternative hypothesis.

• A discriminative training procedure plays a crucial role in improving modeling performance.

34


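• Two-pass procedure:

[Figure: in the first pass the decoder recognizes 今天天氣很好 ("The weather is nice today") using the observation score $P(x \mid s^c)$ and the transition score $p(s_j^c \mid s_i^c)$; in the second pass each decoded segment, e.g. 今天 ("today") spanning frames $s$ to $e$, is verified with the likelihood ratio test.]

$$LRT: \frac{P(X_s^e \mid \lambda^c)}{P(X_s^e \mid \lambda^a)}$$

$\lambda^c$: the correct model of 今天

$\lambda^a$: the anti-model of 今天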

35


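• One-pass procedure

[Figure: the decoder recognizes 今天天氣很好 directly with likelihood ratio scores over paired states $(s_t^c, s_t^a)$, so the segment 今天 is decoded and verified in the same pass.]

$$\text{observation score} = \frac{P(x \mid s^c)}{P(x \mid s^a)}$$

$$\text{transition score} = \frac{p(s_j^c \mid s_i^c)}{p(s_j^a \mid s_i^a)}$$

$$LRT: \frac{P(X_s^e \mid \lambda^c)}{P(X_s^e \mid \lambda^a)}$$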

36


• How to calculate the confidence of a recognized word?

The unweighted unit-level likelihood score for a subword unit $u$ decoded over a segment of observation vectors, $X_u$, can be obtained as

$$LR(u) = \frac{1}{N_u}\log LR(X_u,\lambda_u^c,\lambda_u^a) = \frac{1}{N_u}\log\frac{P(X_u \mid \lambda_u^c)}{P(X_u \mid \lambda_u^a)}$$

where $N_u$ is the number of frames in the decoded segment.

To limit the dynamic range, the subword confidence measure is manipulated by a sigmoid function

$$U(u) = \frac{1}{1+\exp(-\alpha(\log LR(u) - \beta))}$$

where $\alpha$ defines the slope of the sigmoid function and $\beta$ is a shift.

37


• How to calculate the confidence of a recognized word (cont)?

Word-level confidence measures corresponding to the unweighted unit-level likelihood ratio scores, LR(), and the sigmoid-weighted unit-level likelihood ratio scores, U(), are compared. The following measures are defined for a word $w_n$ composed of $N_n$ subword units $u_{n,i}$, $i = 1,\dots,N_n$:

$$W_1(w_n) = \frac{1}{N_n}\sum_{i=1}^{N_n} LR(u_{n,i}) \quad \text{(arithmetic mean of the LR())}$$

$$W_2(w_n) = \exp\Big(\frac{1}{N_n}\sum_{i=1}^{N_n} \log LR(u_{n,i})\Big) \quad \text{(geometric mean of the LR())}$$

$$W_3(w_n) = \frac{1}{N_n}\sum_{i=1}^{N_n} U(u_{n,i}) \quad \text{(arithmetic mean of the U())}$$

$$W_4(w_n) = \exp\Big(\frac{1}{N_n}\sum_{i=1}^{N_n} \log U(u_{n,i})\Big) \quad \text{(geometric mean of the U())}$$
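The unit-level and word-level scores can be sketched as below. To keep the geometric means well defined, the sketch assumes that $\log LR(u)$ is the frame-normalized log likelihood ratio, so that $LR(u)$ itself is positive; this reading, the function names, the default slope and shift, and the log-likelihood values are all assumptions.

```python
import math


def unit_log_lr(log_p_target, log_p_anti, n_frames):
    """Frame-normalized log likelihood ratio of a decoded subword unit."""
    return (log_p_target - log_p_anti) / n_frames


def sigmoid_score(log_lr, slope=1.0, shift=0.0):
    """Sigmoid-weighted unit score U(u), bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (log_lr - shift)))


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Requires strictly positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical units: (log P under target model, log P under anti-model, frames).
units = [(-50.0, -50.0, 10), (-40.0, -40.0 - 10 * math.log(4.0), 10)]
log_lrs = [unit_log_lr(*u) for u in units]
lrs = [math.exp(x) for x in log_lrs]        # LR(u) under the stated assumption
us = [sigmoid_score(x) for x in log_lrs]    # U(u)

w1 = arithmetic_mean(lrs)   # arithmetic mean of the LR()
w2 = geometric_mean(lrs)    # geometric mean of the LR()
w3 = arithmetic_mean(us)    # arithmetic mean of the U()
w4 = geometric_mean(us)     # geometric mean of the U()
```

The geometric means are pulled toward the weakest unit, so a single poorly matched subword lowers the word score more than it does under the arithmetic means.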

38


• Discriminative training [10]

– The goal of the training procedure is to increase the average value of $LR(X,\lambda^c,\lambda^a)$ for correct hypotheses and decrease the average value of $LR(X,\lambda^c,\lambda^a)$ for false acceptances.

A frame-based distance is defined for each state transition $q_{t-1} = i$, $q_t = j$ in the state sequence obtained by the decoder:

$$r(x_t) = \log\big(a_{ij}^c\,b_j^c(x_t)\big) - \log\big(a_{ij}^a\,b_j^a(x_t)\big)$$

The segment-based distance is obtained by averaging the frame-based distances as

$$R_u(X_u) = \frac{1}{t_f^u - t_i^u + 1}\sum_{t=t_i^u}^{t_f^u} r(x_t)$$

where $t_i^u$ and $t_f^u$ are the initial and final frame of the speech segment decoded as unit $u$, over the segment $X_u = \{x_{t_i^u},\dots,x_{t_f^u}\}$.

39


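• Discriminative training (cont)

Define the cost function $F(X_u,\lambda_u^c,\lambda_u^a)$ for unit $u$ using a sigmoid function

$$F(X_u,\lambda_u^c,\lambda_u^a) = \frac{1}{1+\exp\big(\gamma\,\delta(u)\,R_u(X_u)\big)}$$

where the indicator function $\delta(u)$ is defined as

$$\delta(u) = \begin{cases} 1, & \text{correct} \\ -1, & \text{imposter} \end{cases}$$

A gradient update is performed on the expected cost $E\{F(X_u,\Lambda_u)\}$:

$$\Lambda_u^{n+1} = \Lambda_u^{n} - \epsilon\,\nabla E\{F(X_u,\Lambda_u)\}$$

where $\Lambda_u = \{\lambda_u^c,\lambda_u^a\}$ and

$$E\{F(X_u,\Lambda_u)\} = \frac{1}{N_u}\sum_{i=1}^{N_u} F(X_{u,i},\Lambda_u)$$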

40


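Why does discriminative training work?

$$F(X_u,\lambda_u^c,\lambda_u^a) = \frac{1}{1+\exp\big(\gamma\,\delta(u)\,R_u(X_u)\big)}$$

$F(\cdot)$ decreases if $u$ is correct and $R_u(\cdot)$ increases, or if $u$ is an imposter and $R_u(\cdot)$ decreases.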

41


• Experiments [10] • This task, referred to as the “movie locator”,

42

Incorporation of high-level information for CM

43

Incorporation of high-level information for CM (1/4)

• LSA

– The key property of LSA is that words whose vectors are close to each other are semantically similar words.

– These similarities can be used to provide an estimate of the likelihood of the words co-occurring within the same utterance.

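[Figure: the $m \times n$ word-document matrix $A$ (rows $w_1, w_2, \dots, w_m$; columns $d_1, d_2, \dots, d_n$) is factored by singular value decomposition into $U$ (rows $w_1, w_2, \dots, w_m$) and $V^T$ (columns $d_1, d_2, \dots, d_n$).]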

44


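• LSA (cont)

– The entry of matrix $A$:

$$a_{ij} = (1 - E_i)\,\frac{\log(1 + c_{ij})}{n_j}$$

$$E_i = -\frac{1}{\log_2 N}\sum_{j=1}^{N} f_{ij}\log_2 f_{ij}, \qquad f_{ij} = \frac{c_{ij}}{t_i}$$

$c_{ij}$: the count of term $i$ in document $j$

$n_j$: the size of document $j$

$t_i$: the count of term $i$ in all documents

– The confidence of a recognized word $w_i$:

$$\frac{1}{N}\sum_{j=1}^{N} \text{Cosine}\big(U(w_i), U(w_j)\big)$$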
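The word-level LSA confidence is an average cosine similarity between the LSA vector of a word and the vectors of the words in the utterance. The sketch below skips the SVD and assumes the word vectors are already given; the two-dimensional vectors and function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def lsa_confidence(words, vectors):
    """Average cosine between each word vector and all word vectors of the utterance."""
    n = len(words)
    return [sum(cosine(vectors[wi], vectors[wj]) for wj in words) / n for wi in words]


# Hypothetical LSA vectors: two semantically close words and one unrelated word.
vectors = {"候選人": [1.0, 0.0], "審查": [1.0, 0.0], "魚籃": [0.0, 1.0]}
scores = lsa_confidence(["候選人", "審查", "魚籃"], vectors)
```

As in the formula on the slide, the sum includes the word itself, so every word contributes a cosine of 1 to its own score.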

45


• Inter-word mutual information:

Assume $N(w_1,w_2)$ denotes the co-occurrence times of word $w_1$ and word $w_2$ in the training documents; the joint probability $P(w_1,w_2)$ is:

$$P(w_1,w_2) = \frac{N(w_1,w_2)}{\sum_{w_1,w_2} N(w_1,w_2)}$$

Mutual information between any two words $w_1$ and $w_2$ can be calculated as

$$MI(w_1,w_2) = \log\Big(\frac{p(w_1,w_2)}{p(w_1)\,p(w_2)}\Big)$$

The CM of each recognized word is calculated as the average mutual information of this word with the remaining recognized words.
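A small sketch of the mutual information based CM. It assumes that co-occurrence means appearing in the same training document, that word probabilities are marginals of the joint table, and that a small floor replaces zero probabilities; these choices, the function names and the toy documents are assumptions. An utterance needs at least two words.

```python
import math
from collections import Counter
from itertools import combinations


def mi_model(documents):
    """Build symmetric co-occurrence counts N(w1, w2) and their marginals."""
    pair = Counter()
    for doc in documents:
        for w1, w2 in combinations(sorted(set(doc)), 2):
            pair[(w1, w2)] += 1
            pair[(w2, w1)] += 1   # keep the table symmetric
    total = sum(pair.values())
    marginal = Counter()
    for (w1, _), n in pair.items():
        marginal[w1] += n
    return pair, marginal, total


def mutual_information(w1, w2, model, floor=1e-6):
    pair, marginal, total = model
    # The floor avoids log(0) for unseen pairs or words (assumption).
    p12 = max(pair[(w1, w2)] / total, floor)
    p1 = max(marginal[w1] / total, floor)
    p2 = max(marginal[w2] / total, floor)
    return math.log(p12 / (p1 * p2))


def mi_confidence(words, model):
    """Average MI of each word with the remaining recognized words."""
    scores = []
    for i, w in enumerate(words):
        others = [v for j, v in enumerate(words) if j != i]
        scores.append(sum(mutual_information(w, v, model) for v in others) / len(others))
    return scores


docs = [["三名", "候選人", "通過"], ["候選人", "通過", "審查"], ["魚籃", "市場"]]
model = mi_model(docs)
scores = mi_confidence(["候選人", "通過", "魚籃"], model)
```

The out-of-context word 魚籃 never co-occurs with the other two words in the toy documents, so it receives the lowest score.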

46


• Experiments [14]

CM             Switchboard   Mandarin dictation
LSA            44.7          38.5
MI             41.0          33.7
Cmed           24.4          17.5
N-best count   28.3          21.1
MI+Cmed        23.9          16.2

47

The application of CM to improve speech recognition

48

The application of CM to improve speech recognition (1/10)

• Statistical decision theory aims at minimizing the expected cost of making errors

$$[w_n,s_n,e_n]_1^{N^*} = \arg\max_{[w_n,s_n,e_n]_1^N} P([w_n,s_n,e_n]_1^N \mid X)$$

[Figure: word graph with competing arcs such as 靜音 (silence), 結果, 建國, the homophones 有 / 由 / 又, 三名, 候選人, 沒有, 通過, 候選人通過 and 三名候選人.]

49


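• Method 1 [16]:

$$p([w_n,s_n,e_n]_1^N \mid X) = p([w_n,s_n,e_n]_1^N \mid x_1^T) = \prod_{n=1}^{N} p([w_n,s_n,e_n] \mid [w_n,s_n,e_n]_1^{n-1}, x_1^T) \approx \prod_{n=1}^{N} p([w_n,s_n,e_n] \mid x_1^T)$$

$$[w_n,s_n,e_n]_1^{N^*} = \arg\max_{[w_n,s_n,e_n]_1^N} P([w_n,s_n,e_n]_1^N \mid x_1^T)$$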

50


• Method 2 [18]:

$$[w_n,s_n,e_n]_1^{N^*} = \arg\min_{[w_n,s_n,e_n]_1^N} WER([w_n,s_n,e_n]_1^N \mid X)$$

$$WER([w_n,s_n,e_n]_1^N \mid X) = 1.0 - \frac{1}{N}\sum_{n=1}^{N} P(w_n = \text{correct})\,P(w_n \mid X)$$

51


• Method 3 (Time Frame Error decoding) [17]:

– In minimum Bayes risk decoding

$$[w_n,s_n,e_n]_1^{N^*} = \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{[v_m,s_m,e_m]_1^M} C([w_n,s_n,e_n]_1^N,[v_m,s_m,e_m]_1^M)\,P([v_m,s_m,e_m]_1^M \mid X)$$

– if

$$C([w_n,s_n,e_n]_1^N,[v_m,s_m,e_m]_1^M) = \begin{cases} 0, & [w_n,s_n,e_n]_1^N = [v_m,s_m,e_m]_1^M \\ 1, & [w_n,s_n,e_n]_1^N \neq [v_m,s_m,e_m]_1^M \end{cases}$$

52


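• Method 3 (cont)

$$[w_n,s_n,e_n]_1^{N^*} = \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{[v_m,s_m,e_m]_1^M} C([w_n,s_n,e_n]_1^N,[v_m,s_m,e_m]_1^M)\,P([v_m,s_m,e_m]_1^M \mid X)$$

$$= \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{[v_m,s_m,e_m]_1^M \neq [w_n,s_n,e_n]_1^N} 1 \cdot P([v_m,s_m,e_m]_1^M \mid X)$$

$$= \arg\min_{[w_n,s_n,e_n]_1^N}\ \big(1 - P([w_n,s_n,e_n]_1^N \mid X)\big)$$

$$= \arg\max_{[w_n,s_n,e_n]_1^N}\ P([w_n,s_n,e_n]_1^N \mid X)$$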

53


• Method 3 (cont)

– We are now faced with a conceptual mismatch between the decision rule and the evaluation criterion for speech recognizers, the word error rate.

– The easiest way to overcome this mismatch is to use the same cost function as in evaluation: the Levenshtein distance.

– In (Stolcke et al., 1997), the pairwise alignment is restricted to the N-best list.

– Let us assume that substitutions were the only type of error.

• A dynamic programming alignment would thus not be necessary.

54


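• Method 3 (cont)

$$C([w_n,s_n,e_n]_1^N,[v_m,s_m,e_m]_1^M) = \sum_{n=1}^{N}\frac{1}{1+\alpha(e_n - s_n)}\sum_{t=s_n}^{e_n}\big(1 - \delta(w_n,v_t)\big)$$

$v_t$: the word identity of the word hypothesis in word sequence $[v_m,s_m,e_m]_1^M$ which intersects time frame $t$

If $\alpha = 1$, the maximum cost of every word is 1. For example, the maximum cost of the two-word sequence 今天天氣 is 2 and the maximum cost of the three-word sequence 今天天氣很好 is 3.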

55


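• Method 3 (cont)

$$[w_n,s_n,e_n]_1^{N^*} = \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{[v_m,s_m,e_m]_1^M} C([w_n,s_n,e_n]_1^N,[v_m,s_m,e_m]_1^M)\,P([v_m,s_m,e_m]_1^M \mid X)$$

$$= \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{[v_m,s_m,e_m]_1^M}\ \sum_{n=1}^{N}\frac{1}{1+\alpha(e_n - s_n)}\sum_{t=s_n}^{e_n}\big(1 - \delta(w_n,v_t)\big)\,P([v_m,s_m,e_m]_1^M \mid X)$$

$$= \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{n=1}^{N}\frac{1}{1+\alpha(e_n - s_n)}\sum_{t=s_n}^{e_n}\Big[\sum_{[v_m,s_m,e_m]_1^M} P([v_m,s_m,e_m]_1^M \mid x_1^T) - \sum_{[v_m,s_m,e_m]_1^M}\delta(w_n,v_t)\,P([v_m,s_m,e_m]_1^M \mid X)\Big]$$

$$= \arg\min_{[w_n,s_n,e_n]_1^N}\ \sum_{n=1}^{N}\frac{1}{1+\alpha(e_n - s_n)}\sum_{t=s_n}^{e_n}\Big[1 - \sum_{[v_m,s_m,e_m]_1^M}\delta(w_n,v_t)\,P([v_m,s_m,e_m]_1^M \mid X)\Big]$$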

56


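• Method 3 (cont)

$$\sum_{[v_m,s_m,e_m]_1^M}\delta(w_n,v_t)\,P([v_m,s_m,e_m]_1^M \mid X) = \sum_{[v_m,s_m,e_m]_1^M}\ \sum_{m:\ s_m \le t \le e_m}\delta(w_n,v_m)\,P([v_m,s_m,e_m]_1^M \mid X)$$

$$= \sum_{[v_m,s_m,e_m]_1^M:\ \exists m,\ v_m = w_n,\ s_m \le t \le e_m} P([v_m,s_m,e_m]_1^M \mid X) = p(w_n \mid t, X)$$

so the quantity to be minimized becomes

$$\sum_{n=1}^{N}\frac{1}{1+\alpha(e_n - s_n)}\sum_{t=s_n}^{e_n}\big[1 - p(w_n \mid t, X)\big]$$

Each term of the outer sum can be interpreted as the normalized probability of the word $w_n$ being incorrect.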
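The time frame error criterion can be sketched directly from this expression. The sketch assumes that the frame-wise word posterior $p(w \mid t, X)$ is obtained by summing the posteriors of all arcs labelled $w$ that cover frame $t$; the function names, the arc layout `(word, start, end, posterior)` and the numbers are hypothetical.

```python
def frame_word_posterior(word, t, arcs):
    """p(w | t, X): summed posterior of arcs labelled `word` that cover frame t."""
    return sum(p for (w, s, e, p) in arcs if w == word and s <= t <= e)


def time_frame_cost(hypothesis, arcs, alpha=1.0):
    """Expected time frame error of a hypothesis given as (word, start, end) triples."""
    cost = 0.0
    for word, s, e in hypothesis:
        frame_errors = sum(
            1.0 - frame_word_posterior(word, t, arcs) for t in range(s, e + 1)
        )
        cost += frame_errors / (1.0 + alpha * (e - s))   # per-word normalization
    return cost


# Hypothetical arc posteriors: one certain word, then two competitors.
arcs = [("今天", 0, 4, 1.0), ("很好", 5, 9, 0.6), ("不佳", 5, 9, 0.4)]
cost_good = time_frame_cost([("今天", 0, 4), ("很好", 5, 9)], arcs)
cost_bad = time_frame_cost([("今天", 0, 4), ("不佳", 5, 9)], arcs)
```

The decoder would pick the hypothesis with the smaller cost, here the one ending in 很好 (0.4 against 0.6).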

57


• Experiments (Method 3)

corpus           baseline   Time frame error
ARISE            15.8       15.0
Verbmobil        33.6       32.5
NAB 20k          13.2       12.9
NAB 64k          11.1       10.8
Broadcast news   33.3       32.3

58

Summary

• Almost all CMs rely almost entirely on a single information source: how much the underlying decision can overtake other possible competitors.

• We believe it is critical to improve the performance of CMs by

– Taking the segmentation issue into account.

– Deciding a dynamic threshold for different words.

59

Reference (1/4)

• Main reference

– [1] H. Jiang, “Confidence Measures for Speech Recognition: A Survey”, Speech Communication, 2005.

• Feature based confidence measure

– [2] S. Cox and R. Rose, “Confidence Measures for the Switchboard Database”, ICASSP 1996.

– [3] T. Schaaf and T. Kemp, “Confidence Measures for Spontaneous Speech Recognition”, ICASSP 1997.

– [4] A. Sanchis, A. Juan and E. Vidal, “New Features Based on Multiple Word Graphs for Utterance Verification”, ICSLP 2003.

– [5] R. Zhang and A. I. Rudnicky, “Word Level Confidence Annotation Using Combinations of Features”, EuroSpeech 2001.

• Posterior based confidence measure

– [6] F. Wessel, R. Schluter, K. Macherey and H. Ney, “Confidence Measures for Large Vocabulary Continuous Speech Recognition”, IEEE SAP 2001.

– [7] F. K. Soong and W. K. Lo, “Generalized Posterior Probability for Minimum Error Verification of Recognized Sentences”, ICASSP 2005.

60

Reference (2/4)

• Posterior based confidence measure

– [8] J. Razik, O. Mella, D. Fohr and J.-P. Haton, “Local Word Confidence Measure Using Word Graph and N-Best List”.

– [9] T. Fabian, R. Lieb, G. Ruske and M. Thomae, “Impact of Word Graph Density on the Quality of Posterior Probability Based Confidence Measures”.

• Explicit model based confidence measure

– [10] E. Lleida and R. C. Rose, “Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures”, IEEE SAP 2000.

– [11] M. G. Rahim and C.-H. Lee, “String-based Minimum Verification Error (SB-MVE) Training for Speech Recognition”, Computer Speech and Language, 1997.

– [12] H. Jiang, F. K. Soong and C.-H. Lee, “A Dynamic In-Search Data Selection Method With Its Applications to Acoustic Modeling and Utterance Verification”, IEEE SAP 2005.

61

Reference (3/4)

• Incorporation of high-level information for CM

– [13] R. Sarikaya, Y. Gao, M. Picheny and H. Erdogan, “Semantic Confidence Measurement for Spoken Dialog Systems”, IEEE SAP 2005.

– [14] G. Guo, C. Huang, H. Jiang and R.-H. Wang, “A Comparative Study on Various Confidence Measures in Large Vocabulary Speech Recognition”, ISCSLP 2004.

• Some applications of CM

– [15] M. Afify, F. Liu, H. Jiang and O. Siohan, “A New Verification-based Fast-Match for Large Vocabulary Continuous Speech Recognition”, IEEE SAP 2005.

– [16] F. Wessel, R. Schluter and H. Ney, “Using Posterior Word Probabilities for Improved Speech Recognition”, ICASSP 2000.

– [17] F. Wessel, R. Schluter and H. Ney, “Explicit Word Error Minimization Using Word Hypothesis Posterior Probabilities”, ICASSP 2001.

62

Reference (4/4)

• Some applications of CM

– [18] A. Kobayashi, K. Onoe, S. Sato and T. Imai, “Word Error Minimization Using an Integrate Confidence Measure”, INTERSPEECH 2005.

– [19] Y. Qian, T. Lee and F. K. Soong, “Tone Information as a Confidence Measure for Improving Cantonese LVCSR”, ICSLP 2004.
