
Page 1: HUMAN AND SYSTEMS ENGINEERING:

Sridhar Raghavan (1) and Joseph Picone (2)

(1) Graduate research assistant, Human and Systems Engineering

(2) Professor, Electrical and Computer Engineering

URL: www.isip.msstate.edu/publications/seminars/msstate/2005/confidence/

HUMAN AND SYSTEMS ENGINEERING: Confidence Measures Based on Word Posteriors and Word Graphs

Page 2: HUMAN AND SYSTEMS ENGINEERING:


Abstract

Confidence measures using word posteriors:

• There is a strong need to determine the confidence of a word hypothesis in LVCSR systems, because in conventional Viterbi decoding the objective function minimizes the sentence error rate and not the word error rate.

• A good estimate of the confidence is the word posterior probability.

• The word posteriors can be computed from a word graph.

• A forward-backward algorithm can be used to compute the word posteriors.

Page 3: HUMAN AND SYSTEMS ENGINEERING:


Foundation

The equation for computing the posterior probability of a word is as follows [F. Wessel]:

$$p([w; t_b, t_e] \mid x_1^T) = \sum_{w_b} \sum_{w_e} p(w_b, [w; t_b, t_e], w_e \mid x_1^T)$$

where:

w : the single word of interest

t_b, t_e : start and end time of the word of interest

x_1^T : acoustic vector from time 1 to T

w_b : denotes all word hypothesis sequences preceding w

w_e : denotes all word hypothesis sequences succeeding w

"The posterior probability of a word hypothesis is the sum of the posterior probabilities of all lattice paths of which the word is a part" (Lidia Mangu et al., "Finding consensus in speech recognition: word error minimization and other applications of confusion networks").

[Figure: a word w spanning the time interval t_b to t_e, with preceding word hypotheses w_b1, w_b2, …, w_bn and succeeding word hypotheses w_e1, w_e2, …, w_em.]

Page 4: HUMAN AND SYSTEMS ENGINEERING:


Foundation: continued…

We cannot compute the posterior probability directly, so we decompose it into a likelihood and priors using Bayes' rule.

$$p(w_a, [w; t_b, t_e], w_e \mid x_1^T) = \frac{p(x_1^T \mid w_a, w, w_e)\; p(w_a, w, w_e)}{p(x_1^T)}$$

where:

p(x_1^T | w_a, w, w_e) : acoustic model probability

p(w_a, w, w_e) : language model probability

and

$$p(x_1^T) = \sum_{w_a} \sum_{w_e} p(x_1^T \mid w_a, w, w_e)\; p(w_a, w, w_e)$$

The numerator is computed using the forward-backward algorithm. The denominator term is simply a by-product of the forward-backward algorithm.

[Figure: a lattice node N with six incoming links and two outgoing links.]

There are six different ways to reach node N and two different ways to leave it, so we need both the forward probability and the backward probability to determine the probability of passing through node N; this is where the forward-backward algorithm comes into the picture.

Page 5: HUMAN AND SYSTEMS ENGINEERING:


Scaling

Scaling is used to obtain a flatter posterior distribution, so that the distribution is not dominated by the best path [G. Evermann]. Experimentally it has been determined that 1/(language model scale factor) is a good exponent with which to scale down the acoustic model score.

The acoustic model score is scaled down using the language model scale factor as follows:

$$\big(p(x_1^T \mid w_a, w, w_e)\big)^{1/\lambda}$$

where:

p(x_1^T | w_a, w, w_e) : acoustic model score

p(w_a, w, w_e) : language model score

λ : language model scale factor
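A minimal sketch of this scaling in the log domain (assuming the lattice stores log acoustic and log language model scores per link; the function and parameter names are illustrative, not from the slides):

  # Scale down the acoustic score by 1/(language model scale factor).
  # In the log domain this becomes a simple division of the log acoustic score.
  def scaled_link_score(log_acoustic, log_lm, lm_scale=12.0):
      # lm_scale = 12.0 is only a placeholder value for the LM scale factor
      return log_acoustic / lm_scale + log_lm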

Page 6: HUMAN AND SYSTEMS ENGINEERING:


How to combine word-posteriors?

The word posteriors corresponding to the same word can be combined to obtain a better confidence estimate. There are several ways to do this; some of the methods are as follows:

1. Sum up the posteriors of similar words that fall within the same time frame, or choose the maximum posterior value among the similar words in the same time frame [F. Wessel, R. Schlüter, K. Macherey, H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition"]; a sketch of this combination appears at the end of this slide.

2. Build a confusion network, where the entire lattice is mapped into a single linear graph, i.e. the links pass through all the nodes in the same order.

[Figure: a full lattice network over paths such as "sil this is a test sentence sil", with competing words (the, guest, quest, sense), and the confusion network into which the full lattice network is mapped, a single linear graph.]

Note: The redundant silence edges can be fused together in the full lattice network before computing the forward-backward probabilities. This saves a lot of computation if there are many silence edges in the lattice.
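A minimal sketch of the first combination scheme (summing, or taking the maximum of, the posteriors of identical words that fall in the same time frame). The link layout and the bucketing by mid-frame are assumptions made for illustration:

  from collections import defaultdict

  def combine_posteriors(links, use_max=False):
      # links: iterable of (word, start_frame, end_frame, posterior) tuples
      combined = defaultdict(float)
      for word, start, end, post in links:
          key = (word, (start + end) // 2)   # crude "same time frame" bucket
          if use_max:
              combined[key] = max(combined[key], post)   # keep the maximum posterior
          else:
              combined[key] += post                      # sum the posteriors
      return dict(combined)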

Page 7: HUMAN AND SYSTEMS ENGINEERING:


Some challenges during posterior rescoring!

Word posteriors are not a very good estimate of confidence when the WER on the data is poor. This is described in the paper by G. Evermann and P.C. Woodland [Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities]. The reason is that the posteriors are overestimated, because the words in the lattice are not the full set of possible words; when the WER is poor the lattice will contain many wrong hypotheses. In such a case the depth of the lattice becomes a critical factor in determining the effectiveness of the confidence measure.

The paper mentioned above cites two techniques to solve this problem.

1. A decision-tree-based technique

2. A neural-network-based technique

Different confidence measure techniques are judged on a metric known as normalized cross entropy (NCE).
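For reference, a hedged sketch of how NCE is usually computed (following the standard NIST definition; the formula is not spelled out in these slides):

  import math

  def normalized_cross_entropy(confidences, correct):
      # confidences: per-word confidence scores in (0, 1)
      # correct: matching list of booleans (True if the word was recognized correctly)
      n = len(confidences)
      n_c = sum(correct)
      p_c = n_c / n
      # baseline entropy using only the overall probability of a word being correct
      h_max = -(n_c * math.log2(p_c) + (n - n_c) * math.log2(1.0 - p_c))
      # entropy of the actual confidence estimates
      h = -sum(math.log2(c) if ok else math.log2(1.0 - c)
               for c, ok in zip(confidences, correct))
      return (h_max - h) / h_max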

Page 8: HUMAN AND SYSTEMS ENGINEERING:


How can we compute the word posterior from a word graph?

The word posterior probability is computed by considering the word's acoustic score, language model score, and its position and history in a particular path through the word graph.

An example of a word graph is given below; note that the nodes hold the start/stop times and the links hold the word labels, language model scores and acoustic scores.

[Figure: an example word graph; the nodes hold start/stop times, and each link carries a word (sil, this, is, a, the, test, guest, quest, sentence, sense) and a likelihood such as 1/6, 2/6, 3/6, 4/6 or 5/6.]

Page 9: HUMAN AND SYSTEMS ENGINEERING:


Example

Let us consider the example as shown below:

[Figure: the example word graph with the likelihoods (1/6 to 5/6) shown on the links for sil, this, is, a, the, test, guest, quest, sentence and sense.]

The values on the links are the likelihoods. Some nodes are outlined in red to signify that they occur at the same time.

Page 10: HUMAN AND SYSTEMS ENGINEERING:


Forward-backward algorithm

We will use a forward-backward type algorithm for determining the link probabilities.

The general equations used to compute the alphas and betas for an HMM are as follows [from any speech textbook]:

Computing alphas:

Step 1: Initialization. In a conventional HMM forward-backward algorithm we would perform the following:

$$\alpha_1(i) = \pi_i \, b_i(X_1), \quad 1 \le i \le N$$

where:

π_i : initial probability of state i

b_i(X_1) : emission probability of the observed data X_1 given that we are in state i

We need to use a slightly modified version of the above equation for processing a word graph: the emission probability becomes the acoustic score. In our implementation we simply initialize the first alpha value with a constant; since the implementation works in the log domain, we assign the first alpha value as 0.

Page 11: HUMAN AND SYSTEMS ENGINEERING:


Forward-backward algorithm continue…

The α for the first node is 1:

$$\alpha(1) = 1$$

Step 2: Induction

$$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(X_t), \quad 2 \le t \le T, \; 1 \le j \le N$$

where:

a_ij : transition probability

b_j(X_t) : emission probability of the observation X_t

The alpha values computed in the previous step are used to compute the alphas for the succeeding nodes.

Note: Unlike in HMMs, where we move from left to right at fixed intervals of time, here we move from one node to the next based on node indexes, which are time aligned.

Page 12: HUMAN AND SYSTEMS ENGINEERING:


Forward-backward algorithm continue…

Let us see the computation of the alphas from node 2 onwards; the alpha for node 1 was initialized to 1 in the previous step.

Node 2: α(2) = 1 × (3/6) × 0.01 = 0.005

Node 3: α(3) = (1 × (3/6) × 0.01) + (0.005 × (3/6) × 0.01) = 0.005025

Node 4: α(4) = 0.005025 × (2/6) × 0.01 = 1.675E-05

The alpha calculation continues in this manner for all the remaining nodes.

The forward-backward calculation on word graphs is similar to the calculation used on HMMs, but in word graphs the transition matrix is populated by the language model probabilities and the emission probability corresponds to the acoustic score.

[Figure: nodes 1 through 4 of the word graph, with links Sil (3/6), Sil (3/6), this (3/6) and is (2/6), annotated with α = 1, α = 0.005, α = 0.005025 and α = 1.675E-05.]
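A minimal sketch of this forward pass over the first few nodes of the example, reproducing the alpha values above (the 0.01 word probability and the link likelihoods are taken from the example; the data structures are illustrative):

  # each link: (start_node, end_node, likelihood); word probability = 0.01
  links = [(1, 2, 3/6), (1, 3, 3/6), (2, 3, 3/6), (3, 4, 2/6)]
  word_prob = 0.01

  alpha = {1: 1.0}                     # alpha of the first node is initialized to 1
  for node in (2, 3, 4):
      alpha[node] = sum(alpha[src] * lik * word_prob
                        for src, dst, lik in links if dst == node)

  print(alpha)   # {1: 1.0, 2: 0.005, 3: 0.005025, 4: ~1.675e-05}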

Page 13: HUMAN AND SYSTEMS ENGINEERING:


Forward-backward algorithm continue…

Once we compute the alphas using the forward algorithm we begin the beta computation using the backward algorithm.

The backward algorithm is similar to the forward algorithm, but we start from the last node and proceed from right to left.

Step 1: Initialization

$$\beta_T(i) = 1/N, \quad 1 \le i \le N$$

The N at the final instant is usually 1, and hence the initial value of β at the final node is 1. This is the same initial value used in both of our ASR systems during training.

Step 2: Induction

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(X_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \; 1 \le i \le N$$

where:

a_ij : language model score (transition probability)

b_j(X_{t+1}) : acoustic score (emission probability)

β_{t+1}(j) : beta values computed in the preceding backward step, i.e. of the nodes that follow the current node in time

Page 14: HUMAN AND SYSTEMS ENGINEERING:


Forward-backward algorithm continue…

Let us see the computation of the beta values from node 14 backwards.

Node 14: β(14) = (1/6) × 1 × 0.01 = 0.001667

Node 13: β(13) = (5/6) × 1 × 0.01 = 0.00833

Node 12: β(12) = (4/6) × 0.01 × 0.00833 = 5.555E-05

[Figure: nodes 11 through 15 of the word graph, with links sentence (4/6), sense (1/6), sentence (1/6) and Sil (5/6, 1/6), annotated with β = 1.66E-05, β = 5.55E-05, β = 0.00833, β = 0.001667 and β = 1.]

Page 15: HUMAN AND SYSTEMS ENGINEERING:


Forward-backward algorithm continue…

Node 11: β(11) = ((1/6) × 0.01 × 0.001667) + ((1/6) × 0.01 × 0.00833) = 1.666E-05

In a similar manner we obtain the beta values for all the nodes back to node 1.

The alpha for the last node should be the same as the beta for the first node.

We can compute the probabilities on the links (between two nodes) as follows:

Let us call this link probability Γ.

Therefore Γ(t-1, t) is computed as the product α(t-1) × β(t) × a_ij. These values give the un-normalized posterior probabilities of the word on the link, considering all possible paths through the link.
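A minimal sketch of this link probability computation, normalizing by the sum over all paths (which equals the alpha of the final node); the data layout and names are illustrative:

  def link_posteriors(links, alpha, beta, final_node, word_prob=0.01):
      # links: iterable of (start_node, end_node, word, likelihood)
      # alpha, beta: dicts of forward/backward scores per node
      total = alpha[final_node]                            # sum over all paths in the graph
      gammas = {}
      for src, dst, word, lik in links:
          gamma = alpha[src] * lik * word_prob * beta[dst] # un-normalized Γ for this link
          gammas[(src, dst, word)] = gamma / total         # normalized link posterior
      return gammas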

Page 16: HUMAN AND SYSTEMS ENGINEERING:


Word graph showing the computed alphas and betas

[Figure: the full 15-node example word graph with every node annotated with its α and β value, from node 1 (α = 1, β = 2.88E-16) through node 15 (α = 2.88E-16, β = 1).]

The assumption here is that the probability of occurrence of any word is 0.01, i.e. we have 100 words in a loop grammar.

This word graph shows every node with its corresponding alpha and beta value.

Page 17: HUMAN AND SYSTEMS ENGINEERING:


Link probabilities calculated from alphas and betas

The following word graph shows the links with their corresponding link posterior probabilities normalized by the sum of all paths.

[Figure: the example word graph with a normalized link posterior Γ on each link, e.g. Γ = 0.996, 0.993, 0.9739, 0.9566 and 0.9950 on the best-path links and values between Γ = 4.98E-03 and Γ = 0.0373 on the competing links.]

By choosing the links with the maximum posterior probability we can be certain that we have included the most probable words in the final sequence.

Page 18: HUMAN AND SYSTEMS ENGINEERING:


Some Alternate approaches…

The paper by F. Wessel (Confidence Measures for Large Vocabulary Continuous Speech Recognition) describes alternate techniques to compute the posterior. The drawback of the approach described above is that the lattice has to be very deep to accommodate sufficient links at the same time instant. To overcome this problem we can use a soft time margin instead of a hard margin, achieved by considering words that overlap to a certain degree. But by doing this, the author states, the normalization no longer works, since the probabilities are not summed within the same time frame and hence do not sum to one. The author therefore suggests an approach in which the posteriors are computed frame by frame so that normalization of the posteriors remains possible. In the end it was found that the frame-by-frame normalization did not perform significantly better than the overlapping-time-marks approach.

The normalization of the posteriors is done by dividing each value by the sum of the posterior probabilities over all paths in the lattice. This example suffers from the fact that the lattice is not deep enough; normalization can therefore push the values of some links very close to 1. This phenomenon is explained in the paper by G. Evermann and P.C. Woodland.

Page 19: HUMAN AND SYSTEMS ENGINEERING:


Logarithmic computations:

Instead of using the probabilities directly as described above, we can use the logarithms of those probabilities so that multiplications become additions. We can use the acoustic and language model scores from the word graphs directly.

We will use the following log trick to add two values in the log domain: log(x + y) = log(x) + log(1 + y/x). The logarithmic alphas and betas computed are shown below:

[Figure: the example word graph in the log domain, with log link scores such as -0.6931, -1.0986, -0.4054 and -1.7917, and every node annotated with its log α and log β (e.g. node 0: α = 0, β = -3.1438; final node: α = -3.1442, β = 0).]
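A minimal sketch of the log-add trick used above (a standard way of summing probabilities without leaving the log domain; the function name is illustrative):

  import math

  def log_add(log_x, log_y):
      # computes log(x + y) given log(x) and log(y)
      if log_y > log_x:              # keep the larger term first for numerical stability
          log_x, log_y = log_y, log_x
      return log_x + math.log1p(math.exp(log_y - log_x))

In the log-domain forward and backward passes, the products in the recursions become additions and every sum of incoming path scores is accumulated with log_add.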

Page 20: HUMAN AND SYSTEMS ENGINEERING:


Logarithmic posterior probabilities

[Figure: the example word graph with log posterior probabilities on the links, e.g. p = -0.0086, -0.0273, -0.0459 and -0.0075 on the best-path links and values between p = -3.2884 and p = -4.8978 on the competing links.]

From the lattice we can obtain the best word sequence by picking the words with the highest posterior probability as we traverse from node to node.
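A minimal sketch of this greedy traversal, picking the outgoing link with the highest posterior at every node (the adjacency structure is assumed for illustration):

  def best_word_sequence(out_links, start_node, final_node):
      # out_links: dict mapping a node to a list of (posterior, next_node, word) tuples
      words, node = [], start_node
      while node != final_node:
          post, node, word = max(out_links[node])   # follow the highest-posterior link
          words.append(word)
      return words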

Page 21: HUMAN AND SYSTEMS ENGINEERING:


References:

• F. Wessel, R. Schlüter, K. Macherey, H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288-298, March 2001.

• F. Wessel, K. Macherey, R. Schlüter, "Using Word Probabilities as Confidence Measures," Proc. ICASSP '97.

• G. Evermann and P.C. Woodland, "Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities," Proc. ICASSP 2000, pp. 2366-2369, Istanbul.

• X. Huang, A. Acero, and H.W. Hon, Spoken Language Processing - A Guide to Theory, Algorithm, and System Development, Prentice Hall, ISBN: 0-13-022616-5, 2001.

• J. Deller, et al., Discrete-Time Processing of Speech Signals, MacMillan Publishing Co., ISBN: 0-7803-5386-2, 2000.