TRANSCRIPT
Lattice Segmentation and Minimum Bayes Risk Discriminative Training for
Large Vocabulary Continuous Speech Recognition
Vlasios Doumpiotis, William Byrne
Johns Hopkins University
Speech Communication. Accepted, in revision.
Presented by Shih-Hung
2005/10/31
Reference
Discriminative Training for Segmental Minimum Bayes Risk Decoding (ICASSP 2003) (Alphadigits)
Lattice Segmentation and Minimum Bayes Risk Discriminative Training (Eurospeech 2003) (Alphadigits)
Pinched Lattice Minimum Bayes Risk Discriminative Training for Large Vocabulary Continuous Speech Recognition (ICSLP 2004)
Minimum Bayes Risk Estimation and Decoding in Large Vocabulary Continuous Speech Recognition (ATR Workshop 2004)
Outline
Introduction
Minimum Bayes risk discriminative training
– Update parameters via the Extended Baum-Welch algorithm
– Efficient computation of risk
– Risk-based pruning of the evidence space
Pinched lattice minimum Bayes risk discriminative training
Pinched lattice MMIE for whole-word acoustic models
One-worst pinched lattice MBRDT
Small vocabulary ASR performance and analysis (Alphadigits)
MBRDT for LVCSR results (Switchboard, MALACH-CZ)
Conclusion
Introduction
Discriminative acoustic modeling procedures, such as MMIE, are powerful techniques that can be used to improve the performance of ASR systems.
MMI is often motivated as an estimation procedure by observing that it increases the a posteriori probability of the correct transcription of the speech in the training set.
ML:
$$\theta^*_{\mathrm{ML}} = \arg\max_{\theta} \log P(O \mid W; \theta)$$
MMI (MP):
$$\theta^*_{\mathrm{MMI}} = \arg\max_{\theta} \log P(W \mid O; \theta) = \arg\max_{\theta} \log \frac{P(O \mid W; \theta)\, P(W)}{P(O; \theta)}$$
Introduction
Since the ultimate goal is to reduce the number of words in error, which we define as the loss, we prefer estimation procedures that reduce the loss directly rather than improve likelihood.
One such risk-based parameter estimation procedure was developed by Kaiser et al. (2000, 2002) to reduce the expected loss, or risk, over the training set.
MBR decoding:
$$W^* = \arg\min_{W' \in \mathcal{W}} \sum_{W'' \in \mathcal{W}} L(W'', W')\, \frac{P(O \mid W''; \theta)\, P(W'')}{\sum_{W} P(O \mid W; \theta)\, P(W)}$$
Risk-based estimation:
$$\theta^* = \arg\min_{\theta} R(W, \theta, O), \qquad R(W, \theta, O) = \sum_{W'} l(W, W')\, P(W' \mid O; \theta)$$
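For illustration, a minimal self-contained sketch (not from the paper) that computes this expected risk over an N-best list, using the word-level Levenshtein distance as the loss l(W, W') and posteriors obtained by normalizing per-hypothesis log scores over the list:

```python
# Minimal sketch: expected risk R(W, theta, O) over an N-best list.
import math

def levenshtein(ref, hyp):
    """Word-level edit distance between two token sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

def expected_risk(reference, nbest):
    """R = sum_W' l(W, W') P(W'|O), posteriors normalized over the list."""
    logps = [lp for _, lp in nbest]
    m = max(logps)
    z = sum(math.exp(lp - m) for lp in logps)        # log-sum-exp normalizer
    return sum(levenshtein(reference, hyp) * math.exp(lp - m) / z
               for hyp, lp in nbest)

nbest = [("A B C".split(), -10.2), ("A D C".split(), -11.0),
         ("A B".split(), -12.5)]
print(expected_risk("A B C".split(), nbest))
```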
Introduction
Their approach is a generalization of MMI in that both are derived via the Extended Baum-Welch algorithm, and MMI is a special case of risk minimization under the sentence-error loss function.
Under the sentence-error loss
$$l(W, W') = \begin{cases} 0, & W' = W \\ 1, & W' \neq W \end{cases}$$
risk minimization reduces to MMI:
$$\arg\min_{\theta} \sum_{W'} l(W, W')\, P(W' \mid O; \theta) = \arg\min_{\theta} \big(1 - P(W \mid O; \theta)\big) = \arg\max_{\theta} P(W \mid O; \theta) = \arg\max_{\theta} \frac{P(W, O; \theta)}{\sum_{W'} P(W', O; \theta)}$$
The Extended Baum-Welch updates of the Gaussian means and variances take the form
$$\hat{\mu}_s = \frac{\gamma_s(O; \theta) + D_s \mu_s}{\gamma_s(\theta) + D_s}, \qquad
\hat{\sigma}_s^2 = \frac{\gamma_s(O^2; \theta) + D_s(\sigma_s^2 + \mu_s^2)}{\gamma_s(\theta) + D_s} - \hat{\mu}_s^2$$
where the $\gamma_s(\cdot)$ are accumulated occupancy and data statistics for state $s$ and $D_s$ is the smoothing constant.
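As a concrete illustration of this update (a minimal sketch assuming scalar observations and the accumulator notation in the reconstruction above; in practice $D_s$ is chosen per state large enough to keep the variance positive):

```python
# Minimal sketch: Extended Baum-Welch update of one scalar Gaussian state.
def ebw_update(mu, var, gamma, stat_o, stat_o2, D):
    """gamma  : accumulated (possibly signed) state occupancy
    stat_o : first-order statistic,  sum over frames of gamma_t * o_t
    stat_o2: second-order statistic, sum over frames of gamma_t * o_t**2
    D      : smoothing constant keeping the new variance positive"""
    denom = gamma + D
    mu_new = (stat_o + D * mu) / denom
    var_new = (stat_o2 + D * (var + mu ** 2)) / denom - mu_new ** 2
    return mu_new, var_new

# One state, a handful of frames' worth of statistics:
print(ebw_update(mu=0.0, var=1.0, gamma=2.0, stat_o=1.5, stat_o2=3.0, D=4.0))
```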
Introduction
The risk-based estimation algorithm of Kaiser et al. is not suited for direct application to LVCSR.
– The difficulty arises from the need to compute the risk over many recognition hypotheses to obtain reliable statistical estimates.
In small vocabulary tasks, N-best lists are adequate to represent the space of hypotheses.
Lattice algorithms have been developed to compute the statistics needed for likelihood-based estimation procedures such as MMI.
– Risk-based estimation algorithms, however, are not easily formulated over lattices.
Introduction
The focus of this paper is the efficient computation of loss and likelihood in risk-based parameter estimation for LVCSR.
We use lattice-cutting techniques developed for Minimum Bayes Risk decoding (Goel 2001) to efficiently compute the statistics needed by the algorithm of Kaiser et al.
Minimum Bayes risk discriminative training
Likelihood-based:
$$\theta^* = \arg\max_{\theta} \log P(W \mid O; \theta)$$
Risk-based:
$$\theta^* = \arg\min_{\theta} R(W; \theta), \qquad R(W; \theta) = \sum_{W'} l(W', W)\, P(W' \mid O; \theta)$$
via the Extended Baum-Welch algorithm, with per-hypothesis weights
$$K(W', W; \theta) = \Big( \sum_{W''} l(W'', W)\, P(W'' \mid O; \theta) - l(W', W) \Big)\, P(W' \mid O; \theta) = \big( R(W; \theta) - l(W', W) \big)\, P(W' \mid O; \theta)$$
and updates
$$\hat{\mu}_s = \frac{\sum_{W'} K(W', W; \theta)\, \gamma_s(O; W') + D_s \mu_s}{\sum_{W'} K(W', W; \theta)\, \gamma_s(W') + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\sum_{W'} K(W', W; \theta)\, \gamma_s(O^2; W') + D_s(\sigma_s^2 + \mu_s^2)}{\sum_{W'} K(W', W; \theta)\, \gamma_s(W') + D_s} - \hat{\mu}_s^2$$
where $\gamma_s(W')$, $\gamma_s(O; W')$ and $\gamma_s(O^2; W')$ are gathered by a Forward-Backward pass over each hypothesis $W'$.
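A minimal sketch of how the K(W', W) weights fall out of the risk (assumptions: hypotheses arrive as an N-best list with log scores, posteriors are normalized over the list, and the loss is passed in as a callable):

```python
# Minimal sketch: per-hypothesis K(W', W) = (R(W) - l(W', W)) * P(W'|O).
import math

def k_weights(reference, nbest, loss):
    """Return one K weight per N-best entry (hypothesis, log score)."""
    logps = [lp for _, lp in nbest]
    m = max(logps)
    z = sum(math.exp(lp - m) for lp in logps)
    post = [math.exp(lp - m) / z for lp in logps]     # P(W'|O) over the list
    losses = [loss(reference, hyp) for hyp, _ in nbest]
    risk = sum(l * p for l, p in zip(losses, post))   # R(W)
    return [(risk - l) * p for l, p in zip(losses, post)]

nbest = [("A B C".split(), -10.0), ("A D C".split(), -10.5)]
loss = lambda r, h: sum(x != y for x, y in zip(r, h))  # toy per-position loss
print(k_weights("A B C".split(), nbest, loss))
```

Hypotheses less errorful than average receive positive weight, so their statistics are reinforced; worse-than-average hypotheses receive negative weight and are suppressed.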
Computing statistics over the evidence space
In LVCSR tasks, the evidence space W is often a lattice generated by the ASR decoder.
Through the conditional independence assumptions underlying the ASR system, posterior probabilities can be found by summing over the lattice arcs.
– Dense lattices are needed to obtain robust and unbiased estimates.
However, the risk-minimizing estimation procedure is not readily implemented over lattices.
– The trellis of the DP (Levenshtein) alignment is not consistent with the structure of ASR lattices.
Computing statistics over the evidence space
The only exact approach to lattice-based estimation is simply to expand the lattices into N-best lists so that the string-to-string comparisons and the gathering of statistics (done via the Forward-Backward procedure) can be carried out exactly (Kaiser 2000).
While correct, this approach is not feasible for LVCSR tasks, where the N-best lists would have to be extremely deep to contain a significant portion of the most likely hypotheses.
We now discuss risk-based MMI variants for parameter estimation in LVCSR.
Efficient computation of risk
The key is an efficient lattice-to-string alignment algorithm to find l(W', W) for all W' in any lattice W.
By tracing a path through the lattice, accumulating the Levenshtein alignment costs, and weighting them by the arc likelihoods (copied from the original lattice), the risk R(W; θ) can be computed, along with
$$K(W', W; \theta) = \big( R(W; \theta) - l(W', W) \big)\, P(W' \mid O; \theta)$$
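As an illustration of this path-tracing computation, a minimal sketch under an assumed lattice encoding (integer node ids in topological order; each arc carries a posterior weight and a precomputed local alignment cost); this is not the paper's implementation:

```python
# Minimal sketch: expected loss over an acyclic lattice in one forward pass.
from collections import defaultdict

def lattice_risk(arcs, start, final):
    """arcs: (src, dst, posterior, local_cost) with node ids in topological
    order. Accumulates per node the path probability mass and the
    mass-weighted cumulative cost; returns the expected path cost."""
    mass = defaultdict(float); mass[start] = 1.0
    cost = defaultdict(float)
    for src, dst, post, c in sorted(arcs):          # topological by node id
        mass[dst] += mass[src] * post
        cost[dst] += (cost[src] + mass[src] * c) * post
    return cost[final] / mass[final] if mass[final] else 0.0

# Two-segment toy lattice 0 -> 1 -> 2, one error hypothesis per segment:
arcs = [(0, 1, 0.7, 0.0), (0, 1, 0.3, 1.0),
        (1, 2, 0.9, 0.0), (1, 2, 0.1, 1.0)]
print(lattice_risk(arcs, 0, 2))  # 0.3 + 0.1 = 0.4 expected word errors
```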
Risk-based pruning of the evidence space
Risk-based lattice segmentation proceeds by segmenting the lattice with respect to the reference string.
For the K words of the reference string, we identify K-1 node cut sets.
– identify all lattice subpaths that align to each reference word
– the cut set Ni consists of the final lattice nodes of all these subpaths
The evidence space is then pruned in two steps (a sketch of the second step follows this list):
– The likelihood of each lattice arc is used to discard paths through each confusion set so that only the most likely alternative to the reference word remains.
– We count all the confusion pairs in the training set lattices, and if any pair has occurred fewer times than a set threshold, that pair is everywhere pruned back to the reference transcription.
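A minimal sketch of the counting-and-threshold step, under a hypothetical data layout in which each cut-set segment is a (reference word, alternative word or None) pair collected over the whole training set:

```python
# Minimal sketch: discard confusion pairs seen fewer than min_count times.
from collections import Counter

def prune_confusion_pairs(segments, min_count):
    """Returns the surviving (reference, alternative) pairs and the segment
    list with rare pairs pruned back to the reference (alternative = None)."""
    counts = Counter(p for p in segments if p[1] is not None)
    kept = {p for p, c in counts.items() if c >= min_count}
    pruned = [(ref, alt if (ref, alt) in kept else None)  # None = ref only
              for ref, alt in segments]
    return kept, pruned

segments = [("A", "EIGHT"), ("B", "D"), ("A", "EIGHT"), ("B", None), ("C", "Z")]
kept, pruned = prune_confusion_pairs(segments, min_count=2)
print(kept)    # {('A', 'EIGHT')}: 'B'/'D' and 'C'/'Z' occur only once
print(pruned)
```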
Induced loss function
The original motivation for refining the evidence space was to speed up MBR search, but lattice pinching also allows us to refine the string-to-string loss function within W.
We refer to the corresponding loss as the induced loss function.
If the initial lattice-to-string alignment was good, l_I(W', W) will be a good approximation to l(W', W).
The induced loss function and the pinched and pruned evidence space can be used to reduce the computational cost of MBRDT in LVCSR.
$$l_I(W', W) = \sum_{i=0}^{K-1} l(W'_i, W_i)$$
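Since the induced loss is just a sum of independent per-segment losses, it is trivial to compute; a minimal sketch with a 0/1 loss per segment:

```python
# Minimal sketch: the induced loss decomposes over the K pinched segments.
def induced_loss(ref_segments, hyp_segments):
    """l_I(W', W) = sum_i l(W'_i, W_i) for aligned segment sequences."""
    assert len(ref_segments) == len(hyp_segments)
    return sum(r != h for r, h in zip(ref_segments, hyp_segments))

print(induced_loss(["A", "B", "C"], ["A", "D", "C"]))  # 1
```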
Pinched lattice minimum Bayes risk discriminative training (PLMBRDT)
$$\theta^* = \arg\min_{\theta} \sum_{W' \in \widetilde{\mathcal{W}}} l_I(W', W)\, P(W' \mid O; \theta)$$
$$K(W', \widetilde{\mathcal{W}}; \theta) = \Big( \sum_{W'' \in \widetilde{\mathcal{W}}} l_I(W'', W)\, P(W'' \mid O; \theta) - l_I(W', W) \Big)\, P(W' \mid O; \theta) = \big( R_I(W, \widetilde{\mathcal{W}}; \theta) - l_I(W', W) \big)\, P(W' \mid O; \theta)$$
with the same Extended Baum-Welch updates as before, now restricted to the pinched lattice:
$$\hat{\mu}_s = \frac{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(O; W') + D_s \mu_s}{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(W') + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(O^2; W') + D_s(\sigma_s^2 + \mu_s^2)}{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(W') + D_s} - \hat{\mu}_s^2$$
Pinched lattice minimum Bayes risk discriminative training (PLMBRDT)
PLMBRDT algorithm:
Step 1: Generate lattices over the training set.
Step 2: Align the training set lattices to the reference transcription.
Step 3: Segment the lattices.
Step 4: Prune the confusion sets to confusion pairs.
Step 5: Discard infrequently occurring confusion pairs.
Step 6: Expand each pinched lattice W̃ into an N-best list, keeping l_I(W', W).
Step 7: Compute R_I(W, W̃; θ).
Step 8: For each W', compute K(W', W̃; θ).
Step 9: Perform a Forward-Backward pass for each W'.
Step 10: Apply the reestimation equations.
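A self-contained toy sketch of Steps 6 through 10 (assumed data layout: each utterance provides its reference segment sequence and an N-best list of segment sequences with posteriors; the returned K values would scale the Forward-Backward statistics of Step 9):

```python
# Minimal sketch: induced losses, risk, and K weights per pinched utterance.
def plmbrdt_stats(utterances):
    """utterances: list of (ref_segments, nbest) with nbest holding
    (segment sequence, posterior) pairs from the pinched lattice."""
    all_weights = []
    for ref, nbest in utterances:
        losses = [sum(r != h for r, h in zip(ref, hyp)) for hyp, _ in nbest]
        risk = sum(l * p for l, (_, p) in zip(losses, nbest))          # Step 7
        weights = [(risk - l) * p for l, (_, p) in zip(losses, nbest)]  # Step 8
        all_weights.append(weights)
    return all_weights

utts = [(["A", "B"],
         [(["A", "B"], 0.6), (["EIGHT", "B"], 0.3), (["A", "D"], 0.1)])]
print(plmbrdt_stats(utts))  # reference positive, errorful hypotheses negative
```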
Pinched lattice MMIE for whole-word acoustic models (PLMMI)
Because the induced loss decomposes over the K segments, the risk decomposes into independent per-segment problems:
$$\theta^* = \arg\min_{\theta} \sum_{W' \in \widetilde{\mathcal{W}}} l_I(W', W)\, P(W' \mid O; \theta) = \arg\min_{\theta} \sum_{W' \in \widetilde{\mathcal{W}}} \sum_{i=0}^{K-1} l(W'_i, W_i)\, P(W' \mid O; \theta)$$
$$= \arg\min_{\theta} \sum_{i=0}^{K-1} \sum_{W'_i \in \widetilde{\mathcal{W}}_i} l(W'_i, W_i)\, P(W'_i \mid O, \widetilde{\mathcal{W}}; \theta) = \arg\max_{\theta} \prod_{i \in C} P(W_i \mid O, \widetilde{\mathcal{W}}; \theta)$$
where C is the set of segments containing confusion pairs, i.e. MMI restricted to the reference words in the confusable segments.
Pinched lattice MMIE for whole word acoustic models (PLMMI)
PLMMI algorithm:
Step 1: Generate lattices over the training set.
Step 2: Align the training set lattices to the reference transcription.
Step 3: Segment the lattices.
Step 4: Prune the confusion sets to confusion pairs.
Step 5: Discard infrequently occurring confusion pairs.
Step 6: Tag word hypotheses in confusion pairs.
Step 7: Regenerate lattices over the training set, using the tagged and pinched lattices to constrain recognition.
Step 8: Perform lattice-based MMI using the word boundary times obtained from the lattices.
Statistics are gathered only for those word instances that appear in the confusion sets (a sketch of the tagging in Step 6 follows).
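A minimal sketch of one possible tagging scheme (hypothetical; the specific tag format is an assumption, not the paper's): words inside confusion pairs are marked with their segment index so that confusable instances stay distinct from other occurrences of the same word during re-recognition and MMI:

```python
# Minimal sketch: tag only the word hypotheses that sit in confusion pairs.
def tag_confusion_words(segments):
    """segments: list of (reference_word, alternative_word or None).
    Returns the tagged reference sequence used to constrain recognition."""
    tagged = []
    for i, (ref, alt) in enumerate(segments):
        if alt is None:
            tagged.append(ref)                 # unambiguous: leave untouched
        else:
            tagged.append(f"{ref}#seg{i}")     # hypothetical tag format
            # the alternative f"{alt}#seg{i}" stays in the pinched lattice
    return tagged

print(tag_confusion_words([("A", "EIGHT"), ("B", None), ("C", "Z")]))
# ['A#seg0', 'B', 'C#seg2']
```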
One-worst pinched lattice MBRDT
Instead of the full expected risk, the risk over the pinched lattice can be approximated using only the single most errorful hypothesis
$$W^* = \arg\max_{W' \in \widetilde{\mathcal{W}}} l_I(W', W)$$
With the evidence restricted to the reference W and W*, the weights become
$$K(W, W, \widetilde{\mathcal{W}}; \theta) = \big( R_I(W, \widetilde{\mathcal{W}}; \theta) - l_I(W, W) \big)\, P(W \mid O) = l_I(W^*, W)\, P(W^* \mid O)\, P(W \mid O)$$
$$K(W^*, W, \widetilde{\mathcal{W}}; \theta) = \big( R_I(W, \widetilde{\mathcal{W}}; \theta) - l_I(W^*, W) \big)\, P(W^* \mid O) = -\, l_I(W^*, W)\, P(W^* \mid O)\, \big(1 - P(W^* \mid O)\big)$$
so the reference receives positive weight and the worst hypothesis negative weight, giving the MMI-style updates
$$\hat{\mu}_s = \frac{\gamma_s(W, O; \theta) - \gamma_s(W^*, O; \theta) + D_s \mu_s}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\gamma_s(W, O^2; \theta) - \gamma_s(W^*, O^2; \theta) + D_s(\sigma_s^2 + \mu_s^2)}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s} - \hat{\mu}_s^2$$
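A minimal sketch of the one-worst selection and weights (hypothetical list layout: each N-best entry holds a hypothesis, its induced loss, and its posterior; the reference posterior is supplied separately):

```python
# Minimal sketch: one-worst approximation yields MMI-style +/- weights.
def one_worst_weights(nbest, ref_posterior):
    hyp, loss, p_star = max(nbest, key=lambda h: h[1])  # W* = argmax l_I
    k_ref = loss * p_star * ref_posterior               # positive weight on W
    k_worst = -loss * p_star * (1.0 - p_star)           # negative weight on W*
    return hyp, k_ref, k_worst

nbest = [(["EIGHT", "B"], 1, 0.3), (["EIGHT", "D"], 2, 0.1)]
print(one_worst_weights(nbest, ref_posterior=0.6))
# (['EIGHT', 'D'], 0.12, -0.18)
```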
One-worst pinched lattice MBRDT
One-worst PLMBRDT algorithm (corrective training):
Step 1: Generate lattices over the training set.
Step 2: Align the training set lattices to the reference transcription.
Step 3: Segment the lattices.
Step 4: Prune the confusion sets to confusion pairs.
Step 5: Discard infrequently occurring confusion pairs.
Step 6: Extract the most errorful hypothesis W*.
Step 7: Perform Forward-Backward passes with respect to W and W*.
Step 8: Perform the MMI update below.
$$\hat{\mu}_s = \frac{\gamma_s(W, O; \theta) - \gamma_s(W^*, O; \theta) + D_s \mu_s}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\gamma_s(W, O^2; \theta) - \gamma_s(W^*, O^2; \theta) + D_s(\sigma_s^2 + \mu_s^2)}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s} - \hat{\mu}_s^2$$
Conclusion
We have demonstrated how techniques developed for MBR decoding make it possible to apply risk-based parameter estimation algorithms to LVCSR.
– PLMBRDT, PLMMI, One-worst PLMBRDT
Our approach starts with the original derivations of Kaiser et al., which show how the Extended Baum-Welch algorithm can be used to derive a parameter estimation procedure that reduces the expected loss over the training data.