TRANSCRIPT
Lattice Segmentation and Minimum Bayes Risk Discriminative Training for
Large Vocabulary Continuous Speech Recognition
Vlasios Doumpiotis, William Byrne
Johns Hopkins University
Speech Communication. Accepted, in revision.
Presented by Shih-Hung
2005/10/31
Reference
Discriminative Training for Segmental Minimum Bayes Risk Decoding (ICASSP 2003) (Alphadigits)
Lattice Segmentation and Minimum Bayes Risk Discriminative Training (Eurospeech 2003) (Alphadigits)
Pinched Lattice Minimum Bayes Risk Discriminative Training for Large Vocabulary Continuous Speech Recognition (ICSLP 2004)
Minimum Bayes Risk Estimation and Decoding in Large Vocabulary Continuous Speech Recognition (ATR Workshop 2004)
Outline
Introduction
Minimum Bayes risk discriminative training
– Update parameters via the Extended Baum-Welch algorithm
– Efficient computation of risk
– Risk-based pruning of the evidence space
Pinched lattice minimum Bayes risk discriminative training
Pinched lattice MMIE for whole-word acoustic models
One-worst pinched lattice MBRDT
Small vocabulary ASR performance and analysis (Alphadigits)
MBRDT for LVCSR results (Switchboard, MALACH-CZ)
Conclusion
Introduction
Discriminative acoustic modeling procedures, such as MMIE, are powerful techniques that can be used to improve the performance of ASR systems.
MMI is often motivated as an estimation procedure by observing that it increases the a posteriori probability of the correct transcription of the speech in the training set.
ML:
$$\theta^*_{\mathrm{ML}} = \arg\max_{\theta} \log P(O \mid W; \theta)$$
MMI (MP):
$$\theta^*_{\mathrm{MMI}} = \arg\max_{\theta} \log P(W \mid O; \theta) = \arg\max_{\theta} \log \frac{P(O \mid W; \theta)\, P(W)}{P(O; \theta)}$$
Introduction
Since the ultimate goal is to reduce the number of words in error, which we define as the loss, we prefer estimation procedures that reduce the loss directly rather than improve likelihood.
One such risk-based parameter estimation procedure was developed by Kaiser et al. (2000, 2002) to reduce the expected loss, or risk, over the training set.
MBR decoding:
$$W^* = \arg\min_{W' \in \mathcal{W}} \sum_{W'' \in \mathcal{W}} L(W'', W')\, \frac{P(O \mid W''; \theta)\, P(W'')}{\sum_{W} P(O \mid W; \theta)\, P(W)}$$
Risk-based estimation:
$$\theta^* = \arg\min_{\theta} R(W, \theta, O), \qquad R(W, \theta, O) = \sum_{W'} l(W, W')\, P(W' \mid O; \theta)$$
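For illustration, a minimal self-contained sketch (not from the paper) that computes this expected risk over an N-best list, using the word-level Levenshtein distance as the loss l(W, W') and posteriors obtained by normalizing per-hypothesis log scores over the list:

```python
# Minimal sketch: expected risk R(W, theta, O) over an N-best list.
import math

def levenshtein(ref, hyp):
    """Word-level edit distance between two token sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

def expected_risk(reference, nbest):
    """R = sum_W' l(W, W') P(W'|O), posteriors normalized over the list."""
    logps = [lp for _, lp in nbest]
    m = max(logps)
    z = sum(math.exp(lp - m) for lp in logps)        # log-sum-exp normalizer
    return sum(levenshtein(reference, hyp) * math.exp(lp - m) / z
               for hyp, lp in nbest)

nbest = [("A B C".split(), -10.2), ("A D C".split(), -11.0),
         ("A B".split(), -12.5)]
print(expected_risk("A B C".split(), nbest))
```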
Introduction
Their approach is a generalization of MMI in that both are derived via the Extended Baum-Welch algorithm, and MMI is a special case of risk minimization under the sentence-error loss function.
Under the sentence-error loss
$$l(W, W') = \begin{cases} 0, & W' = W \\ 1, & W' \neq W \end{cases}$$
risk minimization reduces to MMI:
$$\arg\min_{\theta} \sum_{W'} l(W, W')\, P(W' \mid O; \theta) = \arg\min_{\theta} \big(1 - P(W \mid O; \theta)\big) = \arg\max_{\theta} P(W \mid O; \theta) = \arg\max_{\theta} \frac{P(W, O; \theta)}{\sum_{W'} P(W', O; \theta)}$$
The Extended Baum-Welch updates of the Gaussian means and variances take the form
$$\hat{\mu}_s = \frac{\gamma_s(O; \theta) + D_s \mu_s}{\gamma_s(\theta) + D_s}, \qquad
\hat{\sigma}_s^2 = \frac{\gamma_s(O^2; \theta) + D_s(\sigma_s^2 + \mu_s^2)}{\gamma_s(\theta) + D_s} - \hat{\mu}_s^2$$
where the $\gamma_s(\cdot)$ are accumulated occupancy and data statistics for state $s$ and $D_s$ is the smoothing constant.
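As a concrete illustration of this update (a minimal sketch assuming scalar observations and the accumulator notation in the reconstruction above; in practice $D_s$ is chosen per state large enough to keep the variance positive):

```python
# Minimal sketch: Extended Baum-Welch update of one scalar Gaussian state.
def ebw_update(mu, var, gamma, stat_o, stat_o2, D):
    """gamma  : accumulated (possibly signed) state occupancy
    stat_o : first-order statistic,  sum over frames of gamma_t * o_t
    stat_o2: second-order statistic, sum over frames of gamma_t * o_t**2
    D      : smoothing constant keeping the new variance positive"""
    denom = gamma + D
    mu_new = (stat_o + D * mu) / denom
    var_new = (stat_o2 + D * (var + mu ** 2)) / denom - mu_new ** 2
    return mu_new, var_new

# One state, a handful of frames' worth of statistics:
print(ebw_update(mu=0.0, var=1.0, gamma=2.0, stat_o=1.5, stat_o2=3.0, D=4.0))
```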
Introduction
The risk-based estimation algorithm of Kaiser et al. is not suited for direct application to LVCSR.
– The difficulty arises from the need to compute the risk over many recognition hypotheses to obtain reliable statistical estimates.
In small vocabulary tasks, N-best lists are adequate to represent the space of hypotheses.
Lattice algorithms have been developed to compute the statistics needed for likelihood-based estimation procedures such as MMI.
– Risk-based estimation algorithms, however, are not easily formulated over lattices.
Introduction
The focus of this paper is the efficient computation of loss and likelihood in risk-based parameter estimation for LVCSR.
We use lattice-cutting techniques developed for Minimum Bayes Risk decoding (Goel 2001) to efficiently compute the statistics needed by the algorithm of Kaiser et al.
Minimum Bayes risk discriminative training
Likelihood-based:
$$\theta^* = \arg\max_{\theta} \log P(W \mid O; \theta)$$
Risk-based:
$$\theta^* = \arg\min_{\theta} R(W; \theta), \qquad R(W; \theta) = \sum_{W'} l(W', W)\, P(W' \mid O; \theta)$$
via the Extended Baum-Welch algorithm, with per-hypothesis weights
$$K(W', W; \theta) = \Big( \sum_{W''} l(W'', W)\, P(W'' \mid O; \theta) - l(W', W) \Big)\, P(W' \mid O; \theta) = \big( R(W; \theta) - l(W', W) \big)\, P(W' \mid O; \theta)$$
and updates
$$\hat{\mu}_s = \frac{\sum_{W'} K(W', W; \theta)\, \gamma_s(O; W') + D_s \mu_s}{\sum_{W'} K(W', W; \theta)\, \gamma_s(W') + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\sum_{W'} K(W', W; \theta)\, \gamma_s(O^2; W') + D_s(\sigma_s^2 + \mu_s^2)}{\sum_{W'} K(W', W; \theta)\, \gamma_s(W') + D_s} - \hat{\mu}_s^2$$
where $\gamma_s(W')$, $\gamma_s(O; W')$ and $\gamma_s(O^2; W')$ are gathered by a Forward-Backward pass over each hypothesis $W'$.
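A minimal sketch of how the K(W', W) weights fall out of the risk (assumptions: hypotheses arrive as an N-best list with log scores, posteriors are normalized over the list, and the loss is passed in as a callable):

```python
# Minimal sketch: per-hypothesis K(W', W) = (R(W) - l(W', W)) * P(W'|O).
import math

def k_weights(reference, nbest, loss):
    """Return one K weight per N-best entry (hypothesis, log score)."""
    logps = [lp for _, lp in nbest]
    m = max(logps)
    z = sum(math.exp(lp - m) for lp in logps)
    post = [math.exp(lp - m) / z for lp in logps]     # P(W'|O) over the list
    losses = [loss(reference, hyp) for hyp, _ in nbest]
    risk = sum(l * p for l, p in zip(losses, post))   # R(W)
    return [(risk - l) * p for l, p in zip(losses, post)]

nbest = [("A B C".split(), -10.0), ("A D C".split(), -10.5)]
loss = lambda r, h: sum(x != y for x, y in zip(r, h))  # toy per-position loss
print(k_weights("A B C".split(), nbest, loss))
```

Hypotheses less errorful than average receive positive weight, so their statistics are reinforced; worse-than-average hypotheses receive negative weight and are suppressed.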
Computing statistics over the evidence space
In LVCSR tasks, the evidence space W is often a lattice generated by the ASR decoder.
Through the conditional independence assumptions underlying the ASR system, posterior probabilities can be found by summing over the lattice arcs.
– Dense lattices are needed to obtain robust and unbiased estimates.
However, the risk-minimizing estimation procedure is not readily implemented over lattices.
– The trellis of the DP (Levenshtein) alignment is not consistent with the structure of ASR lattices.
Computing statistics over the evidence space
The only exact approach to lattice-based estimation is simply to expand the lattices into N-best lists so that the string-to-string comparisons and the gathering of statistics (done via the Forward-Backward procedure) can be carried out exactly (Kaiser 2000).
While correct, this approach is not feasible for LVCSR tasks, where the N-best lists would have to be extremely deep to contain a significant portion of the most likely hypotheses.
We now discuss risk-based MMI variants for parameter estimation in LVCSR.
Efficient computation of risk
The key is an efficient lattice-to-string alignment algorithm to find l(W', W) for all W' in any lattice W.
By tracing a path through the lattice, accumulating the Levenshtein alignment costs, and weighting them by the arc likelihoods (copied from the original lattice), the risk R(W; θ) can be computed, along with
$$K(W', W; \theta) = \big( R(W; \theta) - l(W', W) \big)\, P(W' \mid O; \theta)$$
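As an illustration of this path-tracing computation, a minimal sketch under an assumed lattice encoding (integer node ids in topological order; each arc carries a posterior weight and a precomputed local alignment cost); this is not the paper's implementation:

```python
# Minimal sketch: expected loss over an acyclic lattice in one forward pass.
from collections import defaultdict

def lattice_risk(arcs, start, final):
    """arcs: (src, dst, posterior, local_cost) with node ids in topological
    order. Accumulates per node the path probability mass and the
    mass-weighted cumulative cost; returns the expected path cost."""
    mass = defaultdict(float); mass[start] = 1.0
    cost = defaultdict(float)
    for src, dst, post, c in sorted(arcs):          # topological by node id
        mass[dst] += mass[src] * post
        cost[dst] += (cost[src] + mass[src] * c) * post
    return cost[final] / mass[final] if mass[final] else 0.0

# Two-segment toy lattice 0 -> 1 -> 2, one error hypothesis per segment:
arcs = [(0, 1, 0.7, 0.0), (0, 1, 0.3, 1.0),
        (1, 2, 0.9, 0.0), (1, 2, 0.1, 1.0)]
print(lattice_risk(arcs, 0, 2))  # 0.3 + 0.1 = 0.4 expected word errors
```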
Risk-based pruning of the evidence space
Risk-based lattice segmentation proceeds by segmenting the lattice with respect to the reference string.
For the K words of the reference string, we identify K-1 node cut sets.
– identify all lattice subpaths that align to each reference word
– the cut set Ni consists of the final lattice nodes of all these subpaths
The evidence space is then pruned in two steps (a sketch of the second step follows this list):
– The likelihood of each lattice arc is used to discard paths through each confusion set so that only the most likely alternative to the reference word remains.
– We count all the confusion pairs in the training set lattices, and if any pair has occurred fewer times than a set threshold, that pair is everywhere pruned back to the reference transcription.
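A minimal sketch of the counting-and-threshold step, under a hypothetical data layout in which each cut-set segment is a (reference word, alternative word or None) pair collected over the whole training set:

```python
# Minimal sketch: discard confusion pairs seen fewer than min_count times.
from collections import Counter

def prune_confusion_pairs(segments, min_count):
    """Returns the surviving (reference, alternative) pairs and the segment
    list with rare pairs pruned back to the reference (alternative = None)."""
    counts = Counter(p for p in segments if p[1] is not None)
    kept = {p for p, c in counts.items() if c >= min_count}
    pruned = [(ref, alt if (ref, alt) in kept else None)  # None = ref only
              for ref, alt in segments]
    return kept, pruned

segments = [("A", "EIGHT"), ("B", "D"), ("A", "EIGHT"), ("B", None), ("C", "Z")]
kept, pruned = prune_confusion_pairs(segments, min_count=2)
print(kept)    # {('A', 'EIGHT')}: 'B'/'D' and 'C'/'Z' occur only once
print(pruned)
```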
Induced loss function
The original motivation for refining the evidence space was to speed up MBR search, but lattice pinching also allows us to refine the string-to-string loss function within W.
We refer to the corresponding loss as the induced loss function.
If the initial lattice-to-string alignment was good, l_I(W', W) will be a good approximation to l(W', W).
The induced loss function and the pinched and pruned evidence space can be used to reduce the computational cost of MBRDT in LVCSR.
$$l_I(W', W) = \sum_{i=0}^{K-1} l(W'_i, W_i)$$
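Since the induced loss is just a sum of independent per-segment losses, it is trivial to compute; a minimal sketch with a 0/1 loss per segment:

```python
# Minimal sketch: the induced loss decomposes over the K pinched segments.
def induced_loss(ref_segments, hyp_segments):
    """l_I(W', W) = sum_i l(W'_i, W_i) for aligned segment sequences."""
    assert len(ref_segments) == len(hyp_segments)
    return sum(r != h for r, h in zip(ref_segments, hyp_segments))

print(induced_loss(["A", "B", "C"], ["A", "D", "C"]))  # 1
```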
Pinched lattice minimum Bayes risk discriminative training (PLMBRDT)
$$\theta^* = \arg\min_{\theta} \sum_{W' \in \widetilde{\mathcal{W}}} l_I(W', W)\, P(W' \mid O; \theta)$$
$$K(W', \widetilde{\mathcal{W}}; \theta) = \Big( \sum_{W'' \in \widetilde{\mathcal{W}}} l_I(W'', W)\, P(W'' \mid O; \theta) - l_I(W', W) \Big)\, P(W' \mid O; \theta) = \big( R_I(W, \widetilde{\mathcal{W}}; \theta) - l_I(W', W) \big)\, P(W' \mid O; \theta)$$
with the same Extended Baum-Welch updates as before, now restricted to the pinched lattice:
$$\hat{\mu}_s = \frac{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(O; W') + D_s \mu_s}{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(W') + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(O^2; W') + D_s(\sigma_s^2 + \mu_s^2)}{\sum_{W' \in \widetilde{\mathcal{W}}} K(W', \widetilde{\mathcal{W}}; \theta)\, \gamma_s(W') + D_s} - \hat{\mu}_s^2$$
Pinched lattice minimum Bayes risk discriminative training (PLMBRDT)
PLMBRDT algorithm:
Step 1: Generate lattices over the training set.
Step 2: Align the training set lattices to the reference transcription.
Step 3: Segment the lattices.
Step 4: Prune the confusion sets to confusion pairs.
Step 5: Discard infrequently occurring confusion pairs.
Step 6: Expand each pinched lattice W̃ into an N-best list, keeping l_I(W', W).
Step 7: Compute R_I(W, W̃; θ).
Step 8: For each W', compute K(W', W̃; θ).
Step 9: Perform a Forward-Backward pass for each W'.
Step 10: Apply the reestimation equations.
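A self-contained toy sketch of Steps 6 through 10 (assumed data layout: each utterance provides its reference segment sequence and an N-best list of segment sequences with posteriors; the returned K values would scale the Forward-Backward statistics of Step 9):

```python
# Minimal sketch: induced losses, risk, and K weights per pinched utterance.
def plmbrdt_stats(utterances):
    """utterances: list of (ref_segments, nbest) with nbest holding
    (segment sequence, posterior) pairs from the pinched lattice."""
    all_weights = []
    for ref, nbest in utterances:
        losses = [sum(r != h for r, h in zip(ref, hyp)) for hyp, _ in nbest]
        risk = sum(l * p for l, (_, p) in zip(losses, nbest))          # Step 7
        weights = [(risk - l) * p for l, (_, p) in zip(losses, nbest)]  # Step 8
        all_weights.append(weights)
    return all_weights

utts = [(["A", "B"],
         [(["A", "B"], 0.6), (["EIGHT", "B"], 0.3), (["A", "D"], 0.1)])]
print(plmbrdt_stats(utts))  # reference positive, errorful hypotheses negative
```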
Pinched lattice MMIE for whole-word acoustic models (PLMMI)
Because the induced loss decomposes over the K segments, the risk decomposes into independent per-segment problems:
$$\theta^* = \arg\min_{\theta} \sum_{W' \in \widetilde{\mathcal{W}}} l_I(W', W)\, P(W' \mid O; \theta) = \arg\min_{\theta} \sum_{W' \in \widetilde{\mathcal{W}}} \sum_{i=0}^{K-1} l(W'_i, W_i)\, P(W' \mid O; \theta)$$
$$= \arg\min_{\theta} \sum_{i=0}^{K-1} \sum_{W'_i \in \widetilde{\mathcal{W}}_i} l(W'_i, W_i)\, P(W'_i \mid O, \widetilde{\mathcal{W}}; \theta) = \arg\max_{\theta} \prod_{i \in C} P(W_i \mid O, \widetilde{\mathcal{W}}; \theta)$$
where C is the set of segments containing confusion pairs, i.e. MMI restricted to the reference words in the confusable segments.
Pinched lattice MMIE for whole word acoustic models (PLMMI)
PLMMI algorithm:
Step 1: Generate lattices over the training set.
Step 2: Align the training set lattices to the reference transcription.
Step 3: Segment the lattices.
Step 4: Prune the confusion sets to confusion pairs.
Step 5: Discard infrequently occurring confusion pairs.
Step 6: Tag word hypotheses in confusion pairs.
Step 7: Regenerate lattices over the training set, using the tagged and pinched lattices to constrain recognition.
Step 8: Perform lattice-based MMI using the word boundary times obtained from the lattices.
Statistics are gathered only for those word instances that appear in the confusion sets (a sketch of the tagging in Step 6 follows).
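A minimal sketch of one possible tagging scheme (hypothetical; the specific tag format is an assumption, not the paper's): words inside confusion pairs are marked with their segment index so that confusable instances stay distinct from other occurrences of the same word during re-recognition and MMI:

```python
# Minimal sketch: tag only the word hypotheses that sit in confusion pairs.
def tag_confusion_words(segments):
    """segments: list of (reference_word, alternative_word or None).
    Returns the tagged reference sequence used to constrain recognition."""
    tagged = []
    for i, (ref, alt) in enumerate(segments):
        if alt is None:
            tagged.append(ref)                 # unambiguous: leave untouched
        else:
            tagged.append(f"{ref}#seg{i}")     # hypothetical tag format
            # the alternative f"{alt}#seg{i}" stays in the pinched lattice
    return tagged

print(tag_confusion_words([("A", "EIGHT"), ("B", None), ("C", "Z")]))
# ['A#seg0', 'B', 'C#seg2']
```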
One-worst pinched lattice MBRDT
Instead of the full expected risk, the risk over the pinched lattice can be approximated using only the single most errorful hypothesis
$$W^* = \arg\max_{W' \in \widetilde{\mathcal{W}}} l_I(W', W)$$
With the evidence restricted to the reference W and W*, the weights become
$$K(W, W, \widetilde{\mathcal{W}}; \theta) = \big( R_I(W, \widetilde{\mathcal{W}}; \theta) - l_I(W, W) \big)\, P(W \mid O) = l_I(W^*, W)\, P(W^* \mid O)\, P(W \mid O)$$
$$K(W^*, W, \widetilde{\mathcal{W}}; \theta) = \big( R_I(W, \widetilde{\mathcal{W}}; \theta) - l_I(W^*, W) \big)\, P(W^* \mid O) = -\, l_I(W^*, W)\, P(W^* \mid O)\, \big(1 - P(W^* \mid O)\big)$$
so the reference receives positive weight and the worst hypothesis negative weight, giving the MMI-style updates
$$\hat{\mu}_s = \frac{\gamma_s(W, O; \theta) - \gamma_s(W^*, O; \theta) + D_s \mu_s}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\gamma_s(W, O^2; \theta) - \gamma_s(W^*, O^2; \theta) + D_s(\sigma_s^2 + \mu_s^2)}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s} - \hat{\mu}_s^2$$
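A minimal sketch of the one-worst selection and weights (hypothetical list layout: each N-best entry holds a hypothesis, its induced loss, and its posterior; the reference posterior is supplied separately):

```python
# Minimal sketch: one-worst approximation yields MMI-style +/- weights.
def one_worst_weights(nbest, ref_posterior):
    hyp, loss, p_star = max(nbest, key=lambda h: h[1])  # W* = argmax l_I
    k_ref = loss * p_star * ref_posterior               # positive weight on W
    k_worst = -loss * p_star * (1.0 - p_star)           # negative weight on W*
    return hyp, k_ref, k_worst

nbest = [(["EIGHT", "B"], 1, 0.3), (["EIGHT", "D"], 2, 0.1)]
print(one_worst_weights(nbest, ref_posterior=0.6))
# (['EIGHT', 'D'], 0.12, -0.18)
```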
One-worst pinched lattice MBRDT
One-worst PLMBRDT algorithm (corrective training):
Step 1: Generate lattices over the training set.
Step 2: Align the training set lattices to the reference transcription.
Step 3: Segment the lattices.
Step 4: Prune the confusion sets to confusion pairs.
Step 5: Discard infrequently occurring confusion pairs.
Step 6: Extract the most errorful hypothesis W*.
Step 7: Perform Forward-Backward passes with respect to W and W*.
Step 8: Perform the MMI update below.
$$\hat{\mu}_s = \frac{\gamma_s(W, O; \theta) - \gamma_s(W^*, O; \theta) + D_s \mu_s}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s}$$
$$\hat{\sigma}_s^2 = \frac{\gamma_s(W, O^2; \theta) - \gamma_s(W^*, O^2; \theta) + D_s(\sigma_s^2 + \mu_s^2)}{\gamma_s(W; \theta) - \gamma_s(W^*; \theta) + D_s} - \hat{\mu}_s^2$$
Conclusion
We have demonstrated how techniques developed for MBR decoding make it possible to apply risk-based parameter estimation algorithms to LVCSR.
– PLMBRDT, PLMMI, One-worst PLMBRDT
Our approach starts with the original derivations of Kaiser et al., which show how the Extended Baum-Welch algorithm can be used to derive a parameter estimation procedure that reduces the expected loss over the training data.