
Page 1:

Discriminative Training of Decoding Graphs for Large Vocabulary Continuous Speech Recognition

by Hong-Kwang Jeff Kuo, Brian Kingsbury (IBM Research) and Geoffrey Zweig (Microsoft Research)

ICASSP 2007

Presented by: Eugene Weinstein, NYU

April 22nd, 2008

Page 2:

Transducers in Speech

• Given observation sequence o, want word sequence w

• Constraints modeled between HMM states/distributions d, context-dependent phones c, phonemes p, and words w

• Constraint set combined by using transducer composition


Model Combination

Steps:

• models represented by weighted transducers.

• Viterbi approximation: semiring change.

• composition of weighted transducers.

w = argmin_w Π₂(O ∘ H ∘ C ∘ L ∘ G).

[Figure: recognition cascade: O (observ. seq.) → H (HMM: CD phone seq.) → C (CD model: phoneme seq.) → L (pron. model: word seq.) → G (lang. model: word seq.)]

[Graphic: Mohri ‘07]
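To make the semiring point concrete, here is a minimal, hypothetical sketch (not the toolkit used in the paper or in the course slides) of composing two small epsilon-free weighted transducers over the tropical semiring and reading off the cheapest accepting path. The arc tuple format, state numbering, and toy labels are assumptions for illustration only.

```python
# Toy weighted transducers over the tropical semiring: weights add along a path,
# and alternative paths are combined by taking the minimum total weight.
# Arc format (an assumption for this sketch): (src_state, in_label, out_label, weight, dst_state).

def compose(arcs_a, finals_a, arcs_b, finals_b):
    """Compose A and B by matching A's output labels with B's input labels.
    Result states are pairs (state_of_A, state_of_B); weights add (tropical semiring)."""
    arcs = [((pa, pb), ia, ob, wa + wb, (qa, qb))
            for (pa, ia, oa, wa, qa) in arcs_a
            for (pb, ib, ob, wb, qb) in arcs_b
            if oa == ib]
    finals = {(fa, fb): wa + wb for fa, wa in finals_a.items() for fb, wb in finals_b.items()}
    return arcs, finals

def shortest_cost(arcs, finals, start):
    """Cost of the cheapest accepting path (shortest path in the tropical semiring)."""
    best = {start: 0.0}
    changed = True
    while changed:  # simple relaxation; fine for the small acyclic toy below
        changed = False
        for (p, _, _, w, q) in arcs:
            if p in best and best[p] + w < best.get(q, float("inf")):
                best[q] = best[p] + w
                changed = True
    return min((best[f] + wf for f, wf in finals.items() if f in best), default=float("inf"))

# A maps symbol "a" to either "x" (cost 1.0) or "y" (cost 3.0); B maps both "x" and "y" to "u".
A_arcs, A_finals = [(0, "a", "x", 1.0, 1), (0, "a", "y", 3.0, 1)], {1: 0.0}
B_arcs, B_finals = [(0, "x", "u", 2.0, 1), (0, "y", "u", 0.5, 1)], {1: 0.0}

C_arcs, C_finals = compose(A_arcs, A_finals, B_arcs, B_finals)
print(shortest_cost(C_arcs, C_finals, start=(0, 0)))  # 3.0: min(1.0 + 2.0, 3.0 + 0.5)
```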

– usually a beam search implemented using the Viterbi algorithm. This is a greedy algorithm which discards any paths through the automaton which are unlikely to match the observation sequence according to some heuristic.

Representing the output of a speech recognizer as a set of most likely hypotheses as opposed to a single best-path hypothesis is advantageous for several reasons. One reason is that we may try to apply additional recognition runs to these hypotheses. For instance, suppose we are interested in increasing the speed of decoding. One possibility is to apply a first run generating a number of hypotheses using a fast decoder with simple models. We would subsequently apply a second "pass" of a more sophisticated decoder over the search space. This second pass would generate the final hypothesis. Another possibility is that we might be interested in more than just the one-best hypothesis as the output of the decoder. This would be the case if the output of the speech recognizer is being fed into another application, such as a natural language parsing algorithm or a speech indexing system.

The term pruning refers to narrowing the set of hypotheses being considered. The pruning algorithm usually uses a heuristic, such as a threshold on the likelihood of the path. In this work, we present the current methods being used, empirically analyze their efficacy, and suggest methods for improving them.

2 Viterbi Decoding and the Lattice Representation

If o, c, p, and w are sequences of observations, context-dependent phones, phonemes, and words, respectively, the decoding problem may be written as

w = argmax_w Σ_{c,p} Pr(o|c) Pr(c|p) Pr(p|w) Pr(w).   (1)

Let C, L, and G be transducers over the tropical semiring representing the context-dependent phonotactic model, the dictionary or pronunciation model, and the language model or grammar, respectively. The decoding problem may then be viewed as the application of a shortest-path algorithm to the composition of these components:

w = argmin_w Π_o[A(O) ∘ C ∘ L ∘ G],   (2)

where A(O) represents a distribution sequence resulting from the application of an HMM-based acoustic model to the observation sequence. Since exhaustive search of the paths in the graph resulting from the above compositions is generally prohibitive, a beam search is used. If R is an automaton representing the paths searched thus far and t is a pruning threshold, the beam search algorithm discards any state q that is more than t away from the cost of the shortest path found so far. If s_o is the cost of the shortest path through the whole search space so far and s(q) is the cost of the shortest path passing through state q, the pruning rule is

∀q ∈ R : s(q) > s_o + t, discard q.   (3)
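As a concrete, hypothetical illustration of rule (3), the following sketch applies the beam threshold to a toy set of active states with made-up path costs:

```python
# Minimal sketch of the pruning rule (3): keep only states whose best partial-path
# cost s(q) is within the beam t of the current best cost s_o; everything else is discarded.

def beam_prune(active, t):
    """active maps state -> s(q); returns the states that survive the beam."""
    s_o = min(active.values())                      # cost of the best path found so far
    return {q: s for q, s in active.items() if s <= s_o + t}

active = {"q1": 120.0, "q2": 123.5, "q3": 141.2}    # toy path costs (negative log scores)
print(beam_prune(active, t=10.0))                   # {'q1': 120.0, 'q2': 123.5}; q3 is pruned
```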


Recognition Cascade

Combination of components

Viterbi approximation

[Figure: recognition cascade: observ. seq. → HMM → CD phone seq. → CD model → phoneme seq. → pron. model → word seq. → lang. model → word seq.]

w = argmax_w Σ_{d,c,p} Pr[o | d] Pr[d | c] Pr[c | p] Pr[p | w] Pr[w]

  ≈ argmax_w max_{d,c,p} Pr[o | d] Pr[d | c] Pr[c | p] Pr[p | w] Pr[w].
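A tiny numeric illustration of the Viterbi approximation above, with made-up path probabilities (the specific numbers are assumptions):

```python
# The exact formulation sums the joint probability over all hidden sequences (d, c, p);
# the Viterbi approximation keeps only the single best hidden sequence.
path_probs = [0.020, 0.015, 0.001]      # toy joint probabilities of three hidden paths for one w
exact_score = sum(path_probs)           # marginalize over hidden sequences: 0.036
viterbi_score = max(path_probs)         # best single path only: 0.020
print(exact_score, viterbi_score)       # the ranking of candidate word sequences is often preserved
```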


Statistical Formulation

Observation sequence produced by signal processing system:

Sequence of words over alphabet Σ:

Formulation (maximum a posteriori decoding):

o = o_1 … o_m.

w = w_1 … w_k.

w = argmax_{w ∈ Σ*} Pr[w | o]
  = argmax_{w ∈ Σ*} Pr[o | w] Pr[w] / Pr[o]
  = argmax_{w ∈ Σ*} Pr[o | w] · Pr[w],

where Pr[o | w] is the acoustic and pronunciation model and Pr[w] is the language model.

(Bahl, Jelinek, and Mercer, 1983)


Page 3:

Discriminative Training

• Previous work on discriminative training in speech

• Minimum Classification Error (e.g., [Juang et al. ‘97]): train acoustic models w/ discriminative criterion

• Discriminative learning of language models (e.g., [Roark et al. ‘06])

• Other work extends this to training entire “CLG”

• [Lin and Yvon ‘05]: Construct the full constraint graph, train weights to minimize error

• Present paper: same technique; larger-scale experiments


Page 4:

Discriminative Formulation

• Let Γ be the set of transition weights in C ∘ L ∘ G and Λ be the set of acoustic model parameters

• Given observations X and a word sequence W, the log-prob of path S is g(X, W, S; Λ, Γ) = α · a(X, W, S; Λ) + b(W, S; Γ)

• Decoding problem: find the best word sequence W_1 = argmax_{W,S} g(X, W, S; Λ, Γ)

• If W_0 is the correct transcription, a discriminant function is d(X; Λ, Γ) = −g(X, W_0, S_0; Λ, Γ) + g(X, W_1, S_1; Λ, Γ)


DISCRIMINATIVE TRAINING OF DECODING GRAPHS FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

Hong-Kwang Jeff Kuo, Brian Kingsbury
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
{hkuo,bedk}@us.ibm.com

Geoffrey Zweig
Microsoft Research, Redmond, WA
[email protected]

ABSTRACT

Finite-state decoding graphs integrate the decision trees, pronunciation model and language model for speech recognition into a unified representation of the search space. We explore discriminative training of the transition weights in the decoding graph in the context of large vocabulary speech recognition. In preliminary experiments on the RT-03 English Broadcast News evaluation set, the word error rate was reduced by about 5.7% relative, from 23.0% to 21.7%. We discuss how this method is particularly applicable to low-latency and low-resource applications such as real-time closed captioning of broadcast news and interactive speech-to-speech translation.

Index Terms— Discriminative training, Finite-state decoding graph, Language model, Pronunciation model, Low-resource speech recognition.

1. INTRODUCTION

In recent years, it has become popular to use an integrated finite-state decoding graph as a pre-compiled search space for efficient decoding for large-vocabulary speech recognition [1]. This decoding graph can be thought of as a finite-state machine that results from the composition of a few weighted finite-state transducers (wFSTs) that incorporate the statistical language model (LM), the pronunciation model, and the decision trees that expand context-independent phones to context-dependent units. With appropriate optimizations, the decoding graph can be made efficient for speech decoding.

Discriminative training of the language model for speech recognition has become an active area of research [2, 3, 4, 5, 6]. The motivation is clear: instead of using maximum likelihood to estimate the LM probabilities, the LM parameters are trained on speech data and corresponding transcripts to minimize the actual speech recognition error rate. In the framework of speech recognizers using an integrated finite-state decoding graph, one can either discriminatively train the LM before constructing the decoding graph or, as recently proposed [7], one can create the decoding graph and then discriminatively train the transition weights in the graph.

Potential advantages of training the decoding graph instead of just the language model include the following. First, the decoding graph combines several models (the language model, pronunciation model, decision trees, silence insertion penalty, etc.) and in some cases it would be better to perform end-to-end optimization of the combined model rather than just one model separately. In addition, it is possible to learn LM-context-dependent pronunciation probabilities, e.g. "for the record" vs. "need to record."

In this paper, we extend previous work on discriminative graph training [7] to large vocabulary continuous speech recognition, using context-dependent acoustic models instead of context-independent models. All transition weights in the decoding graph are adjustable except for zero-cost self loops. We also describe methods using FST tools to make it possible to perform discriminative graph training on a large decoding graph and a large amount of training data.

2. DISCRIMINATIVE TRAINING

In this section, we describe how the transition weights of an integrated finite-state decoding graph are adjusted discriminatively to improve the score separation of the correct word sequence from the competing word sequence hypothesis, using the Minimum Classification Error (MCE) criterion [8, 9]. The treatment is similar to [7], with one difference being that we use context-dependent acoustic models, so the sequences are context-dependent state sequences.

Note that the integrated decoding graph (call it G) is essentially a classical Hidden Markov Model (HMM) with two sets of parameters: the acoustic model Λ, consisting of the Gaussian densities in the HMM states, and the transition weights Γ, which specify the costs of transitions between the HMM states. The decoding graph can be constructed according to procedures described in [1, 10], which involve FST composition and optimization of the language model, pronunciation model, and decision trees of the context-dependent HMM. The language model is a back-off n-gram language model, trained using the conventional maximum likelihood criterion and appropriate smoothing such as modified Kneser-Ney. The pronunciation probabilities may be arbitrarily set to uniform or may be based on estimates from aligning pronunciation variants ("lexemes") to the speech training data.

Given an acoustic observation sequence X = x_1, x_2, ..., x_t representing the speech signal and a word sequence W, the conditional likelihood of X is approximated as the score of the best path S = S_1, S_2, ..., S_t through G for input X and output W. We define a discriminant function to be this score, which is a weighted combination of the sum of acoustic log likelihoods a(X, W, S; Λ) and transition weights b(W, S; Γ):

g(X, W, S; Λ, Γ) = α · a(X, W, S; Λ) + b(W, S; Γ),   (1)

where α is the acoustic model weight. Note that the path S is a sequence of hidden states that actually specify a lexeme (specific pronunciation variant of a word) sequence as well as a leaf (context-dependent HMM state) sequence. The sum of the transition weights of the path includes the sum of the language and pronunciation model log probabilities of the associated lexeme sequence (plus other parameters such as word or silence insertion penalties, etc.).
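As a concrete reading of Eq. (1), here is a minimal, hypothetical sketch (the function name and the toy numbers are assumptions, not the authors' code) of the discriminant score as a weighted combination of the acoustic log-likelihoods along a path and the transition weights on that path:

```python
# g(X, W, S; Lambda, Gamma) = alpha * a(X, W, S; Lambda) + b(W, S; Gamma), per Eq. (1).
def discriminant_score(acoustic_logliks, transition_weights, alpha):
    """Weighted combination of per-frame acoustic log-likelihoods and per-arc transition
    weights (LM + pronunciation log-probabilities, insertion penalties) along one path S."""
    return alpha * sum(acoustic_logliks) + sum(transition_weights)

# Toy values for a three-frame path (made up for illustration):
print(discriminant_score([-12.3, -9.8, -11.1], [-2.3, -0.7, -1.9], alpha=0.1))  # roughly -8.22
```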


A common strategy for a speech recognizer is to search for the word sequence W_1 with the largest value for this function:

W_1 = argmax_{W,S} g(X, W, S; Λ, Γ).   (2)

Let W_0 be the known correct word sequence. The misclassification function is defined to be the difference between the discriminant function and the anti-discriminant function, which is normally an L_p norm weighted combination of the N-best competing hypotheses [3]. For simplicity, we follow [7] and just consider the decoded (single best) hypothesis W_1. Let the misclassification function be

d(X; Λ, Γ) = −g(X, W_0, S_0; Λ, Γ) + g(X, W_1, S_1; Λ, Γ).   (3)

When this misclassification function is strictly positive, a sentence recognition error has been made. To formulate an error function appropriate for gradient descent optimization, a smooth, differentiable function ranging from 0 to 1 such as the sigmoid function is chosen to be the class loss function for a specific utterance X_i:

l_i(X_i) = l(d(X_i)) = 1 / (1 + exp(−γ d(X_i) + θ)),   (4)

where γ and θ are constants which control the slope and the shift of the sigmoid function, respectively. Our objective is to minimize the loss function over all utterances in the training corpus:

l(X) = Σ_i l_i(X_i).   (5)

The transition-weight parameters can be adjusted iteratively (with step size ε) to minimize the objective function using the following update equation:

Γ_{t+1} = Γ_t − ε ∇l(X; Λ_t, Γ_t).   (6)

The gradient of the loss function is

∇l(X; Λ_t, Γ_t) = Σ_i (∂l_i / ∂d_i) · (∂d(X_i; Λ, Γ) / ∂Γ),   (7)

where the first term is the slope associated with the sigmoid class-loss function and is given by:

∂l_i / ∂d_i = γ l(d_i)(1 − l(d_i)).   (8)
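To make Eqs. (3), (4) and (8) concrete, a minimal sketch follows; the function name and the toy scores are assumptions, not the authors' code:

```python
import math

# MCE ingredients for one utterance: the misclassification score d (Eq. 3), the sigmoid
# class loss l (Eq. 4), and the slope dl/dd (Eq. 8) that gates the utterance's influence.
def mce_loss(g_ref, g_hyp, gamma, theta):
    d = -g_ref + g_hyp                                # Eq. (3): d > 0 means a sentence error
    l = 1.0 / (1.0 + math.exp(-gamma * d + theta))    # Eq. (4): smooth 0..1 loss
    slope = gamma * l * (1.0 - l)                     # Eq. (8): derivative of l w.r.t. d
    return d, l, slope

print(mce_loss(g_ref=-250.0, g_hyp=-248.5, gamma=0.05, theta=0.0))  # small positive d -> loss just above 0.5
```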

If we regard Γ as a vector of transition weights s_j, to compute ∂d(X_i; Λ, Γ)/∂Γ we can take the partial derivatives with respect to each s_j. Using the definition of d in Equation 3 and after working out the mathematics, we get:

∂d(X_i; Λ, Γ) / ∂s_j = −I(W_0, s_j) + I(W_1, s_j),   (9)

where I(W, s_j) denotes the number of times the transition s_j is taken in the best aligned path of X_i to the word sequence W.

For each utterance in the training data, the algorithm is counting the transitions for the correct string and the decoded hypothesis. Transitions for the correct string increase the corresponding transition weights, while those for the decoded string decrease the weights. The amount of increase or decrease is proportional to the step size ε, the value of the slope of the sigmoid function and the difference in the number of times the transition appears. The slope of the sigmoid function is close to 0 for very large positive d, so little adjustment is made for a sentence for which the total score of the correct string is much worse than the score of the competing string. This decreases the effect of outliers, for example of utterances whose transcripts are erroneous. Notice that the only dependence on the acoustic scores (more specifically, the difference in the total path scores) in the equations is in the slope ∂l_i/∂d_i, which determines how much influence a particular training sample has in updating the parameters.
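The counting view above can be sketched directly. This is a hypothetical illustration of Eqs. (6), (7) and (9) (helper names and toy transition IDs are assumptions), reusing the slope computed by the mce_loss sketch earlier:

```python
from collections import Counter

def accumulate_gradient(gradient, ref_transitions, hyp_transitions, slope):
    """Per-utterance contribution to Eq. (7): slope * (I(W1, s_j) - I(W0, s_j)) for each s_j (Eq. 9)."""
    ref_counts, hyp_counts = Counter(ref_transitions), Counter(hyp_transitions)
    for s_j in set(ref_counts) | set(hyp_counts):
        gradient[s_j] = gradient.get(s_j, 0.0) + slope * (hyp_counts[s_j] - ref_counts[s_j])

def gradient_descent_update(weights, gradient, epsilon):
    """Eq. (6): Gamma_{t+1} = Gamma_t - epsilon * gradient."""
    for s_j, g in gradient.items():
        weights[s_j] = weights.get(s_j, 0.0) - epsilon * g

weights, gradient = {"t1": -1.0, "t2": -2.0, "t3": -0.5}, {}
accumulate_gradient(gradient, ref_transitions=["t1", "t2", "t2"], hyp_transitions=["t1", "t3"], slope=0.02)
gradient_descent_update(weights, gradient, epsilon=0.5)
print(weights)  # t2 (reference-only) moves up, t3 (decoded-only) moves down, t1 is unchanged
```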

With gradient descent optimization, there is a choice of batch mode, which collects the statistics over all the training data before making an update to the model, or online mode, where the model is updated after processing each training sample and typically the sample order is randomized. Although online mode may result in faster convergence, batch mode has the advantage of allowing for parallelism in collecting statistics of the training data. In this paper, we use batch mode training.

3. FST IMPLEMENTATION

In this section, we describe a simple and elegant method for implementing discriminative training for large vocabulary decoding graphs using weighted finite-state transducers. We use an internal IBM FSM toolkit [11], with functionality similar to the publicly available AT&T toolkit [12].

It is easy to instrument a Viterbi decoder to count state transitions (just count on the backtrace). However, it is not obvious how the reference aligns to the decoding graph, especially because of LM backoff arcs. Therefore, we treat both cases in the same way: we produce leaf sequences for the reference and decoded word sequences, and then use a decoder-derived transducer that reads leaves and outputs state transitions to do the counting.

The algorithm consists of the following steps:

1. For each sentence in the training data, find the reference leaf sequence by aligning the reference transcript to the speech data. Encode the leaf sequence as an FSM and attach a dummy arc to store the acoustic score.

2. Decode training speech data using the decoding graph. For each sentence, find the decoded leaf sequence. Construct the FSM and attach a dummy arc to store the acoustic score.

3. Construct a transducer from the decoding graph to transform leaf sequences to state transition sequences, with associated transition weights. (The transition weights will be used later to calculate the path score of the leaf sequence.) Apply the transducer to both reference and decoded leaf sequences (a toy sketch of this step follows the list).

4. For each sentence, count transitions in the reference and decoded sequences (Equation 9), and weight the count for each transition by the derivative of the sigmoid class loss function (Equation 8) using the difference in total path scores (Equation 3). Accumulate over all utterances (Equation 7) to compute the gradient.

5. Update weights in the decoding graph based on the gradient.

6. Repeat from step 2 until the performance on a held-out set converges.
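For step 3, here is a toy, deliberately simplified sketch of what the leaf-to-transition transducer accomplishes. The deterministic arcs, the arc format, and the labels are assumptions; real decoding graphs require proper FST composition rather than a dictionary lookup.

```python
# Reading a leaf sequence through a (toy, deterministic) decoding graph yields both the
# sequence of transitions taken and their summed weight b(W, S; Gamma) for the path score.
def leaves_to_transitions(arcs, start, leaf_seq):
    by_label = {(src, leaf): (arc_id, w, dst) for arc_id, (src, leaf, w, dst) in enumerate(arcs)}
    state, transitions, total_weight = start, [], 0.0
    for leaf in leaf_seq:
        arc_id, w, dst = by_label[(state, leaf)]   # deterministic lookup; real graphs need search
        transitions.append(arc_id)
        total_weight += w
        state = dst
    return transitions, total_weight

arcs = [(0, "leaf_a", -0.7, 1), (1, "leaf_b", -1.2, 2), (1, "leaf_c", -2.0, 2)]
print(leaves_to_transitions(arcs, 0, ["leaf_a", "leaf_b"]))   # ([0, 1], approx -1.9)
```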

There are a variety of methods to make the updates based on the gradient [13]. We tried regular gradient descent (Equation 6) and Quickprop [14]. For Quickprop, Equations 7–9 are still the same; the only difference is in the update:

Γ_{t+1} = Γ_t − [(∇²l(Γ))^{-1} + ε] ∇l(X; Λ_t, Γ_t),   (10)

where the Hessian ∇²l(Γ) is assumed to be diagonal [13, 14].
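A minimal sketch of the Eq. (10) update under the diagonal-Hessian assumption. How the diagonal second derivatives are estimated (e.g., from successive gradient differences, as in Quickprop) is outside this sketch, so they are simply passed in; the toy numbers are assumptions.

```python
# Eq. (10) with a diagonal Hessian: Gamma_{t+1} = Gamma_t - [(H_jj)^-1 + epsilon] * grad_j.
def quickprop_update(weights, gradient, hessian_diag, epsilon):
    return {s_j: weights.get(s_j, 0.0) - (1.0 / hessian_diag[s_j] + epsilon) * g
            for s_j, g in gradient.items()}

print(quickprop_update({"t2": -2.0, "t3": -0.5},          # current transition weights
                       {"t2": -0.04, "t3": 0.02},         # accumulated gradient
                       {"t2": 0.5, "t3": 0.5},            # assumed diagonal Hessian entries
                       epsilon=0.5))                      # -> roughly {'t2': -1.9, 't3': -0.55}
```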


Page 5:

Loss Function; Gradient

• For utterance X_i, the loss is l_i(X_i) = l(d(X_i)) = 1 / (1 + exp(−γ d(X_i) + θ))

• Smooth, differentiable, 0 to 1 range function

• For a training set X, total loss is l(X) = Σ_i l_i(X_i)

• Gradient of the loss is ∇l(X; Λ_t, Γ_t) = Σ_i (∂l_i/∂d_i)(∂d(X_i; Λ, Γ)/∂Γ)

• Γ is a vector of transition weights s_j: ∂d(X_i; Λ, Γ)/∂s_j = −I(W_0, s_j) + I(W_1, s_j)



∂l_i/∂d_i = [−exp(−γ d_i + θ) · (−γ)] / (1 + exp(−γ d_i + θ))²
          = γ · exp(−γ d_i + θ) / (1 + exp(−γ d_i + θ))²
          = γ l_i (1 − l_i)

A common strategy for a speech recognizer is to search for the

word sequenceW1 with the largest value for this function:

W1 = argmaxW,S

g(X, W, S;!, "). (2)

LetW0 be the known correct word sequence. The misclassifica-

tion function is defined to be the difference between the discriminant

function and the anti-discriminant function, which is normally an Lp

norm weighted combination of the N-best competing hypotheses [3].

For simplicity, we follow [7] and just consider the decoded (single

best) hypothesisW1. Let the misclassification function be

d(X;!, ") = !g(X, W0, S0; !, ") + g(X, W1, S1; !, ").(3)

When this misclassification function is strictly positive, a sen-

tence recognition error has been made. To formulate an error func-

tion appropriate for gradient descent optimization, a smooth, differ-

entiable function ranging from 0 to 1 such as the sigmoid function is

chosen to be the class loss function for a specific utteranceXi:

li(Xi) = l(d(Xi)) =1

1 + exp(!!d(Xi) + "), (4)

where ! and " are constants which control the slope and the shift ofthe sigmoid function, respectively. Our objective is to minimize the

loss function over all utterances in the training corpus:

l(X) =X

i

li(Xi). (5)

The transition-weight parameters can be adjusted iteratively

(with step size #) to minimize the objective function using the fol-lowing update equation:

"t+1 = "t ! #"l(X;!t, "t). (6)

The gradient of the loss function is

"l(X;!t, "t) =X

i

$li$di

$d(Xi; !, ")$"

, (7)

where the first term is the slope associated with the sigmoid class-

loss function and is given by:

$li$di

= !l(di)(1 ! l(di)). (8)

If we regard " as a vector of transition weights sj , to compute!d(Xi;!,")

!" , we can take the partial derivatives with respect to each

sj . Using the definition of d in Equation 3 and after working out themathematics, we get:

$d(Xi; !, ")$sj

= !I(W0, sj) + I(W1, sj), (9)

where I(W, sj) denotes the number of times the transition Sj is

taken in the best aligned path ofXi to the word sequenceW .

For each utterance in the training data, the algorithm is count-

ing the transitions for the correct string and the decoded hypothesis.

Transitions for the correct string increase the corresponding transi-

tion weights, while those for the decoded string decrease the weights.

The amount of increase or decrease is proportional to the step size #,the value of the slope of the sigmoid function and the difference in

the number of times the transition appears. The slope of the sigmoid

function is close to 0 for very large positive d, so little adjustment is

made for a sentence for which the total score of the correct string is

much worse than the score of the competing string. This decreases

the effect of outliers, for example of utterances whose transcripts are

erroneous. Notice that the only dependence on the acoustic scores

(more specifically, the difference in the total path scores) in the equa-

tions is in the slope!li!di, which determines how much influence a

particular training sample has in updating the parameters.

With gradient descent optimization, there is a choice of batch

mode, which collects the statistics over all the training data before

making an update to the model, or online mode, where the model

is updated after processing each training sample and typically the

sample order is randomized. Although online mode may result in

faster convergence, batch mode has the advantage of allowing for

parallelism in collecting statistics of the training data. In this paper,

we use batch mode training.

3. FST IMPLEMENTATION

In this section, we describe a simple and elegant method for im-

plementing discriminative training for large vocabulary decoding

graphs using weighted finite-state transducers. We use an internal

IBM FSM toolkit [11], with functionality similar to the publicly

available AT&T toolkit [12].

It is easy to instrument a Viterbi decoder to count state tran-

sitions (just count on the backtrace). However, it is not obvious

how the reference aligns to the decoding graph, especially because

of LM backoff arcs. Therefore, we treat both cases in the same way:

we produce leaf sequences for the reference and decoded word se-

quences, and then use a decoder-derived transducer that reads leaves

and outputs state transitions to do the counting.

The algorithm consists of the following steps:

1. For each sentence in the training data, find the reference leaf

sequence by aligning the reference transcript to the speech

data. Encode the leaf sequence as an FSM and attach a

dummy arc to store the acoustic score.

2. Decode training speech data using the decoding graph. For

each sentence, find the decoded leaf sequence. Construct the

FSM and attach a dummy arc to store the acoustic score.

3. Construct a transducer from the decoding graph to transform

leaf sequences to state transition sequences, with associated

transition weights. (The transition weights will be used later

to calculate the path score of the leaf sequence.) Apply the

transducer to both reference and decoded leaf sequences.

4. For each sentence, count transitions in the reference and de-

coded sequences (Equation 9), and weight the count for each

transition by the derivative of the sigmoid class loss function

(Equation 8) using the difference in total path scores (Equa-

tion 3). Accumulate over all utterances (Equation 7) to com-

pute the gradient.

5. Update weights in the decoding graph based on the gradient.

6. Repeat from step 2 until the performance on a held-out set

converges.

There are a variety of methods to make the updates based on the

gradient [13]. We tried regular gradient descent (Equation 6) and

Quickprop [14]. For Quickprop, Equations 7–9 are still the same;

the only difference is in the update:

"t+1 = "t ! [(#2l("))!1 + #]"l(X;!t, "t), (10)

where the Hessian#2l(") is assumed to be diagonal [13, 14].

A common strategy for a speech recognizer is to search for the

word sequenceW1 with the largest value for this function:

W1 = argmaxW,S

g(X, W, S;!, "). (2)

LetW0 be the known correct word sequence. The misclassifica-

tion function is defined to be the difference between the discriminant

function and the anti-discriminant function, which is normally an Lp

norm weighted combination of the N-best competing hypotheses [3].

For simplicity, we follow [7] and just consider the decoded (single

best) hypothesisW1. Let the misclassification function be

d(X;!, ") = !g(X, W0, S0; !, ") + g(X, W1, S1; !, ").(3)

When this misclassification function is strictly positive, a sen-

tence recognition error has been made. To formulate an error func-

tion appropriate for gradient descent optimization, a smooth, differ-

entiable function ranging from 0 to 1 such as the sigmoid function is

chosen to be the class loss function for a specific utteranceXi:

li(Xi) = l(d(Xi)) =1

1 + exp(!!d(Xi) + "), (4)

where ! and " are constants which control the slope and the shift ofthe sigmoid function, respectively. Our objective is to minimize the

loss function over all utterances in the training corpus:

l(X) =X

i

li(Xi). (5)

The transition-weight parameters can be adjusted iteratively

(with step size #) to minimize the objective function using the fol-lowing update equation:

"t+1 = "t ! #"l(X;!t, "t). (6)

The gradient of the loss function is

"l(X;!t, "t) =X

i

$li$di

$d(Xi; !, ")$"

, (7)

where the first term is the slope associated with the sigmoid class-

loss function and is given by:

$li$di

= !l(di)(1 ! l(di)). (8)

If we regard " as a vector of transition weights sj , to compute!d(Xi;!,")

!" , we can take the partial derivatives with respect to each

sj . Using the definition of d in Equation 3 and after working out themathematics, we get:

$d(Xi; !, ")$sj

= !I(W0, sj) + I(W1, sj), (9)

where I(W, sj) denotes the number of times the transition Sj is

taken in the best aligned path ofXi to the word sequenceW .

For each utterance in the training data, the algorithm is count-

ing the transitions for the correct string and the decoded hypothesis.

Transitions for the correct string increase the corresponding transi-

tion weights, while those for the decoded string decrease the weights.

The amount of increase or decrease is proportional to the step size #,the value of the slope of the sigmoid function and the difference in

the number of times the transition appears. The slope of the sigmoid

function is close to 0 for very large positive d, so little adjustment is

made for a sentence for which the total score of the correct string is

much worse than the score of the competing string. This decreases

the effect of outliers, for example of utterances whose transcripts are

erroneous. Notice that the only dependence on the acoustic scores

(more specifically, the difference in the total path scores) in the equa-

tions is in the slope!li!di, which determines how much influence a

particular training sample has in updating the parameters.

With gradient descent optimization, there is a choice of batch

mode, which collects the statistics over all the training data before

making an update to the model, or online mode, where the model

is updated after processing each training sample and typically the

sample order is randomized. Although online mode may result in

faster convergence, batch mode has the advantage of allowing for

parallelism in collecting statistics of the training data. In this paper,

we use batch mode training.

3. FST IMPLEMENTATION

In this section, we describe a simple and elegant method for im-

plementing discriminative training for large vocabulary decoding

graphs using weighted finite-state transducers. We use an internal

IBM FSM toolkit [11], with functionality similar to the publicly

available AT&T toolkit [12].

It is easy to instrument a Viterbi decoder to count state tran-

sitions (just count on the backtrace). However, it is not obvious

how the reference aligns to the decoding graph, especially because

of LM backoff arcs. Therefore, we treat both cases in the same way:

we produce leaf sequences for the reference and decoded word se-

quences, and then use a decoder-derived transducer that reads leaves

and outputs state transitions to do the counting.

The algorithm consists of the following steps:

1. For each sentence in the training data, find the reference leaf

sequence by aligning the reference transcript to the speech

data. Encode the leaf sequence as an FSM and attach a

dummy arc to store the acoustic score.

2. Decode training speech data using the decoding graph. For

each sentence, find the decoded leaf sequence. Construct the

FSM and attach a dummy arc to store the acoustic score.

3. Construct a transducer from the decoding graph to transform

leaf sequences to state transition sequences, with associated

transition weights. (The transition weights will be used later

to calculate the path score of the leaf sequence.) Apply the

transducer to both reference and decoded leaf sequences.

4. For each sentence, count transitions in the reference and de-

coded sequences (Equation 9), and weight the count for each

transition by the derivative of the sigmoid class loss function

(Equation 8) using the difference in total path scores (Equa-

tion 3). Accumulate over all utterances (Equation 7) to com-

pute the gradient.

5. Update weights in the decoding graph based on the gradient.

6. Repeat from step 2 until the performance on a held-out set

converges.

There are a variety of methods to make the updates based on the

gradient [13]. We tried regular gradient descent (Equation 6) and

Quickprop [14]. For Quickprop, Equations 7–9 are still the same;

the only difference is in the update:

"t+1 = "t ! [(#2l("))!1 + #]"l(X;!t, "t), (10)

where the Hessian#2l(") is assumed to be diagonal [13, 14].


Page 6: Computing Gradient

Computing Gradient

• $F_i$: HMM state sequence of the best forced-alignment path of training utterance $X_i$ to the correct word sequence

• $B_i$: HMM state sequence of the best full search path

• $T$: transducer mapping an HMM state sequence to a transition sequence (each transition in CLG gets a distinct label)

• $I(W_0, s_j)$ and $I(W_1, s_j)$: counts of $s_j$ in $F_i \circ T$ and $B_i \circ T$

• Gradient descent training: with $\lambda = (s_1, s_2, \ldots, s_k)$, the per-utterance gradient is $\frac{\partial d_i}{\partial \lambda} = \left( \frac{\partial d_i}{\partial s_1}, \frac{\partial d_i}{\partial s_2}, \ldots, \frac{\partial d_i}{\partial s_k} \right)$ (Equations 6–9)

• Alternative update: Quickprop (Equation 10)

6
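Combining Equations 7–9 of the paper (with $\gamma$ the sigmoid slope constant of Equation 4), the accumulated gradient for a single transition weight $s_j$ can be written out as:

$$\frac{\partial l(X)}{\partial s_j} = \sum_i \frac{\partial l_i}{\partial d_i}\,\frac{\partial d(X_i;\Lambda,\lambda)}{\partial s_j} = \sum_i \gamma\, l(d_i)\bigl(1 - l(d_i)\bigr)\,\bigl[\, I(W_1, s_j) - I(W_0, s_j) \,\bigr],$$

so, under the descent update of Equation 6, each utterance pushes $s_j$ up when the reference path uses it more often than the decoded path and down otherwise, scaled by how close that utterance is to the decision boundary.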



Page 7: Experiments

Experiments

• Language model: trained on 132M-word Broadcast News

• Small bigram model to make training feasible

• Large 4-gram model used for lattice rescoring

• Acoustic model: trained on 143-hour BN corpus

• Test audio: 2.93 hours, 25K words

• Effect of training set size:

7

4. EXPERIMENTAL SETUP

The experiments are done using a speaker-independent, English

broadcast news recognition system. The language model used to

build the decoding graph is trained on a 132M word corpus com-

prising the 1996 English Broadcast News Transcripts (LDC97T22),

the 1997 English Broadcast News Transcripts (LDC98T28) and the

1996 CSR Hub4 Language Model data (LDC98T31). It is pruned

to a bigram language model with 61K unigrams and 204K bigrams.

This very small model size was necessary to turn around multiple

experiments. A larger language model was used in lattice rescoring

experiments. It was trained on the same 132M word corpus, but was

a back-off 4-gram model containing 61K unigrams, 5.7M bigrams,

14M trigrams, and 8.7M 4-grams.

The acoustic model is trained on a 143-hour corpus com-

prising the 1996 English Broadcast News Speech collection

(LDC97S44) and the 1997 English Broadcast News Speech collec-

tion (LDC98S71). The recognition features are 40-d vectors com-

puted via an LDA+MLLT projection of 9 spliced frames of 19-d

PLP features. The raw PLP features are normalized using utterance-

based cepstral mean subtraction. In total, the acoustic model in-

cludes 6000 quinphone context dependent states and 120K Gaussian

mixture components. The acoustic model is trained using maximum

likelihood estimation.
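As a rough sketch of this front end (utterance-based cepstral mean subtraction of 19-d PLP features, splicing of 9 consecutive frames, and a 40-dimensional LDA+MLLT projection), assuming a precomputed 40x171 projection matrix is supplied and leaving PLP extraction itself out; the function name and the edge-padding convention at utterance boundaries are mine, not details given in the paper:

    import numpy as np

    def splice_and_project(plp_frames, projection, context=4):
        """plp_frames: (T, 19) utterance-level PLP features.
        projection: (40, 171) LDA+MLLT matrix estimated during AM training.
        Returns (T, 40) recognition features."""
        # Utterance-based cepstral mean subtraction on the raw PLP features.
        normalized = plp_frames - plp_frames.mean(axis=0, keepdims=True)
        T, _ = normalized.shape
        # Splice +/- 4 neighboring frames (9 frames -> 9 * 19 = 171 dims),
        # repeating the edge frames at utterance boundaries.
        padded = np.pad(normalized, ((context, context), (0, 0)), mode="edge")
        spliced = np.stack(
            [padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])
        return spliced @ projection.T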

For testing, we use the RT-03 English Broadcast News evalu-

ation set, a collection of six news broadcasts from 2001. The total

duration of the test audio is 2.93 hours, and the total number of words

in the reference transcripts is 24790. The segmentation of the test set

audio was derived from the reference transcripts.

5. RESULTS

The first practical issue that arises when doing discriminative train-

ing of decoding graphs is that the reference transcripts of the train-

ing data may not match the decoding graph. There may be out-

of-vocabulary (OOV) words in the training data due to vocabulary

pruning during LM training – we can do nothing about these words.

However, some OOVs may be correctable, e.g. in the reference

“um” and “uh” may be separate words while in the decoding graph

they are represented as one word with multiple pronunciations, or

OOVs may be caused by typographical errors or spelling variations.

The first step is to normalize the high frequency OOVs to better

match the decoding graph vocabulary.
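A minimal sketch of this kind of transcript normalization; the mapping table below is purely illustrative (only the "um"/"uh" case is mentioned in the text, and the spelling-variant entries are hypothetical examples):

    # Hypothetical normalization table: maps high-frequency reference tokens that
    # are OOV with respect to the decoding graph onto in-vocabulary spellings.
    NORMALIZATION = {
        "uh": "um",          # filled pauses collapsed onto one vocabulary entry
        "ok": "okay",        # hypothetical spelling-variant example
        "theatre": "theater",  # hypothetical spelling-variant example
    }

    def normalize_transcript(words, vocabulary, table=NORMALIZATION):
        """Rewrite reference words to match the decoding-graph vocabulary where
        a mapping is known; leave genuinely unmappable OOVs untouched."""
        out = []
        for w in words:
            if w in vocabulary:
                out.append(w)
            elif w.lower() in table and table[w.lower()] in vocabulary:
                out.append(table[w.lower()])
            else:
                out.append(w)  # nothing we can do about this OOV
        return out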

Because training over all the data takes a long time, we per-

formed some preliminary experiments on a subset of the training

data. Since there is a data weighting effect due to the derivative of

the sigmoid in Equation 8, an intelligent method is to select the sub-

set based on a range of the misclassification function (Equation 3).

By picking d < 2.0, we ended up with about 5K training utterances.

Using this training set, we first explored the difference between
gradient descent and Quickprop for updating the transition weights.
Choosing the proper learning rate ($\epsilon$) is always tricky for gradient descent and to a lesser degree for Quickprop. For gradient descent, the

learning rate was chosen by considering the largest absolute value of

the gradient, in order to constrain the maximum weight update to be

on the order of 0.1.
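The two heuristics just described, selecting utterances by a threshold on the misclassification score d and scaling the learning rate so that the largest single weight update is about 0.1, might look like the following sketch (data structures and names are illustrative):

    def select_training_subset(utterance_d_values, threshold=2.0):
        """Keep utterances whose misclassification score d (Equation 3) falls
        below the threshold, as in the d < 2.0 / 10.0 / 5000.0 subsets."""
        return [utt for utt, d in utterance_d_values.items() if d < threshold]

    def choose_learning_rate(gradient, max_update=0.1):
        """Pick eps so that the largest absolute weight change is ~max_update."""
        largest = max(abs(g) for g in gradient.values())
        return max_update / largest if largest > 0 else 0.0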

Figure 1 shows the convergence rate of simple gradient descent

and Quickprop. The objective function and word error rate (WER)

for the training data, as well as the test data WER, generally decrease

with each epoch of training. Quickprop seems to achieve convergence faster than gradient descent, although the test error curve is a

bit more bumpy at some points.

[Figure 1: three panels plot the training-set objective function, the training-set WER (%), and the test-set WER (%) against the number of training epochs (0–40), each comparing gradient descent with Quickprop.]

Fig. 1. Convergence rates for gradient descent and Quickprop

Baseline WER: 23.0%

d threshold    No. training sentences    WER (%)
2.0            5538                      22.3
10.0           24492                     21.8
5000.0         56194                     21.7

Table 1. Effect of amount of training data on WER

In Table 1, we show the effects of discriminative graph training
on the WER, for various amounts of training data. Even with 5538

sentences, we are able to get a significant improvement (of 0.7%)

because of the way we have chosen them: we chose the sentences

that are likely to have the largest contribution to the training. With

more (25K sentences), there is further improvement. Based on the

best results in the table, starting from a baseline WER of 23.0%,

discriminative graph training was able to reduce the WER to 21.7%,

representing an absolute improvement of 1.3%, or 5.7% relative.

In many large vocabulary transcription tasks it is common to

generate lattices representing a large set of possible output hypothe-

ses and then rescore the lattices with a larger, more complex lan-

guage model before producing the final output. One objection to the

discriminative training of decoding graphs is that any gains achieved

through discriminative training will disappear following such lan-

guage model rescoring. We tested this hypothesis by generating lat-

tices using the baseline decoding graph and the best decoding graph

from above (obtained with a d threshold of 5000.0) and rescoring the lattices with a much larger, 4-gram language model. The baseline system's WER drops to 17.7% after rescoring, while the discriminatively trained system's WER drops to 17.6%.

Page 8: Experiments, cont'd

Experiments, cont’d

• Discriminative training (MCE) works better than ML

• However, training infeasible for full language model

• We can first use small MCE model, then rescore with large ML model; however, benefit is lost

8

             beam    ML      MCE     Diff    %Diff
one-pass     14      23.0%   21.7%   1.3%    5.7%
             10      23.2%   21.9%   1.3%    5.6%
             9       23.8%   22.3%   1.5%    6.3%
             8       25.7%   23.7%   2.0%    7.8%
             7       33.5%   30.1%   3.4%    10.1%
LM rescored  14      17.7%   17.6%   0.1%    0.6%
             10      18.5%   18.0%   0.5%    2.7%
             9       19.4%   18.9%   0.5%    2.6%
             8       22.2%   21.0%   1.2%    5.4%
             7       31.7%   28.4%   3.3%    10.4%

Table 2. WER as a function of decoding beam width

The absolute difference in the number of errors between the two systems is 38.

Thus, the rescoring objection is upheld by these results. We note,

however, that there are a number of applications in which low sys-

tem latency is vital, such as the real-time closed captioning of news

broadcasts, or in which system resources are constrained, such as

speech recognition on handheld computers. In such low-latency and

resource constrained applications, lattice rescoring may not be pos-

sible, so techniques such as discriminative training that improve the

decoding graph itself are of practical interest.
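For scale, the 38-error gap mentioned at the start of this paragraph, measured against the 24,790-word reference, works out to

$$\frac{38}{24790} \approx 0.15\% \text{ absolute},$$

which is consistent, after rounding, with the 17.7% versus 17.6% WERs reported above.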

Furthermore, discriminatively trained decoding graphs appear

to have better pruning behavior. Table 2 shows how WER changes

as the beam width for decoding is decreased to reduce computation

and memory requirements. The advantage of using an MCE-trained

decoding graph is even more apparent when very low beam widths

are used. For example, with a beam of 8, the improvement in WER

is increased to 2.0% absolute.
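The beam width in Table 2 bounds how far a partial hypothesis may fall behind the current best path before it is discarded during Viterbi search. A generic sketch of that pruning test (not the paper's decoder), assuming log-domain path scores where lower is better:

    def prune_hypotheses(active, beam):
        """Keep only partial hypotheses whose score is within `beam` of the best.

        `active` maps decoding-graph states to negative-log path scores; a smaller
        beam prunes more aggressively, saving time and memory at the risk of
        search errors (hence the WER degradation at low beams in Table 2).
        """
        if not active:
            return active
        best = min(active.values())
        return {state: score for state, score in active.items()
                if score <= best + beam}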

6. DISCUSSION AND CONCLUSIONS

We extended discriminative training of decoding graphs to large-

vocabulary speech recognition with context-dependent acoustic

models. We partially overcame challenges of large graphs and large

amounts of training data, using a simple wFST approach and achiev-

ing decent improvements on a relatively small decoding graph.

The benefit of using a discriminatively trained decoding graph

when a very large language model is available for rescoring is un-

clear at this time. (Such a language model is so large that it cannot

practically be expanded into a decoding graph.) Our current exper-

iments indicate that LM rescoring decreases the benefits of using

our discriminatively trained decoding graph. However, future im-

provements to the training paradigm may show different results. For

example, we do not know what would happen if we trained a much

larger decoding graph and then used LM rescoring. Also, our simpli-

fied framework for discriminative training does not expose the algo-

rithm to enough realistic acoustic confusions in the training data. Fu-

ture work could include holding out acoustic training data transcripts

from language model training and using lattice or N-best hypothe-

ses. Despite the LM rescoring issue, the discriminatively trained de-

coding graph is particularly useful in low-latency and low-resource

applications such as real-time closed captioning or speech-to-speech

translation, where LM rescoring is not possible.

Another line of future work is to more precisely characterize the

benefits of integrated training of the decoding graph compared with

discriminative LM training. It would also be interesting to deter-

mine whether LM context dependent pronunciation modeling, one

advantage of integrated decoding graph training, is useful for certain

applications.

7. ACKNOWLEDGMENTS

This work was partially supported by the Defense Advanced Re-

search Projects Agency under contract No. HR0011-06-2-0001. We

thank Hagen Soltau for help with the IBM ASR package Attila, and

Mohamed Afify and Alain Biem for discussions.

8. REFERENCES

[1] Mehryar Mohri, Fernando Pereira, and Michael Riley,

“Weighted finite-state transducers in speech recognition,”

Computer Speech and Language, vol. 16, no. 1, pp. 69–88,

2002.

[2] Andreas Stolcke and Mitch Weintraub, “Discriminative lan-

guage modeling,” in Proc. 9th Hub-5 Conversational Speech

Recognition Workshop, 1998.

[3] Hong-Kwang Jeff Kuo, Eric Fosler-Lussier, Hui Jiang, and

Chin-Hui Lee, “Discriminative training of language models for

speech recognition,” in Proc. ICASSP 2002, Orlando, Florida,

May 2002.

[4] Vaibhava Goel, “Conditional maximum likelihood estimation

for improving annotation performance of N-gram models in-

corporating stochastic finite state grammars,” in Proc. ICSLP

2004, Jeju Island, Korea, Oct. 2004.

[5] Brian Roark, Murat Saraclar, and Michael Collins, “Discrimi-

native n-gram language modeling,” Computer Speech and Lan-

guage, 2006, to appear.

[6] Brian Roark, Murat Saraclar, Michael Collins, and Mark John-

son, “Discriminative language modeling with conditional ran-

dom fields and the perceptron algorithm,” in Proc. ACL,

Barcelona, Spain, July 2004.

[7] Shiuan-Sung Lin and Francois Yvon, “Discriminative training

of finite state decoding graphs,” in Proc. Interspeech 2005,

Lisbon, Portugal, Sept. 2005.

[8] Shigeru Katagiri, Chin-Hui Lee, and Biing-Hwang Juang,

“New discriminative algorithm based on the generalized prob-

abilistic descent method,” in Proc. IEEE Workshop on Neu-

ral Network for Signal Processing, Princeton, Sept. 1991, pp.

299–309.

[9] Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee, “Minimum

classification error rate methods for speech recognition,” IEEE

Transactions on Speech and Audio Processing, vol. 5, no. 3,

pp. 257–265, May 1997.

[10] Stanley F. Chen, “Compiling large-context phonetic decision

trees into finite-state transducers,” in Proc. Eurospeech 2003,

Geneva, Switzerland, Sept. 2003.

[11] Stanley F. Chen, “The IBM finite-state machine toolkit,” Tech.

Rep., IBM T.J. Watson Research Center, Feb. 2000.

[12] Mehryar Mohri, Fernando C. Pereira, and Michael

Riley, “AT&T Finite-State Machine Library,”

http://www.research.att.com/~fsmtools/fsm/.

[13] Jonathan Le Roux and Eric McDermott, “Optimization meth-

ods for discriminative training,” in Proc. Interspeech 2005,

Lisbon, Portugal, Sept. 2005.

[14] Scott E. Fahlman, “An empirical study of learning speed in

back-propagation networks,” Tech. Rep. CMU-CS-88-162,

Carnegie Mellon University, Sept. 1988.

Page 9: Experiments Cont'd

Experiments Cont’d

• Quickprop results in faster convergence

9


Page 10: References

References

• Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee, "Minimum classification error rate methods for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 3, pp. 257–265, May 1997.

• Shiuan-Sung Lin and François Yvon, "Discriminative training of finite state decoding graphs," in Proc. Interspeech 2005, Lisbon, Portugal, Sept. 2005.

• Brian Roark, Murat Saraclar, and Michael Collins, "Discriminative n-gram language modeling," Computer Speech and Language, vol. 21, no. 2, pp. 373–392, 2007.

10