Discriminative Training and Acoustic Modeling for Automatic Speech Recognition - Chap. 4 Discriminative Training. Wolfgang Macherey. Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the academic degree of Doctor of Natural Sciences. Presented by Yueng-Tien, Lo.


Page 1:

Discriminative Training and Acoustic Modeling for Automatic Speech Recognition
Chap. 4 Discriminative Training

Wolfgang Macherey
Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the academic degree of Doctor of Natural Sciences

Presented by Yueng-Tien, Lo

Page 2:

Outline

• A General View of Discriminative Training Criteria
  – Extended Unifying Approach
  – Smoothed Error Minimizing Training Criteria

Page 3:

A General View of Discriminative Training Criteria

• A unifying approach for a class of discriminative training criteria was presented that allows for optimizing several objective functions – among them the Maximum Mutual Information criterion and the Minimum Classification Error criterion – within a single framework.

• The approach is extended such that it also captures other criteria more recently proposed

Page 4:

A General View of Discriminative Training Criteria

• A class of discriminative training criteria can then be defined by:

• G(W, Wr) is a gain function that allows for rating sentence hypotheses W based on the spoken word string Wr

• G reflects an error metric such as the Levenshtein distance or the sentence error.
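
A plausible rendering of this class of criteria, assuming the notation commonly used with the unifying approach (R training utterances, acoustic observations Xr, competing hypothesis set Mr, smoothing function f, weighting exponent β, and gain function G; the exponent symbol β is introduced here):

F(\theta) = \sum_{r=1}^{R} f\!\left[ \frac{\sum_{W \in M_r} \big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}\, G(W, W_r)}{\sum_{V \in M_r} \big( p(V)\, p_\theta(X_r \mid V) \big)^{\beta}} \right]

The fraction inside the brackets is the local discriminative criterion mentioned on the next slide; choosing G as the Kronecker delta reduces it to the (scaled) posterior probability of the spoken word sequence Wr.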

Page 5:

A General View of Discriminative Training Criteria

• The fraction inside the brackets is called the local discriminative criterion

Page 6:

Extended Unifying Approach

• The choice of alternative word sequences contained in the set Mr, together with the smoothing function f, the weighting exponent, and the gain function G, determines the particular criterion in use.

Page 7:

Extended Unifying Approach

• Maximum Likelihood (ML) Criterion
• Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion
• Diversity Index
• Jeffreys’ Criterion
• Chernoff Affinity

Page 8:

Maximum Likelihood (ML) Criterion

• Although not a discriminative criterion in a strict sense, the ML objective function is contained as a special case in the extended unifying approach.

• The ML estimator is consistent, which means that for any increasing and representative set of training samples the parameter estimates converge toward the true model parameters.
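
For reference, the ML criterion itself has the familiar form (notation as in the sketch above):

F_{\text{ML}}(\theta) = \sum_{r=1}^{R} \log p_\theta(X_r \mid W_r)

Each class is trained only on its own observations; no competing hypotheses enter the objective.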

Page 9:

Maximum Likelihood (ML) Criterion

• However, for automatic speech recognition, the model assumptions are typically not correct, and therefore the ML estimator will return the true parameters of a wrong model in the limiting case of an infinite amount of training data

• In contrast to discriminative training criteria, which concentrate on enhancing class separability by taking class extraneous data into account, the ML estimator optimizes each class region individually

Page 10:

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion

• With an appropriate choice of the smoothing and gain functions, the extended unifying approach yields the MMI criterion, which is defined as the sum over the logarithms of the class posterior probabilities of the spoken word sequences Wr for each training utterance r, given the corresponding acoustic observations Xr:
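
Written out (a standard form; β is the weighting exponent of the framework above, and β = 1 gives the plain posterior probabilities):

F_{\text{MMI}}(\theta) = \sum_{r=1}^{R} \log \frac{\big( p(W_r)\, p_\theta(X_r \mid W_r) \big)^{\beta}}{\sum_{W \in M_r} \big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}}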

Page 11:

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion

• The MMI criterion is equivalent to the Shannon Entropy which, in the case of given class priors, is also known as the Equivocation.
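
One way to see the equivalence: the equivocation, i.e. the conditional entropy of the spoken word sequence given the observations, estimated on the training data, is

\hat{H}(W \mid X) = -\frac{1}{R} \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r)

so maximizing the MMI criterion is the same as minimizing the equivocation.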

Page 12:

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion

• An approximation to the MMI criterion is given by the Corrective Training (CT) criterion

• Here, the sum over all competing word sequences in the denominator is replaced with the best recognized sentence hypothesis
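
A plausible rendering of this approximation (the recognized hypothesis \hat{W}_r is notation introduced here):

F_{\text{CT}}(\theta) = \sum_{r=1}^{R} \log \frac{p(W_r)\, p_\theta(X_r \mid W_r)}{p(\hat{W}_r)\, p_\theta(X_r \mid \hat{W}_r)}, \qquad \hat{W}_r = \arg\max_{W \in M_r} \, p(W)\, p_\theta(X_r \mid W)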

Page 13:

Diversity Index

• The Diversity Index of a given degree measures the divergence of a probability distribution from the uniform distribution.

• While a diversity index closer to the maximum at 0 means a larger divergence from the uniform distribution, smaller values indicate that all classes tend to be nearly equally likely.

Page 14:

Diversity Index

• The Diversity Index applies the corresponding weighting function f, which results in the following expression for the discriminative training criterion:
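
One concrete choice that is consistent with the limiting cases stated on the next slide (both the form of the weighting function and the degree symbol ν are assumptions of this sketch) is the Box-Cox-type function f_ν(x) = (x^ν − 1)/ν applied to the local posterior probability:

F_{\nu}(\theta) = \sum_{r=1}^{R} \frac{p_\theta(W_r \mid X_r)^{\nu} - 1}{\nu}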

Page 15:

Diversity Index

• Two well-known diversity indices, the Shannon Entropy (which is equivalent to the MMI criterion for constant class priors) and the Gini Index, are special cases of the Diversity Index

• The Shannon Entropy results in the continuous limit as the degree parameter approaches 0, while the Gini Index follows from setting it to 1:
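
Under the form sketched above, the two special cases read:

\lim_{\nu \to 0} F_{\nu}(\theta) = \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r) \quad \text{(Shannon Entropy, i.e. MMI for constant class priors)}

F_{1}(\theta) = \sum_{r=1}^{R} \big( p_\theta(W_r \mid X_r) - 1 \big) \quad \text{(Gini-type index)}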

Page 17:

Jeffreys’ Criterion

• Jeffreys’ criterion, which is also known as Jeffreys’ divergence, is closely related to the Kullback-Leibler distance [Kullback & Leibler 51] and was first proposed in [Jeffreys 46]:

• The smoothing function is not bounded from below, which means that the gain in the objective function is large if the parameters are estimated such that no training utterance has a vanishingly small posterior probability.
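
As background (this is the standard definition, not necessarily the exact notation of the thesis), the Jeffreys divergence between two distributions p and q is the symmetrized Kullback-Leibler distance:

J(p, q) = D_{\text{KL}}(p \,\|\, q) + D_{\text{KL}}(q \,\|\, p) = \sum_{x} \big( p(x) - q(x) \big) \log \frac{p(x)}{q(x)}

Because the log-ratio diverges as one of the probabilities tends to zero, criteria built on it are not bounded from below, which is the property referred to in the bullet above.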

Page 18:

Chernoff Affinity

• The Chernoff Affinity was suggested as a generalization of Bhattacharyya’s measure of affinity.

• It employs a smoothing function with a free parameter, which leads to the following training criterion:
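
As background, the Chernoff affinity between two distributions p and q is the sum over p(x)^s q(x)^{1−s} with a free parameter s ∈ (0, 1); Bhattacharyya’s measure of affinity is the special case s = 1/2. Evaluating this affinity between the model posterior over Mr and the Kronecker distribution concentrated on Wr gives one plausible instantiation of the criterion (the symbol s and this derivation are assumptions of this sketch):

F_{s}(\theta) = \sum_{r=1}^{R} p_\theta(W_r \mid X_r)^{s}, \qquad 0 < s < 1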

Page 20:

Smoothed Error Minimizing Training Criteria

• Error-minimizing training criteria such as the MCE, the MWE, and the MPE criterion aim at minimizing the expectation of an error-related loss function on the training data

• Let L denote any such loss function. Then the objective is to determine a parameter set that minimizes the total costs due to classification errors:
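
A plausible rendering of this minimization (referred to as Eq. (4.17) on a later slide; the recognized hypothesis \hat{W}_r is notation introduced here):

\hat{\theta} = \arg\min_{\theta} \sum_{r=1}^{R} L\big( W_r, \hat{W}_r(\theta) \big), \qquad \hat{W}_r(\theta) = \arg\max_{W \in M_r} \, p(W)\, p_\theta(X_r \mid W)

The inner decision rule (the arg max, or equivalently an arg min over costs) makes the objective piecewise constant in θ, which is the difficulty addressed on the next slide.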

Page 21:

The optimization problem

• The objective function includes an “argmin” operation, which prevents the computation of a gradient.

• The objective function has many local optima, which any optimization algorithm must cope with.

• The loss function L itself is typically a discontinuous step function and therefore not differentiable.

Page 22:

Smoothed Error Minimizing Training Criteria

• A remedy to make this class of error-minimizing training criteria amenable to gradient-based optimization methods is to replace Eq. (4.17) with the following expression:

• Discriminative criteria like the MCE, the MWE, and the MPE criterion differ only with respect to the choice of the loss function L.
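
A plausible rendering of the smoothed replacement, in which the hard decision is replaced by the expected loss under the scaled model posterior (notation as in the earlier sketches):

F(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_{\theta,\beta}(W \mid X_r)\, L(W, W_r), \qquad p_{\theta,\beta}(W \mid X_r) = \frac{\big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}}{\sum_{V \in M_r} \big( p(V)\, p_\theta(X_r \mid V) \big)^{\beta}}

This expression is differentiable in θ whenever the acoustic and language model scores are, so gradient-based optimization becomes applicable.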

Page 23:

Smoothed Error Minimizing Training Criteria

• While the MCE criterion typically applies a smoothed sentence error loss function, both the MWE and the MPE criterion are based on approximations of the word or phoneme error rate

Page 24:

Smoothed Error Minimizing Training Criteria

• Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion

• Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion

• Minimum Squared Error (MSE) Criterion

Page 25:

Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion

• The MCE criterion aims at minimizing the expectation of a smoothed sentence error on training data.

• According to Bayes’ decision rule, the probability of making a classification error in utterance r is given by:
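
A plausible reading, with the posterior taken over the competing set Mr:

p_\theta(\text{error} \mid X_r) = 1 - p_\theta(W_r \mid X_r) = \frac{\sum_{W \in M_r \setminus \{W_r\}} p(W)\, p_\theta(X_r \mid W)}{\sum_{V \in M_r} p(V)\, p_\theta(X_r \mid V)}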

Page 26:

Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion

• Smoothing the local error probability with a sigmoid function and carrying out the sum over all training utterances yields the MCE criterion:

• Similar to the CT criterion, the Falsifying Training (FT) criterion derives from the MCE criterion in the limiting case of an infinite exponent:
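
A plausible rendering, assuming a sigmoid applied to the log-odds of the local error probability, with slope ρ and weighting exponent β (both symbols are assumptions of this sketch; for ρ = β = 1 each summand reduces exactly to the local error probability above):

F_{\text{MCE}}(\theta) = \sum_{r=1}^{R} \left[ 1 + \left( \frac{\big( p(W_r)\, p_\theta(X_r \mid W_r) \big)^{\beta}}{\sum_{W \in M_r \setminus \{W_r\}} \big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}} \right)^{\!\rho} \right]^{-1}

When the exponent is driven to infinity, the sum over competing hypotheses is dominated by the single best competitor; a plausible reading is that this limit yields the Falsifying Training criterion.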

Page 27:

Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion

• The objective of the Minimum Word Error (MWE) criterion as well as its closely related Minimum Phone Error (MPE) criterion is to minimize the expectation of an approximation to the word or phoneme accuracy on training data.

• After an efficient lattice-based training scheme was found and successfully implemented in [Povey & Woodland 02], the criterion has received increasing interest.

Page 28:

Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion

• Both criteria compute the average transcription accuracy over all sentence hypotheses considered for discrimination:

• Here, the weight of each hypothesis is defined as the posterior probability of the sentence hypothesis W, scaled by a factor in the log-space:
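
A plausible rendering (A(W, Wr) denotes the word or phoneme accuracy of hypothesis W with respect to the reference Wr, and the scaled posterior is the one introduced in the earlier sketch; the symbols are assumptions):

F_{\text{MWE/MPE}}(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_{\theta,\beta}(W \mid X_r)\, A(W, W_r), \qquad p_{\theta,\beta}(W \mid X_r) = \frac{\big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}}{\sum_{V \in M_r} \big( p(V)\, p_\theta(X_r \mid V) \big)^{\beta}}

The exponent β, applied in the log-domain and typically chosen smaller than 1, flattens the posterior distribution so that more than just the single best hypothesis contributes to the sum.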

Page 29:

A class of discriminative training criteria covered by the extended unifying approach.