Discriminative Training and Acoustic Modeling for Automatic Speech Recognition - Chap. 4 Discriminative Training. Wolfgang Macherey. Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the academic degree of Doctor of Natural Sciences. Presented by Yueng-Tien, Lo.


Page 1:

Discriminative Training and Acoustic Modeling for Automatic Speech Recognition
Chap. 4 Discriminative Training

Wolfgang Macherey
Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the academic degree of Doctor of Natural Sciences

Presented by Yueng-Tien, Lo

Page 2:

Outline

• A General View of Discriminative Training Criteria
  – Extended Unifying Approach
  – Smoothed Error Minimizing Training Criteria

Page 3:

A General View of Discriminative Training Criteria

• A unifying approach for a class of discriminative training criteria was presented that allows for optimizing several objective functions – among them the Maximum Mutual Information criterion and the Minimum Classification Error criterion – within a single framework.

• The approach is extended such that it also captures other criteria more recently proposed

Page 4:

A General View of Discriminative Training Criteria

• A class of discriminative training criteria can then be defined by:

• G(W, Wr) is a gain function that allows for rating sentence hypotheses W based on the spoken word string Wr

• G reflects an error metric such as the Levenshtein distance or the sentence error.
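
A plausible rendering of this class of criteria, assuming the notation commonly used with the unifying approach (R training utterances, acoustic observations Xr, competing hypothesis set Mr, smoothing function f, weighting exponent β, and gain function G; the exponent symbol β is introduced here):

F(\theta) = \sum_{r=1}^{R} f\!\left[ \frac{\sum_{W \in M_r} \big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}\, G(W, W_r)}{\sum_{V \in M_r} \big( p(V)\, p_\theta(X_r \mid V) \big)^{\beta}} \right]

The fraction inside the brackets is the local discriminative criterion mentioned on the next slide; choosing G as the Kronecker delta reduces it to the (scaled) posterior probability of the spoken word sequence Wr.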

Page 5:

A General View of Discriminative Training Criteria

• The fraction inside the brackets is called the local discriminative criterion

Page 6:

Extended Unifying Approach

• The choice of alternative word sequences contained in the set Mr, together with the smoothing function f, the weighting exponent, and the gain function G, determines the particular criterion in use.

Page 7:

Extended Unifying Approach

• Maximum Likelihood (ML) Criterion
• Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion
• Diversity Index
• Jeffreys’ Criterion
• Chernoff Affinity

Page 8:

Maximum Likelihood (ML) Criterion

• Although not a discriminative criterion in a strict sense, the ML objective function is contained as a special case in the extended unifying approach.

• The ML estimator is consistent, which means that for any increasing and representative set of training samples the parameter estimates converge toward the true model parameters.
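
For reference, the ML criterion itself has the familiar form (notation as in the sketch above):

F_{\text{ML}}(\theta) = \sum_{r=1}^{R} \log p_\theta(X_r \mid W_r)

Each class is trained only on its own observations; no competing hypotheses enter the objective.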

Page 9:

Maximum Likelihood (ML) Criterion

• However, for automatic speech recognition, the model assumptions are typically not correct, and therefore the ML estimator will return the true parameters of a wrong model in the limiting case of an infinite amount of training data

• In contrast to discriminative training criteria, which concentrate on enhancing class separability by taking class extraneous data into account, the ML estimator optimizes each class region individually

Page 10:

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion

• With an appropriate choice of the smoothing and gain functions, the extended unifying approach yields the MMI criterion, which is defined as the sum over the logarithms of the class posterior probabilities of the spoken word sequences Wr for each training utterance r, given the corresponding acoustic observations Xr:
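
Written out (a standard form; β is the weighting exponent of the framework above, and β = 1 gives the plain posterior probabilities):

F_{\text{MMI}}(\theta) = \sum_{r=1}^{R} \log \frac{\big( p(W_r)\, p_\theta(X_r \mid W_r) \big)^{\beta}}{\sum_{W \in M_r} \big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}}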

Page 11:

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion

• The MMI criterion is equivalent to the Shannon Entropy which, in the case of given class priors, is also known as the Equivocation.
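
One way to see the equivalence: the equivocation, i.e. the conditional entropy of the spoken word sequence given the observations, estimated on the training data, is

\hat{H}(W \mid X) = -\frac{1}{R} \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r)

so maximizing the MMI criterion is the same as minimizing the equivocation.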

Page 12:

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion

• An approximation to the MMI criterion is given by the Corrective Training (CT) criterion

• Here, the sum over all competing word sequences in the denominator is replaced with the best recognized sentence hypothesis
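
A plausible rendering of this approximation (the recognized hypothesis \hat{W}_r is notation introduced here):

F_{\text{CT}}(\theta) = \sum_{r=1}^{R} \log \frac{p(W_r)\, p_\theta(X_r \mid W_r)}{p(\hat{W}_r)\, p_\theta(X_r \mid \hat{W}_r)}, \qquad \hat{W}_r = \arg\max_{W \in M_r} \, p(W)\, p_\theta(X_r \mid W)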

Page 13:

Diversity Index

• The Diversity Index of a given degree measures the divergence of a probability distribution from the uniform distribution.

• While a diversity index closer to the maximum at 0 means a larger divergence from the uniform distribution, smaller values indicate that all classes tend to be nearly equally likely.

Page 14:

Diversity Index

• The Diversity Index applies the corresponding weighting function f, which results in the following expression for the discriminative training criterion:
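
One concrete choice that is consistent with the limiting cases stated on the next slide (both the form of the weighting function and the degree symbol ν are assumptions of this sketch) is the Box-Cox-type function f_ν(x) = (x^ν − 1)/ν applied to the local posterior probability:

F_{\nu}(\theta) = \sum_{r=1}^{R} \frac{p_\theta(W_r \mid X_r)^{\nu} - 1}{\nu}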

Page 15:

Diversity Index

• Two well-known diversity indices, the Shannon Entropy (which is equivalent to the MMI criterion for constant class priors) and the Gini Index, are special cases of the Diversity Index

• The Shannon Entropy results in the continuous limit as the degree parameter approaches 0, while the Gini Index follows from setting it to 1:
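
Under the form sketched above, the two special cases read:

\lim_{\nu \to 0} F_{\nu}(\theta) = \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r) \quad \text{(Shannon Entropy, i.e. MMI for constant class priors)}

F_{1}(\theta) = \sum_{r=1}^{R} \big( p_\theta(W_r \mid X_r) - 1 \big) \quad \text{(Gini-type index)}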

Page 17:

Jeffreys’ Criterion

• Jeffreys’ criterion, which is also known as Jeffreys’ divergence, is closely related to the Kullback-Leibler distance [Kullback & Leibler 51] and was first proposed in [Jeffreys 46]:

• The smoothing function is not bounded from below, which means that the gain in the objective function is large if the parameters are estimated such that no training utterance has a vanishingly small posterior probability.
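
As background (this is the standard definition, not necessarily the exact notation of the thesis), the Jeffreys divergence between two distributions p and q is the symmetrized Kullback-Leibler distance:

J(p, q) = D_{\text{KL}}(p \,\|\, q) + D_{\text{KL}}(q \,\|\, p) = \sum_{x} \big( p(x) - q(x) \big) \log \frac{p(x)}{q(x)}

Because the log-ratio diverges as one of the probabilities tends to zero, criteria built on it are not bounded from below, which is the property referred to in the bullet above.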

Page 18:

Chernoff Affinity

• The Chernoff Affinity was suggested as a generalization of Bhattacharyya’s measure of affinity.

• It employs a smoothing function with a free parameter, which leads to the following training criterion:
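
As background, the Chernoff affinity between two distributions p and q is the sum over p(x)^s q(x)^{1−s} with a free parameter s ∈ (0, 1); Bhattacharyya’s measure of affinity is the special case s = 1/2. Evaluating this affinity between the model posterior over Mr and the Kronecker distribution concentrated on Wr gives one plausible instantiation of the criterion (the symbol s and this derivation are assumptions of this sketch):

F_{s}(\theta) = \sum_{r=1}^{R} p_\theta(W_r \mid X_r)^{s}, \qquad 0 < s < 1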

Page 20:

Smoothed Error Minimizing Training Criteria

• Error-minimizing training criteria such as the MCE, the MWE, and the MPE criterion aim at minimizing the expectation of an error-related loss function on the training data

• Let L denote any such loss function. Then the objective is to determine a parameter set that minimizes the total costs due to classification errors:
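
A plausible rendering of this minimization (referred to as Eq. (4.17) on a later slide; the recognized hypothesis \hat{W}_r is notation introduced here):

\hat{\theta} = \arg\min_{\theta} \sum_{r=1}^{R} L\big( W_r, \hat{W}_r(\theta) \big), \qquad \hat{W}_r(\theta) = \arg\max_{W \in M_r} \, p(W)\, p_\theta(X_r \mid W)

The inner decision rule (the arg max, or equivalently an arg min over costs) makes the objective piecewise constant in θ, which is the difficulty addressed on the next slide.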

Page 21:

The optimization problem

• The objective function includes an “argmin” operation, which prevents the computation of a gradient.

• The objective function has many local optima, which any optimization algorithm must cope with.

• The loss function L itself is typically a discontinuous step function and therefore not differentiable.

Page 22:

Smoothed Error Minimizing Training Criteria

• A remedy to make this class of error-minimizing training criteria amenable to gradient-based optimization methods is to replace Eq. (4.17) with the following expression:

• Discriminative criteria like the MCE, the MWE, and the MPE criterion differ only with respect to the choice of the loss function L.
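
A plausible rendering of the smoothed replacement, in which the hard decision is replaced by the expected loss under the scaled model posterior (notation as in the earlier sketches):

F(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_{\theta,\beta}(W \mid X_r)\, L(W, W_r), \qquad p_{\theta,\beta}(W \mid X_r) = \frac{\big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}}{\sum_{V \in M_r} \big( p(V)\, p_\theta(X_r \mid V) \big)^{\beta}}

This expression is differentiable in θ whenever the acoustic and language model scores are, so gradient-based optimization becomes applicable.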

Page 23:

Smoothed Error Minimizing Training Criteria

• While the MCE criterion typically applies a smoothed sentence error loss function, both the MWE and the MPE criterion are based on approximations of the word or phoneme error rate

Page 24:

Smoothed Error Minimizing Training Criteria

• Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion

• Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion

• Minimum Squared Error (MSE) Criterion

Page 25:

Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion

• The MCE criterion aims at minimizing the expectation of a smoothed sentence error on training data.

• According to Bayes’ decision rule, the probability of making a classification error in utterance r is given by:
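
A plausible reading, with the posterior taken over the competing set Mr:

p_\theta(\text{error} \mid X_r) = 1 - p_\theta(W_r \mid X_r) = \frac{\sum_{W \in M_r \setminus \{W_r\}} p(W)\, p_\theta(X_r \mid W)}{\sum_{V \in M_r} p(V)\, p_\theta(X_r \mid V)}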

Page 26:

Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion

• Smoothing the local error probability with a sigmoid function and carrying out the sum over all training utterances yields the MCE criterion:

• Similar to the CT criterion, the Falsifying Training (FT) criterion derives from the MCE criterion in the limiting case of an infinite exponent:
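
A plausible rendering, assuming a sigmoid applied to the log-odds of the local error probability, with slope ρ and weighting exponent β (both symbols are assumptions of this sketch; for ρ = β = 1 each summand reduces exactly to the local error probability above):

F_{\text{MCE}}(\theta) = \sum_{r=1}^{R} \left[ 1 + \left( \frac{\big( p(W_r)\, p_\theta(X_r \mid W_r) \big)^{\beta}}{\sum_{W \in M_r \setminus \{W_r\}} \big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}} \right)^{\!\rho} \right]^{-1}

When the exponent is driven to infinity, the sum over competing hypotheses is dominated by the single best competitor; a plausible reading is that this limit yields the Falsifying Training criterion.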

Page 27:

Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion

• The objective of the Minimum Word Error (MWE) criterion as well as its closely related Minimum Phone Error (MPE) criterion is to minimize the expectation of an approximation to the word or phoneme accuracy on training data.

• After an efficient lattice-based training scheme was found and successfully implemented in [Povey & Woodland 02], the criterion has received increasing interest.

Page 28:

Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion

• Both criteria compute the average transcription accuracy over all sentence hypotheses considered for discrimination:

• Here, the weight of each hypothesis is defined as the posterior probability of the sentence hypothesis W, scaled by a factor in the log-space:
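
A plausible rendering (A(W, Wr) denotes the word or phoneme accuracy of hypothesis W with respect to the reference Wr, and the scaled posterior is the one introduced in the earlier sketch; the symbols are assumptions):

F_{\text{MWE/MPE}}(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_{\theta,\beta}(W \mid X_r)\, A(W, W_r), \qquad p_{\theta,\beta}(W \mid X_r) = \frac{\big( p(W)\, p_\theta(X_r \mid W) \big)^{\beta}}{\sum_{V \in M_r} \big( p(V)\, p_\theta(X_r \mid V) \big)^{\beta}}

The exponent β, applied in the log-domain and typically chosen smaller than 1, flattens the posterior distribution so that more than just the single best hypothesis contributes to the sum.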

Page 29:

A class of discriminative training criteria covered by the extended unifying approach.