Conditional Random Fields for ASR
Jeremy Morris
July 25, 2006
Overview
► Problem Statement (Motivation)
► Conditional Random Fields
► Experiments
  ► Attribute Selection
  ► Experimental Setup
► Results
► Future Work
Problem Statement
► Developed as part of the ASAT Project (Automatic Speech Attribute Transcription)
► Goal: Develop a system for bottom-up speech recognition using 'speech attributes'
Speech Attributes?
► Any information that could be useful for recognizing the spoken language
  ► Phonetic attributes
  ► Speaker attributes (gender, age, etc.)
  ► Any other useful attributes that could be used for speech recognition
  ► Note that there is no guarantee that attributes will be independent of each other
► One part of this project is to explore ways to create a framework for easily combining new features for experimental purposes
/d/: manner: stop, place of artic: dental, voicing: voiced
/t/: manner: stop, place of artic: dental, voicing: unvoiced
/iy/: height: high, backness: front, roundness: nonround
Evidence Combination
► Two basic ways to build hypotheses:
  ► Top Down: generate a hypothesis, then see if the data fits the hypothesis
  ► Bottom Up: examine the data, then search for a hypothesis that fits
[Diagram: top down reasons from hypothesis to data; bottom up reasons from data to hypothesis]
Top Down
► Traditional Automated Speech Recognition (ASR) systems use a top-down approach (HMMs)
  ► The hypothesis is the phone we are predicting
  ► The data is some encoding of the acoustic speech signal
  ► A likelihood of the signal given the phone label, P(X|/iy/), is learned from the data
  ► A prior probability for the phone label, P(/iy/), is learned from the data
  ► These are combined through Bayes' Rule to give us the posterior probability P(/iy/|X)
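The Bayes' Rule combination on this slide can be sketched numerically. The priors and likelihoods below are made-up illustrative values, not figures from the system described here.

```python
# Hypothetical phone priors P(phone) and acoustic likelihoods P(X|phone);
# Bayes' Rule combines them into the posterior P(phone|X).
prior = {"/iy/": 0.04, "/k/": 0.03}
likelihood = {"/iy/": 0.20, "/k/": 0.05}

# Evidence P(X) marginalizes over the candidate phones
evidence = sum(likelihood[p] * prior[p] for p in prior)

# Posterior P(phone|X) = P(X|phone) * P(phone) / P(X)
posterior = {p: likelihood[p] * prior[p] / evidence for p in prior}
```

With these numbers the posterior strongly favors /iy/, since its larger likelihood dominates the small difference in priors.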
Bottom Up
► Bottom-up models have the same high-level goal – determine the label from the observation
  ► But instead of a likelihood, the posterior probability P(/iy/|X) is learned directly from the data
► Neural Networks have been used to learn these probabilities
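A network trained with one output per label commonly produces these posteriors through a softmax output layer; the activations below are invented for illustration, not outputs of the networks used in this work.

```python
import math

# Hypothetical output-layer activations for one acoustic frame X
activations = {"/iy/": 2.1, "/k/": 0.3, "/t/": -1.0}

# Softmax turns the raw scores into posteriors P(label|X) that sum to one
z = sum(math.exp(a) for a in activations.values())
posteriors = {label: math.exp(a) / z for label, a in activations.items()}
```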
Speech is a Sequence
► Speech is not a single, independent event
  ► It is a combination of multiple events over time
► A model to recognize spoken language should take into account dependencies across time
/k/ /k/ /iy/ /iy/ /iy/
Speech is a Sequence
► A top down (generative) model can be extended into a time sequence as a Hidden Markov Model (HMM)
  ► Now our likelihood of the data is over the entire sequence instead of a single phone
/k/ /k/ /iy/ /iy/ /iy/
X X X X X
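The sequence-level likelihood mentioned above can be sketched with the standard HMM forward recursion. All probabilities here are invented toy values, not trained TIMIT parameters.

```python
# Toy 2-state HMM over the phones /k/ and /iy/
states = ["/k/", "/iy/"]
start = {"/k/": 0.6, "/iy/": 0.4}                 # initial state probabilities
trans = {"/k/": {"/k/": 0.5, "/iy/": 0.5},        # P(state_t | state_{t-1})
         "/iy/": {"/k/": 0.1, "/iy/": 0.9}}
emit = [{"/k/": 0.7, "/iy/": 0.1},                # P(x_t | state) for 3 frames
        {"/k/": 0.3, "/iy/": 0.6},
        {"/k/": 0.1, "/iy/": 0.8}]

# Forward recursion: alpha[t][s] = P(x_1..x_t, state_t = s)
alpha = [{s: start[s] * emit[0][s] for s in states}]
for t in range(1, len(emit)):
    alpha.append({s: emit[t][s] * sum(alpha[t - 1][r] * trans[r][s] for r in states)
                  for s in states})

# Likelihood of the whole observation sequence, summed over final states
seq_likelihood = sum(alpha[-1].values())
```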
Speech is a Sequence
► Tandem is a method for using evidence bottom up (discriminative)
  ► The hypothesis output of a Neural Network is used to train an HMM
  ► Not a pure discriminative method, but a combination of generative and discriminative methods
/k/ /iy/ /iy/
Y Y Y
X X X
Bottom Up Modelling
► The idea is to have a system that combines evidence layer by layer
  ► Speech attributes contribute to phone attribute detection
  ► Phone attributes contribute to "syllable" attribute detection, and so on
► Each layer combines information from previous layers to form its hypotheses
  ► We want to do this probabilistically – no hard decisions
  ► Note that there is no guarantee of independence among the observed speech features – in fact, they are often very dependent
Conditional Random Fields
► A form of discriminative modelling
  ► Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
► Processes evidence bottom-up
  ► Combines multiple features of the data
  ► Builds the probability P(sequence | data)
Conditional Random Fields
► CRFs are based on the idea of Markov Random Fields
  ► Modelled as an undirected graph connecting labels with observations
  ► Observations in a CRF are not random variables
/k/ /k/ /iy/ /iy/ /iy/
X X X X X
► Transition functions add associations between transitions from one label to another
► State functions help determine the identity of the state
Conditional Random Fields

P(y|x) = (1/Z(x)) · exp( Σ_t [ Σ_i λ_i f_i(x, y_t) + Σ_j μ_j g_j(x, y_{t-1}, y_t) ] )

where the f_i are state feature functions with weights λ_i, the g_j are transition feature functions with weights μ_j, and Z(x) is the normalizing constant.
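The formula above can be made concrete on a toy problem. The two feature functions and the weights below are illustrative assumptions (echoing the λ and μ examples on the following slides), not a model trained on real data.

```python
import math
from itertools import product

labels = ["/k/", "/iy/"]
x = [["stop"], ["high"]]                      # one observed attribute per frame

def f(attrs, y):
    # State feature: fires when the frame's attribute supports the label
    support = {"/k/": "stop", "/iy/": "high"}
    return 1.0 if support[y] in attrs else 0.0

def g(y_prev, y):
    # Transition feature: fires on /k/ followed by /iy/
    return 1.0 if (y_prev, y) == ("/k/", "/iy/") else 0.0

lam, mu = 10.0, 4.0                           # illustrative weights

def potential(seq):
    # exp of the weighted feature sums over all time steps
    s = sum(lam * f(x[t], seq[t]) for t in range(len(seq)))
    s += sum(mu * g(seq[t - 1], seq[t]) for t in range(1, len(seq)))
    return math.exp(s)

# Z(x) normalizes over every possible label sequence
Z = sum(potential(seq) for seq in product(labels, repeat=len(x)))
p_k_iy = potential(("/k/", "/iy/")) / Z       # P(/k/ /iy/ | x)
```

Because both state features and the transition feature fire for the sequence /k/ /iy/, its exponent dominates and the normalized probability is close to one.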
State Feature Function: f([x is stop], /t/)
► One possible state feature function for our attributes and labels

State Feature Weight: λ = 10
► One possible weight value for this state feature (strong)
Transition Feature Function: g(x, /iy/, /k/)
► One possible transition feature function – indicates /k/ followed by /iy/

Transition Feature Weight: μ = 4
► One possible weight value for this transition feature
► The Hammersley-Clifford Theorem states that a random field is an MRF iff it can be described in the above form
  ► The exponential is the sum of the clique potentials of the undirected graph
Conditional Random Fields
► Conceptual Overview
  ► Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
    ► A positive value if the attribute appears in the data
    ► A zero value if the attribute is not in the data
  ► Each feature function carries a weight that gives the strength of that feature function for the proposed label
    ► High positive weights indicate a good association between the feature and the proposed label
    ► High negative weights indicate a negative association between the feature and the proposed label
    ► Weights close to zero indicate the feature has little or no impact on the identity of the label
Experiments
► Goal: Implement a Conditional Random Field model on ASAT-style data
  ► Perform phone recognition
  ► Compare results to those obtained via a Tandem system
► Experimental Data
  ► TIMIT read speech corpus
  ► Moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions
Attribute Selection
► Attribute Detectors
  ► ICSI QuickNet Neural Networks
► Two different types of attributes
  ► Phonological feature detectors
    ► Place, Manner, Voicing, Vowel Height, Backness, etc.
    ► Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
  ► Phone detectors
    ► Neural network outputs based on the phone labels – one output per label
► Classifiers were applied to 2960 utterances from the TIMIT training set
Experimental Setup
► Code built on the Java CRF toolkit on SourceForge
  ► http://crf.sourceforge.net
► Performs training to maximize the log-likelihood of the training set with respect to the model
  ► Uses a Limited-Memory BFGS (L-BFGS) algorithm, driving the gradient of the log-likelihood to zero
  ► For CRF models, maximizing the log-likelihood of the empirical distribution of the data as predicted by the model is the same as maximizing the entropy (Berger et al.)
Experimental Setup
► Outputs from the Neural Nets are themselves treated as feature functions for the observed sequence – each attribute/label combination gives us a value for one feature function
  ► Note that this makes the feature functions non-binary
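That setup can be sketched as follows. The posterior values and the helper name `state_feature` are hypothetical, meant only to show how a real-valued network output replaces a 0/1 feature.

```python
# Hypothetical per-frame attribute posteriors from the neural nets
nn_posteriors = {"stop": 0.92, "voiced": 0.15, "high": 0.03}

def state_feature(attr, label, frame_posteriors):
    # Non-binary state feature: its value is the net's posterior for the
    # attribute at this frame, paired with the proposed label
    return frame_posteriors.get(attr, 0.0)

value = state_feature("stop", "/t/", nn_posteriors)   # 0.92, not just 0 or 1
```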
Results

Model                      Phone Accuracy   Phone Correct
Tandem [1] (phones)        60.48%           63.30%
Tandem [3] (phones)        67.32%           73.81%
CRF [1] (phones)           66.89%           68.49%
Tandem [1] (features)      61.48%           63.50%
Tandem [3] (features)      66.69%           72.52%
CRF [1] (features)         65.29%           66.81%
Tandem [1] (phones/feas)   61.78%           63.68%
Tandem [3] (phones/feas)   67.96%           73.40%
CRF (phones/feas)          68.00%           69.58%
Future Work
► More features
  ► What kinds of features can we add to improve our transitions?
► Tuning
  ► The HMM model has parameters that can be tuned for better performance – can we tweak the CRF to do something similar?
► Word recognition
  ► How does this model do at the full word recognition level, instead of just phones?
► Other corpora
  ► Can we extend this method beyond TIMIT to different types of corpora? (e.g. WSJ)