Corpora and Statistical Methods, Lecture 9


Hidden Markov Models & POS Tagging
Corpora and Statistical Methods, Lecture 9

Acknowledgement: some of the diagrams are from slides by David Bley (available on the companion website to Manning and Schütze 1999).

Part 1: Formalisation of a Hidden Markov model

Crucial ingredients (familiar)

Underlying states: S = {s1, …, sN}

Output alphabet (observations): K = {k1, …, kM}

State transition probabilities: A = {aij}, i, j ∈ S

State sequence: X = (X1, …, XT+1), together with a function mapping each Xt to a state s ∈ S

Output sequence: O = (O1, …, OT), where each Ot ∈ K

Crucial ingredients (additional)

Initial state probabilities: Π = {πi}, i ∈ S
(these give the initial probability of each state)

Symbol emission probabilities: B = {bijk}, i, j ∈ S, k ∈ K
(these give the probability b of seeing observation Ot = k at time t, given that Xt = si and Xt+1 = sj)

[Trellis diagram of an HMM: states s1, s2, s3 with transition probabilities a1,1, a1,2, a1,3; observation sequence o1, o2, o3 at times t1, t2, t3; emission probabilities such as b1,1,k=O2, b1,2,k=O2, b1,3,k=O2, b1,1,k=O3]
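As a concrete illustration (not part of the original slides), the ingredients above can be written down directly as Python data structures. The numbers below are invented toy values, and the table B follows the arc-emission parameterisation B = {bijk} just defined.

```python
# A toy HMM with N = 2 states and M = 2 output symbols.
# All numbers are invented for illustration only.

N, M = 2, 2

# Initial state probabilities: Pi[i] = P(X_1 = s_i)
Pi = [0.6, 0.4]

# State transition probabilities: A[i][j] = P(X_{t+1} = s_j | X_t = s_i)
A = [[0.7, 0.3],
     [0.4, 0.6]]

# Arc-emission probabilities: B[i][j][k] = P(O_t = k | X_t = s_i, X_{t+1} = s_j)
B = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.2, 0.8], [0.3, 0.7]]]

# An observation sequence of length T = 3 (symbols are indices into K)
O = [0, 1, 0]
```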

The fundamental questions for HMMs

1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)?

2. Given an observation sequence O and a model μ, which state sequence (X1, …, XT+1) best explains the observations? This is the decoding problem.

3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?

Application of question 1 (ASR)

Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O|μ)?

The input to an ASR system is a continuous stream of sound waves, which is ambiguous. It needs to be decoded into a sequence of phones:
is the input the sequence [n iy d] or [n iy]?
which sequence is the most probable?

Application of question 2 (POS tagging)

Given an observation sequence O and a model μ, which state sequence (X1, …, XT+1) best explains the observations? This is the decoding problem.

Consider a POS tagger.
Input observation sequence: I can read

We need to find the most likely sequence of underlying POS tags:
e.g. is "can" a modal verb, or a noun?
how likely is it that "can" is a noun, given that the previous word is a pronoun?

Finding the probability of an observation sequence

Example problem: ASR. Assume that the input contains the word "need"; the input stream is ambiguous (there is noise, individual variation in speech, etc.).

Possible sequences of observations:
[n iy] (knee)
[n iy d] (need)
[n iy t] (neat)

States: underlying sequences of phones giving rise to the input observations, with transition probabilities. Assume we have state sequences for need, knee, new, neat, …

Formulating the problem

The probability of an observation sequence is logically an OR problem: the model gives us state transitions underlying several possible words (knee, need, neat).

How likely is the word need? We have:
all possible state sequences X
each sequence can give rise to the signal received with a certain probability (possibly zero)
the probability of the word need is the sum of the probabilities with which each sequence could have given rise to the word (a brute-force version of this sum is sketched below).
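A minimal sketch of this "sum over all state sequences" view, assuming the toy parameters Pi, A, B and observation sequence O from the earlier snippet (all invented): it enumerates every state sequence of length T+1 and adds up the probability with which each one could have produced the observations. This is exponential in T, which motivates the forward procedure introduced later.

```python
from itertools import product

def brute_force_likelihood(O, Pi, A, B):
    """P(O | mu) by explicit enumeration of every state sequence X of length T+1.

    Exponential in T, which is exactly why the forward procedure is needed.
    """
    N, T = len(Pi), len(O)
    total = 0.0
    for X in product(range(N), repeat=T + 1):        # every possible state sequence
        p = Pi[X[0]]                                  # probability of the initial state
        for t in range(T):                            # transition + arc emission at each step
            p *= A[X[t]][X[t + 1]] * B[X[t]][X[t + 1]][O[t]]
        total += p                                    # "OR" over sequences = sum of probabilities
    return total

# e.g. brute_force_likelihood(O, Pi, A, B) with the toy parameters defined earlier
```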

Simplified trellis diagram representation

[Figure: hidden layer of states (start, n, iy, d/dh, end) with transitions between the sounds forming the words need and knee; this is our model. The visible layer, the observation sequence o1 … oT, is what the ASR system is given as input.]

Computing the probability of an observation

For a state sequence X = (X1, …, XT+1), the probability decomposes as:

P(O, X | μ) = P(O | X, μ) · P(X | μ)
P(X | μ) = πX1 · aX1X2 · aX2X3 · … · aXTXT+1
P(O | X, μ) = bX1X2O1 · bX2X3O2 · … · bXTXT+1OT

Summing over all possible state sequences:

P(O | μ) = Σ over all X of πX1 · (aX1X2 · bX1X2O1) · … · (aXTXT+1 · bXTXT+1OT)

A final word on observation probabilities

Since we're computing the probability of an observation given a model, we can use these methods to compare different models: if we take the observations in our corpus as given, then the best model is the one which maximises the probability of these observations (useful for training/parameter setting).

The forward procedure

Given our phone input, how do we decide whether the actual word is need, knee, …? We could compute P(O|μ) for every single word, but this is highly expensive in terms of computation.

The forward procedure is an efficient solution to this problem, based on dynamic programming (memoisation): rather than performing separate computations for all possible sequences X, we keep partial solutions in memory.

It uses a network representation of all the sequences X of states that could generate the observations, and sums the probabilities over those sequences.

E.g. O = [n iy] could be generated by:
X1 = [n iy d] (need)
X2 = [n iy t] (neat)
Shared histories can help us save on memory.

Fundamental assumption: given several state sequences of length t+1 with a shared history up to t, the probability of the first t observations is the same in all of them.

Forward procedure

The probability of the first t observations is the same for all possible state sequences of length t+1, so we define a forward variable:

αi(t) = P(o1 … ot-1, Xt = si | μ)

the probability of ending up in state si at time t after observations 1 to t-1.

Forward procedure: initialisation

αi(1) = πi

The probability of being in state si first is just equal to the initialisation probability of that state.

Forward procedure: inductive step

αj(t+1) = Σ over i of αi(t) · aij · bijOt

To reach state j at time t+1 we must have been in some state i at time t, made the transition from i to j, and emitted observation Ot on that transition; we sum over all such i.

The total probability of the observation sequence is then:

P(O | μ) = Σ over i of αi(T+1)
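A sketch of the forward procedure under the same toy, arc-emission assumptions as the earlier snippets (illustrative code, not the lecture's): alpha is filled in left to right, and the total should agree with the brute-force enumeration above.

```python
def forward(O, Pi, A, B):
    """Return alpha, where alpha[t][i] corresponds to alpha_i(t+1) in the slides'
    notation: the probability of o_1 .. o_t together with being in state s_i next."""
    N, T = len(Pi), len(O)
    alpha = [[0.0] * N for _ in range(T + 1)]
    alpha[0] = list(Pi)                               # initialisation: alpha_i(1) = pi_i
    for t in range(T):                                # inductive step, left to right
        for j in range(N):
            alpha[t + 1][j] = sum(alpha[t][i] * A[i][j] * B[i][j][O[t]]
                                  for i in range(N))
    return alpha

def likelihood_forward(O, Pi, A, B):
    """P(O | mu) = sum over i of alpha_i(T+1)."""
    return sum(forward(O, Pi, A, B)[-1])
```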

Looking backward

The forward procedure caches the probability of sequences of states leading up to an observation (left to right).

The backward procedure works the other way: it gives the probability of seeing the rest of the observation sequence, given that we were in some state at some time.

Backward procedure: basic structure

Define:

βi(t) = P(ot … oT | Xt = si, μ)

the probability of the remaining observations, given that the current observation is emitted from state i.

Initialise:

βi(T+1) = 1

(the probability at the final state)

Inductive step:

βi(t) = Σ over j of aij · bijOt · βj(t+1)

Total:

P(O | μ) = Σ over i of πi · βi(1)
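A matching sketch of the backward procedure (same toy assumptions as the earlier snippets): beta is filled in right to left, and the "total" line recovers the same P(O|μ) as the forward procedure.

```python
def backward(O, Pi, A, B):
    """Return beta, where beta[t][i] corresponds to beta_i(t+1) in the slides'
    notation: the probability of o_{t+1} .. o_T given that state s_i is occupied then."""
    N, T = len(Pi), len(O)
    beta = [[0.0] * N for _ in range(T + 1)]
    beta[T] = [1.0] * N                               # initialisation: beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):                    # inductive step, right to left
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[i][j][O[t]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def likelihood_backward(O, Pi, A, B):
    """P(O | mu) = sum over i of pi_i * beta_i(1)."""
    beta = backward(O, Pi, A, B)
    return sum(Pi[i] * beta[0][i] for i in range(len(Pi)))
```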

Combining forward & backward variables

Our two variables can be combined:

αi(t) · βi(t) = P(O, Xt = si | μ)

The likelihood of being in state i at time t with our sequence of observations is a function of:
the probability of ending up in i at t given what came previously
the probability of being in i at t given the rest

Therefore:

P(O | μ) = Σ over i of αi(t) · βi(t), for any t with 1 ≤ t ≤ T+1
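Continuing the forward/backward sketches above (same invented parameters), the combination can be checked directly: for any time t, summing αi(t) · βi(t) over the states gives the same P(O|μ).

```python
def likelihood_at(t, O, Pi, A, B):
    """P(O | mu) computed as sum over i of alpha_i(t) * beta_i(t), for any 1 <= t <= T+1."""
    alpha = forward(O, Pi, A, B)                      # from the forward sketch above
    beta = backward(O, Pi, A, B)                      # from the backward sketch above
    return sum(a * b for a, b in zip(alpha[t - 1], beta[t - 1]))   # t is 1-based, as in the slides

# Sanity check: the result should not depend on the choice of t.
# for t in range(1, len(O) + 2): print(likelihood_at(t, O, Pi, A, B))
```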

Decoding: finding the best state sequence

Best state sequence: example

Consider the ASR problem again.

Input observation sequence: [aa n iy dh ax] (corresponds to "I need the")

Possible solutions:
I need a
I need the
I kneed a

NB: each possible solution corresponds to a state sequence. The problem is to find the best word segmentation and the most likely underlying phonetic input.

Some difficulties

If we focus on the likelihood of each individual state, we run into problems: context effects mean that what is individually likely may together yield an unlikely sequence.

The ASR program therefore needs to look at the probability of entire sequences.

Viterbi algorithm

Given an observation sequence O and a model μ, find:

argmaxX P(X, O | μ)

the sequence of states X such that P(X, O | μ) is highest.

Basic idea:
run a type of forward procedure (computing the probability of all possible paths)
store partial solutions
at the end, look back to find the best path

Illustration: path through the trellis

[Figure: trellis with states s1 to s4 over times t = 1 to 7]

At every node (state) and time, we store:
the likelihood of reaching that state at that time by the most probable path leading to it (denoted δ)
the preceding state leading to the current state (denoted ψ)

Viterbi Algorithm: definitions

δj(t) = the probability of the most probable path through observations 1 to t-1 that lands us in state j at t:

δj(t) = max over x1 … xt-1 of P(o1 … ot-1, x1 … xt-1, Xt = sj | μ)

Viterbi Algorithm: initialisation

δj(1) = πj

The probability of being in state j at the beginning is just the initialisation probability of state j.

Viterbi Algorithm: inductive step

δj(t+1) = max over i of δi(t) · aij · bijOt

The probability of being in j at t+1 depends on the state i for which the product of δi(t), the transition probability aij, and the probability of emitting the observed symbol is highest.

Backtrace store: ψj(t+1) records the most probable state from which state j can be reached:

ψj(t+1) = argmax over i of δi(t) · aij · bijOt

Illustration

[Figure: trellis with states s1 to s4 over times t = 1 to 7, with the most probable path marked]

δ2(t=6) = the probability of reaching state 2 at time t=6 by the most probable path (marked) through state 2 at t=6
ψ2(t=6) = 3: state 3 is the state preceding state 2 at t=6 on the most probable path through state 2 at t=6
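A sketch of the Viterbi algorithm under the same toy, arc-emission assumptions as the earlier snippets (illustrative only, not the lecture's code): δ and ψ are filled in left to right, and the best path is then recovered by the backtrace.

```python
def viterbi(O, Pi, A, B):
    """Return (best_path, best_prob), where best_path is the most probable
    state sequence X_1 .. X_{T+1} and best_prob = max over X of P(X, O | mu)."""
    N, T = len(Pi), len(O)
    delta = [[0.0] * N for _ in range(T + 1)]    # delta[t][j] ~ delta_j(t+1) in the slides
    psi = [[0] * N for _ in range(T + 1)]        # psi[t][j]: best predecessor of state j
    delta[0] = list(Pi)                          # initialisation: delta_j(1) = pi_j
    for t in range(T):                           # inductive step
        for j in range(N):
            scores = [delta[t][i] * A[i][j] * B[i][j][O[t]] for i in range(N)]
            psi[t + 1][j] = max(range(N), key=lambda i: scores[i])
            delta[t + 1][j] = scores[psi[t + 1][j]]
    # Backtrace: start from the best final state and work backwards.
    best_last = max(range(N), key=lambda i: delta[T][i])
    path = [best_last]
    for t in range(T, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[T][best_last]
```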

Viterbi Algorithm: backtrace

The best state at T is the state i for which the probability δi(T) is highest.

Work backwards to the most likely preceding state at each step, using the stored ψ values.

The probability of the best state sequence is the maximum value stored for the final state T.

Summary

We've looked at two algorithms for solving two of the fundamental problems of HMMs:
the likelihood of an observation sequence given a model (forward/backward procedure)
the most likely underlying state sequence, given an observation sequence (Viterbi algorithm)

Next up: we look at POS tagging.