Part of Speech Tagging


Page 1: Part of Speech Tagging

Part of Speech Tagging

Some examples:

• The/DT students/NN went/VB to/P class/NN

• Plays/VB,NN well/ADV,NN with/P others/NN,DT

• Fruit/NN flies/NN,VB like/VB,P a/DT banana/NN

Words shown with more than one tag are ambiguous; each combination of candidate tags corresponds to a different reading of the sentence.

Page 2: Probabilistic POS Tagging

Probabilistic POS Tagging

• Addresses the ambiguity problem
  – Use probabilities to find the more likely tag sequence

• Some popular approaches:
  – Transformational tagger
  – Maximum Entropy
  – Hidden Markov Model

Page 3: Problem Setup

Problem Setup

• There are M types of POS tags
  – Tag set: {t_1,..,t_M}

• The word vocabulary size is V
  – Vocabulary set: {w_1,..,w_V}

• We have a word sequence of length n:
  <w> = w1, w2 … wn

• Want to find the best sequence of POS tags:
  <t> = t1, t2 … tn

  <t>_best = argmax_<t> Pr(<t> | <w>)

Page 4: Noisy Channel Framework

Noisy Channel Framework

• Pr(<t>|<w>) is awkward to estimate directly, but by Bayes' rule:

  Pr(<t>|<w>) = Pr(<w>|<t>) Pr(<t>) / Pr(<w>)

• Can cast the problem in terms of the noisy channel model
  – POS tag sequence is the source
  – Through the "noisy channel," the sequence is transformed into the observed English words.
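
To make the decomposition concrete, here is a tiny Python illustration with made-up numbers (the two candidate tag sequences and their probabilities are invented): since Pr(<w>) is the same for every candidate <t>, ranking by Pr(<t>|<w>) is the same as ranking by Pr(<w>|<t>) Pr(<t>).

# Toy illustration only: likelihood = Pr(<w>|<t>), prior = Pr(<t>), both invented.
candidates = {
    ("NN", "VB"): {"likelihood": 0.002, "prior": 0.30},
    ("VB", "NN"): {"likelihood": 0.004, "prior": 0.05},
}
# Pr(<w>) cancels when comparing tag sequences for the same words,
# so the unnormalized score Pr(<w>|<t>) * Pr(<t>) is enough to pick a winner.
scores = {t: v["likelihood"] * v["prior"] for t, v in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])          # ('NN', 'VB') 0.0006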

Page 5: Model for POS Tagging

Model for POS Tagging

• Need to compute Pr(<w>|<t>) and Pr(<t>)

• Make Markov assumptions to simplify:
  – Generation of each word wi depends only on its tag ti, not on previous words
  – Generation of each tag ti depends only on its immediate predecessor ti-1

  <t>_best = argmax_<t> Pr(<t>|<w>) = argmax_<t> Pr(<w>|<t>) Pr(<t>)

  Pr(<w>|<t>) Pr(<t>) = ∏_{i=1..n} p(w_i|t_i) p(t_i|t_{i-1})
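
As a sanity check, a minimal Python sketch of this factorized score; the transition and emission tables below are invented toy values, and "<s>" is a stand-in for the *start* state:

transition = {                     # p(t_i | t_{i-1}); "<s>" marks the start
    ("<s>", "DT"): 0.5, ("DT", "NN"): 0.8, ("NN", "VB"): 0.4,
}
emission = {                       # p(w_i | t_i)
    ("the", "DT"): 0.6, ("students", "NN"): 0.001, ("went", "VB"): 0.002,
}

def joint_score(words, tags):
    """Pr(<w>|<t>) * Pr(<t>) = product over i of p(w_i|t_i) * p(t_i|t_{i-1})."""
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return score

print(joint_score(["the", "students", "went"], ["DT", "NN", "VB"]))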

Page 6: POS Model in Terms of HMM

POS Model in Terms of HMM

• The states of the HMM represent POS tags
• The output alphabet corresponds to the English vocabulary
  [notation: ti is the ith tag in a tag sequence <t>; t_i represents the ith tag in the tag set {t_1,..,t_M}]
• π_i : [p(t_i|*start*)] prob of starting on state t_i
• a_ij : [p(t_j|t_i)] prob of going from t_i to t_j
• b_jk : [p(w_k|t_j)] prob of output vocab w_k at state t_j
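
One possible in-memory layout for these parameters, sketched in Python with a toy tag set and vocabulary (both invented for illustration; the random initialization anticipates the unsupervised setting on the later slides):

import random
random.seed(0)

tags  = ["DT", "NN", "VB"]               # tag set {t_1,..,t_M}, so M = 3
vocab = ["the", "students", "went"]      # vocabulary {w_1,..,w_V}, so V = 3
M, V = len(tags), len(vocab)

def random_dist(size):
    """A random probability distribution over `size` outcomes."""
    xs = [random.random() for _ in range(size)]
    total = sum(xs)
    return [x / total for x in xs]

pi = random_dist(M)                      # pi[i]   = p(t_i | *start*)
A  = [random_dist(M) for _ in range(M)]  # A[i][j] = p(t_j | t_i)
B  = [random_dist(V) for _ in range(M)]  # B[j][k] = p(w_k | t_j)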

Page 7: Learning the Parameters with Annotated Corpora

Learning the Parameters with Annotated Corpora

• Values for model parameters are unknown
  – Suppose we have pairs of sequences:
      <w> = w1, w2 … wn
      <t> = t1, t2 … tn
    such that <t> are the correct tags for <w>
• How to estimate the parameters?
• Max Likelihood Estimate
  – Just count co-occurrences
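
A minimal sketch of these counting-based estimates, assuming a hypothetical tagged_corpus of (word, tag) sequences; each estimate is the count of an event divided by the count of its conditioning context:

from collections import Counter

# Stand-in annotated data: each sentence is a list of (word, tag) pairs.
tagged_corpus = [
    [("the", "DT"), ("students", "NN"), ("went", "VB")],
    [("the", "DT"), ("class", "NN")],
]

start_counts, trans_counts, emit_counts, tag_counts = Counter(), Counter(), Counter(), Counter()
for sentence in tagged_corpus:
    prev = None
    for w, t in sentence:
        tag_counts[t] += 1
        emit_counts[(t, w)] += 1
        if prev is None:
            start_counts[t] += 1          # Count(t | *start*)
        else:
            trans_counts[(prev, t)] += 1  # Count(t | prev)
        prev = t

pi = {t: start_counts[t] / len(tagged_corpus) for t in tag_counts}        # p(t | *start*)
a  = {pair: c / tag_counts[pair[0]] for pair, c in trans_counts.items()}  # p(t_j | t_i)
b  = {pair: c / tag_counts[pair[0]] for pair, c in emit_counts.items()}   # p(w_k | t_j)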

Page 8: Learning the Parameters without Annotated Corpora

Learning the Parameters without Annotated Corpora

• Values for the model parameters are still unknown, but now we have no annotated tags for the word sequences <w>

• Need to search through the space of all possible parameter values to find good ones.

• Expectation Maximization
  – Learn the parameters through iterative refinement

– A form of greedy heuristic

– Guaranteed to find a locally optimal solution but it may not be *the best* solution

Page 9: The EM Algorithm (sketch)

The EM Algorithm (sketch)

Given (as training data): word sequences <w>
Initialize all the parameters of the model to some random values
Repeat until convergence
    E-Step: Compute the expected likelihood of generating all training sequences <w> using the current model
    M-Step: Update the parameters of the model to maximize the likelihood of getting <w>

Page 10: An Inefficient HMM Training Algorithm

An Inefficient HMM Training Algorithm

Initialize all parameters (π, A, B) to some random values
Repeat until convergence
    clear all count table entries
    /* E-Step */
    for every training sequence <w>
        Pr(<w>) := 0
        for all possible sequences <t>
            compute Pr(<t>) and Pr(<w>|<t>)
            Pr(<w>) += Pr(<w>|<t>) Pr(<t>)
        for all possible sequences <t>
            compute Pr(<t>|<w>)    /* Pr(<t>|<w>) = Pr(<w>|<t>) Pr(<t>) / Pr(<w>) */
            Count(t1|*start*) += Pr(<t>|<w>)
            for each position s = 1..n    /* update all expected counts */
                Count(ts+1|ts) += Pr(<t>|<w>)
                Count(ws|ts) += Pr(<t>|<w>)
    /* M-Step */
    for all tags t_i
        π_i := Count(t_i|*start*) / Count(*start*)
    for all pairs of tags t_i and t_j:
        a_ij := Count(t_j|t_i) / Count(t_i)    /* use expected counts collected */
    for all word/tag pairs t_j, w_k:
        b_jk := Count(w_k|t_j) / Count(t_j)
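
A direct Python transcription of the E-step above (with made-up, dictionary-based parameters) makes the inefficiency obvious: it enumerates all M**n tag sequences, which is exactly what the forward-backward machinery on the following slides avoids.

from itertools import product
from collections import defaultdict

tags = ["DT", "NN", "VB"]
pi = {"DT": 0.6, "NN": 0.3, "VB": 0.1}                          # p(t | *start*), toy values
A  = {(ti, tj): 1.0 / len(tags) for ti in tags for tj in tags}  # p(t_j | t_i), uniform placeholder
B  = defaultdict(lambda: 0.1)                                   # p(w_k | t_j), flat placeholder

def expected_counts(words):
    """Brute-force E-step for one sequence: posterior-weighted event counts."""
    seq_probs, total = {}, 0.0
    for tag_seq in product(tags, repeat=len(words)):            # all M**n tag sequences
        p = pi[tag_seq[0]] * B[(tag_seq[0], words[0])]
        for s in range(1, len(words)):
            p *= A[(tag_seq[s - 1], tag_seq[s])] * B[(tag_seq[s], words[s])]
        seq_probs[tag_seq] = p
        total += p                                              # accumulates Pr(<w>)
    counts = defaultdict(float)
    for tag_seq, p in seq_probs.items():
        post = p / total                                        # Pr(<t>|<w>)
        counts[("start", tag_seq[0])] += post
        for s, t in enumerate(tag_seq):
            counts[("emit", t, words[s])] += post
            if s + 1 < len(words):
                counts[("trans", t, tag_seq[s + 1])] += post
    return counts

counts = expected_counts(["the", "students", "went"])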

Page 11: Forward & Backward Equations

Forward & Backward Equations

• Forward: α_i(s)
  – Pr(w1, w2 … ws, t_i)
  – Prob of outputting prefix w1..ws (through all possible paths) and landing on state (tag) t_i at time (position) s.
  – Base case: α_i(1) = π_i b_i[w1]
  – Inductive step: α_j(s+1) = [ Σ_{i=1..M} α_i(s) a_ij ] b_j[ws+1]

• Backward: β_i(s)
  – Pr(ws+1, … wn | t_i)
  – Prob of outputting suffix ws+1..wn (through all possible paths) knowing that we must be on state t_i at time (position) s.
  – Base case: β_i(n) = 1
  – Inductive step: β_i(s) = Σ_{j=1..M} a_ij b_j[ws+1] β_j(s+1)

Note: I used [ws] to denote the index k such that ws = w_k
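
A Python sketch of both recursions, reusing the pi/A/B layout from the earlier parameter sketch (an M-vector, an M×M list of lists, and an M×V list of lists); word_ids holds the vocabulary index [ws] of each word, and Python's 0-based index s corresponds to position s+1 on the slide:

def forward(word_ids, pi, A, B):
    """alpha[s][i] = Pr(w1..w_{s+1}, state t_i at position s+1)."""
    n, M = len(word_ids), len(pi)
    alpha = [[0.0] * M for _ in range(n)]
    for i in range(M):                       # base case: alpha_i(1) = pi_i * b_i[w1]
        alpha[0][i] = pi[i] * B[i][word_ids[0]]
    for s in range(1, n):                    # inductive step
        for j in range(M):
            alpha[s][j] = sum(alpha[s - 1][i] * A[i][j] for i in range(M)) * B[j][word_ids[s]]
    return alpha

def backward(word_ids, A, B):
    """beta[s][i] = Pr(w_{s+2}..wn | state t_i at position s+1)."""
    n, M = len(word_ids), len(A)
    beta = [[0.0] * M for _ in range(n)]
    for i in range(M):                       # base case: beta_i(n) = 1
        beta[n - 1][i] = 1.0
    for s in range(n - 2, -1, -1):           # inductive step
        for i in range(M):
            beta[s][i] = sum(A[i][j] * B[j][word_ids[s + 1]] * beta[s + 1][j] for j in range(M))
    return beta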

Page 12: More Fun with Forward & Backward Equations

More Fun with Forward & Backward Equations

• Can use them to compute the prob of the word sequence Pr(<w>) for any time/position step s:

  Pr(<w>) = Σ_{i=1..M} α_i(s) β_i(s)

• Can also compute the prob of leaving state t_i at time step s:

  γ_i(s) = α_i(s) β_i(s) / Pr(<w>)

• Can compute the prob of going from state t_i to t_j at time s:

  ξ_ij(s) = α_i(s) a_ij b_j[ws+1] β_j(s+1) / Pr(<w>)
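
These three quantities are small helpers once the forward and backward values are available; a sketch, again with 0-based indices (so position s on the slide is index s-1 here):

def sequence_prob(alpha, beta, s=0):
    """Pr(<w>) = sum over i of alpha_i(s) * beta_i(s); the same for every s."""
    return sum(a * b for a, b in zip(alpha[s], beta[s]))

def gamma(alpha, beta, s, i):
    """Prob of leaving state t_i at position s, given the whole word sequence."""
    return alpha[s][i] * beta[s][i] / sequence_prob(alpha, beta)

def xi(alpha, beta, A, B, word_ids, s, i, j):
    """Prob of going from state t_i at position s to t_j at position s+1."""
    return (alpha[s][i] * A[i][j] * B[j][word_ids[s + 1]] * beta[s + 1][j]
            / sequence_prob(alpha, beta))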

Page 13: Update Rules for Parameter Re-Estimation

Update Rules for Parameter Re-Estimation

Using the probability quantities defined in the previous slide (based on forward and backward functions), we can get new values for the HMM parameters:

π_i := γ_i(1)
    (prob of leaving state t_i at time step 1)

a_ij := [ Σ_{s=1..n-1} ξ_ij(s) ] / [ Σ_{s=1..n-1} γ_i(s) ]
    (total expected count of going from t_i to t_j / total expected count of leaving t_i)

b_jk := [ Σ_{s=1..n, ws=w_k} γ_j(s) ] / [ Σ_{s=1..n} γ_j(s) ]
    (total expected count of t_j generating w_k / total expected count of leaving t_j)
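
Put together with the forward/backward and gamma/xi sketches above, the re-estimation step for a single training sequence might look like this (a sketch, not the slide's exact implementation):

def reestimate(word_ids, pi, A, B):
    """One M-step: new pi, A, B from the expected counts under the current model."""
    alpha = forward(word_ids, pi, A, B)
    beta  = backward(word_ids, A, B)
    n, M, V = len(word_ids), len(pi), len(B[0])
    new_pi = [gamma(alpha, beta, 0, i) for i in range(M)]        # gamma_i(1)
    new_A = [[sum(xi(alpha, beta, A, B, word_ids, s, i, j) for s in range(n - 1))
              / sum(gamma(alpha, beta, s, i) for s in range(n - 1))
              for j in range(M)] for i in range(M)]
    new_B = [[sum(gamma(alpha, beta, s, j) for s in range(n) if word_ids[s] == k)
              / sum(gamma(alpha, beta, s, j) for s in range(n))
              for k in range(V)] for j in range(M)]
    return new_pi, new_A, new_B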

Page 14: Efficient Training of HMM

Efficient Training of HMM

Init same as before

Repeat

E-Step: Compute all forward and backward values α_i(s), β_i(s) /* where i=1..M, s=1..n */

M-Step: update all parameters using the update rules in the previous slide

Until Convergence
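
Tying the pieces together, an illustrative training loop built from the earlier sketches (random_dist, forward, backward, sequence_prob, reestimate), with a crude convergence test on the change in Pr(<w>):

def train(word_ids, M, V, max_iters=50, tol=1e-8):
    """Baum-Welch-style training on one word sequence, reusing the sketches above."""
    pi = random_dist(M)                        # init same as before: random parameters
    A  = [random_dist(M) for _ in range(M)]
    B  = [random_dist(V) for _ in range(M)]
    prev = -1.0
    for _ in range(max_iters):
        alpha = forward(word_ids, pi, A, B)    # E-step quantities
        beta  = backward(word_ids, A, B)
        likelihood = sequence_prob(alpha, beta)       # Pr(<w>) under the current model
        pi, A, B = reestimate(word_ids, pi, A, B)     # M-step
        if abs(likelihood - prev) < tol:              # crude convergence test
            break
        prev = likelihood
    return pi, A, B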