Part of Speech Tagging

Some examples:

The/DT students/NN went/VB to/P class/NN

"Plays well with others" (several words are ambiguous):
Plays: VB or NN;  well: ADV or NN;  with: P;  others: NN or DT

"Fruit flies like a banana" (several tag sequences are possible):
Fruit: NN;  flies: NN or VB;  like: VB or P;  a: DT;  banana: NN
Probabilistic POS Tagging

• Addresses the ambiguity problem
  – Use probabilities to find the more likely tag sequence
• Some popular approaches:
  – Transformational tagger
  – Maximum Entropy
  – Hidden Markov Model
Problem Setup
• There are M types of POS tags
  – Tag set: {t_1,..,t_M}
• The word vocabulary size is V
  – Vocabulary set: {w_1,..,w_V}
• We have a word sequence of length n:
  <w> = w1,w2,…,wn
• Want to find the best sequence of POS tags:
  <t> = t1,t2,…,tn

  <t>_best = argmax_<t> Pr(<t>|<w>)
Noisy Channel Framework
• Pr(<t>|<w>) is awkward to estimate directly, but by Bayes' Rule:

  Pr(<t>|<w>) = Pr(<w>|<t>) Pr(<t>) / Pr(<w>)

• Can cast the problem in terms of the noisy channel model
  – POS tag sequence is the source
  – Through the “noisy channel,” the sequence is transformed into the observed English words.
Model for POS Tagging
• Need to compute Pr(<w>|<t>) and Pr(<t>)
• Make Markov assumptions to simplify:
  – Generation of each word wi depends only on its tag ti, not on previous words
  – Generation of each tag ti depends only on its immediate predecessor ti-1

  <t>_best = argmax_<t> Pr(<t>|<w>) = argmax_<t> Pr(<w>|<t>) Pr(<t>)

  Pr(<w>|<t>) Pr(<t>) ≈ ∏_{i=1..n} p(wi|ti) p(ti|ti-1)
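As a concrete sketch of the factored model, the snippet below scores one candidate tag sequence against toy transition and emission tables. The probability values and the `<s>` start symbol are illustrative choices, not taken from the slides:

```python
# A minimal sketch of the factored model: score one candidate tag sequence.
# The probability tables and the "<s>" start symbol are illustrative choices.

trans = {  # p(t_i | t_{i-1})
    ("<s>", "DT"): 0.6, ("DT", "NN"): 0.9, ("NN", "VB"): 0.5,
}
emit = {   # p(w_i | t_i)
    ("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.2,
}

def joint_prob(words, tags):
    """Pr(<w>|<t>) * Pr(<t>)  =  prod_i p(w_i|t_i) * p(t_i|t_{i-1})."""
    p = 1.0
    prev = "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p
```

For `["the", "dog", "barks"]` tagged `["DT", "NN", "VB"]` this computes 0.6·0.7 · 0.9·0.1 · 0.5·0.2 ≈ 0.00378; the tagger's job is to find the tag sequence maximizing this product.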
POS Model in Terms of HMM
• The states of the HMM represent POS tags
• The output alphabet corresponds to the English vocabulary
  [notation: ti is the ith tag in a tag sequence <t>; t_i represents the ith tag in the tag set {t_1,..,t_M}]
• πi : [p(t_i|*start*)] prob of starting in state t_i
• aij : [p(t_j|t_i)] prob of going from t_i to t_j
• bjk : [p(w_k|t_j)] prob of emitting vocab word w_k at state t_j
Learning the Parameters with Annotated Corpora
• Values for model parameters are unknown
  – Suppose we have pairs of sequences:
    • <w> = w1,w2,…,wn
    • <t> = t1,t2,…,tn
    such that <t> are the correct tags for <w>
• How to estimate the parameters?
• Max Likelihood Estimate
  – Just count co-occurrences
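The counting described above can be sketched as follows; the `mle_estimate` helper and its dict-based table layout are illustrative choices, not from the slides:

```python
from collections import Counter

def mle_estimate(tagged_sentences):
    """Maximum-likelihood estimates from tagged data: just count co-occurrences.
    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    (The dict-based layout is an illustrative choice, not from the slides.)"""
    start, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for pairs in tagged_sentences:
        start[pairs[0][1]] += 1                     # Count(t1|*start*)
        for (_, t), (_, t_next) in zip(pairs, pairs[1:]):
            trans[(t, t_next)] += 1                 # Count(t_j|t_i)
        for w, t in pairs:
            emit[(t, w)] += 1                       # Count(w_k|t_j)
            tag_count[t] += 1                       # Count(t_i)
    n = len(tagged_sentences)                       # Count(*start*)
    pi = {t: c / n for t, c in start.items()}
    a = {(ti, tj): c / tag_count[ti] for (ti, tj), c in trans.items()}
    b = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}
    return pi, a, b
```

In practice these raw ratios are usually smoothed so that unseen word/tag pairs do not get zero probability, but the sketch keeps the plain counts from the slide.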
Learning the Parameters without Annotated Corpora
• Values for the model parameters are still unknown, and now we have no annotated tags for the word sequences <w>
• Need to search through the space of all possible parameter values to find good ones.
• Expectation Maximization
  – Learns the parameters through iterative refinement
  – A form of greedy heuristic
  – Guaranteed to find a locally optimal solution, but it may not be *the best* solution
The EM Algorithm (sketch)

Given (as training data): word sequences <w>
Initialize all the parameters of the model to some random values
Repeat until convergence:
  E-Step: Compute the expected likelihood of generating all training sequences <w> using the current model
  M-Step: Update the parameters of the model to maximize the likelihood of getting <w>
An Inefficient HMM Training Algorithm

Initialize all parameters (π, A, B) to some random values
Repeat until convergence:
  /* E-Step */
  clear all count table entries
  for every training sequence <w>:
    Pr(<w>) := 0
    for all possible sequences <t>:
      compute Pr(<t>) and Pr(<w>|<t>)
      Pr(<w>) += Pr(<w>|<t>) Pr(<t>)
    for all possible sequences <t>:
      compute Pr(<t>|<w>)   /* Pr(<t>|<w>) = Pr(<w>|<t>) Pr(<t>) / Pr(<w>) */
      Count(t1|*start*) += Pr(<t>|<w>)
      for each position s = 1..n:   /* update all expected counts */
        Count(ts+1|ts) += Pr(<t>|<w>)
        Count(ws|ts) += Pr(<t>|<w>)
  /* M-Step */
  for all tags t_i:
    πi := Count(t_i|*start*) / Count(*start*)
  for all pairs of tags t_i and t_j:
    aij := Count(t_j|t_i) / Count(t_i)   /* use expected counts collected */
  for all word–tag pairs t_j, w_k:
    bjk := Count(w_k|t_j) / Count(t_j)
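A minimal sketch of the brute-force E-step above, assuming dict-based π, A, B tables (an illustrative layout, not from the slides). It enumerates all M^n tag sequences, which is exactly why this version is inefficient:

```python
import itertools
from collections import Counter

def brute_force_estep(words, tags, pi, a, b):
    """E-step by brute force, as in the algorithm above: enumerate every
    possible tag sequence <t> and weight counts by Pr(<t>|<w>).
    O(M^n) work per sequence, which is exactly why it is inefficient."""
    def joint(seq):                       # Pr(<w>|<t>) * Pr(<t>)
        p = pi[seq[0]] * b[(seq[0], words[0])]
        for s in range(1, len(words)):
            p *= a[(seq[s - 1], seq[s])] * b[(seq[s], words[s])]
        return p

    seqs = list(itertools.product(tags, repeat=len(words)))
    pr_w = sum(joint(seq) for seq in seqs)          # Pr(<w>)
    counts = Counter()
    for seq in seqs:
        post = joint(seq) / pr_w                    # Pr(<t>|<w>)
        counts[("start", seq[0])] += post           # Count(t1|*start*)
        for s in range(len(words) - 1):
            counts[("trans", seq[s], seq[s + 1])] += post
        for s, w in enumerate(words):
            counts[("emit", seq[s], w)] += post
    return pr_w, counts
```

The forward-backward recursions on the next slides compute the same expected counts in O(M²·n) time instead.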
Forward & Backward Equations

• Forward: αi(s)
  – Pr(w1,w2,…,ws, t_i)
  – Prob of outputting the prefix w1..ws (through all possible paths) and landing on state (tag) t_i at time (position) s.
  – Base case:       αi(1) = πi · bi[w1]
  – Inductive step:  αj(s+1) = ( Σ_{i=1..M} αi(s) · aij ) · bj[ws+1]

• Backward: βi(s)
  – Pr(ws+1,…,wn | t_i)
  – Prob of outputting the suffix ws+1..wn (through all possible paths) knowing that we must be on state t_i at time (position) s.
  – Base case:       βi(n) = 1
  – Inductive step:  βi(s) = Σ_{j=1..M} aij · bj[ws+1] · βj(s+1)

Note: I used [ws] to denote some index k, such that ws = w_k
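The two recursions can be sketched as below, assuming dict-based parameter tables (an illustrative layout). Positions are 0-indexed here, so αi(1) from the slide corresponds to `alpha[0]`:

```python
def forward_backward(words, tags, pi, a, b):
    """Forward/backward tables per the recursions above; positions are
    0-indexed here, so alpha_i(1) from the slides is alpha[0][t_i].
    pi[t], a[(ti, tj)], b[(t, w)] are dicts (an illustrative layout)."""
    n = len(words)
    alpha = [{t: 0.0 for t in tags} for _ in range(n)]
    beta = [{t: 0.0 for t in tags} for _ in range(n)]
    for t in tags:                  # base case: alpha_i(1) = pi_i * b_i[w1]
        alpha[0][t] = pi[t] * b[(t, words[0])]
    for s in range(1, n):           # alpha_j(s+1) = (sum_i alpha_i(s) a_ij) b_j[w_{s+1}]
        for tj in tags:
            alpha[s][tj] = sum(alpha[s - 1][ti] * a[(ti, tj)]
                               for ti in tags) * b[(tj, words[s])]
    for t in tags:                  # base case: beta_i(n) = 1
        beta[n - 1][t] = 1.0
    for s in range(n - 2, -1, -1):  # beta_i(s) = sum_j a_ij b_j[w_{s+1}] beta_j(s+1)
        for ti in tags:
            beta[s][ti] = sum(a[(ti, tj)] * b[(tj, words[s + 1])] * beta[s + 1][tj]
                              for tj in tags)
    return alpha, beta
```

A useful sanity check: Σ_i αi(s)·βi(s) equals Pr(<w>) at every position s, which is the first identity on the next slide.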
More Fun with Forward & Backward Equations

• Can use them to compute the prob of the word sequence Pr(<w>) at any time/position step s:

  Pr(<w>) = Σ_{i=1..M} αi(s) · βi(s)

• Can also compute the prob of leaving state t_i at time step s:

  γi(s) = αi(s) · βi(s) / Pr(<w>)

• Can compute the prob of going from state t_i to t_j at time s:

  ξij(s) = αi(s) · aij · bj[ws+1] · βj(s+1) / Pr(<w>)
Update Rules for Parameter Re-Estimation

Using the probability quantities defined in the previous slide (based on the forward and backward functions), we can get new values for the HMM parameters:

  πi := γi(1)
      /* prob of leaving state t_i at time step 1 */

  aij := Σ_{s=1..n-1} ξij(s) / Σ_{s=1..n-1} γi(s)
      /* total expected count of going from t_i to t_j,
         over total expected count of leaving t_i */

  bjk := Σ_{s=1..n, ws=w_k} γj(s) / Σ_{s=1..n} γj(s)
      /* total expected count of t_j generating w_k,
         over total expected count of leaving t_j */
Efficient Training of HMM

Init: same as before
Repeat:
  E-Step: compute all forward and backward values αi(s), βi(s)   /* where i = 1..M, s = 1..n */
  M-Step: update all parameters using the update rules on the previous slide
Until convergence
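Putting the pieces together, one iteration of this efficient training loop (often called Baum-Welch) might look like the sketch below. The dict-based tables, the restriction to a single training sequence, and the lack of smoothing are all illustrative simplifications:

```python
def baum_welch_step(words, tags, pi, a, b):
    """One efficient EM iteration on a single training sequence:
    forward/backward (E-step), then the update rules above (M-step).
    Dict-based tables, single sequence, no smoothing: a sketch only."""
    n = len(words)
    # E-step: forward and backward tables (0-indexed positions)
    alpha = [{t: 0.0 for t in tags} for _ in range(n)]
    beta = [{t: 1.0 for t in tags} for _ in range(n)]   # beta_i(n) = 1
    for t in tags:
        alpha[0][t] = pi[t] * b[(t, words[0])]
    for s in range(1, n):
        for tj in tags:
            alpha[s][tj] = sum(alpha[s - 1][ti] * a[(ti, tj)]
                               for ti in tags) * b[(tj, words[s])]
    for s in range(n - 2, -1, -1):
        for ti in tags:
            beta[s][ti] = sum(a[(ti, tj)] * b[(tj, words[s + 1])] * beta[s + 1][tj]
                              for tj in tags)
    pr_w = sum(alpha[n - 1][t] for t in tags)           # Pr(<w>)
    # gamma_i(s) and xi_ij(s) from the previous slides
    gamma = [{t: alpha[s][t] * beta[s][t] / pr_w for t in tags}
             for s in range(n)]
    xi = [{(ti, tj): alpha[s][ti] * a[(ti, tj)] * b[(tj, words[s + 1])]
                     * beta[s + 1][tj] / pr_w
           for ti in tags for tj in tags} for s in range(n - 1)]
    # M-step: the update rules
    new_pi = {t: gamma[0][t] for t in tags}
    new_a = {(ti, tj): sum(xi[s][(ti, tj)] for s in range(n - 1))
                       / sum(gamma[s][ti] for s in range(n - 1))
             for ti in tags for tj in tags}
    new_b = {(t, w): sum(gamma[s][t] for s in range(n) if words[s] == w)
                     / sum(gamma[s][t] for s in range(n))
             for t in tags for w in set(words)}
    return new_pi, new_a, new_b, pr_w
```

Iterating this step until Pr(<w>) stops improving gives the locally optimal parameters that EM guarantees; each iteration costs O(M²·n) instead of the O(Mⁿ) of the brute-force version.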