Review: Hidden Markov Models (lecture transcript)
Review: Hidden Markov Models
• Efficient dynamic programming algorithms exist for
  – Finding Pr(S)
  – Finding the highest-probability path P that maximizes Pr(S,P) (Viterbi)
• Training the model
  – State sequence known: MLE + smoothing
  – Otherwise: Baum-Welch algorithm
[Figure: an example HMM with states S1–S4; transition probabilities 0.9, 0.5, 0.5, 0.8, 0.2, 0.1 on the arcs, and per-state emission probabilities over the alphabet {A, C} (0.6/0.4, 0.3/0.7, 0.5/0.5, 0.9/0.1).]
HMM for Segmentation
• Simplest Model: One state per entity type
HMM Learning
• Manually pick the HMM's graph (e.g., the simple model, or fully connected)
• Learn transition probabilities: Pr(si|sj)
• Learn emission probabilities: Pr(w|si)
Learning model parameters
• When training data defines a unique path through the HMM:
  – Transition probabilities
    • Probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions out of state i)
  – Emission probabilities
    • Probability of emitting symbol k from state i = (number of times k is generated from i) / (total symbols generated from i)
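In code, those counting rules look like this (a minimal sketch; the one-(word, state)-pair-per-token data format and the function name are assumptions, and no smoothing is applied):

```python
from collections import defaultdict

def train_hmm(tagged_seqs):
    """MLE estimates when training data defines a unique path through the HMM:
    Pr(j|i) = (# transitions i -> j) / (total transitions out of i)
    Pr(k|i) = (# times state i emits symbol k) / (total emissions from i)"""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for seq in tagged_seqs:                      # seq is a list of (word, state)
        for word, state in seq:
            emit[state][word] += 1
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[s1][s2] += 1
    norm = lambda d: {k: v / sum(d.values()) for k, v in d.items()}
    return ({i: norm(js) for i, js in trans.items()},
            {i: norm(ks) for i, ks in emit.items()})
```

With smoothing (e.g., add-one), the zero counts for unseen transitions and symbols would be handled as the earlier slide suggests.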
What is a “symbol” ???
Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ?
4601 => “4601”, “9999”, “9+”, “number”, … ?
[Figure: a taxonomy of symbol abstractions. All splits into Numbers, Chars (A..z) with Multi-letter Words (aa..), and Delimiters (. , / - + ? #); Numbers splits into 3-digits (000..999), 5-digits (00000..99999), and Others (0..99, 0000..9999, 000000..).]
Datamold: choose the best abstraction level using a holdout set
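An abstraction lattice like the one above can be sketched as a function from a token to its candidate levels (a hypothetical scheme; the exact levels and names Datamold uses are not reproduced here):

```python
def abstractions(token):
    """Return candidate symbol abstractions for a token, most specific first.
    A Datamold-style system would pick one level per state using a holdout set."""
    levels = [token]                       # most specific: the token itself
    if token.isdigit():
        levels += ["%d-digits" % len(token), "number"]
    elif token.isalpha():
        shape = "".join("X" if c.isupper() else "x" for c in token)
        levels += [shape, "word"]          # e.g. "Cohen" -> "Xxxxx" -> "word"
    else:
        levels += ["delimiter"]
    return levels
```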
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: graphical model with states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}. Candidate word features: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; … (example: "Wisniewski" ends in "-ski" and is part of a noun phrase).]
We can extend the HMM model so that each state generates multiple “features” – but they should be independent.
Borthwick et al solution
We could use YFCL (your favorite classifier learner): an SVM, logistic regression, a decision tree, … We'll be talking about logistic regression.
[Figure: the same graphical model and feature list as the previous slide.]
Instead of an HMM, classify each token. Don’t learn transition probabilities, instead constrain them at test time.
Stupid HMM tricks
[Figure: a degenerate HMM: from the start state, move to the "red" state with Pr(red) or the "green" state with Pr(green); then Pr(red|red) = 1 and Pr(green|green) = 1, so the state never changes.]
Stupid HMM tricks
[Figure: the same two-state HMM as the previous slide.]

Pr(y|x) = Pr(x|y) · Pr(y) / Pr(x)

argmax_y Pr(y|x) = argmax_y Pr(x|y) · Pr(y) = argmax_y Pr(y) · Pr(x1|y) · Pr(x2|y) · … · Pr(xm|y)

Pr("I voted for Ralph Nader" | ggggg) = Pr(g) · Pr(I|g) · Pr(voted|g) · Pr(for|g) · Pr(Ralph|g) · Pr(Nader|g)
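A tiny sketch of that computation (the class priors and conditionals are invented for illustration):

```python
import math

def nb_log_score(words, y, prior, cond, eps=1e-9):
    """log[ Pr(y) * prod_j Pr(w_j|y) ]: the probability the degenerate
    one-state-per-class HMM assigns to generating the sentence from class y."""
    return math.log(prior[y]) + sum(math.log(cond[y].get(w, eps)) for w in words)

# Toy two-class model ("g" vs "r"); the numbers are made up.
prior = {"g": 0.5, "r": 0.5}
cond = {"g": {"I": 0.4, "voted": 0.4},
        "r": {"I": 0.1, "voted": 0.1}}
best = max(prior, key=lambda y: nb_log_score("I voted".split(), y, prior, cond))
```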
From NB to Maxent
Pr(y|x) = (1/Z) · Pr(y) · ∏_j Pr(w_j|y), where w_j is a word in x

Equivalently, define one feature per (j, k) combination:
f_i(x, y) = [word w_j appears in x and y is class y_k] (1 or 0),
with weight λ_i = log Pr(w_j | y_k), plus a bias feature f_0(x, y) whose weight λ_0 absorbs Pr(y) and 1/Z.
From NB to Maxent
Pr(y|x) = (1/Z) · Pr(y) · ∏_j Pr(w_j|y), where w_j is a word in x

Or: log Pr(y|x) = λ_0 + Σ_i λ_i f_i(x, y)

Idea: keep the same functional form as naïve Bayes, but pick the parameters to optimize performance on training data.

One possible definition of performance is the conditional log likelihood of the data: Σ_t log Pr(y_t | x_t)
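A minimal sketch of that optimization: batch gradient ascent on Σ_t log P(y_t|x_t) for a two-class model with naive-Bayes-style word features. The learning rate, iteration count, and `l2` option (the zero-centered prior from the next slide) are illustrative choices, not anything prescribed in the lecture:

```python
import math

def feats(x, y):
    """Naive-Bayes-style indicator features: one per (word, class) pair."""
    return {(w, y): 1.0 for w in x}

def train_maxent(data, classes=(0, 1), n_iter=100, lr=0.5, l2=0.0):
    """Batch gradient ascent on sum_t log P(y_t | x_t), where
    log P(y|x) is proportional to sum_i lambda_i f_i(x, y).
    Setting l2 > 0 adds a zero-centered (ridge-style) penalty."""
    lam = {}
    for _ in range(n_iter):
        grad = {}
        for x, y in data:
            scores = {c: sum(lam.get(f, 0.0) * v for f, v in feats(x, c).items())
                      for c in classes}
            z = sum(math.exp(s) for s in scores.values())
            for c in classes:
                p = math.exp(scores[c]) / z        # P(c | x) under current lam
                # gradient = empirical feature counts - expected feature counts
                for f, v in feats(x, c).items():
                    grad[f] = grad.get(f, 0.0) + ((y == c) - p) * v
        for f, g in grad.items():
            lam[f] = lam.get(f, 0.0) + lr * (g - l2 * lam.get(f, 0.0))
    return lam

def predict(lam, x, classes=(0, 1)):
    return max(classes,
               key=lambda c: sum(lam.get(f, 0.0) * v for f, v in feats(x, c).items()))
```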
MaxEnt Comments
– Implementation:
  • All methods are iterative
  • For NLP-like problems with many features, modern gradient-like or Newton-like methods work well
  • Thursday I'll derive the gradient for CRFs
– Smoothing:
  • Typically maxent will overfit the data if there are many infrequent features.
  • Old-school solutions: discard low-count features; early stopping with a holdout set; …
  • Modern solutions: penalize large parameter values with a prior centered on zero to limit the size of the alphas (i.e., optimize log likelihood minus a penalty on the alphas); other regularization techniques
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: the same graphical model and feature list as before.]
Borthwick et al idea
[Figure: the same graphical model and feature list as before.]
Idea: replace generative model in HMM with a maxent model, where state depends on observations
Pr(s_t | x_t, …)
Another idea….
[Figure: the same graphical model and feature list as before.]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state
Pr(s_t | x_t, s_{t-1}, …)
MaxEnt taggers and MEMMs
[Figure: the same graphical model and feature list as before.]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history
Pr(s_t | x_t, s_{t-1}, s_{t-2}, …)
Learning does not change – you’ve just added a few additional features that are the previous labels.
Classification is trickier – we don’t know the previous-label features at test time – so we will need to search for the best sequence of labels (like for an HMM).
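That search can be written as a Viterbi-style sweep over the local classifier's outputs (a sketch: `local_prob(prev, word, tag)` stands in for any trained classifier, and the toy table below is hand-built):

```python
import math

def memm_viterbi(words, tags, local_prob, start="START"):
    """Best tag sequence under prod_t P(tag_t | word_t, tag_{t-1}), where the
    previous-label feature is filled in by the search rather than the data."""
    V = {t: math.log(local_prob(start, words[0], t)) for t in tags}
    backptrs = []
    for w in words[1:]:
        newV, bp = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: V[p] + math.log(local_prob(p, w, t)))
            newV[t] = V[best] + math.log(local_prob(best, w, t))
            bp[t] = best
        backptrs.append(bp)
        V = newV
    last = max(tags, key=lambda t: V[t])
    path = [last]
    for bp in reversed(backptrs):       # follow backpointers to recover the path
        path.append(bp[path[-1]])
    return list(reversed(path))

# A hand-built toy "classifier" (a real one would also look at the word):
TOY = {"START": {"B": 0.9, "I": 0.1},
       "B": {"B": 0.2, "I": 0.8},
       "I": {"B": 0.4, "I": 0.6}}

def toy_prob(prev, word, tag):
    return TOY[prev][tag]
```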
Partial history of the idea
• Sliding-window classifiers
  – Sejnowski's NETtalk, mid 1980's
• Recurrent neural networks and other "recurrent" sliding-window classifiers
  – Late 1980's and 1990's
• Ratnaparkhi's thesis
  – Mid-late 1990's
• Freitag, McCallum & Pereira, ICML 2000
  – Formalize the notion of the MEMM
• OpenNLP
  – Based largely on MaxEnt taggers; Apache open source
Ratnaparkhi’s MXPOST
• Sequential learning problem: predict POS tags of words.
• Uses the MaxEnt model described above.
• Rich feature set.
• To smooth, discard features occurring < 10 times.
MXPOST
MXPOST: learning & inference
• GIS
• Feature selection
Using the HMM to segment
• Find the highest-probability path through the HMM.
• Viterbi: quadratic dynamic programming algorithm
[Figure: Viterbi lattice for the observation "15213 Butler Highway Greenville 21578", with one column of states {House, Road, City, Pin} per token o_t.]
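The quadratic DP, sketched in log space (the two-state toy model in the test is invented; the real lattice uses the House/Road/City/Pin states above):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Highest-probability state path through an HMM for observation sequence
    obs; O(len(obs) * |states|^2) time."""
    V = [{s: math.log(start_p[s] * emit_p[s][obs[0]]) for s in states}]
    backptrs = []
    for o in obs[1:]:
        row, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s]))
            row[s] = V[-1][prev] + math.log(trans_p[prev][s] * emit_p[s][o])
            bp[s] = prev
        V.append(row)
        backptrs.append(bp)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path))
```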
Inference for MENE (Borthwick et al system)
[Figure: B/I/O lattice with one column per token of "When will prof Cohen post the notes …".]
Goal: find the best legal path through the lattice (i.e., the path that runs through the most black ink). Like Viterbi, but the costs of possible transitions are ignored.
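One way to implement that: a Viterbi-style max over the per-token classifier scores, where legal B/I/O transitions cost nothing and illegal ones (an I after O, or at the start) are forbidden. A sketch with invented probabilities:

```python
import math

def best_legal_bio(token_probs):
    """token_probs[t] maps each tag to P(tag | token t) from the classifier.
    Maximize the product of local probabilities over paths where 'I' never
    follows 'O' or starts the sequence; transition costs are otherwise ignored."""
    NEG = float("-inf")
    tags = ("B", "I", "O")
    legal = lambda prev, cur: not (cur == "I" and prev in (None, "O"))
    V = {t: math.log(token_probs[0][t]) if legal(None, t) else NEG for t in tags}
    backptrs = []
    for dist in token_probs[1:]:
        row, bp = {}, {}
        for t in tags:
            cands = [p for p in tags if legal(p, t) and V[p] > NEG]
            prev = max(cands, key=lambda p: V[p]) if cands else None
            row[t] = (V[prev] + math.log(dist[t])) if prev is not None else NEG
            bp[t] = prev
        V = row
        backptrs.append(bp)
    last = max(tags, key=lambda t: V[t])
    path = [last]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path))
```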
Inference for MXPOST
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
Pr(y|x) = ∏_i Pr(y_i | x, y_1, …, y_{i-1})
≈ ∏_i Pr(y_i | x, y_{i-k}, …, y_{i-1})   (window of k tags)
= ∏_i Pr(y_i | x, y_{i-1})   (k = 1)

(Approximate view): find the best path; weights are now on arcs from state to state.
Inference for MXPOST
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
More accurately: find the total flow to each node; weights are now on arcs from state to state.

α_t(y) = Σ_{y'} α_{t-1}(y') · Pr(Y_t = y | x, Y_{t-1} = y')
Inference for MXPOST
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
Pr(y|x) = ∏_i Pr(y_i | x, y_1, …, y_{i-1})
≈ ∏_i Pr(y_i | x, y_{i-k}, …, y_{i-1})   (window of k tags)
= ∏_i Pr(y_i | x, y_{i-2}, y_{i-1})   (k = 2)
Find best path? tree? Weights are on hyperedges
Inference for MxPOST
[Figure: search lattice over "When will prof Cohen post the notes …"; each node (iI, iO, oI, oO, …) pairs the previous tag with the current tag.]
Beam search is an alternative to Viterbi: at each stage, expand the children of every state on the beam, score them, and discard all but the top n states.
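A sketch of that procedure (the toy P(tag | word, previous tag) table is hand-built, and n = 2 is an arbitrary beam width):

```python
import math

def beam_search(words, tags, local_prob, n=2, start="START"):
    """Keep only the top-n partial tag sequences after each token; an
    approximate alternative to exact Viterbi."""
    beam = [(0.0, [start])]                       # (log score, tag sequence)
    for w in words:
        cands = [(score + math.log(local_prob(seq[-1], w, t)), seq + [t])
                 for score, seq in beam for t in tags]
        beam = sorted(cands, key=lambda c: c[0], reverse=True)[:n]
    return beam[0][1][1:]                         # drop the start symbol

# Hand-built toy distribution standing in for the trained tagger:
TOY = {"START": {"B": 0.9, "I": 0.1},
       "B": {"B": 0.2, "I": 0.8},
       "I": {"B": 0.4, "I": 0.6}}
```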
MXPost results
• State-of-the-art accuracy (for 1996)
• The same approach was used successfully for several other sequential classification steps of a stochastic parser (also state of the art).
• The same (or similar) approaches were used for NER by Borthwick, Malouf, Manning, and others.
MEMMs
• Basic difference from ME tagging:
  – ME tagging: the previous state is a feature of the MaxEnt classifier
  – MEMM: build a separate MaxEnt classifier for each state.
    • Can build any HMM architecture you want; e.g., parallel nested HMM's, etc.
    • Data is fragmented: examples where the previous tag is "proper noun" give no information about learning tags when the previous tag is "noun"
  – Mostly a difference in viewpoint
  – MEMM does allow the possibility of "hidden" states and Baum-Welch-like training
MEMM task: FAQ parsing
MEMM features
MEMMs
Looking forward
• HMMs
  – Easy-to-train generative model
  – Features for a state must be independent (−)
• MaxEnt tagger/MEMM
  – Multiple cascaded classifiers
  – Features can be arbitrary (+)
  – Have we given anything up?
HMM inference
• The total probability of transitions out of a state must sum to 1
• But … they can all lead to "unlikely" states
• So … a state can be a (probable) "dead end" in the lattice
[Figure: the same lattice as before, over "15213 Butler ... 21578" with states {House, Road, City, Pin}.]
Inference for MXPOST
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
More accurately: find the total flow to each node; weights are now on arcs from state to state.

α_t(y) = Σ_{y'} α_{t-1}(y') · Pr(Y_t = y | x, Y_{t-1} = y')

Flow out of each node is always fixed:

∀y':  Σ_y Pr(Y_t = y | x, Y_{t-1} = y') = 1
Label Bias Problem (Lafferty, McCallum, Pereira ICML 2001)
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1

Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1

But per-state normalization also forces Pr(2|1,i) = 1 and Pr(5|4,o) = 1, so Pr(0123|rib) = Pr(0453|rob) = 0.5 as well: the observation inside the word cannot change the path probabilities, and the model cannot prefer the correct path.
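The arithmetic can be checked directly with a hand-built version of this MEMM, where state 0 branches on 'r' and every other state has a single outgoing arc (so its locally normalized distribution ignores the observation):

```python
# Locally normalized next-state distributions, keyed by (state, observation);
# None means the distribution is the same for every observation (only one arc).
LOCAL = {(0, "r"): {1: 0.5, 4: 0.5},
         (1, None): {2: 1.0},      # Pr(2|1,o) = Pr(2|1,i) = 1
         (2, None): {3: 1.0},
         (4, None): {5: 1.0},      # Pr(5|4,i) = Pr(5|4,o) = 1
         (5, None): {3: 1.0}}

def path_prob(word, path):
    """Product of per-state, locally normalized transition probabilities."""
    p = 1.0
    for prev, ch, nxt in zip(path, word, path[1:]):
        dist = LOCAL.get((prev, ch)) or LOCAL[(prev, None)]
        p *= dist.get(nxt, 0.0)
    return p
```

Both paths get probability 0.5 for either word: per-state normalization throws away the evidence from the middle character, which is the label bias problem.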
Another max-flow scheme
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
More accurately: find the total flow to each node; weights are now on arcs from state to state.

α_t(y) = Σ_{y'} α_{t-1}(y') · Pr(Y_t = y | x, Y_{t-1} = y')

Flow out of a node is always fixed:

∀y':  Σ_y Pr(Y_t = y | x, Y_{t-1} = y') = 1
Another max-flow scheme: MRFs
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
Goal is to learn how to weight edges in the graph:
• weight(y_i, y_{i+1}) = 2·[(y_i = B or I) and isCap(x_i)] + 1·[y_i = B and isFirstName(x_i)] − 5·[y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
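That hand-written rule translates directly into a scoring function (a sketch; `FIRST_NAMES` is a hypothetical stand-in for whatever list isFirstName consults):

```python
FIRST_NAMES = {"Ralph", "William"}   # hypothetical name list for isFirstName

def edge_weight(yi, yi1, xi, xi1):
    """The slide's hand-weighted score for the edge between labels yi and yi1."""
    w = 0.0
    if yi in ("B", "I") and xi[:1].isupper():      # 2*[(yi=B or I) and isCap(xi)]
        w += 2.0
    if yi == "B" and xi in FIRST_NAMES:            # 1*[yi=B and isFirstName(xi)]
        w += 1.0
    if yi1 != "B" and xi.islower() and xi1[:1].isupper():
        w -= 5.0                                   # -5*[yi+1 != B and case clash]
    return w
```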
Another max-flow scheme: MRFs
[Figure: the same B/I/O lattice over "When will prof Cohen post the notes …" as before.]
Find the total flow to each node; weights are now on edges from state to state. The goal is to learn how to weight the edges in the graph, given features from the examples.
Another view of label bias [Sha & Pereira]
So what’s the alternative?