Slide 1: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
J. Lafferty, A. McCallum, F. Pereira. (ICML’01)
Presented by Kevin Duh, March 4, 2005, UW Markovia Reading Group
Slide 2: Outline
• Motivation:
  • HMM and CMM limitations
  • Label Bias problem
• Conditional Random Field (CRF) Definition
• CRF Parameter Estimation
  • Iterative Scaling
• Experiments
  • Synthetic
  • Part-of-speech tagging
Slide 3: Hidden Markov Models (HMMs)
• Generative model p(X,Y) (a toy sketch of this joint probability follows below)
• Must enumerate all possible observation sequences, which requires an atomic representation
• Assumes independence of features (same as Naïve Bayes)
[Figure: HMM chain with states Y_{i-3}, ..., Y_i and emissions X_{i-3}, ..., X_i; Naïve Bayes graph with Y generating X_A and X_B]
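To make the generative factorization concrete, here is a minimal Python sketch of the HMM joint probability p(X,Y); the tags, words, and numbers are illustrative, not from the slides:

```python
# A minimal sketch of the HMM joint probability
#   p(X, Y) = p(y_1) p(x_1|y_1) * prod_i p(y_i | y_{i-1}) p(x_i | y_i)
def hmm_joint(xs, ys, trans, emit, start):
    """trans[(y_prev, y)], emit[(y, x)], start[y] are probabilities."""
    p = start[ys[0]] * emit[(ys[0], xs[0])]
    for i in range(1, len(xs)):
        p *= trans[(ys[i - 1], ys[i])] * emit[(ys[i], xs[i])]
    return p

# Toy example with two tags and two words:
start = {"D": 0.9, "N": 0.1}
trans = {("D", "N"): 0.8, ("D", "D"): 0.2, ("N", "N"): 0.5, ("N", "D"): 0.5}
emit = {("D", "the"): 0.9, ("D", "robot"): 0.1,
        ("N", "the"): 0.1, ("N", "robot"): 0.9}
print(hmm_joint(["the", "robot"], ["D", "N"], trans, emit, start))  # 0.5832
```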
Slide 4: Conditional Markov Models (CMMs)
• Conditional model P(Y|X)
• No effort wasted on modeling observations
• Transition probability can depend on both past and future observations
• Features can be dependent
• Suffers from the label-bias problem due to per-state normalization
[Figure: CMM chain with each state Y_i conditioned on Y_{i-1} and observation X_i]
Example: Maximum Entropy Markov Model (MEMM)
Slide 5: Label Bias Example
After Wallach ‘02
[Figure: stochastic FSA from start state s to end state e with two tag paths: s → D(The) → N(robot) → N(wheels) → V(are) → A(round) → e, and s → D(The) → N(robot) → V(wheels) → N(Fred) → R(round) → e]
D: determiner, N: noun, V: verb, A: adjective, R: adverb
Obs: “The robot wheels are round.”
But if P(V | N, "wheels") > P(N | N, "wheels"), then the path tagging "wheels" as a verb is chosen regardless of the rest of the observation (a numeric sketch follows below).
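A toy numeric illustration of the effect (the scores are hypothetical, not from Wallach's experiments): with per-state normalization, every state on each branch after "wheels" has a single successor and therefore passes probability 1.0 no matter what is observed, so the branch decision at "wheels" alone fixes the path:

```python
# Label bias under per-state normalization (MEMM-style), sketched on the
# two tag paths of the figure above.
def normalize(scores):
    """Per-state normalization: outgoing scores sum to 1."""
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

# Hypothetical unnormalized scores out of state N given obs "wheels":
out_of_N = normalize({"V": 3.0, "N": 2.0})  # P(V|N,wheels) > P(N|N,wheels)

# After this branch point each path is deterministic (one successor),
# so every later transition contributes a factor of exactly 1.0:
p_wheels_as_verb = out_of_N["V"] * 1.0 * 1.0  # regardless of "are"/"Fred"
p_wheels_as_noun = out_of_N["N"] * 1.0 * 1.0
print(p_wheels_as_verb, p_wheels_as_noun)  # 0.6 0.4 -- later words never mattered
```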
Slide 6: Label Bias Problem
• The Problem: States with low-entropy next-state distributions ignore observations
• Fundamental cause: Per-state normalization
  • "Conservation of score mass": transitions leaving a given state compete only against each other
• Solution:
  • Model accounts for the whole sequence at once
  • Probability mass is amplified/dampened at individual transitions
Slide 7: Conditional Random Fields (CRFs)
• Single exponential model of the joint probability of the entire state sequence given the observations
• Alternative view: a finite state model with un-normalized transition probabilities
[Figure: chain CRF with connected states Y_{i-3}, ..., Y_i globally conditioned on the whole observation sequence X]
Slide 8: Definition of CRF
Def: A CRF is an undirected graphical model globally conditioned on the observation sequence.
Graph: G = (V, E), where V indexes all of Y. (X, Y) is a CRF if, when conditioned on X, each $Y_v$ obeys the Markov property with respect to G:

$P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)$

where $w \sim v$ means that w and v are neighbors in G.
Slide 9: What does the distribution of a Random Field look like?
• Hammersley-Clifford Theorem:
• Potential functions:
  • strictly positive, real-valued functions
  • no direct probabilistic interpretation
  • represent "constraints" on configurations of random variables
• An overall configuration satisfying more constraints will have higher probability
• Here, potential functions are chosen based on the Maximum Entropy principle
By the theorem, the joint distribution factorizes over the cliques C of the graph:

$p(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_{c \in C} \Psi_{V_c}(v_c)$

(This factorization corresponds exactly to the conditional independence statements made by the random field.)
Slide 10: Maximum Entropy Principle
• MaxEnt says: "When estimating a distribution, pick the maximum entropy distribution that respects all features f(x,y) seen in the training data"
• Constrained optimization problem, with constraints

$\sum_{x,y} \tilde{p}(x,y)\, f(x,y) = \sum_{x,y} \tilde{p}(x)\, q(y \mid x)\, f(x,y)$, i.e. $E_{\tilde{p}(x,y)}[f] = E_q[f]$

• Parametric form (a toy evaluation is sketched below):

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x,y) \Big)$
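A minimal sketch of evaluating this parametric form; the feature functions and weights are invented for illustration and nothing here is prescribed by the paper:

```python
import math

# MaxEnt / log-linear conditional model:
#   p_lambda(y|x) = exp(sum_k lambda_k f_k(x,y)) / Z(x)
def p_lambda(y, x, features, lambdas, label_set):
    """Conditional probability under a log-linear (MaxEnt) model."""
    def score(y_, x_):
        return sum(l * f(x_, y_) for l, f in zip(lambdas, features))
    z = sum(math.exp(score(y_, x)) for y_ in label_set)  # per-input Z(x)
    return math.exp(score(y, x)) / z

# Toy example: two binary features over words and POS tags.
features = [
    lambda x, y: 1.0 if x.endswith("s") and y == "NOUN" else 0.0,
    lambda x, y: 1.0 if x == "are" and y == "VERB" else 0.0,
]
lambdas = [1.5, 2.0]
print(p_lambda("NOUN", "wheels", features, lambdas, {"NOUN", "VERB"}))  # ~0.82
```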
Slide 11: Parametric Form of the CRF Distribution
Define each potential function as:

$\Psi_{Y_c}(y_c) = \exp\Big( \sum_k \lambda_k f_k(c, y_c, x) \Big)$

The CRF distribution becomes:

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in C} \sum_k \lambda_k f_k(c, y_c, x) \Big)$

Distinguish between two types of features, edge features $f_k$ and vertex features $g_k$:

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, y_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y_v, x) \Big)$

Special case of an HMM-like chain graph (evaluated in the sketch below):

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{i,k} \mu_k g_k(y_i, x) \Big)$
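To make the global normalization concrete, here is a brute-force sketch of the chain-CRF distribution. The weighted feature functions are invented for illustration, and enumeration of Z(x) is exponential, so this is only sensible for toy label sets:

```python
import math
from itertools import product

def seq_score(y, x, f_edge, g_vert):
    """Unnormalized log-score of label sequence y for input x."""
    s = sum(g_vert(y[i], x, i) for i in range(len(y)))
    s += sum(f_edge(y[i - 1], y[i], x, i) for i in range(1, len(y)))
    return s

def p_crf(y, x, f_edge, g_vert, labels):
    # Global normalizer Z(x): sum over all whole label sequences.
    z = sum(math.exp(seq_score(yp, x, f_edge, g_vert))
            for yp in product(labels, repeat=len(x)))
    return math.exp(seq_score(y, x, f_edge, g_vert)) / z

# Toy pre-weighted features (weights folded into the return values):
f_edge = lambda yp, y, x, i: 0.8 if (yp, y) == ("D", "N") else 0.0
g_vert = lambda y, x, i: 1.2 if (x[i] == "the" and y == "D") else 0.0
x = ("the", "robot")
print(p_crf(("D", "N"), x, f_edge, g_vert, labels=("D", "N", "V")))
```

Note the contrast with the MEMM: normalization is over whole label sequences rather than per state, so no single transition can hoard probability mass.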
Slide 12: CRF Parameter Estimation
• Iterative Scaling: maximizes the log-likelihood by iteratively updating the parameters (loop skeleton sketched below)

$O(\Lambda) = \sum_{i=1}^{N} \log p_\Lambda(y^{(i)} \mid x^{(i)}) \propto \sum_{x,y} \tilde{p}(x,y) \log p_\Lambda(y \mid x)$

• Define an auxiliary function $A(\Lambda', \Lambda)$ s.t. $A(\Lambda', \Lambda) \leq O(\Lambda') - O(\Lambda)$
• Initialize each $\lambda_k$
• Do until convergence:
  • Solve $\frac{dA(\Lambda', \Lambda)}{d\,\delta\lambda_k} = 0$ for each $\delta\lambda_k$
  • Update the parameters: $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$, $\mu_k \leftarrow \mu_k + \delta\mu_k$
Slide 13: CRF Parameter Estimation (cont.)
• For a chain CRF, setting $\frac{dA(\Lambda', \Lambda)}{d\,\delta\lambda_k} = 0$ gives:

$\tilde{E}[f_k] \triangleq \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x,y)\big)$

• $T(x,y) = \sum_{i,k} f_k(y_{i-1}, y_i, x) + \sum_{i,k} g_k(y_i, x)$ is the total feature count
• Unfortunately, T(x,y) is a global property of (x,y): dynamic programming would have to sum over sequences with potentially varying T, making the exponential-sum computation inefficient
Slide 14: Algorithm S (Generalized Iterative Scaling)
• Introduce a global slack feature s.t. T(x,y) becomes a constant S for all (x,y):

$s(x,y) \triangleq S - \sum_{i,k} f_k(y_{i-1}, y_i, x) - \sum_{i,k} g_k(y_i, x)$

• Define forward and backward variables (a matrix-form sketch follows below):

$\alpha_i(y \mid x) = \alpha_{i-1}(y \mid x)\, \exp\Big( \sum_k \lambda_k f_k(y_{i-1}, y_i, x) + \sum_k \mu_k g_k(y_i, x) \Big)$

$\beta_i(y \mid x) = \beta_{i+1}(y \mid x)\, \exp\Big( \sum_k \lambda_k f_k(y_i, y_{i+1}, x) + \sum_k \mu_k g_k(y_{i+1}, x) \Big)$
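A sketch of these recursions in the standard matrix form; the matrix packaging is an assumption made for compactness (the slide states the recursions per label sequence), and the score matrices are random toys:

```python
import numpy as np

# Forward/backward for a chain CRF. M[i] holds un-normalized transition
# scores at position i:
#   M[i][y_prev, y] = exp(sum_k lambda_k f_k(y_prev, y, x) + sum_k mu_k g_k(y, x))
def forward_backward(M):
    n, L, _ = M.shape               # n positions, L labels
    alpha = np.zeros((n, L))
    beta = np.zeros((n, L))
    alpha[0] = M[0][0]              # assume a fixed start state indexed 0
    for i in range(1, n):
        alpha[i] = alpha[i - 1] @ M[i]
    beta[n - 1] = 1.0
    for i in range(n - 2, -1, -1):
        beta[i] = M[i + 1] @ beta[i + 1]
    Z = alpha[n - 1].sum()          # global normalizer
    posterior = alpha * beta / Z    # like HMM state posteriors
    return alpha, beta, Z, posterior

# Toy run: 3 positions, 2 labels, random positive scores.
rng = np.random.default_rng(0)
M = np.exp(rng.normal(size=(3, 2, 2)))
alpha, beta, Z, post = forward_backward(M)
print(post.sum(axis=1))  # each position's posterior sums to 1
```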
Slide 15: Algorithm S
The update equations become:

$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E[f_k]}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E}[g_k]}{E[g_k]}$

where the model expectations $E[f_k]$ and $E[g_k]$ are computed from the forward and backward variables.
Note that $\alpha_i \beta_i / Z$ acts like a posterior, as in HMM forward-backward.
The rate of convergence is governed by S.
Slide 16: Algorithm T (Improved Iterative Scaling)
The equation we want to solve:

$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x,y)\big)$

Define $T(x) \triangleq \max_y T(x,y)$. Then:

$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x)\big) = \sum_{t=0}^{T_{\max}} \Big[ \sum_{\{x,y \mid T(x)=t\}} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) \Big] \exp(\delta\lambda_k)^t$

Now, let $a_{k,t}$ (and analogously $b_{k,t}$ for the $g_k$ features) be the expectation of $f_k$ restricted to sequences with $T(x) = t$:

$a_{k,t} = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \delta\big(t, T(x)\big)$

Update: $\sum_{t=0}^{T_{\max}} a_{k,t} \exp(\delta\lambda_k)^t = \tilde{E}[f_k]$ is polynomial in $\exp(\delta\lambda_k)$, so it can be solved with Newton's method (sketched below).
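A minimal sketch of the resulting one-dimensional Newton solve; the coefficients a_{k,t} and the target expectation are made up for illustration:

```python
import math

# Solve  sum_t a[t] * beta**t = E_tilde  for beta = exp(delta_lambda),
# then recover delta_lambda = log(beta).
def solve_update(a, E_tilde, beta0=1.0, tol=1e-10, max_iters=100):
    beta = beta0
    for _ in range(max_iters):
        f = sum(c * beta**t for t, c in enumerate(a)) - E_tilde
        df = sum(t * c * beta**(t - 1) for t, c in enumerate(a) if t > 0)
        step = f / df
        beta -= step                   # Newton step
        if abs(step) < tol:
            break
    return math.log(beta)              # delta_lambda_k

# Toy coefficients a_{k,t} for t = 0..3 and a target empirical expectation:
print(solve_update(a=[0.1, 0.5, 0.3, 0.1], E_tilde=2.0))
```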
Slide 17: Experiments with Synthetic Data
1. Modeling label bias:
   • Generated data with the Fig. 1 stochastic FSA
   • Error rates: CRF 4.6%, MEMM 42%
2. Modeling mixed-order sources:
   • Generate data by $\alpha\, p(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p(y_i \mid y_{i-1})$ (sampling sketched below)
Slide 18: POS tagging experiment
Wall Street Journal dataset; 45 POS tags
Training time: initial parameter values are the result of MEMM training (100 iterations); convergence for CRF+ took 1000 more iterations.
Slide 19: Conclusion / Summary
• CRFs are undirected graphical models globally conditioned on observations
• Advantages of CRFs:
  • Conditional model
  • Allows multiple interacting features
• Disadvantage of CRFs:
  • Slow convergence during training
• Potential future directions:
  • More complex graph structures
  • Faster (approximate) inference/learning algorithms
  • Feature selection/induction algorithms for CRFs
Slide 20: Useful References
• Hanna Wallach. Efficient Training of Conditional Random Fields. M.Sc. thesis, Division of Informatics, University of Edinburgh, 2002.
• Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
• Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22.