Slide 1: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
J. Lafferty, A. McCallum, F. Pereira. (ICML’01)
Presented by Kevin Duh, March 4, 2005, UW Markovia Reading Group
Slide 2: Outline
• Motivation:
  • HMM and CMM limitations
  • Label Bias problem
• Conditional Random Field (CRF) Definition
• CRF Parameter Estimation
  • Iterative Scaling
• Experiments
  • Synthetic
  • Part-of-speech tagging
Slide 3: Hidden Markov Models (HMMs)
• Generative model p(X,Y) (a toy sketch of this joint probability follows below)
• Must enumerate all possible observation sequences, which requires an atomic representation
• Assumes independence of features (same as Naïve Bayes)
[Figure: HMM chain with states Y_{i-3}, ..., Y_i and emissions X_{i-3}, ..., X_i; Naïve Bayes graph with Y generating X_A and X_B]
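To make the generative factorization concrete, here is a minimal Python sketch of the HMM joint probability p(X,Y); the tags, words, and numbers are illustrative, not from the slides:

```python
# A minimal sketch of the HMM joint probability
#   p(X, Y) = p(y_1) p(x_1|y_1) * prod_i p(y_i | y_{i-1}) p(x_i | y_i)
def hmm_joint(xs, ys, trans, emit, start):
    """trans[(y_prev, y)], emit[(y, x)], start[y] are probabilities."""
    p = start[ys[0]] * emit[(ys[0], xs[0])]
    for i in range(1, len(xs)):
        p *= trans[(ys[i - 1], ys[i])] * emit[(ys[i], xs[i])]
    return p

# Toy example with two tags and two words:
start = {"D": 0.9, "N": 0.1}
trans = {("D", "N"): 0.8, ("D", "D"): 0.2, ("N", "N"): 0.5, ("N", "D"): 0.5}
emit = {("D", "the"): 0.9, ("D", "robot"): 0.1,
        ("N", "the"): 0.1, ("N", "robot"): 0.9}
print(hmm_joint(["the", "robot"], ["D", "N"], trans, emit, start))  # 0.5832
```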
Slide 4: Conditional Markov Models (CMMs)
• Conditional model P(Y|X)
• No effort wasted on modeling observations
• Transition probability can depend on both past and future observations
• Features can be dependent
• Suffers from the label-bias problem due to per-state normalization
[Figure: CMM chain with each state Y_i conditioned on Y_{i-1} and observation X_i]
Example: Maximum Entropy Markov Model (MEMM)
Slide 5: Label Bias Example
After Wallach ‘02
[Figure: stochastic FSA from start state s to end state e with two tag paths: s → D(The) → N(robot) → N(wheels) → V(are) → A(round) → e, and s → D(The) → N(robot) → V(wheels) → N(Fred) → R(round) → e]
D: determiner, N: noun, V: verb, A: adjective, R: adverb
Obs: “The robot wheels are round.”
But if P(V | N, "wheels") > P(N | N, "wheels"), then the path tagging "wheels" as a verb is chosen regardless of the rest of the observation (a numeric sketch follows below).
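A toy numeric illustration of the effect (the scores are hypothetical, not from Wallach's experiments): with per-state normalization, every state on each branch after "wheels" has a single successor and therefore passes probability 1.0 no matter what is observed, so the branch decision at "wheels" alone fixes the path:

```python
# Label bias under per-state normalization (MEMM-style), sketched on the
# two tag paths of the figure above.
def normalize(scores):
    """Per-state normalization: outgoing scores sum to 1."""
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()}

# Hypothetical unnormalized scores out of state N given obs "wheels":
out_of_N = normalize({"V": 3.0, "N": 2.0})  # P(V|N,wheels) > P(N|N,wheels)

# After this branch point each path is deterministic (one successor),
# so every later transition contributes a factor of exactly 1.0:
p_wheels_as_verb = out_of_N["V"] * 1.0 * 1.0  # regardless of "are"/"Fred"
p_wheels_as_noun = out_of_N["N"] * 1.0 * 1.0
print(p_wheels_as_verb, p_wheels_as_noun)  # 0.6 0.4 -- later words never mattered
```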
Slide 6: Label Bias Problem
• The Problem: States with low-entropy next-state distributions ignore observations
• Fundamental cause: Per-state normalization
  • "Conservation of score mass": transitions leaving a given state compete only against each other
• Solution:
  • Model accounts for the whole sequence at once
  • Probability mass is amplified/dampened at individual transitions
Slide 7: Conditional Random Fields (CRFs)
• Single exponential model of the joint probability of the entire state sequence given the observations
• Alternative view: a finite state model with un-normalized transition probabilities
[Figure: chain CRF with connected states Y_{i-3}, ..., Y_i globally conditioned on the whole observation sequence X]
Slide 8: Definition of CRF
Def: A CRF is an undirected graphical model globally conditioned on the observation sequence.
Graph: G = (V, E), where V indexes all of Y. (X, Y) is a CRF if, when conditioned on X, each $Y_v$ obeys the Markov property with respect to G:

$P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)$

where $w \sim v$ means that w and v are neighbors in G.
Slide 9: What does the distribution of a Random Field look like?
• Hammersley-Clifford Theorem:
• Potential functions:
  • strictly positive, real-valued functions
  • no direct probabilistic interpretation
  • represent "constraints" on configurations of random variables
• An overall configuration satisfying more constraints will have higher probability
• Here, potential functions are chosen based on the Maximum Entropy principle
By the theorem, the joint distribution factorizes over the cliques C of the graph:

$p(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_{c \in C} \Psi_{V_c}(v_c)$

(This factorization corresponds exactly to the conditional independence statements made by the random field.)
Slide 10: Maximum Entropy Principle
• MaxEnt says: "When estimating a distribution, pick the maximum entropy distribution that respects all features f(x,y) seen in the training data"
• Constrained optimization problem, with constraints

$\sum_{x,y} \tilde{p}(x,y)\, f(x,y) = \sum_{x,y} \tilde{p}(x)\, q(y \mid x)\, f(x,y)$, i.e. $E_{\tilde{p}(x,y)}[f] = E_q[f]$

• Parametric form (a toy evaluation is sketched below):

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x,y) \Big)$
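A minimal sketch of evaluating this parametric form; the feature functions and weights are invented for illustration and nothing here is prescribed by the paper:

```python
import math

# MaxEnt / log-linear conditional model:
#   p_lambda(y|x) = exp(sum_k lambda_k f_k(x,y)) / Z(x)
def p_lambda(y, x, features, lambdas, label_set):
    """Conditional probability under a log-linear (MaxEnt) model."""
    def score(y_, x_):
        return sum(l * f(x_, y_) for l, f in zip(lambdas, features))
    z = sum(math.exp(score(y_, x)) for y_ in label_set)  # per-input Z(x)
    return math.exp(score(y, x)) / z

# Toy example: two binary features over words and POS tags.
features = [
    lambda x, y: 1.0 if x.endswith("s") and y == "NOUN" else 0.0,
    lambda x, y: 1.0 if x == "are" and y == "VERB" else 0.0,
]
lambdas = [1.5, 2.0]
print(p_lambda("NOUN", "wheels", features, lambdas, {"NOUN", "VERB"}))  # ~0.82
```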
Slide 11: Parametric Form of the CRF Distribution
Define each potential function as:

$\Psi_{Y_c}(y_c) = \exp\Big( \sum_k \lambda_k f_k(c, y_c, x) \Big)$

The CRF distribution becomes:

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{c \in C} \sum_k \lambda_k f_k(c, y_c, x) \Big)$

Distinguish between two types of features, edge features $f_k$ and vertex features $g_k$:

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, y_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y_v, x) \Big)$

Special case of an HMM-like chain graph (evaluated in the sketch below):

$p_\Lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{i,k} \mu_k g_k(y_i, x) \Big)$
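To make the global normalization concrete, here is a brute-force sketch of the chain-CRF distribution. The weighted feature functions are invented for illustration, and enumeration of Z(x) is exponential, so this is only sensible for toy label sets:

```python
import math
from itertools import product

def seq_score(y, x, f_edge, g_vert):
    """Unnormalized log-score of label sequence y for input x."""
    s = sum(g_vert(y[i], x, i) for i in range(len(y)))
    s += sum(f_edge(y[i - 1], y[i], x, i) for i in range(1, len(y)))
    return s

def p_crf(y, x, f_edge, g_vert, labels):
    # Global normalizer Z(x): sum over all whole label sequences.
    z = sum(math.exp(seq_score(yp, x, f_edge, g_vert))
            for yp in product(labels, repeat=len(x)))
    return math.exp(seq_score(y, x, f_edge, g_vert)) / z

# Toy pre-weighted features (weights folded into the return values):
f_edge = lambda yp, y, x, i: 0.8 if (yp, y) == ("D", "N") else 0.0
g_vert = lambda y, x, i: 1.2 if (x[i] == "the" and y == "D") else 0.0
x = ("the", "robot")
print(p_crf(("D", "N"), x, f_edge, g_vert, labels=("D", "N", "V")))
```

Note the contrast with the MEMM: normalization is over whole label sequences rather than per state, so no single transition can hoard probability mass.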
Slide 12: CRF Parameter Estimation
• Iterative Scaling: maximizes the log-likelihood by iteratively updating the parameters (loop skeleton sketched below)

$O(\Lambda) = \sum_{i=1}^{N} \log p_\Lambda(y^{(i)} \mid x^{(i)}) \propto \sum_{x,y} \tilde{p}(x,y) \log p_\Lambda(y \mid x)$

• Define an auxiliary function $A(\Lambda', \Lambda)$ s.t. $A(\Lambda', \Lambda) \leq O(\Lambda') - O(\Lambda)$
• Initialize each $\lambda_k$
• Do until convergence:
  • Solve $\frac{dA(\Lambda', \Lambda)}{d\,\delta\lambda_k} = 0$ for each $\delta\lambda_k$
  • Update the parameters: $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$, $\mu_k \leftarrow \mu_k + \delta\mu_k$
Slide 13: CRF Parameter Estimation (cont.)
• For a chain CRF, setting $\frac{dA(\Lambda', \Lambda)}{d\,\delta\lambda_k} = 0$ gives:

$\tilde{E}[f_k] \triangleq \sum_{x,y} \tilde{p}(x,y) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x,y)\big)$

• $T(x,y) = \sum_{i,k} f_k(y_{i-1}, y_i, x) + \sum_{i,k} g_k(y_i, x)$ is the total feature count
• Unfortunately, T(x,y) is a global property of (x,y): dynamic programming would have to sum over sequences with potentially varying T, making the exponential-sum computation inefficient
Slide 14: Algorithm S (Generalized Iterative Scaling)
• Introduce a global slack feature s.t. T(x,y) becomes a constant S for all (x,y):

$s(x,y) \triangleq S - \sum_{i,k} f_k(y_{i-1}, y_i, x) - \sum_{i,k} g_k(y_i, x)$

• Define forward and backward variables (a matrix-form sketch follows below):

$\alpha_i(y \mid x) = \alpha_{i-1}(y \mid x)\, \exp\Big( \sum_k \lambda_k f_k(y_{i-1}, y_i, x) + \sum_k \mu_k g_k(y_i, x) \Big)$

$\beta_i(y \mid x) = \beta_{i+1}(y \mid x)\, \exp\Big( \sum_k \lambda_k f_k(y_i, y_{i+1}, x) + \sum_k \mu_k g_k(y_{i+1}, x) \Big)$
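A sketch of these recursions in the standard matrix form; the matrix packaging is an assumption made for compactness (the slide states the recursions per label sequence), and the score matrices are random toys:

```python
import numpy as np

# Forward/backward for a chain CRF. M[i] holds un-normalized transition
# scores at position i:
#   M[i][y_prev, y] = exp(sum_k lambda_k f_k(y_prev, y, x) + sum_k mu_k g_k(y, x))
def forward_backward(M):
    n, L, _ = M.shape               # n positions, L labels
    alpha = np.zeros((n, L))
    beta = np.zeros((n, L))
    alpha[0] = M[0][0]              # assume a fixed start state indexed 0
    for i in range(1, n):
        alpha[i] = alpha[i - 1] @ M[i]
    beta[n - 1] = 1.0
    for i in range(n - 2, -1, -1):
        beta[i] = M[i + 1] @ beta[i + 1]
    Z = alpha[n - 1].sum()          # global normalizer
    posterior = alpha * beta / Z    # like HMM state posteriors
    return alpha, beta, Z, posterior

# Toy run: 3 positions, 2 labels, random positive scores.
rng = np.random.default_rng(0)
M = np.exp(rng.normal(size=(3, 2, 2)))
alpha, beta, Z, post = forward_backward(M)
print(post.sum(axis=1))  # each position's posterior sums to 1
```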
Slide 15: Algorithm S
The update equations become:

$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E[f_k]}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E}[g_k]}{E[g_k]}$

where the model expectations $E[f_k]$ and $E[g_k]$ are computed from the forward and backward variables.
Note that $\alpha_i \beta_i / Z$ acts like a posterior, as in HMM forward-backward.
The rate of convergence is governed by S.
Slide 16: Algorithm T (Improved Iterative Scaling)
The equation we want to solve:

$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x,y)\big)$

Define $T(x) \triangleq \max_y T(x,y)$. Then:

$\tilde{E}[f_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k\, T(x)\big) = \sum_{t=0}^{T_{\max}} \Big[ \sum_{\{x,y \mid T(x)=t\}} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) \Big] \exp(\delta\lambda_k)^t$

Now, let $a_{k,t}$ (and analogously $b_{k,t}$ for the $g_k$ features) be the expectation of $f_k$ restricted to sequences with $T(x) = t$:

$a_{k,t} = \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x)\, \delta\big(t, T(x)\big)$

Update: $\sum_{t=0}^{T_{\max}} a_{k,t} \exp(\delta\lambda_k)^t = \tilde{E}[f_k]$ is polynomial in $\exp(\delta\lambda_k)$, so it can be solved with Newton's method (sketched below).
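A minimal sketch of the resulting one-dimensional Newton solve; the coefficients a_{k,t} and the target expectation are made up for illustration:

```python
import math

# Solve  sum_t a[t] * beta**t = E_tilde  for beta = exp(delta_lambda),
# then recover delta_lambda = log(beta).
def solve_update(a, E_tilde, beta0=1.0, tol=1e-10, max_iters=100):
    beta = beta0
    for _ in range(max_iters):
        f = sum(c * beta**t for t, c in enumerate(a)) - E_tilde
        df = sum(t * c * beta**(t - 1) for t, c in enumerate(a) if t > 0)
        step = f / df
        beta -= step                   # Newton step
        if abs(step) < tol:
            break
    return math.log(beta)              # delta_lambda_k

# Toy coefficients a_{k,t} for t = 0..3 and a target empirical expectation:
print(solve_update(a=[0.1, 0.5, 0.3, 0.1], E_tilde=2.0))
```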
Slide 17: Experiments with Synthetic Data
1. Modeling label bias:
   • Generated data with the Fig. 1 stochastic FSA
   • Error rates: CRF 4.6%, MEMM 42%
2. Modeling mixed-order sources:
   • Generate data by $\alpha\, p(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p(y_i \mid y_{i-1})$ (sampling sketched below)
Slide 18: POS tagging experiment
Wall Street Journal dataset; 45 POS tags
Training time: initial parameter values are the result of MEMM training (100 iterations); convergence for CRF+ took 1000 more iterations.
Slide 19: Conclusion / Summary
• CRFs are undirected graphical models globally conditioned on observations
• Advantages of CRFs:
  • Conditional model
  • Allows multiple interacting features
• Disadvantage of CRFs:
  • Slow convergence during training
• Potential future directions:
  • More complex graph structures
  • Faster (approximate) inference/learning algorithms
  • Feature selection/induction algorithms for CRFs
Slide 20: Useful References
• Hanna Wallach. Efficient Training of Conditional Random Fields. M.Sc. thesis, Division of Informatics, University of Edinburgh, 2002.
• Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
• Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22.