
  • Advanced Language Modeling: Conditional Random Fields

    Thomas Hanneforth

    Linguistics Dept., Universität Potsdam

    February 10, 2015

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • Conditional Random Fields: Overview

    Conditional Random Fields (CRFs) are a generalisation of log-linear models. The main difference between log-linear models and CRFs is that the output set Y of CRFs is generalised to encompass structured data (sequences, ...). CRFs are conditional models, that is, they represent conditional probability distributions of the form p(y|x) between the inputs x and the outputs y.

    CRFs are based on the idea of undirected graphical models: a probability distribution over many random variables can often be represented as a product of local functions on a subset of the variables.

  • Conditional Random Fields: Undirected graphical models

    Assume that a probability distribution p over a set Y of random variables can be factored into a set of local factor functions Φ_a (with |{Φ_a}| = A), where Φ_a depends only on a subset Y_a of Y:

    p(y) = (1/Z) · ∏_{a=1}^{A} Φ_a(y_a)

    As a side condition, Φ_a(y_a) ≥ 0 for all y_a ∈ Y_a. Z is used to normalise ∏_a Φ_a(y_a) to a probability and is called the partition function:

    Z = Σ_{y ∈ Y} ∏_{a=1}^{A} Φ_a(y_a)

    Note that the enumeration of all y ∈ Y is in general intractable.
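
    As a quick illustration of the factorisation and the partition function, the following Python sketch builds p(y) for two binary random variables from two local factors and normalises with Z by brute-force enumeration. The factors and their values are illustrative assumptions, not taken from the slides.

      import itertools

      # Toy undirected model over two binary variables y = (y1, y2).
      # phi_1 depends on y1 only, phi_2 on the pair (y1, y2); values are arbitrary non-negative numbers.
      def phi_1(y1):
          return 2.0 if y1 == 1 else 1.0

      def phi_2(y1, y2):
          return 3.0 if y1 == y2 else 0.5

      def factor_product(y):
          y1, y2 = y
          return phi_1(y1) * phi_2(y1, y2)

      # Partition function Z: sum of the factor product over all assignments.
      assignments = list(itertools.product([0, 1], repeat=2))
      Z = sum(factor_product(y) for y in assignments)

      # p(y) = (1/Z) * prod_a phi_a(y_a); the probabilities sum to 1.
      p = {y: factor_product(y) / Z for y in assignments}
      print(p, sum(p.values()))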

  • Conditional Random Fields: General CRFs

    Suppose the set of random variables of the model is partitioned into two disjoint sets: X, the set of input variables (whose values are observed), and Y, the set of output variables (whose values are to be predicted).

    Suppose furthermore that the model can be factored into A local factors Φ_a(x_a, y_a) such that x_a ∈ X_a, y_a ∈ Y_a, X_a ⊆ X and Y_a ⊆ Y. A general CRF then represents the conditional distribution

    p(y|x) = (1/Z(x)) · ∏_{a=1}^{A} Φ_a(x_a, y_a)

    Z(x) here is the input-dependent partition function.

  • Conditional Random Fields: General CRFs

    As in log-linear models, we make the log-linear assumption that each factor Φ_a(x_a, y_a) is defined as the exponential of a dot product between a vector of parameters θ_a and a vector-valued feature function F_a:

    Φ_a(x_a, y_a) = exp( Σ_{k=1}^{K(a)} θ_{ak} · F_{ak}(x_a, y_a) )

    Combining this with the definition of the conditional distribution, we arrive at

    p(y|x) = (1/Z_θ(x)) · ∏_{a=1}^{A} exp( Σ_{k=1}^{K(a)} θ_{ak} · F_{ak}(x_a, y_a) )

    as the form of a general CRF.

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • Conditional Random Fields: Linear CRFs

    A special case of CRFs are linear (chain) CRFs, which are used for sequence labeling.

    The input of a sequence labeling task is a sequence constructed from some given input alphabet, while the output is a same-length sequence with class labels assigned to each input symbol.

    Prototypical sequence labeling tasks in NLP are tagging or sentence segmentation.

    For linear CRFs, we add the constraint that an output label Y_i depends only upon the h labels directly preceding it. In practice, h is most often 1.

  • Conditional Random Fields: Linear CRFs: examples

    Example (Linear CRF where each label Y_i depends on the current input X_i and the preceding label)

    [Figure: chain Y_1, Y_2, ..., Y_n over the output variables, with each Y_i linked to its input variable X_i]

  • Conditional Random Fields: Linear CRFs: examples

    Example (Linear CRF where each label Y_i depends on the input symbols X_i, X_{i−1}, X_{i+1} and the preceding label)

    [Figure: chain Y_1, Y_2, ..., Y_n, with each Y_i linked to the input variables X_{i−1}, X_i, and X_{i+1}]

  • Conditional Random Fields: Linear CRFs: examples

    Example (Linear CRF where each label Y_i depends on the whole input and the preceding label)

    [Figure: chain Y_1, Y_2, ..., Y_n, with each Y_i linked to the entire input sequence X_1, ..., X_n]

  • Conditional Random Fields: Linear CRFs: maximum cliques

    Note that the input sequence x̄ is often treated as a single (structured) random variable.

    This leads to the following picture:

    [Figure: chain over the output variables Y_1, ..., Y_n, with the input x̄ treated as one variable attached to every Y_i]

    In graph-theoretic terms, the factor functions of a (linear) CRF then correspond to the maximal cliques of the independency graph underlying the CRF.

  • Conditional Random Fields: Linear CRFs: maximum cliques

    Example (Maximal clique)

    [Figure: the chain over Y_1, ..., Y_n with one maximal clique, two adjacent output variables together with x̄, marked]

  • Conditional Random Fields: Linear CRFs

    Definition (Linear CRF). Let X be the input alphabet and Y the output alphabet. Let x̄ be a sequence from X^n and ȳ a sequence from Y^{n+1}, with n ≥ 1 (we assume a special start-of-sequence marker in ȳ). Let θ be a real-valued parameter vector in R^d, and let F be a set of d real-valued feature functions {f_k(y_{i−1}, y_i, x̄, i)}_{k=1}^{d}.

    A linear-chain conditional random field is a conditional distribution p(ȳ|x̄) of the following form:

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{i=1}^{|ȳ|} exp( Σ_{k=1}^{d} θ_k · f_k(y_{i−1}, y_i, x̄, i) )
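
    To make the definition concrete, the following Python sketch computes the unnormalised score ∏_i exp( Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) ) of one candidate labeling. The two feature functions, their weights, and the start marker handling are illustrative assumptions, not part of the slides.

      import math

      # Toy indicator features with signature f(y_prev, y, x, i); i is a 0-based input position.
      def f_trans_DT_NN(y_prev, y, x, i):
          return 1.0 if (y_prev, y) == ("DT", "NN") else 0.0

      def f_the_is_DT(y_prev, y, x, i):
          return 1.0 if x[i] == "the" and y == "DT" else 0.0

      features = [f_trans_DT_NN, f_the_is_DT]
      theta = [1.5, 2.0]                       # one weight per feature (illustrative values)

      def unnormalised_score(x, y):
          # y[0] is the start-of-sequence marker; y[i+1] labels x[i]
          score = 1.0
          for i in range(len(x)):
              dot = sum(t * f(y[i], y[i + 1], x, i) for t, f in zip(theta, features))
              score *= math.exp(dot)
          return score

      print(unnormalised_score(["the", "dog"], ["<s>", "DT", "NN"]))   # exp(2.0) * exp(1.5)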

  • Conditional Random Fields: Linear CRFs

    Again, Z_θ(x̄) is the partition function, which normalises the product to a probability:

    Z_θ(x̄) = Σ_{ȳ′ ∈ Y^{|x̄|}} ∏_{i=1}^{|ȳ′|} exp( Σ_{k=1}^{d} θ_k · f_k(y′_{i−1}, y′_i, x̄, i) )

    Equivalently, all f_k(y_{i−1}, y_i, x̄, i) can be combined into a vector F(y_{i−1}, y_i, x̄, i) ∈ R^d, in which case p(ȳ|x̄) simplifies to

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{i=1}^{|ȳ|} exp( θ · F(y_{i−1}, y_i, x̄, i) )

    Note that the number of different ȳ′ in the definition of Z_θ(x̄) is |Y|^{|x̄|}.
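
    The exponential blow-up of Z_θ(x̄) is easy to see in code. The sketch below, a deliberately naive brute-force computation over a toy model (all names and weights are illustrative assumptions), enumerates all |Y|^{|x̄|} labelings; this is only feasible for very short sequences.

      import itertools, math

      LABELS = ["DT", "NN"]                      # toy label alphabet Y

      def local_score(y_prev, y, x, i):
          # theta . F(y_prev, y, x, i) for two toy indicator features
          s = 0.0
          if (y_prev, y) == ("DT", "NN"):
              s += 1.5
          if x[i] == "the" and y == "DT":
              s += 2.0
          return s

      def unnormalised_score(x, y):
          # y[0] is the start-of-sequence marker; y[i+1] labels x[i]
          return math.exp(sum(local_score(y[i], y[i + 1], x, i) for i in range(len(x))))

      def partition_function(x):
          # Z_theta(x): brute-force sum over all |Y|^{|x|} labelings
          return sum(unnormalised_score(x, ["<s>"] + list(labels))
                     for labels in itertools.product(LABELS, repeat=len(x)))

      x = ["the", "dog"]
      print(unnormalised_score(x, ["<s>", "DT", "NN"]) / partition_function(x))   # p(DT NN | x)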

  • Conditional Random Fields: Linear CRFs: Relationship to general CRFs

    Remember the definition of a general CRF:

    p(y|x) = (1/Z(x)) · ∏_{a=1}^{A} Φ_a(x_a, y_a)

    Compare that to the definition of a linear CRF:

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{i=1}^{|ȳ|} exp( θ · F(y_{i−1}, y_i, x̄, i) )

    That is, Φ_a(x_a, y_a) = Φ_i(y_{i−1}, y_i, x̄) = exp( θ · F(y_{i−1}, y_i, x̄, i) ) and A = |ȳ|, with x_a = x̄ and y_a = {y_{i−1}, y_i}.

  • Conditional Random Fields: Linear CRFs: Examples for feature functions

    Feature function used for tagging (Y = Tag):

    f_1(y_j, y_i, x̄, i) = 1   if y_j = DT ∧ y_i = NN ∧ x_{i−1} = the
                          0   otherwise

    Feature function used for language modeling (Y = Tag):

    f_2(y_j, y_i, x̄, i) = 1   if y_j = JJ ∧ y_i = NN
                          0   otherwise

    Feature function used for named entity recognition (Y = NEClass):

    f_3(y_j, y_i, x̄, i) = 1   if y_i = ORG ∧ pref3(x_i) = al- ∧ x_{i+1} = front
                          0   otherwise

    f_3 establishes a relationship that x_i is a named entity of class ORG(anisation) if x_i starts with al- and is followed by front (for example, al-Nusra front). Note that this feature ignores the previous output label; it is therefore called a label feature.
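
    The indicator features above translate directly into small Python predicates with the signature f(y_prev, y, x, i). The sketch below implements f_1 to f_3; the pref3 helper and the 0-based indexing are illustrative assumptions.

      def pref3(token):
          # first three characters of the token (helper assumed for f_3)
          return token[:3]

      def f1(y_prev, y, x, i):
          # tagging: previous label DT, current label NN, preceding word "the"
          return 1.0 if y_prev == "DT" and y == "NN" and i > 0 and x[i - 1] == "the" else 0.0

      def f2(y_prev, y, x, i):
          # label bigram JJ -> NN, independent of the input
          return 1.0 if y_prev == "JJ" and y == "NN" else 0.0

      def f3(y_prev, y, x, i):
          # NER label feature: current word starts with "al-" and the next word is "front"
          return 1.0 if (y == "ORG" and pref3(x[i]) == "al-"
                         and i + 1 < len(x) and x[i + 1] == "front") else 0.0

      print(f1("DT", "NN", ["the", "dog"], 1))                 # 1.0
      print(f3("OTHER", "ORG", ["al-Nusra", "front"], 0))      # 1.0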

  • Conditional Random Fields: Linear CRFs: Examples for feature functions

    Feature function used for named entity recognition (Y = NEClass):

    f_4(y_j, y_i, x̄, i) = 1   if y_i = PER ∧ x_{i−1} = president
                          0   otherwise

    f_4 is 1 iff the current output label is PER(SON) and the word preceding the current one is president.

    Feature function used for named entity recognition (Y = NEClass):

    f_5(y_j, y_i, x̄, i) = 1   if y_j = PER ∧ y_i = PER ∧ x_{i−1} · x_i ∈ NE-List_Person
                          0   otherwise

    f_5 is 1 iff the current and the preceding output labels are both PER(SON) and the concatenation of the preceding and the current input symbol is contained in some list of persons (for example, derived from Wikipedia). This is also called a gazetteer feature.

  • Conditional Random Fields: Linear CRFs: Classification of feature functions

    Label observation features: these are functions of the form

    f_{k,m,y}(y_j, y_i, x̄, i) = g_m(x̄, i)   if y_i = y
                                0           otherwise

    f_{k,m,y} is non-zero iff the current output label matches some specified label y. g_m is a function which depends only on the input sequence x̄ and the current position i.

    Example (Feature functions for NE recognition)

    f_{k,cap,ORG}(y_j, y_i, x̄, i) = 1   if capitalised(x_i) ∧ y_i = ORG
                                    0   otherwise

    f_{k,ulr,PRODUCT}(y_j, y_i, x̄, i) = upper-lower-ratio(x_i)   if y_i = PRODUCT
                                        0                        otherwise

  • Conditional Random Fields: Linear CRFs: Classification of feature functions

    Transition observation features: these are functions of the form

    f_{k,m,y′,y}(y_j, y_i, x̄, i) = g_m(x̄, i)   if y_j = y′ ∧ y_i = y
                                   0           otherwise

    f_{k,m,y′,y} depends on the transition y_j → y_i and some property of the input. Again, g_m is a function which depends only on x̄ and i.

    Example (Feature functions for NE recognition)

    f_{k,Mr.,OTHER,PER}(y_j, y_i, x̄, i) = 1   if y_j = OTHER ∧ y_i = PER ∧ x_{i−1} = Mr.
                                          0   otherwise

  • Conditional Random Fields: Linear CRFs: Classification of feature functions

    Transition features: these are functions of the form

    f_{k,y′,y}(y_j, y_i, x̄, i) = 1   if y_j = y′ ∧ y_i = y
                                 0   otherwise

    f_{k,y′,y} is an indicator function for the presence of a transition y_j → y_i.

    Example (Feature functions for tagging)

    f_{k,DT,NN}(y_j, y_i, x̄, i) = 1   if y_j = DT ∧ y_i = NN
                                  0   otherwise
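
    Transition features are rarely written by hand; since there is exactly one per ordered label pair, they can be generated programmatically. A small sketch under that assumption (label set and names are illustrative):

      from itertools import product

      LABELS = ["DT", "NN", "JJ"]                  # toy tag set Y

      def make_transition_feature(y_from, y_to):
          # indicator f_{k,y',y}(y_prev, y, x, i) for the transition y' -> y
          def f(y_prev, y, x, i):
              return 1.0 if (y_prev, y) == (y_from, y_to) else 0.0
          return f

      # one transition feature per ordered label pair: |Y|^2 functions in total
      transition_features = {(a, b): make_transition_feature(a, b)
                             for a, b in product(LABELS, repeat=2)}

      print(len(transition_features))                                          # 9
      print(transition_features[("DT", "NN")]("DT", "NN", ["the", "dog"], 1))  # 1.0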

  • Conditional Random Fields: Linear CRFs: Relationship to HMMs

    An HMM can be cast into the form of a linear CRF by setting the local factors of the CRF Φ_i(y_{i−1}, y_i, x̄) as follows:

    Φ_i(y_{i−1}, y_i, x̄) = p(y_i|y_{i−1}) · p(x_i|y_i)

    By the definition of the local factors of a linear CRF,

    Φ_i(y_{i−1}, y_i, x̄) = exp( Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) )

    Then:

    exp( Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) ) = p(y_i|y_{i−1}) · p(x_i|y_i)

    Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) = ln( p(y_i|y_{i−1}) · p(x_i|y_i) ) = ln p(y_i|y_{i−1}) + ln p(x_i|y_i)

  • Conditional Random Fields: Linear CRFs: Relationship to HMMs

    This can be achieved by letting F (the set of all feature functions) consist of |Y|² functions of the form

    f_{i,j}(r, q, x) = ln p(y_j|y_i)   if r = y_i ∧ q = y_j
                       0              otherwise

    and |Y|·|X| functions of the form

    f_{i,o}(r, q, x) = ln p(x_o|y_i)   if q = y_i ∧ x = x_o
                       0              otherwise

    Since HMMs are already probabilised at the transition level, Z(x̄) becomes 1.

    Note that we omit the index position i in the definition of the feature functions, since HMMs can only inspect a single input symbol.
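
    A minimal sketch of this construction (the toy probabilities and helper names are assumptions): every HMM transition and emission log-probability becomes the value of one feature, so that with all weights θ_k = 1, summing the firing features at position i yields ln p(y_i|y_{i−1}) + ln p(x_i|y_i).

      import math

      # Toy HMM parameters (illustrative and deliberately incomplete).
      trans = {("DT", "NN"): 0.9, ("NN", "NN"): 0.1}
      emit  = {("DT", "the"): 0.6, ("NN", "dog"): 0.3}

      def transition_feature(y_from, y_to, p):
          # returns ln p(y_to | y_from) when the transition matches, else 0
          def f(y_prev, y, x, i):
              return math.log(p) if (y_prev, y) == (y_from, y_to) else 0.0
          return f

      def emission_feature(label, word, p):
          # returns ln p(word | label) when label and observed word match, else 0
          def f(y_prev, y, x, i):
              return math.log(p) if y == label and x[i] == word else 0.0
          return f

      features  = [transition_feature(a, b, p) for (a, b), p in trans.items()]
      features += [emission_feature(l, w, p) for (l, w), p in emit.items()]

      # The local factor exp(sum_k f_k) equals p(y_i|y_{i-1}) * p(x_i|y_i); with complete
      # HMM distributions, Z(x) would be 1 (the toy numbers here are not complete).
      x, y = ["the", "dog"], ["<s>", "DT", "NN"]
      print(math.exp(sum(f(y[1], y[2], x, 1) for f in features)), 0.9 * 0.3)   # both 0.27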

  • Conditional Random Fields: Linear CRFs: Alternative view

    Alternatively, the k-th factor of a linear CRF can be seen as the sum of the feature function f_k over all input positions i:

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{k=1}^{d} exp( Σ_{i=1}^{|ȳ|} θ_k · f_k(y_{i−1}, y_i, x̄, i) )

    Here, p(ȳ|x̄) is defined as a product of d factors, one factor for each feature f_k. Each factor is the exponentiated weighted sum of all occurrences of that feature in the pair 〈x̄, ȳ〉.

  • Conditional Random Fields: Higher-order linear CRFs

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • Conditional Random Fields: Inference

    Inference is generally defined as finding the best output sequence ȳ* for a given input sequence x̄ and a model M with parameters θ. More formally:

    ȳ* = argmax_{ȳ} p(ȳ|x̄; θ)

    Since there are |Y|^{|x̄|} sequences of length |x̄|, computing this argmax by enumeration is intractable in general.

    But for linear CRFs, it can be computed quite efficiently with a variant of the Viterbi algorithm.

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Let x̄ be an input sequence. Define variables δ_x̄(i, y) (with 0 ≤ i ≤ |x̄| and y ∈ Y) based on the following recurrence:

    δ_x̄(i, y) = 1                                                               if i = 0
                max_{y′} ( δ_x̄(i−1, y′) · exp( Σ_k θ_k · f_k(y′, y, x̄, i) ) )    if i > 0

    By adding backpointers, the best output sequence for x̄ can be reconstructed from max_{y′} δ_x̄(|x̄|, y′).

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Note that the δ_x̄(i, y)'s are not probabilities, since they are not yet normalised. The probability p(ȳ*|x̄) would be max_{y′} δ_x̄(|x̄|, y′) / Z_θ(x̄).

    The computation of δ_x̄(i, y) is more efficient if it is carried out in log-space:

    δ_x̄(i, y) = 0                                                        if i = 0
                max_{y′} ( δ_x̄(i−1, y′) + Σ_k θ_k · f_k(y′, y, x̄, i) )    if i > 0

    This has a number of advantages:
      - Multiplication is replaced by addition, which is much more efficient.
      - The exp-operation is cancelled out.
      - The values of δ_x̄(i, y) stay within reasonable intervals.

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Viterbi algorithm for linear CRFs (in the (max,+) semiring)

    Input:  CRF 〈X, Y, θ, F, d〉, input sequence x̄ ∈ X⁺
    Output: argmax_ȳ p(ȳ|x̄; θ)

     1  for each y′ ∈ Y do
     2      δ(1, y′) ← Ψ(x̄, y′, 1)
     3  for i ← 2 to |x̄| do
     4      for each y ∈ Y do
     5          v ← −∞
     6          for each y′ → y do
     7              if δ(i−1, y′) + θ_{y′→y} > v then
     8                  v ← δ(i−1, y′) + θ_{y′→y}
     9                  Π(i, y) ← y′
    10          δ(i, y) ← v + Ψ(x̄, y, i)
    11  return extract-path(max_y δ(|x̄|, y), Π)

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Explanation

    Ψ(x̄, y, i) is the dot product of the parameter vector θ with the label observation features for x̄ at position i and label y.

    In line 5, a temporary score v is set to −∞ (the identity element of max on real numbers).

    Note that we take advantage of the potential sparseness of the transitions by only considering transitions y′ → y found in the model (line 6); the training corpus on which the model is based may not contain all |Y|² possible label bigrams.

    Note that the value of δ(i, y) depends only on the previous values at i−1 and the transitions y′ → y. The label observation features in Ψ(x̄, y, i) stay the same (lines 8, 10); that is why they do not contribute to the maximisation in line 7.

    v is updated whenever we find a better score considering an incoming transition from y′ (θ_{y′→y} denotes the parameter value of the transition) and the score in δ(i−1, y′). In that case, we also update the backpointer in Π(i, y) to point to y′.
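
    The following Python sketch is a minimal re-implementation of the pseudocode above in log-space (the (max,+) semiring). The psi score function, the transition weights, and the handling of the start marker are illustrative assumptions rather than the original implementation.

      import math

      def viterbi(x, labels, psi, trans_theta, start="<s>"):
          # psi(x, y, i): dot product of theta with the label observation features at position i
          # trans_theta[(y_prev, y)]: weight of the transition feature y_prev -> y
          n = len(x)
          delta = [{} for _ in range(n)]
          back = [{} for _ in range(n)]
          for y in labels:                                  # initialisation (position 0)
              delta[0][y] = trans_theta.get((start, y), 0.0) + psi(x, y, 0)
          for i in range(1, n):                             # recursion
              for y in labels:
                  best_prev, best_v = None, -math.inf
                  for y_prev in labels:
                      v = delta[i - 1][y_prev] + trans_theta.get((y_prev, y), 0.0)
                      if v > best_v:
                          best_prev, best_v = y_prev, v
                  delta[i][y] = best_v + psi(x, y, i)
                  back[i][y] = best_prev
          y_last = max(delta[n - 1], key=delta[n - 1].get)  # backtrace from the best final label
          path = [y_last]
          for i in range(n - 1, 0, -1):
              path.append(back[i][path[-1]])
          return list(reversed(path))

      labels = ["DT", "NN"]
      trans_theta = {("<s>", "DT"): 0.5, ("DT", "NN"): 1.5}       # illustrative weights
      psi = lambda x, y, i: 2.0 if x[i] == "the" and y == "DT" else 0.0
      print(viterbi(["the", "dog"], labels, psi, trans_theta))    # ['DT', 'NN']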

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • CRF Training

    Given a finite corpus T = (〈x̄^(1), ȳ^(1)〉, ..., 〈x̄^(N), ȳ^(N)〉) of training pairs 〈x̄^(n), ȳ^(n)〉, the goal of training is to maximise the log conditional likelihood LCL(T, θ) of the corpus given the model parameters θ:

    LCL(T, θ) = Σ_{n=1}^{N} ln p(ȳ^(n) | x̄^(n); θ)

    θ* = argmax_{θ} LCL(T, θ)

  • CRF Training: Methods

    There are a couple of training algorithms for linear CRFs:

      - First order gradient methods, for example gradient descent or stochastic gradient descent (SGD)
      - Approximate gradient methods: averaged perceptron
      - Second order gradient methods: L-BFGS

  • CRF Training: Empirical distribution

    Given a corpus T = (〈x̄^(1), ȳ^(1)〉, ..., 〈x̄^(N), ȳ^(N)〉), the empirical distribution p̃(x̄, ȳ) is defined as:

    p̃(x̄, ȳ) = (1/N) Σ_{i=1}^{N} 1{x̄^(i) = x̄} · 1{ȳ^(i) = ȳ}

    1{...} denotes an indicator function which is 1 if the condition in {...} is met, and 0 otherwise.

    If all N training pairs 〈x̄^(i), ȳ^(i)〉 are distinct, each pair gets an equal probability share of 1/N.

    Note that p̃(x̄, ȳ) is zero for all pairs 〈x̄, ȳ〉 not found in the training corpus.
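
    A direct way to obtain p̃ from a corpus of 〈x̄, ȳ〉 pairs is a normalised counter, as in the following sketch (data and names are illustrative):

      from collections import Counter

      def empirical_distribution(corpus):
          # p~(x, y) = count(x, y) / N for a corpus of (x_seq, y_seq) pairs
          counts = Counter((tuple(x), tuple(y)) for x, y in corpus)
          N = len(corpus)
          return {pair: c / N for pair, c in counts.items()}

      corpus = [
          (["the", "dog"], ["DT", "NN"]),
          (["the", "dog"], ["DT", "NN"]),
          (["a", "cat"],   ["DT", "NN"]),
      ]
      p_tilde = empirical_distribution(corpus)
      print(p_tilde[(("the", "dog"), ("DT", "NN"))])   # 2/3; unseen pairs have p~ = 0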

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ)

    Analogously to the training of log-linear models, we are interested in the functional dependence of the log-likelihood of a given training pair 〈x̄, ȳ〉 on a selected parameter value θ_k. In other words: we compute the partial derivative ∂/∂θ_k ln p(ȳ|x̄; θ).

    Recall that the gradient of ln p(ȳ|x̄; θ), written ∇ ln p(ȳ|x̄; θ), is the vector of the partial derivatives of ln p(ȳ|x̄; θ) with respect to all d parameters.

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ)

    ∂/∂θ_k ln p(ȳ|x̄; θ)
      = ∂/∂θ_k [ ln( (1/Z_θ(x̄)) · exp Σ_j θ_j · F_j(x̄, ȳ) ) ]
      = ∂/∂θ_k [ ln(1/Z_θ(x̄)) + Σ_j θ_j · F_j(x̄, ȳ) ]                                  (def. of ln)
      = ∂/∂θ_k [ Σ_j θ_j · F_j(x̄, ȳ) − ln Z_θ(x̄) ]                                      (def. of ln, +)
      = F_k(x̄, ȳ) − ∂/∂θ_k ln Z_θ(x̄)                                                   (def. of ∂, sum rule)
      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · ∂/∂θ_k Z_θ(x̄)                                         (def. of ∂ ln)
      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · ∂/∂θ_k ( Σ_{ȳ′} exp Σ_j θ_j · F_j(x̄, ȳ′) )             (def. of Z_θ(x̄))

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ) (continued)

      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · Σ_{ȳ′} ( ∂/∂θ_k exp Σ_j θ_j · F_j(x̄, ȳ′) )                            (sum rule)
      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · Σ_{ȳ′} F_k(x̄, ȳ′) · ( exp Σ_j θ_j · F_j(x̄, ȳ′) )                       (∂ exp, chain rule)
      = F_k(x̄, ȳ) − Σ_{ȳ′} F_k(x̄, ȳ′) · ( exp Σ_j θ_j · F_j(x̄, ȳ′) / Z_θ(x̄) )                           (Z_θ(x̄) into Σ_{ȳ′})
      = F_k(x̄, ȳ) − Σ_{ȳ′} F_k(x̄, ȳ′) · ( exp Σ_j θ_j · F_j(x̄, ȳ′) / Σ_{ȳ′′} exp Σ_{j′} θ_{j′} · F_{j′}(x̄, ȳ′′) )   (expand Z_θ(x̄))
      = F_k(x̄, ȳ) − Σ_{ȳ′} ( F_k(x̄, ȳ′) · p(ȳ′|x̄; θ) )                                                  (def. of p(ȳ′|x̄; θ))
      = F_k(x̄, ȳ) − E_{p(ȳ′|x̄;θ)}[ F_k(x̄, ȳ′) ]                                                        (def. of E)

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ): Empirical feature counts

    The empirical feature counts depend on the factorisation of the CRF and the form of the feature functions.

    For linear CRFs, the count F_k of feature f_k, given a training pair 〈x̄^(n), ȳ^(n)〉, is

    F_k(x̄^(n), ȳ^(n)) = Σ_{i=1}^{|x̄^(n)|} f_k(y^(n)_{i−1}, y^(n)_i, x̄^(n), i)

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ)

    Recall that F_k(x̄, ȳ) is the sum of feature f_k over the whole pair 〈x̄, ȳ〉. Given a training pair 〈x̄^(n), ȳ^(n)〉, the main result from above is:

    F_k(x̄^(n), ȳ^(n)) − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ]

    That is: we subtract the expected aggregated value of the feature f_k under the model distribution from the aggregated value of f_k in the given sample.

    Since F_k(x̄^(n), ȳ^(n)) can be restated as the expected value of f_k under the empirical distribution p̃(x̄, ȳ), we arrive at:

    E_{p̃(x̄,ȳ)}[ F_k(x̄^(n), ȳ^(n)) ] − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ]

    After training, we want this difference to be 0, that is, the expectation of f_k under the empirical distribution (that is, the value of f_k in the training pair) matches the expected value of f_k under the model distribution.
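
    The sketch below computes this gradient component F_k(x̄, ȳ) − E_{p(ȳ′|x̄;θ)}[F_k(x̄, ȳ′)] for one training pair by brute-force enumeration of all labelings. The toy features and weights are illustrative assumptions; practical trainers replace the enumeration with the dynamic programming discussed later.

      import itertools, math

      LABELS = ["DT", "NN"]
      def f0(y_prev, y, x, i): return 1.0 if (y_prev, y) == ("DT", "NN") else 0.0
      def f1(y_prev, y, x, i): return 1.0 if x[i] == "the" and y == "DT" else 0.0
      features, theta = [f0, f1], [0.5, 0.5]                 # illustrative weights

      def global_counts(x, y):
          # F_k(x, y): feature f_k summed over all positions (y[0] is the start marker)
          return [sum(f(y[i], y[i + 1], x, i) for i in range(len(x))) for f in features]

      def score(x, y):
          return math.exp(sum(t * c for t, c in zip(theta, global_counts(x, y))))

      def gradient_component(x, y_gold):
          all_y = [["<s>"] + list(l) for l in itertools.product(LABELS, repeat=len(x))]
          Z = sum(score(x, y) for y in all_y)
          expected = [0.0] * len(features)                   # E_{p(y'|x)}[F_k(x, y')]
          for y in all_y:
              p = score(x, y) / Z
              for k, c in enumerate(global_counts(x, y)):
                  expected[k] += p * c
          observed = global_counts(x, y_gold)                # F_k(x, y_gold)
          return [o - e for o, e in zip(observed, expected)]

      print(gradient_component(["the", "dog"], ["<s>", "DT", "NN"]))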

  • CRF Training: Example: NE recognition

    Example (Input sequence with two alternative labelings)

    x̄:  Britain sent warships across the English Channel Monday
    ȳ:  B       O    O        O      O   B       I       B
    ȳ′: O       O    O        O      O   B       I       B

    x̄:  to rescue Britons stranded by Eyjafjallakökull 's vulcanic
    ȳ:  O  O      B       O        O  B                O  O
    ȳ′: O  O      B       O        O  B                O  O

    x̄:  ash cloud .
    ȳ:  O   O     O
    ȳ′: O   O     O

    (B = beginning of a NE, I = within a NE, O = outside a NE)

    [Example taken from Noah Smith's Linguistic structure prediction]

  • CRF Training: Example: NE recognition

    Example (Some feature functions and their global feature vectors)

    Group       Indicator function f_k(x_i, y_i)            Σ_i f_k(x_i, y_i)   Σ_i f_k(x_i, y′_i)
    bias        y_i = B                                     5                   4
                y_i = I                                     1                   1
                y_i = O                                     14                  15
    lexical     x_i = Britain and y_i = B                   1                   0
                x_i = Britain and y_i = O                   0                   1
    lowercase   lc(x_i) = britain and y_i = B               1                   0
                lc(x_i) = britain and y_i = O               0                   1
    shape       shape(x_i) = Aaaaaaa and y_i = B            3                   2
                shape(x_i) = Aaaaaaa and y_i = I            1                   1
                shape(x_i) = Aaaaaaa and y_i = O            0                   1
    prefix      pref1(x_i) = B and y_i = B                  2                   1
                pref1(x_i) = B and y_i = O                  0                   1
                shape(pref1(x_i)) = A and y_i = B           5                   4
    wiki        x_i is in Wikipedia NE list and y_i = B     2                   1

  • CRF Training: Gradient-based parameter estimation

    This difference of expectations can be generalised to the whole corpus:

    Σ_{n=1}^{N} ( E_{p̃(x̄,ȳ)}[ F_k(x̄^(n), ȳ^(n)) ] − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ] )

    In practice, these differences are often negated:

    −( E_{p̃(x̄,ȳ)}[ F_k(x̄, ȳ) ] − E_{p(ȳ′|x̄; θ)}[ F_k(x̄, ȳ′) ] )

    and

    −Σ_{n=1}^{N} ( E_{p̃(x̄,ȳ)}[ F_k(x̄^(n), ȳ^(n)) ] − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ] )

    Then, both functions are convex, that is, every local minimum is also a global minimum.

  • CRF Training: Gradient-based parameter estimation

    This leads to two generic algorithms for parameter estimation for a given corpus T of pairs 〈x̄^(n), ȳ^(n)〉:

    1 Gradient descent:
      - Use −Σ_{n=1}^{N} ln p(ȳ^(n)|x̄^(n); θ) as the objective function of the parameter estimation.
      - Compute the gradient of the above function relative to the current parameters θ, and update θ accordingly.
      - Repeat the above step several times.

    2 Stochastic gradient descent:
      - Use −ln p(ȳ|x̄; θ) as the objective function of the parameter estimation.
      - Take a random training pair 〈x̄^(n), ȳ^(n)〉, compute the gradient of the objective function, and update θ.
      - Repeat this several times for each 〈x̄^(n), ȳ^(n)〉, that is, take several passes over the corpus.

  • CRF Training: Regularisation
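    The slide itself gives no details. As an illustration only (an assumption, not taken from the slides), a common form of regularisation adds an L2 penalty ‖θ‖²/(2σ²) to the negative log-likelihood, which simply contributes θk/σ² to each component of the gradient; grad_nll is the same hypothetical helper as in the earlier sketch.

```python
import numpy as np

def l2_regularised_grad(grad_nll, x_bar, y_bar, theta, sigma2=10.0):
    """Gradient of -ln p(y|x; theta) + ||theta||^2 / (2*sigma2), i.e. a Gaussian prior on theta."""
    return grad_nll(x_bar, y_bar, theta) + theta / sigma2
```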


  • CRF Training: Stochastic gradient descent

    Stochastic gradient descent

    Input: Corpus T = (⟨x̄(1), ȳ(1)⟩, …, ⟨x̄(N), ȳ(N)⟩)
    Output: Parameter vector θ with |θ| = d

    1  θ ← 0
    2  repeat
    3      permute(T)
    4      for n ← 1 to N do
    5          δ ← ∇ ln p(ȳ(n) | x̄(n); θ)
    6          Calculate β̂ (the learning rate)
    7          θ ← θ + β̂ δ
    8  until convergence
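    A direct transcription of this pseudocode into Python could look as follows. The helper log_likelihood_grad (returning ∇ ln p(ȳ(n) | x̄(n); θ)) and the convergence test are placeholders, and the decaying learning rate is just one possible choice, so this is a sketch rather than a reference implementation.

```python
import numpy as np

def sgd_train(corpus, d, log_likelihood_grad, max_epochs=50, tol=1e-4, seed=0):
    """Stochastic gradient descent for a CRF, following the pseudocode above.

    corpus              -- list of (x_bar, y_bar) training pairs
    d                   -- number of features / parameters
    log_likelihood_grad -- returns the gradient of ln p(y_bar | x_bar; theta), a length-d vector
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                                        # line 1: theta <- 0
    for t in range(max_epochs):                                # line 2: repeat ...
        order = rng.permutation(len(corpus))                   # line 3: permute(T)
        max_step = 0.0
        for n in order:                                        # line 4: for n <- 1 to N
            x_bar, y_bar = corpus[n]
            delta = log_likelihood_grad(x_bar, y_bar, theta)   # line 5: gradient for one pair
            beta_hat = 1.0 / (t + 1.0)                         # line 6: a simple decaying learning rate
            theta += beta_hat * delta                          # line 7: ascend the log-likelihood
            max_step = max(max_step, beta_hat * float(np.max(np.abs(delta))))
        if max_step < tol:                                     # line 8: until convergence
            break
    return theta
```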


  • CRF Training: Stochastic gradient descent

    Remarks

    This is a generic algorithm which works for all types of CRFs.

    Its performance depends on how efficiently we can compute the gradient of the log conditional likelihood of the training pair in line 5.

    For linear CRFs, there is a polynomial algorithm based on dynamic programming.

    Note that SGD is an inherently sequential algorithm that is not easily parallelised.

    But the gradient vector can in principle be computed with a parallel algorithm.


  • CRF Training: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)]

    The computationally demanding step of gradient-based training algorithms is the computation of the gradient of the partition function Z, that is, E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)], since it requires a summation over exponentially many values of ȳ.

    For linear CRFs, we can take advantage of their special factorisation by using a dynamic programming approach. Recall that for linear CRFs,

    Fk(x̄, ȳ) = ∑_{i=1}^{|ȳ|} fk(y_{i−1}, y_i, x̄, i)


  • CRF Training: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)] for linear CRFs

    E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)]

      = E_{p(ȳ|x̄;θ)}[ ∑_{i=1}^{|ȳ|} fk(y_{i−1}, y_i, x̄, i) ]                 (Def. of Fk for linear CRFs)

      = ∑_{i=1}^{|ȳ|} E_{p(ȳ|x̄;θ)}[ fk(y_{i−1}, y_i, x̄, i) ]                 (E[X+Y] = E[X] + E[Y])

      = ∑_{i=1}^{|ȳ|} ∑_{ȳ} p(ȳ | x̄; θ) · fk(y_{i−1}, y_i, x̄, i)             (Def. of E_{p(ȳ|x̄;θ)}[·])

      = ∑_{i=1}^{|ȳ|} ∑_{y′,y∈Y} p(y′, y | x̄, i; θ) · fk(y′, y, x̄, i)         (Def. of p(ȳ|x̄;θ))


  • CRF Training: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)] for linear CRFs

    Given an output sequence ȳ, p(y′, y | x̄, i; θ) is the probability of the transition from y′ to y at position i of x̄, given θ.

    This probability can be further decomposed as follows:

    p(y′, y | x̄, i; θ) = αx̄(i−1, y′) · Φ(y′, y, x̄, i) · βx̄(i, y) / Z_θ(x̄)

    with
    - αx̄(i−1, y′) being the forward score of outputting y′ after a prefix of ȳ of length i−1,
    - Φ(y′, y, x̄, i) = exp( ∑_{k=1}^{d} θk fk(y′, y, x̄, i) ) being the score of the transition from y′ to y,
    - βx̄(i, y) being the backward score of outputting y and then the remaining suffix of ȳ, and
    - Z_θ(x̄) = ∑_{y∈Y} αx̄(|x̄|, y), the total score of all sequences in Y^{|x̄|}.


  • Conditional Random Fields: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)]: Forward-backward algorithm

    The α- and β-variables can be defined in a way analogous to HMMs:

    αx̄(i, y) = 1                                              if i = 0
    αx̄(i, y) = ∑_{y′} αx̄(i−1, y′) · Φ(y′, y, x̄, i)            if i > 0

    βx̄(i, y) = 1                                              if i = |x̄|
    βx̄(i, y) = ∑_{y′} βx̄(i+1, y′) · Φ(y, y′, x̄, i+1)          if 0 ≤ i < |x̄|

    with

    Φ(y, y′, x̄, i) = exp( ∑_{k=1}^{d} θk · fk(y, y′, x̄, i) )
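    As an illustrative sketch (not from the slides), the two recursions and the partition function can be computed as follows, assuming the transition scores Φ(y′, y, x̄, i) for i = 1, …, |x̄| have already been collected into a NumPy array phi with phi[i-1, y_prev, y] = Φ(y_prev, y, x̄, i):

```python
import numpy as np

def forward_backward(phi):
    """Compute alpha, beta and Z for one input sequence.

    phi -- array of shape (T, Y, Y) with phi[i-1, y_prev, y] = Phi(y_prev, y, x_bar, i)
           for positions i = 1..T (T = |x_bar|, Y = number of output labels).
    """
    T, Y, _ = phi.shape
    alpha = np.zeros((T + 1, Y))
    beta = np.zeros((T + 1, Y))
    alpha[0, :] = 1.0                          # alpha(0, y) = 1
    beta[T, :] = 1.0                           # beta(|x|, y) = 1
    for i in range(1, T + 1):                  # forward pass
        alpha[i] = alpha[i - 1] @ phi[i - 1]   # sum_{y'} alpha(i-1, y') * Phi(y', y, x, i)
    for i in range(T - 1, -1, -1):             # backward pass
        beta[i] = phi[i] @ beta[i + 1]         # sum_{y'} Phi(y, y', x, i+1) * beta(i+1, y')
    Z = alpha[T].sum()                         # Z = sum_y alpha(|x|, y)
    return alpha, beta, Z
```

    The transition marginals needed on the previous slides are then alpha[i-1, y_prev] * phi[i-1, y_prev, y] * beta[i, y] / Z. In practice the recursions are usually carried out in log space (or with per-position scaling) to avoid numerical overflow.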


  • CRF Training: Averaged perceptron

    In machine learning, the perceptron is an algorithm for supervised classification of an input into one of several possible non-binary outputs.

    It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.

    The perceptron algorithm is a simple iterative parameter estimation algorithm where we count the mismatches between the output sequence ȳ given in the training corpus and the predicted output sequence ȳ′ based on the current parameters θ.

    Effectively, we try to minimise the number of these mismatches.


  • CRF Training: Averaged perceptron

    Recall the definition of the (unregularised) gradient of a feature fk given a single corpus instance ⟨x̄(j), ȳ(j)⟩:

    ∂/∂θk ln p(ȳ(j) | x̄(j); θ) = Fk(x̄(j), ȳ(j)) − E_{p(ȳ|x̄(j);θ)}[Fk(x̄(j), ȳ)]

    In perceptron-based methods, the gradient for a single corpus instance ⟨x̄(j), ȳ(j)⟩ is approximated by:

    ∂/∂θk ln p(ȳ(j) | x̄(j); θ) ≈ Fk(x̄(j), ȳ(j)) − Fk(x̄(j), argmax_ȳ p(ȳ | x̄(j); θ))

    That is, instead of taking the weighted average of Fk over all possible ȳ, we take only one specific ȳ into account, namely the one with the highest Viterbi score given the current parameters θ.


  • CRF Training: Averaged perceptron

    Averaged perceptron training algorithm

    Input: Corpus T = (⟨x̄(1), ȳ(1)⟩, …, ⟨x̄(N), ȳ(N)⟩)
    Output: Parameter vector θ

    1  θ ← 0; θsum ← 0
    2  for t ← 1 to M do
    3      permute(T)
    4      for j ← 1 to N do
    5          ȳ* ← argmax_{ȳ′} p(ȳ′ | x̄(j); θ)
    6          if ȳ* ≠ ȳ(j) then
    7              θ ← θ + ∑_{i=1}^{|x̄(j)|} F(y(j)_{i−1}, y(j)_i, x̄, i) − ∑_{i=1}^{|x̄(j)|} F(y*_{i−1}, y*_i, x̄, i)
    8          θsum ← θsum + θ
    9  return θsum / (N · M)
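    A runnable sketch of this procedure is given below. Here viterbi_decode and feature_vector are hypothetical helpers (a Viterbi decoder and the per-position feature vector F(y′, y, x̄, i)), and "<s>" is an assumed start label, so the code illustrates the update rule without committing to a particular feature representation.

```python
import numpy as np

def averaged_perceptron(corpus, d, viterbi_decode, feature_vector, M=10, seed=0):
    """Averaged perceptron training, following the pseudocode above.

    corpus         -- list of (x_bar, y_bar) pairs, y_bar being the gold label sequence
    d              -- number of features
    viterbi_decode -- viterbi_decode(x_bar, theta) -> best label sequence under theta
    feature_vector -- feature_vector(y_prev, y, x_bar, i) -> length-d vector F(y_prev, y, x_bar, i)
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                                    # line 1
    theta_sum = np.zeros(d)
    N = len(corpus)
    for t in range(M):                                     # line 2: M passes over the corpus
        order = rng.permutation(N)                         # line 3: permute(T)
        for j in order:                                    # line 4
            x_bar, y_bar = corpus[j]
            y_star = viterbi_decode(x_bar, theta)          # line 5: Viterbi search
            if list(y_star) != list(y_bar):                # line 6: prediction differs from gold?
                prev_gold, prev_pred = "<s>", "<s>"        # assumed start label
                for i in range(1, len(x_bar) + 1):         # line 7: add gold features, subtract predicted
                    theta += feature_vector(prev_gold, y_bar[i - 1], x_bar, i)
                    theta -= feature_vector(prev_pred, y_star[i - 1], x_bar, i)
                    prev_gold, prev_pred = y_bar[i - 1], y_star[i - 1]
            theta_sum += theta                             # line 8: accumulate
    return theta_sum / (N * M)                             # line 9: average
```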


  • CRF Training: Averaged perceptron

    Remarks

    Line 5 does a Viterbi search to find the best output sequence ȳ* for the current input x̄(j) given the current parameters θ.

    Line 6 compares that sequence with the correct corpus sequence ȳ(j).

    If the sequences are not the same, then in line 7 the feature vector of the training pair (x̄(j), ȳ(j)) is added to θ and the feature vector of (x̄(j), ȳ*) is subtracted from θ (here, F(y′, y, x̄, i) denotes the vector of the d feature functions fk(y′, y, x̄, i)).

    Then, θ is accumulated into a vector θsum.

    In effect, this means that features of the correct pair (x̄(j), ȳ(j)) are amplified and features of the incorrect prediction (x̄(j), ȳ*) are dampened.

    Finally, θsum is averaged by dividing it by |T| · M.


  • CRF Training: Averaged perceptron

    Intuitions behind the algorithm

    Consider an incorrect output label, causing the test in line 6 to be true. Say ȳ(j) differs from ȳ* only at position l.

    Then line 7 effectively subtracts from θ all feature values in F(y*_{l−1}, y*_l, x̄, l). That is, the parameter values of the features causing the incorrect label y*_l are dampened. At the same time, the parameter values of the features found in F(y(j)_{l−1}, y(j)_l, x̄, l) are always amplified, since we assume the corpus labelling to be correct.

    The next time the pair ⟨x̄(j), ȳ(j)⟩ is considered, the Viterbi search may yield the correct label y(j)_l found in the corpus, since the dampened parameter values of the features that caused the incorrect label play a less prominent role.


  • CRF Training: Averaged perceptron

    Implementation notes

    The efficiency of the averaged perceptron algorithm depends crucially on how smartly we implement the vector sum in line 8, since θ can contain several million features and line 8 is executed once for every training pair in every pass, i.e. N · M times.

    The Viterbi-based approximation is somewhat cheaper to compute than E_{p(ȳ|x̄(j);θ)}[Fk(x̄(j), ȳ)] in the SGD algorithm.

    Again, this is a sequential algorithm. (A standard trick for the vector sum in line 8 is sketched below.)
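    The following sketch shows one well-known way (an addition here, not taken from the slides) to avoid the dense accumulation in line 8: for each feature we remember when it last changed and credit its contribution to θsum lazily, so the per-pair cost is proportional to the number of changed features rather than to d.

```python
import numpy as np

class LazyAveragedWeights:
    """Averaged-perceptron weights with lazy accumulation of theta_sum.

    Instead of adding the full dense vector theta to theta_sum after every
    training pair, we record for each feature when it last changed and add
    its contribution for the elapsed interval only when it changes again.
    """
    def __init__(self, d):
        self.theta = np.zeros(d)
        self.theta_sum = np.zeros(d)
        self.last_update = np.zeros(d, dtype=np.int64)   # step of last change per feature
        self.step = 0

    def update(self, k, delta):
        """Add delta to feature k (called only for the few features that change)."""
        # credit theta[k] for all steps since its last change, then modify it
        self.theta_sum[k] += self.theta[k] * (self.step - self.last_update[k])
        self.theta[k] += delta
        self.last_update[k] = self.step

    def tick(self):
        """Call once per training pair (plays the role of line 8 of the pseudocode)."""
        self.step += 1

    def average(self):
        """Flush outstanding contributions and return theta_sum divided by the number of steps."""
        self.theta_sum += self.theta * (self.step - self.last_update)
        self.last_update[:] = self.step
        return self.theta_sum / max(self.step, 1)
```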


  • CRF Training: Averaged perceptron: summary

    - Overfitting
    - Error driven


  • CRF Training: L-BFGS


  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF-Training
      First order gradient methods
      Stochastic gradient descent
      Averaged perceptron
      L-BFGS

    5 Summary


  • Conditional random fields: Summary


  • TODO

    Relationship to maximum entropy

    Higher-order LCRFs

    Regularisation

    L-BFGS


  • Version history

    29.1.2015: Initial version 0.1
    10.2.2015: Version 0.2: Moved the forward-backward algorithm to the CRF training section and added an additional slide. Changed the slide about efficient gradient computation. Fixed a number of errors. Added two further slides to the averaged perceptron algorithm.

