
  • Advanced Language Modeling: Conditional Random Fields

    Thomas Hanneforth

    Linguistics Dept., Universität Potsdam

    February 10, 2015

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • Conditional Random Fields: Overview

    Conditional Random Fields (CRFs) are a generalisation of log-linear models. The main difference between log-linear models and CRFs is that the output set Y of CRFs is generalised to encompass structured data (sequences, ...). CRFs are conditional models, that is, they represent conditional probability distributions of the form p(y|x) between the inputs x and the outputs y.

    CRFs are based on the idea of undirected graphical models: a probability distribution over many random variables can often be represented as a product of local functions on a subset of the variables.

  • Conditional Random Fields: Undirected graphical models

    Assume that a probability distribution p over a set Y of random variables can be factored into a set of local factor functions Φ_a (with |{Φ_a}| = A), where Φ_a depends only on a subset Y_a of Y:

    p(y) = (1/Z) · ∏_{a=1}^{A} Φ_a(y_a)

    As a side condition, Φ_a(y_a) ≥ 0 for all y_a ∈ Y_a. Z is used to normalise ∏_a Φ_a(y_a) to a probability and is called the partition function:

    Z = Σ_{y ∈ Y} ∏_{a=1}^{A} Φ_a(y_a)

    Note that the enumeration of all y ∈ Y is in general intractable.
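
    As a quick illustration of the factorisation and the partition function, the following Python sketch builds p(y) for two binary random variables from two local factors and normalises with Z by brute-force enumeration. The factors and their values are illustrative assumptions, not taken from the slides.

      import itertools

      # Toy undirected model over two binary variables y = (y1, y2).
      # phi_1 depends on y1 only, phi_2 on the pair (y1, y2); values are arbitrary non-negative numbers.
      def phi_1(y1):
          return 2.0 if y1 == 1 else 1.0

      def phi_2(y1, y2):
          return 3.0 if y1 == y2 else 0.5

      def factor_product(y):
          y1, y2 = y
          return phi_1(y1) * phi_2(y1, y2)

      # Partition function Z: sum of the factor product over all assignments.
      assignments = list(itertools.product([0, 1], repeat=2))
      Z = sum(factor_product(y) for y in assignments)

      # p(y) = (1/Z) * prod_a phi_a(y_a); the probabilities sum to 1.
      p = {y: factor_product(y) / Z for y in assignments}
      print(p, sum(p.values()))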

  • Conditional Random Fields: General CRFs

    Suppose the set of random variables of the model is partitioned into two disjoint sets: X, the set of input variables (whose values are observed), and Y, the set of output variables (whose values are to be predicted).

    Suppose furthermore that the model can be factored into A local factors Φ_a(x_a, y_a) such that x_a ∈ X_a, y_a ∈ Y_a, X_a ⊆ X and Y_a ⊆ Y. A general CRF then represents the conditional distribution

    p(y|x) = (1/Z(x)) · ∏_{a=1}^{A} Φ_a(x_a, y_a)

    Z(x) here is the input-dependent partition function.

  • Conditional Random Fields: General CRFs

    As in log-linear models, we make the log-linear assumption that each factor Φ_a(x_a, y_a) is defined as the exponential of a dot product between a vector of parameters θ_a and a vector-valued feature function F_a:

    Φ_a(x_a, y_a) = exp( Σ_{k=1}^{K(a)} θ_{ak} · F_{ak}(x_a, y_a) )

    Combining this with the definition of the conditional distribution, we arrive at

    p(y|x) = (1/Z_θ(x)) · ∏_{a=1}^{A} exp( Σ_{k=1}^{K(a)} θ_{ak} · F_{ak}(x_a, y_a) )

    as the form of a general CRF.

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • Conditional Random Fields: Linear CRFs

    A special case of CRFs are linear (chain) CRFs, which are used for sequence labeling.

    The input of a sequence labeling task is a sequence constructed from some given input alphabet, while the output is a same-length sequence with class labels assigned to each input symbol.

    Prototypical sequence labeling tasks in NLP are tagging or sentence segmentation.

    For linear CRFs, we add the constraint that an output label Y_i depends only upon the h labels directly preceding it. In practice, h is most often 1.

  • Conditional Random Fields: Linear CRFs: examples

    Example (Linear CRF where each label Y_i depends on the current input X_i and the preceding label)

    [Figure: chain Y_1, Y_2, ..., Y_n over the output variables, with each Y_i linked to its input variable X_i]

  • Conditional Random Fields: Linear CRFs: examples

    Example (Linear CRF where each label Y_i depends on the input symbols X_i, X_{i−1}, X_{i+1} and the preceding label)

    [Figure: chain Y_1, Y_2, ..., Y_n, with each Y_i linked to the input variables X_{i−1}, X_i, and X_{i+1}]

  • Conditional Random Fields: Linear CRFs: examples

    Example (Linear CRF where each label Y_i depends on the whole input and the preceding label)

    [Figure: chain Y_1, Y_2, ..., Y_n, with each Y_i linked to the entire input sequence X_1, ..., X_n]

  • Conditional Random Fields: Linear CRFs: maximum cliques

    Note that the input sequence x̄ is often treated as a single (structured) random variable.

    This leads to the following picture:

    [Figure: chain over the output variables Y_1, ..., Y_n, with the input x̄ treated as one variable attached to every Y_i]

    In graph-theoretic terms, the factor functions of a (linear) CRF then correspond to the maximal cliques of the independency graph underlying the CRF.

  • Conditional Random Fields: Linear CRFs: maximum cliques

    Example (Maximal clique)

    [Figure: the chain over Y_1, ..., Y_n with one maximal clique, two adjacent output variables together with x̄, marked]

  • Conditional Random Fields: Linear CRFs

    Definition (Linear CRF). Let X be the input alphabet and Y the output alphabet. Let x̄ be a sequence from X^n and ȳ a sequence from Y^{n+1}, with n ≥ 1 (we assume a special start-of-sequence marker in ȳ). Let θ be a real-valued parameter vector in R^d, and let F be a set of d real-valued feature functions {f_k(y_{i−1}, y_i, x̄, i)}_{k=1}^{d}.

    A linear-chain conditional random field is a conditional distribution p(ȳ|x̄) of the following form:

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{i=1}^{|ȳ|} exp( Σ_{k=1}^{d} θ_k · f_k(y_{i−1}, y_i, x̄, i) )
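
    To make the definition concrete, the following Python sketch computes the unnormalised score ∏_i exp( Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) ) of one candidate labeling. The two feature functions, their weights, and the start marker handling are illustrative assumptions, not part of the slides.

      import math

      # Toy indicator features with signature f(y_prev, y, x, i); i is a 0-based input position.
      def f_trans_DT_NN(y_prev, y, x, i):
          return 1.0 if (y_prev, y) == ("DT", "NN") else 0.0

      def f_the_is_DT(y_prev, y, x, i):
          return 1.0 if x[i] == "the" and y == "DT" else 0.0

      features = [f_trans_DT_NN, f_the_is_DT]
      theta = [1.5, 2.0]                       # one weight per feature (illustrative values)

      def unnormalised_score(x, y):
          # y[0] is the start-of-sequence marker; y[i+1] labels x[i]
          score = 1.0
          for i in range(len(x)):
              dot = sum(t * f(y[i], y[i + 1], x, i) for t, f in zip(theta, features))
              score *= math.exp(dot)
          return score

      print(unnormalised_score(["the", "dog"], ["<s>", "DT", "NN"]))   # exp(2.0) * exp(1.5)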

  • Conditional Random Fields: Linear CRFs

    Again, Z_θ(x̄) is the partition function, which normalises the product to a probability:

    Z_θ(x̄) = Σ_{ȳ′ ∈ Y^{|x̄|}} ∏_{i=1}^{|ȳ′|} exp( Σ_{k=1}^{d} θ_k · f_k(y′_{i−1}, y′_i, x̄, i) )

    Equivalently, all f_k(y_{i−1}, y_i, x̄, i) can be combined into a vector F(y_{i−1}, y_i, x̄, i) ∈ R^d, in which case p(ȳ|x̄) simplifies to

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{i=1}^{|ȳ|} exp( θ · F(y_{i−1}, y_i, x̄, i) )

    Note that the number of different ȳ′ in the definition of Z_θ(x̄) is |Y|^{|x̄|}.
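
    The exponential blow-up of Z_θ(x̄) is easy to see in code. The sketch below, a deliberately naive brute-force computation over a toy model (all names and weights are illustrative assumptions), enumerates all |Y|^{|x̄|} labelings; this is only feasible for very short sequences.

      import itertools, math

      LABELS = ["DT", "NN"]                      # toy label alphabet Y

      def local_score(y_prev, y, x, i):
          # theta . F(y_prev, y, x, i) for two toy indicator features
          s = 0.0
          if (y_prev, y) == ("DT", "NN"):
              s += 1.5
          if x[i] == "the" and y == "DT":
              s += 2.0
          return s

      def unnormalised_score(x, y):
          # y[0] is the start-of-sequence marker; y[i+1] labels x[i]
          return math.exp(sum(local_score(y[i], y[i + 1], x, i) for i in range(len(x))))

      def partition_function(x):
          # Z_theta(x): brute-force sum over all |Y|^{|x|} labelings
          return sum(unnormalised_score(x, ["<s>"] + list(labels))
                     for labels in itertools.product(LABELS, repeat=len(x)))

      x = ["the", "dog"]
      print(unnormalised_score(x, ["<s>", "DT", "NN"]) / partition_function(x))   # p(DT NN | x)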

  • Conditional Random Fields: Linear CRFs: Relationship to general CRFs

    Remember the definition of a general CRF:

    p(y|x) = (1/Z(x)) · ∏_{a=1}^{A} Φ_a(x_a, y_a)

    Compare that to the definition of a linear CRF:

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{i=1}^{|ȳ|} exp( θ · F(y_{i−1}, y_i, x̄, i) )

    That is, Φ_a(x_a, y_a) = Φ_i(y_{i−1}, y_i, x̄) = exp( θ · F(y_{i−1}, y_i, x̄, i) ) and A = |ȳ|, with x_a = x̄ and y_a = {y_{i−1}, y_i}.

  • Conditional Random Fields: Linear CRFs: Examples for feature functions

    Feature function used for tagging (Y = Tag):

    f_1(y_j, y_i, x̄, i) = 1   if y_j = DT ∧ y_i = NN ∧ x_{i−1} = the
                          0   otherwise

    Feature function used for language modeling (Y = Tag):

    f_2(y_j, y_i, x̄, i) = 1   if y_j = JJ ∧ y_i = NN
                          0   otherwise

    Feature function used for named entity recognition (Y = NEClass):

    f_3(y_j, y_i, x̄, i) = 1   if y_i = ORG ∧ pref3(x_i) = al- ∧ x_{i+1} = front
                          0   otherwise

    f_3 establishes a relationship that x_i is a named entity of class ORG(anisation) if x_i starts with al- and is followed by front (for example, al-Nusra front). Note that this feature ignores the previous output label; it is therefore called a label feature.
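
    The indicator features above translate directly into small Python predicates with the signature f(y_prev, y, x, i). The sketch below implements f_1 to f_3; the pref3 helper and the 0-based indexing are illustrative assumptions.

      def pref3(token):
          # first three characters of the token (helper assumed for f_3)
          return token[:3]

      def f1(y_prev, y, x, i):
          # tagging: previous label DT, current label NN, preceding word "the"
          return 1.0 if y_prev == "DT" and y == "NN" and i > 0 and x[i - 1] == "the" else 0.0

      def f2(y_prev, y, x, i):
          # label bigram JJ -> NN, independent of the input
          return 1.0 if y_prev == "JJ" and y == "NN" else 0.0

      def f3(y_prev, y, x, i):
          # NER label feature: current word starts with "al-" and the next word is "front"
          return 1.0 if (y == "ORG" and pref3(x[i]) == "al-"
                         and i + 1 < len(x) and x[i + 1] == "front") else 0.0

      print(f1("DT", "NN", ["the", "dog"], 1))                 # 1.0
      print(f3("OTHER", "ORG", ["al-Nusra", "front"], 0))      # 1.0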

  • Conditional Random Fields: Linear CRFs: Examples for feature functions

    Feature function used for named entity recognition (Y = NEClass):

    f_4(y_j, y_i, x̄, i) = 1   if y_i = PER ∧ x_{i−1} = president
                          0   otherwise

    f_4 is 1 iff the current output label is PER(SON) and the word preceding the current one is president.

    Feature function used for named entity recognition (Y = NEClass):

    f_5(y_j, y_i, x̄, i) = 1   if y_j = PER ∧ y_i = PER ∧ x_{i−1} · x_i ∈ NE-List_Person
                          0   otherwise

    f_5 is 1 iff the current and the preceding output labels are both PER(SON) and the concatenation of the preceding and the current input symbol is contained in some list of persons (for example, derived from Wikipedia). This is also called a gazetteer feature.

  • Conditional Random Fields: Linear CRFs: Classification of feature functions

    Label observation features: these are functions of the form

    f_{k,m,y}(y_j, y_i, x̄, i) = g_m(x̄, i)   if y_i = y
                                0           otherwise

    f_{k,m,y} is non-zero iff the current output label matches some specified label y. g_m is a function which depends only on the input sequence x̄ and the current position i.

    Example (Feature functions for NE recognition)

    f_{k,cap,ORG}(y_j, y_i, x̄, i) = 1   if capitalised(x_i) ∧ y_i = ORG
                                    0   otherwise

    f_{k,ulr,PRODUCT}(y_j, y_i, x̄, i) = upper-lower-ratio(x_i)   if y_i = PRODUCT
                                        0                        otherwise

  • Conditional Random Fields: Linear CRFs: Classification of feature functions

    Transition observation features: these are functions of the form

    f_{k,m,y′,y}(y_j, y_i, x̄, i) = g_m(x̄, i)   if y_j = y′ ∧ y_i = y
                                   0           otherwise

    f_{k,m,y′,y} depends on the transition y_j → y_i and some property of the input. Again, g_m is a function which depends only on x̄ and i.

    Example (Feature functions for NE recognition)

    f_{k,Mr.,OTHER,PER}(y_j, y_i, x̄, i) = 1   if y_j = OTHER ∧ y_i = PER ∧ x_{i−1} = Mr.
                                          0   otherwise

  • Conditional Random Fields: Linear CRFs: Classification of feature functions

    Transition features: these are functions of the form

    f_{k,y′,y}(y_j, y_i, x̄, i) = 1   if y_j = y′ ∧ y_i = y
                                 0   otherwise

    f_{k,y′,y} is an indicator function for the presence of a transition y_j → y_i.

    Example (Feature functions for tagging)

    f_{k,DT,NN}(y_j, y_i, x̄, i) = 1   if y_j = DT ∧ y_i = NN
                                  0   otherwise
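
    Transition features are rarely written by hand; since there is exactly one per ordered label pair, they can be generated programmatically. A small sketch under that assumption (label set and names are illustrative):

      from itertools import product

      LABELS = ["DT", "NN", "JJ"]                  # toy tag set Y

      def make_transition_feature(y_from, y_to):
          # indicator f_{k,y',y}(y_prev, y, x, i) for the transition y' -> y
          def f(y_prev, y, x, i):
              return 1.0 if (y_prev, y) == (y_from, y_to) else 0.0
          return f

      # one transition feature per ordered label pair: |Y|^2 functions in total
      transition_features = {(a, b): make_transition_feature(a, b)
                             for a, b in product(LABELS, repeat=2)}

      print(len(transition_features))                                          # 9
      print(transition_features[("DT", "NN")]("DT", "NN", ["the", "dog"], 1))  # 1.0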

  • Conditional Random Fields: Linear CRFs: Relationship to HMMs

    An HMM can be cast into the form of a linear CRF by setting the local factors of the CRF Φ_i(y_{i−1}, y_i, x̄) as follows:

    Φ_i(y_{i−1}, y_i, x̄) = p(y_i|y_{i−1}) · p(x_i|y_i)

    By the definition of the local factors of a linear CRF,

    Φ_i(y_{i−1}, y_i, x̄) = exp( Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) )

    Then:

    exp( Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) ) = p(y_i|y_{i−1}) · p(x_i|y_i)

    Σ_k θ_k · f_k(y_{i−1}, y_i, x̄, i) = ln( p(y_i|y_{i−1}) · p(x_i|y_i) ) = ln p(y_i|y_{i−1}) + ln p(x_i|y_i)

  • Conditional Random Fields: Linear CRFs: Relationship to HMMs

    This can be achieved by letting F (the set of all feature functions) consist of |Y|² functions of the form

    f_{i,j}(r, q, x) = ln p(y_j|y_i)   if r = y_i ∧ q = y_j
                       0              otherwise

    and |Y|·|X| functions of the form

    f_{i,o}(r, q, x) = ln p(x_o|y_i)   if q = y_i ∧ x = x_o
                       0              otherwise

    Since HMMs are already probabilised at the transition level, Z(x̄) becomes 1.

    Note that we omit the index position i in the definition of the feature functions, since HMMs can only inspect a single input symbol.
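
    A minimal sketch of this construction (the toy probabilities and helper names are assumptions): every HMM transition and emission log-probability becomes the value of one feature, so that with all weights θ_k = 1, summing the firing features at position i yields ln p(y_i|y_{i−1}) + ln p(x_i|y_i).

      import math

      # Toy HMM parameters (illustrative and deliberately incomplete).
      trans = {("DT", "NN"): 0.9, ("NN", "NN"): 0.1}
      emit  = {("DT", "the"): 0.6, ("NN", "dog"): 0.3}

      def transition_feature(y_from, y_to, p):
          # returns ln p(y_to | y_from) when the transition matches, else 0
          def f(y_prev, y, x, i):
              return math.log(p) if (y_prev, y) == (y_from, y_to) else 0.0
          return f

      def emission_feature(label, word, p):
          # returns ln p(word | label) when label and observed word match, else 0
          def f(y_prev, y, x, i):
              return math.log(p) if y == label and x[i] == word else 0.0
          return f

      features  = [transition_feature(a, b, p) for (a, b), p in trans.items()]
      features += [emission_feature(l, w, p) for (l, w), p in emit.items()]

      # The local factor exp(sum_k f_k) equals p(y_i|y_{i-1}) * p(x_i|y_i); with complete
      # HMM distributions, Z(x) would be 1 (the toy numbers here are not complete).
      x, y = ["the", "dog"], ["<s>", "DT", "NN"]
      print(math.exp(sum(f(y[1], y[2], x, 1) for f in features)), 0.9 * 0.3)   # both 0.27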

  • Conditional Random Fields: Linear CRFs: Alternative view

    Alternatively, the k-th factor of a linear CRF can be seen as the sum of the feature function f_k over all input positions i:

    p(ȳ|x̄) = (1/Z_θ(x̄)) · ∏_{k=1}^{d} exp( Σ_{i=1}^{|ȳ|} θ_k · f_k(y_{i−1}, y_i, x̄, i) )

    Here, p(ȳ|x̄) is defined as a product of d factors, one factor for each feature f_k. Each factor is the exponentiated weighted sum of all occurrences of that feature in the pair 〈x̄, ȳ〉.

  • Conditional Random Fields: Higher-order linear CRFs

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • Conditional Random Fields: Inference

    Inference is generally defined as finding the best output sequence ȳ* for a given input sequence x̄ and a model M with parameters θ. More formally:

    ȳ* = argmax_{ȳ} p(ȳ|x̄; θ)

    Since there are |Y|^{|x̄|} sequences of length |x̄|, computing this argmax by enumeration is intractable in general.

    But for linear CRFs, it can be computed quite efficiently with a variant of the Viterbi algorithm.

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Let x̄ be an input sequence. Define variables δ_x̄(i, y) (with 0 ≤ i ≤ |x̄| and y ∈ Y) based on the following recurrence:

    δ_x̄(i, y) = 1                                                               if i = 0
                max_{y′} ( δ_x̄(i−1, y′) · exp( Σ_k θ_k · f_k(y′, y, x̄, i) ) )    if i > 0

    By adding backpointers, the best output sequence for x̄ can be reconstructed from max_{y′} δ_x̄(|x̄|, y′).

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Note that the δ_x̄(i, y)'s are not probabilities, since they are not yet normalised. The probability p(ȳ*|x̄) would be max_{y′} δ_x̄(|x̄|, y′) / Z_θ(x̄).

    The computation of δ_x̄(i, y) is more efficient if it is carried out in log-space:

    δ_x̄(i, y) = 0                                                        if i = 0
                max_{y′} ( δ_x̄(i−1, y′) + Σ_k θ_k · f_k(y′, y, x̄, i) )    if i > 0

    This has a number of advantages:
      - Multiplication is replaced by addition, which is much more efficient.
      - The exp-operation is cancelled out.
      - The values of δ_x̄(i, y) stay within reasonable intervals.

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Viterbi algorithm for linear CRFs (in the (max,+) semiring)

    Input:  CRF 〈X, Y, θ, F, d〉, input sequence x̄ ∈ X⁺
    Output: argmax_ȳ p(ȳ|x̄; θ)

     1  for each y′ ∈ Y do
     2      δ(1, y′) ← Ψ(x̄, y′, 1)
     3  for i ← 2 to |x̄| do
     4      for each y ∈ Y do
     5          v ← −∞
     6          for each y′ → y do
     7              if δ(i−1, y′) + θ_{y′→y} > v then
     8                  v ← δ(i−1, y′) + θ_{y′→y}
     9                  Π(i, y) ← y′
    10          δ(i, y) ← v + Ψ(x̄, y, i)
    11  return extract-path(max_y δ(|x̄|, y), Π)

  • Conditional Random Fields: Viterbi inference with Linear CRFs

    Explanation

    Ψ(x̄, y, i) is the dot product of the parameter vector θ with the label observation features for x̄ at position i and label y.

    In line 5, a temporary score v is set to −∞ (the identity element of max on real numbers).

    Note that we take advantage of the potential sparseness of the transitions by only considering transitions y′ → y found in the model (line 6); the training corpus on which the model is based may not contain all |Y|² possible label bigrams.

    Note that the value of δ(i, y) depends only on the previous values at i−1 and the transitions y′ → y. The label observation features in Ψ(x̄, y, i) stay the same (lines 8, 10); that is why they do not contribute to the maximisation in line 7.

    v is updated whenever we find a better score considering an incoming transition from y′ (θ_{y′→y} denotes the parameter value of the transition) and the score in δ(i−1, y′). In that case, we also update the backpointer in Π(i, y) to point to y′.
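
    The following Python sketch is a minimal re-implementation of the pseudocode above in log-space (the (max,+) semiring). The psi score function, the transition weights, and the handling of the start marker are illustrative assumptions rather than the original implementation.

      import math

      def viterbi(x, labels, psi, trans_theta, start="<s>"):
          # psi(x, y, i): dot product of theta with the label observation features at position i
          # trans_theta[(y_prev, y)]: weight of the transition feature y_prev -> y
          n = len(x)
          delta = [{} for _ in range(n)]
          back = [{} for _ in range(n)]
          for y in labels:                                  # initialisation (position 0)
              delta[0][y] = trans_theta.get((start, y), 0.0) + psi(x, y, 0)
          for i in range(1, n):                             # recursion
              for y in labels:
                  best_prev, best_v = None, -math.inf
                  for y_prev in labels:
                      v = delta[i - 1][y_prev] + trans_theta.get((y_prev, y), 0.0)
                      if v > best_v:
                          best_prev, best_v = y_prev, v
                  delta[i][y] = best_v + psi(x, y, i)
                  back[i][y] = best_prev
          y_last = max(delta[n - 1], key=delta[n - 1].get)  # backtrace from the best final label
          path = [y_last]
          for i in range(n - 1, 0, -1):
              path.append(back[i][path[-1]])
          return list(reversed(path))

      labels = ["DT", "NN"]
      trans_theta = {("<s>", "DT"): 0.5, ("DT", "NN"): 1.5}       # illustrative weights
      psi = lambda x, y, i: 2.0 if x[i] == "the" and y == "DT" else 0.0
      print(viterbi(["the", "dog"], labels, psi, trans_theta))    # ['DT', 'NN']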

  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF Training: First order gradient methods, Stochastic gradient descent, Averaged perceptron, L-BFGS

    5 Summary

  • CRF Training

    Given a finite corpus T = (〈x̄^(1), ȳ^(1)〉, ..., 〈x̄^(N), ȳ^(N)〉) of training pairs 〈x̄^(n), ȳ^(n)〉, the goal of training is to maximise the log conditional likelihood LCL(T, θ) of the corpus given the model parameters θ:

    LCL(T, θ) = Σ_{n=1}^{N} ln p(ȳ^(n) | x̄^(n); θ)

    θ* = argmax_{θ} LCL(T, θ)

  • CRF Training: Methods

    There are a couple of training algorithms for linear CRFs:

      - First order gradient methods, for example gradient descent or stochastic gradient descent (SGD)
      - Approximate gradient methods: averaged perceptron
      - Second order gradient methods: L-BFGS

  • CRF Training: Empirical distribution

    Given a corpus T = (〈x̄^(1), ȳ^(1)〉, ..., 〈x̄^(N), ȳ^(N)〉), the empirical distribution p̃(x̄, ȳ) is defined as:

    p̃(x̄, ȳ) = (1/N) Σ_{i=1}^{N} 1{x̄^(i) = x̄} · 1{ȳ^(i) = ȳ}

    1{...} denotes an indicator function which is 1 if the condition in {...} is met, and 0 otherwise.

    If all N training pairs 〈x̄^(i), ȳ^(i)〉 are distinct, each pair gets an equal probability share of 1/N.

    Note that p̃(x̄, ȳ) is zero for all pairs 〈x̄, ȳ〉 not found in the training corpus.
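
    A direct way to obtain p̃ from a corpus of 〈x̄, ȳ〉 pairs is a normalised counter, as in the following sketch (data and names are illustrative):

      from collections import Counter

      def empirical_distribution(corpus):
          # p~(x, y) = count(x, y) / N for a corpus of (x_seq, y_seq) pairs
          counts = Counter((tuple(x), tuple(y)) for x, y in corpus)
          N = len(corpus)
          return {pair: c / N for pair, c in counts.items()}

      corpus = [
          (["the", "dog"], ["DT", "NN"]),
          (["the", "dog"], ["DT", "NN"]),
          (["a", "cat"],   ["DT", "NN"]),
      ]
      p_tilde = empirical_distribution(corpus)
      print(p_tilde[(("the", "dog"), ("DT", "NN"))])   # 2/3; unseen pairs have p~ = 0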

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ)

    Analogously to the training of log-linear models, we are interested in the functional dependence of the log-likelihood of a given training pair 〈x̄, ȳ〉 on a selected parameter value θ_k. In other words: we compute the partial derivative ∂/∂θ_k ln p(ȳ|x̄; θ).

    Recall that the gradient of ln p(ȳ|x̄; θ), written ∇ ln p(ȳ|x̄; θ), is the vector of the partial derivatives of ln p(ȳ|x̄; θ) with respect to all d parameters.

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ)

    ∂/∂θ_k ln p(ȳ|x̄; θ)
      = ∂/∂θ_k [ ln( (1/Z_θ(x̄)) · exp Σ_j θ_j · F_j(x̄, ȳ) ) ]
      = ∂/∂θ_k [ ln(1/Z_θ(x̄)) + Σ_j θ_j · F_j(x̄, ȳ) ]                                  (def. of ln)
      = ∂/∂θ_k [ Σ_j θ_j · F_j(x̄, ȳ) − ln Z_θ(x̄) ]                                      (def. of ln, +)
      = F_k(x̄, ȳ) − ∂/∂θ_k ln Z_θ(x̄)                                                   (def. of ∂, sum rule)
      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · ∂/∂θ_k Z_θ(x̄)                                         (def. of ∂ ln)
      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · ∂/∂θ_k ( Σ_{ȳ′} exp Σ_j θ_j · F_j(x̄, ȳ′) )             (def. of Z_θ(x̄))

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ) (continued)

      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · Σ_{ȳ′} ( ∂/∂θ_k exp Σ_j θ_j · F_j(x̄, ȳ′) )                            (sum rule)
      = F_k(x̄, ȳ) − (1/Z_θ(x̄)) · Σ_{ȳ′} F_k(x̄, ȳ′) · ( exp Σ_j θ_j · F_j(x̄, ȳ′) )                       (∂ exp, chain rule)
      = F_k(x̄, ȳ) − Σ_{ȳ′} F_k(x̄, ȳ′) · ( exp Σ_j θ_j · F_j(x̄, ȳ′) / Z_θ(x̄) )                           (Z_θ(x̄) into Σ_{ȳ′})
      = F_k(x̄, ȳ) − Σ_{ȳ′} F_k(x̄, ȳ′) · ( exp Σ_j θ_j · F_j(x̄, ȳ′) / Σ_{ȳ′′} exp Σ_{j′} θ_{j′} · F_{j′}(x̄, ȳ′′) )   (expand Z_θ(x̄))
      = F_k(x̄, ȳ) − Σ_{ȳ′} ( F_k(x̄, ȳ′) · p(ȳ′|x̄; θ) )                                                  (def. of p(ȳ′|x̄; θ))
      = F_k(x̄, ȳ) − E_{p(ȳ′|x̄;θ)}[ F_k(x̄, ȳ′) ]                                                        (def. of E)

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ): Empirical feature counts

    The empirical feature counts depend on the factorisation of the CRF and the form of the feature functions.

    For linear CRFs, the count F_k of feature f_k, given a training pair 〈x̄^(n), ȳ^(n)〉, is

    F_k(x̄^(n), ȳ^(n)) = Σ_{i=1}^{|x̄^(n)|} f_k(y^(n)_{i−1}, y^(n)_i, x̄^(n), i)

  • CRF Training: Partial derivatives of ln p(ȳ|x̄; θ)

    Recall that F_k(x̄, ȳ) is the sum of feature f_k over the whole pair 〈x̄, ȳ〉. Given a training pair 〈x̄^(n), ȳ^(n)〉, the main result from above is:

    F_k(x̄^(n), ȳ^(n)) − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ]

    That is: we subtract the expected aggregated value of the feature f_k under the model distribution from the aggregated value of f_k in the given sample.

    Since F_k(x̄^(n), ȳ^(n)) can be restated as the expected value of f_k under the empirical distribution p̃(x̄, ȳ), we arrive at:

    E_{p̃(x̄,ȳ)}[ F_k(x̄^(n), ȳ^(n)) ] − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ]

    After training, we want this difference to be 0, that is, the expectation of f_k under the empirical distribution (that is, the value of f_k in the training pair) matches the expected value of f_k under the model distribution.
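
    The sketch below computes this gradient component F_k(x̄, ȳ) − E_{p(ȳ′|x̄;θ)}[F_k(x̄, ȳ′)] for one training pair by brute-force enumeration of all labelings. The toy features and weights are illustrative assumptions; practical trainers replace the enumeration with the dynamic programming discussed later.

      import itertools, math

      LABELS = ["DT", "NN"]
      def f0(y_prev, y, x, i): return 1.0 if (y_prev, y) == ("DT", "NN") else 0.0
      def f1(y_prev, y, x, i): return 1.0 if x[i] == "the" and y == "DT" else 0.0
      features, theta = [f0, f1], [0.5, 0.5]                 # illustrative weights

      def global_counts(x, y):
          # F_k(x, y): feature f_k summed over all positions (y[0] is the start marker)
          return [sum(f(y[i], y[i + 1], x, i) for i in range(len(x))) for f in features]

      def score(x, y):
          return math.exp(sum(t * c for t, c in zip(theta, global_counts(x, y))))

      def gradient_component(x, y_gold):
          all_y = [["<s>"] + list(l) for l in itertools.product(LABELS, repeat=len(x))]
          Z = sum(score(x, y) for y in all_y)
          expected = [0.0] * len(features)                   # E_{p(y'|x)}[F_k(x, y')]
          for y in all_y:
              p = score(x, y) / Z
              for k, c in enumerate(global_counts(x, y)):
                  expected[k] += p * c
          observed = global_counts(x, y_gold)                # F_k(x, y_gold)
          return [o - e for o, e in zip(observed, expected)]

      print(gradient_component(["the", "dog"], ["<s>", "DT", "NN"]))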

  • CRF Training: Example: NE recognition

    Example (Input sequence with two alternative labelings)

    x̄:  Britain sent warships across the English Channel Monday
    ȳ:  B       O    O        O      O   B       I       B
    ȳ′: O       O    O        O      O   B       I       B

    x̄:  to rescue Britons stranded by Eyjafjallakökull 's vulcanic
    ȳ:  O  O      B       O        O  B                O  O
    ȳ′: O  O      B       O        O  B                O  O

    x̄:  ash cloud .
    ȳ:  O   O     O
    ȳ′: O   O     O

    (B = beginning of a NE, I = within a NE, O = outside a NE)

    [Example taken from Noah Smith's Linguistic structure prediction]

  • CRF Training: Example: NE recognition

    Example (Some feature functions and their global feature vectors)

    Group       Indicator function f_k(x_i, y_i)            Σ_i f_k(x_i, y_i)   Σ_i f_k(x_i, y′_i)
    bias        y_i = B                                     5                   4
                y_i = I                                     1                   1
                y_i = O                                     14                  15
    lexical     x_i = Britain and y_i = B                   1                   0
                x_i = Britain and y_i = O                   0                   1
    lowercase   lc(x_i) = britain and y_i = B               1                   0
                lc(x_i) = britain and y_i = O               0                   1
    shape       shape(x_i) = Aaaaaaa and y_i = B            3                   2
                shape(x_i) = Aaaaaaa and y_i = I            1                   1
                shape(x_i) = Aaaaaaa and y_i = O            0                   1
    prefix      pref1(x_i) = B and y_i = B                  2                   1
                pref1(x_i) = B and y_i = O                  0                   1
                shape(pref1(x_i)) = A and y_i = B           5                   4
    wiki        x_i is in Wikipedia NE list and y_i = B     2                   1

  • CRF Training: Gradient-based parameter estimation

    This difference of expectations can be generalised to the whole corpus:

    Σ_{n=1}^{N} ( E_{p̃(x̄,ȳ)}[ F_k(x̄^(n), ȳ^(n)) ] − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ] )

    In practice, these differences are often negated:

    −( E_{p̃(x̄,ȳ)}[ F_k(x̄, ȳ) ] − E_{p(ȳ′|x̄; θ)}[ F_k(x̄, ȳ′) ] )

    and

    −Σ_{n=1}^{N} ( E_{p̃(x̄,ȳ)}[ F_k(x̄^(n), ȳ^(n)) ] − E_{p(ȳ|x̄^(n); θ)}[ F_k(x̄^(n), ȳ) ] )

    Then, both functions are convex, that is, every local minimum is also a global minimum.

  • CRF Training: Gradient-based parameter estimation

    This leads to two generic algorithms for parameter estimation for a given corpus T of pairs 〈x̄^(n), ȳ^(n)〉:

    1 Gradient descent:
      - Use −Σ_{n=1}^{N} ln p(ȳ^(n)|x̄^(n); θ) as the objective function of the parameter estimation.
      - Compute the gradient of the above function relative to the current parameters θ, and update θ accordingly.
      - Repeat the above step several times.

    2 Stochastic gradient descent:
      - Use −ln p(ȳ|x̄; θ) as the objective function of the parameter estimation.
      - Take a random training pair 〈x̄^(n), ȳ^(n)〉, compute the gradient of the objective function, and update θ.
      - Repeat this several times for each 〈x̄^(n), ȳ^(n)〉, that is, take several passes over the corpus.

  • CRF Training: Regularisation
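    The slide itself gives no details. As an illustration only (an assumption, not taken from the slides), a common form of regularisation adds an L2 penalty ‖θ‖²/(2σ²) to the negative log-likelihood, which simply contributes θk/σ² to each component of the gradient; grad_nll is the same hypothetical helper as in the earlier sketch.

```python
import numpy as np

def l2_regularised_grad(grad_nll, x_bar, y_bar, theta, sigma2=10.0):
    """Gradient of -ln p(y|x; theta) + ||theta||^2 / (2*sigma2), i.e. a Gaussian prior on theta."""
    return grad_nll(x_bar, y_bar, theta) + theta / sigma2
```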


  • CRF Training: Stochastic gradient descent

    Stochastic gradient descent

    Input: Corpus T = (⟨x̄(1), ȳ(1)⟩, …, ⟨x̄(N), ȳ(N)⟩)
    Output: Parameter vector θ with |θ| = d

    1  θ ← 0
    2  repeat
    3      permute(T)
    4      for n ← 1 to N do
    5          δ ← ∇ ln p(ȳ(n) | x̄(n); θ)
    6          Calculate β̂ (the learning rate)
    7          θ ← θ + β̂ δ
    8  until convergence
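    A direct transcription of this pseudocode into Python could look as follows. The helper log_likelihood_grad (returning ∇ ln p(ȳ(n) | x̄(n); θ)) and the convergence test are placeholders, and the decaying learning rate is just one possible choice, so this is a sketch rather than a reference implementation.

```python
import numpy as np

def sgd_train(corpus, d, log_likelihood_grad, max_epochs=50, tol=1e-4, seed=0):
    """Stochastic gradient descent for a CRF, following the pseudocode above.

    corpus              -- list of (x_bar, y_bar) training pairs
    d                   -- number of features / parameters
    log_likelihood_grad -- returns the gradient of ln p(y_bar | x_bar; theta), a length-d vector
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                                        # line 1: theta <- 0
    for t in range(max_epochs):                                # line 2: repeat ...
        order = rng.permutation(len(corpus))                   # line 3: permute(T)
        max_step = 0.0
        for n in order:                                        # line 4: for n <- 1 to N
            x_bar, y_bar = corpus[n]
            delta = log_likelihood_grad(x_bar, y_bar, theta)   # line 5: gradient for one pair
            beta_hat = 1.0 / (t + 1.0)                         # line 6: a simple decaying learning rate
            theta += beta_hat * delta                          # line 7: ascend the log-likelihood
            max_step = max(max_step, beta_hat * float(np.max(np.abs(delta))))
        if max_step < tol:                                     # line 8: until convergence
            break
    return theta
```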


  • CRF Training: Stochastic gradient descent

    Remarks

    This is a generic algorithm which works for all types of CRFs.

    Its performance depends on how efficiently we can compute the gradient of the log conditional likelihood of the training pair in line 5.

    For linear CRFs, there is a polynomial algorithm based on dynamic programming.

    Note that SGD is an inherently sequential algorithm that is not easily parallelised.

    But the gradient vector can in principle be computed with a parallel algorithm.


  • CRF Training: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)]

    The computationally demanding step of gradient-based training algorithms is the computation of the gradient of the partition function Z, that is, E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)], since it requires a summation over exponentially many values of ȳ.

    For linear CRFs, we can take advantage of their special factorisation by using a dynamic programming approach. Recall that for linear CRFs,

    Fk(x̄, ȳ) = ∑_{i=1}^{|ȳ|} fk(y_{i−1}, y_i, x̄, i)


  • CRF Training: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)] for linear CRFs

    E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)]

      = E_{p(ȳ|x̄;θ)}[ ∑_{i=1}^{|ȳ|} fk(y_{i−1}, y_i, x̄, i) ]                 (Def. of Fk for linear CRFs)

      = ∑_{i=1}^{|ȳ|} E_{p(ȳ|x̄;θ)}[ fk(y_{i−1}, y_i, x̄, i) ]                 (E[X+Y] = E[X] + E[Y])

      = ∑_{i=1}^{|ȳ|} ∑_{ȳ} p(ȳ | x̄; θ) · fk(y_{i−1}, y_i, x̄, i)             (Def. of E_{p(ȳ|x̄;θ)}[·])

      = ∑_{i=1}^{|ȳ|} ∑_{y′,y∈Y} p(y′, y | x̄, i; θ) · fk(y′, y, x̄, i)         (Def. of p(ȳ|x̄;θ))


  • CRF Training: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)] for linear CRFs

    Given an output sequence ȳ, p(y′, y | x̄, i; θ) is the probability of the transition from y′ to y at position i of x̄, given θ.

    This probability can be further decomposed as follows:

    p(y′, y | x̄, i; θ) = αx̄(i−1, y′) · Φ(y′, y, x̄, i) · βx̄(i, y) / Z_θ(x̄)

    with
    - αx̄(i−1, y′) being the forward score of outputting y′ after a prefix of ȳ of length i−1,
    - Φ(y′, y, x̄, i) = exp( ∑_{k=1}^{d} θk fk(y′, y, x̄, i) ) being the score of the transition from y′ to y,
    - βx̄(i, y) being the backward score of outputting y and then the remaining suffix of ȳ, and
    - Z_θ(x̄) = ∑_{y∈Y} αx̄(|x̄|, y), the total score of all sequences in Y^{|x̄|}.


  • Conditional Random Fields: Efficient computation of E_{p(ȳ|x̄;θ)}[Fk(x̄, ȳ)]: Forward-backward algorithm

    The α- and β-variables can be defined in a way analogous to HMMs:

    αx̄(i, y) = 1                                              if i = 0
    αx̄(i, y) = ∑_{y′} αx̄(i−1, y′) · Φ(y′, y, x̄, i)            if i > 0

    βx̄(i, y) = 1                                              if i = |x̄|
    βx̄(i, y) = ∑_{y′} βx̄(i+1, y′) · Φ(y, y′, x̄, i+1)          if 0 ≤ i < |x̄|

    with

    Φ(y, y′, x̄, i) = exp( ∑_{k=1}^{d} θk · fk(y, y′, x̄, i) )
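    As an illustrative sketch (not from the slides), the two recursions and the partition function can be computed as follows, assuming the transition scores Φ(y′, y, x̄, i) for i = 1, …, |x̄| have already been collected into a NumPy array phi with phi[i-1, y_prev, y] = Φ(y_prev, y, x̄, i):

```python
import numpy as np

def forward_backward(phi):
    """Compute alpha, beta and Z for one input sequence.

    phi -- array of shape (T, Y, Y) with phi[i-1, y_prev, y] = Phi(y_prev, y, x_bar, i)
           for positions i = 1..T (T = |x_bar|, Y = number of output labels).
    """
    T, Y, _ = phi.shape
    alpha = np.zeros((T + 1, Y))
    beta = np.zeros((T + 1, Y))
    alpha[0, :] = 1.0                          # alpha(0, y) = 1
    beta[T, :] = 1.0                           # beta(|x|, y) = 1
    for i in range(1, T + 1):                  # forward pass
        alpha[i] = alpha[i - 1] @ phi[i - 1]   # sum_{y'} alpha(i-1, y') * Phi(y', y, x, i)
    for i in range(T - 1, -1, -1):             # backward pass
        beta[i] = phi[i] @ beta[i + 1]         # sum_{y'} Phi(y, y', x, i+1) * beta(i+1, y')
    Z = alpha[T].sum()                         # Z = sum_y alpha(|x|, y)
    return alpha, beta, Z
```

    The transition marginals needed on the previous slides are then alpha[i-1, y_prev] * phi[i-1, y_prev, y] * beta[i, y] / Z. In practice the recursions are usually carried out in log space (or with per-position scaling) to avoid numerical overflow.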


  • CRF Training: Averaged perceptron

    In machine learning, the perceptron is an algorithm for supervised classification of an input into one of several possible non-binary outputs.

    It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.

    The perceptron algorithm is a simple iterative parameter estimation algorithm where we count the mismatches between the output sequence ȳ given in the training corpus and the predicted output sequence ȳ′ based on the current parameters θ.

    Effectively, we try to minimise the number of these mismatches.


  • CRF Training: Averaged perceptron

    Recall the definition of the (unregularised) gradient of a feature fk given a single corpus instance ⟨x̄(j), ȳ(j)⟩:

    ∂/∂θk ln p(ȳ(j) | x̄(j); θ) = Fk(x̄(j), ȳ(j)) − E_{p(ȳ|x̄(j);θ)}[Fk(x̄(j), ȳ)]

    In perceptron-based methods, the gradient for a single corpus instance ⟨x̄(j), ȳ(j)⟩ is approximated by:

    ∂/∂θk ln p(ȳ(j) | x̄(j); θ) ≈ Fk(x̄(j), ȳ(j)) − Fk(x̄(j), argmax_ȳ p(ȳ | x̄(j); θ))

    That is, instead of taking the weighted average of Fk over all possible ȳ, we take only one specific ȳ into account, namely the one with the highest Viterbi score given the current parameters θ.


  • CRF Training: Averaged perceptron

    Averaged perceptron training algorithm

    Input: Corpus T = (⟨x̄(1), ȳ(1)⟩, …, ⟨x̄(N), ȳ(N)⟩)
    Output: Parameter vector θ

    1  θ ← 0; θsum ← 0
    2  for t ← 1 to M do
    3      permute(T)
    4      for j ← 1 to N do
    5          ȳ* ← argmax_{ȳ′} p(ȳ′ | x̄(j); θ)
    6          if ȳ* ≠ ȳ(j) then
    7              θ ← θ + ∑_{i=1}^{|x̄(j)|} F(y(j)_{i−1}, y(j)_i, x̄, i) − ∑_{i=1}^{|x̄(j)|} F(y*_{i−1}, y*_i, x̄, i)
    8          θsum ← θsum + θ
    9  return θsum / (N · M)
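    A runnable sketch of this procedure is given below. Here viterbi_decode and feature_vector are hypothetical helpers (a Viterbi decoder and the per-position feature vector F(y′, y, x̄, i)), and "<s>" is an assumed start label, so the code illustrates the update rule without committing to a particular feature representation.

```python
import numpy as np

def averaged_perceptron(corpus, d, viterbi_decode, feature_vector, M=10, seed=0):
    """Averaged perceptron training, following the pseudocode above.

    corpus         -- list of (x_bar, y_bar) pairs, y_bar being the gold label sequence
    d              -- number of features
    viterbi_decode -- viterbi_decode(x_bar, theta) -> best label sequence under theta
    feature_vector -- feature_vector(y_prev, y, x_bar, i) -> length-d vector F(y_prev, y, x_bar, i)
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                                    # line 1
    theta_sum = np.zeros(d)
    N = len(corpus)
    for t in range(M):                                     # line 2: M passes over the corpus
        order = rng.permutation(N)                         # line 3: permute(T)
        for j in order:                                    # line 4
            x_bar, y_bar = corpus[j]
            y_star = viterbi_decode(x_bar, theta)          # line 5: Viterbi search
            if list(y_star) != list(y_bar):                # line 6: prediction differs from gold?
                prev_gold, prev_pred = "<s>", "<s>"        # assumed start label
                for i in range(1, len(x_bar) + 1):         # line 7: add gold features, subtract predicted
                    theta += feature_vector(prev_gold, y_bar[i - 1], x_bar, i)
                    theta -= feature_vector(prev_pred, y_star[i - 1], x_bar, i)
                    prev_gold, prev_pred = y_bar[i - 1], y_star[i - 1]
            theta_sum += theta                             # line 8: accumulate
    return theta_sum / (N * M)                             # line 9: average
```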


  • CRF Training: Averaged perceptron

    Remarks

    Line 5 does a Viterbi search to find the best output sequence ȳ* for the current input x̄(j) given the current parameters θ.

    Line 6 compares that sequence with the correct corpus sequence ȳ(j).

    If the sequences are not the same, then in line 7 the feature vector of the training pair (x̄(j), ȳ(j)) is added to θ and the feature vector of (x̄(j), ȳ*) is subtracted from θ (here, F(y′, y, x̄, i) denotes the vector of the d feature functions fk(y′, y, x̄, i)).

    Then, θ is accumulated into a vector θsum.

    In effect, this means that features of the correct pair (x̄(j), ȳ(j)) are amplified and features of the incorrect prediction (x̄(j), ȳ*) are dampened.

    Finally, θsum is averaged by dividing it by |T| · M.


  • CRF Training: Averaged perceptron

    Intuitions behind the algorithm

    Consider an incorrect output label, causing the test in line 6 to be true. Say ȳ(j) differs from ȳ* only at position l.

    Then line 7 effectively subtracts from θ all feature values in F(y*_{l−1}, y*_l, x̄, l). That is, the parameter values of the features causing the incorrect label y*_l are dampened. At the same time, the parameter values of the features found in F(y(j)_{l−1}, y(j)_l, x̄, l) are always amplified, since we assume the corpus labelling to be correct.

    The next time the pair ⟨x̄(j), ȳ(j)⟩ is considered, the Viterbi search may yield the correct label y(j)_l found in the corpus, since the dampened parameter values of the features that caused the incorrect label play a less prominent role.


  • CRF Training: Averaged perceptron

    Implementation notes

    The efficiency of the averaged perceptron algorithm depends crucially on how smartly we implement the vector sum in line 8, since θ can contain several million features and line 8 is executed once for every training pair in every pass, i.e. N · M times.

    The Viterbi-based approximation is somewhat cheaper to compute than E_{p(ȳ|x̄(j);θ)}[Fk(x̄(j), ȳ)] in the SGD algorithm.

    Again, this is a sequential algorithm. (A standard trick for the vector sum in line 8 is sketched below.)
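    The following sketch shows one well-known way (an addition here, not taken from the slides) to avoid the dense accumulation in line 8: for each feature we remember when it last changed and credit its contribution to θsum lazily, so the per-pair cost is proportional to the number of changed features rather than to d.

```python
import numpy as np

class LazyAveragedWeights:
    """Averaged-perceptron weights with lazy accumulation of theta_sum.

    Instead of adding the full dense vector theta to theta_sum after every
    training pair, we record for each feature when it last changed and add
    its contribution for the elapsed interval only when it changes again.
    """
    def __init__(self, d):
        self.theta = np.zeros(d)
        self.theta_sum = np.zeros(d)
        self.last_update = np.zeros(d, dtype=np.int64)   # step of last change per feature
        self.step = 0

    def update(self, k, delta):
        """Add delta to feature k (called only for the few features that change)."""
        # credit theta[k] for all steps since its last change, then modify it
        self.theta_sum[k] += self.theta[k] * (self.step - self.last_update[k])
        self.theta[k] += delta
        self.last_update[k] = self.step

    def tick(self):
        """Call once per training pair (plays the role of line 8 of the pseudocode)."""
        self.step += 1

    def average(self):
        """Flush outstanding contributions and return theta_sum divided by the number of steps."""
        self.theta_sum += self.theta * (self.step - self.last_update)
        self.last_update[:] = self.step
        return self.theta_sum / max(self.step, 1)
```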


  • CRF Training: Averaged perceptron: summary

    - Overfitting
    - Error driven


  • CRF Training: L-BFGS


  • Outline

    1 Introduction

    2 Linear CRFs

    3 Inference

    4 CRF-Training
      First order gradient methods
      Stochastic gradient descent
      Averaged perceptron
      L-BFGS

    5 Summary


  • Conditional random fields: Summary


  • TODO

    Relationship to maximum entropy

    Higher-order LCRFs

    Regularisation

    L-BFGS


  • Version history

    29.1.2015: Initial version 0.1
    10.2.2015: Version 0.2: Moved the forward-backward algorithm to the CRF training section and added an additional slide. Changed the slide about efficient gradient computation. Fixed a number of errors. Added two further slides to the averaged perceptron algorithm.

