
Page 1:

CS 388: Natural Language Processing:
Discriminative Training and Conditional Random Fields (CRFs) for Sequence Labeling

Raymond J. Mooney
University of Texas at Austin

Page 2:

Joint Distribution

• The joint probability distribution for a set of random variables X1,…,Xn gives the probability of every combination of values (an n-dimensional array with v^n entries if all variables are discrete with v values; all v^n values must sum to 1): P(X1,…,Xn)

• The marginal probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution.

• Therefore, all conditional probabilities can also be calculated.

Joint distribution for class = positive:

          circle   square
  red      0.20     0.02
  blue     0.02     0.01

Joint distribution for class = negative:

          circle   square
  red      0.05     0.30
  blue     0.20     0.20

P(red ∧ circle) = 0.20 + 0.05 = 0.25

P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57

P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
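As a minimal illustration (not from the original slides), the same marginal and conditional probabilities can be read directly off a joint table; the dictionary layout and helper below are assumptions made for this sketch:

```python
# Joint distribution from the slide, keyed by (category, color, shape).
joint = {
    ("positive", "red", "circle"): 0.20, ("positive", "red", "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red", "circle"): 0.05, ("negative", "red", "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}
VARS = ("category", "color", "shape")

def marginal(**fixed):
    """Sum the joint over all entries consistent with the fixed variable values."""
    return sum(p for key, p in joint.items()
               if all(dict(zip(VARS, key))[var] == val for var, val in fixed.items()))

print(marginal(color="red", shape="circle"))                 # 0.25
print(marginal(color="red"))                                 # 0.57
print(marginal(category="positive", color="red", shape="circle")
      / marginal(color="red", shape="circle"))               # 0.80
```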

Page 3:

Probabilistic Classification

• Let Y be the random variable for the class which takes values {y1,y2,…ym}.

• Let X be the random variable describing an instance consisting of a vector of values for n features <X1,X2…Xn>, let xk be a possible vector value for X and xij a possible value for Xi.

• For classification, we need to compute P(Y=yi | X=xk) for i = 1…m

• This could be done using the joint distribution, but that requires estimating an exponential number of parameters.

Page 4:

Bayesian Categorization

• Determine the category of xk by computing, for each yi:

• P(X=xk) can be determined since categories are complete and disjoint.

P(Y=yi | X=xk) = P(Y=yi) P(X=xk | Y=yi) / P(X=xk)

Σi=1..m P(Y=yi | X=xk) = Σi=1..m P(Y=yi) P(X=xk | Y=yi) / P(X=xk) = 1

⇒ P(X=xk) = Σi=1..m P(Y=yi) P(X=xk | Y=yi)

Page 5:

Bayesian Categorization (cont.)

• Need to know:
– Priors: P(Y=yi)

– Conditionals: P(X=xk | Y=yi)

• P(Y=yi) are easily estimated from data.

– If ni of the examples in D are in yi then P(Y=yi) = ni / |D|

• Too many possible instances (e.g. 2^n for binary features) to estimate all P(X=xk | Y=yi).

• Still need to make some sort of independence assumptions about the features to make learning tractable.

Page 6:

Naïve Bayes Generative Model

[Figure: a bag of positive examples and a bag of negative examples; each example is generated by first drawing its Category (pos/neg) and then drawing Size (sm/med/lg), Color (red/blue/grn), and Shape (circ/sqr/tri) from that category's distributions.]

Page 7:

Naïve Bayes Inference Problem

[Figure: the same generative model; the inference problem is to determine which category (positive or negative) most probably generated the test instance <lg, red, circ>.]

Page 8:

Naïve Bayesian Categorization

• If we assume the features of an instance are independent given the category (i.e., conditionally independent), the conditional distribution factors as in the equation below.

• Therefore, we then only need to know P(Xi | Y) for each possible pair of a feature-value and a category.

• If Y and all Xi are binary, this requires specifying only 2n parameters:
– P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi

– P(Xi=false | Y) = 1 – P(Xi=true | Y)

• Compared to specifying 2^n parameters without any independence assumptions.

P(X | Y) = P(X1, X2, …, Xn | Y) = Πi=1..n P(Xi | Y)
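As a minimal sketch (not from the slides) of how these estimates are used, the posterior over classes can be computed from the priors and per-feature conditionals; the dictionaries and all numeric values below are made up for illustration:

```python
# Hypothetical parameters: priors P(Y) and conditionals P(X_i = value | Y).
priors = {"positive": 0.5, "negative": 0.5}
cond = {
    "positive": {"size": {"sm": 0.4, "med": 0.3, "lg": 0.3},
                 "color": {"red": 0.9, "blue": 0.05, "grn": 0.05},
                 "shape": {"circ": 0.8, "sqr": 0.1, "tri": 0.1}},
    "negative": {"size": {"sm": 0.4, "med": 0.3, "lg": 0.3},
                 "color": {"red": 0.3, "blue": 0.3, "grn": 0.4},
                 "shape": {"circ": 0.3, "sqr": 0.4, "tri": 0.3}},
}

def naive_bayes_posterior(x):
    """P(Y=y | X=x) ∝ P(Y=y) * Π_i P(X_i | Y=y); renormalizing gives the posterior."""
    scores = {}
    for y in priors:
        p = priors[y]
        for feature, value in x.items():
            p *= cond[y][feature][value]      # P(X_i = value | Y = y)
        scores[y] = p                          # proportional to P(Y=y, X=x)
    z = sum(scores.values())                   # = P(X=x), since classes are complete and disjoint
    return {y: p / z for y, p in scores.items()}

# The test instance <lg, red, circ> from the earlier inference slide.
print(naive_bayes_posterior({"size": "lg", "color": "red", "shape": "circ"}))
```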

Page 9:

Generative vs. Discriminative Models

• Generative models are not directly designed to maximize the performance of classification. They model the complete joint distribution P(X,Y).

• Classification is then done using Bayesian inference given the generative model of the joint distribution.

• But a generative model can also be used to perform any other inference task, e.g. P(X1 | X2, …, Xn, Y).
– "Jack of all trades, master of none."

• Discriminative models are specifically designed and trained to maximize performance of classification. They only model the conditional distribution P(Y | X).

• By focusing on modeling the conditional distribution, they generally perform better on classification than generative models when given a reasonable amount of training data.

Page 10:

Logistic Regression

• Assumes a parametric form for directly estimating P(Y | X). For binary concepts, this is:

P(Y=1 | X) = 1 / (1 + exp(w0 + Σi=1..n wi Xi))

• Equivalent to a one-layer backpropagation neural net.
– Logistic regression is the source of the sigmoid function used in backpropagation.

– Objective function for training is somewhat different.

P(Y=0 | X) = 1 − P(Y=1 | X) = exp(w0 + Σi=1..n wi Xi) / (1 + exp(w0 + Σi=1..n wi Xi))
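A minimal sketch (not from the slides) of evaluating these two probabilities for one instance; the weights and feature values below are made up:

```python
import math

def p_y1_given_x(x, w0, w):
    """P(Y=1 | X=x) = 1 / (1 + exp(w0 + sum_i w_i * x_i)), matching the slide's form."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

# Hypothetical weights for a three-feature instance.
w0, w = -1.0, [0.5, -2.0, 0.25]
x = [1.0, 0.0, 4.0]
p1 = p_y1_given_x(x, w0, w)
print(p1, 1.0 - p1)   # P(Y=1 | x) and P(Y=0 | x)
```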

Page 11:

Logistic Regression as a Log-Linear Model

• Logistic regression is basically a linear model, which is demonstrated by taking logs.

Assign label Y = 0 iff 1 < P(Y=0 | X) / P(Y=1 | X)

which holds iff 1 < exp(w0 + Σi=1..n wi Xi)

or equivalently iff 0 < w0 + Σi=1..n wi Xi

• Also called a maximum entropy model (MaxEnt) because it can be shown that standard training for logistic regression gives the distribution with maximum entropy that is consistent with the training data.

Page 12:

Logistic Regression Training

• Weights are set during training to maximize the conditional data likelihood:

W ← argmaxW Πd∈D P(Yd | Xd, W)

where D is the set of training examples and Yd and Xd denote, respectively, the values of Y and X for example d.

• Equivalently viewed as maximizing the conditional log likelihood (CLL)

W ← argmaxW Σd∈D ln P(Yd | Xd, W)

Page 13:

Logistic Regression Training

• Like neural nets, can use standard gradient descent to find the parameters (weights) that optimize the CLL objective function.

• Many other more advanced training methods are possible to speed convergence:
– Conjugate gradient
– Generalized Iterative Scaling (GIS)
– Improved Iterative Scaling (IIS)
– Limited-memory quasi-Newton (L-BFGS)
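A minimal sketch (not from the slides) of the plain gradient-based option; with the parameterization above, the CLL gradient with respect to wi works out to Σd Xi_d (P(Y=1|X_d) − Y_d), and the toy data and learning rate below are illustrative:

```python
import math

def p_y1(x, w0, w):
    """P(Y=1 | x) under the slide's parameterization, 1 / (1 + exp(w0 + w·x))."""
    return 1.0 / (1.0 + math.exp(w0 + sum(wi * xi for wi, xi in zip(w, x))))

def train_logreg(data, n_features, lr=0.1, epochs=500):
    """Batch gradient ascent on the conditional log likelihood (CLL).
    data is a list of (feature vector, label) pairs with labels in {0, 1}."""
    w0, w = 0.0, [0.0] * n_features
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * n_features
        for x, y in data:
            err = p_y1(x, w0, w) - y          # dCLL/d(w0 + w·x) for this example
            g0 += err
            for i, xi in enumerate(x):
                g[i] += err * xi
        w0 += lr * g0                          # ascend the CLL gradient
        w = [wi + lr * gi for wi, gi in zip(w, g)]
    return w0, w

# Toy data: the single feature separates the two labels.
data = [([0.0], 0), ([0.2], 0), ([0.9], 1), ([1.0], 1)]
print(train_logreg(data, n_features=1))
```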

Page 14:

Preventing Overfitting in Logistic Regression

• To prevent overfitting, one can use regularization (a.k.a. smoothing), penalizing large weights by changing the training objective:

W ← argmaxW Σd∈D ln P(Yd | Xd, W) − (λ/2) ||W||²

where λ is a constant that determines the amount of smoothing.

• This can be shown to be equivalent to MAP parameter estimation assuming a Gaussian prior for W with zero mean and a variance related to 1/λ.
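As a minimal sketch (not from the slides), the penalty only changes each gradient step: the CLL gradient for wi gets an extra −λ·wi term (the bias is often left unpenalized); the lam and lr values are illustrative:

```python
def l2_regularized_step(w0, w, g0, g, lr=0.1, lam=0.01):
    """One ascent step on CLL(W) - (lam / 2) * ||W||^2, given the CLL gradient (g0, g)."""
    new_w0 = w0 + lr * g0                               # bias left unpenalized here
    new_w = [wi + lr * (gi - lam * wi) for wi, gi in zip(w, g)]
    return new_w0, new_w
```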

Page 15:

Multinomial Logistic Regression (MaxEnt)

• Logistic regression can be generalized to multi-class problems (where Y has a multinomial distribution).

• Create a feature function for each combination of a class value y′ and each feature Xj, and another for the "bias weight" of each class:
– f y′,j (Y, X) = Xj if Y = y′ and 0 otherwise
– f y′ (Y, X) = 1 if Y = y′ and 0 otherwise

• The final conditional distribution is:

P(Y | X) = (1/Z(X)) exp( Σk=1..K λk fk(Y, X) )

Z(X) = ΣY′ exp( Σk=1..K λk fk(Y′, X) )   (the normalizing constant; the λk are weights)
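A minimal sketch (not from the slides) of evaluating this distribution, with hypothetical indicator feature functions and made-up weights λk:

```python
import math

def maxent_posterior(x, classes, feature_fns, lam):
    """P(Y=y | X=x) = exp(Σ_k λ_k f_k(y, x)) / Z(x) for each class y."""
    scores = {y: math.exp(sum(l * f(y, x) for l, f in zip(lam, feature_fns)))
              for y in classes}
    z = sum(scores.values())                  # Z(x), the normalizing constant
    return {y: s / z for y, s in scores.items()}

# Hypothetical features: one per (class, feature Xj) pair plus a bias feature per class.
classes = ["pos", "neg"]
feature_fns = [
    lambda y, x: x[0] if y == "pos" else 0.0,   # f_{pos,0}(Y, X)
    lambda y, x: x[0] if y == "neg" else 0.0,   # f_{neg,0}(Y, X)
    lambda y, x: 1.0 if y == "pos" else 0.0,    # bias feature f_{pos}(Y, X)
    lambda y, x: 1.0 if y == "neg" else 0.0,    # bias feature f_{neg}(Y, X)
]
lam = [1.5, -0.5, 0.1, -0.1]                    # made-up weights
print(maxent_posterior([2.0], classes, feature_fns, lam))
```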

Page 16:

Graphical Models

• If no assumption of independence is made, then an exponential number of parameters must be estimated for sound probabilistic inference.
– No realistic amount of training data is sufficient to estimate so many parameters.

• If a blanket assumption of conditional independence is made, efficient training and inference is possible, but such a strong assumption is rarely warranted.

• Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated.
– Bayesian Networks: Directed acyclic graphs that indicate causal structure.
– Markov Networks: Undirected graphs that capture general dependencies.

Page 17:

Bayesian Networks

• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences

[Figure: the burglary alarm network; Burglary and Earthquake are parents of Alarm, which is the parent of JohnCalls and MaryCalls.]

Page 18:

Conditional Probability Tables

• Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior probabilities.

[Figure: the burglary alarm network annotated with its CPTs:]

P(B) = .001    P(E) = .002

B  E  P(A)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

A  P(J)        A  P(M)
T  .90         T  .70
F  .05         F  .01

Page 19:

Joint Distributions for Bayes Nets

• A Bayesian Network implicitly defines a joint distribution.

P(x1, x2, …, xn) = Πi=1..n P(xi | Parents(Xi))

• Example:

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J | A) P(M | A) P(A | ¬B ∧ ¬E) P(¬B) P(¬E)
                        = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 = 0.00062
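A minimal sketch (not from the slides) of this computation using the CPTs from the previous slide; encoding the CPTs as Python dictionaries is an assumption of the sketch:

```python
# CPTs of the burglary alarm network from the previous slide.
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}      # P(Alarm | Burglary, Earthquake)
p_j = {True: 0.90, False: 0.05}                          # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}                          # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of each node's CPT entry given its parents."""
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pj = p_j[a] if j else 1 - p_j[a]
    pm = p_m[a] if m else 1 - p_m[a]
    return pb * pe * pa * pj * pm

# The slide's example: P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) ≈ 0.00062
print(joint(b=False, e=False, a=True, j=True, m=True))
```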

Page 20:

Naïve Bayes as a Bayes Net

• Naïve Bayes is a simple Bayes Net

[Figure: Bayes net with class node Y as the single parent of feature nodes X1, X2, …, Xn.]

• Priors P(Y) and conditionals P(Xi|Y) for Naïve Bayes provide CPTs for the network.

Page 21:

Markov Networks

• Undirected graph over a set of random variables, where an edge represents a dependency.

• The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X).

• Every node in a Markov Net is conditionally independent of every other node given its Markov blanket.

Page 22:

Distribution for a Markov Network

• The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φk, for each clique, k, in the graph.

• For each joint assignment of values to the variables in clique k, φk assigns a non-negative real value that represents the compatibility of these values.

• The joint distribution of a Markov network is then defined by:

P(x1, x2, …, xn) = (1/Z) Πk φk(x{k})

where x{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:

Z = Σx Πk φk(x{k})

Page 23:

Sample Markov Network

[Figure: the burglary network as an undirected Markov net over Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls, with one potential function per edge:]

B  A  φ1       E  A  φ2
T  T  100      T  T  50
T  F  1        T  F  10
F  T  1        F  T  1
F  F  200      F  F  200

J  A  φ3       M  A  φ4
T  T  75       T  T  50
T  F  10       T  F  1
F  T  1        F  T  10
F  F  200      F  F  200
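A minimal sketch (not from the slides) of evaluating this distribution by brute force for the small network above; enumerating all assignments to compute Z is feasible only because there are just five binary variables:

```python
from itertools import product

# Edge potentials from the sample network, one per (variable, Alarm) pair.
phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}   # (B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}   # (E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}   # (J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}   # (M, A)

def unnormalized(b, e, a, j, m):
    """Product of the clique potentials for one joint assignment."""
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

# Z sums the potential product over all 2^5 assignments; then P(x) = unnormalized(x) / Z.
Z = sum(unnormalized(*assignment) for assignment in product([True, False], repeat=5))
print(unnormalized(False, False, True, True, True) / Z)
```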

Page 24:

Logistic Regression as a Markov Net

• Logistic regression is a simple Markov Net

[Figure: Markov net with class node Y connected to feature nodes X1, X2, …, Xn.]

• But it only models the conditional distribution P(Y | X), not the full joint P(X,Y).

• Same as a discriminatively trained naïve Bayes.

Page 25:

Generative vs. Discriminative Sequence Labeling Models

• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q).

• HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task.

• Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q | O)

Page 26:

Classification

[Figure: naïve Bayes (generative) and logistic regression (discriminative/conditional) share the same graph, a class node Y connected to feature nodes X1, X2, …, Xn; naïve Bayes models the joint P(X,Y) while logistic regression models only the conditional P(Y | X).]

Page 27:

Sequence Labeling

[Figure: an HMM (generative) and a linear-chain CRF (discriminative/conditional) share the same chain structure, labels Y1, Y2, …, YT connected in a chain with each Yt linked to its observation Xt; the HMM models the joint distribution over labels and observations, while the CRF models only the conditional distribution of labels given observations.]

Page 28:

Simple Linear Chain CRF Features

• The conditional distribution is modeled in a way similar to multinomial logistic regression.

• Create feature functions fk(Yt, Yt−1, Xt)

– Feature for each state-transition pair i, j:
  fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j, and 0 otherwise
– Feature for each state-observation pair i, o:
  fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o, and 0 otherwise

• Note: number of features grows quadratically in the number of states (i.e. tags).


Page 29:

Conditional Distribution for a Linear Chain CRF

• Using these feature functions for a simple linear chain CRF, we can define:

P(Y | X) = (1/Z(X)) exp( Σt=1..T Σk=1..K λk fk(Yt, Yt−1, Xt) )

Z(X) = ΣY′ exp( Σt=1..T Σk=1..K λk fk(Y′t, Y′t−1, Xt) )
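A minimal sketch (not from the slides) of scoring one label sequence under this model, using the indicator transition and observation features from the previous slide; the tag set, weights, and brute-force computation of Z(X) are all illustrative (real implementations compute Z(X) with the forward algorithm rather than by enumeration):

```python
import math
from itertools import product

def score(y_seq, x_seq, trans_w, obs_w):
    """Σ_t Σ_k λ_k f_k(y_t, y_{t-1}, x_t) with indicator transition/observation features."""
    total = 0.0
    for t in range(len(x_seq)):
        if t > 0:
            total += trans_w.get((y_seq[t], y_seq[t - 1]), 0.0)   # λ for f_{i,j}
        total += obs_w.get((y_seq[t], x_seq[t]), 0.0)             # λ for f_{i,o}
    return total

def crf_prob(y_seq, x_seq, states, trans_w, obs_w):
    """P(Y | X) = exp(score(Y, X)) / Z(X), with Z(X) computed by brute-force enumeration."""
    z = sum(math.exp(score(y, x_seq, trans_w, obs_w))
            for y in product(states, repeat=len(x_seq)))
    return math.exp(score(y_seq, x_seq, trans_w, obs_w)) / z

# Hypothetical two-tag toy problem.
states = ["N", "V"]
trans_w = {("V", "N"): 1.0, ("N", "V"): 0.5}        # keyed (current tag, previous tag)
obs_w = {("N", "dog"): 2.0, ("V", "runs"): 2.0}     # keyed (tag, token)
print(crf_prob(["N", "V"], ["dog", "runs"], states, trans_w, obs_w))
```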

Page 30:

Adding Token Features to a CRF

• Can add token features Xi,j

[Figure: linear chain of labels Y1, Y2, …, YT, with each label Yt connected to its token features Xt,1, …, Xt,m.]

• Can add additional feature functions for each token feature to model the conditional distribution.

Page 31:

Features in POS Tagging

• For POS tagging, use lexicographic features of tokens:
– Capitalized?
– Starts with a numeral?
– Ends in a given suffix (e.g. "s", "ed", "ly")?
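A minimal sketch (not from the slides) of extracting such features for a single token; the exact feature set and names are illustrative:

```python
def token_features(word):
    """A few lexicographic indicator features of the kind listed above."""
    features = {
        "capitalized": word[:1].isupper(),
        "starts_with_numeral": word[:1].isdigit(),
    }
    for suffix in ("s", "ed", "ly"):
        features["ends_with_" + suffix] = word.endswith(suffix)
    return features

print(token_features("Quickly"))   # capitalized and ends_with_ly are True
```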


Page 32:

Enhanced Linear Chain CRF (standard approach)

• Can also condition transition on the current token features.

[Figure: the same linear chain of labels Y1, Y2, …, YT with token features Xt,1, …, Xt,m; the transitions are also connected to the token features.]

• Add feature functions:
– fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1, and 0 otherwise

Page 33:

Supervised Learning (Parameter Estimation)

• As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the CLL of the supervised training data.

• See paper for details.


Page 34:

Sequence Tagging (Inference)

• A variant of the Viterbi algorithm can be used to efficiently, in O(TN²) time, determine the globally most probable label sequence for a given token sequence under a given log-linear model of the conditional probability P(Y | X).

• See paper for details.
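A minimal sketch (not from the slides) of Viterbi decoding over log-linear scores of the kind defined earlier; the tag set and weights are the same illustrative ones as before, and each table cell stores the best score of any label prefix ending in a given tag, which yields the O(TN²) running time mentioned above:

```python
def viterbi(x_seq, states, trans_w, obs_w):
    """Return the most probable label sequence under a log-linear chain model."""
    # delta[t][s] = best score over label sequences for x[0..t] that end in state s
    delta = [{s: obs_w.get((s, x_seq[0]), 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(x_seq)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] + trans_w.get((s, p), 0.0))
            delta[t][s] = (delta[t - 1][prev] + trans_w.get((s, prev), 0.0)
                           + obs_w.get((s, x_seq[t]), 0.0))
            back[t][s] = prev
    # Trace back from the best final state.
    best_last = max(states, key=lambda s: delta[-1][s])
    path = [best_last]
    for t in range(len(x_seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Same illustrative weights as in the earlier CRF sketch.
print(viterbi(["dog", "runs"], ["N", "V"],
              {("V", "N"): 1.0, ("N", "V"): 0.5},
              {("N", "dog"): 2.0, ("V", "runs"): 2.0}))   # ['N', 'V']
```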


Page 35:

Skip-Chain CRFs

• Can model some long-distance dependencies (i.e. the same word appearing in different parts of the text) by including long-distance edges in the Markov model.

[Figure: a linear chain over the tokens of "Michael Dell said … Dell bought …" with labels Y1, Y2, Y3, …, Y100, Y101; a long-distance skip edge connects the labels of the two occurrences of "Dell" (Y2 and Y100).]

• The additional links make exact inference intractable, so one must resort to approximate inference to try to find the most probable labeling.

Page 36:

CRF Results

• Experimental results verify that they have superior accuracy on various sequence labeling tasks:
– Part-of-speech tagging
– Noun phrase chunking
– Named entity recognition
– Semantic role labeling

• However, CRFs are much slower to train and do not scale as well to large amounts of training data.
– Training for POS on the full Penn Treebank (~1M words) currently takes "over a week."

• Skip-chain CRFs improve results on IE.

Page 37:

CRF Summary

• CRFs are a discriminative approach to sequence labeling whereas HMMs are generative.

• Discriminative methods are usually more accurate since they are trained for a specific performance task.

• CRFs also easily allow adding additional token features without making additional independence assumptions.

• Training time is increased since a complex optimization procedure is needed to fit supervised training data.

• CRFs are a state-of-the-art method for sequence labeling.
