Natural Language Processing (8)
Zhao Hai 赵海, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 2010-2011, zhaohai@cs.sjtu.edu.cn

Upload: nancy-johnston

Post on 25-Dec-2015



Natural Language Processing (8)

Zhao Hai 赵海

Department of Computer Science and Engineering

Shanghai Jiao Tong University

2010-2011 

zhaohai@cs.sjtu.edu.cn

Overview

• Models
– HMM: Hidden Markov Model
– MEMM: maximum entropy Markov model
– CRFs: Conditional Random Fields

• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition

What is an HMM?

• Graphical model
• Circles indicate states
• Arrows indicate probabilistic dependencies between states

What is an HMM?

• Green circles are hidden states
• Each hidden state depends only on the previous state
• “The past is independent of the future given the present.”

What is an HMM?

• Purple nodes are observed states
• Each observed state depends only on its corresponding hidden state

HMM Formalism

• An HMM is the tuple $\{S, K, \Pi, A, B\}$
• $S : \{s_1 \ldots s_N\}$ are the values for the hidden states
• $K : \{k_1 \ldots k_M\}$ are the values for the observations

[Figure: chain of hidden states S, each emitting an observation K]

HMM Formalism

• $\{S, K, \Pi, A, B\}$
• $\Pi = \{\pi_i\}$ are the initial state probabilities
• $A = \{a_{ij}\}$ are the state transition probabilities
• $B = \{b_{ik}\}$ are the observation state probabilities

[Figure: chain of hidden states S linked by transitions A, each emitting an observation K with probability B]


Inference in an HMM

• Probability Estimation: Compute the probability of a given observation sequence

• Decoding: Given an observation sequence, compute the most likely hidden state sequence

• Parameter Estimation: Given an observation sequence, find a model that most closely fits the observation

Probability Estimation

Given an observation sequence and a model, compute the probability of the observation sequence:

$$O = (o_1 \ldots o_T), \quad \mu = (A, B, \Pi)$$

$$\text{Compute } P(O \mid \mu)$$

Probability Estimation

For a fixed hidden state sequence $X = (x_1 \ldots x_T)$:

$$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$$

$$P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$$

$$P(O, X \mid \mu) = P(O \mid X, \mu)\, P(X \mid \mu)$$

$$P(O \mid \mu) = \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)$$

Probability Estimation

$$P(O \mid \mu) = \sum_{\{x_1 \ldots x_T\}} \pi_{x_1} b_{x_1 o_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}$$
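The sum over all state sequences can be evaluated literally, which makes its cost visible: there are N^T sequences to enumerate. The sketch below is only an illustration under assumptions; the two-state, three-symbol parameter values and the name `brute_force_likelihood` are made up for the example and do not come from the slides.

```python
from itertools import product

# Toy HMM; every number here is an illustrative assumption.
pi = [0.6, 0.4]                          # pi[i]   = P(x_1 = i)
A = [[0.7, 0.3], [0.4, 0.6]]             # A[i][j] = P(x_{t+1} = j | x_t = i)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # B[i][k] = P(o_t = k | x_t = i)


def brute_force_likelihood(obs, pi, A, B):
    """Sum P(O, X | mu) over every state sequence X (exponential in len(obs))."""
    n_states = len(pi)
    total = 0.0
    for states in product(range(n_states), repeat=len(obs)):
        # pi_{x_1} b_{x_1 o_1} times the product of a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}
        p = pi[states[0]] * B[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        total += p
    return total


print(brute_force_likelihood([0, 1, 2], pi, A, B))
```

With 2 states and 3 observations this enumerates only 8 sequences, but the count grows as N^T, which is why a dynamic-programming solution is needed for realistic sequence lengths.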

Forward Procedure

• Special structure gives us an efficient solution using dynamic programming.
• Intuition: Probability of the first t observations is the same for all possible t+1 length state sequences.
• Define: $\alpha_i(t) = P(o_1 \ldots o_t, x_t = i \mid \mu)$

Forward Procedure

$$\begin{aligned}
\alpha_j(t+1) &= P(o_1 \ldots o_{t+1}, x_{t+1} = j) \\
&= P(o_1 \ldots o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j) \\
&= P(o_1 \ldots o_t \mid x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)\, P(x_{t+1} = j) \\
&= P(o_1 \ldots o_t, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j)
\end{aligned}$$

Forward Procedure

$$\begin{aligned}
\alpha_j(t+1) &= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_t = i, x_{t+1} = j)\, P(o_{t+1} \mid x_{t+1} = j) \\
&= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_{t+1} = j \mid x_t = i)\, P(x_t = i)\, P(o_{t+1} \mid x_{t+1} = j) \\
&= \sum_{i=1 \ldots N} P(o_1 \ldots o_t, x_t = i)\, P(x_{t+1} = j \mid x_t = i)\, P(o_{t+1} \mid x_{t+1} = j) \\
&= \sum_{i=1 \ldots N} \alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}
\end{aligned}$$
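The recursion for α translates almost line for line into code. The following is a minimal sketch, not a reference implementation: the toy parameters are invented for illustration, and the base case α_i(1) = π_i b_{i o_1} is an assumption implied by the definition of α rather than something written out on the slides.

```python
def forward(obs, pi, A, B):
    """Return alpha, where alpha[t][i] = P(o_1..o_{t+1}, x_{t+1} = i) for 0-based t."""
    n = len(pi)
    # Assumed base case: alpha_i(1) = pi_i * b_{i o_1}
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # alpha_j(t+1) = sum_i alpha_i(t) a_ij b_{j o_{t+1}}
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    return alpha


# Toy HMM; every number here is an illustrative assumption.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

alpha = forward([0, 1, 2], pi, A, B)
likelihood = sum(alpha[-1])   # P(O | mu) = sum_i alpha_i(T)
print(likelihood)
```

Each time step costs O(N^2) operations, so the whole pass is O(N^2 T) instead of the O(T N^T) of explicit enumeration.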

Backward Procedure

$$\beta_i(t) = P(o_t \ldots o_T \mid x_t = i)$$

Probability of the rest of the states given the first state

$$\beta_i(T+1) = 1$$

$$\beta_i(t) = \sum_{j=1 \ldots N} a_{ij}\, b_{i o_t}\, \beta_j(t+1)$$

The Solution to Estimation

Forward Procedure:

$$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)$$

Backward Procedure:

$$P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$

Combination:

$$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$$
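A small numerical check can confirm that the forward and backward routes agree. The sketch below follows the definition β_i(t) = P(o_t … o_T | x_t = i) with β_i(T+1) = 1. One caveat is an assumption drawn from the definitions rather than from the slides: when α_i(t) and β_i(t) both include the emission of o_t, their product counts b_{i o_t} twice, so the sketch divides one copy out when combining them at an interior time step. Parameter values are invented for illustration.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_{t+1}, x_{t+1} = i) for 0-based t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | x_{t+1} = i) for 0-based t; beta[T][i] = 1."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T + 1)]
    for t in range(T - 1, -1, -1):
        for i in range(n):
            # beta_i(t) = sum_j a_ij b_{i o_t} beta_j(t+1)
            beta[t][i] = B[i][obs[t]] * sum(A[i][j] * beta[t + 1][j] for j in range(n))
    return beta


# Toy HMM; every number here is an illustrative assumption.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
p_forward = sum(alpha[-1])
p_backward = sum(pi[i] * beta[0][i] for i in range(len(pi)))
# Interior time step: remove the doubly counted emission b_{i o_t}.
t = 1
p_combined = sum(alpha[t][i] * beta[t][i] / B[i][obs[t]] for i in range(len(pi)))
print(p_forward, p_backward, p_combined)
```

All three printed values should coincide up to floating-point rounding.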

Decoding: Best State Sequence

• Find the state sequence that best explains the observations
• Viterbi algorithm
• $\hat{X} = \arg\max_X P(X \mid O)$

Viterbi Algorithm

$$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$$

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t

Recursive Computation

$$\delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

$$\psi_j(t+1) = \arg\max_i \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

Compute the most likely state sequence by working backwards

$$\hat{X}_T = \arg\max_i \delta_i(T)$$

$$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$

$$P(\hat{X}) = \max_i \delta_i(T)$$
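The δ/ψ recursion and the backward read-out can be sketched as follows. This is an illustrative version under assumptions: the toy parameters are invented, and the initialization δ_j(1) = π_j b_{j o_1} is assumed from the definition of δ rather than taken from the slides.

```python
def viterbi(obs, pi, A, B):
    """Return (best state path, its joint probability with the observations)."""
    n = len(pi)
    delta = [[pi[j] * B[j][obs[0]] for j in range(n)]]   # assumed base case
    psi = [[0] * n]                                      # back-pointers; row 0 is unused
    for t in range(1, len(obs)):
        d_row, p_row = [], []
        for j in range(n):
            # psi_j(t+1) = argmax_i delta_i(t) a_ij  (b_{j o_{t+1}} does not depend on i)
            best_i = max(range(n), key=lambda i: delta[-1][i] * A[i][j])
            d_row.append(delta[-1][best_i] * A[best_i][j] * B[j][obs[t]])
            p_row.append(best_i)
        delta.append(d_row)
        psi.append(p_row)
    # Work backwards from the best final state.
    last = max(range(n), key=lambda i: delta[-1][i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Toy HMM; every number here is an illustrative assumption.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

path, prob = viterbi([0, 1, 2], pi, A, B)
print(path, prob)
```

The only change from the forward procedure is that the sum over predecessor states becomes a max, with ψ remembering which predecessor won.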

Parameter Estimation

• Given an observation sequence, find the model that is most likely to produce that sequence.
• No analytic method => an EM algorithm (Baum-Welch)
• Given a model and observation sequence, update the model parameters to better fit the observations.

Parameter Estimation

Probability of traversing an arc:

$$p_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_{j o_{t+1}}\, \beta_j(t+1)}{\sum_{m=1 \ldots N} \alpha_m(t)\, \beta_m(t)}$$

Probability of being in state i:

$$\gamma_i(t) = \sum_{j=1 \ldots N} p_t(i, j)$$

Now we can compute the new estimates of the model parameters.

$$\hat{\pi}_i = \gamma_i(1)$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$$

$$\hat{b}_{ik} = \frac{\sum_{\{t : o_t = k\}} \gamma_i(t)}{\sum_{t=1}^{T} \gamma_i(t)}$$
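One re-estimation step can be sketched as below. Several choices are assumptions made for the example and are not taken from the slides: the toy parameters and observation sequence; the common state-emission convention for the backward variable, in which β_i(t) covers o_{t+1} … o_T and β_i(T) = 1, so that α_i(t) β_i(t) / P(O) is the state posterior; and summing the transition counts over t = 1 … T-1, since no arc leaves the final position.

```python
def forward(obs, pi, A, B):
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, len(obs)):
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """Assumed convention: beta[t][i] = P(later observations | state i at step t)."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    return beta


def baum_welch_step(obs, pi, A, B):
    """One EM update; returns (new_pi, new_A, new_B, likelihood under the old model)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    like = sum(alpha[-1])
    # p[t][i][j]: probability of traversing arc i -> j between steps t and t+1
    p = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
           for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # gamma[t][i]: probability of being in state i at step t
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)] for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[sum(p[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1)) for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T)) for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, like


# Toy model and data; every number here is an illustrative assumption.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2, 2, 0, 1]

pi1, A1, B1, like0 = baum_welch_step(obs, pi, A, B)
_, _, _, like1 = baum_welch_step(obs, pi1, A1, B1)
print(like0, like1)   # the likelihood should not decrease after the update
```

Repeating the step until the likelihood stops improving gives the usual Baum-Welch training loop; it converges to a local optimum that depends on the starting parameters.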

HMM Applications

• Generating parameters for n-gram models
• Tagging speech
• Speech recognition

The Most Important Thing

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not solvable.


Limitations of HMM

“US official questions regulatory scrutiny of Apple”

• Problem 1: HMMs only use word identity. They cannot use richer representations.
– Apple is capitalized.
• MEMM Solution: Use more descriptive features
– (b0: Is-capitalized, b1: Is-in-plural, b2: Has-wordnet-antonym, b3: Is-“the”, etc.)
– Real-valued features can also be handled.
• Here features are pairs <b, s>: b is a feature of the observation and s is the destination state, e.g. <Is-capitalized, Company>
• Feature function:

$$f_{\langle b, s \rangle}(o_t, s_t) = \begin{cases} 1 & \text{if } b(o_t) \text{ is true and } s = s_t \\ 0 & \text{otherwise} \end{cases}$$

e.g. $f_{\langle \text{Is-capitalized}, \text{Company} \rangle}(\text{“Apple”}, \text{Company}) = 1$.
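The feature function above is just a binary predicate paired with a destination state. A minimal sketch follows; the helper names `make_feature` and `is_capitalized` are hypothetical and only serve the illustration.

```python
def make_feature(b, s):
    """Build f_<b,s>(o_t, s_t): 1 if predicate b holds for o_t and s_t equals s, else 0."""
    def f(o_t, s_t):
        return 1 if b(o_t) and s_t == s else 0
    return f


def is_capitalized(word):
    """Observation predicate b: does the word start with an uppercase letter?"""
    return word[:1].isupper()


f_cap_company = make_feature(is_capitalized, "Company")

print(f_cap_company("Apple", "Company"))   # 1
print(f_cap_company("apple", "Company"))   # 0
print(f_cap_company("Apple", "Other"))     # 0
```

Any number of such predicates can be defined over the same observation, which is exactly what an HMM's single emission distribution cannot express.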

HMMs vs. MEMMs (I)

• HMMs: $P(s \mid s')$, $P(o \mid s)$
• MEMMs: $P(s \mid s', o)$, represented as $|S|$ distributions $P_{s'}(s \mid o)$

HMMs vs. MEMMs (II)

HMMs

• αt(s): the probability of producing o1, . . . , ot and being in s at time t.

$$\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s')\, P(s \mid s')\, P(o_{t+1} \mid s)$$

• δt(s): the probability of the best path for producing o1, . . . , ot and being in s at time t.

$$\delta_{t+1}(s) = \max_{s' \in S} \delta_t(s')\, P(s \mid s')\, P(o_{t+1} \mid s)$$

MEMMs

• αt(s): the probability of being in s at time t given o1, . . . , ot.

$$\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s')\, P_{s'}(s \mid o_{t+1})$$

• δt(s): the probability of the best path that reaches s at time t given o1, . . . , ot.

$$\delta_{t+1}(s) = \max_{s' \in S} \delta_t(s')\, P_{s'}(s \mid o_{t+1})$$

Maximum Entropy

• Problem 2: HMMs are trained to maximize the likelihood of the training set. Generative, joint distribution.
• But they solve conditional problems (observations are given).
• MEMM Solution: Maximum Entropy (duh).
• Idea: Use the least biased hypothesis, subject to what is known.
• Constraints: The expectation Ei of feature i in the learned distribution should be the same as its mean Fi on the training set. For every state s' and feature i:

$$F_i = \frac{1}{n_{s'}} \sum_{k=1}^{n_{s'}} f_i(o_k, s_k)$$

$$E_i = \frac{1}{n_{s'}} \sum_{k=1}^{n_{s'}} \sum_{s \in S} P_{s'}(s \mid o_k)\, f_i(o_k, s)$$

More on MEMMs

• It turns out that the maximum entropy distribution is unique and has an exponential form:

$$P_{s'}(s \mid o) = \frac{1}{Z(o, s')} \exp\left( \sum_{i \in \text{features}} \lambda_i f_i(o, s) \right)$$

• We can estimate λi with Generalized Iterative Scaling.
– Adding a feature x: $f_x(o, s) = C - \sum_i f_i(o, s)$ does not affect the solution.
– Compute Fi.
– Set $\lambda_i^{(0)} = 0$.
– Compute the current expectation $E_i^{(j)}$ of feature i from the model.
– Update: $\lambda_i^{(j+1)} = \lambda_i^{(j)} + \frac{1}{C} \log\left( \frac{F_i}{E_i^{(j)}} \right)$
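The GIS update can be tried on a deliberately tiny problem. Everything in this sketch is an assumption made for illustration: the training pairs, the state names, and the choice of one indicator feature per (observation, state) pair. That choice means exactly C = 1 feature fires for every (o, s), so no correction feature is needed, and the fitted distribution should match the empirical conditional frequencies. Only the distribution for a single previous state s' is modelled.

```python
import math

STATES = ["N", "V"]
# Hypothetical (observation, next state) pairs seen after one previous state s'.
DATA = [("a", "N"), ("a", "N"), ("a", "V"),
        ("b", "V"), ("b", "V"), ("b", "N"), ("b", "V")]
# One indicator feature per (observation, state) pair, so the features sum to C = 1.
FEATURES = [(o, s) for o in ("a", "b") for s in STATES]
C = 1


def f(i, o, s):
    """Binary feature i evaluated on (o, s)."""
    return 1 if FEATURES[i] == (o, s) else 0


def prob(lam, o):
    """P_{s'}(s | o) = exp(sum_i lam_i f_i(o, s)) / Z(o, s'), one entry per state."""
    scores = [math.exp(sum(lam[i] * f(i, o, s) for i in range(len(FEATURES))))
              for s in STATES]
    z = sum(scores)
    return [score / z for score in scores]


def gis(data, iterations=5):
    n, k = len(data), len(FEATURES)
    F = [sum(f(i, o, s) for o, s in data) / n for i in range(k)]   # empirical means
    lam = [0.0] * k                                                # lambda^(0) = 0
    for _ in range(iterations):
        E = [0.0] * k                                              # model expectations
        for o, _label in data:
            for s, p in zip(STATES, prob(lam, o)):
                for i in range(k):
                    E[i] += p * f(i, o, s) / n
        lam = [l + math.log(Fi / Ei) / C for l, Fi, Ei in zip(lam, F, E)]
    return lam


lam = gis(DATA)
print(prob(lam, "a"))   # close to the empirical [2/3, 1/3]
print(prob(lam, "b"))   # close to the empirical [1/4, 3/4]
```

With overlapping features the update needs many more iterations, and the correction feature f_x is required whenever the active-feature count varies across (o, s) pairs.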

Extensions

• We can train even when the labels are not known using EM.
– E step: determine most probable state sequence and compute Fi.
– M step: GIS.
• We can reduce the number of parameters to estimate by moving the previous state into the features: “Subject-is-female”, “Previous-was-question”, “Is-verb-and-no-noun-yet”.
• We can even add features regarding actions in a reinforcement learning setting: “Slow-vehicle-encountered-and-steer-left”.
• We can mitigate data sparseness problems by simplifying the model:

$$P(s \mid s', o) = P(s \mid s')\, \frac{1}{Z(o, s')} \exp\left( \sum_i \lambda_i f_i(o, s) \right)$$


CRFs as Sequence Labeling Tool

• Conditional random fields (CRFs) are a statistical sequence modeling framework first introduced to the field of natural language processing (NLP) to overcome the label-bias problem.
• John Lafferty, A. McCallum and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, 282-289. June 28-July 01, 2001

Sequence Segmenting and Labeling

• Goal: mark up sequences with content tags
• Application in computational biology
– DNA and protein sequence alignment
– Sequence homolog searching in databases
– Protein secondary structure prediction
– RNA secondary structure analysis
• Application in computational linguistics & computer science
– Text and speech processing, including topic segmentation, part-of-speech (POS) tagging
– Information extraction
– Syntactic disambiguation

Example: Protein secondary structure prediction

Conf: 977621015677468999723631357600330223342057899861488356412238
Pred: CCCCCCCCCCCCCEEEEEEECCCCCCCCCCCCCHHHHHHHHHHHHHHHCCCCEEEEHHCC
AA:   EKKSINECDLKGKKVLIRVDFNVPVKNGKITNDYRIRSALPTLKKVLTEGGSCVLMSHLG
      (positions 1-60)

Conf: 855764222454123478985100010478999999874033445740023666631258
Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCHHHHHHHHHHHHHCCCCCCCCCCCCHHHHHHCCC
AA:   RPKGIPMAQAGKIRSTGGVPGFQQKATLKPVAKRLSELLLRPVTFAPDCLNAADVVSKMS
      (positions 61-120)

Conf: 874688611002343044310017899999875053355212244334552001322452
Pred: CCCEEEECCCHHHHHHCCCCCHHHHHHHHHHHHHCCEEEECCCCCCCCCCCCCCCCHHHH
AA:   PGDVVLLENVRFYKEEGSKKAKDREAMAKILASYGDVYISDAFGTAHRDSATMTGIPKIL
      (positions 121-180)

HMMs as Generative Models

• Hidden Markov models (HMMs)
• Assign a joint probability to paired observation and label sequences
– The parameters are typically trained to maximize the joint likelihood of the training examples

HMMs as Generative Models (cont’d)

• Difficulties and disadvantages
– Need to enumerate all possible observation sequences
– Not practical to represent multiple interacting features or long-range dependencies of the observations
– Very strict independence assumptions on the observations

Conditional Models

• Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
– Specify the probability of possible label sequences given an observation sequence
• Allow arbitrary, non-independent features on the observation sequence X
• The probability of a transition between labels may depend on past and future observations
– Relax strong independence assumptions in generative models

Discriminative Models: Maximum Entropy Markov Models (MEMMs)

• Exponential model
• Given training set X with label sequence Y:
– Train a model θ that maximizes P(Y|X, θ)
– For a new data sequence x, the predicted label y maximizes P(y|x, θ)
– Notice the per-state normalization

MEMMs (cont’d)

• MEMMs have all the advantages of Conditional Models
• Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)
• Subject to Label Bias Problem
– Bias toward states with fewer outgoing transitions

Label Bias Problem

• Consider this MEMM: [Figure: MEMM state-transition diagram]
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
• P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1. Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x.
• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation
• http://wing.comp.nus.edu.sg/pipermail/graphreading/2005-September/000032.html
• http://hi.baidu.com/%BB%F0%D1%BF_ayouh/blog/item/338f13510d38e8441038c250.html
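The effect of per-state normalization can be reproduced numerically. In this sketch the state graph, the scores, and all function names are hypothetical: state 1 has a single successor (state 2), and the raw score of the 1 → 2 arc strongly prefers the observation "i" over "o". Because the softmax is taken over the successors of one state only, that preference disappears.

```python
import math

# Hypothetical MEMM topology: state 1 can only move to state 2.
SUCCESSORS = {0: [1, 4], 1: [2], 4: [5]}


def score(prev, nxt, obs):
    """Hypothetical unnormalized score; the 1 -> 2 arc prefers 'i' over 'o'."""
    if (prev, nxt) == (1, 2):
        return 5.0 if obs == "i" else -5.0
    return 0.0


def local_prob(prev, nxt, obs):
    """Per-state normalization: softmax over the successors of `prev` only."""
    z = sum(math.exp(score(prev, s, obs)) for s in SUCCESSORS[prev])
    return math.exp(score(prev, nxt, obs)) / z


def path_prob(path, observations):
    """Product of locally normalized transition probabilities along the path."""
    p = 1.0
    for (prev, nxt), obs in zip(zip(path, path[1:]), observations):
        p *= local_prob(prev, nxt, obs)
    return p


p_ri = path_prob([0, 1, 2], "ri")
p_ro = path_prob([0, 1, 2], "ro")
print(p_ri, p_ro)   # identical: the second observation has no influence
```

A globally normalized model would keep the raw scores until the end, so the path through state 1 would score higher for "ri" than for "ro".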

Solve the Label Bias Problem

• Change the state-transition structure of the model

– Not always practical to change the set of states

• Start with a fully-connected model and let the training procedure figure out a good structure

– Precludes the use of prior knowledge, which is very valuable (e.g. in information extraction)


Random Field

Conditional Random Fields (CRFs)

• CRFs have all the advantages of MEMMs without the label bias problem
– An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
– A CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
• Undirected acyclic graph
• Allow some transitions to “vote” more strongly than others depending on the corresponding observations


Definition of CRFs

X is a random variable over data sequences to be labeled

Y is a random variable over corresponding label sequences


Example of CRFs


Graphical comparison among HMMs, MEMMs and CRFs

HMM MEMM CRF

Conditional Distribution

If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

$$p_\theta(y \mid x) \propto \exp\left( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \right)$$

• x is a data sequence
• y is a label sequence
• v is a vertex from vertex set V = set of label random variables
• e is an edge from edge set E over V
• fk and gk are given and fixed. gk is a Boolean vertex feature; fk is a Boolean edge feature
• k indexes the features
• $\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n)$; $\lambda_k$ and $\mu_k$ are parameters to be estimated
• y|e is the set of components of y defined by edge e
• y|v is the set of components of y defined by vertex v

Conditional Distribution (cont’d)

• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

$$p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \right)$$

Z(x) is a normalization over the data sequence x
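For a linear chain, Z(x) can be computed with the same forward-style recursion used for HMMs, only over unnormalized scores. The sketch below is an illustration under assumptions: the two-label set, the edge and vertex scores (standing in for the weighted feature sums), and the function names are all invented for the example.

```python
import math
from itertools import product

LABELS = ["B", "E"]


def edge_score(prev_y, y, x, t):
    """Stands in for sum_k lambda_k f_k(e, y|_e, x); the weights are hypothetical."""
    if (prev_y, y) == ("B", "E"):
        return 1.0
    if (prev_y, y) == ("E", "E"):
        return -1.0
    return 0.0


def vertex_score(y, x, t):
    """Stands in for sum_k mu_k g_k(v, y|_v, x); the weights are hypothetical."""
    return 0.5 if x[t] == "x" and y == "B" else 0.0


def total_score(y_seq, x):
    s = sum(vertex_score(y, x, t) for t, y in enumerate(y_seq))
    s += sum(edge_score(y_seq[t - 1], y_seq[t], x, t) for t in range(1, len(y_seq)))
    return s


def partition(x):
    """Z(x): sum of exp(score) over all label sequences, via a forward recursion."""
    alpha = {y: math.exp(vertex_score(y, x, 0)) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {y: sum(alpha[p] * math.exp(edge_score(p, y, x, t) + vertex_score(y, x, t))
                        for p in LABELS)
                 for y in LABELS}
    return sum(alpha.values())


def prob(y_seq, x):
    """p(y | x) = exp(score(y, x)) / Z(x), normalized once over whole sequences."""
    return math.exp(total_score(y_seq, x)) / partition(x)


x = "xyx"
print(sum(prob(y, x) for y in product(LABELS, repeat=len(x))))   # sums to 1 up to rounding
```

Normalizing once per sequence, rather than once per state, is what lets an edge with a low score lower the probability of every path through it.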

Decoding: To label an unseen sequence

We compute the most likely labeling Y* as follows by dynamic programming (for efficient computation)

$$Y^* = \arg\max_Y P(Y \mid X)$$

Complexity Estimation

• The time complexity of an iteration of parameter estimation of the L-BFGS algorithm is $O(L^2 N M F)$
• where L and N are, respectively, the numbers of labels and sequences (sentences),
• M is the average length of sequences, and
• F is the average number of activated features of each labelled clique.

CRF++: a CRFs Package

• CRF++ is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
• http://crfpp.sourceforge.net/
• Requirements
– C++ compiler (gcc 3.0 or higher)
• How to make
– % ./configure
– % make
– % su
– # make install


CRF++

• Feature template representation and input file format


CRF++

• training

• % crf_learn -f 3 -c 1.5 template_file train_file model_file

• test

• % crf_test -m model_file test_files


Summary

• Discriminative models are prone to the label bias problem

• CRFs provide the benefits of discriminative models

• CRFs solve the label bias problem well, and demonstrate good performance


What is Chinese Word Segmentation

• A special case of tokenization in natural language processing (NLP) for many languages that have no explicit word delimiters such as spaces.
• Original:
– 她来自苏格兰
– She comes from SU GE LAN (Meaningless!)
• Segmented:
– 她 / 来 / 自 / 苏格兰
– She comes from Scotland. (Meaningful!)

Learning from a Lexicon: maximal matching algorithm for word segmentation

• Input
– A lexicon is pre-defined.
– An unsegmented sequence
• The algorithm:
① Start from the first character, try to find the longest matched word in the lexicon.
② Set the next character after the above found word as the new start point.
③ If it reaches the end of the sequence, the algorithm ends.
④ Otherwise, go to ①.
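The four steps above can be sketched as a short greedy loop. Two things here are assumptions added for the example: the toy lexicons, and the fallback that emits a single character as its own word when no lexicon entry matches, a case the algorithm description leaves open.

```python
def max_match(text, lexicon):
    """Greedy forward maximal matching: always take the longest lexicon word."""
    max_len = max(map(len, lexicon), default=1)
    words, start = [], 0
    while start < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for end in range(min(len(text), start + max_len), start, -1):
            # Assumed fallback: a single unmatched character becomes a word.
            if text[start:end] in lexicon or end == start + 1:
                words.append(text[start:end])
                start = end          # the next character is the new start point
                break
    return words


lexicon = {"她", "来自", "苏格兰", "苏", "格"}
print(max_match("她来自苏格兰", lexicon))

# Greedy matching can commit to a long word too early:
print(max_match("研究生命", {"研究", "研究生", "生命", "命"}))
```

The second call illustrates a well-known weakness: the longest match 研究生 is taken first, so the intended reading 研究 / 生命 is never considered.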

Learning from a segmented corpus: Word segmentation as labeling

• 自然科学的研究不断深入
• natural science / of / research / uninterruptedly / deepen
• 自然科学 / 的 / 研究 / 不断 / 深入
• BMME S BE BE BE
• B: beginning, M: middle, E: end of a word
• S: single-character word
• Using CRFs as the learning model
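Converting between a segmented sentence and its character tags is mechanical, as the following sketch shows for the B/M/E/S scheme. The function names are hypothetical; the example sentence is the one above.

```python
def words_to_tags(words):
    """Map each word to per-character tags: S for a single character, else B M* E."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and tags; a word closes on E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:                      # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


words = ["自然科学", "的", "研究", "不断", "深入"]
tags = words_to_tags(words)
print("".join(tags))                              # BMMESBEBEBE
print(tags_to_words("".join(words), tags))
```

Training data for the tagger is produced with the first function; the second turns the tagger's predicted labels back into a segmentation.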

CWS as Character-based Tagging: From the Beginning to the Latest

• Nianwen Xue, 2003. Chinese Word Segmentation as Character Tagging, CLCLP, Vol. 8(1), 2003
• Xiaoqiang Luo, 2003. A Maximum Entropy Chinese Character-based Parser, EMNLP-2003
• Hwee Tou Ng and Jin Kiat Low, 2004. Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-based or Character-Based? EMNLP-2004
• Jin Kiat Low, Hwee Tou Ng, Wenyuan Guo, 2005. A Maximum Entropy Approach to Chinese Word Segmentation, The 4th SIGHAN Workshop on CLP, 2005
• Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, Christopher Manning, 2005. A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005, The 4th SIGHAN Workshop on CLP, 2005

Label Set

Tag set | Tags | Multi-character Word | Reference
2-tag | B, E | B, BE, BEE, … | Mostly for CRF
4-tag | B, M, E, S | S, BE, BME, BMME, … | Xue/Low/MaxEnt
6-tag | B, M, E, S, B2, B3 | S, BE, BB2E, BB2B3E, BB2B3ME, … | Zhao/CRF

More labels, better performance for CWS …
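The 6-tag scheme follows a simple rule that can be read off the table: the first three non-final characters receive B, B2 and B3, any further non-final characters receive M, and the last character receives E. The sketch below encodes that reading; the function name is hypothetical.

```python
def six_tag(word):
    """Tag one word with the 6-tag set {B, B2, B3, M, E, S}."""
    n = len(word)
    if n == 1:
        return ["S"]
    # Up to three distinct head tags for the non-final characters.
    head = ["B", "B2", "B3"][: min(n - 1, 3)]
    # Remaining non-final characters are M; the final character is E.
    return head + ["M"] * (n - 1 - len(head)) + ["E"]


for word in ["的", "研究", "研究生", "自然科学", "abcdef"]:
    print(word, six_tag(word))
```

Compared with the 4-tag set, the extra B2 and B3 labels let the model distinguish positions inside longer words instead of collapsing them all into M.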

Feature Template Set

$C_{-1}$, $C_0$, $C_1$, $C_{-1}C_0$, $C_0C_1$, $C_{-1}C_1$

where $C_{-1}$, $C_0$ and $C_1$ are the previous, current and next characters
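In a CRF++ template file, each of these six features becomes one unigram template line built from `%x[row,col]` macros, where the row is the offset relative to the current token and the column selects the input column. The sketch below generates such lines; the helper name `crfpp_template` and the assumption that the character sits in column 0 of the training file are illustrative choices.

```python
# Offsets for C-1, C0, C1, C-1C0, C0C1, C-1C1 relative to the current character.
OFFSET_GROUPS = [(-1,), (0,), (1,), (-1, 0), (0, 1), (-1, 1)]


def crfpp_template(offset_groups, column=0):
    """Render unigram template lines such as 'U03:%x[-1,0]/%x[0,0]'."""
    lines = []
    for idx, group in enumerate(offset_groups):
        macros = "/".join(f"%x[{offset},{column}]" for offset in group)
        lines.append(f"U{idx:02d}:{macros}")
    return "\n".join(lines)


print(crfpp_template(OFFSET_GROUPS))
```

The printed text can be saved as the `template_file` argument passed to `crf_learn`.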


Page 74: 1 Natural Language Processing (8) Zhao Hai 赵海 Department of Computer Science and Engineering Shanghai Jiao Tong University 2010-2011 zhaohai@cs.sjtu.edu.cn

74

Parts of Speech

• Generally speaking, the “grammatical type” of a word:
– Verb, Noun, Adjective, Adverb, Article, …

• We can also include inflection:
– Verbs: Tense, number, …
– Nouns: Number, proper/common, …
– Adjectives: comparative, superlative, …
– …

• Most commonly used POS sets for English have 50-80 different tags


75

BNC Parts of Speech

• Nouns:

NN0 Common noun, neutral for number (e.g. aircraft)

NN1 Singular common noun (e.g. pencil, goose, time)

NN2 Plural common noun (e.g. pencils, geese, times)

NP0 Proper noun (e.g. London, Michael, Mars, IBM)

• Pronouns:

PNI Indefinite pronoun (e.g. none, everything, one)

PNP Personal pronoun (e.g. I, you, them, ours)

PNQ Wh-pronoun (e.g. who, whoever, whom)

PNX Reflexive pronoun (e.g. myself, itself, ourselves)


76

• Verbs:

VVB finite base form of lexical verbs (e.g. forget, send, live, return)

VVD past tense form of lexical verbs (e.g. forgot, sent, lived)

VVG -ing form of lexical verbs (e.g. forgetting, sending, living)

VVI infinitive form of lexical verbs (e.g. forget, send, live, return)

VVN past participle form of lexical verbs (e.g. forgotten, sent, lived)

VVZ -s form of lexical verbs (e.g. forgets, sends, lives, returns)

VBB present tense of BE, except for is

…and so on: VBD VBG VBI VBN VBZ

VDB finite base form of DO: do

…and so on: VDD VDG VDI VDN VDZ

VHB finite base form of HAVE: have, 've

…and so on: VHD VHG VHI VHN VHZ

VM0 Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)


77

• Articles

AT0 Article (e.g. the, a, an, no)

DPS Possessive determiner (e.g. your, their, his)

DT0 General determiner (this, that)

DTQ Wh-determiner (e.g. which, what, whose, whichever)

EX0 Existential there, i.e. occurring in “there is…” or “there are…”

• Adjectives

AJ0 Adjective (general or positive) (e.g. good, old, beautiful)

AJC Comparative adjective (e.g. better, older)

AJS Superlative adjective (e.g. best, oldest)

• Adverbs

AV0 General adverb (e.g. often, well, longer (adv.), furthest)

AVP Adverb particle (e.g. up, off, out)

AVQ Wh-adverb (e.g. when, where, how, why, wherever)


78

• Miscellaneous:

CJC Coordinating conjunction (e.g. and, or, but)

CJS Subordinating conjunction (e.g. although, when)

CJT The subordinating conjunction that

CRD Cardinal number (e.g. one, 3, fifty-five, 3609)

ORD Ordinal numeral (e.g. first, sixth, 77th, last)

ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)

POS The possessive or genitive marker 's or '

TO0 Infinitive marker to

PUL Punctuation: left bracket - i.e. ( or [

PUN Punctuation: general separating mark - i.e. . , ! , : ; - or ?

PUQ Punctuation: quotation mark - i.e. ' or "

PUR Punctuation: right bracket - i.e. ) or ]

XX0 The negative particle not or n't

ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)


79

Task: Part-Of-Speech Tagging

• Goal: Assign the correct part-of-speech to each word (and punctuation) in a text.

• Example:

Two old men bet on the game .
CRD AJ0 NN2 VVD PP0 AT0 NN1 PUN

• Learn a local model of POS dependencies, usually from pre-tagged data

• No parsing


80

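Hidden Markov Models

• Assume: POS (state) sequence generated as time-invariant random process, and each POS randomly generates a word (output symbol)

[Figure: HMM state diagram over the tags AT0, AJ0, NN1 and NN2, with transition probabilities on the arcs between tags and output probabilities for the emitted words “the”, “a”, “cat”, “bet”, “cats” and “men”]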


81

Definition of HMM for Tagging

• Set of states – all possible tags
• Output alphabet – all words in the language
• State/tag transition probabilities
• Initial state probabilities: the probability of beginning a sentence with a tag t (t0 → t)
• Output probabilities – producing word w at state t
• Output sequence – observed word sequence
• State sequence – underlying tag sequence


82

HMMs For Tagging

• First-order (bigram) Markov assumptions:

– Limited Horizon: Tag depends only on previous tag

$P(t_{i+1} = t^k \mid t_1 = t^{j_1}, \ldots, t_i = t^{j_i}) = P(t_{i+1} = t^k \mid t_i = t^j)$

– Time invariance: No change over time

$P(t_{i+1} = t^k \mid t_i = t^j) = P(t_2 = t^k \mid t_1 = t^j) = P(t^j \to t^k)$

• Output probabilities:

– Probability of getting word $w^k$ for tag $t^j$: $P(w^k \mid t^j)$
– Assumption: Not dependent on other tags or words!


83

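Combining Probabilities

• Probability of a tag sequence:

$P(t_1 t_2 \ldots t_n) = P(t_1)\,P(t_1 \to t_2)\,P(t_2 \to t_3) \cdots P(t_{n-1} \to t_n)$

Assume $t_0$ is the starting tag:

$= P(t_0 \to t_1)\,P(t_1 \to t_2)\,P(t_2 \to t_3) \cdots P(t_{n-1} \to t_n)$

• Prob. of word sequence and tag sequence:

$P(W, T) = \prod_i P(t_{i-1} \to t_i)\,P(w_i \mid t_i)$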
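The joint probability of a word sequence and a tag sequence can be computed directly from this product. The sketch below works in log space to avoid underflow. The probability values are invented for illustration only, and the function name `joint_log_prob` is hypothetical.

```python
import math


def joint_log_prob(words, tags, trans, emit, start="t0"):
    """log P(W, T) = sum_i [ log P(t_{i-1} -> t_i) + log P(w_i | t_i) ]."""
    logp = 0.0
    prev = start
    for w, t in zip(words, tags):
        logp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return logp


# Toy parameters (assumed values, not estimated from any corpus).
trans = {("t0", "AT0"): 0.5, ("AT0", "NN1"): 0.4}
emit = {("AT0", "the"): 0.6, ("NN1", "cat"): 0.1}

p = math.exp(joint_log_prob(["the", "cat"], ["AT0", "NN1"], trans, emit))
```

With these toy values, `p` is the product 0.5 · 0.6 · 0.4 · 0.1 = 0.012.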


84

Training from Labeled Corpus

• Labeled training = each word has a POS tag

• Thus:

$P_{MLE}(t^j) = C(t^j) / N$

$P_{MLE}(t^j \to t^k) = C(t^j, t^k) / C(t^j)$

$P_{MLE}(w^k \mid t^j) = C(t^j : w^k) / C(t^j)$

• Smoothing can be applied.
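The relative-frequency estimates above amount to counting over a tagged corpus. A minimal sketch, assuming sentences are given as lists of (word, tag) pairs and that a start tag `t0` is counted once per sentence; the name `train_hmm` is hypothetical and no smoothing is applied here.

```python
from collections import Counter


def train_hmm(tagged_sentences, start="t0"):
    """Maximum-likelihood transition and emission tables from tagged data."""
    tag_count, trans_count, emit_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = start
        tag_count[start] += 1  # one start tag per sentence
        for word, tag in sent:
            tag_count[tag] += 1
            trans_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            prev = tag
    # P(tj -> tk) = C(tj, tk) / C(tj) and P(wk | tj) = C(tj:wk) / C(tj)
    trans = {(a, b): c / tag_count[a] for (a, b), c in trans_count.items()}
    emit = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return trans, emit


corpus = [
    [("the", "AT0"), ("cat", "NN1")],
    [("the", "AT0"), ("men", "NN2")],
]
trans, emit = train_hmm(corpus)
```

Any pair that never occurs in the corpus is simply absent from the tables, which is the zero-probability problem that smoothing addresses.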


85

Viterbi Tagging

• Most probable tag sequence given text:

$T^* = \arg\max_T P_m(T \mid W)$

$= \arg\max_T P_m(W \mid T)\,P_m(T) / P_m(W)$ (Bayes’ Theorem)

$= \arg\max_T P_m(W \mid T)\,P_m(T)$ (W is constant for all T)

$= \arg\max_T \prod_i [\,m(t_{i-1} \to t_i)\,m(w_i \mid t_i)\,]$

$= \arg\max_T \sum_i \log[\,m(t_{i-1} \to t_i)\,m(w_i \mid t_i)\,]$

• Exponential number of possible tag sequences – use dynamic programming for efficient computation


86

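Transition costs, -log m(row tag → column tag):

-log m | t1 | t2 | t3
t0 | 2.3 | 1.7 | 1
t1 | 1.7 | 1 | 2.3
t2 | 0.3 | 3.3 | 3.3
t3 | 1.3 | 1.3 | 2.3

Emission costs, -log m(column word | row tag):

-log m | w1 | w2 | w3
t1 | 0.7 | 2.3 | 2.3
t2 | 1.7 | 0.7 | 3.3
t3 | 1.7 | 1.7 | 1.3

[Figure: Viterbi trellis from t0 through the states t1, t2, t3 at each of the words w1, w2, w3. Arcs carry the log transition scores, e.g. -2.3, -1.7, -1 out of t0 and -1.7, -0.3, -1.3 into t1. The node scores are:]

D(i, t) | w1 | w2 | w3
t1 | -3 | -6 | -7.3
t2 | -3.4 | -4.7 | -10.3
t3 | -2.7 | -6.7 | -9.3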


87

Viterbi Algorithm

1. D(0, START) = 0

2. for each tag t != START do: D(0, t) = -∞

3. for i ← 1 to N do:

for each tag tj do:

D(i, tj) ← maxk D(i-1, tk) + lm(tk → tj) + lm(wi | tj)
Record best(i, j) = k which yielded the max

4. log P(W,T) = maxj D(N, tj)

5. Reconstruct path from maxj backwards

where: lm(.) = log m(.) and D(i, tj) is the max joint log-probability of state and word sequences up to position i, ending at tj.

Complexity: O(Nt² · N), where Nt is the number of tags and N is the number of words
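A runnable version of these steps, applied to the three-word example with the -log m tables given earlier (rows of the transition table are read as the previous tag, columns as the next tag). This is a sketch: the function name `viterbi` and the dictionary layout are assumptions, and the start state is folded into the first step instead of initializing the other tags to -∞.

```python
def viterbi(words, tags, log_trans, log_emit, start="t0"):
    """Return (best log score, best tag path, full score table D)."""
    D = [{} for _ in words]      # D[i][t]: best log score ending in tag t at word i
    back = [{} for _ in words]   # back[i][t]: previous tag that achieved D[i][t]
    for i, w in enumerate(words):
        for t in tags:
            if i == 0:
                best_prev, best = start, log_trans[(start, t)]
            else:
                best_prev = max(tags, key=lambda k: D[i - 1][k] + log_trans[(k, t)])
                best = D[i - 1][best_prev] + log_trans[(best_prev, t)]
            D[i][t] = best + log_emit[(t, w)]
            back[i][t] = best_prev
    # Pick the best final tag, then follow the back-pointers to the start.
    last = max(tags, key=lambda t: D[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return D[-1][last], path, D


tags = ["t1", "t2", "t3"]
words = ["w1", "w2", "w3"]
neg_trans = {"t0": [2.3, 1.7, 1.0], "t1": [1.7, 1.0, 2.3],
             "t2": [0.3, 3.3, 3.3], "t3": [1.3, 1.3, 2.3]}
neg_emit = {"t1": [0.7, 2.3, 2.3], "t2": [1.7, 0.7, 3.3], "t3": [1.7, 1.7, 1.3]}
# Convert the tabulated costs (-log m) into log scores.
log_trans = {(a, b): -c for a, row in neg_trans.items() for b, c in zip(tags, row)}
log_emit = {(t, w): -c for t, row in neg_emit.items() for w, c in zip(words, row)}

score, path, D = viterbi(words, tags, log_trans, log_emit)
```

The best final score is about -7.3, reached at t1 after t2. The first tag of the path is a tie between t1 and t3 (both give -4 into t2), so which one is returned depends on tie-breaking.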


88

Overview

• Models
– HMM: Hidden Markov Model
– maximum entropy Markov model
– CRFs: Conditional Random Fields

• Tasks
– Chinese word segmentation
– part-of-speech tagging
– named entity recognition


89

Hidden Markov Model (HMM) based NERC System

• Named Entity Recognition and Classification System

• HMM-

– Statistical construct used to solve classification problems, having an inherent state sequence representation

– Transition probability: Probability of traveling between two given states

– A set of output symbols (also known as observation) emitted by the process

– Emitted symbol depends on the probability distribution of the particular state

– Output of the HMM: Sequence of output symbols

– Exact state sequence corresponding to a particular observation sequence is unknown (i.e., hidden)

– Simple language model (n-gram) for NE tagging

• Uses very little knowledge about the language, apart from simple context information


90

HMM based NERC System (Contd..)

HMM based NERC Architecture


91

HMM based NERC System (Contd..)

• Components of HMM based NERC system

– Language model

• Represented by the model parameters of HMM

• Model parameters estimated based on the labeled data during learning

– Possible class module

• Consists of a list of lexical units associated with the list of 17 tags

– NE disambiguation algorithm

• Input: List of lexical units with the associated list of possible tags

• Output: Output tag for each lexical unit using the encoded information from the language model

• Decides the best possible tag assignment for every word in a sentence according to the language model

• Viterbi algorithm (Viterbi, 1967)

– Unknown word handling

• Viterbi algorithm (Viterbi, 1967) assigns some tags to unknown words

• variable length NE suffixes

• Lexicon (Ekbal and Bandyopadhyay, 2008d)


92

HMM based NERC System (Contd..)

• Problem of NE tagging

Let W be a sequence of words

W = w1 , w2 , … , wn

Let T be the corresponding NE tag sequence

T = t1 , t2 , … , tn

Task : Find T which maximizes P ( T | W )

T’ = argmaxT P ( T | W )


93

HMM based NERC System (Contd..)

By Bayes' rule,

$P(T \mid W) = P(W \mid T) * P(T) / P(W)$

$T' = \arg\max_T P(W \mid T) * P(T)$

Models

– First order model (Bigram): The probability of a tag depends only on the previous tag

– Second order model (Trigram): The probability of a tag depends on the previous two tags

Transition Probability

Chain rule (exact): $P(T) = P(t_1) * P(t_2 \mid t_1) * P(t_3 \mid t_1 t_2) * \ldots * P(t_n \mid t_1 \ldots t_{n-1})$

Bigram: $P(T) \approx P(t_1) * P(t_2 \mid t_1) * P(t_3 \mid t_2) * \ldots * P(t_n \mid t_{n-1})$

Trigram: $P(T) \approx P(t_1) * P(t_2 \mid t_1) * P(t_3 \mid t_1 t_2) * \ldots * P(t_n \mid t_{n-2} t_{n-1})$

With a start marker: $P(T) \approx P(t_1 \mid \$) * P(t_2 \mid \$\, t_1) * P(t_3 \mid t_1 t_2) * \ldots * P(t_n \mid t_{n-2} t_{n-1})$

where $ is a dummy tag used to represent the beginning of a sentence


94

HMM based NERC System (Contd..)

• Estimation of unigram, bigram and trigram probabilities from the training corpus

Unigram: $P(t_3) = \dfrac{\mathrm{freq}(t_3)}{N}$

Bigram: $P(t_3 \mid t_2) = \dfrac{\mathrm{freq}(t_2, t_3)}{\mathrm{freq}(t_2)}$

Trigram: $P(t_3 \mid t_1, t_2) = \dfrac{\mathrm{freq}(t_1, t_2, t_3)}{\mathrm{freq}(t_1, t_2)}$

• Emission Probability

$P(W \mid T) \approx P(w_1 \mid t_1) * P(w_2 \mid t_2) * \ldots * P(w_n \mid t_n)$

• Estimation:

$P(w_i \mid t_i) = \dfrac{\mathrm{freq}(w_i, t_i)}{\mathrm{freq}(t_i)}$


95

HMM based NERC System (Contd..)

Context Dependency (Modification)

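– To make the Markov model more powerful, introduce a 1st order context dependent feature: each word is conditioned on the previous tag as well as the current tag

$P(W \mid T) \approx P(w_1 \mid \$, t_1) * P(w_2 \mid t_1, t_2) * \ldots * P(w_n \mid t_{n-1}, t_n)$

$P(w_i \mid t_{i-1}, t_i) = \dfrac{\mathrm{freq}(t_{i-1}, t_i, w_i)}{\mathrm{freq}(t_{i-1}, t_i)}$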
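The context-dependent emission estimate is again a ratio of counts, now keyed by the previous tag as well. A minimal sketch, assuming (word, tag) sentences and `$` as the dummy start tag; the function name `context_emission` and the tiny corpus with tags `O` and `B-PER` are invented for illustration.

```python
from collections import Counter


def context_emission(tagged_sentences, start="$"):
    """P(w_i | t_{i-1}, t_i) = freq(t_{i-1}, t_i, w_i) / freq(t_{i-1}, t_i)."""
    pair_count, triple_count = Counter(), Counter()
    for sent in tagged_sentences:
        prev = start
        for word, tag in sent:
            pair_count[(prev, tag)] += 1
            triple_count[(prev, tag, word)] += 1
            prev = tag
    # Divide each (previous tag, tag, word) count by its (previous tag, tag) count.
    return {k: c / pair_count[k[:2]] for k, c in triple_count.items()}


corpus = [
    [("Mr", "O"), ("Smith", "B-PER")],
    [("Dr", "O"), ("Smith", "B-PER")],
    [("Dr", "O"), ("Jones", "B-PER")],
]
emit_ctx = context_emission(corpus)
```

In this toy corpus, `emit_ctx[("O", "B-PER", "Smith")]` is 2/3, since two of the three `O → B-PER` transitions emit "Smith".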


96

2nd order Hidden Markov Model


97

2nd order Hidden Markov Model (Proposed)


98

HMM based NERC System (Contd..)

• Why Smoothing?

– All events may not be encountered in the limited training corpus

– Insufficient instances for each bigram or trigram to reliably estimate the probability

– Setting a probability to zero has an undesired effect

• Procedure

– Transition probability:

$P'(t_n \mid t_{n-2}, t_{n-1}) = \lambda_1 P(t_n) + \lambda_2 P(t_n \mid t_{n-1}) + \lambda_3 P(t_n \mid t_{n-2}, t_{n-1})$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$

– Emission probability:

$P'(w_i \mid t_{i-1}, t_i) = \theta_1 P(w_i \mid t_i) + \theta_2 P(w_i \mid t_{i-1}, t_i)$, with $\theta_1 + \theta_2 = 1$

– Calculation of λs and θs (Brants, 2000)
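The interpolation itself is a weighted sum of the lower- and higher-order estimates. The sketch below only applies given weights; it does not implement the weight estimation of Brants (2000). The function names and the weight values are assumptions chosen for illustration.

```python
def interpolate_transition(p_uni, p_bi, p_tri, lambdas):
    """Smoothed P'(t_n | t_{n-2}, t_{n-1}) as a mix of uni-, bi- and trigram estimates."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


def interpolate_emission(p_tag, p_ctx, thetas):
    """Smoothed P'(w_i | t_{i-1}, t_i) as a mix of plain and context-dependent estimates."""
    th1, th2 = thetas
    if abs(th1 + th2 - 1.0) > 1e-9:
        raise ValueError("thetas must sum to 1")
    return th1 * p_tag + th2 * p_ctx


# An unseen trigram (estimate 0.0) still receives probability mass
# from the unigram and bigram terms. The weights here are made up.
smoothed = interpolate_transition(0.2, 0.5, 0.0, (0.1, 0.3, 0.6))
```

With these example weights the unseen trigram gets 0.1 · 0.2 + 0.3 · 0.5 = 0.17 instead of zero.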


99

HMM based NERC System (Contd..)

Handling of unknown words

• Viterbi algorithm (Viterbi, 1967) attempts to assign a tag to the unknown words

• For an unknown word, $P(w_i \mid t_i)$ is replaced by $P(f_i \mid t_i)$, calculated based on the features $f_i$ of the unknown word

• Suffixes: Probability distribution of a particular suffix with respect to specific NE tags is generated from all words in the training set that share the same suffix

– Variable length person name suffixes (e.g., -bAbu [-babu], -dA [-da], -di [-di], etc.)

– Variable length location name suffixes (e.g., -lYAnd [-land], -pur [-pur], -liYA [-lia], etc.)

• Lexicon

– Lexicon contains the root words and their basic POS information

– An unknown word that is found to appear in the lexicon is most likely not an NE
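One way to picture the suffix statistics is the sketch below, which collects, for every word-final string up to a maximum length, the distribution of NE tags over training words sharing that suffix. Note the assumptions: this yields P(tag | suffix), not P(f | t) directly; the maximum suffix length of 3, the back-off from longest to shortest suffix, the function names and the toy training words are all illustrative choices, not the cited system's settings.

```python
from collections import Counter, defaultdict


def suffix_tag_distributions(tagged_words, max_len=3):
    """Tag distribution for each word-final string of length 1..max_len."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return {
        suffix: {t: c / sum(tc.values()) for t, c in tc.items()}
        for suffix, tc in counts.items()
    }


def guess_unknown(word, suffix_dist, max_len=3):
    """Back off from the longest known suffix of the word to shorter ones."""
    for n in range(min(max_len, len(word)), 0, -1):
        dist = suffix_dist.get(word[-n:])
        if dist:
            return dist
    return {}


train = [("Rampur", "LOC"), ("Kanpur", "LOC"), ("Ram", "PER")]
dist = suffix_tag_distributions(train)
guess = guess_unknown("Nagpur", dist)
```

Here the unseen word "Nagpur" shares the suffix "pur" with two location names, so `guess` puts all its mass on `LOC`.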


100

Supervised NERC Systems (ME, CRF and SVM)

• Limitations of HMM
– Use of only local features may not work well
– Simple HMM models do not work well when large data are not used to estimate the model parameters
– Incorporating a diverse set of features in an HMM based NE tagger is difficult and complicates the smoothing

• Solution:
– Maximum Entropy (ME) model, Conditional Random Field (CRF) or Support Vector Machine (SVM)
– ME, CRF or SVM can make use of rich feature information

• ME model
– Very flexible method of statistical modeling
– A combination of several features can be easily incorporated
– Careful feature selection plays a crucial role
– Does not provide a method for automatic selection of useful features
– Features selected using heuristics
– Adding arbitrary features may result in overfitting


101

Supervised NERC Systems (ME, CRF and SVM)

• CRF
– CRF does not require careful feature selection in order to avoid overfitting
– Freedom to include arbitrary features
– Ability of feature induction to automatically construct the most useful feature combinations
– Conjunction of features
– Infeasible to incorporate all possible conjunction features due to overflow of memory
– Good to handle different types of data

• SVM
– Predict the classes depending upon the labeled word examples only
– Predict the NEs based on feature information of words collected in a predefined window size only
– Cannot handle the NEs outside tokens
– Achieves high generalization even with training data of a very high dimension
– Can handle non-linear feature spaces with


102

Named Entity Features

• Language Independent Features
– Can be applied for NERC in any language

• Language Dependent Features
– Generated from the language specific resources like gazetteers and POS taggers
– Indian languages are resource-constrained
– Creation of gazetteers in resource-constrained environment requires a priori knowledge of the language
– POS information depends on some language specific phenomena such as person, number, tense, gender etc.
– POS tagger (Ekbal and Bandyopadhyay, 2008d) makes use of several language specific resources such as lexicon, inflection list and a NERC system to improve its performance

• Language dependent features improve system performance


103

Language Independent Features

– Context Word: Preceding and succeeding words

– Word Suffix

• Not necessarily linguistic suffixes

• Fixed length character strings stripped from the endings of words

• Variable length suffix: binary valued feature

– Word Prefix

• Fixed length character strings stripped from the beginning of the words

– Named Entity Information: Dynamic NE tag (s) of the previous word (s)

– First Word (binary valued feature): Check whether the current token is the first word in the sentence


104

Language Independent Features (Contd..)

• Length (binary valued): Checks whether the length of the current word is less than three (shorter words are rarely NEs)

• Position (binary valued): Position of the word in the sentence

• Infrequent (binary valued): Infrequent words in the training corpus are most probably NEs

• Digit features: Binary-valued

– Presence and/or the exact number of digits in a token

• CntDgt : Token contains digits

• FourDgt: Token consists of four digits

• TwoDgt: Token consists of two digits

• CnsDgt: Token consists of digits only


105

Language Independent Features (Contd..)

– Combination of digits and punctuation symbols

• CntDgtCma: Token consists of digits and comma

• CntDgtPrd: Token consists of digits and periods

– Combination of digits and symbols

• CntDgtSlsh: Token consists of digit and slash

• CntDgtHph: Token consists of digits and hyphen

• CntDgtPrctg: Token consists of digits and percentages

– Combination of digit and special symbols

• CntDgtSpl: Token consists of digit and special symbol such as $, # etc.
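The digit features can be computed with simple string tests. This is a sketch under assumptions: "consists of four/two digits" is read as exactly that many digits and nothing else, the combination features only require a digit plus the symbol somewhere in the token, and `$` and `#` stand in for the "special symbols". The function name `digit_features` is hypothetical.

```python
import re


def digit_features(token):
    """Binary digit features named after the slide's feature labels."""
    has_digit = any(ch.isdigit() for ch in token)
    return {
        "CntDgt": has_digit,                                      # contains a digit
        "FourDgt": re.fullmatch(r"\d{4}", token) is not None,     # exactly four digits
        "TwoDgt": re.fullmatch(r"\d{2}", token) is not None,      # exactly two digits
        "CnsDgt": token.isdigit(),                                # digits only
        "CntDgtCma": has_digit and "," in token,
        "CntDgtPrd": has_digit and "." in token,
        "CntDgtSlsh": has_digit and "/" in token,
        "CntDgtHph": has_digit and "-" in token,
        "CntDgtPrctg": has_digit and "%" in token,
        "CntDgtSpl": has_digit and any(ch in "$#" for ch in token),
    }


year = digit_features("2010")
date = digit_features("10/12")
```

For example, "2010" fires `CntDgt`, `FourDgt` and `CnsDgt`, which is a useful cue for year expressions, while "10/12" fires `CntDgt` and `CntDgtSlsh`.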


106

Language dependent Features (Contd..)

– Part of Speech (POS) Information: POS tag(s) of the current and/or the surrounding word(s)

• SVM-based POS tagger (Ekbal and Bandyopadhyay, 2008b)

• Accuracy = 90.2%

• SVM based NERC: POS tagger developed with a fine-grained tagset of 27 tags

• ME and CRF based NERC: Coarse-grained POS tagger

– Nominal, PREP (Postpositions) and Other

– Gazetteer based features (binary valued): Several features extracted from the gazetteers


107

ME based NERC System (Contd..)

• Language model: Represented by ME model parameters

• Possible class module: Consists of a list of lexical units for each word associated with the list of 17 tags

• NE disambiguation: Decides the most probable tag sequence for a given word sequence

– Beam search algorithm

• Elimination of inadmissible sequences: Removes the inadmissible tag sequences from the output of the ME model

Tool: C++ based ME Package (http://homepages.inf.ed.ac.uk/s0450736/software/maxent/maxent-20061005.tar.bz2)


108

ME based NERC System

• Elimination of Inadmissible Tag Sequences

– Inadmissible tag sequence (e.g., B-PER followed by LOC)

– Transition probability:

$P(c_i \mid c_j) = 1$ if the sequence is admissible, $0$ otherwise

– Probability of the classes assigned to the words in a sentence ‘s’ in a document ‘D’ defined as:

$P(c_1, \ldots, c_n \mid s, D) = \prod_{i=1}^{n} P(c_i \mid s, D) * P(c_i \mid c_{i-1})$

where $P(c_i \mid s, D)$ is determined by the maximum entropy classifier
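The 0/1 transition factor can be sketched as a filter on tag sequences. The admissibility rule below is an assumption, since the slides give only one example (B-PER followed by LOC): it supposes a tag set with `B-X`, `I-X`, `E-X` for multi-word entities and bare tags such as `LOC` or `O` for single tokens. The function names and the classifier probabilities are invented for illustration.

```python
def admissible(prev, cur):
    """Assumed rule: an open entity (B-X or I-X) must continue with I-X or E-X."""
    if prev.startswith(("B-", "I-")):
        kind = prev[2:]
        return cur in (f"I-{kind}", f"E-{kind}")
    # Outside an entity, a continuation or end tag cannot appear.
    return not cur.startswith(("I-", "E-"))


def sequence_prob(tags, class_probs, start="O"):
    """Product over i of P(c_i | s, D) * P(c_i | c_{i-1}), with a 0/1 transition term."""
    p, prev = 1.0, start
    for tag, probs in zip(tags, class_probs):
        p *= probs[tag] * (1.0 if admissible(prev, tag) else 0.0)
        prev = tag
    return p


# Per-word classifier outputs (made-up numbers).
probs = [{"B-PER": 0.5}, {"E-PER": 0.5, "LOC": 0.4}]
ok = sequence_prob(["B-PER", "E-PER"], probs)
bad = sequence_prob(["B-PER", "LOC"], probs)
```

The admissible sequence keeps its classifier score (0.25 here), while the sequence B-PER followed by LOC is forced to zero regardless of how confident the classifier was.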


109

CRF based NERC System

• Language model: Represented by CRF model parameters

• Possible class module: Consists of a list of lexical units for each word associated with the list of 17 tags

• NE disambiguation: Decides the most probable tag sequence for a given word sequence

• Forward Viterbi and backward A* search algorithm (Rabiner, 1989) for disambiguation

• Elimination of inadmissible sequences: Removes the inadmissible tag sequences from the output of the CRF model (Same as ME model)

Tool: C++ based CRF++ package (http://crfpp.sourceforge.net )


110

CRF based NERC System (Contd..)

• Feature Template: Feature represented in terms of feature template

Feature template used in the experiment


111

SVM based NERC System

•Language model: Represented by SVM model parameters

•Possible class module: Considers any of the 17 NE tags for each word

•NE disambiguation: Beam search (Selection of beam width (i.e., N) is very important, as larger beam width does not always give a significant improvement in performance)

•Elimination of inadmissible tag sequences: Same as ME and CRF

Training: YamCha toolkit (http://chasen-org/~taku/software/yamcha/)

Classification: TinySVM-0.07 (http://cl.aist-nara.ac.jp/~takuku/ software/TinySVM )

One vs rest and pairwise multi-class decision methods

Polynomial kernel function


112

SVM based NERC System (Contd..)

•wi: word appearing at the ith position

•pi: POS feature of wi

•ti: NE label for the ith word

•Reverse parsing direction is possible (from right to left)

•Models of SVM:

•SVM-F: Parses from left to right

•SVM-B: Parses from right to left

Feature representation in SVM

•Features: Surrounding context, such as words, their lexical features, and the various orthographic word-level features as well as the NE labels


113

Best Feature Sets for ME, CRF and SVM

Model Feature

ME Word, Context (Preceding one and following one word), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tag of the previous word, First word of the sentence, Infrequent word, Length of the word, Digit features

CRF Word, Context (Preceding two and following two words), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tag of the previous word, First word of the sentence, Infrequent word, Length of the word, Digit features

SVM-F Word, Context (Preceding three and following two words), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tag of the previous two words, First word of the sentence, Infrequent word, Length of the word, Digit features

SVM-B Word, Context (Preceding three and following two words), Prefixes and suffixes of length up to three characters of the current word only, Dynamic NE tag of the previous two words, First word of the sentence, Infrequent word, Length of the word, Digit features

Best Feature set Selection: Trained with language independent features and tested on the development set


114

Language Dependent Evaluation (ME, CRF and SVM)

• Observations:

– Classifiers trained with best set of language independent as well as language dependent features

– POS information of the words is very effective

• Coarse-grained POS tagger (Nominal, PREP and Other) for ME and CRF

• Fine-grained POS tagger (developed with 27 POS tags) for SVM based systems

– Best performance of ME: POS information of the current word only (an improvement of 2.02% F-Score)

– Best performance of CRF: POS information of the current, previous and next words (an improvement of 3.04% F-Score)

– Best performance of SVM: POS information of the current, previous and next words (an improvement of 2.37% F-Score in SVM-F and 2.32% in SVM-B)

– NE suffixes, organization suffix words, person prefix words, designations and common location words are more effective than other gazetteers


115

Reference

• HMM http://www-nlp.stanford.edu/fsnlp/hmm-chap/blei-hmm-ch9.ppt

• MEMM www.cs.cornell.edu/courses/cs778/2006fa/lectures/05-memm.pdf

• CRFs web.engr.oregonstate.edu/~tgd/classes/539/slides/Shen-CRF.ppt

• PoS-tagging cs.haifa.ac.il/~shuly/teaching/04/statnlp/pos-tagging.ppt

• NER www.cl.uni-heidelberg.de/colloquium/docs/ekbal_abstract.pdf


116

Assignment

• States: {S_sunny, S_rainy, S_snowy}

• Observations: {O_skirt, O_coat, O_umbrella}

• State transition probabilities:

A =
.8 .15 .05
.38 .6 .02
.75 .05 .2

• Emission probability:

B =
.6 .3 .1
.05 .3 .65
0 .5 .5

• Initial state distribution: (.7 .25 .05)

• Given O: O_coat O_coat O_umbrella O_umbrella O_skirt O_umbrella O_umbrella

• What is the probability that the given sequence appears? (forward procedure)

• What is the most probable sequence of weather states given the observations?
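As a starting point for the first question, the forward procedure can be sketched as below. The indexing is an assumption: rows of A are taken as the current state and columns as the next state, in the order sunny, rainy, snowy; rows of B are states and columns are the observations skirt, coat, umbrella. The function name `forward` is hypothetical, and the code is demonstrated only on a one-symbol sequence.

```python
def forward(obs, pi, A, B):
    """P(O) by the forward procedure: alpha_1(s) = pi_s * b_s(o_1), then recurse."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * A[r][s] for r in range(n)) * B[s][o]
            for s in range(n)
        ]
    return sum(alpha)


SKIRT, COAT, UMBRELLA = 0, 1, 2
pi = [0.7, 0.25, 0.05]
A = [[0.8, 0.15, 0.05], [0.38, 0.6, 0.02], [0.75, 0.05, 0.2]]
B = [[0.6, 0.3, 0.1], [0.05, 0.3, 0.65], [0.0, 0.5, 0.5]]

# Probability of seeing a coat on the first day only.
p_coat = forward([COAT], pi, A, B)
```

For the single observation `COAT` this gives 0.7 · 0.3 + 0.25 · 0.3 + 0.05 · 0.5 = 0.31. Passing the full observation sequence of the assignment answers the first question, and replacing the sum over previous states with a max (plus back-pointers) turns the same recursion into Viterbi for the second.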