
  • GRAPH-BASED POSTERIOR REGULARIZATION FOR

    SEMI-SUPERVISED STRUCTURED PREDICTION

    Luheng He, Jennifer Gillenwater†, Ben Taskar

    †University of Pennsylvania; University of Washington

  • OVERVIEW

    Structured Prediction (CRF)

    Graph Propagation

    A Joint Objective: \mathcal{J}(q, p_\theta)

  • CONDITIONAL RANDOM FIELD

    Example:  This  is    recognized  as   a    face  .
              DET   VERB  VERB        ADP  DET  NOUN

    X = the input sentence,   Y = y_1 y_2 y_3 y_4 y_5 y_6

    Conditional distribution:

        p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\Big[ \sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x) \Big]

    where each f(y_t, y_{t-1}, x) is a local factor of p_\theta.

    CRF objective (negative log-likelihood of the \ell labeled sentences):

        NLik(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y_i \mid x_i)
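
To make the factorization concrete, here is a minimal sketch of the linear-chain CRF conditional probability and the NLik objective. It is illustrative only (not the authors' implementation): the features in `f` are toy hash features, and the partition function Z_theta(x) is computed by brute-force enumeration rather than the forward algorithm, which is fine for a six-word sentence.

```python
# Minimal linear-chain CRF sketch (illustrative; brute-force normalization).
import itertools
import numpy as np

TAGS = ["DET", "VERB", "ADP", "NOUN"]      # toy tag set

def f(y_t, y_prev, x, t, dim=8):
    """Toy feature vector for one local factor f(y_t, y_{t-1}, x)."""
    v = np.zeros(dim)
    v[hash((y_t, x[t])) % dim] += 1.0      # emission-style feature
    v[hash((y_t, y_prev)) % dim] += 1.0    # transition-style feature
    return v

def score(y, x, theta):
    """sum_t theta . f(y_t, y_{t-1}, x); position 0 gets a START predecessor."""
    return sum(theta @ f(y[t], y[t - 1] if t > 0 else "START", x, t)
               for t in range(len(x)))

def log_Z(x, theta):
    """log Z_theta(x), by summing over all tag sequences (exponential; toy only)."""
    return np.logaddexp.reduce(
        [score(y, x, theta) for y in itertools.product(TAGS, repeat=len(x))])

def log_p(y, x, theta):
    """log p_theta(y | x) = score(y, x) - log Z_theta(x)."""
    return score(y, x, theta) - log_Z(x, theta)

def nlik(labeled, theta):
    """NLik(p_theta) = - sum_i log p_theta(y_i | x_i) over labeled sentences."""
    return -sum(log_p(y, x, theta) for x, y in labeled)

x = "This is recognized as a face".split()
y = ["DET", "VERB", "VERB", "ADP", "DET", "NOUN"]
theta = np.random.default_rng(0).normal(size=8)
print("log p(y | x) =", log_p(y, x, theta), "   NLik =", nlik([(x, y)], theta))
```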

  • GRAPH PROPAGATION

    Zhu et al. (ICML 2003): graph-based semi-supervised learning.

    Labeled:    The painting is considered as a work of genius .   (considered → VERB)
    Unlabeled:  This is recognized as a face .                     (recognized → ?)

    The two verbs occur in similar contexts, and similar context suggests similar
    tagging, so the VERB label is propagated to "recognized".
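
The graph itself is built over trigram types before training. As a rough illustration of the idea (the actual construction follows Subramanya et al. (2010) and uses a richer, carefully chosen feature set), here is a generic sketch that represents each trigram type by the words seen around it and connects each node to its k most cosine-similar neighbors. The function names and the window-count features are assumptions for the sketch, not the paper's recipe.

```python
# Generic k-NN similarity graph over trigram types (illustrative sketch).
from collections import Counter, defaultdict
import numpy as np

def trigram_context_features(sentences, window=1):
    """Map each trigram type to a bag of words seen just outside it."""
    feats = defaultdict(Counter)
    for sent in sentences:
        for t in range(1, len(sent) - 1):
            tri = tuple(sent[t - 1:t + 2])
            left = sent[max(0, t - 1 - window):t - 1]
            right = sent[t + 2:t + 2 + window]
            feats[tri].update(left + right)
    return feats

def knn_graph(feats, k=3):
    """Edges (a, b) -> cosine similarity, keeping each node's top-k neighbors."""
    tris = list(feats)
    vocab = sorted({w for c in feats.values() for w in c})
    X = np.array([[feats[t][w] for w in vocab] for t in tris], dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    sim = X @ X.T
    return {(a, tris[j]): float(sim[i, j])
            for i, a in enumerate(tris)
            for j in np.argsort(-sim[i])[1:k + 1]}      # index 0 is the node itself
```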

  • GRAPH LAPLACIAN REGULARIZER

    A similarity graph over trigram types, with edge weights:

        ... a run along ...   NOUN    -- 0.4 --   ... ninth run for ...   ?
        ... a run for ...     NOUN    -- 0.8 --   ... ninth run for ...   ?
        ... luck run out ...  VERB    -- 0.8 --   ... ninth run for ...   ?

    Propagating the neighbors' tag distributions to the unlabeled trigram gives

        Prob(NOUN | ninth run for) = 0.6,   Prob(VERB | ninth run for) = 0.4,
        Prob(ADV  | ninth run for) = 0,     Prob(DET  | ninth run for) = 0,  ...

    because the propagated distribution solves

        \mathrm{Prob}(\text{tag} \mid \text{ninth run for}) = \arg\min_m \;
              0.4 \, \| m - \mathrm{Prob}(\text{tag} \mid \text{a run along}) \|_2^2
            + 0.8 \, \| m - \mathrm{Prob}(\text{tag} \mid \text{a run for}) \|_2^2
            + 0.8 \, \| m - \mathrm{Prob}(\text{tag} \mid \text{luck run out}) \|_2^2

    In general, with m_{a,k} = the proportion of the time trigram a has tag k:

        \min_m \; \mathrm{Lap}(m) = \sum_{a \in \mathrm{Unlab}} \;
            \sum_{b \in \mathrm{Neighbors}(a)} \; \sum_{k \in \mathrm{Tags}}
            w_{ab} \, (m_{a,k} - m_{b,k})^2
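
A quick way to see where the numbers above come from: the minimizer of a weighted sum of squared distances is the weighted average of the neighbors' tag distributions, so (0.4 + 0.8) / 2.0 of the mass goes to NOUN and 0.8 / 2.0 to VERB. A small sketch (illustrative names) that reproduces the slide's result:

```python
# The minimizer of  sum_b w_ab ||m - Prob(tag | b)||^2  is the weighted average
# of the neighbors' tag distributions (illustrative sketch of the slide's example).
import numpy as np

TAGS = ["NOUN", "VERB", "ADV", "DET"]

# (edge weight, neighbor's tag distribution) for the three labeled neighbors
neighbors = {
    "a run along":  (0.4, np.array([1.0, 0.0, 0.0, 0.0])),   # NOUN
    "a run for":    (0.8, np.array([1.0, 0.0, 0.0, 0.0])),   # NOUN
    "luck run out": (0.8, np.array([0.0, 1.0, 0.0, 0.0])),   # VERB
}

w = np.array([wt for wt, _ in neighbors.values()])
D = np.stack([d for _, d in neighbors.values()])

m = w @ D / w.sum()                      # argmin_m sum_b w_b ||m - d_b||^2
print(dict(zip(TAGS, m)))                # NOUN: 0.6, VERB: 0.4, ADV: 0.0, DET: 0.0
```

In the full Lap(m) objective the unlabeled trigrams are also coupled to each other, so the minimizer is found by iterating this kind of local averaging (standard label propagation) rather than in a single step.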

  • COMBINING THE TWO

    Data:    labeled                      |  labeled + unlabeled
    Method:  CRF estimation               |  graph propagation
    Model:   p(tags | sentence; \theta)   |  m(tag | trigram)

  • PRIOR WORK

    Subramanya et al. (EMNLP 2010): alternate between graph propagation and
    CRF estimation, without a single objective tying the two together.

    Our work: retains that efficiency while optimizing an extendible, joint objective.

  • HOW TO COMBINE?

    Introduce auxiliary variables q, with one distribution q_i per sentence:

        p_\theta(y \mid x_i) = \frac{1}{Z_\theta(x_i)} \exp\Big[ \sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x_i) \Big]

        q_{iy} = \frac{1}{Z_q(x_i)} \exp\Big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \Big]

    Like p_\theta, each q_i is:  1. normalized   2. decomposed into local factors
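
Concretely, each q_i can be stored as a small table of local scores r_{i,t}(y_t, y_{t-1}), one (#tags x #tags) block per position, rather than as a probability for every whole tag sequence. A toy sketch (illustrative, with brute-force normalization that is only feasible for short sentences):

```python
# Factored representation of q_i:  q_{iy} ∝ exp( sum_t r_{i,t}(y_t, y_{t-1}) ).
import itertools
import numpy as np

def q_distribution(r):
    """Normalize the factored q for one short sentence by brute force.
    r: array of shape (T, K, K) with entries r_{i,t}(y_t, y_{t-1}); the fixed
    predecessor of position 0 is taken to be tag index 0 (a START convention)."""
    T, K, _ = r.shape
    scores = {}
    for y in itertools.product(range(K), repeat=T):
        prev = (0,) + y[:-1]
        scores[y] = np.exp(sum(r[t, y[t], prev[t]] for t in range(T)))
    Z = sum(scores.values())                     # Z_q(x_i)
    return {y: s / Z for y, s in scores.items()}

# A 3-word sentence with 4 tags: r holds 3 * 4 * 4 = 48 numbers, while q ranges
# over 4^3 = 64 whole tag sequences.
q = q_distribution(np.random.default_rng(0).normal(size=(3, 4, 4)))
print(len(q), "sequences, total probability =", round(sum(q.values()), 6))
```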

  • JOINT OBJECTIVE

        \mathcal{J}(q, p_\theta) = NLik(p_\theta) + \mathrm{Lap}(q) + KL(q \,\|\, p_\theta)

    where

        NLik(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y_i \mid x_i)

        \mathrm{Lap}(q) = \sum_{a \in \mathrm{Unlab}} \sum_{b \in \mathrm{Neighbors}(a)}
            \sum_{k \in \mathrm{Tags}} w_{ab} \, (m_{a,k}(q) - m_{b,k}(q))^2

        KL(q \,\|\, p_\theta) = \sum_{i=1}^{n} \sum_{y} q_{iy} \log \frac{q_{iy}}{p_\theta(y \mid x_i)}
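
For bookkeeping, here is a toy assembly of the three terms. It is a sketch under simplifying assumptions: q_i is stored explicitly over whole tag sequences (feasible only for toy sentences, whereas the method keeps q in the factored form above), `log_p` is any CRF log-probability function such as the brute-force one sketched earlier, `graph` maps trigram edges (a, b) to weights w_ab, and `occurrences[a]` lists the (sentence index, position) occurrences of trigram a among the sentences that carry a q_i. Marginals for labeled neighbor trigrams would come from their gold tags and are omitted here.

```python
# Toy assembly of  J(q, p_theta) = NLik(p_theta) + Lap(q) + KL(q || p_theta).
import itertools
import numpy as np

TAGS = ["DET", "VERB", "ADP", "NOUN"]

def all_seqs(n):
    return list(itertools.product(TAGS, repeat=n))

def kl(q_i, p_i):
    """KL(q_i || p_i) for two dicts mapping tag sequence -> probability."""
    return sum(qv * np.log(qv / p_i[y]) for y, qv in q_i.items() if qv > 0)

def tag_marginal(q_i, pos, tag):
    """Probability under q_i that position `pos` carries `tag`."""
    return sum(qv for y, qv in q_i.items() if y[pos] == tag)

def laplacian(q, graph, occurrences):
    """Lap(q) = sum_{a, b in Neighbors(a), k} w_ab (m_{a,k}(q) - m_{b,k}(q))^2,
    with m_{a,k}(q) = average tag-k marginal over occurrences of trigram a."""
    def m(a, k):
        return float(np.mean([tag_marginal(q[i], pos, k) for i, pos in occurrences[a]]))
    return sum(w * (m(a, k) - m(b, k)) ** 2
               for (a, b), w in graph.items() for k in TAGS)

def joint_objective(theta, q, labeled, sentences, graph, occurrences, log_p):
    """J(q, p_theta) = NLik + Lap + KL, all by explicit enumeration.
    `sentences[i]` is the sentence whose auxiliary distribution is q[i]."""
    nlik = -sum(log_p(list(y), x, theta) for x, y in labeled)
    kl_term = 0.0
    for i, x in enumerate(sentences):
        p_i = {y: np.exp(log_p(list(y), x, theta)) for y in all_seqs(len(x))}
        kl_term += kl(q[i], p_i)
    return nlik + laplacian(q, graph, occurrences) + kl_term
```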

  • HOW TO OPTIMIZE?

        \min_{q, \theta} \; \mathcal{J}(q, p_\theta)

    \theta is unconstrained, so a plain gradient update works:

        \theta' = \theta - \eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}

    q is constrained: \sum_y q_{iy} = 1 over (\#\mathrm{tags})^{(i\text{'s length})} values,
    so there is no compact form and projection is hard.

  • UPDATE Q

    q can be represented by local factors r:

        q_{iy} = \frac{1}{Z_q(x_i)} \exp\Big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \Big]

    But after an additive gradient update

        q'_{iy} = q_{iy} - \eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}}

    q' cannot be written as a product of local factors!

  • EXPONENTIATED GRADIENT

    Collins et al. (JMLR 2008): exponentiated gradient for CRFs.

    Multiplicative gradient update:

        q'_{iy} = \frac{1}{Z_{q'}(x_i)} \, q_{iy} \exp\Big[ -\eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}} \Big]

               = \frac{1}{Z_{q'}(x_i)} \exp\Big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \Big]
                 \exp\Big[ -\eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}} \Big]

    which decomposes into local factors:

               = \frac{1}{Z_{q'}(x_i)} \exp\Big[ \sum_{t=1}^{T} r'_{i,t}(y_t, y_{t-1}) \Big]

    Only (\#\mathrm{tags})^2 \times (i\text{'s length}) variables are updated!
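
A small sanity check on why the multiplicative update stays cheap: assuming the gradient \partial \mathcal{J} / \partial q_{iy} decomposes into per-position terms g_{i,t}(y_t, y_{t-1}) (which is what "decomposes into local factors" refers to), the whole update reduces to r'_{i,t} = r_{i,t} - \eta g_{i,t} on the local tables. The sketch below (illustrative, not the released code) checks that this factored update agrees with the direct multiplicative update on every tag sequence of a toy instance.

```python
# Exponentiated-gradient update kept in factored form (illustrative sketch).
import itertools
import numpy as np

def q_unnormalized(r, y):
    """exp( sum_t r[t, y_t, y_{t-1}] ), with tag index 0 as the START predecessor."""
    prev = (0,) + tuple(y[:-1])
    return np.exp(sum(r[t, y[t], prev[t]] for t in range(r.shape[0])))

def eg_update(r, g, eta):
    """Factored EG step: only (#tags)^2 * T numbers change, not (#tags)^T."""
    return r - eta * g

rng = np.random.default_rng(0)
T, K, eta = 3, 2, 0.1
r = rng.normal(size=(T, K, K))          # local factors r_{i,t}(y_t, y_{t-1})
g = rng.normal(size=(T, K, K))          # assumed local gradient terms g_{i,t}
r_new = eg_update(r, g, eta)

for y in itertools.product(range(K), repeat=T):
    prev = (0,) + tuple(y[:-1])
    grad_y = sum(g[t, y[t], prev[t]] for t in range(T))        # dJ/dq_{iy}
    direct = q_unnormalized(r, y) * np.exp(-eta * grad_y)      # q_{iy} * exp(-eta * grad)
    assert np.isclose(direct, q_unnormalized(r_new, y))
print("factored EG update matches the direct multiplicative update (up to Z)")
```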

  • SUMMARY

        \mathcal{J}(q, p_\theta) = NLik(p_\theta) + \mathrm{Lap}(q) + KL(q \,\|\, p_\theta)

    E-step (exponentiated-gradient update of q, which stays in the factored form
    q_{iy} = \frac{1}{Z_q(x_i)} \exp\big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \big];
    in practice each local factor r_{i,t} is updated):

        q'_{iy} = \frac{1}{Z_{q'}(x_i)} \, q_{iy} \exp\Big[ -\eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}} \Big]

    M-step (gradient update of the CRF parameters):

        \theta' = \theta - \eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}

    Theorem: this procedure converges to a local optimum of \mathcal{J}(q, p_\theta).
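
Putting the pieces together, training alternates exponentiated-gradient steps on the q factors with gradient steps on \theta. Below is a high-level sketch of the loop only; the two derivative callbacks (gradients of NLik + Lap + KL) are supplied by the caller, and the step sizes and iteration counts are arbitrary illustrative choices, not the settings used in the paper.

```python
# Alternating optimization of J(q, p_theta): a structural sketch only.

def train(theta, r_factors, grad_J_theta, local_grad_q,
          eta_q=0.1, eta_theta=0.1, n_outer=20, n_e_steps=5):
    """theta: CRF weight vector.
    r_factors[i]: (T_i, n_tags, n_tags) array of local factors r_{i,t} defining q_i.
    grad_J_theta(theta, r_factors) and local_grad_q(i, theta, r_factors) return
    the gradient of J w.r.t. theta and w.r.t. sentence i's r table, respectively."""
    for _ in range(n_outer):
        # E-step: exponentiated-gradient updates on q, i.e. additive updates on r.
        for _ in range(n_e_steps):
            for i in range(len(r_factors)):
                r_factors[i] = r_factors[i] - eta_q * local_grad_q(i, theta, r_factors)
        # M-step: gradient step on the CRF parameters.
        theta = theta - eta_theta * grad_J_theta(theta, r_factors)
    return theta, r_factors
```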

  • EXPERIMENT SETTING

    10 languages (CoNLL-X and CoNLL-2007)

    100 randomly sampled labeled sentences, averaged over 10 sampling runs

    Universal POS tags (Petrov et al. 2011)

    Second-order CRF model: f(y_t, y_{t-1}, y_{t-2}, x)

  • [Results figure: POS tagging error (%) for EN, DE, ES, PT, DA, SL, SV, EL, IT,
    NL, and their average (y-axis range roughly 8-24), comparing GP (graph
    propagation alone), GP → CRF (graph propagation followed by CRF training),
    CRF (the supervised CRF), and \mathcal{J} (the joint objective).]

    28% average relative error reduction

  • CONCLUSION

        \mathcal{J}(q, p_\theta) = NLik(p_\theta) + R(q) + KL(q \,\|\, p_\theta)

    The posterior-constraint term R(q), here \mathrm{Lap}(q), can be any convex,
    differentiable regularizer.

    Code: https://code.google.com/p/pr-graph/