TRANSCRIPT
-
GRAPH-BASED POSTERIOR REGULARIZATION FOR
SEMI-SUPERVISED STRUCTURED PREDICTION
Luheng He, Jennifer Gillenwater†, Ben Taskar
†University of Pennsylvania University of Washington
-
OVERVIEW
Structured Prediction (CRF)
Graph Propagation
A Joint Objective: $\mathcal{J}(q, p_\theta)$
-
CONDITIONAL RANDOM FIELD
X = This is recognized as a face
Y = y1 y2 y3 y4 y5 y6 = DET VERB VERB ADP DET NOUN
Each pair of adjacent tags forms a factor with features $f(y_t, y_{t-1}, x)$.
Conditional Distribution
$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\left[\sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x)\right]$$
CRF objective
$$\mathrm{NLik}(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y^i \mid x^i)$$
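As a rough illustration of the conditional distribution and the CRF objective, the sketch below computes $\log Z_\theta(x)$ with the standard forward recursion and checks by brute-force enumeration that the probabilities sum to one. The array `scores[t, prev, cur]` stands in for $\theta^\top f(y_t, y_{t-1}, x)$; that layout, the random values, and the convention that position 0 uses a start tag with index 0 are assumptions made for this example, not details from the slides.

```python
import itertools
import numpy as np

def sequence_score(scores, y):
    """Sum the local factor scores along one tag sequence y."""
    prev, total = 0, 0.0  # assumed convention: start tag has index 0
    for t, cur in enumerate(y):
        total += scores[t, prev, cur]
        prev = cur
    return total

def log_partition(scores):
    """log Z(x) by the forward recursion, O(T * K^2) instead of O(K^T)."""
    T, K, _ = scores.shape
    alpha = scores[0, 0, :]  # scores of each first tag given the start tag
    for t in range(1, T):
        # alpha_new[cur] = logsumexp over prev of alpha[prev] + scores[t, prev, cur]
        alpha = np.logaddexp.reduce(alpha[:, None] + scores[t], axis=0)
    return np.logaddexp.reduce(alpha)

def log_prob(scores, y):
    """log p(y | x) = score(y) - log Z(x)."""
    return sequence_score(scores, y) - log_partition(scores)

rng = np.random.default_rng(0)
T, K = 4, 3  # toy sizes: 4 positions, 3 tags
scores = rng.normal(size=(T, K, K))

# Brute-force check: probabilities over all K^T sequences sum to 1.
total_prob = sum(np.exp(log_prob(scores, y))
                 for y in itertools.product(range(K), repeat=T))

# Negative log-likelihood of one (hypothetical) gold sequence.
nlik = -log_prob(scores, (0, 2, 1, 1))
```

The forward recursion is what keeps the normalizer tractable: it touches K² entries per position rather than enumerating every sequence.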
-
GRAPH PROPAGATION
Zhu et al. (ICML 2003): Graph-based Semi-supervised Learning
The painting is considered as a work of genius .
This is recognized as a face .
Known tag: VERB. Unknown tag: ?
similar context → similar tagging: the unknown tag is also VERB
-
GRAPH LAPLACIAN REGULARIZER
Unlabeled trigram: ... ninth run for ... (tag ?)
Neighbors, with tags and edge weights:
... a run along ...   NOUN   0.4
... a run for ...     NOUN   0.8
... luck run out ...  VERB   0.8
Resulting distribution:
Prob(NOUN | ninth run for) = 0.6
Prob(VERB | ninth run for) = 0.4
Prob(ADV | ninth run for) = 0
Prob(DET | ninth run for) = 0
…
$$\mathrm{Prob}(\mathrm{tag} \mid \text{ninth run for}) = \arg\min_m \; 0.4 \,\lVert m - \mathrm{Prob}(\mathrm{tag} \mid \text{a run along}) \rVert_2^2 + 0.8 \,\lVert m - \mathrm{Prob}(\mathrm{tag} \mid \text{a run for}) \rVert_2^2 + 0.8 \,\lVert m - \mathrm{Prob}(\mathrm{tag} \mid \text{luck run out}) \rVert_2^2$$
$m_{a,k}$: The proportion of time trigram a has tag k
$$\min_m \mathrm{Lap}(m) = \sum_{a \in \mathrm{Unlab}} \sum_{b \in \mathrm{Neighbors}(a)} \sum_{k \in \mathrm{Tags}} w_{ab} \,(m_{a,k} - m_{b,k})^2$$
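The 0.6 / 0.4 split in the example can be reproduced numerically. The sketch below assumes each neighbor's tag distribution is one-hot on the tag shown next to it and restricts the tag set to the four tags named on the slide; both are simplifications for illustration. Under those assumptions the minimizer of the weighted squared distances is the weight-normalized average of the neighbors' distributions.

```python
import numpy as np

TAGS = ["NOUN", "VERB", "ADV", "DET"]  # only the tags named in the example

def one_hot(tag):
    """Distribution that puts all mass on a single tag (an assumption)."""
    v = np.zeros(len(TAGS))
    v[TAGS.index(tag)] = 1.0
    return v

# (edge weight, neighbor distribution) for the node "ninth run for"
neighbors = [
    (0.4, one_hot("NOUN")),  # a run along
    (0.8, one_hot("NOUN")),  # a run for
    (0.8, one_hot("VERB")),  # luck run out
]

def laplacian_term(m):
    """Weighted sum of squared distances from m to each neighbor."""
    return sum(w * np.sum((m - p) ** 2) for w, p in neighbors)

# Setting the gradient to zero gives the weighted average of the neighbors.
weights = np.array([w for w, _ in neighbors])
targets = np.stack([p for _, p in neighbors])
m_star = weights @ targets / weights.sum()

for tag, prob in zip(TAGS, m_star):
    print(f"Prob({tag} | ninth run for) = {prob:.1f}")
```

Here NOUN receives (0.4 + 0.8) / 2.0 = 0.6 and VERB receives 0.8 / 2.0 = 0.4, matching the values listed above.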
-
COMBINING THE TWO
        CRF estimation            graph-propagation
Data    labeled                   labeled + unlabeled
Model   p(tags | sentence; θ)     m(tag | trigram)
-
PRIOR WORK
Subramanya et al. (EMNLP 2010): graph-propagation + CRF estimation
Our work: retains efficiency while optimizing an extendible, joint objective.
-
HOW TO COMBINE?
Introduce auxiliary variables q
1. Normalized
2. Decomposed into local factors
$$p_\theta(y \mid x^i) = \frac{1}{Z_\theta(x^i)} \exp\left[\sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x^i)\right]$$
$$q^i_y = \frac{1}{Z_q(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right]$$
-
JOINT OBJECTIVE
$$\mathcal{J}(q, p_\theta) = \mathrm{NLik}(p_\theta) + \mathrm{KL}(q \,\|\, p_\theta) + \mathrm{Lap}(q)$$
$$\mathrm{NLik}(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y^i \mid x^i)$$
$$\mathrm{KL}(q \,\|\, p_\theta) = \sum_{i=1}^{n} \sum_{y} q^i_y \log \frac{q^i_y}{p_\theta(y \mid x^i)}$$
$$\mathrm{Lap}(q) = \sum_{a \in \mathrm{Unlab}} \sum_{b \in \mathrm{Neighbors}(a)} \sum_{k \in \mathrm{Tags}} w_{ab} \,\big(m_{a,k}(q) - m_{b,k}(q)\big)^2$$
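One way to picture the Laplacian term of the joint objective is sketched below. It assumes that $m_{a,k}(q)$ is obtained by averaging the token-level tag posteriors under q over every occurrence of trigram type a, which is one reading of "the proportion of time trigram a has tag k". The helper names, the toy posteriors, and the edge list are all hypothetical.

```python
import numpy as np

def trigram_marginals(token_posteriors, token_trigram, num_trigrams):
    """Average per-token tag posteriors over the occurrences of each trigram type.

    token_posteriors: (num_tokens, K) array of posterior tag marginals under q
    token_trigram:    (num_tokens,) trigram-type index for each token
    Every trigram type is assumed to occur at least once.
    """
    K = token_posteriors.shape[1]
    sums = np.zeros((num_trigrams, K))
    counts = np.zeros(num_trigrams)
    np.add.at(sums, token_trigram, token_posteriors)  # accumulate per type
    np.add.at(counts, token_trigram, 1.0)
    return sums / counts[:, None]

def laplacian(m, edges, unlabeled):
    """Sum of w_ab * ||m_a - m_b||^2 over edges whose source a is unlabeled."""
    return sum(w * np.sum((m[a] - m[b]) ** 2)
               for a, b, w in edges if a in unlabeled)

# Toy data: three tokens, two trigram types, two tags.
token_posteriors = np.array([[0.8, 0.2],
                             [0.4, 0.6],
                             [0.1, 0.9]])
token_trigram = np.array([0, 0, 1])

m = trigram_marginals(token_posteriors, token_trigram, num_trigrams=2)
lap = laplacian(m, edges=[(0, 1, 0.5)], unlabeled={0})
```

In this toy case trigram 0 averages to (0.6, 0.4), trigram 1 stays at (0.1, 0.9), and the single edge contributes 0.5 × (0.25 + 0.25) = 0.25.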
-
HOW TO OPTIMIZE?
$$\min_{q, \theta} \mathcal{J}(q, p_\theta)$$
update p: θ is unconstrained, so take a gradient step
$$\theta' = \theta - \eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}$$
update q:
no compact form: (# tags)^(i's length) values
projection is hard: $\sum_{y} q^i_y = 1$
-
UPDATE Q
q can be represented by local factors r:
$$q^i_y = \frac{1}{Z_q(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right]$$
doing an additive gradient update:
$${q^i_y}' = q^i_y - \eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}$$
q' cannot be written as a product of local factors!
-
EXPONENTIATED GRADIENT
Collins et al. (JMLR 2008): Exponentiated gradient for CRFs
multiplicative gradient update:
$${q^i_y}' = \frac{1}{Z_{q'}(x^i)} \, q^i_y \exp\left[-\eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}\right]$$
Substituting the local-factor form of q (its normalizer is absorbed into $Z_{q'}(x^i)$):
$$= \frac{1}{Z_{q'}(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right] \exp\left[-\eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}\right]$$
The gradient term decomposes into local factors, so:
$$= \frac{1}{Z_{q'}(x^i)} \exp\left[\sum_{t=1}^{T} r'_{i,t}(y_t, y_{t-1})\right]$$
only updating (#tags)^2 × (i's length) variables!
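The claim that the multiplicative update stays within the local-factor family can be checked on a tiny chain. The sketch below assumes the gradient with respect to $q^i_y$ is a sum of local pieces `g[t, y_{t-1}, y_t]` along the sequence, as the "decompose into local factors" step suggests; the random tables, the step size, and the start-tag convention (index 0) are illustrative choices, not values from the talk.

```python
import itertools
import numpy as np

def path_sum(table, y):
    """Sum table[t, y_{t-1}, y_t] along sequence y (start tag index 0 assumed)."""
    return sum(table[t, (y[t - 1] if t > 0 else 0), y[t]] for t in range(len(y)))

def chain_distribution(r):
    """Enumerate q(y) proportional to exp(path_sum(r, y)) for a tiny chain."""
    T, K, _ = r.shape
    seqs = list(itertools.product(range(K), repeat=T))
    logits = np.array([path_sum(r, y) for y in seqs])
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return seqs, weights / weights.sum()

rng = np.random.default_rng(1)
T, K, eta = 3, 2, 0.5
r = rng.normal(size=(T, K, K))  # local factors of q
g = rng.normal(size=(T, K, K))  # local pieces of the gradient (assumed form)

seqs, q = chain_distribution(r)
grad = np.array([path_sum(g, y) for y in seqs])  # one entry per full sequence

# Exponentiated gradient on the full table of K^T values.
q_full = q * np.exp(-eta * grad)
q_full /= q_full.sum()

# The same update applied only to the T * K^2 local factors.
_, q_local = chain_distribution(r - eta * g)

# An additive step, by contrast, generally leaves the simplex and has no
# local-factor representation.
q_additive = q - eta * grad
```

Because both the factors and the gradient are sums along the sequence, multiplying by the exponentiated gradient just shifts each local factor to `r - eta * g`, so `q_full` and `q_local` coincide.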
-
SUMMARY
$$\mathcal{J}(q, p_\theta) = \mathrm{NLik}(p_\theta) + \mathrm{KL}(q \,\|\, p_\theta) + \mathrm{Lap}(q)$$
E-step:
$${q^i_y}' = \frac{1}{Z_{q'}(x^i)} \, q^i_y \exp\left[-\eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q^i_y}\right]$$
(in practice, update each local factor $r_{i,t}$)
$$q^i_y = \frac{1}{Z_q(x^i)} \exp\left[\sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1})\right]$$
M-step:
$$\theta' = \theta - \eta \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}$$
Theorem: Converges to a local optimum of $\mathcal{J}(q, p_\theta)$
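The alternating scheme can be tried on a deliberately simplified, unstructured toy problem: a single tag variable, $p_\theta$ given by a softmax over logits θ, one labeled example, and a quadratic pull of q toward one neighbor's distribution standing in for Lap(q). None of these choices come from the talk; they only make the E-step and M-step updates concrete enough to run.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def objective(q, theta, gold, m_b, w):
    """Toy J(q, p_theta) = NLik + KL(q || p_theta) + w * ||q - m_b||^2."""
    p = softmax(theta)
    nlik = -np.log(p[gold])
    kl = np.sum(q * (np.log(q) - np.log(p)))
    reg = w * np.sum((q - m_b) ** 2)
    return nlik + kl + reg

K, eta, w = 3, 0.1, 0.8           # hypothetical sizes and step size
gold = 0                          # tag of the single labeled example
m_b = np.array([0.1, 0.8, 0.1])   # neighbor's tag distribution (made up)
theta = np.zeros(K)
q = np.full(K, 1.0 / K)

history = [objective(q, theta, gold, m_b, w)]
for _ in range(200):
    p = softmax(theta)
    # E-step: multiplicative (exponentiated gradient) update of q.
    grad_q = np.log(q) - np.log(p) + 1.0 + 2.0 * w * (q - m_b)
    q = q * np.exp(-eta * grad_q)
    q /= q.sum()
    # M-step: plain gradient step on the unconstrained theta.
    grad_theta = (p - np.eye(K)[gold]) + (p - q)
    theta = theta - eta * grad_theta
    history.append(objective(q, theta, gold, m_b, w))
```

The multiplicative E-step keeps q positive and normalized without any projection, while θ needs no constraint handling at all, which is the asymmetry the summary relies on.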
-
EXPERIMENT SETTING
10 Languages (CoNLL-X and CoNLL-2007)
100 Randomly sampled labeled sentences
Averaged over 10 sampling runs
Universal POS Tags (Petrov et al. 2011)
Second Order CRF Model $f(y_t, y_{t-1}, y_{t-2}, x)$
-
[Figure: POS tagging error (y-axis, 8 to 24) by language (EN, DE, ES, PT, DA, SL, SV, EL, IT, NL, Avg), built up over four systems: GP (graph propagation), GP → CRF, CRF, and J (the joint objective)]
28% average relative error reduction
-
CONCLUSION
$$\mathcal{J}(q, p_\theta) = \mathrm{NLik}(p_\theta) + \mathrm{KL}(q \,\|\, p_\theta) + R(q)$$
R(q): posterior constraint; any convex, differentiable regularizer (Lap(q) in this work)
Code: https://code.google.com/p/pr-graph/