
  • GRAPH-BASED POSTERIOR REGULARIZATION FOR

    SEMI-SUPERVISED STRUCTURED PREDICTION

    Luheng He, Jennifer Gillenwater†, Ben Taskar

    †University of Pennsylvania; University of Washington

  • OVERVIEW

    Structured Prediction (CRF)

    Graph Propagation

    A Joint Objective: \mathcal{J}(q, p_\theta)

  • CONDITIONAL RANDOM FIELD

    Example:  This  is    recognized  as   a    face  .
              DET   VERB  VERB        ADP  DET  NOUN

    X = the input sentence,   Y = y_1 y_2 y_3 y_4 y_5 y_6

    Conditional distribution:

        p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\Big[ \sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x) \Big]

    where each f(y_t, y_{t-1}, x) is a local factor of p_\theta.

    CRF objective (negative log-likelihood of the \ell labeled sentences):

        NLik(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y_i \mid x_i)
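
To make the factorization concrete, here is a minimal sketch of the linear-chain CRF conditional probability and the NLik objective. It is illustrative only (not the authors' implementation): the features in `f` are toy hash features, and the partition function Z_theta(x) is computed by brute-force enumeration rather than the forward algorithm, which is fine for a six-word sentence.

```python
# Minimal linear-chain CRF sketch (illustrative; brute-force normalization).
import itertools
import numpy as np

TAGS = ["DET", "VERB", "ADP", "NOUN"]      # toy tag set

def f(y_t, y_prev, x, t, dim=8):
    """Toy feature vector for one local factor f(y_t, y_{t-1}, x)."""
    v = np.zeros(dim)
    v[hash((y_t, x[t])) % dim] += 1.0      # emission-style feature
    v[hash((y_t, y_prev)) % dim] += 1.0    # transition-style feature
    return v

def score(y, x, theta):
    """sum_t theta . f(y_t, y_{t-1}, x); position 0 gets a START predecessor."""
    return sum(theta @ f(y[t], y[t - 1] if t > 0 else "START", x, t)
               for t in range(len(x)))

def log_Z(x, theta):
    """log Z_theta(x), by summing over all tag sequences (exponential; toy only)."""
    return np.logaddexp.reduce(
        [score(y, x, theta) for y in itertools.product(TAGS, repeat=len(x))])

def log_p(y, x, theta):
    """log p_theta(y | x) = score(y, x) - log Z_theta(x)."""
    return score(y, x, theta) - log_Z(x, theta)

def nlik(labeled, theta):
    """NLik(p_theta) = - sum_i log p_theta(y_i | x_i) over labeled sentences."""
    return -sum(log_p(y, x, theta) for x, y in labeled)

x = "This is recognized as a face".split()
y = ["DET", "VERB", "VERB", "ADP", "DET", "NOUN"]
theta = np.random.default_rng(0).normal(size=8)
print("log p(y | x) =", log_p(y, x, theta), "   NLik =", nlik([(x, y)], theta))
```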

  • GRAPH PROPAGATION

    Zhu et al. (ICML 2003): graph-based semi-supervised learning.

    Labeled:    The painting is considered as a work of genius .   (considered → VERB)
    Unlabeled:  This is recognized as a face .                     (recognized → ?)

    The two verbs occur in similar contexts, and similar context suggests similar
    tagging, so the VERB label is propagated to "recognized".
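
The graph itself is built over trigram types before training. As a rough illustration of the idea (the actual construction follows Subramanya et al. (2010) and uses a richer, carefully chosen feature set), here is a generic sketch that represents each trigram type by the words seen around it and connects each node to its k most cosine-similar neighbors. The function names and the window-count features are assumptions for the sketch, not the paper's recipe.

```python
# Generic k-NN similarity graph over trigram types (illustrative sketch).
from collections import Counter, defaultdict
import numpy as np

def trigram_context_features(sentences, window=1):
    """Map each trigram type to a bag of words seen just outside it."""
    feats = defaultdict(Counter)
    for sent in sentences:
        for t in range(1, len(sent) - 1):
            tri = tuple(sent[t - 1:t + 2])
            left = sent[max(0, t - 1 - window):t - 1]
            right = sent[t + 2:t + 2 + window]
            feats[tri].update(left + right)
    return feats

def knn_graph(feats, k=3):
    """Edges (a, b) -> cosine similarity, keeping each node's top-k neighbors."""
    tris = list(feats)
    vocab = sorted({w for c in feats.values() for w in c})
    X = np.array([[feats[t][w] for w in vocab] for t in tris], dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12
    sim = X @ X.T
    return {(a, tris[j]): float(sim[i, j])
            for i, a in enumerate(tris)
            for j in np.argsort(-sim[i])[1:k + 1]}      # index 0 is the node itself
```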

  • GRAPH LAPLACIAN REGULARIZER

    A similarity graph over trigram types, with edge weights:

        ... a run along ...   NOUN    -- 0.4 --   ... ninth run for ...   ?
        ... a run for ...     NOUN    -- 0.8 --   ... ninth run for ...   ?
        ... luck run out ...  VERB    -- 0.8 --   ... ninth run for ...   ?

    Propagating the neighbors' tag distributions to the unlabeled trigram gives

        Prob(NOUN | ninth run for) = 0.6,   Prob(VERB | ninth run for) = 0.4,
        Prob(ADV  | ninth run for) = 0,     Prob(DET  | ninth run for) = 0,  ...

    because the propagated distribution solves

        \mathrm{Prob}(\text{tag} \mid \text{ninth run for}) = \arg\min_m \;
              0.4 \, \| m - \mathrm{Prob}(\text{tag} \mid \text{a run along}) \|_2^2
            + 0.8 \, \| m - \mathrm{Prob}(\text{tag} \mid \text{a run for}) \|_2^2
            + 0.8 \, \| m - \mathrm{Prob}(\text{tag} \mid \text{luck run out}) \|_2^2

    In general, with m_{a,k} = the proportion of the time trigram a has tag k:

        \min_m \; \mathrm{Lap}(m) = \sum_{a \in \mathrm{Unlab}} \;
            \sum_{b \in \mathrm{Neighbors}(a)} \; \sum_{k \in \mathrm{Tags}}
            w_{ab} \, (m_{a,k} - m_{b,k})^2
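
A quick way to see where the numbers above come from: the minimizer of a weighted sum of squared distances is the weighted average of the neighbors' tag distributions, so (0.4 + 0.8) / 2.0 of the mass goes to NOUN and 0.8 / 2.0 to VERB. A small sketch (illustrative names) that reproduces the slide's result:

```python
# The minimizer of  sum_b w_ab ||m - Prob(tag | b)||^2  is the weighted average
# of the neighbors' tag distributions (illustrative sketch of the slide's example).
import numpy as np

TAGS = ["NOUN", "VERB", "ADV", "DET"]

# (edge weight, neighbor's tag distribution) for the three labeled neighbors
neighbors = {
    "a run along":  (0.4, np.array([1.0, 0.0, 0.0, 0.0])),   # NOUN
    "a run for":    (0.8, np.array([1.0, 0.0, 0.0, 0.0])),   # NOUN
    "luck run out": (0.8, np.array([0.0, 1.0, 0.0, 0.0])),   # VERB
}

w = np.array([wt for wt, _ in neighbors.values()])
D = np.stack([d for _, d in neighbors.values()])

m = w @ D / w.sum()                      # argmin_m sum_b w_b ||m - d_b||^2
print(dict(zip(TAGS, m)))                # NOUN: 0.6, VERB: 0.4, ADV: 0.0, DET: 0.0
```

In the full Lap(m) objective the unlabeled trigrams are also coupled to each other, so the minimizer is found by iterating this kind of local averaging (standard label propagation) rather than in a single step.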

  • COMBINING THE TWO

    Data:    labeled                      |  labeled + unlabeled
    Method:  CRF estimation               |  graph propagation
    Model:   p(tags | sentence; \theta)   |  m(tag | trigram)

  • PRIOR WORK

    Subramanya et al. (EMNLP 2010): alternate between graph propagation and
    CRF estimation, without a single objective tying the two together.

    Our work: retains that efficiency while optimizing an extendible, joint objective.

  • HOW TO COMBINE?

    Introduce auxiliary variables q, with one distribution q_i per sentence:

        p_\theta(y \mid x_i) = \frac{1}{Z_\theta(x_i)} \exp\Big[ \sum_{t=1}^{T} \theta^\top f(y_t, y_{t-1}, x_i) \Big]

        q_{iy} = \frac{1}{Z_q(x_i)} \exp\Big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \Big]

    Like p_\theta, each q_i is:  1. normalized   2. decomposed into local factors
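
Concretely, each q_i can be stored as a small table of local scores r_{i,t}(y_t, y_{t-1}), one (#tags x #tags) block per position, rather than as a probability for every whole tag sequence. A toy sketch (illustrative, with brute-force normalization that is only feasible for short sentences):

```python
# Factored representation of q_i:  q_{iy} ∝ exp( sum_t r_{i,t}(y_t, y_{t-1}) ).
import itertools
import numpy as np

def q_distribution(r):
    """Normalize the factored q for one short sentence by brute force.
    r: array of shape (T, K, K) with entries r_{i,t}(y_t, y_{t-1}); the fixed
    predecessor of position 0 is taken to be tag index 0 (a START convention)."""
    T, K, _ = r.shape
    scores = {}
    for y in itertools.product(range(K), repeat=T):
        prev = (0,) + y[:-1]
        scores[y] = np.exp(sum(r[t, y[t], prev[t]] for t in range(T)))
    Z = sum(scores.values())                     # Z_q(x_i)
    return {y: s / Z for y, s in scores.items()}

# A 3-word sentence with 4 tags: r holds 3 * 4 * 4 = 48 numbers, while q ranges
# over 4^3 = 64 whole tag sequences.
q = q_distribution(np.random.default_rng(0).normal(size=(3, 4, 4)))
print(len(q), "sequences, total probability =", round(sum(q.values()), 6))
```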

  • JOINT OBJECTIVE

        \mathcal{J}(q, p_\theta) = NLik(p_\theta) + \mathrm{Lap}(q) + KL(q \,\|\, p_\theta)

    where

        NLik(p_\theta) = -\sum_{i=1}^{\ell} \log p_\theta(y_i \mid x_i)

        \mathrm{Lap}(q) = \sum_{a \in \mathrm{Unlab}} \sum_{b \in \mathrm{Neighbors}(a)}
            \sum_{k \in \mathrm{Tags}} w_{ab} \, (m_{a,k}(q) - m_{b,k}(q))^2

        KL(q \,\|\, p_\theta) = \sum_{i=1}^{n} \sum_{y} q_{iy} \log \frac{q_{iy}}{p_\theta(y \mid x_i)}
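
For bookkeeping, here is a toy assembly of the three terms. It is a sketch under simplifying assumptions: q_i is stored explicitly over whole tag sequences (feasible only for toy sentences, whereas the method keeps q in the factored form above), `log_p` is any CRF log-probability function such as the brute-force one sketched earlier, `graph` maps trigram edges (a, b) to weights w_ab, and `occurrences[a]` lists the (sentence index, position) occurrences of trigram a among the sentences that carry a q_i. Marginals for labeled neighbor trigrams would come from their gold tags and are omitted here.

```python
# Toy assembly of  J(q, p_theta) = NLik(p_theta) + Lap(q) + KL(q || p_theta).
import itertools
import numpy as np

TAGS = ["DET", "VERB", "ADP", "NOUN"]

def all_seqs(n):
    return list(itertools.product(TAGS, repeat=n))

def kl(q_i, p_i):
    """KL(q_i || p_i) for two dicts mapping tag sequence -> probability."""
    return sum(qv * np.log(qv / p_i[y]) for y, qv in q_i.items() if qv > 0)

def tag_marginal(q_i, pos, tag):
    """Probability under q_i that position `pos` carries `tag`."""
    return sum(qv for y, qv in q_i.items() if y[pos] == tag)

def laplacian(q, graph, occurrences):
    """Lap(q) = sum_{a, b in Neighbors(a), k} w_ab (m_{a,k}(q) - m_{b,k}(q))^2,
    with m_{a,k}(q) = average tag-k marginal over occurrences of trigram a."""
    def m(a, k):
        return float(np.mean([tag_marginal(q[i], pos, k) for i, pos in occurrences[a]]))
    return sum(w * (m(a, k) - m(b, k)) ** 2
               for (a, b), w in graph.items() for k in TAGS)

def joint_objective(theta, q, labeled, sentences, graph, occurrences, log_p):
    """J(q, p_theta) = NLik + Lap + KL, all by explicit enumeration.
    `sentences[i]` is the sentence whose auxiliary distribution is q[i]."""
    nlik = -sum(log_p(list(y), x, theta) for x, y in labeled)
    kl_term = 0.0
    for i, x in enumerate(sentences):
        p_i = {y: np.exp(log_p(list(y), x, theta)) for y in all_seqs(len(x))}
        kl_term += kl(q[i], p_i)
    return nlik + laplacian(q, graph, occurrences) + kl_term
```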

  • HOW TO OPTIMIZE?

        \min_{q, \theta} \; \mathcal{J}(q, p_\theta)

    \theta is unconstrained, so a plain gradient update works:

        \theta' = \theta - \eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}

    q is constrained: \sum_y q_{iy} = 1 over (\#\mathrm{tags})^{(i\text{'s length})} values,
    so there is no compact form and projection is hard.

  • UPDATE Q

    q can be represented by local factors r:

        q_{iy} = \frac{1}{Z_q(x_i)} \exp\Big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \Big]

    But after an additive gradient update

        q'_{iy} = q_{iy} - \eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}}

    q' cannot be written as a product of local factors!

  • EXPONENTIATED GRADIENT

    Collins et al. (JMLR 2008): exponentiated gradient for CRFs.

    Multiplicative gradient update:

        q'_{iy} = \frac{1}{Z_{q'}(x_i)} \, q_{iy} \exp\Big[ -\eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}} \Big]

               = \frac{1}{Z_{q'}(x_i)} \exp\Big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \Big]
                 \exp\Big[ -\eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}} \Big]

    which decomposes into local factors:

               = \frac{1}{Z_{q'}(x_i)} \exp\Big[ \sum_{t=1}^{T} r'_{i,t}(y_t, y_{t-1}) \Big]

    Only (\#\mathrm{tags})^2 \times (i\text{'s length}) variables are updated!
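
A small sanity check on why the multiplicative update stays cheap: assuming the gradient \partial \mathcal{J} / \partial q_{iy} decomposes into per-position terms g_{i,t}(y_t, y_{t-1}) (which is what "decomposes into local factors" refers to), the whole update reduces to r'_{i,t} = r_{i,t} - \eta g_{i,t} on the local tables. The sketch below (illustrative, not the released code) checks that this factored update agrees with the direct multiplicative update on every tag sequence of a toy instance.

```python
# Exponentiated-gradient update kept in factored form (illustrative sketch).
import itertools
import numpy as np

def q_unnormalized(r, y):
    """exp( sum_t r[t, y_t, y_{t-1}] ), with tag index 0 as the START predecessor."""
    prev = (0,) + tuple(y[:-1])
    return np.exp(sum(r[t, y[t], prev[t]] for t in range(r.shape[0])))

def eg_update(r, g, eta):
    """Factored EG step: only (#tags)^2 * T numbers change, not (#tags)^T."""
    return r - eta * g

rng = np.random.default_rng(0)
T, K, eta = 3, 2, 0.1
r = rng.normal(size=(T, K, K))          # local factors r_{i,t}(y_t, y_{t-1})
g = rng.normal(size=(T, K, K))          # assumed local gradient terms g_{i,t}
r_new = eg_update(r, g, eta)

for y in itertools.product(range(K), repeat=T):
    prev = (0,) + tuple(y[:-1])
    grad_y = sum(g[t, y[t], prev[t]] for t in range(T))        # dJ/dq_{iy}
    direct = q_unnormalized(r, y) * np.exp(-eta * grad_y)      # q_{iy} * exp(-eta * grad)
    assert np.isclose(direct, q_unnormalized(r_new, y))
print("factored EG update matches the direct multiplicative update (up to Z)")
```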

  • SUMMARY

        \mathcal{J}(q, p_\theta) = NLik(p_\theta) + \mathrm{Lap}(q) + KL(q \,\|\, p_\theta)

    E-step (exponentiated-gradient update of q, which stays in the factored form
    q_{iy} = \frac{1}{Z_q(x_i)} \exp\big[ \sum_{t=1}^{T} r_{i,t}(y_t, y_{t-1}) \big];
    in practice each local factor r_{i,t} is updated):

        q'_{iy} = \frac{1}{Z_{q'}(x_i)} \, q_{iy} \exp\Big[ -\eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial q_{iy}} \Big]

    M-step (gradient update of the CRF parameters):

        \theta' = \theta - \eta \, \frac{\partial \mathcal{J}(q, p_\theta)}{\partial \theta}

    Theorem: this procedure converges to a local optimum of \mathcal{J}(q, p_\theta).
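
Putting the pieces together, training alternates exponentiated-gradient steps on the q factors with gradient steps on \theta. Below is a high-level sketch of the loop only; the two derivative callbacks (gradients of NLik + Lap + KL) are supplied by the caller, and the step sizes and iteration counts are arbitrary illustrative choices, not the settings used in the paper.

```python
# Alternating optimization of J(q, p_theta): a structural sketch only.

def train(theta, r_factors, grad_J_theta, local_grad_q,
          eta_q=0.1, eta_theta=0.1, n_outer=20, n_e_steps=5):
    """theta: CRF weight vector.
    r_factors[i]: (T_i, n_tags, n_tags) array of local factors r_{i,t} defining q_i.
    grad_J_theta(theta, r_factors) and local_grad_q(i, theta, r_factors) return
    the gradient of J w.r.t. theta and w.r.t. sentence i's r table, respectively."""
    for _ in range(n_outer):
        # E-step: exponentiated-gradient updates on q, i.e. additive updates on r.
        for _ in range(n_e_steps):
            for i in range(len(r_factors)):
                r_factors[i] = r_factors[i] - eta_q * local_grad_q(i, theta, r_factors)
        # M-step: gradient step on the CRF parameters.
        theta = theta - eta_theta * grad_J_theta(theta, r_factors)
    return theta, r_factors
```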

  • EXPERIMENT SETTING

    10 languages (CoNLL-X and CoNLL-2007)

    100 randomly sampled labeled sentences, averaged over 10 sampling runs

    Universal POS tags (Petrov et al. 2011)

    Second-order CRF model: f(y_t, y_{t-1}, y_{t-2}, x)

  • [Results figure: POS tagging error (%) for EN, DE, ES, PT, DA, SL, SV, EL, IT,
    NL, and their average (y-axis range roughly 8-24), comparing GP (graph
    propagation alone), GP → CRF (graph propagation followed by CRF training),
    CRF (the supervised CRF), and \mathcal{J} (the joint objective).]

    28% average relative error reduction

  • CONCLUSION

        \mathcal{J}(q, p_\theta) = NLik(p_\theta) + R(q) + KL(q \,\|\, p_\theta)

    The posterior-constraint term R(q), here \mathrm{Lap}(q), can be any convex,
    differentiable regularizer.

    Code: https://code.google.com/p/pr-graph/