Learning with Sparse Latent Structure. Vlad Niculae, Instituto de Telecomunicações. Work with: André Martins, Claire Cardie, Mathieu Blondel. github.com/vene/sparsemap @vnfrombucharest


Page 1:

Learning with Sparse Latent Structure
Vlad Niculae

Instituto de Telecomunicações

Work with: André Martins, Claire Cardie, Mathieu Blondel

github.com/vene/sparsemap @vnfrombucharest

Page 2:

Structured Prediction

Part-of-speech tag sequences over "dog on wheels": VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, · · ·

Dependency trees over "⋆ dog on wheels" (several candidate parses), · · ·

Word alignments between "dog on wheels" and "hond op wielen" (several candidate alignments), · · ·


Page 5:

Latent Structure Models

input → · · · (latent structure) · · · → output: positive / neutral / negative

Page 6:

*record scratch*

*freeze frame*

How to select an item from a set?


Page 8:

How to select an item from a set?

c1, c2, · · ·, cN;  input x → θ → p → output y

θ = f1(x; w)    y = f2(p, x; w)

∂y/∂w = ?   or, essentially,  ∂p/∂θ = ?



Page 22:

Argmax

c1, c2, · · ·, cN;  θ → p

∂p/∂θ = 0

(plot: p1 as a function of θ1, with ticks at θ2 − 1, θ2, θ2 + 1)

Page 23:

Argmax vs. Softmax

c1, c2, · · ·, cN;  θ → p

∂p/∂θ = diag(p) − p p⊤

(plot: p1 as a function of θ1, with ticks at θ2 − 1, θ2, θ2 + 1)

pj = exp(θj) / Z
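As a quick sanity check of the Jacobian above, here is a small NumPy sketch (not from the slides) that compares diag(p) − p p⊤ against finite differences:

```python
import numpy as np

def softmax(theta):
    """p_j = exp(theta_j) / Z, computed stably."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([0.2, 1.4, -0.3])
p = softmax(theta)

# analytic Jacobian: dp/dtheta = diag(p) - p p^T
J = np.diag(p) - np.outer(p, p)

# central finite-difference check, column j = dp / dtheta_j
eps = 1e-6
J_num = np.column_stack([
    (softmax(theta + eps * np.eye(3)[j]) - softmax(theta - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
assert np.allclose(J, J_num, atol=1e-6)
```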


Page 32:

Variational Form of Argmax

△ = { p ∈ ℝᴺ : p ≥ 0, 1⊤p = 1 }

(plots: the simplex for N = 2, a segment with vertices p = [1, 0] and p = [0, 1] and midpoint p = [1/2, 1/2];
and for N = 3, a triangle with vertices including p = [0, 1, 0] and p = [0, 0, 1] and center p = [1/3, 1/3, 1/3])


Page 39:

Variational Form of Argmax

max_j θ_j = max_{p ∈ △} p⊤θ      (Fundamental Thm. of Linear Programming; Dantzig et al., 1955)

N = 2:  θ = [.2, 1.4]  ⇒  p⋆ = [0, 1]
N = 3:  θ = [.7, .1, 1.5]  ⇒  p⋆ = [0, 0, 1]
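The identity can be illustrated numerically; a small sketch (not part of the talk) using scipy.optimize.linprog to maximize p⊤θ over the simplex, which returns a vertex, i.e. the indicator of the argmax:

```python
import numpy as np
from scipy.optimize import linprog

theta = np.array([0.7, 0.1, 1.5])

# maximize p^T theta over the simplex {p >= 0, 1^T p = 1}
# (linprog minimizes, so pass -theta as the cost)
res = linprog(c=-theta, A_eq=np.ones((1, 3)), b_eq=[1.0], bounds=(0, None))

print(res.x)                              # a vertex of the simplex, here [0, 0, 1]
assert np.isclose(-res.fun, theta.max())  # max_j theta_j = max_{p in simplex} p^T theta
```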


Page 43:

Smoothed Max Operators

π_Ω(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax:      Ω(p) = 0
softmax:     Ω(p) = Σ_j p_j log p_j
sparsemax:   Ω(p) = 1/2 ∥p∥²₂
fusedmax:    Ω(p) = 1/2 ∥p∥²₂ + Σ_j |p_j − p_{j−1}|
csparsemax:  Ω(p) = 1/2 ∥p∥²₂ + ι(a ≤ p ≤ b)

(plot: p1 as a function of θ1; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7])

(Martins and Astudillo, 2016; Niculae and Blondel, 2017)

Page 44:

softmax

Page 45:

sparsemax

Page 46:

fusedmax ?!


Page 49:

Sparsemax

sparsemax(θ) = argmax_{p ∈ △} p⊤θ − 1/2 ∥p∥²₂ = argmin_{p ∈ △} ∥p − θ∥²₂

Computation:
p⋆ = [θ − τ1]₊ ;   θ_i > θ_j ⇒ p_i ≥ p_j ;   O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass:
J_sparsemax = diag(s) − (1/|S|) s s⊤,  where S = { j : p⋆_j > 0 },  s_j = ⟦j ∈ S⟧
(Martins and Astudillo, 2016)

argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)
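A minimal NumPy sketch of the computation and backward pass described above (an illustration only; the function names are made up here, and the talk's own structured code lives at github.com/vene/sparsemap):

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex (Martins & Astudillo, 2016):
    p* = [theta - tau]_+ with tau chosen so that p* sums to 1."""
    z = np.sort(theta)[::-1]            # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0          # coordinates kept in the support
    k_star = support.sum()
    tau = cssv[k_star - 1] / k_star     # the threshold
    return np.maximum(theta - tau, 0.0)

def sparsemax_jvp(p_star, v):
    """Multiply the sparsemax Jacobian at p* by a vector v:
    J = diag(s) - (1/|S|) s s^T, with s the 0/1 indicator of the support S."""
    s = p_star > 0
    out = np.zeros_like(v)
    out[s] = v[s] - v[s].mean()
    return out

p = sparsemax(np.array([0.2, 1.4, 1.0]))          # -> [0. , 0.7, 0.3]: sparse, sums to 1
g = sparsemax_jvp(p, np.array([1.0, 2.0, 3.0]))   # zero gradient outside the support
```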


Page 53:

Structured Prediction, finally

Page 54:

Structured Prediction is essentially a (very high-dimensional) argmax

c1, c2, · · ·, cN;  input x → θ → p → output y

There are exponentially many structures (θ cannot fit in memory!)


Page 57:

Factorization Into Parts:  θ = A⊤η

Dependency parsing example ("⋆ dog on wheels"): the rows of A are arcs
(⋆→dog, on→dog, wheels→dog, ⋆→on, dog→on, wheels→on, ⋆→wheels, dog→wheels, on→wheels);
each column of A is the 0/1 indicator of the arcs used by one tree;
η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1] holds the per-arc scores.

Alignment example ("dog on wheels" ↔ "hond op wielen"): the rows of A are links
(dog—hond, dog—op, dog—wielen, on—hond, on—op, on—wielen, wheels—hond, wheels—op, wheels—wielen);
each column is the 0/1 indicator of one matching; η again holds the per-link scores.
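To make the factorization concrete, here is a toy NumPy sketch; the two candidate trees used as columns of A are illustrative choices, not the columns shown on the slide:

```python
import numpy as np

# Arc (part) scores eta for "* dog on wheels", in the row order listed above.
arcs = ["*->dog", "on->dog", "wheels->dog", "*->on", "dog->on",
        "wheels->on", "*->wheels", "dog->wheels", "on->wheels"]
eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])

# Columns of A: one 0/1 arc-indicator vector per candidate tree (illustrative).
tree_1 = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1])   # *->dog, dog->on, on->wheels
tree_2 = np.array([0, 1, 0, 1, 0, 0, 0, 1, 0])   # *->on, on->dog, dog->wheels
A = np.column_stack([tree_1, tree_2])

# theta = A^T eta: each structure's score is the sum of its part scores.
theta = A.T @ eta
print(theta)   # [0.8, 0.7]
```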



Page 73:

argmax      argmax_{p ∈ △} p⊤θ              MAP         argmax_{μ ∈ M} μ⊤η
softmax     argmax_{p ∈ △} p⊤θ + H(p)       marginals   argmax_{μ ∈ M} μ⊤η + H̃(μ)
sparsemax   argmax_{p ∈ △} p⊤θ − 1/2 ∥p∥²   SparseMAP   argmax_{μ ∈ M} μ⊤η − 1/2 ∥μ∥²

MAP: e.g. dependency parsing → max. spanning tree; matching → the Hungarian algorithm

marginals: e.g. sequence labelling → forward-backward (Rabiner, 1989); as attention: (Kim et al., 2017)
  dependency parsing → the Matrix-Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007); as attention: (Liu and Lapata, 2018)
  but e.g. matchings → #P-complete! (Taskar, 2004; Valiant, 1979)

SparseMAP: (Niculae, Martins, Blondel, and Cardie, 2018)

M := conv{ a_y : y ∈ Y } = { Ap : p ∈ △ } = { E_{Y∼p}[a_Y] : p ∈ △ }

Page 74:

SparseMAP Solution

μ⋆ = argmax_{μ ∈ M} μ⊤η − 1/2 ∥μ∥²

(figure: μ⋆ = .6 × one parse tree + .4 × another)

μ⋆ = A p⋆ with very sparse p⋆ ∈ △_N

Page 75:

Algorithms for SparseMAP

μ⋆ = argmax_{μ ∈ M} μ⊤η − 1/2 ∥μ∥²
(quadratic objective, linear constraints; alas, exponentially many!)

Conditional Gradient (Frank and Wolfe, 1956; Lacoste-Julien and Jaggi, 2015)
• select a new corner of M:  a_{y⋆} = argmax_{μ ∈ M} μ⊤(η − μ^(t−1)) = argmax_{μ ∈ M} μ⊤η̃
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Vinyes and Obozinski, 2017)
  Active Set achieves finite & linear convergence!

Backward pass: ∂μ/∂η is sparse; computing (∂μ/∂η)⊤ dy takes O(dim(μ) nnz(p⋆))

Completely modular: just add MAP
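A rough sketch of the vanilla conditional-gradient loop for this objective, assuming a black-box map_oracle (e.g. a maximum spanning tree solver). The talk's solver uses the active-set variant, so this is only meant to show the "just add MAP" modularity:

```python
import numpy as np

def sparsemap_cg(eta, map_oracle, n_iter=100, tol=1e-9):
    """Vanilla conditional-gradient sketch for
    max_{mu in M} mu^T eta - 1/2 ||mu||^2, with M = conv{a_y : y in Y}.

    map_oracle(scores) is assumed to return the 0/1 part-indicator vector
    a_{y*} of the highest-scoring structure; it is the only
    structure-specific piece of the algorithm.
    """
    a0 = map_oracle(eta).astype(float)
    mu = a0.copy()
    atoms, p = [a0], [1.0]                    # selected corners of M and their coefficients

    for _ in range(n_iter):
        grad = mu - eta                       # gradient of f(mu) = 1/2 ||mu||^2 - eta^T mu
        a = map_oracle(-grad).astype(float)   # new corner: MAP with scores eta - mu
        d = a - mu
        gap = -grad @ d                       # Frank-Wolfe duality gap (>= 0)
        if gap < tol:
            break
        step = np.clip(gap / (d @ d), 0.0, 1.0)   # exact line search for the quadratic
        mu = mu + step * d
        p = [(1.0 - step) * q for q in p]         # shrink old coefficients
        for i, b in enumerate(atoms):             # merge or append the new corner
            if np.array_equal(a, b):
                p[i] += step
                break
        else:
            atoms.append(a)
            p.append(step)
    return mu, atoms, p
```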


Page 86:

Structured Attention & Graphical Models


Page 88:

Structured Attention for Alignments

NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.

input (P, H) → attention between premise words (A, gentleman, overlooking, ..., situation)
and hypothesis words (A, police, officer, ..., closely) → output: entails / contradicts / neutral

(Model: ESIM (Chen et al., 2017))


Page 91:

Structured Attention for Alignments

(Proposed model: global matching)

Page 92:

(bar charts: SNLI accuracy (3-class), axis 85.5%–87%, and MultiNLI accuracy, axis 75%–76.5%, comparing softmax, matching, and sequence attention)

Page 93:

(figure: example alignment weights between "a gentleman overlooking a neighborhood situation ." and "a police officer watches a situation closely .")

Page 94:

(figure: further example alignments between the same premise and hypothesis)

Page 95:

Dynamically inferring the computation graph

Page 96:

Dependency TreeLSTM

The bears eat the pretty ones

(Tai et al., 2015)

closely related to GCNs, e.g. (Kipf and Welling, 2017; Marcheggiani and Titov, 2017)


Page 103:

Latent Dependency TreeLSTM

input x → latent tree h ∈ H over "The bears eat the pretty ones" → output y

p(y | x) = Σ_{h ∈ H} p(y | h, x) p(h | x)

(Niculae, Martins, and Cardie, 2018)


Page 105:

Structured Latent Variable Models

p(y | x) = Σ_{h ∈ H} p_φ(y | h, x) p_π(h | x)

  p_φ: e.g., a TreeLSTM defined by h
  p_π: a parsing model, using some score_π(h; x)
  the sum over all possible trees h ∈ H is exponentially large!

How to define p_π?  How to compute ∂p(y | x)/∂π?

idea 1:  p_π(h | x) = 1 if h = h⋆ else 0          (argmax)
idea 2:  p_π(h | x) ∝ exp(score_π(h; x))          (softmax)
idea 3:  SparseMAP


Pages 120–122: SparseMAP

[figure: the SparseMAP solution puts nonzero weight on only a few trees,
decomposing as 0.7 · (tree 1) + 0.3 · (tree 2) + 0 · (every other tree) + …]

p(y | x) = 0.7 · p_φ(y | tree 1) + 0.3 · p_φ(y | tree 2)
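A sketch of the consequence this has downstream (hypothetical weights matching the slide; tree_lstm_predict is only a stand-in for p_φ): the trees with exactly zero weight never need to be instantiated or run.

import numpy as np

sparse_posterior = {"tree_1": 0.7, "tree_2": 0.3}   # SparseMAP weights; every other tree gets exactly 0

def tree_lstm_predict(tree, x):
    """Stand-in for p_phi(y | h, x); in the real model, a TreeLSTM composed along h."""
    fake = {"tree_1": np.array([0.6, 0.3, 0.1]),
            "tree_2": np.array([0.2, 0.5, 0.3])}
    return fake[tree]

x = "dog on wheels"
p_y = sum(w * tree_lstm_predict(h, x) for h, w in sparse_posterior.items())
print(p_y)   # = 0.7 · p_phi(y | tree_1) + 0.3 · p_phi(y | tree_2)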

Pages 123–132:

[bar charts]
Sentiment classification (SST): LTR, Flat, CoreNLP, Latent; accuracy (binary), axis 80%–85%
Natural Language Inference (SNLI): LTR, Flat, CoreNLP, Latent; accuracy (3-class), axis 80.6%–82%
Reverse dictionary lookup (definitions, concepts): LTR, Flat, Latent; accuracy@10, axis 30%–38%

⋆ The bears eat the pretty ones   (Left-to-right: regular LSTM)
⋆ The bears eat the pretty ones   (Flat: bag-of-words-like)
⋆ The bears eat the pretty ones   (CoreNLP: off-line parser)

Sentence pair classification (P, H):
p(y | P, H) = ∑_{h_P ∈ H(P)} ∑_{h_H ∈ H(H)} p_φ(y | h_P, h_H) p_π(h_P | P) p_π(h_H | H)

Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016); instead of p(y | x), we model E_{p_π}[g(x)] = ∑_{h ∈ H} g(x; h) p_π(h | x)
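For the reverse dictionary task, a rough sketch of the expectation form above (the encoder, weights, and candidate embeddings are all made up): the predicted embedding is the p_π-weighted average of per-tree encodings, and candidates are then ranked by cosine similarity for accuracy@10.

import numpy as np

posterior = {"tree_1": 0.8, "tree_2": 0.2}      # sparse p_pi(h | x)

def encode_with_tree(x, tree):
    """Stand-in for g(x; h): a sentence encoding composed along tree h."""
    fake = {"tree_1": np.array([0.1, 0.9, 0.0]),
            "tree_2": np.array([0.5, 0.5, 0.0])}
    return fake[tree]

definition = "a domesticated carnivorous mammal"
predicted = sum(w * encode_with_tree(definition, h) for h, w in posterior.items())

# Rank candidate word embeddings by cosine similarity (made-up vectors).
candidates = {"dog": np.array([0.2, 0.9, 0.1]), "cat": np.array([0.9, 0.1, 0.3])}
cos = {w: v @ predicted / (np.linalg.norm(v) * np.linalg.norm(predicted))
       for w, v in candidates.items()}
print(sorted(cos, key=cos.get, reverse=True))   # words ranked by similarity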

Pages 133–135: Syntax vs. Composition Order

[latent parse trees and their probabilities]
⋆ lovely and poignant .   p = 22.6%
⋆ lovely and poignant .   CoreNLP parse, p = 21.4%
· · ·
⋆ a deep and meaningful film .   p = 15.33%
⋆ a deep and meaningful film .   p = 15.27%
· · ·
⋆ a deep and meaningful film .   CoreNLP parse, p = 0%

Page 136: Conclusions

Differentiable & sparse structured inference
Generic, extensible algorithms
Interpretable structured attention
Dynamically-inferred computation graphs

[email protected]   github.com/vene/sparsemap   https://vene.ro   @vnfrombucharest

[thumbnails of earlier figures: structured attention between "a police officer watches a situation closely ." and "a gentleman overlooking a neighborhood situation ."; latent parses of "⋆ lovely and poignant ." with p = 22.6% and p = 21.4%]

Page 137: Extra slides

Page 138: Acknowledgements

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.

Some icons by Dave Gandy and Freepik via flaticon.com.

Pages 139–141: Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)

Let φ : R^d × Z → R, with Z ⊂ R^d compact. Then
∂ max_{z∈Z} φ(x, z) = conv{ ∇_x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z) }.

Example: maximum of a vector. With φ(p, θ) = p⊤θ and △ the probability simplex,
∂ max_{j∈[d]} θ_j = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{ ∇_θ φ(p⋆, θ) } = conv{ p⋆ }.

[plots for θ = [t, 0], t ∈ [−1, +1]: max_j θ_j, and {g_1 | g ∈ ∂ max_j θ_j}]
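A small numerical check of this example (a sketch, not from the talk): away from ties, the one-hot indicator of the argmax is the gradient of max_j θ_j, which finite differences confirm; at a tie, any convex combination of such indicators is a valid subgradient.

import numpy as np

def max_and_subgradient(theta):
    """Return max_j theta_j and one element of its subdifferential, conv{p*}."""
    g = np.zeros_like(theta)
    g[np.argmax(theta)] = 1.0       # indicator of one maximizer, p* = e_{j*}
    return theta.max(), g

theta = np.array([0.3, 0.0])        # theta = [t, 0] with t = 0.3 > 0
f, g = max_and_subgradient(theta)

eps = 1e-6
fd = np.array([(np.max(theta + eps * e) - np.max(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(g, fd)                        # both [1, 0]; at t = 0, any [a, 1 - a] would be valid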

Pages 142–143: Fusedmax

fusedmax(θ) = argmax_{p∈△} ( p⊤θ − ½‖p‖² − ∑_{2≤j≤d} |p_j − p_{j−1}| )
            = argmin_{p∈△} ( ½‖p − θ‖² + ∑_{2≤j≤d} |p_j − p_{j−1}| )

prox_fused(θ) = argmin_{p∈R^d} ( ½‖p − θ‖² + ∑_{2≤j≤d} |p_j − p_{j−1}| )

"Fused Lasso", a.k.a. 1-d Total Variation (Tibshirani et al., 2005)

Proposition (Niculae and Blondel, 2017): fusedmax(θ) = sparsemax(prox_fused(θ))
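The proposition is easy to exercise once sparsemax itself is available. Below is a minimal sketch of the standard sort-based sparsemax projection (Martins and Astudillo, 2016); the fused-lasso prox (1-d total variation) is not reproduced here, so composing this function with any off-the-shelf TV solver is left as the final step.

import numpy as np

def sparsemax(theta):
    """Euclidean projection onto the probability simplex (Martins and Astudillo, 2016):
    argmin_{p in simplex} ||p - theta||^2, computed by sorting."""
    z = np.sort(theta)[::-1]                 # scores in decreasing order
    cssz = np.cumsum(z) - 1.0                # cumulative sum minus the total mass 1
    k = np.arange(1, len(theta) + 1)
    support = z > cssz / k                   # prefix of coordinates kept in the support
    tau = cssz[support][-1] / support.sum()  # threshold
    return np.maximum(theta - tau, 0.0)

print(sparsemax(np.array([1.5, 1.2, -0.3])))   # [0.65, 0.35, 0.] -- exact zeros

# fusedmax(theta) would then be sparsemax(prox_fused(theta)), with prox_fused the
# 1-d total-variation (fused lasso) prox supplied by any standard solver.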

Page 144: Example: Source Sentence with Three Words

[attention heatmaps with fertility bars; color scale 0–1]
Attention weights over the three source words, under each transformation:

softmax:     (0.52, 0.35, 0.13)   (0.36, 0.44, 0.20)   (0.18, 0.27, 0.55)
sparsemax:   (0.70, 0.30, 0.00)   (0.40, 0.60, 0.00)   (0.00, 0.15, 0.85)
csoftmax:    (0.52, 0.35, 0.13)   (0.36, 0.44, 0.20)   (0.12, 0.21, 0.67)
csparsemax:  (0.70, 0.30, 0.00)   (0.30, 0.70, 0.00)   (0.00, 0.00, 1.00)

Page 145: e.g., fertility constraints for NMT

[word-alignment attention heatmaps for two English–German sentence pairs]

constrained softmax: (Martins and Kreutzer, 2017)   constrained sparsemax: (Malaviya et al., 2018)
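To make the constrained variants concrete, here is a rough sketch of constrained sparsemax as a projection onto the fertility-capped simplex; the scores and the bounds u are made up, and Malaviya et al. (2018) give the exact algorithm used in practice, so this bisection is only illustrative.

import numpy as np

def csparsemax(theta, u, n_iter=60):
    """Rough sketch of constrained sparsemax: project theta onto the capped simplex
    {p : 0 <= p <= u, sum(p) = 1}, assuming sum(u) >= 1. By the KKT conditions the
    solution is p_j = clip(theta_j - tau, 0, u_j) for some threshold tau; the sum of
    the clipped values is non-increasing in tau, so bisection finds it."""
    lo = theta.min() - u.max()          # here every coordinate clips to u_j, so the sum is >= 1
    hi = theta.max()                    # here every coordinate clips to 0, so the sum is 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if np.clip(theta - tau, 0.0, u).sum() > 1.0:
            lo = tau
        else:
            hi = tau
    return np.clip(theta - 0.5 * (lo + hi), 0.0, u)

theta = np.array([1.0, 0.4, -0.7])      # made-up attention scores over 3 source words
u = np.array([0.7, 0.7, 1.0])           # made-up fertility upper bounds
print(csparsemax(theta, u))             # approx. [0.7, 0.3, 0.]: the first word hits its cap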

Page 146:

[structured attention between "a police officer watches a situation closely ." and "a gentleman overlooking a neighborhood situation ."]

Pages 147–148: Structured Output Prediction

SparseMAP loss:
L_A(η, μ̄) = max_{μ∈M} ( η⊤μ − ½‖μ‖² ) − η⊤μ̄ + ½‖μ̄‖²

cost-SparseMAP loss:
L^ρ_A(η, μ̄) = max_{μ∈M} ( η⊤μ − ½‖μ‖² + ρ(μ, μ̄) ) − η⊤μ̄ + ½‖μ̄‖²

Instance of a structured Fenchel-Young loss, like CRF, SVM, etc. (Blondel, Martins, and Niculae, 2019)
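As a sanity-check sketch (not the talk's code), the unstructured special case where M is just the probability simplex reduces L_A to the sparsemax loss; the inner maximizer is sparsemax(η), and the gradient in η is the difference between the predicted and the gold points.

import numpy as np

def sparsemax(theta):
    # projection onto the probability simplex (same routine as in the fusedmax sketch above)
    z = np.sort(theta)[::-1]
    cssz = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z > cssz / k
    tau = cssz[support][-1] / support.sum()
    return np.maximum(theta - tau, 0.0)

def sparsemap_loss_unstructured(eta, mu_true):
    """Hypothetical helper: L_A(eta, mu_true) with M = probability simplex, i.e. the
    sparsemax loss. The gradient with respect to eta is sparsemax(eta) - mu_true."""
    mu_star = sparsemax(eta)
    value = (eta @ mu_star - 0.5 * mu_star @ mu_star
             - eta @ mu_true + 0.5 * mu_true @ mu_true)
    return value, mu_star - mu_true

eta = np.array([0.5, 0.2, -0.1])          # made-up scores
mu_true = np.array([0.0, 1.0, 0.0])       # one-hot gold "structure"
value, grad = sparsemap_loss_unstructured(eta, mu_true)
print(value, grad)                        # loss >= 0; it is 0 iff sparsemax(eta) equals mu_true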

Pages 154–158: References

Amos, Brandon and J. Zico Kolter (2017). "OptNet: Differentiable optimization as a layer in neural networks". In: Proc. of ICML.
Bertsekas, Dimitri P. (1999). Nonlinear Programming. Athena Scientific, Belmont.
Blondel, Mathieu, André F. T. Martins, and Vlad Niculae (2019). "Learning with Fenchel-Young losses". In: preprint arXiv:1901.02324.
Brucker, Peter (1984). "An O(n) algorithm for quadratic knapsack problems". In: Operations Research Letters 3.3, pp. 163–166.
Chen, Qian et al. (2017). "Enhanced LSTM for natural language inference". In: Proc. of ACL.
Condat, Laurent (2016). "Fast projection onto the simplex and the ℓ1 ball". In: Mathematical Programming 158.1-2, pp. 575–585.
Danskin, John M. (1966). "The theory of max-min, with applications". In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.
Dantzig, George B., Alex Orden, and Philip Wolfe (1955). "The generalized simplex method for minimizing a linear form under linear inequality restraints". In: Pacific Journal of Mathematics 5.2, pp. 183–195.
Frank, Marguerite and Philip Wolfe (1956). "An algorithm for quadratic programming". In: Nav. Res. Log. 3.1-2, pp. 95–110.
Gould, Stephen et al. (2016). "On differentiating parameterized argmin and argmax problems with application to bi-level optimization". In: preprint arXiv:1607.05447.
Held, Michael, Philip Wolfe, and Harlan P. Crowder (1974). "Validation of subgradient optimization". In: Mathematical Programming 6.1, pp. 62–88.
Hill, Felix et al. (2016). "Learning to understand phrases by embedding the dictionary". In: TACL 4.1, pp. 17–30.
Kim, Yoon et al. (2017). "Structured attention networks". In: Proc. of ICLR.
Kipf, Thomas N. and Max Welling (2017). "Semi-supervised classification with graph convolutional networks". In: Proc. of ICLR.
Koo, Terry et al. (2007). "Structured prediction models via the matrix-tree theorem". In: Proc. of EMNLP.
Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Proc. of NeurIPS.
Liu, Yang and Mirella Lapata (2018). "Learning structured text representations". In: TACL 6, pp. 63–75.
Malaviya, Chaitanya, Pedro Ferreira, and André F. T. Martins (2018). "Sparse and constrained attention for neural machine translation". In: Proc. of ACL.
Marcheggiani, Diego and Ivan Titov (2017). "Encoding sentences with graph convolutional networks for semantic role labeling". In: Proc. of EMNLP.
Martins, André F. T. and Ramón Fernandez Astudillo (2016). "From softmax to sparsemax: A sparse model of attention and multi-label classification". In: Proc. of ICML.
Martins, André F. T. and Julia Kreutzer (2017). "Learning what's easy: Fully differentiable neural easy-first taggers". In: Proc. of EMNLP, pp. 349–362.
McDonald, Ryan T. and Giorgio Satta (2007). "On the complexity of non-projective data-driven dependency parsing". In: Proc. of IWPT.
Niculae, Vlad and Mathieu Blondel (2017). "A regularized framework for sparse and structured neural attention". In: Proc. of NeurIPS.
Niculae, Vlad, André F. T. Martins, Mathieu Blondel, et al. (2018). "SparseMAP: Differentiable sparse structured inference". In: Proc. of ICML.
Niculae, Vlad, André F. T. Martins, and Claire Cardie (2018). "Towards dynamic computation graphs via sparse latent structure". In: Proc. of EMNLP.
Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer, New York.
Rabiner, Lawrence R. (1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". In: Proc. IEEE 77.2, pp. 257–286.
Smith, David A. and Noah A. Smith (2007). "Probabilistic models of nonprojective dependency trees". In: Proc. of EMNLP.
Tai, Kai Sheng, Richard Socher, and Christopher D. Manning (2015). "Improved semantic representations from tree-structured Long Short-Term Memory networks". In: Proc. of ACL-IJCNLP.
Taskar, Ben (2004). "Learning structured prediction models: A large margin approach". PhD thesis. Stanford University.
Tibshirani, Robert et al. (2005). "Sparsity and smoothness via the fused lasso". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.
Valiant, Leslie G. (1979). "The complexity of computing the permanent". In: Theor. Comput. Sci. 8.2, pp. 189–201.
Vinyes, Marina and Guillaume Obozinski (2017). "Fast column generation for atomic norm regularization". In: Proc. of AISTATS.
Wolfe, Philip (1976). "Finding the nearest point in a polytope". In: Mathematical Programming 11.1, pp. 128–149.