vene.ro/talks/19-edinburgh.pdf
TRANSCRIPT
Learning with Sparse Latent Structure
Vlad Niculae
Instituto de Telecomunicações
Work with: André Martins, Claire Cardie, Mathieu Blondel
github.com/vene/sparsemap @vnfrombucharest
Structured Prediction
Examples for "dog on wheels": part-of-speech tag sequences (VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, ...), dependency trees over "⋆ dog on wheels", and word alignments between "dog on wheels" and "hond op wielen".
Latent Structure Models
input → (latent structure) → output, e.g. positive / neutral / negative
*record scratch*
*freeze frame*
How to select an item from a set?
Candidates c1, c2, ..., cN; scores θ; selection p; input x; output y.
θ = f1(x; w),    y = f2(p, x; w)
∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
Argmax
∂p/∂θ = 0 (almost everywhere)
(Plot: p1 as a function of θ1 is a step at θ1 = θ2, flat at 0 and 1 elsewhere.)
Argmax vs. Softmax
pj = exp(θj)/Z,    ∂p/∂θ = diag(p) − p p⊤
(Plot: p1 as a function of θ1 is a smooth sigmoid around θ1 = θ2.)
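To make the contrast concrete, here is a minimal NumPy sketch (my own, not from the slides) of softmax and its Jacobian diag(p) − p p⊤, which is dense and nonzero everywhere, unlike the argmax Jacobian above:

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax: p_j = exp(theta_j) / Z."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def softmax_jacobian(theta):
    """Jacobian of softmax: diag(p) - p p^T (dense, never exactly zero)."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.2, 1.4, -0.3])
print(softmax(theta))           # all entries strictly positive
print(softmax_jacobian(theta))  # dense Jacobian
```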
Variational Form of Argmax
△ := {p ∈ R^N : p ≥ 0, 1⊤p = 1}
(Plots of the simplex: for N = 2, the segment with vertices p = [1, 0] and p = [0, 1] and midpoint p = [1/2, 1/2]; for N = 3, the triangle with vertices such as p = [0, 1, 0] and p = [0, 0, 1] and center p = [1/3, 1/3, 1/3].)
Variational Form of Argmax
max_j θj = max_{p∈△} p⊤θ    (Fundamental Thm. of Lin. Prog.; Dantzig et al., 1955)
(Example: for N = 2, θ = [.2, 1.4] gives p⋆ = [0, 1]; for N = 3, θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1].)
Smoothed Max Operators
πΩ(θ) = argmax_{p∈△} p⊤θ − Ω(p)
argmax:     Ω(p) = 0
softmax:    Ω(p) = Σj pj log pj
sparsemax:  Ω(p) = 1/2 ∥p∥²₂    (Martins and Astudillo, 2016)
fusedmax:   Ω(p) = 1/2 ∥p∥²₂ + Σj |pj − pj−1|
csparsemax: Ω(p) = 1/2 ∥p∥²₂ + ι(a ≤ p ≤ b)
(Niculae and Blondel, 2017)
(Plot: p1 as a function of θ1, with example output vectors [0, 0, 1], [.3, .2, .5], and [.3, 0, .7].)
Sparsemax
sparsemax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ = argmin_{p∈△} 1/2 ∥p − θ∥²₂
Computation: p⋆ = [θ − τ1]₊;  θi > θj ⇒ pi ≥ pj;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)
Backward pass: J_sparsemax = diag(s) − (1/|S|) s s⊤, where S = {j : p⋆j > 0}, sj = ⟦j ∈ S⟧
(Martins and Astudillo, 2016); argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)
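A minimal NumPy sketch of the two formulas above (not the talk's official implementation; see github.com/vene/sparsemap for that): the forward pass by sorting, and the Jacobian-vector product using the support S:

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex: p* = [theta - tau 1]_+."""
    z = np.sort(theta)[::-1]                 # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0               # candidate support sizes
    rho = k[support][-1]                     # size of the support
    tau = cssv[support][-1] / rho            # threshold
    return np.maximum(theta - tau, 0.0)

def sparsemax_jvp(p_star, v):
    """Multiply the sparsemax Jacobian diag(s) - (1/|S|) s s^T by a vector v."""
    s = (p_star > 0).astype(float)
    v_masked = s * v
    return v_masked - s * (v_masked.sum() / s.sum())

theta = np.array([0.7, 0.1, 1.5])
print(sparsemax(theta))                      # sparse: some entries exactly 0
```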
Structured Prediction, finally
Structured prediction is essentially a (very high-dimensional) argmax over candidates c1, ..., cN, but there are exponentially many structures (θ cannot fit in memory!).
Factorization Into Parts: θ = A⊤η
Dependency parsing over "⋆ dog on wheels": the rows of A are arcs (parts) ⋆→dog, on→dog, wheels→dog, ⋆→on, dog→on, wheels→on, ⋆→wheels, dog→wheels, on→wheels; each column of A is the 0/1 indicator of the arcs used by one tree, and η holds one score per arc, e.g. η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1].
Alignment between "dog on wheels" and "hond op wielen": the parts are links dog—hond, dog—op, dog—wielen, on—hond, on—op, on—wielen, wheels—hond, wheels—op, wheels—wielen; each column of A indicates the links used by one matching.
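A small sketch of the factorization with made-up numbers (the A and η below are illustrative, not the slide's exact example): each column of A is one structure, each row one part, and the structure scores come out as θ = A⊤η:

```python
import numpy as np

# Toy example: 4 parts (rows) and 3 candidate structures (columns).
# Column j indicates which parts structure j uses.
A = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

eta = np.array([0.1, 0.2, -0.1, 0.3])   # one score per part
theta = A.T @ eta                        # one score per structure
print(theta)                             # [0.0, 0.5, 0.1]
```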
Over the simplex △:
argmax      argmax_{p∈△} p⊤θ
softmax     argmax_{p∈△} p⊤θ + H(p)
sparsemax   argmax_{p∈△} p⊤θ − 1/2 ∥p∥²
Over the marginal polytope M := conv{a_y : y ∈ Y} = {Ap : p ∈ △} = {E_{Y∼p}[a_Y] : p ∈ △}:
MAP         argmax_{μ∈M} μ⊤η
            e.g. dependency parsing → max. spanning tree; matching → the Hungarian algorithm
marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
            e.g. sequence labelling → forward-backward (Rabiner, 1989), as attention: (Kim et al., 2017); dependency parsing → the Matrix-Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007), as attention: (Liu and Lapata, 2018); but matchings → #P-complete! (Taskar, 2004; Valiant, 1979)
SparseMAP   argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²    (Niculae, Martins, Blondel, and Cardie, 2018)
SparseMAP Solution
μ⋆ = argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²
e.g. μ⋆ = .6 · (one tree) + .4 · (another tree)
= A p⋆ with very sparse p⋆ ∈ △
Algorithms for SparseMAP
μ⋆ = argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²: a quadratic objective with linear constraints (alas, exponentially many!)
Conditional Gradient (Frank and Wolfe, 1956; Lacoste-Julien and Jaggi, 2015)
• select a new corner of M: a_{y⋆} = argmax_{μ∈M} μ⊤(η − μ^(t−1)), i.e. MAP with adjusted scores η̃
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Vinyes and Obozinski, 2017) achieves finite & linear convergence!
Completely modular: just add MAP.
Backward pass: ∂μ/∂η is sparse; computing (∂μ/∂η)⊤ dy takes O(dim(μ) · nnz(p⋆)).
A sketch of the vanilla update rule follows below.
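A minimal sketch of the conditional gradient loop (assumptions: a `map_oracle` callable returning the highest-scoring vertex a_y as a NumPy array; vanilla update rule only, not the active-set variant used in the paper):

```python
import numpy as np

def sparsemap_fw(map_oracle, eta, n_iter=100, tol=1e-9):
    """Vanilla conditional gradient (Frank-Wolfe) sketch for SparseMAP:
        maximize  mu^T eta - 1/2 ||mu||^2   over  mu in M,
    where M is accessed only through map_oracle(scores), assumed to return
    the vertex a_y of M with the highest score (e.g. a maximum spanning tree).
    Returns mu* and a sparse dict of structure -> probability."""
    mu = map_oracle(eta).astype(float)          # start from a MAP vertex
    p = {tuple(mu): 1.0}                        # sparse distribution over structures
    for _ in range(n_iter):
        grad = eta - mu                         # gradient of the concave objective
        a = map_oracle(grad).astype(float)      # best corner along the gradient
        d = a - mu
        gap = grad @ d                          # Frank-Wolfe duality gap
        if gap < tol:
            break
        gamma = min(1.0, gap / (d @ d))         # exact line search (quadratic objective)
        mu = mu + gamma * d
        p = {y: (1.0 - gamma) * w for y, w in p.items()}
        p[tuple(a)] = p.get(tuple(a), 0.0) + gamma
    return mu, p
```

As a sanity check, with a `map_oracle` that returns one-hot vectors over an explicit candidate list (so M is the simplex), this recovers sparsemax.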
Structured Attention & Graphical Models
Structured Attention for Alignments
NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.
input (P, H) → attention alignment between premise words (A, gentleman, overlooking, ..., situation) and hypothesis words (A, police, officer, ..., closely) → output: entails / contradicts / neutral
(Base model: ESIM (Chen et al., 2017); proposed model: global matching attention.)
(Results: SNLI accuracy (3-class) in the 85.5–87% range and MultiNLI in the 75–76.5% range, comparing softmax, matching, and sequence attention. Example attention alignment between "A gentleman overlooking a neighborhood situation." and "A police officer watches a situation closely.")
Dynamically inferring the computation graph
Dependency TreeLSTM (Tai et al., 2015)
Example: "The bears eat the pretty ones", composed along a dependency tree.
Closely related to GCNs, e.g. (Kipf and Welling, 2017; Marcheggiani and Titov, 2017).
Latent Dependency TreeLSTM (Niculae, Martins, and Cardie, 2018)
input x → latent tree h ∈ H over "The bears eat the pretty ones" → output y
p(y | x) = Σ_{h∈H} p(y | h, x) p(h | x)
Structured Latent Variable Models
p(y | x) = Σ_{h∈H} p_φ(y | h, x) p_π(h | x)
p_φ(y | h, x): e.g. a TreeLSTM defined by h; p_π(h | x): a parsing model, using some score_π(h; x); the sum ranges over all possible trees — an exponentially large sum!
How to define p_π, and how to compute ∂p(y | x)/∂π through that sum?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0 — argmax
idea 2: p_π(h | x) ∝ exp(score_π(h; x)) — softmax
idea 3: SparseMAP
SparseMAP
μ⋆ = .7 · (one tree) + .3 · (another tree) + 0 · (every other tree) + ...
p(y | x) = .7 p_φ(y | tree₁, x) + .3 p_φ(y | tree₂, x)
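Because p⋆ is sparse, the exponentially large sum collapses to its support. A schematic sketch (the names are illustrative, not the paper's code): `p_star` is a dict of structures and weights as returned by a SparseMAP solver like the one sketched above, and `downstream` stands for the hypothetical p_φ(y | h, x) computation, e.g. a TreeLSTM forward pass:

```python
def marginal_likelihood(p_star, downstream):
    """p(y | x) = sum_h p_phi(y | h, x) p_pi(h | x), with p_pi the sparse
    SparseMAP distribution: only structures with nonzero weight contribute."""
    return sum(w * downstream(h) for h, w in p_star.items() if w > 0.0)
```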
Sentiment classification (SST): accuracy (binary) in the 80–85% range, comparing LTR, Flat, CoreNLP, and Latent.
Natural Language Inference (SNLI): accuracy (3-class) in the 80.6–82% range, comparing LTR, Flat, CoreNLP, and Latent.
Reverse dictionary lookup (definitions and concepts): accuracy@10 in the 30–38% range, comparing LTR, Flat, and Latent.
Baselines over "⋆ The bears eat the pretty ones" — Left-to-right (LTR): regular LSTM; Flat: bag-of-words–like; CoreNLP: off-line parser.
Sentence pair classification (P, H): p(y | P, H) = Σ_{h_P∈H(P)} Σ_{h_H∈H(H)} p_φ(y | h_P, h_H) p_π(h_P | P) p_π(h_H | H)
Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016); instead of p(y | x), we model E_{p_π}[g(x)] = Σ_{h∈H} g(x; h) p_π(h | x).
Syntax vs. Composition Order
"⋆ lovely and poignant .": most probable latent tree at p = 22.6%; the CoreNLP parse gets p = 21.4%.
"⋆ a deep and meaningful film .": top latent trees at p = 15.33% and p = 15.27%; the CoreNLP parse gets p = 0%.
Conclusions
Differentiable & sparse structured inference
Generic, extensible algorithms
Interpretable structured attention
Dynamically-inferred computation graphs
[email protected] · github.com/vene/sparsemap · https://vene.ro · @vnfrombucharest
Extra slides
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.
Some icons by Dave Gandy and Freepik via flaticon.com.
Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)
Let φ : R^d × Z → R, with Z ⊂ R^d compact. Then
∂ max_{z∈Z} φ(x, z) = conv{∇_x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.
Example: maximum of a vector
∂ max_{j∈[d]} θj = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇_θ φ(p⋆, θ)} = conv{p⋆}
(Plots for θ = [t, 0], t ∈ [−1, +1]: max_j θj and {g1 | g ∈ ∂ max_j θj}.)
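A tiny numeric illustration (not from the slides) of the example: away from ties, the gradient of max_j θj matches the one-hot indicator of the argmax, as a finite-difference check confirms:

```python
import numpy as np

theta = np.array([0.2, 1.4, -0.3])
eps = 1e-6

# Finite-difference gradient of f(theta) = max_j theta_j.
grad_fd = np.array([
    (np.max(theta + eps * e) - np.max(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

# Danskin: the (sub)gradient is the argmax vertex p* of the variational form.
grad_danskin = np.eye(len(theta))[np.argmax(theta)]

print(grad_fd)        # ~[0., 1., 0.]
print(grad_danskin)   # [0., 1., 0.]
```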
Fusedmax
fusedmax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ − Σ_{2≤j≤d} |pj − pj−1|
            = argmin_{p∈△} 1/2 ∥p − θ∥²₂ + Σ_{2≤j≤d} |pj − pj−1|
prox_fused(θ) = argmin_{p∈R^d} 1/2 ∥p − θ∥²₂ + Σ_{2≤j≤d} |pj − pj−1|
Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))    (Niculae and Blondel, 2017)
"Fused Lasso" a.k.a. 1-d Total Variation (Tibshirani et al., 2005)
Example: Source Sentence with Three Words
softmax:     (0.52, 0.35, 0.13), (0.36, 0.44, 0.2), (0.18, 0.27, 0.55)
sparsemax:   (0.7, 0.3, 0), (0.4, 0.6, 0), (0, 0.15, 0.85)
csoftmax:    (0.52, 0.35, 0.13), (0.36, 0.44, 0.2), (0.12, 0.21, 0.67)
csparsemax:  (0.7, 0.3, 0), (0.3, 0.7, 0), (0, 0, 1)
(Bar charts of the resulting per-word fertilities, on a 0–1 scale.)
e.g., fertility constraints for NMT
(Figure: attention alignment examples for English–German translation.)
constrained softmax: (Martins and Kreutzer, 2017); constrained sparsemax: (Malaviya et al., 2018)
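A minimal sketch (my own, not from the talk's codebase) of constrained sparsemax as a Euclidean projection onto the box-constrained simplex, solved by bisection on the threshold τ; `lo` and `hi` are illustrative per-word bound arrays (e.g. remaining fertility budgets), assumed feasible:

```python
import numpy as np

def csparsemax(theta, lo, hi, n_iter=60):
    """Project theta onto {p : sum(p) = 1, lo <= p <= hi} by bisecting on tau,
    so that p = clip(theta - tau, lo, hi) sums to one.
    Assumes feasibility: sum(lo) <= 1 <= sum(hi)."""
    excess = lambda tau: np.clip(theta - tau, lo, hi).sum() - 1.0
    left = theta.min() - hi.max()     # excess(left) >= 0
    right = theta.max() - lo.min()    # excess(right) <= 0
    for _ in range(n_iter):
        mid = 0.5 * (left + right)
        if excess(mid) > 0:
            left = mid                # total mass too large: raise the threshold
        else:
            right = mid
    tau = 0.5 * (left + right)
    return np.clip(theta - tau, lo, hi)

theta = np.array([1.2, 0.3, -0.5])
lo, hi = np.zeros(3), np.full(3, 0.7)
print(csparsemax(theta, lo, hi))      # ~[0.7, 0.3, 0.], respects the 0.7 cap
```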
Structured Output Prediction
SparseMAP loss: L_A(η, μ̄) = max_{μ∈M} [η⊤μ − 1/2 ∥μ∥²] − η⊤μ̄ + 1/2 ∥μ̄∥²
cost-SparseMAP loss: L^ρ_A(η, μ̄) = max_{μ∈M} [η⊤μ − 1/2 ∥μ∥² + ρ(μ, μ̄)] − η⊤μ̄ + 1/2 ∥μ̄∥²
(μ̄ denotes the gold structure's vector.)
Instance of a structured Fenchel-Young loss, like CRF, SVM, etc. (Blondel, Martins, and Niculae, 2019)
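A sketch of this loss and its gradient in η, under the same assumptions as the conditional-gradient snippet earlier (the `sparsemap_fw` and `map_oracle` names refer to that sketch); by Danskin's theorem the gradient of the max term is μ⋆, so ∇_η L = μ⋆ − μ̄:

```python
import numpy as np

def sparsemap_loss(map_oracle, eta, mu_gold, n_iter=100):
    """SparseMAP (Fenchel-Young) loss and its gradient w.r.t. the part scores eta.
    mu_gold is the gold structure's indicator vector (the mu-bar above)."""
    mu_star, _ = sparsemap_fw(map_oracle, eta, n_iter=n_iter)
    loss = (eta @ mu_star - 0.5 * mu_star @ mu_star
            - eta @ mu_gold + 0.5 * mu_gold @ mu_gold)
    grad_eta = mu_star - mu_gold     # Danskin: d(max term)/d(eta) = mu_star
    return loss, grad_eta
```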
References
Amos, Brandon and J. Zico Kolter (2017). "OptNet: Differentiable optimization as a layer in neural networks". In: Proc. of ICML.
Bertsekas, Dimitri P. (1999). Nonlinear Programming. Athena Scientific.
Blondel, Mathieu, André F. T. Martins, and Vlad Niculae (2019). "Learning with Fenchel-Young losses". In: preprint arXiv:1901.02324.
Brucker, Peter (1984). "An O(n) algorithm for quadratic knapsack problems". In: Operations Research Letters 3.3, pp. 163–166.
Chen, Qian et al. (2017). "Enhanced LSTM for natural language inference". In: Proc. of ACL.
Condat, Laurent (2016). "Fast projection onto the simplex and the ℓ1 ball". In: Mathematical Programming 158.1-2, pp. 575–585.
Danskin, John M. (1966). "The theory of max-min, with applications". In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.
Dantzig, George B., Alex Orden, and Philip Wolfe (1955). "The generalized simplex method for minimizing a linear form under linear inequality restraints". In: Pacific Journal of Mathematics 5.2, pp. 183–195.
Frank, Marguerite and Philip Wolfe (1956). "An algorithm for quadratic programming". In: Nav. Res. Log. 3.1-2, pp. 95–110.
Gould, Stephen et al. (2016). "On differentiating parameterized argmin and argmax problems with application to bi-level optimization". In: preprint arXiv:1607.05447.
Held, Michael, Philip Wolfe, and Harlan P. Crowder (1974). "Validation of subgradient optimization". In: Mathematical Programming 6.1, pp. 62–88.
Hill, Felix et al. (2016). "Learning to understand phrases by embedding the dictionary". In: TACL 4.1, pp. 17–30.
Kim, Yoon et al. (2017). "Structured attention networks". In: Proc. of ICLR.
Kipf, Thomas N. and Max Welling (2017). "Semi-supervised classification with graph convolutional networks". In: Proc. of ICLR.
Koo, Terry et al. (2007). "Structured prediction models via the matrix-tree theorem". In: Proc. of EMNLP.
Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Proc. of NeurIPS.
Liu, Yang and Mirella Lapata (2018). "Learning structured text representations". In: TACL 6, pp. 63–75.
Malaviya, Chaitanya, Pedro Ferreira, and André F. T. Martins (2018). "Sparse and constrained attention for neural machine translation". In: Proc. of ACL.
Marcheggiani, Diego and Ivan Titov (2017). "Encoding sentences with graph convolutional networks for semantic role labeling". In: Proc. of EMNLP.
Martins, André F. T. and Ramón Fernandez Astudillo (2016). "From softmax to sparsemax: A sparse model of attention and multi-label classification". In: Proc. of ICML.
Martins, André F. T. and Julia Kreutzer (2017). "Learning what's easy: Fully differentiable neural easy-first taggers". In: Proc. of EMNLP, pp. 349–362.
McDonald, Ryan T. and Giorgio Satta (2007). "On the complexity of non-projective data-driven dependency parsing". In: Proc. of IWPT.
Niculae, Vlad and Mathieu Blondel (2017). "A regularized framework for sparse and structured neural attention". In: Proc. of NeurIPS.
Niculae, Vlad, André F. T. Martins, Mathieu Blondel, et al. (2018). "SparseMAP: Differentiable sparse structured inference". In: Proc. of ICML.
Niculae, Vlad, André F. T. Martins, and Claire Cardie (2018). "Towards dynamic computation graphs via sparse latent structure". In: Proc. of EMNLP.
Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.
Rabiner, Lawrence R. (1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". In: Proceedings of the IEEE 77.2, pp. 257–286.
Smith, David A. and Noah A. Smith (2007). "Probabilistic models of nonprojective dependency trees". In: Proc. of EMNLP.
Tai, Kai Sheng, Richard Socher, and Christopher D. Manning (2015). "Improved semantic representations from tree-structured Long Short-Term Memory networks". In: Proc. of ACL-IJCNLP.
Taskar, Ben (2004). "Learning structured prediction models: A large margin approach". PhD thesis. Stanford University.
Tibshirani, Robert et al. (2005). "Sparsity and smoothness via the fused lasso". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.
Valiant, Leslie G. (1979). "The complexity of computing the permanent". In: Theoretical Computer Science 8.2, pp. 189–201.
Vinyes, Marina and Guillaume Obozinski (2017). "Fast column generation for atomic norm regularization". In: Proc. of AISTATS.
Wolfe, Philip (1976). "Finding the nearest point in a polytope". In: Mathematical Programming 11.1, pp. 128–149.