vene.ro/talks/19-edinburgh.pdf
TRANSCRIPT
Learning with Sparse Latent Structure
Vlad Niculae
Instituto de Telecomunicações
Work with: André Martins, Claire Cardie, Mathieu Blondel
github.com/vene/sparsemap @vnfrombucharest
Structured Prediction
Examples for "dog on wheels": part-of-speech tag sequences (VERB PREP NOUN, NOUN PREP NOUN, NOUN DET NOUN, ...), dependency trees over "⋆ dog on wheels", and word alignments between "dog on wheels" and "hond op wielen".
Latent Structure Models
input → (latent structure) → output, e.g. positive / neutral / negative
*record scratch*
*freeze frame*
How to select an item from a set?
Candidates c1, c2, ..., cN; scores θ; selection p; input x; output y.
θ = f1(x; w),    y = f2(p, x; w)
∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
Argmax
∂p/∂θ = 0 (almost everywhere)
(Plot: p1 as a function of θ1 is a step at θ1 = θ2, flat at 0 and 1 elsewhere.)
Argmax vs. Softmax
pj = exp(θj)/Z,    ∂p/∂θ = diag(p) − p p⊤
(Plot: p1 as a function of θ1 is a smooth sigmoid around θ1 = θ2.)
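To make the contrast concrete, here is a minimal NumPy sketch (my own, not from the slides) of softmax and its Jacobian diag(p) − p p⊤, which is dense and nonzero everywhere, unlike the argmax Jacobian above:

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax: p_j = exp(theta_j) / Z."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def softmax_jacobian(theta):
    """Jacobian of softmax: diag(p) - p p^T (dense, never exactly zero)."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.2, 1.4, -0.3])
print(softmax(theta))           # all entries strictly positive
print(softmax_jacobian(theta))  # dense Jacobian
```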
Variational Form of Argmax
△ := {p ∈ R^N : p ≥ 0, 1⊤p = 1}
(Plots of the simplex: for N = 2, the segment with vertices p = [1, 0] and p = [0, 1] and midpoint p = [1/2, 1/2]; for N = 3, the triangle with vertices such as p = [0, 1, 0] and p = [0, 0, 1] and center p = [1/3, 1/3, 1/3].)
Variational Form of Argmax
max_j θj = max_{p∈△} p⊤θ    (Fundamental Thm. of Lin. Prog.; Dantzig et al., 1955)
(Example: for N = 2, θ = [.2, 1.4] gives p⋆ = [0, 1]; for N = 3, θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1].)
Smoothed Max Operators
πΩ(θ) = argmax_{p∈△} p⊤θ − Ω(p)
argmax:     Ω(p) = 0
softmax:    Ω(p) = Σj pj log pj
sparsemax:  Ω(p) = 1/2 ∥p∥²₂    (Martins and Astudillo, 2016)
fusedmax:   Ω(p) = 1/2 ∥p∥²₂ + Σj |pj − pj−1|
csparsemax: Ω(p) = 1/2 ∥p∥²₂ + ι(a ≤ p ≤ b)
(Niculae and Blondel, 2017)
(Plot: p1 as a function of θ1, with example output vectors [0, 0, 1], [.3, .2, .5], and [.3, 0, .7].)
Sparsemax
sparsemax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ = argmin_{p∈△} 1/2 ∥p − θ∥²₂
Computation: p⋆ = [θ − τ1]₊;  θi > θj ⇒ pi ≥ pj;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)
Backward pass: J_sparsemax = diag(s) − (1/|S|) s s⊤, where S = {j : p⋆j > 0}, sj = ⟦j ∈ S⟧
(Martins and Astudillo, 2016); argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)
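A minimal NumPy sketch of the two formulas above (not the talk's official implementation; see github.com/vene/sparsemap for that): the forward pass by sorting, and the Jacobian-vector product using the support S:

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the simplex: p* = [theta - tau 1]_+."""
    z = np.sort(theta)[::-1]                 # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0               # candidate support sizes
    rho = k[support][-1]                     # size of the support
    tau = cssv[support][-1] / rho            # threshold
    return np.maximum(theta - tau, 0.0)

def sparsemax_jvp(p_star, v):
    """Multiply the sparsemax Jacobian diag(s) - (1/|S|) s s^T by a vector v."""
    s = (p_star > 0).astype(float)
    v_masked = s * v
    return v_masked - s * (v_masked.sum() / s.sum())

theta = np.array([0.7, 0.1, 1.5])
print(sparsemax(theta))                      # sparse: some entries exactly 0
```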
Structured Prediction, finally
Structured prediction is essentially a (very high-dimensional) argmax over candidates c1, ..., cN, but there are exponentially many structures (θ cannot fit in memory!).
Factorization Into Parts: θ = A⊤η
Dependency parsing over "⋆ dog on wheels": the rows of A are arcs (parts) ⋆→dog, on→dog, wheels→dog, ⋆→on, dog→on, wheels→on, ⋆→wheels, dog→wheels, on→wheels; each column of A is the 0/1 indicator of the arcs used by one tree, and η holds one score per arc, e.g. η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1].
Alignment between "dog on wheels" and "hond op wielen": the parts are links dog—hond, dog—op, dog—wielen, on—hond, on—op, on—wielen, wheels—hond, wheels—op, wheels—wielen; each column of A indicates the links used by one matching.
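A small sketch of the factorization with made-up numbers (the A and η below are illustrative, not the slide's exact example): each column of A is one structure, each row one part, and the structure scores come out as θ = A⊤η:

```python
import numpy as np

# Toy example: 4 parts (rows) and 3 candidate structures (columns).
# Column j indicates which parts structure j uses.
A = np.array([[1, 0, 0],
              [0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

eta = np.array([0.1, 0.2, -0.1, 0.3])   # one score per part
theta = A.T @ eta                        # one score per structure
print(theta)                             # [0.0, 0.5, 0.1]
```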
Over the simplex △:
argmax      argmax_{p∈△} p⊤θ
softmax     argmax_{p∈△} p⊤θ + H(p)
sparsemax   argmax_{p∈△} p⊤θ − 1/2 ∥p∥²
Over the marginal polytope M := conv{a_y : y ∈ Y} = {Ap : p ∈ △} = {E_{Y∼p}[a_Y] : p ∈ △}:
MAP         argmax_{μ∈M} μ⊤η
            e.g. dependency parsing → max. spanning tree; matching → the Hungarian algorithm
marginals   argmax_{μ∈M} μ⊤η + H̃(μ)
            e.g. sequence labelling → forward-backward (Rabiner, 1989), as attention: (Kim et al., 2017); dependency parsing → the Matrix-Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007), as attention: (Liu and Lapata, 2018); but matchings → #P-complete! (Taskar, 2004; Valiant, 1979)
SparseMAP   argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²    (Niculae, Martins, Blondel, and Cardie, 2018)
SparseMAP Solution
μ⋆ = argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²
e.g. μ⋆ = .6 · (one tree) + .4 · (another tree)
= A p⋆ with very sparse p⋆ ∈ △
Algorithms for SparseMAP
μ⋆ = argmax_{μ∈M} μ⊤η − 1/2 ∥μ∥²: a quadratic objective with linear constraints (alas, exponentially many!)
Conditional Gradient (Frank and Wolfe, 1956; Lacoste-Julien and Jaggi, 2015)
• select a new corner of M: a_{y⋆} = argmax_{μ∈M} μ⊤(η − μ^(t−1)), i.e. MAP with adjusted scores η̃
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Vinyes and Obozinski, 2017) achieves finite & linear convergence!
Completely modular: just add MAP.
Backward pass: ∂μ/∂η is sparse; computing (∂μ/∂η)⊤ dy takes O(dim(μ) · nnz(p⋆)).
A sketch of the vanilla update rule follows below.
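A minimal sketch of the conditional gradient loop (assumptions: a `map_oracle` callable returning the highest-scoring vertex a_y as a NumPy array; vanilla update rule only, not the active-set variant used in the paper):

```python
import numpy as np

def sparsemap_fw(map_oracle, eta, n_iter=100, tol=1e-9):
    """Vanilla conditional gradient (Frank-Wolfe) sketch for SparseMAP:
        maximize  mu^T eta - 1/2 ||mu||^2   over  mu in M,
    where M is accessed only through map_oracle(scores), assumed to return
    the vertex a_y of M with the highest score (e.g. a maximum spanning tree).
    Returns mu* and a sparse dict of structure -> probability."""
    mu = map_oracle(eta).astype(float)          # start from a MAP vertex
    p = {tuple(mu): 1.0}                        # sparse distribution over structures
    for _ in range(n_iter):
        grad = eta - mu                         # gradient of the concave objective
        a = map_oracle(grad).astype(float)      # best corner along the gradient
        d = a - mu
        gap = grad @ d                          # Frank-Wolfe duality gap
        if gap < tol:
            break
        gamma = min(1.0, gap / (d @ d))         # exact line search (quadratic objective)
        mu = mu + gamma * d
        p = {y: (1.0 - gamma) * w for y, w in p.items()}
        p[tuple(a)] = p.get(tuple(a), 0.0) + gamma
    return mu, p
```

As a sanity check, with a `map_oracle` that returns one-hot vectors over an explicit candidate list (so M is the simplex), this recovers sparsemax.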
Structured Attention & Graphical Models
Structured Attention for Alignments
NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.
input (P, H) → attention alignment between premise words (A, gentleman, overlooking, ..., situation) and hypothesis words (A, police, officer, ..., closely) → output: entails / contradicts / neutral
(Base model: ESIM (Chen et al., 2017); proposed model: global matching attention.)
(Results: SNLI accuracy (3-class) in the 85.5–87% range and MultiNLI in the 75–76.5% range, comparing softmax, matching, and sequence attention. Example attention alignment between "A gentleman overlooking a neighborhood situation." and "A police officer watches a situation closely.")
Dynamically inferring the computation graph
Dependency TreeLSTM (Tai et al., 2015)
Example: "The bears eat the pretty ones", composed along a dependency tree.
Closely related to GCNs, e.g. (Kipf and Welling, 2017; Marcheggiani and Titov, 2017).
Latent Dependency TreeLSTM (Niculae, Martins, and Cardie, 2018)
input x → latent tree h ∈ H over "The bears eat the pretty ones" → output y
p(y | x) = Σ_{h∈H} p(y | h, x) p(h | x)
Structured Latent Variable Models
p(y | x) = Σ_{h∈H} p_φ(y | h, x) p_π(h | x)
p_φ(y | h, x): e.g. a TreeLSTM defined by h; p_π(h | x): a parsing model, using some score_π(h; x); the sum ranges over all possible trees — an exponentially large sum!
How to define p_π, and how to compute ∂p(y | x)/∂π through that sum?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0 — argmax
idea 2: p_π(h | x) ∝ exp(score_π(h; x)) — softmax
idea 3: SparseMAP
SparseMAP
μ⋆ = .7 · (one tree) + .3 · (another tree) + 0 · (every other tree) + ...
p(y | x) = .7 p_φ(y | tree₁, x) + .3 p_φ(y | tree₂, x)
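Because p⋆ is sparse, the exponentially large sum collapses to its support. A schematic sketch (the names are illustrative, not the paper's code): `p_star` is a dict of structures and weights as returned by a SparseMAP solver like the one sketched above, and `downstream` stands for the hypothetical p_φ(y | h, x) computation, e.g. a TreeLSTM forward pass:

```python
def marginal_likelihood(p_star, downstream):
    """p(y | x) = sum_h p_phi(y | h, x) p_pi(h | x), with p_pi the sparse
    SparseMAP distribution: only structures with nonzero weight contribute."""
    return sum(w * downstream(h) for h, w in p_star.items() if w > 0.0)
```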
Sentiment classification (SST): accuracy (binary) in the 80–85% range, comparing LTR, Flat, CoreNLP, and Latent.
Natural Language Inference (SNLI): accuracy (3-class) in the 80.6–82% range, comparing LTR, Flat, CoreNLP, and Latent.
Reverse dictionary lookup (definitions and concepts): accuracy@10 in the 30–38% range, comparing LTR, Flat, and Latent.
Baselines over "⋆ The bears eat the pretty ones" — Left-to-right (LTR): regular LSTM; Flat: bag-of-words–like; CoreNLP: off-line parser.
Sentence pair classification (P, H): p(y | P, H) = Σ_{h_P∈H(P)} Σ_{h_H∈H(H)} p_φ(y | h_P, h_H) p_π(h_P | P) p_π(h_H | H)
Reverse dictionary lookup: given a word description, predict the word embedding (Hill et al., 2016); instead of p(y | x), we model E_{p_π}[g(x)] = Σ_{h∈H} g(x; h) p_π(h | x).
Syntax vs. Composition Order
"⋆ lovely and poignant .": most probable latent tree at p = 22.6%; the CoreNLP parse gets p = 21.4%.
"⋆ a deep and meaningful film .": top latent trees at p = 15.33% and p = 15.27%; the CoreNLP parse gets p = 0%.
Conclusions
Differentiable & sparse structured inference
Generic, extensible algorithms
Interpretable structured attention
Dynamically-inferred computation graphs
[email protected] · github.com/vene/sparsemap · https://vene.ro · @vnfrombucharest
Extra slides
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.
Some icons by Dave Gandy and Freepik via flaticon.com.
Danskin's Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)
Let φ : R^d × Z → R, with Z ⊂ R^d compact. Then
∂ max_{z∈Z} φ(x, z) = conv{∇_x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.
Example: maximum of a vector
∂ max_{j∈[d]} θj = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇_θ φ(p⋆, θ)} = conv{p⋆}
(Plots for θ = [t, 0], t ∈ [−1, +1]: max_j θj and {g1 | g ∈ ∂ max_j θj}.)
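A tiny numeric illustration (not from the slides) of the example: away from ties, the gradient of max_j θj matches the one-hot indicator of the argmax, as a finite-difference check confirms:

```python
import numpy as np

theta = np.array([0.2, 1.4, -0.3])
eps = 1e-6

# Finite-difference gradient of f(theta) = max_j theta_j.
grad_fd = np.array([
    (np.max(theta + eps * e) - np.max(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

# Danskin: the (sub)gradient is the argmax vertex p* of the variational form.
grad_danskin = np.eye(len(theta))[np.argmax(theta)]

print(grad_fd)        # ~[0., 1., 0.]
print(grad_danskin)   # [0., 1., 0.]
```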
Fusedmax
fusedmax(θ) = argmax_{p∈△} p⊤θ − 1/2 ∥p∥²₂ − Σ_{2≤j≤d} |pj − pj−1|
            = argmin_{p∈△} 1/2 ∥p − θ∥²₂ + Σ_{2≤j≤d} |pj − pj−1|
prox_fused(θ) = argmin_{p∈R^d} 1/2 ∥p − θ∥²₂ + Σ_{2≤j≤d} |pj − pj−1|
Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))    (Niculae and Blondel, 2017)
"Fused Lasso" a.k.a. 1-d Total Variation (Tibshirani et al., 2005)
Example: Source Sentence with Three Words
softmax:     (0.52, 0.35, 0.13), (0.36, 0.44, 0.2), (0.18, 0.27, 0.55)
sparsemax:   (0.7, 0.3, 0), (0.4, 0.6, 0), (0, 0.15, 0.85)
csoftmax:    (0.52, 0.35, 0.13), (0.36, 0.44, 0.2), (0.12, 0.21, 0.67)
csparsemax:  (0.7, 0.3, 0), (0.3, 0.7, 0), (0, 0, 1)
(Bar charts of the resulting per-word fertilities, on a 0–1 scale.)
e.g., fertility constraints for NMT
(Figure: attention alignment examples for English–German translation.)
constrained softmax: (Martins and Kreutzer, 2017); constrained sparsemax: (Malaviya et al., 2018)
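A minimal sketch (my own, not from the talk's codebase) of constrained sparsemax as a Euclidean projection onto the box-constrained simplex, solved by bisection on the threshold τ; `lo` and `hi` are illustrative per-word bound arrays (e.g. remaining fertility budgets), assumed feasible:

```python
import numpy as np

def csparsemax(theta, lo, hi, n_iter=60):
    """Project theta onto {p : sum(p) = 1, lo <= p <= hi} by bisecting on tau,
    so that p = clip(theta - tau, lo, hi) sums to one.
    Assumes feasibility: sum(lo) <= 1 <= sum(hi)."""
    excess = lambda tau: np.clip(theta - tau, lo, hi).sum() - 1.0
    left = theta.min() - hi.max()     # excess(left) >= 0
    right = theta.max() - lo.min()    # excess(right) <= 0
    for _ in range(n_iter):
        mid = 0.5 * (left + right)
        if excess(mid) > 0:
            left = mid                # total mass too large: raise the threshold
        else:
            right = mid
    tau = 0.5 * (left + right)
    return np.clip(theta - tau, lo, hi)

theta = np.array([1.2, 0.3, -0.5])
lo, hi = np.zeros(3), np.full(3, 0.7)
print(csparsemax(theta, lo, hi))      # ~[0.7, 0.3, 0.], respects the 0.7 cap
```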
Structured Output Prediction
SparseMAP loss: L_A(η, μ̄) = max_{μ∈M} [η⊤μ − 1/2 ∥μ∥²] − η⊤μ̄ + 1/2 ∥μ̄∥²
cost-SparseMAP loss: L^ρ_A(η, μ̄) = max_{μ∈M} [η⊤μ − 1/2 ∥μ∥² + ρ(μ, μ̄)] − η⊤μ̄ + 1/2 ∥μ̄∥²
(μ̄ denotes the gold structure's vector.)
Instance of a structured Fenchel-Young loss, like CRF, SVM, etc. (Blondel, Martins, and Niculae, 2019)
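A sketch of this loss and its gradient in η, under the same assumptions as the conditional-gradient snippet earlier (the `sparsemap_fw` and `map_oracle` names refer to that sketch); by Danskin's theorem the gradient of the max term is μ⋆, so ∇_η L = μ⋆ − μ̄:

```python
import numpy as np

def sparsemap_loss(map_oracle, eta, mu_gold, n_iter=100):
    """SparseMAP (Fenchel-Young) loss and its gradient w.r.t. the part scores eta.
    mu_gold is the gold structure's indicator vector (the mu-bar above)."""
    mu_star, _ = sparsemap_fw(map_oracle, eta, n_iter=n_iter)
    loss = (eta @ mu_star - 0.5 * mu_star @ mu_star
            - eta @ mu_gold + 0.5 * mu_gold @ mu_gold)
    grad_eta = mu_star - mu_gold     # Danskin: d(max term)/d(eta) = mu_star
    return loss, grad_eta
```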
References
Amos, Brandon and J. Zico Kolter (2017). "OptNet: Differentiable optimization as a layer in neural networks". In: Proc. of ICML.
Bertsekas, Dimitri P. (1999). Nonlinear Programming. Athena Scientific.
Blondel, Mathieu, André F. T. Martins, and Vlad Niculae (2019). "Learning with Fenchel-Young losses". In: preprint arXiv:1901.02324.
Brucker, Peter (1984). "An O(n) algorithm for quadratic knapsack problems". In: Operations Research Letters 3.3, pp. 163–166.
Chen, Qian et al. (2017). "Enhanced LSTM for natural language inference". In: Proc. of ACL.
Condat, Laurent (2016). "Fast projection onto the simplex and the ℓ1 ball". In: Mathematical Programming 158.1-2, pp. 575–585.
Danskin, John M. (1966). "The theory of max-min, with applications". In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.
Dantzig, George B., Alex Orden, and Philip Wolfe (1955). "The generalized simplex method for minimizing a linear form under linear inequality restraints". In: Pacific Journal of Mathematics 5.2, pp. 183–195.
Frank, Marguerite and Philip Wolfe (1956). "An algorithm for quadratic programming". In: Nav. Res. Log. 3.1-2, pp. 95–110.
Gould, Stephen et al. (2016). "On differentiating parameterized argmin and argmax problems with application to bi-level optimization". In: preprint arXiv:1607.05447.
Held, Michael, Philip Wolfe, and Harlan P. Crowder (1974). "Validation of subgradient optimization". In: Mathematical Programming 6.1, pp. 62–88.
Hill, Felix et al. (2016). "Learning to understand phrases by embedding the dictionary". In: TACL 4.1, pp. 17–30.
Kim, Yoon et al. (2017). "Structured attention networks". In: Proc. of ICLR.
Kipf, Thomas N. and Max Welling (2017). "Semi-supervised classification with graph convolutional networks". In: Proc. of ICLR.
Koo, Terry et al. (2007). "Structured prediction models via the matrix-tree theorem". In: Proc. of EMNLP.
Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Proc. of NeurIPS.
Liu, Yang and Mirella Lapata (2018). "Learning structured text representations". In: TACL 6, pp. 63–75.
Malaviya, Chaitanya, Pedro Ferreira, and André F. T. Martins (2018). "Sparse and constrained attention for neural machine translation". In: Proc. of ACL.
Marcheggiani, Diego and Ivan Titov (2017). "Encoding sentences with graph convolutional networks for semantic role labeling". In: Proc. of EMNLP.
Martins, André F. T. and Ramón Fernandez Astudillo (2016). "From softmax to sparsemax: A sparse model of attention and multi-label classification". In: Proc. of ICML.
Martins, André F. T. and Julia Kreutzer (2017). "Learning what's easy: Fully differentiable neural easy-first taggers". In: Proc. of EMNLP, pp. 349–362.
McDonald, Ryan T. and Giorgio Satta (2007). "On the complexity of non-projective data-driven dependency parsing". In: Proc. of IWPT.
Niculae, Vlad and Mathieu Blondel (2017). "A regularized framework for sparse and structured neural attention". In: Proc. of NeurIPS.
Niculae, Vlad, André F. T. Martins, Mathieu Blondel, et al. (2018). "SparseMAP: Differentiable sparse structured inference". In: Proc. of ICML.
Niculae, Vlad, André F. T. Martins, and Claire Cardie (2018). "Towards dynamic computation graphs via sparse latent structure". In: Proc. of EMNLP.
Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.
Rabiner, Lawrence R. (1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". In: Proceedings of the IEEE 77.2, pp. 257–286.
Smith, David A. and Noah A. Smith (2007). "Probabilistic models of nonprojective dependency trees". In: Proc. of EMNLP.
Tai, Kai Sheng, Richard Socher, and Christopher D. Manning (2015). "Improved semantic representations from tree-structured Long Short-Term Memory networks". In: Proc. of ACL-IJCNLP.
Taskar, Ben (2004). "Learning structured prediction models: A large margin approach". PhD thesis. Stanford University.
Tibshirani, Robert et al. (2005). "Sparsity and smoothness via the fused lasso". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.
Valiant, Leslie G. (1979). "The complexity of computing the permanent". In: Theoretical Computer Science 8.2, pp. 189–201.
Vinyes, Marina and Guillaume Obozinski (2017). "Fast column generation for atomic norm regularization". In: Proc. of AISTATS.
Wolfe, Philip (1976). "Finding the nearest point in a polytope". In: Mathematical Programming 11.1, pp. 128–149.