Learning Fast-Mixing Models for Structured Prediction
Jacob Steinhardt Percy Liang
Stanford University
{jsteinhardt,pliang}@cs.stanford.edu
July 8, 2015
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 1 / 11
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a
Goal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:
Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)
Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u
A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A
A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A
A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A
u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u
A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A
u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u
A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A
A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A
· · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ = εuθ +(1− ε)Aθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ = εuθ +(1− ε)Aθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ = εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ
F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
Strategy
Parameterize strong Doeblin distributions πθ
Maximize log-likelihood: L(θ) = 1n ∑
ni=1 log πθ (z(i))
Issue: hard to compute ∇L(θ)
Insight: interpret Markov chain as latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Observe: πθ (z) = pθ (zT = z), T ∼ Geometric(ε)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 5 / 11
Strategy
Parameterize strong Doeblin distributions πθ
Maximize log-likelihood: L(θ) = 1n ∑
ni=1 log πθ (z(i))
Issue: hard to compute ∇L(θ)
Insight: interpret Markov chain as latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Observe: πθ (z) = pθ (zT = z), T ∼ Geometric(ε)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 5 / 11
Strategy
Parameterize strong Doeblin distributions πθ
Maximize log-likelihood: L(θ) = 1n ∑
ni=1 log πθ (z(i))
Issue: hard to compute ∇L(θ)
Insight: interpret Markov chain as latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Observe: πθ (z) = pθ (zT = z), T ∼ Geometric(ε)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 5 / 11
Learning Updates
Recall latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Lemma
For any fixed z,
∂ logpθ (zT = z)∂θ
= Ez1:T−1∼pθ (·|zT =z)
[∂ logpθ (z1:T )
∂θ
].
Upshot: just need to sample trajectories that end at z.=⇒ importance sampling
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 6 / 11
Learning Updates
Recall latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Lemma
For any fixed z,
∂ logpθ (zT = z)∂θ
= Ez1:T−1∼pθ (·|zT =z)
[∂ logpθ (z1:T )
∂θ
].
Upshot: just need to sample trajectories that end at z.=⇒ importance sampling
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 6 / 11
Learning Updates
Recall latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Lemma
For any fixed z,
∂ logpθ (zT = z)∂θ
= Ez1:T−1∼pθ (·|zT =z)
[∂ logpθ (z1:T )
∂θ
].
Upshot: just need to sample trajectories that end at z.=⇒ importance sampling
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 6 / 11
Experiments: Task
Task from before:
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a
Note y is a deterministic function y = f (z).
Goal: learn model p(z | x) that maximizes
p(y | x) = ∑z∈f−1(y)
p(z | x)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 7 / 11
Experiments: Task
Task from before:
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a
Note y is a deterministic function y = f (z).
Goal: learn model p(z | x) that maximizes
p(y | x) = ∑z∈f−1(y)
p(z | x)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 7 / 11
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:
basic: Gibbs sampling (A)(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
Experiments: Results
5 10 15training passes
0.00
0.05
0.10
0.15
0.20
0.25ac
cura
cySwipe Typing (Test Accuracy)
Doeblinbasic
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 9 / 11
Experiments: Results
5 10 15training passes
0.00
0.05
0.10
0.15
0.20
0.25ac
cura
cySwipe Typing (Test Accuracy)
Doeblinuµ-Gibbsbasic
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 9 / 11
Discussion
Summary:
Strong Doeblin property enables fast mixing
Interpolates between tractability and expressivity
Provides better learning updates at training time
Also in paper:
Theoretical analysis of strong Doeblin family
Multi-stage Doeblin chains
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 10 / 11
Discussion
Summary:
Strong Doeblin property enables fast mixing
Interpolates between tractability and expressivity
Provides better learning updates at training time
Also in paper:
Theoretical analysis of strong Doeblin family
Multi-stage Doeblin chains
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 10 / 11
Discussion
Related work:
Policy gradient (Sutton et al., 1999)
Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al.,2011; Huang et al., 2012)
Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran& Tweedie, 1998)
Future work:
Explore other tractable families
Learn multi-stage chains
Reproducible experiments on CodaLab: codalab.org/worksheets
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 11 / 11
Discussion
Related work:
Policy gradient (Sutton et al., 1999)
Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al.,2011; Huang et al., 2012)
Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran& Tweedie, 1998)
Future work:
Explore other tractable families
Learn multi-stage chains
Reproducible experiments on CodaLab: codalab.org/worksheets
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 11 / 11
Discussion
Related work:
Policy gradient (Sutton et al., 1999)
Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al.,2011; Huang et al., 2012)
Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran& Tweedie, 1998)
Future work:
Explore other tractable families
Learn multi-stage chains
Reproducible experiments on CodaLab: codalab.org/worksheets
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 11 / 11