![Page 1: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/1.jpg)
Learning Fast-Mixing Models for Structured Prediction
Jacob Steinhardt Percy Liang
Stanford University
{jsteinhardt,pliang}@cs.stanford.edu
July 8, 2015
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 1 / 11
![Page 2: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/2.jpg)
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a
Goal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
![Page 3: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/3.jpg)
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:
Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
![Page 4: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/4.jpg)
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
![Page 5: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/5.jpg)
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)
Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
![Page 6: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/6.jpg)
Structured Prediction Task
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n aGoal: fit maximum likelihood model pθ (z | x). Two routes:Use simple model u, exact inference
Use expressive model, Gibbs sampling (transition kernel A)Can we get the best of both worlds?
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 2 / 11
![Page 7: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/7.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 8: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/8.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u
A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 9: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/9.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A
A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 10: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/10.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A
A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 11: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/11.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A
u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 12: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/12.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u
A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 13: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/13.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A
u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 14: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/14.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u
A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 15: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/15.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A
A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 16: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/16.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A
· · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 17: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/17.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 18: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/18.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 19: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/19.jpg)
Strong Doeblin Chains
Definition (Doeblin, 1940)
A chain A is strong Doeblin with parameter ε if
A(zt | zt−1) = εu(zt)+(1− ε)A(zt | zt−1)
for some u, A.
u A A A u A u A A · · ·
All Doeblin chains mix quickly:
Proposition
If A is ε strong Doeblin, then its mixing time is at most 1ε.
Moreover, the stationary distribution is AT u, where T ∼ Geometric(ε).
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 3 / 11
![Page 20: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/20.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ = εuθ +(1− ε)Aθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 21: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/21.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ = εuθ +(1− ε)Aθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 22: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/22.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ = εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 23: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/23.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 24: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/24.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 25: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/25.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 26: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/26.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ
F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 27: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/27.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 28: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/28.jpg)
A Strong Doeblin Family
Let θ parameterize a distribution uθ and transition matrix Aθ .
Aθ
πθ
= εuθ +(1− ε)Aθ
πθ
Three model families:
F0
{uθ}θ∈Θ
F{πθ}θ∈Θ F {πθ}θ∈Θ
F parameterizes computationally tractable distributions!
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 4 / 11
![Page 29: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/29.jpg)
Strategy
Parameterize strong Doeblin distributions πθ
Maximize log-likelihood: L(θ) = 1n ∑
ni=1 log πθ (z(i))
Issue: hard to compute ∇L(θ)
Insight: interpret Markov chain as latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Observe: πθ (z) = pθ (zT = z), T ∼ Geometric(ε)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 5 / 11
![Page 30: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/30.jpg)
Strategy
Parameterize strong Doeblin distributions πθ
Maximize log-likelihood: L(θ) = 1n ∑
ni=1 log πθ (z(i))
Issue: hard to compute ∇L(θ)
Insight: interpret Markov chain as latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Observe: πθ (z) = pθ (zT = z), T ∼ Geometric(ε)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 5 / 11
![Page 31: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/31.jpg)
Strategy
Parameterize strong Doeblin distributions πθ
Maximize log-likelihood: L(θ) = 1n ∑
ni=1 log πθ (z(i))
Issue: hard to compute ∇L(θ)
Insight: interpret Markov chain as latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Observe: πθ (z) = pθ (zT = z), T ∼ Geometric(ε)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 5 / 11
![Page 32: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/32.jpg)
Learning Updates
Recall latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Lemma
For any fixed z,
∂ logpθ (zT = z)∂θ
= Ez1:T−1∼pθ (·|zT =z)
[∂ logpθ (z1:T )
∂θ
].
Upshot: just need to sample trajectories that end at z.=⇒ importance sampling
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 6 / 11
![Page 33: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/33.jpg)
Learning Updates
Recall latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Lemma
For any fixed z,
∂ logpθ (zT = z)∂θ
= Ez1:T−1∼pθ (·|zT =z)
[∂ logpθ (z1:T )
∂θ
].
Upshot: just need to sample trajectories that end at z.=⇒ importance sampling
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 6 / 11
![Page 34: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/34.jpg)
Learning Updates
Recall latent variable model:
pθ : z1
uθ
z2 · · · zTAθ Aθ Aθ
Lemma
For any fixed z,
∂ logpθ (zT = z)∂θ
= Ez1:T−1∼pθ (·|zT =z)
[∂ logpθ (z1:T )
∂θ
].
Upshot: just need to sample trajectories that end at z.=⇒ importance sampling
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 6 / 11
![Page 35: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/35.jpg)
Experiments: Task
Task from before:
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a
Note y is a deterministic function y = f (z).
Goal: learn model p(z | x) that maximizes
p(y | x) = ∑z∈f−1(y)
p(z | x)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 7 / 11
![Page 36: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/36.jpg)
Experiments: Task
Task from before:
a
c b
e
d gf
i
h kj
m
l
o
n
q p
s
r utw
v
y
xz
x: b d s a d b n n n f a a s s j j j
z: b # # a # # n-n-n # a-a # # n-n a
y: b a n a n a
Note y is a deterministic function y = f (z).
Goal: learn model p(z | x) that maximizes
p(y | x) = ∑z∈f−1(y)
p(z | x)
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 7 / 11
![Page 37: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/37.jpg)
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
![Page 38: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/38.jpg)
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
![Page 39: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/39.jpg)
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:
basic: Gibbs sampling (A)(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
![Page 40: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/40.jpg)
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
![Page 41: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/41.jpg)
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
![Page 42: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/42.jpg)
Experiments: Setup
Models:
u(z | x) (bigram, DP): z1 z2 z3 z4
A(zt | zt−1,x) (dictionary, Gibbs): z1 z2 z3 z4
Comparisons:basic: Gibbs sampling (A)
(compute gradients assuming exact inference)
uθ -Gibbs: Gibbs with random restarts from u
Doeblin: our method
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 8 / 11
![Page 43: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/43.jpg)
Experiments: Results
5 10 15training passes
0.00
0.05
0.10
0.15
0.20
0.25ac
cura
cySwipe Typing (Test Accuracy)
Doeblinbasic
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 9 / 11
![Page 44: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/44.jpg)
Experiments: Results
5 10 15training passes
0.00
0.05
0.10
0.15
0.20
0.25ac
cura
cySwipe Typing (Test Accuracy)
Doeblinuµ-Gibbsbasic
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 9 / 11
![Page 45: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/45.jpg)
Discussion
Summary:
Strong Doeblin property enables fast mixing
Interpolates between tractability and expressivity
Provides better learning updates at training time
Also in paper:
Theoretical analysis of strong Doeblin family
Multi-stage Doeblin chains
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 10 / 11
![Page 46: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/46.jpg)
Discussion
Summary:
Strong Doeblin property enables fast mixing
Interpolates between tractability and expressivity
Provides better learning updates at training time
Also in paper:
Theoretical analysis of strong Doeblin family
Multi-stage Doeblin chains
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 10 / 11
![Page 47: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/47.jpg)
Discussion
Related work:
Policy gradient (Sutton et al., 1999)
Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al.,2011; Huang et al., 2012)
Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran& Tweedie, 1998)
Future work:
Explore other tractable families
Learn multi-stage chains
Reproducible experiments on CodaLab: codalab.org/worksheets
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 11 / 11
![Page 48: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/48.jpg)
Discussion
Related work:
Policy gradient (Sutton et al., 1999)
Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al.,2011; Huang et al., 2012)
Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran& Tweedie, 1998)
Future work:
Explore other tractable families
Learn multi-stage chains
Reproducible experiments on CodaLab: codalab.org/worksheets
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 11 / 11
![Page 49: Learning Fast-Mixing Models for Structured Predictionjsteinhardt/publications/fast-mcmc/slides.pdfq p s w r t u v y z x x: b d s a d b n n n f a a s s j j j z: b # # a # # n-n-n #](https://reader033.vdocument.in/reader033/viewer/2022043000/5f7764466f73d11c8b493cd7/html5/thumbnails/49.jpg)
Discussion
Related work:
Policy gradient (Sutton et al., 1999)
Inference-aware learning (Barbu, 2009; Domke, 2011; Stoyanov et al.,2011; Huang et al., 2012)
Strong Doeblin analysis (Doeblin, 1940; Propp & Wilson, 1996; Corcoran& Tweedie, 1998)
Future work:
Explore other tractable families
Learn multi-stage chains
Reproducible experiments on CodaLab: codalab.org/worksheets
J. Steinhardt & P. Liang (Stanford) Fast-Mixing Models July 8, 2015 11 / 11