
Ergodic Subgradient Descent

John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan

University of California, Berkeley and Royal Institute of Technology (KTH), Sweden

Allerton Conference, September 2011

Duchi, Agarwal, Johansson, Jordan (UCB) Ergodic Subgradient Descent September 2011 1 / 19

Stochastic Gradient Descent

Goal: solve

minimize f(x) subject to x ∈ X.

Repeat: at iteration t

◮ Receive stochastic gradient g(t), unbiased in the sense that

E[g(t) | g(1), ..., g(t−1)] = ∇f(x(t))

◮ Update

x(t+1) = x(t) − α(t) g(t)
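The update above can be sketched in a few lines. This is a minimal illustration, not the speakers' code: the least-squares objective f(x) = E[(aᵀx − b)²], the data model, and the step-size constant 0.1 are all assumptions made for the example.

```python
# Stochastic gradient descent with alpha(t) proportional to 1/sqrt(t),
# on an illustrative least-squares objective with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
x_star = rng.normal(size=d)          # ground-truth minimizer (synthetic)
x = np.zeros(d)

for t in range(1, T + 1):
    a = rng.normal(size=d)           # fresh i.i.d. sample -> unbiased gradient
    b = a @ x_star
    g = 2 * (a @ x - b) * a          # stochastic gradient of (a^T x - b)^2
    x = x - (0.1 / np.sqrt(t)) * g   # decaying step size alpha(t) = 0.1/sqrt(t)
```

Because each sample is drawn fresh and independently, the conditional expectation of g is exactly ∇f(x(t)), which is the unbiasedness condition on the slide.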

Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possessing a function fi(x). Objective:

f(x) = (1/n) Σ_{i=1}^n fi(x)

A token walks randomly according to a transition matrix P; processor i(t) has the token at time t. Update:

x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: network of six processors holding local functions f1(x), ..., f6(x)]

Unbiased gradient estimate: choose i ∈ {1, ..., n} uniformly at random,

g(t) = ∇fi(x(t))
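A toy simulation of the token-passing scheme, under assumptions not from the talk: the network is a ring of n = 6 processors, each local function is the quadratic fi(x) = (x − ci)², and the step size decays as 0.5/√t rather than staying constant.

```python
# Token performs a random walk on a ring; the processor holding the token
# applies a gradient step for its local function f_i(x) = (x - c_i)^2.
import numpy as np

rng = np.random.default_rng(1)
n = 6
c = np.arange(n, dtype=float)          # local minimizers c_0, ..., c_5
x_opt = c.mean()                       # minimizer of the average objective

i, x = 0, 0.0
iterates = []
for t in range(1, 50_001):
    g = 2.0 * (x - c[i])               # gradient of the local function f_i
    x -= 0.5 / np.sqrt(t) * g
    iterates.append(x)
    i = (i + rng.choice([-1, 1])) % n  # token moves to a uniform random neighbor

x_bar = np.mean(iterates)              # averaged iterate, as in the analysis
```

Although each individual gradient is biased toward the current token holder, the walk's stationary distribution is uniform over processors, so the averaged iterate still approaches the minimizer of the average objective.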

Unbiasedness?

Where does the data come from? The noise?

◮ Financial data

◮ Autoregressive processes

◮ Markov chains

◮ Queuing systems

◮ Machine learning

Related Work

◮ Stochastic approximation methods (e.g. Robbins and Monro 1951, Polyak and Juditsky 1992)

◮ ODE and perturbation methods with dependent noise (e.g. Kushner and Yin 2003)

◮ Robust approaches and finite sample rates for independent noise settings (Nemirovski and Yudin 1983, Nemirovski et al. 2009)

What we do: finite-time convergence rates of stochastic optimization procedures with dependent noise

Stochastic Optimization

Goal: solve the following problem

min_x f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

subject to x ∈ X

Here ξ comes from a distribution Π on sample space Ξ.

Example: distributed optimization

f(x) = (1/n) Σ_{i=1}^n F(x; ξi) = (1/n) Σ_{i=1}^n fi(x),

where ξi is the data on machine i.

[Figure: network of six processors holding local functions f1(x), ..., f6(x)]

Difficulties

Goal: solve

min_{x ∈ X} f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

◮ Cannot compute the expectation in closed form

◮ May not know the distribution Π

◮ Might not even be able to get samples ξ ∼ Π

Solution: stochastic gradient descent, but with samples ξ from a distribution P^t, where

P^t → Π

Ergodic Gradient Descent

Algorithm: receive ξt ∼ P(· | ξ1, ..., ξt−1), compute a stochastic (sub)gradient:

g(t) ∈ ∂F(x(t); ξt),

then take a projected gradient step:

x(t+1) = Proj_X[x(t) − α(t) g(t)]

Stochastic Assumption

Ergodicity: the stochastic process ξ1, ξ2, ..., ξt ∼ P is sufficiently mixing, i.e.

d_TV( P(ξt | ξ1, ..., ξs), Π ) → 0

as t − s ↑ ∞. Specifically, define the mixing time τmix such that

t − s ≥ τmix(P, ε)  ⇒  d_TV( P(ξt | ξ1, ..., ξs), Π ) ≤ ε.

[Figure: timeline ξ1, ξ2, ξ3, ..., ξt, with the dependence weakening as the gap t − s grows]
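The mixing-time definition can be made concrete on a chain small enough to compute exactly. The two-state transition matrix below is an assumption chosen for illustration, not from the talk; for it, everything in the definition is available in closed form.

```python
# Total variation distance to stationarity, and the resulting mixing time,
# for a two-state Markov chain with stationary distribution pi = (2/3, 1/3).
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])        # stationary: pi @ P == pi

def tv_from_state(s, t):
    """d_TV between the t-step distribution started at state s and pi."""
    dist = np.linalg.matrix_power(P, t)[s]
    return 0.5 * np.abs(dist - pi).sum()

def tau_mix(eps):
    """Smallest t with max_s d_TV(P^t(s, .), pi) <= eps."""
    t = 1
    while max(tv_from_state(0, t), tv_from_state(1, t)) > eps:
        t += 1
    return t
```

For this chain the distance decays geometrically at rate λ2 = 0.7, so τmix(P, ε) grows like log(1/ε) / log(1/0.7), i.e. fast mixing in the sense used later in the talk.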

Main Results

For the algorithm

g(t) ∈ ∂F(x(t); ξt),   x(t+1) = Proj_X[x(t) − α(t) g(t)]

we have the following guarantees on the averaged iterate x̄(T) = (1/T) Σ_{t=1}^T x(t):

Theorem (expected convergence): with the choice α(t) ∝ 1/√t,

E[f(x̄(T))] − f(x∗) ≤ C √( τmix(P) / T )

Theorem (high-probability convergence): with probability at least 1 − e^{−κ},

f(x̄(T)) − f(x∗) ≤ C √( τmix(P) / T )  [expected rate]  +  C √( κ τmix(P) log τmix(P) / T )  [deviation]

Examples:

◮ Peer-to-peer Distributed Optimization

◮ Ranking algorithms (optimization on combinatorial spaces)

◮ Slowly mixing Markov chains

◮ Any mixing process


Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possessing a function fi(x). Objective:

f(x) = (1/n) Σ_{i=1}^n fi(x)

A token walks randomly according to a transition matrix P; processor i(t) has the token at time t. Update:

x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: network of six processors holding local functions f1(x), ..., f6(x)]

Peer-to-peer distributed optimization: convergence

The convergence rate of

x(t+1) = Proj_X( x(t) − α(t) ∇f_{i(t)}(x(t)) )

is governed by the spectral gap of the transition matrix P (here ρ2(P) is the second singular value of P):

f(x̄(T)) − f(x∗) = O( √( log(Tn) / (1 − ρ2(P)) ) · 1/√T )

in expectation and with high probability, where 1/(1 − ρ2(P)) plays the role of τmix.
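The quantity ρ2(P) in the bound is easy to compute for a concrete topology. The lazy random walk on a ring of n nodes below is an illustrative choice of network, not one from the talk:

```python
# Second singular value rho_2(P) and spectral gap 1 - rho_2(P) of the lazy
# random walk on an n-cycle (a doubly stochastic, symmetric transition matrix).
import numpy as np

def ring_walk(n, laziness=0.5):
    """Transition matrix: stay put w.p. `laziness`, else move to a neighbor."""
    P = laziness * np.eye(n)
    for i in range(n):
        P[i, (i - 1) % n] += (1 - laziness) / 2
        P[i, (i + 1) % n] += (1 - laziness) / 2
    return P

P = ring_walk(6)
rho2 = np.linalg.svd(P, compute_uv=False)[1]   # singular values sorted descending
spectral_gap = 1 - rho2
```

For the 6-cycle the eigenvalues are 0.5 + 0.5·cos(2πk/6), so ρ2(P) = 0.75 and the gap is 0.25; a larger ring mixes more slowly, and the bound above degrades accordingly through the 1/(1 − ρ2(P)) factor.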

Optimization over combinatorial spaces

General problem: we want samples from a uniform distribution Π over a combinatorial space. This is hard to do directly, so use a random walk P → Π (Jerrum & Sinclair).

Example: learn a ranking. We receive pairwise user preferences between items and would like to be oblivious to the order of the remainder. P is a partial order on the ranked items, and {σ ∈ P} denotes the permutations of [n] consistent with P.

[Figure: directed graph of pairwise preference constraints among the items]

Objective:

f(x) := (1/card{σ ∈ P}) Σ_{σ ∈ P} F(x; σ)

Partial-order Permutation Markov Chain

Markov chain (Karzanov and Khachiyan 91): pick a random adjacent pair (i, j); swap if j ≺ i is consistent with P

[Figure: a permutation 1, ..., i, j, ..., n before and after swapping the adjacent items i and j]

Mixing time (Wilson 04): τmix(P, ε) ≤ (4/π²) n³ log(n/ε)

Convergence rate:

f(x̄(T)) − f(x∗) = O( n^{3/2} √(log(Tn)) / √T )
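A bare-bones version of this adjacent-transposition chain can be written directly. This sketch is an assumption-laden illustration: the partial order (item 0 must precede item 1), the item count, and the non-lazy variant of the chain are all choices made for the example.

```python
# Karzanov-Khachiyan-style chain over permutations: repeatedly pick a random
# adjacent pair and swap it unless the swap would violate the partial order.
import random

def step(sigma, precedes):
    """One move; `precedes` is a set of (a, b) pairs meaning a must come before b."""
    k = random.randrange(len(sigma) - 1)
    a, b = sigma[k], sigma[k + 1]
    if (a, b) not in precedes:       # swapping keeps the order consistent
        sigma[k], sigma[k + 1] = b, a
    return sigma

random.seed(0)
precedes = {(0, 1)}                  # toy constraint: 0 before 1
sigma = [0, 1, 2, 3]
for _ in range(10_000):
    sigma = step(sigma, precedes)
```

Every reachable state respects the constraint, so the chain explores exactly the permutations consistent with the partial order, which is the sample space of the objective on the previous slide.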

Slowly mixing Markov chains

◮ Might not have such fast mixing rates (i.e. log(1/ε))

◮ Examples:
  ◮ Physical simulations of natural phenomena
  ◮ Autoregressive processes
  ◮ Monte Carlo-sampling based variants of the expectation maximization (EM) algorithm

◮ A possible mixing rate is algebraic:

τmix(P, ε) ≤ M ε^{−β}

for some β > 0.

◮ Consequence: expected and high-probability convergence

f(x̄(T)) − f(x∗) = O( T^{−1/(2β+2)} )

Arbitrary mixing process

General convergence guarantee:

f(x̄(T)) − f(x∗) ≤ C1/√T + C inf_{ε > 0} { ε + τmix(P, ε)/√T }.

Consequence: if τmix(P, ε) < ∞ for every ε > 0, then

f(x̄(T)) − f(x∗) → 0 with probability 1.
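The general guarantee can be evaluated numerically for a given mixing profile. The sketch below assumes the algebraic profile τmix(P, ε) = M ε^{−β} from the previous slides (with illustrative constants M = 1, β = 1) and minimizes the bracketed quantity over a grid of ε values; with β = 1 the optimal trade-off gives a T^{−1/4} rate, matching the slowly-mixing consequence stated earlier.

```python
# Minimize eps + tau_mix(P, eps) / sqrt(T) over eps for algebraic mixing
# tau_mix(P, eps) = M * eps**(-beta), on a logarithmic grid of eps values.
import math

def bound(T, M=1.0, beta=1.0):
    """Grid minimum of eps + M * eps**(-beta) / sqrt(T) for eps in [1e-5, 1]."""
    eps_grid = [10 ** (k / 100) for k in range(-500, 1)]
    return min(e + M * e ** (-beta) / math.sqrt(T) for e in eps_grid)

# With beta = 1, calculus gives eps* = T**(-1/4) and a minimum of 2 * T**(-1/4),
# so the ratio below should sit very close to 2.
ratio = bound(10 ** 8) / (10 ** 8) ** (-0.25)
```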

Conclusions and Discussion

◮ Finite sample convergence rates for stochastic gradient algorithms with dependent noise

◮ Convergence rates depend on the mixing time τmix

◮ In companion work, we extend these results to all stable online learning algorithms (Agarwal and Duchi, 2011)

◮ Future work: dynamic adaptation for unknown mixing rates

◮ Weaken uniformity of mixing-time assumptions

Thanks!
