
Ergodic Subgradient Descent

John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan

University of California, Berkeley and Royal Institute of Technology (KTH), Sweden

Allerton Conference, September 2011

Duchi, Agarwal, Johansson, Jordan (UCB) Ergodic Subgradient Descent September 2011 1 / 19

Stochastic Gradient Descent

Goal: solve

minimize f(x) subject to x ∈ X.

Repeat: at iteration t

◮ Receive stochastic gradient g(t), unbiased in the sense that

E[g(t) | g(1), ..., g(t−1)] = ∇f(x(t))

◮ Update

x(t+1) = x(t) − α(t) g(t)
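The update above can be sketched in a few lines. This is a minimal illustration, not the speakers' code: the least-squares objective f(x) = E[(aᵀx − b)²], the data model, and the step-size constant 0.1 are all assumptions made for the example.

```python
# Stochastic gradient descent with alpha(t) proportional to 1/sqrt(t),
# on an illustrative least-squares objective with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 20000
x_star = rng.normal(size=d)          # ground-truth minimizer (synthetic)
x = np.zeros(d)

for t in range(1, T + 1):
    a = rng.normal(size=d)           # fresh i.i.d. sample -> unbiased gradient
    b = a @ x_star
    g = 2 * (a @ x - b) * a          # stochastic gradient of (a^T x - b)^2
    x = x - (0.1 / np.sqrt(t)) * g   # decaying step size alpha(t) = 0.1/sqrt(t)
```

Because each sample is drawn fresh and independently, the conditional expectation of g is exactly ∇f(x(t)), which is the unbiasedness condition on the slide.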

Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possessing a function fi(x). Objective:

f(x) = (1/n) Σ_{i=1}^n fi(x)

A token walks randomly according to a transition matrix P; processor i(t) has the token at time t. Update:

x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: network of six processors holding local functions f1(x), ..., f6(x)]

Unbiased gradient estimate: choose i ∈ {1, ..., n} uniformly at random,

g(t) = ∇fi(x(t))
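A toy simulation of the token-passing scheme, under assumptions not from the talk: the network is a ring of n = 6 processors, each local function is the quadratic fi(x) = (x − ci)², and the step size decays as 0.5/√t rather than staying constant.

```python
# Token performs a random walk on a ring; the processor holding the token
# applies a gradient step for its local function f_i(x) = (x - c_i)^2.
import numpy as np

rng = np.random.default_rng(1)
n = 6
c = np.arange(n, dtype=float)          # local minimizers c_0, ..., c_5
x_opt = c.mean()                       # minimizer of the average objective

i, x = 0, 0.0
iterates = []
for t in range(1, 50_001):
    g = 2.0 * (x - c[i])               # gradient of the local function f_i
    x -= 0.5 / np.sqrt(t) * g
    iterates.append(x)
    i = (i + rng.choice([-1, 1])) % n  # token moves to a uniform random neighbor

x_bar = np.mean(iterates)              # averaged iterate, as in the analysis
```

Although each individual gradient is biased toward the current token holder, the walk's stationary distribution is uniform over processors, so the averaged iterate still approaches the minimizer of the average objective.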

Unbiasedness?

Where does the data come from? The noise?

◮ Financial data

◮ Autoregressive processes

◮ Markov chains

◮ Queuing systems

◮ Machine learning

Related Work

◮ Stochastic approximation methods (e.g. Robbins and Monro 1951, Polyak and Juditsky 1992)

◮ ODE and perturbation methods with dependent noise (e.g. Kushner and Yin 2003)

◮ Robust approaches and finite sample rates for independent noise settings (Nemirovski and Yudin 1983, Nemirovski et al. 2009)

What we do: finite-time convergence rates of stochastic optimization procedures with dependent noise

Stochastic Optimization

Goal: solve the following problem

min_x f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

subject to x ∈ X

Here ξ comes from a distribution Π on sample space Ξ.

Example: distributed optimization

f(x) = (1/n) Σ_{i=1}^n F(x; ξi) = (1/n) Σ_{i=1}^n fi(x),

where ξi is the data on machine i.

[Figure: network of six processors holding local functions f1(x), ..., f6(x)]

Difficulties

Goal: solve

min_{x ∈ X} f(x) = E_Π[F(x; ξ)] = ∫_Ξ F(x; ξ) dΠ(ξ)

◮ Cannot compute the expectation in closed form

◮ May not know the distribution Π

◮ Might not even be able to get samples ξ ∼ Π

Solution: stochastic gradient descent, but with samples ξ from a distribution P^t, where

P^t → Π

Ergodic Gradient Descent

Algorithm: receive ξt ∼ P(· | ξ1, ..., ξt−1), compute a stochastic (sub)gradient:

g(t) ∈ ∂F(x(t); ξt),

then take a projected gradient step:

x(t+1) = Proj_X[x(t) − α(t) g(t)]

Stochastic Assumption

Ergodicity: the stochastic process ξ1, ξ2, ..., ξt ∼ P is sufficiently mixing, i.e.

d_TV( P(ξt | ξ1, ..., ξs), Π ) → 0

as t − s ↑ ∞. Specifically, define the mixing time τmix such that

t − s ≥ τmix(P, ε)  ⇒  d_TV( P(ξt | ξ1, ..., ξs), Π ) ≤ ε.

[Figure: timeline ξ1, ξ2, ξ3, ..., ξt, with the dependence weakening as the gap t − s grows]
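The mixing-time definition can be made concrete on a chain small enough to compute exactly. The two-state transition matrix below is an assumption chosen for illustration, not from the talk; for it, everything in the definition is available in closed form.

```python
# Total variation distance to stationarity, and the resulting mixing time,
# for a two-state Markov chain with stationary distribution pi = (2/3, 1/3).
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])        # stationary: pi @ P == pi

def tv_from_state(s, t):
    """d_TV between the t-step distribution started at state s and pi."""
    dist = np.linalg.matrix_power(P, t)[s]
    return 0.5 * np.abs(dist - pi).sum()

def tau_mix(eps):
    """Smallest t with max_s d_TV(P^t(s, .), pi) <= eps."""
    t = 1
    while max(tv_from_state(0, t), tv_from_state(1, t)) > eps:
        t += 1
    return t
```

For this chain the distance decays geometrically at rate λ2 = 0.7, so τmix(P, ε) grows like log(1/ε) / log(1/0.7), i.e. fast mixing in the sense used later in the talk.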

Main Results

For the algorithm

g(t) ∈ ∂F(x(t); ξt),   x(t+1) = Proj_X[x(t) − α(t) g(t)]

we have the following guarantees on the averaged iterate x̄(T) = (1/T) Σ_{t=1}^T x(t):

Theorem (expected convergence): with the choice α(t) ∝ 1/√t,

E[f(x̄(T))] − f(x∗) ≤ C √( τmix(P) / T )

Theorem (high-probability convergence): with probability at least 1 − e^{−κ},

f(x̄(T)) − f(x∗) ≤ C √( τmix(P) / T )  [expected rate]  +  C √( κ τmix(P) log τmix(P) / T )  [deviation]

Examples:

◮ Peer-to-peer Distributed Optimization

◮ Ranking algorithms (optimization on combinatorial spaces)

◮ Slowly mixing Markov chains

◮ Any mixing process


Peer-to-peer distributed optimization

Setup (Johansson et al. 09): n processors, each possessing a function fi(x). Objective:

f(x) = (1/n) Σ_{i=1}^n fi(x)

A token walks randomly according to a transition matrix P; processor i(t) has the token at time t. Update:

x(t+1) = Proj_X( x(t) − α ∇f_{i(t)}(x(t)) )

[Figure: network of six processors holding local functions f1(x), ..., f6(x)]

Peer-to-peer distributed optimization: convergence

The convergence rate of

x(t+1) = Proj_X( x(t) − α(t) ∇f_{i(t)}(x(t)) )

is governed by the spectral gap of the transition matrix P (here ρ2(P) is the second singular value of P):

f(x̄(T)) − f(x∗) = O( √( log(Tn) / (1 − ρ2(P)) ) · 1/√T )

in expectation and with high probability, where 1/(1 − ρ2(P)) plays the role of τmix.
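The quantity ρ2(P) in the bound is easy to compute for a concrete topology. The lazy random walk on a ring of n nodes below is an illustrative choice of network, not one from the talk:

```python
# Second singular value rho_2(P) and spectral gap 1 - rho_2(P) of the lazy
# random walk on an n-cycle (a doubly stochastic, symmetric transition matrix).
import numpy as np

def ring_walk(n, laziness=0.5):
    """Transition matrix: stay put w.p. `laziness`, else move to a neighbor."""
    P = laziness * np.eye(n)
    for i in range(n):
        P[i, (i - 1) % n] += (1 - laziness) / 2
        P[i, (i + 1) % n] += (1 - laziness) / 2
    return P

P = ring_walk(6)
rho2 = np.linalg.svd(P, compute_uv=False)[1]   # singular values sorted descending
spectral_gap = 1 - rho2
```

For the 6-cycle the eigenvalues are 0.5 + 0.5·cos(2πk/6), so ρ2(P) = 0.75 and the gap is 0.25; a larger ring mixes more slowly, and the bound above degrades accordingly through the 1/(1 − ρ2(P)) factor.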

Optimization over combinatorial spaces

General problem: we want samples from a uniform distribution Π over a combinatorial space. This is hard to do directly, so use a random walk P → Π (Jerrum & Sinclair).

Example: learn a ranking. We receive pairwise user preferences between items and would like to be oblivious to the order of the remainder. P is a partial order on the ranked items, and {σ ∈ P} denotes the permutations of [n] consistent with P.

[Figure: directed graph of pairwise preference constraints among the items]

Objective:

f(x) := (1/card{σ ∈ P}) Σ_{σ ∈ P} F(x; σ)

Partial-order Permutation Markov Chain

Markov chain (Karzanov and Khachiyan 91): pick a random adjacent pair (i, j); swap if j ≺ i is consistent with P

[Figure: a permutation 1, ..., i, j, ..., n before and after swapping the adjacent items i and j]

Mixing time (Wilson 04): τmix(P, ε) ≤ (4/π²) n³ log(n/ε)

Convergence rate:

f(x̄(T)) − f(x∗) = O( n^{3/2} √(log(Tn)) / √T )
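A bare-bones version of this adjacent-transposition chain can be written directly. This sketch is an assumption-laden illustration: the partial order (item 0 must precede item 1), the item count, and the non-lazy variant of the chain are all choices made for the example.

```python
# Karzanov-Khachiyan-style chain over permutations: repeatedly pick a random
# adjacent pair and swap it unless the swap would violate the partial order.
import random

def step(sigma, precedes):
    """One move; `precedes` is a set of (a, b) pairs meaning a must come before b."""
    k = random.randrange(len(sigma) - 1)
    a, b = sigma[k], sigma[k + 1]
    if (a, b) not in precedes:       # swapping keeps the order consistent
        sigma[k], sigma[k + 1] = b, a
    return sigma

random.seed(0)
precedes = {(0, 1)}                  # toy constraint: 0 before 1
sigma = [0, 1, 2, 3]
for _ in range(10_000):
    sigma = step(sigma, precedes)
```

Every reachable state respects the constraint, so the chain explores exactly the permutations consistent with the partial order, which is the sample space of the objective on the previous slide.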

Slowly mixing Markov chains

◮ Might not have such fast mixing rates (i.e. log(1/ε))

◮ Examples:
  ◮ Physical simulations of natural phenomena
  ◮ Autoregressive processes
  ◮ Monte Carlo-sampling based variants of the expectation maximization (EM) algorithm

◮ A possible mixing rate is algebraic:

τmix(P, ε) ≤ M ε^{−β}

for some β > 0.

◮ Consequence: expected and high-probability convergence

f(x̄(T)) − f(x∗) = O( T^{−1/(2β+2)} )

Arbitrary mixing process

General convergence guarantee:

f(x̄(T)) − f(x∗) ≤ C1/√T + C inf_{ε > 0} { ε + τmix(P, ε)/√T }.

Consequence: if τmix(P, ε) < ∞ for every ε > 0, then

f(x̄(T)) − f(x∗) → 0 with probability 1.
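The general guarantee can be evaluated numerically for a given mixing profile. The sketch below assumes the algebraic profile τmix(P, ε) = M ε^{−β} from the previous slides (with illustrative constants M = 1, β = 1) and minimizes the bracketed quantity over a grid of ε values; with β = 1 the optimal trade-off gives a T^{−1/4} rate, matching the slowly-mixing consequence stated earlier.

```python
# Minimize eps + tau_mix(P, eps) / sqrt(T) over eps for algebraic mixing
# tau_mix(P, eps) = M * eps**(-beta), on a logarithmic grid of eps values.
import math

def bound(T, M=1.0, beta=1.0):
    """Grid minimum of eps + M * eps**(-beta) / sqrt(T) for eps in [1e-5, 1]."""
    eps_grid = [10 ** (k / 100) for k in range(-500, 1)]
    return min(e + M * e ** (-beta) / math.sqrt(T) for e in eps_grid)

# With beta = 1, calculus gives eps* = T**(-1/4) and a minimum of 2 * T**(-1/4),
# so the ratio below should sit very close to 2.
ratio = bound(10 ** 8) / (10 ** 8) ** (-0.25)
```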

Conclusions and Discussion

◮ Finite sample convergence rates for stochastic gradient algorithms with dependent noise

◮ Convergence rates depend on the mixing time τmix

◮ In companion work, we extend these results to all stable online learning algorithms (Agarwal and Duchi, 2011)

◮ Future work: dynamic adaptation for unknown mixing rates

◮ Weaken uniformity of mixing-time assumptions

Thanks!
