
Multi-Block Alternating Direction Method of Multipliers

Are there Alternative Algorithms for Linear Programming?

Yinyu Ye

Department of Management Science and Engineering and Institute for Computational and Mathematical Engineering

Stanford University, Stanford

September 17, 2015

Ye, Yinyu (Stanford) Alternative Algorithms for LP? MIT ORC, 2015 1 / 45


Table of Contents

1 Linear Programming and Algorithms

2 Advances in Simplex and Interior-Point Algorithms

3 First-Order and Alternating Direction Method of Multipliers

4 Convergence of RP-ADMM in Expectation

5 Why Multi-Block ADMM?



Linear Programming (LP)



Algebra of Linear Programming

\[
\begin{array}{ll}
\text{minimize}_x & c^T x \\
\text{subject to} & Ax \ (\le, =, \ge) \ b, \quad x \ge 0,
\end{array}
\]

where the given constraint matrix A is an m × n matrix, the right-hand side b is an m-dimensional vector, and the objective coefficient vector c is an n-dimensional vector.

The variable x is an n-dimensional vector that needs to be decided optimally.

LP is a data-driven computation/decision model with wide applications, so we would like to find the optimal x fast in both theory and practice.
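As a small illustration of the algebra above, the following Python sketch solves a tiny LP in equality form (slack variables added, x ≥ 0) by brute force: it enumerates every choice of m basic columns, solves for the basic solution, and keeps the best feasible one. The example data and the helper names `solve_square` and `brute_force_lp` are made up for illustration. This is not one of the algorithms discussed in the talk; it assumes an optimal solution exists and is only sensible for very small problems.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination; None if M is singular."""
    k = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][k] for r in range(k)]


def brute_force_lp(c, A, b):
    """Minimize c^T x subject to A x = b, x >= 0 by enumerating basic solutions."""
    m, n = len(A), len(c)
    best_value, best_x = None, None
    for basis in combinations(range(n), m):
        B = [[A[i][j] for j in basis] for i in range(m)]
        z = solve_square(B, b)
        if z is None or any(v < 0 for v in z):
            continue  # singular basis or infeasible basic solution
        x = [Fraction(0)] * n
        for j, v in zip(basis, z):
            x[j] = v
        value = sum(Fraction(cj) * xj for cj, xj in zip(c, x))
        if best_value is None or value < best_value:
            best_value, best_x = value, x
    return best_value, best_x


# Hypothetical data: minimize -x1 - 2*x2
# subject to x1 + x2 <= 4 and x1 + 3*x2 <= 6, written with slacks x3, x4.
c = [-1, -2, 0, 0]
A = [[1, 1, 1, 0],
     [1, 3, 0, 1]]
b = [4, 6]
value, x = brute_force_lp(c, A, b)
print(value, [str(v) for v in x])
```

The number of bases grows combinatorially with m and n, which is one way to see why the choice of algorithm matters.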



Geometry of Linear Programming



LP Algorithms: the Simplex Method



LP Algorithms: the Interior-Point Method



Advances in the Simplex Method

Markov decision processes (MDPs) provide a mathematical framework for sequential decision-making where outcomes are partly random and partly under the control of a decision maker.

MDPs are useful for studying a wide range of optimization problems solved via dynamic programming, and have been known at least since the 1950s (cf. Shapley 53, Bellman 57).

Modern applications include dynamic planning, reinforcement learning, social networking, and almost all other dynamic/sequential decision-making problems in the mathematical, physical, management and social sciences.



The Markov Decision Process/Game continued

At each time step, the process is in some state $i \in \{1, \dots, m\}$, and the decision maker chooses an action $j_i$ from a set of actions $A_i$ for state i.

The process responds at the next time step by randomly moving into a new state and producing a corresponding cost $c_{j_i}$. The corresponding probabilities of entering the next state, $p_{j_i}$, are conditionally independent of all previous states and actions.

A stationary policy for the decision maker is a set of m actions, one per state, taken by the decision maker at all times.

The MDP problem is to find a policy that optimizes the expected discounted sum of costs over the infinite horizon with a discount factor $0 \le \gamma < 1$.



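The Fixed-Point of Cost-to-Go Values

An optimal policy is associated with m cost-to-go values, one for every state, $y^* \in \mathbb{R}^m$, such that $y^*$ is a fixed point:

\[
y_i^* = \min\{ c_{j_i} + \gamma p_{j_i}^T y^*, \ j_i \in A_i \}, \quad \forall i,
\]

where the optimal action is

\[
j_i^* = \arg\min\{ c_{j_i} + \gamma p_{j_i}^T y^*, \ j_i \in A_i \}, \quad \forall i.
\]

Equivalently, the fixed point solves the linear program

\[
\begin{array}{lll}
\text{maximize}_y & \sum_{i=1}^m y_i & \\
\text{subject to} & y_1 \le c_{j_1} + \gamma p_{j_1}^T y, & j_1 \in A_1 \\
 & \qquad \vdots & \\
 & y_i \le c_{j_i} + \gamma p_{j_i}^T y, & j_i \in A_i \\
 & \qquad \vdots & \\
 & y_m \le c_{j_m} + \gamma p_{j_m}^T y, & j_m \in A_m.
\end{array}
\]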
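The fixed-point characterization suggests a simple scheme: apply the min operator repeatedly until the values stop changing (value iteration). Below is a minimal Python sketch on a made-up two-state MDP. The data layout (`actions[i]` as a list of `(cost, transition_probabilities)` pairs) and the function name `value_iteration` are assumptions for illustration, not notation from the talk.

```python
def value_iteration(actions, gamma, tol=1e-10, max_iter=10_000):
    """Approximate the fixed point y_i = min_j { c_j + gamma * p_j^T y }.

    actions[i] is a list of (cost, probs) pairs for state i, where probs[k]
    is the probability of moving to state k (hypothetical data layout).
    """
    m = len(actions)
    y = [0.0] * m
    for _ in range(max_iter):
        y_new = [
            min(cost + gamma * sum(p * v for p, v in zip(probs, y))
                for cost, probs in actions[i])
            for i in range(m)
        ]
        change = max(abs(a - b) for a, b in zip(y_new, y))
        y = y_new
        if change < tol:
            break
    # Greedy policy: the action attaining the minimum in each state.
    policy = [
        min(range(len(actions[i])),
            key=lambda j: actions[i][j][0]
            + gamma * sum(p * v for p, v in zip(actions[i][j][1], y)))
        for i in range(m)
    ]
    return y, policy


# Made-up example: in state 0, action 0 stays put at cost 1 and
# action 1 moves to state 1 at cost 2; state 1 is absorbing with cost 0.
actions = [
    [(1.0, [1.0, 0.0]), (2.0, [0.0, 1.0])],
    [(0.0, [0.0, 1.0])],
]
y, policy = value_iteration(actions, gamma=0.9)
print(y, policy)
```

In this toy instance, staying in state 0 forever would cost 1/(1 − 0.9) = 10, so paying 2 once to leave is cheaper, and the greedy policy picks the second action in state 0.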



Algorithmic Events of the MDP Methods

Shapley 53 and Bellman 57 developed the value-iteration method to approximate the optimal state values.

Another well-known method is due to Howard 60 and is known as the policy-iteration method, or block-pivoting simplex method, which generates an optimal policy in a finite number of iterations in a distributed and decentralized way.

de Ghellinck 60, D’Epenoux 60 and Manne 60 showed that the MDP has an LP representation, so that it can be solved by the simplex method of Dantzig 47, which is a special policy-iteration method in which each step switches the action in only one state.

But most analyses of these methods are negative: the simplex method with the smallest-index rule (Melekopoglou/Condon 90), the random pivoting rule (Friedman et al 12), and policy iteration for undiscounted finite-horizon MDPs (Fearnley 10) are all exponential.



Positive Results for MDP of m states and n actions

The classic simplex method, with the Dantzig pivoting rule, terminates in

\[
\frac{m(n-m)}{1-\gamma} \cdot \log\left(\frac{m^2}{1-\gamma}\right)
\]

iterations, and each iteration uses at most O(mn) arithmetic operations (Y MOR10).

The policy-iteration or block-pivoting method terminates in

\[
\frac{n}{1-\gamma} \cdot \log\left(\frac{m}{1-\gamma}\right)
\]

iterations, and each iteration uses at most $m^2 n$ arithmetic operations (Hansen/Miltersen/Zwick ACM12).
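To get a feel for the two bounds, the sketch below simply evaluates them for one hypothetical instance. The function names are made up, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def simplex_iteration_bound(m, n, gamma):
    """m(n - m)/(1 - gamma) * log(m^2/(1 - gamma)), the simplex bound quoted above."""
    return m * (n - m) / (1 - gamma) * math.log(m ** 2 / (1 - gamma))


def policy_iteration_bound(m, n, gamma):
    """n/(1 - gamma) * log(m/(1 - gamma)), the policy-iteration bound quoted above."""
    return n / (1 - gamma) * math.log(m / (1 - gamma))


# Hypothetical instance: 10 states, 30 actions in total, discount factor 0.9.
m, n, gamma = 10, 30, 0.9
print(round(simplex_iteration_bound(m, n, gamma)))
print(round(policy_iteration_bound(m, n, gamma)))
```

Both expressions are polynomial in m and n once γ is fixed, and both blow up as γ approaches 1.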



Positive Results for MDP continued

The simplex method for deterministic MDP, regardless of the discount factor, terminates in

\[
O(m^3 n^2 \log^2 m)
\]

iterations (Post/Y MOR2014).

The result was extended to general LP by Kitahara and Mizuno 2012; also see renewed exciting research work on the simplex method, e.g., Feinberg/Huang 2013, Lee/Epelman/Romeijn/Smith 2013, Scherrer 2014, Fearnley/Savani 2014, Adler/Papadimitriou/Rubinstein 2014, etc.



The Turn-Based Two-Person Zero-Sum Game

The states are partitioned into two sets, one belonging to the minimizing player and the other to the maximizing player, i.e., the fixed point satisfies

\[
y_i^* = \min\{ c_{j_i} + \gamma p_{j_i}^T y^*, \ j_i \in A_i \}, \quad \forall i \in I^+,
\]

\[
y_i^* = \max\{ c_{j_i} + \gamma p_{j_i}^T y^*, \ j_i \in A_i \}, \quad \forall i \in I^-.
\]

It does not admit a convex programming formulation, and it is unknown whether it can be solved in polynomial time in general.

Hansen/Miltersen/Zwick ACM12 proved that the strategy-iteration method solves it in strongly polynomial time when the discount factor is fixed: the first strongly polynomial time algorithm.



Advances in Interior-Point Methods

In theory, the best interior-point algorithms converge in $O(\sqrt{\max\{m, n\}})$ iterations, where the notation O hides some constant and logarithmic factors of the dimensions and the accuracy $\epsilon$.

Navigating Central Path with Electrical Flows: from Flows to Matchings, and Back (Madry FOCS13). The paper made a breakthrough on solving a class of max-flow and min-cut problems.

Path-Finding Methods for Linear Programming: Solving Linear Programs in $O(\sqrt{\min\{m, n\}})$ Iterations and Faster Algorithms for Maximum Flow (Lee and Sidford, FOCS14).

Efficient Inverse Maintenance and Faster Algorithms for Linear Programming (Lee and Sidford, FOCS15), where the authors have also reduced the operation complexity for linear programming.



First-Order or “Inverse-Free” Methods

Early work includes the von Neumann projection method (also see Freund and Jarre/Rendl 07).

Use the subgradient method for a transformed problem, assuming a strictly feasible point is known (Renegar 14, Freund 15), with iteration complexity $O(L^2 D^2 / \epsilon^2)$, where L and D are condition numbers of the optimal solution structure and the feasible region.

First-order Karmarkar potential reduction algorithm...



First-Order or “Inverse-Free” Methods

First-order method for packing and covering LP (Allen-Zhu/Orecchia 14): penalize the constraints, then use stochastic coordinate descent and several other techniques. The iteration complexity is $O(1/\epsilon)$ for packing LP and $O(1/\epsilon^{1.5})$ for covering LP.

Two-block ADMM for solving the primal LP (He/Yuan 11, Monteiro/Svaiter 13, Monteiro/Ortiz/Svaiter 14) with iteration complexity $O(1/\epsilon)$ (a matrix needs to be inverted once).



First-Order Potential Reduction I

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & e^T x = 1, \quad x \ge 0,
\end{array}
\]

where e is the vector of all ones.

Assume that f(x) is a convex function in $x \in \mathbb{R}^n$ and $f(x^*) = 0$, where $x^*$ is a minimizer of the problem. Furthermore,

\[
f(x + d) - f(x) \le \nabla f(x)^T d + \frac{\gamma}{2}\|d\|^2,
\]

where the positive $\gamma$ is the Lipschitz parameter.

We consider the potential function (e.g., Karmarkar 84, Todd/Ye 87)

\[
\phi(x) = \rho \ln(f(x)) - \sum_j \ln(x_j),
\]

where $\rho \ge n$, over the simplex. We start from $x^0 = \frac{1}{n} e$ and generate a sequence of points $x^k$, k = 1, 2, ..., whose potential value strictly decreases by a fixed amount at each step.



First-Order Potential Reduction II

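Update x by solving

\[
\begin{array}{ll}
\text{minimize} & \nabla \phi(x)^T d \\
\text{subject to} & e^T d = 0, \quad \|X^{-1} d\| \le \beta,
\end{array}
\]

where $X = \mathrm{diag}(x)$ and $\beta < 1$ is yet to be determined; then set $x^+ = x + d$. We have

\[
\phi(x^+) - \phi(x) \le -\beta + \frac{\rho\gamma}{2 f(x)}\beta^2 + \frac{\beta^2}{2(1-\beta)},
\]

so that one can choose a $\beta$ to make

\[
\phi(x^+) - \phi(x) \le \frac{-f(x)}{2(f(x) + 2\rho\gamma)}.
\]

Theorem
The steepest descent potential reduction algorithm generates an $x^k$ with $f(x^k)/f(x^0) \le \epsilon$ in no more than

\[
\frac{4(n + \sqrt{n}) \max\{1, \ 2(n + \sqrt{n})\gamma / f(x^0)\}}{\epsilon} \ln\left(\frac{1}{\epsilon}\right)
\]

steps.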



Alternating Direction Method of Multipliers

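Consider the convex optimization problem

\[
\begin{array}{ll}
\min_{x \in \mathbb{R}^n} & f_1(x_1) + \cdots + f_p(x_p), \\
\text{s.t.} & Ax \triangleq A_1 x_1 + \cdots + A_p x_p = b, \\
 & x_i \in \mathcal{X}_i \subset \mathbb{R}^{d_i}, \quad i = 1, \dots, p.
\end{array}
\]

Augmented Lagrangian function:

\[
L_\gamma(x_1, \dots, x_p; y) = \sum_i f_i(x_i) - y^T\Big(\sum_i A_i x_i - b\Big) + \frac{\gamma}{2}\Big\|\sum_i A_i x_i - b\Big\|^2.
\]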




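Alternating Direction Method of Multipliers continued

Multi-block ADMM:

\[
\begin{array}{rcl}
x_1^{k+1} & \longleftarrow & \arg\min_{x_1 \in \mathcal{X}_1} L_\gamma(x_1, x_2^k, \dots, x_p^k; y^k), \\
 & \vdots & \\
x_p^{k+1} & \longleftarrow & \arg\min_{x_p \in \mathcal{X}_p} L_\gamma(x_1^{k+1}, \dots, x_{p-1}^{k+1}, x_p; y^k), \\
y^{k+1} & \longleftarrow & y^k - \gamma\Big(\sum_i A_i x_i^{k+1} - b\Big).
\end{array}
\]

Convergence is well established when p = 1 or p = 2 (..., Glowinski/Marrocco 75, Gabay/Mercier 76, Eckstein/Bertsekas 92, ...).

But what about p > 2?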
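For intuition, each block update has a closed form in the simplest case. The assumptions here are made only for illustration and are not on the slide: $f_i \equiv 0$, $\mathcal{X}_i = \mathbb{R}^{d_i}$, and $A_i$ has full column rank. Setting the gradient of $L_\gamma$ with respect to $x_i$ to zero gives a least-squares step:

```latex
\[
-A_i^T y^k + \gamma A_i^T\Big(A_i x_i + \sum_{j<i} A_j x_j^{k+1} + \sum_{j>i} A_j x_j^{k} - b\Big) = 0
\]
\[
\Longrightarrow\quad
x_i^{k+1} = (A_i^T A_i)^{-1} A_i^T\Big(b + \tfrac{1}{\gamma}\, y^k - \sum_{j<i} A_j x_j^{k+1} - \sum_{j>i} A_j x_j^{k}\Big).
\]
```

Under these assumptions every step of a round is linear in $(x^k, y^k)$, so when b = 0 a whole round is a single linear map.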



Convergence of Multi-Block ADMM

Recent discovery: when p ≥ 3, ADMM can diverge [Chen/He/Y/Yuan MP14]. Take p = 3 and

\[
Ax = a_1 x_1 + a_2 x_2 + a_3 x_3 = 0, \quad \text{where} \quad
A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \end{pmatrix}.
\]

ADMM is then a linear mapping with matrix M:

\[
z^{k+1} = M z^k, \quad \text{where } z^k = (x_1^k; x_2^k; x_3^k; y^k).
\]

Why does it diverge? ρ(M) > 1 for the above A and any choice of γ.

When you randomly choose an initial solution, it diverges with probability one!
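The divergence can be observed numerically. The sketch below runs cyclic three-block ADMM on the system above with zero objective, using the closed-form scalar update $x_i = a_i^T(y/\gamma - \sum_{j \ne i} a_j x_j)/(a_i^T a_i)$ that follows from minimizing the augmented Lagrangian in $x_i$. The starting point, the number of rounds, and the function name `cyclic_admm` are arbitrary choices for illustration.

```python
import math

# Columns a1, a2, a3 of the matrix A from the counterexample.
COLUMNS = [(1.0, 1.0, 1.0), (1.0, 1.0, 2.0), (1.0, 2.0, 2.0)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cyclic_admm(x, y, gamma=1.0, rounds=2000):
    """Run cyclic 3-block ADMM on  min 0  s.t.  a1*x1 + a2*x2 + a3*x3 = 0.

    Each scalar block update minimizes the augmented Lagrangian in closed form:
    x_i = a_i^T (y / gamma - sum_{j != i} a_j x_j) / (a_i^T a_i).
    Returns the norm of (x, y) after every round.
    """
    x, y = list(x), list(y)
    norms = []
    for _ in range(rounds):
        for i, a_i in enumerate(COLUMNS):
            rest = [sum(COLUMNS[j][r] * x[j] for j in range(3) if j != i)
                    for r in range(3)]
            target = [y[r] / gamma - rest[r] for r in range(3)]
            x[i] = dot(a_i, target) / dot(a_i, a_i)
        residual = [sum(COLUMNS[j][r] * x[j] for j in range(3)) for r in range(3)]
        y = [y[r] - gamma * residual[r] for r in range(3)]
        norms.append(math.sqrt(dot(x, x) + dot(y, y)))
    return norms


# Arbitrary starting point chosen for illustration.
norms = cyclic_admm(x=[1.0, 1.0, 1.0], y=[0.0, 0.0, 0.0])
print(norms[0], norms[-1])
```

Since A is invertible, the only solution is x = 0 with multiplier y = 0, so a convergent method would drive the printed norm to zero; here it grows instead.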



Almost Always Diverges



Strong Convexity Does not Help

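\[
\begin{array}{ll}
\min & 0.05 x_1^2 + 0.05 x_2^2 + 0.05 x_3^2 \\
\text{s.t.} &
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = 0.
\end{array}
\]

The mapping matrix M in ADMM (γ = 1) has

\[
\rho(M) = 1.0087 > 1.
\]

That is, even for strongly convex programming, ADMM is not necessarily convergent for a range of γ > 0.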



The Stepsize of ADMM

For some step-size 0 < β < 1, update the Lagrangian multiplier:

\[
y^{k+1} := y^k - \beta\gamma\Big(\sum_i A_i x_i^{k+1} - b\Big).
\]

p = 1 (Augmented Lagrangian Method): converges for $\beta \in (0, 2)$ (Hestenes 69, Powell 69).

p = 2 (Alternating Direction Method of Multipliers): converges for $\beta \in (0, \frac{1+\sqrt{5}}{2})$ (Glowinski 84).

p ≥ 3: converges for β sufficiently small, provided additional conditions on the problem hold (Hong/Luo 12).



Is there a Problem-Data-Independent β?

For any positive β, the following linear system makes ADMM diverge:

\[
\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1+\beta \\ 1 & 1+\beta & 1+\beta \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = 0.
\]

There is no practical problem-data-independent β such that the small-step-size variant would work.

Many other alternatives were proposed, but somehow they all make ADMM converge more slowly in practice...



Randomly Permuted ADMM

Randomly Permuted ADMM (RP-ADMM): in each round, draw a random permutation $\sigma = (\sigma(1), \dots, \sigma(p))$ of $\{1, \dots, p\}$, and

update $x_{\sigma(1)} \to x_{\sigma(2)} \to \cdots \to x_{\sigma(p)} \to y$.

(This is block sampling without replacement.)

It forces “absolute fairness” among blocks, or a mixed strategy on the updating order.

Random permutation has been used effectively in many areas, but little theoretical analysis is known.

Simulation test result: it almost always converges, and fast!
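The scheme is a small change to the cyclic loop: only the order of the block updates differs from round to round. A minimal sketch on the 3×3 system from the divergence example (zero objective, scalar blocks) follows. The function name `rp_admm`, the seed, the starting point, and the number of rounds are assumptions for illustration, and no particular numerical outcome is claimed here.

```python
import math
import random

# Columns a1, a2, a3 of the 3x3 counterexample matrix used earlier in the talk.
COLUMNS = [(1.0, 1.0, 1.0), (1.0, 1.0, 2.0), (1.0, 2.0, 2.0)]


def rp_admm(x, y, gamma=1.0, rounds=200, seed=0):
    """Randomly permuted ADMM on  min 0  s.t.  a1*x1 + a2*x2 + a3*x3 = 0.

    Every round draws a fresh permutation of the blocks, updates the scalar
    blocks in that order, then updates the multiplier y.
    Returns the final x, the final y, and the list of permutations used.
    """
    rng = random.Random(seed)
    x, y = list(x), list(y)
    p = len(COLUMNS)
    orders = []
    for _ in range(rounds):
        order = rng.sample(range(p), p)  # sampling without replacement
        orders.append(order)
        for i in order:
            a_i = COLUMNS[i]
            rest = [sum(COLUMNS[j][r] * x[j] for j in range(p) if j != i)
                    for r in range(3)]
            numer = sum(a_i[r] * (y[r] / gamma - rest[r]) for r in range(3))
            x[i] = numer / sum(v * v for v in a_i)
        residual = [sum(COLUMNS[j][r] * x[j] for j in range(p)) for r in range(3)]
        y = [y[r] - gamma * residual[r] for r in range(3)]
    return x, y, orders


x, y, orders = rp_admm(x=[1.0, 1.0, 1.0], y=[0.0, 0.0, 0.0])
print(math.sqrt(sum(v * v for v in x + y)))
```

Replacing `rng.sample(range(p), p)` with the fixed order `[0, 1, 2]` recovers the cyclic method, so the two can be compared on the same data.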



Almost Always Converges when Randomizing the Update Order



Any Theory Behind the Success?

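Consider a square system of linear equations:

\[
(S) \qquad
\begin{array}{ll}
\min_{x \in \mathbb{R}^n} & 0, \\
\text{s.t.} & A_1 x_1 + \cdots + A_p x_p = b,
\end{array}
\]

where $A = [A_1, \dots, A_p] \in \mathbb{R}^{n \times n}$, $x_i \in \mathbb{R}^{d_i}$ and $\sum_i d_i = n$.

After k rounds, RP-ADMM generates $z^k$, a random variable depending on

\[
\xi_k = (\sigma_1, \dots, \sigma_k),
\]

where $\sigma_j$ is the randomly picked permutation at the j-th round. Let

\[
\phi_k \triangleq E_{\xi_k}(z^k).
\]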



Convergence in Expectation (Sun/Luo/Y 15)

Theorem 1
The expected iterate $\phi_k \triangleq E_{\xi_k}(z^k)$ converges to the solution linearly for any 1 ≤ p ≤ n if A is invertible, i.e.

\[
\{\phi_k\}_{k \to \infty} \longrightarrow \begin{bmatrix} A^{-1} b \\ 0 \end{bmatrix}.
\]

Remark: Expected convergence ≠ convergence, but it is strong evidence for convergence when solving most problems, e.g., when the iterates are bounded.



The Average Mapping is a Contraction

The update equation of RP-ADMM for (S) (with b = 0) is

\[
z^{k+1} = M_\sigma z^k,
\]

where $M_\sigma \in \mathbb{R}^{2n \times 2n}$ depends on σ.

Define the expected update matrix as

\[
M = E_\sigma(M_\sigma) = \frac{1}{p!} \sum_\sigma M_\sigma.
\]

Theorem 2
The spectral radius of M is strictly less than 1, i.e. ρ(M) < 1, for any 1 ≤ p ≤ n if A is invertible.



Proof Difficulty of Theorem 2

That Theorem 2 implies Theorem 1 is relatively easy to show, but Theorem 2...

Few tools deal with the spectral radius of non-symmetric matrices.

E.g. ρ(X + Y) ≤ ρ(X) + ρ(Y) and ρ(XY) ≤ ρ(X)ρ(Y) don’t hold.

Though ρ(M) < ∥M∥, it turns out ∥M∥ > 2.3 for the counterexample.

The block positions within a random permutation are not independent, so M is a complicated function of A.

Techniques: symmetrization and mathematical induction.



Two Main Lemmas to Prove Theorem 2

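Step 1: Relate M to a symmetric matrix $AQA^T$.

Lemma 1

\[
\lambda \in \mathrm{eig}(M) \iff \frac{(1-\lambda)^2}{1-2\lambda} \in \mathrm{eig}(AQA^T).
\]

Since Q is symmetric, we have

\[
\rho(M) < 1 \iff \mathrm{eig}(AQA^T) \subseteq \left(0, \tfrac{4}{3}\right).
\]

Step 2: Bound the eigenvalues of $AQA^T$, proved by induction.

Lemma 2

\[
\mathrm{eig}(AQA^T) \subseteq \left(0, \tfrac{4}{3}\right).
\]

Remark: 4/3 is “almost” tight; for p = 3 blocks, the maximum is ≈ 1.18, and it increases to 4/3 as the number of blocks increases.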
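A short calculation, added here as a sanity check and not taken from the slides, shows where the interval (0, 4/3) comes from. Writing μ for a real eigenvalue of $AQA^T$, the relation in Lemma 1 is a quadratic in λ (note that λ = 1/2 is never a root of it):

```latex
\[
\frac{(1-\lambda)^2}{1-2\lambda} = \mu
\iff \lambda^2 - 2(1-\mu)\lambda + (1-\mu) = 0
\iff \lambda = (1-\mu) \pm \sqrt{\mu(\mu-1)} .
\]
\[
0 < \mu < 1:\ \text{complex conjugate roots with } |\lambda|^2 = 1-\mu < 1;
\qquad \mu = 1:\ \lambda = 0;
\]
\[
\mu > 1:\ \text{real roots, and } \Big|(1-\mu) - \sqrt{\mu(\mu-1)}\Big| < 1
\iff \sqrt{\mu(\mu-1)} < 2-\mu \iff \mu < \tfrac{4}{3};
\]
\[
\mu \le 0:\ \text{the root } (1-\mu) + \sqrt{\mu(\mu-1)} \ge 1 .
\]
```

So every eigenvalue λ of M lies strictly inside the unit disk exactly when every μ lies in (0, 4/3), which is the equivalence stated after Lemma 1.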



Fixed Cyclic ADMM vs RP-ADMM

Solving weakly Laplacian linear systems:

A(i, i) = 1, A(i, j) = A(j, i) = τ · rand(1, 1).

(n, τ)           CycADMM (iter, time)    RP-ADMM (iter, time)
(500, 0.01)      diverge                 (10.1, 1.3)
(500, 0.05)      diverge                 (67, 0.61)
(5000, 0.002)    diverge                 (7.5, 21.1)
(5000, 0.01)     diverge                 (22, 38)
(5000, 0.02)     diverge                 (205, 251)



Further Result on Randomized Permutation for Convex Optimization

Consider the non-separable convex quadratic problem

\[
\begin{array}{ll}
\min_{x \in \mathbb{R}^n} & x^T H x + c^T x, \\
\text{s.t.} & Ax \triangleq A_1 x_1 + \cdots + A_p x_p = b, \\
 & x_i \in \mathcal{X}_i \subset \mathbb{R}^{d_i}, \quad i = 1, \dots, p.
\end{array}
\]

Theorem 3
If each block subproblem possesses a unique solution, the expected iterate $\phi_k \triangleq E_{\xi_k}(z^k)$ converges to an optimal solution of the original problem (Chen, Li, Liu and Y 15).



Universal Decomposition and Block Coordinate Descent

Consider the homogeneous and self-dual linear program to find a feasible solution (x, y, s) such that

\[
\begin{array}{rcl}
Ax - b\tau & = & 0, \\
-A^T y - s + c\tau & = & 0, \\
b^T y - c^T x - \kappa & = & 0, \\
e^T x + \tau + e^T s + \kappa & = & 1, \\
(x, \tau, s, \kappa) & \ge & 0,
\end{array}
\]

where the three blocks (x, τ), y and (s, κ) are alternately updated (O’Donoghue/Chu/Parikh/Boyd 15); also see Sun and Toh 14 for SDP.



Combine ADMM and Interior-Point

Consider the logarithmic barrier function as the objective:

\[
\begin{array}{ll}
\min & -\mu \ln(\tau\kappa) - \mu \sum_j \ln(x_j s_j) \\
\text{s.t.} & Ax - b\tau = 0, \\
 & -A^T y - s + c\tau = 0, \\
 & b^T y - c^T x - \kappa = 0, \\
 & e^T x + \tau + e^T s + \kappa = 1,
\end{array}
\]

which is solved by Multi-Block ADMM.

Then µ is gradually reduced to 0 as in interior-point methods. Initial computational results are encouraging (Lin/Ma 15)!



Implementation on Distributed Platforms I

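Reformulate

\[
\begin{array}{ll}
\text{minimize}_x & c^T x \\
\text{subject to} & Ax = b, \quad x \ge 0,
\end{array}
\]

as

\[
\begin{array}{ll}
\text{minimize}_{x_0, \dots, x_m} & c^T x_0 \\
\text{subject to} & a_i x_i = b_i, \quad i = 1, \dots, m, \\
 & x_i - x_0 = 0, \quad i = 1, \dots, m, \\
 & x_0 \ge 0,
\end{array}
\]

where $a_i$ is the i-th row of A and each $x_i$ is a local copy of x.

Then apply Multi-Block ADMM to the new formulation (He and Yuan 15, Sun/Y 15).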



Implementation on Distributed Platforms II

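In each round of ADMM, for i = 1, ..., m, one independently solves, for a simple quadratic objective $q_i$,

\[
\begin{array}{ll}
\text{minimize}_{x_i} & q_i(x_i) \\
\text{subject to} & a_i x_i = b_i;
\end{array}
\]

then one updates $x_0$ as

\[
x_0 = \max\Big\{ \frac{1}{m} \sum_i x_i, \ 0 \Big\}.
\]

Or one solves

\[
\begin{array}{ll}
\text{minimize}_{x_i} & q_i(x_i) \\
\text{subject to} & a_i x_i = b_i, \quad x_i \ge 0;
\end{array}
\]

and updates

\[
x_0 = \frac{1}{m} \sum_i x_i.
\]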
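The structure of one round (m independent equality-constrained quadratic solves, then an average and a clip at zero) can be sketched as follows. This is a simplification under a loud assumption: $q_i(x_i)$ is taken to be $\|x_i - x_0\|^2$, so each subproblem is a projection onto the hyperplane $a_i x_i = b_i$. The actual ADMM subproblem would also carry multiplier and penalty terms, which are omitted here. The function names and data are made up.

```python
def project_onto_hyperplane(v, a, b):
    """Closest point to v on {x : a^T x = b}, i.e. argmin ||x - v||^2 s.t. a^T x = b."""
    shift = (sum(ai * vi for ai, vi in zip(a, v)) - b) / sum(ai * ai for ai in a)
    return [vi - shift * ai for ai, vi in zip(a, v)]


def consensus_round(x0, rows, rhs):
    """One simplified round: project x0 onto each row constraint, then average and clip at 0."""
    # Each projection touches only one row of A, so the m solves are independent.
    copies = [project_onto_hyperplane(x0, a, b) for a, b in zip(rows, rhs)]
    m = len(copies)
    averaged = [sum(col) / m for col in zip(*copies)]
    return [max(v, 0.0) for v in averaged], copies


# Made-up data: two constraints x1 + x2 = 2 and x1 - x2 = 0.
rows = [[1.0, 1.0], [1.0, -1.0]]
rhs = [2.0, 0.0]
x0, copies = consensus_round([3.0, 0.0], rows, rhs)
print(x0, copies)
```

Because each local solve needs only its own row $a_i$ and the shared $x_0$, the list comprehension over rows is the part that could be spread across workers.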



Summary and Future Directions

Could MDP be solved in strongly polynomial time regardless of the discount factor?

Could the recent theoretically better interior-point algorithms be realized in practice?

Could the RP-ADMM convergence hold with high probability, and/or could the RP-ADMM result be extended to more general convex optimization problems?

Big picture: are there other competitive linear programming algorithms in both theory and practice?

Multi-Block ADMM might have a chance (at least in practice):

Good theories have been established.
It could be easily implemented in a distributed way on a cloud and/or GPU platform.

Linear Programming Research Continues ...

