Virginia Smith, Researcher, UC Berkeley, at MLconf SF 2016
TRANSCRIPT
CoCoA: A General Framework for Communication-Efficient Distributed Optimization
Virginia Smith ⋅ Simone Forte ⋅ Chenxin Ma ⋅ Martin Takáč ⋅ Martin Jaggi ⋅ Michael I. Jordan
Machine Learning with Large Datasets
image/music/video tagging ⋅ document categorization ⋅ item recommendation ⋅ click-through rate prediction ⋅ sequence tagging ⋅ protein structure prediction ⋅ sensor data prediction ⋅ spam classification ⋅ fraud detection
Machine Learning Workflow
DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …
SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …
Open Problem: efficiently solving the objective when the data is distributed
Distributed Optimization

“always communicate”: reduce: $w = w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees  ✗ high communication

“never communicate”: average: $w := \frac{1}{K} \sum_k w_k$
✔ low communication  ✗ convergence not guaranteed
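The two extremes can be simulated in a few lines. This is an illustrative sketch (my construction, not the speaker's code): K workers hold shards of a least-squares problem; the "always communicate" scheme does one reduce per gradient step, while the "never communicate" scheme solves locally and averages once.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 80, 5, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # noisy labels
parts = np.array_split(np.arange(n), K)                 # each worker's shard

def shard_grad(w, idx):
    """Gradient of (1/2n)||Xw - y||^2 restricted to one shard."""
    return X[idx].T @ (X[idx] @ w - y[idx]) / n

# "always communicate": reduce w = w - alpha * sum_k dw_k, every step
w = np.zeros(d)
for _ in range(1000):
    w -= 0.1 * sum(shard_grad(w, p) for p in parts)

# "never communicate": each worker solves its shard, then one final average
local = [np.linalg.lstsq(X[p], y[p], rcond=None)[0] for p in parts]
w_oneshot = sum(local) / K    # average: w := (1/K) sum_k w_k
```

The reduce-every-step scheme recovers the global least-squares solution but pays 1000 communication rounds; the one-shot average uses a single round but in general only lands near the solution.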
Mini-batch

reduce: $w = w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees  ✔ tunable communication
a natural middle-ground
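As a concrete toy version of this middle ground (assumptions mine: least-squares loss, simulated sampling), the batch size |b| sets how much gradient work is amortized over each reduce:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)     # noiseless labels, so SGD can fit exactly

def minibatch_run(batch_size, rounds, lr=0.02):
    w = np.zeros(d)
    for _ in range(rounds):    # one communication round per mini-batch
        b = rng.choice(n, size=batch_size, replace=False)
        # reduce: w = w - (lr/|b|) * sum_{i in b} dw_i
        w -= (lr / batch_size) * X[b].T @ (X[b] @ w - y[b])
    return w

w_small = minibatch_run(batch_size=5, rounds=3000)    # cheap rounds
w_large = minibatch_run(batch_size=50, rounds=3000)   # more work per round
```

Either choice converges here; what changes is the communication/computation ratio per round, which is the knob mini-batching exposes.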
Mini-batch Limitations
1. ONE-OFF METHODS
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE
LARGE-SCALE OPTIMIZATION
CoCoA
ProxCoCoA+
Mini-batch Limitations
1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K << batch size

CoCoA-v1
1. Primal-Dual Framework

PRIMAL ≥ DUAL

$$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \right]$$

$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n} x_i$$

Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per datapoint)
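A sketch of the duality-gap stopping criterion for this primal-dual pair, instantiated (my choice, not specified on the slide) with the squared loss $\ell_i(a) = (a - y_i)^2/2$, whose conjugate is $\ell_i^*(u) = u^2/2 + u\,y_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 4, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
A = X.T / (lam * n)                      # columns A_i = x_i / (lam * n)

def primal(w):
    return lam / 2 * w @ w + np.mean((X @ w - y) ** 2) / 2

def dual(alpha):
    w = A @ alpha                        # primal-dual map w(alpha) = A alpha
    # l*_i(-alpha_i) = alpha_i^2/2 - alpha_i*y_i  for squared loss
    return -lam / 2 * w @ w - np.mean(alpha ** 2 / 2 - alpha * y)

def gap(alpha):                          # gap bounds primal suboptimality
    return primal(A @ alpha) - dual(alpha)

alpha, w = np.zeros(n), np.zeros(d)
while gap(alpha) > 1e-6:                 # the duality-gap stopping criterion
    for i in range(n):                   # SDCA sweep, closed form for sq. loss
        delta = (y[i] - X[i] @ w - alpha[i]) / (1 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)
```

The gap is computable from quantities both sides already have, which is why it makes a practical stopping rule.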
1. Primal-Dual Framework

Global objective:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n} x_i$$

Local objective:
$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \; -\frac{1}{n}\sum_{i \in P_k} \ell_i^*\big({-\alpha_i} - (\Delta\alpha_{[k]})_i\big) - \frac{1}{n} w^T A \Delta\alpha_{[k]} - \frac{\lambda}{2}\left\|\frac{1}{\lambda n} A \Delta\alpha_{[k]}\right\|^2$$

Can solve the local objective using any internal optimization method
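Putting the pieces together, here is a minimal CoCoA-v1 round structure (my reconstruction, with squared loss so the local SDCA coordinate step has a closed form; not the authors' released code): each worker improves its own dual block against a local copy of w, and the driver averages the resulting updates.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, K, lam = 40, 5, 4, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
parts = np.array_split(np.arange(n), K)    # P_k: datapoints on machine k

def primal(w):
    return lam / 2 * w @ w + np.mean((X @ w - y) ** 2) / 2

def local_solver(w, alpha, idx, epochs=10):
    """Approximately solve machine k's local objective with SDCA steps."""
    w_loc, dalpha = w.copy(), np.zeros(n)
    for _ in range(epochs):
        for i in idx:
            delta = (y[i] - X[i] @ w_loc - (alpha[i] + dalpha[i])) \
                    / (1 + X[i] @ X[i] / (lam * n))
            dalpha[i] += delta
            w_loc += delta * X[i] / (lam * n)   # apply update immediately
    return dalpha, w_loc - w

alpha, w = np.zeros(n), np.zeros(d)
for t in range(300):                       # outer (communication) rounds
    updates = [local_solver(w, alpha, p) for p in parts]
    alpha += sum(da for da, _ in updates) / K
    w += sum(dw for _, dw in updates) / K  # reduce: w = w + (1/K) sum_k dw_k
```

The local solver here happens to be SDCA, but the framework's point is that any internal method meeting the approximation assumption could be substituted.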
2. Immediately Apply Updates

STALE:
for i ∈ b
    Δw ← Δw − α ∇_i P(w)
end
w ← w + Δw

FRESH:
for i ∈ b
    Δw ← −α ∇_i P(w)
    w ← w + Δw
end
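A toy contrast of the two schemes (illustrative only; the quadratic and step size are my choices): "stale" accumulates coordinate updates all computed at the round's starting point, while "fresh" applies each update before computing the next, Gauss-Seidel style.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
Q = rng.normal(size=(d, d))
Q = Q @ Q.T / d + np.eye(d)          # P(w) = 1/2 w^T Q w, strongly convex
P = lambda w: 0.5 * w @ Q @ w
grad_i = lambda w, i: Q[i] @ w       # i-th partial derivative of P

def sweep(w0, lr, fresh):
    w, dw = w0.copy(), np.zeros(d)
    for i in range(d):               # batch b = all coordinates, in order
        if fresh:
            w[i] -= lr * grad_i(w, i)    # apply the local update immediately
        else:
            dw[i] -= lr * grad_i(w, i)   # gradient at the stale start point
    return w + dw                    # deferred update (dw = 0 when fresh)

w0 = rng.normal(size=d)
w_stale = sweep(w0, 0.1, fresh=False)
w_fresh = sweep(w0, 0.1, fresh=True)
```

Both make progress, but they visit different points: the fresh sweep's later steps already see the earlier steps' work.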
3. Average over K

reduce: $w = w + \frac{1}{K} \sum_k \Delta w_k$
CoCoA-v1 Limitations

Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith, et al., ICML '15]
L1-regularized objectives are not covered in the initial framework
LARGE-SCALE OPTIMIZATION
CoCoA
ProxCoCoA+
L1 Regularization
Encourages sparse solutions
Includes popular models: lasso regression, sparse logistic regression, elastic net-regularized problems
Beneficial to distribute by feature
Can we map this to the CoCoA setup?
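A one-function illustration of why L1 encourages sparsity (background math, not a slide from the talk): the proximal operator of $\lambda\|\cdot\|_1$ soft-thresholds, mapping every coordinate within $\lambda$ of zero to exactly zero.

```python
import numpy as np

def soft_threshold(v, lam):
    """prox of lam*||.||_1: argmin_w  1/2 ||w - v||^2 + lam ||w||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, 0.2, -0.05, -1.5, 0.6])
w = soft_threshold(v, 0.5)   # small entries become exactly zero
# -> [ 2.5  0.  -0.  -1.   0.1]
```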
Solution: Solve Primal Directly

PRIMAL ≥ DUAL

$$\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i) \qquad\qquad \min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n g_i^*(-x_i^T w)$$

Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)
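To make "one primal variable per feature" concrete, here is a sketch (my reconstruction, not the released ProxCoCoA+ code) of coordinate descent on the lasso $\min_\alpha \frac{1}{2n}\|X\alpha - y\|^2 + \lambda\|\alpha\|_1$, with the feature columns split into K blocks; each block only ever touches its own coordinates of $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, K, lam = 100, 20, 4, 0.1
X = rng.normal(size=(n, d))
a_true = np.zeros(d); a_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
y = X @ a_true + 0.1 * rng.normal(size=n)
feat_parts = np.array_split(np.arange(d), K)   # features per machine

def objective(a):
    return np.sum((X @ a - y) ** 2) / (2 * n) + lam * np.abs(a).sum()

a = np.zeros(d)
r = y - X @ a                        # shared residual
col_sq = (X ** 2).sum(axis=0) / n
for t in range(100):
    for part in feat_parts:          # in the framework these run in parallel
        for j in part:               # soft-threshold update for feature j
            rho = X[:, j] @ r / n + col_sq[j] * a[j]
            a_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (a[j] - a_new)   # keep residual in sync
            a[j] = a_new
```

I run the blocks sequentially for clarity; the structural point is that the L1 term separates per feature, so each machine can own a block of primal coordinates.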
50x speedup
CoCoA: A Framework for Distributed Optimization

flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models

1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]
CoCoA Framework
Impact & Adoption
gingsmith.github.io/cocoa/
Numerous talks & demos (e.g. ICML)
Open-source code & documentation
Industry & academic adoption
Thanks!
cs.berkeley.edu/~vsmith
github.com/gingsmith/proxcocoa
github.com/gingsmith/cocoa
Empirical Results

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
url        2M             3M             4e-3%      4
kddb       19M            29M            1e-4%      4
epsilon    400K           2K             100%       8
webspam    350K           16M            0.02%      16
A First Approach: CoCoA+ with Smoothing

Issue: CoCoA+ requires strongly-convex regularizers
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465

✗ CoCoA+ with smoothing doesn't work
Additionally, CoCoA+ distributes by datapoint, not by feature
Better Solution: ProxCoCoA+

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465
ProxCoCoA+       0.6030
Convergence

Assumption: Local Θ-Approximation. For $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution satisfying:
$$\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \le \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$$

Theorem 1. Let $g_i$ have $L$-bounded support. Then
$$T \ge O\left(\frac{1}{1-\Theta}\left(\frac{8L^2 n^2}{\tau\epsilon} + c\right)\right)$$

Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then
$$T \ge \frac{1}{1-\Theta}\,\frac{\mu\tau + n}{\mu\tau}\,\log\frac{n}{\epsilon}$$