Virginia Smith, Researcher, UC Berkeley, at MLconf SF 2016
TRANSCRIPT
CoCoA: A General Framework for Communication-Efficient Distributed Optimization
Virginia Smith ⋅ Simone Forte ⋅ Chenxin Ma ⋅ Martin Takáč ⋅ Martin Jaggi ⋅ Michael I. Jordan
Machine Learning with Large Datasets
image/music/video tagging ⋅ document categorization ⋅ item recommendation ⋅ click-through rate prediction ⋅ sequence tagging ⋅ protein structure prediction ⋅ sensor data prediction ⋅ spam classification ⋅ fraud detection
Machine Learning Workflow
DATA & PROBLEM: classification, regression, collaborative filtering, …
MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …
SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …
Open Problem: efficiently solving the objective when the data is distributed
Distributed Optimization

“always communicate”: reduce: $w = w - \alpha \sum_k \Delta w_k$
✔ convergence guarantees  ✗ high communication

“never communicate”: average: $w := \frac{1}{K} \sum_k w_k$
✔ low communication  ✗ convergence not guaranteed
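The two extremes can be simulated in a few lines. This is an illustrative sketch (my construction, not the speaker's code): K workers hold shards of a least-squares problem; the "always communicate" scheme does one reduce per gradient step, while the "never communicate" scheme solves locally and averages once.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 80, 5, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)   # noisy labels
parts = np.array_split(np.arange(n), K)                 # each worker's shard

def shard_grad(w, idx):
    """Gradient of (1/2n)||Xw - y||^2 restricted to one shard."""
    return X[idx].T @ (X[idx] @ w - y[idx]) / n

# "always communicate": reduce w = w - alpha * sum_k dw_k, every step
w = np.zeros(d)
for _ in range(1000):
    w -= 0.1 * sum(shard_grad(w, p) for p in parts)

# "never communicate": each worker solves its shard, then one final average
local = [np.linalg.lstsq(X[p], y[p], rcond=None)[0] for p in parts]
w_oneshot = sum(local) / K    # average: w := (1/K) sum_k w_k
```

The reduce-every-step scheme recovers the global least-squares solution but pays 1000 communication rounds; the one-shot average uses a single round but in general only lands near the solution.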
Mini-batch

reduce: $w = w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
✔ convergence guarantees  ✔ tunable communication
a natural middle-ground
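As a concrete toy version of this middle ground (assumptions mine: least-squares loss, simulated sampling), the batch size |b| sets how much gradient work is amortized over each reduce:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)     # noiseless labels, so SGD can fit exactly

def minibatch_run(batch_size, rounds, lr=0.02):
    w = np.zeros(d)
    for _ in range(rounds):    # one communication round per mini-batch
        b = rng.choice(n, size=batch_size, replace=False)
        # reduce: w = w - (lr/|b|) * sum_{i in b} dw_i
        w -= (lr / batch_size) * X[b].T @ (X[b] @ w - y[b])
    return w

w_small = minibatch_run(batch_size=5, rounds=3000)    # cheap rounds
w_large = minibatch_run(batch_size=50, rounds=3000)   # more work per round
```

Either choice converges here; what changes is the communication/computation ratio per round, which is the knob mini-batching exposes.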
Mini-batch Limitations
1. ONE-OFF METHODS
2. STALE UPDATES
3. AVERAGE OVER BATCH SIZE
LARGE-SCALE OPTIMIZATION
CoCoA
ProxCoCoA+
Mini-batch Limitations
1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K << batch size

CoCoA-v1
1. Primal-Dual Framework

PRIMAL ≥ DUAL

$$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \right]$$

$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n} x_i$$

Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per datapoint)
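A sketch of the duality-gap stopping criterion for this primal-dual pair, instantiated (my choice, not specified on the slide) with the squared loss $\ell_i(a) = (a - y_i)^2/2$, whose conjugate is $\ell_i^*(u) = u^2/2 + u\,y_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 4, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
A = X.T / (lam * n)                      # columns A_i = x_i / (lam * n)

def primal(w):
    return lam / 2 * w @ w + np.mean((X @ w - y) ** 2) / 2

def dual(alpha):
    w = A @ alpha                        # primal-dual map w(alpha) = A alpha
    # l*_i(-alpha_i) = alpha_i^2/2 - alpha_i*y_i  for squared loss
    return -lam / 2 * w @ w - np.mean(alpha ** 2 / 2 - alpha * y)

def gap(alpha):                          # gap bounds primal suboptimality
    return primal(A @ alpha) - dual(alpha)

alpha, w = np.zeros(n), np.zeros(d)
while gap(alpha) > 1e-6:                 # the duality-gap stopping criterion
    for i in range(n):                   # SDCA sweep, closed form for sq. loss
        delta = (y[i] - X[i] @ w - alpha[i]) / (1 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)
```

The gap is computable from quantities both sides already have, which is why it makes a practical stopping rule.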
1. Primal-Dual Framework

Global objective:
$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n} x_i$$

Local objective:
$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \; -\frac{1}{n}\sum_{i \in P_k} \ell_i^*\big({-\alpha_i} - (\Delta\alpha_{[k]})_i\big) - \frac{1}{n} w^T A \Delta\alpha_{[k]} - \frac{\lambda}{2}\left\|\frac{1}{\lambda n} A \Delta\alpha_{[k]}\right\|^2$$

Can solve the local objective using any internal optimization method
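Putting the pieces together, here is a minimal CoCoA-v1 round structure (my reconstruction, with squared loss so the local SDCA coordinate step has a closed form; not the authors' released code): each worker improves its own dual block against a local copy of w, and the driver averages the resulting updates.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, K, lam = 40, 5, 4, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
parts = np.array_split(np.arange(n), K)    # P_k: datapoints on machine k

def primal(w):
    return lam / 2 * w @ w + np.mean((X @ w - y) ** 2) / 2

def local_solver(w, alpha, idx, epochs=10):
    """Approximately solve machine k's local objective with SDCA steps."""
    w_loc, dalpha = w.copy(), np.zeros(n)
    for _ in range(epochs):
        for i in idx:
            delta = (y[i] - X[i] @ w_loc - (alpha[i] + dalpha[i])) \
                    / (1 + X[i] @ X[i] / (lam * n))
            dalpha[i] += delta
            w_loc += delta * X[i] / (lam * n)   # apply update immediately
    return dalpha, w_loc - w

alpha, w = np.zeros(n), np.zeros(d)
for t in range(300):                       # outer (communication) rounds
    updates = [local_solver(w, alpha, p) for p in parts]
    alpha += sum(da for da, _ in updates) / K
    w += sum(dw for _, dw in updates) / K  # reduce: w = w + (1/K) sum_k dw_k
```

The local solver here happens to be SDCA, but the framework's point is that any internal method meeting the approximation assumption could be substituted.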
2. Immediately Apply Updates

STALE:
for i ∈ b
    Δw ← Δw − α ∇_i P(w)
end
w ← w + Δw

FRESH:
for i ∈ b
    Δw ← −α ∇_i P(w)
    w ← w + Δw
end
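A toy contrast of the two schemes (illustrative only; the quadratic and step size are my choices): "stale" accumulates coordinate updates all computed at the round's starting point, while "fresh" applies each update before computing the next, Gauss-Seidel style.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
Q = rng.normal(size=(d, d))
Q = Q @ Q.T / d + np.eye(d)          # P(w) = 1/2 w^T Q w, strongly convex
P = lambda w: 0.5 * w @ Q @ w
grad_i = lambda w, i: Q[i] @ w       # i-th partial derivative of P

def sweep(w0, lr, fresh):
    w, dw = w0.copy(), np.zeros(d)
    for i in range(d):               # batch b = all coordinates, in order
        if fresh:
            w[i] -= lr * grad_i(w, i)    # apply the local update immediately
        else:
            dw[i] -= lr * grad_i(w, i)   # gradient at the stale start point
    return w + dw                    # deferred update (dw = 0 when fresh)

w0 = rng.normal(size=d)
w_stale = sweep(w0, 0.1, fresh=False)
w_fresh = sweep(w0, 0.1, fresh=True)
```

Both make progress, but they visit different points: the fresh sweep's later steps already see the earlier steps' work.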
3. Average over K

reduce: $w = w + \frac{1}{K} \sum_k \Delta w_k$
CoCoA-v1 Limitations

Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith, et al., ICML '15]
L1-regularized objectives are not covered in the initial framework
LARGE-SCALE OPTIMIZATION
CoCoA
ProxCoCoA+
L1 Regularization
Encourages sparse solutions
Includes popular models: lasso regression, sparse logistic regression, elastic net-regularized problems
Beneficial to distribute by feature
Can we map this to the CoCoA setup?
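A one-function illustration of why L1 encourages sparsity (background math, not a slide from the talk): the proximal operator of $\lambda\|\cdot\|_1$ soft-thresholds, mapping every coordinate within $\lambda$ of zero to exactly zero.

```python
import numpy as np

def soft_threshold(v, lam):
    """prox of lam*||.||_1: argmin_w  1/2 ||w - v||^2 + lam ||w||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([3.0, 0.2, -0.05, -1.5, 0.6])
w = soft_threshold(v, 0.5)   # small entries become exactly zero
# -> [ 2.5  0.  -0.  -1.   0.1]
```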
Solution: Solve Primal Directly

PRIMAL ≥ DUAL

$$\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i) \qquad\qquad \min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n g_i^*(-x_i^T w)$$

Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)
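To make "one primal variable per feature" concrete, here is a sketch (my reconstruction, not the released ProxCoCoA+ code) of coordinate descent on the lasso $\min_\alpha \frac{1}{2n}\|X\alpha - y\|^2 + \lambda\|\alpha\|_1$, with the feature columns split into K blocks; each block only ever touches its own coordinates of $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, K, lam = 100, 20, 4, 0.1
X = rng.normal(size=(n, d))
a_true = np.zeros(d); a_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
y = X @ a_true + 0.1 * rng.normal(size=n)
feat_parts = np.array_split(np.arange(d), K)   # features per machine

def objective(a):
    return np.sum((X @ a - y) ** 2) / (2 * n) + lam * np.abs(a).sum()

a = np.zeros(d)
r = y - X @ a                        # shared residual
col_sq = (X ** 2).sum(axis=0) / n
for t in range(100):
    for part in feat_parts:          # in the framework these run in parallel
        for j in part:               # soft-threshold update for feature j
            rho = X[:, j] @ r / n + col_sq[j] * a[j]
            a_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (a[j] - a_new)   # keep residual in sync
            a[j] = a_new
```

I run the blocks sequentially for clarity; the structural point is that the L1 term separates per feature, so each machine can own a block of primal coordinates.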
50x speedup
CoCoA: A Framework for Distributed Optimization

flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models

1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]
CoCoA Framework
Impact & Adoption
gingsmith.github.io/cocoa/
Numerous talks & demos (e.g. ICML)
Open-source code & documentation
Industry & academic adoption
Thanks!
cs.berkeley.edu/~vsmith
github.com/gingsmith/proxcocoa
github.com/gingsmith/cocoa
Empirical Results

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
url        2M             3M             4e-3%      4
kddb       19M            29M            1e-4%      4
epsilon    400K           2K             100%       8
webspam    350K           16M            0.02%      16
A First Approach: CoCoA+ with Smoothing

Issue: CoCoA+ requires strongly-convex regularizers
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465

✗ CoCoA+ with smoothing doesn't work
Additionally, CoCoA+ distributes by datapoint, not by feature
Better Solution: ProxCoCoA+

Amount of L2     Final Sparsity
Ideal (δ = 0)    0.6030
δ = 0.0001       0.6035
δ = 0.001        0.6240
δ = 0.01         0.6465
ProxCoCoA+       0.6030
Convergence

Assumption: Local Θ-Approximation. For $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution satisfying:
$$\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \le \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$$

Theorem 1. Let $g_i$ have $L$-bounded support. Then
$$T \ge O\left(\frac{1}{1-\Theta}\left(\frac{8L^2 n^2}{\tau\epsilon} + c\right)\right)$$

Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then
$$T \ge \frac{1}{1-\Theta}\,\frac{\mu\tau + n}{\mu\tau}\,\log\frac{n}{\epsilon}$$