accelerated, parallel and proximal coordinate descent moscow february 2014 approx peter richtárik...
TRANSCRIPT
![Page 1: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/1.jpg)
Accelerated, Parallel and PROXimal coordinate descent
Moscow February 2014
A P PROXPeter Richtárik
(Joint work with Olivier Fercoq - arXiv:1312.5799)
![Page 2: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/2.jpg)
Optimization Problem
![Page 3: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/3.jpg)
Problem
Convex (smooth or nonsmooth)
Convex (smooth or nonsmooth)- separable- allow
Loss Regularizer
![Page 4: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/4.jpg)
Regularizer: examples
No regularizer Weighted L1 norm
Weighted L2 normBox constraints
e.g., SVM dual
e.g., LASSO
![Page 5: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/5.jpg)
Loss: examples
Quadratic loss
L-infinity
L1 regression
Exponential loss
Logistic loss
Square hinge loss
BKBG’11RT’11bTBRS’13RT ’13a
FR’13
![Page 6: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/6.jpg)
RANDOMIZED COORDINATE DESCENT
IN 2D
![Page 7: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/7.jpg)
Find the minimizer of
2D OptimizationContours of a function
Goal:
a2 =b2
![Page 8: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/8.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
N
S
EW
![Page 9: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/9.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
N
S
EW
![Page 10: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/10.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
N
S
EW
2
![Page 11: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/11.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
23 N
S
EW
![Page 12: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/12.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
23
4N
S
EW
![Page 13: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/13.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
23
4N
S
EW
5
![Page 14: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/14.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
23
45
6
N
S
EW
![Page 15: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/15.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
1
23
45
N
S
EW
67SOLVED!
![Page 16: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/16.jpg)
CONTRIBUTIONS
![Page 17: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/17.jpg)
Variants of Randomized Coordinate Descent Methods
• Block– can operate on “blocks” of coordinates – as opposed to just on individual coordinates
• General – applies to “general” (=smooth convex) functions – as opposed to special ones such as quadratics
• Proximal– admits a “nonsmooth regularizer” that is kept intact in solving subproblems – regularizer not smoothed, nor approximated
• Parallel – operates on multiple blocks / coordinates in parallel– as opposed to just 1 block / coordinate at a time
• Accelerated– achieves O(1/k^2) convergence rate for convex functions– as opposed to O(1/k)
• Efficient– complexity of 1 iteration is O(1) per processor on sparse problems – as opposed to O(# coordinates) : avoids adding two full vectors
![Page 18: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/18.jpg)
Brief History of Randomized Coordinate Descent Methods
+ new long stepsizes
![Page 19: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/19.jpg)
APPROX
![Page 20: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/20.jpg)
“PROXIMAL”“PAR
ALLE
L”
“ACCELERATED”
A P PROX
![Page 21: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/21.jpg)
PCDM (R. & Takáč, 2012) = APPROX if we force
![Page 22: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/22.jpg)
APPROX: Smooth Case
Want this to be as large as possible
Update for coordinate i
Partial derivative of f
![Page 23: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/23.jpg)
CONVERGENCE RATE
![Page 24: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/24.jpg)
Convergence Rate
average # coordinates updated / iteration
# coordinates# iterations
implies
Theorem [FR’13b]
Key assumption
![Page 25: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/25.jpg)
Special Case: Fully Parallel Variant
all coordinates are updated in each iteration
# normalized weights (summing to n)
# iterations
implies
![Page 26: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/26.jpg)
Special Case: Effect of New Stepsizes
Average degree of separability
“Average” of the Lipschitz constants
With the new stepsizes (will mention later!), we have:
![Page 27: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/27.jpg)
“EFFICIENCY” OF
APPROX
![Page 28: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/28.jpg)
Cost of 1 Iteration of APPROX
Assume N = n (all blocks are of size 1)and that
Sparse matrixThen the average cost of 1 iteration of APPROX is
Scalar function: derivative = O(1)
arithmetic ops
= average # nonzeros in a column of A
![Page 29: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/29.jpg)
Bottleneck: Computation of Partial Derivatives
maintained
![Page 30: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/30.jpg)
PRELIMINARYEXPERIMENTS
![Page 31: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/31.jpg)
L1 Regularized L1 Regression
Dorothea dataset:
Gradient Method
Nesterov’s Accelerated Gradient Method
SPCDM
APPROX
![Page 32: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/32.jpg)
L1 Regularized L1 Regression
![Page 33: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/33.jpg)
L1 Regularized Least Squares (LASSO)
KDDB dataset:
PCDM
APPROX
![Page 34: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/34.jpg)
Training Linear SVMs
Malicious URL dataset:
![Page 35: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/35.jpg)
Choice of Stepsizes:
How (not) to ParallelizeCoordinate Descent
![Page 36: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/36.jpg)
Convergence of Randomized Coordinate Descent
Strongly convex F(Simple Mehod)
Smooth or ‘simple’ nonsmooth F(Accelerated Method)
‘Difficult’ nonsmooth F(Accelerated Method)
or smooth F(Simple method)
‘Difficult’ nonsmooth F(Simple Method)
Focus on n
(big data = big n)
![Page 37: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/37.jpg)
Parallelization Dream
Depends on to what extent we can add up individual updates, which depends on the properties of F and the
way coordinates are chosen at each iteration
Serial Parallel
What do we actually get?WANT
![Page 38: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/38.jpg)
“Naive” parallelization
Do the same thing as before, but
for MORE or ALL coordinates &
ADD UP the updates
![Page 39: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/39.jpg)
Failure of naive parallelization
1a
1b
0
![Page 40: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/40.jpg)
Failure of naive parallelization
1
1a
1b
0
![Page 41: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/41.jpg)
Failure of naive parallelization
1
2a
2b
![Page 42: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/42.jpg)
Failure of naive parallelization
1
2a
2b
2
![Page 43: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/43.jpg)
Failure of naive parallelization
2
OOPS!
![Page 44: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/44.jpg)
1
1a
1b
0
Idea: averaging updates may help
SOLVED!
![Page 45: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/45.jpg)
Averaging can be too conservative
1a
1b
0
12a
2b
2
and so on...
![Page 46: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/46.jpg)
Averaging may be too conservative 2
WANT
BAD!!!But we wanted:
![Page 47: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/47.jpg)
What to do?
Averaging:
Summation:
Update to coordinate i
i-th unit coordinate vector
Figure out when one can safely use:
![Page 48: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/48.jpg)
ESO:Expected SeparableOverapproximation
![Page 49: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/49.jpg)
5 Models for f Admitting Small
1
2
3
Smooth partially separable f [RT’11b ]
Nonsmooth max-type f [FR’13]
f with ‘bounded Hessian’ [BKBG’11, RT’13a ]
![Page 50: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/50.jpg)
5 Partially separable f with block smooth components [FR’13b]
5 Models for f Admitting Small
4 Partially separable f with smooth components [NC’13]
![Page 51: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/51.jpg)
Randomized Parallel Coordinate Descent Method
Random set of coordinates (sampling)
Current iterate New iterate i-th unit coordinate vector
Update to i-th coordinate
![Page 52: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/52.jpg)
ESO: Expected Separable Overapproximation
Definition [RT’11b]
1. Separable in h2. Can minimize in parallel3. Can compute updates for only
Shorthand:
Minimize in h
![Page 53: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/53.jpg)
PART II.ADDITIONAL TOPICS
![Page 54: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/54.jpg)
Partial Separability and
Doubly Uniform Samplings
![Page 55: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/55.jpg)
Serial uniform samplingProbability law:
![Page 56: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/56.jpg)
-nice samplingProbability law:
Good for shared memory systems
![Page 57: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/57.jpg)
Doubly uniform sampling
Probability law:
Can model unreliable processors / machines
![Page 58: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/58.jpg)
ESO for partially separable functionsand doubly uniform samplings
Theorem [RT’11b]
1 Smooth partially separable f [RT’11b ]
![Page 59: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/59.jpg)
PCDM: Theoretical Speedup
Much of Big Data is here!
degree of partial separability
# coordinates
# coordinate updates / iter
WEAK OR NO SPEEDUP: Non-separable (dense) problems
LINEAR OR GOOD SPEEDUP: Nearly separable (sparse) problems
![Page 60: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/60.jpg)
![Page 61: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/61.jpg)
n = 1000(# coordinates)
Theory
![Page 62: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/62.jpg)
Practice
n = 1000(# coordinates)
![Page 63: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/63.jpg)
PCDM: Experiment with a 1 billion-by-2 billion
LASSO problem
![Page 64: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/64.jpg)
Optimization with Big Data
* in a billion dimensional space on a foggy day
Extreme* Mountain Climbing=
![Page 65: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/65.jpg)
Coordinate Updates
![Page 66: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/66.jpg)
Iterations
![Page 67: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/67.jpg)
Wall Time
![Page 68: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/68.jpg)
Distributed-Memory Coordinate Descent
![Page 69: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/69.jpg)
Distributed -nice sampling
Probability law:
Machine 2Machine 1 Machine 3
Good for a distributed version of coordinate descent
![Page 70: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/70.jpg)
ESO: Distributed setting
Theorem [RT’13b]
3 f with ‘bounded Hessian’ [BKBG’11, RT’13a ]
spectral norm of the data
![Page 71: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/71.jpg)
Bad partitioning at most doubles # of iterations
spectral norm of the partitioning
Theorem [RT’13b]
# nodes
# iterations = implies
# updates/node
![Page 72: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/72.jpg)
![Page 73: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/73.jpg)
LASSO with a 3TB data matrix
128 Cray XE6 nodes with 4 MPI processes (c = 512) Each node: 2 x 16-cores with 32GB RAM
= # coordinates
![Page 74: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/74.jpg)
• Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for L1-regularized loss minimization. JMLR 2011.
• Yurii Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
• [RT’11b] P.R. and Martin Takáč, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Prog., 2012.
• Rachael Tappenden, P.R. and Jacek Gondzio, Inexact coordinate descent: complexity and preconditioning, arXiv: 1304.5530, 2013.
• Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
• Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.
References: serial coordinate descent
![Page 75: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/75.jpg)
• [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin, Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011
• [RT’12] P.R. and Martin Takáč, Parallel coordinate descen methods for big data optimization. arXiv:1212.0873, 2012
• Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013
• [FR’13a] Olivier Fercoq and P.R., Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013
• [RT’13a] P.R. and Martin Takáč, Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013
• [RT’13b] P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013
References: parallel coordinate descent
Good entry point to the topic (4p paper)
![Page 76: Accelerated, Parallel and PROXimal coordinate descent Moscow February 2014 APPROX Peter Richtárik (Joint work with Olivier Fercoq - arXiv:1312.5799)](https://reader038.vdocument.in/reader038/viewer/2022110116/551c2d0d550346b24f8b61ce/html5/thumbnails/76.jpg)
• P.R. and Martin Takáč, Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings 2012.
• Rachael Tappenden, P.R. and Burak Buke, Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
• Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
• [FR’13b] Olivier Fercoq and P.R., Accelerated, Parallel and Proximal coordinate descent. arXiv:1312.5799, 2013
References: parallel coordinate descent