Peter Richtárik: Parallel Coordinate Descent Methods. NIPS 2013, Lake Tahoe

TRANSCRIPT

  • Slide 1
  • Peter Richtárik. Parallel Coordinate Descent Methods. NIPS 2013, Lake Tahoe
  • Slide 2
  • OUTLINE: I. Introduction; II. Regularized Convex Optimization; III. Parallel CD (PCDM); IV. Accelerated Parallel CD (APCDM); V. ESO? Good ESO?; VI. Experiments in 10^9 dim; VII. Distributed CD (Hydra); VIII. Nonuniform Parallel CD (NSync); IX. Mini-batch SDCA for SVMs. POSTER
  • Slide 3
  • Part I. Introduction
  • Slide 4
  • What is Randomized Coordinate Descent ?
  • Slide 5
  • 2D Optimization. Contours of a function. Goal: find the minimizer.
  • Slide 6
  • Randomized Coordinate Descent in 2D [figure sequence, slides 6-13: successive iterates 1-7, each obtained by moving along one of the coordinate directions (N, S, E, W), until the minimizer is reached: SOLVED!]
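To make the picture concrete, here is a minimal Python sketch of serial randomized coordinate descent (my own illustrative code, not from the slides): at each step, pick one coordinate uniformly at random and take a 1D gradient step along it. The toy quadratic objective and the 1/L_i step rule are illustrative assumptions.

```python
import numpy as np

def serial_rcd(grad_i, lipschitz, x0, iters=2000, seed=0):
    """Serial randomized coordinate descent.

    grad_i(x, i)  -- partial derivative of f at x w.r.t. coordinate i
    lipschitz[i]  -- coordinate-wise Lipschitz constant L_i of that partial derivative
    """
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    n = x.size
    for _ in range(iters):
        i = int(rng.integers(n))                 # pick a coordinate uniformly at random
        x[i] -= grad_i(x, i) / lipschitz[i]      # 1D gradient step along coordinate i
    return x

# Illustrative 2D example: f(x) = 0.5 * ||A x - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_i = lambda x, i: A[:, i] @ (A @ x - b)      # i-th partial derivative of f
L = (A * A).sum(axis=0)                          # L_i = ||A[:, i]||^2
print(serial_rcd(grad_i, L, np.zeros(2)))        # approaches the minimizer of f
```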
  • Slide 14
  • Convergence of Randomized Coordinate Descent: rates for strongly convex F, for smooth or simple nonsmooth F, and for difficult nonsmooth F. Focus on n (big data = big n).
  • Slide 15
  • Parallelization Dream: Serial vs. Parallel. What we WANT vs. what we actually get depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way coordinates are chosen at each iteration.
  • Slide 16
  • How (not) to Parallelize Coordinate Descent
  • Slide 17
  • Naive parallelization: do the same thing as before, but for MORE (or ALL) coordinates, and ADD UP the updates (a sketch follows).
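A minimal sketch of the naive scheme just described (illustrative code, not from the talk): every chosen coordinate computes its update at the same point x, and the updates are simply added up. The helper grad_i and the 1/L_i step are the same assumptions as in the serial sketch above.

```python
import numpy as np

def naive_parallel_step(grad_i, lipschitz, x, coords):
    """Naive parallelization: compute the update for each coordinate in `coords`
    at the SAME point x (embarrassingly parallel), then ADD UP all the updates."""
    updates = {i: -grad_i(x, i) / lipschitz[i] for i in coords}
    x_new = x.copy()
    for i, h in updates.items():
        x_new[i] += h    # summation of individually-good updates -- may overshoot
    return x_new
```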
  • Slide 18
  • Failure of naive parallelization [figure sequence, slides 18-22: starting from point 0, the two individual updates 1a and 1b each make progress, but their sum 1 overshoots; repeating the step gives 2a, 2b and an even worse point 2. OOPS!]
  • Slide 23
  • Idea: averaging updates may help [figure: averaging the two updates 1a and 1b from point 0 lands at the minimizer. SOLVED!]
  • Slide 24
  • Averaging can be too conservative [figure: from point 0, updates 1a and 1b average to a short step 1; then 2a and 2b average to 2; and so on...]
  • Slide 25
  • Averaging may be too conservative: the averaged step 2 is much shorter than the summed step we WANTED (BAD!!!).
  • Slide 26
  • What to do? Figure out when one can safely use summation rather than averaging of the updates (notation: h^(i) is the update to coordinate i, e_i the i-th unit coordinate vector; see the sketch below).
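In symbols (my reconstruction of the two candidate rules, with Ŝ the set of updated coordinates, h^{(i)} the update to coordinate i, and e_i the i-th unit coordinate vector):

```latex
\text{Averaging:}\quad x_{+} = x + \frac{1}{|\hat{S}|} \sum_{i \in \hat{S}} h^{(i)} e_i
\qquad\qquad
\text{Summation:}\quad x_{+} = x + \sum_{i \in \hat{S}} h^{(i)} e_i
```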
  • Slide 27
  • Part II. Regularized Convex Optimization
  • Slide 28
  • Problem: minimize F(x) = f(x) + Ω(x). Loss f: convex (smooth or nonsmooth). Regularizer Ω: convex (smooth or nonsmooth), separable, allowed to take the value +∞ (so constraints can be modeled).
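Written out in the standard notation of this line of work (the exact symbols on the slide may differ slightly):

```latex
\min_{x \in \mathbb{R}^n} \; F(x) = f(x) + \Omega(x),
\qquad
\Omega(x) = \sum_{i=1}^{n} \Omega_i(x_i)
\quad \text{(separable; each } \Omega_i \text{ may take the value } +\infty\text{)}.
```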
  • Slide 29
  • Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., the SVM dual).
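Typical forms of the regularizers named above (my reconstruction; λ_i ≥ 0 are weights and [a_i, b_i] is the box):

```latex
\Omega(x) = 0, \qquad
\Omega(x) = \sum_{i} \lambda_i |x_i| \ \text{(weighted L1, e.g. LASSO)}, \qquad
\Omega(x) = \tfrac{1}{2}\sum_{i} \lambda_i x_i^2 \ \text{(weighted L2)},
\qquad
\Omega_i(x_i) =
\begin{cases}
0 & a_i \le x_i \le b_i\\
+\infty & \text{otherwise}
\end{cases}
\ \text{(box constraints, e.g. the SVM dual)}.
```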
  • Slide 30
  • Loss: examples. Quadratic loss, L-infinity, L1 regression, exponential loss, logistic loss, square hinge loss [BKBG11, RT11b, TBRS13, RT13a, FR13a].
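Typical forms of the losses named above (my reconstruction, with data rows a_j and targets/labels b_j; the cited papers use specific variants):

```latex
\sum_{j} \tfrac{1}{2}\big(a_j^\top x - b_j\big)^2 \ \text{(quadratic)}, \qquad
\sum_{j} \big|a_j^\top x - b_j\big| \ \text{(L1 regression)}, \qquad
\max_{j} \big|a_j^\top x - b_j\big| \ \text{(L-infinity)},
\qquad
\sum_{j} \log\!\big(1 + e^{-b_j a_j^\top x}\big) \ \text{(logistic)}, \qquad
\sum_{j} \tfrac{1}{2}\max\{0,\, 1 - b_j a_j^\top x\}^2 \ \text{(square hinge)}, \qquad
\sum_{j} e^{-b_j a_j^\top x} \ \text{(exponential)}.
```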
  • Slide 31
  • 3 models for f with a small ESO parameter: (1) smooth partially separable f [RT11b]; (2) nonsmooth max-type f [FR13a]; (3) f with bounded Hessian [BKBG11, RT13a].
  • Slide 32
  • Part III. Parallel Coordinate Descent. P.R. and Martin Takáč, Parallel Coordinate Descent Methods for Big Data Optimization, arXiv:1212.0873, 2012 [IMA Leslie Fox Prize 2013]
  • Slide 33
  • Randomized Parallel Coordinate Descent Method: at each iteration, draw a random set of coordinates (a sampling); for each chosen coordinate i, compute an update and apply it along the i-th unit coordinate vector, moving from the current iterate to the new iterate.
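A minimal Python sketch of one way to implement this loop (illustrative only: the per-coordinate update shown is a plain gradient step scaled by the ESO parameters, whereas the paper minimizes the ESO overapproximation, possibly with a prox step):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def pcdm(grad_i, beta, w, x0, tau=8, iters=1000, seed=0):
    """Parallel coordinate descent (PCDM-style) sketch.

    beta, w : ESO parameters -- the step along coordinate i is scaled by 1/(beta * w[i]).
    tau     : number of coordinates updated per iteration (tau-nice sampling).
    """
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    n = x.size
    with ThreadPoolExecutor() as pool:
        for _ in range(iters):
            S = rng.choice(n, size=tau, replace=False)        # tau-nice sampling
            # compute the tau updates in parallel, all at the current iterate x
            hs = list(pool.map(lambda i: -grad_i(x, i) / (beta * w[i]), S))
            for i, h in zip(S, hs):                           # apply them (summation!)
                x[i] += h
    return x
```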
  • Slide 34
  • ESO: Expected Separable Overapproximation. Definition [RT11b] (written out below). Why it is useful: 1. it is separable in h; 2. it can be minimized in h in parallel; 3. updates need to be computed only for the coordinates in the sampled set. Shorthand notation: (f, Ŝ) ~ ESO(β, w).
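As I recall the definition from [RT11b]/[RT12] (a hedged reconstruction): f admits an Expected Separable Overapproximation with respect to the sampling Ŝ, with parameter β and weight vector w, if for all x and h

```latex
\mathbf{E}\Big[f\big(x + h_{[\hat{S}]}\big)\Big]
\;\le\;
f(x) + \frac{\mathbf{E}|\hat{S}|}{n}
\Big( \langle \nabla f(x), h \rangle + \tfrac{\beta}{2}\, \|h\|_w^2 \Big),
\qquad
h_{[\hat{S}]} = \sum_{i \in \hat{S}} h^{(i)} e_i,
\quad
\|h\|_w^2 = \sum_{i=1}^{n} w_i \big(h^{(i)}\big)^2 .
```

The right-hand side is separable in h, which is exactly what makes points 1-3 above possible.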
  • Slide 35
  • Convergence rate: convex f. Theorem [RT11b]: a bound on the # of iterations, stated in terms of the # of coordinates n, the average # of updated coordinates per iteration τ = E|Ŝ|, the stepsize parameter β, and the error tolerance ε, implies an ε-accurate solution with high probability (leading-order form sketched below).
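In leading order (my hedged paraphrase; constants and the distance-to-solution / level-set term are suppressed), the theorem says that

```latex
k \;=\; \mathcal{O}\!\left( \frac{n\,\beta}{\tau} \cdot \frac{1}{\varepsilon} \right)
\qquad \text{iterations suffice, where } \tau = \mathbf{E}|\hat{S}| .
```

So, as long as β stays small, doubling the average number of coordinates updated per iteration roughly halves the number of iterations.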
  • Slide 36
  • Convergence rate: strongly convex f. Theorem [RT11b]: a linear-rate bound involving the strong convexity constant of the loss f and the strong convexity constant of the regularizer.
  • Slide 37
  • Part IV. Accelerated Parallel CD. Olivier Fercoq and P.R., Accelerated, Parallel and Proximal Coordinate Descent, Manuscript, 2013.
  • Slide 38
  • Accelerated? Parallel? Proximal? 3 x YES.
  • Slide 39
  • The Algorithm: an accelerated (momentum) sequence wrapped around a parallel CD step, with stepsizes given by the ESO parameters.
  • Slide 40
  • Complexity. Theorem [FR13b]: a bound on the # of iterations in terms of the # of coordinates n, the average # of updated coordinates per iteration τ, and the error tolerance ε; the dependence on ε improves to the accelerated O(1/√ε) rate.
  • Slide 41
  • Part V. ESO? Good ESO? (e.g., partially separable f & doubly uniform S)
  • Slide 42
  • Serial uniform sampling. Probability law: a single coordinate is chosen, each with probability 1/n.
  • Slide 43
  • τ-nice sampling. Probability law: a subset of τ coordinates is chosen uniformly at random from all subsets of size τ. Good for shared-memory systems.
  • Slide 44
  • Doubly uniform sampling. Probability law: all subsets of the same cardinality are equally likely (only the distribution of the cardinality is free). Can model unreliable processors / machines (a sketch of all three samplings follows).
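A minimal Python sketch of the three samplings just listed (my own illustrative code):

```python
import numpy as np

rng = np.random.default_rng(0)

def serial_uniform(n):
    """Serial uniform sampling: a single coordinate, each with probability 1/n."""
    return {int(rng.integers(n))}

def tau_nice(n, tau):
    """tau-nice sampling: a subset of size tau, uniform over all such subsets."""
    return set(rng.choice(n, size=tau, replace=False).tolist())

def doubly_uniform(n, size_probs):
    """Doubly uniform sampling: first draw the cardinality k from size_probs
    (size_probs[k] = P(|S| = k)), then pick a subset of that size uniformly.
    Models machines that deliver a variable number of updates per iteration."""
    k = int(rng.choice(len(size_probs), p=size_probs))
    return set(rng.choice(n, size=k, replace=False).tolist()) if k > 0 else set()
```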
  • Slide 45
  • ESO for partially separable functions and doubly uniform samplings. Theorem [RT11b]: model 1 (smooth partially separable f [RT11b]) admits an ESO; the τ-nice case is written out below.
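For instance, for the τ-nice sampling and a smooth f that is partially separable of degree ω (every term of f depends on at most ω coordinates), the ESO parameter is, as I recall from [RT11b]/[RT12],

```latex
\beta \;=\; 1 + \frac{(\omega - 1)(\tau - 1)}{\max(1,\, n - 1)},
```

with the weights w given by the coordinate-wise Lipschitz constants of f.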
  • Slide 46
  • Theoretical speedup, as a function of the degree of partial separability ω, the # of coordinates n, and the # of coordinate updates per iteration τ. LINEAR OR GOOD SPEEDUP: nearly separable (sparse) problems (much of Big Data is here!). WEAK OR NO SPEEDUP: non-separable (dense) problems.
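Combining this β with the complexity estimate above, the theoretical parallelization speedup is roughly τ/β (a hedged reconstruction of what the slide's plot shows):

```latex
\text{speedup} \;\approx\; \frac{\tau}{\beta}
\;=\; \frac{\tau}{1 + \dfrac{(\omega-1)(\tau-1)}{\max(1,\,n-1)}}
\;\approx\;
\begin{cases}
\tau & \text{if } \omega \ll n \ \ \text{(sparse, nearly separable: linear speedup)}\\[6pt]
1 & \text{if } \omega \approx n \ \ \text{(dense, non-separable: weak or no speedup)}
\end{cases}
```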
  • Slide 47
  • Slide 48
  • Theory: theoretical speedup for n = 1000 (# coordinates) [figure]
  • Slide 49
  • Practice: observed speedup for n = 1000 (# coordinates) [figure]
  • Slide 50
  • Part VI. Experiment with a 1 billion-by-2 billion LASSO problem
  • Slide 51
  • Optimization with Big Data = Extreme* Mountain Climbing (* in a billion dimensional space on a foggy day)
  • Slide 52
  • Coordinate Updates
  • Slide 53
  • Iterations
  • Slide 54
  • Wall Time
  • Slide 55
  • L2-reg logistic loss on rcv1.binary
  • Slide 56
  • Part VII. Distributed CD. P.R. and Martin Takáč, Distributed Coordinate Descent Method for Learning with Big Data, arXiv:1310.2059, 2013.
  • Slide 57
  • Distributed τ-nice sampling. Probability law: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3, ...), and each machine independently samples τ of its own coordinates uniformly at random. Good for a distributed version of coordinate descent (sketch below).
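A minimal sketch of this sampling (illustrative assumption: coordinates are partitioned into contiguous blocks, one per machine, and each machine draws τ of its own coordinates uniformly without replacement):

```python
import numpy as np

def distributed_tau_nice(n, num_machines, tau, seed=0):
    """Each machine owns a block of coordinates and independently samples tau
    of them uniformly at random; the overall sample is the union of the blocks."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(n), num_machines)   # partition of the coordinates
    sample = []
    for block in blocks:
        sample.extend(rng.choice(block, size=tau, replace=False).tolist())
    return sample

# e.g. 12 coordinates, 3 machines, tau = 2 updates per machine per iteration
print(distributed_tau_nice(12, 3, 2))
```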
  • Slide 58
  • ESO: distributed setting. Theorem [RT13b]: model 3 (f with bounded Hessian [BKBG11, RT13a]) admits an ESO whose parameter involves the spectral norm of the data.
  • Slide 59
  • Bad partitioning at most doubles the # of iterations. Theorem [RT13b]: the # of iterations is bounded in terms of the spectral norm of the partitioning, the # of nodes, and the # of updates per node; a bad partitioning increases the bound by at most a factor of 2.
  • Slide 60
  • Slide 61
  • LASSO with a 3TB data matrix. 128 Cray XE6 nodes with 4 MPI processes each (c = 512); each node: 2 x 16-core CPUs with 32GB RAM. n = # coordinates.
  • Slide 62
  • References
  • Slide 63
  • References: serial coordinate descent
    Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 2011.
    Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
    [RT11b] P.R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
    Rachael Tappenden, P.R. and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.
    Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
    Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.
  • Slide 64
  • References: parallel coordinate descent
    [BKBG11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. ICML 2011.
    [RT12] P.R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
    Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
    [FR13a] Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.
    [RT13a] P.R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.
    [RT13b] P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013. (Good entry point to the topic: 4p paper.)
  • Slide 65
  • References: parallel coordinate descent (continued)
    P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012.
    [FR13b] Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent. Manuscript, 2013.
    Rachael Tappenden, P.R. and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
    Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
    Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. NIPS 2013.