francis bach inria - ecole normale sup´erieure, paris, francefbach/fbach_mlss_2018.pdf ·...
TRANSCRIPT
![Page 1: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/1.jpg)
Statistical machine learningand convex optimization
Francis Bach
INRIA - Ecole Normale Superieure, Paris, France
ÉCOLE NORMALE
S U P É R I E U R E
Machine Learning Summer School - Madrid, 2018
Slides available: www.di.ens.fr/~fbach/fbach_mlss_2018.pdf
1
![Page 2: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/2.jpg)
“Big data/AI” revolution?
A new scientific context
• Data everywhere: size does not (always) matter
• Science and industry
• Size and variety
• Learning from examples
– n observations in dimension d
2
![Page 3: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/3.jpg)
“Big data/AI” revolution?
A new scientific context
• Data everywhere: size does not (always) matter
• Science and industry
• Size and variety
• Learning from examples
– n observations in dimension d
3
![Page 4: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/4.jpg)
Advertising
4
![Page 5: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/5.jpg)
Marketing - Personalized recommendation
5
![Page 6: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/6.jpg)
Visual object recognition
6
![Page 7: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/7.jpg)
Personal photos
• Recognizing people and places
- Emile and Francis
- Chocolateria San Gines
7
![Page 8: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/8.jpg)
Personal photos
• Recognizing people and places
– Emile and Francis
– Chocolateria San Gines
8
![Page 9: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/9.jpg)
Bioinformatics
• Protein: Crucial elements of cell life
• Massive data: 2 millions for humans
• Complex data
9
![Page 10: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/10.jpg)
Context
Machine learning for “big data”
• Large-scale machine learning: large d, large n
– d : dimension of each observation (or number of features)
– n : number of observations
• Examples: computer vision, bioinformatics, advertising
– Ideal running-time complexity: O(dn)
– Going back to simple methods
– Stochastic gradient methods (Robbins and Monro, 1951b)
– Mixing statistics and optimization
– Using smoothness to go beyond stochastic gradient descent
10
![Page 11: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/11.jpg)
Context
Machine learning for “big data”
• Large-scale machine learning: large d, large n
– d : dimension of each observation (or number of features)
– n : number of observations
• Examples: computer vision, bioinformatics, advertising
• Ideal running-time complexity: O(dn)
– Going back to simple methods
– Stochastic gradient methods (Robbins and Monro, 1951b)
– Mixing statistics and optimization
– Using smoothness to go beyond stochastic gradient descent
11
![Page 12: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/12.jpg)
Context
Machine learning for “big data”
• Large-scale machine learning: large d, large n
– d : dimension of each observation (or number of features)
– n : number of observations
• Examples: computer vision, bioinformatics, advertising
• Ideal running-time complexity: O(dn)
• Going back to simple methods
– Stochastic gradient methods (Robbins and Monro, 1951b)
– Mixing statistics and optimization
– Using smoothness to go beyond stochastic gradient descent
12
![Page 13: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/13.jpg)
Scaling to large problems
“Retour aux sources”
• 1950’s: Computers not powerful enough
IBM “1620”, 1959
CPU frequency: 50 KHz
Price > 100 000 dollars
• 2010’s: Data too massive
- Stochastic gradient methods (Robbins and Monro, 1951a)
- Going back to simple methods
13
![Page 14: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/14.jpg)
Scaling to large problems
“Retour aux sources”
• 1950’s: Computers not powerful enough
IBM “1620”, 1959
CPU frequency: 50 KHz
Price > 100 000 dollars
• 2010’s: Data too massive
• Stochastic gradient methods (Robbins and Monro, 1951a)
– Going back to simple methods
14
![Page 15: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/15.jpg)
Outline - II
1. Introduction
• Large-scale machine learning and optimization
• Classes of functions (convex, smooth, etc.)
• Traditional statistical analysis (regardless of optimization)
2. Classical methods for convex optimization
• Smooth optimization (gradient descent, Newton method)
• Non-smooth optimization (subgradient descent)
• Proximal methods
3. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex
15
![Page 16: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/16.jpg)
Outline - II
4. Classical stochastic approximation (not covered)
• Asymptotic analysis
• Robbins-Monro algorithm and Polyak-Rupert averaging
5. Smooth stochastic approximation algorithms
• Non-asymptotic analysis for smooth functions
• Least-squares regression without decaying step-sizes
6. Finite data sets (partially covered)
• Gradient methods with exponential convergence rates
• (Dual) stochastic coordinate descent
• Frank-Wolfe
7. Non-convex problems (“open” / not covered)
16
![Page 17: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/17.jpg)
Supervised machine learning
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd
– NB: non-linear problems (on the board)
17
![Page 18: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/18.jpg)
Supervised machine learning
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd
• (regularized) empirical risk minimization: find θ solution of
minθ∈Rd
1
n
n∑
i=1
ℓ(
yi, θ⊤Φ(xi)
)
+ µΩ(θ)
convex data fitting term + regularizer
18
![Page 19: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/19.jpg)
Usual losses
• Regression: y ∈ R, prediction y = θ⊤Φ(x)
– quadratic loss 12(y − y)2 = 1
2(y − θ⊤Φ(x))2
19
![Page 20: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/20.jpg)
Usual losses
• Regression: y ∈ R, prediction y = θ⊤Φ(x)
– quadratic loss 12(y − y)2 = 1
2(y − θ⊤Φ(x))2
• Classification : y ∈ −1, 1, prediction y = sign(θ⊤Φ(x))
– loss of the form ℓ(y θ⊤Φ(x))– “True” 0-1 loss: ℓ(y θ⊤Φ(x)) = 1y θ⊤Φ(x)<0
– Usual convex losses:
−3 −2 −1 0 1 2 3 40
1
2
3
4
5
0−1hingesquarelogistic
20
![Page 21: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/21.jpg)
Main motivating examples
• Support vector machine (hinge loss): non-smooth
ℓ(Y, θ⊤Φ(X)) = max1− Y θ⊤Φ(X), 0
• Logistic regression: smooth
ℓ(Y, θ⊤Φ(X)) = log(1 + exp(−Y θ⊤Φ(X)))
• Least-squares regression
ℓ(Y, θ⊤Φ(X)) =1
2(Y − θ⊤Φ(X))2
• Structured output regression
– See Tsochantaridis et al. (2005); Lacoste-Julien et al. (2013)
21
![Page 22: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/22.jpg)
Usual regularizers
• Main goal: avoid overfitting
• (squared) Euclidean norm: ‖θ‖22 =∑d
j=1 |θj|2
– Numerically well-behaved
– Representer theorem and kernel methods : θ =∑n
i=1αiΦ(xi)
– See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and
Cristianini (2004) and references therein
22
![Page 23: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/23.jpg)
Usual regularizers
• Main goal: avoid overfitting
• (squared) Euclidean norm: ‖θ‖22 =∑d
j=1 |θj|2
– Numerically well-behaved
– Representer theorem and kernel methods : θ =∑n
i=1αiΦ(xi)
– See, e.g., Scholkopf and Smola (2001); Shawe-Taylor and
Cristianini (2004) and references therein
• Sparsity-inducing norms
– Main example: ℓ1-norm ‖θ‖1 =∑d
j=1 |θj|– Perform model selection as well as regularization
– Non-smooth optimization and structured sparsity
– See, e.g., Bach et al. (2012b,a) and references therein
23
![Page 24: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/24.jpg)
Supervised machine learning
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd
• (regularized) empirical risk minimization: find θ solution of
minθ∈Rd
1
n
n∑
i=1
ℓ(
yi, θ⊤Φ(xi)
)
+ µΩ(θ)
convex data fitting term + regularizer
24
![Page 25: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/25.jpg)
Supervised machine learning
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd
• (regularized) empirical risk minimization: find θ solution of
minθ∈Rd
1
n
n∑
i=1
ℓ(
yi, θ⊤Φ(xi)
)
+ µΩ(θ)
convex data fitting term + regularizer
• Empirical risk: f(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi)) training cost
• Expected risk: f(θ) = E(x,y)ℓ(y, θ⊤Φ(x)) testing cost
• Two fundamental questions: (1) computing θ and (2) analyzing θ
– May be tackled simultaneously25
![Page 26: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/26.jpg)
Supervised machine learning
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd
• (regularized) empirical risk minimization: find θ solution of
minθ∈Rd
1
n
n∑
i=1
ℓ(
yi, θ⊤Φ(xi)
)
+ µΩ(θ)
convex data fitting term + regularizer
• Empirical risk: f(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi)) training cost
• Expected risk: f(θ) = E(x,y)ℓ(y, θ⊤Φ(x)) testing cost
• Two fundamental questions: (1) computing θ and (2) analyzing θ
– May be tackled simultaneously26
![Page 27: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/27.jpg)
Supervised machine learning
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rd
• (regularized) empirical risk minimization: find θ solution of
minθ∈Rd
1
n
n∑
i=1
ℓ(
yi, θ⊤Φ(xi)
)
such that Ω(θ) 6 D
convex data fitting term + constraint
• Empirical risk: f(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi)) training cost
• Expected risk: f(θ) = E(x,y)ℓ(y, θ⊤Φ(x)) testing cost
• Two fundamental questions: (1) computing θ and (2) analyzing θ
– May be tackled simultaneously27
![Page 28: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/28.jpg)
General assumptions
• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Bounded features Φ(x) ∈ Rd: ‖Φ(x)‖2 6 R
• Empirical risk: f(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi)) training cost
• Expected risk: f(θ) = E(x,y)ℓ(y, θ⊤Φ(x)) testing cost
• Loss for a single observation: fi(θ) = ℓ(yi, θ⊤Φ(xi))
⇒ ∀i, f(θ) = Efi(θ)
• Properties of fi, f, f
– Convex on Rd
– Additional regularity assumptions: Lipschitz-continuity,
smoothness and strong convexity
28
![Page 29: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/29.jpg)
Convexity
• Global definitions
g(θ)
θ
29
![Page 30: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/30.jpg)
Convexity
• Global definitions (full domain)
g(θ)
θ
g(θ1)g(θ2)
θ2θ1
– Not assuming differentiability:
∀θ1, θ2, α ∈ [0, 1], g(αθ1 + (1− α)θ2) 6 αg(θ1) + (1− α)g(θ2)
30
![Page 31: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/31.jpg)
Convexity
• Global definitions (full domain)
g(θ)
θ
g(θ1)g(θ2)
θ2θ1
– Assuming differentiability:
∀θ1, θ2, g(θ1) > g(θ2) + g′(θ2)⊤(θ1 − θ2)
• Extensions to all functions with subgradients / subdifferential
31
![Page 32: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/32.jpg)
Convexity
• Global definitions (full domain)
g(θ)
θ
• Local definitions
– Twice differentiable functions
– ∀θ, g′′(θ) < 0 (positive semi-definite Hessians)
32
![Page 33: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/33.jpg)
Convexity
• Global definitions (full domain)
g(θ)
θ
• Local definitions
– Twice differentiable functions
– ∀θ, g′′(θ) < 0 (positive semi-definite Hessians)
• Why convexity?
33
![Page 34: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/34.jpg)
Why convexity?
• Local minimum = global minimum
– Optimality condition (smooth): g′(θ) = 0
– Most algorithms do not need convexity for their definitions
– Local convexity
• Convex duality
– See Boyd and Vandenberghe (2003)
• Recognizing convex problems
– See Boyd and Vandenberghe (2003)
34
![Page 35: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/35.jpg)
Why convexity?
• Local minimum = global minimum
– Optimality condition (smooth): g′(θ) = 0
– Most algorithms do not need convexity for their definitions
– Local convexity around a local optimum
• Convex duality
– See Boyd and Vandenberghe (2003)
• Recognizing convex problems
– See Boyd and Vandenberghe (2003)
35
![Page 36: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/36.jpg)
Lipschitz continuity
• Bounded gradients of g (⇔ Lipschitz-continuity): the function
g if convex, differentiable and has (sub)gradients uniformly bounded
by B on the ball of center 0 and radius D:
∀θ ∈ Rd, ‖θ‖2 6 D ⇒ ‖g′(θ)‖2 6 B
⇔
∀θ, θ′ ∈ Rd, ‖θ‖2, ‖θ′‖2 6 D ⇒ |g(θ)− g(θ′)| 6 B‖θ − θ′‖2
• Machine learning
– with g(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi))
– G-Lipschitz loss and R-bounded data: B = GR (see board)
36
![Page 37: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/37.jpg)
Smoothness and strong convexity
• A function g : Rd → R is L-smooth if and only if it is differentiable
and its gradient is L-Lipschitz-continuous
∀θ1, θ2 ∈ Rd, ‖g′(θ1)− g′(θ2)‖2 6 L‖θ1 − θ2‖2
• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) 4 L · Id
smooth non−smooth
37
![Page 38: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/38.jpg)
Smoothness and strong convexity
• A function g : Rd → R is L-smooth if and only if it is differentiable
and its gradient is L-Lipschitz-continuous
∀θ1, θ2 ∈ Rd, ‖g′(θ1)− g′(θ2)‖2 6 L‖θ1 − θ2‖2
• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) 4 L · Id
• Machine learning (see board)
– with g(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi))
– Hessian ≈ covariance matrix 1n
∑ni=1Φ(xi)Φ(xi)
⊤
– Lloss-smooth loss and R-bounded data: L = LlossR2
38
![Page 39: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/39.jpg)
Smoothness and strong convexity
• A function g : Rd → R is µ-strongly convex if and only if
∀θ1, θ2 ∈ Rd, g(θ1) > g(θ2) + g′(θ2)
⊤(θ1 − θ2) +µ2‖θ1 − θ2‖22
• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) < µ · Id
convexstronglyconvex
39
![Page 40: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/40.jpg)
Smoothness and strong convexity
• A function g : Rd → R is µ-strongly convex if and only if
∀θ1, θ2 ∈ Rd, g(θ1) > g(θ2) + g′(θ2)
⊤(θ1 − θ2) +µ2‖θ1 − θ2‖22
• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) < µ · Id
(large µ/L) (small µ/L)
40
![Page 41: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/41.jpg)
Smoothness and strong convexity
• A function g : Rd → R is µ-strongly convex if and only if
∀θ1, θ2 ∈ Rd, g(θ1) > g(θ2) + g′(θ2)
⊤(θ1 − θ2) +µ2‖θ1 − θ2‖22
• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) < µ · Id
• Machine learning
– with g(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi))
– Hessian ≈ covariance matrix 1n
∑ni=1Φ(xi)Φ(xi)
⊤
– Data with invertible covariance matrix (low correlation/dimension)
41
![Page 42: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/42.jpg)
Smoothness and strong convexity
• A function g : Rd → R is µ-strongly convex if and only if
∀θ1, θ2 ∈ Rd, g(θ1) > g(θ2) + g′(θ2)
⊤(θ1 − θ2) +µ2‖θ1 − θ2‖22
• If g is twice differentiable: ∀θ ∈ Rd, g′′(θ) < µ · Id
• Machine learning
– with g(θ) = 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi))
– Hessian ≈ covariance matrix 1n
∑ni=1Φ(xi)Φ(xi)
⊤
– Data with invertible covariance matrix (low correlation/dimension)
• Adding regularization by µ2‖θ‖2
– creates additional bias unless µ is small
42
![Page 43: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/43.jpg)
Summary of smoothness/convexity assumptions
• Bounded gradients of g (Lipschitz-continuity): the function g if
convex, differentiable and has (sub)gradients uniformly bounded by
B on the ball of center 0 and radius D:
∀θ ∈ Rd, ‖θ‖2 6 D ⇒ ‖g′(θ)‖2 6 B
• Smoothness of g: the function g is convex, differentiable with
L-Lipschitz-continuous gradient g′ (e.g., bounded Hessians):
∀θ ∈ Rd, g′′(θ) 4 L · Id
• Strong convexity of g: The function g is strongly convex with
respect to the norm ‖ · ‖, with convexity constant µ > 0:
∀θ ∈ Rd, g′′(θ) < µ · Id
43
![Page 44: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/44.jpg)
Analysis of empirical risk minimization
• Approximation and estimation errors: Θ = θ ∈ Rd,Ω(θ) 6 D
f(θ)− minθ∈Rd
f(θ) =
[
f(θ)−minθ∈Θ
f(θ)
]
+
[
minθ∈Θ
f(θ)− minθ∈Rd
f(θ)
]
Estimation error Approximation error
– NB: may replace minθ∈Rd
f(θ) by best (non-linear) predictions
44
![Page 45: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/45.jpg)
Analysis of empirical risk minimization
• Approximation and estimation errors: Θ = θ ∈ Rd,Ω(θ) 6 D
f(θ)− minθ∈Rd
f(θ) =
[
f(θ)−minθ∈Θ
f(θ)
]
+
[
minθ∈Θ
f(θ)− minθ∈Rd
f(θ)
]
Estimation error Approximation error
1. Uniform deviation bounds, with θ ∈ argminθ∈Θ
f(θ)
f(θ)−minθ∈Θ
f(θ) =[
f(θ)−f(θ)]
+[
f(θ)−f((θ∗)Θ)]
+[
f((θ∗)Θ)− f((θ∗)Θ)]
6 supθ∈Θ
f(θ)− f(θ) + 0 + supθ∈Θ
f(θ)−f(θ)
45
![Page 46: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/46.jpg)
Analysis of empirical risk minimization
• Approximation and estimation errors: Θ = θ ∈ Rd,Ω(θ) 6 D
f(θ)− minθ∈Rd
f(θ) =
[
f(θ)−minθ∈Θ
f(θ)
]
+
[
minθ∈Θ
f(θ)− minθ∈Rd
f(θ)
]
Estimation error Approximation error
1. Uniform deviation bounds, with θ ∈ argminθ∈Θ
f(θ)
f(θ)−minθ∈Θ
f(θ) 6 supθ∈Θ
f(θ)− f(θ) + supθ∈Θ
f(θ)− f(θ)
– Typically slow rate O(
1/√n)
2. More refined concentration results with faster rates O(1/n)
46
![Page 47: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/47.jpg)
Analysis of empirical risk minimization
• Approximation and estimation errors: Θ = θ ∈ Rd,Ω(θ) 6 D
f(θ)− minθ∈Rd
f(θ) =
[
f(θ)−minθ∈Θ
f(θ)
]
+
[
minθ∈Θ
f(θ)− minθ∈Rd
f(θ)
]
Estimation error Approximation error
1. Uniform deviation bounds, with θ ∈ argminθ∈Θ
f(θ)
f(θ)−minθ∈Θ
f(θ) 6 2 · supθ∈Θ
|f(θ)− f(θ)|
– Typically slow rate O(
1/√n)
2. More refined concentration results with faster rates O(1/n)
47
![Page 48: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/48.jpg)
Slow rate for supervised learning
• Assumptions (f is the expected risk, f the empirical risk)
– Ω(θ) = ‖θ‖2 (Euclidean norm)
– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 6 R a.s.
– G-Lipschitz loss: f and f are GR-Lipschitz on Θ = ‖θ‖2 6 D– No assumptions regarding convexity
48
![Page 49: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/49.jpg)
Slow rate for supervised learning
• Assumptions (f is the expected risk, f the empirical risk)
– Ω(θ) = ‖θ‖2 (Euclidean norm)
– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 6 R a.s.
– G-Lipschitz loss: f and f are GR-Lipschitz on Θ = ‖θ‖2 6 D– No assumptions regarding convexity
• With probability greater than 1− δ
supθ∈Θ
|f(θ)− f(θ)| 6 ℓ0 +GRD√n
[
2 +
√
2 log2
δ
]
• Expectated estimation error: E[
supθ∈Θ
|f(θ)− f(θ)|]
64ℓ0 + 4GRD√
n
• Using Rademacher averages (see, e.g., Boucheron et al., 2005)
• Lipschitz functions ⇒ slow rate
49
![Page 50: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/50.jpg)
Motivation from mean estimation
• Estimator θ = 1n
∑ni=1 zi = argminθ∈R
12n
∑ni=1(θ − zi)
2 = f(θ)
– θ∗ = Ez = argminθ∈R12E(θ − z)2 = f(θ)
– From before (estimation error): f(θ)− f(θ∗) = O(1/√n)
50
![Page 51: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/51.jpg)
Motivation from mean estimation
• Estimator θ = 1n
∑ni=1 zi = argminθ∈R
12n
∑ni=1(θ − zi)
2 = f(θ)
– θ∗ = Ez = argminθ∈R12E(θ − z)2 = f(θ)
– From before (estimation error): f(θ)− f(θ∗) = O(1/√n)
• Direct computation:
– f(θ) = 12E(θ − z)2 = 1
2(θ − Ez)2 + 12 var(z)
• More refined/direct bound:
f(θ)− f(Ez) =1
2(θ − Ez)2
E[
f(θ)− f(Ez)]
=1
2E
(
1
n
n∑
i=1
zi − Ez
)2
=1
2nvar(z)
• Bound only at θ + strong convexity (instead of uniform bound)
51
![Page 52: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/52.jpg)
Fast rate for supervised learning
• Assumptions (f is the expected risk, f the empirical risk)
– Same as before (bounded features, Lipschitz loss)
– Regularized risks: fµ(θ) = f(θ)+µ2‖θ‖22 and fµ(θ) = f(θ)+µ
2‖θ‖22– Convexity
52
![Page 53: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/53.jpg)
Fast rate for supervised learning
• Assumptions (f is the expected risk, f the empirical risk)
– Same as before (bounded features, Lipschitz loss)
– Regularized risks: fµ(θ) = f(θ)+µ2‖θ‖22 and fµ(θ) = f(θ)+µ
2‖θ‖22– Convexity
• For any a > 0, with probability greater than 1− δ, for all θ ∈ Rd,
fµ(θ)− minη∈Rd
fµ(η) 68G2R2(32 + log 1
δ)
µn
• Results from Sridharan, Srebro, and Shalev-Shwartz (2008)
– see also Boucheron and Massart (2011) and references therein
• Strongly convex functions ⇒ fast rate
– Warning: µ should decrease with n to reduce approximation error
53
![Page 54: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/54.jpg)
Outline - II
1. Introduction
• Large-scale machine learning and optimization
• Classes of functions (convex, smooth, etc.)
• Traditional statistical analysis (regardless of optimization)
2. Classical methods for convex optimization
• Smooth optimization (gradient descent, Newton method)
• Non-smooth optimization (subgradient descent)
• Proximal methods
3. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex
54
![Page 55: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/55.jpg)
Outline - II
4. Classical stochastic approximation (not covered)
• Asymptotic analysis
• Robbins-Monro algorithm and Polyak-Rupert averaging
5. Smooth stochastic approximation algorithms
• Non-asymptotic analysis for smooth functions
• Least-squares regression without decaying step-sizes
6. Finite data sets (partially covered)
• Gradient methods with exponential convergence rates
• (Dual) stochastic coordinate descent
• Frank-Wolfe
7. Non-convex problems (“open” / not covered)
55
![Page 56: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/56.jpg)
Complexity results in convex optimization
• Assumption: g convex on Rd
• Classical generic algorithms
– Gradient descent and accelerated gradient descent
– Newton method
– Subgradient method (and ellipsoid algorithm)
56
![Page 57: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/57.jpg)
Complexity results in convex optimization
• Assumption: g convex on Rd
• Classical generic algorithms
– Gradient descent and accelerated gradient descent
– Newton method
– Subgradient method (and ellipsoid algorithm)
• Key additional properties of g
– Lipschitz continuity, smoothness or strong convexity
• Key insight from Bottou and Bousquet (2008)
– In machine learning, no need to optimize below estimation error
• Key references: Nesterov (2004), Bubeck (2015)
57
![Page 58: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/58.jpg)
Several criteria for characterizing convergence
• Objective function values
g(θ)− infη∈Rd
g(η)
– Usually weaker condition
• Iterates
infη∈argmin g
∥
∥θ − η∥
∥
2
– Typically used for strongly-convex problems
58
![Page 59: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/59.jpg)
Several criteria for characterizing convergence
• Objective function values
g(θ)− infη∈Rd
g(η)
– Usually weaker condition
• Iterates
infη∈argmin g
∥
∥θ − η∥
∥
2
– Typically used for strongly-convex problems
• NB 1: relationships between the two types in several situations
• NB 2: similarity with prediction vs. estimation in statistics
59
![Page 60: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/60.jpg)
(smooth) gradient descent
• Assumptions
– g convex with L-Lipschitz-continuous gradient (e.g., L-smooth)
– Minimum attained at θ∗
• Algorithm:
θt = θt−1 −1
Lg′(θt−1)
60
![Page 61: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/61.jpg)
(smooth) gradient descent - strong convexity
• Assumptions
– g convex with L-Lipschitz-continuous gradient (e.g., L-smooth)
– g µ-strongly convex
• Algorithm:
θt = θt−1 −1
Lg′(θt−1)
• Bound:
g(θt)− g(θ∗) 6 (1− µ/L)t[
g(θ0)− g(θ∗)]
• Three-line proof
• Line search, steepest descent or constant step-size
61
![Page 62: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/62.jpg)
(smooth) gradient descent - slow rate
• Assumptions
– g convex with L-Lipschitz-continuous gradient (e.g., L-smooth)
– Minimum attained at θ∗
• Algorithm:
θt = θt−1 −1
Lg′(θt−1)
• Bound:
g(θt)− g(θ∗) 62L‖θ0 − θ∗‖2
t+ 4• Four-line proof
• Adaptivity of gradient descent to problem difficulty
• Not best possible convergence rates after O(d) iterations
62
![Page 63: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/63.jpg)
Gradient descent - Proof for quadratic functions
• Quadratic convex function: g(θ) = 12θ
⊤Hθ − c⊤θ
– µ and L are smallest largest eigenvalues of H
– Global optimum θ∗ = H−1c (or H†c) such that Hθ∗ = c
63
![Page 64: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/64.jpg)
Gradient descent - Proof for quadratic functions
• Quadratic convex function: g(θ) = 12θ
⊤Hθ − c⊤θ
– µ and L are smallest largest eigenvalues of H
– Global optimum θ∗ = H−1c (or H†c) such that Hθ∗ = c
• Gradient descent with γ = 1/L:
θt = θt−1 −1
L(Hθt−1 − c) = θt−1 −
1
L(Hθt−1 −Hθ∗)
θt − θ∗ = (I − 1
LH)(θt−1 − θ∗) = (I − 1
LH)t(θ0 − θ∗)
64
![Page 65: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/65.jpg)
Gradient descent - Proof for quadratic functions
• Quadratic convex function: g(θ) = 12θ
⊤Hθ − c⊤θ
– µ and L are smallest largest eigenvalues of H
– Global optimum θ∗ = H−1c (or H†c) such that Hθ∗ = c
• Gradient descent with γ = 1/L:
θt = θt−1 −1
L(Hθt−1 − c) = θt−1 −
1
L(Hθt−1 −Hθ∗)
θt − θ∗ = (I − 1
LH)(θt−1 − θ∗) = (I − 1
LH)t(θ0 − θ∗)
• Strong convexity µ > 0: eigenvalues of (I − 1LH)t in [0, (1− µ
L)t]
– Convergence of iterates: ‖θt − θ∗‖2 6 (1− µ/L)2t‖θ0 − θ∗‖2– Function values: g(θt)− g(θ∗) 6 (1− µ/L)2t
[
g(θ0)− g(θ∗)]
– Function values: g(θt)− g(θ∗) 6Lt ‖θ0 − θ∗‖2
65
![Page 66: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/66.jpg)
Gradient descent - Proof for quadratic functions
• Quadratic convex function: g(θ) = 12θ
⊤Hθ − c⊤θ
– µ and L are smallest largest eigenvalues of H
– Global optimum θ∗ = H−1c (or H†c) such that Hθ∗ = c
• Gradient descent with γ = 1/L:
θt = θt−1 −1
L(Hθt−1 − c) = θt−1 −
1
L(Hθt−1 −Hθ∗)
θt − θ∗ = (I − 1
LH)(θt−1 − θ∗) = (I − 1
LH)t(θ0 − θ∗)
• Convexity µ = 0: eigenvalues of (I − 1LH)t in [0, 1]
– No convergence of iterates: ‖θt − θ∗‖2 6 ‖θ0 − θ∗‖2– Function values: g(θt)−g(θ∗) 6 maxv∈[0,L] v(1−v/L)2t‖θ0−θ∗‖2– Function values: g(θt)− g(θ∗) 6
Lt ‖θ0 − θ∗‖2
66
![Page 67: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/67.jpg)
Accelerated gradient methods (Nesterov, 1983)
• Assumptions
– g convex with L-Lipschitz-cont. gradient , min. attained at θ∗
• Algorithm:θt = ηt−1 −
1
Lg′(ηt−1)
ηt = θt +t− 1
t+ 2(θt − θt−1)
• Bound:g(θt)− g(θ∗) 6
2L‖θ0 − θ∗‖2(t+ 1)2
• Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)
• Not improvable
• Extension to strongly-convex functions
67
![Page 68: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/68.jpg)
Accelerated gradient methods - strong convexity
• Assumptions
– g convex with L-Lipschitz-cont. gradient , min. attained at θ∗– g µ-strongly convex
• Algorithm:
θt = ηt−1 −1
Lg′(ηt−1)
ηt = θt +1−
√
µ/L
1 +√
µ/L(θt − θt−1)
• Bound: g(θt)− f(θ∗) 6 L‖θ0 − θ∗‖2(1−√
µ/L)t
– Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)
– Not improvable
– Relationship with conjugate gradient for quadratic functions
68
![Page 69: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/69.jpg)
Optimization for sparsity-inducing norms
(see Bach, Jenatton, Mairal, and Obozinski, 2012b)
• Gradient descent as a proximal method (differentiable functions)
– θt+1 = arg minθ∈Rd
f(θt) + (θ − θt)⊤∇f(θt)+
L
2‖θ − θt‖22
– θt+1 = θt − 1L∇f(θt)
69
![Page 70: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/70.jpg)
Optimization for sparsity-inducing norms
(see Bach, Jenatton, Mairal, and Obozinski, 2012b)
• Gradient descent as a proximal method (differentiable functions)
– θt+1 = arg minθ∈Rd
f(θt) + (θ − θt)⊤∇f(θt)+
L
2‖θ − θt‖22
– θt+1 = θt − 1L∇f(θt)
• Problems of the form: minθ∈Rd
f(θ) + µΩ(θ)
– θt+1 = arg minθ∈Rd
f(θt) + (θ − θt)⊤∇f(θt)+µΩ(θ)+
L
2‖θ − θt‖22
– Ω(θ) = ‖θ‖1 ⇒ Thresholded gradient descent
• Similar convergence rates than smooth optimization
– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)
70
![Page 71: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/71.jpg)
Soft-thresholding for the ℓ1-norm
• Example: quadratic problem in 1D, i.e. minx∈R
1
2x2 − xy + λ|x|
71
![Page 72: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/72.jpg)
Soft-thresholding for the ℓ1-norm
• Example: quadratic problem in 1D, i.e. minx∈R
1
2x2 − xy + λ|x|
• Piecewise quadratic function with a kink at zero
– Derivative at 0+: g+ = λ− y and 0−: g− = −λ− y
– x = 0 is the solution iff g+ > 0 and g− 6 0 (i.e., |y| 6 λ)
– x > 0 is the solution iff g+ 6 0 (i.e., y > λ) ⇒ x∗ = y − λ
– x 6 0 is the solution iff g− 6 0 (i.e., y 6 −λ) ⇒ x∗ = y + λ
• Solution x∗ = sign(y)(|y| − λ)+ = soft thresholding
72
![Page 73: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/73.jpg)
Soft-thresholding for the ℓ1-norm
• Example: quadratic problem in 1D, i.e. minx∈R
1
2x2 − xy + λ|x|
• Piecewise quadratic function with a kink at zero
• Solution x∗ = sign(y)(|y| − λ)+ = soft thresholding
x
−λ
x*(y)
λ y
73
![Page 74: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/74.jpg)
Projected gradient descent
• Problems of the form: minθ∈K
f(θ)
– θt+1 = argminθ∈K
f(θt) + (θ − θt)⊤∇f(θt)+
L
2‖θ − θt‖22
– θt+1 = argminθ∈K
1
2
∥
∥
∥θ −
(
θt −1
L∇f(θt)
)
∥
∥
∥
2
2– Projected gradient descent
• Similar convergence rates than smooth optimization
– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)
74
![Page 75: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/75.jpg)
Newton method
• Given θt−1, minimize second-order Taylor expansion
g(θ) = g(θt−1)+g′(θt−1)⊤(θ−θt−1)+
1
2(θ−θt−1)
⊤g′′(θt−1)⊤(θ−θt−1)
• Expensive Iteration: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– Running-time complexity: O(d3) in general
75
![Page 76: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/76.jpg)
Newton method
• Given θt−1, minimize second-order Taylor expansion
g(θ) = g(θt−1)+g′(θt−1)⊤(θ−θt−1)+
1
2(θ−θt−1)
⊤g′′(θt−1)⊤(θ−θt−1)
• Expensive Iteration: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– Running-time complexity: O(d3) in general
• Quadratic convergence: If ‖θt−1 − θ∗‖ small enough, for some
constant C, we have
(C‖θt − θ∗‖) = (C‖θt−1 − θ∗‖)2
– See Boyd and Vandenberghe (2003)
76
![Page 77: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/77.jpg)
Summary: minimizing smooth convex functions
• Assumption: g convex
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for smooth convex functions
– O(e−tµ/L) convergence rate for strongly smooth convex functions
– Optimal rates O(1/t2) and O(e−t√
µ/L)
• Newton method: θt = θt−1 − f ′′(θt−1)−1f ′(θt−1)
– O(
e−ρ2t)
convergence rate
77
![Page 78: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/78.jpg)
Summary: minimizing smooth convex functions
• Assumption: g convex
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for smooth convex functions
– O(e−tµ/L) convergence rate for strongly smooth convex functions
– Optimal rates O(1/t2) and O(e−t√
µ/L)
• Newton method: θt = θt−1 − f ′′(θt−1)−1f ′(θt−1)
– O(
e−ρ2t)
convergence rate
• From smooth to non-smooth
– Subgradient method (and ellipsoid)
78
![Page 79: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/79.jpg)
Counter-example (Bertsekas, 1999)
Steepest descent for nonsmooth objectives
• g(θ1, θ2) =
−5(9θ21 + 16θ22)1/2 if θ1 > |θ2|
−(9θ1 + 16|θ2|)1/2 if θ1 6 |θ2|
• Steepest descent starting from any θ such that θ1 > |θ2| >
(9/16)2|θ1|
−5 0 5−5
0
5
79
![Page 80: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/80.jpg)
Subgradient method/“descent” (Shor et al., 1985)
• Assumptions
– g convex and B-Lipschitz-continuous on ‖θ‖2 6 D
• Algorithm: θt = ΠD
(
θt−1 −2D
B√tg′(θt−1)
)
– ΠD : orthogonal projection onto ‖θ‖2 6 D
Constraints
80
![Page 81: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/81.jpg)
Subgradient method/“descent” (Shor et al., 1985)
• Assumptions
– g convex and B-Lipschitz-continuous on ‖θ‖2 6 D
• Algorithm: θt = ΠD
(
θt−1 −2D
B√tg′(θt−1)
)
– ΠD : orthogonal projection onto ‖θ‖2 6 D
• Bound:
g
(
1
t
t−1∑
k=0
θk
)
− g(θ∗) 62DB√
t
• Three-line proof
• Best possible convergence rate after O(d) iterations (Bubeck, 2015)
81
![Page 82: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/82.jpg)
Subgradient method/“descent” - proof - I
• Iteration: θt = ΠD(θt−1 − γtg′(θt−1)) with γt =
2DB√t
• Assumption: ‖g′(θ)‖2 6 B and ‖θ‖2 6 D
82
![Page 83: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/83.jpg)
Subgradient method/“descent” - proof - I
• Iteration: θt = ΠD(θt−1 − γtg′(θt−1)) with γt =
2DB√t
• Assumption: ‖g′(θ)‖2 6 B and ‖θ‖2 6 D
‖θt − θ∗‖22 6 ‖θt−1 − θ∗ − γtg′(θt−1)‖22 by contractivity of projections
= ‖θt−1 − θ∗‖22 + γ2t ‖g′(θt−1)‖22 − 2γt(θt−1 − θ∗)
⊤g′(θt−1)
6 ‖θt−1 − θ∗‖22 +B2γ2t − 2γt(θt−1 − θ∗)
⊤g′(θt−1) because ‖g′(θt−1)‖2 6 B
6 ‖θt−1 − θ∗‖22 +B2γ2t − 2γt
[
g(θt−1)− g(θ∗)]
(property of subgradients)
• leading to
g(θt−1)− g(θ∗) 6B2γt2
+1
2γt
[
‖θt−1 − θ∗‖22 − ‖θt − θ∗‖22]
83
![Page 84: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/84.jpg)
Subgradient method/“descent” - proof - II
• Starting from g(θt−1)− g(θ∗) 6B2γt2
+1
2γt
[
‖θt−1 − θ∗‖22 − ‖θt − θ∗‖22]
• Constant step-size γt = γ
t∑
u=1
[
g(θu−1)− g(θ∗)]
6
t∑
u=1
B2γ
2+
t∑
u=1
1
2γ
[
‖θu−1 − θ∗‖22 − ‖θu − θ∗‖22]
6 tB2γ
2+
1
2γ‖θ0 − θ∗‖22 6 t
B2γ
2+
2
γD2
• Optimized step-size γt =2DB√tdepends on “horizon” t
– Leads to bound of 2DB√t
• Using convexity: g
(
1
t
t−1∑
k=0
θk
)
−g(θ∗) 61
t
t−1∑
k=0
g(θk)−g(θ∗) 62DB√
t
84
![Page 85: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/85.jpg)
Subgradient method/“descent” - proof - III
• Starting from g(θt−1)− g(θ∗) 6B2γt2
+1
2γt
[
‖θt−1 − θ∗‖22 − ‖θt − θ∗‖22]
• Decreasing step-size
t∑
u=1
[
g(θu−1)− g(θ∗)]
6
t∑
u=1
B2γu2
+t
∑
u=1
1
2γu
[
‖θu−1 − θ∗‖22 − ‖θu − θ∗‖22]
=
t∑
u=1
B2γu2
+
t−1∑
u=1
‖θu−θ∗‖22( 1
2γu+1− 1
2γu
)
+‖θ0−θ∗‖22
2γ1− ‖θt−θ∗‖22
2γt
6
t∑
u=1
B2γu2
+t−1∑
u=1
4D2( 1
2γu+1− 1
2γu
)
+4D2
2γ1
=t
∑
u=1
B2γu2
+4D2
2γt6 3DB
√t with γt =
2D
B√t
• Using convexity: g(
1t
∑t−1k=0 θk
)
− g(θ∗) 63DB√
t
85
![Page 86: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/86.jpg)
Subgradient descent for machine learning
• Assumptions (f is the expected risk, f the empirical risk)
– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 6 R a.s.
– f(θ) = 1n
∑ni=1 ℓ(yi,Φ(xi)
⊤θ)
– G-Lipschitz loss: f and f are GR-Lipschitz on Θ = ‖θ‖2 6 D
• Statistics: with probability greater than 1− δ
supθ∈Θ
|f(θ)− f(θ)| 6 GRD√n
[
2 +
√
2 log2
δ
]
• Optimization: after t iterations of subgradient method
f(θ)−minη∈Θ
f(η) 6GRD√
t
• t = n iterations, with total running-time complexity of O(n2d)
86
![Page 87: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/87.jpg)
Subgradient descent - strong convexity
• Assumptions
– g convex and B-Lipschitz-continuous on ‖θ‖2 6 D– g µ-strongly convex
• Algorithm: θt = ΠD
(
θt−1 −2
µ(t+ 1)g′(θt−1)
)
• Bound:
g
(
2
t(t+ 1)
t∑
k=1
kθk−1
)
− g(θ∗) 62B2
µ(t+ 1)
• Three-line proof
• Best possible convergence rate after O(d) iterations (Bubeck, 2015)
87
![Page 88: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/88.jpg)
Subgradient method - strong convexity - proof - I
• Iteration: θt = ΠD(θt−1 − γtg′(θt−1)) with γt =
2µ(t+1)
• Assumption: ‖g′(θ)‖2 6 B and ‖θ‖2 6 D and µ-strong convexity of f
‖θt − θ∗‖22 6 ‖θt−1 − θ∗ − γtg′(θt−1)‖22 by contractivity of projections
6 ‖θt−1 − θ∗‖22 +B2γ2t − 2γt(θt−1 − θ∗)
⊤g′(θt−1) because ‖g′(θt−1)‖2 6 B
6 ‖θt−1 − θ∗‖22 +B2γ2t − 2γt
[
g(θt−1)− g(θ∗)+µ
2‖θt−1 − θ∗‖22
]
(property of subgradients and strong convexity)
• leading to
g(θt−1)− g(θ∗) 6B2γt2
+1
2
[ 1
γt− µ
]
‖θt−1 − θ∗‖22 −1
2γt‖θt − θ∗‖22
6B2
µ(t+ 1)+
µ
2
[t− 1
2
]
‖θt−1 − θ∗‖22 −µ(t+ 1)
4‖θt − θ∗‖22
88
![Page 89: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/89.jpg)
Subgradient method - strong convexity - proof - II
• From g(θt−1)− g(θ∗) 6B2
µ(t+ 1)+
µ
2
[t− 1
2
]
‖θt−1 − θ∗‖22 −µ(t+ 1)
4‖θt − θ∗‖22
t∑
u=1
u[
g(θu−1)− g(θ∗)]
6
u∑
t=1
B2u
µ(u+ 1)+
1
4
t∑
u=1
[
u(u−1)‖θu−1−θ∗‖22 − u(u+ 1)‖θu−θ∗‖22]
6B2t
µ+
1
4
[
0− t(t+ 1)‖θt − θ∗‖22]
6B2t
µ
• Using convexity: g
(
2
t(t+ 1)
t∑
u=1
uθu−1
)
− g(θ∗) 62B2
t+ 1
• NB: with step-size γn = 1/(nµ), extra logarithmic factor
89
![Page 90: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/90.jpg)
Summary: minimizing convex functions
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/√t) convergence rate for non-smooth convex functions
– O(1/t) convergence rate for smooth convex functions
– O(e−ρt) convergence rate for strongly smooth convex functions
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
convergence rate
- Key insights from Bottou and Bousquet (2008)
- In machine learning, no need to optimize below statistical error
- In machine learning, cost functions are averages
- Testing errors are more important than training errors
⇒ Stochastic approximation
90
![Page 91: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/91.jpg)
Summary of rates of convergence
• Problem parameters
– D diameter of the domain
– B Lipschitz-constant
– L smoothness constant
– µ strong convexity constant
convex strongly convex
nonsmooth deterministic: BD/√t deterministic: B2/(tµ)
stochastic: BD/√n stochastic: B2/(nµ)
smooth deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: LD2/√n stochastic: L/(nµ)
finite sum: n/t finite sum: exp(−min1/n, µ/Lquadratic deterministic: LD2/t2 deterministic: exp(−t
√
µ/L)
stochastic: d/n+ LD2/n stochastic: d/n+ LD2/n
91
![Page 92: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/92.jpg)
Summary: minimizing convex functions
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/√t) convergence rate for non-smooth convex functions
– O(1/t) convergence rate for smooth convex functions
– O(e−ρt) convergence rate for strongly smooth convex functions
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
convergence rate
• Key insights from Bottou and Bousquet (2008)
1. In machine learning, no need to optimize below statistical error
2. In machine learning, cost functions are averages
3. Testing errors are more important than training errors
⇒ Stochastic approximation
92
![Page 93: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/93.jpg)
Outline - II
1. Introduction
• Large-scale machine learning and optimization
• Classes of functions (convex, smooth, etc.)
• Traditional statistical analysis (regardless of optimization)
2. Classical methods for convex optimization
• Smooth optimization (gradient descent, Newton method)
• Non-smooth optimization (subgradient descent)
• Proximal methods
3. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex
93
![Page 94: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/94.jpg)
Outline - II
4. Classical stochastic approximation (not covered)
• Asymptotic analysis
• Robbins-Monro algorithm and Polyak-Rupert averaging
5. Smooth stochastic approximation algorithms
• Non-asymptotic analysis for smooth functions
• Least-squares regression without decaying step-sizes
6. Finite data sets (partially covered)
• Gradient methods with exponential convergence rates
• (Dual) stochastic coordinate descent
• Frank-Wolfe
7. Non-convex problems (“open” / not covered)
94
![Page 95: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/95.jpg)
Stochastic approximation
• Goal: Minimizing a function f defined on Rd
– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at
certain points θn ∈ Rd
95
![Page 96: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/96.jpg)
Stochastic approximation
• Goal: Minimizing a function f defined on Rd
– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at
certain points θn ∈ Rd
• Machine learning - statistics
– loss for a single pair of observations: fn(θ) = ℓ(yn, θ⊤Φ(xn))
– f(θ) = Efn(θ) = E ℓ(yn, θ⊤Φ(xn)) = generalization error
– Expected gradient: f ′(θ) = Ef ′n(θ) = E
ℓ′(yn, θ⊤Φ(xn))Φ(xn)
– Non-asymptotic results
• Number of iterations = number of observations
96
![Page 97: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/97.jpg)
Stochastic approximation
• Goal: Minimizing a function f defined on Rd
– given only unbiased estimates f ′n(θn) of its gradients f ′(θn) at
certain points θn ∈ Rd
• Stochastic approximation
– (much) broader applicability beyond convex optimization
θn = θn−1 − γnhn(θn−1) with E[
hn(θn−1)|θn−1
]
= h(θn−1)
– Beyond convex problems, i.i.d assumption, finite dimension, etc.
– Typically asymptotic results (see next lecture)
– See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)
97
![Page 98: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/98.jpg)
Relationship to online learning
• Stochastic approximation
– Minimize f(θ) = Ezℓ(θ, z) = generalization error of θ
– Using the gradients of single i.i.d. observations
98
![Page 99: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/99.jpg)
Relationship to online learning
• Stochastic approximation
– Minimize f(θ) = Ezℓ(θ, z) = generalization error of θ
– Using the gradients of single i.i.d. observations
• Batch learning
– Finite set of observations: z1, . . . , zn– Empirical risk: f(θ) = 1
n
∑nk=1 ℓ(θ, zi)
– Estimator θ = Minimizer of f(θ) over a certain class Θ
– Generalization bound using uniform concentration results
99
![Page 100: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/100.jpg)
Relationship to online learning
• Stochastic approximation
– Minimize f(θ) = Ezℓ(θ, z) = generalization error of θ
– Using the gradients of single i.i.d. observations
• Batch learning
– Finite set of observations: z1, . . . , zn– Empirical risk: f(θ) = 1
n
∑nk=1 ℓ(θ, zi)
– Estimator θ = Minimizer of f(θ) over a certain class Θ
– Generalization bound using uniform concentration results
• Online learning
– Update θn after each new (potentially adversarial) observation zn– Cumulative loss: 1
n
∑nk=1 ℓ(θk−1, zk)
– Online to batch through averaging (Cesa-Bianchi et al., 2004)
100
![Page 101: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/101.jpg)
Convex stochastic approximation
• Key properties of f and/or fn
– Smoothness: f B-Lipschitz continuous, f ′ L-Lipschitz continuous
– Strong convexity: f µ-strongly convex
101
![Page 102: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/102.jpg)
Convex stochastic approximation
• Key properties of f and/or fn
– Smoothness: f B-Lipschitz continuous, f ′ L-Lipschitz continuous
– Strong convexity: f µ-strongly convex
• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)
θn = θn−1 − γnf′n(θn−1)
– Polyak-Ruppert averaging: θn = 1n
∑n−1k=0 θk
– Which learning rate sequence γn? Classical setting: γn = Cn−α
102
![Page 103: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/103.jpg)
Convex stochastic approximation
• Key properties of f and/or fn
– Smoothness: f B-Lipschitz continuous, f ′ L-Lipschitz continuous
– Strong convexity: f µ-strongly convex
• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)
θn = θn−1 − γnf′n(θn−1)
– Polyak-Ruppert averaging: θn = 1n
∑n−1k=0 θk
– Which learning rate sequence γn? Classical setting: γn = Cn−α
• Desirable practical behavior
– Applicable (at least) to classical supervised learning problems
– Robustness to (potentially unknown) constants (L,B,µ)
– Adaptivity to difficulty of the problem (e.g., strong convexity)
103
![Page 104: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/104.jpg)
Stochastic subgradient “descent”/method
• Assumptions
– fn convex and B-Lipschitz-continuous on ‖θ‖2 6 D– (fn) i.i.d. functions such that Efn = f
– θ∗ global optimum of f on C = ‖θ‖2 6 D
• Algorithm: θn = ΠD
(
θn−1 −2D
B√nf ′n(θn−1)
)
104
![Page 105: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/105.jpg)
Stochastic subgradient “descent”/method
• Assumptions
– fn convex and B-Lipschitz-continuous on ‖θ‖2 6 D– (fn) i.i.d. functions such that Efn = f
– θ∗ global optimum of f on C = ‖θ‖2 6 D
• Algorithm: θn = ΠD
(
θn−1 −2D
B√nf ′n(θn−1)
)
• Bound:
Ef
(
1
n
n−1∑
k=0
θk
)
− f(θ∗) 62DB√
n
• “Same” three-line proof as in the deterministic case
• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
• Running-time complexity: O(dn) after n iterations
105
![Page 106: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/106.jpg)
Stochastic subgradient method - proof - I
• Iteration: θn = ΠD(θn−1 − γnf′n(θn−1)) with γn = 2D
B√n
• Fn : information up to time n
• ‖f ′n(θ)‖2 6 B and ‖θ‖2 6 D, unbiased gradients/functions E(fn|Fn−1) = f
‖θn − θ∗‖22 6 ‖θn−1 − θ∗ − γnf′n(θn−1)‖22 by contractivity of projections
6 ‖θn−1 − θ∗‖22 +B2γ2n − 2γn(θn−1 − θ∗)
⊤f ′n(θn−1) because ‖f ′
n(θn−1)‖2 6
E[
‖θn − θ∗‖22|Fn−1
]
6 ‖θn−1 − θ∗‖22 +B2γ2n − 2γn(θn−1 − θ∗)
⊤f ′(θn−1)
6 ‖θn−1 − θ∗‖22 +B2γ2n − 2γn
[
f(θn−1)− f(θ∗)]
(subgradient propert
E‖θn − θ∗‖22 6 E‖θn−1 − θ∗‖22 + B2γ2n − 2γn
[
Ef(θn−1)− f(θ∗)]
• leading to Ef(θn−1)− f(θ∗) 6B2γn2
+1
2γn
[
E‖θn−1 − θ∗‖22 − E‖θn − θ∗‖22]
106
![Page 107: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/107.jpg)
Stochastic subgradient method - proof - II
• Starting from Ef(θn−1)− f(θ∗) 6B2γn2
+1
2γn
[
E‖θn−1 − θ∗‖22 − E‖θn − θ∗‖22]
n∑
u=1
[
Ef(θu−1)− f(θ∗)]
6
n∑
u=1
B2γu2
+n∑
u=1
1
2γu
[
E‖θu−1 − θ∗‖22 − E‖θu − θ∗‖22]
6
n∑
u=1
B2γu2
+4D2
2γn6 2DB
√n with γn =
2D
B√n
• Using convexity: Ef
(
1
n
n−1∑
k=0
θk
)
− f(θ∗) 62DB√
n
107
![Page 108: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/108.jpg)
Stochastic subgradient descent - strong convexity - I
• Assumptions
– fn convex and B-Lipschitz-continuous
– (fn) i.i.d. functions such that Efn = f
– f µ-strongly convex on ‖θ‖2 6 D– θ∗ global optimum of f over ‖θ‖2 6 D
• Algorithm: θn = ΠD
(
θn−1 −2
µ(n+ 1)f ′n(θn−1)
)
• Bound:
Ef
(
2
n(n+ 1)
n∑
k=1
kθk−1
)
− f(θ∗) 62B2
µ(n+ 1)
• “Same” proof than deterministic case (Lacoste-Julien et al., 2012)
• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
108
![Page 109: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/109.jpg)
Stochastic subgradient - strong convexity - proof - I
• Iteration: θn = ΠD(θn−1 − γnf′n(θt−1)) with γn = 2
µ(n+1)
• Assumption: ‖f ′n(θ)‖2 6 B and ‖θ‖2 6 D and µ-strong convexity of f
‖θn − θ∗‖22 6 ‖θn−1 − θ∗ − γnf′n(θt−1)‖22 by contractivity of projections
6 ‖θn−1 − θ∗‖22 +B2γ2n − 2γn(θn−1 − θ∗)
⊤f ′n(θt−1) because ‖f ′
n(θt−1)‖2E(·|Fn−1) 6 ‖θn−1 − θ∗‖22 +B2γ2
n − 2γn[
f(θn−1)− f(θ∗)+µ
2‖θn−1 − θ∗‖22
]
(property of subgradients and strong convexity)
• leading to
Ef(θn−1)− f(θ∗) 6B2γn2
+1
2
[ 1
γn− µ
]
‖θn−1 − θ∗‖22 −1
2γn‖θn − θ∗‖22
6B2
µ(n+ 1)+
µ
2
[n− 1
2
]
‖θn−1 − θ∗‖22 −µ(n+ 1)
4‖θn − θ∗‖22
109
![Page 110: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/110.jpg)
Stochastic subgradient - strong convexity - proof - II
• FromEf(θn−1)−f(θ∗)6B2
µ(n+ 1)+µ
2
[n− 1
2
]
E‖θn−1−θ∗‖22−µ(n+ 1)
4E‖θn−θ∗‖22
n∑
u=1
u[
Ef(θu−1)− f(θ∗)]
6
n∑
u=1
B2u
µ(u+ 1)+
1
4
n∑
u=1
[
u(u− 1)E‖θu−1 − θ∗‖22 − u(u+ 1)E‖θu
6B2n
µ+
1
4
[
0− n(n+ 1)E‖θn − θ∗‖22]
6B2n
µ
• Using convexity: Ef
(
2
n(n+ 1)
n∑
u=1
uθu−1
)
− g(θ∗) 62B2
n+ 1
• NB: with step-size γn = 1/(nµ), extra logarithmic factor (see later)
110
![Page 111: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/111.jpg)
Stochastic subgradient descent - strong convexity - II
• Assumptions
– fn convex and B-Lipschitz-continuous
– (fn) i.i.d. functions such that Efn = f
– θ∗ global optimum of g = f + µ2‖ · ‖22
– No compactness assumption - no projections
• Algorithm:
θn = θn−1−2
µ(n+ 1)g′n(θn−1) = θn−1−
2
µ(n+ 1)
[
f ′n(θn−1)+µθn−1
]
• Bound: Eg
(
2
n(n+ 1)
n∑
k=1
kθk−1
)
− g(θ∗) 62B2
µ(n+ 1)
• Minimax convergence rate
111
![Page 112: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/112.jpg)
Beyond convergence in expectation
• Typical result: Ef
(
1
n
n−1∑
k=0
θk
)
− f(θ∗) 62DB√
n
– Obtained with simple conditioning arguments
• High-probability bounds
– Markov inequality: P(
f(
1n
∑n−1k=0 θk
)
− f(θ∗) > ε)
62DB√nε
112
![Page 113: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/113.jpg)
Beyond convergence in expectation
• Typical result: Ef
(
1
n
n−1∑
k=0
θk
)
− f(θ∗) 62DB√
n
– Obtained with simple conditioning arguments
• High-probability bounds
– Markov inequality: P(
f(
1n
∑n−1k=0 θk
)
− f(θ∗) > ε)
62DB√nε
– Deviation inequality (Nemirovski et al., 2009; Nesterov and Vial,
2008)
P
(
f(1
n
n−1∑
k=0
θk
)
− f(θ∗) >2DB√
n(2 + 4t)
)
6 2 exp(−t2)
• See also Bach (2013) for logistic regression
113
![Page 114: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/114.jpg)
Beyond stochastic gradient method
• Adding a proximal step
– Goal: minθ∈Rd
f(θ) + Ω(θ) = Efn(θ) + Ω(θ)
– Replace recursion θn = θn−1 − γnf′n(θn) by
θn = minθ∈Rd
∥
∥θ − θn−1 + γnf′n(θn)
∥
∥
2
2+ CΩ(θ)
– Xiao (2010); Hu et al. (2009)
– May be accelerated (Ghadimi and Lan, 2013)
• Related frameworks
– Regularized dual averaging (Nesterov, 2009; Xiao, 2010)
– Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)
114
![Page 115: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/115.jpg)
Minimax rates (Agarwal et al., 2012)
• Model of computation (i.e., algorithms): first-order oracle
– Queries a function f by obtaining f(θk) and f ′(θk) with zero-mean
bounded variance noise, for k = 0, . . . , n− 1 and outputs θn
• Class of functions
– convex B-Lipschitz-continuous (w.r.t. ℓ2-norm) on a compact
convex set C containing an ℓ∞-ball
• Performance measure
– for a given algorithm and function εn(algo, f) = f(θn)−infθ∈C f(θ)– for a given algorithm: sup
functions f
εn(algo, f)
• Minimax performance: infalgo
supfunctions f
εn(algo, f)
115
![Page 116: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/116.jpg)
Minimax rates (Agarwal et al., 2012)
• Convex functions: domain C that contains an ℓ∞-ball of radius D
infalgo
supfunctions f
ε(algo, f) > cst ×min
BD
√
d
n,BD
– Consequences for ℓ2-ball of radius D: BD/√n
– Upper-bound through stochastic subgradient
• µ-strongly-convex functions:
infalgo
supfunctions f
εn(algo, f) > cst ×minB2
µn,B2
µd,BD
√
d
n,BD
116
![Page 117: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/117.jpg)
Summary of rates of convergence
• Problem parameters
– D diameter of the domain
– B Lipschitz-constant
– L smoothness constant
– µ strong convexity constant
convex strongly convex
nonsmooth deterministic: BD/√t deterministic: B2/(tµ)
stochastic: BD/√n stochastic: B2/(nµ)
smooth deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: LD2/√n stochastic: L/(nµ)
finite sum: n/t finite sum: exp(−min1/n, µ/Lquadratic deterministic: LD2/t2 deterministic: exp(−t
√
µ/L)
stochastic: d/n+ LD2/n stochastic: d/n+ LD2/n
117
![Page 118: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/118.jpg)
Outline - II
1. Introduction
• Large-scale machine learning and optimization
• Classes of functions (convex, smooth, etc.)
• Traditional statistical analysis (regardless of optimization)
2. Classical methods for convex optimization
• Smooth optimization (gradient descent, Newton method)
• Non-smooth optimization (subgradient descent)
• Proximal methods
3. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex
118
![Page 119: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/119.jpg)
Outline - II
4. Classical stochastic approximation (not covered)
• Asymptotic analysis
• Robbins-Monro algorithm and Polyak-Rupert averaging
5. Smooth stochastic approximation algorithms
• Non-asymptotic analysis for smooth functions
• Least-squares regression without decaying step-sizes
6. Finite data sets (partially covered)
• Gradient methods with exponential convergence rates
• (Dual) stochastic coordinate descent
• Frank-Wolfe
7. Non-convex problems (“open” / not covered)
119
![Page 120: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/120.jpg)
Convex stochastic approximation
Existing work
• Known global minimax rates of convergence for non-smooth
problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
– Strongly convex: O((µn)−1)
Attained by averaged stochastic gradient descent with γn ∝ (µn)−1
– Non-strongly convex: O(n−1/2)
Attained by averaged stochastic gradient descent with γn ∝ n−1/2
– Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan
et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz
et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov
and Vial (2008); Nemirovski et al. (2009)
120
![Page 121: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/121.jpg)
Convex stochastic approximation
Existing work
• Known global minimax rates of convergence for non-smooth
problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
– Strongly convex: O((µn)−1)
Attained by averaged stochastic gradient descent with γn ∝ (µn)−1
– Non-strongly convex: O(n−1/2)
Attained by averaged stochastic gradient descent with γn ∝ n−1/2
• Many contributions in optimization and online learning: Bottou
and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al.
(2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al.
(2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and
Vial (2008); Nemirovski et al. (2009)
121
![Page 122: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/122.jpg)
Convex stochastic approximation
Existing work
• Known global minimax rates of convergence for non-smooth
problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
– Strongly convex: O((µn)−1)
Attained by averaged stochastic gradient descent with γn ∝ (µn)−1
– Non-strongly convex: O(n−1/2)
Attained by averaged stochastic gradient descent with γn ∝ n−1/2
• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;
Ruppert, 1988)
– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for
smooth strongly convex problems
A single algorithm with global adaptive convergence rate for
smooth problems?
122
![Page 123: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/123.jpg)
Convex stochastic approximation
Existing work
• Known global minimax rates of convergence for non-smooth
problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
– Strongly convex: O((µn)−1)
Attained by averaged stochastic gradient descent with γn ∝ (µn)−1
– Non-strongly convex: O(n−1/2)
Attained by averaged stochastic gradient descent with γn ∝ n−1/2
• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;
Ruppert, 1988)
– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for
smooth strongly convex problems
• Non-asymptotic analysis for smooth problems? problems with
convergence rate O(min1/µn, 1/√n)123
![Page 124: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/124.jpg)
Algorithm
• Iteration: θn = θn−1 − γnf′n(θn−1)
– Polyak-Ruppert averaging: θn = 1n
∑n−1k=0 θk
124
![Page 125: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/125.jpg)
Summary of results (Bach and Moulines, 2011)
• Stochastic gradient descent with learning rate γn = Cn−α
• Strongly convex smooth objective functions
– Old: O(n−1µ−1) rate achieved without averaging for α = 1
– New: O(n−1µ−1) rate achieved with averaging for α ∈ [1/2, 1]
– Non-asymptotic analysis with explicit constants
– Forgetting of initial conditions
– Robustness to the choice of C
125
![Page 126: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/126.jpg)
Summary of results (Bach and Moulines, 2011)
• Stochastic gradient descent with learning rate γn = Cn−α
• Strongly convex smooth objective functions
– Old: O(n−1µ−1) rate achieved without averaging for α = 1
– New: O(n−1µ−1) rate achieved with averaging for α ∈ [1/2, 1]
– Non-asymptotic analysis with explicit constants
– Forgetting of initial conditions
– Robustness to the choice of C
• Convergence rates for E‖θn − θ∗‖2 and E‖θn − θ∗‖2
– no averaging: O(σ2γn
µ
)
+O(e−µnγn)‖θ0 − θ∗‖2
– averaging:trH(θ∗)−1
n+ µ−1O(n−2α+n−2+α) +O
(‖θ0 − θ∗‖2µ2n2
)
126
![Page 127: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/127.jpg)
Robustness to wrong constants for γn = Cn−α
• f(θ) = 12|θ|2 with i.i.d. Gaussian noise (d = 1)
• Left: α = 1/2
• Right: α = 1
0 2 4−5
0
5
log(n)
log[
f(θ n)−
f∗ ]
α = 1/2
sgd − C=1/5ave − C=1/5sgd − C=1ave − C=1sgd − C=5ave − C=5
0 2 4−5
0
5
log(n)
log[
f(θ n)−
f∗ ]
α = 1
sgd − C=1/5ave − C=1/5sgd − C=1ave − C=1sgd − C=5ave − C=5
• See also http://leon.bottou.org/projects/sgd
127
![Page 128: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/128.jpg)
Summary of new results (Bach and Moulines, 2011)
• Stochastic gradient descent with learning rate γn = Cn−α
• Strongly convex smooth objective functions
– Old: O(n−1) rate achieved without averaging for α = 1
– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]
– Non-asymptotic analysis with explicit constants
128
![Page 129: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/129.jpg)
Summary of new results (Bach and Moulines, 2011)
• Stochastic gradient descent with learning rate γn = Cn−α
• Strongly convex smooth objective functions
– Old: O(n−1) rate achieved without averaging for α = 1
– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]
– Non-asymptotic analysis with explicit constants
• Non-strongly convex smooth objective functions
– Old: O(n−1/2) rate achieved with averaging for α = 1/2
– New: O(maxn1/2−3α/2, n−α/2, nα−1) rate achieved without
averaging for α ∈ [1/3, 1]
• Take-home message
– Use α = 1/2 with averaging to be adaptive to strong convexity
129
![Page 130: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/130.jpg)
Robustness to lack of strong convexity
• Left: f(θ) = |θ|2 between −1 and 1
• Right: f(θ) = |θ|4 between −1 and 1
• affine outside of [−1, 1], continuously differentiable.
0 1 2 3 4 5
−4
−3
−2
−1
0
1
log(n)
log[
f(θ n)−
f∗ ]
power 2
sgd − 1/3ave − 1/3sgd − 1/2ave − 1/2sgd − 2/3ave − 2/3sgd − 1ave − 1
0 1 2 3 4 5
−6
−4
−2
0
2
log(n)
log[
f(θ n)−
f∗ ]
power 4
sgd − 1/3ave − 1/3sgd − 1/2ave − 1/2sgd − 2/3ave − 2/3sgd − 1ave − 1
130
![Page 131: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/131.jpg)
Convex stochastic approximation
Existing work
• Known global minimax rates of convergence for non-smooth
problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
– Strongly convex: O((µn)−1)
Attained by averaged stochastic gradient descent with γn ∝ (µn)−1
– Non-strongly convex: O(n−1/2)
Attained by averaged stochastic gradient descent with γn ∝ n−1/2
• Asymptotic analysis of averaging (Polyak and Juditsky, 1992;
Ruppert, 1988)
– All step sizes γn = Cn−α with α ∈ (1/2, 1) lead to O(n−1) for
smooth strongly convex problems
• A single adaptive algorithm for smooth problems with
convergence rate O(1/n) in all situations?
131
![Page 132: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/132.jpg)
Least-mean-square algorithm
• Least-squares: f(θ) = 12E
[
(yn − 〈Φ(xn), θ〉)2]
with θ ∈ Rd
– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
– usually studied without averaging and decreasing step-sizes
– with strong convexity assumption E[
Φ(xn)⊗Φ(xn)]
= H < µ · Id
132
![Page 133: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/133.jpg)
Least-mean-square algorithm
• Least-squares: f(θ) = 12E
[
(yn − 〈Φ(xn), θ〉)2]
with θ ∈ Rd
– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
– usually studied without averaging and decreasing step-sizes
– with strong convexity assumption E[
Φ(xn)⊗Φ(xn)]
= H < µ · Id
• New analysis for averaging and constant step-size γ = 1/(4R2)
– Assume ‖Φ(xn)‖ 6 R and |yn − 〈Φ(xn), θ∗〉| 6 σ almost surely
– No assumption regarding lowest eigenvalues of H
– Main result: Ef(θn−1)− f(θ∗) 64σ2d
n+
4R2‖θ0 − θ∗‖2n
• Matches statistical lower bound (Tsybakov, 2003)
– Non-asymptotic robust version of Gyorfi and Walk (1996)
133
![Page 134: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/134.jpg)
Markov chain interpretation of constant step sizes
• LMS recursion for fn(θ) =12
(
yn − 〈Φ(xn), θ〉)2
θn = θn−1 − γ(
〈Φ(xn), θn−1〉 − yn)
Φ(xn)
• The sequence (θn)n is a homogeneous Markov chain
– convergence to a stationary distribution πγ
– with expectation θγdef=
∫
θπγ(dθ)
- For least-squares, θγ = θ∗
θγ
θ0
θn
134
![Page 135: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/135.jpg)
Markov chain interpretation of constant step sizes
• LMS recursion for fn(θ) =12
(
yn − 〈Φ(xn), θ〉)2
θn = θn−1 − γ(
〈Φ(xn), θn−1〉 − yn)
Φ(xn)
• The sequence (θn)n is a homogeneous Markov chain
– convergence to a stationary distribution πγ
– with expectation θγdef=
∫
θπγ(dθ)
• For least-squares, θγ = θ∗
θ∗
θ0
θn
135
![Page 136: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/136.jpg)
Markov chain interpretation of constant step sizes
• LMS recursion for fn(θ) =12
(
yn − 〈Φ(xn), θ〉)2
θn = θn−1 − γ(
〈Φ(xn), θn−1〉 − yn)
Φ(xn)
• The sequence (θn)n is a homogeneous Markov chain
– convergence to a stationary distribution πγ
– with expectation θγdef=
∫
θπγ(dθ)
• For least-squares, θγ = θ∗
θ∗
θ0
θn
θn
136
![Page 137: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/137.jpg)
Markov chain interpretation of constant step sizes
• LMS recursion for fn(θ) =12
(
yn − 〈Φ(xn), θ〉)2
θn = θn−1 − γ(
〈Φ(xn), θn−1〉 − yn)
Φ(xn)
• The sequence (θn)n is a homogeneous Markov chain
– convergence to a stationary distribution πγ
– with expectation θγdef=
∫
θπγ(dθ)
• For least-squares, θγ = θ∗
– θn does not converge to θ∗ but oscillates around it
– oscillations of order√γ
• Ergodic theorem:
– Averaged iterates converge to θγ = θ∗ at rate O(1/n)
137
![Page 138: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/138.jpg)
Simulations - synthetic examples
• Gaussian distributions - d = 20
0 2 4 6−5
−4
−3
−2
−1
0
log10
(n)
log 10
[f(θ)
−f(
θ *)]
synthetic square
1/2R2
1/8R2
1/32R2
1/2R2n1/2
138
![Page 139: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/139.jpg)
Simulations - benchmarks
• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)
0 2 4 6
−2
−1.5
−1
−0.5
0
0.5
1
log10
(n)
log 10
[f(θ)
−f(
θ *)]
alpha square C=1 test
1/R2
1/R2n1/2
SAG
0 2 4 6
−2
−1.5
−1
−0.5
0
0.5
1
log10
(n)
alpha square C=opt test
C/R2
C/R2n1/2
SAG
0 2 4
−0.8
−0.6
−0.4
−0.2
0
0.2
log10
(n)
log 10
[f(θ)
−f(
θ *)]
news square C=1 test
1/R2
1/R2n1/2
SAG
0 2 4
−0.8
−0.6
−0.4
−0.2
0
0.2
log10
(n)
news square C=opt test
C/R2
C/R2n1/2
SAG
139
![Page 140: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/140.jpg)
Isn’t least-squares regression a “regression”?
140
![Page 141: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/141.jpg)
Isn’t least-squares regression a “regression”?
• Least-squares regression
– Simpler to analyze and understand
– Explicit relationship to bias/variance trade-offs
– See Defossez and Bach (2015); Dieuleveut et al. (2016)
• Many important loss functions are not quadratic
– Beyond least-squares with online Newton steps
– Complexity of O(d) per iteration with rate O(d/n)
– See Bach and Moulines (2013) for details
141
![Page 142: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/142.jpg)
Beyond least-squares - Markov chain interpretation
• Recursion θn = θn−1 − γf ′n(θn−1) also defines a Markov chain
– Stationary distribution πγ such that∫
f ′(θ)πγ(dθ) = 0
– When f ′ is not linear, f ′(∫
θπγ(dθ)) 6=∫
f ′(θ)πγ(dθ) = 0
142
![Page 143: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/143.jpg)
Beyond least-squares - Markov chain interpretation
• Recursion θn = θn−1 − γf ′n(θn−1) also defines a Markov chain
– Stationary distribution πγ such that∫
f ′(θ)πγ(dθ) = 0
– When f ′ is not linear, f ′(∫
θπγ(dθ)) 6=∫
f ′(θ)πγ(dθ) = 0
• θn oscillates around the wrong value θγ 6= θ∗
θγ
θ0
θn
θn
θ∗
143
![Page 144: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/144.jpg)
Beyond least-squares - Markov chain interpretation
• Recursion θn = θn−1 − γf ′n(θn−1) also defines a Markov chain
– Stationary distribution πγ such that∫
f ′(θ)πγ(dθ) = 0
– When f ′ is not linear, f ′(∫
θπγ(dθ)) 6=∫
f ′(θ)πγ(dθ) = 0
• θn oscillates around the wrong value θγ 6= θ∗
– moreover, ‖θ∗ − θn‖ = Op(√γ)
– Linear convergence up to the noise level for strongly-convex
problems (Nedic and Bertsekas, 2000)
• Ergodic theorem
– averaged iterates converge to θγ 6= θ∗ at rate O(1/n)
– moreover, ‖θ∗ − θγ‖ = O(γ) (Bach, 2013)
144
![Page 145: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/145.jpg)
Simulations - synthetic examples
• Gaussian distributions - d = 20
0 2 4 6−5
−4
−3
−2
−1
0
log10
(n)
log 10
[f(θ)
−f(
θ *)]
synthetic logistic − 1
1/2R2
1/8R2
1/32R2
1/2R2n1/2
145
![Page 146: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/146.jpg)
Restoring convergence through online Newton steps
• Known facts
1. Averaged SGD with γn ∝ n−1/2 leads to robust rate O(n−1/2)
for all convex functions
2. Averaged SGD with γn constant leads to robust rate O(n−1)
for all convex quadratic functions
3. Newton’s method squares the error at each iteration
for smooth functions
4. A single step of Newton’s method is equivalent to minimizing the
quadratic Taylor expansion
– Online Newton step
– Rate: O((n−1/2)2 + n−1) = O(n−1)
– Complexity: O(d) per iteration
146
![Page 147: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/147.jpg)
Restoring convergence through online Newton steps
• Known facts
1. Averaged SGD with γn ∝ n−1/2 leads to robust rate O(n−1/2)
for all convex functions
2. Averaged SGD with γn constant leads to robust rate O(n−1)
for all convex quadratic functions ⇒ O(n−1)
3. Newton’s method squares the error at each iteration
for smooth functions ⇒ O((n−1/2)2)
4. A single step of Newton’s method is equivalent to minimizing the
quadratic Taylor expansion
• Online Newton step
– Rate: O((n−1/2)2 + n−1) = O(n−1)
– Complexity: O(d) per iteration
147
![Page 148: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/148.jpg)
Restoring convergence through online Newton steps
• The Newton step for f = Efn(θ)def= E
[
ℓ(yn, 〈θ,Φ(xn)〉)]
at θ is
equivalent to minimizing the quadratic approximation
g(θ) = f(θ) + 〈f ′(θ), θ − θ〉+ 12〈θ − θ, f ′′(θ)(θ − θ)〉
= f(θ) + 〈Ef ′n(θ), θ − θ〉+ 1
2〈θ − θ,Ef ′′n(θ)(θ − θ)〉
= E
[
f(θ) + 〈f ′n(θ), θ − θ〉+ 1
2〈θ − θ, f ′′n(θ)(θ − θ)〉
]
148
![Page 149: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/149.jpg)
Restoring convergence through online Newton steps
• The Newton step for f = Efn(θ)def= E
[
ℓ(yn, 〈θ,Φ(xn)〉)]
at θ is
equivalent to minimizing the quadratic approximation
g(θ) = f(θ) + 〈f ′(θ), θ − θ〉+ 12〈θ − θ, f ′′(θ)(θ − θ)〉
= f(θ) + 〈Ef ′n(θ), θ − θ〉+ 1
2〈θ − θ,Ef ′′n(θ)(θ − θ)〉
= E
[
f(θ) + 〈f ′n(θ), θ − θ〉+ 1
2〈θ − θ, f ′′n(θ)(θ − θ)〉
]
• Complexity of least-mean-square recursion for g is O(d)
θn = θn−1 − γ[
f ′n(θ) + f ′′
n(θ)(θn−1 − θ)]
– f ′′n(θ) = ℓ′′(yn, 〈θ,Φ(xn)〉)Φ(xn)⊗ Φ(xn) has rank one
– New online Newton step without computing/inverting Hessians
149
![Page 150: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/150.jpg)
Choice of support point for online Newton step
• Two-stage procedure
(1) Run n/2 iterations of averaged SGD to obtain θ
(2) Run n/2 iterations of averaged constant step-size LMS
– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
– Provable convergence rate of O(d/n) for logistic regression
– Additional assumptions but no strong convexity
150
![Page 151: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/151.jpg)
Choice of support point for online Newton step
• Two-stage procedure
(1) Run n/2 iterations of averaged SGD to obtain θ
(2) Run n/2 iterations of averaged constant step-size LMS
– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
– Provable convergence rate of O(d/n) for logistic regression
– Additional assumptions but no strong convexity
• Update at each iteration using the current averaged iterate
– Recursion: θn = θn−1 − γ[
f ′n(θn−1) + f ′′
n(θn−1)(θn−1 − θn−1)]
– No provable convergence rate (yet) but best practical behavior
– Note (dis)similarity with regular SGD: θn = θn−1 − γf ′n(θn−1)
151
![Page 152: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/152.jpg)
Simulations - synthetic examples
• Gaussian distributions - d = 20
0 2 4 6−5
−4
−3
−2
−1
0
log10
(n)
log 10
[f(θ)
−f(
θ *)]
synthetic logistic − 1
1/2R2
1/8R2
1/32R2
1/2R2n1/2
0 2 4 6−5
−4
−3
−2
−1
0
log10
(n)lo
g 10[f(
θ)−
f(θ *)]
synthetic logistic − 2
every 2p
every iter.2−step2−step−dbl.
152
![Page 153: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/153.jpg)
Simulations - benchmarks
• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)
0 2 4 6−2.5
−2
−1.5
−1
−0.5
0
0.5
log10
(n)
log 10
[f(θ)
−f(
θ *)]
alpha logistic C=1 test
1/R2
1/R2n1/2
SAGAdagradNewton
0 2 4 6−2.5
−2
−1.5
−1
−0.5
0
0.5
log10
(n)
alpha logistic C=opt test
C/R2
C/R2n1/2
SAGAdagradNewton
0 2 4−1
−0.8
−0.6
−0.4
−0.2
0
0.2
log10
(n)
log 10
[f(θ)
−f(
θ *)]
news logistic C=1 test
1/R2
1/R2n1/2
SAGAdagradNewton
0 2 4−1
−0.8
−0.6
−0.4
−0.2
0
0.2
log10
(n)
news logistic C=opt test
C/R2
C/R2n1/2
SAGAdagradNewton
153
![Page 154: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/154.jpg)
Summary of rates of convergence
• Problem parameters
– D diameter of the domain
– B Lipschitz-constant
– L smoothness constant
– µ strong convexity constant
convex strongly convex
nonsmooth deterministic: BD/√t deterministic: B2/(tµ)
stochastic: BD/√n stochastic: B2/(nµ)
smooth deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: LD2/√n stochastic: L/(nµ)
finite sum: n/t finite sum: exp(−min1/n, µ/Lquadratic deterministic: LD2/t2 deterministic: exp(−t
√
µ/L)
stochastic: d/n+ LD2/n stochastic: d/n+ LD2/n
154
![Page 155: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/155.jpg)
Summary of rates of convergence
• Problem parameters
– D diameter of the domain
– B Lipschitz-constant
– L smoothness constant
– µ strong convexity constant
convex strongly convex
nonsmooth deterministic: BD/√t deterministic: B2/(tµ)
stochastic: BD/√n stochastic: B2/(nµ)
smooth deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: LD2/√n stochastic: L/(nµ)
finite sum: n/t finite sum: exp(−t/(n+L/µ))
quadratic deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: d/n+ LD2/n stochastic: d/n+ LD2/n
155
![Page 156: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/156.jpg)
Outline - II
1. Introduction
• Large-scale machine learning and optimization
• Classes of functions (convex, smooth, etc.)
• Traditional statistical analysis (regardless of optimization)
2. Classical methods for convex optimization
• Smooth optimization (gradient descent, Newton method)
• Non-smooth optimization (subgradient descent)
• Proximal methods
3. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex
156
![Page 157: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/157.jpg)
Outline - II
4. Classical stochastic approximation (not covered)
• Asymptotic analysis
• Robbins-Monro algorithm and Polyak-Rupert averaging
5. Smooth stochastic approximation algorithms
• Non-asymptotic analysis for smooth functions
• Least-squares regression without decaying step-sizes
6. Finite data sets (partially covered)
• Gradient methods with exponential convergence rates
• (Dual) stochastic coordinate descent
• Frank-Wolfe
7. Non-convex problems (“open” / not covered)
157
![Page 158: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/158.jpg)
Going beyond a single pass over the data
• Stochastic approximation
– Assumes infinite data stream
– Observations are used only once
– Directly minimizes testing cost E(x,y) ℓ(y, θ⊤Φ(x))
158
![Page 159: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/159.jpg)
Going beyond a single pass over the data
• Stochastic approximation
– Assumes infinite data stream
– Observations are used only once
– Directly minimizes testing cost E(x,y) ℓ(y, θ⊤Φ(x))
• Machine learning practice
– Finite data set (x1, y1, . . . , xn, yn)
– Multiple passes
– Minimizes training cost 1n
∑ni=1 ℓ(yi, θ
⊤Φ(xi))
– Need to regularize (e.g., by the ℓ2-norm) to avoid overfitting
• Goal: minimize g(θ) =1
n
n∑
i=1
fi(θ)
159
![Page 160: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/160.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
g(θt)− g(θ∗) 6 O(1/t)
g(θt)− g(θ∗) 6 O(e−t(µ/L)) = O(e−t/κ)
(small κ = L/µ) (large κ = L/µ)
160
![Page 161: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/161.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
g(θt)− g(θ∗) 6 O(1/t)
g(θt)− g(θ∗) 6 O((1−µ/L)t) = O(e−t(µ/L)) if µ-strongly convex
(small κ = L/µ) (large κ = L/µ)
161
![Page 162: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/162.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e−t/κ) linear if strongly-convex
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
quadratic rate
162
![Page 163: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/163.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e−t/κ) linear if strongly-convex ⇔ O(κ log 1ε) iterations
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
quadratic rate ⇔ O(log log 1ε) iterations
163
![Page 164: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/164.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e−t/κ) linear if strongly-convex ⇔ complexity = O(nd · κ log 1ε)
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
quadratic rate ⇔ complexity = O((nd2 + d3) · log log 1ε)
164
![Page 165: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/165.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e−t/κ) linear if strongly-convex ⇔ complexity = O(nd · κ log 1ε)
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
quadratic rate ⇔ complexity = O((nd2 + d3) · log log 1ε)
• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below statistical error
2. Cost functions are averages
3. Testing error is more important than training error
165
![Page 166: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/166.jpg)
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on Rd
• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e−t/κ) linear if strongly-convex ⇔ complexity = O(nd · κ log 1ε)
• Newton method: θt = θt−1 − g′′(θt−1)−1g′(θt−1)
– O(
e−ρ2t)
quadratic rate ⇔ complexity = O((nd2 + d3) · log log 1ε)
• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below statistical error
2. Cost functions are averages
3. Testing error is more important than training error
166
![Page 167: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/167.jpg)
Stochastic gradient descent (SGD) for finite sums
minθ∈Rd
g(θ) =1
n
n∑
i=1
fi(θ)
• Iteration: θt = θt−1 − γtf′i(t)(θt−1)
– Sampling with replacement: i(t) random element of 1, . . . , n– Polyak-Ruppert averaging: θt =
1t+1
∑tu=0 θu
167
![Page 168: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/168.jpg)
Stochastic gradient descent (SGD) for finite sums
minθ∈Rd
g(θ) =1
n
n∑
i=1
fi(θ)
• Iteration: θt = θt−1 − γtf′i(t)(θt−1)
– Sampling with replacement: i(t) random element of 1, . . . , n– Polyak-Ruppert averaging: θt =
1t+1
∑tu=0 θu
• Convergence rate if each fi is convex L-smooth and g µ-strongly-
convex:
Eg(θt)− g(θ∗) 6
O(1/√t) if γt = 1/(L
√t)
O(L/(µt)) = O(κ/t) if γt = 1/(µt)
– No adaptivity to strong-convexity in general
– Adaptivity with self-concordance assumption (Bach, 2013)
– Running-time complexity: O(d · κ/ε)168
![Page 169: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/169.jpg)
Stochastic vs. deterministic methods
• Minimizing g(θ) =1
n
n∑
i=1
fi(θ) with fi(θ) = ℓ(
yi, h(xi, θ))
+ λΩ(θ)
169
![Page 170: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/170.jpg)
Stochastic vs. deterministic methods
• Minimizing g(θ) =1
n
n∑
i=1
fi(θ) with fi(θ) = ℓ(
yi, h(xi, θ))
+ λΩ(θ)
• Batch gradient descent: θt = θt−1−γtg′(θt−1) = θt−1−
γtn
n∑
i=1
f ′i(θt−1)
– Linear (e.g., exponential) convergence rate in O(e−t/κ)
– Iteration complexity is linear in n
170
![Page 171: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/171.jpg)
Stochastic vs. deterministic methods
• Minimizing g(θ) =1
n
n∑
i=1
fi(θ) with fi(θ) = ℓ(
yi, h(xi, θ))
+ λΩ(θ)
• Batch gradient descent: θt = θt−1−γtg′(θt−1) = θt−1−
γtn
n∑
i=1
f ′i(θt−1)
171
![Page 172: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/172.jpg)
Stochastic vs. deterministic methods
• Minimizing g(θ) =1
n
n∑
i=1
fi(θ) with fi(θ) = ℓ(
yi, h(xi, θ))
+ λΩ(θ)
• Batch gradient descent: θt = θt−1−γtg′(θt−1) = θt−1−
γtn
n∑
i=1
f ′i(θt−1)
– Linear (e.g., exponential) convergence rate in O(e−t/κ)
– Iteration complexity is linear in n
• Stochastic gradient descent: θt = θt−1 − γtf′i(t)(θt−1)
– Sampling with replacement: i(t) random element of 1, . . . , n– Convergence rate in O(κ/t)
– Iteration complexity is independent of n
172
![Page 173: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/173.jpg)
Stochastic vs. deterministic methods
• Minimizing g(θ) =1
n
n∑
i=1
fi(θ) with fi(θ) = ℓ(
yi, h(xi, θ))
+ λΩ(θ)
• Batch gradient descent: θt = θt−1−γtg′(θt−1) = θt−1−
γtn
n∑
i=1
f ′i(θt−1)
• Stochastic gradient descent: θt = θt−1 − γtf′i(t)(θt−1)
173
![Page 174: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/174.jpg)
Stochastic vs. deterministic methods
• Goal = best of both worlds: Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
174
![Page 175: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/175.jpg)
Stochastic vs. deterministic methods
• Goal = best of both worlds: Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
175
![Page 176: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/176.jpg)
Accelerating gradient methods - Related work
• Generic acceleration (Nesterov, 1983, 2004)
θt = ηt−1 − γtg′(ηt−1) and ηt = θt + δt(θt − θt−1)
176
![Page 177: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/177.jpg)
Accelerating gradient methods - Related work
• Generic acceleration (Nesterov, 1983, 2004)
θt = ηt−1 − γtg′(ηt−1) and ηt = θt + δt(θt − θt−1)
– Good choice of momentum term δt ∈ [0, 1)
g(θt)− g(θ∗) 6 O(1/t2)
g(θt)− g(θ∗) 6 O(e−t√
µ/L) = O(e−t/√κ) if µ-strongly convex
– Optimal rates after t = O(d) iterations (Nesterov, 2004)
177
![Page 178: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/178.jpg)
Accelerating gradient methods - Related work
• Generic acceleration (Nesterov, 1983, 2004)
θt = ηt−1 − γtg′(ηt−1) and ηt = θt + δt(θt − θt−1)
– Good choice of momentum term δt ∈ [0, 1)
g(θt)− g(θ∗) 6 O(1/t2)
g(θt)− g(θ∗) 6 O(e−t√
µ/L) = O(e−t/√κ) if µ-strongly convex
– Optimal rates after t = O(d) iterations (Nesterov, 2004)
– Still O(nd) iteration cost: complexity = O(nd · √κ log 1ε)
178
![Page 179: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/179.jpg)
Accelerating gradient methods - Related work
• Constant step-size stochastic gradient
– Solodov (1998); Nedic and Bertsekas (2000)
– Linear convergence, but only up to a fixed tolerance
179
![Page 180: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/180.jpg)
Accelerating gradient methods - Related work
• Constant step-size stochastic gradient
– Solodov (1998); Nedic and Bertsekas (2000)
– Linear convergence, but only up to a fixed tolerance
• Stochastic methods in the dual (SDCA)
– Shalev-Shwartz and Zhang (2012)
– Similar linear rate but limited choice for the fi’s
– Extensions without duality: see Shalev-Shwartz (2016)
180
![Page 181: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/181.jpg)
Accelerating gradient methods - Related work
• Constant step-size stochastic gradient
– Solodov (1998); Nedic and Bertsekas (2000)
– Linear convergence, but only up to a fixed tolerance
• Stochastic methods in the dual (SDCA)
– Shalev-Shwartz and Zhang (2012)
– Similar linear rate but limited choice for the fi’s
– Extensions without duality: see Shalev-Shwartz (2016)
• Stochastic version of accelerated batch gradient methods
– Tseng (1998); Ghadimi and Lan (2010); Xiao (2010)
– Can improve constants, but still have sublinear O(1/t) rate
181
![Page 182: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/182.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
182
![Page 183: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/183.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
yt1
yt2
yt3
yt4
ytn−1
ytn
f1 f2 f3 f4 fnfn−1
1
n
∑n
i=1yti
g =1
n
∑n
i=1fifunctions
gradients ∈ Rd
183
![Page 184: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/184.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
yt1
yt2
yt3
yt4
ytn−1
ytn
f1 f2 f3 f4 fnfn−1
1
n
∑n
i=1yti
g =1
n
∑n
i=1fifunctions
gradients ∈ Rd
184
![Page 185: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/185.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
yt1
yt2
yt3
yt4
ytn−1
ytn
f1 f2 f3 f4 fnfn−1
1
n
∑n
i=1yti
g =1
n
∑n
i=1fifunctions
gradients ∈ Rd
185
![Page 186: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/186.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
• Stochastic version of incremental average gradient (Blatt et al., 2008)
186
![Page 187: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/187.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
• Stochastic version of incremental average gradient (Blatt et al., 2008)
• Extra memory requirement: n gradients in Rd in general
• Linear supervised machine learning: only n real numbers
– If fi(θ) = ℓ(yi,Φ(xi)⊤θ), then f ′
i(θ) = ℓ′(yi,Φ(xi)⊤θ)Φ(xi)
187
![Page 188: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/188.jpg)
Stochastic average gradient - Convergence analysis
• Assumptions
– Each fi is L-smooth, i = 1, . . . , n - link with R2
– g= 1n
∑ni=1 fi is µ-strongly convex
– constant step size γt = 1/(16L) - no need to know µ
188
![Page 189: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/189.jpg)
Stochastic average gradient - Convergence analysis
• Assumptions
– Each fi is L-smooth, i = 1, . . . , n - link with R2
– g= 1n
∑ni=1 fi is µ-strongly convex
– constant step size γt = 1/(16L) - no need to know µ
• Strongly convex case (Le Roux et al., 2012, 2013)
E[
g(θt)− g(θ∗)]
6 cst ×(
1−min 1
8n,
µ
16L
)t
– Linear (exponential) convergence rate with O(d) iteration cost
– After one pass, reduction of cost by exp(
−min
18,
nµ16L
)
– NB: in machine learning, may often restrict to µ > L/n
⇒ constant error reduction after each effective pass
189
![Page 190: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/190.jpg)
Running-time comparisons (strongly-convex)
• Assumptions: g(θ) = 1n
∑ni=1 fi(θ)
– Each fi convex L-smooth and g µ-strongly convex
Stochastic gradient descent d×∣
∣
∣
Lµ × 1
ε
Gradient descent d×∣
∣
∣nLµ × log 1
ε
Accelerated gradient descent d×∣
∣
∣n√
Lµ × log 1
ε
SAG d×∣
∣
∣(n+ L
µ) × log 1ε
– NB-1: for (accelerated) gradient descent, L = smoothness constant of g
– NB-2: with non-uniform sampling, L = average smoothness constants of all fi’s
190
![Page 191: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/191.jpg)
Running-time comparisons (strongly-convex)
• Assumptions: g(θ) = 1n
∑ni=1 fi(θ)
– Each fi convex L-smooth and g µ-strongly convex
Stochastic gradient descent d×∣
∣
∣
Lµ × 1
ε
Gradient descent d×∣
∣
∣nLµ × log 1
ε
Accelerated gradient descent d×∣
∣
∣n√
Lµ × log 1
ε
SAG d×∣
∣
∣(n+ L
µ) × log 1ε
• Beating two lower bounds (Nemirovsky and Yudin, 1983; Nesterov,
2004): with additional assumptions
(1) stochastic gradient: exponential rate for finite sums
(2) full gradient: better exponential rate using the sum structure
191
![Page 192: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/192.jpg)
Running-time comparisons (non-strongly-convex)
• Assumptions: g(θ) = 1n
∑ni=1 fi(θ)
– Each fi convex L-smooth
– Ill conditioned problems: g may not be strongly-convex (µ = 0)
Stochastic gradient descent d×∣
∣
∣
∣
1/ε2
Gradient descent d×∣
∣
∣
∣
n/ε
Accelerated gradient descent d×∣
∣
∣
∣
n/√ε
SAG d×∣
∣
∣
∣
√n/ε
• Adaptivity to potentially hidden strong convexity
• No need to know the local/global strong-convexity constant
192
![Page 193: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/193.jpg)
Stochastic average gradient
Implementation details and extensions
• Sparsity in the features
– Just-in-time updates ⇒ replace O(d) by number of non zeros
– See also Leblond, Pedregosa, and Lacoste-Julien (2016)
• Mini-batches
– Reduces the memory requirement + block access to data
• Line-search
– Avoids knowing L in advance
• Non-uniform sampling
– Favors functions with large variations
• See www.cs.ubc.ca/~schmidtm/Software/SAG.html
193
![Page 194: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/194.jpg)
Experimental results (logistic regression)
quantum dataset rcv1 dataset
(n = 50 000, d = 78) (n = 697 641, d = 47 236)
AFG
L−BFGS
SGASG
IAG
AFG
L−BFGS
SG
ASG
IAG
194
![Page 195: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/195.jpg)
Experimental results (logistic regression)
quantum dataset rcv1 dataset
(n = 50 000, d = 78) (n = 697 641, d = 47 236)
AFG
L−BFGS
SG
ASG
IAG
SAG−LS
AFGL−BFGS
SG
ASG
IAG
SAG−LS
195
![Page 196: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/196.jpg)
Before non-uniform sampling
protein dataset sido dataset
(n = 145 751, d = 74) (n = 12 678, d = 4 932)
IAG
AGD
L−BFG
S
SG−C
ASG−C
PCD−L
DCA
SAG
IAG
AGD
L−BFGS
SG−CASG−C
PCD−L
DCA
SAG
196
![Page 197: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/197.jpg)
After non-uniform sampling
protein dataset sido dataset
(n = 145 751, d = 74) (n = 12 678, d = 4 932)
IAG
AGDL−BFGS SG−CASG−C
PCD−L
DC
A
SAG
SAG−LS (Lipschitz)
IAG AGD L−BFGS
SG−C ASG−CPCD−L
DCA
SAG
SAG−LS (Lipschitz)
197
![Page 198: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/198.jpg)
Linearly convergent stochastic gradient algorithms
• Many related algorithms
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SDCA (Shalev-Shwartz and Zhang, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– MISO (Mairal, 2015)
– Finito (Defazio et al., 2014a)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014b)
– · · ·
• Similar rates of convergence and iterations
– Different interpretations and proofs / proof lengths
– Lazy gradient evaluations
– Variance reduction
198
![Page 199: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/199.jpg)
Linearly convergent stochastic gradient algorithms
• Many related algorithms
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SDCA (Shalev-Shwartz and Zhang, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– MISO (Mairal, 2015)
– Finito (Defazio et al., 2014a)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014b)
– · · ·
• Similar rates of convergence and iterations
• Different interpretations and proofs / proof lengths
– Lazy gradient evaluations
– Variance reduction
199
![Page 200: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/200.jpg)
Variance reduction
• Principle: reducing variance of sample of X by using a sample from
another random variable Y with known expectation
Zα = α(X − Y ) + EY
– EZα = αEX + (1− α)EY
– var(Zα) = α2[
var(X) + var(Y )− 2cov(X,Y )]
– α = 1: no bias, α < 1: potential bias (but reduced variance)
– Useful if Y positively correlated with X
200
![Page 201: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/201.jpg)
Variance reduction
• Principle: reducing variance of sample of X by using a sample from
another random variable Y with known expectation
Zα = α(X − Y ) + EY
– EZα = αEX + (1− α)EY
– var(Zα) = α2[
var(X) + var(Y )− 2cov(X,Y )]
– α = 1: no bias, α < 1: potential bias (but reduced variance)
– Useful if Y positively correlated with X
• Application to gradient estimation (Johnson and Zhang, 2013;
Zhang, Mahdavi, and Jin, 2013)
– SVRG: X = f ′i(t)(θt−1), Y = f ′
i(t)(θ), α = 1, with θ stored
– EY = 1n
∑ni=1 f
′i(θ) full gradient at θ,X − Y = f ′
i(t)(θt−1)− f ′i(t)(θ)
201
![Page 202: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/202.jpg)
Stochastic variance reduced gradient (SVRG)
(Johnson and Zhang, 2013; Zhang et al., 2013)
• Initialize θ ∈ Rd
• For iepoch = 1 to # of epochs
– Compute all gradients f ′i(θ) ; store g
′(θ) = 1n
∑ni=1 f
′i(θ)
– Initialize θ0 = θ
– For t = 1 to length of epochs
θt = θt−1 − γ[
g′(θ) +(
f ′i(t)(θt−1)− f ′
i(t)(θ))
]
– Update θ = θt
• Output: θ
– No need to store gradients - two gradient evaluations per inner step
– Two parameters: length of epochs + step-size γ
– Same linear convergence rate as SAG, simpler proof
202
![Page 203: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/203.jpg)
Stochastic variance reduced gradient (SVRG)
(Johnson and Zhang, 2013; Zhang et al., 2013)
• Initialize θ ∈ Rd
• For iepoch = 1 to # of epochs
– Compute all gradients f ′i(θ) ; store g
′(θ) = 1n
∑ni=1 f
′i(θ)
– Initialize θ0 = θ
– For t = 1 to length of epochs
θt = θt−1 − γ[
g′(θ) +(
f ′i(t)(θt−1)− f ′
i(t)(θ))
]
– Update θ = θt
• Output: θ
– No need to store gradients - two gradient evaluations per inner step
– Two parameters: length of epochs + step-size γ
– Same linear convergence rate as SAG, simpler proof
203
![Page 204: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/204.jpg)
Interpretation of SAG as variance reduction
• SAG update: θt = θt−1−γ
n
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
– Interpretation as lazy gradient evaluations
204
![Page 205: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/205.jpg)
Interpretation of SAG as variance reduction
• SAG update: θt = θt−1−γ
n
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
– Interpretation as lazy gradient evaluations
• SAG update: θt = θt−1 − γ[
1n
∑ni=1 y
t−1i + 1
n
(
f ′i(t)(θt−1)− yt−1
i(t)
)
]
– Biased update (expectation w.r.t. to i(t) not equal to full gradient)
205
![Page 206: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/206.jpg)
Interpretation of SAG as variance reduction
• SAG update: θt = θt−1−γ
n
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
– Interpretation as lazy gradient evaluations
• SAG update: θt = θt−1 − γ[
1n
∑ni=1 y
t−1i + 1
n
(
f ′i(t)(θt−1)− yt−1
i(t)
)
]
– Biased update (expectation w.r.t. to i(t) not equal to full gradient)
• SVRG update: θt = θt−1−γ[
1n
∑ni=1 f
′i(θ)+
(
f ′i(t)(θt−1)−f ′
i(t)(θ))
]
– Unbiased update
206
![Page 207: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/207.jpg)
Interpretation of SAG as variance reduction
• SAG update: θt = θt−1−γ
n
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
– Interpretation as lazy gradient evaluations
• SAG update: θt = θt−1 − γ[
1n
∑ni=1 y
t−1i + 1
n
(
f ′i(t)(θt−1)− yt−1
i(t)
)
]
– Biased update (expectation w.r.t. to i(t) not equal to full gradient)
• SVRG update: θt = θt−1−γ[
1n
∑ni=1 f
′i(θ)+
(
f ′i(t)(θt−1)−f ′
i(t)(θ))
]
– Unbiased update
• SAGA update: θt = θt−1 − γ[
1n
∑ni=1 y
t−1i +
(
f ′i(t)(θt−1)− yt−1
i(t)
)
]
– Defazio, Bach, and Lacoste-Julien (2014b)
– Unbiased update without epochs
207
![Page 208: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/208.jpg)
SVRG vs. SAGA
• SAGA update: θt = θt−1 − γ[
1n
∑ni=1 y
t−1i +
(
f ′i(t)(θt−1)− yt−1
i(t)
)
]
• SVRG update: θt = θt−1−γ[
1n
∑ni=1 f
′i(θ)+
(
f ′i(t)(θt−1)−f ′
i(t)(θ))
]
SAGA SVRG
Storage of gradients yes no
Epoch-based no yes
Parameters step-size step-size & epoch lengths
Gradient evaluations per step 1 at least 2
Adaptivity to strong-convexity yes no
Robustness to ill-conditioning yes no
– See Babanezhad et al. (2015)
208
![Page 209: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/209.jpg)
Proximal extensions
• Composite optimization problems: minθ∈Rd
1
n
n∑
i=1
fi(θ)+h(θ)
– fi smooth and convex
– h convex, potentially non-smooth
209
![Page 210: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/210.jpg)
Proximal extensions
• Composite optimization problems: minθ∈Rd
1
n
n∑
i=1
fi(θ)+h(θ)
– fi smooth and convex
– h convex, potentially non-smooth
– Constrained optimization: h(θ) = 0 if θ ∈ K, and +∞ otherwise
– Sparsity-inducing norms, e.g., h(θ) = ‖θ‖1
210
![Page 211: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/211.jpg)
Proximal extensions
• Composite optimization problems: minθ∈Rd
1
n
n∑
i=1
fi(θ)+h(θ)
– fi smooth and convex
– h convex, potentially non-smooth
– Constrained optimization: h(θ) = 0 if θ ∈ K, and +∞ otherwise
– Sparsity-inducing norms, e.g., h(θ) = ‖θ‖1
• Proximal methods (a.k.a. splitting methods)
– Extra projection / soft thresholding step after gradient update
– See, e.g., Combettes and Pesquet (2011); Bach, Jenatton, Mairal,
and Obozinski (2012b); Parikh and Boyd (2014)
211
![Page 212: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/212.jpg)
Proximal extensions
• Composite optimization problems: minθ∈Rd
1
n
n∑
i=1
fi(θ)+h(θ)
– fi smooth and convex
– h convex, potentially non-smooth
– Constrained optimization: h(θ) = 0 if θ ∈ K, and +∞ otherwise
– Sparsity-inducing norms, e.g., h(θ) = ‖θ‖1
• Proximal methods (a.k.a. splitting methods)
– Extra projection / soft thresholding step after gradient update
– See, e.g., Combettes and Pesquet (2011); Bach, Jenatton, Mairal,
and Obozinski (2012b); Parikh and Boyd (2014)
• Directly extends to variance-reduced gradient techniques
– Same rates of convergence
212
![Page 213: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/213.jpg)
Acceleration
• Similar guarantees for finite sums: SAG, SDCA, SVRG (Xiao and
Zhang, 2014), SAGA, MISO (Mairal, 2015)
Gradient descent d×∣
∣
∣nLµ × log 1
ε
Accelerated gradient descent d×∣
∣
∣n√
Lµ × log 1
ε
SAG(A), SVRG, SDCA, MISO d×∣
∣
∣(n+ L
µ) × log 1ε
213
![Page 214: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/214.jpg)
Acceleration
• Similar guarantees for finite sums: SAG, SDCA, SVRG (Xiao and
Zhang, 2014), SAGA, MISO (Mairal, 2015)
Gradient descent d×∣
∣
∣nLµ × log 1
ε
Accelerated gradient descent d×∣
∣
∣n√
Lµ × log 1
ε
SAG(A), SVRG, SDCA, MISO d×∣
∣
∣(n+ L
µ) × log 1ε
Accelerated versions d×∣
∣
∣
∣
(n+√
nLµ) × log 1
ε
• Acceleration for special algorithms (e.g., Shalev-Shwartz and
Zhang, 2014; Nitanda, 2014; Lan, 2015)
• Catalyst (Lin, Mairal, and Harchaoui, 2015)
– Widely applicable generic acceleration scheme
214
![Page 215: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/215.jpg)
From training to testing errors
• rcv1 dataset (n = 697 641, d = 47 236)
– NB: IAG, SG-C, ASG with optimal step-sizes in hindsight
Training cost Testing cost
Effective Passes
0 10 20 30 40 50
Objective minus Optimum
10-20
10-15
10-10
10-5
100
AGDL-BFGS
SG-CASG
IAG
SAG
215
![Page 216: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/216.jpg)
From training to testing errors
• rcv1 dataset (n = 697 641, d = 47 236)
– NB: IAG, SG-C, ASG with optimal step-sizes in hindsight
Training cost Testing cost
Effective Passes
0 10 20 30 40 50
Objective minus Optimum
10-20
10-15
10-10
10-5
100
AGDL-BFGS
SG-CASG
IAG
SAG
Effective Passes
0 10 20 30 40 50
Test Logistic Loss
105
0
0.5
1
1.5
2
2.5
AGD
L-BFGS
SG-C
ASG
IAG
SAG
216
![Page 217: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/217.jpg)
SGD minimizes the testing cost!
• Goal: minimize f(θ) = Ep(x,y)ℓ(y, θ⊤Φ(x))
– Given n independent samples (xi, yi), i = 1, . . . , n from p(x, y)
– Given a single pass of stochastic gradient descent
– Bounds on the excess testing cost Ef(θn)− infθ∈Rd f(θ)
217
![Page 218: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/218.jpg)
SGD minimizes the testing cost!
• Goal: minimize f(θ) = Ep(x,y)ℓ(y, θ⊤Φ(x))
– Given n independent samples (xi, yi), i = 1, . . . , n from p(x, y)
– Given a single pass of stochastic gradient descent
– Bounds on the excess testing cost Ef(θn)− infθ∈Rd f(θ)
• Optimal convergence rates: O(1/√n) and O(1/(nµ))
– Optimal for non-smooth losses (Nemirovsky and Yudin, 1983)
– Attained by averaged SGD with decaying step-sizes
218
![Page 219: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/219.jpg)
SGD minimizes the testing cost!
• Goal: minimize f(θ) = Ep(x,y)ℓ(y, θ⊤Φ(x))
– Given n independent samples (xi, yi), i = 1, . . . , n from p(x, y)
– Given a single pass of stochastic gradient descent
– Bounds on the excess testing cost Ef(θn)− infθ∈Rd f(θ)
• Optimal convergence rates: O(1/√n) and O(1/(nµ))
– Optimal for non-smooth losses (Nemirovsky and Yudin, 1983)
– Attained by averaged SGD with decaying step-sizes
• Constant-step-size SGD
– Linear convergence up to the noise level for strongly-convex
problems (Solodov, 1998; Nedic and Bertsekas, 2000)
– Full convergence and robustness to ill-conditioning?
219
![Page 220: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/220.jpg)
Robust averaged stochastic gradient
(Bach and Moulines, 2013)
• Constant-step-size SGD is convergent for least-squares
– Convergence rate in O(1/n) without any dependence on µ
– Simple choice of step-size (equal to 1/L)
! " #
!$%
!$&
!$#
!$"
!
!$"
'()*!+,-
'()*!./+
/+-0
*1
*1,*1"
345
- Convergence in O(1/n) for smooth losses with O(d) online Newton
step220
![Page 221: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/221.jpg)
Robust averaged stochastic gradient
(Bach and Moulines, 2013)
• Constant-step-size SGD is convergent for least-squares
– Convergence rate in O(1/n) without any dependence on µ
– Simple choice of step-size (equal to 1/L)
! " #
!$%
!$&
!$#
!$"
!
!$"
'()*!+,-
'()*!./+
/+-0
*1
*1,*1"
345
• Convergence in O(1/n) for smooth losses with O(d) online Newton
step221
![Page 222: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/222.jpg)
Conclusions - variance reduction
• Linearly-convergent stochastic gradient methods
– Provable and precise rates
– Improves on two known lower-bounds (by using structure)
– Several extensions / interpretations / accelerations
222
![Page 223: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/223.jpg)
Conclusions - variance reduction
• Linearly-convergent stochastic gradient methods
– Provable and precise rates
– Improves on two known lower-bounds (by using structure)
– Several extensions / interpretations / accelerations
• Extensions and future work
– Extension to saddle-point problems (Balamurugan and Bach, 2016)
– Lower bounds for finite sums (Agarwal and Bottou, 2014; Lan,
2015; Arjevani and Shamir, 2016)
– Sampling without replacement (Gurbuzbalaban et al., 2015;
Shamir, 2016)
223
![Page 224: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/224.jpg)
Conclusions - variance reduction
• Linearly-convergent stochastic gradient methods
– Provable and precise rates
– Improves on two known lower-bounds (by using structure)
– Several extensions / interpretations / accelerations
• Extensions and future work
– Extension to saddle-point problems (Balamurugan and Bach, 2016)
– Lower bounds for finite sums (Agarwal and Bottou, 2014; Lan,
2015; Arjevani and Shamir, 2016)
– Sampling without replacement (Gurbuzbalaban et al., 2015;
Shamir, 2016)
– Bounds on testing errors for incremental methods (Frostig et al.,
2015; Babanezhad et al., 2015)
224
![Page 225: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/225.jpg)
Frank-Wolfe - conditional gradient - I
• Goal: minimize smooth convex function f(θ) on compact set C
• Standard result: accelerated projected gradient descent with optimal
rate O(1/t2)
– Requires projection oracle: argminθ∈C12‖θ − η‖2
• Only availability of the linear oracle: argminθ∈C θ⊤η
– Many examples (sparsity, low-rank, large polytopes, etc.)
– Iterative Frank-Wolfe algorithm (see, e.g., Jaggi, 2013, and
references therein) with geometric interpretation (see board)
θt ∈ argminθ∈C
θ⊤f ′(θt−1)
θt = (1− ρt)θt−1 + ρtθt
225
![Page 226: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/226.jpg)
Frank-Wolfe - conditional gradient - II
• Convergence rates: f(θt)− f(θ∗) 62Ldiam(C)2
t+ 1
Iteration:
θt ∈ argminθ∈C
θ⊤f ′(θt−1)
θt = (1− ρt)θt−1 + ρtθt
From smoothness: f(θt) 6 f(θt−1)+f ′(θt−1)⊤[ρt(θt−θt−1)
]
+L
2
∥
∥ρt(θt−θt−1)∥
∥
2
From compactness: f(θt) 6 f(θt−1) + f ′(θt−1)⊤[ρt(θt − θt−1)
]
+L
2ρ2tdiam(C)2
From convexity, f(θt)− f(θ∗) 6 f ′(θt−1)⊤(θt−1− θ∗) 6 max
θ∈Cf ′(θt−1)
⊤(θt−1− θ),
which is equal to f ′(θt−1)⊤(θt−1 − θt)
Thus, f(θt) 6 f(θt−1)− ρt[
f(θt−1)− f(θ∗)]
+L
2ρ2tdiam(C)2
With ρt = 2/(t+ 1): f(θt) 62Ldiam(C)2
t+1 by direct expansion
226
![Page 227: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/227.jpg)
Frank-Wolfe - conditional gradient - II
• Convergence rates: f(θt)− f(θ∗) 62Ldiam(C)2
t
• Remarks and extensions
– Affine-invariant algorithms
– Certified duality gaps and dual interpretations (Bach, 2015)
– Adapted to very large polytopes
– Line-search extensions: minimize quadratic upper-bound
– Stochastic extensions (Lacoste-Julien et al., 2013)
– Away and pairwise steps to avoid oscillations (Lacoste-Julien and
Jaggi, 2015)
227
![Page 228: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/228.jpg)
Outline - II
1. Introduction
• Large-scale machine learning and optimization
• Classes of functions (convex, smooth, etc.)
• Traditional statistical analysis (regardless of optimization)
2. Classical methods for convex optimization
• Smooth optimization (gradient descent, Newton method)
• Non-smooth optimization (subgradient descent)
• Proximal methods
3. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex
228
![Page 229: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/229.jpg)
Outline - II
4. Classical stochastic approximation (not covered)
• Asymptotic analysis
• Robbins-Monro algorithm and Polyak-Rupert averaging
5. Smooth stochastic approximation algorithms
• Non-asymptotic analysis for smooth functions
• Least-squares regression without decaying step-sizes
6. Finite data sets (partially covered)
• Gradient methods with exponential convergence rates
• (Dual) stochastic coordinate descent
• Frank-Wolfe
7. Non-convex problems (“open” / not covered)
229
![Page 230: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/230.jpg)
Subgradient descent for machine learning
• Assumptions (f is the expected risk, f the empirical risk)
– “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖2 6 R a.s.
– f(θ) = 1n
∑ni=1 ℓ(yi,Φ(xi)
⊤θ)
– G-Lipschitz loss: f and f are GR-Lipschitz on Θ = ‖θ‖2 6 D
• Statistics: with probability greater than 1− δ
supθ∈Θ
|f(θ)− f(θ)| 6 GRD√n
[
2 +
√
2 log2
δ
]
• Optimization: after t iterations of subgradient method
f(θ)−minη∈Θ
f(η) 6GRD√
t
• t = n iterations, with total running-time complexity of O(n2d)
230
![Page 231: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/231.jpg)
Stochastic subgradient “descent”/method
• Assumptions
– fn convex and B-Lipschitz-continuous on ‖θ‖2 6 D– (fn) i.i.d. functions such that Efn = f
– θ∗ global optimum of f on ‖θ‖2 6 D
• Algorithm: θn = ΠD
(
θn−1 −2D
B√nf ′n(θn−1)
)
• Bound:
Ef
(
1
n
n−1∑
k=0
θk
)
− f(θ∗) 62DB√
n
• “Same” three-line proof as in the deterministic case
• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
• Running-time complexity: O(dn) after n iterations
231
![Page 232: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/232.jpg)
Summary of new results (Bach and Moulines, 2011)
• Stochastic gradient descent with learning rate γn = Cn−α
• Strongly convex smooth objective functions
– Old: O(n−1) rate achieved without averaging for α = 1
– New: O(n−1) rate achieved with averaging for α ∈ [1/2, 1]
– Non-asymptotic analysis with explicit constants
– Forgetting of initial conditions
– Robustness to the choice of C
• Convergence rates for E‖θn − θ∗‖2 and E‖θn − θ∗‖2
– no averaging: O(σ2γn
µ
)
+O(e−µnγn)‖θ0 − θ∗‖2
– averaging:trH(θ∗)−1
n+ µ−1O(n−2α+n−2+α) +O
(‖θ0 − θ∗‖2µ2n2
)
232
![Page 233: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/233.jpg)
Least-mean-square algorithm
• Least-squares: f(θ) = 12E
[
(yn − 〈Φ(xn), θ〉)2]
with θ ∈ Rd
– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
– usually studied without averaging and decreasing step-sizes
– with strong convexity assumption E[
Φ(xn)⊗Φ(xn)]
= H < µ · Id
• New analysis for averaging and constant step-size γ = 1/(4R2)
– Assume ‖Φ(xn)‖ 6 R and |yn − 〈Φ(xn), θ∗〉| 6 σ almost surely
– No assumption regarding lowest eigenvalues of H
– Main result: Ef(θn−1)− f(θ∗) 64σ2d
n+
4R2‖θ0 − θ∗‖2n
• Matches statistical lower bound (Tsybakov, 2003)
– Non-asymptotic robust version of Gyorfi and Walk (1996)
233
![Page 234: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/234.jpg)
Choice of support point for online Newton step
• Two-stage procedure
(1) Run n/2 iterations of averaged SGD to obtain θ
(2) Run n/2 iterations of averaged constant step-size LMS
– Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
– Provable convergence rate of O(d/n) for logistic regression
– Additional assumptions but no strong convexity
• Update at each iteration using the current averaged iterate
– Recursion: θn = θn−1 − γ[
f ′n(θn−1) + f ′′
n(θn−1)(θn−1 − θn−1)]
– No provable convergence rate (yet) but best practical behavior
– Note (dis)similarity with regular SGD: θn = θn−1 − γf ′n(θn−1)
234
![Page 235: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/235.jpg)
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)
• Stochastic average gradient (SAG) iteration
– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ 1, . . . , n with replacement
– Iteration: θt = θt−1 −γtn
n∑
i=1
yti with yti =
f ′i(θt−1) if i = i(t)
yt−1i otherwise
• Stochastic version of incremental average gradient (Blatt et al., 2008)
• Extra memory requirement
– Supervised machine learning
– If fi(θ) = ℓi(yi,Φ(xi)⊤θ), then f ′
i(θ) = ℓ′i(yi,Φ(xi)⊤θ)Φ(xi)
– Only need to store n real numbers
235
![Page 236: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/236.jpg)
Summary of rates of convergence
• Problem parameters
– D diameter of the domain
– B Lipschitz-constant
– L smoothness constant
– µ strong convexity constant
convex strongly convex
nonsmooth deterministic: BD/√t deterministic: B2/(tµ)
stochastic: BD/√n stochastic: B2/(nµ)
smooth deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: LD2/√n stochastic: L/(nµ)
finite sum: n/t finite sum: exp(−t/(n+L/µ))
quadratic deterministic: LD2/t2 deterministic: exp(−t√
µ/L)
stochastic: d/n+ LD2/n stochastic: d/n+ LD2/n
236
![Page 237: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/237.jpg)
Conclusions
Machine learning and convex optimization
• Statistics with or without optimization?
– Significance of mixing algorithms with analysis
– Benefits of mixing algorithms with analysis
• Open problems
– Non-parametric stochastic approximation (see, e.g. Dieuleveut and
Bach, 2014)
– Characterization of implicit regularization of online methods
– Structured prediction
– Going beyond a single pass over the data (testing performance)
– Parallel and distributed optimization
– Non-convex optimization (see, e.g. Reddi et al., 2016)
237
![Page 238: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/238.jpg)
References
A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds
on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information
Theory, 58(5):3235–3249, 2012.
Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. arXiv preprint
arXiv:1410.0723, 2014.
Y. Arjevani and O. Shamir. Dimension-free iteration complexity of finite sum optimization problems.
In Advances In Neural Information Processing Systems, 2016.
R. Babanezhad, M. O. Ahmed, A. Virani, M. W. Schmidt, J. Konecny, and S. Sallinen. Stopwasting
my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, 2015.
F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic
regression. Technical Report 00804431, HAL, 2013.
F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine
learning. In Adv. NIPS, 2011.
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence
rate o(1/n). Technical Report 00831977, HAL, 2013.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization,
2012a.
Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on
Optimization, 25(1):115–129, 2015.
238
![Page 239: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/239.jpg)
Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-
inducing penalties. Foundations and Trends R© in Machine Learning, 4(1):1–106, 2012b.
P. Balamurugan and F. Bach. Stochastic variance reduction methods for saddle-point problems.
Technical Report 01319293, HAL, 2016.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Albert Benveniste, Michel Metivier, and Pierre Priouret. Adaptive algorithms and stochastic
approximations. Springer Publishing Company, Incorporated, 2012.
D. P. Bertsekas. Nonlinear programming. Athena scientific, 1999.
D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant
step size. SIAM Journal on Optimization, 18(1):29–51, 2008.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in
Business and Industry, 21(2):137–151, 2005.
S. Boucheron and P. Massart. A high-dimensional wilks phenomenon. Probability theory and related
fields, 150(3-4):405–433, 2011.
S. Boucheron, O. Bousquet, G. Lugosi, et al. Theory of classification: A survey of some recent
advances. ESAIM Probability and statistics, 9:323–375, 2005.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine
Learning, 8(3-4):231–357, 2015.
239
![Page 240: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/240.jpg)
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms.
Information Theory, IEEE Transactions on, 50(9):2050–2057, 2004.
P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-point
algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011.
A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method
for big data problems. In Proc. ICML, 2014a.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method
with support for non-strongly convex composite objectives. In Advances in Neural Information
Processing Systems, pages 1646–1654, 2014b.
A. Defossez and F. Bach. Constant step size least-mean-square: Bias-variance trade-offs and optimal
sampling distributions. 2015.
A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Technical
report, ArXiv, 2014.
A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for
least-squares regression. Technical Report 1602.05419, arXiv, 2016.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal
of Machine Learning Research, 10:2899–2934, 2009. ISSN 1532-4435.
R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Competing with the empirical risk minimizer in a
single pass. In Proceedings of the Conference on Learning Theory, 2015.
S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic
composite optimization. Optimization Online, July, 2010.
240
![Page 241: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/241.jpg)
Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex
stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal
on Optimization, 23(4):2061–2089, 2013.
M. Gurbuzbalaban, A. Ozdaglar, and P. Parrilo. On the convergence rate of incremental aggregated
gradient algorithms. Technical Report 1506.02081, arXiv, 2015.
L. Gyorfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal
on Control and Optimization, 34(1):31–61, 1996.
E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69(2):169–192, 2007.
Chonghai Hu, James T Kwok, and Weike Pan. Accelerated gradient methods for stochastic optimization
and online learning. In NIPS, volume 22, pages 781–789, 2009.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of
the 30th International Conference on Machine Learning (ICML-13), pages 427–435, 2013.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance
reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications.
Springer-Verlag, second edition, 2003.
S. Lacoste-Julien and M. Jaggi. On the global linear convergence of frank-wolfe optimization variants.
In Advances in Neural Information Processing Systems (NIPS), 2015.
S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an o (1/t) convergence
rate for projected stochastic subgradient descent. Technical Report 1212.2002, ArXiv, 2012.
241
![Page 242: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/242.jpg)
Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of The 30th International Conference
on Machine Learning, pages 53–61, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv,
2015.
Guanghui Lan, Arkadi Nemirovski, and Alexander Shapiro. Validation analysis of mirror descent
stochastic approximation method. Mathematical programming, 134(2):425–458, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets. Technical Report 00674995, HAL,
2013.
R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Asaga: Asynchronous parallel Saga. Technical Report
1606.04809, arXiv, 2016.
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in
Neural Information Processing Systems (NIPS), 2015.
O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission.
Wiley West Sussex, 1995.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine
learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic
242
![Page 243: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/243.jpg)
Optimization: Algorithms and Applications, pages 263–304, 2000.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to
stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley
& Sons, 1983.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k2).
Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers,
2004.
Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations
Research and Econometrics (CORE), Catholic University of Louvain, Tech. Rep, 76, 2007.
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120
(1):221–259, 2009.
Y. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):
1559–1568, 2008. ISSN 0005-1098.
A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural
Information Processing Systems (NIPS), 2014.
N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in optimization, 1(3):127–239,
2014.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal
on Control and Optimization, 30(4):838–855, 1992.
243
![Page 244: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/244.jpg)
S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex
optimization. Technical Report 1603.06160, arXiv, 2016.
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics,
pages 400–407, 1951a.
H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400–407,
1951b.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report
781, Cornell University Operations Research and Industrial Engineering, 1988.
M. Schmidt, N. Le Roux, and F. Bach. Optimization with approximate gradients. Technical report,
HAL, 2011.
B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
S. Shalev-Shwartz. Sdca without duality, regularization, and individual convexity. Technical Report
1602.01582, arXiv, 2016.
S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc.
ICML, 2008.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss
minimization. Technical Report 1209.1873, Arxiv, 2012.
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized
loss minimization. In Proc. ICML, 2014.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for svm.
In Proc. ICML, 2007.
244
![Page 245: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/245.jpg)
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In proc.
COLT, 2009.
O. Shamir. Without-replacement sampling for stochastic gradient methods: Convergence results and
application to distributed optimization. Technical Report 1603.00570, arXiv, 2016.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press,
2004.
Naum Zuselevich Shor, Krzysztof C. Kiwiel, and Andrzej Ruszcay?ski. Minimization methods for
non-differentiable functions. Springer-Verlag New York, Inc., 1985.
M.V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational
Optimization and Applications, 11(1):23–35, 1998.
K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. 2008.
P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize
rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
I. Tsochantaridis, Thomas Joachims, T., Y. Altun, and Y. Singer. Large margin methods for structured
and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.
A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge Univ. press, 2000.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal
of Machine Learning Research, 9:2543–2596, 2010. ISSN 1532-4435.
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction.
SIAM Journal on Optimization, 24(4):2057–2075, 2014.
245
![Page 246: Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, Francefbach/fbach_mlss_2018.pdf · 2018-08-29 · Outline - II 1. Introduction • Large-scale machine learning and optimization](https://reader034.vdocument.in/reader034/viewer/2022042211/5eb2898e59f10216b202b4a4/html5/thumbnails/246.jpg)
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of
full gradients. In Advances in Neural Information Processing Systems, 2013.
246