Stochastic optimization in Hilbert spaces

Aymeric Dieuleveut

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48

Outline

Learning vs Statistics
Tradeoffs of large scale learning
Algorithm complexity. ERM?
Stochastic optimization
Why is SGD so useful in learning?
A simple case
Least mean squares, finite dimension
Higher dimension?
RKHS, non parametric learning
Lower complexity?
Column sampling, feature selection

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 7 / 48

Tradeoffs of Large scale learning - Learning

Statistics vs Machine Learning

Statistics      | Machine Learning
Estimation      | Learning
Classifier      | Hypothesis
Data point      | Example / Instance
Regression      | Supervised Learning
Classification  | Supervised Learning
Covariate       | Feature
Response        | Label (1)

Essentially, AI people and mathematicians doing the same kind of work. The main differences:

Statisticians are more interested in the model and in drawing conclusions about it.

ML researchers are more interested in prediction, with a concern for algorithms that handle high-dimensional data.

1. taken from www.quora.com/What-is-the-difference-between-statistics-and-machine-learning

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 8 / 48


Tradeoffs of Large scale learning - Learning

Framework

We consider the classical risk minimization problem. Given:

a space of input-output pairs (x, y) ∈ X × Y, with probability distribution P(x, y),

a loss function ℓ : Y × Y → R, and a class of functions F,

the risk of a function f : X → Y is R(f) := E_P[ℓ(f(x), y)].

Our aim is

$\min_{f \in \mathcal{F}} R(f).$

R is unknown.

Given a sequence of i.i.d. data points (x_i, y_i)_{i=1..n} ∼ P^{⊗n}, we can define the empirical risk

$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 9 / 48
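As a concrete illustration of the empirical risk (not part of the original slides), here is a minimal Python sketch for the squared loss with a linear predictor f_θ(x) = θᵀx; the function name empirical_risk and the toy data are assumptions made for this example.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Empirical risk R_n(f_theta) for the squared loss and the linear predictor f_theta(x) = theta^T x."""
    residuals = X @ theta - y      # f_theta(x_i) - y_i for each of the n samples
    return np.mean(residuals ** 2)

# toy usage: n = 100 i.i.d. samples in dimension d = 5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y), empirical_risk(theta_star, X, y))
```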


Tradeoffs of Large scale learning - Learning

The bias-variance tradeoff

a.k.a. the approximation/estimation error decomposition.

There are many ways of seeing it:

constrained case

penalized case

other regularizations

Hence the compromise: ε_app + ε_est.

        ε_app   ε_est
F ↗       ↘       ↗

This is the classical setting.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 10 / 48


Tradeoffs of Large scale learning - Learning

Adding an optimization term

When facing large datasets, it may be difficult, and even useless, to optimize the estimator to high accuracy. We therefore consider the choice of the algorithm from a fixed time-budget point of view. (2)

This raises the following questions:

up to which precision is it necessary to optimize?

what is the limiting factor (time, data points)?

A problem is said to be large scale when time is the limiting factor. For a large scale problem:

which algorithm?

does more data mean less work (when time is limiting)?

2. Ref: [Shalev-Schwartz and Srebro, 2008, Shalev-Schwartz and K., 2011, Bottou and Bousquet, 2008]

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 11 / 48


Tradeoffs of Large scale learning - Learning

Tradeoffs - Large scale learning

          F ↗    n ↗    ε ↗
ε_app      ↘
ε_est      ↗      ↘
ε_opt                    ↗
T          ↗      ↗      ↘

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 12 / 48


Tradeoffs of Large scale learning - Learning

Different algorithms

To minimize the empirical risk, several algorithms may be considered:

Gradient descent

Second-order gradient descent

Stochastic gradient descent

Fast stochastic algorithms (requiring high memory storage)

Let us first compare the first-order methods: SGD and GD.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 13 / 48


Tradeoffs of Large scale learning - Learning

Stochastic gradient algorithms

Aim: min_f R(f), where we only have access to unbiased estimates of R(f) and ∇R(f).

1. Start at some f_0.
2. Iterate: get an unbiased gradient estimate g_k, s.t. E[g_k] = ∇R(f_k), and set f_{k+1} ← f_k − γ_k g_k.
3. Output f_m, or the averaged iterate $\bar{f}_m := \frac{1}{m} \sum_{k=1}^{m} f_k$ (averaged SGD).

Gradient descent: the same, but with the "true" gradient.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 14 / 48
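To make the recursion concrete, here is a minimal Python sketch of SGD with Polyak-Ruppert averaging on a stream of examples; this is only an illustration under assumed choices (a least-squares loss and the step size γ_k = γ_0/√k mentioned later), and the names averaged_sgd and lsq_grad are hypothetical.

```python
import numpy as np

def averaged_sgd(sample_grad, d, n_iter, gamma0=0.1):
    """Generic SGD with averaging: f_{k+1} = f_k - gamma_k * g_k, output (1/m) sum_k f_k.

    sample_grad(theta, k) must return an unbiased estimate of the gradient of the risk at theta.
    """
    theta = np.zeros(d)            # f_0 = 0
    theta_bar = np.zeros(d)        # running average of the iterates
    for k in range(1, n_iter + 1):
        g = sample_grad(theta, k)                  # unbiased gradient estimate g_k
        theta = theta - gamma0 / np.sqrt(k) * g    # step size gamma_k proportional to 1/sqrt(k)
        theta_bar += (theta - theta_bar) / k       # online update of the average
    return theta_bar

# toy usage on a least-squares stream: one fresh (x, y) pair per iteration
rng = np.random.default_rng(1)
theta_star = np.ones(3)

def lsq_grad(theta, k):
    x = rng.normal(size=3)
    y = x @ theta_star + 0.1 * rng.normal()
    return (x @ theta - y) * x    # gradient of 0.5 * (x^T theta - y)^2

print(averaged_sgd(lsq_grad, d=3, n_iter=5000))
```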


Tradeoffs of Large scale learning - Learning

ERM

SGD on the empirical risk: min_{f ∈ F} R_n(f)
  Pick any (x_i, y_i) from the empirical sample.
  g_k = ∇_f ℓ(f_k, (x_i, y_i)), then f_{k+1} ← f_k − γ_k g_k.
  Output $\bar{f}_m$, with $R_n(\bar{f}_m) - R_n(f_n^\ast) \leq O(1/\sqrt{m})$.
  sup_{f ∈ F} |R − R_n|(f) ≤ O(1/√n).
  Cost of one iteration: O(d).

GD on the empirical risk: min_{f ∈ F} R_n(f)
  $g_k = \nabla_f \frac{1}{n}\sum_{i=1}^{n} \ell(f_k, (x_i, y_i)) = \nabla_f R_n(f_k)$, then f_{k+1} ← f_k − γ_k g_k.
  Output f_m, with $R_n(f_m) - R_n(f_n^\ast) \leq O((1 - \kappa)^m)$.
  sup_{f ∈ F} |R − R_n|(f) ≤ O(1/√n).
  Cost of one iteration: O(nd).

Overall (with step size γ_k proportional to 1/√k):

$R(\bar{f}_m) - R(f^\ast) \leq O(1/\sqrt{m}) + O(1/\sqrt{n}).$

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 15 / 48


Tradeoffs of Large scale learning - Learning

Conclusion

In the large scale setting, it is beneficial to use SGD!

Does more data help? With the global estimation error fixed, it seems that

$T \simeq \frac{1}{R(\bar{f}_m) - R(f^\ast) - \frac{1}{\sqrt{n}}}$

is decreasing with n.

Upper bounding R_n − R uniformly is dangerous: we also have to compare to one-pass SGD, which minimizes the true risk R.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 16 / 48


Tradeoffs of Large scale learning - Learning

Expectation minimization

Stochastic gradient descent may also be used to minimize R(f) directly:

SGD on the empirical risk: min_{f ∈ F} R_n(f)
  Pick any (x_i, y_i) from the empirical sample.
  g_k = ∇_f ℓ(f_k, (x_i, y_i)), then f_{k+1} ← f_k − γ_k g_k.
  Output $\bar{f}_m$, with $R_n(\bar{f}_m) - R_n(f_n^\ast) \leq O(1/\sqrt{m})$ and sup_{f ∈ F} |R − R_n|(f) ≤ O(1/√n).
  Cost of one iteration: O(d).

One-pass SGD: min_{f ∈ F} R(f)
  Pick a fresh, independent (x, y).
  g_k = ∇_f ℓ(f_k, (x, y)), then f_{k+1} ← f_k − γ_k g_k.
  Output $\bar{f}_k$, k ≤ n, with $R(\bar{f}_k) - R(f^\ast) \leq O(1/\sqrt{k})$.
  Cost of one iteration: O(d).

SGD with one pass (early stopping as regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 17 / 48


Tradeoffs of Large scale learning - Learning

Rate of convergence

We are interested in prediction.

Strongly convex objective: 1/(µn).

Non-strongly convex objective: 1/√n.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 18 / 48

A case study - Finite dimension linear least mean squares

LMS [Bach and Moulines, 2013]

We now consider the simple case where X = R^d and the loss ℓ is quadratic. We are interested in linear predictors:

$\min_{\theta \in \mathbb{R}^d} \mathbb{E}_P\big[(\theta^\top x - y)^2\big].$

We assume that the data points are generated according to

$y_i = \theta_\ast^\top x_i + \varepsilon_i.$

We consider the stochastic gradient algorithm:

$\theta_0 = 0, \qquad \theta_{n+1} = \theta_n - \gamma_n \big(\langle x_n, \theta_n \rangle\, x_n - y_n x_n\big).$

This recursion may be rewritten as

$\theta_{n+1} - \theta_\ast = (I - \gamma_n x_n x_n^\top)(\theta_n - \theta_\ast) - \gamma_n \xi_n. \qquad (1)$

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 19 / 48
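As an illustration (not from the slides), here is a minimal Python sketch of recursion (1) with averaging of the iterates; the constant step size γ = 1/(4R²) follows the regime analysed in [Bach and Moulines, 2013], and the function name lms_averaged_sgd and the toy data are assumptions.

```python
import numpy as np

def lms_averaged_sgd(X, y, gamma):
    """Averaged SGD for least squares, one pass over the data.

    Recursion from the slide: theta_{n+1} = theta_n - gamma * (<x_n, theta_n> - y_n) * x_n,
    started at theta_0 = 0; the averaged iterate is returned.
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        x, target = X[i], y[i]
        theta = theta - gamma * (x @ theta - target) * x
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta_bar

# toy usage; R2 is an empirical estimate of E[||x||^2]
rng = np.random.default_rng(2)
n, d = 2000, 10
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.5 * rng.normal(size=n)
R2 = np.mean(np.sum(X ** 2, axis=1))
print(np.linalg.norm(lms_averaged_sgd(X, y, gamma=1.0 / (4 * R2)) - theta_star))
```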

A case study - Finite dimension linear least mean squares

Rate of convergence, back again!

We are interested in prediction.

Strongly convex objective: 1/(µn).

Non-strongly convex objective: 1/√n.

We define H = E[xxᵀ]; we have µ = min Sp(H).

For least mean squares, the statistical rate of the ordinary least-squares estimator is σ²d/n, so there is still a gap to be bridged!

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 20 / 48


A case study - Finite dimension linear least mean squares

A few assumptions

We define H = E[xxᵀ] and C = E[ξξᵀ].

Bounded noise variance: we assume C ≼ σ²H (as symmetric operators).

Covariance operator:

no assumption on the minimal eigenvalue,

E[‖x‖²] ≤ R².

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 21 / 48


A case study - Finite dimension linear least mean squares

Result

Theorem

$\mathbb{E}\big[R(\bar{\theta}_n) - R(\theta_\ast)\big] \;\leq\; \frac{4}{n}\left(\sigma^2 d + R^2\,\|\theta_0 - \theta_\ast\|^2\right)$

Optimal statistical rate.

Rate 1/n without strong convexity.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 22 / 48

Non parametric learning

Outline

What if d >> n?

Carry the analysis into a Hilbert space, using reproducing kernel Hilbert spaces.

Non parametric regression in RKHS: an interesting problem in itself.

Behaviour in finite dimension: adaptivity, tradeoffs.

Optimal statistical rates in RKHS: choice of γ.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 26 / 48

Non parametric learning

Reproducing kernel Hilbert space [Dieuleveut and Bach, 2014]

We denote by H_K a Hilbert space of functions, H_K ⊂ R^X, characterized by the kernel function K : X × X → R:

for any x, the function K_x : X → R defined by K_x(x') = K(x, x') is in H_K,

reproducing property: for all g ∈ H_K and x ∈ X, g(x) = ⟨g, K_x⟩_K.

Two usages:

α) A hypothesis space for regression.

β) Mapping data points into a linear space.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 27 / 48


Non parametric learning

α) A hypothesis space for regression.

Classical regression setting:

(X_i, Y_i) ∼ ρ i.i.d., with (X_i, Y_i) ∈ X × R.

Goal: minimizing the prediction error

$\min_{g \in L^2} \mathbb{E}\big[(g(X) - Y)^2\big].$

We look for an estimator g_n of g_ρ(X) = E[Y | X], with g_ρ ∈ L²_ρX, where

$L^2_{\rho_X} = \Big\{ f : X \to \mathbb{R} \;\Big|\; \int f^2(t)\, d\rho_X(t) < \infty \Big\}.$

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 28 / 48


Non parametric learning

β) Mapping data points into a linear space.

Linear regression on data mapped into some RKHS:

$\arg\min_{\theta \in \mathcal{H}} \|Y - X\theta\|^2.$

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 29 / 48

Non parametric learning

Two approaches to the regression problem:

the general regression problem, g_ρ ∈ L²_ρX,

the linear regression problem in the RKHS.

Link: in general H_K ⊂ L²_ρX, and in some cases the closure of the RKHS for the L²_ρX norm is the whole space,

$\overline{\mathcal{H}_K}^{\,\|\cdot\|_{L^2_{\rho_X}}} = L^2_{\rho_X}.$

We then look for an estimator of the regression function (the first problem) using the natural algorithms for the second one, run in the RKHS.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 30 / 48



Non parametric learning

SGD algorithm in the RKHS

Start from g_0 ∈ H_K (we often consider g_0 = 0), and write

$g_n = \sum_{i=1}^{n} a_i K_{x_i}, \qquad (2)$

with $(a_n)_n$ such that $a_n := -\gamma_n\big(g_{n-1}(x_n) - y_n\big) = -\gamma_n\Big(\sum_{i=1}^{n-1} a_i K(x_n, x_i) - y_n\Big)$.

Indeed,

$g_n = g_{n-1} - \gamma_n\big(g_{n-1}(x_n) - y_n\big) K_{x_n} = \sum_{i=1}^{n} a_i K_{x_i}$, with $a_n$ defined as above.

$(g_{n-1}(x_n) - y_n)\, K_{x_n}$ is an unbiased estimate of the gradient of $\tfrac{1}{2}\,\mathbb{E}\big[(\langle K_x, g_{n-1}\rangle - y)^2\big]$.

The SGD algorithm in the RKHS therefore takes a very simple form.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 32 / 48
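The coefficient update above translates directly into code. Here is a minimal Python sketch of one-pass kernel SGD; the Gaussian kernel, the constant step size and the names kernel_sgd / gaussian_kernel are illustrative assumptions. Note that step n costs n kernel evaluations, which is the quadratic complexity discussed in the last section.

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * bandwidth ** 2))

def kernel_sgd(X, y, kernel, gammas):
    """One-pass SGD in the RKHS: g_n = sum_i a_i K_{x_i}.

    At step n the new coefficient is a_n = -gamma_n * (g_{n-1}(x_n) - y_n),
    where g_{n-1}(x_n) = sum_{i < n} a_i K(x_n, x_i).
    """
    n = len(y)
    coeffs = np.zeros(n)
    for i in range(n):
        pred = sum(coeffs[j] * kernel(X[i], X[j]) for j in range(i))  # g_{i}(x_i) built from past points
        coeffs[i] = -gammas[i] * (pred - y[i])
    return coeffs

# toy usage: the prediction at a new point x is sum_i coeffs[i] * K(x, x_i)
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
coeffs = kernel_sgd(X, y, gaussian_kernel, gammas=0.5 * np.ones(200))
x_new = np.array([0.3])
print(sum(coeffs[i] * gaussian_kernel(x_new, X[i]) for i in range(200)))
```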


Non parametric learning

Assumptions

Two important points characterize the difficulty of the problem:

The regularity of the objective function

The spectrum of the covariance operator

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 33 / 48

Non parametric learning

Covariance operator

We have Σ = E[K_x ⊗ K_x], where K_x ⊗ K_x : g ↦ ⟨K_x, g⟩ K_x = g(x) K_x.

The covariance operator is a self-adjoint operator which contains information on the distribution of K_x.

Assumptions:

tr(Σ^α) < ∞, for some α ∈ [0; 1],

on g_ρ: g_ρ ∈ Σ^r(L²_ρ(X)) with r ≥ 0.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 34 / 48


Non parametric learning

Interpretation

tr(Σ^α) < ∞ quantifies how fast the eigenvalues of Σ decrease.

The condition on g_ρ defines an ellipsoid class of functions (we do not assume g_ρ ∈ H_K).

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 35 / 48

Non parametric learning

Result

Theorem

Under a few hidden assumptions:

$\mathbb{E}\big[R(\bar{g}_n) - R(g_\rho)\big] \;\leq\; O\!\left(\frac{\sigma^2\,\mathrm{tr}(\Sigma^\alpha)\,\gamma^\alpha}{n^{1-\alpha}}\right) + O\!\left(\frac{\|\Sigma^{-r} g_\rho\|^2}{(n\gamma)^{2(r\wedge 1)}}\right)$

Bias-variance decomposition.

The O(·) hides a known constant (4 or 8).

Finite-horizon result here, but it extends to the online setting.

Saturation (the r ∧ 1 in the bias term).

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 36 / 48


Non parametric learning

Corollary

Assume A1-8. If $\frac{1-\alpha}{2} < r < \frac{2-\alpha}{2}$, then with $\gamma = n^{-\frac{2r+\alpha-1}{2r+\alpha}}$ we get the optimal rate:

$\mathbb{E}\big[R(\bar{g}_n) - R(g_\rho)\big] = O\!\left(n^{-\frac{2r}{2r+\alpha}}\right). \qquad (3)$

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 37 / 48
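As a quick sanity check (not on the original slide), plugging this step size into the two terms of the theorem shows that they balance, using r ∧ 1 = r in the corollary's range:

```latex
% With \gamma = n^{-\frac{2r+\alpha-1}{2r+\alpha}}, the variance and bias terms of the theorem match:
\frac{\gamma^{\alpha}}{n^{1-\alpha}}
  = n^{-\alpha \frac{2r+\alpha-1}{2r+\alpha} - (1-\alpha)}
  = n^{-\frac{2r}{2r+\alpha}},
\qquad
\frac{1}{(n\gamma)^{2r}}
  = n^{-2r \left(1 - \frac{2r+\alpha-1}{2r+\alpha}\right)}
  = n^{-\frac{2r}{2r+\alpha}}.
```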

Non parametric learning

Conclusion 1

We get the statistically optimal rate of convergence for learning in an RKHS with one-pass SGD.

We get insights on how to choose the kernel and the step size.

We compare favorably to [Ying and Pontil, 2008, Caponnetto and De Vito, 2007, Tarres and Yao, 2011].

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 38 / 48


Non parametric learning

Conclusion 2

The theorem can be rewritten as

$\mathbb{E}\big[R(\bar{\theta}_n) - R(\theta_\ast)\big] \;\leq\; O\!\left(\frac{\sigma^2\,\mathrm{tr}(\Sigma^\alpha)\,\gamma^\alpha}{n^{1-\alpha}}\right) + O\!\left(\frac{\theta_\ast^\top \Sigma^{2r-1}\, \theta_\ast}{(n\gamma)^{2(r\wedge 1)}}\right), \qquad (4)$

where the ellipsoid condition appears more clearly. Thus:

SGD is adaptive to the regularity of the problem,

and it bridges the gap between the different regimes and explains the behaviour when d >> n.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 39 / 48


The complexity challenge, approximation of the kernel

1. Tradeoffs of Large scale learning - Learning

2. A case study - Finite dimension linear least mean squares

3. Non parametric learning

4. The complexity challenge, approximation of the kernel

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 40 / 48

The complexity challenge, approximation of the kernel

Reducing complexity: sampling methods

However, the complexity of such a method remains quadratic in the number of examples: iteration number n costs n kernel evaluations.

                       Rate       Complexity
Finite dimension       d/n        O(dn)
Infinite dimension     d_n/n      O(n²)

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 41 / 48


The complexity challenge, approximation of the kernel

Two related methods:

approximate the kernel matrix,

approximate the kernel itself.

Results on the first from [Bach, 2012]; such results have been extended by [Alaoui and Mahoney, 2014, Rudi et al., 2015]. There also exist results in the second situation [Rahimi and Recht, 2008, Dai et al., 2014].

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 42 / 48

The complexity challenge, approximation of the kernel

Sharp analysis

We only consider a fixed design setting. We then have to approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number d_n of its columns.

We still get the same estimation errors, leading to:

                       Rate       Complexity
Finite dimension       d/n        O(dn)
Infinite dimension     d_n/n      O(n d_n²)

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 43 / 48
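A minimal sketch of the column-sampling idea (a Nyström-type approximation of the kernel matrix), assuming a helper gaussian_columns that returns the requested columns; the pseudo-inverse reconstruction below is an illustration, not the exact estimator analysed in [Bach, 2012].

```python
import numpy as np

def nystrom_approx(kernel_columns, n, d_n, rng):
    """Rank-d_n approximation of an n x n kernel matrix from d_n randomly sampled columns.

    kernel_columns(idx) must return the n x len(idx) block of kernel values K(x_i, x_j), j in idx,
    so only n * d_n kernel evaluations are needed instead of n^2.
    """
    idx = rng.choice(n, size=d_n, replace=False)   # randomly picked columns
    C = kernel_columns(idx)                        # n x d_n block
    W = C[idx, :]                                  # d_n x d_n block K[idx][:, idx]
    return C @ np.linalg.pinv(W) @ C.T             # K is approximated by C W^+ C^T

# toy usage with a Gaussian kernel on one-dimensional points
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=500)

def gaussian_columns(idx):
    return np.exp(-(X[:, None] - X[idx][None, :]) ** 2)

K_approx = nystrom_approx(gaussian_columns, n=500, d_n=50, rng=rng)
print(K_approx.shape)
```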


The complexity challenge, approximation of the kernel

Random feature selection

Many kernels may be represented, thanks to Bochner's theorem, as

$K(x, y) = \int_{\mathcal{W}} \varphi(w, x)\, \varphi(w, y)\, d\mu(w)$

(think of translation-invariant kernels and the Fourier transform).

We thus consider the low-rank approximation

$\hat{K}(x, y) = \frac{1}{d} \sum_{i=1}^{d} \varphi(x, w_i)\, \varphi(y, w_i), \qquad w_i \sim \mu,$

and use this approximation of the kernel in SGD.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 44 / 48
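For a Gaussian (translation-invariant) kernel, Bochner's theorem gives φ(w, x) = √2 cos(wᵀx + b), with w Gaussian and b uniform on [0, 2π]. Below is a minimal Python sketch of this construction; the function name and the numerical check are assumptions for illustration, and SGD can then be run on the d-dimensional features instead of in the RKHS.

```python
import numpy as np

def random_fourier_features(X, num_features, bandwidth, rng):
    """Features phi(x, w_i) such that (1/d) sum_i phi(x, w_i) phi(y, w_i) approximates
    the Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    n, dim = X.shape
    W = rng.normal(scale=1.0 / bandwidth, size=(dim, num_features))  # w_i drawn from mu
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0) * np.cos(X @ W + b)        # n x num_features matrix

# toy check: the approximate kernel matrix (1/d) Phi Phi^T is close to the exact one
rng = np.random.default_rng(5)
X = rng.normal(size=(6, 3))
Phi = random_fourier_features(X, num_features=2000, bandwidth=1.0, rng=rng)
K_approx = Phi @ Phi.T / Phi.shape[1]
K_exact = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2.0)
print(np.max(np.abs(K_approx - K_exact)))
```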



The complexity challenge, approximation of the kernel

Directions

What I am working on at the moment:

Random feature selection

Tuning the sampling to improve accuracy of the approximation

Acceleration + stochasticity (with Nicolas Flammarion).

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 45 / 48

The complexity challenge, approximation of the kernel

Some references I

Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR, abs/1411.0306.

Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368.

Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27, pages 3041-3049.

Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 46 / 48

The complexity challenge, approximation of the kernel

Some references II

Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21, pages 1313-1320.

Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nystrom computational regularization. CoRR, abs/1507.04717.

Shalev-Schwartz, S. and K., S. (2011). Theoretical basis for more data less work.

Shalev-Schwartz, S. and Srebro, N. (2008). SVM optimization: Inverse dependence on training set size. Proceedings of the International Conference on Machine Learning (ICML).

Tarres, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. ArXiv e-prints 1103.5538.

Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Foundations of Computational Mathematics, 5.

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 47 / 48

Thank you for your attention!

Aymeric Dieuleveut Stochastic optimization Hilbert spaces 48 / 48