An introduction to statistical learning theory, Fabrice Rossi, TELECOM ParisTech, November 2008


  • An introduction to statistical learning theory

    Fabrice Rossi

    TELECOM ParisTech

    November 2008

  • About this lecture

    Main goal
    Show how statistics enable a rigorous analysis of machine learning methods, which leads to a better understanding and to possible improvements of those methods

    Outline

    1. introduction to the formal model of statistical learning
    2. empirical risk and capacity measures
    3. capacity control
    4. beyond empirical risk minimization



  • Part I

    Introduction and formalization


  • Outline

    Machine learning in a nutshell

    Formalization
      Data and algorithms
      Consistency
      Risk/Loss optimization


  • Statistical Learning

    Statistical Learning = Machine learning + Statistics

    Machine learning in three steps:
    1. record observations about a phenomenon
    2. build a model of this phenomenon
    3. predict the future behavior of the phenomenon

    ... all of this is done by the computer (no human intervention)

    Statistics gives:
    - a formal definition of machine learning
    - some guarantees on its expected results
    - some suggestions for new or improved modelling tools

    "Nothing is more practical than a good theory" (Vladimir Vapnik, 1998)



  • Machine learning

    - a phenomenon is recorded via observations ⇒ (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z
    - two generic situations:
      1. unsupervised learning:
         - no predefined structure on Z
         - the goal is then to find some structure: clusters, association rules, distribution, etc.
      2. supervised learning:
         - two non-symmetric components in Z = X × Y
         - z = (x, y) ∈ X × Y
         - modelling: finding how x and y are related
         - the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon
    - this course focuses on supervised learning



  • Supervised learning

    - several types of Y can be considered:
      - Y = {1, ..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures
      - Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x
      - Y = "something complex but structured", e.g., construct a parse tree for a natural language sentence
    - modelling difficulty increases with the complexity of Y
    - Vapnik's message: "don't build a regression model to do classification"



  • Model quality

    - given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y
    - a "good" model should be such that

      ∀ 1 ≤ i ≤ n, g(x_i) ≈ y_i

      or not? Perfect interpolation is always possible, but is that what we are looking for?
    - No! We want generalization:
      - a good model must "learn" something "smart" from the data, not simply memorize it
      - on new data, it should be such that g(x_i) ≈ y_i (for i > n)



  • Machine learning objectives

    - main goal: from a dataset (x_i, y_i)_{1 ≤ i ≤ n}, build a model g which generalizes in a satisfactory way on new data
    - but also:
      - request only as much data as needed
      - do this efficiently (quickly, with a low memory usage)
      - gain knowledge on the underlying phenomenon (e.g., discard useless parts of x)
      - etc.
    - the statistical learning framework addresses some of those goals:
      - it provides asymptotic guarantees about learnability
      - it gives non-asymptotic confidence bounds around performance estimates
      - it suggests/validates best practices (hold-out estimates, regularization, etc.)




  • Data

    - the phenomenon is fully described by an unknown probability distribution P on X × Y
    - we are given n observations (see the sketch below):
      - D_n = (X_i, Y_i)_{i=1}^n
      - each pair (X_i, Y_i) is distributed according to P
      - pairs are independent
    - remarks:
      - the phenomenon is stationary, i.e., P does not change: new data will be distributed as the learning data
      - extensions to non-stationary situations exist, see e.g. covariate shift and concept drift
      - the independence assumption says that each observation brings new information
      - extensions to dependent observations exist, e.g., Markovian models for time series analysis


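  • Illustration: sampling i.i.d. observations (code sketch)

    A minimal Python sketch of the data model above: draw D_n = (X_i, Y_i), i = 1..n, i.i.d. from a synthetic distribution P. The Gaussian input and the noisy sine relationship are purely illustrative assumptions.

      import numpy as np

      def sample_phenomenon(n, rng):
          """Return an i.i.d. sample D_n = (X_i, Y_i), i = 1..n, from a toy P on X x Y."""
          x = rng.normal(loc=0.0, scale=1.0, size=n)            # X ~ N(0, 1)
          y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=n)   # Y = f(X) + noise
          return x, y

      rng = np.random.default_rng(0)
      x_train, y_train = sample_phenomenon(200, rng)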

  • Model quality (risk)

    - quality is measured via a cost function c: Y × Y → R+ (a low cost is good!)
    - the risk of a model g: X → Y is

      L(g) = E{c(g(X), Y)}

      This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good! see the sketch below)
    - interpretation:
      - the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g)
      - this is a measure of generalization capabilities (if g has been built from a dataset)
    - remark: we need a probability space, and c and g have to be measurable functions, etc.


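  • Illustration: Monte Carlo estimate of the risk (code sketch)

    Since L(g) = E{c(g(X), Y)} is an expectation under P, it can be approximated by averaging the cost over a large fresh sample whenever P can be simulated. A minimal sketch, reusing the illustrative sample_phenomenon generator defined above (a hypothetical helper, not a standard function):

      import numpy as np

      def quadratic_cost(pred, y):
          return (pred - y) ** 2

      def monte_carlo_risk(g, cost, n_test=100_000, seed=1):
          """Approximate L(g) = E{c(g(X), Y)} by an average over a large fresh sample."""
          rng = np.random.default_rng(seed)
          x, y = sample_phenomenon(n_test, rng)   # illustrative generator from the sketch above
          return float(np.mean(cost(g(x), y)))

      # the constant model g(x) = 0 has a larger risk than the "true" regression function
      print(monte_carlo_risk(lambda x: np.zeros_like(x), quadratic_cost))
      print(monte_carlo_risk(lambda x: np.sin(2.0 * x), quadratic_cost))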

  • We are frequentists


  • Cost functions

    - Regression (Y = R^q): (see the sketch below)
      - the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖²
      - other costs (Y = R):
        - L1 (Laplacian) cost: c(g(x), y) = |g(x) − y|
        - ε-insensitive cost: c_ε(g(x), y) = max(0, |g(x) − y| − ε)
        - Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ, and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise
    - Classification (Y = {1, ..., q}):
      - 0/1 cost: c(g(x), y) = 1{g(x) ≠ y}
      - then the risk is the probability of misclassification
    - Remark:
      - the cost is used to measure the quality of the model
      - the machine learning algorithm may use a loss ≠ cost to build the model
      - e.g.: the hinge loss for support vector machines
      - more on this in part IV


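  • Illustration: the cost functions in code (sketch)

    A minimal sketch of the costs listed above for Y = R, plus the 0/1 cost for classification; the NumPy helpers and default parameter values are illustrative choices.

      import numpy as np

      def quadratic(p, y):
          return (p - y) ** 2

      def laplacian(p, y):                        # L1 cost
          return np.abs(p - y)

      def eps_insensitive(p, y, eps=0.1):         # zero inside a tube of half-width eps
          return np.maximum(0.0, np.abs(p - y) - eps)

      def huber(p, y, delta=1.0):                 # quadratic near 0, linear in the tails
          r = np.abs(p - y)
          return np.where(r < delta, r ** 2, 2.0 * delta * (r - delta / 2.0))

      def zero_one(p, y):                         # classification: 1 when the prediction is wrong
          return (p != y).astype(float)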

  • Machine learning algorithm

    - given D_n, a machine learning algorithm/method builds a model g_{D_n} (written g_n for simplicity)
    - g_n is a random variable with values in the set of measurable functions from X to Y
    - the associated (random) risk is

      L(g_n) = E{c(g_n(X), Y) | D_n}

    - statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases:
      - keep in mind that L(g_n) is a random variable
      - if many datasets are generated from the phenomenon, the mean risk of the classifiers produced by the algorithm for those datasets is E{L(g_n)}



  • Consistency

    - best risk: L* = inf_g L(g)
    - the infimum is taken over all measurable functions from X to Y
    - a machine learning method is:
      - strongly consistent: if L(g_n) → L* almost surely
      - consistent: if E{L(g_n)} → L*
      - universally (strongly) consistent: if the convergence holds for any distribution P of the data
    - remark: if c is bounded, then lim_{n→∞} E{L(g_n)} = L* ⇔ L(g_n) → L* in probability



  • Interpretation

    - universality: no assumption on the data distribution (realistic)
    - consistency: "perfect" generalization when given an infinite learning set
    - strong case:
      - for (almost) any sequence of observations (x_i, y_i)_{i≥1}
      - and for any ε > 0
      - there is N such that for n ≥ N

        L(g_n) ≤ L* + ε

        where g_n is built on (x_i, y_i)_{1≤i≤n}
      - N depends on ε and on the dataset (x_i, y_i)_{i≥1}
    - "weak" case:
      - for any ε > 0
      - there is N such that for n ≥ N

        E{L(g_n)} ≤ L* + ε

        where g_n is built on (x_i, y_i)_{1≤i≤n}



  • Interpretation (continued)

    - We can be unlucky:
      - strong case:
        - L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ε
        - but what about

          (1/p) ∑_{i=n+1}^{n+p} c(g_n(x_i), y_i) ?

      - weak case:
        - E{L(g_n)} ≤ L* + ε
        - but what about L(g_n) for a particular dataset?
        - see also the strong case!
    - Stronger result, Probably Approximately Optimal (PAO) algorithm: ∀ ε, δ, ∃ n(ε, δ) such that ∀ n ≥ n(ε, δ)

      P{L(g_n) > L* + ε} < δ

    - Gives some convergence speed: especially interesting when n(ε, δ) is independent of P (distribution free)



  • From PAO to consistency

    - if convergence happens fast enough, then the method is (strongly) consistent
    - "weak" consistency is easy (for c bounded):
      - if the method is PAO, then for any fixed ε, P{L(g_n) > L* + ε} → 0 as n → ∞
      - by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability
      - then, if c is bounded, E{L(g_n)} → L* as n → ∞
    - if convergence happens faster, then the method might be strongly consistent
    - we need a very sharp decrease of P{L(g_n) > L* + ε} with n, i.e., a slow increase of n(ε, δ) when δ decreases



  • From PAO to strong consistency

    - this is based on the Borel-Cantelli lemma:
      - let (A_i)_{i≥1} be a sequence of events
      - define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often)
      - the lemma states that if ∑_i P{A_i} < ∞, then P{[A_i i.o.]} = 0
    - apply it to A_n = {L(g_n) > L* + ε}: if ∑_n P{L(g_n) > L* + ε} < ∞ for every ε > 0, then P{[{L(g_n) > L* + ε} i.o.]} = 0
    - we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j}
    - and therefore L(g_n) → L* almost surely



  • Summary

    - from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n
    - its risk is L(g_n) = E{c(g_n(X), Y) | D_n}
    - statistical learning studies L(g_n) − L* and E{L(g_n)} − L*
    - the preferred approach is to show a fast decrease of P{L(g_n) > L* + ε} with n
    - no hypothesis on the data (i.e., on P), a.k.a. distribution-free results


  • Classical statistics

    Even though we are frequentists...

    - this program is quite different from classical statistics:
      - no optimal parameters, no identifiability: intrinsically non-parametric
      - no optimal model: only optimal performances
      - no asymptotic distribution: focus on finite-distance inequalities
      - no (or minimal) assumptions on the data distribution
    - minimal hypotheses
      - are justified by our lack of knowledge on the studied phenomenon
      - have strong and annoying consequences, linked to the worst-case scenario they imply


  • Optimal model

    - there is nevertheless an optimal model
    - regression:
      - for the quadratic cost c(g(x), y) = ‖g(x) − y‖²
      - the minimum risk L* is reached by g(x) = E{Y | X = x}
    - classification:
      - we need the posterior probabilities P{Y = k | X = x}
      - the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class, arg max_k P{Y = k | X = x} (see the sketch below)
    - we only need to mimic the prediction of the optimal model, not the model itself:
      - for optimal classification, we don't need to compute the posterior probabilities
      - for the binary case, the decision is based on P{Y = −1 | X = x} − P{Y = 1 | X = x}
      - we only need to agree with the sign of E{Y | X = x}


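  • Illustration: the Bayes classifier on a synthetic problem (code sketch)

    A minimal sketch of the Bayes classifier for a binary problem where the posterior P{Y = 1 | X = x} is known by construction; the Gaussian class conditionals and the prior are illustrative assumptions.

      import numpy as np
      from scipy.stats import norm

      P1 = 0.4                                   # prior P{Y = 1}; P{Y = -1} = 1 - P1

      def posterior_y1(x):
          # P{Y = 1 | X = x} by Bayes' rule with Gaussian class-conditional densities
          f_plus = norm.pdf(x, loc=1.0, scale=1.0)
          f_minus = norm.pdf(x, loc=-1.0, scale=1.0)
          return P1 * f_plus / (P1 * f_plus + (1 - P1) * f_minus)

      def bayes_classifier(x):
          # assign x to the most probable class, i.e. follow the sign of P{Y=1|x} - P{Y=-1|x}
          return np.where(posterior_y1(x) > 0.5, 1, -1)

      print(bayes_classifier(np.array([-2.0, 0.0, 2.0])))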

  • No free lunch

    - unfortunately, making no hypothesis on P costs a lot
    - no free lunch results for binary classification and the misclassification cost
    - there is always a bad distribution:
      - given a fixed machine learning method
      - for all ε > 0 and all n, there is (X, Y) such that L* = 0 and

        E{L(g_n)} ≥ 1/2 − ε

    - arbitrarily slow convergence always happens:
      - given a fixed machine learning method
      - and a decreasing sequence (a_n) with limit 0 (and a_1 ≤ 1/16)
      - there is (X, Y) such that L* = 0 and

        E{L(g_n)} ≥ a_n



  • Estimation difficulties

    - in fact, estimating the Bayes risk L* (in binary classification) is difficult:
      - given a fixed estimation algorithm for L*, denoted L̂_n
      - for all ε > 0 and all n, there is (X, Y) such that

        E{|L̂_n − L*|} ≥ 1/4 − ε

    - regression is more difficult than classification:
      - let η_n be an estimator of η = E{Y | X} in a weak sense:

        lim_{n→∞} E{(η_n(X) − η(X))²} = 0

      - then for g_n(x) = 1{η_n(x) > 1/2}, we have

        lim_{n→∞} (E{L(g_n)} − L*) / √(E{(η_n(X) − η(X))²}) = 0



  • Consequences

    - no free lunch results show that we cannot get a uniform convergence speed:
      - they don't rule out universal (strong) consistency!
      - they don't rule out PAO with hypotheses on P
    - intuitive explanation:
      - stationarity and independence are not sufficient
      - if P is arbitrarily complex, then no generalization can occur from a finite learning set
      - e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses
    - but we don't want hypotheses on P...



  • Estimation and approximation

    - solution: split the problem into two parts, estimation and approximation
    - estimation:
      - restrict the class of acceptable models to G, a set of measurable functions from X to Y
      - study L(g_n) − inf_{g∈G} L(g) when the ML algorithm must choose g_n in G
      - statistics needed!
      - hypotheses on G replace hypotheses on P and allow uniform results!
    - approximation:
      - compare G to the set of all possible models
      - in other words, study inf_{g∈G} L(g) − L*
      - function approximation results (density), no statistics!



  • Back to machine learning algorithms

    - many ad hoc algorithms:
      - k nearest neighbors
      - tree-based algorithms
      - Adaboost
    - but most algorithms are based on the following scheme:
      - fix a model class G
      - choose g_n in G by optimizing a quality measure defined via D_n
    - examples:
      - linear regression (X is a Hilbert space, Y = R):
        - G = {x ↦ ⟨w, x⟩}
        - find w_n by minimizing the mean squared error: w_n = arg min_w (1/n) ∑_{i=1}^n (y_i − ⟨w, x_i⟩)² (see the sketch below)
      - linear classification (X = R^p, Y = {1, ..., c}):
        - G = {x ↦ arg max_k (Wx)_k}
        - find W, e.g., by minimizing a mean squared error for a disjunctive coding of the classes


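  • Illustration: linear regression by mean squared error minimization (code sketch)

    A minimal sketch of the linear regression example above: G = {x ↦ ⟨w, x⟩} and w_n minimizes the mean squared error on D_n, here through the standard least-squares solver; the synthetic data are an illustrative assumption.

      import numpy as np

      def fit_linear_regression(X, y):
          """w_n = arg min_w (1/n) * sum_i (y_i - ⟨w, x_i⟩)^2, i.e. ordinary least squares."""
          w, *_ = np.linalg.lstsq(X, y, rcond=None)
          return w

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 3))
      w_true = np.array([1.0, -2.0, 0.5])
      y = X @ w_true + rng.normal(scale=0.1, size=200)
      print(fit_linear_regression(X, y))          # close to w_true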

  • Criterion

    - the best solution would be to choose g_n by minimizing L(g_n) over G, but:
      - we cannot compute L(g_n) (P is unknown!)
      - even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance)
    - partial solution, the empirical risk (see the sketch below):

      L_n(g) = (1/n) ∑_{i=1}^n c(g(X_i), Y_i)

    - naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_{n→∞} L_n(g) = L(g) almost surely
    - the algorithmic issue is handled by replacing the empirical risk by an empirical loss (see part IV)


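  • Illustration: the empirical risk in code (sketch)

    The empirical risk L_n(g) is simply the average cost over the dataset. A minimal sketch, usable with any of the illustrative cost helpers defined earlier:

      import numpy as np

      def empirical_risk(g, cost, X, y):
          """L_n(g) = (1/n) * sum_i c(g(X_i), Y_i)."""
          return float(np.mean(cost(g(X), y)))

      # e.g., with the quadratic cost and a linear model g(x) = ⟨w, x⟩:
      # empirical_risk(lambda X: X @ w, lambda p, y: (p - y) ** 2, X, y)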

  • Empirical risk minimization

    - generic machine learning method (see the sketch below):
      - fix a model class G
      - choose g*_n in G by

        g*_n = arg min_{g∈G} L_n(g)

    - denote L*_G = inf_{g∈G} L(g)
    - does that work?
      - E{L(g*_n)} → L*_G?
      - E{L(g*_n)} → L*?
      - the SLLN does not help, as g*_n depends on n
    - we are looking for bounds:
      - L(g*_n) < L_n(g*_n) + B (bound the risk using the empirical risk)
      - L(g*_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class)
      - L*_G − L* is handled differently (approximation vs estimation)


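  • Illustration: ERM over a small finite class (code sketch)

    A minimal sketch of empirical risk minimization when G is a deliberately small, finite class of threshold classifiers x ↦ sign(x − t); the grid of thresholds is an illustrative choice, and empirical_risk / zero_one are the hypothetical helpers from the sketches above.

      import numpy as np

      thresholds = np.linspace(-2.0, 2.0, 81)     # G: one classifier per threshold t

      def erm(X, y):
          """Return g*_n = arg min over G of L_n(g), found by exhaustive search over G."""
          risks = [empirical_risk(lambda x, t=t: np.where(x > t, 1, -1), zero_one, X, y)
                   for t in thresholds]
          best_t = thresholds[int(np.argmin(risks))]
          return lambda x: np.where(x > best_t, 1, -1)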

  • Complexity control
    The overfitting issue

    - controlling L_n(g*_n) − L(g*_n) is not possible in arbitrary classes G
    - a simple example (see the sketch below):
      - binary classification
      - G: all piecewise functions on arbitrary partitions of X
      - obviously L_n(g*_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule)
      - but L(g*_n) > 0 when L* > 0!
      - the 1-nn rule is overfitting
    - the only solution is to reduce the "size" of G (see part II)
    - this implies L*_G > L* (for arbitrary P)
    - how to reach L*? (will be studied in part III)
    - in other words, how to balance estimation error and approximation error


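  • Illustration: the 1-nn rule overfits (code sketch)

    A minimal numerical illustration of the example above: on a noisy binary problem (labels flipped with probability 0.2, so L* = 0.2), the 1-nearest-neighbour rule reaches zero empirical risk but its test error stays well above L*. The data-generating choices are illustrative assumptions.

      import numpy as np

      rng = np.random.default_rng(0)

      def sample_noisy(n):
          # binary problem with L* = 0.2: the "clean" label sign(x) is flipped with probability 0.2
          x = rng.uniform(-1.0, 1.0, size=(n, 1))
          y = np.where(x[:, 0] > 0.0, 1, -1)
          flip = rng.random(n) < 0.2
          return x, np.where(flip, -y, y)

      def one_nn_predict(x_train, y_train, x_query):
          # 1-nearest-neighbour rule: copy the label of the closest training point
          d = np.abs(x_query[:, 0][:, None] - x_train[:, 0][None, :])
          return y_train[np.argmin(d, axis=1)]

      x_tr, y_tr = sample_noisy(200)
      x_te, y_te = sample_noisy(10_000)
      print(np.mean(one_nn_predict(x_tr, y_tr, x_tr) != y_tr))   # empirical risk: 0.0
      print(np.mean(one_nn_predict(x_tr, y_tr, x_te) != y_te))   # test error: noticeably above 0.2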

  • Increasing complexity

    - key idea: choose g_n in G_n, i.e., in a class that depends on the data size
    - estimation part:
      - statistics
      - make sure that L_n(g*_n) − L(g*_n) is under control when g*_n = arg min_{g∈G_n} L_n(g)
    - approximation part:
      - functional analysis
      - make sure that lim_{n→∞} L*_{G_n} = L*
      - related to density arguments (universal approximation)
    - this is the simplest approach but also the least realistic: G_n depends only on the data size



  • Structural risk minimization

    - key idea: again use more and more powerful classes G_d, but also compare models between classes
    - more precisely (see the sketch below):
      - compute g*_{n,d} by empirical risk minimization in G_d
      - choose g*_n by minimizing over d

        L_n(g*_{n,d}) + r(n, d)

    - r(n, d) is a penalty term:
      - it decreases with n and increases with the "complexity" of G_d, to favor models for which the risk is correctly estimated by the empirical risk
      - it corresponds to a complexity measure for G_d
    - more interesting than the increasing complexity solution, but rather unrealistic from a practical point of view because of the tremendous computational cost


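  • Illustration: structural risk minimization over polynomial degrees (code sketch)

    A minimal sketch of the scheme above with nested classes G_d = {polynomials of degree ≤ d}: fit g*_{n,d} by least squares in each class, then pick the degree minimizing the penalized empirical risk. The penalty r(n, d) = d/n used here is a placeholder assumption, not a prescribed choice.

      import numpy as np

      def fit_poly(x, y, d):
          """ERM in G_d: least-squares polynomial of degree d."""
          return np.polyfit(x, y, deg=d)

      def empirical_mse(coeffs, x, y):
          return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

      def srm_select(x, y, degrees=range(0, 11), penalty=lambda n, d: d / n):
          """Choose d minimizing L_n(g*_{n,d}) + r(n, d), then return (d, g*_{n,d})."""
          n = len(x)
          scores = {d: empirical_mse(fit_poly(x, y, d), x, y) + penalty(n, d)
                    for d in degrees}
          best_d = min(scores, key=scores.get)
          return best_d, fit_poly(x, y, best_d)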

  • Regularization

    - key idea: use a large class, but optimize a penalized version of the empirical risk (see part IV, and the sketch below)
    - more precisely:
      - choose G such that L*_G = L*
      - define g*_n by

        g*_n = arg min_{g∈G} (L_n(g) + λ_n R(g))

    - R(g) measures the complexity of the model
    - the trade-off between the risk and the complexity, λ_n, must be carefully chosen
    - this is the (or one of the) best solution(s)



  • Summary

    - from n i.i.d. observations Dn = (Xi, Yi), i = 1, ..., n, a ML algorithm builds a model gn
    - a good ML algorithm builds a universally strongly consistent model, i.e. for all P,

        L(gn) → L*  almost surely

    - a very general class of ML algorithms is based on empirical risk minimization, i.e.

        g*_n = arg min_{g∈G} (1/n) Σ_{i=1}^n c(g(Xi), Yi)

    - consistency is obtained by controlling the estimation error and the approximation error
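
    A minimal sketch of the ERM principle on a deliberately simple finite class (threshold classifiers on a grid, with the misclassification cost); the class and the data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# binary classification data: Y = I{X >= 0.6} with 10% label noise
n = 500
X = rng.uniform(0, 1, n)
Y = (X >= 0.6).astype(int)
flip = rng.uniform(size=n) < 0.1
Y[flip] = 1 - Y[flip]

# finite model class G: threshold classifiers g_t(x) = I{x >= t}, t on a grid
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    """L_n(g_t) with the misclassification cost c(g(x), y) = I{g(x) != y}."""
    return np.mean((X >= t).astype(int) != Y)

# empirical risk minimization over G
risks = np.array([empirical_risk(t) for t in thresholds])
t_star = thresholds[np.argmin(risks)]
print("ERM threshold:", t_star, "empirical risk:", risks.min())
```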


  • Outline

    - part II: empirical risk and capacity measures
    - part III: capacity control
    - part IV: empirical loss and regularization
    - not in this lecture (see the references):
      - ad hoc methods such as k nearest neighbours and trees
      - advanced topics such as noise conditions and data dependent bounds
      - clustering
      - the Bayesian point of view
      - and many other things...


  • Part II

    Empirical risk and capacity measures


  • Outline

    Concentration
      Hoeffding's inequality
      Uniform bounds

    Vapnik-Chervonenkis dimension
      Definition
      Application to classification
      Proof

    Covering numbers
      Definition and results
      Computing covering numbers

    Summary


  • Empirical risk

    - empirical risk minimisation (ERM) relies on the link between Ln(g) and L(g)
    - the law of large numbers is too limited:
      - asymptotic only
      - g must be fixed
    - we need:
      1. quantitative, finite distance results, i.e., PAC-like bounds on

           P{ |Ln(g) − L(g)| > ε }

      2. uniform results, i.e., bounds on

           P{ sup_{g∈G} |Ln(g) − L(g)| > ε }


  • Concentration inequalities

    - in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation

        Ln(g) − L(g) = (1/n) Σ_{i=1}^n c(g(Xi), Yi) − E{ c(g(X), Y) }

    - some standard bounds:
      - Markov: P{ |X| ≥ t } ≤ E{|X|} / t
      - Bienaymé-Chebyshev: P{ |X − E{X}| ≥ t } ≤ Var(X) / t²
      - BC gives, for n independent real valued random variables Xi and Sn = Σ_{i=1}^n Xi:

          P{ |Sn − E{Sn}| ≥ t } ≤ ( Σ_{i=1}^n Var(Xi) ) / t²
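
    A quick numerical illustration of how loose the Bienaymé-Chebyshev bound can be for a sum of bounded variables; the Bernoulli(1/2) choice and the values of n and t are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# S_n = sum of n independent Bernoulli(1/2) variables (so Var(X_i) = 1/4)
n, t = 100, 15
n_trials = 100_000

S = rng.binomial(1, 0.5, size=(n_trials, n)).sum(axis=1)
deviation_prob = np.mean(np.abs(S - n / 2) >= t)

chebyshev_bound = (n / 4) / t ** 2  # sum of variances / t^2

print("Monte Carlo estimate of P{|S_n - E S_n| >= t}:", deviation_prob)
print("Bienaymé-Chebyshev bound:                     ", chebyshev_bound)
```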


  • Hoeffding's inequality (Hoeffding, 1963)

    hypothesis

    - X1, ..., Xn: n independent r.v.
    - Xi ∈ [ai, bi]
    - Sn = Σ_{i=1}^n Xi

    result

        P{ Sn − E{Sn} ≥ ε } ≤ exp( −2ε² / Σ_{i=1}^n (bi − ai)² )
        P{ Sn − E{Sn} ≤ −ε } ≤ exp( −2ε² / Σ_{i=1}^n (bi − ai)² )

    A quantitative, distribution free, finite distance law of large numbers!


  • Direct application

    - with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as
      - a bounded Y in regression
      - or the misclassification cost in binary classification: [a, b] = [0, 1]
    - apply Hoeffding to Ui = (1/n) c(g(Xi), Yi), with Ui ∈ [a/n, b/n]
    - warning: g cannot depend on the (Xi, Yi) (or the Ui are no longer independent)
    - Σ_{i=1}^n (bi − ai)² = (b − a)²/n, and therefore

        P{ |Ln(g) − L(g)| ≥ ε } ≤ 2 exp( −2nε² / (b − a)² )

    - then Ln(g) → L(g) in probability and Ln(g) → L(g) almost surely (Borel-Cantelli)
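
    The bound can be compared with a Monte Carlo estimate of the deviation probability for one fixed classifier (so that the Ui stay independent); the data distribution and the classifier below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# fixed classifier g(x) = I{x >= 0.5} on X ~ U[0, 1], Y = I{X >= 0.4}
# misclassification cost in [0, 1]; true risk L(g) = P{0.4 <= X < 0.5} = 0.1
n, eps = 200, 0.05
L_true = 0.1

def empirical_risk(rng):
    X = rng.uniform(0, 1, n)
    Y = (X >= 0.4).astype(int)
    pred = (X >= 0.5).astype(int)
    return np.mean(pred != Y)

deviations = np.array([abs(empirical_risk(rng) - L_true) for _ in range(20_000)])
print("Monte Carlo estimate of P{|L_n(g) - L(g)| >= eps}:", np.mean(deviations >= eps))
print("Hoeffding bound 2*exp(-2*n*eps^2):                ", 2 * np.exp(-2 * n * eps ** 2))
```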


  • Limitations

    - g cannot depend on the (Xi, Yi) (independent test set)
    - intuitively:
      - set δ = 2 exp( −2nε² / (b − a)² )
      - for each g, the probability to draw from P a Dn on which

          |Ln(g) − L(g)| ≤ (b − a) sqrt( log(2/δ) / (2n) )

        is at least 1 − δ
      - but the sets of "correct" Dn differ for each g
      - conversely, for a fixed Dn, the bound is valid only for some of the models
    - this cannot be used to study g*_n = arg min_{g∈G} Ln(g)


  • The test set

    - Hoeffding's inequality justifies the test set approach (hold-out estimate)
    - given a dataset Dm:
      - split the dataset into two disjoint sets: a learning set Dn and a test set D'p (with p + n = m)
      - use a ML algorithm to build gn using only Dn
      - claim that L(gn) ≃ L'p(gn) = (1/p) Σ_{i=n+1}^m c(gn(Xi), Yi)
    - in fact, with probability at least 1 − δ,

        L(gn) ≤ L'p(gn) + (b − a) sqrt( log(1/δ) / (2p) )

    - example:
      - classification (a = 0, b = 1), with a "good" classifier: L'p(gn) = 0.05
      - to be 99% sure that L(gn) ≤ 0.06, we need p = 23000 (!!!)
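
    The p ≈ 23000 figure follows directly from the bound: solving (b − a) sqrt(log(1/δ)/(2p)) ≤ ε for p gives the minimal test set size. A minimal sketch:

```python
import math

def min_test_size(eps, delta, a=0.0, b=1.0):
    """Smallest p such that (b - a) * sqrt(log(1/delta) / (2p)) <= eps."""
    return math.ceil((b - a) ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))

# to be 99% sure that L(g_n) <= L'_p(g_n) + 0.01 in classification ([a, b] = [0, 1])
print(min_test_size(eps=0.01, delta=0.01))   # about 23000 (23026)
```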


  • Proof of Hoeffding's inequality

    - Chernoff's bounding technique (an application of Markov's inequality):

        P{ X ≥ t } = P{ e^{sX} ≥ e^{st} } ≤ E{ e^{sX} } / e^{st}

      the bound is controlled via s
    - lemma: if E{X} = 0 and X ∈ [a, b], then for all s,

        E{ e^{sX} } ≤ e^{s²(b−a)²/8}

      - the exponential is convex ⇒ E{ e^{sX} } ≤ e^{φ(s(b−a))}
      - then we bound φ
    - this is used for X = Sn − E{Sn}


  • Proof of Hoeffding's inequality (continued)

        P{ Sn − E{Sn} ≥ ε } ≤ e^{−sε} E{ e^{s Σ_{i=1}^n (Xi − E{Xi})} }
                            = e^{−sε} Π_{i=1}^n E{ e^{s(Xi − E{Xi})} }     (independence)
                            ≤ e^{−sε} Π_{i=1}^n e^{s²(bi − ai)²/8}          (lemma)
                            = e^{−sε} e^{s² Σ_{i=1}^n (bi − ai)²/8}
                            = e^{−2ε²/Σ_{i=1}^n (bi − ai)²}                 (minimization in s)

    with s = 4ε / Σ_{i=1}^n (bi − ai)²
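
    The lemma used above can be sanity-checked numerically: for a centered bounded variable, a Monte Carlo estimate of E{e^{sX}} is compared with the bound e^{s²(b−a)²/8}; the uniform distribution and the values of s are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# centered bounded variable: X = U - 1/2 with U ~ U[0, 1], so X in [a, b] = [-1/2, 1/2]
a, b = -0.5, 0.5
X = rng.uniform(0, 1, 1_000_000) - 0.5

for s in (0.5, 1.0, 2.0, 5.0):
    mgf = np.mean(np.exp(s * X))                  # Monte Carlo estimate of E{e^{sX}}
    bound = np.exp(s ** 2 * (b - a) ** 2 / 8)     # Hoeffding's lemma bound
    print(f"s = {s}: E[exp(sX)] ~ {mgf:.4f} <= bound {bound:.4f}")
```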


  • Remarks

    - Hoeffding's inequality is useful in many other contexts:
      - large-scale machine learning (e.g., [Domingos and Hulten, 2001]):
        - enormous datasets (e.g., that cannot be loaded in memory)
        - process growing subsets until the model quality is known with sufficient confidence
        - monitor drift via confidence bounds
      - regret in game theory (e.g., [Cesa-Bianchi and Lugosi, 2006]):
        - define strategies by minimizing a regret measure
        - get bounds on the regret estimates (e.g., trade-off between exploration and exploitation in multi-armed bandits)
    - many improvements/complements are available, such as:
      - Bernstein's inequality
      - McDiarmid's bounded difference inequality


  • Uniform bounds

    - we deal with the estimation error:
      - given the ERM choice g*_n = arg min_{g∈G} Ln(g),
      - what about L(g*_n) − inf_{g∈G} L(g)?
    - we need uniform bounds:

        |Ln(g*_n) − L(g*_n)| ≤ sup_{g∈G} |Ln(g) − L(g)|

        L(g*_n) − inf_{g∈G} L(g) = L(g*_n) − Ln(g*_n) + Ln(g*_n) − inf_{g∈G} L(g)
                                 ≤ L(g*_n) − Ln(g*_n) + sup_{g∈G} |Ln(g) − L(g)|
                                 ≤ 2 sup_{g∈G} |Ln(g) − L(g)|

    - therefore uniform bounds answer both questions: the quality of the empirical risk as an estimate of the risk, and the quality of the model selected by ERM


  • Finite model class

    - union bound: P{ A or B } ≤ P{A} + P{B}, and therefore

        P{ max_{1≤i≤m} Ui ≥ ε } ≤ Σ_{i=1}^m P{ Ui ≥ ε }

    - then, when G is finite,

        P{ sup_{g∈G} |Ln(g) − L(g)| ≥ ε } ≤ 2|G| exp( −2nε² / (b − a)² )

      so L(g*_n) → inf_{g∈G} L(g) almost surely (but in general inf_{g∈G} L(g) > L*)
    - a quantitative, distribution free, uniform, finite distance law of large numbers!


  • Bound versus size

    with probability at least 1 − δ,

        L(g*_n) ≤ inf_{g∈G} L(g) + 2(b − a) sqrt( (log|G| + log(2/δ)) / (2n) )

    - inf_{g∈G} L(g) decreases with |G| (more models)
    - but the bound increases with |G|: trade-off between flexibility and estimation quality
    - remark: log|G| is (up to the base of the logarithm) the number of bits needed to encode the model choice; this is a capacity measure
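
    A quick numerical look at the trade-off: the sketch below evaluates the deviation term 2(b − a) sqrt((log|G| + log(2/δ))/(2n)) for a few class sizes and sample sizes (all values are arbitrary assumptions).

```python
import numpy as np

def estimation_term(card_G, n, delta=0.05, a=0.0, b=1.0):
    """Deviation term of the bound: 2(b - a) * sqrt((log|G| + log(2/delta)) / (2n))."""
    return 2 * (b - a) * np.sqrt((np.log(card_G) + np.log(2.0 / delta)) / (2.0 * n))

for card_G in (10, 1_000, 1_000_000):
    for n in (100, 10_000):
        print(f"|G| = {card_G:>9}, n = {n:>6}: "
              f"estimation term = {estimation_term(card_G, n):.3f}")
```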


  • Infinite model class

    - so far we have uniform bounds only when G is finite:
      - even simple classes such as Glin = {x ↦ arg max_k (Wx)_k} are infinite
      - in addition, inf_{g∈Glin} L(g) > 0 in general, even when L* = 0 (nonlinear problems)
    - solution: evaluate the capacity of a class rather than its size
    - first step:
      - binary classification (the simplest case)
      - a model is then a measurable set A ⊂ X
      - the capacity of G measures the variability of the available shapes
      - it is evaluated with respect to the data: if A ∩ Dn = B ∩ Dn, then A and B are considered identical


  • Growth function

    - abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words, a set of (indicator functions of) measurable subsets of R^d)
    - given z1 ∈ R^d, ..., zn ∈ R^d, we define

        F_{z1,...,zn} = { u ∈ {0,1}^n | ∃ f ∈ F, u = (f(z1), ..., f(zn)) }

    - interpretation: each u describes a binary partition of z1, ..., zn, and F_{z1,...,zn} is then the set of partitions implemented by F
    - growth function:

        S_F(n) = sup_{z1∈R^d, ..., zn∈R^d} |F_{z1,...,zn}|

    - interpretation: maximal number of binary partitions implementable via F (distribution free ⇒ worst case analysis)


  • Vapnik-Chervonenkis dimension

    - S_F(n) ≤ 2^n
    - vocabulary:
      - S_F(n) is the n-th shatter coefficient of F
      - if |F_{z1,...,zn}| = 2^n, F shatters z1, ..., zn
    - Vapnik-Chervonenkis dimension:

        VCdim(F) = sup{ n ∈ N+ | S_F(n) = 2^n }

    - interpretation:
      - if S_F(n) < 2^n:
        - for all z1, ..., zn, there is a partition u ∈ {0,1}^n such that u ∉ F_{z1,...,zn}
        - for any dataset, F can fail to reach L* = 0 (for some labelling of the points)
      - if S_F(n) = 2^n: there is at least one set z1, ..., zn shattered by F


  • Example

    - points from R²
    - F = { I{ax+by+c ≥ 0} } (half-planes)
    - S_F(3) = 2³, even if sometimes |F_{z1,z2,z3}| < 2³ (e.g. for three aligned points)
    - S_F(4) < 2⁴
    - VCdim(F) = 3
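
    These claims can be checked by brute force, under the (standard but here assumed) remark that a labelling is realized by some I{ax+by+c ≥ 0} as soon as the two label groups are strictly linearly separable; for each labelling, a small feasibility linear program tests separability.

```python
import numpy as np
from scipy.optimize import linprog
from itertools import product

def separable(points, labels):
    """Is there (a, b, c) with sign(a*x + b*y + c) matching labels in {-1, +1}?
    Feasibility LP: labels_i * (a*x_i + b*y_i + c) >= 1 for all i."""
    pts = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # variables: (a, b, c); constraints rewritten as -y_i*(a*x_i + b*y_i + c) <= -1
    A_ub = -y[:, None] * np.hstack([pts, np.ones((len(pts), 1))])
    b_ub = -np.ones(len(pts))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0

def n_realizable_labellings(points):
    count = 0
    for labels in product([-1, 1], repeat=len(points)):
        if separable(points, labels):
            count += 1
    return count

three_points = [(0, 0), (1, 0), (0, 1)]            # 3 points in general position
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]     # 4 points (square)

print("3 points:", n_realizable_labellings(three_points), "labellings out of", 2 ** 3)
print("4 points:", n_realizable_labellings(four_points), "labellings out of", 2 ** 4)
```

    For three points in general position all 8 labellings are found; for the four corners of a square only 14 of the 16 labellings are realizable (the two "XOR" patterns are not), matching S_F(4) < 2⁴.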


  • Linear models

    result
    if G is a p-dimensional vector space of real valued functions defined on R^d, then

        VCdim( { f : R^d → {0,1} | ∃ g ∈ G, f(z) = I{g(z) ≥ 0} } ) ≤ p

    proof

    - given z1, ..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z1), ..., g(z_{p+1}))
    - as dim(F(G)) ≤ p, there is a non-zero γ = (γ1, ..., γ_{p+1}) such that Σ_{i=1}^{p+1} γi g(zi) = 0 for all g ∈ G
    - if z1, ..., z_{p+1} were shattered, there would also be gj such that gj(zi) = δij, and therefore γj = 0 for all j
    - consequently, no z1, ..., z_{p+1} can be shattered


  • VC dimension ≠ parameter number

    - in the linear case F = { f(z) = I{Σ_{i=1}^p wi φi(z) ≥ 0} }, VCdim(F) = p
    - but in general:
      - VCdim( { f(z) = I{sin(tz) ≥ 0} } ) = ∞ (a single parameter t)
      - one hidden layer perceptron:

          G = { g(z) = T( β0 + Σ_{k=1}^h βk T( αk0 + Σ_{j=1}^d αkj zj ) ) }

        with T(a) = I{a ≥ 0}; then VCdim(G) ≥ W log2(h/4) / 32, where W = dh + 2h + 1 is the number of parameters
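
    The sin example can be illustrated numerically with the classical construction (stated here as an assumption rather than proved): for points zi = 10^{-i} and any target labels yi ∈ {0, 1}, the frequency t = π(1 + Σ_i (1 − yi) 10^i) makes I{sin(t zi) ≥ 0} reproduce the labels, so arbitrarily large sets can be shattered with a single real parameter.

```python
import numpy as np
from itertools import product

def frequency_for_labels(labels):
    """Classical construction: t = pi * (1 + sum_i (1 - y_i) * 10**i) for points z_i = 10**(-i)."""
    return np.pi * (1 + sum((1 - y) * 10 ** i for i, y in enumerate(labels, start=1)))

n = 5
z = np.array([10.0 ** (-i) for i in range(1, n + 1)])

# check that every labelling of z_1, ..., z_n is realized by some f(z) = I{sin(t z) >= 0}
all_realized = True
for labels in product([0, 1], repeat=n):
    t = frequency_for_labels(labels)
    predicted = (np.sin(t * z) >= 0).astype(int)
    all_realized &= np.array_equal(predicted, np.array(labels))

print(f"all {2 ** n} labellings of {n} points realized:", all_realized)
```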


  • Snake oil warning

    - you might read here and there things like "the VC-dimension of the class of separating hyperplanes with margin larger than 1/δ is bounded by bla bla bla"
    - this is plain wrong!
      - the class G is data independent
      - shattering does not take a margin into account
      - asking for the "VC-dimension of the class of separating hyperplanes with margin larger than 1/δ" is nonsense
    - there is an extension of the VC dimension