An introduction to statistical learning theory, Fabrice Rossi, TELECOM ParisTech, November 2008


  • An introduction to statistical learning theory

    Fabrice Rossi

    TELECOM ParisTech

    November 2008

  • About this lecture

    Main goal
    Show how statistics enable a rigorous analysis of machine learning methods, which leads to a better understanding and to possible improvements of those methods

    Outline

    1. introduction to the formal model of statistical learning
    2. empirical risk and capacity measures
    3. capacity control
    4. beyond empirical risk minimization



  • Part I

    Introduction and formalization


  • Outline

    Machine learning in a nutshell

    Formalization
      Data and algorithms
      Consistency
      Risk/Loss optimization


  • Statistical Learning

    Statistical Learning = Machine learning + Statistics

    Machine learning in three steps:
    1. record observations about a phenomenon
    2. build a model of this phenomenon
    3. predict the future behavior of the phenomenon

    ... all of this is done by the computer (no human intervention)

    Statistics gives:
    - a formal definition of machine learning
    - some guarantees on its expected results
    - some suggestions for new or improved modelling tools

    "Nothing is more practical than a good theory" (Vladimir Vapnik, 1998)



  • Machine learning

    - a phenomenon is recorded via observations ⇒ (z_i)_{1 ≤ i ≤ n} with z_i ∈ Z
    - two generic situations:
      1. unsupervised learning:
         - no predefined structure on Z
         - the goal is then to find some structure: clusters, association rules, distribution, etc.
      2. supervised learning:
         - two non-symmetric components in Z = X × Y
         - z = (x, y) ∈ X × Y
         - modelling: finding how x and y are related
         - the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon
    - this course focuses on supervised learning



  • Supervised learning

    - several types of Y can be considered:
      - Y = {1, ..., q}: classification in q classes, e.g., tell whether a patient has hyper, hypo or normal thyroidal function based on some clinical measures
      - Y = R^q: regression, e.g., calculate the price of a house y based on some characteristics x
      - Y = "something complex but structured", e.g., construct a parse tree for a natural language sentence
    - modelling difficulty increases with the complexity of Y
    - Vapnik's message: "don't build a regression model to do classification"



  • Model quality

    - given a dataset (x_i, y_i)_{1 ≤ i ≤ n}, a machine learning method builds a model g from X to Y
    - a "good" model should be such that

      ∀ 1 ≤ i ≤ n, g(x_i) ≈ y_i

      or not? Perfect interpolation is always possible, but is that what we are looking for?
    - No! We want generalization:
      - a good model must "learn" something "smart" from the data, not simply memorize it
      - on new data, it should be such that g(x_i) ≈ y_i (for i > n)



  • Machine learning objectives

    - main goal: from a dataset (x_i, y_i)_{1 ≤ i ≤ n}, build a model g which generalizes in a satisfactory way on new data
    - but also:
      - request only as much data as needed
      - do this efficiently (quickly, with a low memory usage)
      - gain knowledge on the underlying phenomenon (e.g., discard useless parts of x)
      - etc.
    - the statistical learning framework addresses some of those goals:
      - it provides asymptotic guarantees about learnability
      - it gives non-asymptotic confidence bounds around performance estimates
      - it suggests/validates best practices (hold-out estimates, regularization, etc.)




  • Data

    - the phenomenon is fully described by an unknown probability distribution P on X × Y
    - we are given n observations (see the sketch below):
      - D_n = (X_i, Y_i)_{i=1}^n
      - each pair (X_i, Y_i) is distributed according to P
      - pairs are independent
    - remarks:
      - the phenomenon is stationary, i.e., P does not change: new data will be distributed as the learning data
      - extensions to non-stationary situations exist, see e.g. covariate shift and concept drift
      - the independence assumption says that each observation brings new information
      - extensions to dependent observations exist, e.g., Markovian models for time series analysis


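  • Illustration: sampling i.i.d. observations (code sketch)

    A minimal Python sketch of the data model above: draw D_n = (X_i, Y_i), i = 1..n, i.i.d. from a synthetic distribution P. The Gaussian input and the noisy sine relationship are purely illustrative assumptions.

      import numpy as np

      def sample_phenomenon(n, rng):
          """Return an i.i.d. sample D_n = (X_i, Y_i), i = 1..n, from a toy P on X x Y."""
          x = rng.normal(loc=0.0, scale=1.0, size=n)            # X ~ N(0, 1)
          y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=n)   # Y = f(X) + noise
          return x, y

      rng = np.random.default_rng(0)
      x_train, y_train = sample_phenomenon(200, rng)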

  • Model quality (risk)

    - quality is measured via a cost function c: Y × Y → R+ (a low cost is good!)
    - the risk of a model g: X → Y is

      L(g) = E{c(g(X), Y)}

      This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good! see the sketch below)
    - interpretation:
      - the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g)
      - this is a measure of generalization capabilities (if g has been built from a dataset)
    - remark: we need a probability space, and c and g have to be measurable functions, etc.


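  • Illustration: Monte Carlo estimate of the risk (code sketch)

    Since L(g) = E{c(g(X), Y)} is an expectation under P, it can be approximated by averaging the cost over a large fresh sample whenever P can be simulated. A minimal sketch, reusing the illustrative sample_phenomenon generator defined above (a hypothetical helper, not a standard function):

      import numpy as np

      def quadratic_cost(pred, y):
          return (pred - y) ** 2

      def monte_carlo_risk(g, cost, n_test=100_000, seed=1):
          """Approximate L(g) = E{c(g(X), Y)} by an average over a large fresh sample."""
          rng = np.random.default_rng(seed)
          x, y = sample_phenomenon(n_test, rng)   # illustrative generator from the sketch above
          return float(np.mean(cost(g(x), y)))

      # the constant model g(x) = 0 has a larger risk than the "true" regression function
      print(monte_carlo_risk(lambda x: np.zeros_like(x), quadratic_cost))
      print(monte_carlo_risk(lambda x: np.sin(2.0 * x), quadratic_cost))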

  • We are frequentists


  • Cost functions

    - Regression (Y = R^q): (see the sketch below)
      - the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖²
      - other costs (Y = R):
        - L1 (Laplacian) cost: c(g(x), y) = |g(x) − y|
        - ε-insensitive cost: c_ε(g(x), y) = max(0, |g(x) − y| − ε)
        - Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ, and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise
    - Classification (Y = {1, ..., q}):
      - 0/1 cost: c(g(x), y) = 1{g(x) ≠ y}
      - then the risk is the probability of misclassification
    - Remark:
      - the cost is used to measure the quality of the model
      - the machine learning algorithm may use a loss ≠ cost to build the model
      - e.g.: the hinge loss for support vector machines
      - more on this in part IV


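  • Illustration: the cost functions in code (sketch)

    A minimal sketch of the costs listed above for Y = R, plus the 0/1 cost for classification; the NumPy helpers and default parameter values are illustrative choices.

      import numpy as np

      def quadratic(p, y):
          return (p - y) ** 2

      def laplacian(p, y):                        # L1 cost
          return np.abs(p - y)

      def eps_insensitive(p, y, eps=0.1):         # zero inside a tube of half-width eps
          return np.maximum(0.0, np.abs(p - y) - eps)

      def huber(p, y, delta=1.0):                 # quadratic near 0, linear in the tails
          r = np.abs(p - y)
          return np.where(r < delta, r ** 2, 2.0 * delta * (r - delta / 2.0))

      def zero_one(p, y):                         # classification: 1 when the prediction is wrong
          return (p != y).astype(float)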

  • Machine learning algorithm

    - given D_n, a machine learning algorithm/method builds a model g_{D_n} (written g_n for simplicity)
    - g_n is a random variable with values in the set of measurable functions from X to Y
    - the associated (random) risk is

      L(g_n) = E{c(g_n(X), Y) | D_n}

    - statistical learning studies the behavior of L(g_n) and E{L(g_n)} when n increases:
      - keep in mind that L(g_n) is a random variable
      - if many datasets are generated from the phenomenon, the mean risk of the classifiers produced by the algorithm for those datasets is E{L(g_n)}



  • Consistency

    - best risk: L* = inf_g L(g)
    - the infimum is taken over all measurable functions from X to Y
    - a machine learning method is:
      - strongly consistent: if L(g_n) → L* almost surely
      - consistent: if E{L(g_n)} → L*
      - universally (strongly) consistent: if the convergence holds for any distribution P of the data
    - remark: if c is bounded, then lim_{n→∞} E{L(g_n)} = L* ⇔ L(g_n) → L* in probability



  • Interpretation

    - universality: no assumption on the data distribution (realistic)
    - consistency: "perfect" generalization when given an infinite learning set
    - strong case:
      - for (almost) any sequence of observations (x_i, y_i)_{i≥1}
      - and for any ε > 0
      - there is N such that for n ≥ N

        L(g_n) ≤ L* + ε

        where g_n is built on (x_i, y_i)_{1≤i≤n}
      - N depends on ε and on the dataset (x_i, y_i)_{i≥1}
    - "weak" case:
      - for any ε > 0
      - there is N such that for n ≥ N

        E{L(g_n)} ≤ L* + ε

        where g_n is built on (x_i, y_i)_{1≤i≤n}



  • Interpretation (continued)

    - We can be unlucky:
      - strong case:
        - L(g_n) = E{c(g_n(X), Y) | D_n} ≤ L* + ε
        - but what about

          (1/p) ∑_{i=n+1}^{n+p} c(g_n(x_i), y_i) ?

      - weak case:
        - E{L(g_n)} ≤ L* + ε
        - but what about L(g_n) for a particular dataset?
        - see also the strong case!
    - Stronger result, Probably Approximately Optimal (PAO) algorithm: ∀ ε, δ, ∃ n(ε, δ) such that ∀ n ≥ n(ε, δ)

      P{L(g_n) > L* + ε} < δ

    - Gives some convergence speed: especially interesting when n(ε, δ) is independent of P (distribution free)



  • From PAO to consistency

    - if convergence happens fast enough, then the method is (strongly) consistent
    - "weak" consistency is easy (for c bounded):
      - if the method is PAO, then for any fixed ε, P{L(g_n) > L* + ε} → 0 as n → ∞
      - by definition (and also because L(g_n) ≥ L*), this means L(g_n) → L* in probability
      - then, if c is bounded, E{L(g_n)} → L* as n → ∞
    - if convergence happens faster, then the method might be strongly consistent
    - we need a very sharp decrease of P{L(g_n) > L* + ε} with n, i.e., a slow increase of n(ε, δ) when δ decreases



  • From PAO to strong consistency

    - this is based on the Borel-Cantelli lemma:
      - let (A_i)_{i≥1} be a sequence of events
      - define [A_i i.o.] = ∩_{i=1}^∞ ∪_{j=i}^∞ A_j (infinitely often)
      - the lemma states that if ∑_i P{A_i} < ∞, then P{[A_i i.o.]} = 0
    - apply it to A_n = {L(g_n) > L* + ε}: if ∑_n P{L(g_n) > L* + ε} < ∞ for every ε > 0, then P{[{L(g_n) > L* + ε} i.o.]} = 0
    - we have {L(g_n) → L*} = ∩_{j=1}^∞ ∪_{i=1}^∞ ∩_{n=i}^∞ {L(g_n) ≤ L* + 1/j}
    - and therefore L(g_n) → L* almost surely



  • Summary

    - from n i.i.d. observations D_n = (X_i, Y_i)_{i=1}^n, a ML algorithm builds a model g_n
    - its risk is L(g_n) = E{c(g_n(X), Y) | D_n}
    - statistical learning studies L(g_n) − L* and E{L(g_n)} − L*
    - the preferred approach is to show a fast decrease of P{L(g_n) > L* + ε} with n
    - no hypothesis on the data (i.e., on P), a.k.a. distribution-free results


  • Classical statistics

    Even though we are frequentists...

    - this program is quite different from classical statistics:
      - no optimal parameters, no identifiability: intrinsically non-parametric
      - no optimal model: only optimal performances
      - no asymptotic distribution: focus on finite-distance inequalities
      - no (or minimal) assumptions on the data distribution
    - minimal hypotheses
      - are justified by our lack of knowledge on the studied phenomenon
      - have strong and annoying consequences, linked to the worst-case scenario they imply


  • Optimal model

    - there is nevertheless an optimal model
    - regression:
      - for the quadratic cost c(g(x), y) = ‖g(x) − y‖²
      - the minimum risk L* is reached by g(x) = E{Y | X = x}
    - classification:
      - we need the posterior probabilities P{Y = k | X = x}
      - the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class, arg max_k P{Y = k | X = x} (see the sketch below)
    - we only need to mimic the prediction of the optimal model, not the model itself:
      - for optimal classification, we don't need to compute the posterior probabilities
      - for the binary case, the decision is based on P{Y = −1 | X = x} − P{Y = 1 | X = x}
      - we only need to agree with the sign of E{Y | X = x}


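  • Illustration: the Bayes classifier on a synthetic problem (code sketch)

    A minimal sketch of the Bayes classifier for a binary problem where the posterior P{Y = 1 | X = x} is known by construction; the Gaussian class conditionals and the prior are illustrative assumptions.

      import numpy as np
      from scipy.stats import norm

      P1 = 0.4                                   # prior P{Y = 1}; P{Y = -1} = 1 - P1

      def posterior_y1(x):
          # P{Y = 1 | X = x} by Bayes' rule with Gaussian class-conditional densities
          f_plus = norm.pdf(x, loc=1.0, scale=1.0)
          f_minus = norm.pdf(x, loc=-1.0, scale=1.0)
          return P1 * f_plus / (P1 * f_plus + (1 - P1) * f_minus)

      def bayes_classifier(x):
          # assign x to the most probable class, i.e. follow the sign of P{Y=1|x} - P{Y=-1|x}
          return np.where(posterior_y1(x) > 0.5, 1, -1)

      print(bayes_classifier(np.array([-2.0, 0.0, 2.0])))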

  • No free lunch

    - unfortunately, making no hypothesis on P costs a lot
    - no free lunch results for binary classification and the misclassification cost
    - there is always a bad distribution:
      - given a fixed machine learning method
      - for all ε > 0 and all n, there is (X, Y) such that L* = 0 and

        E{L(g_n)} ≥ 1/2 − ε

    - arbitrarily slow convergence always happens:
      - given a fixed machine learning method
      - and a decreasing sequence (a_n) with limit 0 (and a_1 ≤ 1/16)
      - there is (X, Y) such that L* = 0 and

        E{L(g_n)} ≥ a_n



  • Estimation difficulties

    - in fact, estimating the Bayes risk L* (in binary classification) is difficult:
      - given a fixed estimation algorithm for L*, denoted L̂_n
      - for all ε > 0 and all n, there is (X, Y) such that

        E{|L̂_n − L*|} ≥ 1/4 − ε

    - regression is more difficult than classification:
      - let η_n be an estimator of η = E{Y | X} in a weak sense:

        lim_{n→∞} E{(η_n(X) − η(X))²} = 0

      - then for g_n(x) = 1{η_n(x) > 1/2}, we have

        lim_{n→∞} (E{L(g_n)} − L*) / √(E{(η_n(X) − η(X))²}) = 0



  • Consequences

    - no free lunch results show that we cannot get a uniform convergence speed:
      - they don't rule out universal (strong) consistency!
      - they don't rule out PAO with hypotheses on P
    - intuitive explanation:
      - stationarity and independence are not sufficient
      - if P is arbitrarily complex, then no generalization can occur from a finite learning set
      - e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses
    - but we don't want hypotheses on P...



  • Estimation and approximation

    - solution: split the problem into two parts, estimation and approximation
    - estimation:
      - restrict the class of acceptable models to G, a set of measurable functions from X to Y
      - study L(g_n) − inf_{g∈G} L(g) when the ML algorithm must choose g_n in G
      - statistics needed!
      - hypotheses on G replace hypotheses on P and allow uniform results!
    - approximation:
      - compare G to the set of all possible models
      - in other words, study inf_{g∈G} L(g) − L*
      - function approximation results (density), no statistics!



  • Back to machine learning algorithms

    - many ad hoc algorithms:
      - k nearest neighbors
      - tree-based algorithms
      - Adaboost
    - but most algorithms are based on the following scheme:
      - fix a model class G
      - choose g_n in G by optimizing a quality measure defined via D_n
    - examples:
      - linear regression (X is a Hilbert space, Y = R):
        - G = {x ↦ ⟨w, x⟩}
        - find w_n by minimizing the mean squared error: w_n = arg min_w (1/n) ∑_{i=1}^n (y_i − ⟨w, x_i⟩)² (see the sketch below)
      - linear classification (X = R^p, Y = {1, ..., c}):
        - G = {x ↦ arg max_k (Wx)_k}
        - find W, e.g., by minimizing a mean squared error for a disjunctive coding of the classes


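  • Illustration: linear regression by mean squared error minimization (code sketch)

    A minimal sketch of the linear regression example above: G = {x ↦ ⟨w, x⟩} and w_n minimizes the mean squared error on D_n, here through the standard least-squares solver; the synthetic data are an illustrative assumption.

      import numpy as np

      def fit_linear_regression(X, y):
          """w_n = arg min_w (1/n) * sum_i (y_i - ⟨w, x_i⟩)^2, i.e. ordinary least squares."""
          w, *_ = np.linalg.lstsq(X, y, rcond=None)
          return w

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 3))
      w_true = np.array([1.0, -2.0, 0.5])
      y = X @ w_true + rng.normal(scale=0.1, size=200)
      print(fit_linear_regression(X, y))          # close to w_true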

  • Criterion

    - the best solution would be to choose g_n by minimizing L(g_n) over G, but:
      - we cannot compute L(g_n) (P is unknown!)
      - even if we knew L(g_n), optimization might be intractable (L has no reason to be convex, for instance)
    - partial solution, the empirical risk (see the sketch below):

      L_n(g) = (1/n) ∑_{i=1}^n c(g(X_i), Y_i)

    - naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_{n→∞} L_n(g) = L(g) almost surely
    - the algorithmic issue is handled by replacing the empirical risk by an empirical loss (see part IV)


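  • Illustration: the empirical risk in code (sketch)

    The empirical risk L_n(g) is simply the average cost over the dataset. A minimal sketch, usable with any of the illustrative cost helpers defined earlier:

      import numpy as np

      def empirical_risk(g, cost, X, y):
          """L_n(g) = (1/n) * sum_i c(g(X_i), Y_i)."""
          return float(np.mean(cost(g(X), y)))

      # e.g., with the quadratic cost and a linear model g(x) = ⟨w, x⟩:
      # empirical_risk(lambda X: X @ w, lambda p, y: (p - y) ** 2, X, y)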

  • Empirical risk minimization

    - generic machine learning method (see the sketch below):
      - fix a model class G
      - choose g*_n in G by

        g*_n = arg min_{g∈G} L_n(g)

    - denote L*_G = inf_{g∈G} L(g)
    - does that work?
      - E{L(g*_n)} → L*_G?
      - E{L(g*_n)} → L*?
      - the SLLN does not help, as g*_n depends on n
    - we are looking for bounds:
      - L(g*_n) < L_n(g*_n) + B (bound the risk using the empirical risk)
      - L(g*_n) < L*_G + C (guarantee that the risk is close to the best possible one in the class)
      - L*_G − L* is handled differently (approximation vs estimation)


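  • Illustration: ERM over a small finite class (code sketch)

    A minimal sketch of empirical risk minimization when G is a deliberately small, finite class of threshold classifiers x ↦ sign(x − t); the grid of thresholds is an illustrative choice, and empirical_risk / zero_one are the hypothetical helpers from the sketches above.

      import numpy as np

      thresholds = np.linspace(-2.0, 2.0, 81)     # G: one classifier per threshold t

      def erm(X, y):
          """Return g*_n = arg min over G of L_n(g), found by exhaustive search over G."""
          risks = [empirical_risk(lambda x, t=t: np.where(x > t, 1, -1), zero_one, X, y)
                   for t in thresholds]
          best_t = thresholds[int(np.argmin(risks))]
          return lambda x: np.where(x > best_t, 1, -1)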

  • Complexity control
    The overfitting issue

    - controlling L_n(g*_n) − L(g*_n) is not possible in arbitrary classes G
    - a simple example (see the sketch below):
      - binary classification
      - G: all piecewise functions on arbitrary partitions of X
      - obviously L_n(g*_n) = 0 if P_X has a density (via the classifier defined by the 1-nn rule)
      - but L(g*_n) > 0 when L* > 0!
      - the 1-nn rule is overfitting
    - the only solution is to reduce the "size" of G (see part II)
    - this implies L*_G > L* (for arbitrary P)
    - how to reach L*? (will be studied in part III)
    - in other words, how to balance estimation error and approximation error


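  • Illustration: the 1-nn rule overfits (code sketch)

    A minimal numerical illustration of the example above: on a noisy binary problem (labels flipped with probability 0.2, so L* = 0.2), the 1-nearest-neighbour rule reaches zero empirical risk but its test error stays well above L*. The data-generating choices are illustrative assumptions.

      import numpy as np

      rng = np.random.default_rng(0)

      def sample_noisy(n):
          # binary problem with L* = 0.2: the "clean" label sign(x) is flipped with probability 0.2
          x = rng.uniform(-1.0, 1.0, size=(n, 1))
          y = np.where(x[:, 0] > 0.0, 1, -1)
          flip = rng.random(n) < 0.2
          return x, np.where(flip, -y, y)

      def one_nn_predict(x_train, y_train, x_query):
          # 1-nearest-neighbour rule: copy the label of the closest training point
          d = np.abs(x_query[:, 0][:, None] - x_train[:, 0][None, :])
          return y_train[np.argmin(d, axis=1)]

      x_tr, y_tr = sample_noisy(200)
      x_te, y_te = sample_noisy(10_000)
      print(np.mean(one_nn_predict(x_tr, y_tr, x_tr) != y_tr))   # empirical risk: 0.0
      print(np.mean(one_nn_predict(x_tr, y_tr, x_te) != y_te))   # test error: noticeably above 0.2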

  • Increasing complexity

    - key idea: choose g_n in G_n, i.e., in a class that depends on the data size
    - estimation part:
      - statistics
      - make sure that L_n(g*_n) − L(g*_n) is under control when g*_n = arg min_{g∈G_n} L_n(g)
    - approximation part:
      - functional analysis
      - make sure that lim_{n→∞} L*_{G_n} = L*
      - related to density arguments (universal approximation)
    - this is the simplest approach but also the least realistic: G_n depends only on the data size



  • Structural risk minimization

    - key idea: again use more and more powerful classes G_d, but also compare models between classes
    - more precisely (see the sketch below):
      - compute g*_{n,d} by empirical risk minimization in G_d
      - choose g*_n by minimizing over d

        L_n(g*_{n,d}) + r(n, d)

    - r(n, d) is a penalty term:
      - it decreases with n and increases with the "complexity" of G_d, to favor models for which the risk is correctly estimated by the empirical risk
      - it corresponds to a complexity measure for G_d
    - more interesting than the increasing complexity solution, but rather unrealistic from a practical point of view because of the tremendous computational cost


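  • Illustration: structural risk minimization over polynomial degrees (code sketch)

    A minimal sketch of the scheme above with nested classes G_d = {polynomials of degree ≤ d}: fit g*_{n,d} by least squares in each class, then pick the degree minimizing the penalized empirical risk. The penalty r(n, d) = d/n used here is a placeholder assumption, not a prescribed choice.

      import numpy as np

      def fit_poly(x, y, d):
          """ERM in G_d: least-squares polynomial of degree d."""
          return np.polyfit(x, y, deg=d)

      def empirical_mse(coeffs, x, y):
          return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

      def srm_select(x, y, degrees=range(0, 11), penalty=lambda n, d: d / n):
          """Choose d minimizing L_n(g*_{n,d}) + r(n, d), then return (d, g*_{n,d})."""
          n = len(x)
          scores = {d: empirical_mse(fit_poly(x, y, d), x, y) + penalty(n, d)
                    for d in degrees}
          best_d = min(scores, key=scores.get)
          return best_d, fit_poly(x, y, best_d)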

  • Regularization

    - key idea: use a large class, but optimize a penalized version of the empirical risk (see part IV, and the sketch below)
    - more precisely:
      - choose G such that L*_G = L*
      - define g*_n by

        g*_n = arg min_{g∈G} (L_n(g) + λ_n R(g))

    - R(g) measures the complexity of the model
    - the trade-off between the risk and the complexity, λ_n, must be carefully chosen
    - this is the (or one of the) best solution(s)



  • Summary

    - from n i.i.d. observations Dn = (Xi, Yi), i = 1, ..., n, a ML algorithm builds a model gn
    - a good ML algorithm builds a universally strongly consistent model, i.e. for all P,

        L(gn) → L*  almost surely

    - a very general class of ML algorithms is based on empirical risk minimization, i.e.

        g*_n = arg min_{g∈G} (1/n) Σ_{i=1}^n c(g(Xi), Yi)

    - consistency is obtained by controlling the estimation error and the approximation error
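
    A minimal sketch of the ERM principle on a deliberately simple finite class (threshold classifiers on a grid, with the misclassification cost); the class and the data-generating process are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# binary classification data: Y = I{X >= 0.6} with 10% label noise
n = 500
X = rng.uniform(0, 1, n)
Y = (X >= 0.6).astype(int)
flip = rng.uniform(size=n) < 0.1
Y[flip] = 1 - Y[flip]

# finite model class G: threshold classifiers g_t(x) = I{x >= t}, t on a grid
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    """L_n(g_t) with the misclassification cost c(g(x), y) = I{g(x) != y}."""
    return np.mean((X >= t).astype(int) != Y)

# empirical risk minimization over G
risks = np.array([empirical_risk(t) for t in thresholds])
t_star = thresholds[np.argmin(risks)]
print("ERM threshold:", t_star, "empirical risk:", risks.min())
```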


  • Outline

    - part II: empirical risk and capacity measures
    - part III: capacity control
    - part IV: empirical loss and regularization
    - not in this lecture (see the references):
      - ad hoc methods such as k nearest neighbours and trees
      - advanced topics such as noise conditions and data dependent bounds
      - clustering
      - the Bayesian point of view
      - and many other things...


  • Part II

    Empirical risk and capacity measures


  • Outline

    Concentration
      Hoeffding's inequality
      Uniform bounds

    Vapnik-Chervonenkis dimension
      Definition
      Application to classification
      Proof

    Covering numbers
      Definition and results
      Computing covering numbers

    Summary


  • Empirical risk

    - empirical risk minimisation (ERM) relies on the link between Ln(g) and L(g)
    - the law of large numbers is too limited:
      - asymptotic only
      - g must be fixed
    - we need:
      1. quantitative, finite distance results, i.e., PAC-like bounds on

           P{ |Ln(g) − L(g)| > ε }

      2. uniform results, i.e., bounds on

           P{ sup_{g∈G} |Ln(g) − L(g)| > ε }


  • Concentration inequalities

    - in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation

        Ln(g) − L(g) = (1/n) Σ_{i=1}^n c(g(Xi), Yi) − E{ c(g(X), Y) }

    - some standard bounds:
      - Markov: P{ |X| ≥ t } ≤ E{|X|} / t
      - Bienaymé-Chebyshev: P{ |X − E{X}| ≥ t } ≤ Var(X) / t²
      - BC gives, for n independent real valued random variables Xi and Sn = Σ_{i=1}^n Xi:

          P{ |Sn − E{Sn}| ≥ t } ≤ ( Σ_{i=1}^n Var(Xi) ) / t²
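
    A quick numerical illustration of how loose the Bienaymé-Chebyshev bound can be for a sum of bounded variables; the Bernoulli(1/2) choice and the values of n and t are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# S_n = sum of n independent Bernoulli(1/2) variables (so Var(X_i) = 1/4)
n, t = 100, 15
n_trials = 100_000

S = rng.binomial(1, 0.5, size=(n_trials, n)).sum(axis=1)
deviation_prob = np.mean(np.abs(S - n / 2) >= t)

chebyshev_bound = (n / 4) / t ** 2  # sum of variances / t^2

print("Monte Carlo estimate of P{|S_n - E S_n| >= t}:", deviation_prob)
print("Bienaymé-Chebyshev bound:                     ", chebyshev_bound)
```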


  • Hoeffding's inequality (Hoeffding, 1963)

    hypothesis

    - X1, ..., Xn: n independent r.v.
    - Xi ∈ [ai, bi]
    - Sn = Σ_{i=1}^n Xi

    result

        P{ Sn − E{Sn} ≥ ε } ≤ exp( −2ε² / Σ_{i=1}^n (bi − ai)² )
        P{ Sn − E{Sn} ≤ −ε } ≤ exp( −2ε² / Σ_{i=1}^n (bi − ai)² )

    A quantitative, distribution free, finite distance law of large numbers!


  • Direct application

    - with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as
      - a bounded Y in regression
      - or the misclassification cost in binary classification: [a, b] = [0, 1]
    - apply Hoeffding to Ui = (1/n) c(g(Xi), Yi), with Ui ∈ [a/n, b/n]
    - warning: g cannot depend on the (Xi, Yi) (or the Ui are no longer independent)
    - Σ_{i=1}^n (bi − ai)² = (b − a)²/n, and therefore

        P{ |Ln(g) − L(g)| ≥ ε } ≤ 2 exp( −2nε² / (b − a)² )

    - then Ln(g) → L(g) in probability and Ln(g) → L(g) almost surely (Borel-Cantelli)
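
    The bound can be compared with a Monte Carlo estimate of the deviation probability for one fixed classifier (so that the Ui stay independent); the data distribution and the classifier below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# fixed classifier g(x) = I{x >= 0.5} on X ~ U[0, 1], Y = I{X >= 0.4}
# misclassification cost in [0, 1]; true risk L(g) = P{0.4 <= X < 0.5} = 0.1
n, eps = 200, 0.05
L_true = 0.1

def empirical_risk(rng):
    X = rng.uniform(0, 1, n)
    Y = (X >= 0.4).astype(int)
    pred = (X >= 0.5).astype(int)
    return np.mean(pred != Y)

deviations = np.array([abs(empirical_risk(rng) - L_true) for _ in range(20_000)])
print("Monte Carlo estimate of P{|L_n(g) - L(g)| >= eps}:", np.mean(deviations >= eps))
print("Hoeffding bound 2*exp(-2*n*eps^2):                ", 2 * np.exp(-2 * n * eps ** 2))
```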


  • Limitations

    - g cannot depend on the (Xi, Yi) (independent test set)
    - intuitively:
      - set δ = 2 exp( −2nε² / (b − a)² )
      - for each g, the probability to draw from P a Dn on which

          |Ln(g) − L(g)| ≤ (b − a) sqrt( log(2/δ) / (2n) )

        is at least 1 − δ
      - but the sets of "correct" Dn differ for each g
      - conversely, for a fixed Dn, the bound is valid only for some of the models
    - this cannot be used to study g*_n = arg min_{g∈G} Ln(g)


  • The test set

    - Hoeffding's inequality justifies the test set approach (hold-out estimate)
    - given a dataset Dm:
      - split the dataset into two disjoint sets: a learning set Dn and a test set D'p (with p + n = m)
      - use a ML algorithm to build gn using only Dn
      - claim that L(gn) ≃ L'p(gn) = (1/p) Σ_{i=n+1}^m c(gn(Xi), Yi)
    - in fact, with probability at least 1 − δ,

        L(gn) ≤ L'p(gn) + (b − a) sqrt( log(1/δ) / (2p) )

    - example:
      - classification (a = 0, b = 1), with a "good" classifier: L'p(gn) = 0.05
      - to be 99% sure that L(gn) ≤ 0.06, we need p = 23000 (!!!)
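
    The p ≈ 23000 figure follows directly from the bound: solving (b − a) sqrt(log(1/δ)/(2p)) ≤ ε for p gives the minimal test set size. A minimal sketch:

```python
import math

def min_test_size(eps, delta, a=0.0, b=1.0):
    """Smallest p such that (b - a) * sqrt(log(1/delta) / (2p)) <= eps."""
    return math.ceil((b - a) ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))

# to be 99% sure that L(g_n) <= L'_p(g_n) + 0.01 in classification ([a, b] = [0, 1])
print(min_test_size(eps=0.01, delta=0.01))   # about 23000 (23026)
```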


  • Proof of Hoeffding's inequality

    - Chernoff's bounding technique (an application of Markov's inequality):

        P{ X ≥ t } = P{ e^{sX} ≥ e^{st} } ≤ E{ e^{sX} } / e^{st}

      the bound is controlled via s
    - lemma: if E{X} = 0 and X ∈ [a, b], then for all s,

        E{ e^{sX} } ≤ e^{s²(b−a)²/8}

      - the exponential is convex ⇒ E{ e^{sX} } ≤ e^{φ(s(b−a))}
      - then we bound φ
    - this is used for X = Sn − E{Sn}


  • Proof of Hoeffding's inequality (continued)

        P{ Sn − E{Sn} ≥ ε } ≤ e^{−sε} E{ e^{s Σ_{i=1}^n (Xi − E{Xi})} }
                            = e^{−sε} Π_{i=1}^n E{ e^{s(Xi − E{Xi})} }     (independence)
                            ≤ e^{−sε} Π_{i=1}^n e^{s²(bi − ai)²/8}          (lemma)
                            = e^{−sε} e^{s² Σ_{i=1}^n (bi − ai)²/8}
                            = e^{−2ε²/Σ_{i=1}^n (bi − ai)²}                 (minimization in s)

    with s = 4ε / Σ_{i=1}^n (bi − ai)²
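
    The lemma used above can be sanity-checked numerically: for a centered bounded variable, a Monte Carlo estimate of E{e^{sX}} is compared with the bound e^{s²(b−a)²/8}; the uniform distribution and the values of s are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# centered bounded variable: X = U - 1/2 with U ~ U[0, 1], so X in [a, b] = [-1/2, 1/2]
a, b = -0.5, 0.5
X = rng.uniform(0, 1, 1_000_000) - 0.5

for s in (0.5, 1.0, 2.0, 5.0):
    mgf = np.mean(np.exp(s * X))                  # Monte Carlo estimate of E{e^{sX}}
    bound = np.exp(s ** 2 * (b - a) ** 2 / 8)     # Hoeffding's lemma bound
    print(f"s = {s}: E[exp(sX)] ~ {mgf:.4f} <= bound {bound:.4f}")
```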


  • Remarks

    - Hoeffding's inequality is useful in many other contexts:
      - large-scale machine learning (e.g., [Domingos and Hulten, 2001]):
        - enormous datasets (e.g., that cannot be loaded in memory)
        - process growing subsets until the model quality is known with sufficient confidence
        - monitor drift via confidence bounds
      - regret in game theory (e.g., [Cesa-Bianchi and Lugosi, 2006]):
        - define strategies by minimizing a regret measure
        - get bounds on the regret estimates (e.g., trade-off between exploration and exploitation in multi-armed bandits)
    - many improvements/complements are available, such as:
      - Bernstein's inequality
      - McDiarmid's bounded difference inequality


  • Uniform bounds

    - we deal with the estimation error:
      - given the ERM choice g*_n = arg min_{g∈G} Ln(g),
      - what about L(g*_n) − inf_{g∈G} L(g)?
    - we need uniform bounds:

        |Ln(g*_n) − L(g*_n)| ≤ sup_{g∈G} |Ln(g) − L(g)|

        L(g*_n) − inf_{g∈G} L(g) = L(g*_n) − Ln(g*_n) + Ln(g*_n) − inf_{g∈G} L(g)
                                 ≤ L(g*_n) − Ln(g*_n) + sup_{g∈G} |Ln(g) − L(g)|
                                 ≤ 2 sup_{g∈G} |Ln(g) − L(g)|

    - therefore uniform bounds answer both questions: the quality of the empirical risk as an estimate of the risk, and the quality of the model selected by ERM


  • Finite model class

    - union bound: P{ A or B } ≤ P{A} + P{B}, and therefore

        P{ max_{1≤i≤m} Ui ≥ ε } ≤ Σ_{i=1}^m P{ Ui ≥ ε }

    - then, when G is finite,

        P{ sup_{g∈G} |Ln(g) − L(g)| ≥ ε } ≤ 2|G| exp( −2nε² / (b − a)² )

      so L(g*_n) → inf_{g∈G} L(g) almost surely (but in general inf_{g∈G} L(g) > L*)
    - a quantitative, distribution free, uniform, finite distance law of large numbers!


  • Bound versus size

    with probability at least 1 − δ,

        L(g*_n) ≤ inf_{g∈G} L(g) + 2(b − a) sqrt( (log|G| + log(2/δ)) / (2n) )

    - inf_{g∈G} L(g) decreases with |G| (more models)
    - but the bound increases with |G|: trade-off between flexibility and estimation quality
    - remark: log|G| is (up to the base of the logarithm) the number of bits needed to encode the model choice; this is a capacity measure
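
    A quick numerical look at the trade-off: the sketch below evaluates the deviation term 2(b − a) sqrt((log|G| + log(2/δ))/(2n)) for a few class sizes and sample sizes (all values are arbitrary assumptions).

```python
import numpy as np

def estimation_term(card_G, n, delta=0.05, a=0.0, b=1.0):
    """Deviation term of the bound: 2(b - a) * sqrt((log|G| + log(2/delta)) / (2n))."""
    return 2 * (b - a) * np.sqrt((np.log(card_G) + np.log(2.0 / delta)) / (2.0 * n))

for card_G in (10, 1_000, 1_000_000):
    for n in (100, 10_000):
        print(f"|G| = {card_G:>9}, n = {n:>6}: "
              f"estimation term = {estimation_term(card_G, n):.3f}")
```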


  • Infinite model class

    - so far we have uniform bounds only when G is finite:
      - even simple classes such as Glin = {x ↦ arg max_k (Wx)_k} are infinite
      - in addition, inf_{g∈Glin} L(g) > 0 in general, even when L* = 0 (nonlinear problems)
    - solution: evaluate the capacity of a class rather than its size
    - first step:
      - binary classification (the simplest case)
      - a model is then a measurable set A ⊂ X
      - the capacity of G measures the variability of the available shapes
      - it is evaluated with respect to the data: if A ∩ Dn = B ∩ Dn, then A and B are considered identical


  • Growth function

    - abstract setting: F is a set of measurable functions from R^d to {0, 1} (in other words, a set of (indicator functions of) measurable subsets of R^d)
    - given z1 ∈ R^d, ..., zn ∈ R^d, we define

        F_{z1,...,zn} = { u ∈ {0,1}^n | ∃ f ∈ F, u = (f(z1), ..., f(zn)) }

    - interpretation: each u describes a binary partition of z1, ..., zn, and F_{z1,...,zn} is then the set of partitions implemented by F
    - growth function:

        S_F(n) = sup_{z1∈R^d, ..., zn∈R^d} |F_{z1,...,zn}|

    - interpretation: maximal number of binary partitions implementable via F (distribution free ⇒ worst case analysis)


  • Vapnik-Chervonenkis dimension

    - S_F(n) ≤ 2^n
    - vocabulary:
      - S_F(n) is the n-th shatter coefficient of F
      - if |F_{z1,...,zn}| = 2^n, F shatters z1, ..., zn
    - Vapnik-Chervonenkis dimension:

        VCdim(F) = sup{ n ∈ N+ | S_F(n) = 2^n }

    - interpretation:
      - if S_F(n) < 2^n:
        - for all z1, ..., zn, there is a partition u ∈ {0,1}^n such that u ∉ F_{z1,...,zn}
        - for any dataset, F can fail to reach L* = 0 (for some labelling of the points)
      - if S_F(n) = 2^n: there is at least one set z1, ..., zn shattered by F


  • Example

    - points from R²
    - F = { I{ax+by+c ≥ 0} } (half-planes)
    - S_F(3) = 2³, even if sometimes |F_{z1,z2,z3}| < 2³ (e.g. for three aligned points)
    - S_F(4) < 2⁴
    - VCdim(F) = 3
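
    These claims can be checked by brute force, under the (standard but here assumed) remark that a labelling is realized by some I{ax+by+c ≥ 0} as soon as the two label groups are strictly linearly separable; for each labelling, a small feasibility linear program tests separability.

```python
import numpy as np
from scipy.optimize import linprog
from itertools import product

def separable(points, labels):
    """Is there (a, b, c) with sign(a*x + b*y + c) matching labels in {-1, +1}?
    Feasibility LP: labels_i * (a*x_i + b*y_i + c) >= 1 for all i."""
    pts = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # variables: (a, b, c); constraints rewritten as -y_i*(a*x_i + b*y_i + c) <= -1
    A_ub = -y[:, None] * np.hstack([pts, np.ones((len(pts), 1))])
    b_ub = -np.ones(len(pts))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0

def n_realizable_labellings(points):
    count = 0
    for labels in product([-1, 1], repeat=len(points)):
        if separable(points, labels):
            count += 1
    return count

three_points = [(0, 0), (1, 0), (0, 1)]            # 3 points in general position
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]     # 4 points (square)

print("3 points:", n_realizable_labellings(three_points), "labellings out of", 2 ** 3)
print("4 points:", n_realizable_labellings(four_points), "labellings out of", 2 ** 4)
```

    For three points in general position all 8 labellings are found; for the four corners of a square only 14 of the 16 labellings are realizable (the two "XOR" patterns are not), matching S_F(4) < 2⁴.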


  • Linear models

    result
    if G is a p-dimensional vector space of real valued functions defined on R^d, then

        VCdim( { f : R^d → {0,1} | ∃ g ∈ G, f(z) = I{g(z) ≥ 0} } ) ≤ p

    proof

    - given z1, ..., z_{p+1}, let F from G to R^{p+1} be F(g) = (g(z1), ..., g(z_{p+1}))
    - as dim(F(G)) ≤ p, there is a non-zero γ = (γ1, ..., γ_{p+1}) such that Σ_{i=1}^{p+1} γi g(zi) = 0 for all g ∈ G
    - if z1, ..., z_{p+1} were shattered, there would also be gj such that gj(zi) = δij, and therefore γj = 0 for all j
    - consequently, no z1, ..., z_{p+1} can be shattered


  • VC dimension ≠ parameter number

    - in the linear case F = { f(z) = I{Σ_{i=1}^p wi φi(z) ≥ 0} }, VCdim(F) = p
    - but in general:
      - VCdim( { f(z) = I{sin(tz) ≥ 0} } ) = ∞ (a single parameter t)
      - one hidden layer perceptron:

          G = { g(z) = T( β0 + Σ_{k=1}^h βk T( αk0 + Σ_{j=1}^d αkj zj ) ) }

        with T(a) = I{a ≥ 0}; then VCdim(G) ≥ W log2(h/4) / 32, where W = dh + 2h + 1 is the number of parameters
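
    The sin example can be illustrated numerically with the classical construction (stated here as an assumption rather than proved): for points zi = 10^{-i} and any target labels yi ∈ {0, 1}, the frequency t = π(1 + Σ_i (1 − yi) 10^i) makes I{sin(t zi) ≥ 0} reproduce the labels, so arbitrarily large sets can be shattered with a single real parameter.

```python
import numpy as np
from itertools import product

def frequency_for_labels(labels):
    """Classical construction: t = pi * (1 + sum_i (1 - y_i) * 10**i) for points z_i = 10**(-i)."""
    return np.pi * (1 + sum((1 - y) * 10 ** i for i, y in enumerate(labels, start=1)))

n = 5
z = np.array([10.0 ** (-i) for i in range(1, n + 1)])

# check that every labelling of z_1, ..., z_n is realized by some f(z) = I{sin(t z) >= 0}
all_realized = True
for labels in product([0, 1], repeat=n):
    t = frequency_for_labels(labels)
    predicted = (np.sin(t * z) >= 0).astype(int)
    all_realized &= np.array_equal(predicted, np.array(labels))

print(f"all {2 ** n} labellings of {n} points realized:", all_realized)
```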


  • Snake oil warning

    - you might read here and there things like "the VC-dimension of the class of separating hyperplanes with margin larger than 1/δ is bounded by bla bla bla"
    - this is plain wrong!
      - the class G is data independent
      - shattering does not take a margin into account
      - asking for the "VC-dimension of the class of separating hyperplanes with margin larger than 1/δ" is nonsense
    - there is an extension of the VC dimension