Introduction to Statistical Learning Theory
Petra Philips
Friedrich Miescher Laboratory, Tübingen
Lecture, Winter Semester 2006/2007, Eberhard Karls Universität Tübingen
24 January 2007
http://www.fml.mpg.de/raetsch/lectures/amsa
Retrospection
Supervised Learning in a Nutshell
Given: Training data, a finite set of examples x_i ∈ X and their associated labels y_i ∈ Y.
Wanted: The 'best' estimator modelling the relationship between the x_i and the associated labels y_i, i.e. the 'best' function f : X → Y.
Approach
Restrict the possible functions (e.g. to hyperplanes).
Quantify 'best' as the optimum of some computable objective function (usually the error on the training data).
Evaluate prediction performance on new test data.
Challenge
Is there an a priori way to guarantee good performance?
Recall - Loss, Risk
Loss: The error on a particular example, ℓ(f(x_i), y_i). Examples: 0-1 loss, hinge loss, squared loss.
Risk: The expected loss over all data, including unseen data:
R(f) = ∫ ℓ(f(x), y) dρ.
Empirical Risk: The average loss on the training data only:
R_emp(f) = (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i).
'Best' Function: The one that minimizes the risk.
Empirical Risk Minimization: Instead of minimizing the risk, we minimize the empirical risk!
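These definitions translate directly into a few lines of code. The sketch below (plain Python with NumPy; the function names and toy data are purely illustrative, and labels are assumed to be in {-1, +1}) computes the empirical risk under the three example losses:

import numpy as np

def zero_one_loss(fx, y):
    # 1 if the sign of the prediction disagrees with the label, else 0
    return (np.sign(fx) != y).astype(float)

def hinge_loss(fx, y):
    # max(0, 1 - y * f(x)), the loss used by SVMs
    return np.maximum(0.0, 1.0 - y * fx)

def squared_loss(fx, y):
    return (fx - y) ** 2

def empirical_risk(f, X, y, loss):
    # Remp(f) = (1/n) * sum_i loss(f(x_i), y_i)
    return float(np.mean(loss(f(X), y)))

# Toy usage with a fixed linear function f(x) = w . x
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -2.0]])
y = np.array([1, -1, -1])
w = np.array([0.5, -0.25])
f = lambda X: X @ w
print(empirical_risk(f, X, y, zero_one_loss))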
Questions
How can we know we are doing 'the right thing'? Why should a small error on the training data ensure a small error on unseen test data?
Assumption: Training and test data are 'similar' because they represent the same phenomenon.
No Free Lunch: Without assumptions and restrictions, no inference and generalization is possible!
Why should the minimizer of the empirical risk be the same as the minimizer of the risk?
More Precise Questions
How to restrict the possible set of functions?
Occam's Razor: Of two equivalent models, choose the simpler one.
Can we quantify the 'complexity' of a learning problem?
Is more data always better data?
How much data do we need?
Statistical Learning Theory
Provides a theoretical framework to study these questions.
Started with Vapnik and Chervonenkis [1971], which led to VC theory and SVMs.
Models the machine learning setting as a statistical phenomenon.
Answers are probabilistic in nature.
Tools: statistics, functional analysis, empirical processes, combinatorics, high-dimensional geometry, complexity theory.
Newer view: Bousquet et al. [2004].
Probabilistic Learning Model
Assumption: All data is generated by the same hidden probabilistic source!
Formally: ρ is an unknown joint probability distribution over X × Y.
Training data ((x_1, y_1), ..., (x_n, y_n)) is drawn iid ∼ ρ.
Aim: find the best function f** (over all possible functions) that minimizes the risk
R(f) = ∫ ℓ(f(x), y) dρ.
Restricted aim: find the best f* ∈ F that minimizes the risk within the chosen class F.
ERM: find f_n ∈ F that minimizes the empirical risk
R_emp(f) = (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i).
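A minimal sketch of this setup (not from the lecture; the distribution ρ, the finite class F of threshold classifiers, and all names are chosen here purely for illustration): draw training data iid from a known ρ, let ERM pick f_n from F, and approximate the true risk R(f_n) with a large fresh sample:

import numpy as np

rng = np.random.default_rng(0)

def sample_rho(n):
    # the hidden source rho: y is -1 or +1 with equal probability, x ~ N(y, 1)
    y = rng.choice([-1, 1], size=n)
    x = y + rng.normal(size=n)
    return x, y

def risk_01(t, x, y):
    # 0-1 risk of the threshold classifier f_t(x) = sign(x - t) on the sample (x, y)
    pred = np.where(x > t, 1, -1)
    return float(np.mean(pred != y))

thresholds = np.linspace(-3.0, 3.0, 121)        # F: a finite class of threshold classifiers
x_tr, y_tr = sample_rho(50)                     # training data, iid from rho
x_te, y_te = sample_rho(100_000)                # large fresh sample to approximate R(f)

emp_risks = [risk_01(t, x_tr, y_tr) for t in thresholds]
t_n = thresholds[int(np.argmin(emp_risks))]     # ERM solution f_n

print("Remp(f_n) =", min(emp_risks))
print("R(f_n)   ≈", risk_01(t_n, x_te, y_te))   # typically a bit larger than Remp(f_n)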
Challenge Question
Is R(f_n) small, i.e. is R(f_n) ≈ R(f**)?
Magic?
Approximation & Estimation Error
R(f_n) − R(f**) = [R(f_n) − R(f*)] + [R(f*) − R(f**)] = estimation error + approximation error
F large: small approximation error, but large estimation error (overfitting).
F small: large approximation error, but better generalization (small estimation error) at the price of possibly poor overall performance.
Model selection: Choose F to get an optimal tradeoff between approximation and estimation error.
Estimation Error
R(f*) − R(f_n) ?
It depends on the training data, on F, on how the algorithm chooses f_n, and on the unknown ρ (through f* and the risk).
For ERM, use the uniform differences trick!
Uniform differences:
|R(f*) − R(f_n)| ≤ 2 sup_{f∈F} |R_emp(f) − R(f)|
(Why: R(f_n) − R(f*) = [R(f_n) − R_emp(f_n)] + [R_emp(f_n) − R_emp(f*)] + [R_emp(f*) − R(f*)]; the middle term is ≤ 0 because f_n minimizes the empirical risk, and each of the other two terms is at most sup_{f∈F} |R_emp(f) − R(f)|.)
Empirical and Actual Risk
R_emp(f) ≈ R(f) ?
Asymptotics (Law of Large Numbers): For any fixed f, |R_emp(f) − R(f)| → 0 as n → ∞.
Finite sample result [Chernoff-Hoeffding]: For any fixed f, with high probability, |R_emp(f) − R(f)| ≈ 1/√n.
Does this mean that ERM finds the optimal estimator f* as the training sample gets large?
NO! f_n is a random variable, not a fixed function. A uniform law of large numbers is needed, one that holds simultaneously for all f ∈ F. This is true only for classes F that are 'not too complex'.
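A quick simulation (again just an illustrative sketch) makes the 1/√n behaviour visible: fix one classifier f with true 0-1 risk R(f) = 0.3, repeatedly draw training samples of size n, and record the typical deviation |R_emp(f) − R(f)|:

import numpy as np

rng = np.random.default_rng(1)
true_risk = 0.3   # suppose the fixed classifier f has true 0-1 risk R(f) = 0.3

for n in [100, 1_000, 10_000, 100_000]:
    # Remp(f) is the average of n iid 0-1 losses, i.e. Binomial(n, 0.3) / n
    devs = [abs(rng.binomial(n, true_risk) / n - true_risk) for _ in range(200)]
    print(f"n = {n:>7}   typical |Remp - R| = {np.mean(devs):.4f}   1/sqrt(n) = {n ** -0.5:.4f}")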
VC Dimension
A model class shatters a set of data points if it can realize every possible labeling of those points.
Lines (halfplanes) in R² shatter 3 points in general position, but no set of 4 points.
VC dimension [Vapnik, 1995]: The VC dimension of a model class is the maximum h such that some set of h data points can be shattered by the class. (E.g. the VC dimension of lines in R² is 3.)
A small VC dimension implies small complexity.
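The shattering condition can be checked mechanically. The sketch below (illustrative; it tests linear separability of each labeling via a linear-programming feasibility problem solved with scipy.optimize.linprog, one of several ways to do this) confirms that lines in R² shatter 3 points in general position but cannot shatter 4 points:

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?  (LP feasibility check)
    A = np.hstack([X, np.ones((len(X), 1))])            # variables are (w1, w2, b)
    res = linprog(c=np.zeros(A.shape[1]),
                  A_ub=-(y[:, None] * A), b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * A.shape[1])
    return res.status == 0                              # 0 = a feasible point was found

def shattered(X):
    # every one of the 2^n labelings must be realizable by some line
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three))   # True:  3 points in general position are shattered
print(shattered(four))    # False: the XOR labeling of 4 points is not separable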
Optimal and Empirical Estimator
R(f*) ≈ R(f_n) ?
Uniform differences:
|R(f*) − R(f_n)| ≤ 2 sup_{f∈F} |R_emp(f) − R(f)|
Finite sample results:
One fixed function: |R(f*) − R(f_n)| ≈ 1/√n
F finite: |R(f*) − R(f_n)| ≈ √(log|F|)/√n
F infinite: |R(f*) − R(f_n)| ≈ √(VCdim(F))/√n
All results hold with high probability over the random draw of training samples!
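For a rough feel of the orders of magnitude (a back-of-the-envelope illustration; the constants hidden in ≈ are ignored, and the class sizes are made up), the three rates can be evaluated numerically, e.g. for hyperplanes in R^10, whose VC dimension is 11:

import math

size_F = 1_000   # a hypothetical finite class
vc_dim = 11      # hyperplanes in R^10 have VC dimension d + 1 = 11

for n in [100, 1_000, 10_000, 100_000]:
    fixed = 1 / math.sqrt(n)
    finite = math.sqrt(math.log(size_F)) / math.sqrt(n)
    infinite = math.sqrt(vc_dim) / math.sqrt(n)
    print(f"n = {n:>7}   fixed f: {fixed:.3f}   finite F: {finite:.3f}   VC bound: {infinite:.3f}")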
Implications
VC dimension is a meaningful complexity measure.
Do model selection by minimizing the VC dimension.
More data makes a good predictor more likely.
Larger Margin Classifiers
Large Margin ⇒ Small VC Dimension: Hyperplane classifiers with a large margin have small VC dimension [Vapnik, 1995].
Maximum Margin ⇒ Minimum Complexity: Minimize complexity by maximizing the margin (irrespective of the dimension of the space).
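As an illustrative sketch of this connection (using scikit-learn, which the slides do not reference, and made-up Gaussian data): a linear SVM with a very large C approximates the maximum-margin hyperplane, and its geometric margin 1/||w|| can be read off the fitted weights even in a high-dimensional space:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# two well-separated Gaussian blobs in R^20: high dimension, but a large margin exists
X = np.vstack([rng.normal(-2.0, 1.0, (50, 20)), rng.normal(2.0, 1.0, (50, 20))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ (almost) hard margin
w = clf.coef_.ravel()
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors          =", int(clf.n_support_.sum()))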
Summary - SLT
Provides a statistical framework to study learning algorithms.
Quantifies the generalization ability in terms of
complexity of estimator functions
number of training examples.
Results are probabilistic in nature (confidences).
Results teach us
When and why our intuitive solutions were right (SVM, cross-validation).
Why and how to restrict the class of estimators and to regularize.
That more data is better, because it increases the confidence in the result.
But: Limited model, many questions not yet understood!
References
O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Machine Learning Summer School 2003, volume 3176 of LNAI, pages 208–240. Springer-Verlag, 2004.
V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.
V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.