Learning Theory
John Shawe-Taylor
Centre for Computational Statistics and Machine Learning
University College London
[email protected]
September, 2009
Machine Learning Summer School (MLSS’09) Tutorial, Sept 2009
STRUCTURE
PART A
1. General Statistical Considerations
2. Basic PAC Ideas and proofs
3. Real-valued Function Classes and the Margin
PART B
1. Rademacher complexity and Main Theory
2. Applications to classification
3. Conclusions
Aim:
• Some thoughts on why theory
• Basic Techniques with some deference to history
• Insights into proof techniques and statistical learning approaches

• Concentration inequalities and their relation to the Rademacher approach
What won’t be included:
• The most general results

• A complete history

• Analysis of Bayesian inference

• The most recent developments, e.g. PAC-Bayes, local Rademacher complexity, etc.
PART A
Theories of learning
• The basic approach of SLT is to view learning from a statistical viewpoint.

• The aim of any theory is to model real or artificial phenomena so that we can better understand, predict and exploit them.

• SLT is just one approach to understanding, predicting and exploiting learning systems; others include Bayesian inference, inductive inference, statistical physics and traditional statistical analysis.
Theories of learning cont.
• Each theory makes assumptions about the phenomenon of learning and, based on these, derives predictions of behaviour:

– every predictive theory automatically implies a possible algorithm:

– simply optimise the quantities that improve the predictions.

• Each theory has strengths and weaknesses:

– generally speaking, more assumptions allow potentially more accurate predictions,

– BUT this depends on capturing the right aspects of the phenomenon.
General statistical considerations
• Statistical models (not including Bayesian) begin with an assumption that the data is generated by an underlying distribution P, typically not given explicitly to the learner.

• If we are trying to classify cancerous tissue from healthy tissue, there are two distributions, one for cancerous cells and one for healthy ones.
General statistical considerations cont.
• Usually the distribution subsumes the processes of the natural/artificial world that we are studying.

• Rather than accessing the distribution directly, statistical learning typically assumes that we are given a ‘training sample’ or ‘training set’

S = {(x_1, y_1), . . . , (x_m, y_m)}

generated independently and identically distributed (i.i.d.) according to the distribution P.
Generalisation of a learner
• Assume that we have a learning algorithm A that chooses a function A_F(S) from a function space F in response to the training set S.

• From a statistical point of view the quantity of interest is the random variable:

ε(S, A, F) = E_{(x,y)}[ℓ(A_F(S), x, y)],

where ℓ is a ‘loss’ function that measures the discrepancy between A_F(S)(x) and y.
Generalisation of a learner
• For example, in the case of classification ℓ is 1 if the two disagree and 0 otherwise, while for regression it could be the square of the difference between A_F(S)(x) and y.

• We refer to the random variable ε(S, A, F) as the generalisation of the learner.

• Note that it is random because of its dependence on the training set S.
Example of Generalisation I
• We consider the Breast Cancer dataset from the UCI repository.

• Use the simple Parzen window classifier: the weight vector is

w_+ − w_−

where w_+ is the average of the positive training examples and w_− is the average of the negative training examples. The threshold is set so that the hyperplane bisects the line joining these two points.
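This construction is simple enough to sketch directly (a minimal illustration with made-up data, not the original experiment; all names here are ours):

```python
import numpy as np

def parzen_window_classifier(X_pos, X_neg):
    """Parzen window classifier as described above: the weight vector is
    the difference of the class means, and the threshold places the
    hyperplane at the midpoint of the line joining the two means."""
    w_pos = X_pos.mean(axis=0)           # average positive example
    w_neg = X_neg.mean(axis=0)           # average negative example
    w = w_pos - w_neg                    # weight vector w+ - w-
    b = -0.5 * w @ (w_pos + w_neg)       # bisects the join of the means
    return lambda x: np.sign(x @ w + b)

# Toy data: two small, well-separated clusters.
X_pos = np.array([[2.0, 2.0], [3.0, 2.0]])
X_neg = np.array([[-2.0, -2.0], [-3.0, -2.0]])
predict = parzen_window_classifier(X_pos, X_neg)
```

Points on the positive side of the bisecting hyperplane are labelled +1, the rest −1.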
Example of Generalisation II

• Given a size m of the training set, by repeatedly drawing random training sets S we estimate the distribution of

ε(S, A, F) = E_{(x,y)}[ℓ(A_F(S), x, y)],

using the test set error as a proxy for the true generalisation.

• We plot the histogram and the average of the distribution for various sizes of training set. Initially the whole dataset gives a single value, if we use all the examples for both training and testing; we then plot for training set sizes:

342, 273, 205, 137, 68, 34, 27, 20, 14, 7.
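A sketch of this experiment (using synthetic two-class Gaussian data as a stand-in for the Breast Cancer dataset, which is not bundled here; sizes and trial counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the UCI data: two Gaussian classes.
n, dim = 600, 5
X = np.vstack([rng.normal(+1.0, 2.0, (n // 2, dim)),
               rng.normal(-1.0, 2.0, (n // 2, dim))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

def parzen_error(tr, te):
    """Test error of the Parzen window classifier trained on index set tr."""
    mu_p = X[tr][y[tr] == 1].mean(axis=0)
    mu_n = X[tr][y[tr] == -1].mean(axis=0)
    w, mid = mu_p - mu_n, 0.5 * (mu_p + mu_n)
    return np.mean(np.sign((X[te] - mid) @ w) != y[te])

def error_distribution(m, trials=200):
    """Distribution of test error over random training sets of size m
    (stratified so that both classes are always represented)."""
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    errs = []
    for _ in range(trials):
        p, q = rng.permutation(pos), rng.permutation(neg)
        k = m // 2
        tr = np.hstack([p[:m - k], q[:k]])
        te = np.hstack([p[m - k:], q[k:]])
        errs.append(parzen_error(tr, te))
    return np.array(errs)

small, large = error_distribution(7), error_distribution(342)
```

On data like this the histogram for m = 7 is typically far more spread out than for m = 342, which is the behaviour the following slides display.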
Example of Generalisation III
• Since the expected classifier is in all cases the same:

E[A_F(S)] = E_S[w_S^+ − w_S^−] = E_S[w_S^+] − E_S[w_S^−] = E_{y=+1}[x] − E_{y=−1}[x],

we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won’t be exactly the same.
Error distributions

[Histograms removed: for each training-set size – the full dataset, then 342, 273, 205, 137, 68, 34, 27, 20, 14 and 7 – a histogram of the test-error distribution over random training sets, with error on the horizontal axis (0 to 1) and frequency on the vertical axis. The distributions become more spread out as the training set shrinks.]
Bayes risk and consistency

• Traditional statistics has concentrated on analysing

E_S[ε(S, A, F)].

• For example, consistency of a classification algorithm A and function class F means

lim_{m→∞} E_S[ε(S, A, F)] = err(f_Bayes),

where

f_Bayes(x) = 1 if P(x, 1) > P(x, 0), and 0 otherwise,

is the function with the lowest possible risk; this lowest risk err(f_Bayes) is often referred to as the Bayes risk.
Expected versus confident bounds
• For a finite sample, the generalisation ε(S, A, F) has a distribution depending on the algorithm, function class and sample size m.

• Traditional statistics, as indicated above, has concentrated on the mean of this distribution – but this quantity can be misleading, e.g. for low-fold cross-validation.
Expected versus confident bounds cont.

• Statistical learning theory has preferred to analyse the tail of the distribution, finding a bound which holds with high probability.

• This looks like a statistical test – significance at the 1% level means that the chance of the conclusion not being true is less than 1% over random samples of that size.

• This is also the source of the acronym PAC: probably approximately correct; the ‘confidence’ parameter δ is the probability that we have been misled by the training set.
Probability of being misled in classification

• Aim to cover a number of key techniques of SLT. The basic approach is usually to bound the probability of being misled and set this equal to δ.

• What is the chance of being misled by a single bad function f, i.e. training error err_S(f) = 0 while the true error is bad, err(f) > ε?

P_S{err_S(f) = 0, err(f) > ε} = (1 − err(f))^m ≤ (1 − ε)^m ≤ exp(−εm),

so that choosing ε = ln(1/t)/m ensures this probability is less than t.
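These inequalities are easy to check numerically (the numbers below are illustrative, not from the slides):

```python
import math
import random

random.seed(0)

m, eps = 50, 0.1        # sample size and accuracy threshold
true_err = 0.15         # a "bad" function with err(f) > eps

# Exact probability that this fixed bad function has zero training error.
p_misled = (1 - true_err) ** m

# Monte Carlo estimate: fraction of m-samples on which f looks perfect.
trials = 100_000
count = sum(all(random.random() > true_err for _ in range(m))
            for _ in range(trials))
estimate = count / trials
```

Here p_misled = 0.85^50 ≈ 3.0e-4, comfortably below the bound exp(−εm) = e^−5 ≈ 6.7e-3.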
Finite or Countable function classes

If we now consider a function class

F = {f_1, f_2, . . . , f_n, . . .}

and make the probability of being misled by f_n less than q_n·δ, with

Σ_{n=1}^∞ q_n ≤ 1,

then the probability of being misled by one of the functions is bounded by

P_S{∃ f_n : err_S(f_n) = 0, err(f_n) > (1/m) ln(1/(q_n δ))} ≤ δ.

This uses the so-called union bound: the probability of the union of a set of events is at most the sum of the individual probabilities.
Finite or Countable function classes result

• The bound translates into a theorem: given F and q, with probability at least 1 − δ over random m-samples, the generalisation error of a function f_n ∈ F with zero training error is bounded by

err(f_n) ≤ (1/m)(ln(1/q_n) + ln(1/δ)).
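As a concrete instance (the prior q_n = 2^(−n) is our illustrative choice, which satisfies Σ_n q_n ≤ 1):

```python
import math

def countable_bound(n, m, delta):
    """The theorem's bound for the n-th function under the prior
    q_n = 2^(-n):  err(f_n) <= (1/m) * (ln(1/q_n) + ln(1/delta))."""
    ln_inv_q = n * math.log(2.0)     # ln(1/q_n) = n * ln 2
    return (ln_inv_q + math.log(1.0 / delta)) / m

b_simple = countable_bound(n=5, m=1000, delta=0.05)     # short description
b_complex = countable_bound(n=50, m=1000, delta=0.05)   # long description
b_more_data = countable_bound(n=5, m=10000, delta=0.05)
```

Functions with smaller prior weight (longer description length) receive weaker guarantees, and every bound shrinks like 1/m as data grows.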
Some comments on the result
• We can think of the term ln(1/q_n) as the complexity / description length of the function f_n.

• Note that we must put a prior weight on the functions. If the functions are drawn at random according to a distribution p_n, the expected generalisation bound will be minimal if we choose our prior q = p.

• This is the starting point of the PAC-Bayes analysis.
What if uncountably many functions?
• We need a way to convert an infinite set to a finite one.

• The key idea is to replace measuring performance on a random test point with measuring it on a second ‘ghost’ sample.

• In this way the analysis is reduced to a finite set of examples and hence a finite set of classification functions.

• This step is often referred to as the ‘double sample trick’.
Double sample trick

The result has the following form:

P^m{X ∈ X^m : ∃h ∈ H : err_X(h) = 0, err(h) ≥ ε}
  ≤ 2 P^{2m}{XY ∈ X^{2m} : ∃h ∈ H : err_X(h) = 0, err_Y(h) ≥ ε/2}.

If we think of the first probability as being over XY, the result concerns three events:

A(h) := {err_X(h) = 0}
B(h) := {err(h) ≥ ε}
C(h) := {err_Y(h) ≥ ε/2}
Double sample trick II
It is clear that

P^{2m}(C(h) | A(h) & B(h)) = P^{2m}(C(h) | B(h)) > 0.5

for reasonable m, by a binomial tail bound.
Double sample trick II cont.

Hence, we have

P^{2m}{XY ∈ X^{2m} : ∃h ∈ H : A(h) & C(h)}
  ≥ P^{2m}{XY ∈ X^{2m} : ∃h ∈ H : A(h) & B(h) & C(h)}
  = P^{2m}{XY ∈ X^{2m} : ∃h ∈ H : A(h) & B(h)} · P^{2m}(C(h) | A(h) & B(h)).

It follows that

P^m{X ∈ X^m : ∃h ∈ H : A(h) & B(h)} ≤ 2 P^{2m}{XY ∈ X^{2m} : ∃h ∈ H : A(h) & C(h)},

the required result.
How many functions on a finite sample?

• Let H be a set of {−1, 1}-valued functions.

• The growth function B_H(m) is the maximum cardinality of the set of functions in H when restricted to m points – note that this cannot be larger than 2^m, i.e. log_2(B_H(m)) ≤ m.

• For the statistics to work we want the number of functions to be much smaller than this, as we will perform a union bound over this number.
Examining the growth function
Consider a plot of the ratio of the growth function B_H(m) to 2^m for linear functions in a 20-dimensional space:

[Plot removed: the ratio stays at 1 for small m and then decays towards 0 as m increases to 100.]
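The shape of this curve can be reproduced with Cover's function-counting theorem (not stated on the slide; we use the homogeneous-separator count in R^20 as an illustration):

```python
from math import comb

def cover_count(m, n):
    """Cover's count of the dichotomies of m points in general position
    in R^n realised by homogeneous linear separators:
    C(m, n) = 2 * sum_{k < n} binom(m - 1, k)."""
    return 2 * sum(comb(m - 1, k) for k in range(n))

n = 20
ratios = [cover_count(m, n) / 2 ** m for m in range(1, 101)]
```

The ratio is exactly 1 while m ≤ n, since then every dichotomy of the points is realisable, and it falls rapidly towards 0 thereafter.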
Vapnik Chervonenkis dimension
• The Vapnik-Chervonenkis dimension is the point at which the ratio stops being equal to 1:

VCdim(H) = max{m : ∃ x_1, . . . , x_m such that ∀ b ∈ {−1, 1}^m ∃ h_b ∈ H with h_b(x_i) = b_i}

• For linear functions L in R^n, VCdim(L) = n + 1.
Sauer’s Lemma
• Sauer’s Lemma (also due to Vapnik–Chervonenkis and to Shelah):

B_H(m) ≤ Σ_{i=0}^{d} (m choose i) ≤ (em/d)^d,

where m ≥ d = VCdim(H).

• This is surprising, as the rate of growth of B_H(m) up to m = d is exponential – but thereafter it is polynomial with exponent d.

• Furthermore, there is no assumption made about the functions being linear – this is a completely general result.
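A quick numerical check of both inequalities (the values of m and d are arbitrary):

```python
from math import comb, e

def sauer_sides(m, d):
    """The binomial-sum and polynomial caps in Sauer's lemma, m >= d."""
    growth_cap = sum(comb(m, i) for i in range(d + 1))
    poly_cap = (e * m / d) ** d
    return growth_cap, poly_cap

checks = [(m, *sauer_sides(m, 10)) for m in (10, 20, 50, 100)]
```

For m = 100 and d = 10 the cap is about 1.9e13, minute compared with 2^100 ≈ 1.3e30, which is what makes the union bound usable.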
Basic Theorem of SLT
We want to bound the probability that the training examples can mislead us about one of the functions we are considering using:

P^m{X ∈ X^m : ∃h ∈ H : err_X(h) = 0, err(h) ≥ ε}
  ≤ 2 P^{2m}{XY ∈ X^{2m} : ∃h ∈ H : err_X(h) = 0, err_Y(h) ≥ ε/2}   (double sample trick)
  ≤ 2 B_H(2m) P^{2m}{XY ∈ X^{2m} : err_X(h) = 0, err_Y(h) ≥ ε/2}   (union bound)

The final ingredient is known as symmetrisation.
Symmetrisation
• Consider generating a 2m-sample S. Since the points are generated independently, the probability of generating the same set of points in a different order is the same.

• Consider a fixed set Σ of permutations; each time we generate a sample we randomly permute it with a uniformly chosen element of Σ – this gives a probability distribution P^{2m}_Σ.
Symmetrisation cont.
• Any event has equal probability under P^{2m} and P^{2m}_Σ, so that

P^{2m}(A) = P^{2m}_Σ(A) = E^{2m}[P_{σ∼Σ}(A)]

• Consider the particular choice of Σ: the permutations that swap or leave unchanged corresponding elements of the two samples X and Y – there are 2^m such permutations.
Completion of the proof
P^{2m}{XY ∈ X^{2m} : err_X(h) = 0, err_Y(h) ≥ ε/2}
  ≤ E^{2m}[P_{σ∼Σ}{err_X(h) = 0, err_Y(h) ≥ ε/2 for σ(XY)}]
  ≤ E^{2m}[2^{−εm/2}]
  = 2^{−εm/2}

• Setting the right-hand side equal to δ/(2 B_H(2m)) and inverting gives the bound on ε.
Final result

• Assembling the ingredients gives the result: with probability at least 1 − δ over random m-samples, the generalisation error of a function h ∈ H chosen from a class H with VC dimension d with zero training error is bounded by

ε = ε(m, H, δ) = (2/m)(d log(2em/d) + log(2/δ))

• Note that we can think of d as the complexity / capacity of the function class H.

• The bound does not distinguish between functions in H.
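Plugging numbers into this bound (we take base-2 logarithms, which match the 2^(−εm/2) step in the proof; the parameter values are illustrative):

```python
import math

def vc_bound(m, d, delta):
    """Zero-training-error VC bound from the slide:
    eps(m, H, delta) = (2/m) * (d * log2(2*e*m/d) + log2(2/delta))."""
    return (2.0 / m) * (d * math.log2(2 * math.e * m / d)
                        + math.log2(2.0 / delta))

eps_base = vc_bound(m=1000, d=10, delta=0.05)
eps_more_data = vc_bound(m=10000, d=10, delta=0.05)
eps_richer = vc_bound(m=1000, d=50, delta=0.05)
```

More data tightens the bound, while a richer class (larger d) loosens it, exactly the capacity/fit trade-off the next slides exploit.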
Lower bounds
• VCdim characterises learnability in the PAC setting: there exist distributions such that, with probability at least δ over m random examples, the error of h is at least

max( (d − 1)/(32m), (1/m) log(1/δ) ).
Non-zero training error
• Very similar results can be obtained for non-zero training error.

• The main difference is the introduction of a square root, to give a bound of the form

ε(m, H, k, δ) = k/m + O( √((d/m) log(2em/d)) + √((1/m) log(2/δ)) )

for k training errors, which is significantly worse than in the zero training error case.

• PAC-Bayes bounds now interpolate between these two.
Structural Risk Minimisation

• The way to differentiate between functions using the VC result is to create a hierarchy of classes

H_1 ⊆ H_2 ⊆ · · · ⊆ H_d ⊆ · · · ⊆ H_K

of increasing complexity/VC dimension.

• We can now find the function in each class with minimum empirical error k_d and choose between the classes by minimising over the choice of d:

ε(m, d, δ) = k_d/m + O( √((d/m) log(2em/d)) + √((1/m) log(2K/δ)) ),

which bounds the generalisation by a further application of the union bound over the K classes.
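The selection step can be sketched as follows (the hierarchy, the training-error counts and the O(·) constants below are all hypothetical):

```python
import math

def srm_criterion(d, k_d, m, K, delta):
    """Empirical error rate plus the complexity penalty of the bound,
    with the O(.) constants taken to be 1 for illustration."""
    penalty = (math.sqrt(d / m * math.log(2 * math.e * m / d))
               + math.sqrt(math.log(2 * K / delta) / m))
    return k_d / m + penalty

# Hypothetical hierarchy: richer classes make fewer training errors.
m, K, delta = 1000, 10, 0.05
train_errors = {1: 120, 2: 60, 4: 25, 8: 12, 16: 10, 32: 9}
best_d = min(train_errors,
             key=lambda d: srm_criterion(d, train_errors[d], m, K, delta))
```

On these numbers the criterion picks a small class rather than the one with the fewest training errors, which is exactly the point of SRM: trading fit against capacity.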
Criticisms of PAC Theory
• The theory is certainly valid, and the lower bounds indicate that it is not too far out – so it cannot be criticised as it stands.

• The criticism is that it doesn’t accord with the experience of those applying learning.

• There is a mismatch between theory and practice.

• For example:
Support Vector Machines (SVM)
One example of PAC failure is in analysing SVMs: linear functions in very high dimensional feature spaces.

1. the kernel trick means we can work in an infinite dimensional feature space (⇒ infinite VC dimension), so that the VC result does not apply;

2. and YET very impressive performance.
Support Vector Machines cont.
• An SVM seeks a linear function in a feature space defined implicitly via a kernel κ:
κ(x, z) = 〈φ(x), φ(z)〉
• For example the 1-norm SVM seeks w to solve
min over w, b, ξ:  ‖w‖² + C ∑ᵢ₌₁ᵐ ξᵢ

subject to  yᵢ(〈w, φ(xᵢ)〉 + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0,  i = 1, …, m.
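The optimisation problem above can be illustrated numerically. Below is a minimal sketch (not the solver used in practice, where QP or SMO packages are standard): plain subgradient descent on the equivalent hinge-loss objective ‖w‖² + C ∑ᵢ max(0, 1 − yᵢ(〈w, xᵢ〉 + b)), with φ taken to be the identity and synthetic, linearly separable data.

```python
import numpy as np

# Subgradient descent on the soft-margin SVM objective
#   ||w||^2 + C * sum_i max(0, 1 - y_i (<w, x_i> + b)),
# i.e. the formulation on the slide with phi = identity.
# All data here is synthetic and purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)  # separable labels

w, b, C, eta = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w + b)
    viol = margins < 1                      # points with non-zero hinge loss
    grad_w = 2 * w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w, b = w - eta * grad_w, b - eta * grad_b

train_acc = np.mean(np.sign(X @ w + b) == y)
```

Since the labels are sign(x₁ + x₂), the learned weight vector should align with the (1, 1) direction and classify the training set almost perfectly.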
Margin in SVMs
• The intuition behind SVMs is that maximising the margin makes it possible to obtain good generalisation despite the high VC dimension.

• The lower bound implies that we must be taking advantage of a benign distribution, since we know that in the worst case generalisation will be bad.

• Structural Risk Minimisation cannot be applied to a hierarchy determined by the margin of different classifiers, since the margin is not known until we see the data, while the SRM hierarchy must be chosen in advance.
Margin in SVMs cont
• Hence, we require a theory that can give bounds that are sensitive to serendipitous distributions – in particular we conjecture that the margin is an indication of such ‘luckiness’.

• The proof approach will rely on using real-valued function classes. The margin gives an indication of the accuracy with which we need to approximate the functions when applying the statistics.
Covering Numbers
Let F be a class of real functions defined on X and ‖ · ‖d a norm on F. Then

N(γ, F, ‖ · ‖d)

is the size of the smallest set Uγ such that for every f ∈ F there is a u ∈ Uγ with ‖f − u‖d < γ.
Covering Numbers cont.
For generalisation bounds we need the γ-growth function

Nm(γ, F) := supX∈Xm N(γ, F, ℓX∞),

where ℓX∞ gives the distance between two functions as the maximum difference between their outputs on the sample X.
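For intuition, an ℓ∞ covering number on a fixed sample can be computed exactly for a tiny class. The sketch below (a toy example, not from the slides) brute-forces the smallest cover drawn from a finite class F of threshold functions itself – a proper cover, which upper-bounds N(γ, F, ℓX∞).

```python
import itertools
import numpy as np

# Brute-force l_infinity covering number of a small finite class F on a
# fixed sample X, with cover centres restricted to F (a proper cover).
X = np.linspace(0, 1, 5)                                 # sample x_1..x_5
thresholds = np.linspace(0, 1, 9)
F = [np.where(X >= t, 1.0, 0.0) for t in thresholds]     # threshold functions

def covering_number(F, gamma):
    """Smallest U in F such that every f in F has u with max_i |f-u| < gamma."""
    for size in range(1, len(F) + 1):
        for U in itertools.combinations(range(len(F)), size):
            if all(any(np.max(np.abs(F[f] - F[u])) < gamma for u in U)
                   for f in range(len(F))):
                return size
    return len(F)

n_half = covering_number(F, 0.5)   # distinct behaviours on the sample
n_big = covering_number(F, 1.5)    # gamma exceeds the output range
```

On five sample points the nine thresholds realise only five distinct behaviours, so the γ = 0.5 cover needs five centres, while γ = 1.5 is covered by a single function.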
Covering numbers for linear functions
• Consider the set of functions

{x ↦ 〈w, φ(x)〉 : w = (1/Z) ∑ᵢ₌₁ᵐ zᵢφ(xᵢ), zᵢ ∈ ℤ, Z = ∑ᵢ₌₁ᵐ zᵢ ≠ 0, ∑ᵢ₌₁ᵐ |zᵢ| ≤ 8R²/γ²}.

• This is an explicit cover that approximates the output of any norm-1 linear function on the sample {x₁, …, xₘ} to within γ/2.
Covering numbers for linear functions

• We convert the problem of γ/2 approximation on the sample into a classification problem, which is solvable with a margin of γ/2.

• It follows that if we use the perceptron algorithm to find a classifier, we will find a function satisfying the γ/2 approximation with just 8R²/γ² updates.

• This gives a sparse dual representation of the function. The covering is chosen as the set of functions with small sparse dual representations.
• This gives a bound on the size of the cover:

log₂ N2m(γ/2, F) ≤ k log₂(e(2m + k − 1)/k),  where k = 8R²/γ².
Second statistical result
• We want to bound the probability that the training examples can mislead us about one of the functions with margin bigger than a fixed γ:

Pm{X ∈ Xm : ∃f ∈ F : errX(f) = 0, mX(f) ≥ γ, errP(f) ≥ ε}
≤ 2P2m{XY ∈ X2m : ∃f ∈ F such that errX(f) = 0, mX(f) ≥ γ, errY(f) ≥ ε/2}
≤ 2N2m(γ/2, F) P2m{XY ∈ X2m : for fixed f′, mX(f′) > γ/2, mY(εm/2)(f′) < γ/2}
≤ 2N2m(γ/2, F) 2^(−εm/2) ≤ δ
Second statistical result cont.
• Inverting gives

ε = ε(m, F, δ, γ) = (2/m)(log₂ N2m(γ/2, F) + log₂(2/δ)),

i.e. with probability 1 − δ over m random examples a margin-γ hypothesis has error less than ε. This must be applied for a finite set of γ (‘do SRM over γ’).
Bounding the covering numbers
We have the following correspondences with the standard VC case (easy slogans):
Growth function – γ-growth function
Vapnik Chervonenkis dim – Fat shattering dim
Sauer’s Lemma – Alon et al.
Generalization of SVMs
• Since SVMs are linear functions, we can apply the linear function bound.

• Hence for a distribution with support in a ball of radius R (e.g. Gaussian kernels, R = 1) and margin γ, we have the bound

ε(m, L, δ, γ) = (2/m)(k log₂(e(2m + k − 1)/k) + log₂(m/δ)),

where k = 8R²/γ².
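The bound above is dimension-free: only the ratio R/γ enters through k. A quick numerical sketch (illustrative values, not from the slides) makes the margin dependence concrete.

```python
import math

# Evaluating the covering-number generalisation bound from the slide:
#   eps(m, delta, gamma) = (2/m) * (k*log2(e*(2m+k-1)/k) + log2(m/delta)),
# with k = ceil(8 R^2 / gamma^2).  Note no feature-space dimension appears.
def svm_cover_bound(m, R, gamma, delta=0.05):
    k = math.ceil(8 * R**2 / gamma**2)
    return (2.0 / m) * (k * math.log2(math.e * (2 * m + k - 1) / k)
                        + math.log2(m / delta))

loose = svm_cover_bound(m=1000, R=1.0, gamma=0.1)   # small margin: vacuous
tight = svm_cover_bound(m=1000, R=1.0, gamma=1.0)   # large margin: non-trivial
```

With γ = 0.1 the bound exceeds 1 (vacuous), while γ = 1 gives a meaningful error estimate at the same sample size, illustrating the benign-distribution effect of a large margin.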
Error distributions

[Figures: histograms of the observed test error (horizontal axis 0–1) over random splits, for dataset sizes 205, 137, 68, 34, 27, 20, 14 and 7.]
Using a kernel
• We can consider much higher dimensional spaces using the kernel trick.

• We can even work in infinite-dimensional spaces, e.g. using the Gaussian kernel:

κ(x, z) = exp(−‖x − z‖²/(2σ²))
[Figures: histograms of the observed test error (horizontal axis 0–1) over random splits using the Gaussian kernel, for dataset sizes 342 and 273.]
Conclusions
• Outline of philosophy and approach of SLT
• Central result of SLT
• Touched on covering number bounds for margin-based analysis
• Application to analyse SVM learning
STRUCTURE
PART A
1. General Statistical Considerations
2. Basic PAC Ideas and proofs
3. Real-valued Function Classes and the Margin
PART B
1. Rademacher complexity and Main Theory
2. Applications to classification
3. Conclusions
PART B
Concentration inequalities
• Statistical Learning is concerned with the reliability or stability of inferences made from a random sample.

• Random variables with this property have been a subject of ongoing interest to probabilists and statisticians.
Concentration inequalities cont.
• As an example consider the mean of a sample of m 1-dimensional random variables X₁, …, Xₘ:

Sm = (1/m) ∑ᵢ₌₁ᵐ Xᵢ.

• Hoeffding’s inequality states that if Xᵢ ∈ [aᵢ, bᵢ] then

P{|Sm − E[Sm]| ≥ ε} ≤ 2 exp(−2m²ε²/∑ᵢ₌₁ᵐ (bᵢ − aᵢ)²).

Note how the probability falls off exponentially with the distance from the mean and with the number of variables.
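Hoeffding's inequality is easy to check by simulation. The sketch below (an illustration, not part of the slides) compares the empirical deviation frequency of the mean of m uniform [0, 1] variables with the bound, which in this case (aᵢ = 0, bᵢ = 1) reads 2 exp(−2mε²).

```python
import numpy as np

# Empirical check of Hoeffding's inequality for the mean of m
# uniform [0, 1] variables: P(|S_m - E S_m| >= eps) <= 2 exp(-2 m eps^2).
rng = np.random.default_rng(1)
m, eps, trials = 100, 0.1, 20000
S = rng.uniform(0, 1, size=(trials, m)).mean(axis=1)   # 20000 sample means
empirical = np.mean(np.abs(S - 0.5) >= eps)            # observed deviation rate
bound = 2 * np.exp(-2 * m * eps**2)
```

The empirical frequency sits far below the bound: Hoeffding holds for any bounded distribution, so it is loose for any particular one.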
Concentration for SLT
• We are now going to look at deriving SLT results from concentration inequalities.

• Perhaps the best known form is due to McDiarmid (although he was actually re-presenting previously derived results):
McDiarmid’s inequality

Theorem 1. Let X₁, …, Xₙ be independent random variables taking values in a set A, and assume that f : Aⁿ → ℝ satisfies

sup over x₁, …, xₙ, x̂ᵢ ∈ A of |f(x₁, …, xₙ) − f(x₁, …, xᵢ₋₁, x̂ᵢ, xᵢ₊₁, …, xₙ)| ≤ cᵢ,

for 1 ≤ i ≤ n. Then for all ε > 0,

P{f(X₁, …, Xₙ) − Ef(X₁, …, Xₙ) ≥ ε} ≤ exp(−2ε²/∑ᵢ₌₁ⁿ cᵢ²)
• Hoeffding is the special case obtained when f(x₁, …, xₙ) = Sₙ
Using McDiarmid
• By setting the right-hand side equal to δ, we can always invert McDiarmid to get a high-confidence bound: with probability at least 1 − δ,

f(X₁, …, Xₙ) < Ef(X₁, …, Xₙ) + √((∑ᵢ₌₁ⁿ cᵢ²/2) log(1/δ))

• If cᵢ = c/n for each i this reduces to

f(X₁, …, Xₙ) < Ef(X₁, …, Xₙ) + √((c²/(2n)) log(1/δ))
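The inversion above is a one-line calculation; the sketch below (illustrative values) computes the deviation term both ways and confirms that the cᵢ = c/n case agrees with the general formula.

```python
import math

# Inverting McDiarmid: set exp(-2 eps^2 / sum_i c_i^2) = delta and solve,
# giving eps = sqrt((sum_i c_i^2 / 2) * log(1/delta)).
def mcdiarmid_deviation(c, delta):
    """eps such that P(f - Ef >= eps) <= delta for bounded differences c."""
    return math.sqrt(sum(ci**2 for ci in c) / 2 * math.log(1 / delta))

n, delta = 1000, 0.01
general = mcdiarmid_deviation([1.0 / n] * n, delta)        # c_i = 1/n each
special = math.sqrt((1.0 / (2 * n)) * math.log(1 / delta)) # c = 1 reduction
```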
Rademacher complexity
• Rademacher complexity is a new way of measuring the complexity of a function class. It arises naturally if we rerun the proof using the double sample trick and symmetrisation, but look at what is actually needed to continue the proof:
Rademacher proof beginnings

For a fixed f ∈ F we have

E[f(z)] ≤ Ê[f(z)] + sup over h ∈ F of (E[h] − Ê[h]),

where F is a class of functions mapping from Z to [0, 1] and Ê denotes the sample average.

We must bound the size of the second term. First apply McDiarmid’s inequality (with cᵢ = 1/m for all i) to obtain, with probability at least 1 − δ:

sup over h ∈ F of (E[h] − Ê[h]) ≤ ES[sup over h ∈ F of (E[h] − Ê[h])] + √(ln(1/δ)/(2m)).
Deriving double sample result
• We can now move to the ghost sample S̃ = {z̃₁, …, z̃ₘ} by simply observing that E[h] = ES̃[Ê[h]]:

ES[sup over h ∈ F of (E[h] − Ê[h])] = ES[sup over h ∈ F of ES̃[(1/m) ∑ᵢ₌₁ᵐ h(z̃ᵢ) − (1/m) ∑ᵢ₌₁ᵐ h(zᵢ) | S]]
Deriving double sample result cont.
Since the sup of an expectation is less than or equal to the expectation of the sup (we can make the choice to optimise for each S̃) we have

ES[sup over h ∈ F of (E[h] − Ê[h])] ≤ ES ES̃[sup over h ∈ F of (1/m) ∑ᵢ₌₁ᵐ (h(z̃ᵢ) − h(zᵢ))]
Adding symmetrisation

Here symmetrisation is again just swapping corresponding elements – but we can write this as multiplication by a variable σᵢ which takes values ±1 with equal probability:

ES[sup over h ∈ F of (E[h] − Ê[h])]
≤ EσSS̃[sup over h ∈ F of (1/m) ∑ᵢ₌₁ᵐ σᵢ(h(z̃ᵢ) − h(zᵢ))]
≤ 2 ESσ[sup over h ∈ F of |(1/m) ∑ᵢ₌₁ᵐ σᵢh(zᵢ)|]
= Rm(F),
Rademacher complexity

where

Rm(F) = ESσ[sup over f ∈ F of |(2/m) ∑ᵢ₌₁ᵐ σᵢf(zᵢ)|]

is known as the Rademacher complexity of the function class F.

• Rademacher complexity is the expected value of the maximal correlation with random noise – a very natural measure of capacity.

• Note that the Rademacher complexity is distribution dependent since it involves an expectation over the choice of sample – this might seem hard to compute.
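For a fixed sample, the inner expectation over σ can simply be estimated by Monte Carlo. The sketch below (a toy example, not from the slides) does this for a small finite class given by its value matrix on the sample; since the functions are ±1-valued, the result is trivially at most 2.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity
#   E_sigma sup_{f in F} | (2/m) sum_i sigma_i f(z_i) |
# for a finite class represented by its values on the sample.
rng = np.random.default_rng(2)
m = 50
values = rng.choice([-1.0, 1.0], size=(10, m))   # 10 random +/-1 functions

def empirical_rademacher(values, n_sigma=2000, rng=rng):
    m = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
    # |correlation| of every function with every noise vector, scaled by 2/m
    corr = np.abs(sigma @ values.T) * (2.0 / m)   # shape (n_sigma, n_funcs)
    return corr.max(axis=1).mean()                # E_sigma of sup_f

R_hat = empirical_rademacher(values)
```

For 10 random functions on 50 points the estimate lands well inside (0, 2), reflecting that random ±1 functions correlate with noise only at the √m scale.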
Main Rademacher theorem
Putting the pieces together gives the main theorem of Rademacher complexity: with probability at least 1 − δ over random samples S of size m, every f ∈ F satisfies

E[f(z)] ≤ Ê[f(z)] + Rm(F) + √(ln(1/δ)/(2m))
Empirical Rademacher theorem
• Since the empirical Rademacher complexity

R̂m(F) = Eσ[sup over f ∈ F of |(2/m) ∑ᵢ₌₁ᵐ σᵢf(zᵢ)| | z₁, …, zₘ]

is concentrated, we can make a further application of McDiarmid to obtain, with probability at least 1 − δ,

ED[f(z)] ≤ Ê[f(z)] + R̂m(F) + 3√(ln(2/δ)/(2m)).
Relation to VC theorem
• For H a class of ±1-valued functions with VC dimension d, we can upper bound Rm(H) using Hoeffding’s inequality to bound

P{|∑ᵢ₌₁ᵐ σᵢf(xᵢ)| ≥ ε} ≤ 2 exp(−ε²/(2m))

for a fixed function f, since the expected value of the sum is 0, and the maximum change from replacing a σᵢ is 2.
Relation to VC theorem cont.
• By Sauer’s lemma there are at most (em/d)^d distinct functions on the sample, so we can bound the probability that the sum exceeds ε for some function by

(em/d)^d · 2 exp(−ε²/(2m)) =: δ.

• Taking δ = m⁻¹ and solving for ε gives

ε = √(2md ln(em/d) + 2m ln(2m))
Rademacher bound for VC class
• Hence we can bound

Rm(H) ≤ (2/m)(δm + ε(1 − δ)) ≤ 2/m + √(8(d ln(em/d) + ln(2m))/m)

• This is equivalent to the PAC bound with non-zero loss, except that we could have used the growth function or VC dimension measured on the sample rather than the sup over the whole input space.
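Evaluating the bound numerically (illustrative values) shows the familiar ~√(d log m / m) decay:

```python
import math

# The Rademacher bound for a VC class from the slide:
#   R_m(H) <= 2/m + sqrt(8 (d ln(e m / d) + ln(2 m)) / m).
def rademacher_vc_bound(m, d):
    return 2.0 / m + math.sqrt(8 * (d * math.log(math.e * m / d)
                                    + math.log(2 * m)) / m)

b1 = rademacher_vc_bound(m=1000, d=10)
b2 = rademacher_vc_bound(m=100000, d=10)   # 100x more data
```

With d = 10, a hundred-fold increase in sample size shrinks the complexity term by roughly an order of magnitude, as the √(1/m) rate predicts.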
Application to large margin classification

• Rademacher complexity comes into its own for Boosting and SVMs.
Application to Boosting
• We can view Boosting as seeking a function from the class (H is the set of weak learners)

{∑ over h ∈ H of aₕh(x) : aₕ ≥ 0, ∑ over h ∈ H of aₕ ≤ B} = convB(H)

by minimising some function of the margin distribution.

• Adaboost corresponds to optimising an exponential function of the margin over this set of functions.

• We will see how to include the margin in the analysis later, but concentrate on computing the Rademacher complexity for now.
Rademacher complexity of convex hulls

Rademacher complexity has a very nice property for convex hull classes:

Rm(convB(H)) = (2/m) Eσ[sup over hⱼ ∈ H, ∑ⱼ aⱼ ≤ B of |∑ᵢ₌₁ᵐ σᵢ ∑ⱼ aⱼhⱼ(xᵢ)|]
≤ (2/m) Eσ[sup over hⱼ ∈ H, ∑ⱼ aⱼ ≤ B of ∑ⱼ aⱼ |∑ᵢ₌₁ᵐ σᵢhⱼ(xᵢ)|]
≤ (2/m) Eσ[sup over h ∈ H of B |∑ᵢ₌₁ᵐ σᵢh(xᵢ)|]
≤ B Rm(H).
Rademacher complexity of convex hulls cont.

• Hence, we can move to the convex hull without incurring any complexity penalty for B = 1!
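The B = 1 case of the chain above can be observed directly in simulation. The sketch below (a toy illustration, not from the slides) compares a Monte Carlo estimate over random convex combinations of a small base class with the estimate for the base class itself, using the same noise draws.

```python
import numpy as np

# Empirical illustration that R_m(conv_1(H)) <= R_m(H): per noise vector,
# |sum_i sigma_i sum_j a_j h_j(x_i)| <= max_j |sum_i sigma_i h_j(x_i)|
# whenever a_j >= 0 and sum_j a_j <= 1.
rng = np.random.default_rng(3)
m, n_funcs, n_sigma = 40, 6, 1000
H = rng.choice([-1.0, 1.0], size=(n_funcs, m))          # base class on sample
weights = rng.dirichlet(np.ones(n_funcs), size=200)     # rows sum to 1
hull = weights @ H                                      # points of conv_1(H)

sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
R_H = (np.abs(sigma @ H.T) * 2 / m).max(axis=1).mean()
R_hull = (np.abs(sigma @ hull.T) * 2 / m).max(axis=1).mean()
```

Because the inequality holds for every individual σ, the sampled hull estimate never exceeds the base-class estimate, matching the slide's conclusion.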
Rademacher complexity for SVMs
• The Rademacher complexity of a class of linear functions with bounded 2-norm:

{x ↦ ∑ᵢ₌₁ᵐ αᵢκ(xᵢ, x) : α′Kα ≤ B²} ⊆ {x ↦ 〈w, φ(x)〉 : ‖w‖ ≤ B} = FB,

where we assume a kernel-defined feature space with

〈φ(x), φ(z)〉 = κ(x, z).
Rademacher complexity of FB

Rm(FB) = Eσ[sup over f ∈ FB of |(2/m) ∑ᵢ₌₁ᵐ σᵢf(xᵢ)|]
= Eσ[sup over ‖w‖ ≤ B of |〈w, (2/m) ∑ᵢ₌₁ᵐ σᵢφ(xᵢ)〉|]
≤ (2B/m) Eσ[‖∑ᵢ₌₁ᵐ σᵢφ(xᵢ)‖]
= (2B/m) Eσ[〈∑ᵢ₌₁ᵐ σᵢφ(xᵢ), ∑ⱼ₌₁ᵐ σⱼφ(xⱼ)〉^(1/2)]
≤ (2B/m) (Eσ[∑ᵢ,ⱼ₌₁ᵐ σᵢσⱼκ(xᵢ, xⱼ)])^(1/2)
= (2B/m) √(∑ᵢ₌₁ᵐ κ(xᵢ, xᵢ))
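For the linear kernel κ(x, z) = 〈x, z〉 the sup over ‖w‖ ≤ B is exactly B‖(2/m) ∑ᵢ σᵢxᵢ‖, so the left-hand side can be averaged directly and compared with the trace bound. The sketch below (synthetic data, illustrative only) does this.

```python
import numpy as np

# Checking  R_m(F_B) <= (2B/m) sqrt(sum_i kappa(x_i, x_i))  for the linear
# kernel: the sup over ||w|| <= B of |<w, v>| equals B ||v||, so the
# Rademacher average can be estimated exactly by Monte Carlo over sigma.
rng = np.random.default_rng(4)
m, d, B, n_sigma = 60, 5, 1.0, 4000
X = rng.normal(size=(m, d))

sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
R_mc = np.mean(B * np.linalg.norm((2.0 / m) * sigma @ X, axis=1))
trace_bound = (2.0 * B / m) * np.sqrt((X**2).sum())   # sum_i kappa(x_i, x_i)
```

The Monte Carlo value sits just below the bound: the only slack in the derivation is Jensen's inequality between E‖·‖ and √(E‖·‖²), which is small here.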
Applying to 1-norm SVMs

We take the following formulation of the 1-norm SVM:

min over w, b, γ, ξ:  −γ + C ∑ᵢ₌₁ᵐ ξᵢ
subject to  yᵢ(〈w, φ(xᵢ)〉 + b) ≥ γ − ξᵢ, ξᵢ ≥ 0, i = 1, …, m, and ‖w‖² = 1.   (1)

Note that

ξᵢ = (γ − yᵢg(xᵢ))₊,

where g(·) = 〈w, φ(·)〉 + b.

• The first step is to introduce a loss function which upper bounds the discrete loss

P(y ≠ sgn(g(x))) = E[H(−yg(x))],

where H is the Heaviside function.
Applying the Rademacher theorem
• Consider the loss function A : ℝ → [0, 1], given by

A(a) = 1, if a > 0;  1 + a/γ, if −γ ≤ a ≤ 0;  0, otherwise.

• By the Rademacher Theorem and since the loss function A − 1 dominates H − 1, we have that

E[H(−yg(x)) − 1] ≤ E[A(−yg(x)) − 1]
≤ Ê[A(−yg(x)) − 1] + Rm((A − 1) ∘ F) + 3√(ln(2/δ)/(2m)).
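The ramp loss A is a one-liner, and its two key properties – it dominates the Heaviside loss and is 1/γ-Lipschitz – can be checked numerically:

```python
import numpy as np

# The ramp loss A from the slide: A(a) = 1 for a > 0, 1 + a/gamma on
# [-gamma, 0], and 0 below; equivalently clip(1 + a/gamma, 0, 1).
def ramp(a, gamma):
    return np.clip(1.0 + a / gamma, 0.0, 1.0)

def heaviside(a):
    return (a >= 0).astype(float)

gamma = 0.5
a = np.linspace(-2, 2, 401)
A, H = ramp(a, gamma), heaviside(a)
lip = np.max(np.abs(np.diff(A)) / np.diff(a))   # empirical Lipschitz constant
```

Dominance (A ≥ H pointwise) is what lets the ramp's expectation upper bound the misclassification probability, and the Lipschitz constant 1/γ is exactly what the contraction argument below needs.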
Empirical loss and slack variables
• But A(−yᵢg(xᵢ)) ≤ ξᵢ/γ for i = 1, …, m, and so

E[H(−yg(x))] ≤ (1/(mγ)) ∑ᵢ₌₁ᵐ ξᵢ + Rm((A − 1) ∘ F) + 3√(ln(2/δ)/(2m)).

• The final missing ingredient to complete the bound is to bound Rm((A − 1) ∘ F) in terms of Rm(F).

• This will require a more detailed look at Rademacher complexity.
Rademacher complexity bounds
• First simple observation: for a ∈ ℝ, Rm(aF) = |a|Rm(F), since af is the function achieving the sup for some σ for aF iff f achieves the sup for F.

• We are interested in bounding Rm(L ∘ F) ≤ 2L·Rm(F) for the class L ∘ F = {L ∘ f : f ∈ F}, where L satisfies L(0) = 0 and

|L(a) − L(b)| ≤ L|a − b|,

i.e. L is a Lipschitz function with constant L > 0.
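The contraction property can be illustrated empirically before proving it. The sketch below (a toy example, not from the slides) uses the 1-Lipschitz map L(a) = clip(a, −1/2, 1/2), which satisfies L(0) = 0, and compares Monte Carlo estimates of the two sides.

```python
import numpy as np

# Empirical illustration of the contraction bound R_m(L o F) <= 2 L R_m(F)
# with the 1-Lipschitz function L(a) = clip(a, -0.5, 0.5), L(0) = 0.
rng = np.random.default_rng(5)
m, n_funcs, n_sigma = 30, 8, 3000
F_vals = rng.uniform(-1, 1, size=(n_funcs, m))   # finite class on the sample
L_vals = np.clip(F_vals, -0.5, 0.5)              # L o F applied pointwise

sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
R_F = (np.abs(sigma @ F_vals.T) * 2 / m).max(axis=1).mean()
R_LF = (np.abs(sigma @ L_vals.T) * 2 / m).max(axis=1).mean()
```

Clipping shrinks the function values, so the composed class correlates less with noise and sits comfortably below twice the original complexity.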
Rademacher complexity bounds cont.
• In our case the Lipschitz function is $L = A - 1$ and the constant is $L = 1/\gamma$.

• By the above it is sufficient to prove the case $L = 1$ only, since then

\[
R_m(L \circ F) = L\,R_m((L/L)\circ F) \le 2L\,R_m(F).
\]
Proof for contraction (L = 1)
We want to get rid of the absolute value in the Rademacher complexity:

\[
R_m(F) = \mathbb{E}_\sigma\left[\sup_{f\in F}\left|\frac{2}{m}\sum_{i=1}^{m}\sigma_i f(z_i)\right|\right],
\]

so consider defining

\[
L^+ = L, \qquad L^-(a) = -L(-a).
\]
Proof for contraction (L = 1)
Assume $F$ is closed under negation. Now if for some $\sigma$ the sup is achieved with $f$ such that

\[
\sum_{i=1}^{m}\sigma_i L(f(z_i)) < 0,
\]

then

\[
\left|\sum_{i=1}^{m}\sigma_i L(f(z_i))\right| = \sum_{i=1}^{m}\sigma_i L^-(-f(z_i)),
\]
Contraction proof cont.
• and so

\[
R_m(L \circ F) = \mathbb{E}_\sigma\left[\sup_{f\in F,\,N\in\{L^+, L^-\}}\frac{2}{m}\sum_{i=1}^{m}\sigma_i N(f(z_i))\right]
\]

• if we further assume $0 \in F$ we have

\[
R_m(L \circ F) \le \mathbb{E}_\sigma\left[\sup_{f\in F}\frac{2}{m}\sum_{i=1}^{m}\sigma_i L^+(f(z_i))\right] + \mathbb{E}_\sigma\left[\sup_{f\in F}\frac{2}{m}\sum_{i=1}^{m}\sigma_i L^-(f(z_i))\right]
\]
Contraction proof cont.
• Hence if we show the result without the factor of 2 for the complexity without absolute values, the desired result will follow, since for classes closed under negation we have

\[
R_m(F) = \mathbb{E}_\sigma\left[\sup_{f\in F}\frac{2}{m}\sum_{i=1}^{m}\sigma_i f(z_i)\right].
\]
Contraction proof cont.
• We show

\[
\mathbb{E}_\sigma\left[\sup_{f\in F}\sum_{i=1}^{m}\sigma_i L(f(x_i))\right] \le \mathbb{E}_\sigma\left[\sup_{f\in F}\left(\sigma_1 f(x_1) + \sum_{i=2}^{m}\sigma_i L(f(x_i))\right)\right]
\]
and apply induction to obtain the full result.
Contraction proof cont.
• We take the sign vectors in pairs:

\[
(1, \sigma_2, \ldots, \sigma_m) \quad\text{and}\quad (-1, \sigma_2, \ldots, \sigma_m).
\]

The result will follow if we show that for all $f, g \in F$ we can find $f', g' \in F$ such that (writing $f_i = f(x_i)$, etc.)

\[
L(f_1) + \sum_{i=2}^{m}\sigma_i L(f_i) - L(g_1) + \sum_{i=2}^{m}\sigma_i L(g_i) \le f'_1 + \sum_{i=2}^{m}\sigma_i L(f'_i) - g'_1 + \sum_{i=2}^{m}\sigma_i L(g'_i).
\]
Contraction proof end

• If $f_1 \ge g_1$ take $f' = f$ and $g' = g$ to reduce to showing

\[
L(f_1) - L(g_1) \le f_1 - g_1 = |f_1 - g_1|,
\]

which follows since $|L(f_1) - L(g_1)| \le |f_1 - g_1|$.

• Otherwise $f_1 < g_1$ and we take $f' = g$ and $g' = f$ to reduce to showing

\[
L(f_1) - L(g_1) \le g_1 - f_1 = |f_1 - g_1|,
\]

which again follows from $|L(f_1) - L(g_1)| \le |f_1 - g_1|$.
Final SVM bound
• Assembling the results we obtain:

\[
P(y \ne \mathrm{sgn}(g(x))) = \mathbb{E}\left[H(-yg(x))\right] \le \frac{1}{m\gamma}\sum_{i=1}^{m}\xi_i + \frac{4}{m\gamma}\sqrt{\sum_{i=1}^{m}\kappa(x_i, x_i)} + 3\sqrt{\frac{\ln(2/\delta)}{2m}}
\]

• Note that for the Gaussian kernel ($\kappa(x, x) = 1$) this reduces to

\[
P(y \ne \mathrm{sgn}(g(x))) \le \frac{1}{m\gamma}\sum_{i=1}^{m}\xi_i + \frac{4}{\sqrt{m}\,\gamma} + 3\sqrt{\frac{\ln(2/\delta)}{2m}}
\]
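Assembled as code, the bound is a one-liner to evaluate. This is a hedged sketch: the function name is mine, and it assumes the hinge-slack definition $\xi_i = (\gamma - y_i g(x_i))_+$:

```python
import numpy as np

def svm_margin_bound(margins, gamma, kappa_diag, delta=0.05):
    """Evaluate the final SVM bound of the slide:
    (1/(m gamma)) sum xi_i + (4/(m gamma)) sqrt(sum kappa(x_i, x_i))
      + 3 sqrt(ln(2/delta) / (2m)).
    `margins` holds y_i g(x_i); `kappa_diag` holds kappa(x_i, x_i)."""
    margins = np.asarray(margins, dtype=float)
    m = margins.size
    xi = np.maximum(0.0, gamma - margins)  # assumed slack definition
    return (xi.sum() / (m * gamma)
            + 4.0 * np.sqrt(np.sum(kappa_diag)) / (m * gamma)
            + 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * m)))

# Gaussian kernel: kappa(x, x) = 1, so the middle term is 4/(sqrt(m) gamma)
b = svm_margin_bound(np.ones(100), gamma=1.0, kappa_diag=np.ones(100))
```

With all margins at least $\gamma$ the slack term vanishes, leaving only the complexity and confidence terms.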
Final Boosting bound

• Applying a similar strategy for Boosting with the 1-norm of the slack variables we arrive at linear programming boosting, which minimises

\[
\sum_{h} a_h + C\sum_{i=1}^{m}\xi_i, \qquad\text{where } \xi_i = \Big(1 - y_i\sum_{h} a_h h(x_i)\Big)_+,
\]

• with corresponding bound:

\[
P(y \ne \mathrm{sgn}(g(x))) = \mathbb{E}\left[H(-yg(x))\right] \le \frac{1}{m}\sum_{i=1}^{m}\xi_i + R_m(H)\sum_{h} a_h + 3\sqrt{\frac{\ln(2/\delta)}{2m}}
\]
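The LP boosting objective is easy to state in code. This sketch covers the objective only, not the full linear program; the array conventions and function name are assumptions:

```python
import numpy as np

def lp_boost_objective(a, H, y, C):
    """Objective of linear programming boosting from the slide:
    sum_h a_h + C * sum_i xi_i  with  xi_i = (1 - y_i sum_h a_h h(x_i))_+.
    H is an (m, n_h) array of base-classifier outputs h(x_i); a >= 0."""
    xi = np.maximum(0.0, 1.0 - y * (H @ a))  # hinge slacks
    return a.sum() + C * xi.sum()

# two base classifiers evaluated on two points
H = np.array([[1.0, -1.0],
              [1.0,  1.0]])
y = np.array([1.0, -1.0])
obj = lp_boost_objective(np.array([1.0, 0.0]), H, y, C=1.0)
```

Because both the objective and the slack constraints are linear in $(a, \xi)$, the minimisation is a standard linear program.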
Conclusions
• Outline of philosophy and approach of SLT
• Central result of SLT
• Touched on covering number analysis for margin-based bounds
• Moved to consideration of Rademacher complexity.
• Applied Rademacher complexity to classification, giving bounds for two of the most effective classification algorithms: SVMs and Boosting
Where to find out more
Web Sites: www.support-vector.net (SV Machines)
www.kernel-methods.net (kernel methods)
www.kernel-machines.net (kernel Machines)
www.neurocolt.com
www.pascal-network.org