
An Introduction to Statistical Learning Theory

John Shawe-Taylor
Centre for Computational Statistics and Machine Learning
Department of Computer Science, UCL Engineering
University College London
jst@cs.ucl.ac.uk

June 2011

Lammhult Summer School, June 2011

STRUCTURE

PART A

1. General Statistical Considerations
2. Basic PAC Ideas and Proofs
3. Real-valued Function Classes and the Margin

PART B

1. Rademacher Complexity and Main Theory
2. Applications to Classification
3. Conclusions


Aim:

• Some thoughts on the link between theory and practice

• Basic techniques, with some deference to history

• Insights into proof techniques and statistical learning approaches

• Concentration inequalities and their relation to the Rademacher approach

What won’t be included:

• The most general results

• Complete History

• Analysis of Bayesian inference

• Most recent developments, e.g. PAC-Bayes, local Rademacher complexity, etc.

PART A


Statistical Learning Theory

• Statistical Learning Theory aims to analyse key quantities of interest to the practitioner:

– Bounds on performance on new data
– Bounds that hold for the particular training set, not an average over possible training sets
– Bounds that apply to the very flexible classes of functions used in practice
– Bounds that can drive improved learning algorithms
– Bounds that take into account all aspects of the data analysis

Theories of learning

• The aim of any theory is to model real/artificial phenomena so that we can better understand/predict/exploit them.

• SLT is just one approach to understanding/predicting/exploiting learning systems; others include Bayesian inference, inductive inference, statistical physics, and traditional statistical analysis.

Theories of learning cont.

• Each theory makes assumptions about the phenomenon of learning and based on these derives predictions of behaviour

– every predictive theory automatically implies a possible algorithm
– simply optimise the quantities that improve the predictions

• Each theory has strengths and weaknesses

– generally speaking, more assumptions mean potentially more accurate predictions
– BUT this depends on capturing the right aspects of the phenomenon

General statistical considerations

• Statistical models (not including Bayesian) begin with an assumption that the data is generated by an underlying distribution P, typically not given explicitly to the learner.

• If we are trying to classify cancerous tissue from healthy tissue, there are two distributions, one for cancerous cells and one for healthy ones.

General statistical considerations cont.

• Usually the distribution subsumes the processes of the natural/artificial world that we are studying.

• Rather than accessing the distribution directly, statistical learning typically assumes that we are given a 'training sample' or 'training set'

S = {(x1, y1), . . . , (xm, ym)}

generated independently and identically distributed (i.i.d.) according to the distribution P.

Generalisation of a learner

• Assume that we have a learning algorithm A that chooses a function AF(S) from a function space F in response to the training set S.

• From a statistical point of view the quantity of interest is the random variable:

ϵ(S,A,F) = E(x,y)[ℓ(AF(S), x, y)],

where ℓ is a 'loss' function that measures the discrepancy between AF(S)(x) and y.

Generalisation of a learner

• For example, in the case of classification ℓ is 1 if the two disagree and 0 otherwise, while for regression it could be the square of the difference between AF(S)(x) and y.

• We refer to the random variable ϵ(S,A,F) as the generalisation of the learner.

• Note that it is random because of the dependence on the training set S.

Example of Generalisation I

• We consider the Breast Cancer dataset from the UCI repository.

• Use the simple Parzen window classifier: the weight vector is

w+ − w−,

where w+ is the average of the positive training examples and w− is the average of the negative training examples. The threshold is set so the hyperplane bisects the line joining these two points.
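The difference-of-means classifier just described can be sketched in a few lines. The synthetic Gaussian data below is an illustrative stand-in for the UCI dataset, and all names are assumptions of this sketch:

```python
import numpy as np

def parzen_window_classifier(X_pos, X_neg):
    """Difference-of-means classifier: w = w+ - w-, with the threshold
    chosen so the hyperplane bisects the segment joining the two means."""
    w_plus = X_pos.mean(axis=0)
    w_minus = X_neg.mean(axis=0)
    w = w_plus - w_minus
    # the midpoint of the two class means lies on the decision boundary
    b = -w.dot((w_plus + w_minus) / 2)
    return lambda x: np.sign(x.dot(w) + b)

# toy data: two well-separated Gaussian clouds
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
clf = parzen_window_classifier(X_pos, X_neg)
```

On such well-separated data the classifier recovers both classes exactly; its interest here is that its expected weight vector has a simple closed form, which the next slides exploit.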

Example of Generalisation II

• Given a size m of the training set, by repeatedly drawing random training sets S we estimate the distribution of

ϵ(S,A,F) = E(x,y)[ℓ(AF(S), x, y)],

by using the test set error as a proxy for the true generalisation.

• We plot the histogram and the average of the distribution for various sizes of training set – initially the whole dataset gives a single value if we use all the examples for both training and test, but then we plot for training set sizes:

342, 273, 205, 137, 68, 34, 27, 20, 14, 7.

Example of Generalisation III

• Since the expected classifier is in all cases the same:

E[AF(S)] = ES[w+S − w−S] = ES[w+S] − ES[w−S] = Ey=+1[x] − Ey=−1[x],

we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be exactly the same.

Error distribution: full dataset

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 342

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 273

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 205

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 137

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 68

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 34

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 27

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 20

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 14

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 7

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Bayes risk and consistency

• Traditionally statistics has concentrated on analysing

ES[ϵ(S,A,F)].

• For example, consistency of a classification algorithm A and function class F means

lim_{m→∞} ES[ϵ(S,A,F)] = err(fBayes),

where

fBayes(x) = 1 if P(x, 1) > P(x, 0), and 0 otherwise,

is the function with the lowest possible risk, often referred to as the Bayes risk.

Expected versus confident bounds

• For a finite sample the generalisation ϵ(S,A,F) has a distribution depending on the algorithm, function class and sample size m.

• As indicated above, statistics has tended to concentrate on the mean of this distribution – but this quantity can be misleading, e.g. for low-fold cross-validation.

Expected versus confident bounds cont.

• Statistical learning theory has preferred to analyse the tail of the distribution, finding a bound which holds with high probability.

• This looks like a statistical test – significance at the 1% level means that the chances of the conclusion not being true for my training set with m examples are less than 1%.

• This is also the source of the acronym PAC: probably approximately correct; the 'confidence' parameter δ is the probability that we have been misled by this particular training set.

Probability of being misled in classification

• Aim to cover a number of key techniques of SLT. The basic approach is usually to bound the probability of being misled and set this equal to δ.

• What is the chance of being misled by a single bad function f, i.e. training error errS(f) = 0 while the true error is bad, err(f) > ϵ?

PS{errS(f) = 0, err(f) > ϵ} = (1 − err(f))^m ≤ (1 − ϵ)^m ≤ exp(−ϵm),

so that choosing ϵ = ln(1/t)/m ensures the probability is less than t.

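As a quick numerical sanity check of this step (a sketch, not part of the original slides), the exact probability (1 − err(f))^m of a fixed bad function looking perfect on m points is indeed dominated by exp(−ϵm), and the inverted choice ϵ = ln(1/δ)/m drives it below δ:

```python
import math

def p_misled(true_err, m):
    # exact chance that a function with true error `true_err`
    # makes zero mistakes on m i.i.d. training points
    return (1 - true_err) ** m

m, delta = 200, 0.01
eps = math.log(1 / delta) / m  # invert exp(-eps * m) = delta
```

With m = 200 and δ = 0.01 this gives ϵ ≈ 0.023, and p_misled(eps, m) is guaranteed to be at most δ.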

Finite or Countable function classes

If we now consider a function class

F = {f1, f2, . . . , fn, . . .}

and make the probability of being misled by fn less than qnδ with

∑_{n=1}^{∞} qn ≤ 1,

then the probability of being misled by one of the functions is bounded by

PS{∃fn : errS(fn) = 0, err(fn) > (1/m) ln(1/(qnδ))} ≤ δ.

This uses the so-called union bound – the probability of the union of a set of events is at most the sum of the individual probabilities.

Finite or Countable function classes result

• The bound translates into a theorem: given F and q, with probability at least 1 − δ over m random samples, the generalisation error of a function fn ∈ F with zero training error is bounded by

err(fn) ≤ (1/m)(ln(1/qn) + ln(1/δ))
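To get a feel for the numbers, here is a small sketch (the sample size and the common choice qn = 2^−n are assumptions of the illustration) evaluating the finite-class bound; the complexity term grows linearly in the index n:

```python
import math

def finite_class_bound(m, q_n, delta):
    # err(f_n) <= (ln(1/q_n) + ln(1/delta)) / m
    return (math.log(1 / q_n) + math.log(1 / delta)) / m

b_simple = finite_class_bound(1000, 2 ** -1, 0.05)    # low-index function
b_complex = finite_class_bound(1000, 2 ** -10, 0.05)  # high-index function
```

Each extra bit of description length costs exactly ln(2)/m in the bound, which is the "complexity / description length" reading discussed on the next slide.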

Some comments on the result

• We can think of the term ln(1/qn) as the complexity / description length of the function fn.

• Note that we must put a prior weight on the functions. If the functions are drawn at random according to a distribution p, the expected generalisation will be minimal if we choose our prior q = p.

• Note that though q looks like a prior, the bound holds whether or not it is 'correct': this is the starting point of the PAC-Bayes analysis.

What if uncountably many functions?

• We need a way to convert an infinite set to a finite one.

• The key idea is to replace measuring performance on a random test point with measuring on a second 'ghost' sample.

• In this way the analysis is reduced to a finite set of examples and hence a finite set of classification functions.

• This step is often referred to as the 'double sample trick'.

Double sample trick

The result has the following form:

P^m{X ∈ X^m : ∃h ∈ H : errX(h) = 0, err(h) ≥ ϵ} ≤ 2P^2m{XY ∈ X^2m : ∃h ∈ H : errX(h) = 0, errY(h) ≥ ϵ/2}

If we think of the first probability as being over XY, the result concerns three events:

A(h) := {errX(h) = 0}
B(h) := {err(h) ≥ ϵ}
C(h) := {errY(h) ≥ ϵ/2}

Double sample trick II

It is clear that

P^2m(C(h) | A(h) & B(h)) = P^2m(C(h) | B(h)) > 0.5

for reasonable m, by a binomial tail bound.

Double sample trick cont.

Hence, we have

P^2m{XY ∈ X^2m : ∃h ∈ H : A(h) & C(h)} ≥ P^2m{XY ∈ X^2m : ∃h ∈ H : A(h) & B(h) & C(h)} = P^2m{XY ∈ X^2m : ∃h ∈ H : A(h) & B(h)} P(C(h) | A(h) & B(h))

It follows that

P^m{X ∈ X^m : ∃h ∈ H : A(h) & B(h)} ≤ 2P^2m{XY ∈ X^2m : ∃h ∈ H : A(h) & C(h)},

the required result.

How many functions on a finite sample?

• Let H be a set of {−1, 1}-valued functions.

• The growth function BH(m) is the maximum cardinality of the set of functions H when restricted to m points – note that this cannot be larger than 2^m, i.e. log2(BH(m)) ≤ m.

• For the statistics to work we want the number of functions to be much smaller than this, as we will perform a union bound over this number.
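As a toy illustration (not from the slides), the growth function of 1-D threshold classifiers can be computed by brute force: m distinct points receive only 2m distinct labelings, far fewer than 2^m. The function and data below are assumptions of the sketch, and the candidate thresholds assume points spaced more than 0.5 apart:

```python
def growth_function_thresholds(points):
    """Count distinct labelings of `points` by 1-D thresholds
    x -> sign(x - t), allowing both orientations (VC dimension 1)."""
    labelings = set()
    candidates = [min(points) - 1] + [p + 0.5 for p in points]
    for t in candidates:
        for orient in (1, -1):
            labelings.add(tuple(orient * (1 if p > t else -1) for p in points))
    return len(labelings)

pts = [1.0, 2.0, 3.0, 4.0, 5.0]
n_labelings = growth_function_thresholds(pts)
```

For these 5 points the class realises 10 labelings out of the 32 possible, matching the 2m pattern.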

Examining the growth function

Consider a plot of the ratio of the growth function BH(m) to 2^m for linear functions in a 20-dimensional space:

[Plot omitted: BH(m)/2^m against m (0 to 100); the ratio equals 1 for small m and then decays towards 0.]

Vapnik–Chervonenkis dimension

• The Vapnik–Chervonenkis dimension is the point at which the ratio stops being equal to 1:

VCdim(H) = max{m : for some x1, . . . , xm, for all b ∈ {−1, 1}^m, ∃hb ∈ H, hb(xi) = bi}

• For linear functions L in R^n, VCdim(L) = n + 1.

Sauer's Lemma

• Sauer's Lemma (also due to Vapnik–Chervonenkis and Shelah):

BH(m) ≤ ∑_{i=0}^{d} (m choose i) ≤ (em/d)^d,

where m ≥ d = VCdim(H).

• This is surprising, as the rate of growth of BH(m) up to m = d is exponential – but thereafter it is polynomial with exponent d.

• Furthermore, there is no assumption made about the functions being linear – this is a completely general result.
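Sauer's bound is easy to check numerically; this small sketch evaluates the exact binomial sum and compares it with the (em/d)^d overestimate:

```python
from math import comb, e

def sauer_sum(m, d):
    # exact bound: sum of binomial coefficients up to the VC dimension d
    return sum(comb(m, i) for i in range(d + 1))
```

At m = d the sum is the full 2^d (exponential regime); for larger m it is polynomial of degree d, and always below (em/d)^d.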

Basic Theorem of SLT

We want to bound the probability that the training examples can mislead us about one of the functions we are considering using:

P^m{X ∈ X^m : ∃h ∈ H : errX(h) = 0, err(h) ≥ ϵ}
→ double sample trick →
≤ 2P^2m{XY ∈ X^2m : ∃h ∈ H : errX(h) = 0, errY(h) ≥ ϵ/2}
→ union bound →
≤ 2BH(2m) P^2m{XY ∈ X^2m : errX(h) = 0, errY(h) ≥ ϵ/2}

The final ingredient is known as symmetrisation.

Symmetrisation

• Consider generating a 2m-sample S. Since the points are generated independently, the probability of generating the same set of points in a different order is the same.

• Consider a fixed set Σ of permutations; each time we generate a sample we randomly permute it with a uniformly chosen element of Σ – this gives a probability distribution P^2m_Σ.

Symmetrisation cont.

• Any event has equal probability under P^2m and P^2m_Σ, so that

P^2m(A) = P^2m_Σ(A) = E^2m[Pσ∼Σ(A)]

• Consider the particular choice of Σ: the permutations that swap or leave unchanged corresponding elements of the two samples X and Y – there are 2^m such permutations.

Completion of the proof

P^2m{XY ∈ X^2m : errX(h) = 0, errY(h) ≥ ϵ/2}
≤ E^2m[Pσ∼Σ{errX(h) = 0, errY(h) ≥ ϵ/2 for σ(XY)}]
≤ E^2m[2^(−ϵm/2)] = 2^(−ϵm/2)

• Setting the right hand side equal to δ/(2BH(2m)) and inverting gives the bound on ϵ.

Final result

• Assembling the ingredients gives the result: with probability at least 1 − δ over m random samples, the generalisation error of a function h ∈ H chosen from a class H with VC dimension d with zero training error is bounded by

ϵ = ϵ(m,H, δ) = (2/m)(d log(2em/d) + log(2/δ))

• Note that we can think of d as the complexity / capacity of the function class H.

• The bound does not distinguish between functions in H.

Lower bounds

• VCdim characterises learnability in the PAC setting: there exist distributions such that with probability at least δ over m random examples, the error of h is at least

max((d − 1)/(32m), (1/m) log(1/δ)).

Non-zero training error

• Very similar results can be obtained for non-zero training error.

• The main difference is the introduction of a square root, giving a bound of the form

ϵ(m,H, k, δ) = k + O(√((d/m) log(2em/d)) + √((1/m) log(2/δ)))

for k training errors, which is significantly worse than in the zero training error case.

• PAC-Bayes bounds now interpolate between these two.

Structural Risk Minimisation

• The way to differentiate between functions using the VC result is to create a hierarchy of classes

H1 ⊆ H2 ⊆ · · · ⊆ Hd ⊆ · · · ⊆ HK

of increasing complexity/VC dimension.

• We can now find the function in each class with minimum empirical error kd and choose between the classes by minimising over the choice of d:

ϵ(m, d, δ) = kd + O(√((d/m) log(2em/d)) + √((1/m) log(2K/δ))),

which bounds the generalisation by a further application of the union bound over the K classes.

Criticisms of PAC Theory

• The theory is certainly valid, and the lower bounds indicate that it is not too far out – so it can't be criticised as it stands.

• The criticism is that it doesn't accord with the experience of those applying learning.

• Mismatch between theory and practice.

• For example:

Support Vector Machines (SVM)

One example of PAC failure is in analysing SVMs: thresholded linear functions in very high-dimensional feature spaces.

1. the kernel trick means we can work in an infinite-dimensional feature space (⇒ infinite VC dimension), so that the VC result does not apply;
2. and YET very impressive performance.

Support Vector Machines cont.

• An SVM seeks a linear function in a feature space defined by a mapping ϕ specified implicitly via a kernel κ:

κ(x, z) = ⟨ϕ(x), ϕ(z)⟩

• For example, the 1-norm SVM seeks w to solve

min_{w,b,ξ} ∥w∥^2 + C ∑_{i=1}^m ξi
subject to yi(⟨w, ϕ(xi)⟩ + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, . . . , m.
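A minimal subgradient-descent sketch of the soft-margin objective above, in the linear (identity feature map) case. This is an illustrative stand-in for a real QP solver; the learning rate, epoch count, and toy data are assumptions:

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on ||w||^2 + C * sum_i max(0, 1 - y_i(<w,x_i> + b))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points violating the margin contribute subgradients
        grad_w = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# linearly separable toy problem
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = soft_margin_svm(X, y)
```

On separable data like this the slacks are driven to zero and the learned hyperplane separates the two classes.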

Margin in SVMs

• The intuition behind SVMs is that maximising the margin makes it possible to obtain good generalisation despite the high VC dimension.

• The lower bound implies that we must be taking advantage of a benign distribution, since we know that in the worst case generalisation will be bad.

• Structural Risk Minimisation cannot be applied to a hierarchy determined by the margin of different classifiers, since the margin is not known until we see the data, while the SRM hierarchy must be chosen in advance.

Margin in SVMs cont.

• Hence, we require a theory that can give bounds that are sensitive to serendipitous distributions – in particular we conjecture that the margin is an indication of such 'luckiness'.

• The proof approach will rely on using real-valued function classes. The margin gives an indication of the accuracy with which we need to approximate the functions when applying the statistics.

Covering Numbers

For F a class of real functions defined on X and ∥ · ∥d a norm on F,

N(γ,F, ∥ · ∥d)

is the smallest size of a set Uγ such that for any f ∈ F there is a u ∈ Uγ with ∥f − u∥d < γ.

Covering Numbers cont.

For generalization bounds we need the γ-growth function,

N^m(γ,F) := sup_{X∈X^m} N(γ,F, ℓ^X_∞),

where ℓ^X_∞ gives the distance between two functions as the maximum difference between their outputs on the sample.

Covering numbers for linear functions

• Consider the set of functions:

{x ↦ ⟨w, ϕ(x)⟩ : w = (1/Z) ∑_{i=1}^m zi ϕ(xi), zi ∈ ℤ, Z = ∑_{i=1}^m zi ≠ 0, ∑_{i=1}^m |zi| ≤ 8R^2/γ^2},

where R = max_{1≤i≤m} ∥ϕ(xi)∥.

• This is an explicit cover that approximates the output of any norm-1 linear function to within γ/2 on the sample {x1, . . . , xm}.

Covering numbers for linear functions cont.

• We convert the problem of γ/2 approximation on the sample into a classification problem, which is solvable with a margin of γ/2.

• It follows that if we use the perceptron algorithm to find a classifier, we will find a function satisfying the γ/2 approximation with just 8R^2/γ^2 updates.

• This gives a sparse dual representation of the function. The covering is chosen as the set of functions with small sparse dual representations.

• This gives a bound on the size of the cover:

log2 N^2m(γ/2,F) ≤ k log2(e(2m + k − 1)/k), where k = 8R^2/γ^2.

Second statistical result

• We want to bound the probability that the training examples can mislead us about one of the functions with margin bigger than a fixed γ:

P^m{X ∈ X^m : ∃f ∈ F : errX(f) = 0, mX(f) ≥ γ, errP(f) ≥ ϵ}
≤ 2P^2m{XY ∈ X^2m : ∃f ∈ F such that errX(f) = 0, mX(f) ≥ γ, errY(f) ≥ ϵ/2}
≤ 2N^2m(γ/2,F) P^2m{XY ∈ X^2m : for fixed f′, mX(f′) > γ/2, mY(ϵm/2)(f′) < γ/2}
≤ 2N^2m(γ/2,F) 2^(−ϵm/2) ≤ δ

Second statistical result cont.

• Inverting gives

ϵ = ϵ(m,F, δ, γ) = (2/m)(log2 N^2m(γ/2,F) + log2(2/δ)),

i.e. with probability 1 − δ over m random examples, a margin-γ hypothesis has error less than ϵ. This must be applied for a finite set of γ ('do SRM over γ').

Bounding the covering numbers

We have the following correspondences with the standard VC case (easy slogans):

Growth function – γ-growth function
Vapnik–Chervonenkis dim – Fat-shattering dim
Sauer's Lemma – Alon et al.

Generalization of SVMs

• Since SVMs are linear functions, we can apply the linear function bound.

• Hence for a distribution with support in a ball of radius R (e.g. Gaussian kernels: R = 1) and margin γ, we have the bound:

ϵ(m,L, δ, γ) = (2/m)(k log2(e(2m + k − 1)/k) + log2(m/δ)),

where k = 8R^2/γ^2.

• Key property: no explicit dependence on dimension!

Error distribution: dataset size: 205

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 137

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 68

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 34

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 27

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 20

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 14

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 7

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Using a kernel

• Can consider much higher dimensional spaces using the kernel trick.

• Can even work in infinite-dimensional spaces, e.g. using the Gaussian kernel:

κ(x, z) = exp(−∥x − z∥^2/(2σ^2))
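A direct implementation of the Gaussian kernel. Note that κ(x, x) = 1, so every point has unit norm in feature space – this is the R = 1 used earlier for Gaussian kernels:

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    # kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

The kernel value decays smoothly from 1 towards 0 as the points move apart, with σ controlling the length scale.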

Error distribution: dataset size: 342

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Error distribution: dataset size: 273

[Histogram of the test error distribution omitted (x-axis: error 0–1; y-axis: frequency).]

Conclusions

• Outline of the philosophy and approach of SLT

• Central result of SLT

• Touched on covering number bounds for margin-based analysis

• Application to analysing SVM learning

STRUCTURE

PART A

1. General Statistical Considerations
2. Basic PAC Ideas and Proofs
3. Real-valued Function Classes and the Margin

PART B

1. Rademacher Complexity and Main Theory
2. Applications to Classification
3. Conclusions


PART B


Concentration inequalities

• Statistical Learning is concerned with the reliability or stability of inferences made from a random sample.

• Random variables with this property have been a subject of ongoing interest to probabilists and statisticians.

Concentration inequalities cont.

• As an example consider the mean of a sample of m 1-dimensional random variables X1, . . . , Xm:

Sm = (1/m) ∑_{i=1}^m Xi.

• Hoeffding's inequality states that if Xi ∈ [ai, bi],

P{|Sm − E[Sm]| ≥ ϵ} ≤ 2 exp(−2m^2 ϵ^2 / ∑_{i=1}^m (bi − ai)^2)

Note how the probability falls off exponentially with the distance from the mean and with the number of variables.
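A Monte Carlo sanity check (a sketch; the sample sizes and trial counts are assumptions) that the empirical deviation frequency of the mean of [0, 1] variables sits below Hoeffding's bound:

```python
import math
import random

def hoeffding_bound(m, eps):
    # X_i in [0, 1], so sum_i (b_i - a_i)^2 = m
    return 2 * math.exp(-2 * m * eps ** 2)

random.seed(0)
m, eps, trials = 500, 0.1, 2000
deviations = sum(
    abs(sum(random.random() for _ in range(m)) / m - 0.5) >= eps
    for _ in range(trials)
)
empirical = deviations / trials
```

With m = 500 and ϵ = 0.1 the bound is already below 10^−4, illustrating the exponential fall-off.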

Concentration for SLT

• We are now going to look at deriving SLT results from concentration inequalities.

• Perhaps the best known form is due to McDiarmid (although he was actually re-presenting previously derived results):

McDiarmid's inequality

Theorem 1. Let X1, . . . , Xn be independent random variables taking values in a set A, and assume that f : A^n → R satisfies

sup_{x1,...,xn, x̂i∈A} |f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x̂i, . . . , xn)| ≤ ci,

for 1 ≤ i ≤ n. Then for all ϵ > 0,

P{f(X1, . . . , Xn) − Ef(X1, . . . , Xn) ≥ ϵ} ≤ exp(−2ϵ^2 / ∑_{i=1}^n ci^2)

• Hoeffding is a special case, when f(x1, . . . , xn) = Sn.

Using McDiarmid

• By setting the right hand side equal to δ, we can always invert McDiarmid to get a high confidence bound: with probability at least 1 − δ,

f(X1, . . . , Xn) < Ef(X1, . . . , Xn) + √((∑_{i=1}^n ci^2 / 2) log(1/δ))

• If ci = c/n for each i, this reduces to

f(X1, . . . , Xn) < Ef(X1, . . . , Xn) + √((c^2/(2n)) log(1/δ))
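Inverting McDiarmid numerically (a sketch with assumed parameters): for the sample mean of n variables in [0, 1] the bounded differences are ci = 1/n, recovering the √(log(1/δ)/(2n)) deviation above:

```python
import math

def mcdiarmid_deviation(c_squared_sum, delta):
    # solve exp(-2 eps^2 / sum_i c_i^2) = delta for eps
    return math.sqrt(c_squared_sum / 2 * math.log(1 / delta))

n, delta = 1000, 0.05
eps = mcdiarmid_deviation(n * (1.0 / n) ** 2, delta)
```

Fewer samples give a looser high-confidence deviation, as the √(1/n) scaling suggests.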

Rademacher complexity

• Rademacher complexity is a new way of measuring the complexity of a function class. It arises naturally if we rerun the proof using the double sample trick and symmetrisation, but look at what is actually needed to continue the proof:

Rademacher proof beginnings

For a fixed f ∈ F we have

E[f(z)] ≤ Ê[f(z)] + sup_{h∈F}(E[h] − Ê[h]),

where F is a class of functions mapping from Z to [0, 1] and Ê denotes the sample average.

We must bound the size of the second term. First apply McDiarmid's inequality to obtain (with ci = 1/m for all i), with probability at least 1 − δ:

sup_{h∈F}(E[h] − Ê[h]) ≤ ES[sup_{h∈F}(E[h] − Ê[h])] + √(ln(1/δ)/(2m)).

Deriving double sample result

• We can now move to the ghost sample S̃ = {z̃1, . . . , z̃m} by simply observing that E[h] = ES̃[Ê'[h]], where Ê' denotes the average over the ghost sample:

ES[sup_{h∈F}(E[h] − Ê[h])] = ES[sup_{h∈F} ES̃[(1/m) ∑_{i=1}^m h(z̃i) − (1/m) ∑_{i=1}^m h(zi) | S]]

Deriving double sample result cont.

Since the sup of an expectation is less than or equal to the expectation of the sup (we can make the choice to optimise for each S̃) we have

ES[sup_{h∈F}(E[h] − Ê[h])] ≤ ES ES̃[sup_{h∈F} (1/m) ∑_{i=1}^m (h(z̃i) − h(zi))]

Adding symmetrisation

Here symmetrisation is again just swapping corresponding elements – but we can write this as multiplication by a variable σi which takes values ±1 with equal probability:

ES[sup_{h∈F}(E[h] − Ê[h])] ≤ E_{σالسS̃ S}[sup_{h∈F} (1/m) ∑_{i=1}^m σi(h(z̃i) − h(zi))]
≤ 2 E_{Sσ}[sup_{h∈F} (1/m) ∑_{i=1}^m σi h(zi)] = Rm(F),

Rademacher complexity

where

Rm(F) = E_{Sσ}[sup_{f∈F} (2/m) ∑_{i=1}^m σi f(zi)]

is known as the Rademacher complexity of the function class F.

• Rademacher complexity is the expected value of the maximal correlation with random noise – a very natural measure of capacity.

• Note that the Rademacher complexity is distribution dependent since it involves an expectation over the choice of sample – this might seem hard to compute.
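The empirical (fixed-sample) Rademacher complexity of a finite class can be estimated by Monte Carlo over random sign vectors. This sketch (all names are assumptions) checks two extremes: a single constant function cannot correlate with noise, while the full cube {−1, 1}^m matches any noise pattern perfectly:

```python
import random
from itertools import product

def empirical_rademacher(outputs, n_sigma=5000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite class given by its output vectors (f(z_1), ..., f(z_m))."""
    rng = random.Random(seed)
    m = len(outputs[0])
    total = 0.0
    for _ in range(n_sigma):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        best = max(sum(s * f[i] for i, s in enumerate(sigma)) for f in outputs)
        total += 2.0 * best / m
    return total / n_sigma

single_constant = [[1.0] * 10]                              # no capacity
full_cube = [list(f) for f in product((-1.0, 1.0), repeat=4)]  # shatters everything
```

The constant class scores near 0 while the full cube scores exactly 2, the two ends of the capacity scale.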

Main Rademacher theorem

Putting the pieces together gives the main theorem of Rademacher complexity: with probability at least 1 − δ over random samples S of size m, every f ∈ F satisfies

E[f(z)] ≤ Ê[f(z)] + Rm(F) + √(ln(1/δ)/(2m))

Empirical Rademacher theorem

• Since the empirical Rademacher complexity

R̂m(F) = Eσ[sup_{f∈F} (2/m) ∑_{i=1}^m σi f(zi) | z1, . . . , zm]

is concentrated, we can make a further application of McDiarmid to obtain, with probability at least 1 − δ,

ED[f(z)] ≤ Ê[f(z)] + R̂m(F) + 3√(ln(2/δ)/(2m)).

Relation to VC theorem

• For H a class of ±1-valued functions with VC dimension d, we can upper bound Rm(H) using Hoeffding's inequality to upper bound

P{∑_{i=1}^m σi f(xi) ≥ ϵ} ≤ exp(−ϵ^2/(2m))

for a fixed function f, since the expected value of the sum is 0, and the maximum change from replacing a σi is 2.

Relation to VC theorem cont.

• By Sauer's lemma there are at most (em/d)^d functions on the sample, so we can bound the probability that the sum exceeds ϵ for some function by

(em/d)^d exp(−ϵ^2/(2m)) =: δ.

• Taking δ = m^(−1) and solving for ϵ gives

ϵ = √(2md ln(em/d) + 2m ln(m))

Rademacher bound for VC class

• Hence we can bound

Rm(H) ≤ (2/m)(δm + ϵ(1 − δ)) ≤ 2/m + √(8(d ln(em/d) + ln(m))/m)

• This is equivalent to the PAC bound with non-zero loss, except that we could have used the growth function or VC dimension measured on the sample, rather than the sup over the whole input space.

Application to large margin classification

• Rademacher complexity comes into its own for Boosting and SVMs.

Application to Boosting

• We can view Boosting as seeking a function from the class (H is the set of weak learners)

{∑_{h∈H} ah h(x) : ah ≥ 0, ∑_{h∈H} ah ≤ B} = convB(H)

by minimising some function of the margin distribution.

• Adaboost corresponds to optimising an exponential function of the margin over this set of functions.

• We will see how to include the margin in the analysis later, but concentrate on computing the Rademacher complexity for now.

Rademacher complexity of convex hulls

Rademacher complexity has a very nice property for convex hull classes:

Rm(convB(H)) = (2/m) Eσ[sup_{hj∈H, ∑j aj≤B} ∑_{i=1}^m σi ∑_j aj hj(xi)]
≤ (2/m) Eσ[sup_{hj∈H, ∑j aj≤B} ∑_j aj ∑_{i=1}^m σi hj(xi)]
≤ (2/m) Eσ[sup_{hj∈H} B ∑_{i=1}^m σi hj(xi)]
≤ B Rm(H).

Rademacher complexity of convex hulls cont.

• Hence, we can move to the convex hull without incurring any complexity penalty for B = 1!

Rademacher complexity for SVMs

• The Rademacher complexity of a class of linear functions with bounded 2-norm:

{x → ∑_{i=1}^m αi κ(xi, x) : α′Kα ≤ B^2} ⊆ {x → ⟨w, ϕ(x)⟩ : ∥w∥ ≤ B} = FB,

where we assume a kernel-defined feature space with

⟨ϕ(x), ϕ(z)⟩ = κ(x, z).

Rademacher complexity of FB

Rm(FB) = Eσ[sup_{f∈FB} (2/m) ∑_{i=1}^m σi f(xi)]
= Eσ[sup_{∥w∥≤B} ⟨w, (2/m) ∑_{i=1}^m σi ϕ(xi)⟩]
≤ (2B/m) Eσ[∥∑_{i=1}^m σi ϕ(xi)∥]
= (2B/m) Eσ[⟨∑_{i=1}^m σi ϕ(xi), ∑_{j=1}^m σj ϕ(xj)⟩^(1/2)]
≤ (2B/m) (Eσ[∑_{i,j=1}^m σi σj κ(xi, xj)])^(1/2)
= (2B/m) √(∑_{i=1}^m κ(xi, xi)),

where the last inequality is Jensen's, and the final equality uses Eσ[σi σj] = 0 for i ≠ j.
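The final line of the derivation is directly computable from the kernel diagonal. This sketch (parameter values are assumptions) evaluates 2B/m · √(∑ κ(xi, xi)) and confirms that for a Gaussian kernel, where κ(x, x) = 1, it collapses to 2B/√m:

```python
import math

def rademacher_bound_linear(kernel_diagonal, B):
    # 2B/m * sqrt(sum_i kappa(x_i, x_i)), the bound derived above
    m = len(kernel_diagonal)
    return 2 * B / m * math.sqrt(sum(kernel_diagonal))

m, B = 400, 1.0
bound = rademacher_bound_linear([1.0] * m, B)  # Gaussian kernel diagonal
```

The bound shrinks like 1/√m regardless of the (possibly infinite) feature-space dimension, which is the point of the whole analysis.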

Applying to 1-norm SVMs

We take the following formulation of the 1-norm SVM:

min_{w,b,γ,ξ} −γ + C ∑_{i=1}^m ξi
subject to yi(⟨w, ϕ(xi)⟩ + b) ≥ γ − ξi, ξi ≥ 0, i = 1, . . . , m, and ∥w∥^2 = 1.   (1)

Note that

ξi = (γ − yi g(xi))+,

where g(·) = ⟨w, ϕ(·)⟩ + b.

• The first step is to introduce a loss function which upper bounds the discrete loss

P(y ≠ sgn(g(x))) = E[H(−y g(x))],

where H is the Heaviside function.

• Consider the loss function A : R → [0, 1], givenby

A(a) =

1, if a > 0;1 + a/γ, if −γ ≤ a ≤ 0;0, otherwise.

• By the Rademacher Theorem and since the lossfunction A− 1 dominates H − 1, we have that

E [H(−yg(x))− 1] ≤ E [A(−yg(x))− 1]

≤ E [A(−yg(x))− 1] +

Rm((A− 1) ◦ F) + 3

√ln(2/δ)

2m.

Lammhult Summer School, June 2011 100

Empirical loss and slack variables

• But the function A(−yi g(xi)) ≤ ξi/γ, for i = 1, . . . , m, and so

E[H(−yg(x))] ≤ (1/(mγ)) ∑_{i=1}^m ξi + Rm((A − 1) ◦ F) + 3√(ln(2/δ)/(2m)).

• The final missing ingredient to complete the bound is to bound Rm((A − 1) ◦ F) in terms of Rm(F).

• This will require a more detailed look at Rademacher complexity.

Rademacher complexity bounds

• First simple observation: for a ∈ R, Rm(aF) = |a|Rm(F), since |a|f is the function achieving the sup for some σ for aF iff f achieves the sup for F.

• We are interested in bounding the Rademacher complexity of composed classes:

Rm(L ◦ F) ≤ L Rm(F)

for the class L ◦ F = {L ◦ f : f ∈ F}, where L satisfies

|L(a) − L(b)| ≤ L|a − b|,

i.e. L is a Lipschitz function with constant L > 0.

Rademacher complexity bounds cont.

• In our case L = A − 1 and L = 1/γ.

• By the above it is sufficient to prove the case L = 1 only, since then

Rm(L ◦ F) = L Rm((L/L) ◦ F) ≤ L Rm(F)

Contraction proof (L = 1)

• We show

    E_σ[ sup_{f∈F} ∑_{i=1}^m σ_i L(f(x_i)) ] ≤ E_σ[ sup_{f∈F} ( σ_1 f(x_1) + ∑_{i=2}^m σ_i L(f(x_i)) ) ]

and apply induction to obtain the full result.

Contraction proof cont.

• We take the sign vectors in pairs:

    (1, σ_2, . . . , σ_m)  and  (−1, σ_2, . . . , σ_m).

The result will follow if we show that for all f, g ∈ F we can find f′, g′ ∈ F such that (writing f_i = f(x_i), etc.)

    L(f_1) + ∑_{i=2}^m σ_i L(f_i) − L(g_1) + ∑_{i=2}^m σ_i L(g_i)
      ≤ f′_1 + ∑_{i=2}^m σ_i L(f′_i) − g′_1 + ∑_{i=2}^m σ_i L(g′_i).

Contraction proof end

• If f_1 ≥ g_1, take f′ = f and g′ = g to reduce to showing

    L(f_1) − L(g_1) ≤ f_1 − g_1 = |f_1 − g_1|,

which follows since |L(f_1) − L(g_1)| ≤ |f_1 − g_1|.

• Otherwise f_1 < g_1, and we take f′ = g and g′ = f to reduce to showing

    L(f_1) − L(g_1) ≤ g_1 − f_1 = |f_1 − g_1|,

which again follows from |L(f_1) − L(g_1)| ≤ |f_1 − g_1|.

Final SVM bound

• Assembling the result we obtain:

    P(y ≠ sgn(g(x))) = E[H(−y g(x))]
      ≤ (1/(mγ)) ∑_{i=1}^m ξ_i + (2/(mγ)) √( ∑_{i=1}^m κ(x_i, x_i) ) + 3 √( ln(2/δ) / (2m) ).

• Note that for the Gaussian kernel (κ(x, x) = 1) this reduces to

    P(y ≠ sgn(g(x))) ≤ (1/(mγ)) ∑_{i=1}^m ξ_i + 2/(γ√m) + 3 √( ln(2/δ) / (2m) ).

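The three terms of the bound are easy to evaluate numerically; the following helper (illustrative, with hypothetical values for the slacks, margin and confidence) makes the trade-off explicit:

```python
import numpy as np

def svm_generalisation_bound(xi, gamma, kappa_diag, delta):
    # slack term + complexity (Rademacher) term + confidence term
    m = len(xi)
    slack = np.sum(xi) / (m * gamma)
    complexity = (2.0 / (m * gamma)) * np.sqrt(np.sum(kappa_diag))
    confidence = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * m))
    return slack + complexity + confidence

m, gamma, delta = 1000, 0.5, 0.05
# Gaussian kernel: kappa(x, x) = 1, so the complexity term is 2/(gamma*sqrt(m))
bound = svm_generalisation_bound(np.zeros(m), gamma, np.ones(m), delta)
```

With zero slacks the bound is 2/(γ√m) + 3√(ln(2/δ)/(2m)), so it shrinks at rate O(1/√m) as the sample grows.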
Final Boosting bound

• Applying a similar strategy for Boosting with the 1-norm of the slack variables, we arrive at Linear Programming boosting, which minimises

    ∑_h a_h + C ∑_{i=1}^m ξ_i,   where ξ_i = (1 − y_i ∑_h a_h h(x_i))₊,

• with corresponding bound:

    P(y ≠ sgn(g(x))) = E[H(−y g(x))]
      ≤ (1/m) ∑_{i=1}^m ξ_i + R_m(H) ∑_h a_h + 3 √( ln(2/δ) / (2m) ),

where H is the class of weak learners.

Linear programming machine

• Controversy over why boosting works and its relation to bagging.

• The previous bound suggests an optimisation similar to that of SVMs.

• Seeks a linear function in a feature space defined explicitly.

• For example, using the 1-norm it seeks w to solve

    min_{w,b,ξ}  ∥w∥₁ + C ∑_{i=1}^m ξ_i
    subject to  y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, . . . , m.

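A minimal sketch of this LP (assuming SciPy's `linprog` is available; the data and C are toy values). The 1-norm is linearised with the standard split w = u − v, u, v ≥ 0:

```python
import numpy as np
from scipy.optimize import linprog

def lp_machine(X, y, C=1.0):
    # min ||w||_1 + C*sum(xi)  s.t.  y_i(<w, x_i> + b) >= 1 - xi_i, xi >= 0
    # variables: [u (n), v (n), b (1), xi (m)] with w = u - v
    m, n = X.shape
    c = np.concatenate([np.ones(2 * n), [0.0], C * np.ones(m)])
    Yx = y[:, None] * X
    # constraint rewritten as -(y_i(<w, x_i> + b) + xi_i) <= -1
    A_ub = np.hstack([-Yx, Yx, -y[:, None], -np.eye(m)])
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * m,
                  method="highs")
    w = res.x[:n] - res.x[n:2 * n]
    return w, res.x[2 * n], res.x[2 * n + 1:]

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, xi = lp_machine(X, y, C=10.0)
```

On this separable toy set the 1-norm objective drives w sparse: only the first coordinate is used.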
Linear programming boosting

• A very slight generalisation considers the features as a matrix H = (H_ij) of 'weak' learner outputs, H_ij = h_j(x_i) (including the constant function and the negative of each weak learner as weak learners):

    min_{a,ξ}  ∥a∥₁ + C ∑_{i=1}^m ξ_i
    subject to  y_i H_i a ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, . . . , m,  a_j ≥ 0.

Alternative version

• Can explicitly optimise the margin with the 1-norm fixed:

    max_{ρ,a,ξ}  ρ − D ∑_{i=1}^m ξ_i
    subject to  y_i H_i a ≥ ρ − ξ_i,  ξ_i ≥ 0,  a_j ≥ 0,
                ∑_{j=1}^N a_j = 1.

• The dual has the following form:

    min_{β,u}  β
    subject to  ∑_{i=1}^m u_i y_i H_ij ≤ β,  j = 1, . . . , N,
                ∑_{i=1}^m u_i = 1,  0 ≤ u_i ≤ D.

Column generation

Can solve the dual linear programme using an iterative method:

1. initialise u_i = 1/m, i = 1, . . . , m, β = 0, J = ∅
2. choose j⋆ that maximises f(j) = ∑_{i=1}^m u_i y_i H_ij
3. if f(j⋆) ≤ β, solve the primal restricted to J and exit
4. J = J ∪ {j⋆}
5. solve the dual restricted to the set J to give u_i, β
6. go to 2

• Note that u is a distribution on the examples.

• Each j added acts like an additional weak learner.

• f(j) is simply the weighted classification accuracy.

• Hence this gives a 'boosting' algorithm, with the previous weights updated so as to satisfy the error bound.

• Guaranteed convergence and a soft stopping criterion.

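A compact implementation of the column-generation loop (a sketch assuming SciPy's `linprog`; the weak-learner matrix H is a toy example in which column 0 classifies perfectly, so the algorithm should select it and stop):

```python
import numpy as np
from scipy.optimize import linprog

def lp_boost(H, y, D, max_iter=20, tol=1e-9):
    # Column generation for the dual:  min beta  s.t.
    # sum_i u_i y_i H_ij <= beta (j in J), sum_i u_i = 1, 0 <= u_i <= D
    m, N = H.shape
    u, beta, J = np.ones(m) / m, 0.0, []
    for _ in range(max_iter):
        f = (u * y) @ H                      # weighted edge of each weak learner
        j_star = int(np.argmax(f))
        if f[j_star] <= beta + tol:          # no violated column: dual optimal
            break
        J.append(j_star)
        c = np.zeros(m + 1); c[-1] = 1.0     # variables [u (m), beta]
        A_ub = np.hstack([(y[:, None] * H[:, J]).T, -np.ones((len(J), 1))])
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(J)),
                      A_eq=np.append(np.ones(m), 0.0)[None, :], b_eq=[1.0],
                      bounds=[(0, D)] * m + [(None, None)], method="highs")
        u, beta = res.x[:m], res.x[-1]
    # restricted primal: max rho - D*sum(xi) over the selected columns J
    k = len(J)
    c = np.concatenate([np.zeros(k), [-1.0], D * np.ones(m)])
    A_ub = np.hstack([-(y[:, None] * H[:, J]), np.ones((m, 1)), -np.eye(m)])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m),
                  A_eq=np.append(np.append(np.ones(k), 0.0), np.zeros(m))[None, :],
                  b_eq=[1.0],
                  bounds=[(0, None)] * k + [(None, None)] + [(0, None)] * m,
                  method="highs")
    return J, res.x[:k], res.x[k]            # chosen columns, weights a, margin rho

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=12)
H = np.tile(y[:, None], (1, 6))              # column 0: a perfect weak learner
for j in range(1, 6):
    H[rng.choice(12, size=3, replace=False), j] *= -1.0  # noisy weak learners
J, a, rho = lp_boost(H, y, D=1.0)
```

Each dual solve reweights the examples (u is the boosting distribution), and each new column j⋆ is the weak learner with the largest weighted edge under that distribution.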
Multiple kernel learning

• MKL puts a 1-norm constraint on a linear combination of kernels:

    { κ(x, z) = ∑_{t=1}^N μ_t κ_t(x, z) : μ_t ≥ 0, ∑_{t=1}^N μ_t = 1 }

and trains an SVM while optimising the coefficients μ_t – a convex problem, c.f. group Lasso.

• We obtain the corresponding bound:

    P(y ≠ sgn(g(x)))
      ≤ (1/(mγ)) ∑_{i=1}^m ξ_i + (1/γ) R_m( ∪_{t=1}^N F_t ) + 3 √( ln(2/δ) / (2m) ).

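A sketch of forming the MKL convex-combination kernel (the base kernels and mixing coefficients here are hypothetical; in MKL the coefficients are optimised jointly with the SVM). Any convex combination of positive semi-definite kernels is again positive semi-definite:

```python
import numpy as np

def rbf_kernel(X, gamma):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def combined_kernel(kernels, coeffs):
    # convex combination: coeffs >= 0 and sum to 1 (the 1-norm constraint)
    coeffs = np.asarray(coeffs, dtype=float)
    assert np.all(coeffs >= 0) and abs(coeffs.sum() - 1.0) < 1e-9
    return sum(c * K for c, K in zip(coeffs, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
kernels = [rbf_kernel(X, g) for g in (0.1, 1.0, 10.0)]  # hypothetical widths
K = combined_kernel(kernels, [0.5, 0.3, 0.2])
min_eig = np.linalg.eigvalsh(K).min()
```

Since each RBF kernel has unit diagonal and the coefficients sum to one, the combined kernel also has κ(x, x) = 1.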
Bounding MKL

• Need a bound on

    R_m( F = ∪_{t=1}^N F_t ),

where F_t = {x ↦ ⟨w, ϕ_t(x)⟩ : ∥w∥ ≤ 1}.

• First note that further applications of McDiarmid give, with probability 1 − δ_0 over a random selection of σ∗:

    R_m(F) ≤ (2/m) sup_{f∈F} ∑_{i=1}^m σ∗_i f(x_i) + 4 √( ln(1/δ_0) / (2m) ),

and, with probability 1 − δ_t:

    (2/m) sup_{f∈F_t} ∑_{i=1}^m σ∗_i f(x_i) ≤ R_m(F_t) + 4 √( ln(1/δ_t) / (2m) ).

Bounding MKL

• Hence, taking δ_t = δ/(2(N + 1)) for t = 0, . . . , N:

    R_m( F = ∪_{t=1}^N F_t )
      ≤ (2/m) sup_{f∈F} ∑_{i=1}^m σ∗_i f(x_i) + 4 √( ln(2(N + 1)/δ) / (2m) )
      ≤ (2/m) max_{1≤t≤N} sup_{f∈F_t} ∑_{i=1}^m σ∗_i f(x_i) + 4 √( ln(2(N + 1)/δ) / (2m) )
      ≤ max_{1≤t≤N} R_m(F_t) + 8 √( ln(2(N + 1)/δ) / (2m) ),

with probability 1 − δ/2.

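The cost of the union bound is mild: the penalty term grows only logarithmically in the number of kernels N. A quick numerical illustration (the values of δ and m are arbitrary):

```python
import numpy as np

def union_penalty(N, m, delta=0.05):
    # the 8*sqrt(ln(2(N+1)/delta)/(2m)) term from the union bound
    return 8.0 * np.sqrt(np.log(2.0 * (N + 1) / delta) / (2.0 * m))

m = 1000
p_1 = union_penalty(1, m)      # a single kernel
p_100 = union_penalty(100, m)  # one hundred kernels
```

Going from 1 to 100 kernels inflates the penalty by well under a factor of two, which is what makes selecting among many kernels statistically cheap.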
Conclusions

• Outline of the philosophy and approach of SLT.

• Central result of SLT.

• Touched on covering number analysis for margin-based analysis.

• Moved to consideration of Rademacher complexity.

• The case of Rademacher complexity for classification gives bounds for two of the most effective classification algorithms: SVMs and Boosting.

• Similar bounds can be obtained for regression and other tasks such as PCA.

Where to find out more

Web Sites:
  www.support-vector.net (SV Machines)
  www.kernel-methods.net (kernel methods)
  www.kernel-machines.net (kernel machines)
  www.neurocolt.com
  www.pascal-network.org
