Page 1

The Learning Problem and Regularization

Tomaso Poggio

9.520 Class 02

February 2009

Page 2

About this class

Theme: We introduce the learning problem as the problem of function approximation from sparse data. We define the key ideas of loss functions, empirical error and generalization error. We then introduce the Empirical Risk Minimization approach and the two key requirements on algorithms using it: well-posedness and consistency. We then describe a key algorithm, Tikhonov regularization, that satisfies these requirements.

Math required: Familiarity with basic ideas in probability theory.

Page 4

Plan

- Learning as function approximation
- Empirical Risk Minimization
- Generalization and Well-posedness
- Regularization
- Appendix: Sample and Approximation Error

Page 5

Data Generated By A Probability Distribution

We assume that there are an “input” space X and an “output” space Y. We are given a training set S consisting of n samples drawn i.i.d. from the probability distribution µ(z) on Z = X × Y:

(x_1, y_1), \ldots, (x_n, y_n)

that is, z_1, \ldots, z_n. We will make frequent use of the conditional probability of y given x, written p(y|x):

µ(z) = p(x, y) = p(y|x) · p(x)

It is crucial to note that we view p(x, y) as fixed but unknown.
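As a concrete (hypothetical) illustration of this factorization, the sketch below draws i.i.d. samples z_i = (x_i, y_i) by first sampling x from p(x) and then y from p(y|x); the specific choices of p(x) and p(y|x) here are our own assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n):
    """Draw n i.i.d. samples z_i = (x_i, y_i) from mu(z) = p(y|x) p(x)."""
    x = rng.uniform(-1.0, 1.0, size=n)       # x ~ p(x): uniform on [-1, 1]
    p_pos = 1.0 / (1.0 + np.exp(-4.0 * x))   # p(y = +1 | x): a logistic curve (assumed)
    y = np.where(rng.uniform(size=n) < p_pos, 1.0, -1.0)
    return x, y

x, y = sample_training_set(5)
print(list(zip(x.round(2), y)))              # five (x_i, y_i) pairs
```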

Page 6

Probabilistic setting

[Figure: the probabilistic setting. Inputs x are drawn from the marginal P(x) on X; for each x, the output y in Y is drawn from the conditional P(y|x).]

Page 7

Hypothesis Space

The hypothesis space H is the space of functions that we allow our algorithm to provide. For many algorithms (such as optimization algorithms) it is the space the algorithm is allowed to search. As we will see, it is often important to choose the hypothesis space as a function of the amount of data available.

Page 8

Learning As Function Approximation From Samples: Regression and Classification

The basic goal of supervised learning is to use the training set S to “learn” a function f_S that looks at a new x value x_new and predicts the associated value of y:

y_pred = f_S(x_new)

If y is a real-valued random variable, we have regression. If y takes values from an unordered finite set, we have pattern classification. In two-class pattern classification problems, we assign one class a y value of 1, and the other class a y value of −1.

Page 9

Loss Functions

In order to measure the goodness of our function, we need a loss function V. In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Page 10

Common Loss Functions For Regression

For regression, the most common loss function is square loss or L2 loss:

V(f(x), y) = (f(x) − y)^2

We could also use the absolute value, or L1 loss:

V(f(x), y) = |f(x) − y|

Vapnik’s more general ε-insensitive loss function is:

V(f(x), y) = (|f(x) − y| − ε)_+
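As a concrete illustration (ours, not from the slides), here is a minimal NumPy sketch of these three regression losses; the function names and the choice of ε are our own.

```python
import numpy as np

def square_loss(f_x, y):
    """L2 loss: (f(x) - y)^2."""
    return (f_x - y) ** 2

def absolute_loss(f_x, y):
    """L1 loss: |f(x) - y|."""
    return np.abs(f_x - y)

def eps_insensitive_loss(f_x, y, eps=0.1):
    """Vapnik's epsilon-insensitive loss: (|f(x) - y| - eps)_+ .
    Errors smaller than eps are not penalized at all."""
    return np.maximum(np.abs(f_x - y) - eps, 0.0)

# Example: predictions vs. targets
f_x = np.array([0.9, 2.1, 3.5])
y   = np.array([1.0, 2.0, 3.0])
print(square_loss(f_x, y))           # [0.01 0.01 0.25]
print(absolute_loss(f_x, y))         # [0.1  0.1  0.5 ]
print(eps_insensitive_loss(f_x, y))  # [0.   0.   0.4 ]
```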

Page 11

Common Loss Functions For Classification

For binary classification, the most intuitive loss is the 0-1 loss:

V(f(x), y) = Θ(−y f(x))

where Θ is the step function. For tractability and other reasons, we often use the hinge loss (implicitly introduced by Vapnik) in binary classification:

V(f(x), y) = (1 − y f(x))_+
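A minimal sketch of these two classification losses (our own illustration, not part of the slides); labels are assumed to be ±1 as above, and the 0-1 loss counts f(x) = 0 as a mistake by convention.

```python
import numpy as np

def zero_one_loss(f_x, y):
    """0-1 loss Theta(-y f(x)): 1 if the sign of f(x) disagrees with y, else 0."""
    return (y * f_x <= 0).astype(float)

def hinge_loss(f_x, y):
    """Hinge loss (1 - y f(x))_+ , a convex surrogate for the 0-1 loss."""
    return np.maximum(1.0 - y * f_x, 0.0)

f_x = np.array([ 2.0,  0.3, -0.5])   # real-valued predictions
y   = np.array([ 1.0, -1.0, -1.0])   # labels in {-1, +1}
print(zero_one_loss(f_x, y))  # [0. 1. 0.]
print(hinge_loss(f_x, y))     # [0.  1.3 0.5]
```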

Page 12

The learning problem: summary so far

There is an unknown probability distribution on the product space Z = X × Y, written µ(z) = µ(x, y). We assume that X is a compact domain in Euclidean space and Y a bounded subset of R. The training set S = {(x_1, y_1), ..., (x_n, y_n)} = {z_1, ..., z_n} consists of n samples drawn i.i.d. from µ.

H is the hypothesis space, a space of functions f : X → Y.

A learning algorithm is a map L : Z^n → H that looks at S and selects from H a function f_S : x → y such that f_S(x) ≈ y in a predictive way.

Page 13

Expected error, empirical error

Given a function f, a loss function V, and a probability distribution µ over Z, the expected or true error of f is:

I[f] = E_z V[f, z] = \int_Z V(f, z) \, dµ(z)

which is the expected loss on a new example drawn at random from µ. We would like to make I[f] small, but in general we do not know µ.

Given a function f, a loss function V, and a training set S consisting of n data points, the empirical error of f is:

I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)
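The sketch below (ours, not from the slides) computes the empirical error of a fixed function on a sample and approximates the expected error by Monte Carlo, using the square loss and an assumed data model y = sin(2πx) + noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Draw n i.i.d. pairs (x, y) from an assumed distribution mu:
    x ~ Uniform[0, 1], y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

def f(x):
    """A fixed candidate function (a crude linear guess)."""
    return 1.0 - 2.0 * x

def empirical_error(f, x, y):
    """I_S[f] = (1/n) sum_i V(f, z_i) with the square loss."""
    return np.mean((f(x) - y) ** 2)

# Empirical error I_S[f] on a small training set...
x_train, y_train = sample_data(20)
print("I_S[f] =", empirical_error(f, x_train, y_train))

# ...versus a Monte Carlo estimate of the expected error I[f]
x_big, y_big = sample_data(200_000)
print("I[f] (approx) =", empirical_error(f, x_big, y_big))
```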

Page 14

A reminder: convergence in probability

Let {X_n} be a sequence of bounded random variables. We say that

lim_{n→∞} X_n = X in probability

if

∀ε > 0   lim_{n→∞} P{|X_n − X| ≥ ε} = 0,

or, equivalently, if for each n there exist ε_n and δ_n such that

P{|X_n − X| ≥ ε_n} ≤ δ_n,

with ε_n and δ_n going to zero as n → ∞.
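As a quick numerical illustration (ours): the sample mean of i.i.d. bounded variables converges in probability to the true mean, so the probability P{|X_n − X| ≥ ε} shrinks as n grows. The distribution and ε below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.05, 1000
true_mean = 0.5   # here X is the constant 0.5, the mean of Uniform[0, 1]

for n in [10, 100, 1000, 10000]:
    # X_n = sample mean of n Uniform[0, 1] draws; repeat the experiment many times
    sample_means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    prob = np.mean(np.abs(sample_means - true_mean) >= eps)
    print(f"n = {n:5d}   P(|X_n - X| >= {eps}) ~ {prob:.3f}")
```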

Page 15

Generalization

A very natural requirement for f_S is distribution-independent generalization:

∀µ,   lim_{n→∞} |I_S[f_S] − I[f_S]| = 0 in probability

In other words, the training error for the solution must converge to the expected error and thus be a “proxy” for it. Otherwise the solution would not be “predictive”.

A desirable additional requirement is universal consistency:

∀ε > 0   lim_{n→∞} sup_µ P_S { I[f_S] > inf_{f∈H} I[f] + ε } = 0.

Remark: For some of the results the requirement of uniform convergence must be added in both definitions.

Page 16

A learning algorithm should be well-posed, e.g. stable

In addition to the key property of generalization, a “good” learning algorithm should also be stable: f_S should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity.

Page 17

General definition of Well-Posed and Ill-Posed problems

A problem is well-posed if its solution:

- exists
- is unique
- depends continuously on the data (e.g. it is stable)

A problem is ill-posed if it is not well-posed. In the context of this class, well-posedness is mainly used to mean stability of the solution.

Page 18

More on well-posed and ill-posed problems

Hadamard introduced the definition of ill-posedness. Ill-posed problems are typically inverse problems.

As an example, assume g is a function in Y and u is a function in X, with Y and X Hilbert spaces. Then, given the linear, continuous operator L, consider the equation

g = Lu.

The direct problem is to compute g given u; the inverse problem is to compute u given the data g. In the learning case L is somewhat similar to a “sampling” operation, and the inverse problem becomes the problem of finding a function that takes the values

f(x_i) = y_i,   i = 1, ..., n

The inverse problem of finding u is well-posed when the solution exists, is unique, and is stable, that is, depends continuously on the initial data g. Ill-posed problems fail to satisfy one or more of these criteria. Often the term ill-posed applies to problems that are not stable, which in a sense is the key condition.
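To make instability concrete, here is a small sketch (ours, not from the slides) with a nearly singular finite-dimensional "operator" L: a tiny perturbation of the data g changes the naively recovered u dramatically, while a Tikhonov-regularized solve (a foretaste of the regularization discussed later) stays stable. The matrix, perturbation, and γ are arbitrary choices.

```python
import numpy as np

# A nearly singular "forward operator" L (a discrete analogue of an ill-posed problem)
L = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
u_true = np.array([1.0, 1.0])
g = L @ u_true

# Direct inversion: a tiny perturbation of g produces a wildly different u
g_pert = g + np.array([0.0, 1e-3])
print(np.linalg.solve(L, g))        # ~ [1, 1]
print(np.linalg.solve(L, g_pert))   # far from [1, 1]

# Tikhonov-regularized solve: minimize ||L u - g||^2 + gamma ||u||^2
gamma = 1e-3
u_reg      = np.linalg.solve(L.T @ L + gamma * np.eye(2), L.T @ g)
u_reg_pert = np.linalg.solve(L.T @ L + gamma * np.eye(2), L.T @ g_pert)
print(u_reg, u_reg_pert)            # the two regularized solutions stay close to each other
```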

Page 19

ERM

Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select f_S as

f_S = arg min_{f∈H} I_S[f]
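A minimal example of ERM (ours): with the square loss and a hypothesis space H of polynomials of fixed degree, the empirical risk minimizer is the least-squares fit. The data model and degree are assumptions for illustration.

```python
import numpy as np

def erm_polynomial(x, y, degree):
    """ERM with the square loss over H = polynomials of the given degree:
    the empirical risk minimizer is the least-squares fit."""
    X = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda t: np.vander(np.atleast_1d(t), degree + 1, increasing=True) @ coeffs

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)

f_S = erm_polynomial(x, y, degree=2)
print("empirical error of f_S:", np.mean((f_S(x) - y) ** 2))
```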

Page 20

Generalization and Well-posedness of Empirical Risk Minimization

For ERM to represent a “good” class of learning algorithms, the solution should

- “generalize”
- exist, be unique and, especially, be “stable” (well-posedness).

Page 21

ERM and generalization: given a certain number of samples...

Page 22

...suppose this is the “true” solution...

Page 23

... but suppose ERM gives this solution.

Page 24

Under which conditions does the ERM solution converge to the true solution as the number of examples increases? In other words, what are the conditions for generalization of ERM?

Page 25

ERM and stability: given 10 samples...

[Plot: ten sample points; both axes range from 0 to 1.]

Page 26

...we can find the smoothest interpolating polynomial (which degree?).

[Plot: the interpolating polynomial through the ten points; both axes range from 0 to 1.]

Page 27

But if we perturb the points slightly...

[Plot: the same ten points, slightly perturbed; both axes range from 0 to 1.]

Page 28

...the solution changes a lot!

[Plot: the interpolating polynomial for the perturbed points; both axes range from 0 to 1.]

Page 29

If we restrict ourselves to degree two polynomials...

[Plot: the best degree-two polynomial fit to the ten points; both axes range from 0 to 1.]

Page 30

...the solution varies only a small amount under a small perturbation.

[Plot: the degree-two fits to the original and the perturbed points, which nearly coincide; both axes range from 0 to 1.]
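The sketch below (ours) reproduces the spirit of this demonstration: a degree-9 polynomial that interpolates the 10 points typically changes drastically when the data are slightly perturbed, while a degree-2 least-squares fit barely moves. The data model and perturbation size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 10))
y = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=10)
y_pert = y + 0.01 * rng.normal(size=10)     # a small perturbation of the data

grid = np.linspace(0, 1, 200)

def fit_on_grid(x, y, degree):
    """Least-squares polynomial fit evaluated on a dense grid;
    degree 9 interpolates the 10 points exactly."""
    return np.polyval(np.polyfit(x, y, degree), grid)

# Degree 9 (interpolation): the two curves typically differ by a large amount
print(np.max(np.abs(fit_on_grid(x, y, 9) - fit_on_grid(x, y_pert, 9))))

# Degree 2 (a small hypothesis space): the two fits stay very close
print(np.max(np.abs(fit_on_grid(x, y, 2) - fit_on_grid(x, y_pert, 2))))
```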

Page 31

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. Similarly, the classical conditions for consistency of ERM consist of appropriately restricting H:

Page 32

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Definition: H is a (weak) uniform Glivenko-Cantelli (uGC) class if

∀ε > 0   lim_{n→∞} sup_µ P_S { sup_{f∈H} |I[f] − I_S[f]| > ε } = 0.

Page 33

ERM: conditions for well-posedness (stability) and predictivity (generalization)

Theorem [Vapnik and Cervonenkis (71), Alon et al (97), Dudley, Giné, and Zinn (91)]

A necessary and sufficient condition for generalization (and consistency) of ERM is that H is uGC.

Thus, a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization and consistency are equivalent). A proper choice also guarantees stability of ERM. With the appropriate definition of stability, stability and generalization are equivalent for ERM.

Page 34

Regularization

Regularization (originally introduced by Tikhonov independently of the learning problem) ensures well-posedness and, because of the above argument, generalization of ERM by constraining the hypothesis space H. The direct way, minimizing the empirical error subject to f lying in a ball in an appropriate H, is called Ivanov regularization. The indirect way is Tikhonov regularization (which is not strictly ERM).

Page 35

Ivanov and Tikhonov Regularization

ERM finds the function in (H, ‖·‖) which minimizes

\frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)

which in general, for an arbitrary hypothesis space H, is ill-posed.

Ivanov regularizes by finding the function that minimizes

\frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)

while satisfying

‖f‖^2 ≤ A.

Tikhonov regularization minimizes over the hypothesis space H, for a fixed positive parameter γ, the regularized functional

\frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + γ‖f‖_K^2,    (1)

where ‖f‖_K is the norm in H, the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel K.
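For the square loss, the minimizer of the Tikhonov functional (1) in an RKHS has the well-known kernel ridge regression form f(x) = Σ_i c_i K(x, x_i) with c = (K + γ n I)^{-1} y. The sketch below (ours; the Gaussian kernel width, γ, and data model are arbitrary assumptions) implements it.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.2):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for 1-D inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def tikhonov_fit(x, y, gamma=1e-3, sigma=0.2):
    """Minimize (1/n) sum_i (f(x_i) - y_i)^2 + gamma ||f||_K^2 over the RKHS.
    The solution is f(x) = sum_i c_i K(x, x_i) with c = (K + gamma*n*I)^{-1} y."""
    n = len(x)
    K = gaussian_kernel(x, x, sigma)
    c = np.linalg.solve(K + gamma * n * np.eye(n), y)
    return lambda t: gaussian_kernel(np.atleast_1d(t), x, sigma) @ c

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

f_S = tikhonov_fit(x, y)
print(f_S(np.array([0.0, 0.25, 0.5])))   # approximately sin(2*pi*t) at those points
```

The derivation of this solution is exactly what the next classes cover; here it is only meant as a preview of how the functional (1) is used in practice.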

Page 36

Tikhonov Regularization

As we will see in future classes

- Tikhonov regularization ensures well-posedness, that is existence, uniqueness and, especially, stability (in a very strong form) of the solution
- Tikhonov regularization ensures generalization
- Tikhonov regularization is closely related to, but different from, Ivanov regularization, i.e. ERM on a hypothesis space H which is a ball in an RKHS.

Page 37

Next Class

In the next class we will introduce RKHS: they will be the hypothesis spaces we will work with. We will also derive the solution of Tikhonov regularization.

Page 38

Appendix: Target Space, Sample and Approximation Error

In addition to the hypothesis space H, the space we allow our algorithms to search, we define...

The target space T is a space of functions, chosen a priori in any given problem, that is assumed to contain the “true” function f_0 that minimizes the risk. Often, T is chosen to be all functions in L2, or all differentiable functions. Notice that the “true” function, if it exists, is defined by µ(z), which contains all the relevant information.

Page 39

Sample Error (also called Estimation Error)

Let f_H be the function in H with the smallest true risk. We have defined the generalization error to be I_S[f_S] − I[f_S]. We define the sample error to be I[f_S] − I[f_H], the difference in true risk between the function in H we actually find and the best function in H. This is what we pay because our finite sample does not give us enough information to choose the “best” function in H. We’d like this to be small. Consistency, defined earlier, is equivalent to the sample error going to zero for n → ∞.

A main goal in classical learning theory (Vapnik, Smale, ...) is “bounding” the generalization error. Another goal, for learning theory and statistics, is bounding the sample error, that is, determining conditions under which we can state that I[f_S] − I[f_H] will be small (with high probability).

As a simple rule, we expect that if H is “well-behaved”, then, as n gets large, the sample error will become small.

Page 40

Approximation Error

Let f_0 be the function in T with the smallest true risk. We define the approximation error to be I[f_H] − I[f_0], the difference in true risk between the best function in H and the best function in T. This is what we pay because H is smaller than T. We’d like this error to be small too. In much of the following we can assume that I[f_0] = 0.

We will focus less on the approximation error in 9.520, but we will explore it.

As a simple rule, we expect that as H grows bigger, the approximation error gets smaller. If T ⊆ H, a situation called the realizable setting, the approximation error is zero.

Page 41

Error

We define the error to be I[f_S] − I[f_0], the difference in true risk between the function we actually find and the best function in T. We’d really like this to be small. As we mentioned, often we can assume that the error is simply I[f_S].

The error is the sum of the sample error and the approximation error:

I[f_S] − I[f_0] = (I[f_S] − I[f_H]) + (I[f_H] − I[f_0])

If we can make both the approximation and the sample error small, the error will be small. There is a tradeoff between the approximation error and the sample error...

Page 42

The Approximation/Sample Tradeoff

It should already be intuitively clear that making H big makes the approximation error small. This implies that we can (help) make the error small by making H big.

On the other hand, we will show that making H small will make the sample error small. In particular for ERM, if H is a uGC class, the generalization error and the sample error will go to zero as n → ∞, but how quickly depends directly on the “size” of H. This implies that we want to keep H as small as possible. (Furthermore, T itself may or may not be a uGC class.)

Ideally, we would like to find the optimal tradeoff between these conflicting requirements.
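A small numerical illustration of the tradeoff (ours; the data model, degrees, and sample size are arbitrary assumptions): as the polynomial degree, a proxy for the “size” of H, grows, the training error keeps falling while the test error, a stand-in for the true risk, typically first falls and then rises.

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(2000)    # a large test set approximates the true risk I[f]

for degree in [1, 2, 3, 5, 9]:      # degree is a proxy for the "size" of H
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err  = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train {train_err:.3f}   test {test_err:.3f}")
```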
