Statistical Learning Theory


Page 1: Statistical Learning Theory


Statistical Learning Theory

Page 2: Statistical Learning Theory


Statistical Learning Theory

A model of supervised learning consists of:

a) Environment. It supplies a vector $x$ with a fixed but unknown pdf $F_X(x)$.

b) Teacher. It provides a desired response $d$ for every $x$ according to a conditional pdf $F_X(x \mid d)$. These are related by

$$d = f(x, v)$$

Page 3: Statistical Learning Theory


Statistical Learning Theory

where $v$ is a noise term.

c) Learning machine. It is capable of implementing a set of I/O mapping functions

$$y = F(x, w)$$

where $y$ is the actual response and $w$ is a set of free parameters (weights) selected from the parameter (weight) space $W$.

Page 4: Statistical Learning Theory


Statistical Learning Theory

The supervised learning problem is that of selecting the particular $F(x, w)$ that approximates $d$ in an optimum fashion. The selection itself is based on a set of i.i.d. training samples

$$\hat{T} = \{(x_i, d_i)\}_{i=1}^{N}$$

Each sample in $\hat{T}$ is drawn according to the joint pdf $F_{X,D}(x, d)$.

Page 5: Statistical Learning Theory


Statistical Learning Theory

Supervised learning depends on the following question: "Do the training examples $\{(x_i, d_i)\}$ contain enough information to construct a learning machine capable of good generalization?"

To answer it, we view the problem as an approximation problem: we wish to find the function $F(x, w)$ that is the best possible approximation to $f(x)$.

Page 6: Statistical Learning Theory


Statistical Learning Theory

Let

$$L(d, F(x, w)) = (d - F(x, w))^2$$

denote a measure of the discrepancy between the desired response $d$ corresponding to a vector $x$ and the actual response produced by $F(x, w)$.

The expected value of the loss is defined by the risk functional

$$R(w) = \int L(d, F(x, w)) \, dF_{X,D}(x, d)$$

Page 7: Statistical Learning Theory


Statistical Learning Theory

The risk functional may be easily understood from the finite approximation

$$R(w) \approx \sum_i L(d_i, F(x_i, w)) \, P(x_i, d_i)$$

where $P(x_i, d_i)$ denotes the probability of drawing the $i$-th sample.

Page 8: Statistical Learning Theory


Principle of Empirical Risk Minimization

Instead of using $R(w)$ we use an empirical measure:

$$R_{emp}(w) = \frac{1}{N} \sum_{i=1}^{N} L(d_i, F(x_i, w))$$

This measure differs from $R(w)$ in two desirable ways:

a) It does not depend on the unknown pdf $F_{X,D}(x, d)$ explicitly.
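As a quick illustration of the empirical measure, here is a minimal Python sketch (not from the slides) of the empirical risk with the squared loss defined earlier; the linear form of $F(x, w)$ and all function names are assumptions made only for this example.

```python
import numpy as np

def squared_loss(d, y):
    """Loss L(d, F(x, w)) = (d - F(x, w))^2."""
    return (d - y) ** 2

def empirical_risk(X, d, w, F):
    """R_emp(w) = (1/N) * sum_i L(d_i, F(x_i, w))."""
    y = F(X, w)                         # actual responses F(x_i, w)
    return np.mean(squared_loss(d, y))  # average loss over the N training samples

# Hypothetical learning machine: a linear map F(x, w) = x . w
linear_F = lambda X, w: X @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # N = 100 samples, m = 3 inputs
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=100)   # desired response d = f(x) + noise v

print(empirical_risk(X, d, w_true, linear_F))
```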

Page 9: Statistical Learning Theory


Principle of Empirical Risk Minimization

b) In theory it can be minimized with respect to $w$.

Let $w_E$ and $F(x, w_E)$ denote the weight vector and the mapping that minimize $R_{emp}(w)$. Also, let $w_0$ and $F(x, w_0)$ denote the analogues for $R(w)$. Both $w_E$ and $w_0$ belong to the space $W$.

Page 10: Statistical Learning Theory


Principle of Empirical Risk Minimization

We must now consider under which conditions $F(x, w_E)$ is close to $F(x, w_0)$, as measured by the mismatch between $R_{emp}(w)$ and $R(w)$.

Page 11: Statistical Learning Theory


Principle of Empirical Risk Minimization

1. In place of $R(w)$, construct

$$R_{emp}(w) = \frac{1}{N} \sum_{i=1}^{N} L(d_i, F(x_i, w))$$

on the basis of the training set of i.i.d. samples $(x_i, d_i)$, $i = 1, \ldots, N$.

Page 12: Statistical Learning Theory


Principle of Empirical Risk Minimization

2. $R_{emp}(w_E)$ converges in probability to the minimum possible value of $R(w)$ as $N \to \infty$, provided that $R_{emp}(w)$ converges uniformly to $R(w)$.

3. Uniform convergence, as per

$$P\left(\sup_{w \in W} |R(w) - R_{emp}(w)| > \epsilon\right) \to 0$$

is necessary and sufficient for consistency of the principle of empirical risk minimization (PERM).
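The convergence claim can be made concrete with a small numerical experiment. The sketch below is my own illustration, with an assumed model $d = 2x + v$ and a scalar weight; it shows the empirical risk of a fixed mapping approaching a Monte Carlo estimate of the true risk as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n i.i.d. pairs (x, d) from an assumed model d = 2x + v."""
    x = rng.normal(size=n)
    d = 2.0 * x + rng.normal(scale=0.5, size=n)
    return x, d

def risk(w, x, d):
    """Average squared loss of the mapping F(x, w) = w*x on the given pairs."""
    return np.mean((d - w * x) ** 2)

w = 1.5                                  # a fixed candidate weight
x_big, d_big = sample(500_000)           # large sample: Monte Carlo proxy for R(w)
print("R(w) ~", risk(w, x_big, d_big))

for N in (10, 100, 1_000, 10_000):       # R_emp(w) on training sets of growing size
    x, d = sample(N)
    print(N, risk(w, x, d))
```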

Page 13: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

The theory of uniform convergence of $R_{emp}(w)$ to $R(w)$ includes rates of convergence based on a parameter called the VC dimension. It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.

Page 14: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

To describe the concept of VC dimension, let us consider a binary pattern classification problem for which the desired response is $d \in \{0, 1\}$.

A dichotomy is a classification function. Let $\hat{F}$ denote the set of dichotomies implemented by a learning machine:

$$\hat{F} = \{F(x, w) : w \in \hat{W},\; F : \mathbb{R}^m \times \hat{W} \to \{0, 1\}\}$$

Page 15: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

Let $\hat{L}$ denote a set of N points in the m-dimensional space $X$ of input vectors:

$$\hat{L} = \{x_i \in X;\; i = 1, \ldots, N\}$$

A dichotomy partitions $\hat{L}$ into two disjoint subsets $\hat{L}_0$ and $\hat{L}_1$ such that

$$F(x, w) = \begin{cases} 0 & \text{for } x \in \hat{L}_0 \\ 1 & \text{for } x \in \hat{L}_1 \end{cases}$$

Page 16: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

Let $\Delta_{\hat{F}}(\hat{L})$ denote the number of distinct dichotomies of $\hat{L}$ implemented by the learning machine.

Let $\Delta_{\hat{F}}(l)$ denote the maximum of $\Delta_{\hat{F}}(\hat{L})$ over all $\hat{L}$ with $|\hat{L}| = l$.

$\hat{L}$ is shattered by $\hat{F}$ if $\Delta_{\hat{F}}(\hat{L}) = 2^{|\hat{L}|}$, that is, if all possible dichotomies of $\hat{L}$ can be induced by functions in $\hat{F}$.

Page 17: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

In the figure we illustrate a two-dimensional space consisting of four points $(x_1, \ldots, x_4)$. The decision boundaries of $F_0$ and $F_1$ correspond to the classes 0 and 1 being true. $F_0$ induces the dichotomy:

Page 18: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

$$\hat{D}_0 = \{\hat{L}_0 = [x_1, x_2, x_4],\; \hat{L}_1 = [x_3]\}$$

while $F_1$ induces

$$\hat{D}_1 = \{\hat{L}_0 = [x_1, x_2],\; \hat{L}_1 = [x_3, x_4]\}$$

With the set $\hat{L}$ consisting of four points, the cardinality $|\hat{L}| = 4$. Hence,

$$\Delta_{\hat{F}}(\hat{L}) \le 2^4 = 16$$
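The counting in this example can be reproduced mechanically. The sketch below assumes a specific learning machine (linear threshold functions) and specific coordinates for the four points, neither of which is given in the slides; it collects the dichotomies such a machine induces and confirms that this particular four-point set is not shattered.

```python
import numpy as np

# Four points in the plane, loosely following the slide's (x1, ..., x4) example
# (the exact coordinates are an assumption; the slides do not give them).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def dichotomies_by_linear_thresholds(pts, n_trials=50_000, seed=0):
    """Collect the dichotomies induced on `pts` by randomly drawn linear threshold
    functions F(x, w) = 1[w.x + b >= 0], used here as a stand-in learning machine."""
    rng = np.random.default_rng(seed)
    seen = set()
    for _ in range(n_trials):
        w, b = rng.normal(size=2), rng.normal()
        seen.add(tuple((pts @ w + b >= 0).astype(int)))
    return seen

found = dichotomies_by_linear_thresholds(points)
print(f"distinct dichotomies found: {len(found)} of the 2^4 = 16 possible")
# The two XOR-style labelings of opposite corners cannot be realized by a line,
# so this particular four-point set is not shattered:
print((1, 0, 0, 1) in found, (0, 1, 1, 0) in found)
```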

Page 19: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

We now formally define the VC dimension as follows:

"The VC dimension of an ensemble of dichotomies $\hat{F}$ is the cardinality of the largest set $\hat{L}$ that is shattered by $\hat{F}$."

Page 20: Statistical Learning Theory


The Vapnik Chervonenkis Dimension

In more familiar terms, the VC dimension of the set of classification functions $\{F(x, w) : w \in \hat{W}\}$ is the maximum number of training examples that can be learned by the machine without error, for all possible binary labelings of those examples.

Page 21: Statistical Learning Theory


Importance of the VC Dimension

Roughly speaking, the number of examples needed to learn a class of interest reliably is proportional to the VC dimension.

In some cases the VC dimension is determined by the free parameters of a Neural Network.

In this regard, the following two results are of interest.

Page 22: Statistical Learning Theory


Importance of the VC Dimension

1. Let $\mathcal{N}$ denote an arbitrary feedforward network built up from neurons with a threshold activation function

$$\varphi(v) = \begin{cases} 1 & \text{for } v \ge 0 \\ 0 & \text{for } v < 0 \end{cases}$$

The VC dimension of $\mathcal{N}$ is $O(W \log W)$, where $W$ is the total number of free parameters in the network.

Page 23: Statistical Learning Theory


Importance of the VC Dimension

2. Let $\mathcal{N}$ denote a multilayer feedforward network whose neurons use a sigmoid activation function

$$\varphi(v) = \frac{1}{1 + e^{-v}}$$

The VC dimension of $\mathcal{N}$ is $O(W^2)$, where $W$ is the total number of free parameters in the network.
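To make the two growth rates tangible, the following sketch counts the free parameters $W$ of a hypothetical fully connected network and prints the corresponding orders of growth; the architecture is made up, and the printed quantities are order-of-magnitude indicators only, not actual VC dimensions.

```python
import math

def count_free_parameters(layer_sizes):
    """Total number of weights and biases W in a fully connected feedforward network.

    layer_sizes = [inputs, hidden_1, ..., hidden_k, outputs]
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical architecture: 10 inputs, two hidden layers of 20 neurons, 1 output.
W = count_free_parameters([10, 20, 20, 1])
print("W =", W)
print("threshold units : VC dimension grows like W log W ~", round(W * math.log(W)))
print("sigmoidal units : VC dimension grows like W^2     ~", W * W)
```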

Page 24: Statistical Learning Theory


Importance of the VC Dimension

In the case of binary pattern classification the loss function has only two possible values:

$$L(d, F(x, w)) = \begin{cases} 0 & \text{if } F(x, w) = d \\ 1 & \text{otherwise} \end{cases}$$

The risk functional $R(w)$ and the empirical risk functional $R_{emp}(w)$ then assume the following interpretations:

Page 25: Statistical Learning Theory


Importance of the VC Dimension

$R(w)$ is the probability of classification error, denoted by $P(w)$.

$R_{emp}(w)$ is the training error, denoted by $v(w)$.

Then (Haykin, p. 98):

$$P\left(\sup_{w} |P(w) - v(w)| > \epsilon\right) \to 0 \quad \text{as } N \to \infty$$

Page 26: Statistical Learning Theory


Importance of the VC Dimension

The notion of VC dimension provides a bound on the rate of uniform convergence. For the set of classification functions with VC dimension $h$, the following inequality holds:

$$P\left(\sup_{w} |P(w) - v(w)| > \epsilon\right) \le \left(\frac{2eN}{h}\right)^{h} \exp(-\epsilon^2 N) \qquad \text{(vc.1)}$$

where $N$ is the size of the training sample. In other words, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
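The right-hand side of (vc.1) is easy to evaluate numerically. The sketch below does so in log space to avoid overflow; the values of $h$ and $\epsilon$ are arbitrary choices for the illustration.

```python
import numpy as np

def vc_uniform_convergence_bound(N, h, eps):
    """Right-hand side of (vc.1): (2eN/h)^h * exp(-eps^2 * N)."""
    log_bound = h * (np.log(2.0 * N) + 1.0 - np.log(h)) - eps**2 * N
    return np.exp(log_bound)

# For an assumed machine with h = 50, how fast does the bound on
# P(sup |P(w) - v(w)| > 0.1) shrink as the sample size N grows?
h, eps = 50, 0.1
for N in (10_000, 50_000, 100_000, 200_000):
    print(N, vc_uniform_convergence_bound(N, h, eps))
```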

Page 27: Statistical Learning Theory


Importance of the VC Dimension

The factor $(2eN/h)^h$ in (vc.1) represents a bound on the growth function $\Delta_{\hat{F}}(l)$ for the family of functions

$$\hat{F} = \{F(x, w);\; w \in \hat{W}\}$$

for $l \ge h \ge 1$. Provided that this function does not grow too fast, the right-hand side will go to zero as $N$ goes to infinity.

This requirement is satisfied if the VC dimension is finite.

Page 28: Statistical Learning Theory


Importance of the VC Dimension

Thus, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.

Let $\alpha$ denote the probability of occurrence of the event

$$\sup_{w} |P(w) - v(w)| > \epsilon$$

Using the previous bound (vc.1) we find

$$\alpha = \left(\frac{2eN}{h}\right)^{h} \exp(-\epsilon^2 N) \qquad \text{(vc.2)}$$

Page 29: Statistical Learning Theory


Importance of the VC Dimension

Let $\epsilon_0(N, h, \alpha)$ denote the special value of $\epsilon$ that satisfies (vc.2). Then we obtain (Haykin, p. 99):

$$\epsilon_0(N, h, \alpha) = \sqrt{\frac{h}{N}\left[\log\left(\frac{2N}{h}\right) + 1\right] - \frac{1}{N}\log\alpha}$$

We refer to $\epsilon_0$ as the confidence interval.
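A direct implementation of this expression, with illustrative values of $h$ and $\alpha$:

```python
import numpy as np

def epsilon_0(N, h, alpha):
    """Confidence interval eps_0(N, h, alpha) obtained by solving (vc.2) for eps:
    eps_0 = sqrt( (h/N) * (log(2N/h) + 1) - (1/N) * log(alpha) )."""
    return np.sqrt((h / N) * (np.log(2.0 * N / h) + 1.0) - np.log(alpha) / N)

# Illustrative values only: VC dimension h = 50, confidence level alpha = 0.05.
for N in (1_000, 10_000, 100_000, 1_000_000):
    print(N, epsilon_0(N, h=50, alpha=0.05))
```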

Page 30: Statistical Learning Theory


Importance of the VC Dimension

We may also write

$$P(w) \le v(w) + \epsilon_1(N, h, \alpha, v)$$

where

$$\epsilon_1(N, h, \alpha, v) = 2\,\epsilon_0^2(N, h, \alpha)\left[1 + \sqrt{1 + \frac{v(w)}{\epsilon_0^2(N, h, \alpha)}}\right]$$

Page 31: Statistical Learning Theory


Importance of the VC Dimension

Conclusions:

1. In general: $P(w) \le v(w) + \epsilon_1(N, h, \alpha, v)$

2. For a small training error (close to zero): $P(w) \le v(w) + 4\,\epsilon_0^2(N, h, \alpha)$

3. For a large training error (close to unity): $P(w) \le v(w) + \epsilon_0(N, h, \alpha)$

Page 32: Statistical Learning Theory


Structural Risk Minimization

The training error is the frequency of errors made during the training session by some machine with weight vector $w$.

The generalization error is the frequency of errors made by the machine when it is tested with examples not seen before.

Let these two errors be denoted by $v_{train}(w)$ and $v_{gene}(w)$, respectively.

Page 33: Statistical Learning Theory


Structural Risk Minimization

Let $h$ be the VC dimension of a family of classification functions $\{F(x, w);\; w \in \hat{W}\}$ with respect to the input space $X$. The generalization error $v_{gene}(w)$ is lower than the guaranteed risk defined by the sum of two competing terms

$$v_{guarant}(w) = v_{train}(w) + \epsilon_1(N, h, \alpha, v_{train})$$

where the confidence interval $\epsilon_1(N, h, \alpha, v_{train})$ is defined as before.

Page 34: Statistical Learning Theory


Structural Risk Minimization

For a fixed number of training samples $N$, the training error decreases monotonically as the capacity $h$ is increased, whereas the confidence interval increases monotonically:

$$\epsilon_1(N, h, \alpha, v_{train}) = 2\,\epsilon_0^2(N, h, \alpha)\left[1 + \sqrt{1 + \frac{v_{train}(w)}{\epsilon_0^2(N, h, \alpha)}}\right]$$
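The monotone behaviour of the confidence interval can be seen numerically. In the sketch below, $N$ and $\alpha$ are fixed and the listed training errors are invented values that shrink as $h$ grows; only the formulas for $\epsilon_0$ and $\epsilon_1$ come from the slides.

```python
import numpy as np

def epsilon_0(N, h, alpha):
    """Confidence interval eps_0 from (vc.2)."""
    return np.sqrt((h / N) * (np.log(2.0 * N / h) + 1.0) - np.log(alpha) / N)

def epsilon_1(N, h, alpha, v_train):
    """Confidence term eps_1 = 2*eps_0^2 * (1 + sqrt(1 + v_train / eps_0^2))."""
    e0_sq = epsilon_0(N, h, alpha) ** 2
    return 2.0 * e0_sq * (1.0 + np.sqrt(1.0 + v_train / e0_sq))

# Illustrative values only: fixed N and alpha, a training error that shrinks as
# capacity h grows (the v_train values are made up for the illustration).
N, alpha = 10_000, 0.05
for h, v_train in [(10, 0.20), (50, 0.12), (200, 0.05), (1000, 0.01)]:
    conf = epsilon_1(N, h, alpha, v_train)
    print(f"h={h:5d}  v_train={v_train:.2f}  eps_1={conf:.3f}  guaranteed={v_train + conf:.3f}")
```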


Page 37: Statistical Learning Theory


Structural Risk Minimization

The challenge in solving a supervised learning problem lies in realizing the best generalization performance by matching the machine capacity to the available amount of training data for the problem at hand. The method of structural risk minimization provides an inductive procedure to achieve this goal by making the VC dimension of the learning machine a control variable.

Page 38: Statistical Learning Theory


Structural Risk Minimization

Consider an ensemble of pattern classifiers $\{F(x, w) : w \in \hat{W}\}$ and define a nested structure of $n$ such machines

$$\hat{F}_k = \{F(x, w);\; w \in \hat{W}_k\}, \qquad k = 1, \ldots, n$$

such that we have

$$\hat{F}_1 \subset \hat{F}_2 \subset \cdots \subset \hat{F}_n$$

Correspondingly, the VC dimensions of the individual pattern classifiers satisfy

$$h_1 \le h_2 \le \cdots \le h_n$$

which implies that the VC dimension of each classifier is finite (see next figure).

Page 39: Statistical Learning Theory


Illustration of relationship between training error, confidence interval and guaranteed risk

Page 40: Statistical Learning Theory


Structural Risk Minimization

Then:

a) The empirical risk (training error) of each classifier is minimized.

b) The pattern classifier $F^*$ with the smallest guaranteed risk is identified; this particular machine provides the best compromise between the training error (quality of the approximation) and the confidence interval (complexity of the approximating function).

Page 41: Statistical Learning Theory


Structural Risk Minimization

Our goal is to find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.

We achieve this, for example, by varying $h$ through the number of hidden neurons.

We evaluate the ensemble of fully connected multilayer feedforward networks in which the number of neurons in one of the hidden layers is increased in a monotonic fashion.

Page 42: Statistical Learning Theory


Structural Risk Minimization

The principle of SRM states that the best network in this ensemble is the one for which the guaranteed risk is the minimum.
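A rough sketch of the SRM selection loop described on the last few slides. The use of scikit-learn MLPs, the synthetic data, and in particular equating $h$ with the parameter count $W$ are assumptions made only for illustration; the slides do not prescribe any of these choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

def epsilon_0(N, h, alpha):
    return np.sqrt((h / N) * (np.log(2.0 * N / h) + 1.0) - np.log(alpha) / N)

def epsilon_1(N, h, alpha, v_train):
    e0_sq = epsilon_0(N, h, alpha) ** 2
    return 2.0 * e0_sq * (1.0 + np.sqrt(1.0 + v_train / e0_sq))

# Nested family F_1 c F_2 c ...: one-hidden-layer networks of increasing width.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
N, m, alpha = len(y), X.shape[1], 0.05

best = None
for k in (2, 4, 8, 16, 32):
    net = MLPClassifier(hidden_layer_sizes=(k,), max_iter=1000, random_state=0).fit(X, y)
    v_train = 1.0 - net.score(X, y)        # training error
    W = m * k + k + k + 1                  # free parameters of the network
    h = W                                  # crude stand-in for the VC dimension
    guaranteed = v_train + epsilon_1(N, h, alpha, v_train)
    print(f"hidden={k:3d}  W={W:4d}  v_train={v_train:.3f}  guaranteed={guaranteed:.3f}")
    if best is None or guaranteed < best[1]:
        best = (k, guaranteed)

print("selected hidden-layer size:", best[0])
```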