
Page 1

Introduction to Radial Basis Function Networks

Contents

– Overview
– The Models of Function Approximator
– The Radial Basis Function Networks
– RBFN’s for Function Approximation
– The Projection Matrix
– Learning the Kernels
– Bias-Variance Dilemma
– The Effective Number of Parameters
– Model Selection

Introduction to Radial Basis Function Networks

Overview


Linear models have been studied in statistics for about 200 years, and the theory is applicable to RBF networks, which are just one particular type of linear model.

However, the fashion for neural networks, which started in the mid-80s, has given rise to new names for concepts already familiar to statisticians.

Typical Applications of NN

Pattern classification: $f(\mathbf{x}) = l$, where $l$ is a class label, $l \in C = \{1, \dots, N\}$.

Function approximation: $\mathbf{y} = f(\mathbf{x})$, with $\mathbf{x} \in X \subseteq \mathbb{R}^m$ and $\mathbf{y} \in Y \subseteq \mathbb{R}^n$.

Time-series forecasting: $x_t = f(x_{t-1}, x_{t-2}, x_{t-3}, \dots)$.

Function Approximation

The unknown function $f: X \to Y$, with $X \subseteq \mathbb{R}^m$ and $Y \subseteq \mathbb{R}^n$, is to be matched by an approximator $\hat{f}: X \to Y$.

Page 2

Neural Network

Supervised Learning

An unknown function produces the target $y_i$ for each input $\mathbf{x}_i$, $i = 1, 2, \dots$; the network produces the estimate $\hat{y}_i$, and learning is driven by the error $e_i = y_i - \hat{y}_i$.

Neural Networks as Universal Approximators

Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.

Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators.

Statistics vs. Neural Networks

Statistics                Neural Networks
Model                     Network
Estimation                Learning
Regression                Supervised learning
Interpolation             Generalization
Observations              Training set
Parameters                (Synaptic) weights
Independent variables     Inputs
Dependent variables       Outputs
Ridge regression          Weight decay

Introduction to Radial Basis Function Networks

The Models of Function Approximator

Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$

The $\phi_i(\cdot)$ are fixed basis functions; the $w_i$ are the weights.

Linear Models

[Figure: network topology. The feature vector $\mathbf{x} = (x_1, \dots, x_n)^T$ feeds $m$ hidden units $\phi_1, \dots, \phi_m$, whose outputs are combined through the weights $w_1, \dots, w_m$ into the output $y$.]

The hidden units perform decomposition, feature extraction, and transformation; the output is the linearly weighted sum

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$
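To make the linear-model view concrete, here is a minimal NumPy sketch; the helper name `design_matrix` and the Gaussian bumps are illustrative choices, not something from the slides. The model is linear in the weights even though each basis function is nonlinear in the input.

```python
import numpy as np

def design_matrix(X, basis_fns):
    """Evaluate each fixed basis function phi_i on every input in X."""
    return np.column_stack([phi(X) for phi in basis_fns])

# Three fixed Gaussian bumps on the real line play the role of phi_1..phi_m.
basis_fns = [lambda x, c=c: np.exp(-(x - c)**2) for c in (-1.0, 0.0, 1.0)]

X = np.linspace(-2, 2, 5)               # inputs x^(1)..x^(5)
w = np.array([0.5, -1.0, 2.0])          # weights w_1..w_m
y = design_matrix(X, basis_fns) @ w     # f(x) = sum_i w_i phi_i(x)
print(y)
```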

Page 3


Example Linear Models

Polynomial:

$$f(x) = \sum_{i} w_i\,x^i, \qquad \phi_i(x) = x^i, \quad i = 0, 1, 2, \dots$$

Fourier series:

$$f(x) = \sum_{k} \tilde{w}_k\,\exp\!\left(j\,2\pi k \nu_0 x\right), \qquad \phi_k(x) = \exp\!\left(j\,2\pi k \nu_0 x\right), \quad k = 0, 1, 2, \dots$$

Single-Layer Perceptrons as Universal Approximators

Hidden units: sigmoidal functions. With a sufficient number of sigmoidal units, $f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$ can be a universal approximator.

Radial Basis Function Networks as Universal Approximators

Hidden units: radial basis functions. With a sufficient number of radial-basis-function units, $f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$ can also be a universal approximator.

Non-Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$

Here not only the weights but also the basis functions themselves are adjusted by the learning process.

Introduction to Radial Basis Function Networks

The Radial Basis Function Networks

Page 4

Radial Basis Functions

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}), \qquad \phi_i(\mathbf{x}) = \phi\!\left(\left\|\mathbf{x} - \mathbf{x}_i\right\|\right)$$

Three parameters for a radial function:
– Center: $\mathbf{x}_i$
– Distance measure: $r = \left\|\mathbf{x} - \mathbf{x}_i\right\|$
– Shape: the profile $\phi$

Typical Radial Functions

Gaussian:

$$\phi(r) = e^{-\frac{r^2}{2\sigma^2}}, \qquad \sigma > 0,\; r \in \mathbb{R}$$

Hardy multiquadratic (1971):

$$\phi(r) = \sqrt{r^2 + c^2}, \qquad c > 0,\; r \in \mathbb{R}$$

Inverse multiquadratic:

$$\phi(r) = \frac{c}{\sqrt{r^2 + c^2}}, \qquad c > 0,\; r \in \mathbb{R}$$

[Figure: Gaussian basis functions with $\sigma = 0.5, 1.0, 1.5$.]

Inverse Multiquadratic

[Figure: $\phi(r) = c/\sqrt{r^2 + c^2}$ for $c = 1, 2, 3, 4, 5$, plotted over $-10 \le r \le 10$; every curve peaks at $\phi(0) = 1$ and flattens as $c$ grows.]
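The three radial profiles translate directly into code. A small sketch (the parameter defaults are illustrative):

```python
import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian profile: peaks at r = 0, width controlled by sigma."""
    return np.exp(-r**2 / (2 * sigma**2))

def multiquadratic(r, c=1.0):
    """Hardy's multiquadratic (1971): grows with |r|."""
    return np.sqrt(r**2 + c**2)

def inverse_multiquadratic(r, c=1.0):
    """Inverse multiquadratic: equals 1 at r = 0 for every c."""
    return c / np.sqrt(r**2 + c**2)

r = np.linspace(-10, 10, 201)
print(gaussian(r, sigma=0.5).max(), inverse_multiquadratic(r, c=3.0).max())
```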

Most General RBF

$$\phi_i(\mathbf{x}) = \phi\!\left(\left(\mathbf{x} - \boldsymbol{\mu}_i\right)^T \boldsymbol{\Sigma}_i^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_i\right)\right)$$

with centers $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\mu}_3, \dots$

The basis $\{\phi_i : i = 1, 2, \dots\}$ is 'nearly' orthogonal.

Properties of RBF’s

On-center, off-surround.

Analogies with localized receptive fields found in several biological structures, e.g.:
– visual cortex
– ganglion cells

Page 5

The Topology of RBF

[Figure: the feature vector $x_1, \dots, x_n$ at the input layer, hidden units in the middle, output units $y_1, \dots, y_m$ at the top.]

As a function approximator, $\mathbf{y} = f(\mathbf{x})$: the hidden units perform projection, the output units interpolation.

As a pattern classifier: the hidden units represent subclasses, the output units classes.

Introduction to Radial Basis Function Networks

RBFN’s for Function Approximation

Radial Basis Function Networks

Radial basis function (RBF) networks are feed-forward networks trained using a supervised training algorithm.

The activation function is selected from a class of functions called basis functions.

They usually train much faster than back-propagation (BP) networks, and they are less susceptible to problems with non-stationary inputs.

Radial Basis Function Networks

Popularized by Broomhead and Lowe (1988) and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.

The major difference between RBF and BP networks lies in the behavior of the single hidden layer: rather than the sigmoidal (S-shaped) activation function used in BP, the hidden units of RBF networks use a Gaussian or some other basis kernel function.

The idea

[Figure: training data sampled, with noise, from an unknown function of $x$ that is to be approximated.]

Page 6

The idea

Place basis functions (kernels) along the input axis; the function learned is their weighted superposition:

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$

[Figure: the learned function follows the training data and is evaluated at a non-training sample.]

Radial Basis Function Networks as Universal Approximators

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$

Training set: $\mathcal{T} = \left\{\left(\mathbf{x}^{(k)}, y^{(k)}\right)\right\}_{k=1}^{p}$

Goal: $f(\mathbf{x}^{(k)}) \approx y^{(k)}$ for all $k$, i.e.,

$$\min\; SSE = \sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 = \sum_{k=1}^{p}\left(y^{(k)} - \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}^{(k)})\right)^2$$


Page 7


Regularization

Training set: $\mathcal{T} = \left\{\left(\mathbf{x}^{(k)}, y^{(k)}\right)\right\}_{k=1}^{p}$; goal: $f(\mathbf{x}^{(k)}) \approx y^{(k)}$ for all $k$.

Minimize the regularized cost

$$C = \sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m}\lambda_i w_i^2, \qquad \lambda_i \ge 0$$

If regularization is unneeded, set $\lambda_i = 0$.

Learn the Optimal Weight Vector

Minimize

$$C = \sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m}\lambda_i w_i^2, \qquad f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$$

Setting each partial derivative to zero,

$$\frac{\partial C}{\partial w_j} = -2\sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)\phi_j(\mathbf{x}^{(k)}) + 2\lambda_j w_j = 0$$

$$\Rightarrow\quad \sum_{k=1}^{p} f^*(\mathbf{x}^{(k)})\,\phi_j(\mathbf{x}^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)}\,\phi_j(\mathbf{x}^{(k)}), \qquad f^*(\mathbf{x}) = \sum_{i=1}^{m} w_i^*\,\phi_i(\mathbf{x})$$

Learn the Optimal Weight Vector

Define

$$\boldsymbol{\varphi}_j = \left(\phi_j(\mathbf{x}^{(1)}), \dots, \phi_j(\mathbf{x}^{(p)})\right)^T, \qquad \mathbf{f}^* = \left(f^*(\mathbf{x}^{(1)}), \dots, f^*(\mathbf{x}^{(p)})\right)^T, \qquad \mathbf{y} = \left(y^{(1)}, \dots, y^{(p)}\right)^T$$

Then the optimality conditions become

$$\boldsymbol{\varphi}_j^T\,\mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\varphi}_j^T\,\mathbf{y}, \qquad j = 1, \dots, m$$

Learn the Optimal Weight Vector

Stacking the $m$ conditions and defining

$$\Phi = \left[\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2, \dots, \boldsymbol{\varphi}_m\right], \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m), \qquad \mathbf{w}^* = \left(w_1^*, w_2^*, \dots, w_m^*\right)^T$$

gives

$$\Phi^T\,\mathbf{f}^* + \Lambda\,\mathbf{w}^* = \Phi^T\,\mathbf{y}$$

Learn the Optimal Weight Vector

Since $f^*(\mathbf{x}^{(k)}) = \sum_{i=1}^{m} w_i^*\,\phi_i(\mathbf{x}^{(k)})$, the fitted-value vector is

$$\mathbf{f}^* = \begin{bmatrix}\phi_1(\mathbf{x}^{(1)}) & \phi_2(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)})\\ \phi_1(\mathbf{x}^{(2)}) & \phi_2(\mathbf{x}^{(2)}) & \cdots & \phi_m(\mathbf{x}^{(2)})\\ \vdots & \vdots & & \vdots\\ \phi_1(\mathbf{x}^{(p)}) & \phi_2(\mathbf{x}^{(p)}) & \cdots & \phi_m(\mathbf{x}^{(p)})\end{bmatrix}\begin{bmatrix}w_1^*\\ w_2^*\\ \vdots\\ w_m^*\end{bmatrix} = \Phi\,\mathbf{w}^*$$

Substituting,

$$\Phi^T\Phi\,\mathbf{w}^* + \Lambda\,\mathbf{w}^* = \Phi^T\,\mathbf{y} \quad\Rightarrow\quad \mathbf{w}^* = \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T\,\mathbf{y} = A^{-1}\Phi^T\,\mathbf{y}$$

$\Phi$: design matrix; $A^{-1} = \left(\Phi^T\Phi + \Lambda\right)^{-1}$: variance matrix.
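In code, the whole solution is one design matrix plus one linear solve. A sketch assuming Gaussian kernels on scalar inputs; `rbf_design`, the centers, and the width are illustrative choices:

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Design matrix Phi with Phi[k, i] = phi_i(x^(k)), Gaussian kernels."""
    d = X[:, None] - centers[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def fit_weights(Phi, y, lam=0.0):
    """w* = (Phi^T Phi + Lambda)^(-1) Phi^T y; lam is a scalar or length-m vector."""
    m = Phi.shape[1]
    Lam = np.diag(np.broadcast_to(lam, (m,)))   # Lambda = diag(lambda_1..lambda_m)
    A = Phi.T @ Phi + Lam
    return np.linalg.solve(A, Phi.T @ y)        # solve A w* = Phi^T y

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(50)
Phi = rbf_design(X, centers=np.linspace(0, 1, 10), sigma=0.1)
w = fit_weights(Phi, y, lam=1e-3)
f_star = Phi @ w                                # fitted values f* = Phi w*
```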

Page 8

Introduction to Radial Basis Function Networks

The Projection Matrix

The Empirical-Error Vector

[Figure: the unknown function supplies the targets $\mathbf{y}$; the network, with kernel outputs $\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_m$ and weights $w_1^*, \dots, w_m^*$ from $\mathbf{w}^* = A^{-1}\Phi^T\mathbf{y}$, produces $\mathbf{f}^* = \Phi\,\mathbf{w}^*$.]

The error vector is

$$\mathbf{e} = \mathbf{y} - \mathbf{f}^* = \mathbf{y} - \Phi\,\mathbf{w}^* = \mathbf{y} - \Phi A^{-1}\Phi^T\,\mathbf{y} = \left(I_p - \Phi A^{-1}\Phi^T\right)\mathbf{y} = P\,\mathbf{y}$$

with

$$P = I_p - \Phi A^{-1}\Phi^T, \qquad A = \Phi^T\Phi + \Lambda, \qquad \Phi = \left[\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2, \dots, \boldsymbol{\varphi}_m\right]$$

Sum-squared error:

$$SSE = \left\|\mathbf{e}\right\|^2 = (P\mathbf{y})^T(P\mathbf{y}) = \mathbf{y}^T P^2\,\mathbf{y}$$

If $\Lambda = 0$, the RBFN's learning algorithm simply minimizes SSE (MSE).

The Projection Matrix

When $\Lambda = 0$, $\mathbf{e} = P\mathbf{y} \perp \mathrm{span}(\boldsymbol{\varphi}_1, \dots, \boldsymbol{\varphi}_m)$ and $P$ is a projection: $P(P\mathbf{y}) = P\mathbf{y}$, i.e., $P\mathbf{e} = \mathbf{e}$, so

$$SSE = \mathbf{y}^T P^2\,\mathbf{y} = \mathbf{y}^T P\,\mathbf{y}$$
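A quick numerical check of these properties, as a sketch with random data: with $\Lambda = 0$, $P$ is idempotent, so $P\mathbf{e} = \mathbf{e}$ and $\mathbf{y}^T P^2\,\mathbf{y} = \mathbf{y}^T P\,\mathbf{y}$.

```python
import numpy as np

def projection_matrix(Phi, lam=0.0):
    """P = I_p - Phi (Phi^T Phi + Lambda)^(-1) Phi^T."""
    p, m = Phi.shape
    A = Phi.T @ Phi + np.diag(np.broadcast_to(lam, (m,)))
    return np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
P = projection_matrix(Phi)                     # Lambda = 0
e = P @ y                                      # error vector e = P y
print(np.allclose(P @ e, e))                   # True: P e = e
print(np.allclose(y @ P @ P @ y, y @ P @ y))   # True: y^T P^2 y = y^T P y
```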

Introduction to Radial Basis Function Networks

Learning the Kernels

Page 9

RBFN’s as Universal Approximators

[Figure: inputs $x_1, \dots, x_n$, kernels $\phi_1, \dots, \phi_m$, outputs $y_1, \dots, y_l$, with output weights $w_{ij}$.]

Training set: $\mathcal{T} = \left\{\left(\mathbf{x}^{(k)}, \mathbf{y}^{(k)}\right)\right\}_{k=1}^{p}$

Kernels:

$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\left\|\mathbf{x} - \boldsymbol{\mu}_j\right\|^2}{2\sigma_j^2}\right)$$

What to Learn?
– Weights $w_{ij}$
– Centers $\boldsymbol{\mu}_j$ of the $\phi_j$
– Widths $\sigma_j$ of the $\phi_j$
– Number of kernels $m$ (a model selection problem)


One-Stage Learning

Gradient descent on the squared error, with $f_i(\mathbf{x}^{(k)}) = \sum_{j=1}^{m} w_{ij}\,\phi_j(\mathbf{x}^{(k)})$, gives the updates

$$\Delta w_{ij} = \eta_1\left(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\right)\phi_j(\mathbf{x}^{(k)})$$

$$\Delta\boldsymbol{\mu}_j = \eta_2\sum_{i=1}^{l}\left(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\right)w_{ij}\,\phi_j(\mathbf{x}^{(k)})\,\frac{\mathbf{x}^{(k)} - \boldsymbol{\mu}_j}{\sigma_j^2}$$

$$\Delta\sigma_j = \eta_3\sum_{i=1}^{l}\left(y_i^{(k)} - f_i(\mathbf{x}^{(k)})\right)w_{ij}\,\phi_j(\mathbf{x}^{(k)})\,\frac{\left\|\mathbf{x}^{(k)} - \boldsymbol{\mu}_j\right\|^2}{\sigma_j^3}$$

The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or on-line settings; a sketch follows.
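A minimal sketch of one simultaneous update for a single-output network, transcribing the three equations above; the learning rates and array shapes are illustrative:

```python
import numpy as np

def one_stage_step(x, y, w, mu, sig, etas=(0.1, 0.1, 0.1)):
    """One pattern-by-pattern gradient step on weights, centers, and widths
    (scalar input and output, m Gaussian kernels)."""
    e1, e2, e3 = etas
    phi = np.exp(-(x - mu)**2 / (2 * sig**2))    # kernel outputs phi_j(x)
    err = y - w @ phi                            # y - f(x)
    dw   = e1 * err * phi
    dmu  = e2 * err * w * phi * (x - mu) / sig**2
    dsig = e3 * err * w * phi * (x - mu)**2 / sig**3
    return w + dw, mu + dmu, sig + dsig
```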

Two-Stage Training

Step 1 (train the kernels): determine the centers $\boldsymbol{\mu}_j$ of the $\phi_j$, the widths $\sigma_j$ of the $\phi_j$, and the number of kernels $m$, typically by unsupervised methods.

Step 2: determine the weights $w_{ij}$, e.g., using batch learning. A sketch of both stages follows.
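The sketch below assumes k-means for stage 1 on scalar inputs; the width heuristic is an illustrative choice, not from the slides:

```python
import numpy as np

def kmeans(X, m, iters=100, seed=0):
    """Plain k-means on scalar inputs to place the centers (stage 1)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean()
    return centers

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(80)

centers = kmeans(X, m=8)                        # stage 1: unsupervised
sigma = np.diff(np.sort(centers)).mean()        # heuristic width from center spacing
Phi = np.exp(-(X[:, None] - centers[None, :])**2 / (2 * sigma**2))
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)     # stage 2: batch least squares
```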

Page 10

Unsupervised Training

Methods:
– Subset selection: random subset selection, forward selection, backward elimination
– Clustering algorithms: k-means, LVQ
– Mixture models: GMM

Random Subset Selection

Randomly choose a subset of points from the training set as the centers.

This is sensitive to the initially chosen points, so some adaptive technique is used to tune:
– the centers
– the widths
– the number of points

Clustering Algorithms

Page 11

Clustering Algorithms

[Figure: data points partitioned into four clusters (1–4); the cluster centers provide the kernel centers.]

Introduction to Radial Basis Function Networks

Bias-Variance Dilemma

Goal Revisit

Ultimate goal: generalization, i.e., minimize the prediction error.

Goal of our learning procedure: minimize the empirical error.

Badness of Fit

Underfitting: a model (e.g., a network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting. This produces excessive bias in the outputs.

Overfitting: a model that is too complex may fit the noise, not just the signal, leading to overfitting. This produces excessive variance in the outputs.

Underfitting/Overfitting Avoidance

– Model selection
– Jittering
– Early stopping
– Weight decay (regularization, ridge regression)
– Bayesian learning
– Combining networks

Page 12

Best Way to Avoid Overfitting

Use lots of training data, e.g.:
– 30 times as many training cases as there are weights in the network;
– for noise-free data, 5 times as many training cases as weights may be sufficient.

Don't arbitrarily reduce the number of weights for fear of underfitting.

Badness of Fit

[Figure: an underfit model and an overfit model on the same data.]

Bias-Variance Dilemma

           Underfit         Overfit
Bias       Large bias       Small bias
Variance   Small variance   Large variance

However, it's not really a dilemma.

Bias-Variance Dilemma

More on overfitting:
– It easily leads to predictions that are far beyond the range of the training data.
– It can produce wild predictions in multilayer perceptrons even with noise-free data.

Bias-Variance Dilemma

It's not really a dilemma.

[Figure: as a model is allowed to fit more, it moves from underfitting toward overfitting; bias falls while variance grows.]

Page 13

Bias-Variance Dilemma

[Figure: a set of realizable functions, e.g., depending on the number of hidden nodes used, shown against the true model; the data differ from the true model by noise. Solutions obtained with training sets 1, 2, and 3 scatter around the true model. The mean of these solutions determines the bias; their spread determines the variance.]

Model Selection

Reduce the effective number of parameters, e.g., reduce the number of hidden nodes. [Figure: the smaller set of functions has less variance but more bias relative to the true model.]

Bias-Variance Dilemma

The true model: $y = g(\mathbf{x}) + \varepsilon$, where $\varepsilon$ is noise and $g(\mathbf{x})$ is the target; $f(\mathbf{x})$ denotes the learned approximation.

Goal:

$$\min\; E\!\left[\left(y - f(\mathbf{x})\right)^2\right]$$

Bias-Variance Dilemma

$$E\!\left[\left(y - f(\mathbf{x})\right)^2\right] = E\!\left[\left(g(\mathbf{x}) + \varepsilon - f(\mathbf{x})\right)^2\right] = E\!\left[\left(g(\mathbf{x}) - f(\mathbf{x})\right)^2\right] + 2\,E\!\left[\varepsilon\left(g(\mathbf{x}) - f(\mathbf{x})\right)\right] + E\!\left[\varepsilon^2\right]$$

The middle term is zero (the noise has zero mean and is independent of the fit) and $E[\varepsilon^2]$ is a constant, so

$$\min\; E\!\left[\left(y - f(\mathbf{x})\right)^2\right] \iff \min\; E\!\left[\left(g(\mathbf{x}) - f(\mathbf{x})\right)^2\right]$$

Bias-Variance Dilemma

Insert $E[f(\mathbf{x})]$, the average solution over training sets:

$$E\!\left[\left(g - f\right)^2\right] = E\!\left[\left(\left(g - E[f]\right) + \left(E[f] - f\right)\right)^2\right] = E\!\left[\left(g - E[f]\right)^2\right] + E\!\left[\left(E[f] - f\right)^2\right] + 2\,E\!\left[\left(g - E[f]\right)\left(E[f] - f\right)\right]$$

The cross term vanishes, since $g - E[f]$ is deterministic and $E\!\left[E[f] - f\right] = E[f] - E[f] = 0$. Combining with the previous step,

$$E\!\left[\left(y - f(\mathbf{x})\right)^2\right] = E\!\left[\varepsilon^2\right] + \left(g(\mathbf{x}) - E[f(\mathbf{x})]\right)^2 + E\!\left[\left(f(\mathbf{x}) - E[f(\mathbf{x})]\right)^2\right]$$

Page 14

Bias-Variance Dilemma

$$E\!\left[\left(y - f(\mathbf{x})\right)^2\right] = \underbrace{E\!\left[\varepsilon^2\right]}_{\text{noise}} + \underbrace{\left(g(\mathbf{x}) - E[f(\mathbf{x})]\right)^2}_{\text{bias}^2} + \underbrace{E\!\left[\left(f(\mathbf{x}) - E[f(\mathbf{x})]\right)^2\right]}_{\text{variance}}$$

The noise term cannot be minimized; the task is to minimize both bias² and variance.
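The decomposition can be checked by simulation: fit the same model to many independent training sets and measure the spread of its predictions at one point. A sketch in which the target $g$, the noise level, and the polynomial model are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.sin(2 * np.pi * x)      # the true model g(x)
x0, sigma_eps, degree = 0.3, 0.2, 3

preds = []
for _ in range(2000):                    # many independent training sets
    X = rng.uniform(0, 1, 30)
    y = g(X) + sigma_eps * rng.standard_normal(30)
    coef = np.polyfit(X, y, degree)      # f = model fitted to this training set
    preds.append(np.polyval(coef, x0))   # f(x0)

preds = np.array(preds)
bias2 = (g(x0) - preds.mean())**2        # (g - E[f])^2
var = preds.var()                        # E[(f - E[f])^2]
print(bias2, var, sigma_eps**2)          # the noise term E[eps^2] adds on top
```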

Model Complexity vs. Bias-Variance

[Figure: in the decomposition above, the bias² term decreases and the variance term increases with model complexity (capacity); their sum, plus the irreducible noise, is the prediction error.]

Example (Polynomial Fits)

Noisy samples of $y = x^2\,e^{-16x^2}$ are fitted with polynomials of increasing degree.

[Figure: fits of degree 1, 5, 10, and 15; the low-degree fits underfit, the high-degree fits overfit.]
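A sketch reproducing the flavor of this experiment; the sample size and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.linspace(-1, 1, 40)
y = X**2 * np.exp(-16 * X**2) + 0.02 * rng.standard_normal(40)

for degree in (1, 5, 10, 15):
    coef = np.polyfit(X, y, degree)
    sse = np.sum((y - np.polyval(coef, X))**2)
    print(degree, sse)   # training SSE falls with degree, but high degrees overfit
```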

Page 15

Introduction to Radial Basis Function Networks

The Effective Number of Parameters

Variance Estimation

Mean: $\mu$; variance:

$$\sigma^2 = \frac{1}{p}\sum_{i=1}^{p}\left(x_i - \mu\right)^2$$

In general, $\mu$ is not available.

Variance Estimation

Mean: $\bar{x} = \dfrac{1}{p}\displaystyle\sum_{i=1}^{p} x_i$

Variance:

$$\hat{\sigma}^2 = s^2 = \frac{1}{p-1}\sum_{i=1}^{p}\left(x_i - \bar{x}\right)^2$$

Estimating the mean from the data loses 1 degree of freedom.

Simple Linear Regression

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Fit $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ by minimizing

$$SSE = \sum_{i=1}^{p}\left(y_i - \hat{y}_i\right)^2$$

Mean squared error (MSE):

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p-2}$$

Estimating the two coefficients loses 2 degrees of freedom.

Page 16

Variance Estimation

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p-m}$$

Fitting a model with $m$ parameters loses $m$ degrees of freedom.

The Number of Parameters

For the linear model $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$, the number of parameters, and hence of lost degrees of freedom, is $m$.

The Effective Number of Parameters (γ)

For $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x})$ with projection matrix

$$P = I_p - \Phi A^{-1}\Phi^T, \qquad A = \Phi^T\Phi + \Lambda,$$

the effective number of parameters is

$$\gamma = p - \mathrm{trace}(P)$$

When $\Lambda = 0$, $\gamma = m$.

The Effective Number of Parameters (γ)

Facts: $\mathrm{trace}(A + B) = \mathrm{trace}(A) + \mathrm{trace}(B)$ and $\mathrm{trace}(AB) = \mathrm{trace}(BA)$.

Proof (for $\Lambda = 0$, so $A = \Phi^T\Phi$):

$$\mathrm{trace}(P) = \mathrm{trace}\!\left(I_p - \Phi A^{-1}\Phi^T\right) = p - \mathrm{trace}\!\left(\Phi\left(\Phi^T\Phi\right)^{-1}\Phi^T\right) = p - \mathrm{trace}\!\left(\left(\Phi^T\Phi\right)^{-1}\Phi^T\Phi\right) = p - \mathrm{trace}(I_m) = p - m$$

Hence $\gamma = p - \mathrm{trace}(P) = m$.
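A numerical check of both statements (random design matrix, illustrative sizes): $\gamma = m$ at $\Lambda = 0$, and $\gamma$ shrinks below $m$ as the regularization grows.

```python
import numpy as np

rng = np.random.default_rng(5)
p, m = 50, 10
Phi = rng.standard_normal((p, m))
for lam in (0.0, 0.1, 10.0):
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    print(lam, p - np.trace(P))   # gamma: exactly 10.0, then smaller and smaller
```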

Regularization

$$C = \underbrace{\sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2}_{\text{empirical error }(SSE)} + \underbrace{\sum_{i=1}^{m}\lambda_i w_i^2}_{\text{model's complexity penalty}} = \sum_{k=1}^{p}\left(y^{(k)} - \sum_{i=1}^{m}w_i\,\phi_i(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m}\lambda_i w_i^2$$

The penalty term punishes models with large weights. The effective number of parameters: $\gamma = p - \mathrm{trace}(P)$.

Without penalty ($\lambda_i = 0$), there are $m$ degrees of freedom available to minimize $SSE$ (the cost), and the effective number of parameters is $\gamma = m$.

Page 17

With penalty ($\lambda_i > 0$), the freedom to minimize $SSE$ is reduced, and the effective number of parameters satisfies $\gamma < m$.

Variance Estimation

$$\hat{\sigma}^2 = MSE = \frac{SSE}{p - \gamma}, \qquad \gamma = p - \mathrm{trace}(P)$$

Fitting the model loses $\gamma$ degrees of freedom. Since $p - \gamma = \mathrm{trace}(P)$ and $SSE = \mathbf{y}^T P^2\,\mathbf{y}$,

$$\hat{\sigma}^2 = \frac{SSE}{\mathrm{trace}(P)}$$

Introduction to Radial Basis Function Networks

Model Selection

Model Selection

Goal: choose the fittest model.

Criterion: least prediction error.

Main tools (for estimating model fitness):
– Cross-validation
– Projection matrix

Methods:
– Weight decay (ridge regression)
– Pruning and growing RBFN's

Empirical Error vs. Model Fitness

Ultimate goal: generalization, i.e., minimize the prediction error.

Goal of our learning procedure: minimize the empirical error (MSE), although what we actually want to minimize is the prediction error.

Page 18

Estimating Prediction Error

When you have plenty of data, use independent test sets: e.g., use the same training set to train different models, and choose the best model by comparing on the test set.

When data is scarce, use:
– Cross-validation
– Bootstrap

Cross Validation

Cross-validation is the simplest and most widely used method for estimating prediction error.

Partition the original set in several different ways and compute an average score over the different partitions, e.g.:
– K-fold cross-validation
– Leave-one-out cross-validation
– Generalized cross-validation

K-Fold CV

Split the set $D$ of available input-output patterns into $k$ mutually exclusive subsets $D_1, D_2, \dots, D_k$.

Train and test the learning algorithm $k$ times; each time it is trained on $D \setminus D_i$ and tested on $D_i$.

K-Fold CV

[Figure: the available data are split into blocks $D_1, D_2, D_3, \dots, D_k$; in each of the $k$ rounds a different block serves as the test set while the remaining blocks form the training set, and the pooled test errors estimate $\sigma^2$.]
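A generic k-fold sketch; the degree-3 polynomial in the usage lines is an illustrative stand-in for an RBFN:

```python
import numpy as np

def kfold_mse(X, y, k, fit, predict, seed=0):
    """Average the test MSE over the k partitions D_1..D_k."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]                                                # D_i
        train = np.concatenate([folds[j] for j in range(k) if j != i]) # D \ D_i
        model = fit(X[train], y[train])
        scores.append(np.mean((y[test] - predict(model, X[test]))**2))
    return np.mean(scores)

rng = np.random.default_rng(6)
X = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(60)
print(kfold_mse(X, y, k=5,
                fit=lambda X, y: np.polyfit(X, y, 3),
                predict=lambda c, X: np.polyval(c, X)))
```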

Leave-One-Out CV

Split the $p$ available input-output patterns into a training set of size $p - 1$ and a test set of size 1.

Average the squared error on the left-out pattern over the $p$ possible ways of obtaining such a partition.

This is a special case of k-fold CV with $k = p$.

Page 19

Error Variance Predicted by LOO

Available input-output patterns: $D = \left\{(\mathbf{x}_k, y_k) : k = 1, 2, \dots, p\right\}$.

Training sets of LOO: $D_i = D \setminus \left\{(\mathbf{x}_i, y_i)\right\}$, $i = 1, \dots, p$.

$f_i$: the function learned using $D_i$ as training set, i.e., given a model, the function with least empirical error on $D_i$.

The estimate of the variance of the prediction error using LOO averages the squared error on each left-out element:

$$\hat{\sigma}^2_{LOO} = \frac{1}{p}\sum_{i=1}^{p}\left(y_i - f_i(\mathbf{x}_i)\right)^2$$

This is an index of the model's fitness: we want to find a model that also minimizes this quantity.

Error Variance Predicted by LOO

For a linear model, $\hat{\sigma}^2_{LOO}$ can be computed from the projection matrix without any retraining:

$$\hat{\sigma}^2_{LOO} = \frac{1}{p}\,\mathbf{y}^T P\left(\mathrm{diag}\,P\right)^{-2} P\,\mathbf{y}$$
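A sketch of this shortcut, with a brute-force leave-one-out check that the identity is exact for $\lambda = 0$ (the problem sizes are illustrative):

```python
import numpy as np

def loo_variance(Phi, y, lam=0.0):
    """sigma^2_LOO = (1/p) y^T P (diag P)^(-2) P y, with no retraining."""
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    r = P @ y                                  # residuals e = P y
    return np.mean((r / np.diag(P))**2)

rng = np.random.default_rng(7)
Phi = rng.standard_normal((15, 4))
y = rng.standard_normal(15)

explicit = []
for i in range(15):                            # retrain p times, the slow way
    keep = np.delete(np.arange(15), i)
    w = np.linalg.solve(Phi[keep].T @ Phi[keep], Phi[keep].T @ y[keep])
    explicit.append((y[i] - Phi[i] @ w)**2)
print(np.isclose(loo_variance(Phi, y), np.mean(explicit)))   # True
```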

Generalized Cross-Validation

Replacing $\mathrm{diag}\,P$ by its average value, $\dfrac{\mathrm{trace}(P)}{p}\,I_p$, gives

$$\hat{\sigma}^2_{GCV} = \frac{p\,\mathbf{y}^T P^2\,\mathbf{y}}{\left(\mathrm{trace}\,P\right)^2} = \frac{p\,\mathbf{y}^T P^2\,\mathbf{y}}{\left(p - \gamma\right)^2}$$

More Criteria Based on CV

GCV (generalized CV):

$$\hat{\sigma}^2_{GCV} = \frac{p\,\mathbf{y}^T P^2\,\mathbf{y}}{\left(p - \gamma\right)^2}$$

UEV (unbiased estimate of variance):

$$\hat{\sigma}^2_{UEV} = \frac{\mathbf{y}^T P^2\,\mathbf{y}}{p - \gamma}$$

FPE (final prediction error, Akaike's information criterion):

$$\hat{\sigma}^2_{FPE} = \frac{p + \gamma}{p\,(p - \gamma)}\,\mathbf{y}^T P^2\,\mathbf{y} = \frac{p + \gamma}{p}\,\hat{\sigma}^2_{UEV}$$

BIC (Bayesian information criterion):

$$\hat{\sigma}^2_{BIC} = \frac{p + (\ln p - 1)\,\gamma}{p\,(p - \gamma)}\,\mathbf{y}^T P^2\,\mathbf{y} = \frac{p + (\ln p - 1)\,\gamma}{p}\,\hat{\sigma}^2_{UEV}$$

Page 20

More Criteria Based on CV

Since $\hat{\sigma}^2_{GCV} = \frac{p}{p-\gamma}\,\hat{\sigma}^2_{UEV}$, the estimates are ordered:

$$\hat{\sigma}^2_{UEV} \le \hat{\sigma}^2_{FPE} \le \hat{\sigma}^2_{GCV} \le \hat{\sigma}^2_{BIC}$$

All of them are computed from the projection matrix $P$.
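All four criteria come from the same projection matrix. A sketch collecting them in one routine, following the formulas above:

```python
import numpy as np

def selection_criteria(Phi, y, lam=0.0):
    """Return (UEV, FPE, GCV, BIC) estimates of the prediction-error variance."""
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    yPPy = y @ P @ P @ y                  # y^T P^2 y
    gamma = p - np.trace(P)               # effective number of parameters
    uev = yPPy / (p - gamma)
    fpe = (p + gamma) / p * uev
    gcv = p / (p - gamma) * uev
    bic = (p + (np.log(p) - 1) * gamma) / p * uev
    return uev, fpe, gcv, bic             # ordered: UEV <= FPE <= GCV <= BIC
```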

Regularization

$$C = \underbrace{\sum_{k=1}^{p}\left(y^{(k)} - f(\mathbf{x}^{(k)})\right)^2}_{\text{empirical error }(SSE)} + \underbrace{\sum_{i=1}^{m}\lambda_i w_i^2}_{\text{model's complexity penalty}}$$

Standard ridge regression uses a single regularization parameter, $\lambda_i = \lambda$ for all $i$, so the penalty term becomes $\lambda\sum_{i=1}^{m} w_i^2$.

Solution Review

$$\min_{\mathbf{w}}\; C = \sum_{k=1}^{p}\left(y^{(k)} - \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}^{(k)})\right)^2 + \lambda\sum_{i=1}^{m} w_i^2$$

$$\mathbf{w}^* = A^{-1}\Phi^T\,\mathbf{y}, \qquad A = \Phi^T\Phi + \lambda I_m, \qquad P = I_p - \Phi A^{-1}\Phi^T$$

$P$ is used to compute the model selection criteria.

Example

$$y(x) = \left(1 - 2x - x^2\right)e^{-x^2} + \varepsilon, \qquad \varepsilon \sim N(0, 0.2^2), \qquad p = 50$$

Width of RBF: $r = 0.5$.

Page 21

Example

[Figure: $\hat{\sigma}^2_{UEV}$, $\hat{\sigma}^2_{FPE}$, $\hat{\sigma}^2_{GCV}$, and $\hat{\sigma}^2_{BIC}$ as functions of the regularization parameter for the data above ($p = 50$, noise $\sim N(0, 0.2^2)$, RBF width $r = 0.5$).]

Optimizing the Regularization Parameter

Re-estimation formula (iterate to a fixed point):

$$\lambda \leftarrow \frac{\mathbf{y}^T P^2\,\mathbf{y}\,\cdot\,\mathrm{trace}\!\left(A^{-1} - \lambda A^{-2}\right)}{\mathbf{w}^T A^{-1}\,\mathbf{w}\,\cdot\,\mathrm{trace}(P)}$$
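A sketch of the fixed-point iteration; the starting value and iteration count are illustrative, and, as the next example shows, the fixed point reached depends on the start:

```python
import numpy as np

def reestimate_lambda(Phi, y, lam=1e-2, iters=100):
    """Iterate the re-estimation formula for a single global ridge parameter."""
    p, m = Phi.shape
    for _ in range(iters):
        A = Phi.T @ Phi + lam * np.eye(m)
        Ainv = np.linalg.inv(A)
        P = np.eye(p) - Phi @ Ainv @ Phi.T
        w = Ainv @ Phi.T @ y                              # current weights
        num = (y @ P @ P @ y) * np.trace(Ainv - lam * Ainv @ Ainv)
        den = (w @ Ainv @ w) * np.trace(P)
        lam = num / den                                   # re-estimated lambda
    return lam
```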

Example

$$y = \sin(12x) + \varepsilon, \qquad \varepsilon \sim N(0, 0.1^2), \qquad m = p = 50$$

Width of RBF: $r = 0.05$.

[Figure: the effective number of parameters $\gamma = p - \mathrm{trace}(P)$ as a function of the regularization parameter.]

Page 22

Example

Same setting: $y = \sin(12x) + \varepsilon$, $\varepsilon \sim N(0, 0.1^2)$, $m = p = 50$, RBF width $r = 0.05$.

There are two local minima of the criterion as a function of $\lambda$. Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting:

– $\hat{\lambda}(0) = 0.01 \;\Rightarrow\; \lim_{t\to\infty}\hat{\lambda}(t) = 0.1$
– $\hat{\lambda}(0) = 10^{-5} \;\Rightarrow\; \lim_{t\to\infty}\hat{\lambda}(t) = 2.1\times10^{-4}$

Example

[Figure: RMSE (root mean squared error) against the true function, as a function of $\lambda$, for the same setting. In a real case this curve is not available, since the true function is unknown.]

Local Ridge Regression

Standard ridge regression:

$$\min_{\mathbf{w}}\; C = \sum_{k=1}^{p}\left(y^{(k)} - \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}^{(k)})\right)^2 + \lambda\sum_{i=1}^{m} w_i^2$$

Local ridge regression:

$$\min_{\mathbf{w}}\; C = \sum_{k=1}^{p}\left(y^{(k)} - \sum_{i=1}^{m} w_i\,\phi_i(\mathbf{x}^{(k)})\right)^2 + \sum_{i=1}^{m}\lambda_i w_i^2$$

Page 23

$\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed.

The Solutions

$$\mathbf{w}^* = A^{-1}\Phi^T\,\mathbf{y}, \qquad P = I_p - \Phi A^{-1}\Phi^T$$

– Linear regression: $A = \Phi^T\Phi$
– Standard ridge regression: $A = \Phi^T\Phi + \lambda I_m$
– Local ridge regression: $A = \Phi^T\Phi + \Lambda$

$P$ is used to compute the model selection criteria.

Optimizing the Regularization Parameters

Incremental operation: let $P$ be the current projection matrix and $P_j$ the projection matrix obtained by removing $\phi_j(\cdot)$. They are related by

$$P_j = P + \frac{P\,\boldsymbol{\varphi}_j\,\boldsymbol{\varphi}_j^T\,P}{\lambda_j - \boldsymbol{\varphi}_j^T P\,\boldsymbol{\varphi}_j}$$

so $\hat{\sigma}^2_{GCV} = \dfrac{p\,\mathbf{y}^T P^2\,\mathbf{y}}{\left(\mathrm{trace}\,P\right)^2}$ can be updated cheaply as kernels are added or removed.

Optimizing the Regularization Parameters

For each $j$, solve

$$\frac{\partial \hat{\sigma}^2_{GCV}}{\partial \lambda_j} = 0 \qquad \text{subject to } \lambda_j \ge 0$$

Optimizing the Regularization Parameters

Define the scalars

$$a = \mathbf{y}^T P_j^2\,\mathbf{y}, \qquad b = \mathbf{y}^T P_j^2\,\boldsymbol{\varphi}_j \cdot \mathbf{y}^T P_j\,\boldsymbol{\varphi}_j, \qquad c = \boldsymbol{\varphi}_j^T P_j^2\,\boldsymbol{\varphi}_j \cdot \left(\mathbf{y}^T P_j\,\boldsymbol{\varphi}_j\right)^2$$

Solving $\partial\hat{\sigma}^2_{GCV}/\partial\lambda_j = 0$ subject to $\lambda_j \ge 0$ then yields a closed-form candidate $\hat{\lambda}_j$ expressed through $a$, $b$, $c$, $\mathrm{trace}(P_j)$, and $\boldsymbol{\varphi}_j^T P_j\,\boldsymbol{\varphi}_j$; depending on the relative sizes of these terms, the constrained optimum is either this finite value, $\hat{\lambda}_j = 0$, or $\hat{\lambda}_j = \infty$, in which case $\phi_j(\cdot)$ is removed.

Page 24

The Algorithm

Initialize the $\lambda_i$'s, e.g., by performing standard ridge regression.

Repeat the following until GCV converges:
– Randomly select $j$ and compute $\hat{\lambda}_j$.
– Perform local ridge regression with $\hat{\lambda}_j$.
– If GCV is reduced, keep $\hat{\lambda}_j$; if $\hat{\lambda}_j = \infty$, remove $\phi_j(\cdot)$.
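A sketch of this loop. Since the closed form for $\hat{\lambda}_j$ is only partially recoverable from these slides, this version scores candidate $\lambda_j$ values (including $\infty$, i.e., removal of $\phi_j$) directly by GCV, which preserves the accept-only-if-GCV-drops logic:

```python
import numpy as np

def gcv(Phi, y, lams):
    """GCV for local ridge; lams[j] = inf means kernel j is removed."""
    keep = np.isfinite(lams)
    Phi_k, lam_k = Phi[:, keep], lams[keep]
    p = len(y)
    A = Phi_k.T @ Phi_k + np.diag(lam_k)
    P = np.eye(p) - Phi_k @ np.linalg.solve(A, Phi_k.T)
    return p * (y @ P @ P @ y) / np.trace(P)**2

def local_ridge(Phi, y, lam0=1e-2, sweeps=200, seed=0):
    rng = np.random.default_rng(seed)
    m = Phi.shape[1]
    lams = np.full(m, lam0)              # initialize, e.g., from standard ridge
    grid = np.concatenate([[0.0], np.logspace(-8, 4, 25), [np.inf]])
    best = gcv(Phi, y, lams)
    for _ in range(sweeps):              # "repeat until GCV converges"
        j = rng.integers(m)              # randomly select a kernel j
        for cand in grid:                # candidate lambda_j, removal included
            trial = lams.copy()
            trial[j] = cand
            score = gcv(Phi, y, trial)
            if score < best:             # keep the change only if GCV drops
                best, lams = score, trial
    return lams                          # lams[j] == inf: phi_j was removed
```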