
Neural Networks, Part 2

Dan Simon, Cleveland State University

1

Outline

1. Preprocessing
– Input normalization
– Feature selection
– Principal component analysis (PCA)

2. Cascade Correlation

2

Preprocessing

First things first:
1. Preprocessing
– Input normalization
– Feature selection
2. Neural network training

If you use a poor preprocessing algorithm, then it doesn't matter what type of neural network you use. And if you use a good preprocessing algorithm, then it doesn't matter what type of neural network you use.

3

Input normalization for independent variables

Training data: xni
n ∈ [1, N], N = # of training samples
i ∈ [1, d], d = input dimension

Mean of each input dimension:
$$\bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_{ni} \qquad (i = 1,\ldots,d)$$

Variance of each input dimension:
$$\sigma_i^2 = \frac{1}{N-1}\sum_{n=1}^{N} (x_{ni} - \bar{x}_i)^2 \qquad (i = 1,\ldots,d)$$

Normalized inputs:
$$\tilde{x}_{ni} = \frac{x_{ni} - \bar{x}_i}{\sigma_i} \qquad (n = 1,\ldots,N,\; i = 1,\ldots,d)$$

Sometimes weight updating takes care of normalization, but what about weight initialization? Also, recall that RBF activation is determined by Euclidean distance between input and center, which we don’t want to be dominated by a single dimension.

4
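As a concrete illustration of this normalization, here is a minimal NumPy sketch (the function name and array shapes are my own, not from the slides):

```python
import numpy as np

def normalize_inputs(X):
    """Normalize each input dimension to zero mean and unit variance.

    X is an (N, d) array: N training samples, d input dimensions.
    Returns the normalized data plus the per-dimension mean and standard
    deviation, which are needed to normalize test inputs the same way.
    """
    mean = X.mean(axis=0)            # mean of each input dimension
    std = X.std(axis=0, ddof=1)      # standard deviation, 1/(N-1) estimate
    return (X - mean) / std, mean, std

# Example: 100 samples, 3 input dimensions with very different scales.
X = np.random.rand(100, 3) * np.array([1.0, 50.0, 1000.0])
X_tilde, mean, std = normalize_inputs(X)
print(X_tilde.mean(axis=0))           # approximately 0 in every dimension
print(X_tilde.std(axis=0, ddof=1))    # approximately 1 in every dimension
```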

Input normalization for correlated variables (whitening)

Training data: xni
n ∈ [1, N], N = # of training samples
i ∈ [1, d], d = input dimension

$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n \qquad (d \times 1 \text{ vectors})$$

$$\Sigma = \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \qquad (d \times d \text{ covariance})$$

$$\Sigma u_j = \lambda_j u_j \quad (j = 1,\ldots,d) \qquad \text{(eigenvalue equation)}$$

$$U = [u_1 \,\cdots\, u_d], \qquad \Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_d)$$

$$\tilde{x}_n = \Lambda^{-1/2} U^T (x_n - \bar{x}) \qquad \text{(normalized inputs)}$$

What is the mean and covariance of the normalized inputs?

5

[Figure: the original distribution compared with the whitened distribution]

6
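A minimal NumPy sketch of the whitening transform above (function name and shapes are my own); the printed mean and covariance answer the question on the previous slide:

```python
import numpy as np

def whiten_inputs(X):
    """Whiten correlated inputs so they have zero mean and identity covariance.

    X is an (N, d) array. Implements the slide's transform
    x_tilde_n = Lambda^(-1/2) U^T (x_n - x_bar).
    """
    mean = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)            # (d, d) sample covariance, 1/(N-1)
    lam, U = np.linalg.eigh(Sigma)             # eigenvalue equation (Sigma is symmetric)
    W = np.diag(1.0 / np.sqrt(lam)) @ U.T      # Lambda^(-1/2) U^T
    return (X - mean) @ W.T

# Check the slide's question: the whitened mean is ~0 and the covariance ~identity.
X = np.random.multivariate_normal([1.0, 2.0], [[4.0, 1.9], [1.9, 1.0]], size=500)
Z = whiten_inputs(X)
print(np.round(Z.mean(axis=0), 3))
print(np.round(np.cov(Z, rowvar=False), 3))
```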

Feature Selection: What features of the input data should we use as inputs to the neural network?

Example: A 256 × 256 bitmap for character recognition gives 65,536 input neurons!

One solution:
• Problem-specific clustering of features – for example, “super pixels” in image processing

We went from 256 to 64 features.

7

Feature Selection: What features of the input data should we use as inputs to the neural network?

Expert knowledge: Use clever problem-dependent approaches.

[Figure: an 8 × 8 binary character bitmap reduced to 16 count-valued features, e.g., 5 6 2 6 6 2 7 4 and 0 1 6 6 6 6 7 6]

We went from 64 to 16 features. The old features were binary, and the new features are not.

8

Clever problem-dependent approach to feature selection

Example: Use ECG data to diagnose heart disease.

We have 24 hours of data at 500 Hz.

Cardiologists tell us that primary indicators include:
• P wave duration
• P wave amplitude
• P wave energy
• P wave inflection point
This gives us a neural network with four inputs.

9

Feature Selection: If you reduce the number of features, make sure you don’t lose too much information!

[Figure: bitmaps of the digits 8 and 6 that become hard to distinguish after too much feature reduction]

10

Feature Selection in the case of no expert knowledge

Brute Force Search: If we want to reduce the number of features from M to N, we find all possible N-element subsets of the M features, and check neural network performance for each subset.

How many N-element subsets can be taken from an M-element set?

Binomial coefficient:
$$\binom{M}{N} = \frac{M!}{N!\,(M-N)!}$$

Example: M = 64, N = 8 → M-choose-N ≈ 4.4 billion. Hmm, I wonder if there are better ways …

11
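A sketch of the brute-force search in Python; the `evaluate` callback, which would train an ANN on a subset and return its performance, is a hypothetical placeholder:

```python
from itertools import combinations
from math import comb

# Number of N-element subsets of an M-element set (the binomial coefficient).
M, N = 64, 8
print(comb(M, N))  # 4426165368 -- about 4.4 billion candidate subsets

def brute_force_select(features, n_keep, evaluate):
    """Exhaustive search: evaluate a network on every n_keep-element subset.

    evaluate(subset) is a hypothetical callback that trains an ANN on the
    given feature subset and returns its performance (higher is better).
    Only feasible when the number of subsets is small.
    """
    best_subset, best_score = None, float("-inf")
    for subset in combinations(features, n_keep):
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```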

Feature Selection:

• Branch and Bound Method
This is based on the idea that deleting features cannot improve performance.
Suppose we want to reduce the number of features from 5 to 2.
First, pick a “reasonable” pair of features (say, 1 and 2). Train an ANN with those features. This gives a performance threshold J.
Create a tree of eliminated features. Move down the tree to accumulate deleted features.
Evaluate ANN performance P at each node of the tree. If P < J, there is no need to consider that branch any further.

12

Branch and Bound: we want a reduction from 5 to 2 features. This is an optimal method, but it may require a lot of effort.

[Figure: tree of eliminated features; the number at each node is the feature deleted there, and A, B, C label three of the nodes.]

Use features 1 and 2 to obtain performance at node A.
We find that B is worse than A, so there is no need to evaluate below B.
We also find that C is worse than A, so there is no need to evaluate below C.

13
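A simplified branch-and-bound sketch under the monotonicity assumption above; `evaluate` is again a hypothetical stand-in for training an ANN, and for clarity this version may revisit the same subset via different deletion orders (a real implementation orders the tree to avoid that):

```python
def branch_and_bound_select(features, n_keep, evaluate):
    """Branch-and-bound feature selection (simplified sketch).

    Relies on the slide's assumption that deleting features cannot improve
    performance. evaluate(subset) trains an ANN on the subset and returns
    its performance (higher is better).
    """
    features = tuple(features)
    best = features[:n_keep]              # a "reasonable" starting subset ...
    threshold = evaluate(best)            # ... gives the performance threshold J

    def descend(subset):
        nonlocal best, threshold
        score = evaluate(subset)
        if score <= threshold:            # P < J: prune this whole branch
            return
        if len(subset) == n_keep:         # a leaf that beats the threshold
            best, threshold = subset, score
            return
        for i in range(len(subset)):      # delete one more feature and recurse
            descend(subset[:i] + subset[i + 1:])

    descend(features)
    return best, threshold
```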

Feature Selection:
• Sequential Forward Selection
Find the feature f1 that gives the best performance.
Find the feature f2 such that (f1, f2) gives the best performance.
Repeat for as many features as desired.

Example: Find the best 3 out of 5 available features

Candidates: 1 2 3 4 5 → feature 2 is best
Candidates: 1 3 4 5 (paired with 2) → {2, 4} is best
Candidates: 1 3 5 (added to {2, 4}) → {2, 4, 1} is best

14
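A sketch of sequential forward selection, with the same hypothetical `evaluate` callback as before:

```python
def sequential_forward_selection(features, n_keep, evaluate):
    """Greedy forward selection (a sketch).

    evaluate(subset) is a hypothetical callback returning ANN performance
    on that feature subset (higher is better).
    """
    selected = []
    remaining = list(features)
    while len(selected) < n_keep:
        # Add the single feature that most improves performance.
        best_f = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Example mirroring the slide: pick the best 3 of 5 features.
# selected = sequential_forward_selection([1, 2, 3, 4, 5], 3, evaluate)
```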

Problem with sequential forward selection: There may be two features such that either one alone provides little information, but the combination provides a lot of information.

[Figure: scatter plot of Class 1 and Class 2 versus features x1 and x2. Neither feature, if used alone, provides information about the class, but both features in combination provide a lot of information.]

15

Feature Selection:
• Sequential Backward Elimination
Start with all features.
Eliminate the one that provides the least information.
Repeat until the desired number of features is obtained.

Example: Find the best 3 out of 5 available features.

Start: {1, 2, 3, 4, 5} – use all features for best performance.
After one elimination: candidates {1,2,3,4}, {1,2,3,5}, {1,2,4,5}, {1,3,4,5}, {2,3,4,5} – eliminating feature 4 results in the least loss of performance.
After two eliminations: candidates {1,2,3}, {1,2,5}, {1,3,5}, {2,3,5} – eliminating feature 1 results in the least loss of performance.

16
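A matching sketch of sequential backward elimination, again with a hypothetical `evaluate` callback:

```python
def sequential_backward_elimination(features, n_keep, evaluate):
    """Greedy backward elimination (a sketch).

    Starts from the full feature set and repeatedly removes the feature
    whose removal loses the least performance. evaluate(subset) is a
    hypothetical callback returning ANN performance (higher is better).
    """
    selected = list(features)
    while len(selected) > n_keep:
        # Drop the feature whose removal hurts performance the least.
        worst_f = max(selected,
                      key=lambda f: evaluate([g for g in selected if g != f]))
        selected.remove(worst_f)
    return selected

# Example mirroring the slide: keep the best 3 of 5 features.
# selected = sequential_backward_elimination([1, 2, 3, 4, 5], 3, evaluate)
```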

Principal Component Analysis (PCA)

This is not a feature selection method, but a feature reduction method.

17

[Figure: Class 1 and Class 2 plotted versus features x1 and x2]

This is a reduced-dimension problem, but no single feature gives us enough information about the classes.

18

Principal Component Analysis (PCA)

We are given input vectors xn (n = 1, …, N), and each vector contains d elements.
Goal: map the xn vectors to zn vectors, where each zn vector has M elements, and M < d.

$$x_n = \sum_{i=1}^{d} z_{ni} u_i$$

{ui} = orthonormal basis vectors

Since $u_i^T u_k = \delta_{ik}$, we see that $z_{ni} = u_i^T x_n$.

Now approximate each xn using only the first M basis vectors, replacing the remaining coefficients with constants {bi} (TBD):

$$\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{d} b_i u_i$$

$$x_n - \tilde{x}_n = \sum_{i=M+1}^{d} (z_{ni} - b_i) u_i$$

$$E_M = \frac{1}{2}\sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{i=M+1}^{d} (z_{ni} - b_i)^2$$

We want to minimize EM

We found the best {bi} values. What are the best {ui} vectors?

19

$$E_M = \frac{1}{2}\sum_{n=1}^{N}\sum_{i=M+1}^{d} (z_{ni} - b_i)^2, \qquad \text{where } z_{ni} = u_i^T x_n$$

Setting $dE_M/db_i = 0$ gives

$$b_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni} = u_i^T \bar{x}, \qquad \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$$

Substituting the optimal bi back into the error:

$$E_M = \frac{1}{2}\sum_{i=M+1}^{d}\sum_{n=1}^{N} \left[ u_i^T (x_n - \bar{x}) \right]^2 = \frac{1}{2}\sum_{i=M+1}^{d}\sum_{n=1}^{N} u_i^T (x_n - \bar{x})(x_n - \bar{x})^T u_i = \frac{1}{2}\sum_{i=M+1}^{d} u_i^T P u_i$$

This defines P (a d × d matrix).

where U = [uM+1 … ud], M = [μik], I = identity matrix. U is a d × (d−M) matrix, M is a (d−M) × (d−M) matrix.

20

$$E_M = \frac{1}{2}\sum_{i=M+1}^{d} u_i^T P u_i, \qquad P = \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$

We want to minimize EM with respect to {ui}. {ui} = 0 would work … We need to constrain {ui} to be a set of orthonormal basis vectors; that is, $u_i^T u_k = \delta_{ik}$. Constrained optimization:

$$\hat{E}_M = \frac{1}{2}\sum_{i=M+1}^{d} u_i^T P u_i - \frac{1}{2}\sum_{i=M+1}^{d}\sum_{k=M+1}^{d} \mu_{ik}\,(u_i^T u_k - \delta_{ik}) = \frac{1}{2}\mathrm{Tr}(U^T P U) - \frac{1}{2}\mathrm{Tr}[M(U^T U - I)]$$

μik = Lagrange multipliers; M = M^T w.l.o.g.

21

$$\hat{E}_M = \frac{1}{2}\mathrm{Tr}(U^T P U) - \frac{1}{2}\mathrm{Tr}[M(U^T U - I)]$$

$$\frac{\partial \hat{E}_M}{\partial U} = 0 \;\Rightarrow\; PU = UM \;\Rightarrow\; U^T P U = M$$

$U^T P U = M$: [(d−M) × d] [d × d] [d × (d−M)] = [(d−M) × (d−M)]
Add columns to U and expand M, so that PU = UM becomes a full eigenvector equation while still satisfying the original $U^T P U = M$ equation.
PCA Solution: {ui} = eigenvectors of P

The error is half of the sum of the (d−M) smallest eigenvalues of P. {ui} = the principal components.

PCA = Karhunen-Loeve transformation

22
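A NumPy sketch of the PCA solution above (names are my own); it checks numerically that the reconstruction error equals half the sum of the discarded eigenvalues:

```python
import numpy as np

def pca_reduce(X, M):
    """Reduce d-dimensional inputs to their top M principal components (sketch).

    X is an (N, d) data array. Returns the M-dimensional coefficients of the
    centered data, the top M eigenvectors of P, and the reconstruction error
    E_M, which matches half the sum of the (d - M) smallest eigenvalues of P.
    """
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    P = Xc.T @ Xc                              # P = sum_n (x_n - x_bar)(x_n - x_bar)^T
    lam, U = np.linalg.eigh(P)                 # eigenvalues ascending, eigenvectors in columns
    lam, U = lam[::-1], U[:, ::-1]             # reorder: largest eigenvalues first
    Z = Xc @ U[:, :M]                          # coefficients u_i^T (x_n - x_bar), i = 1..M
    X_approx = x_bar + Z @ U[:, :M].T          # reconstruction from M components
    E_M = 0.5 * np.sum((X - X_approx) ** 2)
    assert np.isclose(E_M, 0.5 * lam[M:].sum())   # error = half the discarded eigenvalues
    return Z, U[:, :M], E_M

# Example: project 3-dimensional data onto its top 2 principal components.
X = np.random.randn(200, 3) @ np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.2, 0.1]])
Z, U_M, E_M = pca_reduce(X, 2)
```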

$$E_M = \frac{1}{2}\sum_{i=M+1}^{d} u_i^T P u_i = \frac{1}{2}\sum_{i=M+1}^{d} \lambda_i\, u_i^T u_i = \frac{1}{2}\sum_{i=M+1}^{d} \lambda_i$$

PCA illustration in two dimensions: All data points are projected onto the u1 direction. Any variation in the u2 direction is ignored.

23

[Figure: two-dimensional data with principal directions u1 and u2; the data are projected onto u1]

24

Cascade Correlation (Scott Fahlman, 1988)

This gives a way to automatically adjust the network size. Also, it uses gradient-based weight optimization without the complexity of backpropagation.

1. Begin with no hidden-layer neurons.
2. Add hidden neurons one at a time.
3. After adding hidden neuron Hi, optimize the weights from “upstream” neurons to Hi to maximize the effect of Hi on the outputs.
4. Optimize the output weights of Hi to minimize training error.

Cascade Correlation Example: Two inputs and two outputs
Step 1 – Start with a two-layer network (no hidden layers).

25

[Figure: two-layer network with inputs x1, x2, a bias node 1, and outputs y1, y2]

y1 = f(x · w1), where w1 = weights from inputs to y1.
w1 can be trained with a gradient method. Similarly, w2 can be trained with a gradient method.
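A minimal sketch of this gradient training for the two-layer network, assuming a sigmoid activation and a squared-error cost (the slides only say “a gradient method” and do not specify either):

```python
import numpy as np

def f(a):
    """Sigmoid activation (an assumption; the slides just write f)."""
    return 1.0 / (1.0 + np.exp(-a))

def train_output_weights(X, Y, epochs=2000, lr=0.1):
    """Gradient training of the two-layer (no hidden layer) network (sketch).

    X: (N, 2) inputs, Y: (N, 2) targets. A bias input of 1 is appended, so
    each output computes y_k = f(x . w_k). Plain batch gradient descent on
    squared error, one possible "gradient method".
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append the bias node "1"
    W = np.zeros((Xb.shape[1], Y.shape[1]))          # columns are w1 and w2
    for _ in range(epochs):
        Yhat = f(Xb @ W)
        grad = Xb.T @ ((Yhat - Y) * Yhat * (1.0 - Yhat))   # gradient of squared error
        W -= lr * grad
    return W
```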

Cascade Correlation Example: Two inputs and two outputs
Step 2 – Add a hidden neuron H1. Maximize the correlation between the H1 output and the training error.

26

[Figure: the network with hidden neuron H1 added; z is the H1 output. The weights into the output neurons y1, y2 are not updated in this step (“don’t update”); the weights from the inputs to H1 are updated (“do update”).]

$$S = \text{Correlation} = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (z_n - \bar{z})(e_{nk} - \bar{e}_k)$$

$$\frac{dS}{dw_i} = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (e_{nk} - \bar{e}_k)\, f'(w \cdot x_n)\, x_{ni}$$

no = number of outputs
N = number of training samples
e = training error before H1 is added
{wi} = weights from inputs to H1

Use a gradient method to maximize |S| with respect to {wi}
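A sketch of this correlation maximization for a single candidate unit, assuming a tanh activation (the slides only write f); the gradient follows the dS/dwi expression above:

```python
import numpy as np

def f(a):
    return np.tanh(a)                 # candidate activation (an assumption)

def fprime(a):
    return 1.0 - np.tanh(a) ** 2

def train_candidate(X, E, epochs=2000, lr=0.01):
    """Train a candidate hidden unit by gradient ascent on |S| (sketch).

    X: (N, d) inputs feeding the candidate (bias column included);
    E: (N, n_o) training errors of the current network, computed before the
    candidate is added.
    """
    w = 0.1 * np.random.randn(X.shape[1])
    Ec = E - E.mean(axis=0)                          # e_nk - e_bar_k
    for _ in range(epochs):
        a = X @ w
        z = f(a)
        S = np.sum((z - z.mean())[:, None] * Ec)     # the correlation S
        dS_dw = X.T @ (Ec.sum(axis=1) * fprime(a))   # dS/dw_i from the slide
        w += lr * np.sign(S) * dS_dw                 # ascend |S|
    return w
```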

Cascade Correlation Example: Two inputs and two outputs
Step 3 – Optimize the output weights.

27

[Figure: the same network; the weights into H1 are now frozen (“don’t update”) and the weights into the output neurons y1, y2 are trained (“do update”).]

Use a gradient method to minimize training error with respect to the weights that are connected to the output neurons.

Cascade Correlation Example: Two inputs and two outputs
Step 4 – Add another hidden neuron H2 and repeat Step 2: maximize the correlation between the H2 output and the training error.

28

[Figure: the network with a second hidden neuron H2 added; z2 is the H2 output. H2 receives connections from x1, x2, the bias node, and H1. Only the weights into H2 are updated (“do update”); all other weights are frozen (“don’t update”).]

$$S = \text{Correlation} = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (z_n - \bar{z})(e_{nk} - \bar{e}_k)$$

$$\frac{dS}{dw_i} = \sum_{k=1}^{n_o} \sum_{n=1}^{N} (e_{nk} - \bar{e}_k)\, f'(w \cdot x_n)\, x_{ni}$$

no = number of outputs
N = number of training samples
e = training error before H2 is added
{wi} = weights “upstream” from H2

Use a gradient method to maximize |S| with respect to {wi}

Cascade Correlation Example: Two inputs and two outputs
Step 5 – Optimize the output weights.

29

[Figure: the network with H1 and H2; the weights into H1 and H2 are frozen (“don’t update”) and the weights into the output neurons y1, y2 are trained (“do update”).]

Use a gradient method to minimize training error with respect to the weights that are connected to the output neurons.

Repeat the previous two steps until the desired performance is obtained (a sketch of this loop is given after the list):
• Add a hidden neuron Hi
• Maximize the correlation between the Hi output and the training error with respect to the Hi input weights
• Freeze the input weights
• Minimize the training error with respect to the output weights.
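A high-level sketch of the whole cascade-correlation loop; `train_outputs` and `train_candidate` are hypothetical helpers standing in for the two gradient steps described on the previous slides, and `f` is the activation:

```python
import numpy as np

def cascade_correlation(X, Y, n_hidden, train_outputs, train_candidate, f):
    """High-level cascade-correlation training loop (a sketch).

    train_outputs(H, Y) returns output weights fitted on the current
    activation matrix H; train_candidate(H, E) returns input weights for a
    new hidden unit chosen to maximize the correlation |S| between its
    output and the training error E.
    """
    H = np.hstack([X, np.ones((X.shape[0], 1))])   # inputs plus bias node
    for _ in range(n_hidden):
        W_out = train_outputs(H, Y)                # minimize training error (output weights)
        E = f(H @ W_out) - Y                       # training error before the new unit
        w_new = train_candidate(H, E)              # maximize |correlation| for the new unit
        H = np.hstack([H, f(H @ w_new)[:, None]])  # freeze its inputs; cascade its output
    W_out = train_outputs(H, Y)                    # final output-weight training
    return W_out, H
```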

References
• C. Bishop, Neural Networks for Pattern Recognition
• D. Simon, Optimal State Estimation (Chapter 1)

30