Neural Networks I
CMPUT 466/551
Nilanjan Ray
Outline
• Projection Pursuit Regression
• Neural Network
  – Background
  – Vanilla Neural Networks
  – Back-propagation
  – Examples
Projection Pursuit Regression
• Additive model with non-linear functions g_m
• The features X are projected onto direction vectors w_m, which we have to learn from the training data
• A precursor to neural networks
  f(X) = \sum_{m=1}^{M} g_m(w_m^T X)
Fitting a PPR Model
• Minimize the squared-error loss function:

  \min_{\{w_m, g_m\}_{m=1}^{M}} \; \sum_{i=1}^{N} \Big[ y_i - \sum_{m=1}^{M} g_m(w_m^T x_i) \Big]^2

• Proceed in forward stages: M = 1, 2, ... etc.
  – At each stage, estimate g given w (say, by fitting a spline function)
  – Estimate w given g (details provided in the next slide)
• The value of M is decided by cross-validation (a small sketch of evaluating the model and its loss follows below)
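A minimal sketch of evaluating a fitted PPR model and the squared-error criterion above; the ridge functions are passed as callables, and all names (ppr_predict, ppr_loss, w_list, g_list) are illustrative rather than from the slides.

```python
import numpy as np

def ppr_predict(X, w_list, g_list):
    """f(X) = sum_m g_m(w_m^T X): additive model of non-linear ridge functions."""
    return sum(g(X @ w) for w, g in zip(w_list, g_list))

def ppr_loss(X, y, w_list, g_list):
    """Squared-error criterion minimized over {g_m, w_m}."""
    return np.sum((y - ppr_predict(X, w_list, g_list)) ** 2)
```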
Fitting a PPR Model…
• At stage m, given g, compute w (Gauss-Newton search)
Linearize g about the current estimate w_old:

  g(w^T x_i) ≈ g(w_old^T x_i) + g'(w_old^T x_i) (w - w_old)^T x_i

Substituting into the criterion gives

  \sum_{i=1}^{N} \big[ z_i - g(w^T x_i) \big]^2
    ≈ \sum_{i=1}^{N} g'(w_old^T x_i)^2 \Big[ \Big( w_old^T x_i + \frac{z_i - g(w_old^T x_i)}{g'(w_old^T x_i)} \Big) - w^T x_i \Big]^2

with weights g'(w_old^T x_i)^2 and adjusted responses w_old^T x_i + (z_i - g(w_old^T x_i)) / g'(w_old^T x_i).

So this is a weighted least-squares technique (a minimal per-stage sketch follows below).

Here z_i = y_i - \sum_{k=1}^{m-1} g_k(w_k^T x_i) is the residual after stage m-1.
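A minimal sketch of fitting one PPR stage to the current residuals by alternating the two steps above; as an assumption, a cubic polynomial fit (np.polyfit) stands in for the spline smoother, and fit_ppr_stage and its arguments are illustrative names.

```python
import numpy as np

def fit_ppr_stage(X, z, n_alt=5, degree=3, eps=1e-8):
    """Fit one term g(w^T x) to residuals z by alternating
    (i) g given w via a polynomial smoother and
    (ii) w given g via the Gauss-Newton weighted least-squares step."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_alt):
        v = X @ w                                       # w_old^T x_i
        g = np.poly1d(np.polyfit(v, z, degree))         # estimate g given w
        dgv = g.deriv()(v)
        dgv = np.where(np.abs(dgv) > eps, dgv, eps)     # guard against g' = 0
        weights = dgv ** 2                              # g'(w_old^T x_i)^2
        adjusted = v + (z - g(v)) / dgv                 # adjusted responses
        # Weighted least squares: min_w sum_i weights_i * (adjusted_i - w^T x_i)^2
        XtWX = X.T @ (weights[:, None] * X)
        w = np.linalg.lstsq(XtWX, X.T @ (weights * adjusted), rcond=None)[0]
        w /= np.linalg.norm(w)
    return w, g
```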
Vanilla Neural Network
  Z_m = \sigma(\alpha_{0m} + \alpha_m^T X),  m = 1, 2, ..., M
  T_k = \beta_{0k} + \beta_k^T Z,  k = 1, 2, ..., K
  f_k(X) = g_k(T),  k = 1, 2, ..., K

where X = (X_1, X_2, ..., X_p), Z = (Z_1, Z_2, ..., Z_M), T = (T_1, T_2, ..., T_K)

For regression: g_k(T) = T_k
For classification: g_k(T) = exp(T_k) / \sum_{l=1}^{K} exp(T_l)  (softmax)

\sigma is the sigmoid, \sigma(v) = 1 / (1 + e^{-v})

[Figure: network diagram with input layer, hidden layer, and output layer]
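A minimal sketch of these forward equations for a single hidden layer, with identity outputs for regression and a softmax output for classification; the array shapes and names (forward, alpha0, alpha, beta0, beta) are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(X, alpha0, alpha, beta0, beta, task="classification"):
    """X: (N, p); alpha: (M, p), alpha0: (M,); beta: (K, M), beta0: (K,)."""
    Z = sigmoid(alpha0 + X @ alpha.T)   # Z_m = sigma(alpha_0m + alpha_m^T X)
    T = beta0 + Z @ beta.T              # T_k = beta_0k + beta_k^T Z
    if task == "regression":
        return T                        # g_k(T) = T_k
    expT = np.exp(T - T.max(axis=1, keepdims=True))    # numerically stable softmax
    return expT / expT.sum(axis=1, keepdims=True)      # g_k(T) = exp(T_k)/sum_l exp(T_l)
```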
The Sigmoid Function
[Figure: plots of σ(0.5v) and σ(10v)]

σ(sv) = 1/(1 + exp(-sv))
• a smooth (regularized) threshold function
• s controls the activation rate: as s↑ the activation becomes hard (step-like); as s↓ it becomes close to the identity function (see the small demo below)
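A small demo of the effect of the scale s in the formula above; the function name sigmoid_s is an illustrative choice.

```python
import numpy as np

def sigmoid_s(v, s=1.0):
    """sigma(s*v) = 1 / (1 + exp(-s*v))."""
    return 1.0 / (1.0 + np.exp(-s * v))

v = np.linspace(-3, 3, 7)
print(np.round(sigmoid_s(v, 0.5), 3))   # gentle slope, nearly linear near 0
print(np.round(sigmoid_s(v, 10.0), 3))  # close to a hard threshold at 0
```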
Multilayer Feed Forward NN
http://www.teco.uni-karlsruhe.de/~albrecht/neuro/html/node18.html
Example architectures
NN: Universal Approximator
• A NN with one hidden layer can approximate arbitrarily well any continuous functional mapping from one finite-dimensional space to another, provided the number of hidden units is sufficiently large.
• The proof is based on the Fourier expansion of a function (see Bishop).
NN: Kolmogorov's Theorem
• Any continuous mapping f(x) of d input variables can be represented by a neural network with two hidden layers of nodes. The first layer contains d(2d+1) nodes, and the second layer contains (2d+1) nodes.
• So why bother about topology at all? This 'universal' architecture is impractical because the functions represented by the hidden units will be non-smooth and unsuitable for learning (see Bishop for more).
The XOR Problem and NN
Activation functions are hard thresholds at 0 (one hand-picked set of weights that solves XOR is sketched below).
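A minimal sketch of such a solution; the particular weights and thresholds are an illustrative choice, not the unique one.

```python
def step(v):
    """Hard threshold at 0."""
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit acting as OR
    h2 = step(x1 + x2 - 1.5)        # hidden unit acting as AND
    return step(h1 - 2 * h2 - 0.5)  # fires only when exactly one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # prints 0, 1, 1, 0
```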
Fitting Neural Networks
• Parameters to learn from training data:

  \{\alpha_{0m}, \alpha_m ;  m = 1, 2, ..., M\}   (M(p+1) weights)
  \{\beta_{0k}, \beta_k ;  k = 1, 2, ..., K\}   (K(M+1) weights)

• Cost functions (both are sketched below)
  – Sum-of-squared errors for regression:

    R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2

  – Cross-entropy errors for classification:

    R(\theta) = - \sum_{k=1}^{K} \sum_{i=1}^{N} y_{ik} \log f_k(x_i)
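A minimal sketch of both cost functions, assuming y is a one-hot (N, K) target array and f holds the network outputs f_k(x_i) of the same shape; the names are illustrative.

```python
import numpy as np

def sum_of_squares(y, f):
    """R(theta) = sum_k sum_i (y_ik - f_k(x_i))^2, used for regression."""
    return np.sum((y - f) ** 2)

def cross_entropy(y, f, eps=1e-12):
    """R(theta) = -sum_k sum_i y_ik * log f_k(x_i), used for classification."""
    return -np.sum(y * np.log(f + eps))
```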
Gradient descent: Back-propagation
Fitting criterion (squared error):

  R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2

Derivatives with respect to the parameters \beta and \alpha:

  \frac{\partial R_i}{\partial \beta_{km}} = -2 (y_{ik} - f_k(x_i)) \, g_k'(\beta_k^T z_i) \, z_{mi} = \delta_{ki} z_{mi}

  \frac{\partial R_i}{\partial \alpha_{ml}} = - \sum_{k=1}^{K} 2 (y_{ik} - f_k(x_i)) \, g_k'(\beta_k^T z_i) \, \beta_{km} \, \sigma'(\alpha_m^T x_i) \, x_{il} = s_{mi} x_{il}

where \delta_{ki} and s_{mi} are the errors at the output and hidden units, related by the back-propagation equation

  s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km} \delta_{ki}

Gradient descent update at iteration r, with learning rate \gamma_r:

  \beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}}

  \alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}}
Back-propagation: Implementation
• Step 1: Initialize the parameters (weights) of the NN
• Iterate:
  – Forward pass: compute f_k(X) for the current parameter values, starting at the input layer and moving all the way up to the output layer.
  – Backward pass: start at the output layer and compute the errors δ_ki; go down one layer at a time and compute s_mi, all the way down to the input layer.
  – Update the weights by the gradient-descent rule (a minimal sketch follows below).
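Below is a minimal sketch of one such iteration for the squared-error case with identity output units (so g_k' = 1); the full training loop, the learning-rate schedule, and all names here are illustrative assumptions rather than the slides' own code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, Y, alpha0, alpha, beta0, beta, gamma=0.01):
    """One forward pass, one backward pass, and one gradient-descent update.
    X: (N, p), Y: (N, K); alpha: (M, p), alpha0: (M,); beta: (K, M), beta0: (K,)."""
    # Forward pass
    A = alpha0 + X @ alpha.T            # alpha_0m + alpha_m^T x_i
    Z = sigmoid(A)                      # hidden units z_mi
    F = beta0 + Z @ beta.T              # outputs f_k(x_i), identity g_k
    # Backward pass
    delta = -2.0 * (Y - F)              # output errors delta_ki
    S = (delta @ beta) * Z * (1.0 - Z)  # s_mi = sigma'(.) * sum_k beta_km * delta_ki
    # Gradient-descent update, summing the per-case gradients over i
    beta   -= gamma * (delta.T @ Z)     # dR/dbeta_km  = sum_i delta_ki * z_mi
    beta0  -= gamma * delta.sum(axis=0)
    alpha  -= gamma * (S.T @ X)         # dR/dalpha_ml = sum_i s_mi * x_il
    alpha0 -= gamma * S.sum(axis=0)
    return alpha0, alpha, beta0, beta
```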
Issues in Training Neural Networks
• Initial values of parameters
  – Back-propagation only finds a local minimum
• Overfitting
  – Neural networks have too many parameters
  – Use early stopping and regularization
• Scaling of the inputs
  – Inputs are typically scaled to have zero mean and unit standard deviation
• Number of hidden units and layers
  – Better to have too many than too few
  – With 'traditional' back-propagation a deep NN gets stuck in local minima and does not learn well
Avoiding Overfitting
• Weight decay: minimize R(\theta) + \lambda J(\theta) with the cost function

  J(\theta) = \sum_{k,m} \beta_{km}^2 + \sum_{m,l} \alpha_{ml}^2

• Weight elimination penalty function:

  J(\theta) = \sum_{k,m} \frac{\beta_{km}^2}{1 + \beta_{km}^2} + \sum_{m,l} \frac{\alpha_{ml}^2}{1 + \alpha_{ml}^2}

(Both penalties are sketched below.)
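A minimal sketch of the two penalties and of how weight decay enters the gradient (it simply adds 2λβ and 2λα to the respective derivatives); all names are illustrative.

```python
import numpy as np

def weight_decay_penalty(alpha, beta, lam):
    """lambda * J(theta) with J = sum beta_km^2 + sum alpha_ml^2."""
    return lam * (np.sum(beta ** 2) + np.sum(alpha ** 2))

def weight_elimination_penalty(alpha, beta, lam):
    """lambda * J(theta) with J = sum b^2/(1+b^2) + sum a^2/(1+a^2)."""
    return lam * (np.sum(beta ** 2 / (1.0 + beta ** 2)) +
                  np.sum(alpha ** 2 / (1.0 + alpha ** 2)))

# With weight decay, the back-propagation gradients pick up extra terms:
#   dR/dbeta_km  + 2 * lam * beta_km
#   dR/dalpha_ml + 2 * lam * alpha_ml
```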
Example
Example: ZIP Code Recognition
Some Architectures for ZIP Code Recognition
Architectures and Parameters
• Net-1: No hidden layer, equivalent to multinomial logistic regression
• Net-2: One hidden layer, 12 hidden units fully connected
• Net-3: Two hidden layers, locally connected
• Net-4: Two hidden layers, locally connected with weight sharing
• Net-5: Two hidden layers, locally connected, two levels of weight sharing
Networks built around this kind of weight sharing are known as convolutional neural networks.
More on Architectures and Results

Network   Architecture         Links   Weights   %Correct
Net-1     Single layer          2570      2570       80.0
Net-2     Two layer             3214      3214       87.0
Net-3     Locally connected     1226      1226       88.5
Net-4     Constrained           2266      1132       94.0
Net-5     Constrained           5194      1060       98.4
Net-1: #Links = #Weights = 16*16*10 + 10 = 2570
Net-2: #Links = #Weights = 16*16*12 + 12 + 12*10 + 10 = 3214
Net-3: #Links = #Weights = 8*8*3*3 + 8*8 + 4*4*5*5 + 4*4 + 10*4*4 + 10 = 1226
Net-4: #Links = 2*8*8*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10 = 2266
       #Weights = 2*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10 = 1132
Net-5: #Links = 2*8*8*3*3 + 2*8*8 + 4*4*4*5*5*2 + 4*4*4 + 4*4*4*10 + 10 = 5194
       #Weights = 2*3*3 + 2*8*8 + 4*5*5*2 + 4*4*4 + 4*4*4*10 + 10 = 1060
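A small snippet that simply re-evaluates the arithmetic above, so the table entries can be checked directly.

```python
counts = {
    "Net-1 links = weights": 16*16*10 + 10,
    "Net-2 links = weights": 16*16*12 + 12 + 12*10 + 10,
    "Net-3 links = weights": 8*8*3*3 + 8*8 + 4*4*5*5 + 4*4 + 10*4*4 + 10,
    "Net-4 links":   2*8*8*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10,
    "Net-4 weights": 2*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10,
    "Net-5 links":   2*8*8*3*3 + 2*8*8 + 4*4*4*5*5*2 + 4*4*4 + 4*4*4*10 + 10,
    "Net-5 weights": 2*3*3 + 2*8*8 + 4*5*5*2 + 4*4*4 + 4*4*4*10 + 10,
}
for name, value in counts.items():
    print(name, "=", value)   # 2570, 3214, 1226, 2266, 1132, 5194, 1060
```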
Performance vs. Training Time
Some References
• C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, 1996. (For good understanding)
• S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009. (For very basic reading, lots of examples etc.)
• Prominent researchers:
  – Yann LeCun (http://yann.lecun.com/)
  – G. E. Hinton (http://www.cs.toronto.edu/~hinton/)
  – Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html)