Concept Map for Ch. 3
[Concept map: Feedforward networks are nonlayered or layered; layered networks are single-layer (ALC, perceptron – Ch. 1, Ch. 2) or multilayer. The multilayer perceptron computes y = F(x, W) ≈ f(x) with sigmoid activations, with the weights W viewed in matrix-vector form or as scalars w_ij. Learning by backpropagation (BP): from training data {(x_i, f(x_i)) | i = 1 ~ N}, gradient descent minimizes E(W), producing a new W from the old W using the difference between the desired and the actual output.]
Chapter 3. Multilayer Perceptron
1. MLP Architecture – Extension of the Perceptron to Many Layers and Sigmoidal Activation Functions
– for real-valued mapping/classification
Learning: Discrete training data → Find W* → Continuous F(x, W*) ≈ f(x)
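As a rough sketch of this architecture, the following Python snippet (my own illustration; the 2-3-1 layer sizes, the bias-free weighted sums, and the name `forward` are assumptions, not from the slides) builds a layered feedforward mapping y = F(x, W) from matrix-vector weight layers and a sigmoidal activation:

```python
import numpy as np

def sigmoid(v):
    # logistic activation, one of the sigmoids used in Ch. 3
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights):
    """Layered feedforward pass: y = F(x, W).

    weights is a list of matrices, one per layer; each layer computes
    a weighted sum (matrix-vector product) followed by the sigmoid.
    """
    y = x
    for W in weights:
        y = sigmoid(W @ y)
    return y

# Hypothetical 2-3-1 MLP: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
print(forward(np.array([0.5, -0.2]), weights))
```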
Sigmoidal Activation Functions
– Hyperbolic tangent: φ(v) = tanh(v/2) = (1 − e^(−v)) / (1 + e^(−v)), range (−1, 1), with φ' = (1/2)(1 − φ²)
– Logistic: φ(v) = 1 / (1 + e^(−v)), range (0, 1), with φ' = φ(1 − φ) (smaller maximum slope, 1/4)
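A small numerical sketch of these two sigmoids and their derivative identities (assuming the tanh(v/2) and logistic forms reconstructed above):

```python
import math

def tanh_half(v):
    # hyperbolic-tangent sigmoid, range (-1, 1)
    return math.tanh(v / 2.0)          # = (1 - e**-v) / (1 + e**-v)

def tanh_half_deriv(v):
    phi = tanh_half(v)
    return 0.5 * (1.0 - phi ** 2)      # phi' = (1/2)(1 - phi^2)

def logistic(v):
    # logistic sigmoid, range (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def logistic_deriv(v):
    phi = logistic(v)
    return phi * (1.0 - phi)           # phi' = phi(1 - phi), max 1/4 at v = 0

print(tanh_half_deriv(0.0), logistic_deriv(0.0))   # 0.5  0.25
```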
2. Weight Learning Rule – Backpropagation of Error
(1) Training Data {(x_p, d_p)} → Weights (W):
Curve (Data) Fitting (Modeling, Nonlinear Regression)
(2) Mean Squared Error E for a 1-D Function as an Example
E(W) = (1/2) Σ_p E_p² = (1/2) Σ_p (d_p − F(x_p, W))²  : Cost Function
Here (x_p, d_p) are samples of the true function f(x), and F(x, W) is the NN approximating function.
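As a minimal illustration, the cost above could be computed as follows; the callable F standing in for the network output and the toy data are assumptions made for the example:

```python
def mean_squared_error(data, F, W):
    """E(W) = (1/2) * sum over patterns p of (d_p - F(x_p, W))**2."""
    return 0.5 * sum((d_p - F(x_p, W)) ** 2 for x_p, d_p in data)

# Illustrative use: samples of a "true function" f(x) = x**2,
# approximated here by a (trivially linear) F(x, W) = W*x.
data = [(x, x ** 2) for x in (-1.0, 0.0, 1.0, 2.0)]
print(mean_squared_error(data, lambda x, W: W * x, W=1.0))
```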
(3) Gradient Descent Learning
1) Batch:   Δw = −η ∂E/∂w,  E = Σ_p E_p
2) Pattern: Δw = −η ∂E_p/∂w   [local (instantaneous) gradient]

(4) Learning Curve
Iteration = one scan of the training set (epoch)
The learning curve plots E{W(n)} (along with the weight track) against the number of iterations n, starting from n = 0.
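The two modes can be sketched as below; the one-weight model F(x, w) = w·x and the sample data are made up purely to show how the batch and pattern updates differ and how the learning-curve values E(W(n)) are collected:

```python
def grad_Ep(w, pattern):
    # gradient of E_p = (1/2)(d_p - w*x_p)**2 for one pattern (toy model)
    x_p, d_p = pattern
    return -(d_p - w * x_p) * x_p

def train(data, w, eta=0.1, epochs=20, batch=True):
    curve = []                                  # learning curve E(W(n))
    for n in range(epochs):
        if batch:                               # 1) batch: sum gradients, one update per epoch
            w -= eta * sum(grad_Ep(w, p) for p in data)
        else:                                   # 2) pattern: update after every pattern
            for p in data:
                w -= eta * grad_Ep(w, p)        # local (instantaneous) gradient
        curve.append(0.5 * sum((d - w * x) ** 2 for x, d in data))
    return w, curve

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]     # roughly d = 2x
w_batch, curve = train(data, w=0.0)
print(round(w_batch, 3), [round(e, 3) for e in curve[:3]])
```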
(5) Backpropagation Learning Rule
Notation: input x_i feeds hidden unit j through w_ij; unit j feeds output unit k through w_jk; y_i, y_j, y_k are the unit outputs and (d_k − y_k) is the output error.

A. Output Layer Weights
Δ_p w_jk = −η ∂E_p/∂w_jk = η δ_k y_j   [(local error)(local activation)]
where δ_k = −∂E_p/∂sum_k = (d_k − y_k) φ'(sum_k)

B. Inner Layer Weights
Δ_p w_ij = −η ∂E_p/∂w_ij = η δ_j y_i
where δ_j = −∂E_p/∂sum_j = φ'(sum_j) Σ_k δ_k w_jk   [error signal for hidden units]

Features: Locality of Computation, No Centralized Control, 2-Pass (Credit Assignment)
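A hedged sketch of both rules in vectorized form for a single hidden layer of logistic units (the array shapes, the function name `bp_update`, and the learning rate value are my own choices, not the slides'):

```python
import numpy as np

def bp_update(x, d, V, W, eta=0.5):
    """One pattern-mode backprop step for an input -> hidden -> output MLP.

    V: hidden weights (n_hidden x n_in), W: output weights (n_out x n_hidden).
    """
    phi = lambda v: 1.0 / (1.0 + np.exp(-v))                # logistic, phi' = phi(1 - phi)
    # forward pass (function signals)
    y_hidden = phi(V @ x)
    y_out = phi(W @ y_hidden)
    # backward pass (error signals)
    delta_out = (d - y_out) * y_out * (1 - y_out)           # delta_k = (d_k - y_k) phi'(sum_k)
    delta_hid = y_hidden * (1 - y_hidden) * (W.T @ delta_out)  # delta_j = phi'(sum_j) sum_k delta_k w_jk
    # weight changes: (local error)(local activation)
    W += eta * np.outer(delta_out, y_hidden)
    V += eta * np.outer(delta_hid, x)
    return V, W
```

In use, `bp_update` would be called once per training pattern, i.e. the pattern (online) mode of the gradient descent section above.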
Water Flow Analogy to Backpropagation
[Figure: a river flow from input to output. The object is dropped at the input x_p and fetched at the output y_p = F(x_p, W); the many weights w_1, ..., w_l act as the flows, d_p is the desired output, and e_p is the error.]
If the error is very sensitive to a weight change, then change that weight a lot, and vice versa.
→ Gradient Descent, Minimum Disturbance Principle
(6) Computation Example: MLP(2-1-2)
A. Forward Processing – Compute Function Signals
sum_1 = v_1 x_1 + v_2 x_2
h = φ(sum_1)
y_1 = φ(sum_21) = φ(w_1 h)
y_2 = φ(sum_22) = φ(w_2 h)
[Figure: inputs x_1, x_2 connect to the hidden unit h through v_1, v_2; h connects to the outputs y_1, y_2 through w_1, w_2.]
No desired response is needed for hidden nodes. The derivative φ' must exist ⇒ φ = sigmoid [tanh or logistic]. For classification, d = ±0.9 for tanh; d = 0.1, 0.9 for logistic.
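A direct transcription of this forward pass into Python; tanh is chosen as the sigmoid and the weight and input values are made up for illustration:

```python
import math

phi = math.tanh                     # sigmoid activation (tanh variant)

# made-up weights and input for the 2-1-2 network
v1, v2 = 0.3, -0.5                  # input -> hidden
w1, w2 = 0.8, -0.2                  # hidden -> outputs
x1, x2 = 1.0, 0.5

sum1 = v1 * x1 + v2 * x2            # hidden net input
h = phi(sum1)                       # hidden output
sum21, sum22 = w1 * h, w2 * h       # output net inputs
y1, y2 = phi(sum21), phi(sum22)     # function signals
print(h, y1, y2)
```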
B. Backward Processing – Compute Error Signals
e_1 = d_1 − y_1,  e_2 = d_2 − y_2
δ_1 = e_1 φ'(sum_21),  δ_2 = e_2 φ'(sum_22)
Δw_1 = η δ_1 h = η e_1 φ'(sum_21) h
Δw_2 = η δ_2 h = η e_2 φ'(sum_22) h
δ_3 = φ'(sum_1)[δ_1 w_1 + δ_2 w_2] = φ'(sum_1)[e_1 φ'(sum_21) w_1 + e_2 φ'(sum_22) w_2]
Δv_1 = η δ_3 x_1,  Δv_2 = η δ_3 x_2
Each φ'(sum) is evaluated from values that have already been computed in the forward processing (e.g., for the logistic, φ' = φ(1 − φ)).
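Continuing with the same made-up 2-1-2 values, a sketch of the backward pass; note that for φ = tanh, φ'(sum) = 1 − φ(sum)², so the derivatives reuse quantities already available from the forward processing:

```python
import math

phi = math.tanh
eta = 0.1

# same made-up weights and input as the forward sketch above
v1, v2, w1, w2 = 0.3, -0.5, 0.8, -0.2
x1, x2 = 1.0, 0.5
d1, d2 = 0.9, -0.9                   # desired outputs (tanh classification targets)

# forward pass (function signals)
h = phi(v1 * x1 + v2 * x2)
y1, y2 = phi(w1 * h), phi(w2 * h)

# backward pass: error signals
e1, e2 = d1 - y1, d2 - y2
dphi = lambda out: 1.0 - out * out   # phi'(sum) = 1 - tanh(sum)**2, from forward values
delta1, delta2 = e1 * dphi(y1), e2 * dphi(y2)          # output-layer deltas
delta_h = dphi(h) * (delta1 * w1 + delta2 * w2)        # hidden-unit delta (backpropagated)

# weight changes
dw1, dw2 = eta * delta1 * h, eta * delta2 * h          # hidden -> output weights
dv1, dv2 = eta * delta_h * x1, eta * delta_h * x2      # input -> hidden weights
print(round(dw1, 4), round(dw2, 4), round(dv1, 4), round(dv2, 4))
```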
If we knew f(x,y), it would be a lot faster to use it to calculate the output than to use the
NN.
Student Questions:
Does the output error become more uncertain for a complex multilayer network than for a simple single-layer one?
Should we use only up to 3 layers?
Why can oscillation occur in the learning curve?
Do we use the old weights for calculating the error signal δ?
What does ANN mean?
Considering the equation for the weight change, which makes more sense: the error gradient or the weight gradient?
What becomes the error signal for training the weights in forward mode?