Supervised Learning: Neural Networks and Back-Propagation
Part 7: Neural Networks & Learning




11/2/05 1

Neural Networks and Back-Propagation Learning

11/2/05 2

Supervised Learning

• Produce desired outputs for training inputs

• Generalize reasonably & appropriately to other inputs

• Good example: pattern recognition

• Feedforward multilayer networks

11/2/05 3

Feedforward Network

[Figure: feedforward network with an input layer on the left, several hidden layers in the middle, and an output layer on the right.]

11/2/05 4

Credit Assignment Problem

[Figure: the same feedforward network (input layer, hidden layers, output layer), with the desired output shown beside the output layer.]

How do we adjust the weights of the hidden layers?

11/2/05 5

Adaptive System

[Figure: block diagram of an adaptive system $S$ with control parameters $P_1, \ldots, P_k, \ldots, P_m$, set by a control algorithm $C$, and an evaluation function $F$ (fitness, figure of merit) measuring the system's performance.]

11/2/05 6

Gradient

$\dfrac{\partial F}{\partial P_k}$ measures how $F$ is altered by variation of $P_k$.

$$\nabla F = \begin{pmatrix} \dfrac{\partial F}{\partial P_1} \\ \vdots \\ \dfrac{\partial F}{\partial P_k} \\ \vdots \\ \dfrac{\partial F}{\partial P_m} \end{pmatrix}$$

$\nabla F$ points in the direction of maximum increase in $F$.


11/2/05 7

Gradient Ascent on Fitness Surface

[Figure: fitness surface with gradient vectors $\nabla F$ pointing uphill; gradient ascent follows them toward the maximum (+).]

11/2/05 8

Gradient Ascent by Discrete Steps

[Figure: the same fitness surface climbed in discrete steps along $\nabla F$ toward the maximum (+).]

11/2/05 9

Gradient Ascent is Local But Not Shortest

[Figure: a gradient-ascent path on the fitness surface; it always climbs locally uphill, but it is not the shortest path to the maximum (+).]

11/2/05 10

Gradient Ascent Process

$$\dot{\mathbf{P}} = \nabla F(\mathbf{P})$$

Change in fitness:

$$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k}\,\frac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k\,\dot{P}_k$$

$$\dot{F} = \nabla F \cdot \dot{\mathbf{P}}$$

$$\dot{F} = \nabla F \cdot \nabla F = \|\nabla F\|^2 \ge 0$$

Therefore gradient ascent increases fitness (until it reaches a point of zero gradient).
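To make gradient ascent by discrete steps concrete, here is a minimal Python sketch that climbs a simple fitness surface; the quadratic fitness function, step size, and starting point are illustrative assumptions, not part of the slides.

```python
import numpy as np

def fitness(P):
    # Illustrative concave fitness surface with its maximum at P = (1, 2).
    return -((P[0] - 1.0) ** 2 + (P[1] - 2.0) ** 2)

def grad_fitness(P):
    # Analytic gradient of the fitness above.
    return np.array([-2.0 * (P[0] - 1.0), -2.0 * (P[1] - 2.0)])

P = np.array([4.0, -3.0])            # arbitrary starting parameters
step = 0.1                           # assumed step size
for _ in range(100):
    P = P + step * grad_fitness(P)   # discrete step in the direction of grad F

print(P, fitness(P))                 # P approaches (1, 2); fitness approaches 0
```

Because each step moves in the direction of $\nabla F$, the fitness never decreases (for a sufficiently small step), matching the $\dot F = \|\nabla F\|^2 \ge 0$ argument above.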

11/2/05 11

General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:

$$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\|\,\|\dot{\mathbf{P}}\|\cos\varphi$$

where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, that is, $\varphi < 90^{\circ}$.

11/2/05 12

General Ascent on Fitness Surface

[Figure: a general (non-gradient) ascent path on the fitness surface, still climbing toward the maximum (+) because it stays within 90° of $\nabla F$.]


11/2/05 13

Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $\mathbf{t}^1, \ldots, \mathbf{t}^Q$.

Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \ldots, \mathbf{y}^Q$.

Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target and actual outputs.

Let $E^q = D\!\left(\mathbf{t}^q, \mathbf{y}^q\right)$ be the error on the $q$th sample.

Let $F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D\!\left[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})\right]$.
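A minimal sketch of this fitness-as-negative-total-error definition, using the squared Euclidean distance (introduced a few slides below) for $D$; the target and output arrays are made-up placeholders.

```python
import numpy as np

def D(t, y):
    # Squared Euclidean distance between target and actual output.
    return np.sum((t - y) ** 2)

# Q = 3 made-up training samples with 2-dimensional outputs.
targets = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
outputs = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.9]])

errors = [D(t, y) for t, y in zip(targets, outputs)]  # E^q for each sample
F = -sum(errors)                                      # fitness = minus total error
print(errors, F)
```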

11/2/05 14

Gradient of Fitness

$$\nabla F = -\nabla \sum_q E^q = -\sum_q \nabla E^q$$

$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D\!\left(\mathbf{t}^q, \mathbf{y}^q\right)
= \sum_j \frac{\partial D\!\left(\mathbf{t}^q, \mathbf{y}^q\right)}{\partial y_j^q}\,\frac{\partial y_j^q}{\partial P_k}
= \frac{d D\!\left(\mathbf{t}^q, \mathbf{y}^q\right)}{d\mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}
= \nabla_{\mathbf{y}^q} D\!\left(\mathbf{t}^q, \mathbf{y}^q\right) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}$$

11/2/05 15

Jacobian Matrix

Define the Jacobian matrix

$$\mathbf{J}^q = \begin{pmatrix} \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m} \end{pmatrix}$$

Note that $\mathbf{J}^q \in \mathbb{R}^{n \times m}$ and $\nabla D\!\left(\mathbf{t}^q, \mathbf{y}^q\right) \in \mathbb{R}^{n \times 1}$.

Since

$$\left(\nabla E^q\right)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k}\,\frac{\partial D\!\left(\mathbf{t}^q, \mathbf{y}^q\right)}{\partial y_j^q},$$

we have

$$\nabla E^q = \left(\mathbf{J}^q\right)^{\mathsf T}\,\nabla D\!\left(\mathbf{t}^q, \mathbf{y}^q\right).$$
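To make the shapes concrete, here is a small sketch (with made-up sizes and random numbers) of the identity $\nabla E^q = (\mathbf{J}^q)^{\mathsf T}\,\nabla D$, assuming the squared Euclidean distance of the next slide for $D$: the Jacobian is $n \times m$, $\nabla D$ has length $n$, and the product is the length-$m$ gradient with respect to the parameters.

```python
import numpy as np

n, m = 3, 5                      # n outputs, m parameters (illustrative sizes)
rng = np.random.default_rng(0)
Jq = rng.standard_normal((n, m)) # Jacobian entries dy_j/dP_k for the qth input
t = rng.standard_normal(n)       # target output
y = rng.standard_normal(n)       # actual output

grad_D = 2.0 * (y - t)           # gradient of squared Euclidean distance w.r.t. y
grad_Eq = Jq.T @ grad_D          # chain rule: gradient of E^q w.r.t. the parameters
print(grad_Eq.shape)             # (5,), i.e. one component per parameter
```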

11/2/05 16

Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i \left(t_i - y_i\right)^2$. Then

$$\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j}
= \frac{\partial}{\partial y_j} \sum_i \left(t_i - y_i\right)^2
= \sum_i \frac{\partial \left(t_i - y_i\right)^2}{\partial y_j}
= \frac{d\left(t_j - y_j\right)^2}{d y_j}
= -2\left(t_j - y_j\right)$$

$$\frac{d D(\mathbf{t}, \mathbf{y})}{d \mathbf{y}} = 2\left(\mathbf{y} - \mathbf{t}\right)$$

11/2/05 17

Gradient of Error on qth Input

$$\frac{\partial E^q}{\partial P_k}
= \frac{d D\!\left(\mathbf{t}^q, \mathbf{y}^q\right)}{d\mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}
= 2\left(\mathbf{y}^q - \mathbf{t}^q\right) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}
= 2\sum_j \left(y_j^q - t_j^q\right)\frac{\partial y_j^q}{\partial P_k}$$

11/2/05 18

Recap

To know how to decrease the differences between actual and desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, each of which says how the $j$th output varies with the $k$th parameter (given the $q$th input).

The Jacobian depends on the specific form of the system; in this case, a feedforward neural network.

$$\dot{\mathbf{P}} = \sum_q \left(\mathbf{J}^q\right)^{\mathsf T}\left(\mathbf{t}^q - \mathbf{y}^q\right)$$


11/2/05 19

Equations

Net input:
$$h_i = \sum_{j=1}^{n} w_{ij}\,s_j \qquad\qquad \mathbf{h} = \mathbf{W}\mathbf{s}$$

Neuron output:
$$s_i = \sigma\!\left(h_i\right) \qquad\qquad \mathbf{s} = \sigma(\mathbf{h})$$
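These two equations are exactly one layer of a forward pass. A minimal sketch, assuming the logistic sigmoid defined later in the slides as the activation function and made-up sizes:

```python
import numpy as np

def sigma(h):
    # Logistic sigmoid, applied elementwise.
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # weights from 3 inputs to 4 neurons (illustrative)
s_in = rng.standard_normal(3)    # outputs of the previous layer

h = W @ s_in                     # net input: h_i = sum_j w_ij s_j
s_out = sigma(h)                 # neuron output: s_i = sigma(h_i)
print(s_out)
```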

11/2/05 20

Multilayer Notation

[Figure: layers $\mathbf{s}^1, \mathbf{s}^2, \ldots, \mathbf{s}^{L-1}, \mathbf{s}^L$ connected in sequence by weight matrices $\mathbf{W}^1, \mathbf{W}^2, \ldots, \mathbf{W}^{L-2}, \mathbf{W}^{L-1}$; the input $\mathbf{x}^q$ enters layer $\mathbf{s}^1$ and layer $\mathbf{s}^L$ produces the output $\mathbf{y}^q$.]

11/2/05 21

Notation

• $L$ layers of neurons, labeled $1, \ldots, L$
• $N_l$ neurons in layer $l$
• $\mathbf{s}^l$ = vector of outputs from the neurons in layer $l$
• input layer: $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
• output layer: $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
• $\mathbf{W}^l$ = weights between layers $l$ and $l+1$
• Problem: find how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ ($l = 1, \ldots, L-1$)

11/2/05 22

Typical Neuron

[Figure: neuron $i$ in layer $l$ receives inputs $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$, forms the net input $h_i^l$, and produces the output $s_i^l$.]

11/2/05 23

Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^{l}}$, starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \ldots, 1$).

11/2/05 24

Delta Values

It is convenient to break the derivatives apart using the chain rule:

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l}\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$.

So $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$.


11/2/05 25

Output-Layer Neuron

[Figure: output-layer neuron $i$ receives $s_1^{L-1}, \ldots, s_j^{L-1}, \ldots, s_N^{L-1}$ through weights $W_{i1}^{L-1}, \ldots, W_{ij}^{L-1}, \ldots, W_{iN}^{L-1}$, forms $h_i^L$, and outputs $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to produce the error $E^q$.]

11/2/05 26

Output-Layer Derivatives (1)

$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L}
= \frac{\partial}{\partial h_i^L}\sum_k \left(s_k^L - t_k^q\right)^2
= \frac{d\left(s_i^L - t_i^q\right)^2}{d h_i^L}
= 2\left(s_i^L - t_i^q\right)\frac{d s_i^L}{d h_i^L}
= 2\left(s_i^L - t_i^q\right)\sigma'\!\left(h_i^L\right)$$

11/2/05 27

Output-Layer Derivatives (2)

$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}}\sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$$

$$\frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}, \qquad \text{where } \delta_i^L = 2\left(s_i^L - t_i^q\right)\sigma'\!\left(h_i^L\right)$$

11/2/05 28

Hidden-Layer Neuron

[Figure: hidden neuron $i$ in layer $l$ receives $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$, forms $h_i^l$ and output $s_i^l$, which feeds the next layer's neurons $s_1^{l+1}, \ldots, s_k^{l+1}, \ldots, s_N^{l+1}$ through weights $W_{1i}^{l}, \ldots, W_{ki}^{l}, \ldots, W_{Ni}^{l}$; every such path contributes to the error $E^q$.]

11/2/05 29

Hidden-Layer Derivatives (1)

Recall that

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l}
= \sum_k \frac{\partial E^q}{\partial h_k^{l+1}}\,\frac{\partial h_k^{l+1}}{\partial h_i^l}
= \sum_k \delta_k^{l+1}\,\frac{\partial h_k^{l+1}}{\partial h_i^l}$$

$$\frac{\partial h_k^{l+1}}{\partial h_i^l}
= \frac{\partial \sum_m W_{km}^{l}\, s_m^{l}}{\partial h_i^l}
= \frac{\partial W_{ki}^{l}\, s_i^{l}}{\partial h_i^l}
= W_{ki}^{l}\,\frac{d\sigma\!\left(h_i^l\right)}{d h_i^l}
= W_{ki}^{l}\,\sigma'\!\left(h_i^l\right)$$

$$\delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^{l}\,\sigma'\!\left(h_i^l\right)
= \sigma'\!\left(h_i^l\right)\sum_k \delta_k^{l+1} W_{ki}^{l}$$

11/2/05 30

Hidden-Layer Derivatives (2)

$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}
= \frac{\partial}{\partial W_{ij}^{l-1}}\sum_k W_{ik}^{l-1} s_k^{l-1}
= \frac{d\, W_{ij}^{l-1} s_j^{l-1}}{d\, W_{ij}^{l-1}}
= s_j^{l-1}$$

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}, \qquad \text{where } \delta_i^l = \sigma'\!\left(h_i^l\right)\sum_k \delta_k^{l+1} W_{ki}^{l}$$


11/2/05 31

Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-h)}$ (logistic sigmoid). Then

$$\begin{aligned}
D_h s &= D_h \left[1 + \exp(-h)\right]^{-1} = -\left[1 + \exp(-h)\right]^{-2} D_h\!\left(1 + e^{-h}\right)\\
&= -\left(1 + e^{-h}\right)^{-2}\left(-e^{-h}\right) = \frac{e^{-h}}{\left(1 + e^{-h}\right)^2}\\
&= \frac{1}{1 + e^{-h}}\cdot\frac{e^{-h}}{1 + e^{-h}} = s\left(\frac{1 + e^{-h}}{1 + e^{-h}} - \frac{1}{1 + e^{-h}}\right)\\
&= s(1 - s)
\end{aligned}$$
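A quick finite-difference check of the identity $\sigma'(h) = s(1-s)$; the test point and step size are arbitrary.

```python
import numpy as np

def sigma(h):
    return 1.0 / (1.0 + np.exp(-h))

h = 0.7                          # arbitrary test point
eps = 1e-6
numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)  # central difference
s = sigma(h)
analytic = s * (1 - s)           # the closed form derived above
print(numeric, analytic)         # the two values agree to roughly 1e-10
```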

11/2/05 32

Summary of Back-Propagation Algorithm

Output layer:
$$\delta_i^L = 2\, s_i^L\left(1 - s_i^L\right)\left(s_i^L - t_i^q\right), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}$$

Hidden layers:
$$\delta_i^l = s_i^l\left(1 - s_i^l\right)\sum_k \delta_k^{l+1} W_{ki}^{l}, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}$$
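Below is a minimal NumPy sketch of one back-propagation pass for a single training pair, using the output-layer and hidden-layer formulas summarized above (logistic units, squared-error distance). The layer sizes, learning rate eta, and data are illustrative assumptions, and the weight step is plain gradient descent on $E^q$.

```python
import numpy as np

def sigma(h):
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]                       # neurons in layers 1..L (illustrative)
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]

x_q = rng.standard_normal(sizes[0])        # input pattern s^1 = x^q (made up)
t_q = rng.standard_normal(sizes[-1])       # target output t^q (made up)
eta = 0.1                                  # assumed learning rate

# Forward pass: s^1 = x^q, then s^{l+1} = sigma(W^l s^l) for each layer.
s = [x_q]
for Wl in W:
    s.append(sigma(Wl @ s[-1]))

# Output layer: delta^L_i = 2 s^L_i (1 - s^L_i) (s^L_i - t^q_i).
delta = 2.0 * s[-1] * (1.0 - s[-1]) * (s[-1] - t_q)

# Work backward through the weight matrices.
for l in range(len(W) - 1, -1, -1):
    grad = np.outer(delta, s[l])           # dE^q/dW^{l-1}_{ij} = delta^l_i s^{l-1}_j
    if l > 0:
        # Hidden layer: delta^l_i = s^l_i (1 - s^l_i) * sum_k delta^{l+1}_k W^l_{ki}.
        delta = s[l] * (1.0 - s[l]) * (W[l].T @ delta)
    W[l] = W[l] - eta * grad               # descend the error for this pair

print([Wl.shape for Wl in W])
```

In online learning this update would be applied after every training pair; in batch learning the grad terms would be accumulated over the whole epoch before the weights change (see the Training Procedures slide below).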

11/2/05 33

Output-Layer Computation

[Figure: computation diagram for an output-layer neuron: the inputs $s_j^{L-1}$ and weights $W_{ij}^{L-1}$ produce $h_i^L$ and $s_i^L = y_i^q$; the target $t_i^q$, the factor $1 - s_i^L$, and the factor 2 combine to form $\delta_i^L$, which multiplies $s_j^{L-1}$ to give the weight change.]

$$\delta_i^L = 2\, s_i^L\left(1 - s_i^L\right)\left(t_i^q - s_i^L\right)$$

$$\Delta W_{ij}^{L-1} = \delta_i^L\, s_j^{L-1}$$

11/2/05 34

Hidden-Layer Computation

[Figure: computation diagram for a hidden neuron: the inputs $s_j^{l-1}$ and weights $W_{ij}^{l-1}$ produce $h_i^l$ and $s_i^l$; the back-propagated deltas $\delta_1^{l+1}, \ldots, \delta_k^{l+1}, \ldots, \delta_N^{l+1}$, weighted by $W_{1i}^{l}, \ldots, W_{ki}^{l}, \ldots, W_{Ni}^{l}$, combine with the factor $1 - s_i^l$ to form $\delta_i^l$, which multiplies $s_j^{l-1}$ to give the weight change.]

$$\delta_i^l = s_i^l\left(1 - s_i^l\right)\sum_k \delta_k^{l+1} W_{ki}^{l}$$

$$\Delta W_{ij}^{l-1} = \delta_i^l\, s_j^{l-1}$$

11/2/05 35

Training Procedures

• Batch learning
– on each epoch (one pass through all the training pairs), the weight changes for all patterns are accumulated
– the weight matrices are updated at the end of the epoch
– gives an accurate computation of the gradient

• Online learning
– the weights are updated after back-propagation of each training pair
– the order of the pairs is usually randomized on each epoch
– gives an approximation of the gradient

• In practice the choice doesn't make much difference (both procedures are sketched below)
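A schematic sketch of the two procedures, assuming a helper backprop_gradients(W, x, t) (a hypothetical name) that returns the list of per-pair weight gradients, e.g. as computed in the back-propagation sketch above.

```python
import random

def train_batch(W, pairs, eta, backprop_gradients):
    # Batch learning: accumulate the gradients over all pairs, update once per epoch.
    totals = None
    for x, t in pairs:
        grads = backprop_gradients(W, x, t)
        totals = grads if totals is None else [a + g for a, g in zip(totals, grads)]
    return [Wl - eta * g for Wl, g in zip(W, totals)]

def train_online(W, pairs, eta, backprop_gradients):
    # Online learning: update after back-propagating each pair, in random order.
    pairs = list(pairs)
    random.shuffle(pairs)
    for x, t in pairs:
        grads = backprop_gradients(W, x, t)
        W = [Wl - eta * g for Wl, g in zip(W, grads)]
    return W
```

Batch learning computes the exact gradient of the summed error before each update, while online learning follows the gradient of one pattern's error surface at a time, as the following slides illustrate.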

11/2/05 36

Summation of Error Surfaces

[Figure: the individual error surfaces $E^1$ and $E^2$ add up to the total error surface $E$.]


11/2/05 37

Gradient Computation in Batch Learning

[Figure: in batch learning the gradient is computed on the total error surface $E = E^1 + E^2$.]

11/2/05 38

Gradient Computation in Online Learning

[Figure: in online learning the gradient is computed on one individual error surface ($E^1$ or $E^2$) at a time, approximating the gradient of the total error $E$.]

11/2/05 39

The Golden Rule of Neural Nets

Neural networks are the second-best way to do everything!