Concept Map for Ch. 3
[Concept map: Feedforward networks are nonlayered or layered; layered networks are single-layer (ALC, perceptron – Ch. 1, Ch. 2) or multilayer. The multilayer perceptron computes y = F(x, W) ≈ f(x) with sigmoid activations, with the weights W viewed in matrix-vector form or as scalars w_ij. Learning by backpropagation (BP): from training data {(x_i, f(x_i)) | i = 1 ~ N}, gradient descent minimizes E(W), producing a new W from the old W using the difference between the desired and the actual output.]
Chapter 3. Multilayer Perceptron
1. MLP Architecture – Extension of the Perceptron to Many Layers and Sigmoidal Activation Functions
– for real-valued mapping/classification
Learning: Discrete training data → Find W* → Continuous F(x, W*) ≈ f(x)
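As a rough sketch of this architecture, the following Python snippet (my own illustration; the 2-3-1 layer sizes, the bias-free weighted sums, and the name `forward` are assumptions, not from the slides) builds a layered feedforward mapping y = F(x, W) from matrix-vector weight layers and a sigmoidal activation:

```python
import numpy as np

def sigmoid(v):
    # logistic activation, one of the sigmoids used in Ch. 3
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, weights):
    """Layered feedforward pass: y = F(x, W).

    weights is a list of matrices, one per layer; each layer computes
    a weighted sum (matrix-vector product) followed by the sigmoid.
    """
    y = x
    for W in weights:
        y = sigmoid(W @ y)
    return y

# Hypothetical 2-3-1 MLP: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
print(forward(np.array([0.5, -0.2]), weights))
```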
Sigmoidal Activation Functions
– Hyperbolic tangent: φ(v) = tanh(v/2) = (1 − e^(−v)) / (1 + e^(−v)), range (−1, 1), with φ' = (1/2)(1 − φ²)
– Logistic: φ(v) = 1 / (1 + e^(−v)), range (0, 1), with φ' = φ(1 − φ) (smaller maximum slope, 1/4)
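A small numerical sketch of these two sigmoids and their derivative identities (assuming the tanh(v/2) and logistic forms reconstructed above):

```python
import math

def tanh_half(v):
    # hyperbolic-tangent sigmoid, range (-1, 1)
    return math.tanh(v / 2.0)          # = (1 - e**-v) / (1 + e**-v)

def tanh_half_deriv(v):
    phi = tanh_half(v)
    return 0.5 * (1.0 - phi ** 2)      # phi' = (1/2)(1 - phi^2)

def logistic(v):
    # logistic sigmoid, range (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def logistic_deriv(v):
    phi = logistic(v)
    return phi * (1.0 - phi)           # phi' = phi(1 - phi), max 1/4 at v = 0

print(tanh_half_deriv(0.0), logistic_deriv(0.0))   # 0.5  0.25
```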
2. Weight Learning Rule – Backpropagation of Error
(1) Training Data {(x_p, d_p)} → Weights (W):
Curve (Data) Fitting (Modeling, Nonlinear Regression)
(2) Mean Squared Error E for a 1-D Function as an Example
E(W) = (1/2) Σ_p E_p² = (1/2) Σ_p (d_p − F(x_p, W))²  : Cost Function
Here (x_p, d_p) are samples of the true function f(x), and F(x, W) is the NN approximating function.
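As a minimal illustration, the cost above could be computed as follows; the callable F standing in for the network output and the toy data are assumptions made for the example:

```python
def mean_squared_error(data, F, W):
    """E(W) = (1/2) * sum over patterns p of (d_p - F(x_p, W))**2."""
    return 0.5 * sum((d_p - F(x_p, W)) ** 2 for x_p, d_p in data)

# Illustrative use: samples of a "true function" f(x) = x**2,
# approximated here by a (trivially linear) F(x, W) = W*x.
data = [(x, x ** 2) for x in (-1.0, 0.0, 1.0, 2.0)]
print(mean_squared_error(data, lambda x, W: W * x, W=1.0))
```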
(3) Gradient Descent Learning
1) Batch:   Δw = −η ∂E/∂w,  E = Σ_p E_p
2) Pattern: Δw = −η ∂E_p/∂w   [local (instantaneous) gradient]

(4) Learning Curve
Iteration = one scan of the training set (epoch)
The learning curve plots E{W(n)} (along with the weight track) against the number of iterations n, starting from n = 0.
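The two modes can be sketched as below; the one-weight model F(x, w) = w·x and the sample data are made up purely to show how the batch and pattern updates differ and how the learning-curve values E(W(n)) are collected:

```python
def grad_Ep(w, pattern):
    # gradient of E_p = (1/2)(d_p - w*x_p)**2 for one pattern (toy model)
    x_p, d_p = pattern
    return -(d_p - w * x_p) * x_p

def train(data, w, eta=0.1, epochs=20, batch=True):
    curve = []                                  # learning curve E(W(n))
    for n in range(epochs):
        if batch:                               # 1) batch: sum gradients, one update per epoch
            w -= eta * sum(grad_Ep(w, p) for p in data)
        else:                                   # 2) pattern: update after every pattern
            for p in data:
                w -= eta * grad_Ep(w, p)        # local (instantaneous) gradient
        curve.append(0.5 * sum((d - w * x) ** 2 for x, d in data))
    return w, curve

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]     # roughly d = 2x
w_batch, curve = train(data, w=0.0)
print(round(w_batch, 3), [round(e, 3) for e in curve[:3]])
```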
(5) Backpropagation Learning Rule
Notation: input x_i feeds hidden unit j through w_ij; unit j feeds output unit k through w_jk; y_i, y_j, y_k are the unit outputs and (d_k − y_k) is the output error.

A. Output Layer Weights
Δ_p w_jk = −η ∂E_p/∂w_jk = η δ_k y_j   [(local error)(local activation)]
where δ_k = −∂E_p/∂sum_k = (d_k − y_k) φ'(sum_k)

B. Inner Layer Weights
Δ_p w_ij = −η ∂E_p/∂w_ij = η δ_j y_i
where δ_j = −∂E_p/∂sum_j = φ'(sum_j) Σ_k δ_k w_jk   [error signal for hidden units]

Features: Locality of Computation, No Centralized Control, 2-Pass (Credit Assignment)
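A hedged sketch of both rules in vectorized form for a single hidden layer of logistic units (the array shapes, the function name `bp_update`, and the learning rate value are my own choices, not the slides'):

```python
import numpy as np

def bp_update(x, d, V, W, eta=0.5):
    """One pattern-mode backprop step for an input -> hidden -> output MLP.

    V: hidden weights (n_hidden x n_in), W: output weights (n_out x n_hidden).
    """
    phi = lambda v: 1.0 / (1.0 + np.exp(-v))                # logistic, phi' = phi(1 - phi)
    # forward pass (function signals)
    y_hidden = phi(V @ x)
    y_out = phi(W @ y_hidden)
    # backward pass (error signals)
    delta_out = (d - y_out) * y_out * (1 - y_out)           # delta_k = (d_k - y_k) phi'(sum_k)
    delta_hid = y_hidden * (1 - y_hidden) * (W.T @ delta_out)  # delta_j = phi'(sum_j) sum_k delta_k w_jk
    # weight changes: (local error)(local activation)
    W += eta * np.outer(delta_out, y_hidden)
    V += eta * np.outer(delta_hid, x)
    return V, W
```

In use, `bp_update` would be called once per training pattern, i.e. the pattern (online) mode of the gradient descent section above.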
Water Flow Analogy to Backpropagation
[Figure: a river flow from input to output. The object is dropped at the input x_p and fetched at the output y_p = F(x_p, W); the many weights w_1, ..., w_l act as the flows, d_p is the desired output, and e_p is the error.]
If the error is very sensitive to a weight change, then change that weight a lot, and vice versa.
→ Gradient Descent, Minimum Disturbance Principle
(6) Computation Example: MLP(2-1-2)
A. Forward Processing – Compute Function Signals
sum_1 = v_1 x_1 + v_2 x_2
h = φ(sum_1)
y_1 = φ(sum_21) = φ(w_1 h)
y_2 = φ(sum_22) = φ(w_2 h)
[Figure: inputs x_1, x_2 connect to the hidden unit h through v_1, v_2; h connects to the outputs y_1, y_2 through w_1, w_2.]
No desired response is needed for hidden nodes. The derivative φ' must exist ⇒ φ = sigmoid [tanh or logistic]. For classification, d = ±0.9 for tanh; d = 0.1, 0.9 for logistic.
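A direct transcription of this forward pass into Python; tanh is chosen as the sigmoid and the weight and input values are made up for illustration:

```python
import math

phi = math.tanh                     # sigmoid activation (tanh variant)

# made-up weights and input for the 2-1-2 network
v1, v2 = 0.3, -0.5                  # input -> hidden
w1, w2 = 0.8, -0.2                  # hidden -> outputs
x1, x2 = 1.0, 0.5

sum1 = v1 * x1 + v2 * x2            # hidden net input
h = phi(sum1)                       # hidden output
sum21, sum22 = w1 * h, w2 * h       # output net inputs
y1, y2 = phi(sum21), phi(sum22)     # function signals
print(h, y1, y2)
```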
B. Backward Processing – Compute Error Signals
e_1 = d_1 − y_1,  e_2 = d_2 − y_2
δ_1 = e_1 φ'(sum_21),  δ_2 = e_2 φ'(sum_22)
Δw_1 = η δ_1 h = η e_1 φ'(sum_21) h
Δw_2 = η δ_2 h = η e_2 φ'(sum_22) h
δ_3 = φ'(sum_1)[δ_1 w_1 + δ_2 w_2] = φ'(sum_1)[e_1 φ'(sum_21) w_1 + e_2 φ'(sum_22) w_2]
Δv_1 = η δ_3 x_1,  Δv_2 = η δ_3 x_2
Each φ'(sum) is evaluated from values that have already been computed in the forward processing (e.g., for the logistic, φ' = φ(1 − φ)).
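Continuing with the same made-up 2-1-2 values, a sketch of the backward pass; note that for φ = tanh, φ'(sum) = 1 − φ(sum)², so the derivatives reuse quantities already available from the forward processing:

```python
import math

phi = math.tanh
eta = 0.1

# same made-up weights and input as the forward sketch above
v1, v2, w1, w2 = 0.3, -0.5, 0.8, -0.2
x1, x2 = 1.0, 0.5
d1, d2 = 0.9, -0.9                   # desired outputs (tanh classification targets)

# forward pass (function signals)
h = phi(v1 * x1 + v2 * x2)
y1, y2 = phi(w1 * h), phi(w2 * h)

# backward pass: error signals
e1, e2 = d1 - y1, d2 - y2
dphi = lambda out: 1.0 - out * out   # phi'(sum) = 1 - tanh(sum)**2, from forward values
delta1, delta2 = e1 * dphi(y1), e2 * dphi(y2)          # output-layer deltas
delta_h = dphi(h) * (delta1 * w1 + delta2 * w2)        # hidden-unit delta (backpropagated)

# weight changes
dw1, dw2 = eta * delta1 * h, eta * delta2 * h          # hidden -> output weights
dv1, dv2 = eta * delta_h * x1, eta * delta_h * x2      # input -> hidden weights
print(round(dw1, 4), round(dw2, 4), round(dv1, 4), round(dv2, 4))
```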
If we knew f(x,y), it would be a lot faster to use it to calculate the output than to use the
NN.
Student Questions:
Does the output error become more uncertain for a complex multilayer network than for a simple single-layer one?
Should we use only up to 3 layers?
Why can oscillation occur in the learning curve?
Do we use the old weights for calculating the error signal δ?
What does ANN mean?
Considering the equation for the weight change, which makes more sense: the error gradient or the weight gradient?
What becomes the error signal for training the weights in forward mode?