Source: people.sabanciuniv.edu/berrin/cs512/lectures/7-nn3-backprop.ppt.pdf
Artificial Neural Networks MultiLayer Perceptrons
Backpropagation Part 3/3
Berrin Yanikoglu
Capabilities of Multilayer Perceptrons
Multilayer Perceptron
A multilayer perceptron has one or more hidden layers, so called because their outputs are not observed from outside the network.
Multilayer Perceptron
• Each layer may have a different number of nodes and a different activation function.
• Commonly, the same activation function is used within one layer.
• Typically,
– a sigmoid/tanh activation function is used in the hidden units, and
– a sigmoid/tanh or linear activation function is used in the output units, depending on the problem (classification or function approximation).
• In feedforward networks, activations are passed only from one layer to the next.
Backpropagation
The capabilities of multilayer NNs were known, but a practical learning algorithm was missing:
• a learning algorithm was introduced by Werbos (1974);
• it was made famous by Rumelhart and McClelland (mid-1980s, the PDP book);
• it started massive research in the area.
XOR problem
• Learning Boolean functions: a 1/0 output can be seen as a 2-class classification problem.
• XOR can be solved by a 1-hidden-layer network.
XOR problem
One solution with hard-limiting units (each weight vector given as [w1 w2 bias]):

  W1,1 = [ 1  1 −0.5]   (hidden node 1, implements boundary 1)
  W1,2 = [−1 −1  1.5]   (hidden node 2, implements boundary 2)
  W2   = [ 1  1 −1.5]   (output node)

Notice how each hidden node implements a decision boundary and the output node combines (ANDs) their results.
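The slide's weights can be checked with a tiny script; a minimal sketch with hard-limiting (step) units (the code layout and names are my own, not from the slide):

```python
import numpy as np

def step(net):
    # hard-limiting unit: fires (1) if net > 0, else 0
    return (np.asarray(net) > 0).astype(int)

# weights from the slide, each row in the form [w1, w2, bias]
W1 = np.array([[ 1.0,  1.0, -0.5],   # hidden node 1
               [-1.0, -1.0,  1.5]])  # hidden node 2
W2 = np.array([ 1.0,  1.0, -1.5])    # output node

def xor_net(x1, x2):
    x = np.array([x1, x2, 1.0])        # inputs plus a bias input of 1
    h = step(W1 @ x)                   # hidden-layer outputs
    o = step(np.append(h, 1.0) @ W2)   # output node ANDs the hidden outputs
    return int(o)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

Hidden node 1 fires when x1 + x2 > 0.5 (OR), hidden node 2 when x1 + x2 < 1.5 (NAND); ANDing them gives XOR.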
Capabilities (Hardlimiting nodes)
Single layer – Hyperplane boundaries
1-hidden layer – Can form any convex region (possibly unbounded)
2-hidden layers – Arbitrarily complex decision regions
Capabilities: Decision Regions
Capabilities
Thm. (Cybenko 1989; Hornik et al. 1989): Every bounded continuous function can be approximated arbitrarily accurately by two layers of weights (one hidden layer) and sigmoidal units. All other functions can be learned by 2-hidden-layer networks.
– Discontinuities can be tolerated in theory for most real-life problems. Also, functions without compact support can be learned under some conditions.
– The proof is based on Kolmogorov's theorem.

(Figure: a function with a discontinuity, and a continuous function with a discontinuous first derivative.)
Performance Learning
A learning paradigm where the network adjusts its parameters (weights and biases) so as to optimize its “performance”
• Need to define a performance index – e.g. mean square error E = (1/N) Σᵢ₌₁ᴺ (tᵢ − oᵢ)², where (tᵢ − oᵢ) is the difference between the target and the output for pattern i.
• Search the parameter space to minimize the performance index with respect to the parameters
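For concreteness, the mean square error can be computed directly from the targets and outputs; a small sketch (the target and output values are made up for illustration):

```python
import numpy as np

# Mean square error over N patterns: E = (1/N) * sum_i (t_i - o_i)^2
t = np.array([1.0, 0.0, 1.0, 1.0])   # targets (made-up values)
o = np.array([0.9, 0.2, 0.8, 0.6])   # network outputs (made-up values)

mse = np.mean((t - o) ** 2)
print(mse)  # ≈ 0.0625
```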
Performance Optimization
Iterative minimization techniques are a form of performance learning:
• Define E(·) as the performance index.
• Starting with an initial guess w(0), find w(n+1) at each iteration such that

  E(w(n+1)) < E(w(n))

• In particular, we will see that gradient descent is an iterative (error) minimization technique.
Basic Optimization Algorithm
Start with an initial guess w₀ and update the guess at each stage by moving along a search direction:

  wₖ₊₁ = wₖ + αₖ pₖ,   or equivalently   Δwₖ = wₖ₊₁ − wₖ = αₖ pₖ

A new state in the search involves deciding on a search direction and the size of the step to take in that direction:
  pₖ – search direction
  αₖ – learning rate
Gradient Descent
Also known as Steepest Descent
…
Performance Optimization
Iterative minimization techniques: Gradient Descent
– Successive adjustments to w are made in the direction of steepest descent (the direction opposite to the gradient vector):

  w(n+1) = w(n) − η ∇E(w(n))
Performance Optimization: Iterative Techniques Summary - ADVANCED
Choose the next step so that the function decreases:

  F(xₖ₊₁) < F(xₖ)

For small changes in x we can approximate F(x) using the Taylor series expansion:

  F(xₖ₊₁) = F(xₖ + Δxₖ) ≈ F(xₖ) + gₖᵀ Δxₖ,   where gₖ ≡ ∇F(x) evaluated at x = xₖ

If we want the function to decrease, we must choose pₖ such that:

  gₖᵀ Δxₖ = αₖ gₖᵀ pₖ < 0

We can maximize the decrease by choosing pₖ = −gₖ, which gives:

  xₖ₊₁ = xₖ − αₖ gₖ
Steepest Descent
Example

  F(x) = x₁² + 2x₁x₂ + 2x₂² + x₁,   x₀ = [0.5, 0.5]ᵀ

  ∇F(x) = [∂F/∂x₁, ∂F/∂x₂]ᵀ = [2x₁ + 2x₂ + 1, 2x₁ + 4x₂]ᵀ

  g₀ = ∇F(x₀) = [3, 3]ᵀ

With α = 0.1:

  x₁ = x₀ − α g₀ = [0.5, 0.5]ᵀ − 0.1 [3, 3]ᵀ = [0.2, 0.2]ᵀ
  x₂ = x₁ − α g₁ = [0.2, 0.2]ᵀ − 0.1 [1.8, 1.2]ᵀ = [0.02, 0.08]ᵀ
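These two iterations can be checked numerically; a minimal sketch of the same steepest-descent steps:

```python
import numpy as np

# F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1, as in the example above
def grad(x):
    x1, x2 = x
    return np.array([2*x1 + 2*x2 + 1.0, 2*x1 + 4*x2])

x = np.array([0.5, 0.5])
alpha = 0.1
for k in (1, 2):
    x = x - alpha * grad(x)   # x_{k+1} = x_k - alpha * g_k
    print(f"x{k} =", x)
# x1 = [0.2 0.2], x2 = [0.02 0.08], matching the example
```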
Two simple error surfaces (for 2 weights)
In (a), as you move parallel to one of the axes, there is no change in the error.
In (b), moving diagonally, rather than parallel to the axes, brings the biggest change.
Gradient Descent: illustration
Gradient: ∇E[w] = [∂E/∂w₀, …, ∂E/∂wₙ]

Weight update: Δw = −η ∇E[w], moving from (w₁, w₂) to (w₁ + Δw₁, w₂ + Δw₂).
Gradient Vector
  ∇F(x) = [∂F/∂x₁, ∂F/∂x₂, …, ∂F/∂xₙ]ᵀ

Each dimension of the gradient vector is the partial derivative of the function with respect to one of the dimensions.
Gradient Descent for the Delta Rule for Adaline (Linear Activation)
(Figure: a linear unit with weights w0 … wn producing output o.)
Stochastic Gradient Descent
Approximate Gradient Descent (Stochastic Backpropagation)
Normally, in gradient descent, we would need to compute how the error over all input samples (the true gradient) changes with respect to a small change in a given weight.

But the common form of the gradient descent algorithm takes one input pattern, computes the error of the network on that pattern only, and updates the weights using only that information.
– Notice that the new weights may not be better for all patterns, but we expect that if we take small steps, the updates will average out and approximate the true gradient.
Stochastic Approximation to Steepest Descent
Instead of waiting until all examples have been observed before updating the weights, we update after every example:

  Δwᵢ = η (t − o) xᵢ

Remarks:
• Speeds up learning significantly when data sets are large.
• Standard (batch) gradient descent can be used with a larger step size.
• When there are multiple local minima, the stochastic approximation to gradient descent may avoid getting stuck in a local minimum.
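The per-example delta rule can be sketched for a linear (Adaline) unit; the toy data, true coefficients, and learning rate below are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a linear unit: targets generated by t = 2*x1 - 1*x2 + 0.5
# (hypothetical coefficients, chosen just for this sketch)
X = rng.uniform(-1, 1, size=(100, 2))
X = np.hstack([X, np.ones((100, 1))])   # append a bias input of 1
w_true = np.array([2.0, -1.0, 0.5])
t = X @ w_true

w = np.zeros(3)
eta = 0.1
for epoch in range(50):
    for x, target in zip(X, t):
        o = w @ x                     # linear activation
        w += eta * (target - o) * x   # delta rule: Δw_i = η (t − o) x_i

print(w)  # close to [2, -1, 0.5]
```

Each update uses only one pattern's error, yet the weights converge toward the generating coefficients.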
Gradient Descent Backpropagation Algorithm
Derivation for General Activation Functions
Transfer Function Derivatives
Sigmoid:

  f′(n) = d/dn [1/(1 + e⁻ⁿ)] = e⁻ⁿ/(1 + e⁻ⁿ)² = (1 − 1/(1 + e⁻ⁿ)) · (1/(1 + e⁻ⁿ)) = (1 − a) a,   where a = f(n)

Linear:

  f′(n) = d/dn (n) = 1

Note: the saturation of sigmoid nodes, and the efficiency of computing the derivative directly from the activation a.
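The identity f′(n) = (1 − a) a can be verified against a central finite difference; a quick sketch:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

n = np.linspace(-4, 4, 9)
a = sigmoid(n)

analytic = (1 - a) * a   # derivative computed from the activation alone
numeric = (sigmoid(n + 1e-6) - sigmoid(n - 1e-6)) / 2e-6  # central difference

print(np.max(np.abs(analytic - numeric)))  # tiny: the two agree
```

This is why sigmoid derivatives are cheap during backpropagation: the forward-pass activation a is all that is needed.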
Stochastic Backpropagation
To calculate the partial derivative of Ep (the error on pattern p) with respect to a given weight wji, we have to consider whether wji feeds an output node or a hidden node.

If wji is an output node weight, the situation is simpler:

  dEp/dwji = dEp/doj × doj/dnetj × dnetj/dwji
           = −(tj − oj) × f′(netj) × oi

where

  netj = Σᵢ oᵢ wji,   oj = f(netj),   Ep = (tp − op)²

(With Ep defined as above the derivative carries a factor of 2; it is often absorbed into the learning rate, or Ep is defined with a factor of ½.)

Note that oi is the input to node j.
The situation is more complex with a hidden node (you don't need to follow the full derivation), because basically:
• while we know what the output of an output node should be,
• we don't know what the output of a hidden node should be.
Backpropagation – Hidden nodes
If wji is a hidden node weight:

  dEp/dwji = dEp/doj × doj/dnetj × dnetj/dwji
           = dEp/doj × f′(netj) × oi

where

  netj = Σᵢ oᵢ wji,   oj = f(netj),   Ep = (tp − op)²

(Node j feeds the output nodes k, k′, … through weights wkj.)
Backpropagation – Hidden nodes
If wji is a hidden node weight:

  dEp/dwji = dEp/doj × doj/dnetj × dnetj/dwji
           = dEp/doj × f′(netj) × oi

Note that as j is a hidden node, we do not know its target. Hence, dEp/doj can only be calculated through j's contribution to the derivative of Ep w.r.t. netk at the output nodes k that j feeds through weights wkj:

  dEp/doj = Σₖ wkj × dEp/dnetk
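Putting the output-node and hidden-node rules together, here is a minimal stochastic-backprop sketch for a small network with sigmoid hidden units and a linear output (the network shape, data values, and variable names are my own assumptions; biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

W1 = rng.normal(size=(3, 2))   # hidden weights (2 inputs -> 3 hidden nodes)
W2 = rng.normal(size=3)        # output weights (3 hidden -> 1 linear output)

def backprop(x, t):
    # forward pass
    net1 = W1 @ x
    o1 = sigmoid(net1)
    o = W2 @ o1                          # linear output, Ep = (t - o)^2
    # backward pass
    dE_dnet_out = -2.0 * (t - o)         # output node (f' = 1 for linear)
    dE_dW2 = dE_dnet_out * o1            # output-node weight gradients
    dE_do1 = W2 * dE_dnet_out            # hidden: dEp/doj = Σk wkj dEp/dnetk
    dE_dnet1 = dE_do1 * o1 * (1 - o1)    # times f'(netj) = (1 - a) a
    dE_dW1 = np.outer(dE_dnet1, x)       # hidden-node weight gradients
    return dE_dW1, dE_dW2

x, t = np.array([0.5, -0.3]), 1.0
dW1, dW2 = backprop(x, t)
```

A subsequent weight update would be W1 -= eta * dW1 and W2 -= eta * dW2, exactly the stochastic (per-pattern) scheme described above.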
Vanishing Gradient Problem
The use of the chain rule in computing the gradient (with sigmoid or tanh activations) causes the gradient to decrease exponentially as we move from the output towards the first layers.

As a result, the early layers learn much more slowly.
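The exponential shrinkage is easy to see in a back-of-the-envelope sketch: each sigmoid layer multiplies the backpropagated gradient by f′(net) = a(1 − a) ≤ 0.25 (the weight factors are ignored here for simplicity, an assumption of this sketch):

```python
# Best case for a sigmoid: a = 0.5, where f'(net) = a * (1 - a) peaks at 0.25.
grad = 1.0
a = 0.5
for layer in range(10):
    grad *= a * (1 - a)   # multiply by at most 0.25 per layer
print(grad)               # 0.25 ** 10 ≈ 9.5e-7
```

After only ten layers the gradient has shrunk by roughly six orders of magnitude, even in this best case.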
We will cover the rest briefly:
1) Showing what a single neuron (or, in the example, three of them) can do in a function approximation setting.
2) Showing networks of different complexity solving the same problem (approximating a sinusoidal function).
Function Approximation & Network Capabilities
Function Approximation
Neural networks are intrinsically function approximators:
– we can train a NN to map real-valued vectors to real-valued vectors.

The function approximation capabilities of a simple network, in response to its parameters (weights and biases), are illustrated in the next slides.
Function Approximation: Example
Hidden layer: f¹(n) = 1/(1 + e⁻ⁿ)   (log-sigmoid)
Output layer: f²(n) = n             (linear)

Nominal parameter values (superscripts are layer numbers):

  w¹₁,₁ = 10,  w¹₂,₁ = 10,  b¹₁ = −10,  b¹₂ = 10
  w²₁,₁ = 1,   w²₁,₂ = 1,   b² = 0
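With these values, the nominal response of the 1-2-1 network (one input, two log-sigmoid hidden nodes, one linear output) can be traced; a sketch:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def net_output(p):
    # hidden layer: nets are 10*p - 10 (node 1) and 10*p + 10 (node 2)
    a1 = sigmoid(np.array([10.0 * p - 10.0,
                           10.0 * p + 10.0]))
    # linear output: weights 1 and 1, bias 0
    return a1[0] + a1[1]

for p in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(p, round(net_output(p), 3))
# the response rises in two sigmoid steps from about 0 (at p = -2) to about 2 (at p = 2)
```

Each hidden node contributes one sigmoid step (centered at p = 1 and p = −1 respectively), and the linear output sums them.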
Nominal Response
(Plot: the nominal network response over −2 ≤ p ≤ 2.)
Parameter Variations
(Plots: the network response over −2 ≤ p ≤ 2 as individual parameters are varied; the labeled range is −1 ≤ b² ≤ 1.)

What would be the effect of varying the bias of the output neuron?
Parameter Variations
(Plots: the network response over −2 ≤ p ≤ 2 as each parameter varies over its range:)

  −1 ≤ w²₁,₁ ≤ 1
  −1 ≤ w²₁,₂ ≤ 1
  0 ≤ b¹₂ ≤ 20
  −1 ≤ b² ≤ 1
Network Complexity
Choice of Architecture

  g(p) = 1 + sin(iπ/4 · p)

1-3-1 network (1 input node, 3 hidden nodes, 1 output node)

(Plots: the 1-3-1 network's fit to g over −2 ≤ p ≤ 2 for i = 1, 2, 4, 8.)
Choice of Network Architecture
  g(p) = 1 + sin(6π/4 · p)

(Plots: fits to g over −2 ≤ p ≤ 2 by 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks.)

The residual error decreases as O(1/M), where M is the number of hidden units.
Convergence in Time

  g(p) = 1 + sin(πp)

(Plots: the network's approximation to g over −2 ≤ p ≤ 2 at successive iterations, labeled 0 through 5, for two training runs.)
Generalization
Training set: {p₁, t₁}, {p₂, t₂}, …, {p_Q, t_Q}

  g(p) = 1 + sin(π/4 · p),   p = −2, −1.6, −1.2, …, 1.6, 2

(Plots: fits of a 1-2-1 and a 1-9-1 network to these training points.)
We will see later how to use complex models (e.g. a higher-order polynomial or an MLP with a large number of nodes) together with regularization to control model complexity:
• e.g. keeping the weights small, so that even if we use a complex model, the weights are kept in check and the overall model is not prone to overfitting.
Next: Issues and Variations on
Backpropagation