
Machine Learning: Multi Layer Perceptrons

Prof. Dr. Martin Riedmiller
Albert-Ludwigs-University Freiburg
AG Maschinelles Lernen

Outline

◮ multi layer perceptrons (MLP)
◮ learning MLPs
◮ function minimization: gradient descent & related methods

Neural networks

◮ single neurons are not able to solve complex tasks (e.g. they are restricted to linear calculations)
◮ creating networks by hand is too expensive; we want to learn from data
◮ nonlinear features also have to be generated by hand; tessellations become intractable for larger dimensions
◮ we want to have a generic model that can adapt to some training data
◮ basic idea: multi layer perceptron (Werbos 1974; Rumelhart, McClelland, Hinton 1986), also named feed-forward networks

Neurons in a multi layer perceptron

◮ standard perceptrons calculate a discontinuous function:

  $\vec{x} \mapsto f_{step}(w_0 + \langle \vec{w}, \vec{x} \rangle)$

◮ due to technical reasons, neurons in MLPs calculate a smoothed variant of this:

  $\vec{x} \mapsto f_{log}(w_0 + \langle \vec{w}, \vec{x} \rangle)$

  with

  $f_{log}(z) = \frac{1}{1 + e^{-z}}$

  $f_{log}$ is called the logistic function

  [plot: logistic function $f_{log}(z)$ for $z \in [-8, 8]$, values between 0 and 1]

◮ properties:
  • monotonically increasing
  • $\lim_{z \to \infty} f_{log}(z) = 1$
  • $\lim_{z \to -\infty} f_{log}(z) = 0$
  • $f_{log}(z) = 1 - f_{log}(-z)$
  • continuous, differentiable

Multi layer perceptrons

◮ A multi layer perceptron (MLP) is a finite acyclic graph. The nodes are neurons with logistic activation.

  [diagram: inputs $x_1, x_2, \ldots, x_n$ feeding into several layers of Σ-neurons]

◮ neurons of the $i$-th layer serve as input features for neurons of the $(i+1)$-th layer
◮ very complex functions can be calculated by combining many neurons

Multi layer perceptrons (cont.)

◮ multi layer perceptrons, more formally:
  An MLP is a finite directed acyclic graph.
  • nodes that are not the target of any connection are called input neurons. An MLP that should be applied to input patterns of dimension n must have n input neurons, one for each dimension. Input neurons are typically enumerated as neuron 1, neuron 2, neuron 3, ...
  • nodes that are not the source of any connection are called output neurons. An MLP can have more than one output neuron. The number of output neurons depends on the way the target values (desired values) of the training patterns are described.
  • all nodes that are neither input neurons nor output neurons are called hidden neurons.
  • since the graph is acyclic, all neurons can be organized in layers, with the set of input neurons being the first layer.

Multi layer perceptrons (cont.)

• connections that hop over several layers are called shortcuts
• most MLPs have a connection structure with connections from all neurons of one layer to all neurons of the next layer, without shortcuts
• all neurons are enumerated
• Succ(i) is the set of all neurons j for which a connection i → j exists
• Pred(i) is the set of all neurons j for which a connection j → i exists
• all connections are weighted with a real number. The weight of the connection i → j is named $w_{ji}$
• all hidden and output neurons have a bias weight. The bias weight of neuron i is named $w_{i0}$

Multi layer perceptrons (cont.)

◮ variables for calculation:
  • hidden and output neurons have some variable $net_i$ ("network input")
  • all neurons have some variable $a_i$ ("activation"/"output")

◮ applying a pattern $\vec{x} = (x_1, \ldots, x_n)^T$ to the MLP:
  • for each input neuron the respective element of the input pattern is presented, i.e. $a_i \leftarrow x_i$
  • for all hidden and output neurons $i$: after the values $a_j$ have been calculated for all predecessors $j \in Pred(i)$, calculate $net_i$ and $a_i$ as:

    $net_i \leftarrow w_{i0} + \sum_{j \in Pred(i)} w_{ij} a_j$
    $a_i \leftarrow f_{log}(net_i)$

  • the network output is given by the $a_i$ of the output neurons

Multi layer perceptrons (cont.)

◮ illustration:

  [diagram: a small MLP with two input neurons (1, 2), several hidden Σ-neurons, and two output neurons]

  • apply pattern $\vec{x} = (x_1, x_2)^T$
  • calculate activation of input neurons: $a_i \leftarrow x_i$
  • propagate forward the activations: step by step
  • read the network output from both output neurons

Multi layer perceptrons (cont.)

◮ algorithm (forward pass):

  Require: pattern $\vec{x}$, MLP, enumeration of all neurons in topological order
  Ensure: calculate output of MLP
   1: for all input neurons i do
   2:   set $a_i \leftarrow x_i$
   3: end for
   4: for all hidden and output neurons i in topological order do
   5:   set $net_i \leftarrow w_{i0} + \sum_{j \in Pred(i)} w_{ij} a_j$
   6:   set $a_i \leftarrow f_{log}(net_i)$
   7: end for
   8: for all output neurons i do
   9:   assemble $a_i$ in output vector $\vec{y}$
  10: end for
  11: return $\vec{y}$
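A minimal sketch of this forward pass in Python, assuming a fully connected layered MLP without shortcuts, stored as per-layer weight matrices and bias vectors (the names `forward`, `weights`, `biases` are illustrative, not from the slides):

```python
import numpy as np

def f_log(z):
    # logistic activation function
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass through a layered MLP without shortcut connections.

    weights: list of matrices, weights[l][i, j] = weight from neuron j of
             layer l to neuron i of layer l+1 (w_ij in the slides)
    biases:  list of vectors, biases[l][i] = bias weight w_i0 of neuron i
             in layer l+1
    """
    a = np.asarray(x, dtype=float)      # activations of the input layer
    for W, b in zip(weights, biases):
        net = b + W @ a                 # net_i = w_i0 + sum_j w_ij * a_j
        a = f_log(net)                  # a_i = f_log(net_i)
    return a                            # activations of the output neurons

# usage: a 2-3-2 network with random weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
biases = [rng.normal(size=3), rng.normal(size=2)]
print(forward([0.5, -1.0], weights, biases))
```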

Multi layer perceptrons (cont.)

◮ variant:
  Neurons with logistic activation can only output values between 0 and 1. To enable output in a wider range of real numbers, variants are used:
  • neurons with tanh activation function:

    $a_i = \tanh(net_i) = \frac{e^{net_i} - e^{-net_i}}{e^{net_i} + e^{-net_i}}$

  • neurons with linear activation:

    $a_i = net_i$

  [plot: $f_{log}(2x)$, $\tanh(x)$ and linear activation for $x \in [-3, 3]$]

  • the calculation of the network output is similar to the case of logistic activation, except that the relationship between $net_i$ and $a_i$ is different.
  • the activation function is a local property of each neuron.

Multi layer perceptrons (cont.)

◮ typical network topologies:
  • for regression: output neurons with linear activation
  • for classification: output neurons with logistic/tanh activation
  • all hidden neurons with logistic activation
  • layered layout:
    input layer – first hidden layer – second hidden layer – ... – output layer
    with connections from each neuron in layer $i$ to each neuron in layer $i+1$, no shortcut connections

◮ Lemma:
  Any boolean function can be realized by an MLP with one hidden layer. Any bounded continuous function can be approximated with arbitrary precision by an MLP with one hidden layer.
  Proof: was given by Cybenko (1989). Idea: partition the input space into small cells

MLP Training

◮ given training data: $D = \{(\vec{x}^{(1)}, \vec{d}^{(1)}), \ldots, (\vec{x}^{(p)}, \vec{d}^{(p)})\}$ where $\vec{d}^{(i)}$ is the desired output (real number for regression, class label 0 or 1 for classification)
◮ given topology of an MLP
◮ task: adapt weights of the MLP

MLP Training (cont.)

◮ idea: minimize an error term

  $E(\vec{w}; D) = \frac{1}{2} \sum_{i=1}^{p} \|y(\vec{x}^{(i)}; \vec{w}) - \vec{d}^{(i)}\|^2$

  with $y(\vec{x}; \vec{w})$: network output for input pattern $\vec{x}$ and weight vector $\vec{w}$,
  $\|\vec{u}\|^2$ the squared length of vector $\vec{u}$: $\|\vec{u}\|^2 = \sum_{j=1}^{\dim(\vec{u})} (u_j)^2$

◮ learning means: calculating weights for which the error becomes minimal

  $\text{minimize}_{\vec{w}} \; E(\vec{w}; D)$

◮ interpret $E$ just as a mathematical function depending on $\vec{w}$ and forget about its semantics; then we are faced with a problem of mathematical optimization
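As a quick illustration, the error term can be written down directly in Python; this sketch reuses the illustrative `forward` function from the forward-pass example above and represents the training data D as a list of (x, d) pairs (all names are assumptions, not from the slides):

```python
import numpy as np

def error(weights, biases, dataset):
    """E(w; D) = 1/2 * sum_i ||y(x^(i); w) - d^(i)||^2 (sum-of-squares error)."""
    total = 0.0
    for x, d in dataset:
        y = forward(x, weights, biases)              # network output y(x; w)
        total += 0.5 * float(np.sum((y - np.asarray(d)) ** 2))
    return total
```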

Optimization theory

◮ discusses mathematical problems of the form:

  $\text{minimize}_{\vec{u}} \; f(\vec{u})$

  $\vec{u}$ can be any vector of suitable size. But which one solves this task and how can we calculate it?

◮ some simplifications:
  here we consider only functions $f$ which are continuous and differentiable

  [sketches: a non-continuous function (disrupted), a continuous but non-differentiable function (folded), a differentiable function (smooth)]

Optimization theory (cont.)

◮ A global minimum $\vec{u}^*$ is a point so that:

  $f(\vec{u}^*) \leq f(\vec{u})$

  for all $\vec{u}$.

◮ A local minimum $\vec{u}^+$ is a point so that there exists $r > 0$ with

  $f(\vec{u}^+) \leq f(\vec{u})$

  for all points $\vec{u}$ with $\|\vec{u} - \vec{u}^+\| < r$

  [sketch: a function with several local minima, one of which is the global minimum]

Optimization theory (cont.)

◮ analytical way to find a minimum:
  For a local minimum $\vec{u}^+$, the gradient of $f$ becomes zero:

  $\frac{\partial f}{\partial u_i}(\vec{u}^+) = 0$ for all $i$

  Hence, calculating all partial derivatives and looking for zeros is a good idea (c.f. linear regression)

  but: there are also other points for which $\frac{\partial f}{\partial u_i} = 0$, and resolving these equations is often not possible

Optimization theory (cont.)

◮ numerical way to find a minimum, searching:
  assume we are starting at a point $\vec{u}$.
  Which is the best direction to search for a point $\vec{v}$ with $f(\vec{v}) < f(\vec{u})$?
  • slope is negative (descending): go right!
  • slope is positive (ascending): go left!

  Which is the best stepwidth?
  • slope is small: small step!
  • slope is large: large step!

◮ general principle:

  $v_i \leftarrow u_i - \epsilon \frac{\partial f}{\partial u_i}$

  $\epsilon > 0$ is called the learning rate

Gradient descent

◮ Gradient descent approach:

  Require: mathematical function $f$, learning rate $\epsilon > 0$
  Ensure: returned vector is close to a local minimum of $f$
  1: choose an initial point $\vec{u}$
  2: while $\|\mathrm{grad} f(\vec{u})\|$ not close to 0 do
  3:   $\vec{u} \leftarrow \vec{u} - \epsilon \cdot \mathrm{grad} f(\vec{u})$
  4: end while
  5: return $\vec{u}$

◮ open questions:
  • how to choose the initial $\vec{u}$
  • how to choose $\epsilon$
  • does this algorithm really converge?
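A minimal gradient descent sketch in Python, assuming the gradient is available as a function (`f_grad`, `gradient_descent` and the stopping parameters are illustrative names, not from the slides):

```python
import numpy as np

def gradient_descent(f_grad, u0, eps=0.1, tol=1e-6, max_iter=10_000):
    """Plain gradient descent: u <- u - eps * grad f(u)."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:     # ||grad f(u)|| close to 0
            break
        u = u - eps * g
    return u

# usage: minimize f(u) = (u0 - 1)^2 + 2*(u1 + 3)^2, grad f = (2(u0-1), 4(u1+3))
grad = lambda u: np.array([2 * (u[0] - 1), 4 * (u[1] + 3)])
print(gradient_descent(grad, [0.0, 0.0]))   # approximately [1, -3]
```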

Gradient descent (cont.)

◮ choice of $\epsilon$:
  1. case small $\epsilon$: convergence
  2. case very small $\epsilon$: convergence, but it may take very long
  3. case medium size $\epsilon$: convergence
  4. case large $\epsilon$: divergence

◮ the choice of $\epsilon$ is crucial:
  • only small $\epsilon$ guarantees convergence
  • for small $\epsilon$, learning may take very long
  • it depends on the scaling of $f$: an optimal learning rate for $f$ may lead to divergence for $2 \cdot f$

Gradient descent (cont.)

◮ some more problems with gradient descent:
  • flat spots and steep valleys:
    need a larger $\epsilon$ in $\vec{u}$ to jump over the uninteresting flat area, but need a smaller $\epsilon$ in $\vec{v}$ to meet the minimum
  • zig-zagging:
    in higher dimensions: $\epsilon$ is not appropriate for all dimensions

Gradient descent (cont.)

◮ conclusion:
  pure gradient descent is a nice theoretical framework but of limited power in practice. Finding the right $\epsilon$ is annoying. Approaching the minimum is time consuming.

◮ heuristics to overcome problems of gradient descent:
  • gradient descent with momentum
  • individual learning rates for each dimension
  • adaptive learning rates
  • decoupling steplength from partial derivatives

Gradient descent (cont.)

◮ gradient descent with momentum
  idea: make updates smoother by carrying forward the latest update.

  1: choose an initial point $\vec{u}$
  2: set $\vec{\Delta} \leftarrow \vec{0}$ (stepwidth)
  3: while $\|\mathrm{grad} f(\vec{u})\|$ not close to 0 do
  4:   $\vec{\Delta} \leftarrow -\epsilon \cdot \mathrm{grad} f(\vec{u}) + \mu \vec{\Delta}$
  5:   $\vec{u} \leftarrow \vec{u} + \vec{\Delta}$
  6: end while
  7: return $\vec{u}$

  $0 \leq \mu < 1$ is an additional parameter that has to be adjusted by hand. For $\mu = 0$ we get vanilla gradient descent.
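A sketch of the momentum variant in Python, directly following the update rule above (all names are illustrative):

```python
import numpy as np

def gradient_descent_momentum(f_grad, u0, eps=0.1, mu=0.9,
                              tol=1e-6, max_iter=10_000):
    """Gradient descent with momentum: delta <- -eps * grad f(u) + mu * delta."""
    u = np.asarray(u0, dtype=float)
    delta = np.zeros_like(u)            # stepwidth, initially zero
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:
            break
        delta = -eps * g + mu * delta   # carry forward the latest update
        u = u + delta
    return u
```

For `mu = 0` this reduces to the plain gradient descent sketch above.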

Gradient descent (cont.)

◮ advantages of momentum:
  • smoothes zig-zagging
  • accelerates learning at flat spots
  • slows down when signs of partial derivatives change

◮ disadvantages:
  • additional parameter $\mu$
  • may cause additional zig-zagging

  [plots: trajectories of vanilla gradient descent, gradient descent with momentum, and gradient descent with strong momentum]

Page 61: Machine Learning: Multi Layer Perceptronsml.informatik.uni-freiburg.de/former/_media/teaching/ss10/05_mlps.printer.pdf · Multi layer perceptrons (cont.) multi layer perceptrons,

Gradient descent(cont.)

◮ advantages of momentum:

• smoothes zig-zagging

• accelerates learning at flat spots

• slows down when signs of partialderivatives change

◮ disadavantage:

• additional parameter µ

• may cause additional zig-zagging

vanilla gradient descent

Machine Learning: Multi Layer Perceptrons – p.24/61

Page 62: Machine Learning: Multi Layer Perceptronsml.informatik.uni-freiburg.de/former/_media/teaching/ss10/05_mlps.printer.pdf · Multi layer perceptrons (cont.) multi layer perceptrons,

Gradient descent(cont.)

◮ advantages of momentum:

• smoothes zig-zagging

• accelerates learning at flat spots

• slows down when signs of partialderivatives change

◮ disadavantage:

• additional parameter µ

• may cause additional zig-zagging

gradient descent withmomentum

Machine Learning: Multi Layer Perceptrons – p.24/61

Page 63: Machine Learning: Multi Layer Perceptronsml.informatik.uni-freiburg.de/former/_media/teaching/ss10/05_mlps.printer.pdf · Multi layer perceptrons (cont.) multi layer perceptrons,

Gradient descent(cont.)

◮ adaptive learning rateidea:

• make learning rate individual for each dimension and adaptive

• if signs of partial derivative change, reduce learning rate

• if signs of partial derivative don’t change, increase learning rate

◮ algorithm: Super-SAB (Tollenare 1990)

Machine Learning: Multi Layer Perceptrons – p.25/61

Gradient descent (cont.)

  1: choose an initial point $\vec{u}$
  2: set initial learning rate $\vec{\epsilon}$
  3: set former gradient $\vec{\gamma} \leftarrow \vec{0}$
  4: while $\|\mathrm{grad} f(\vec{u})\|$ not close to 0 do
  5:   calculate gradient $\vec{g} \leftarrow \mathrm{grad} f(\vec{u})$
  6:   for all dimensions i do
  7:     $\epsilon_i \leftarrow \begin{cases} \eta^+ \epsilon_i & \text{if } g_i \cdot \gamma_i > 0 \\ \eta^- \epsilon_i & \text{if } g_i \cdot \gamma_i < 0 \\ \epsilon_i & \text{otherwise} \end{cases}$
  8:     $u_i \leftarrow u_i - \epsilon_i g_i$
  9:   end for
  10:  $\vec{\gamma} \leftarrow \vec{g}$
  11: end while
  12: return $\vec{u}$

  $\eta^+ \geq 1$, $\eta^- \leq 1$ are additional parameters that have to be adjusted by hand. For $\eta^+ = \eta^- = 1$ we get vanilla gradient descent.
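A rough Python sketch of this adaptive scheme, with one learning rate per dimension adapted by the sign of $g_i \cdot \gamma_i$ (function and parameter names are assumptions):

```python
import numpy as np

def super_sab(f_grad, u0, eps0=0.1, eta_plus=1.2, eta_minus=0.5,
              tol=1e-6, max_iter=10_000):
    """Super-SAB style update as on the slide: per-dimension learning rates,
    increased/decreased depending on the sign of g_i * gamma_i."""
    u = np.asarray(u0, dtype=float)
    eps = np.full_like(u, eps0)         # one learning rate per dimension
    gamma = np.zeros_like(u)            # former gradient
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:
            break
        prod = g * gamma
        eps = np.where(prod > 0, eta_plus * eps,
              np.where(prod < 0, eta_minus * eps, eps))
        u = u - eps * g
        gamma = g
    return u
```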

Gradient descent (cont.)

◮ advantages of Super-SAB and related approaches:
  • decouples learning rates of different dimensions
  • accelerates learning at flat spots
  • slows down when signs of partial derivatives change

◮ disadvantages:
  • steplength still depends on partial derivatives

  [plots: trajectories of vanilla gradient descent compared with SuperSAB]

Gradient descent (cont.)

◮ make steplength independent of partial derivatives
  idea:
  • use explicit steplength parameters, one for each dimension
  • if the sign of the partial derivative changes, reduce the steplength
  • if the sign of the partial derivative doesn't change, increase the steplength

◮ algorithm: Rprop (Riedmiller & Braun, 1993)

Gradient descent (cont.)

  1: choose an initial point $\vec{u}$
  2: set initial steplength $\vec{\Delta}$
  3: set former gradient $\vec{\gamma} \leftarrow \vec{0}$
  4: while $\|\mathrm{grad} f(\vec{u})\|$ not close to 0 do
  5:   calculate gradient $\vec{g} \leftarrow \mathrm{grad} f(\vec{u})$
  6:   for all dimensions i do
  7:     $\Delta_i \leftarrow \begin{cases} \eta^+ \Delta_i & \text{if } g_i \cdot \gamma_i > 0 \\ \eta^- \Delta_i & \text{if } g_i \cdot \gamma_i < 0 \\ \Delta_i & \text{otherwise} \end{cases}$
  8:     $u_i \leftarrow \begin{cases} u_i + \Delta_i & \text{if } g_i < 0 \\ u_i - \Delta_i & \text{if } g_i > 0 \\ u_i & \text{otherwise} \end{cases}$
  9:   end for
  10:  $\vec{\gamma} \leftarrow \vec{g}$
  11: end while
  12: return $\vec{u}$

  $\eta^+ \geq 1$, $\eta^- \leq 1$ are additional parameters that have to be adjusted by hand. For MLPs, good heuristics exist for the parameter settings: $\eta^+ = 1.2$, $\eta^- = 0.5$, initial $\Delta_i = 0.1$
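A compact Python sketch of this Rprop-style update; only the sign of each partial derivative is used for the step. Names are illustrative, and the weight-backtracking refinements of later Rprop variants are omitted:

```python
import numpy as np

def rprop(f_grad, u0, delta0=0.1, eta_plus=1.2, eta_minus=0.5,
          tol=1e-6, max_iter=10_000):
    """Rprop-style update: explicit per-dimension steplengths, adapted by
    sign changes of the partial derivatives; the step uses only sign(g_i)."""
    u = np.asarray(u0, dtype=float)
    delta = np.full_like(u, delta0)     # per-dimension steplengths
    gamma = np.zeros_like(u)            # former gradient
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:
            break
        prod = g * gamma
        delta = np.where(prod > 0, eta_plus * delta,
                np.where(prod < 0, eta_minus * delta, delta))
        u = u - np.sign(g) * delta      # move against the sign of the gradient
        gamma = g
    return u
```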

Gradient descent (cont.)

◮ advantages of Rprop:
  • decouples learning rates of different dimensions
  • accelerates learning at flat spots
  • slows down when signs of partial derivatives change
  • independent of gradient length

  [plots: trajectories of vanilla gradient descent compared with Rprop]

Beyond gradient descent

◮ Newton
◮ Quickprop
◮ line search

Beyond gradient descent (cont.)

◮ Newton's method:
  approximate $f$ by a second-order Taylor polynomial:

  $f(\vec{u} + \vec{\Delta}) \approx f(\vec{u}) + \mathrm{grad} f(\vec{u}) \cdot \vec{\Delta} + \frac{1}{2} \vec{\Delta}^T H(\vec{u}) \vec{\Delta}$

  with $H(\vec{u})$ the Hessian of $f$ at $\vec{u}$, the matrix of second-order partial derivatives.

  Zeroing the gradient of the approximation with respect to $\vec{\Delta}$:

  $\vec{0} \approx (\mathrm{grad} f(\vec{u}))^T + H(\vec{u}) \vec{\Delta}$

  Hence:

  $\vec{\Delta} \approx -(H(\vec{u}))^{-1} (\mathrm{grad} f(\vec{u}))^T$

◮ advantages: no learning rate, no parameters, quick convergence
◮ disadvantages: calculation of $H$ and $H^{-1}$ is very time consuming in high-dimensional spaces
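A sketch of a single Newton step in Python, assuming both gradient and Hessian are available as functions (names are illustrative; solving the linear system avoids forming the inverse explicitly):

```python
import numpy as np

def newton_step(f_grad, f_hess, u):
    """One Newton step: Delta = -H(u)^{-1} * grad f(u)."""
    g = f_grad(u)
    H = f_hess(u)
    delta = -np.linalg.solve(H, g)      # solve H * Delta = -g instead of inverting H
    return u + delta

# usage on a quadratic: f(u) = (u0-1)^2 + 2*(u1+3)^2, minimum reached in one step
grad = lambda u: np.array([2 * (u[0] - 1), 4 * (u[1] + 3)])
hess = lambda u: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_step(grad, hess, np.array([0.0, 0.0])))   # [1, -3]
```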

Beyond gradient descent (cont.)

◮ Quickprop (Fahlman, 1988)
  • like Newton's method, but replaces $H$ by a diagonal matrix containing only the diagonal entries of $H$.
  • hence, calculating the inverse is simplified
  • replaces second-order derivatives by approximations (difference ratios)

◮ update rule:

  $\Delta w_i^t := \frac{-g_i^t}{g_i^t - g_i^{t-1}} (w_i^t - w_i^{t-1})$

  where $g_i^t = \mathrm{grad} f$ at time $t$.

◮ advantages: no learning rate, no parameters, quick convergence in many cases
◮ disadvantages: sometimes unstable
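A per-weight sketch of this update rule in Python (the secant approximation of the second derivative; the function name is an assumption, and practical Quickprop implementations add further safeguards, e.g. for the very first step and for limiting the step size):

```python
def quickprop_step(w, w_prev, g, g_prev):
    """Quickprop update for one weight:
    dw = -g_t / (g_t - g_prev) * (w_t - w_prev)."""
    denom = g - g_prev
    if denom == 0.0:                    # safeguard: secant approximation undefined
        return 0.0
    return -g / denom * (w - w_prev)
```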

Beyond gradient descent (cont.)

◮ line search algorithms:
  two nested loops:
  • outer loop: determine the search direction from the gradient
  • inner loop: determine the minimizing point on the line defined by the current search position and the search direction

◮ the inner loop can be realized by any minimization algorithm for one-dimensional tasks
◮ advantage: the inner loop algorithm may be a more complex algorithm, e.g. Newton

  [sketch: gradient direction and search line at the current search position]

◮ problem: time consuming for high-dimensional spaces

Summary: optimization theory

◮ several algorithms to solve problems of the form:

  $\text{minimize}_{\vec{u}} \; f(\vec{u})$

◮ gradient descent gives the main idea
◮ Rprop plays a major role in the context of MLPs
◮ dozens of variants and alternatives exist

Back to MLP Training

◮ training an MLP means solving:

  $\text{minimize}_{\vec{w}} \; E(\vec{w}; D)$

  for a given network topology and training data $D$, with

  $E(\vec{w}; D) = \frac{1}{2} \sum_{i=1}^{p} \|y(\vec{x}^{(i)}; \vec{w}) - \vec{d}^{(i)}\|^2$

◮ optimization theory offers algorithms to solve tasks of this kind
  open question: how can we calculate the derivatives of $E$?

Calculating partial derivatives

◮ the calculation of the network output of an MLP is done step-by-step: neuron $i$ uses the output of neurons $j \in Pred(i)$ as arguments, and calculates some output which serves as argument for all neurons $j \in Succ(i)$.
◮ apply the chain rule!

Calculating partial derivatives (cont.)

◮ the error term

  $E(\vec{w}; D) = \sum_{i=1}^{p} \left( \frac{1}{2} \|y(\vec{x}^{(i)}; \vec{w}) - \vec{d}^{(i)}\|^2 \right)$

  introducing $e(\vec{w}; \vec{x}, \vec{d}) = \frac{1}{2} \|y(\vec{x}; \vec{w}) - \vec{d}\|^2$ we can write:

  $E(\vec{w}; D) = \sum_{i=1}^{p} e(\vec{w}; \vec{x}^{(i)}, \vec{d}^{(i)})$

  applying the rule for sums:

  $\frac{\partial E(\vec{w}; D)}{\partial w_{kl}} = \sum_{i=1}^{p} \frac{\partial e(\vec{w}; \vec{x}^{(i)}, \vec{d}^{(i)})}{\partial w_{kl}}$

  we can calculate the derivatives for each training pattern individually and sum up

Calculating partial derivatives (cont.)

◮ individual error terms for a pattern $\vec{x}, \vec{d}$
  simplifications in notation:
  • omitting dependencies on $\vec{x}$ and $\vec{d}$
  • $y(\vec{w}) = (y_1, \ldots, y_m)^T$ network output (when applying input pattern $\vec{x}$)

Calculating partial derivatives (cont.)

◮ individual error term:

  $e(\vec{w}) = \frac{1}{2} \|y(\vec{x}; \vec{w}) - \vec{d}\|^2 = \frac{1}{2} \sum_{j=1}^{m} (y_j - d_j)^2$

  by direct calculation:

  $\frac{\partial e}{\partial y_j} = (y_j - d_j)$

  $y_j$ is the activation of a certain output neuron, say $a_i$. Hence:

  $\frac{\partial e}{\partial a_i} = \frac{\partial e}{\partial y_j} = (a_i - d_j)$

Calculating partial derivatives (cont.)

◮ calculations within a neuron $i$
  assume we already know $\frac{\partial e}{\partial a_i}$
  observation: $e$ depends indirectly on $a_i$, and $a_i$ depends on $net_i$
  ⇒ apply the chain rule:

  $\frac{\partial e}{\partial net_i} = \frac{\partial e}{\partial a_i} \cdot \frac{\partial a_i}{\partial net_i}$

  what is $\frac{\partial a_i}{\partial net_i}$?

Calculating partial derivatives (cont.)

◮ $\frac{\partial a_i}{\partial net_i}$:
  $a_i$ is calculated as $a_i = f_{act}(net_i)$ ($f_{act}$: activation function). Hence:

  $\frac{\partial a_i}{\partial net_i} = \frac{\partial f_{act}(net_i)}{\partial net_i}$

  • linear activation: $f_{act}(net_i) = net_i$
    ⇒ $\frac{\partial f_{act}(net_i)}{\partial net_i} = 1$

  • logistic activation: $f_{act}(net_i) = \frac{1}{1 + e^{-net_i}}$
    ⇒ $\frac{\partial f_{act}(net_i)}{\partial net_i} = \frac{e^{-net_i}}{(1 + e^{-net_i})^2} = f_{log}(net_i) \cdot (1 - f_{log}(net_i))$

  • tanh activation: $f_{act}(net_i) = \tanh(net_i)$
    ⇒ $\frac{\partial f_{act}(net_i)}{\partial net_i} = 1 - (\tanh(net_i))^2$
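In code, these derivatives are usually expressed directly in terms of the activation value $a_i$, which the forward pass has already computed; a small illustrative sketch (function name and interface are assumptions):

```python
import numpy as np

def dact_dnet(a, kind="logistic"):
    """Derivative of the activation function, written via the activation a.

    linear:   d(net)/d(net)       = 1
    logistic: d f_log(net)/d(net) = a * (1 - a)   with a = f_log(net)
    tanh:     d tanh(net)/d(net)  = 1 - a**2      with a = tanh(net)
    """
    a = np.asarray(a, dtype=float)
    if kind == "linear":
        return np.ones_like(a)
    if kind == "logistic":
        return a * (1.0 - a)
    if kind == "tanh":
        return 1.0 - a ** 2
    raise ValueError(f"unknown activation: {kind}")
```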

Calculating partial derivatives (cont.)

◮ from neuron to neuron
  assume we already know $\frac{\partial e}{\partial net_j}$ for all $j \in Succ(i)$
  observation: $e$ depends indirectly on the $net_j$ of the successor neurons, and $net_j$ depends on $a_i$
  ⇒ apply the chain rule:

  $\frac{\partial e}{\partial a_i} = \sum_{j \in Succ(i)} \left( \frac{\partial e}{\partial net_j} \cdot \frac{\partial net_j}{\partial a_i} \right)$

  and:

  $net_j = w_{ji} a_i + \ldots$

  hence:

  $\frac{\partial net_j}{\partial a_i} = w_{ji}$

Calculating partial derivatives (cont.)

◮ the weights
  assume we already know $\frac{\partial e}{\partial net_i}$ for neuron $i$, and neuron $j$ is a predecessor of $i$
  observation: $e$ depends indirectly on $net_i$, and $net_i$ depends on $w_{ij}$
  ⇒ apply the chain rule:

  $\frac{\partial e}{\partial w_{ij}} = \frac{\partial e}{\partial net_i} \cdot \frac{\partial net_i}{\partial w_{ij}}$

  and:

  $net_i = w_{ij} a_j + \ldots$

  hence:

  $\frac{\partial net_i}{\partial w_{ij}} = a_j$

Calculating partial derivatives (cont.)

◮ bias weights

  assume we already know ∂e/∂net_i for neuron i

  observation: e depends indirectly on net_i, and net_i depends on w_i0 ⇒ apply chain rule

  ∂e/∂w_i0 = ∂e/∂net_i · ∂net_i/∂w_i0

  and: net_i = w_i0 + ...

  hence: ∂net_i/∂w_i0 = 1
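Putting the last three rules together for a single neuron i, a minimal sketch in my own notation (not from the slides), assuming a logistic activation: delta[j] stands for ∂e/∂net_j of the successor neurons, w[j][i] for the weight from neuron i to neuron j, and succ/pred for the successor and predecessor lists.

```python
def backprop_step(i, a, delta, w, succ, pred):
    # de/da_i = sum over successors j of de/dnet_j * w_ji
    de_da_i = sum(delta[j] * w[j][i] for j in succ[i])
    # de/dnet_i = de/da_i * a_i * (1 - a_i)   (logistic activation of neuron i)
    delta[i] = de_da_i * a[i] * (1.0 - a[i])
    # de/dw_ij = de/dnet_i * a_j   for every predecessor j of neuron i
    grad_w = {j: delta[i] * a[j] for j in pred[i]}
    # de/dw_i0 = de/dnet_i * 1     (bias weight)
    grad_b = delta[i]
    return grad_w, grad_b
```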

Calculating partial derivatives (cont.)

◮ a simple example:

  [network: neuron 1 → (w_2,1) → neuron 2 (Σ) → (w_3,2) → neuron 3 (Σ) → error e]

  ∂e/∂a_3 = a_3 - d_1

  ∂e/∂net_3 = ∂e/∂a_3 · ∂a_3/∂net_3 = ∂e/∂a_3 · 1

  ∂e/∂a_2 = Σ_{j ∈ Succ(2)} ( ∂e/∂net_j · ∂net_j/∂a_2 ) = ∂e/∂net_3 · w_3,2

  ∂e/∂net_2 = ∂e/∂a_2 · ∂a_2/∂net_2 = ∂e/∂a_2 · a_2(1 - a_2)

  ∂e/∂w_3,2 = ∂e/∂net_3 · ∂net_3/∂w_3,2 = ∂e/∂net_3 · a_2

  ∂e/∂w_2,1 = ∂e/∂net_2 · ∂net_2/∂w_2,1 = ∂e/∂net_2 · a_1

  ∂e/∂w_3,0 = ∂e/∂net_3 · ∂net_3/∂w_3,0 = ∂e/∂net_3 · 1

  ∂e/∂w_2,0 = ∂e/∂net_2 · ∂net_2/∂w_2,0 = ∂e/∂net_2 · 1

  (neuron 3 uses a linear activation, neuron 2 a logistic one, as can be read off ∂a_3/∂net_3 = 1 and ∂a_2/∂net_2 = a_2(1 - a_2))
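The same example can be checked numerically. The sketch below uses hypothetical numbers of my own and assumes the squared error e = 1/2 (a_3 - d_1)^2, which is consistent with ∂e/∂a_3 = a_3 - d_1 above; the closed-form derivatives from the slide are compared against a finite-difference estimate.

```python
import numpy as np

def forward(a1, w21, w20, w32, w30):
    net2 = w21 * a1 + w20
    a2 = 1.0 / (1.0 + np.exp(-net2))      # neuron 2: logistic activation
    a3 = w32 * a2 + w30                   # neuron 3: linear activation
    return a2, a3

a1, d1 = 0.8, 1.0                         # assumed input and target
w21, w20, w32, w30 = 0.5, -0.1, 1.5, 0.2  # assumed weights and biases

a2, a3 = forward(a1, w21, w20, w32, w30)

# backward pass, exactly the formulas of the slide
de_da3   = a3 - d1
de_dnet3 = de_da3 * 1.0
de_da2   = de_dnet3 * w32
de_dnet2 = de_da2 * a2 * (1.0 - a2)
de_dw32  = de_dnet3 * a2
de_dw21  = de_dnet2 * a1

# finite-difference check of de/dw21
eps = 1e-6
e = lambda w: 0.5 * (forward(a1, w, w20, w32, w30)[1] - d1) ** 2
print(de_dw21, (e(w21 + eps) - e(w21 - eps)) / (2 * eps))   # the two values should agree
```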

Calculating partial derivatives (cont.)

◮ calculating the partial derivatives:

• starting at the output neurons

• neuron by neuron, go from output to input

• finally calculate the partial derivatives with respect to the weights

◮ Backpropagation


Calculating partial derivatives (cont.)

◮ illustration:

  [network diagram: inputs 1 and 2 feeding a multi layer perceptron of Σ-neurons]

  • apply pattern ~x = (x1, x2)^T

  • propagate the activations forward, step by step

  • calculate the error, ∂e/∂a_i, and ∂e/∂net_i for the output neurons

  • propagate the error backward, step by step

  • calculate ∂e/∂w_ij

  • repeat for all patterns and sum up
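A compact, vectorised sketch of these steps for one pattern, assuming a network with one logistic hidden layer, linear output neurons and the squared error e = 1/2 ||a_2 - ~d||^2; the layer structure and all names are assumptions of mine, not taken from the slides.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def pattern_gradients(x, d, W1, b1, W2, b2):
    # forward: propagate the activations, step by step
    a1 = logistic(W1 @ x + b1)                 # hidden layer (logistic)
    a2 = W2 @ a1 + b2                          # output layer (linear)
    # output neurons: de/da_i = a_i - d_i, de/dnet_i = de/da_i * 1
    delta2 = a2 - d
    # backward: propagate the error, step by step (logistic hidden neurons)
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
    # de/dw_ij = de/dnet_i * a_j,  de/dw_i0 = de/dnet_i
    return np.outer(delta2, a1), delta2, np.outer(delta1, x), delta1
```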

Back to MLP Training

◮ bringing together the building blocks of MLP learning:

  • we can calculate ∂E/∂w_ij

  • we have discussed methods to minimize a differentiable mathematical function

◮ combining them yields a learning algorithm for MLPs:

  • (standard) backpropagation = gradient descent combined with calculating ∂E/∂w_ij for MLPs

  • backpropagation with momentum = gradient descent with momentum term combined with calculating ∂E/∂w_ij for MLPs (update rule sketched below)

  • Quickprop

  • Rprop

  • ...
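A sketch of the momentum update mentioned above, with symbols of my own choosing (eps for the learning rate, mu for the momentum parameter): the step is dw(t) = -eps · ∂E/∂w + mu · dw(t-1), applied once per epoch after the gradient has been accumulated.

```python
import numpy as np

def momentum_step(w, grad, prev_dw, eps=0.05, mu=0.9):
    dw = -eps * grad + mu * prev_dw   # gradient descent step plus momentum term
    return w + dw, dw                 # new weights and the step to remember

# usage once per epoch: w, prev_dw = momentum_step(w, dE_dw, prev_dw)
```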

Back to MLP Training (cont.)

◮ generic MLP learning algorithm:

  1: choose an initial weight vector ~w
  2: initialize the minimization approach
  3: while error did not converge do
  4:   for all (~x, ~d) ∈ D do
  5:     apply ~x to the network and calculate the network output
  6:     calculate ∂e(~x)/∂w_ij for all weights
  7:   end for
  8:   calculate ∂E(D)/∂w_ij for all weights, summing over all training patterns
  9:   perform one update step of the minimization approach
  10: end while

◮ learning by epoch: all training patterns are considered for one update step of the function minimization
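A direct transcription of this learning-by-epoch scheme into Python; gradient(x, d, w), the learning rate and the convergence test are placeholders of mine, standing for the per-pattern backpropagation routine and the chosen minimization approach (here simply vanilla gradient descent).

```python
import numpy as np

def train_by_epoch(D, w, gradient, eps=0.1, max_epochs=1000, tol=1e-4):
    for epoch in range(max_epochs):            # "while error did not converge"
        dE_dw = np.zeros_like(w)
        for x, d in D:                         # for all (x, d) in D
            dE_dw += gradient(x, d, w)         # sum de(x)/dw over all training patterns
        w = w - eps * dE_dw                    # one update step of the minimization approach
        if np.linalg.norm(dE_dw) < tol:        # crude convergence test
            break
    return w
```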

Back to MLP Training (cont.)

◮ generic MLP learning algorithm:

  1: choose an initial weight vector ~w
  2: initialize the minimization approach
  3: while error did not converge do
  4:   for all (~x, ~d) ∈ D do
  5:     apply ~x to the network and calculate the network output
  6:     calculate ∂e(~x)/∂w_ij for all weights
  7:     perform one update step of the minimization approach
  8:   end for
  9: end while

◮ learning by pattern: only one training pattern is considered for one update step of function minimization (only works with vanilla gradient descent!)
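For comparison, the learning-by-pattern variant as a sketch, reusing the same hypothetical gradient(x, d, w) routine but performing one vanilla gradient descent step immediately after every pattern.

```python
def train_by_pattern(D, w, gradient, eps=0.1, max_epochs=1000):
    for epoch in range(max_epochs):
        for x, d in D:
            w = w - eps * gradient(x, d, w)   # update right after each pattern
    return w
```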

Learning behaviour and parameter choice - 3 bit parity

[plot "3 Bit Parity - Sensitivity": average no. of epochs (10 to 100) vs. learning parameter (0.001 to 1, logarithmic) for Rprop, BP, QP, SSAB]

Learning behaviour and parameter choice - 6 bit parity

[plot "6 Bit Parity - Sensitivity": average no. of epochs (0 to 500) vs. learning parameter (0.0001 to 0.1, logarithmic) for Rprop, BP, QP, SSAB]

Learning behaviour and parameter choice - 10-5-10 encoder

[plot "10-5-10 Encoder - Sensitivity": average no. of epochs (0 to 500) vs. learning parameter (0.001 to 10, logarithmic) for Rprop, BP, QP, SSAB]

Learning behaviour and parameter choice - 12-2-12 encoder

[plot "12-2-12 Encoder - Sensitivity": average no. of epochs (0 to 1000) vs. learning parameter (0.001 to 1, logarithmic) for Rprop, QP, SSAB]

Learning behaviour and parameter choice - 'two spirals'

[plot "Two Spirals - Sensitivity": average no. of epochs (0 to 14000) vs. learning parameter (1e-05 to 0.1, logarithmic) for Rprop, BP, QP, SSAB]

Real-world examples: sales rate prediction

◮ Bild-Zeitung is the most frequently sold newspaper in Germany, approx. 4.2 million copies per day

◮ it is sold in 110 000 sales outlets in Germany, which differ in many respects

◮ problem: how many copies are sold in which sales outlet?

◮ neural approach: train a neural network for each sales outlet; the network predicts next week's sales rates

◮ system in use since the mid-1990s

Examples: Alvinn (Dean Pomerleau, 1992)

◮ autonomous vehicle driven by a multi-layer perceptron

◮ input: raw camera image

◮ output: steering wheel angle

◮ training data generated by a human driver

◮ drives up to 90 km/h

◮ 15 frames per second

Alvinn MLP structure

[figure: the Alvinn network structure]

Alvinn Training aspects

◮ training data must be ’diverse’

◮ training data should be balanced (otherwise e.g. a bias towards steering leftmight exist)

◮ if the human driver makes errors, the training data contains errors

◮ if the human driver makes no errors, no information about how to make corrections is available

◮ generation of artificial training data by shifting and rotating images


Summary

◮ MLPs are broadly applicable ML models

◮ continuous features, continuous outputs

◮ suited for regression and classification

◮ learning is based on a general principle: gradient descent on an error function

◮ powerful learning algorithms exist

◮ likely to overfit ⇒ regularisation methods