Lecture 10: Learning Rules & Optimization

Page 1: Lecture 10: Learning Rules & Optimization

Page 2: Outline

• Five basic learning rules
 ▫ Error-correction learning
 ▫ Hebbian learning
 ▫ Memory-based learning
 ▫ Competitive learning
 ▫ Boltzmann learning
• Learning paradigms
 ▫ Learning with a teacher
 ▫ Learning without a teacher
• Learning tasks
 ▫ Pattern association
 ▫ Pattern recognition
 ▫ Function approximation
 ▫ Control
 ▫ Filtering, smoothing, prediction
• Associative memory
• Optimization methods

Page 3: Learning

• Property of primary significance in neural nets: the ability to learn from the environment and to improve performance through learning.
• Learning takes place through iterative adjustment of synaptic weights.
• Learning is hard to define. One definition, by Mendel and McClaren: Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

Page 4: Learning

• Sequence of events in NN learning:
 ▫ The NN is stimulated by the environment.
 ▫ The NN undergoes changes in its free parameters as a result of this stimulation.
 ▫ The NN responds in a new way to the environment because of the changes that have occurred in its internal structure.
• A prescribed set of well-defined rules for the solution of the learning problem is called a learning algorithm.
• The manner in which a NN relates to the environment dictates the learning paradigm, which refers to a model of the environment operated on by the NN.

Page 5: Five basic learning rules

• Error-correction learning <- optimum filtering
• Memory-based learning <- memorizing the training data explicitly
• Hebbian learning <- neurobiological
• Competitive learning <- neurobiological
• Boltzmann learning <- statistical mechanics

Page 6: Error-Correction Learning

• Input x(n), output yk(n), and desired response dk(n) (or target output).
• Error signal: ek(n) = dk(n) - yk(n)
• ek(n) actuates a control mechanism that gradually adjusts the synaptic weights to minimize the cost function (or index of performance):
  E(n) = (1/2) ek(n)^2
• When the synaptic weights reach a steady state, learning is stopped.

Page 7: Error-Correction Learning: Delta Rule

• Widrow-Hoff rule, with learning rate η:
  Δwkj(n) = η ek(n) xj(n)
• Having this, we can update the weights:
  wkj(n+1) = wkj(n) + Δwkj(n)
• There is a sound theoretical reason for doing this, which we will discuss later. (Can anyone derive it right now?)
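
To make the update concrete, here is a minimal MATLAB sketch of the delta rule for a single linear neuron; the toy inputs X, targets d, learning rate eta, and epoch count are assumptions made for illustration, not part of the slides.

    % Delta rule for a single linear neuron: w(n+1) = w(n) + eta*e(n)*x(n)
    X = [1 2; 2 1; 3 3; 4 1];      % assumed toy inputs, one x(n) per row
    d = [5; 4; 9; 6];              % assumed desired responses d(n)
    eta = 0.05;                    % learning rate
    w = zeros(2, 1);               % initial synaptic weights

    for epoch = 1:100
        for n = 1:size(X, 1)
            x = X(n, :)';          % current input vector
            y = w' * x;            % neuron output y(n)
            e = d(n) - y;          % error signal e(n) = d(n) - y(n)
            w = w + eta * e * x;   % Widrow-Hoff (delta rule) update
        end
    end
    disp(w)                        % approaches [1; 2], the weights that generated d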

Page 8: Memory-based Learning

• All (or most) past experiences are explicitly stored, as input-target pairs {(xi, di)}, i = 1, ..., N.
• For example, two classes C1 and C2.
• Given a new input xtest, determine its class based on the local neighborhood of xtest.
• Memory-based methods differ in:
 ▫ The criterion used for determining the neighborhood.
 ▫ The learning rule applied to the neighborhood of the input within the set of training examples.

Page 9: Memory-based Learning: Nearest Neighbor

• A set of instances observed so far: XN = {x1, x2, ..., xN}.
• The nearest neighbor xN' of xtest within XN satisfies
  min over i of d(xi, xtest) = d(xN', xtest)
  where d(·,·) is the Euclidean distance.
• xtest is classified as belonging to the same class as xN'.
• Cover and Hart (1967): the probability of error is bounded by at most twice the optimum (the Bayes probability of error), given that
 ▫ The classified examples are independently and identically distributed.
 ▫ The sample size N is infinitely large.

Page 10: Memory-based Learning: K-Nearest Neighbor

• Identify the k classified patterns that lie nearest to the test vector xtest, for some integer k.
• Assign xtest to the class that is most frequently represented by the k neighbors (use a majority vote).
• In effect, it is like averaging, so it can deal with outliers.
• In the slide's illustration, the input x is classified as class 1.
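
As a concrete illustration of the rule, here is a small MATLAB sketch of k-nearest-neighbor classification by majority vote; the training set Xtrain, labels ytrain, query xtest, and k are assumed toy values, not taken from the slides.

    % k-nearest-neighbor classification with Euclidean distance and majority vote.
    Xtrain = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];    % assumed stored patterns, one per row
    ytrain = [1; 1; 1; 2; 2; 2];                % class labels of the stored patterns
    xtest  = [2 2];                             % assumed query pattern
    k = 3;

    dist = sqrt(sum((Xtrain - xtest).^2, 2));   % distance from xtest to each stored pattern
    [~, idx] = sort(dist);                      % order neighbors by distance
    label = mode(ytrain(idx(1:k)));             % majority vote among the k nearest
    fprintf('xtest assigned to class %d\n', label);

With k = 1 the same code reduces to the nearest-neighbor rule of the previous slide.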

Page 11: Hebbian Learning

• Donald Hebb's postulate of learning appeared in his book The Organization of Behavior (1949):
 ▫ When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
• Hebbian synapse:
 ▫ If two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened.
 ▫ If they are activated asynchronously, the synapse is weakened or eliminated. (This part was not mentioned in Hebb's original postulate.)

Page 12: Hebbian Synapses

• Key properties:
 ▫ Time-dependent mechanism
 ▫ Local mechanism (spatiotemporal contiguity)
 ▫ Interactive mechanism
 ▫ Correlative/conjunctive mechanism
• Strong evidence for Hebbian plasticity in the hippocampus (a brain region).
• Spike-timing-dependent plasticity rule.
• A Hebbian synapse increases its strength with positively correlated presynaptic and postsynaptic signals, and decreases its strength when the signals are either uncorrelated or negatively correlated.

Page 13: Classification of Synaptic Plasticity

Page 14: Mathematical Models of Synaptic Plasticity

• General form:
  Δwkj(n) = F(yk(n), xj(n))
• Hebbian learning (with learning rate η):
  Δwkj(n) = η yk(n) xj(n)
• Covariance rule:
  Δwkj(n) = η (yk(n) - ȳ)(xj(n) - x̄)
  where x̄ and ȳ denote the time-averaged presynaptic and postsynaptic activities.
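
The difference between the two rules is easiest to see numerically. Below is an illustrative MATLAB sketch contrasting the plain Hebbian and covariance-rule updates for a single synapse; the activity traces x and y and the rate eta are assumed toy values.

    % Plain Hebbian vs. covariance-rule updates for one synaptic weight w_kj.
    rng(0);
    eta  = 0.01;
    x    = randn(1, 1000) + 0.5;            % assumed presynaptic activity x_j(n)
    y    = 0.8 * x + 0.2 * randn(1, 1000);  % correlated postsynaptic activity y_k(n)
    xbar = mean(x);                         % time-averaged presynaptic activity
    ybar = mean(y);                         % time-averaged postsynaptic activity

    w_hebb = 0; w_cov = 0;
    for n = 1:numel(x)
        w_hebb = w_hebb + eta * y(n) * x(n);                    % Hebbian rule
        w_cov  = w_cov  + eta * (y(n) - ybar) * (x(n) - xbar);  % covariance rule
    end
    fprintf('Hebbian weight: %.3f   covariance-rule weight: %.3f\n', w_hebb, w_cov);

The Hebbian weight keeps growing because the mean activities are positive, while the covariance-rule weight reflects only the correlated fluctuations, which is the behavior described on the next slide.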

Page 15: Covariance Rule

  Δwkj(n) = η (yk(n) - ȳ)(xj(n) - x̄)

• Convergence to a nontrivial state.
• Prediction of both potentiation and depression.
• Observations:
 ▫ The weight is enhanced when both pre- and postsynaptic activities are above average.
 ▫ The weight is depressed when presynaptic activity is above average and postsynaptic activity is below average, or when presynaptic activity is below average and postsynaptic activity is above average.
• Note that:
 1. Synaptic weight wkj is enhanced if the conditions xj > x̄ and yk > ȳ are both satisfied.
 2. Synaptic weight wkj is depressed if either xj > x̄ and yk < ȳ, or yk > ȳ and xj < x̄.

Page 16: Competitive Learning

• Output neurons compete with each other for a chance to become active.
• Highly suited to discovering statistically salient features (which may aid in classification).
• Three basic elements:
 ▫ The same type of neurons with different weight sets, so that they respond differently to a given set of inputs.
 ▫ A limit imposed on the strength of each neuron.
 ▫ A competition mechanism, to choose one winner: the winner-takes-all neuron.

Page 17: Inputs and Weights Seen as Vectors in High-dimensional Space

• Inputs and weights can be seen as vectors: x and wk.
• Note that the weight vector belongs to a certain output neuron k, hence the index.

Page 18: Competitive Learning Example

• Single layer, with feed-forward excitatory connections and lateral inhibitory connections.
• Winner selection:
  yk = 1 if vk > vj for all j, j ≠ k; otherwise yk = 0
• Limit:
  Σj wkj = 1 for all k
• Adaptation:
  Δwkj = η (xj - wkj) if neuron k wins the competition; 0 if neuron k loses.
• Simply put, the synaptic weight vector of the winner is moved towards the input vector.

Page 19: Example

• Adaptation:
  Δwkj = η (xj - wkj) for the winning neuron k
• Interpreting this as a vector operation gives the plot on the slide: each update moves the winner's weight vector toward the current input vector.
• Weight vectors converge toward local input clusters: clustering.
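
A small MATLAB sketch of winner-takes-all competitive learning on toy 2-D data is given below; the two input clusters, the number of output neurons K, and eta are assumptions for illustration, and the winner is chosen as the closest weight vector, matching the geometric picture on the slide.

    % Winner-takes-all competitive learning: dw_kj = eta*(x_j - w_kj) for the winner only.
    rng(0);
    X = [randn(50, 2) + 4; randn(50, 2) - 4];    % two assumed input clusters
    K = 2;                                       % number of output neurons
    eta = 0.1;
    W = randn(K, 2);                             % one weight vector per neuron (rows)

    for epoch = 1:20
        for n = 1:size(X, 1)
            x = X(n, :);
            [~, k] = min(sum((W - x).^2, 2));            % winner = closest weight vector
            W(k, :) = W(k, :) + eta * (x - W(k, :));     % move the winner toward x
        end
    end
    disp(W)    % the weight vectors end up near the two cluster centers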

Page 20: Boltzmann Learning

• Stochastic learning algorithm rooted in statistical mechanics.
• Recurrent network, binary neurons (on: '+1', off: '-1').
• Energy function E:
  E = -(1/2) Σj Σk wkj xk xj,  with j ≠ k
• Activation:
 ▫ Choose a random neuron k with state xk.
 ▫ Flip the state with a probability (given temperature T)
   P(xk -> -xk) = 1 / (1 + exp(ΔEk / T))
   where ΔEk is the change (increase) in E caused by the flip.
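
For concreteness, here is a minimal MATLAB sketch of this stochastic update step; the weight matrix W, temperature T, and number of sweeps are assumed toy values, and the flip probability follows the convention above (ΔE is the energy increase caused by the flip).

    % Stochastic state updates for a small Boltzmann machine.
    % Energy: E = -0.5 * x' * W * x, with W symmetric and zero diagonal.
    rng(1);
    N = 5;
    W = randn(N); W = (W + W') / 2; W(1:N+1:end) = 0;   % symmetric weights, no self-connections
    x = sign(randn(N, 1)); x(x == 0) = 1;               % binary states in {-1, +1}
    T = 1.0;                                            % temperature

    for sweep = 1:1000
        k  = randi(N);                                  % pick a random neuron
        dE = 2 * x(k) * (W(k, :) * x);                  % energy increase if x(k) is flipped
        if rand < 1 / (1 + exp(dE / T))                 % accept the flip stochastically
            x(k) = -x(k);
        end
    end
    disp(x')    % a low-energy state configuration at temperature T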

Page 21: Learning Paradigms

• How neural networks relate to their environment:
 ▫ Learning with a teacher
 ▫ Learning without a teacher

Page 22: Learning with a Teacher

• Also known as supervised learning.
• The teacher has knowledge, represented as input-output examples. The environment is unknown to the NN.
• The NN tries to emulate the teacher gradually.
• Error-correction learning is one way to achieve this.
• Key notions: error surface, gradient, steepest descent, etc.

Page 23: Learning without a Teacher

• Two classes:
 ▫ Reinforcement learning (RL) / neurodynamic programming
 ▫ Unsupervised learning / self-organization

Page 24: Learning without a Teacher: Reinforcement Learning

• Learning an input-output mapping through continued interaction with the environment.
• Actor-critic architecture: the critic converts the primary reinforcement signal into a higher-quality, heuristic reinforcement signal (Barto, Sutton, ...).
• The goal is to optimize the cumulative cost of actions.
• In many cases, learning proceeds under delayed reinforcement. Delayed RL is difficult since (1) the teacher does not provide a desired action at each step, and (2) the learner must solve a temporal credit-assignment problem.
• Related to dynamic programming, in the context of optimal control theory (Bellman).

Page 25: Learning without a Teacher: Unsupervised Learning

• Learn based on a task-independent measure of the quality of the representation.
• Internal representations encode features of the input space.
• A competitive learning rule is needed, such as winner-takes-all.

Page 26: Learning Tasks

• Learning tasks:
 ▫ Pattern association
 ▫ Pattern recognition
 ▫ Function approximation
 ▫ Control
 ▫ Filtering

Page 27: Pattern Association

• Associative memory: a brain-like distributed memory that learns associations. Storage and retrieval (recall).
• Pattern association (xk: key pattern, yk: memorized pattern):
  xk -> yk,  k = 1, 2, ..., q
 ▫ Autoassociation (yk = xk): given a partial or corrupted version of a stored pattern, retrieve the original.
 ▫ Heteroassociation (yk ≠ xk): learn arbitrary pattern pairs and retrieve them.
• Relevant issues: storage capacity vs. accuracy.

Page 28: Pattern Classification

• Mapping between an input pattern and a prescribed number of classes (categories).
• Two general types:
 ▫ Feature extraction (observation space to feature space: cf. dimensionality reduction), followed by classification (feature space to decision space).
 ▫ Single step (observation space to decision space).

Page 29: Function Approximation

• Nonlinear input-output mapping d = f(x) for an unknown f.
• Given a set of labeled examples {(xi, di)}, i = 1, ..., N, estimate F(·) such that
  ||F(x) - f(x)|| < ε,  for all x

Page 30: Function Approximation: System Identification and Inverse System Modeling

• System identification: learn the function of an unknown system,
  d = f(x)
• Inverse system modeling: learn the inverse function,
  x = f^-1(d)

Page 31: Control

• Control of a plant: a process or critical part of a system that is to be maintained in a controlled condition.
• Feedback controller: adjust the plant input u so that the plant output y tracks the reference signal d. Learning takes the form of free-parameter adjustment in the controller.

Page 32: Filtering, Smoothing, Prediction

• Extract information about a quantity of interest from a set of noisy data.
• Filtering: estimate the quantity at time n, based on measurements up to time n.
• Smoothing: estimate the quantity at time n - τ (τ > 0), based on measurements up to time n.
• Prediction: estimate the quantity at time n + τ (τ > 0), based on measurements up to time n.

Page 33: Memory

• Memory: relatively enduring neural alterations induced by an organism's interaction with the environment.
• Memory needs to be accessible by the nervous system in order to influence behavior.
• Activity patterns need to be stored through a learning process.
• Types of memory: short-term and long-term memory.

Page 34: Memory: Associative Memory

• An associative memory is a brain-like distributed memory that learns by association.
• Autoassociation: a neural network is required to store a set of patterns by repeatedly presenting them to the network. The network is then presented a partial description of an original stored pattern, and the task is to retrieve that particular pattern.
• Heteroassociation: differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns.

Page 35: Memory: Associative Memory

• Let xk denote a key pattern and yk denote a memorized pattern. The pattern association is described by
  xk -> yk,  k = 1, 2, ..., q
• In an autoassociative memory, xk = yk.
• In a heteroassociative memory, xk ≠ yk.
• Two phases: a storage phase and a recall phase.
• q is a direct measure of the storage capacity.

Page 36: Optimization

• Adaptive filtering problem
• Unconstrained optimization
 ▫ Steepest descent method
 ▫ Newton's method
 ▫ Gauss-Newton method
• Linear least-square filters
• Least mean-square (LMS) algorithm
• Learning-rate annealing schedules
• Learning curves

Page 37: General Adaptive Filter Configuration

Page 38: Adaptive Filtering Problem

• Consider an unknown dynamical system that takes m inputs and generates one output.
• The behavior of the system is described by its input/output pairs:
  T : {(x(i), d(i)) | i = 1, 2, ..., n, ...}
  where x(i) = [x1(i), x2(i), ..., xm(i)]^T is the input and d(i) is the desired response (or target signal).
• The input vector can be either a spatial snapshot or a temporal sequence uniformly spaced in time.
• There are two important processes in adaptive filtering:
 ▫ Filtering process: generation of the output based on the input,
   y(i) = x(i)^T w(i)
 ▫ Adaptive process: automatic adjustment of the weights to reduce the error,
   e(i) = d(i) - y(i)
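
The structure of the two processes can be sketched in a few lines of MATLAB; the simulated system weights, noise level, and data sizes below are assumptions, and the weight update itself is deliberately left out because it is the subject of the following slides.

    % Structure of the adaptive filtering problem: an unknown system produces d(i)
    % from x(i); the filter computes y(i) = x(i)'*w and the error e(i) = d(i) - y(i).
    rng(0);
    m = 3; n = 200;
    w_sys = [0.5; -1.2; 2.0];                 % assumed unknown system weights
    X = randn(n, m);                          % input vectors x(i), one per row
    d = X * w_sys + 0.05 * randn(n, 1);       % desired responses with a little noise

    w = zeros(m, 1);                          % filter weights (not yet adapted)
    errs = zeros(n, 1);
    for i = 1:n
        y = X(i, :) * w;                      % filtering process: y(i) = x(i)' * w
        errs(i) = d(i) - y;                   % error signal e(i) = d(i) - y(i)
        % adaptive process: w would be updated here (see the LMS algorithm below)
    end
    fprintf('mean squared error with unadapted weights: %.3f\n', mean(errs.^2));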

Page 39: Mean-Square Error Surface

Page 40: Unconstrained Optimization Techniques

• How can we adjust w(i) to gradually minimize e(i)? Note that
  e(i) = d(i) - y(i) = d(i) - x(i)^T w(i)
  Since d(i) and x(i) are fixed, only a change in w(i) can change e(i).
• In other words, we want to minimize a cost function E(w) with respect to the weight vector w: find the optimal solution w*.
• The necessary condition for optimality is
  ∇E(w*) = 0
  where the gradient operator is defined as
  ∇ = [∂/∂w1, ∂/∂w2, ..., ∂/∂wm]^T
  with this we get
  ∇E(w) = [∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wm]^T

Page 41: Steepest Descent Method

• We want the iterative update algorithm to have the following property:
  E(w(n+1)) < E(w(n))
• Define the gradient vector ∇E(w) as g.
• The iterative weight update rule then becomes
  w(n+1) = w(n) - η g(n)
  where η is a small learning-rate parameter. So we can say
  Δw(n) = w(n+1) - w(n) = -η g(n)
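
As an illustration, here is a MATLAB sketch of steepest descent on a simple quadratic cost; the cost E(w) = 0.5*w'*A*w - b'*w, the values of A and b, the step size, and the iteration count are all assumptions chosen so the gradient is easy to write down.

    % Steepest descent: w(n+1) = w(n) - eta * g(n), with g the gradient of E at w(n).
    % Assumed quadratic cost E(w) = 0.5*w'*A*w - b'*w, so g(w) = A*w - b.
    A = [3 1; 1 2];                 % symmetric positive definite
    b = [1; 1];
    eta = 0.1;                      % small learning rate
    w = [4; -3];                    % initial guess

    for n = 1:200
        g = A * w - b;              % gradient at the current point
        w = w - eta * g;            % steepest-descent update
    end
    disp(w)                         % approaches the true minimizer
    disp(A \ b)                     % the true minimizer w* for comparison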

Page 42: Steepest Descent Method

• We now check that E(w(n+1)) < E(w(n)).
• Using a first-order Taylor expansion of E(·) near w(n), together with Δw(n) = -η g(n), we get
  E(w(n+1)) ≈ E(w(n)) + g(n)^T Δw(n) = E(w(n)) - η g(n)^T g(n) = E(w(n)) - η ||g(n)||^2
• So the inequality indeed holds (for a small positive η):
  E(w(n+1)) < E(w(n))
• Taylor series around w(n):
  E(w) ≈ E(w(n)) + g(n)^T (w - w(n)) + (1/2)(w - w(n))^T H(n)(w - w(n)) + ...
  where H(n) is the Hessian of E at w(n).

Page 43: Steepest Descent: Example

• Convergence to the optimal w is very slow.
• Small η: overdamped, smooth trajectory.
• Large η: underdamped, jagged trajectory.
• η too large: the algorithm becomes unstable.

Page 44: Newton's Method

• Newton's method is an extension of steepest descent, in which the second-order term of the Taylor series expansion is also used.
• It is generally faster and shows less erratic meandering than the steepest descent method.
• There are certain conditions to be met, though, such as the Hessian matrix H(n) being positive definite (for an arbitrary n).
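
For comparison with the steepest-descent sketch above, here is a Newton-step sketch on the same assumed quadratic cost; since the cost is quadratic, the Hessian is constant (H = A) and a single Newton step lands on the minimizer, which is exactly the faster, less meandering behavior described on this slide.

    % Newton's method: w(n+1) = w(n) - inv(H(n)) * g(n).
    % On the assumed quadratic E(w) = 0.5*w'*A*w - b'*w: H = A and g = A*w - b.
    A = [3 1; 1 2];                 % Hessian (positive definite)
    b = [1; 1];
    w = [4; -3];                    % initial guess

    for n = 1:3                     % one step suffices for a quadratic; loop kept for form
        g = A * w - b;              % gradient
        H = A;                      % Hessian
        w = w - H \ g;              % Newton update (solve H * dw = g)
    end
    disp(w)                         % equals A\b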

Page 45: Steepest Descent vs. Newton

(Figure: convergence trajectories of the steepest descent method and Newton's method, side by side.)

Page 46: Gauss-Newton Method

• Applicable to cost functions expressed as a sum of error squares:
  E(w) = (1/2) Σ from i=1 to n of e(i)^2
  where e(i) is the error in the i-th trial, with the weight vector w.
• Recalling the Taylor series, we can express e(i), evaluated near the current estimate w(n), by the linearization
  e'(i, w) = e(i) + [∂e(i)/∂w]^T (w - w(n))
• In matrix notation, we get:
  e'(n, w) = e(n) + J(n)(w - w(n))

Page 47: Gauss-Newton Method

• J(n) is the Jacobian matrix, where each row is the (transposed) gradient of one error term e(i):
  row i of J(n) = [∂e(i)/∂w1, ∂e(i)/∂w2, ..., ∂e(i)/∂wm]
• We can then evaluate J(n) by plugging the actual value w = w(n) into the Jacobian matrix above.

Page 48: Quick Example: Jacobian Matrix

• Given a vector of error terms e(w), the Jacobian collects the partial derivatives ∂e(i)/∂wj, one row per error term and one column per weight.
• (The specific example functions and the evaluated Jacobian on this slide were not preserved in the transcript.)
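
Since the slide's own example did not survive extraction, here is a small stand-in: a hypothetical error vector, its hand-derived Jacobian, and a finite-difference check. Both the function e(w) and the evaluation point are assumptions for illustration only.

    % Hypothetical error vector e(w) = [w1^2 - w2; w1*w2] with Jacobian
    % J(w) = [2*w1, -1; w2, w1], checked numerically with finite differences.
    e = @(w) [w(1)^2 - w(2); w(1) * w(2)];
    J = @(w) [2*w(1), -1; w(2), w(1)];

    w0 = [1; 2];
    h  = 1e-6;
    Jnum = zeros(2, 2);
    for j = 1:2
        dw = zeros(2, 1); dw(j) = h;
        Jnum(:, j) = (e(w0 + dw) - e(w0)) / h;   % numerical partial derivatives
    end
    disp(J(w0))      % analytic Jacobian at w0
    disp(Jnum)       % finite-difference approximation (agrees to ~1e-6)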

Page 49: Gauss-Newton Method

• Again, starting with the linearized error
  e'(n, w) = e(n) + J(n)(w - w(n))
• what we want is to choose w so that the error approaches 0.
• That is, we want to minimize the norm of e'(n, w):
  (1/2) ||e(n) + J(n)(w - w(n))||^2
• Differentiating the above with respect to w and setting the result to 0, we get
  w(n+1) = w(n) - (J(n)^T J(n))^-1 J(n)^T e(n)
• J(n)^T J(n) needs to be nonsingular (its inverse is needed).
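
Here is a MATLAB sketch of this iteration on a small assumed nonlinear least-squares problem (fitting d ≈ a*exp(b*t) to four points); the data, starting point, and iteration count are made up for illustration, and the update line is exactly the formula above.

    % Gauss-Newton: w(n+1) = w(n) - (J'*J) \ (J'*e), with e(i) = d(i) - a*exp(b*t(i)).
    t = [0; 1; 2; 3];
    d = [2.0; 3.3; 5.4; 8.9];                    % assumed data, roughly 2*exp(0.5*t)
    w = [1.5; 0.3];                              % initial guess for [a; b]

    for n = 1:20
        a = w(1); b = w(2);
        r = d - a * exp(b * t);                  % error vector e(w)
        J = [-exp(b * t), -a * t .* exp(b * t)]; % Jacobian: row i = [de(i)/da, de(i)/db]
        w = w - (J' * J) \ (J' * r);             % Gauss-Newton update
    end
    disp(w)                                      % close to [2; 0.5]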

Page 50: Applications of Adaptive Filtering

• System identification
• Signal enhancement
• Signal prediction

Page 51: Linear Least-Square Filters

• Given m inputs and one output y(i) = x(i)^T w (i.e., the filter is linear), and a set of training samples {(x(i), d(i))}, i = 1, ..., n, we can define the error vector for an arbitrary weight vector w as
  e(w) = d - X w
  where d = [d(1), ..., d(n)]^T and X is the n-by-m matrix whose i-th row is x(i)^T.
• Differentiating the above with respect to w, we get ∂e(i)/∂w = -x(i). So the Jacobian becomes
  J = -X
• Plugging this into the Gauss-Newton equation, we finally get:
  w(n+1) = w(n) + (X^T X)^-1 X^T (d - X w(n)) = (X^T X)^-1 X^T d

Page 52: Linear Least-Square Filters

Points worth noting:
• X does not need to be a square matrix!
• We get the solution in a single step, partly because the output is linear (otherwise, the formula would be more complex).
• The Jacobian of the error function depends only on the input, and is invariant with respect to the weight w.
• The factor (X^T X)^-1 X^T (let's call it X^+) is like an inverse (the pseudo-inverse of X). If d = X w, multiplying both sides by X^+ gives
  X^+ d = X^+ X w = w
  so the weights are recovered exactly, as in the example on the next slide.

Page 53: Example

• A pseudo-inverse code in MATLAB:

    X = ceil(rand(4,2)*10), wtrue = rand(2,1)*10, d = X*wtrue, w = inv(X'*X)*X'*d

    X =
        10     7
         3     7
         3     6
         5     4
    wtrue =
        0.56644
        4.99120
    d =
       40.603
       36.638
       31.647
       22.797
    w =
        0.56644
        4.99120

Page 54: Least Mean-Square Algorithm

• The cost function is based on instantaneous values:
  E(w) = (1/2) e(n)^2
• Differentiating the above with respect to w, we get
  ∂E(w)/∂w = e(n) ∂e(n)/∂w
• Plugging in e(n) = d(n) - x(n)^T w(n), so that ∂e(n)/∂w = -x(n), gives
  ∂E(w)/∂w = -x(n) e(n)
• Using this in the steepest descent rule, we get the LMS algorithm:
  w(n+1) = w(n) + η x(n) e(n)
• Note that this weight update is done with only one (x, d) pair!
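
A minimal MATLAB sketch of the LMS loop on a simulated input/desired-response stream follows; the system weights, noise level, sample count, and eta are assumed, and the single line inside the loop is the update derived above.

    % LMS algorithm: w(n+1) = w(n) + eta * x(n) * e(n), one sample at a time.
    rng(0);
    m = 3; n = 2000;
    w_sys = [0.5; -1.2; 2.0];                % assumed unknown system
    X = randn(n, m);
    d = X * w_sys + 0.05 * randn(n, 1);      % desired responses with noise

    eta = 0.01;
    w = zeros(m, 1);
    for i = 1:n
        x = X(i, :)';
        e = d(i) - w' * x;                   % instantaneous error e(i)
        w = w + eta * x * e;                 % LMS update with this single pair
    end
    disp([w w_sys])                          % learned weights vs. true system weights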

Page 55: LMS Algorithm

Page 56: Least Mean-Square Algorithm: Evaluation

• The LMS algorithm behaves like a low-pass filter.
• The LMS algorithm is simple and model-independent, and thus robust.
• LMS does not follow the direction of steepest descent exactly: instead, it follows it stochastically (stochastic gradient descent).
• Slow convergence is an issue.
• LMS is sensitive to the condition number of the input correlation matrix (the ratio between the largest and smallest eigenvalues of the correlation matrix).
• LMS can be shown to converge if the learning rate satisfies
  0 < η < 2 / λmax
  where λmax is the largest eigenvalue of the correlation matrix.

Page 57: Improving Convergence in LMS

• The main problem arises because of the fixed learning rate η.
• One solution: use a time-varying learning rate, η(n) = c/n, as in stochastic optimization theory.
• A better alternative: use a hybrid method called search-then-converge,
  η(n) = η0 / (1 + n/τ)
• When n << τ, the performance is similar to standard LMS.
• When n >> τ, it behaves like stochastic optimization.
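
A tiny MATLAB sketch comparing the two annealing schedules mentioned above is given here; the constants c, eta0, and tau are assumed values chosen only to make the two regimes visible.

    % Learning-rate annealing schedules for LMS.
    n   = 1:1000;
    c   = 0.5;
    eta_stoch = c ./ n;                   % eta(n) = c/n (stochastic optimization)
    eta0 = 0.1;  tau = 100;
    eta_stc = eta0 ./ (1 + n / tau);      % search-then-converge schedule

    % For n << tau, eta_stc stays near eta0 (LMS-like "search" phase);
    % for n >> tau, it decays roughly like eta0*tau/n ("converge" phase).
    plot(n, eta_stoch, n, eta_stc);
    legend('c/n', 'eta_0/(1+n/tau)');
    xlabel('n'); ylabel('eta(n)');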

Page 58: Search-Then-Converge in LMS

Page 59: Learning Curves

Page 60: Reading

• S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapters 2 and 3).