Lecture 10: Learning Rules & Optimization

Page 1: Lecture 10: Learning Rules & Optimization

Page 2: Outline

• Five basic learning rules
 ▫ Error-correction learning
 ▫ Hebbian learning
 ▫ Memory-based learning
 ▫ Competitive learning
 ▫ Boltzmann learning
• Learning paradigms
 ▫ Learning with a teacher
 ▫ Learning without a teacher
• Learning tasks
 ▫ Pattern association
 ▫ Pattern recognition
 ▫ Function approximation
 ▫ Control
 ▫ Filtering, smoothing, prediction
• Associative memory
• Optimization methods

Page 3: Learning

• Property of primary significance in neural nets: the ability to learn from the environment and to improve performance through learning.
• Learning takes place through iterative adjustment of synaptic weights.
• Learning is hard to define. One definition, by Mendel and McClaren: Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place.

Page 4: Learning

• Sequence of events in NN learning:
 ▫ The NN is stimulated by the environment.
 ▫ The NN undergoes changes in its free parameters as a result of this stimulation.
 ▫ The NN responds in a new way to the environment because of the changes that have occurred in its internal structure.
• A prescribed set of well-defined rules for the solution of the learning problem is called a learning algorithm.
• The manner in which a NN relates to the environment dictates the learning paradigm, which refers to a model of the environment operated on by the NN.

Page 5: Five basic learning rules

• Error-correction learning <- optimum filtering
• Memory-based learning <- memorizing the training data explicitly
• Hebbian learning <- neurobiological
• Competitive learning <- neurobiological
• Boltzmann learning <- statistical mechanics

Page 6: Error-Correction Learning

• Input x(n), output yk(n), and desired response dk(n) (or target output).
• Error signal: ek(n) = dk(n) - yk(n)
• ek(n) actuates a control mechanism that gradually adjusts the synaptic weights to minimize the cost function (or index of performance):
  E(n) = (1/2) ek(n)^2
• When the synaptic weights reach a steady state, learning is stopped.

Page 7: Error-Correction Learning: Delta Rule

• Widrow-Hoff rule, with learning rate η:
  Δwkj(n) = η ek(n) xj(n)
• Having this, we can update the weights:
  wkj(n+1) = wkj(n) + Δwkj(n)
• There is a sound theoretical reason for doing this, which we will discuss later. (Can anyone derive it right now?)
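
To make the update concrete, here is a minimal MATLAB sketch of the delta rule for a single linear neuron; the toy inputs X, targets d, learning rate eta, and epoch count are assumptions made for illustration, not part of the slides.

    % Delta rule for a single linear neuron: w(n+1) = w(n) + eta*e(n)*x(n)
    X = [1 2; 2 1; 3 3; 4 1];      % assumed toy inputs, one x(n) per row
    d = [5; 4; 9; 6];              % assumed desired responses d(n)
    eta = 0.05;                    % learning rate
    w = zeros(2, 1);               % initial synaptic weights

    for epoch = 1:100
        for n = 1:size(X, 1)
            x = X(n, :)';          % current input vector
            y = w' * x;            % neuron output y(n)
            e = d(n) - y;          % error signal e(n) = d(n) - y(n)
            w = w + eta * e * x;   % Widrow-Hoff (delta rule) update
        end
    end
    disp(w)                        % approaches [1; 2], the weights that generated d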

Page 8: Memory-based Learning

• All (or most) past experiences are explicitly stored, as input-target pairs {(xi, di)}, i = 1, ..., N.
• For example, two classes C1 and C2.
• Given a new input xtest, determine its class based on the local neighborhood of xtest.
• Memory-based methods differ in:
 ▫ The criterion used for determining the neighborhood.
 ▫ The learning rule applied to the neighborhood of the input within the set of training examples.

Page 9: Memory-based Learning: Nearest Neighbor

• A set of instances observed so far: XN = {x1, x2, ..., xN}.
• The nearest neighbor xN' of xtest within XN satisfies
  min over i of d(xi, xtest) = d(xN', xtest)
  where d(·,·) is the Euclidean distance.
• xtest is classified as belonging to the same class as xN'.
• Cover and Hart (1967): the probability of error is bounded by at most twice the optimum (the Bayes probability of error), given that
 ▫ The classified examples are independently and identically distributed.
 ▫ The sample size N is infinitely large.

Page 10: Memory-based Learning: K-Nearest Neighbor

• Identify the k classified patterns that lie nearest to the test vector xtest, for some integer k.
• Assign xtest to the class that is most frequently represented by the k neighbors (use a majority vote).
• In effect, it is like averaging, so it can deal with outliers.
• In the slide's illustration, the input x is classified as class 1.
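
As a concrete illustration of the rule, here is a small MATLAB sketch of k-nearest-neighbor classification by majority vote; the training set Xtrain, labels ytrain, query xtest, and k are assumed toy values, not taken from the slides.

    % k-nearest-neighbor classification with Euclidean distance and majority vote.
    Xtrain = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];    % assumed stored patterns, one per row
    ytrain = [1; 1; 1; 2; 2; 2];                % class labels of the stored patterns
    xtest  = [2 2];                             % assumed query pattern
    k = 3;

    dist = sqrt(sum((Xtrain - xtest).^2, 2));   % distance from xtest to each stored pattern
    [~, idx] = sort(dist);                      % order neighbors by distance
    label = mode(ytrain(idx(1:k)));             % majority vote among the k nearest
    fprintf('xtest assigned to class %d\n', label);

With k = 1 the same code reduces to the nearest-neighbor rule of the previous slide.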

Page 11: Hebbian Learning

• Donald Hebb's postulate of learning appeared in his book The Organization of Behavior (1949):
 ▫ When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
• Hebbian synapse:
 ▫ If two neurons on either side of a synapse are activated simultaneously, the synapse is strengthened.
 ▫ If they are activated asynchronously, the synapse is weakened or eliminated. (This part was not mentioned in Hebb's original postulate.)

Page 12: Hebbian Synapses

• Key properties:
 ▫ Time-dependent mechanism
 ▫ Local mechanism (spatiotemporal contiguity)
 ▫ Interactive mechanism
 ▫ Correlative/conjunctive mechanism
• Strong evidence for Hebbian plasticity in the hippocampus (a brain region).
• Spike-timing-dependent plasticity rule.
• A Hebbian synapse increases its strength with positively correlated presynaptic and postsynaptic signals, and decreases its strength when the signals are either uncorrelated or negatively correlated.

Page 13: Classification of Synaptic Plasticity

Page 14: Mathematical Models of Synaptic Plasticity

• General form:
  Δwkj(n) = F(yk(n), xj(n))
• Hebbian learning (with learning rate η):
  Δwkj(n) = η yk(n) xj(n)
• Covariance rule:
  Δwkj(n) = η (yk(n) - ȳ)(xj(n) - x̄)
  where x̄ and ȳ denote the time-averaged presynaptic and postsynaptic activities.
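
The difference between the two rules is easiest to see numerically. Below is an illustrative MATLAB sketch contrasting the plain Hebbian and covariance-rule updates for a single synapse; the activity traces x and y and the rate eta are assumed toy values.

    % Plain Hebbian vs. covariance-rule updates for one synaptic weight w_kj.
    rng(0);
    eta  = 0.01;
    x    = randn(1, 1000) + 0.5;            % assumed presynaptic activity x_j(n)
    y    = 0.8 * x + 0.2 * randn(1, 1000);  % correlated postsynaptic activity y_k(n)
    xbar = mean(x);                         % time-averaged presynaptic activity
    ybar = mean(y);                         % time-averaged postsynaptic activity

    w_hebb = 0; w_cov = 0;
    for n = 1:numel(x)
        w_hebb = w_hebb + eta * y(n) * x(n);                    % Hebbian rule
        w_cov  = w_cov  + eta * (y(n) - ybar) * (x(n) - xbar);  % covariance rule
    end
    fprintf('Hebbian weight: %.3f   covariance-rule weight: %.3f\n', w_hebb, w_cov);

The Hebbian weight keeps growing because the mean activities are positive, while the covariance-rule weight reflects only the correlated fluctuations, which is the behavior described on the next slide.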

Page 15: Covariance Rule

  Δwkj(n) = η (yk(n) - ȳ)(xj(n) - x̄)

• Convergence to a nontrivial state.
• Prediction of both potentiation and depression.
• Observations:
 ▫ The weight is enhanced when both pre- and postsynaptic activities are above average.
 ▫ The weight is depressed when presynaptic activity is above average and postsynaptic activity is below average, or when presynaptic activity is below average and postsynaptic activity is above average.
• Note that:
 1. Synaptic weight wkj is enhanced if the conditions xj > x̄ and yk > ȳ are both satisfied.
 2. Synaptic weight wkj is depressed if either xj > x̄ and yk < ȳ, or yk > ȳ and xj < x̄.

Page 16: Competitive Learning

• Output neurons compete with each other for a chance to become active.
• Highly suited to discovering statistically salient features (which may aid in classification).
• Three basic elements:
 ▫ The same type of neurons with different weight sets, so that they respond differently to a given set of inputs.
 ▫ A limit imposed on the strength of each neuron.
 ▫ A competition mechanism, to choose one winner: the winner-takes-all neuron.

Page 17: Inputs and Weights Seen as Vectors in High-dimensional Space

• Inputs and weights can be seen as vectors: x and wk.
• Note that the weight vector belongs to a certain output neuron k, hence the index.

Page 18: Competitive Learning Example

• Single layer, with feed-forward excitatory connections and lateral inhibitory connections.
• Winner selection:
  yk = 1 if vk > vj for all j, j ≠ k; otherwise yk = 0
• Limit:
  Σj wkj = 1 for all k
• Adaptation:
  Δwkj = η (xj - wkj) if neuron k wins the competition; 0 if neuron k loses.
• Simply put, the synaptic weight vector of the winner is moved towards the input vector.

Page 19: Example

• Adaptation:
  Δwkj = η (xj - wkj) for the winning neuron k
• Interpreting this as a vector operation gives the plot on the slide: each update moves the winner's weight vector toward the current input vector.
• Weight vectors converge toward local input clusters: clustering.
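
A small MATLAB sketch of winner-takes-all competitive learning on toy 2-D data is given below; the two input clusters, the number of output neurons K, and eta are assumptions for illustration, and the winner is chosen as the closest weight vector, matching the geometric picture on the slide.

    % Winner-takes-all competitive learning: dw_kj = eta*(x_j - w_kj) for the winner only.
    rng(0);
    X = [randn(50, 2) + 4; randn(50, 2) - 4];    % two assumed input clusters
    K = 2;                                       % number of output neurons
    eta = 0.1;
    W = randn(K, 2);                             % one weight vector per neuron (rows)

    for epoch = 1:20
        for n = 1:size(X, 1)
            x = X(n, :);
            [~, k] = min(sum((W - x).^2, 2));            % winner = closest weight vector
            W(k, :) = W(k, :) + eta * (x - W(k, :));     % move the winner toward x
        end
    end
    disp(W)    % the weight vectors end up near the two cluster centers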

Page 20: Boltzmann Learning

• Stochastic learning algorithm rooted in statistical mechanics.
• Recurrent network, binary neurons (on: '+1', off: '-1').
• Energy function E:
  E = -(1/2) Σj Σk wkj xk xj,  with j ≠ k
• Activation:
 ▫ Choose a random neuron k with state xk.
 ▫ Flip the state with a probability (given temperature T)
   P(xk -> -xk) = 1 / (1 + exp(ΔEk / T))
   where ΔEk is the change (increase) in E caused by the flip.
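
For concreteness, here is a minimal MATLAB sketch of this stochastic update step; the weight matrix W, temperature T, and number of sweeps are assumed toy values, and the flip probability follows the convention above (ΔE is the energy increase caused by the flip).

    % Stochastic state updates for a small Boltzmann machine.
    % Energy: E = -0.5 * x' * W * x, with W symmetric and zero diagonal.
    rng(1);
    N = 5;
    W = randn(N); W = (W + W') / 2; W(1:N+1:end) = 0;   % symmetric weights, no self-connections
    x = sign(randn(N, 1)); x(x == 0) = 1;               % binary states in {-1, +1}
    T = 1.0;                                            % temperature

    for sweep = 1:1000
        k  = randi(N);                                  % pick a random neuron
        dE = 2 * x(k) * (W(k, :) * x);                  % energy increase if x(k) is flipped
        if rand < 1 / (1 + exp(dE / T))                 % accept the flip stochastically
            x(k) = -x(k);
        end
    end
    disp(x')    % a low-energy state configuration at temperature T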

Page 21: Learning Paradigms

• How neural networks relate to their environment:
 ▫ Learning with a teacher
 ▫ Learning without a teacher

Page 22: Learning with a Teacher

• Also known as supervised learning.
• The teacher has knowledge, represented as input-output examples. The environment is unknown to the NN.
• The NN tries to emulate the teacher gradually.
• Error-correction learning is one way to achieve this.
• Key notions: error surface, gradient, steepest descent, etc.

Page 23: Learning without a Teacher

• Two classes:
 ▫ Reinforcement learning (RL) / neurodynamic programming
 ▫ Unsupervised learning / self-organization

Page 24: Learning without a Teacher: Reinforcement Learning

• Learning an input-output mapping through continued interaction with the environment.
• Actor-critic architecture: the critic converts the primary reinforcement signal into a higher-quality, heuristic reinforcement signal (Barto, Sutton, ...).
• The goal is to optimize the cumulative cost of actions.
• In many cases, learning proceeds under delayed reinforcement. Delayed RL is difficult since (1) the teacher does not provide a desired action at each step, and (2) the learner must solve a temporal credit-assignment problem.
• Related to dynamic programming, in the context of optimal control theory (Bellman).

Page 25: Learning without a Teacher: Unsupervised Learning

• Learn based on a task-independent measure of the quality of the representation.
• Internal representations encode features of the input space.
• A competitive learning rule is needed, such as winner-takes-all.

Page 26: Learning Tasks

• Learning tasks:
 ▫ Pattern association
 ▫ Pattern recognition
 ▫ Function approximation
 ▫ Control
 ▫ Filtering

Page 27: Pattern Association

• Associative memory: a brain-like distributed memory that learns associations. Storage and retrieval (recall).
• Pattern association (xk: key pattern, yk: memorized pattern):
  xk -> yk,  k = 1, 2, ..., q
 ▫ Autoassociation (yk = xk): given a partial or corrupted version of a stored pattern, retrieve the original.
 ▫ Heteroassociation (yk ≠ xk): learn arbitrary pattern pairs and retrieve them.
• Relevant issues: storage capacity vs. accuracy.

Page 28: Pattern Classification

• Mapping between an input pattern and a prescribed number of classes (categories).
• Two general types:
 ▫ Feature extraction (observation space to feature space: cf. dimensionality reduction), followed by classification (feature space to decision space).
 ▫ Single step (observation space to decision space).

Page 29: Function Approximation

• Nonlinear input-output mapping d = f(x) for an unknown f.
• Given a set of labeled examples {(xi, di)}, i = 1, ..., N, estimate F(·) such that
  ||F(x) - f(x)|| < ε,  for all x

Page 30: Function Approximation: System Identification and Inverse System Modeling

• System identification: learn the function of an unknown system,
  d = f(x)
• Inverse system modeling: learn the inverse function,
  x = f^-1(d)

Page 31: Control

• Control of a plant: a process or critical part of a system that is to be maintained in a controlled condition.
• Feedback controller: adjust the plant input u so that the plant output y tracks the reference signal d. Learning takes the form of free-parameter adjustment in the controller.

Page 32: Filtering, Smoothing, Prediction

• Extract information about a quantity of interest from a set of noisy data.
• Filtering: estimate the quantity at time n, based on measurements up to time n.
• Smoothing: estimate the quantity at time n - τ (τ > 0), based on measurements up to time n.
• Prediction: estimate the quantity at time n + τ (τ > 0), based on measurements up to time n.

Page 33: Memory

• Memory: relatively enduring neural alterations induced by an organism's interaction with the environment.
• Memory needs to be accessible by the nervous system in order to influence behavior.
• Activity patterns need to be stored through a learning process.
• Types of memory: short-term and long-term memory.

Page 34: Memory: Associative Memory

• An associative memory is a brain-like distributed memory that learns by association.
• Autoassociation: a neural network is required to store a set of patterns by repeatedly presenting them to the network. The network is then presented a partial description of an original stored pattern, and the task is to retrieve that particular pattern.
• Heteroassociation: differs from autoassociation in that an arbitrary set of input patterns is paired with another arbitrary set of output patterns.

Page 35: Memory: Associative Memory

• Let xk denote a key pattern and yk denote a memorized pattern. The pattern association is described by
  xk -> yk,  k = 1, 2, ..., q
• In an autoassociative memory, xk = yk.
• In a heteroassociative memory, xk ≠ yk.
• Two phases: a storage phase and a recall phase.
• q is a direct measure of the storage capacity.

Page 36: Optimization

• Adaptive filtering problem
• Unconstrained optimization
 ▫ Steepest descent method
 ▫ Newton's method
 ▫ Gauss-Newton method
• Linear least-square filters
• Least mean-square (LMS) algorithm
• Learning-rate annealing schedules
• Learning curves

Page 37: General Adaptive Filter Configuration

Page 38: Adaptive Filtering Problem

• Consider an unknown dynamical system that takes m inputs and generates one output.
• The behavior of the system is described by its input/output pairs:
  T : {(x(i), d(i)) | i = 1, 2, ..., n, ...}
  where x(i) = [x1(i), x2(i), ..., xm(i)]^T is the input and d(i) is the desired response (or target signal).
• The input vector can be either a spatial snapshot or a temporal sequence uniformly spaced in time.
• There are two important processes in adaptive filtering:
 ▫ Filtering process: generation of the output based on the input,
   y(i) = x(i)^T w(i)
 ▫ Adaptive process: automatic adjustment of the weights to reduce the error,
   e(i) = d(i) - y(i)
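
The structure of the two processes can be sketched in a few lines of MATLAB; the simulated system weights, noise level, and data sizes below are assumptions, and the weight update itself is deliberately left out because it is the subject of the following slides.

    % Structure of the adaptive filtering problem: an unknown system produces d(i)
    % from x(i); the filter computes y(i) = x(i)'*w and the error e(i) = d(i) - y(i).
    rng(0);
    m = 3; n = 200;
    w_sys = [0.5; -1.2; 2.0];                 % assumed unknown system weights
    X = randn(n, m);                          % input vectors x(i), one per row
    d = X * w_sys + 0.05 * randn(n, 1);       % desired responses with a little noise

    w = zeros(m, 1);                          % filter weights (not yet adapted)
    errs = zeros(n, 1);
    for i = 1:n
        y = X(i, :) * w;                      % filtering process: y(i) = x(i)' * w
        errs(i) = d(i) - y;                   % error signal e(i) = d(i) - y(i)
        % adaptive process: w would be updated here (see the LMS algorithm below)
    end
    fprintf('mean squared error with unadapted weights: %.3f\n', mean(errs.^2));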

Page 39: Mean-Square Error Surface

Page 40: Unconstrained Optimization Techniques

• How can we adjust w(i) to gradually minimize e(i)? Note that
  e(i) = d(i) - y(i) = d(i) - x(i)^T w(i)
  Since d(i) and x(i) are fixed, only a change in w(i) can change e(i).
• In other words, we want to minimize a cost function E(w) with respect to the weight vector w: find the optimal solution w*.
• The necessary condition for optimality is
  ∇E(w*) = 0
  where the gradient operator is defined as
  ∇ = [∂/∂w1, ∂/∂w2, ..., ∂/∂wm]^T
  with this we get
  ∇E(w) = [∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wm]^T

Page 41: Steepest Descent Method

• We want the iterative update algorithm to have the following property:
  E(w(n+1)) < E(w(n))
• Define the gradient vector ∇E(w) as g.
• The iterative weight update rule then becomes
  w(n+1) = w(n) - η g(n)
  where η is a small learning-rate parameter. So we can say
  Δw(n) = w(n+1) - w(n) = -η g(n)
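
As an illustration, here is a MATLAB sketch of steepest descent on a simple quadratic cost; the cost E(w) = 0.5*w'*A*w - b'*w, the values of A and b, the step size, and the iteration count are all assumptions chosen so the gradient is easy to write down.

    % Steepest descent: w(n+1) = w(n) - eta * g(n), with g the gradient of E at w(n).
    % Assumed quadratic cost E(w) = 0.5*w'*A*w - b'*w, so g(w) = A*w - b.
    A = [3 1; 1 2];                 % symmetric positive definite
    b = [1; 1];
    eta = 0.1;                      % small learning rate
    w = [4; -3];                    % initial guess

    for n = 1:200
        g = A * w - b;              % gradient at the current point
        w = w - eta * g;            % steepest-descent update
    end
    disp(w)                         % approaches the true minimizer
    disp(A \ b)                     % the true minimizer w* for comparison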

Page 42: Steepest Descent Method

• We now check that E(w(n+1)) < E(w(n)).
• Using a first-order Taylor expansion of E(·) near w(n), together with Δw(n) = -η g(n), we get
  E(w(n+1)) ≈ E(w(n)) + g(n)^T Δw(n) = E(w(n)) - η g(n)^T g(n) = E(w(n)) - η ||g(n)||^2
• So the inequality indeed holds (for a small positive η):
  E(w(n+1)) < E(w(n))
• Taylor series around w(n):
  E(w) ≈ E(w(n)) + g(n)^T (w - w(n)) + (1/2)(w - w(n))^T H(n)(w - w(n)) + ...
  where H(n) is the Hessian of E at w(n).

Page 43: Steepest Descent: Example

• Convergence to the optimal w is very slow.
• Small η: overdamped, smooth trajectory.
• Large η: underdamped, jagged trajectory.
• η too large: the algorithm becomes unstable.

Page 44: Newton's Method

• Newton's method is an extension of steepest descent, in which the second-order term of the Taylor series expansion is also used.
• It is generally faster and shows less erratic meandering than the steepest descent method.
• There are certain conditions to be met, though, such as the Hessian matrix H(n) being positive definite (for an arbitrary n).
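
For comparison with the steepest-descent sketch above, here is a Newton-step sketch on the same assumed quadratic cost; since the cost is quadratic, the Hessian is constant (H = A) and a single Newton step lands on the minimizer, which is exactly the faster, less meandering behavior described on this slide.

    % Newton's method: w(n+1) = w(n) - inv(H(n)) * g(n).
    % On the assumed quadratic E(w) = 0.5*w'*A*w - b'*w: H = A and g = A*w - b.
    A = [3 1; 1 2];                 % Hessian (positive definite)
    b = [1; 1];
    w = [4; -3];                    % initial guess

    for n = 1:3                     % one step suffices for a quadratic; loop kept for form
        g = A * w - b;              % gradient
        H = A;                      % Hessian
        w = w - H \ g;              % Newton update (solve H * dw = g)
    end
    disp(w)                         % equals A\b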

Page 45: Steepest Descent vs. Newton

(Figure: convergence trajectories of the steepest descent method and Newton's method, side by side.)

Page 46: Gauss-Newton Method

• Applicable to cost functions expressed as a sum of error squares:
  E(w) = (1/2) Σ from i=1 to n of e(i)^2
  where e(i) is the error in the i-th trial, with the weight vector w.
• Recalling the Taylor series, we can express e(i), evaluated near the current estimate w(n), by the linearization
  e'(i, w) = e(i) + [∂e(i)/∂w]^T (w - w(n))
• In matrix notation, we get:
  e'(n, w) = e(n) + J(n)(w - w(n))

Page 47: Gauss-Newton Method

• J(n) is the Jacobian matrix, where each row is the (transposed) gradient of one error term e(i):
  row i of J(n) = [∂e(i)/∂w1, ∂e(i)/∂w2, ..., ∂e(i)/∂wm]
• We can then evaluate J(n) by plugging the actual value w = w(n) into the Jacobian matrix above.

Page 48: Quick Example: Jacobian Matrix

• Given a vector of error terms e(w), the Jacobian collects the partial derivatives ∂e(i)/∂wj, one row per error term and one column per weight.
• (The specific example functions and the evaluated Jacobian on this slide were not preserved in the transcript.)
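
Since the slide's own example did not survive extraction, here is a small stand-in: a hypothetical error vector, its hand-derived Jacobian, and a finite-difference check. Both the function e(w) and the evaluation point are assumptions for illustration only.

    % Hypothetical error vector e(w) = [w1^2 - w2; w1*w2] with Jacobian
    % J(w) = [2*w1, -1; w2, w1], checked numerically with finite differences.
    e = @(w) [w(1)^2 - w(2); w(1) * w(2)];
    J = @(w) [2*w(1), -1; w(2), w(1)];

    w0 = [1; 2];
    h  = 1e-6;
    Jnum = zeros(2, 2);
    for j = 1:2
        dw = zeros(2, 1); dw(j) = h;
        Jnum(:, j) = (e(w0 + dw) - e(w0)) / h;   % numerical partial derivatives
    end
    disp(J(w0))      % analytic Jacobian at w0
    disp(Jnum)       % finite-difference approximation (agrees to ~1e-6)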

Page 49: Gauss-Newton Method

• Again, starting with the linearized error
  e'(n, w) = e(n) + J(n)(w - w(n))
• what we want is to choose w so that the error approaches 0.
• That is, we want to minimize the norm of e'(n, w):
  (1/2) ||e(n) + J(n)(w - w(n))||^2
• Differentiating the above with respect to w and setting the result to 0, we get
  w(n+1) = w(n) - (J(n)^T J(n))^-1 J(n)^T e(n)
• J(n)^T J(n) needs to be nonsingular (its inverse is needed).
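
Here is a MATLAB sketch of this iteration on a small assumed nonlinear least-squares problem (fitting d ≈ a*exp(b*t) to four points); the data, starting point, and iteration count are made up for illustration, and the update line is exactly the formula above.

    % Gauss-Newton: w(n+1) = w(n) - (J'*J) \ (J'*e), with e(i) = d(i) - a*exp(b*t(i)).
    t = [0; 1; 2; 3];
    d = [2.0; 3.3; 5.4; 8.9];                    % assumed data, roughly 2*exp(0.5*t)
    w = [1.5; 0.3];                              % initial guess for [a; b]

    for n = 1:20
        a = w(1); b = w(2);
        r = d - a * exp(b * t);                  % error vector e(w)
        J = [-exp(b * t), -a * t .* exp(b * t)]; % Jacobian: row i = [de(i)/da, de(i)/db]
        w = w - (J' * J) \ (J' * r);             % Gauss-Newton update
    end
    disp(w)                                      % close to [2; 0.5]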

Page 50: Applications of Adaptive Filtering

• System identification
• Signal enhancement
• Signal prediction

Page 51: Linear Least-Square Filters

• Given m inputs and one output y(i) = x(i)^T w (i.e., the filter is linear), and a set of training samples {(x(i), d(i))}, i = 1, ..., n, we can define the error vector for an arbitrary weight vector w as
  e(w) = d - X w
  where d = [d(1), ..., d(n)]^T and X is the n-by-m matrix whose i-th row is x(i)^T.
• Differentiating the above with respect to w, we get ∂e(i)/∂w = -x(i). So the Jacobian becomes
  J = -X
• Plugging this into the Gauss-Newton equation, we finally get:
  w(n+1) = w(n) + (X^T X)^-1 X^T (d - X w(n)) = (X^T X)^-1 X^T d

Page 52: Linear Least-Square Filters

Points worth noting:
• X does not need to be a square matrix!
• We get the solution in a single step, partly because the output is linear (otherwise, the formula would be more complex).
• The Jacobian of the error function depends only on the input, and is invariant with respect to the weight w.
• The factor (X^T X)^-1 X^T (let's call it X^+) is like an inverse (the pseudo-inverse of X). If d = X w, multiplying both sides by X^+ gives
  X^+ d = X^+ X w = w
  so the weights are recovered exactly, as in the example on the next slide.

Page 53: Example

• A pseudo-inverse code in MATLAB:

    X = ceil(rand(4,2)*10), wtrue = rand(2,1)*10, d = X*wtrue, w = inv(X'*X)*X'*d

    X =
        10     7
         3     7
         3     6
         5     4
    wtrue =
        0.56644
        4.99120
    d =
       40.603
       36.638
       31.647
       22.797
    w =
        0.56644
        4.99120

Page 54: Least Mean-Square Algorithm

• The cost function is based on instantaneous values:
  E(w) = (1/2) e(n)^2
• Differentiating the above with respect to w, we get
  ∂E(w)/∂w = e(n) ∂e(n)/∂w
• Plugging in e(n) = d(n) - x(n)^T w(n), so that ∂e(n)/∂w = -x(n), gives
  ∂E(w)/∂w = -x(n) e(n)
• Using this in the steepest descent rule, we get the LMS algorithm:
  w(n+1) = w(n) + η x(n) e(n)
• Note that this weight update is done with only one (x, d) pair!
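
A minimal MATLAB sketch of the LMS loop on a simulated input/desired-response stream follows; the system weights, noise level, sample count, and eta are assumed, and the single line inside the loop is the update derived above.

    % LMS algorithm: w(n+1) = w(n) + eta * x(n) * e(n), one sample at a time.
    rng(0);
    m = 3; n = 2000;
    w_sys = [0.5; -1.2; 2.0];                % assumed unknown system
    X = randn(n, m);
    d = X * w_sys + 0.05 * randn(n, 1);      % desired responses with noise

    eta = 0.01;
    w = zeros(m, 1);
    for i = 1:n
        x = X(i, :)';
        e = d(i) - w' * x;                   % instantaneous error e(i)
        w = w + eta * x * e;                 % LMS update with this single pair
    end
    disp([w w_sys])                          % learned weights vs. true system weights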

Page 55: LMS Algorithm

Page 56: Least Mean-Square Algorithm: Evaluation

• The LMS algorithm behaves like a low-pass filter.
• The LMS algorithm is simple and model-independent, and thus robust.
• LMS does not follow the direction of steepest descent exactly: instead, it follows it stochastically (stochastic gradient descent).
• Slow convergence is an issue.
• LMS is sensitive to the condition number of the input correlation matrix (the ratio between the largest and smallest eigenvalues of the correlation matrix).
• LMS can be shown to converge if the learning rate satisfies
  0 < η < 2 / λmax
  where λmax is the largest eigenvalue of the correlation matrix.

Page 57: Improving Convergence in LMS

• The main problem arises because of the fixed learning rate η.
• One solution: use a time-varying learning rate, η(n) = c/n, as in stochastic optimization theory.
• A better alternative: use a hybrid method called search-then-converge,
  η(n) = η0 / (1 + n/τ)
• When n << τ, the performance is similar to standard LMS.
• When n >> τ, it behaves like stochastic optimization.
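
A tiny MATLAB sketch comparing the two annealing schedules mentioned above is given here; the constants c, eta0, and tau are assumed values chosen only to make the two regimes visible.

    % Learning-rate annealing schedules for LMS.
    n   = 1:1000;
    c   = 0.5;
    eta_stoch = c ./ n;                   % eta(n) = c/n (stochastic optimization)
    eta0 = 0.1;  tau = 100;
    eta_stc = eta0 ./ (1 + n / tau);      % search-then-converge schedule

    % For n << tau, eta_stc stays near eta0 (LMS-like "search" phase);
    % for n >> tau, it decays roughly like eta0*tau/n ("converge" phase).
    plot(n, eta_stoch, n, eta_stc);
    legend('c/n', 'eta_0/(1+n/tau)');
    xlabel('n'); ylabel('eta(n)');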

Page 58: Search-Then-Converge in LMS

Page 59: Learning Curves

Page 60: Reading

• S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapters 2 and 3).