EE645 Neural Networks and Learning Theory, Spring 2003. Prof. Anthony Kuh, Dept. of Elec. Eng., University of Hawaii. Phone: (808)-956-7527, Fax: (808)-956-3427. Email: [email protected]

Page 1:

EE645 Neural Networks and Learning Theory

Spring 2003

Prof. Anthony Kuh

Dept. of Elec. Eng.

University of Hawaii

Phone: (808)-956-7527, Fax: (808)-956-3427

Email: [email protected]

Page 2:

I. Introduction to Neural Networks

Goal: study the computational capabilities of neural networks and learning systems.

Multidisciplinary field

Algorithms, Analysis, Applications

Page 3:

A. Motivation

Why study neural networks and machine learning?

Biological inspiration (natural computation)
Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models
Implementation
Applications

Cognitive (human vs. computer intelligence): humans are superior to computers in pattern recognition, associative recall, and learning complex tasks; computers are superior to humans in arithmetic computations and simple repeatable tasks.

Biological (study of the human brain): 10^10 to 10^11 neurons in the cerebral cortex, with on average 10^3 interconnections per neuron.

Page 4:

A neuron

Schematic of one neuron

Page 5:

Neural Network

Connection of many neurons together forms a neural network.

Neural network properties:
Highly parallel (distributed computing)
Robust and fault tolerant
Flexible (short and long term learning)
Handles a variety of information (often random, fuzzy, and inconsistent)
Small, compact, dissipates very little power

Page 6:

B. Single Neuron

s = w^T x + w0; synaptic strength (linearly weighted sum of inputs).

y = g(s); activation or squashing function

[Figure: computational node with input vector x, weights w, and threshold weight w0; s = w^T x + w0 feeds activation g(·) to give output y]
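As a concrete illustration of this computational node, here is a minimal numpy sketch; the example input, weights, and threshold weight are made up for illustration.

```python
import numpy as np

def neuron(x, w, w0, g=np.tanh):
    """Single computational node: s = w^T x + w0 (synaptic strength), y = g(s) (activation)."""
    s = w @ x + w0
    return g(s)

x = np.array([0.5, -1.0, 2.0])            # example input vector
w = np.array([0.2, 0.4, -0.1])            # example weights
print(neuron(x, w, w0=0.3))               # sigmoidal unit: g(s) = tanh(s)
print(neuron(x, w, w0=0.3, g=np.sign))    # threshold unit (note: np.sign(0) = 0, not +1)
```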

Page 7:

Activation functions

Linear units: g(s) = s. Linear threshold units: g(s) = sgn(s). Sigmoidal units: g(s) = tanh(Bs), B > 0.

Neural networks generally have nonlinear activation functions.

Most popular models: linear threshold units and sigmoidal units.

Other types of computational units: receptive units (radial basis functions).

Page 8:

C. Neural Network Architectures

Systems composed of interconnected neurons

[Figure: network of interconnected units mapping inputs to an output]

Neural network represented by directed graph: edges represent weights and nodes represent computational units.

Page 9:

Definitions

Feedforward neural network has no loops in directed graph.

Neural networks are often arranged in layers.

A single-layer feedforward neural network has one layer of computational nodes.

A multilayer feedforward neural network has two or more layers of computational nodes.

Computational nodes that are not output nodes are called hidden units.

Page 10:

D. Learning and Information Storage

1. Neural networks have computational capabilities. Where is information stored in a neural network?

What are parameters of neural network?

2. How does a neural network work? (two phases)

Training or learning phase (equivalent to the write phase in conventional computer memory): weights are adjusted to meet a desired criterion.

Recall or test phase (equivalent to the read phase in conventional computer memory): weights are fixed as the neural network realizes some task.

Page 11:

Learning and Information (continued)

3) What can neural network models learn?

Boolean functions

Pattern recognition problems

Function approximation

Dynamical systems

4) What types of learning algorithms are there?

Supervised learning (learning with a teacher)

Unsupervised learning (no teacher)

Reinforcement learning (learning with a critic)

Page 12:

Learning and Information (continued)

5) How do neural networks learn?

• Iterative algorithm: weights of the neural network are adjusted on-line as training data is received.

w(k+1) = L(w(k), x(k), d(k)) for supervised learning, where d(k) is the desired output.

• Need a cost criterion. A common cost criterion is the mean squared error: for one output, J(w) = (y(k) – d(k))².

• Goal is to find the minimum of J(w) over all possible w. Iterative techniques often use gradient descent approaches.

Page 13:

Learning and Information (continued)

6) Learning and Generalization

A learning algorithm takes training examples as inputs and produces the concept, pattern, or function to be learned.

How good is the learning algorithm? Generalization ability measures how well the learning algorithm performs.

Sufficient number of training examples. (LLN, typical sequences)

Occam’s razor: “simplest explanation is the best”.

[Figure: regression problem, a curve fitted through labeled training points]

Page 14:

Learning and Information (continued)

Generalization error: ε_g = ε_emp + ε_model

Empirical error: average error from training data (desired output vs. actual output)

Model error: due to dimensionality of class of functions or patterns

Desire class to be large enough so that empirical error is small and small enough so that model error is small.

Page 15:

II. Linear threshold units

A. Preliminaries

[Figure: linear threshold unit with inputs x, weights w, and threshold weight w0; output y = sgn(s)]

sgn(s) = 1 if s ≥ 0; -1 if s < 0

Page 16:

Linearly separable

Consider a set of points with two labels: + and o.

A set of points is linearly separable if a linear threshold function can partition the + points from the o points.

[Figure: set of linearly separable points]

Page 17:

Not linearly separable

A set of labeled points that cannot be partitioned by a linear threshold function is not linearly separable.

[Figure: set of points that are not linearly separable]

Page 18:

B. Perceptron Learning Algorithm

An iterative learning algorithm that can find a linear threshold function to partition two sets of points.

1) w(0) arbitrary

2) Pick point (x(k),d(k)).

3) If w(k)^T x(k) d(k) > 0, go to 5).

4) w(k+1) = w(k) + x(k) d(k).

5) k = k+1; check if cycled through the data; if not, go to 2).

6) Otherwise stop.
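A minimal numpy sketch of steps 1)-6), assuming labels d(k) ∈ {-1, +1} and the threshold weight folded into w via a constant-1 input; the data at the end is only an illustration.

```python
import numpy as np

def perceptron_train(X, d, max_epochs=100):
    """Perceptron learning algorithm; X is (N, n) with a leading column of 1's for the threshold."""
    w = np.zeros(X.shape[1])                 # 1) w(0) arbitrary (here zero)
    for _ in range(max_epochs):
        updated = False
        for x_k, d_k in zip(X, d):           # 2) pick point (x(k), d(k))
            if w @ x_k * d_k <= 0:           # 3) misclassified (or on the boundary)?
                w = w + x_k * d_k            # 4) w(k+1) = w(k) + x(k) d(k)
                updated = True
        if not updated:                      # 5)-6) stop once a full pass makes no update
            break
    return w

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # linearly separable (AND)
d = np.array([-1, -1, -1, 1])
w = perceptron_train(X, d)
print(np.sign(X @ w), w)
```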

Page 19:

PLA comments

Perceptron convergence theorem (requires margins)

Sketch of proof

Updating threshold weights

Algorithm is based on cost function

J(w) = - (sum of synaptic strengths of misclassified points)

w(k+1) = w(k) - η(k) ∇J(w(k))   (gradient descent)

Page 20:

Perceptron Convergence Theorem

Assumptions: w* is a solution with ||w*|| = 1, no threshold, and w(0) = 0. Let α = max ||x(k)|| and γ = min y(k) x(k)^T w*.

<w(k), w*> = <w(k-1) + x(k-1) y(k-1), w*> ≥ <w(k-1), w*> + γ ≥ kγ.

||w(k)||² ≤ ||w(k-1)||² + ||x(k-1)||² ≤ ||w(k-1)||² + α² ≤ kα².

Together these imply k ≤ (α/γ)² (maximum number of updates).

Page 21:

III. Linear Units

A. Preliminaries

[Figure: linear unit with inputs x and weights w; output y = s = w^T x]

Page 22:

Model Assumptions and Parameters

Training examples (x(k), d(k)) drawn randomly.

Parameters:
– Inputs: x(k)
– Outputs: y(k)
– Desired outputs: d(k)
– Weights: w(k)
– Error: e(k) = d(k) - y(k)

Error criterion (MSE): min J(w) = E[0.5 e(k)²]

Page 23:

Wiener solution

Define P = E(x(k) d(k)) and R = E(x(k) x(k)^T).

J(w) = 0.5 E[(d(k) - y(k))²] = 0.5 E[d(k)²] - E(x(k) d(k))^T w + 0.5 w^T E(x(k) x(k)^T) w = 0.5 E[d(k)²] - P^T w + 0.5 w^T R w

Note J(w) is a quadratic function of w. To minimize J(w), find the gradient ∇J(w) and set it to 0.

∇J(w) = -P + Rw = 0, so Rw = P (Wiener solution). If R is nonsingular, then w = R^-1 P.

Resulting minimum MSE = 0.5 E[d(k)²] - 0.5 P^T R^-1 P
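A minimal numpy sketch of the Wiener solution with R and P replaced by sample averages; the synthetic training data (true weights, noise level) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000
X = rng.standard_normal((N, 3))                                      # inputs x(k)
d = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)    # desired outputs d(k)

R = (X.T @ X) / N                 # sample estimate of R = E[x x^T]
P = (X.T @ d) / N                 # sample estimate of P = E[x d]
w = np.linalg.solve(R, P)         # Wiener solution w = R^-1 P (R nonsingular)
mse_min = 0.5 * np.mean(d**2) - 0.5 * P @ w    # resulting minimum MSE
print(w, mse_min)
```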

Page 24:

Iterative algorithms

Steepest descent algorithm (move in direction of negative gradient)

w(k+1) = w(k) - μ ∇J(w(k)) = w(k) + μ (P - R w(k))

Least mean square (LMS) algorithm (approximate the gradient from the current training example):

∇J(w(k)) ≈ -e(k) x(k)

w(k+1) = w(k) + μ e(k) x(k)
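A minimal sketch of the LMS update; the step size μ and the synthetic data are illustrative choices.

```python
import numpy as np

def lms(X, d, mu=0.01):
    """LMS: w(k+1) = w(k) + mu * e(k) * x(k), with e(k) = d(k) - w(k)^T x(k)."""
    w = np.zeros(X.shape[1])
    for x_k, d_k in zip(X, d):
        e_k = d_k - w @ x_k          # instantaneous error
        w = w + mu * e_k * x_k       # update using the approximate gradient -e(k) x(k)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
d = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(5000)
print(lms(X, d))                     # approaches the Wiener solution R^-1 P
```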

Page 25:

Steepest Descent Convergence

w(k+1) = w(k) + μ (P - R w(k)); let w* be the solution (P = R w*).

Center the weight vector: v = w - w*, so v(k+1) = v(k) - μ R v(k); assume R is nonsingular.

Decorrelate the weight vector: u = Q^-1 v, where R = Q Λ Q^-1 is the transformation that diagonalizes R.

u(k+1) = (I - μΛ) u(k), so u(k) = (I - μΛ)^k u(0).

Condition for convergence: 0 < μ < 2/λ_max.
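A small numeric check of the step-size condition, assuming R is estimated from samples; the data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3)) @ np.diag([1.0, 2.0, 0.5])   # inputs with unequal power
R = (X.T @ X) / X.shape[0]               # sample autocorrelation matrix
lam_max = np.linalg.eigvalsh(R).max()    # largest eigenvalue of R
print("stable step sizes: 0 < mu <", 2.0 / lam_max)
```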

Page 26:

LMS Algorithm Properties

Steepest Descent and LMS algorithm convergence depends on step size and eigenvalues of R.

LMS algorithm is simple to implement.

LMS algorithm convergence is relatively slow.

Tradeoff between convergence speed and excess MSE.

LMS algorithm can track training data that is time varying.

Page 27:

Adaptive MMSE Methods

Training data
– Linear MMSE: LMS, RLS algorithms
– Nonlinear: decision feedback detectors

Blind algorithms
– Second order statistics: minimum output energy methods; reduced-order approximations (PCA, multistage Wiener filter)
– Higher order statistics: cumulants, information based criteria

Page 28:

Designing a learning system

Given a set of training data, design a system that can realize the desired task.

[Block diagram: Inputs → Signal Processing → Feature Extraction → Neural Network → Outputs]

Page 29:

IV. Multilayer Networks

A. Capabilities
– Depend directly on the total number of weights and threshold values.
– A one-hidden-layer network with a sufficient number of hidden units can approximate arbitrarily well any Boolean function, pattern recognition problem, or well-behaved function approximation problem.
– Sigmoidal units are more powerful than linear threshold units.

Page 30:

B. Error backpropagation

Error backpropagation algorithm: methodical way of implementing LMS algorithm for multilayer neural networks.

Two passes: forward pass (computational pass), backward pass (weight correction pass).

Analog computations based on the MSE criterion.

Hidden units are usually sigmoidal units.

Initialization: weights take on small random values.

Algorithm may not converge to the global minimum.

Algorithm converges slower than for linear networks.

Representation is distributed.
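A minimal sketch of the two passes on a one-hidden-layer network with tanh units and an MSE criterion; the XOR data, network size, step size, and epoch count are illustrative assumptions, not the course's example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
D = np.array([[-1], [1], [1], [-1]], dtype=float)             # targets in {-1, +1}

n_in, n_hid, n_out, eta = 2, 4, 1, 0.1
W1 = 0.5 * rng.standard_normal((n_in, n_hid));  b1 = np.zeros(n_hid)   # small random initial weights
W2 = 0.5 * rng.standard_normal((n_hid, n_out)); b2 = np.zeros(n_out)

for epoch in range(5000):
    for x, d in zip(X, D):
        # Forward (computational) pass
        h = np.tanh(x @ W1 + b1)                   # hidden-unit outputs
        y = np.tanh(h @ W2 + b2)                   # network output
        # Backward (weight-correction) pass: propagate error terms (deltas)
        e = d - y
        delta2 = e * (1 - y**2)                    # output-layer error term
        delta1 = (delta2 @ W2.T) * (1 - h**2)      # hidden-layer error term
        W2 += eta * np.outer(h, delta2); b2 += eta * delta2
        W1 += eta * np.outer(x, delta1); b1 += eta * delta1

print(np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2).round(2))       # should approach the XOR targets
```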

Page 31:

BP Algorithm Comments

δ's are error terms computed from the output layer back to the first layer in the dual network.

Training is usually done online.

Examples are presented in random or sequential order.

The update rule is local, as weight changes only involve connections to that weight.

Computational complexity depends on the number of computational units.

Initial weights are randomized to avoid converging to local minima.

Page 32:

BP Algorithm Comment continued

Threshold weights updated in similar manner to other weights (input =1).

Momentum term added to speed up convergence.

Step size set to a small value.

Sigmoidal activation derivatives are simple to compute.

Page 33:

BP Architecture

[Figure: BP architecture: a forward network where the outputs of the computational units are calculated, and a sensitivity network where the error terms are calculated]

Page 34:

Modifications to BP Algorithm

Batch procedure
Variable step size
Better approximation of gradient methods (momentum term, conjugate gradient)
Newton methods (Hessian)
Alternate cost functions
Regularization
Network construction algorithms
Incorporating time

Page 35:

When to stop training

First, major features are captured. As training continues, minor features are captured.

Look at training error.

Cross-validation (training, validation, and test sets)

[Figure: training error and testing error versus training time]

Learning is typically slow and may encounter flat learning areas with little improvement in the energy function.

Page 36:

C. Radial Basis Functions

Use locally receptive units (potential functions).

Transform the input space to the hidden unit space via potential functions. The output unit is linear.

Potential units: φ(x) = exp(-0.5 ||x - c||² / σ²)

[Figure: RBF network, inputs feeding potential units whose outputs feed a linear output unit]
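A minimal sketch of this architecture: Gaussian potential units with fixed centers and width, followed by a linear output unit fit by least squares (a batch stand-in for the second-layer LMS training listed on a later slide); the 1-D target function, centers, and width are illustrative.

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Potential units: phi_j(x) = exp(-0.5 * ||x - c_j||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / sigma**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
d = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)     # noisy target function

centers = np.linspace(-3, 3, 10).reshape(-1, 1)           # fixed centers on a lattice
Phi = rbf_features(X, centers, sigma=0.8)                 # hidden-unit (potential) outputs
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)               # linear output-unit weights
print(np.mean((Phi @ w - d) ** 2))                        # training MSE
```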

Page 37:

Transformation of input space

[Figure: mapping Φ: X → Z from the input space to the feature space, where the X and O classes become separable]

Page 38:

Training Radial basis functions

Use gradient descent on unknown parameters: centers, widths, and output weights

Separate tasks for quicker training: (first layer centers, widths), (second layer weights)

– First layer: fix widths with centers determined from a lattice structure; or fix widths and use a clustering algorithm for the centers; or use a resource allocation network

– Second layer: use LMS to learn weights

Page 39:

Comparisons between RBFs and BP Algorithm

RBF networks have a single hidden layer, while networks trained with the BP algorithm can have many hidden layers.

RBF (potential function) units are locally receptive, versus the BP algorithm's sigmoidal units, which form distributed representations.

RBF networks typically need many more hidden units.

RBF training is typically quicker.

Page 40:

V. Alternate Detection Method

Consider detection methods based on optimum margin classifiers or Support Vector Machines (SVM)

SVM are based on concepts from statistical learning theory.

SVM are easily extended to nonlinear decision regions via kernel functions.

SVM solutions involve solving quadratic programming problems.

Page 41:

Optimal Margin Classifiers

[Figure: linearly separable X and O points with a candidate separating hyperplane]

Given a set of points that are linearly separable: which hyperplane should you choose to separate the points?

Choose the hyperplane that maximizes the distance between the two sets of points.

Page 42:

Finding Optimal Hyperplane

[Figure: convex hulls of the X and O points, the shortest line segment connecting them, and the optimal hyperplane with its margins and normal vector w]

Draw the convex hull around each set of points.

Find the shortest line segment connecting the two convex hulls.

Find the midpoint of the line segment.

The optimal hyperplane intersects the line segment at the midpoint, perpendicular to the line segment.

Page 43:

Alternative Characterization of Optimal Margin Classifiers

[Figure: X and O points with support points u and v on the two margins of the optimal hyperplane, normal vector w, and margin width 2m]

Maximizing the margin is equivalent to minimizing the magnitude of the weight vector:

w^T u + b = 1

w^T v + b = -1

w^T (u - v) = 2

w^T (u - v) / ||w|| = 2 / ||w|| = 2m
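A small numeric illustration of this characterization: for a separating (w, b) scaled so the closest point of each class satisfies w^T x + b = ±1, the margin is 2/||w||; the 2-D points and weights below are made up.

```python
import numpy as np

w = np.array([1.0, 1.0])                     # hyperplane normal
b = -3.0                                     # threshold
X_pos = np.array([[2.0, 2.0], [3.0, 2.0]])   # class +1; closest point u = (2, 2)
X_neg = np.array([[1.0, 1.0], [0.0, 1.0]])   # class -1; closest point v = (1, 1)

print(X_pos @ w + b)                         # [1. 2.]  -> w^T u + b = 1 active at u
print(X_neg @ w + b)                         # [-1. -2.] -> w^T v + b = -1 active at v
print(2.0 / np.linalg.norm(w))               # margin width 2m = 2 / ||w||
```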

Page 44:

Solution in 1 Dimension

O O O O O X O X X O X X X

Points on wrong side of hyperplane

If C is large, SVs include only points near or on the wrong side of the margin.

If C is small SV include all points (scaled MMSE solution)

Note that weight vector depends most heavily on outer support vectors.

Page 45:

Comments on 1 Dimensional Solution

A simple algorithm can be implemented to solve the 1-D problem.

The solution in multiple dimensions is to find the weight vector and then project down to 1-D.

The minimum probability of error threshold depends on the likelihood ratio.

The MMSE solution depends on all points, whereas the SVM depends on the SVs (points that are under the margin), which is closer to the minimum probability of error solution.

Minimum probability of error, MMSE, and SVM solutions in general give different detectors.

Page 46:

Kernel Methods

In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to the "curse of dimensionality".

Solution: use kernel methods, where computations are done in the dual observation space.

[Figure: mapping Φ: X → Z from the input space to the feature space]
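A minimal sketch of the kernel idea: feature-space inner products are computed directly from the observations without forming Φ explicitly, here with a Gaussian kernel; the points and kernel parameter are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x1_i - x2_j||^2) = <Phi(x1_i), Phi(x2_j)> in feature space."""
    d2 = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)              # Gram matrix of all pairwise feature-space inner products
print(K)
```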

Page 47:

Solving QP problem

SVMs require solving large QP problems. However, many α's are zero (not support vectors). Break up the QP into subproblems.

Chunking (Vapnik 1979): numerical solution.

Osuna algorithm (1997): numerical solution.

Platt algorithm (1998): Sequential Minimal Optimization (SMO), analytical solution.

Page 48:

SMO Algorithm

Sequential Minimal Optimization breaks up the QP program into small subproblems that are solved analytically.

SMO solves the dual QP SVM problem by examining points that violate the KKT conditions.

The algorithm converges and consists of:
Search for 2 points that violate the KKT conditions.
Solve the QP problem for the 2 points.
Calculate the threshold value b.
Continue until all points satisfy the KKT conditions.

On numerous benchmarks the time to convergence of SMO varied from O(l) to O(l^2.2). Convergence time depends on the difficulty of the classification problem and the kernel functions used.

Page 49:

SVM Summary

SVM are based on optimum margin classifiers and are solved using quadratic programming methods.

SVM are easily extended to problems that are not linearly separable.

SVM can create nonlinear separating surfaces via kernel functions.

SVM can be efficiently programmed via the SMO algorithm.

SVM can be extended to solve regression problems.

Page 50:

VI. Unsupervised Learning

A. Motivation

Given a set of training examples with no teacher or critic, why do we learn?

Feature extraction
Data compression
Signal detection and recovery
Self organization

Information about the data can be found from the inputs.

Page 51:

B. Principal Component Analysis

Introduction

Consider a zero mean random vector x ∈ R^n with autocorrelation matrix R = E(xx^T).

R has eigenvectors q(1), …, q(n) and associated eigenvalues λ(1) ≥ … ≥ λ(n).

Let Q = [q(1) | … | q(n)] and let Λ be a diagonal matrix containing the eigenvalues along its diagonal.

Then R = Q Λ Q^T is the eigenvector/eigenvalue decomposition of R.
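A minimal batch sketch of this decomposition, estimating R from samples and using numpy's symmetric eigensolver; the synthetic correlated data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2)) @ np.array([[2.0, 0.0], [1.2, 0.3]])  # zero-mean, correlated samples
R = (X.T @ X) / X.shape[0]                      # sample autocorrelation matrix R = E[x x^T]
lam, Q = np.linalg.eigh(R)                      # eigenvalues (ascending) and orthonormal eigenvectors
lam, Q = lam[::-1], Q[:, ::-1]                  # reorder so lambda(1) >= ... >= lambda(n)
print(lam)                                      # diagonal of Lambda
print(np.allclose(Q @ np.diag(lam) @ Q.T, R))   # R = Q Lambda Q^T
Y = X @ Q                                       # projections y = Q^T x for every sample
```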

Page 52:

First Principal Component

Find max x^T R x subject to ||x|| = 1. The maximum is obtained when x = q(1), as this corresponds to x^T R x = λ(1).

q(1) is the first principal component of x and also yields the direction of maximum variance.

y(1) = q(1)^T x is the projection of x onto the first principal component.

[Figure: vector x projected onto the direction q(1), giving y(1)]

Page 53:

Other Principal Components

The ith principal component is denoted by q(i) and its projection by y(i) = q(i)^T x, with E(y(i)) = 0 and E(y(i)²) = λ(i).

Note that y = Q^T x, and we can obtain the data vector x from y by noting that x = Q y.

We can approximate x by taking the first m principal components (PCs) to get z: z = y(1) q(1) + … + y(m) q(m). The error is given by e = x - z; e is orthogonal to q(i) when 1 ≤ i ≤ m.

Page 54:

Diagram of PCA

[Figure: scatter of data points with the first and second PC directions overlaid]

The first PC gives more information than the second PC.

Page 55:

Learning algorithms for PCA

Hebbian learning rule: when the presynaptic and postsynaptic signals are positive, the weight associated with the synapse increases in strength.

Δw = η x y

[Figure: single synapse with input x, weight w, and output y]

Page 56:

Oja’s rule

Use a normalized Hebbian rule applied to a linear neuron.

[Figure: linear neuron with inputs x, weights w, and output y = s = w^T x]

A normalized Hebbian rule is needed; otherwise the weight vector will grow unbounded.

Page 57:

Oja’s rule continued

w_i(k+1) = w_i(k) + η x_i(k) y(k)   (apply Hebbian rule)

w(k+1) ← w(k+1) / ||w(k+1)||   (renormalize the weight vector)

Unfortunately the above rule is difficult to implement, so a modification approximates it, giving

w_i(k+1) = w_i(k) + η y(k) (x_i(k) - y(k) w_i(k))

This is similar to the Hebbian rule with a modified input.

It can be shown that w(k) → q(1) with probability one, given that x(k) is zero mean, second order, and drawn from a fixed distribution.
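A minimal sketch of the approximate (Oja) update above on zero-mean data, compared against the first eigenvector of the sample correlation matrix; the step size and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2)) @ np.array([[2.0, 0.0], [1.2, 0.3]])  # zero-mean inputs
w, eta = rng.standard_normal(2), 0.001
for x in X:
    y = w @ x
    w = w + eta * y * (x - y * w)          # Oja's rule: Hebbian term with a normalizing correction

R = (X.T @ X) / X.shape[0]
q1 = np.linalg.eigh(R)[1][:, -1]           # eigenvector of the largest eigenvalue, i.e. q(1)
print(w / np.linalg.norm(w), q1)           # w converges to +/- q(1)
```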

Page 58:

Learning other PCs

Adaptive learning rules (subtract larger PCs out)
– Generalized Hebbian Algorithm
– APEX

Batch algorithm (singular value decomposition)
– Approximate the correlation matrix R with time averages.

Page 59:

Applications of PCA

Matched filter problem: x(k) = s(k) + v(k)
Multiuser communications: CDMA
Image coding (data compression)

[Block diagram: image coding with PCA (GHA) followed by a quantizer]

Page 60:

Kernel Methods

In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to the "curse of dimensionality".

Solution: use kernel methods, where computations are done in the dual observation space.

[Figure: mapping Φ: X → Z from the input space to the feature space]

Page 61:

C. Independent Component Analysis

PCA decorrelates inputs. However in many instances we may want to make outputs independent.

[Block diagram: independent sources U pass through mixing matrix A to give observations X; demixing matrix W produces outputs Y]

Inputs U are assumed independent and the user observes X. The goal is to find W so that Y is independent.

Page 62:

ICA Solution

Y = DPU where D is a diagonal matrix and P is a permutation matrix.

The algorithm is unsupervised. Under what assumptions is learning possible? All components of U except possibly one must be non-Gaussian.

Establish a criterion to learn from (use higher order statistics): information based criteria, kurtosis function.

Kullback–Leibler divergence:

D(f, g) = ∫ f(x) log (f(x)/g(x)) dx

Page 63:

ICA Information Criterion

The Kullback–Leibler divergence is nonnegative. Set f to the joint density of Y and g to the product of the marginals of Y; then

D(f, g) = -H(Y) + Σ H(Y_i)

which is minimized when the components of Y are independent.

When outputs are independent they can be a permutation and scaled version of U.

Page 64:

Learning Algorithms

Can learn weights by approximating divergence cost function using contrast functions.

Iterative gradient estimate algorithms can be used.

Faster convergence can be achieved with fixed-point algorithms that approximate Newton's method.

Algorithms have been shown to converge.
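A minimal one-unit fixed-point sketch in the spirit of the fixed-point (FastICA-style) algorithms mentioned above, using a tanh contrast on whitened data; the contrast choice, whitening step, and two-source demo are illustrative assumptions.

```python
import numpy as np

def fixed_point_one_unit(Z, n_iter=200, tol=1e-8, seed=0):
    """One-unit fixed-point ICA update on whitened data Z (dimensions x samples)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        u = w @ Z                                          # projections
        g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2     # contrast and its derivative
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w  # fixed-point (approximate Newton) step
        w_new /= np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < tol        # converged up to a sign flip
        w = w_new
        if converged:
            break
    return w

rng = np.random.default_rng(1)
U = rng.uniform(-1, 1, size=(2, 5000))           # independent non-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])           # mixing matrix
X = A @ U
X -= X.mean(axis=1, keepdims=True)
lam, Q = np.linalg.eigh(np.cov(X))               # whiten: Z = Lambda^(-1/2) Q^T X
Z = np.diag(lam ** -0.5) @ Q.T @ X
w = fixed_point_one_unit(Z)
y = w @ Z                                        # one recovered component (up to scale and sign)
```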

Page 65:

Applications of ICA

Array antenna processing

Blind source separation: speech separation, biomedical signals, financial data

Page 66:

D. Competitive Learning

Motivation: Neurons compete with one another with only one winner emerging.

The brain is a topologically ordered computational map. Arrays of neurons self-organize.

Generalized competitive learning algorithm:
1) Initialize weights.
2) Randomly choose an input.
3) Pick the winner.
4) Update the weights associated with the winner.
5) Go to 2).

Page 67:

Competitive Learning Algorithm

K-means algorithm (no topological ordering; see the sketch after this list)
– Online algorithm
– Update centers
– Reclassify points
– Converges to a local minimum

Kohonen Self-Organizing Feature Map (topological ordering)
– Neurons arranged on a lattice
– Weights that are updated depend on the winner, step size, and neighborhood
– Decrease the step size and neighborhood size to get topological ordering
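A minimal batch-style K-means sketch following the outline above (the slide's online variant would instead update the winning center after each input); the cluster count and synthetic data are illustrative.

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Batch K-means: assign each point to the nearest center (the winner), then update the centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]        # initialize centers from the data
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                           # pick the winner for each point
        for k in range(K):                                   # update the centers
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(2, 0.3, (100, 2))])
centers, labels = kmeans(X, K=2)
print(centers)           # converges to a local minimum of the within-cluster squared error
```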

Page 68:

KSOFM 2 dimensional lattice

Page 69:

Neural Network Applications

Backgammon (feedforward network)
– 459-24-24-1 network to rate moves
– Hand-crafted examples; noise helped in training
– 59% winning percentage against SUN gammontools
– Later versions used reinforcement learning

Handwritten zip code (feedforward network)
– 16-768-192-30-10 network to distinguish numbers
– Preprocessed data; 2 hidden layers act as feature detectors
– 7291 training examples, 2000 test examples
– Errors: 0.14% on training data, 5% on test data; with a reject option, 1% error at 12% reject

Page 70:

Neural Network Applications

Speech recognition
– KSOFM map followed by a feedforward neural network
– 40 to 120 frames mapped onto a 12-by-12 Kohonen map
– Each frame composed of a 600- to 1800-element analog vector
– Output of the Kohonen map fed to the feedforward network
– Reduced search using the KSOFM map
– TI 20-word database: 98-99% correct on speaker-dependent classification

Page 71:

Other topics

Reinforcement learning
Associative networks
Neural dynamics and control
Computational learning theory
Bayesian learning
Neuroscience
Cognitive science
Hardware implementation