Neural Networks and Kernel Methods


Page 1:

Neural Networks and Kernel Methods

Page 2:

How are we doing on the pass sequence?

• We can now track both men, provided with– Hand-labeled coordinates of both men in 30 frames– Hand-extracted features (stripe detector, white blob

detector)– Hand-labeled classes for the white-shirt tracker

• We have a framework for how to optimally make decisions and track the men

Generally, this will take a lot longer than 24 hours… We need to avoid doing this by hand!

Page 3:

Recall: Multi-input linear regression

y(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + … + wM φM(x)

• x can be an entire scan-line or image!

• We could try to uniformly distribute basis functions in the input space:

• This is futile, because of the curse of dimensionality (a quick count below makes this concrete)

x = entire scan line
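To make the curse-of-dimensionality count concrete, here is a throwaway Python sketch; the grid sizes are illustrative, with D = 320 matching the scan-line input used later in these slides.

```python
# Basis functions on a uniform grid: k centres per dimension, D input dimensions,
# so k**D basis functions in total.
for k, D in [(10, 1), (10, 2), (10, 5), (10, 320)]:
    print(f"k={k}, D={D}: {k ** D} basis functions")
```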

Page 4:

Neural networks and kernel methods

Two main approaches to avoiding the curse of dimensionality:

– “Neural networks”
  • Parameterize the basis functions and learn their locations
  • Can be nested to create a hierarchy
  • Regularize the parameters or use Bayesian learning

– “Kernel methods”
  • The basis functions are associated with data points, limiting complexity
  • A subset of data points may be selected to further limit complexity

Page 5:

Neural networks and kernel methods

Two main approaches to avoiding the curse of dimensionality:

– “Neural networks”
  • Parameterize the basis functions and learn their locations
  • Can be nested to create a hierarchy
  • Regularize the parameters or use Bayesian learning

– “Kernel methods”
  • The basis functions are associated with data points, limiting complexity
  • A subset of data points may be selected to further limit complexity

Page 6:

Two-layer neural networks

• Before, we used y(x, w) = w0 + Σ_j wj φj(x)

• Replace each φj with a variable zj, where zj = h( Σ_i wji xi ) and h(·) is a fixed activation function

• The outputs are obtained from yk = σ( Σ_j wkj zj ), where σ(·) is another fixed function

• In all, we have (simplifying biases): yk(x, w) = σ( Σ_j wkj h( Σ_i wji xi ) )
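The composed model above maps directly to code. Below is a minimal NumPy sketch of the two-layer forward pass yk(x, w) = σ( Σ_j wkj h( Σ_i wji xi ) ); the sizes, tanh hidden units, and logistic output are illustrative choices, not something fixed by the slide.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer network: hidden activations z = h(W1 x), outputs y = sigma(W2 z).

    x  : (D,)   input vector (e.g. an entire scan line)
    W1 : (M, D) first-layer weights (biases omitted, as on the slide)
    W2 : (K, M) second-layer weights
    """
    a = W1 @ x                       # first-layer pre-activations a_j
    z = np.tanh(a)                   # hidden units z_j = h(a_j), here h = tanh
    b = W2 @ z                       # second-layer pre-activations
    y = 1.0 / (1.0 + np.exp(-b))     # outputs y_k = sigma(b_k), here sigma = logistic
    return y

# Example with arbitrary sizes: D = 320 inputs, M = 5 hidden units, K = 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=320)
W1 = 0.1 * rng.normal(size=(5, 320))
W2 = 0.1 * rng.normal(size=(1, 5))
print(forward(x, W1, W2))
```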

Page 7:

Typical activation functions

• Logistic sigmoid, aka logit:

h(a) = σ(a) = 1/(1 + e^(-a))

• Hyperbolic tangent:

h(a) = tanh(a) = (e^a - e^(-a)) / (e^a + e^(-a))

• Cumulative Gaussian (error function):

h(a) = 2 ∫_{-∞}^{a} N(x|0,1) dx - 1

– This one has a lighter tail

[Plot: the three activation functions h(a) vs a, normalized to have the same range and slope at a = 0; a second panel shows the same curves with h on a log scale.]
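A small sketch of the three activation functions above, using math.erf for the cumulative Gaussian (since 2 ∫_{-∞}^{a} N(x|0,1) dx - 1 = erf(a/√2)); the sample points are arbitrary.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def hyperbolic_tangent(a):
    return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a))

def cumulative_gaussian(a):
    # 2 * Phi(a) - 1, where Phi is the standard normal CDF; equals erf(a / sqrt(2))
    return math.erf(a / math.sqrt(2.0))

for a in (-2.0, 0.0, 2.0):
    print(a, logistic(a), hyperbolic_tangent(a), cumulative_gaussian(a))
```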

Page 8:

Examples of functions learned by a neural network(3 tanh hidden units, one linear output unit)

Page 9:

Multi-layer neural networks

• Only weights corresponding to the feed-forward topology are instantiated

• The sum is over those values of j with instantiated weights wkj

From now on, we’ll denote all activation functions by h

Page 10:

Learning neural networks

• As for regression, we consider a squared error cost function:

E(w) = ½ Σ_n Σ_k ( tnk - yk(xn, w) )²

which corresponds to a Gaussian density p(t|x)

• We can substitute yk(x, w) = σ( Σ_j wkj h( Σ_i wji xi ) ) and use a general-purpose optimizer to estimate w, but it is illustrative and useful to study the derivatives of E…
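As the slide suggests, one option is simply to hand E(w) to a general-purpose optimizer. A minimal sketch of that route using scipy.optimize.minimize with numerically estimated gradients; the toy data, network sizes, and tanh/linear choices are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(w, D, M, K):
    """Split a flat parameter vector into the two weight matrices."""
    W1 = w[:M * D].reshape(M, D)
    W2 = w[M * D:].reshape(K, M)
    return W1, W2

def cost(w, X, T, D, M, K):
    W1, W2 = unpack(w, D, M, K)
    Y = np.tanh(X @ W1.T) @ W2.T        # tanh hidden units, linear outputs y_k(x_n, w)
    return 0.5 * np.sum((T - Y) ** 2)   # E(w) = 1/2 sum_n sum_k (t_nk - y_k)^2

D, M, K, N = 3, 4, 1, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
T = np.sin(X[:, :1])                    # toy targets
w0 = 0.1 * rng.normal(size=M * D + K * M)
res = minimize(cost, w0, args=(X, T, D, M, K), method="L-BFGS-B")
print("final E(w) =", res.fun)
```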

Page 11:

Learning neural networks

E(w) = ½ Σ_n Σ_k ( tnk - yk(xn, w) )²

• Recall that for linear regression:

∂E(w)/∂wm = -Σ_n ( tn - yn ) xnm

• We’ll use the chain rule of differentiation to derive a similar-looking expression, where
  – Local input signals are forward-propagated from the input
  – Local error signals are back-propagated from the output

[Diagram: the weight sits in between the back-propagated error signal and the forward-propagated input signal.]

Page 12:

Local signals needed for learning

• For clarity, consider the error for one training case: En(w) = ½ Σ_k ( tnk - yk(xn, w) )²

• To compute ∂En/∂wji, note that wji appears in only one term of the overall expression, namely aj = Σ_i wji zi

• Using the chain rule of differentiation, we have ∂En/∂wji = ( ∂En/∂aj )( ∂aj/∂wji ) = δj zi, where δj ≡ ∂En/∂aj (if wji is in the 1st layer, zi is actually input xi)

[Diagram: ∂En/∂wji = δj zi, labeled as the weight, the local error signal, and the local input signal.]

Page 13:

Forward-propagating local input signals

• Forward propagation gives all the a’s and z’s

Page 14:

Back-propagating local error signals

• Back-propagation gives all the δ’s

[Diagram: error signals propagated backward from the output units, whose targets are t1 and t2.]

Page 15:

Back-propagating error signals

• To compute ∂En/∂aj (that is, δj), note that aj appears in all those expressions ak = Σ_i wki h(ai) that depend on aj

• Using the chain rule, we have ∂En/∂aj = Σ_k ( ∂En/∂ak )( ∂ak/∂aj )

• The sum is over k s.t. unit j is connected to unit k, and for each such term, ∂ak/∂aj = wkj h'(aj)

• Noting that ∂En/∂ak = δk, we get the back-propagation rule: δj = h'(aj) Σ_k wkj δk

• For output units: δk = yk - tk

Page 16:

Putting the propagations together

• For each training case n, apply forward propagation and back-propagation to compute ∂En/∂wji for each weight wji

• Sum these over training cases to compute ∂E/∂wji = Σ_n ∂En/∂wji

• Use these derivatives for steepest descent learning or as input to a conjugate gradients optimizer, etc.

• On-line learning: After each pattern presentation, use the above gradient to update the weights
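Putting the two propagations into code, the sketch below runs one forward and one backward pass for a two-layer network with tanh hidden units and linear outputs (so δk = yk - tk), and checks one derivative against a finite difference; the sizes and random data are placeholders.

```python
import numpy as np

def backprop(x, t, W1, W2):
    # Forward propagation: all the a's and z's
    a = W1 @ x
    z = np.tanh(a)
    y = W2 @ z                                        # linear output units
    # Back-propagation: all the deltas
    delta_out = y - t                                 # output units: delta_k = y_k - t_k
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)   # delta_j = h'(a_j) sum_k w_kj delta_k
    # dE_n/dw_ji = delta_j * z_i (or x_i for the first layer)
    dW2 = np.outer(delta_out, z)
    dW1 = np.outer(delta_hid, x)
    E = 0.5 * np.sum((t - y) ** 2)
    return E, dW1, dW2

rng = np.random.default_rng(1)
x, t = rng.normal(size=4), rng.normal(size=2)
W1, W2 = 0.3 * rng.normal(size=(3, 4)), 0.3 * rng.normal(size=(2, 3))
E, dW1, dW2 = backprop(x, t, W1, W2)

# Finite-difference check on one first-layer weight
eps, (i, j) = 1e-6, (0, 0)
W1p = W1.copy(); W1p[i, j] += eps
Ep, _, _ = backprop(x, t, W1p, W2)
print(dW1[i, j], (Ep - E) / eps)   # the two numbers should agree closely
```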

Page 17:

The number of hidden units determines the complexity of the learned function

(M = # hidden units)

Page 18:

The effect of local minima

• Because of random weight initialization, each training run will find a different solution

[Plot: validation error vs M for several training runs.]

Page 19:

Regularizing neural networks

Demonstration of over-fitting (M = # hidden units)

Page 20:

Regularizing neural networks

• Use cross-validation to select the network architecture (number of layers, number of units per layer)

• Add to E a term (λ/2) Σ_ji wji² that penalizes large weights, so the regularized error is E(w) + (λ/2) Σ_ji wji²; use cross-validation to select λ (see the sketch after this list)

• Use early-stopping and cross-validation (next slide)

• Take a Bayesian approach: Put a prior on the w’s and integrate over them to make predictions

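For the weight-decay option, the penalty (λ/2) Σ_ji wji² simply adds λ·wji to each derivative. A one-step sketch; λ and the learning rate below are placeholders that would be chosen by cross-validation in practice.

```python
import numpy as np

def penalized_gradient(dE_dW, W, lam):
    """Gradient of E(W) + (lam/2) * sum of squared weights."""
    return dE_dW + lam * W

# One steepest-descent step with weight decay (toy numbers)
lam, eta = 0.01, 0.1                              # regularization strength, learning rate
W = np.array([[0.5, -1.2], [2.0, 0.3]])
dE_dW = np.array([[0.1, 0.0], [-0.2, 0.4]])       # pretend these came from back-propagation
W -= eta * penalized_gradient(dE_dW, W, lam)
print(W)
```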

Page 21:

Early stopping

• The weights start at small values and grow

• Perhaps the number of learning iterations is a surrogate for model complexity?

• This works for some learning tasks (a generic stopping loop is sketched below)

[Plot: training error and validation error vs number of learning iterations.]
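A generic early-stopping loop might look like the sketch below; step and val_error are assumed to be supplied (one gradient update and the validation-set error), and the patience threshold is an arbitrary choice.

```python
import numpy as np

def train_with_early_stopping(step, val_error, w0, max_iters=1000, patience=10):
    """Keep the weights that gave the lowest validation error.

    step(w)      -> new weights after one gradient update (assumed supplied)
    val_error(w) -> error on the held-out validation set (assumed supplied)
    """
    w, best_w = w0, w0
    best_err, since_best = val_error(w0), 0
    for _ in range(max_iters):
        w = step(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:   # validation error has stopped improving
                break
    return best_w, best_err

# Toy usage: minimize (w - 3)^2 with a noisy "validation" error
rng = np.random.default_rng(0)
step = lambda w: w - 0.1 * 2 * (w - 3.0)
val_error = lambda w: (w - 3.0) ** 2 + 0.01 * rng.normal()
print(train_with_early_stopping(step, val_error, w0=0.0))
```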

Page 22:

Can we use a standard neural network to automatically learn the features needed for tracking?

• x is 320-dimensional, so the number of parameters would be at least 320

• We have only 15 data points (setting aside 15 for cross-validation), so over-fitting will be an issue

• We could try weight decay, Bayesian learning, etc., but a little thinking reveals that our approach is wrong…

• In fact, we want the weights connecting different positions in the scan line to use the same feature (e.g., stripes)

x = entire scan line

Page 23:

Convolutional neural networks

• Recall that a short portion of the scan line was sufficient for tracking the striped shirt
• We can use this idea to build a convolutional network

Same set of weights used for all hidden units

With constrained weights, the number of free parameters is now only ~ one dozen, so…

We can use Bayesian/regularized learning to automatically learn the features
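A minimal sketch of the weight-sharing idea for a 1-D scan line: every hidden unit applies the same small weight vector to a different window of the input, so the only free parameters are the window weights. The window length and the stripe-like weights below are made up for illustration.

```python
import numpy as np

def conv1d_hidden_layer(x, w, b=0.0):
    """Hidden units z_j = h(sum_i w_i * x_{j+i} + b): the same weights w for every unit."""
    W = len(w)
    windows = np.lib.stride_tricks.sliding_window_view(x, W)  # shape (len(x) - W + 1, W)
    return np.tanh(windows @ w + b)

x = np.random.default_rng(0).normal(size=320)            # entire scan line
w = np.array([0.5, -1.0, 0.5, -1.0, 0.5, -1.0, 0.5])     # a handful of shared weights
z = conv1d_hidden_layer(x, w)
print(z.shape)   # one hidden unit per window position, all sharing the same w
```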

Page 24:

Convolutional neural networks in 2-D(from Le Cun et al, 1989)

Page 25:

Neural networks and kernel methods

Two main approaches to avoiding the curse of dimensionality:

– “Neural networks”
  • Parameterize the basis functions and learn their locations
  • Can be nested to create a hierarchy
  • Regularize the parameters or use Bayesian learning

– “Kernel methods”
  • The basis functions are associated with data points, limiting complexity
  • A subset of data points may be selected to further limit complexity

Page 26:

Kernel methods

• Basis functions offer a way to enrich the feature space, making simple methods (such as linear regression and linear classifiers) much more powerful

• Example: Input x; features x, x², x³, sin(x), …

• There are two problems with this approach:

– Computational efficiency: Generally, the appropriate features are not known, so there is a huge (possibly infinite) number of them to search over

– Regularization: Even if we could search over the huge number of features, how can we select appropriate features so as to prevent overfitting?

• The kernel framework enables efficient approaches to both problems

Page 27:

Kernel methods

[Diagram: inputs x1, x2 mapped into a feature space with coordinates φ1, φ2.]

Page 28:

Definition of a kernel

• Suppose φ(x) is a mapping from the D-dimensional input vector x to a high (possibly infinite) dimensional feature space

• Many simple methods rely on inner products of feature vectors, φ(x1)^T φ(x2)

• For certain feature spaces, the “kernel trick” can be used to compute φ(x1)^T φ(x2) using the input vectors directly:

φ(x1)^T φ(x2) = k(x1, x2)

• k(x1, x2) is referred to as a kernel

• If a function satisfies “Mercer’s conditions” (see textbook), it can be used as a kernel
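A tiny illustration of the kernel trick (the polynomial kernel here is an example of my choosing, not one from the slide): for k(x1, x2) = (x1^T x2)², the explicit feature map φ contains every pairwise product xi·xj, yet the kernel never constructs it.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x1, x2) = (x1^T x2)^2: all products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=5), rng.normal(size=5)

lhs = phi(x1) @ phi(x2)        # inner product in the D^2-dimensional feature space
rhs = (x1 @ x2) ** 2           # kernel evaluated directly on the inputs
print(np.isclose(lhs, rhs))    # True: same number, without building phi explicitly
```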

Page 29:

Examples of kernels

• k(x1, x2) = x1^T x2

• k(x1, x2) = x1^T Σ x2   (Σ is symmetric positive definite)

• k(x1, x2) = exp( -||x1 - x2||² )

• k(x1, x2) = exp( -½ (x1 - x2)^T Σ⁻¹ (x1 - x2) )   (Σ is symmetric positive definite)

• k(x1, x2) = p(x1)p(x2)
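As a quick check that such a kernel yields a valid Gram matrix, the sketch below builds K for the Gaussian kernel and verifies symmetry and positive semi-definiteness; the length scale and data are placeholders.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                       # six 2-D inputs
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])

print(np.allclose(K, K.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # eigenvalues >= 0 (positive semi-definite)
```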

Page 30:

Gaussian processes

• Recall that for linear regression: y(x, w) = Σ_m wm φm(x) = w^T φ(x)

• Using a design matrix Φ, with Φnm = φm(xn), our prediction vector is y = Φ w

• Let’s use a simple prior on w: p(w) = N(w | 0, α⁻¹ I)

• Then p(y) = N(y | 0, α⁻¹ Φ Φ^T) = N(y | 0, K)

• K is called the Gram matrix, where Knm = k(xn, xm) = α⁻¹ φ(xn)^T φ(xm)

• Result: The correlation between two predictions equals the kernel evaluated for the corresponding inputs (a small sampling sketch follows)

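Since p(y) = N(y | 0, K), drawing sample functions from the prior only needs the Gram matrix. A minimal sketch; the Gaussian kernel, α = 1, and the jitter term are assumptions.

```python
import numpy as np

def sample_prior(X, k, alpha=1.0, n_samples=3):
    """Draw y ~ N(0, K) at the inputs X, where K_nm = k(x_n, x_m) / alpha."""
    K = np.array([[k(a, b) for b in X] for a in X]) / alpha
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))   # small jitter for numerical stability
    return L @ np.random.default_rng(0).normal(size=(len(X), n_samples))

k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)   # Gaussian kernel on scalar inputs
X = np.linspace(-3, 3, 50)
print(sample_prior(X, k).shape)                # (50, 3): three sample functions evaluated at X
```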

Page 31:

Gaussian processes: “Learning” and prediction

• As before, we assume Gaussian observation noise: p(tn | yn) = N(tn | yn, β⁻¹)

• The target vector likelihood is p(t | y) = N(t | y, β⁻¹ I)

• Using p(y) = N(y | 0, K), we can obtain the marginal predictive distribution over targets: p(t) = N(t | 0, C), where C = K + β⁻¹ I

• Predictions are based on p(tN+1 | t), where the joint covariance of (t, tN+1) has blocks C (as above), k with kn = k(xn, xN+1), and c = k(xN+1, xN+1) + β⁻¹

• p(tN+1 | t) is Gaussian with mean k^T C⁻¹ t and variance c - k^T C⁻¹ k
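The prediction formulas above translate into a few lines of NumPy; the Gaussian kernel on scalars, the noise level β, and the toy data are placeholders.

```python
import numpy as np

def gp_predict(X, t, x_star, k, beta):
    """Predictive mean and variance of t_star given training inputs X and targets t."""
    N = len(X)
    C = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)]) + np.eye(N) / beta
    kvec = np.array([k(X[i], x_star) for i in range(N)])
    c = k(x_star, x_star) + 1.0 / beta
    mean = kvec @ np.linalg.solve(C, t)           # k^T C^-1 t
    var = c - kvec @ np.linalg.solve(C, kvec)     # c - k^T C^-1 k
    return mean, var

k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)      # Gaussian kernel on scalar inputs
X = np.array([-1.0, 0.0, 1.0])
t = np.sin(X)
print(gp_predict(X, t, 0.5, k, beta=25.0))
```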

Page 32:

Example: Samples from the prior p(y)

Page 33:

Example: Learning and prediction

Page 34:

Sparse kernel methods and SVMs

• Idea: Identify a small number of training cases, called support vectors, which are used to make predictions

• See textbook for details

[Figure: support vectors.]
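For a concrete picture of support vectors, here is a quick illustration with scikit-learn's SVC, assuming that library is available; the two Gaussian blobs are made-up data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)), rng.normal(loc=+2, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

svm = SVC(kernel="rbf").fit(X, y)
# Only the support vectors are needed to make predictions
print(len(svm.support_vectors_), "of", len(X), "training cases are support vectors")
```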

Page 35:

Questions?

Page 36:

How are we doing on the pass sequence?

• We can now automatically learn the features needed to track both people

Same set of weights used for all hidden units

Page 37:

How are we doing on the pass sequence?

Pretty good! We can now automatically learn the features needed to track both people

But it sucks that we need to hand-label the coordinates of both men in 30 frames and hand-label the two classes for the white-shirt tracker

Same set of weights used for all hidden units

Page 38:
Page 39:

Lecture 5 Appendix

Page 40:

Constructing kernels

• Provided with a kernel or a set of kernels, we can construct new kernels using any of the rules: