
Page 1

Perceptron: from Minsky & Papert (1969)

Retina with input pattern

Non-adaptive local feature detectors (preprocessing)

Trainable evidence weigher

Strictly bottom-up feature processing.

Page 2

Teacher: Correct output for trial 1 is 1. Correct output for trial 2 is 1. Correct output for trial 3 is 0. ... Correct output for trial n is 0.

"Supervised" learning training paradigm

Call the correct answers the desired outputs and represent them by the symbol "d".

Call the outputs from the network the actual outputs and represent them by the symbol "a".

Adjust weights to reduce error.

Page 3

From a neuron to a linear threshold unit.

Page 4

Trivial Data Set

No generalization needed.

Not linearly separable.

Odd bit parity.

Page 5

Perceptron Learning Rule
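The rule itself is on the slide image. In the deck's notation (d = desired output, a = actual output), the standard perceptron rule, with learning rate η, is:

$$\Delta w_i = \eta\,(d - a)\,x_i$$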

Page 6

More Compact Form

Page 7

An Alternative Representation

Page 8

Learning the AND function

Perceptron learning algorithm
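A runnable sketch of the algorithm on this task (the learning rate, zero initialization, bias handling, and epoch cap are assumptions, not from the slides):

    # Minimal perceptron learning sketch for the AND function.
    # Bias is folded in as a weight on a constant +1 input.

    def step(x):
        return 1 if x >= 0 else 0

    # Training set for AND: inputs augmented with a constant 1 for the bias.
    data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]

    w = [0.0, 0.0, 0.0]
    eta = 0.1

    for epoch in range(100):
        errors = 0
        for x, d in data:
            a = step(sum(wi * xi for wi, xi in zip(w, x)))  # actual output
            for i in range(len(w)):
                w[i] += eta * (d - a) * x[i]                # perceptron rule
            errors += int(a != d)
        if errors == 0:      # converged: all four cases classified correctly
            break

    print(w)  # weights that put only (1,1) on the positive side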

Page 9

Linear Separability

Perceptron with 2 inputs (figure labels: inputs, threshold).

Solve for y:

Equation for a straight line. Adjustable weights control slope and intercept.
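A reconstruction of the algebra the slide sketches, calling the two inputs x and y and the threshold θ (symbol names assumed): at the decision boundary,

$$w_1 x + w_2 y = \theta \quad\Longrightarrow\quad y = -\frac{w_1}{w_2}\,x + \frac{\theta}{w_2},$$

a straight line whose slope and intercept are set by the weights.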

Page 10

AND vs XOR

Try to draw a straight line through positive and negative instances.

Geometric illustration of linear separability (or the lack of it).

Page 11

Algebraic proof that XOR is not linearly separable

Page 12

Algebraic proof that XOR is not linearly separable (continued)

Greater than zero. Impossible, given the above.
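The algebra itself is on the slide image; a standard reconstruction, assuming a unit that outputs 1 when $w_1 x_1 + w_2 x_2 > \theta$:

$$\begin{aligned}
(0,0)\mapsto 0 &: \quad 0 \le \theta \\
(1,0)\mapsto 1 &: \quad w_1 > \theta \\
(0,1)\mapsto 1 &: \quad w_2 > \theta \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 \le \theta
\end{aligned}$$

Adding the two middle inequalities gives $w_1 + w_2 > 2\theta$; combined with the last line this forces $2\theta < \theta$, i.e. $\theta < 0$, contradicting the first line. Impossible, as the slide says.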

Page 13

Thresholds can be represented as weights
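The trick the slide illustrates, written out: compare against zero instead of θ by giving the unit an extra input clamped at 1 with weight $w_0 = -\theta$ (the bias):

$$\sum_i w_i x_i > \theta \iff \sum_i w_i x_i + w_0 \cdot 1 > 0.$$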

Page 14

Linear component of the LTU (the linear discriminant function):

$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$

At the decision boundary: $y(\mathbf{x}) = 0$.

The separating plane is also called the decision boundary.

Page 15

Properties of the linear decision boundary

1. The decision boundary is orthogonal to the weight vector.

2. Will give a formula for the distance of the decision boundary to the origin.

Page 16

Geometry of the linear decision boundary (Bishop)

Points in x space: let $\mathbf{x}_A$ and $\mathbf{x}_B$ be two points on the hyperplane. Then $y(\mathbf{x}_A) = 0 = y(\mathbf{x}_B)$, and therefore

$$\mathbf{w}^T(\mathbf{x}_B - \mathbf{x}_A) = 0.$$

Page 17

$\mathbf{x}_A - \mathbf{x}_B$ is parallel to the decision boundary (interpretation of vector subtraction), so $\mathbf{w}^T(\mathbf{x}_B - \mathbf{x}_A) = 0$ implies that the weight vector is perpendicular to the decision boundary.

Page 18

Distance to the hyperplane: project a point $\mathbf{x}$ lying on the hyperplane onto $\mathbf{w}$. Using $\mathbf{w}^T\mathbf{x} = -w_0$ for any such point,

$$l = \|\mathbf{x}\|\cos\theta = \|\mathbf{x}\|\,\frac{\mathbf{w}^T\mathbf{x}}{\|\mathbf{x}\|\,\|\mathbf{w}\|} = \frac{-w_0}{\|\mathbf{w}\|}.$$

Page 19

SVM Kernel (Russell and Norvig)

Projecting input features to a higher-dimensional space.

Page 20

Perceptron: from Minsky & Papert (1969)

Retina with input pattern

Non-adaptive local feature detectors (preprocessing)

Trainable evidence weigher

Strictly bottom-up feature processing.

Page 21

Diameter-limited Perceptron

Page 22

Digits made of seven line segments (Winston)

Positive target is the exact digit pattern.

Negative target is all other possible segment configurations.

Page 23

Digit Segment Perceptron (Winston)

Ten perceptrons are required.

Page 24

Perform a feedforward sweep to compute the output values below.

Input is constrained so that exactly 2 of the 6 inputs are 1.

Output units indicate whether the inputs are acquaintances or siblings.

Perceptron learning does not generalize to multilayer networks.

Winston

Page 25

An early model of an error signal: 1960s

Page 26

Linear combiner element

Descriptive complexity is low.

Page 27

A very simple learning task

•  Consider a neural network with two layers of neurons.
   –  Neurons in the top layer represent known shapes.
   –  Neurons in the bottom layer represent pixel intensities.
•  A pixel gets to vote if it has ink on it.
   –  Each inked pixel can vote for several different shapes.
•  The shape that gets the most votes wins.

0 1 2 3 4 5 6 7 8 9

Hinton

Page 28

Why the simple system does not work

•  A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape.
   –  The winner is the template that has the biggest overlap with the ink.
•  The ways in which shapes vary are much too complicated to be captured by simple template matches of whole shapes.
   –  To capture all the allowable variations of a shape we need to learn the features that it is composed of.

Hinton

Page 29

Examples of handwritten digits from a test set

Page 30

General layered feedforward network

There can be varying numbers of units in the different layers.

A feedforward network does not have to be layered. Any connection topology that does not have recurrent connections is a feedforward network.

Page 31

Summary of Architecture for Backprop

input vector → hidden layers → outputs

Compare outputs with the correct answer to get the error signal.

Back-propagate the error signal to get derivatives for learning.

Hinton

Page 32

Matrix representations of networks

Page 33

Need for supervised training

We cannot define the hand-printed character A. We can only present examples and hope that the network generalizes.

The fact that this happens was a breakthrough in pattern recognition.

Page 34

Sigmoid plots: the sigmoid and its derivative.

Logistic sigmoid equation: gives the neuron a graded output value between 0 and 1.
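The slide's placeholder ("put sigmoid function here") presumably refers to the standard logistic sigmoid:

$$\operatorname{logsig}(x) = \frac{1}{1 + e^{-x}}$$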

Page 35

Simple transformations on functions

Page 36

Odd and Even Functions

Even: f(x) = f(−x). Flip about the y-axis.

Odd: f(x) = −f(−x). Flip about the y-axis and flip about the x-axis.

Even: cos, cosh. Odd: sin, sinh.

Page 37

sinh, cosh, and tanh
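For reference alongside the slide's plots, the standard definitions:

$$\sinh x = \frac{e^x - e^{-x}}{2}, \qquad \cosh x = \frac{e^x + e^{-x}}{2}, \qquad \tanh x = \frac{\sinh x}{\cosh x} = \frac{e^x - e^{-x}}{e^x + e^{-x}}.$$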

Page 38

Relation between logsig and tanh
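The relation the slide presumably plots: tanh is a scaled and shifted logistic sigmoid,

$$\tanh(x) = 2\operatorname{logsig}(2x) - 1, \qquad \operatorname{logsig}(x) = \frac{1 + \tanh(x/2)}{2}.$$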

Page 39

Exact formula for setting weights

1.  Two classes
2.  Gaussian distributed
3.  Equal variances
4.  Naïve Bayes

Page 40

Gaussian Distribution
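The density on the slide is presumably the standard univariate form:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$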

Page 41

Fitting a bivariate Gaussian to 2D data

From: Florian Hahne

Data is presented as a scatter plot.

Page 42

Bivariate Gaussian

Optimal decision boundary

Page 43

Exact formula for setting weights

Bishop, 1995. However, in most applications, we must train the weights.

Page 44

Derivatives of sigmoidal functions

$$\tanh'(x) = 1 - \tanh^2(x)$$

$$\operatorname{logsig}'(x) = \operatorname{logsig}(x)\,\bigl(1 - \operatorname{logsig}(x)\bigr)$$

If you know the value of the function at a particular point, you can quickly compute the derivative of the function.
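A small sketch (Python; the test point and tolerance are arbitrary) of why this matters for backprop: the forward pass has already computed the function value, so the derivative comes almost for free.

    import math

    def logsig(x):
        return 1.0 / (1.0 + math.exp(-x))

    x = 0.7
    o = logsig(x)                    # value already available from the forward pass
    d_logsig = o * (1.0 - o)         # derivative from the value alone
    d_tanh = 1.0 - math.tanh(x)**2   # same trick for tanh

    # Sanity check against a centered finite difference:
    h = 1e-6
    approx = (logsig(x + h) - logsig(x - h)) / (2 * h)
    assert abs(d_logsig - approx) < 1e-6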

Page 45

Error Signal (Cost function)

For regression: sum of squares.

For classification: cross entropy, where c is the number of classes and S the number of cases:

$$E = -\sum_{t=1}^{S}\sum_{k=1}^{c} d_k(t)\,\ln o_k(t)$$

Page 46

Error Measure for Linear Unit

This will be the cost function.
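The measure on the slide image is presumably the standard sum-of-squares error over a batch, in the deck's notation (d desired, o actual, S cases):

$$E = \frac{1}{2}\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)^2.$$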

Page 47

Instantaneous slope (i.e., the derivative)

Slope: how fast a line is increasing.

$$f'(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

[Figure: a secant through the points (x, f(x)) and (x + Δx, f(x + Δx)) on the x and y axes.]

Page 48

Principle of gradient descent

The goal is to find the minimum of the cost function. Suppose the estimate is x = 1. Then the derivative is > 0. Update the estimate in the negative direction.
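A minimal sketch of the principle (the cost f(x) = x², the learning rate, and the iteration count are assumptions, not from the slide):

    def df(x):              # derivative of the assumed cost f(x) = x**2
        return 2.0 * x

    x = 1.0                 # the slide's starting estimate
    eta = 0.1               # assumed learning rate
    for _ in range(50):
        x -= eta * df(x)    # derivative > 0, so the step moves x downhill

    print(x)                # approaches 0, the minimum of the cost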

Page 49

How the derivative varies with the error function

Minimum of the cost function. Positive gradient. Negative gradient.

Page 50

Deriving a learning rule: trains weights for a single linear unit

Weight update rule from gradient descent.

Derivative of the cost function with respect to weight w_1.

t denotes the test case or trial; S denotes the number of cases in a batch.
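A reconstruction of the derivation in the deck's notation (the slide's own algebra is in the image): with $E = \frac{1}{2}\sum_{t=1}^{S}(d(t) - o(t))^2$ and $o(t) = \sum_i w_i x_i(t)$,

$$\frac{\partial E}{\partial w_1} = -\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)\,x_1(t), \qquad \Delta w_1 = -\eta\,\frac{\partial E}{\partial w_1} = \eta\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)\,x_1(t).$$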

Page 51

Simple example: learn AND with a linear unit.

Since there are only two weights, it's possible to visualize the error surface.

Number of cases in a batch = 4.
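A runnable sketch of this example (learning rate and epoch count are assumptions; no bias weight, matching the two-weight error surface):

    # Batch gradient descent for a single linear unit learning AND.
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = [0.0, 0.0]
    eta = 0.1

    for epoch in range(500):
        grad = [0.0, 0.0]
        for x, d in data:                      # S = 4 cases per batch
            o = w[0] * x[0] + w[1] * x[1]      # linear unit's output
            for i in range(2):
                grad[i] -= (d - o) * x[i]      # dE/dw_i accumulated over the batch
        for i in range(2):
            w[i] -= eta * grad[i]              # step against the gradient

    print(w)  # converges to w1 = w2 = 1/3, the optimum noted on the next slide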

Page 52

Composite error surface for a linear unit learning AND

Optimal weight values are at the top of the surface (w1 = w2 = 1/3).

In this case, quadratic. The error surface is flipped upside down so that it is easier to see.

Page 53

Error surface for each of the 4 cases.

The previous error surface is obtained by adding these together.

Page 54

Adding a Logistic Sigmoid Unit

g(x): formula for the sigmoidal unit.

Derivative of the sigmoidal unit.

Page 55

Derivative of the cost function for a linear unit with sigmoid

Chain rule; derivative of the sigmoid.
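A reconstruction of the chain-rule step the slide annotates: with $o = g(h)$ and $h = \sum_i w_i x_i$,

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial h}\,\frac{\partial h}{\partial w_i} = -\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)\,g'\!\bigl(h(t)\bigr)\,x_i(t),$$

where $g'(h) = o(1 - o)$ for the logistic sigmoid.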

Page 56

Error surface with a sigmoidal unit

plateau

Page 57

Update rule for a weight in the output layer

New form: backpropagation of errors. A function of h, which is defined on the next slide.

Page 58

Backpropagation of errors: new form, for an arbitrary location in the network (adapted from Bishop).

Weight update:

$$\Delta w_{ji} = \eta\, o_i\, \delta_j$$

For the output layer:

$$\delta_n = (d_n - o_n)\, g'(h_n)$$

For hidden layers:

$$\delta_j = g'(h_j) \sum_k w_{kj}\, \delta_k$$

Page 59

Delta_n: output layer

1.  If sigmoid, then o(1 − o).
2.  If tanh, then 1 − o².
3.  If linear, then 1.
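A compact runnable sketch tying the last two slides together: a 2-2-1 sigmoid network trained on XOR (the task, network size, learning rate, and seed are all assumptions, not from the slides). It uses exactly the equations above: δ_out = (d − o)·o(1 − o), δ_hidden = g'(h)·Σ w·δ, Δw = η·o_i·δ_j. With an unlucky seed it can stall in a poor local optimum, a limitation the deck notes later.

    import math, random

    random.seed(1)
    def sig(x): return 1.0 / (1.0 + math.exp(-x))

    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
    w_o = [random.uniform(-1, 1) for _ in range(3)]                      # 2 hidden + bias
    eta = 0.5

    for epoch in range(20000):
        for (x1, x2), d in data:
            xs = (x1, x2, 1.0)                           # bias input clamped at 1
            hs = [sig(sum(w * x for w, x in zip(row, xs))) for row in w_h]
            o = sig(sum(w * h for w, h in zip(w_o, hs + [1.0])))
            delta_o = (d - o) * o * (1 - o)              # output-layer delta
            deltas_h = [h * (1 - h) * w_o[j] * delta_o   # hidden deltas
                        for j, h in enumerate(hs)]
            for j, h in enumerate(hs + [1.0]):
                w_o[j] += eta * h * delta_o              # dw = eta * o_i * delta_j
            for j, dh in enumerate(deltas_h):
                for i, x in enumerate(xs):
                    w_h[j][i] += eta * x * dh

    for (x1, x2), d in data:
        xs = (x1, x2, 1.0)
        hs = [sig(sum(w * x for w, x in zip(row, xs))) for row in w_h]
        o = sig(sum(w * h for w, h in zip(w_o, hs + [1.0])))
        print((x1, x2), d, round(o, 2))  # outputs should approach the XOR targets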

Page 60

Initializing weights

1.  Most methods set weights to randomly chosen small values.
2.  Random values are chosen to avoid symmetries in the network.
3.  Small values are chosen to avoid the saturated regions of the sigmoid, where the slope is near zero.

Page 61

Time complexity of backprop

The feedforward sweep is O(w), where w is the number of weights. The backprop sweep is also O(w).

Page 62

Basic Autoencoder

Ashish Gupta

Page 63

Sparse Autoencoder

The number of hidden units is greater than the number of input units.

The cost function is chosen to make most of the hidden units inactive.

Cost function for the sparse AE: SS error + minimize weight magnitude + minimize the number of active units.
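One common concrete form of that three-term cost (this follows Ng's sparse-autoencoder notes rather than anything shown in the deck; λ, β, ρ are hyperparameters and $\hat\rho_j$ is the mean activation of hidden unit j):

$$J = \underbrace{\sum_{t}\bigl\|\mathbf{x}(t) - \hat{\mathbf{x}}(t)\bigr\|^2}_{\text{SS error}} \;+\; \underbrace{\lambda\sum_{i} w_i^2}_{\text{weight magnitude}} \;+\; \underbrace{\beta\sum_{j} \mathrm{KL}\bigl(\rho \,\|\, \hat\rho_j\bigr)}_{\text{active units}}.$$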

Page 64

Limited-memory BFGS

A limitation of gradient descent: it assumes the direction of maximum change in the gradient is a good direction for the update. A better choice maximizes the ratio of gradient to curvature of the error surface.

Newton's method takes curvature into account. The curvature is given by the Hessian matrix H, but Newton's method requires inverting H, which is infeasible for large problems.

BFGS approximates the inverse of H without actually inverting the Hessian; thus it is a quasi-Newton method.

Broyden, Fletcher, Goldfarb, Shanno
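The deck doesn't name an implementation; as a hedged illustration, SciPy's L-BFGS-B minimizer applied to a toy quadratic standing in for a network's error surface:

    import numpy as np
    from scipy.optimize import minimize

    def cost(w):          # toy quadratic bowl with its minimum at w = 1
        return 0.5 * np.sum((w - 1.0) ** 2)

    def grad(w):          # L-BFGS needs only gradients; it never forms the Hessian
        return w - 1.0

    res = minimize(cost, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    print(res.x)          # approximately [1. 1. 1. 1. 1.]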

Page 65

Purpose of the sparse autoencoder

Discover an overcomplete basis set.

Use the basis set as a set of input features to a learning algorithm.

Can think of this as a form of highly sophisticated preprocessing.

Page 66

Summary of Architecture for Backprop

input vector → hidden layers → outputs

Compare outputs with the correct answer to get the error signal.

Back-propagate the error signal to get derivatives for learning.

Hinton

Page 67

Training Procedure: make sure lab results apply to the application domain.

If you have enough data, split it into:

1.  Training set
2.  Testing set
3.  Validation set

Otherwise: ten-fold cross-validation, which allows use of all the data for training.
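A minimal sketch of the ten-fold split (pure Python over indices; the shuffle, seed, and handling of a non-divisible remainder are assumptions):

    import random

    def ten_fold_indices(n, seed=0):
        """Yield (train, test) index lists for 10-fold cross-validation."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        fold = n // 10
        for k in range(10):
            test = idx[k * fold:(k + 1) * fold]
            train = idx[:k * fold] + idx[(k + 1) * fold:]
            yield train, test

    # Each example is tested once and trained on in the other nine folds.
    for train, test in ten_fold_indices(50):
        pass  # fit on train, evaluate on test, then average the 10 scores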

Page 68

Curse of Dimensionality

[Figure: a space divided into cells along axes x1 (D = 1); x1, x2 (D = 2); and x1, x2, x3 (D = 3).]

The amount of data needed to sample a high-dimensional space grows exponentially with D: with k cells per axis, there are k^D cells to populate.

Page 69

A feature preprocessing example: distinguishing mines from rocks in sonar returns.

There exists a set of weights to solve the problem. Gradient descent searches for the desired weights.

Page 70

NETtalk – Sejnowski and Rosenberg (1987)

203-bit binary input vector: 29 bits for each of 7 characters (including punctuation), a 1-out-of-29 code.

80 hidden units.

26 outputs coding phonemes.

Trained on 1024 words; capable of intelligible speech after 10 training epochs, 95% training accuracy after 50 epochs, and 78% generalization accuracy.

Page 71

Limitations of back-propagation

•  It requires labeled training data.
   –  Almost all data in an applied setting is unlabeled.
•  The learning time does not scale well.
   –  It is very slow in networks with multiple hidden layers.
•  It can get stuck in poor local optima.
   –  These are often quite good, but for deep networks they are far from optimal.

Hinton

Page 72

Synaptic noise sources:

1.  Probability of vesicular release.
2.  Magnitude of the response in case of vesicular release.

Page 73

Possible biological source of a quasi-global difference-of-reward signal

Mesolimbic system or mesocortical system.

Error broadcast to every weight in the network.

Page 74

A spectrum of machine learning tasks

Typical Statistics:

•  Low-dimensional data (e.g. less than 100 dimensions).
•  Lots of noise in the data.
•  There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
•  The main problem is distinguishing true structure from noise.

Artificial Intelligence:

•  High-dimensional data (e.g. more than 100 dimensions).
•  The noise is not sufficient to obscure the structure in the data if we process it right.
•  There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
•  The main problem is figuring out how to represent the complicated structure in a way that can be learned.

Typical Statistics ------------ Artificial Intelligence

Hinton

Page 75

Generative Model for Classification

Definition:

1.  Model class-conditional densities p(x | C_k).
2.  Model class priors p(C_k).
3.  Compute the posterior p(C_k | x) using Bayes' theorem (written out below).

Motivation:

1.  Links ML to probability theory, the universal language for modeling uncertainty and evaluating evidence.
2.  Links neural networks to ML.
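The Bayes' theorem step in standard form, with the denominator summing over all classes:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,p(C_k)}{\sum_j p(\mathbf{x} \mid C_j)\,p(C_j)}.$$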

Page 76

Knowledge structuring issues and information sources

Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by its slef, but the wrod as a wlohe.

Page 77

Top-down or Context Effects

Semantics of a structured pattern

Kanizsa figures