
Page 1

Perceptron: from Minsky & Papert (1969)

Retina with input pattern

Non-adaptive local feature detectors (preprocessing)

Trainable evidence weigher

Strictly bottom-up feature processing.

Page 2

Teacher: Correct output for trial 1 is 1. Correct output for trial 2 is 1. Correct output for trial 3 is 0. ... Correct output for trial n is 0.

"Supervised" learning training paradigm

Call the correct answers the desired outputs and represent them by the symbol "d".

Call the outputs from the network the actual outputs and represent them by the symbol "a".

Adjust weights to reduce error.

Page 3

From a neuron to a linear threshold unit.

Page 4

Trivial Data Set

No generalization needed.

Not linearly separable.

Odd bit parity.

Page 5

Perceptron Learning Rule
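The rule itself is on the slide image. In the deck's notation (d = desired output, a = actual output), the standard perceptron rule, with learning rate η, is:

$$\Delta w_i = \eta\,(d - a)\,x_i$$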

Page 6

More Compact Form

Page 7

An Alternative Representation

Page 8

Learning the AND function

Perceptron learning algorithm
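A runnable sketch of the algorithm on this task (the learning rate, zero initialization, bias handling, and epoch cap are assumptions, not from the slides):

    # Minimal perceptron learning sketch for the AND function.
    # Bias is folded in as a weight on a constant +1 input.

    def step(x):
        return 1 if x >= 0 else 0

    # Training set for AND: inputs augmented with a constant 1 for the bias.
    data = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]

    w = [0.0, 0.0, 0.0]
    eta = 0.1

    for epoch in range(100):
        errors = 0
        for x, d in data:
            a = step(sum(wi * xi for wi, xi in zip(w, x)))  # actual output
            for i in range(len(w)):
                w[i] += eta * (d - a) * x[i]                # perceptron rule
            errors += int(a != d)
        if errors == 0:      # converged: all four cases classified correctly
            break

    print(w)  # weights that put only (1,1) on the positive side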

Page 9

Linear Separability

Perceptron with 2 inputs (figure labels: inputs, threshold).

Solve for y:

Equation for a straight line. Adjustable weights control slope and intercept.
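A reconstruction of the algebra the slide sketches, calling the two inputs x and y and the threshold θ (symbol names assumed): at the decision boundary,

$$w_1 x + w_2 y = \theta \quad\Longrightarrow\quad y = -\frac{w_1}{w_2}\,x + \frac{\theta}{w_2},$$

a straight line whose slope and intercept are set by the weights.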

Page 10

AND vs XOR

Try to draw a straight line through positive and negative instances.

Geometric illustration of linear separability (or the lack of it).

Page 11

Algebraic proof that XOR is not linearly separable

Page 12

Algebraic proof that XOR is not linearly separable (continued)

Greater than zero. Impossible, given the above.
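The algebra itself is on the slide image; a standard reconstruction, assuming a unit that outputs 1 when $w_1 x_1 + w_2 x_2 > \theta$:

$$\begin{aligned}
(0,0)\mapsto 0 &: \quad 0 \le \theta \\
(1,0)\mapsto 1 &: \quad w_1 > \theta \\
(0,1)\mapsto 1 &: \quad w_2 > \theta \\
(1,1)\mapsto 0 &: \quad w_1 + w_2 \le \theta
\end{aligned}$$

Adding the two middle inequalities gives $w_1 + w_2 > 2\theta$; combined with the last line this forces $2\theta < \theta$, i.e. $\theta < 0$, contradicting the first line. Impossible, as the slide says.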

Page 13

Thresholds can be represented as weights
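The trick the slide illustrates, written out: compare against zero instead of θ by giving the unit an extra input clamped at 1 with weight $w_0 = -\theta$ (the bias):

$$\sum_i w_i x_i > \theta \iff \sum_i w_i x_i + w_0 \cdot 1 > 0.$$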

Page 14

Linear component of the LTU (the linear discriminant function):

$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$

At the decision boundary: $y(\mathbf{x}) = 0$.

The separating plane is also called the decision boundary.

Page 15

Properties of the linear decision boundary

1. The decision boundary is orthogonal to the weight vector.

2. Will give a formula for the distance of the decision boundary to the origin.

Page 16

Geometry of the linear decision boundary (Bishop)

Points in x space: let $\mathbf{x}_A$ and $\mathbf{x}_B$ be two points on the hyperplane. Then $y(\mathbf{x}_A) = 0 = y(\mathbf{x}_B)$, and therefore

$$\mathbf{w}^T(\mathbf{x}_B - \mathbf{x}_A) = 0.$$

Page 17

$\mathbf{x}_A - \mathbf{x}_B$ is parallel to the decision boundary (interpretation of vector subtraction), so $\mathbf{w}^T(\mathbf{x}_B - \mathbf{x}_A) = 0$ implies that the weight vector is perpendicular to the decision boundary.

Page 18

Distance to the hyperplane: project a point $\mathbf{x}$ lying on the hyperplane onto $\mathbf{w}$. Using $\mathbf{w}^T\mathbf{x} = -w_0$ for any such point,

$$l = \|\mathbf{x}\|\cos\theta = \|\mathbf{x}\|\,\frac{\mathbf{w}^T\mathbf{x}}{\|\mathbf{x}\|\,\|\mathbf{w}\|} = \frac{-w_0}{\|\mathbf{w}\|}.$$

Page 19

SVM Kernel (Russell and Norvig)

Projecting input features to a higher-dimensional space.

Page 20

Perceptron: from Minsky & Papert (1969)

Retina with input pattern

Non-adaptive local feature detectors (preprocessing)

Trainable evidence weigher

Strictly bottom-up feature processing.

Page 21

Diameter-limited Perceptron

Page 22

Digits made of seven line segments (Winston)

Positive target is the exact digit pattern.

Negative target is all other possible segment configurations.

Page 23

Digit Segment Perceptron (Winston)

Ten perceptrons are required.

Page 24

Perform a feedforward sweep to compute the output values below.

Input is constrained so that exactly 2 of the 6 inputs are 1.

Output units indicate whether the inputs are acquaintances or siblings.

Perceptron learning does not generalize to multilayer networks.

Winston

Page 25

An early model of an error signal: 1960s

Page 26

Linear combiner element

Descriptive complexity is low.

Page 27

A very simple learning task

•  Consider a neural network with two layers of neurons.
   –  Neurons in the top layer represent known shapes.
   –  Neurons in the bottom layer represent pixel intensities.
•  A pixel gets to vote if it has ink on it.
   –  Each inked pixel can vote for several different shapes.
•  The shape that gets the most votes wins.

0 1 2 3 4 5 6 7 8 9

Hinton

Page 28

Why the simple system does not work

•  A two-layer network with a single winner in the top layer is equivalent to having a rigid template for each shape.
   –  The winner is the template that has the biggest overlap with the ink.
•  The ways in which shapes vary are much too complicated to be captured by simple template matches of whole shapes.
   –  To capture all the allowable variations of a shape we need to learn the features that it is composed of.

Hinton

Page 29

Examples of handwritten digits from a test set

Page 30

General layered feedforward network

There can be varying numbers of units in the different layers.

A feedforward network does not have to be layered. Any connection topology that does not have recurrent connections is a feedforward network.

Page 31

Summary of Architecture for Backprop

input vector → hidden layers → outputs

Compare outputs with the correct answer to get the error signal.

Back-propagate the error signal to get derivatives for learning.

Hinton

Page 32

Matrix representations of networks

Page 33

Need for supervised training

We cannot define the hand-printed character A. We can only present examples and hope that the network generalizes.

The fact that this happens was a breakthrough in pattern recognition.

Page 34

Sigmoid plots: the sigmoid and its derivative.

Logistic sigmoid equation: gives the neuron a graded output value between 0 and 1.
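The slide's placeholder ("put sigmoid function here") presumably refers to the standard logistic sigmoid:

$$\operatorname{logsig}(x) = \frac{1}{1 + e^{-x}}$$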

Page 35

Simple transformations on functions

Page 36

Odd and Even Functions

Even: f(x) = f(−x). Flip about the y-axis.

Odd: f(x) = −f(−x). Flip about the y-axis and flip about the x-axis.

Even: cos, cosh. Odd: sin, sinh.

Page 37

sinh, cosh, and tanh
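For reference alongside the slide's plots, the standard definitions:

$$\sinh x = \frac{e^x - e^{-x}}{2}, \qquad \cosh x = \frac{e^x + e^{-x}}{2}, \qquad \tanh x = \frac{\sinh x}{\cosh x} = \frac{e^x - e^{-x}}{e^x + e^{-x}}.$$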

Page 38

Relation between logsig and tanh
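The relation the slide presumably plots: tanh is a scaled and shifted logistic sigmoid,

$$\tanh(x) = 2\operatorname{logsig}(2x) - 1, \qquad \operatorname{logsig}(x) = \frac{1 + \tanh(x/2)}{2}.$$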

Page 39

Exact formula for setting weights

1.  Two classes
2.  Gaussian distributed
3.  Equal variances
4.  Naïve Bayes

Page 40

Gaussian Distribution
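The density on the slide is presumably the standard univariate form:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$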

Page 41

Fitting a bivariate Gaussian to 2D data

From: Florian Hahne

Data is presented as a scatter plot.

Page 42

Bivariate Gaussian

Optimal decision boundary

Page 43

Exact formula for setting weights

Bishop, 1995. However, in most applications, we must train the weights.

Page 44

Derivatives of sigmoidal functions

$$\tanh'(x) = 1 - \tanh^2(x)$$

$$\operatorname{logsig}'(x) = \operatorname{logsig}(x)\,\bigl(1 - \operatorname{logsig}(x)\bigr)$$

If you know the value of the function at a particular point, you can quickly compute the derivative of the function.
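A small sketch (Python; the test point and tolerance are arbitrary) of why this matters for backprop: the forward pass has already computed the function value, so the derivative comes almost for free.

    import math

    def logsig(x):
        return 1.0 / (1.0 + math.exp(-x))

    x = 0.7
    o = logsig(x)                    # value already available from the forward pass
    d_logsig = o * (1.0 - o)         # derivative from the value alone
    d_tanh = 1.0 - math.tanh(x)**2   # same trick for tanh

    # Sanity check against a centered finite difference:
    h = 1e-6
    approx = (logsig(x + h) - logsig(x - h)) / (2 * h)
    assert abs(d_logsig - approx) < 1e-6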

Page 45

Error Signal (Cost function)

For regression: sum of squares.

For classification: cross entropy, where c is the number of classes and S the number of cases:

$$E = -\sum_{t=1}^{S}\sum_{k=1}^{c} d_k(t)\,\ln o_k(t)$$

Page 46

Error Measure for Linear Unit

This will be the cost function.
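The measure on the slide image is presumably the standard sum-of-squares error over a batch, in the deck's notation (d desired, o actual, S cases):

$$E = \frac{1}{2}\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)^2.$$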

Page 47

Instantaneous slope (i.e., the derivative)

Slope: how fast a line is increasing.

$$f'(x) \approx \frac{f(x + \Delta x) - f(x)}{\Delta x}$$

[Figure: a secant through the points (x, f(x)) and (x + Δx, f(x + Δx)) on the x and y axes.]

Page 48

Principle of gradient descent

The goal is to find the minimum of the cost function. Suppose the estimate is x = 1. Then the derivative is > 0. Update the estimate in the negative direction.
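A minimal sketch of the principle (the cost f(x) = x², the learning rate, and the iteration count are assumptions, not from the slide):

    def df(x):              # derivative of the assumed cost f(x) = x**2
        return 2.0 * x

    x = 1.0                 # the slide's starting estimate
    eta = 0.1               # assumed learning rate
    for _ in range(50):
        x -= eta * df(x)    # derivative > 0, so the step moves x downhill

    print(x)                # approaches 0, the minimum of the cost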

Page 49

How the derivative varies with the error function

Minimum of the cost function. Positive gradient. Negative gradient.

Page 50

Deriving a learning rule: trains weights for a single linear unit

Weight update rule from gradient descent.

Derivative of the cost function with respect to weight w_1.

t denotes the test case or trial; S denotes the number of cases in a batch.
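A reconstruction of the derivation in the deck's notation (the slide's own algebra is in the image): with $E = \frac{1}{2}\sum_{t=1}^{S}(d(t) - o(t))^2$ and $o(t) = \sum_i w_i x_i(t)$,

$$\frac{\partial E}{\partial w_1} = -\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)\,x_1(t), \qquad \Delta w_1 = -\eta\,\frac{\partial E}{\partial w_1} = \eta\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)\,x_1(t).$$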

Page 51

Simple example: learn AND with a linear unit.

Since there are only two weights, it's possible to visualize the error surface.

Number of cases in a batch = 4.
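A runnable sketch of this example (learning rate and epoch count are assumptions; no bias weight, matching the two-weight error surface):

    # Batch gradient descent for a single linear unit learning AND.
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = [0.0, 0.0]
    eta = 0.1

    for epoch in range(500):
        grad = [0.0, 0.0]
        for x, d in data:                      # S = 4 cases per batch
            o = w[0] * x[0] + w[1] * x[1]      # linear unit's output
            for i in range(2):
                grad[i] -= (d - o) * x[i]      # dE/dw_i accumulated over the batch
        for i in range(2):
            w[i] -= eta * grad[i]              # step against the gradient

    print(w)  # converges to w1 = w2 = 1/3, the optimum noted on the next slide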

Page 52

Composite error surface for a linear unit learning AND

Optimal weight values are at the top of the surface (w1 = w2 = 1/3).

In this case, quadratic. The error surface is flipped upside down so that it is easier to see.

Page 53

Error surface for each of the 4 cases.

The previous error surface is obtained by adding these together.

Page 54

Adding a Logistic Sigmoid Unit

g(x): formula for the sigmoidal unit.

Derivative of the sigmoidal unit.

Page 55

Derivative of the cost function for a linear unit with sigmoid

Chain rule; derivative of the sigmoid.
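A reconstruction of the chain-rule step the slide annotates: with $o = g(h)$ and $h = \sum_i w_i x_i$,

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial h}\,\frac{\partial h}{\partial w_i} = -\sum_{t=1}^{S}\bigl(d(t) - o(t)\bigr)\,g'\!\bigl(h(t)\bigr)\,x_i(t),$$

where $g'(h) = o(1 - o)$ for the logistic sigmoid.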

Page 56

Error surface with a sigmoidal unit

plateau

Page 57

Update rule for a weight in the output layer

New form: backpropagation of errors. A function of h, which is defined on the next slide.

Page 58

Backpropagation of errors: new form, for an arbitrary location in the network (adapted from Bishop).

Weight update:

$$\Delta w_{ji} = \eta\, o_i\, \delta_j$$

For the output layer:

$$\delta_n = (d_n - o_n)\, g'(h_n)$$

For hidden layers:

$$\delta_j = g'(h_j) \sum_k w_{kj}\, \delta_k$$

Page 59

Delta_n: output layer

1.  If sigmoid, then o(1 − o).
2.  If tanh, then 1 − o².
3.  If linear, then 1.
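A compact runnable sketch tying the last two slides together: a 2-2-1 sigmoid network trained on XOR (the task, network size, learning rate, and seed are all assumptions, not from the slides). It uses exactly the equations above: δ_out = (d − o)·o(1 − o), δ_hidden = g'(h)·Σ w·δ, Δw = η·o_i·δ_j. With an unlucky seed it can stall in a poor local optimum, a limitation the deck notes later.

    import math, random

    random.seed(1)
    def sig(x): return 1.0 / (1.0 + math.exp(-x))

    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    w_h = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
    w_o = [random.uniform(-1, 1) for _ in range(3)]                      # 2 hidden + bias
    eta = 0.5

    for epoch in range(20000):
        for (x1, x2), d in data:
            xs = (x1, x2, 1.0)                           # bias input clamped at 1
            hs = [sig(sum(w * x for w, x in zip(row, xs))) for row in w_h]
            o = sig(sum(w * h for w, h in zip(w_o, hs + [1.0])))
            delta_o = (d - o) * o * (1 - o)              # output-layer delta
            deltas_h = [h * (1 - h) * w_o[j] * delta_o   # hidden deltas
                        for j, h in enumerate(hs)]
            for j, h in enumerate(hs + [1.0]):
                w_o[j] += eta * h * delta_o              # dw = eta * o_i * delta_j
            for j, dh in enumerate(deltas_h):
                for i, x in enumerate(xs):
                    w_h[j][i] += eta * x * dh

    for (x1, x2), d in data:
        xs = (x1, x2, 1.0)
        hs = [sig(sum(w * x for w, x in zip(row, xs))) for row in w_h]
        o = sig(sum(w * h for w, h in zip(w_o, hs + [1.0])))
        print((x1, x2), d, round(o, 2))  # outputs should approach the XOR targets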

Page 60

Initializing weights

1.  Most methods set weights to randomly chosen small values.
2.  Random values are chosen to avoid symmetries in the network.
3.  Small values are chosen to avoid the saturated regions of the sigmoid, where the slope is near zero.

Page 61

Time complexity of backprop

The feedforward sweep is O(w), where w is the number of weights. The backprop sweep is also O(w).

Page 62

Basic Autoencoder

Ashish Gupta

Page 63

Sparse Autoencoder

The number of hidden units is greater than the number of input units.

The cost function is chosen to make most of the hidden units inactive.

Cost function for the sparse AE: SS error + minimize weight magnitude + minimize the number of active units.
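One common concrete form of that three-term cost (this follows Ng's sparse-autoencoder notes rather than anything shown in the deck; λ, β, ρ are hyperparameters and $\hat\rho_j$ is the mean activation of hidden unit j):

$$J = \underbrace{\sum_{t}\bigl\|\mathbf{x}(t) - \hat{\mathbf{x}}(t)\bigr\|^2}_{\text{SS error}} \;+\; \underbrace{\lambda\sum_{i} w_i^2}_{\text{weight magnitude}} \;+\; \underbrace{\beta\sum_{j} \mathrm{KL}\bigl(\rho \,\|\, \hat\rho_j\bigr)}_{\text{active units}}.$$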

Page 64

Limited-memory BFGS

A limitation of gradient descent: it assumes the direction of maximum change in the gradient is a good direction for the update. A better choice maximizes the ratio of gradient to curvature of the error surface.

Newton's method takes curvature into account. The curvature is given by the Hessian matrix H, but Newton's method requires inverting H, which is infeasible for large problems.

BFGS approximates the inverse of H without actually inverting the Hessian; thus it is a quasi-Newton method.

Broyden, Fletcher, Goldfarb, Shanno
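The deck doesn't name an implementation; as a hedged illustration, SciPy's L-BFGS-B minimizer applied to a toy quadratic standing in for a network's error surface:

    import numpy as np
    from scipy.optimize import minimize

    def cost(w):          # toy quadratic bowl with its minimum at w = 1
        return 0.5 * np.sum((w - 1.0) ** 2)

    def grad(w):          # L-BFGS needs only gradients; it never forms the Hessian
        return w - 1.0

    res = minimize(cost, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    print(res.x)          # approximately [1. 1. 1. 1. 1.]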

Page 65

Purpose of the sparse autoencoder

Discover an overcomplete basis set.

Use the basis set as a set of input features to a learning algorithm.

Can think of this as a form of highly sophisticated preprocessing.

Page 66

Summary of Architecture for Backprop

input vector → hidden layers → outputs

Compare outputs with the correct answer to get the error signal.

Back-propagate the error signal to get derivatives for learning.

Hinton

Page 67

Training Procedure: make sure lab results apply to the application domain.

If you have enough data, split it into:

1.  Training set
2.  Testing set
3.  Validation set

Otherwise: ten-fold cross-validation, which allows use of all the data for training.
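A minimal sketch of the ten-fold split (pure Python over indices; the shuffle, seed, and handling of a non-divisible remainder are assumptions):

    import random

    def ten_fold_indices(n, seed=0):
        """Yield (train, test) index lists for 10-fold cross-validation."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        fold = n // 10
        for k in range(10):
            test = idx[k * fold:(k + 1) * fold]
            train = idx[:k * fold] + idx[(k + 1) * fold:]
            yield train, test

    # Each example is tested once and trained on in the other nine folds.
    for train, test in ten_fold_indices(50):
        pass  # fit on train, evaluate on test, then average the 10 scores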

Page 68

Curse of Dimensionality

[Figure: a space divided into cells along axes x1 (D = 1); x1, x2 (D = 2); and x1, x2, x3 (D = 3).]

The amount of data needed to sample a high-dimensional space grows exponentially with D: with k cells per axis, there are k^D cells to populate.

Page 69

A feature preprocessing example: distinguishing mines from rocks in sonar returns.

There exists a set of weights to solve the problem. Gradient descent searches for the desired weights.

Page 70

NETtalk – Sejnowski and Rosenberg (1987)

203-bit binary input vector: 29 bits for each of 7 characters (including punctuation), a 1-out-of-29 code.

80 hidden units.

26 outputs coding phonemes.

Trained on 1024 words; capable of intelligible speech after 10 training epochs, 95% training accuracy after 50 epochs, and 78% generalization accuracy.

Page 71

Limitations of back-propagation

•  It requires labeled training data.
   –  Almost all data in an applied setting is unlabeled.
•  The learning time does not scale well.
   –  It is very slow in networks with multiple hidden layers.
•  It can get stuck in poor local optima.
   –  These are often quite good, but for deep networks they are far from optimal.

Hinton

Page 72

Synaptic noise sources:

1.  Probability of vesicular release.
2.  Magnitude of the response in case of vesicular release.

Page 73

Possible biological source of a quasi-global difference-of-reward signal

Mesolimbic system or mesocortical system.

Error broadcast to every weight in the network.

Page 74

A spectrum of machine learning tasks

Typical Statistics:

•  Low-dimensional data (e.g. less than 100 dimensions).
•  Lots of noise in the data.
•  There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
•  The main problem is distinguishing true structure from noise.

Artificial Intelligence:

•  High-dimensional data (e.g. more than 100 dimensions).
•  The noise is not sufficient to obscure the structure in the data if we process it right.
•  There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
•  The main problem is figuring out how to represent the complicated structure in a way that can be learned.

Typical Statistics ------------ Artificial Intelligence

Hinton

Page 75

Generative Model for Classification

Definition:

1.  Model class-conditional densities p(x | C_k).
2.  Model class priors p(C_k).
3.  Compute the posterior p(C_k | x) using Bayes' theorem (written out below).

Motivation:

1.  Links ML to probability theory, the universal language for modeling uncertainty and evaluating evidence.
2.  Links neural networks to ML.
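The Bayes' theorem step in standard form, with the denominator summing over all classes:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,p(C_k)}{\sum_j p(\mathbf{x} \mid C_j)\,p(C_j)}.$$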

Page 76

Knowledge structuring issues and information sources

Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by its slef, but the wrod as a wlohe.

Page 77

Top-down or Context Effects

Semantics of a structured pattern

Kanizsa figures