Pattern Associators, Generalization, Processing (Psych 85-419/719, Feb 6, 2001)


Page 1: Pattern Associators, Generalization, Processing

Pattern Associators, Generalization, Processing

Psych 85-419/719

Feb 6, 2001

Page 2: Pattern Associators, Generalization, Processing

A Pattern Associator

• Consists of a set of input units, a set of output units, and connections from input to output.

• … and a training set of examples, each consisting of an input and its corresponding output.
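To make the setup concrete, here is a minimal NumPy sketch (not part of the original slides); the 4-input/2-output sizes, the variable names, and the example patterns are arbitrary choices for illustration.

```python
import numpy as np

# A pattern associator: input units fully connected to output units
# by a single layer of weights.
n_in, n_out = 4, 2
W = np.zeros((n_out, n_in))                  # one weight per input -> output connection

# A training set of examples: (input pattern, corresponding output pattern) pairs.
train_set = [
    (np.array([1., -1.,  1., -1.]), np.array([ 1., -1.])),
    (np.array([1.,  1., -1., -1.]), np.array([-1.,  1.])),
]

def output(W, inp):
    """Each output unit's activity is the weighted sum of its inputs (linear units)."""
    return W @ inp
```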

Page 3: Pattern Associators, Generalization, Processing

Simple Generalization in Hebbian Learning

w_{ij} = o_{i,l} i_{j,l}   (Hebb rule, after training on pattern l)

o_{i,t} = Σ_j w_{ij} i_{j,t}   (output of unit i for test input t)

o_{i,t} = o_{i,l} Σ_j i_{j,l} i_{j,t}   (substituting in the Hebbian weights)

Page 4: Pattern Associators, Generalization, Processing

The Dot Product

• The sum of the products of elements of two vectors

• When normalized for the vectors' lengths, it is essentially the correlation between the vectors

• The angle between the vectors is the inverse cosine of the normalized dot product

• When the dot product is 0 (or, angle is 90 degrees), vectors are orthogonal
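A quick numerical check of these points (a sketch, using the two vectors from the next slide):

```python
import numpy as np

a = np.array([1., 1.])
b = np.array([1., -1.])

dot = np.dot(a, b)                                       # sum of products of elements
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # dot product normalized for length
angle = np.degrees(np.arccos(cosine))                    # angle = inverse cosine of the normalized dot product

print(dot, cosine, angle)    # 0.0 0.0 90.0 -> the vectors are orthogonal
```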

Page 5: Pattern Associators, Generalization, Processing

Geometrically...

[Figure: the vectors (1,1) and (1,-1) drawn in the plane; they meet at a right angle.]

Page 6: Pattern Associators, Generalization, Processing

So, Generalization in Hebb...

• After a single learning trial, generalization is proportional to:
– Output from the trained trial, and
– Correlation between the new test input and the learned input

o_{i,t} ∝ o_{i,l} Σ_j i_{j,t} i_{j,l}

Page 7: Pattern Associators, Generalization, Processing

After Multiple Training Trials..

• Output for a test pattern is a function of the sum, over all training patterns, of the dot product between the test input and each training input, multiplied by the output that each training pattern produced.

o_{i,t} = k Σ_l o_{i,l} (i_t · i_l)   (k is a constant; the sum runs over training patterns l, and i_t · i_l is the dot product of the test and training inputs)

Page 8: Pattern Associators, Generalization, Processing

Properties of Hebb Generalization

• If input is uncorrelated with all training inputs, output is zero

• Otherwise, output is a weighted average of all outputs from all training trials
– Weighted by correlations with the inputs on training trials

• If all training trials orthogonal to each other, no cross-talk
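A minimal sketch of these properties, assuming hand-picked, unit-length ±1 input patterns and a learning rate of 1 (none of this code is from the original slides):

```python
import numpy as np

# Two orthogonal, unit-length training inputs with their target outputs.
train = [
    (np.array([ 1.,  1.,  1.,  1.]) / 2, np.array([ 1., -1.])),
    (np.array([ 1., -1.,  1., -1.]) / 2, np.array([-1.,  1.])),
]

# Hebbian learning: accumulate the outer product of output and input.
W = np.zeros((2, 4))
for inp, out in train:
    W += np.outer(out, inp)                        # w_ij += o_i * i_j

# A test input identical to the first training input: no cross-talk,
# because the two training inputs are orthogonal.
print(W @ (np.array([1., 1., 1., 1.]) / 2))        # ~[ 1., -1.]

# A test input orthogonal to both training inputs: output is zero.
print(W @ (np.array([1., 1., -1., -1.]) / 2))      # [0., 0.]
```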

Page 9: Pattern Associators, Generalization, Processing

Cross-Talk in Delta Rule Learning

• Suppose we learn a given pattern in the delta rule

• What happens when we present a test pattern that is similar to that learned pattern?

• Difference in output is a function of error on the learned pattern, and dot product of learned input and test input
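A small numerical check of that claim (a sketch with made-up weights, patterns, and a learning rate of 0.25): one delta-rule step on the learned pattern changes the output for a similar test pattern by the learning rate times the error times the dot product of the two inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(2, 4))           # some pre-existing weights

learned = np.array([1., 1., 1., 1.]) / 2         # the trained input
target  = np.array([1., -1.])
test    = np.array([1., 1., 1., -1.]) / 2        # a similar test input

before = W @ test
err = target - W @ learned                       # error on the learned pattern
W  += 0.25 * np.outer(err, learned)              # one delta-rule update (lrate = 0.25)
after = W @ test

print(after - before)                            # change in the test output...
print(0.25 * err * np.dot(learned, test))        # ...equals lrate * error * (learned . test)
```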

Page 10: Pattern Associators, Generalization, Processing

What Does This Mean?

• When our new item is similar to what we’ve been trained on, learning is easier if the output we want is close to the output we get from other examples.

• So, regular items (ones that have similar input-output relationships) don’t need a lot of training

• Exceptions need more training.

Page 11: Pattern Associators, Generalization, Processing

Frequency of Regulars and Exceptions

[Figure: proportion of items as a function of log frequency, shown separately for regulars and exceptions.]

Page 12: Pattern Associators, Generalization, Processing

Constraints on Learning

• With Hebb rule, each training input needs to be orthogonal to every other one in order to be separable and avoid cross-talk

• With delta rule, the inputs just have to be linearly independent from each other to prevent one training trial from wrecking what was learned on other trials
– Linearly independent: no input vector can be produced as a weighted combination of the others (for two vectors: you can't produce vector A by multiplying vector B by a scalar)

Page 13: Pattern Associators, Generalization, Processing

More Geometry...

[Figure: (1,1) and (1,-1) are orthogonal vectors; (-.5,1) and (-1,.5) are linearly independent, but not orthogonal.]
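The same facts can be checked numerically (a sketch; using the matrix rank as a test for linear independence is one standard way to do it, not something from the slides):

```python
import numpy as np

a = np.array([-0.5, 1.0])
b = np.array([-1.0, 0.5])

print(np.dot(a, b))                               # 1.0 -> not orthogonal
print(np.linalg.matrix_rank(np.vstack([a, b])))   # 2   -> linearly independent
```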

Page 14: Pattern Associators, Generalization, Processing

Different Types of Activation Functions

• Linear: output of a unit is simply the summed input to it

• Linear threshold: output of a unit is the summed input, but clipped so it cannot go above or below a threshold

• Stochastic: roll the dice; the output is chosen probabilistically, with a probability that depends on the input

• Sigmoid: 1/(1+exp(-net))
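A sketch of the four activation functions in NumPy; the clipping bounds for the threshold unit and the use of the sigmoid as the firing probability for the stochastic unit are assumptions, since the slide doesn't pin them down:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(net):
    return net                                     # output is just the summed input

def linear_threshold(net, lo=-1.0, hi=1.0):
    return np.clip(net, lo, hi)                    # summed input, clipped at the thresholds

def stochastic(net):
    p = 1.0 / (1.0 + np.exp(-net))                 # firing probability grows with net input (assumed rule)
    return (rng.random(np.shape(net)) < p).astype(float)   # "roll the dice"

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))              # 1/(1+exp(-net)), as on the slide
```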

Page 15: Pattern Associators, Generalization, Processing

Delta Rule for Non-Linear Units

o_i = f(net_i)

net_i = Σ_j w_{ij} o_j

E = Σ_i (t_i − o_i)^2

∂E/∂w_{ij} = (∂E/∂o_i) (∂o_i/∂net_i) (∂net_i/∂w_{ij})

For linear units, ∂o_i/∂net_i equals 1.

Otherwise, it's the derivative of our activation function f.
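A sketch of the resulting update for sigmoid output units (the variable names and learning rate are arbitrary; the factor of 2 from the squared error is absorbed into the learning rate):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def delta_step(W, inp, target, lrate=0.5):
    """One gradient step on E = sum_i (t_i - o_i)^2 with sigmoid output units."""
    net = W @ inp
    out = sigmoid(net)
    # Chain rule: dE/dw_ij = dE/do_i * do_i/dnet_i * dnet_i/dw_ij.
    # For the sigmoid, do_i/dnet_i = out_i * (1 - out_i); and dnet_i/dw_ij = inp_j.
    delta = (target - out) * out * (1.0 - out)
    W += lrate * np.outer(delta, inp)
    return W, out
```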

Page 16: Pattern Associators, Generalization, Processing

So Delta Rule Works Well for Any Activation Function f That Is Differentiable

• Linear: easily differentiable

• Sigmoid: easily differentiable

• Threshold… not so much (it isn't differentiable at the step).

• (what about other error functions besides sum-squared?)

Page 17: Pattern Associators, Generalization, Processing

Minimizing Sum Squared Error

• With unambiguous input, will converge to correct output

• With ambiguous or noisy input, will converge to output that minimizes average squared distance from all targets– This is effectively regression!

• Can read outputs as a probability distribution (recall IA Reading Model)

Page 18: Pattern Associators, Generalization, Processing

Regression vs. Winner-Take-All

• In the Jets and Sharks model, activating a gang node activated the "winner" in the age group
– Other ages were suppressed

• In delta rule, output is proportional to the statistics of the training set

• Which is better?

Page 19: Pattern Associators, Generalization, Processing

The Ideas From Ch 11

• We can think of patterns being correlated over units, rather than units correlated over patterns

• Same with targets

• Based on this, we can see how much cross talk there is between inputs, or weights, or outputs
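One way to compute such a pattern-by-pattern correlation matrix (a sketch with made-up target patterns, one per row):

```python
import numpy as np

targets = np.array([[ 1.,  1.,  1.,  1.],
                    [ 1.,  1.,  1., -1.],
                    [ 1.,  1., -1., -1.]])

# Correlate patterns over units: entry (p, q) is the normalized dot product
# of pattern p with pattern q.
norms = np.linalg.norm(targets, axis=1)
pattern_corr = (targets @ targets.T) / np.outer(norms, norms)
print(pattern_corr)   # 1s on the diagonal; off-diagonal entries indicate potential cross-talk
```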

Page 20: Pattern Associators, Generalization, Processing

As Learning Progresses, Weights Become Aligned With Targets

1.0 0.4 0.2      1.0 0.1 0.0      1.0 0.0 0.0
0.4 1.0 0.3      0.1 1.0 0.1      0.0 1.0 0.0
0.2 0.3 1.0      0.0 0.1 1.0      0.0 0.0 1.0

Page 21: Pattern Associators, Generalization, Processing

Performance Measures

• Sum squared error
– tss is total sum squared error
– pss is sum squared error for the current pattern

• Can also compute vector differences between actual output and target output
– ndp is the normalized dot product
– nvl is the normalized vector length: the magnitude of the output vector
– vcor is the correlation, ignoring magnitude

Page 22: Pattern Associators, Generalization, Processing

Unpacking...

• Suppose our targets were -1,1,-1 and our output was -0.5,0.5,-0.5

• vcor, the correlation ignoring length, is perfect (1.0)

• Length (nvl) is less than 1; output is not at full magnitude

• So, overall performance (ndp) is not 1.
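The same numbers, computed directly (a sketch: these are plausible definitions of ndp, nvl, and vcor; the pdp software's exact formulas may differ in detail):

```python
import numpy as np

target = np.array([-1.0, 1.0, -1.0])
output = np.array([-0.5, 0.5, -0.5])

vcor = np.dot(output, target) / (np.linalg.norm(output) * np.linalg.norm(target))
nvl  = np.linalg.norm(output) / np.linalg.norm(target)
ndp  = np.dot(output, target) / np.dot(target, target)

print(vcor, nvl, ndp)   # 1.0 0.5 0.5 -> right direction, but only half the magnitude
```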

Page 23: Pattern Associators, Generalization, Processing

Back to Generalization

• Two layer delta rule networks are great for picking up on regularities

• Can’t do XOR problems

• Recall: regular and exception items (GAVE, WAVE, PAVE… HAVE)

• Are exceptions a form of XOR?
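A quick way to see the XOR limitation (a sketch, not from the slides: a single layer of modifiable weights with a linear output unit, a bias input, and plain delta-rule training):

```python
import numpy as np

# XOR patterns, with a constant bias input appended to each.
X = np.array([[0., 0., 1.],
              [0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 1.]])
T = np.array([0., 1., 1., 0.])

w = np.zeros(3)
for epoch in range(1000):
    for x, t in zip(X, T):
        o = x @ w                    # linear output unit
        w += 0.05 * (t - o) * x      # delta rule

print(X @ w)   # all four outputs hover near 0.5: no single weight layer solves XOR
```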

Page 24: Pattern Associators, Generalization, Processing

XOR and Exceptions

• Depends on your representation.

• With localist word units (for example), they are linearly independent, and hence learnable.

• … but you don't get decent generalization with localist representations!

• This state of affairs led many to conclude that there were two systems for learning regulars and exceptions.

Page 25: Pattern Associators, Generalization, Processing

Evidence for Two Systems

• Phonological dyslexics: impaired at rule application, more or less ok at exceptions

• Surface dyslexics: ok at rule application, poor at exceptions

• Conclusion of many: there are two systems. One learns and applies the rules; the other has localist word nodes and handles the exceptions.

Page 26: Pattern Associators, Generalization, Processing

History of the Argument

• When this two-system account was put forward, it was not known how to train a network to handle XOR problems.

• Existing symbolic models also could pick up rules, but needed something else for exceptions.

• BUT: Starting next week, we’ll talk about learning rules that can handle the XOR problem.

Page 27: Pattern Associators, Generalization, Processing

The Zorzi et al. Model

[Diagram: word input feeds pronunciation along two routes: a two-layer association network trained with the delta rule, and a "lexical" representation.]

Page 28: Pattern Associators, Generalization, Processing

For Thursday…

• Topic: Distributed Representations

• Read PDP1, Chapter 3.

• Optional: Handout, Plaut & McClelland, Stipulating versus discovering representations

• Optional: Science article, Sparse population coding of faces in inferotemporal cortex

• Look over homework #2