TRANSCRIPT
Categorization

[Figure: stimuli in a two-dimensional space (Dimension 1 × Dimension 2), partitioned into Category A and Category B]
$$P(R_j \mid i) = \frac{\beta_j \cdot E_j}{\sum_{K \in R} \beta_K \cdot E_K}$$
Now the evidence values E_j are evidence for category membership rather than evidence for identity.
$$P(A \mid i) = \frac{\beta_A \cdot E_{A|i}}{\beta_A \cdot E_{A|i} + \beta_B \cdot E_{B|i}}$$
What are some ways categories could be represented?
What gives rise to the evidence values?
Prototypes

E_{A|i} is proportional to similarity to the prototype of category A

[Figure: Dimension 1 × Dimension 2 space with each category's prototype marked]
Ideals

E_{A|i} is proportional to similarity to the ideal point of category A

[Figure: Dimension 1 × Dimension 2 space with each category's ideal point marked]
Exemplars

E_{A|i} is proportional to similarity to the experienced exemplars of category A

[Figure: Dimension 1 × Dimension 2 space with the stored exemplars of each category marked]
Decision Boundaries

E_{A|i} is given by which side of the boundary exemplar i is on (the boundary can be noisy)

[Figure: Dimension 1 × Dimension 2 space divided by a decision boundary]
Rules

E_{A|i} is given by which side of the rule boundary exemplar i is on (the boundary can be noisy)

[Figure: Dimension 1 × Dimension 2 space divided by a rule boundary]
Exemplars

E_{A|i} is proportional to similarity to the experienced exemplars of category A

[Figure: Dimension 1 × Dimension 2 space with the stored exemplars of each category marked]
- similarity to the closest exemplar (nearest neighbor)
- average similarity to exemplars
- summed similarity to exemplars
$$E_{A|i} = \sum_{j=1}^{N_A} s_{ij} \qquad \text{or equivalently} \qquad E_{A|i} = \sum_{j \in A} s_{ij}$$
Generalized Context Model of Categorization
Exemplars

E_{A|i} is proportional to similarity to the experienced exemplars of category A

[Figure: item i shown among the exemplars of category A in Dimension 1 × Dimension 2 space]
Exemplars

E_{B|i} is proportional to similarity to the experienced exemplars of category B

[Figure: item i shown among the exemplars of category B in Dimension 1 × Dimension 2 space]
$$P(A \mid i) = \frac{\beta_A \cdot E_{A|i}}{\beta_A \cdot E_{A|i} + \beta_B \cdot E_{B|i}} \qquad E_{A|i} = \sum_{j=1}^{N_A} s_{ij}$$
$$P(A \mid i) = \frac{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij}}{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij} + \beta_B \cdot \sum_{j=1}^{N_B} s_{ij}}$$
$$P(R_j \mid i) = \frac{\beta_j \cdot s_{ij}}{\sum_{K \in R} \beta_K \cdot s_{iK}}$$

QUESTION: Can the same similarities that explain identification confusions also explain categorization confusions?
Shepard, Hovland, & Jenkins (1961) tested this prediction by first having people learn to identify each object with a unique name. They fitted the SCM to the observed data (more on this later) to obtain values of the bias and s_ij parameters. Next, they attempted to account for categorization data using those s_ij parameters in the categorization model.
$$P(A \mid i) = \frac{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij}}{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij} + \beta_B \cdot \sum_{j=1}^{N_B} s_{ij}}$$
Shepard, Hovland, & Jenkins (1961)

[Figure: the six category types (I–VI) defined over stimuli varying in size and shape, ordered from single-dimension (Type I) through XOR (Type II) to unique identification (Type VI)]
QUESTION: Can the same similarities that explain identification confusions also explain categorization confusions?

Identification requires fine discriminations between similar stimuli … Categorization requires treating clearly discriminable stimuli as the same thing … So maybe it's not surprising that the answer is no.
Not so fast …
Generalized Context Model (GCM)
$$s_{ij} = \exp(-c \cdot d_{ij}^{\,p})$$

$$d_{ij} = \left( \sum_{m=1}^{M} w_m \, |i_m - j_m|^r \right)^{1/r}$$

$$P(A \mid i) = \frac{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij}}{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij} + \beta_B \cdot \sum_{j=1}^{N_B} s_{ij}}$$
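A minimal MATLAB sketch of the similarity and distance equations above (the stimulus coordinates and parameter values here are illustrative, not from the lecture):

% GCM similarity between two stimuli (illustrative values)
xi = [0.2 0.7];  xj = [0.5 0.1];       % two stimuli in a 2-D space
w  = [0.5 0.5];                        % attention weights on the dimensions
r  = 1;                                % r = 1: city-block; r = 2: Euclidean
p  = 1;                                % p = 1: exponential; p = 2: Gaussian
c  = 2;                                % scaling (sensitivity) parameter
d  = sum(w .* abs(xi - xj).^r)^(1/r);  % weighted general distance metric
s  = exp(-c * d^p);                    % similarity decreases with distance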
[Figure: three points (i1,i2), (j1,j2), (k1,k2) plotted in dimension 1 × dimension 2 space]
$$d_{ij} = \left( \sum_{m=1}^{M} w_m \, |i_m - j_m|^r \right)^{1/r}$$

the weighted general distance metric; w_m is the weight on dimension m
[Figure: as w2 → 0, the points (i1,i2), (j1,j2), (k1,k2) collapse onto dimension 1; differences on dimension 2 are ignored]
[Figure: as w1 → 0, the points collapse onto dimension 2; differences on dimension 1 are ignored]
Shepard, Hovland, & Jenkins (1961)

[Figure: the six category types (I–VI) again, over stimuli varying in size and shape]
Parameter Fitting Techniques
how do we find the values of model parameters that maximize the fit of a model to observed data?
Measures of Fit
what do I mean by “fit”?
what are some ways you could measure fit?
- Pearson Correlation
- SSE
- RMSE
- % Variance Accounted For
- Likelihood (next week)
Pearson Correlation
$$r_{obs,prd} = \frac{\sum (obs - \mu_{obs})(prd - \mu_{prd})}{\sqrt{\sum (obs - \mu_{obs})^2 \cdot \sum (prd - \mu_{prd})^2}}$$
Sum of Squared Error (SSE)
$$SSE_{obs,prd} = \sum (obs - prd)^2$$
Root Mean Squared Error (RMSE)
$$RMSE_{obs,prd} = \sqrt{\frac{\sum (obs - prd)^2}{N}}$$
% Variance Accounted For
$$\%Var = \frac{SSE_{null} - SSE_{model}}{SSE_{null}}$$

$$SSE_{null} = \sum_i (obs_i - \mu_{obs})^2 \qquad SSE_{model} = \sum_i (obs_i - prd_i)^2$$
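A minimal MATLAB sketch computing these fit measures (the obs and prd vectors here are illustrative):

% fit measures for observed values vs. model predictions
obs = [.77 .78 .83 .64]';  prd = [.79 .83 .88 .65]';  % illustrative values
tmp  = corrcoef(obs, prd);  r = tmp(1,2);   % Pearson correlation
sse  = sum((obs - prd).^2);                 % sum of squared error
rmse = sqrt(mean((obs - prd).^2));          % root mean squared error
sseNull = sum((obs - mean(obs)).^2);        % SSE of the null (mean) model
pctVar  = (sseNull - sse) / sseNull;        % % variance accounted for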
Parameter Fitting Techniques
minimize SSE, maximize r, or maximize %Var

next week we'll talk about maximum likelihood; after that we'll talk about more complex measures
One approach: CALCULUS
DUMB MODEL (example)
d_ij    obs s_ij    prd s_ij
0       1.000
1       0.368
2       0.135
3       0.050
4       0.018
5       0.007

$$s_{ij} = \alpha + \beta \, d_{ij}$$

find the parameters (α and β) that minimize SSE between the obs s_ij and prd s_ij
$$SSE = \sum_k (obs_k - prd_k)^2 = \sum_k \big(obs_k - (\alpha + \beta d_k)\big)^2$$

$$\frac{\partial SSE}{\partial \alpha} = \sum_k 2\,(obs_k - \alpha - \beta d_k)(-1) = -2 \sum_k (obs_k - \alpha - \beta d_k)$$

$$\frac{\partial SSE}{\partial \beta} = \sum_k 2\,(obs_k - \alpha - \beta d_k)(-d_k) = -2 \sum_k (obs_k - \alpha - \beta d_k)\, d_k$$

set both partial derivatives to zero and solve for α and β:

$$\frac{\partial SSE}{\partial \alpha} = 0 \qquad \frac{\partial SSE}{\partial \beta} = 0$$
Why does this work?
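Setting both partial derivatives to zero finds the minimum because SSE is a smooth quadratic (convex) function of α and β, so its only stationary point is the global minimum. A minimal MATLAB sketch of this closed-form fit, using the table values above:

% closed-form least-squares fit of s_ij = alpha + beta*d_ij
d   = (0:5)';                                   % distances d_ij
obs = [1.000 0.368 0.135 0.050 0.018 0.007]';   % observed similarities
X   = [ones(size(d)) d];    % design matrix: columns for alpha and beta
b   = X \ obs;              % backslash solves the least-squares problem
alpha = b(1);  beta = b(2);
prd = alpha + beta*d;
sse = sum((obs - prd).^2);
fprintf('alpha = %.3f, beta = %.3f, SSE = %.4f\n', alpha, beta, sse);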
One approach: CALCULUS
nearly impossible in many situations; the mathematical problem becomes intractable
Other approaches: Search/Optimization Algorithms
require computer power
but first, a quick aside …
Illustration of one Common Modeling Technique
(1) start with a model
(2) set the free parameters to known values
(3) generate predictions from the model
(4) now treat those predictions as "data"
(5) fit the model to the "observed data"
(6) can you fit the model to the data (you should)?
(7) do you get the same parameters back (depends)?

Why would you do this?
(a) test that your model-fitting program works right
(b) check that the parameters are "identifiable" (more later)
(c) compare models based on their "flexibility"
If Model A can fit data generated by Model A and by Model B, but Model B can only really fit data generated by Model B, then perhaps Model A is too flexible.
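A minimal MATLAB sketch of steps (2)–(7) for a hypothetical one-parameter model (the exponential similarity model here is a stand-in, not the course code):

% parameter recovery: generate "data" at a known parameter, then refit
trueC = 1.5;
d = (0:5)';
model = @(c) exp(-c .* d);          % toy one-parameter model
fakeData = model(trueC);            % treat predictions as "observed data"
sse = @(c) sum((fakeData - model(c)).^2);
recovered = fminsearch(sse, 0.5);   % fit the model to the "data"
fprintf('true c = %.3f, recovered c = %.3f\n', trueC, recovered);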
Generalized Context Model (GCM)
$$s_{ij} = \exp(-c \cdot d_{ij}^{\,p})$$

$$d_{ij} = \left( \sum_{m=1}^{M} w_m \, |i_m - j_m|^r \right)^{1/r}$$

$$P(A \mid i) = \frac{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij}}{\beta_A \cdot \sum_{j=1}^{N_A} s_{ij} + \beta_B \cdot \sum_{j=1}^{N_B} s_{ij}}$$
Categorization Task
unidimensional stimuli
e.g., proportion of white vs. black squares
MATLAB EXAMPLE
Categorization Task
two-dimensional stimuli
MATLAB EXAMPLE
how do we find the values of the model parameters that minimize SSE (or maximize r, or maximize %Var)?
GRID SEARCH
[Figure: a grid over parameter 1 × parameter 2]

calculate SSE at each combination of parameter 1 and parameter 2
Matlab: See grid search for simple 1-parameter categorization model
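A minimal MATLAB sketch of a two-parameter grid search (the objective function mysse here is a stand-in, not the class model):

% evaluate SSE at every grid point and keep the best combination
p1grid = linspace(0, 5, 51);
p2grid = linspace(0, 5, 51);
mysse = @(p1,p2) (p1 - 1.3)^2 + (p2 - 2.7)^2;   % stand-in objective
best = Inf;
for a = 1:numel(p1grid)
    for b = 1:numel(p2grid)
        fit = mysse(p1grid(a), p2grid(b));
        if fit < best
            best = fit;  bestP = [p1grid(a) p2grid(b)];
        end
    end
end
fprintf('best SSE = %.4f at parameter 1 = %.2f, parameter 2 = %.2f\n', best, bestP);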
What might be some limitations of a grid search?
the finer the grid search, the more evaluations you need to run
How fine of a grid search do you run? What if the best-fitting parameters are between the ones you’ve tried?
How long does it take to run a grid search?
evaluation time for one set of parameters
x # of evaluations
# of evaluations = # steps of parm 1 × # steps of parm 2 × # steps of parm 3 × …

e.g., 1000 × 1000 × 1000 × 1000 = 10^12 evaluations

at 1 nanosecond (10^-9 s) per evaluation × 10^12 evaluations:
10^3 seconds ≈ 17 min

at 100 seconds (10^2 s) per evaluation × 10^12 evaluations:
10^14 seconds ≈ 3 million years
Hill-climbing Algorithms

simple hill climbing, Nelder-Mead Simplex, Hooke and Jeeves

"direct search methods"
Enrico Fermi and Nicholas Metropolis used one of the first digital computers, the Los Alamos Maniac, to determine which values of certain theoretical parameters (phase shifts) best fit experimental data (scattering cross sections). They varied one theoretical parameter at a time by steps of the same magnitude, and when no such increase or decrease in any one parameter further improved the fit to the experimental data, they halved the step size and repeated the process until the steps were deemed sufficiently small. Their simple procedure was slow but sure, and several of us used it on the Avidac computer at the Argonne National Laboratory for adjusting six theoretical parameters to fit the pion-proton scattering data we had gathered using the University of Chicago synchrocyclotron [7].
W. C. Davidon, Variable Metric Method for Minimization, Tech. Rep. 5990, Argonne National Laboratory, Argonne, IL, 1959.
these techniques only emerged about 50 years ago (calculus was invented roughly 350 years ago)
Simple Hill Climbing
DEMONSTRATE
Simple Hill Climbing
how many points do you need to evaluate with each step?
2 parameters

[Figure: the 8 neighboring grid points (numbered 1–8) surrounding the current point]
N parameters: 3^N − 1 evaluations per step
5 parameters: 3^5 − 1 = 242 evaluations per step
10 parameters: 3^10 − 1 = 59,048 evaluations per step
this ends up being inefficient because you may need to take thousands of steps

a "stupid" algorithm
SimpleHillClimb.m
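A minimal MATLAB sketch of the idea behind simple hill climbing (the actual SimpleHillClimb.m used in class may differ; mysse is a stand-in objective):

% evaluate the neighboring points; move to any better one; if none
% improves the fit, halve the step size; stop when the step is tiny
mysse = @(p) (p(1) - 1.3)^2 + (p(2) - 2.7)^2;   % stand-in objective
p = [0 0];                                      % starting point
step = 1;                                       % initial step size
while step > 1e-6
    improved = false;
    for di = -1:1
        for dj = -1:1
            cand = p + step*[di dj];            % one of the 8 neighbors
            if mysse(cand) < mysse(p)
                p = cand;  improved = true;
            end
        end
    end
    if ~improved
        step = step/2;   % shrink the step, as in the Fermi & Metropolis procedure
    end
end
fprintf('minimum near parameter values %.4f, %.4f\n', p);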
More sophisticated algorithms
Kolda, T.G., Lewis, R.M., & Torczon, V. (2003) Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45, 385-482.
More sophisticated algorithms
DEMONSTRATE
e.g., Hooke and Jeeves, a pattern search method
More sophisticated algorithms
DEMONSTRATE
e.g., Nelder-Mead Simplex (fminsearch in MATLAB)
http://en.wikipedia.org/wiki/Nelder-Mead_method
http://www.scholarpedia.org/article/Nelder-Mead_algorithm
What is a simplex?

0 dimensions: point (1 vertex)
1 dimension: line (2 vertices)
2 dimensions: triangle (3 vertices)
3 dimensions: tetrahedron (4 vertices)
4 dimensions: pentachoron (5 vertices)
. . .
N dimensions: N-simplex (N+1 vertices)

basically just a generalization of a triangle to N dimensions
the Nelder-Mead simplex operations: reflect, expand, contract, shrink
Matlab examples
More Matlab examples
Medin & Schaffer (1978)
stim  dims      P(A)obs  P(A)prd
A1    1 1 1 2   .77      .79
A2    1 2 1 2   .78      .83
A3    1 2 1 1   .83      .88
A4    1 1 2 1   .64      .65
A5    2 1 1 1   .61      .64
B1    1 1 2 2   .39      .45
B2    2 1 1 2   .41      .44
B3    2 2 2 1   .21      .23
B4    2 2 2 2   .15      .16
T1    1 2 2 1   .56      .62
T2    1 2 2 2   .41      .47
T3    1 1 1 1   .82      .85
T4    2 2 1 2   .40      .45
T5    2 1 2 1   .32      .34
T6    2 2 1 1   .53      .61
T7    2 1 2 2   .20      .22
$$s_{ij} = \exp(-c \cdot d_{ij}^{\,p}) \qquad d_{ij} = \left( \sum_{m=1}^{M} w_m \, |i_m - j_m|^r \right)^{1/r}$$

$$P(A \mid i) = \frac{\beta_A \sum_{j=1}^{N_A} s_{ij}}{\beta_A \sum_{j=1}^{N_A} s_{ij} + \beta_B \sum_{j=1}^{N_B} s_{ij}}$$
free parameters: w1, w2, w3, w4, c

the model generates the predicted P(A) values (P(A)prd) in the table above; fit is measured by the SSE between P(A)obs and P(A)prd
Gradient-Based Techniques
when you can calculate (or approximate) derivatives

Simulated Annealing (a generalization of the Metropolis algorithm)
with noisy objective functions and with discrete parameter values

Genetic Search Algorithms
with discrete parameter values
possible project: explore different parameter search routines to see which best recovers parameters and does it most quickly
Homework Assignment
fit the SCM; fit the GCM

partly using code we used in class today and code from last week's assignment

I encourage people to work together conceptually, but each person should do their own programming.
Problems of local minima
importance of multiple starting positions
Genetic Algorithms and Simulated Annealing may solve these problems
Simulated Annealing

always accept the new candidate parameter vector if it gives a better fit, but also accept a new candidate parameter vector with probability P if it gives a WORSE fit

e.g., P = exp(-Δfit/T)
Δfit is the decrease in fit between the current and candidate vectors
T is the "temperature", which decreases according to a schedule

as Δfit → 0, P → 1
T starts at ∞, so P starts at 1 (completely random)
T goes to 0, so P goes to 0 (pure hill climbing)

depending on the cooling schedule, simulated annealing can take orders of magnitude longer than a basic hill-climbing algorithm
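A minimal MATLAB sketch of the acceptance rule described above (the candidate-generation step and cooling schedule here are simplified assumptions):

% simulated annealing: sometimes accept a WORSE candidate
mysse = @(p) (p - 2)^2 + sin(5*p);   % stand-in objective with local minima
p = 0;  fit = mysse(p);
T = 1;                               % temperature
for step = 1:5000
    cand = p + 0.1*randn;            % random candidate near current point
    candFit = mysse(cand);
    deltaFit = candFit - fit;        % positive means the fit got WORSE
    if deltaFit < 0 || rand < exp(-deltaFit/T)
        p = cand;  fit = candFit;    % accept better always, worse with prob P
    end
    T = 0.999 * T;                   % cooling schedule
end
fprintf('ended at parameter %.3f with fit %.3f\n', p, fit);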
Genetic Algorithms

multiple candidate parameter vectors are recombined or mutated, and only some offspring are retained, akin to natural selection
Problems of local minima
importance of multiple starting positions

how do you know when you've tested enough starting points?

what starting points do you pick?
(1) based on "experience" with the model
(2) based on an initial coarse parameter search followed by a fine parameter search
(3) an initial "random search" first, like simulated annealing or genetic algorithms

how many starting points?
(1) do many starting points converge on the same optimal parameter values?
(2) need to consider the amount of time it takes to do a search from each starting point
(3) if the model fits "everything" you're okay, but it's harder to know that a model really blows it
How to use the programs
parInit = [3 2];
options = optimset('display', 'iter', 'MaxIter', 500);
[bestx,fval] = fminsearch(@mymodel, parInit, options);

passing a function handle (@mymodel) as a parameter
parInit = [1.6 -1.6];
parInc  = [0.1 0.1];
parLow  = [-4 -4];
parHigh = [ 4  4];
[HOOK_fit,HOOK_pos,HOOK_path] = ...
    hook('mymodel', parInit, parLow, parHigh, parInc, parInc/10);

passing the name of the function as a string; hook uses MATLAB's eval() function
fminsearch():

parInit = [3 2];
options = optimset('display', 'iter', 'MaxIter', 500);
[bestx,fval] = fminsearch(@mymodel, parInit, options);
[Figure: the fitting loop: the search routine passes params to mymodel(), which returns a fit value (SSE against the P(A)obs values in the table above); the routine then changes params to try to decrease the fit]
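A hedged MATLAB sketch of what such a mymodel() could look like for the Medin & Schaffer data (this body is an illustration, not the course's actual code; it assumes r = 1, p = 1, summed similarity, and equal category biases, and it uses implicit expansion, so it needs MATLAB R2016b or later):

% GCM objective: params = [w1 w2 w3 c], with w4 = 1 - w1 - w2 - w3
function sse = mymodel(params)
    stim = [1 1 1 2; 1 2 1 2; 1 2 1 1; 1 1 2 1; 2 1 1 1; ...   % A1-A5
            1 1 2 2; 2 1 1 2; 2 2 2 1; 2 2 2 2; ...            % B1-B4
            1 2 2 1; 1 2 2 2; 1 1 1 1; 2 2 1 2; ...            % T1-T4
            2 1 2 1; 2 2 1 1; 2 1 2 2];                        % T5-T7
    obs = [.77 .78 .83 .64 .61 .39 .41 .21 .15 ...
           .56 .41 .82 .40 .32 .53 .20]';
    w = [params(1:3) 1 - sum(params(1:3))];   % attention weights sum to 1
    c = params(4);                            % scaling parameter
    A = stim(1:5,:);  B = stim(6:9,:);        % category exemplars
    prd = zeros(size(obs));
    for k = 1:size(stim,1)
        dA = abs(A - stim(k,:)) * w';         % weighted city-block distances
        dB = abs(B - stim(k,:)) * w';
        EA = sum(exp(-c * dA));               % summed similarity to A
        EB = sum(exp(-c * dB));               % summed similarity to B
        prd(k) = EA / (EA + EB);              % equal-bias choice rule
    end
    sse = sum((obs - prd).^2);
end

it could then be fit with, e.g., fminsearch(@mymodel, [.25 .25 .25 1]).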
Some things to consider when using these tools:

• do you have continuous vs. discrete parameters?
  - discrete parameters may require a grid search

• need for multiple starting points because of local minima
  - did you use enough starting points?
  - do the various starting points converge?
  - how long does each parameter search take?

• where to place the starting points
  - based on experience with the model
  - preliminary exploration of the parameter space

• has the maximum number of iterations been reached?
  - MaxIter and MaxFunEvals in MATLAB
  - should only be reached if a parameter is going to ∞
Some things to consider when using these tools:

• what is the initial step size in the search?
  - consider a large step size in step 1
  - a smaller step size in step 2
  - does the algorithm decrease the step size?
  - what is the step size for each parameter?

• what is the range of valid values for each parameter?
  - does the search algorithm set min and max values?
parInit = [1.6 -1.6];
parInc  = [0.1 0.1];
parLow  = [-4 -4];
parHigh = [ 4  4];
[HOOK_fit,HOOK_pos,HOOK_path] = ...
    hook('mymodel', parInit, parLow, parHigh, parInc, parInc/10);

Hooke and Jeeves lets you specify the step size (parInc) separately for each parameter, and lets you specify the min (parLow) and max (parHigh) separately for each parameter
parInit = [3 2];
options = optimset('display', 'iter', 'MaxIter', 500);
[bestx,fval] = fminsearch(@mymodel, parInit, options);

MATLAB's fminsearch (Simplex) does not let you set min and max values; all parameters are allowed to range between -∞ and +∞

ONE SOLUTION: use fminsearchbnd from the MATLAB Central File Exchange
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=8277&objectType=file
or program the constraints yourself:

function fit = mymodel(params)
p1 = params(1);
p2 = params(2);
p3 = params(3);
. . .
fit = sse;
what if p1 can range between -∞ and +∞,
but p2 can only go between 1 and +∞,
and p3 can only go between 1 and 4?
function fit = mymodel(params)
p1 = params(1);                        % between -∞ and +∞
p2 = 1 + params(2)^2;                  % between 1 and +∞
p3 = 1 + 3*(sin(params(3)) + 1)/2;     % between 1 and 4
. . .
fit = sse;
more generally:

function fit = mymodel(params)
p1 = params(1);                                   % between -∞ and +∞
p2 = LOW + params(2)^2;                           % between LOW and +∞
p3 = HIGH - params(3)^2;                          % between -∞ and HIGH
p4 = LOW + (HIGH-LOW)*(sin(params(4)) + 1)/2;     % between LOW and HIGH
. . .
fit = sse;
MATLAB's fminsearch (Simplex) does not let you set the step size (for any parameter); it may use an initial step size that is proportional to the value of the parameter, but this cannot be set by the user. It's probably okay.

there are some other search algorithms that assume the same step size for every parameter (e.g., subplex)
Some options the programs give you …
Max Iterations
- check that Max Iterations is never hit
- set it to a big number
- sometimes searches can go off to infinity
Some options the programs give you …
Step Size / Min Step Size
rule of thumb: 1/100 of the expected parameter value

NOTE: some programs use the same step size for every parameter; you should rescale the parameter value within the model routine

one approach is to do an initial search with a large step size just to find a reasonable set of starting points, then switch to a smaller step size
Some options the programs give you …

Step Size / Min Step Size
what if the step size is way too big?

[Figure: with too large a step size, the search can jump from the starting point right past the minimum]
More sophisticated algorithms

- for noisy objective functions (e.g., with Monte Carlo simulations)
- combining hill climbing with grid search, when some parameters are continuous and some are discrete