What Makes a Good Model? Statistical Reasoning, Common Sense, Human Fallibility

Richard Shiffrin and Woojae Kim

Page 1:

What Makes a Good Model?

Statistical Reasoning, Common Sense, Human Fallibility

Richard Shiffrin, Woojae Kim

Page 2:

• What makes a good model?
– How do scientists judge?
– How should scientists judge?

Page 3:

• Model selection involves many high-level factors, but let me begin with a narrower focus on statistical inference:

– model comparison
– model estimation
– data prediction

Page 4:

• I will focus today on quantitative models, models that make quantitative predictions for quantitative data, predictions that are exact once all parameters are assigned values.

• Non-experts often find modern model selection an intimidating subject, filled with arcane terminology and difficult, complex methods of implementation. And experts argue endlessly about the merits of the many approaches.

Page 5:

• SOME METHODS:

• ML (Maximum Likelihood)
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
• BMS (Bayesian Model Selection)
• FIA (Fisher Information Approximation)
• NML (Normalized Maximum Likelihood)
• Prequential Prediction
• Cross-validation
• Extrapolation
• Generalization
• PBCM (Parametric Bootstrap Cross-fitting Method)

Page 6:

• Let me start by discussing the two ‘best’ methods:
– MDL (Minimum Description Length)
– BMS (Bayesian Model Selection)

– (and cross-validation)

Page 7:

• Good source:

• Peter Grunwald: The Minimum Description Length Principle (MIT Press, 2007)

[For background and a great deal of insight into Minimum Description Length (MDL) and its relation to Bayesian Model Selection (BMS), we highly recommend Peter Grunwald's The Minimum Description Length Principle, a 2007 MIT Press book that makes a reasonably successful attempt to present much of its material in less technical side boxes and chapters.]

Page 8:

• A quantitative model for a given task or tasks specifies the probability of each data set, for all possible data sets that could have been found.

• ‘model’ denotes a given multidimensional parameter—i.e. with all parameter values specified

• ‘model class’ denotes a collection of such models

• Thus y = ax+b is a class of linear models, and y = 2x+4 is a model in that class
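To make the distinction concrete, here is a minimal Python sketch (hypothetical code, for illustration only):

```python
import numpy as np

def linear_model_class(a, b):
    """The model class y = ax + b: each choice of (a, b) picks out one model."""
    def model(x):
        return a * x + b
    return model

# y = 2x + 4 is a single model in that class: all parameters are assigned values.
model = linear_model_class(a=2.0, b=4.0)
print(model(np.array([0.0, 1.0, 2.0])))  # [4. 6. 8.]
```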

Page 9:

In a hierarchical model, some parameters typically assign probabilities to other parameters.

• But all of the values for the parameters and hyperparameters are captured as a single multidimensional parameter (one column of the descriptive matrix I will present shortly).

• All of the data for all subjects are captured as a single multidimensional data description (one row of the matrix).

Page 10:

• Statistical model selection in its most advanced form is at heart very simple, basing goodness on the joint probability of the data and the model:

• P(Di, θj)

• In BMS, P(Di, θj) = P(Di | θj) Po(θj)
– Po(θj) is termed the ‘prior’ probability of model θj
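A minimal sketch of how this matrix can be tabulated (hypothetical Python; the toy binomial task and all numbers are my own illustration):

```python
import numpy as np
from scipy.stats import binom

# Toy setup: 3 candidate models (columns) for a binomial task with 2 trials,
# so 3 possible data outcomes (rows): 0, 1, or 2 successes.
thetas = np.array([0.2, 0.5, 0.8])    # success probability under each model
priors = np.array([0.25, 0.5, 0.25])  # Po(theta_j), the parameter priors

outcomes = np.arange(3)               # D_i: number of successes
like = binom.pmf(outcomes[:, None], 2, thetas[None, :])  # P(D_i | theta_j)
joint = like * priors[None, :]        # P(D_i, theta_j) = P(D_i | theta_j) Po(theta_j)

print(joint)                          # the joint probability matrix
print(joint.sum(axis=1))              # row sums: the data 'prior' PA(D_i)
```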

Page 11:

MODEL Classes I and II: Joint Probability Matrix

             Model Class I                  Model Class II
          θ1         θ2        θ3     . . .     λn          Data ‘Prior’
Priors:   Po(θ1)     Po(θ2)    Po(θ3) . . .     Po(λn)
D1        P1(D1&θ1)    .         .    . . .       .         PA(D1)
D2          .          .         .    . . .       .         PA(D2)
D3          .          .         .    . . .       .         PA(D3)
Di          .          .    P1(Di&λj) . . .       .         PA(Di)
Dn          .          .         .    . . .       .         PA(Dn)

Table entries give the joint values: the probability of a given data outcome AND the particular parameter value. The column headers carry the parameter priors; the row sums give the data ‘prior’ PA(Di).

Page 12:

• The entries are the joint probability of the model and the data. Where do these come from? In traditional BMS, they are simply the prior times the likelihood: The probability of the data given the model times the prior probability of the model.

• Although one might think the joint probability should also reflect the prior probability of the data, doing so in any simple way will distort the definition of the model and the model class, so we will keep the traditional approach.

Page 13:

• Model selection is based on a comparison of two (or more) model classes.

• The classes typically differ in complexity. E.g. a data set could be fit by a linear model (simpler) or a seventh-degree polynomial (more complex).

• How to compare?
– Judge a model class by its best member (MDL/NML)?
– Judge it by a weighted average of its members (BMS)?

• How balance good fit and complexity?

Page 14:

• BMS and NML both use the joint probability matrix for model selection. It is of course equivalent to separately give the conditional probabilities and the prior probabilities of the models, but I find it simpler to couch discussion directly in terms of the joint probabilities.

Page 15:

(The joint probability matrix from Page 11 is shown again here.)

Page 16:

• Of course it is critical to take prior probabilities into account to carry out sensible inference.

Page 17:

• You are all familiar with the rare disease example: A test is 80% accurate: 80% of the time you have the disease the test says so; 80% of the time you do not have the disease the test says so.

• The test says you have the disease. Should you be worried?

• The incidence of the disease in the population is 1 in 1000. This is the ‘prior’ probability, and it needs to be taken into account: P(disease | positive test) ≈ 0.004 (not 0.8).
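The arithmetic, as a quick Python check (numbers from the example above):

```python
# Bayes' rule for the rare-disease example: prior 1/1000, test 80% accurate
# in both directions. The posterior is about 0.004, not 0.8.
p_disease = 0.001
p_pos_given_disease = 0.8
p_pos_given_healthy = 0.2

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 4))  # 0.004
```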

Page 18:

In general, what we know about data from history and what we know about parameters from history will NOT be consistent with each other, because they are typically based on different sources of prior knowledge. Also, the dimensionality of the two priors differs markedly; models are used to ‘compress’ the data.

Proper inference requires that both priors be taken into account, but the field has not taken up ways to carry out inference when these are not mutually consistent.

I will soon suggest a way to take data priors into account, but for now let us follow convention and focus only on the parameter (i.e. model) prior, ignoring any data prior.

Page 19:

When we consider the models together, we must not be confused by the fact that a given column of joint probabilities might be identical in two model classes (if the parameter priors are the same for those columns) or related by a constant multiplier (if the parameter priors for those columns differ). {E.g. A column in each model class might be identical}.

This situation occurs routinely, as when one model class is nested inside another. When we realize we are selecting model classes, not a particular model, the problem dissolves. It may help to think of two identical models in different model classes as just very similar (differing by an infinitesimal amount).

We are now ready to describe BMS and MDL in terms of the joint probability matrix:

Page 20:

• The BMS model selection criterion is now simple: Sum the joint probabilities in the row for the observed data for model class 1, and separately form this sum for model class 2.

• We prefer the class with the larger sum. More precisely, the posterior probability for class 1 is its sum divided by the sum of both sums.
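In code, the criterion is a pair of row sums (a hypothetical joint matrix; the numbers are made up for illustration):

```python
import numpy as np

# Joint probability matrix: rows = data outcomes, columns = models.
# Here the first 2 columns belong to Class I and the last 3 to Class II.
joint = np.array([[0.10, 0.05, 0.02, 0.01, 0.01],
                  [0.05, 0.10, 0.08, 0.08, 0.10],
                  [0.01, 0.02, 0.10, 0.13, 0.14]])
obs = 1                                  # row index of the observed data

sum_I = joint[obs, :2].sum()             # sum over Class I columns
sum_II = joint[obs, 2:].sum()            # sum over Class II columns
print(sum_I / (sum_I + sum_II))          # posterior probability of Class I
```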

Page 21:

(The joint probability matrix is shown again, now captioned ‘MODEL Classes I and II’.)

Page 22:

• I will return to BMS after first discussing model selection approaches that base inference on the maximum probability assigned to a given data outcome within a given model class. The best such method (see Grunwald) is MDL as approximated by a particular form of NML (normalized maximum likelihood).

• This is easy to describe with our matrix:

Page 23:

MODEL Class I: Joint Probability Matrix

          θ1         θ2        θ3     . . .     θn          Data ‘Prior’
Priors:   Po(θ1)     Po(θ2)    Po(θ3) . . .     Po(θn)
D1        P1(D1&θ1)    .         .    . . .       .         PM(D1)
D2          .          .         .    . . .       .         PM(D2)
D3          .          .         .    . . .       .         PM(D3)
Di          .          .    P1(Di&θj) . . .       .         PM(Di)  ← max in this row
Dn          .          .         .    . . .       .         PM(Dn)

Table entries give the joint values: the probability of a given data outcome AND the particular parameter value. PM(Di) denotes the maximum joint value in row i.

Page 24:

• All of the modern model selection methods balance good fit and complexity. It is easy to see how NML does this: The max fit for the observed data represents good fit: larger is better. But this is divided by the sum of maxes for all possible data outcomes: We dislike models that predict everything, and want the grand sum to be as small as possible.
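A minimal sketch of the NML score computed from one class’s block of the joint matrix (hypothetical numbers):

```python
import numpy as np

# One model class's columns of the joint probability matrix.
joint_class = np.array([[0.10, 0.05],
                        [0.05, 0.10],
                        [0.01, 0.02]])
obs = 1                                   # row index of the observed data

row_maxes = joint_class.max(axis=1)       # best fit the class offers each outcome
print(row_maxes[obs] / row_maxes.sum())   # good fit on top, total flexibility below
```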

Page 25:

• The way that BMS balances good fit and complexity is called the Bayesian Occam’s Razor, and operates similarly, though it can be harder to see.

• It is easiest to see the close connection of BMS and NML by re-describing the BMS model selection criterion in a new way that is nonetheless mathematically equivalent to the usual description.

Page 26:

In the joint probability table, take the mean joint probability value for the observed data and divide by the sum of such means for all data outcomes.
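The equivalence is easy to verify numerically: the mean of a row is its sum divided by the (constant) number of columns, and that constant cancels in the ratio. A quick check with hypothetical numbers:

```python
import numpy as np

joint_class = np.array([[0.10, 0.05],
                        [0.05, 0.12],
                        [0.01, 0.02]])
obs = 1

row_means = joint_class.mean(axis=1)
row_sums = joint_class.sum(axis=1)

mean_form = row_means[obs] / row_means.sum()  # mean / sum-of-means (this slide)
sum_form = row_sums[obs] / row_sums.sum()     # the usual row-sum form (Page 20)
assert np.isclose(mean_form, sum_form)        # mathematically equivalent
print(mean_form)
```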

Page 27:

MODEL Class I: Joint Probability Matrix

          θ1         θ2        θ3     . . .     θn          Row mean
Priors:   Po(θ1)     Po(θ2)    Po(θ3) . . .     Po(θn)
D1        P1(D1&θ1)    .         .    . . .       .         M1
D2          .          .         .    . . .       .         M2
D3          .          .         .    . . .       .         M3
Di          .          .    P1(Di&θj) . . .       .         Mi
Dn          .          .         .    . . .       .         Mn

Table entries give the joint values: the BMS score for Model Class I is the mean for the observed data (say Row 1) divided by the sum of means for all rows.

Page 28:

• We end with a rather simple and fairly remarkable conceptual convergence of the NML and BMS methods:

• Both use the joint probability matrix. Both divide a statistic for the observed data by a sum of those statistics for all data outcomes.

• The statistic for NML is the max of the distribution, and the statistic for BMS is the mean (both of the joint probability values).

Page 29:

• The description in terms of max and mean allows us to compare the two approaches easily.

• Occam’s Razor becomes clear in both BMS and NML:

• Fit to the observed data is Good.
• Fit to all possible data is Bad.

Page 30:

• The way BMS and NML balance fit and complexity has many connections to another model selection criterion, prediction, often implemented in one or another form of cross-validation:

• A model class is good if the fit to the current set of data predicts new data well.

• Thus we might split the data, fit the first half, and prefer the model class that, based on that fit, best predicts the other half.
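A split-half sketch (hypothetical data; linear vs. seventh-degree polynomial, echoing the earlier example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2 * x + 4 + rng.normal(0, 0.5, x.size)     # noisy data from a line

fit_x, test_x = x[::2], x[1::2]                # split the data in half
fit_y, test_y = y[::2], y[1::2]

for degree in (1, 7):                          # simpler vs. more complex class
    coefs = np.polyfit(fit_x, fit_y, degree)   # fit the first half
    pred = np.polyval(coefs, test_x)           # predict the other half
    print(degree, np.mean((pred - test_y) ** 2))  # prefer the smaller error
```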

Page 31: (no transcript text)

Page 32:

• Note that BMS, MDL, and goodness of prediction (e.g. cross-validation) are different criteria. They usually make similar model selection choices, but are not identical (I will say more about this later).

• E.g. One can predict using Bayesian Model Averaging (integrating predictions over the posterior), but this will not necessarily produce the ‘best’ predictions.

(Some recent research by Grunwald shows how to ‘fix’ BMS to predict better).

Page 33:

• I have noted the need for inference to include prior knowledge. There has been much ‘philosophical’ argumentation about the Bayesian interpretation of priors.

• E.g. Is it sensible to assign degrees of belief to a model we know is wrong? Thus Grunwald calls the priors ‘weights’ and does not assume they must add to 1.0.

• But since BMS and NML both divide a quantity by a sum of like quantities, only the relative sizes of the weights/prior probabilities matter. We might as well think of the priors as weights.

Page 34:

Because all our models are known to be wrong, you may dislike assigning posterior degrees of belief to such models, as is done in BMS. If this bothers you, use the MDL/NML justification for model selection, and consider BMS a close approximation that is easier to calculate.

Page 35:

• It has been claimed that BMS does not depend on the intent of the experimenter (the Likelihood Principle) but NML does.

• However, if the difference between the two approaches is one of max vs mean, then the difference due to intent is limited to differences in max vs mean calculations.

Page 36:

• E.g. one can carry out a Binomial study: N trials of successes and failures, observing a string of N-1 failures and then a success, or carry out a Negative Binomial study sampling until a first success occurs, also observing N-1 failures and then a success.

• Given the same data, the BMS model selection score is of course the same for the two intents.

• It is generally the case that this is not true for NML.

• However, if the difference between the two approaches is one of max vs mean, then the difference due to intent is limited to differences in max vs mean calculations. Such differences are typically modest, and we therefore regard the NML intent differences as an aside rather than of deep fundamental importance.

• (We will discuss later situations in which intent really ought to matter, though that issue is orthogonal to the present one.)
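The Binomial/Negative Binomial point can be checked directly: for these data the two likelihoods differ only by a constant in p, so it cancels from any ratio of summed (or averaged) joint probabilities. A sketch with two hypothetical model classes, p uniform on [0, .5] versus uniform on [.5, 1]:

```python
from scipy.stats import binom, nbinom
from scipy.integrate import quad

N = 10  # data: N - 1 failures, then one success

def bayes_factor(likelihood):
    # marginal likelihood under each class (uniform density 2 on a half-interval)
    m1, _ = quad(lambda p: likelihood(p) * 2, 0.0, 0.5)
    m2, _ = quad(lambda p: likelihood(p) * 2, 0.5, 1.0)
    return m1 / m2

bf_binomial = bayes_factor(lambda p: binom.pmf(1, N, p))        # fixed-N intent
bf_negbinom = bayes_factor(lambda p: nbinom.pmf(N - 1, 1, p))   # stop-at-success intent
print(bf_binomial, bf_negbinom)   # identical: the constant factor N cancels
```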

Page 37:

Generalizing Statistical Model Selection:

I: Data Priors

Page 38:

• One could imagine data priors and parameter priors that are consistent: Take the joint probability matrix: The column sums are the parameter (model) priors and the row sums are the data priors, and these are then consistent with each other.

• This raises the question: from where do the joint probabilities arise?

Page 39:

• Going into an experiment what we know about models and what we know about data are (almost) always based on different sources of knowledge, and will not be consistent with each other.

• In actual practice, we usually know more, and are more confident, about probable data outcomes than about model parameter values. After all, our models are reflections of, and attempts to characterize, the real world, i.e. data.

Page 40:

• No current model selection method, including BMS and MDL, provides a means for dealing with data priors.

• There are several ways we have considered for doing so. This is research in progress. Let me mention one reasonable possibility. Consider BMS first.

Page 41:

• Suppose our knowledge of likely data is not based on an earlier replication of the present study, but instead on vague inference from general knowledge and prior studies in other paradigms.

• Such knowledge has two main dimensions:
– The relative shape of data outcomes
– The strength of belief in such inference

Page 42:

• We can represent both by imagining we had a prior study:
– assume the prior study had m trials (representing the strength of knowledge)
– assign different probabilities to data outcomes of that study (representing shape knowledge)

Page 43:

• For expanded inference, we select one of the prior study outcomes, call it D*j, and combine that data with each of the actual and potential data outcomes of the present study: i.e., the rows of the matrix that previously represented Di now represent Di+D*j.

• We now carry out Bayesian inference on this matrix as usual, obtaining a posterior based on both the present study’s data and one of the imagined data samples.

Page 44:

• We do this for every imagined data sample, obtaining M* posteriors, the probability of each given by the data prior.

• These posteriors are weighted by the data prior probabilities, and averaged.

• The model selection criterion is, as usual, the sum of the resultant (average) posterior across the models in a class:

BMS*_DP(K) = Σ_k Σ_j p′(D_obs, D*_j, θ_k) p_o(D*_j), where j runs over the imagined outcomes D*_j and k over the models θ_k in class K
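A sketch of the whole procedure on a toy binomial problem (hypothetical numbers throughout; this is one reading of the proposal, not a settled method):

```python
import numpy as np
from scipy.stats import binom

thetas = np.array([0.2, 0.5, 0.8])        # candidate models
param_prior = np.ones(3) / 3              # Po(theta)
m, n = 5, 10                              # imagined prior study; present study
k_obs = 7                                 # observed successes in present study

# Data prior: belief about the imagined m-trial outcomes (shape knowledge).
data_prior = binom.pmf(np.arange(m + 1), m, 0.7)
data_prior /= data_prior.sum()

avg_posterior = np.zeros_like(thetas)
for k_star, w in zip(np.arange(m + 1), data_prior):
    # combine the imagined outcome with the observed data, update as usual
    post = binom.pmf(k_obs + k_star, n + m, thetas) * param_prior
    avg_posterior += w * post / post.sum()

print(avg_posterior)   # average posterior, weighted by the data prior
```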

Page 45:

• To do this as stated would generally not be computationally feasible, due to the large number of imagined data outcomes.

• I believe, though this is not yet confirmed, that the proposed system will work pretty well if we represent the data prior with just a few representative imagined data outcomes.

Page 46:

• There is an analogous expanded formulation within NML, but I will not discuss that today, to save time.

Page 47:

• To summarize, we represent prior data knowledge by imagining a prior study with size of study representing our strength of knowledge (relative to the present study), and with outcome structure representing the form of the knowledge.

• We assume one of those imagined outcomes occurred, combine that outcome with the actual and possible outcomes of the present study, and carry out BMS, obtaining a posterior.

• Then these posteriors are averaged.

Page 48:

Generalizing Statistical Model Selection:
II: Data Validity

Page 49:

• In any real life model selection situation we not only have to consider inference based on the observed and virtual data, but also the reality that the observed data might be invalid. For example programming errors might have been made anywhere from experimental design to implementation to data analysis.

• All of us have experienced cases where a research assistant brings us results we do not believe, and most often further checking reveals problems that show such data to have been invalid.

Page 50:

• Other common cases occur with study replications where the outcomes are inconsistent to a degree unlikely to have occurred by chance. We probably trust our own study more than a study by someone else, but in truth we should allow for the possibility that either is invalid (or even that both are).

• Of course our validity inferences should be governed in part by the number of studies whose results are consistent with each other: One deviant study among n consistent studies is likely the invalid one.

Page 51:

• Yet other common cases occur when one or more of the studies have a design that is not the one being modeled (the methods are unclear, incomplete, or misinterpreted). In these cases the researcher has not made an error, but the application of the various model classes is inappropriate.

• Here it is proper to judge a data set as ‘invalid’ for the models being assessed, although it might well be more accurate to judge the models as inappropriate.

Page 52:

• With respect to validity, the size of the study acts in peculiar ways. If one study shows p(c) = .98 on 1000 trials and another shows p(c) = .02 on 1,000,000 trials, we are sure one is invalid, but do not want to conclude that the one with the larger n is the valid one.

• Why? We have an intuition that the encoding of results was reversed in one of the studies, but the findings are equally consistent with either study being the one with the error.

Page 53:

• Generally we do not expect different studies to have exactly the same parameters within a given model class. For example, slight differences in the populations sampled might change overall performance levels.

• Thus judgments of invalidity must be made in the context of hierarchical models that posit prior distributions of parameters from which the different studies are sampled.

Page 54:

• Positing such hierarchical models and assigning priors is of course a tricky matter:

• For example, we tend to assume that overall performance can and will change across replication studies, but do not like to assume or allow parameter changes that produce qualitative changes in predictions (e.g. a variable in one study produces an increase in performance but the same variable in the other study produces a decrease).

• Nonetheless, this is the way to proceed.

Page 55:

• It is a tricky matter to introduce inferences about data validity into standard model selection methods. This is another topic of current and unresolved research.

• My best present approach involves the addition of an error model or error models into the inference process. E.g. in the previous example with performance levels of 0.02 and 0.98, an error model would assume that one of the data sets was reversed.

Page 56:

• The error models are incorporated into the parameters of the models, i.e. the columns.

• For n actual studies each data outcome (each row) is a concatenation of a potential outcome for each study. Different rows give other joint outcomes of the n studies. (One row represents the observed joint outcomes).

Page 57:

• The error models are incorporated into the parameters (the columns).

• E.g. if there are two studies, and we believe one data set might be reversed, then it would be natural to split each given model (column) into four columns, one for neither reversed, one for the first reversed, one for the second reversed, and one for both reversed.

• It would be natural to multiply the original parameter prior by our prior belief about data set validity.
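A sketch of this column splitting for the two-study reversal example (hypothetical numbers; the 0.9 validity prior is made up):

```python
import numpy as np
from itertools import product
from scipy.stats import binom

p_valid = 0.9                       # prior that a single study is valid
theta, prior_theta = 0.98, 0.5      # one model (one original column)
data = [(980, 1000), (20, 1000)]    # (successes, trials) for the two studies

# Each original column splits into four: neither / first / second / both reversed.
for rev in product([False, True], repeat=2):
    lik = np.prod([binom.pmf(n - k if r else k, n, theta)
                   for (k, n), r in zip(data, rev)])
    prior = prior_theta * np.prod([1 - p_valid if r else p_valid for r in rev])
    print(rev, lik * prior)         # joint value for the expanded column
```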

Page 58:

• In our example with 1000 and 1000000 trials, with p(c) of .02 and .98, either reversal would fit equally well, and the other possibilities would have essentially zero posterior probability and drop out.

• If we had prior belief that favored the validity of one of the studies, then the corresponding error model would be favored to that degree.

Page 59:

• HOWEVER: Computation could become impossible.

• E.g. When there are a large number of studies, we must cut the cases down to manageable size. I suggest we look at the outcomes and group them into subsets likely to be all valid or all invalid. In most cases we will have just two groups.

Page 60:

• A second serious problem is the large number of possible error models (there are many ways something can go awry).

• Suppose we allow two kinds of errors: a given study might be valid, encoding-reversed, or randomly recorded. Each study would have some prior probability of each of these possibilities being true. A set of n studies would then have 3^n ways to assign these possibilities, and 3n parameter values would determine the probabilities of the assignments.
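The count grows fast, as a two-line check shows (assuming independent per-study possibilities):

```python
from itertools import product

states = ('valid', 'reversed', 'random')
for n in (2, 5, 10, 20):
    print(n, 3 ** n)                    # 9, 243, 59049, 3486784401: quickly infeasible

print(list(product(states, repeat=2)))  # all 9 joint assignments for n = 2
```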

Page 61:

• Even ignoring computational difficulties, we cannot anticipate all error models.

• I suggest therefore that we generate our error models after the fact, after considering the results of all the studies. We then use those posthoc error models to do inference.

• Although a theoretically problematic procedure (e.g. we use the same data twice), this tends to match the way scientists act in actual practice.

Page 62:

• We should not get overwrought about posthoc selection of error models, because we do the same when postulating our model classes:

• In most cases the models being compared are developed posthoc after considering the data. I will return to this point later when discussing implicit models.

Page 63:

• In any event, I retain hopes that restricting consideration to just a few error models and just two groups of studies would make computation feasible. (This hope needs to be assessed with simulation studies).

Page 64:

• Combining data priors and data validity is straightforward. The two suggested techniques are combined:

• We construct our error models and our data priors. Then we add each virtual data set to our actual data sets, produce a posterior for each combined data set, and average these across the data prior.

Page 65:

Generalizing Statistical Model Selection:
III: Model Validity

Page 66:

• It is in general not good enough to assign posterior degrees of belief to models and model classes because:

1) we know prior to the study that all the models are wrong

2) all of the model classes might be not just wrong, but terrible.

• In practice we should and often do entertain the possibility that all the extant models are wrong, and proceed to search for a new one.

Page 67:

• Present methods do not tell us what it means to be sufficiently ‘wrong’ that all the current models should be rejected.

• Often we use qualitative criteria, based on a comparison of best predictions to the observations.

• However, often we keep a model that predicts well except for a qualitative failure—for example when it is obvious how to add a plausible mechanism that would ‘fix’ the model.

Page 68:

• This issue is related to the goals of modeling (discussed next): In many cases our modeling goal is not to say that one wrong model is slightly better than another wrong model, but instead to use modeling to identify and suggest ways to improve present models. When this is the goal, then ‘none of the above’ is always the right model selection answer.

Page 69:

Generalizing Statistical Model Selection:

IV: Goals of Modeling

Page 70:

• One cannot discuss model selection, goodness, and evaluation without considering the modeling goals.

• Some goals are ‘engineering’ in nature: One wants a model class (usually with parameters specified) that will predict well, but may not care about the processes or mechanisms by which the prediction is obtained. E.g. face recognition at airports, chess playing, designing a wheelchair that climbs stairs, assessing truth-telling from physiological measures.

Page 71:

• Another goal is ‘scientific understanding’. One wishes to understand the laws, processes, and mechanisms that are operative in some setting: E.g. How does memory work? Here one’s goal is largely to improve existing models.

• Another goal is model comparison: Which of several extant models does a better job of explaining observed data and predicting future data (balancing fit and complexity).

Page 72:

• These and other goals are best fulfilled by different approaches. E.g. an engineering goal may not care about model complexity, as long as prediction is accurate. (Of course this ignores generalization. If one wants to predict what happens in a new situation, then complexity becomes quite important.)

Page 73:

• As scientists we often have a goal of understanding. This is a subtle matter. In many (most) domains in cognitive science and psychology our models are really quite crude approximations to anything resembling the true complexities of mind and brain. Thus we never seek the ‘correct’ answer (though we sometimes pretend we are doing so), but seek a greatly simplified system that captures (approximates) ‘enough’ of the processes operating to increase understanding, refine models, and suggest further research.

Page 74:

• This scientific goal requires that we understand the models we propose and use.

• Model classes vary widely in the degree to which such understanding exists, and in the amount of model exploration required to understand how a model class operates and how its parameters determine predictions.

• (Myung, Pitt, and Kim term part of this process ‘landscaping’.)

Page 75:

• We are all familiar, for example, with neural net modeling. Early versions used feedforward transmission of activation from one layer of nodes to another (usually three or more layers), the activation between nodes governed by weights (parameters).

• In one sense such systems are ‘understandable’, because the structure is simple.

Page 76:

• However, the number of nodes and parameters can be very large, making it hard to see how behavior emerges through condensation in the intermediate layers of the model. Thus in a number of cases, researchers analyze the weights in the intermediate layers with multidimensional scaling techniques to try to uncover the underlying dimensions of the weights that control behavior.

Page 77:

• In practice, as both neural net models and many other modeling approaches grow more and more complicated, they almost always expand in modular fashion, with components (e.g. layers or groups of nodes in neural net models) dedicated to some understandable function.

• This kind of modularization is not required by mathematics. Modularization often helps computationally, by allowing more efficient search of the parameter space.

• But most important, modularization helps us understand how the model class operates.

Page 78:

• The problem of understanding becomes even more acute when models include recurrence, so that activation can return to nodes that are also activation senders. Such models can become chaotic and extremely nonlinear, with predictions that change radically with small changes in parameters.

• The goal of ‘understanding’ has led most theorists to propose recurrent models with very simple modular structure (Grossberg, Elman, etc).

Page 79:

• The problems of understanding also arise very often in probabilistic modeling: Such models are often highly nonlinear in the way they map parameters to predictions, and it can be difficult to obtain analytic expressions for predictions. In such cases, it is necessary to explore a model class with extensive Monte Carlo methods that are very demanding of computational resources, making it difficult to explore the parameter space.
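A minimal Monte Carlo exploration of this kind (an entirely hypothetical toy model; the point is only the grid-plus-simulation pattern):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_accuracy(drift, noise, n=5000):
    # toy evidence-accumulation model, treated as if it had no closed-form
    # prediction: estimate predicted accuracy by simulating n trials
    evidence = drift + noise * rng.standard_normal(n)
    return np.mean(evidence > 0)

for drift in (0.1, 0.5, 1.0):          # sweep the parameter space on a grid
    for noise in (0.5, 1.0):
        print(drift, noise, simulate_accuracy(drift, noise))
```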

Page 80:

• Difficulties or not, when the goals of modeling include understanding, typically the case in science, then it is in most cases essential to explore the model classes in detail. Simply showing a model class fits well, or fits better than another, is at best a starting point for analysis, not an end point of the research.

Page 81:

Generalizing Statistical Model Selection:
V: Implicit Models

Page 82:

• In scientific practice it is a fairly rare case when one generates the models to be judged in advance.

• Most often one looks at the data, and formulates models appropriate for the new study and new data. Quite often the formulated models may be altered versions of previous models, but sometimes have a new structure.

Page 83:

• Either way, the structure of the models is usually formulated after examining the data. Some or many of the model assumptions are built into the structure, but are not parameterized, and therefore alternative assumptions do not appear as separate columns in the model selection matrix. This is a potential problem because such structural assumptions were chosen after looking at the data, to make at least one of the model classes accord with the observations.

Page 84:

• Model comparisons try to balance fit and complexity and it should be clear that choosing model structure for a model class after looking at the data tailors the model class to the data in a way that the model comparison methods have no way of capturing. This is particularly a problem when such tailoring differs for the model classes being compared.

Page 85:

• Although one could argue that in theory the model classes should be specified before examining the data, and there are some situations where this might be appropriate, it is not possible in most situations, because the models have been developed for other tasks and do not make explicit predictions for the new task until some task-specific alterations/additions have been made. Although one could imagine such changes in advance, human imagination is in general insufficient, and is helped mightily by viewing and considering the actual data.

Page 86:

• Probably the best we can do is either or both of:

A) acknowledging such implicit assumptions and restricting conclusions accordingly, or

B) trying to make such implicit structural assumptions common to the model classes being considered, in a way that is as fair as possible to both classes.

Page 87:

• Time does not permit discussion of other higher level factors that help us judge models. These include admittedly highly subjective factors such as ‘elegance’. They include description in terms of mathematical axioms that allow analytical derivations. They include the ability to predict a priori, and especially to predict a priori results that are different from those anticipated.

Page 88:

• Finally let me note that this talk has emphasized theory of model selection. I have also been carrying out empirical research exploring aspects of the way that scientists make judgments (in particular how they explain noisy data). Originally I hoped to include some of this research in this address, but it would at least double the length of the presentation, so I must leave that research for a talk in another setting.