What Makes a Good Model? Statistical Reasoning, Common Sense, Human Fallibility

Richard Shiffrin and Woojae Kim

Page 1:

What Makes a Good Model?

Statistical Reasoning, Common Sense, Human Fallibility

Richard Shiffrin, Woojae Kim

Page 2:

• What makes a good model?
– How do scientists judge?
– How should scientists judge?

Page 3:

• Model selection involves many high-level factors, but let me begin with a narrower focus on statistical inference:

– model comparison
– model estimation
– data prediction

Page 4:

• I will focus today on quantitative models, models that make quantitative predictions for quantitative data, predictions that are exact once all parameters are assigned values.

• Non-experts often find modern model selection an intimidating subject, filled with arcane terminology and difficult, complex methods of implementation. And experts argue endlessly about the merits of the many approaches.

Page 5:

• SOME METHODS:

• ML (Maximum Likelihood)
• AIC (Akaike Information Criterion)
• BIC (Bayesian Information Criterion)
• BMS (Bayesian Model Selection)
• FIA (Fisher Information Approximation)
• NML (Normalized Maximum Likelihood)
• Prequential Prediction
• Cross-validation
• Extrapolation
• Generalization
• PBCM (Parametric Bootstrap Cross-fitting Method)

Page 6:

• Let me start by discussing the two ‘best’ methods:
– MDL (Minimum Description Length)
– BMS (Bayesian Model Selection)

– (and cross-validation)

Page 7:

• Good source:

• Peter Grunwald: The Minimum Description Length Principle (MIT Press, 2007)

[For background and a great deal of insight into Minimum Description Length (MDL) and its relation to Bayesian Model Selection (BMS), we highly recommend Peter Grunwald's The Minimum Description Length Principle, a 2007 MIT Press book that makes a reasonably successful attempt to present much of its material in less technical side boxes and chapters.]

Page 8:

• A quantitative model for a given task or tasks specifies the probability of each data set, for all possible data sets that could have been found.

• ‘model’ denotes a given multidimensional parameter—i.e. with all parameter values specified

• ‘model class’ denotes a collection of such models

• Thus y = ax+b is a class of linear models, and y = 2x+4 is a model in that class
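To make the distinction concrete, here is a minimal Python sketch (hypothetical code, for illustration only):

```python
import numpy as np

def linear_model_class(a, b):
    """The model class y = ax + b: each choice of (a, b) picks out one model."""
    def model(x):
        return a * x + b
    return model

# y = 2x + 4 is a single model in that class: all parameters are assigned values.
model = linear_model_class(a=2.0, b=4.0)
print(model(np.array([0.0, 1.0, 2.0])))  # [4. 6. 8.]
```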

Page 9:

In a hierarchical model, some parameters typically assign probabilities to other parameters.

• But all of the values for the parameters and hyperparameters are captured as a single multidimensional parameter (one column of the descriptive matrix I will present shortly).

• All of the data for all subjects are captured as a single multidimensional data description (one row of the matrix).

Page 10:

• Statistical model selection in its most advanced form is at heart very simple, basing goodness on the joint probability of the data and the model:

• P(Di, θj)

• In BMS, P(Di, θj) = P(Di | θj) Po(θj)
– Po(θj) is termed the ‘prior’ probability of model θj
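A minimal sketch of how this matrix can be tabulated (hypothetical Python; the toy binomial task and all numbers are my own illustration):

```python
import numpy as np
from scipy.stats import binom

# Toy setup: 3 candidate models (columns) for a binomial task with 2 trials,
# so 3 possible data outcomes (rows): 0, 1, or 2 successes.
thetas = np.array([0.2, 0.5, 0.8])    # success probability under each model
priors = np.array([0.25, 0.5, 0.25])  # Po(theta_j), the parameter priors

outcomes = np.arange(3)               # D_i: number of successes
like = binom.pmf(outcomes[:, None], 2, thetas[None, :])  # P(D_i | theta_j)
joint = like * priors[None, :]        # P(D_i, theta_j) = P(D_i | theta_j) Po(theta_j)

print(joint)                          # the joint probability matrix
print(joint.sum(axis=1))              # row sums: the data 'prior' PA(D_i)
```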

Page 11:

MODEL Classes I and II: Joint Probability Matrix

             Model Class I                  Model Class II
          θ1         θ2        θ3     . . .     λn          Data ‘Prior’
Priors:   Po(θ1)     Po(θ2)    Po(θ3) . . .     Po(λn)
D1        P1(D1&θ1)    .         .    . . .       .         PA(D1)
D2          .          .         .    . . .       .         PA(D2)
D3          .          .         .    . . .       .         PA(D3)
Di          .          .    P1(Di&λj) . . .       .         PA(Di)
Dn          .          .         .    . . .       .         PA(Dn)

Table entries give the joint values: the probability of a given data outcome AND the particular parameter value. The column headers carry the parameter priors; the row sums give the data ‘prior’ PA(Di).

Page 12:

• The entries are the joint probability of the model and the data. Where do these come from? In traditional BMS, they are simply the prior times the likelihood: The probability of the data given the model times the prior probability of the model.

• Although one might think the joint probability should also reflect the prior probability of the data, doing so in any simple way will distort the definition of the model and the model class, so we will keep the traditional approach.

Page 13:

• Model selection is based on a comparison of two (or more) model classes.

• The classes typically differ in complexity. E.g. a data set could be fit by a linear model (simpler) or a seventh-degree polynomial (more complex).

• How to compare?
– Judge a model class by its best member (MDL/NML)?
– Judge it by a weighted average of its members (BMS)?

• How balance good fit and complexity?

Page 14:

• BMS and NML both use the joint probability matrix for model selection. It is of course equivalent to separately give the conditional probabilities and the prior probabilities of the models, but I find it simpler to couch discussion directly in terms of the joint probabilities.

Page 15:

(The joint probability matrix from Page 11 is shown again here.)

Page 16:

• Of course it is critical to take prior probabilities into account to carry out sensible inference.

Page 17:

• You are all familiar with the rare disease example: A test is 80% accurate: 80% of the time you have the disease the test says so; 80% of the time you do not have the disease the test says so.

• The test says you have the disease. Should you be worried?

• The incidence of the disease in the population is 1 in 1000. This is the ‘prior’ probability, and it needs to be taken into account: P(disease | positive test) ≈ 0.004 (not 0.8).
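The arithmetic, as a quick Python check (numbers from the example above):

```python
# Bayes' rule for the rare-disease example: prior 1/1000, test 80% accurate
# in both directions. The posterior is about 0.004, not 0.8.
p_disease = 0.001
p_pos_given_disease = 0.8
p_pos_given_healthy = 0.2

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 4))  # 0.004
```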

Page 18:

In general, what we know about data from history and what we know about parameters from history will NOT be consistent with each other, because they are typically based on different sources of prior knowledge. Also, the dimensionality of the two priors differs markedly; models are used to ‘compress’ the data.

Proper inference requires that both priors be taken into account, but the field has not taken up ways to carry out inference when these are not mutually consistent.

I will soon suggest a way to take data priors into account, but for now let us follow convention and focus only on the parameter (i.e. model) prior, ignoring any data prior.

Page 19:

When we consider the models together, we must not be confused by the fact that a given column of joint probabilities might be identical in two model classes (if the parameter priors are the same for those columns) or related by a constant multiplier (if the parameter priors for those columns differ). {E.g. A column in each model class might be identical}.

This situation occurs routinely, as when one model class is nested inside another. When we realize we are selecting model classes, not a particular model, the problem dissolves. It may help to think of two identical models in different model classes as just very similar (differing by an infinitesimal amount).

We are now ready to describe BMS and MDL in terms of the joint probability matrix:

Page 20:

• The BMS model selection criterion is now simple: Sum the joint probabilities in the row for the observed data for model class 1, and separately form this sum for model class 2.

• We prefer the class with the larger sum. More precisely, the posterior probability for class 1 is its sum divided by the sum of both sums.
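In code, the criterion is a pair of row sums (a hypothetical joint matrix; the numbers are made up for illustration):

```python
import numpy as np

# Joint probability matrix: rows = data outcomes, columns = models.
# Here the first 2 columns belong to Class I and the last 3 to Class II.
joint = np.array([[0.10, 0.05, 0.02, 0.01, 0.01],
                  [0.05, 0.10, 0.08, 0.08, 0.10],
                  [0.01, 0.02, 0.10, 0.13, 0.14]])
obs = 1                                  # row index of the observed data

sum_I = joint[obs, :2].sum()             # sum over Class I columns
sum_II = joint[obs, 2:].sum()            # sum over Class II columns
print(sum_I / (sum_I + sum_II))          # posterior probability of Class I
```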

Page 21:

(The joint probability matrix is shown again, now captioned ‘MODEL Classes I and II’.)

Page 22:

• I will return to BMS after first discussing model selection approaches that base inference on the maximum probability assigned to a given data outcome within a given model class. The best such method (see Grunwald) is MDL as approximated by a particular form of NML (normalized maximum likelihood).

• This is easy to describe with our matrix:

Page 23:

MODEL Class I: Joint Probability Matrix

          θ1         θ2        θ3     . . .     θn          Data ‘Prior’
Priors:   Po(θ1)     Po(θ2)    Po(θ3) . . .     Po(θn)
D1        P1(D1&θ1)    .         .    . . .       .         PM(D1)
D2          .          .         .    . . .       .         PM(D2)
D3          .          .         .    . . .       .         PM(D3)
Di          .          .    P1(Di&θj) . . .       .         PM(Di)  ← max in this row
Dn          .          .         .    . . .       .         PM(Dn)

Table entries give the joint values: the probability of a given data outcome AND the particular parameter value. PM(Di) denotes the maximum joint value in row i.

Page 24:

• All of the modern model selection methods balance good fit and complexity. It is easy to see how NML does this: The max fit for the observed data represents good fit: larger is better. But this is divided by the sum of maxes for all possible data outcomes: We dislike models that predict everything, and want the grand sum to be as small as possible.
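A minimal sketch of the NML score computed from one class’s block of the joint matrix (hypothetical numbers):

```python
import numpy as np

# One model class's columns of the joint probability matrix.
joint_class = np.array([[0.10, 0.05],
                        [0.05, 0.10],
                        [0.01, 0.02]])
obs = 1                                   # row index of the observed data

row_maxes = joint_class.max(axis=1)       # best fit the class offers each outcome
print(row_maxes[obs] / row_maxes.sum())   # good fit on top, total flexibility below
```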

Page 25:

• The way that BMS balances good fit and complexity is called the Bayesian Occam’s Razor, and operates similarly, though it can be harder to see.

• It is easiest to see the close connection of BMS and NML by re-describing the BMS model selection criterion in a new way that is nonetheless mathematically equivalent to the usual description.

Page 26:

In the joint probability table, take the mean joint probability value for the observed data and divide by the sum of such means for all data outcomes.
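The equivalence is easy to verify numerically: the mean of a row is its sum divided by the (constant) number of columns, and that constant cancels in the ratio. A quick check with hypothetical numbers:

```python
import numpy as np

joint_class = np.array([[0.10, 0.05],
                        [0.05, 0.12],
                        [0.01, 0.02]])
obs = 1

row_means = joint_class.mean(axis=1)
row_sums = joint_class.sum(axis=1)

mean_form = row_means[obs] / row_means.sum()  # mean / sum-of-means (this slide)
sum_form = row_sums[obs] / row_sums.sum()     # the usual row-sum form (Page 20)
assert np.isclose(mean_form, sum_form)        # mathematically equivalent
print(mean_form)
```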

Page 27:

MODEL Class I: Joint Probability Matrix

          θ1         θ2        θ3     . . .     θn          Row mean
Priors:   Po(θ1)     Po(θ2)    Po(θ3) . . .     Po(θn)
D1        P1(D1&θ1)    .         .    . . .       .         M1
D2          .          .         .    . . .       .         M2
D3          .          .         .    . . .       .         M3
Di          .          .    P1(Di&θj) . . .       .         Mi
Dn          .          .         .    . . .       .         Mn

Table entries give the joint values: the BMS score for Model Class I is the mean for the observed data (say Row 1) divided by the sum of means for all rows.

Page 28:

• We end with a rather simple and fairly remarkable conceptual convergence of the NML and BMS methods:

• Both use the joint probability matrix. Both divide a statistic for the observed data by a sum of those statistics for all data outcomes.

• The statistic for NML is the max of the distribution, and the statistic for BMS is the mean (both of the joint probability values).

Page 29:

• The description in terms of max and mean allows us to compare the two approaches easily.

• Occam’s Razor becomes clear in both BMS and NML:

• Fit to the observed data is Good.
• Fit to all possible data is Bad.

Page 30:

• The way BMS and NML balance fit and complexity has many connections to another model selection criterion, prediction, often implemented in one or another form of cross-validation:

• A model class is good if the fit to the current set of data predicts new data well.

• Thus we might split the data, fit the first half, and prefer the model class that, based on that fit, best predicts the other half.
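A split-half sketch (hypothetical data; linear vs. seventh-degree polynomial, echoing the earlier example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2 * x + 4 + rng.normal(0, 0.5, x.size)     # noisy data from a line

fit_x, test_x = x[::2], x[1::2]                # split the data in half
fit_y, test_y = y[::2], y[1::2]

for degree in (1, 7):                          # simpler vs. more complex class
    coefs = np.polyfit(fit_x, fit_y, degree)   # fit the first half
    pred = np.polyval(coefs, test_x)           # predict the other half
    print(degree, np.mean((pred - test_y) ** 2))  # prefer the smaller error
```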

Page 31: (no transcript text)

Page 32:

• Note that BMS, MDL, and goodness of prediction (e.g. cross-validation) are different criteria. They usually make similar model selection choices, but are not identical (I will say more about this later).

• E.g. One can predict using Bayesian Model Averaging (integrating predictions over the posterior), but this will not necessarily produce the ‘best’ predictions.

(Some recent research by Grunwald shows how to ‘fix’ BMS to predict better).

Page 33:

• I have noted the need for inference to include prior knowledge. There has been much ‘philosophical’ argumentation about the Bayesian interpretation of priors.

• E.g. Is it sensible to assign degrees of belief to a model we know is wrong? Thus Grunwald calls the priors ‘weights’ and does not assume they must add to 1.0.

• But since BMS and NML both divide a quantity by a sum of like quantities, only the relative sizes of the weights/prior probabilities matter. We might as well think of the priors as weights.

Page 34:

Because all our models are known to be wrong, you may dislike assigning posterior degrees of belief to such models, as is done in BMS. If this bothers you, use the MDL/NML justification for model selection, and consider BMS a close approximation that is easier to calculate.

Page 35:

• It has been claimed that BMS does not depend on the intent of the experimenter (the Likelihood Principle) but NML does.

• However, if the difference between the two approaches is one of max vs mean, then the difference due to intent is limited to differences in max vs mean calculations.

Page 36:

• E.g. one can carry out a Binomial study: N trials of successes and failures, observing a string of N-1 failures and then a success, or carry out a Negative Binomial study sampling until a first success occurs, also observing N-1 failures and then a success.

• Given the same data, the BMS model selection score is of course the same for the two intents.

• It is generally the case that this is not true for NML.

• However, if the difference between the two approaches is one of max vs mean, then the difference due to intent is limited to differences in max vs mean calculations. Such differences are typically modest, and we therefore regard the NML intent differences as an aside rather than of deep fundamental importance.

• (We will discuss later situations in which intent really ought to matter, though that issue is orthogonal to the present one.)
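The Binomial/Negative Binomial point can be checked directly: for these data the two likelihoods differ only by a constant in p, so it cancels from any ratio of summed (or averaged) joint probabilities. A sketch with two hypothetical model classes, p uniform on [0, .5] versus uniform on [.5, 1]:

```python
from scipy.stats import binom, nbinom
from scipy.integrate import quad

N = 10  # data: N - 1 failures, then one success

def bayes_factor(likelihood):
    # marginal likelihood under each class (uniform density 2 on a half-interval)
    m1, _ = quad(lambda p: likelihood(p) * 2, 0.0, 0.5)
    m2, _ = quad(lambda p: likelihood(p) * 2, 0.5, 1.0)
    return m1 / m2

bf_binomial = bayes_factor(lambda p: binom.pmf(1, N, p))        # fixed-N intent
bf_negbinom = bayes_factor(lambda p: nbinom.pmf(N - 1, 1, p))   # stop-at-success intent
print(bf_binomial, bf_negbinom)   # identical: the constant factor N cancels
```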

Page 37:

Generalizing Statistical Model Selection:

I: Data Priors

Page 38:

• One could imagine data priors and parameter priors that are consistent: Take the joint probability matrix: The column sums are the parameter (model) priors and the row sums are the data priors, and these are then consistent with each other.

• This raises the question: from where do the joint probabilities arise?

Page 39:

• Going into an experiment what we know about models and what we know about data are (almost) always based on different sources of knowledge, and will not be consistent with each other.

• In actual practice, we usually know more, and are more confident, about probable data outcomes than about model parameter values. After all, our models are reflections of, and attempts to characterize, the real world, i.e. data.

Page 40:

• No current model selection method, including BMS and MDL, provides a means for dealing with data priors.

• There are several ways we have considered for doing so. This is research in progress. Let me mention one reasonable possibility. Consider BMS first.

Page 41:

• Suppose our knowledge of likely data is not based on an earlier replication of the present study, but instead on vague inference from general knowledge and prior studies in other paradigms.

• Such knowledge has two main dimensions:
– The relative shape of data outcomes
– The strength of belief in such inference

Page 42:

• We can represent both by imagining we had a prior study:
– assume the prior study had m trials (representing the strength of knowledge)
– assign different probabilities to data outcomes of that study (representing shape knowledge)

Page 43:

• For expanded inference, we select one of the prior study outcomes, call it D*j, and combine that data with each of the actual and potential data outcomes of the present study: i.e., the rows of the matrix that previously represented Di now represent Di+D*j.

• We now carry out Bayesian inference on this matrix as usual, obtaining a posterior based on both the present study’s data and one of the imagined data samples.

Page 44:

• We do this for every imagined data sample, obtaining M* posteriors, the probability of each given by the data prior.

• These posteriors are weighted by the data prior probabilities, and averaged.

• The model selection criterion is, as usual, the sum of the resultant (average) posterior across the models in a class:

BMS*_DP(K) = Σ_k Σ_j p′(D_obs, D*_j, θ_k) p_o(D*_j), where j runs over the imagined outcomes D*_j and k over the models θ_k in class K
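A sketch of the whole procedure on a toy binomial problem (hypothetical numbers throughout; this is one reading of the proposal, not a settled method):

```python
import numpy as np
from scipy.stats import binom

thetas = np.array([0.2, 0.5, 0.8])        # candidate models
param_prior = np.ones(3) / 3              # Po(theta)
m, n = 5, 10                              # imagined prior study; present study
k_obs = 7                                 # observed successes in present study

# Data prior: belief about the imagined m-trial outcomes (shape knowledge).
data_prior = binom.pmf(np.arange(m + 1), m, 0.7)
data_prior /= data_prior.sum()

avg_posterior = np.zeros_like(thetas)
for k_star, w in zip(np.arange(m + 1), data_prior):
    # combine the imagined outcome with the observed data, update as usual
    post = binom.pmf(k_obs + k_star, n + m, thetas) * param_prior
    avg_posterior += w * post / post.sum()

print(avg_posterior)   # average posterior, weighted by the data prior
```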

Page 45:

• To do this as stated would generally not be computationally feasible, due to the large number of imagined data outcomes.

• I believe, though this is not yet confirmed, that the proposed system will work pretty well if we represent the data prior with just a few representative imagined data outcomes.

Page 46:

• There is an analogous expanded formulation within NML, but I will not discuss that today, to save time.

Page 47:

• To summarize, we represent prior data knowledge by imagining a prior study with size of study representing our strength of knowledge (relative to the present study), and with outcome structure representing the form of the knowledge.

• We assume one of those imagined outcomes occurred, combine that outcome with the actual and possible outcomes of the present study, and carry out BMS, obtaining a posterior.

• Then these posteriors are averaged.

Page 48:

Generalizing Statistical Model Selection:
II: Data Validity

Page 49:

• In any real life model selection situation we not only have to consider inference based on the observed and virtual data, but also the reality that the observed data might be invalid. For example programming errors might have been made anywhere from experimental design to implementation to data analysis.

• All of us have experienced cases where a research assistant brings us results we do not believe, and most often further checking reveals problems that show such data to have been invalid.

Page 50:

• Other common cases occur with study replications where the outcomes are inconsistent to a degree unlikely to have occurred by chance. We probably trust our own study more than a study by someone else, but in truth we should allow for the possibility that either is invalid (or even that both are).

• Of course our validity inferences should be governed in part by the number of studies whose results are consistent with each other: One deviant study among n consistent studies is likely the invalid one.

Page 51:

• Yet other common cases occur when one or more of the studies have a design that is not the one being modeled (the methods are unclear, incomplete, or misinterpreted). In these cases the researcher has not made an error, but the application of the various model classes is inappropriate.

• Here it is proper to judge a data set as ‘invalid’ for the models being assessed, although it might well be more accurate to judge the models as inappropriate.

Page 52:

• With respect to validity, the size of the study acts in peculiar ways. If one study shows p(c) = .98 on 1000 trials and another shows p(c) = .02 on 1,000,000 trials, we are sure one is invalid, but do not want to conclude that the one with the larger n is the valid one.

• Why? We have an intuition that the encoding of results was reversed in one of the studies, but the findings are equally consistent with either study being the one with the error.

Page 53:

• Generally we do not expect different studies to have exactly the same parameters within a given model class. For example, slight differences in the populations sampled might change overall performance levels.

• Thus judgments of invalidity must be made in the context of hierarchical models that posit prior distributions of parameters from which the different studies are sampled.

Page 54:

• Positing such hierarchical models and assigning priors is of course a tricky matter:

• For example, we tend to assume that overall performance can and will change across replication studies, but do not like to assume or allow parameter changes that produce qualitative changes in predictions (e.g. a variable in one study produces an increase in performance but the same variable in the other study produces a decrease).

• Nonetheless, this is the way to proceed.

Page 55:

• It is a tricky matter to introduce inferences about data validity into standard model selection methods. This is another topic of current and unresolved research.

• My best present approach involves the addition of an error model or error models into the inference process. E.g. in the previous example with performance levels of 0.02 and 0.98, an error model would assume that one of the data sets was reversed.

Page 56:

• The error models are incorporated into the parameters of the models, i.e. the columns.

• For n actual studies each data outcome (each row) is a concatenation of a potential outcome for each study. Different rows give other joint outcomes of the n studies. (One row represents the observed joint outcomes).

Page 57:

• The error models are incorporated into the parameters (the columns).

• E.g. if there are two studies, and we believe one data set might be reversed, then it would be natural to split each given model (column) into four columns, one for neither reversed, one for the first reversed, one for the second reversed, and one for both reversed.

• It would be natural to multiply the original parameter prior by our prior belief about data set validity.
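A sketch of this column splitting for the two-study reversal example (hypothetical numbers; the 0.9 validity prior is made up):

```python
import numpy as np
from itertools import product
from scipy.stats import binom

p_valid = 0.9                       # prior that a single study is valid
theta, prior_theta = 0.98, 0.5      # one model (one original column)
data = [(980, 1000), (20, 1000)]    # (successes, trials) for the two studies

# Each original column splits into four: neither / first / second / both reversed.
for rev in product([False, True], repeat=2):
    lik = np.prod([binom.pmf(n - k if r else k, n, theta)
                   for (k, n), r in zip(data, rev)])
    prior = prior_theta * np.prod([1 - p_valid if r else p_valid for r in rev])
    print(rev, lik * prior)         # joint value for the expanded column
```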

Page 58:

• In our example with 1000 and 1000000 trials, with p(c) of .02 and .98, either reversal would fit equally well, and the other possibilities would have essentially zero posterior probability and drop out.

• If we had prior belief that favored the validity of one of the studies, then the corresponding error model would be favored to that degree.

Page 59:

• HOWEVER: Computation could become impossible.

• E.g. When there are a large number of studies, we must cut the cases down to manageable size. I suggest we look at the outcomes and group them into subsets likely to be all valid or all invalid. In most cases we will have just two groups.

Page 60:

• A second serious problem is the large number of possible error models (there are many ways something can go awry).

• Suppose we allow two kinds of errors: a given study might be valid, encoding-reversed, or randomly recorded. Each study would have some prior probability of each of these possibilities being true. A set of n studies would then have 3^n ways to assign these possibilities, and 3n parameter values would determine the probabilities of the assignments.
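The count grows fast, as a two-line check shows (assuming independent per-study possibilities):

```python
from itertools import product

states = ('valid', 'reversed', 'random')
for n in (2, 5, 10, 20):
    print(n, 3 ** n)                    # 9, 243, 59049, 3486784401: quickly infeasible

print(list(product(states, repeat=2)))  # all 9 joint assignments for n = 2
```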

Page 61:

• Even ignoring computational difficulties, we cannot anticipate all error models.

• I suggest therefore that we generate our error models after the fact, after considering the results of all the studies. We then use those posthoc error models to do inference.

• Although a theoretically problematic procedure (e.g. we use the same data twice), this tends to match the way scientists act in actual practice.

Page 62:

• We should not get overwrought about posthoc selection of error models, because we do the same when postulating our model classes:

• In most cases the models being compared are developed posthoc after considering the data. I will return to this point later when discussing implicit models.

Page 63:

• In any event, I retain hopes that restricting consideration to just a few error models and just two groups of studies would make computation feasible. (This hope needs to be assessed with simulation studies).

Page 64:

• Combining data priors and data validity is straightforward. The two suggested techniques are combined:

• We construct our error models and our data priors. Then we add each virtual data set to our actual data sets, produce a posterior for each combined data set, and average these across the data prior.

Page 65:

Generalizing Statistical Model Selection:
III: Model Validity

Page 66:

• It is in general not good enough to assign posterior degrees of belief to models and model classes because:

1) we know prior to the study that all the models are wrong

2) all of the model classes might be not just wrong, but terrible.

• In practice we should and often do entertain the possibility that all the extant models are wrong, and proceed to search for a new one.

Page 67:

• Present methods do not tell us what it means to be sufficiently ‘wrong’ that all the current models should be rejected.

• Often we use qualitative criteria, based on a comparison of best predictions to the observations.

• However, often we keep a model that predicts well except for a qualitative failure—for example when it is obvious how to add a plausible mechanism that would ‘fix’ the model.

Page 68:

• This issue is related to the goals of modeling (discussed next): In many cases our modeling goal is not to say that one wrong model is slightly better than another wrong model, but instead to use modeling to identify and suggest ways to improve present models. When this is the goal, then ‘none of the above’ is always the right model selection answer.

Page 69:

Generalizing Statistical Model Selection:

IV: Goals of Modeling

Page 70:

• One cannot discuss model selection, goodness, and evaluation without considering the modeling goals.

• Some goals are ‘engineering’ in nature: One wants a model class (usually with parameters specified) that will predict well, but may not care about the processes or mechanisms by which the prediction is obtained. E.g. face recognition at airports, chess playing, designing a wheelchair that climbs stairs, assessing truth-telling from physiological measures.

Page 71:

• Another goal is ‘scientific understanding’. One wishes to understand the laws, processes, and mechanisms that are operative in some setting: E.g. How does memory work? Here one’s goal is largely to improve existing models.

• Another goal is model comparison: Which of several extant models does a better job of explaining observed data and predicting future data (balancing fit and complexity).

Page 72:

• These and other goals are best fulfilled by different approaches. E.g. an engineering goal may not care about model complexity, as long as prediction is accurate. (Of course this ignores generalization. If one wants to predict what happens in a new situation, then complexity becomes quite important.)

Page 73:

• As scientists we often have a goal of understanding. This is a subtle matter. In many (most) domains in cognitive science and psychology our models are really quite crude approximations to anything resembling the true complexities of mind and brain. Thus we never seek the ‘correct’ answer (though we sometimes pretend we are doing so), but seek a greatly simplified system that captures (approximates) ‘enough’ of the processes operating to increase understanding, refine models, and suggest further research.

Page 74:

• This scientific goal requires that we understand the models we propose and use.

• Model classes vary widely in the degree to which such understanding exists, and in the amount of model exploration required to understand how a model class operates and how its parameters determine predictions.

• (Myung, Pitt, and Kim term part of this process ‘landscaping’.)

Page 75:

• We are all familiar, for example, with neural net modeling. Early versions used feedforward transmission of activation from one layer of nodes to another (usually three or more layers), the activation between nodes governed by weights (parameters).

• In one sense such systems are ‘understandable’, because the structure is simple.

Page 76:

• However, the number of nodes and parameters can be very large, making it hard to see how behavior emerges through condensation in the intermediate layers of the model. Thus in a number of cases, researchers analyze the weights in the intermediate layers with multidimensional scaling techniques to try to uncover the underlying dimensions of the weights that control behavior.

Page 77:

• In practice, as both neural net models and many other modeling approaches grow more and more complicated, they almost always expand in modular fashion, with components (e.g. layers or groups of nodes in neural net models) dedicated to some understandable function.

• This kind of modularization is not required by mathematics. Modularization often helps computationally, by allowing more efficient search of the parameter space.

• But most important, modularization helps us understand how the model class operates.

Page 78:

• The problem of understanding becomes even more acute when models include recurrence, so that activation can return to nodes that are also activation senders. Such models can become chaotic and extremely nonlinear, with predictions that change radically with small changes in parameters.

• The goal of ‘understanding’ has led most theorists to propose recurrent models with very simple modular structure (Grossberg, Elman, etc).

Page 79:

• The problems of understanding also arise very often in probabilistic modeling: Such models are often highly nonlinear in the way they map parameters to predictions, and it can be difficult to obtain analytic expressions for predictions. In such cases, it is necessary to explore a model class with extensive Monte Carlo methods that are very demanding of computational resources, making it difficult to explore the parameter space.
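A minimal Monte Carlo exploration of this kind (an entirely hypothetical toy model; the point is only the grid-plus-simulation pattern):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_accuracy(drift, noise, n=5000):
    # toy evidence-accumulation model, treated as if it had no closed-form
    # prediction: estimate predicted accuracy by simulating n trials
    evidence = drift + noise * rng.standard_normal(n)
    return np.mean(evidence > 0)

for drift in (0.1, 0.5, 1.0):          # sweep the parameter space on a grid
    for noise in (0.5, 1.0):
        print(drift, noise, simulate_accuracy(drift, noise))
```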

Page 80:

• Difficulties or not, when the goals of modeling include understanding, typically the case in science, then it is in most cases essential to explore the model classes in detail. Simply showing a model class fits well, or fits better than another, is at best a starting point for analysis, not an end point of the research.

Page 81:

Generalizing Statistical Model Selection:
V: Implicit Models

Page 82:

• In scientific practice it is a fairly rare case when one generates the models to be judged in advance.

• Most often one looks at the data, and formulates models appropriate for the new study and new data. Quite often the formulated models may be altered versions of previous models, but sometimes have a new structure.

Page 83:

• Either way, the structure of the models is usually formulated after examining the data. Some or many of the model assumptions are built into the structure, but are not parameterized, and therefore alternative assumptions do not appear as separate columns in the model selection matrix. This is a potential problem because such structural assumptions were chosen after looking at the data, to make at least one of the model classes accord with the observations.

Page 84:

• Model comparisons try to balance fit and complexity and it should be clear that choosing model structure for a model class after looking at the data tailors the model class to the data in a way that the model comparison methods have no way of capturing. This is particularly a problem when such tailoring differs for the model classes being compared.

Page 85:

• Although one could argue that in theory the model classes should be specified before examining the data, and there are some situations where this might be appropriate, it is not possible in most situations, because the models have been developed for other tasks and do not make explicit predictions for the new task until some task-specific alterations/additions have been made. Although one could imagine such changes in advance, human imagination is in general insufficient, and is helped mightily by viewing and considering the actual data.

Page 86:

• Probably the best we can do is either or both of:

A) acknowledging such implicit assumptions and restricting conclusions accordingly, or

B) trying to make such implicit structural assumptions common to the model classes being considered, in a way that is as fair as possible to both classes.

Page 87:

• Time does not permit discussion of other higher level factors that help us judge models. These include admittedly highly subjective factors such as ‘elegance’. They include description in terms of mathematical axioms that allow analytical derivations. They include the ability to predict a priori, and especially to predict a priori results that are different from those anticipated.

Page 88:

• Finally let me note that this talk has emphasized theory of model selection. I have also been carrying out empirical research exploring aspects of the way that scientists make judgments (in particular how they explain noisy data). Originally I hoped to include some of this research in this address, but it would at least double the length of the presentation, so I must leave that research for a talk in another setting.