
The Robust Beauty of Improper Linear Models in Decision Making

ROBYN M. DAWES
University of Oregon

ABSTRACT: Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl's book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.

Paul Meehl's (1954) book Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence appeared 25 years ago. It reviewed studies indicating that the prediction of numerical criterion variables of psychological interest (e.g., faculty ratings of graduate students who had just obtained a PhD) from numerical predictor variables (e.g., scores on the Graduate Record Examination, grade point averages, ratings of letters of recommendation) is better done by a proper linear model than by the clinical intuition of people presumably skilled in such prediction. The point of this article is to review evidence that even improper linear models may be superior to clinical predictions.


A proper linear model is one in which the weights given to the predictor variables are chosen in such a way as to optimize the relationship between the prediction and the criterion. Simple regression analysis is the most common example of a proper linear model; the predictor variables are weighted in such a way as to maximize the correlation between the subsequent weighted composite and the actual criterion. Discriminant function analysis is another example of a proper linear model; weights are given to the predictor variables in such a way that the resulting linear composites maximize the discrepancy between two or more groups. Ridge regression analysis, another example (Darlington, 1978; Marquardt & Snee, 1975), attempts to assign weights in such a way that the linear composites correlate maximally with the criterion of interest in a new set of data.

Thus, there are many types of proper linear models and they have been used in a variety of contexts. One example (Dawes, 1971) was presented in this Journal; it involved the prediction of faculty ratings of graduate students.

Work on this article was started at the University of Oregon and Decision Research, Inc., Eugene, Oregon; it was completed while I was a James McKeen Cattell Sabbatical Fellow at the Psychology Department at the University of Michigan and at the Research Center for Group Dynamics at the Institute for Social Research there. I thank all these institutions for their assistance, and I especially thank my friends at them who helped.

This article is based in part on invited talks given at the American Psychological Association (August 1977), the University of Washington (February 1978), the Aachen Technological Institute (June 1978), the University of Groningen (June 1978), the University of Amsterdam (June 1978), the Institute for Social Research at the University of Michigan (September 1978), Miami University, Oxford, Ohio (November 1978), and the University of Chicago School of Business (January 1979). I received valuable feedback from most of the audiences.

Requests for reprints should be sent to Robyn M. Dawes, Department of Psychology, University of Oregon, Eugene, Oregon 97403.


All graduate students at the University of Oregon's Psychology Department who had been admitted between the fall of 1964 and the fall of 1967—and who had not dropped out of the program for nonacademic reasons (e.g., psychosis or marriage)—were rated by the faculty in the spring of 1969; faculty members rated only students whom they felt comfortable rating. The following rating scale was used: 5, outstanding; 4, above average; 3, average; 2, below average; 1, dropped out of the program in academic difficulty. Such overall ratings constitute a psychologically interesting criterion because the subjective impressions of faculty members are the main determinants of the job (if any) a student obtains after leaving graduate school. A total of 111 students were in the sample; the number of faculty members rating each of these students ranged from 1 to 20, with the mean number being 5.67 and the median being 5. The ratings were reliable. (To determine the reliability, the ratings were subjected to a one-way analysis of variance in which each student being rated was regarded as a treatment. The resulting between-treatments variance ratio (η²) was .67, and it was significant beyond the .001 level.) These faculty ratings were predicted from a proper linear model based on the student's Graduate Record Examination (GRE) score, the student's undergraduate grade point average (GPA), and a measure of the selectivity of the student's undergraduate institution.¹ The cross-validated multiple correlation between the faculty ratings and predictor variables was .38. Congruent with Meehl's results, the correlation of these latter faculty ratings with the average rating of the people on the admissions committee who selected the students was .19;² that is, it accounted for one fourth as much variance. This example is typical of those found in psychological research in this area in that (a) the correlation with the model's predictions is higher than the correlation with clinical prediction, but (b) both correlations are low. These characteristics often lead psychologists to interpret the findings as meaning that while the low correlation of the model indicates that linear modeling is deficient as a method, the even lower correlation of the judges indicates only that the wrong judges were used.

An improper linear model is one in which the weights are chosen by some nonoptimal method. They may be chosen to be equal, they may be chosen on the basis of the intuition of the person making the prediction, or they may be chosen at random.

Nevertheless, improper models may have great utility. When, for example, the standardized GREs, GPAs, and selectivity indices in the previous example were weighted equally, the resulting linear composite correlated .48 with later faculty rating. Not only is the correlation of this linear composite higher than that with the clinical judgment of the admissions committee (.19), it is also higher than that obtained upon cross-validating the weights obtained from half the sample.
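The equal-weighting procedure behind that .48 figure is simple enough to state in a few lines. The following is a minimal sketch, not the original analysis; the applicant numbers and faculty ratings are invented for illustration, and only the standardize-then-add logic is the point.

```python
import numpy as np

def unit_weighted_composite(predictors):
    """Standardize each predictor column and add the columns with equal (unit) weights."""
    z = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)
    return z.sum(axis=1)

# Hypothetical numbers, not the Oregon data: columns are GRE score, GPA, and
# undergraduate-institution selectivity (1-6) for five applicants.
X = np.array([[620, 3.1, 2],
              [750, 3.6, 5],
              [680, 3.3, 3],
              [710, 3.9, 4],
              [560, 2.9, 1]], dtype=float)
faculty_rating = np.array([2.0, 4.5, 3.0, 4.0, 1.5])   # invented criterion values

composite = unit_weighted_composite(X)
print(np.corrcoef(composite, faculty_rating)[0, 1])     # validity of the unit-weighted model
```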

An example of an improper model that might be of somewhat more interest—at least to the general public—was motivated by a physician who was on a panel with me concerning predictive systems. Afterward, at the bar with his wife and me, he said that my paper might be of some interest to my colleagues, but success in graduate school in psychology was not of much general interest: "Could you, for example, use one of your improper linear models to predict how well my wife and I get along together?" he asked. I realized that I could—or might. At that time, the Psychology Department at the University of Oregon was engaged in sex research, most of which was behavioristically oriented. So the subjects of this research monitored when they made love, when they had fights, when they had social engagements (e.g., with in-laws), and so on. These subjects also made subjective ratings about how happy they were in their marital or coupled situation. I immediately thought of an improper linear model to predict self-ratings of marital happiness: rate of lovemaking minus rate of fighting. My colleague John Howard had collected just such data on couples when he was an undergraduate at the University of Missouri—Kansas City, where he worked with Alexander (1971). After establishing the intercouple reliability of judgments of lovemaking and fighting, Alexander had one partner from each of 42 couples monitor these events. She allowed us to analyze her data, with the following results:

¹ This index was based on Cass and Birnbaum's (1968) rating of selectivity given at the end of their book Comparative Guide to American Colleges. The verbal categories of selectivity were given numerical values according to the following rule: most selective, 6; highly selective, 5; very selective (+), 4; very selective, 3; selective, 2; not mentioned, 1.

² Unfortunately, only 23 of the 111 students could be used in this comparison because the rating scale the admissions committee used changed slightly from year to year.


"In the thirty happily married couples (as reported by the monitoring partner) only two argued more often than they had intercourse. All twelve of the unhappily married couples argued more often" (Howard & Dawes, 1976, p. 478). We then replicated this finding at the University of Oregon, where 27 monitors rated happiness on a 7-point scale, from "very unhappy" to "very happy," with a neutral midpoint. The correlation of rate of lovemaking minus rate of arguments with these ratings of marital happiness was .40 (p < .05); neither variable alone was significant. The findings were replicated in Missouri by D. D. Edwards and Edwards (1977) and in Texas by Thornton (1977), who found a correlation of .81 (p < .01) between the sex-argument difference and self-rating of marital happiness among 28 new couples. (The reason for this much higher correlation might be that Thornton obtained the ratings of marital happiness after, rather than before, the subjects monitored their lovemaking and fighting; in fact, one subject decided to get a divorce after realizing that she was fighting more than loving; Thornton, Note 1.) The conclusion is that if we love more than we hate, we are happy; if we hate more than we love, we are miserable. This conclusion is not very profound, psychologically or statistically. The point is that this very crude improper linear model predicts a very important variable: judgments about marital happiness.

The bulk (in fact, all) of the literature since the publication of Meehl's (1954) book supports his generalization about proper models versus intuitive clinical judgment. Sawyer (1966) reviewed a plethora of these studies, and some of these studies were quite extensive (cf. Goldberg, 1965). Some 10 years after his book was published, Meehl (1965) was able to conclude, however, that there was only a single example showing clinical judgment to be superior, and this conclusion was immediately disputed by Goldberg (1968) on the grounds that even the one example did not show such superiority. Holt (1970) criticized details of several studies, and he even suggested that prediction as opposed to understanding may not be a very important part of clinical judgment. But a search of the literature fails to reveal any studies in which clinical judgment has been shown to be superior to statistical prediction when both are based on the same codable input variables. And though most nonpositivists would agree that understanding is not synonymous with prediction, few would agree that it doesn't entail some ability to predict.

Why? Because people—especially the experts in a field—are much better at selecting and coding information than they are at integrating it.

But people are important. The statistical model may integrate the information in an optimal manner, but it is always the individual (judge, clinician, subjects) who chooses variables. Moreover, it is the human judge who knows the directional relationship between the predictor variables and the criterion of interest, or who can code the variables in such a way that they have clear directional relationships. And it is in precisely the situation where the predictor variables are good and where they have a conditionally monotone relationship with the criterion that proper linear models work well.³

The linear model cannot replace the expert in deciding such things as "what to look for," but it is precisely this knowledge of what to look for in reaching the decision that is the special expertise people have. Even in as complicated a judgment as making a chess move, it is the ability to code the board in an appropriate way to "see" the proper moves that distinguishes the grand master from the expert from the novice (deGroot, 1965; Simon & Chase, 1973). It is not in the ability to integrate information that people excel (Slovic, Note 2). Again, the chess grand master considers no more moves than does the expert; he just knows which ones to look at. The distinction between knowing what to look for and the ability to integrate information is perhaps best illustrated in a study by Einhorn (1972). Expert doctors coded biopsies of patients with Hodgkin's disease and then made an overall rating of the severity of the process. The overall rating did not predict survival time of the 193 patients, all of whom died.

³ Relationships are conditionally monotone when variables can be scaled in such a way that higher values on each predict higher values on the criterion. This condition is the combination of two more fundamental measurement conditions: (a) independence (the relationship between each variable and the criterion is independent of the values on the remaining variables) and (b) monotonicity (the ordinal relationship is one that is monotone). (See Krantz, 1972; Krantz, Luce, Suppes, & Tversky, 1971.) The true relationships need not be linear for linear models to work; they must merely be approximated by linear models. It is not true that "in order to compute a correlation coefficient between two variables the relationship between them must be linear" (advice found in one introductory statistics text). In the first place, it is always possible to compute something.


(The correlations of rating with survival time were all virtually 0, some in the wrong direction.) The variables that the doctors coded did, however, predict survival time when they were used in a multiple regression model.

In summary, proper linear models work for a very simple reason. People are good at picking out the right predictor variables and at coding them in such a way that they have a conditionally monotone relationship with the criterion. People are bad at integrating information from diverse and incomparable sources. Proper linear models are good at such integration when the predictions have a conditionally monotone relationship to the criterion.

Consider, for example, the problem of comparing one graduate applicant with GRE scores of 750 and an undergraduate GPA of 3.3 with another with GRE scores of 680 and an undergraduate GPA of 3.7. Most judges would agree that these indicators of aptitude and previous accomplishment should be combined in some compensatory fashion, but the question is how to compensate. Many judges attempting this feat have little knowledge of the distributional characteristics of GREs and GPAs, and most have no knowledge of studies indicating their validity as predictors of graduate success. Moreover, these numbers are inherently incomparable without such knowledge, GREs running from 500 to 800 for viable applicants, and GPAs from 3.0 to 4.0. Is it any wonder that a statistical weighting scheme does better than a human judge in these circumstances?
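To make the incomparability point concrete, here is a hypothetical worked comparison; the means and standard deviations (GRE: 650 and 100; GPA: 3.5 and 0.3 among viable applicants) are invented for the arithmetic, not taken from the article. Standardizing each score and weighting equally,

$$
z_A = \frac{750-650}{100} + \frac{3.3-3.5}{0.3} \approx 1.00 - 0.67 = 0.33,
\qquad
z_B = \frac{680-650}{100} + \frac{3.7-3.5}{0.3} \approx 0.30 + 0.67 = 0.97,
$$

so the second applicant ranks higher under unit weights; with different assumed spreads the ordering could reverse, which is exactly the distributional knowledge an unaided judge rarely has.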

Suppose now that it is not possible to construct a proper linear model in some situation. One reason we may not be able to do so is that our sample size is inadequate. In multiple regression, for example, b weights are notoriously unstable; the ratio of observations to predictors should be as high as 15 or 20 to 1 before b weights, which are the optimal weights, do better on cross-validation than do simple unit weights. Schmidt (1971), Goldberg (1972), and Claudy (1972) have demonstrated this need empirically through computer simulation, and Einhorn and Hogarth (1975) and Srinivisan (Note 3) have attacked the problem analytically. The general solution depends on a number of parameters such as the multiple correlation in the population and the covariance pattern between predictor variables. But the applied implication is clear. Standard regression analysis cannot be used in situations where there is not a "decent" ratio of observations to predictors.
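The observations-to-predictors point can be illustrated with a rough Monte Carlo sketch. The population values below are invented (five positively weighted predictors and a multiple correlation near .47), and the simulation is not any of the cited studies; it simply shows fitted regression weights losing to unit weights on cross-validation when the calibration sample is small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented population: five predictors, all positively weighted but unequally so.
BETA = np.array([0.10, 0.15, 0.20, 0.25, 0.30])
NOISE_SD = np.sqrt(1.0 - np.sum(BETA ** 2))       # criterion scaled to unit variance

def draw(n):
    X = rng.standard_normal((n, len(BETA)))
    y = X @ BETA + NOISE_SD * rng.standard_normal(n)
    return X, y

def cross_validated_r(n_obs, n_reps=300):
    """Average validation-sample correlation of fitted OLS weights vs. unit weights."""
    ols, unit = [], []
    for _ in range(n_reps):
        X, y = draw(n_obs)                          # small calibration sample
        b, *_ = np.linalg.lstsq(X, y, rcond=None)   # "proper" b weights
        Xv, yv = draw(20_000)                       # large fresh validation sample
        ols.append(np.corrcoef(Xv @ b, yv)[0, 1])
        unit.append(np.corrcoef(Xv.sum(axis=1), yv)[0, 1])
    return np.mean(ols), np.mean(unit)

for n in (15, 50, 200):   # 3:1, 10:1, and 40:1 observation-to-predictor ratios
    ols_r, unit_r = cross_validated_r(n)
    print(f"n = {n:3d}: OLS weights r = {ols_r:.3f}, unit weights r = {unit_r:.3f}")
```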

Another situation in which proper linear models cannot be used is that in which there are no measurable criterion variables. We might, nevertheless, have some idea about what the important predictor variables would be and the direction they would bear to the criterion if we were able to measure the criterion. For example, when deciding which students to admit to graduate school, we would like to predict some future long-term variable that might be termed "professional self-actualization." We have some idea what we mean by this concept, but no good, precise definition as yet. (Even if we had one, it would be impossible to conduct the study using records from current students, because that variable could not be assessed until at least 20 years after the students had completed their doctoral work.) We do, however, know that in all probability this criterion is positively related to intelligence, to past accomplishments, and to ability to snow one's colleagues. In our applicants' files, GRE scores assess the first variable; undergraduate GPA, the second; and letters of recommendation, the third. Might we not, then, wish to form some sort of linear combination of these variables in order to assess our applicants' potentials? Given that we cannot perform a standard regression analysis, is there nothing to do other than fall back on unaided intuitive integration of these variables when we assess our applicants?

One possible way of building an improper linear model is through the use of bootstrapping (Dawes & Corrigan, 1974; Goldberg, 1970). The process is to build a proper linear model of an expert's judgments about an outcome criterion and then to use that linear model in place of the judge. That such linear models can be accurate in predicting experts' judgments has been pointed out in the psychological literature by Hammond (1955) and Hoffman (1960). (This work was anticipated by 32 years by the late Henry Wallace, Vice-President under Roosevelt, in a 1923 agricultural article suggesting the use of linear models to analyze "what is on the corn judge's mind.") In his influential article, Hoffman termed the use of linear models a paramorphic representation of judges, by which he meant that the judges' psychological processes did not involve computing an implicit or explicit weighted average of input variables, but that it could be simulated by such a weighting. Paramorphic representations have been extremely successful (for reviews see Dawes & Corrigan, 1974; Slovic & Lichtenstein, 1971) in contexts in which predictor variables have conditionally monotone relationships to criterion variables.



The bootstrapping models make use of the weights derived from the judges; because these weights are not derived from the relationship between the predictor and criterion variables themselves, the resulting linear models are improper. Yet these paramorphic representations consistently do better than the judges from which they were derived (at least when the evaluation of goodness is in terms of the correlation between predicted and actual values).
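In code, the bootstrapping idea amounts to regressing the judge's predictions (not the criterion) on the cues and then using the fitted model in the judge's place. The sketch below uses entirely synthetic data and invented weights; it is meant only to show the mechanics and the typical effect of removing the judge's trial-to-trial error, not to reproduce any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)

def paramorphic_model(cues, judge_predictions):
    """Bootstrapping in Dawes's sense: fit a linear model OF THE JUDGE by regressing
    the judge's own predictions on the cues (the criterion is never consulted),
    then return a function that predicts in the judge's place."""
    X = np.column_stack([np.ones(len(cues)), cues])            # add an intercept
    weights, *_ = np.linalg.lstsq(X, judge_predictions, rcond=None)
    return lambda c: np.column_stack([np.ones(len(c)), c]) @ weights

# Synthetic illustration: the criterion depends on four cues plus noise, and the
# "judge" uses the cues with somewhat wrong weights plus random error of his own.
cues = rng.standard_normal((100, 4))
criterion = cues @ np.array([0.3, 0.3, 0.2, 0.2]) + 0.8 * rng.standard_normal(100)
judge = cues @ np.array([0.4, 0.2, 0.3, 0.1]) + 0.6 * rng.standard_normal(100)

model_of_judge = paramorphic_model(cues, judge)
print("judge vs. criterion:         ", np.corrcoef(judge, criterion)[0, 1])
print("model of judge vs. criterion:", np.corrcoef(model_of_judge(cues), criterion)[0, 1])
# With these population values the model of the judge typically correlates more
# highly with the criterion than the judge does, because his random error is gone.
```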

Bootstrapping has turned out to be pervasive. For example, in a study conducted by Wiggins and Kohen (1971), psychology graduate students at the University of Illinois were presented with 10 background, aptitude, and personality measures describing other (real) Illinois graduate students in psychology and were asked to predict these students' first-year graduate GPAs. Linear models of every one of the University of Illinois judges did a better job than did the judges themselves in predicting actual grade point averages. This result was replicated in a study conducted in conjunction with Wiggins, Gregory, and Diller (cited in Dawes & Corrigan, 1974). Goldberg (1970) demonstrated it for 26 of 29 clinical psychology judges predicting psychiatric diagnosis of neurosis or psychosis from Minnesota Multiphasic Personality Inventory (MMPI) profiles, and Dawes (1971) found it in the evaluation of graduate applicants at the University of Oregon. The one published exception to the success of bootstrapping of which I am aware was a study conducted by Libby (1976). He asked 16 loan officers from relatively small banks (located in Champaign-Urbana, Illinois, with assets between $3 million and $56 million) and 27 loan officers from large banks (located in Philadelphia, with assets between $.6 billion and $4.4 billion) to judge which 30 of 60 firms would go bankrupt within three years after their financial statements. The loan officers requested five financial ratios on which to base their judgments (e.g., the ratio of present assets to total assets). On the average, the loan officers correctly categorized 44.4 businesses (74%) as either solvent or future bankruptcies, but on the average, the paramorphic representations of the loan officers could correctly classify only 43.3 (72%). This difference turned out to be statistically significant, and Libby concluded that he had an example of a situation where bootstrapping did not work—perhaps because his judges were highly skilled experts attempting to predict a highly reliable criterion.

Goldberg (1976), however, noted that many of the ratios had highly skewed distributions, and he reanalyzed Libby's data, normalizing the ratios before building models of the loan officers. Libby found 77% of his officers to be superior to their paramorphic representations, but Goldberg, using his rescaled predictor variables, found the opposite; 72% of the models were superior to the judges from whom they were derived.⁴

Why does bootstrapping work? Bowman (1963), Goldberg (1970), and Dawes (1971) all maintained that its success arises from the fact that a linear model distills underlying policy (in the implicit weights) from otherwise variable behavior (e.g., judgments affected by context effects or extraneous variables).

Belief in the efficacy of bootstrapping was based on the comparison of the validity of the linear model of the judge with the validity of his or her judgments themselves. This is only one of two logically possible comparisons. The other is the validity of the linear model of the judge versus the validity of linear models in general; that is, to demonstrate that bootstrapping works because the linear model catches the essence of the judge's valid expertise while eliminating unreliability, it is necessary to demonstrate that the weights obtained from an analysis of the judge's behavior are superior to those that might be obtained in other ways, for example, randomly. Because both the model of the judge and the model obtained randomly are perfectly reliable, a comparison of the random model with the judge's model permits an evaluation of the judge's underlying linear representation, or policy. If the random model does equally well, the judge would not be "following valid principles but following them poorly" (Dawes, 1971, p. 182), at least not principles any more valid than any others that weight variables in the appropriate direction.

Table 1 presents five studies summarized by Dawes and Corrigan (1974) in which validities (i.e., correlations) obtained by various methods were compared.

⁴ It should be pointed out that a proper linear model does better than either loan officers or their paramorphic representations. Using the same task, Beaver (1966) and Deacon (1972) found that linear models predicted with about 78% accuracy on cross-validation. But I can't resist pointing out that the simplest possible improper model of them all does best. The ratio of assets to liabilities (!) correctly categorizes 48 (80%) of the cases studied by Libby.


TABLE 1
Correlations Between Predictions and Criterion Values

Example                                          Average     Average      Average      Validity of   Cross-validity  Validity of
                                                 validity    validity of  validity of  equal-        of regression   optimal
                                                 of judge    judge model  random model weighting     analysis        linear model
                                                                                       model

Prediction of neurosis vs. psychosis               .28          .31          .30          .34            .46             .46
Illinois students' predictions of GPA              .33          .50          .51          .60            .57             .69
Oregon students' predictions of GPA                .37          .43          .51          .60            .57             .69
Prediction of later faculty ratings at Oregon      .19          .25          .39          .48            .38             .54
Yntema & Torgerson's (1961) experiment             .84          .89          .84          .97             —              .97

Note. GPA = grade point average.

In the first study, a pool of 861 psychiatric patients took the MMPI in various hospitals; they were later categorized as neurotic or psychotic on the basis of more extensive information. The MMPI profiles consist of 11 scores, each of which represents the degree to which the respondent answers questions in a manner similar to patients suffering from a well-defined form of psychopathology. A set of 11 scores is thus associated with each patient, and the problem is to predict whether a later diagnosis will be psychosis (coded 1) or neurosis (coded 0). Twenty-nine clinical psychologists "of varying experience and training" (Goldberg, 1970, p. 425) were asked to make this prediction on an 11-step forced-normal distribution. The second two studies concerned 90 first-year graduate students in the Psychology Department of the University of Illinois who were evaluated on 10 variables that are predictive of academic success. These variables included aptitude test scores, college GPA, various peer ratings (e.g., extraversion), and various self-ratings (e.g., conscientiousness). A first-year GPA was computed for all these students. The problem was to predict the GPA from the 10 variables. In the second study this prediction was made by 80 (other) graduate students at the University of Illinois (Wiggins & Kohen, 1971), and in the third study this prediction was made by 41 graduate students at the University of Oregon. The details of the fourth study have already been covered; it is the one concerned with the prediction of later faculty ratings at Oregon. The final study (Yntema & Torgerson, 1961) was one in which experimenters assigned values to ellipses presented to the subjects, on the basis of the figures' size, eccentricity, and grayness. The formula used was ij + jk + ik, where i, j, and k refer to values on the three dimensions just mentioned.

Subjects in this experiment were asked to estimate the value of each ellipse and were presented with outcome feedback at the end of each trial. The problem was to predict the true (i.e., experimenter-assigned) value of each ellipse on the basis of its size, eccentricity, and grayness.

The first column of Table 1 presents the average validity of the judges in these studies, and the second presents the average validity of the paramorphic model of these judges. In all cases, bootstrapping worked. But then what Corrigan and I constructed were random linear models, that is, models in which weights were randomly chosen except for sign and were then applied to standardized variables.⁵

The sign of each variable was determined on an a priori basis so that it would have a positive relationship to the criterion. Then a normal deviate was selected at random from a normal distribution with unit variance, and the absolute value of this deviate was used as a weight for the variable. Ten thousand such models were constructed for each example. (Dawes & Corrigan, 1974, p. 102)
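The quoted procedure translates directly into code. The sketch below is a hypothetical reimplementation rather than Dawes and Corrigan's original program; X and y stand for whatever predictor matrix and criterion one supplies, with the predictors already signed so that each relates positively to the criterion.

```python
import numpy as np

def random_model_validities(X, y, n_models=10_000, seed=0):
    """Random linear models in the Dawes & Corrigan sense: predictors are standardized
    and signed a priori to relate positively to the criterion; each weight is the
    absolute value of a standard normal deviate."""
    rng = np.random.default_rng(seed)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
    yz = (y - y.mean()) / y.std()               # standardize the criterion
    r = np.empty(n_models)
    for i in range(n_models):
        w = np.abs(rng.standard_normal(Xz.shape[1]))
        r[i] = np.corrcoef(Xz @ w, yz)[0, 1]
    return r   # r.mean() estimates the "average validity of random model" entry of Table 1
```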

On the average, these random linear models perform about as well as the paramorphic models of the judges; these averages are presented in the third column of the table. Equal-weighting models, presented in the fourth column, do even better. (There is a mathematical reason why equal-weighting models must outperform the average random model.⁶)

⁵ Unfortunately, Dawes and Corrigan did not spell out in detail that these variables must first be standardized and that the result is a standardized dependent variable. Equal or random weighting of incomparable variables—for example, GRE score and GPA—without prior standardization would be nonsensical.

⁶ Consider a set of standardized variables $X_1, X_2, \ldots, X_m$, each of which is positively correlated with a standardized variable $Y$. The correlation of the average of the $X$s with $Y$ is equal to the correlation of the sum of the $X$s with $Y$. The covariance of this sum with $Y$ is equal to

$$
\frac{1}{n}\sum_{i} y_i\,(x_{i1} + x_{i2} + \cdots + x_{im})
= \frac{1}{n}\sum_{i} y_i x_{i1} + \cdots + \frac{1}{n}\sum_{i} y_i x_{im}
= r_1 + r_2 + \cdots + r_m \quad \text{(the sum of the correlations).}
$$

The variance of $Y$ is 1, and the variance of the sum of the $X$s is $m + m(m-1)\bar r$, where $\bar r$ is the average intercorrelation between the $X$s. Hence, the correlation of the average of the $X$s with $Y$ is $\bigl(\sum r_i\bigr)/\bigl(m + m(m-1)\bar r\bigr)^{1/2}$; this is greater than $\bigl(\sum r_i\bigr)/\bigl(m + m^2 - m\bigr)^{1/2} = $ average $r_i$. Because each of the random models is positively correlated with the criterion, the correlation of their average, which is the unit-weighted model, is higher than the average of the correlations.
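A quick numeric illustration of the inequality in Footnote 6 (the values are invented purely for the arithmetic): with $m = 3$ standardized predictors, each correlating .30 with the criterion and an average intercorrelation of .20,

$$
\frac{\sum r_i}{\sqrt{m + m(m-1)\bar r}} = \frac{.90}{\sqrt{3 + 3\cdot 2\cdot(.20)}} = \frac{.90}{\sqrt{4.2}} \approx .44 \;>\; .30 = \text{average } r_i.
$$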


Finally, the last two columns present the cross-validated validity of the standard regression model and the validity of the optimal linear model.

Essentially the same results were obtained when the weights were selected from a rectangular distribution. Why? Because linear models are robust over deviations from optimal weighting. In other words, the bootstrapping finding, at least in these studies, has simply been a reaffirmation of the earlier finding that proper linear models are superior to human judgments—the weights derived from the judges' behavior being sufficiently close to the optimal weights that the outputs of the models are highly similar. The solution to the problem of obtaining optimal weights is one that—in terms of von Winterfeldt and Edwards (Note 4)—has a "flat maximum." Weights that are near to optimal level produce almost the same output as do optimal beta weights. Because the expert judge knows at least something about the direction of the variables, his or her judgments yield weights that are nearly optimal (but note that in all cases equal weighting is superior to models based on judges' behavior).

The fact that different linear composites correlate highly with each other was first pointed out 40 years ago by Wilks (1938). He considered only situations in which there was positive correlation between predictors. This result seems to hold generally as long as these intercorrelations are not negative; for example, the correlation between X + 2Y and 2X + Y is .80 when X and Y are uncorrelated. The ways in which outputs are relatively insensitive to changes in coefficients (provided changes in sign are not involved) have been investigated most recently by Green (1977), Wainer (1976), Wainer and Thissen (1976), W. M. Edwards (1978), and Gardiner and Edwards (1975).


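The .80 figure just quoted can be checked directly; assuming X and Y are standardized and uncorrelated,

$$
\operatorname{corr}(X+2Y,\;2X+Y)
= \frac{2\operatorname{var}(X) + 2\operatorname{var}(Y)}{\sqrt{\operatorname{var}(X+2Y)}\,\sqrt{\operatorname{var}(2X+Y)}}
= \frac{4}{\sqrt{5}\,\sqrt{5}} = .80.
$$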

Dawes and Corrigan (1974, p. 105) concluded that "the whole trick is to know what variables to look at and then know how to add." That principle is well illustrated in the following study, conducted since the Dawes and Corrigan article was published. In it, Hammond and Adelman (1976) both investigated and influenced the decision about what type of bullet should be used by the Denver City Police, a decision having much more obvious social impact than most of those discussed above. To quote Hammond and Adelman (1976):

In 1974, the Denver Police Department (DPD), as well as other police departments throughout the country, decided to change its handgun ammunition. The principal reason offered by the police was that the conventional round-nosed bullet provided insufficient "stopping effectiveness" (that is, the ability to incapacitate and thus to prevent the person shot from firing back at a police officer or others). The DPD chief recommended (as did other police chiefs) the conventional bullet be replaced by a hollow-point bullet. Such bullets, it was contended, flattened on impact, thus decreasing penetration, increasing stopping effectiveness, and decreasing ricochet potential. The suggested change was challenged by the American Civil Liberties Union, minority groups, and others. Opponents of the change claimed that the new bullets were nothing more than outlawed "dum-dum" bullets, that they created far more injury than the round-nosed bullet, and should, therefore, be barred from use. As is customary, judgments on this matter were formed privately and then defended publicly with enthusiasm and tenacity, and the usual public hearings were held. Both sides turned to ballistics experts for scientific information and support. (p. 392)

The disputants focused on evaluating the merits of specific bullets—confounding the physical effect of the bullets with the implications for social policy; that is, rather than separating questions of what it is the bullet should accomplish (the social policy question) from questions concerning ballistic characteristics of specific bullets, advocates merely argued for one bullet or another. Thus, as Hammond and Adelman pointed out, social policymakers inadvertently adopted the role of (poor) ballistics experts, and vice versa. What Hammond and Adelman did was to discover the important policy dimensions from the policymakers, and then they had the ballistics experts rate the bullets with respect to these dimensions. These dimensions turned out to be stopping effectiveness (the probability that someone hit in the torso could not return fire), probability of serious injury, and probability of harm to bystanders.


When the ballistics experts rated the bullets with respect to these dimensions, it turned out that the last two were almost perfectly confounded, but they were not perfectly confounded with the first. Bullets do not vary along a single dimension that confounds effectiveness with lethalness. The probability of serious injury or harm to bystanders is highly related to the penetration of the bullet, whereas the probability of the bullet's effectively stopping someone from returning fire is highly related to the width of the entry wound. Since policymakers could not agree about the weights given to the three dimensions, Hammond and Adelman suggested that they be weighted equally. Combining the equal weights with the (independent) judgments of the ballistics experts, Hammond and Adelman discovered a bullet that "has greater stopping effectiveness and is less apt to cause injury (and is less apt to threaten bystanders) than the standard bullet then in use by the DPD" (Hammond & Adelman, 1976, p. 395). The bullet was also less apt to cause injury than was the bullet previously recommended by the DPD. That bullet was "accepted by the City Council and all other parties concerned, and is now being used by the DPD" (Hammond & Adelman, 1976, p. 395).⁷ Once again, "the whole trick is to decide what variables to look at and then know how to add" (Dawes & Corrigan, 1974, p. 105).
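Mechanically, the equal-weighting step of the bullet decision is nothing more than averaging the experts' ratings once every dimension is scored in the same desirable direction. The ratings below are invented, not Hammond and Adelman's data; the sketch only illustrates the arithmetic of the choice.

```python
import numpy as np

# Invented ratings (0-100).  Each dimension is scored so that higher is more
# desirable (injury and bystander risk are reversed), which is the scaling that
# equal weighting requires.
bullets = {
    "round-nosed (old standard)":  np.array([40, 70, 72]),
    "hollow-point (DPD proposal)": np.array([75, 45, 50]),
    "compromise bullet":           np.array([70, 68, 70]),
}

scores = {name: ratings.mean() for name, ratings in bullets.items()}   # equal weights
print(scores)
print("equal-weight choice:", max(scores, key=scores.get))
```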

So why don't people do it more often? I know of four universities (University of Illinois; New York University; University of Oregon; University of California, Santa Barbara—there may be more) that use a linear model for applicant selection, but even these use it as an initial screening device and substitute clinical judgment for the final selection of those above a cut score. Goldberg's (1965) actuarial formula for diagnosing neurosis or psychosis from MMPI profiles has proven superior to clinical judges attempting the same task (no one to my or Goldberg's knowledge has ever produced a judge who does better), yet my one experience with its use (at the Ann Arbor Veterans Administration Hospital) was that it was discontinued on the grounds that it made obvious errors (an interesting reason, discussed at length below). In 1970, I suggested that our fellowship committee at the University of Oregon apportion cutbacks of National Science Foundation and National Defense Education Act fellowships to departments on the basis of a quasi-linear point system based on explicitly defined indices, departmental merit, and need; I was told "you can't systemize human judgment." It was only six months later, after our committee realized the political and ethical impossibility of cutting back fellowships on the basis of intuitive judgment, that such a system was adopted. And so on.

In the past three years, I have written and talked about the utility (and in my view, ethical superiority) of using linear models in socially important decisions. Many of the same objections have been raised repeatedly by different readers and audiences. I would like to conclude this article by cataloging these objections and answering them.

Objections to Using Linear Models

These objections may be placed in three broad categories: technical, psychological, and ethical. Each category is discussed in turn.

TECHNICAL

The most common technical objection is to the use of the correlation coefficient; for example, Remus and Jenicke (1978) wrote:

It is clear that Dawes and Corrigan's choice of the correlation coefficient to establish the utility of random and unit rules is inappropriate [sic, inappropriate for what?]. A criterion function is also needed in the experiments cited by Dawes and Corrigan. Surely there is a cost function for misclassifying neurotics and psychotics or refusing qualified students admissions to graduate school while admitting marginal students. (p. 221)

Consider the graduate admission problem first. Most schools have k slots and N applicants. The problem is to get the best k (who are in turn willing to accept the school) out of N. What better way is there than to have an appropriate rank? None. Remus and Jenicke write as if the problem were not one of comparative choice but of absolute choice. Most social choices, however, involve selecting the better or best from a set of alternatives: the students that will be better, the bullet that will be best, a possible airport site that will be superior, and so on. The correlation coefficient, because it reflects ranks so well, is clearly appropriate for evaluating such choices.

⁷ It should be pointed out that there were only eight bullets on the Pareto frontier; that is, there were only eight that were not inferior to some particular other bullet in both stopping effectiveness and probability of harm (or inferior on one of the variables and equal on the other). Consequently, any weighting rule whatsoever would have chosen one of these eight.



The neurosis-psychosis problem is more subtle and even less supportive of their argument. "Surely," they state, "there is a cost function," but they don't specify any candidates. The implication is clear: If they could find it, clinical judgment would be found to be superior to linear models. Why? In the absence of such a discovery on their part, the argument amounts to nothing at all. But this argument from a vacuum can be very compelling to people (for example, to losing generals and losing football coaches, who know that "surely" their plans would work "if"—when the plans are in fact doomed to failure no matter what).

A second related technical objection is to the comparison of average correlation coefficients of judges with those of linear models. Perhaps by averaging, the performance of some really outstanding judges is obscured. The data indicate otherwise. In the Goldberg (1970) study, for example, only 5 of 29 trained clinicians were better than the unit-weighted model, and none did better than the proper one. In the Wiggins and Kohen (1971) study, no judges were better than the unit-weighted model, and we replicated that effect at Oregon. In the Libby (1976) study, only 9 of 43 judges did better than the ratio of assets to liabilities at predicting bankruptcies (3 did equally well). While it is then conceded that clinicians should be able to predict diagnosis of neurosis or psychosis, that graduate students should be able to predict graduate success, and that bank loan officers should be able to predict bankruptcies, the possibility is raised that perhaps the experts used in the studies weren't the right ones. This again is arguing from a vacuum: If other experts were used, then the results would be different. And once again no such experts are produced, and once again the appropriate response is to ask for a reason why these hypothetical other people should be any different. (As one university vice-president told me, "Your research only proves that you used poor judges; we could surely do better by getting better judges"—apparently not from the psychology department.)

A final technical objection concerns the nature of the criterion variables. They are admittedly short-term and unprofound (e.g., GPAs, diagnoses); otherwise, most studies would be infeasible. The question is then raised of whether the findings would be different if a truly long-range important criterion were to be predicted.

The answer is that of course the findings could be different, but we have no reason to suppose that they would be different. First, the distant future is in general less predictable than the immediate future, for the simple reason that more unforeseen, extraneous, or self-augmenting factors influence individual outcomes. (Note that we are not discussing aggregate outcomes, such as an unusually cold winter in the Midwest in general spread out over three months.) Since, then, clinical prediction is poorer than linear to begin with, the hypothesis would hold only if linear prediction got much worse over time than did clinical prediction. There is no a priori reason to believe that this differential deterioration in prediction would occur, and none has ever been suggested to me. There is certainly no evidence. Once again, the objection consists of an argument from a vacuum.

Particularly compelling is the fact that people who argue that different criteria or judges or variables or time frames would produce different results have had 25 years in which to produce examples, and they have failed to do so.

PSYCHOLOGICAL

One psychological resistance to using linear models lies in our selective memory about clinical prediction. Our belief in such prediction is reinforced by the availability (Tversky & Kahneman, 1974) of instances of successful clinical prediction—especially those that are exceptions to some formula: "I knew someone once with . . . who . . . ." (e.g., "I knew of someone with a tested IQ of only 130 who got an advanced degree in psychology.") As Nisbett, Borgida, Crandall, and Reed (1976) showed, such single instances often have greater impact on judgment than do much more valid statistical compilations based on many instances. (A good prophylactic for clinical psychologists basing resistance to actuarial prediction on such instances would be to keep careful records of their own predictions about their own patients—prospective records not subject to hindsight. Such records could make all instances of successful and unsuccessful prediction equally available for impact; in addition, they could serve for another clinical versus statistical study using the best possible judge—the clinician himself or herself.)

Moreover, an illusion of good judgment may be reinforced due to selection (Einhorn & Hogarth, 1978) in those situations in which the prediction of a positive or negative outcome has a self-fulfilling effect.


For example, admissions officers who judge that a candidate is particularly qualified for a graduate program may feel that their judgment is exonerated when that candidate does well, even though the candidate's success is in large part due to the positive effects of the program. (In contrast, a linear model of selection is evaluated by seeing how well it predicts performance within the set of applicants selected.) Or a waiter who believes that particular people at the table are poor tippers may be less attentive than usual and receive a smaller tip, thereby having his clinical judgment exonerated.⁸

A second psychological resistance to the use of linear models stems from their "proven" low validity. Here, there is an implicit (as opposed to explicit) argument from a vacuum because neither changes in evaluation procedures, nor in judges, nor in criteria, are proposed. Rather, the unstated assumption is that these criteria of psychological interest are in fact highly predictable, so it follows that if one method of prediction (a linear model) doesn't work too well, another might do better (reasonable), which is then translated into the belief that another will do better (which is not a reasonable inference)—once it is found. This resistance is best expressed by a dean considering graduate admissions, who wrote, "The correlation of the linear composite with future faculty ratings is only .4, whereas that of the admissions committee's judgment correlates .2. Twice nothing is nothing." In 1976, I answered as follows (Dawes, 1976, pp. 6-7):

In response, I can only point out that 16% of the variance is better than 4% of the variance. To me, however, the fascinating part of this argument is the implicit assumption that that other 84% of the variance is predictable and that we can somehow predict it.

Now what are we dealing with? We are dealing with personality and intellectual characteristics of [uniformly bright] people who are about 20 years old. . . . Why are we so convinced that this prediction can be made at all? Surely, it is not necessary to read Ecclesiastes every night to understand the role of chance. . . . Moreover, there are clearly positive feedback effects in professional development that exaggerate threshold phenomena. For example, once people are considered sufficiently "outstanding" that they are invited to outstanding institutions, they have outstanding colleagues with whom to interact—and excellence is exacerbated. This same problem occurs for those who do not quite reach such a threshold level. Not only do all these factors mitigate against successful long-range prediction, but studies of the success of such prediction are necessarily limited to those accepted, with the incumbent problems of restriction of range and a negative covariance structure between predictors (Dawes, 1975).

Finally, there are all sorts of nonintellectual factors in professional success that could not possibly be evaluated before admission to graduate school, for example, success at forming a satisfying or inspiring libidinal relationship, not yet evident genetic tendencies to drug or alcohol addiction, the misfortune to join a research group that "blows up," and so on, and so forth.

Intellectually, I find it somewhat remarkable that we are able to predict even 16% of the variance. But I believe that my own emotional response is indicative of those of my colleagues who simply assume that the future is more predictable. I want it to be predictable, especially when the aspect of it that I want to predict is important to me. This desire, I suggest, translates itself into an implicit assumption that the future is in fact highly predictable, and it would then logically follow that if something is not a very good predictor, something else might do better (although it is never correct to argue that it necessarily will).

Statistical prediction, because it includes the specification (usually a low correlation coefficient) of exactly how poorly we can predict, bluntly strikes us with the fact that life is not all that predictable. Unsystematic clinical prediction (or "postdiction"), in contrast, allows us the comforting illusion that life is in fact predictable and that we can predict it.

ETHICAL

When I was at the Los Angeles Renaissance Fair last summer, I overheard a young woman complain that it was "horribly unfair" that she had been rejected by the Psychology Department at the University of California, Santa Barbara, on the basis of mere numbers, without even an interview. "How can they possibly tell what I'm like?" The answer is that they can't. Nor could they with an interview (Kelly, 1954). Nevertheless, many people maintain that making a crucial social choice without an interview is dehumanizing. I think that the question of whether people are treated in a fair manner has more to do with the question of whether or not they have been dehumanized than does the question of whether the treatment is face to face. (Some of the worst doctors spend a great deal of time conversing with their patients, read no medical journals, order few or no tests, and grieve at the funerals.)

⁸ This example was provided by Einhorn (Note 5).


A GPA represents 3½ years of behavior on the part of the applicant. (Surely, not all the professors are biased against his or her particular form of creativity.) The GRE is a more carefully devised test. Do we really believe that we can do a better or a fairer job by a 10-minute folder evaluation or a half-hour interview than is done by these two mere numbers? Such cognitive conceit (Dawes, 1976, p. 7) is unethical, especially given the fact of no evidence whatsoever indicating that we do a better job than does the linear equation. (And even making exceptions must be done with extreme care if it is to be ethical, for if we admit someone with a low linear score on the basis that he or she has some special talent, we are automatically rejecting someone with a higher score, who might well have had an equally impressive talent had we taken the trouble to evaluate it.)

No matter how much we would like to see this or that aspect of one or another of the studies reviewed in this article changed, no matter how psychologically uncompelling or distasteful we may find their results to be, no matter how ethically uncomfortable we may feel at "reducing people to mere numbers," the fact remains that our clients are people who deserve to be treated in the best manner possible. If that means—as it appears at present—that selection, diagnosis, and prognosis should be based on nothing more than the addition of a few numbers representing values on important attributes, so be it. To do otherwise is cheating the people we serve.

REFERENCE NOTES

1. Thornton, B. Personal communication, 1977.

2. Slovic, P. Limitations of the mind of man: Implications for decision making in the nuclear age. In H. J. Otway (Ed.), Risk vs. benefit: Solution or dream? (Report LA 4860-MS). Los Alamos, N.M.: Los Alamos Scientific Laboratory, 1972. [Also available as Oregon Research Institute Bulletin, 1971, 11(17).]

3. Srinivasan, V. A theoretical comparison of the predictive power of the multiple regression and equal weighting procedures (Research Paper No. 347). Stanford, Calif.: Stanford University, Graduate School of Business, February 1977.

4. von Winterfeldt, D., & Edwards, W. Costs and payoffs in perceptual research. Unpublished manuscript, University of Michigan, Engineering Psychology Laboratory, 1973.

5. Einhorn, H. J. Personal communication, January 1979.

REFERENCES

Alexander, S. A. H. Sex, arguments, and social engagements in marital and premarital relations. Unpublished master's thesis, University of Missouri—Kansas City, 1971.

Beaver, W. H. Financial ratios as predictors of failure. In Empirical research in accounting: Selected studies. Chicago: University of Chicago, Graduate School of Business, Institute of Professional Accounting, 1966.

Bowman, E. H. Consistency and optimality in managerial decision making. Management Science, 1963, 9, 310-321.

Cass, J., & Birnbaum, M. Comparative guide to American colleges. New York: Harper & Row, 1968.

Claudy, J. G. A comparison of five variable weighting procedures. Educational and Psychological Measurement, 1972, 32, 311-322.

Darlington, R. B. Reduced-variance regression. Psychological Bulletin, 1978, 85, 1238-1255.

Dawes, R. M. A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 1971, 26, 180-188.

Dawes, R. M. Graduate admissions criteria and future success. Science, 1975, 187, 721-723.

Dawes, R. M. Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.

Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95-106.

Deacon, E. B. A discriminant analysis of predictors of business failure. Journal of Accounting Research, 1972, 10, 167-179.

deGroot, A. D. Het denken van den schaker [Thought and choice in chess]. The Hague, The Netherlands: Mouton, 1965.

Edwards, D. D., & Edwards, J. S. Marriage: Direct and continuous measurement. Bulletin of the Psychonomic Society, 1977, 10, 187-188.

Edwards, W. M. Technology for director dubious: Evaluation and decision in public contexts. In K. R. Hammond (Ed.), Judgment and decision in public policy formation. Boulder, Colo.: Westview Press, 1978.

Einhorn, H. J. Expert measurement and mechanical combination. Organizational Behavior and Human Performance, 1972, 7, 86-106.

Einhorn, H. J., & Hogarth, R. M. Unit weighting schemas for decision making. Organizational Behavior and Human Performance, 1975, 13, 171-192.

Einhorn, H. J., & Hogarth, R. M. Confidence in judgment: Persistence of the illusion of validity. Psychological Review, 1978, 85, 395-416.

Gardiner, P. C., & Edwards, W. Public values: Multiattribute-utility measurement for social decision making. In M. F. Kaplan & S. Schwartz (Eds.), Human judgment and decision processes. New York: Academic Press, 1975.

Goldberg, L. R. Diagnosticians vs. diagnostic signs: The diagnosis of psychosis vs. neurosis from the MMPI. Psychological Monographs, 1965, 79(9, Whole No. 602).

Goldberg, L. R. Seer over sign: The first "good" example? Journal of Experimental Research in Personality, 1968, 3, 168-171.

Goldberg, L. R. Man versus model of man: A rationale, plus some evidence for a method of improving on clinical inferences. Psychological Bulletin, 1970, 73, 422-432.

Goldberg, L. R. Parameters of personality inventory construction and utilization: A comparison of prediction strategies and tactics. Multivariate Behavioral Research Monographs, 1972, No. 72-2.

Goldberg, L. R. Man versus model of man: Just how conflicting is that evidence? Organizational Behavior and Human Performance, 1976, 16, 13-22.

Green, B. F., Jr. Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 1977, 3, 263.

Hammond, K. R. Probabilistic functioning and the clinical method. Psychological Review, 1955, 62, 255-262.

Hammond, K. R., & Adelman, L. Science, values, and human judgment. Science, 1976, 194, 389-396.

Hoffman, P. J. The paramorphic representation of clinical judgment. Psychological Bulletin, 1960, 57, 116-131.

Holt, R. R. Yet another look at clinical and statistical prediction. American Psychologist, 1970, 25, 337-339.

Howard, J. W., & Dawes, R. M. Linear prediction of marital happiness. Personality and Social Psychology Bulletin, 1976, 2, 478-480.

Kelly, L. Evaluation of the interview as a selection technique. In Proceedings of the 1953 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1954.

Krantz, D. H. Measurement structures and psychological laws. Science, 1972, 175, 1427-1435.

Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. Foundations of measurement (Vol. 1). New York: Academic Press, 1971.

Libby, R. Man versus model of man: Some conflicting evidence. Organizational Behavior and Human Performance, 1976, 16, 1-12.

Marquardt, D. W., & Snee, R. D. Ridge regression in practice. American Statistician, 1975, 29, 3-19.

Meehl, P. E. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press, 1954.

Meehl, P. E. Seer over sign: The first good example. Journal of Experimental Research in Personality, 1965, 1, 27-32.

Nisbett, R. E., Borgida, E., Crandall, R., & Reed, H. Popular induction: Information is not necessarily normative. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.

Remus, W. E., & Jenicke, L. O. Unit and random linear models in decision making. Multivariate Behavioral Research, 1978, 13, 215-221.

Sawyer, J. Measurement and prediction, clinical and statistical. Psychological Bulletin, 1966, 66, 178-200.

Schmidt, F. L. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 1971, 31, 699-714.

Simon, H. A., & Chase, W. G. Skill in chess. American Scientist, 1973, 61, 394-403.

Slovic, P., & Lichtenstein, S. Comparison of Bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior and Human Performance, 1971, 6, 649-744.

Thornton, B. Linear prediction of marital happiness: A replication. Personality and Social Psychology Bulletin, 1977, 3, 674-676.

Tversky, A., & Kahneman, D. Judgment under uncertainty: Heuristics and biases. Science, 1974, 185, 1124-1131.

Wainer, H. Estimating coefficients in linear models: It don't make no nevermind. Psychological Bulletin, 1976, 83, 213-217.

Wainer, H., & Thissen, D. Three steps toward robust regression. Psychometrika, 1976, 41, 9-34.

Wallace, H. A. What is in the corn judge's mind? Journal of the American Society of Agronomy, 1923, 15, 300-304.

Wiggins, N., & Kohen, E. S. Man vs. model of man revisited: The forecasting of graduate school success. Journal of Personality and Social Psychology, 1971, 19, 100-106.

Wilks, S. S. Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 1938, 3, 23-40.

Yntema, D. B., & Torgerson, W. S. Man-computer cooperation in decisions requiring common sense. IRE Transactions of the Professional Group on Human Factors in Electronics, 1961, 2(1), 20-26.
