
VOL. 52, No. 4

Psychological Bulletin

JULY, 1955

CONSTRUCT VALIDITY IN PSYCHOLOGICAL TESTS

LEE J. CRONBACH
University of Illinois

AND PAUL E. MEEHL¹
University of Minnesota

Validation of psychological tests has not yet been adequately conceptualized, as the APA Committee on Psychological Tests learned when it undertook (1950-54) to specify what qualities should be investigated before a test is published. In order to make coherent recommendations the Committee found it necessary to distinguish four types of validity, established by different types of research and requiring different interpretation. The chief innovation in the Committee's report was the term construct validity.² This idea was first formulated by a subcommittee (Meehl and R. C. Challman) studying how proposed recommendations would apply to projective techniques, and later modified and clarified by the entire Committee (Bordin, Challman, Conrad, Humphreys, Super, and the present writers). The statements agreed upon by the Committee (and by committees of two other associations) were published in the Technical Recommendations (59). The present interpretation of construct validity is not "official" and deals with some areas where the Committee would probably not be unanimous. The present writers are solely responsible for this attempt to explain the concept and elaborate its implications.

¹ The second author worked on this problem in connection with his appointment to the Minnesota Center for Philosophy of Science. We are indebted to the other members of the Center (Herbert Feigl, Michael Scriven, Wilfrid Sellars), and to D. L. Thistlethwaite of the University of Illinois, for their major contributions to our thinking and their suggestions for improving this paper.

² Referred to in a preliminary report (58) as congruent validity.

Identification of construct validity was not an isolated development. Writers on validity during the preceding decade had shown a great deal of dissatisfaction with conventional notions of validity, and introduced new terms and ideas, but the resulting aggregation of types of validity seems only to have stirred the muddy waters. Portions of the distinctions we shall discuss are implicit in Jenkins' paper, "Validity for what?" (33), Gulliksen's "intrinsic validity" (27), Goodenough's distinction between tests as "signs" and "samples" (22), Cronbach's separation of "logical" and "empirical" validity (11), Guilford's "factorial validity" (25), and Mosier's papers on "face validity" and "validity generalization" (49, 50). Helen Peak (52) comes close to an explicit statement of construct validity as we shall present it.

FOUR TYPES OF VALIDATION

The categories into which the Recommendations divide validity studies are: predictive validity, concurrent validity, content validity, and construct validity. The first two of these may be considered together as criterion-oriented validation procedures.

The pattern of a criterion-oriented study is familiar. The investigator is primarily interested in some criterion which he wishes to predict. He administers the test, obtains an independent criterion measure on the same subjects, and computes a correlation. If the criterion is obtained some time after the test is given, he is studying predictive validity. If the test score and criterion score are determined at essentially the same time, he is studying concurrent validity. Concurrent validity is studied when one test is proposed as a substitute for another (for example, when a multiple-choice form of spelling test is substituted for taking dictation), or a test is shown to correlate with some contemporary criterion (e.g., psychiatric diagnosis).

Content validity is established by showing that the test items are a sample of a universe in which the investigator is interested. Content validity is ordinarily to be established deductively, by defining a universe of items and sampling systematically within this universe to establish the test.

Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not "operationally defined." The problem faced by the investigator is, "What constructs account for variance in test performance?" Construct validity calls for no new scientific approach. Much current research on tests of personality (9) is construct validation, usually without the benefit of a clear formulation of this process.

Construct validity is not to be identified solely by particular investigative procedures, but by the orientation of the investigator. Criterion-oriented validity, as Bechtoldt emphasizes (3, p. 1245), "involves the acceptance of a set of operations as an adequate definition of whatever is to be measured." When an investigator believes that no criterion available to him is fully valid, he perforce becomes interested in construct validity because this is the only way to avoid the "infinite frustration" of relating every criterion to some more ultimate standard (21). In content validation, acceptance of the universe of content as defining the variable to be measured is essential. Construct validity must be investigated whenever no criterion or universe of content is accepted as entirely adequate to define the quality to be measured. Determining what psychological constructs account for test performance is desirable for almost any test. Thus, although the MMPI was originally established on the basis of empirical discrimination between patient groups and so-called normals (concurrent validity), continuing research has tried to provide a basis for describing the personality associated with each score pattern. Such interpretations permit the clinician to predict performance with respect to criteria which have not yet been employed in empirical validation studies (cf. 46, pp. 49-50, 110-111).

We can distinguish among the four types of validity by noting that each involves a different emphasis on the criterion. In predictive or concurrent validity, the criterion behavior is of concern to the tester, and he may have no concern whatsoever with the type of behavior exhibited in the test. (An employer does not care if a worker can manipulate blocks, but the score on the block test may predict something he cares about.) Content validity is studied when the tester is concerned with the type of behavior involved in the test performance. Indeed, if the test is a work sample, the behavior represented in the test may be an end in itself. Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test behavior or the scores on the criteria (59, p. 14).


Construct validation is important at times for every sort of psychological test: aptitude, achievement, interests, and so on. Thurstone's statement is interesting in this connection:

In the field of intelligence tests, it used to be common to define validity as the correlation between a test score and some outside criterion. We have reached a stage of sophistication where the test-criterion correlation is too coarse. It is obsolete. If we attempted to ascertain the validity of a test for the second space-factor, for example, we would have to get judges [to] make reliable judgments about people as to this factor. Ordinarily their [the available judges'] ratings would be of no value as a criterion. Consequently, validity studies in the cognitive functions now depend on criteria of internal consistency . . . (60, p. 3).

Construct validity would be involved in answering such questions as: To what extent is this test of intelligence culture-free? Does this test of "interpretation of data" measure reading ability, quantitative reasoning, or response sets? How does a person with A in Strong Accountant, and B in Strong CPA, differ from a person who has these scores reversed?

Example of construct validation procedure. Suppose measure X correlates .50 with Y, the amount of palmar sweating induced when we tell a student that he has failed a Psychology I exam. Predictive validity of X for Y is adequately described by the coefficient, and a statement of the experimental and sampling conditions. If someone were to ask, "Isn't there perhaps another way to interpret this correlation?" or "What other kinds of evidence can you bring to support your interpretation?", we would hardly understand what he was asking because no interpretation has been made. These questions become relevant when the correlation is advanced as evidence that "test X measures anxiety proneness." Alternative interpretations are possible; e.g., perhaps the test measures "academic aspiration," in which case we will expect different results if we induce palmar sweating by economic threat. It is then reasonable to inquire about other kinds of evidence.

Add these facts from further studies: Test X correlates .45 with fraternity brothers' ratings on "tenseness." Test X correlates .55 with amount of intellectual inefficiency induced by painful electric shock, and .68 with the Taylor Anxiety scale. Mean X score decreases among four diagnosed groups in this order: anxiety state, reactive depression, "normal," and psychopathic personality. And palmar sweat under threat of failure in Psychology I correlates .60 with threat of failure in mathematics. Negative results eliminate competing explanations of the X score; thus, findings of negligible correlations between X and social class, vocational aim, and value-orientation make it fairly safe to reject the suggestion that X measures "academic aspiration." We can have substantial confidence that X does measure anxiety proneness if the current theory of anxiety can embrace the variates which yield positive correlations, and does not predict correlations where we found none.

KINDS OF CONSTRUCTS

At this point we should indicate summarily what we mean by a construct, recognizing that much of the remainder of the paper deals with this question. A construct is some postulated attribute of people, assumed to be reflected in test performance. In test validation the attribute about which we make statements in interpreting a test is a construct. We expect a person at any time to possess or not possess a qualitative attribute (amnesia) or structure, or to possess some degree of a quantitative attribute (cheerfulness). A construct has certain associated meanings carried in statements of this general character: Persons who possess this attribute will, in situation X, act in manner Y (with a stated probability). The logic of construct validation is invoked whether the construct is highly systematized or loose, used in ramified theory or a few simple propositions, used in absolute propositions or probability statements. We seek to specify how one is to defend a proposed interpretation of a test; we are not recommending any one type of interpretation.

The constructs in which tests are to be interpreted are certainly not likely to be physiological. Most often they will be traits such as "latent hostility" or "variable in mood," or descriptions in terms of an educational objective, as "ability to plan experiments." For the benefit of readers who may have been influenced by certain eisegeses of MacCorquodale and Meehl (40), let us here emphasize: Whether or not an interpretation of a test's properties or relations involves questions of construct validity is to be decided by examining the entire body of evidence offered, together with what is asserted about the test in the context of this evidence. Proposed identifications of constructs allegedly measured by the test with constructs of other sciences (e.g., genetics, neuroanatomy, biochemistry) make up only one class of construct-validity claims, and a rather minor one at present. Space does not permit full analysis of the relation of the present paper to the MacCorquodale-Meehl distinction between hypothetical constructs and intervening variables. The philosophy of science pertinent to the present paper is set forth later in the section entitled, "The nomological network."

THE RELATION OF CONSTRUCTS TO "CRITERIA"

Critical View of the Criterion Implied

An unquestionable criterion may be found in a practical operation, or may be established as a consequence of an operational definition. Typically, however, the psychologist is unwilling to use the directly operational approach because he is interested in building theory about a generalized construct. A theorist trying to relate behavior to "hunger" almost certainly invests that term with meanings other than the operation "elapsed-time-since-feeding." If he is concerned with hunger as a tissue need, he will not accept time lapse as equivalent to his construct because it fails to consider, among other things, energy expenditure of the animal.

In some situations the criterion is no more valid than the test. Suppose, for example, that we want to know if counting the dots on Bender-Gestalt figure five indicates "compulsive rigidity," and take psychiatric ratings on this trait as a criterion. Even a conventional report on the resulting correlation will say something about the extent and intensity of the psychiatrist's contacts and should describe his qualifications (e.g., diplomate status? analyzed?).

Why report these facts? Because data are needed to indicate whether the criterion is any good. "Compulsive rigidity" is not really intended to mean "social stimulus value to psychiatrists." The implied trait involves a range of behavior-dispositions which may be very imperfectly sampled by the psychiatrist. Suppose dot-counting does not occur in a particular patient and yet we find that the psychiatrist has rated him as "rigid." When questioned the psychiatrist tells us that the patient was a rather easy, free-wheeling sort; however, the patient did lean over to straighten out a skewed desk blotter, and this, viewed against certain other facts, tipped the scale in favor of a "rigid" rating. On the face of it, counting Bender dots may be just as good (or poor) a sample of the compulsive-rigidity domain as straightening desk blotters is.

Suppose, to extend our example, we have four tests on the "predictor" side, over against the psychiatrist's "criterion," and find generally positive correlations among the five variables. Surely it is artificial and arbitrary to impose the "test-should-predict-criterion" pattern on such data. The psychiatrist samples verbal content, expressive pattern, voice, posture, etc. The psychologist samples verbal content, perception, expressive pattern, etc. Our proper conclusion is that, from this evidence, the four tests and the psychiatrist all assess some common factor.

The asymmetry between the "test" and the so-designated "criterion" arises only because the terminology of predictive validity has become a commonplace in test analysis. In this study where a construct is the central concern, any distinction between the merit of the test and criterion variables would be justified only if it had already been shown that the psychiatrist's theory and operations were excellent measures of the attribute.

INADEQUACY OF VALIDATION IN TERMS OF SPECIFIC CRITERIA

The proposal to validate constructual interpretations of tests runs counter to suggestions of some others. Spiker and McCandless (57) favor an operational approach. Validation is replaced by compiling statements as to how strongly the test predicts other observed variables of interest. To avoid requiring that each new variable be investigated completely by itself, they allow two variables to collapse into one whenever the properties of the operationally defined measures are the same: "If a new test is demonstrated to predict the scores on an older, well-established test, then an evaluation of the predictive power of the older test may be used for the new one." But accurate inferences are possible only if the two tests correlate so highly that there is negligible reliable variance in either test, independent of the other. Where the correspondence is less close, one must either retain all the separate variables operationally defined or embark on construct validation.
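The requirement that the two tests correlate very highly can be illustrated numerically. The sketch below is not taken from Spiker and McCandless; it relies only on the standard fact that a 3 x 3 correlation matrix must be positive semidefinite, which bounds the correlation between a new test and a criterion once the other two correlations are fixed. The function name and the example values (.70, .90, .99 between the tests; .50 between the old test and the criterion) are assumptions chosen for illustration.

```python
import math

def criterion_correlation_bounds(r_new_old, r_old_crit):
    """Range of possible correlations between a new test and a criterion,
    given r(new, old) and r(old, criterion).

    The bounds follow from requiring the 3x3 correlation matrix of the
    three variables to be positive semidefinite.
    """
    center = r_new_old * r_old_crit
    half_width = math.sqrt((1 - r_new_old ** 2) * (1 - r_old_crit ** 2))
    return center - half_width, center + half_width

# Hypothetical case: the old test correlates .50 with the criterion.
for r_new_old in (0.70, 0.90, 0.99):
    low, high = criterion_correlation_bounds(r_new_old, 0.50)
    print(f"r(new, old) = {r_new_old:.2f}: "
          f"r(new, criterion) lies in [{low:.2f}, {high:.2f}]")
```

Under these assumptions the admissible interval narrows only as the correlation between the two tests approaches unity, which is one way of reading the statement that the tests must leave negligible reliable variance independent of each other.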

The practical user of tests must rely on constructs of some generality to make predictions about new situations. Test X could be used to predict palmar sweating in the face of failure without invoking any construct, but a counselor is more likely to be asked to forecast behavior in diverse or even unique situations for which the correlation of test X is unknown. Significant predictions rely on knowledge accumulated around the generalized construct of anxiety. The Technical Recommendations state:

It is ordinarily necessary to evaluate construct validity by integrating evidence from many different sources. The problem of construct validation becomes especially acute in the clinical field since for many of the constructs dealt with it is not a question of finding an imperfect criterion but of finding any criterion at all. The psychologist interested in construct validity for clinical devices is concerned with making an estimate of a hypothetical internal process, factor, system, structure, or state and cannot expect to find a clear unitary behavioral criterion. An attempt to identify any one criterion measure or any composite as the criterion aimed at is, however, usually unwarranted (59, pp. 14-15).

This appears to conflict with arguments for specific criteria prominent at places in the testing literature. Thus Anastasi (2) makes many statements of the latter character: "It is only as a measure of a specifically defined criterion that a test can be objectively validated at all . . . To claim that a test measures anything over and above its criterion is pure speculation" (p. 67). Yet elsewhere this article supports construct validation. Tests can be profitably interpreted if we "know the relationships between the tested behavior . . . and other behavior samples, none of these behavior samples necessarily occupying the preeminent position of a criterion" (p. 75). Factor analysis with several partial criteria might be used to study whether a test measures a postulated "general learning ability." If the data demonstrate specificity of ability instead, such specificity is "useful in its own right in advancing our knowledge of behavior; it should not be construed as a weakness of the tests" (p. 75).

We depart from Anastasi at two points. She writes, "The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration." We, however, regard such analysis as a most important type of validation. Second, she refers to "the will-o'-the-wisp of psychological processes which are distinct from performance" (2, p. 77). While we agree that psychological processes are elusive, we are sympathetic to attempts to formulate and clarify constructs which are evidenced by performance but distinct from it. Surely an inductive inference based on a pattern of correlations cannot be dismissed as "pure speculation."

Specific Criteria Used Temporarily: The "Bootstraps" Effect

Even when a test is constructed on the basis of a specific criterion, it may ultimately be judged to have greater construct validity than the criterion. We start with a vague concept which we associate with certain observations. We then discover empirically that these observations covary with some other observation which possesses greater reliability or is more intimately correlated with relevant experimental changes than is the original measure, or both. For example, the notion of temperature arises because some objects feel hotter to the touch than others. The expansion of a mercury column does not have face validity as an index of hotness. But it turns out that (a) there is a statistical relation between expansion and sensed temperature; (b) observers employ the mercury method with good interobserver agreement; (c) the regularity of observed relations is increased by using the thermometer (e.g., melting points of samples of the same material vary little on the thermometer; we obtain nearly linear relations between mercury measures and pressure of a gas). Finally, (d) a theoretical structure involving unobservable microevents, the kinetic theory, is worked out which explains the relation of mercury expansion to heat. This whole process of conceptual enrichment begins with what in retrospect we see as an extremely fallible "criterion": the human temperature sense. That original criterion has now been relegated to a peripheral position. We have lifted ourselves by our bootstraps, but in a legitimate and fruitful way.

Similarly, the Binet scale was first valued because children's scores tended to agree with judgments by schoolteachers. If it had not shown this agreement, it would have been discarded along with reaction time and the other measures of ability previously tried. Teacher judgments once constituted the criterion against which the individual intelligence test was validated. But if today a child's IQ is 135 and three of his teachers complain about how stupid he is, we do not conclude that the test has failed. Quite to the contrary, if no error in test procedure can be argued, we treat the test score as a valid statement about an important quality, and define our task as that of finding out what other variables (personality, study skills, etc.) modify achievement or distort teacher judgment.

EXPERIMENTATION TO INVESTIGATE CONSTRUCT VALIDITY

Validation Procedures

We can use many methods in construct validation. Attention should particularly be drawn to Macfarlane's survey of these methods as they apply to projective devices (41).

Group differences. If our understanding of a construct leads us to expect two groups to differ on the test, this expectation may be tested directly. Thus Thurstone and Chave validated the Scale for Measuring Attitude Toward the Church by showing score differences between church members and nonchurchgoers. Churchgoing is not the criterion of attitude, for the purpose of the test is to measure something other than the crude sociological fact of church attendance; on the other hand, failure to find a difference would have seriously challenged the test.

Only coarse correspondence between test and group designation is expected. Too great a correspondence between the two would indicate that the test is to some degree invalid, because members of the groups are expected to overlap on the test. Intelligence test items are selected initially on the basis of a correspondence to age, but an item that correlates .95 with age in an elementary school sample would surely be suspect.

Correlation matrices and factor analysis. If two tests are presumed to measure the same construct, a correlation between them is predicted. (An exception is noted where some second attribute has positive loading in the first test and negative loading in the second test; then a low correlation is expected. This is a testable interpretation provided an external measure of either the first or the second variable exists.) If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies.
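The parenthetical exception can be made concrete under a simplifying assumption not stated in the text: that each standardized test is a linear combination of two uncorrelated attributes plus independent error, so that the correlation between two tests is the sum of the products of their loadings. The function name and the loadings are hypothetical.

```python
def predicted_correlation(loadings_a, loadings_b):
    """Correlation between two standardized tests implied by their loadings
    on uncorrelated common factors (orthogonal factor model, an assumption)."""
    return sum(a * b for a, b in zip(loadings_a, loadings_b))

# Hypothetical loadings: (construct of interest, second attribute).
test_a = (0.6, 0.5)
test_b_same_sign = (0.6, 0.5)   # second attribute loads positively in both tests
test_b_opposite = (0.6, -0.5)   # second attribute loads negatively in test B

print(round(predicted_correlation(test_a, test_b_same_sign), 2))
print(round(predicted_correlation(test_a, test_b_opposite), 2))
```

With these illustrative loadings both tests reflect the construct equally, yet the opposite-signed second attribute cancels much of the shared variance, so a low observed correlation would not by itself count against the construct interpretation.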

Guilford (26) has discussed the place of factor analysis in construct validation. His statements may be extracted as follows:

"The personnel psychologist wishes to know 'why his tests are valid.' He can place tests and practical criteria in a matrix and factor it to identify 'real dimensions of human personality.' A factorial description is exact and stable; it is economical in explanation; it leads to the creation of pure tests which can be combined to predict complex behaviors." It is clear that factors here function as constructs. Eysenck, in his "criterion analysis" (18), goes farther than Guilford, and shows that factoring can be used explicitly to test hypotheses about constructs.

Factors may or may not be weighted with surplus meaning. Certainly when they are regarded as "real dimensions" a great deal of surplus meaning is implied, and the interpreter must shoulder a substantial burden of proof. The alternative view is to regard factors as defining a working reference frame, located in a convenient manner in the "space" defined by all behaviors of a given type. Which set of factors from a given matrix is "most useful" will depend partly on predilections, but in essence the best construct is the one around which we can build the greatest number of inferences, in the most direct fashion.

Studies of internal structure. For many constructs, evidence of homogeneity within the test is relevant in judging validity. If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated. Even low correlations, if consistent, would support the argument that people may be fruitfully described in terms of a generalized tendency to dominate or not dominate. The general quality would have power to predict behavior in a variety of situations represented by the specific items. Item-test correlations and certain reliability formulas describe internal consistency.
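The text does not say which reliability formulas it has in mind. As one illustration, the sketch below computes item-test correlations and coefficient alpha, a widely used internal-consistency index. The choice of alpha, the function names, and the response data are all assumptions made for this example.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def total_scores(items):
    """Sum each person's scores across items (items: one list per item)."""
    return [sum(scores) for scores in zip(*items)]

def item_test_correlations(items):
    """Correlation of each item with the total score (item included in total)."""
    totals = total_scores(items)
    return [pearson(item, totals) for item in items]

def coefficient_alpha(items):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(items)
    item_variance_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / pvariance(total_scores(items)))

# Hypothetical responses of six persons to four "dominance" items (1 = yes, 0 = no).
items = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
]
print([round(r, 2) for r in item_test_correlations(items)])
print(round(coefficient_alpha(items), 2))
```

Such figures only describe homogeneity. Whether a given level supports the proposed interpretation depends on what the theory of the trait leads one to expect.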

It is unwise to list uninterpreted data of this sort under the heading "validity" in test manuals, as some authors have done. High internal consistency may lower validity. Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity. Negative item-test correlations may support construct validity, provided that the items with negative correlations are believed irrelevant to the postulated construct and serve as suppressor variables (31, pp. 431-436; 44).

Study of distinctive subgroups of items within a test may set an upper limit to construct validity by showing that irrelevant elements influence scores. Thus a study of the PMA space tests shows that variance can be partially accounted for by a response set, tendency to mark many figures as similar (12). An internal factor analysis of the PEA Interpretation of Data Test shows that in addition to measuring reasoning skills, the test score is strongly influenced by a tendency to say "probably true" rather than "certainly true," regardless of item content (17). On the other hand, a study of item groupings in the DAT Mechanical Comprehension Test permitted rejection of the hypothesis that knowledge about specific topics such as gears made a substantial contribution to scores (13).

Studies of change over occasions. The stability of test scores ("retest reliability," Cattell's "N-technique") may be relevant to construct validation. Whether a high degree of stability is encouraging or discouraging for the proposed interpretation depends upon the theory defining the construct.

More powerful than the retest after uncontrolled intervening experiences is the retest with experimental intervention. If a transient influence swings test scores over a wide range, there are definite limits on the extent to which a test result can be interpreted as reflecting the typical behavior of the individual. These are examples of experiments which have indicated upper limits to test validity: studies of differences associated with the examiner in projective testing, of change of score under alternative directions ("tell the truth" vs. "make yourself look good to an employer"), and of coachability of mental tests. We may recall Gulliksen's distinction (27): When the coaching is of a sort that improves the pupil's intellectual functioning in school, the test which is affected by the coaching has validity as a measure of intellectual functioning; if the coaching improves test taking but not school performance, the test which responds to the coaching has poor validity as a measure of this construct.

Sometimes, where differences between individuals are difficult to assess by any means other than the test, the experimenter validates by determining whether the test can detect induced intra-individual differences. One might hypothesize that the Zeigarnik effect is a measure of ego involvement, i.e., that with ego involvement there is more recall of incomplete tasks. To support such an interpretation, the investigator will try to induce ego involvement on some task by appropriate directions and compare subjects' recall with their recall for tasks where there was a contrary induction. Sometimes the intervention is drastic. Porteus finds (53) that brain-operated patients show disruption of performance on his maze, but do not show impaired performance on conventional verbal tests, and argues therefrom that his test is a better measure of planfulness.

Studies of process. One of the best ways of determining informally what accounts for variability on a test is the observation of the person's process of performance. If it is supposed, for example, that a test measures mathematical competence, and yet observation of students' errors shows that erroneous reading of the question is common, the implications of a low score are altered. Lucas in this way showed that the Navy Relative Movement Test, an aptitude test, actually involved two different abilities: spatial visualization and mathematical reasoning (39).

Mathematical analysis of scoring procedures may provide important negative evidence on construct validity. A recent analysis of "empathy" tests is perhaps worth citing (14). "Empathy" has been operationally defined in many studies by the ability of a judge to predict what responses will be given on some questionnaire by a subject he has observed briefly. A mathematical argument has shown, however, that the scores depend on several attributes of the judge which enter into his perception of any individual, and that they therefore cannot be interpreted as evidence of his ability to interpret cues offered by particular others, or his intuition.

The Numerical Estimate of Construct Validity

There is an understandable tendency to seek a "construct validity coefficient." A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis, but since present methods of factor analysis are based on linear relations, more general methods will ultimately be needed to deal with many quantitative problems of construct validation.

Rarely will it be possible to estimate definite "construct saturations," because no factor corresponding closely to the construct will be available. One can only hope to set upper and lower bounds to the "loading." If "creativity" is defined as something independent of knowledge, then a correlation of .40 between a presumed test of creativity and a test of arithmetic knowledge would indicate that at least 16 per cent of the reliable test variance is irrelevant to creativity as defined. Laboratory performance on problems such as Maier's "hatrack" would scarcely be an ideal measure of creativity, but it would be somewhat relevant. If its correlation with the test is .60, this permits a tentative estimate of 36 per cent as a lower bound. (The estimate is tentative because the test might overlap with the irrelevant portion of the laboratory measure.) The saturation seems to lie between 36 and 84 per cent; a cumulation of studies would provide better limits.
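
The arithmetic of the creativity example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `saturation_bounds` and its parameter names are hypothetical, and the lower bound stays tentative for the reason given in the parenthetical remark above.

```python
def saturation_bounds(r_relevant, r_irrelevant):
    """Return (lower, upper) bounds on the proportion of reliable test
    variance attributable to the construct.

    r_relevant   -- correlation of the test with a somewhat relevant measure
                    (hypothetical name; e.g. the laboratory problem)
    r_irrelevant -- correlation of the test with a measure defined as
                    irrelevant to the construct (e.g. arithmetic knowledge)
    """
    for r in (r_relevant, r_irrelevant):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    lower = r_relevant ** 2          # variance shared with the relevant measure
    upper = 1.0 - r_irrelevant ** 2  # variance left after removing the irrelevant share
    return lower, upper


# Figures taken from the example in the text: .60 and .40.
lower, upper = saturation_bounds(0.60, 0.40)
print(f"saturation between {lower:.0%} and {upper:.0%}")
```

With the correlations of the example, the sketch reproduces the limits of 36 and 84 per cent; further studies would supply further correlations and, one hopes, narrower limits.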

It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (35, p. 284). The problem is not to conclude that the test "is valid" for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have.

THE LOGIC OF CONSTRUCT VALIDATION

Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim. The philosophy of science which we believe does most justice to actual scientific practice will now be briefly and dogmatically set forth. Readers interested in further study of the philosophical underpinning are referred to the works by Braithwaite (6, especially Chapter III), Carnap (7; 8, pp. 56-69), Pap (51), Sellars (55, 56), Feigl (19, 20), Beck (4), Kneale (37, pp. 92-110), Hempel (29; 30, Sec. 7).

The Nomological Net

The fundamental principles are these:

1. Scientifically speaking, to "make clear what something is" means to set forth the laws in which it occurs. We shall refer to the interlocking system of laws which constitute a theory as a nomological network.

2. The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These "laws" may be statistical or deterministic.

3. A necessary condition for a construct to be scientifically admissible is that it occur in a nomological net, at least some of whose laws involve observables. Admissible constructs may be remote from observation, i.e., a long derivation may intervene between the nomologicals which implicitly define the construct, and the (derived) nomologicals of type a. These latter propositions permit predictions about events. The construct is not "reduced" to the observations, but only combined with other constructs in the net to make predictions about observables.

4. "Learning more about" a theoretical construct is a matter of elaborating the nomological network in which it occurs, or of increasing the definiteness of the components. At least in the early history of a construct the network will be limited, and the construct will as yet have few connections.

5. An enrichment of the net such as adding a construct or a relation to theory is justified if it generates nomologicals that are confirmed by observation or if it reduces the number of nomologicals required to predict the same observations. When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network. That is, there may be alternative constructs or ways of organizing the net which for the time being are equally defensible.

6. We can say that "operations" which are qualitatively very different "overlap" or "measure the same thing" if their positions in the nomological net tie them to the same construct variable. Our confidence in this identification depends upon the amount of inductive support we have for the regions of the net involved. It is not necessary that a direct observational comparison of the two operations be made; we may be content with an intranetwork proof indicating that the two operations yield estimates of the same network-defined quantity. Thus, physicists are content to speak of the "temperature" of the sun and the "temperature" of a gas at room temperature even though the test operations are nonoverlapping because this identification makes theoretical sense.
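
One hedged way to picture principles 1 to 3 is to treat the net as a graph whose nodes are constructs and observables and whose edges are laws. The Python sketch below is an illustration only: the class `NomologicalNet`, its method names, and the example node names are all hypothetical, and reading "admissible" as "connected to some observable by a chain of laws" is a simplifying assumption, not a formalization offered in the text.

```python
from collections import deque


class NomologicalNet:
    """Toy nomological net: nodes are constructs or observables, edges are laws."""

    def __init__(self):
        self.observables = set()
        self.links = {}  # node name -> set of directly linked node names

    def add_observable(self, name):
        self.observables.add(name)
        self.links.setdefault(name, set())

    def add_law(self, a, b):
        # A law relates two nodes; it may be statistical or deterministic.
        self.links.setdefault(a, set()).add(b)
        self.links.setdefault(b, set()).add(a)

    def is_admissible(self, construct):
        """True if some chain of laws ties the construct to an observable."""
        seen = {construct}
        queue = deque([construct])
        while queue:  # breadth-first search through the net
            node = queue.popleft()
            if node in self.observables:
                return True
            for neighbour in self.links.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return False


# Hypothetical example nodes, chosen only for illustration.
net = NomologicalNet()
net.add_observable("palmar_sweating")
net.add_law("anxiety", "arousal")            # construct to construct (type c)
net.add_law("arousal", "palmar_sweating")    # construct to observable (type b)
net.add_law("vital_force", "entelechy")      # a fragment with no observable anywhere

print(net.is_admissible("anxiety"))      # remote from observation, but tied to it
print(net.is_admissible("vital_force"))  # never reaches an observable
```

In this toy net "anxiety" is admissible although it touches no observable directly, which mirrors the remark that a long derivation may intervene between a construct and the observational laws.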

With these statements of scientific methodology in mind, we return to the specific problem of construct validity as applied to psychological tests. The preceding guide rules should reassure the "toughminded," who fear that allowing construct validation opens the door to nonconfirmable test claims. The answer is that unless the network makes contact with observations, and exhibits explicit, public steps of inference, construct validation cannot be claimed. An admissible psychological construct must be behavior-relevant (59, p. 15). For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences.

A rigorous (though perhaps probabilistic) chain of inference is required to establish a test as a measure of a construct. To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist. When a construct is fairly new, there may be few specifiable associations by which to pin down the concept. As research proceeds, the construct sends out roots in many directions, which attach it to more and more facts or other constructs. Thus the electron has more accepted properties than the neutrino; numerical ability has more than the second space factor.

"Acceptance," which was critical in criterion-oriented and content validities, has now appeared in construct validity. Unless substantially the same nomological net is accepted by the several users of the construct, public validation is impossible. If A uses aggressiveness to mean overt assault on others, and B's usage includes repressed hostile reactions, evidence which convinces B that a test measures aggressiveness convinces A that the test does not. Hence, the investigator who proposes to establish a test as a measure of a construct must specify his network or theory sufficiently clearly that others can accept or reject it (cf. 41, p. 406). A consumer of the test who rejects the author's theory cannot accept the author's validation. He must validate the test for himself, if he wishes to show that it represents the construct as he defines it.

Two general qualifications are in order with reference to the methodological principles 1-6 set forth at the beginning of this section. Both of them concern the amount of "theory," in any high-level sense of that word, which enters into a construct-defining network of laws or lawlike statements. We do not wish to convey the impression that one always has a very elaborate theoretical network, rich in hypothetical processes or entities.

Constructs as inductive summaries. In the early stages of development of a construct or even at more advanced stages when our orientation is thoroughly practical, little or no theory in the usual sense of the word need be involved. In the extreme case the hypothesized laws are formulated entirely in terms of descriptive (observational) dimensions although not all of the relevant observations have actually been made.

The hypothesized network "goes beyond the data" only in the limited sense that it purports to characterize the behavior facets which belong to an observable but as yet only partially sampled cluster; hence, it generates predictions about hitherto unsampled regions of the phenotypic space. Even though no unobservables or high-order theoretical constructs are introduced, an element of inductive extrapolation appears in the claim that a cluster including some elements not-yet-observed has been identified. Since, as in any sorting or abstracting task involving a finite set of complex elements, several nonequivalent bases of categorization are available, the investigator may choose a hypothesis which generates erroneous predictions. The failure of a supposed, hitherto untried, member of the cluster to behave in the manner said to be characteristic of the group, or the finding that a nonmember of the postulated cluster does behave in this manner, may modify greatly our tentative construct.

For example, one might build an intelligence test on the basis of his background notions of "intellect," including vocabulary, arithmetic calculation, general information, similarities, two-point threshold, reaction time, and line bisection as subtests. The first four of these correlate, and he extracts a huge first factor. This becomes a second approximation of the intelligence construct, described by its pattern of loadings on the four tests. The other three tests have negligible loading on any common factor. On this evidence the investigator reinterprets intelligence as "manipulation of words." Subsequently it is discovered that test-stupid people are rated as unable to express their ideas, are easily taken in by fallacious arguments, and misread complex directions. These data support the "linguistic" definition of intelligence and the test's claim of validity for that construct. But then a block design test with pantomime instructions is found to be strongly saturated with the first factor. Immediately the purely "linguistic" interpretation of Factor I becomes suspect. This finding, taken together with our initial acceptance of the others as relevant to the background concept of intelligence, forces us to reinterpret the concept once again.

If we simply list the tests or traits which have been shown to be saturated with the "factor" or which belong to the cluster, no construct is employed. As soon as we even summarize the properties of this group of indicators, we are already making some guesses. Intensional characterization of a domain is hazardous since it selects (abstracts) properties and implies that new tests sharing those properties will behave as do the known tests in the cluster, and that tests not sharing them will not.

The difficulties in merely "characterizing the surface cluster" are strikingly exhibited by the use of certain special and extreme groups for purposes of construct validation. The Pd scale of MMPI was originally derived and cross-validated upon hospitalized patients diagnosed "Psychopathic personality, asocial and amoral type" (42). Further research shows the scale to have a limited degree of predictive and concurrent validity for "delinquency" more broadly defined (5, 28). Several studies show associations between Pd and very special "criterion" groups which it would be ludicrous to identify as "the criterion" in the traditional sense. If one lists these heterogeneous groups and tries to characterize them intensionally, he faces enormous conceptual difficulties. For example, a recent survey of hunting accidents in Minnesota showed that hunters who had "carelessly" shot someone were significantly elevated on Pd when compared with other hunters (48). This is in line with one's theoretical expectations; when you ask MMPI "experts" to predict for such a group they invariably predict Pd or Ma or both. The finding seems therefore to lend some slight support to the construct validity of the Pd scale. But of course it would be nonsense to define the Pd component "operationally" in terms of, say, accident proneness. We might try to subsume the original phenotype and the hunting-accident proneness under some broader category, such as "Disposition to violate society's rules, whether legal, moral, or just sensible." But now we have ceased to have a neat operational criterion, and are using instead a rather vague and wide-range class. Besides, there is worse to come. We want the class specification to cover a group trend that (nondelinquent) high school students judged by their peer group as least "responsible" score over a full sigma higher on Pd than those judged most "responsible" (23, p. 75). Most of the behaviors contributing to such sociometric choices fall well within the range of socially permissible action; the proffered criterion specification is still too restrictive. Again, any clinician familiar with MMPI lore would predict an elevated Pd on a sample of (nondelinquent) professional actors. Chyatte's confirmation of this prediction (10) tends to support both: (a) the theory sketch of "what the Pd factor is, psychologically"; and (b) the claim of the Pd scale to construct validity for this hypothetical factor. Let the reader try his hand at writing a brief phenotypic criterion specification that will cover both trigger-happy hunters and Broadway actors! And if he should be ingenious enough to achieve this, does his definition also encompass Hovey's report that high Pd predicts the judgments "not shy" and "unafraid of mental patients" made upon nurses by their supervisors (32, p. 143)? And then we have Gough's report that low Pd is associated with ratings as "good-natured" (24, p. 40), and Roessell's data showing that high Pd is predictive of "dropping out of high school" (54). The point is that all seven of these "criterion" dispositions would be readily guessed by any clinician having even superficial familiarity with MMPI interpretation; but to mediate these inferences explicitly requires quite a few hypotheses about dynamics, constituting an admittedly sketchy (but far from vacuous) network defining the genotype psychopathic deviate.

Vagueness of present psychological laws. This line of thought leads directly to our second important qualification upon the network schema. The idealized picture is one of a tidy set of postulates which jointly entail the desired theorems; since some of the theorems are coordinated to the observation base, the system constitutes an implicit definition of the theoretical primitives and gives them an indirect empirical meaning. In practice, of course, even the most advanced physical sciences only approximate this ideal. Questions of "categoricalness" and the like, such as logicians raise about pure calculi, are hardly even statable for empirical networks. (What, for example, would be the desiderata of a "well-formed formula" in molar behavior theory?) Psychology works with crude, half-explicit formulations. We do not worry about such advanced formal questions as "whether all molar-behavior statements are decidable by appeal to the postulates" because we know that no existing theoretical network suffices to predict even the known descriptive laws. Nevertheless, the sketch of a network is there; if it were not, we would not be saying anything intelligible about our constructs. We do not have the rigorous implicit definitions of formal calculi (which still, be it noted, usually permit of a multiplicity of interpretations). Yet the vague, avowedly incomplete network still gives the constructs whatever meaning they do have. When the network is very incomplete, having many strands missing entirely and some constructs tied in only by tenuous threads, then the "implicit definition" of these constructs is disturbingly loose; one might say that the meaning of the constructs is underdetermined. Since the meaning of theoretical constructs is set forth by stating the laws in which they occur, our incomplete knowledge of the laws of nature produces a vagueness in our constructs (see Hempel, 30; Kaplan, 34; Pap, 51). We will be able to say "what anxiety is" when we know all of the laws involving it; meanwhile, since we are in the process of discovering these laws, we do not yet know precisely what anxiety is.

CONCLUSIONS REGARDING THE NETWORK AFTER EXPERIMENTATION

The proposition that x per cent of test variance is accounted for by the construct is inserted into the accepted network. The network then generates a testable prediction about the relation of the test scores to certain other variables, and the investigator gathers data. If prediction and result are in harmony, he can retain his belief that the test measures the construct. The construct is at best adopted, never demonstrated to be "correct."

We do not first "prove" the theory, and then validate the test, nor conversely. In any probable inductive type of inference from a pattern of observations, we examine the relation between the total network of theory and observations. The system involves propositions relating test to construct, construct to other constructs, and finally relating some of these constructs to observables. In ongoing research the chain of inference is very complicated. Kelly and Fiske (36, p. 124) give a complex diagram showing the numerous inferences required in validating a prediction from assessment techniques, where theories about the criterion situation are as integral a part of the prediction as are the test data. A predicted empirical relationship permits us to test all the propositions leading to that prediction. Traditionally the proposition claiming to interpret the test has been set apart as the hypothesis being tested, but actually the evidence is significant for all parts of the chain. If the prediction is not confirmed, any link in the chain may be wrong.

A theoretical network can be divided into subtheories used in making particular predictions. All the events successfully predicted through a subtheory are of course evidence in favor of that theory. Such a subtheory may be so well confirmed by voluminous and diverse evidence that we can reasonably view a particular experiment as relevant only to the test's validity. If the theory, combined with a proposed test interpretation, mispredicts in this case, it is the latter which must be abandoned. On the other hand, the accumulated evidence for a test's construct validity may be so strong that an instance of misprediction will force us to modify the subtheory employing the construct rather than deny the claim that the test measures the construct.

Most cases in psychology today lie somewhere between these extremes. Thus, suppose we fail to find a greater incidence of "homosexual signs" in the Rorschach records of paranoid patients. Which is more strongly disconfirmed: the Rorschach signs or the orthodox theory of paranoia? The negative finding shows the bridge between the two to be undependable, but this is all we can say. The bridge cannot be used unless one end is placed on solider ground. The investigator must decide which end it is best to relocate.

Numerous successful predictions dealing with phenotypically diverse "criteria" give greater weight to the claim of construct validity than do fewer predictions, or predictions involving very similar behaviors. In arriving at diverse predictions, the hypothesis of test validity is connected each time to a subnetwork largely independent of the portion previously used. Success of these derivations testifies to the inductive power of the test-validity statement, and renders it unlikely that an equally effective alternative can be offered.

Implications of Negative Evidence

The investigator whose prediction and data are discordant must make strategic decisions. His result can be interpreted in three ways:

1. The test does not measure the construct variable.

2. The theoretical network which generated the hypothesis is incorrect.

3. The experimental design failed to test the hypothesis properly. (Strictly speaking this may be analyzed as a special case of 2, but in practice the distinction is worth making.)

For further research. If a specific fault of procedure makes the third a reasonable possibility, his proper response is to perform an adequate study, meanwhile making no report. When faced with the other two alternatives, he may decide that his test does not measure the construct adequately. Following that decision, he will perhaps prepare and validate a new test. Any rescoring or new interpretative procedure for the original instrument, like a new test, requires validation by means of a fresh body of data.

The investigator may regard interpretation 2 as more likely to lead to eventual advances. It is legitimate for the investigator to call the network defining the construct into question, if he has confidence in the test. Should the investigator decide that some step in the network is unsound, he may be able to invent an alternative network. Perhaps he modifies the network by splitting a concept into two or more portions, e.g., by designating types of anxiety, or perhaps he specifies added conditions under which a generalization holds. When an investigator modifies the theory in such a manner, he is now required to gather a fresh body of data to test the altered hypotheses. This step should normally precede publication of the modified theory. If the new data are consistent with the modified network, he is free from the fear that his nomologicals were gerrymandered to fit the peculiarities of his first sample of observations. He can now trust his test to some extent, because his test results behave as predicted.

The choice among alternatives, like any strategic decision, is a gamble as to which course of action is the best investment of effort. Is it wise to modify the theory? That depends on how well the system is confirmed by prior data, and how well the modifications fit available observations. Is it worth while to modify the test in the hope that it will fit the construct? That depends on how much evidence there is, apart from this abortive experiment, to support the hope, and also on how much it is worth to the investigator's ego to salvage the test. The choice among alternatives is a matter of research planning.

For practical use of the test. The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test. If the test has not been published, it should be restricted to research use until some degree of validity is established (1). The consumer can await the results of the investigator's gamble with confidence that proper application of the scientific method will ultimately tell whether the test has value. Until the evidence is in, he has no justification for employing the test as a basis for terminal decisions. The test may serve, at best, only as a source of suggestions about individuals to be confirmed by other evidence (15, 47).

There are two perspectives in test validation. From the viewpoint of the psychological practitioner, the burden of proof is on the test. A test should not be used to measure a trait until its proponent establishes that predictions made from such measures are consistent with the best available theory of the trait. In the view of the test developer, however, both the test and the theory are under scrutiny. He is free to say to himself privately, "If my test disagrees with the theory, so much the worse for the theory." This way lies delusion, unless he continues his research using a better theory.

Reporting of Positive Results

The test developer who finds positive correspondence between his proposed interpretation and data is expected to report the basis for his validity claim. Defending a claim of construct validity is a major task, not to be satisfied by a discourse without data. The Technical Recommendations have little to say on reporting of construct validity. Indeed, the only detailed suggestions under that heading refer to correlations of the test with other measures, together with a cross reference to some other sections of the report. The two key principles, however, call for the most comprehensive type of reporting. The manual for any test "should report all available information which will assist the user in determining what psychological attributes account for variance in test scores" (59, p. 27). And, "The manual for a test which is used primarily to assess postulated attributes of the individual should outline the theory on which the test is based and organize whatever partial validity data there are to show in what way they support the theory" (59, p. 28). It is recognized, by a classification as "very desirable" rather than "essential," that the latter recommendation goes beyond present practice of test authors.

The proper goals in reporting construct validation are to make clear (a) what interpretation is proposed, (b) how adequately the writer believes this interpretation is substantiated, and (c) what evidence and reasoning lead him to this belief. Without (a) the construct validity of the test is of no use to the consumer. Without (b) the consumer must carry the entire burden of evaluating the test research. Without (c) the consumer or reviewer is being asked to take (a) and (b) on faith. The test manual cannot always present an exhaustive statement on these points, but it should summarize and indicate where complete statements may be found.

To specify the interpretation, the writer must state what construct he has in mind, and what meaning he gives to that construct. For a construct which has a short history and has built up few connotations, it will be fairly easy to indicate the presumed properties of the construct, i.e., the nomologicals in which it appears. For a construct with a longer history, a summary of properties and references to previous theoretical discussions may be appropriate. It is especially critical to distinguish proposed interpretations from other meanings previously given the same construct. The validator faces no small task; he must somehow communicate a theory to his reader.

To evaluate his evidence calls for a statement like the conclusions from a program of research, noting what is well substantiated and what alternative interpretations have been considered and rejected. The writer must note what portions of his proposed interpretation are speculations, extrapolations, or conclusions from insufficient data. The author has an ethical responsibility to prevent unsubstantiated interpretations from appearing as truths. A claim is unsubstantiated unless the evidence for the claim is public, so that other scientists may review the evidence, criticize the conclusions, and offer alternative interpretations.

The report of evidence in a test manual must be as complete as any research report, except where adequate public reports can be cited. Reference to something "observed by the writer in many clinical cases" is worthless as evidence. Full case reports, on the other hand, may be a valuable source of evidence so long as these cases are representative and negative instances receive due attention. The report of evidence must be interpreted with reference to the theoretical network in such a manner that the reader sees why the author regards a particular correlation or experiment as confirming (or throwing doubt upon) the proposed interpretation. Evidence collected by others must be taken fairly into account.

VALIDATION OF A COMPLEX TEST "AS A WHOLE"

Special questions must be considered when we are investigating the validity of a test which is aimed to provide information about several constructs. In one sense, it is naive to inquire "Is this test valid?" One does not validate a test, but only a principle for making inferences. If a test yields many different types of inferences, some of them can be valid and others invalid (cf. Technical Recommendation C2: "The manual should report the validity of each type of inference for which a test is recommended"). From this point of view, every topic sentence in the typical book on Rorschach interpretation presents a hypothesis requiring validation, and one should validate inferences about each aspect of the personality separately and in turn, just as he would want information on the validity (concurrent or predictive) for each scale of MMPI.

There is, however, another defensible point of view. If a test is purely empirical, based strictly on observed connections between response to an item and some criterion, then of course the validity of one scoring key for the test does not make validation for its other scoring keys any less necessary. But a test may be developed on the basis of a theory which in itself provides a linkage between the various keys and the various criteria. Thus, while Strong's Vocational Interest Blank is developed empirically, it also rests on a "theory" that a youth can be expected to be satisfied in an occupation if he has interests common to men now happy in the occupation. When Strong finds that those with high Engineering interest scores in college are preponderantly in engineering careers 19 years later, he has partly validated the proposed use of the Engineer score (predictive validity). Since the evidence is consistent with the theory on which all the test keys were built, this evidence alone increases the presumption that the other keys have predictive validity. How strong is this presumption? Not very, from the viewpoint of the traditional skepticism of science. Engineering interests may stabilize early, while interests in art or management or social work are still unstable. A claim cannot be made that the whole Strong approach is valid just because one score shows predictive validity. But if thirty interest scores were investigated longitudinally and all of them showed the type of validity predicted by Strong's theory, we would indeed be caviling to say that this evidence gives no confidence in the long-range validity of the thirty-first score.

Confidence in a theory is increased as more relevant evidence confirms it, but it is always possible that tomorrow's investigation will render the theory obsolete. The Technical Recommendations suggest a rule of reason, and ask for evidence for each type of inference for which a test is recommended. It is stated that no test developer can present predictive validities for all possible criteria; similarly, no developer can run all possible experimental tests of his proposed interpretation. But the recommendation is more subtle than advice that a lot of validation is better than a little.

Consider the Rorschach test. It is used for many inferences, made by means of nomological networks at several levels. At a low level are the simple unrationalized correspondences presumed to exist between certain signs and psychiatric diagnoses. Validating such a sign does nothing to substantiate Rorschach theory. For other Rorschach formulas an explicit a priori rationale exists (for instance, high F% interpreted as implying rigid control of impulses). Each time such a sign shows correspondence with criteria, its rationale is supported just a little. At a still higher level of abstraction, a considerable body of theory surrounds the general area of outer control, interlacing many different constructs. As evidence cumulates, one should be able to decide what specific inference-making chains within this system can be depended upon. One should also be able to conclude (or deny) that so much of the system has stood up under test that one has some confidence in even the untested lines in the network.

In addition to relatively delimited nomological networks surrounding control or aspiration, the Rorschach interpreter usually has an overriding theory of the test as a whole. This may be a psychoanalytic theory, a theory of perception and set, or a theory stated in terms of learned habit patterns. Whatever the theory of the interpreter, whenever he validates an inference from the system, he obtains some reason for added confidence in his overriding system. His total theory is not tested, however, by experiments dealing with only one limited set of constructs. The test developer must investigate far-separated, independent sections of the network. The more diversified the predictions the system is required to make, the greater confidence we can have that only minor parts of the system will later prove faulty. Here we begin to glimpse a logic to defend the judgment that the test and its whole interpretative system is valid at some level of confidence.

There are enthusiasts who would conclude from the foregoing paragraphs that since there is some evidence of correct, diverse predictions made from the Rorschach, the test as a whole can now be accepted as validated. This conclusion overlooks the negative evidence. Just one finding contrary to expectation, based on sound research, is sufficient to wash a whole theoretical structure away. Perhaps the remains can be salvaged to form a new structure. But this structure now must be exposed to fresh risks, and sound negative evidence will destroy it in turn. There is sufficient negative evidence to prevent acceptance of the Rorschach and its accompanying interpretative structures as a whole. So long as any aspects of the overriding theory stated for the test have been disconfirmed, this structure must be rebuilt.

Talk of areas and structures may seem not to recognize those who would interpret the personality "globally." They may argue that a test is best validated in matching studies. Without going into detailed questions of matching methodology, we can ask whether such a study validates the nomological network "as a whole." The judge does employ some network in arriving at his conception of his subject, integrating specific inferences from specific data. Matching studies, if successful, demonstrate only that each judge's interpretative theory has some validity, that it is not completely a fantasy. Very high consistency between judges is required to show that they are using the same network, and very high success in matching is required to show that the network is dependable.

If inference is less than perfectly dependable, we must know which aspects of the interpretative network are least dependable and which are most dependable. Thus, even if one has considerable confidence in a test "as a whole" because of frequent successful inferences, one still returns as an ultimate aim to the request of the Technical Recommendation for separate evidence on the validity of each type of inference to be made.

RECAPITULATION

Construct validation was introduced in order to specify types of research required in developing tests for which the conventional views on validation are inappropriate. Personality tests, and some tests of ability, are interpreted in terms of attributes for which there is no adequate criterion. This paper indicates what sorts of evidence can substantiate such an interpretation, and how such evidence is to be interpreted. The following points made in the discussion are particularly significant.

1. A construct is defined implicitly by a network of associations or propositions in which it occurs. Constructs employed at different stages of research vary in definiteness.

2. Construct validation is possible only when some of the statements in the network lead to predicted relations among observables. While some observables may be regarded as "criteria," the construct validity of the criteria themselves is regarded as under investigation.

3. The network defining the construct, and the derivation leading to the predicted observation, must be reasonably explicit so that validating evidence may be properly interpreted.

4. Many types of evidence are relevant to construct validity, including content validity, interitem correlations, intertest correlations, test-"criterion" correlations, studies of stability over time, and stability under experimental intervention. High correlations and high stability may constitute either favorable or unfavorable evidence for the proposed interpretation, depending on the theory surrounding the construct.

5. When a predicted relation fails to occur, the fault may lie in the proposed interpretation of the test or in the network. Altering the network so that it can cope with the new observations is, in effect, redefining the construct. Any such new interpretation of the test must be validated by a fresh body of data before being advanced publicly. Great care is required to avoid substituting a posteriori rationalizations for proper validation.

6. Construct validity cannot generally be expressed in the form of a single simple coefficient. The data often permit one to establish upper and lower bounds for the proportion of test variance which can be attributed to the construct. The integration of diverse data into a proper interpretation cannot be an entirely quantitative process.

7. Constructs may vary in nature from those very close to "pure description" (involving little more than extrapolation of relations among observation-variables) to highly theoretical constructs involving hypothesized entities and processes, or making identifications with constructs of other sciences.

8. The investigation of a test's construct validity is not essentially different from the general scientific procedures for developing and confirming theories.
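The bounds mentioned in point 6 can be illustrated with a small numerical sketch. It is only one possible reading, offered under assumptions that go beyond the text: relations are taken to be linear, the "irrelevant" measure is assumed independent of the construct, and whatever variance the test shares with the "relevant" measure is assumed to be construct variance. The function name, its parameters, and the figures used are hypothetical.

```python
def construct_variance_bounds(reliability, r_irrelevant, r_relevant=0.0):
    """Hypothetical helper: bound the share of total test variance
    attributable to the construct.

    Assumptions (illustrative only):
      * relations among measures are linear;
      * the 'irrelevant' measure is independent of the construct, so the
        variance it shares with the test cannot be construct variance;
      * the variance shared with the 'relevant' measure is construct variance.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    for name, r in (("r_irrelevant", r_irrelevant), ("r_relevant", r_relevant)):
        if not -1.0 <= r <= 1.0:
            raise ValueError(f"{name} must lie in [-1, 1]")

    # Upper bound: reliable variance minus the part shared with the
    # construct-irrelevant measure.
    upper = reliability - r_irrelevant ** 2
    # Lower bound: variance shared with the construct-relevant measure.
    lower = r_relevant ** 2

    if lower > upper:
        raise ValueError("inputs are inconsistent under the stated assumptions")
    return lower, upper


if __name__ == "__main__":
    # Hypothetical figures: reliability .90, r = .40 with an irrelevant
    # measure, r = .50 with a construct-relevant measure.
    low, high = construct_variance_bounds(0.90, 0.40, 0.50)
    print(f"construct share of test variance lies between {low:.2f} and {high:.2f}")
```

Even under these simplifying assumptions the result is an interval, not a single coefficient, and choosing which measures count as relevant or irrelevant remains a theoretical judgment.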

Without in the least advocating construct validity as preferable to the other three kinds (concurrent, predictive, content), we do believe it imperative that psychologists make a place for it in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an "operational" methodology which, if adopted, would force research into a mold it does not fit.

REFERENCES

1. AMERICAN PSYCHOLOGICAL ASSOCIATION. Ethical standards of psychologists. Washington, D. C.: American Psychological Association, Inc., 1953.

2. ANASTASI, ANNE. The concept of validity in the interpretation of test scores. Educ. psychol. Measmt, 1950, 10, 67-78.

3. BECHTOLDT, H. P. Selection. In S. S. Stevens (Ed.), Handbook of experimental psychology. New York: Wiley, 1951. Pp. 1237-1267.

4. BECK, L. W. Constructions and inferred entities. Phil. Sci., 1950, 17. Reprinted in H. Feigl and M. Brodbeck (Eds.), Readings in the philosophy of science. New York: Appleton-Century-Crofts, 1953. Pp. 368-381.

5. BLAIR, W. R. N. A comparative study of disciplinary offenders and non-offenders in the Canadian Army. Canad. J. Psychol., 1950, 4, 49-62.

6. BRAITHWAITE, R. B. Scientific explanation. Cambridge: Cambridge Univer. Press, 1953.

7. CARNAP, R. Empiricism, semantics, and ontology. Rev. int. de Phil., 1950, II, 20-40. Reprinted in P. P. Wiener (Ed.), Readings in philosophy of science. New York: Scribner's, 1953. Pp. 509-521.

8. CARNAP, R. Foundations of logic and mathematics. International encyclopedia of unified science, I, No. 3. Pages 56-69 reprinted as "The interpretation of physics" in H. Feigl and M. Brodbeck (Eds.), Readings in the philosophy of science. New York: Appleton-Century-Crofts, 1953. Pp. 309-318.

9. CHILD, I. L. Personality. Annu. Rev. Psychol., 1954, 5, 149-171.

10. CHYATTE, C. Psychological characteristics of a group of professional actors. Occupations, 1949, 27, 245-250.

11. CRONBACH, L. J. Essentials of psychological testing. New York: Harper, 1949.

12. CRONBACH, L. J. Further evidence on response sets and test design. Educ. psychol. Measmt, 1950, 10, 3-31.

13. CRONBACH, L. J. Coefficient alpha and the internal structure of tests. Psychometrika, 1951, 16, 297-335.

14. CRONBACH, L. J. Processes affecting scores on "understanding of others" and "assumed similarity." Psychol. Bull., 1955, 52, 177-193.

15. CRONBACH, L. J. The counselor's problems from the perspective of communication theory. In Vivian H. Hewer (Ed.), New perspectives in counseling. Minneapolis: Univer. of Minnesota Press, 1955.

16. CURETON, E. E. Validity. In E. F. Lindquist (Ed.), Educational measurement. Washington, D. C.: American Council on Education, 1950. Pp. 621-695.

17. DAMRIN, DORA E. A comparative study of information derived from a diagnostic problem-solving test by logical and factorial methods of scoring. Unpublished doctor's dissertation, Univer. of Illinois, 1952.

18. EYSENCK, H. J. Criterion analysis - an application of the hypothetico-deductive method in factor analysis. Psychol. Rev., 1950, 57, 38-53.

19. FEIGL, H. Existential hypotheses. Phil. Sci., 1950, 17, 35-62.

20. FEIGL, H. Confirmability and confirmation. Rev. int. de Phil., 1951, 5, 1-12. Reprinted in P. P. Wiener (Ed.), Readings in philosophy of science. New York: Scribner's, 1953. Pp. 522-530.

21. GAYLORD, R. H. Conceptual consistency and criterion equivalence: a dual approach to criterion analysis. Unpublished manuscript (PRB Research Note No. 17). Copies obtainable from ASTIA-DSC, AD-21 440.

22. GOODENOUGH, FLORENCE L. Mental testing. New York: Rinehart, 1950.

23. GOUGH, H. G., MCCLOSKY, H., & MEEHL, P. E. A personality scale for social responsibility. J. abnorm. soc. Psychol., 1952, 47, 73-80.

24. GOUGH, H. G., MCKEE, M. G., & YANDELL, R. J. Adjective check list analyses of a number of selected psychometric and assessment variables. Unpublished manuscript. Berkeley: IPAR, 1953.

25. GUILFORD, J. P. New standards for test evaluation. Educ. psychol. Measmt, 1946, 6, 427-439.

26. GUILFORD, J. P. Factor analysis in a test-development program. Psychol. Rev., 1948, 55, 79-94.

27. GULLIKSEN, H. Intrinsic validity. Amer. Psychologist, 1950, 5, 511-517.

28. HATHAWAY, S. R., & MONACHESI, E. D. Analyzing and predicting juvenile delinquency with the MMPI. Minneapolis: Univer. of Minnesota Press, 1953.

29. HEMPEL, C. G. Problems and changes in the empiricist criterion of meaning. Rev. int. de Phil., 1950, 4, 41-63. Reprinted in L. Linsky, Semantics and the philosophy of language. Urbana: Univer. of Illinois Press, 1952. Pp. 163-185.

30. HEMPEL, C. G. Fundamentals of concept formation in empirical science. Chicago: Univer. of Chicago Press, 1952.

31. HORST, P. The prediction of personal adjustment. Soc. Sci. Res. Council Bull., 1941, No. 48.

32. HOVEY, H. B. MMPI profiles and personality characteristics. J. consult. Psychol., 1953, 17, 142-146.

33. JENKINS, J. G. Validity for what? J. consult. Psychol., 1946, 10, 93-98.

34. KAPLAN, A. Definition and specification of meaning. J. Phil., 1946, 43, 281-288.

35. KELLY, E. L. Theory and techniques of assessment. Annu. Rev. Psychol., 1954, 5, 281-311.

36. KELLY, E. L., & FISKE, D. W. The prediction of performance in clinical psychology. Ann Arbor: Univer. of Michigan Press, 1951.

37. KNEALE, W. Probability and induction. Oxford: Clarendon Press, 1949. Pages 92-110 reprinted as "Induction, explanation, and transcendent hypotheses" in H. Feigl and M. Brodbeck (Eds.), Readings in the philosophy of science. New York: Appleton-Century-Crofts, 1953. Pp. 353-367.

38. LINDQUIST, E. F. Educational measurement. Washington, D. C.: American Council on Education, 1950.

39. LUCAS, C. M. Analysis of the relative movement test by a method of individual interviews. Bur. Naval Personnel Res. Rep., Contract Nonr-694 (00), NR 151-13, Educational Testing Service, March 1953.

40. MACCORQUODALE, K., & MEEHL, P. E. On a distinction between hypothetical constructs and intervening variables. Psychol. Rev., 1948, 55, 95-107.

41. MACFARLANE, JEAN W. Problems of validation inherent in projective methods. Amer. J. Orthopsychiat., 1942, 12, 405-410.

42. MCKINLEY, J. C., & HATHAWAY, S. R. The MMPI: V. Hysteria, hypomania, and psychopathic deviate. J. appl. Psychol., 1944, 28, 153-174.

43. MCKINLEY, J. C., HATHAWAY, S. R., & MEEHL, P. E. The MMPI: VI. The K scale. J. consult. Psychol., 1948, 12, 20-31.

44. MEEHL, P. E. A simple algebraic development of Horst's suppressor variables. Amer. J. Psychol., 1945, 58, 550-554.

45. MEEHL, P. E. An investigation of a general normality or control factor in personality testing. Psychol. Monogr., 1945, 59, No. 4 (Whole No. 274).

46. MEEHL, P. E. Clinical vs. statistical prediction. Minneapolis: Univer. of Minnesota Press, 1954.

47. MEEHL, P. E., & ROSEN, A. Antecedent probability and the efficiency of psychometric signs, patterns or cutting scores. Psychol. Bull., 1955, 52, 194-216.

48. Minnesota Hunter Casualty Study. St. Paul: Jacob Schmidt Brewing Company, 1954.

49. MOSIER, C. I. A critical examination of the concepts of face validity. Educ. psychol. Measmt, 1947, 7, 191-205.

50. MOSIER, C. I. Problems and designs of cross-validation. Educ. psychol. Measmt, 1951, 11, 5-12.

51. PAP, A. Reduction-sentences and open concepts. Methodos, 1953, 5, 3-30.

52. PEAK, HELEN. Problems of objective observation. In L. Festinger and D. Katz (Eds.), Research methods in the behavioral sciences. New York: Dryden Press, 1953. Pp. 243-300.

53. PORTEUS, S. D. The Porteus maze test and intelligence. Palo Alto: Pacific Books, 1950.

54. ROESSEL, F. P. MMPI results for high school drop-outs and graduates. Unpublished doctor's dissertation, Univer. of Minnesota, 1954.

55. SELLARS, W. S. Concepts as involving laws and inconceivable without them. Phil. Sci., 1948, 15, 287-315.

56. SELLARS, W. S. Some reflections on language games. Phil. Sci., 1954, 21, 204-228.

57. SPIKER, C. C., & MCCANDLESS, B. R. The concept of intelligence and the philosophy of science. Psychol. Rev., 1954, 61, 255-267.

58. Technical recommendations for psychological tests and diagnostic techniques: preliminary proposal. Amer. Psychologist, 1952, 7, 461-476.

59. Technical recommendations for psychological tests and diagnostic techniques. Psychol. Bull. Supplement, 1954, 51, 2, Part 2, 1-38.

60. THURSTONE, L. L. The criterion problem in personality research. Psychometric Lab. Rep., No. 78. Chicago: Univer. of Chicago, 1952.

Received for early publication February 18, 1955.
