
    Test Validity and the Ethics of Assessment

SAMUEL MESSICK
Educational Testing Service, Princeton, New Jersey

ABSTRACT: Questions of the adequacy of a test as a measure of the characteristic it is interpreted to assess are answerable on scientific grounds by appraising psychometric evidence, especially construct validity. Questions of the appropriateness of test use in proposed applications are answerable on ethical grounds by appraising potential social consequences of the testing. The first set of answers provides an evidential basis for test interpretation, and the second set provides a consequential basis for test use. In addition, this article stresses (a) the importance of construct validity for test use because it provides a rational foundation for predictiveness and relevance, and (b) the importance of taking into account the value implications of test interpretations per se. By thus considering both the evidential and consequential bases of both test interpretation and test use, the roles of evidence and social values in the overall validation process are illuminated, and test validity comes to be based on ethical as well as evidential grounds.

Fifteen years ago or so, in papers dealing with personality measurement and the ethics of assessment, I drew a straightforward but deceptively simple distinction between the psychometric adequacy of a test and the appropriateness of its use (Messick, 1964, 1965). I argued that not only should tests be evaluated in terms of their measurement properties but that testing applications should be evaluated in terms of their potential social consequences. I urged that two questions be explicitly addressed whenever a test is proposed for a specific purpose: First, is the test any good as a measure of the characteristics it is interpreted to assess? Second, should the test be used for the proposed purpose in the proposed way? The first question is a scientific and technical one and may be answered by appraising evidence for the test's psychometric properties, especially construct validity. The second question is an ethical one, and its answer requires a justification of the proposed use in terms of social values. Good answers to the first question are not satisfactory answers to the second. Justification of test use by an appeal to empirical validity is not enough; the potential social consequences of the testing should also be appraised, not only in terms of what it might entail directly as costs and benefits but also in terms of what it makes more likely as possible side effects.

These two questions were phrased to parallel two recurrent criticisms of testing (that some tests are of poor quality and that tests are often misused) in an attempt to separate the frequently blurred issues in the typical critical interchange into (a) questions of test bias or the adequacy of measurement, and (b) questions of test fairness or the appropriateness of use (Messick, 1965).

It was in the context of appraising personality measurement for selection purposes that I originally stressed the need for ethical standards for justifying test use (Messick, 1964). Although at that time personality tests appeared inadequate for the selection task when systematically evaluated against measurement and prediction standards, it seemed likely that rapidly advancing research technology would, in the relatively near future, produce psychometrically sophisticated personality assessment devices. Therefore, questions might soon arise in earnest as to the scope of their practical application beyond clinical and counseling usage. With variables as value-laden as personality characteristics, it seemed critical that values as well as validity be considered in contemplating test use.

Kaplan (1964) pointed out that "the validity of a measurement consists in what it is able to accomplish, or more accurately, in what we are able to do with it. . . . The basic question is always whether the measures have been so arrived at that they can serve effectively as means to the given end" (p. 198). Also at issue is whether the measures should serve as means to the given end, in light of other ends they might inadvertently serve and in consideration of the place of the given end in the social fabric of pluralistic alternatives.

This article was an invited address to the Divisions of Educational Psychology and of Evaluation and Measurement, presented at the meeting of the American Psychological Association, New York City, September 1, 1979.

Requests for reprints should be sent to Samuel Messick, Educational Testing Service, Princeton, New Jersey 08541.


For example, should a psychometrically sound measure of "flexibility versus rigidity" be used for selection in a particular college if it significantly improves the multiple prediction of grade point average there? What if the direction of prediction favored rigid students? What if entrance to a military academy were at issue, or a medical school? What if the scores had been interpreted instead as measures of "confusion versus control"? What if there were large sex differences in the score distributions? In a different arena, what minimal levels of knowledge and skill should be required for graduation from high school and in what areas?

It seemed clear at this point that value issues in measurement were not limited to personality assessment, nor to selection applications, but should be extended to all psychological and educational measurement (Messick, 1975). This is primarily because psychological and educational variables all bear, either directly or indirectly, on human characteristics, processes, and products and hence are inherently, though variably, value-laden. The measurement of such characteristics entails value judgments (at all levels of test construction, analysis, interpretation, and use), and this raises questions both of whose values are the standard and of what should be the consequences of negative valuation. Values thus appear to be as pervasive and critical for psychological and educational measurement as is testing's acknowledged touchstone, validity. Indeed, "The root meaning of the word 'validity' is the same as that of the word 'value': both derive from a term meaning strength" (Kaplan, 1964, p. 198).

It should be emphasized that value questions arise with any approach to psychological and educational testing, whether it be norm-referenced or criterion-referenced (Glaser & Nitko, 1971), a construct-based ability test or a content-sampled achievement test (Messick, 1975), a reactive task or an unobtrusive observation (Webb, Campbell, Schwartz, & Sechrest, 1966), a sign or a sample (Goodenough, 1969), or whatever, but the nature of the critical value questions may differ from one approach to another. For example, many of the advantages of samples over signs derive from the similarity of past behaviors to desired future behaviors, which makes it more likely that behavior-sample tests will be judged relevant in both content and process to the task or job domain about which inferences are to be drawn. It is also likely that scores from such samples, because of behavioral consistency from one time to another, will be predictive of performance in those domains (Wernimont & Campbell, 1968). A key value question is whether such "persistence forecasting," as Wallach (1976) calls it, is desirable in a particular domain of application. In higher education, for example, the appropriate model might not be persistence but development and change, which suggests that in such instances we be wary of selection procedures that restrict individual opportunity on the basis of behavior to date (Hudson, 1976).

The distinction stressed thus far between the adequacy of a test as a measure of the characteristic it is interpreted to assess and the appropriateness of its use in specific applications underscores in the first instance the evidential basis of test interpretation, especially the need for construct validity evidence, and in the second instance the consequential basis of test use, through appraisal of potential social consequences. In developing this distinction in prior work I emphasized the importance of construct validity for test use as well, arguing "that even for purposes of applied decision making reliance upon criterion validity or content coverage is not enough," that the meaning of the measure must also be comprehended in order to appraise potential social consequences sensibly (Messick, 1975, p. 956). The present article extends this argument for the importance of construct validity in test use still further by stressing its role in providing a "rational foundation for predictive validity" (Guion, 1976b). After thus elaborating the evidential basis of test use, I consider the value implications of test interpretations per se, especially those that bear evaluative and ideological overtones going beyond intended meanings and supporting evidence; the circle is thereby completed with an examination of the consequential basis of test interpretation. Finally, the dynamic interplay between test interpretation and its value implications, on the one hand, and test use and its social consequences, on the other, is sketched in a feedback model that incorporates a pragmatic component for the empirical evaluation of testing consequences.

Validity as Inference From Evidence

According to the Standards for Educational and Psychological Tests (American Psychological Association et al., 1974), "Questions of validity are questions of what may properly be inferred from a test score; validity refers to the appropriateness of inferences from test scores or other forms of assessment. . . . It is important to note that validity is itself inferred, not measured. It is, therefore, something that is judged as adequate, or marginal, or unsatisfactory" (p. 25). This document also points out that the many forms of validity questions fall into two broad classes, those dealing with inferences about what is being measured by the test and those inquiring into the usefulness of the measurement as a predictor of other variables. Furthermore, there are a variety of validation methods available, but they all entail in principle a clear designation of what is to be inferred from the scores and the presentation of data to support such inferences.

Unfortunately, after this splendid beginning, this and other official documents, namely, the Division of Industrial and Organizational Psychology's (1975) Principles for the Validation and Use of Personnel Selection Procedures and the Equal Employment Opportunity Commission et al.'s (1978) "Uniform Guidelines on Employee Selection Procedures," proceed, as Dunnette and Borman (1979) lament, to "perpetuate a conceptual compartmentalization of 'types' of validity: criterion-related, content, and construct. . . . the implication that validities come in different types leads to confusion and, in the face of confusion, over-simplification" (p. 483). One consequence of this simplism is that many test users focus on one or another of the types of validity, as though any one would do, rather than on the specific inferences they intend to make from the scores. There is an implication that once evidence of one type of validity is forthcoming, one is relieved of responsibility for further inquiry. Indeed, the "Uniform Guidelines" seem to treat the three types of validity, in Guion's (1980) words, "as something of a Holy Trinity representing three different roads to psychometric salvation. If you can't demonstrate one kind of validity, you've got two more chances" (p. 4).

Different kinds of inferences from test scores require different kinds of evidence, not different kinds of validity. By "evidence" I mean both data, or facts, and the rationale or arguments that cement those facts into a justification of test-score inferences. "Another way to put this is to note that data are not information; information is that which results from the interpretation of data" (Mitroff & Sagasti, 1973, p. 123). Or as Kaplan (1964) states, "What serves as evidence is the result of a process of interpretation: facts do not speak for themselves; nevertheless, facts must be given a hearing, or the scientific point to the process of interpretation is lost" (p. 375). Facts and rationale thus blend in this view of evidence, and the tolerable balance between them in the arena of test validity extends over a considerable range, possibly even falling just short of the one extreme where facts are left to speak for themselves and the other extreme where a logical rationale alone is deemed self-evident.

By focusing on the nature of the evidence in relation to the nature of the inferences drawn from test scores, we come to view validity as a general imperative in measurement. Validity is the overall degree of justification for test interpretation and use. It is "an evaluation, considering all things, of a certain kind of inference about people who obtain a certain score" (Guion, 1978b, p. 500). Although it may prove helpful conceptually to discuss the interdependent features of the generic concept in terms of different aspects or facets, it is simplistic to think of different types or kinds of validity.

From this standpoint we are not very well served by labeling different aspects of a general concept with the name of the concept, as in criterion-related validity, content validity, or construct validity, or by proliferating a host of specialized validity modifiers, such as discriminant validity, trait validity, factorial validity, structural validity, or population validity, each delimiting some aspect of a broader meaning. The substantive points associated with each of these terms are important ones, but their distinctiveness is blunted by calling them all "validity." Since many of the referents are similar but not identical, they tend to be assimilated one to another, leading to confusion among them and to a blurring of the different forms of evidence that the terms were invoked to highlight in the first place. Worse still, any one of these so-called validities, or a small set of them, might be treated as the whole of validity, while the entire collection to date might still not exhaust the essence of the whole.

We would be much better off conceptually to use labels more descriptive of the character and intent of each aspect, such as content relevance and content coverage rather than content validity, or population generalizability rather than population validity. Table 1 lists a number of currently used validity terms along with a tentative descriptive designation for each that is intended to underscore differences among the concepts while at the same time highlighting the key feature of each, such as consistency or utility, and pointing to essential areas of similarity and overlap, as with criterion relatedness, nomological relatedness, and external relatedness. With one possible exception to be discussed subsequently, none of these concepts qualifies for the accolade of validity, for at best they are only one facet of validity and at worst, as in the case of content coverage, they are not validity at all. So-called "content validity" refers to the relevance and representativeness of the task content used in test construction and does not refer to test scores at all, let alone evidence to support inferences from test scores, although such content considerations do permit elaborations on score inferences supported by other evidence (Guion, 1977a, 1978a; Messick, 1975; Tenopyr, 1977).

I will comment on most of the concepts in Table 1 in passing while considering the claim of the one exception noted earlier, namely, construct validity, to bear the name "validity" and to wear the mantle of all that name implies. I have pressed in previous writing for the view that all measurement should be construct-referenced (Messick, 1975, p. 957). Others have similarly argued that "any inference relative to prediction and . . . all inferences relative to test scores, are based upon underlying constructs" (Tenopyr, 1977, p. 48). Guion (1977b, p. 410) concluded that "all validity is at its base some form of construct validity. . . . It is the basic meaning of validity." I will argue, building on Guion's (1976b) conceptual groundwork, that construct validity is indeed the unifying concept of validity that integrates criterion and content considerations into a common framework for testing rational hypotheses about theoretically relevant relationships. The bridge or unifying theme that permits this integration is the meaningfulness or interpretability of the test scores, which is the goal of the construct validation process. This construct meaning provides a rational basis both for hypothesizing predictive relationships and for judging content relevance and representativeness.

I stop short, however, as did Guion (1980), of equating construct validity with validity in general, but for different reasons. The main basis for hesitancy on my part, as we shall see, is that validity entails an evaluation of the value implications of both test interpretation and test use. These implications derive primarily from the test's construct meaning, to be sure, and they feed back into the construct validation process, but they also derive in part from broader social ideologies, such as the ideologies of social science or of education or of social justice, and hence go beyond construct meaning per se.

TABLE 1
Alternative Descriptors for Aspects of Test Validity

Validity designation      Descriptive designation
Content validity          Content relevance (domain specifications);
                          content coverage (domain representativeness)
Criterion validity        Criterion relatedness
Predictive validity       Predictive utility
Concurrent validity       Diagnostic utility; substitutability
Construct validity        Interpretive meaningfulness
Convergent validity       Convergent coherence
Discriminant validity     Discriminant distinctiveness
Trait validity            Trait correspondence
Nomological validity      Nomological relatedness
Factorial validity        Factorial composition
Substantive validity      Substantive consistency
Structural validity       Structural fidelity
External validity         External relatedness
Population validity       Population generalizability
Ecological validity       Ecological generalizability
Temporal validity         Temporal continuity (across developmental levels);
                          temporal generalizability (across historical periods)
Task validity             Task generalizability


    INTERPRETIVE MEANINGFULNESS

Construct validation is a process of marshaling evidence to support the inference that an observed response consistency in test performance has a particular meaning, primarily by appraising the extent to which empirical relationships with other measures, or the lack thereof, are consistent with that meaning. These empirical relationships may be assessed in a variety of ways, for example, by gauging the degree of consistency in correlational patterns and factor structures, in group differences, response processes, and changes over time, or in responsiveness to experimental treatments. The process attempts to link the reliable response consistencies summarized by test scores to nontest behavioral consistencies reflective of a presumably common underlying construct, usually an attribute or process or trait that is itself embedded in a more comprehensive network of theoretical propositions or laws called a nomological network (Feigl, 1956; Hempel, 1970; Margenau, 1950). An empirically grounded pattern of such links provides an evidential basis for interpreting the test scores in construct or process terms, as well as a rational basis for inferring testable implications of the scores from the broader theoretical network of the construct's meaning (Cronbach & Meehl, 1955; Messick, 1975). Constructs are thus chosen or created "to organize experience into general law-like statements" (Cronbach, 1971, p. 462).

Construct validation entails both confirmatory and disconfirmatory strategies, one to provide convergent evidence that the measure in question is coherently related to other measures of the same construct as well as to other variables that it should relate to on theoretical grounds, and the other to provide discriminant evidence that the measure is not related unduly to exemplars of other distinct constructs (D. T. Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible counterhypotheses to the construct interpretation (Popper, 1959), especially those pointing to the possibility that the observed consistencies might instead be attributable to shared method constraints, response sets, or other contaminants.

Construct validity emphasizes two intertwined sets of relationships for the test: one between the test and different methods for measuring the same construct or trait, and the other between measures of the focal construct and exemplars of different constructs predicted to be variously related to it on theoretical grounds. Theoretically relevant empirical consistencies in the first set, indicating a correspondence between measures of the same construct, have been called trait validity, and those in the second set, indicating a lawful relatedness between measures of different constructs, have been called nomological validity (D. T. Campbell, 1960; Cronbach & Meehl, 1955). In order to discount competing hypotheses involving alternative constructs or method contaminants, the two sets are often analyzed simultaneously in a multitrait-multimethod strategy that employs multiple methods for assessing each of two or more different constructs (D. T. Campbell & Fiske, 1959). Such an approach highlights the need for both convergent and discriminant evidence in both trait and nomological validity.
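To make the multitrait-multimethod logic concrete, here is a minimal sketch (a hypothetical illustration, not from the original article; the matrix values and labels are invented) of the two Campbell-Fiske checks: convergent correlations between different methods of measuring the same trait should be substantial, and each should exceed the correlations between different traits measured by different methods.

```python
import numpy as np

# Hypothetical multitrait-multimethod (MTMM) correlation matrix for two
# traits (T1, T2) each measured by two methods (M1, M2). Values invented.
labels = ["T1M1", "T2M1", "T1M2", "T2M2"]
R = np.array([
    [1.00, 0.30, 0.65, 0.20],
    [0.30, 1.00, 0.25, 0.60],
    [0.65, 0.25, 1.00, 0.35],
    [0.20, 0.60, 0.35, 1.00],
])

def trait_method(label):
    # "T1M1" -> ("T1", "M1")
    return label[:2], label[2:]

n = len(labels)
# Heterotrait-heteromethod correlations: different trait AND different method.
hthm = [R[a, b] for a in range(n) for b in range(a + 1, n)
        if trait_method(labels[a])[0] != trait_method(labels[b])[0]
        and trait_method(labels[a])[1] != trait_method(labels[b])[1]]

for i in range(n):
    for j in range(i + 1, n):
        (ti, mi), (tj, mj) = trait_method(labels[i]), trait_method(labels[j])
        if ti == tj and mi != mj:  # convergent: same trait, different methods
            conv = R[i, j]
            ok = all(conv > r for r in hthm)  # discriminant check
            print(f"{labels[i]}-{labels[j]}: r = {conv:.2f}; "
                  f"exceeds all heterotrait-heteromethod rs: {ok}")
```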

Trait validity deals with the fit between measurement operations and conceptual definitions of the construct, and nomological validity deals with the fit between obtained data patterns and theoretical predictions about those patterns (Cook & Campbell, 1979). The former is concerned with the meaning of the measure as a reflection of the construct, and the latter is concerned with the meaning of the construct as reflected in the measure's relational properties. Both aspects are intrinsic to construct validity, and the interplay between them leads to iterative refinements of measures, constructs, and theories over time. Thus, the paradox that measures are needed to define constructs and constructs are needed to build measures is resolved, like all existential dilemmas in science, by a process of successive approximation (Kaplan, 1964; Lenzen, 1955).

It will be recalled that the Standards for Educational and Psychological Tests (APA et al., 1974) condensed the variety of validity questions into two types, those dealing with the intrinsic nature or meaning of the measure and those dealing with its use as an indicator or predictor of other variables. In the present context, this distinction should be seen as a whole-part relationship: Evidence bearing on the meaning of the measure embraces all of construct validity, whereas evidence for certain predictive relationships contributes to that part called nomological validity. Some predictive relationships, namely, those between the measure and specific applied criterion behaviors, are traditionally singled out for special attention under the rubric of "criterion-related validity," and it therefore follows that this too is subsumed conceptually as part of construct validity.

This does not mean, however, that construct validity in general can replace criterion-related validity in particular in applied settings. The criterion correlates of a measure constitute strands in the construct's nomological network, but their empirical basis is still to be checked. Thus, "criterion-related validity is intended to show the validity, not of the test, but of that hypothesis of relationship to the criterion" (Guion, 1978a, p. 207). The analysis of criterion variables within the measure's construct network, especially if conducted in tandem with the construct validation of the criterion measures themselves, provides a powerful rational basis for criterion prediction (Guion, 1976b).

    CRITERION RELATEDNESS

    So-called "criterion-related validity" is usually con-sidered to comprise two types, concurrent validityand predictive validity, which differ respectively interms of whether the test and criterion data werecollected at the same time or at different times. A

    1016 N O V E M B E R 1980 A M E R I C A N P S Y C H O L O G IS T

  • 8/12/2019 Test Validity and the Ethics of Assessment

    6/16

    more fundamental distinction would recognize thatconcurrent correlations with criteria are usuallyobtained either to appraise the diagnostic effective-ness of the test in detecting current behavioral pat-terns or to assess the suitability of substituting thetest for a longer, more cumbersome, or more expen-sive criterion measure. It would also be' more help-

    ful in both the predictive and the concurrent case tocharacterize the function of the relationship in termsof utility rather than validi ty. Criterion relatednessdiffers from the more general nomological relatednessin being more narrowly stated and pointed towardspecific sets of data and specific applied settings.In criterion relatedness we are concerned no t justwith verifying the existence of relationships andgauging their strength, bu t with identifying usefulrelationships under the applied conditions. Utilityis the more appropriate concept in such instances

    because it implies interpretation of the correlationsin the decision context in terms of indices of predic^tive efficiency relative to base rates, mean gains incriterion performance due to selection, the dollarvalue of such gains relative to costs, and so forth(Brogden, 1946; Cronbach & Gleser, 1965; Curtis& Alf, 1969; Darlington & Stauffer, 1966; Hunter,Schmidt, & Rauschenberger, 1977).
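For instance, the Brogden-Cronbach-Gleser index cited above expresses the expected gain in criterion performance from selection as the validity coefficient times the criterion standard deviation times the mean standard predictor score of those selected, net of testing costs. A minimal sketch with invented figures (every parameter value below is an assumption for illustration, not data from the article):

```python
from statistics import NormalDist

# Brogden-Cronbach-Gleser utility estimate (all values hypothetical).
r_xy = 0.40       # assumed predictor-criterion validity coefficient
sd_y = 10_000.0   # assumed SD of criterion performance, in dollars/year
sel_ratio = 0.20  # proportion of applicants selected (top-down on the test)
cost = 50.0       # assumed testing cost per applicant, in dollars
n_hired = 100

# Mean standardized predictor score of selectees under top-down selection
# from a normal applicant pool: z_bar = pdf(cutoff) / selection ratio.
nd = NormalDist()
cutoff = nd.inv_cdf(1 - sel_ratio)
z_bar = nd.pdf(cutoff) / sel_ratio

# Expected annual gain per hire, less the testing cost spread over hires.
gain_per_hire = r_xy * sd_y * z_bar - cost / sel_ratio
print(f"Estimated utility gain: ${n_hired * gain_per_hire:,.0f} per year")
```

Under these assumptions the program nets roughly $5,350 per hire per year; the point of such indices, as the text notes, is to interpret a correlation in its decision context rather than as a freestanding "validity."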

In developing rational hypotheses of criterion relatedness, we not only need a conception of the construct meaning of the predictor measures, as we have seen, but we also need to conceptualize criterion constructs, basing judgments on data from job or task analyses and the construct validation of provisional criterion measures (Guion, 1976b). In the last analysis, the ultimate criterion is determined on rational grounds (Thorndike, 1949); in any event, it "can best be described as a psychological construct . . . [and] the process of determining the relevance of the immediate to the ultimate criterion becomes one of construct validation" (Kavanagh, MacKinney, & Wolins, 1971, p. 35). It is particularly crucial to identify criterion constructs whenever potentially contaminated criterion measures, such as ratings or especially multiple ratings from different sources, are employed (James, 1973). In the face of impure or contaminated criterion measures, the question of the intrinsic nature of the relation between predictor and criterion comes to the fore (Gulliksen, 1950), and construct validity is needed to broach that issue. "In other words, an orientation toward construct validation in criterion research is the best way of guarding against a hopelessly incomplete job of criterion development" (Smith, 1976, p. 768). Thus, if construct validity is not available on the predictor side, it better be on the criterion side, and both "must have adequate construct validity for their respective sides if the theory is to be tested adequately" (Guion, 1976b, p. 802).

Implicit in this rational approach to predictive hypotheses there is thus also a rational basis for judging the relevance of the test to the criterion domain. This provides a means of coping with the quasi-judicial term job-relatedness, even in the case where criterion-related empirical verification is missing. "Where it is clearly not feasible to do the study, the defense of the predictor can rest on a combination of its construct validity and the rational justification for the inclusion of the construct in the predictive hypothesis" (Guion, 1974, p. 291). The case becomes stronger if the predicted relationship has been verified empirically in other settings. Guion (1974), for one, has maintained that this stance offers better evidence of job-relatedness than does a tenuous criterion-related study done under pressure with small samples, low variances, or questionable criterion measures. On the other hand, the simple demonstration of an empirical relationship between a measure and a criterion in the absence of a cogent rationale is a dubious basis for justifying relevance or use (Messick, 1964, 1975).

    CONTENT RELEVANCE AND CONTENT COVERAGE

The other major basis for judging the relevance of the test to the behavioral domain about which inferences are to be drawn is so-called "content validity." Content validity in its classic form (Cronbach, 1971) is limited to the strict behavioral language of task description, for otherwise constructs are apt to be invoked and we have another case of construct validity. There are two main facets to content validity: One is content relevance, which refers to the specification of the behavioral domain in question and the attendant specification of the task or test domain. Specifying domain boundaries is essentially a requirement of operational definition and, in the absence of appeal to a construct theory of task performance, is limited to a statement of admissible task characteristics and behavioral requirements. The other facet is content coverage, which refers to the specification of procedures for sampling the domain in some representative fashion. The concern is thus with content sampling of a specified content domain, which is a prescription for test construction, not validity. Consensual judgments about the relevance of the test domain as defined to a particular behavioral domain of interest (as, for example, when choosing a standardized achievement test to evaluate a new curriculum), along with judgments of the adequacy of content coverage in the test, are the kinds of evidence usually offered for content validity. But note that this is not evidence in support of inferences from test scores, although it might influence the nature of those inferences.

This attempt to define content validity as separate from construct validity produces a dysfunctional strain to avoid constructs, as if shunning them in test development somehow lessens the import of response processes in test performance. The important sampling consideration in test construction is not representativeness of the surface content of tasks but representativeness of the processes employed by subjects in arriving at a response (Lennon, 1956). This puts content validity squarely in the realm of construct validity (Messick, 1975). Rather than strain after nebulous distinctions, we should inquire how content considerations contribute to construct validity and how to strengthen that contribution (Tenopyr, 1977).

Loevinger (1957) incorporated content as an important feature of construct validity by considering content representativeness and response consistency jointly. What she called "substantive validity" is "the extent to which the content of the items included in (and excluded from?) the test can be accounted for in terms of the trait believed to be measured and the context of measurement" (Loevinger, 1957, p. 661). This notion was introduced "because of the conviction that considerations of content alone are not sufficient to establish validity even when the test content resembles the trait, and considerations of content cannot be excluded when the test content least resembles the trait" (Loevinger, 1957, p. 657). The elimination of certain items from the test because of poor empirical response properties may sometimes distort the test's representativeness in covering the construct domain as originally conceived, but it is justified if the resulting test thereby becomes a better exemplar of the construct as empirically grounded (Loevinger, 1957; Messick, 1975).

Content validity has little to say about the scoring of content samples, and as a result scoring procedures are typically ad hoc (Guion, 1978b). Scoring models in the construct framework, in contrast, logically parallel the structural relations inherent in behavioral manifestations of the construct being measured. Loevinger (1957) drew explicit attention to the need for rational scoring models by coining the term structural validity, which includes "both the fidelity of the structural model to the structural characteristics of non-test manifestations of the trait and the degree of inter-item structure" (p. 661).

Even in instances where the test is an undisputed representative sample of the behavioral domain of interest and the concern is with the demonstration of task accomplishment per se regardless of the processes underlying performance (cf. Ebel, 1961, 1977), empirical evidence of response consistency and not just representative content sampling is important. In such cases, inferences are usually drawn from the sample performance to domain performance, and these inferences should be buttressed by indices of the internal-consistency type to gauge the extent of generalizability to other items like those in the sample, to other tests developed in parallel fashion, and so forth (J. P. Campbell, 1976; Cronbach, Gleser, Nanda, & Rajaratnam, 1972). We should also consider the possibility that the test might contain sources of variance irrelevant to domain performance, which is a particularly important consideration in interpreting low scores. Content validity at best is a unidirectional concept: Although it may undergird certain straightforward interpretations for high scorers (such as "they possess suitable skills to perform the tasks correctly, because they did so repeatedly"), it provides no basis for interpreting low scores in terms of incompetence or lack of skill. To do that requires the discounting of plausible counterhypotheses about such irrelevancies in the testing as anxiety, defensiveness, inattention, or low motivation (Guion, 1978a; Messick, 1975, 1979). And the empirical discounting of plausible rival hypotheses is the hallmark of construct validation.
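One standard index of the internal-consistency type invoked here is coefficient alpha. The sketch below (hypothetical data, purely illustrative) estimates how far scores on a content sample might generalize to other items like those in the sample.

```python
import numpy as np

def coefficient_alpha(scores):
    """Internal-consistency estimate for an examinees-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 examinees x 4 items scored right/wrong (1/0).
X = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0],
     [1, 1, 1, 1],
     [0, 1, 0, 0],
     [1, 1, 1, 1]]
print(f"alpha = {coefficient_alpha(X):.2f}")  # about 0.83 for these data
```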

    GENERALITY OF CONSTRUCT MEANING

The issue of generalizability just broached for content sampling permeates all of validity. Several aspects of generalizability of special concern have been given distinctive labels, but unfortunately these labels once again invoke the sobriquet validity. The extent to which a measure's empirical relations and construct interpretation generalize to other population groups is called "population validity" (Shulman, 1970); to other situations or settings, "ecological validity" (Bracht & Glass, 1968; Snow, 1974); to other times, "temporal validity" (Messick & Barrows, 1972); and to other tasks representative of the operations called for in the particular domain of interest, "task validity" (Shulman, 1970).

The label validity is especially unsuitable for these important facets of generalizability, for such usage might be taken to imply that the more generalizable a measure is, the more valid. This is not always the case, however, as in the measurement of such constructs as mood, which fluctuates over time, or concrete operations, which typify a particular developmental stage, or administrative role, which operates in special organizational settings, or delusions, which are limited to specific psychotic groups. Rather, the appropriate degree of generalizability for a measure depends upon the nature of the construct assessed and the scope of its theoretical applicability. A closely related issue of "referent generality" (Coan, 1964; Snow, 1974), called "referent validity" by Cook and Campbell (1979), concerns the extent to which research evidence supports a measure's range of reference and the multiplicity of its referent terms. This concept points to the need to tailor the level of construct interpretation to the limits of the evidence and to avoid both oversimplification and overgeneralization in the connotation of construct labels. Nonetheless, constructs refer not only to available evidence but to potential evidence, so that the choice of construct labels is influenced by theory as well as by evidence and, as we shall see, by ideologies about the nature of humanity and society which add value implications that go beyond evidential validity per se.

EVIDENTIAL BASIS OF TEST INTERPRETATION AND USE

To recapitulate thus far, construct validity is the evidential basis of test interpretation. It entails both convergent and discriminant evidence documenting theoretically relevant empirical relationships (a) between the test and different methods for measuring the same construct, as well as (b) between measures of the construct and exemplars of different constructs predicted to be related nomologically. For test use, the relevance of the construct for the applied purpose is determined, in addition, by developing rational hypotheses relating the construct to performance in the applied domain. Some of the construct's nomological relations thus become criteria when made specific to the applied setting. The empirical verification of this rational hypothesis contributes to the construct validity of both the measure and the criterion, and the utility of the applied relation supports the practicality of the proposed use. Thus, the evidential basis of test use is also construct validity, but elaborated to determine the relevance of the construct to the applied purpose and the utility of the measure in the applied setting.

In all of this discussion I have tried to avoid the language of necessary and sufficient requirements, because such language seemed simplistic for a complex and holistic concept like test validity. On the one hand, construct validation is a continuous, never-ending process developing an ever-expanding mosaic of research evidence. At any point new evidence may dictate a change in construct, theory, or measurement, so that in the long run it is difficult to claim sufficiency for any piece. On the other hand, given that the mosaic of evidence is reasonably dense, it is difficult to claim that any piece is necessary, even, as we have seen, empirical evidence for criterion-related predictive relationships in specific applied settings, provided, of course, that other evidence consistently supports a compelling rationale for the application.

Since the evidence in these evidential bases derives from empirical studies evaluating hypotheses about relationships or about the structure of sets of relationships, we must also be concerned about the quality of those studies themselves and about the extent to which the research conclusions are tenable or are threatened by plausible counterhypotheses to explain the results (Guion, 1980). Four classes of threats to the tenability and generalizability of research conclusions are discussed by Cook and Campbell (1979), with primary reference to quasi-experimental and experimental research but also relevant to nonexperimental correlational studies. These four classes deal, respectively, with the questions of (a) whether a relationship exists between two variables, an issue called "statistical conclusion validity"; (b) whether the relationship is plausibly causal from one variable to the other, called "internal validity"; (c) what interpretive constructs underlie the relationship, called "construct validity"; and (d) the extent to which the interpreted relationship generalizes to and across other population groups, settings, and times, called "external validity."

I will not discuss here the first question raised by Cook and Campbell except simply to affirm that the tenability of statistical conclusions about the existence and strength of relationships is of course basic to the whole enterprise. I have already discussed construct validity and external generalizability, although it is important to note in connection with the latter that I was referring to the generalizability of a measure's empirical relations and construct interpretation to other populations, settings, and times, whereas Cook and Campbell (1979) were referring to the generalizability of research conclusions that two variables (and their attendant constructs) are causally related one to the other. My emphasis was on the generality of a measure's construct meaning based on any relevant evidence (Messick, 1975; Messick & Barrows, 1972), commonality of factor structures, for example, while theirs was on the generality of a causal relationship from one measure or construct to another based on experimental or quasi-experimental treatments.

Verification of the hypothesis of causal relationship is what Cook and Campbell term internal validity, and such evidence contributes importantly to the nomological basis of a measure's construct meaning for those construct theories entailing causal claims. Internal validity thus provides the evidential basis for causal strands in a nomological network. The tenability of cause-effect implications is important for the construct validity of a variety of educational and psychological measures, such as those interpreted in terms of intelligence, achievement, or motivation. Indeed, the causal overtones of constructs are one source of the value implications of test interpretation, a topic I will turn to shortly.

    Validity as Evaluation of Implications

Since validity is an evaluation of evidence, a judgment rather than an entity, and since some evidential basis should be provided for the interpretation and use of any test, validity has always been an ethical imperative in testing. As Burton (1978) put it, "Validity (as the word implies) has been primarily an ethical requirement of tests, a prerequisite guarantee, rather than an active component of the use and interpretation of tests" (p. 264). She went on to argue that with criterion-referenced testing, "Glaser, in essence, was taking traditional validity out of the realm of ethics into the active arena of test use" (p. 264). Glaser may have taken traditional validity into the active arena of test use, as it were, but it never left the realm of ethics because test use itself is an ethical issue.

If test validity is the overall degree of justification for test interpretation and use, and if human and social values encroach on both interpretation and use, as they do, then test validity should take account of those value implications in the overall judgment. The concern here, as in most ethical issues, is with evaluating the present and future consequences of interpretation and use (Churchman, 1961). If, as an intrinsic part of the overall validation process, we weigh the actual and potential consequences of our testing practices in light of considerations of what future society might need or desire, then test validity comes to be based on ethical as well as evidential grounds.

    CONSEQUENTIAL BASIS OF TEST USE

Value issues have long been recognized in connection with test use. We have seen that one of the key questions to be posed whenever a test is suggested for a specific purpose is "Should it be used for that purpose?" Answers to that question require an evaluation of the potential consequences of the testing in terms of social values, but that is no trivial enterprise. There is no guarantee that at any point in time we will identify all of the critical possibilities, especially those unintended side effects that are distal to the manifest testing aims.

There are few prescriptions for how to proceed here, but one recommendation is to contrast the potential social consequences of the proposed testing with the potential social consequences of alternative procedures and even of procedures antagonistic to testing. This pitting of the proposed test use against alternative proposals is an instance of what Churchman (1971) has called Kantian inquiry; the pitting against antithetical counterproposals is called Hegelian inquiry. The intent of these strategies is to draw attention to vulnerabilities in the proposed use and to expose its tacit value assumptions to open debate. In the context of testing, a particularly powerful and general form of counterproposal is to weigh the potential social consequences of the proposed test use against the potential social consequences of not testing at all (Ebel, 1964).

The role of values in test use has been intensively examined in certain selection applications, namely, in those where different population groups display significantly different means on predictors, or criteria, or both. Since fair test use implies that selection decisions will be equally appropriate regardless of an individual's group membership, and since different selection systems yield different proportions of selected individuals in the different groups, the question of test fairness arises in earnest. In good Kantian fashion, several models of fair selection were formulated and contrasted with each other (Cleary, 1968; Cole, 1973; Darlington, 1971; Einhorn & Bass, 1971; Linn, 1973, 1976; Thorndike, 1971); some, having been found incompatible or even mutually contradictory, offered good Hegelian contrasts (Petersen & Novick, 1976). It soon became apparent in comparing these models that each accorded a different importance or value to the various subsets of selected versus rejected and successful versus unsuccessful individuals in the different population groups (Dunnette & Borman, 1979; Linn, 1973). Moreover, the values accorded are a function not only of desired criterion performance but of desired individual and group attributes (Novick & Ellis, 1977). Thus, each model not only constitutes a different definition of fairness but also implies a particular ethical position (Hunter & Schmidt, 1976). Each view is ostensibly fair under certain conditions, so that arguments over the fairness of test use turn out in many instances to be disagreements as to what the conditions are or ought to be.
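To illustrate how one such model operationalizes fairness, the Cleary (1968) regression definition counts a test as fair when a common regression line neither over- nor underpredicts the criterion for any group. A minimal sketch of that check (simulated data; the variable names and effect sizes are invented for illustration):

```python
import numpy as np

# Simulate predictor x, criterion y, and a binary group code; here the
# same regression holds in both groups, so the common line is "fair"
# in Cleary's sense. All values are invented.
rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, n)
x = rng.normal(0.0, 1.0, n) + 0.3 * group   # groups differ in mean score
y = 0.5 * x + rng.normal(0.0, 1.0, n)       # one regression for everyone

# Fit the common line y = a + b*x and compare mean residuals by group:
# a markedly nonzero group mean residual signals systematic over- or
# underprediction, i.e., unfairness under the Cleary model.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
for g in (0, 1):
    print(f"group {g}: mean residual = {residuals[group == g].mean():+.3f}")
```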

With the recognition that fundamental value differences were at issue, several utility models were developed that required specific value positions to be taken (Cronbach, 1976; Gross & Su, 1975; Petersen & Novick, 1976; Sawyer, Cole, & Cole, 1976), thereby incorporating social values explicitly with measurement technology. But making values explicit does not determine choices among them, and at this point it appears difficult if not impossible to be fair to individuals in terms of equity, to groups in terms of parity or adverse impact, to institutions in terms of efficiency, and to society in terms of benefits and risks all at the same time. A workable balancing of the needs of all of the parties is likely to require successive approximations over time, with iterative modifications of utility matrices based on experience with the consequences of decision processes to date (Darlington, 1976).

    CONSEQUENTIAL BASIS OF TEST INTERPRETATION

In contrast to test use, the value issues in test interpretation have not been as vigorously addressed. That social values impinge upon theoretical interpretation may not be as obvious, but it is no less serious. "Data come to us only in answer to questions. . . . How we put the question reflects our values on the one hand, and on the other hand helps determine the answer we get" (Kaplan, 1964, p. 385). Facts and values thus go hand in hand (Churchman, 1961), and "we cannot avoid ethics breaking into inductive logic" (Braithwaite, 1956, p. 174). As Kaplan (1964) put it, "Data are the product of a process of interpretation, and though there is some sense in which the materials for this process are 'given' it is only the product which has a scientific status and function. In a word, data have meaning, and this word 'meaning,' like its cognates 'significance' and 'import,' includes a reference to values" (p. 385). Thus, just as data and theoretical interpretation were seen to be intimately intertwined in the concept of evidence, so data and values are intertwined in the concept of interpretation, and fact, value, and meaning become three faces of the substance of science.

Whenever an event or relationship is conceptualized, it is judged, even if only tacitly, as belonging to some broader category to which value already attaches. If a crime, for example, is seen as a violation of the social order, the modal societal response is to seek correction, which is a derivative of the value context of this way of seeing. If crime is seen as a violation of the moral order, expiation will be sought. And if seen as a sign of distress, especially if the distress can be assimilated to a narrower category like mental illness, then a claim of compassion and help attaches to the valuation. In Vickers's (1970) terms, the conceptualization of an event or relationship within a broader category is a process of "matching," which is an informational concept involving the comparison of forms. The assimilation of the value attached to the broader schema is a process of "weighing," which is a dynamic concept involving the comparison of forces. For Vickers (1970), "the elaboration of the reality system and the value system proceed together. Facts are relevant only to some standard of value; values are applicable only to some configuration of fact" (p. 134). He uses the term appreciation to refer to those conjoint judgments of fact and value (Vickers, 1965).

In the construct interpretation of tests, such appreciative processes are central, though typically latent. Constructs are broader conceptual categories than the test behaviors, and they carry with them into the testing arena value connotations stemming from three major sources: First are the evaluative overtones of the construct labels themselves; next are the value connotations of the broader theories or nomological networks in which constructs are embedded; and last are the implications of the still broader ideologies about the nature of humanity, society, and science that color how we proceed. Ideology is a complex configuration of values, affects, and beliefs that provides, among other things, an existential perspective for viewing the world, a "stage-setting," as it were, for interpreting the human drama in ethical, scientific, or whatever terms (Edel, 1970). The ideological overlay subtly influences test interpretation, especially for very general constructs like intelligence, in ways that go beyond empirically verified connections in the nomological network (Crawford, 1979). The hope here in drawing attention explicitly to the value implications of test interpretation is that some of these ideological and valuative links might be exposed to inquiry and subjected either to empirical grounding or to policy debate.

Exposing the value assumptions of a construct theory and its more subtle links to ideology (possibly to multiple, cross-cutting ideologies) is an awesome challenge. One approach is to follow Churchman's (1971) lead and attempt to contrast each construct theory with an alternative perspective for interpreting the test scores, as in the Kantian mode of inquiry; better still for probing the ethical implications of a theory is to contrast it with an antithetical, though plausible, Hegelian counterperspective. This raises to the grander level of theory comparison the strategy of focusing on plausible rival hypotheses and counterhypotheses in evaluating the basis for relationships within a theory. Systematic competition between countertheories in attempting to explain the conjoint data derivable from each also tends to offset the concern that scientific observations are theory-laden or theory-dependent and that the presumption of a single theory might thereby preclude uncovering the most challenging test data for that theory (Feyerabend, 1975; Mitroff, 1973). Moreover, as Churchman (1961) stresses, although consensus is the decision rule of traditional science, conflict is the decision rule of ethics. Since the one thing we universally disagree about is "what ought to be," any scientific approach to ethics should allow for conflict and debate, as should any attempt to assess the ethical implications of science. "Thus, in order to derive the 'ethical' implications of any technical or scientific model, we explicitly incorporate a dialectical mode of examining (or testing) models" (Mitroff & Sagasti, 1973, p. 133). In a sense we are asking, as did Churchman's mentor E. A. Singer (1959), what the consequences would be if a given scientific judgment had the status of an ethical judgment.

It should be noted that value issues intrude in the testing process at all levels, not just at the grand level of broad construct interpretation. For example, values influence the relative emphasis on different types of content in test construction (Nunnally, 1967) and procedures for scoring the quality of performance on content samples (Guion, 1978b), but the concern here is limited to the value implications of test interpretation. Consider first the evaluative overtones of the construct label itself. I have already suggested that a measure interpreted in terms of "flexibility versus rigidity" would be utilized differently if it were instead labeled "confusion versus control." Similarly, a measure called "inhibited versus impulsive" would have different consequences if it were labeled "self-controlled versus uninhibited." So would a variable like "stress" if it were relabeled "challenge." The point is not that we would make a concept like stress into a good thing by renaming it but that by not presuming it to be a bad thing we would investigate broader consequences, facilitative as well as debilitative (McGrath, 1976). In choosing a construct label, we should strive for consistency between the trait and evaluative implications of the name, attempting to capture as closely as possible the essence of the construct's theoretical import, especially its empirically grounded import, in terms reflective of its salient value connotations. This may prove difficult, however, because many traits are open to conflicting value interpretations and thus call for systematic examination of counterhypotheses about value outcomes, if not to reach convergence, at least to clarify the basis of the conflict. Some traits may also imply different value outcomes under different circumstances, which suggests the possible utility of differentiated trait labels to embody these value distinctions, as in the case of "debilitating anxiety" and "facilitating anxiety." Rival theories of the construct might also highlight different value implications, of course, and lead to conflict between the theories not only in trait interpretation but also in value interpretation.

Apart from its normative and evaluative overtones, perhaps the most important feature of a construct in regard to value connotations is its breadth, or the range of its theoretical and empirical referents. This is the issue that Snow (1974) called referent generality.



                        Test Interpretation     Test Use

    Evidential Basis    Construct Validity      Construct Validity + Relevance/Utility

    Consequential Basis Value Implications      Social Consequences

Figure 1. Facets of test validity.

The broader the construct, the more difficult it is to embrace all of its critical features in a single measure and the more we are open to what Coombs (1954) has called "operationism in reverse," that is, "endowing the measures with all the meanings associated with the concept" (p. 476). In choosing the appropriate breadth or level of generality for a construct and its label, one is buffeted by opposing counterpressures toward oversimplification on the one hand and overgeneralization on the other. At one extreme is the apparent safety in using merely descriptive labels tightly tied to behavioral exemplars in the test (such as Adding Two-Digit Numbers). Choices on this side sacrifice interpretive power and range of application if the test might also be defensibly viewed more broadly (e.g., Number Facility). At the other extreme is the apparent richness of high-level inferential labels (such as Intelligence, Creativity, or Introversion). Choices on this side are subject to the dangers of mischievous dispositional connotations and the backlash of conceptual imperialism.

At first glance, one might think that the appropriate level of construct reference should be tied not to test behavior but to the level of generalization supported by the convergent and discriminant research evidence in hand. But constructs refer to potential relationships as well as actual relationships, so their level of generality should in principle be tied to their range of reference in the nomological theory, with the important proviso that this range be restricted or extended when research evidence so indicates. The scope of the original theoretical formulation is thus modified by the research evidence available, but it is not limited to the research evidence available. As Cook and Campbell (1979) put it, "The data edit the kinds of general statements we can make" (p. 88). And debating the value implications of test interpretation may also edit the kinds of general statements we should make.

Validity as Evaluation of Evidence and Consequence

Test validity is thus an overall evaluative judgment of the adequacy and appropriateness of inferences drawn from test scores. This evaluation rests on four bases: (1) an inductive summary of convergent and discriminant research evidence that the test scores are interpretable in terms of a particular construct meaning, (2) an appraisal of the value implications of that interpretation, (3) a rationale and evidence for the relevance of the construct and the utility of the scores in particular applications, and (4) an appraisal of the potential social consequences of the proposed use and of the actual consequences when used.

Putting these bases together, we can see that test validity has two interconnected facets linking the source of justification (either evidential or consequential) to the function or outcome of the testing (either interpretation or use). This crossing of basis and function is portrayed in Figure 1.
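For readers who find a concrete rendering helpful, the fourfold crossing in Figure 1 can be expressed as a simple lookup table. The sketch below is an editorial illustration rather than part of the original article; it assumes Python, and the key and value names are invented for the example.

    # Figure 1's crossing of justification basis and testing function,
    # rendered as a dictionary: (basis, function) -> validity facet.
    # Purely illustrative; the strings used here are assumptions.
    VALIDITY_FACETS = {
        ("evidential", "interpretation"): "construct validity",
        ("evidential", "use"): "construct validity + relevance/utility",
        ("consequential", "interpretation"): "value implications",
        ("consequential", "use"): "social consequences",
    }

    # Example: justifying a proposed test use on evidential grounds
    # requires construct validity plus relevance and utility.
    print(VALIDITY_FACETS[("evidential", "use")])

Read this way, the table makes plain that a complete validation argument must fill all four cells, not just the evidential column.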

The interactions among these aspects are more dynamic in practice, however, than is implied by a fourfold classification. In an attempt to represent the interdependence and feedback among the components, a flow diagram is presented in Figure 2. The double arrows linking construct validity and test interpretation in the diagram are meant to imply a continuous process that starts sometimes with a construct in search of proper measurement and sometimes with an existing test in search of proper meaning.

Figure 2. Feedback model for test validity. (Flow diagram; the recoverable box labels include "Implications for Test Interpretation" and "Evaluate Consequences.")

The model also includes a pragmatic component for the evaluation of actual consequences of test practice, pragmatic in the sense that this component is oriented, like pragmatic philosophy, toward outcomes rather than origins and seeks justification for use in the practical consequences of use. The primary concern of this component is the balancing of the instrumental value of the test in accomplishing its intended purpose with the instrumental value of any negative side effects and positive by-products of the testing. Most test makers acknowledge responsibility for providing general evidence of the instrumental value of the test. The terminal value of the test in terms of the social ends to be served goes beyond the test maker to include as well the decision maker, policymaker, and test user, who are responsible for specific evidence of instrumental value in their particular setting and for the specific interpretations and uses made of the test scores. In the final analysis, "responsibility for valid use of a test rests on the person who interprets it" (Cronbach, 1969, p. 51), and that interpretation entails responsibility for its value consequences.

Intervening in the model between test use and the evaluation of consequences is a decision matrix, to emphasize the point that tests are rarely used in isolation but rather in combination with other information in broader decision systems. The decision process is profoundly influenced by social values and deserves, in its own right, massive research attention beyond the good beginning provided by utility models. As Guion (1976a) phrased it, "The formulation of hypotheses is or should be applied science, the validation of hypotheses is applied methodology, but the act of making . . . [a] decision is . . . still an art" (p. 646). The feedback model as portrayed is a closed system, to emphasize the point that even when consequences are evaluated favorably they should be continuously or periodically monitored to permit the detection of changing circumstances and of delayed side effects.
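The utility models mentioned above as a good beginning can be made concrete with a small worked example. The sketch below is an editorial illustration, not part of the original article: it uses the linear selection-utility formula associated with Brogden (1946) and Cronbach and Gleser (1965), both listed in the references, and every parameter value, together with the added side-effect term, is a hypothetical assumption of the sketch.

    # Selection utility in the Brogden / Cronbach-Gleser tradition:
    #   gain  = n_selected * r_xy * sd_y * mean_z_selected
    #   costs = n_tested * cost_per_test + side_effect_cost
    # The side-effect term is an assumption added here to echo the
    # article's point that negative consequences belong in the balance.
    def selection_utility(n_selected, r_xy, sd_y, mean_z_selected,
                          n_tested, cost_per_test, side_effect_cost=0.0):
        gain = n_selected * r_xy * sd_y * mean_z_selected
        costs = n_tested * cost_per_test + side_effect_cost
        return gain - costs

    # Hypothetical case: select 50 of 200 applicants with a test of
    # validity .35; selectees average 1.2 SD above the applicant mean
    # on the test, and criterion performance is valued in dollars.
    net = selection_utility(n_selected=50, r_xy=0.35, sd_y=4000.0,
                            mean_z_selected=1.2, n_tested=200,
                            cost_per_test=25.0, side_effect_cost=10000.0)
    print(round(net))  # 69000

Even so small a calculation makes the article's point visible: the favorable net figure depends on value-laden estimates, here sd_y and side_effect_cost, that the decision maker rather than the test maker must supply, and that should be revisited as consequences are monitored.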

The model is closed, and this article is closed, with the provocative words of Sir Geoffrey Vickers (1970): "If indeed we have reached the end of ideology (in Daniel Bell's phrase) it is not because we can do without ideologies but because we should now know enough about them to show a proper respect for our neighbour's and a proper sense of responsibility for our own" (p. 109).

    REFERENCES

American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. Standards for educational and psychological tests. Washington, D.C.: American Psychological Association, 1974.
Bracht, G. H., & Glass, G. V. The external validity of experiments. American Educational Research Journal, 1968, 5, 437-474.
Braithwaite, R. B. Scientific explanation. Cambridge, England: Cambridge University Press, 1956.
Brogden, H. E. On the interpretation of the correlation coefficient as a measure of predictive efficiency. Journal of Educational Psychology, 1946, 37, 65-76.
Burton, N. W. Societal standards. Journal of Educational Measurement, 1978, 15, 263-271.
Campbell, D. T. Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 1960, 15, 546-553.
Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Campbell, J. P. Psychometric theory. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Churchman, C. W. Prediction and optimal decision: Philosophical issues of a science of values. Englewood Cliffs, N.J.: Prentice-Hall, 1961.
Churchman, C. W. The design of inquiring systems: Basic concepts of systems and organization. New York: Basic Books, 1971.
Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 1968, 5, 115-124.
Coan, R. W. Facts, factors, and artifacts: The quest for psychological meaning. Psychological Review, 1964, 71, 123-140.
Cole, N. S. Bias in selection. Journal of Educational Measurement, 1973, 10, 237-255.
Cook, T. D., & Campbell, D. T. Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally, 1979.
Coombs, C. H. Theory and methods of social measurement. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences. New York: Holt, Rinehart & Winston, 1954.
Crawford, C. George Washington, Abraham Lincoln, and Arthur Jensen: Are they compatible? American Psychologist, 1979, 34, 664-672.
Cronbach, L. J. Validation of educational measures. Proceedings of the 1969 Invitational Conference on Testing Problems: Toward a theory of achievement measurement. Princeton, N.J.: Educational Testing Service, 1969.
Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Cronbach, L. J. Equity in selection – Where psychometrics and political philosophy meet. Journal of Educational Measurement, 1976, 13, 31-41.
Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press, 1965.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley, 1972.
Cronbach, L. J., & Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955, 52, 281-302.
Curtis, E. W., & Alf, E. F. Validity, predictive efficiency, and practical significance of selection tests. Journal of Applied Psychology, 1969, 53, 327-337.
Darlington, R. B. Another look at "culture fairness." Journal of Educational Measurement, 1971, 8, 71-82.
Darlington, R. B. A defense of "rational" personnel selection, and two new methods. Journal of Educational Measurement, 1976, 13, 43-52.


Darlington, R. B., & Stauffer, G. F. Use and evaluation of discrete test information in decision making. Journal of Applied Psychology, 1966, 50, 125-129.
Division of Industrial and Organizational Psychology, American Psychological Association. Principles for the validation and use of personnel selection procedures. Hamilton, Ohio: Hamilton Print Co., 1975.
Dunnette, M. D., & Borman, W. C. Personnel selection and classification systems. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual Review of Psychology (Vol. 30). Palo Alto, Calif.: Annual Reviews, 1979.
Ebel, R. L. Must all tests be valid? American Psychologist, 1961, 16, 640-647.
Ebel, R. L. The social consequences of educational testing. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964.
Ebel, R. L. Comments on some problems of employment testing. Personnel Psychology, 1977, 30, 55-63.
Edel, A. Science and the structure of ethics. In O. Neurath, R. Carnap, & C. Morris (Eds.), Foundations of the unity of science: Toward an international encyclopedia of unified science (Vol. 2). Chicago: University of Chicago Press, 1970.
Einhorn, H. J., & Bass, A. R. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 1971, 75, 261-269.
Equal Employment Opportunity Commission, Civil Service Commission, U.S. Department of Labor, & U.S. Department of Justice. Uniform guidelines on employee selection procedures. Federal Register (August 25, 1978), 43(166), 38290-38315.
Feigl, H. Some major issues and developments in the philosophy of science of logical empiricism. In H. Feigl & M. Scriven (Eds.), Minnesota studies in the philosophy of science: The foundations of science and the concepts of psychology and psychoanalysis. Minneapolis: University of Minnesota Press, 1956.
Feyerabend, P. Against method: Outline of an anarchistic theory of knowledge. London, England: New Left Books, 1975.
Glaser, R., & Nitko, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education, 1971.
Goodenough, F. L. Mental testing: Its history, principles, and applications. New York: Holt, Rinehart & Winston, 1969.
Gross, A. L., & Su, W. Defining a "fair" or "unbiased" selection model: A question of utilities. Journal of Applied Psychology, 1975, 60, 345-351.
Guion, R. M. Open a new window: Validities and values in psychological measurement. American Psychologist, 1974, 29, 287-296.
Guion, R. M. The practice of industrial and organizational psychology. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976. (a)
Guion, R. M. Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976. (b)
Guion, R. M. Content validity – The source of my discontent. Applied Psychological Measurement, 1977, 1, 1-10. (a)
Guion, R. M. Content validity: Three years of talk – What's the action? Public Personnel Management, 1977, 6, 407-414. (b)
Guion, R. M. "Content validity" in moderation. Personnel Psychology, 1978, 31, 205-213. (a)
Guion, R. M. Scoring of content domain samples: The problem of fairness. Journal of Applied Psychology, 1978, 63, 499-506. (b)
Guion, R. M. On trinitarian doctrines of validity. Professional Psychology, 1980, 11, 385-398.
Gulliksen, H. Intrinsic validity. American Psychologist, 1950, 5, 511-517.
Hempel, C. G. Fundamentals of concept formation in empirical science. In O. Neurath, R. Carnap, & C. Morris (Eds.), Foundations of the unity of science: Toward an international encyclopedia of unified science (Vol. 2). Chicago: University of Chicago Press, 1970.
Hudson, L. Singularity of talent. In S. Messick (Ed.), Individuality in learning. San Francisco: Jossey-Bass, 1976.
Hunter, J. E., & Schmidt, F. L. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 1976, 83, 1053-1071.
Hunter, J. E., Schmidt, F. L., & Rauschenberger, J. M. Fairness of psychological tests: Implications of four definitions for selection utility and minority hiring. Journal of Applied Psychology, 1977, 62, 245-260.
James, L. R. Criterion models and construct validity for criteria. Psychological Bulletin, 1973, 80, 75-83.
Kaplan, A. The conduct of inquiry: Methodology for behavioral science. San Francisco: Chandler, 1964.
Kavanagh, M. J., MacKinney, A. C., & Wolins, L. Issues in managerial performance: Multitrait-multimethod analyses of ratings. Psychological Bulletin, 1971, 75, 34-49.
Lennon, R. T. Assumptions underlying the use of content validity. Educational and Psychological Measurement, 1956, 16, 294-304.
Lenzen, V. F. Procedures of empirical science. In O. Neurath, R. Carnap, & C. W. Morris (Eds.), International encyclopedia of unified science (Vol. 1, Pt. 1). Chicago: University of Chicago Press, 1955.
Linn, R. L. Fair test use in selection. Review of Educational Research, 1973, 43, 139-161.
Linn, R. L. In search of fair selection procedures. Journal of Educational Measurement, 1976, 13, 53-58.
Loevinger, J. Objective tests as instruments of psychological theory. Psychological Reports, 1957, 3, 635-694 (Monograph Supplement 9).
Margenau, H. The nature of physical reality. New York: McGraw-Hill, 1950. (Reprinted, Woodbridge, Conn.: Oxbow, 1977.)
McGrath, J. E. Stress and behavior in organizations. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Messick, S. Personality measurement and college performance. Proceedings of the 1963 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1964.
Messick, S. Personality measurement and the ethics of assessment. American Psychologist, 1965, 20, 136-142.
Messick, S. The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 1975, 30, 955-966.
Messick, S. Potential uses of noncognitive measurement in education. Journal of Educational Psychology, 1979, 71, 281-292.
Messick, S., & Barrows, T. S. Strategies for research and evaluation in early childhood education. In I. J. Gordon (Ed.), Early childhood education: The seventy-first yearbook of the National Society for the Study of Education. Chicago: University of Chicago Press, 1972.
Mitroff, I. I. 'Be it resolved that structured debate not consensus ought to form the epistemic cornerstone of OR/MS': A reaction to Ackoff's note on systems science. Interfaces, 1973, 3, 14-17.

Mitroff, I. I., & Sagasti, F. Epistemology as general systems theory: An approach to the design of complex decision-making experiments. Philosophy of the Social Sciences, 1973, 3, 117-134.

Novick, M. R., & Ellis, D. D. Equal opportunity in educational and employment selection. American Psychologist, 1977, 32, 306-320.
Nunnally, J. Psychometric theory. New York: McGraw-Hill, 1967.
Petersen, N. S., & Novick, M. R. An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 1976, 13, 3-29.
Popper, K. R. The logic of scientific discovery. New York: Basic Books, 1959.
Sawyer, R. L., Cole, N. S., & Cole, J. W. L. Utilities and the issue of fairness in a decision theoretic model for selection. Journal of Educational Measurement, 1976, 13, 59-76.
Shulman, L. S. Reconstruction of educational research. Review of Educational Research, 1970, 40, 371-396.
Singer, E. A. Experience and reflection (C. W. Churchman, Ed.). Philadelphia: University of Pennsylvania Press, 1959.
Smith, P. C. Behaviors, results, and organizational effectiveness: The problem of criteria. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Snow, R. E. Representative and quasi-representative designs for research on teaching. Review of Educational Research, 1974, 44, 265-291.
Tenopyr, M. L. Content-construct confusion. Personnel Psychology, 1977, 30, 47-54.
Thorndike, R. L. Personnel selection: Test and measurement techniques. New York: Wiley, 1949.
Thorndike, R. L. Concepts of culture-fairness. Journal of Educational Measurement, 1971, 8, 63-70.
Vickers, G. The art of judgment. New York: Basic Books, 1965.
Vickers, G. Value systems and social process. Harmondsworth, Middlesex, England: Penguin Books, 1970.
Wallach, M. A. Psychology of talent and graduate education. In S. Messick (Ed.), Individuality in learning. San Francisco: Jossey-Bass, 1976.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Wernimont, P. F., & Campbell, J. P. Signs, samples, and criteria. Journal of Applied Psychology, 1968, 52, 372-376.
