Content-based Interpretations of Test Scores. Michael Kane, National Conference of Bar Examiners. Maryland Assessment Research Center for Education Success, October 2008.

1

Content-based Interpretations of Test Scores

Michael Kane
National Conference of Bar Examiners

Maryland Assessment Research Center for Education Success
October, 2008

2

Overview

• Argument-based framework for validation

• Three content-based interpretations:
  – observable attributes
  – operationally defined attributes
  – traits

• Limitations of content-based validity evidence
  – “Begging the question”

3

Validation

• To validate test score interpretations and uses is to evaluate the plausibility of the interpretations and the appropriateness of the uses.

• Validation is therefore contingent; the evidence relevant to validation depends on the proposed interpretations and uses.

4

Argument-based Framework for Validation

5

Interpretations/Uses of scores

• In order to evaluate an interpretation, it is necessary to specify what it claims.

• What inferences are being drawn?

• What rules of inference are being relied on?

• What supporting assumptions are being made?

• The format used to specify the interpretation and uses is not important. That they be specified is essential.

6

Argument-based Approach to Validation

• The interpretive argument specifies the interpretations and uses of the test performances in terms of the inferences and assumptions used to get from a person’s test performance to the conclusions and decisions based on the test results.

• The validity argument provides a critical evaluation of the interpretive argument.

7

Toulmin’s Model of Inference

[Figure: Datum → [Warrant] → so {Qualifier} Claim. Backing supports the warrant; exceptions are conditions under which the inference does not hold.]
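The elements of Toulmin's model can be written down as a small data structure, which makes it easier to see what each inference in an interpretive argument has to specify. This is only an illustrative sketch: the class name `Inference`, its field names, and the example wording are assumptions made here, not part of the original slides.

```python
from dataclasses import dataclass, field


@dataclass
class Inference:
    """One Toulmin-style inference (hypothetical structure for illustration)."""
    datum: str                      # what we start from
    warrant: str                    # the rule of inference relied on
    claim: str                      # what is concluded
    qualifier: str = "presumably"   # strength attached to the claim
    backing: list = field(default_factory=list)     # evidence supporting the warrant
    exceptions: list = field(default_factory=list)  # conditions that defeat the inference

    def state(self) -> str:
        """Render the inference as a single sentence."""
        text = f"{self.datum}; so, {self.qualifier}, {self.claim} (since {self.warrant})"
        if self.exceptions:
            text += ", unless " + " or ".join(self.exceptions)
        return text


# Example wording is invented for illustration only.
generalization = Inference(
    datum="the observed score on a sample of tasks is high",
    warrant="observed scores estimate the expected score over similar tasks",
    claim="the expected score over similar tasks is high",
    backing=["generalizability study"],
    exceptions=["the testing conditions were atypical"],
)
print(generalization.state())
```

Writing the backing and exceptions as explicit fields reflects the point that a warrant is only as strong as the evidence behind it, and that the resulting claim remains open to challenge in particular cases.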

8

Warrants as Generic Inferences

9

Characteristics of the Interpretive Argument

• “Informal”: involves substantive inferences and assumptions, not just logical or statistical inferences and assumptions.

• “Presumptive”: does not prove the conclusions, but develops a presumption in favor of them.

• “Tentative”: conclusions are uncertain.

• “Defeasible”: can be overturned in particular cases.

10

Criteria for Validating/Evaluating Interpretive Arguments

• Clarity of the interpretive argument

• Coherence of the interpretive argument

• Plausibility of Inferences

• Plausibility of Assumptions

11

Asking the Right Questions

• An essential step in validation is the clear, explicit, and complete specification of the proposed interpretations and uses of test scores.

• In the absence of a clear and complete understanding of the proposed interpretations and uses, validators literally do not know what they are doing.

• To evaluate/validate the claims based on test scores, it is important to know what is being claimed.

12

Three Distinct Content-based Interpretations:

Observable attributes

Operationally defined attributes

Traits

13

A Family of Content-based Interpretations

• A cluster of closely related attributes that derive much of their meaning from content domains (Observable Attributes, Operational Definitions, and Traits).

• These attributes are interesting in themselves.

• And they illustrate the dependence of validation on the details of the proposed interpretations and uses.

14

Observable Attributes

• Some kind of behavior is of interest

• A target domain (TD) of possible observations (often large and somewhat fuzzy) is specified.

• The target score (TS), the person’s expected value over the TD, is taken to be the value of the observable attribute (OA) for that person.

• Because it is not generally possible to observe all of the observations in the TD, the TS has to be estimated using samples from the TD.
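The definition of the target score and its sample-based estimate can be written compactly. The notation here is an assumption introduced for illustration: X_{po} stands for person p's scored response on observation o, and o_1, ..., o_n are the sampled observations.

```latex
% Notation assumed for illustration; the slides do not give formulas.
\[
  \mathrm{TS}_p = \mathbb{E}_{o \in \mathrm{TD}}\!\left[ X_{po} \right],
  \qquad
  \widehat{\mathrm{TS}}_p = \frac{1}{n} \sum_{k=1}^{n} X_{p o_k}
\]
```

The sample mean is an unbiased estimate of the target score only if the observations are a random or representative sample from the TD.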

15

Possible Observations

• Observable attributes are dispositions.

• They report a tendency to respond in some way to some kind of stimulus or to perform in some way given a task.

• Each possible observation in the TD involves some task or stimulus, some conditions of observation, some context, some response, and a categorization of the response (e.g., good, adequate, marginal, inadequate).

16

Notes on OAs

• OAs are “observable” in the sense that they are expected values over (very large) domains (or sets) of potential observations.

• They are inductive summaries.

• OAs do not require an explanation for the observations, and they do not assume any latent trait that accounts for the observations.

• But they do not rule out explanations in terms of theories, latent traits, etc. Rather, they invite explanation.

17

What Shapes TDs?

• Why do we include some observations and not others in the TD?
  – Practical needs: performances involved in a job, sport, or other activity
  – Theoretical context: performances serve the same role or are accounted for in the same way by a theory
  – Experience: performances seem to hang together

• However, once the TD is specified, it defines the observable attribute.

18

Examples of Observable Attributes

• Performance in shooting free throws in basketball

• Performance in responding appropriately to written materials in English

• Performance in a job

• Performance in a trade or profession

• Tendency to respond in some way to some kind of stimulus

19

Measuring Observable Attributes

• Typically, it is not feasible to draw random or representative samples from the TD.

• Rather, a measurement procedure is defined in terms of a subset of the TD, from which we can draw random or representative samples.

• I will refer to this subset of the TD defining the measurement procedure as the universe of generalization (UG) for the procedure.

• I will refer to a person’s expected value over the UG as the person’s universe score (US).

20

Interpretive Arguments for OAs

• Evaluation: from observations to an observed score (OS)

• Generalization: from the observed score (OS) to a universe score (US)

• Extrapolation: from the universe score (US) to the target score (TS)
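The three inferences can be made concrete with a small simulation. Everything in this sketch is hypothetical: a single person, a target domain of 100 tasks, a universe of generalization made up of the first 60 of them, and invented success probabilities. It is meant only to show that a sample from the UG estimates the universe score, and that a separate step is needed to get from there to the target score.

```python
import random
from statistics import mean

# Hypothetical person: probability of a successful response on each of 100 TD tasks.
# Tasks 0-59 are test-like tasks (the UG); tasks 60-99 are TD tasks outside the UG.
success_prob = [0.8] * 60 + [0.5] * 40
ug_tasks = list(range(60))

target_score = mean(success_prob)                         # TS: expected value over the TD
universe_score = mean(success_prob[t] for t in ug_tasks)  # US: expected value over the UG


def observed_score(n_tasks, rng):
    """Sample tasks from the UG, score each response 0/1, and average (the OS)."""
    sampled = rng.sample(ug_tasks, n_tasks)
    # Evaluation: each response is categorized as success (1) or failure (0).
    scored = [1 if rng.random() < success_prob[t] else 0 for t in sampled]
    return mean(scored)


rng = random.Random(2008)  # fixed seed so the sketch is repeatable
print("OS:", observed_score(30, rng))
print("US:", universe_score)
print("TS:", round(target_score, 2))
```

In this invented setup the universe score is higher than the target score, so even a very precise observed score would overstate performance in the TD. Generalization evidence addresses the step from OS to US; it says nothing about the gap between US and TS, which is the job of extrapolation evidence.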

21

Validity Arguments for OAs

• Evaluation: expert judgment supporting the scoring rule

• Generalization: generalizability study

• Extrapolation: criterion-related study; analyses of relationships between UG performances and TD performances
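As a rough illustration of what a generalizability study computes, the sketch below estimates variance components for the simplest design, persons crossed with items with one score per cell, and then a generalizability coefficient for relative decisions. The design, the function names, and the data are assumptions chosen for illustration; the slides do not specify a design.

```python
from statistics import mean


def g_study(scores):
    """Estimate variance components for a crossed persons x items design.

    scores[p][i] is the score of person p on item i (one score per cell).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[i] for row in scores) for i in range(n_i)]

    # Sums of squares for persons, items, and the residual (interaction + error).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Expected-mean-square solutions; negative estimates are truncated at zero.
    return {
        "person": max((ms_p - ms_res) / n_i, 0.0),
        "item": max((ms_i - ms_res) / n_p, 0.0),
        "residual": max(ms_res, 0.0),
    }


def g_coefficient(components, n_items):
    """Generalizability coefficient for relative decisions with n_items items."""
    relative_error = components["residual"] / n_items
    denominator = components["person"] + relative_error
    return components["person"] / denominator if denominator > 0 else 0.0


# Invented scores for five persons on four items.
data = [
    [4, 5, 3, 4],
    [2, 3, 2, 1],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [1, 2, 1, 2],
]
components = g_study(data)
print(components)
print(g_coefficient(components, n_items=4))
```

The coefficient speaks only to the generalization inference, from the observed score to the universe score. Evidence for scoring and for extrapolation has to come from the other sources listed above.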

22

Operational Definitions

• In some cases, OAs may be defined in terms of a domain from which it is possible to draw random or representative samples, and the attribute can be operationally defined in terms of a measurement procedure.

• For such operationally defined attributes (ODAs), there is no extrapolation to a broader domain, and therefore no need for evidence supporting extrapolation.

• So validation is much easier for an ODA than it is for a broadly defined OA.

23

Interpretive Arguments for Operationally Defined Attributes

• Evaluation: from observations to an observed score (OS)

• Generalization: from the observed score (OS) to a universe score (US)

24

Uses of Operational Definitions

• An operationally defined attribute is interpreted in terms of expected test performance.

• Any inferences about non-test performances will generally require specific criterion-related evidence.

• An ODA can also be used as an indicator for a theoretical construct, but this use requires construct-related validity evidence.

25

Traits

• Trait definitions incorporate target domains of possible observations, but add assumptions about underlying causal traits that account for performance in the target domain.

• As a result, trait interpretations are much richer than the interpretations of observable attributes or operationally defined attributes.

26

Trait Language 1

• A trait is a disposition to behave or perform in some way in response to some kinds of stimuli or tasks, under some range of circumstances. Much of the meaning of the trait is given by the domain of observations over which the disposition is defined, but trait interpretations also assume, at least implicitly, that some underlying or latent attribute accounts for the observed regularities in performance (Loevinger, 1957).

27

Trait Language 2

• Messick defined a trait as: “a relatively stable characteristic of a person ... which is consistently manifested to some degree when relevant, despite considerable variation in the range of settings and circumstances” (Messick, 1989, p. 15).

• Trait language tends to be implicitly causal, but no specific mechanisms describe how the trait influences performance or behavior.

28

Traits

• One can think of a trait as an observable attribute with an added dimension, the underlying latent attribute that accounts for the observed performances.

• Alternatively, one can start with a latent “trait” (e.g., anxiety, quantitative aptitude), and then specify a corresponding target domain of possible observations.

• Either way, we have a target domain and an underlying latent trait.

29

Interpretive Arguments for Traits

• Evaluation: from observations to an observed score (OS)

• Generalization: from the observed score (OS) to a universe score (US)

• Extrapolation: from the universe score (US) to the target score (TS)

• Explanation/Implications: from the target score (TS) to the latent trait and to any implications associated with the trait

30

Validating Trait Interpretations

• Validation requires backing for the scoring and generalization inferences, and typically for an extrapolation inference.

• In addition, validation calls for backing for any additional inferences associated with the trait claims:
  – Unidimensionality
  – Agreement with theory (as in Cronbach and Meehl, 1955)
  – Relationship to other variables
  – Fit to an IRT model
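To make "fit to an IRT model" slightly more concrete, here is a minimal sketch of the Rasch (one-parameter logistic) model, one common IRT model. The slides do not say which model is intended, so the choice of the Rasch model, the function names, and the item difficulties are assumptions for illustration.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model.

    theta is the person's latent trait level; difficulty is the item's difficulty.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_total_score(theta, difficulties):
    """Expected number-correct score over a set of items for a given theta."""
    return sum(rasch_probability(theta, b) for b in difficulties)


# Invented item difficulties, for illustration only.
difficulties = [-1.0, 0.0, 1.0]
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(expected_total_score(theta, difficulties), 3))
```

Checking fit then amounts to comparing observed response patterns with the probabilities the model implies. A poor fit would undercut the claim that a single latent trait accounts for the performances.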

31

Limitations of Content-based Validity Evidence

32

Criticisms of the Content Model

• Content-based judgments about content relevance and representativeness are typically made during test development and have a confirmationist bias.

• Messick (1989) saw content-validity evidence as playing a minor role in validation because it doesn’t apply directly to “inferences to be made from test scores” (p. 17).

• Cronbach (1971, p. 452) maintained that:
  – Judgments about content validity should be restricted to the operational, externally observable side of testing. Judgments about the subject’s internal processes state hypotheses, and these require empirical construct validation. (italics in original)

33

Judgment

34

Confirmationist Bias and the Stages of Validation

• Development Stage: Creating the test and the interpretive argument
  – Done by test developers
  – Tends to be confirmationist
  – Most content-related evidence is collected

• Appraisal Stage: challenging the interpretive argument

35

The Begging-the-question Fallacy

• Begging the question occurs if a large part of the question at issue is simply taken for granted or “begged”.

• In the weakest applications of content-validity models, content judgments are used to justify very expansive interpretations (e.g., in terms of traits, theoretical constructs) and uses (accountability).

36

To validate the interpretations and uses of test scores is to evaluate all of the claims being made.