
Maintaining, adjusting and generalizing standards and cut-scores

Robert Coe, Durham University
Standard setting in the Nordic countries
Centre for Educational Measurement, University of Oslo (CEMO), Oslo, 22 September 2015

@ProfCoe

Slides available at:

www.twitter.com/ProfCoe

Different meanings of ‘standards’

Does it test something sensible?
Is the content complex & extensive?
Are the questions hard?
Has the design/development followed the rules?
Has it been marked properly?
Are the scores reliable?
Does it actually measure the intended construct?
Can the outcomes (grades/scores) be used as desired?
Does a particular cut-point indicate the same as
– Its ‘equivalent’ in previous versions
– Some kind of equivalent in other assessments
– Some specified level of performance

2

In England we are worried about

Standards over time
Standards across qualifications, subjects, or specifications within the same broad qualification
Standards between awarding organisations, or assessment processes
Standards across countries
Standards between groups of candidates (e.g. males/females, rich/poor)

3

Comparability (Newton, 2010)

Candidates who score at linked (grade boundary) marks must be the same in terms of …
– the character of their attainments (phenomenal)
– the causes of their attainments (causal)
– the extent to which their attainments predict their future success (predictive)

4

Comparability (Coe, Newton & Elliott, 2012)

Any rational claim about the comparability of grades in different qualifications amounts to a claim that those grades can be treated as interchangeable for some purpose or interpretation.

We should talk about the comparability of grades or scores (rather than of qualifications), since these are the outcomes of an assessment that are interpreted and used.

Most interpretations of a grade achieved in an examination relate directly to the candidate. In other words, we are interested in what the grade tells us about the person who achieved it, and inferring characteristics of the person from the observed performance.

Any claim about interchangeability relates to a particular construct.

5

Test development and standards

Theoretical
– Specify the construct
– Develop the assessments to measure it
– Use equating/linking procedures to link key cut-points
– Candidates with linked scores are equivalent (wrt the construct)

Pragmatic
– Assessments evolve, shaped by
  – Explicit constructs
  – Past practice
  – User requirements (wide range of different uses & purposes)
  – Political drivers
  – Pragmatic constraints
– Comparability defined by public opinion (Cresswell, 1996, 2012)

6

An integration: rational and pragmatic

Consider the different ways exam results are used (interchangeably)
Identify an implied construct for each (in terms of which they are interchangeable)
Develop a defensible method for minimising unfairness and undesirable behaviour that results from these interchangeability requirements

7

1. Use / interpretation: The claim by teachers in the 2012 GCSE English dispute that students who met the criteria deserve a C
   Implied construct: The grade indicates specific competences within the subject domain that have been demonstrated on the assessment occasion.
   Interchangeability requirement: Performance judged to meet the same ‘criteria’ gets the same grade on different occasions, specifications, boards

2. Use / interpretation: The use of a B in GCSE maths as a filter for A level study in maths.
   Implied construct: The grade indicates specific competences within the subject domain that the candidate is likely to be able to reproduce in the future.
   Interchangeability requirement: Grades (across occasions, specifications, boards) represent the same level of the construct (mathematics)

8

3. Use / interpretation: The use of ‘5A*-C EM’ (at least 5 grade Cs inc Eng & math) at GCSE as a filter for any A level study
   Implied construct: The grade indicates competences transferable to other academic study that the candidate is likely to be able to reproduce in the future.
   Interchangeability requirement: Grades achieved in different combinations of subjects and other allowable qualifications must be equivalent in terms of their predictions for subsequent academic outcomes.

4. Use / interpretation: Employers requiring job applicants to have ‘5A*-C EM’.
   Implied construct: The grade indicates competences transferable to employment contexts that the candidate is likely to be able to reproduce in the future.
   Interchangeability requirement: Grades achieved in maths and English (across occasions, specifications, boards) must predict the same level of relevant, reproducible workplace competences.

9

5. Use / interpretation: Use of GCSE results in league tables to judge schools
   Implied construct: Average grades for a class or school (especially if referenced against prior attainment) indicate the impact (and hence quality) of the teaching experienced.
   Interchangeability requirement: Grades achieved in different combinations of subjects and other allowable qualifications must be equivalent in terms of some measure of the teaching (quality and quantity) that is typically (after controlling for pre-existing or irrelevant differences) associated with those outcomes.

6. Use / interpretation: Comparison of GCSE results of different types of school to justify impact of policy.
   Implied construct: Average grades across the jurisdiction indicate the impact (and hence quality) of the system’s schooling provision.
   Interchangeability requirement: As in 5

10

Grade C in GCSE French could be made comparable to the same grade …

In French in previous years (or parallel specifications), in terms of what is specified to be demonstrated

In French in previous years (or parallel specifications), or in other languages, in terms of the candidate’s ability to communicate in the target language

In other (academic) subjects, in terms of their prediction of subsequent attainment in other (academic) subjects

In other subjects, in terms of how hard it is to get students to reach this level

11

It follows that …

We cannot talk about standards (setting or maintaining) until we decide which of these uses/interpretations we want to support
In at least some cases the different uses/interpretations will be incompatible
If we want the ‘standard’ to be captured in the outcome (score/grade) we have to prioritise (or optimise)
Alternatively, we can use different equivalences for different uses

12

A level data

[Chart: relative severity of grading across A level subjects, on a corrected-tariff scale, with one line per grade from A* to E. Subjects are ordered from leniently graded (Film Studies, Art Photography, Media Studies, …) to severely graded (…, Biology, Chemistry, Physics). Y-axis: relative severity (corrected tariff).]

14

A taxonomy of standard setting and maintaining methods (from Coe & Walker, 2013)

15

Judgement-based methods

Criterion-based judgement
– Judgement against specific competences
– Judgement against overall grade descriptors

Item-based judgement
– Angoff method
– Bookmark method

Comparative judgement
– Cross-moderation
– Paired comparison

Judgement of demand
– CRAS (complexity, resources, abstractness, strategies)
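Of these, the Angoff method is the most directly computational once panel ratings have been collected: each judge estimates, per item, the probability that a minimally competent candidate answers it correctly, and the panel cut-score is conventionally the mean across judges of each judge's summed probabilities. Below is a minimal sketch; the judges, items and probabilities are invented for illustration, and operational Angoff panels typically add discussion rounds and impact data.

```python
import statistics

# Hypothetical Angoff ratings (invented): each judge estimates, for every
# item, the probability that a minimally competent candidate answers it
# correctly.
ratings = {
    "judge_1": [0.6, 0.8, 0.4, 0.7, 0.5],
    "judge_2": [0.5, 0.9, 0.3, 0.6, 0.6],
    "judge_3": [0.7, 0.7, 0.5, 0.8, 0.4],
}

# A judge's recommended cut-score is the sum of their item probabilities:
# the expected raw score of the borderline candidate on this test.
judge_cuts = [round(sum(probs), 2) for probs in ratings.values()]

# The panel cut-score is conventionally the mean across judges.
cut_score = statistics.mean(judge_cuts)
print(f"Judge cut-scores: {judge_cuts}")    # e.g. [3.0, 2.9, 3.1]
print(f"Panel cut-score:  {cut_score:.2f}") # 3.00, out of 5 items
```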

16

Equating methods

Classical equating models
– Linear equating
– Equipercentile equating

IRT equating
– Rasch model
– Other IRT models

Equating designs
– Equivalent groups
– Common persons
– Common items
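The two classical models can be stated concretely. The sketch below assumes an equivalent-groups design; the forms, score distributions and grade-C cut-score are invented, and operational equipercentile equating would smooth the score distributions rather than use this crude percentile lookup.

```python
import statistics

def linear_equate(x, scores_x, scores_y):
    """Linear equating: map a form-X score onto the form-Y scale by
    matching means and standard deviations:
        y = mu_Y + (sigma_Y / sigma_X) * (x - mu_X)
    """
    mu_x, sd_x = statistics.mean(scores_x), statistics.stdev(scores_x)
    mu_y, sd_y = statistics.mean(scores_y), statistics.stdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)

def equipercentile_equate(x, scores_x, scores_y):
    """Equipercentile equating (crude, unsmoothed): find the proportion
    of form-X scores at or below x, then return the form-Y score at the
    same percentile rank."""
    p = sum(s <= x for s in scores_x) / len(scores_x)
    ys = sorted(scores_y)
    return ys[min(int(p * len(ys)), len(ys) - 1)]

# Invented equivalent-groups data: comparable cohorts sat forms X and Y.
scores_x = [42, 48, 53, 55, 59, 61, 66, 70]
scores_y = [38, 44, 49, 50, 54, 57, 61, 65]

# Carry a grade-C cut-score of 50 on form X onto the form-Y scale.
print(f"Linear:         {linear_equate(50, scores_x, scores_y):.1f}")
print(f"Equipercentile: {equipercentile_equate(50, scores_x, scores_y)}")
```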

17

Linking/comparability methods

Reference/anchor test
– Concurrent
– Prior

Common candidate methods
– Subject pairs
– Subject matrix
– Latent trait

Pre-testing designs (when high-stakes & released)
– Live testing with additional future trial test items
– Random future test versions within live testing
– Low-stakes pre-testing two versions in counterbalanced trial
– Low-stakes pre-testing with an anchor test

Norm/cohort referencing
– Pure cohort referencing
– Adjusted cohort referencing
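As a toy illustration of the common-candidate idea, the subject-pairs method compares, within each candidate who took both subjects of a pair, the grades achieved, then averages across candidates; a consistently negative mean difference suggests the first subject is more severely graded. The data and grade-point coding below are invented; analyses behind charts like the A level severity plot earlier use much larger matched datasets and model-based refinements (e.g. latent trait models).

```python
from collections import defaultdict
from itertools import combinations

# Invented candidate records: grade points in the subjects each candidate
# took (assume a coding such as A* = 6 ... E = 1).
candidates = [
    {"Media Studies": 6, "Physics": 4},
    {"Media Studies": 5, "Physics": 3, "Maths": 3},
    {"Media Studies": 6, "Maths": 5},
    {"Physics": 5, "Maths": 5},
]

# For every pair of subjects, average the within-candidate grade
# difference over all candidates who took both subjects.
diffs = defaultdict(list)
for cand in candidates:
    for a, b in combinations(sorted(cand), 2):
        diffs[(a, b)].append(cand[a] - cand[b])

for (a, b), d in sorted(diffs.items()):
    mean_d = sum(d) / len(d)
    # Negative mean: subject a tends to award lower grades to the same
    # candidates, i.e. it looks more severely graded than subject b.
    print(f"{a:13s} vs {b:13s}: {mean_d:+.2f}")
```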

18