Transcript
Page 1:

Generalized Mixed-effects Models for Monitoring Cut-scores for Differences Between Raters, Procedures, and Time

Yeow Meng Thum
Hye Sook Shin

UCLA Graduate School of Education & Information Studies
National Center for Research on Evaluation, Standards, and Student Testing (CRESST)

CRESST Conference 2004

Los Angeles

Page 2:

Rationale

• Research shows that cut-scores vary as a function of many factors: raters, procedures, and time.

• How does one defend a particular cut-score? Current options include averaging several values or using collateral information.

• High-stakes accountability hinges on the comparability of performance standards over time.

• Some method is required to monitor cut-scores for consistency across groups and over time (Green et al.).

Page 3:

Purpose of Study

• An approach for estimating the impact of procedural factors, rater characteristics, and time.

• Monitoring the consistency of cut-scores across several groups.

Page 4:

Transforming Judgments into Scale Scores

[Figure: ogive relating Probability (y-axis, 0.0 to 1.0) to the Logit scale (x-axis, -4 to 4) and the Scale Score metric (480 to 720); the cut-score of 0.633 logits corresponds to 619 scale-score points.]

Figure 1: Working with the Grade 3 SAT-9 mathematics scale
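A minimal sketch of the mapping in Figure 1, assuming a hypothetical linear logit-to-scale conversion anchored so that 0.633 logits corresponds to 619 scale-score points; the slope of 40 points per logit is illustrative, not the published SAT-9 conversion:

```python
import math

def inverse_logit(eta):
    """Probability implied by a logit value (the ogive in Figure 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit_to_scale(eta, slope=40.0, anchor_logit=0.633, anchor_scale=619.0):
    """Hypothetical linear conversion from logits to scale-score points,
    anchored at the cut-score (0.633 logits -> 619 points)."""
    return anchor_scale + slope * (eta - anchor_logit)

cut_logit = 0.633
print(f"P(success) at the cut-score: {inverse_logit(cut_logit):.3f}")
print(f"Scale-score cut: {logit_to_scale(cut_logit):.0f}")
```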

Page 5:

Performance Distribution for Four Urban Schools

Figure 2: Grade 3 SAT-9 mathematics scale score distribution for four schools

[Figure: four scale-score density panels (x-axis 400 to 800), each marked at the 619 cut-score. School A: 32% Proficient; School B: 70% Proficient; School C: 19% Proficient; School D: 32% Proficient.]

Page 6:

Potential Impact of Revising a Cut-score

Revised cut-score (as a fraction of the SEM)

School     -1    -0.5      0     0.5      1
A         41%     37%    32%     29%    26%
B         78%     75%    70%     67%    63%
C         25%     23%    19%     15%    13%
D         40%     36%    32%     28%    25%

Table 1: Potential impact on school performance when the cut-score changes
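Table 1 can be reproduced in spirit with a normal approximation to a school's scale-score distribution; the sketch below uses hypothetical school means, SDs, and SEM rather than the actual figures behind schools A-D:

```python
from statistics import NormalDist

def pct_proficient(mean, sd, cut):
    """Percent of students at or above the cut under a normal approximation."""
    return 100.0 * (1.0 - NormalDist(mean, sd).cdf(cut))

cut, sem = 619.0, 12.0          # the SEM value here is illustrative only
schools = {"A (hypothetical)": (600.0, 40.0), "B (hypothetical)": (640.0, 40.0)}

for name, (mean, sd) in schools.items():
    row = [f"{pct_proficient(mean, sd, cut + f * sem):4.0f}%" for f in (-1, -0.5, 0, 0.5, 1)]
    print(f"School {name}: " + "  ".join(row))
```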

Page 7:

Data & Model

• Simulate data for a standard-setting study design: a randomized block confounded factorial design (Kirk, 1995)

• Factors of the standard-setting study:

a. Rater Dimensions (Teacher, Non-Teacher, etc.)

b. Procedural Factors/Treatments

1. Type of Feedback (outcome or impact feedback: "yes" or "no", etc.)

2. Item Sampling in Booklet (number of items, etc.)

3. Type of Task (a modified Angoff, a contrasting-groups approach, or the Bookmark method, etc.)

Page 8:

Treating Binary Outcomes

1 for "pass"

0 for "fail",ijty

ln1

ijtijt

ijt

p

p

(2)

Binary outcome

(1)

Logit link function

(pass if rater j thinks, the passing candidate has a good chance of getting the ith item right in session t)

Page 9:

IRT Model for Cut-score - I

Procedural Factors Impacting A Rater's Cut-scores:

\kappa_{jt} = \sum_s \lambda_s S_{sjt} + \delta_{jt}    (3)

where \lambda_s is the fixed effect due to session characteristic s, and \delta_{jt} is a random effect that evolves over time (ROUND_{jt}) and is a function of rater characteristics X_{pj}.

Item Response Model (IRT):

\ln[\, p_{ijt} / (1 - p_{ijt}) \,] = \kappa_{jt} - d_{ijt}    (4)
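A minimal sketch of equations (2)-(4): the probability that rater j judges the borderline candidate as passing item i follows from the rater's cut-score in logits minus the item difficulty; the numeric values below are made up for illustration.

```python
import math

def p_pass(kappa_jt, d_ijt):
    """Eqs (2) and (4): P(y_ijt = 1) = inverse-logit(kappa_jt - d_ijt)."""
    return 1.0 / (1.0 + math.exp(-(kappa_jt - d_ijt)))

kappa = 0.633                        # a rater's cut-score in logits
for d in (-0.5, 0.0, 1.0):           # illustrative item difficulties
    print(f"d = {d:+.1f}: P(pass) = {p_pass(kappa, d):.3f}")
```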

Page 10:

IRT Model for Cut-score - II

Estimating Factors Impacting A Rater's Cut-scores:

\delta_{jt} = \pi_{0j} + \pi_{1j} ROUND_{jt}
\pi_{0j} = \beta_{00} + \sum_p \beta_{0p} X_{pj} + u_{0j}
\pi_{1j} = \beta_{10} + \sum_p \beta_{1p} X_{pj} + u_{1j}    (5)

(u_{0j}, u_{1j}) are distributed bivariate normal with means (0, 0) and variance-covariance matrix

T = [ \tau_{00}  \tau_{01} ; \tau_{10}  \tau_{11} ]
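A sketch of equation (5): draw (u_0j, u_1j) from a bivariate normal with covariance T and build delta_jt across rounds. The fixed effects and T below are hypothetical values, not estimates from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_00, beta_01 = 0.20, -0.10       # hypothetical intercept model (grand mean, teacher effect)
beta_10, beta_11 = -0.05, 0.02       # hypothetical round-slope model
T = np.array([[0.10, 0.02],
              [0.02, 0.05]])         # hypothetical covariance of (u_0j, u_1j)

def draw_delta(x_teacher, n_rounds=4):
    """Eq (5): delta_jt = pi_0j + pi_1j * ROUND_jt, with (u_0j, u_1j) ~ N(0, T)."""
    u0, u1 = rng.multivariate_normal([0.0, 0.0], T)
    pi_0j = beta_00 + beta_01 * x_teacher + u0
    pi_1j = beta_10 + beta_11 * x_teacher + u1
    return np.array([pi_0j + pi_1j * r for r in range(n_rounds)])

print(draw_delta(x_teacher=1))       # one teacher-rater's delta over rounds 0..3
```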

Page 11:

Likelihood

Conditional on \delta_j, y_j has probability (6):

f(y_j; \delta_j) = \prod_t \prod_i p_{ijt}^{y_{ijt}} (1 - p_{ijt})^{1 - y_{ijt}}

Prior distribution of \delta_j: g(\delta_j; T)

The conditional posterior of the rater random effects \delta_j is (7):

f(y_j; \delta_j) \, g(\delta_j; T) / h(y_j, \beta, T)

where h(y_j, \beta, T) = \int f(y_j; \delta_j) \, g(\delta_j; T) \, d\delta_j and \beta collects the fixed effects.

Joint marginal likelihood (8):

\prod_j h(y_j, \beta, T)
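A sketch of equations (6)-(8) for a simplified case with a single random intercept per rater: the conditional likelihood is integrated over delta_j with Gauss-Hermite quadrature, similar in spirit to the adaptive quadrature SAS Proc NLMixed uses. The data and parameter values are made up.

```python
import numpy as np

def marginal_loglik_rater(y, d, kappa_fixed, tau2, n_quad=21):
    """Approximate ln h(y_j): the integral of f(y_j; delta) g(delta; tau2)
    over delta, for one rater with a random intercept delta ~ N(0, tau2)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)     # weight exp(-x^2/2)
    deltas = np.sqrt(tau2) * nodes
    eta = kappa_fixed + deltas[:, None] - d[None, :]                # eq (4) at each node
    p = 1.0 / (1.0 + np.exp(-eta))
    cond = np.prod(np.where(y[None, :] == 1, p, 1.0 - p), axis=1)   # eq (6)
    return np.log(np.sum(weights * cond) / np.sqrt(2.0 * np.pi))    # one rater's term of eq (8)

y = np.array([1, 0, 1, 1])              # illustrative judgments
d = np.array([-0.5, 0.8, 0.0, -1.0])    # illustrative item difficulties
print(marginal_loglik_rater(y, d, kappa_fixed=0.6, tau2=0.1))
```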

Page 12:

Multiple Studies: Consistency & Stability

Procedural Factors Impacting A Rater's Cut-scores for separate study g (g = 1, 2, 3, ..., G):

\kappa_{jt} = \sum_s \lambda_{sg} S_{sjt} + \delta_{jt}    (9)

where \lambda_{sg} is the fixed effect due to session characteristic s in study g, and \delta_{jt} is a random effect that evolves over time (ROUND_{jt}) and is a function of rater characteristics X_{pj}.

Group Factors Impacting A Rater's Severity:

\delta_{jt} = \pi_{0j} + \pi_{1j} ROUND_{jt}
\pi_{0j} = \sum_{g=1}^{G} \beta_{00g} GROUP_{gj} + \sum_{g=1}^{G} \sum_p \beta_{0pg} GROUP_{gj} X_{pj} + u_{0j}
\pi_{1j} = \sum_{g=1}^{G} \beta_{10g} GROUP_{gj} + \sum_{g=1}^{G} \sum_p \beta_{1pg} GROUP_{gj} X_{pj} + u_{1j}    (10)
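A small sketch of equation (10)'s group-indexed intercept for one rater, with hypothetical coefficients for G = 3 groups; it simply lets the coefficients of equation (5) differ by rater group or study.

```python
import numpy as np

beta_00g = np.array([0.35, 0.45, 0.40])   # hypothetical group-specific intercepts
beta_0pg = np.array([0.12, 0.05, 0.09])   # hypothetical group-specific teacher effects

def pi_0j(group, x_teacher, u_0j):
    """Group-indexed intercept model from eq (10)."""
    g = np.zeros(3)
    g[group] = 1.0                        # GROUP_gj indicators
    return g @ beta_00g + x_teacher * (g @ beta_0pg) + u_0j

print(pi_0j(group=1, x_teacher=1, u_0j=0.02))
```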

Page 13:

Simulation: SAS Proc NLMixed

150 raters randomly exposed over 4 rounds to a standard-setting exercise varying on 3 session factors.

Session Factor 1: Feedback type

Session Factor 2: Item Targeting in Booklet

Session Factor 3: Type of Standard Setting Task

Rater Characteristics: Teacher, Non-Teacher

Change over Round (time)
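A hedged sketch of the kind of data-generation step this simulation implies: 150 raters over 4 rounds, with three +/-0.5-coded session factors and a teacher indicator feeding the cut-score model of equations (3)-(5). Every parameter value, the number of items, and the factor codings are hypothetical, and the actual study fit the model in SAS Proc NLMixed rather than in Python.

```python
import numpy as np

rng = np.random.default_rng(2004)
n_raters, n_rounds, n_items = 150, 4, 30

# Hypothetical design codings for the three session factors and the rater characteristic.
feedback  = rng.choice([-0.5, 0.5], size=(n_raters, n_rounds))   # Session Factor 1
targeting = rng.choice([-0.5, 0.5], size=(n_raters, n_rounds))   # Session Factor 2
task      = rng.choice([-0.5, 0.5], size=(n_raters, n_rounds))   # Session Factor 3
teacher   = rng.choice([0, 1], size=n_raters)                     # Rater characteristic

# Hypothetical generating parameters (logit metric).
lam = np.array([0.15, -0.10, 0.20])          # session-factor effects, eq (3)
beta_00, beta_01 = 0.40, 0.10                # intercept model, eq (5)
beta_10, beta_11 = 0.08, -0.05               # round-slope model, eq (5)
T = np.array([[0.09, 0.01], [0.01, 0.04]])   # covariance of (u_0j, u_1j)
d = rng.normal(0.0, 1.0, size=n_items)       # item difficulties, eq (4)

u = rng.multivariate_normal([0.0, 0.0], T, size=n_raters)
records = []
for j in range(n_raters):
    pi0 = beta_00 + beta_01 * teacher[j] + u[j, 0]
    pi1 = beta_10 + beta_11 * teacher[j] + u[j, 1]
    for t in range(n_rounds):
        session = np.array([feedback[j, t], targeting[j, t], task[j, t]])
        kappa_jt = session @ lam + pi0 + pi1 * t          # eqs (3) and (5)
        p = 1.0 / (1.0 + np.exp(-(kappa_jt - d)))         # eqs (2) and (4)
        records.append((j, t, rng.binomial(1, p)))        # eq (1): 0/1 judgments per item

print(f"Simulated {len(records)} rater-by-round judgment vectors of {n_items} items each")
```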

Page 14:

Selected Results

• The model (reasonably) recovers parameters within sampling uncertainty across the 3 studies.

• The average cut-score (All Teachers) for each rater group at the last round is not significantly different from 619, whereas the first-round results were significantly different.

• Results from the model for multiple studies are similarly encouraging.

Page 15:

Suggestions

• Large-scale testing programs should monitor their cut-score estimates for consistency and stability.

• For a stable performance scale, estimates of cut-scores and factor effects should be replicable to a reasonable degree across groups and over time.

• The model in this paper can be adapted to actual data so that variation due to the relevant factors of a study can be verified and balanced out.

