reliability: sem and agreement - michigan state university notes/395 - 3... · 7 example 2: brutus...

1

Reliability:SEM and Agreement

PSY 395 – Oswald

Overview

• Review of Classical Test Theory

• Standard Error of Measurement (SEM)

• Comparing SEM to Reliability

• Reliability vs. Agreement

• Agreement

Classical Test Theory – A refresher!

If an immortal person took test an infinite number of times, what would be the mean of the observed scores?

X = T + E,T is the same every time you took itE is random (sometimes +, sometimes -) and would average out to zero, so…average X = T

Review of Classical Test Theory

2

Standard Error of Measurement (SEM)

• …but what would the standard deviation be?...

T = 10

+ errors- errors

Review of Classical Test Theory

The Mysteries of SEM …Revealed!

standard error of measurement = SEMstandard deviation of the error around any individual’s true score

±2 SEM captures95% of the error

Standard Error of Measurement

The Mysteries of SEM …Revealed!

…you’d like SEM/errorto be small

Why?How?


3

Large SEM

The CAT (Creative Analogies Test) has 50 items. Most people score anywhere from 20 to 100.

Amy scores 75 on the CAT.What if SEM = 10?Then…• 95% sure her true score is 55 – 95 (±2 SEM)


Small SEM

The CAT (Creative Analogies Test) has 50 items. Most people score anywhere from 20 to 100.

Amy scores 75 on the CAT.What if SEM = 2?Then…• 95% sure her true score is 71 – 79 (±2 SEM)


It’s SEMtacular!

1x xxSEM s r= −

xs

10 1 .84 4SEM = − =

xxr= SD of test scores = test reliability

10 1 .19 9SEM = − =

good reliability low SEM

poor reliability high SEM


4

SEM Example

Say you have a test wheremean = 50, sx = 10 and rxx = .84

SEM = − =10 1 84 4.

Fred scores 35 on this test, which means

• 95% sure his true score is 27 – 43 (±2 SEM)• 68% sure his true score is 36 – 44 (±1 SEM)


SEM Assumptions

• a reliability coefficient based on an appropriate measure

• the sample appropriately represents the population

1x xxSEM s r= −


Use of Cut Scores

• Cut scores are set values on a test that determines who passes and who fails the test.– Commonly used for licensure or certification

(bar exam, medical licensure, civil service)

• What is the impact of the SEM on interpreting cut scores?

Comparing SEM to Reliability

5

Some general rules about reliability

• Bigger is better!– The more items, the more occasions, and the more

observers the better the reliability• The Spearman-Brown Prophesy formula gives

us the reliability of a test if the number of (similar) items is increased

• Need to compute the standard error of measurement to understand individual test scores

• A measure can not be valid if it isn’t reliable

Comparing SEM to Reliability

Reliability vs. Agreement

Ex. Two raters rate you on the MSU School Spirit Test!!!1-5 scale, 1=strongly disagree and 5=strongly agree

Are the scores reliably related? Do they agree?

R2R1Question

134. This person might tattoo her/his entire body green and white.

353. Maybe you think she/he is sick, but it’s just “Spartan Fever.”

352. For this person, an MSU football game is an all-American treat.

241. This person would change his/her name to Sparty or Spartyette.

Reliability vs. Agreement

Example 1 - Interrater Agreement

ExampleBrutus and Maureen are clinical psychologists. They independently conduct intake interviews for 200 clients presenting mood problems.

Agreement

6

(Ex. 1, cont.) Assigned Categories

Clients assigned to 1 of 3 categories

– Cyclothymic– Bipolar– Depressed

Why would you think/hopethe therapists agree?

Agreement

(Ex. 1, cont.) Assignments

N=200 totalMaureen

cyclo bipo deprBrutus cyclo 106 10 4 120

bipo 22 28 10 60depr 2 12 6 20

130 50 20 140

Agreement

(Ex. 1, cont.) Convert # to Proportions

Proportions (=cell freq/200)Maureen

cyclo bipo deprBrutus cyclo .53 .05 .02 .60

bipo .11 .14 .05 .30depr .01 .06 .03 .10

.65 .25 .10 .70.53+.14+.03 = 70 % interrater agreement

Agreement

7

Example 2:Brutus and Maureen: Bogus Clinicians

What if Brutus and Maureen were actors, just pretending to be clinical psychologists as a sick little drama class project?

In other words, what if Brutus and Maureen had NO IDEA WHAT THEY WERE DOING, and all their “diagnoses”were COMPLETELY DUE TO CHANCE?

Agreement

(Ex. 2, cont.) Frequencies

N=201 totalMaureen

cyclo bipo deprBrutus cyclo 22 23 22 67

bipo 23 22 22 67depr 22 22 23 67

67 67 67 67

Agreement

(Ex. 2, cont.) Proportions



bipo .11 .11 .11 .33depr .11 .11 .11 .33

.33 .33 .33 .33.11+.11+.11 = 33 % interrater agreement just by chance

Agreement

8

Wait a sec…

Two people could agree just by chance, right? (e.g., 2 people flippin’ coins)

(1) By how much do ratings beatchance agreement?

(2) What’s the MOST the ratings could beat chance by?

Agreement

Kappa CoefficientHow Much Above Chance Agreement?

κ =−

−P P

Pobs chance

chance1

The maximum amount beyond chance you couldhave agreement on.

The amount above chanceyou do have agreement on.

….out of….

Agreement



bipo .11 .11 .11 .33depr .11 .11 .11 .33

.33 .33 .33 .33.11+.11+.11 = 33 % interrater agreement just by chance

. 3 3 .3 3 01 1 .3 3o b s c h a n c e

c h a n c e

P PP

κ − −= = =

− −

Agreement

9



bipo .11 .14 .05 .30depr .01 .06 .03 .10

.65 .25 .10 .70.53+.14+.03 = 70 % interrater agreement

.7 0 .4 7 5 .4 31 1 .4 7 5o b s ch a n ce

ch a n ce

P PP

κ − −= = =

− −

(chance is different from .33 here, why?)

Agreement

PregnancyTest

Say you had the followingdata on 1000 women…

Actual…

960 98%

9604098096020no preg20020preg

no pregpreg

Example 3:What if a pregnancy test was

claimed to be “98% effective”?

Agreement

κ = (Pobs – Pchance) / (1 – Pchance)κ = .98 – .94 = .67

1 – .94

PregnancyTest

Say you had the followingdata on 1000 women…

Actual…

.98.96.04

.98.96.02no preg

.02.00.02pregno pregpreg

(Ex. 3, cont.) Proportions

10

What if a pregnancy test was claimed to be “98% effective”?

In these data, most (96%) were not pregnant anyway, so the test would have been correct 96% of the time even if the test always said the individual was not pregnant!

The pregnancy test does help above chance levels, but – given the sample data – not as much as the claim might lead one to believe.

reliability: sem and agreement - michigan state university notes/395 - 3... · 7 example 2: brutus...

Documents