Reliability: SEM and Agreement
PSY 395 – Oswald, Michigan State University
Overview
• Review of Classical Test Theory
• Standard Error of Measurement (SEM)
• Comparing SEM to Reliability
• Reliability vs. Agreement
• Agreement
Classical Test Theory – A refresher!
If an immortal person took a test an infinite number of times, what would be the mean of the observed scores?
X = T + E
• T is the same every time you take the test
• E is random (sometimes +, sometimes −) and would average out to zero
So… average X = T
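The thought experiment can be approximated with a quick simulation. This is a minimal sketch and not part of the original slides: the true score of 10, the error SD of 2, and the normal shape of the errors are illustrative assumptions.

```python
import random
import statistics

def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Simulate X = T + E for one person taking the same test many times."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_administrations)]

# Assumed values for illustration only: T = 10, error SD = 2.
scores = simulate_observed_scores(true_score=10, error_sd=2, n_administrations=100_000)
mean_x = statistics.mean(scores)   # should land very close to T
sd_x = statistics.stdev(scores)    # should land close to the error SD
print(round(mean_x, 2), round(sd_x, 2))
```

With many administrations the errors cancel, so the mean of X settles near T, while the spread of X reflects only the error.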
Review of Classical Test Theory
Standard Error of Measurement (SEM)
• …but what would the standard deviation be?...
[Figure: observed scores scattered around the true score T = 10, with + errors on one side and − errors on the other]
The Mysteries of SEM …Revealed!
standard error of measurement = SEM = standard deviation of the error around any individual’s true score
±2 SEM captures 95% of the error
Standard Error of Measurement
The Mysteries of SEM …Revealed!
…you’d like SEM/error to be small
Why? How?
Large SEM
The CAT (Creative Analogies Test) has 50 items. Most people score anywhere from 20 to 100.
Amy scores 75 on the CAT. What if SEM = 10? Then…
• 95% sure her true score is 55 – 95 (±2 SEM)
Small SEM
The CAT (Creative Analogies Test) has 50 items. Most people score anywhere from 20 to 100.
Amy scores 75 on the CAT. What if SEM = 2? Then…
• 95% sure her true score is 71 – 79 (±2 SEM)
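The two bands for Amy can be reproduced with a few lines of code. This is a small sketch; the function name `true_score_band` is a hypothetical helper, not something from the slides.

```python
def true_score_band(observed, sem, n_sems=2):
    """Return the (low, high) band of observed score plus or minus n_sems * SEM."""
    margin = n_sems * sem
    return observed - margin, observed + margin

print(true_score_band(75, 10))  # large SEM: (55, 95)
print(true_score_band(75, 2))   # small SEM: (71, 79)
```

The same observed score gives a much tighter band when the SEM is small, which is why a small SEM is desirable.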
It’s SEMtacular!
$SEM = s_x \sqrt{1 - r_{xx}}$
where $s_x$ = SD of test scores and $r_{xx}$ = test reliability
good reliability → low SEM: $SEM = 10\sqrt{1 - .84} = 4$
poor reliability → high SEM: $SEM = 10\sqrt{1 - .19} = 9$
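The formula is easy to check numerically. A minimal sketch, with a hypothetical function name and an added guard on the reliability range that is an assumption of this example:

```python
import math

def standard_error_of_measurement(sd_x, r_xx):
    """SEM = s_x * sqrt(1 - r_xx), for a reliability between 0 and 1."""
    if not 0 <= r_xx <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd_x * math.sqrt(1 - r_xx)

print(round(standard_error_of_measurement(10, 0.84), 2))  # good reliability: 4.0
print(round(standard_error_of_measurement(10, 0.19), 2))  # poor reliability: 9.0
```

Note the two extremes: with perfect reliability (1.0) the SEM is 0, and with zero reliability the SEM equals the SD of the test scores.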
SEM Example
Say you have a test where mean = 50, $s_x$ = 10, and $r_{xx}$ = .84
$SEM = 10\sqrt{1 - .84} = 4$
Fred scores 35 on this test, which means
• 95% sure his true score is 27 – 43 (±2 SEM)
• 68% sure his true score is 31 – 39 (±1 SEM)
SEM Assumptions
• a reliability coefficient based on an appropriate measure
• the sample appropriately represents the population
$SEM = s_x \sqrt{1 - r_{xx}}$
Use of Cut Scores
• Cut scores are set values on a test that determine who passes and who fails the test.
  – Commonly used for licensure or certification (bar exam, medical licensure, civil service)
• What is the impact of the SEM on interpreting cut scores?
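One way to think about this question, offered as a rough sketch rather than the lecture's own answer: if the cut score falls inside a person's ±2 SEM band, the pass/fail decision for that person is uncertain. The function name, the rule that a score at or above the cut passes, and the example numbers (cut score 70, SEM = 4) are all hypothetical.

```python
def cut_score_decision(observed, cut_score, sem, n_sems=2):
    """Classify a score as pass/fail and flag it when the SEM band contains the cut."""
    low = observed - n_sems * sem
    high = observed + n_sems * sem
    decision = "pass" if observed >= cut_score else "fail"  # assumed passing rule
    uncertain = low <= cut_score <= high
    return decision, uncertain

print(cut_score_decision(72, 70, 4))  # ('pass', True): band 64-80 contains the cut
print(cut_score_decision(85, 70, 4))  # ('pass', False): band 77-93 is above the cut
print(cut_score_decision(60, 70, 4))  # ('fail', False): band 52-68 is below the cut
```

The larger the SEM, the more test takers near the cut score end up in the uncertain zone.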
Comparing SEM to Reliability
Some general rules about reliability
• Bigger is better!
  – The more items, the more occasions, and the more observers, the better the reliability
• The Spearman-Brown Prophecy formula gives us the reliability of a test if the number of (similar) items is increased
• Need to compute the standard error of measurement to understand individual test scores
• A measure cannot be valid if it isn’t reliable
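The slides name the Spearman-Brown formula without writing it out. Its standard form is r_new = k·r / (1 + (k − 1)·r), where k is the factor by which test length changes. A short sketch follows; the starting reliability of .70 is an illustrative assumption, not a value from the lecture.

```python
def spearman_brown(r_xx, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * r_xx) / (1 + (length_factor - 1) * r_xx)

print(round(spearman_brown(0.70, 2), 2))    # doubling the test: 0.82
print(round(spearman_brown(0.70, 0.5), 2))  # halving the test: 0.54
```

Lengthening the test raises the predicted reliability, which in turn lowers the SEM for a fixed SD.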
Reliability vs. Agreement
Ex. Two raters rate you on the MSU School Spirit Test!!!
1-5 scale, 1 = strongly disagree and 5 = strongly agree
Are the scores reliably related? Do they agree?
Question                                                               R1   R2
1. This person would change his/her name to Sparty or Spartyette.       4    2
2. For this person, an MSU football game is an all-American treat.      5    3
3. Maybe you think she/he is sick, but it’s just “Spartan Fever.”       5    3
4. This person might tattoo her/his entire body green and white.        3    1
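The two questions on this slide can be answered directly from the ratings. This sketch reads the table as R1 = 4, 5, 5, 3 and R2 = 2, 3, 3, 1 for questions 1 to 4 (an interpretation of the flattened table), and uses the Pearson correlation as one common index of how reliably related two sets of scores are.

```python
import math

r1 = [4, 5, 5, 3]  # Rater 1, questions 1-4
r2 = [2, 3, 3, 1]  # Rater 2, questions 1-4

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def proportion_exact_agreement(x, y):
    """Share of items on which both raters gave the identical rating."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

print(round(pearson_r(r1, r2), 2))         # 1.0
print(proportion_exact_agreement(r1, r2))  # 0.0
```

Rater 2 is always exactly 2 points below Rater 1, so the ratings are perfectly correlated even though the raters never give the same score.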
Example 1 - Interrater Agreement
Brutus and Maureen are clinical psychologists. They independently conduct intake interviews for 200 clients presenting mood problems.
Agreement
(Ex. 1, cont.) Assigned Categories
Clients assigned to 1 of 3 categories
– Cyclothymic
– Bipolar
– Depressed
Why would you think/hope the therapists agree?
(Ex. 1, cont.) Assignments
N = 200 total
                      Maureen
                 cyclo   bipo   depr   Total
Brutus  cyclo     106     10      4     120
        bipo       22     28     10      60
        depr        2     12      6      20
        Total     130     50     20
Agreements (diagonal): 106 + 28 + 6 = 140
(Ex. 1, cont.) Convert # to Proportions
Proportions (= cell freq/200)
                      Maureen
                 cyclo   bipo   depr   Total
Brutus  cyclo     .53    .05    .02     .60
        bipo      .11    .14    .05     .30
        depr      .01    .06    .03     .10
        Total     .65    .25    .10
.53 + .14 + .03 = .70, i.e., 70% interrater agreement
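The 70% figure can also be computed straight from the Example 1 frequencies. A minimal sketch; the function name is a hypothetical helper.

```python
# Rows = Brutus, columns = Maureen, both in the order cyclo, bipo, depr.
counts = [
    [106, 10, 4],
    [22, 28, 10],
    [2, 12, 6],
]

def observed_agreement(table):
    """Proportion of cases on the diagonal, where both raters chose the same category."""
    total = sum(sum(row) for row in table)
    agreed = sum(table[i][i] for i in range(len(table)))
    return agreed / total

print(observed_agreement(counts))  # 0.7
```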
Example 2: Brutus and Maureen: Bogus Clinicians
What if Brutus and Maureen were actors, just pretending to be clinical psychologists as a sick little drama class project?
In other words, what if Brutus and Maureen had NO IDEA WHAT THEY WERE DOING, and all their “diagnoses” were COMPLETELY DUE TO CHANCE?
(Ex. 2, cont.) Frequencies
N = 201 total
                      Maureen
                 cyclo   bipo   depr   Total
Brutus  cyclo      22     23     22      67
        bipo       23     22     22      67
        depr       22     22     23      67
        Total      67     67     67
Agreements (diagonal): 22 + 22 + 23 = 67
(Ex. 2, cont.) Proportions
Proportions (= cell freq/201)
                      Maureen
                 cyclo   bipo   depr   Total
Brutus  cyclo     .11    .11    .11     .33
        bipo      .11    .11    .11     .33
        depr      .11    .11    .11     .33
        Total     .33    .33    .33
.11 + .11 + .11 = .33, i.e., 33% interrater agreement just by chance
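Chance agreement can be estimated from the marginal totals alone: for each category, multiply the proportion of times each rater used it, then add those products up. This sketch applies that standard calculation to the Example 2 frequencies; the function name is hypothetical.

```python
# Example 2 counts: rows = Brutus, columns = Maureen (cyclo, bipo, depr).
counts = [
    [22, 23, 22],
    [23, 22, 22],
    [22, 22, 23],
]

def chance_agreement(table):
    """Sum over categories of (row proportion) * (column proportion)."""
    total = sum(sum(row) for row in table)
    row_props = [sum(row) / total for row in table]
    col_props = [sum(col) / total for col in zip(*table)]
    return sum(r * c for r, c in zip(row_props, col_props))

print(round(chance_agreement(counts), 2))  # 0.33
```

Because each rater uses every category a third of the time, the expected chance agreement is 3 × (1/3)(1/3) = 1/3, matching the observed 33%.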
Wait a sec…
Two people could agree just by chance, right? (e.g., 2 people flippin’ coins)
(1) By how much do ratings beat chance agreement?
(2) What’s the MOST the ratings could beat chance by?
Kappa Coefficient: How Much Above Chance Agreement?
$\kappa = \frac{P_{obs} - P_{chance}}{1 - P_{chance}}$
Numerator: the amount above chance you do have agreement on.
….out of….
Denominator: the maximum amount beyond chance you could have agreement on.
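The formula translates into a one-line function. A minimal sketch; the guard against a chance level of 1 and the input values below are illustrative assumptions chosen to show the anchor points of the scale.

```python
def kappa(p_obs, p_chance):
    """kappa = (P_obs - P_chance) / (1 - P_chance)."""
    if p_chance >= 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_obs - p_chance) / (1 - p_chance)

print(kappa(1.0, 0.5))   # perfect agreement: 1.0
print(kappa(0.5, 0.5))   # agreement exactly at chance level: 0.0
print(kappa(0.75, 0.5))  # halfway between chance and perfect: 0.5
```

Kappa is 0 when raters do no better than chance and 1 when they agree on every case.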
Proportions (= cell freq/201)
                      Maureen
                 cyclo   bipo   depr   Total
Brutus  cyclo     .11    .11    .11     .33
        bipo      .11    .11    .11     .33
        depr      .11    .11    .11     .33
        Total     .33    .33    .33
.11 + .11 + .11 = .33, i.e., 33% interrater agreement just by chance
$\kappa = \frac{P_{obs} - P_{chance}}{1 - P_{chance}} = \frac{.33 - .33}{1 - .33} = 0$
Proportions (= cell freq/200)
                      Maureen
                 cyclo   bipo   depr   Total
Brutus  cyclo     .53    .05    .02     .60
        bipo      .11    .14    .05     .30
        depr      .01    .06    .03     .10
        Total     .65    .25    .10
.53 + .14 + .03 = .70, i.e., 70% interrater agreement
$\kappa = \frac{P_{obs} - P_{chance}}{1 - P_{chance}} = \frac{.70 - .475}{1 - .475} = .43$
(chance is different from .33 here, why?)
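One way to answer the question, assuming chance agreement is computed from the two raters' marginal proportions (Brutus's row total times Maureen's column total for each category, then summed), which reproduces the .475 used above:

```latex
\begin{aligned}
P_{chance} &= (.60)(.65) + (.30)(.25) + (.10)(.10) \\
           &= .390 + .075 + .010 \\
           &= .475
\end{aligned}
```

In Example 2 both raters used each category about a third of the time, so the same sum was 3 × (.33)(.33) ≈ .33. Here both raters pick "cyclo" far more often than the other categories, so they would match by chance more often.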
Example 3: What if a pregnancy test was claimed to be “98% effective”?
Say you had the following data on 1000 women…
                              Actual…
                          preg   no preg   Total
Pregnancy Test  preg        20         0      20
                no preg     20       960     980
                Total       40       960
Agreement: 20 + 960 = 980 of 1000 = 98%
(Ex. 3, cont.) Proportions
Say you had the following data on 1000 women…
                              Actual…
                          preg   no preg   Total
Pregnancy Test  preg       .02       .00     .02
                no preg    .02       .96     .98
                Total      .04       .96
Agreement: .02 + .96 = .98
$\kappa = \frac{P_{obs} - P_{chance}}{1 - P_{chance}} = \frac{.98 - .94}{1 - .94} = .67$
What if a pregnancy test was claimed to be “98% effective”?
In these data, most (96%) were not pregnant anyway, so the test would have been correct 96% of the time even if the test always said the individual was not pregnant!
The pregnancy test does help above chance levels, but – given the sample data – not as much as the claim might lead one to believe.
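The whole Example 3 calculation can be run from the raw counts. This sketch assumes the table is read with test results as rows and actual status as columns (test says pregnant: 20 and 0; test says not pregnant: 20 and 960), and computes chance agreement from the marginal proportions.

```python
# Rows = test result (preg, no preg); columns = actual status (preg, no preg).
counts = [
    [20, 0],
    [20, 960],
]
total = sum(sum(row) for row in counts)

p_obs = (counts[0][0] + counts[1][1]) / total           # observed agreement
row_props = [sum(row) / total for row in counts]        # how often the test says each result
col_props = [sum(col) / total for col in zip(*counts)]  # how often each status actually occurs
p_chance = sum(r * c for r, c in zip(row_props, col_props))
kappa = (p_obs - p_chance) / (1 - p_chance)

# Accuracy of a "test" that always says "not pregnant".
always_negative_accuracy = col_props[1]

print(round(p_obs, 2), round(p_chance, 4), round(kappa, 2), always_negative_accuracy)
```

Without rounding, chance agreement is .9416 and kappa comes out near .66; the slide's .67 comes from rounding chance to .94 first. Either way, the test is well short of what "98% effective" suggests, and always answering "not pregnant" would already be right 96% of the time.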