Lesson Six
Reliability
Yun-Pi Yuan 2
Contents
Definition of reliability
Factors contributing to unreliability
Types of reliability
Indication of reliability: reliability coefficient
Ways of obtaining the reliability coefficient:
  Alternate/parallel forms
  Test-retest
  Split-half and KR-21/KR-20
Two ways of testing reliability
How to make tests more reliable
Definition of Reliability (1)
“The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24).
If you give the same test to the same testees on two different occasions, the test should yield similar results.
Definition of Reliability (2)
A reliable test is consistent and dependable.
Scores are consistent and reproducible.
The accuracy or precision with which a test measures something; that is, consistency, dependability, or stability of test results.
Factors Contributing to Unreliability
X = T + E (observed score = true score + error score)
Concerned with freedom from nonsystematic fluctuation.
Fluctuations in:
  the student
  scoring
  test administration
  the test itself
Types of Reliability
Student- (or person-) related reliability
Rater- (or scorer-) related reliability
  Intra-rater reliability
  Inter-rater reliability
Test administration reliability
Test (or instrument-related) reliability
Student-Related Reliability (1)
The error score comes from the test takers:
  Temporary illness
  Fatigue
  Anxiety
  Other physical or psychological factors
  Test-wiseness (i.e., strategies for efficient test taking)
Student-Related Reliability (2)
Principles:
  Assess on several occasions.
  Assess when the person is prepared and best able to perform well.
  Ensure that the person understands what is expected (e.g., instructions are clear).
Rater (or Scorer) Reliability (1)
Fluctuations: including human error, subjectivity, and bias
Principles:
  Use experienced, trained raters.
  Use more than one rater.
  Raters should carry out their assessments independently.
Rater Reliability (2)
Two kinds of rater reliability:
  Intra-rater reliability
  Inter-rater reliability
Intra-Rater Reliability
Fluctuations including:
  Unclear scoring criteria
  Fatigue
  Bias toward particular good and bad students
  Simple carelessness
Inter-Rater Reliability (1)
Fluctuations including:
  Lack of attention to scoring criteria
  Inexperience
  Inattention
  Preconceived biases
Inter-Rater Reliability (2)
Used with subjective tests when two or more independent raters are involved in scoring
Train the raters before scoring (e.g., TWE, dept. oral and composition tests for recommended students).
Inter-Rater Reliability (3)
Compare the scores given to the same testee by different raters. If the correlation r between the raters' scores is high, there is inter-rater reliability.
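As an illustration of comparing two raters' scores, here is a minimal Python sketch that computes a Pearson correlation. The slide does not say which correlation statistic is meant, so Pearson r is an assumption, and the rater scores below are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical composition scores for the same five testees from two raters
rater_a = [70, 80, 90, 60, 85]
rater_b = [72, 78, 88, 65, 90]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater r = {r:.2f}")
```

A value of r close to 1 means the two raters rank and space the testees in nearly the same way.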
Test Administration Reliability
Street noise (e.g., during a listening comprehension test)
Photocopying variations
Lighting
Variations in temperature
Condition of desks and chairs
Monitors
Test Reliability
Measurement errors come from the test itself:
  Test is too long
  Test with a time limit
  Test format allows for guessing
  Ambiguous test items
  Test with more than one correct answer
Reliability Coefficient (r)
Quantifies the reliability of a test and allows us to compare the reliability of different tests.
0 ≤ r ≤ 1 (ideal r= 1, which means the test gives precisely the same results for particular testees regardless of when it happened to be administered).
If r = 1: 100% reliable
A good achievement test: r ≥ .90
If r < .70: the test shouldn't be used
How to Get Reliability Coefficient
Two forms, two administrations: alternate/parallel forms
One form, two administrations: test-retest
One form, one administration (internal consistency):
  split-half (Spearman-Brown procedure)
  KR-21
  KR-20
Alternate/Parallel Forms
Two forms, two administrations: equivalent forms (i.e., different items testing the same topic) taken by the same test taker on different days
If r is high, this test is said to have good reliability.
This is the most stringent form of reliability estimate.
[Diagram: one test plan yields Form A and Form B]
Test-Retest
The same test is administered to the same testees after a short time lag, and r is then calculated from the two sets of scores.
Appropriate for highly speeded tests
[Diagram: Test A given as Trial 1 and Trial 2 (one form, two administrations)]
Split-half (Spearman-Brown Procedure)
One test, one administration
Split the test into halves (e.g., odd questions vs. even questions) to form two sets of scores.
Also called internal consistency
[Diagram: questions Q1 to Q6 divided into a first half and a second half]
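The odd/even split can be sketched in Python as follows. This is only an illustration under stated assumptions: the response matrix is hypothetical, items are scored 1 (correct) or 0 (wrong), and Pearson r is assumed as the correlation between the two halves.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_scores(responses):
    """Return (odd_totals, even_totals): one half-test total per test taker."""
    odd = [sum(row[0::2]) for row in responses]   # Q1, Q3, Q5, ...
    even = [sum(row[1::2]) for row in responses]  # Q2, Q4, Q6, ...
    return odd, even


# Hypothetical data: 5 test takers x 6 items (1 = correct, 0 = wrong)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

odd, even = split_half_scores(responses)
r_half = pearson_r(odd, even)
print(f"correlation between halves = {r_half:.2f}")
```

The value printed here is the correlation between two half-length tests, not yet the reliability of the full-length test.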
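Split-half (2)
Note that this r isn't the reliability of the whole test; it is the correlation between two half-length tests.
There is a mathematical relationship between test length and reliability: the longer the test, the more reliable it is.
Spearman-Brown Prophecy Formula: $\text{Rel}_{\text{total}} = \frac{nr}{1 + (n-1)r}$, where n is the factor by which the test is lengthened.
E.g., correlation between the 2 halves of a test: r = .6, so the reliability of the full test (n = 2) = .75
If the length is tripled instead (n = 3, starting from r = .6): reliability = .82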
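The two worked figures on this slide can be checked with a short Python sketch of the Spearman-Brown formula. The function name `spearman_brown` is just an illustrative choice.

```python
def spearman_brown(r, n):
    """Predicted reliability when a test with reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


# Half-test correlation of .6, full test is twice as long (n = 2)
print(round(spearman_brown(0.6, 2), 2))  # 0.75

# Same starting r of .6, test made three times as long (n = 3)
print(round(spearman_brown(0.6, 3), 2))  # 0.82
```

Both results match the slide's examples, which suggests the .82 figure uses n = 3 applied to the original r = .6.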
Kuder-Richardson Formula 21
$KR\text{-}21 = \frac{k}{k-1}\left[1 - \frac{\bar{x}\,(1 - \bar{x}/k)}{s^2}\right]$
k = number of items; $\bar{x}$ = mean; s = standard deviation (formula: see Bailey 100)
s describes how spread out a set of scores is (i.e., how far the scores deviate from the mean).
0 ≤ s; the larger s is, the more spread out the scores are.
E.g., 2 sets of scores: (5, 4, 3) and (7, 4, 1); which group in general behaves more similarly?
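To make the role of s concrete, the sketch below computes the standard deviation of the two score sets from the slide and then applies KR-21 to a hypothetical set of total scores. It assumes the population form of the standard deviation (dividing by N); the slide does not say whether N or N-1 is intended, and the 10-item test scores are invented for illustration.

```python
from statistics import mean, pstdev, pvariance


def kr21(k, scores):
    """KR-21 from the number of items k and a list of total scores."""
    x_bar = mean(scores)
    s2 = pvariance(scores)  # population variance (assumption: divide by N)
    return (k / (k - 1)) * (1 - x_bar * (1 - x_bar / k) / s2)


# The two score sets from the slide: same mean (4), different spread
set_a = [5, 4, 3]
set_b = [7, 4, 1]
print(f"s for (5, 4, 3) = {pstdev(set_a):.2f}")
print(f"s for (7, 4, 1) = {pstdev(set_b):.2f}")

# Hypothetical total scores of five test takers on a 10-item test
scores = [2, 4, 5, 6, 8]
print(f"KR-21 = {kr21(10, scores):.2f}")
```

The set (5, 4, 3) has the smaller s, so its scores behave more similarly.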
Kuder-Richardson Formula 20
$KR\text{-}20 = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s^2}\right)$
p = item difficulty (proportion of people who got an item right)
q = 1 - p (i.e., proportion of people who got an item wrong)
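A minimal Python sketch of KR-20 follows, assuming items are scored 1 (right) or 0 (wrong), that k is the number of items, and that s² is the population variance of the total scores. The response matrix is hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a matrix of 0/1 responses (rows = test takers, columns = items)."""
    n_people = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    s2 = pvariance(totals)  # variance of total scores (assumption: divide by N)
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_people  # proportion who got item j right
        sum_pq += p * (1 - p)                            # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / s2)


# Hypothetical data: 5 test takers x 4 items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20 = {kr20(responses):.2f}")
```

The function is undefined when every test taker has the same total score, because s² is then 0.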
Ways of Testing Reliability
Examine the amount of variation: Standard Error of Measurement (SEM)
  The smaller, the better
Calculate the reliability coefficient r
  The bigger, the better
Standard Error of Measurement (1)
The average SD of an individual's scores over a large number of testings
Essence of variability of scores of an individual
How large the error component is likely to be
Particularly useful in interpretation of test scores
$SEM = s\sqrt{1 - \text{rel.}}$ (s = standard deviation of the test scores; rel. = reliability coefficient)
Standard Error of Measurement (2)
Average of a set of scores= “true” score of the individual
X1 = T1 + E1
X2 = T2 + E2
...
Xn = Tn + En
Mean(X) = T + 0 (over many testings, the errors average out to 0)
Standard Error of Measurement (3)
E.g., GRE: SD = 100, rel. = .91, so $SEM = 100\sqrt{1 - .91} = 30$
How do we apply the SEM in the interpretation of a score?
For a given spread of scores, the greater the reliability coefficient, the smaller the SEM will be.
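The GRE example can be reproduced with a short Python sketch, which also shows one common way of applying the SEM: reporting a band of plus or minus one SEM around an observed score. The observed score of 500 is hypothetical, and the band interpretation (roughly 68% of the time, assuming normally distributed errors) is a standard convention rather than something stated on the slide.

```python
from math import sqrt


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


gre_sem = sem(100, 0.91)
print(f"SEM = {gre_sem:.1f}")  # SEM = 30.0

# Hypothetical observed score; band of +/- 1 SEM around it
observed = 500
low, high = observed - gre_sem, observed + gre_sem
print(f"score band: {low:.0f} to {high:.0f}")

# For the same SD, a higher reliability gives a smaller SEM
print(sem(100, 0.96) < sem(100, 0.91))  # True
```

Under these assumptions, an observed score of 500 would be read as a band from about 470 to 530 rather than as a single exact point.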
Ways of Enhancing Reliability
General strategies:
  Consider possible sources of unreliability
  Reduce or average out nonsystematic fluctuations in:
    raters
    persons
    test administration
    instruments
How to Make Tests More Reliable? (1)
Take enough samples of behavior
Try to avoid ambiguous items
Provide clear and explicit instructions
Ensure tests are well laid out and perfectly legible
Provide uniform and undistracted conditions of administration
Try to use objective tests
How to Make Tests More Reliable? (2)
Try to use direct tests
Have independent, trained raters
Provide a detailed scoring key
Try to identify the test takers by number, not by name
Try to have multiple independent scorings in subjective tests
(Hughes, 1989, pp. 36-42).