Lesson Six
Reliability
Case
Imagine that a hundred students take a 100-item test at three o’clock one Thursday afternoon. The test is neither impossibly difficult nor ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if in fact they had not taken the test on the Thursday but had taken it at three o’clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday?
Contents
Definition of reliability
Factors contributing to unreliability
Types of reliability
Indication of reliability: the reliability coefficient
Ways of obtaining a reliability coefficient:
– Alternate/parallel forms
– Test-retest
– Split-half & KR-21/KR-20
Two ways of testing reliability
How to make tests more reliable
Online video
http://www.le.ac.uk/education/testing/ilta/faqs/main.html
Definition of Reliability (1)
“The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24).
If you give the same test to the same testees on two different occasions, the test should yield similar results.
Definition of Reliability (2)
A reliable test is consistent and dependable.
Scores are consistent and reproducible.
The accuracy or precision with which a test measures something; that is, the consistency, dependability, or stability of test results.
Factors Contributing to Unreliability
X = T + E (observed score = true score + error score)
Concerned with freedom from nonsystematic fluctuation.
Fluctuations in:
– the student
– scoring
– test administration
– the test itself
Types of Reliability
Student- (or person-) related reliability
Rater- (or scorer-) related reliability
– Intra-rater reliability
– Inter-rater reliability
Test administration reliability
Test (or instrument-related) reliability
Student-Related Reliability (1)
The source of the error score comes from the test takers:
– Temporary illness
– Fatigue
– Anxiety
– Other physical or psychological factors
– Test-wiseness (i.e., strategies for efficient test taking)
Student-Related Reliability (2)
Principles:
– Assess on several occasions
– Assess when the person is prepared and best able to perform well
– Ensure that the person understands what is expected (e.g., instructions are clear)
Rater (or Scorer) Reliability (1)
Fluctuations include human error, subjectivity, and bias.
Principles:
– Use experienced, trained raters.
– Use more than one rater.
– Raters should carry out their assessments independently.
Rater Reliability (2)
Two kinds of rater reliability:– Intra-rater reliability– Inter-rater reliability
Intra-Rater Reliability
Fluctuations include:
– Unclear scoring criteria
– Fatigue
– Bias toward particular (good or bad) students
– Simple carelessness
Inter-Rater Reliability (1)
Fluctuations include:
– Lack of attention to scoring criteria
– Inexperience
– Inattention
– Preconceived biases
Inter-Rater Reliability (2)
Used with subjective tests when two or more independent raters are involved in scoring
Train the raters before scoring (e.g., TWE, dept. oral and composition tests for recommended students).
Inter-Rater Reliability (3)
Compare the scores of the same testee given by different raters.
If r is high, there is inter-rater reliability.
Test Administration Reliability
– Street noise (e.g., during a listening comprehension test)
– Photocopying variations
– Lighting
– Variations in temperature
– Condition of desks and chairs
– Monitors
Test Reliability
Measurement errors come from the test itself:
– Test is too long
– Test has a time limit
– Test format allows for guessing
– Ambiguous test items
– Test has more than one correct answer
Ways of Enhancing Reliability
General strategies: consider possible sources of unreliability.
– Reduce or average out nonsystematic fluctuations in raters, persons, test administration, and instruments.
How to Make Tests More Reliable? (1)
Take enough samples of behavior
Try to avoid ambiguous items
Provide clear and explicit instructions
Ensure tests are well laid out and perfectly legible
Provide uniform and non-distracting administration conditions
Try to use objective tests
How to Make Tests More Reliable? (2)
Try to use direct tests
Have independent, trained raters
Provide a detailed scoring key
Try to identify test takers by number, not by name
Try to have multiple independent scorings in subjective tests (Hughes, 1989, pp. 36-42).
Reliability Coefficient (r)
Quantifies the reliability of a test and allows us to compare the reliability of different tests.
0 ≤ r ≤ 1 (ideal r= 1, which means the test gives precisely the same results for particular testees regardless of when it happened to be administered).
If r = 1: 100% reliable
A good achievement test: r ≥ .90
If r < .70, the test should not be used
How to Get Reliability Coefficient
How to measure each type of reliability:
– Stability (test-retest): give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2.
– Alternate form: create two forms of the same test (varying the items slightly). Reliability is stated as the correlation between scores on Test 1 and Test 2.
– Internal consistency (alpha, α): compare one half of the test to the other half, or use methods such as Kuder-Richardson Formula 20 (KR-20) or Cronbach's alpha.
How to Get Reliability Coefficient
Two forms, two administrations: alternate/parallel forms
One form, two administrations: test-retest
One form, one administration (internal consistency):
– split-half (Spearman-Brown procedure)
– KR-21
– KR-20
Alternate/Parallel Forms
Two forms, two administrations:
– Equivalent forms (i.e., different items testing the same topic) taken by the same test taker on different days
– If r is high, the test is said to have good reliability.
– The most stringent approach
One test plan → Form A and Form B
Test-Retest
The same test is administered to the same testees after a short time lag, and then r is calculated.
Appropriate for highly speeded tests
Test A → Trial 1 and Trial 2 (one form, two administrations)
Split-half (Spearman-Brown Procedure)
One test, one administration
Split the test into halves (e.g., odd-numbered vs. even-numbered questions) to form two sets of scores.
Also called internal consistency
Items Q1–Q6 are split into two halves (First Half vs. Second Half)
Split-half (2)
Note that the r isn’t the reliability of the test A math relationship between test length and reliabilit
y: the longer the test, the more reliable it is. Rel.total = nr/1+ (n-1)r Spearman & Brown Prophec
y Formula E.g., correlation between 2 parts of test; r= .6 rel.
of full test = .75 If lengthen the test items into 3 times: r= .82
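The prophecy formula can be checked directly; the call below reproduces the slide's worked example (half-test r = .6):

```python
def spearman_brown(r, n):
    """Predicted reliability of a test lengthened n times,
    given reliability r of the original-length scores."""
    return n * r / (1 + (n - 1) * r)

# Half-test correlation r = .6:
full = spearman_brown(0.6, 2)    # full test (n = 2): .75
tripled = spearman_brown(0.6, 3)  # three times the half length (n = 3): about .82
```

Note that reliability grows with length but with diminishing returns: doubling again would not double the gain.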
Kuder-Richardson Formula 21
KR-21 = [k/(k−1)]{1 − [x̄(1 − x̄/k)]/s²}
k = number of items; x̄ = mean; s = standard deviation (for the formula, see Bailey, p. 100)
– s describes the spread in a set of scores (i.e., score deviations from the mean)
– 0 ≤ s; the larger s is, the more spread out the scores
– E.g., two sets of scores: (5, 4, 3) and (7, 4, 1); which group behaves more similarly overall?
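A sketch of KR-21 with hypothetical total scores. It needs only the totals, not item-level data; note that texts differ on whether s² is the population or sample variance (population variance is assumed here):

```python
from statistics import mean, pvariance

def kr21(scores, k):
    """KR-21 from total scores alone: k = number of items,
    x_bar = mean score, s2 = variance of the total scores."""
    x_bar = mean(scores)
    s2 = pvariance(scores)  # population variance; some texts use sample variance
    return (k / (k - 1)) * (1 - (x_bar * (1 - x_bar / k)) / s2)

# Hypothetical totals for seven testees on a 20-item test.
reliability = kr21([18, 15, 12, 20, 10, 16, 14], k=20)
```

For this made-up data the estimate comes out around .66, below the .70 floor mentioned earlier.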
Kuder-Richardson Formula 20
KR-20 = [k/(k−1)][1 − (Σpq/s²)]
p = item difficulty (proportion of people who got an item right)
q = 1 − p (proportion of people who got an item wrong)
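Unlike KR-21, KR-20 needs the full right/wrong matrix, because Σpq is summed item by item. A sketch with invented data (rows = testees, columns = items):

```python
from statistics import pvariance

def kr20(matrix):
    """KR-20 from a 0/1 response matrix: rows = testees, columns = items."""
    k = len(matrix[0])  # number of items
    n = len(matrix)     # number of testees
    s2 = pvariance([sum(row) for row in matrix])  # variance of total scores
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # item difficulty
        sum_pq += p * (1 - p)                  # p * q for this item
    return (k / (k - 1)) * (1 - sum_pq / s2)

# Invented responses: six testees, five items.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
reliability = kr20(responses)
```

When every item has the same difficulty, KR-20 and KR-21 agree; otherwise KR-21 gives a lower (more conservative) estimate.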
Ways of Testing Reliability
Examine the amount of variation:
– Standard Error of Measurement (SEM)
– The smaller, the better
Calculate the reliability coefficient:
– r
– The bigger, the better
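The two indicators are linked: a standard formula (not given on the slide) is SEM = s·√(1 − r), where s is the standard deviation of observed scores, so a bigger r directly yields a smaller SEM.

```python
import math

def sem(s, r):
    """Standard Error of Measurement: s = SD of observed scores,
    r = reliability coefficient."""
    return s * math.sqrt(1 - r)

# e.g. with s = 10 and r = .91: SEM = 10 * sqrt(.09), i.e. about 3 score points
error_band = sem(10, 0.91)
```

Roughly two-thirds of the time, a testee's true score lies within one SEM of the observed score, which is why a smaller SEM means a more trustworthy test.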