
Lesson Six

Reliability

Case

Imagine that a hundred students take a 100-item test at three o’clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if they had not taken the test on the Thursday but had instead taken it at three o’clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday?

Contents

Definition of reliability
Factors contributing to unreliability
Types of reliability
Indication of reliability: reliability coefficient
Ways of obtaining a reliability coefficient:
– Alternate/parallel forms
– Test-retest
– Split-half & KR-21/KR-20
Two ways of testing reliability
How to make tests more reliable
Online video

http://www.le.ac.uk/education/testing/ilta/faqs/main.html

Definition of Reliability (1)

“The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24).

If you give the same test to the same testees on two different occasions, the test should yield similar results.

Definition of Reliability (2)

A reliable test is consistent and dependable.

Scores are consistent and reproducible.
Reliability is the accuracy or precision with which a test measures something; that is, the consistency, dependability, or stability of test results.

Factors Contributing to Unreliability

X = T + E (observed score = true score + error score)

Concerned with freedom from nonsystematic fluctuation.

Fluctuations in
– the student
– scoring
– test administration
– the test itself
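As a rough illustration of X = T + E and the Wednesday/Thursday case, the sketch below simulates two sittings of the same test for students whose true scores do not change. The normally distributed error, the `error_sd` value, and all names are assumptions made for illustration only; they are not part of the lesson.

```python
import random


def simulate_observed_scores(true_scores, error_sd, seed=None):
    """Return observed scores X = T + E for each true score T.

    E is drawn from a normal distribution with mean 0 and standard
    deviation error_sd (an assumed error model, for illustration only).
    """
    rng = random.Random(seed)
    return [t + rng.gauss(0, error_sd) for t in true_scores]


# Hypothetical true scores for five students.
true_scores = [55, 62, 70, 78, 85]

# Same true scores, different random error on each day.
wednesday = simulate_observed_scores(true_scores, error_sd=3, seed=1)
thursday = simulate_observed_scores(true_scores, error_sd=3, seed=2)

for t, w, th in zip(true_scores, wednesday, thursday):
    print(f"T={t}  Wednesday X={w:.1f}  Thursday X={th:.1f}")
```

With `error_sd=0` the observed scores equal the true scores; the larger the error component, the more the two sittings disagree.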

Types of Reliability

Student- (or person-) related reliability
Rater- (or scorer-) related reliability
– Intra-rater reliability
– Inter-rater reliability
Test administration reliability
Test- (or instrument-) related reliability

Student-Related Reliability (1)

The source of the error score is the test takers:
– Temporary illness
– Fatigue
– Anxiety
– Other physical or psychological factors
– Test-wiseness (i.e., strategies for efficient test taking)

Student-Related Reliability (2)

Principles:
– Assess on several occasions
– Assess when the person is prepared and best able to perform well
– Ensure that the person understands what is expected (e.g., instructions are clear)

Rater (or Scorer) Reliability (1)

Fluctuations: including human error, subjectivity, and bias

Principles:
– Use experienced, trained raters.
– Use more than one rater.
– Raters should carry out their assessments independently.

Rater Reliability (2)

Two kinds of rater reliability:
– Intra-rater reliability
– Inter-rater reliability

Intra-Rater Reliability

Fluctuations including:
– Unclear scoring criteria
– Fatigue
– Bias toward particular “good” and “bad” students
– Simple carelessness

Inter-Rater Reliability (1)

Fluctuations including:
– Lack of attention to scoring criteria
– Inexperience
– Inattention
– Preconceived biases


Used with subjective tests when two or more independent raters are involved in scoring

Train the raters before scoring (e.g., TWE, dept. oral and composition tests for recommended students).


Compare the scores of the same testee given by different raters.

If the correlation (r) between the raters’ scores is high, there is inter-rater reliability.
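The comparison of two raters' scores can be sketched as a Pearson correlation. This is a minimal example under stated assumptions: the lesson does not say which correlation coefficient is meant, so Pearson's r is assumed here, and the scores and names are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2 scores)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical composition scores given to the same six testees
# by two independent raters.
rater_a = [4, 5, 3, 6, 2, 5]
rater_b = [4, 6, 3, 5, 2, 5]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater r = {r:.2f}")
```

The closer r is to 1, the more closely the two raters agree on how the testees rank relative to one another.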

Test Administration Reliability

Street noise
– Listening comprehension test
Photocopying variations
Lighting
Variations in temperature
Condition of desks and chairs
Monitors

Test Reliability

Measurement errors come from the test itself:
– Test is too long
– Test with a time limit
– Test format allows for guessing
– Ambiguous test items
– Test with more than one correct answer

Ways of Enhancing Reliability

General strategies:
Consider possible sources of unreliability
– Reduce or average out nonsystematic fluctuations in raters, persons, test administration, and instruments

How to Make Tests More Reliable? (1)

Take enough samples of behavior
Try to avoid ambiguous items
Provide clear and explicit instructions
Ensure tests are well laid out and perfectly legible
Provide uniform and non-distracting conditions of administration
Try to use objective tests

How to Make Tests More Reliable? (2)

Try to use direct tests
Have independent, trained raters
Provide a detailed scoring key
Try to identify the test takers by number, not by name
Try to have multiple independent scorings in subjective tests (Hughes, 1989, pp. 36-42).

Reliability Coefficient (r)

Quantifying the reliability of a test allows us to compare the reliability of different tests.

0 ≤ r ≤ 1 (ideal r= 1, which means the test gives precisely the same results for particular testees regardless of when it happened to be administered).

If r = 1: 100% reliable
A good achievement test: r ≥ .90
If r < .70: the test shouldn’t be used

How to Get Reliability Coefficient

Type of Reliability: How to Measure

Stability or Test-Retest: Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2.

Alternate Form: Create two forms of the same test (vary the items slightly). Reliability is stated as the correlation between scores on Test 1 and Test 2.

Internal Consistency (Alpha, α): Compare one half of the test to the other half. Or, use methods such as Kuder-Richardson Formula 20 (KR-20) or Cronbach's Alpha.

How to Get Reliability Coefficient

Two forms, two administrations: alternate/parallel forms
One form, two administrations: test-retest
One form, one administration (internal consistency):
– split-half (Spearman-Brown procedure)
– KR-21
– KR-20

Alternate/Parallel Forms

Two forms, two administrations:
– Equivalent forms (i.e., different items testing the same topic) taken by the same test taker on different days
– If r is high, this test is said to have good reliability.
– The most stringent form

[Diagram: one test plan, two forms (Form A and Form B)]

Test-Retest

The same test is administered to the same testees with a short time lag, and then r is calculated.

Appropriate for highly speeded tests

[Diagram: Test A, Trial 1 and Trial 2 (one form, two administrations)]

Split-half (Spearman-Brown Procedure)

One test, one administration
Split the test into halves (e.g., odd questions vs. even questions) to form two sets of scores.
Also called internal consistency

[Diagram: questions Q1 to Q6 divided into a First Half and a Second Half]
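The split-half procedure can be sketched as follows, assuming an odd/even split and items scored 1 (right) or 0 (wrong). The response data and function names are hypothetical, and Pearson's r is an assumed choice of correlation. Note that the value computed here is only the correlation between the two halves.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_scores(responses):
    """Return (odd-item totals, even-item totals), one pair per student.

    responses: one list per student of 0/1 item scores, in item order.
    """
    odd = [sum(row[0::2]) for row in responses]   # Q1, Q3, Q5, ...
    even = [sum(row[1::2]) for row in responses]  # Q2, Q4, Q6, ...
    return odd, even


# Hypothetical answers of five students to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

odd, even = split_half_scores(responses)
r_half = pearson_r(odd, even)
print(f"odd-half totals:  {odd}")
print(f"even-half totals: {even}")
print(f"correlation between halves = {r_half:.2f}")
```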

Split-half (2)

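Note that this r (the correlation between the two halves) isn’t the reliability of the whole test.
There is a mathematical relationship between test length and reliability: the longer the test, the more reliable it is.
Spearman-Brown Prophecy Formula:

$$\mathrm{rel}_{\mathrm{total}} = \frac{nr}{1 + (n-1)\,r}$$

(n = the factor by which the test is lengthened; r = the correlation between the parts)
E.g., if the correlation between the 2 parts of the test is r = .6, the reliability of the full test (n = 2) = .75.
If the test is lengthened to 3 times (n = 3): rel. = .82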
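The worked figures on this slide can be checked with a few lines of code. This is only a sketch of the formula nr / (1 + (n − 1)r); the function name is chosen here for illustration.

```python
def spearman_brown(r, n):
    """Predicted reliability when a test with reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


# Correlation between the two halves of the test, as in the slide's example.
r_half = 0.6

print(round(spearman_brown(r_half, 2), 2))  # 0.75
print(round(spearman_brown(r_half, 3), 2))  # 0.82
```

Because each extra multiple of length adds less than the one before, lengthening a test improves reliability with diminishing returns.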

Kuder-Richardson formula 21

$$\mathrm{KR\text{-}21} = \frac{k}{k-1}\left[1 - \frac{\bar{x}\,(1 - \bar{x}/k)}{s^{2}}\right]$$

k = number of items; $\bar{x}$ = mean; s = standard deviation (for the formula, see Bailey, p. 100)

– s describes how spread out a set of scores is (i.e., how far the scores deviate from the mean)
– 0 ≤ s; the larger s, the more spread out the scores
– E.g., 2 sets of scores: (5, 4, 3) and (7, 4, 1); which group in general behaves more similarly?
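A small sketch of both ideas on this slide: the spread of the two example score sets, and KR-21 computed from total scores. It assumes the population standard deviation (dividing by N rather than N − 1), which the slide does not specify; the eight total scores on a 10-item test are invented for illustration.

```python
from statistics import mean, pstdev, pvariance

# Spread of the two example sets from the slide.
print(round(pstdev([5, 4, 3]), 2))  # 0.82
print(round(pstdev([7, 4, 1]), 2))  # 2.45


def kr21(total_scores, k):
    """KR-21 from students' total scores on a k-item test.

    Uses the population variance of the totals (an assumption).
    """
    m = mean(total_scores)
    s2 = pvariance(total_scores)
    return (k / (k - 1)) * (1 - (m * (1 - m / k)) / s2)


# Hypothetical total scores of eight students on a 10-item test.
scores = [2, 4, 5, 7, 9, 10, 3, 8]
print(round(kr21(scores, k=10), 2))  # 0.76
```

The set (5, 4, 3) has the smaller s, so its scores are closer together than those in (7, 4, 1).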

Kuder-Richardson formula 20

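$$\mathrm{KR\text{-}20} = \frac{k}{k-1}\left[1 - \frac{\sum pq}{s^{2}}\right]$$

k = number of items; s = standard deviation of the total scores
p = item difficulty (proportion of people who got an item right)
q = 1 − p (i.e., proportion of people who got an item wrong)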
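KR-20 can be sketched directly from a table of right/wrong item scores. As before, the population variance of the total scores is an assumption, and the response data and function name are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a test whose items are scored 1 (right) or 0 (wrong).

    responses: one list per student of 0/1 item scores, in item order.
    """
    n_students = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    s2 = pvariance(totals)  # variance of total scores (population form, assumed)

    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_students  # proportion right
        q = 1 - p                                             # proportion wrong
        sum_pq += p * q

    return (k / (k - 1)) * (1 - sum_pq / s2)


# Hypothetical answers of five students to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

print(round(kr20(responses), 2))  # 0.75
```

Unlike KR-21, this version uses each item's own difficulty rather than treating all items as equally difficult.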

Ways of Testing Reliability

Examine the amount of variation
– Standard Error of Measurement (SEM)
– The smaller the better

Calculate the “reliability coefficient”
– “r”
– The bigger the better
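The two indicators are linked. The slide does not give a formula for SEM; the sketch below assumes the usual classical test theory relationship SEM = s × √(1 − r), where s is the standard deviation of the test scores. The example values are invented.

```python
from math import sqrt


def standard_error_of_measurement(s, r):
    """SEM = s * sqrt(1 - r), assuming the classical test theory formula."""
    if not 0 <= r <= 1:
        raise ValueError("reliability coefficient r must be between 0 and 1")
    return s * sqrt(1 - r)


# Hypothetical test: standard deviation of 10 points, reliability of .91.
print(round(standard_error_of_measurement(10, 0.91), 2))  # 3.0

# A perfectly reliable test (r = 1) has no measurement error.
print(standard_error_of_measurement(10, 1.0))  # 0.0
```

Under this formula, a bigger r always gives a smaller SEM for the same spread of scores, which is why the two rules of thumb point in opposite directions.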
