
Lesson Six

Reliability

Yun-Pi Yuan 2

Contents

Definition of reliability
Factors contributing to unreliability
Types of reliability
Indication of reliability: reliability coefficient
Ways of obtaining a reliability coefficient:
  Alternate/parallel forms
  Test-retest
  Split-half and KR-21/KR-20
Two ways of testing reliability
How to make tests more reliable

Definition of Reliability (1)

“The consistency of measures across different times, test forms, raters, and other characteristics of the measurement context” (Bachman, 1990, p. 24).

If you give the same test to the same testees on two different occasions, the test should yield similar results.

Definition of Reliability (2)

A reliable test is consistent and dependable.
Scores are consistent and reproducible.
Reliability is the accuracy or precision with which a test measures something; that is, the consistency, dependability, or stability of test results.

Factors Contributing to Unreliability

X = T + E (observed score = true score + error score)

Reliability is concerned with freedom from nonsystematic fluctuation.

Fluctuations in:
  the student
  scoring
  test administration
  the test itself

Types of Reliability

Student- (or person-) related reliability
Rater- (or scorer-) related reliability:
  Intra-rater reliability
  Inter-rater reliability
Test administration reliability
Test- (or instrument-) related reliability

Student-Related Reliability (1)

The error score comes from the test takers:
  Temporary illness
  Fatigue
  Anxiety
  Other physical or psychological factors
  Test-wiseness (i.e., strategies for efficient test taking)

Student-Related Reliability (2)

Principles:
  Assess on several occasions.
  Assess when the person is prepared and best able to perform well.
  Ensure that the person understands what is expected (e.g., instructions are clear).

Rater (or Scorer) Reliability (1)

Fluctuations: human error, subjectivity, and bias

Principles:
  Use experienced, trained raters.
  Use more than one rater.
  Raters should carry out their assessments independently.

Rater Reliability (2)

Two kinds of rater reliability:
  Intra-rater reliability
  Inter-rater reliability

Intra-Rater Reliability

Fluctuations include:
  Unclear scoring criteria
  Fatigue
  Bias toward particular good and bad students
  Simple carelessness

Inter-Rater Reliability (1)

Fluctuations include:
  Lack of attention to scoring criteria
  Inexperience
  Inattention
  Preconceived biases


Inter-Rater Reliability (2)

Used with subjective tests when two or more independent raters are involved in scoring.

Train the raters before scoring (e.g., TWE, dept. oral and composition tests for recommended students).


Inter-Rater Reliability (3)

Compare the scores given to the same testee by different raters. If r is high, there is inter-rater reliability.
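The comparison of two raters' scores can be sketched in a few lines of Python. This is only an illustration: it assumes r is the Pearson correlation coefficient (the slides do not name the statistic), and the scores and the names `rater_a` and `rater_b` are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one set of scores has no spread")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical composition scores for the same five testees from two raters.
rater_a = [78, 85, 62, 90, 70]
rater_b = [75, 88, 60, 92, 73]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater r = {r:.2f}")
```

The same function could be applied to two administrations of one test (test-retest) or to two equivalent forms (alternate forms), since each of those procedures also correlates two sets of scores from the same testees.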

Test Administration Reliability

Street noise (e.g., during a listening comprehension test)
Photocopying variations
Lighting
Variations in temperature
Condition of desks and chairs
Monitors

Test Reliability

Measurement errors come from the test itself:
  Test is too long
  Test with a time limit
  Test format allows for guessing
  Ambiguous test items
  Test with more than one correct answer

Reliability Coefficient (r)

Quantifying the reliability of a test allows us to compare the reliability of different tests.

0 ≤ r ≤ 1 (ideal r = 1, which means the test gives precisely the same results for particular testees regardless of when it happened to be administered).

If r = 1: 100% reliable
A good achievement test: r ≥ .90
If r < .70: the test shouldn't be used

How to Get a Reliability Coefficient

Two forms, two administrations: alternate/parallel forms
One form, two administrations: test-retest
One form, one administration (internal consistency):
  Split-half (Spearman-Brown procedure)
  KR-21
  KR-20

Alternate/Parallel Forms

Two forms, two administrations: equivalent forms (i.e., different items testing the same topic) taken by the same test takers on different days.

If r is high, the test is said to have good reliability.

This is the most stringent form of reliability estimate.

[Diagram: one test plan yields Form A and Form B]

Test-Retest

One form, two administrations.

The same test is administered twice to the same testees with a short time lag, and r is then calculated from the two sets of scores.

Appropriate for highly speeded tests.

[Diagram: Test A given as Trial 1 and Trial 2]

Split-half (Spearman-Brown Procedure)

One test, one administration.
Split the test into halves (e.g., odd questions vs. even questions) to form two sets of scores.
Also called internal consistency.

[Diagram: questions Q1 to Q6 divided into a first half and a second half]

Split-half (2)

Note that the r between the two halves isn't the reliability of the whole test.

There is a mathematical relationship between test length and reliability: the longer the test, the more reliable it is.

Spearman-Brown Prophecy Formula:

$$\mathrm{rel.}_{total} = \frac{nr}{1 + (n - 1)r}$$

E.g., if the correlation between the 2 parts of the test is r = .6, the reliability of the full test (n = 2) is .75.

If the test is lengthened to 3 times the length (n = 3), the reliability is .82.
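The two worked figures above can be checked with a short Python sketch of the Spearman-Brown prophecy formula. The function name `spearman_brown` is illustrative; only the formula and the value r = .6 come from the slide.

```python
def spearman_brown(r, n):
    """Predicted reliability when a test with reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


half_test_r = 0.6  # correlation between the two halves, as on the slide

print(round(spearman_brown(half_test_r, 2), 2))  # full test (two halves): 0.75
print(round(spearman_brown(half_test_r, 3), 2))  # three times the half length: 0.82
```

Note that n = 1 returns r unchanged, which is a quick sanity check on the formula.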

Kuder-Richardson Formula 21

$$\mathrm{KR\text{-}21} = \frac{k}{k-1}\left[1 - \frac{\bar{x}\,(1 - \bar{x}/k)}{s^{2}}\right]$$

k = number of items; $\bar{x}$ = mean; s = standard deviation (for the formula, see Bailey 100)

The standard deviation s describes how spread out a set of scores is (i.e., how far the scores deviate from the mean).

0 ≤ s; the larger s is, the more spread out the scores are.

E.g., 2 sets of scores: (5, 4, 3) and (7, 4, 1); which group in general behaves more similarly?
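A minimal Python sketch of KR-21 and of the spread question above. Two assumptions are made here that the slide does not state: the population form of the standard deviation is used, and the 20-item test with mean 12 and variance 16 is a made-up example.

```python
import statistics


def kr21(k, mean, variance):
    """KR-21 from the number of items k, the mean total score, and the score variance (s squared)."""
    if k < 2 or variance <= 0:
        raise ValueError("need at least 2 items and a positive variance")
    return (k / (k - 1)) * (1 - mean * (1 - mean / k) / variance)


# Spread of the two score sets from the slide; both have a mean of 4.
# Assumption: population SD (statistics.pstdev).
sd_first = statistics.pstdev([5, 4, 3])
sd_second = statistics.pstdev([7, 4, 1])
print(sd_first < sd_second)  # True: the first group's scores are more alike

# Hypothetical 20-item test with mean 12 and variance 16.
print(round(kr21(20, 12, 16), 3))  # 0.737
```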

Kuder-Richardson Formula 20

$$\mathrm{KR\text{-}20} = \frac{k}{k-1}\left[1 - \frac{\sum pq}{s^{2}}\right]$$

p = item difficulty (proportion of people who got an item right)
q = 1 - p (i.e., proportion of people who got an item wrong)
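KR-20 can be sketched in Python from a table of right/wrong item scores. The response data below are invented for illustration, and the population variance of the total scores is assumed for s², which the slide does not specify.

```python
import statistics


def kr20(responses):
    """KR-20 for a list of rows, one per test taker, each a list of 0/1 item scores."""
    n_people = len(responses)
    k = len(responses[0])
    if n_people < 2 or k < 2:
        raise ValueError("need at least 2 test takers and 2 items")
    totals = [sum(row) for row in responses]
    variance = statistics.pvariance(totals)  # assumption: population variance
    if variance == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_people  # proportion correct
        sum_pq += p * (1 - p)                               # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / variance)


# Hypothetical answers of five test takers on a four-item test (1 = right, 0 = wrong).
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(answers), 2))  # 0.8
```

Unlike KR-21, which needs only k, the mean, and s, this calculation uses the difficulty of every item, so it requires the item-level responses.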

Ways of Testing Reliability

Examine the amount of variation: Standard Error of Measurement (SEM). The smaller, the better.

Calculate the reliability coefficient (r). The bigger, the better.

Standard Error of Measurement (1)

The average SD of an individual's scores over a large number of testings
In essence, the variability of an individual's scores
Shows how large the error component is likely to be
Particularly useful in the interpretation of test scores

$$\mathrm{SEM} = S\sqrt{1 - \mathrm{rel.}}$$


Standard Error of Measurement (2)

Average of a set of scores = “true” score of the individual

X1 = T1 + E1
X2 = T2 + E2
...
Xn = Tn + En

On average: X = T + 0 (the errors average out to zero)
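The idea that errors average out over many testings can be illustrated with a small simulation. This is a sketch under assumptions not found on the slide: errors are drawn from a normal distribution with mean 0, and the true score of 500 and error SD of 30 are arbitrary values.

```python
import random
import statistics


def simulate_testings(true_score, error_sd, n_testings, seed=0):
    """Observed scores X = T + E for one individual, with random error E."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_testings)]


scores = simulate_testings(true_score=500, error_sd=30, n_testings=10_000)

# The mean of the observed scores should land close to the true score,
# and their SD should land close to the SD of the error component.
print(round(statistics.mean(scores), 1))
print(round(statistics.pstdev(scores), 1))
```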


Standard Error of Measurement (3)

E.g., GRE: SD = 100, rel. = .91, so SEM = 100 × √(1 - .91) = 30.

How do we apply the SEM in the interpretation of the score?

For a given spread of scores, the greater the reliability coefficient, the smaller the SEM will be.
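One way to apply the SEM is to report a band around the observed score instead of a single number. The sketch below reproduces the GRE figure above and then builds such a band. The observed score of 500 is hypothetical, and the reading that the true score falls within one SEM of the observed score about 68% of the time assumes normally distributed errors; neither is stated on the slide.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD times the square root of (1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def score_band(observed, sd, reliability, n_sem=1):
    """Band of n_sem standard errors on either side of an observed score."""
    margin = n_sem * sem(sd, reliability)
    return observed - margin, observed + margin


print(round(sem(100, 0.91)))  # 30, as in the GRE example

low, high = score_band(500, 100, 0.91)  # hypothetical observed score of 500
print(round(low), round(high))  # 470 530
```

Raising the reliability to .96 with the same SD gives an SEM of 20, which matches the point that a higher reliability coefficient means a smaller SEM.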

Ways of Enhancing Reliability

General strategies:
  Consider possible sources of unreliability.
  Reduce or average out nonsystematic fluctuations in:
    raters
    persons
    test administration
    instruments

How to Make Tests More Reliable? (1)

Take enough samples of behavior.
Try to avoid ambiguous items.
Provide clear and explicit instructions.
Ensure tests are well laid out and perfectly legible.
Provide uniform and non-distracting conditions of administration.
Try to use objective tests.

How to Make Tests More Reliable? (2)

Try to use direct tests.
Have independent, trained raters.
Provide a detailed scoring key.
Try to identify the test takers by number, not by name.
Try to have multiple independent scorings in subjective tests.

(Hughes, 1989, pp. 36-42).
