CHARACTERISTICS OF A GOOD TEST

A. RELIABILITY

Reliability

• Reliability is synonymous with consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications.

• No psychological test is completely consistent; however, a measurement that is unreliable is worthless.

Would you keep using these measurement tools?

The consistency of test scores is critically important in determining whether a test can provide good measurement.

When someone says you are a ‘reliable’ person, what do they really mean?

Are you a reliable person?

Reliability (cont.)

• Because no unit of measurement is exact, any time you measure something (observed score), you are really measuring two things.

1. True Score - the portion of the observed score that truly represents what you are intending to measure.

2. Error Component - the portion of the observed score that comes from other variables affecting the measurement.

Observed Test Score = True Score + Errors of Measurement

Measurement Error

• Any fluctuation in test scores that results from factors related to the measurement process that are irrelevant to what is being measured.

• The difference between the observed score and the true score is called the error score.

True Score = Observed Score - Error Score

Measurement Error is Reduced By:

- Writing items clearly

- Making instructions easily understood

- Adhering to proper test administration

- Providing consistent scoring

Determining Reliability

• There are several ways to determine reliability, depending on the type of measurement and the supporting data required. They include:

- Internal Consistency

- Test-retest Reliability

- Interrater Reliability

- Split-half Methods

- Odd-even Reliability

- Alternate Forms Methods

Internal Consistency

• Measures the reliability of a test based solely on the number of items on the test and the intercorrelation among the items. In effect, it compares each item to every other item.

Cronbach’s Alpha:

.80 to .95 (Excellent)

.70 to .80 (Very Good)

.60 to .70 (Satisfactory)

<.60 (Suspect)
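
As an illustration of how such a coefficient might be obtained, here is a minimal sketch. It assumes the standard formula alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), which the slide itself does not state. The function names and the sample data are hypothetical, and the labels follow the scale above.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for rows of item scores (one row per test taker)."""
    k = len(scores[0])                                   # number of items
    item_variances = [variance(col) for col in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def describe_alpha(alpha):
    """Label a coefficient using the scale on this slide."""
    if alpha >= 0.80:
        return "Excellent"
    if alpha >= 0.70:
        return "Very Good"
    if alpha >= 0.60:
        return "Satisfactory"
    return "Suspect"

# Hypothetical data: 5 test takers, 3 items each (not from the slides)
sample = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
]
alpha = cronbach_alpha(sample)
print(round(alpha, 3), describe_alpha(alpha))
```

Sample variances (n - 1 in the denominator) are used for both the items and the totals; any consistent choice gives the same ratio.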

Split Half & Odd-Even Reliability

Split Half - refers to determining a correlation between the first half of the measurement and the second half of the measurement (i.e., we would expect answers to the first half to be similar to the second half).

Odd-Even - refers to the correlation between even items and odd items of a measurement tool.

• In this sense, we are using a single test to create two tests, eliminating the need for additional items and multiple administrations.

• Since in both of these types only 1 administration is needed and the groups are determined by the internal components of the test, it is referred to as an internal consistency measure.

Split-half reliability [error due to differences in item content between the halves of the test]

• Typically, responses on odd versus even items are employed

• Correlate total scores on odd items with the scores obtained on even items

Person   Odd   Even
1        36    43
2        44    40
3        42    37
4        33    40

(1 ... 100: 50 pairs)
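
The odd-even procedure can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, the correlation used is the ordinary Pearson coefficient, and the data are the four person rows shown in the table, which are far too few for a meaningful estimate.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_totals(item_scores):
    """Total one person's scores on odd-numbered and even-numbered items."""
    odd = sum(item_scores[0::2])    # items 1, 3, 5, ...
    even = sum(item_scores[1::2])   # items 2, 4, 6, ...
    return odd, even

# Odd and even totals for the four persons listed in the table
odd_scores = [36, 44, 42, 33]
even_scores = [43, 40, 37, 40]
r = pearson_r(odd_scores, even_scores)
print(round(r, 3))
```

The same `pearson_r` helper would serve for test-retest, alternate-forms, and interrater reliability, since each of those is described here as a correlation between two sets of scores.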

Test-retest Reliability

• Test-retest reliability is usually measured by computing the correlation coefficient between the scores of two administrations.

Test-retest Reliability (cont.)

• The amount of time allowed between measures is critical.

• The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because observations made closer together in time are more strongly related.

• Optimum time between administrations is 2 to 4 weeks.

• The rationale behind this method is that the difference between the scores of the test and the retest should be due solely to measurement error.

Interrater Reliability

• Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.

Interrater Reliability (cont.)

• For some scales it is important to assess interrater reliability.

• Interrater reliability means that if two different raters scored the scale using the scoring rules, they should attain the same result.

• Interrater reliability is usually measured by computing the correlation coefficient between the scores of two raters for the set of respondents.

• Here the criterion of acceptability is pretty high (e.g., a correlation of at least .9), but what is considered acceptable will vary from situation to situation.

Parallel/Alternate Forms Method

Parallel/Alternate Forms Method - refers to the administration of two alternate forms of the same measurement device and then comparing the scores.

• Both forms are administered to the same person and the scores are correlated. If the two produce the same results, then the instrument is considered reliable.

Parallel/Alternate Forms Method (cont.)

• A correlation between these two forms is computed just as in the test-retest method.

Advantages

• Eliminates the problem of memory effect.

• Reactivity effects (i.e., experience of taking the test) are also partially controlled.

Factors Affecting Reliability

• Administrator Factors

• Number of Items on the instrument

• The Instrument Taker

• Heterogeneity of the Items

• Heterogeneity of the Group Members

• Length of Time between Test and Retest

How High Should Reliability Be?

• A highly reliable test is always preferable to a test with lower reliability.

.80 or greater (Excellent)

.70 to .80 (Very Good)

.60 to .70 (Satisfactory)

<.60 (Suspect)

• A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.
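
The arithmetic behind this statement is simple enough to sketch. The function name is hypothetical, and the reading of the coefficient as the share of score variance attributable to true scores is an assumption taken from classical test theory rather than from the slide.

```python
def variance_shares(reliability):
    """Split score variance into (true-score share, error share)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability, 1.0 - reliability

true_share, error_share = variance_shares(0.80)
print(f"{error_share:.0%} of score variability is measurement error")
```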

Reliability deals with consistency.

Reliability is the quality that guarantees us that we will get similar results when conducting the same test on the same population every time.

Consider this ruler…

Now compare this ruler…

With this one…

Each ruler will give the same answer each time…

But this one will be wrong each time…

Each ruler is reliable…

But reliability doesn’t mean much when it is wrong…

So, not only do we require reliability…

We also need…

VALIDITY

Good Ruler

Bad Ruler

VALIDITY

Validity deals with the accuracy of the measurement

Validity Depends on the PURPOSE

• E.g., a ruler may be a valid measuring device for length, but isn’t very valid for measuring volume

• Measuring what ‘it’ is supposed to

• Matter of degree (how valid?)

• Specific to a particular purpose!

• Learning outcomes

1. Content coverage (relevance?)

2. Level & type of student engagement (cognitive, affective, psychomotor) – appropriate?

Types of validity measures

• Face validity

• Construct validity

• Content validity

• Criterion validity

Face Validity

Does it appear to measure what it is supposed to measure?

Example: Let’s say you are interested in measuring ‘Propensity towards violence and aggression’. By simply looking at the following items, state which ones qualify to measure the variable of interest:

- Have you been arrested?

- Have you been involved in physical fighting?

- Do you get angry easily?

- Do you sleep with your socks on?

- Is it hard to control your anger?

- Do you enjoy playing sports?

Construct Validity

• Does the test measure the ‘human’ theoretical construct or trait?

• Examples:

- Mathematical reasoning

- Verbal reasoning or fluency

- Musical ability

- Spatial ability

- Motivation

• Applicable to authentic assessment

• Each construct is broken down into its component parts

• E.g., ‘motivation’ can be broken down into:

- Interest

- Attention span

- Hours spent

- Assignments undertaken and submitted, etc.

• All of these sub-constructs, put together, measure ‘motivation’

Content Validity

How well do the elements of the test relate to the content domain?

How closely does the content of the questions in the test relate to the content of the curriculum?

Directly relates to instructional objectives and the fulfillment of the same!

Major concern for achievement tests (where content is emphasized)

Can you test students on things they have not been taught?

How to establish Content Validity?

• Instructional objectives (looking at your list)

• Table of Specification

E.g., at the end of the chapter, the student will be able to do the following:

1. Explain what ‘stars’ are

2. Discuss the types of stars and galaxies in our universe

3. Categorize different constellations by looking at the stars

4. Differentiate between our star, the Sun, and all other stars

Table of Specification (An Example)

                         Categories of Performance (Mental Skills)
Content areas            Knowledge   Comprehension   Analysis   Total
1. What are ‘stars’?
2. Our star, the Sun
3. Constellations
4. Galaxies
Total                                                           Grand Total

Criterion Validity

The degree to which content on a test (predictor) correlates with performance on relevant criterion measures (concrete criterion in the "real" world?)

If they do correlate highly, it means that the test (predictor) is a valid one!

E.g. if you taught skills relating to ‘public speaking’ and had students do a test on it, the test can be validated by looking at how it relates to actual performance (public speaking) of students inside or outside of the classroom

Factors that can lower Validity

• Unclear directions

• Difficult reading vocabulary and sentence structure

• Ambiguity in statements

• Inadequate time limits

• Inappropriate level of difficulty

• Poorly constructed test items

• Test items inappropriate for the outcomes being measured

• Tests that are too short

• Improper arrangement of items (complex to easy?)

• Identifiable patterns of answers

• Teaching

• Administration and scoring

• Students

• Nature of criterion

Validity and Reliability

• Neither Valid nor Reliable

• Reliable but not Valid

• Valid & Reliable

• Fairly Valid but not very Reliable

Think in terms of ‘the purpose of tests’ and the ‘consistency’ with which the purpose is fulfilled/met

Objectivity

• The state of being fair, without bias or external influence.

• If the test is marked by different people, the score will be the same. In other words, the marking process should not be affected by the marker's personality.

• Not influenced by emotion or personal prejudice. Based on observable phenomena; presented factually: an objective appraisal.

• The questions and answers should be clear.

• Measures an individual's characteristics in a way that is independent of the rater’s bias or the examiner's own beliefs.

• Gauges the test taker's conscious thoughts and feelings without regard to the test administrator's beliefs or biases.

• Helps greatly in determining the test taker's personality.

Understanding Norms

Norms are a list of scores and corresponding percentile ranks, standard scores, or other transformed scores of a group of examinees on whom a test was standardized.

“In a psychometric context, norms are the test performance data of a particular group of test takers that are designed for use as a reference for evaluating or interpreting individual test scores” (Cohen & Swerdlik, 2002, p. 100).

TYPES OF NORMS

• Percentiles - refer to a distribution divided into 100 equal parts.

- refer to the score at or below which a specific percentage of scores fall.

Ex. A student got a percentile rank of 90 on the NAT exam. What does this mean?

It means that 90% of his classmates scored at or below his score, and 10% of his classmates scored above it.
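
A percentile rank of this kind can be computed directly from a list of scores. The sketch below assumes the "at or below" definition given above; the function name and the ten scores are hypothetical.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in all_scores that are at or below the given score."""
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)

# Hypothetical class of ten scores (not from the slides)
class_scores = [55, 60, 62, 70, 71, 75, 80, 82, 88, 95]
rank = percentile_rank(88, class_scores)
print(rank)
```

Here nine of the ten scores are at or below 88, so the student with 88 has a percentile rank of 90.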

Age Norms (age-equivalent scores)

–“indicate the average performance of different samples of test takers who were at various ages at the time the test was administered” (Cohen & Swerdlik, 2002, p. 105).

Grade Norms

– Used to indicate the average test performance of test takers in a specific grade.

– Based on a ten-month scale; refers to grade and month (e.g., 7.3 is equivalent to seventh grade, third month).

National Norms

– Derived from a standardization sample nationally representative of the population of interest.

Subgroup Norms

– Are created when narrowly defined groups are sampled. Ex.:

• Socioeconomic status

• Handedness

• Education level

Local Norms

– Are derived from the local population’s performance on a measure.

– Typically created locally (i.e., by guidance counselor, personnel director, etc.)

Fixed Reference Group Scoring Systems

• Calculation of test scores is based on a fixed reference group that was tested in the past.

• Norm-referenced tests consider the individual’s score relative to the scores of test takers in the normative sample.

• Criterion-referenced tests consider the individual’s score relative to a specified standard or criterion (cut score).

– Licensure exams

– Proficiency tests

Item Analysis

A name given to a variety of statistical techniques designed to analyze individual items on a test

It involves examining class-wide performance on individual test items.

It sometimes suggests why an item has not functioned effectively and how it might be improved

A test composed of items revised and selected on the basis of item-analysis is almost certain to be more reliable than the one composed of an equal number of untested items.

Difficulty index

The proportion of students in the class who got an item correct. The larger the proportion, the more students have learned the content measured by the item.

Discrimination index

A basic measure of the validity of an item.

A measure of an item’s ability to discriminate between those who scored high on the total test and those who scored low.

It can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skill is related to the response on an item.

Analysis of response options/distracter analysis

In addition to examining the performance of a test item, teachers are often interested in examining the performance of individual distracters (incorrect answer options) on multiple-choice items.

By calculating the proportion of students who chose each answer option, teachers can identify which distracters are working and appear to be attractive to students who do not know the correct answer, and which distracters are simply taking up space and not being chosen by many students.

To reduce blind guessing, which can produce a correct answer purely by chance (and which hurts the validity of a test item), teachers want as many plausible distracters as is feasible.

The process of item analysis

1. Arrange the test scores from highest to lowest.

2. Select the criterion groups.

Identify a High group and a Low group. The High group is the highest-scoring 27% of the group and the Low group is the lowest-scoring 27%.

Each 27% of the examinees is called a criterion group. This percentage provides the best compromise between two desirable but inconsistent aims: to make the extreme groups as large as possible and as different as possible. We can then say with confidence that those in the High group are superior in the ability measured by the test to those in the Low group.

3. For each item, count the number of examinees in the High group who have correct responses. Do a separate, similar count for the Low group.

4. Solve for the difficulty index of each item. The larger the value of the index, the easier the item. The smaller the value, the more difficult the item.

Scale for interpreting the difficulty index of an item

Below 0.25     item is very difficult

0.25 – 0.75    item is of average difficulty or item is rightly difficult

Above 0.75     item is very easy
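
Steps 1 to 4 can be sketched as follows. This is an illustration under stated assumptions: the function names and the ten-examinee data set are hypothetical, responses are coded 1 for correct and 0 for incorrect, the group size is 27% of the examinees rounded to a whole number, and the difficulty index is computed as (Hc + Lc) / 2N, the formula used in the worked example later in these slides.

```python
def criterion_groups(examinees, fraction=0.27):
    """examinees: list of (total_score, responses). Returns (high, low) groups."""
    ranked = sorted(examinees, key=lambda e: e[0], reverse=True)   # step 1
    n = round(fraction * len(ranked))                              # step 2
    if n == 0:
        raise ValueError("too few examinees to form criterion groups")
    return ranked[:n], ranked[-n:]

def difficulty_index(high, low, item):
    """(Hc + Lc) / 2N for one item, where N is the size of each group."""
    hc = sum(responses[item] for _, responses in high)             # step 3
    lc = sum(responses[item] for _, responses in low)
    return (hc + lc) / (len(high) + len(low))                      # step 4

def describe_difficulty(index):
    """Label an index using the scale above."""
    if index < 0.25:
        return "very difficult"
    if index <= 0.75:
        return "average difficulty"
    return "very easy"

# Hypothetical data: (total score, [item 0 correct?, item 1 correct?])
examinees = [
    (48, [1, 1]), (45, [1, 1]), (44, [1, 0]), (40, [1, 1]), (38, [0, 1]),
    (35, [1, 1]), (30, [0, 1]), (28, [0, 1]), (25, [0, 1]), (20, [1, 1]),
]
high, low = criterion_groups(examinees)
for item in range(2):
    index = difficulty_index(high, low, item)
    print(item, round(index, 2), describe_difficulty(index))
```

With ten examinees each criterion group holds three people; with larger classes the same code scales without change.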

Example: Item analysis

1. Count and arrange the scores from highest to lowest. Ex. n = 43 scores

2. Calculate the size of the criterion group (N), which is 27% of the total number of scores. Ex. N = 27% of 43 = (0.27)(43) = 11.61, rounded to 12

3. Take 12 scores from the highest down and 12 scores from the lowest up; call these the High group and the Low group respectively.

4. Tabulate the number of responses for each option from the High and Low groups for the particular item under analysis.

5. Solve for the difficulty index of each item. The larger the value of the index, the easier the item. The smaller, the more difficult.

Scale for interpreting the difficulty index of an item

Below 0.25     item is very difficult

0.25 – 0.75    item is of average difficulty or item is rightly difficult

Above 0.75     item is very easy

Ex: Item #5 of the multiple-choice test; D is the correct option (marked *).

              A   B   C   D*  E   Total
Upper Group   1   1   0   9   1   12
Lower Group   3   1   4   4   0   12

The following can be used to interpret the index of discrimination (Idis).

Idis Index     Description   Interpretation
0.40 – 1.00    High          The item is very good
0.30 – 0.39    Moderate      Reasonably good, can be improved
0.20 – 0.29    Moderate      In need of improvement
< 0.20         Low           Poor, to be discarded

Idis            Idif             Item category
High            Easy             Good
High            Easy/difficult   Fair
Moderate        Easy/difficult   Fair
High/moderate   Easy/difficult   Fair
Low             At any level     Poor (Discard the item)

• Interpreting the results by giving value judgment

Here Hc and Lc are the numbers of correct responses in the High and Low groups, and N is the size of each group.

Index of difficulty = (Hc + Lc) / 2N = (9 + 4) / (2 × 12) = 0.54: the item is rightly difficult

Index of discrimination = (Hc – Lc) / N = (9 – 4) / 12 = 0.42: high index of discrimination; the item has the power to discriminate

Hence, item number 5 has to be retained.
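
The two computations for item 5 can be checked with a short sketch. The function names are hypothetical; the formulas and the cut-offs are the ones given in these slides.

```python
def item_indices(high_correct, low_correct, group_size):
    """Return (difficulty, discrimination) for one item."""
    difficulty = (high_correct + low_correct) / (2 * group_size)
    discrimination = (high_correct - low_correct) / group_size
    return difficulty, discrimination

def describe_discrimination(index):
    """Label an index using the Idis interpretation table."""
    if index >= 0.40:
        return "High: the item is very good"
    if index >= 0.30:
        return "Moderate: reasonably good, can be improved"
    if index >= 0.20:
        return "Moderate: in need of improvement"
    return "Low: poor, to be discarded"

# Item 5: 9 correct in the Upper group, 4 in the Lower group, 12 per group
difficulty, discrimination = item_indices(9, 4, 12)
print(round(difficulty, 2), round(discrimination, 2))
print(describe_discrimination(discrimination))
```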

Distracter analysis: A and C are good distracters
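
One way to arrive at this conclusion is sketched below. The rule used is an assumption, not something the slides state: a distracter is treated as working when more Lower-group than Upper-group examinees choose it. The names are hypothetical; the counts are those tabulated for item 5.

```python
# Option counts for item 5, taken from the table; D is the keyed answer
upper = {"A": 1, "B": 1, "C": 0, "D": 9, "E": 1}
lower = {"A": 3, "B": 1, "C": 4, "D": 4, "E": 0}
key = "D"

def working_distracters(upper, lower, key):
    """List distracters chosen more often by the Lower group than the Upper group."""
    return [
        option
        for option in upper
        if option != key and lower[option] > upper[option]
    ]

good = working_distracters(upper, lower, key)
print(good)
```

Under this rule, options B and E would be candidates for revision, since neither attracts more low scorers than high scorers.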

Thank you and God bless us all!
