Characteristics of a Good Test
![Page 1: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/1.jpg)
CHARACTERISTICS OF A GOOD TEST
A. RELIABILITY
![Page 2: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/2.jpg)
Reliability
• Reliability is synonymous with consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications.
• No psychological test is completely consistent; however, a measurement that is unreliable is worthless.
![Page 3: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/3.jpg)
Would you keep using these measurement tools?
The consistency of test scores is critically important in determining whether a test can provide good measurement.
![Page 4: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/4.jpg)
When someone says you are a ‘reliable’ person, what do they really mean?
Are you a reliable person?
![Page 5: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/5.jpg)
Reliability (cont.)
* Because no unit of measurement is exact, any time you measure something (observed score), you are really measuring two things.
1. True Score - the portion of the observed score that truly represents what you are intending to measure.
2. Error Component - the portion of the observed score that is due to other variables.
Observed Test Score = True Score + Errors of Measurement
![Page 6: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/6.jpg)
Measurement Error
• Any fluctuation in test scores that results from factors related to the measurement process that are irrelevant to what is being measured.
• The difference between the observed score and the true score is called the error score.
S true = S observed - S error
![Page 7: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/7.jpg)
Measurement Error is Reduced By:
- Writing items clearly
- Making instructions easily understood
- Adhering to proper test administration
- Providing consistent scoring
![Page 8: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/8.jpg)
Determining Reliability
• There are several ways to determine reliability, depending on the type of measurement and the supporting data required. They include:
- Internal Consistency
- Test-retest Reliability
- Inter-rater Reliability
- Split-half Methods
- Odd-even Reliability
- Alternate Forms Methods
![Page 9: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/9.jpg)
Internal Consistency
• Measures the reliability of a test based solely on the number of items on the test and the intercorrelation among the items. Therefore, it compares each item to every other item.
Cronbach’s Alpha:
- .80 to .95 (Excellent)
- .70 to .80 (Very Good)
- .60 to .70 (Satisfactory)
- <.60 (Suspect)
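As a rough illustration of how an internal consistency coefficient can be obtained, the sketch below computes Cronbach's alpha with the standard formula (k / (k - 1)) × (1 - sum of item variances / variance of total scores). The formula is not given on the slide, and the response matrix, the function names, and the treatment of values above .95 as "Excellent" are assumptions made only for this example.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: one row per test taker, one column per item (hypothetical data)."""
    k = len(scores[0])                              # number of items
    items = list(zip(*scores))                      # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def interpret_alpha(alpha):
    """Labels follow the scale above; values over .95 are treated as Excellent (assumption)."""
    if alpha >= 0.80:
        return "Excellent"
    if alpha >= 0.70:
        return "Very Good"
    if alpha >= 0.60:
        return "Satisfactory"
    return "Suspect"

# Hypothetical ratings: 5 test takers, 4 items
responses = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 1, 2, 2],
    [4, 4, 3, 4],
]
alpha = cronbach_alpha(responses)
label = interpret_alpha(alpha)
```

With real data the matrix would hold every test taker's score on every item; the labels are only as meaningful as the cut-offs chosen.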
![Page 10: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/10.jpg)
Split Half & Odd-Even Reliability
Split Half - refers to determining a correlation between the first half of the measurement and the second half of the measurement (i.e., we would expect answers to the first half to be similar to the second half).
Odd-Even - refers to the correlation between even items and odd items of a measurement tool.
• In this sense, we are using a single test to create two tests, eliminating the need for additional items and multiple administrations.
• Since both of these types need only one administration and the groups are determined by the internal components of the test, they are referred to as internal consistency measures.
![Page 11: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/11.jpg)
Split-half reliability [error due to differences in item content between the halves of the test]
• Typically, responses on odd versus even items are employed
• Correlate total scores on odd items with the scores obtained on even items
| Person | Odd | Even |
|--------|-----|------|
| 1      | 36  | 43   |
| 2      | 44  | 40   |
| 3      | 42  | 37   |
| 4      | 33  | 40   |
Items 1 to 100 (50 pairs)
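The odd-even procedure described above can be sketched as follows: sum each person's odd-numbered and even-numbered items, then correlate the two sets of totals. This is a minimal sketch with a hypothetical 0/1 response matrix; the function names are illustrative and the Pearson coefficient is written out by hand so the block has no dependencies.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_reliability(responses):
    """responses: one row per person, one 0/1 entry per item (hypothetical data)."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)

# Hypothetical data: 5 persons, 6 items scored right (1) or wrong (0)
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
r_odd_even = odd_even_reliability(responses)
```

A first-half versus second-half split would use the same correlation, with the slices replaced by the first and last halves of each row.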
![Page 12: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/12.jpg)
Test-retest Reliability
• Test-retest reliability is usually measured by computing the correlation coefficient between scores of two administrations.
![Page 13: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/13.jpg)
Test-retest Reliability (cont.)
• The amount of time allowed between measures is critical.
• The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time.
• Optimum time between administrations is 2 to 4 weeks.
• The rationale behind this method is that any difference between the scores of the test and the retest should be due solely to measurement error.
![Page 14: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/14.jpg)
Inter-rater Reliability
• Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.
![Page 15: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/15.jpg)
Inter-rater Reliability (cont.)
• For some scales it is important to assess interrater reliability.
• Interrater reliability means that if two different raters scored the scale using the scoring rules, they should attain the same result.
• Interrater reliability is usually measured by computing the correlation coefficient between the scores of two raters for the set of respondents.
• Here the criterion of acceptability is pretty high (e.g., a correlation of at least .9), but what is considered acceptable will vary from situation to situation.
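A minimal sketch of the check described above: correlate two raters' scores for the same respondents and compare the coefficient with the acceptability criterion. The scores, the function names, and the default cut-off of .9 (taken from the example figure above) are assumptions for illustration only.

```python
from math import sqrt

def rater_correlation(rater_a, rater_b):
    """Pearson correlation between two raters' scores for the same respondents."""
    n = len(rater_a)
    mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in zip(rater_a, rater_b))
    saa = sum((a - mean_a) ** 2 for a in rater_a)
    sbb = sum((b - mean_b) ** 2 for b in rater_b)
    return sab / sqrt(saa * sbb)

def meets_criterion(rater_a, rater_b, minimum=0.9):
    """True when the inter-rater correlation reaches the chosen cut-off (assumed .9)."""
    return rater_correlation(rater_a, rater_b) >= minimum

# Hypothetical scores given by two raters to the same five respondents
rater_a = [12, 15, 9, 20, 17]
rater_b = [11, 16, 10, 19, 18]
r_raters = rater_correlation(rater_a, rater_b)
acceptable = meets_criterion(rater_a, rater_b)
```

Because the acceptable level varies from situation to situation, the cut-off is left as a parameter rather than fixed.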
![Page 16: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/16.jpg)
Parallel/Alternate Forms Method
Parallel/Alternate Forms Method - refers to the administration of two alternate forms of the same measurement device and then comparing the scores.
• Both forms are administered to the same person and the scores are correlated. If the two produce the same results, then the instrument is considered reliable.
![Page 17: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/17.jpg)
Parallel/Alternate Forms Method (cont.)
• A correlation between these two forms is computed just as in the test-retest method.
Advantages
• Eliminates the problem of memory effect.
• Reactivity effects (i.e., experience of taking the test) are also partially controlled.
![Page 18: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/18.jpg)
Factors Affecting Reliability
• Administrator Factors
• Number of Items on the instrument
• The Instrument Taker
• Heterogeneity of the Items
• Heterogeneity of the Group Members
• Length of Time between Test and Retest
![Page 19: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/19.jpg)
How High Should Reliability Be?
• A highly reliable test is always preferable to a test with lower reliability.
- .80 or greater (Excellent)
- .70 to .80 (Very Good)
- .60 to .70 (Satisfactory)
- <.60 (Suspect)
• A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.
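The 20% figure follows from the classical test theory reading of a reliability coefficient as the share of observed-score variance that is true-score variance, assuming errors are unrelated to true scores. The sketch below makes that arithmetic explicit; the variance values and function names are hypothetical.

```python
def reliability_from_variances(true_var, error_var):
    """Reliability as true-score variance divided by observed-score variance.

    Assumes observed variance = true variance + error variance
    (errors uncorrelated with true scores).
    """
    return true_var / (true_var + error_var)

def error_variance_share(reliability):
    """Proportion of observed-score variance attributed to measurement error."""
    return 1.0 - reliability

# Hypothetical variances: 80 units of true-score variance, 20 of error variance
reliability = reliability_from_variances(80, 20)
error_share = error_variance_share(reliability)
```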
![Page 20: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/20.jpg)
Reliability deals with consistency.
Reliability is the quality that guarantees us that we will get similar results when conducting the same test on the same population every time.
Consider this ruler…
![Page 21: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/21.jpg)
Now compare this ruler…
With this one…
![Page 22: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/22.jpg)
Each ruler will give the same answer each time…
But this one will be wrong each time…
![Page 23: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/23.jpg)
Each ruler is reliable…
But reliability doesn’t mean much when it is wrong…
![Page 24: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/24.jpg)
So, not only do we require reliability…
We also need…
![Page 25: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/25.jpg)
VALIDITY
Good Ruler
Bad Ruler
![Page 26: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/26.jpg)
VALIDITY
Validity deals with the accuracy of the measurement
![Page 27: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/27.jpg)
Validity
• Depends on the PURPOSE
  - E.g. a ruler may be a valid measuring device for length, but isn’t very valid for measuring volume
• Measuring what ‘it’ is supposed to
• Matter of degree (how valid?)
• Specific to a particular purpose!
• Learning outcomes
1. Content coverage (relevance?)
2. Level & type of student engagement (cognitive, affective, psychomotor) – appropriate?
![Page 28: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/28.jpg)
Types of validity measures
• Face validity
• Construct validity
• Content validity
• Criterion validity
![Page 29: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/29.jpg)
Face Validity
Does it appear to measure what it is supposed to measure?
Example: Let’s say you are interested in measuring ‘Propensity towards violence and aggression’. By simply looking at the following items, state which ones qualify to measure the variable of interest:
• Have you been arrested?
• Have you been involved in physical fighting?
• Do you get angry easily?
• Do you sleep with your socks on?
• Is it hard to control your anger?
• Do you enjoy playing sports?
![Page 30: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/30.jpg)
Construct Validity
• Does the test measure the ‘human’ theoretical construct or trait?
• Examples:
  - Mathematical reasoning
  - Verbal reasoning or fluency
  - Musical ability
  - Spatial ability
  - Motivation
• Applicable to authentic assessment
• Each construct is broken down into its component parts
• E.g. ‘motivation’ can be broken down to:
  - Interest
  - Attention span
  - Hours spent
  - Assignments undertaken and submitted, etc.
• All of these sub-constructs put together measure ‘motivation’
![Page 31: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/31.jpg)
Content Validity
How well do elements of the test relate to the content domain?
How closely does the content of questions in the test relate to the content of the curriculum?
Directly relates to instructional objectives and the fulfillment of the same!
Major concern for achievement tests (where content is emphasized)
Can you test students on things they have not been taught?
![Page 32: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/32.jpg)
How to establish Content Validity?
• Instructional objectives (looking at your list)
• Table of Specification
• E.g. At the end of the chapter, the student will be able to do the following:
1. Explain what ‘stars’ are
2. Discuss the types of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our star, the Sun, and all other stars
![Page 33: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/33.jpg)
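Table of Specification (An Example)

Columns 2 to 4 are the Categories of Performance (Mental Skills).

| Content areas        | Knowledge | Comprehension | Analysis | Total       |
|----------------------|-----------|---------------|----------|-------------|
| 1. What are ‘stars’? |           |               |          |             |
| 2. Our star, the Sun |           |               |          |             |
| 3. Constellations    |           |               |          |             |
| 4. Galaxies          |           |               |          |             |
| Total                |           |               |          | Grand Total |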
![Page 34: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/34.jpg)
Criterion Validity
The degree to which content on a test (predictor) correlates with performance on relevant criterion measures (concrete criterion in the "real" world?)
If they do correlate highly, it means that the test (predictor) is a valid one!
E.g. if you taught skills relating to ‘public speaking’ and had students do a test on it, the test can be validated by looking at how it relates to actual performance (public speaking) of students inside or outside of the classroom
![Page 35: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/35.jpg)
Factors that can lower Validity
• Unclear directions
• Difficult reading vocabulary and sentence structure
• Ambiguity in statements
• Inadequate time limits
• Inappropriate level of difficulty
• Poorly constructed test items
• Test items inappropriate for the outcomes being measured
• Tests that are too short
• Improper arrangement of items (complex to easy?)
• Identifiable patterns of answers
• Teaching
• Administration and scoring
• Students
• Nature of criterion
![Page 36: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/36.jpg)
Validity and Reliability
• Neither Valid nor Reliable
• Reliable but not Valid
• Valid & Reliable
• Fairly Valid but not very Reliable
Think in terms of ‘the purpose of tests’ and the ‘consistency’ with which the purpose is fulfilled/met
![Page 37: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/37.jpg)
Objectivity
• The state of being fair, without bias or external influence.
• If the test is marked by different people, the score will be the same. In other words, the marking process should not be affected by the marker's personality.
• Not influenced by emotion or personal prejudice. Based on observable phenomena; presented factually: an objective appraisal.
• The questions and answers should be clear.
![Page 38: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/38.jpg)
An objective test:
• measures an individual's characteristics in a way that is independent of the rater’s bias or the examiner's own beliefs.
• gauges the test taker's conscious thoughts and feelings without regard to the test administrator's beliefs or biases.
• helps greatly in determining the test taker's personality.
![Page 39: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/39.jpg)
Understanding Norms
• Norms are a list of scores and corresponding percentile ranks, standard scores, or other transformed scores of a group of examinees on whom a test was standardized.
• “In a psychometric context, norms are the test performance data of a particular group of test takers that are designed for use as a reference for evaluating or interpreting individual test scores” (Cohen & Swerdlik, 2002, p. 100).
![Page 40: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/40.jpg)
TYPES OF NORMS
• Percentiles
- refer to a distribution divided into 100 equal parts.
- refer to the score at or below which a specific percentage of scores fall.
Ex. A student obtained a percentile rank of 90 on the NAT exam. What does this mean?
![Page 41: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/41.jpg)
It means that 90% of his classmates scored at or below his score, and 10% of his classmates scored above it.
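The percentile idea can be sketched as a small calculation: count the scores in the reference group that fall at or below a given score and express that count as a percentage. The norm-group scores and the function name below are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of scores in the reference group at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)

# Hypothetical reference group of ten scores
norm_scores = [55, 60, 62, 68, 70, 74, 77, 81, 85, 93]
rank_of_85 = percentile_rank(85, norm_scores)
```

Here a score of 85 sits at or above nine of the ten scores, so only one score in the group lies above it.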
![Page 42: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/42.jpg)
• Age Norms (age-equivalent scores)
– “indicate the average performance of different samples of test takers who were at various ages at the time the test was administered” (Cohen & Swerdlik, 2002, p. 105).
• Grade Norms
– Used to indicate the average test performance of test takers in a specific grade.
– Based on a ten-month scale; refers to grade and month (e.g., 7.3 is equivalent to seventh grade, third month).
![Page 43: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/43.jpg)
• National Norms
– Derived from a standardization sample nationally representative of the population of interest.
• Subgroup Norms
– Are created when narrowly defined groups are sampled. Ex.
  • Socioeconomic status
  • Handedness
  • Education level
![Page 44: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/44.jpg)
• Local Norms
– Are derived from the local population’s performance on a measure.
– Typically created locally (i.e., by guidance counselor, personnel director, etc.)
• Fixed Reference Group Scoring Systems
– Calculation of test scores is based on a fixed reference group that was tested in the past.
![Page 45: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/45.jpg)
• Norm-referenced tests consider the individual’s score relative to the scores of test takers in the normative sample.
• Criterion-referenced tests consider the individual’s score relative to a specified standard or criterion (cut score).
– Licensure exams
– Proficiency tests
![Page 46: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/46.jpg)
Item Analysis
A name given to a variety of statistical techniques designed to analyze individual items on a test
It involves examining class-wide performance on individual test items.
It sometimes suggests why an item has not functioned effectively and how it might be improved
A test composed of items revised and selected on the basis of item-analysis is almost certain to be more reliable than the one composed of an equal number of untested items.
![Page 47: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/47.jpg)
Difficulty index
The proportion of students in the class who got an item correct. The larger the proportion, the more students have learned the content measured by the item.
![Page 48: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/48.jpg)
Discrimination index
• A basic measure of the validity of an item.
• A measure of an item’s ability to discriminate between those who scored high on the total test and those who scored low.
• It can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skill is related to the response on an item.
![Page 49: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/49.jpg)
Analysis of response options/distracter analysis
In addition to examining the performance of a test item, teachers are often interested in examining the performance of individual distracters (incorrect answer options) on multiple-choice items.
By calculating the proportion of students who chose each answer option, teachers can identify which distracters are working and appear to be attractive to students who do not know the correct answer, and which distracters are simply taking up space and not being chosen by many students
![Page 50: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/50.jpg)
To eliminate blind guessing which results in a correct answer purely by chance (which hurts the validity of a test item), teachers want as many plausible distracters as is feasible.
![Page 51: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/51.jpg)
The process of item analysis
1. Arrange the test scores from highest to lowest.
2. Select the criterion groups.
• Identify a High group and a Low group. The High group is the highest-scoring 27% of the examinees and the Low group is the lowest-scoring 27%.
• Each 27% of the examinees is called a criterion group. This fraction provides the best compromise between two desirable but conflicting aims: to make the extreme groups as large as possible and as different as possible.
• We can then say with confidence that those in the High group are superior to those in the Low group in the ability measured by the test.
![Page 52: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/52.jpg)
3. For each item, count the number of examinees in the High group who have correct responses. Do a separate, similar count for the Low group.
4. Solve for the difficulty index of each item.
• The larger the value of the index, the easier the item.
• The smaller the value, the more difficult the item.
Scale for interpreting the difficulty index of an item:
- Below 0.25: item is very difficult
- 0.25 – 0.75: item is of average difficulty, or item is rightly difficult
- Above 0.75: item is very easy
![Page 53: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/53.jpg)
Example: Item analysis
1. Count the scores and arrange them from highest to lowest. Ex. n = 43 scores
2. Calculate the size of each criterion group (N), which is 27% of the total number of scores. Ex. N = 27% of 43 = (0.27)(43) = 11.61 ≈ 12
3. Take 12 scores from the highest down and 12 scores from the lowest up; call these the High group and the Low group respectively.
4. Tabulate the number of responses for each option from the High and Low groups for the particular item under analysis.
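Steps 1 to 3 above can be sketched in a few lines: sort the scores, take 27% of their count rounded to the nearest whole number, and slice off the top and bottom groups. The score list and function name are hypothetical, and rounding to the nearest integer is an assumption that happens to match the worked figure of 12 for 43 scores.

```python
def criterion_groups(scores, fraction=0.27):
    """Return (high_group, low_group), each holding `fraction` of the scores."""
    ordered = sorted(scores, reverse=True)           # highest to lowest
    size = max(1, round(fraction * len(ordered)))    # e.g. 0.27 * 43 = 11.61 -> 12
    return ordered[:size], ordered[-size:]

# Hypothetical set of 43 distinct test scores
scores = list(range(1, 44))
high_group, low_group = criterion_groups(scores)
```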
![Page 54: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/54.jpg)
5. Solve for the difficulty index of each item. The larger the value of the index, the easier the item. The smaller, the more difficult.
Scale for interpreting the difficulty index of an item:
- Below 0.25: item is very difficult
- 0.25 – 0.75: item is of average difficulty, or item is rightly difficult
- Above 0.75: item is very easy
![Page 55: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/55.jpg)
Ex: Item # 5 of the Multiple Choice test; D is the correct option.

| Group       | A | B | C | D* | E | Total |
|-------------|---|---|---|----|---|-------|
| Upper Group | 1 | 1 | 0 | 9  | 1 | 12    |
| Lower Group | 3 | 1 | 4 | 4  | 0 | 12    |
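A distracter analysis of the item # 5 tallies can be sketched as below. The counts come from the example; the rule that a distracter is "working" when it draws more Lower Group than Upper Group examinees is an assumption used here for illustration, not a rule stated in the slides, and the function name is hypothetical.

```python
# Response tallies for item # 5 (D is the keyed answer)
upper = {"A": 1, "B": 1, "C": 0, "D": 9, "E": 1}
lower = {"A": 3, "B": 1, "C": 4, "D": 4, "E": 0}
KEY = "D"

def distracter_report(upper, lower, key):
    """Proportion choosing each wrong option, and whether it attracts low scorers."""
    total = sum(upper.values()) + sum(lower.values())
    report = {}
    for option in upper:
        if option == key:
            continue                                  # skip the correct answer
        report[option] = {
            "proportion": (upper[option] + lower[option]) / total,
            # Assumed rule of thumb: chosen more often by the Lower Group
            "attracts_low_group": lower[option] > upper[option],
        }
    return report

report = distracter_report(upper, lower, KEY)
```

Under this assumed rule, options A and C are flagged as attracting the Lower Group, while B and E are not.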
![Page 56: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/56.jpg)
The following can be used to interpret the index of discrimination.

| Idis index  | Description | Interpretation                   |
|-------------|-------------|----------------------------------|
| 0.40 – 1.0  | High        | The item is very good            |
| 0.30 – 0.39 | Moderate    | Reasonably good, can be improved |
| 0.20 – 0.29 | Moderate    | In need of improvement           |
| < 0.20      | Low         | Poor, to be discarded            |
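The interpretation bands for the index of discrimination can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, and treating each band's lower bound as inclusive (so that a value such as 0.395 falls in the 0.30 band) is an assumption, since the bands as listed leave small gaps.

```python
def interpret_discrimination(idis):
    """Map an index of discrimination to (description, interpretation)."""
    if idis >= 0.40:
        return ("High", "The item is very good")
    if idis >= 0.30:
        return ("Moderate", "Reasonably good, can be improved")
    if idis >= 0.20:
        return ("Moderate", "In need of improvement")
    return ("Low", "Poor, to be discarded")

description, interpretation = interpret_discrimination(0.42)
```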
![Page 57: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/57.jpg)
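• Interpreting the results by giving value judgment

| Idis          | Idif           | Item category           |
|---------------|----------------|-------------------------|
| High          | Easy           | Good                    |
| High          | Easy/difficult | Fair                    |
| Moderate      | Easy/difficult | Fair                    |
| High/moderate | Easy/difficult | Fair                    |
| Low           | At any level   | Poor (Discard the item) |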
![Page 58: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/58.jpg)
Here Hc and Lc are the numbers of correct responses in the High and Low groups, and N is the number of examinees in each group.

Index of difficulty = (Hc + Lc) / 2N = (9 + 4) / (2 × 12) = .54
• The item is rightly difficult.

Index of discrimination = (Hc – Lc) / N = (9 – 4) / 12 = .42
• High index of discrimination: the item has the power to discriminate.

Hence, item number 5 has to be retained.
Distracter analysis: A and C are good distracters.
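The two formulas can be checked with a short sketch that reproduces the figures for item number 5. The function names are hypothetical; `hc` and `lc` stand for the numbers of correct responses in the High and Low groups, and `n` for the size of each group, as read from the worked example.

```python
def difficulty_index(hc, lc, n):
    """(Hc + Lc) / 2N: proportion of both criterion groups answering correctly."""
    return (hc + lc) / (2 * n)

def discrimination_index(hc, lc, n):
    """(Hc - Lc) / N: difference in correct responses relative to group size."""
    return (hc - lc) / n

def interpret_difficulty(idif):
    """Scale from the slides: below 0.25, 0.25 to 0.75, above 0.75."""
    if idif < 0.25:
        return "very difficult"
    if idif <= 0.75:
        return "average difficulty"
    return "very easy"

# Item # 5: 9 correct in the High group, 4 in the Low group, 12 examinees per group
idif = difficulty_index(9, 4, 12)
idis = discrimination_index(9, 4, 12)
difficulty_label = interpret_difficulty(idif)
```

Rounded to two decimals these values agree with the .54 and .42 reported for the item.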
![Page 59: Characteristics of a good test](https://reader033.vdocument.in/reader033/viewer/2022061210/548f868cb47959d7668b45d4/html5/thumbnails/59.jpg)
Thank you and God bless us all!