Introduction to Testing and Measurement
TRANSCRIPT
Testing: Basic Definitions
• Assessment - process of documenting knowledge, skills, attitudes, and/or beliefs
• Evaluation - the making of a judgment about the amount, number, or value
• Measurement - quantitative (involves assigning numbers)
• Testing - form of measurement
Basic Definitions (Continued)
• Reliability - Measures consistency
• Validity - Valid to the degree that accomplishes purpose
• Objective - To the degree that two or more reasonable persons given a key will agree
Mean (Arithmetic Average - the sum divided by the count)
• Advantages
– Calculation includes all scores
– Indicates “typical” score for group
• Disadvantages
– Easily distorted by extreme scores
Median (Midpoint - place the numbers in value order and find the middle number)
• Advantages
– Not easily distorted by extremely high or low scores
• Disadvantages
– Does not take into account the value of all the scores in the group
Mean or median?
“Rule of Thumb”
• use median when extremely high or low scores (outliers) are present;
• use the mean for most other situations
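A minimal sketch of the rule of thumb, using Python's standard `statistics` module. The scores are made-up illustration data, not taken from any real test.

```python
from statistics import mean, median

# Hypothetical class scores (illustration only).
scores = [70, 72, 75, 78, 80]

# The same class with one extremely low score (an outlier) added.
scores_with_outlier = scores + [5]

# Without the outlier, mean and median agree on the "typical" score.
print("no outlier:   mean =", mean(scores), " median =", median(scores))

# With the outlier, the mean drops sharply while the median barely moves.
print("with outlier: mean =", round(mean(scores_with_outlier), 2),
      " median =", median(scores_with_outlier))
```

Here a single outlier pulls the mean down by more than 11 points and the median by only 1.5, which is why the median is preferred when outliers are present.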
Standard Deviation
• Indicates by how much the scores in a distribution typically deviate from the mean
• In a normal distribution, the mean falls at the 50th percentile of the norm group, with approximately:
– 68% of scores within 1 SD above or below the mean
– 95% of scores within 2 SD above or below the mean
– 99.7% of scores within 3 SD above or below the mean
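The 68 / 95 / 99.7 percentages can be checked against the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is only an illustrative choice.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of scores within k SDs above or below the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")

# The mean itself sits at the 50th percentile.
print(f"proportion below the mean: {std_normal.cdf(0):.0%}")
```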
Normal Curve - Properties
• Symmetrical, bell-shaped
• Total area under the curve represents total number of scores in the distribution
• Vertical lines mark sub-areas and represent proportions of scores falling in a particular range
• Points along baseline correspond to standard deviations away from the mean
Validity of Test Scores
• The extent to which the scores on the test are representative of what you are trying to measure
– Example - Does the science test measure only the knowledge of science, or does it also depend on reading ability and therefore measure both science and reading ability?
Types of Validity
• Content Validity
– Determined by the degree to which the questions or items are representative of the universe of behavior the test was designed to sample (does the test assess what it claims to assess?)
• Criterion-Related Validity
– Determined by whether there is a relationship between a test and an immediate criterion measure
– Example - a driving test, employment
Factors That Can Reduce Validity
• Factors in the Test
– Vague Directions
– Irrelevant Items
– Poorly Constructed Items
– Items that Contain Clues to the Correct Answer
– Too Few or Improperly Sequenced Items
What Affects Validity (Continued)
• Factors in Test Administration and Scoring
– Insufficient Time to Complete the Test
– Testing Environment
– Undetected Cheating
– Inappropriate Help or Coaching
– Poorly Motivated Students
– Unreliable Item Scoring
What Affects Validity (Continued)
• Factors Affecting Pupil Responses
– High Level of Fear or Anxiety About Taking the Test
– A Tendency to Rush Through the Test
– Guessing
Reliability of Test Scores
• Consistency
• Measure of confidence that, if the same individuals were retested under similar conditions, the results could be replicated
Types of Reliability
• Test-Retest: Coefficient of Stability
• Alternate Form: Coefficient of Equivalence
• Internal Consistency: Consistency of examinee across test items
• Interrater Reliability: Consistency of judges or scorers
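A coefficient of stability is commonly computed as the Pearson correlation between scores from two administrations of the same test. The sketch below assumes that approach; the function name `pearson_r` and both score lists are hypothetical illustration data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)

# Hypothetical scores for five students on a first testing and a retest.
first_testing = [12, 15, 18, 20, 25]
retest = [13, 14, 19, 21, 24]

stability = pearson_r(first_testing, retest)
print(f"test-retest coefficient of stability: {stability:.2f}")
```

The same calculation applies to alternate-form reliability (scores on form A against scores on form B) and, for two scorers, to interrater reliability.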
Reliability General Guidelines
• Test scores used for decisions about individuals require a much higher degree of reliability than those used for decisions about groups.
• Higher reliability coefficients are essential if decisions based on test scores have long term consequences.
Reliability General Guidelines (Continued)
• Lower reliability coefficients are tolerable if decisions are reversible or have only a temporary impact.
• Reliability coefficients for standardized tests should be .90 or higher
• Reliability coefficients are influenced by many factors.
How to Increase Reliability
• Use objective tests
• Use a more heterogeneous group
• Make sure the difficulty level is appropriate for the individuals being tested
• Increase the number of items
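One way to see why adding items helps is the Spearman-Brown prophecy formula, which the slides do not mention and which is brought in here only as an illustration. It predicts the reliability of a lengthened test, assuming the added items are comparable in quality to the existing ones. The starting reliability of 0.70 is a made-up value.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the new items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical test with a reliability coefficient of 0.70.
current = 0.70

for factor in (1, 2, 3):
    predicted = spearman_brown(current, factor)
    print(f"{factor}x as many items: predicted reliability = {predicted:.2f}")
```

Under these assumptions, doubling the number of items raises the predicted coefficient from 0.70 to about 0.82.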
Reliability vs. Validity
• Reliability means that the test-takers will get the same score in multiple takes (within reason of course).
• Validity means measuring what it is supposed to measure
• Reliability doesn't necessarily equate to validity:
– A test can be reliable without being valid.
– However, a test cannot be valid unless it is reliable.
Standardized Test
• administered and scored in a consistent, predetermined, "standard" manner.
• designed in such a way that the questions, conditions for administering, scoring procedures, and interpretations are consistent
• not necessarily high-stakes, time-limited, or multiple-choice.
Standardized Testing Benefits
• Objectivity
• Evidence of validity or reliability of results
• Ability to compare across students, schools, states, etc.
• Ease of administration and scoring
• Efficiency (group testing)
• Developed over time and supported with data and research
Standardized Testing Possible Issues
• Can only sample a portion of the domain
• May not match school curriculum
• May not answer relevant questions
• Interpretations may not be relevant for all populations
• Extraneous factors may prevent a good measure of the student’s ability
• May not be available for some constructs/concepts
Base Test Type on the Decision to Be Made
• Norm-Referenced: Level of achievement compared to other students
• Criterion-Referenced: Level of achievement compared to external criterion
Norm-Referenced Scores
• Based on the normal curve
• Reflect student performance compared to other similar students
• Show relative strengths and weaknesses
• Are not standards of “what should be” - only indicators of what “is”
• Examples: CogAT, Iowa, NNAT, WISC, Stanford, Terra Nova
Norms
• A set standard of development or achievement usually derived from the average or median achievement of a large group
• Used to compare one student’s results to those of a large sample of students:
– National norms - based on a large sample from across the nation
– Local norms - based on a large sample from local schools within a city, district, state, etc.
Norms (Continued)
• Indicate what the current reality is
– are not standards, or indicators of what should be
• Derived by assessing students thought to be “typical”
• For mental ability scores, use student age norms
• For achievement scores, use student grade norms
Good Norms are…
• Recent
– When outdated norms are used, results can be misleading. Norms change every 5-7 years. (Tests with norms over 10 years old are not used for gifted evaluation in Cobb County.)
• Representative
– Because participation in the norm group is voluntary, norm groups might not be representative.
• Relevant
– The “normal” students used to establish the norms may not have been provided a “normal” instructional program.
Norm Referenced Tests (NRT)
Appropriate Uses
• Used to compare student performance with a large, usually national or international, sample of similar students
• Used to compare schools or school systems relative to a national sample
Criterion-Referenced Tests
• Allow inferences about:
– a curricular domain of skills and knowledge (e.g. the CCGPS, state standards)
– a cognitive domain of skill (e.g. reading comprehension, math computation)
– standing with respect to a judgmental criterion
• Examples: CRCT (Criterion Referenced Competency Test), EOCT (End of Course Test), Georgia Milestones
Criterion Referenced Tests (CRT)
Appropriate Uses
• To make instructional decisions about individual students
• To make placement decisions about students, along with other information
• To make evaluative (formative and summative) decisions about programs
• To make decisions about the curriculum
Raw Scores
• Actual number of points received on test
– For example, 25 correct answers out of 30 questions equals a raw score of 25
• Have not been “cooked” in cauldron of statistics
Standard Scores
• Raw scores converted to a new scale
• Can be used to make direct comparisons among classes, schools, or districts
• Can be misinterpreted because somewhat arbitrary scale values are used from test to test
• Commonly Reported Standard Scores
– SAT, GRE, NCEs, Stanines, SAS
Normal Curve Equivalent (NCE)
• “Normalized standard scores” used for reporting some standardized achievement tests
• Converted to a scale with a mean of 50 and a standard deviation of 21.06
• Reported in a range between values of 1 and 99
• Are not particularly useful in reporting test results to parents
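The NCE scale can be sketched as a two-step conversion: find the z-score on the normal curve that matches a percentile rank, then rescale it to a mean of 50 and a standard deviation of 21.06. The function name `percentile_to_nce` is hypothetical, and published tests typically supply their own conversion tables, so treat this as an approximation.

```python
from statistics import NormalDist

def percentile_to_nce(percentile_rank):
    """Approximate NCE for a percentile rank between 1 and 99."""
    # z-score with this proportion of the normal curve below it.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    # Rescale to the NCE scale: mean 50, standard deviation 21.06.
    return 50 + 21.06 * z

for pr in (1, 25, 50, 75, 99):
    print(f"percentile rank {pr:2d} -> NCE {percentile_to_nce(pr):.1f}")
```

The standard deviation of 21.06 is what makes NCEs of 1, 50, and 99 line up with percentile ranks of 1, 50, and 99, while the values in between differ.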
Standard Age Scores (SAS)
• Used to report the results of ability tests
• Sometimes reported as “deviation” IQ scores
• Converted to a scale with a mean of 100 and a standard deviation of 15
• “Average” is considered within 15 points above or below 100 (from 85 to 115 on the normal curve)
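A simplified sketch of how a score lands on the SAS scale, assuming a plain linear conversion from a z-score within the student's age group. Real ability tests use publisher norm tables, and the function name, age-group mean, and SD below are hypothetical.

```python
def standard_age_score(raw, age_group_mean, age_group_sd):
    """Linear conversion of a raw score to a mean-100, SD-15 scale."""
    # How many SDs the raw score sits above or below the age-group mean.
    z = (raw - age_group_mean) / age_group_sd
    return 100 + 15 * z

# Hypothetical age group with a raw-score mean of 50 and SD of 10.
for raw in (40, 50, 60):
    print(f"raw {raw} -> SAS {standard_age_score(raw, 50, 10):.0f}")
```

Raw scores one SD below and above the age-group mean map to 85 and 115, the edges of the “average” range.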
Stanines
• Standard Scores with whole number values ranging from 1 to 9
• Relate to percentile bands
• Useful as a simple approximation of performance
• May lead to a loss of precision in reporting
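The link between stanines and percentile ranges can be sketched with the conventional stanine cut points (stanines 1 to 9 cover 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4% of the norm group). These cut points are the usual textbook values rather than something stated in the slides, and a given test's manual may differ slightly.

```python
from bisect import bisect_right

# Lowest percentile rank belonging to stanines 2 through 9
# (cumulative percentages of the conventional stanine distribution).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_stanine(percentile_rank):
    """Stanine (1-9) for a percentile rank between 1 and 99."""
    # Count how many cut points the rank has reached, then shift to 1-9.
    return bisect_right(STANINE_CUTS, percentile_rank) + 1

for pr in (3, 10, 50, 76, 77, 99):
    print(f"percentile rank {pr:2d} -> stanine {percentile_to_stanine(pr)}")
```

Percentile ranks 60 through 76 all collapse into stanine 6, which illustrates the loss of precision.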
Percentile Scores
• Commonly used in expressing results of standardized tests
• Probably the best single derived score for general use in relaying test results
• Indicate the percentage of students in the norm group scoring lower than the examinee
• Range between values of 1 and 99
• Used to interpret a student’s performance in comparison to other students
• Can result in misinterpretation because percentile ranks are not equally spaced along any one scale
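A minimal sketch of the definition above: count the share of the norm group scoring lower than the examinee and keep the result between 1 and 99. The norm group here is made-up data, and publishers may handle ties and rounding differently.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring lower, limited to 1-99."""
    below = sum(1 for s in norm_scores if s < score)
    rank = round(100 * below / len(norm_scores))
    # Reported percentile ranks run from 1 to 99, never 0 or 100.
    return min(99, max(1, rank))

# Hypothetical norm group: 100 students with raw scores 1 through 100.
norm_group = list(range(1, 101))

print(percentile_rank(76, norm_group))   # 75 students scored lower
print(percentile_rank(200, norm_group))  # above everyone, capped at 99
print(percentile_rank(0, norm_group))    # below everyone, floored at 1
```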
Percentile Bands
• Range of values thought to contain the student’s “true” percentile rank
– smaller bands reflect higher reliability
• Example: Susan might have a percentile band ranging between 76 and 86 for math computation on the ITBS, and a percentile band ranging between 82 and 92 for reading.
– Scores indicate that Susan probably performs better at reading than at math computation
– However, because the bands overlap, her exact percentile score for math could be higher than for reading
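Susan's example comes down to whether the two bands overlap. The sketch below checks that with her bands from the slide; the function name and the second comparison band are hypothetical.

```python
def bands_overlap(band_a, band_b):
    """True if two (low, high) percentile bands share any values."""
    low_a, high_a = band_a
    low_b, high_b = band_b
    return low_a <= high_b and low_b <= high_a

math_band = (76, 86)      # Susan's math computation band
reading_band = (82, 92)   # Susan's reading band

# The bands share 82-86, so a true math rank above the true
# reading rank cannot be ruled out.
print(bands_overlap(math_band, reading_band))

# A hypothetical lower band that does not overlap the reading band
# would support a clearer difference.
print(bands_overlap((60, 70), reading_band))
```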
Grade Equivalents
• Identifies grade level at which “typical” student obtains same raw score
• Expressed by grade and month
• Are useful in measuring growth
• Can be easily misinterpreted
Grade Equivalent Interpretation
• Compares student performance on grade-level material against the average performance of students at other grade levels on the same material
• Reported in terms of grade level and months
• Does not mean a 5th grade student with an 8.5 GE score in reading can do 8th grade reading work
• Does not mean the 5th grade student needs to be in 8th grade
• Does mean the 5th grade student is performing better than peers at the same grade level
• Does mean that the 5th grade student reads 5th grade material as well as the average 8th grader
Grade Equivalents - Common Misinterpretations
• Cannot be interpreted as an estimate of the grade where a student should be placed
• Are not equal across the range of the scale
• Are not necessarily equal across tests
• Extremely high or low GE scores are not dependable estimates of student achievement
Things to Know
• Know the Test – study the manual and understand the content and purpose
• Know the Norms – you cannot interpret scores well if you don’t understand the norming population
• Know the Score – is it a standard score, raw score, percentile rank, or something else?
• Know the Background – test results don’t tell the whole story, so consider multiple sources of data and information on the student
More to Know
• Research on your own – the more you know, the more you can explain test results with accuracy and confidence
• Communicate effectively – provide pertinent information in a clear, understandable manner to approved individuals
• Use the test – understanding increases with multiple uses
• Use caution – test scores can reflect ability but they do not determine ability
Reference – Test Scores and What They Mean, 6th edition by H. Lyman,