Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments
An Empirical Assessment Based on Four Recent Evaluations
IES Research Conference, June 28th, 2010
Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong
MDRC
Two key concerns with using state tests in an evaluation…
They may not be suitable for the evaluation
  Validity concerns: They may not be aligned with the outcomes of interest (do not provide a valid inference about program impacts)
  Reliability concerns: They may be too difficult for low-performing students (unreliable)
Variation in scale/content of state tests also complicates the task of combining impact findings across states and grades
About This Study
Funded by the Institute of Education Sciences (IES)
Purpose is to “bring data to bear” on several topics covered in the May et al. discussion paper:
  Are state tests suitable for evaluation purposes?
    As a measure of the outcome(s) of interest?
    As a measure of student achievement at baseline?
  How should impacts on state tests be pooled?
    Are impact findings sensitive to methods of rescaling and aggregating test scores across states and/or grades?
Overview of Analytical Approach
We identified 4 large-scale randomized experiments where achievement was measured using both (i) state tests AND (ii) a study test
  The study test provides a benchmark for gauging the suitability of state tests
Two types of analyses:
  Impact analyses: We compared estimated impacts on state tests and on the “benchmark” study test
  Descriptive analyses: We also examined published information on the characteristics/content of tests
Data and Samples
Studies represent diversity with respect to grade levels and outcomes
Analysis sample includes students with a state test score and a study test score
Study  Targeted Outcome             Level          Sample for Analysis
A      General Reading Achievement  Elementary     1,032 (9 states)
B      General Math Achievement     Elementary     944 (7 states)
C      Specific Reading Outcome     High School    1,065 (4 states)
D      Specific Math Outcome        Middle School  4,387 (9 states)
Approach for Estimating Impacts
Impact on state tests:
  Rescaling: Scores are z-scored by state and grade using the sample mean and standard deviation
  Pooling approach: Impacts by state and grade are aggregated using precision weighting
Impact on the study test:
  Rescaled/pooled using the same approach for comparability
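The rescaling and pooling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the estimation model used in the four evaluations: it assumes that "precision weighting" means inverse-variance weighting of the block-level impacts, and it estimates each block's impact as a simple treatment-control difference in mean z-scores (no covariates, no clustering). The record layout (`state`, `grade`, `treat`, `score`) and all function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev, variance


def group_by_block(records):
    """Group student records into (state, grade) blocks."""
    blocks = defaultdict(list)
    for r in records:
        blocks[(r["state"], r["grade"])].append(r)
    return blocks


def zscore_block(rows):
    """Z-score one block's scores using that block's sample mean and SD."""
    scores = [r["score"] for r in rows]
    m, s = mean(scores), stdev(scores)
    return [{**r, "z": (r["score"] - m) / s} for r in rows]


def block_impact(rows):
    """Treatment-minus-control difference in mean z-scores and its standard error.

    Assumption: a simple difference in means; each arm needs at least
    two students so that a sample variance can be computed.
    """
    treated = [r["z"] for r in rows if r["treat"] == 1]
    control = [r["z"] for r in rows if r["treat"] == 0]
    impact = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return impact, se


def pooled_impact(records):
    """Pool block-level impacts with inverse-variance (precision) weights."""
    estimates = [block_impact(zscore_block(rows))
                 for rows in group_by_block(records).values()]
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled, pooled_se
```

Because each block is z-scored separately, two states whose tests use very different score scales contribute on the same footing, and blocks with more precise estimates receive more weight in the pooled impact.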
Two dimensions of suitability
  Validity: Whether the content of state tests is aligned with the outcomes of interest in the evaluation
  Reliability: Whether state tests provide a reliable measure of achievement for the target population (in this case, low-performing students)
A key concern: State tests may have low reliability and may not yield valid inferences about program effectiveness
Criteria for Assessing “Suitability”
Implications for the impact findings:
  Poor validity:
    Could fail to detect impacts on the outcome of interest (invalid inference about program effectiveness)
    Affects the magnitude of the estimated impact on state tests
  Low reliability:
    Student achievement is estimated with greater error
    Affects the standard error of the estimated impact on state tests
Criteria for Assessing “Suitability”
Reliability: Compare the standard error of the estimated impact on state tests vs. the study test
  Smaller standard error is better (more precision)
Validity: Compare the magnitude of the impact estimates, in light of estimation error…
  Compare the statistical significance of the impact findings (i.e., conclusions about program effectiveness based on p-value)
  If both estimates are statistically significant, then also compare their magnitudes
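The two criteria above can be expressed as a small comparison routine. The sketch below is illustrative only: the 0.05 significance threshold and the normal approximation for the p-value are assumptions (the slides do not state either), and the function names are hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(estimate, se):
    """Two-sided p-value for estimate / se under a normal approximation."""
    return erfc(abs(estimate / se) / sqrt(2))


def compare_tests(state, study, alpha=0.05):
    """Compare impact estimates on the state test and the study test.

    state, study: (impact_estimate, standard_error) tuples.
    alpha: significance threshold (assumed here, not taken from the slides).
    """
    (b_state, se_state), (b_study, se_study) = state, study
    p_state = two_sided_p(b_state, se_state)
    p_study = two_sided_p(b_study, se_study)
    return {
        # Reliability criterion: a ratio above 1 means the state test is less precise.
        "se_ratio": se_state / se_study,
        "p_state": p_state,
        "p_study": p_study,
        # Validity criterion, step 1: do both tests lead to the same conclusion?
        "same_conclusion": (p_state < alpha) == (p_study < alpha),
        # Validity criterion, step 2: magnitudes are compared only if both are significant.
        "compare_magnitudes": p_state < alpha and p_study < alpha,
    }
```

For example, calling `compare_tests((0.10, 0.06), (0.20, 0.05))` with these made-up inputs gives a state-to-study standard error ratio of 1.2 and flags that the two tests lead to different conclusions at the assumed threshold.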
Criteria for Assessing Validity
The extent to which the magnitudes of the impact estimates are expected to differ depends on the outcome that state tests are intended to measure
Two types of intervention:
Targeted outcome is general achievement (Studies A and B)
The outcome of interest is “general achievement” in math or reading
Both state tests and the study test measure the targeted outcome (general achievement)
If state tests are valid, then the impact on the study test and state tests should be similar
Criteria for Assessing Validity
Two types of intervention (ctd.)
Targeted outcome is a specific skill (Studies C and D)
There are two outcomes of interest: targeted skill (short-term) and general achievement (longer-term)
Study test is used to measure the short-term outcome (specific skill), while state tests are used to measure the longer-term outcome (general achievement)
If state tests are valid, then the impact on state tests should be smaller than the impact on the study test
Benchmark: Impact on the Study Test
[Figure: P-Value & Magnitude (Validity), Targeted Outcome is General Achievement. P-values shown: 0.055, 0.119, 0.229, 0.189]
[Figure: P-Value & Magnitude (Validity), Targeted Outcome is a Specific Skill. P-values shown: 0.578, 0.002, 0.007, 0.219]
Standard Errors (Reliability)
State-Study Ratio: 1.20 1.07 1.04 1.03
Conclusion
Findings suggest that state tests can be used as a complement to a study-administered test
  State tests are suitable (valid and reliable) in 3 of 4 studies
Whether state tests can be used as a substitute for a study test is an open question
  Limited availability in some grades and subjects
    Available for all states/grades in only 1 of 4 studies
  May not be able to use them to measure a specific targeted skill
  Possibly less reliable
Findings from descriptive analysis lead to the same conclusions as the impact analysis…
Questions?
Marie-Andrée Somers
Pei Zhu
Edmond Wong