

Contract No.: 282-98-0021 (13)
MPR Reference No.: 8701-300

MATHEMATICA Policy Research, Inc.

Strategies for Measuring the Impacts of Federal Reading Programs on Reading Achievement

Final Report

May 16, 2002

Steven Glazerman
John Mullens
Angie KewalRamani
David Myers

Submitted to:
U.S. Department of Education
Reading Excellence Team
400 Maryland Avenue, SW, Suite 5C152
Washington, DC 20202-6200
Project Officer: Nancy Rhett

Submitted by:
Mathematica Policy Research, Inc.
600 Maryland Ave., SW, Suite 550
Washington, DC 20024-2512
Telephone: (202) 484-9220
Facsimile: (202) 863-1763
Project Director: David Myers


ACKNOWLEDGMENTS

The authors would like to thank the many people at the U.S. Department of Education who provided helpful advice and input, including Nancy Rhett, the project officer for this study, and Joseph Conaty and Ricky Takai. We also benefited from discussions with educators who generously shared their time and expertise with us. These individuals include state REA coordinators in Florida, Kentucky, Maine, Maryland, Oregon, Pennsylvania, Utah, Vermont, and West Virginia, and especially Beverly Kingery, the Reading/English Language Arts Coordinator in the West Virginia Department of Education. David J. Francis, of the University of Houston, provided helpful technical advice.

At MPR, Mark Dynarski and Peter Schochet read the report carefully and provided valuable comments. Larry Radbill provided expert programming for the statistical power simulation. The report was produced by Micki Morris and Melanie Lynch and edited by Carol Soble and Daryl Hall.

ii


CONTENTS

Section                                                                Page

A. OVERVIEW 1

B. DESIGN CONSIDERATIONS

   1. Research Questions

   2. Defining the Counterfactual: What Should Reading Programs Be Compared To? 4

   3. Outcomes: Defining and Measuring Reading Achievement

C. RESEARCH DESIGNS 7

   1. Experimental Design

   2. Quasi-Experimental Designs

D. SAMPLE SIZES 15

E. NEXT STEPS 19

REFERENCES 21

APPENDIX A: AN ILLUSTRATION OF A FEDERALLY FUNDED READING PROGRAM: THE READING EXCELLENCE ACT

APPENDIX B: ANALYTIC MODEL

APPENDIX C: METHODS FOR POWER ANALYSIS

iii


A. OVERVIEW

Improving young children's reading skills is at the top of the Bush Administration's education policy agenda. The National Reading Panel has already documented successful practices in reading instruction when such practices are implemented in experimental settings (National Reading Panel 2000), but there is little evidence on the impact of federal programs as implemented at the local level. To build a body of evidence on the impacts of specific federal reading programs, the U.S. Department of Education (ED) contracted with Mathematica Policy Research, Inc. (MPR) to provide guidance on designing a formal impact evaluation of federal efforts to boost reading achievement in young children. The initial objective of this task order project, later modified during the course of the work, was to design an evaluation of federal reading programs operating under the Reading Excellence Act (REA).

To learn more about the implementation of the REA grants and how such implementation might influence the evaluation designs considered by ED, MPR conducted interviews and site visits with state and local education officials who were REA grantees. We consulted with reading experts, including David J. Francis, on a range of methodological issues, including how to measure different aspects of reading achievement and how often to test students.

During the course of this work, as the new administration's education agenda was articulated, it became clear that the objective of the task needed to be broadened to include other reading programs that ED may operate. Several general concerns arose during discussions with ED officials about how to structure an evaluation of REA or of other reading programs ED may operate:

• Defining the Scope of the Evaluation. Whether to focus on all grantees or only on grantees that implemented the program most faithfully.

• Making Causal Inferences. The use of random assignment and other research designs, such as matching or interrupted time-series, to estimate program impacts, and the assumptions that must be made when making causal inferences concerning program effects.

• Determining the Number of Test Points Necessary. The utility of using more than two measurement points to assess program impacts in a large-scale evaluation.

• Assessing Sample Size. The number of schools, students, and test points that should be included in an evaluation so that substantial and expected program impacts can be detected by the research design.

We considered a range of design options and impact estimation strategies, specifically addressing these concerns. Although the designs can apply to any federally funded reading program, we use features of the REA program in this report as the context for the design options. (A summary description of the REA program appears in Appendix A.)

1


In this report we discuss (1) design considerations for assessing impacts of large-scale federal reading programs; (2) possible research designs, their assumptions, and feasibility; and (3) the likely sample sizes required to detect program impacts in the range of effects that ED may expect. To estimate the required sample sizes, we developed a computational tool to simulate student achievement growth and program impact estimation using a multilevel model. The model is described in Appendix B, and the tool for estimating sample size requirements is described in Appendix C.
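The simulation logic behind such a tool can be illustrated with a minimal sketch. This is not the MPR tool itself (which is documented in Appendices B and C), and every parameter value here, the intraclass correlation, effect size, and sample sizes, is a hypothetical chosen only for illustration: scores are generated from a two-level model with a school random effect, and power is the share of simulated trials in which a school-level difference-in-means test rejects the null.

```python
import numpy as np

def simulate_power(n_schools=40, n_students=60, effect_size=0.25,
                   icc=0.15, n_reps=500, crit_z=1.96, seed=0):
    """Estimate the power of a school-level randomized design by simulation.

    Student scores are standardized (total variance 1); `icc` is the share
    of variance between schools. Each trial assigns half the schools to
    treatment, adds the impact at the school level, and tests the
    difference in mean school scores with a two-sample z-test.
    """
    rng = np.random.default_rng(seed)
    half = n_schools // 2
    rejections = 0
    for _ in range(n_reps):
        school_effects = rng.normal(0.0, np.sqrt(icc), n_schools)
        school_effects[:half] += effect_size   # program impact, treated schools
        scores = school_effects[:, None] + rng.normal(
            0.0, np.sqrt(1.0 - icc), (n_schools, n_students))
        means = scores.mean(axis=1)
        t, c = means[:half], means[half:]
        se = np.sqrt(t.var(ddof=1) / half + c.var(ddof=1) / half)
        if abs(t.mean() - c.mean()) / se > crit_z:
            rejections += 1
    return rejections / n_reps
```

With values like these, a 0.25 standard deviation impact is far from certain to be detected, which is why clustering and the number of schools, rather than the number of students, dominate the sample size question taken up in Section D.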

B. DESIGN CONSIDERATIONS

1. Research Questions

ED posed three main research questions for evaluating large-scale reading programs that focus on narrow bands of grades in elementary school, such as grades 1 through 3. The most general question was: What is the cumulative impact of the program as implemented on students' reading achievement? The cumulative, or gross, impact of the program has to do with the level of student outcomes after a predetermined period. The second, more specific, question was: What is the impact of a reading program on students' reading achievement trajectories; in other words, what is the program impact on the pattern of change in achievement over time? This question acknowledges that a reading program may affect children differently in early grades than in later grades, or at different times of the year. It may be informative, for example, to know if the impact is large while children are in first grade and then is smaller when children are in grades two and three. The third question was: What is the impact of a well-implemented program on students' reading achievement? This question tries to determine the effectiveness of programs that have faithfully implemented a set of guidelines prescribed by ED, not the effectiveness of all reading programs that may be operated under a federal grant.

In this report, we focus mainly on the first question by describing strategies that can be used to measure gross impacts in a large-scale evaluation. The answers to this question will (1) tell policymakers if the taxpayers' money is being well spent, that is, whether the program is making a substantial difference by improving the reading achievement of all children who may be eligible for such a program, and (2) provide new evidence on the impact, when implemented on a large scale, of reading programs that have been studied in many small-scale settings.

The second question, on how the impact on students' reading achievement differs as they move through the early elementary school grades, may require many test points to detect where changes in achievement impacts might occur. The additional testing may place considerable burden on schools and students. Francis suggests that this research question is more appropriate for a small-scale study in which the testing conditions can be easily controlled by the evaluators and when only a few classrooms or schools are involved (personal communication).

Another reason we do not focus here on the second question is the difficulty of interpreting time-varying impacts. Consider two alternative achievement growth trajectories, shown in Figure 1. For both patterns A and B, the final measure of the reading outcome is the same, and the growth in reading achievement, as measured by pretest/post-test gains, is the same. Yet, is one of these superior to the other? Is it worth conducting frequent student assessments to


FIGURE 1

ALTERNATIVE ACHIEVEMENT GROWTH TRAJECTORIES

[Two panels, Pattern A and Pattern B, each plotting reading achievement against time/testing occasion.]

distinguish a pattern A child from a pattern B child? Some might argue that pattern B is a better outcome because the time trend suggests an increasing rate of growth. Others might argue that pattern A is a better outcome because it demonstrates achievement that is the same as or higher than pattern B's in every period. In a real-world evaluation, the shapes of these trajectories are unlikely to be even as simple as in this example, where quadratic growth curves, drawn in to aid the reader, fit the data very well. Answering this second question is likely to produce an indeterminate result, diverting attention from the summative result of whether the program is effective. We recommend against undue attention to the question in this report, but we do offer some suggestions about design issues that are relevant to the question.
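The two patterns can be made concrete with a toy construction of our own (not taken from the report's figure): two quadratic trajectories that share the same pretest and post-test scores, so the gross gain is identical, yet pattern A sits at or above pattern B at every intermediate testing occasion.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 5)     # five testing occasions, scaled to [0, 1]

# Both trajectories start at 0 and end at 1: identical pretest/post-test gain.
pattern_a = 2 * t - t**2         # decelerating growth: large early gains
pattern_b = t**2                 # accelerating growth: large late gains

# Pattern A dominates at every interior occasion even though the
# cumulative (gross) gain over the study period is the same for both.
```

A gross-impact evaluation treats these two children identically; only frequent intermediate testing can tell them apart, and, as the text argues, it is not obvious which pattern should count as the better outcome.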

The third question, on the impacts of well-implemented reading programs, has, to a large extent, been addressed by the National Reading Panel. Studies summarized by the panel generally focus on well-implemented instructional approaches to reading and demonstrate what ED might expect as an upper bound for an impact when a scientifically based program of instruction is well implemented.


2. Defining the Counterfactual: What Should Reading Programs Be Compared To?

To understand the impact of a program like REA that is implemented locally, we must determine the difference between what happened under the program and what would have happened in the program's absence (the counterfactual). The research design attempts to approximate the counterfactual. Because reading instruction is essentially universal, it is neither realistic nor especially informative to compare REA-funded instruction with no reading instruction at all. Instead, the program should be compared with whatever instructional strategies are otherwise available to schools or that become available over the course of the evaluation.

Under these conditions, we would infer from a positive and statistically significant impact estimate that REA improved reading achievement more than existing programs would have done. An impact estimate near zero, on the other hand, would suggest that the students would have done just as well under the instructional approaches already in place, and a negative impact would indicate that they would have done better under existing instructional approaches than under those implemented through REA.

4

Although it is not necessary to collect information on the other services received by the treatment group and the control group to compute program impacts, such information would be especially important for interpreting the impacts. Thus an evaluation of REA, for example, might estimate impacts of REA relative to some mixture of existing Title I-funded reading programs or locally developed reading initiatives. If REA is found to have large impacts for some groups of students but not for others, evaluators may need additional information on the services and resources that had been available to those students to explain the result. A careful impact evaluation could therefore be most useful by also surveying or visiting both REA and non-REA schools as part of an implementation study to measure:

• Classroom reading instruction and assessment practices

• Professional development activities for reading instruction

• Staff attitudes toward whole language reading and phonics

• Level of local and supplemental resources for instruction and professional development

• Participation in specific state, local, and federal reading programs, including whole school reform models

• School context and culture


3. Outcomes: Defining and Measuring Reading Achievement

Given the research questions and the fact that the program is directed to the lower elementary school grades, there are two distinct approaches to assessing impacts. The first approach measures the cumulative impact of the program on reading ability and uses standardized reading tests that are appropriate to students' grade level. The test would be given at the end of grade 3, for example.

The second approach measures students' performance in terms of a set of reading dimensions, or reading-related skills, such as those identified by the National Reading Panel:

• Phonemic Awareness. The skills and knowledge to understand how phonemes, or speech sounds, are connected to print

• Decoding. The ability to decode unfamiliar words

• Fluency. The ability to read fluently

• Background Knowledge and Vocabulary. Background information and vocabulary that can foster reading comprehension

• Comprehension Strategies. The development of appropriate active strategies to construct meaning from print

In order to make decisions about measuring reading outcomes, it is necessary to understand how these component skills interact. A conceptual model of the process by which effective reading skills develop over time is shown in Figure 2. Children may develop precursor skills such as phonemic awareness and decoding early on, even before school. Together with the vocabulary and background knowledge they receive from their environments in and out of school, these skills help develop fluency and comprehension, so that by third grade, children might become effective, fluent readers with good comprehension.

One way to capture the constellation of all five dimensions of reading ability is to measure every skill separately on each of several testing occasions. Another approach is to measure only selected dimensions appropriate to the grade being assessed. The latter approach concentrates only on those aspects in which student achievement is expected to change during a given time span. For example, research suggests that children reach a plateau in precursor skills such as phonemic awareness, after which decoding and fluency become more important. Therefore, assessing phonemic awareness for younger children could be replaced by assessing decoding as they get older.


FIGURE 2

READING SKILL DEVELOPMENT

The ideal instrument for this evaluation will cover the aspects of reading appropriate for the grade levels being tested and will have desirable statistical properties, such as high reliability and an ability to discriminate over a wide range of student proficiency levels.1 To measure gross impacts, we recommend using standardized tests, such as the Woodcock-Johnson, the Gray Oral Reading Test, and the Iowa Test of Basic Skills. An advantage of this approach is that standardized tests are relatively easy to administer and may already be part of the district's existing assessment program, so the burden on students and schools is modest.2

For three reasons, it may be preferable to use more than one reading assessment instrument. First, we know that "reading" is composed of five or more unique dimensions. Second, the extent of student growth in reading over a given period can range widely within a single school or classroom. Third, most reading instruments tend to assess some of the reading dimensions but not the full range. For these reasons, it may be both wise and more practical to identify different instruments for the targeted dimensions. Furthermore, it may be most efficient to use a "placing test" as a precursor to identify the appropriate level within each dimension to test each study participant.

We also recommend that the same test or a similar test be given to students just before they participate in the program, say, in the fall of grade one. In doing so, one can substantially increase the precision with which impacts are estimated at the end of the program period and use the prior test scores to statistically adjust for possible differences that may occur by chance between the treatment and control groups when random assignment is used, and for differences that may still exist between treatment and comparison groups when matching or other quasi-experimental methods are employed. In Section D, we address the issue of whether there is any advantage to using more than one prior test point. In the context of using a single prior test for purposes of increasing precision or for statistical adjustment, it is not critical that the prior test be the same as the test used at the end of the program; however, it should be strongly correlated with the later test. If we were measuring growth, then the same test (a different form and level could be used) would be required at each test point.
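The precision gain from a baseline test can be demonstrated with a small Monte Carlo sketch of our own, under assumed values: a true impact of 0.2 standard deviations and a pretest/post-test correlation of 0.7. Across repeated trials, both the simple difference in means and the regression-adjusted (ANCOVA-style) estimator center on the true impact, but the adjusted estimator varies less.

```python
import numpy as np

def estimator_spread(n=400, rho=0.7, impact=0.2, reps=300, seed=1):
    """Compare the sampling spread of a simple difference in means with a
    pretest-adjusted (ANCOVA) impact estimate, via repeated simulation."""
    rng = np.random.default_rng(seed)
    simple, adjusted = [], []
    for _ in range(reps):
        treat = rng.permutation(np.repeat([0.0, 1.0], n // 2))
        pre = rng.normal(0.0, 1.0, n)
        # Post-test correlates rho with the pretest, plus the program impact.
        post = impact * treat + rho * pre + rng.normal(0.0, np.sqrt(1 - rho**2), n)
        simple.append(post[treat == 1].mean() - post[treat == 0].mean())
        # ANCOVA: regress the post-test on a treatment indicator and the pretest.
        X = np.column_stack([np.ones(n), treat, pre])
        beta, *_ = np.linalg.lstsq(X, post, rcond=None)
        adjusted.append(beta[1])
    return float(np.std(simple)), float(np.std(adjusted))
```

With a pretest/post-test correlation of rho, the adjusted estimator's variance shrinks by roughly a factor of (1 - rho squared), which is the arithmetic behind the recommendation to use a strongly correlated prior test.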

[Figure 2 boxes: phonemic awareness, decoding/phonics, and vocabulary/background knowledge feed into fluency and comprehension strategies, leading to fluent reading with good comprehension.]

6


Capturing student achievement on all the dimensions at just the right time requires careful consideration of the assessment instrument. Other ED-funded reports (Voight et al. 2001; Birman et al. 2001) list several assessment instruments that could be used in an evaluation of REA and discuss the possible modes of test administration. To balance the need to control cost with the need for developmentally appropriate tests, Voight et al. recommended using individually administered tests for young children and then group-administered tests as the children reach the end of grade one. Rather than recommending a single test, they suggested combining subtests from different sources for the youngest study participants (those who have yet to complete grade one). Similarly, Birman et al. did not single out a test or publisher that was best suited for evaluating REA but listed many instruments with the domains they cover and their psychometric properties. Recommendations concerning specific subtests or tests must wait until specific reading programs are selected for evaluation and the scope of the curriculum is known.

C. RESEARCH DESIGNS

To be useful for policymaking, the evaluation must generate impact estimates that measure the true effects of the program on students' reading achievement; that is, the estimates must reflect the difference between what students achieved while in a program and what they would have achieved without the program (the counterfactual). The main objective of the research design is to approximate the counterfactual. A major barrier to accomplishing this objective is selection bias: the difference between the program's true impact and the observed difference in outcomes between program participants and nonparticipants.

A conceptual model of the evaluation of reading interventions helps to illustrate how selection bias may arise. Under the model shown in Figure 3, several factors combine to produce reading achievement. The model is based on the hypothesis that children come to school with unique personal and family characteristics. At school, they receive instruction that may or may not include services enhanced by REA or similar grants. Through this instruction, children develop phonemic awareness and decoding abilities, through which they develop fluency and reading comprehension. Throughout this process, children develop background knowledge, vocabulary, and motivation that also help them become effective readers.

The selection bias in this case is a consequence of differences in (1) the skills and motivation that local educators and policymakers bring to the grant-making and program implementation process, and (2) the kinds of parents and children who "self-select" into the treatment or into the potential comparison schools. These are represented in Figure 3 by dotted lines connecting family background and school effects with participation in the reading program. If grants are awarded to educators who are more experienced and skilled, as in a competitive award process, for example, then the treatment group benefits not only from the services attached to the grant program but also from the additional pre-program skills and experience of the teachers and administrators who operate the program. The evaluation strategy must isolate the separate impact of the program or the impact estimates will be biased.
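The bias mechanism can be sketched numerically; the magnitudes below are hypothetical and chosen only for illustration. If the most skilled half of applicants win grants, a naive treatment-control difference bundles the true program impact together with the pre-existing skill difference.

```python
import numpy as np

rng = np.random.default_rng(2)
n_schools = 2000
true_impact = 0.10                    # assumed program effect, in SD units

educator_skill = rng.normal(0.0, 1.0, n_schools)
# Competitive award process: the most skilled half of applicants win grants.
grant = educator_skill > np.median(educator_skill)

# Skill independently raises achievement, so grantee schools would have
# scored higher even with no program at all.
score = true_impact * grant + 0.3 * educator_skill + rng.normal(0.0, 1.0, n_schools)

# The naive grantee-vs-nongrantee difference overstates the true impact.
naive_estimate = score[grant].mean() - score[~grant].mean()
```

Here the naive estimate is several times the assumed true impact, because it also captures the skill gap the award process built into the two groups: exactly the confound the research design must remove.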

7


FIGURE 3

LOGIC MODEL FOR THE EVALUATION OF READING INTERVENTIONS

[Figure 3 boxes: a federal reading program, family background and school effects, and other reading interventions feed into reading skill development (see Figure 2), which is captured by assessment of reading skills at multiple time points.]

Selection bias is typically addressed through one of two broad types of research designs: experimental and quasi-experimental. In an experimental design, individuals who are similar in demographic and other background-related characteristics are randomly assigned to a treatment or a control group, thus ensuring that any differences in the outcomes between treated and untreated subjects can be attributed to the treatment itself. In a quasi-experimental design, a variety of methods can be used in the attempt to approximate the results achieved with random assignment. Table 1 provides an overview of the designs in each category that are described in this report, showing the assumptions required when making causal inferences based on the designs and the advantages and disadvantages of each.

8


TABLE 1

FEATURES OF ALTERNATIVE DESIGN STRATEGIES

Experimental

Random Assignment
  Implementation and Analysis: Intervene in the grant-making process; compute differences between treatment and control schools.
  Assumption(s): No spillover or contamination.
  Advantages: Intuitive; most direct approach to reducing selection bias; simple analysis methods.
  Disadvantages: Potentially disruptive to the program.

Quasi-Experimental

Matched Comparison
  Implementation and Analysis: Construct comparison groups based on student and school background characteristics, with or without propensity score methods; compute differences between treatment and comparison schools within matched pairs or groups.
  Assumption(s): Determinants of program status that also affect reading achievement are known and can all be measured.
  Advantages: Intuitive; not disruptive to the program.
  Disadvantages: May require a large number of comparison schools and extensive background data; unobservable factors may still bias estimates; untestable assumptions.

Interrupted Time-Series (School-Level)
  Implementation and Analysis: Uses multiple pre- and post-intervention test points.
  Assumption(s): There is a linear time trend in school performance; cohorts of different children are similar over time.
  Advantages: Not disruptive to the program.
  Disadvantages: Requires many test points; untestable assumptions.

Differencing with Student Fixed Effects (Student-Level)
  Implementation and Analysis: Use pre-intervention outcome measures to rescale the dependent variable and compute regression-adjusted differences at the same grade level.
  Assumption(s): Confounding student factors are stable over time.
  Advantages: Complements other approaches.
  Disadvantages: Requires longitudinal test data; untestable assumptions.


1. Experimental Design

a. Overview

The centerpiece of an experimental design is the random assignment of subjects to a treatment or control condition. For an educational intervention, random assignment can be done at the student, classroom, school, or district level. Assigning students in blocks, such as classrooms, schools, or districts, to a treatment or control group is not as statistically efficient as independently assigning each individual student, because it leaves fewer units over which treatment status varies. It is often necessary, however, to employ such blocking strategies to maintain a clean separation between treatment and control status.

For a reading program like REA, we recommend using schools as the unit of random assignment. It would be impossible to assign individual students or classrooms to the treatment without serious concerns of control group contamination or treatment spillover, because students collaborate within classrooms and teachers collaborate within schools. Under REA especially, teachers engage in professional development and work together across classrooms on lesson planning, as we observed during site visits to REA-funded schools. Experimental impact estimates would be biased if there is any interaction between the treatment and control groups.

Defining treatment status at the school level, therefore, is congruent with both the practical limitations of field-based research and the theoretical requirements of an experimental design.

b. Estimating Impacts with an Experimental Design: An Illustration

Implementing a random assignment evaluation with schools as the unit of assignment might include the following steps:

• States compete for REA program grants just as they would in the absence of an evaluation, allocating sub-grants to districts according to standard criteria or competitive awards based on federal REA requirements.

• To be eligible for REA sub-grants, local school districts would identify a set of schools that could implement the program and agree to determine by lottery which schools receive the REA funding during the study period and which serve as control schools.3

3 Allocation of program funds to half the schools is not a formal requirement for the evaluation design. The ratio of schools assigned to the treatment and control groups will depend on both the number of schools identified as eligible within a district and the level of funding allocated to the district. For simplicity, we typically assume a balanced design, which means that half are treatment schools and half are control schools. Unbalanced designs are less statistically efficient, but may still be the only option that is feasible and cost-effective.

10


• Schools in the control group are prohibited from receiving federal REA grant funds during the study period, but would receive them afterward. The study period would be three years, to follow one cohort of students from the beginning of grade one to the end of grade three.

With this design, the evaluator can estimate the gross impact of the program as simply the difference in average test scores between treatment and control schools at the end of the study period: spring of third grade. A pretest administered at baseline, in the fall of first grade, and possibly at other times as well, could be used to reduce the variance in the impact estimates. The design relies on school districts to identify potential comparison schools.
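The lottery step in the illustration above can be sketched as a simple within-district randomization. This helper is hypothetical, not part of the report, and assumes a balanced design within each district.

```python
import random

def lottery_assignment(schools_by_district, seed=0):
    """Randomly split each district's eligible schools into treatment and
    control groups by lottery (a balanced design within each district)."""
    rng = random.Random(seed)
    assignment = {}
    for district, schools in schools_by_district.items():
        shuffled = list(schools)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for school in shuffled[:half]:
            assignment[school] = "treatment"
        for school in shuffled[half:]:
            assignment[school] = "control"
    return assignment
```

With the assignments in hand, the gross impact estimate is just the difference in mean third-grade scores between the two groups of schools, optionally regression-adjusted for the baseline pretest.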

c. Assumptions

Few assumptions are required in a well-implemented random assignment experiment. One of the few is that conditions in the control group schools are not affected by the presence of the reading program at other schools or by the rejection of their own grant application. Several conditions might violate this assumption, although each can be monitored if the evaluation also measures implementation, as recommended earlier. For example, teachers who benefit from program-funded professional development might move from a treatment school to a control group school or share ideas with other teachers working in control schools, causing the treatment to spill over into control schools. Another example would be a school principal who learns about reading instruction methods and invests time gaining teachers' commitment to the approach in preparation for applying for a grant but then is randomly assigned to the control group (and does not receive funds). If that school still implements instructional practices recommended by the National Reading Panel, for example, it would no longer represent the counterfactual.4

d. Feasibility

The usual concern about randomized experiments is not restrictive, untestable assumptions, but feasibility. Random assignment is feasible if some schools eligible to receive federal funds can be denied, based on the lottery, the funds and services that would normally be provided as part of the federal grant.5 Otherwise, the data requirements of a randomized experiment are minimal: there is no need for frequent testing or extensive background data collection except to

4 An alternative to random assignment of schools is random assignment of school districts. Allocating entire districts to a treatment or a control group addresses some but not all of the cross-over issues. It is less likely that staff from schools in different districts would communicate with one another, but the rejection effects described above may still be present. For example, motivated district or school staff might use their rejected REA grant application as a means to seek alternative funding to implement the same program as the schools that did receive REA funding.

5 If some districts are unable or unwilling to participate in the study and comply with their assigned treatment status, it may be difficult to generalize the findings beyond the students and schools who did voluntarily participate.


increase the precision of the estimated impacts or learn about the time pattern of impacts, if desired.

2. Quasi-Experimental Designs

a. Overview

A variety of quasi-experimental designs can be used to evaluate the impacts of a federal reading program. A basic ingredient of the more rigorous quasi-experimental designs is a comparison group, which is composed of nonparticipants.6 For some of the same reasons that we suggested for using the school as the unit of assignment in an experimental design, we recommend the same unit for quasi-experimental designs. Educators may view quasi-experimental designs as less intrusive than experimental designs because evaluators work with naturally occurring treatment and comparison groups, defined through the usual selection procedures for who receives a grant and who does not. For example, school district officials would decide on their own, rather than by lottery, which schools receive resources to implement new reading curricula. However, quasi-experimental designs have several drawbacks, such as a requirement for more data and a greater reliance on untestable assumptions to claim that the impact estimates are unbiased.

The idea behind a quasi-experimental design using a comparison group is to construct the statistical equivalent of a randomly selected control group. This is often done by matching treatment and comparison schools on their easily observed characteristics. By necessity, the techniques used when constructing comparison groups and estimating impacts from comparison group designs are more complicated than those used in experimental designs. Instead of using random assignment to deal with selection bias, quasi-experimental designs rely on statistical matching and modeling procedures to overcome its effects.

Although many procedures can be used to construct comparison groups, perhaps the most popular procedures use propensity score matching (Rosenbaum and Rubin 1983).7 Propensity scores are the predicted probability that treatment schools and potential comparison schools will be in the treatment group. These scores are estimated from an analytic model, such as a logit model or probit model, and should include as predictor variables all characteristics hypothesized

6 Simple pre-post designs or simple interrupted time-series designs with no comparison group demand that the evaluator make very strong assumptions when making causal inferences about program effects; these assumptions are too strong in a developmental context like children's reading because the underlying maturation process is indistinguishable from program effects. As a consequence, we do not consider these designs in this report (see, for example, Cook and Campbell 1979 for an overview of designs both with and without comparison groups).

7 A simpler approach is to adjust for differences in background characteristics using a regression model of reading achievement with students from all potential comparison schools. Often, this approach will provide estimates of program impacts that are similar to those from the matched comparison design based on propensity scores; however, the matching process allows the evaluator (1) to weed out schools from the analysis that are poor matches and (2) to explicitly construct a counterfactual that is compatible with the research questions.


to influence whether a school participates in the federal reading program. To select the comparison group, the evaluator identifies the candidate comparison schools whose propensity scores are similar to the scores of treated schools. There are several other quasi-experimental design strategies in addition to matching, many of which can be combined in a single evaluation. We discuss two here, interrupted time series and individual differencing, presented in Table 1 as potential enhancements to the matched comparison design.
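The propensity score estimation step can be sketched in code. The example below fits a logit model of treatment status on school characteristics with a plain Newton-Raphson routine and returns each school's predicted probability of treatment. The covariates and the selection rule are simulated stand-ins, not data from any actual REA competition.

```python
import numpy as np

def estimate_propensity_scores(X, treated, n_iter=25):
    """Fit a logit model of treatment status on school characteristics
    (Newton-Raphson) and return predicted treatment probabilities."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))     # current fitted probabilities
        W = p * (1.0 - p)                        # logit weights
        # Newton step: beta += (X'WX)^{-1} X'(y - p)
        beta += np.linalg.solve(X1.T @ (W[:, None] * X1), X1.T @ (treated - p))
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

# Simulated school-level covariates standing in for, e.g., enrollment,
# percent minority, and percent free/reduced-price lunch (standardized).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
treated = (X[:, 2] + rng.normal(size=200) > 0).astype(float)  # selection on observables
scores = estimate_propensity_scores(X, treated)
```

Comparison schools would then be drawn from the non-grantees whose scores overlap the range of the treated schools' scores.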

b. Estimating Impacts with a Quasi-Experimental Design : An Illustratio n

A typical quasi-experimental design would be implemented as follows:

• States apply for and are awarded grants through normal procedures

• States provide funds to school districts, and school districts select schools through their normal procedures

• The evaluator collects some basic data on all the schools in the selected districts, and possibly on others chosen as potential comparison schools. The data are used to estimate propensity scores, which are then used to identify the best comparison schools, that is, those most similar to the schools that received REA funding. Matching is based on commonly available characteristics such as school demographics (for example, enrollment, percent minority, percent free and reduced-price lunch) or pre-intervention test score data. Ideally, there are three or four potential comparison schools for every treatment school within a district. From the larger pool of potential comparison schools, the evaluator identifies one school that matches well with each treatment school.

• The evaluator collects follow-up test score data and more detailed student and school background data from treatment and matched-comparison schools.

Although these steps appear similar to those used in an experimental design, it is substantially more difficult to analyze data generated from a quasi-experimental design than from an experimental design. First, there are often several different ways to implement matching. Second, the evaluator may need to choose from a variety of analytic strategies to further adjust for sample selection bias that may remain.

For example, we proposed using propensity scores to construct the matched comparisons. One method for using propensity scores would choose a single comparison school for each treatment school, selecting the comparison school whose score is closest to the score of the treatment school; this is called nearest neighbor matching. Another method would compare treatment and comparison schools whose scores fall within defined ranges of the distribution of propensity scores. Still other methods might be used, instead of or in addition to one of the above methods, as robustness tests, although practical experience suggests that the different propensity score methods produce similar results.
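As a concrete sketch of nearest neighbor matching, the snippet below pairs each treatment school with the still-unused comparison school whose propensity score is closest (matching without replacement). The scores are made-up illustrative values, not estimates from any real grant competition.

```python
def nearest_neighbor_match(treat_scores, comp_scores):
    """For each treatment school, pick the comparison school with the
    closest propensity score; each comparison school is used at most once."""
    available = list(range(len(comp_scores)))
    matches = []
    for ts in treat_scores:
        j = min(available, key=lambda k: abs(comp_scores[k] - ts))
        matches.append(j)
        available.remove(j)   # matching without replacement
    return matches

treat = [0.62, 0.45, 0.81]                    # hypothetical treatment-school scores
comparison = [0.40, 0.60, 0.79, 0.52, 0.85]   # hypothetical candidate pool
print(nearest_neighbor_match(treat, comparison))  # → [1, 0, 2]
```

Interval (stratified) or caliper matching would replace the `min` step with a grouping of schools into propensity score ranges or a maximum-distance tolerance.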


One strategy that some have proposed is the interrupted time series of school-level data. This strategy would use multiple test points from pre-intervention and post-intervention periods in a fixed grade level with successive cohorts of students. Bloom (1999) suggested this approach for a hypothetical evaluation of school restructuring in which children in the third grade are tested every year. His strategy was to use data from a set of treatment and comparison schools with five successive cohorts of third graders tested before the onset of the intervention and five additional cohorts of third graders tested after the intervention started. With this time series of school-level data, Bloom argues, one can predict the school-specific trend in achievement that one would expect in the absence of the intervention (the counterfactual) by extrapolating from the pre-intervention cohorts' scores, and the difference between the counterfactual and the observed performance after the intervention will equal the program's impact.
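The extrapolation logic for a single school can be sketched as follows: fit a linear trend to the pre-intervention cohort means, project it into the post-intervention years as the counterfactual, and take the average deviation of observed scores from that projection. The scores below are invented for illustration, and Bloom's full design would also difference against comparison schools rather than rely on one school's trend alone.

```python
import numpy as np

# Hypothetical mean third-grade scores for one school: five cohorts
# tested before the intervention, five cohorts tested after.
pre = np.array([500.0, 502.0, 501.0, 503.0, 504.0])
post = np.array([510.0, 512.0, 511.0, 514.0, 515.0])

# Fit a linear trend to the pre-intervention cohorts (years coded 0..4).
slope, intercept = np.polyfit(np.arange(5), pre, 1)

# Extrapolate into the post-intervention years (5..9): the counterfactual.
counterfactual = intercept + slope * np.arange(5, 10)

# Impact estimate: observed post-intervention scores minus the projection.
impact = (post - counterfactual).mean()
print(round(impact, 2))
```

The linearity of the projected trend is itself one of the assumptions discussed under "Assumptions" below.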

Another strategy that can strengthen the matched comparison design can be called differencing of individual student scores over time. A differencing design focuses on the same cohorts of students each year as they progress through the early elementary grades, using at least one pre-intervention and one post-intervention test score for the same students. In this way, students effectively serve as their own control group. Including a comparison group in the design helps adjust for maturation effects. The impact would be computed by subtracting pre-test scores from post-test scores and then subtracting this difference for the comparison group from the corresponding difference for the treatment group, producing what is referred to as the difference-in-differences estimate of the program impact. The technique can be extended to data where students are tested at several time points; this is called a student fixed effects model.
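The difference-in-differences arithmetic just described reduces to a few lines. The mean scores below are hypothetical placeholders:

```python
# Hypothetical mean scores for the same students tested pre and post.
treat_pre, treat_post = 480.0, 520.0
comp_pre, comp_post = 482.0, 510.0

gain_treatment = treat_post - treat_pre    # 40-point gain: program plus maturation
gain_comparison = comp_post - comp_pre     # 28-point gain: maturation alone

# Net out maturation by subtracting the comparison group's gain.
impact = gain_treatment - gain_comparison
print(impact)  # → 12.0
```

In practice the same computation is run on individual student scores within a regression framework, so that standard errors and covariate adjustments come along for free.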

c. Assumption s

In a quasi-experimental design, the evaluator must make strong, often untestable, assumptions in order to interpret the differences in outcomes between the treatment and comparison schools as causal effects (impacts).9 These assumptions can be summarized as follows. For a matched comparison group design, we assume that all relevant inputs to

9 A fourth strategy for reducing sample selection bias is behavioral modeling. Behavioral modeling is an attempt to use information about how students and schools come to be in the treatment and comparison groups to measure the influence of selection factors in determining program outcomes. If the influence of selection factors can be accounted for, then the remaining differences between treatment and comparison groups can be attributed to the program itself. In the REA example, the evaluator would model two processes: (1) how schools and students gain access to REA services and (2) how REA services and other factors influence reading outcomes. Methods for implementing this approach include the well-known Heckman selection model (Heckman and Robb 1985) and instrumental variables estimation. The critical assumption of the behavioral modeling approach is that we can identify at least one variable that affects program participation but not reading achievement. We do not list this approach as a separate option in Table 1, primarily because the research settings in which behavioral modeling is most successful are often discovered opportunistically by researchers rather than planned through an evaluation. Nevertheless, behavioral modeling approaches may be useful as a sensitivity test, by re-analyzing data collected through one of the other designs.


achievement that differ between participating and nonparticipating schools have been captured in the matching process. This creates a requirement for detailed background data.

The interrupted time series design, if Bloom's example is any guide, requires at least 10 years of test score data. It also requires the following two critical assumptions: (1) each successive third-grade cohort is similar enough that the cohorts would have the same average reading outcomes,10 and (2) schools have their own stable time trend in average student achievement scores, although researchers typically assume further that the trend is linear.

The individual differencing design assumes that unobserved selection factors such as motivation affect student achievement both before and after the intervention. For all of these design strategies, the extent to which the assumptions are violated affects the credibility of the impact estimates derived from them. An evaluator can try to test these assumptions, but the unmeasured bias that may remain cannot be assessed with sample information alone. Combining approaches may strengthen the findings, because we would rely less on any one assumption.

d. Feasibility

In many circumstances, a quasi-experimental design involving matched comparison groups can be implemented. However, several challenges are involved, including the need to (1) obtain the cooperation of potential comparison schools that are not participating in the federal reading program and (2) collect more information from schools and students than in the simple experimental design. Experience with other evaluations suggests that schools not participating in the federal program may resist being used as comparison schools, and there is little incentive for

them to cooperate with the evaluation. Often when schools are denied a grant, they worry that they will be singled out when comparisons of performance are made or that their staff and students will need to endure the burden created by the additional data collection. These challenges can sometimes be overcome with incentive payments, although this approach to gaining school cooperation may be expensive.

D. SAMPLE SIZES

A practical consideration in selecting an evaluation design is the number of schools, students, and test points required to detect the size of impacts that may be expected from a reading program implemented on a large scale. Although we focused on gross impacts and not developmental outcomes (which require many test points), additional test points may increase the precision with which we can estimate a program's impact. Therefore, we consider multiple test points in this analysis, along with varying the number of schools and students.11

10 One can either assume that successive cohorts are the same or try to use statistical adjustments to make them more similar. These adjustments are generally able to correct for demographic shifts, but not for changes in teacher quality and other hard-to-observe factors.

11 If multiple test points are already being used to answer the research question about how impacts vary over time (developmental outcomes), then the only remaining ways to increase precision are to include more schools in the evaluation or increase the length of the assessment instrument. We typically assume that every student in the sampled grade (school) will be included in the study.


We conducted a power analysis to determine the sample sizes that may be required to detect the expected impacts. A power analysis estimates the size of program impacts that can be detected with different sample configurations. We quantify the size of such program impacts using the minimum detectable effect size (MDE). The MDE is the smallest impact that the design is able to detect, expressed as a fraction of a standard deviation in the outcome measure (reading ability). An MDE of 0.20, for example, means that if the program is able to boost a child's reading achievement by at least two-tenths of a standard deviation (equivalent to moving the child from the 50th to the 58th percentile), there is an 80 percent chance that the impact estimate from the evaluation will be found to be statistically significant at the 5 percent level.12 If the true impact of the program is smaller than 0.20, we would be less likely to find a statistically significant result with that design. Failure to detect such an outcome is acceptable if an effect size smaller than 0.20 is not large enough to be of policy interest. A more powerful design would have a smaller MDE, but would require a larger sample.
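For a school-randomized design with equal arms, the MDE can be approximated with the standard cluster formula: MDE = (z for the significance level + z for the power target) times the standard error of the difference in mean outcomes. The sketch below uses that formula with illustrative parameter values (an assumed intraclass correlation of 0.15, 80 percent power, 5 percent significance); it will not reproduce the figures in this report, which rest on the richer multi-level latent curve model described later.

```python
import math

def mde(n_schools, students_per_school, icc=0.15, z_alpha=1.96, z_power=0.84):
    """Minimum detectable effect (in standard-deviation units) for a
    school-randomized design split evenly into treatment and control arms.
    icc is the assumed between-school share of test-score variance."""
    per_arm = n_schools / 2
    # Variance of the difference in mean outcomes, in effect-size units.
    var_diff = (icc + (1 - icc) / students_per_school) * (2 / per_arm)
    return (z_alpha + z_power) * math.sqrt(var_diff)

for schools in (10, 20, 40):
    print(schools, round(mde(schools, 50), 2))
```

Doubling the number of schools shrinks the MDE by roughly a factor of the square root of two, while adding students within a school helps little once the school-level variance dominates.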

Determining what is "of policy interest" is subjective. The range of impacts we might expect from a locally implemented federal reading program is probably large. The National Reading Panel (2000) showed that effect sizes may range from about 0.15 to about 0.80. In part, the variation can be attributed to the skill or outcome assessed, whether high- or low-level learners were assessed, and the grade in which students were tested. For this report, we have focused on the ability to detect impacts in the range of 0.10 to 0.20. This lower MDE threshold may be appropriate because a national evaluation is more likely to use standardized tests, which tend not to be as closely tied to the curriculum as the instruments used in studies summarized by the National Reading Panel. In addition, the impact estimated from such a national program evaluation may be diluted somewhat by schools whose implementation of the program is relatively weak.

The power analysis also requires an analytic model. We used a multi-level model of student achievement that accounts for variation at the school level, student level, and time level. Our model corresponds to what is described in the literature as a hierarchical linear model (Bryk and Raudenbush 1992), random effects model (Hsiao 1986), or, in the version we present, latent curve model (Muthén 1997). This power analysis is meant to be illustrative. We made tentative assumptions about the particular assessment instrument and its statistical properties. A precise description of the multi-level model and the methods and assumptions used for the power analysis are presented in Appendices B and C.

Figure 4 presents the MDEs we calculated for different sample size configurations.13 To simplify the presentation, we fixed the number of students at 50 per school, based on 20 students

12 The 80 percent and 5 percent figures are conventional but can be changed.

13 For convenience, we assumed in the power analysis that treatment status is determined by random assignment. Some quasi-experimental methods may reduce the power from what is shown in Figure 4 because more data are required to model selection effects (REA participation). Recently completed research shows that, under some conditions at least, the sample variance for matching estimators is nearly the same as for simple difference estimators from randomized designs (Agodini and Dynarski 2001). However, a cautious approach would assume that larger sample sizes than those reported here will be required for quasi-experimental designs.


[FIGURE 4. MINIMUM DETECTABLE EFFECT BY NUMBER OF SCHOOLS (50 Students per School). The figure plots the MDE (vertical axis, 0.00 to 0.25) against the number of schools (horizontal axis, 10 to 60) for four testing schedules: Pre/Post, 1 Test/Yr, 2 Tests/Yr, and 4 Tests/Yr. NOTE: Based on simulations with 1,000 replicates.]


per classroom, 2.5 classrooms per grade level (on average), and one grade level (cohort) per school.14 We assumed a single cohort would be followed over a three-year intervention and assessment period.

We considered sample configurations with up to 50 schools (25 treatment schools and 25 control/comparison schools) and several options for the frequency of testing during the evaluation period. The options include two tests (a baseline in first grade and a post-test in the spring of grade three), four tests (fall and spring of grade one, spring of grades two and three), six tests (fall and spring of each year), and 12 tests (four times per year). In each case, we assumed that half the schools would be treatment schools and half would be control or comparison schools. Unbalanced designs (unequal numbers of treatment and comparison schools) reduce the power slightly from what we present here.

The lines in Figure 4 show the MDEs associated with each option for different numbers of schools. All else being equal, adding more test points always lowers the MDEs, allowing the evaluation to detect smaller and smaller gross impacts. The separation between the lines indicates that the greatest gains in MDEs are achieved by increasing the number of testing points from two to four. Over three years, this reduces the MDE from 0.14 to 0.08 with 20 schools in the evaluation (10 treatment schools and 10 control/comparison schools).

The slopes of the lines indicate the improvement in MDEs that results from increasing the number of participating schools. In this regard, the greatest gains in MDEs are achieved by increasing the number of schools from 10 to 20. Assuming, for example, that two test points are used, increasing the number of schools from 10 to 20 reduces the MDE from 0.21 to 0.14.15 Figure 4 also clearly shows that adding schools to the evaluation reduces the MDE, but in every case there are diminishing returns. The value of adding schools beyond 30 or 40 seems negligible in this example.

Figure 4 can also be used to determine the different sample size configurations for achieving a target MDE. Assuming that we expect a minimum detectable effect of 0.10, for example, our power analysis suggests that we can achieve this with 40 schools and two test points over three years or with 14 schools and four test points. These two designs are equally powerful, but their cost is likely to differ.16 Cost information would illuminate the tradeoffs involved in adding schools or adding test points.

14 An elementary school used in a typical evaluation would have about 350 students, roughly 50 in each grade. Actual schools may be somewhat smaller or larger.

15 This echoes the finding reported by Muthén and Curran (1997). They showed that moving from two measurement points to three yields a much greater gain in statistical power than moving from three to five. They also remark that adding time points does make it easier to distinguish between alternative growth forms.

16 One should note, however, that this analysis rests on some necessarily crude assumptions. Before concluding that the designs are indeed equally powerful, the assumptions should be updated using empirical data relevant to a particular program and assessment instrument.


We also conducted the power analysis with 75 students per school (not shown). Including these extra 25 students per school made little difference in the MDEs. As noted, a design that achieves an MDE of 0.10 with four test points requires 14 schools (7 treatment schools and 7 control/comparison schools). Increasing the number of students per school by 50 percent (from 50 students to 75 students) lowers that requirement to about 13 schools, a savings of only one school, which in practice means using a slightly unbalanced design. Similarly, a 50-student design that achieves an MDE of 0.10 with only two test points requires 40 schools. The same design with 75 students per school requires 34 schools, a savings of six schools.17

Some additional information would be useful for assessing the tradeoffs between adding schools and adding students to gain precision. In particular, one needs to compare the incremental cost of adding schools to the evaluation with the incremental cost of adding students while using a fixed number of schools. The ratio of these costs depends in large part on the mode of data collection. Incremental costs per student are relatively low when the achievement tests are group-administered and when they are part of the district's regular testing program. In other words, it is often easier to include entire classrooms, grades, or schools in an evaluation and reduce the total number of schools than to sample students within schools. For a specific evaluation in which the mode of test administration is known, empirical data can be used to resolve the tradeoff.
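The school-versus-test-point cost comparison can be sketched with a simple cost function. The per-school and per-student-test costs below are invented placeholders; only the structure of the comparison is the point.

```python
# Hypothetical incremental costs (placeholders, not actual evaluation costs).
COST_PER_SCHOOL = 10_000        # recruiting and coordinating one school
COST_PER_STUDENT_TEST = 25      # one group-administered test, per student

def total_cost(n_schools, students_per_school, n_test_points):
    """Data-collection cost of a design: school overhead plus testing."""
    testing = n_schools * students_per_school * n_test_points * COST_PER_STUDENT_TEST
    return n_schools * COST_PER_SCHOOL + testing

# Two roughly equally powerful designs discussed in the text (MDE of about 0.10):
print(total_cost(40, 50, 2))   # 40 schools, two test points
print(total_cost(14, 50, 4))   # 14 schools, four test points
```

Under these assumed prices, the fewer-schools, more-tests design is cheaper; with a much higher per-test cost or a lower per-school cost, the ranking could reverse, which is exactly the empirical tradeoff described above.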

E. NEXT STEPS

Designing an evaluation of a federal reading program presents many operational and methodological challenges. For programs that resemble REA, experimental designs are promising, so quasi-experimental approaches may not be necessary. The appropriate number of students, schools, and test points will depend on details of the program, the study questions, and the psychometric properties of the assessment instrument, all of which need to be investigated further. For our illustrative example, we calculated that a random assignment design with about 50 students per grade in each school would require 20 treatment and 20 control schools tested in the fall of grade one and in the spring of grade three to detect policy-relevant impacts. Or it could achieve nearly the same precision by testing more frequently with fewer schools. For a matched comparison (quasi-experimental) design, we suggested that the initial sample of potential comparison schools may have to be slightly larger and that it may be necessary to collect extensive data on students, their families, and their schools to adjust for sample selection bias. To estimate the impact at different developmental stages, considerably more test points and schools may be needed.

17 Similar calculations can be used to estimate the design's ability to detect impacts for subgroups of students. Once the subgroups are identified, we would simply take the average number of students in each subgroup per school and re-calculate the MDEs corresponding to that many students per school.


The next steps for refining the design are full specification of the intervention to be evaluated, selection of a test instrument, and development of a cost function that relates the cost of adding schools, students, and test points to the data collection plan. After that, it will be necessary to refine the assumptions that feed into the power analysis. The key assumptions include the conditional intraclass correlation of test scores and the correlation of test scores from one occasion to the next; the conditional intraclass correlation is the amount of between-school variation in test scores versus within-school variation after adjusting for measurable student characteristics. We can find realistic values for these parameters once the testing instrument for a particular evaluation is identified. With more concrete design parameters, we can assemble more practical evidence concerning the tradeoffs outlined throughout this report and move forward with an efficient, informative evaluation design.


REFERENCES

Agodini, Roberto, and Mark Dynarski. "Are Experiments the Only Option? A Look at Dropout Prevention Programs." Princeton, NJ: Mathematica Policy Research, August 2001.

Birman, Bea, Kwang Yoon, Mike Garet, Andy Porter, Laura Desimone, Marc Moss, and Beth Gamse. "Assessment Options for the LEESI and REA SCII." Washington, DC: American Institutes for Research, April 2001.

Bloom, Howard. "Estimating Program Impacts on Student Achievement Using 'Short' Interrupted Time Series." Working Paper. New York: Manpower Demonstration Research Corporation, August 1999.

Bryk, Anthony S., and Stephen W. Raudenbush. Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage Publications, 1992.

Cook, Thomas D., and Donald T. Campbell. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston, MA: Houghton Mifflin, 1979.

Heckman, James J., and Richard Robb, Jr. "Alternative Methods for Evaluating the Impact of Interventions." In James Heckman and Burton Singer, eds., Longitudinal Analysis of Labor Market Data. New York: Cambridge University Press, 1985.

Hsiao, Cheng. Analysis of Panel Data. Econometric Society Monograph No. 11. Cambridge: Cambridge University Press, 1986.

Muthén, Bengt. "Latent Variable Modeling with Longitudinal and Multilevel Data." In A. Raftery, ed., Sociological Methodology 1997. Boston: Blackwell Publishers, 1997.

Muthén, Bengt, and Patrick J. Curran. "General Longitudinal Modeling of Individual Differences in Experimental Designs: A Latent Variable Framework for Analysis and Power Estimation." Psychological Methods, vol. 2, no. 4, 1997, pp. 371-402.

National Reading Panel. "Teaching Children to Read: An Evidence-Based Assessment of the Scientific Research Literature on Reading and Its Implications for Reading Instruction." Report of the National Reading Panel, Reports of the Subgroups. Washington, DC: National Institutes of Health, December 2000.

Rosenbaum, Paul R., and Donald B. Rubin. "The Central Role of the Propensity Score in Observational Studies for Causal Effects." Biometrika, vol. 70, no. 1, 1983, pp. 41-55.

Satorra, A., W. E. Saris, and W. M. de Pijper. "A Comparison of Several Approximations to the Power Function of the Likelihood Ratio Test in Covariance Structure Analysis." Statistica Neerlandica, vol. 45, no. 2, 1991, pp. 173-185.

Voight, Janet, Terry Salinger, and Rita Kirshstein. "Review of Reading Assessments, Grades K-3." Draft Report. Washington, DC: American Institutes for Research, February 2001.


Willett, John B. "Questions and Answers in the Measurement of Change." Review of Research in Education, vol. 15, 1988, pp. 345-422.

Willms, J. Douglas, and Stephen W. Raudenbush. "A Longitudinal Hierarchical Linear Model for Estimating School Effects and Their Stability." Journal of Educational Measurement, vol. 26, no. 3, Fall 1988, pp. 209-232.


APPENDIX A

AN ILLUSTRATION OF A FEDERALLY FUNDED READING PROGRAM: THE READING EXCELLENCE ACT


The REA program, administered by the U.S. Department of Education (ED), has awarded three-year grants totaling more than $700 million to 40 states and jurisdictions in three yearly competitions beginning in FY 1999. The funding legislation grew out of federal efforts to synthesize a vast literature on effective reading instructional practices, as described below. The literature review suggested that some specific instructional practices were scientifically shown to be effective. Encouraged by the review's findings, Congress enacted REA in 1998 as part of Title II of the Elementary and Secondary Education Act of 1965. REA is intended to help every child read by the end of third grade, provide children in early childhood with the readiness skills and support they need to learn to read once they enter school, expand the number of high-quality family literacy programs, provide early intervention to children at risk of being identified inappropriately for special education, and provide the impetus to ground classroom instruction in scientifically based reading research. Most of the funding is allocated for reading and literacy grants to states. Over 80 percent of the state funds are directed toward state competitions for two-year subgrants to districts and schools that exhibit "extreme need."

REA funds are awarded through grants and subgrants that pass through states and school districts. Funds are awarded to states through a competitive review process. Successful states award subgrants, through their own competitive review processes, to local educational agencies as Local Reading Improvement (LRI) subgrants and Tutorial Assistance (TA) subgrants. Under the terms of the legislation, 85 percent of the state award must go to LRI subgrants. LRIs are directed to districts that have (a) at least one school in Title I school improvement status, (b) the highest or second-highest rates of poverty in the state, or (c) the highest or second-highest number of poor children in the state. LRI subgrants are intended to provide professional development, operate tutoring programs, and provide family literacy services. TA subgrants are intended to provide tutorial assistance in reading to children with reading difficulties. The evaluation strategies discussed in this report would be especially relevant for LRI subgrants.

The program requires all services provided under the subgrants to be based on scientifically grounded reading research and to be consistent with the local school's reading program. Activities are expected to address each of the following aspects of reading:

• The skills and knowledge to understand how phonemes, or speech sounds, are connected to print

• The ability to decode unfamiliar words

• The ability to read fluently

• Sufficient background information and vocabulary to foster reading comprehension

• The development of appropriate active strategies to construct meaning from print


States and LEAs can incorporate these activities into their reading programs in a variety of ways. One state, for example, provided a five-day summer training session to introduce teachers to scientifically based reading research, the new state reading benchmarks, and the six dimensions of reading. In addition to one-day follow-up workshops every 8 weeks, the state funds Master Teachers in reading to help teachers transfer the workshop learning into daily instructional strategies. The following vignette describes how a Master Teacher demonstrates a typical reading lesson in a rural classroom with limited local opportunities for professional development. The classroom teacher observes from the back.

In Green Hills Elementary School, a master teacher from a nearby county has a class of first-grade students call out words to describe Spring while she lists them on the board. The regular classroom teacher watches from the back of the classroom and takes notes. The master teacher's tone and facial expressions encourage the children. As their enthusiasm grows, so does the list of words. The master teacher guides the students to examine the words, read them aloud, and group them by color, smell, and sound. On an overhead projector, the master teacher displays a poem on Spring filled with blanks, and the students take turns selecting words from their list to fill the blanks. She leads the children in reciting the completed lines, and by the time the poem is finished, the children have recited it many times and some have memorized it. Everyone gets to copy and illustrate the poem as the teacher works individually with children at their desks. Tonight's homework will be to read their poem to their parents.

After the children file outside for recess, the master teacher and the classroom teacher critique how the lesson was constructed, how it played out, how the master teacher responded to the students, and how the regular teacher could adapt the essential elements of that lesson to other lessons.


APPENDIX B

ANALYTIC MODEL


Underpinning the power analyses was an analytic model that can be applied regardless of whether an experimental or quasi-experimental design is used. The model is based on a general statistical model that goes by many names, including random effects model, hierarchical linear model, and latent curve model (Bryk and Raudenbush 1992; Muthén 1997; Willms and Raudenbush 1989; Willett 1988). It proposes that reading achievement is determined by factors that vary across students, between schools, and over time.

1. The Basic Model

We can express the test score of student i in school j at time t (Y_ijt) as the sum of a set of latent variables (u, v, w, e) that vary along these dimensions of student, school, and time, as follows:

(1)  Y_ijt = u_i + v_j + w_jt + e_ijt

This is a student-level model. Of particular interest is the time-varying component of the school effects, w_jt. It captures the effects of interventions whose timing and strength, it is assumed, can be explicitly measured. We can decompose the time-varying school effect using a school-level model as follows:

(2)  w_jt = d * T_jt + ω_jt

The treatment variable T_jt is set equal to 1 if school j had a reading program available in period t, 0 otherwise. It can also be set to a number between 0 and 1 to represent the fraction of time in period t that students in school j were exposed to program services. The coefficient on the treatment variable, d, represents the effect of the intervention per unit of time. We typically rescale the effect size into standard deviations per year so we can compare models that allow larger or smaller time periods between testing occasions.
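The rescaling described above is simple arithmetic; the following is a minimal illustrative sketch (the function name and the example numbers are ours, not from the report):

```python
def annualize_effect(effect_per_period, months_between_tests):
    """Rescale an impact measured in standard deviations per testing
    period into standard deviations per year, so that designs with
    different intervals between testing occasions can be compared."""
    return effect_per_period * 12.0 / months_between_tests

# A 0.05 SD gain measured over a 6-month testing interval corresponds
# to 0.10 SD per year of program exposure.
```

For instance, a hypothetical per-period impact of 0.05 SD with tests 6 months apart annualizes to 0.10 SD per year.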

If treatment status is determined by a random process similar to a lottery, then there is no danger that the program impact estimate, the estimate of the coefficient d, would be contaminated by selection bias. Formally, selection bias is manifested by a correlation between the treatment variable and the error term ω_jt, which includes all the other factors that determine a school's effectiveness at raising reading achievement. Under random assignment, the treatment variable is by construction uncorrelated with everything.

Equations (1) and (2) represent a hierarchical model stripped down to show the basic features of the student achievement growth process described above. We assume that, when estimated, time-varying covariates such as the interaction between time and free-lunch eligibility status will be included in equation (2) to reduce the variance of the error term and hence result in a more precise estimate of the treatment effect. For ease of presentation (but without loss of generality) we assume that there is no secular time trend. That is, normal achievement "growth" would be reflected by having the same score in every period, because we assume the test is scaled to reflect ability relative to a nationally normed age cohort. Thus, if the mean were set at 50 and a child's score were 50 in every period, then that child would have shown normal achievement growth over the period. This assumption refers to the scaling of the test, not to the test's content or difficulty.

2. Basic Model with Treatment Interactions

The impact estimate from the basic model represents the average or gross treatment effect, however defined, on reading achievement. To answer more detailed questions about whether a program has a differing impact for certain types of students or schools, we can define interaction terms that break down the gross effect into a sum or average of separate components.

(3a)  w_jt = d_0 * T_jt + d_1 * (T_jt * X̄_j) + ω_jt

(3b)  w_jt = d_0 * T_jt + d_1 * (T_jt * Z_j) + ω_jt

(3c)  w_jt = d_t * T_jt + ω_jt

The model given by equation (3a) presents impacts that vary with the school's average student characteristics (X̄_j). Because the unit of treatment assignment is the school, it is difficult to observe directly how the treatment might have a different impact on different types of students within the school. We can still address the issue of whether impacts differ, for example, for low-income versus higher-income students or for girls versus boys. Equation (3a) measures subgroup impacts by comparing impacts across schools that vary in the composition of their student body. If schools with higher concentrations of low-income students show larger impacts, we conclude that the program has a greater impact for low-income students.¹ The impact of the treatment as measured by equation (3a) is (d_0 + d_1 * X̄_j), which depends on the average student characteristics in the school.
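As a small worked illustration of the expression above (the function name and numbers are hypothetical, not from the report), the implied impact for a school is just the base effect plus the interaction term evaluated at that school's composition:

```python
def subgroup_impact(d0, d1, xbar):
    """Treatment impact implied by equation (3a) for a school whose mean
    student characteristic (e.g., share of low-income students) is xbar."""
    return d0 + d1 * xbar

# With a hypothetical base impact d0 = 0.05 SD and interaction d1 = 0.10,
# a school where 80 percent of students are low income (xbar = 0.8)
# would have an implied impact of 0.05 + 0.10 * 0.8 = 0.13 SD.
```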

Equation (3b) allows the researcher to estimate impacts that vary with school or community characteristics, represented by Z_j, which can include many variables. The impact from equation (3b) is (d_0 + d_1 * Z_j).

There are two ways to specify the model when the research question asks, "How do impacts vary over time?" Equation (3c) shows a general specification, with a treatment effect and associated time subscript. It assumes separate period-by-period impacts. If four tests are administered to the same cohort over a three-year period (assuming the initial test is a baseline), there will be three separate impacts to estimate, one for each grade level.

¹To understand the effects of targeting REA services within schools of a given composition (for example, to evaluate the effects of the REA tutorial assistance grants), a separate study would have to be undertaken in which students are the unit of treatment assignment.


To achieve a slight increase in efficiency, a less general specification can be used to estimate impacts. The second specification assumes that the time pattern of impacts is a linear (or quadratic) function of time. In other words, it assumes the impacts follow a fixed time trend like a straight line or a curve. These assumptions are slightly more restrictive than letting the impacts bounce around from one period to the next, but because the model is more concise and estimates fewer parameters altogether, it yields slightly more efficient impact estimates. The savings from the more specific model would be meaningful only if there were many testing occasions and some reason to expect that impacts follow a simple time trend.

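The period-by-period and trend specifications differ only in how the treatment regressors are constructed. A minimal sketch, assuming numpy is available (the function name is ours):

```python
import numpy as np

def treatment_columns(T, period, trend="by_period"):
    """Build treatment regressors for the time-varying impact models.
    T      : 0/1 treatment exposure for each observation
    period : integer testing occasion for each observation
    'by_period' returns one column per period (separate impacts d_t, as
    in equation (3c)); 'linear' returns a single T*period column, which
    constrains impacts to a straight-line trend with one parameter."""
    T = np.asarray(T, float)
    period = np.asarray(period)
    if trend == "linear":
        return (T * period)[:, None]
    return np.column_stack([T * (period == p) for p in np.unique(period)])
```

The by-period design matrix has as many treatment columns as testing occasions; the linear-trend design has just one, which is where the small efficiency gain comes from.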

A general problem in trying to address the timing of impacts is how to attribute gains in a given period to that period only. It may be the case that students learn something in one period but that learning is reflected in test score improvement in a later period. It would be mistaken to think that eliminating the intervention in the earlier period and intervening only in the later period, when the impact was measured, would produce the same result. Therefore, the value to policymakers of localizing impacts to a particular time period is questionable.

It is important not to confuse growth of program impacts over time with growth in individual student achievement over time. Individual student achievement growth is captured by the age-norming of the test.


APPENDIX C

METHODS FOR POWER ANALYSIS


This appendix explains how the statistical power analysis was conducted to compute the minimum detectable effect sizes (MDEs) used in the body of this report. In order to accommodate the complexity of a hierarchical model, we used a non-standard technique that relied on simulation methods. As part of this technique, we conducted the actual data analysis that would be conducted as part of the evaluation. These methods are described below. As with any statistical power analysis, we had to make several assumptions. These are also described below.

1. Simulation Approach to Power Analysis

It is customary to estimate the statistical power of an evaluation design as follows. Define a key hypothesis test in terms of data to be collected. Derive a formula for the test statistic that uses the observed data. Set some probability for a Type I error (the statistical significance level) and a Type II error (which declines with statistical power). Rearrange the formula to solve for the functional relationship between sample size and the smallest impact that the design can detect. This is called the minimum detectable effect (MDE).
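Under a normal approximation, the rearranged closed-form relationship takes a standard shape; the sketch below illustrates it (this is the textbook formula, not the simulation method this appendix actually uses, and the function name is ours):

```python
from statistics import NormalDist

def closed_form_mde(se, alpha=0.05, power=0.80):
    """Smallest true impact detectable with the given power in a
    two-sided test, given the standard error of the impact estimate:
    MDE = (z_{1 - alpha/2} + z_{power}) * SE."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * se
```

For example, with a standard error of 0.10 SD, the conventional 5 percent significance level and 80 percent power give an MDE of roughly 0.28 SD. The simulation approach described next replaces this formula when no closed form is available.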

A problem with this approach can be that deriving the formula for the test statistic of interest may be overly complicated, depending on the sampling design, the structure of the data one expects to collect, and the methods used to analyze the data. In many cases, there is no closed-form solution to derive. Most researchers approximate the power function (the function that relates sample size to MDE) using rules of thumb rather than sound principles. To avoid this potential source of error, we used an alternative approach, one that relies on Monte Carlo simulations. The method, used by Satorra et al. (1991), works as follows.

First, we specify the data generating process in as much detail as we think is realistic. Then we use a random number generator to draw a sample using the assumptions about the sampling distributions of unknown quantities (see below for assumptions). The sample consists only of latent variables, plus a binary indicator for treatment status, where half the sample is assigned to treatment and half to control. From this random sample, we conduct the analysis as we would with the actual data, including the hypothesis test of whether the impact estimate is statistically significant (different from zero). We draw another random sample, repeat the hypothesis test, and do this again and again for some large number of iterations. We then list and sort the test statistics from these repeated samples from the same distribution to identify the smallest effect size that would result in rejection of the null hypothesis in 80 percent of the cases (corresponding to a statistical power of 80 percent). This is the replicate-based MDE, which we report in this text. As we increase the assumed sample size, we can observe how the MDE becomes smaller and smaller (at a diminishing rate). The method has the flexibility of conducting the power analysis under a wide range of assumptions, including those that involve complex data generating processes or complex impact estimation methods.
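The replication loop above can be sketched in a few lines. The sketch below uses the variance components assumed later in this appendix, but the sample sizes, the single-period setup, and the simple difference-in-means test are simplifying choices of ours, not the report's full hierarchical estimator:

```python
import numpy as np

def simulated_power(effect, n_schools=40, n_students=60,
                    n_reps=500, seed=0):
    """Fraction of simulated replications in which a true school-level
    impact of `effect` SD is detected at the 5 percent level.  Latent
    draws follow the appendix's assumptions: v ~ N(0, 0.1/3),
    w ~ N(0, 0.1), and student noise u + e with total variance 2/3,
    entering only through the school mean."""
    rng = np.random.default_rng(seed)
    half = n_schools // 2
    treat = np.repeat([1.0, 0.0], half)      # half treatment, half control
    rejections = 0
    for _ in range(n_reps):
        v = rng.normal(0.0, np.sqrt(0.1 / 3), n_schools)   # school effect
        w = rng.normal(0.0, np.sqrt(0.1), n_schools)       # school-time effect
        noise = rng.normal(0.0, np.sqrt(2 / 3 / n_students), n_schools)
        y_bar = v + w + effect * treat + noise             # school mean scores
        diff = y_bar[:half].mean() - y_bar[half:].mean()
        se = np.sqrt(y_bar[:half].var(ddof=1) / half +
                     y_bar[half:].var(ddof=1) / half)
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / n_reps
```

The replicate-based MDE is then found by scanning `effect` upward until this function first returns at least 0.80.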


2. Estimation Method Assumed for the Power Analysis

A key feature of the simulation method of conducting power analysis is that it requires us to go through the steps of analyzing the data for the evaluation design using simulated data, exactly as we would when conducting the evaluation itself with real data. For the results shown in this report, we estimated the model in two stages. The first stage was a student-level regression. We let test scores for student i in time t be a function of dummy variables for each school j at each time point t. We omit any time-varying regressors and the time-invariant components of the student and school effects, assuming for clarity of exposition that they are uncorrelated with the school-time interactions and are absorbed into the composite error term ε_ijt:

(C1)  Y_ijt = w_jt + ε_ijt

The second stage was a random effects regression at the school level:

(C2)  w_jt = d * REA_jt + ω_jt

where ω_jt is assumed to include a school-specific random effect. The impact estimate is the estimated coefficient d from this regression of school-time effects on treatment status (REA).
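The two stages can be sketched as follows. This is an illustrative simplification of ours (for a single time point, with school means standing in for the school-dummy coefficients and ordinary least squares standing in for the random effects regression):

```python
import numpy as np

def two_stage_impact(scores, school, treated):
    """Stage 1: estimate the school-level effects w_jt as school mean
    scores (the coefficients on school dummies in equation (C1)).
    Stage 2: regress those estimated effects on the treatment
    indicator; the OLS slope is the impact estimate d."""
    schools = np.unique(school)
    w_hat = np.array([scores[school == j].mean() for j in schools])  # stage 1
    T = np.array([treated[school == j][0] for j in schools], float)
    T_c = T - T.mean()                                               # stage 2
    return (T_c @ (w_hat - w_hat.mean())) / (T_c @ T_c)
```

With simulated data generated under a known true impact, this estimator can be run on each replication exactly as it would be run on real evaluation data, which is the point of the simulation approach.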

3. Assumptions Used in the Power Analysis

Any power analysis must rely on assumptions about the expected variances of the unknown error terms. In a hierarchical setting, such as student outcomes measured within schools at multiple time points, it is important to account for the variation that is specific to schools, students, and repeated measures. For the analysis in this report, we assumed that the outcome is the sum of four latent variables. These are unobserved random variables with known (assumed) distributions. We assume all the components are mutually independent.

(C3)  Y_ijt = w_jt + {u_i + v_j + e_ijt}

The terms in braces in equation (C3) correspond to the composite error term in equation (C1). The following assumptions were made for the variance components in order to create a realistic balance between variation at the student and school level from one time period to the next.

u_i ~ N(0, 1/3), fixed for each individual i (time-invariant student effect)

v_j ~ N(0, 0.1/3), fixed for each school j (time-invariant school effect)

w_jt ~ N(0, 0.1), varies by school and time (time-varying school effect)

e_ijt ~ N(0, 1/3), varies by student, school, and time point. This represents all the idiosyncratic determinants of student achievement.


This results in the following key parameters:

• Intra-class correlation coefficient: approximately 0.15

• Test-retest correlation: approximately 0.50
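Under the mutual-independence assumption, both parameters follow directly from the assumed variance components; a quick arithmetic check of ours (which lands near the report's rounded figures):

```python
# Assumed variance components from this appendix.
var_u, var_v, var_w, var_e = 1 / 3, 0.1 / 3, 0.1, 1 / 3

total = var_u + var_v + var_w + var_e    # total outcome variance = 0.8

# Share of outcome variance between schools (time-invariant plus
# time-varying school components).
icc = (var_v + var_w) / total            # about 0.17

# Correlation between two scores for the same student, which share
# the student and school components u_i and v_j.
test_retest = (var_u + var_v) / total    # about 0.46
```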

These were chosen to be consistent with the range of estimates in the literature and the expectations of reading experts consulted for this study. These assumptions can be updated to be more realistic by doing some preliminary calculations with existing test score data. First, however, a specific assessment instrument and evaluation design must be determined.