
Prioritizing Growth But Underutilizing Growth Scales:

Implications of Advances in Growth Modeling for Educational Policy and Practice

Understanding and supporting growth in student achievement has garnered much attention

among practitioners, policymakers, and researchers. This interest is based in part on well-

documented shortcomings of only considering a student’s achievement at a point in time when

making decisions about students, teachers, and schools (Darling-Hammond, 2007; Duckworth,

Quinn, & Tsukayama, 2012; Kim & Sunderman, 2005).

Despite emphasis placed on student growth, educational practice still tends to rely primarily on

static achievement. For example, most states participating in the federal Race to the Top

competition identified the bottom 5% of schools and restructured many of those schools based on

achievement from a single point in time. Even when policies and practices do include estimates

of growth, they tend to rely on fairly coarse approaches like comparing separate cohorts of

students over time or standardizing test scores and comparing rank orderings between two

timepoints.

In this symposium, we introduce innovative growth modeling techniques to address several high-

profile issues in education policy and practice that revolve around making better inferences about

student learning using standardized achievement data. The initial paper in the symposium lays

out a new approach to modeling the seasonality of vertically-scaled achievement test data, which

often grows curvilinearly from fall to spring, but can drop off between spring and fall of the

following academic year. This Compound Polynomial (CP) model simultaneously estimates a

smooth growth curve over time and the jagged spring-to-fall drops. The remaining three papers

either apply the CP directly or use related growth modeling techniques to examine specific issues

of practice and policy. First, the CP model is used to estimate summer learning loss, a major

concern among policymakers. Most previous research on summer learning loss only compares

test scores from spring and fall, but does not account for growth trends outside of summer break.

Second, school value-added models are estimated using a student growth model, a very different

approach to quantifying school effectiveness than the traditional models that regress a post-test

score on a pre-test score (both standardized). This work shows that school effectiveness

estimates between the two approaches are quite different, and provides evidence that value-

added estimates can give educators useful information not always available when scale

properties are removed from the test through standardization or use of ordinal models. Third,

dynamic measurement modeling is used with vertically scaled longitudinal data to estimate a

student’s predicted capacity asymptote, which provides evidence on a student’s capacity to learn a

subject. Results suggest that impoverished students, despite having developed less mathematics

ability on average than their more privileged peers by 8th grade, nonetheless retain a practically

equal capacity for learning within that domain in the future.

Altogether, these papers show that using innovative methods in conjunction with vertically

scaled assessment data that states and districts are currently collecting can generate inferences about

educational policies that are useful to educators and policymakers. These inferences cannot

typically be made under the common practice of standardizing test scores, using ordinal models,

or otherwise doing analyses that do not harness vertical scales.

Citations

Darling-Hammond, L. (2007). Race, inequality and educational accountability: The irony of ‘No

Child Left Behind.’ Race Ethnicity and Education, 10(3), 245–260.

Duckworth, A. L., Quinn, P. D., & Tsukayama, E. (2012). What No Child Left Behind leaves behind:

The roles of IQ and self-control in predicting standardized achievement test scores and report

card grades. Journal of Educational Psychology, 104(2), 439–451.

Kim, J. S., & Sunderman, G. L. (2005). Measuring academic proficiency under the No Child Left

Behind Act: Implications for educational equity. Educational Researcher, 34(8), 3–13.

Paper 1

Title: Modeling Growth by Adding Curves: The Compound Polynomial for Seasonal Time-Series

Background / Context:

While many alternative growth functions are available for describing learning change over time, the choice of a model depends highly on key graphical features of a given set of data. In this paper, we question the appropriateness of the conventional polynomial growth

curve, given as

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \varepsilon_i \qquad (1)

for score y_i observed at time x_i (of order p) for assessment, or other, data that exhibit clear seasonality or cycles.

An example of a data series that exhibits a marked form of seasonality is the Measures of

Academic Progress, or MAP Growth®, an interim assessment which is offered by Northwest

Evaluation Association to elementary and high school students across the US. When we focus on

aggregate trends, we find patterns of mathematics and reading score means (in RITs or Rasch

Units) for a typical district, as shown in Figure 1. Notice the prominent pattern of mean

score drops from the Spring term of one grade to the Fall term of the next, the so-called “summer

drop-offs.” Growth appears as a chain of upward-tilting chevron curve segments from the lower

to the upper grade-levels.

Also shown are the fitted conventional polynomials (quadratic), indicating how inadequately

they navigate the observed data points. We focus on conventional polynomials not only because they are familiar to analysts but also because they are often an effective option for many educational applications. It is evident that, if conventional polynomials are employed for the above data, the drawbacks are familiar: inefficient estimates due to serially correlated errors, higher-order model terms that are hard to interpret, and poor data-model fit with clearly discernible prediction bias.

Purpose:

First introduced in Thum and Hauser (2015) for developing growth norms for MAP Growth

assessments, the compound polynomial, or CP, has been employed in other growth modeling of MAP Growth data by Thum and Matta (2015a, 2015b, 2016) and by Thum and Soland (2017). In

this paper, we present the basic principle of the CP and, with the help of selected analyses from

past applications, show that the CP is to be preferred when compared with the conventional

polynomial.

While the original motivation, and discovery, of the CP comes from a successful attempt to fit

NWEA MAP Growth data better, it turns out that the effect of adding “primary curves” to

achieve a more desirable “overall curve” has a long history, including Bernoulli’s study of

harmonics and Thomas Young’s attempt to give a better account of the interference of light and

sound waves toward the latter half of the 18th century (Kipnis, 1991). Figure 2 depicts the

addition of y1, y2, and y3, three sound waves of varying amplitudes, frequencies, and phases.

Note how the compound wave departs starkly from the general character of each of its

contributors.
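As a quick illustration (ours, not from the paper), the compound wave in Figure 2 can be reproduced in a few lines of R directly from the formulas in the figure legend; treating the phase shifts (0, 60, 40) as degrees is our assumption.

  # Reproduce Figure 2: three sine curves and their compound wave.
  # Amplitudes, frequencies, and phases are taken from the figure legend;
  # interpreting the phase shifts as degrees is an assumption.
  x   <- seq(-12, 8, by = 0.01)
  deg <- pi / 180                       # degrees-to-radians conversion
  y1  <- 1.4 * sin(2.0 * x +  0 * deg)
  y2  <- 2.0 * sin(1.2 * x + 60 * deg)
  y3  <- 2.2 * sin(3.0 * x + 40 * deg)
  plot(x, y1 + y2 + y3, type = "l", ylim = c(-6, 4), xlab = "x", ylab = "y",
       main = "Adding 3 Sine Curves of Different Amplitudes, Frequencies, and Phases")
  lines(x, y1, lty = 2); lines(x, y2, lty = 3); lines(x, y3, lty = 4)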

Setting / Population / Participants:

This study employs student mathematics RIT scores for the Northwest Evaluation Association

MAP Growth® assessment, a computerized adaptive test that is used by over 6,000 districts

across the US. We use all the available fall, winter, and spring MAP mathematics RIT scores (592,305 in total) for a random sample of 1,443 schools serving 130,077 students who attended grades 3, 4, and 5.

Method / Research Design:

To describe seasonal or cyclical data series, such as that depicted in Figure 1, we employ a

weighted sum of several cross-segment polynomial functions. Each component describes how a

within-segment polynomial coefficient changes over segments. For the general case with g = 1, 2, \dots, G segments, n_g observations in segment g, and a within-segment polynomial P of degree K \le \min(n_g), the CP is given by

(I_G \otimes P)\,\left(Q_1 \otimes u_1 + Q_2 \otimes u_2 + \cdots + Q_k \otimes u_k + \cdots + Q_K \otimes u_K\right) \qquad (2)

where Q_k is the between-segment polynomial model for the k-th within-segment term and u_k is a K \times 1 null vector with “1” in row k, 1 \le k \le K, for k = 1, 2, \dots, K. Modifications to Equation 2 (to trace primary effects such as Spring-Fall summer loss, to accommodate a centering of time, or to vary the inter-point spacing of time to reflect a varying number of instructional weeks, for example) are minimal and will be detailed in the completed paper.

Returning to the earlier motivating example from MAP Growth, what are added together in a CP

(see Figure 3) are a polynomial capturing the change in Fall mean scores over grade-levels

(y1, left vertical axis) and a second polynomial describing the change of Fall-Spring gains over

grade-levels (y2, right vertical axis). The result of adding two continuous curves is a surprising

near-discontinuous growth curve that better describes the summer loss typically observed in

interim assessments when compared with the conventional polynomial growth curve.
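For concreteness, a minimal R sketch of this idea (ours, not the authors' estimation code) fits a CP to the term means shown in Table 1 below by ordinary least squares, with fall status quadratic in grade and within-grade growth linear in the instructional-day counts. Using raw instructional days as the within-grade clock is our assumption, so the coefficient scaling need not match Table 2 exactly.

  # Compound polynomial fit to the Table 1 term means (Fall grade 3 - Fall grade 6).
  d <- data.frame(
    rit   = c(192, 199, 206, 204, 210, 216, 213, 218, 225, 221),
    days  = c(15, 81, 155, 192, 259, 332, 368, 438, 511, 547),
    grade = rep(3:6, times = c(3, 3, 3, 1))
  )
  d$g      <- d$grade - 4                               # center grade at 4th-grade fall
  d$within <- d$days - ave(d$days, d$grade, FUN = min)  # days since the fall test
  cp_fit <- lm(rit ~ g + I(g^2) + within, data = d)     # quadratic fall status plus
  summary(cp_fit)                                       # a constant within-grade slope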

Findings / Results:

Table 1 displays MAP Growth mathematics means and instructional days for Fall, Winter, and

Spring terms in grades 4, 5, and 6 to be employed in our illustrative example. Table 2 displays

the estimation results for a CP model with those from a regular polynomial model describing the

change in mathematics score means over 10 consecutive terms from the Fall of grade 3 through

Fall of grade 6.

We find that a null model shows a strong sample first-order auto-correlation of 0.612. We also find that the best-fitting regular polynomial model is one that is linear in time, with a residual sum-of-squares of 59.317 and an auto-correlation estimate of 0.320. In contrast, the best-fitting

CP model indicates that Fall results are quadratic over grade-levels and that the within-grade

trend is linear and constant over grades 3 to 6. The residual sum-of-squares for the CP model is

much lower at 8.311. Equally important, as the auto-correlation estimate of 0.042 indicates,

residuals are no longer correlated. Figure 4 displays the observed data, predicted results, and the

residuals for this illustration.

Conclusions:

We present a new class of growth curves that is obtained by the weighted sum of cross-segment

polynomial functions, defined for each segment-specific polynomial coefficient. We show that

CP curves are surprisingly flexible in form and thus they provide a better fit to seasonal or

cyclical data. With the help of several examples, we also show how clinically useful parameters, i.e., parameters that convey key and meaningful aspects of growth, may be obtained.

Paper 1 Appendix A.

References

Kipnis, N. (1991). History of the Principle of Interference of Light. Basel, Boston and Berlin:

Birkhauser Verlag.

Thum, Y. M., & Hauser, C. H. (2015). NWEA 2015 MAP Norms for Student and School

Achievement Status and Growth. NWEA Research Report, Portland, OR: NWEA.

Thum, Y.M., & Matta, T. (2015a, May). Predicting College Readiness from Interim Assessment

Results: Selection Modeling for Longitudinal Data. Paper presented at the Modern

Modeling Methods (M3) Conference, Neag School of Education, University of

Connecticut, CT.

Thum, Y. M., & Matta, T. (2015b). MAP College Readiness Benchmarks: A Research Brief.

NWEA Research Report. Portland, OR: NWEA.

Thum, Y. M., & Matta, T. (2016, April). Fitting Curves to Data Series with Seasonality using the

Additive Polynomial (AP) Growth Model. Presented at the Annual Meeting of the

National Council on Measurement in Education, Washington, DC.

Thum, Y. M., & Soland, J. (2017, March). School Norms for Mathematics Achievement Status,

Term-to-term Growth, and the Gender Gap. Paper presented at the SREE 2017 Spring

Conference, Washington, DC.

Paper 1 Appendix B. Tables and Figures

Table 1. Grade 4 MAP Mathematics Means (Midwestern State) and the Number of Instructional Days by School Year and Term

School Year   Term   Avg. RIT   Days
2011          F      192         15
2011          W      199         81
2011          S      206        155
2012          F      204        192
2012          W      210        259
2012          S      216        332
2013          F      213        368
2013          W      218        438
2013          S      225        511
2014          F      221        547

Table 2. Comparing a conventional polynomial with a compound (or additive) polynomial: Grade 4 MAP Mathematics Means (Midwestern State).

Conventional Polynomial* (error DF = 8)       Est.       s.e.
Intercept (4th Grade Fall)                    205.550    0.973
Linear                                        0.279      0.025
Residual SS                                   59.317
Durbin-Watson d                               2.262
Sample 1st-order Auto-correlation             -0.320

Compound Polynomial (error DF = 6)
Within-Grade   Between-Grade                  Est.       s.e.
Fall Status    Intercept (4th Grade)          204.022    0.343
Fall Status    Linear                         10.099     0.223
Fall Status    Quadratic                      -0.478     0.206
Linear         Intercept (4th Grade)          0.464      0.016
Residual SS                                   8.311
Durbin-Watson d                               1.841
Sample 1st-order Auto-correlation             0.042

* 10 data points, ρ = 0.612

Figure 1. Pattern of MAP Growth mathematics and reading means by grade and term

(F=“Fall”, S=“Spring”) for a district. Dashed lines are quadratic curves for each series.

Figure 2. The impact of adding three sound waves.

[Chart: "Adding 3 Sine Curves of Different Amplitudes, Frequencies, and Phases"; y1 = 1.4*sin(2x + 0), y2 = 2*sin(1.2x + 60), y3 = 2.2*sin(3x + 40), and the compound y1 + y2 + y3, plotted for x from -12 to 8.]

Figure 3. A compound polynomial for generic MAP Growth data.

Figure 4. Illustrative example comparing the compound (or additive) polynomial to a

conventional polynomial.

[Chart: "Adding 2 Regular Polynomial Curves"; y1 = 230 + 2*Time - 0.7*Time^2 + 0.02*Time^3 (RIT, left axis), y2 = 0.8 - 0.2*Time + 0.1*Time^2 - 0.02*Time^3 (Fall-Spring slope, right axis), and the compound y1 + y2.]

Paper 2

Title: Summer Learning Loss and Student Learning Trajectories

Background / Context:

The question of whether student learning is negatively impacted by summer vacation has

been of interest for researchers for a long time (Cooper, Nye, Charlton, Lindsay, & Greathouse,

1996; Phillips, Crouse, & Ralph, 1998). Of particular interest is whether summer learning rates

differ by student characteristics, such as socioeconomic status (SES) or race/ethnicity, which

could contribute to inequalities in academic trajectories (Quinn, Cooc, McIntyre, & Gomez, 2016).

Researchers have typically used fairly basic standardized statistics or regression models with two

time points (fall and spring) to obtain population estimates or group differences in summer

learning. These estimates provide a broad overview of summer learning patterns at an aggregate

level, but potentially mask a great deal of variability across students and do not provide

meaningful information regarding the degree to which experiencing summer learning loss is associated with students’ academic trajectories.

Purpose:

The purpose of this paper is to embed the study of summer learning loss in a longitudinal

analysis of student academic growth across school years. We estimate the variability in students’

summer learning across individuals, as well as the association between summer learning and

learning rates across the school years (e.g., growth in reading and math across elementary and

middle school). We also will explore whether minority students are more likely to experience

summer learning loss than non-minority students.

Setting:

The data for the current study come from the Measures of Academic Progress (MAP) assessment, which is administered to school-age students across the U.S. The MAP is a computer adaptive test that assesses student performance in math and reading, and is administered multiple times per school year, typically in the fall and spring. Test scores are reported in an IRT-based metric, which is equal-interval scaled, allowing for measures of growth across grades.

Population / Participants / Subjects:

The dataset includes a cohort of students enrolled in public schools who took the MAP

assessment between 2004 and 2008 in a single school district in a midwestern state. We follow

students in this district across a five-year longitudinal pattern (fourth through eighth grade, with

MAP assessments in the fall and spring of each school year). Table 1 presents sample sizes by

grade and year of data. Our analyses include 3,693 students for whom we observe math and

reading achievement scores from fourth through eighth grade.

Research Design:

The goal of the study is to obtain estimates of student score drops/increases over the

summer as well as patterns of growth across the elementary/middle school years. We estimated

overall trajectories and summer learning patterns using a growth curve model that accounts for

the seasonality of student assessment data. In this model, we treat the fall and spring MAP test

scores (level 1) as nested within students (level 2). A compound polynomial model, which is

described in greater detail in the first paper of this symposium as well as in Thum & Matta

(2015; 2016), was specified for this study. The compound polynomial set-up allows for the

simultaneous estimation of the overall learning trajectory from fall of fourth to fall of eighth grade

and the average rate of spring-to-fall (summer) growth across grades. Time is centered in the

model so that fall of fifth grade is the intercept. In the multilevel set-up, we can also include

person-level predictors such as gender and race/ethnicity to better understand how both summer

and within-school learning rates vary across students. The equations for the growth model are

provided in Appendix B. The lmer package in R was used to estimate the longitudinal

hierarchical linear models (Bates, Maechler, Bolker, & Walker, 2015).
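A minimal lmer() sketch of this specification (variable names are ours) takes long-format data carrying the X0-X4 design columns described in Appendix B and places random effects on the fall intercept, fall slope, and spring-to-fall gap, matching Equation 1:

  library(lme4)
  # dat is a hypothetical long-format data frame: one row per student-term,
  # with MAP scores and the X1-X4 design columns (X0 is the intercept).
  cp_hlm <- lmer(
    MAP ~ X1 + X2 + X3 + X4 +        # fixed effects (gamma_00 ... gamma_40)
      (1 + X1 + X3 | student_id),    # random fall status (v00), fall slope (v10),
    data = dat                       # and spring-to-fall gap (v30) per student
  )
  summary(cp_hlm)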

MAP testing schedules are set by users, and test dates typically span a 6- to 8-week window in each testing season. To better relate learning to the amount of instruction students received,

we employ an estimate of the allotted instructional time (in weeks) based on a database of

NWEA partner district calendars instead of test dates as the time scale for describing

achievement growth (Thum & Hauser, 2015).

Findings / Results:

We have conducted preliminary analyses using the math MAP assessment following one

cohort of students as they move from fourth through eighth grade. Table 2 presents the

regression coefficients from the compound polynomial model. To better represent the meaning

of these coefficients, Figure 1 shows the predicted overall trajectory as well as the separate fall-

to-fall growth (red line) and spring-to-fall (green line) components of the compound polynomial

model. The intercept term represents the fall status in 5th grade (208.9), and the estimated fall-to-

fall slope across grade-levels is 9.55 (slope of the red curve). The estimated spring-to-fall

(summer) score differences are displayed as a separate green line. The fourth model coefficient

(β3̂) represents the difference between spring and fall, where a negative value indicates the fall score is higher (i.e., summer learning), while a positive value indicates fall is lower than spring (i.e., summer drop). In the summer before 5th grade, students increased an average of 1.34 points

(β3̂ = −1.34). Across grade levels, the spring-to-fall gains decrease and become an average

summer loss by the summer before eighth grade.

Figure 2 displays the predicted trajectory as well as individual trajectories for a random

subset of students. It is clear that the overall trajectory seen in Figure 1 masks a great deal of

individual variability seen in the lines in the background. While many students showed increases,

a fair number of students displayed serious loss in MAP scores over the summer periods. More

analyses will be conducted to understand how student and school characteristics explain this

variability.

Conclusions:

Advanced modeling techniques such as the compound polynomial model can shed light

on average summer learning patterns as well as characteristics of students and schools that are

associated with summer test score drop offs. By identifying the groups of students and grade

levels in which summer learning loss is most often observed, policies can be better targeted to

alleviate gaps for the most vulnerable students.

Paper 2 Appendix A.

References

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using

lme4. Journal of Statistical Software, 67(1), 1-48.

Cooper, H., Nye, B., Charlton, K., Lindsay, J., & Greathouse, S. (1996). The effects of summer

vacation on achievement test scores: A narrative and meta-analytic review. Review of

Educational Research, 66(3), 227.

Phillips, M., Crouse, J., & Ralph, J. (1998). Does the Black-White test score gap widen after children enter school? In C. Jencks & M. Phillips (Eds.), The Black-White test score gap (pp. 229–272). Washington, DC: Brookings Institution Press.

Quinn, D. M., Cooc, N., McIntyre, J., & Gomez, C. J. (2016). Seasonal dynamics of academic

achievement inequality by socioeconomic status and race/ethnicity: Updating and

extending past research with new national data. Educational Researcher, 45(8), 443–453.

Thum, Y. M., & Hauser, C. H. (2015). NWEA 2015 MAP Norms for Student and School Achievement Status and Growth. NWEA Research Report. Portland, OR: NWEA.

Thum, Y.M., & Matta, T. (2015, May). Predicting College Readiness from Interim Assessment

Results: Selection Modeling for Longitudinal Data. Paper presented at the Modern

Modeling Methods (M3) Conference, Neag School of Education, University of

Connecticut, CT.

Thum, Y. M., & Matta, T. H. (2016, April). Fitting Curves to Data Series with Seasonality using

the Additive Polynomial (AP) Growth Model. Presented at the Annual Meeting of the

National Council on Measurement in Education, Washington, DC.

Paper 2 Appendix B. Tables and Figures and Equations

Table 1. Number of unique students, sample MAP score mean, and compound polynomial predicted value by school term

Term          Grade   Sample size   Sample mean   Predicted value
2004 Fall     4        260          196.77        198.48
2004 Spring   4        329          203.32        207.52
2005 Fall     5       1550          208.07        208.87
2005 Spring   5       1714          217.03        216.68
2006 Fall     6       3022          220.01        217.55
2006 Spring   6       3029          225.94        224.14
2007 Fall     7       3035          225.13        224.52
2007 Spring   7       3011          230.31        229.88
2008 Fall     8       2979          231.20        229.78
2008 Spring   8       2967          235.65        233.92

Table 2. Coefficients from the compound polynomial model

Fixed effects
Term                                        Estimate   Std. Error   t-value
Fall - status (5th grade)                   208.87     0.26         790.50
Fall linear growth                          9.55       0.12         77.80
Fall quadratic growth                       -0.86      0.03         -25.50
Spring-to-fall difference (4th-5th grade)   -1.34      0.21         -6.50
Spring-to-fall linear growth                0.47       0.09         5.20

Variance components
Term                                        Variance   Std. Dev.
Fall - status (5th grade)                   207.49     14.40
Fall linear growth                          2.69       1.64
Spring-fall gap (4th-5th grade)             0.45       0.67
Residual                                    28.85      5.37

Figure 1. Overall trajectory, fall-to-fall, and spring-to-fall components from the compound

polynomial model

Note. The black line with diamonds represents the predicted trajectory of scores estimated by the compound

polynomial model. For demonstration, we also split up the terms of the model and plotted the separated components

to clarify the structure of the model. The red line represents the first half of model (β0̂X0 + β1̂X1 + β2̂X2) that

describes the change in fall status across grade-level. The green line (on the scale shown on the right axis) represents

the change over time in the spring-to-fall gaps (β3̂X3 + β4̂X4).

Figure 2. Overall trajectory and individual trajectories of MAP data across grades 4-8

Note. The black line with large circles represents the predicted trajectory for the sample. The colored lines represent

individual observed trajectories for a random subsample of the cohort. The fall-to-spring growth is represented by

dashed lines while the summer (spring-to-fall) growth is represented by solid lines.

Equation 1. Basic Structure of the Compound Polynomial Hierarchical Linear Model

Level-1 Model (repeated observations t of MAP scores within students i):

MAP_{ti} = \sum_{k=0}^{4} \beta_{ki} X_{kti} + e_{ti}

Level-2 Model (students i):

\beta_{0i} = \gamma_{00} + v_{00i}, where \gamma_{00} is the predicted Fall score at 5th grade

\beta_{1i} = \gamma_{10} + v_{10i}, where \gamma_{10} is the linear growth rate of change of Fall scores

\beta_{2i} = \gamma_{20}, where \gamma_{20} is the quadratic growth rate of change of Fall scores

\beta_{3i} = \gamma_{30} + v_{30i}, where \gamma_{30} is the predicted Spring-Fall difference in 5th grade

\beta_{4i} = \gamma_{40}, where \gamma_{40} is the linear growth rate of change for the Spring-Fall difference

A simplified version of the X coding for an individual with all observed timepoints is presented below:

Grade/Term   X0 (Int.)   X1 (Fall linear slope)   X2 (Fall quadratic slope)   X3 (Spring-Fall Int.)   X4 (Spring-Fall linear slope)
4th Fall     1           -1                        1                          0                       0
4th Spring   1            0                        0                          1                       0
5th Fall     1            0                        0                          0                       0
5th Spring   1            1                        1                          1                       1
6th Fall     1            1                        1                          0                       0
6th Spring   1            2                        4                          1                       2
7th Fall     1            2                        4                          0                       0
7th Spring   1            3                        9                          1                       3
8th Fall     1            3                        9                          0                       0
8th Spring   1            4                       16                          1                       4

Note. The first three terms (X0 − X2) represent a standard quadratic growth model estimating

change in fall status from 4th-8th grade, with time centered at 5th grade. The second set of terms

(X3 − X4) represent the part of the model where spring-to-fall gaps are estimated. Since time is

centered around 5th grade fall, X3 represents the predicted difference in scores between 4th grade

spring and 5th grade fall, and X4 represents the change in the estimated gap across grade-levels.

Paper 3

Title: Estimating School Value-Added Using a Student Growth Model: Implications for

Practice and Policy

Background / Context:

Teacher and school value added models (VAM) and estimates are best known for their

uses in high-stakes accountability systems. In practice, some districts and states have used VAM

estimates to remediate or terminate extremely low-performing teachers, including the District of

Columbia (Dee & Wyckoff, 2015). Under the federal Race to the Top competition, many states

chose to identify the bottom 5% of schools and implement comprehensive reforms in those

schools (Baker, Oluwole, & Green, 2013). In both cases, consequences for low-performing

teachers and schools can be severe.

Most of the VAMs used in research and practice regress the student’s post-test score on a

pre-test score, covariates, and a teacher or school fixed or random effect. Oftentimes, these

scores are standardized to have a mean of zero and variance of one or are treated as ordinal,

essentially stripping out the psychometric properties of the scale. These models can shed light

on whether teachers or schools contribute to re-ordering of students between two time points.

Decisions to coarsen the data are often made due to a growing literature on the practical

consequences of wrongly assuming an interval scale, the implications of which are great when

VAM estimates are used in high-stakes accountability regimes (Briggs & Domingue, 2013;

Briggs & Weeks, 2009; Soland, 2017).

Purpose:

One might argue that a question much closer to the policy intent behind VAMs is whether

teachers or schools contribute to the learning gains of students over the course of their time in K-

12 schools. That is, rather than look at rankings of students, models could instead be developed

to quantify teacher and school contributions to how students develop as learners. In this paper,

we fit VAMs using a baseline student growth model that better matches this second concept of

long-term learning, compare estimates of school effectiveness to ones from more traditional

VAMs, and highlight inferences that can be made from VAMs that are layered on to student

growth models. Ultimately, this study explores what might come from inverting the normal scholarly dialog on VAM estimates: rather than ask what invalid and

improper inferences might be drawn from VAM estimates in a high-stakes context, we ask what

useful inferences may be drawn from thinking of VAMs as low-stakes tools for educators.

Specifically, we ask two research questions:

1. How different are school-level VAM estimates produced by traditional models versus

those that use a vertical scale to estimate a student growth model with school random

effects?

2. What are examples of inferences that can be drawn from estimates of school quality

based on an underlying student growth model that are useful to educators and cannot

necessarily be made using an ordinal scale?

Population / Participants / Subjects:

The analytical sample consists of data from a Southern state where the majority of

students take MAP Growth, an interim assessment with a vertical scale. Three years of data are

used with students beginning in fall of 6th grade and ending in spring of 8th grade. MAP

Growth is often administered three times per year in the fall, winter, and spring, though winter

administrations of the test are less frequent. Table 1 shows counts of students and other relevant

demographic information by time period and term. As the table makes clear, while the data

follow students over time, we do not use an intact cohort design: students can enter and exit the

sample at any time so long as they attended a school in the state during 8th grade. For the

purposes of estimating VAM, each student is associated with the school they attended in fall of

8th grade (266 schools in total).

Methods & Research Design:

We fit two VAMs, then compare estimates from those models. Models are estimated

separately in math and reading. The first VAM is the traditional model estimated for student i in

school j for time t and test score 𝑌𝑡𝑖𝑗.

Y_{tij} = \beta_0 + \beta_1 Y_{t-4,ij} + \mathbf{X}_{ijst}\delta + \gamma_s + \varepsilon_{ijst}

Here, 𝑿𝑖𝑗𝑠𝑡 is a matrix of student- and school-level covariates and 𝛾𝑠 is a school random effect.

Scores from spring of 8th grade are regressed on scores from spring of 7th grade, both

standardized within time period to have mean of zero and unit variance.
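A hedged lmer() sketch of this traditional VAM (variable names are ours, and the covariates are illustrative stand-ins for X) is:

  library(lme4)
  # vam_dat is a hypothetical student-level data frame; z_score_g8 and
  # z_score_g7 are spring 8th- and 7th-grade scores, each standardized
  # within time period; black, hisp, and male stand in for the covariates.
  trad_vam <- lmer(z_score_g8 ~ z_score_g7 + black + hisp + male +
                     (1 | school_id), data = vam_dat)
  school_va <- ranef(trad_vam)$school_id  # empirical Bayes school effects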

By comparison, a VAM that models individual student growth is also fit such that:

Level 1

Y_{tij} = \pi_{0ij} + \pi_{1ij}\,time_{tij} + \pi_{2ij}\,time_{tij}^2 + \pi_{3ij}\,time_{tij}^3 + \varepsilon_{tij}

Level 2

\pi_{0ij} = \beta_{00j} + r_{0ij}

\pi_{1ij} = \beta_{10j} + r_{1ij}

\pi_{2ij} = \beta_{20j}

\pi_{3ij} = \beta_{30j}

Level 3

\beta_{00j} = \gamma_{000} + u_{00j}

\beta_{10j} = \gamma_{100} + u_{10j}

\beta_{20j} = \gamma_{200}

\beta_{30j} = \gamma_{300}

With covariance structure

\begin{pmatrix} u_{00j} \\ u_{10j} \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_{00}^2 & \tau_{01} \\ \tau_{10} & \tau_{11}^2 \end{pmatrix}\right]

\begin{pmatrix} r_{0ij} \\ r_{1ij} \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \varphi_{00}^2 & \varphi_{01} \\ \varphi_{10} & \varphi_{11}^2 \end{pmatrix}\right]

\varepsilon_{tij} \sim N(0, \sigma_\varepsilon^2)

This formulation includes a student growth model with a random student intercept and slope on

time. It also includes school random intercepts and random slopes on time.
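A corresponding lmer() sketch of this growth-based VAM (again with our own variable names) nests student intercepts and slopes within schools and reads school value-added off the school-level random effects:

  library(lme4)
  # vam_long is a hypothetical long-format data frame of term-level scores.
  growth_vam <- lmer(
    Y ~ time + I(time^2) + I(time^3) +       # fixed polynomial in time
      (1 + time | school_id:student_id) +    # student effects (r0, r1)
      (1 + time | school_id),                # school effects (u00, u10)
    data = vam_long
  )
  school_growth <- ranef(growth_vam)$school_id  # intercepts and slopes by school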

Findings/Results:

Table 2 shows crosstabulations of traditional and growth VAM estimates by quintile of

their respective distributions. Incorporating student growth into the model moderately changes the classification of schools. For example, 13% of schools in the bottom two quintiles

based on traditional estimates would be in the top two quintiles using a growth model. Table 3

shows a similar crosstabulation, this time of estimated school-level achievement at time 0 and

mean within-school growth across time periods. This table suggests that some schools ranked

high or low in terms of baseline achievement would be ranked quite differently based on

estimated growth rates.

Conclusions:

Results suggest that VAMs can be used to help understand long-term contributions to

student learning. This approach is categorically different from that taken in most VAMs in practice,

which examine school contributions to changes in rank orderings of students between two

timepoints. This finding suggests that VAM estimates have untapped potential as data to inform

practice, a potential that is hard to realize when estimates are also used for accountability

purposes, which often necessitates more cautious approaches like ordinal models.

Paper 3 Appendix A.

References

Baker, B. D., Oluwole, J., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the Race-to-the-Top era. Education Policy Analysis Archives, 21(5).

Briggs, D. C., & Domingue, B. (2013). The gains from vertical scaling. Journal of Educational

and Behavioral Statistics, 38(6), 551-576.

Briggs, D. C., & Weeks, J. P. (2009). The sensitivity of value-added modeling to the creation of

a vertical score scale. Education Finance and Policy, 4(4), 384–414.

Dee, T. S., & Wyckoff, J. (2015). Incentives, selection, and teacher performance: Evidence from

IMPACT. Journal of Policy Analysis and Management, 34(2), 267–297.

Soland, J. (2017). Is Teacher Value Added a Matter of Scale? The Practical Consequences of

Treating an Ordinal Scale as Interval for Estimation of Teacher Effects. Applied

Measurement in Education, 30(1), 52–70.

Paper 3 Appendix B. Tables and Figures

Table 2

Cross-tabulations of Traditional and Growth VAM Estimates by Quintile

                                 Growth VAM
Traditional VAM    1    2    3    4    5    Total
1                 28   14    8    3    1     54
2                 17   16    9    9    2     53
3                  6   12   10   16    9     53
4                  3    8   16   13   13     53
5                  0    4    9   13   27     53
Total             54   54   52   54   52    266

Table 3

Cross-tabulations of Achievement at Time 0 and Mean Within-School Growth Rate by Quintile

                           Achievement at Time 0
Mean Growth    1    2    3    4    5    Total
1             37    8    3    5    0     53
2              6   29   14    1    1     51
3              4    7   24   16    1     52
4              3    6    9   21   13     52
5              0    3    1   10   38     52
Total         50   53   51   53   53    260

Table 1

Descriptive Statistics for Analytical Sample by Time Period

Time Term Grades Students Prop. Black Prop. Hisp. Prop. Male

0 Fall 2011 6 43,873 0.355 0.061 0.506

1 Winter 2012 6 21,033 0.388 0.059 0.510

2 Spring 2012 6 42,151 0.356 0.060 0.505

3 Fall 2012 7 41,824 0.356 0.060 0.504

4 Winter 2013 7 20,772 0.384 0.056 0.508

5 Spring 2013 7 40,386 0.358 0.061 0.502

6 Fall 2013 8 37,180 0.362 0.062 0.502

7 Winter 2014 8 15,605 0.439 0.065 0.507

8 Spring 2014 8 35,998 0.362 0.062 0.502

Paper 4

Title: Dynamic Measurement Modeling: Using Nonlinear Growth Models to Estimate Student

Learning Capacity

Background and Purpose:

Psychometric assessments, as they have traditionally been applied in the educational

setting, solely measure abilities and skills that students have developed prior to the occasion of

testing, and consequently cannot tap a student’s capacity for developing those abilities in the

future (Sternberg et al., 2002). Despite this recognized disconnect between developed abilities

and developing capacities, scores on single-time-point educational or psychological measures are

all-too-often misinterpreted as relating to student potential. For this reason, students who may

not have had adequate opportunity to develop a given ability—and therefore score poorly on a

performance assessment—may be officially judged as not having the capacity for developing

that ability, and as such may not be given the resources and attention they need from educators in

order to meet their potential (Lohman, 1999).

One methodology that has been historically utilized to address this problem is dynamic

assessment (DA; Feuerstein, 1979; Tzuriel, 2001). Because DA features multiple testing

occasions, integrated with instruction by a clinician, it is capable of estimating a student’s

capacity for developing a particular skill or ability. Please see Figure 1 for a graphical depiction

of the theoretical conceptualization of student ability growth and predicted learning capacity that

is adapted from the DA literature. Unfortunately, because widely applying DA in any

educational system would entail substantial time investment by trained clinicians, the monetary

requirements of such extensive application are beyond those currently available to most state

systems, school districts, and educational research groups.

However, recent advances in non-linear growth modeling and statistical computing, as

well as the proliferation of reliable longitudinal data pertaining to the educational achievement of

U.S. students, offer an alternative solution. Specifically, a new psychometric modeling

framework—Dynamic Measurement Modeling (DMM)—is now capable of accomplishing many

of the goals of DA through the modeling of longitudinal testing data, without the need for

extensive one-on-one clinical work (Dumas & McNeish, 2017; McNeish & Dumas, 2017).

In general, DMM utilizes vertically scaled longitudinal data to estimate subject-specific

random effects for a number of growth parameters associated with learning, including each

students’ predicted capacity asymptote. A major motivation for the creation of DMM was to

quantitatively produce estimates of student learning capacity that are relatively free from the

undue influence of student demographic characteristics such as socioeconomic status (SES),

race/ethnicity, and gender. However, prior to the completion of the current study, the actual

efficacy of DMM to accomplish this goal had not been formally tested. Therefore, we here

conduct and present just such an investigation.

Setting:

To address this critical research question, we utilized the Early Childhood Longitudinal Study-Kindergarten (ECLS-K) 1999 cohort (Tourangeau et al., 2009). These publicly

available data were collected at seven time-points: fall and spring of kindergarten, fall and spring

of Grade 1, spring of Grade 3, spring of Grade 5, and spring of Grade 8. In this analysis, we

utilized mathematics assessment scale scores (not individual items), which were vertically scaled

across time-points.

Research Design:

To these data, we fit a DMM capable of modeling the growth trajectory of every

individual student in the dataset. Please see Figure 2 for the growth trajectories and capacity

asymptotes of 50 randomly selected ECLS-K participants. This model, termed the “full model,” included data spanning from kindergarten through eighth grade. Another model, termed the “reduced model,” was fit to data spanning only through grade 5. All subject-specific estimates for

both of these models were saved for later analysis.
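To convey the flavor of estimating a student-specific asymptote, the sketch below uses nlme's built-in asymptotic regression. It is only loosely analogous to the authors' DMM, and the data frame, variable names, and starting values are all ours.

  library(nlme)
  # ecls is a hypothetical long data frame: id, years (after kindergarten
  # fall), and math (vertically scaled score). SSasymp fits
  # Asym + (R0 - Asym) * exp(-exp(lrc) * years).
  dmm_sketch <- nlme(math ~ SSasymp(years, Asym, R0, lrc),
                     data   = ecls,
                     fixed  = Asym + R0 + lrc ~ 1,
                     random = Asym ~ 1 | id,   # subject-specific asymptote
                     start  = c(Asym = 250, R0 = 30, lrc = -1))
  capacity <- coef(dmm_sketch)$Asym            # per-student capacity estimates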

Findings/Results:

Reliability and Convergent Validity. In order to ascertain the reliability of DMM

capacity estimates, we correlated the subject-specific capacity estimates from the full and

reduced model. The resulting value of r = .934 indicated that the asymptotic capacity predictions

were very stably estimated across models. In fact, this correlation exceeds the correlation of r =

.836 found between Grade 5 and Grade 8 mathematics scores by a comfortable margin. The

correlations between capacity estimates and scale scores are also moderately high (range: .679 to

.771), which appears to provide satisfactory convergent validity evidence, suggesting that student capacities are positively related to single-time-point assessments, but not so strongly related as to suggest they are synonymous.

Consequential Validity. In a sequence of general linear models (GLMs) we tested the

effect of student demographic characteristics (i.e., race/ethnicity, SES, and gender) on single-

time-point assessments as well as DMM capacity estimates. In this analysis, SES was quantified

through a principal components analysis of a variety of student background variables present in

the ECLS-K data. All GLM analysis was conducted on the “full model” DMM results that

included ECLS-K data from kindergarten through eighth grade.

Figure 3 shows a plot of the GLM omnibus R2 values. Note that for the ECLS-K

mathematics scores, the omnibus R2 values fall between 15.8% and 22.8%. On the other hand, the R2 value for capacity is 9.9%, approximately half that of the ECLS-K score GLMs, so the variation these demographics explain is noticeably reduced relative to the single-time-point scores.

We present the effect sizes related to SES in Figure 4. Effect sizes depicted in Figure 4

are Cohen’s f, which fall on the following scale: .10, .25, and .40 for small, medium and large

effects, respectively (Cohen, 1992). Effect sizes below .10 are considered negligible. The effect

of SES on ECLS-K mathematics scale scores would be classified on the high side of a small

effect, at times approaching a medium effect. However, the effect size for SES on capacity is

noticeably smaller than for each of the scale scores and falls short of the small effect cut-off (i.e., it is

negligible) by a reasonable margin.
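For reference, Cohen's f for a single predictor set can be recovered from the increment in R2 when that set is added to the model, f^2 = (R2_full - R2_reduced) / (1 - R2_full) (Cohen, 1992); the R snippet below uses hypothetical numbers, not values from the paper.

  # Cohen's f from hierarchical R2 values.
  cohens_f <- function(r2_full, r2_reduced) {
    sqrt((r2_full - r2_reduced) / (1 - r2_full))
  }
  cohens_f(0.099, 0.094)  # hypothetical: a 0.005 increment gives f of about .07 (negligible)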

Conclusions:

In our view, this finding implies that impoverished students, despite having developed

less mathematics ability on average than their more privileged peers by 8th grade, nonetheless

retain a practically equal capacity for learning within that domain in the future. This type of

conclusion is not readily attainable with most other available types of psychometric or

longitudinal methods. Therefore, we argue that DMMs hold substantial promise for informing

measurement practice and educational research.

Paper 4 Appendix A.

References

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.

Dumas, D., & McNeish, D. (2017). Dynamic measurement modeling: Using nonlinear growth

models to estimate student learning capacity. Educational Researcher, 46(6), 284-292.

DOI: 10.3102/0013189X17725747

Feuerstein, R., Rand, Y., & Hoffman, M. B. (1979). The dynamic assessment of retarded

performers: the learning potential assessment device, theory, instruments, and techniques.

Baltimore: University Park Press.

Lohman, D. F. (1999). Minding our p's and q's: On finding relationships between learning and

intelligence. Learning and individual differences: Process, trait, and content determinants, 55–76.

McNeish, D., & Dumas, D. (2017). Non-linear growth models as measurement models: A

second-order growth curve model for measuring potential. Multivariate Behavioral

Research, 52 (1), 61-85.

Sternberg, R. J., Grigorenko, E. L., Ngorosho, D., Tantufuye, E., Mbise, A., Nokes, C., ... &

Bundy, D. A. (2002). Assessing intellectual potential in rural Tanzanian school children. Intelligence, 30, 141–162.

Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., & Najarian, M. (2009). Early Childhood

Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K): Combined user's manual for the

ECLS-K eighth-grade and K-8 full sample data files and electronic codebooks. NCES

2009-004. Washington: National Center for Education Statistics.

Tzuriel, D. (2001). Dynamic assessment of young children. New York: Kluwer Academic.

Paper 4 Appendix B. Tables and Figures

Figure 1. Theoretical depiction of the dynamic assessment process. The space below the line

is realized ability, the space above the line is unrealized availability, and the horizontal line

at the top is the capacity.

[Chart: Ability Score (0-350) versus Elapsed Time (0-8), with regions labeled Availability and Ability and a horizontal Capacity asymptote.]

Figure 2. Ability, Capacity and Availability trajectory plots for two random samples of 25

students from the ECLS-K dataset, with superimposed sample mean trajectory (bold)

[Two panels: Predicted Mathematics Score (0-300) versus Years After Kindergarten Fall (0-8), one for each random sample of 25 students.]

Figure 3. Plot of omnibus R2 values showing the amount of total variation explained in

Scale Scores and Full Model Capacity estimates by gender, race/ethnicity, SES, and all two

and three-way interactions

Figure 4. The effect size (Cohen’s f) of SES on ECLS-K scale scores and Full Model

capacity estimates. The dashed horizontal line at .10 represents the cut-off for a “small”

effect, the dashed line at .25 represents the cut-off for a “medium” effect.

[Figure 3 data: omnibus R2 for the seven mathematics score occasions = .189, .181, .158, .178, .202, .228, .227; Full Model Capacity = .099.]

[Figure 4 data: Cohen's f for SES at the seven mathematics score occasions = .237, .222, .196, .163, .196, .193, .204; Full Model Capacity = .071.]