Draft document. Do not circulate. Direct questions to ravaglia@stanford.edu

Contents

1 INTRODUCTION
2 METHODS
  2.1 Research design
  2.2 Data collection
    Math CST scores
    Table 1. Count of number of pairs on 2006 and 2007 Math CST scores for 919 pairs
    Figure 1. Comparison on 2006 Math CST scores for EPGY and control group
    Course performance of EPGY group
    Figure 2. Histogram of number of correct first-attempts for the 919 EPGY students at 8 Title I schools
    Table 2. Descriptive statistics of correct first-attempts for each subgroup of 919 EPGY students
3 METHOD OF STATISTICAL ANALYSIS
  3.1 Paired t-test and effect size
    Paired t-test
    Effect size: Cohen's d
  3.2 Three-level hierarchical linear model
    Table 3. Three levels of student, classroom and school
  3.3 Binomial analysis of changes in proficiency level
    Table 4. California classification of 2007 Math CST test scores by proficiency level
  3.4 Multivariate normal covariate models
4 COMPARISON OF EPGY AND CONTROL STUDENTS IN THE EFFECTIVENESS STUDY
  4.1 Paired t-test result for all Title I students
    Results for non-second graders
    Table 5. Summaries of comparisons between EPGY and matching control students for all students in the Effectiveness Study on the paired t-test, using correct first-attempts
    Table 5a. Quartile summaries of comparisons between EPGY and matching control students for all students in the Effectiveness Study on the paired t-test, using weighted correct first-attempts


    Results for second graders
    Table 6. Summaries of comparisons between EPGY and matching control students in the second grade at the Effectiveness Study schools on the paired t-test, using correct first-attempts
  4.2 Changes in proficiency level
    Table 7. Result of binomial analysis of changes in proficiency level, with students ranked by correct first-attempts
    Table 7a. Result of binomial analysis of changes in proficiency level, with students ranked by weighted correct first-attempts
    Table 8. Comparisons between EPGY and matching control students ranked by Math CST06 scores for all schools in the Effectiveness Study on the paired t-test
    Table 9. Result of binomial analysis of changes in proficiency level
    Table 10. Comparisons between EPGY and matching control students ranked by Math CST06 scores and correct first-attempts for all schools in the Effectiveness Study on the paired t-test
    Table 11. Result of binomial analysis of changes in proficiency level
  4.3 District-wide and school-wide results
    Table 12. Summary of comparisons between top-half EPGY and matching control students for all schools, individual district and individual school in the Effectiveness Study on the paired t-tests, using correct first-attempts
    Table 13. Expected effect sizes on paired t-test results in Table 24
  4.4 Three-level hierarchical linear model (HLM)
    Results for top half
    Table 14a. Random effects for top half pairs of EPGY and control students, using correct first-attempts
    Table 14b. Fixed effects for top half pairs of EPGY and control students, using correct first-attempts
    Result for the 572 pairs
    Table 15a. Random effects for the 572 pairs
    Table 15b. Fixed effects for the 572 pairs
5 REGRESSION MODELS ONLY FOR EXPERIMENTAL GROUP (EPGY STUDENTS)
  5.1 Multiple linear regression model with the independent variables EPGY correct first-attempts and students' 2006 Math CST scores


    Table 16. Regression model of 2007 Math CST against correct first-attempts
    Table 17. Normalized regression model of 2007 Math CST against correct first-attempts
    Table 18. Regression model of 2007 Math CST against weighted correct first-attempts
    Table 19. Normalized regression model of 2007 Math CST against weighted correct first-attempts
  5.2 A Title I school in district 2 outside the Effectiveness Study
  5.3 Regression with the added covariate of number of correct second-attempts
  5.4 Correlation between 2006 Math CST and correct first-attempts
    Table 20. Data on 2006 Math CST and correct first-attempts
    Table 21a. Fixed effects of hierarchical model of 2006 Math CST and correct first-attempts
    Table 21b. Random effects
  5.5 Correlation between re-scaled correct first-attempts and Math CST scores
    Table 22. Pearson correlation coefficients for re-scaled correct first-attempts of top half of EPGY students
    Figure 4. Scatter plots of the three groups of students: green, those who were in the top half on both CFAs and AGPs; red, those in the top half of CFA or AGP, but not both; and black, students in neither the top half of CFA nor of AGP
  5.6 Analysis of the mean-difference 2007 Math CST – 2006 Math CST corresponding to different 2006 CST scores and number of correct first-attempts
    Table 23. The number of students in each cell is shown in the lower right-hand corner
6 MULTIVARIATE-NORMAL COVARIATE MODEL FOR EPGY STUDENTS ONLY
  6.1 Minimum-distance classifier
    Table 24. Transition matrix for proficiency levels
    Table 25. Classification matrix of minimum-distance classifier
  6.2 Bayesian classifier
    Table 26. Classification matrix for Bayesian classifier using uniform prior


  6.3 Learning curves
    Figure 4. Learning curve for different models using test data set with fixed size (10% of each class)
    Figure 5. Learning curve for different models using test data set with fixed size (219 trials)
7 PREDICTION MODEL FOR EPGY STUDENTS ON THE MATH CST 2008
    OLS model
    HLM model
  7.1 Two models using CFA for prediction
    Coefficients from California Standard Math Test 2006-2007
    Table 27a. Estimated coefficients of OLS model (2006-2007)
    Table 27b. Parameter estimates of Hierarchical Linear Model (2006-2007)
    Prediction result
    Figure 6. The comparison of predicted values of 2008 CST between OLS and HLM models
    Table 28. Statistics of Effectiveness Study predictions obtained from the two different models
    Figure 7. Predicted histograms of OLS and HLM models
    Figure 8. The normal-distribution approximations of OLS and HLM predictions
    Figure 9. The comparison of predicted and actual 2008 CST
    Table 29. The difference between the predicted values and the actual CST08 with unweighted correct first-attempts
    Table 30. Number of students over-predicted or under-predicted with unweighted correct first-attempts
  7.2 Two models using weighted CFA for predicting results
    Table 31a. Estimated coefficients of OLS model (2006-2007)
    Table 31b. Parameter estimates of Hierarchical Linear Model (2006-2007)
    Figure 10. The comparison of predicted and actual 2008 CST with weighted correct first-attempts


    Table 32. Statistics of Effectiveness Study predictions obtained from OLS and HLM models with weighted correct first-attempts
    Table 33. The difference between the predicted values and the actual 2008 CST with weighted correct first-attempts
    Table 34. Number of students over-predicted or under-predicted with weighted correct first-attempts
8 Conclusion
Acknowledgements
References


Draft

Effectiveness of Stanford’s EPGY Online Math K-5 Course in Eight Title I

Elementary Schools in Three California School Districts, 2006-2007

Patrick Suppes, Minh-thien Vu, and Yuanan Hu

7 January 2009

1. INTRODUCTION

Stanford University’s Education Program for Gifted Youth (EPGY) conducted a

randomized treatment experiment (RTE) during the 2006-2007 school year to test the efficacy for

Title I students of the EPGY Kindergarten through Grade 5 Mathematics Course Sequence (Math

K-5). All eight participating schools had a full K-5 sequence of instruction.

In Section 2 we describe the research design and the data collection procedures, with a

table and three histograms summarizing significant features of the data. In Section 3 we describe

the statistical methods of analysis used. The presentation is more detailed than it would be if just

aimed at those familiar with these matters. Section 4 occupies a central place. Here the main

results of the randomized treatment experiment are given, focused on analyzing the Math 2007

California Standards Test (CST) for the experimental (EPGY) and control students. Section 5 is

devoted to the application of linear regression models just to the results of the experimental


group on Math CST 2007 scores. Section 6 is a long and rather technical section presenting a

number of results from applying multivariate-normal covariate models to the experimental

(EPGY) student test results. Section 7 is unusual for this kind of assessment study. We use a

regression model to predict EPGY students’ test scores on the Math CST 2008. Finally, Section

8 provides a brief concluding summary.

2. METHODS

2.1 Research design

The RTE was conducted with students in Grades 2 through 5 at eight Title I elementary schools located in three school districts within a 50-mile radius of Stanford. Within each participating class, students were randomly assigned to two distinct treatment groups on the basis of mathematical achievement. For students entering Grades 3 to 5, achievement was measured by their performance on the mathematics section of the 2006 California Standards Test (CST) for Grades 2 to 4; for students entering Grade 2, it was measured by the Stanford EPGY Mathematical Aptitude Test (SEMAT). (The CST was not given to students in the first grade, which was the situation of the Grade 2 students in 2006.) Students were ranked and ordered by their test scores, and adjacent students in the sorted order were paired. A computer algorithm randomly assigned the members of each pair to the two treatment conditions 1000 times and chose the assignment that yielded the smallest sum of the absolute difference in mean scores and the absolute difference in variances between the EPGY and control groups. As a result of the pairing and random assignment process, the mean and the variance of the prior test scores were approximately the same for both treatments. Because of this, we were assured at the outset that the two treatment groups were nearly evenly matched on mathematical achievement.
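The pairing and assignment procedure described above can be sketched as follows. This is a minimal illustration, not the algorithm actually used in the study: the function name `assign_pairs`, the use of the sample variance, the fixed random seed, and the handling of an unpaired last student (dropped) are all assumptions made for the example.

```python
import random
from statistics import mean, variance


def assign_pairs(scores, n_trials=1000, seed=0):
    """Pair adjacent students by prior score and pick a balanced random split.

    scores: dict mapping a (hypothetical) student id to a prior test score.
    Returns (epgy_ids, control_ids, cost) for the best of n_trials assignments.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)  # rank students by prior score
    # Pair each two adjacent students; an unpaired last student is dropped.
    pairs = [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]

    best, best_cost = None, float("inf")
    for _ in range(n_trials):
        epgy, control = [], []
        for a, b in pairs:
            if rng.random() < 0.5:  # coin flip within each pair
                a, b = b, a
            epgy.append(a)
            control.append(b)
        e = [scores[s] for s in epgy]
        c = [scores[s] for s in control]
        # Criterion from the text: |difference in means| + |difference in variances|
        cost = abs(mean(e) - mean(c)) + abs(variance(e) - variance(c))
        if cost < best_cost:
            best, best_cost = (epgy, control), cost
    return best[0], best[1], best_cost
```

Each trial keeps the pairs fixed and only re-flips which member goes to which group, so every candidate assignment remains a valid within-pair randomization.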


Students in the EPGY group left their classrooms and went to the computer lab in their

school, where they worked for roughly 20 minutes a day, five days a week, under the supervision

of a classroom teacher and an EPGY School Site Instructor. On each exercise attempted, the

EPGY students’ performance data and response latency were logged every school day in the

EPGY Oracle database at Stanford. Students in the control group remained in the classroom

during this time under the supervision of a classroom teacher and received an alternative

treatment consisting of seatwork which was either worksheets from the adopted textbook or

worksheets from the Renaissance Learning Accelerated Mathematics product, which was widely

available. The alternative treatment was the same in each of the schools. Additionally, both

groups participated in the same basic mathematics instruction delivered by their classroom

teachers during the school day. Scheduling and logistical details were determined on a school-

by-school basis.

2.2 Data collection

After randomization, 2046 students were paired, 1023 in EPGY and 1023 in the control group. Of the 1023 EPGY students in matched pairs, 919 completed at least one exercise as logged in the centralized EPGY Oracle database at Stanford. The remaining 104 students dropped out of the experiment. This left 919 pairs for data collection and analysis, but at the end of the school year there were only 619 EPGY students with both Math CST06 and Math CST07 scores and, of these, there were 572 matched pairs of EPGY and control students with both Math CST06 and Math CST07 scores. In addition, there were 170 matched pairs of second graders with Math CST07 scores. As already remarked, because of their grade level none of these students had Math CST06 scores. In the particular case of the analysis of individual school results in Section


4.4, we used 800 matched pairs of students. All of them have Math CST07 scores, but not all

had Math CST06 scores; those without had to be matched on the basis of other evidence.

Math CST scores. At the end of the year, students in both groups took the 2007 Math CST

(CST07). The number of CST scores, for each district and school, is shown in Table 1 and the

histograms of 2006 Math CST scores (CST06) are shown in Figure 1. The close similarity of the

two randomly assigned groups is apparent. The test results, and related objective data such as

individual student proficiency level, were evaluated to compare the performance of the two

treatment groups. Given that the CST has been externally developed, validated, administered,

and scored, there was no additional external evaluation of this experiment.

Table 1. Count of number of pairs on 2006 and 2007 Math CST scores for 919 pairs

| School name | Total number of pairs | CST06: both members having scores | CST06: any member without scores | CST07: both members having scores | CST07: any member without scores | CST06 & CST07: both members having scores | CST06 & CST07: only EPGY students |
|---|---|---|---|---|---|---|---|
| All 8 Title I schools | 919 | 632 | 287 | 800 | 119 | 572 | 619 |
| District 1, all | 526 | 353 | 173 | 486 | 40 | 333 | 348 |
| District 1, School A | 144 | 87 | 57 | 136 | 8 | 82 | 84 |
| District 1, School B | 102 | 77 | 25 | 94 | 8 | 72 | 78 |
| District 1, School C | 145 | 103 | 42 | 140 | 5 | 99 | 102 |
| District 1, School D | 135 | 86 | 49 | 116 | 19 | 80 | 84 |
| District 2, all | 174 | 108 | 66 | 128 | 46 | 92 | 104 |
| District 2, School E | 97 | 58 | 39 | 60 | 37 | 50 | 56 |
| District 2, School F | 77 | 50 | 27 | 68 | 9 | 42 | 48 |
| District 3, all | 219 | 171 | 48 | 186 | 33 | 147 | 167 |
| District 3, School G | 113 | 104 | 9 | 93 | 20 | 87 | 103 |
| District 3, School H | 106 | 67 | 39 | 93 | 13 | 60 | 64 |


[Two histograms of 2006 Math CST scores (x-axis: 2006 Math CST scores; y-axis: number of students). EPGY group: N = 572, median = 353, mean = 358.97, Std = 77.56. Control group: N = 572, median = 353, mean = 358.12, Std = 76.39.]

Figure 1: Comparison of 2006 Math CST scores for the EPGY and control groups (572 matched pairs)


Course performance of EPGY group. Because the experimental treatment consisted of mathematics exercises presented to the students by computer, an unusually detailed record of student course performance in the EPGY curriculum was available for analysis. Here we concentrate on two variables that we expect to be significant covariates of final, i.e., 2007, Math CST scores.

These two variables are, for each student i, the number of correct first-attempts (CFA07_i) at doing an exercise, and the number of correct second-attempts (CSA07_i). (Of course, there is no second attempt if the first attempt was correct.) Except on exercises with only two possible responses, students who made an error on their first attempt were immediately given a second opportunity to do the exercise. We consider second attempts only later.

[Histogram of number of correct first-attempts (x-axis: number of correct first-attempts, 0 to 5000; y-axis: number of students). N = 919, median = 1828, mean = 1838.57, Std = 755.89.]

Figure 2: Histogram of number of correct first-attempts for 919 EPGY students at the 8 Title I schools


Figure 2 shows the histogram of correct first-attempts for these 919 EPGY students with

the range from 1 to 4,917. Ranking by correct first-attempts, 230 students in the first quartile

completed a minimum of 2 and a maximum of 1325 correct first responses during the 2006-2007

school year with a mean of 903.23 and a standard deviation of 310.56. Students in the second

quartile completed a minimum of 1326 and a maximum of 1825 such attempts with a mean of

1612.82 and a standard deviation of 142.31. Students in the third quartile had a minimum of

1828 such responses and a maximum of 2287, with a mean of 2033.33, and a standard deviation

of 128.52. Students in the fourth quartile had a range from 2290 to 4917 correct first-attempts,

with a mean of 2803.90 and a standard deviation of 503.63.

In subsequent analyses, we often use subsets of the 919 EPGY students. We therefore show in

Table 2 the median, mean and standard deviation (Std) of the number of correct first-attempts

(CFA) for each subset. The descriptive statistics for correct second-attempts are shown in Section 5.3.

Table 2. Descriptive statistics of correct first-attempts for each subgroup of 919 EPGY students

| Subgroup of 919 EPGY students | Number of students | Median | Mean | Std |
|---|---|---|---|---|
| Both members of second-grade pairs with CST07 | 170 | 2098 | 2117.90 | 554.94 |
| Both members of the pair having CST06 and CST07 | 572 | 1808.50 | 1861.48 | 780.10 |
| Only EPGY students with CST06 and CST07 | 619 | 1797 | 1843.44 | 766.61 |


3. METHOD OF STATISTICAL ANALYSIS

The basic question for the statistical analysis was how the students in the EPGY

group performed relative to the students in the control group. Results for all schools

together, individual districts and schools are given.

Three main statistical approaches were used to compare the experimental and

control groups. First, paired-sample t-tests were run to examine the difference between

the EPGY and control groups at each level of aggregation. Effect sizes were also

calculated using Cohen’s d statistic (Cohen 1988). Second, a three-level linear model of

student, classroom and school was used (West, Welch, and Galecki 2007). Third,

students’ changes in proficiency level were analyzed for statistical significance. The five

proficiency levels used were those defined in the “No Child Left Behind” legislation: Far

Below Basic, Below Basic, Basic, Proficient, and Advanced. Statistical tests applied only

to the data of the experimental group are described later in Sections 5 and 6.

3.1 Paired samples t-test and effect size

Paired t-test. A paired t-test was used to compare the means of 2007 CST scores of the two groups, EPGY and control. The null hypothesis is that the mean difference between the two groups is zero. The test statistic, t, is a function of the average difference between all pairs ($\bar{X}_D$), the standard deviation of those differences ($SD_D$), and the number of pairs ($N$):

$$t = \frac{\bar{X}_D \sqrt{N}}{SD_D}$$


Under the null hypothesis, this statistic has, to a close approximation, a t-distribution with $N-1$ degrees of freedom. The p-value for the statistic t is computed using the $t_{N-1}$ distribution.

Hypothetical sample. For students at a small school, the mean and the standard deviation of the differences were $\bar{X}_D = 47$ and $SD_D = 88.29$, with $N = 18$. Using the formula above, the t statistic was

$$t = \frac{\bar{X}_D \sqrt{N}}{SD_D} = \frac{47 \times \sqrt{18}}{88.29} = 2.26,$$

for which the p-value was p = 0.04. This p-value was smaller than 0.05, a common standard level for significance, and so in this case there was significant evidence to reject the null hypothesis. (A significance level of 0.05 means that, if the null hypothesis is true, the chance of obtaining a t-value at least this extreme is 0.05 or less. It is not the probability that the null hypothesis is true.)
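The hypothetical sample can be checked numerically. The sketch below is illustrative only: the helper names are made up for this example, and the two-sided p-value is obtained by integrating the t density with Simpson's rule so that nothing beyond the Python standard library is assumed; a statistics package would normally be used instead.

```python
import math


def paired_t(mean_diff, sd_diff, n_pairs):
    """t statistic for a paired t-test: mean difference * sqrt(N) / SD of differences."""
    return mean_diff * math.sqrt(n_pairs) / sd_diff


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] by Simpson's rule."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = 2 * (total * h / 3)  # P(-|t| < T < |t|), using symmetry
    return 1 - central


t = paired_t(47, 88.29, 18)  # about 2.26, as in the text
p = two_sided_p(t, df=17)    # below 0.05, in line with the reported p = 0.04
```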

Effect size: Cohen's d. Cohen's d is an appropriate effect-size measure to use in the context of a t-test on means. It is defined as the difference between two means divided by the pooled standard deviation of the two groups. Because of the pairing, our two samples had the same size, and thereby

$$d = \frac{mean_1 - mean_2}{\sqrt{(SD_1^2 + SD_2^2)/2}},$$

where $mean_i$ and $SD_i$ are the mean and standard deviation for group i, for i = 1, 2. The standard interpretation of the effect size is that 0.2 is indicative of a small effect, 0.5 a medium and 0.8 a large effect size (Cohen 1988). Readers with a background in physics or engineering will take this interpretation of 0.8 as being a large effect size as perhaps too strong a claim. To see this, just express the above equation as

$$mean_1 - mean_2 = d \cdot SD,$$


or $d\sigma$ (physicists would prefer σ rather than SD in notation). The physics folklore standard for significant experiments is a difference of at least 3 or 4 σ, and in really significant finds often as much as 20 or 30 or more. Quality-control engineers often consider something like 6σ the gold standard. A little reflection makes it clear that such a standard for comparing instructional methods would almost never be met, hence the defense of 0.8 as large in educational assessment.

Example (continued). The mean of EPGY students was 432.44 with a standard deviation of 63.87. The mean of control students was 385.44 with a standard deviation of 63.44. So

$$d = \frac{432.44 - 385.44}{\sqrt{(63.87^2 + 63.44^2)/2}} = 0.78.$$

An effect size of 0.78 is considered large, based on the standard stated above.
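As a quick check of the effect-size formula for two equal-sized groups, a small sketch (the function name `cohens_d` is chosen here for illustration) evaluates it on the rounded figures quoted in the example:

```python
import math


def cohens_d(mean1, sd1, mean2, sd2):
    """Cohen's d for two equal-sized groups, using the root-mean-square of the two SDs."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (mean1 - mean2) / pooled_sd


# Rounded values quoted in the example: EPGY vs. control.
d = cohens_d(432.44, 63.87, 385.44, 63.44)
```

Evaluated on these rounded means and standard deviations, the sketch gives roughly 0.74, somewhat below the 0.78 reported in the text; the reported value may have been computed from unrounded or different inputs.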

3.2 Three-level hierarchical linear model.

The basic question for this statistical analysis is whether students in the EPGY

program had higher 2007 Math CST scores than those in the control group. With

randomization at the student level, we expect 2006 Math CST scores to be equally

distributed between treatment and control groups, but in any single randomization there

may be discrepancies between the distributions due to chance. In the model below,

CST06 is included to increase the precision of the treatment effect. The dependencies

among observations within classrooms and schools are also considered by modeling

random effects (West, Welch and Galecki, 2007). This three-level hierarchical linear

model of student, classroom and school is presented as follows:


Table 3. Three levels of student, classroom and school

| Level | Name | Factor and covariate |
|---|---|---|
| I: i = 1, 2, 3, 4, … | Student | CST06; Treatment (1 for EPGY students and 0 for controls) |
| II: j = 1, 2, 3, 4, 5, … | Classroom | |
| III: k = 1, 2, 3, 4, 5, 6, … | School | |

Level 1: student level. Coefficients of CST06 and Treatment are fixed.

$$CST07_{ijk} = \pi_{0jk} + \pi_1\,TREATMENT + \pi_2\,CST06 + e_{ijk}$$

Level 2: classroom level. No predictor is available at this level, and the intercept is allowed to vary across classrooms within schools.

$$\pi_{0jk} = \beta_{00k} + \upsilon_{0jk}$$

Level 3: school level. No predictor is available at this level.

$$\beta_{00k} = \gamma_{000} + u_{00k}$$

Putting these three levels together, we have the following equation:

$$CST07_{ijk} = \gamma_{000} + \pi_1\,TREATMENT + \pi_2\,CST06 + (u_{00k} + \upsilon_{0jk} + e_{ijk}),$$

where $\gamma_{000}, \pi_1, \pi_2$ are called the fixed effects and $u_{00k}, \upsilon_{0jk}, e_{ijk}$ are called the random effects.
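To make the structure of the three-level model concrete, the following sketch simulates data from it: one random intercept per school, one per classroom within a school, and a residual per student, added to the fixed part. Every numerical value here (coefficients, standard deviations, group sizes, the CST06 distribution) is a made-up assumption for illustration and is not an estimate from the study; fitting such a model would require a mixed-effects package, which is not shown.

```python
import random


def simulate(n_schools=8, classes_per_school=6, students_per_class=20,
             gamma000=100.0, pi1=10.0, pi2=0.7,
             sd_school=10.0, sd_class=8.0, sd_student=50.0, seed=1):
    """Generate (school, classroom, treatment, cst06, cst07) rows from the model."""
    rng = random.Random(seed)
    rows = []
    for k in range(n_schools):
        u_00k = rng.gauss(0, sd_school)            # school-level random effect
        for j in range(classes_per_school):
            v_0jk = rng.gauss(0, sd_class)         # classroom-level random effect
            for i in range(students_per_class):
                cst06 = rng.gauss(355, 75)         # hypothetical prior-score distribution
                treatment = i % 2                  # alternate EPGY (1) / control (0)
                e_ijk = rng.gauss(0, sd_student)   # student-level residual
                cst07 = (gamma000 + pi1 * treatment + pi2 * cst06
                         + (u_00k + v_0jk + e_ijk))
                rows.append((k, j, treatment, cst06, cst07))
    return rows
```

Because students in the same classroom share `v_0jk` and students in the same school share `u_00k`, their simulated scores are correlated, which is the dependency the random effects are meant to capture.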

3.3 Binomial analysis of changes in proficiency level

The third main statistical analysis is to compare the experimental and control

groups on the nature of the changes in proficiency levels, as fixed in the No Child Left

Behind legislation. The null hypothesis in this case is that the numbers of positive

changes (+1) in proficiency from the 2006 Math CST Test to the 2007 Test should be

about the same as the number of negative changes (-1). For example, a student who

moved from Proficient in 2006 to Basic in 2007 counts as -1, and a student who moved


from Proficient to Advanced counts as +1. The null hypothesis is that the chance of any

student who changes his or her level moving up is the same as moving down, i.e., the

probability of moving up or down is ½, given a move was made. If N students in a unit changed level, k of them moving up and N − k moving down, then the probability of at least k students moving up by chance (i.e., with p = 0.5) is:

$$P(x \ge k) = \sum_{j=k}^{N} P(x = j) = \sum_{j=k}^{N} \binom{N}{j} p^{j} (1-p)^{N-j}.$$
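The tail probability above is straightforward to evaluate directly. A minimal sketch follows; the function name and the example counts are hypothetical and do not come from the study's data.

```python
from math import comb


def binomial_tail(n_changed, k_up, p=0.5):
    """P(x >= k_up) when each of n_changed students moves up with probability p."""
    return sum(comb(n_changed, j) * p ** j * (1 - p) ** (n_changed - j)
               for j in range(k_up, n_changed + 1))


# Hypothetical unit: 10 students changed level, 8 of them moved up.
p_value = binomial_tail(10, 8)  # (45 + 10 + 1) / 1024
```

With these made-up counts the probability is 56/1024, about 0.055, so 8 upward moves out of 10 would fall just short of significance at the 0.05 level.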

The classification of 2007 Math CST test scores by proficiency level is shown in Table 4.

Table 4. California classification of 2007 Math CST test scores by proficiency level

| Grade | Far below basic | Below basic | Basic | Proficient | Advanced |
|---|---|---|---|---|---|
| 2 | 150–235 | 236–299 | 300–349 | 350–413 | 414–600 |
| 3 | 150–235 | 236–299 | 300–349 | 350–413 | 414–600 |
| 4 | 150–244 | 245–299 | 300–349 | 350–400 | 401–600 |
| 5 | 150–247 | 248–299 | 300–349 | 350–429 | 430–600 |
| 6 | 150–252 | 253–299 | 300–349 | 350–414 | 415–600 |
| 7 | 150–256 | 257–299 | 300–349 | 350–413 | 414–600 |
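The cut scores in Table 4 translate directly into a lookup. The sketch below maps a grade and scale score to a proficiency level and scores a year-to-year change as +1, -1 or 0. The function names are illustrative, and applying the 2007 cut scores to a student's 2006 score is an assumption made only for this example, since the 2006 classification is not given here.

```python
LEVELS = ["Far below basic", "Below basic", "Basic", "Proficient", "Advanced"]

# Lowest score of each level above "Far below basic", by grade (from Table 4).
CUTS = {
    2: (236, 300, 350, 414),
    3: (236, 300, 350, 414),
    4: (245, 300, 350, 401),
    5: (248, 300, 350, 430),
    6: (253, 300, 350, 415),
    7: (257, 300, 350, 414),
}


def level_index(grade, score):
    """Index into LEVELS for a scale score in the 150-600 range."""
    if not 150 <= score <= 600:
        raise ValueError("scale scores range from 150 to 600")
    return sum(score >= cut for cut in CUTS[grade])


def proficiency_change(grade_before, score_before, grade_after, score_after):
    """+1 if the student moved up a level or more, -1 if down, 0 if unchanged."""
    before = level_index(grade_before, score_before)
    after = level_index(grade_after, score_after)
    return (after > before) - (after < before)
```

For instance, a score of 401 is Advanced in Grade 4 but Proficient in Grade 3, which is why the grade must be part of the lookup.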

3.4 Multivariate normal covariate models

We concentrate on three closely related models. In Section 6.1 we use a minimum-

distance classifier using the standard Mahalanobis distance function. In Section 6.2 we

use a Bayesian classifier with the classification criterion being the highest posterior

probability. In the last subsection, 6.3, we use recursive equations for learning for both

classifiers, a topic usually not considered in such covariate analyses. These methods

were developed in Suppes and Liang (1998).


4. COMPARISON OF EPGY AND CONTROL STUDENTS IN THE

EFFECTIVENESS STUDY

4.1 Paired t-test results for all Title I students

Results for non-second graders

We first restrict the analysis to the 572 pairs of students in the eight Title I schools.

On the hypothesis that the number of correct first-attempts would be an important

covariate for the experimental group, we ranked the pairs by the number of correct first-

attempts at doing exercises for each EPGY member of a pair. (No such data were

available for the other pair members.) In Table 5 the results are shown for the top

quartile, ranked by number of correct first-attempts, then in the second row for the top

third, in the third row for the top half, in the fourth row, the top quartile of the bottom

half, then in the fifth row the top third of the bottom half, and in the final row, the top half

of the bottom half, corresponding in this case to the third quartile.

Table 5. Summaries of comparisons between EPGY and matching control students for all students in the Effectiveness Study on the paired t-test, using correct first-attempts

| Group/Subgroup (572 pairs of students in all 8 schools) | Number of pairs | Mean of the difference of CST07 within each pair | Std of the mean difference | Effect size | Paired t-test p-value |
|---|---|---|---|---|---|
| Top quartile ranked on correct first-attempts | 143 | 17.65 | 67.91 | 0.21 | 2.30 × 10^-3 |
| Top third ranked on correct first-attempts | 191 | 16.90 | 4.64 | 0.19 | 4.00 × 10^-4 |
| Top half ranked on correct first-attempts | 286 | 12.54 | 65.85 | 0.15 | 1.40 × 10^-3 |
| Top quartile of the bottom half ranked on correct first-attempts | 71 | -2.16 | 75.88 | -0.03 | 0.81 |
| Top third of the bottom half ranked on correct first-attempts | 95 | -2.30 | 77.06 | -0.03 | 0.77 |
| Top half of the bottom half ranked on correct first-attempts | 143 | -8.27 | 73.74 | -0.11 | 0.18 |
| All pairs | 572 | 0.05 | 67.62 | 4.96 × 10^-4 | 0.98 |

The paired t-tests were significant for the first three subgroups, with p-values, i.e., probabilities of such a difference occurring under the null hypothesis, ranging from 4.00 x 10^-4 to 2.30 x 10^-3. The effect sizes for these t-tests showed a corresponding range, from 0.15 to 0.21. These three p-values show highly significant results for the EPGY students who were working regularly and well. The result for the whole population in the last row shows no statistically significant difference between the experimental and control groups. In fact, it is notable that the data for all students show a regressive tendency, a general effect we discuss in Section 5.6.
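As a rough check on how the columns of Table 5 fit together, the sketch below recomputes a paired t statistic from one row's summary values. It assumes that the "Std" column is the standard deviation of the within-pair differences, which is a reading of the column header rather than something stated in the text. To stay within the Python standard library it uses a normal approximation to the t distribution, so the p-value is only approximate. The function name is illustrative.

```python
import math

def paired_t_from_summary(mean_diff, sd_diff, n_pairs):
    """Return (t, approximate two-sided p) for a paired t-test from summary values.

    Assumption: sd_diff is the standard deviation of the within-pair differences.
    The p-value uses a normal approximation, which is reasonable for large n_pairs.
    """
    std_err = sd_diff / math.sqrt(n_pairs)       # standard error of the mean difference
    t = mean_diff / std_err
    p_approx = math.erfc(abs(t) / math.sqrt(2))  # two-sided tail of the standard normal
    return t, p_approx

# Top-quartile row of Table 5: 143 pairs, mean difference 17.65, Std 67.91.
t_stat, p_val = paired_t_from_summary(17.65, 67.91, 143)
print(f"t = {t_stat:.2f}, approximate two-sided p = {p_val:.4f}")
```

With these inputs t is about 3.11. The normal approximation slightly understates the tail probability, so the exact t distribution with 142 degrees of freedom gives a somewhat larger p-value, in line with the tabulated 2.30 x 10^-3.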

Because all of our models are essentially linear, it is worth exploring the effects of two natural non-linear re-scalings of correct first-attempts (CFA). The first is to weight the CFA_i of student i by the adjusted grade-placement AGP_ij of each exercise j that student i worked correctly on the first attempt. The adjustment is to subtract the school grade placement of student i from the grade-placement GP_j of the exercise. The AGP_ij can be negative. The new weighted WCFA_i is defined as follows:

$$\mathrm{WCFA}_i = \sum_j \mathrm{AGP}_{ij}$$
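The weighting can be sketched in a few lines. The grade placements below are hypothetical values chosen for illustration, and the function name is not taken from the EPGY system; only the subtraction and the sum follow the definition in the text.

```python
def weighted_cfa(exercise_grade_placements, student_grade):
    """Sum of adjusted grade placements (GP_j minus the student's school grade
    placement) over the exercises answered correctly on the first attempt."""
    return sum(gp - student_grade for gp in exercise_grade_placements)

# Hypothetical student with school grade placement 3.0 and five correct first-attempts.
placements = [2.8, 3.0, 3.2, 3.5, 4.1]
cfa = len(placements)                  # unweighted count, CFA_i
wcfa = weighted_cfa(placements, 3.0)   # weighted score, WCFA_i
print(cfa, round(wcfa, 2))
```

Exercises below the student's school grade contribute negative terms, so a student who works many easy exercises can have a high CFA_i but a low or even negative WCFA_i.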

We consider the second rescaling, by latency, in Section 5.5. Table 5a is structured just like Table 5, but with the variable CFA_i replaced by WCFA_i.


Table 5a. Quartile summaries of comparisons between EPGY and matching control students for all students in the Effectiveness Study on the paired t-test, using weighted correct first-attempts

| Group/Subgroup (572 pairs of students in all 8 schools) | Number of pairs | Mean of the difference of CST07 within each pair | Std of the mean difference | Effect size | Paired t-test p-value |
|---|---|---|---|---|---|
| Top quartile ranked on weighted correct first-attempts | 143 | 23.99 | 80.86 | 0.346 | 5.00 x 10^-4 |
| Top third ranked on weighted correct first-attempts | 191 | 15.74 | 80.89 | 0.217 | 7.80 x 10^-3 |
| Top half ranked on weighted correct first-attempts | 286 | 6.44 | 77.26 | 0.087 | 0.16 |
| Top quartile of the bottom half ranked on weighted correct first-attempts | 71 | -13.49 | 56.81 | -0.250 | 0.05 |
| Top third of the bottom half ranked on weighted correct first-attempts | 95 | -10.71 | 60.98 | -0.192 | 0.09 |
| Top half of the bottom half ranked on weighted correct first-attempts | 143 | -6.56 | 56.82 | -0.112 | 0.17 |
| All pairs | 572 | 0.05 | 67.62 | 0.001 | 0.98 |

Results for second graders

Table 6. Summaries of comparisons between EPGY and matching control students in the second grade at the Effectiveness Study schools on the paired t-test, using correct first-attempts

| Group/Subgroup (170 pairs of 2nd grade students in all 8 schools) | Number of pairs | Mean of the difference of CST07 within each pair | Std of the mean difference | Effect size | Paired t-test p-value |
|---|---|---|---|---|---|
| Top quartile ranked on correct first-attempts | 42 | 53.33 | 76.29 | 0.74 | 5.01 x 10^-5 |
| Top third ranked on correct first-attempts | 57 | 46.97 | 87.45 | 0.65 | 1.57 x 10^-4 |
| Top half ranked on correct first-attempts | 85 | 28.25 | 95.39 | 0.38 | 7.71 x 10^-3 |
| Top quartile of the bottom half ranked on correct first-attempts | 21 | 9.62 | 79.46 | 0.13 | 0.59 |
| Top third of the bottom half ranked on correct first-attempts | 28 | 4.79 | 78.63 | 0.06 | 0.75 |
| Top half of the bottom half ranked on correct first-attempts | 43 | 4.72 | 75.63 | 0.07 | 0.68 |
| All pairs | 170 | -3.08 | 99.23 | -0.04 | 0.69 |

4.2 Changes in proficiency level

We now turn to binomial analysis of changes in proficiency level. The main results are summarized in Table 7 and Table 7a.

Table 7. Result of binomial analysis of changes in proficiency level, with students ranked by correct first-attempts

| Group/Subgroup (572 pairs of students in all 8 schools) | Change in proficiency level, EPGY group | Change in proficiency level, control group | Change in test scores, EPGY group | Change in test scores, control group |
|---|---|---|---|---|
| Top quartile ranked on correct first-attempts | +48 vs -15, p = 3.76 x 10^-5 | +39 vs -28, p = 0.22 | +91 vs -52, p = 1.40 x 10^-3 | +80 vs -62, p = 0.15 |
| Top third ranked on correct first-attempts | +64 vs -24, p = 2.37 x 10^-5 | +49 vs -36, p = 0.19 | +121 vs -70, p = 2.75 x 10^-4 | +103 vs -87, p = 0.28 |
| Top half ranked on correct first-attempts | +84 vs -50, p = 4.19 x 10^-3 | +69 vs -62, p = 0.60 | +165 vs -120, p = 9.03 x 10^-3 | +152 vs -133, p = 0.29 |
| Top quartile of the bottom half ranked on correct first-attempts | +13 vs -22, p = 0.18 | +11 vs -26, p = 0.02 | +31 vs -40, p = 0.34 | +28 vs -42, p = 0.12 |
| Top third of the bottom half ranked on correct first-attempts | +17 vs -32, p = 0.04 | +17 vs -34, p = 0.02 | +39 vs -56, p = 0.10 | +37 vs -57, p = 0.049 |
| Top half of the bottom half ranked on correct first-attempts | +22 vs -53, p = 4.49 x 10^-4 | +31 vs -48, p = 0.07 | +51 vs -92, p = 7.64 x 10^-4 | +58 vs -83, p = 0.04 |
| All pairs | +120 vs -177, p = 1.12 x 10^-3 | +130 vs -164, p = 0.05 | +258 vs -313, p = 0.02 | +263 vs -306, p = 0.08 |

Table 7 summarizes the positive and negative changes in proficiency classification for EPGY and control students, with students ranked by correct first-attempts. For the top quartile of EPGY students ranked on number of correct first-attempts, the result for the 143 EPGY students in the matched pairs is highly significant in a positive sense, with p = 3.76 x 10^-5, and not so for the control group. The numerical result is also impressive for the experimental group, +48 vs -15, meaning that the proficiency level of 48 of the students increased from 2006 to 2007, and 15 regressed to a lower level of proficiency. For the control group, the pair of numbers is, as shown, +39 vs -28. In the second row the p-value for EPGY is again below 10^-4, compared to a nonsignificant result for the control group. The third row, for the top half of EPGY students, ranked by correct first-attempts, is also highly significant, with p = 4.19 x 10^-3, and with a nonsignificant result for the control group.
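The entries of Table 7 can be read as a sign test on the counts of students who moved up versus down. The sketch below assumes a two-sided exact binomial test with success probability 0.5, with students whose classification did not change left out. This is one reading of the binomial analysis of Section 3.3, not a documented detail, and the function name is illustrative.

```python
from math import comb

def sign_test_p(n_up, n_down):
    """Two-sided exact binomial p-value for n_up increases vs n_down decreases,
    under the null hypothesis that either direction is equally likely."""
    n = n_up + n_down
    k = min(n_up, n_down)
    # Probability of a count at least as extreme as k in the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Top quartile, change in proficiency level (first row of Table 7).
p_epgy = sign_test_p(48, 15)
p_control = sign_test_p(39, 28)
print(f"EPGY: p = {p_epgy:.2e}, control: p = {p_control:.2f}")
```

Under this reading the two values come out close to the tabulated 3.76 x 10^-5 and 0.22, which suggests the assumption is at least compatible with the reported results.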


Table 7a is like Table 7, except that the weighted variable WCFA_i replaces CFA_i in the analysis.

Table 7a. Result of binomial analysis of changes in proficiency level, with students ranked by weighted correct first-attempts

| Group/Subgroup (572 pairs of students in all 8 schools) | Change in proficiency level, EPGY group | Change in proficiency level, control group | Change in test scores, EPGY group | Change in test scores, control group |
|---|---|---|---|---|
| Top quartile ranked on weighted correct first-attempts | +21 vs -27, p = 0.47 | +19 vs -44, p = 2.00 x 10^-3 | +63 vs -80, p = 0.19 | +51 vs -91, p = 9.95 x 10^-4 |
| Top third ranked on weighted correct first-attempts | +27 vs -51, p = 9.00 x 10^-3 | +26 vs -64, p = 7.66 x 10^-5 | +78 vs -113, p = 0.01 | +71 vs -119, p = 6.12 x 10^-4 |
| Top half ranked on weighted correct first-attempts | +46 vs -97, p = 2.41 x 10^-5 | +47 vs -95, p = 6.91 x 10^-5 | +110 vs -176, p = 1.14 x 10^-4 | +109 vs -175, p = 1.07 x 10^-4 |
| Top quartile of the bottom half ranked on weighted correct first-attempts | +15 vs -30, p = 0.04 | +17 vs -21, p = 0.63 | +30 vs -41, p = 0.24 | +30 vs -41, p = 0.24 |
| Top third of the bottom half ranked on weighted correct first-attempts | +22 vs -39, p = 0.04 | +25 vs -28, p = 0.78 | +39 vs -55, p = 0.121 | +43 vs -52, p = 0.41 |
| Top half of the bottom half ranked on weighted correct first-attempts | +34 vs -50, p = 0.05 | +39 vs -41, p = 0.91 | +69 vs -73, p = 0.80 | +71 vs -72, p = 1.00 |
| All pairs | +120 vs -177, p = 1.12 x 10^-3 | +130 vs -164, p = 0.05 | +258 vs -313, p = 0.02 | +263 vs -306, p = 0.08 |

A possible response to these very significant results for EPGY students, as reflected

in both Table 5 and Table 7, is that what we have done is just select a variable applicable

only to EPGY students but one that is correlated with their mathematical ability and

achievement. Naturally a selection so favorable to EPGY, and not applicable to the

control group, would yield positive comparative results.


Our next analysis shows that this alternative hypothesis to explain the positive

results does not hold up under closer scrutiny.

If this selection by general mathematics ability and achievement held up, then the

hypothesis should hold for both groups if we use as the variable for selection Math

CST06 scores. Indeed, as shown in Table 8, using this selection variable, there is no

difference between the Math CST07 scores of the top quartile and top half of the two

groups.

Table 8. Comparisons between EPGY and matching control students ranked by Math CST06 scores for all schools in the Effectiveness Study on the paired t-test

| Subgroup (572 pairs of students in all 8 schools) | Number of pairs | Mean of the difference of CST07 within each pair | Std of the mean difference | Paired t-test p-value |
|---|---|---|---|---|
| Top quartile ranked on 2006 Math CST | 142 | 9.17 | 84.31 | 0.20 |
| Top half ranked on 2006 Math CST | 281 | 3.41 | 75.50 | 0.45 |

Table 9 summarizes the result of binomial analysis of changes in proficiency level for the top quartile and top half of students.

Table 9. Result of binomial analysis of changes in proficiency level

| Subgroup (572 pairs of students in all 8 schools) | Change in proficiency level, EPGY group | Change in proficiency level, control group |
|---|---|---|
| Top quartile ranked on 2006 Math CST | +6 vs -44, p = 3.21 x 10^-8 | +10 vs -49, p = 2.71 x 10^-7 |
| Top half ranked on 2006 Math CST | +34 vs -99, p = 9.91 x 10^-9 | +42 vs -95, p = 6.92 x 10^-6 |

Moreover, the real absence of achievement by the better students in the two

groups, as measured by Math CST06 scores, can be seen in Table 9. For both groups,

regression toward the mean in proficiency levels for 2006 to 2007 Math CST scores is


highly significant for the top quartiles and top half of the students in both experimental

and control groups, ranked by Math CST06 scores.

The stark results of these last two tables reinforce the significance of the positive

achievement of the EPGY students who worked carefully and diligently, as measured by

their number (not percent) of correct first-attempts in doing exercises.

The statistical significance of these results is high, but in the context of general

elementary-school mathematics student learning, not entirely unexpected. Students in

these grades are presented with a curriculum that is increasingly difficult and more

complex. Whatever the particular math curriculum, the students who do well need to

work accurately and continually throughout the school year. The active engagement of

doing hundreds of exercises, individually adapted to the level of each student, is

probably the facilitating feature of the EPGY computer courses responsible for the

positive results. In summary, what is important is to have some significant measure of

work done, such as number of correct first-attempts.

Later, we examine briefly the possible effects of nonlinear changes in the scale

measuring the amount of work.

In another direction, a different concern is that the positive results we have

analyzed so far are for the top half of Title I students, as judged by their performance in

EPGY based on a student’s number of correct first-attempts (CFA). These students

overlap considerably with the top half of students based on Math CST06 scores. The

intersection is 150 of 286, which supports the view that these are among the best Title I

students.


We now present results for the less able Title I students, as measured by the Math

CST06 test. We selected the students who constitute the bottom half on the 2006 test. We

then tested the effectiveness of EPGY with this group of students by first taking the top

half of such students as measured by correct first-attempts (CFA). We also did the same

analyses using the top quartile of these students in the bottom half of the 2006 test scores,

again as measured by correct first-attempts.

Table 10. Comparisons between EPGY and matching control students ranked by Math CST06 scores and correct first-attempts for all schools in the Effectiveness Study on the paired t-test

| Subgroup (572 pairs of students in all 8 schools) | Number of pairs | Mean of the difference of CST07 within each pair | Std of the mean difference | Effect size | Paired t-test p-value |
|---|---|---|---|---|---|
| Top quartile CFA of bottom half CST06 | 73 | 15.71 | 63.39 | 0.25 | 0.04 |
| Top half CFA of bottom half CST06 | 146 | 8.49 | 59.85 | 0.15 | 0.09 |

Obviously this top quartile group is smaller, and performed, as a group, better in

EPGY than the top half. The results shown in Table 10 match this expectation of better

performance. What is important is that Title I students in the bottom half of 2006 test

scores but who excelled at working carefully and diligently, as measured by their being in

the top quartile as defined above, did perform significantly better than the control group with

matching 2006 scores. The results in Table 10 are not as decisive when we consider the

top half rather than the top quartile, as measured by correct first-attempts. The effect size

is positive, but small and the p-value is not significant.


Table 11. Result of binomial analysis of changes in proficiency level

| Subgroup (572 pairs of students in all 8 schools) | Change in proficiency level, EPGY group | Change in proficiency level, control group |
|---|---|---|
| Top quartile CFA of bottom half CST06 | +35 vs -6, p = 4.87 x 10^-6 | +24 vs -13, p = 0.1 |
| Top half CFA of bottom half CST06 | +62 vs -22, p = 1.47 x 10^-5 | +46 vs -30, p = 0.08 |

Table 11 shows the corresponding results for the binomial analysis of proficiency-level change for the top quartile and top half. The results are clearer in terms of positive proficiency-level changes. For the top half, the EPGY changes of +62 vs -22 were highly significant, with p = 1.47 x 10^-5, compared to +46 vs -30 for the control group, which is at best marginally significant, with p = 0.08, and not as good as the result for the EPGY group.

In the case of the top quartile, the results for the EPGY students were also quite good, +35 vs -6, with p = 4.87 x 10^-6, and for the control students not as good, +24 vs -13, with nonsignificant p = 0.10.

In summary, the results shown in Table 10 and Table 11 support the hypothesis that

careful and diligent work by EPGY students, whose test scores have classified them as

being in the bottom half of the Title I students in the experiment, can also show

significant benefits, as measured by the Math CST07 results.

Just to look even more closely at the students with the lowest Math CST06 scores, we examined the data of EPGY students in the top half CFA of the bottom half CST06. The number of students is now only 72. The difference between EPGY students and control students was not significant on the paired t-test. On the other hand, on the binomial analysis of change in proficiency level, the results were significant, +28 vs -10, with p = 5.10 x 10^-3. The results for the control group were also positive but less significant, +26 vs -11, with p = 0.03.

4.3 District-wide and school-wide results

Given the results of Section 4.1, we restricted our detailed analysis of individual

schools to the top half of students as ranked by their EPGY work. Because of the small

number of students at some schools, we did the ranking using 800 pairs in which both

members had CST07. Of special note, each school and each district were ranked

independently from each other. Table 12 and Table 13 summarize comparisons between

EPGY and control students for all schools, each district, and each school.

Table 12. Summary of comparisons between top-half EPGY and matching control students for all schools, individual district and individual school in the Effectiveness Study on the paired t-tests, using correct first-attempts

| School name | Number of pairs | Mean of the difference of CST07 within each pair | Std of the mean difference | Paired t-test p-value |
|---|---|---|---|---|
| District 1 | 243 | 18.74 | 79.31 | 3.00 x 10^-4 |
| School A | 68 | 27.77 | 90.08 | 0.01 |
| School B | 47 | 25.43 | 79.25 | 0.03 |
| School C | 70 | 19.66 | 69.91 | 0.02 |
| School D | 58 | 6.72 | 81.85 | 0.53 |
| District 2 | 64 | 8.20 | 83.90 | 0.44 |
| School E | 31 | 36.36 | 75.51 | 0.01 |
| School F | 34 | 13.36 | 79.19 | 0.33 |
| District 3 | 93 | 14.48 | 72.49 | 0.05 |
| School G | 47 | 26.70 | 78.56 | 0.02 |
| School H | 47 | 5.26 | 57.95 | 0.54 |


Table 13. Expected effect sizes on paired t-test results in Table 12

| School name | EPGY group: mean of CST07 | EPGY group: Std of CST07 | Control group: mean of CST07 | Control group: Std of CST07 | Expected effect size based on t-test |
|---|---|---|---|---|---|
| District 1 | 383.48 | 83.79 | 364.74 | 76.97 | 0.23 |
| School A | 386.75 | 82.27 | 358.99 | 63.39 | 0.38 |
| School B | 364.23 | 75.28 | 338.81 | 75.03 | 0.34 |
| School C | 403.93 | 94.87 | 384.27 | 84.62 | 0.22 |
| School D | 348.17 | 81.01 | 341.45 | 71.72 | 0.09 |
| District 2 | 386.83 | 83.61 | 378.63 | 81.73 | 0.10 |
| School E | 411.77 | 78 | 375.42 | 91.68 | 0.43 |
| School F | 376.09 | 84.58 | 362.53 | 71.64 | 0.17 |
| District 3 | 364.22 | 81.41 | 349.73 | 74.79 | 0.19 |
| School G | 388.45 | 77.96 | 361.74 | 76.07 | 0.35 |
| School H | 331.64 | 74.84 | 326.38 | 70.6 | 0.07 |

Table 12 and Table 13 show that the EPGY students consistently scored higher than the control students when the analysis was restricted to the top half of students as ranked by their number of correct first-attempts. Most p-values for this group of students were below the standard significance level of .05. The associated effect sizes were also positive.
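The effect sizes in Table 13 can be related to the group means and standard deviations in the same table. The sketch below assumes Cohen's d with the pooled standard deviation taken as the root mean square of the two group standard deviations. That formula is an assumption about the definition given in Section 3.1, chosen because it reproduces the tabulated values for the rows checked here. The function name is illustrative.

```python
import math

def cohens_d(mean_epgy, sd_epgy, mean_control, sd_control):
    """Cohen's d using the root mean square of the two group standard deviations
    as the pooled standard deviation (an assumed definition)."""
    pooled_sd = math.sqrt((sd_epgy ** 2 + sd_control ** 2) / 2.0)
    return (mean_epgy - mean_control) / pooled_sd

# District 1 and School A rows of Table 13.
d_district_1 = cohens_d(383.48, 83.79, 364.74, 76.97)
d_school_a = cohens_d(386.75, 82.27, 358.99, 63.39)
print(round(d_district_1, 2), round(d_school_a, 2))
```

Both values agree with the "Expected effect size" column to two decimals (0.23 and 0.38), which makes the pooled-standard-deviation reading plausible for the table as a whole.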

4.4 Three-level Hierarchical Linear Model (HLM)

Results for top half. The results are for the top half of the EPGY students, as

ranked by the number of correct first-attempts, and their matched control students. There


are 286 matched pairs in this group who had 2006 Math CST, 2007 Math CST, and

stayed with the same teacher and school within the experiment.

Table 14a. Random effects for top half pairs of EPGY and control students, using correct first-attempts

| Notation | Level | Estimated variance | StdErr | z-value | p-value |
|---|---|---|---|---|---|
| u_00k | School | 160.19 | 193.93 | 0.83 | 0.20 |
| υ_0jk | Classroom | 923.95 | 267.61 | 3.45 | 3.00 x 10^-4 |
| e_ijk | Student | 2153.27 | 135.27 | 15.92 | < 10^-50 |

Table 14a shows the results of random effects. The significant p-value for

classroom level, in addition to this expected result for student level, suggests the

existence of significant variation among classrooms.
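To give a sense of scale for the variance components in Table 14a, the following sketch computes each level's share of the total estimated variance and recomputes the z-values as estimate divided by standard error. Treating the three components as adding up to the total variance is the usual reading of a three-level random-intercept model, stated here as an assumption; the variable names are illustrative.

```python
# Variance estimates and standard errors copied from Table 14a.
variance = {"school": 160.19, "classroom": 923.95, "student": 2153.27}
std_err = {"school": 193.93, "classroom": 267.61, "student": 135.27}

total = sum(variance.values())
shares = {level: v / total for level, v in variance.items()}
z_values = {level: variance[level] / std_err[level] for level in variance}

for level in variance:
    print(f"{level:9s} share = {shares[level]:.1%}, z = {z_values[level]:.2f}")
```

On these figures roughly 5% of the variance lies between schools, about 29% between classrooms within schools, and about 67% between students within classrooms.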

Table 14b. Fixed effects for top half pairs of EPGY and control students, using correct first-attempts

| Notation | Effect | Estimate | StdErr | DF | t-value | p-value |
|---|---|---|---|---|---|---|
| γ_000 | Intercept | 72.41 | 12.37 | 7 | 5.85 | 6.28 x 10^-4 |
| π_1 | CST06 | 0.80 | 0.03 | 561 | 28.30 | < 10^-100 |
| π_2 | Treatment | 10.41 | 3.89 | 561 | 2.68 | 7.61 x 10^-3 |

Table 14b presents fixed effects. The intercept, 72.41, estimates γ_000, which is the mean math score when the predictors are 0. The estimate of π_1 for CST06, 0.80, gives the statistical relationship between 2006 and 2007 math test scores. Students who differ by 1 point on CST06 differ on average by 0.80 on CST07. The relationship is statistically significant, with p < 10^-100. The estimate of π_2 for the treatment, 10.41, provides the average difference in 2007 Math CST scores between EPGY and control students. Its p-value, 7.61 x 10^-3, indicates that EPGY has a statistically significant effect on improving 2007 Math CST performance.
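The fixed-effect estimates can be turned into a simple prediction rule. The sketch below assumes that the treatment indicator is coded 1 for EPGY and 0 for control, and it sets the school, classroom and student random effects to zero, so it describes an average student in an average classroom rather than any particular one. The function name and the example score of 350 are illustrative.

```python
# Fixed-effect estimates from Table 14b.
INTERCEPT = 72.41         # gamma_000
SLOPE_CST06 = 0.80        # pi_1
TREATMENT_EFFECT = 10.41  # pi_2

def predicted_cst07(cst06, in_epgy):
    """Predicted 2007 Math CST score, with all random effects set to zero.

    Assumption: the treatment indicator is 1 for EPGY students and 0 for controls.
    """
    treatment = 1 if in_epgy else 0
    return INTERCEPT + SLOPE_CST06 * cst06 + TREATMENT_EFFECT * treatment

# Hypothetical pair of students who both scored 350 on the 2006 Math CST.
control_score = predicted_cst07(350, in_epgy=False)
epgy_score = predicted_cst07(350, in_epgy=True)
print(round(control_score, 2), round(epgy_score, 2))
```

Whatever the 2006 score, the two predictions differ by the treatment estimate of 10.41 points, because the model has no interaction between treatment and CST06.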

Results for the 572 pairs. This result is for the 572 matched pairs having 2006

Math CST and 2007 Math CST.

There are variations between and within classrooms, as suggested in Table 15a. As with the paired t-test result shown in the last line of Table 5, the three-level hierarchical model does not show any statistically significant difference between the EPGY and control groups, as presented in Table 15b.

Table 15a. Random effects for the 572 pairs

| Notation | Level | Estimated variance | StdErr | z-value | p-value |
|---|---|---|---|---|---|
| u_00k | School | 138.24 | 137.70 | 1.00 | 0.16 |
| υ_0jk | Classroom | 718.04 | 164.27 | 4.37 | 6.18 x 10^-6 |
| e_ijk | Student | 2170.08 | 93.45 | 23.22 | < 10^-100 |

Table 15b. Fixed effects for the 572 pairs

| Notation | Effect | Estimate | StdErr | DF | t-value | p-value |
|---|---|---|---|---|---|---|
| γ_000 | Intercept | 77.10 | 9.30 | 7 | 8.30 | 7.21 x 10^-5 |
| π_1 | CST06 | 0.78 | 0.02 | 1131 | 38.30 | < 10^-100 |
| π_2 | Treatment | -0.53 | 2.76 | 1131 | -0.19 | 0.85 |


5. REGRESSION MODELS ONLY FOR EXPERIMENTAL GROUP (EPGY STUDENTS)

5.1 Multiple linear regression model with the independent variables EPGY correct first-attempts and students’ 2006 Math CST scores.

As has already been shown in several ways, a good measure of EPGY

performance that reflected the amount of careful sustained work by each individual

student was the number of correct first-attempts he or she had during the school year. A

multiple linear regression model was used to examine the relationship between this

number of correct first-attempts, 2006 Math CST scale-scores and 2007 Math CST scale-

scores for all Title I schools, each district and individual school. In other words, 2007

Math CST was modeled as a linear function of 2006 Math CST and the number of correct

first-attempts. Our regression model is of the form

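$$\mathrm{CST07}_i = \beta_0 + \beta_1\,\mathrm{CST06}_i + \beta_2\,\mathrm{CFA07}_i + e_i$$

where CST07_i is student i’s 2007 Math CST score, CST06_i is student i’s 2006 Math CST score, and CFA07_i is the cumulative number of correct first-attempts of student i in 2006-2007.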
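A least-squares fit of a model of this form can be sketched with the standard library alone. The data below are hypothetical, noise-free records generated from the "All Title I schools" slopes of Table 16 (0.74 and 0.03) and an assumed intercept of 60, so the fit should simply recover those values. The study's own software and student data are not shown here, and the function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_ols(cst06, cfa07, cst07):
    """Least-squares estimates [b0, b1, b2] for CST07 = b0 + b1*CST06 + b2*CFA07 + e,
    obtained from the normal equations (X'X) b = X'y."""
    rows = [[1.0, x1, x2] for x1, x2 in zip(cst06, cfa07)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, cst07)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Hypothetical students: 2006 scores and cumulative correct first-attempts.
cst06 = [280.0, 310.0, 345.0, 360.0, 400.0, 425.0]
cfa07 = [500.0, 1500.0, 900.0, 2100.0, 1200.0, 2500.0]
# Noise-free outcomes; the intercept of 60 is an assumption made for illustration.
cst07 = [60.0 + 0.74 * a + 0.03 * b for a, b in zip(cst06, cfa07)]

coeffs = fit_ols(cst06, cfa07, cst07)
print([round(c, 4) for c in coeffs])
```

With real data the residuals would not vanish, and the F-test and t-tests described in the text would then be computed from the residual variance.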

Of 919 EPGY students, 619 students have both 2006 and 2007 Math CST scores.

The regression models were applied to these students only. The overall F-test for the

model was used to determine if all coefficients were zero. The t-test was used to examine

the statistical significance of each covariate. The F-test results are shown under the

“Model description” column in Table 16; and the t-test results are shown in the

“Parameter Estimates” of Table 16.

Our regression models show a consistent positive relationship between 2007 Math CST scores and students’ EPGY work for all schools, every district and every school in the Effectiveness Study. The regression coefficients for correct first-attempts were positive throughout and statistically significant at the .05 level for all Title I schools combined, for all three districts, and for all but three of the schools (Schools A, D and F), with p-values of less than 10^-4 in several cases. These results show that, independent of where they started, the more exercises students did correctly on their first try, the higher the CST scores they got.

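Table 16. Regression model of 2007 Math CST against correct first-attempts

| School name | N | F-value | p-value | R-square | CST06: regression coefficient | CST06: p-value | Correct first-attempts: regression coefficient | Correct first-attempts: p-value |
|---|---|---|---|---|---|---|---|---|
| All Title I schools | 619 | 441.71 | < 10^-100 | 0.59 | 0.74 | < 10^-100 | 0.03 | < 10^-10 |
| District 1 | 348 | 261.3 | 8.18 x 10^-70 | 0.60 | 0.77 | < 10^-50 | 0.04 | < 10^-10 |
| School A | 84 | 55.11 | < 10^-10 | 0.58 | 1.10 | < 10^-10 | 0.01 | 0.38 |
| School B | 78 | 64.64 | < 10^-10 | 0.63 | 0.51 | 1.58 x 10^-9 | 0.05 | 1.23 x 10^-8 |
| School C | 102 | 87.68 | < 10^-20 | 0.64 | 0.65 | < 10^-10 | 0.06 | < 10^-10 |
| School D | 84 | 78.39 | < 10^-10 | 0.66 | 0.85 | 1.29 x 10^-20 | 0.01 | 0.14 |
| District 2 | 104 | 85.02 | < 10^-20 | 0.63 | 0.64 | < 10^-20 | 0.02 | 3.00 x 10^-3 |
| School E | 56 | 65.41 | < 10^-10 | 0.71 | 0.64 | < 10^-10 | 0.05 | 4.97 x 10^-6 |
| School F | 48 | 43.83 | < 10^-10 | 0.66 | 0.60 | < 10^-10 | 0.01 | 0.15 |
| District 3 | 167 | 164.14 | 7.18 x 10^-40 | 0.67 | 0.78 | < 10^-30 | 0.02 | 3.46 x 10^-4 |
| School G | 103 | 95.57 | < 10^-20 | 0.66 | 0.77 | < 10^-20 | 0.01 | 2.40 x 10^-2 |
| School H | 64 | 65.17 | < 10^-10 | 0.68 | 0.76 | < 10^-10 | 0.02 | 6.00 x 10^-3 |

The columns N, F-value, p-value and R-square form the model description; the remaining columns are the parameter estimates.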

In Table 17 we reanalyze the data used in Table 16 by normalizing to N(0,1) the

distribution of the two independent variables CST06 and correct first-attempts.
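The normalization can be illustrated as a z-score transform of each independent variable. The sketch below assumes that "normalizing to N(0,1)" means subtracting the mean and dividing by the standard deviation, here the population standard deviation; the report does not say which variant was used. The values are hypothetical and the function name is illustrative.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population form)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical correct first-attempt counts for six students.
cfa = [500, 1500, 900, 2100, 1200, 2500]
z_cfa = standardize(cfa)
print([round(z, 2) for z in z_cfa])
```

Because this is a linear rescaling of a predictor, refitting the regression multiplies that predictor's coefficient by its standard deviation and leaves the F-value, R-square and p-values unchanged, so a normalized coefficient reads as the change in CST07 per standard deviation of the predictor.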


Table 17. Normalized regression model of 2007 Math CST against correct first-attempts

| School name | N | F-value | p-value | R-square | CST06: regression coefficient | CST06: p-value | Correct first-attempts: regression coefficient | Correct first-attempts: p-value |
|---|---|---|---|---|---|---|---|---|
| All Title I schools | 619 | 441.71 | < 10^-100 | 0.59 | 58.0 | < 10^-100 | 20.30 | < 10^-10 |
| District 1 | 348 | 261.3 | 8.18 x 10^-70 | 0.60 | 60.70 | < 10^-50 | 29.60 | < 10^-10 |
| School A | 84 | 55.11 | < 10^-10 | 0.58 | 86.70 | < 10^-10 | 8.20 | 0.38 |
| School B | 78 | 64.64 | < 10^-10 | 0.63 | 40.40 | 1.58 x 10^-9 | 34.90 | 1.23 x 10^-8 |
| School C | 102 | 87.68 | < 10^-20 | 0.64 | 51.00 | < 10^-10 | 47.80 | < 10^-10 |
| School D | 84 | 78.39 | < 10^-10 | 0.66 | 66.70 | 1.29 x 10^-20 | 9.60 | 0.14 |
| District 2 | 104 | 85.02 | < 10^-20 | 0.63 | 50.40 | < 10^-20 | 16.50 | 3.00 x 10^-3 |
| School E | 56 | 65.41 | < 10^-10 | 0.71 | 50.30 | < 10^-10 | 41.50 | 4.97 x 10^-6 |
| School F | 48 | 43.83 | < 10^-10 | 0.66 | 47.50 | < 10^-10 | 10.40 | 0.15 |
| District 3 | 167 | 164.14 | 7.18 x 10^-40 | 0.67 | 60.90 | < 10^-30 | 12.30 | 3.47 x 10^-4 |
| School G | 103 | 95.57 | < 10^-20 | 0.66 | 60.40 | < 10^-20 | 9.40 | 0.02 |
| School H | 64 | 65.17 | < 10^-10 | 0.68 | 59.70 | < 10^-10 | 17.10 | 6.00 x 10^-3 |

By and large the results shown in Table 17 are very similar to those in Table 16.


In Table 18 we show a regression analysis like that of Table 16, but with correct first-attempts (CFA_i) replaced by the weighted scale WCFA_i.

Table 18. Regression model of 2007 Math CST against weighted correct first-attempts

| School name | N | F-value | p-value | R-square | CST06: regression coefficient | CST06: p-value | Weighted correct first-attempts: regression coefficient | Weighted correct first-attempts: p-value |
|---|---|---|---|---|---|---|---|---|
| All Title I schools | 619 | 421.83 | < 10^-100 | 0.58 | 0.45 | 1.58 x 10^-20 | 0.03 | < 10^-10 |
| District 1 | 348 | 217.84 | < 10^-60 | 0.56 | 0.46 | 3.02 x 10^-9 | 0.03 | 8.95 x 10^-10 |
| School A | 84 | 58.53 | < 10^-10 | 0.59 | 0.79 | 2.99 x 10^-4 | 0.03 | 0.06 |
| School B | 78 | 38.6 | < 10^-10 | 0.51 | 0.21 | 0.18 | 0.04 | 1.19 x 10^-3 |
| School C | 102 | 69.56 | < 10^-10 | 0.58 | 0.12 | 0.395 | 0.04 | 5.74 x 10^-8 |
| School D | 84 | 80.69 | 5.27 x 10^-20 | 0.67 | 0.65 | 3.66 x 10^-7 | 0.02 | 0.01 |
| District 2 | 104 | 91.63 | < 10^-20 | 0.65 | 0.35 | 2.81 x 10^-4 | 0.03 | 2.21 x 10^-4 |
| School E | 56 | 50.18 | < 10^-10 | 0.65 | 0.30 | 0.029 | 0.04 | 7.60 x 10^-4 |
| School F | 48 | 45.14 | < 10^-10 | 0.67 | 0.41 | 3.41 x 10^-3 | 0.02 | 0.09 |
| District 3 | 167 | 158.09 | < 10^-30 | 0.66 | 0.62 | 1.13 x 10^-13 | 0.01 | 3.13 x 10^-3 |
| School G | 103 | 95.97 | < 10^-20 | 0.66 | 0.62 | 3.17 x 10^-9 | 0.01 | 0.02 |
| School H | 64 | 56.52 | 10^-14 | 0.65 | 0.64 | 3.93 x 10^-6 | 9.00 x 10^-3 | 0.18 |

Table 19 is the normalized regression of Table 18, i.e., with weighted CFA_i’s.

Table 19. Normalized regression model of 2007 Math CST against weighted correct first-attempts

| School name | N | F-value | p-value | R-square | Normalized CST06: regression coefficient | Normalized CST06: p-value | Normalized weighted correct first-attempts: regression coefficient | Normalized weighted correct first-attempts: p-value |
|---|---|---|---|---|---|---|---|---|
| All Title I schools | 619 | 421.83 | < 10^-100 | 0.58 | 35.70 | 1.58 x 10^-20 | 30.80 | 6.15 x 10^-16 |
| District 1 | 348 | 217.84 | < 10^-6 | 0.56 | 36.10 | 3.02 x 10^-9 | 37.90 | 8.95 x 10^-10 |
| School A | 84 | 58.53 | < 10^-10 | 0.59 | 62.00 | 2.99 x 10^-4 | 32.30 | 0.06 |
| School B | 78 | 38.6 | < 10^-10 | 0.51 | 16.80 | 0.18 | 46.80 | 1.19 x 10^-3 |
| School C | 102 | 69.56 | < 10^-10 | 0.58 | 9.20 | 0.40 | 53.30 | 5.74 x 10^-8 |
| School D | 84 | 80.69 | 5.27 x 10^-20 | 0.67 | 51.20 | 3.66 x 10^-7 | 22.30 | 0.06 |
| District 2 | 104 | 91.63 | < 10^-20 | 0.65 | 27.80 | 2.81 x 10^-4 | 35.20 | 2.21 x 10^-4 |
| School E | 56 | 50.18 | < 10^-10 | 0.65 | 23.90 | 0.03 | 52.70 | 7.6 x 10^-4 |
| School F | 48 | 45.14 | < 10^-10 | 0.67 | 32.20 | 3 x 10^-3 | 21.30 | 0.09 |
| District 3 | 167 | 158.09 | < 10^-30 | 0.66 | 48.90 | < 10^-10 | 15.10 | 3.13 x 10^-3 |
| School G | 103 | 95.97 | 10^-20 | 0.66 | 48.80 | 3.17 x 10^-9 | 14.50 | 0.02 |
| School H | 64 | 56.52 | < 10^-10 | 0.65 | 50.00 | 3.93 x 10^-6 | 11.70 | 0.18 |

As in the previous comparison, normalizing made only small changes in the results, as a comparison of Table 18 and Table 19 shows.

5.2 A Title I school in district 2 outside the Effectiveness Study

There were 143 students with 2007 Math CST scores at this school. The average

score was 342.10 points on the CST. The standard deviation was 73.79. The mean

number of correct first-attempt responses was 1168.62 with a standard deviation of

517.63, and with a range from 129 to 2620 such responses.

A simple linear regression was used to examine the relationship between the 2007 Math CST score and the number of correct first-attempts. The result shows a strong positive relationship between these two variables: each additional 100 correct first-attempts was associated with an increase of 4.92 points in the 2007 Math CST score. This result is statistically significant: p = 2.39 x 10^-5.
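As a minimal sketch of how a slope "per 100 correct first-attempts" is read off a simple linear regression, the following fits a least-squares line to made-up data. The data points, the intercept of 280, and the function name `fit_line` are assumptions for illustration only; they are not the study's data or software.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical, noise-free data: the intercept of 280 is made up, and the
# slope of 0.0492 points per correct first-attempt mirrors the reported
# 4.92 points per 100 correct first-attempts.
cfa = [200, 600, 1000, 1400, 1800, 2200, 2600]
cst07 = [280.0 + 0.0492 * x for x in cfa]

intercept, slope = fit_line(cfa, cst07)
points_per_100 = 100 * slope
print(f"{points_per_100:.2f} CST points per 100 correct first-attempts")
```

The fitted slope is expressed per single correct first-attempt, so it is multiplied by 100 to match the way the result is reported in the text.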

5.3 Regression with the added covariate of number of correct second-attempts

By including correct second-attempts as a covariate in the regression model

described in Section 5.1, the enlarged model can perhaps explain more variability in the

2007 Math CST scores, i.e. improve the fit. Our model is thus

$$\mathrm{CST07}_i = \beta_0 + \beta_1\,\mathrm{CST06}_i + \beta_2\,\mathrm{CFA07}_i + \beta_3\,\mathrm{CSA07}_i + e_i,$$

where $\mathrm{CST06}_i$ is student i’s CST Math score in 2006, $\mathrm{CST07}_i$ is student i’s CST Math score in 2007, $\mathrm{CFA07}_i$ is the cumulative number of correct first-attempts of student i in 2006-2007, and $\mathrm{CSA07}_i$ is the cumulative number of correct second-attempts of student i in 2006-2007.

This model, like that of Section 5.1, was applied to the data of 619 EPGY students. The R-square of the original model was 0.589. The R-square of this new model was 0.600, a relative increase in fit of about 2%. In terms of the regression coefficients, correct first-attempts remained positively significant (0.04) with p < 10^-10. Correct second-attempts, however, were negatively significant (-0.15), with a p-value of 5.3 x 10^-5. The result for correct second-attempts was not surprising, because only students who made an error on their first attempt were given, immediately thereafter, a second opportunity to do the exercise.


5.4 Correlation between 2006 Math CST and correct first-attempts

The data we consider are given in Table 20.

Table 20. Data on 2006 Math CST and correct first-attempts

Variable N Mean Std Minimum Maximum

Correct first-attempts 619 1843.44 766.61 145 4917

2006 CST 619 355.68 78.59 194 600

Because 2006 Math CST and correct first-attempts in Section 5.1 are both positively related to 2007 Math CST, it is desirable to examine the correlation between these two variables to check for collinearity. We expect a weak relationship between the 2006 Math CST and correct first-attempts, since EPGY students took the CST test (May 2006) before they began using the EPGY program (November 2006). The Pearson correlation coefficient between these two variables confirms our hypothesis (ρ = 0.24, p < .0001).

However, in the reverse chronological order, the number of correct first-attempts should be highly correlated with CST06. Because students were clustered within classrooms and schools, we used a hierarchical linear model (HLM) to examine this relationship. In this model, correct first-attempts is the dependent variable and CST06 is the independent variable. The model has three levels (school, classroom and student), and the intercept is modeled as random. The combined equation for this model is

$$\mathrm{CFA07}_{ijk} = \delta_{000} + \delta_{1}\,\mathrm{CST06}_{ijk} + (u_{00k} + \upsilon_{0jk} + e_{ijk}),$$

where $\mathrm{CST06}_{ijk}$ is the 2006 Math CST score of student i in classroom j at school k, and $\mathrm{CFA07}_{ijk}$ is correct first-attempts for student i in classroom j at school k.


The results are presented in Table 21a and Table 21b.

Table 21a. Fixed effects of hierarchical model of 2006 Math CST and correct first-attempts

Notation  Term/variable  Estimate  StdErr  DF   t Value  p-value
δ_000     Intercept      1265.96   177.92  7    7.12     2.00 x 10^-4
δ_1       CST06          1.50      0.363   633  4.13     4.06 x 10^-6

The fixed-effects result shows a positive relationship between correct first-attempts and 2006 Math CST, with a p-value of 4.64 x 10^-10. The estimate of 1.65 for δ_1 indicates that for every one-point increase in the 2006 Math CST score, students get about 2 more exercises correct on the first try.

Table 21b. Random effects

Notation  Variance   Estimate  StdErr  Z Value  p-value
u_00k     School     97286     62721   1.55     0.06
υ_0jk     Classroom  112793    30325   3.72     9.98 x 10^-5
e_ijk     Student    421902    24753   17.04    <10^-60

Significant p-values for classroom and student show there is variance between

classrooms and within a single classroom.

5.5 Correlation between re-scaled correct first-attempts and Math CST scores.

The second re-scaling is to divide AGPA_ij by the latency of student i in working exercise j (LAT_ij). The intuitive hypothesis is that faster work, i.e., shorter latency, is a measure of greater mastery. So, we define:

$$\mathrm{LATWFCA}_i = \sum_{j} \frac{\mathrm{AGP}_{ij}}{\mathrm{LAT}_{ij}}$$
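The latency re-scaling can be sketched as a sum of grade-placement weights divided by latencies. This is only an illustration: the function name `latency_weighted_cfa`, the sample values, and the latency units are assumptions, and the sum is taken here over exercises answered correctly on the first attempt.

```python
def latency_weighted_cfa(exercises):
    """Sum AGP / latency over a student's correct first-attempts.

    `exercises` is a list of (agp, latency) pairs, one per exercise the
    student answered correctly on the first attempt.
    """
    return sum(agp / latency for agp, latency in exercises)


# Hypothetical students: the same exercises, but the second student takes
# twice as long on each one.
fast = [(2.0, 10.0), (3.0, 15.0), (1.5, 5.0)]
slow = [(agp, 2 * latency) for agp, latency in fast]

fast_score = latency_weighted_cfa(fast)
slow_score = latency_weighted_cfa(slow)
print(fast_score, slow_score)
```

Doubling every latency halves the score, which is the sense in which shorter latency is treated as a sign of greater mastery.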


Table 22. Pearson correlation coefficients for re-scaled correct first-attempts of top half of EPGY students

Group and selected variable                Pearson correlation with CST07  p-value
Group A: CFA                               0.199                           4.0 x 10^-4
Group B: CFA weighted by AGP               0.611                           <10^-30
Group C: CFA weighted by AGP and latency   0.543                           3.57 x 10^-20

As can be seen in Table 22, the AGP weighting produces a surprisingly large increase in the correlation with Math CST07, from 0.20 to 0.61. Dividing by the latency decreases the correlation to 0.54, contrary to expectation.

That the adjusted-grade-placement weighting is more highly correlated with the Math CST07 than the unweighted sum of correct first-attempts fits the intuitive judgment that greater weight should be given to exercises worked correctly at a higher placement level in the curriculum. The reader is reminded that EPGY’s K-5 Math course is highly individualized, so that the grade placement at which a student is working does not depend on the grade placement of other students in the class.

The effects of the scaling on the selection of the top half of the students can be seen

in Figure 3, where CFA is the scale on the x-axis and AGP the scale on the y-axis. All

EPGY students taking the Math CST07 Test in the study are used in the scatter plot.


Figure 3. Scatter plots of the three groups of students: green for those who were in the top half on both CFA and AGP; red for those in the top half of CFA or AGP, but not both; and black for students in the top half of neither CFA nor AGP.

Note that the re-scaling strongly affects the selection of the top half, as seen by the large

number of red dots in the scatter plot.

[Figure 3 scatter plot: CFA on the x-axis, Weighted_CFA on the y-axis, points grouped by groupAB (0, 1, 2).]


5.6 Analysis of the mean-difference 2007 Math CST – 2006 Math CST corresponding to different 2006 CST scores and number of correct first-attempts

Table 23. Mean-difference 2007 Math CST – 2006 Math CST corresponding to different

2006 CST scores and number of correct first-attempts.

Number of correct       2006 Math CST scores
first-attempts          200-250     250-300     300-350     350-400     400-450     450-500
3500-4000               .           .           74          .           .
                                                    5
3000-3500               .           .           35.8        62          .           .
                                                    11          5
2500-3000               .           51.8        47.9        15.2        41.6
                                        10          12          16          8
2000-2500               29.1        18.1        13.6        8.9         0.5         24.8
                            10          18          39          27          19          13
1500-2000               37.9        1           -3          -8.5        -12.8       -4.3
                            8           40          37          45          33          10
1000-1500               27.7        -1          -15.8       -32.1       -21.0       -63.1
                            9           23          30          30          21          7
500-1000                12.9        -1          -44.5       -32.1       -56.5       .
                            11          18          19          15          13
0-500                   .           .           -24.7       .           -38         .
                                                    7                       5

The mean-difference (2007 Math CST – 2006 Math CST) of each cell is presented in the upper left-hand corner in Table 23. The number of students in each cell is shown in the lower right-hand corner. This table is not complete: a cell must contain at least 5 students to be recorded here, so 45 of the 619 students are omitted, and the total shown is 619 - 45 = 574. The mean-differences are surprisingly strongly clustered. To confirm this, look at the lower right part of the table: the entries below the mean of 1843.44 correct first-attempts (in fact below 2000) and to the right of a CST06 score of 300 are all negative. Only two other negative entries occur outside this block. The striking qualitative summary of the students with negative change is that they are the ones with high CST06 scores, but with less than the average amount of successful work, as measured by CFAs in EPGY’s Math K-5 online course. In contrast, the upper left corner consists of students with below-average 2006 test scores, but with above-average correct first-attempts (CFA) in EPGY. In this region, the mean-differences of the 2007 and 2006 test scores for 2006 test scores less than 350 are all positive.

This table provides another angle on the 2007 test results that supports earlier evidence of the significant positive effect of students working diligently and carefully in the EPGY online course.

6. MULTIVARIATE-NORMAL COVARIATE MODEL FOR EPGY STUDENTS ONLY

The data for this multivariate analysis consisted of information from 619 EPGY students. Based on past experience, three covariate features were selected from a student’s EPGY performance. Together with the 2006 Math CST score, they form the vector of a student’s properties, which contained four components as follows.

1. 2006 Math CST scores (x1).

2. Number of correct first-attempts (x2).

3. Adjusted final grade placement (x3) is the difference between a student’s final

grade placement in EPGY mathematics curriculum and the actual classroom grade

level.

4. Average latency on correct first-attempts (x4) is the student’s average response

time on exercises answered correctly at the first attempt.

6.1 Minimum-distance classifier


A minimum-distance classifier was used. First, we observed the outcomes and feature measurements in the training data set, and calculated the mean and covariance matrix for each class (proficiency level). Then, we computed the Mahalanobis distance from an unknown vector $x = (x_1, x_2, \ldots, x_4)^T$ to the kth class with mean $\mu_k = (\mu_1, \mu_2, \ldots, \mu_4)^T$ and covariance matrix $P_k$ as

$$d_M(x, \text{class } k) = (x - \mu_k)^T P_k^{-1} (x - \mu_k).$$

The vector of each student’s properties as defined above was placed in the class whose multivariate distribution had the closest Mahalanobis distance to the vector.
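The classification rule can be sketched in a few lines. The example below is an illustration under stated assumptions, not the study's implementation: it uses two features instead of four so that the covariance matrix can be inverted in closed form, the training values are invented, and all function names are hypothetical.

```python
def mean_vector(rows):
    """Per-feature mean of a list of 2-feature vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]


def covariance_2x2(rows, mean):
    """Unbiased sample covariance matrix for two features."""
    n = len(rows)
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for l in range(2):
                cov[i][l] += d[i] * d[l] / (n - 1)
    return cov


def mahalanobis_sq(x, mean, cov):
    """(x - mean)^T cov^{-1} (x - mean), using the closed-form 2x2 inverse."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    return (cov[1][1] * d0 * d0 - (cov[0][1] + cov[1][0]) * d0 * d1
            + cov[0][0] * d1 * d1) / det


def fit(training):
    """Map each class label to its (mean, covariance) pair."""
    model = {}
    for label, rows in training.items():
        m = mean_vector(rows)
        model[label] = (m, covariance_2x2(rows, m))
    return model


def classify(x, model):
    """Return the class with the smallest Mahalanobis distance to x."""
    return min(model, key=lambda label: mahalanobis_sq(x, *model[label]))


# Hypothetical training data: (2006 CST score, correct first-attempts).
training = {
    "Basic": [(300, 1500), (310, 1700), (295, 1400), (320, 1650), (305, 1550)],
    "Advanced": [(420, 2300), (440, 2600), (410, 2200), (450, 2500), (430, 2450)],
}
model = fit(training)
print(classify((315, 1600), model))
print(classify((435, 2400), model))
```

Because each class has its own covariance matrix, the distance accounts for how spread out and how correlated the features are within that class, which a plain Euclidean distance would ignore.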

Table 24 lists the transition matrix of students’ proficiency levels in 2006 and 2007. Each cell of the matrix contains two numbers, with the frequency at the top and the percentage of the 2007-level (column) total at the bottom. It shows that 54.1% of students were classified correctly under the simple strong null assumption that no student’s proficiency level changed.

We used all 619 students’ information to construct the classifier. Table 25 indicates that the model improves the classification significantly, with 63.8% of students identified correctly, p < 10^-6. The comparison also shows that the classifier predicts ‘Below Basic’ and ‘Advanced’ much better than ‘Basic’ and ‘Proficient’. The worst case occurs at level ‘Basic’, with 47.33% of students classified correctly, which is still higher than the 37.33% in Table 24.
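The two overall rates quoted above are the diagonal share of each frequency matrix. As a quick check, the sketch below recomputes them from the frequencies reported in Table 24 and Table 25; the helper name `correct_rate` is an assumption for illustration.

```python
def correct_rate(matrix):
    """Share of students on the diagonal of a square frequency matrix."""
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return diagonal / total


# Frequencies from Table 24 (rows: 2006 level, columns: 2007 level),
# ordered Advanced, Proficient, Basic, Below Basic.
table_24 = [
    [92, 36, 15, 2],
    [36, 68, 50, 12],
    [12, 34, 56, 50],
    [1, 7, 29, 119],
]

# Frequencies from Table 25 (rows: 2007 level, columns: predicted level).
table_25 = [
    [112, 24, 5, 0],
    [41, 70, 26, 8],
    [16, 37, 71, 26],
    [4, 8, 29, 142],
]

rate_24 = correct_rate(table_24)
rate_25 = correct_rate(table_25)
print(f"{100 * rate_24:.1f}% unchanged, {100 * rate_25:.1f}% classified correctly")
```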

Table 24. Transition matrix for proficiency levels

Cell contents: Frequency (top), Percentage (bottom)

                 2007 level
2006 level       Advanced    Proficient  Basic       Below basic  Total
Advanced         92          36          15          2            145
                 65.25       24.83       10.00       1.09
Proficient       36          68          50          12           166
                 25.53       46.90       33.33       6.56
Basic            12          34          56          50           152
                 8.51        23.45       37.33       27.32
Below Basic      1           7           29          119          156
                 0.71        4.83        19.33       65.03
Total            141         145         150         183          619

Table 25. Classification matrix of minimum distance classifier

Cell contents: Frequency (top), Percentage (bottom)

                 Predicted level
2007 Level       Advanced    Proficient  Basic       Below basic  Total
Advanced         112         24          5           0            141
                 79.43       17.02       3.55        0.00
Proficient       41          70          26          8            145
                 28.28       48.28       17.93       5.52
Basic            16          37          71          26           150
                 10.67       24.67       47.33       17.33
Below Basic      4           8           29          142          183
                 2.19        4.37        15.85       77.60
Total            173         139         131         176          619


6.2 Bayesian classifier

Instead of assigning a vector $x = (x_1, x_2, \ldots, x_4)^T$ to the class with minimum distance, a Bayesian classifier places it in the group with highest posterior probability. Mathematically, Bayes’ theorem states

$$P(H_j \mid E) = \frac{P(E \mid H_j)\,P(H_j)}{P(E)},$$

where $H_j$ represents a hypothesis, $P(H_j)$ is the prior probability of $H_j$, $P(E \mid H_j)$ is the conditional probability of event E occurring given that $H_j$ is true (this probability is called the likelihood of $H_j$ on E), and $P(E)$ is the marginal probability of event E, which can be calculated by $\sum_j P(E \mid H_j)\,P(H_j)$.

In our case, we wanted to predict to which classification a vector $x = (x_1, x_2, \ldots, x_4)^T$ belonged. We denote the posterior probability that x is assigned to class j as $P(\text{class } j \mid x)$, and assume a prior probability $P(\text{class } j)$ and a conditional distribution $f(x \mid \text{class } j)$ for class j. The posterior probability is computed in the following way:

$$P(\text{class } j \mid x) = \frac{f(x \mid \text{class } j)\,P(\text{class } j)}{P(x)}.$$

Bayes’ rule assigned the vector $x = (x_1, x_2, \ldots, x_4)$ to the class j with maximum $P(x, \text{class } j)$.

Two main concerns here were to determine the prior and the conditional distribution. The uniform prior was tried in our study. That is, we assumed a prior in which the probability of classification had a uniform distribution (1/4, 1/4, 1/4, 1/4). The conditional distribution $f(x \mid \text{class } j)$ was obtained from the estimated multivariate normal distribution,

$$f(x \mid \text{class } j) = (2\pi)^{-2}\,\lvert \Sigma_{j,n} \rvert^{-1/2} \exp\{-\tfrac{1}{2}\,(x - m_{j,n})'\,\Sigma_{j,n}^{-1}\,(x - m_{j,n})\},$$

where $m_{j,n}$ is the mean and $\Sigma_{j,n}$ is the covariance matrix for class j, and n is the number of members in class j.
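A minimal sketch of the posterior computation with a uniform prior follows. It is illustrative only: it uses two features rather than four (so the normalizing constant is $(2\pi)^{-1}$ instead of $(2\pi)^{-2}$), the class means and covariances are invented, and the function names are hypothetical.

```python
import math


def gaussian_density_2d(x, mean, cov):
    """Bivariate normal density f(x | class) for a mean and 2x2 covariance."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    quad = (cov[1][1] * d0 * d0 - 2 * cov[0][1] * d0 * d1
            + cov[0][0] * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def posteriors(x, classes, priors):
    """P(class j | x) for each class, computed with Bayes' theorem."""
    joint = {j: gaussian_density_2d(x, m, c) * priors[j]
             for j, (m, c) in classes.items()}
    evidence = sum(joint.values())  # P(x), the marginal probability
    return {j: p / evidence for j, p in joint.items()}


# Hypothetical class parameters: mean vector and covariance matrix of
# (2006 CST score, correct first-attempts) for two proficiency levels.
classes = {
    "Basic": ([306.0, 1560.0], [[92.5, 987.5], [987.5, 14250.0]]),
    "Advanced": ([430.0, 2410.0], [[250.0, 2250.0], [2250.0, 25500.0]]),
}
priors = {label: 1 / len(classes) for label in classes}  # uniform prior

post = posteriors((315, 1600), classes, priors)
predicted = max(post, key=post.get)
print(predicted)
```

With a uniform prior the prior terms cancel, so the class with the highest posterior is simply the class with the highest estimated density at x.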

Table 26 provides the results obtained from the 619 students using the uniform prior. Compared to Table 24, the outcomes of the Bayesian classifier show a significant improvement in classification, with 66.6% of students classified correctly. The comparison between Table 26 and Table 25 shows that the Bayesian model performs much better at predicting the Basic level than the minimum-distance classifier. There is no significant difference between the overall correct prediction rates of the two classifiers.

Table 26. Classification matrix for Bayesian classifier using uniform prior

Cell contents: Frequency (top), Percentage (bottom)

                 Predicted level
2007 level       Advanced    Proficient  Basic       Below basic  Total
Advanced         104         29          8           0            141
                 73.76       20.57       5.67        0.00
Proficient       23          84          29          9            145
                 15.86       57.93       20.00       6.21
Basic            7           36          80          27           150
                 4.67        24.00       53.33       18.00
Below Basic      2           7           30          144          183
                 1.09        3.83        16.39       78.69
Total            136         156         147         180          619

6.3 Learning curves

The learning curve represents the relationship between the performance of the

classifier and the size of training data. In our case, the correct classification rate of the

test data was chosen to evaluate the classifier’s performance. The mean learning curve

was obtained by averaging over 1000 statistically independent runs. The procedure for

each run is described as follows.


(1) Form the test set by randomly selecting 10% of the elements from each class. The

data remaining are available for training.

(2) Form the original training set by randomly selecting, without replacement, the same number of samples for each class from the whole test population.

(3) Construct the classifier on the current training set, obtain the predictions for the test data set, and compute the correct classification rate.

(4) Generate new training sets by adding one trial of each class.

(5) Repeat steps (3) and (4) until the training set size reaches the maximum.
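The five steps can be sketched as a short simulation. To keep the example self-contained, a one-dimensional nearest-class-mean rule stands in for the classifiers used in the study, the data are randomly generated, the training sets are drawn from the data remaining after step (1), and 100 runs are averaged instead of 1000; all names are hypothetical.

```python
import random


def class_means(train):
    """Mean of the training values for each class."""
    return {c: sum(v) / len(v) for c, v in train.items()}


def predict(means, x):
    """Assign x to the class whose training mean is closest (1-D stand-in)."""
    return min(means, key=lambda c: abs(x - means[c]))


def one_run(data, start, rng):
    """One run: correct classification rates for growing training sets."""
    test, pool = {}, {}
    for c, values in data.items():          # step (1): hold out 10% per class
        shuffled = values[:]
        rng.shuffle(shuffled)
        k = max(1, len(shuffled) // 10)
        test[c], pool[c] = shuffled[:k], shuffled[k:]
    n_test = sum(len(xs) for xs in test.values())
    max_size = min(len(v) for v in pool.values())
    rates = []
    # Steps (2), (4), (5): equal-size training sets grown by one trial per
    # class; a prefix of a shuffled list is a sample without replacement.
    for size in range(start, max_size + 1):
        means = class_means({c: pool[c][:size] for c in pool})
        hits = sum(predict(means, x) == c   # step (3)
                   for c, xs in test.items() for x in xs)
        rates.append(hits / n_test)
    return rates


def mean_learning_curve(data, start=5, runs=100, seed=0):
    """Average the per-run curves point by point."""
    rng = random.Random(seed)
    curves = [one_run(data, start, rng) for _ in range(runs)]
    return [sum(col) / runs for col in zip(*curves)]


# Hypothetical 1-D scores for two classes.
gen = random.Random(1)
data = {
    "Basic": [gen.gauss(320, 30) for _ in range(60)],
    "Advanced": [gen.gauss(420, 30) for _ in range(60)],
}
curve = mean_learning_curve(data)
print(len(curve), round(curve[0], 2), round(curve[-1], 2))
```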

In the above procedure, the classifier is updated recursively. With a vector $x = (x_1, x_2, \ldots, x_4)^T$ added to the old training data set containing n trials, the recursive learning rules for updating sample means and the sample covariance matrix for each class are the following:

$$m_{ij,n+1} = \frac{m_{ij,n}\sum_{k=1}^{n}\delta_{j,k} + \delta_{j,n+1}\,x_{i,n+1}}{\sum_{k=1}^{n+1}\delta_{j,k}}$$

$$S_{il,j,n+1} = \frac{S_{il,j,n}\left(\sum_{k=1}^{n}\delta_{j,k} - 1\right) + \delta_{j,n+1}\,x_{i,n+1}\,x_{l,n+1} + m_{ij,n}\,m_{lj,n}\sum_{k=1}^{n}\delta_{j,k} - m_{ij,n+1}\,m_{lj,n+1}\sum_{k=1}^{n+1}\delta_{j,k}}{\sum_{k=1}^{n+1}\delta_{j,k} - 1}$$

where $m_{ij}$ is the sample mean of feature $x_i$ for class j, $S_{il}$ is the covariance between two features $x_i$ and $x_l$, n is the number of trials, and $\delta_{j,k}$ is an indicator function. If the kth trial falls in class j, it is 1. Otherwise, it is zero.
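One way to check a recursive rule of this kind is to compare it with statistics computed from scratch. The sketch below assumes the unbiased sample covariance (division by the class count minus one) and uses invented two-feature data; the function names are hypothetical.

```python
def batch_stats(rows):
    """Count, mean vector and unbiased covariance matrix computed directly."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[l] - mean[l]) for r in rows) / (n - 1)
            for l in range(d)] for i in range(d)]
    return n, mean, cov


def recursive_update(count, mean, cov, x, delta):
    """Update one class's statistics with trial x.

    delta is 1 if the trial belongs to the class and 0 otherwise; count is
    the number of earlier trials in the class (at least 2 here).
    """
    d = len(x)
    new_count = count + delta
    new_mean = [(count * mean[i] + delta * x[i]) / new_count for i in range(d)]
    new_cov = [[((count - 1) * cov[i][l] + delta * x[i] * x[l]
                 + count * mean[i] * mean[l]
                 - new_count * new_mean[i] * new_mean[l]) / (new_count - 1)
                for l in range(d)] for i in range(d)]
    return new_count, new_mean, new_cov


# Hypothetical trials for one class: (2006 CST score, correct first-attempts).
rows = [(300, 1500), (310, 1700), (295, 1400), (320, 1650), (305, 1550)]

state = batch_stats(rows[:3])          # start from the first three trials
for x in rows[3:]:
    state = recursive_update(*state, x, 1)

unchanged = recursive_update(*state, (999, 999), 0)  # trial from another class
expected = batch_stats(rows)
print(state[1], expected[1])
```

A trial with an indicator of 0 leaves the class statistics as they were, which is what allows the same rule to be applied to every class at every step.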

Figure 4 shows the mean learning curves for models we used for our study.


[Figure 4 plot: mean rate (%) on the y-axis against samples of each class on the x-axis, with one curve for the minimum Mahalanobis distance classifier and one for the Bayesian classifier with uniform prior.]

Figure 4. Learning curve for different models using test data set with fixed size (10% of each class)

The smallest training set in Figure 4 contains 5 samples for each class. The mean

learning curves show that the performance of classifiers improves greatly when the

training set size increases to 50 for each class. Both of them reach around 60% for the

probability of a correct classification. Increasing the number further brings a small

benefit. The mean rates are more than 61% for both of them when the number of samples

reaches 100 for each category. The classifiers have similar learning ability in this case.

The Bayesian classifier performs better than the Minimum Mahalanobis distance

classifier.


[Figure 5 plot: mean rate (%) on the y-axis against samples of each class on the x-axis, with one curve for the minimum Mahalanobis distance classifier and one for the Bayesian classifier with uniform prior.]

Figure 5. Learning curve for different models using test data set with fixed size (219 trials)

Figure 5 represents learning curves obtained from test data with 219 trials. The procedure is the same as above, except that in step (1) the training data are formed by randomly selecting 100 trials from each class. In this case, the Bayesian classifier outperforms the minimum-distance model.

Compared to Figure 4, the classifiers achieve higher rates of correct prediction. The reason is that both classifiers perform well in identifying “Advanced” students, who have higher test percentages.


7. PREDICTION MODEL FOR EPGY STUDENTS ON THE MATH CST 2008

The last analysis was to predict students’ performance on the 2008 Math CST given their 2007 Math CST, cumulative number of correct first-attempts and average latency per exercise since the beginning of the 2007-2008 school year. Two models were considered: the Ordinary Least-Squares Regression (OLS) and the Hierarchical Linear Regression (HLM).

OLS Model. For this prediction, we used the Ordinary Least-Squares Regression Model. Two steps were taken. The first step was to use the regression coefficients obtained in the 2006-2007 study,

$$\mathrm{CST07}_i = \alpha + \beta_1\,\mathrm{CST06}_i + \beta_2\,\mathrm{CFA07}_i + \beta_3\,\mathrm{LATENCY07}_i + e_i,$$

where $\mathrm{CST07}_i$ is student i’s CST Math score in 2007, $\mathrm{CFA07}_i$ is the cumulative number of correct first-attempts of student i in 2006-2007, and $\mathrm{LATENCY07}_i$ is the average latency in minutes of correct first-attempts by student i.

The second step was to predict 2008 Math CST scores by substituting the coefficients obtained above into the following equation:

$$\mathrm{CST08}_i = \alpha + \beta_1\,\mathrm{CST07}_i + \beta_2\,\mathrm{CFA08}_i + \beta_3\,\mathrm{LATENCY08}_i + e_i,$$

where $\mathrm{CST08}_i$ is the predicted value of Math CST 2008 of student i, $\mathrm{CFA08}_i$ is the cumulative number of correct first-attempts of student i from July 15, 2007 until May 12, 2008, and $\mathrm{LATENCY08}_i$ is the average latency in minutes of correct first-attempts by student i.
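The two-step procedure can be sketched as follows: fit the coefficients on one year's data, then apply them to the next year's predictors. This is an illustration under stated assumptions. The student records are invented and noise-free, generated from the coefficient values reported in Table 27a, the function names are hypothetical, and the least-squares fit is done with the normal equations, which is only one of several ways to compute it.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, ys):
    """Least-squares coefficients [alpha, beta1, beta2, beta3]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    """alpha + beta1 * row[0] + beta2 * row[1] + beta3 * row[2]."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Step 1: hypothetical 2006-2007 records (CST06, CFA07, LATENCY07), with
# CST07 generated without noise from the Table 27a coefficient values.
table_27a = [4.34, 0.73, 0.03, 93.00]
rows_0607 = [(300, 1500, 0.30), (350, 2100, 0.25), (280, 900, 0.40),
             (420, 2600, 0.20), (390, 1200, 0.35), (330, 1800, 0.45)]
cst07 = [predict(table_27a, r) for r in rows_0607]
coef = fit_ols(rows_0607, cst07)

# Step 2: apply the fitted coefficients to a hypothetical 2007-2008 record
# (CST07, CFA08, LATENCY08) to predict CST08.
predicted_cst08 = predict(coef, (340, 1900, 0.30))
print(round(predicted_cst08, 2))
```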

HLM Model. Similar to the steps above, we also used the Hierarchical Linear Regression Model of the 2006-2007 study, which took into account the variance between and within schools. The two-level model is presented as follows.

Level 1: Student level

$$\mathrm{CST07}_{ij} = \pi_{0j} + \pi_1\,\mathrm{CST06}_{ij} + \pi_2\,\mathrm{CFA07}_{ij} + \pi_3\,\mathrm{LATENCY07}_{ij} + e_{ij},$$

where $\mathrm{CST07}_{ij}$ is student i’s CST Math score in 2007 at school j, $\mathrm{CFA07}_{ij}$ is the cumulative number of correct first-attempts of student i at school j in 2006-2007, and $\mathrm{LATENCY07}_{ij}$ is the average latency in minutes of correct first-attempts by student i at school j.

Level 2: School level. No predictor was available at this level, and the coefficients are estimated directly in the 2007-2008 regression computation and are allowed to vary across schools.

$$\pi_{0j} = \beta_{00} + \upsilon_{0j}$$

Putting those two level models together, the HLM model in step 1 is

$$\mathrm{CST07}_{ij} = \beta_{00} + \pi_1\,\mathrm{CST06}_{ij} + \pi_2\,\mathrm{CFA07}_{ij} + \pi_3\,\mathrm{LATENCY07}_{ij} + (\upsilon_{0j} + e_{ij})$$

The HLM model for prediction is

$$\mathrm{CST08}_{ij} = \beta_{00} + \pi_1\,\mathrm{CST07}_{ij} + \pi_2\,\mathrm{CFA08}_{ij} + \pi_3\,\mathrm{LATENCY08}_{ij} + (\upsilon_{0j} + e_{ij})$$

7.1 OLS and HLM prediction models using correct first-attempts

Coefficients from California Standard Math Test 2006-2007. There were 619 students

with CST06, CST07, correct first-attempts, and latencies. The coefficients obtained by

this regression model are presented in Table 27a and Table 27b.

Table 27a. Estimated coefficients of OLS model (2006-2007)

Notation  Effect        Estimate  StdErr  t value  p-value
α         Intercept     4.34      16.650  0.26     0.79
β_1       CST06_i       0.73      0.027   26.25    <10^-100
β_2       CFA07_i       0.03      0.003   9.62     1.71 x 10^-20
β_3       LATENCY07_i   93.00     31.890  2.92     3.60 x 10^-3

Table 27b. Parameter estimates of Hierarchical Linear Model (2006-2007)

Notation  Fixed Effect   Estimate  StdErr  DF   t value  p-value
β_00      Intercept      -36.77    19.070  7    -1.93    0.10
π_1       CST2006_ij     0.71      0.027   608  26.22    <10^-100
π_2       CFA2007_ij     0.04      0.004   608  10.87    <10^-20
π_3       LATENCY07_ij   175.20    32.990  608  5.31     1.53 x 10^-7

Notation  Random Effect            Estimate  StdErr  z value  p-value
υ_0j      Between school variance  493.53    289.45  1.71     0.04
e_ij      Within school variance   2499.10   143.37  17.43    <10^-60


Prediction result. The predicted test scores for 952 students are shown graphically in

Figure 6.

[Figure 6 scatter plot: CST07 on the x-axis and predicted values on the y-axis, both from 0 to 800, with separate markers for the OLS and HLM models.]

Figure 6. The comparison of predicted 2008 CST values between OLS and HLM models.

The scatter plot shows that the predicted 2008 test scores from the OLS and HLM models are close to each other. Both groups cluster somewhat above the diagonal, as evidence of predicted improvement in scores. Table 28 summarizes the statistics of the predictions. The t-test shows that the difference between the predicted values obtained from OLS and HLM is not significant. Figure 7 shows that the two normal distribution approximations of the predictions are similar.

Table 28. Statistics of Effectiveness Study predictions obtained from the two different models

Variable                      N    Mean    Std    Minimum  Maximum
Prediction obtained from OLS  952  361.10  73.88  111.71   671.08
Prediction obtained from HLM  952  361.25  79.19  110.17   800.66

[Figure 7 histograms: count on the y-axis (0 to 300) against prediction on the x-axis (100 to 800), one panel for HLM and one for OLS.]

Figure 7. Histograms of the predictions of the OLS and HLM models


[Figure 8 plot: normal density curves for OLS and HLM over the range 0 to 1000 on the x-axis, with the y-axis scaled from 0 to 6 x 10^-3.]

Figure 8. The normal-distribution approximations of the OLS and HLM predictions

As can be seen in Figure 8, the normal-distribution approximations for the predictions are not a very good fit. This discrepancy is not directly relevant to the detailed predictions.


[Figure 9 scatter plot: CST08 on the x-axis and predicted values on the y-axis, both from 0 to 800, with separate markers for the OLS and HLM models.]

Figure 9. The comparison of predicted and actual 2008 CST

Figure 9 shows the scatter plot of the predictions, with the actual Math CST08 for 777 EPGY students shown on the x-axis and the predicted scores on the y-axis. The figure suggests that the models under-predict actual scores on average.


Table 29. The difference between the predicted values and the actual CST08 with unweighted correct first-attempts

                                      OLS Prediction - Actual CST08   HLM Prediction - Actual CST08
Rank for CST08   Number of students   Mean     Std     StdErr         Mean     Std     StdErr
Bottom quartile  194                  25.58    40.46   2.90           25.64    44.33   3.18
3rd quartile     198                  8.16     41.51   2.95           8.04     45.66   3.24
2nd quartile     184                  -10.89   48.39   3.57           -10.67   52.46   3.87
Top quartile     201                  -41.70   60.32   4.25           -38.22   64.22   4.53
All              777                  -4.90    54.49   1.95           -3.96    57.45   2.06

Table 29 shows the details of the difference between the predicted values and the actual CST08. Detailed information for each quartile is also reported. The numbers of students in the quartiles are not equal because students with tied scores at the border of two groups are assigned to the same group.
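The quartile rows and the "All" row of Table 29 can be cross-checked with a few lines of arithmetic. The sketch below uses the OLS columns as reported in the table and assumes that StdErr is the standard deviation divided by the square root of the number of students, which the report does not state explicitly.

```python
import math

# (quartile, N, mean, std, reported StdErr) for "OLS Prediction - Actual
# CST08", copied from Table 29.
rows = [
    ("Bottom quartile", 194, 25.58, 40.46, 2.90),
    ("3rd quartile", 198, 8.16, 41.51, 2.95),
    ("2nd quartile", 184, -10.89, 48.39, 3.57),
    ("Top quartile", 201, -41.70, 60.32, 4.25),
]

total_n = sum(n for _, n, _, _, _ in rows)

# The overall mean is the student-weighted average of the quartile means.
pooled_mean = sum(n * mean for _, n, mean, _, _ in rows) / total_n

# Assumed relationship: StdErr = Std / sqrt(N).
std_errors = {name: std / math.sqrt(n) for name, n, _, std, _ in rows}

print(total_n, round(pooled_mean, 2))
for name, _, _, _, reported in rows:
    print(name, round(std_errors[name], 2), reported)
```

The weighted average of the quartile means reproduces the overall mean difference in the "All" row, and the computed standard errors agree with the reported ones to within rounding.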

Table 30. Number of students over-predicted or under-predicted with unweighted correct first-attempts

                                      OLS                               HLM
Rank for CST08   Number of students   Under-predicted  Over-predicted   Under-predicted  Over-predicted
Bottom quartile  194                  58               136              59               135
3rd quartile     198                  86               112              84               114
2nd quartile     184                  110              74               108              76
Top quartile     201                  155              46               154              47
All              777                  409              368              405              372


7.2 OLS and HLM prediction models using weighted correct first-attempts

We also considered OLS and HLM models with CST06, CST07, weighted correct first-attempts, and latencies. The coefficients obtained by these regression models are presented in Table 31a and Table 31b.

Table 31a. Estimated coefficients of OLS model (2006-2007) with weighted correct first-attempts

Notation  Effect        Estimate  StdErr  t value  p-value
α         Intercept     242.82    21.77   11.16    <10^-20
β_1       CST06_i       0.45      0.047   9.61     1.81 x 10^-20
β_2       WCFA07_i      0.03      0.003   8.52     <10^-10
β_3       LATENCY07_i   -76.72    28.07   -2.73    6.00 x 10^-3

Table 31b. Parameter estimates of HLM model (2006-2007) with weighted correct first-attempts

Notation  Fixed Effect   Estimate  StdErr  DF   t value  p-value
β_00      Intercept      231.94    22.66   7    10.23    1.84 x 10^-5
π_1       CST2006_ij     0.46      0.047   608  9.79     <10^-20
π_2       WCFA2007_ij    0.025     0.003   608  8.30     <10^-10
π_3       LATENCY07_ij   -61.69    27.64   608  -2.23    0.03

Notation  Random Effect            Estimate  StdErr  z value  p-value
υ_0j      Between school variance  224.28    137.95  1.63     0.05
e_ij      Within school variance   2697.58   154.69  17.44    <10^-60


[Figure 10 scatter plot: CST08 on the x-axis and predicted values on the y-axis, both from 0 to 800, with separate markers for the OLS and HLM models.]

Figure 10. The comparison of predicted and actual 2008 CST with weighted correct first-attempts

Table 32. Statistics of Effectiveness Study predictions obtained from OLS and HLM models with weighted correct first-attempts

Variable                      N    Mean    Std    Minimum  Maximum
Prediction obtained from OLS  952  338.87  68.20  25.93    602.10
Prediction obtained from HLM  952  337.50  67.69  25.11    597.73

Table 32 shows that the predicted values from the new models are smaller than those from the old models using unweighted CFA. As observed in Figure 10, the new models under-predict most students’ 2008 CST. Detailed information is presented in Table 33 and Table 34.


Table 33. The difference between the predicted values and the actual 2008 CST with weighted correct first-attempts

                            OLS Prediction - Actual CST08   HLM Prediction - Actual CST08
Rank for CST08   Number of
                 students   Mean     Std     StdErr         Mean     Std     StdErr
Bottom quartile  194        2.84     49.37   3.54           1.63     48.91   3.51
3rd quartile     198        -24.55   49.00   3.48           -25.89   48.60   3.45
2nd quartile     184        -43.21   46.32   3.41           -44.63   45.85   3.38
Top quartile     201        -77.33   65.35   4.61           -78.72   64.40   4.54
All              777        -35.78   60.72   2.18           -37.12   60.23   2.16

Table 34. Number of students over-predicted or under-predicted with weighted correct first-attempts

                            OLS                             HLM
Rank for CST08   Number of  Under-      Over-               Under-      Over-
                 students   predicted   predicted           predicted   predicted
Bottom quartile  194        94          100                 96          98
3rd quartile     198        130         68                  135         63
2nd quartile     184        150         34                  154         30
Top quartile     201        181         20                  182         19
All              777        555         222                 567         210
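The columns of Tables 33 and 34 can be related to one another with a short calculation. The sketch below assumes that the StdErr column is the standard deviation divided by the square root of the number of students. The source does not state this, but the reported OLS figures are consistent with it. The snippet also computes the share of students under-predicted in each row. All names are hypothetical.

```python
import math

# (rank, n, OLS std, OLS reported StdErr, OLS under-predicted, OLS over-predicted)
# Values copied from Tables 33 and 34.
OLS_ROWS = [
    ("Bottom quartile", 194, 49.37, 3.54, 94, 100),
    ("3rd quartile", 198, 49.00, 3.48, 130, 68),
    ("2nd quartile", 184, 46.32, 3.41, 150, 34),
    ("Top quartile", 201, 65.35, 4.61, 181, 20),
    ("All", 777, 60.72, 2.18, 555, 222),
]


def stderr_of_mean(std, n):
    """Standard error of a mean computed from n observations (assumed formula)."""
    return std / math.sqrt(n)


def share_under_predicted(under, over):
    """Fraction of students whose actual score exceeded the prediction."""
    return under / (under + over)


for rank, n, std, reported_se, under, over in OLS_ROWS:
    se = stderr_of_mean(std, n)
    share = share_under_predicted(under, over)
    print(f"{rank:16s} SE={se:.2f} (reported {reported_se:.2f}) "
          f"under-predicted={share:.1%}")
```

Read this way, the share of under-predicted students rises steadily from the bottom quartile to the top quartile, which matches the increasingly negative mean differences in Table 33.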

A common feature of the predictions using both weighted and unweighted correct first-attempts is the underprediction of actual 2008 Math CST scores for all students and for all quartiles, except the bottom quartile for the OLS and HLM models using unweighted correct first-attempts, and the bottom two quartiles for both models using the weighted correct first-attempts.


8. CONCLUSION

A strong positive relationship between EPGY work and 2007 Math CST scores was found consistently for all Title I schools, each district, and each school. The more students worked carefully and in a sustained fashion, the higher they scored on their 2007 Math CST. In particular, EPGY students in the top quartile or the top half, ranked by the number of correct first-attempts, performed significantly better than matched control students.

A clear graphic presentation of these positive results for students is given in Table 23. All EPGY students whose number of correct first-attempts was greater than 2,000 (the mean was 1843.44) had higher test scores in 2007 than in 2006. Below 2,000, only 4 blocks in the graph showed such an improvement, and these were all students with the lowest 2006 scores.

Acknowledgements.

We thank Paul Holland for a variety of critical comments and suggestions at various

stages of this work. For financial support to conduct this study, we are indebted to three

corporations: Tessera, Flextronics and SanDisk; as well as the following individuals:

Bruce and Astrid McWilliams, Michael and Carole Marks, Tom and Johanna Baruch, and

Tim Mott.

References


Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.

West, B.T., Welch, K.B., and Galecki, A.T. (2007). Linear Mixed Models: A Practical

Guide Using Statistical Software. London, New York: Chapman & Hall/CRC.

Pack, P., Holland, P., and Suppes, P. (1999). “Development and Analysis of a

Mathematics Aptitude Test for Gifted Elementary School Students”. School

Science and Mathematics, 99: 228-247.

Suppes, P. and Liang, L. (1998). Concept Learning Rates and Transfer Performance of Several Multivariate Neural Network Models. In Recent Progress in Mathematical Psychology. C.E. Dowling, F.S. Roberts, P. Theuns, eds. Mahwah, NJ: Lawrence Erlbaum, pp. 227-252.