RESEARCH REPORT April 2003 RR-03-12
Applying the Online Scoring Network (OSN) to Advanced Placement Program® (AP®) Tests
Research & Development Division Princeton, NJ 08541
Yuli (Lilly) Zhang Donald E. Powers Wendi Wright Rick Morgan
Research Reports provide preliminary and limited dissemination of ETS research prior to publication. They are available without charge from:
Research Publications Office Mail Stop 10-R Educational Testing Service Princeton, NJ 08541
Abstract
This project explored the feasibility of using the ETS Online Scoring Network (OSN) to score
selected Advanced Placement Program® (AP®) tests. In particular, the quality of scores obtained
from traditional large-scale readings was compared with the quality of scores obtained from
OSN scoring. (For the former, readers convene in a central location to evaluate student responses
in a paper-based format; for the latter, they work alone from remote locations to evaluate
responses that are transmitted via the Internet.) The study also obtained readers’ reactions to
OSN training and to OSN capabilities for monitoring reader performance online.
The results, based on more than 11,000 test takers who had taken either the AP English
Language and Composition exam or the AP Calculus AB exam, revealed little if any difference
between scores assigned at traditional central readings and those assigned when readers used
OSN. The lack of differences applied to (a) score level and variability, (b) interreader agreement,
and (c) passing rates on the two AP exams that were investigated. Study participants did,
however, suggest several areas in which improvements could be made to OSN procedures.
Key words: Advanced Placement Program, remote scoring, Online Scoring Network,
constructed responses, essay scoring
The College Board’s Advanced Placement Program® (AP®) provides high school
students with an opportunity to earn college credit for, or advanced placement into, college-level
courses while they are still in high school. Currently, more than 100,000 teachers lead AP
courses in 19 subject areas (for which there are 35 exams). In 2001, more than 1,400,000 AP
exams were administered worldwide (The Advanced Placement Program, n.d.).
In addition to their scope, the AP tests are also noteworthy with respect to their format.
Besides multiple-choice questions, all of the AP exams contain a free-response section (essays,
problems, or speaking tasks). The free-response questions are designed to assess examinees’
ability to organize their knowledge and to produce clear, coherent answers that demonstrate their
understanding of the discipline and of specific concepts. The free responses take the form of
essays or solutions to problems and programs and are currently scored in a paper-and-pencil
format by trained readers at central locations. Needless to say, considerable effort is involved in
evaluating examinee responses to AP questions.
The ETS Online Scoring Network (OSN) is a computer-based system that was designed
to accommodate the growing needs of large-scale testing programs to score substantial numbers
of free- or constructed-response test items. By capturing and routing examinee responses, the
system enables readers to evaluate test takers’ performances from remote locations via the
Internet instead of at centralized locations, thus eliminating the transportation and housing costs
associated with a centralized reading. Reader training and certification are also accomplished
remotely as is the continuous monitoring of readers’ performance.
The objective of this study was to evaluate the feasibility of using OSN to score AP
exams. Of particular interest were (a) readers’ reactions to using OSN and (b) the extent to which
the level, reliability, and meaning of scores were comparable under OSN and under traditional
methods of scoring.
Previous Research
In perhaps the earliest research concerned with the efficacy of remote scoring, Breland
and Jones (1988) compared the reliability and validity of first-year college students’ essay scores
when they were based on (a) trained readers working in a conference setting versus (b)
unmonitored readers working in their own homes or offices. The results revealed slightly lower
reliability and validity when essays were scored remotely. The researchers concluded, however,
that remote scoring was potentially feasible, especially if more sophisticated calibration
procedures could be developed and better reader monitoring implemented.
Early research on the first version of OSN (Powers, Farnum, Grant, & Kubota, 1997) was
conducted to determine readers' reactions to the prospect of evaluating essay responses on
computer screens (instead of paper). To make this assessment, experienced readers evaluated
samples of essays both on screen and on paper. The results revealed that readers were relatively
positive about online scoring. Moreover, there were no differences between the average scores
awarded to on-screen essays versus hard-copy essays, and interreader agreement was comparable
for both kinds of presentation. Similar results were obtained by Powers and Farnum (1997), who
found no differences between displaying and scoring essays on a computer screen versus
presenting them in a paper format.
Powers, Kubota, Bentley, Farnum, Swartz, and Willard (1998a) investigated the extent to
which inexperienced readers could be effectively trained to use OSN. Traditionally, personnel
who evaluate constructed responses for ETS-administered testing programs have been required
to possess certain academic credentials. This study was designed to determine the extent to
which these prerequisites could be relaxed without sacrificing the accuracy of scoring. Both
experienced and inexperienced readers evaluated essays (involving the discussion of an issue)
both before and after they had undergone standard training for scoring essays. The results
showed that training did affect scoring accuracy, especially for readers who were previously
inexperienced. Moreover, after training, a significant proportion of inexperienced readers
exhibited a level of accuracy that was commensurate with that shown by experienced readers. A
related effort extended these findings to a second kind of essay prompt involving the analysis of
an argument (Powers, Kubota, Bentley, Farnum, Swartz, & Willard, 1998b).
Thus, several research studies have provided strong evidence of the feasibility of online
scoring. Most of this research, however, has focused on the evaluation of relatively lengthy essay
responses. To date, no large-scale study has compared evaluations conducted online with those
conducted when readers convene in a central location to evaluate student responses in a
paper-based format. Also, no study has investigated the evaluation of responses to tests like those
offered by the Advanced Placement Program.
The following sections of this paper discuss the research design for, and the results of,
one such study. The final section gives a summary of the results and suggestions for future OSN
training, scoring, and monitoring for AP tests.
Method
The Exams
The study focused on two AP exams, English Language and Composition and Calculus
AB, which employ two very different kinds of constructed-response questions. The AP English
Language and Composition test contains three free-response questions that are designed to test
skills in analyzing the rhetoric of prose passages by requiring students to write narrative,
expository, analytical, and argumentative essays in a clear and cogent manner. The AP Calculus
AB exam contains six free-response questions that require students to demonstrate, by showing
their work, their ability to solve calculus problems involving an extended chain of reasoning.
For this study, readers scored a randomly selected subset of exams from the May 2002
AP test administrations—approximately 6,000 exam books for AP English Language and
Composition and another 6,000 for AP Calculus AB were scanned and available for scoring
using OSN. All of these exams had been scored previously at the operational AP reading for
these exams in June 2002. In addition, approximately 500 exams were scored twice at the operational
reading and twice again during the OSN reading for our study. OSN scoring took place between July
22, 2002, and August 12, 2002. This allowed readers approximately 20 days to complete their
scoring, with each reader devoting a minimum of 4 hours per day.
Reader Recruitment
The study recruited readers from the pool of previously qualified AP readers. The aim
was to select a sample of readers who would approximate the pool of paper-and-pencil readers
for 2002 in terms of the proportions of experienced and new readers, males and females, college
and high school teachers, and minority group members. People who served as readers in June
2002 were not eligible to read for this study.
Invitations were sent via e-mail to the following categories of readers: those who had
declined invitations to serve as AP readers during 2002; those involved in OSN scoring for other
ETS programs; those who had retired from the reader pool within the past 2 years; and potential
readers who had been placed on an AP reader waiting list. The invitations included a link to
ONYX, an electronic relationship management tool on the ETS Web site, where invitees were
directed to complete a form to indicate their interest, availability, and access to the Internet and
other needed hardware. Readers and scoring leaders were compensated at the same hourly rate as
current 2002 AP readers and scoring leaders.
Invitations yielded a total of 73 study readers (33 for the AP English Language and
Composition exam and 40 for the AP Calculus AB exam). A plurality (41%) of the readers was
new to the AP scoring process. The others had read previously for the AP program for either 1
year (21%), 2 to 5 years (22%), or more than 5 years (16%). Most were college faculty members
(62%). A majority (82%) was White, 11% were minority group members (Asian American,
Black American, or other), and 7% did not reveal their ethnicity. Virtually equal numbers of men
and women participated.
The study involved 10 scoring leaders for the AP English Language and Composition test
and 17 for the AP Calculus AB exam. The scoring leaders’ role was to answer questions that
readers might have. Most had served previously as AP table leaders, with 77% having served at
least two years. Only three of the study scoring leaders indicated having no previous experience
as table leaders. Most were either secondary school teachers (n = 10) or college faculty members
(n = 12), with roughly equal numbers of males (n = 13) and females (n = 12). (The gender of two
scoring leaders was not reported.) All scoring leaders were White.
All in all, the readers and scoring leaders who volunteered to participate in the study
appeared to be reasonably representative of the total pool of potential AP readers. The possibility
exists, however, that these volunteers may have differed from other AP readers with regard to the
extent to which they were comfortable with reading papers on a computer screen.
Training Materials
The selection of training samples was completed as part of the regular June 2002 AP
reading sessions. These samples were used to develop OSN calibration items, benchmarks, and
rangefinders for the study. As required for OSN, the samples were also annotated during the
reading session and loaded into OSN in early July. A training Web site, which included
additional samples and annotations, was also developed. Because the calculus exam was scored
analytically, there was no need for traditional rangefinders and benchmarks. Instead, as per
current practice for paper-and-pencil reading sessions, samples of test performances were
selected only to represent a range of typical responses.
Scanning
All AP exam books that were scored for this study were scanned into OSN, the same system
used for scoring other ETS tests. The OSN capability to store any scanned
annotations/scoring guides for each question was also used in this study. For both the AP English
Language and Composition and the AP Calculus exams, a total of 240 folders (25 exam books per
folder) were selected to represent the AP populations for each exam. This was accomplished by
creating folders throughout the entire length of the period during which exams were returned, so that
any sampling bias due to early returns or to geographic region would be minimized.
Training
The study conducted two levels of training:
• General OSN training for experienced AP readers
• General OSN training and subject specific training for new readers
Each reader used the OSN AP tutorial Web site to receive initial training. At the end of
this training, which was expected to require several hours for new readers, each reader took a
certification exam, which required trainees to match the evaluations awarded to previously vetted
responses. We administered a second certification set to trainees who failed to qualify as readers
on the initial attempt. Once certified, a trainee was allowed to access the OSN Web site and
begin scoring examinee responses.
In order to evaluate the effectiveness of reader training conducted through OSN and to
gather reader reaction to participating in OSN-based training and scoring, we developed a brief
questionnaire and administered it to each reader upon completion of training. The questionnaire
also asked readers about their likely availability to participate in future OSN readings.
Scoring
Each reader scored for a total of approximately 20 hours, with each scoring session being
a minimum of 4 hours. At the beginning of each scoring session, readers took a calibration exam
consisting of prescored papers to ensure that they remained on scale. Readers were permitted to
resume scoring only if they passed the calibration exam.
Results
This report presents results in two main parts. The first part contains the statistical results,
that is, the comparison of the free-response scoring under AP operational paper-and-pencil
scoring and under OSN scoring. The second part contains the main results from surveys of AP OSN readers
and scoring leaders.
Statistical Results
The following results involve comparing the responses of a large sample of AP
examinees as evaluated by two comparable sets of trained readers under each of two different
scoring systems—OSN and traditional centralized paper/pencil scoring sessions. For each of the
two AP exams investigated here, summary statistics are shown (in Table 1 for the AP English
Language and Composition and in Table 2 for the AP Calculus AB exam) for each free-response
question, both for the total AP test taker population and for the OSN study. For the OSN study
sample, we present the results of both the operational scoring and OSN scoring. The total
samples include all examinees who took the exams during 2002. Paired t-tests were conducted
(and effect sizes calculated) to evaluate the difference between means for operational and OSN
scoring for each of the two OSN study samples.
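The paired comparison described above can be sketched in a few lines. The scores below are synthetic and the variable names are illustrative; only the analysis (a paired t-test plus a mean difference expressed in pooled-SD units) mirrors the report.

```python
"""Sketch of the paired t-test and effect-size computation; data are
synthetic stand-ins for the study's operational and OSN scores."""
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical 9-point free-response scores for 500 examinees,
# each response scored once operationally and once via OSN
operational = rng.integers(1, 10, size=500).astype(float)
osn = np.clip(operational + rng.choice([-1.0, 0.0, 1.0], size=500,
                                       p=[0.2, 0.6, 0.2]), 1, 9)

# Paired t-test: the same responses scored under the two systems
t_stat, p_value = stats.ttest_rel(operational, osn)

# Effect size d: mean score difference in pooled-SD units
pooled_sd = np.sqrt((operational.var(ddof=1) + osn.var(ddof=1)) / 2)
d = (osn.mean() - operational.mean()) / pooled_sd
```

With large samples such as those in Tables 1 and 2, even trivially small mean differences can reach statistical significance, which is why the report pairs each p value with an effect size.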
For the AP English Language and Composition test, the study found statistically
significant differences (p < .01) between groups for two of the three free-response questions, as
shown in Table 1. These results are consistent with a 1988 reader reliability study that was
conducted for the exam. Table 1 also shows, however, that the effect sizes were very small
(d < .1), corresponding to mean differences of less than half a score point between the two groups.
Table 1
Comparison of Operational and OSN Summary Statistics for Free-Response Questions for the AP English Language and Composition Test

                Total               OSN sample:         OSN sample:
                (N = 152,889)       Operational         OSN
                Mean     SD         Mean     SD         Mean     SD         N        p        Effect size
Question 1      4.82     1.76       4.97     1.67       5.09     1.71       5,388    < .01    .07
Question 2      4.74     1.66       4.77     1.63       4.67     1.72       4,225    < .01    .06
Question 3      5.02     1.66       5.00     1.61       5.05     1.60       3,414    .11      .03
Comparable summary statistics are shown in Table 2 for the six free-response questions on the AP
Calculus AB exam. Paired t-tests revealed statistically significant differences (p < .01) between the
operational and OSN scoring for all six questions, but the average score difference was less than .05
score points and the effect sizes were minuscule (d ≤ .02). This result is also consistent with a reader
reliability study for the AP Calculus AB test conducted in 1996.
Table 2
Comparison of Operational and OSN Summary Statistics for Free-Response Questions for the AP Calculus AB Test

                Total               OSN sample:         OSN sample:
                (N = 152,696)       Operational         OSN
                Mean     SD         Mean     SD         Mean     SD         N        p        Effect size
Question 1      4.08     2.57       4.12     2.55       4.16     2.57       2,928    < .01    .02
Question 2      3.13     2.82       3.28     2.86       3.31     2.86       5,320    < .01    .01
Question 3      3.12     2.24       3.24     2.32       3.20     2.32       2,660    < .01    .02
Question 4      3.51     2.87       3.62     2.86       3.58     2.84       3,181    < .01    .01
Question 5      2.29     2.66       2.44     2.71       2.47     2.69       5,609    < .01    .01
Question 6      2.33     2.11       2.47     2.17       2.44     2.15       4,142    < .01    .01
Comparison of Interreader Correlations for Operational and OSN Scoring
In addition to scoring each AP question once for a large sample, we also conducted a
reader reliability study in which a sample of responses to each free-response question was scored
by two readers. Product-moment correlations between readers were calculated for each question
as an index of reader consistency. The results were compared with the same statistics derived
from a similar study conducted as part of the 2002 operational AP readings. Tables 3 and 4
compare interreader correlations for AP readers using OSN with those for readers who scored the
same responses in the AP operational reading. The correlations between OSN readers were
nearly as high as, or higher than, those for operational readers. However, z tests of the difference
between two correlation coefficients (Marascuilo & Serlin, 1988) revealed no statistically
significant differences (p > .05) between interreader correlation coefficients obtained in the
operational and OSN settings.
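In its textbook form, the z test for the difference between two independent correlations (Marascuilo & Serlin, 1988) is a Fisher r-to-z comparison. A minimal sketch follows; the inputs are hypothetical, and the study's reported z values presumably reflect its exact sample configuration, so this sketch is not expected to reproduce Tables 3 and 4.

```python
"""Textbook z test for H0: rho1 == rho2 (independent samples), via
Fisher's r-to-z transformation. Inputs are illustrative."""
import math
from statistics import NormalDist

def fisher_z_test(r1, n1, r2, n2):
    """Return (z, two-sided p) for two independent Pearson r's."""
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # SE of z1 - z2
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# e.g., comparing two interreader correlations of .65 and .60, each
# based on 300 double-scored responses (hypothetical numbers)
z, p = fisher_z_test(0.65, 300, 0.60, 300)
```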
Table 3
Interreader Correlations for Free-Response Questions in Operational and OSN Settings for the AP English Language and Composition Test

                N       OSN correlation     Operational correlation     z        p
Question 1      365     .69                 .55                         1.91     > .05
Question 2      354     .61                 .62                         0.07     > .05
Question 3      306     .67                 .52                         1.84     > .05
Table 4
Interreader Correlations for Free-Response Questions in Operational and OSN Settings for the AP Calculus AB Test

                N       OSN correlation     Operational correlation     z        p
Question 1      228     .98                 .97                         0.03     > .05
Question 2      381     .97                 .96                         0.10     > .05
Question 3      210     .94                 .93                         0.11     > .05
Question 4      263     .97                 .96                         0.10     > .05
Question 5      462     .97                 .94                         0.47     > .05
Question 6      362     .97                 .96                         0.20     > .05
Comparison of Reliability of Operational and OSN Scoring
The Cronbach coefficient alpha was used to estimate the internal consistency of the
free-response section. Because each free-response question was read by a different reader,
coefficient alpha reflects both interreader consistency and the overall consistency of
measurement provided by the free-response questions. Table 5 presents the reliability comparison between
operational and OSN groups for the AP English Language and Composition test. There were no
significant between-group differences in the reliabilities of free-response items between the
operational and OSN scoring results.
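Coefficient alpha can be computed directly from an examinee-by-question score matrix; the sketch below uses synthetic scores and the standard formula, not the study's data.

```python
"""Cronbach's coefficient alpha from an examinee-by-question score
matrix. Synthetic data; the three columns play the role of the three
free-response questions."""
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = examinees, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # per-item variance
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
ability = rng.normal(5, 1.5, size=1000)          # shared trait
scores = np.column_stack(
    [ability + rng.normal(0, 1.2, size=1000) for _ in range(3)])
alpha = cronbach_alpha(scores)
```

Because each question is scored by a different reader, the alpha computed this way bundles reader inconsistency together with question-to-question inconsistency, exactly as noted above.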
Table 5
Comparison of Reliability Estimates for Free-Response Items in the AP English Language and Composition Test

                Total        OSN matched sample
                             Operational      OSN
N               152,889      3,241            3,241
Reliability     .673         .670             .655
Table 6 presents the reliability comparison between operational and OSN scoring for the
AP Calculus AB test. The reliability estimates are virtually identical for the two modes of
scoring.
Table 6
Comparison of Reliability Estimates for Free-Response Items in the AP Calculus AB Test

                Total        OSN matched sample
                             Operational      OSN
N               153,696      514              514
Reliability     .851         .861             .864
Comparison of Item-level Reader Agreement Rates for Operational and OSN Scoring
Besides computing interreader correlations for the smaller reliability sample, we also
estimated score consistency in the larger study sample by computing the simple percentage
agreement rates between readers performing in the two different scoring environments. For each
of the free-response questions, which are scored on a 9-point scale, three agreement rates were
computed: exact, exact or within 1 point, and exact or within 2 points. Both the observed
percentages of agreement and Cohen’s kappa agreement statistics are shown in Tables 7 and 8
for each question on the two AP exams. (Cohen’s kappa is a more appropriate way of expressing
reader agreement than is simple percentage agreement because Cohen’s kappa corrects for
chance agreement.)
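The agreement statistics can be sketched as follows. The within-k generalization of kappa shown here corrects the observed within-k agreement by the within-k agreement expected by chance, a straightforward reading of the approach described above; function and variable names are illustrative.

```python
"""Observed agreement (Po) and Cohen's kappa, generalized so that
scores within `tol` points count as agreeing (tol = 0 gives exact
agreement). Names are illustrative, not the study's code."""
import numpy as np

def agreement_stats(scores_a, scores_b, tol=0, n_cats=9):
    """Two readers' integer scores on a 1..n_cats scale."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    po = np.mean(np.abs(a - b) <= tol)            # observed agreement
    # chance agreement from the two marginal score distributions
    pa = np.bincount(a, minlength=n_cats + 1)[1:] / len(a)
    pb = np.bincount(b, minlength=n_cats + 1)[1:] / len(b)
    pe = sum(pa[i] * pb[j]
             for i in range(n_cats) for j in range(n_cats)
             if abs(i - j) <= tol)
    return po, (po - pe) / (1 - pe)               # (Po, kappa)
```

On a 9-point scale, within-1 and within-2 chance agreement is substantial, which is why kappa runs well below the raw percentages in Tables 7 and 8.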
As shown in Table 7 for the three questions on the AP English Language and
Composition test, kappa statistics for exact agreement, agreement within 1 point, and agreement
within 2 points varied from .16 to .20, from .49 to .58, and from .77 to .83, respectively. The
exact agreement rates are relatively low, based on criteria that have been suggested elsewhere.
(See Powers, 2000, for a review of these criteria.) However, the rates for less-than-exact
agreement are reasonably good, especially when the standard is agreement within two points. The
relatively low rate of exact agreement on the AP English Language and Composition test questions
may be a function of either the nature of the task or the different environments in which the
questions were scored. The comparisons in the following section provide a better sense of the
reasons for these low agreement rates.
Table 7
Observed Percentage Agreements (and Kappa Statistics) Between Operational and OSN Scoring for the AP English Language and Composition Test

                Exact             Within 1 point      Within 2 points
                Po(a)    K(b)     Po       K          Po       K
Question 1      .24      .16      .64      .49        .87      .77
Question 2      .25      .17      .66      .53        .88      .78
Question 3      .28      .20      .70      .58        .91      .83

(a) Po: observed percentage of agreement. (b) K: Cohen’s kappa.
Table 8 presents agreement statistics for the six free-response questions for the AP
Calculus AB test. The kappa-based exact, within-1-point, and within-2-point agreement rates
varied from .57 to .78, .89 to .96, and .95 to 1.00, respectively. All are quite respectable
according to the published criteria.
Table 8
Observed Percentage Agreements (and Kappa Statistics) Between Operational and OSN Scoring for the AP Calculus AB Test

                Exact             Within 1 point      Within 2 points
                Po(a)    K(b)     Po       K          Po       K
Question 1      .74      .71      .96      .94        .99      .99
Question 2      .72      .69      .95      .93        .99      .98
Question 3      .62      .57      .93      .90        .99      .98
Question 4      .71      .68      .94      .92        .99      .98
Question 5      .71      .68      .92      .89        .97      .95
Question 6      .80      .78      .97      .96        1.00     1.00

(a) Po: observed percentage of agreement. (b) K: Cohen’s kappa.
Appendix A presents detailed agreement tables for each free-response item for both the
AP English and Composition and the AP Calculus AB tests.
Comparison of Interreader Agreement
As shown above, the rate of exact agreement between OSN and operational scoring was
relatively low for the AP English Language and Composition test. In order to ascertain the nature
of the score discrepancies—whether from disagreement between readers or from disagreement
due to different scoring environments—we compared interreader agreement in the reliability
study sample (in which both readers evaluated responses in an operational setting) with the
agreement exhibited in the OSN study sample (in which one reader evaluated responses in an
operational setting and the other evaluated the same responses in an OSN setting). The
comparisons, shown in Table 9, reveal that reader agreement was at least as good (perhaps
better) between two readers in different scoring environments as for readers in the same
operational scoring environment (i.e., in the reliability study sample). This result suggests that
the major source of disagreement is readers themselves, not differences between OSN and
operational scoring environments.
Table 9
Kappa Statistics for the Study Sample (Operational vs. OSN Scoring) and for the Reader Reliability Sample for the AP English Language and Composition Test

                Exact              Within 1 point      Within 2 points
                OSN(a)   RRS(b)    OSN      RRS        OSN      RRS
Question 1      .22      .21       .62      .53        .88      .79
Question 2      .19      .22       .59      .56        .84      .89
Question 3      .24      .17       .62      .54        .91      .82

(a) OSN: OSN study sample, n = 5,388; 4,225; and 3,414, respectively, for each question. (b) RRS: reader reliability study sample, n = 500.
The results in Table 10 provide similar information (and results) for the AP Calculus AB test.
Table 10
Kappa Statistics for the Study Sample (Operational vs. OSN Scoring) and for the Reader Reliability Sample for the AP Calculus AB Test

                Exact              Within 1 point      Within 2 points
                OSN(a)   RRS(b)    OSN      RRS        OSN      RRS
Question 1      .73      .74       .96      .94        1.00     .99
Question 2      .70      .67       .93      .92        1.00     .97
Question 3      .63      .58       .93      .88        .97      .98
Question 4      .72      .66       .92      .91        .99      .99
Question 5      .73      .64       .94      .90        .99      .95
Question 6      .76      .76       .98      .95        1.00     .98

(a) OSN: OSN study sample, n = 2,928; 5,320; 2,660; 3,181; 5,609; and 4,142, respectively, for each question. (b) RRS: reader reliability study sample, n = 500.
Comparison of AP Grades Based on Operational and OSN Scoring
The effect on AP grades was evaluated by comparing grades based on an operational
reading to those based on OSN scoring. The multiple-choice section scores were, of course,
identical in these comparisons. For the AP Calculus AB test, the free-response section
contributes 50% to the composite score; for the AP English Language and Composition test, it
contributes 55%. Table 11 provides the distribution of operationally reported grades versus
OSN-based grades for the AP English Language and Composition test, with the diagonal cells
representing agreement and the off-diagonal cells indicating discrepancies.
In this study, OSN grades were identical to operational grades in 2,253 of 3,241 cases
(69.6%) for the AP English Language and Composition test. Had reported grades been based on
OSN scores, 30.4% of the grades would have differed, with 16.0% being higher under OSN
grading and 14.4% being lower; fewer than 0.5% of the differences were greater than one AP
grade. This result is consistent with reader reliability studies conducted with
operational data in 2002 and 1998, which yielded an overall agreement rate of 66% (Educational
Testing Service, 1998).
Table 11
Cross-tabulation of AP Grades Based on Operational and OSN Scoring for the AP English Language and Composition Test

Operational                          OSN grade
grade        1           2           3            4           5           Total
1            183 (78%)   60          0            0           0           243
2            53          787 (79%)   163          1           0           1,004
3            0           145         680 (67%)    162         2           989
4            0           3           173          404 (62%)   130         710
5            0           0           6            86          203 (61%)   295
Total        236         995         1,022        653         335         3,241
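As a check on the rates quoted in the text, the counts in Table 11 can be reduced to agreement and discrepancy percentages directly; the computed rates match the published figures to rounding.

```python
"""Reducing the Table 11 cross-tabulation to agreement and discrepancy
rates (rows: operational grade 1-5; columns: OSN grade 1-5)."""
import numpy as np

table11 = np.array([
    [183,  60,   0,   0,   0],
    [ 53, 787, 163,   1,   0],
    [  0, 145, 680, 162,   2],
    [  0,   3, 173, 404, 130],
    [  0,   0,   6,  86, 203],
])
n = table11.sum()                                  # 3,241 examinees
exact = np.trace(table11) / n                      # same grade both ways
higher = np.triu(table11, k=1).sum() / n           # OSN grade higher
lower = np.tril(table11, k=-1).sum() / n           # OSN grade lower
off_more_than_one = sum(table11[i, j]              # |difference| > 1
                        for i in range(5) for j in range(5)
                        if abs(i - j) > 1) / n
```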
Table 12 presents the distribution of reported grades versus OSN-based grades for the AP
Calculus AB test. In this study, OSN grades were identical to operational grades in 425 of 446
cases (95.3%). Had reported grades been based on OSN scores, 4.7% of the grades would have
differed by one AP grade. This result is consistent with the reader reliability study conducted in
1996, for which an exact rate of agreement of 94% was calculated (see Bleistein, Morgan, &
Battleman, 1996).
Table 12
Cross-tabulation of AP Grades Based on Operational and OSN Scoring for the AP Calculus AB Test

Operational                         OSN grade
grade        1          2           3            4           5           Total
1            38 (97%)   1           0            0           0           39
2            0          56 (97%)    2            0           0           58
3            0          2           119 (95%)    4           0           125
4            0          0           0            107 (93%)   8           115
5            0          0           0            4           105 (96%)   109
Total        38         59          121          115         113         446
Performance of OSN Readers Over Time
The OSN reading for the AP English Language and Composition exam required more
than 20 days, with the number of free-response questions read varying from day to day. The
consistency of OSN readers over time was estimated by computing the correlation between
scores on free-response and multiple-choice sections for both operational and OSN scoring.
Using a z test for the equality of correlations, these correlations were compared for every
question for each day of scoring that yielded at least 50 evaluations per question.
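The day-by-day analysis can be sketched as follows, using synthetic data; the column names ('day', 'fr_score', 'mc_score') and the helper name are illustrative assumptions, not the study's actual variables.

```python
"""Per-day Pearson correlations between free-response and
multiple-choice scores, keeping only days with at least 50
evaluations. Data and names are illustrative."""
import numpy as np
import pandas as pd

def daily_validity_correlations(df, min_n=50):
    """Return a Series of per-day free-response/MC correlations."""
    out = {}
    for day, grp in df.groupby("day"):
        if len(grp) >= min_n:      # study required >= 50 evaluations
            out[day] = grp["fr_score"].corr(grp["mc_score"])
    return pd.Series(out)

rng = np.random.default_rng(2)
mc = rng.normal(30, 8, size=2000)  # multiple-choice section scores
df = pd.DataFrame({
    "day": rng.integers(1, 20, size=2000),   # 19 scoring days
    "mc_score": mc,
    # free-response score sharing a moderate common factor with MC
    "fr_score": 0.10625 * (mc - 30) + rng.normal(5, 1.5, size=2000),
})
daily_r = daily_validity_correlations(df)
median_r = daily_r.median()
```

Pairs of per-day correlations (operational vs. OSN) would then be compared with the same z test for the equality of correlations used elsewhere in the report.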
For the AP English Language and Composition exam, the median correlations (over days)
with performance on multiple-choice questions for the three free-response questions were .46, .42,
and .42 when free-responses were scored operationally. The corresponding correlations when
free-response questions were scored with OSN were .48, .41, and .46. For the AP Calculus AB
exam, the median correlations over time between multiple-choice scores and each of six free-
response question scores were .67, .64, .63, .69, .68, and .71 for operational scoring. For OSN
scoring, the corresponding correlations were virtually identical: .66, .64, .65, .69, .69, and .71.
The comparisons over time for the three free-response questions in the AP English
Language and Composition test are displayed in Figure 1, which shows that, for each of the three
questions, the differences between correlations (OSN correlation minus operational correlation)
fluctuate randomly around 0.0 and between –0.2 and +0.2. There were no statistically significant
differences between the two correlations at the .05 level, and the differences between correlations
in the two scoring environments were as likely to be positive as negative. (See Tables B1 to B3
in Appendix B.) These results indicate that, when the multiple-choice section of the test is treated
as a validity criterion, OSN and traditional operational readings of free-response questions for
the AP English Language and Composition test are equally valid.
[Figure omitted. Line graph: x-axis “Days of scoring” (1–19); y-axis “Difference” (–0.2 to +0.2); one line each for Q1, Q2, and Q3.]

Figure 1. Differences over time between OSN and OP in correlations for AP English
Language and Composition Test.
The OSN reading for the AP Calculus AB test required about 16 days. The consistency of
OSN readers was estimated in the same manner as for the AP English Language and
Composition test. The results, shown in Figure 2 (see Tables B4 to B9 in Appendix B), are
similar to those shown in Figure 1 for the AP English Language and Composition test. The six
lines in the graph fluctuate in a seemingly random manner around 0.0 and in a very limited range
(-0.03 to 0.04). Thus, there were no apparent differences in the validity of operational and OSN-
based readings over time.
[Figure omitted. Line graph: x-axis “Days of scoring” (1–14); y-axis “Difference” (–0.03 to +0.05); one line each for Q1 through Q6.]

Figure 2. Differences over time between OSN and OP in correlations for AP Calculus AB.
AP OSN Survey Results
Readers’ Opinions About OSN Training
Upon completion of their training, readers who participated in the study were surveyed to
obtain their impressions of OSN. All of the 73 study readers (33 for the AP English Language
and Composition exam and 40 for the AP Calculus AB exam) completed the survey. A majority
(89%) of readers rated general OSN training as being effective (29% rated it as very effective) in
helping them to properly utilize OSN without seeking technical support, and 94% regarded the
subject-specific scoring training as being effective (28% rated it as very effective) in helping
them to score responses accurately. However, 76% of the 41 readers who had read previously for
the AP program did not think that OSN training was as effective as their previous AP training,
and six of these readers thought that OSN training was much less effective.
Participants who expressed dissatisfaction with OSN training most often mentioned the
lack of opportunity to discuss standards with other readers, the inability to print the commentary
that accompanied training essays, the inability to display a larger portion of an essay without
scrolling, and the lack of correspondence between the prompts used for training and those that
examinees had actually answered.
During the course of the OSN reading, a majority of readers (63%) felt that they needed
to seek OSN technical support, and 88% found this help to be satisfactory for resolving technical
problems. Readers were also asked to rate each of several aspects of OSN as being either
excellent, good, satisfactory, or unsatisfactory. Their responses are shown in Table 13.
Table 13
Readers’ Opinion About OSN
% Satisfactory or higher   % Excellent or good
OSN login process 97 86
User-friendliness (ease-of-use, navigation) 96 65
Online practice tutorial 90 69
Visual display (screen headings, etc.) 87 69
System response time 80 55
Handwriting image display 69 37
Most aspects of OSN were viewed as at least satisfactory and, with one exception, all
were rated as good or excellent by a majority of readers. The lowest-rated feature, the
handwriting display, was found to be satisfactory by 69% of readers. Even though a significant
minority of readers found the handwriting display to be less than satisfactory, it apparently did
not hinder their scoring.
Readers were also asked to evaluate their interactions with and support from scoring
leaders. The vast majority of readers (92%) consulted their scoring leader at least once, and a
slight majority (53%) did so more than twice. Of those responding, 92% felt that the telephone
was at least satisfactory for discussing scoring issues with their scoring leader. All respondents
felt that the scoring leader was helpful (78% found him or her very helpful), and 86% said that
the scoring leader was available when needed.
Finally, readers were asked if they had encountered technical difficulties when using the
OSN Web site. Of the 58 readers who responded to the question, nearly half (45%) indicated that
they had experienced difficulty connecting with the Web site, and 38% said that they had trouble
with download speed.
Scoring Leaders’ Opinions
When asked for their opinion about the overall effectiveness of the general OSN training
(in helping them to properly utilize the system without seeking technical support), fully 83% of
respondents indicated that the training was at least slightly effective; only four scoring leaders
felt that it was either not very effective or not effective at all. Eighty-one percent regarded the
use of the phone as being at least satisfactory for discussing scoring issues with readers. The
most often mentioned positive aspect of OSN training was that it enabled readers to train at their
own pace; the most often mentioned negative aspect was the inability of readers to interact with
each other (to discuss troublesome essays or scoring rubrics, for example).
Scoring leaders were also asked to rate each of several aspects of OSN. Their ratings are
shown in Table 14.
Table 14
Scoring Leaders’ Opinion About OSN
% Satisfactory or higher   % Excellent or good
OSN login process 96 77
User-friendliness (ease-of-use, navigation) 81 54
Online practice tutorial 80 36
Visual display (screen headings, etc.) 81 31
System response time 77 57
Handwriting image display 69 19
Three quarters of the scoring leaders reported that they required technical support during
the readings, and all but one of 19 respondents rated the help they received as being
satisfactory. However, 15 of 20 respondents said they encountered trouble at least once when
connecting to the OSN Web site, and nearly half of them reported trouble with download speed.
Summary and Discussion
The ETS Online Scoring Network (OSN) is a computer-based system designed to score
free- or constructed-response test items. By capturing and routing examinee responses, the
system enables readers to evaluate test takers’ performances from remote locations via the
Internet rather than at centralized locations, as has been the tradition for many large-scale testing
programs that employ constructed-response testing.
The effort described here entailed the development and application of OSN procedures to
score a large sample of constructed responses for two Advanced Placement tests—English Language
and Composition and Calculus AB. The study objectives were to obtain participants’ reactions to
OSN and to assess their performance when using the system. Study participants were 67 readers and
26 scoring leaders, all recruited from the same population as traditional AP readers. All told, these
readers evaluated nearly 6,000 exams for each of the two tests that were studied.
With respect to participant reactions to using OSN, the following results were obtained.
The vast majority of readers rated both general OSN training and subject-specific training as
being effective in helping them to utilize OSN and to score responses accurately. In general,
however, readers did not think that OSN training was as effective as their previous AP training.
During the course of the OSN reading, a slight majority of readers sought technical
support, and, of those readers, the vast majority found this support to be satisfactory for resolving
technical problems. Readers expressed variable opinions about specific aspects of OSN, though
most aspects were rated quite positively. The least positive rating pertained to the handwriting
image display, which was still rated as at least satisfactory by a majority of readers. Many
readers also indicated that they had experienced difficulty connecting with the Web site or with
download speed.
The opinions of scoring leaders regarding the effectiveness of various aspects of the
system were generally consistent with those of readers. For example, scoring leaders also rated
the handwriting image display as the least effective component of the system, although a
majority rated it as satisfactory. Like readers, some scoring leaders also reported difficulty when
connecting to the OSN Web site or with slow download speed.
Readers’ performance in the OSN environment was assessed by comparing the
evaluations made by two comparable sets of trained readers each working in a different scoring
environment—either in OSN or in traditional centralized paper/pencil scoring sessions. The same
samples of examinee test responses were evaluated in each scoring environment.
For each test question, a comparison of mean scores awarded in each scoring
environment revealed statistically significant differences between scoring environments.
(Statistical significance was likely a result of the very large samples.) However, these differences
were very small (d < .1) for the AP English Language and Composition test and minuscule (d <
.02) for the AP Calculus AB exam. Moreover, the direction of any difference was as likely to
favor one scoring environment as the other.
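The report gives only the resulting d values and does not state which formula was used; the pooled-standard-deviation form of the standardized mean difference is one common definition. A minimal sketch, with made-up means and SDs (not study data) chosen to show how small d < .1 is in practice:

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference using a pooled standard deviation.
    This pooled-SD form is an assumption here, not the authors' stated formula."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical illustration: a 0.07-point difference in essay-score means
# with SDs of about 1.7 on roughly 5,400 responses per environment
# yields d of about .04, i.e., comfortably under .1.
d = cohens_d(4.62, 1.7, 5390, 4.55, 1.7, 5390)
```

With large samples such a difference is statistically significant, yet negligible in standardized units, which is the point of the paragraph above.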
In addition to scoring each AP question once for a large sample, a reader reliability study
was conducted in which a sample of responses to each free-response question was scored by two
readers. The agreement between OSN readers was at least as good as that for those who read in a
traditional operational setting.
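Agreement of the kind summarized here (and tabulated item by item in Appendix A) can be read off a crosstab of the two sets of scores. A minimal sketch with a made-up 3-point-scale table, not study data:

```python
def agreement_rates(table):
    """Percent exact and within-one-point agreement from a square
    crosstab of first-reading scores (rows) by second-reading scores (columns)."""
    size = len(table)
    total = sum(sum(row) for row in table)
    exact = sum(table[i][i] for i in range(size))
    adjacent = sum(table[i][j] for i in range(size)
                   for j in range(size) if abs(i - j) <= 1)
    return 100 * exact / total, 100 * adjacent / total

# Hypothetical 3-point example (not study data):
tab = [[40, 8, 2],
       [10, 50, 9],
       [3, 7, 45]]
exact, within_one = agreement_rates(tab)
```

Exact agreement is the diagonal of the crosstab; adjacent agreement adds the two off-diagonals, which is how discrepancies of one point are usually tolerated in essay scoring.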
The internal consistency of the free-response test sections was also compared across the
two scoring environments. There were no significant between-environment differences in the
reliabilities of free-response sections for either test.
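The report does not name the reliability coefficient it computed for the free-response sections; coefficient alpha is the standard internal-consistency estimate for a section made up of several question scores, and a generic sketch (with tiny illustrative data, not study data) is:

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """Coefficient alpha for a list of examinees' per-question score lists.
    Illustrative only: the report does not state which coefficient was used."""
    k = len(scores[0])  # number of free-response questions in the section
    item_vars = [pvariance([s[i] for s in scores]) for i in range(k)]
    total_var = pvariance([sum(s) for s in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Tiny made-up example: two questions, four examinees.
alpha = coefficient_alpha([[1, 2], [2, 3], [3, 4], [4, 6]])
```

Comparing such coefficients across the two scoring environments is what the paragraph above reports.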
In addition, we compared interreader agreement in operational and OSN settings. The
comparisons revealed that reader agreement was very similar in the two different scoring
environments.
The effect on AP grades was evaluated by comparing grades based on an operational
reading to those obtained with OSN scoring. For each exam, the rates of agreement between
OSN-based grades and those based on the traditional AP scoring environment were at least as
high as the rates noted in previous AP reliability studies, in which exams were double scored in
the same, traditional scoring environment.
The consistency of OSN readers over time was also estimated by computing the
correlations between scores on free-response and multiple-choice sections for operational and
OSN scoring and comparing these correlations for each day of scoring. There were no
statistically significant differences between the two sets of correlations for either test. These
results indicate that, when the multiple-choice section of the test is treated as a validity criterion,
OSN and traditional operational readings of free-response questions are equally valid.
In conclusion, it appears that, according to study participants, some aspects of OSN need
to be improved. However, by each of the several widely accepted performance standards
(interreader agreement, internal consistency, and relationship to other appropriate variables), the
results obtained with OSN are extremely similar to those obtained with traditional AP scoring
methods.
References
The Advanced Placement Program. (n.d.). Retrieved August 29, 2002, from
http://apcentral.collegeboard.com/program
Bleistein, C., Morgan, R., & Battleman, M. (1996). College Board Advanced Placement
Examination reader reliability study: Calculus AB and Calculus BC, Form 3RBP.
Unpublished manuscript.
Breland, H. M., & Jones, R. J. (1988). Remote scoring of essays (ETS RR-88-04). Princeton, NJ:
Educational Testing Service.
Educational Testing Service. (1998). College Board Advanced Placement English Language and
Composition form 3TRP reader reliability study. Unpublished manuscript.
Marascuilo, L. A., & Serlin, R. C. (1988). Statistical methods for the social and behavioral
sciences. New York: W. H. Freeman and Company.
Powers, D. E. (2000). Computing reader agreement for the GRE Writing Assessment (ETS RM-
00-08). Princeton, NJ: Educational Testing Service.
Powers, D. E., & Farnum, M. (1997). Effects of mode of presentation on essay scores (ETS RM-
97-08). Princeton, NJ: Educational Testing Service.
Powers, D. E., Farnum, M., Grant, M., & Kubota, M. (1997). A pilot test of online essay scoring
(ETS RM-97-07). Princeton, NJ: Educational Testing Service.
Powers, D. E., Kubota, M., Bentley, J., Farnum, M., Swartz, R., & Willard, A. (1998a).
Qualifying readers for the Online Scoring Network (ETS RR-98-20). Princeton, NJ:
Educational Testing Service.
Powers, D. E., Kubota, M., Bentley, J., Farnum, M., Swartz, R., & Willard, A. (1998b).
Qualifying readers for the Online Scoring Network: Scoring argument essays (ETS RR-
98-28). Princeton, NJ: Educational Testing Service.
Appendix A
Item Agreement Between OP and OSN Scoring
Tables A1 through A9 show the item agreement between operational (OP) and OSN scoring on
free-response (FR) questions for AP English Language and Composition and AP Calculus AB tests.
Table A1
AP English Language and Composition: Question 1
OP FR1 (rows) by OSN FR1 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0           1     3     1     0     0     0     1     0     0     0       6
1           3    27    31    12     7     1     0     0     0     0      81
2           1    14    65   109    82    44    18     9     4     1     347
3           2     7    80   113   157    91    69    14     5     2     540
4           1     3    56   166   337   245   244   109    24     7   1,192
5           1     4    23   103   249   200   242   140    59    10   1,031
6           0     6    20    63   231   275   350   182   105    22   1,254
7           0     2     4    18    75    87   184   133    92    19     614
8           0     0     3     6    13    36    69    60    55    18     260
9           0     0     0     0     3     4    12     9    24    13      65
Total       9    66   283   590 1,154   983 1,189   656   368    92   5,390
Table A2
AP English Language and Composition: Question 2
OP FR2 (rows) by OSN FR2 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0           1     4     1     0     0     0     1     0     0     0       7
1           1    39    19    11     5     0     0     1     0     0      76
2           1    28    85    59    43    31    10     4     2     0     263
3           0    24    86   143   130    92    41    16     6     1     539
4           2    14    84   189   298   272   126    56    21     4   1,066
5           6     8    33   102   221   197   165    79    25     2     838
6           4     3    22    69   152   232   192   114    44    12     844
7           4     2     6    17    49    97   107    69    43    10     404
8           0     0     2     4    10    28    41    40    33    12     170
9           0     0     0     1     2     3     7     9     6     7      35
Total      19   122   338   595   910   952   690   388   180    48   4,242
Table A3
AP English Language and Composition: Question 3
OP FR3 (rows) by OSN FR3 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0           0     1     0     0     0     0     0     0     0     0       1
1           1     9    14     9     2     2     0     1     0     0      38
2           1    10    35    49    48    27     7     1     1     0     179
3           0     8    41   103   114    74    39     8     3     0     390
4           0     4    36   116   218   173   101    39    12     2     701
5           1     3    11    52   175   240   162    80    14     5     743
6           0     2     6    31   119   197   215   147    52     7     776
7           0     2     2    10    33    78   114    96    45    13     393
8           0     0     0     2     9    20    45    41    35     5     157
9           0     0     0     0     2     1     4    10    17     4      38
Total       3    39   145   372   720   812   687   423   179    36   3,416
Table A4
AP Calculus AB: Question 1
OP FR1 (rows) by OSN FR1 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0         208    17     3     1     0     0     0     0     0     0     229
1          14   184    32    10     2     1     0     2     0     0     245
2           0    31   298    37     9     1     1     0     0     0     377
3           1     4    43   392    49    11     8     0     0     0     508
4           0     0     8    57   265    46    10     0     0     0     386
5           0     0     0    12    39   168    43     5     1     0     268
6           0     0     0     1     6    43   214    33     9     2     308
7           0     0     0     0     1     6    31   152    39     2     231
8           0     0     0     0     0     0     3    30   128    32     193
9           0     0     0     0     0     0     2     2    16   163     183
Total     223   236   384   510   371   276   312   224   193   199   2,928
Note. Frequency missing = 150,768.
Table A5
AP Calculus AB: Question 2
OP FR2 (rows) by OSN FR2 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0       1,430    93    28     6     3     0     0     0     0     0   1,560
1          76   289    77    13     6     1     0     0     0     1     463
2          14    64   217    50     8     5     2     0     0     0     360
3           4     8    41   197    54    12     2     1     0     0     319
4           4     1    17    54   501    82    18     4     0     1     682
5           2     1     1     9    66   370    98    16     3     0     566
6           1     0     0     0     9    88   307    87    10     2     504
7           0     0     0     0     4    10    81   219    60    10     384
8           0     0     0     0     0     2    14    68   172    38     294
9           0     0     0     0     0     0     0     4    52   132     188
Total   1,531   456   381   329   651   570   522   399   297   184   5,320
Table A6
AP Calculus AB: Question 3
OP FR3 (rows) by OSN FR3 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0         331    31     2     1     0     0     0     0     0     0     365
1          28   253    69     8     1     0     0     0     0     0     359
2           8    65   239    68    11     2     0     0     0     0     393
3           1    20    83   217    66    15     5     1     0     0     408
4           1     2    19    83   170    55    11     1     0     0     342
5           1     0     1    16    60   170    43     7     3     0     301
6           0     1     0     6    14    45   119    32     4     1     222
7           0     0     0     0     2    12    38    71    23     5     151
8           0     0     0     0     0     3     6    21    40    10      80
9           0     0     0     0     0     0     0     3    10    26      39
Total     370   372   413   399   324   302   222   136    80    42   2,660
Table A7
AP Calculus AB: Question 4
OP FR4 (rows) by OSN FR4 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0         625    46     5     1     1     1     0     0     0     0     679
1          36   292    33     7     2     0     0     0     0     0     370
2           3    37   176    45     8     0     0     0     0     0     269
3           2     4    51   171    32    13     1     0     0     0     274
4           1     0     7    61   146    49    15     1     1     0     281
5           0     1     1    12    51   180    40    10     1     0     296
6           1     1     2     2    21    67   213    50    10     0     367
7           0     0     0     1     5    17    43   221    31     4     322
8           0     0     0     0     1     1    10    33   140    10     195
9           0     0     0     0     0     0     2     5    19   102     128
Total     668   381   275   300   267   328   324   320   202   116   3,181
Table A8
AP Calculus AB: Question 5
OP FR5 (rows) by OSN FR5 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0       1,907   157    45    14    10     7     1     1     0     0   2,142
1          94   533    98    33    12    11     3     3     0     0     787
2          42   101   279    67    20     6     3     4     0     0     522
3           5    21    63   163    67    16     6     4     1     0     346
4          11    14    24    66   155    50    18     3     0     0     341
5           0     4    11    27    66   224    47     6     1     0     386
6           0     0     4     5    20    56   273    56     2     1     417
7           0     2     1     3     5     8    54   299    42     4     418
8           0     0     0     0     0     0     9    31    70    17     127
9           0     0     0     1     0     0     2     3    16   101     123
Total   2,059   832   525   379   355   378   416   410   132   123   5,609
Table A9
AP Calculus AB: Question 6
OP FR6 (rows) by OSN FR6 (columns). Cell entries are frequencies; the row percentages
reported in the original equal each frequency divided by its row total.

OP\OSN      0     1     2     3     4     5     6     7     8     9   Total
0       1,117    55     4     1     1     0     0     0     0     0   1,178
1          49   299    42     2     1     0     0     0     0     0     393
2          10    50   559    59     5     1     0     0     0     0     684
3           0    10    74   387    65     8     2     0     0     0     546
4           0     1    10    69   416    56     5     0     0     0     557
5           0     0     0    14    63   263    37     5     1     0     383
6           0     1     0     0    18    43   143    15     2     0     222
7           0     0     0     1     0     6    23    70     9     0     109
8           0     0     0     0     0     0     2    11    39     3      55
9           0     0     0     0     0     0     0     0     3    12      15
Total   1,176   416   689   533   569   377   212   101    54    15   4,142
Appendix B
Correlations Between Operational and OSN Scoring
Tables B1 through B9 show the correlations between free-response (FR) and multiple-
choice (MC) scores in operational (OP) and OSN scoring for the AP English Language and
Composition and the Calculus AB tests, by day of reading.
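The Z column in these tables is consistent with dividing each correlation difference by sqrt(2/(N − 3)), the large-sample standard error of a difference between two Fisher-transformed correlations. A minimal sketch of that computation (our reconstruction from the tabled values, not a formula stated in the report):

```python
import math

def z_for_r_difference(r_op, r_osn, n):
    """Z statistic for the difference between the operational and OSN
    FR-with-MC correlations for the same n responses. Uses
    SE = sqrt(2 / (n - 3)); this reconstruction reproduces the tabled
    Z values to rounding error."""
    se = math.sqrt(2.0 / (n - 3))
    return abs(r_osn - r_op) / se

# Second row of Table B1 (7/20/02, N = 280): difference 0.090, tabled Z = 1.059.
print(round(z_for_r_difference(0.513, 0.603, 280), 3))  # 1.059
```

None of the tabled Z values approaches the conventional 1.96 criterion, which is the basis for the no-difference conclusion in the text.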
Table B1
AP English Language and Composition: Question 1
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/19/02 99 0.458 0.521 0.063 0.439
7/20/02 280 0.513 0.603 0.090 1.059
7/21/02 155 0.478 0.412 –0.065 0.570
7/22/02 191 0.541 0.499 –0.042 0.411
7/23/02 363 0.505 0.444 –0.061 0.820
7/24/02 179 0.442 0.409 –0.033 0.309
7/25/02 289 0.487 0.511 0.023 0.280
7/26/02 297 0.455 0.498 0.043 0.523
7/27/02 427 0.424 0.412 –0.012 0.178
7/28/02 129 0.401 0.430 0.029 0.231
7/29/02 252 0.519 0.507 –0.012 0.132
7/30/02 47 0.173 0.449 0.276 1.294
7/31/02 210 0.402 0.569 0.167 1.702
8/1/02 267 0.511 0.477 –0.034 0.387
8/2/02 140 0.430 0.532 0.102 0.846
8/3/02 577 0.461 0.385 –0.075 1.277
8/4/02 625 0.385 0.401 0.015 0.271
8/5/02 324 0.479 0.451 –0.028 0.353
8/6/02 495 0.447 0.492 0.044 0.693
Table B2
AP English Language and Composition: Question 2
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/19/02 41 0.181 0.150 –0.032 0.139
7/21/02 106 0.321 0.445 0.124 0.892
7/22/02 121 0.338 0.406 0.069 0.527
7/23/02 103 0.473 0.322 –0.151 1.067
7/24/02 426 0.474 0.498 0.024 0.349
7/25/02 124 0.474 0.499 0.026 0.200
7/26/02 279 0.392 0.471 0.080 0.937
7/27/02 322 0.483 0.433 –0.050 0.637
7/28/02 150 0.390 0.416 0.025 0.215
7/29/02 91 0.478 0.481 0.003 0.020
7/31/02 274 0.452 0.411 –0.041 0.472
8/1/02 354 0.497 0.521 0.024 0.318
8/2/02 169 0.401 0.412 0.010 0.095
8/3/02 242 0.412 0.318 –0.094 1.028
8/4/02 208 0.417 0.383 –0.033 0.339
8/5/02 87 0.421 0.342 –0.079 0.511
8/6/02 108 0.517 0.357 –0.160 1.162
8/7/02 69 0.654 0.540 –0.114 0.654
8/8/02 198 0.367 0.298 –0.069 0.684
8/9/02 120 0.444 0.504 0.059 0.454
8/10/02 218 0.417 0.399 –0.018 0.186
8/11/02 51 0.480 0.379 –0.102 0.497
8/12/02 369 0.325 0.405 0.080 1.079
Table B3
AP English Language and Composition: Question 3
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/21/02 72 0.486 0.419 –0.066 0.390
7/22/02 54 0.483 0.503 0.020 0.101
7/23/02 53 0.414 0.438 0.025 0.123
7/24/02 61 0.458 0.493 0.036 0.193
7/25/02 162 0.505 0.502 –0.003 0.029
7/27/02 359 0.474 0.473 –0.001 0.017
7/28/02 222 0.440 0.457 0.017 0.175
7/29/02 138 0.384 0.505 0.121 0.993
7/30/02 217 0.417 0.320 –0.097 1.008
7/31/02 149 0.318 0.309 –0.009 0.075
8/1/02 239 0.393 0.393 –0.001 0.010
8/2/02 191 0.367 0.299 –0.068 0.658
8/3/02 183 0.386 0.371 –0.015 0.144
8/4/02 277 0.405 0.459 0.054 0.627
8/5/02 170 0.479 0.554 0.075 0.687
8/6/02 68 0.416 0.448 0.032 0.183
8/8/02 203 0.521 0.554 0.033 0.326
8/9/02 208 0.360 0.387 0.028 0.279
8/10/02 209 0.388 0.485 0.097 0.985
8/11/02 181 0.419 0.530 0.111 1.050
Table B4
AP Calculus AB: Question 1
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/22/02 69 0.681 0.698 0.017 0.096
7/23/02 405 0.665 0.659 –0.005 0.073
7/25/02 473 0.673 0.679 0.006 0.091
7/26/02 146 0.632 0.633 0.001 0.009
7/27/02 105 0.701 0.691 –0.010 0.071
7/28/02 249 0.683 0.675 –0.007 0.078
7/29/02 491 0.644 0.661 0.017 0.265
7/30/02 179 0.648 0.644 –0.004 0.036
7/31/02 195 0.618 0.610 –0.008 0.078
8/9/02 191 0.626 0.648 0.022 0.216
8/10/02 133 0.690 0.679 –0.011 0.089
8/11/02 66 0.706 0.720 0.014 0.077
8/12/02 225 0.620 0.649 0.029 0.309
Table B5
AP Calculus AB: Question 2
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/22/02 143 0.679 0.681 0.002 0.018
7/23/02 566 0.653 0.652 –0.001 0.009
7/24/02 506 0.616 0.624 0.009 0.138
7/25/02 862 0.635 0.637 0.001 0.031
7/26/02 653 0.651 0.652 0.001 0.027
7/27/02 205 0.633 0.620 –0.013 0.128
7/29/02 1,080 0.626 0.625 –0.001 0.031
7/30/02 210 0.661 0.665 0.004 0.040
7/31/02 734 0.630 0.621 –0.009 0.179
8/8/02 360 0.626 0.637 0.011 0.147
Table B6
AP Calculus AB: Question 3
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/23/02 71 0.682 0.682 0.001 0.004
7/24/02 243 0.551 0.569 0.018 0.197
7/25/02 391 0.635 0.637 0.002 0.031
7/26/02 256 0.625 0.648 0.022 0.252
7/27/02 499 0.598 0.601 0.003 0.040
7/29/02 242 0.648 0.647 –0.001 0.010
7/30/02 142 0.621 0.631 0.010 0.087
8/8/02 275 0.631 0.641 0.010 0.117
8/9/02 464 0.641 0.648 0.007 0.109
8/12/02 77 0.612 0.651 0.039 0.236
Table B7
AP Calculus AB: Question 4
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/25/02 53 0.821 0.818 –0.003 0.017
7/26/02 143 0.666 0.671 0.005 0.040
7/27/02 101 0.651 0.631 –0.019 0.133
7/30/02 55 0.729 0.754 0.026 0.131
7/31/02 196 0.679 0.658 –0.020 0.199
8/1/02 294 0.645 0.632 –0.013 0.160
8/2/02 382 0.696 0.694 –0.002 0.034
8/3/02 162 0.715 0.722 0.007 0.064
8/4/02 116 0.755 0.759 0.004 0.031
8/5/02 552 0.711 0.713 0.002 0.038
8/6/02 355 0.672 0.676 0.004 0.052
8/7/02 463 0.677 0.681 0.005 0.075
8/8/02 148 0.675 0.674 –0.001 0.008
8/10/02 148 0.751 0.763 0.012 0.106
Table B8
AP Calculus AB: Question 5
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/26/02 27 0.504 0.524 0.020 0.069
7/29/02 6 0.430 0.731 0.301 0.368
7/30/02 64 0.649 0.670 0.021 0.117
7/31/02 132 0.671 0.653 –0.018 0.147
8/1/02 308 0.599 0.597 –0.002 0.019
8/2/02 729 0.602 0.604 0.002 0.041
8/5/02 949 0.627 0.651 0.024 0.531
8/6/02 1,017 0.615 0.620 0.005 0.120
8/7/02 674 0.606 0.628 0.022 0.397
8/8/02 515 0.672 0.687 0.016 0.253
8/9/02 13 0.585 0.542 –0.043 0.096
8/10/02 471 0.590 0.546 –0.044 0.678
8/11/02 291 0.556 0.565 0.009 0.107
8/12/02 413 0.593 0.615 0.022 0.313
Table B9
AP Calculus AB: Question 6
Date of OSN reading   N read in OSN   Oper FR & MC   OSN FR & MC   Difference between correlations   Z
7/30/02 189 0.766 0.754 –0.012 0.113
7/31/02 562 0.710 0.713 0.003 0.053
8/1/02 462 0.734 0.728 –0.005 0.080
8/2/02 627 0.697 0.687 –0.010 0.181
8/3/02 607 0.722 0.729 0.007 0.125
8/4/02 116 0.692 0.708 0.016 0.120
8/5/02 180 0.698 0.708 0.010 0.098
8/6/02 704 0.694 0.688 –0.005 0.097
8/7/02 593 0.718 0.722 0.004 0.072
8/8/02 52 0.703 0.683 –0.020 0.099
8/9/02 50 0.836 0.821 –0.015 0.073