The First RWPT (lower form)

The first RWPT (Appendix B) was developed to address some of the shortcomings of the existing exam through increased task authenticity, an analytic scoring rubric for the writing subtest, and balanced portions of R&W tasks. I recruited nine IEP student volunteers to take the pilot test on Oct. 7, 2001. The following is a brief discussion of the test results, with a focus on statistical analysis and validity, which sheds light on the revision steps taken in developing the current RWPT.
Reliability.

Assuming the two halves of the test independently measure the test takers' skills, the Guttman split-half technique can be used to calculate the reliability coefficient (Bachman, 1990). To calculate internal consistency, I combined the reading items in Sections I and II. The first reason for doing so is that both sections use the same multiple-choice (MC) format; the second is that each section is relatively short. The resulting coefficient is 0.76, which indicates that 76% of the score variance reflects the skills being measured, with the remaining 24% attributable to measurement error.
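For reference, the following is a minimal sketch of the Guttman split-half computation, following the formula presented in Bachman (1990); the nine half-test score pairs are invented for illustration and are not the actual pilot data.

```python
# Guttman split-half coefficient:
# rho = 2 * (1 - (var(half A) + var(half B)) / var(total)).
# The half-test scores below are hypothetical, for illustration only.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def guttman_split_half(half_a, half_b):
    total = [a + b for a, b in zip(half_a, half_b)]
    return 2 * (1 - (variance(half_a) + variance(half_b)) / variance(total))

half_a = [5, 6, 4, 7, 3, 6, 5, 2, 6]  # hypothetical half-test totals
half_b = [4, 6, 5, 6, 3, 5, 5, 3, 6]
print(round(guttman_split_half(half_a, half_b), 2))
```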
The writing tasks were graded by an R&W IEP teacher (rater 1) and myself (rater 2). The correlation coefficient between the two raters' sets of scores can be calculated to arrive at an estimate of the reliability of their judgments (Brown, 2005). To determine inter-rater reliability, I used the correlation function in Microsoft Excel.
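The same estimate can be reproduced outside Excel: the sketch below computes the Pearson correlation coefficient directly, using the Section I Task 1 scores from Table 1 below, and returns the .57 reported there.

```python
# Pearson correlation between two raters' scores, equivalent to
# Excel's correlation function. Data: Section I, Task 1 (Table 1).

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rater1 = [5.5, 6, 6, 6, 5.5, 5, 6, 5.5, 3]
rater2 = [5.5, 5.5, 5.5, 6, 5.5, 3.5, 4.5, 4, 4]
print(round(pearson(rater1, rater2), 2))  # 0.57
```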
Table 1
Inter-rater Reliability among the Writing Tasks (N=9)

                 W* Task 1 (Section I)    W Task 1 (Section II)    W Task 2 (Section II)
Student ID       Rater 1      Rater 2     Rater 1      Rater 2     Rater 1      Rater 2
1                5.5          5.5         5            5           5.5          5
2                6            5.5         5            5           5.5          5.5
3                6            5.5         5.5          5.5         5            5
4                6            6           6            5.5         5.5          5.5
5                5.5          5.5         6            5           6            5
6                5            3.5         4.5          4.5         5.5          5.5
7                6            4.5         5.5          5           6            5
8                5.5          4           4            3           N/A¹         N/A
9                3            4           4            3           4            4
                 IR* = .57                IR = .88                 IR = .97

Note. W*: writing; IR*: inter-rater reliability.
The inter-rater reliability for both Tasks 1 and 2 in Section II is much higher than that in Section I. I set up a meeting with rater 1 to find out what might account for our grading discrepancies, and also to seek feedback on his experience using the analytic rubric (Appendix C). We both realized that the grammar mistakes the test takers produced showed more variety and complexity than the rubric stipulated. We felt that the poor definition of grammar use, together with the short writing samples the task elicited (under 50 words on average), may have caused gaps in how writing errors were counted, which in turn contributed to the grading differences.
¹ As mentioned above, student 8 was not able to produce his writing due to his failure to comprehend the reading prompt.
Another salient factor in the discrepancy may be the length of the writing. Most respondents provided a longer writing sample for Task 2 than for Task 1. Rater 1 and I acknowledged that it was easier to assign a level to an individual skill area when more complexity was involved. This may be because simplistic writing (e.g., for Task 1) tends to contain fewer errors, which can leave the evaluators with the dilemma of basing their judgment on one or two minor errors.
Task 2 in Section II involves writing a 3-5 sentence summary of a news story. Despite its short length, the reliability coefficient was rather high (0.97). Most of the test takers had used parts of the original sentences from the article to construct their summaries. Their responses appeared much more homogenized and standardized than those for the previous two tasks, which may have led to fewer grading disputes and, in turn, higher inter-rater reliability. However, this casts serious doubt on the task's ability to elicit genuine linguistic performance. The task clearly needs to be revised or replaced.
Item analysis.
Item analysis was conducted to gain a clear picture of the performance of each individual MC item (Tables 2.1 & 2.2). Table 3 breaks the item facility (IF) values into three groups to shed some light on whether the reading subtest presents a range of difficulty levels. To ensure diversity in difficulty, a placement test should contain easy items (IF values 0.81-0.95) and difficult items (IF values 0.35-0.59) at both ends of the spectrum; most items should demonstrate an IF value in the range of 0.60-0.80 (Turner, 2011). Measured against this standard, the results suggest a need for more difficult reading items in order to boost the 38% in the middle range and the 24% at the lower end of the IF distribution. In addition, the item discrimination (ID) analysis provides a preliminary assessment of how well the MC items discriminate among test takers, which helped me decide which items to select and carry over into the revised version.
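To make the two indices concrete, the sketch below computes IF and ID for a single MC item. It assumes the common upper/lower-group definition of ID (proportion correct in the top-scoring third minus that in the bottom third, i.e., groups of three when N=9); the item responses and total scores are invented for illustration.

```python
# Item facility (IF) and item discrimination (ID) for one MC item.
# IF = proportion answering correctly; ID = IF of the top-scoring
# third minus IF of the bottom third (an assumed, common definition).

def item_stats(responses, totals, k=3):
    n = len(responses)
    if_value = sum(responses) / n
    order = sorted(range(n), key=lambda i: totals[i])  # rank by test total
    low = sum(responses[i] for i in order[:k]) / k     # bottom k examinees
    high = sum(responses[i] for i in order[-k:]) / k   # top k examinees
    return if_value, high - low

responses = [1, 1, 0, 1, 0, 1, 1, 0, 1]  # hypothetical 0/1 item scores
totals = [10, 9, 4, 8, 5, 7, 9, 3, 10]   # hypothetical total test scores
if_val, id_val = item_stats(responses, totals)
print(f"IF = {if_val:.2f}, ID = {id_val:.2f}")
```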
Table 2.1
Section I Item Facility and Item Discrimination (N=9)

Item No.    Number of correct responses    IF      ID
1           8                              .89     .33
2           9                              1.00    0
3           5                              .56     .67
4           9                              1.00    0
5           4                              .44     0
6           9                              1.00    0
7           7                              .78     0
8           2                              .22     .33
9           9                              1.00    0
10          8                              .89     .33
11          7                              .78     .67
Average                                    .78     .21
Table 2.2
Section II Item Facility and Item Discrimination (N=9)

Item No.    Number of correct responses    IF      ID
1           5                              .56     .33
2           6                              .67     .67
3           7                              .78     .33
4           8                              .89     0
5           7                              .78     .67
6           6                              .67     1
7           6                              .67     1
8           4                              .44     .33
9           8                              .89     0
10          6                              .67     .67
Average                                    .70     .50
Table 3
Distribution of Lower, Middle, and Upper Ranges of IFs (N=9)

IF value range    Lower (0.22²-0.59)    Middle (0.60-0.80)    Upper (0.81-1)
Percentage        24%                   38%                   38%

² The cut-off point for the lower range starts at 0.22, which is the lowest IF value from this pilot test.
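As a cross-check on Table 3, the short sketch below bins the 21 IF values from Tables 2.1 and 2.2 into the three ranges and reproduces the 24%/38%/38% distribution.

```python
# Bin the 21 IF values from Tables 2.1 and 2.2 into the three
# ranges used in Table 3 and report the percentage in each.

section1 = [.89, 1.00, .56, 1.00, .44, 1.00, .78, .22, 1.00, .89, .78]
section2 = [.56, .67, .78, .89, .78, .67, .67, .44, .89, .67]
ifs = section1 + section2

bins = {"Lower (0.22-0.59)": lambda v: v <= 0.59,
        "Middle (0.60-0.80)": lambda v: 0.60 <= v <= 0.80,
        "Upper (0.81-1)": lambda v: v >= 0.81}

for label, in_range in bins.items():
    share = sum(map(in_range, ifs)) / len(ifs)
    print(f"{label}: {share:.0%}")
```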