The First RWPT (lower form)

The first RWPT (Appendix B) was developed to address some of the shortcomings of the existing exam through increased task authenticity, an analytic scoring rubric for the writing subtest, and balanced portions of R&W tasks. I recruited nine IEP student volunteers to take the pilot test on Oct. 7, 2001. The following is a brief discussion of the test results, with a focus on statistical analysis and validity, which sheds light on the revision steps taken in developing the current RWPT.
Reliability.

Assuming the two halves of the test independently measure the test takers' skills, the Guttman split-half technique can be used to calculate the reliability coefficient (Bachman, 1990). To calculate internal consistency, I combined the reading items in Sections I and II. The first reason for doing so is that both sections use the same multiple-choice (MC) format; the second is that each section is relatively short. The resulting coefficient is 0.76, which indicates that 76% of the score variance reflects the skills being measured, with the remaining 24% attributable to measurement error.
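For reference, the following is a minimal sketch of the Guttman split-half computation, following the formula presented in Bachman (1990); the nine half-test score pairs are invented for illustration and are not the actual pilot data.

```python
# Guttman split-half coefficient:
# rho = 2 * (1 - (var(half A) + var(half B)) / var(total)).
# The half-test scores below are hypothetical, for illustration only.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def guttman_split_half(half_a, half_b):
    total = [a + b for a, b in zip(half_a, half_b)]
    return 2 * (1 - (variance(half_a) + variance(half_b)) / variance(total))

half_a = [5, 6, 4, 7, 3, 6, 5, 2, 6]  # hypothetical half-test totals
half_b = [4, 6, 5, 6, 3, 5, 5, 3, 6]
print(round(guttman_split_half(half_a, half_b), 2))
```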
The writing tasks were graded by an R&W IEP teacher (rater 1) and myself (rater 2). The correlation coefficient between the two raters' sets of scores can be calculated to arrive at an estimate of the reliability of their judgments (Brown, 2005). To determine inter-rater reliability, I used the correlation function in Microsoft Excel.
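The same estimate can be reproduced outside Excel: the sketch below computes the Pearson correlation coefficient directly, using the Section I Task 1 scores from Table 1 below, and returns the .57 reported there.

```python
# Pearson correlation between two raters' scores, equivalent to
# Excel's correlation function. Data: Section I, Task 1 (Table 1).

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

rater1 = [5.5, 6, 6, 6, 5.5, 5, 6, 5.5, 3]
rater2 = [5.5, 5.5, 5.5, 6, 5.5, 3.5, 4.5, 4, 4]
print(round(pearson(rater1, rater2), 2))  # 0.57
```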
Table 1
Inter-rater Reliability among the Writing Tasks (N=9)

                 W* Task 1 (Section I)    W Task 1 (Section II)    W Task 2 (Section II)
Student ID       Rater 1      Rater 2     Rater 1      Rater 2     Rater 1      Rater 2
1                5.5          5.5         5            5           5.5          5
2                6            5.5         5            5           5.5          5.5
3                6            5.5         5.5          5.5         5            5
4                6            6           6            5.5         5.5          5.5
5                5.5          5.5         6            5           6            5
6                5            3.5         4.5          4.5         5.5          5.5
7                6            4.5         5.5          5           6            5
8                5.5          4           4            3           N/A¹         N/A
9                3            4           4            3           4            4
                 IR* = .57                IR = .88                 IR = .97

Note. W*: writing; IR*: inter-rater reliability.
The inter-rater reliability for both Tasks 1 and 2 in Section II is much higher than that in Section I. I set up a meeting with rater 1 to find out what might account for our grading discrepancies, and also to seek feedback on his experience using the analytic rubric (Appendix C). We both realized that the grammar mistakes the test takers produced showed more variety and complexity than the rubric stipulated. We felt that the poor definition of grammar use, together with the short writing samples the task elicited (under 50 words on average), may have caused gaps in how writing errors were counted, which in turn contributed to the grading differences.
¹ As mentioned above, student 8 was not able to produce his writing due to his failure to comprehend the reading prompt.
Another salient factor in the discrepancy may be the length of the writing. Most respondents provided a longer writing sample for Task 2 than for Task 1. Rater 1 and I acknowledged that it was easier to assign a level to an individual skill area when more complexity was involved. This may be because simplistic writing (e.g., for Task 1) tends to contain fewer errors, which can leave the evaluators with the dilemma of basing their judgment on one or two minor errors.
Task 2 in Section II involves writing a 3-5 sentence summary of a news story. Despite its short length, the reliability coefficient was rather high (0.97). Most of the test takers had used parts of the original sentences from the article to construct their summaries. Their responses appeared much more homogenized and standardized than those for the previous two tasks, which may have led to fewer grading disputes and, in turn, higher inter-rater reliability. However, this casts serious doubt on the task's ability to elicit genuine linguistic performance. The task clearly needs to be revised or replaced.
Item analysis.
Item analysis was conducted to gain a clear picture of the performance of each individual MC item (Tables 2.1 & 2.2). Table 3 breaks the item facility (IF) values into three groups to shed some light on whether the reading subtest presents a range of difficulty levels. To ensure diversity in difficulty, a placement test should contain easy items (IF values 0.81-0.95) and difficult items (IF values 0.35-0.59) at both ends of the spectrum; most items should demonstrate an IF value in the range of 0.60-0.80 (Turner, 2011). Measured against this standard, the results suggest a need for more difficult reading items in order to boost the 38% in the middle range and the 24% at the lower end of the IF distribution. In addition, the item discrimination (ID) analysis provides a preliminary assessment of how well the MC items discriminate among test takers, which helped me decide which items to select and carry over into the revised version.
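To make the two indices concrete, the sketch below computes IF and ID for a single MC item. It assumes the common upper/lower-group definition of ID (proportion correct in the top-scoring third minus that in the bottom third, i.e., groups of three when N=9); the item responses and total scores are invented for illustration.

```python
# Item facility (IF) and item discrimination (ID) for one MC item.
# IF = proportion answering correctly; ID = IF of the top-scoring
# third minus IF of the bottom third (an assumed, common definition).

def item_stats(responses, totals, k=3):
    n = len(responses)
    if_value = sum(responses) / n
    order = sorted(range(n), key=lambda i: totals[i])  # rank by test total
    low = sum(responses[i] for i in order[:k]) / k     # bottom k examinees
    high = sum(responses[i] for i in order[-k:]) / k   # top k examinees
    return if_value, high - low

responses = [1, 1, 0, 1, 0, 1, 1, 0, 1]  # hypothetical 0/1 item scores
totals = [10, 9, 4, 8, 5, 7, 9, 3, 10]   # hypothetical total test scores
if_val, id_val = item_stats(responses, totals)
print(f"IF = {if_val:.2f}, ID = {id_val:.2f}")
```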
Table 2.1
Section I Item Facility and Item Discrimination (N=9)

Item No.    Number of correct responses    IF      ID
1           8                              .89     .33
2           9                              1.00    0
3           5                              .56     .67
4           9                              1.00    0
5           4                              .44     0
6           9                              1.00    0
7           7                              .78     0
8           2                              .22     .33
9           9                              1.00    0
10          8                              .89     .33
11          7                              .78     .67
Average                                    .78     .21
Table 2.2
Section II Item Facility and Item Discrimination (N=9)

Item No.    Number of correct responses    IF      ID
1           5                              .56     .33
2           6                              .67     .67
3           7                              .78     .33
4           8                              .89     0
5           7                              .78     .67
6           6                              .67     1
7           6                              .67     1
8           4                              .44     .33
9           8                              .89     0
10          6                              .67     .67
Average                                    .70     .50
Table 3
Distribution of Lower, Middle, and Upper Ranges of IFs (N=9)

IF value range    Lower (0.22²-0.59)    Middle (0.60-0.80)    Upper (0.81-1)
Percentage        24%                   38%                   38%

² The cut-off point for the lower range starts at 0.22, which is the lowest IF value from this pilot test.
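As a cross-check on Table 3, the short sketch below bins the 21 IF values from Tables 2.1 and 2.2 into the three ranges and reproduces the 24%/38%/38% distribution.

```python
# Bin the 21 IF values from Tables 2.1 and 2.2 into the three
# ranges used in Table 3 and report the percentage in each.

section1 = [.89, 1.00, .56, 1.00, .44, 1.00, .78, .22, 1.00, .89, .78]
section2 = [.56, .67, .78, .89, .78, .67, .67, .44, .89, .67]
ifs = section1 + section2

bins = {"Lower (0.22-0.59)": lambda v: v <= 0.59,
        "Middle (0.60-0.80)": lambda v: 0.60 <= v <= 0.80,
        "Upper (0.81-1)": lambda v: v >= 0.81}

for label, in_range in bins.items():
    share = sum(map(in_range, ifs)) / len(ifs)
    print(f"{label}: {share:.0%}")
```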