The First RWPT

The First RWPT (lower form)

The first RWPT (Appendix B) was developed to address some of the shortcomings of the existing exam via increased task authenticity, an analytic scoring rubric for the writing subtest, and balanced portions of R&W tasks. I recruited nine IEP student volunteers to take the pilot test on Oct. 7, 2001. The following is a brief discussion of the test results with a focus on statistical analysis and validity, which sheds light on the revision steps taken for developing the current RWPT.

Reliability. Assuming the two halves of the test independently measure the test takers' skills, the Guttman split-half technique can be used to calculate the reliability coefficient (Bachman, 1990). To calculate the internal consistency, I combined the reading items in Sections I and II. The first reason for this method is that both sections are in the same multiple-choice (MC) format; the second is that each section is relatively short. The resulting coefficient is 0.76, which indicates that 76% of the observed score variance reflects the skills the test measures, with the remaining 24% attributable to measurement error.
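For illustration, the split-half computation can be sketched as follows. This is a minimal sketch, assuming an odd/even split of the combined reading items and illustrative 0/1 response data rather than the actual pilot scores.

    import numpy as np

    def guttman_split_half(item_scores):
        """Guttman split-half reliability for a 0/1 item-score matrix.

        Items are split into odd- and even-numbered halves; the coefficient is
        2 * (1 - (var(half_A) + var(half_B)) / var(total)).
        """
        r = np.asarray(item_scores, dtype=float)
        half_a = r[:, 0::2].sum(axis=1)   # odd-numbered items
        half_b = r[:, 1::2].sum(axis=1)   # even-numbered items
        total = half_a + half_b
        return 2.0 * (1.0 - (half_a.var(ddof=1) + half_b.var(ddof=1)) / total.var(ddof=1))

    # Illustrative 0/1 scores for 9 examinees on 21 reading MC items
    # (not the actual pilot data): stronger examinees answer more items correctly
    ability = np.linspace(0.3, 0.9, 9)[:, None]   # per-examinee probability of a correct answer
    rng = np.random.default_rng(0)
    demo = (rng.random((9, 21)) < ability).astype(int)
    print(round(guttman_split_half(demo), 2))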

The writing tasks were graded by an R&W IEP teacher (rater 1) and by me (rater 2). The correlation coefficient between the two sets of scores can be calculated to arrive at an estimate of the reliability of the judgments made by the two raters (Brown, 2005). To determine inter-rater reliability, I used the correlation function in Microsoft Excel.

Table 1
Inter-rater Reliability among the Writing Tasks (N=9)

Student ID    W* Task 1 (Section I)    W Task 1 (Section II)    W Task 2 (Section II)
              Rater 1    Rater 2       Rater 1    Rater 2       Rater 1    Rater 2
1             5.5        5.5           5          5             5.5        5
2             6          5.5           5          5             5.5        5.5
3             6          5.5           5.5        5.5           5          5
4             6          6             6          5.5           5.5        5.5
5             5.5        5.5           6          5             6          5
6             5          3.5           4.5        4.5           5.5        5.5
7             6          4.5           5.5        5             6          5
8             5.5        4             4          3             N/A¹       N/A
9             3          4             4          3             4          4
              IR* = .57                IR = .88                 IR = .97

Note. W* = writing; IR* = inter-rater reliability.
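As a cross-check on the Excel output, the same statistic can be reproduced as a Pearson correlation. The sketch below uses the Section II Task 1 scores from Table 1; for Task 2, student 8's missing scores would be excluded pairwise.

    import numpy as np

    # Section II Writing Task 1 scores from Table 1 (students 1-9)
    rater1 = np.array([5, 5, 5.5, 6, 6, 4.5, 5.5, 4, 4])
    rater2 = np.array([5, 5, 5.5, 5.5, 5, 4.5, 5, 3, 3])

    # Pearson correlation between the two raters' scores
    ir = np.corrcoef(rater1, rater2)[0, 1]
    print(round(ir, 2))   # ~.88, matching the IR value reported in Table 1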

The inter-rater reliability for both Tasks 1 and 2 in Section II is much higher than that for Section I. I set up a meeting with rater 1 to find out what might account for our grading discrepancies, and to seek feedback on his experience using the analytic rubric (Appendix C). We both realized that the grammar mistakes the examinees produced showed more variety and complexity than the rubric stipulated. We felt that the vague definition of grammar use, together with the short writing samples the task elicited (under 50 words on average), may have caused gaps in how writing errors were counted, which in turn contributed to the grading differences.

¹ As mentioned above, student 8 was not able to produce his writing due to his failure to comprehend the reading prompt.

Another salient factor in the discrepancy may be the length of writing. Most respondents provided a longer writing sample for Task 2 than for Task 1. Rater 1 and I acknowledged that it was easier to assign a level to an individual skill area when more complexity was involved. This ease may stem from the fact that simpler writing samples (e.g., for Task 1) tend to contain fewer errors, which can leave evaluators with the dilemma of basing their judgment on one or two minor errors.

Task 2 in Section II involves writing a 3-5 sentence summary of a news story. Despite its short length, the reliability coefficient was rather high (0.97). Most of the test takers had used parts of the original sentences from the article to construct their summaries. Their responses appeared much more homogenized and standardized than those for the previous two tasks, which may have led to fewer grading disputes and, in turn, higher inter-rater reliability. However, it casts serious doubt on the task's ability to elicit real linguistic performance. This task apparently needs to be revised or replaced.

Item analysis. Item analysis was conducted to gain a clear picture of the performance of each individual MC item (Tables 2.1 and 2.2). Table 3 breaks the item facility (IF) values into three groups to shed some light on whether the reading subtest presents a range of difficulty levels. To ensure diversity in difficulty, a placement test should contain both easy items (IF values 0.81-0.95) and difficult items (IF values 0.35-0.59); most items should demonstrate an IF value in the range of 0.60-0.80 (Turner, 2011). Aligning with this standard appears to require more difficult reading items in order to raise the 38% of items in the middle range and the 24% at the lower end of the IF distribution. In addition, the item discrimination (ID) analysis provides a preliminary assessment of the discriminating ability of the MC items, which helps me decide which items to select and carry over to the revised version.
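The sketch below shows one way IF and ID can be computed from a scored response matrix, assuming 1/0 item scoring and an upper-third versus lower-third contrast for ID; the matrix shown is illustrative, not the pilot data.

    import numpy as np

    def item_analysis(responses):
        """Item facility (IF) and item discrimination (ID) for a 0/1 response matrix.

        responses: array of shape (n_examinees, n_items), 1 = correct, 0 = incorrect.
        IF = proportion of examinees answering the item correctly.
        ID = proportion correct in the upper-scoring third minus that in the lower third.
        """
        r = np.asarray(responses, dtype=float)
        n = r.shape[0]
        if_values = r.mean(axis=0)

        totals = r.sum(axis=1)
        order = np.argsort(totals)      # examinees sorted from lowest to highest total score
        k = max(1, n // 3)              # size of the upper/lower groups (3 when N = 9)
        lower, upper = r[order[:k]], r[order[-k:]]
        id_values = upper.mean(axis=0) - lower.mean(axis=0)
        return if_values, id_values

    # Hypothetical 9 x 3 response matrix (the actual pilot responses are not reproduced here)
    demo = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 1], [1, 0, 0],
            [1, 1, 1], [0, 1, 1], [1, 0, 0], [1, 1, 1]]
    IF, ID = item_analysis(demo)
    print(np.round(IF, 2), np.round(ID, 2))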

Table 2.1
Section I Item Facility and Item Discrimination (N=9)

Item No.    Number of correct responses    IF      ID
1           8                              .89     .33
2           9                              1.00    0
3           5                              .56     .67
4           9                              1.00    0
5           4                              .44     0
6           9                              1.00    0
7           7                              .78     0
8           2                              .22     .33
9           9                              1.00    0
10          8                              .89     .33
11          7                              .78     .67
Average                                    .78     .21

Table 2.2
Section II Item Facility and Item Discrimination (N=9)

Item No.    Number of correct responses    IF      ID
1           5                              .56     .33
2           6                              .67     .67
3           7                              .78     .33
4           8                              .89     0
5           7                              .78     .67
6           6                              .67     1
7           6                              .67     1
8           4                              .44     .33
9           8                              .89     0
10          6                              .67     .67
Average                                    .70     .50

Table 3
Distribution of Lower, Middle and Upper Range of IFs (N=9)

IF value range    Lower (0.22²-0.59)    Middle (0.60-0.80)    Upper (0.81-1)
Percentage        24%                   38%                   38%

² The cut-off point for the lower range starts at 0.22, which is the lowest IF value from this pilot test.
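As a quick check, the Table 3 percentages can be reproduced by binning the IF values reported in Tables 2.1 and 2.2:

    import numpy as np

    # IF values from Tables 2.1 and 2.2 (Sections I and II, 21 items total)
    all_if = np.array([.89, 1.00, .56, 1.00, .44, 1.00, .78, .22, 1.00, .89, .78,   # Section I
                       .56, .67, .78, .89, .78, .67, .67, .44, .89, .67])           # Section II

    # Share of items falling in each IF range
    for label, lo, hi in [("Lower (0.22-0.59)", 0.22, 0.59),
                          ("Middle (0.60-0.80)", 0.60, 0.80),
                          ("Upper (0.81-1)", 0.81, 1.00)]:
        share = np.mean((all_if >= lo) & (all_if <= hi))
        print(f"{label}: {share:.0%}")   # 24%, 38%, 38%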