Closing the loop: Providing test developers with performance level descriptors so standard setters can do their job
Amanda A. Wolkowitz Alpine Testing Solutions, James C. Impara Psychometric Inquiries
Chad W. Buckendahl Psychometric Consultant
What do standard setters do & how do they do it?
Standard setters recommend cut scores
An early step in the process is defining what the “borderline” examinee is expected to be able to do in terms of the test content. Specifically, they examine the performance level descriptors (PLDs) and define the borderline examinee at each performance level.
Using modifications of the Angoff or Bookmark methods, they review test items and judge the difficulty of the item for examinees who are at the borderline of one or more performance categories.
Typically they estimate item difficulty over one or more rounds, sometimes with item statistics or other data provided after their first round of item difficulty estimation.
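The round-1 judgments described above can be sketched in a few lines. All numbers below are hypothetical: each panelist estimates, per item, the probability that a borderline examinee answers correctly; a panelist's cut score is the sum of those estimates, and the panel recommendation is typically the mean or median across panelists.

```python
# Minimal sketch of turning Angoff-style round-1 ratings into a recommended
# cut score. Data are hypothetical (3 panelists, a 5-item test).

from statistics import mean, median

# rows = panelists, columns = items; entries are estimated probabilities
# that a borderline examinee answers the item correctly
round1 = [
    [0.8, 0.6, 0.4, 0.7, 0.5],
    [0.9, 0.5, 0.3, 0.6, 0.6],
    [0.7, 0.7, 0.5, 0.8, 0.4],
]

# each panelist's cut score is the sum of that panelist's item estimates
panelist_cuts = [sum(row) for row in round1]
print([round(c, 1) for c in panelist_cuts])   # [3.0, 2.9, 3.1]
print(round(median(panelist_cuts), 1))        # 3.0
```

In practice a facilitator would repeat this after each round; the point here is only that each panelist's ratings collapse to a single cut score whose spread across panelists can then be examined.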
How are tests developed?
Item writers, typically content “experts,” draft items that are responsive to the test specifications (or test blueprint).
The test blueprint may include a description of the various performance levels, but most often it does not.
The test blueprint virtually never provides a description of the “borderline” examinee at each performance level.
The study design
This was not a designed study, but an ad hoc study. That is, we developed the research question and looked for data that would provide some answers. Thus, there are some limitations.
The first data collection was in 2009 and the second was in 2013. Both took place in the same southeastern state in the USA, and both related to the same assessment.
The study design - 2
2009
Performance level descriptors (PLDs) were defined initially. Borderline performance was described for each PLD.
Standard setting was done for alternative assessments (for students with severe cognitive disabilities) in:
English Language Arts (ELA), grades 4 – 8
Mathematics, grades 3 – 8
All tests had 15 items scored dichotomously (0 or 2 points for each item).
Four performance levels were defined, thus three cut scores.
There were separate panels for each content area.
The study design - 3
2013
Standard setting was the same as in 2009 except:
The PLDs developed in 2009 were examined and refined.
The original PLDs were known to test developers and drove the development process.
Scoring was modified from dichotomous to three-point scoring for each item – partial credit was permitted, so scores were 0, 1, or 2.
Slightly fewer panelists (17 – 20 for each grade span in 2009, 14 – 15 in 2013).
Study design - 4
A final difference between the two standard setting activities was the method used.
2009 used the Modified Angoff method as described by Impara & Plake (1997), often characterized as the Yes/No method.
2013 used the Extended Angoff method as described by Hambleton & Plake (1995) and Plake & Hambleton (2001).
The reason for this difference was the change from dichotomous scoring to partial-credit (3-point) scoring.
Both methods rely on item level judgments.
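The two judgment formats can be contrasted in a small sketch with hypothetical ratings. In the Yes/No method, each dichotomous item receives a 1 ("a borderline examinee would answer correctly") or a 0; in the Extended Angoff method, each partial-credit item receives an expected score of 0, 1, or 2. Either way, a panelist's cut score is the sum of the item judgments.

```python
# Contrast of the two item-level judgment formats, with hypothetical data.
from statistics import median

# Yes/No (Modified Angoff): 1 = borderline examinee would answer correctly.
# With the 0-or-2 dichotomous scoring used here, each "yes" is worth 2 points.
yes_no = [
    [1, 1, 0, 1, 0],  # panelist A
    [1, 0, 0, 1, 1],  # panelist B
    [1, 1, 1, 1, 0],  # panelist C
]
yes_no_cuts = [2 * sum(row) for row in yes_no]   # 2 raw points per "yes"

# Extended Angoff: expected score in {0, 1, 2} on each partial-credit item.
extended = [
    [2, 1, 0, 1, 2],
    [2, 1, 1, 1, 1],
    [1, 2, 1, 0, 1],
]
extended_cuts = [sum(row) for row in extended]

print(yes_no_cuts, median(yes_no_cuts))       # [6, 6, 8] 6
print(extended_cuts, median(extended_cuts))   # [6, 6, 5] 6
```

The Extended Angoff judgment is harder for panelists because each item requires choosing among three expected scores rather than making a single yes/no call, which is relevant to the correlation results discussed later.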
PLDs
There were four performance levels: Achievement Level 1 (limited command), Achievement Level 2 (partial command), Achievement Level 3 (solid command), and Achievement Level 4 (superior command).
The PLDs further defined each level: Level 1 students would need academic support, Level 2 students would likely need academic support, Level 3 students would be prepared, and Level 4 students would be well prepared to be successful in further studies in that content area.
PLDs also contained specific abilities that students at a given level could demonstrate.
Study Expectations
The principal research question was:
Will the consistency of ratings at the end of round 1 of the standard setting process increase? That is: Will developing items with known PLDs help panelists be more consistent in their initial ratings and more congruent with the item p-values prior to any feedback?
Why?
Why is this an important question?
If panelists are more consistent in their round 1 ratings, then they may reach closure faster in subsequent rounds, perhaps reducing the number of rounds (sometimes three rounds are used) and thus making the process more efficient.
Panelists often become frustrated if there are no or too few items at a performance level, thus causing them to question the validity of the process.
How?
How will we know if there is greater consistency among panelists? Distribution of students across levels will be consistent with
expectations – most students will be classified at Levels 2 and 3. There should be greater congruence between actual item difficulty
and the panelists’ estimate of item difficulty The correlation between actual item difficulty and panelists’ item
difficulty estimate will be higher The range of panelists’ cut scores will be lower. Percentage of panelists who were within one point of the
recommended cut score at the end of round 1 will be higher. The standard deviation of the panelists’ cut scores at each level will
be lower.
Result – 1
Distribution of students across levels
Distribution of students - ELA
In virtually every grade in the 2009 standard setting, many students were assigned to Achievement Level 4, the highest level.
In 2013, the distribution was much more appropriate, with most students assigned to levels 2 and 3.
Distribution of students - Math
In 2009, several of the grades showed appropriate distributions, but many grades still had many students assigned to Levels 1 and 4.
In 2013, relatively few students were assigned to levels 1 and 4 and the preponderance of students were classified as level 2 or 3.
Congruence of actual and panelist’s item difficulty
It was expected that the actual item difficulty value for an item (i.e., the percentage of students in the population who answer the item correctly) would be greater than or equal to the corresponding Level 2 cut score rating and less than the corresponding Level 4 cut score rating.
Hence, the actual p-value would fall between the Level 2 cut score and the Level 4 cut score.
The exception should be a relatively small number of items with difficulties outside this range: items that virtually all examinees answer correctly (those targeted at Level 1) and items that virtually no one answers correctly (those targeted at Level 4).
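The congruence check just described can be sketched in a few lines. The item names, p-values, and panel ratings below are all hypothetical.

```python
# Flag items whose actual p-value falls outside the band defined by the
# panel's average borderline ratings for Level 2 and Level 4.
# All numbers are hypothetical.

items = [
    # (item id, actual p-value, mean Level 2 rating, mean Level 4 rating)
    ("item01", 0.72, 0.55, 0.90),   # within the expected band
    ("item02", 0.95, 0.50, 0.85),   # too easy: above the Level 4 rating
    ("item03", 0.30, 0.45, 0.80),   # too hard: below the Level 2 rating
]

def congruent(p, lvl2, lvl4):
    """True when the actual p-value lies in [Level 2 rating, Level 4 rating)."""
    return lvl2 <= p < lvl4

flags = {name: congruent(p, l2, l4) for name, p, l2, l4 in items}
print(flags)  # {'item01': True, 'item02': False, 'item03': False}
```

A small number of flagged items is expected (the very easy and very hard items noted above); a large number would signal a mismatch between panelists' judgments and actual student performance.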
Congruence - summary
Correlation Analysis
A correlation analysis compared the relationship between actual item difficulty values and the average item rating at each achievement level.
Expectation: a direct relationship was expected between an item’s difficulty value and its average rating. As the item difficulty value increases (i.e., the item becomes easier), the greater the chance that a borderline student will respond to the item correctly.
This trend was expected for all three cut scores.
Correlation Analysis
Results – the reverse of expectations.
2009 item ratings generally had moderate to strong positive correlations with the corresponding item difficulty values.
2013 ratings tended to have only moderate correlations at best.
The 2009 ratings correlated higher with the p-values than did the 2013 ratings.
Correlation Analysis
Why were 2009 correlations higher?
One possible explanation: the 2009 panel only had to make Yes/No judgments, whereas the 2013 panel had to judge whether a student would score 0, 1, or 2 points on the item.
Another possible explanation: the items on the 2013 exams may have had more similar difficulty values around the intended PLDs than the 2009 items.
Also, it was learned that in 2013 few students were assigned a score of 0 on an item, resulting in a restriction of the range of p-values.
Internal Consistency – range
Internal consistency was evaluated several ways.
First, by comparing the range of recommended cut scores following round 1 for each level and panel.
A smaller range would indicate that the given year’s panel was more consistent in its ratings than the other year’s panel.
Internal Consistency – Range
Range – ELA
Internal Consistency - Range
Range – Math
Internal Consistency – Proximity to the median
Internal consistency was evaluated several ways:
Second, by calculating the percentage of panelists whose ratings were within one point (plus or minus) of their panel’s recommended Round 1 median cut score (all median cut scores ended up as possible scores, i.e., no median cut score ended in “.5”).
Thus, if the panelists’ cut scores were all relatively close together, most would be within one point of the median. For example, the median Level 2 cut score recommendation for the “Math – 4” exam was 6 out of 30 points for 2009 and 7 out of 30 points for 2013.
Internal Consistency – Proximity to the median
ELA
Internal Consistency – Proximity to the median
Math
Internal Consistency – Standard Deviation
Internal consistency was evaluated several ways:
Third, by comparing the standard deviations of the panelists’ ratings across years.
If the panelists are more consistent, then the standard deviations will be smaller.
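The three internal-consistency measures used in this study (range, percentage of panelists within one point of the median, and standard deviation) can all be computed directly from a panel's round-1 cut score recommendations. The two panels below are hypothetical.

```python
# Compute the study's three internal-consistency measures for a panel's
# round-1 cut score recommendations. Both panels are hypothetical.

from statistics import median, pstdev

def consistency(cuts):
    """Range, % of panelists within one point of the median, and SD."""
    med = median(cuts)
    within_one = sum(1 for c in cuts if abs(c - med) <= 1) / len(cuts)
    return {
        "range": max(cuts) - min(cuts),
        "pct_within_one_of_median": round(100 * within_one, 1),
        "sd": round(pstdev(cuts), 2),
    }

panel_2009 = [4, 6, 9, 5, 11, 7, 6]   # more spread out
panel_2013 = [6, 7, 7, 8, 6, 7, 7]    # tighter around the median

print(consistency(panel_2009))
print(consistency(panel_2013))
```

Under these made-up numbers, the tighter panel shows a smaller range, a higher percentage within one point of the median, and a smaller standard deviation, i.e., it scores better on all three measures at once.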
Internal Consistency – Standard Deviation
ELA
Internal Consistency – Standard Deviation
Math
Recall this slide
How will we know if there is greater consistency among panelists?
The distribution of students across levels will be consistent with expectations – most students will be classified at Levels 2 and 3.
There will be greater congruence between actual item difficulty and the panelists’ estimates of item difficulty.
The correlation between actual item difficulty and panelists’ item difficulty estimates will be higher.
The range of panelists’ cut scores will be lower.
The percentage of panelists who are within one point of the recommended cut score at the end of round 1 will be higher.
The standard deviation of the panelists’ cut scores at each level will be lower.
How did we do in terms of distribution of students?
Expected result: The distribution of students across levels will be consistent with expectations – most students will be classified at Levels 2 and 3.
Actual result: In both ELA and Math, the results were as expected in virtually every grade and performance level.
Thus, positive results.
How did we do in terms of congruence of item difficulty?
Expected result: There will be greater congruence between actual item difficulty and the panelists’ estimates of item difficulty.
Actual result: There was greater congruence between actual item difficulties and panelists’ estimates of item difficulty at all levels and grades.
However, there were few cases in which the actual p-values were outside the cut score boundaries.
Thus, somewhat positive results, but somewhat problematic.
How did we do in terms of correlation of actual and estimated item difficulty?
Expected result: The correlation between actual item difficulty and panelists’ item difficulty estimates will be higher in 2013.
Actual result: The 2009 ratings correlated higher with the p-values than did the 2013 ratings.
Thus, negative results.
How did we do in terms of the ranges of panelists’ cut scores?
Expected result: The range of panelists’ cut scores will be lower in 2013.
Actual result: In the majority of grades and levels, the range of cut scores was lower in 2013, particularly at Levels 3 and 4.
Thus, mostly positive results.
How did we do in terms of the proximity of panelists’ cut scores to the median?
Expected result: The percentage of panelists who were within one point of the recommended cut score at the end of round 1 will be higher in 2013.
Actual result: In the majority of comparisons, the percentage of panelists’ ratings within one point of the median was higher in 2013.
Thus, mostly positive.
How did we do in terms of the standard deviations of panelists’ cut scores?
Expected result: The standard deviation of the panelists’ cut scores at each level will be lower in 2013.
Actual result: In the majority of comparisons, the 2013 panels had lower standard deviations than the 2009 panels.
Thus, mostly positive.
Overall & Conclusion
The results overall supported providing test developers with the PLDs.
More specifically designed research is needed; this study had many limitations.
If future studies are supportive of providing the test developers with the PLDs and they are instructed to target item development to these PLDs, it could result in more efficiency in the standard setting process and in greater levels of satisfaction among panelists.
Questions?
Thank you for your attention.
Are there any questions?