
© 2015 Learner-Centered Initiatives, Ltd. All rights reserved. May not be modified, reproduced, or distributed without permission.

Purpose: This document was created by members of the Learner-Centered Initiatives, Ltd. (LCI, Ltd.) consulting staff to serve as a guidance document for our clients, colleagues, and other educational leaders across New York State who use Performance-Based Assessments (PBAs) to generate scores.

Gallavan, N. P. (2009). Developing Performance-Based Assessments, Grades K-5. Corwin Press.

Hambleton, R. K., & Murphy, E. (1992). A psychometric perspective on authentic measurement. Applied Measurement in Education, 5, 1-16.

Johnson, B. (1996). The Performance Assessment Handbook. Eye on Education.

Martin-Kniep, G., & Picone-Zocchia, J. (2009). Changing the Way You Teach, Improving the Way Students Learn. ASCD.

Nitko, A., & Brookhart, S. (2012). Educational Assessment of Students (6th ed.). Pearson.

Popham, W. J. (2011). Portfolio Assessment and Performance Testing. Routledge.

Rudner, L. M., & Boston, C. (1994). Performance assessment. The ERIC Review, 3(1), 2-12.

Definitions and Relevant Research

Nitko and Brookhart (2012) define performance-based tasks and assessments as “an assessment that requires students to demonstrate achievement by producing an extended written or spoken answer, by engaging in group or individual activities, or by creating a specific product” (p. 246). PBAs generally take the form of one of three designs:

1. Structured, on-demand demonstration tasks (e.g., presentations, group performances, or problem solving)
2. Structured, on-demand product tasks (e.g., a written response or musical composition)
3. Curriculum-embedded tasks that reflect the creation of a product or the demonstration of a skill or ability

Regardless of design, all PBAs include two components:

1. The performance task itself
2. A clear and explicit rubric for scoring

PBAs have several advantages over other assessments (Hambleton & Murphy, 1992; Rudner & Boston, 1994; Shepard, 1991; Wiggins, 1990):

- They assess a student’s ability “to do” or “to apply” content knowledge
- They provide a tighter connection to instructional activities
- They broaden the approach to student assessment

According to Miller and Seraphine (1993), some known challenges presented by PBAs include the following:

- They are difficult to create and design
- They are more difficult and time-consuming to score than traditional tests
- They limit the number of learning targets that can be assessed

While some PBAs may emphasize authenticity (tasks that enable students to engage in problems for audiences that can benefit from their work), this is not a requirement of all performance tasks. However, authentic assessments can “be an integral component of a rich and balanced and standards-based educational program designed to maximize students’ acquisition and use of basic knowledge and skills; and enable them to use and apply their learning in contexts and situations that relate what they learn in school with what they need to succeed in the world we live in. Now, more than ever before, and in the face of a world in which all kinds of knowledge are accessible digitally, it is critical that schools re-establish the value of school learning by helping students find purpose and meaning in what they learn” (Martin-Kniep & Picone-Zocchia, 2009).


Scoring PBAs Using a Quality Rubric

When it comes to scoring PBAs, the nature of the task, the reason for the score, and how the rubric will be used with students all influence the score-generating process. Because performance-based assessments require students to generate a product or demonstrate their learning in some way, evaluation focuses on the quality of their work rather than the quantity. This focus on quality requires the use of a quality rubric: a scoring mechanism that describes the degree to which the student has met the expectations for the task. Typically, these rubrics use four levels to communicate those expectations and what it looks like to meet, or not meet, them.

These four levels represent four positions on an ordinal scale, similar to the scale used to label the order in which racers finish a race (i.e., 1st, 2nd, 3rd). To continue the example, if a runner wanted to summarize her success in recent races, she would count how many times she came in first, second, third, and so on, and focus on the most frequently occurring place (the mode). She would not average (find the mean of) 1st place, 1st place, 2nd place, 2nd place. She might, however, average her finishing times; she can do this because finishing time is reported on an interval scale. Although a quality rubric may be used to generate a number (e.g., 3, meaning the student’s work demonstrated the quality described in the third column of the rubric), it is an ordinal scale and must be treated as such.

Rubric Overview

Standards assessed:
K.CC.4: Understand the relationship between numbers and quantities; connect counting to cardinality.
K.CC.3: Write numbers 0-20. Represent a number of objects with a written numeral 0-20.

Level I (Significantly Below Mastery): The student draws random shapes of an indeterminate quantity. Without significant support, their page would not make sense or would be blank. Even when the task is repeated or restated, they are unable to make connections between the objects in the bag and the number.

Level II (Approaching Mastery): The student is able to count out the manipulatives but needs assistance writing the correct number or writes an incorrect number. Once the task is repeated to the student or they are redirected, he or she is able to correctly draw the requested number of objects.

Level III (Mastery): The student independently and accurately counts out the number of manipulatives and legibly writes the correct number on the line. The student independently and accurately draws the number of grapes in both jars by counting up one at a time or by drawing a large quantity of objects and then erasing or crossing off extras.

Level IV (Exceeds Mastery): The student writes the number of objects on the line without organizing the manipulatives. He or she uses mental math, rather than counting out each object individually, to determine the count. The student draws the correct number of objects in the jars, clearly arranging the shapes in a way that makes mathematical sense (2 groups of 4, 3 groups of 3).

If a student’s work falls across multiple columns or levels, the student’s score should reflect the level that is most evident. The focus should be on the mode (places in a race), rather than the average (time elapsed). Given the nature of these assessments, teachers should err on the side of the lower level if a student’s performance falls equally between two levels. (See the next page for more examples.) Student performance on individual dimensions (rows) can be used to disaggregate data for data analysis purposes.
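The mode-with-score-low rule described above can be sketched in a few lines of Python. This is a minimal illustration (the function name is hypothetical, not part of the document’s materials), and it only mechanizes the basic rule; the worked examples later in this document also apply professional judgment and dimension importance, which a simple function cannot capture.

```python
from collections import Counter

def holistic_score(dimension_scores):
    """Return the holistic score as the most frequent level (the mode).

    When two levels tie for most frequent, the LOWER level is taken,
    per the score-low guidance, to minimize score inflation.
    """
    counts = Counter(dimension_scores)
    highest_frequency = max(counts.values())
    # Among the levels that tie for most frequent, take the lowest.
    return min(level for level, n in counts.items() if n == highest_frequency)

print(holistic_score([3, 3, 3, 2]))  # 3 (the mode is Level III)
print(holistic_score([3, 3, 2, 2]))  # 2 (tie between III and II: score low)
```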

Level I: The indicators in this box describe the work of a student whose understanding of the dimension is significantly below expectations.

Level II: The indicators in this box describe the work of a student whose understanding of the dimension is approaching the expectations.

Level III: The indicators in this box describe the work of a student whose understanding of the dimension meets expectations.

Level IV: The indicators in this box describe the work of a student whose understanding of the dimension exceeds expectations.


Examples of Determining a Holistic Score

Dimensions of a rubric are the attributes of quality that the evaluator or scorer is looking for in the student’s product or performance. The text in each cell describes the degree of quality at that level. Generally, dimensions are aligned to standards and/or desired outcomes, and the work at Level III reflects mastery of the expectation, standard, or desired outcome. Quality rubric design practices suggest that the dimensions (rows) be rank-ordered, with the most important expectation listed first. A variety of examples are provided below.

Example 1 of a holistic score for a rubric with 4 dimensions
Levels: Level I, Level II, Level III, Level IV
Dimension 1: Level III
Dimension 2: Level III
Dimension 3: Level III
Dimension 4: Level II
The student scored 3, 3, 3, 2; holistic score = 3.
Rationale: The most frequently occurring level is III.

Example 2 of a holistic score for a rubric with 4 dimensions (aligned to W.11-12.01a)
Levels: Law Student (1), Lawyer (2), Judge (3), Supreme Court (4)
Claim: Judge (3)
Evidence: Judge (3)
Counterclaim: Lawyer (2)
Organization: Lawyer (2)
The student scored 3, 3, 2, 2; holistic score = 2.
Rationale: When there is a tie, SCORE LOW to minimize the possibility of inflation.

Example 1 of a holistic score for a rubric with 3 dimensions
Levels: Line Cook (1), Sous Chef (2), Chef (3), Michelin Chef (4)
Presentation: Sous Chef (2)
Taste: Chef (3)
Creativity: Line Cook (1)
The student scored 2, 3, 1; holistic score = 2.
Rationale: Holistically, the work suggests he or she isn’t “quite” there yet.

Example 2 of a holistic score for a rubric with 3 dimensions
Levels: AAA (1), Minors (2), Majors (3), Hall of Fame (4)
Batting: AAA (1)
Base Running: AAA (1)
Fielding: Majors (3)
The student scored 1, 1, 3; holistic score = 1.
Rationale: The most frequently occurring level is I. Additionally, the 3 occurred on the third (least important) dimension.

Example 1 of a holistic score for a rubric with 3 dimensions, top dimension weighted x2
Levels: Requires Prompting (1), Some Support (2), Independent (3), Metacognitive (4)
Standard 1 (weighted x2): Some Support (2)
Standard 2 (weighted x1): Independent (3)
Standard 3 (weighted x1): Requires Prompting (1)
The student scored 2, 2, 3, 1; holistic score = 2.
Rationale: The first dimension counts twice because the designers chose to weight it.

Example 2 of a holistic score for a rubric with 4 dimensions, top dimension weighted x2
Levels: Beginner (1), Amateur (2), Proficient (3), Mentor (4)
Dimension 1 (weighted x2): Beginner (1)
Dimension 2 (weighted x1): Proficient (3)
Dimension 3 (weighted x1): Beginner (1)
Dimension 4 (weighted x1): Mentor (4)
The student scored 1, 1, 3, 1, 4; holistic score = 1.
Rationale: Although the student excelled at one dimension, it was the least important one. The presence of three 1’s indicates the student is a beginner.


Converting a Rubric to a Grade or Percentage

For these kinds of tasks, it is not advisable to convert the student’s performance into a percentage. To return to the runner analogy: a runner who finishes in 20 minutes ran twice as fast as a runner who finished in 40 minutes. We can make that statement because time is reported on an interval scale. However, we do not know how much faster the runner in 1st place was than the runner in 2nd. The same applies to rubrics used to score performance-based assessments. A student whose work is at Level IV is not “twice” as good as a student at Level II. Representing Level II as “50%” and Level IV as “100%” would misrepresent both the rubric’s scale and the nature of what it is measuring.

This is also why it is difficult to give a student a score of 0 on a task scored with a qualitative rubric. There is a known, clear difference between a runner who finishes with a recorded time and a runner who did not run the race. The non-runner didn’t come in “zero” place; she simply didn’t run. In other words, if a student has done no work, there is nothing to score. This situation should be handled in accordance with district policy for students who are absent or refuse to do an assessment. In most cases, a minimal response would correspond to a Level I. If the page is blank, it can be assumed that the student did not do the work, so there is no need to generate a score.

Despite being an ordinal scale, a rubric can be used to establish both a holistic (aggregate) score and a set of disaggregated scores. For a rubric to be considered reliable, or trustworthy, it is recommended that users ensure that scorers are well-versed in the expectations for the task and are provided with examples of student work at each level (anchors). In addition, before setting any targets or using the scores, the rubric users should determine the inter-rater reliability (IRR) of the rubric.
IRR is a statistic that communicates how frequently two scorers agree on the scores given to a set of papers. For low-stakes PBA rubrics, such as classroom projects, raters should agree at least 65% of the time in order for the rubric to be considered trustworthy (citation). For higher-stakes assessments, such as those used for teacher evaluation, raters should agree at least 80% of the time.[1] The activity below walks through an example of determining IRR. To better understand the example, it is best to complete the task with a second person.

Task 1: With a partner, determine each student’s holistic score in the following example. As you work, consider the patterns you find and the conclusions you reach, and share them with your partner. Your goal is to be able to explain how you arrived at the holistic score you noted in the last column.

Student D1 – Claim D2 – Evidence D3 – Counterclaim D4 – Organization Holistic Score

White Law Student (1) Lawyer (2) Judge (3) Supreme Court (4)

Wesche Law Student (1) Lawyer (2) Law Student (1) Law Student (1)

Reitz Judge (3) Judge (3) Judge (3) Law Student (1)

Murray Lawyer (2) Law Student (1) Judge (3) Law Student (1)

Morehouse Judge (3) Law Student (1) Judge (3) Law Student (1)

McQueen Law Student (1) Lawyer (2) Judge (3) Supreme Court (4)

Mack Law Student (1) Law Student (1) Judge (3) Judge (3)

Golish Judge (3) Lawyer (2) Lawyer (2) Judge (3)

Emerson Supreme Court (4) Law Student (1) Law Student (1) Judge (3)

[1] .7 (or 70%) is generally recognized as sufficient for IRR (Nunnally & Bernstein, 1994; Stemler, 2004, as cited in Pantzare, 2015). NYSED RFQ #15-001, part 2.2(D)-I references agreement > .80.


Task 2: This time, determine each student’s holistic score without talking to your partner.

Student D1 – Claim D2 – Evidence D3 – Counterclaim D4 – Organization My Score

White Judge (3) Judge (3) Law Student (1) Law Student (1)

Wesche Judge (3) Judge (3) Judge (3) Lawyer (2)

Webb Lawyer (2) Law Student (1) Judge (3) Judge (3)

Reitz Judge (3) Judge (3) Judge (3) Supreme Court (4)

Murray Judge (3) Supreme Court (4) Lawyer (2) Lawyer (2)

Morehouse Supreme Court (4) Judge (3) Law Student (1) Law Student (1)

McQueen Judge (3) Judge (3) Judge (3) Supreme Court (4)

Mack Lawyer (2) Lawyer (2) Lawyer (2) Law Student (1)

Golish Lawyer (2) Lawyer (2) Lawyer (2) Judge (3)

Emerson Lawyer (2) Lawyer (2) Judge (3) Judge (3)

Task 3: Determine Inter-rater reliability:

Write your score in the first blank column. After you’ve scored, ask your partner to read down his/her list and note them in the second column.

Calculate inter-rater agreement using the formula below and record it as a percentage. If scorers agree on at least 80% (.8) of the samples, you have established inter-rater reliability and can reasonably conclude that the rubric will generate consistent scores and that the scores can be used in setting targets.

Our IRR = ___________________

IRR = (Number of Student Work Samples Assigned the Same Score) ÷ (Total Number of Student Work Samples)

Student    You    Partner    Difference (Y/N)

White

Wesche

Webb

Reitz

Murray

Morehouse

McQueen

Mack

Golish

Emerson
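The agreement formula above can be expressed as a short function. This is a hedged sketch (the function name and the two raters’ score lists are made up for illustration, not taken from the tasks above):

```python
def inter_rater_agreement(scores_a, scores_b):
    """Percent exact agreement between two raters over the same samples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both raters must score the same set of samples.")
    # Count samples where the two raters assigned the same score.
    same = sum(a == b for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)

# Hypothetical holistic scores from two raters on ten work samples:
rater_1 = [3, 1, 3, 1, 1, 3, 1, 2, 4, 1]
rater_2 = [3, 1, 3, 2, 1, 3, 1, 2, 3, 1]
print(f"{inter_rater_agreement(rater_1, rater_2):.0%}")  # 80%
```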


Using a PBA Quality Rubric to Generate Targets

After administering and scoring a performance-based assessment using a quality rubric, each student will have at least one data point: his or her aggregate (holistic) score. Each student will also have additional data points: one for each dimension (disaggregated) score. Using these data, teachers can set targets for their students if an assessment is used at the beginning of a year, quarter, or unit as a baseline measure. Below is an example set of results from a high school Physical Education performance task:

Student Dimension #1 Dimension #2 Dimension #3 Holistic Score

Student A Level II Level II Level III Level II

Student B Level I Level I Level I Level I

Student C Level II Level II Level I Level II

Student D Level I Level I Level II Level I

Student E Level III Level III Level IV Level III

Student F Level I Level II Level I Level I

Student H Level II Level II Level II Level II

The teacher set the following targets for her students based on their holistic scores. When setting targets, the teacher kept the following guidelines in mind:

Growth from Level I to Level III represents the growth from someone who has never demonstrated the skill or created the product being assessed to someone who has demonstrated mastery.

Growth from Level III to Level IV represents a student who has already demonstrated mastery and can now approach the task in a new and creative way.

Student Baseline Holistic Score Holistic Target

Student A Level II Level II

Student B Level I Level III

Student C Level II Level III

Student D Level I Level II

Student E Level III Level IV

Student F Level I Level III

Student H Level II Level III

One of the challenges of holistic targets is that they may be set too high, especially for a student who is still a beginner on most or all of the dimensions (e.g., Student A, Student D, and Student F). In order to set more reasonable targets for these students, teachers can also set analytic targets. This means that rather than measuring growth across all of the dimensions, the teacher will focus on the most important skill, typically the first dimension (D1). Examples of this approach are provided below.


Examples of Setting Analytic Targets

Student    Baseline Holistic Score    Holistic Target    Analytic Target    Rationale for Target Setting

A Level II Level III The student demonstrated mastery on one of the dimensions. A holistic target of Level III is reasonable.

B Level I D1 – Level III The student struggled to demonstrate mastery on all dimensions and full level’s growth is not likely. Growth from Level I to III on D1 is reasonable.

C Level II Level III The student demonstrated skills that are almost to mastery on two of the dimensions. A holistic target of Level III is reasonable.

D Level I Level II The student demonstrated skills that are almost to mastery on one of the dimensions. A holistic target of Level II is reasonable.

E Level III Level IV The student demonstrated mastery on all of the dimensions. A holistic target of Level IV is reasonable.

F Level I D1 – Level III The student struggled to demonstrate mastery on all dimensions and full level’s growth is not likely. Growth from Level I to III on D1 is reasonable.

H Level II Level III The student demonstrated mastery on one of the dimensions. A holistic target of Level III is reasonable.
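The target-setting guidelines above can be roughly encoded as rules, though the rationales in the table make clear that real target setting also relies on teacher judgment. The sketch below is a simplified, hypothetical heuristic (function name and rule thresholds are illustrative assumptions, not the document’s procedure):

```python
def suggest_target(baseline, dimension_scores):
    """Hypothetical, simplified target-setting heuristic:
    - If the student is at Level I on every dimension, suggest an
      analytic target (Level III on D1, the most important dimension)
      rather than a full holistic jump.
    - If the baseline shows mastery (Level III), suggest Level IV
      (approaching the task in a new and creative way).
    - Otherwise, suggest one holistic level higher, capped at IV.
    Teacher judgment should always override these rules."""
    if all(score == 1 for score in dimension_scores):
        return ("analytic D1", 3)
    if baseline >= 3:
        return ("holistic", 4)
    return ("holistic", min(baseline + 1, 4))

print(suggest_target(1, [1, 1, 1]))  # ('analytic D1', 3)
print(suggest_target(3, [3, 3, 4]))  # ('holistic', 4)
print(suggest_target(2, [2, 2, 3]))  # ('holistic', 3)
```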

Task 4: With a partner, consider the student scores below and talk through a holistic target and, if you feel it is necessary, an analytic target. As you discuss, consider making notes in the Rationale column to document your decision-making process.

Student    Dimension #1    Dimension #2    Dimension #3    Holistic Target    Analytic Target    Rationale

Student Z Level I Level II Level III

Student Y Level II Level I Level I

Student X Level III Level II Level II

Student W Level IV Level II Level I

Student V Level II Level II Level II