
Page 1

Developing Automated Scoring for Large-scale Assessments of Three-dimensional Learning

Jay Thomas (1), Ellen Holste (2), Karen Draney (3), Shruti Bathia (3), and Charles W. Anderson (2)

1. ACT, Inc.

2. Michigan State University

3. UC Berkeley, BEAR Center

Page 2

Based on the NRC report Developing Assessments for the Next Generation Science Standards (Pellegrino et al., 2014)

• Need assessment tasks with multiple components to get at all 3 dimensions (C 2-1)

• Tasks must accurately locate students along a sequence of progressively more complex understanding (C 2-2)

• Traditional selected-response items cannot assess the full breadth and depth of NGSS

• Technology can address some of the problems
  • Particularly scalability and cost

Page 3

Example of a Carbon TIME Item

Page 4

Comparing FC (forced-choice) vs. CR (constructed-response) vs. both

• Compare spread of data

• Adding CR (or CR only) increases the confidence that we have classified students correctly

• Since explanation is a practice that the learning progression (LP) focuses on, CR items are required to assess the construct fully

Page 5

Recursive Feedback Loops for Item Development (flow diagram; the stages below were the diagram's boxes):

1. Item Development
2. Students respond to items
3. WEW (Rubric) Development
4. Using WEW (human scoring) to create a training set
5. Creating Machine Learning (ML) models
6. Using ML model (computer scoring)
7. Backcheck coding (human)
8. QWK check for reliability (see the sketch after this list)
9. Psychometric analysis (IRT, WLE)
10. Interpretation by larger research group

The diagram distinguished processes moving towards final interpretation from feedback loops that indicate that a question, rubric, or coding potentially has a problem that needs to be addressed.
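A minimal sketch of the QWK reliability check in step 8, assuming two parallel lists of integer scores for the same responses (human back-check vs. machine). The scikit-learn call is one standard way to compute quadratic weighted kappa, not necessarily the project's own tooling, and the scores are hypothetical placeholders.

```python
# Minimal QWK sketch: agreement between human back-check codes and machine scores.
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel score lists for the same set of responses.
human_scores = [2, 3, 1, 4, 2, 3, 3, 1]
machine_scores = [2, 3, 2, 4, 2, 3, 4, 1]

# Quadratic weighting penalizes large disagreements more than adjacent-category ones.
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```

A low QWK on the back-check sample is the kind of signal that feeds back into revising the rubric, the item, or the ML model.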

Page 6

Consequences of using machine scoring

• Item revision and improvement
• Increase in the size of the usable data set to increase the power of statistics
• Increased confidence in the reliability of scoring through back-checking samples and revising models
• Reduced costs by needing fewer human coders
• A model to show that the kinds of assessments envisioned by Pellegrino et al. (2014) for NGSS can be reached at scale with low cost

As of March 6, 2019

School Year Responses Scored

2015-16 175,265

2016-17 532,825

2017-18 693,086

2018-19 227,041

TOTAL 1,628,217

Cost Savings and Scalability

Labor hours needed to human score responses @ 100 per hour: 16,282.17 hours
Labor cost per hour (undergraduate students, including misc. costs): $18 per hour
Cost to human score all responses: $293,079
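For reference, these figures follow directly from the response counts above; the snippet below just reproduces the arithmetic (variable names are illustrative).

```python
# Reproducing the slide's cost arithmetic.
total_responses = 1_628_217          # responses scored through 2018-19
responses_per_hour = 100             # human scoring rate used on the slide
hourly_rate = 18                     # USD per hour, incl. misc. costs

labor_hours = total_responses / responses_per_hour    # 16,282.17 hours
human_scoring_cost = labor_hours * hourly_rate        # about $293,079
print(f"{labor_hours:,.2f} hours -> ${human_scoring_cost:,.0f}")
```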

Page 7

Types of validity evidence

• As taken from the Standards for Educational and Psychological Testing, 2014 ed.
• Evidence based on test content

• Evidence based on response processes

• Evidence based on internal structure

• Evidence based on relation to other variables
  • Convergent and discriminant evidence
  • Test-criterion evidence

• Evidence for validity and consequences of testing

Page 8

Comparison of interviews and IRT analysis results

• Overall Spearman rank correlation = 0.81, p<0.01, n=49

• Comparison of scoring for one written versus interview item
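As a rough sketch of how such a comparison can be computed, the snippet below runs a Spearman rank correlation between interview-based LP levels and IRT proficiency estimates for the same students; the data and variable names are hypothetical placeholders, and scipy is only one common choice of tool.

```python
# Hypothetical sketch of the interview-vs-IRT comparison via Spearman rank correlation.
from scipy.stats import spearmanr

interview_levels = [2, 3, 1, 4, 2, 3]            # placeholder interview-based LP levels
irt_proficiencies = [-0.4, 0.8, -1.2, 1.5, -0.1, 0.6]  # placeholder IRT estimates (e.g., WLEs)

rho, p_value = spearmanr(interview_levels, irt_proficiencies)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```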

Page 9

Evidence based on internal structure

• Analysis method: item response models (specifically, unidimensional and multidimensional partial credit models)

• Provide item and step difficulties and person proficiencies on one scale

• Provide comparisons of step difficulties within items
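For reference, the unidimensional partial credit model referred to above is commonly written as follows; the notation is the standard formulation, not taken from the slides.

```latex
% Probability that person n scores x on item i with maximum score m_i.
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp\!\left( \sum_{j=0}^{x} (\theta_n - \delta_{ij}) \right)}
       {\sum_{k=0}^{m_i} \exp\!\left( \sum_{j=0}^{k} (\theta_n - \delta_{ij}) \right)},
  \qquad x = 0, 1, \ldots, m_i
```

Here the sum for x = 0 is taken to be zero, theta_n is the person proficiency, and delta_ij are the step difficulties, all placed on the same logit scale; the multidimensional version replaces theta_n with a dimension-specific proficiency.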

Page 10

Step difficulties for each item (2015-16 data)

Page 11

Classifying Students into LP Levels: Comparing FC to EX + FC

Page 12

Classifying Students into LP Levels: Comparing EX to EX + FC

Page 13

Classifying Classroom Data

95% confidence intervals: Average learning gains for teachers with at least 15 students who had both overall pretests and overall posttests (macroscopic explanations)
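A minimal sketch of the classroom-level summary implied by this caption, assuming a table of per-student pretest and posttest scores; the column names and the tiny data frame are placeholders, not Carbon TIME data, and the slide restricts the analysis to teachers with at least 15 matched students.

```python
# Hypothetical sketch: mean pre-to-post learning gain per teacher with a 95% CI.
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    "teacher": ["A"] * 3 + ["B"] * 3,          # placeholder students per teacher
    "pre":  [1.0, 1.5, 0.8, 2.0, 1.2, 1.7],    # placeholder pretest proficiencies
    "post": [2.0, 2.5, 1.4, 2.6, 2.0, 2.3],    # placeholder posttest proficiencies
})
data["gain"] = data["post"] - data["pre"]

for teacher, grp in data.groupby("teacher"):
    mean_gain = grp["gain"].mean()
    ci_low, ci_high = stats.t.interval(
        0.95, df=len(grp) - 1, loc=mean_gain, scale=stats.sem(grp["gain"])
    )
    print(f"Teacher {teacher}: mean gain {mean_gain:.2f}, "
          f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```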