
Page 1

Developing Automated Scoring for Large-scale Assessments of Three-dimensional Learning

Jay Thomas (1), Ellen Holste (2), Karen Draney (3), Shruti Bathia (3), and Charles W. Anderson (2)

1. ACT, Inc.

2. Michigan State University

3. UC Berkeley, BEAR Center

Page 2

Based on the NRC report Developing Assessments for the Next Generation Science Standards (Pellegrino et al., 2014)

• Need assessment tasks with multiple components to get at all 3 dimensions (C 2-1)

• Tasks must accurately locate students along a sequence of progressively more complex understanding (C 2-2)

• Traditional selected-response items cannot assess the full breadth and depth of NGSS

• Technology can address some of the problems
  • Particularly scalability and cost

Page 3

Example of a Carbon TIME Item

Page 4

Comparing FC (forced-choice) vs. CR (constructed-response) vs. both

• Compare spread of data

• Adding CR (or CR only) increases the confidence that we have classified students correctly

• Since explanation is a practice that the learning progression (LP) focuses on, CR items are required to assess the construct fully

Page 5

Recursive Feedback Loops for Item Development (flow diagram; the stages below were the diagram's boxes):

1. Item Development
2. Students respond to items
3. WEW (Rubric) Development
4. Using WEW (human scoring) to create a training set
5. Creating Machine Learning (ML) models
6. Using ML model (computer scoring)
7. Backcheck coding (human)
8. QWK check for reliability (see the sketch after this list)
9. Psychometric analysis (IRT, WLE)
10. Interpretation by larger research group

The diagram distinguished processes moving towards final interpretation from feedback loops that indicate that a question, rubric, or coding potentially has a problem that needs to be addressed.
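A minimal sketch of the QWK reliability check in step 8, assuming two parallel lists of integer scores for the same responses (human back-check vs. machine). The scikit-learn call is one standard way to compute quadratic weighted kappa, not necessarily the project's own tooling, and the scores are hypothetical placeholders.

```python
# Minimal QWK sketch: agreement between human back-check codes and machine scores.
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel score lists for the same set of responses.
human_scores = [2, 3, 1, 4, 2, 3, 3, 1]
machine_scores = [2, 3, 2, 4, 2, 3, 4, 1]

# Quadratic weighting penalizes large disagreements more than adjacent-category ones.
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```

A low QWK on the back-check sample is the kind of signal that feeds back into revising the rubric, the item, or the ML model.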

Page 6

Consequences of using machine scoring

• Item revision and improvement
• Increase in the size of the usable data set to increase the power of statistics
• Increased confidence in the reliability of scoring through back-checking samples and revising models
• Reduced costs by needing fewer human coders
• A model to show that the kinds of assessments envisioned by Pellegrino et al. (2014) for NGSS can be reached at scale with low cost

As of March 6, 2019

School Year Responses Scored

2015-16 175,265

2016-17 532,825

2017-18 693,086

2018-19 227,041

TOTAL 1,628,217

Cost Savings and Scalability

Labor hours needed to human score responses @ 100 per hour: 16,282.17 hours
Labor cost per hour (undergraduate students, including misc. costs): $18 per hour
Cost to human score all responses: $293,079
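For reference, these figures follow directly from the response counts above; the snippet below just reproduces the arithmetic (variable names are illustrative).

```python
# Reproducing the slide's cost arithmetic.
total_responses = 1_628_217          # responses scored through 2018-19
responses_per_hour = 100             # human scoring rate used on the slide
hourly_rate = 18                     # USD per hour, incl. misc. costs

labor_hours = total_responses / responses_per_hour    # 16,282.17 hours
human_scoring_cost = labor_hours * hourly_rate        # about $293,079
print(f"{labor_hours:,.2f} hours -> ${human_scoring_cost:,.0f}")
```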

Page 7

Types of validity evidence

• As taken from the Standards for Educational and Psychological Testing, 2014 ed.
• Evidence based on test content

• Evidence based on response processes

• Evidence based on internal structure

• Evidence based on relation to other variables
  • Convergent and discriminant evidence
  • Test-criterion evidence

• Evidence for validity and consequences of testing

Page 8

Comparison of interviews and IRT analysis results

• Overall Spearman rank correlation = 0.81, p<0.01, n=49

• Comparison of scoring for one written versus interview item
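As a rough sketch of how such a comparison can be computed, the snippet below runs a Spearman rank correlation between interview-based LP levels and IRT proficiency estimates for the same students; the data and variable names are hypothetical placeholders, and scipy is only one common choice of tool.

```python
# Hypothetical sketch of the interview-vs-IRT comparison via Spearman rank correlation.
from scipy.stats import spearmanr

interview_levels = [2, 3, 1, 4, 2, 3]            # placeholder interview-based LP levels
irt_proficiencies = [-0.4, 0.8, -1.2, 1.5, -0.1, 0.6]  # placeholder IRT estimates (e.g., WLEs)

rho, p_value = spearmanr(interview_levels, irt_proficiencies)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```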

Page 9

Evidence based on internal structure

• Analysis method: item response models (specifically, unidimensional and multidimensional partial credit models)

• Provide item and step difficulties and person proficiencies on one scale

• Provide comparisons of step difficulties within items
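For reference, the unidimensional partial credit model referred to above is commonly written as follows; the notation is the standard formulation, not taken from the slides.

```latex
% Probability that person n scores x on item i with maximum score m_i.
P(X_{ni} = x \mid \theta_n) =
  \frac{\exp\!\left( \sum_{j=0}^{x} (\theta_n - \delta_{ij}) \right)}
       {\sum_{k=0}^{m_i} \exp\!\left( \sum_{j=0}^{k} (\theta_n - \delta_{ij}) \right)},
  \qquad x = 0, 1, \ldots, m_i
```

Here the sum for x = 0 is taken to be zero, theta_n is the person proficiency, and delta_ij are the step difficulties, all placed on the same logit scale; the multidimensional version replaces theta_n with a dimension-specific proficiency.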

Page 10

Step difficulties for each item (2015-16 data)

Page 11

Classifying Students into LP Levels: Comparing FC to EX + FC

Page 12

Classifying Students into LP Levels: Comparing EX to EX + FC

Page 13

Classifying Classroom Data

95% confidence intervals: Average learning gains for teachers with at least 15 students who had both overall pretests and overall posttests (macroscopic explanations)
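A minimal sketch of the classroom-level summary implied by this caption, assuming a table of per-student pretest and posttest scores; the column names and the tiny data frame are placeholders, not Carbon TIME data, and the slide restricts the analysis to teachers with at least 15 matched students.

```python
# Hypothetical sketch: mean pre-to-post learning gain per teacher with a 95% CI.
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    "teacher": ["A"] * 3 + ["B"] * 3,          # placeholder students per teacher
    "pre":  [1.0, 1.5, 0.8, 2.0, 1.2, 1.7],    # placeholder pretest proficiencies
    "post": [2.0, 2.5, 1.4, 2.6, 2.0, 2.3],    # placeholder posttest proficiencies
})
data["gain"] = data["post"] - data["pre"]

for teacher, grp in data.groupby("teacher"):
    mean_gain = grp["gain"].mean()
    ci_low, ci_high = stats.t.interval(
        0.95, df=len(grp) - 1, loc=mean_gain, scale=stats.sem(grp["gain"])
    )
    print(f"Teacher {teacher}: mean gain {mean_gain:.2f}, "
          f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```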