standard setting methods with high stakes assessments barbara s. plake buros center for testing...

Standard Setting Methods with High Stakes Assessments

Barbara S. Plake

Buros Center for Testing

University of Nebraska

Setting Passing Scores

Essential for making high stakes decisions
Must ensure that qualified candidates pass
Must ensure that unqualified candidates fail
70% correct is NOT the right answer!
“Standard Setting” -- setting the “standard” or “passing score”

Approaches

Empirically based
– Regression
– Contrasting groups/Borderline groups
– Norm-based

Test based
– Judgmental
– Test and candidate based

Empirically based methods

Need to know status of candidate (worthy of passing or not)
More likely in classroom settings
Not likely the case in licensure settings
Norm-based
– Not tied to the KSAs needed to function effectively/safely in the profession
– Capricious and arbitrary

Test Based

KSAs form basis for test content
Focus on target candidate
– MCC
– JQC

Assessment Tasks

Multiple choice questions
– Good content coverage
– Efficient scoring
– Can measure higher order reasoning if well constructed

Constructed Response

More directly related to target skill?
Some differences by candidate
Time consuming to administer and score
Increases costs

Judgmental task

How will the minimally qualified candidate (MCC) perform on the tasks in the test?
Need qualified, well trained judges
– Often experts (SMEs)
– Need to modify SMEs’ perception to focus on entry level performance
– Feedback

Decision Rules

Compensatory
– Performance on total is what matters
– Weaknesses in one area can be compensated by strengths in another
– Higher reliability

Decision Rules

Conjunctive
– Passing scores set on parts of the test
– Candidates must pass all parts in order to pass the test
– Sometimes candidates are allowed to “bank” passed parts
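
The two decision rules can be contrasted with a small sketch. The function names, part labels, and cut scores below are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
def compensatory_pass(part_scores, total_cut):
    """Compensatory rule: only the total score is compared with the cut score."""
    return sum(part_scores.values()) >= total_cut


def conjunctive_pass(part_scores, part_cuts):
    """Conjunctive rule: every part must reach its own cut score."""
    return all(part_scores[part] >= cut for part, cut in part_cuts.items())


# Hypothetical candidate: strong on part_a, weak on part_b.
scores = {"part_a": 30, "part_b": 18}

# The strength in part_a offsets the weakness in part_b (48 >= 45).
print(compensatory_pass(scores, total_cut=45))                   # True

# part_b is below its own cut score, so the candidate fails.
print(conjunctive_pass(scores, {"part_a": 25, "part_b": 20}))    # False
```

The same candidate passes under one rule and fails under the other, which is why the choice of rule is part of the standard setting design.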

Test Based Methods

Multiple choice questions
– Angoff Method
– Yes/No Extension
– Bookmark

Test Based Methods

Constructed Response
– Analytical Judgment
– Paper selection

Angoff “Method”

SMEs estimate the probability that a hypothetical, randomly selected MCC will be able to answer each question correctly.
Sum of an SME’s estimates = that SME’s passing score
Average across SMEs = recommended passing score
Range of probable values (SEE)
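
The arithmetic above can be sketched as follows. The ratings are invented for illustration. The slides do not give a formula for the SEE, so this sketch assumes it is the standard deviation of the SMEs' passing scores divided by the square root of the number of SMEs; treat that as an assumption, not the presenter's definition.

```python
from math import sqrt
from statistics import mean, stdev


def angoff_cut(ratings):
    """ratings: one list per SME of per-item probability estimates for the MCC."""
    sme_scores = [sum(item_probs) for item_probs in ratings]  # each SME's passing score
    cut = mean(sme_scores)                                    # recommended passing score
    # ASSUMPTION: SEE taken as the standard error of the mean across SMEs.
    see = stdev(sme_scores) / sqrt(len(sme_scores))
    return sme_scores, cut, see


# Hypothetical panel: 3 SMEs rating a 4-item test.
ratings = [
    [0.50, 0.75, 0.50, 0.75],
    [0.75, 0.75, 0.50, 1.00],
    [0.50, 0.50, 0.25, 0.75],
]
sme_scores, cut, see = angoff_cut(ratings)
print(sme_scores)   # [2.5, 3.0, 2.0]
print(cut)          # 2.5
print(cut - see, cut + see)  # range of probable values at +/- 1 SEE
```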

Angoff variations

Multiple rounds of ratings
Feedback in between
– SME results
– Candidate performance
• P-values
• % passing

Criticisms of Angoff Methods

Cognitively challenging
“Impossible task”
“Fatally flawed” NRC report
Research has shown that ratings are consistent across years and raters
Need strong training/discussion of KSAs of MCCs

Yes/No Variation

SMEs estimate whether or not the MCC will be able to answer the item correctly (Y/N)
– Response probability
– More likely than not (.50)
– Fairly certain (.67)
Add the Ys to get SME’s passing score
Average across SMEs = recommended passing score
Cutpoint +/- SEE (1 or 2)
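
A minimal sketch of the Yes/No tally, using invented judgments. As with the Angoff sketch, the SEE formula here (standard deviation of SME passing scores over the square root of the panel size) is an assumption, since the slides name the SEE without defining it. The `n_see` parameter is a hypothetical name for the "1 or 2" multiplier.

```python
from math import sqrt
from statistics import mean, stdev


def yes_no_cut(judgments, n_see=1):
    """judgments: one string per SME, one 'Y' or 'N' per item."""
    sme_scores = [j.count("Y") for j in judgments]   # add the Ys for each SME
    cut = mean(sme_scores)                           # recommended passing score
    # ASSUMPTION: SEE computed as the standard error of the mean across SMEs.
    see = stdev(sme_scores) / sqrt(len(sme_scores))
    return sme_scores, cut, (cut - n_see * see, cut + n_see * see)


# Hypothetical panel: 3 SMEs judging a 5-item test.
sme_scores, cut, band = yes_no_cut(["YYNYN", "YNNYN", "YYYYN"])
print(sme_scores)  # [3, 2, 4]
print(cut)         # 3
print(band)        # cutpoint +/- 1 SEE
```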

Yes/No Variation

More popular with SMEs
Feedback not necessarily needed
Quicker to implement

Bookmark Method

Often used with IRT calibrated items, but IRT is not necessary
Test questions ordered from easy to hard
Response probability
Insert bookmark between pages when the MCC probability of a correct response dips below the response probability

Bookmark Method

Number of items preceding bookmark is SME’s passing score
Often little discussion on KSAs of MCC
Multiple small groups
Discussion between rounds
Multiple rounds; data usually isn’t shared until 2nd or 3rd round.

Bookmark Method

Results often shown graphically across rounds
Frequently convergence occurs after 1st round
Average across SMEs = recommended cutpoint
SEE formula; cutpoint +/- SEE (1 or 2)
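
The bookmark placement rule can be sketched as below. In practice an SME simply places the bookmark in the ordered item booklet; representing each SME's judgment as a list of per-item MCC probabilities is an assumption made here so the rule can be shown in code. The default response probability of 0.67 is also an assumption, and all data are invented.

```python
from statistics import mean


def bookmark_passing_score(mcc_probs, response_probability=0.67):
    """mcc_probs: an SME's judged MCC success probabilities, items ordered easy to hard.

    Returns the number of items preceding the bookmark, i.e. the items before
    the first one whose probability dips below the response probability.
    """
    for position, p in enumerate(mcc_probs):
        if p < response_probability:
            return position
    return len(mcc_probs)  # probability never dips: bookmark goes after the last item


# Hypothetical panel of 3 SMEs on a 6-item ordered booklet.
panel = [
    [0.95, 0.90, 0.80, 0.70, 0.60, 0.50],
    [0.90, 0.85, 0.70, 0.60, 0.55, 0.40],
    [0.95, 0.90, 0.85, 0.75, 0.70, 0.60],
]
sme_scores = [bookmark_passing_score(probs) for probs in panel]
print(sme_scores)        # [4, 3, 5]
print(mean(sme_scores))  # 4 (recommended cutpoint)
```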

Constructed response tasks

Extended Angoff
Analytical Judgment

Extended Angoff

SMEs estimate how many of the total points available for the task will be earned by the MCC.
Cutpoint is determined in a similar fashion to Angoff; sum points for SME, average across SMEs.
Range of probable values
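
A short sketch of the Extended Angoff sum-and-average step, with a check that no estimate exceeds the points available for a task. The task maxima and estimates are invented for illustration.

```python
from statistics import mean


def extended_angoff_cut(max_points, estimates):
    """max_points: points available per task.
    estimates: one list per SME of points the MCC is expected to earn per task."""
    sme_scores = []
    for sme_estimates in estimates:
        if len(sme_estimates) != len(max_points):
            raise ValueError("one estimate is required per task")
        for est, available in zip(sme_estimates, max_points):
            if not 0 <= est <= available:
                raise ValueError("estimate must lie between 0 and the points available")
        sme_scores.append(sum(sme_estimates))   # sum points for this SME
    return sme_scores, mean(sme_scores)         # average across SMEs


# Hypothetical: 3 constructed response tasks worth 10, 5, and 8 points.
sme_scores, cut = extended_angoff_cut(
    max_points=[10, 5, 8],
    estimates=[[6, 3, 5], [7, 3, 4], [5, 2, 5]],
)
print(sme_scores)  # [14, 14, 12]
print(cut)
```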

Analytical Judgment

SMEs see prescored candidate responses (but scores aren’t revealed)
Task is to sort candidate responses into performance categories
– Clearly passing
– Passing
– Not Passing

Analytical Judgment

Clearly passing set aside
Candidate responses in the Passing and Not Passing categories are ordered from lowest performance to highest.
Top responses in the Not Passing category are identified (usually 3)
Lowest responses in the Passing category are identified (usually 3)

Analytical Judgment

Average across these 6 papers is SME’s passing score
Feedback provided on SME passing scores
Round 2
Cutpoint is average across SMEs’ passing scores
Range of probable values
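
The boundary-paper calculation for one SME can be sketched as follows. The lists hold the hidden prescores of the responses in the order the SME ranked them, lowest to highest; the scores themselves are invented.

```python
from statistics import mean


def analytical_judgment_score(not_passing, passing, n=3):
    """not_passing, passing: prescored responses in each category, in the SME's
    order from lowest to highest performance. n: boundary papers per category."""
    boundary = not_passing[-n:] + passing[:n]   # top of Not Passing + bottom of Passing
    return mean(boundary)                       # this SME's passing score


# Hypothetical prescores, revealed only after the SME has sorted the papers.
not_passing = [2, 3, 4, 5, 5]
passing = [5, 6, 6, 7, 8]
print(analytical_judgment_score(not_passing, passing))  # mean of 4, 5, 5, 5, 6, 6
```

The panel cutpoint is then the average of these per-SME scores, as in the other methods.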

Paper Selection

Exemplar candidate work is selected for each score point (typically 2/score point)
SMEs’ task is to pick the two papers that best represent the work of the MCC
Scores are not revealed to SMEs
Average of SME’s selected papers = SME’s passing score
Average across SMEs = cutpoint
Range of probable values
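
A minimal sketch of the Paper Selection arithmetic. The paper labels, scores, and SME picks are hypothetical; two exemplar papers are assumed per score point, following the "typically 2/score point" note above.

```python
from statistics import mean

# Hypothetical exemplar papers: two per score point, scores hidden from SMEs.
paper_scores = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3, "G": 4, "H": 4}


def paper_selection_cut(paper_scores, picks):
    """picks: for each SME, the two papers judged to best represent the MCC."""
    sme_scores = [mean(paper_scores[paper] for paper in pair) for pair in picks]
    return sme_scores, mean(sme_scores)   # per-SME passing scores, panel cutpoint


sme_scores, cut = paper_selection_cut(paper_scores, [("C", "E"), ("D", "E"), ("E", "F")])
print(sme_scores)  # [2.5, 2.5, 3]
print(cut)
```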

Who Makes the Final Decision?

Each approach yielded a cutpoint and a “range of probable values”
This information should be communicated to the policy makers for their final decision.
Standard setting methods only yield a range of consistent, defensible cutpoints
Final decision is a policy matter!

Providing Validity Evidence

What evidence is useful in supporting the results of the standard setting process?
This evidence should be gathered to have available in case of a legal challenge.
Responsibility of test developer to provide at least procedural validity evidence.
Collateral evidence could be part of a long-term validity research program

Procedural Evidence

SMEs
– Representative of profession
– Qualifications
– Confidentiality
– Conflict of interest statements
– Cannot teach preparation classes or sit for examination

Training

Did SMEs understand method?
Was sufficient time allotted to training?
Did the SMEs have a clear conceptualization of the MCC?
Did they understand the purpose of the standard setting procedure?
Do they understand that the final decision will be based on their work, but not dictated by it?

Practice

Was enough time devoted to practice?
Were the practice materials sufficiently similar to the operational materials?
Did the SMEs feel they had a reasonable opportunity to ask questions and receive clarifications?
Did they understand the feedback information?

Operational

Was enough time devoted to their work (across rounds)?
How confident did the SMEs feel about their ratings (across rounds)?
How useful/influential was the feedback?
Did the facilities support their work?

Overall

Confidence that the method used will result in appropriate minimum passing score?
Was the workshop handled in a professional manner?
Was the workshop well organized?
Opportunity for comments

Main Point

Many methods, all aimed at providing a structured and reasoned approach to identifying
– Cutpoint
– Range of probable values
– Procedural validity evidence

Match of Method to Assessment

Method selected should be appropriate for the assessment (MCQ, constructed response).
Logistically feasible
Published in peer-reviewed journals?
Should be replicable
Multiple methods? Multiple panels?

Purpose of Presentation

Provide an orientation to current standard setting methods
Provide background on the needed processes and procedures to conduct a professional (and legally defensible) standard setting workshop.

Thank you

I am honored to be asked to share my expertise in this area
I hope the presentation has been useful and meaningful
Best outcome for me is if it raised your awareness of methods and issues in standard setting.
