Lessons from High-Stakes Licensure Examinations
for Medical School Examinations
Queen's University
4 December 2008
Dale Dauphinee, MD, FRCPC, FCAHS
Background: FAME Course
Validating Test Scores and Decisions
Pulling All of the Pieces Together!
Dale Dauphinee
Seeing the woods for the trees … and defining the way ahead … !
Why? Ensure that you keep out of trouble and get the effect/impact that you want!
Goal today is to offer insights for those of you working at the undergraduate level – looking back on my two careers in assessment: Undergraduate Assoc. Dean and CEO of the MCC!
FAME Course Framework: Assessment Frames
Themes:
• Knowledge and Reasoning
• Clinical Skills
• Workplace Performance
• Program Evaluation
• Scoring, Analysis & Reporting
• Test Material Development
• Standard Setting
• Test Design: Constructed Response
• Test Design: Content and Validity
Elements of Talk
• Process: be clear on why you are doing this!
  – Describe: assessment steps written down
• Item design: key issues
• Structure: be clear where decisions are made
• Outcome: pass-fail or honours-pass-fail
• Evaluation cycle: it is about improvement!
• Getting into trouble
  – Problems in process: questions to be asked
  – Never ask them after the fact: ANTICIPATE
• Prevention
Preparing a ‘Course’ Flow Chart
• For whom and what?
• What is the practice/curriculum model?
• What method?
• What is the blueprint and sampling frame?
• To what resolution level will they answer?
• Scoring and analysis
• Decision making
• Reporting
• Due process
HINT: Think project management! What are the intended steps?
Classic Assessment Cycle
• Desired objectives or attributes
• Educational program
• Assessment of performance
• Performance gaps
• Program revisions

Change in the Hallmarks of Competence – Increase Validity
[Figure: assessment methods from 1960 to 2000, rising in professional or clinical authenticity – knowledge assessment → problem-solving assessment → clinical skills assessment → practice assessment (adapted from van der Vleuten 2000)]
Climbing the Pyramid
• Knows – factual tests: MCQ, essay type, oral…
• Knows how – (clinical) context-based tests: MCQ, essay type, oral…
• Shows how – performance assessment in vitro: OSCE, SP-based test…
• Does – performance assessment in vivo: undercover SPs, video, logs…
Traditional View
[Diagram: Curriculum, Teacher, Assessment and Student – after van der Vleuten, 1999]
An Alternative View
[Diagram: Curriculum, Teacher, Assessment and Student – after van der Vleuten, 1999]
Traditional Assessment: What, Where & How
Student-Trainee Assessment
• Content: maps onto the domain and curriculum → to which the results generalize – the basis of assessment
• Where and who: within 'set' programs where candidates are in the same cohort
• Measurement:
  – Test or tool testing time is long enough to yield reliable results (a sketch follows this slide)
  – Tests are comparable from administration to administration
  – Controlled environment – not complex
  – Can differences be attributed to the candidate? … and rule out 'exam-based' or error attribution
  – Adequate numbers per cohort
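To make the test-length point concrete, here is a minimal sketch (mine, not from the talk) that computes Cronbach's alpha for a toy item-score matrix and then projects reliability for a lengthened test with the Spearman-Brown prophecy formula; all data are invented.

```python
# Sketch: how test length drives reliability (toy data, for illustration only).
from statistics import pvariance

# rows = candidates, columns = items (1 = correct, 0 = incorrect)
scores = [
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]

k = len(scores[0])                                    # number of items
item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
total_var = pvariance([sum(row) for row in scores])   # variance of total scores
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)  # Cronbach's alpha

def spearman_brown(rel, factor):
    """Projected reliability if the test is lengthened by `factor`."""
    return factor * rel / (1 + (factor - 1) * rel)

print(f"alpha for {k} items: {alpha:.2f}")
print(f"projected alpha if test length is doubled: {spearman_brown(alpha, 2):.2f}")
```

With these toy numbers, alpha is about 0.65 and doubling the test projects to roughly 0.79 – the slide's lesson in miniature.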
Traditional Tests/Tools at School
• Does content map to the domain?
• Test length = reliable?
• Attributable to the candidate?
• Are tests comparable?
• The ideal test has all of these qualities!
Principle
It is all about the context and purpose of your course, then the intended use of the test score – or the program!
'There is no test for all seasons or for all reasons.'
Written Tests: Designing Items
Key Concepts
Principle
∴ The case 'prompts' or item stems must create low-level simulations in the candidate's mind about … the performance situations that are about to be assessed …
Classifying Constructed Formats
• Cronbach (1984): defined constructed-response formats as a broad class of item formats where the response is generated by the examinee rather than selected from a list of options.
• Haladyna (1997): constructed-response formats
  – High-inference format
    • Requires expert judgment about a trait being observed
  – Low-inference format
    • Observing the behaviour of interest: short answer; checklists
Types of CR Formats*
• Low inference
  – Work sampling (done in real time)
  – In-training evaluations (rating provided later)
  – Mini-CEX
  – Short answer
  – Clinical orals: structured
  – Essays (with score key)
  – Key features (no menus)
  – OSCEs at early UG level
• High inference
  – Work – 360s
  – OSCEs at grad level
  – Orals (not 'old' vivas)
  – Complex simulations (teams, interventions)
  – Case-based discussions
  – Portfolios
  – Demonstration of procedures

*Principle – all CR formats need lots of development planning: you can't show up and wing it!
What Do CRs Offer & What Must One Consider for Good CRs?
The CR format can provide:
– Opportunity for candidates to generate/create a response
– Opportunity to move beyond MCQs
– A response that is evaluated by comparison to pre-developed criteria
– Evaluation criteria with a range of values that are acceptable to the faculty of the course or testing body

CRs: other considerations
– Writers/authors need training
– Need a CR development process
– Need a topic selection plan or blueprint
– Need guidelines
– Need a scoring rubric and analysis → reporting (see the sketch below)
– Need a content review process
– Need a test assembly process
– May encounter technical issues…
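As an illustration of the 'pre-developed criteria' and 'scoring rubric' points above, a hypothetical sketch of a short-answer scoring key follows; the elements, acceptable answers, and weights are invented, not drawn from any real exam.

```python
# Hypothetical short-answer scoring key: each response element is compared
# to a faculty-approved set of acceptable answers with an agreed point value.

KEY = {
    "diagnosis":  {"acceptable": {"community-acquired pneumonia", "cap"}, "points": 2},
    "first_test": {"acceptable": {"chest x-ray", "cxr"}, "points": 1},
}

def score_response(response):
    """Score a candidate's answers against the pre-developed key."""
    total = 0
    for element, rule in KEY.items():
        answer = response.get(element, "").strip().lower()
        if answer in rule["acceptable"]:
            total += rule["points"]
    return total

candidate = {"diagnosis": "CAP", "first_test": "chest X-ray"}
print(score_response(candidate))  # 3 of a possible 3
```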
Moving to Clinical Assessment
Think of it as work assessment!
Point: the validity of scoring is key because the scores are being used to judge clinical competence in certain domains!
Clinical Assessment Issues
• Context:
  – Clinical skills
  – Work assessment
• Overview:
  – Validating test scores
  – Validating decisions
• Examples:
  – Exit (final) OSCE
  – Mini-CEX
• Conclusion
Presentation Grid

                        Clinical Skills   Mini-CEX
Validating Scoring            ✔              ✔
Validating Decisions          ✔              ✔
Key Pre-condition #1
• What is the educational goal?
  – And the level of resolution expected?
• Have you defined the purpose or goal of the evaluation and the manner in which the result will be used?
• Learning point:
  – Need to avoid Downing's threats to validity
    • Too few cases/items (construct under-representation)
    • Flawed cases/items (construct-irrelevant variance)
• If not – you are not ready to proceed!
Key Pre-condition #2
• Be clear about due process!
• Ultimately, if this instrument is an 'exit' exam or an assessment to be used for promotion, clarity about 'due process' is crucial
• Samples: the student must know that he/she has the right to the last word; the 'board' must have followed acceptable standards of decision-making; etc.
Practically, in 2008, validity implies …
… that in the interpretation of a test score, a series of assertions, assumptions and arguments are considered that support that interpretation!
– ∴ Validation is a pre-decision assessment – specifying how you will consider and interpret the results as 'evidence' to be used in final decision-making!
– In simple terms: for student promotion, a series of conditional steps ('cautions') is needed to document a 'legitimate' assessment 'process'
– ∴ Critical steps for a 'valid' process leading to the ultimate decision
  • i.e. make a pass/fail decision or provide a standing
General Framework for Evaluating Assessment Methods – after Swanson
Evaluation: determining the quality of the performance observed on the test
Generalization: generalizing from performance on the test to other tests covering similar, but not identical, content
Extrapolation: inferring performance in actual practice from performance on the test
Evaluation, Generalization, and Extrapolation are like links in a chain: the chain is only as strong as the weakest link
Kane's 'Links in a Chain' Defense – after Swanson
[Diagram: Evaluation → Generalization → Extrapolation; includes scoring and decision-making]
Scoring: Deriving the Evidence
• Content validity:
  – Performance- and work-based tests
    • Enough items/cases?
    • Match to exam blueprint and ultimate uses
  – Exam versus work-related assessment point
    • Direct measures of observed attributes
    • Key: is it being scored by items or cases?
    • Observed score compared to target score
  – Item (case) matches the patient problem!
  – And the candidate's ability!
Preparing the Evidence
• From results to evidence: three inferences
  – Evaluate performance – get a score
  – Generalize that to a target score
  – Translate the target score into a verbal 'description'
• All three inferences must be valid
• Process:
  – Staff role versus decision-makers' responsibilities/role
    • Flawed items/cases
    • Flag unusual or critical events for decision-makers
    • Prepare analyses
  – Comparison data
Validating the Scoring – Evidence
• Validation is carried out in two stages
  – Developmental stage: the process is nurtured, refined
  – Appraisal stage: the real thing – trial by fire!
• Interpretive argument
• Content validity: how do scores function in various required conditions?
  – Enough items/cases?
  – Eliminate flawed items/cases (see the sketch below)
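One concrete way to act on 'eliminate flawed items/cases' is routine item analysis. The sketch below is an assumption on my part, not the MCC's documented procedure: it flags items with extreme difficulty or poor item-rest discrimination in a toy 0/1 score matrix.

```python
# Sketch: flag candidate items for review by difficulty (proportion correct)
# and item-rest discrimination (toy data; thresholds are illustrative).
from statistics import mean, stdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / ((len(x) - 1) * stdev(x) * stdev(y))

# rows = candidates, columns = items (1 = correct)
scores = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

for i in range(len(scores[0])):
    item = [row[i] for row in scores]
    rest = [sum(row) - row[i] for row in scores]  # total score minus this item
    p = mean(item)                                # difficulty
    r = pearson(item, rest)                       # discrimination
    flag = "  <- review" if p < 0.2 or p > 0.9 or r < 0.2 else ""
    print(f"item {i + 1}: p = {p:.2f}, r = {r:+.2f}{flag}")
```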
Observation of Performance with Real Patients
[Chain: Evaluation → Generalization → Extrapolation – if the candidate sees a variety of patients]
Objective Structured Clinical Examination (OSCE)
[Chain: Evaluation → Generalization → Extrapolation – Dave Swanson]
Stop and Re-consider …
What were the educational goals?
AND
How will the decision be used?
The Decision-making Process
• Standard setting
  – many methods
• But the keys are:
  – ultimate success
  – fidelity
  – the care with which the decision is executed is crucial
  – it must be documented
• Helpful hint: standard setting can also be used to define faculty expectations for content and use – in advance of the test! (A sketch of one method follows.)
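The talk lists 'many methods' without endorsing one, so as an illustration only, here is a minimal sketch of a modified Angoff procedure, one common choice: each judge estimates, per item, the probability that a borderline candidate answers correctly, and the cut score is the mean of the judges' summed estimates. All numbers are made up.

```python
# Sketch of a modified Angoff cut-score calculation (made-up judgments).
from statistics import mean

# rows = judges, columns = items; each entry is the judged probability that
# a borderline candidate answers that item correctly
judgments = [
    [0.60, 0.45, 0.80, 0.55, 0.70],
    [0.55, 0.50, 0.75, 0.60, 0.65],
    [0.65, 0.40, 0.85, 0.50, 0.70],
]

cut_score = mean(sum(judge) for judge in judgments)  # expected borderline total
print(f"pass mark: {cut_score:.1f} of {len(judgments[0])} items")
```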
The Decision-making Process
• Generic steps:
  – the exam was conducted properly;
  – the results are psychometrically accurate and valid;
  – establish the pass-fail point;
  – and consider each candidate's results
• Red steps require an evaluating process that is:
  – deliberate and reflective
  – open discussion
• Black steps: decision
  – All members of the decision-making board must be 'in' – or else an escalation procedure needs to be established – in advance!
Examples
• OSCE
  – MCC meeting steps
    • Overview: how the exam went
    • Review each station
      – Discussion
      – Decision: use all cases
    • Review results 'in toto'
      – Decide on the pass-fail point
      – Consider each person:
        • Decide pass-fail for specific challenging instances
        • Award standing or tentative decision
  – Comments
• Work-based: mini-CEX
  – Six-month rotation in PGY-1
  – Construction steps
    • Sampling grid?
      – Numbers needed (see the sketch after this slide)
      – Score per case
    • Rating issues:
      – Global (preferred) vs. checklist
      – Scale issues
    • Examiner strategy
      – Not the same one
      – Number needed
      – Preparation
    • Awarding standing: pass-fail or one of several parameters?
  – Comments
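For the 'numbers needed' question in the mini-CEX sampling grid, one common back-of-envelope approach (not stated in the talk) is to solve the Spearman-Brown formula for the number of encounters required to reach a target reliability; the single-encounter reliability below is an assumed figure, and real values should come from your own data.

```python
# Sketch: encounters needed for a dependable mini-CEX rotation score,
# via the Spearman-Brown formula solved for the lengthening factor.
import math

single_encounter_rel = 0.25  # assumed reliability of one mini-CEX encounter
target_rel = 0.80            # desired reliability for the rotation score

factor = (target_rel * (1 - single_encounter_rel)) / (
    single_encounter_rel * (1 - target_rel)
)
print(f"encounters needed: {math.ceil(factor)}")  # 12 with these assumptions
```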
Appeals vs. Remarking!
• Again – a pre-defined process
• Tending to make a negative decision
  – Candidate's right to the last word before the final decision
    • Where does that take place? Must plan this!
  – Differentiate decision-making from rescoring
    • Requires an independent 'ombudsperson'
• Other common issues
Delivering the News
• Depends on the purpose and desired use
• Context driven
• In a high-stakes situation at a specific faculty – may want a two-step process
  – Tending to a negative decision:
    • Notion of the right of the candidate to the last word before a decision is made: he/she has the right to provide evidence that addresses the board's concerns
  – Final decision
• Comments/queries?
Key Lessons: Re-cap
• Purpose and use of the result
• Overview of due process – in promotion
• Overview of validity – prefer Kane's approach
• Scoring component of validity
• Generalization and extrapolation
  – True score variance ↑ and error variance ↓ (see the identity below)
• Interpretation/decision-making components of validity
• Know 'due process'
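The recap's note that generalization improves as true-score variance rises and error variance falls is the classical test-theory definition of reliability, shown here for completeness:

```latex
\text{reliability} = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{true}} + \sigma^{2}_{\text{error}}}
```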
Are you ready?
• Are the faculty clear on the ultimate use and purpose of the test or exam?
• How will you track the issues to be resolved?
• Have you defined the major feasibility challenges at your institution – and a plan to meet them?
• Do you have a process to assure valid scoring and interpretation of the result?
• Do you have support and back-up?
Summary and Questions
Thank You!
References
Clauser BE, Margolis MJ, Swanson DB. (2008). Issues of Validity and Reliability for Assessments in Medical Education. In: Hawkins R, Holmboe ES, eds. Practical Guide to the Evaluation of Clinical Competence. Mosby.
Pangaro L, Holmboe ES. (2008). Evaluation Forms and Global Rating Forms. In: Hawkins R, Holmboe ES, eds. Practical Guide to the Evaluation of Clinical Competence. Mosby.
Newble D, Dawson-Saunders B, Dauphinee WD, et al. (1994). Guidelines for Assessing Clinical Competence. Teaching and Learning in Medicine 6(3): 213-220.
Kane MT. (1992). An Argument-Based Approach to Validity. Psychological Bulletin 112(3): 527-535.
Downing S. (2003). Validity: on the meaningful interpretation of assessment data. Medical Education 37: 830-837.
Norcini J. (2003). Work based assessment. BMJ 326: 753-755.
Smee S. (2003). Skill based assessment. BMJ 326: 703-706.