Comparative Research on Training Simulators in Emergency Medicine: A Methodological Review
Comparative Research on
Training Simulators in Emergency Medicine:
A Methodological Review
Matt Lineberry, Ph.D., Research Psychologist, NAWCTSD
Medical Technology, Training, & Treatment (MT3), May 2012
Credits and Disclaimers
• Co-authors
– Melissa Walwanis, Senior Research Psychologist, NAWCTSD
– Joseph Reni, Research Psychologist, NAWCTSD
• These are my professional views, not necessarily those of NAWCTSD, NAVMED, etc.
Objectives
• Motivate the conduct of comparative research in simulation-based training (SBT) for healthcare
• Identify challenges evident from past comparative research
• Promote stronger research methodologies in future work
Cook et al. (2011) meta-analysis in JAMA:
“…we question the need for further studies comparing simulation with no intervention (ie, single-group pretest-posttest studies and comparisons with no-intervention controls).
…theory-based comparisons between different technology-enhanced simulation designs (simulation vs. simulation studies) that minimize bias, achieve appropriate power, and avoid confounding… are necessary”
Issenberg et al. (2011) research agenda in SIH:
“…studies that compare simulation training to traditional training or no training (as is often the case in control groups), in which the goal is to justify its use or prove it can work, do little to advance the field of human learning and training.”
Moving forward: comparative research
• How do varying degrees and types of fidelity affect learning?
• Are some simulation approaches or modalities superior to others? For what learning objectives? Which learners? Which tasks?
• How do cost and throughput considerations affect the utility of different approaches?
Where are we now?
Searched for peer-reviewed studies comparing the training effectiveness of simulation approaches and/or measured practice on human patients for emergency medical skills
• Searched PubMed and CINAHL using the terms mannequin, manikin, animal, cadaver, simulat*, virtual reality, VR, compar*, versus, and VS
• Exhaustively searched Simulation in Healthcare
• Among identified studies, searched references forward and backward
Reviewed studies
17 studies met criteria
• Procedure trained: predominantly needle access (7 studies); also 4 airway adjunct, 3 TEAM, 2 FAST, etc.
• Simulators compared: predominantly manikins, VR systems, and part-task trainers
Reviewed studies
• Design: almost entirely between-subjects (16 of 17)
• Trainee performance measurement:
– 7 were post-test only; all others included pre-tests
– Most (9 studies) used expert ratings; also: knowledge tests (7), success/failure (6), and objective criteria (5)
– 6 studies tested trainees on actual patients; 6 tested trainees on one of the simulators used in training
Apparent methodological challenges
1. Inherently smaller differences between conditions – and consequently, underpowered designs
2. An understandable desire to “prove the null” – but inappropriate approaches to testing equivalence
3. Difficulty measuring or approximating the ultimate criterion: performance on the job
Challenge #1: Detecting “small” differences
• Cook et al. (2011) meta-analysis: differences in outcomes of roughly 0.5–1.2 standard deviations, favoring simulation-based training over no simulation. Comparative research should expect smaller differences than these.
• HOWEVER, small differences can have great practical significance if they…
– correspond to important outcomes (e.g., morbidity or mortality),
– can be exploited widely, and/or
– can be exploited inexpensively.
The power of small differences…
• Physicians’ Health Study: the aspirin trial was halted prematurely due to an obvious benefit for heart attack reduction
– Effect size: r = .034
– Of 22,000 participants, there were 85 fewer heart attacks in the aspirin group
…and the tyranny of small differences
• The probability of detecting differences (power) drops sharply as effect size decreases
• We generally can’t control effect sizes. Among other things, we can control:
– Sample size
– Reliability of measurement
– Chosen error rates
Sample size
• Among reviewed studies, n ranges from 8 to 62; median n = 15
• If n = 15, α = .05, the true difference = 0.2 SDs, and measurement is perfectly reliable, the probability of detecting the difference is only 13%
RECOMMENDATION: Pool resources in multi-site collaborations to achieve the power needed to detect effects (and estimate power requirements a priori)
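The 13% figure can be reproduced approximately with a back-of-the-envelope calculation. The sketch below uses the normal approximation to a two-sample comparison of means (exact t-based power differs slightly); the slide's figure corresponds to a one-sided test:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def two_sample_power(n_per_group, d, alpha=0.05, two_sided=False):
    """Normal-approximation power of a two-sample comparison of means,
    where d is the true standardized difference (Cohen's d)."""
    ncp = d * math.sqrt(n_per_group / 2)  # expected z-statistic under the alternative
    if two_sided:
        z_crit = Z.inv_cdf(1 - alpha / 2)
        return (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)
    z_crit = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(z_crit - ncp)

def n_per_group_for_power(d, target=0.80, alpha=0.05, two_sided=True):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while two_sample_power(n, d, alpha, two_sided) < target:
        n += 1
    return n

print(round(two_sample_power(15, 0.2), 3))  # roughly 0.14: badly underpowered
print(n_per_group_for_power(0.2))           # hundreds per group are needed
```

The second call makes the case for multi-site pooling concrete: detecting a 0.2 SD difference with conventional error rates requires far more trainees than any single reviewed study enrolled.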
Reliability of measurement
• Potential rater errors are numerous
• Typical statistical estimates can be uninformative (e.g., coefficient alpha, inter-rater correlations)
• If measures are unreliable – and especially if samples are also small – you’ll almost always fail to find differences, whether they exist or not
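That last point can be made vivid with a small Monte Carlo sketch. The numbers here are hypothetical (n = 30 per group, a large true effect of 0.8 SDs); measurement error is injected so that true-score variance is only a chosen fraction (the reliability) of observed-score variance:

```python
import math
import random
from statistics import mean, variance

def simulated_detection_rate(n_per_group, d, reliability, trials=2000):
    """Monte Carlo: fraction of simulated two-group experiments (true
    effect d, in true-score SD units) whose z-statistic exceeds 1.96,
    when each observed score = true score + rating noise. Noise variance
    is chosen so that true variance / observed variance = reliability."""
    noise_sd = math.sqrt(1 / reliability - 1)
    hits = 0
    for _ in range(trials):
        g1 = [random.gauss(0, 1) + random.gauss(0, noise_sd)
              for _ in range(n_per_group)]
        g2 = [random.gauss(d, 1) + random.gauss(0, noise_sd)
              for _ in range(n_per_group)]
        se = math.sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
        if abs((mean(g2) - mean(g1)) / se) > 1.96:
            hits += 1
    return hits / trials

random.seed(7)
print(simulated_detection_rate(30, 0.8, reliability=1.0))  # high detection rate
print(simulated_detection_rate(30, 0.8, reliability=0.5))  # much lower
```

Even a large true difference goes undetected much of the time once half the observed variance is rater noise; with the small samples typical of the reviewed studies, the loss is worse still.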
Reliability of measurement
Among the nine studies using expert ratings:
• Only two used multiple raters for all participants
• Six studies did not estimate reliability at all
• One study reported an inter-rater reliability coefficient, and two reported correlations between raters’ scores – both approaches make unfounded assumptions
• Ratings were never collected on multiple occasions
Reliability of measurement
RECOMMENDATIONS:
1. Use robust measurement protocols – e.g., frame-of-reference rater training, multiple raters
2. For expert ratings, use generalizability (G) theory to estimate and improve reliability
G-theory respects a basic truth: “reliability” is not a single value associated with a measurement tool. Rather, it depends on how you conduct measurement, who is being measured, the type of comparison for which you use the scores, etc.
G-theory process, in a nutshell
1. Collect ratings, using an experimental design to expose sources of error (e.g., have multiple raters give ratings, on multiple occasions)
2. Use ANOVA to estimate the magnitude of each error source
3. Given the results from step 2, forecast what reliability will result from different combinations of raters, occasions, etc.
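The three steps above can be sketched for the simplest design – persons fully crossed with raters, one occasion, one rating per cell. The ratings below are hypothetical; variance components come from the usual expected mean squares of a two-way ANOVA:

```python
from statistics import mean

def g_study(scores):
    """Step 2: variance components from a fully crossed persons-x-raters
    design (scores[p][r] = rating of person p by rater r)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    p_means = [mean(row) for row in scores]
    r_means = [mean(row[r] for row in scores) for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_e = (ss_total - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    var_p = max(0.0, (ms_p - ms_e) / n_r)               # person (true-score) variance
    var_r = max(0.0, (ss_r / (n_r - 1) - ms_e) / n_p)   # rater leniency variance
    return var_p, var_r, ms_e                           # ms_e estimates residual error

def g_coefficient(var_p, var_resid, k_raters):
    """Step 3: forecast relative reliability when averaging over k raters."""
    return var_p / (var_p + var_resid / k_raters)

ratings = [[7, 8], [5, 6], [3, 4], [9, 9], [2, 3]]  # 5 trainees x 2 raters (made up)
var_p, var_r, var_e = g_study(ratings)
print(round(g_coefficient(var_p, var_e, 1), 3))  # reliability with one rater
print(round(g_coefficient(var_p, var_e, 2), 3))  # forecast with two raters
```

Note how the G coefficient is not a property of the rating form: the same variance components yield a different forecast for each number of raters, which is exactly the "reliability is not a single value" point.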
Weighted scoring
• Two studies used weighting schemes – more points associated with more critical procedural steps
– Can improve both reliability and validity
• RECOMMENDATION: Use task-analytic procedures to identify the criticality of subtasks; weight scores accordingly
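A weighted checklist score is simple to implement once the task analysis has assigned criticality weights. The checklist steps and weights below are purely illustrative, not from a real task analysis:

```python
# Hypothetical checklist for a needle-access procedure; steps and
# criticality weights are illustrative only.
CHECKLIST = [
    ("identifies anatomical landmarks", 3),   # critical
    ("maintains sterile field", 3),           # critical
    ("inserts needle at correct angle", 2),
    ("confirms placement", 2),
    ("documents the procedure", 1),           # least critical
]

def weighted_score(checklist, steps_performed):
    """Percentage score in which critical steps earn more points."""
    total = sum(weight for _, weight in checklist)
    earned = sum(weight for step, weight in checklist
                 if step in steps_performed)
    return 100.0 * earned / total

done = {"identifies anatomical landmarks",
        "maintains sterile field",
        "confirms placement"}
print(round(weighted_score(CHECKLIST, done), 1))  # higher than a flat 3-of-5 = 60%
```

Because the two critical steps were performed, this trainee scores above the unweighted 60%; a trainee who skipped a critical step would score below it, which is the validity gain the slide describes.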
Selecting error rates
Why do we choose α = .05 as the threshold for statistical significance?
Relative severity of errors
Type I error: “Simulator X is more effective than Simulator Y” (but really, they’re equally effective)
– Potential outcome: largely trivial; both are equally effective, so erroneously favoring one does not affect learning or patient outcomes
Type II error: “Simulators X and Y are equally effective” (but really, Simulator X is superior)
– Potential outcome: adverse effects on learning and patient outcomes if Simulator X is consequently underutilized
Type I error rate: α = .05
Type II error rate: β = 1 − power (e.g., 1 − .80 = .20)
Relative severity of errors• RECOMMENDATION:
Particularly in a new line of research, adopt an alpha level that rationally balances inferential errors according to their severity
Cascio, W. F., & Zedeck, S. (1983). Open a new window in rational research planning: Adjust alpha to maximize statistical power. Personnel Psychology, 36, 517-526.
Murphy, K. (2004). Using power analysis to evaluate and improve research. In S.G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (Chapter 6, pp. 119-137). Malden, MA: Blackwell.
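One simple way to operationalize this recommendation – a sketch, not necessarily Cascio and Zedeck's exact procedure – is to pick the α at which the severity-weighted Type I and Type II error rates are equal, given the design's n and a plausible effect size:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def type2_rate(alpha, n_per_group, d):
    """Type II error rate of a two-sided two-sample z-test at level alpha
    (normal approximation), for true standardized difference d."""
    ncp = d * math.sqrt(n_per_group / 2)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    power = (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)
    return 1 - power

def balanced_alpha(n_per_group, d, rel_type2_severity=1.0):
    """Bisect for the alpha where alpha = rel_type2_severity * beta, so
    the two error rates, weighted by judged severity, are equal.
    rel_type2_severity > 1 means a Type II error is judged more serious,
    which pushes the balanced alpha higher."""
    lo, hi = 1e-6, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if rel_type2_severity * type2_rate(mid, n_per_group, d) > mid:
            lo = mid  # beta still too large relative to alpha: raise alpha
        else:
            hi = mid
    return (lo + hi) / 2

print(round(balanced_alpha(15, 0.2), 2))   # tiny n forces a huge error rate somewhere
print(round(balanced_alpha(100, 0.5), 2))  # near the conventional .05
```

With the median reviewed design (n = 15) and a small effect, no α choice escapes large errors – the rational "balance point" is enormous – whereas a well-powered design lands near the convention. That is the sense in which α = .05 is only rational for some designs.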
Challenge #2: Proving the null
• Language in studies often reflects a desire to assert equivalence – e.g., that different simulators are “reaching parity”
• Standard null-hypothesis significance testing (NHST) does not support this assertion
– Failure to detect effects should prompt reservation of judgment, not acceptance of the null hypothesis
Which assertion is more bold?
• “Sim X is more effective than Sim Y”
• “Sims X and Y are equally effective”
(Figure: for each assertion, a confidence interval drawn on a number line running from “Y favored” through 0 to “X favored”)
Proving the null
• It is possible to “prove” the null:
– Set a region of practical equivalence around zero
– Evaluate whether all plausible differences (e.g., the 95% confidence interval) fall within that region
• RECOMMENDATIONS:
– Avoid unjustified acceptance of the null
– Use strong tests of equivalence when hoping to assert equivalence
– Be explicit about what effect size you would consider practically significant, and why
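The equivalence logic above can be sketched as a confidence-interval check – a simplified stand-in for formal equivalence testing such as TOST, with `rope` standing for the a-priori region of practical equivalence:

```python
import math
from statistics import mean, variance

def diff_ci_95(x, y):
    """Normal-approximation 95% CI for mean(x) - mean(y)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    d = mean(x) - mean(y)
    return d - 1.96 * se, d + 1.96 * se

def verdict(x, y, rope):
    """'equivalent' only if the WHOLE CI sits inside the region of
    practical equivalence (-rope, +rope); 'different' if the CI excludes
    zero; otherwise reserve judgment, as standard NHST alone requires."""
    lo, hi = diff_ci_95(x, y)
    if -rope < lo and hi < rope:
        return "equivalent"
    if lo > 0 or hi < 0:
        return "different"
    return "withhold judgment"

# Small, noisy hypothetical samples: the CI is wide and spans zero,
# so neither difference nor equivalence can be asserted.
print(verdict([1, 2, 3, 4, 5], [1.5, 2.5, 3.5, 4.0, 5.0], rope=0.5))
```

The three-way verdict is the key design choice: a nonsignificant difference and demonstrated equivalence are different outcomes, and small noisy studies will usually land in the "withhold judgment" bin rather than either of them.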
Challenge #3: Getting to the ultimate criterion
• The goal is not test performance but job performance; “the map is not the territory”
• It is typical to test demonstration of procedures, often on a simulator
– Will trainees perform similarly on actual patients, under authentic work conditions?
– Do trainees know when to execute the procedure?
– Are trainees willing to act promptly?
e.g., Roberts et al. (1997)
• No differences detected in the rate of successful laryngeal mask airway placement for manikin vs. manikin-plus-live-patient training
– However: confidence was very low, and increased only with live-patient practice
• “…if a nurse does not feel confident enough… the patient will initially receive pocket-mask or bag-mask ventilation, and this is clearly less desirable”
• An issue of willingness to act decisively
Criterion relevance
• RECOMMENDATION: Where possible, use criterion testbeds that correspond closely to actual job performance
– Assess performance on human patients/volunteers
– Replicate performance-shaping factors (not just the environment)
– Test knowledge of indications and willingness to act
What if patients can’t be used?
• Using simulators as the criterion testbed introduces potential biases
– e.g., train on a cadaver or a manikin; test on a different manikin
A partial solution: crossed-criterion design
• Advantages
– Mitigates bias
– Allows comparison of how learning from each training condition generalizes
• Disadvantages
– Precludes pre-testing, if pre-test exposure to each simulator is lengthy enough to confer learning benefits
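The analysis of a crossed-criterion design can be sketched with hypothetical data: each training group is tested on both simulators, so the training-condition comparison can be averaged over testbeds and the same-testbed advantage can be estimated separately. All scores below are invented for illustration:

```python
from statistics import mean

# Hypothetical crossed-criterion results: scores[(trained_on, tested_on)]
scores = {
    ("train_A", "test_A"): [82, 78, 85, 80],
    ("train_A", "test_B"): [70, 68, 74, 71],
    ("train_B", "test_A"): [75, 73, 77, 74],
    ("train_B", "test_B"): [79, 81, 78, 80],
}

def training_effect_unbiased(scores):
    """Average each training group over BOTH testbeds, so neither group
    benefits from being tested on the simulator it trained on."""
    a = mean(scores[("train_A", "test_A")] + scores[("train_A", "test_B")])
    b = mean(scores[("train_B", "test_A")] + scores[("train_B", "test_B")])
    return a - b

def same_testbed_bias(scores):
    """How much better groups perform on their own training simulator."""
    own = mean(scores[("train_A", "test_A")] + scores[("train_B", "test_B")])
    other = mean(scores[("train_A", "test_B")] + scores[("train_B", "test_A")])
    return own - other

print(round(training_effect_unbiased(scores), 2))  # small training-condition difference
print(round(same_testbed_bias(scores), 2))         # large same-simulator advantage
```

In this made-up dataset a naive single-testbed comparison would have crowned whichever simulator served as the criterion; crossing the testbeds exposes that most of the apparent difference is same-simulator familiarity, not superior learning.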
Conclusions
• “The greatest enemy of a good plan is the dream of a perfect plan”
• All previous comparative research is to be lauded for pushing the field forward
• Concrete steps can be taken to maximize the theoretical and practical value of future comparative research