Comparative Research on Training Simulators in Emergency Medicine: A Methodological Review
Comparative Research on
Training Simulators in Emergency Medicine:
A Methodological Review
Matt Lineberry, Ph.D., Research Psychologist, NAWCTSD
Medical Technology, Training, & Treatment (MT3), May 2012
Credits and Disclaimers
• Co-authors
– Melissa Walwanis, Senior Research Psychologist, NAWCTSD
– Joseph Reni, Research Psychologist, NAWCTSD
• These are my professional views, not necessarily those of NAWCTSD, NAVMED, etc.
Objectives
• Motivate the conduct of comparative research in simulation-based training (SBT) for healthcare
• Identify challenges evident from past comparative research
• Promote stronger research methodologies in future work
Cook et al. (2011) meta-analysis in JAMA:
“…we question the need for further studies comparing simulation with no intervention (ie, single-group pretest-posttest studies and comparisons with no-intervention controls).
…theory-based comparisons between different technology-enhanced simulation designs (simulation vs. simulation studies) that minimize bias, achieve appropriate power, and avoid confounding… are necessary”
Issenberg et al. (2011) research agenda in SIH:
“…studies that compare simulation training to traditional training or no training (as is often the case in control groups), in which the goal is to justify its use or prove it can work, do little to advance the field of human learning and training.”
Moving forward: comparative research
• How do varying degrees and types of fidelity affect learning?
• Are some simulation approaches or modalities superior to others? For what learning objectives? Which learners? Which tasks?
• How do cost and throughput considerations affect the utility of different approaches?
Where are we now?
Searched for peer-reviewed studies comparing the training effectiveness of simulation approaches and/or measured practice on human patients for emergency medical skills
• Searched PubMed and CINAHL using the terms mannequin, manikin, animal, cadaver, simulat*, virtual reality, VR, compar*, versus, and VS
• Exhaustively searched Simulation in Healthcare
• Among identified studies, searched references forward and backward
Reviewed studies
17 studies met criteria
• Procedure trained: predominantly needle access (7 studies); also 4 airway adjunct, 3 TEAM, 2 FAST, etc.
• Simulators compared: predominantly manikins, VR systems, and part-task trainers
Reviewed studies
• Design: almost entirely between-subjects (16 of 17)
• Trainee performance measurement:
– 7 were post-test only; all others included pre-tests
– Most (9 studies) used expert ratings; also: knowledge tests (7), success/failure (6), and objective criteria (5)
– 6 studies tested trainees on actual patients; 6 tested trainees on one of the simulators used in training
Apparent methodological challenges
1. Inherently smaller differences between conditions – and consequently, underpowered designs
2. An understandable desire to “prove the null” – but inappropriate approaches to testing equivalence
3. Difficulty measuring or approximating the ultimate criterion: performance on the job
Challenge #1: Detecting “small” differences
• Cook et al. (2011) meta-analysis: differences in outcomes of roughly 0.5–1.2 standard deviations, favoring simulation-based training over no simulation. Comparative research should expect smaller differences than these.
• HOWEVER, small differences can have great practical significance if they…
– correspond to important outcomes (e.g., morbidity or mortality),
– can be exploited widely, and/or
– can be exploited inexpensively.
The power of small differences…
• Physicians’ Health Study: the aspirin trial was halted prematurely due to an obvious benefit for heart attack reduction
– Effect size: r = .034
– Of 22,000 participants, there were 85 fewer heart attacks in the aspirin group
…and the tyranny of small differences
• The probability of detecting differences (power) drops sharply as effect size decreases
• We generally can’t control effect sizes. Among other things, we can control:
– Sample size
– Reliability of measurement
– Chosen error rates
Sample size
• Among reviewed studies, n ranges from 8 to 62; median n = 15
• If n = 15, α = .05, the true difference = 0.2 SDs, and measurement is perfectly reliable, the probability of detecting the difference is only 13%
RECOMMENDATION: Pool resources in multi-site collaborations to achieve the power needed to detect effects (and estimate power requirements a priori)
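The 13% figure can be reproduced approximately with a back-of-the-envelope calculation. The sketch below uses the normal approximation to a two-sample comparison of means (exact t-based power differs slightly); the slide's figure corresponds to a one-sided test:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def two_sample_power(n_per_group, d, alpha=0.05, two_sided=False):
    """Normal-approximation power of a two-sample comparison of means,
    where d is the true standardized difference (Cohen's d)."""
    ncp = d * math.sqrt(n_per_group / 2)  # expected z-statistic under the alternative
    if two_sided:
        z_crit = Z.inv_cdf(1 - alpha / 2)
        return (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)
    z_crit = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(z_crit - ncp)

def n_per_group_for_power(d, target=0.80, alpha=0.05, two_sided=True):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while two_sample_power(n, d, alpha, two_sided) < target:
        n += 1
    return n

print(round(two_sample_power(15, 0.2), 3))  # roughly 0.14: badly underpowered
print(n_per_group_for_power(0.2))           # hundreds per group are needed
```

The second call makes the case for multi-site pooling concrete: detecting a 0.2 SD difference with conventional error rates requires far more trainees than any single reviewed study enrolled.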
Reliability of measurement
• Potential rater errors are numerous
• Typical statistical estimates can be uninformative (e.g., coefficient alpha, inter-rater correlations)
• If measures are unreliable – and especially if samples are also small – you’ll almost always fail to find differences, whether they exist or not
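That last point can be made vivid with a small Monte Carlo sketch. The numbers here are hypothetical (n = 30 per group, a large true effect of 0.8 SDs); measurement error is injected so that true-score variance is only a chosen fraction (the reliability) of observed-score variance:

```python
import math
import random
from statistics import mean, variance

def simulated_detection_rate(n_per_group, d, reliability, trials=2000):
    """Monte Carlo: fraction of simulated two-group experiments (true
    effect d, in true-score SD units) whose z-statistic exceeds 1.96,
    when each observed score = true score + rating noise. Noise variance
    is chosen so that true variance / observed variance = reliability."""
    noise_sd = math.sqrt(1 / reliability - 1)
    hits = 0
    for _ in range(trials):
        g1 = [random.gauss(0, 1) + random.gauss(0, noise_sd)
              for _ in range(n_per_group)]
        g2 = [random.gauss(d, 1) + random.gauss(0, noise_sd)
              for _ in range(n_per_group)]
        se = math.sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
        if abs((mean(g2) - mean(g1)) / se) > 1.96:
            hits += 1
    return hits / trials

random.seed(7)
print(simulated_detection_rate(30, 0.8, reliability=1.0))  # high detection rate
print(simulated_detection_rate(30, 0.8, reliability=0.5))  # much lower
```

Even a large true difference goes undetected much of the time once half the observed variance is rater noise; with the small samples typical of the reviewed studies, the loss is worse still.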
Reliability of measurement
Among the nine studies using expert ratings:
• Only two used multiple raters for all participants
• Six studies did not estimate reliability at all
• One study reported an inter-rater reliability coefficient, and two reported correlations between raters’ scores – both approaches make unfounded assumptions
• Ratings were never collected on multiple occasions
Reliability of measurement
RECOMMENDATIONS:
1. Use robust measurement protocols – e.g., frame-of-reference rater training, multiple raters
2. For expert ratings, use generalizability (G) theory to estimate and improve reliability
G-theory respects a basic truth: “reliability” is not a single value associated with a measurement tool. Rather, it depends on how you conduct measurement, who is being measured, the type of comparison for which you use the scores, etc.
G-theory process, in a nutshell
1. Collect ratings, using an experimental design to expose sources of error (e.g., have multiple raters give ratings, on multiple occasions)
2. Use ANOVA to estimate the magnitude of each error source
3. Given the results from step 2, forecast what reliability will result from different combinations of raters, occasions, etc.
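The three steps above can be sketched for the simplest design – persons fully crossed with raters, one occasion, one rating per cell. The ratings below are hypothetical; variance components come from the usual expected mean squares of a two-way ANOVA:

```python
from statistics import mean

def g_study(scores):
    """Step 2: variance components from a fully crossed persons-x-raters
    design (scores[p][r] = rating of person p by rater r)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    p_means = [mean(row) for row in scores]
    r_means = [mean(row[r] for row in scores) for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_e = (ss_total - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    var_p = max(0.0, (ms_p - ms_e) / n_r)               # person (true-score) variance
    var_r = max(0.0, (ss_r / (n_r - 1) - ms_e) / n_p)   # rater leniency variance
    return var_p, var_r, ms_e                           # ms_e estimates residual error

def g_coefficient(var_p, var_resid, k_raters):
    """Step 3: forecast relative reliability when averaging over k raters."""
    return var_p / (var_p + var_resid / k_raters)

ratings = [[7, 8], [5, 6], [3, 4], [9, 9], [2, 3]]  # 5 trainees x 2 raters (made up)
var_p, var_r, var_e = g_study(ratings)
print(round(g_coefficient(var_p, var_e, 1), 3))  # reliability with one rater
print(round(g_coefficient(var_p, var_e, 2), 3))  # forecast with two raters
```

Note how the G coefficient is not a property of the rating form: the same variance components yield a different forecast for each number of raters, which is exactly the "reliability is not a single value" point.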
Weighted scoring
• Two studies used weighting schemes – more points associated with more critical procedural steps
– Can improve both reliability and validity
• RECOMMENDATION: Use task-analytic procedures to identify the criticality of subtasks; weight scores accordingly
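A weighted checklist score is simple to implement once the task analysis has assigned criticality weights. The checklist steps and weights below are purely illustrative, not from a real task analysis:

```python
# Hypothetical checklist for a needle-access procedure; steps and
# criticality weights are illustrative only.
CHECKLIST = [
    ("identifies anatomical landmarks", 3),   # critical
    ("maintains sterile field", 3),           # critical
    ("inserts needle at correct angle", 2),
    ("confirms placement", 2),
    ("documents the procedure", 1),           # least critical
]

def weighted_score(checklist, steps_performed):
    """Percentage score in which critical steps earn more points."""
    total = sum(weight for _, weight in checklist)
    earned = sum(weight for step, weight in checklist
                 if step in steps_performed)
    return 100.0 * earned / total

done = {"identifies anatomical landmarks",
        "maintains sterile field",
        "confirms placement"}
print(round(weighted_score(CHECKLIST, done), 1))  # higher than a flat 3-of-5 = 60%
```

Because the two critical steps were performed, this trainee scores above the unweighted 60%; a trainee who skipped a critical step would score below it, which is the validity gain the slide describes.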
Selecting error rates
Why do we choose α = .05 as the threshold for statistical significance?
Relative severity of errors
Type I error: “Simulator X is more effective than Simulator Y” (but really, they’re equally effective)
– Potential outcome: largely trivial; both are equally effective, so erroneously favoring one does not affect learning or patient outcomes
Type II error: “Simulators X and Y are equally effective” (but really, Simulator X is superior)
– Potential outcome: adverse effects on learning and patient outcomes if Simulator X is consequently underutilized
Type I error rate: α = .05
Type II error rate: β = 1 − power (e.g., 1 − .80 = .20)
Relative severity of errors• RECOMMENDATION:
Particularly in a new line of research, adopt an alpha level that rationally balances inferential errors according to their severity
Cascio, W. F., & Zedeck, S. (1983). Open a new window in rational research planning: Adjust alpha to maximize statistical power. Personnel Psychology, 36, 517-526.
Murphy, K. (2004). Using power analysis to evaluate and improve research. In S.G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (Chapter 6, pp. 119-137). Malden, MA: Blackwell.
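One simple way to operationalize this recommendation – a sketch, not necessarily Cascio and Zedeck's exact procedure – is to pick the α at which the severity-weighted Type I and Type II error rates are equal, given the design's n and a plausible effect size:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def type2_rate(alpha, n_per_group, d):
    """Type II error rate of a two-sided two-sample z-test at level alpha
    (normal approximation), for true standardized difference d."""
    ncp = d * math.sqrt(n_per_group / 2)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    power = (1 - Z.cdf(z_crit - ncp)) + Z.cdf(-z_crit - ncp)
    return 1 - power

def balanced_alpha(n_per_group, d, rel_type2_severity=1.0):
    """Bisect for the alpha where alpha = rel_type2_severity * beta, so
    the two error rates, weighted by judged severity, are equal.
    rel_type2_severity > 1 means a Type II error is judged more serious,
    which pushes the balanced alpha higher."""
    lo, hi = 1e-6, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if rel_type2_severity * type2_rate(mid, n_per_group, d) > mid:
            lo = mid  # beta still too large relative to alpha: raise alpha
        else:
            hi = mid
    return (lo + hi) / 2

print(round(balanced_alpha(15, 0.2), 2))   # tiny n forces a huge error rate somewhere
print(round(balanced_alpha(100, 0.5), 2))  # near the conventional .05
```

With the median reviewed design (n = 15) and a small effect, no α choice escapes large errors – the rational "balance point" is enormous – whereas a well-powered design lands near the convention. That is the sense in which α = .05 is only rational for some designs.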
Challenge #2: Proving the null
• Language in studies often reflects a desire to assert equivalence – e.g., that different simulators are “reaching parity”
• Standard null-hypothesis significance testing (NHST) does not support this assertion
– Failure to detect effects should prompt reservation of judgment, not acceptance of the null hypothesis
Which assertion is more bold?
• “Sim X is more effective than Sim Y”
• “Sims X and Y are equally effective”
(Figure: for each assertion, a confidence interval drawn on a number line running from “Y favored” through 0 to “X favored”)
Proving the null
• It is possible to “prove” the null:
– Set a region of practical equivalence around zero
– Evaluate whether all plausible differences (e.g., the 95% confidence interval) fall within that region
• RECOMMENDATIONS:
– Avoid unjustified acceptance of the null
– Use strong tests of equivalence when hoping to assert equivalence
– Be explicit about what effect size you would consider practically significant, and why
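The equivalence logic above can be sketched as a confidence-interval check – a simplified stand-in for formal equivalence testing such as TOST, with `rope` standing for the a-priori region of practical equivalence:

```python
import math
from statistics import mean, variance

def diff_ci_95(x, y):
    """Normal-approximation 95% CI for mean(x) - mean(y)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    d = mean(x) - mean(y)
    return d - 1.96 * se, d + 1.96 * se

def verdict(x, y, rope):
    """'equivalent' only if the WHOLE CI sits inside the region of
    practical equivalence (-rope, +rope); 'different' if the CI excludes
    zero; otherwise reserve judgment, as standard NHST alone requires."""
    lo, hi = diff_ci_95(x, y)
    if -rope < lo and hi < rope:
        return "equivalent"
    if lo > 0 or hi < 0:
        return "different"
    return "withhold judgment"

# Small, noisy hypothetical samples: the CI is wide and spans zero,
# so neither difference nor equivalence can be asserted.
print(verdict([1, 2, 3, 4, 5], [1.5, 2.5, 3.5, 4.0, 5.0], rope=0.5))
```

The three-way verdict is the key design choice: a nonsignificant difference and demonstrated equivalence are different outcomes, and small noisy studies will usually land in the "withhold judgment" bin rather than either of them.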
Challenge #3: Getting to the ultimate criterion
• The goal is not test performance but job performance; “the map is not the territory”
• It is typical to test demonstration of procedures, often on a simulator
– Will trainees perform similarly on actual patients, under authentic work conditions?
– Do trainees know when to execute the procedure?
– Are trainees willing to act promptly?
e.g., Roberts et al. (1997)
• No differences detected in the rate of successful laryngeal mask airway placement for manikin vs. manikin-plus-live-patient training
– However: confidence was very low, and increased only with live-patient practice
• “…if a nurse does not feel confident enough… the patient will initially receive pocket-mask or bag-mask ventilation, and this is clearly less desirable”
• An issue of willingness to act decisively
Criterion relevance
• RECOMMENDATION: Where possible, use criterion testbeds that correspond closely to actual job performance
– Assess performance on human patients/volunteers
– Replicate performance-shaping factors (not just the environment)
– Test knowledge of indications and willingness to act
What if patients can’t be used?
• Using simulators as the criterion testbed introduces potential biases
– e.g., train on a cadaver or a manikin; test on a different manikin
A partial solution: crossed-criterion design
• Advantages
– Mitigates bias
– Allows comparison of how learning from each training condition generalizes
• Disadvantages
– Precludes pre-testing, if pre-test exposure to each simulator is lengthy enough to confer learning benefits
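The analysis of a crossed-criterion design can be sketched with hypothetical data: each training group is tested on both simulators, so the training-condition comparison can be averaged over testbeds and the same-testbed advantage can be estimated separately. All scores below are invented for illustration:

```python
from statistics import mean

# Hypothetical crossed-criterion results: scores[(trained_on, tested_on)]
scores = {
    ("train_A", "test_A"): [82, 78, 85, 80],
    ("train_A", "test_B"): [70, 68, 74, 71],
    ("train_B", "test_A"): [75, 73, 77, 74],
    ("train_B", "test_B"): [79, 81, 78, 80],
}

def training_effect_unbiased(scores):
    """Average each training group over BOTH testbeds, so neither group
    benefits from being tested on the simulator it trained on."""
    a = mean(scores[("train_A", "test_A")] + scores[("train_A", "test_B")])
    b = mean(scores[("train_B", "test_A")] + scores[("train_B", "test_B")])
    return a - b

def same_testbed_bias(scores):
    """How much better groups perform on their own training simulator."""
    own = mean(scores[("train_A", "test_A")] + scores[("train_B", "test_B")])
    other = mean(scores[("train_A", "test_B")] + scores[("train_B", "test_A")])
    return own - other

print(round(training_effect_unbiased(scores), 2))  # small training-condition difference
print(round(same_testbed_bias(scores), 2))         # large same-simulator advantage
```

In this made-up dataset a naive single-testbed comparison would have crowned whichever simulator served as the criterion; crossing the testbeds exposes that most of the apparent difference is same-simulator familiarity, not superior learning.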
Conclusions
• “The greatest enemy of a good plan is the dream of a perfect plan”
• All previous comparative research is to be lauded for pushing the field forward
• Concrete steps can be taken to maximize the theoretical and practical value of future comparative research