
The American Journal of Surgery (2015)

Crowd-sourced assessment of technical skills: an opportunity for improvement in the assessment of laparoscopic surgical skills

Shanley B. Deal, M.D.a,*, Thomas S. Lendvay, M.D., F.A.C.S.b, Mohamad I. Haque, M.D., L.T.C.c, Timothy Brand, M.D., L.T.C.c, Bryan Comstock, M.S.b, Justin Warren, M.B.A.b, Adnan Alseidi, M.D., M.Ed.a

aVirginia Mason Medical Center, Department of General Surgery, Mailstop H8-GME, 1100 9th Ave., Seattle, WA, 98101, USA; bDepartment of Urology, University of Washington, Seattle, WA, USA; and cMadigan Army Medical Center, Department of General Surgery, Tacoma, WA, USA

KEYWORDS: Surgical skills education; Psychomotor skills; Surgical skills assessment; Crowd sourced data


Abstract

BACKGROUND: Objective, unbiased assessment of surgical skills remains a challenge in surgical education. We sought to evaluate the feasibility and reliability of Crowd-Sourced Assessment of Technical Skills.

METHODS: Seven volunteer general surgery interns were given time for training and then testing on laparoscopic peg transfer, precision cutting, and intracorporeal knot-tying. Six faculty experts (FEs) and 203 Amazon.com Mechanical Turk crowd workers (CWs) evaluated 21 deidentified video clips using the Global Objective Assessment of Laparoscopic Skills validated rating instrument.

RESULTS: Within 19 hours and 15 minutes we received 662 eligible ratings from 203 CWs; 126 ratings from 6 FEs were received over 10 days. FE video ratings were of borderline internal consistency (Krippendorff's alpha = .55). FE ratings were highly correlated with CW ratings (Pearson's correlation coefficient = .78, P < .001).

CONCLUSION: We propose the use of Crowd-Sourced Assessment of Technical Skills as a reliable, basic tool to standardize the evaluation of technical skills in general surgery.

© 2015 Elsevier Inc. All rights reserved.

C-SATS served as a study sponsor and donated the use of their product for the completion of this study. In addition, three of the contributing authors, Thomas Lendvay MD FACS, Bryan Comstock MS, and Justin Warren MBA, are C-SATS employees who conducted the C-SATS preparation of videos and analysis of data. Interpretation of data and writing of the manuscript was a collaborative endeavor.

* Corresponding author. Tel.: +1-253-948-6920; fax: +1-206-341-0048.
E-mail address: [email protected]
Manuscript received April 11, 2015; revised manuscript August 31, 2015

0002-9610/$ - see front matter © 2015 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.amjsurg.2015.09.005

Objective and unbiased assessment of surgical skills remains an ongoing challenge in surgical education. Rigorous psychomotor assessment has been neglected, while assessment of knowledge and judgment is commonplace using well-validated, high-stakes assessment tools. Competency-based assessment of surgical graduates is the future of our educational environment. To produce safe, accurate surgeons we must be able to assess their operative performance using a timely, cost-effective, reliable, and valid tool.


When evaluating laparoscopic skills, an assessment tool must demonstrate reliability, validity, and feasibility to be an acceptable tool to determine competency.1 In our current assessment environment, most surgical residency graduates are determined to be competent surgeons by virtue of graduating from a residency program and subsequently passing their written and oral board examinations as facilitated by the American Board of Surgery. Both examinations reliably evaluate cognition and judgment but not psychomotor skill. Because of this, our profession relies on an apprenticeship model of skill acquisition without a measure confirming that every surgeon has demonstrated acquisition of a safe benchmark of technical skills.

Attempts to determine psychomotor skills competency in the past have largely relied on 3 factors: the number of procedures performed as recorded by the Accreditation Council for Graduate Medical Education resident case log system, including case type and number; the speed and relative accuracy of basic laparoscopic skills demonstrated through successful completion of the Fundamentals of Laparoscopic Surgery examination; and subjective evaluation of the resident's operative skills by senior surgical faculty at the resident's institution. Caseload has limitations as a surrogate for surgical skill because the degree of involvement of the surgical trainee may be variable. Also, relying on an overall number of cases does not account for the innate skill of the surgical trainee or the amount of surgical training they may acquire outside the surgical suite.2 Time-based metrics and a proprietary formula can be used for scoring trainees using the Fundamentals of Laparoscopic Surgery program offered by the American College of Surgeons, but this is expensive, labor intensive for grading, limited to specific testing sites, and somewhat less applicable to gynecology and urology.3 Surgical faculty are frequently used for these surgical evaluations, either in real time or, less frequently, for a taped event. These faculty evaluations can be completed using validated instruments or using more subjective measures. There are some limitations to this approach. Faculty frequently have competing demands, so there may be a lag time between when the case is finished and when the evaluation is completed. There may also be personal bias that intercedes to artificially lower or raise the assessment of the trainee's performance.4 These 3 avenues for determining surgical competency may be inadequate and allow for inconsistency among residents, faculty, and programs across the nation. There has been some hope that virtual reality solutions may be useful mechanisms for resident assessment, but these have mostly fallen short.5

The Association for Surgical Education, through a Delphi consensus process, identified determining the best methods/metrics for assessment of technical and nontechnical performance on simulators and in the operating room as one of the top 10 research priorities for 21st century surgical simulation.6 Objective Structured Assessment of Technical Skills is a reliable and valid tool but is time intensive for expert raters, making it less feasible for some centers.7 Because of this, virtual reality simulators have been proposed as a more efficient assessment tool. Currently, assessment tools are derived from the opinion of one or a few expert surgeons, which requires that inter-rater reliability be rigorously examined. These evaluation tools are certainly valuable adjuncts to current training; we set out to assess a new crowd-sourced assessment method that generates rapid, reliable, and feasible data to evaluate surgical skill.

Crowd sourcing is a recent phenomenon that utilizes anonymous crowd workers (CWs) to complete tasks. The "crowd" is a group of independent, diverse, anonymous workers who generate data by completing defined tasks. The Amazon Mechanical Turk is an online work marketplace that recruits affordable, readily available nonexperts to complete tasks. Recent work has proposed crowd sourcing as an alternative, objective method for evaluating operative performance; when used to evaluate robotic suturing performance, CWs could rapidly assess skill at a level equivalent to faculty experts (FEs).8–11 We sought to evaluate the feasibility of Crowd-Sourced Assessment of Technical Skills (C-SATS).

Methods

Seven general surgery interns were invited to participate in the pilot study on a volunteer basis. The interns were instructed and given ample time for proctored training on peg transfer, precision cutting, and intracorporeal knot-tying using a standard laparoscopic box trainer (three standard tasks from the Fundamentals of Laparoscopic Surgery). Interns then performed these tasks on the Electronic Data Generator for Evaluation laparoscopic trainer, which has video recording capability. Each participant completed the set of tasks individually, in a test setting away from other participants. Two facilitators were present to orient the intern, review the series of tasks before recording each task, and answer questions. Edited and deidentified videos were uploaded to a private, secure website in a standardized format. These 21 video clips were evaluated, in random order, by 5 volunteer surgical FEs. A standard evaluation tool was used and is discussed in detail below.

Crowd worker recruitment and selection

Amazon.com Mechanical Turk CWs generated a minimum of 30 evaluations on each of the same 21 video clips. CWs were included based on rater performance history as well as passing a screening and attention test. Rater performance history was determined by including CWs who had a greater than or equal to 95% acceptance rating on historical tasks within the Amazon Mechanical Turk crowd-sourcing platform. The screening test required the CW to watch a short side-by-side video of the Fundamentals of Laparoscopic Surgery block transfer and identify which video showed the surgeon of higher skill, to assess the CW's discriminative ability. To further qualify the CW and ensure their complete attention, an attention question was embedded within the survey. The attention question directed the CW to leave the question unanswered. The responses from any CW who failed either the screening or attention questions were excluded. A sample video interface demonstrating the use of C-SATS for CW selection can be viewed (Video 1).

Technical performance measures

Both evaluator groups used the Global Objective Assessment of Laparoscopic Skills (GOALS) validated rating instrument.12 For both experts and CWs, 5-point Likert rating assessments (1 = worst to 5 = best) were captured on 4 domains. These domains included depth perception, bimanual dexterity, efficiency, and tissue handling, as dictated by the GOALS instrument. GOALS includes autonomy as a fifth domain; however, this was excluded from the assessment. All raters also provided an overall pass or fail assessment for each video. Each rater provided comments to further describe the basis of their rating. These comments were collected and evaluated for common themes and subthemes.
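As a minimal illustration of the rating data this instrument yields, a single rating record and its composite score might be represented as below; the class and field names are hypothetical and are not part of the GOALS instrument or the study's software.

```python
from dataclasses import dataclass


@dataclass
class GoalsRating:
    """One rater's GOALS assessment of one video (autonomy domain omitted, as in this study)."""
    rater_id: str
    video_id: str
    depth_perception: int    # 1 (worst) to 5 (best)
    bimanual_dexterity: int
    efficiency: int
    tissue_handling: int
    passed: bool             # overall pass/fail judgment
    comment: str = ""

    def total(self) -> int:
        """Composite GOALS score: sum of the 4 domains, ranging from 4 to 20."""
        return (self.depth_perception + self.bimanual_dexterity
                + self.efficiency + self.tissue_handling)
```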

Data collection

Crowd-sourced data collection was adapted from the study and methods described by Chen et al.8–11 Eligible CWs were able to access the rating task through Amazon Mechanical Turk. For the task, a custom web-based video display and survey tool (Zoho Corp., Pleasanton, CA) was built such that the faculty panel and CWs utilized the same interface for blinded review of the videos. This interface also time-stamped the moment a reviewer's page loaded and all survey click times for each video review session. Total video review time was derived as the time between page load and the last survey click.
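A minimal sketch of that derivation is shown below, assuming ISO-8601 timestamp strings; the actual storage format used by the survey tool is not described in the paper.

```python
from datetime import datetime


def review_duration_seconds(page_load: str, click_times: list[str]) -> float:
    """Total video review time: last survey click minus page load (ISO-8601 timestamps assumed)."""
    load = datetime.fromisoformat(page_load)
    last_click = max(datetime.fromisoformat(t) for t in click_times)
    return (last_click - load).total_seconds()


# Example: a reviewer whose last click came 4 minutes 30 seconds after page load -> 270.0 seconds
print(review_duration_seconds("2015-04-11T10:00:00",
                              ["2015-04-11T10:01:12", "2015-04-11T10:04:30"]))
```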

Ratings from random CWs naturally vary, as they do with any set of expert raters. The CWs were from a variety of countries, but most were from North America, and most had scored surgeon videos previously. For this project, we included only performance ratings from CWs who passed a calibration test and an attention test. A calibration test was used to ensure that CWs correctly chose one of two options, where the correct choice is obvious to an expert surgeon. In each rating questionnaire, we additionally embedded a single-item attention test that begins by asking the CW "not" to answer the following question: "How well do you pay attention to details?" Ratings of CWs who responded to this question were excluded from all analyses. Missing data were otherwise not allowed for individual GOALS items in the survey, and CWs were not allowed to submit or be paid for incomplete tasks.
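The eligibility rules above can be sketched as a simple filter; the per-rating record structure and key names below are assumptions for illustration, not the study's actual data schema.

```python
def is_eligible(rating: dict) -> bool:
    """Keep a CW rating only if the worker passed the calibration test, left the
    'do not answer' attention item blank, and answered all 4 GOALS items.

    Assumed keys: 'passed_calibration' (bool), 'attention_answer' (None if left blank),
    and one key per GOALS domain.
    """
    goals_items = ("depth_perception", "bimanual_dexterity", "efficiency", "tissue_handling")
    return (
        rating.get("passed_calibration", False)        # screening/calibration test passed
        and rating.get("attention_answer") is None     # attention item correctly left unanswered
        and all(rating.get(item) is not None for item in goals_items)  # no missing GOALS items
    )


# Usage (all_ratings would be the list of collected CW rating records):
# eligible = [r for r in all_ratings if is_eligible(r)]
```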

Analytic approach

The 4 GOALS items from experts and eligible CW ratings were summed into a single numeric global performance score ranging from 4 (worst) to 20 (best). We evaluated inter-rater reliability of FE GOALS ratings using Krippendorff's alpha statistic.13 Krippendorff's alpha is a measure for assessing internal reliability between raters, or between items on a questionnaire, when variable numbers of raters rate the tasks. Because there are no established guidelines for Krippendorff's alpha, we assessed the reliability of FE ratings using frequently cited thresholds for Cronbach's alpha statistic14: α ≥ .9 (excellent, high-stakes testing), .9 > α ≥ .7 (good, low-stakes testing), .7 > α ≥ .6 (acceptable), .6 > α ≥ .5 (poor), and α < .5 (unacceptable). Because multiple ratings per expert and CW were allowed across video performances (only one rating per video), we determined the mean crowd-based GOALS rating and 95% confidence interval (CI) for each video/task using a linear mixed-effects model clustering on rater ID.15 A minimum of n = 31 eligible CW ratings per video was obtained so that the 95% CI for the CW rating was within ±1 point on the GOALS rating scale. Linear mixed-effects models derived an average CW rating and FE rating for each video clip, and Pearson correlation coefficients were used to measure the strength of association between CW and FE ratings. Finally, we also assessed the duration that each CW and FE spent on the web-linked survey.

Figure 1  Correlation between FE and CW GOALS scores.

Figure 2  FE GOALS ratings by video ID, showing the high degree of variability within expert surgeon raters.

Figure 3  CW GOALS ratings by video ID, showing the variability within CWs; box plots demonstrate the central tendency of CW ratings.

Figure 4  Percentage of FE and CW raters who endorsed each video with a passing rating. The degree of agreement between FE and CW corresponds to a Pearson correlation coefficient of 0.80.
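As an illustration of the analytic approach described above, the following is a minimal sketch using common Python statistics libraries (krippendorff, statsmodels, scipy). The column names, the long-format data layout, and the simple per-video means used for the correlation are assumptions for illustration and do not represent the authors' actual analysis code.

```python
import pandas as pd
import krippendorff                      # pip install krippendorff
import statsmodels.formula.api as smf
from scipy.stats import pearsonr


def analyze(df: pd.DataFrame):
    """df is assumed to have columns: rater_id, rater_type ('FE' or 'CW'), video_id, goals_total (4-20)."""
    # Inter-rater reliability of FE ratings: raters x videos matrix, NaN where a video was not rated.
    fe = df[df.rater_type == "FE"]
    fe_matrix = fe.pivot_table(index="rater_id", columns="video_id",
                               values="goals_total").to_numpy()
    alpha = krippendorff.alpha(reliability_data=fe_matrix,
                               level_of_measurement="interval")

    # Per-video CW means from a linear mixed-effects model with a random intercept per rater (cluster on rater ID).
    cw = df[df.rater_type == "CW"]
    mixed_fit = smf.mixedlm("goals_total ~ C(video_id) - 1", data=cw,
                            groups=cw["rater_id"]).fit()

    # Strength of association between per-video mean FE and mean CW scores (simple means shown for brevity).
    fe_means = fe.groupby("video_id")["goals_total"].mean()
    cw_means = cw.groupby("video_id")["goals_total"].mean()
    r, p = pearsonr(fe_means, cw_means.loc[fe_means.index])
    return alpha, mixed_fit, r, p
```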

Results

Responses were collected rapidly from participating CWs. Over a period of 19 hours and 15 minutes, 803 ratings were received, of which 662 were eligible, from 203 unique Amazon Mechanical Turk CWs. In contrast, 126 ratings from 6 FEs were completed over the course of 10 days from the initial notification.

Expert GOALS ratings were highly correlated (Pearson's correlation coefficient = .78) with CW ratings (P < .001) (Fig. 1). FE GOALS ratings were found to have borderline internal consistency across videos (Krippendorff's alpha = .55). Fig. 2 demonstrates variation in FE ratings. Fig. 3 demonstrates the variability seen within CW ratings for each video. Fig. 4 displays the percentage of FE and CW raters who endorsed the video with a "passing" rating (Pearson correlation coefficient = .80).

Expert and crowd comments were reviewed for similarity from a qualitative perspective using theme and subtheme analysis. Comments were remarkably consistent between FEs and CWs. Both rater groups focused on feedback regarding depth perception, poorly coordinated bimanual dexterity, rough use of instruments, missing the target, inefficiency, and slow speed. From our review, the comments from either FE or CW would yield similar feedback to the learner on their final report.

Comments

In the rapidly changing educational environment, surgical educators face new pressures to assess new surgeons in a fair and efficient manner. This study describes the experience with a new assessment tool from a feasibility perspective. The encouraging evidence supports the conclusion that CW assessment of technical performance may be a reliable surgical skills assessment tool. This initial analysis demonstrates that a crowd-sourced method for evaluation is efficient, reliable, and practical. Furthermore, it addresses the inherent problem of inter-rater and interinstitution reliability error.

Review of the expert and crowd comments revealed similar themes between both groups. The most common comments regarding efficiency concerned speed, but both FE and CW pooled comments also noted instruments missing the target ("overshooting"). Tissue handling drew similarly congruent commentary, with FEs and CWs focusing on how "rough" the surgeon was and on frequently bumping into or hitting objects in the frame. Depth perception was equally congruent in feedback from both FEs and CWs, with most videos receiving comments about perceived difficulty with depth perception and poor coordination, such as dropping the needle or suture. Finally, bimanual dexterity generated comments from both FEs and CWs regarding an idle left hand, awkward transfers, and poor use of both hands. Overall, comments for each task and video were highly correlated based on theme analysis.

All 5 of the expert reviewers were experienced laparoscopists and surgeons who have been training and evaluating residents for many years. The inter-rater variability between these expert surgeons highlights the need for a more consistent mechanism to provide this type of evaluation and feedback regarding trainees' performance of commonly performed laparoscopic tasks. In addition, it emphasizes the importance of training faculty raters appropriately. We acknowledge that the expert faculty may have been underprepared for their evaluative role in this setting. However, their orientation and training were identical to those of the CWs, and each of the 5 expert faculty has had experience using GOALS to evaluate resident performance at their respective institutions.

In contrast to the importance of reliability among a small group of expert raters, we feel that inter-rater reliability of CWs is less vital in this study. One of the central tenets of crowd sourcing is that one has a large enough sample size to accurately measure the "signal," or a gauge of how one technical performance compares with another. It is therefore of lesser importance what the overall distribution of CW ratings is, so long as one has enough CW ratings to measure the average with accuracy. In this study, the number of CW ratings was obtained such that, for any given surgical performance, the 95% CI of the CW rating was within ±1 point on the GOALS scale.
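Under a normal approximation, the number of ratings needed for a given CI half-width follows from n ≥ (1.96σ/w)². The sketch below, with an assumed rating standard deviation of about 2.8 GOALS points (a value chosen for illustration and not reported in the paper), yields roughly 31 ratings per video for a ±1 point half-width.

```python
import math


def ratings_needed(sd: float, half_width: float = 1.0, z: float = 1.96) -> int:
    """Smallest n whose 95% CI half-width for the mean rating is <= half_width: n >= (z*sd/half_width)^2."""
    return math.ceil((z * sd / half_width) ** 2)


# Assumed SD of ~2.8 GOALS points keeps the 95% CI within +/-1 point with about 31 ratings.
print(ratings_needed(sd=2.8))  # 31
```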

The rationale for presenting the distribution of expert scores, and for inter-rater reliability analyses in general, is that while one can obtain a limited number of expert ratings, these expert opinions can vary widely. As such, in a setting where ratings will be obtained from multiple expert raters, it is imperative that inter-rater reliability be high to establish an accurate ground truth. It is duly noted that the experts in this study had less than ideal inter-rater reliability. However, these experts were given basic training in the judgment of technical assessment and had prior experience with the GOALS assessment tool. We, therefore, believe that the comparison of crowd-based assessment with expert-based assessment remains accurate in this study. Crowd sourcing offers an alternative methodology for obtaining a larger number of ratings per surgical performance, where the law of large numbers ensures that the rating will be close to the underlying truth and allows for higher confidence in the overall average rating, even if variability in crowd ratings is greater than that of experts.

There are limitations to this study, including a small sample size and the use of novice learners only. Certainly, future investigations should determine the ability of the crowd to stratify learners based on experience, from novice to expert. This tool also requires continued rigorous evaluation of additional parameters, including reliability, ease of use, cost, ease of interpretation by faculty and learners, and impact on patient safety and surgical outcomes.

Swing et al16 proposed a set of standards by which assessment tools should be evaluated and implemented. Van Nortwick et al17 reviewed and made recommendations for validation study reporting in the context of a longitudinal research program approach. Both validation methodologies provide the framework for additional research to establish that C-SATS is acceptable and applicable for surgical evaluation.

Conclusions

C-SATS has broad potential applications. Proposed applications include annual or biannual assessment of surgical resident technical skill to ensure progression of skill and provide feedback. C-SATS also allows for national benchmarking. The objective measures obtained map to the Accreditation Council for Graduate Medical Education Milestones for the Patient Care domains focused on open, endoscopic, laparoscopic, and robotic surgery. Additionally, C-SATS may provide a tool to ensure that graduating residents achieve acceptable psychomotor performance standards, as a supplement to oral and written boards. In conclusion, this pilot study demonstrates the feasibility of Crowd-Sourced Assessment of Technical Skills (C-SATS) and its potential to standardize the evaluation of technical skills in general surgery residency training.

Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.amjsurg.2015.09.005.

References

1. Aggarwal R, Moorthy K, Darzi A. Laparoscopic skills training and assessment. Br J Surg 2004;91:1549–58.
2. Bell Jr RH, Biester TW, Tabuenca A, et al. Operative experience of residents in US general surgery programs: a gap between expectation and experience. Ann Surg 2009;249:719–24.
3. Hur HC, Arden D, Dodge LE, et al. Fundamentals of laparoscopic surgery: a surgical skills assessment tool in gynecology. JSLS 2011;15:21–6.
4. Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med 2003;15:270–92.
5. Steigerwald SN, Park J, Hardy KM, et al. Does laparoscopic simulation predict intraoperative performance? A comparison between the Fundamentals of Laparoscopic Surgery and LapVR evaluation metrics. Am J Surg 2015;209:34–9.
6. Stefanidis D, Arora S, Parrack DM, et al. Research priorities in surgical simulation for the 21st century. Am J Surg 2012;203:49–53.
7. Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997;84:273–8.
8. Chen C, White L, Kowalewski T, et al. Crowd-sourced assessment of technical skills (C-SATS): a novel method to evaluate surgical performance. J Surg Res 2014;187:65–71.
9. Holst D, Kowalewski TM, White L, et al. Crowd sourced assessment of technical skills (CSATS): an adjunct to urology resident surgical simulation training. J Endourol 2015;29:604–9.
10. White LW, Lendvay TS, Holst D, et al. Using crowd-assessment to support surgical training in the developing world. J Am Coll Surg 2014;219:e40.
11. Lendvay TS, White LW, Holst D, et al. Quantifying surgical skill: using the wisdom of crowds. J Am Coll Surg 2014;219:e158–9.
12. Vassiliou MC, Feldman LS, Andrew CG, et al. A global assessment tool for evaluation of intraoperative laparoscopic skills. Am J Surg 2005;190:107–13.
13. Krippendorff K. Content Analysis: An Introduction to Its Methodology. 3rd ed. Thousand Oaks, CA: Sage; 2013.
14. Cronbach LJ, Shavelson RJ. My current thoughts on coefficient alpha and successor procedures. Educ Psychol Meas 2004;64:391–418.
15. Laird NM, Ware JH. Random effects models for longitudinal studies. Biometrics 1983;38:963–74.
16. Swing SR, Clyman SG, Holmboe ES, et al. Advancing resident assessment in graduate medical education. J Grad Med Educ 2009;1:278–86.
17. Van Nortwick S, Lendvay T, Jensen A, et al. Methodologies for establishing validity in surgical simulation studies. Surgery 2010;147:622–30.