
    Public Health

    www.thelancet.com Published online July 5, 2013 http://dx.doi.org/10.1016/S0140-6736(13)61491-9

    Public reporting of surgeon outcomes: low numbers of procedures lead to false complacency

    Kate Walker, Jenny Neuburger, Oliver Groene, David A Cromwell, Jan van der Meulen

    The English National Health Service published outcome information for individual surgeons for ten specialties in June, 2013. We looked at whether individual surgeons do sufficient numbers of procedures to be able to reliably identify those with poor performance. For some specialties, the number of procedures that a surgeon does each year is low and, as a result, the chance of identifying a surgeon with increased mortality rates is also low. Therefore, public reporting of individual surgeons' outcomes could lead to false complacency. We recommend use of outcomes that are fairly frequent, considering the hospital as the unit of reporting when numbers are low, and avoiding interpretation of no evidence of poor performance as evidence of acceptable performance.

    Introduction

    From the summer of 2013, outcomes of some surgical procedures will be reported for individual surgeons as part of the English National Health Service (NHS) Commissioning Board's new policy.1 This policy follows the example of the Society for Cardiothoracic Surgery in Great Britain and Ireland (SCTS)2 and several US states (eg, New York3), which report mortality for adult cardiac procedures by surgeon. The aim is to allow patients to choose their surgeon and clinicians to improve outcomes of care. However, when overall numbers of specific procedures are low, correct identification of a surgeon with poor performance is challenging, even if mortality is high.4 The danger is that low numbers mask poor performance and lead to false complacency.

    We examine this issue in relation to reporting of surgical mortality for individual surgeons for adult cardiac surgery, plus key procedures in three other specialties: oesophagectomy or gastrectomy for oesophagogastric cancer; bowel cancer resection; and hip fracture surgery. We address three questions. First, what number of procedures is necessary for reliable detection of poor performance? Second, how many surgeons in each specialty actually do this number of procedures in a period of 1, 3, or 5 years? Third, what is the probability that a surgeon identified as a statistical outlier has truly poor performance? Finally, we offer recommendations about how surgeon performance can be assessed in a meaningful way. We used postoperative mortality as an example to address these questions, because it is the outcome that will be reported for English surgeons this summer.

    Number of procedures

    The number of adult cardiac surgeries done in NHS hospitals is fairly high: 50% of cardiac surgeons do between 60 and 170 per year.2 Many other procedures are done less frequently, which means statistical power is poor and that poorly performing surgeons are unlikely to be correctly identified. In this context, statistical power is the probability that a surgeon with poor performance will be detected as a statistical outlier (ie, as significantly worse than average). For example, 80% power means that, of ten surgeons who are truly performing poorly, on average eight would be identified.

    Bowel cancer resection illustrates the issue of low numbers. Postoperative mortality is about 5% (table 1).8 Therefore, if 20 operations were done in a year (a high number for this procedure), only one patient would be expected to die after surgery. With low numbers, the play of chance (ie, the role of uncontrollable factors) might be greater than the effect of a surgeon's performance on the number of deaths. Conversely, if a surgeon's performance was average, the chance that more than one patient of 20 would die after surgery can be calculated as about 25% with basic statistical principles.
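The bowel cancer figures above can be checked with a short binomial calculation. This is a minimal sketch, assuming that deaths are independent and that every patient carries the same 5% risk; the function name `prob_more_than` is illustrative and not taken from the article.

```python
from math import comb


def prob_more_than(k: int, n: int, p: float) -> float:
    """Probability of more than k deaths in n operations, each with death risk p."""
    at_most_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1.0 - at_most_k


operations = 20      # annual volume used in the text
mortality = 0.05     # "about 5%" postoperative mortality

expected_deaths = operations * mortality
chance = prob_more_than(1, operations, mortality)

print(f"expected deaths per year: {expected_deaths:.1f}")
print(f"chance of more than one death for an average surgeon: {chance:.2f}")
```

Under these assumptions the chance comes out at roughly a quarter, in line with the "about 25%" quoted in the text.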

    Statistical power is determined by the expected number of deaths, which is a combination of numbers of the procedures and the mortality (panel 1). To calculate how many procedures would be necessary to achieve different statistical power thresholds, we used the national overall mortality rate for adult cardiac surgery, bowel cancer resection, oesophagectomy or gastrectomy, and hip fracture surgery to calculate the expected numbers of deaths,6,7,9 and deemed a doubling of these rates as poor performance (panel 1). The numbers necessary for each power threshold exceed the annual number of procedures typically done by surgeons in English NHS hospitals (table 1). The differences are particularly large for bowel cancer surgery and oesophagectomy or gastrectomy: the annual median number of bowel cancer surgeries is roughly a tenth of that necessary for 60% power, and the median number of oesophagectomy or gastrectomy procedures is about a tenth of that needed for 70% power (table 1). Hip fracture surgery has the highest mortality, and therefore the same power is achieved with fewer operations than are necessary for other procedures (table 1).

    Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, London, UK (K Walker PhD, J Neuburger PhD, O Groene PhD, D A Cromwell PhD, Prof J van der Meulen PhD)

    Correspondence to: Dr Jenny Neuburger, Department of Health Services Research and Policy, London School of Hygiene and Tropical Medicine, London WC1H 9SH, UK

    Proportion of surgeons who do the necessary number of procedures

    We estimated the proportion of surgeons who do a sufficient number of procedures to achieve 60%, 70%, and 80% power to detect poor performance (table 2).2,5 These proportions are calculated for reporting periods of 1, 3, and 5 years, assuming that the overall rate of mortality remains constant with time. The SCTS reports surgeon-level mortality with 3 years of data.2 Its data show that about three-quarters of surgeons do sufficient numbers of cardiac operations to achieve 60% statistical power (table 2). The proportion of surgeons who do sufficient numbers of procedures to identify poor performance is much lower for procedures other than cardiac and hip fracture surgery (table 2).

    Gains in statistical power can clearly be achieved by extension of the period during which data are obtained: as length of time increases, so does the proportion of surgeons who do the necessary number of procedures (table 2). However, pooling of data from long periods will adversely affect the timeliness of reported data. It assumes that individual surgeon skills and practice patterns are largely stable, which might not be the case. Moreover, such pooling could mask a recent deterioration in a surgeon's performance. It also introduces challenges related to the reporting of outcomes of junior surgeons, retired surgeons, and surgeons who are temporarily appointed overseas. We recommend pooling of data over a period of time to increase power, but recognise that a balance needs to be struck between statistical power and timeliness (panel 2).

    Correct identification of poor performance

    Not all surgeons identified as statistical outliers will truly have poor performance. The proportion correctly identified as having poor performance is known as the positive predictive value.10 The number correctly identified depends on the significance threshold, how many procedures a surgeon does, and the prevalence of poor performance. With standard diagnostic reasoning, it can be calculated that, if one in 20 cardiac surgeons truly had poor performance, 63% would be correctly identified on the basis of the median number of procedures in 3 years. The equivalent positive predictive values for the other procedures, with the same assumptions, are 62% for hip fracture surgery, 57% for oesophagectomy or gastrectomy, and 38% for bowel cancer resection. Therefore, a large proportion of surgeons identified as outliers would be falsely accused of poor performance. In reality, the proportion of surgeons with poor performance will probably be lower than one in 20, with the result that the proportion of outliers that are false accusations would be substantially higher.
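The diagnostic reasoning behind the positive predictive value can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' own code: power is taken from a normal approximation, poor performance is double the national rate, only the upper tail of the 95% control limits flags a surgeon (a false-positive rate of 2.5%), national cardiac mortality is read as 2.7% and the median annual volume as 128 procedures (table 1). All function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power(n: int, p0: float, ratio: float = 2.0, alpha: float = 0.05) -> float:
    """Chance that a surgeon with true mortality ratio * p0 falls above the upper control limit."""
    p1 = ratio * p0
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    upper_limit = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)
    return 1 - Z.cdf((upper_limit - p1) / sqrt(p1 * (1 - p1) / n))


def positive_predictive_value(pw: float, prevalence: float, alpha: float = 0.05) -> float:
    """Share of flagged surgeons who truly perform poorly."""
    false_positive_rate = alpha / 2  # assumption: only the upper tail counts as an outlier
    true_positives = prevalence * pw
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


n_three_years = 3 * 128  # median cardiac volume pooled over 3 years
cardiac_power = power(n_three_years, 0.027)
ppv_1_in_20 = positive_predictive_value(cardiac_power, prevalence=1 / 20)
ppv_1_in_50 = positive_predictive_value(cardiac_power, prevalence=1 / 50)

print(f"power: {cardiac_power:.2f}")
print(f"PPV if 1 in 20 surgeons perform poorly: {ppv_1_in_20:.2f}")
print(f"PPV if 1 in 50 surgeons perform poorly: {ppv_1_in_50:.2f}")
```

With these inputs the cardiac value lands close to the 63% quoted above, and lowering the assumed prevalence from one in 20 to one in 50 reduces the predictive value further, which is the point made at the end of the paragraph.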

    Improving statistical power

    There are options for improvements in statistical power other than the pooling of data over time, but each introduces problems of its own. First, data for different procedures could be pooled. However, when outcomes differ between procedures, this approach could prevent fair comparisons of outcomes. We grouped gastrectomy, which has a mortality of 6.9%, with oesophagectomy, which has a mortality of 5.7%.7 Additionally, cardiac procedures were grouped together, combining coronary bypass surgery with replacement or repairs of cardiac valves, as is done by SCTS.2 Adjustment for patients' risk profiles (ie, case-mix adjustment) might not be sufficient to remove biases due to surgeons who do varying mixes of procedures. Another difficulty caused by the pooling of data is that poor surgeon performance for one specific procedure could be masked.

    Second, the control limits used to identify poor performance could be lowered. We used a 5% significance level, which corresponds to 95% control limits (figure), and is the lowest commonly used threshold. However, lowering of the threshold would also lead to an increased proportion of surgeons falsely identified as having poor performance.

    Third, an alternative outcome measure could be selected, such as surgical complications or emergency readmission. Although increased numbers of events would raise statistical power, measurement error due to incomplete or inconsistent recording would tend to reduce it. We recommend use of outcome measures that are fairly frequent (panel 2). Additionally, for specialties in which most surgeons still do not do sufficient numbers of procedures to achieve acceptable power, we recommend that reporting should be at the level of the surgical team or hospital, not the surgeon (panel 2).

    Procedure                        National postoperative   Median annual   Procedures necessary to detect poor performance
                                     mortality (%)            number*         60% power   70% power   80% power
    Hip fracture surgery             8.4                      31              56          75          102
    Oesophagectomy or gastrectomy    6.1                      11              79          109         148
    Bowel cancer resection           5.1                      9               95          132         179
    Cardiac surgery                  2.7                      128             192         256         352

    5% significance level. Poor performance defined as double the national overall mortality rate. *On the basis of hospital episode statistics5 for the 3-year period from April, 2009, to March, 2012 (except for cardiac surgery, for which reported numbers2 are used). Hip fracture surgery: 30-day mortality (March 1, 2010, to Feb 28, 2011).6 Oesophagectomy or gastrectomy: 90-day mortality (Oct 1, 2007, to June 30, 2009).7 Bowel cancer resection: 90-day mortality (Aug 1, 2010, to July 31, 2011).8 Cardiac surgery: in-hospital mortality (April 1, 2008, to March 31, 2011).9

    Table 1: Mortality after four surgical procedures, the number of procedures that occur annually, and how many would be necessary to detect poor performance with different statistical powers

    Panel 1: Calculation of statistical power

    We calculate statistical power with four numbers (under the assumption that underlying statistical distributions can be approximated with the normal distribution):

    1. National overall mortality
    2. Mortality rate at which performance is deemed to be poor
    3. Statistical threshold used to test whether an individual surgeon's rate is consistent with the national overall mortality rate
    4. Number of procedures done by the surgeon

    In this report, we define poor performance as double the national overall mortality rate. This definition is arbitrary, but would in practice represent a fairly large difference in performance. To detect smaller differences would necessitate larger numbers to achieve the same levels of statistical power. We use a 5% significance level to calculate the poor performance threshold, which corresponds to the commonly used 95% control limit on funnel plots (figure). In fact, many audits use wider limits, such as 99.8%, or even higher, to detect an outlier. As levels of significance increase, so limits widen, reducing the statistical power to detect poor performance.
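The calculation outlined in panel 1 can be sketched from its four inputs. The article does not give its exact formula, so what follows is an assumption: a standard normal-approximation sample-size formula for comparing one proportion with a fixed reference rate, with the 5% significance level treated as two-sided 95% control limits. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power(n: int, p0: float, ratio: float = 2.0, alpha: float = 0.05) -> float:
    """Chance of flagging a surgeon whose true mortality is ratio * p0 after n procedures."""
    p1 = ratio * p0
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    upper_limit = p0 + z_alpha * sqrt(p0 * (1 - p0) / n)
    return 1 - Z.cdf((upper_limit - p1) / sqrt(p1 * (1 - p1) / n))


def procedures_needed(target_power: float, p0: float, ratio: float = 2.0, alpha: float = 0.05) -> int:
    """Smallest n (rounded up) at which power reaches target_power."""
    p1 = ratio * p0
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(target_power)
    root_n = (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) / (p1 - p0)
    return ceil(root_n**2)


# National mortality rates as read from table 1 (decimal points restored).
for name, p0 in [("hip fracture", 0.084), ("cardiac", 0.027)]:
    needed = [procedures_needed(t, p0) for t in (0.6, 0.7, 0.8)]
    print(name, needed)
```

Under these assumptions the results are of the same order as the figures in table 1 without matching them exactly, which suggests that the authors used a closely related but not identical calculation.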

    Implications of the new policy

    Reporting of outcomes for individual surgeons for cardiac surgery in the UK has largely been viewed as a success.11,12 As we have shown, numbers of cardiac surgeries are sufficient to allow the process of detection to operate with reasonable statistical power. However, we believe that consultant-level reporting could be far less effective for other specialties. The concern about false identification of poor performance has received much attention in view of the stigma attached to poor performance.13 The potential collateral damage of a false accusation could include reputational damage, increases in indemnity insurance premiums, or even prosecution. Public reporting could also affect surgeon behaviour, leading to selection of low-risk patients for surgery.14

    Inaccurate estimates of surgeon performance could also cause unnecessary alarm in patients. A mortality estimate of 10% for a surgeon could worry patients, even if the estimate is based on such small numbers that no statistical evidence of poor performance is available. One option to overcome this issue would be to use hierarchical modelling techniques that would shrink the surgeon's mortality estimates, especially when based on small numbers, towards the overall mortality.15 However, these hierarchical models would not overcome the problem of low statistical power.
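The idea of shrinking a small-sample estimate towards the overall mortality can be illustrated with a simple beta-binomial style weighting. This is a stand-in for the hierarchical models mentioned above, not the method of the cited work: the `prior_strength` value (how many "pseudo-procedures" the overall rate counts for) is a hypothetical fixed number here, whereas a fitted hierarchical model would estimate it from the variation between surgeons.

```python
def shrunken_rate(deaths: int, procedures: int, overall_rate: float, prior_strength: float) -> float:
    """Weighted average of a surgeon's observed rate and the overall rate.

    prior_strength is an illustrative assumption: the number of pseudo-procedures
    at the overall rate that are blended with the surgeon's own data.
    """
    return (deaths + prior_strength * overall_rate) / (procedures + prior_strength)


overall = 0.05       # overall mortality, "about 5%" as in the bowel cancer example
strength = 100.0     # hypothetical prior strength

small_volume = shrunken_rate(deaths=2, procedures=20, overall_rate=overall, prior_strength=strength)
large_volume = shrunken_rate(deaths=40, procedures=400, overall_rate=overall, prior_strength=strength)

print(f"observed 10% from 20 procedures, shrunk to {small_volume:.3f}")
print(f"observed 10% from 400 procedures, shrunk to {large_volume:.3f}")
```

Both surgeons have an observed mortality of 10%, but the estimate based on 20 procedures is pulled much closer to the overall rate than the one based on 400. The shrinkage changes what is displayed; it does not add information, which is why low statistical power remains a problem.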

    A second implication has received much less attention.16 With low numbers of procedures, an unintended result of reporting for individual surgeons could be false complacency. For most surgeons, power will be insufficient to detect poor performance, and this absence of evidence could be falsely interpreted as evidence of acceptable performance. Therefore, rather than stimulating quality improvement and early responses to local concerns about quality of care, publicly reported figures that identify no problems could lead to inaction.

    Procedure                         60% power   70% power   80% power

    1-year reporting period
    Hip fracture surgery*             4%          1%          0
    Oesophagectomy or gastrectomy*    0           0           0
    Bowel resection*                  0           0           0
    Cardiac surgery                   16%         1%          0

    3-year reporting period
    Hip fracture surgery*             73%         62%         42%
    Oesophagectomy or gastrectomy*    9%          0           0
    Bowel resection*                  17%         4%          0
    Cardiac surgery                   75%         69%         56%

    5-year reporting period
    Hip fracture surgery*             84%         79%         70%
    Oesophagectomy or gastrectomy*    34%         17%         5%
    Bowel resection*                  37%         24%         11%
    Cardiac surgery                   80%         77%         72%

    Poor performance is defined as double the national overall mortality rate, with a 5% significance level. *Data for numbers of procedures come from the hospital episode statistics for the 3-year period from April, 2009, to March, 2012.5 We selected procedures with the appropriate International Classification of Diseases (version 10) diagnosis codes and Office of Population Censuses and Surveys Classification of Interventions and Procedures-4.4 procedure codes. Procedures were allocated to a consultant if they were contracted under a relevant specialty. For cardiac surgery, data are from the Society for Cardiothoracic Surgery in Great Britain and Ireland published data.2

    Table 2: Proportion of surgeons who do sufficient numbers of different procedures every year to identify cases of poor performance

    Panel 2: Recommendations for public reporting of surgeon outcomes

    Measurement of outcomes
    • Pool data over time when annual numbers are low, but also consider timeliness of data
    • Select outcome measures for which the outcome event is fairly frequent
    • For specialties in which most surgeons do not achieve 60% power, the unit of reporting should be the team, hospital, or trust

    Presentation of results
    • Report surgeon outcomes with appropriate statistical techniques, such as funnel plots
    • Avoid interpreting no evidence of poor performance as evidence of acceptable performance
    • Report surgeon outcomes with appropriate health warnings, such as interpretation of outcomes with low numbers and data quality issues
    • Report surgeon outcomes alongside unit or hospital outcomes to guide interpretation

    Figure: Funnel plot showing risk-adjusted 90-day mortality after bowel cancer resection in different trusts

    For reporting for individual surgeons, one dot would represent one surgeon and numbers of procedures would be much lower than they are here. Trusts falling above the control limits are deemed to be outliers. In our estimates, we have used 95% control limits. Reproduced from Health and Social Care Information Centre's national bowel cancer audit,8 by permission of the Health and Social Care Information Centre.

    [Plot not reproduced. x-axis: number of procedures in trust (0 to 250); y-axis: adjusted 90-day mortality (%) (0 to 15); lines shown: median mortality, 95% control limits, 99.8% control limits.]
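The shape of a funnel plot's control limits follows directly from the number of procedures. The sketch below uses a normal approximation around an assumed reference mortality of 5%; published audits often use exact binomial limits and risk-adjusted rates instead, so the numbers are illustrative only and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def upper_control_limit(n: int, reference_rate: float, coverage: float = 0.95) -> float:
    """Upper funnel-plot limit for a unit with n procedures (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    return reference_rate + z * sqrt(reference_rate * (1 - reference_rate) / n)


reference = 0.05  # assumed reference mortality, "about 5%" as for bowel cancer resection

print("procedures  95% limit  99.8% limit")
for n in (20, 50, 100, 250):
    limit_95 = upper_control_limit(n, reference, 0.95)
    limit_998 = upper_control_limit(n, reference, 0.998)
    print(f"{n:10d}  {100 * limit_95:8.1f}%  {100 * limit_998:10.1f}%")
```

The limits narrow as volume grows, giving the funnel its shape. At 20 procedures the 95% limit under this approximation sits well above 10%, so a surgeon with two deaths in 20 operations would not be flagged even though the observed rate is double the reference.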


    Our analyses draw attention to the need for great care in presentation of estimates for individual surgeons. Analyses should be presented in such a way as to avoid false complacency, false accusation, and unnecessary alarm to patients (panel 2).

    Wider issues

    Several wider issues have been raised previously about the reporting of surgeon outcomes, mainly related to adequate adjustment for patient case mix, the accuracy with which the responsible surgeon can be identified, and the shared responsibility for the care of patients within teams.17 Operative mortality, including unavoidable deaths, might not be a good proxy for preventable mortality. Of particular relevance is the mean proportion of deaths that can be prevented: if this proportion is low, mortality is a poor test to predict avoidable mortality.18 Additionally, mortality is not the only outcome that concerns patients; others include avoidance of serious complications, being treated with dignity, return of function, and freedom from recurrent symptoms.19

    Case-mix adjustment aims to account for differences in age, disease severity, or other factors in comparisons of surgeon outcomes. Validated methods for case-mix adjustment do not exist for all the procedures for which outcomes have to be reported in 2013. Even when they do exist, these methods might not fully adjust for case-mix differences. Therefore, some surgeons treating patients at high risk could be wrongly identified as outliers,20 and underperforming surgeons treating low-risk patients will be less likely to be detected.

    Identification of the surgeon responsible for a procedure is not always straightforward. Some procedures are not allocated to any surgeon, whereas others are done by more than one surgeon. Inconsistencies between units in how procedures are allocated to surgeons could further undermine these comparisons.

    A final issue is the appropriate organisational level for reporting outcomes. Reporting for individual surgeons ignores the effect of the multidisciplinary team and the context in which especially complex surgery is done. Many aspects of care other than the surgeon's performance will affect the outcome, such as timeliness of referral and diagnosis, perioperative care, and follow-up care after discharge. For example, complications resulting from surgery might result in a patient's death, depending on the way clinical units monitor patients' vital status and respond to adverse occurrences.21 We recommend that surgeon outcomes are reported alongside unit outcomes to guide interpretation (panel 2).

    Contributors

    KW and JN conceived this report. All authors were involved in the design. KW, JN, and OG collected and analysed data. All authors interpreted results. KW, JN, and OG wrote the report, with contributions from DAC and JvdM.

    Conflicts of interest

    We are all involved in national clinical audits, but report that we have no conflicts of interest.

    Acknowledgments

    We thank Rob Wakeman for providing information on surgeon volume for hip fracture surgery. No specific funding was received for this report. The salaries of KW, JN, and OG were funded by a grant from the Royal College of Surgeons of England.

    References

    1 NHS Commissioning Board. Everyone counts: planning for patients 2013/14. 2012. www.commissioningboard.nhs.uk/files/2012/12/everyonecounts-planning.pdf (accessed June 28, 2013).

    2 Society for Cardiothoracic Surgery in Great Britain and Ireland. UK surgeons' results. http://www.scts.org/modules/surgeons/default.aspx (accessed June 28, 2013).

    3 Hannan EL, Cozzens K, King SB, Walford G, Shah NR. The New York State cardiac registries: history, contributions, limitations, and lessons for future efforts to assess and publicly report healthcare outcomes. J Am Coll Cardiol 2012; 59: 2309–16.

    4 Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ 1995; 311: 485.

    5 Health and Social Care Information Centre. Hospital episode statistics. http://www.hscic.gov.uk/hes (accessed June 28, 2013).

    6 National Hip Fracture Database. National Report 2012. 2012. http://www.nhfd.co.uk/003/hipfractureR.nsf/luMenuDefinitions/CA920122A244F2ED802579C900553993/$file/NHFD%20National%20Report%202012.pdf?OpenElement (accessed June 28, 2013).

    7 Health and Social Care Information Centre. National oesophago-gastric cancer audit 2012. http://www.hscic.gov.uk/searchcatalogue?productid=7335&q=%22National+Oesophago-Gastric+Cancer+Audits%22&sort=Relevance&size=10&page=1#top (accessed June 28, 2013).

    8 Health and Social Care Information Centre. National bowel cancer audit 2012. http://www.hscic.gov.uk/searchcatalogue?productid=10227&q=title%3a%22bowel+cancer%22&sort=Relevance&size=10&page=1#top (accessed June 28, 2013).

    9 National Institute for Cardiovascular Outcomes Research. Adult Cardiac Surgery. 2013. http://www.ucl.ac.uk/nicor/audits/Adultcardiacsurgery (accessed June 28, 2013).

    10 Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ 1994; 309: 102.

    11 Bridgewater B, Hickey GL, Cooper G, Deanfield J, Roxburgh J. Publishing cardiac surgery mortality rates: lessons for other specialties. BMJ 2013; 346: f1139.

    12 Bridgewater B. Mortality data in adult cardiac surgery for named surgeons: retrospective examination of prospectively collected data on coronary artery surgery and aortic valve replacement. BMJ 2005; 330: 506–10.

    13 Lilford R, Mohammed AM, Spiegelhalter D, Thomson R. Use and misuse of process and outcome data in managing performance of acute medical care: avoiding institutional stigma. Lancet 2004; 363: 1147–54.

    14 Shahian DM, Edwards FH, Jacobs JP, et al. Public reporting of cardiac surgery performance: part 1 - history, rationale, consequences. Ann Thorac Surg 2011; 92: S2–11.

    15 Dimick JB, Staiger DO, Birkmeyer JD. Ranking hospitals on surgical mortality: the importance of reliability adjustment. Health Serv Res 2010; 45: 1614–29.

    16 Dimick JB, Welch HG, Birkmeyer JD. Surgical mortality as an indicator of hospital quality: the problem with small sample size. JAMA 2004; 292: 847–51.

    17 Johal A, Cromwell D, van der Meulen J. Hospital episode statistics and revalidation: creating the evidence to support revalidation. Jan 9, 2013. http://www.rcseng.ac.uk/surgeons/research/surgical-research/docs/hospital-episode-statistics-and-revalidation-report-2013 (accessed June 28, 2013).

    18 Girling AJ, Hofer TP, Wu J, et al. Case-mix adjusted hospital mortality is a poor proxy for preventable mortality: a modelling study. BMJ Qual Saf 2012; 21: 1052–56.

    19 Shahian DM, Normand SL. Autonomy, beneficence, justice, and the limits of provider profiling. J Am Coll Cardiol 2012; 59: 2383–86.

    20 Spiegelhalter DJ. Handling over-dispersion of performance indicators. Qual Saf Health Care 2005; 14: 347–51.

    21 Ghaferi AA, Birkmeyer JD, Dimick JB. Hospital volume and failure to rescue with high-risk surgery. Med Care 2011; 49: 1076–81.