The interpretation of diagnostic tests

David E Shapiro
Center for Biostatistics in AIDS Research, Harvard School of Public Health, Boston, Massachusetts, USA

Laboratory diagnostic tests are central in the practice of modern medicine. Common uses include screening a specific population for evidence of disease and confirming or ruling out a tentative diagnosis in an individual patient. The interpretation of a diagnostic test result depends on both the ability of the test to distinguish diseased from nondiseased subjects and the particular characteristics of the patient and setting in which the test is being used. This article reviews statistical methodology for assessing laboratory diagnostic test accuracy and interpreting individual test results, with an emphasis on diagnostic tests that yield a continuous measurement. The article begins with a summary of basic concepts and terminology, then briefly discusses study design and reviews methods for assessing the accuracy of a single diagnostic test, comparing the accuracy of two or more diagnostic tests and interpreting individual test results.

Statistical Methods in Medical Research 1999; 8: 113–134

Address for correspondence: DE Shapiro, Center for Biostatistics in AIDS Research, 651 Huntington Avenue, Boston, MA 02115-6017, USA. E-mail: [email protected]

1 Introduction

Laboratory diagnostic tests are central in the practice of modern medicine. Common uses include screening a specific population for evidence of disease and confirming or ruling out a tentative diagnosis in an individual patient.1 The interpretation of a diagnostic test result depends on both the ability of the test to distinguish diseased from nondiseased subjects and the particular characteristics of the patient and setting in which the test is being used.2 This article reviews statistical methodology for assessing laboratory diagnostic test accuracy and interpreting individual test results.

For convenience, a diagnostic test will be called continuous, dichotomous, or ordinal according to whether the test yields a continuous measurement (e.g. blood pressure), a dichotomous result (e.g. HIV-positive or HIV-negative), or an ordinal outcome (e.g. confidence rating for presence of disease – definitely, probably, possibly, probably not, definitely not). The main focus here is on continuous diagnostic tests, since the majority of laboratory diagnostic tests are of this type1 and since much new methodology for such tests has appeared recently in the statistical literature. Although dichotomous diagnostic tests also are common in laboratory medicine, they will only be discussed briefly here, because much of the methodology involves analysis of binomial proportions and 2 × 2 tables, which is likely to be more familiar to the reader. Many of the methods proposed for continuous tests borrow heavily from the vast literature on ordinal diagnostic tests, which are common in the radiological imaging field; methodology for ordinal tests has been reviewed in depth elsewhere,3,4 including a recent issue of Statistical Methods in Medical Research (volume 7, number 4, December 1998), and will only be discussed here when pertinent to continuous tests.

This article is organized as follows: Section 2 summarizes some basic concepts and terminology. Section 3 discusses study design and methods for assessing the accuracy of a single diagnostic test. Section 4 reviews methods for comparing the accuracy of two or more diagnostic tests. Section 5 describes methods for interpreting individual test results. Finally, Section 6 contains some concluding remarks.

2 Basic concepts and terminology

The interpretation of a diagnostic test depends on two factors: (1) the intrinsic ability of the diagnostic test to distinguish between diseased patients and nondiseased subjects (discriminatory accuracy); and (2) the particular characteristics of the individual and setting in which the test is applied. Each of these factors will be discussed in turn.

2.1 Discriminatory accuracy

The discriminatory accuracy of a diagnostic test is commonly assessed by measuring how well it correctly identifies n1 subjects known to be diseased (D+) and n0 subjects known to be nondiseased (D−). If the diagnostic test gives a dichotomous result for each subject, e.g. positive (T+) or negative (T−), the data can be summarized in a 2 × 2 table of test result versus true disease status (Figure 1). In the medical literature, discriminatory accuracy is commonly measured by the conditional probabilities of correctly classifying diseased patients [P(T+ | D+) = TP/n1 in Figure 1, called the sensitivity or true positive rate (TPR)] and nondiseased subjects [P(T− | D−) = TN/n0, called the specificity or true negative rate (TNR)]; equivalently, the probabilities of type I error [P(T+ | D−) = FP/n0 = 1 − specificity, called the false positive rate (FPR)] and type II error [P(T− | D+) = FN/n1 = 1 − sensitivity, called the false negative rate (FNR)] may also be used.

Figure 1 The 2 × 2 table of frequencies summarizing the results of a dichotomous diagnostic accuracy study. TP represents the number of true positives (diseased patients who test positive), FP the false positives, TN the true negatives, and FN the false negatives.

Many other indices of accuracy have been proposed for dichotomous diagnostic tests, including several measures of association in the 2 × 2 table familiar to biostatisticians, such as the odds ratio and kappa statistic.5 One commonly-used accuracy index, the overall percentage of correct test results Pc = (TP + TN)/N, deserves comment here because it can be misleading.2 Unlike sensitivity and specificity, Pc depends on the prevalence of disease in the study sample (n1/N); in fact, Pc is actually a prevalence-weighted average of sensitivity and specificity. Consequently, when the prevalence is high or low, Pc can be high even when the specificity or sensitivity, respectively, is quite low. For example, in screening pregnant women for HIV infection in the USA, where the prevalence in many areas is < 1%, an HIV test that classified all women as HIV-negative, and therefore identified none of the HIV-positive women, would have Pc > 99% even though its sensitivity would be 0%.
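A one-line calculation makes this point concrete; the following is a minimal sketch with illustrative numbers, not figures from the article:

```python
# Pc is a prevalence-weighted average of sensitivity and specificity:
# Pc = prevalence * sensitivity + (1 - prevalence) * specificity.
def overall_percent_correct(prevalence, sensitivity, specificity):
    return prevalence * sensitivity + (1 - prevalence) * specificity

# A "test" that classifies everyone as negative has sensitivity 0 and
# specificity 1; at 0.5% prevalence it is still "correct" 99.5% of the time.
print(overall_percent_correct(0.005, 0.0, 1.0))  # 0.995
```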

Many laboratory diagnostic tests give a quantitative result X rather than a dichotomous result. The degree of overlap in the distributions of test results for diseased [f(X | D+)] and nondiseased subjects [f(X | D−)] determines the ability of the test to distinguish diseased from nondiseased. In screening situations, where the test is used to separate patients into two groups – those requiring further investigation and/or intervention and those who do not – it is common to define a threshold or decision limit γ and classify the patient as diseased if X > γ and nondiseased if X < γ (Figure 2).1 Each decision limit yields a different 2 × 2 table of dichotomized test result versus true disease status; thus, sensitivity and specificity can be estimated for each decision limit γ. However, as γ decreases, sensitivity increases but specificity decreases; as γ increases, specificity increases at the expense of sensitivity. Thus, there is a trade-off between sensitivity and specificity as the decision limit varies, which must be accounted for in assessing discriminatory accuracy.

Figure 2 Hypothetical distributions of diagnostic test results X for diseased and nondiseased subjects. The vertical line at X = γ indicates the decision limit for a positive test. The shaded area to the right of γ is the FPR; the shaded area to the left of γ is the FNR.

A useful graphical summary of discriminatory accuracy is a plot of sensitivity versus either specificity or FPR as the decision limit varies, which is called the receiver operating characteristic (ROC) curve (Figure 3). The upper left corner of the graph represents perfect discrimination (TPR = 1, FPR = 0), while the diagonal line where TPR = FPR represents discrimination no better than chance. The ROC curve of a diagnostic test is invariant with respect to any monotone transformation of the test measurement scale for both diseased and nondiseased subjects; in other words, the ROC curve does not depend on the scale of measurement for the test, which makes it useful for comparing diagnostic tests of different scales, including tests for which the decision limit scale is latent rather than quantitative, such as medical imaging techniques. Since it is often useful to have a single number that summarizes a test's accuracy, various summary indices based on the ROC curve have been proposed; the most popular such summaries are the area under the ROC curve (AUC), which represents both the average sensitivity over all values of FPR and the probability that test results for a randomly-selected diseased subject and nondiseased subject will be ordered correctly in magnitude; the partial AUC, which is the AUC over a range of FPR that is relevant to a particular setting; and the sensitivity at a fixed FPR.

2.2 Context-specific interpretation

The assessment of discriminatory accuracy focuses on the performance of the diagnostic test when disease status is known; however, when interpreting the diagnostic test result for an individual patient, typically the true disease status is unknown and it is of interest to use the test result to estimate the posterior probability of disease.

Figure 3 The ROC curve corresponding to the test result distributions shown in Figure 2. The point labelled γ on the curve corresponds to the decision limit shown in Figure 2.



For interpretation of a dichotomous test result, the quantities of interest are the posterior probability of disease given a positive test result [P(D+ | T+) = TP/m1 in Figure 1], called the positive predictive value (PPV), and the posterior probability of no disease given a negative test result [P(D− | T−) = TN/m0], called the negative predictive value (NPV). Using Bayes theorem, one can calculate the PPV and NPV based on the sensitivity and specificity of the diagnostic test and the prior probability of disease (prevalence); if probabilities are converted to odds [odds = probability/(1 − probability)], this calculation has a particularly simple form: posterior odds of disease = prior odds of disease × likelihood ratio, where the likelihood ratio is f(T+ | D+)/f(T+ | D−) = sensitivity/(1 − specificity).

For tests that yield a quantitative result X, one could determine the PPV and NPV as in the dichotomous case by dichotomizing the test result at a fixed decision limit γ. However, this method has two limitations: it discards information about the likelihood of disease, since it ignores the distance of X from γ, i.e. it yields the same estimate of PPV or NPV whether X just exceeds γ or is well beyond it; also, this method requires selecting a decision limit that is optimal in some sense for the particular setting. An alternative approach that avoids these difficulties is to calculate the likelihood ratio L(X) = f(X | D+)/f(X | D−) corresponding to the actual test result, and use it to convert the patient's prior probability of disease into a posterior probability of disease P(D+ | X).

It is important to remember that PPV, NPV, and posterior probability of disease are specific to a particular patient or setting, and cannot necessarily be generalized to a different population or setting, because these measures depend strongly on the prevalence of disease. For example, an HIV screening test that is 82% sensitive and > 98% specific would have a PPV of 88% if the prevalence were 10%, as in a high risk setting, but a PPV of < 3% if the prevalence were only 0.044%, as might be the case in screening the general population.1
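The odds-form calculation above is easy to mechanize. In the sketch below the specificity value (0.98) is an assumption chosen for illustration, since the exact specificity behind the quoted PPV figures is not stated; the outputs illustrate the prevalence effect rather than reproduce the article's numbers:

```python
def positive_predictive_value(prior, sensitivity, specificity):
    """PPV via posterior odds = prior odds x positive likelihood ratio."""
    prior_odds = prior / (1 - prior)
    lr_positive = sensitivity / (1 - specificity)
    posterior_odds = prior_odds * lr_positive
    return posterior_odds / (1 + posterior_odds)

print(positive_predictive_value(0.10, 0.82, 0.98))     # ~0.82 (high-risk setting)
print(positive_predictive_value(0.00044, 0.82, 0.98))  # ~0.018 (general population)
```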

3 Assessing the accuracy of a single diagnostic test

This section begins with a brief overview of design considerations for studies of discriminatory accuracy. It then reviews methods for continuous diagnostic tests, including estimation of the ROC curve, summary indices, and incorporating covariates, and concludes with a brief discussion of methods for dichotomous diagnostic tests. Context-specific assessment of diagnostic test performance will be discussed in Section 5.

3.1 Study design

Study design considerations for diagnostic test accuracy studies have received much attention in the literature, but have been reviewed elsewhere,2,4,6 so only a brief summary of the major issues will be given here.

Boyd6 reviews common study designs for diagnostic test accuracy studies, including prospective and retrospective designs. Results of such studies are subject to several potential biases;4,7,8 some of these biases can be avoided by careful study design, and some can be corrected to some degree in the analysis. Begg4 identifies the most prominent and important biases as those concerned with issues relating to the reference test, or 'gold standard', used to determine true disease status. Verification bias in the reported sensitivity and specificity estimates occurs if the selection of patients to receive the reference test is influenced by the result of the diagnostic test being studied. A prospective study design in which all subjects in a defined cohort receive the reference test is preferred, but selective verification may be unavoidable for some diseases, such as when the reference test is invasive; methods for correction of verification bias exist and have been reviewed recently by Zhou.9 Another important source of bias is an imperfect reference test; many bias-correction methods have been proposed.2,4,10–12 The interpretation of the reference test and diagnostic test results should be independent and blinded.8 Failed or uninterpretable test results should be included in the analysis, because such results may affect the test's usefulness.4,8

Selection of appropriate samples of diseased and nondiseased subjects is critical, since sensitivity and specificity can vary according to the 'case-mix' or disease spectrum in the study cohort.2,4,6,13 Adequate sample size is also critical; methods for determining sample size for diagnostic accuracy studies have been reviewed recently by Obuchowski.13

Many diagnostic accuracy studies in the medical literature have serious methodologic flaws. A recent review14 of diagnostic test studies published in four prominent general medical journals between 1978 and 1993 found that the percentage of studies that fulfilled criteria for each of seven methodological standards ranged from 8% (reported test indexes for clinical subgroups) to 46% (avoided verification bias). The proportions of studies meeting individual standards and meeting multiple standards increased over time during the review period, but in the most recent interval, 1990–93, only one methodological standard was fulfilled by more than 50% of the studies, and only 44% of the studies met at least three of the standards.

3.2 ROC curve estimation

Let X and Y denote the continuous diagnostic test results for nondiseased and diseased populations with cumulative distribution functions F and G, respectively. Then, the sensitivity at decision limit t is the survivor function Ḡ(t) = 1 − G(t), and the FPR is F̄(t) = 1 − F(t). The ROC curve of the diagnostic test is a plot of (F̄(t), Ḡ(t)) for all possible values of t or, equivalently, a plot of (p, q), where p = FPR ∈ (0, 1) and q = TPR = Ḡ(F̄⁻¹(p)).

Suppose diagnostic test results x1 . . . xn0 are available for n0 nondiseased and y1 . . . yn1 for n1 diseased subjects. Many approaches have been proposed for estimating the ROC curve, including parametric, semiparametric, and nonparametric methods; each category will be reviewed in turn.

Parametric approaches

One parametric approach is to assume that F and G follow a parametric family of distributions, e.g. normal or lognormal, and fit these distributions to the observed (possibly transformed) test results.1,15 A major advantage of this distributional approach is that it yields a smooth ROC curve that is completely specified by a small number of parameters, which form the basis for statistical inference; hypothesis tests, variance estimates, and confidence intervals for the fully parametric binormal ROC curve and its parameters are well-developed. The parametric method, however, is quite sensitive to departures from the distributional assumptions.15

An alternative parametric approach is to specify the functional form of the ROC curve, e.g. hyperbolic16 or weighted exponential.17 This approach is closely related to the distributional approach, in that specifying the underlying test result distributions determines the form of the ROC curve; e.g. the symmetric 'bilogistic' ROC curve obtained by assuming that F and G follow logistic distributions with equal scale parameters is hyperbolic in form.5 Note that in fitting an ROC curve directly in ROC space, the fitting algorithm needs to take into account the errors in both FPR and TPR; e.g. simple least squares minimizing vertical deviations would be inappropriate.2

Semi-parametric approaches

Metz et al.18 propose a semi-parametric algorithm, LABROC4, that groups the ordered continuous test measurements into runs of diseased and nondiseased subjects and then fits a binormal ROC curve using the Dorfman–Alf maximum likelihood algorithm for ordinal data.19 The LABROC4 approach assumes that the underlying distributions of the grouped nondiseased and diseased test measurements can be transformed to normal distributions by an unspecified monotone transformation of the test measurement axis. The resulting binormal ROC curve is of the form TPR = Φ(a + bΦ⁻¹(FPR)), where Φ is the standard normal cumulative distribution function, a is the scaled difference in means and b the ratio of standard deviations of the latent normal distributions of diseased and nondiseased. Large-sample normal-theory standard errors and confidence intervals for the ROC curve parameters are provided.
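A sketch of evaluating a binormal ROC curve of this form. The expressions for a and b in terms of the latent means and SDs follow one standard convention (nondiseased N(mu0, sigma0), diseased N(mu1, sigma1)), and the example values are arbitrary:

```python
import numpy as np
from scipy.stats import norm

def binormal_roc(fpr, a, b):
    """Binormal ROC curve: TPR = Phi(a + b * Phi^{-1}(FPR)), where
    a = (mu1 - mu0) / sigma1 and b = sigma0 / sigma1."""
    return norm.cdf(a + b * norm.ppf(fpr))

fpr_grid = np.linspace(0.001, 0.999, 999)
tpr_grid = binormal_roc(fpr_grid, a=1.5, b=1.0)  # symmetric curve, AUC ~ 0.86
```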

The assumption of latent normal distributions for grouped data in the LABROC4 method is less strict than the assumption of explicit distributional form for the continuous measurements in the fully parametric approach. However, the LABROC4 method does still assume that a single monotone transformation exists that would make both latent distributions normal simultaneously; this assumption can be checked graphically by assessing linearity of the raw data on normal probability axes (Φ⁻¹(FPR) versus Φ⁻¹(TPR)). Hanley20 demonstrated that by transformation one can make many highly non-normal pairs of distributions resemble the binormal model; however, Zou et al.21 constructed a simulated dataset in which G was bimodal but F was unimodal (as might occur if the diagnostic test identified only one of two possible manifestations of a disease), and found that the LABROC4 ROC curve departed substantially from the ROC curves obtained by nonparametric methods (described next). Hsieh and Turnbull22 proposed a generalized least squares procedure for fitting the binormal model to grouped continuous data and compared it to the Dorfman–Alf algorithm;19 they also described a minimum-distance estimator of the binormal ROC curve that does not require grouping the data.

Nonparametric approaches

The simplest nonparametric approach involves estimating F and G by the empirical distribution functions Fn0(t) = #{xi < t}/n0 and Gn1(t) = #{yi < t}/n1 for the nondiseased and diseased subjects, respectively. This approach is free of distributional assumptions in that it depends only on the ranks of the observations in the combined sample,23 but the resulting empirical ROC curve is a series of horizontal and vertical steps (in the absence of ties), which can be quite jagged.2 Hsieh and Turnbull22 derive asymptotic properties of this estimator. If the data contain ties between test measurements of nondiseased and diseased subjects, corresponding sections of the empirical ROC curve are undefined;2 Le24 proposes a modified estimator of the empirical ROC curve based on mid-ranks that overcomes this difficulty and shows that it converges uniformly to the true ROC curve.
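A minimal sketch of the empirical ROC curve, assuming the convention that a result at or above the decision limit t is called positive; the simulated data are purely illustrative:

```python
import numpy as np

def empirical_roc(x, y):
    """Empirical ROC points from nondiseased results x and diseased results y:
    at each decision limit t, FPR = P(x >= t) and TPR = P(y >= t)."""
    thresholds = np.concatenate([np.unique(np.concatenate([x, y])), [np.inf]])
    fpr = np.array([(x >= t).mean() for t in thresholds])
    tpr = np.array([(y >= t).mean() for t in thresholds])
    order = np.argsort(fpr)        # sort so the curve runs from (0,0) to (1,1)
    return fpr[order], tpr[order]

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100)      # nondiseased
y = rng.normal(1.5, 1.0, 100)      # diseased
fpr, tpr = empirical_roc(x, y)
```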

An alternative nonparametric approach is to fit a smoothed ROC curve using kernel density estimation of F and G:21,25

$$\hat{F}(t) = \frac{1}{n_0} \sum_{i=1}^{n_0} K\!\left(\frac{t - x_i}{h}\right)$$

where K = ∫k, the kernel k is a mean 0 density function, and h is the 'bandwidth', which controls the amount of smoothing; Ĝ(t) is defined similarly based on y1 . . . yn1. Zou et al.21 suggested using a biweight kernel k(t) = (15/16)(1 − t²)² for t in (−1, 1), with bandwidth hn = 0.9 min(SD, IQR/1.34) n^{−1/5}, where SD and IQR are the sample standard deviation and interquartile range, respectively; separate bandwidths hn0 and hn1 are determined for F and G. This bandwidth is optimal for histograms that are roughly bell-shaped, so that a preliminary transformation of the test measurement scale might be required. Lloyd25 suggested using the standard normal kernel with bandwidths hn1/hn0 = (n0/n1)^{1/3}; this choice of bandwidth minimizes mean square error (MSE) if the overall smoothness of the two densities is similar. Lloyd also suggests using the bootstrap to reduce bias in the smooth ROC curve. Advantages of the kernel method are that it closely follows the details of the original data and is free of parametric assumptions, but disadvantages are that it is unreliable at the ends of the ROC curve, requires some ad hoc experimentation with choice of bandwidth, and does not work well when substantial parts of F or G are near zero.
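A sketch of the smoothed estimate with the biweight kernel and bandwidth rule just quoted; the helper names are mine, and the smooth ROC curve is traced by plotting the two survivor-function estimates against each other over a grid of t:

```python
import numpy as np

def biweight_K(u):
    """K = integral of the biweight kernel k(t) = (15/16)(1 - t^2)^2 on (-1, 1)."""
    u = np.clip(u, -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (u - 2.0 * u**3 / 3.0 + u**5 / 5.0)

def kernel_cdf(data, t):
    """Kernel estimate of the cdf, with h = 0.9 min(SD, IQR/1.34) n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    iqr = np.subtract(*np.percentile(data, [75, 25]))
    h = 0.9 * min(np.std(data, ddof=1), iqr / 1.34) * data.size ** (-0.2)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return biweight_K((t[:, None] - data[None, :]) / h).mean(axis=1)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100)              # nondiseased
y = rng.normal(1.5, 1.0, 100)              # diseased
t_grid = np.linspace(-4.0, 6.0, 400)
smooth_fpr = 1.0 - kernel_cdf(x, t_grid)   # separate bandwidth from x
smooth_tpr = 1.0 - kernel_cdf(y, t_grid)   # separate bandwidth from y
```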

Simultaneous confidence bands for the entire ROC curve

For the binormal model, Ma and Hall26 give simultaneous confidence bands for the entire ROC curve by constructing Working–Hotelling confidence bands in probit coordinates, where the binormal ROC curve is linear, and transforming them back to ROC coordinates; the method generalizes to other location-scale parametric ROC models. Campbell23 constructs fixed-width confidence bands for the entire ROC curve based on the bootstrap; Li et al.27 provide an alternative bootstrap simultaneous confidence band for the entire ROC curve based on a vertical shift quantile comparison function.

3.3 Estimation of summary accuracy indices

The ROC curve provides a graphical summary of discriminatory accuracy, but often a one-number summary index of discriminatory accuracy is desired. Many such summary indices have been proposed; the choice of an appropriate index depends on the application. The AUC is the most popular global index for the ordinal diagnostic tests common in radiology, where the test measurement scale is usually latent rather than an observed continuous variable; however, the AUC has several limitations that may make it less useful for continuous diagnostic tests:28 when two ROC curves cross, the two tests can have similar AUC even though one test has higher sensitivity for certain specificities while the other test has better sensitivity for other specificities; the AUC includes regions of ROC space that would not be of practical interest (e.g. very high FPR, or very low TPR); and the AUC is actually the probability that the test results of a random pair of subjects, one diseased and one nondiseased, are ranked correctly (i.e. P(X < Y)), but in practice the test is not given to pairs of diseased and nondiseased subjects. Despite these limitations, since the AUC continues to be the focus of much methodologic work and may still be useful in certain applications, this section begins by reviewing methods for AUC estimation and inference, then discusses alternative summary indices.

AUC

For normally distributed continuous test results, the AUC is

$$\mathrm{AUC} = \Phi\!\left(\frac{\mu_1 - \mu_0}{\sqrt{\sigma_0^2 + \sigma_1^2}}\right)$$

so the AUC can be estimated directly by substituting sample means and standard deviations;29 Wieand et al. obtain a variance estimator via the delta method.29 The semiparametric LABROC algorithm estimates the AUC based on the binormal ROC curve parameters, with large-sample standard errors and confidence intervals.18

An unbiased nonparametric estimate of P(X < Y) is the trapezoidal area under the empirical ROC curve, which equals the Mann–Whitney form of the two-sample Wilcoxon rank-sum statistic.30,31 Several variance estimators for this nonparametric AUC estimator have been proposed;31–34 Hanley and Hajian-Tilaki35 recommend either using the variance based on the theory of U-statistics33 or jackknifing.34 Coffin and Sukhatme36 show that if the diagnostic test results are subject to measurement error, the nonparametric area estimator is biased downward, i.e. underestimates the true population area. They show that resampling methods such as the bootstrap and jackknife are not appropriate for correction of biases caused by nonsampling errors, and derive a bias-correction factor based on kernel-density estimation. Monte Carlo simulations indicated that for binormal and bigamma ROC models, with normal and non-normal measurement errors, the bias-corrected estimator had smaller bias and comparable MSE to the uncorrected estimator. However, they caution that although measurement errors also reduce the power of hypothesis tests based on the uncorrected nonparametric area estimate, it is unclear whether use of the bias-corrected estimator would result in increased power due to its larger variance.
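A sketch of both estimates discussed above: the Mann–Whitney form of the trapezoidal area (counting ties as 1/2) and, for comparison, the binormal plug-in estimate from the formula above:

```python
import numpy as np
from scipy.stats import norm

def auc_mann_whitney(x, y):
    """Trapezoidal area under the empirical ROC curve, as P_hat(X < Y)."""
    xa = np.asarray(x)[:, None]    # nondiseased
    ya = np.asarray(y)[None, :]    # diseased
    return (xa < ya).mean() + 0.5 * (xa == ya).mean()

def auc_binormal(x, y):
    """Plug-in AUC = Phi((mean_y - mean_x) / sqrt(s_x^2 + s_y^2))."""
    return norm.cdf((np.mean(y) - np.mean(x))
                    / np.hypot(np.std(x, ddof=1), np.std(y, ddof=1)))
```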

Hajian-Tilaki et al.37 assessed the bias of parametric and nonparametric AUC estimates for continuous test results generated by Monte Carlo simulation from binormal and nonbinormal models. The biases in both the parametric and nonparametric AUC estimates were found to be very small with both binormal and nonbinormal data. However, for nonbinormal data, the standard error estimates from both the parametric and nonparametric methods exceeded the true standard errors. Hajian-Tilaki et al. did not evaluate the effect of the small biases and inflated standard errors on the power and size of hypothesis tests; however, in another simulation study in which the binormal model was applied to simulated ordinal rating data from the bilogistic model, Walsh38 observed small biases in area estimates and inflated standard errors that together substantially altered the size and power of statistical tests comparing independent areas, suggesting that model misspecification can lead to altered test size and power loss.

Zou et al.21 obtain the area under their smooth kernel-density-based nonparametric ROC curve by numerical integration. Since the smoothing does not affect the true SE of the AUC, to first order, they use the U-statistic variance for the area under the smooth ROC, and calculate a large-sample normal theory confidence interval after applying a −log(1 − AUC) transformation to reduce skewness. They also provide a variance estimate and transformation-based confidence interval for the numerically-integrated partial area under the smooth nonparametric ROC curve. Lloyd25 shows that for the smooth ROC curve based on standard-normal kernel density estimation, there is no advantage in terms of variance to determining the area under the curve by numerical integration rather than via the Mann–Whitney estimate.

Whether the AUC is estimated parametrically from the binormal model or nonparametrically using the Mann–Whitney statistic, confidence intervals are usually obtained by relying on asymptotic normality. Using Monte Carlo simulation, Obuchowski and Lieber39 evaluated the coverage of the asymptotic confidence intervals and alternative ones based on the bootstrap and Student's t-distribution for a single AUC and for the difference between two AUCs. They generated both continuous and ordinal data with sample sizes between 10 and 70 for both diseased and nondiseased subjects and AUCs of moderate (0.8) and high (0.95) accuracy. They found that for the difference in AUC, asymptotic methods provided adequate coverage even with very small sample sizes (20 per group). In contrast, for a single AUC, the asymptotic methods do not provide adequate coverage for small samples, and for highly accurate diagnostic tests, quite large sample sizes (> 200 patients) are required for the asymptotic methods to be adequate. Instead, they recommend using one of three bootstrap methods, depending on the estimation approach (parametric versus nonparametric) and AUC (moderate versus high).
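A generic percentile-bootstrap interval for a single AUC, resampling the diseased and nondiseased subjects separately; this is a plain percentile bootstrap, not necessarily one of the specific variants Obuchowski and Lieber recommend:

```python
import numpy as np

def mw_auc(x, y):
    """Mann-Whitney AUC estimate P_hat(X < Y), ties counted as 1/2."""
    xa, ya = np.asarray(x)[:, None], np.asarray(y)[None, :]
    return (xa < ya).mean() + 0.5 * (xa == ya).mean()

def bootstrap_auc_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a single AUC."""
    rng = np.random.default_rng(seed)
    stats = [mw_auc(rng.choice(x, size=len(x), replace=True),
                    rng.choice(y, size=len(y), replace=True))
             for _ in range(n_boot)]
    alpha = 100 * (1 - level) / 2
    return np.percentile(stats, [alpha, 100 - alpha])
```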

Partial AUC

If only a particular range of specificity or sensitivity values is relevant, the partial AUC may be a more appropriate accuracy index than the AUC. McClish40 calculates the partial AUC or average sensitivity over a fixed range of FPR by direct numerical integration, with variances based on the binormal model; Thompson and Zucchini41 provide an alternative method based on integrating the bivariate normal probability density function. For settings in which high sensitivity is relevant, Jiang et al.42 extend McClish's approach to obtain the average specificity over a range of high sensitivity.
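A sketch of the partial AUC by direct trapezoidal integration of an estimated ROC curve over a restricted FPR range; the range (0, 0.2) is an arbitrary example, and the (fpr, tpr) input can come from the empirical_roc sketch earlier:

```python
import numpy as np

def partial_auc(fpr, tpr, fpr_range=(0.0, 0.2)):
    """Area under the ROC curve over a restricted FPR range (fpr must be
    sorted in increasing order, as returned by empirical_roc)."""
    grid = np.linspace(fpr_range[0], fpr_range[1], 201)
    vals = np.interp(grid, fpr, tpr)
    return float(np.sum(np.diff(grid) * (vals[1:] + vals[:-1]) / 2))
```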

Sensitivity at fixed specificity

Greenhouse and Mantel43 provide normal-theory and nonparametric tests of the hypothesis that a diagnostic test has at least a specified sensitivity (e.g. ≥ 90%) with FPR no higher than a specified value (e.g. ≤ 5%); however, their procedure does not provide an explicit decision limit that has the desired sensitivity and specificity. Schafer44 discusses two methods for estimating an explicit decision limit with pre-specified specificity and/or sensitivity, one based on tolerance limits and the other an extension of the Greenhouse–Mantel method that applies when both sensitivity and specificity are pre-specified; he demonstrates that the extended Greenhouse–Mantel method can yield substantial gains in efficiency compared to the tolerance-limit method.

For the binormal model, McNeil and Hanley34 give pointwise confidence intervals for sensitivity at fixed FPR by transforming to probit coordinates (Φ⁻¹(FPR), Φ⁻¹(TPR)), constructing large sample confidence intervals, and transforming back to ROC coordinates; the simultaneous Working–Hotelling bands of Ma and Hall26 also have a pointwise interpretation (with a different confidence coefficient) as 'vertical' confidence bands for TPR at fixed FPR, or 'horizontal' bands for FPR at fixed TPR. Schafer45 gives a nonparametric large-sample pointwise confidence interval for sensitivity at the estimated decision limit needed to obtain a pre-specified specificity, by inverting the Greenhouse–Mantel43 test statistic. Zou et al.21 suggested constructing a large-sample normal-theory confidence interval based on the kernel-estimated ROC curve in logit coordinates (logit(FPR), logit(TPR)), then transforming back to ROC coordinates; Lloyd25 suggested the analogous procedure based on a probit transformation.

A difficulty with confidence intervals for sensitivity at fixed FPR is that the decision limit t on the test measurement scale that yields this FPR can only be estimated. An alternative is to construct a joint confidence region for the point (FPR, TPR) on the ROC curve corresponding to a given decision limit t. Hilgers46 constructs a point estimator for t and a joint confidence rectangle with confidence level (1 − α)² based on separate distribution-free tolerance intervals for F and G. Unlike the asymptotic methods, Hilgers' method is valid for any sample size; on the other hand, Schafer45 found that substantial gains in efficiency (i.e. narrower confidence intervals) can be obtained with the Greenhouse–Mantel method versus Hilgers' method when sample sizes are adequate for the Greenhouse–Mantel method to be valid. Zou et al.21 suggest constructing joint confidence rectangles with coverage (1 − α)² by forming large-sample confidence intervals for the kernel-based logit(p) and logit(q) and transforming them back to ROC coordinates. Campbell23 uses Kolmogorov–Smirnov confidence bands for F and G to provide equal-sized joint confidence rectangles for each observed point on the ROC curve that have simultaneous coverage (1 − α)² for the true points (FPR, TPR).

Other indices

Hsieh and Turnbull47 suggest using Youden's index (J = maxt [sensitivity(t) + specificity(t)]) as a measure of diagnostic accuracy. They propose two nonparametric estimation methods for determining Youden's index and the corresponding 'optimal' decision limit, based on estimating F and G with the empirical distribution functions and Gaussian kernel distribution estimates, respectively. However, this approach requires assumptions that limit its applicability: Youden's index essentially dichotomizes the continuous diagnostic test results at the decision limit where the costs of a false positive and false negative are equal5 and it assumes F and G have monotone likelihood ratio, which often does not hold in practice (e.g. pairs of normal and logistic distributions have monotone likelihood ratio only when the scale parameters of the diseased and nondiseased distributions are equal). Two other indices, the projected length of the ROC curve and the area swept out by the ROC curve, have been proposed recently as alternatives to the AUC for continuous diagnostic tests,48 but their statistical properties have not been worked out.
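A sketch of the empirical version of the index defined above, maximized over the observed decision limits. (The conventional Youden J subtracts 1, a constant shift that changes the value but not the maximizing decision limit.)

```python
import numpy as np

def youden_index(x, y):
    """max_t [sensitivity(t) + specificity(t)] and the maximizing decision
    limit, with 'positive' meaning a result at or above t."""
    thresholds = np.unique(np.concatenate([x, y]))
    sens = np.array([(y >= t).mean() for t in thresholds])
    spec = np.array([(x < t).mean() for t in thresholds])
    best = np.argmax(sens + spec)
    return (sens + spec)[best], thresholds[best]
```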

3.4 Incorporating covariate information

External factors can affect the performance of diagnostic tests by influencing the distributions of test measurements for diseased and/or nondiseased subjects. General approaches to incorporating covariate information into ROC analysis include stratification and modelling.

Stratification

Sukhatme and Beam49 use stratification of the diseased or nondiseased (or both) to assess the performance of a diagnostic marker against several types of diseased or nondiseased subjects. They develop methods for estimating and comparing stratum-specific empirical ROC curves and associated nonparametric areas, and for efficient estimation of the population ROC curve and associated area, when the diseased or nondiseased (or both) are stratified. One limitation, pointed out by Le,24 is that this approach can only be used with one binary or categorical covariate; it is less efficient with a continuous covariate and very impractical when several covariates need to be simultaneously included.

Modelling

An alternative approach is to pool data across strata and use a regression modelling approach to adjust for confounding effects. Pepe50 describes three general approaches to regression analysis of ROC curves with continuous test results: (1) modelling covariate effects on the test result, then calculating the induced covariate effect on the ROC curve; (2) modelling covariate effects on summary measures of accuracy, such as the area under the curve; and (3) directly modelling covariate effects on the ROC curve.

Several methods of the first type have been proposed. Pepe50 models the continuous test result as belonging to a location-scale family of unspecified form, with mean function depending on both disease status and covariates (including covariates specific to diseased or nondiseased), and variance depending only on disease status; the mean and variance parameters can be estimated by quasi-likelihood methods. In situations where it is reasonable to group the continuous test results into ordered categories, one could apply methods based on the proportional odds model51 or the continuation ratio model,52 although it is not clear whether either method can handle covariates that are specific to the diseased or nondiseased subjects; the continuation ratio-based approach has several advantages over the proportional odds-based approach, in that it belongs to the class of generalized linear models, and yields ROC curves that are concave but not necessarily symmetric, and fitted probabilities that are always between 0 and 1.

The second approach is appropriate for settings in which the experimental design permits calculation of an ROC curve and its summary measure θk (e.g. AUC or partial area) at each observed covariate level k = 1, . . . , K. Pepe50 suggests using standard linear regression techniques to model the expected value of H0⁻¹(θk) as a linear function of the covariates at level k, where H0⁻¹ is a monotone transformation that gives a linear predictor with unrestricted range (−∞, ∞); e.g. since the range of the AUC is restricted to (0, 1), the probit or logit transformation could be used to remove this restriction. Disadvantages of this approach are that it cannot incorporate continuous covariates or covariates that are specific to diseased subjects (e.g. disease severity), and it requires sufficient data at each covariate level (or combination of levels) to estimate an ROC curve with adequate precision.
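A minimal sketch of this second approach under strong simplifying assumptions: per-level AUC estimates (hypothetical numbers) are probit-transformed and regressed on a single covariate by ordinary, unweighted least squares, ignoring the differing precision of the AUC estimates that a careful analysis would account for:

```python
import numpy as np
from scipy.stats import norm

auc_k = np.array([0.78, 0.83, 0.88, 0.91])   # hypothetical per-level AUCs
z_k = np.array([0.0, 1.0, 2.0, 3.0])         # covariate value at level k

# Model probit(AUC_k) = b0 + b1 * z_k, then back-transform to (0, 1).
X = np.column_stack([np.ones_like(z_k), z_k])
b, *_ = np.linalg.lstsq(X, norm.ppf(auc_k), rcond=None)
fitted_auc = norm.cdf(X @ b)
```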

The third approach differs from the first approach in that it models the relationship between test results in the diseased and nondiseased populations, rather than the test result distributions themselves. Pepe50 suggests modelling the ROC curve as ROC(t) = g{α(t), cX}, where g is a known bivariate function with range (0, 1), increasing in t, and α(t) is a parametric function on the domain (0, 1). A necessary preliminary step is to estimate the test result distribution given relevant covariates in the population of nondiseased subjects, using parametric or nonparametric methods. Regression parameters are fitted using an estimating equations approach. Choosing g to be a survivor function and α to be a linear function of g⁻¹ yields an ROC curve of the same form as one generated by assuming a location-scale model for the test results (Method 1 above). However, modelling the ROC curve directly gives much greater flexibility than modelling the test results in that it allows a much broader range of ROC curve models, it can allow the effect of covariates to vary with FPR (interaction), the distributions of test results in the diseased and nondiseased do not need to be from the same family, and the method can model and compare ROC curves for tests with completely different types of test results; on the other hand, currently the method is more difficult to implement, requires special programming, and lacks adequate model-checking procedures and knowledge of large and small sample properties.

Le24 proposed using the Cox proportional hazards regression model with the mid-rank-based empirical ROC curve to evaluate the effect of covariates on sensitivity in studies where certain covariates, such as disease severity, are available only for the diseased subjects and not the nondiseased subjects; the method can also be used to model specificity when certain covariates are only available for nondiseased subjects. The method can identify important covariates, but the regression coefficients do not have an interpretation similar to relative risk in the survival analysis setting, and the method does rely on proportional hazards and linearity assumptions, which may be strong but testable.

3.5 Dichotomous diagnostic tests

Most laboratory tests give a continuous measurement but some tests give a truly dichotomous result. For a single study in which n0 nondiseased and n1 diseased subjects are given the diagnostic test, sensitivity and specificity can be estimated as described in Section 2, and separate binomial confidence intervals calculated. Coughlin et al.53 suggested using a logistic regression model with true disease status as a covariate to model sensitivity and specificity as a function of other covariates. Leisenring et al.54 propose a more general approach using separate marginal regression models for sensitivity and specificity with robust sandwich variance estimators; the method permits incorporation of different covariates for diseased and nondiseased subjects, and can accommodate clustered and unbalanced data.

The literature on a particular dichotomous diagnostic test often contains several accuracy studies, each of which reports different and possibly discrepant sensitivity and specificity estimates. Shapiro55 reviews quantitative and graphical methods for meta-analysis of such collections of diagnostic accuracy studies.

4 Comparing the accuracy of diagnostic tests

This section reviews methods for comparing the discriminatory accuracy of two or more diagnostic tests; its organization mirrors that of Section 3. The choice of method depends on whether the study used a parallel-group or 'unpaired' design, in which each diagnostic test is applied to a different group of subjects, or a paired design, in which the diagnostic tests are applied to the same subjects; the paired design is more efficient, but requires analysis methods that properly handle the within-subject correlation in the data. Many of the methods for correlated data can also be used with unpaired data. Beam56 recently reviewed methods for analysing correlated data in ROC studies, with an emphasis on clustered ordinal data; however, many of the conceptual considerations and the nonparametric methodological strategies in his paper are also applicable to continuous data.

4.1 Comparing ROC curves

Parametric and semiparametric ROC curves can be compared using hypothesis tests and confidence intervals for differences in the ROC curve parameters. Metz and Kronman57 provided normal-theory methods for comparing unpaired binormal ROC curves. For paired data, Metz et al.58 assume the latent test result distributions are jointly bivariate normal, and derive hypothesis tests for equality of the binormal ROC curve parameters; Metz et al.59 extend this methodology to partially-paired designs, in which some but not all subjects were tested with both diagnostic tests. The method assumes that the paired and unpaired portions of the sample are statistically independent and are representative of the same population, and that failure to obtain data from both tests does not bias the case sample.

Three different nonparametric approaches to comparing ROC curves have been proposed that would be useful in cases where the ROC curves cross or where the AUC does not seem to adequately reflect the situation. Campbell23 provides a test of the hypothesis of no difference between two correlated ROC curves based on the sup norm and studies the distribution of the test statistic using the bootstrap. Venkatraman and Begg60 provide a distribution-free permutation test procedure for testing whether two ROC curves based on paired continuous test results are identical at all operating points; further work is needed to extend this method to unpaired data. Moise et al.61 suggest a hypothesis test based on the integrated absolute difference between two correlated ROC curves (i.e. the area between the two ROC curves), with standard error estimated by bootstrapping. Simulation studies conducted by each author found that each of the three methods had superior power to the usual AUC-based test when the ROC curves crossed, but the three methods have not been compared with each other.

4.2 Comparing summary accuracy indices

AUC and partial AUC

Wieand et al.29 gave a parametric test for difference in AUC for two paired tests with normally-distributed test results; the asymptotic variance of the test statistic was obtained using the delta method. Hanley and McNeil31 provided a nonparametric test of difference in trapezoidal AUC for the unpaired design. Hanley and McNeil32 developed a semiparametric test for comparing nonparametric areas under correlated ROC curves that accounts for the correlations induced by the paired design, assuming an underlying binormal model holds. DeLong et al.33 developed a fully nonparametric approach to comparing correlated AUCs, based on the theory of U-statistics, in which all of the covariance terms are estimated nonparametrically, leading to an asymptotically normal test statistic. Mossman62 suggests using resampling methods such as the bootstrap and jackknife for hypothesis testing to compare independent or correlated AUCs (or other summary indices).

Parametric normal-theory methods for comparing diagnostic tests with respect to partial AUC are also available.40–42 Wieand et al.29 gave a class of nonparametric statistics for comparing two correlated or independent ROC curves with respect to average sensitivity over a restricted range of specificity (partial area), with test procedures and confidence intervals based on asymptotic normality; this class includes the full AUC and the sensitivity at a common specificity as special cases.

Sensitivity at common specificity

Greenhouse and Mantel43 give parametric and nonparametric hypothesis tests for comparing sensitivities of continuous tests at a prespecified common specificity; methods are given for both independent and paired data, and all methods correctly take into account the uncertainty in the estimated FPR, which adds an additional variance component to the uncertainty in the estimated sensitivity. Linnet63 modifies their methods to produce a combined parametric/nonparametric procedure that is appropriate when the test measurements are approximately normally distributed for nondiseased subjects but non-normally distributed for diseased subjects. He also points out that in the medical literature it is common to use the simple binomial test for difference in proportions, rather than the correct Greenhouse–Mantel procedures, to compare diagnostic tests with respect to sensitivity at an estimated FPR of 2.5%, i.e. at the upper limit of each test's reference range (the central interval containing 95% of the test results of the nondiseased population, discussed elsewhere in this issue64). Linnet shows that using the simple binomial test in this way, which ignores the uncertainty in the estimated FPR, can increase its type I error to about seven times the nominal value of 0.05. Li et al.27 provide tests for equality of two independent ROC curves at a fixed FPR and over a range of FPR values based on a vertical shift quantile comparison function.



Beam and Wieand65 provide a method for comparing one or more continuous diagnostic tests to one discrete diagnostic test. The comparison of areas or partial areas under the curve in this setting may be inappropriate, because even if the sensitivities of a continuous and a discrete test are comparable at all decision limits available to the discrete test, the concavity of the ROC curve of the continuous test ensures it will have larger area under the curve than the discrete test. They avoid this problem by comparing the tests only at one or more decision limits available to the discrete test, i.e. comparing with respect to sensitivity at a common specificity (or average sensitivity across a set of common specificities). Their method is related to the Greenhouse–Mantel method, but it differs in that the common specificity cannot be arbitrarily chosen but rather must be estimated due to the discreteness of the test; on the other hand, the Beam–Wieand method permits comparison of more than two diagnostic tests, provided only one is discrete, and it allows comparison of sensitivity at more than one specificity of interest.

4.3 Incorporating covariate information

Sukhatme and Beam49 show how to compare two diagnostic tests when the diseased or nondiseased (or both) are stratified. Pepe50 points out that diagnostic tests can be compared using regression approaches that model the ROC curve or a summary index (e.g. AUC) by including a test indicator as one of the covariates in the model; however, regression approaches that model the test results themselves can only be used for comparison if the results of the two tests can be modelled sensibly in the same regression model.

4.4 Dichotomous diagnostic tests

Comparison of the sensitivity and specificity of two dichotomous diagnostic tests based on a paired design can be done using McNemar's test for paired binary data.54 The marginal regression formulation of Leisenring et al.54 described in Section 3 can be used to compare sensitivities and specificities of two or more diagnostic tests, even if all tests are not carried out on all subjects, and it can accommodate clustered data and adjust for covariates; this method corresponds to McNemar's test when two diagnostic tests are given to all subjects in a simple paired design. Lachenbruch and Lynch66 generalize McNemar's test to assess equivalence of the sensitivity and specificity of two paired dichotomous tests.

5 Methods for context-specific interpretation

This section discusses three aspects of context-specific interpretation of diagnostic test results: selection of an optimal decision limit to define a positive test result, estimation of the likelihood ratio needed to convert an individual test result into a posterior probability of disease, and context-specific assessment of diagnostic test performance.

5.1 Selection of an optimal decision limit

In screening situations, it is often desired to use the diagnostic test to determine whether or not a subject needs further work-up or treatment. Selection of a decision limit at which to dichotomize a continuous diagnostic test result as positive (further work-up or treatment required) or negative (no further work-up required) is an optimization problem; the appropriate method depends on the specific objective and data available and, clearly, many different optimization criteria could be used. Two examples are described here.

One possible objective is to minimize the expected costs of misclassification of diseased and nondiseased subjects, in which case the Bayes minimum cost decision rule is optimal.1 This decision rule classifies a subject as diseased or nondiseased according to whether or not the likelihood ratio f(X | D+)/f(X | D−) of the test result X exceeds a fixed decision limit

$$Q = \frac{\{1 - P(D+)\}\,R_-}{P(D+)\,R_+}$$

where P(D+) is the prior probability of disease, and R− and R+ are the regrets (difference between cost of correct and incorrect decision) associated with classifying a subject as nondiseased and diseased, respectively. Strike1 describes two methods for deriving the optimal decision limit XQ: if the test result distributions for both diseased and nondiseased are assumed normally distributed, the decision limit XQ can be calculated algebraically by solving the equality f(X | D+)/f(X | D−) = Q for X. Alternatively, the decision limit can be determined graphically from the ROC curve, using the fact that the slope of the ROC curve at any decision limit X equals the likelihood ratio at that decision limit; XQ is the decision limit corresponding to the point where the line with slope Q is tangent to the ROC curve.
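For the algebraic method, the equal-SD normal case has a closed form, since the log likelihood ratio is then linear in X (with unequal SDs it is quadratic in X and the equality can be solved numerically); a minimal sketch:

```python
import numpy as np

def bayes_decision_limit(mu0, mu1, sigma, p_disease, regret_neg, regret_pos):
    """Solve f(X | D+)/f(X | D-) = Q for X when both test result
    distributions are normal with common SD sigma;
    Q = (1 - P(D+)) R- / (P(D+) R+)."""
    q = (1 - p_disease) * regret_neg / (p_disease * regret_pos)
    return sigma**2 * np.log(q) / (mu1 - mu0) + (mu0 + mu1) / 2
```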

As an alternative to the cost-minimizing approach, Somoza and Mossman67 suggest selecting a prevalence-specific decision limit that maximizes diagnostic information (in the information-theoretic sense) in the particular clinical situation. This method is appropriate when the objective of testing is to reduce overall uncertainty about the patient's disease status, rather than to balance the error rates based on their respective costs, and in situations where misclassification costs are open to dispute or are difficult to estimate precisely.
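A sketch of the information-maximizing idea (not Somoza and Mossman's exact algorithm): for a hypothetical binormal test, scan candidate decision limits and keep the one that maximizes the mutual information between disease status and the dichotomized result at a given prevalence.

```python
# Sketch: prevalence-specific decision limit maximizing diagnostic
# information. Binormal parameters and prevalence are hypothetical.
import numpy as np
from scipy.stats import norm

mu_neg, sd_neg = 50.0, 10.0
mu_pos, sd_pos = 70.0, 12.0
prev = 0.10  # P(D+)

def mutual_information(cut):
    """Mutual information I(D;T), in bits, between disease status D and
    the test result T dichotomized at the given cutoff."""
    tpr = norm.sf(cut, mu_pos, sd_pos)   # sensitivity
    fpr = norm.sf(cut, mu_neg, sd_neg)   # 1 - specificity
    # Joint distribution of (disease status, dichotomized result).
    joint = np.array([[prev * tpr,       prev * (1 - tpr)],
                      [(1 - prev) * fpr, (1 - prev) * (1 - fpr)]])
    pd = joint.sum(axis=1, keepdims=True)   # marginal of D
    pt = joint.sum(axis=0, keepdims=True)   # marginal of T
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (pd * pt))
    return np.nansum(terms)

cuts = np.linspace(30, 100, 701)
best = cuts[np.argmax([mutual_information(c) for c in cuts])]
print(f"information-maximizing decision limit ~ {best:.1f}")
```

Because the optimum depends on the prevalence entered, rerunning the scan at a different prevalence generally moves the decision limit, which is the point of the method.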

5.2 Estimation of the likelihood ratio

Several methods have been proposed for estimating the likelihood ratio L(X) corresponding to an individual test result X. If the ROC curve of the diagnostic test is available for a population similar to the population of interest, the likelihood ratio can be estimated graphically as the slope of the tangent to the ROC curve at the point corresponding to the test result X; note that this method requires knowledge of the actual test measurements corresponding to each point on the ROC curve, which may not be included in published papers.
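In discrete form, this graphical method amounts to computing interval likelihood ratios: the slope of an empirical ROC segment between two adjacent decision limits equals the likelihood ratio for results falling between them. A small sketch with hypothetical data:

```python
# Sketch: interval likelihood ratios as slopes of empirical ROC segments.
# All test results below are hypothetical.
import numpy as np

diseased    = np.array([61, 68, 72, 75, 80, 84, 90, 95])
nondiseased = np.array([40, 45, 48, 52, 55, 58, 63, 70])

thresholds = np.array([50, 60, 70, 80])
tpr = np.array([(diseased >= t).mean() for t in thresholds])
fpr = np.array([(nondiseased >= t).mean() for t in thresholds])

# Slope of the ROC segment between consecutive decision limits =
# (drop in TPR) / (drop in FPR) = likelihood ratio for that interval.
d_tpr = tpr[:-1] - tpr[1:]
d_fpr = fpr[:-1] - fpr[1:]
for lo, hi, lr in zip(thresholds[:-1], thresholds[1:], d_tpr / d_fpr):
    print(f"LR for results in [{lo}, {hi}) ~ {lr:.2f}")
```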

For the case where the test results for both diseased and nondiseased subjects are normally distributed, Strike1 directly calculates the likelihood ratio corresponding to an individual test result and gives a normal-theory standard error and large-sample confidence interval for the posterior probability of disease. If one or both test result distributions are non-normal and cannot both be transformed to normality by a single transformation, he suggests using kernel density estimation to obtain separate smooth estimates of the nondiseased and diseased test result distributions, then calculating the likelihood ratio at the test result X as the ratio of the kernel density estimates f(X|D+) and f(X|D−); however, he does not give a variance estimate or confidence interval for the resulting estimated posterior probability of disease. Linnet68 suggested a different nonparametric method based on estimating relative frequency polygons of the test result distributions.
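A minimal sketch of the kernel-density approach, with simulated data standing in for real test results; the distributions, the individual result, and the prior probability are assumptions of the example.

```python
# Sketch: kernel-density estimate of the likelihood ratio at a test result X,
# and conversion to a posterior probability via Bayes' theorem.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
nondiseased = rng.normal(50, 10, 200)   # hypothetical D- results
diseased    = rng.gamma(9, 9, 150)      # hypothetical, non-normal D+ results

# Separate smooth density estimates for each group.
f_pos = gaussian_kde(diseased)
f_neg = gaussian_kde(nondiseased)

x = 75.0                                # individual test result
lr = f_pos(x)[0] / f_neg(x)[0]          # L(X) = f(X|D+) / f(X|D-)

# Posterior odds = L(X) * prior odds.
prior = 0.10
post_odds = lr * prior / (1 - prior)
print(f"L({x}) ~ {lr:.2f}, posterior P(D+|X) ~ {post_odds/(1+post_odds):.3f}")
```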

Albert69 suggests estimating the likelihood ratio L(X) using logistic regression of true disease status on the test result X. This approach assumes that the likelihood ratio can be well approximated by an exponential function; in particular, L(X) = exp(β0 + β1X + log(nD−/nD+)), where β0 and β1 are the usual logistic regression coefficients and nD− and nD+ are the numbers of nondiseased and diseased subjects, respectively. This method can handle dichotomous or continuous diagnostic tests, and can easily incorporate covariate information. Albert and Harris70 and Simel et al.71 discuss confidence intervals for likelihood ratios.
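A sketch of Albert's estimator using unpenalized logistic regression; the simulated data are hypothetical, and the intercept correction follows the formula above.

```python
# Sketch: likelihood ratio from logistic regression of disease status on
# the test result, with the sample-size correction to the intercept.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x_neg = rng.normal(50, 10, 200)          # hypothetical nondiseased results
x_pos = rng.normal(70, 12, 100)          # hypothetical diseased results

x = np.concatenate([x_neg, x_pos])
y = np.concatenate([np.zeros(len(x_neg)), np.ones(len(x_pos))])

# Logistic regression of true disease status on the test result.
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
b0, b1 = fit.params

def likelihood_ratio(result):
    """L(X) = exp(b0 + b1*X + log(n_neg/n_pos)); the log term corrects the
    intercept for the relative numbers of nondiseased and diseased."""
    return np.exp(b0 + b1 * result + np.log(len(x_neg) / len(x_pos)))

print(f"L(65) ~ {likelihood_ratio(65.0):.2f}")
```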

5.3 Context-specific assessment of diagnostic test performance

The methods for assessing the discriminatory accuracy of a diagnostic test reviewed in Sections 3 and 4 provide information only about the ability of the test to distinguish between diseased and nondiseased patients; they do not directly indicate how well the test result for an individual patient predicts the probability of disease in that patient, which will vary depending on the prevalence of disease. This section reviews methods for assessing context-specific diagnostic test performance. The results of all these methods are prevalence-dependent and may not be directly generalizable to other settings.

Linnet72 suggests assessing diagnostic test performance based on the accuracy of the estimated posterior probabilities. He proposes using a strictly proper scoring rule, which is a function of the posterior probabilities that has maximum expected value at the underlying true probabilities. His approach is to estimate the posterior probability as a function of the diagnostic test result from a training set of observations, then to assess the score on a new set of test observations. Bootstrap standard error estimation and bias correction allow the same subjects to serve as both training and test sets. Comparison of two diagnostic tests can also be done using the bootstrap.
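A sketch in the spirit of this approach, using the Brier score as one strictly proper scoring rule and Efron-style bootstrap bias correction; the data, the logistic model for the posterior probabilities, and the choice of score are assumptions of the example.

```python
# Sketch: scoring estimated posterior probabilities with a strictly proper
# rule (Brier score) and bias-correcting the apparent score by the
# bootstrap, so the same subjects serve as training and test sets.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(50, 10, 200), rng.normal(70, 12, 100)])
y = np.concatenate([np.zeros(200), np.ones(100)])

def brier(y_true, p_hat):
    return np.mean((y_true - p_hat) ** 2)   # lower is better

def fit_predict(x_tr, y_tr, x_ev):
    fit = sm.Logit(y_tr, sm.add_constant(x_tr)).fit(disp=False)
    return fit.predict(sm.add_constant(x_ev))

apparent = brier(y, fit_predict(x, y, x))   # trained and scored on same data

# Bootstrap optimism: how much worse a model trained on a resample scores
# on the original sample than on its own resample, averaged over resamples.
n, optimism = len(y), []
for _ in range(200):
    idx = rng.integers(0, n, n)
    score_train = brier(y[idx], fit_predict(x[idx], y[idx], x[idx]))
    score_orig = brier(y, fit_predict(x[idx], y[idx], x))
    optimism.append(score_orig - score_train)

print(f"apparent Brier = {apparent:.4f}, "
      f"bias-corrected = {apparent + np.mean(optimism):.4f}")
```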

Somoza and Mossman67 suggest comparing diagnostic tests by determining the prevalence-specific decision limit for each test that maximizes diagnostic information (i.e. maximizes the reduction in uncertainty about disease status), then comparing the tests at their optimal decision limits, which may occur at different FPR. This method may be more useful than comparing partial AUC or sensitivity at fixed specificity when ROC curves cross; in this situation, comparison of partial areas or TPR at fixed FPR may be difficult if the ranges of FPR where the tests perform optimally do not coincide. Many other related approaches have been proposed in the medical decision-making literature to incorporate the consequences (in terms of risks and benefits) of subsequent diagnostic decisions based on the test result into the evaluation of diagnostic test performance. A review of this literature is beyond the scope of this paper; a good starting place is a recent paper by Moons et al.73

Bloch74 provides two alternative methods for comparing two diagnostic tests in a paired design where the decision limit for each test and the costs of a false positive and a false negative are specified a priori. One method is based on comparing the risks of the two tests, and the other on comparing kappa coefficients.



6 Concluding remarks

The assessment and interpretation of laboratory diagnostic tests have received increasing attention in the statistical literature in recent years. Many new methods have been proposed, but the relative advantages and disadvantages of different methods have not yet been assessed. More work is needed in identifying the most promising methods and incorporating them into widely available statistical software.

This paper has reviewed methods for assessing, comparing, and interpreting individual diagnostic tests. In actual medical practice, however, individual diagnostic test results are not interpreted in a vacuum, but are combined by the physician with all other available information to arrive at a diagnosis. A large and growing literature exists on computer-based methods for diagnosis and prognosis, including both statistical methods such as logistic regression and recursive partitioning, and nonstatistical methods such as expert systems and neural networks. Strike1 and Begg4 give brief overviews of this area and include several references.

Despite the availability of methodology for interpreting diagnostic test results, relatively few physicians actually use formal methods such as sensitivity, specificity, likelihood ratios, or Bayes theorem when interpreting diagnostic test results. A recent telephone survey of 300 practicing physicians found that only 3% reported calculating the posterior probability of disease using Bayes theorem, 1% reported using ROC curves, and 1% reported using likelihood ratios when evaluating diagnostic test results.75 On the other hand, 84% reported using a test's sensitivity and specificity when conducting diagnostic evaluations, primarily when interpreting test results; however, when asked how they used these indexes, 95% described an informal method that amounted to estimating the test's posterior error probabilities (i.e. the proportion of patients who test positive that subsequently are found to be nondiseased, and the proportion of patients who test negative that subsequently are found to be diseased) based on patients in their own practice, rather than using published sensitivity and specificity data. Additional efforts to make the formal diagnostic test interpretation methods more user-friendly and to explain their use, like the two recent papers by Jaeschke et al.,76,77 are clearly needed.

Acknowledgement

Supported in part by the National Institute of Allergy and Infectious Diseases, cooperative agreement number U01-AI411.

References

1 Strike PW. Measurement in laboratory medicine: a primer on control and interpretation. Oxford: Butterworth-Heinemann, 1995.
2 Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry 1993; 39: 561–77.
3 Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Critical Reviews in Diagnostic Imaging 1989; 29: 307–35.
4 Begg CB. Advances in statistical methodology for diagnostic medicine in the 1980's. Statistics in Medicine 1991; 10: 1887–95.



5 Swets JA. Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psychological Bulletin 1986; 99: 100–17.
6 Boyd JC. Mathematical tools for demonstrating the clinical usefulness of biochemical markers. Scandinavian Journal of Clinical and Laboratory Investigation 1997; 57(suppl. 227): 46–63.
7 Begg CB. Biases in the assessment of diagnostic tests. Statistics in Medicine 1987; 6: 411–24.
8 Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988; 167: 565–69.
9 Zhou XH. Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research 1998; 7: 337–53.
10 Hui SL, Zhou XH. Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research 1998; 7: 354–70.
11 Torrance-Rynard VL, Walter SD. Effects of dependent errors in the assessment of diagnostic test performance. Statistics in Medicine 1997; 16: 2157–75.
12 Phelps CE, Hutson A. Estimating diagnostic test accuracy using a fuzzy gold standard. Medical Decision Making 1995; 15: 44–57.
13 Obuchowski NA. Sample size calculations in studies of test accuracy. Statistical Methods in Medical Research 1998; 7: 371–92.
14 Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research. Journal of the American Medical Association 1995; 274: 645–51.
15 Goddard MJ, Hinberg I. Receiver operating characteristic (ROC) curves and non-normal data: an empirical study. Statistics in Medicine 1990; 9: 325–37.
16 Egan JP. Signal detection theory and ROC analysis. New York: Academic Press, 1975.
17 England WL. An exponential model used for optimal threshold selection on ROC curves. Medical Decision Making 1988; 8: 120–31.
18 Metz CE, Herman BA, Shen J-H. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Statistics in Medicine 1998; 17: 1033–53.
19 Dorfman DD, Alf E Jr. Maximum likelihood estimation of parameters of signal detection theory – a direct solution. Psychometrika 1968; 33: 117–24.
20 Hanley JA. The use of the 'binormal' model for parametric ROC analysis of quantitative diagnostic tests. Statistics in Medicine 1996; 15: 1575–85.
21 Zou KH, Hall WJ, Shapiro DE. Smooth non-parametric receiver operating characteristic (ROC) curves for continuous diagnostic tests. Statistics in Medicine 1997; 16: 2143–56.
22 Hsieh F, Turnbull BW. Nonparametric and semiparametric estimation of the receiver operating characteristic curve. Annals of Statistics 1996; 24: 25–40.
23 Campbell G. General methodology I. Advances in statistical methodology for the evaluation of diagnostic and laboratory tests. Statistics in Medicine 1994; 13: 499–508.
24 Le CT. Evaluation of confounding effects in ROC studies. Biometrics 1997; 53: 998–1007.
25 Lloyd CJ. Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems. Journal of the American Statistical Association 1998; 93: 1356–64.
26 Ma G, Hall WJ. Confidence bands for receiver operating characteristic curves. Medical Decision Making 1993; 13: 191–97.
27 Li G, Tiwari RC, Wells MT. Quantile comparison functions in two-sample problems, with application to comparisons of diagnostic markers. Journal of the American Statistical Association 1996; 91: 689–98.
28 Hilden J. The area under the ROC curve and its competitors. Medical Decision Making 1991; 11: 95–101.
29 Wieand S, Gail MH, James BR, James KL. A family of non-parametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 1989; 76: 585–92.
30 Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 1975; 12: 387–415.
31 Hanley JA, McNeil BJ. The meaning and use of the area under a ROC curve. Radiology 1982; 143: 27–36.
32 Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983; 148: 839–43.
33 DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a non-parametric approach. Biometrics 1988; 44: 837–45.

34 McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Medical Decision Making 1984; 4: 137–50.

35 Hanley JA, Hajian-Tilaki KO. Sampling variability of non-parametric estimates of the area under receiver operating characteristic curves: an update. Academic Radiology 1997; 4: 49–58.
36 Coffin M, Sukhatme S. Receiver operating characteristic studies and measurement errors. Biometrics 1997; 53: 823–37.
37 Hajian-Tilaki KO, Hanley J, Joseph L, Collet J-P. A comparison of parametric and non-parametric approaches to ROC analysis of quantitative diagnostic tests. Medical Decision Making 1997; 17: 94–102.
38 Walsh SJ. Limitations to the robustness of binormal ROC curves: effects of model misspecification and location of decision thresholds on bias, precision, size and power. Statistics in Medicine 1997; 16: 669–79.
39 Obuchowski NA, Lieber M. Confidence intervals for the receiver operating characteristic area in studies with small samples. Academic Radiology 1998; 5: 561–71.
40 McClish DK. Analyzing a portion of the ROC curve. Medical Decision Making 1989; 9: 190–95.
41 Thompson ML, Zucchini W. On the statistical analysis of ROC curves. Statistics in Medicine 1989; 8: 1277–90.
42 Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests. Radiology 1996; 201: 745–50.
43 Greenhouse SW, Mantel N. The evaluation of diagnostic tests. Biometrics 1950; 6: 399–412.
44 Schafer H. Constructing a cut-off point for a quantitative diagnostic test. Statistics in Medicine 1989; 8: 1381–91.
45 Schafer H. Efficient confidence bounds for ROC curves. Statistics in Medicine 1994; 13: 1551–61.
46 Hilgers R. Distribution-free confidence bounds for ROC curves. Methods of Information in Medicine 1991; 30: 96–101.
47 Hsieh F, Turnbull BW. Nonparametric methods for evaluating diagnostic tests. Statistica Sinica 1996; 6: 47–62.
48 Lee W-C, Hsiao CK. Alternative summary indices for the receiver operating characteristic curve. Epidemiology 1996; 7: 605–11.
49 Sukhatme S, Beam C. Stratification in non-parametric ROC studies. Biometrics 1994; 50: 149–63.

50 Pepe MS. Three approaches to regression analysis of receiver operating characteristic curves for continuous test results. Biometrics 1998; 54: 124–35.

51 Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Medical Decision Making 1988; 8: 204–15.
52 Smith PJ, Thompson TJ, Engelgau MM, Herman WH. A generalized linear model for analysing receiver operating characteristic curves. Statistics in Medicine 1996; 15: 323–33.
53 Coughlin SS, Trock B, Criqui MH, Pickle LW, Browner D, Tefft MC. The logistic modelling of sensitivity, specificity, and predictive value of a diagnostic test. Journal of Clinical Epidemiology 1992; 45: 1–7.
54 Leisenring W, Pepe MS, Longton G. A marginal regression modelling framework for evaluating medical diagnostic tests. Statistics in Medicine 1997; 16: 1263–81.
55 Shapiro DE. Issues in combining independent estimates of the sensitivity and specificity of a diagnostic test. Academic Radiology 1995; 2: S37–S47.
56 Beam CA. Analysis of clustered data in receiver operating characteristic studies. Statistical Methods in Medical Research 1998; 7: 324–36.
57 Metz CE, Kronman HB. Statistical significance tests for binormal ROC curves. Journal of Mathematical Psychology 1980; 22: 218–43.
58 Metz CE, Wang P-L, Kronman HB. A new approach for testing the significance of differences between ROC curves measured from correlated data. In Deconinck F ed. Information Processing in Medical Imaging: Proceedings of the Eighth Conference. The Hague: Martinus Nijhoff, 1984: 432–45.
59 Metz CE, Herman BA, Roe CA. Statistical comparison of two ROC-curve estimates obtained from partially-paired datasets. Medical Decision Making 1998; 18: 110–21.
60 Venkatraman E, Begg CB. A distribution-free procedure for comparing receiver operating characteristic curves from a paired experiment. Biometrika 1996; 83: 835–48.
61 Moise A, Clement B, Raissis M. A test for crossing receiver operating characteristic (ROC) curves. Communications in Statistics – Theory and Methods 1988; 17: 1985–2003.
62 Mossman D. Resampling techniques in the analysis of non-binormal ROC data. Medical Decision Making 1995; 15: 358–66.
63 Linnet K. Comparison of quantitative diagnostic tests: type I error, power and sample size. Statistics in Medicine 1987; 6: 147–58.



64 Bland JM, Altman DG. Measuring agreement in method comparison studies. Statistical Methods in Medical Research 1999; 8: 135–60.
65 Beam CA, Wieand S. A statistical method for the comparison of a discrete diagnostic test with several continuous diagnostic tests. Biometrics 1991; 47: 907–19.
66 Lachenbruch PA, Lynch CJ. Assessing screening tests: extensions of McNemar's test. Statistics in Medicine 1998; 17: 2207–17.
67 Somoza E, Mossman D. Comparing and optimizing diagnostic tests: an information-theoretic approach. Medical Decision Making 1992; 12: 179–88.
68 Linnet K. A review on the methodology for assessing diagnostic tests. Clinical Chemistry 1988; 34: 1379–86.
69 Albert A. On the use and computation of likelihood ratios in clinical chemistry. Clinical Chemistry 1982; 28: 1113–19.
70 Albert A, Harris EK. Multivariate interpretation of clinical laboratory data. New York: Marcel Dekker, 1987.
71 Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence: sample size estimation for diagnostic test studies. Journal of Clinical Epidemiology 1991; 44: 763–70.
72 Linnet K. Assessing diagnostic tests by a strictly proper scoring rule. Statistics in Medicine 1989; 8: 609–18.
73 Moons KG, Stijnen T, Michel BC, Buller HR, Es G-AV, Grobbee DE, Habbema DF. Application of treatment thresholds to diagnostic-test evaluation: an alternative to the comparison of areas under receiver operating characteristic curves. Medical Decision Making 1997; 17: 447–54.
74 Bloch DA. Comparing two diagnostic tests against the same 'gold standard' in the same sample. Biometrics 1997; 53: 73–85.
75 Reid MC, Lane DA, Feinstein AR. Academic calculations versus clinical judgments: practicing physicians' use of quantitative measures of test accuracy. The American Journal of Medicine 1998; 104: 374–80.
76 Jaeschke R, Guyatt GH, Sackett DL, the Evidence-Based Medicine Working Group. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? Journal of the American Medical Association 1994; 271: 389–91.
77 Jaeschke R, Guyatt GH, Sackett DL, the Evidence-Based Medicine Working Group. Users' guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? Journal of the American Medical Association 1994; 271: 703–707.

