
Borrelia diagnostics – statistical aspects

Jørgen Hilden, jh@biostat.ku.dk

February 2009

Notes have been added in this file

Plan of my talk

Clinicometric framework

Descriptors of diagnostic power

Displays of diagnostic power

including the ROC diagram

Simultaneous use of 2 measurements

Randomized testing of diagn. procedures

Special topics in supplementary slides

Biostatistical motto: Formalism with a human face

Topics not mentioned

Systematic reviews &

meta-analyses

”Clinicometrics”

…always considers a stream of cases

( statisticians say: a population of cases ):

They are the units of clinical experience

and also of clinical decision making.

They are instances of a (well-defined?)

clinical problem,

”the who-how-where-why of

a patient-doctor encounter.”

Therefore…

In clinical studies the choice of sample, and of the variables on which one bases one's prediction,

must match the clinical problem as it presents itself at the time of decision making.

In particular, one mustn't discard subgroups (as ‘atypical’ or ‘impurities’) that did not become identifiable until later: ensure prospective recognizability !

Data collection*

*as opposed to the ’engineering’ phases

Purity vs. representativeness:

A meticulously filtered case stream ( 'proven single-agent infections', or 'meeting CDC criteria' ) may be needed for patho- and pharmaco-physiological research,

but is inappropriate as a basis for clinical decision policies

[incl. cost studies].

Data collection

Your job is to create decision rules that help the clinician decide, e.g.

- whether to proceed with antibiotics
- when to plan clin. & serol. follow-up checks
- when to apply other tests, e.g. for HSV

►ideally drawing a complete management flowchart, i.e. a bushy tree of action diagnoses, not etiological diagnoses

Don’t forget…

Consecutivity as a safeguard against selection bias.

Standardization: How? Who? Where? When? Gold standard … the big problem !! w. blinding, etc.

Safeguards against change of data after the fact.

w3.consort-statement.org/Initiatives/stardClinical_Chemistry_statement.pdf

Data collection

Quantitative markers

A quantity holds the result of a diagnostic procedure. Histograms describe its distribution in two subpopulations.

We can interpret ordinates and areas under the two humps in terms of true and false decisions …

and get a feel for the trade-off involved, provided that the pre-test probability of disease (percentage diseased) is known.

Focussing on … principle …

[Figure: overlapping histograms of the measurement for the Diseased and Non-disease subpopulations; each area = 1.00 = 100 % of the subpopulation. A cutoff point divides the axis into a negative range and a positive range; the Diseased tail below the cutoff gives the false negatives, the Healthy tail above it the false positives.]

… principle …

[Same figure: the area of the Diseased curve in the positive range is the sensitivity (true positive fraction); the area of the Non-disease curve in the negative range is the specificity (true negative fraction). Note: BLACK&WHITE paradigm!]

Pre-test ’case mix’: 30 % diseased, 70 % non-diseased.

I.e., 64.4 % of cases are true negatives; the other three areas are analogous.

The ’probability square’

[Figure: the probability square. The 30 % / 70 % case mix splits the width; sensitivity (true posit. fraction) and 1 – spec. (false positive fraction) split the heights. With specificity = 0.92, say, the true-negatives area = 0.70 × 0.92 = 0.644; the two cells in the positive band together make up all the positives.]
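Added note: the square’s bookkeeping in a few lines of Python. The 30 % / 70 % case mix and the specificity of 0.92 are the slide’s numbers; the sensitivity value is an invented placeholder.

```python
# The 'probability square': four cell areas from prevalence, sens, spec.
prevalence = 0.30   # pre-test probability of disease (slide's case mix)
sens = 0.75         # true positive fraction -- assumed, not from the slide
spec = 0.92         # true negative fraction (from the slide)

tp = prevalence * sens              # true-positive area
fn = prevalence * (1 - sens)        # false-negative area
tn = (1 - prevalence) * spec        # true-negative area: 0.70 * 0.92 = 0.644
fp = (1 - prevalence) * (1 - spec)  # false-positive area

print(f"TP={tp:.3f}  FN={fn:.3f}  TN={tn:.3f}  FP={fp:.3f}")
assert abs(tp + fn + tn + fp - 1.0) < 1e-9  # the four areas tile the square
```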

Classical terminology

”Positive” = suggestive of (target) disease

”Negative” = suggestive of its absence

”False / True Positive / Negative …”

Sensitivity = TP/(those diseased)

Specificity = TN/(those without it)

What is meant by PV ( ”predictive value” )?
What is meant by LR ( ”likelihood ratio” )?

Classical terminology

”Positive” = suggestive of (target) disease
”Negative” = suggestive of its absence
”False / True Positive / Negative …”
Sensitivity = TP/(those diseased)
Specificity = TN/(those without it)

PVpos = the ”predictive value” of a positive outcome = TP/(all positives) = Pr{ disease | pos }

…chance that the test is right when it says ”positive”

Classical terminology

”Positive” = suggestive of (target) disease
”Negative” = suggestive of its absence
”False / True Positive / Negative …”
Sensitivity = TP/(those diseased)
Specificity = TN/(those without it)

PVneg = the ”predictive value” of a negative outcome = TN/(all negatives) = Pr{ non-disease | neg }

…chance that the test is right when its verdict is ”negative”
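Added note: both predictive values follow from sensitivity, specificity and prevalence via the square above. A minimal sketch, with the same illustrative (partly assumed) numbers:

```python
# Predictive values by Bayes-style bookkeeping on the probability square.
def predictive_values(prevalence, sens, spec):
    tp = prevalence * sens
    fp = (1 - prevalence) * (1 - spec)
    tn = (1 - prevalence) * spec
    fn = prevalence * (1 - sens)
    pv_pos = tp / (tp + fp)  # Pr{ disease | pos } = TP / (all positives)
    pv_neg = tn / (tn + fn)  # Pr{ non-disease | neg } = TN / (all negatives)
    return pv_pos, pv_neg

pv_pos, pv_neg = predictive_values(0.30, 0.75, 0.92)  # sens 0.75 is assumed
print(f"PVpos = {pv_pos:.2f}, PVneg = {pv_neg:.2f}")
```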

Pre-test odds = 3 : 7 ( 30 % diseased, 70 % non-diseased ).

”LR” = 5 : 1 (the ratio of red arrows); ergo post-test odds = 15 : 7.

”Likelihood ratio” principle …

[Figure: the red arrows mark sensitivity and 1 – specificity on the two distributions; their ratio is the LR.]

Pre-test odds low in Lyme problems: few diseased, many non-diseased.

”LRpos” = 5 : 1 is fair; but post-test odds and PVpos are still low. Specificity is not bad. Yet most positives are false positives.
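Added note: a concrete version of this point. The LRpos of 5 is the slide’s figure; the 2 % pre-test probability is an illustrative assumption for a low-prevalence Lyme setting.

```python
# Why PVpos stays low when pre-test odds are low:
# post-test odds = LRpos * pre-test odds.
pre_odds = 0.02 / 0.98          # assumed 2 % prevalence -> odds 1 : 49
post_odds = pre_odds * 5        # LRpos = 5 : 1 (from the slide)
pv_pos = post_odds / (1 + post_odds)
print(f"PVpos = {pv_pos:.2f}")  # about 0.09: most positives are false
```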

Not quite so classical terminology:

Sensitivity = TP/(those diseased)

Specificity = TN/(those without it)

LRpos = the ”likelihood ratio” occasioned

by a positive outcome =

(sensitivity) / (1 – specificity) =

Pr{ pos | disease } / Pr{ pos | non-disease }

Not quite so classical terminology:

Sensitivity = TP/(those diseased)

Specificity = TN/(those without it)

LRneg = the ”likelihood ratio” occasioned

by a negative outcome =

(1 – sensitivity) / (specificity) =

Pr{ neg | disease } / Pr{ neg | non-disease }

LRneg = 0.1 = 1 : 10, for instance. If the pre-test risk of Lyme Disease is low, say p = 2 %, a negative outcome almost eliminates it:

(post-test odds) = (pre-test odds)(LR) = (1 : 49)(1 : 10) = (1 : 490) .
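Added note: the whole LR principle is one line of arithmetic. This sketch just reproduces the slide’s own example:

```python
# (post-test odds) = (pre-test odds) * (LR)
def post_test_odds(pre_odds, lr):
    return pre_odds * lr

odds = post_test_odds(1 / 49, 0.1)             # p = 2 %, LRneg = 1 : 10
print(f"post-test odds = 1 : {1 / odds:.0f}")  # 1 : 490
print(f"post-test risk = {odds / (1 + odds):.4f}")  # about 0.2 %
```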

”LR” principle: it’s the factor by which the observed data will change the odds

[Figure: the two-hump histogram again, with cutoff point, negative/positive ranges, false negatives and false positives. LRpos and LRneg are each a ratio of areas on one side of the cutoff. *Note that LRneg < 1 (!)]

”LR” principle: it’s still the factor by which the observed data will change the odds

[Figure: the same histograms, but now LR, when the data = a measurement value, is the ratio of the two ordinates at that value. The cutpoint is now irrelevant.]

Warning: a 2-gate study

[Figure: 50 diseased and 75 non-diseased cases, sampled separately.]

”LRpos” = 5 : 1, but the ”predictive values” and the post-test odds are unavailable: the 50 : 75 case mix is fixed by the sampling design, so it carries no pre-test odds.

A 2-dim. task

[Scatter plot: IgM vs. IgG; points mark confirmed infections and confirmed non-infected cases.]

A 2-dim. task

[Same plot with a new, undiagnosed patient (?) and iso-Likelihood-Ratio lines (uphill arrow).]

A 2-dim. task

Nearest-neighbours classification of a new patient [same IgM/IgG plot, the new patient marked ?].

A 2-dim. task

Kernel methods form a weighted average of neighbouring prototypes (diagnosed cases), etc., with decreasing influence the farther away.
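Added note: a minimal kernel sketch along those lines. The prototype coordinates, the labels and the Gaussian kernel with bandwidth 1.0 are all invented for illustration.

```python
# Kernel classification: each diagnosed prototype votes for its class,
# down-weighted by a Gaussian kernel of its distance to the new point.
import math

prototypes = [  # ((IgM, IgG), label): label 1 = confirmed infection
    ((2.1, 3.0), 1), ((1.8, 2.6), 1), ((0.4, 0.5), 0), ((0.7, 0.9), 0),
]

def kernel_score(x, prototypes, bandwidth=1.0):
    """Weighted vote for 'infection'; weight falls off with distance."""
    num = den = 0.0
    for (m, g), label in prototypes:
        d2 = (x[0] - m) ** 2 + (x[1] - g) ** 2
        w = math.exp(-d2 / (2 * bandwidth ** 2))
        num += w * label
        den += w
    return num / den  # between 0 (non-infected) and 1 (infected)

print(kernel_score((1.5, 2.0), prototypes))  # nearer the infected cluster
```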

Iso-density lines

[Two IgM/IgG panels, one per class, with density contours.]

Iso-Likelihood Ratio lines (uphill arrows)

[IgM/IgG plane: Infection vs. Non-infected; simulated data (100+100).]

A ROC diagram shows the true positive fraction against the false positive fraction as a function of the choice of cutoff point

Hypothetical smooth trajectory, and two raw empirical ones [ sample sizes: 17+17 and 40+40 ]

[Corner annotations: ”everyone negative” at one end, ”everyone treated as positive” at the other; the cutoff runs from strict to liberal along the curve.]
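Added note: how such a raw empirical trajectory arises. A sketch with invented measurements, higher values taken to suggest disease:

```python
# Empirical ROC: sweep the cutoff from strict to liberal and record
# the (false positive fraction, true positive fraction) pairs.
def roc_points(diseased, healthy):
    """Points for every cutoff c; test positive if value >= c."""
    cuts = sorted(set(diseased) | set(healthy), reverse=True)
    pts = [(0.0, 0.0)]  # strictest cutoff: everyone negative
    for c in cuts:
        tpf = sum(v >= c for v in diseased) / len(diseased)
        fpf = sum(v >= c for v in healthy) / len(healthy)
        pts.append((fpf, tpf))
    return pts  # ends at (1, 1): everyone treated as positive

diseased = [3.1, 2.4, 4.0, 2.9, 3.6]   # invented sample values
healthy = [1.2, 2.0, 0.8, 2.6, 1.5]
for fpf, tpf in roc_points(diseased, healthy):
    print(f"FPF={fpf:.2f}  TPF={tpf:.2f}")
```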

The ROC diagram describes the nosographic properties

Sens, spec.: LRpos, LRneg = slopes of segments.

Y = Youden’s Index = sens + spec – 1 is equivalent to AUC [Area Under Curve] = ½(sens + spec) in this case.

[Figure: single-cutoff ROC; corners Y = 1 and Y = 0; cells FN, TP, FP, TN; segments neg and pos. We are within the BLACK&WHITE paradigm.]

The ROC diagram describes the nosographics*

The slope of each outcome line is its LR; e.g. LRpos = (TP fraction of Diseased)/(FP fraction of non-dis.)

*i.e., the information obtainable from a 2-gate study

[Figure: the same ROC with cells FN, TP, FP, TN and the neg/pos segments.]

Three test outcomes

[Figure: three-segment ROC; the segments read, by outcome: Ominous – Almost no evidence either way – Reassuring. The corner marks the ideal test.]

Three test outcomes

[Figure: the segments now labelled Positive, +/–, Negative, with cutpoints Pos.? and Neg.?]

Ordered (ordinal) test outcomes. Ordered how? By increasing slope, i.e. LR [ concavity ! ]

Three test outcomes

[Figure: segments labelled Definitely positive, Possibly positive (+/–), Negative.]

Ordered (ordinal) test outcomes. Ordered how? By increasing slope, i.e. LR [ concavity ! ]

The slope reflects the medical trade-off between % sensitivity and % specificity. Those with a ”+/–” test result are best treated as negative in this situation.

Trade-off? Constant benefit? … Please take a look at the supplementary figures.

A ’constant-benefit’ line

Interpretation of the area under the ROC as a rank statistic ( cf. Wilcoxon-Mann-Whitney )

E.g., 5 cases of disease D and 10 non-D cases: the ROC square holds 50 small rectangles, 40 of which happen to be below the ROC trajectory, because 40 times (out of 50) it so happens that a non-D finding > a D-group finding [the desired ordering]. For an example, see patient * vs. patient **.

Area Under ROC Curve = freq{ (non-D value) > (D value) } = 0.80.
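Added note: the pair-counting made explicit. The values are invented so that exactly 40 of the 5 × 10 = 50 pairs show the slide’s desired ordering:

```python
# AUC as the Wilcoxon-Mann-Whitney rank statistic: the fraction of
# (non-D, D) pairs with non-D value > D value.
d_values = [10, 12, 14, 16, 18]                    # 5 diseased cases
non_d = [15, 15, 17, 17, 17, 17, 17, 17, 19, 19]   # 10 non-diseased

pairs = [(x, y) for x in non_d for y in d_values]  # all 50 pairs
auc = sum(x > y for x, y in pairs) / len(pairs)    # ties would count 1/2
print(auc)  # 0.80
```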

But where does that lead us? The AUC has no definable interpretation in terms of blood, sweat and tears (loss, benefit, utility).

It only has a soft association with decision-analytic measures of diagnostic power (separation, discrimination).

Its frequent use is purely a matter of being the popular girl in the class.

The primary virtues of the ROC: it allows you

(1) to compare tests regardless of scale, units, & transformations

(2) to see oddities [ which may point to a technical problem, or call for a revised test interpretation rule ]

What!?

Lesions in floating locations

[Image: suspect area? Red = as the imagist saw it; green = surgical truth.]

How do we score diagnostic performance

in such situations ???

Randomized trials of diagn. tests

… theory under development

Purpose & design: many variants. Sub(-set-)randomization, depending on the pt.’s data so far collected.

”Non-disclosure”: some data are kept under seal until analysis. No parallel in therapeutic trials!

Main purposes …

Digression…

… Randomized trials of diagn. tests

1) when the diagnostic intervention is itself potentially therapeutic;

2) when the new test is likely to redefine the disease(s) ( cutting the cake in a completely new way );

3) when there is no obvious rule of translation from the outcomes of the new test to existing treatment guidelines;

4) when clinician behaviour is part of the research question…

…end of digression

Statistical analysis … in the narrow sense: … is very much standard

once you know what aspects to count and compare.

To know that, work backwards from (likely) consequences:

what would have happened to these patients? And

what would have happened in the alternative scenario?

Never argue ”It’s customary to calculate … (this or that)” !

Thank you !

Let me add a personal maxim: Never ask

”What can the journal impact factors do for me?” Ask instead

”What can I do for the journal impact factors?”

Supplementary pictures follow here

… Vassily Vlassov pixit

The rôle of noise

Pure noise, independent of the patient’s true condition, flattens distributions and hence flattens the ROC; less information.

Remedies: technical & procedural standardization, duplicate measurements,

(averaging over assessors, dominance-free consensus formation) …

… may be ineffective if the noise is ”inter-patient”
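Added note: the flattening can be seen by simulation. All distributions and noise levels below are invented; the AUC is computed as the pairwise rank statistic from the earlier slide.

```python
# Adding pure, patient-independent noise shrinks the AUC toward 0.5.
import random

random.seed(1)
diseased = [random.gauss(2.0, 1.0) for _ in range(200)]
healthy = [random.gauss(0.0, 1.0) for _ in range(200)]

def auc(pos, neg):
    """Rank-statistic AUC over all (pos, neg) pairs."""
    return sum(x > y for x in pos for y in neg) / (len(pos) * len(neg))

print(f"AUC without noise: {auc(diseased, healthy):.2f}")
noisy_d = [v + random.gauss(0, 2.0) for v in diseased]
noisy_h = [v + random.gauss(0, 2.0) for v in healthy]
print(f"AUC with noise:    {auc(noisy_d, noisy_h):.2f}")  # nearer 0.5
```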

Three test outcomes

[Figure: segments labelled Definitely positive, Presumably positive, Negative.]

Ordered (ordinal) test outcomes. Ordered how? By increasing slope, i.e. LR [ concavity ! ]

Its slope reflects the medical trade-off between % sensitivity and % specificity

Slope? Constant benefit? … Let’s first look at a

continuous test & selection of cutoff that maximizes benefit

An ”iso-benefit” line

The slope chosen so as to imply constant benefit

A continuous test: cutoff at measurement x = c maximizes benefit.

How do we find that critical slope?

It depends on the pre-test ’disease mix’ – and on the (human) loss associated with wrong or suboptimal treatment

- when only two courses of action are available (otherwise there will be more lines, reflecting several trade-offs).
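Added note: a sketch of the computation behind this. The Gaussian score distributions, the 30 % prevalence and the loss values are all assumptions, chosen so that treating everybody beats treating no-one only slightly, as on the next slide.

```python
# Choosing the cutoff c that minimizes expected misdiagnostic loss.
from statistics import NormalDist

diseased = NormalDist(mu=2.0, sigma=1.0)  # score distribution, diseased
healthy = NormalDist(mu=0.0, sigma=1.0)   # score distribution, healthy
p = 0.30          # pre-test probability of disease (assumed)
loss_fn = 5.0     # loss per missed case (assumed)
loss_fp = 2.0     # loss per overtreated case (assumed)

def expected_loss(c):
    fn = diseased.cdf(c)       # Pr{ score < c | diseased }
    fp = 1 - healthy.cdf(c)    # Pr{ score >= c | healthy }
    return p * fn * loss_fn + (1 - p) * fp * loss_fp

best = min((c / 100 for c in range(-200, 400)), key=expected_loss)
print(f"best cutoff ~ {best:.2f}, loss {expected_loss(best):.2f}")
print(f"treat no-one: {p * loss_fn:.2f}, treat everybody: {(1 - p) * loss_fp:.2f}")
# At the optimum the local LR equals the critical ROC slope
# ((1 - p) * loss_fp) / (p * loss_fn).
```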

The slope chosen so as to imply constant benefit

A continuous test: cutoff at measurement x = c maximizes benefit.

[Figure: ROC with iso-benefit lines of the critical slope; the corners are ”treat no-one” and ”treat everybody”, and benefit is maximal where such a line touches the trajectory.]

Without the test, it’s (slightly) betterto treat everybody than to treat no-one.

With the test available, about 60 % of the ’misdiagnostic burden’ is eliminated; cf. purple bar.

Two binary tests and their 6 most important joint rules of interpretation ( positivity criterion BLACK )

[Figure: the ROC parallelogram spanned by the single-test points A and B; joint rules marked A and B, A or B, A, not B, Not A, AB; corners ”Always!” and ”Never!”; the top-left corner = no misdiagnoses.]

Three slides that illustrate the pitfalls of combining two or more tests …

Test interpretation rules in Boolean terms

Test result, T = A OR B vs. T = A AND B

[Venn diagram for T = A OR B: the regions A, B and the overlap A B are all positive; outside, not A, not B = negative.]

… but do not talk about tests in parallel vs. tests in series (next slide)

Rule of interpretation vs. rule of execution …

Suppose Tests A and B both figure in the rule of interpretation adopted. Next, choose a rule of execution, sequential or simultaneous – depending on cost, inconvenience and delays:

A first; if needed, B
B first; if needed, A
A & B simultaneously

Beware: careless writers may describe either rule by words like ”parallel” & ”serial” without realizing the ambiguity. Distinguish!
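Added note: the nosographic arithmetic of the two rules of interpretation, under the strong (and often false) assumption that the tests err independently within each disease class; the sens/spec values are invented. This is one of the pitfalls the earlier slide warns about: real IgM/IgG-type tests are rarely independent.

```python
# Sens/spec of T = A OR B and T = A AND B, assuming conditional
# independence of the two tests within each class.
sens_a, spec_a = 0.80, 0.90
sens_b, spec_b = 0.70, 0.95

# T = A OR B: positive if either test is positive
sens_or = 1 - (1 - sens_a) * (1 - sens_b)   # misses only if both miss
spec_or = spec_a * spec_b                   # negative only if both negative

# T = A AND B: positive only if both tests are positive
sens_and = sens_a * sens_b
spec_and = 1 - (1 - spec_a) * (1 - spec_b)

print(f"OR : sens={sens_or:.2f} spec={spec_or:.2f}")    # sens up, spec down
print(f"AND: sens={sens_and:.2f} spec={spec_and:.2f}")  # spec up, sens down
```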

Three existing tests with poorly researched ROC trajectories & their conventional decision points

(cut-off values)

The point is that – unbeknownst to us –

the 3 tests are really equally good (the ROCs nearly coincide):

only historical coincidences, FDA, and other irrelevant factors

make performance look very different.

Remedy...


Estimate the entire ROC

- with focus on the local slope

- which becomes the LR estimate belonging to the test outcome concerned

- to be weighed into the pre-test odds: (post-test odds) = (pre-test odds)(LR).

Slope +/– statistical uncertainty


Actual data (108 infected; 815 non-infected subjects)

An ordinary (mathematically formulated) statistical model, and a more closely fitting (over-fitting?) kernel procedure

Lesions in floating locations

[Image: a suspect location (?). Red = as the imagist saw it; green = surgical truth.]

How do we score diagnostic performance

in such situations ???

– No neat answer !

Lesions in floating locations

[The same image, now scored:]

1 true positive

1 dubious lesion confirmed apart from extension (two ½-errors)

1 false positive & 1 false negative, but due to proximity surgeons count:

1 true positive finding

An ’infinite’ number of true negative locations

Unfortunately,

– the region-by-region truth may remain unknown

( death from focus A leaves a suspect

location B unresolved ); and

– region-to-region interdependence is common:

– pathogenetic ( a focus at A makes focus at B more likely )

– diagnostic ( a focus at A

makes location B less visible, or

prompts surgery making B directly observable, or

sharpens the imagist’s attention to B, C, …)

– prognostic ( one verified metastatic focus ⇒ incurability )

– therapeutic (a positive finding, even a false positive one,

may prompt a drug regimen that cures an overlooked

focus [cancelling a false negative blunder] )
