Experimental design for binary diagnostic tests

Frank Schaarschmidt

Institute of Biostatistics - Leibniz Universität Hannover

2009 September 23

ESNATS Summerschool, Zermatt

email: [email protected]

Frank Schaarschmidt (Biostatistics, LUH) Diagnostic tests 2009 September 23 1 / 39

1 Motivation and introduction

2 Some basic statistical concepts

3 Binary diagnostic test

4 Experimental design for binary diagnostic tests

5 Software

6 References


A diagnostic test I

Developing an assay

Reproducibly define a biological system

to measure its reaction to a compound,

where the measurement allows classifying the toxicity of the compound

Define a diagnostic test

Primary measurement could be:

  - IC20
  - expression level of some genes
  - number of mutated cells
  - etc.

Simplify the primary measurement:

  - split the possible values into two or several categories (e.g. GHS C1-C5, not classified)


A diagnostic test II

Binary diagnostic test

Simplest scenario: classify into only two categories

  - fatal vs. not fatal, or,
  - {C1, C2, C3, C4} vs. {C5, not classified}, or,
  - IC20 ≤ 0.001 vs. IC20 > 0.001

Validation

For a given set of compounds

compare the results of the new assay to

  - the results of an existing gold standard (e.g. an in-vivo assay)
  - the summary of all available knowledge from different sources (the ’truth’ as defined by an expert)


A diagnostic test III

Final objective

Assess precision of the new method compared to historically gathered knowledge, a gold standard assay, etc.

in classifying:

  - new toxic compounds correctly as toxic
  - new non-toxic compounds correctly as non-toxic


A carcinogenicity example I

Compare in vivo carcinogenicity in rats and mice to the results of a battery of in vitro carcinogenicity assays (Ames, MN & CA assay) [Kirkland et al. (2006)]

Compound                             CAS-Nr.      In vivo    In vitro
N-Acetoxy-2-acetylaminofluorene      6098-44-8    positive   positive
2-Acetylaminofluorene                53-96-3      positive   positive
2-Aminoanthracene^a                  613-13-8     positive   positive
2-Aminofluorene^a                    153-78-6     positive   positive
2-Amino-4-nitrophenol                99-57-0      negative   positive
...                                  ...          ...        ...
3-Amino-1,2,4-triazole (Amitrole)    61-82-5      positive   negative
tert-Butyl alcohol                   75-65-0      positive   negative
5-Chloro-o-toluidine                 95-79-4      negative   negative
Decabromodiphenyl oxide              1163-19-5    negative   negative
Diethanolamine                       111-42-2     negative   negative
...                                  ...          ...        ...

Table: Excerpt of published data [Kirkland et al. (2006)]


A carcinogenicity example II

              In vivo
In vitro      negative   positive
negative      9          9
positive      18         41
Total         27         50

Table: Summary: In vitro vs. in vivo test results [Kirkland et al. (2006)]

Relative predictivity:

\[ \frac{9 + 41}{9 + 9 + 18 + 41} = \frac{50}{77} \approx 0.65 \tag{1} \]

Is that all there is to present? What does it tell us? What else should be presented?


Questions... I

Questions ... concerning validation

How to describe the precision of in vitro assays compared to a gold standard?

  - Sensitivity, specificity, relative predictivity
  - Positive and negative predictive value
  - Which measure to prefer?

How to estimate these measures and present the uncertainty due to estimation?

  - Confidence intervals!

Questions ... concerning experimental design

How many compounds to choose for the validation step?

Which proportion of positive and negative compounds?

Statistical software?


Questions... II

IN THIS TALK:

1 Introduction to basic statistical concepts

2 Binary diagnostic tests: Validation

3 Experimental design for binary diagnostic tests

4 Software


Some basic statistical concepts


Random variables and parameters I

Random variable

Everything we can observe or measure is subject to (measurement) errors. Repeating an experiment two times gives two (slightly?) different measurements, even if all controllable factors are the same in the two replications.

Parameter

One might assume that some true and unknown values can describe

the variability of the measurements

the properties of our new assay

Statistics

Our measurement, being a random variable, may tell something about the true parameter, but certainly does not tell everything about the truth.


Decisions under uncertainty - Hypothesis tests I

Decisions

Two complementary statements:

1 The new assay is sufficiently precise in distinguishing (toxic vs. non-toxic)

2 The new assay is NOT sufficiently precise in ...

might be expressed in terms of a parameter for precision:

1 H1: The precision of the new assay is greater than x0.

2 H0: The precision of the new assay is less than or equal to x0.

where x0 is the margin of insufficient precision.

Whenever we decide for one of these statements, based on observations, the decision could be wrong, because the observations are random variables.


Decisions under uncertainty - Hypothesis tests II

Right and wrong decisions in hypothesis tests

                   Unknown truth
Test decision      H0 is true   H1 is true
Decide for H0      1 − α        β
Decide for H1      α            1 − β

Table: Type-I and Type-II errors in statistical tests

Type-I-error (α): Risk of rejecting H0 when indeed H0 is true

Type-II-error (β): Risk of not rejecting H0 when indeed H1 is true

Power (1− β): Chance of deciding for H1, when H1 is indeed true


Decisions under uncertainty - Hypothesis tests III

In our case

Type-I-error (α): Risk of deciding that the assay is sufficiently precise, when indeed it is NOT

Type-II-error (β): Risk of deciding that the assay is insufficiently precise, when indeed the assay is sufficiently precise

Power (1 − β): Chance of correctly identifying the new assay as sufficiently precise

Controlling the type-I-error α

Directly fixed and controlled by the test

Specify the hypotheses such that the more important risk is controlled via α

By convention chosen at α = 0.05 (’5% significance level’)


Decisions under uncertainty - Hypothesis tests IV

Reducing the type-II-error β

Cannot be directly fixed or controlled

Depends on the chosen α, the effect size (i.e. the difference between true precision and threshold precision), the sample size, and the variance of the measurements

Type-II-error β decreases, if

  - sample size ↑
  - variance ↓
  - α ↑
  - effect size ↑

Via sample size, the actual type-II-error β depends on the experimental design!
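The dependence of β on sample size can be illustrated with a small exact calculation. The following is a minimal sketch, not taken from the talk: it assumes a single binomial proportion (for example a sensitivity) tested one-sided with H0: p ≤ p0, and the values p0 = 0.8 and true p = 0.9 are purely illustrative.

```python
from math import comb

def upper_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(n, p0, alpha=0.05):
    """Smallest c with P(X >= c | p0) <= alpha.
    Returns n + 1 if H0 can never be rejected at this sample size."""
    for c in range(n + 2):
        if upper_tail(n, c, p0) <= alpha:
            return c

def exact_power(n, p0, p1, alpha=0.05):
    """Power 1 - beta of the exact one-sided test when the true proportion is p1."""
    return upper_tail(n, critical_value(n, p0, alpha), p1)

# Illustrative values (assumptions): threshold p0 = 0.8, true proportion 0.9
for n in (25, 100, 400):
    print(n, round(exact_power(n, 0.8, 0.9), 3))
```

With all other quantities held fixed, the printed power rises with n, i.e. β falls. Because the binomial distribution is discrete, the power is not strictly monotone between neighbouring sample sizes, which is one reason to evaluate several candidate designs rather than a single one.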


Confidence intervals I

Confidence intervals

constructed based on the measurements

such that the interval contains the true parameter in (1 − α) · 100 % of the cases

estimate a parameter, including the uncertainty of the estimation (due to measurement error, limited sample size, type-I-error)

decide upon arbitrary hypotheses concerning that parameter


Confidence intervals II

[Figure with three panels along an axis of possible parameter values: two-sided 95% confidence interval, lower 95% confidence limit, upper 95% confidence limit]

Figure: Depiction of confidence intervals and limits


A binary diagnostic test for assay validation


A binary diagnostic test

Objective is the validation of a new assay method/protocol

Experimental setup

A sample of n compounds is investigated

n0 compounds are known to be non-toxic

n1 compounds are known to be toxic

the new assay is applied to all n = n0 + n1 compounds

Results in a 2× 2 table:

                   Truth (or gold standard)
Assay result       toxic      non-toxic
toxic (1)          x11        x10
non-toxic (0)      x01        x00
Sample size        n1         n0

Table: 2×2 table summarizing the output of an experiment for estimating predictive values.

the result x11, x01, x10, x00 is a random variable

running the same experiment a second time, in a different lab, with a different sample of n compounds will give (slightly?) different results

Hence: everything calculated from x11, x01, x10, x00 is subject to uncertainty


Parameters to validate a binary diagnostic test I

Sensitivity and specificity describe the ’precision’ in an experimental setting

Sensitivity

π11, the probability to classify a toxic compound as toxic

Point estimate: π̂11 = x11/n1

Specificity

π00, the probability to classify a non-toxic compound as non-toxic

Point estimate: π̂00 = x00/n0

Relative predictivity

The proportion of correctly classified compounds

Point estimate: (x00 + x11)/n
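As a small worked example, the three point estimates can be computed directly from the cells of a 2×2 table. The sketch below uses the counts of the carcinogenicity example, treating the in vivo result as the truth and the in vitro result as the assay; the function name `point_estimates` is only illustrative.

```python
def point_estimates(x11, x10, x01, x00):
    """Point estimates from a 2x2 table.
    First index: assay result, second index: truth (1 = toxic, 0 = non-toxic)."""
    n1 = x11 + x01  # number of truly toxic compounds
    n0 = x10 + x00  # number of truly non-toxic compounds
    return {
        "sensitivity": x11 / n1,
        "specificity": x00 / n0,
        "relative_predictivity": (x11 + x00) / (n1 + n0),
    }

# Carcinogenicity example: in vitro (assay) vs. in vivo (truth)
est = point_estimates(x11=41, x10=18, x01=9, x00=9)
for name, value in est.items():
    print(name, round(value, 3))
```

For these counts the relative predictivity of 50/77 hides a marked imbalance: the estimated sensitivity is 41/50 while the estimated specificity is only 9/27, which is one reason to report the measures separately.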


Parameters to validate a binary diagnostic test II

Negative and positive predictive values describe the ’precision’ in a real-world application

Positive predictive value PPV

Probability that a compound is indeed toxic, given test result y = 1

Depends on the prevalence of toxic compounds, ψ, and the sensitivity π11 and specificity π00

\[ \mathrm{PPV} = \frac{\pi_{11}\,\psi}{\pi_{11}\,\psi + (1 - \pi_{00})(1 - \psi)} \tag{2} \]


Parameters to validate a binary diagnostic test III

Negative predictive value NPV

Probability that a compound is indeed non-toxic, given test result y = 0

Important for judgment according to the precautionary principle: ’being confident in negative results’

\[ \mathrm{NPV} = \frac{\pi_{00}\,(1 - \psi)}{(1 - \pi_{11})\,\psi + \pi_{00}\,(1 - \psi)} \tag{3} \]

[Mercaldo et al. (2007)]
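Equations (2) and (3) translate directly into code. The following sketch evaluates them for given sensitivity, specificity and prevalence; the function names `ppv` and `npv` are illustrative, and the numbers plugged in are the point estimates and assumed prevalence of the Alzheimer's disease example from this talk.

```python
def ppv(sens, spec, prev):
    """Positive predictive value, equation (2)."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    """Negative predictive value, equation (3)."""
    return spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))

# Alzheimer's disease example: 240 of 418 AD patients and 288 of 375
# non-AD patients classified correctly, assumed prevalence psi = 0.03
sens, spec = 240 / 418, 288 / 375
print("NPV:", round(npv(sens, spec, 0.03), 4))
print("PPV:", round(ppv(sens, spec, 0.03), 4))
```

Changing only the prevalence argument shows how strongly both predictive values depend on ψ, even when sensitivity and specificity stay the same.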


Hypotheses / Confidence limits I

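A good binary diagnostic test should have high predictive values (PVs).

Precautionary principle:

\[ H_0^{npv}: \; \mathrm{NPV} \le \mathrm{NPV}_0 \tag{4} \]

\[ H_1^{npv}: \; \mathrm{NPV} > \mathrm{NPV}_0 \tag{5} \]

where the threshold NPV0 should be higher than 1 − ψ.

Of further interest:

\[ H_0^{ppv}: \; \mathrm{PPV} \le \mathrm{PPV}_0 \tag{6} \]

\[ H_1^{ppv}: \; \mathrm{PPV} > \mathrm{PPV}_0 \tag{7} \]

where the threshold PPV0 should be higher than ψ.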
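One way to see why the thresholds should exceed 1 − ψ and ψ: consider an uninformative test that calls a compound toxic with the same probability regardless of the truth, i.e. π11 = 1 − π00. This special case is an illustration added here, not part of the original slides. Substituting into equations (2) and (3) gives

```latex
\[
\mathrm{NPV}
= \frac{\pi_{00}(1-\psi)}{(1-\pi_{11})\,\psi + \pi_{00}(1-\psi)}
= \frac{\pi_{00}(1-\psi)}{\pi_{00}\,\psi + \pi_{00}(1-\psi)}
= 1-\psi ,
\qquad
\mathrm{PPV}
= \frac{\pi_{11}\,\psi}{\pi_{11}\,\psi + (1-\pi_{00})(1-\psi)}
= \frac{\pi_{11}\,\psi}{\pi_{11}\,\psi + \pi_{11}(1-\psi)}
= \psi .
\]
```

So a test that carries no information already reaches NPV = 1 − ψ and PPV = ψ, and a threshold at or below these values would not demonstrate any diagnostic ability.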


Hypotheses / Confidence limits II

Confidence interval methods

Confidence intervals for sensitivity and specificity (exact and approximate) are available [Agresti and Coull (1998), Brown et al. (2001), Cai (2005)]

Confidence intervals for PPV and NPV are discussed by [Mercaldo et al. (2007)]


A medical example on diagnosing Alzheimer's disease I

793 patients (418 with Alzheimer's disease (AD), and 375 without) are genotyped. The presence of a certain allele (denoted ApoE.e4+) is assumed to indicate a higher risk for AD. Can genotyping persons for ApoE.e4+/ApoE.e4- serve as a useful diagnostic test for AD?

               Classification by clinicians
Genotype       AD       no AD
ApoE.e4+       240      87
ApoE.e4-       178      288
Sample size    418      375

Table: Case-control study of [Li et al. (2004)]

Prevalence: ψ = 0.03 (depending on age!)


A medical example on diagnosing Alzheimer's disease II

               Estimate   Lower 95% limit
Sensitivity    0.574      0.533
Specificity    0.768      0.729
NPV            0.9831     0.9813
PPV            0.0711     0.0607

Table: Estimates and lower 95% confidence limits for sensitivity, specificity, NPV and PPV of the ApoE.e4 genotype test, with prevalence assumed to be ψ = 0.03
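The talk does not state which interval method produced the lower limits for sensitivity and specificity. As one possibility, the sketch below computes an exact (Clopper-Pearson) one-sided lower limit by bisection; the choice of method and the function names are assumptions made for illustration, and the results need not agree with the table to every digit.

```python
from math import exp, lgamma, log

def log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes, for 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(exp(log_pmf(k, n, p)) for k in range(x, n + 1))

def clopper_pearson_lower(x, n, alpha=0.05):
    """One-sided lower (1 - alpha) limit: the p at which P(X >= x | p) = alpha."""
    if x == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the tail probability increases with p
        mid = (lo + hi) / 2
        if upper_tail(x, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Counts from the case-control table of the Alzheimer's disease example
sens_lower = clopper_pearson_lower(240, 418)
spec_lower = clopper_pearson_lower(288, 375)
print("Sensitivity, lower 95% limit:", round(sens_lower, 3))
print("Specificity, lower 95% limit:", round(spec_lower, 3))
```

A lower limit of this kind can be used directly for the one-sided test decision: H0 (parameter ≤ threshold) is rejected when the lower limit lies above the threshold.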


Experimental design for binary diagnostic tests


Experimental Design I

Question

How to choose sample sizes n, n1, n0 such that

the power (1− β) is high

for successfully showing that the diagnostic test is more precise than a threshold precision (H1npv: NPV > NPV0, H1ppv: PPV > PPV0)?


Experimental Design II

Statistical background

The uncertainty of the test decision depends on the uncertainty of the estimates π̂11, π̂00. This again depends on:

True sensitivity π11 and specificity π00,

Sample size n and the proportion of true positives n1/n (and true negatives)

The power (1− β) additionally depends on

prevalence ψ of positive compounds,

type-I-error rate α


Experimental Design III

Two possible approaches

Calculation:

1 Assume fixed type-I-error α, sensitivity π11, specificity π00, prevalence ψ and thresholds NPV0, PPV0

2 calculate n and the proportion n1/n, for which a high power 1 − β = 0.8 (80%) is achieved

for situations where asymptotic calculations are valid [Mercaldo et al. (2007)]
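The calculation approach can be sketched with a delta-method approximation on the logit scale. This is a hedged illustration, not necessarily the formula used in the cited papers or in the software of this talk: it assumes that logit(NPV) estimated from n1 toxic and n0 non-toxic compounds is approximately normal with variance π11/((1 − π11) n1) + (1 − π00)/(π00 n0), and it treats n1 = P·n as continuous. The example values are those of case a) of the acute toxicity example, with an assumed proportion P = 0.5.

```python
from math import log, sqrt
from statistics import NormalDist

def logit(p):
    return log(p / (1 - p))

def npv(sens, spec, prev):
    """Negative predictive value, equation (3)."""
    return spec * (1 - prev) / ((1 - sens) * prev + spec * (1 - prev))

def approx_power_npv(n, prop_toxic, sens, spec, prev, npv0, alpha=0.05):
    """Approximate power for showing NPV > npv0 (delta method, logit scale)."""
    n1 = prop_toxic * n  # toxic compounds (treated as continuous)
    n0 = n - n1          # non-toxic compounds
    se = sqrt(sens / ((1 - sens) * n1) + (1 - spec) / (spec * n0))
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    shift = (logit(npv(sens, spec, prev)) - logit(npv0)) / se
    return NormalDist().cdf(shift - z_alpha)

def smallest_n(target_power, max_n=100000, **design):
    """Smallest total sample size whose approximate power reaches the target."""
    for n in range(4, max_n + 1):
        if approx_power_npv(n, **design) >= target_power:
            return n
    return None

design = dict(prop_toxic=0.5, sens=0.95, spec=0.90, prev=0.132, npv0=0.95)
n_needed = smallest_n(0.8, **design)
print("Approximate n for 80% power:", n_needed)
```

Such approximations are only trustworthy when the expected cell counts are not too small; for sensitivities or specificities close to 1 and small n1 or n0, a simulation is the safer check.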

Simulation:

1 Assume fixed type-I-error α, sensitivity π11, specificity π00, prevalence ψ and thresholds NPV0, PPV0

2 Choose a number of reasonable designs (n, n1, n0)

3 Simulate actual power for each setting
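The three simulation steps can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the software presented in this talk: the lower confidence limit for NPV is a delta-method limit on the logit scale, 0.5 is added to each cell to avoid estimates of exactly 0 or 1, and the candidate designs are arbitrary. The settings are those of case a) of the acute toxicity example.

```python
import random
from math import log, sqrt
from statistics import NormalDist

def simulate_power_npv(n1, n0, sens, spec, prev, npv0,
                       alpha=0.05, n_sim=2000, seed=1):
    """Share of simulated experiments in which H0: NPV <= npv0 is rejected."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)
    logit_npv0 = log(npv0 / (1 - npv0))
    rejections = 0
    for _ in range(n_sim):
        # Simulate assay results for n1 toxic and n0 non-toxic compounds
        x11 = sum(rng.random() < sens for _ in range(n1))
        x00 = sum(rng.random() < spec for _ in range(n0))
        # Adjusted estimates (assumption: add 0.5 to avoid 0 or 1)
        sens_hat = (x11 + 0.5) / (n1 + 1)
        spec_hat = (x00 + 0.5) / (n0 + 1)
        # logit(NPV) and its delta-method standard error
        logit_npv = log(spec_hat * (1 - prev) / ((1 - sens_hat) * prev))
        se = sqrt(sens_hat / ((1 - sens_hat) * n1)
                  + (1 - spec_hat) / (spec_hat * n0))
        # Reject H0 if the lower (1 - alpha) limit exceeds the threshold
        if logit_npv - z * se > logit_npv0:
            rejections += 1
    return rejections / n_sim

# Step 2: a few candidate designs (n1, n0), chosen for illustration only
for n1, n0 in [(10, 10), (50, 50), (200, 200)]:
    power = simulate_power_npv(n1, n0, sens=0.95, spec=0.90,
                               prev=0.132, npv0=0.95)
    print(n1, n0, power)
```

The simulated power carries Monte Carlo error of roughly sqrt(power · (1 − power) / n_sim), so designs with similar power should be compared with a larger n_sim.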


An example of an assay for acute toxicity I

Setting

Acute toxicity

Prevalence ψ of toxic compounds among all new chemicals (since 1981):

a) Toxic ψ = 0.132 (13.2 %) (GHS C1-C4)

b) Very toxic ψ = 0.004 (0.4 %) (GHS C1, C2; fatal, LD50 ≤ 50 mg/kg)

[Bulgheroni et al. (2009)]

Investigating several assays,

[Clothier et al. (2008)] used n = 97 compounds


An example of an assay for acute toxicity II

Case a) prevalence ψ = 0.132 (13.2 %)

Show with high probability that

1 H1: NPV > 0.95

2 H1: PPV > 0.33

for an assay with

  - sensitivity π11 = 0.95
  - specificity π00 = 0.90
  - i.e., true NPV = 0.992, true PPV = 0.591


An example of an assay for acute toxicity III

[Figure: probability to reject H0: NPV0 ≤ 0.9 and probability to reject H0: PPV0 ≤ 0.33, each shown at levels 0.9, 0.8, 0.7, 0.5. Horizontal axis: proportion of toxic compounds, P = n1/n (0 to 1). Vertical axis: total sample size, n = n0 + n1 (0 to 200). Setting: all toxic compounds (GHS C1-C4), prevalence ψ = 0.132, sensitivity π11 = 0.95, specificity π00 = 0.9.]

Figure: Example of experimental design with intermediate prevalence

Case b) prevalence ψ = 0.004 (0.4 %)

Show with high probability that

1 H1: NPV > 0.999

2 H1: PPV > 0.33

for an assay with

  - sensitivity π11 = 0.98
  - specificity π00 = 0.90
  - i.e., true NPV = 0.9999, true PPV = 0.0379

Note: It is impossible to show that PPV > 0.33, since the true PPV for such an assay is about 0.0379.
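To make the note concrete, equation (2) can be solved for the specificity that would be needed to reach a given PPV threshold. The sketch below is an added illustration; the helper `required_specificity` is hypothetical and simply rearranges PPV ≥ PPV0 into 1 − π00 ≤ π11 ψ (1 − PPV0) / (PPV0 (1 − ψ)).

```python
def ppv(sens, spec, prev):
    """Positive predictive value, equation (2)."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def required_specificity(sens, prev, ppv0):
    """Specificity at which the true PPV equals ppv0 (equation (2) solved for spec)."""
    return 1 - sens * prev * (1 - ppv0) / (ppv0 * (1 - prev))

# Case b): very toxic compounds, prevalence 0.004, sensitivity 0.98
true_ppv = ppv(0.98, 0.90, 0.004)
spec_needed = required_specificity(0.98, 0.004, 0.33)
print("True PPV at specificity 0.90:", round(true_ppv, 4))
print("Specificity needed for PPV = 0.33:", round(spec_needed, 4))
```

At this prevalence the specificity would have to be above roughly 0.99 before the true PPV even reaches 0.33, so no sample size can rescue the PPV hypothesis for an assay with specificity 0.90.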

[Figure: probability to reject H0: NPV0 ≤ 0.999 and probability to reject H0: PPV0 ≤ 0.33, each shown at levels 0.9, 0.8, 0.7, 0.5. Horizontal axis: proportion of toxic compounds, P = n1/n (0 to 1). Vertical axis: total sample size, n = n0 + n1 (0 to 200). Setting: very toxic compounds (GHS C1, C2), prevalence ψ = 0.004, sensitivity π11 = 0.98, specificity π00 = 0.9.]

Figure: Example of experimental design for very low prevalence



Software I

Methodology

Confidence intervals for

  - Sensitivity and specificity
  - NPV and PPV

Sample size calculation for tests on NPV and PPV

Simulation of power, confidence interval width for NPV and PPV

Graphical user interface

Made available by Kornelius Rohmeyer (LUH) and Bernd Bischl (TU Dortmund), using methodology by [Chine (2008)]


Software II

Installation

Needed before installation:

Java 6

R-2.9.2 (at least R>2.8.0)

http://dr-ibs.biostat.uni-hannover.de:8080/rjavaclient/deploy/esnats/

click Java Web Start

to download and run the installer




References I

Agresti, A. and Coull, B.A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician 52, 119-126.

Brown, L.D., Cai, T.T., DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science 16, 101-128.

Bulgheroni A, Kinsner-Ovaskainen A, Hoffmann S, Hartung T, Prieto P (2009): Estimation of acute oral toxicity using the No Observed Adverse Effect Level (NOAEL) from the 28 day repeated dose toxicity studies in rats. Regulatory Toxicology and Pharmacology 53, 16-19.

Cai, T.T. (2005). One-sided confidence intervals in discrete distributions. Journal of Statistical Planning and Inference 131, 63-88.

Chine K (2008). Biocep, Towards a Federative, Collaborative, User-Centric, Grid-Enabled and Cloud-Ready Computational Open Platform. escience, 321-322. Fourth IEEE International Conference on eScience.


References II

Clothier R, Dierickx P, Lakhanisky T, Fabre M, Betanzos M, Curren R, Sjostrom M, Raabe H, Bourne N, Hernandez V, Mainez J, Owen M, Watts S and Anthonissen R (2008): A Database of IC50 Values and Principal Component Analysis of Results from Six Basal Cytotoxicity Assays, for Use in the Modelling of the In Vivo and In Vitro Data of the EU ACuteTox Project. ATLA 36, 503-519.

Kirkland D, Aardema M, Mueller L, Hayashi M (2006): Evaluation of the ability of a battery of three in vitro genotoxicity tests to discriminate rodent carcinogens and non-carcinogens II. Further analysis of mammalian cell results, relative predictivity and tumour profiles. Mutation Research 608, 29-42.

Li et al. (2004). Association of late-onset Alzheimer's disease with genetic variation in multiple members of the GAPD gene family. Proceedings of the National Academy of Sciences, U.S.A. 101, 15688-15693.

Mercaldo, N.D., Lau, K.F. and Zhou, X.H. (2007). Confidence intervals for predictive values with emphasis to case-control studies. Statistics in Medicine 26, 2170-2183.


References III

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Steinberg, D.M., Fine, J. and Chappell, R. (2009). Sample size for positive and negative predictive value in diagnostic research using case-control designs. Biostatistics 10, 94-105.

