Section II Descriptive stats for continuous data


Upload: nubia

Post on 24-Feb-2016


Page 1: Section II Descriptive stats for continuous data

Section II
Descriptive stats for continuous data
Descriptive stats for binary data and bivariate associations in binary data

Page 2: Section II Descriptive stats for continuous data

Types of data
Numerical:
  Continuous - age, SBP, glucose
  Interval - parity, num infections
Ordinal (ranks): Cancer stage, Apgar score
Nominal (no order): Gender, ethnicity, treatment

Page 3: Section II Descriptive stats for continuous data

Dataset used to illustrate some statistics in this section

Stomach cancer survival times in controls (Cameron & Pauling, PNAS, Oct 1976)
Days from end of treatment to death:
4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45
n = 13 subjects

Page 4: Section II Descriptive stats for continuous data

Measures of central tendency (middle)
Data: 4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45
mean = 17.5 days
median = 15 days
mode = 8 days
Geometric mean: GM = (4 x 6 x 8 x 8 x … x 45)^(1/13) = 14.25 days
If we delete the most extreme value, 45, the mean is now 15.25, the median is 14.5 and GM = 13. The median changes least.
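The four summaries on this slide can be checked with a short sketch using Python's standard `statistics` module. The helper name `summarize` is illustrative and not part of the slides; results may differ from the slide in the last printed digit because of rounding.

```python
import statistics

# Stomach cancer survival times in days (n = 13), as listed on the slide
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

def summarize(values):
    """Return mean, median and geometric mean of a list of positive numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "gm": statistics.geometric_mean(values),
    }

full = summarize(days)
trimmed = summarize(days[:-1])   # drop the most extreme value, 45
mode = statistics.mode(days)     # most frequent value

for name in ("mean", "median", "gm"):
    print(name, round(full[name], 2), "->", round(trimmed[name], 2))
print("mode", mode)
```

Comparing the two columns of output shows how much each summary moves when the single largest value is removed.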

Page 5: Section II Descriptive stats for continuous data

Mean versus Median (lesson #1 in how to lie with statistics)
Yearly income data from n=11 persons: one income is for Dr Brilliant, the other 10 incomes are from her 10 graduate students.
Yearly income in dollars:
950, 960, 970, 980, 990, 1010, 1020, 1030, 1040, 1050, 100,000
Total = $110,000, mean = 110,000/11 = $10,000, median = $1,010 (the sixth ordered value)
Which is the better summary of the “typical” value?
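A minimal sketch of the income example, using only the eleven values on the slide; the variable names are illustrative.

```python
import statistics

# Ten graduate student incomes plus Dr Brilliant's income, in dollars
students = [950, 960, 970, 980, 990, 1010, 1020, 1030, 1040, 1050]
incomes = students + [100_000]

mean_income = statistics.mean(incomes)      # pulled up by the one large income
median_income = statistics.median(incomes)  # sixth ordered value

print("mean:", mean_income, "median:", median_income)
```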

Page 6: Section II Descriptive stats for continuous data

Example - Survival times in women with advanced breast cancer
Survival time in days after end of radiotherapy

woman    after 275 days f/u    after 305 days f/u
  1            14                    14
  2            26                    26
  3            43                    43
  4            45                    45
  5            50                    50
  6            58                    58
  7            60                    60
  8            62                    62
  9            70                    70
 10            70                    70
 11            83                    83
 12            98*                  128*
 13           104*                  134*
 14           124*                  154*
 15           125*                  155*
 16           275*                  305*

mean           75.6                  83.1
median         66.0                  66.0
SD             55.8                  66.3
* still alive (censored)

The median is still a valid measure when fewer than half of the observations are censored.

Page 7: Section II Descriptive stats for continuous data

Cumulative frequencies & survival

Days     num dead   pct dead   cum dead   cum pct dead   cum pct alive = S
1-10        4         30.8        4          30.8            69.2
11-20       5         38.5        9          69.2            30.8
21-30       2         15.4       11          84.6            15.4
31-40       1          7.7       12          92.3             7.7
41-50       1          7.7       13         100.0             0
total      13

Page 8: Section II Descriptive stats for continuous data

Stomach cancer survival time in days

day   cum dead   cum incidence   survival
  4       1          7.7%         92.3%
  6       2         15.4%         84.6%
  8       4         30.8%         69.2%
 12       5         38.5%         61.5%
 14       6         46.2%         53.8%
 15       7         53.8%         46.2%
 17       8         61.5%         38.5%
 19       9         69.2%         30.8%
 22      10         76.9%         23.1%
 24      11         84.6%         15.4%
 34      12         92.3%          7.7%
 45      13        100.0%          0.0%
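The table can be rebuilt from the raw survival times with the sketch below. It assumes, as in this dataset, that every subject is followed until death (no censoring), so survival is simply one minus cumulative incidence.

```python
# Stomach cancer survival times in days; all 13 subjects died (no censoring)
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
n = len(days)

rows = []
for day in sorted(set(days)):
    cum_dead = sum(1 for d in days if d <= day)   # deaths on or before this day
    cum_incidence = cum_dead / n
    survival = 1 - cum_incidence
    rows.append((day, cum_dead, cum_incidence, survival))

for day, cum_dead, cum_inc, surv in rows:
    print(f"{day:>3} {cum_dead:>3} {100 * cum_inc:6.1f}% {100 * surv:6.1f}%")
```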

Page 9: Section II Descriptive stats for continuous data

[Figure: Stomach cancer cum incidence & survival; cum incidence and survival (0% to 100%) versus days (0 to 54)]

Page 10: Section II Descriptive stats for continuous data

Bevacizumab & Ovarian Cancer
Berger et al., NEJM, Dec 2011

Page 11: Section II Descriptive stats for continuous data

Why survival curves?

[Figure: pct alive (0% to 100%) versus day (0 to 10)]

Page 12: Section II Descriptive stats for continuous data

Summarizing mortality – hazard rates
Hazard rate = h = (number of persons with outcome) / (total person-time of follow-up in all at risk)
This is a rate per person-time. It is NOT a probability (not a risk).
In stomach cancer n=13, with 13 deaths, total follow-up is 4+6+8+8+12+14+15+17+19+22+24+34+45 = 228 person-days.
Hazard rate = mortality rate = 13/228 = 0.057 per person-day, or 5.7 deaths per 100 person-days of follow-up. Do NOT report this as 5.7%; that is wrong.
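As a quick check of the arithmetic, the hazard rate can be computed directly from the 13 survival times. This is a sketch that assumes every subject was followed to death, which is the case for this dataset.

```python
# Stomach cancer survival times in days; each value is also that
# subject's follow-up time because nobody is censored
days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

deaths = len(days)               # 13 events
person_days = sum(days)          # total follow-up in person-days
hazard = deaths / person_days    # deaths per person-day (a rate, not a risk)

print(f"{person_days} person-days")
print(f"{100 * hazard:.1f} deaths per 100 person-days")
```

With no censoring, 1/hazard is the same number as the mean survival time (228/13 days).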

Page 13: Section II Descriptive stats for continuous data

Example: Why hazard rates?

Group   n     num dead   mean f/u   total f/u   rate per 1000
A       100   7          36         3600        7/3600 = 1.94
B       100   2          3          300         2/300 = 6.67

The mortality rate is higher for B than for A, even though the number of persons in each group is the same and more people died in group A.
The hazard rate ratio for A/B is 1.94/6.67 = 0.291.
When ALL patients are followed to the endpoint (no censoring), mean time to event = 1/hazard.
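The two group rates and their ratio can be reproduced with the sketch below. The function name `rate_per_1000` is illustrative; the slide does not state the time unit, so the result is per 1000 units of person-time.

```python
def rate_per_1000(events, person_time):
    """Events per 1000 units of person-time."""
    return 1000 * events / person_time

rate_a = rate_per_1000(7, 3600)   # group A: 7 deaths over 3600 units of follow-up
rate_b = rate_per_1000(2, 300)    # group B: 2 deaths over 300 units of follow-up
hazard_ratio = rate_a / rate_b    # A compared to B

print(round(rate_a, 2), round(rate_b, 2), round(hazard_ratio, 3))
```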

Page 14: Section II Descriptive stats for continuous data

Hazard rates & survival curves
[Figure: Survival S (0% to 100%) versus t (0 to 12) for hazard rates 0.2 and 0.4]
[Figure: log Survival, log(S) (0.0 to -6.0) versus t (0 to 12) for hazard rates 0.2 and 0.4]
-log_e(S) = cum haz = h t, so -h is the (average) slope of log_e(S) vs t

Page 15: Section II Descriptive stats for continuous data

Hazard rate ratios & Survival curves

ha = hazard rate in group A, hb = hazard rate in group B.
The hazard rate ratio (HR) for A compared to B is HR = ha/hb.
If HR is constant over time, one can compute the survival in group A from the survival in group B:
Sa = Sb^HR
Ex: HR = 0.291 and S at t = 12 months is 90% in group B, so Sa = 0.90^0.291 = 0.970, or 97.0%, in group A at t = 12 months.
A “protective” HR < 1 increases survival. HR > 1 decreases survival.
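The conversion from reference-group survival to survival in the other group is a one-line power. The sketch below assumes a constant hazard ratio over time, as the slide does; `survival_from_hr` is an illustrative name.

```python
def survival_from_hr(s_ref, hr):
    """Survival in the comparison group, given survival in the reference
    group and a hazard ratio that is constant over time."""
    return s_ref ** hr

s_a = survival_from_hr(0.90, 0.291)      # protective HR < 1
s_harm = survival_from_hr(0.90, 2.0)     # harmful HR > 1

print(f"HR = 0.291: {100 * s_a:.1f}%")
print(f"HR = 2.0:   {100 * s_harm:.1f}%")
```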

Page 16: Section II Descriptive stats for continuous data

Cumulative hazard rate
-log_e(S) = cumulative hazard = Σ h_i (summed over time) = ∫ h(t) dt
If h is constant over time, cumulative hazard = h T, where T is the follow-up time. In this case h = cum hazard/T, and h is the slope of the cum hazard vs t plot.
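For a constant hazard, the relation between survival and cumulative hazard can be sketched as follows. The hazard values 0.2 and 0.4 are the ones labelled on the earlier survival-curve figure; the function names are illustrative.

```python
import math

def survival(h, t):
    """Survival at time t under a constant hazard h: S = exp(-h t)."""
    return math.exp(-h * t)

def cumulative_hazard(s):
    """Cumulative hazard recovered from survival: -log_e(S)."""
    return -math.log(s)

for h in (0.2, 0.4):
    t = 5.0
    s = survival(h, t)
    cum = cumulative_hazard(s)
    # dividing the cumulative hazard by follow-up time recovers h
    print(f"h={h}: S({t})={s:.3f}, cum hazard={cum:.2f}, cum/T={cum / t:.2f}")
```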

Page 17: Section II Descriptive stats for continuous data

From: Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial
JAMA. 2002;288(3):321-333.
HR indicates hazard ratio; nCI, nominal confidence interval; and aCI, adjusted confidence interval. Global index = first occurrence of CHD, cancer, stroke, pulmonary embolism, hip fracture or death.

Page 18: Section II Descriptive stats for continuous data

Distribution skewness
Long right-tailed distribution: median < mean (common for survival data)
[Figure: right-skewed distribution with the median and mean marked]

Page 19: Section II Descriptive stats for continuous data

Example: ICU length of stay (Howard)
n = 94, mean = 11.3 days, median = 6 days
min = 1 day, max = 80 days
[Figure: histogram of LOS_ICU (6 to 78 days) versus percent (0 to 80)]

Page 20: Section II Descriptive stats for continuous data

Skewness
Long left-tailed distribution: median > mean (not as common in biology/medicine)
[Figure: left-skewed distribution with the median and mean marked]

Page 21: Section II Descriptive stats for continuous data

Symmetric (common in biology)
[Figure: symmetric distribution from -3.5 to 3.5 with the mean and median marked at the same point]
Can be symmetric without being bell curve shaped – has one mode.
When data have a skewed distribution, must use “non parametric” methods.

Page 22: Section II Descriptive stats for continuous data

Measures of variation, spread
IQR – interquartile range (Q3 - Q1)
[Figure: distribution split at Q1, the median and Q3, with 25% of the data in each of the four segments]

Page 23: Section II Descriptive stats for continuous data

Box-whisker plot
[Figure: box-whisker plot on a 0 to 50 scale marking min, Q1, median, mean, Q3 and max]

Page 24: Section II Descriptive stats for continuous data

Variation - Variance & SD
Mean = Ybar = 17.54 days

  Y    Y - Ybar   (Y - Ybar)^2
  4    -13.54       183.3
  6    -11.54       133.2
  8     -9.54        91.0
  8     -9.54        91.0
 12     -5.54        30.7
 14     -3.54        12.5
 15     -2.54         6.5
 17     -0.54         0.3
 19      1.46         2.1
 22      4.46        19.9
 24      6.46        41.7
 34     16.46       270.9
 45     27.46       754.1
sum      0         1637.2

Variance = Σ(Yi - Ybar)^2 / (n - 1)
Var = 1637.2/12 = 136.4
SD = √Variance = √136.4 = 11.7 days
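The steps in the table (deviations, squared deviations, divide by n - 1, square root) can be sketched as below and compared with the standard library's `statistics.stdev`, which also uses the n - 1 divisor.

```python
import statistics

days = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]
n = len(days)

mean = sum(days) / n
sum_sq = sum((y - mean) ** 2 for y in days)   # sum of squared deviations
variance = sum_sq / (n - 1)                   # sample variance
sd = variance ** 0.5

print(f"sum of squares = {sum_sq:.1f}")
print(f"variance = {variance:.1f}, SD = {sd:.1f} days")
print(f"statistics.stdev = {statistics.stdev(days):.1f} days")
```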

Page 25: Section II Descriptive stats for continuous data

Variation - Interpreting the SD
Rule of thumb from Gaussian (“Normal”) theory (will study more shortly). The rule is OK if the data have a unimodal symmetric distribution.
Range of middle 2/3 of the data: mean +/- SD
Range of middle 95% of the data: mean +/- 2 SD
Implies SD ≈ range/4 (after extreme values are removed from the range)

Page 26: Section II Descriptive stats for continuous data

SD of differences - paired data
chol in mmol/L

person   chol at start   chol at end   difference
  1          12.6           10.0          2.6
  2           8.5            7.5          1.0
  3           7.0            5.8          1.2
  4           6.9            4.9          2.0
  5           5.8            4.0          1.8
  6           4.1            3.8          0.3

mean          7.48           6.00         1.48
SD            2.90           2.38         0.82

Corr of start vs end: r = 0.971

Page 27: Section II Descriptive stats for continuous data

If authors only report (mmol/L):

        start   end    change
mean    7.48    6.00   ??
SD      2.90    2.38   ??

Easy to get mean difference = 7.48 – 6.00 = 1.48
But can’t get the SD of differences by subtraction: 2.90 - 2.38 = 0.52 ≠ 0.82
The 1.48 mean diff is the average response. The 0.82 diff SD is the variation in response.
SDdiff = √(SDstart^2 + SDend^2 – 2 r SDstart SDend), where r = correlation coeff
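The paired-difference formula can be checked against the six cholesterol pairs from the previous slide. This is a sketch; the Pearson correlation is computed by hand so that it runs on older Python versions, and the variable names are illustrative.

```python
import statistics

start = [12.6, 8.5, 7.0, 6.9, 5.8, 4.1]   # chol at start, mmol/L
end = [10.0, 7.5, 5.8, 4.9, 4.0, 3.8]     # chol at end, mmol/L
diff = [s - e for s, e in zip(start, end)]
n = len(start)

sd_start = statistics.stdev(start)
sd_end = statistics.stdev(end)

# Pearson correlation between start and end values
m_start, m_end = statistics.mean(start), statistics.mean(end)
cov = sum((s - m_start) * (e - m_end) for s, e in zip(start, end)) / (n - 1)
r = cov / (sd_start * sd_end)

sd_formula = (sd_start**2 + sd_end**2 - 2 * r * sd_start * sd_end) ** 0.5
sd_direct = statistics.stdev(diff)

print(f"r = {r:.3f}")
print(f"SD of differences: formula {sd_formula:.2f}, direct {sd_direct:.2f}")
print(f"naive subtraction: {sd_start - sd_end:.2f}")
```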

Page 28: Section II Descriptive stats for continuous data

SD of differences - two independent groups
Comparing ages in groups A vs B

          group A   group B
Data ->     30        50
            35        51
            77        55
            41

          group A   group B    B - A     B + A
n            4         3
mean       45.75     52.00      6.25     97.75
SD         18.46      2.16     18.58     18.58
Var       340.69      4.67    345.35    345.35

Page 29: Section II Descriptive stats for continuous data

All possible differences, B - A

  A \ B    50    51    55
   30      20    21    25
   35      15    16    20
   77     -27   -26   -22
   41       9    10    14

mean 6.25, SD 18.58, Var 345.35

All possible sums, B + A

  A \ B    50    51    55
   30      80    81    85
   35      85    86    90
   77     127   128   132
   41      91    92    96

mean 97.75, SD 18.58, Var 345.35

Page 30: Section II Descriptive stats for continuous data

Rule for SD of differences - two independent groups
Var(Y - X) = Var(Y) + Var(X)
Var(Y + X) = Var(Y) + Var(X)
SD(Y - X) = √(SD(Y)^2 + SD(X)^2)
SD(Y + X) = √(SD(Y)^2 + SD(X)^2)
[Figure: right triangle with legs SD(X) and SD(Y) and hypotenuse SD(Y-X)]
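The rule can be verified on the age example by forming all 12 cross-group pairs. One observation from reproducing the numbers: the variances shown in the example (340.69 and 4.67) match the divide-by-n (population) form, so this sketch uses `statistics.pvariance`.

```python
import statistics
from itertools import product

group_a = [30, 35, 77, 41]
group_b = [50, 51, 55]

# every possible pairing of one person from A with one person from B
diffs = [b - a for a, b in product(group_a, group_b)]
sums = [b + a for a, b in product(group_a, group_b)]

var_a = statistics.pvariance(group_a)
var_b = statistics.pvariance(group_b)
var_diff = statistics.pvariance(diffs)
var_sum = statistics.pvariance(sums)

print(f"Var(A) + Var(B) = {var_a + var_b:.2f}")
print(f"Var(B - A) = {var_diff:.2f}, Var(B + A) = {var_sum:.2f}")
print(f"SD(B - A) = {var_diff ** 0.5:.2f}")
```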

Page 31: Section II Descriptive stats for continuous data

BINARY DATA
Statistics

Page 32: Section II Descriptive stats for continuous data

Associations for Binary data

                 disease   no disease   total
Exposed (e)         a          b         a+b
Unexposed (u)       c          d         c+d

              risk = P           odds = O
Exposed       Pe = a/(a+b)       Oe = a/b
Unexposed     Pu = c/(c+d)       Ou = c/d

RR = Pe/Pu    OR = Oe/Ou

Page 33: Section II Descriptive stats for continuous data

Risk vs Odds
P = risk, O = odds
O = P/(1-P), P = O/(1+O)
Example: P = 1/10, O = 1/9
Risk = num sick/total
Odds = num sick/num not sick
RR = OR/(1 – Pu + OR x Pu)
When Pu is small, RR ≈ OR. In general, OR is more extreme (farther from 1) than RR.

Page 34: Section II Descriptive stats for continuous data

Oral Contraceptive (OC) exposure vs Cancer
Prospective study (unbiased est of pop)

               disease   no disease   total    risk     odds
exposed           50        950       1000    0.050    0.053
unexposed        200       8550       8750    0.0228   0.0234
total            250       9500       9750
OC use (P)       20%        10%

RR = 2.188, OR = 2.250
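The risks, odds and ratios in this table, and the formula converting OR to RR, can be reproduced with the sketch below. The helper names `risk` and `odds` are illustrative.

```python
def risk(events, non_events):
    """Risk = num sick / total."""
    return events / (events + non_events)

def odds(events, non_events):
    """Odds = num sick / num not sick."""
    return events / non_events

# Prospective OC study: (disease, no disease) counts per exposure group
p_e, o_e = risk(50, 950), odds(50, 950)         # exposed
p_u, o_u = risk(200, 8550), odds(200, 8550)     # unexposed

rr = p_e / p_u
odds_ratio = o_e / o_u
rr_from_or = odds_ratio / (1 - p_u + odds_ratio * p_u)

print(f"RR = {rr:.3f}, OR = {odds_ratio:.3f}, RR from OR = {rr_from_or:.3f}")
```

Because the unexposed risk is small here (about 2.3%), the OR is only slightly larger than the RR.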

Page 35: Section II Descriptive stats for continuous data

Ratios and differences

For rare events or diseases: Pe = 1/10,000, Pu = 1/100,000
RR = 10, risk difference = 9/100,000
It is misleading to report only the ratio and not the actual risks.

Page 36: Section II Descriptive stats for continuous data

Odds - case-control study

               cancer   no cancer
OC               100        5
no OC            400       45
total            500       50
OC use (P)       20%       10%
Odds (O)         0.25      0.11

OR = 2.25
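A minimal sketch of the case-control calculation: the odds here are odds of exposure (OC use) within cases and within controls, and their ratio equals the familiar cross-product ratio. Variable names are illustrative.

```python
# Case-control counts: OC users and non-users among cases and controls
cases_oc, cases_no_oc = 100, 400
controls_oc, controls_no_oc = 5, 45

odds_exposure_cases = cases_oc / cases_no_oc            # 0.25
odds_exposure_controls = controls_oc / controls_no_oc   # about 0.11
odds_ratio = odds_exposure_cases / odds_exposure_controls

# same quantity written as a cross-product ratio
cross_product = (cases_oc * controls_no_oc) / (cases_no_oc * controls_oc)

print(f"OR = {odds_ratio:.2f}, cross-product = {cross_product:.2f}")
```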

Page 37: Section II Descriptive stats for continuous data

Why use ORs?
1. In a prospective study, we usually quote disease risk & risk ratio (RR). In case-control, we always quote OR, not RR. The case-control OR of exposure in disease/no disease equals the prospective OR of disease in exposed/unexposed in the population if the probability of exposure is the same as in the target population. (Not necessarily true if there is confounding, bias.)
2. OR is more “stable” (universal) across studies. If unexposed risk = 20% and RR = 2, exposed risk = 40%. If unexposed risk = 60%, RR can’t be 2, because the exposed risk would have to be 120%.

Page 38: Section II Descriptive stats for continuous data

Independence rule for ORs
ORs for heart attack (MI):
For smokers/non smokers: OR = 4
For alcohol/no alcohol: OR = 2
If independent, the OR for those who smoke AND drink alcohol is 4 x 2 = 8 (relative to no smoke, no alcohol). This is only true if smoking and drinking are independent influences on MI. However, smoking & drinking can be correlated with each other.

Page 39: Section II Descriptive stats for continuous data

NNT – number needed to treat (or harm) (clinical trials)
Pc (like Pu) = prop w/ disease in control group
Pt (like Pe) = prop w/ disease in treat group
ARR = absolute risk reduction = risk difference = RD = Pc - Pt
RRR = relative risk reduction = (Pc - Pt)/Pc = ARR/Pc = 1 - RR
NNT = number needed to treat = 1/ARR

Page 40: Section II Descriptive stats for continuous data

NNT Example
Pc = 0.36 = 36%, Pt = 0.34 = 34%
ARR = RD = 0.02 = 2%
RRR = 0.02/0.36 = 5.6% (a percent of a percent)
NNT = 1/0.02 = 50
So 50 patients must be given the treatment to prevent one additional disease case.
Can be extended to more complex stats.
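The three quantities can be computed together in a small helper. This is a sketch; the function name `nnt_summary` and the dictionary keys are illustrative.

```python
def nnt_summary(p_control, p_treat):
    """ARR, RR, RRR and NNT from the disease proportions in the
    control and treatment groups."""
    arr = p_control - p_treat
    return {
        "ARR": arr,                  # absolute risk reduction
        "RR": p_treat / p_control,   # risk ratio, treated vs control
        "RRR": arr / p_control,      # relative risk reduction = 1 - RR
        "NNT": 1 / arr,              # number needed to treat
    }

result = nnt_summary(0.36, 0.34)
print(f"ARR = {100 * result['ARR']:.1f}%")
print(f"RRR = {100 * result['RRR']:.1f}%")
print(f"NNT = {result['NNT']:.0f}")
```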

Page 41: Section II Descriptive stats for continuous data

NNT – Ovarian Ca screening
“Tests commonly recommended to screen healthy

women for ovarian cancer do more harm than good and should not be performed, a panel of medical experts said on Monday. The screenings —blood tests for a substance linked to cancer and ultrasound scans to examine the ovaries — do not lower the death rate from the disease, and they yield many false-positive results that lead to unnecessary operations with high complication rates, the panel said.

… “To find one case of ovarian cancer, 20 women had to undergo surgery.” (NY Times, 10 Sept 2012)

Page 42: Section II Descriptive stats for continuous data

Summary - Ratios

            Risk          Odds          Hazard
Measure:    P             O             h
Ratio:      RR = Pe/Pu    OR = Oe/Ou    HR = he/hu

All have the null value of 1.0 when there is no association. The distribution of the logs of these ratios from study to study is usually bell curve shaped around the true log scale value.

Page 43: Section II Descriptive stats for continuous data

Sensitivity and Specificity

                 True-disease   True-no disease
Test-positive         a               b
Test-negative         c               d
Total                a+c             b+d

Sensitivity = a/(a+c), false negative = c/(a+c)
Specificity = d/(b+d), false positive = b/(b+d)
Positive predictive value = PPV = a/(a+b) *
Negative predictive value = NPV = d/(c+d) *
* Depends on disease prevalence - not just an attribute of the test

Page 44: Section II Descriptive stats for continuous data

Sensitivity, Specificity, Accuracy

Accuracy = W Sensitivity + (1-W) Specificity where 0 < W < 1.

Often W=0.5 (unweighted accuracy)

We wish to maximize accuracy, i.e. minimize misclassification = 1 - Accuracy.

Choose W depending on “costs”.

Page 45: Section II Descriptive stats for continuous data

ROC curve – choose continuous data cutpoint (threshold) for highest accuracy, best “separation”
[Figure: distributions of Y (x-axis 0 to 1, y-axis 0.00 to 0.50)]

Page 46: Section II Descriptive stats for continuous data

“Modern” format for ROC

[Figure: sensitivity, specificity and accuracy (0% to 100%) versus Yc = threshold (cutpoint), 0.0 to 1.0]
[Figure: unweighted accuracy (0% to 100%) versus Yc = threshold (cutpoint), 0 to 1]
Highest accuracy is NOT necessarily where sens = spec (only when SD1 = SD2).

Page 47: Section II Descriptive stats for continuous data

“Traditional” ROC (not recommended - hard to label cutpoints)
[Figure: traditional ROC; sens (0% to 100%) versus false pos = 1 - spec (0% to 100%)]

Page 48: Section II Descriptive stats for continuous data

C (concordance) statistic for ROC
C = area under the “traditional” ROC curve
0.5 (bad) < C < 1.0 (good)
Let nd = a+c be the true number with disease and nnd = b+d the true number without disease. From all possible nd x nnd pairs with one diseased and one non-diseased person, call a pair “concordant” if the diseased person is positive and the non-diseased person is negative. C is the proportion of the pairs that are concordant.

Page 49: Section II Descriptive stats for continuous data

Positive and Negative predictive value
Positive predictive value (PPV) & negative predictive value (NPV) depend on sensitivity (sens), specificity (spec) & disease prevalence (P).
Sensitivity and specificity do NOT depend on disease prevalence.
Can only compute PPV = a/(a+b) & NPV = d/(c+d) directly from the table when the disease prevalence is P = (a+c)/(a+b+c+d) = (a+c)/n.

Bayes formulas for PPV and NPV
Let P = prevalence of disease
PPV = test true pos / (test true pos + test false pos) = sens x P / [sens x P + (1 - spec) x (1 - P)]
NPV = test true neg / (test true neg + test false neg) = spec x (1 - P) / [spec x (1 - P) + (1 - sens) x P]
But don’t use these formulas – there is an easier way

Page 50: Section II Descriptive stats for continuous data

Example

                 disease   no disease   Total
Test positive       95         20        115
Test negative        5       1980       1985
Total              100       2000       2100

Sens = 95/100 = 0.95, Spec = 1980/2000 = 0.99
Disease prevalence = P = 100/2100 = 0.0476
PPV = (0.95 x 0.0476) / [0.95 x 0.0476 + 0.01 x 0.9524] = 0.826
PPV = 95/115 = 0.826
NPV = (0.99 x 0.9524) / [0.99 x 0.9524 + 0.05 x 0.0476] = 0.9975
NPV = 1980/1985 = 0.9975
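Both routes to PPV and NPV (directly from the counts, and through the Bayes formulas) can be compared with the sketch below. The function names are illustrative; the last line uses a hypothetical lower prevalence of 0.1% only to show how PPV depends on prevalence.

```python
def diagnostic_stats(a, b, c, d):
    """2x2 counts: a = true pos, b = false pos, c = false neg, d = true neg."""
    n = a + b + c + d
    return {
        "sens": a / (a + c),
        "spec": d / (b + d),
        "prev": (a + c) / n,
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

def ppv_bayes(sens, spec, prev):
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv_bayes(sens, spec, prev):
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

stats = diagnostic_stats(95, 20, 5, 1980)
ppv = ppv_bayes(stats["sens"], stats["spec"], stats["prev"])
npv = npv_bayes(stats["sens"], stats["spec"], stats["prev"])

print(f"PPV: direct {stats['ppv']:.3f}, Bayes {ppv:.3f}")
print(f"NPV: direct {stats['npv']:.4f}, Bayes {npv:.4f}")
# hypothetical prevalence of 0.1%, same sensitivity and specificity
print(f"PPV at prevalence 0.001: {ppv_bayes(0.95, 0.99, 0.001):.3f}")
```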

Page 51: Section II Descriptive stats for continuous data

Bayesian “paradigm” for PPV

                                  Odds of disease            Probability of disease
Prior                             100/2000 = 0.05            100/2100 = 0.0476 = 4.76%
Positive test “data”:             sensitivity/false pos      (not applicable)
  likelihood ratio (LR)             = 0.95/0.01 = 95
Posterior given positive test     0.05 x 95 = 4.75           4.75/(1+4.75) = 0.826 = 82.6%
  = prior x LR = PPV

LR = Prob(+ test | disease) / Prob(+ test | no disease)
Posterior odds = Prior odds x LR
Bayes: Prior -> data -> Posterior
Prior probability is updated with data (LR) to get a posterior probability (PPV)
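The odds-based update in this table takes only a few lines. A sketch, using the sensitivity, false positive rate and prevalence from the worked example; the helper names are illustrative.

```python
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

prior_prob = 100 / 2100                 # disease prevalence
prior_odds = prob_to_odds(prior_prob)   # 100/2000

sens, false_pos = 0.95, 0.01
lr_positive = sens / false_pos          # likelihood ratio for a positive test

posterior_odds = prior_odds * lr_positive
posterior_prob = odds_to_prob(posterior_odds)   # this is the PPV

print(f"prior odds = {prior_odds:.2f}, LR = {lr_positive:.0f}")
print(f"posterior odds = {posterior_odds:.2f}, PPV = {posterior_prob:.3f}")
```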

Page 52: Section II Descriptive stats for continuous data

Bayes paradigm (algebra)
Prior -> (test) data -> Posterior

                 Disease   No disease   Total
Test positive       a          b         a+b
Test negative       c          d         c+d
Total              a+c        b+d         n

n = a+b+c+d
Prior disease risk = (a+c)/n
Prior disease odds = (a+c)/(b+d)
Test data: LR for a positive test = sens/false pos = [a/(a+c)] / [b/(b+d)] (an RR for testing positive, diseased vs. not diseased)
Posterior odds of disease = prior odds x LR for a positive test = a/b
Posterior disease risk = a/(a+b) = PPV

Page 53: Section II Descriptive stats for continuous data

Ex: FASTER Trial (NEJM 353:19, 10 Nov 2005)

Prior odds of Down’s syndrome (varies with gestational age)
↓
LR from biochemical markers (& other factors/data)
↓
Posterior odds of Down’s syndrome