Power and Sample Size
(At study design stage before doing the study):
“How large a sample size do I need to have a good chance of statistically finding a difference if a difference (or effect) truly exists?”
Robert Boudreau, PhD, Co-Director of Methodology Core
PITT-Multidisciplinary Clinical Research Center for Rheumatic and Musculoskeletal Diseases
PHARYNX
• A Clinical Trial in the Treatment of Carcinoma of the Oropharynx
• SIZE: 195 observations
SEX Frequency Percent
Male 149 76.4
Female 46 23.6
Standard treatment: Radiation therapy alone (n=100)
Test treatment: Radiation + Chemotherapy (n=95)
Post Treatment: 1 Yr Mortality Signif Diffs By Gender (?)
% died < 1 yr | Standard        | Test            | P-value
--------------+-----------------+-----------------+--------
Men (n=146)   | 42.1% (32/76)   | 45.7% (32/70)   | 0.66
Women (n=46)  | 21.7% (5/23)    | 52.2% (12/23)   | 0.03
Frequency Missing = 3 (censored before 1 yr)
• Large difference in women detected (even with smaller N)
Is Stage of Cancer a Factor ?
T_STAGE
• 1 = primary tumor measuring 2 cm or less in largest diameter
• 2 = primary tumor measuring 2 cm to 4 cm in largest diameter with minimal infiltration in depth
• 3 = primary tumor measuring more than 4 cm
• 4 = massive invasive tumor
N_STAGE (see Cooper et al., NEJM: Stage 2+ => high mortality)
• 0 = no clinical evidence of node metastases
• 1 = single positive node 3 cm or less in diameter, not fixed
• 2 = single positive node more than 3 cm in diameter, not fixed
• 3 = multiple positive nodes or fixed positive nodes
Is Stage of Cancer a Factor ?
Cooper JS, et al. Postoperative Concurrent Radiotherapy and Chemotherapy for High-Risk Squamous-Cell Carcinoma of the Head and Neck. NEJM 350(19):1937-1944. May 6, 2004
• “Patients who have two or more regional lymph nodes involved, extracapsular spread of disease, or microscopically involved mucosal margins of resection have particularly high rates of local recurrence (27 to 61 percent) and distant metastases (18 to 21 percent) and a high risk of death (five-year survival rate, 27 to 34 percent).”
Males: Tumor Stage by Metastasized Nodes
SEX=Male
The FREQ Procedure
Table of T_STAGE by N_STAGE (cell entries are frequencies)

T_STAGE | N_STAGE 0 | N_STAGE 1 | N_STAGE 2 | N_STAGE 3 | Total
--------+-----------+-----------+-----------+-----------+------
1       |     0     |     0     |     3     |     5     |    8
2       |     0     |     0     |     9     |    10     |   19
3       |    17     |    11     |    11     |    29     |   68
4       |    13     |     9     |     2     |    30     |   54
Total   |    30     |    20     |    25     |    74     |  149
Males: 1 Year Mortality (Among those with none or 1 small node)
Table of TX by died < 1 yr (cell entries are Frequency and Row Pct)

TX       | died < 1 yr = 0 | died < 1 yr = 1 | Total
---------+-----------------+-----------------+------
Standard | 20 (68.97%)     |  9 (31.03%)     |   29
Test     | 10 (50.00%)     | 10 (50.00%)     |   20
Total    | 30              | 19              |   49
Frequency Missing = 1
Statistics for Table of TX by died_lt_1yr
Statistic    DF   Value    Prob
Chi-Square    1   1.7934   0.1805
• Not quite Statistically Significant
Males: 1 Year Mortality (Among those with none or 1 small node)
WHAT IF: Exact same rates, but 5 times as many in study (n=245 vs 49)
Table of TX by died < 1 yr (cell entries are Frequency and Row Pct)

TX       | died < 1 yr = 0 | died < 1 yr = 1 | Total
---------+-----------------+-----------------+------
Standard | 100 (68.97%)    | 45 (31.03%)     |  145
Test     |  50 (50.00%)    | 50 (50.00%)     |  100
Total    | 150             | 95              |  245
Frequency Missing = 5
Statistics for Table of TX by died_lt_1yr
Statistic    DF   Value    Prob
Chi-Square    1   8.9670   0.0027
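The two chi-square results can be checked by hand. The following is a minimal Python sketch (not the SAS PROC FREQ code behind the slides); the function name `pearson_chi2_2x2` is hypothetical, and it assumes the uncorrected Pearson chi-square with 1 degree of freedom.

```python
import math

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P[chi-square > x] = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Standard, Test. Columns: died < 1 yr = 0, died < 1 yr = 1.
chi2_small, p_small = pearson_chi2_2x2(20, 9, 10, 10)     # original study, n=49
chi2_large, p_large = pearson_chi2_2x2(100, 45, 50, 50)   # same rates, n=245
print(round(chi2_small, 4), round(p_small, 4))
print(round(chi2_large, 4), round(p_large, 4))
```

With identical row percentages, multiplying every cell by 5 multiplies the chi-square statistic by 5, which is why the same observed rates move from non-significant to significant.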
Sampling Variability, Power and Sample Size
Standard Treatment
Case 1: n=29 (original study sample size)
p1= sample estimate of prob of death < 1 yr
= 9/29 = 0.3103
Stderr(p1) = sqrt ( p1*(1-p1) / n1 )
= sqrt ( 0.3103*0.6897/29) = 0.0859 (8.6%)
Case 2: n=145 (if 5 times larger sample size)
p1* = 45/145 = 0.3103
Stderr(p1*) = sqrt(0.3103*0.6897/145)
= 0.0384 (3.8%)
= Stderr(p1) / sqrt(5) = Stderr(p1) / 2.236
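A short Python sketch of the two standard-error calculations above; the helper name `stderr_proportion` is an illustrative assumption, not part of the original slides.

```python
import math

def stderr_proportion(deaths, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    p = deaths / n
    return math.sqrt(p * (1 - p) / n)

se_case1 = stderr_proportion(9, 29)      # original study
se_case2 = stderr_proportion(45, 145)    # 5 times larger, same rate
ratio = se_case1 / se_case2              # shrinkage factor from the larger sample
print(round(se_case1, 4), round(se_case2, 4), round(ratio, 3))
```

The ratio equals sqrt(5): the standard error shrinks with the square root of the sample size, not with the sample size itself.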
Sampling Variability, Power and Sample Size (cont’d)
Standard (n1, p1), Test (n2, p2), and their Difference:

n1  | p1     | Stderr(p1) | n2  | p2   | Stderr(p2) | p2-p1  | Stderr(p2-p1) | Z (ratio)
----+--------+------------+-----+------+------------+--------+---------------+----------
 29 | 0.3103 | 0.0859     |  20 | 0.50 | 0.1118     | 0.1897 | 0.1410        | 1.35
145 | 0.3103 | 0.0384     | 100 | 0.50 | 0.0500     | 0.1897 | 0.0631        | 3.01

Stderr(p2-p1) = sqrt( Stderr(p1)^2 + Stderr(p2)^2 ), and Z = (p2-p1) / Stderr(p2-p1)
In both cases
• The null hypothesis is H0: True Diff=0
• P[ Type I error ] = P[ Reject H0 when H0 true ] = 0.05 (“level of significance”, “alpha-level”)
Case #1: Observed diff. explainable “by chance” (Z=1.35, p=0.179)
Case #2: Observed diff. not explainable “by chance” (Z=3.01, p=0.0026)
[Figure: Distribution of possible observed p2-p1 for different sample sizes under the hypothetical condition that the mortality rates are really the same. Curves shown for n=49 total and n=245 total; each shaded tail area is (P-value)/2.]
Two-sided Hypothesis Test (2 treatments equal vs not equal ?)
n=49, p=0.179; n=245, p=0.0026
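The Z ratios and two-sided p-values can be recomputed directly from the raw counts. This is a minimal Python sketch, assuming the unpooled standard error of the difference; `unpooled_z_test` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def unpooled_z_test(x1, n1, x2, n2):
    """Z ratio and two-sided p-value for p2 - p1 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return se_diff, z, p_value

# (deaths, n) for Standard followed by (deaths, n) for Test.
results = {
    "n=49": unpooled_z_test(9, 29, 10, 20),
    "n=245": unpooled_z_test(45, 145, 50, 100),
}
for label, (se_diff, z, p_value) in results.items():
    print(label, round(se_diff, 4), round(z, 2), round(p_value, 4))
```

The observed difference is the same in both cases; only the standard error, and therefore Z, changes with the sample size.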
Sampling Variability, Power and Sample Size (cont’d)
• Null Hypothesis: 1 yr mortality rates are same
• Alternate Hypothesis: 1 yr mortality rates differ by treatment
Natural Question: Is there actually a difference, but the small sample size study didn’t find it ?
• Type II Error: Accept null hypothesis when alternate hypothesis is true
• Prob[Type II Error] = β
• Power = Prob[ Reject H0 when alternate is true ] = 1 - β
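To make the "natural question" concrete, here is a hedged Python sketch of the power the two designs would have had if the observed rates (31.03% vs 50%) were the true rates. It assumes the normal approximation to the two-sided chi-square test with a pooled standard error under the null; the function name `power_two_proportions` is hypothetical and this calculation is not on the original slides.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, n1, p2, n2, alpha=0.05):
    """Approximate power of the two-sided test of p1 = p2 (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)                  # pooled rate under H0
    se_null = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    se_alt = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = abs(p2 - p1)
    upper = norm.cdf((diff - z_crit * se_null) / se_alt)     # reject in the expected direction
    lower = norm.cdf((-diff - z_crit * se_null) / se_alt)    # reject in the opposite direction
    return upper + lower

power_small = power_two_proportions(9 / 29, 29, 0.50, 20)
power_large = power_two_proportions(9 / 29, 145, 0.50, 100)
print(round(power_small, 3), round(power_large, 3))
```

Under these assumptions the original sample sizes give a Type II error probability well above one half, while the five-fold larger design brings power above 80%.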
Making Decisions Using Statistical Tests: Type I & Type II Errors
Q: “Is there actually a difference in 1 yr mortality rates, but the small sample size in the study didn’t find it ?”
The question is asking about the two cells highlighted in blue.
True State of Nature (actual relationship of 1 yr mortality rates between treatments)

Decision Based on Statistical Test                 | Null True (1 yr mortality rates actually the same) | Alternate True (1 yr mortality rates actually differ)
---------------------------------------------------+-----------------------------------------------------+------------------------------------------------------
Accept Null (1 yr mortality rates not signif diff) | Correct decision made by statistical test           | Type II error. Prob: β = ? (depends on how different)
Reject Null (1 yr mortality rates are signif diff) | Type I error. Prob: α = 0.05 (preset level)         | Correct decision made by statistical test. Prob: Power = 1 - β
Power & Sample Size
Cooper et al., NEJM:
• “On the basis of the previous trials of the RTOG, patients treated with postoperative radiation were expected to have a two-year rate of local or regional recurrence of 38 percent. The study required the randomization of 398 eligible patients to have the statistical power to detect an absolute improvement of 15 percent in this rate with the use of a two-sided test with 0.80 statistical power and a significance level of 0.05.”
Power & Sample Size Calculations
• Power & sample size calculations are typically made using estimated rates from prior or related studies
(1) A scientifically meaningful improvement, change, difference, odds-ratio (OR) or hazard-ratio (HR) is set, then a required sample size to achieve 80% power is computed.
(2) The budget may dictate the maximum available “N”. => Power is then calculated based on fixed “N” for a range of differences, ORs or HRs. Prior studies are used to estimate means, stdevs, rates, ORs … etc.
A. Power with sample size (N) fixed
Two Year Rate of Local or Regional Recurrence

Absolute Improvement | Radiation | Radiation + Chemo | Power with n=150 per group *
---------------------+-----------+-------------------+-----------------------------
0                    | 0.38      | 0.38              | 0.050
0.05                 | 0.38      | 0.33              | 0.147
0.10                 | 0.38      | 0.28              | 0.453
0.15                 | 0.38      | 0.23              | 0.809

Power = Prob[ finding signif difference if recurrence rates differ by tabulated amounts ]
* Using two-sample independent chi-square test
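The power column can be approximated outside SAS. This is a sketch in Python, assuming the normal approximation with a pooled standard error under the null hypothesis and equal group sizes; `power_equal_n` is a hypothetical name, and the slides' own values come from PROC POWER.

```python
import math
from statistics import NormalDist

def power_equal_n(p1, p2, n, alpha=0.05):
    """Approximate two-sided power for comparing two proportions, n per group."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n)
    se_alt = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    diff = abs(p1 - p2)
    return (norm.cdf((diff - z_crit * se_null) / se_alt)
            + norm.cdf((-diff - z_crit * se_null) / se_alt))

# Radiation rate fixed at 0.38; radiation + chemo rate varies.
powers = {p2: power_equal_n(0.38, p2, 150) for p2 in (0.38, 0.33, 0.28, 0.23)}
for p2, power in powers.items():
    print(p2, round(power, 3))
```

When the two rates are equal, the "power" is just the Type I error rate of 0.05, which is why the first row of the table reads 0.050.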
A. Power with sample size (N) fixed
• Null Hypothesis: 1 yr mortality rates are same
• Alternate Hypothesis: 1 yr mortality rates differ by treatment
Test statistic: Z = (p1 - p2) / Stderr(p1-p2)
Stderr(p1-p2) = sqrt( var(p1-p2) ) = sqrt( p1*(1-p1)/n + p2*(1-p2)/n )
Z is approximately Normal (for any p1, p2)
with mean: (p1-p2)/stderr (=0 if no difference)
with SD=1 (aka “standardized”)
A. Power with sample size fixed (n=150 each group)
[Figure: distributions of Z under the null hypothesis (red) and under Alt #3: p1=0.38 (radiation), p2=0.23 (rad + chemo) recurrence rates. The rejection regions lie in both tails. Power = 0.809 = Prob[ in rejection region ] under Alt #3.]
A. Power with sample size (N) fixed
Z = (p1 - p2) / stderr
Under Alt #3, distribution of Z has mean:
(0.38 - 0.23) / 0.0525 = 0.15 / 0.0525 = 2.86
→ Reject H0 if |Z| > 1.96
→ 80.9% of the area under the Alt #3 distribution falls in the null hypothesis (no diff) rejection region
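A Normal curve centered at 2.86 with SD=1 would put about 82% of its area beyond 1.96, slightly more than the tabulated 0.809. A sketch of how 0.809 can arise, assuming the critical value is scaled by the pooled standard error under the null hypothesis (an assumption about the normal approximation, not something stated on the slide):

```latex
\[
\text{Power} \approx \Phi\!\left(\frac{|p_1 - p_2| - 1.96\,\mathrm{SE}_0}{\mathrm{SE}_1}\right),
\qquad
\mathrm{SE}_0 = \sqrt{\frac{2\,\bar p\,(1-\bar p)}{n}},
\quad
\mathrm{SE}_1 = \sqrt{\frac{p_1(1-p_1) + p_2(1-p_2)}{n}},
\quad
\bar p = \frac{p_1 + p_2}{2}
\]
\[
\bar p = 0.305,\quad \mathrm{SE}_0 = 0.05316,\quad \mathrm{SE}_1 = 0.05245
\quad\Longrightarrow\quad
\Phi\!\left(\frac{0.15 - 0.10420}{0.05245}\right) = \Phi(0.873) \approx 0.809
\]
```

The contribution of the opposite tail is negligible here, so the one-term expression is enough.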
A. Power with sample size (N) fixed
* In SAS:
* Compute power with n=150
* per group with alternate p2=0.23;
proc power;
  twosamplefreq test=pchi
    groupproportions = (0.38, 0.23)
    npergroup = 150
    power = .;
run;
A. Power with sample size (N) fixed
The POWER Procedure
Pearson Chi-square Test for Two Proportions

Fixed Scenario Elements
Distribution                  Asymptotic normal
Method                        Normal approximation
Group 1 Proportion            0.38
Group 2 Proportion            0.23
Sample Size Per Group         150
Number of Sides               2
Null Proportion Difference    0
Alpha                         0.05

Computed Power
Power    0.809
B: Sample size (N) to achieve 80% power
* How many needed per group for exactly
* 80% power ?;
proc power;
  twosamplefreq test=pchi
    groupproportions = (0.38, 0.23)
    npergroup = .
    power = 0.8;
run;
B: Sample size (N) to achieve 80% power
The POWER Procedure
Pearson Chi-square Test for Two Proportions

Fixed Scenario Elements
Distribution                  Asymptotic normal
Method                        Normal approximation
Group 1 Proportion            0.38
Group 2 Proportion            0.23
Nominal Power                 0.8
Number of Sides               2
Null Proportion Difference    0
Alpha                         0.05

Computed N Per Group
Actual Power    N Per Group
0.801           147
B: Sample size (N) to achieve 80% power
Two Year Rate of Local or Regional Recurrence

Absolute Improvement | Radiation | Radiation + Chemo | N per group * | Total N *
---------------------+-----------+-------------------+---------------+----------
0                    | 0.38      | 0.38              |               |
0.05                 | 0.38      | 0.33              | 1437          | 2874
0.10                 | 0.38      | 0.28              | 346           | 692
0.15                 | 0.38      | 0.23              | 147           | 294

* N to achieve 80% power. 80% Power = Prob[ finding signif difference if recurrence rates differ by tabulated amounts ], using two-sample independent chi-square test
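The per-group sample sizes can be approximated with the closed-form normal-approximation formula. This is a Python sketch under that assumption (PROC POWER itself searches for the smallest N meeting the target power); `n_per_group` is a hypothetical name.

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided test of two proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)     # about 1.96 for alpha = 0.05
    z_beta = norm.inv_cdf(power)              # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)   # round up to whole patients

sizes = {p2: n_per_group(0.38, p2) for p2 in (0.33, 0.28, 0.23)}
for p2, n in sizes.items():
    print(p2, n, 2 * n)
```

Halving the detectable improvement roughly quadruples the required sample size, because the difference enters the formula squared in the denominator.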
[Figure: Rates of Local and Regional Control. Actual results of the Cooper study using the sample sizes based on their power calculations (P = 0.01). Source: Cooper JS et al. Postoperative Concurrent Radiotherapy and Chemotherapy for High-Risk Squamous-Cell Carcinoma of the Head and Neck. New Eng J Med. 350 (2004) 1937-1944.]
B: Sample size (N) to achieve 80% power
Sample Size: Two-sample Test of Proportions
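The formula shown on this slide did not survive in the transcript. The version below is reconstructed from the SAS data step that follows, on the assumption that the data step implements the slide's formula as its comment says; the symbols \(z_{\alpha/2} = 1.96\) and \(z_\beta = 0.84\) correspond to a two-sided 0.05 test with 80% power.

```latex
\[
n \;=\; \frac{\left( z_{\alpha/2}\sqrt{2\,\bar p\,(1-\bar p)} \;+\; z_{\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}{(p_2 - p_1)^{2}},
\qquad
\bar p = \frac{p_1 + p_2}{2}
\]
```

Here \(n\) is the number required in each group, rounded up to the next whole number.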
B: Sample size (N) to achieve 80% power
* How many are needed per group for exactly
* 80% power ? (implements the formula);
data _null_;
  p1=0.38; p2=0.23;
  p=(p1+p2)/2;
  n=( 1.96*sqrt( 2*p*(1-p) )
      + 0.84*sqrt( p1*(1-p1)+ p2*(1-p2) ) )**2
    /(p2-p1)**2;
  put n=;
run;
n=146.5414874
BARI 10-Year Survival Stratified by Diabetes Status
[Figure: survival (0% to 100%) versus years (0 to 10) for four groups: Diabetes CABG (n=180), Diabetes PTCA (n=173), No diabetes CABG (n=734), No diabetes PTCA (n=742).]
10-year survival: ND CABG 78.2%, ND PTCA 76.8%, D CABG 57.1%, D PTCA 44.1%
No Treated Diabetes, CABG vs PTCA: p = 0.50
Treated Diabetes, CABG vs PTCA: p = 0.012
Logistic Regression: Sample size (N) to achieve 80% power
Goal of new study proposal:
Test survival for improved method of PTCA
BARI: Diabetics vs Non-Diabetics
PTCA 10 yrs survival: p1=0.441, p2=0.768
OR = ( p2/(1-p2) ) / ( p1/(1-p1) ) = 3.31 / 0.789 = 4.20
Approx 20% of eligible patients are diabetic
(in general population)
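A quick Python check of the odds and odds ratio computed from the two BARI 10-year PTCA survival proportions; the variable names are illustrative only.

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

p_diabetic = 0.441        # p1: diabetics, PTCA 10-year survival
p_non_diabetic = 0.768    # p2: non-diabetics, PTCA 10-year survival

odds_ratio = odds(p_non_diabetic) / odds(p_diabetic)
print(round(odds(p_non_diabetic), 2), round(odds(p_diabetic), 3), round(odds_ratio, 2))
```

The odds for non-diabetics alone are about 3.31; dividing by the diabetics' odds of about 0.789 gives the odds ratio.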
Logistic Regression: Sample size (N) to achieve 80% power
* To Detect OR=1.8 with 80% Power;
* 20% diabetics (e.g like cohort study);
proc power;
  twosamplefreq test=pchi
    oddsratio = 1.8
    refproportion = 0.441
    groupweights = (1 4)
    ntotal = .
    power = 0.80;
run;
* Note: Could assume higher than 0.441 for diabetics if new method does better;
Logistic Regression: Sample size (N) to achieve 80% power
The POWER Procedure Pearson Chi-square Test for Two Proportions
Fixed Scenario Elements
Distribution                      Asymptotic normal
Method                            Normal approximation
Reference (Group 1) Proportion    0.441
Odds Ratio                        1.8
Group 1 Weight                    1
Group 2 Weight                    4
Nominal Power                     0.8
Number of Sides                   2
Null Odds Ratio                   1
Alpha                             0.05

Computed N Total
Actual Power    N Total
0.801           570
Logistic Regression: Sample size (N) to achieve 80% power
* Detect OR=1.8 with 80% power;
* With equal number of diabetics/non-diabetics
* recruited into study;
proc power;
  twosamplefreq test=pchi
    oddsratio = 1.8
    refproportion = 0.441
    npergroup = .
    power = 0.80;
run;
N Per Group = 184 ( Total N = 368 )
Note: Total N = 570 when 20% diabetics, 80% non-diab. For a fixed total N, power is generally lower with unequal group sizes.
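The effect of the 1:4 allocation can be illustrated with a Python sketch. It assumes the same normal approximation with a pooled null standard error as the chi-square test; the helper names are hypothetical, and 114/456 is simply the 20%/80% split of the 570 reported by PROC POWER.

```python
import math
from statistics import NormalDist

NORM = NormalDist()

def proportion_from_odds_ratio(p_ref, odds_ratio):
    """Proportion whose odds are odds_ratio times the odds of p_ref."""
    odds = odds_ratio * p_ref / (1 - p_ref)
    return odds / (1 + odds)

def power_two_groups(p1, n1, p2, n2, alpha=0.05):
    """Approximate two-sided power for comparing two proportions."""
    z_crit = NORM.inv_cdf(1 - alpha / 2)
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    se_alt = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = abs(p2 - p1)
    return (NORM.cdf((diff - z_crit * se_null) / se_alt)
            + NORM.cdf((-diff - z_crit * se_null) / se_alt))

p_ref = 0.441
p_alt = proportion_from_odds_ratio(p_ref, 1.8)                # implied by OR = 1.8

power_1_to_4 = power_two_groups(p_ref, 114, p_alt, 456)       # total 570, 20% / 80%
power_184_each = power_two_groups(p_ref, 184, p_alt, 184)     # total 368, equal groups
power_285_each = power_two_groups(p_ref, 285, p_alt, 285)     # total 570, equal groups
print(round(p_alt, 4), round(power_1_to_4, 3),
      round(power_184_each, 3), round(power_285_each, 3))
```

With the same total of 570, splitting the sample equally would give noticeably more power than the 20%/80% split, which is the point of the note above.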
Comparing Means of 2 Groups: Power and Sample Size
From Women’s Health Initiative Observational Study (WHI-OS)
~ 90,000 women longitudinal cohort study (8yrs and continuing)
Osteoporotic Fractures Ancillary Substudy
Funded case-control study: 1200 cases (fractures), 1200 controls
• 25(OH)2 Vitamin D3 (ng/ml)
• Inflammatory markers (e.g. IL-6)
• Hormones (estradiol), bone mineral density, …
Comparing Means of 2 Groups: Power and Sample Size
25(OH)2 Vitamin D3 (ng/ml) mean (sd): 25.8 ± 10.7
With n=1200 in each group (fracture=case, no fracture=control)
What is difference in means of Vitamin D3 that can be detected with 80% power ?
Comparing Means of 2 Groups: Power and Sample Size
proc power;
twosamplemeans test=diff
meandiff=.
stddev=10.7
npergroup=1200
power=0.80;
run;
Comparing Means of 2 Groups: Power and Sample Size
The POWER Procedure
Two-sample t Test for Mean Difference

Fixed Scenario Elements
Distribution             Normal
Method                   Exact
Standard Deviation       10.7
Sample Size Per Group    1200
Power                    0.8
Number of Sides          2
Null Difference          0
Alpha                    0.05

Computed Mean Diff
Mean Diff    1.22
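The detectable difference can be approximated by hand. This is a Python sketch assuming the normal approximation (PROC POWER uses the exact t-based method, which is nearly identical at n=1200); `detectable_difference` is a hypothetical name.

```python
import math
from statistics import NormalDist

def detectable_difference(sd, n_per_group, power=0.80, alpha=0.05):
    """Smallest mean difference detectable with the given power (normal approximation)."""
    norm = NormalDist()
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return z_sum * sd * math.sqrt(2 / n_per_group)

diff = detectable_difference(10.7, 1200)
print(round(diff, 2))
```

The same function shows the trade-off directly: quadrupling n per group halves the detectable difference.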
Comparing Means of 2 Groups: Power and Sample Size
Suppose a 1 ng/ml difference is considered scientifically/clinically meaningful
(or) You are designing a study to potentially detect differences in Vitamin D3 that are this small.
How many are needed in each group to have 80% power to detect a difference of 1 ng/ml ?
25(OH)2 Vitamin D3 (ng/ml) mean (SD): 25.8 ± 10.7
Sample Size Formula for Comparing Means of 2 Groups
Usually: D0 = 0(i.e. equality of the means)
Sample Size Formula for Comparing Means of 2 Groups
• How many fracture cases and non-fracture controls are needed to have 80% power to detect a difference of 1 ng/ml in Vitamin D3?
We know from a pilot study or other published results that:
25(OH)2 Vitamin D3 (ng/ml): mean (SD): 25.8 ± 10.7 (SD=10.7)
α = 0.05, α/2 = 0.025, Z_{α/2} = 1.96 (α/2 = 0.025 = area to the right on the normal curve)
Power = 0.80 → β = 0.20, Z_β = 0.84 (β = 0.20 = area to the right on the normal curve)
σ ≈ 10.7, Z_{α/2} = 1.96, Z_β = 0.84, Δ = 1
The sample size (approx) required in each group is:
\[
n \approx \frac{2\,\sigma^{2}\,(Z_{\alpha/2} + Z_{\beta})^{2}}{\Delta^{2}}
= \frac{2\,(10.7)^{2}\,(1.96 + 0.84)^{2}}{1^{2}} = 1795.2 \;\rightarrow\; 1796
\]
Comparing Means of 2 Groups: Power and Sample Size
proc power;
  twosamplemeans test=diff
    meandiff = 1
    stddev = 10.7
    npergroup = .
    power = 0.80;
run;

Computed N Per Group
Actual Power    N Per Group
0.800           1799
(vs 1200 to detect 1.22 diff)
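The hand formula is easy to evaluate in a few lines. This Python sketch uses the rounded quantiles 1.96 and 0.84 from the slide; `n_per_group_means` is a hypothetical name. PROC POWER's slightly larger answer (1799) probably reflects its exact t-based method and unrounded quantiles, which is an interpretation rather than something stated in the slides.

```python
import math

def n_per_group_means(sd, delta, z_alpha=1.96, z_beta=0.84):
    """Normal-approximation n per group: 2 * sd^2 * (z_alpha + z_beta)^2 / delta^2."""
    return 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2

n_approx = n_per_group_means(10.7, 1.0)
n_required = math.ceil(n_approx)    # round up to whole subjects
print(round(n_approx, 1), n_required)
```

Because Δ is squared in the denominator, moving the target difference from 1.22 to 1 ng/ml raises the requirement by roughly a factor of 1.5.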
Comparing Means of 2 Groups: Related to Logistic Regression OR
Hosmer & Lemeshow, Applied Logistic Regression
• Relationship between 2-sample t-test and logistic regression
For continuous predictor (e.g. Vitamin D3):
Let u2-u1 = detectable difference with 80% power
σ = standard deviation
An odds-ratio (OR) per SD ~ exp ( (u2-u1)/ σ )
is detectable with approx. 80% power
OR between 1st & 4th quartile ~ exp (3*(u2-u1)/ σ )
Comparing Means of 2 Groups: Related to Logistic Regression OR
25(OH)2 Vitamin D3 (ng/ml) mean (sd): 25.8 ± 10.7
Actual funded study:With n=1200 in each group (fracture, no fracture)
Diff in means = 1.22 is detectable with 80% power
=> OR per SD = exp(1.22/10.7) = 1.12
=> OR between 1st & 4th quartile ~ exp(3*1.22/10.7) = 1.4
are both detectable with 80% power
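The two detectable odds ratios follow directly from the approximation above. A minimal Python sketch, with illustrative variable names:

```python
import math

mean_diff = 1.22    # difference in means detectable with 80% power (n=1200 per group)
sd = 10.7           # standard deviation of Vitamin D3 (ng/ml)

or_per_sd = math.exp(mean_diff / sd)
or_q1_vs_q4 = math.exp(3 * mean_diff / sd)    # slide's approximation for 1st vs 4th quartile
print(round(or_per_sd, 2), round(or_q1_vs_q4, 2))
```

The factor of 3 is the slide's approximation for the distance between the 1st and 4th quartiles in SD units.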
Proc Power Capabilities
– MULTREG < options > ;
– ONECORR < options > ;
– ONESAMPLEFREQ < options > ;
– ONESAMPLEMEANS < options > ;
– ONEWAYANOVA < options > ;
– PAIREDFREQ < options > ;
– PAIREDMEANS < options > ;
– TWOSAMPLEFREQ < options > ;
– TWOSAMPLEMEANS < options > ;
– TWOSAMPLESURVIVAL < options > ;
– PLOT < plot-options > < / graph-options > ;
Thank you !
Any Questions?