Sample size and study design

Brian Healy, PhD

Comments from last time
We did not cover confounding
Too much in one class / Not enough examples / Superficial level
– I wanted to show one example for each type of analysis so that you can determine what your data matches. This way you can speak to a statistician knowing the basic ideas.
– My hope was for you to feel confident enough to learn more about the topics relevant to you
– Worked example lectures
This is not basic biostatistics
I did Teach for America

Objectives
Type II error
How to improve power?
Sample size calculation
Study design considerations

Review
In previous classes we have focused on data analysis
– AFTER data collection
Hypothesis testing allowed us to determine whether there was a statistically significant:
– Difference between groups
– Association between two continuous factors
– Association between two dichotomous factors

Example
We know that the heart rate for a healthy adult is 80 beats per minute and that this has an approximately normal distribution (according to my wife)
Some elite athletes, like Lance Armstrong, have a lower heart rate, but it is not known if this is true on average
How could we address this question?

Experimental design
One way to do this is to collect a sample of normal controls and a sample of elite athletes and compare their means
– What test would you use?
Another way is to collect a sample of elite athletes and compare their mean to the known population mean
– This is a one-sample test
– Null hypothesis: $\text{mean}_{\text{elite}} = 80$

Question
How large a sample of elite athletes should I collect?
What is the benefit of having a large sample size?
– More information
– More accurate estimate of the population mean
What is the disadvantage of a large sample size?
– Cost
– Effort required to collect
What is the “correct” sample size?

Effect of sample size
Let’s say we wanted to estimate the blood pressure of people at MGH
– If we sampled 3 people, would we have a good estimate of the population mean?
  How much will the sample mean vary from sample to sample?
– Does our estimate of the mean improve if we sample 30 people?
  Would the sample mean vary more or less from sample to sample?
– What about 300 people?

Simulation
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
What is the shape of the distribution of sample means?
Where is the curve centered?
What happens to the curve as sample size increases?
Technical: Central limit theorem
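The questions above can also be explored offline with a small simulation. This is only a sketch, not the applet linked on the slide: it assumes a normal population and borrows the mean of 80 and standard deviation of 20 from the heart-rate example; the function name `sample_mean_spread` and the number of repeated samples are illustrative choices.

```python
import random
import statistics

def sample_mean_spread(n, n_samples=2000, mu=80.0, sigma=20.0, seed=0):
    """Draw n_samples samples of size n; return the mean and SD of the sample means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(n_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)

# The sample means stay centered near mu, but their spread shrinks as n grows.
for n in (3, 30, 300):
    center, spread = sample_mean_spread(n)
    print(f"n={n:3d}: mean of sample means={center:.1f}, SD of sample means={spread:.2f}")
```

The spread of the sample means should come out close to the population standard deviation divided by the square root of n, which is the standard error.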


Standard error of the mean
There are two measures of spread in the data
– Standard deviation: measure of spread of the individual observations
  The estimate of this is the standard deviation of the observations: $s$
– Standard error: standard deviation of the sample mean
  The estimate of this is the standard deviation of the observations divided by the square root of the sample size: $s / \sqrt{n}$
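As a minimal sketch of the distinction between the two measures, the snippet below computes both from one sample. The heart-rate values are made up for illustration and the helper name `sd_and_se` is hypothetical.

```python
import math
import statistics

def sd_and_se(observations):
    """Return (standard deviation, standard error of the mean) for one sample."""
    s = statistics.stdev(observations)        # spread of the individual observations
    se = s / math.sqrt(len(observations))     # spread of the sample mean
    return s, se

heart_rates = [72, 85, 90, 64, 78, 81, 95, 70]  # hypothetical values, beats per minute
s, se = sd_and_se(heart_rates)
print(f"standard deviation = {s:.2f}, standard error = {se:.2f}")
```

The standard deviation describes the observations themselves and does not shrink with more data; the standard error does, because of the division by the square root of n.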

Technical: Distribution of sample mean under the null
If we took repeated samples and calculated the sample mean each time, the sample means would have their own distribution
[Figure: distribution of the sample mean under the null. Mean of distribution = 80. Spread in distribution is based on the standard error.]

Type I error
We could plot the distribution of the sample means under the null before collecting data
Type I error is the probability that you reject the null given that the null is true
$P(\text{reject } H_0 \mid H_0 \text{ is true})$
[Figure note: the shaded area is still part of the null curve, but it is in the tail of the distribution.]

Hypothesis test review
After data collection, we can calculate the p-value
If the p-value is less than the pre-specified $\alpha$-level, we reject the null hypothesis
As the sample size increases, the standard error decreases
The p-value is based on the standard error
– As your sample size increases, the p-value decreases if the mean and standard deviation do not change
– With an extremely large sample, a very small departure from the null is statistically significant
What would you think if you found the sample mean heart rate of three elite athletes was 70 beats per minute?
– Do your thoughts change if you sampled 300 athletes and found the same sample mean?
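To make the three-versus-300 comparison concrete, here is a sketch of a two-sided one-sample z-test. It assumes the null mean of 80 from the running example and a known standard deviation of 20 (the value used later in the slides); the function name `z_test_p_value` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, n, mu0=80.0, sigma=20.0):
    """Two-sided p-value for a one-sample z-test with known sigma (assumed here)."""
    se = sigma / sqrt(n)              # standard error shrinks as n grows
    z = (sample_mean - mu0) / se      # distance from the null in standard errors
    return 2 * NormalDist().cdf(-abs(z))

# Same observed mean of 70, very different evidence against the null.
for n in (3, 300):
    print(f"n={n:3d}: p-value = {z_test_p_value(70.0, n):.3g}")
```

Under these assumptions the p-value with three athletes is roughly 0.39, while with 300 athletes it is vanishingly small, even though the sample mean is the same.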

How much data should we collect?
Depends on several factors:
– Type I error
– Type II error (power)
– Difference we are trying to detect (null and alternative hypotheses)
– Standard deviation
Remember this is decided BEFORE the study!!!

Type II error
Definition: when you fail to reject the null hypothesis when the alternative is in fact true (type II error)
This type of error is based on a specific alternative
$\beta = P(\text{fail to reject } H_0 \mid H_A \text{ is true})$

Power
Definition: the probability that you reject the null hypothesis given that the alternative hypothesis is true. This is what we want to happen.
$\text{Power} = P(\text{reject } H_0 \mid H_A \text{ is true}) = 1 - \beta$
Since this is a good thing, we want this to be high

[Figure: two curves, the distribution under the null hypothesis and the distribution under the alternative hypothesis. The location of the null curve is 0 and the spread in the curve is the standard error. A cut-off value separates "Reject H0" from "Fail to reject H0". Marked areas: type I error, P(reject H0 | H0 is true); power, P(reject H0 | HA is true); type II error, P(fail to reject H0 | HA is true).]

Life is a trade-off
These two errors are related
– We usually assume that the type I error is 0.05 and calculate the type II error for a specific alternative
– If you want to be more strict and falsely reject the null only 1% of the time ($\alpha = 0.01$), the chance of a type II error increases
Similar to the trade-off between sensitivity and specificity, or between false positives and false negatives
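The trade-off can be sketched numerically. The snippet assumes a two-sided one-sample z-test and, purely for illustration, plugs in the heart-rate numbers used elsewhere in these slides (null mean 80, alternative mean 70, standard deviation 20, 40 subjects); `type_ii_error` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, n=40, mu0=80.0, mu1=70.0, sigma=20.0):
    """Type II error (beta) of a two-sided one-sample z-test for a specific alternative."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # stricter alpha pushes the cut-off outward
    shift = abs(mu1 - mu0) / (sigma / sqrt(n))    # true difference in standard-error units
    power = std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)
    return 1 - power

for alpha in (0.05, 0.01):
    print(f"alpha={alpha:.2f}: type II error = {type_ii_error(alpha):.3f}")
```

With everything else held fixed, lowering the type I error from 0.05 to 0.01 raises the type II error.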

Changing the power
Note how the power (green) increases as you increase the difference between the null and alternative hypotheses
How else do you think we could increase the power?
Another way to increase power is to increase the type I error rate
Two other ways to increase power involve changing the shape of the distribution
– Increasing the sample size
  When the sample size increases, the curve for the sample means tightens
– Decreasing the variability in the population
  When there is less variability, the curve for the sample means also tightens

Example
For our study, we know that we can enroll 40 elite athletes.
We also know that the population mean is 80 beats per minute and the standard deviation is 20
We believe the elite athletes will have a mean of 70 beats per minute
How much power would we have to detect this difference at the two-sided 0.05 level?
– All this information fully defines our curves
Using STATA, we find that we have 88.5% power to detect the difference of 10 beats per minute between the groups at the two-sided 0.05 level using a one-sample z-test
Question: If we were able to enroll more subjects, would our power increase or decrease?
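The power figure can be reproduced without STATA. The sketch below is not the STATA command used for the slide; it is a direct calculation for a two-sided one-sample z-test with known standard deviation, and `z_test_power` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(n, mu0=80.0, mu1=70.0, sigma=20.0, alpha=0.05):
    """Power of a two-sided one-sample z-test with known sigma."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)    # cut-off in standard-error units
    shift = abs(mu1 - mu0) / (sigma / sqrt(n))    # true difference in standard-error units
    # Probability of landing beyond either cut-off when the alternative is true.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

print(f"n=40:          power = {z_test_power(40):.3f}")
print(f"n=60:          power = {z_test_power(60):.3f}")
print(f"n=40, sd=15:   power = {z_test_power(40, sigma=15.0):.3f}")
```

With 40 athletes the result is about 0.885, in line with the 88.5% quoted above. The other two calls (60 subjects, and a hypothetical smaller standard deviation of 15) show the power moving in the direction the earlier slides describe.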

Conclusions
For a specific sample size, standard deviation, difference between the means and type I error, we can calculate the power
Changing any of the four parameters above will change the power
– Some are under the control of the investigator, but others are not

Sample size
Up to now we have shown how to find the power given a specific sample size, difference between the means, standard deviation and alpha level.
We can vary any four of these five factors and find the fifth.
– Usually the alpha level is required to be two-sided 0.05
– How can we calculate the sample size for specific values of the remaining parameters?

Two approaches to sample size
Hypothesis testing
– When you have a specific null AND alternative hypothesis in mind
Confidence interval
– When you want to place an interval around an estimate

Hypothesis testing approach
1) State null and alternative hypothesis
– Null usually pretty easy
– Alternative is more difficult, but very important
2) State standard deviation of outcome
3) State desired power and alpha level
– Power=0.8
– Alpha=0.05 for two-sided test
4) State test
5) Use statistical package to calculate sample size

We know the location of the null and alternative curves, but we do not know the shape because the sample size determines the shape. We need to find the sample size that will give the curves the shape so that the $\alpha$ level and power equal the specified values.
[Figure labels: Alpha = 0.025 (per tail), Power = 0.8, Beta = 0.2]

General form of sample size calculation
Here is the general form of the normal sample size
– One-sided
$$n = \frac{\sigma^2 \left(z_{1-\alpha} + z_{1-\beta}\right)^2}{(\mu_0 - \mu_1)^2}$$
– Two-sided
$$n = \frac{\sigma^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{(\mu_0 - \mu_1)^2}$$
where
– $n$: sample size
– $\sigma$: standard deviation
– $z_{1-\alpha}$ or $z_{1-\alpha/2}$: related to type I error
– $z_{1-\beta}$: related to type II error
– $\mu_0$, $\mu_1$: mean under null and alternative
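The normal-theory sample size calculation for a single mean can be sketched in a few lines. The function name `required_n` and its parameter names are illustrative, and the inputs are the heart-rate values from these slides (null mean 80, alternative mean 70, standard deviation 20, power 0.8, alpha 0.05).

```python
from math import ceil
from statistics import NormalDist

def required_n(mu0, mu1, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Sample size for a one-sample z-test: sigma^2 (z_alpha + z_beta)^2 / (mu0 - mu1)^2."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha      # a two-sided test splits alpha across two tails
    z_alpha = std_normal.inv_cdf(1 - tail)        # quantile tied to the type I error
    z_beta = std_normal.inv_cdf(power)            # quantile tied to the type II error (power = 1 - beta)
    return sigma**2 * (z_alpha + z_beta) ** 2 / (mu0 - mu1) ** 2

for two_sided in (False, True):
    n = required_n(80.0, 70.0, 20.0, two_sided=two_sided)
    label = "two-sided" if two_sided else "one-sided"
    print(f"{label}: n = {n:.2f}, round up to {ceil(n)}")
```

The result is always rounded up, since a fractional subject cannot be enrolled and rounding down would leave the study slightly short of the target power.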


1) State null and alternative hypothesis
– $H_0$: $\mu_0 = 80$
– $H_A$: $\mu_1 = 70$
2) sd=20
3) State desired power and alpha level
– Power=0.8
– Alpha=0.05 for two-sided test
4) State test: z-test
5) n=31.36, rounded up to n=32
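The value 31.36 can be checked by hand. The arithmetic below assumes the two-sided normal formula with power 0.8 and alpha 0.05, and the usual rounded quantiles $z_{0.975} \approx 1.96$ and $z_{0.80} \approx 0.84$:

```latex
n = \frac{\sigma^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{(\mu_0 - \mu_1)^2}
  = \frac{20^2 \,(1.96 + 0.84)^2}{(80 - 70)^2}
  = \frac{400 \times 7.84}{100}
  = 31.36
```

Since the sample size must be a whole number and rounding down would fall short of the target power, 31.36 is rounded up to 32.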

Example: more complex
In a recently submitted grant, we investigated the sample size required to detect a difference between RRMS and SPMS patients in terms of levels of a marker
Preliminary data:
– RRMS: mean level=0.54 +/- 0.37
– SPMS: mean level=0.94 +/- 0.42


1) State null and alternative hypothesis
– $H_0$: $\text{mean}_{\text{RRMS}} = \text{mean}_{\text{SPMS}} = 0.54$
– $H_A$: $\text{mean}_{\text{RRMS}} = 0.54$, $\text{mean}_{\text{SPMS}} = 0.94$, difference between groups=0.4
2) $\text{sd}_{\text{RRMS}} = 0.37$, $\text{sd}_{\text{SPMS}} = 0.42$
3) State desired power and alpha level
– Power=0.8
– Alpha=0.05 for two-sided test
4) State test: t-test

Results
Use these values in statistical package
– 17 samples from each group are required
Website: http://hedwig.mgh.harvard.edu/sample_size/size.html
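As a rough cross-check on the package result, the sketch below uses a normal approximation for two groups of equal size. This is an assumption made for illustration, not the t-test calculation the package performs, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(mean1, mean2, sd1, sd2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two means (equal group sizes)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    # The variance of the difference in means is (sd1^2 + sd2^2) / n.
    return (sd1**2 + sd2**2) * (z_alpha + z_beta) ** 2 / (mean1 - mean2) ** 2

n = n_per_group(0.54, 0.94, 0.37, 0.42)
print(f"normal approximation: n = {n:.2f}, round up to {ceil(n)} per group")
```

The approximation comes out slightly below the 17 per group reported above. That is the expected direction, because a t-test has to estimate the standard deviations from the data and therefore needs a little more data for the same power.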

Statistical considerations for grant
“Group sample sizes of 17 and 17 achieve at least 80% power to detect a difference of -0.400 between the null hypothesis that both group means are 0.540 and the alternative hypothesis that the mean of group 2 is 0.940 with estimated group standard deviations of 0.370 and 0.420 and with a significance level (alpha) of 0.05 using a two-sided two-sample t-test.”

Technical remarks
So we have shown that we can calculate the power for a given sample size and the sample size for a given power. We can also find the detectable difference if we set the sample size and power.
In many grant applications, we show the power for a variety of sample sizes and differences in the means in a table so that the grant reviewer can see that there is sufficient power to detect a range of differences with the proposed sample size.
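A table of this kind can be sketched as follows. The standard deviations are the preliminary RRMS/SPMS values from these slides, while the grid of sample sizes and differences is hypothetical, and the power uses a normal approximation rather than the exact t-test calculation a statistical package would report.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(n_per_group, difference, sd1=0.37, sd2=0.42, alpha=0.05):
    """Approximate power of a two-sided two-sample comparison of means (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sqrt(sd1**2 / n_per_group + sd2**2 / n_per_group)   # SE of the difference in means
    shift = abs(difference) / se
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

sample_sizes = (10, 17, 25)      # hypothetical grid of per-group sample sizes
differences = (0.3, 0.4, 0.5)    # hypothetical grid of differences in means

print("n/group " + "".join(f"  diff={d:.1f}" for d in differences))
for n in sample_sizes:
    print(f"{n:7d} " + "".join(f"  {two_sample_power(n, d):8.2f}" for d in differences))
```

Each row shows how power grows with the size of the difference, and each column shows how it grows with the sample size, which is what a reviewer would look for.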

Confidence interval approach
If we do not have a set alternative, we can choose the sample size based on how close to the truth we want to get
In particular, we choose the sample size so that the confidence interval is of a certain width
Under a normal distribution, the 95% confidence interval for a single sample mean is
$$\left(\text{mean} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\ \ \text{mean} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right)$$
We can choose the sample size to provide the specified width of the confidence interval
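Solving the half-width $1.96 \cdot \sigma / \sqrt{n}$ for $n$ gives a direct calculation. The sketch below assumes a known standard deviation; the value 20 is taken from the heart-rate example and the target half-width of 5 beats per minute is purely hypothetical, as is the name `n_for_half_width`.

```python
from math import ceil
from statistics import NormalDist

def n_for_half_width(sigma, half_width, confidence=0.95):
    """Sample size so that the confidence interval is mean +/- half_width (known sigma)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95% confidence
    return (z * sigma / half_width) ** 2

n = n_for_half_width(sigma=20.0, half_width=5.0)
print(f"n = {n:.2f}, round up to {ceil(n)}")
```

Halving the desired half-width multiplies the required sample size by four, because the half-width depends on the square root of n.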

Conclusions
Sample size can be calculated if the power, alpha level, difference between the groups and standard deviation are specified
For more complex settings than those presented here, statisticians have worked out the sample size calculations, but they still need estimates of the hypothesized difference and variability in the data

Study design

Reasons for differences between groups
Actual effect: when there is a difference between the two groups (ex. the treatment has an effect)
Chance
Bias
Confounding

Chance
When we run a study, we can only take a sample of the population. Our conclusions are based on the sample we have drawn. Just by chance, sometimes we can draw an extreme sample from the population. If we had taken a different sample, we may have drawn different conclusions. We call this sampling variability.

Note on variability
Even though your experiments are well controlled, not all subjects will behave exactly the same
– This is true for almost all experiments
– If all animals acted EXACTLY the same, we would only need one animal
Since one is not enough, we observe a group of mice
– We call this our sample
Based on our sample, we draw a conclusion regarding the entire population

Study design considerations
Null hypothesis
Outcome variable
Explanatory variable
Sources of variability
Experimental unit
Potential correlation
Analysis plan
Sample size

Example
We start with a single group (ex. genetically identical mice)
The group is split into 3 groups that are treated with 3 different interventions
An outcome is measured in each individual
Questions:
– What analysis should we do?
– What is the effect of starting from the same population?
– Do we need to account for repeated measures?
[Figure: the original group is split into Condition 1, Condition 2 and Condition 3.]

Generalizability
Assume that we have found a difference between our exposure and control group and we have shown that this result is not likely due to chance, bias or confounding.
What does this mean for the general population? Specifically, to which group can we apply our results?
– This is often based on how the sample was originally collected.

Example 2
We want to compare the expression of a marker in patients vs. controls
Full sample size is 288 samples
Can only run 24 samples (1 plate) per day
Questions:
– What types of analysis should we do?
– Can we combine across the plates?
– Could other confounders be important to collect?

[Figure: Plate 1: 10 patients, 14 controls, giving an estimate of the difference in this plate.]
We can test if there is a different effect in each plate by investigating the interaction

Example 3
We want to compare the expression of 6 markers
We measure the six markers in 5 mice
Questions:
– What types of analysis should we do?
– How many independent groups do we have?
– What is the null hypothesis?

Example 4
“In our experiments, we collect 3 measurements. If it is significant, we call it a day. If it is close to significant, we measure 1 more animal”
Question:
– Is this valid?
It is always more statistically valid if the number is specified BEFORE the experiment
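A simulation can show why the "measure one more animal" rule is a problem. This is a sketch under stated assumptions: the data come from a standard normal population with the null true, the test is a two-sided z-test with known standard deviation, and "close to significant" is taken to mean a p-value between 0.05 and 0.10, a cut-off chosen here only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(values, mu0=0.0, sigma=1.0):
    """Two-sided one-sample z-test p-value with known sigma."""
    z = (sum(values) / len(values) - mu0) / (sigma / sqrt(len(values)))
    return 2 * STD_NORMAL.cdf(-abs(z))

def false_positive_rate(n_sims=100_000, seed=1):
    """Fraction of null experiments declared significant under the 'one more animal' rule."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        data = [rng.gauss(0.0, 1.0) for _ in range(3)]   # the null is true in every experiment
        p = p_value(data)
        if 0.05 <= p < 0.10:                             # "close to significant": add one animal
            data.append(rng.gauss(0.0, 1.0))
            p = p_value(data)
        if p < 0.05:
            rejections += 1
    return rejections / n_sims

print(f"type I error with optional extra animal: {false_positive_rate():.3f}")
```

Because the rule gives near-misses a second chance but never takes back a significant result, the simulated type I error lands above the nominal 0.05.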

Spreadsheet formation
What to collect
– Everything that might be important for the analysis
  Plate
  Batch
  Technician
  All potential sources of variability
  All potential confounders
– Most accurate version of this you can
  If it is continuous, collect it as such. You can always dichotomize later

Spreadsheet formation
Easiest to move to a statistical package if
– One row per measurement
– One column for the outcome, each predictor and potential confounders
– No open space

Conclusions
Sample size for an experiment must be considered BEFORE collecting data
Can improve power by reducing standard deviation, increasing sample size or increasing difference between groups
Important to consider study design as you develop your analysis plan
