introduction to biostatistics for clinical researchers
DESCRIPTION
Introduction to Biostatistics for Clinical Researchers. University of Kansas Department of Biostatistics & University of Kansas Medical Center Department of Internal Medicine. Schedule. Friday, December 10 in 1023 Orr-Major Friday, December 17 in B018 School of Nursing - PowerPoint PPT PresentationTRANSCRIPT
Introduction to Biostatistics for Clinical Researchers
University of Kansas Department of Biostatistics
& University of Kansas Medical Center
Department of Internal Medicine
ScheduleFriday, December 10 in 1023 Orr-Major
Friday, December 17 in B018 School of Nursing
Possibility of a 5th lecture, TBDAll lectures will be held from 8:30a - 10:30a
Materials PowerPoint files can be downloaded from the Department of
Biostatistics website at http://biostatistics.kumc.edu
A link to the recorded lectures will be posted in the same location
An Introduction to Hypothesis Testing: The Paired t-Test
Topics Comparing two groups: the paired-data situation
Hypothesis testing: the Null and Alternative hypotheses
Relationships between confidence intervals and hypothesis testing when comparing means
P-values: definitions, calculations, and more
The Paired t-test: the Confidence Interval Component
Two Group Designs For continuous endpoint: Are the population means
different?
Subjects could be randomized to one of two treatments (randomized parallel-group design) Compare the mean responses from each treatment Also referred to as independent groups
Subjects could each be given both treatments with the ordering of treatments randomized (paired design) Compare the mean difference to zero (or some other
interesting value) Pre-post data Matched case-control
Example: Pre- versus Post- Data Why pair the observations?
Decrease variability in response Each subject acts as its own control (reduced sample
sizes) Good way to get preliminary data/estimates to develop
further research
Example: Pre- versus Post- Data Ten non-pregnant, pre-menopausal women 16-49 years old
who were beginning a regimen of oral contraceptive (OC) use had their blood pressures measured prior to starting OC use and three-months after consistent OC use
The goal of this small study was to see what, if any, changes in average blood pressure were associated with OC use in such women
The data shows the resulting pre- and post-OC use systolic BP measurements for the 10 women in the study
Blood Pressure and OC Use
Subject Before OC After OC Δ = After - Before
1 115 128 132 112 115 33 107 106 -14 119 128 95 115 122 76 138 145 77 126 132 68 105 109 49 104 102 -210 115 117 2
Average 115.6 120.4 4.8
Blood Pressure and OC Use The sample average of the differences is
Also note:
The sample standard deviation (s) of the differences is sdiff = 4.6 Standard deviation of differences follows the formula:
where each represents an individual difference and is the mean difference
4.8diffx
diff after beforex x x -
21
1
n
diff diffi
diff
x xs
n
-
-
diffxdiffx
Note on Paired Data Designs The BP information is essentially reduced from two samples
(prior to and after OC use) into one piece of information--difference in BP between the two samples Response is “within-subject”
This is standard protocol for comparing paired samples with a continuous outcome measure
The Confidence Interval Approach Suppose we want to draw a conclusion about a population
parameter: In a population of women who use OC, is the average
change in blood pressure (after - before) zero?
The CI approach allows us to create a range of plausible values for the average change (μΔ) in blood pressure using data from a single, imperfect, paired sample
The Confidence Interval Approach A 95% CI for μΔ in BP in the population of women taking OC
is
0.95,9
0.95,9 104.64.8 2.2610
1.5,8.1 mmHg
diff diff
diffdiff
x t SE xs
x t
Note The number 0 is NOT in the confidence interval (1.5-8.1)
This suggests there is a non-zero change in BP over time
The phrase “statistically significant” change is used to indicate a non-zero mean change
Note The BP change could be due to factors other than OC
Change in weather over pre- and post- period Changes in personal stress
A control group of comparable women who were not taking OC would strengthen this study This is an example of a pilot study-a small study done
just to generate some evidence of a possible association This can be followed up with a larger, more scientifically
rigorous study
The Paired t-test: the Hypothesis Testing Component
The Hypothesis Testing Approach Suppose we want to draw a conclusion about a population
parameter: In a population of women who use OC, is the average
change in blood pressure (after - before) zero?
The hypothesis testing approach allows us to choose between two competing possibilities for the average change (μΔ) in blood pressure using data from a single, imperfect, paired sample
Hypothesis Testing Two mutually exclusive, collectively exhaustive possibilities
for “truth” about mean change, μΔ
Null hypothesis: HO: μΔ = 0 (what we wish to ‘nullify’) Alternative hypothesis: HA : μΔ ≠ 0 (what we wish to
show evidence in favor of)
We use our data as ‘evidence’ in favor of or against the null hypothesis (and alternative hypothesis, as a result)
Hypothesis Testing Null: Typically represents the hypothesis that there is no
association or difference HO: μΔ = 0 There is no association between OC use
and blood pressure
Alternative: The very general complement to the null HA: μΔ ≠ 0 There is an association between OC use
and blood pressure
Hypothesis Testing Our result will allow us to either reject or fail to reject HO
We start by assuming HO is true, and ask: How likely, given HO is true, is the result we got from
our sample? In other words, what are the chances of obtaining the
sample data we actually observed (“evidence”) if the truth is that there is no association between blood pressure and OC use?
Hypothesis Testing HO, in combination with other information about our
population and the size of our sample, sets up (via the CLT) a theoretical probability distribution of sample means computed from all possible samples of size n = 10 where µ∆ = 0
µ∆ = 0
σ∆
Population distribution
of BP change
Hypothesis Testing HO, in combination with other information about our
population and the size of our sample, sets up (via the CLT) a theoretical probability distribution of sample means computed from all possible samples of size n = 10 where µ∆ = 0
µ∆ = 0
Sampling distribution
of the sample
mean diffSE x
Hypothesis Testing Theoretically, if the null hypothesis were true we would be
more likely to observe values of the sample mean “close” to zero:
µ∆ = 0
diffSE x
Hypothesis Testing Theoretically, if the null hypothesis were true it would be
unlikely that we should observe values of the sample mean “far” from zero:
µ∆ = 0
diffSE x
Hypothesis Testing We observed a sample mean of = 4.8 mmHg—is it far
enough from zero for us to conclude in favor of HA?
µ∆ = 0
In favor of HO
In favor of HA In favor of HA
diffSE x
diffx
Hypothesis Testing We need some measure of how probable the result from our
sample is given the null hypothesis
The sampling distribution of the sample mean allows us to evaluate how unusual our sample statistic is by computing a probability corresponding to the observed results—the p-value If p is small, it suggests that the probability of obtaining
the observed result from the hypothesized distribution is not likely to occur by chance.
In other words, either 1. The null hypothesis is actually true and, just by
chance, we got a sample that gave us an unlikely result; or
2. The null hypothesis is actually false, and we got a sample with evidence of such
Hypothesis Testing1. The null hypothesis is actually false, and we got a sample
with evidence of suchIf we are using a random sample, we can be assured that this is the case (95% confident, in fact)
Hypothesis Testing To compute a p-value, we need to find our value of on
the sampling distribution and figure out how “unusual” it is
Recall: = 4.8 mmHg
µ∆ = 0
diffx
diffx
diffSE x
Hypothesis Testing Problem: What is σ∆?
µ∆ = 0
diffSE xn
Hypothesis Testing Solution: the Student’s t distribution
µ∆ = 0
diffsSE xn
Hypothesis Testing Where is = 4.8 mmHg located on the sampling
distribution (t9)?
µ∆ = 0
1.45
diffx
4.8diffx
Hypothesis Testing The p-value is the probability of getting a sample result as
(or more) extreme that what we observed, given the null hypothesis is true:
µ∆ = 0
1.45
4.84.8-
Hypothesis Testing The p-value is the area under the curve corresponding to
values of the sample mean more extreme than 4.8 P(| | ≥ 4.8)
µ∆ = 0
1.45
diffx
4.84.8-
Hypothesis Testing Strictly for convenience, we standardize our distribution
We center it at zero by subtracting the mean We adjust the variability to correspond to s = 1 by
dividing every observation by
µ∆ = 0 +t-t
1
diffSE x
O 3.3diffxt s
n
-
Hypothesis Testing The p-value is the area under the curve corresponding to
values of the sample mean more extreme than 4.8 P(| | ≥ 4.8) = P(|t| ≥ 3.3) easily found in any t table
µ∆ = 0 +3.3-3.3
1
diffx
Hypothesis Testing Note: this t is called a test statistic (and is synonymous
with a z-score) It represents the distance of the observation from the
hypothesized mean in standard errors In this case, our mean (4.8) is 3.3 SE away from the
hypothesized mean (0) Based on this, what do you think the p-value will look
like? Is a result 3.3 SE above its mean unusual?
Hypothesis Testing t = 3.3 on the sampling distribution (t9)
The p-value The p-value is the probability of getting a sample result as
(or more) extreme that what we observed, given the null hypothesis is true
The p-value We can look this up in a t-table
. . . or we can let Excel or another statistical package do it for us =TDIST(3.3,9,2)
The p-value We can look this up in a t-table
. . . or we can let Excel or another statistical package do it for us =TDIST(3.3,9,2)
Interpreting the p-value P = 0.0092: if the true before OC/after OC blood pressure
difference is zero among all women taking OCs, then the chance of seeing a mean difference as extreme or more extreme than 4.8 in a sample of 10 women is 0.0092
We now need to use the p-value to make a decision—either reject or fail to reject HO
We need to decide if our sample result is unlikely enough to have occurred by chance if the null was true
Using the p-value to Make a Decision Establishing a cutoff
In general, to make a decision about what p-value constitutes “unusual” results, there needs to be a cutoff such that all p-values less than the cutoff result in rejection of the null
The standard (but arbitrary) cutoff is p = 0.05 This cutoff is referred to as the significance level of
the test and is usually represented by α For example, α = 0.05
Using the p-value to Make a Decision Establishing a cutoff
Frequently, the result of a hypothesis test with p < 0.05 is called statistically significant
At the α = 0.05 level, we have a statistically significant blood pressure difference in the BP/OC example
Example: BP/OC Statistical method
The changes in blood pressures after oral contraceptive use were calculated for 10 women
A paired t-test was used to determine if there was a statistically significant change in blood pressure, and a 95% confidence interval was calculated for the mean blood pressure change (after-before)
Result Blood pressure measurements increased on average 4.8
mmHg with standard deviation 4.6 mmHg The 95% confidence interval for the mean change was
1.5-8.1 mmHg The blood pressure measurements after OC use were
statistically significantly higher than before OC use(p = 0.009)
Example: BP/OC Discussion
A limitation of this study is that there was no comparison group of women who did not use oral contraceptives
We do not know if blood pressures may have risen without oral contraceptive usage
Example: Clinical Agreement Two different physicians assessed the number of palpable
lymph nodes in 65 randomly selected male sexual contacts of men with AIDS or AIDS-related conditions1
1Example based on data taken from Rosner, B. (2005). Fundamentals of Biostatistics (6th ed.), Duxbury Press
Doctor 1 Doctor 2 Difference7.91 5.16 -2.75
S 4.35 3.93 2.83x
95% Confidence Interval A 95% CI for difference in mean number of lymph nodes
(Doctor 2 compared to Doctor 1):
1.992.832.75 1.99
653.45, 2.05
diff diffx SE x
- -
Getting a p-value Hypotheses:
HO: µdiff = 0 HA : µdiff ≠ 0
1. Assume the null is true2. Compute the distance in SEs between and the
hypothesized value (zero)3. The sample result is 7.8 SEs below 0—is this unusual?
diffx
O 2.75 7.82.8365
diffxt s
n
- - -
Getting a p-value Sample result is 7.8 SEs below 0—is this unusual?
See where this result falls on the sampling distribution (t64)
The p-value corresponds to P(|t|> 7.8)—without looking it up we know p < 0.001
Example: Oat Bran and LDL Cholesterol Cereal and cholesterol: 14 males with high cholesterol
given oat bran cereal as part of diet for two weeks, and corn flakes cereal as part of diet for two weeks1
mmol/dL Corn Flakes
Oat Bran Difference
4.44 4.08 0.36S 1.0 1.1 0.40
1Example based on data taken from Pagano, M. (2000). Principles of Biostatistics (2nd ed.), Duxbury Press
diffx
95% Confidence Interval A 95% confidence interval for the difference in mean LDL
(corn flakes versus oat bran):
0.95,13
0.040.36 2.1614
0.13,0.6
diff diffx t SE x
Getting a p-value Hypotheses:
HO: µdiff = 0 HA : µdiff ≠ 0
1. Assume the null is true2. Compute the distance in SEs between and the
hypothesized value (zero)3. The sample result is 3.3 SEs above 0—is this unusual?
diffx
O 0.36 3.30.414
diffxt s
n
-
Getting a p-value Sample result is 3.3 SEs above 0—is this unusual?
See where this result falls on the sampling distribution (t13)
The p-value corresponds to P(|t|> 3.3) Using a table or software package, we can find p =
0.005
Note on Direction of Comparison Whether we chose to examine the difference (oat – corn) or
(corn – oat) makes no difference to our results—only in the appropriate interpretation of estimates (including confidence intervals) The sign of the mean will change The limits of the CI will reverse and signs will change
Summary Designate hypotheses
The alternative is usually what we are interested in supporting
The null is usually what we wish to nullify (“no association/no change”)
Collect data
Compute difference in outcome for each paired set of observations Compute , the sample mean of the paired differences Compute s, the sample standard deviation of the
differences Compute the 95% (or other %) CI for the true mean
difference
diffx
1 , 1diff n diffx t SE x
- -
Summary To get the p-value
Assume HO is true (sets up the sampling distribution) Measure the distance of the sample result from µO
Odiffxt s
n
-
Summary Compare the test statistic (distance) to the appropriate
distribution to get the p-value
Summary Paired t-test scenarios
Blood pressure/OC use example
Degree of clinical agreement (each pt received two assessments)
Diet example (each man received two different diets in random order)
Twin study
Matched case-control Suppose we wish to compare levels of a certain
biomarker in pts with versus without a disease
More about P-values
P-values P-values are probabilities
Have to be between 0 and 1
Small p-values mean that the sample results are unlikely when the null is true
The p-value is the probability of obtaining a result as extreme or more extreme than what was actually observed by chance alone, assuming the null is true
P-values The p-value is not the probability that the null hypothesis is
true
It alone imparts no information about scientific/substantive content in result of a study
From the previous example, the researchers found a statistically significant (p = 0.005) difference in average LDL cholesterol levels in men who had been on a diet including corn flakes versus the same men on a diet including oat bran cereal Which diet showed lower average LDL levels? How much was the difference? Does it mean anything
nutritionally?
P-values If the p-value is small, either a rare event occurred and
1. HO is false; or2. HO is true
Type I Error Claim HA is true when in fact HO is true The probability of making a Type I error is called the
significance level of a test (α)
Note on p and α If p < α the result is called statistically significant
This cutoff is the significance (or alpha) level of the test It is the probability of falsely rejecting HO
The idea is to keep the chances of making an error when HO is true low and only reject if the sample evidence is highly against HO
Note on p and α
TruthHO HA
DecisionReject HO
Type I Error (α)
Power (1-β)
Fail to Reject HO
Correct (1-α)
Type II Error β
One- or Two-sided? A two-sided p-value corresponds to results as or more
extreme than what was observed in either direction We know a test is two-sided by observing the alternative
hypothesis, HA
HA containing ≠ indicates we are interested in either an increase or decrease in (greater or smaller) value than the hypothesized value
A one-sided p-value corresponds to results as or more extreme than what was observed in a single direction of interest The direction of interest is stated explicitly in the
alternative hypothesis, HA
HA containing > indicates we are interested in an increase or greater value than the hypothesized value
HA containing < indicates we are interested in a decrease or smaller value than the hypothesized value
Null and Alternative Sampling Distributions
zα
Fail to reject H0Conclude no difference
Reject H0Conclude difference
α1 - α
β 1 - β
Assume H0 is True
zα
Fail to reject H0Conclude no difference
Reject H0Conclude difference
α1 - α
Assume H1 is True
zα
Fail to reject H0Conclude no difference
Reject H0Conclude difference
β 1 - β
One- or Two-sided? In some cases, a one-sided alternative may not make
scientific sense In the absence of pre-existing information for the
evaluation of the relationship between BP and OC, wouldn’t either result be interesting and useful (i.e., negative or positive association)?
In some cases, a one-sided alternative often makes scientific sense We are not interested if new treatment is worse than old,
just whether it’s statistically significantly better
However, for reasons already shown (and because of the sanctity of “.05”), one-sided p-values are viewed with suspicion
Connection: Hypothesis Testing and CIs The confidence interval gives plausible values for the
population parameter “Data, take me to the truth”
Hypothesis testing postulates two choices for the population parameter “Here are two mutually exclusive possibilities for the
truth—data help me choose one”
95% Confidence Interval If zero is not in the 95% confidence interval, then we would
reject HO: µ = 0 at the 5% level of significance (α = 0.05)
Why?
With a confidence interval, we start at the sample mean and go approximately two standard errors in either direction
95% Confidence Interval If zero is not in the 95% confidence interval, then we would
reject HO: µ = 0 at the 5% level of significance (α = 0.05)
Why?
With a confidence interval, we start at the sample mean and go approximately two standard errors in either direction
95% Confidence Interval If zero is not in the 95% CI, then this must mean is > ~2
SE away from zero (either above or below)
Hence, the distance (t) will be either > 2 or < 2, and the resulting p-value will be < 0.05
diffx
95% Confidence Interval and p-value In the BP/OC example, the 95% CI tells us that p < 0.05, but
it doesn’t tell us that it is p = 0.009
The confidence interval and p-value are complimentary
However, you can’t get the exact p-value from just looking at a confidence interval, and you can’t get a sense of the scientific/substantive significance of your study results by looking at a p-value
More on the p-value Statistical significance does not imply or prove causation
Example: in the BP/OC case, there could be other factors at play that could explain the change in blood pressure
A significant p-value is only ruling out random sampling (chance) as the explanation
We would need a randomized comparison group to better establish causality Self-selected would be okay, but not ideal
More on the p-value Statistical significance is not the same as scientific
significance
Hypothetical example: blood pressure and oral contraceptives Suppose: n = 100,000; = 0.03 mmHg; s = 4.6 mmHg P = 0.04
Big n can sometimes produce a small p-value, even in the absence of a relationship The magnitude of the effect is small (not scientifically
interesting) Noise
It is very important to always report a confidence interval 95% CI: 0.002-0.058 mmHg
diffx
More on the p-value Lack of statistical significance is not the same as lack
of scientific significance Must evaluate results in the context of the study and
sample size
Small n can sometimes produce a non-significant result even though the magnitude of the association at the population level is real and important—your study just may not be big enough to detect it
Underpowered, small studies makes not rejecting hard to interpret
Sometimes small studies are designed without power in mind just to generate preliminary data
Comparing Means among Two (or More) Independent Populations
Topics CIs for mean difference between two independent
populations
Two-sample t-test
Non-parametric alternative
Comparing means of more than two independent populations
Comparing Two Independent Groups “A Low Carbohydrate as Compared with a Low Fat Diet in
Severe Obesity”1
132 severely obese subjects randomized to one of two diet groups
Subjects followed for six months
At the end of the study period:“Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI for the difference in weight loss between groups, -1.6 to -6.2 kg; p < 0.01)”
1Samaha, F., et al. A low-carbohydrate as compared with a low-fat diet in severe obesity, NEJM 348:21.
Comparing Two Independent Groups Is weight change associated with diet type?
Diet GroupLow-Carb Low-Fat
Number of subjects (n) 64 68Mean weight change (kg)
(post - pre)-5.7 -1.8
Standard deviation of weight changes (kg)
8.6 3.9
Diet Type and Weight Change 95% CIs for weight change by diet group:
8.6Carb: 5.7 1.96 7.807 , 3.59364
3.9Fat: 1.8 1.96 2.728 , 0.87368
kg kg
kg kg
- - -
- - -
Comparing Two Independent Groups In statistical terms, is there a non-zero difference in the
average weight change for the subjects on the low-fat diet as compared to subjects on the low-carbohydrate diet? 95% CIs for each diet group mean weight change do not
overlap, but how do you quantify the difference?
The comparison of interest is not “paired”-there are different subjects in each diet group
For each subject, a change in weight (post-pre) was computed However, the authors compared the changes in weight
between two independent groups
Comparing Two Independent Groups How do we calculate
CI for the difference? P-value to determine if the difference in two groups is
significant?
Since we have large samples (both greater than 60) we know the sampling distributions of the sample means in both groups are approximately normal
It turns out the difference of quantities that are approximately normally distributed are also normally distributed
Sampling Distribution of Difference in Sample Means The sampling distribution of the difference of two sample
means, each based on large samples, approximates a normal distribution
This sampling distribution is centered at the true mean difference, μ1 - μ2
Simulated Sampling Distribution The simulated sampling distribution of sample mean weight
change for the low-carbohydrate diet group is shown:
Simulated Sampling Distribution The simulated sampling distribution of sample mean weight
change for the low-fat diet group is shown:
Simulated Sampling Distribution The simulated sampling distribution of the difference in
sample means for the two groups is shown:
Simulated Sampling Distribution Side-by-side boxplots
95% CI for the Difference in Means Our most general formula is:
The best estimate of a population mean difference based on sample means:
Here, may represent the sample mean weight loss for the 64 subjects on the low-carb diet, and the mean weight loss for the 68 subjects on the low fat diet
best estimate from sample best estimate from samplemultiplier SE
1 2x x-
1x2x
95% CI for the Difference in Means So, ; this makes the formula for
the 95% CI for μ1 - μ2 is:
where is the standard deviation of the sampling distribution (i.e., the standard error of the difference of two sample means)
1 2 5.7 1.8 3.9x x- - - - -
1 23.9 1.96SE x x- -
1 2SE x x-
Two Independent Groups The standard error of the difference for two independent
samples is calculated differently than that for the paired design With the paired design, we reduced data on two samples
to one set of differences
Statisticians have developed formulas for the standard error of the difference-they depend on sample sizes in both groups and standard deviations in both groups
Aside: is greater than either or -any ideas why?
1 2SE x x- 2SE x 1SE x
Principle Variation from independent sources can be added
We don’t know σ1 or σ2 so we estimate them using s1 and s2 to get an estimated standard error:
2 21 2
1 21 2
SE x xn n
-
2 21 2
1 21 2
s sSE x xn n
-
Comparing Two Independent Groups Recall from the weight change/diet type study:
Diet GroupLow-Carb Low-Fat
Number of subjects (n) 64 68Mean weight change (kg)
(post - pre)-5.7 -1.8
Standard deviation of weight changes (kg)
8.6 3.9
2 2 2 21 2
1 21 2
8.6 3.9 1.1764 68s sSE x xn n
-
95% CI for Difference in Means In this example, the approximate 95% confidence interval
for the true mean difference in weight between the low-carb and low-fat diet groups is:
3.9 1.96 1.17 6.2 , 1.6kg kg- - -
From Article “Subjects on the low-carbohydrate diet lost more weight
than those on a low-fat diet (95% CI: -1.6 to 6.2 kg; p < 0.01)”
Those on the low-carb diet lost more on average by 3.9 kg-after accounting for sampling variability this excess average loss over the low-fat diet group could be as small as 1.6 kg or as large as 6.2 kg
This CI does not include zero, suggesting a real population level association between type of diet and weight loss
Two-sample t-test: Getting a p-value
Hypothesis Test to Compare Two Independent Groups Two-sample t-test
Is the (mean) weight change equal in the two diet groups? HO: μ1 = μ2 HA: μ1 ≠ μ2
In other words, is the expected difference in weight change zero? HO: μ1 - μ2 = 0 HA: μ1 - μ2 ≠ 0
Hypothesis Test to Compare Two Independent Groups Recall, the general “recipe” for hypothesis testing is:
1. Assume HO is true2. Measure the distance of the sample result from the
hypothesized result, μO (in most cases it’s 0)3. Compare the test statistic (distance) to the appropriate
distribution to get the p-value
1 2 O 1 2
2 21 21 21 2
observed difference null differenceobserved difference
tSEx x x xt
s sSE x xn n
-
- - -
-
Diet Type and Weight Loss Study Recall:
For this study:
This study result was 3.33 standard errors below the hypothesized mean of 0-is this result unusual?
1 2
1 2
3.9
1.17
x x
SE x x
- -
-
3.9 3.331.17t - -
How are p-values calculated? The p-value is the probability of getting a result as extreme
or more extreme than what you observed if the null hypothesis were true
It comes from the sampling distribution of the difference in two sample means
What does this sampling distribution look like? If both groups are large, it is approximately normal It is centered at the true difference
Under the null, the true difference is 0
Diet/Weight Loss To compute the p-value, we would need to compute the
probability of being 3.3 or more SEs away from 0
Diet/Weight Loss In Excel, use the “TTEST” function to test whether two
independent samples are significantly different
If you’ve calculated the test statistic, t (-3.33, in this example), you can use the “TDIST” function to compute the p-value
For the diet example, p = 0.0013
Summary: Weight Loss Example Statistical Methods
“We randomly assigned 132 severely obese patients . . . To a carbohydrate-restricted (low-carbohydrate) diet or a calorie- and fat-restricted diet”
“For comparison of continuous variables between the two groups, we calculated the change from baseline to six months in each subject and compared the mean changes in the two diet groups using an unpaired (two-sample) t-test”
Result “Subjects on the low-carbohydrate diet lost more weight
than those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)”
Sampling Distribution Detail What exactly is the sampling distribution of the difference
in sample means? A Student’s t distribution is used with n1 - n2 - 2 degrees
of freedom (total sample size minus two)
Two-Sample t-test In a randomized design, 23 patients with hyperlipidemia
were randomized to either treatment A or treatment B for 12 weeks 12 to A 11 to B
LDL cholesterol levels (mmol/L) measured on each subject at baseline and 12 weeks
The 12-week change in LDL cholesterol was computed for each subject Treatment Group
A BN 12 11Mean LDL change -1.41 -0.32
Standard deviation of LDL changes
0.55 0.65
Two-Sample t-test Is there a difference in LDL change between the two
treatment groups?
Methods of inference CI for the difference in mean LDL cholesterol change
between the two groups Statistical hypothesis test
95% CI for Difference in Means
1 2
1 2
1 2
1 2 1 , 2 1 2
2 2
1 , 2
1 , 2
0.55 0.651.41 0.32 12 111.09 0.25
n n
n n
n n
x x t SE x x
t
t
- -
- -
- -
- -
- - -
-
Treatment GroupA B
N 12 11Mean LDL change -1.41 -0.32
Standard deviation of LDL changes
0.55 0.65
95% CI for Difference in Means How many standard errors to add and subtract (i.e., what is
the correct multiplier)?
The number we need comes from a t with 12 + 11 - 2 = 21 degrees of freedom From t table or excel, this value is 2.08
The 95% CI for true mean difference in change in LDL cholesterol, drug A to drug B is:
1.09 2.08 0.251.61, 0.57
-
- -
Hypothesis Test to Compare Two Independent Groups Two-sample (unpaired) t-test: getting a p-value
Is the change in LDL cholesterol the same in the two treatment groups? HO: μ1 = μ2 HO: μ1 - μ2 = 0 HA: μ1 ≠ μ2 HA: μ1 - μ2 ≠ 0
Hypothesis Test to Compare Two Independent Groups Recall the general “recipe” for hypothesis testing:
1. Assume HO is true2. Measure the distance of the sample result from the
hypothesized result (here, it’s 0)3. Compare the test statistic (distance) to the appropriate
distribution to get the p-value
1 2 O 1 2
2 21 21 21 2
observed difference null differenceobserved difference
tSEx x x xt
s sSE x xn n
-
- - -
-
Diet Type and Weight Loss Study In the diet types and weight loss study, recall:
In this study:
This study result was 4.4 standard errors below the null mean of 0
1 2
1 2
1.09
0.25
x x
SE x x
- -
-
1.09 4.40.25t - -
How are p-values Calculated? Is a result 4.4 standard errors below 0 unusual?
It depends on what kind of distribution we are dealing with
The p-value is the probability of getting a result as extreme or more extreme than what was observed (-4.4) by chance, if the null hypothesis were true
The p-value comes from the sampling distribution of the difference in two sample means
What is the sampling distribution of the difference in sample means? t12 + 11 - 2 = 21
Hyperlipidemia Example To compute a p-value, we need to compute the probability
of being 4.4 or more SE away from 0 on the t with 21 degrees of freedom
P = 0.0003
Summary: Weight Loss Example Statistical Methods
Twenty-three patients with hyperlipidemia were randomly assigned to one of two treatment groups: A or B
12 patients were assigned to receive A 11 patients were assigned to receive B Baseline LDL cholesterol measurements were taken on
each subject and LDL was again measured after 12 weeks of treatment
The change in LDL cholesterol was computed for each subject
The mean LDL changes in the two treatment groups were compared using an unpaired t-test and a 95% confidence interval was constructed for the difference in mean LDL changes
Summary: Weight Loss Example Result
Patients on A showed a decrease in LDL cholesterol of 1.41 mmol/L and subjects on treatment B showed a decrease of 0.32 mmol/L (a difference of 1.09 mmol/L, 95% CI: 0.57 to 1.61 mmol/L)
The difference in LDL changes was statistically significant (p < 0.001)
FYI: Equal Variances Assumption The “traditional” t-test assumes equal variances in the two
groups This can be formally tested using another hypothesis
test But why not just compare observed values of s1 to s2?
There is a slight modification to allow for unequal variances-this modification adjusts the degrees of freedom for the test, using slightly different SE computation
If you want to be truly ‘safe’, it is more conservative to use the test that allows for unequal variances
Makes little to no difference in large samples
FYI: Equal Variances Assumption If underlying population level standard deviations are equal,
both approaches give valid confidence intervals, but intervals assuming unequal standard deviations are slightly wider (p-values slightly larger)
If underlying population level standard deviations are unequal, the approach assuming equal variances does not give valid confidence intervals and can severely under-cover the goal of 95%
Non-Parametric Analogue to the Two-Sample t
Alternative to the Two Sample t-test “Non-parametric” refers to a class of tests that do not
assume anything about the distribution of the data
Nonparametric tests for comparing two groups Mann-Whitney Rank-Sum test (Wilcoxon Rank Sum Test) Also called Wilcoxon-Mann-Whitney Test
Attempts to answer: “Are the two populations distributions different?”
Advantages: does not assume populations being compared are normally distributed, uses only ranks, and is not sensitive to outliers
Alternative to the Two Sample t-test Disadvantages:
often less sensitive (powerful) for finding true differences because they throw away information (by using only ranks rather than the raw data)
need the full data set, not just summary statistics results do not include any CI quantifying range of
possibility for true difference between populations
Health Education Study Evaluate an intervention to educate high school students
about health and lifestyle over a two-month period
10 students randomized to intervention or control group
X = post-test score - pre-test score Compare between the two groups
• Only five individuals in each sample• We want to compare the control and intervention to assess
whether the ‘improvement’ in scores are different, taking random sampling error into account
• With such a small sample size, we need to be sure score improvements are normally distributed if we want to use the t test (BIG assumption)
• Possible approach: Wilcoxon-Mann-Whitney test
Health Education Study
Intervention
5 0 7 2 19
Control 6 -5 -6 1 4
Health Education Study Step 1: rank the pooled data, ignoring groups
Step 2: reattach group status
Step 3: find the average rank in each of the two groups
Intervention
5 0 7 2 19
Control 6 -5 -6 1 4
Intervention
7 3 9 5 10
Control 8 2 1 4 6
3 5 7 9 10 6.851 2 4 6 8 4.25
Health Education Study Statisticians have developed formulas and tables to
determine the probability of observing such an extreme discrepancy in ranks (6.8 versus 4.2) by chance alone (p)
The p-value here is 0.17 The interpretation is that the Mann-Whitney test did not
show any significant difference in test score ‘improvement’ between the intervention and control group (p = 0.17)
The two-sample t test would give a different answer (p = 0.14)
Different statistical methods give different p-values
If the largest observation was changed, the MW p would not change but the t p-value would
Notes The t or the nonparametric test?
Statisticians will not always agree, but there are some guidelines
Use the nonparametric test if the sample size is small and you have no reason to believe data is ‘well-behaved’ (normally distributed)
Only ranks are available
Summary: Educational Intervention Example Statistical methods
10 high school students were randomized to either receive a two-month health and lifestyle education program or no program
Each student was administered a test regarding health and lifestyle issues prior to randomization and after the two-month period
Differences in the two test scores were computed for each student
Mean and median test score changes were computed for each of the two study groups
A Mann-Whitney rank sum test was used to determine if there was a statistically significant difference in test score change between the intervention and control groups at the end of the two-month study period
Summary: Educational Intervention Example Results
Participants randomized to the educational intervention scored a median five points higher on the test given at the end of the two-month study period, as compared to the test administered prior to the intervention
Participants randomized to receive no educational intervention scored a median one point higher on the test given at the end of the two-month study period
The difference in test score improvements between the intervention and control groups was not statistically significant (p = 0.17)
Next Lecture Friday, December 17 in B018 SON from 8:30a - 10:30a
Topics include―ANOVA―Linear Regression―Chi-square test―Survival Analysis―Design of Experiments
References and CitationsLectures modified from notes provided by John McGready and Johns Hopkins Bloomberg School of Public Health accessible from the World Wide Web: http://ocw.jhsph.edu/courses/introbiostats/schedule.cfm