Introduction to Biostatistics for Clinical Researchers

University of Kansas Department of Biostatistics

& University of Kansas Medical Center

Department of Internal Medicine

Schedule Friday, December 10 in 1023 Orr-Major

Friday, December 17 in B018 School of Nursing

Possibility of a 5th lecture, TBD

All lectures will be held from 8:30a - 10:30a

Materials PowerPoint files can be downloaded from the Department of

Biostatistics website at http://biostatistics.kumc.edu

A link to the recorded lectures will be posted in the same location

An Introduction to Hypothesis Testing: The Paired t-Test

Topics Comparing two groups: the paired-data situation

Hypothesis testing: the Null and Alternative hypotheses

Relationships between confidence intervals and hypothesis testing when comparing means

P-values: definitions, calculations, and more

The Paired t-test: the Confidence Interval Component

Two Group Designs

For a continuous endpoint: are the population means different?

Subjects could be randomized to one of two treatments (randomized parallel-group design)

Compare the mean responses from each treatment

Also referred to as independent groups

Subjects could each be given both treatments, with the ordering of treatments randomized (paired design)

Compare the mean difference to zero (or some other interesting value)

Pre-post data

Matched case-control

Example: Pre- versus Post- Data

Why pair the observations?

Decrease variability in response

Each subject acts as its own control (reduced sample sizes)

Good way to get preliminary data/estimates to develop further research

Example: Pre- versus Post- Data

Ten non-pregnant, pre-menopausal women 16-49 years old who were beginning a regimen of oral contraceptive (OC) use had their blood pressures measured prior to starting OC use and three months after consistent OC use

The goal of this small study was to see what, if any, changes in average blood pressure were associated with OC use in such women

The data show the resulting pre- and post-OC use systolic BP measurements for the 10 women in the study

Blood Pressure and OC Use

Subject   Before OC   After OC   Δ = After - Before
1         115         128        13
2         112         115        3
3         107         106        -1
4         119         128        9
5         115         122        7
6         138         145        7
7         126         132        6
8         105         109        4
9         104         102        -2
10        115         117        2
Average   115.6       120.4      4.8

Blood Pressure and OC Use

The sample average of the differences is $\bar{x}_{diff} = 4.8$

Also note: $\bar{x}_{diff} = \bar{x}_{after} - \bar{x}_{before}$

The sample standard deviation (s) of the differences is $s_{diff} = 4.6$

Standard deviation of differences follows the formula:

$$s_{diff} = \sqrt{\frac{\sum_{i=1}^{n} (x_{diff,i} - \bar{x}_{diff})^2}{n - 1}}$$

where each $x_{diff,i}$ represents an individual difference and $\bar{x}_{diff}$ is the mean difference
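As a quick check of the summary numbers, the sketch below recomputes the paired differences, their mean, and their sample standard deviation from the ten before/after values in the table. It uses only Python's standard library; the variable names are illustrative and not part of the original slides.

```python
import math
import statistics

# Systolic BP (mmHg) for the 10 women, copied from the table
before = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
after = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

# One difference per subject (after - before)
diffs = [a - b for a, b in zip(after, before)]

mean_diff = statistics.mean(diffs)   # sample mean of the differences
sd_diff = statistics.stdev(diffs)    # sample SD (n - 1 in the denominator)

# The mean of the differences equals the difference of the two means
assert math.isclose(mean_diff, statistics.mean(after) - statistics.mean(before))

print(f"mean difference   = {mean_diff:.1f} mmHg")
print(f"SD of differences = {sd_diff:.2f} mmHg")
```

The unrounded standard deviation is slightly below the 4.6 mmHg shown on the slide, which reports it to one decimal place.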

Note on Paired Data Designs

The BP information is essentially reduced from two samples (prior to and after OC use) into one piece of information: the difference in BP between the two samples

Response is “within-subject”

This is standard protocol for comparing paired samples with a continuous outcome measure

The Confidence Interval Approach Suppose we want to draw a conclusion about a population

parameter: In a population of women who use OC, is the average

change in blood pressure (after - before) zero?

The CI approach allows us to create a range of plausible values for the average change (μΔ) in blood pressure using data from a single, imperfect, paired sample

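The Confidence Interval Approach

A 95% CI for μΔ in BP in the population of women taking OC is

$$\bar{x}_{diff} \pm t_{0.95,9} \times SE(\bar{x}_{diff}) = \bar{x}_{diff} \pm t_{0.95,9} \times \frac{s_{diff}}{\sqrt{n}}$$

$$= 4.8 \pm 2.26 \times \frac{4.6}{\sqrt{10}} = (1.5, 8.1) \text{ mmHg}$$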
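The interval arithmetic can be reproduced in a few lines. This is a minimal sketch that takes the rounded summary values and the t multiplier (2.26 for 9 degrees of freedom) directly from the slide rather than looking the multiplier up.

```python
import math

n = 10
mean_diff = 4.8   # mmHg, sample mean of the differences (from the slide)
sd_diff = 4.6     # mmHg, sample SD of the differences (from the slide)
t_mult = 2.26     # t multiplier for 95% confidence with 9 df (from the slide)

se = sd_diff / math.sqrt(n)          # standard error of the mean difference
lower = mean_diff - t_mult * se
upper = mean_diff + t_mult * se

print(f"SE = {se:.2f} mmHg, 95% CI = ({lower:.1f}, {upper:.1f}) mmHg")
```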

Note The number 0 is NOT in the confidence interval (1.5-8.1)

This suggests there is a non-zero change in BP over time

The phrase “statistically significant” change is used to indicate a non-zero mean change

Note The BP change could be due to factors other than OC

Change in weather over pre- and post- period Changes in personal stress

A control group of comparable women who were not taking OC would strengthen this study

This is an example of a pilot study: a small study done just to generate some evidence of a possible association

This can be followed up with a larger, more scientifically rigorous study

The Paired t-test: the Hypothesis Testing Component

The Hypothesis Testing Approach Suppose we want to draw a conclusion about a population

parameter: In a population of women who use OC, is the average

change in blood pressure (after - before) zero?

The hypothesis testing approach allows us to choose between two competing possibilities for the average change (μΔ) in blood pressure using data from a single, imperfect, paired sample

Hypothesis Testing Two mutually exclusive, collectively exhaustive possibilities

for “truth” about mean change, μΔ

Null hypothesis: HO: μΔ = 0 (what we wish to ‘nullify’)

Alternative hypothesis: HA: μΔ ≠ 0 (what we wish to show evidence in favor of)

We use our data as ‘evidence’ in favor of or against the null hypothesis (and alternative hypothesis, as a result)

Hypothesis Testing

Null: Typically represents the hypothesis that there is no association or difference

HO: μΔ = 0. There is no association between OC use and blood pressure

Alternative: The very general complement to the null

HA: μΔ ≠ 0. There is an association between OC use and blood pressure

Hypothesis Testing Our result will allow us to either reject or fail to reject HO

We start by assuming HO is true, and ask: How likely, given HO is true, is the result we got from

our sample? In other words, what are the chances of obtaining the

sample data we actually observed (“evidence”) if the truth is that there is no association between blood pressure and OC use?

Hypothesis Testing HO, in combination with other information about our

population and the size of our sample, sets up (via the CLT) a theoretical probability distribution of sample means computed from all possible samples of size n = 10 where µ∆ = 0

[Figure: population distribution of BP change, centered at µ∆ = 0 with standard deviation σ∆]

[Figure: sampling distribution of the sample mean, centered at µ∆ = 0 with standard error $SE(\bar{x}_{diff})$]
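The idea of a sampling distribution under HO can be made concrete with a small simulation. The sketch below is illustrative only: it assumes the population of BP changes is normal with mean 0 and standard deviation 4.6 mmHg (the sample value, used here as a stand-in because σ∆ is unknown), draws many samples of size 10, and compares the spread of the sample means with σ∆/√n.

```python
import math
import random
import statistics

random.seed(0)        # fixed seed so repeated runs give the same numbers
n = 10                # sample size, as in the BP/OC study
sigma = 4.6           # ASSUMED population SD of BP change (mmHg)
n_studies = 5000      # number of simulated studies under the null

sample_means = []
for _ in range(n_studies):
    # One simulated study: n changes drawn from a population with mean 0
    changes = [random.gauss(0.0, sigma) for _ in range(n)]
    sample_means.append(statistics.mean(changes))

center = statistics.mean(sample_means)    # should be close to 0
spread = statistics.stdev(sample_means)   # should be close to sigma / sqrt(n)
theory_se = sigma / math.sqrt(n)

print(f"mean of sample means = {center:.3f}")
print(f"SD of sample means   = {spread:.3f} (theory: {theory_se:.3f})")
```

Under these assumptions the simulated sample means cluster around zero, and a value as far out as 4.8 mmHg is rare.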

Hypothesis Testing Theoretically, if the null hypothesis were true we would be

more likely to observe values of the sample mean “close” to zero:

[Figure: sampling distribution of the sample mean, centered at µ∆ = 0 with standard error $SE(\bar{x}_{diff})$]

Hypothesis Testing Theoretically, if the null hypothesis were true it would be

unlikely that we should observe values of the sample mean “far” from zero:

[Figure: sampling distribution of the sample mean, centered at µ∆ = 0 with standard error $SE(\bar{x}_{diff})$]

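Hypothesis Testing

We observed a sample mean of $\bar{x}_{diff}$ = 4.8 mmHg. Is it far enough from zero for us to conclude in favor of HA?

[Figure: sampling distribution centered at µ∆ = 0 with standard error $SE(\bar{x}_{diff})$; the region near zero is labeled “In favor of HO”, both tails are labeled “In favor of HA”, and the observed $\bar{x}_{diff}$ is marked]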

Hypothesis Testing We need some measure of how probable the result from our

sample is given the null hypothesis

The sampling distribution of the sample mean allows us to evaluate how unusual our sample statistic is by computing a probability corresponding to the observed results: the p-value

If p is small, a result like the one observed would be unlikely to arise by chance from the hypothesized (null) distribution

In other words, either

1. The null hypothesis is actually true and, just by chance, we got a sample that gave us an unlikely result; or

2. The null hypothesis is actually false, and we got a sample with evidence of such

Hypothesis Testing

When p is small we conclude the second: the null hypothesis is actually false, and we got a sample with evidence of such

With a random sample and a cutoff of p < 0.05, this decision rule wrongly rejects a true null hypothesis only 5% of the time

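Hypothesis Testing

To compute a p-value, we need to find our value of $\bar{x}_{diff}$ on the sampling distribution and figure out how “unusual” it is

Recall: $\bar{x}_{diff}$ = 4.8 mmHg

[Figure: sampling distribution of the sample mean, centered at µ∆ = 0 with standard error $SE(\bar{x}_{diff})$; the observed $\bar{x}_{diff}$ is marked]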

Hypothesis Testing

Problem: What is σ∆?

$$SE(\bar{x}_{diff}) = \frac{\sigma_\Delta}{\sqrt{n}}$$

Hypothesis Testing

Solution: the Student’s t distribution

$$SE(\bar{x}_{diff}) = \frac{s_{diff}}{\sqrt{n}}$$

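Hypothesis Testing

Where is $\bar{x}_{diff}$ = 4.8 mmHg located on the sampling distribution (t9)?

[Figure: t9 sampling distribution centered at µ∆ = 0 with standard error 1.45; the observed $\bar{x}_{diff}$ = 4.8 is marked]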

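Hypothesis Testing

The p-value is the probability of getting a sample result as extreme as (or more extreme than) what we observed, given the null hypothesis is true:

[Figure: sampling distribution centered at µ∆ = 0 with standard error 1.45; the tails beyond -4.8 and 4.8 are marked]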

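Hypothesis Testing

The p-value is the area under the curve corresponding to values of the sample mean more extreme than 4.8

$P(|\bar{x}_{diff}| \ge 4.8)$

[Figure: sampling distribution centered at µ∆ = 0 with standard error 1.45; the tail areas beyond -4.8 and 4.8 are shaded]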

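Hypothesis Testing

Strictly for convenience, we standardize our distribution

We center it at zero by subtracting the mean

We adjust the variability to correspond to s = 1 by dividing every observation by $SE(\bar{x}_{diff})$

$$t = \frac{\bar{x}_{diff} - \mu_O}{s_{diff} / \sqrt{n}} = 3.3$$

[Figure: standardized sampling distribution centered at µ∆ = 0 with spread 1; -t and +t are marked]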

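Hypothesis Testing

The p-value is the area under the curve corresponding to values of the sample mean more extreme than 4.8

$P(|\bar{x}_{diff}| \ge 4.8) = P(|t| \ge 3.3)$, easily found in any t table

[Figure: standardized sampling distribution centered at µ∆ = 0 with spread 1; the tail areas beyond -3.3 and +3.3 are shaded]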

Hypothesis Testing

Note: this t is called a test statistic (and is analogous to a z-score)

It represents the distance of the observation from the hypothesized mean in standard errors

In this case, our mean (4.8) is 3.3 SE away from the hypothesized mean (0)

Based on this, what do you think the p-value will look like?

Is a result 3.3 SE above its mean unusual?

Hypothesis Testing t = 3.3 on the sampling distribution (t9)

The p-value

The p-value is the probability of getting a sample result as extreme as (or more extreme than) what we observed, given the null hypothesis is true

The p-value We can look this up in a t-table

. . . or we can let Excel or another statistical package do it for us =TDIST(3.3,9,2)
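For readers working outside Excel, the same two-sided tail area can be computed in Python. The sketch below avoids third-party packages by integrating the Student's t density numerically with Simpson's rule; the helper names `t_density` and `two_sided_p` are made up for this illustration, and in practice a statistics library would normally be used instead.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=20000):
    """Two-sided p-value: area outside (-|t|, +|t|), via Simpson's rule."""
    a = abs(t_stat)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_a = total * h / 3         # area under the curve between 0 and |t|
    return 1 - 2 * area_0_to_a          # the density is symmetric about 0

p = two_sided_p(3.3, 9)                 # t = 3.3 with 9 degrees of freedom
print(f"two-sided p = {p:.4f}")
```

The arguments mirror the Excel call: the test statistic, the degrees of freedom, and a two-tailed area.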

Interpreting the p-value P = 0.0092: if the true before OC/after OC blood pressure

difference is zero among all women taking OCs, then the chance of seeing a mean difference as extreme or more extreme than 4.8 in a sample of 10 women is 0.0092

We now need to use the p-value to make a decision—either reject or fail to reject HO

We need to decide if our sample result is unlikely enough to have occurred by chance if the null were true

Using the p-value to Make a Decision Establishing a cutoff

In general, to make a decision about what p-value constitutes “unusual” results, there needs to be a cutoff such that all p-values less than the cutoff result in rejection of the null

The standard (but arbitrary) cutoff is p = 0.05

This cutoff is referred to as the significance level of the test and is usually represented by α

For example, α = 0.05

Using the p-value to Make a Decision Establishing a cutoff

Frequently, the result of a hypothesis test with p < 0.05 is called statistically significant

At the α = 0.05 level, we have a statistically significant blood pressure difference in the BP/OC example

Example: BP/OC Statistical method

The changes in blood pressures after oral contraceptive use were calculated for 10 women

A paired t-test was used to determine if there was a statistically significant change in blood pressure, and a 95% confidence interval was calculated for the mean blood pressure change (after-before)

Result

Blood pressure measurements increased on average 4.8 mmHg with standard deviation 4.6 mmHg

The 95% confidence interval for the mean change was 1.5-8.1 mmHg

The blood pressure measurements after OC use were statistically significantly higher than before OC use (p = 0.009)

Example: BP/OC Discussion

A limitation of this study is that there was no comparison group of women who did not use oral contraceptives

We do not know if blood pressures may have risen without oral contraceptive usage

Example: Clinical Agreement

Two different physicians assessed the number of palpable lymph nodes in 65 randomly selected male sexual contacts of men with AIDS or AIDS-related conditions¹

¹Example based on data taken from Rosner, B. (2005). Fundamentals of Biostatistics (6th ed.), Duxbury Press

            Doctor 1   Doctor 2   Difference
Mean (x̄)    7.91       5.16       -2.75
SD (s)      4.35       3.93       2.83

95% Confidence Interval

A 95% CI for difference in mean number of lymph nodes (Doctor 2 compared to Doctor 1):

$$\bar{x}_{diff} \pm 1.99 \times SE(\bar{x}_{diff}) = -2.75 \pm 1.99 \times \frac{2.83}{\sqrt{65}} = (-3.45, -2.05)$$

Getting a p-value

Hypotheses: HO: µdiff = 0, HA: µdiff ≠ 0

1. Assume the null is true

2. Compute the distance in SEs between $\bar{x}_{diff}$ and the hypothesized value (zero)

$$t = \frac{\bar{x}_{diff} - \mu_O}{s_{diff} / \sqrt{n}} = \frac{-2.75}{2.83 / \sqrt{65}} = -7.8$$

3. The sample result is 7.8 SEs below 0. Is this unusual?

Getting a p-value Sample result is 7.8 SEs below 0—is this unusual?

See where this result falls on the sampling distribution (t64)

The p-value corresponds to P(|t|> 7.8)—without looking it up we know p < 0.001
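The clinical agreement numbers follow the same paired recipe. A minimal sketch, using the summary statistics from the table and the multiplier 1.99 quoted on the slide:

```python
import math

n = 65
mean_diff = -2.75   # mean difference, Doctor 2 minus Doctor 1
sd_diff = 2.83      # SD of the paired differences
t_mult = 1.99       # multiplier used on the slide for 64 df

se = sd_diff / math.sqrt(n)
ci = (mean_diff - t_mult * se, mean_diff + t_mult * se)
t_stat = (mean_diff - 0) / se   # distance from the null value 0, in SEs

print(f"SE = {se:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"t = {t_stat:.1f} on {n - 1} df")
```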

Example: Oat Bran and LDL Cholesterol

Cereal and cholesterol: 14 males with high cholesterol given oat bran cereal as part of diet for two weeks, and corn flakes cereal as part of diet for two weeks¹

mmol/dL     Corn Flakes   Oat Bran   Difference
Mean (x̄)    4.44          4.08       0.36
SD (s)      1.0           1.1        0.40

¹Example based on data taken from Pagano, M. (2000). Principles of Biostatistics (2nd ed.), Duxbury Press

95% Confidence Interval

A 95% confidence interval for the difference in mean LDL (corn flakes versus oat bran):

$$\bar{x}_{diff} \pm t_{0.95,13} \times SE(\bar{x}_{diff}) = 0.36 \pm 2.16 \times \frac{0.40}{\sqrt{14}} = (0.13, 0.59)$$

Getting a p-value

Hypotheses: HO: µdiff = 0, HA: µdiff ≠ 0

1. Assume the null is true

2. Compute the distance in SEs between $\bar{x}_{diff}$ and the hypothesized value (zero)

$$t = \frac{\bar{x}_{diff} - \mu_O}{s_{diff} / \sqrt{n}} = \frac{0.36}{0.40 / \sqrt{14}} \approx 3.3$$

3. The sample result is 3.3 SEs above 0. Is this unusual?

Getting a p-value Sample result is 3.3 SEs above 0—is this unusual?

See where this result falls on the sampling distribution (t13)

The p-value corresponds to P(|t| > 3.3)

Using a table or software package, we can find p = 0.005

Note on Direction of Comparison

Whether we choose to examine the difference (oat - corn) or (corn - oat) makes no difference to our results, only to the appropriate interpretation of estimates (including confidence intervals)

The sign of the mean will change

The limits of the CI will reverse and signs will change
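The effect of reversing the direction of comparison can be seen by computing the interval both ways. The helper `paired_ci` below is a hypothetical name introduced for this sketch; the inputs are the oat bran summary values and the slide's multiplier of 2.16.

```python
import math

def paired_ci(mean_diff, sd_diff, n, t_mult):
    """CI for a mean paired difference: mean +/- multiplier * SD / sqrt(n)."""
    se = sd_diff / math.sqrt(n)
    return (mean_diff - t_mult * se, mean_diff + t_mult * se)

corn_minus_oat = paired_ci(0.36, 0.40, 14, 2.16)
oat_minus_corn = paired_ci(-0.36, 0.40, 14, 2.16)   # same data, signs flipped

print(f"corn - oat: ({corn_minus_oat[0]:.2f}, {corn_minus_oat[1]:.2f})")
print(f"oat - corn: ({oat_minus_corn[0]:.2f}, {oat_minus_corn[1]:.2f})")
```

The second interval is the mirror image of the first: the limits swap places and change sign, while the width, and therefore the p-value, is unchanged.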

Summary

Designate hypotheses

The alternative is usually what we are interested in supporting

The null is usually what we wish to nullify (“no association/no change”)

Collect data

Compute the difference in outcome for each paired set of observations

Compute $\bar{x}_{diff}$, the sample mean of the paired differences

Compute s, the sample standard deviation of the differences

Compute the 95% (or other %) CI for the true mean difference:

$$\bar{x}_{diff} \pm t_{1-\alpha,\,n-1} \times SE(\bar{x}_{diff})$$

Summary

To get the p-value

Assume HO is true (sets up the sampling distribution)

Measure the distance of the sample result from µO:

$$t = \frac{\bar{x}_{diff} - \mu_O}{s / \sqrt{n}}$$

Summary Compare the test statistic (distance) to the appropriate

distribution to get the p-value

Summary Paired t-test scenarios

Blood pressure/OC use example

Degree of clinical agreement (each pt received two assessments)

Diet example (each man received two different diets in random order)

Twin study

Matched case-control Suppose we wish to compare levels of a certain

biomarker in pts with versus without a disease

More about P-values

P-values P-values are probabilities

Have to be between 0 and 1

Small p-values mean that the sample results are unlikely when the null is true

The p-value is the probability of obtaining a result as extreme or more extreme than what was actually observed by chance alone, assuming the null is true

P-values The p-value is not the probability that the null hypothesis is

true

It alone imparts no information about scientific/substantive content in result of a study

From the previous example, the researchers found a statistically significant (p = 0.005) difference in average LDL cholesterol levels in men who had been on a diet including corn flakes versus the same men on a diet including oat bran cereal

Which diet showed lower average LDL levels?

How much was the difference?

Does it mean anything nutritionally?

P-values

If the p-value is small, either

1. HO is false; or

2. HO is true and a rare event occurred

Type I Error

Claim HA is true when in fact HO is true

The probability of making a Type I error is called the significance level of a test (α)

Note on p and α If p < α the result is called statistically significant

This cutoff is the significance (or alpha) level of the test

It is the probability of falsely rejecting HO

The idea is to keep the chances of making an error when HO is true low and only reject if the sample evidence is highly against HO

Note on p and α

                     Truth
Decision             HO                  HA
Reject HO            Type I Error (α)    Power (1-β)
Fail to Reject HO    Correct (1-α)       Type II Error (β)

One- or Two-sided?

A two-sided p-value corresponds to results as or more extreme than what was observed in either direction

We know a test is two-sided by observing the alternative hypothesis, HA

HA containing ≠ indicates we are interested in either an increase or a decrease (a value greater or smaller than the hypothesized value)

A one-sided p-value corresponds to results as or more extreme than what was observed in a single direction of interest

The direction of interest is stated explicitly in the alternative hypothesis, HA

HA containing > indicates we are interested in an increase or greater value than the hypothesized value

HA containing < indicates we are interested in a decrease or smaller value than the hypothesized value
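The relationship between the three possible p-values can be illustrated numerically. The sketch below uses a test statistic of -3.33 as the example and, as a simplifying assumption, takes tail areas from the standard normal distribution (a large-sample stand-in for the t distribution).

```python
from statistics import NormalDist

z = -3.33                      # example test statistic (observed result below the null)
std_normal = NormalDist()      # ASSUMED large-sample reference distribution

p_less = std_normal.cdf(z)                  # HA contains <  : area below z
p_greater = 1 - std_normal.cdf(z)           # HA contains >  : area above z
p_two_sided = 2 * std_normal.cdf(-abs(z))   # HA contains != : both tails

print(f"HA: <   p = {p_less:.5f}")
print(f"HA: >   p = {p_greater:.5f}")
print(f"HA: !=  p = {p_two_sided:.5f}")
```

When the observed result lies in the direction named by HA, the one-sided p-value is half the two-sided value; in the opposite direction it is close to 1.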

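Null and Alternative Sampling Distributions

[Figure: sampling distributions assuming H0 is true and assuming H1 is true, split at the critical value zα. Below zα: fail to reject H0 (conclude no difference), with area 1 - α under H0 and β under H1. Beyond zα: reject H0 (conclude difference), with area α under H0 and 1 - β under H1.]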

One- or Two-sided?

In some cases, a one-sided alternative may not make scientific sense

In the absence of pre-existing information for the evaluation of the relationship between BP and OC, wouldn’t either result be interesting and useful (i.e., negative or positive association)?

In other cases, a one-sided alternative makes scientific sense

We are not interested in whether the new treatment is worse than the old, just whether it’s statistically significantly better

However, for reasons already shown (and because of the sanctity of “.05”), one-sided p-values are viewed with suspicion

Connection: Hypothesis Testing and CIs

The confidence interval gives plausible values for the population parameter

“Data, take me to the truth”

Hypothesis testing postulates two choices for the population parameter

“Here are two mutually exclusive possibilities for the truth. Data, help me choose one”

95% Confidence Interval If zero is not in the 95% confidence interval, then we would

reject HO: µ = 0 at the 5% level of significance (α = 0.05)

Why?

With a confidence interval, we start at the sample mean and go approximately two standard errors in either direction

95% Confidence Interval

If zero is not in the 95% CI, then $\bar{x}_{diff}$ must be more than ~2 SE away from zero (either above or below)

Hence, the distance (t) will be either > 2 or < -2, and the resulting p-value will be < 0.05
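The equivalence between the two approaches can be checked directly: zero falls outside the 95% CI exactly when |t| exceeds the multiplier used to build the interval. In the sketch below the function name `ci_and_test` is illustrative; 4.8 is the observed BP/OC mean difference and the other mean differences are hypothetical values chosen to show both outcomes.

```python
import math

def ci_and_test(mean_diff, sd_diff, n, t_mult):
    """Return (zero outside the CI?, |t| beyond the multiplier?) for paired data."""
    se = sd_diff / math.sqrt(n)
    lower, upper = mean_diff - t_mult * se, mean_diff + t_mult * se
    t_stat = mean_diff / se
    zero_outside_ci = not (lower <= 0 <= upper)
    reject_null = abs(t_stat) > t_mult
    return zero_outside_ci, reject_null

# 4.8 is the observed value; 2.0, -1.0 and -5.0 are hypothetical
for mean_diff in (4.8, 2.0, -1.0, -5.0):
    outside, reject = ci_and_test(mean_diff, 4.6, 10, 2.26)
    print(f"mean diff {mean_diff:5.1f}: zero outside CI = {outside}, reject HO = {reject}")
```

The two columns always agree because both conditions reduce to the same inequality, |x̄diff| > multiplier × SE.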

95% Confidence Interval and p-value In the BP/OC example, the 95% CI tells us that p < 0.05, but

it doesn’t tell us that it is p = 0.009

The confidence interval and p-value are complementary

However, you can’t get the exact p-value from just looking at a confidence interval, and you can’t get a sense of the scientific/substantive significance of your study results by looking at a p-value

More on the p-value Statistical significance does not imply or prove causation

Example: in the BP/OC case, there could be other factors at play that could explain the change in blood pressure

A significant p-value is only ruling out random sampling (chance) as the explanation

We would need a randomized comparison group to better establish causality

Self-selected would be okay, but not ideal

More on the p-value Statistical significance is not the same as scientific

significance

Hypothetical example: blood pressure and oral contraceptives

Suppose: n = 100,000; $\bar{x}_{diff}$ = 0.03 mmHg; s = 4.6 mmHg

P = 0.04

Big n can sometimes produce a small p-value even in the absence of a meaningful relationship

The magnitude of the effect is small (not scientifically interesting)

Noise

It is very important to always report a confidence interval

95% CI: 0.002-0.058 mmHg
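The hypothetical large-sample numbers can be reproduced as follows. This sketch assumes that, with n = 100,000, the normal distribution is an adequate stand-in for the t distribution, so the multiplier 1.96 and the standard normal tail area are used.

```python
import math
from statistics import NormalDist

n = 100_000
mean_diff = 0.03   # mmHg, hypothetical mean change
sd_diff = 4.6      # mmHg

se = sd_diff / math.sqrt(n)
z = mean_diff / se
p = 2 * (1 - NormalDist().cdf(abs(z)))               # two-sided p-value
ci = (mean_diff - 1.96 * se, mean_diff + 1.96 * se)  # approximate 95% CI

print(f"SE = {se:.4f} mmHg, z = {z:.2f}, p = {p:.3f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f}) mmHg")
```

The p-value falls below 0.05, yet the whole interval sits within a few hundredths of 1 mmHg of zero, which is why the confidence interval is needed to judge the size of the effect.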

More on the p-value

Lack of statistical significance is not the same as lack of scientific significance

Must evaluate results in the context of the study and sample size

Small n can sometimes produce a non-significant result even though the magnitude of the association at the population level is real and important; your study just may not be big enough to detect it

In small, underpowered studies, a failure to reject HO is hard to interpret

Sometimes small studies are designed without power in mind just to generate preliminary data

Comparing Means among Two (or More) Independent Populations

Topics CIs for mean difference between two independent

populations

Two-sample t-test

Non-parametric alternative

Comparing means of more than two independent populations

Comparing Two Independent Groups “A Low Carbohydrate as Compared with a Low Fat Diet in

Severe Obesity”1

132 severely obese subjects randomized to one of two diet groups

Subjects followed for six months

At the end of the study period: “Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI for the difference in weight loss between groups, -1.6 to -6.2 kg; p < 0.01)”

1Samaha, F., et al. A low-carbohydrate as compared with a low-fat diet in severe obesity, NEJM 348:21.

Comparing Two Independent Groups

Is weight change associated with diet type?

                                            Diet Group
                                            Low-Carb   Low-Fat
Number of subjects (n)                      64         68
Mean weight change (kg) (post - pre)        -5.7       -1.8
Standard deviation of weight changes (kg)   8.6        3.9

Diet Type and Weight Change

95% CIs for weight change by diet group:

Low-carb: $-5.7 \pm 1.96 \times \frac{8.6}{\sqrt{64}} = (-7.807, -3.593)$ kg

Low-fat: $-1.8 \pm 1.96 \times \frac{3.9}{\sqrt{68}} = (-2.727, -0.873)$ kg

Comparing Two Independent Groups

In statistical terms, is there a non-zero difference in the average weight change for the subjects on the low-fat diet as compared to subjects on the low-carbohydrate diet?

95% CIs for each diet group mean weight change do not overlap, but how do you quantify the difference?

The comparison of interest is not “paired”: there are different subjects in each diet group

For each subject, a change in weight (post - pre) was computed

However, the authors compared the changes in weight between two independent groups

Comparing Two Independent Groups

How do we calculate

A CI for the difference?

A p-value to determine if the difference between the two groups is significant?

Since we have large samples (both greater than 60), we know the sampling distributions of the sample means in both groups are approximately normal

It turns out that the difference of two quantities that are approximately normally distributed is also approximately normally distributed

Sampling Distribution of Difference in Sample Means The sampling distribution of the difference of two sample

means, each based on large samples, approximates a normal distribution

This sampling distribution is centered at the true mean difference, μ1 - μ2

Simulated Sampling Distribution The simulated sampling distribution of sample mean weight

change for the low-carbohydrate diet group is shown:

Simulated Sampling Distribution The simulated sampling distribution of sample mean weight

change for the low-fat diet group is shown:

Simulated Sampling Distribution The simulated sampling distribution of the difference in

sample means for the two groups is shown:

Simulated Sampling Distribution Side-by-side boxplots

95% CI for the Difference in Means

Our most general formula is:

(best estimate from sample) ± (multiplier) × SE(best estimate from sample)

The best estimate of a population mean difference based on sample means: $\bar{x}_1 - \bar{x}_2$

Here, $\bar{x}_1$ may represent the sample mean weight loss for the 64 subjects on the low-carb diet, and $\bar{x}_2$ the mean weight loss for the 68 subjects on the low-fat diet

95% CI for the Difference in Means

So, $\bar{x}_1 - \bar{x}_2 = -5.7 - (-1.8) = -3.9$; this makes the formula for the 95% CI for μ1 - μ2:

$$-3.9 \pm 1.96 \times SE(\bar{x}_1 - \bar{x}_2)$$

where $SE(\bar{x}_1 - \bar{x}_2)$ is the standard deviation of the sampling distribution (i.e., the standard error of the difference of two sample means)

Two Independent Groups

The standard error of the difference for two independent samples is calculated differently than that for the paired design

With the paired design, we reduced data on two samples to one set of differences

Statisticians have developed formulas for the standard error of the difference; they depend on sample sizes in both groups and standard deviations in both groups

Aside: $SE(\bar{x}_1 - \bar{x}_2)$ is greater than either $SE(\bar{x}_1)$ or $SE(\bar{x}_2)$. Any ideas why?

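Principle

Variation from independent sources can be added

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

We don’t know σ1 or σ2, so we estimate them using s1 and s2 to get an estimated standard error:

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$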

Comparing Two Independent Groups

Recall from the weight change/diet type study:

                                            Diet Group
                                            Low-Carb   Low-Fat
Number of subjects (n)                      64         68
Mean weight change (kg) (post - pre)        -5.7       -1.8
Standard deviation of weight changes (kg)   8.6        3.9

$$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{8.6^2}{64} + \frac{3.9^2}{68}} = 1.17$$

95% CI for Difference in Means

In this example, the approximate 95% confidence interval for the true mean difference in weight change between the low-carb and low-fat diet groups is:

$$-3.9 \pm 1.96 \times 1.17 = (-6.2, -1.6) \text{ kg}$$
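The two-sample calculation can be sketched from the summary table alone. The code below computes each group's standard error, the standard error of the difference, the approximate 95% CI, and the test statistic; variable names are illustrative.

```python
import math

# Summary statistics from the diet table (weight change in kg, post - pre)
n1, mean1, sd1 = 64, -5.7, 8.6   # low-carb
n2, mean2, sd2 = 68, -1.8, 3.9   # low-fat

se1 = sd1 / math.sqrt(n1)        # SE of the low-carb mean
se2 = sd2 / math.sqrt(n2)        # SE of the low-fat mean

diff = mean1 - mean2
# Variances from independent sources add, so the SEs combine in quadrature
se_diff = math.sqrt(sd1**2 / n1 + sd2**2 / n2)

ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
t_stat = diff / se_diff

print(f"SE1 = {se1:.2f}, SE2 = {se2:.2f}, SE of difference = {se_diff:.2f}")
print(f"difference = {diff:.1f} kg, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f}) kg")
print(f"t = {t_stat:.2f}")
```

Because the two variances are added, the standard error of the difference is larger than either group's own standard error, which answers the aside raised earlier.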

From Article

“Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)”

Those on the low-carb diet lost more on average by 3.9 kg; after accounting for sampling variability, this excess average loss over the low-fat diet group could be as small as 1.6 kg or as large as 6.2 kg

This CI does not include zero, suggesting a real population level association between type of diet and weight loss

Two-sample t-test: Getting a p-value

Hypothesis Test to Compare Two Independent Groups Two-sample t-test

Is the (mean) weight change equal in the two diet groups?

HO: μ1 = μ2

HA: μ1 ≠ μ2

In other words, is the expected difference in weight change zero?

HO: μ1 - μ2 = 0

HA: μ1 - μ2 ≠ 0

Hypothesis Test to Compare Two Independent Groups

Recall, the general “recipe” for hypothesis testing is:

1. Assume HO is true

2. Measure the distance of the sample result from the hypothesized result, μO (in most cases it’s 0)

3. Compare the test statistic (distance) to the appropriate distribution to get the p-value

$$t = \frac{\text{observed difference} - \text{null difference}}{SE(\text{observed difference})}$$

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_O}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Diet Type and Weight Loss Study

Recall: $\bar{x}_1 - \bar{x}_2 = -3.9$ and $SE(\bar{x}_1 - \bar{x}_2) = 1.17$

For this study:

$$t = \frac{-3.9}{1.17} = -3.33$$

This study result was 3.33 standard errors below the hypothesized mean of 0. Is this result unusual?

How are p-values calculated? The p-value is the probability of getting a result as extreme

or more extreme than what you observed if the null hypothesis were true

It comes from the sampling distribution of the difference in two sample means

What does this sampling distribution look like?

If both groups are large, it is approximately normal

It is centered at the true difference

Under the null, the true difference is 0

Diet/Weight Loss To compute the p-value, we would need to compute the

probability of being 3.3 or more SEs away from 0

Diet/Weight Loss In Excel, use the “TTEST” function to test whether two

independent samples are significantly different

If you’ve calculated the test statistic, t (-3.33, in this example), you can use the “TDIST” function to compute the p-value

For the diet example, p = 0.0013

Summary: Weight Loss Example Statistical Methods

“We randomly assigned 132 severely obese patients . . . To a carbohydrate-restricted (low-carbohydrate) diet or a calorie- and fat-restricted diet”

“For comparison of continuous variables between the two groups, we calculated the change from baseline to six months in each subject and compared the mean changes in the two diet groups using an unpaired (two-sample) t-test”

Result “Subjects on the low-carbohydrate diet lost more weight

than those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)”

Sampling Distribution Detail

What exactly is the sampling distribution of the difference in sample means?

A Student’s t distribution is used with n1 + n2 - 2 degrees of freedom (total sample size minus two)

Two-Sample t-test

In a randomized design, 23 patients with hyperlipidemia were randomized to either treatment A or treatment B for 12 weeks

12 to A

11 to B

LDL cholesterol levels (mmol/L) measured on each subject at baseline and 12 weeks

The 12-week change in LDL cholesterol was computed for each subject

                                    Treatment Group
                                    A        B
N                                   12       11
Mean LDL change                     -1.41    -0.32
Standard deviation of LDL changes   0.55     0.65

Two-Sample t-test Is there a difference in LDL change between the two

treatment groups?

Methods of inference

CI for the difference in mean LDL cholesterol change between the two groups

Statistical hypothesis test

95% CI for Difference in Means

$$(\bar{x}_1 - \bar{x}_2) \pm t_{0.95,\,n_1+n_2-2} \times SE(\bar{x}_1 - \bar{x}_2)$$

$$= (-1.41 - (-0.32)) \pm t_{0.95,\,n_1+n_2-2} \times \sqrt{\frac{0.55^2}{12} + \frac{0.65^2}{11}}$$

$$= -1.09 \pm t_{0.95,\,n_1+n_2-2} \times 0.25$$

                                    Treatment Group
                                    A        B
N                                   12       11
Mean LDL change                     -1.41    -0.32
Standard deviation of LDL changes   0.55     0.65

95% CI for Difference in Means

How many standard errors to add and subtract (i.e., what is the correct multiplier)?

The number we need comes from a t with 12 + 11 - 2 = 21 degrees of freedom

From a t table or Excel, this value is 2.08

The 95% CI for true mean difference in change in LDL cholesterol, drug A to drug B is:

$$-1.09 \pm 2.08 \times 0.25 = (-1.61, -0.57)$$

Hypothesis Test to Compare Two Independent Groups Two-sample (unpaired) t-test: getting a p-value

Is the change in LDL cholesterol the same in the two treatment groups?

HO: μ1 = μ2, or equivalently HO: μ1 - μ2 = 0

HA: μ1 ≠ μ2, or equivalently HA: μ1 - μ2 ≠ 0

Hypothesis Test to Compare Two Independent Groups

Recall the general “recipe” for hypothesis testing:

1. Assume HO is true

2. Measure the distance of the sample result from the hypothesized result (here, it’s 0)

3. Compare the test statistic (distance) to the appropriate distribution to get the p-value

$$t = \frac{\text{observed difference} - \text{null difference}}{SE(\text{observed difference})}$$

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_O}{SE(\bar{x}_1 - \bar{x}_2)} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Hyperlipidemia Example

In the hyperlipidemia study, recall: $\bar{x}_1 - \bar{x}_2 = -1.09$ and $SE(\bar{x}_1 - \bar{x}_2) = 0.25$

In this study:

$$t = \frac{-1.09}{0.25} = -4.4$$

This study result was 4.4 standard errors below the null mean of 0

How are p-values Calculated? Is a result 4.4 standard errors below 0 unusual?

It depends on what kind of distribution we are dealing with

The p-value is the probability of getting a result as extreme or more extreme than what was observed (-4.4) by chance, if the null hypothesis were true

The p-value comes from the sampling distribution of the difference in two sample means

What is the sampling distribution of the difference in sample means? A t distribution with 12 + 11 - 2 = 21 degrees of freedom

Hyperlipidemia Example To compute a p-value, we need to compute the probability

of being 4.4 or more SE away from 0 on the t with 21 degrees of freedom

P = 0.0003
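The hyperlipidemia test can be carried through end to end from the summary table. This sketch uses only the standard library: it computes the unpooled standard error and the test statistic, then obtains the two-sided p-value on 21 degrees of freedom by numerically integrating the t density. The helper names are made up for this illustration.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=20000):
    """Two-sided p-value: area outside (-|t|, +|t|), via Simpson's rule."""
    a = abs(t_stat)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1 - 2 * (total * h / 3)

# Summary statistics from the treatment table (LDL change, mmol/L)
n1, mean1, sd1 = 12, -1.41, 0.55   # treatment A
n2, mean2, sd2 = 11, -0.32, 0.65   # treatment B

diff = mean1 - mean2
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
t_stat = diff / se
df = n1 + n2 - 2
p = two_sided_p(t_stat, df)

print(f"difference = {diff:.2f}, SE = {se:.2f}, t = {t_stat:.2f}, df = {df}, p = {p:.4f}")
```

With the unrounded standard error the statistic comes out slightly smaller in magnitude than the slide's -4.4, which divides by the rounded value 0.25; the conclusion (p well below 0.001) is the same.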

Summary: Hyperlipidemia Example

Statistical Methods

Twenty-three patients with hyperlipidemia were randomly assigned to one of two treatment groups: A or B

12 patients were assigned to receive A

11 patients were assigned to receive B

Baseline LDL cholesterol measurements were taken on each subject and LDL was again measured after 12 weeks of treatment

The change in LDL cholesterol was computed for each subject

The mean LDL changes in the two treatment groups were compared using an unpaired t-test and a 95% confidence interval was constructed for the difference in mean LDL changes

Summary: Hyperlipidemia Example

Result

Patients on A showed a decrease in LDL cholesterol of 1.41 mmol/L and subjects on treatment B showed a decrease of 0.32 mmol/L (a difference of 1.09 mmol/L, 95% CI: 0.57 to 1.61 mmol/L)

The difference in LDL changes was statistically significant (p < 0.001)

FYI: Equal Variances Assumption

The “traditional” t-test assumes equal variances in the two groups

This can be formally tested using another hypothesis test

But why not just compare observed values of s1 to s2?

There is a slight modification to allow for unequal variances; this modification adjusts the degrees of freedom for the test and uses a slightly different SE computation

If you want to be truly ‘safe’, it is more conservative to use the test that allows for unequal variances

Makes little to no difference in large samples

FYI: Equal Variances Assumption

If underlying population level standard deviations are equal, both approaches give valid confidence intervals, but intervals assuming unequal standard deviations are slightly wider (p-values slightly larger)

If underlying population level standard deviations are unequal, the approach assuming equal variances does not give valid confidence intervals and can severely under-cover the goal of 95%
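To make the two approaches concrete, the sketch below computes the standard error and degrees of freedom both ways. The function names are invented for this example, the unequal-variance degrees of freedom use the Welch-Satterthwaite formula (a common choice, assumed here rather than taken from the slides), and the standard deviations are illustrative values, not figures from the hyperlipidemia study.

```python
import math

def pooled_se(s1, s2, n1, n2):
    """SE of the difference in means assuming equal population variances."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def unequal_var_se(s1, s2, n1, n2):
    """SE of the difference in means allowing unequal variances."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def welch_df(s1, s2, n1, n2):
    """Welch-Satterthwaite approximate degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Hypothetical sample SDs (illustration only); group sizes match the example
s1, s2 = 0.55, 0.65
n1, n2 = 12, 11

print("equal variances:   SE =", round(pooled_se(s1, s2, n1, n2), 3),
      " df =", n1 + n2 - 2)
print("unequal variances: SE =", round(unequal_var_se(s1, s2, n1, n2), 3),
      " df =", round(welch_df(s1, s2, n1, n2), 1))
```

The adjusted degrees of freedom never exceed n1 + n2 - 2, which is one reason the unequal-variance interval tends to be a little wider. When the two sample SDs and group sizes are the same, both approaches coincide.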

Non-Parametric Analogue to the Two-Sample t

Alternative to the Two Sample t-test

“Non-parametric” refers to a class of tests that do not assume a particular distribution (such as the normal) for the data

Nonparametric tests for comparing two groups

Mann-Whitney Rank-Sum test (Wilcoxon Rank Sum Test)

Also called Wilcoxon-Mann-Whitney Test

Attempts to answer: “Are the two population distributions different?”

Advantages: does not assume populations being compared are normally distributed, uses only ranks, and is not sensitive to outliers

Alternative to the Two Sample t-test

Disadvantages:

often less sensitive (powerful) for finding true differences because they throw away information (by using only ranks rather than the raw data)

need the full data set, not just summary statistics

results do not include any CI quantifying range of possibility for true difference between populations

Health Education Study

Evaluate an intervention to educate high school students about health and lifestyle over a two-month period

10 students randomized to intervention or control group

X = post-test score - pre-test score

Compare X between the two groups

• Only five individuals in each sample

• We want to compare the control and intervention groups to assess whether the ‘improvements’ in scores are different, taking random sampling error into account

• With such a small sample size, we need to be sure score improvements are normally distributed if we want to use the t test (BIG assumption)

• Possible approach: Wilcoxon-Mann-Whitney test

Health Education Study

Score changes (X = post-test score - pre-test score):

Intervention    5    0    7    2   19

Control         6   -5   -6    1    4

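Health Education Study

Step 1: rank the pooled data, ignoring groups

Step 2: reattach group status

Step 3: find the average rank in each of the two groups

Score changes:

Intervention    5    0    7    2   19

Control         6   -5   -6    1    4

Ranks in the pooled data (1 = smallest):

Intervention    7    3    9    5   10

Control         8    2    1    4    6

Average ranks:

Intervention: $\frac{3 + 5 + 7 + 9 + 10}{5} = 6.8$

Control: $\frac{1 + 2 + 4 + 6 + 8}{5} = 4.2$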
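The three ranking steps can be reproduced with a short Python sketch. The helper name `pooled_ranks` is made up for this example; it gives tied values the average of the ranks they occupy, although this data set has no ties.

```python
intervention = [5, 0, 7, 2, 19]
control = [6, -5, -6, 1, 4]

def pooled_ranks(values):
    """Map each value to its rank in the pooled data (ties share the mean rank)."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}

# Step 1: rank the pooled data, ignoring groups
rank_of = pooled_ranks(intervention + control)

# Step 2: reattach group status
intervention_ranks = [rank_of[x] for x in intervention]
control_ranks = [rank_of[x] for x in control]

# Step 3: find the average rank in each group
mean_rank_intervention = sum(intervention_ranks) / len(intervention_ranks)
mean_rank_control = sum(control_ranks) / len(control_ranks)

print("Intervention ranks:", intervention_ranks, "average:", mean_rank_intervention)
print("Control ranks:     ", control_ranks, "average:", mean_rank_control)
```

The averages come out to 6.8 for the intervention group and 4.2 for the control group, matching the hand calculation.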

Health Education Study

Statisticians have developed formulas and tables to determine the probability of observing such an extreme discrepancy in ranks (6.8 versus 4.2) by chance alone (p)

The p-value here is 0.17

The interpretation is that the Mann-Whitney test did not show any significant difference in test score ‘improvement’ between the intervention and control group (p = 0.17)

The two-sample t test would give a different answer (p = 0.14)

Different statistical methods give different p-values

If the largest observation (19) were made even larger, the Mann-Whitney p-value would not change but the t-test p-value would
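The sketch below shows two common ways to turn the rank sum into a two-sided p-value, using only the Python standard library. The function names are invented for this example. Which method produced the slide's value is an assumption: the reported p = 0.17 is consistent with the large-sample normal approximation (no continuity correction), while exact enumeration of all 252 possible group assignments gives about 0.22. Both lead to the same conclusion here.

```python
import math
from itertools import combinations

intervention = [5, 0, 7, 2, 19]
control = [6, -5, -6, 1, 4]

def rank_sum(group, other):
    """Sum of the pooled-data ranks for `group` (assumes no tied values)."""
    pooled = sorted(group + other)
    return sum(pooled.index(x) + 1 for x in group)

def normal_approx_p(w, n1, n2):
    """Two-sided p-value from the normal approximation to the rank sum."""
    n = n1 + n2
    mean_w = n1 * (n + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n + 1) / 12)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

def exact_p(w, n1, n2):
    """Two-sided exact p-value: share of all group assignments whose
    rank sum is at least as far from its null mean as the observed one."""
    n = n1 + n2
    mean_w = n1 * (n + 1) / 2
    sums = [sum(c) for c in combinations(range(1, n + 1), n1)]
    extreme = sum(1 for s in sums if abs(s - mean_w) >= abs(w - mean_w))
    return extreme / len(sums)

w = rank_sum(intervention, control)
print("rank sum:", w)
print("normal approximation p:", round(normal_approx_p(w, 5, 5), 2))
print("exact p:", round(exact_p(w, 5, 5), 2))

# Making the largest observation far more extreme leaves the ranks unchanged
w_outlier = rank_sum([5, 0, 7, 2, 190], control)
print("rank sum with 19 replaced by 190:", w_outlier)
```

Because the test sees only ranks, replacing 19 with 190 gives the same rank sum and therefore the same p-value, whereas the sample mean and standard deviation used by the t-test would both change.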

Notes

The t or the nonparametric test?

Statisticians will not always agree, but there are some guidelines

Use the nonparametric test if:

the sample size is small and you have no reason to believe the data are ‘well-behaved’ (normally distributed)

only ranks are available

Summary: Educational Intervention Example

Statistical methods

10 high school students were randomized to either receive a two-month health and lifestyle education program or no program

Each student was administered a test regarding health and lifestyle issues prior to randomization and after the two-month period

Differences in the two test scores were computed for each student

Mean and median test score changes were computed for each of the two study groups

A Mann-Whitney rank sum test was used to determine if there was a statistically significant difference in test score change between the intervention and control groups at the end of the two-month study period

Summary: Educational Intervention Example

Results

Participants randomized to the educational intervention scored a median five points higher on the test given at the end of the two-month study period, as compared to the test administered prior to the intervention

Participants randomized to receive no educational intervention scored a median one point higher on the test given at the end of the two-month study period

The difference in test score improvements between the intervention and control groups was not statistically significant (p = 0.17)

Next Lecture

Friday, December 17 in B018 SON from 8:30a - 10:30a

Topics include:

• ANOVA

• Linear Regression

• Chi-square test

• Survival Analysis

• Design of Experiments

References and Citations

Lectures modified from notes provided by John McGready and Johns Hopkins Bloomberg School of Public Health, accessible from the World Wide Web: http://ocw.jhsph.edu/courses/introbiostats/schedule.cfm

