
BIOS 2041: Introduction to Statistical Methods

Abdus S Wahed*

*Some of the material in this chapter has been adapted from Dr. John Wilson’s lecture notes for the same course.


Chapter 1

Introduction to Statistical Methods

1.1 What is Statistics?

• Statistics → Science of making inferences about specific “random” phenomena based on limited sample materials. The discipline provides methods for answering questions such as

• What effect does air pollution have on the residents of Pittsburgh?

• What proportion of Pittsburgh residents invest in stocks or bonds?

• Is drug A better than drug B in relieving certain asthma symptoms?

• Does vitamin A prevent cancer?

• Based on this quarter’s performance of stock returns, what strategy will optimize the expected return in the next quarter?

A central task of statistical analysis is to draw a conclusion (“make inference”) about a population of interest based on evidence in a sample from that population.

• Population = the set of all subjects or individuals who could

be measured for some variable of interest. Another viewpoint is

that the population is the group about which you wish to draw

a conclusion.

Example: All women in Allegheny County

• A parameter is a numeric characteristic of a population.

Example: Proportion of women in Allegheny County having a

female relative who has been treated for breast cancer.



• A sample is a subset of the population selected for study. The idea is that the sample will provide the information used in drawing the conclusion about the population.

Example: 200 Allegheny County women selected by random-digit telephone dialing.

• A statistic is a numeric characteristic of a sample.

Example: The observed proportion of women with a female relative who had been treated for breast cancer is 35%.

• Inference is the conclusion drawn about the population on the basis of the sample.

Example: The proportion of Allegheny County women having a

female relative who has had breast cancer is 35%.

Another example:

• Population: All patients treated for Acute Myelocytic Leukemia

(AML) who are in first complete remission (CR1).



• Parameter: Median duration of remission of treated AML patients in CR1.

• Sample: 35 AML / CR1 patients treated at the University of

Pittsburgh Cancer Institute during 2007.

• Statistic: The median duration of CR1 in these 35 patients was

13 months.

• Inference: The median duration of CR1 in patients treated for

AML is 13 months.

1.2 What is Biostatistics?

• Biostatistics → The branch of statistics that applies statistical methods to medical and biological problems. Biostatisticians help researchers (basic scientists, medical researchers, drug developers) from the inception of a study to its completion. The role of a biostatistician in the process is:



• To formulate the research question in concrete terms → hypothesis.

• To plan the experiment/study that will answer the research question accurately and efficiently, e.g.,

    • How many subjects (mice, patients, machines) will be needed to answer the research question?

    • How would, for example, subjects be assigned to different groups?

    • What data should be collected on each subject?

    • How would the data be verified and processed?

    • What are the issues with the data? e.g.,

        • How would the missing data be handled?

        • Are there measurement errors in the data? How are they going to be handled?

• To analyze the collected data to draw conclusions regarding the hypotheses.



Example 1.2.1. Drug development. XYZ Pharmaceuticals has been conducting research on developing drugs for hepatitis C (Hep C) treatment since 1990. Their basic science researchers have convinced the Food and Drug Administration (FDA) through phase I and II trials that they have discovered a new molecule of the standard interferon that can be administered once weekly instead of once daily, and they claim that the drug provides a better response rate compared to standard interferon. The company is planning to test the drug on a large cohort of hepatitis C patients. The statistician assigned to this study will generally start by asking basic questions like:

1. How would you quantify the response? (Usually a simplified

answer would be: absence of Hep C virus in the serum 24 weeks

after the end of the treatment.)

2. How much improvement do you expect in response rate among the users of the new drug compared to standard interferon users? (The Phase II trial would indicate some ballpark figure for this.)



Based on the answers, the statistician will

• Formulate the hypothesis in quantitative terms:

$$H_0 : P_1 = P_2, \tag{1.2.1}$$

where $P_1$ is the response rate in the standard interferon group and $P_2$ is the response rate for the new treatment (weekly interferon).

• Determine the number of patients to be recruited in the (standard) daily interferon group and in the (new) weekly interferon group.

• Make sure that patient safety and privacy are ensured in the protocol, keeping in mind the objective of the study.

• Devise a randomization scheme (possibly double-blinded) to assign treatments to patients so that the two groups are comparable with respect to patient characteristics.

• Suggest a data collection, verification and management plan.



• How many sites will be used for patient recruitment?

• What data needs to be collected? What system will be used

to transfer the data?

• How will the data be processed?

• What information should be presented to the DSMB (Data Safety Monitoring Board), and how often?

• What criteria should be used to declare the new treatment significantly better?

• How many interim analyses should be planned?

• What criteria should be used for stopping the trial?

• Finally, when the trial ends, the statistician will conduct/oversee the data analysis to arrive at a conclusion regarding the hypothesis.

In this course, we will mainly talk about:



• Statistical methods to analyze collected data so that specific questions of interest can be answered.

• Design issues, for example, sample size and power, etc.

We will cover:

Chapters 1-8 (in full), 10 - 11 (partial).


Chapter 2

Descriptive Statistics

In most cases data consist of many sample points. In a bid to interpret data, the first task is to summarize the data in some concise manner.

2.1 Types of data.

Data collected, outcomes of experiments, etc. are often referred to

as variables or outcomes, which come in several varieties. The type

of outcome observed plays a role in determining which statistical

procedures are appropriate.



• Categorical (discrete) - data can be assigned to discrete categories.

• a) Unordered

i) Gender

ii) Political party to which one belongs

iii) Exposed vs not exposed

iv) Disease or no disease

• b) Ordered

i) Good- Better- Best classification

ii) Number of times patient admitted to hospital for illness

during a given year.

• Continuous variables

• a) Ordinary or uncensored

i) Standard scale measurements

- height



- weight

- optical density

- pH

ii) Survival times that are actually observed.

• b) Censored data

i) Survival time: it may be known only that the time is greater than some observed time.

Here are the first 10 records from a dataset:

Table 2.1: Several records from a dataset

Obs    ID     AGE    SEX    LEADTYP    IQF
  1    101    1101     1          1     70
  2    102     905     1          1     85
  3    103    1101     1          1     86
  4    104     611     1          1     76
  5    105    1103     1          1     84
  6    106     606     1          2     96
  7    107     611     1          2     94
  8    108    1500     2          2     56
  9    109     702     2          2    115
 10    110     703     1          2     97



Many numerical and graphical techniques are available for the purpose of summarizing data. We will start with continuous variables.

2.2 Measures of Location

The first set of summary measures describes the center (or middle) of the sample data. Such measures are known as measures of location or measures of central tendency. We will start with the simplest of these measures, the arithmetic mean (or simply, the mean).

2.2.1 Arithmetic Mean

The arithmetic mean is the sum of the observations divided by the number of observations.

Formula:

If X is what is measured (observed) and $x_1, x_2, \dots, x_n$ are the values of n measurements, then the arithmetic mean is given by the formula:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}. \tag{2.2.1}$$



Example 2.2.1. Table 2.1 (Rosner)

Table 2.2: Sample of birthweights (g) of live-born infants born at a private hospital in San Diego, California, during a 1-week period.

New-born  Weight (g)    New-born  Weight (g)    New-born  Weight (g)    New-born  Weight (g)
       1        3265           6        3323          11        2581          16        2759
       2        3260           7        3649          12        2841          17        3248
       3        3245           8        3200          13        3609          18        3314
       4        3484           9        3031          14        2838          19        3101
       5        4146          10        2069          15        3541          20        2834

X = birthweights (g) of live-born infants

$$\bar{x} = \frac{3265 + 3260 + \dots + 2834}{20} = 3166.9\ \text{g}. \tag{2.2.2}$$

Facts about mean

• Arithmetic mean is easy to compute.

• If the sample points change in scale by a factor of c, the mean

changes by a factor of c.

• In some cases it fails to reflect the center of the sample, specifically in the presence of unusually high or low values (outliers).



• It is the most widely used measure of location.
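The calculation in Example 2.2.1 and two of the facts above can be checked with a few lines of Python. This is a minimal sketch: the function name `arithmetic_mean`, the grams-to-kilograms rescaling, and the mistyped value 41460 are illustrative assumptions, not part of the original notes.

```python
# Birthweights (g) of the 20 infants in Example 2.2.1.
birthweights = [
    3265, 3260, 3245, 3484, 4146,
    3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541,
    2759, 3248, 3314, 3101, 2834,
]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_g = arithmetic_mean(birthweights)
print(f"mean = {mean_g:.1f} g")  # mean = 3166.9 g

# Change of scale by c = 1/1000 (grams to kilograms): the mean changes by c too.
mean_kg = arithmetic_mean([w / 1000 for w in birthweights])
print(f"mean = {mean_kg:.4f} kg")  # mean = 3.1669 kg

# Hypothetical data-entry error: 4146 typed as 41460. One outlier moves the mean a lot.
with_error = [41460 if w == 4146 else w for w in birthweights]
mean_with_error = arithmetic_mean(with_error)
print(f"mean with outlier = {mean_with_error:.1f} g")  # mean with outlier = 5032.6 g
```

With the single mistyped value, the mean is larger than 19 of the 20 observations, which is the sense in which it can fail to reflect the center.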

2.2.2 Median

Loosely speaking, the median is a number such that in the ordered sample, half of the sample points lie below it, and half above it.

Formula:

If n is odd, then the $\left(\frac{n+1}{2}\right)$th largest observation is the median. Otherwise, the median is defined as the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th largest observations.

Example 2.2.2. Table 2.2 (Rosner). White blood cell counts

(× 1000) for a sample of 9 patients entering a hospital. The ordered

sample is as follows:

3, 5, 7, 8, 8, 9, 10, 12, 35

Here, n = 9, and hence $\frac{n+1}{2} = 5$. The median white blood cell count for this sample is the 5th observation, which is 8000.



Facts about median:

• Median is not highly influenced by extreme observations, unless there are only one or two data points.

• Median depends only on one or two middle observations and

hence is less sensitive to the magnitude of other observations in

the sample.
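The median rule and its insensitivity to extreme values can be sketched in Python as follows. The function name `median` and the replacement value 13 are illustrative assumptions; the data are the white blood cell counts of Example 2.2.2.

```python
def median(values):
    """Middle value if n is odd; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # (n/2)th and (n/2 + 1)th


# White blood cell counts (x 1000) from Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(median(wbc))  # 8

# Hypothetical variant: replace the extreme value 35 by 13.
wbc_without_extreme = wbc[:-1] + [13]
print(median(wbc_without_extreme))  # 8, unchanged

# The mean, by contrast, moves noticeably.
print(round(sum(wbc) / len(wbc), 2))                                  # 10.78
print(round(sum(wbc_without_extreme) / len(wbc_without_extreme), 2))  # 8.33
```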

2.2.3 Mode

Mode is the most frequently occurring value in the sample.

In the above example, the mode of the white blood cell counts is 8000, as it occurs more frequently than any other white blood cell count.

Facts about mode:

• If all the data points occur exactly the same number of times,

then there is no mode.

• A sample with one mode is called unimodal; two modes, bimodal; three modes, trimodal; and so on.
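A short Python sketch of the mode, following the conventions listed above (no mode when every value occurs equally often, several modes allowed). The function name `modes` and the two small test samples are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    if all(c == highest for c in counts.values()):
        return []  # every value occurs equally often: no mode, as stated in the notes
    return sorted(v for v, c in counts.items() if c == highest)


# White blood cell counts (x 1000) from Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(modes(wbc))              # [8]  -> unimodal
print(modes([1, 1, 2, 2, 3]))  # [1, 2] -> bimodal
print(modes([1, 2, 3]))        # [] -> no mode
```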

2.2.4 Geometric Mean

Geometric mean is often used for summarizing ratios, percentages, indices, or other data sets bounded by zero. The geometric mean of n positive numbers $x_1, x_2, \dots, x_n$ is defined as the n-th root of their product.

Formula:

$$GM = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n} = (x_1 \times x_2 \times \dots \times x_n)^{1/n}. \tag{2.2.3}$$

In Example 2.2.2, the geometric mean is

$$(3 \times 5 \times \dots \times 35)^{1/9} = 8.59.$$

Facts about geometric mean:

• Only defined for non-negative numbers.

• Usually, if a distribution on the positive axis is asymmetric, then

a log transformation is used to make it symmetric. For such

distributions the geometric mean is used.
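The link between the geometric mean and the log transformation can be made concrete: the geometric mean is the exponential of the arithmetic mean of the logs. The sketch below assumes strictly positive data (required for the logarithm); the function name `geometric_mean` is illustrative.

```python
import math


def geometric_mean(values):
    """exp(mean(log x)), which equals the n-th root of the product."""
    if any(v <= 0 for v in values):
        raise ValueError("this log-based sketch requires strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


# White blood cell counts (x 1000) from Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
gm = geometric_mean(wbc)
print(f"{gm:.2f}")  # 8.59

# Same quantity computed directly as the 9th root of the product.
direct = math.prod(wbc) ** (1 / len(wbc))
print(math.isclose(gm, direct))  # True
```

Working on the log scale avoids forming a very large product, which matters for longer samples.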



2.3 Measures of Spread/Variation/Dispersion

Refer to Figure 2.4 (FOB).

2.3.1 Range

Range is the difference between the largest and the smallest observations. For the birthweights data in Table 2.1 (Rosner), the range is

Range = 4146 − 2069 = 2077 g.

For the data in Figure 2.4 (FOB), the range for the Autoanalyzer method is

226 − 177 = 49 mg/dl,

whereas the same for the Microenzymatic method is

209 − 192 = 17 mg/dl.

Thus, one can claim that:

• The Microenzymatic method measures cholesterol levels more consistently than the Autoanalyzer method does. Or, equivalently,

• Measurements of cholesterol levels using the Microenzymatic method are more precise than those using the Autoanalyzer method. Or, equivalently,

• Microenzymatic cholesterol measurements have lower variability compared to Autoanalyzer cholesterol measurements.

Facts about range:

• Easy to compute.

• Depends highly on the extreme values.
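The range comparison above takes only a few lines in Python. The Autoanalyzer values appear in Section 2.3.3; the Microenzymatic values are not listed explicitly in these notes and are reconstructed here from the deviations used in Section 2.3.4 and the quartiles in Table 2.4, so treat them as an assumption.

```python
# Cholesterol measurements (mg/dl).
autoanalyzer = [177, 193, 195, 209, 226]
# Assumed values: reconstructed from the deviations (-8, -3, 0, 2, 9) about a mean of 200.
microenzymatic = [192, 197, 200, 202, 209]


def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


auto_range = value_range(autoanalyzer)
micro_range = value_range(microenzymatic)
print(auto_range)   # 49
print(micro_range)  # 17
```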

2.3.2 Percentiles/Quantiles and Interquartile Range

The 100pth (0 ≤ p ≤ 1) percentile of a distribution is the value Vp

such that 100p% of the sample points are less than or equal to Vp.

Median is the 50th percentile.



For the birthweights data in Table 2.1 (Rosner), some of the percentiles are calculated as:

Percentile   Value (g)   How we calculated it from the ordered data
10th         2670.0      n × p = 20 × 0.10 = 2; the average of the 2nd and 3rd observations.
25th         2839.5      n × p = 20 × 0.25 = 5; the average of the 5th and 6th observations.
...
99th         4146.0      n × p = 20 × 0.99 = 19.8; the 20th observation.

Table 2.3: Percentiles for the Birthweights data in Table 2.1 (Rosner)
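The rule behind Table 2.3 can be sketched in Python: if n × p is a whole number k, average the k-th and (k+1)-th ordered observations; otherwise take the next observation above n × p. The function name `percentile` and the choice to pass the percentile as a whole number (to avoid floating-point comparisons) are assumptions of this sketch, and other software may use different interpolation rules.

```python
def percentile(values, pct):
    """pct-th percentile (integer 1..99) using the rule illustrated in Table 2.3."""
    if not 0 < pct < 100:
        raise ValueError("pct must be an integer between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * pct, 100)  # n * p = k + remainder / 100
    if remainder == 0:
        # n * p is an integer: average of the k-th and (k + 1)-th observations.
        return (ordered[k - 1] + ordered[k]) / 2
    # n * p is not an integer: the (k + 1)-th observation.
    return ordered[k]


birthweights = [
    3265, 3260, 3245, 3484, 4146,
    3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541,
    2759, 3248, 3314, 3101, 2834,
]

print(percentile(birthweights, 10))  # 2670.0
print(percentile(birthweights, 25))  # 2839.5
print(percentile(birthweights, 50))  # 3246.5 (the median)
print(percentile(birthweights, 99))  # 4146
```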

Facts about percentiles

• Percentiles are also known as quantiles.

• Percentiles characterize the relative positioning of the observations in the sample.

• The spread of the distribution about the center can be characterized by specifying certain quantiles. For instance, the 25th and 75th percentiles tell us that the middle half of the sample points lie between these two values.



• The 25th percentile and the 75th percentile of a distribution are

commonly referred to as 1st (lower) and 3rd (upper) quartiles.

Here are the percentiles for the cholesterol data in Figure 2.4 (FOB):

Method   N   Lower Quartile   Median   Upper Quartile   IQR
Auto     5        193.0        195.0        209.0       16.0
Micro    5        197.0        200.0        202.0        5.0

Table 2.4: Percentiles for the Cholesterol data in Figure 2.4 (Rosner)

Interquartile range

The distance between the 1st quartile (Q1) and the 3rd quartile (Q3) is known as the interquartile range (IQR). The interquartile range is useful for comparing the spread of two distributions as well as for detecting outliers. The higher the IQR, the more variable the distribution is. For the cholesterol data, the IQRs for the Autoanalyzer method and the Microenzymatic method are 16 and 5, respectively, which supports our previous claim that the Autoanalyzer method is not as precise as the Microenzymatic method.



• For a positively skewed distribution, the distance between the

median and upper quartile is greater than the distance between

median and the lower quartile.

• For a negatively skewed distribution, the distance between the

median and upper quartile is smaller than the distance between

median and the lower quartile. [Birthweights data (Table 2.1,

FOB)]

• For a symmetric distribution, the distance between the median and upper quartile is approximately equal to the distance between the median and the lower quartile. [For the menstrual cycle data Table 2.3 (FOB), Q1 = 28 = Median, Q3 = 29.]

Outliers

Outliers are extremely high or low values that are “isolated” from the

overall distribution. Outliers in a data set can be identified based on

the lower and upper quartiles.

Formula:



An observation x can be treated as an outlier if either

1. x > Q3 + 1.5 ∗ IQR, or

2. x < Q1 − 1.5 ∗ IQR.

Formula:

An observation x is an extreme outlier if either

1. x > Q3 + 3 ∗ IQR, or

2. x < Q1 − 3 ∗ IQR.

Are there any outliers in the cholesterol data set?
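One way to check is to compute the fences from the quartiles in Table 2.4 and compare each observation against them. The sketch below takes the quartiles directly from that table; the function names are illustrative, the Microenzymatic values are reconstructed from the deviations in Section 2.3.4 (an assumption), and the value 240 is a hypothetical extra observation.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Observations outside the fences; k=1.5 for outliers, k=3 for extreme outliers."""
    low, high = fences(q1, q3, k)
    return [x for x in values if x < low or x > high]


autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]  # assumed, reconstructed values

# Quartiles from Table 2.4.
print(fences(193, 209))                         # (169.0, 233.0)
print(find_outliers(autoanalyzer, 193, 209))    # []
print(fences(197, 202))                         # (189.5, 209.5)
print(find_outliers(microenzymatic, 197, 202))  # []

# Hypothetical Autoanalyzer reading of 240: an outlier, but not an extreme one.
print(find_outliers(autoanalyzer + [240], 193, 209))       # [240]
print(find_outliers(autoanalyzer + [240], 193, 209, k=3))  # []
```

Under these assumptions, neither method has an observation beyond its fences.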

2.3.3 Mean deviation

Let us look at the cholesterol data one more time.

[INSERT CHOLESTEROL FIGURE]

Look at how each observation differs from the mean, i.e.,

$$x_1 - \bar{x},\ x_2 - \bar{x},\ x_3 - \bar{x},\ \dots,\ x_n - \bar{x}.$$

One way to measure the spread is to look at how sample points in the data differ from the mean. However, the mean of these differences is zero for any data. For the autoanalyzer method sample, the differences are:

(177 − 200) = −23, (193 − 200) = −7, (195 − 200) = −5,

(209 − 200) = 9, and (226 − 200) = 26,

and the mean difference is zero. The same is true for the microenzymatic method. Therefore the mean difference about the mean cannot be used to distinguish between samples based on spreads.

What if we just take the average of the distances, instead of differences, i.e.,

$$|x_1 - \bar{x}|,\ |x_2 - \bar{x}|,\ |x_3 - \bar{x}|,\ \dots,\ |x_n - \bar{x}|.$$

Average of the distances from mean is known as mean deviation. For

the autoanalyzer method sample, the distances are:

23, 7, 5, 9, and 26

with an average of 14. On the other hand, the mean deviation for

the microenzymatic method is 4.4.
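Both observations above (signed deviations average to zero, absolute deviations do not) can be verified with a short Python sketch. The function names are illustrative, and the microenzymatic values are reconstructed from the deviations used in Section 2.3.4, so they are an assumption rather than data quoted in the notes.

```python
def mean(values):
    return sum(values) / len(values)


def mean_signed_deviation(values):
    """Average of (x_i - mean); zero for any sample, up to rounding error."""
    m = mean(values)
    return sum(x - m for x in values) / len(values)


def mean_deviation(values):
    """Average of the distances |x_i - mean|."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]  # assumed, reconstructed values

print(mean_signed_deviation(autoanalyzer))  # 0.0
print(mean_deviation(autoanalyzer))         # 14.0
print(mean_deviation(microenzymatic))       # 4.4
```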



2.3.4 Variance and Standard Deviation

In the definition of the mean deviation, we used absolute values of the differences between individual observations and the sample mean. Absolute values are sometimes difficult to deal with. Another measure of spread uses the squared deviations from the mean and averages them over the whole sample. The measure, known as variance, is defined as:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}. \tag{2.3.1}$$

The use of n − 1 instead of n in the denominator has a special justification, which we will discuss in Chapter 6.

Standard deviation is defined as the positive square root of the variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}. \tag{2.3.2}$$

For the autoanalyzer method, the variance is

$$s^2 = \frac{(-23)^2 + (-7)^2 + (-5)^2 + 9^2 + 26^2}{4} = 340.$$

For the microenzymatic method, the variance is

$$s^2 = \frac{(-8)^2 + (-3)^2 + 0^2 + 2^2 + 9^2}{4} = 39.5.$$

The corresponding standard deviations are $s = \sqrt{340} = 18.4$ and $s = \sqrt{39.5} = 6.3$, respectively.

Thus the spread of the autoanalyzer method, as measured by the standard deviation, is approximately three times as large as that of the microenzymatic method.
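The variance and standard deviation calculations can be reproduced with a hand-written function and cross-checked against Python's standard `statistics` module, which also uses the n − 1 denominator. The function name `sample_variance` is illustrative, and the microenzymatic values are reconstructed from the deviations shown above (an assumption).

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]  # assumed, reconstructed values

var_auto = sample_variance(autoanalyzer)
var_micro = sample_variance(microenzymatic)
sd_auto = math.sqrt(var_auto)
sd_micro = math.sqrt(var_micro)

print(var_auto, round(sd_auto, 1))    # 340.0 18.4
print(var_micro, round(sd_micro, 1))  # 39.5 6.3
print(round(sd_auto / sd_micro, 1))   # 2.9, i.e. roughly three times

# Cross-check against the standard library (sample variance, n - 1 denominator).
print(math.isclose(var_auto, statistics.variance(autoanalyzer)))  # True
print(math.isclose(sd_micro, statistics.stdev(microenzymatic)))   # True
```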

Facts about variance and standard deviation

• Variance and standard deviation remain unchanged when all the

observations in the sample are shifted by the same constant. For

example, the following two samples have the same variance (340)

and standard deviation (18.4):

Sample 1: 77, 93, 95, 109, 126

Sample 2: 177, 193, 195, 209, 226

• Standard deviation has the same unit of measurement as the

original samples.



• If the sample points change in scale by a factor of c, the variance changes by a factor of c² and the standard deviation changes by a factor of c.

• Standard deviation is the most widely used measure of spread

(dispersion).
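The shift and scale facts above can be checked numerically with the standard `statistics` module. The two samples are the ones listed above; the scale factor c = 10 is an arbitrary illustrative choice.

```python
import math
import statistics

sample_1 = [77, 93, 95, 109, 126]
sample_2 = [177, 193, 195, 209, 226]  # sample_1 shifted by 100

# Shifting every observation by the same constant leaves the spread unchanged.
print(statistics.variance(sample_1) == statistics.variance(sample_2))  # True
print(round(statistics.stdev(sample_1), 1))                            # 18.4

# Rescaling by c multiplies the variance by c**2 and the standard deviation by c.
c = 10  # illustrative scale factor
scaled = [c * x for x in sample_2]
var_ratio = statistics.variance(scaled) / statistics.variance(sample_2)
sd_ratio = statistics.stdev(scaled) / statistics.stdev(sample_2)
print(math.isclose(var_ratio, c ** 2))  # True
print(math.isclose(sd_ratio, c))        # True
```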

2.3.5 Coefficient of Variation

Suppose you are comparing two distributions having different means.

How would you compare the variability of a sample with mean 10

and standard deviation 5 to a sample with mean 100 and standard

deviation 5? Of course, the former is more variable, as the magnitude

of the standard deviation relative to the mean is much higher for that

sample compared to the latter. The measure coefficient of variation

is designed to account for the magnitude of mean when assessing the

spread. It is defined as:

$$CV = \frac{s}{\bar{x}} \times 100. \tag{2.3.3}$$



For the cholesterol data in Table 2.4 (FOB), the coefficients of variation for the Autoanalyzer and Microenzymatic methods are 9.2% and 3.1%, respectively.
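A short Python sketch of the coefficient of variation, applied to the cholesterol samples. The function name is illustrative, and the Microenzymatic values are reconstructed from the deviations in Section 2.3.4, so they are an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]  # assumed, reconstructed values

cv_auto = coefficient_of_variation(autoanalyzer)
cv_micro = coefficient_of_variation(microenzymatic)
print(f"{cv_auto:.1f}%")   # 9.2%
print(f"{cv_micro:.1f}%")  # 3.1%
```

Because the standard deviation and the mean carry the same unit, the ratio is unit-free, which is what makes it usable across samples measured on different scales.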

2.4 Graphical Representation

2.4.1 Histogram

A histogram is a useful way of presenting data graphically. It presents frequencies (or relative frequencies) on the Y-axis against the data points on the X-axis. The frequencies along with the values are usually referred to as the frequency distribution or distribution. When the number of unique observations is too large, the range of the variable is divided into contiguous intervals and the number of observations belonging to each interval is reported.
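As a rough illustration of grouping a variable into intervals, the sketch below builds a frequency distribution for the birthweights of Example 2.2.1 and prints a text histogram. The 500 g interval width and the function name are assumptions made for this example; a different width gives a different picture of the same data.

```python
from collections import Counter

birthweights = [
    3265, 3260, 3245, 3484, 4146,
    3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541,
    2759, 3248, 3314, 3101, 2834,
]
BIN_WIDTH = 500  # grams; an illustrative choice, not taken from the notes


def frequency_distribution(values, width):
    """Map each interval's lower limit to the number of observations in [lower, lower + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


freq = frequency_distribution(birthweights, BIN_WIDTH)
n = len(birthweights)
for lower, count in freq.items():
    relative = 100 * count / n  # relative frequency in percent
    print(f"{lower}-{lower + BIN_WIDTH - 1}  {'#' * count:<10}  {count:>2}  ({relative:.0f}%)")
```

The printed bars show one peak in the 3000-3499 g interval with thinner counts on either side.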

• Distributions having two tails approximately similar are called

symmetric distributions. For such distributions

• Mean ≈ Median ≈ Mode.



[Figure: Histogram of Menstrual Cycle. X-axis: Time (days), 24 to 38; Y-axis: Relative Frequency, 0 to 40.]

Figure 2.1: Distribution of time intervals between successive menstrual periods (days) of

college women (Table 2.3; Rosner; Page 13). Mean=28.5; Median=28; Mode=28.

• A distribution which has a longer tail on the right is called a positively skewed distribution. For such distributions

• data points to the right of the median tend to be farther from the median in absolute value than points below the median,

• Mean ≥ Median ≥ Mode.

Figure 2.2: Example of a distribution which is neither skewed, nor symmetric.

• Distributions with a tail on the left are known as negatively

skewed distributions. For such distributions

• Mean ≤ Median ≤ Mode.

For more examples on symmetric, positively skewed and negatively

skewed distributions, refer to page 12 of FOB.



2.4.2 Stem-and-leaf Plot

A stem-and-leaf plot is similar to a histogram, but it keeps the plot closer to the actual data by using the observations from the actual sample. It shows the basic shape of the distribution just as a histogram does.

Stem  Leaf        Number
   4  1                1
   3  5566             4
   3  012223333        9
   2  68888            5
   2  1                1

Multiply Stem.Leaf by 10**+3

Figure 2.3: Stem-and-leaf plot for the birthweights data in Table 2.1 (FOB).
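The plot in Figure 2.3 can be reproduced with a short Python sketch. It assumes each birthweight is rounded to the nearest 100 g, the thousands digit is used as the stem and the hundreds digit as the leaf, and each stem is split into a lower row (leaves 0 to 4) and an upper row (leaves 5 to 9). These conventions are inferred from the figure rather than stated in the notes, and the function name is illustrative. Unlike the software output shown in these notes, the sketch does not print empty rows.

```python
from collections import defaultdict

birthweights = [
    3265, 3260, 3245, 3484, 4146,
    3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541,
    2759, 3248, 3314, 3101, 2834,
]


def stem_and_leaf(values, unit=100):
    """Return rows (stem, leaves, count), largest stem first, two rows per stem."""
    rows = defaultdict(list)
    for v in values:
        rounded = (v + unit // 2) // unit  # round half up to the nearest unit
        stem, leaf = divmod(rounded, 10)
        upper_half = 1 if leaf >= 5 else 0  # leaves 5-9 go in the upper row
        rows[(stem, upper_half)].append(leaf)
    result = []
    for key in sorted(rows, reverse=True):
        leaves = "".join(str(d) for d in sorted(rows[key]))
        result.append((key[0], leaves, len(leaves)))
    return result


plot = stem_and_leaf(birthweights)
print("Stem  Leaf        Number")
for stem, leaves, count in plot:
    print(f"{stem:>4}  {leaves:<10}  {count:>6}")
```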

2.4.3 Box plot



Stem  Leaf                       Number
  14  1                               1
  13
  13
  12  58                              2
  12  0                               1
  11  558                             3
  11  1124                            4
  10  55677778                        8
  10  00111124444                    11
   9  566666666777889999             18
   9  0111122223334444               16
   8  555555566666778888889999       24
   8  00000022334                    11
   7  5556666667778899               16
   7  012234                          6
   6
   6
   5  6                               1
   5  0                               1
   4  6                               1

Multiply Stem.Leaf by 10**+1

Figure 2.4: Stem-and-leaf plot for the variable IQF from the dataset “Lead” in the case study described in section 2.9 (FOB).



Figure 2.5: Box plot for the variable IQF from the dataset “Lead” in the case study described in section 2.9 (FOB) by exposure type.

[Figure: side-by-side box plots of IQF (vertical axis, 50 to 140) for LEAD_TYP = 1 and LEAD_TYP = 2.]
