BIOS 2041: Introduction to Statistical Methods
Abdus S Wahed*
*Some of the material in this chapter has been adapted from Dr. John Wilson’s lecture
notes for the same course.
BIOS 2041 Statistical Methods Abdus S. Wahed
Chapter 1
Introduction to Statistical Methods
1.1 What is Statistics?
• Statistics → Science of making inferences about specific “ran-
dom” phenomena based on limited sample materials. The disci-
pline provides methods for answering questions such as
• What effect does air pollution have on the residents of Pitts-
burgh?
• What proportion of Pittsburgh residents invest in stocks or
bonds?
• Is drug A better than drug B in relieving certain asthma
symptoms?
• Does vitamin A prevent cancer?
• Based on this quarter’s performance of stock returns, what
strategy will optimize the expected return in the next quar-
ter?
A central task of statistical analysis is to draw a conclusion ("make an
inference") about a population of interest based on evidence in a
sample from that population.
• Population = the set of all subjects or individuals who could
be measured for some variable of interest. Another viewpoint is
that the population is the group about which you wish to draw
a conclusion.
Example: All women in Allegheny County
• A parameter is a numeric characteristic of a population.
Example: Proportion of women in Allegheny County having a
female relative who has been treated for breast cancer.
• A sample is a subset of the population selected for study. The idea
is that the sample will provide the information used in drawing
the conclusion about the population.
Example: 200 Allegheny County women selected by random-
digit telephone dialing.
• A statistic is a numeric characteristic of a sample.
Example: The observed proportion of women with a female rel-
ative who had been treated for breast cancer is 35%.
• Inference is the conclusion drawn about the population on the basis
of the sample.
Example: The proportion of Allegheny County women having a
female relative who has had breast cancer is 35%.
Another example:
• Population: All patients treated for Acute Myelocytic Leukemia
(AML) who are in first complete remission (CR1).
• Parameter: Median duration of remission of treated AML pa-
tients in CR1.
• Sample: 35 AML / CR1 patients treated at the University of
Pittsburgh Cancer Institute during 2007.
• Statistic: The median duration of CR1 in these 35 patients was
13 months.
• Inference: The median duration of CR1 in patients treated for
AML is 13 months.
1.2 What is Biostatistics?
• Biostatistics → The branch of statistics that applies statistical
methods to medical and biological problems. Biostatisti-
cians help researchers (basic scientists, medical researchers, drug
developers) from the inception of a study to its completion. The
role of a biostatistician in the process is:
• To formulate the research question in concrete terms → hy-
pothesis.
• To plan the experiment/study that will answer the research
question accurately and efficiently e.g.
• How many subjects (mice, patients, machines) will be
needed to answer the research question?
• How would, for example, subjects be assigned to different
groups?
• What data should be collected on each subject?
• How would the data be verified and processed?
• What are the issues with the data? e.g.
• How would the missing data be handled?
• Are there measurement errors in the data? How are they
going to be handled?
• To analyze the collected data to draw conclusions regarding the
hypotheses.
Example 1.2.1. Drug development. XYZ pharmaceuticals has
been conducting research on developing drugs for hepatitis C (Hep C)
treatment since 1990. Their basic science researchers have convinced
the Food and Drug Administration (FDA) through phase I and II trials
that they have discovered a new molecule of the standard interferon
that can be administered once weekly instead of once daily, and
they claim that the drug provides a better response rate than
standard interferon. The company is planning to test the drug on
a large cohort of hepatitis C patients. The statistician assigned to
this study will generally start by asking basic questions like:
1. How would you quantify the response? (Usually a simplified
answer would be: absence of Hep C virus in the serum 24 weeks
after the end of the treatment.)
2. How much improvement do you expect in response rate among
the users of the new drug compared to standard interferon users?
(The Phase II trial would indicate some ball park figure for this.)
Based on the answers, the statistician will
• Formulate the hypothesis in quantitative terms:
\[ H_0 : P_1 = P_2, \qquad (1.2.1) \]
where $P_1$ is the response rate in the standard interferon group and $P_2$
is the response rate for the new treatment (weekly interferon).
• Determine the number of patients to be recruited in the (stan-
dard) daily interferon group and in the (new) weekly interferon
group.
• Make sure that patient safety and privacy are ensured in the
protocol, keeping in mind the objective of the study.
• Devise a randomization scheme (possibly double-blinded) to as-
sign treatments to patients so that the two groups are compara-
ble with respect to patient characteristics.
• Suggest a data collection, verification and management plan.
• How many sites will be used for patient recruitment?
• What data needs to be collected? What system will be used
to transfer the data?
• How will the data be processed?
• What information should be presented to the DSMB (Data
Safety Monitoring Board), and how often?
• What criteria should be used to declare the new treatment sig-
nificantly better?
• How many interim analyses should be planned?
• What criteria should be used for stopping the trial?
• Finally, when the trial ends, the statistician will conduct/oversee
the data analysis to arrive at a conclusion regarding the hypoth-
esis.
In this course, we will mainly talk about:
• Statistical methods to analyze collected data so that answers to
specific questions of interest can be obtained.
• Design issues, for example, sample size and power, etc.
We will cover:
Chapters 1-8 (in full), 10 - 11 (partial).
Chapter 2
Descriptive Statistics
In most cases data consist of many sample points. In a bid to in-
terpret data, the first task is to summarize the data in some concise
manner.
2.1 Types of data.
Data collected, outcomes of experiments, etc. are often referred to
as variables or outcomes, which come in several varieties. The type
of outcome observed plays a role in determining which statistical
procedures are appropriate.
• Categorical (discrete) - data can be assigned to discrete cate-
gories.
• a) Unordered
i) Gender
ii) Political party to which one belongs
iii) Exposed vs not exposed
iv) Disease or no disease
• b) Ordered
i) Good- Better- Best classification
ii) Number of times patient admitted to hospital for illness
during a given year.
• Continuous variables
• a) Ordinary or uncensored
i) Standard scale measurements
- height
- weight
- optical density
- pH
ii) Survival times that are actually observed.
• b) Censored data
i) Survival time: it may be known only that the time is greater
than some observed time.
Here are the first 10 records from a dataset:
Table 2.1: Several records from a dataset
Obs ID AGE SEX LEADTYP IQF
1 101 1101 1 1 70
2 102 905 1 1 85
3 103 1101 1 1 86
4 104 611 1 1 76
5 105 1103 1 1 84
6 106 606 1 2 96
7 107 611 1 2 94
8 108 1500 2 2 56
9 109 702 2 2 115
10 110 703 1 2 97
Many numerical and graphical techniques are available for the pur-
pose of summarizing data. We will start with continuous variables.
2.2 Measures of Location
The first sets of summary measures will define the center (or middle)
of the sample data. Such measures are known as measures of location
or measures of central tendency. We will start with the simplest of
these measures, the arithmetic mean (or simply, the mean).
2.2.1 Arithmetic Mean
Arithmetic mean is the sum of the observations divided by the num-
ber of observations.
Formula:
If X is what is measured (observed) and x1, x2, . . . , xn are the
values of n measurements, then the arithmetic mean is given by the
formula:
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}. \qquad (2.2.1) \]
Example 2.2.1. Table 2.1 (Rosner)
Table 2.2: Sample of birthweights (g) of live-born infants born at a private hospital in San
Diego, California, during a 1-week period.
New-born Weight (g) New-born Weight (g) New-born Weight (g) New-born Weight (g)
1 3265 6 3323 11 2581 16 2759
2 3260 7 3649 12 2841 17 3248
3 3245 8 3200 13 3609 18 3314
4 3484 9 3031 14 2838 19 3101
5 4146 10 2069 15 3541 20 2834
X = birthweights (g) of live-born infants
\[ \bar{x} = \frac{3265 + 3260 + \dots + 2834}{20} = 3166.9 \text{ g}. \qquad (2.2.2) \]
Facts about mean
• Arithmetic mean is easy to compute.
• If the sample points change in scale by a factor of c, the mean
changes by a factor of c.
• In some cases it fails to reflect the center of the sample, specifically
in the presence of unusually high or low values (outliers).
• It is the most widely used measure of location.
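As a rough illustration of the formula and the facts above, the following Python sketch recomputes the mean of the birthweights in Table 2.2. The function name `arithmetic_mean`, the grams-to-kilograms factor, and the added value of 9000 g are illustrative assumptions, not part of the course materials.

```python
# Birthweights (g) copied from Table 2.2 of these notes.
birthweights = [
    3265, 3260, 3245, 3484, 4146,
    3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541,
    2759, 3248, 3314, 3101, 2834,
]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_g = arithmetic_mean(birthweights)
print(round(mean_g, 1))  # 3166.9

# Changing the scale by a factor c changes the mean by the same factor.
# Here c converts grams to kilograms (an illustrative choice).
c = 1 / 1000
mean_kg = arithmetic_mean([c * w for w in birthweights])
print(abs(mean_kg - c * mean_g) < 1e-9)  # True

# A single extreme value pulls the mean away from the bulk of the data.
# The value 9000 is hypothetical, e.g. a recording error.
with_outlier = birthweights + [9000]
print(arithmetic_mean(with_outlier) > mean_g)  # True
```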
2.2.2 Median
Loosely speaking, the median is a number such that in the ordered
sample, half of the sample points lie below it, and half above it.
Formula:
If n is odd, then the $\left(\frac{n+1}{2}\right)$th largest observation is the
median. Otherwise, the median is defined as the average of the
$\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th largest
observations.
Example 2.2.2. Table 2.2 (Rosner). White blood cell counts
(× 1000) for a sample of 9 patients entering a hospital. The ordered
sample is as follows:
3, 5, 7, 8, 8, 9, 10, 12, 35
Here, n = 9, and hence $\frac{n+1}{2} = 5$. The median white blood cell
count for this sample is the 5th observation, which is 8000.
Facts about median:
• Median is not highly influenced by extreme observations, unless
there are only one or two data points.
• Median depends only on one or two middle observations and
hence is less sensitive to the magnitude of other observations in
the sample.
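The odd/even rule above can be sketched in Python as follows, using the white blood cell counts of Example 2.2.2. The function name `median` is an illustrative choice; the last two lines show, under that same rule, how removing the extreme value 35 affects the mean but not the median.

```python
def median(values):
    """Median by the rule above: the middle value if n is odd,
    the average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n+1)/2)th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


# White blood cell counts (x 1000) from Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(median(wbc))                    # 8
print(round(sum(wbc) / len(wbc), 2))  # 10.78

# Dropping the extreme value 35 leaves the median at 8 but moves the mean.
wbc_trimmed = wbc[:-1]
print(median(wbc_trimmed))                  # 8.0
print(sum(wbc_trimmed) / len(wbc_trimmed))  # 7.75
```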
2.2.3 Mode
Mode is the most frequently occurring value in the sample.
In the above example, the mode white blood cell count is 8000 as
it occurs more frequently than any other white blood cell count.
Facts about mode:
• If all the data points occur exactly the same number of times,
then there is no mode.
• A sample with one mode is called unimodal; two modes, bimodal;
three modes, trimodal; and so on.
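A minimal Python sketch of these facts is given below. The function name `modes` is hypothetical, and returning an empty list to signal "no mode" is a convention assumed here; a sample consisting of a single repeated value is treated as having that value as its mode, which is also an assumption.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


# White blood cell counts (x 1000) from Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(modes(wbc))              # [8]      unimodal
print(modes([1, 2, 3, 4]))     # []       no mode
print(modes([1, 1, 2, 2, 3]))  # [1, 2]   bimodal
```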
2.2.4 Geometric Mean
Geometric mean is often used for summarizing ratios, percentages,
indices, or other data sets bounded by zero. The geometric mean of
n positive numbers x1, x2, . . . , xn is defined as the n-th root of their
product.
Formula:
\[ GM = \sqrt[n]{x_1 \times x_2 \times \dots \times x_n} = (x_1 \times x_2 \times \dots \times x_n)^{1/n}. \qquad (2.2.3) \]
In Example 2.2.2, the geometric mean is
\[ (3 \times 5 \times \dots \times 35)^{1/9} = 8.59. \]
Facts about geometric mean:
• Only defined for non-negative numbers.
• Usually, if a distribution on the positive axis is asymmetric, then
a log transformation is used to make it symmetric. For such
distributions the geometric mean is used.
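The link between the geometric mean and the log transformation can be sketched in Python: the geometric mean equals the exponential of the arithmetic mean of the logs. The function name `geometric_mean` is illustrative, and this sketch assumes strictly positive values so that the logarithm exists.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed on the log scale."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch requires strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


# White blood cell counts (x 1000) from Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
gm = geometric_mean(wbc)
print(round(gm, 2))  # 8.59

# Same value from the direct definition (product raised to the power 1/n).
direct = math.prod(wbc) ** (1 / len(wbc))
print(math.isclose(gm, direct))  # True
```

Working on the log scale avoids overflow when the product of many observations becomes very large.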
2.3 Measures of Spread/Variation/Dispersion
Refer to Figure 2.4 (FOB).
2.3.1 Range
Range is the difference between the largest and the smallest
observations. For the birthweights data in Table 2.1 (Rosner), the range is
Range = 4146 − 2069 = 2077 g.
For the data in Figure 2.4 (FOB), the range for the Autoanalyzer
method is
226 − 177 = 49 mg/dl,
whereas the same for the Microenzymatic method is
209 − 192 = 17 mg/dl.
Thus, one can claim that:
• The Microenzymatic method measures cholesterol levels more
consistently than the Autoanalyzer method does. Or,
equivalently,
• Measurements of cholesterol levels using the Microenzymatic method
are more precise than those using the Autoanalyzer method. Or,
equivalently,
• Microenzymatic cholesterol measurements have lower variability
compared to Autoanalyzer cholesterol measurements.
Facts about range:
• Easy to compute.
• Depends highly on the extreme values.
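The two cholesterol ranges can be checked with a short Python sketch. The individual readings are not listed in this section, so the lists below are reconstructed from the deviations about the mean (200) given in Sections 2.3.3 and 2.3.4 of these notes; in particular, the smallest Microenzymatic reading is taken as 192, which is what those deviations and the stated range of 17 imply. The function name `sample_range` is illustrative.

```python
# Cholesterol readings (mg/dl), reconstructed from the deviations about
# the mean listed in Sections 2.3.3 and 2.3.4; not copied from the FOB figure.
autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]


def sample_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


print(sample_range(autoanalyzer))    # 49
print(sample_range(microenzymatic))  # 17
```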
2.3.2 Percentiles/Quantiles and Interquartile Range
The 100pth (0 ≤ p ≤ 1) percentile of a distribution is the value Vp
such that 100p% of the sample points are less than or equal to Vp.
Median is the 50th percentile.
For the birthweights data in Table 2.1, some of the percentiles are
calculated as:
Percentile  Value   How it was calculated from the ordered data
10th        2670.0  n × p = 20 × 0.10 = 2; the average of the 2nd and 3rd observations.
25th        2839.5  n × p = 20 × 0.25 = 5; the average of the 5th and 6th observations.
50th        3246.5  n × p = 20 × 0.50 = 10; the average of the 10th and 11th observations.
75th        3403.5  n × p = 20 × 0.75 = 15; the average of the 15th and 16th observations.
90th        3629.0  n × p = 20 × 0.90 = 18; the average of the 18th and 19th observations.
99th        4146.0  n × p = 20 × 0.99 = 19.8 (not an integer); the 20th observation.
Table 2.3: Percentiles for the Birthweights data in Table 2.1 (Rosner)
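The calculation rule used in the table can be sketched in Python as below. This is one reading of the rule (average two neighbours when n × p is a whole number, otherwise take the next ordered observation); the function name `percentile` and the restriction to whole-number percentages are assumptions made here to keep the arithmetic exact.

```python
def percentile(values, pct):
    """Percentile by the rule used in Table 2.3.

    pct is a whole-number percentage strictly between 0 and 100.
    If n*p is an integer k, average the k-th and (k+1)-th ordered values;
    otherwise take the next ordered value after position n*p.
    """
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point trouble in n * p.
    k, remainder = divmod(n * pct, 100)
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2  # k-th and (k+1)-th, from 1
    return ordered[k]                             # (k+1)-th, counting from 1


# Birthweights (g) from the sample used in Example 2.2.1.
birthweights = [
    3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834,
]

print(percentile(birthweights, 10))  # 2670.0
print(percentile(birthweights, 25))  # 2839.5
print(percentile(birthweights, 50))  # 3246.5
print(percentile(birthweights, 75))  # 3403.5
print(percentile(birthweights, 99))  # 4146
```

Statistical packages implement several slightly different percentile definitions, so library results may differ from this rule for small samples.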
Facts about percentiles
• Percentiles are also known as quantiles.
• Percentiles characterize the relative positioning of the observa-
tions in the sample.
• The spread of the distribution about the center can be characterized
by specifying certain quantiles. For instance, the 25th and 75th
percentiles tell us that the middle half of the sample points lie
between these two values.
• The 25th percentile and the 75th percentile of a distribution are
commonly referred to as 1st (lower) and 3rd (upper) quartiles.
Here are the percentiles for the cholesterol data in Figure 2.4 (FOB):
Method N Lower Quartile Median Upper Quartile IQR
Auto 5 193.0 195.0 209.0 16.0
Micro 5 197.0 200.0 202.0 5.0
Table 2.4: Percentiles for the Cholesterol data in Figure 2.4 (Rosner)
Interquartile range
The distance between the 1st quartile (Q1) and the 3rd quartile (Q3)
is known as the interquartile range (IQR). The interquartile range is
useful for comparing the spread of two distributions as well as for
detecting outliers. The higher the IQR, the more variable the distribution is.
For the cholesterol data, the IQRs for the Autoanalyzer method and the
Microenzymatic method are respectively 16 and 5, which supports our
previous claim that the Autoanalyzer method is not as precise as the
Microenzymatic method.
• For a positively skewed distribution, the distance between the
median and upper quartile is greater than the distance between
median and the lower quartile.
• For a negatively skewed distribution, the distance between the
median and upper quartile is smaller than the distance between
median and the lower quartile. [Birthweights data (Table 2.1,
FOB)]
• For a symmetric distribution, the distance between the median
and upper quartile is approximately equal to the distance between
the median and the lower quartile. [For the menstrual cycle data in
Table 2.3 (FOB), Q1 = 28 = Median, Q3 = 29.]
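The quartile comparison for the birthweights data can be sketched in Python. The `percentile` helper below encodes one reading of the calculation rule from the percentile table earlier in this section and is an assumption of this sketch, as is the variable naming.

```python
def percentile(values, pct):
    """Whole-number percentile: average two neighbours when n*p is an
    integer, otherwise take the next ordered observation."""
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * pct, 100)
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]


# Birthweights (g) from the sample used in Example 2.2.1.
birthweights = [
    3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834,
]

q1, med, q3 = (percentile(birthweights, p) for p in (25, 50, 75))
upper = q3 - med  # distance from the median to the upper quartile
lower = med - q1  # distance from the median to the lower quartile
print(upper, lower)  # 157.0 407.0

if upper > lower:
    print("quartiles suggest positive skew")
elif upper < lower:
    print("quartiles suggest negative skew")
else:
    print("quartiles suggest symmetry")
```

For these data the lower distance is larger, which agrees with the negative-skew classification given in the list above.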
Outliers
Outliers are extremely high or low values that are “isolated” from the
overall distribution. Outliers in a data set can be identified based on
the lower and upper quartiles.
Formula:
An observation x can be treated as an outlier if either
1. x > Q3 + 1.5 ∗ IQR, or
2. x < Q1 − 1.5 ∗ IQR.
Formula:
An observation x is an extreme outlier if either
1. x > Q3 + 3 ∗ IQR, or
2. x < Q1 − 3 ∗ IQR.
Are there any outliers in the cholesterol data set?
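One way to check is the Python sketch below, which applies the two rules above. The cholesterol readings are reconstructed from the deviations about the mean (200) listed in Sections 2.3.3 and 2.3.4, and the `percentile` helper encodes one reading of the calculation rule used for Table 2.3; both, along with the function names, are assumptions of this sketch.

```python
def percentile(values, pct):
    """Whole-number percentile: average two neighbours when n*p is an
    integer, otherwise take the next ordered observation."""
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * pct, 100)
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]


def find_outliers(values, multiplier=1.5):
    """Values beyond Q1 - m*IQR or Q3 + m*IQR (m = 3 for extreme outliers)."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [x for x in values if x < low or x > high]


# Reconstructed cholesterol readings (mg/dl).
autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]
print(find_outliers(autoanalyzer))    # []
print(find_outliers(microenzymatic))  # []

# For contrast, the white blood cell counts of Example 2.2.2.
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(find_outliers(wbc))                # [35]
print(find_outliers(wbc, multiplier=3))  # [35]
```

Under these assumptions neither cholesterol sample contains an outlier, whereas the count of 35 in the white blood cell sample is flagged by both rules.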
2.3.3 Mean deviation
Let us look at the cholesterol data one more time.
[INSERT CHOLESTEROL FIGURE]
Look at how each observation differs from the mean, i.e.,
x1 − x̄, x2 − x̄, x3 − x̄, . . . , xn − x̄.
One way to measure the spread is to look at how sample points in
the data differ from the mean. However, the mean of these differences
is zero for any data set. For the autoanalyzer method sample, the
differences are:
(177 − 200) = −23, (193 − 200) = −7, (195 − 200) = −5,
(209 − 200) = 9, and (226 − 200) = 26,
and the mean difference is zero. The same is true for the microenzymatic
method. Therefore the mean difference about the mean cannot be
used to distinguish between samples based on spread.
What if we take the average of the distances instead of the
differences, i.e., the average of
|x1 − x̄|, |x2 − x̄|, |x3 − x̄|, . . . , |xn − x̄|.
The average of the distances from the mean is known as the mean deviation.
For the autoanalyzer method sample, the distances are:
23, 7, 5, 9, and 26
with an average of 14. On the other hand, the mean deviation for
the microenzymatic method is 4.4.
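The two calculations above can be reproduced with a short Python sketch. The readings are reconstructed by adding the listed deviations to the mean of 200, and the function names are illustrative.

```python
# Cholesterol readings (mg/dl), reconstructed as mean (200) plus the
# deviations listed in these notes.
autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]


def deviations(values):
    """Signed differences between each observation and the sample mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def mean_deviation(values):
    """Average absolute distance from the sample mean."""
    return sum(abs(d) for d in deviations(values)) / len(values)


print(deviations(autoanalyzer))       # [-23.0, -7.0, -5.0, 9.0, 26.0]
print(sum(deviations(autoanalyzer)))  # 0.0, the signed differences cancel
print(mean_deviation(autoanalyzer))    # 14.0
print(mean_deviation(microenzymatic))  # 4.4
```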
2.3.4 Variance and Standard Deviation
In the definition of the mean deviation, we used absolute values of the
differences between individual observations and the sample mean.
Absolute values are sometimes difficult to deal with. Another measure
of spread uses the squared deviations from the mean and averages them
over the whole sample. This measure, known as the variance, is defined
as:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}. \qquad (2.3.1) \]
The use of n − 1 instead of n in the denominator has a special
justification, which we will discuss in chapter 6.
Standard deviation is defined as the positive square root of the
variance:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}. \qquad (2.3.2) \]
For the autoanalyzer method, the variance is
\[ s^2 = \frac{(-23)^2 + (-7)^2 + (-5)^2 + 9^2 + 26^2}{4} = 340. \]
For the microenzymatic method, the variance is
\[ s^2 = \frac{(-8)^2 + (-3)^2 + 0^2 + 2^2 + 9^2}{4} = 39.5. \]
The corresponding standard deviations are respectively $s = \sqrt{340} = 18.4$
and $s = \sqrt{39.5} = 6.3$.
Thus the spread of the autoanalyzer method, as measured by the standard
deviation, is approximately three times as large as that of the
microenzymatic method.
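A minimal Python sketch of formulas (2.3.1) and (2.3.2) follows, with a cross-check against the standard library's `statistics.variance`, which also divides by n − 1. The readings are reconstructed as the mean of 200 plus the deviations used above, and the function name `sample_variance` is illustrative.

```python
import math
import statistics

# Cholesterol readings (mg/dl), reconstructed from the deviations above.
autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


var_auto = sample_variance(autoanalyzer)
var_micro = sample_variance(microenzymatic)
print(var_auto, var_micro)  # 340.0 39.5

sd_auto = math.sqrt(var_auto)
sd_micro = math.sqrt(var_micro)
print(round(sd_auto, 1), round(sd_micro, 1))  # 18.4 6.3
print(round(sd_auto / sd_micro, 1))           # 2.9, roughly three times

# Cross-check against the standard library (same n - 1 denominator).
print(math.isclose(var_auto, statistics.variance(autoanalyzer)))  # True
```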
Facts about variance and standard deviation
• Variance and standard deviation remain unchanged when all the
observations in the sample are shifted by the same constant. For
example, the following two samples have the same variance (340)
and standard deviation (18.4):
Sample 1: 77, 93, 95, 109, 126
Sample 2: 177, 193, 195, 209, 226
• Standard deviation has the same unit of measurement as the
original samples.
• If the sample points change in scale by a factor of c, the variance
changes by a factor of c² and the standard deviation changes by
a factor of c.
• Standard deviation is the most widely used measure of spread
(dispersion).
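The shift and scale facts above can be checked numerically with the standard library. The scale factor c = 10 is an arbitrary illustrative choice.

```python
import math
import statistics

sample_2 = [177, 193, 195, 209, 226]
sample_1 = [x - 100 for x in sample_2]  # 77, 93, 95, 109, 126

# Shifting every observation by the same constant leaves the spread unchanged.
print(statistics.variance(sample_1) == statistics.variance(sample_2) == 340)  # True
print(math.isclose(statistics.stdev(sample_1), statistics.stdev(sample_2)))   # True

# Rescaling by c multiplies the variance by c**2 and the standard deviation by c.
c = 10  # arbitrary illustrative factor
scaled = [c * x for x in sample_2]
print(statistics.variance(scaled) == c**2 * statistics.variance(sample_2))          # True
print(math.isclose(statistics.stdev(scaled), c * statistics.stdev(sample_2)))       # True
```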
2.3.5 Coefficient of Variation
Suppose you are comparing two distributions having different means.
How would you compare the variability of a sample with mean 10
and standard deviation 5 to a sample with mean 100 and standard
deviation 5? Of course, the former is more variable, as the magnitude
of the standard deviation relative to the mean is much higher for that
sample compared to the latter. The coefficient of variation
is designed to account for the magnitude of the mean when assessing the
spread. It is defined as:
\[ CV = \frac{s}{\bar{x}} \times 100. \qquad (2.3.3) \]
For the cholesterol data in Table 2.4 (FOB), the coefficients of
variation for the Autoanalyzer and Microenzymatic methods are
respectively 9.2% and 3.1%.
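Formula (2.3.3) can be sketched in Python as follows. The cholesterol readings are reconstructed from the deviations about the mean of 200 used in Section 2.3.4, and the function name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * statistics.stdev(values) / mean


# Reconstructed cholesterol readings (mg/dl).
autoanalyzer = [177, 193, 195, 209, 226]
microenzymatic = [192, 197, 200, 202, 209]
print(round(coefficient_of_variation(autoanalyzer), 1))    # 9.2
print(round(coefficient_of_variation(microenzymatic), 1))  # 3.1

# The motivating comparison: equal standard deviations, different means.
print(100 * 5 / 10)   # 50.0  (mean 10, standard deviation 5)
print(100 * 5 / 100)  # 5.0   (mean 100, standard deviation 5)
```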
2.4 Graphical Representation
2.4.1 Histogram
A histogram is a useful way of presenting data graphically. It presents
frequencies (or relative frequencies) on the Y-axis against the data
points on the X-axis. The frequencies along with the values are usually
referred to as the frequency distribution or distribution. When the
number of unique observations is too large, the range of the variable
is divided into contiguous intervals and the number of observations
belonging to each interval is reported.
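As a rough sketch of how such a frequency distribution is built, the Python code below groups the birthweights of Example 2.2.1 into intervals and prints a text histogram. The interval width of 500 g is an arbitrary assumption made for illustration; a different width would give a different picture.

```python
from collections import Counter

# Birthweights (g) from the sample used in Example 2.2.1.
birthweights = [
    3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834,
]

width = 500  # assumed interval width in grams
# Map each weight to the lower limit of the interval that contains it.
counts = Counter((w // width) * width for w in birthweights)
n = len(birthweights)

for lower in sorted(counts):
    freq = counts[lower]
    rel = 100 * freq / n  # relative frequency in percent
    print(f"{lower}-{lower + width - 1} g  {freq:2d}  {rel:5.1f}%  {'*' * freq}")
```

Each row shows the interval, its frequency, its relative frequency, and a bar of asterisks whose length equals the frequency.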
• Distributions having two tails approximately similar are called
symmetric distributions. For such distributions
• Mean ≈ Median ≈ Mode.
[Figure: Histogram of Menstrual Cycle. X-axis: Time (days), 24 to 38; Y-axis: Relative Frequency, 0 to 40.]
Figure 2.1: Distribution of time intervals between successive menstrual periods (days) of
college women (Table 2.3; Rosner; Page 13). Mean=28.5; Median=28; Mode=28.
• A distribution which has a longer tail on the right is called a
positively skewed distribution. For such distributions
• data points on the right of the median tend to be farther
from the median in absolute value than points below the median,
• Mean ≥ Median ≥ Mode.
Figure 2.2: Example of a distribution which is neither skewed, nor symmetric.
• Distributions with a tail on the left are known as negatively
skewed distributions. For such distributions
• Mean ≤ Median ≤ Mode.
For more examples on symmetric, positively skewed and negatively
skewed distributions, refer to page 12 of FOB.
2.4.2 Stem-and-leaf Plot
A stem-and-leaf plot is similar to a histogram, but it keeps the plot
closer to the actual data by using the observations from the actual
sample. It shows the basic shape of the distribution just as a
histogram does.
Stem Leaf Number
4 1 1
3 5566 4
3 012223333 9
2 68888 5
2 1 1
Multiply Stem.Leaf by 10**+3
Figure 2.3: Stem-and-leaf plot for the birthweights data in Table 2.1 (FOB).
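A plot of this kind can be produced with a few lines of Python. The sketch below assumes the construction that appears to underlie Figure 2.3: each weight is rounded to the nearest 100 g, the thousands digit is the stem, the hundreds digit is the leaf, and each stem is split into a low half (leaves 0 to 4) and a high half (leaves 5 to 9). The function name `stem_and_leaf` is illustrative, and empty stems are simply omitted.

```python
from collections import defaultdict


def stem_and_leaf(values, unit=100):
    """Return 'stem leaves count' rows, largest stems first.

    Assumes positive integer data. Values are rounded (half up) to the
    nearest `unit`; the last digit of the rounded value is the leaf.
    """
    rows = defaultdict(list)
    for v in values:
        rounded = (v + unit // 2) // unit  # integer round-half-up
        stem, leaf = divmod(rounded, 10)
        half = 1 if leaf >= 5 else 0       # split each stem into low/high
        rows[(stem, half)].append(leaf)
    lines = []
    for stem, half in sorted(rows, reverse=True):
        leaves = "".join(str(d) for d in sorted(rows[(stem, half)]))
        lines.append(f"{stem} {leaves} {len(leaves)}")
    return lines


# Birthweights (g) from the sample used in Example 2.2.1.
birthweights = [
    3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
    2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834,
]

for line in stem_and_leaf(birthweights):
    print(line)
# 4 1 1
# 3 5566 4
# 3 012223333 9
# 2 68888 5
# 2 1 1
```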
2.4.3 Box plot
Stem Leaf Number
14 1 1
13
13
12 58 2
12 0 1
11 558 3
11 1124 4
10 55677778 8
10 00111124444 11
9 566666666777889999 18
9 0111122223334444 16
8 555555566666778888889999 24
8 00000022334 11
7 5556666667778899 16
7 012234 6
6
6
5 6 1
5 0 1
4 6 1
Multiply Stem.Leaf by 10**+1
Figure 2.4: Stem-and-leaf plot for the variable IQF from the dataset “Lead” in the case
study described in section 2.9 (FOB).
Figure 2.5: Box plot for the variable IQF from the dataset “Lead” in the case study
described in section 2.9 (FOB) by exposure type.
[Figure: side-by-side text box plots of IQF (vertical axis, roughly 50 to 140) for LEAD_TYP = 1 and LEAD_TYP = 2.]