STA 120
Introduction to Statistics
Script
Reinhard Furrer
and the Applied Statistics Group
Version July 5, 2018
Contents

Preface

1 Exploratory Data Analysis
   1.1 Types of Data
   1.2 Descriptive Statistics
   1.3 Univariate Data
   1.4 Multivariate Data
   1.5 Examples of Poor Plots
   1.6 Bibliographic Remarks and Links

2 Random Variables
   2.1 Basics of Probability Theory
   2.2 Discrete Distributions
   2.3 Continuous Distributions
   2.4 Expectation, Variance and Moments
   2.5 Independent Random Variables
   2.6 Common Discrete Distributions
   2.7 Common Continuous Distributions
   2.8 Functions of Random Variables
   2.9 Bibliographic Remarks and Links

3 Estimation
   3.1 Point Estimation
   3.2 Construction of Estimators
   3.3 Comparison of Estimators
   3.4 Interval Estimators
   3.5 Bibliographic Remarks and Links

4 Statistical Testing
   4.1 The General Concept of Significance Testing
   4.2 Hypothesis Testing
   4.3 Comparing Means
   4.4 Duality of Tests and Confidence Intervals
   4.5 Multiple Testing
   4.6 Additional Tests
   4.7 Bibliographic Remarks and Links

5 A Closer Look: Proportions
   5.1 Estimation
   5.2 Confidence Intervals
   5.3 Tests
   5.4 Comparison of Proportions
   5.5 Bibliographic Remarks and Links

6 Rank-Based Methods
   6.1 Robust Point Estimates
   6.2 Rank-Based Tests
   6.3 Other Tests
   6.4 Bibliographic Remarks and Links

7 Multivariate Normal Distribution
   7.1 Random Vectors
   7.2 Multivariate Normal Distribution
   7.3 Estimation
   7.4 Bibliographic Remarks and Links
   7.5 Appendix: Positive-Definite Matrices

8 Correlation and Simple Regression
   8.1 Correlation
   8.2 Simple Linear Regression
   8.3 Bibliographic Remarks and Links

9 Multiple Regression
   9.1 Model and Estimators
   9.2 Model Validation
   9.3 Information Criterion
   9.4 Examples

10 Analysis of Variance
   10.1 One-Way ANOVA
   10.2 Two-Way and Complete Two-Way ANOVA
   10.3 Analysis of Covariance
   10.4 Example

11 Extending the Linear Model
   11.1 Unifying the Linear Model
   11.2 Regression with Correlated Data
   11.3 Nonlinear Regression
   11.4 Bibliographic Remarks and Links

12 Bayesian Methods
   12.1 Motivating Example and Terminology
   12.2 Examples
   12.3 Choice and Effect of Prior Distribution
   12.4 Bibliographic Remarks and Links

13 A Closer Look: Monte Carlo Methods
   13.1 Monte Carlo Integration
   13.2 Rejection Sampling
   13.3 Gibbs Sampling
   13.4 Bibliographic Remarks and Links

References
Glossary
Index of Statistical Tests
Dataset Index
Preface
This document accompanies the lecture STA120 Introduction to Statistics that has been
given each spring semester since 2013. The lecture is given in the framework of the minor
in Applied Probability and Statistics (www.math.uzh.ch/aws) and comprises 14 weeks of
two hours of lecture and one hour of exercises per week.
As the lecture’s topics are structured on a week-by-week basis, the script contains
thirteen chapters, each covering “one” topic. Some of the chapters contain consolidations
or in-depth studies of previous chapters. The last week is dedicated to a recap/review of
the material.
I have thought long and hard about an optimal structure for this script. Let me quickly
summarize my thoughts. It is very important that the document has a structure that
is tailored to the content I cover in class each week. This inherently leads to 13 “chapters.”
Instead of covering linear models over four weeks, I framed the material in four seemingly
different chapters. This structure helps me to better frame the lectures: each week has
a start, a set of learning goals and a predetermined end.
So to speak, the script covers not 13 but essentially only three topics:
1. Background
2. Statistical foundations in a (i) frequentist setting and (ii) Bayesian setting
3. Linear Modeling
We will not cover these topics chronologically. This is not necessary and I have opted for a
smoother setting. For example, we do not cover the multivariate Gaussian distribution at
the beginning but just before we need it. This also allows for a recap of several univariate
concepts. We use a path illustrated in Figure 1.
In case you use this document outside the lecture, here are several alternative paths
through the chapters, with a minimal impact on concepts that have not been covered:
• Focusing on linear models: Chapters 1, 2, 7, 3, 4, 8, 9, 10, 11
• Good background in probability: You may omit Chapters 2 and 7
[Figure: flowchart from Start to End through three blocks. Background: Exploratory
Data Analysis, Random Variables, Multivariate Normal Distribution. Statistical
Foundations, split into frequentist (Estimation, Statistical Testing, Proportions,
Rank-Based Methods) and Bayesian (Bayesian Approach, Monte Carlo Methods). Linear
Modeling: Correlation and Simple Regression, Multiple Regression, Analysis of
Variance, Extending the Linear Model.]
Figure 1: Structure of the script
All the datasets that are not part of regular CRAN packages are available via the
url www.math.uzh.ch/furrer/download/sta120/. The script is equipped with appropriate
links that facilitate the download.
The lecture STA120 Introduction to Statistics formally requires the prerequisites MAT183
Stochastic for the Natural Sciences and MAT141 Linear Algebra for the Natural Sciences
or equivalent modules. For their content we refer to the corresponding course web pages
www.math.uzh.ch/fs17/mat183 and www.math.uzh.ch/hs17/mat141. It is possible to
successfully pass the lecture without having taken the aforementioned modules, though
some self-study is then necessary. This script and the accompanying exercises require
some calculus and linear algebra: differentiation, integration, matrix notation and basic
matrix operations, and the concept of solving a linear system of equations. We review
and summarize the relevant concepts of probability theory in Chapter 2.
Many have contributed to this document. A big thanks to all of them, especially
(alphabetically) Zofia Baranczuk, Julia Braun, Eva Furrer, Florian Gerber, Lisa Hofer,
Mattia Molinaro, Franziska Robmann, Leila Schuh and many more. Kelly
Reeve spent many hours improving my English. Without their help, you would not be
reading these lines. Yet, this document needs more than a polishing. Please let me know
of any necessary improvements; I highly appreciate all forms of contributions, be it
errata, examples, or text blocks. Contributions can be deposited directly in the following
Google Doc sheet.
Major errors that were corrected after the lecture of the corresponding semester are
listed at www.math.uzh.ch/furrer/download/sta120/errata18FS.txt. I try hard to ensure
that the pagination of the document does not change after the lecture.
Reinhard Furrer
February 2018
Chapter 1
Exploratory Data Analysis
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter01.R.
Assuming that the data collection process is completed, the “analysis” of the data
is one of the next steps. Foremost, the data needs to be loaded into an appropriate
software environment; here we use R. At the beginning of any formal statistical analysis,
an exploratory data analysis (EDA) is performed, in which observations or measured
values are depicted and described qualitatively and quantitatively. At the end of a
study, results are often summarized graphically, because it is generally easier to interpret
and understand graphics than values in a table. As such, graphical representation of data
is an essential part of statistical analysis, from start to finish. In this chapter, we consider
how different types of data can be visualized and thereby get an introduction to what we
mean by EDA.
1.1 Types of Data
There are several different types of data. An important distinction is whether data is
qualitative or quantitative in nature. Non-numerical data, often gathered from open-
ended responses or in audio-visual form, is considered qualitative. This data can be
measured on a nominal or ordinal scale. With the nominal scale, one can determine the
equality of observations, while with the ordinal scale, rank order can also be determined.
Whereas qualitative data consists of categories, quantitative data is numeric and math-
ematical operations can be performed with it. Quantitative data can be either discrete,
taking on only certain values like qualitative data but in the form of integers, or continu-
ous, taking on any value on the real number line. Quantitative data can be measured on
an interval or ratio scale. Unlike the ordinal scale, the interval scale is uniformly spaced.
The ratio scale is characterized by a meaningful absolute zero in addition to the char-
acteristics of all previous scales. Depending on the measurement scale, certain statistics
are appropriate. The measurement scales are classified according to Stevens (1946) and
summarized in Figure 1.1.
[Figure: the four scales with their admissible mathematical operators and statistical
measures —
Nominal: =, ≠; mode.
Ordinal: additionally <, >; median.
Interval: additionally +, −; arithmetic mean, standard deviation, range.
Ratio: additionally ×, ÷; geometric mean, coefficient of variation, studentized range.]
Figure 1.1: Types of scales according to Stevens (1946) and possible math-
ematical operations. The statistical measures are for a description of location
and spread.
Example 1.1. When measuring temperature in Kelvin (absolute zero, corresponding to
−273.15 °C), a statement such as “The temperature has increased by 20%” can be made. ♣
1.2 Descriptive Statistics
Another very basic summary of the data is its size, the number of variables, and the
number of missing values. Although this is rudimentary information, it is often forgotten
or neglected.
A statistic is a single measure of some attribute of the data, which gives a good first
impression of the distribution of the data. Typical statistics for the location parameter
include the (arithmetic) mean, truncated/trimmed mean, and median, and for the scale
parameter, variance, standard deviation, and interquartile range.
Given discrete data, the mode, and sometimes the median, can be determined. The mode
is the most frequent value of an empirical frequency distribution; in order to calculate the
mode, only the operations = and ≠ are necessary. The (empirical) median is the middle
value (or 50th percentile) of the sorted data and thus the operators < and > are also
necessary. Continuous data are first divided into categories (discretized/binned) and then
the mode and median can be determined.
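As a small sketch in R (with simulated die rolls standing in for real data), the mode and median can be computed as follows:

```r
# Simulated discrete data: 200 rolls of a fair die (illustrative values)
set.seed(1)
x <- sample(1:6, size=200, replace=TRUE)

# Mode: the most frequent value; only the operations = and != are needed
tab <- table(x)
mode.x <- as.integer(names(tab)[which.max(tab)])

# Median: the middle value of the sorted data; < and > are also needed
median.x <- median(x)

c(mode=mode.x, median=median.x)
```

Note that R’s built-in mode() returns the storage mode of an object, not the statistical mode, hence the detour via table().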
Example 1.2. In R-Code 1.1 several summary statistics are calculated for 293 observations
of mercury content in Lake Geneva sediments (www.math.uzh.ch/furrer/download/
sta120/lemanHg.csv). ♣
R-Code 1.1 A quantitative EDA of the mercury dataset (subset of the ‘leman’
dataset).
Hg <- read.csv('data/lemanHg.csv')$Hg
str( Hg)
## num [1:293] 0.17 0.21 0.06 0.24 0.35 0.14 0.08 0.26 0.23 0.18 ...
c( mean=mean( Hg), tr.mean=mean( Hg, trim=.1), median=median( Hg))
## mean tr.mean median
## 0.46177 0.43238 0.40000
c( var=var( Hg), sd=sd( Hg), iqr=IQR( Hg))
## var sd iqr
## 0.090146 0.300243 0.380000
summary( Hg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 0.250 0.400 0.462 0.630 1.770
sum( is.na( Hg)) # or any( is.na( Hg))
## [1] 0
tail( sort( Hg)) # largest values
## [1] 1.28 1.29 1.30 1.31 1.47 1.77
Another important aspect is the identification of outliers, which are defined (verbatim
from Olea (1991)): “In a sample, any of the few observations that are separated so far
in value from the remaining measurements that the questions arise whether they belong
to a different population, or that the sampling technique is faulty. The determination
that an observation is an outlier may be highly subjective, as there is no strict criteria
for deciding what is and what is not an outlier.” Graphical representations of the data
often help in the identification of outliers.
1.3 Classic Plots for Univariate Data
Univariate data is composed of one variable, or a single scalar component. Graphical
depiction of univariate data is usually accomplished with histograms, box plots, Q-Q
plots, or bar plots.
Histograms illustrate the frequency distribution of observations graphically and are
easy to construct and to interpret. However, the number of bins (categories to break the
data down into) is a subjective choice that affects the look of the histogram and several
valid rules of thumb exist for choosing the optimal number of bins. R-Code 1.2 and the
associated Figure 1.2 illustrate the construction and resulting histograms of the mercury
dataset. In one of the histograms, a “smoothed density” has been superimposed. Such
curves will be helpful when comparing the data with different statistical models, as we
will see in later chapters. The histograms show that the data is unimodal and right-skewed,
with no exceptional values.
R-Code 1.2 Different histograms (good and bad ones) for the mercury dataset. (See
Figure 1.2.)
histout <- hist( Hg)
hist( Hg, col=7, probability=TRUE, main="With 'smoothed density'")
lines( density( Hg))
hist( Hg, col=7, breaks=90, main="Too many bins")
hist( Hg, col=7, breaks=2, main="Too few bins")
str( histout[1:3]) # Contains essentially all information of histogram
## List of 3
## $ breaks : num [1:10] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8
## $ counts : int [1:9] 58 94 61 38 26 9 5 1 1
## $ density: num [1:9] 0.99 1.604 1.041 0.648 0.444 ...
[Figure with four panels: “Histogram of Hg” (default), “With 'smoothed density'”,
“Too many bins”, “Too few bins”.]
Figure 1.2: Histograms with various bin sizes. (See R-Code 1.2.)
When constructing histograms for discrete data (e.g., integer values), one has to be
careful with the binning. Often it is better to force the bins manually. To represent the
result of many dice tosses, it would be advisable to use hist( x, breaks=seq( from=0.5,
to=6.5, by=1)), or possibly bar plots, as explained towards the end of this section.
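This advice can be sketched as follows, with simulated dice tosses (the data is made up for illustration):

```r
# Simulated result of 600 tosses of a fair die
set.seed(2)
x <- sample(1:6, size=600, replace=TRUE)

# Default binning may lump adjacent integer values into a single bar
hist(x, main="Default breaks")

# Forcing one bin per face value centers each bar on its integer
hist(x, breaks=seq(from=0.5, to=6.5, by=1), main="One bin per value")
```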
A stem-and-leaf plot is similar to a histogram, however, this plot is rarely used today
(Figure 1.3).
Figure 1.3: Stem-and-leaf plot of residents of the municipality of Staldenried
as of 31.12.1993 (Gemeinde Staldenried, 1994).
A box plot is a graphical representation of five statistics of the frequency distribution
of observations: the minimum and maximum values, the lower and upper quartiles, and
the median. Advantages of the box plot include
• It is a quantitative representation of important statistics.
• Symmetry is quickly detected visually.
• Outliers are denoted.
R-Code 1.3 and Figure 1.4 illustrate the box plot for the mercury data. Due to the right
skewness of the data, there are several data points beyond the whiskers.
R-Code 1.3 Box plot. (See Figure 1.4.)
boxplot( Hg)
boxplot( Hg, col="LightBlue", notch=TRUE, ylab="Hg",
outlty=1, outpch='', outcol=2)
quantile( Hg, c(0.25, 0.75)) # compare with summary( Hg)
## 25% 75%
## 0.25 0.63
IQR( Hg)
## [1] 0.38
quantile(Hg, 0.75) + 1.5 * IQR( Hg)
## 75%
## 1.2
Hg[ quantile(Hg, 0.75) + 1.5 * IQR( Hg) < Hg]
## [1] 1.25 1.30 1.47 1.31 1.28 1.29 1.77
[Figure with two panels, y-axis Hg.]
Figure 1.4: Box plots: simple and notched versions. (See R-Code 1.3.)
A quantile-quantile plot (Q-Q plot) is used to visually compare empirical data quantiles
with the quantiles of a theoretical distribution (we will talk more about “theoretical
distributions” in the next chapter). The ordered values are compared with the i/(n+ 1)-
quantiles. In practice, some software uses (i− a)/(n+ 1− 2a), a ∈ [0, 1]. R-Code 1.4 and
Figure 1.5 illustrate a Q-Q plot for the mercury dataset by comparing it to a normal
distribution and a (so-called) exponential distribution. In the case of a good fit, the points
align almost on a straight line. To guide the eye, one often adds a line to the
plot.
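The construction can be sketched by hand; the following code mimics the essence of qqnorm() using the plotting positions i/(n + 1), with simulated data standing in for the mercury values:

```r
# Simulated right-skewed data standing in for the mercury values
set.seed(3)
y <- rexp(293, rate=2.1)

n <- length(y)
p <- (1:n)/(n + 1)               # plotting positions i/(n+1)
plot(qnorm(p), sort(y),          # theoretical vs. ordered sample quantiles
     xlab="Theoretical Quantiles", ylab="Sample Quantiles")
```

R’s own ppoints(n) implements the variant (i − a)/(n + 1 − 2a), with a = 3/8 for n ≤ 10 and a = 1/2 otherwise.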
R-Code 1.4 Q-Q plot. (See Figure 1.5.)
qqnorm( Hg)
qqline( Hg, col=2)
qqplot( qexp( ppoints( 293), rate=2.1), Hg)
qqline( Hg, distribution=function(p) qexp( p, rate=2.1),
prob=c(0.1, 0.6), col=2)
[Figure with two panels: “Normal Q−Q Plot” (theoretical quantiles vs. sample
quantiles) and Hg vs. qexp(ppoints(293), rate = 2.1).]
Figure 1.5: Q-Q plots using the normal distribution (left) and exponential
distribution (right). (See R-Code 1.4.)
The frequency distribution of discrete data is often displayed with a bar plot. The
height of the bars is proportional to the frequency of the corresponding value. The
German language discriminates between bar plots with vertical and horizontal orientation
(‘Säulendiagramm’ and ‘Balkendiagramm’).
R-Code 1.5 and Figure 1.6 illustrate bar plots with data giving aggregated CO2 emis-
sions for the year 2005 (SWISS Magazine, 2011).
Do not use pie charts unless absolutely necessary. Pie charts are often difficult to
read. When slices are similar in size, it is nearly impossible to distinguish which is larger.
Bar plots allow an easier comparison.
R-Code 1.5 Emissions sources for the year 2005 as presented by the SWISS Magazine
10/2011-01/2012, page 107 (SWISS Magazine, 2011). (See Figure 1.6.)
dat <- c(2, 15, 16, 32, 25, 10) # see Figure 1.9
emissionsource <- c('Air', 'Transp', 'Manufac', 'Electr',
'Deforest', 'Other')
barplot( dat, names=emissionsource, ylab="Percent", las=2)
barplot( cbind('2005'=dat), col=c(2,3,4,5,6,7), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', xlim=c(0.2,4))
[Figure with two panels showing the emission sources Air, Transp, Manufac, Electr,
Deforest, Other in percent.]
Figure 1.6: Bar plots: juxtaposed bars (left), stacked (right). (See R-Code 1.5.)
1.4 Visualizing Multivariate Data
Multivariate data means that two or more variables are of interest and collected for each
observation. Visualization of multivariate data is often accomplished with scatter plots
(with plot(x,y) or pairs(...)) so that the relationship between the variables may be
illustrated. For three variables, an interactive visualization based on the package rgl
might be helpful.
R-Code 1.6 and Figure 1.7 illustrate the pairs plot for mercury, cadmium and zinc
content in sediment samples taken in Lake Geneva. For some of the samples, the content
is high for all three metals.
In scatter plots, “guide-the-eye” lines are often included. In such situations, some care
is needed, as there is a perception asymmetry between y versus x and x versus y. We
will discuss this further in Chapter 8.
R-Code 1.6 Mercury (multivariate), data available at www.math.uzh.ch/furrer/
download/sta120/lemanHgCdZn.csv.
metal <- read.csv( 'data/lemanHgCdZn.csv')
str( metal)
## 'data.frame': 293 obs. of 3 variables:
## $ Hg: num 0.17 0.21 0.06 0.24 0.35 0.14 0.08 0.26 0.23 0.18 ...
## $ Cd: num 0.23 0.37 0.14 0.3 0.56 0.3 0.17 0.44 0.39 0.26 ...
## $ Zn: num 72.2 98.2 81.6 131 160 125 48 131 127 106 ...
# More EDA should be here!!!
pairs( metal, gap=0, main='')
# top left is as plot( Hg~Cd, data=metal)
[Figure: pairwise scatter plots of Hg, Cd and Zn.]
Figure 1.7: Scatter plot matrix. (See R-Code 1.6.)
[Figure with two panels comparing the SWISS and c2es.org emission figures, sources
Air, Transp, Manufac, Electr, Deforest, Other in percent.]
Figure 1.8: Bar plots for two variables: stacked (left), grouped (right). (See
R-Code 1.7.)
In the case of several frequency distributions, bar plots, either stacked or grouped,
may also be used in an intuitive way. See R-Code 1.7 and Figure 1.8 for two slightly
different partitions of emission sources.
R-Code 1.7 Emissions sources for the year 2005 from www.c2es.org/facts-
figures/international-emissions/sector (approximate values). (See Figure 1.8.)
dat2 <- c(2, 10, 12, 28, 26, 22) # source c2es.org
mat <- cbind( SWISS=dat, c2es.org=dat2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(0.2,5), legend=emissionsource,
args.legend=list(bty='n') ,ylab='Percent', las=2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(1,30), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', beside=TRUE, las=2)
1.5 Examples of Poor Plots
Figures 1.9 and 1.10 are examples of faulty plots and charts. Additional examples from
renowned academic journals can be found at www.biostat.wisc.edu/~kbroman/toptenworstgraphs/.
The ‘discussion’ links point out the mistakes and give recommendations
for improvement.
Figure 1.9: SWISS Magazine 10/2011-01/2012, 107.
Figure 1.10: Bad example (above) and improved but still not ideal graphic
(below). Figures from university documents.
1.6 Bibliographic Remarks and Links
The term exploratory data analysis (EDA) was coined by Tukey (1977). The series of
books (Tufte, 1983, 1990, 1997b,a) is a visual pleasure to read and gives a plethora of
useful information on visualization and graphical communication.
Chapter 2
Random Variables
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter02.R.
2.1 Basics of Probability Theory
In order to define the term random variable, some basic principles are needed first.
The set of all possible outcomes of an experiment is called the sample space, denoted
by Ω. Each outcome of an experiment ω ∈ Ω is called an elementary event. A subset of
the sample space Ω is called an event, denoted by A ⊂ Ω.
Informally, a probability P can be considered the value of a function defined for an
event of the sample space and assuming values in the interval [0, 1], that is, P(A) ∈ [0, 1].
In order to introduce probability theory formally, one needs some technical terms (σ-
algebra, measure space, . . . ). However, the axiomatic structure of A. Kolmogorov can
also be described accessibly as follows and is sufficient as a basis for our purposes.
A probability measure must satisfy the following axioms:
i) 0 ≤ P(A) ≤ 1, for every event A,
ii) P(Ω) = 1,
iii) P(∪i Ai) = Σi P(Ai), for Ai ∩ Aj = ∅, i ≠ j.
Informally, a probability function P assigns a value in [0, 1], i.e., the probability, to
each event of the sample space, subject to:
i) the probability of an event is never smaller than 0 or greater than 1,
ii) the probability of the whole sample space is 1,
iii) the probability of several events is equal to the sum of the individual probabilities,
if the events are mutually exclusive.
Probabilities are often visualized with Venn diagrams (Figure 2.1), which clearly and
intuitively illustrate more complex facts, such as:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),   (2.1)
P(A | B) = P(A ∩ B) / P(B).   (2.2)
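Identity (2.1) can be checked empirically with relative frequencies; the sketch below simulates die rolls, with events A and B chosen purely for illustration:

```r
# Simulate many rolls of a fair die
set.seed(4)
omega <- sample(1:6, size=1e5, replace=TRUE)
A <- omega <= 4          # event A = {1, 2, 3, 4}
B <- omega %% 2 == 0     # event B = {2, 4, 6}

# Relative frequencies approximate the corresponding probabilities
pA <- mean(A); pB <- mean(B); pAB <- mean(A & B)

# P(A u B) = P(A) + P(B) - P(A n B); theoretically 4/6 + 3/6 - 2/6 = 5/6
mean(A | B)
pA + pB - pAB
```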
We consider a random variable as a function that assigns values to the outcomes
(events) of a random experiment; these values, or values in an interval, are assumed
with certain probabilities. They are called realizations of the random variable.
Figure 2.1: Venn diagram of events A, B and C in the sample space Ω
The following definition gives a (unique) characterization of random variables. In
subsequent sections, we will see additional characterizations. These, however, will depend
on what type of values the random variable can take.
Definition 2.1. The distribution function (cumulative distribution function, cdf) of a
random variable X is
F (x) = FX(x) = P(X ≤ x), for all x. (2.3)
♦
Property 2.1. A distribution function FX(x) is
i) monotonically increasing, i.e., for x < y, FX(x) ≤ FX(y);
ii) right-continuous, i.e., lim_{ε↓0} FX(x + ε) = FX(x), for all x ∈ R;
iii) normalized, i.e., lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.
Random variables are denoted with uppercase letters (e.g. X, Y ), while realizations
are denoted by the corresponding lowercase letters (x, y). This means that the theoretical
concept, or the random variable as a function, is denoted by uppercase letters. Actual
values or data, for example the columns in your dataset, would be denoted with lower-
case letters. For further characterizations of random variables, we need to differentiate
according to the sample space of the random variables.
2.2 Discrete Distributions
A random variable is called discrete when it can assume only a finite or countably infinite
number of values, as illustrated in the following two examples.
Example 2.1. Let X be the sum of the roll of two dice. The random variable X assumes
the values 2, 3, . . . , 12. Figure 2.2 illustrates the probabilities and distribution function.
The distribution function (as for all discrete random variables) is piece-wise constant with
jumps equal to the probability of that value. ♣
Example 2.2. A boy practices free throws, i.e., foul shots to the basket standing at a
distance of 15 ft to the board. Let the random variable X be the number of throws
that are necessary until the boy succeeds. Theoretically, there is no upper bound on this
number. Hence X can take the values 1, 2, . . . . ♣
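This example can be simulated; assuming, purely as an illustration, a success probability of 0.4 per throw (this value is not from the text), the number of throws until the first success is one plus a geometric random variable in R’s parametrization:

```r
set.seed(5)
p <- 0.4                        # assumed per-throw success probability
x <- rgeom(1e4, prob=p) + 1     # rgeom counts failures before the success

mean(x)        # close to the theoretical expectation 1/p = 2.5
max(x)         # no fixed upper bound: large values occur occasionally
```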
Another way of describing discrete random variables is the probability mass function,
defined as follows.
Definition 2.2. The probability mass function (pmf) of a discrete random variable X is
defined by fX(x) = P(X = x). ♦
In other words, the pmf gives the probability that the random variable takes one single
value, while the cdf gives the probability that the random variable takes that or any
smaller value.
Property 2.2. Let X be a discrete random variable with mass function fX(x) and dis-
tribution function FX(x). Then:
i) The probability mass function satisfies fX(x) ≥ 0 for all x ∈ R.
ii) Σi fX(xi) = 1.
iii) The values fX(xi) > 0 are the “jumps” in xi of FX(x).
iv) FX(xi) = Σ_{k: xk ≤ xi} fX(xk).
The last two points show that there is a one-to-one relation (also called a bijection)
between the cdf and pmf. Given one, we can construct the other.
Figure 2.2 illustrates the pmf and cdf of the random variable X as given in Example 2.1.
The jump locations and sizes (discontinuities) of the cdf correspond to the probabilities
given in the left panel. Notice that we have emphasized the right continuity of the cdf
(see Property 2.1.ii) with the additional dot.
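For Example 2.1, the bijection can be made explicit in R: cumulative sums of the probabilities yield the cdf at the jump locations, and the jump sizes recover the pmf:

```r
x <- 2:12
p <- c(1:6, 5:1)/36        # pmf of the sum of two dice
Fx <- cumsum(p)            # cdf evaluated at the jump locations

# Recover the pmf as the jump sizes of the cdf
p.back <- diff(c(0, Fx))
all.equal(p, p.back)       # TRUE
```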
R-Code 2.1 Probabilities and distribution function of X as given in Example 2.1.
(See Figure 2.2.)
x <- 2:12
p <- c(1:6,5:1)/36
plot( x, p, type='h', ylim=c(0, .2),
xlab=expression(x[i]), ylab=expression(p[i]))
points( x, p, pch = 19)
plot.ecdf( outer(1:6,1:6,"+"), ylab=expression(F[X](x)), main='')
[Figure with two panels over x = 2, . . . , 12: probabilities pi (left) and FX(x) (right).]
Figure 2.2: Probability mass function (left) and cumulative distribution func-
tion (right) of X = the sum of the roll of two dice. (See R-Code 2.1.)
2.3 Continuous Distributions
A random variable is called continuous if it can (theoretically) assume any value from one
or several intervals. This means that the number of possible values within the sample space
is infinite, and it is therefore impossible to assign positive probabilities to all the (elementary)
events. In other words, given an infinite number of possible outcomes, the probability of
one particular value being the outcome is zero. For this reason, we need to consider
outcomes that are contained in some interval. Hence, the probability is described by an
integral, as an area under the probability density function, which is formally defined as
follows.
Definition 2.3. The probability density function (density function, pdf) fX(x), or density
for short, of a continuous random variable X is defined by
P(a < X ≤ b) = ∫_a^b fX(x) dx,  a < b.   (2.4)
♦
The density function does not directly give a probability and thus cannot be compared
to the probability mass function. The following properties are nevertheless similar to
Property 2.2.
Property 2.3. Let X be a continuous random variable with density function fX(x) and
distribution function FX(x). Then:
i) The density function satisfies fX(x) ≥ 0 for all x ∈ R and fX(x) is continuous
almost everywhere.
ii) ∫_{−∞}^{∞} fX(x) dx = 1.
iii) fX(x) = F′X(x) = dFX(x)/dx.
iv) FX(x) = ∫_{−∞}^{x} fX(y) dy.
v) The cumulative distribution function FX(x) is continuous everywhere.
vi) P(X = x) = 0.
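Properties iii and iv can be checked numerically; the sketch below uses the standard normal density (a distribution introduced formally later) as an example:

```r
# P(a < X <= b) as the area under the density, for a standard normal
a <- -1; b <- 1
area <- integrate(dnorm, lower=a, upper=b)$value

# The same probability via the cdf: F(b) - F(a)
c(integral=area, cdf.diff=pnorm(b) - pnorm(a))
```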
As given by Property 2.3.iii and iv, there is again a bijection between the density
function and the cumulative distribution function: if we know one, we can construct the
other. Actually, there is a third characterization of random variables, called the quantile
function, which is essentially the inverse of the cdf. That means that we are interested in
values x for which F(x) = p. For (strictly) monotone cumulative distribution functions,
the quantile function is the inverse of the distribution function: QX(p) = FX^(−1)(p).
To be precise, we have to be slightly more careful, as the inverse might not always
exist, e.g., if the cdf has a plateau. The quantile function returns the minimum value of
x from amongst all those values with P(X ≤ x) ≥ p. Formally, this is defined as follows.
Definition 2.4. The quantile function Q_X(p) of a random variable X with cdf F_X(x) is
defined by

Q_X(p) = inf{x | F_X(x) ≥ p},   0 < p < 1.   (2.5)

♦
This latter definition is general, such that we do not have to distinguish between
continuous or discrete random variables. In any case, the quantile function should be
understood as the generalized inverse function of the cdf.
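In R, the built-in q* functions implement exactly this generalized inverse; a small base-R check:

```r
# For a strictly increasing cdf, the quantile function is the plain inverse:
x <- c(-1.5, 0, 2)
stopifnot(isTRUE(all.equal(qnorm(pnorm(x)), x)))   # Q_X(F_X(x)) = x

# For a discrete cdf with plateaus, qbinom returns the smallest k
# with P(X <= k) >= p, i.e., the generalized inverse of Definition 2.4.
stopifnot(pbinom(4, size = 10, prob = 0.5) < 0.5,
          pbinom(5, size = 10, prob = 0.5) >= 0.5,
          qbinom(0.5, size = 10, prob = 0.5) == 5)
```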
Example 2.3. The continuous uniform distribution U(a, b) is defined by a constant density
function over the interval [a, b], a < b, i.e.,

f(x) = 1/(b − a)  if a ≤ x ≤ b,   and   f(x) = 0  otherwise.

The quantile function is Q_X(p) = a + p(b − a) for 0 < p < 1. Figure 2.3 shows the density
and cumulative distribution function of the uniform distribution U(0, 1). ♣
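In R, the quantile function of the uniform distribution is qunif; it matches Q_X(p) = a + p(b − a) (interval bounds below chosen for illustration):

```r
a <- 2; b <- 5                      # illustrative interval bounds
p <- c(0.1, 0.5, 0.9)
stopifnot(isTRUE(all.equal(qunif(p, min = a, max = b), a + p * (b - a))))
```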
R-Code 2.2 Density and distribution function of a uniform distribution. (See Fig-
ure 2.3.)
plot( c(-1,0,NA,0,1,NA,1,2), c(0,0,NA,1,1,NA,0,0), type='l',
xlab='x', ylab=expression(f[X](x)))
plot( c(-1,0,1,2), c(0,0,1,1), type='l',
xlab='x', ylab=expression(F[X](x)))
Figure 2.3: Density and distribution function of the uniform distribution
U(0, 1). (See R-Code 2.2.)
2.4 Expectation, Variance and Moments
The density, cumulative distribution function or quantile function each uniquely characterize
a random variable. Often we do not require such a complete description and "summary"
values are sufficient. We introduce a measure of location and one of scale (more will be
seen in Chapter 6).
Definition 2.5. The expectation of a discrete random variable X is defined by

E(X) = Σ_i x_i P(X = x_i).   (2.6)

The expectation of a continuous random variable X is defined by

E(X) = ∫_R x f_X(x) dx,   (2.7)

where f_X(x) denotes the density of X. ♦
The expectation is "linked" to the average (or empirical mean, mean) if we have a set
of realizations thought to be from the particular random variable. Similarly, the variance,
the expectation of the squared deviation of a random variable from its expectation,
is "linked" to the empirical variance (var). This link will be formalized in later
chapters.
Definition 2.6. The variance of X is defined by

Var(X) = E((X − E(X))²).   (2.8)

♦
Property 2.4. For an "arbitrary" real function g we have:

i) E(g(X)) = Σ_i g(x_i) P(X = x_i), if X is discrete,
   E(g(X)) = ∫_R g(x) f_X(x) dx, if X is continuous.

Regardless of whether X is discrete or continuous, the following rules apply:

ii) Var(X) = E(X²) − (E(X))²,

iii) E(a + bX) = a + b E(X), for given constants a and b,

iv) Var(a + bX) = b² Var(X), for given constants a and b,

v) E(aX + bY) = a E(X) + b E(Y), for a random variable Y.
The second-to-last property seems somewhat surprising. But starting from the
definition of the variance, one quickly realizes that the variance is not a linear operator:

Var(a + bX) = E((a + bX − E(a + bX))²) = E((a + bX − (a + b E(X)))²)
            = E(b²(X − E(X))²) = b² Var(X).   (2.9)
Example 2.4. We consider again the setting of Example 2.1, and a straightforward calculation
shows that

E(X) = Σ_{i=2}^{12} i P(X = i) = 7, by equation (2.6),   (2.10)
     = 2 Σ_{i=1}^{6} i · 1/6 = 2 · 7/2 = 7, by using Property 2.4.v first.   (2.11)

♣
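The moments of the two-dice sum can be verified numerically from the probability mass function of R-Code 2.1:

```r
x <- 2:12
p <- c(1:6, 5:1) / 36                       # pmf of the sum of two dice
stopifnot(isTRUE(all.equal(sum(x * p), 7)))   # E(X), equation (2.6)
v <- sum(x^2 * p) - sum(x * p)^2              # Var(X) via Property 2.4.ii
stopifnot(isTRUE(all.equal(v, 35/6)))         # Var(X) = 35/6
```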
2.5 Independent Random Variables
Often we have not only one random variable but many of them. Here we see a first way
to characterize a set of random variables.
Definition 2.7. Two random variables X and Y are independent if

P({X ∈ A} ∩ {Y ∈ B}) = P(X ∈ A) P(Y ∈ B).   (2.12)

The random variables X_1, …, X_n are independent if

P(∩_{i=1}^n {X_i ∈ A_i}) = Π_{i=1}^n P(X_i ∈ A_i).   (2.13)

♦
The definition also implies that the joint density and joint cumulative distribution are
simply the products of the individual ones, also called the marginal ones.
We will often use many independent random variables with a common distribution
function.
Definition 2.8. A random sample X_1, …, X_n consists of n independent random variables
with the same distribution F. We write X_1, …, X_n iid∼ F, where iid stands for
"independent and identically distributed." The number n of random variables is called
the sample size. ♦
The iid assumption is crucial and relaxing it has severe implications for statistical
modeling. Luckily, independence also implies a simple formula for the variance
of the sum of two or more random variables.
Property 2.5. Let X and Y be two independent random variables. Then

i) Var(aX + bY) = a² Var(X) + b² Var(Y).

Let X_1, …, X_n iid∼ F with E(X_1) = µ and Var(X_1) = σ². Denote X̄ = (1/n) Σ_{i=1}^n X_i. Then

ii) E(X̄) = E((1/n) Σ_{i=1}^n X_i) = E(X_1) = µ,

iii) Var(X̄) = Var((1/n) Σ_{i=1}^n X_i) = (1/n) Var(X_1) = σ²/n.
The latter two properties will be used when we investigate statistical properties of the
sample mean, i.e., linking the empirical mean x̄ = (1/n) Σ_{i=1}^n x_i with the random
sample mean X̄ = (1/n) Σ_{i=1}^n X_i.
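A small simulation sketch of these two properties for the sample mean (sample size and σ chosen for illustration):

```r
set.seed(1)
n <- 20; sigma <- 2
# 10000 realizations of the sample mean of n iid N(0, sigma^2) variables
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
stopifnot(abs(mean(xbar)) < 0.05)                # E(Xbar) = mu = 0
stopifnot(abs(var(xbar) - sigma^2 / n) < 0.02)   # Var(Xbar) = sigma^2 / n
```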
2.6 Common Discrete Distributions
There is of course no limit on the number of different random variables. In practice,
we can often reduce our framework to a few common distributions.
2.6.1 Binomial Distribution
A random experiment with exactly two possible outcomes (for example: success/failure,
heads/tails, male/female) is called a Bernoulli trial. For simplicity, we code the sample
space with '1' (success) and '0' (failure):

P(X = 1) = p,   P(X = 0) = 1 − p,   0 < p < 1,   (2.14)
where the cases p = 0 and p = 1 are typically not relevant. Thus
E(X) = p, Var(X) = p(1− p). (2.15)
If a Bernoulli experiment is repeated n times (resulting in an n-tuple of zeros and ones),
the random variable X = "number of successes" is intuitive. The distribution of X is
called the binomial distribution, denoted by X ∼ Bin(n, p), and the following applies:

P(X = k) = (n choose k) p^k (1 − p)^{n−k},   0 < p < 1,   k = 0, 1, …, n,   (2.16)

and

E(X) = np,   Var(X) = np(1 − p).   (2.17)
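The binomial moments (2.17) can be checked numerically with dbinom (parameter values below are illustrative):

```r
n <- 10; p <- 0.3; k <- 0:n            # illustrative parameter values
pk <- dbinom(k, size = n, prob = p)    # pmf (2.16)
stopifnot(isTRUE(all.equal(sum(pk), 1)))
stopifnot(isTRUE(all.equal(sum(k * pk), n * p)))                          # E(X) = np
stopifnot(isTRUE(all.equal(sum(k^2 * pk) - (n * p)^2, n * p * (1 - p))))  # Var(X)
```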
2.6.2 Poisson Distribution
The Poisson distribution gives the probability of a given number of events occurring in
a fixed interval of time if these events occur at a known and constant rate over time.
One way to formally introduce such a random variable is by giving its probability mass
function. The exact form thereof is not very important.
Definition 2.9. A random variable X whose probability mass function is given by

P(X = k) = λ^k/k! · exp(−λ),   λ > 0,   k = 0, 1, …,   (2.18)

is said to follow a Poisson distribution, denoted by X ∼ Pois(λ). Further,

E(X) = λ,   Var(X) = λ.   (2.19)

♦
The Poisson distribution is also a good approximation for the binomial distribution
with large n and small p.
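A quick numerical illustration of this approximation, with the illustrative values n = 1000 and p = 0.002 (so λ = np = 2):

```r
n <- 1000; p <- 0.002; k <- 0:10
# binomial pmf vs its Poisson approximation with lambda = n*p
stopifnot(max(abs(dbinom(k, n, p) - dpois(k, lambda = n * p))) < 0.01)
```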
2.7 Common Continuous Distributions
Example 2.3 introduced a first common continuous distribution. In this section we first
see the most commonly used one, the normal distribution, and then some distributions
that are derived as functions of normally distributed random variables. We will
encounter these again later, for example, the t-distribution in Chapter 4 and the
F-distribution in Chapter 9.
2.7.1 The Normal Distribution
The normal or Gaussian distribution is probably the best-known distribution, having the
omnipresent "bell-shaped" density. Its importance is mainly due to the fact that the sum
of many random variables (under iid or more general settings) is distributed as a normal
random variable. This is due to the celebrated central limit theorem. As in the case of a
Poisson random variable, we define the normal distribution by giving its density.
Definition 2.10. The random variable X is said to be normally distributed if

F_X(x) = ∫_{−∞}^{x} f_X(y) dy   (2.20)

with density function

f(x) = f_X(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)),   (2.21)

for all x (µ ∈ R, σ > 0). We denote this by X ∼ N(µ, σ²).
The random variable Z = (X − µ)/σ (the so-called z-transformation) is standard
normal and its density and distribution function are usually denoted by ϕ(z) and Φ(z),
respectively. ♦
While the exact form of the density (2.21) is not important, recognizing it will be very
useful. In particular, for a standard normal random variable, the density is proportional
to exp(−z²/2).
The following property is essential and will be used consistently throughout this work.
We justify the first statement later in this chapter. The second is a result of the particular
form of the density.
Property 2.6. i) Let X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1). Conversely, if
Z ∼ N(0, 1), then σZ + µ ∼ N(µ, σ²).

ii) Let X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²) be independent, then
aX_1 + bX_2 ∼ N(aµ_1 + bµ_2, a²σ_1² + b²σ_2²).
The cumulative distribution function Φ has no closed form and the corresponding
probabilities must be determined numerically. In the past, so-called "standard tables"
were often used and included in statistics books; Table 2.1 gives an excerpt of such a
table. Nowadays even "simple" pocket calculators have the corresponding functions. It is
probably worthwhile to remember Φ(1) ≈ 84%, Φ(2) ≈ 98%, Φ(3) ≈ 100%, as well as
Φ(1.64) ≈ 95% and Φ(1.96) ≈ 97.5%.
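These values are easily verified with pnorm:

```r
stopifnot(round(pnorm(1), 2) == 0.84)
stopifnot(round(pnorm(2), 2) == 0.98)
stopifnot(round(pnorm(1.64), 2) == 0.95)
stopifnot(round(pnorm(1.96), 3) == 0.975)
stopifnot(isTRUE(all.equal(pnorm(-2), 1 - pnorm(2))))  # symmetry of the density
```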
Table 2.1: Probabilities of the standard normal distribution. The table gives
the value of Φ(zp). For example, Φ(0.2 + 0.04) = 0.595.
z_p   0.00  0.02  0.04  0.06  0.08
0.0  0.500 0.508 0.516 0.524 0.532
0.1  0.540 0.548 0.556 0.564 0.571
0.2  0.579 0.587 0.595 0.603 0.610
0.3  0.618 0.626 0.633 0.641 0.648
0.4  0.655 0.663 0.670 0.677 0.684
0.5  0.691 0.698 0.705 0.712 0.719
 ...
1.0  0.841 0.846 0.851 0.855 0.860
 ...
1.6  0.945 0.947 0.949 0.952 0.954
1.7  0.955 0.957 0.959 0.961 0.962
1.8  0.964 0.966 0.967 0.969 0.970
1.9  0.971 0.973 0.974 0.975 0.976
2.0  0.977 0.978 0.979 0.980 0.981
 ...
3.0  0.999 0.999 ...
R-Code 2.3 Calculation of the “z-table”, see Table 2.1.
y <- seq( 0, by=.02, length=5)
x <- c( seq( 0, by=.1, to=.5), 1, seq(1.6, by=.1, to=2), 3)
round( pnorm( outer( x, y, "+")), 3)
Example 2.5. Let X ∼ N(4, 9). Then

i) P(X ≤ −2) = P((X − 4)/3 ≤ (−2 − 4)/3) = P(Z ≤ −2) = Φ(−2)
   = 1 − Φ(2) = 1 − 0.977 = 0.023.

ii) P(|X − 3| > 2) = 1 − P(|X − 3| ≤ 2) = 1 − P(−2 ≤ X − 3 ≤ 2)
   = 1 − (P(X ≤ 5) − P(X ≤ 1)) = 1 − Φ((5 − 4)/3) + Φ((1 − 4)/3) ≈ 0.5281. ♣
Using a central limit theorem argument, we can show that the distribution of a binomial
random variable X ∼ Bin(n, p) converges to that of a normal random variable
as n → ∞. Thus, the normal distribution N(np, np(1 − p)) can be used as an
approximation for the binomial distribution Bin(n, p). For the approximation,
n should be larger than 30 for p ≈ 0.5. For p closer to 0 or 1, n needs to be much larger.
R-Code 2.4 Density, distribution, and quantile functions of the standard normal
distribution. (See Figure 2.4.)
plot( dnorm, -2, 2, ylim=c(-1,2))
abline( c(0, 1), h=c(0,1), col='gray') # diag and horizontal lines
plot( pnorm, -2, 2, col=3, add=TRUE)
plot( qnorm, 0, 1, col=4, add=TRUE)
Figure 2.4: Density (black), distribution function (green), and quantile func-
tion (blue) of the standard normal distribution. (See R-Code 2.4.)
To calculate probabilities, we often apply a so-called continuity correction, as illustrated
in the following example.
Example 2.6. Let X ∼ Bin(30, 0.5). Then P(X ≤ 10) = 0.049, "exactly". However,

P(X ≤ 10) ≈ P((X − np)/√(np(1 − p)) ≤ (10 − np)/√(np(1 − p)))
          = Φ((10 − 15)/√(30/4)) = 0.034,   (2.22)

P(X ≤ 10) ≈ P((X + 0.5 − np)/√(np(1 − p)) ≤ (10 + 0.5 − np)/√(np(1 − p)))
          = Φ((10.5 − 15)/√(30/4)) = 0.05.   (2.23)
♣
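The three probabilities of Example 2.6 can be reproduced with pbinom and pnorm:

```r
n <- 30; p <- 0.5
exact <- pbinom(10, size = n, prob = p)                  # "exact" binomial cdf
appr  <- pnorm((10   - n * p) / sqrt(n * p * (1 - p)))   # without correction
apprc <- pnorm((10.5 - n * p) / sqrt(n * p * (1 - p)))   # with continuity correction
stopifnot(round(exact, 3) == 0.049,
          round(appr, 3)  == 0.034,
          round(apprc, 2) == 0.05)
stopifnot(abs(apprc - exact) < abs(appr - exact))        # the correction helps
```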
2.7.2 Chi-Square Distribution
Let Z_1, …, Z_n iid∼ N(0, 1). The distribution of the random variable

χ²_n = Σ_{i=1}^n Z_i²   (2.24)

is called the chi-square distribution (χ² distribution) with n degrees of freedom. The
following applies:

E(χ²_n) = n,   Var(χ²_n) = 2n.   (2.25)
Here and for the next two distributions, we do not give the densities as they are
very complex. Similarly, the expectation and the variance here and for the next two
distributions are for reference only.
The chi-square distribution is used in numerous statistical tests.
The definition of a chi-square random variable is based on a sum of independent
random variables. Hence, in view of the central limit theorem, the following results are
not surprising. If n > 50, we can approximate the chi-square distribution with a normal
distribution, i.e., χ²_n is distributed approximately N(n, 2n). Furthermore, for χ²_n with
n > 30, the random variable X = √(2χ²_n) is approximately normally distributed with
expectation √(2n − 1) and standard deviation 1.
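The first approximation can be checked numerically, e.g., for n = 100 (an illustrative choice):

```r
n <- 100
q <- seq(60, 140, by = 10)
# chi-square cdf vs its N(n, 2n) approximation on a grid of quantiles
stopifnot(max(abs(pchisq(q, df = n) - pnorm(q, mean = n, sd = sqrt(2 * n)))) < 0.03)
```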
R-Code 2.5 Chi-square distribution for various degrees of freedom. (See Figure 2.5.)
x <- seq( 0, to=50, length=150)
plot(x, dchisq( x, df=1), type='l', ylab='Density')
for (i in 1:6)
lines( x, dchisq(x, df=2^i), col=i+1)
legend( "topright", legend=2^(0:6), col=1:7, lty=1, bty="n")
2.7.3 Student’s t-Distribution
In later chapters, we use the so-called t-test when, for example, comparing empirical
means. The test is based on the distribution that we define next.
Let Z ∼ N(0, 1) and X ∼ χ²_m be two independent random variables. The distribution of

T_m = Z/√(X/m)   (2.26)

is called the t-distribution (or Student's t-distribution) with m degrees of freedom. We
have:

E(T_m) = 0, for m > 1;   (2.27)
Var(T_m) = m/(m − 2), for m > 2.   (2.28)
Figure 2.5: Densities of the Chi-square distribution for various degrees of free-
dom. (See R-Code 2.5.)
The density is symmetric around zero and as m → ∞ it converges to the
standard normal density ϕ(x) (see Figure 2.6).
R-Code 2.6 t-distribution for various degrees of freedom. (See Figure 2.6.)
x <- seq( -3, to=3, length=100)
plot( x, dnorm(x), type='l', ylab='Density')
for (i in 0:6)
lines( x, dt(x, df=2^i), col=i+2)
legend( "topright", legend=2^(0:6), col=2:8, lty=1, bty="n")
Remark 2.1. For m = 1, 2 the density is heavy-tailed and the variance of the distribution
does not exist. Realizations of this random variable occasionally manifest as extremely
large values. Of course, the empirical variance can still be calculated (see R-Code 2.7). We
come back to this issue in Chapter 5.
Figure 2.6: Densities of the t-distribution for various degrees of freedom. The
normal distribution is in black. A density with 2^7 = 128 degrees of freedom
would make the normal density function appear thicker. (See R-Code 2.6.)
R-Code 2.7 Empirical variances of the t-distribution with one degree of freedom.
set.seed( 14)
tmp <- rt( 1000, df=1)
print( c(summary( tmp), Var=var( tmp)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. Var
## -191.00 -1.12 -0.02 8.11 1.01 5730.00 37390.66
sort( tmp)[1:10] # many "large" values, but 2 exceptionally large
## [1] -190.929 -168.920 -60.603 -53.736 -47.764 -43.377 -36.252
## [8] -31.498 -30.029 -25.596
sort( tmp, decreasing=TRUE)[1:10]
## [1] 5726.531 2083.682 280.848 239.752 137.363 119.157 102.702
## [8] 47.376 37.887 32.443
2.7.4 F-Distribution
The F-distribution is mainly used to compare two empirical variances with each other,
as we will see in Chapter 11.
Let X ∼ χ²_m and Y ∼ χ²_n be two independent random variables. The distribution of

F_{m,n} = (X/m)/(Y/n)   (2.29)

is called the F-distribution with m and n degrees of freedom. It holds that:

E(F_{m,n}) = n/(n − 2), for n > 2;   (2.30)
Var(F_{m,n}) = 2n²(m + n − 2)/(m(n − 2)²(n − 4)), for n > 4.   (2.31)
Figure 2.7 shows the density for various degrees of freedom.
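The expectation (2.30) can be checked by numerical integration, e.g., for m = 5 and n = 10 (illustrative choices):

```r
m <- 5; n <- 10
# first moment of the F density
e <- integrate(function(x) x * df(x, df1 = m, df2 = n), 0, Inf)$value
stopifnot(abs(e - n / (n - 2)) < 1e-3)   # E(F_{m,n}) = 10/8 = 1.25
```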
R-Code 2.8 F -distribution for various degrees of freedom. (See Figure 2.7.)
x <- seq(0, to=4, length=500)
df1 <- c( 1, 2, 5, 10, 50, 100, 250)
df2 <- c( 1, 50, 10, 50, 50, 300, 250)
plot( x, df( x, df1=1, df2=1), type='l', ylab='Density')
for (i in 2:length(df1))
lines( x, df(x, df1=df1[i], df2=df2[i]), col=i)
legend( "topright", legend=c(expression(F[list(1,1)]),
expression(F[list(2,50)]), expression(F[list(5,10)]),
expression(F[list(10,50)]),expression(F[list(50,50)]),
expression(F[list(100,300)]),expression(F[list(250,250)])),
col=1:7, lty=1, bty="n")
Figure 2.7: Density of the F -distribution for various degrees of freedom. (See
R-Code 2.8.)
2.7.5 Beta Distribution
A random variable X with density

f_X(x) = c · x^{α−1} (1 − x)^{β−1},   x ∈ [0, 1],   α > 0,   β > 0,   (2.32)

where c is a normalization constant, is said to be beta distributed with parameters α and
β. We write this as X ∼ Beta(α, β). The normalization constant cannot be written in
closed form for all parameters α and β. For α = β the density is symmetric around 1/2,
and for α > 1, β > 1 the density is concave with mode (α − 1)/(α + β − 2). For arbitrary
α > 0, β > 0 we have:

E(X) = α/(α + β);   (2.33)
Var(X) = αβ/((α + β + 1)(α + β)²).   (2.34)
Figure 2.8 shows densities of the beta distribution for various pairs of (α, β).
The beta distribution is mainly used to model probabilities. We encounter this distri-
bution again in Chapter 12.
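The mode and the expectation (2.33) can be checked numerically, e.g., for α = 2, β = 4 (illustrative choices):

```r
a <- 2; b <- 4
# numerical mode of the density vs (alpha-1)/(alpha+beta-2)
mo <- optimize(function(x) dbeta(x, a, b), c(0, 1), maximum = TRUE)$maximum
stopifnot(abs(mo - (a - 1) / (a + b - 2)) < 1e-3)   # mode = 1/4
# first moment vs alpha/(alpha+beta)
m1 <- integrate(function(x) x * dbeta(x, a, b), 0, 1)$value
stopifnot(abs(m1 - a / (a + b)) < 1e-4)             # E(X) = 1/3
```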
R-Code 2.9 Densities of beta distributed random variables for various pairs of (α, β).
(See Figure 2.8.)
p <- seq( 0, to=1, length=100)
a.seq <- c( 1:6, .8, .4, .2, 1, .5, 2)
b.seq <- c( 1:6, .8, .4, .2, 4, 4, 4)
col <- c( 1:6, 1:6)
lty <- rep( 1:2, each=6)
plot( p, dbeta( p, 1, 1), type='l', ylab='Density', xlab='x',
xlim=c(0,1.3), ylim=c(0,3))
for ( i in 2:length(a.seq))
lines( p, dbeta(p, a.seq[i], b.seq[i]), col=col[i], lty=lty[i])
legend("topright", legend=c(expression( list( alpha, beta)),
paste(a.seq, b.seq)),
col=c(NA,col), lty=c(NA, lty), cex=.9, bty='n')
Figure 2.8: Densities of beta distributed random variables for various pairs of
(α, β). (See R-Code 2.9.)
2.8 Functions of Random Variables
In the previous sections we saw different examples of often-used, classical random
variables. These examples are often not enough, and through a modeling approach we need
additional ones. In this section we illustrate how to construct the cdf and pdf of a random
variable that is a function of one whose density we know.
Let X be a random variable with distribution function F_X(x). We define a random
variable Y = g(X). The cumulative distribution function of Y is written as

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y).   (2.35)

In many cases g(·) is invertible (and differentiable) and we obtain

F_Y(y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)), if g^{−1} is monotonically increasing,
F_Y(y) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)), if g^{−1} is monotonically decreasing.   (2.36)

To derive the probability mass function we apply Property 2.2. In the more interesting
setting of continuous random variables, the density function is derived by Property 2.3.iii
and is thus

f_Y(y) = |d g^{−1}(y)/dy| · f_X(g^{−1}(y)).   (2.37)
Example 2.7. Let X be a random variable with cdf F_X(x) and pdf f_X(x). We consider
Y = a + bX, for b > 0 and a arbitrary. Hence, g(·) is a linear function and its inverse
g^{−1}(y) = (y − a)/b is monotonically increasing. The cdf of Y is thus F_X((y − a)/b) and
the pdf is f_X((y − a)/b) · 1/b. This fact has already been stated in Property 2.6 for
Gaussian random variables. ♣
Example 2.8. Let X ∼ U(0, 1) and, for 0 < x < 1, set g(x) = −log(1 − x), thus
g^{−1}(y) = 1 − exp(−y). Then the distribution and density function of Y = g(X) are

F_Y(y) = F_X(g^{−1}(y)) = g^{−1}(y) = 1 − exp(−y),   (2.38)
f_Y(y) = |d g^{−1}(y)/dy| · f_X(g^{−1}(y)) = exp(−y),   (2.39)

for y > 0. This random variable is called an exponential random variable (with rate
parameter one). Notice further that g(x) is the quantile function of this random variable.
♣
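Example 2.8 is the basis of inverse transform sampling; a short simulation check:

```r
set.seed(2)
u <- runif(1e5)
y <- -log(1 - u)                 # Y = g(X) from Example 2.8
# empirical cdf of y vs the exponential cdf with rate 1, at a few points
for (q in c(0.5, 1, 2))
  stopifnot(abs(mean(y <= q) - pexp(q, rate = 1)) < 0.01)
# g is indeed the quantile function of the rate-1 exponential
stopifnot(isTRUE(all.equal(qexp(0.3, rate = 1), -log(1 - 0.3))))
```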
As we are often interested in summarizing a random variable by its mean and variance,
there is a very convenient shortcut.
The expectation and the variance of a transformed random variable Y can be
approximated by the so-called delta method. The idea consists of a Taylor expansion
around the expectation E(X):

g(X) ≈ g(E(X)) + g′(E(X)) · (X − E(X))   (2.40)

(two terms of the Taylor series). Thus

E(Y) ≈ g(E(X)),   (2.41)
Var(Y) ≈ g′(E(X))² · Var(X).   (2.42)
Example 2.9. Let X ∼ B(1, p) and Y = X/(1 − X). Thus,

E(Y) ≈ p/(1 − p);   Var(Y) ≈ (1/(1 − p)²)² · p(1 − p) = p/(1 − p)³.   (2.43)
♣
Example 2.10. Let X ∼ B(1, p) and Y = log(X). Thus

E(Y) ≈ log(p),   Var(Y) ≈ (1/p)² · p(1 − p) = (1 − p)/p.   (2.44)
♣
Of course, in the case of a linear transformation (as, e.g., in Example 2.7), equation (2.40)
is an equality and thus relations (2.41) and (2.42) are exact, in line with Property 2.6.
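A small simulation sketch of the delta method, using the illustrative choice g(x) = exp(x) with X normal and small variance (values not from the script):

```r
set.seed(3)
mu <- 2; sig <- 0.05
x <- rnorm(1e5, mean = mu, sd = sig)
y <- exp(x)                                 # g(x) = exp(x), so g'(x) = exp(x)
stopifnot(abs(mean(y) - exp(mu)) < 0.01 * exp(mu))                      # (2.41)
stopifnot(abs(var(y) - exp(mu)^2 * sig^2) < 0.05 * exp(mu)^2 * sig^2)   # (2.42)
```

The approximation is good here because the variance of X is small, so the first-order Taylor expansion (2.40) is accurate over the range of realizations.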
2.9 Bibliographic Remarks and Links
In general, Wikipedia has nice summaries of many distributions. The page
https://en.wikipedia.org/wiki/List_of_probability_distributions lists many of them.
The ultimate reference for (univariate) distributions is the encyclopedic series of
Johnson et al. (2005, 1994, 1995). Figure 1 of Leemis and McQueston (2008) illustrates
extensively the links between countless univariate distributions.
Chapter 3
Estimation
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter03.R.
A central point of statistics is to draw information from observations (measurements,
data, a subset of a population) and to infer from the data towards our hypotheses or possibly
towards the population. This requires foremost data and a statistical model. Such a
statistical model describes the data through the use of distributions, which
means the data are considered a realization of the distribution. The model comprises
unknown quantities, so-called parameters. The goal of statistical estimation is to
determine plausible values for the parameters of the model using the observed data.
The following is an example of a statistical model:
Yi = µ+ εi, i = 1, . . . , n, (3.1)
where Y_i are the observations, µ is an unknown constant and ε_i are random variables
representing measurement error. It is often reasonable to assume E(ε_i) = 0 with a symmetric
density. Here, we even assume ε_i iid∼ N(0, σ²). Thus, Y_1, …, Y_n are normally distributed
with mean µ and variance σ².
Since typically both parameters µ and σ² are unknown, we first address the question of
how to determine plausible values for these model parameters from observed data.
Example 3.1. In R-Code 3.1, hemoglobin levels of blood samples from patients with Hb
SS and Hb S/β sickle cell disease are given (Husler and Zimmermann, 2010). The data
are summarized in Figure 3.1.
Equation (3.1) can be used as a simple statistical model for both diseases, with µ
representing the population mean and ε_i the individual deviation from the population
mean.
Natural questions that arise are: What are plausible values of the population mean?
How much do the individual deviations vary? ♣
R-Code 3.1 Hemoglobin levels of patients with sickle cell disease. (See Figure 3.1.)
HbSS <- c( 7.2, 7.7, 8, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7,
9.1, 9.1, 9.1, 9.8, 10.1, 10.3)
HbSb <- c(8.1, 9.2, 10,10.4,10.6,10.9,11.1,11.9,12.0,12.1)
boxplot( list(HbSS=HbSS, HbSb=HbSb), col=c(3,4))
qqnorm( HbSS, xlim=c(-2,2), ylim=c(7,12), col=3, main='')
qqline( HbSS, col=3)
tmp <- qqnorm( HbSb, plot.it=FALSE)
points( tmp, col=4)
qqline( HbSb, col=4)
Figure 3.1: Hemoglobin levels of patients with Hb SS and Hb S/β sickle cell
disease. (See R-Code 3.1.)
3.1 Point Estimation
For estimation, the data is considered a realization of a random sample with a certain
(parametric) distribution. The goal of point estimation is to provide a value for the
parameters of the distribution based on the data at hand.
Definition 3.1. A statistic is an arbitrary function of a random sample Y1, . . . , Yn and
is therefore also a random variable.
An estimator is a statistic used to extract information about a parameter from the
random sample.
A point estimate is the value of the estimator evaluated at y1, . . . , yn, the realizations
of the random sample.
Estimation (or estimating) is the process of finding a (point) estimate. ♦
3.2. CONSTRUCTION OF ESTIMATORS 35
Example 3.2. i) The numerical values shown in R-Code 3.2 are estimates.

ii) Ȳ = (1/n) Σ_{i=1}^n Y_i is an estimator.
ȳ = (1/10) Σ_{i=1}^{10} y_i = 10.6 is a point estimate.

iii) S² = 1/(n − 1) Σ_{i=1}^n (Y_i − Ȳ)² is an estimator.
s² = (1/9) Σ_{i=1}^{10} (y_i − ȳ)² = 1.65 or s = 1.28 is a point estimate. ♣
R-Code 3.2 Some summary statistics of hemoglobin levels of patients with sickle
cell disease.
c( mean( HbSS), mean( HbSb)) # means for both diseases
## [1] 8.7125 10.6300
var( HbSS) # here and below spread measures
## [1] 0.71317
c( var(HbSb), sum( (HbSb-mean(HbSb))^2)/(length(HbSb)-1) )
## [1] 1.649 1.649
c( sd( HbSS), sqrt( var( HbSS)))
## [1] 0.84449 0.84449
Often, we denote parameters with Greek letters (µ, σ, λ, …), with θ being the generic
one. The estimator and the estimate of a parameter θ are both denoted by θ̂. Context
makes clear which of the two is meant.
3.2 Construction of Estimators
We now consider three examples of how estimators are constructed for arbitrary distribu-
tions.
3.2.1 Ordinary Least Squares
The ordinary least squares method of parameter estimation minimizes the sum of squares
of the differences between observed responses and those predicted by a linear function of
the explanatory variables.
Let Y_1, …, Y_n be iid with E(Y_i) = µ. The least squares estimator is based on

µ̂ = µ̂_LS = argmin_µ Σ_{i=1}^n (Y_i − µ)²,   (3.2)

and thus one has the estimator µ̂_LS = Ȳ and the estimate µ̂_LS = ȳ.
Often, the parameter θ is linked to the expectation E(Y_i) through some function, say
g. In such a setting, we have

θ̂ = θ̂_LS = argmin_θ Σ_{i=1}^n (Y_i − g(θ))².   (3.3)

In many situations, for example in linear regression settings, simple closed-form solutions
exist.
3.2.2 Method of Moments
The method of moments is based on the following idea. The parameters of the distribution
are expressed as functions of the moments, e.g., E(Y), E(Y²). The sample moments
are then plugged into these equations in order to obtain the estimators:

µ := E(Y),   µ̂ = (1/n) Σ_{i=1}^n Y_i = Ȳ,   (3.4)
µ₂ := E(Y²),   µ̂₂ = (1/n) Σ_{i=1}^n Y_i².   (3.5)

By using the observed values of a random sample in the method of moments estimator,
the estimates of the corresponding parameters are obtained.
Example 3.3. Let Y_1, …, Y_n iid∼ Exp(λ). From

E(Y) = 1/λ,   Ȳ = 1/λ̂,   we obtain   λ̂ = λ̂_MM = 1/Ȳ.   (3.6)

Thus, the estimate of λ is the value 1/ȳ. ♣
Example 3.4. Let Y_1, …, Y_n iid∼ F with expectation µ and variance σ². Since
Var(Y) = E(Y²) − E(Y)² (Property 2.4.ii), we can write σ² = µ₂ − µ² and we have the estimator

σ̂²_MM = (1/n) Σ_{i=1}^n Y_i² − Ȳ² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)².   (3.7)

♣
3.2.3 Likelihood Method
The likelihood method chooses as estimate the value for which the observed data are
most likely to stem from the model. We consider the probability density function (for
continuous random variables) or the probability mass function (for discrete random
variables) as a function of the parameter θ, i.e.,

f_Y(y) = f_Y(y; θ) −→ L(θ) := f_Y(y; θ),   (3.8)
p_i = P(Y = y_i) = P(Y = y_i; θ) −→ L(θ) := P(Y = y_i; θ).   (3.9)

For a given distribution, we call L(θ) the likelihood.
Definition 3.2. The maximum likelihood estimator θ̂_ML of the parameter θ is based on
maximizing the likelihood, i.e.,

θ̂_ML = argmax_θ L(θ).   (3.10)

♦
Since a random sample contains independent and identically distributed random variables,
the likelihood is the product of the individual densities (see Section 2.5). Since
θ̂_ML = argmax_θ L(θ) = argmax_θ log(L(θ)), the log-likelihood ℓ(θ) := log(L(θ)) can be
maximized instead. The log-likelihood is often preferred because the expressions simplify
and maximizing sums is much easier than maximizing products.
Example 3.5. Let Y_1, …, Y_n iid∼ Exp(λ), thus

L(λ) = Π_{i=1}^n f_Y(y_i) = Π_{i=1}^n λ exp(−λ y_i) = λ^n exp(−λ Σ_{i=1}^n y_i).   (3.11)

Then

dℓ(λ)/dλ = d(n log(λ) − λ Σ_{i=1}^n y_i)/dλ = n/λ − Σ_{i=1}^n y_i,   (3.12)

which, set to zero, yields

λ̂ = λ̂_ML = n/Σ_{i=1}^n y_i = 1/ȳ.   (3.13)

In this case (as in others), λ̂_ML = λ̂_MM. ♣
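The estimate λ̂ = 1/ȳ can be illustrated on simulated data (the rate 2.5 is an arbitrary illustrative choice):

```r
set.seed(4)
y <- rexp(1e4, rate = 2.5)
lambda_hat <- 1 / mean(y)                # (3.13), also the MM estimate (3.6)
stopifnot(abs(lambda_hat - 2.5) < 0.1)
ll <- function(l) length(y) * log(l) - l * sum(y)   # log-likelihood of (3.11)
stopifnot(ll(lambda_hat) >= ll(0.9 * lambda_hat),
          ll(lambda_hat) >= ll(1.1 * lambda_hat))   # indeed a maximum
```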
In the vast majority of cases, maximum likelihood estimators possess very nice properties.
Intuitively, because we use information about the density and not only about the moments,
they are "better" than method of moments estimators. Further, for many common
random variables, the likelihood function has a single optimum, in fact a maximum, for
all permissible θ.
3.3 Comparison of Estimators
In Example 3.4 we saw an estimator that divides the sum of the squared deviances by n,
whereas the default denominator used in R is n−1 (see R-Code 3.2 or Example 3.2). There
are different estimators for a particular parameter possible. We now see two measures to
compare them.
Definition 3.3. An estimator θ̂ of a parameter θ is unbiased if

E(θ̂) = θ,   (3.14)

otherwise it is biased. The value E(θ̂) − θ is called the bias. ♦
Example 3.6. Let Y_1, …, Y_n iid∼ N(µ, σ²).

i) Ȳ is unbiased for µ, since

E(Ȳ) = E((1/n) Σ_{i=1}^n Y_i) = (1/n) · n · E(Y_i) = µ.   (3.15)

ii) S² = 1/(n − 1) Σ_{i=1}^n (Y_i − Ȳ)² is unbiased for σ². To show this we use the
following two identities:

(Y_i − Ȳ)² = (Y_i − µ + µ − Ȳ)² = (Y_i − µ)² + 2(Y_i − µ)(µ − Ȳ) + (µ − Ȳ)²,   (3.16)
Σ_{i=1}^n 2(Y_i − µ)(µ − Ȳ) = 2(µ − Ȳ) Σ_{i=1}^n (Y_i − µ) = 2(µ − Ȳ)(nȲ − nµ)   (3.17)
                            = −2n(µ − Ȳ)²,   (3.18)

i.e., we rewrite the square such that the cross-term simplifies with the second square
term. Collecting all terms finally leads to

(n − 1)S² = Σ_{i=1}^n (Y_i − µ)² − n(µ − Ȳ)²,   (3.19)
(n − 1) E(S²) = Σ_{i=1}^n Var(Y_i) − n E((µ − Ȳ)²).   (3.20)

By Property 2.5.iii, E((µ − Ȳ)²) = Var(Ȳ) = σ²/n and thus

(n − 1) E(S²) = nσ² − n · σ²/n = (n − 1)σ².   (3.21)

iii) σ̂² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)² is biased, since

E(σ̂²) = ((n − 1)/n) · E(1/(n − 1) Σ_{i=1}^n (Y_i − Ȳ)²) = ((n − 1)/n) · σ²,   (3.22)

where the expectation in the middle term is E(S²) = σ².
The bias is

E(σ̂²) − σ² = ((n − 1)/n) σ² − σ² = −σ²/n,   (3.23)

which amounts to an underestimation of the variance. ♣
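Both statements can be illustrated by a small simulation (n = 5 and σ² = 4 chosen for illustration):

```r
set.seed(5)
n <- 5; sigma2 <- 4
s2  <- replicate(20000, var(rnorm(n, sd = sqrt(sigma2))))  # denominator n-1
s2b <- s2 * (n - 1) / n                                    # denominator n
stopifnot(abs(mean(s2) - sigma2) < 0.1)                    # S^2 is unbiased
stopifnot(abs(mean(s2b) - (n - 1) / n * sigma2) < 0.1)     # bias of -sigma2/n
```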
A further possibility for comparing estimators is the mean squared error

MSE(θ̂) = E((θ̂ − θ)²).   (3.24)

The mean squared error can also be written as MSE(θ̂) = bias(θ̂)² + Var(θ̂).
Example 3.7. Let Y_1, …, Y_n iid∼ N(µ, σ²). Using the result (3.15) and Property 2.5.iii,
we have

MSE(Ȳ) = bias(Ȳ)² + Var(Ȳ) = 0 + σ²/n.   (3.25)

Hence, the MSE vanishes as n increases. ♣
There is a second “classical” example for the calculation of the mean squared error,
however it requires some properties of squared Gaussian variables.
Example 3.8. If Y_1, …, Y_n iid∼ N(µ, σ²), it is possible to show that (n − 1)S²/σ² ∼ χ²_{n−1}.
Then, using the variance of a chi-square random variable (2.25), we have

MSE(S²) = Var(S²) = σ⁴/(n − 1)² · Var((n − 1)S²/σ²) = σ⁴/(n − 1)² · (2n − 2) = 2σ⁴/(n − 1).   (3.26)

Analogously, one can show that MSE(σ̂²_MM) is smaller than (3.26). Moreover,
the estimator (n − 1)S²/(n + 1) possesses the smallest MSE. ♣
3.4 Interval Estimators
Point estimates are inherently single values and do not provide us with uncertainties.
For this we have to extend the idea towards interval estimates and interval estimators.
We start with a motivating example.
Let Y_1, …, Y_n iid∼ N(µ, σ²) with known σ². Thus Ȳ ∼ N(µ, σ²/n) and, by
standardizing, (Ȳ − µ)/(σ/√n) ∼ N(0, 1). As a consequence,
1 − α = P(z_{α/2} ≤ (Ȳ − µ)/(σ/√n) ≤ z_{1−α/2})   (3.27)
      = P(z_{α/2} · σ/√n ≤ Ȳ − µ ≤ z_{1−α/2} · σ/√n)   (3.28)
      = P(−Ȳ + z_{α/2} · σ/√n ≤ −µ ≤ −Ȳ + z_{1−α/2} · σ/√n)   (3.29)
      = P(Ȳ − z_{α/2} · σ/√n ≥ µ ≥ Ȳ − z_{1−α/2} · σ/√n).   (3.30)
Definition 3.4. Let Y₁, . . . , Yₙ iid∼ N(µ, σ²) with known σ². The interval

[Ȳ − z1−α/2 · σ/√n, Ȳ + z1−α/2 · σ/√n]   (3.31)

is an exact (1 − α) confidence interval for the parameter µ. ♦
If we evaluate the bounds of the confidence interval [Bl, Bu] at a realization, we denote [bl, bu] as the empirical confidence interval or the observed confidence interval.

The interpretation of an exact confidence interval is as follows: if a large number of realisations of the random sample are drawn, on average (1 − α) · 100% of the confidence intervals will cover the true parameter µ.

Empirical confidence intervals do not contain random variables and therefore it is not possible to make probability statements about the parameter.
If the standard deviation σ is unknown, the ansatz must be modified by using a point estimate for σ, typically S = √S², with S² = 1/(n − 1) · ∑ᵢ (Yᵢ − Ȳ)². Since (Ȳ − µ)/(S/√n) ∼ Tn−1, the corresponding quantile must be modified:

1 − α = P(tn−1,α/2 ≤ (Ȳ − µ)/(S/√n) ≤ tn−1,1−α/2).   (3.32)
Definition 3.5. Let Y₁, . . . , Yₙ iid∼ N(µ, σ²). The interval

[Ȳ − tn−1,1−α/2 · S/√n, Ȳ + tn−1,1−α/2 · S/√n]   (3.33)

is an exact (1 − α) confidence interval for the parameter µ. ♦
Confidence interval for the mean µ

Under the assumption of a normal random sample,

[Ȳ − tn−1,1−α/2 · S/√n, Ȳ + tn−1,1−α/2 · S/√n]   (3.34)

is an exact (1 − α) confidence interval and

[Ȳ − z1−α/2 · S/√n, Ȳ + z1−α/2 · S/√n]   (3.35)

an approximate (1 − α) confidence interval for µ.
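The t-based interval (3.34) is exactly what t.test returns; a quick sketch with simulated data (our own example, arbitrary settings) confirms this:

```r
# Sketch: the t-based CI computed by hand agrees with t.test()
set.seed(4)
y <- rnorm(12, mean = 5, sd = 2)
n <- length(y); alpha <- 0.05
manual <- mean(y) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sd(y)/sqrt(n)
manual
t.test(y, conf.level = 1 - alpha)$conf.int  # same two numbers
```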
Confidence intervals are, as shown in the previous two definitions, constituted by
random variables (functions of Y1, . . . , Yn). Similar to estimators and estimates, confidence
intervals are computed with the corresponding realization y1, . . . , yn of the random sample.
Subsequently, confidence intervals will be outlined in the blue-highlighted text boxes, as
shown here.
Example 3.9. Let Y₁, . . . , Y₄ iid∼ N(0, 1). R-Code 3.3 and Figure 3.2 show 100 empirical confidence intervals based on Equation (3.31) (top) and Equation (3.33) (bottom). Because n is small, the difference between the normal and the t-distribution is quite pronounced. This becomes clear when

[Ȳ − z1−α/2 · S/√n, Ȳ + z1−α/2 · S/√n]   (3.36)

is used as an approximation (Figure 3.2, middle). ♣
Confidence intervals can often be constructed from a starting estimator and its distribution. In many cases it is possible to extract the parameter to arrive at 1 − α = P(Bl ≤ θ ≤ Bu); often some approximations are necessary.

The coverage probability of an exact confidence interval amounts to exactly 1 − α. This is not the case for approximate confidence intervals.
Example 3.10. Let Y₁, . . . , Yₙ iid∼ N(µ, σ²). The estimator S² for the parameter σ² is such that (n − 1)S²/σ² ∼ χ²n−1, i.e., a chi-square distribution with n − 1 degrees of freedom. Hence

1 − α = P(χ²n−1,α/2 ≤ (n − 1)S²/σ² ≤ χ²n−1,1−α/2)   (3.38)
      = P((n − 1)S²/χ²n−1,α/2 ≥ σ² ≥ (n − 1)S²/χ²n−1,1−α/2),   (3.39)

where χ²n−1,p is the p-quantile of the chi-square distribution with n − 1 degrees of freedom. The corresponding exact (1 − α) confidence interval no longer has the form θ̂ ± q1−α/2 · sd(θ̂), because the chi-square distribution is not symmetric.

For the Hb SS data with sample variance 0.71, we have the confidence interval [0.39, 1.71], computed with (16-1)*var(HbSS)/qchisq( c(.975, .025), df=16-1). ♣
Confidence interval for the variance σ²

Under the assumption of a normal random sample,

[(n − 1)S²/χ²n−1,1−α/2, (n − 1)S²/χ²n−1,α/2]   (3.37)

is an exact (1 − α) confidence interval for σ².
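That the interval (3.37) is exact can be illustrated by simulation; the sketch below is our own (arbitrary settings n = 16, σ² = 1) and estimates the empirical coverage:

```r
# Sketch: empirical coverage of the exact chi-square CI for sigma^2
set.seed(5)
n <- 16; alpha <- 0.05; sigma2 <- 1
covered <- replicate(10000, {
  s2 <- var(rnorm(n, sd = sqrt(sigma2)))
  ci <- (n - 1) * s2 / qchisq(c(1 - alpha/2, alpha/2), df = n - 1)
  ci[1] <= sigma2 && sigma2 <= ci[2]
})
mean(covered)  # close to the nominal 0.95
```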
R-Code 3.3 100 confidence intervals for the parameter µ = 0, with known σ = 1 and with unknown σ.

set.seed( 1)
ex.n <- 100    # 100 confidence intervals
alpha <- .05   # 95% confidence intervals
n <- 4         # sample size
mu <- 0
sigma <- 1
sample <- array( rnorm( ex.n * n, mu, sigma), c(n, ex.n))
yl <- c( -2.9, 2.9)    # same y-axis for all
ybar <- apply( sample, 2, mean)    # means

# Sigma known:
sigmaybar <- sigma/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
     main=expression(sigma~known))
abline( h=0)
for ( i in 1:ex.n) {
  ci <- ybar[i] + sigmaybar * qnorm( c(alpha/2, 1-alpha/2))
  lines( c(i,i), ci, col=ifelse( ci[1]>0 | ci[2]<0, 2, 1))
}

# Sigma unknown, normal approximation:
sybar <- apply( sample, 2, sd)/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
     main="Gaussian Approximation")
abline( h=0)
for ( i in 1:ex.n) {
  ci <- ybar[i] + sybar[i] * qnorm( c(alpha/2, 1-alpha/2))
  lines( c(i,i), ci, col=ifelse( ci[1]>0 | ci[2]<0, 2, 1))
}

# Sigma unknown, t-based:
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
     main='t-distribution')
abline( h=0)
for ( i in 1:ex.n) {
  ci <- ybar[i] + sybar[i] * qt( c(alpha/2, 1-alpha/2), n-1)
  lines( c(i,i), ci, col=ifelse( ci[1]>0 | ci[2]<0, 2, 1))
}
[Figure 3.2 appears here: three panels titled “σ known”, “Gaussian Approximation”, and “t-distribution”, each with a y-axis from −3 to 3.]
Figure 3.2: Normal and t-based confidence intervals for the parameter µ = 0, with known σ = 1 (above) and unknown σ (middle and below). The sample size is n = 4 and the confidence level is (1 − α) = 95%. Confidence intervals which do not cover the true value zero are drawn in red. (Figure based on R-Code 3.3.)
3.5 Bibliographic Remarks and Links
Statistical estimation theory is very classical and many books are available. For example, Held and Sabanés Bové (2014) (or its German predecessor, Held, 2008) is written at an accessible level.
Chapter 4
Statistical Testing
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter04.R.
Even with a fair coin, we do not always toss heads or tails exactly n/2 times. This
is illustrated in R with rbinom(1, size=n, prob=1/2) for n tosses, since rbinom(1,
size=1, prob=1/2) “is” a fair coin. Suppose we observe 13 heads in 17 tosses. Is this a
fair coin? Do we have evidence against a fair coin?
We will formulate a formal statistical procedure to answer such questions.
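Before doing so, the question can already be quantified in one line: under a fair coin, the probability of a result at least as extreme as 13 heads in 17 tosses is about 0.049. The sketch below previews the exact binomial test (proportions are treated in Chapter 5):

```r
# Sketch: exact two-sided binomial test for 13 heads in 17 tosses
binom.test(13, n = 17, p = 1/2)$p.value  # approximately 0.049
```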
4.1 The General Concept of Significance Testing
Example 4.1. A student, by observing her eating habits during her menstrual cycle,
found that she felt hungrier a few days before menstruation, but had almost no appetite
after. Based on these observations, she hypothesized that a recurring pattern is recogniz-
able in the eating habits of women, which can be explained by the menstrual cycle.
As part of her Maturitätsarbeit (high-school thesis), a survey was created and over the course of two months
17 women kept daily records of what and how much they ate. This survey was completed
by subjects daily and captured information on how many portions from each food group
(vegetables, fruits, carbohydrates, milk products, proteins) were eaten by the subjects at
breakfast, lunch, and dinner. A fact sheet was used to explain to the subjects how to
categorize foods and how much is considered one serving.
The menstrual cycle is divided into 4 phases (before/during/after and normal) and the data for the respective phases are given in R-Code 4.1, denoted by portions.normal, portions.during, etc.
Before we start to compare the means between phases, we consider the normal phase
only. The hypothesis is that the population mean is different than (not equal to) 9
portions. (The observed mean is 7.73 with a standard deviation of 2.06.) R-Code 4.1
illustrates the computation of these statistics and the construction of Figure 4.1. ♣
R-Code 4.1 Eating habits, dataset portions . (See Figure 4.1.)
portions.normal
## [1] 5.17 9.00 12.69 6.50 5.76 7.60 6.40 5.19 9.90 7.60
## [11] 6.40 8.39 6.06 8.40 10.80 8.80 6.70
print( me <- mean( portions.normal))
## [1] 7.7271
print( se <- sd( portions.normal))
## [1] 2.0646
plot( 0, type='n', yaxt='n', xaxt='n', bty='n', xlim=c(4,14), ylab='')
axis(1)
rug( portions.normal,ticksize = 0.3)
abline( v=me, col=3)
abline( v=9, lwd=3, col=2)
# t-based CI in green and Gaussian CI (approximation!) in blue
lines( me + qt(c(.025,.975),16)*se/sqrt(17), c(.1,.1), col=3,lwd=5)
lines( me + qnorm(c(.025,.975))*se/sqrt(17), c(-.1,-.1), col=4, lwd=2)
Figure 4.1: Rug plot with sample mean (vertical green) and confidence intervals
based on the t-distribution (green) and normal approximation (blue). (See R-
Code 4.1.)
The idea of a statistical testing procedure is to formulate statistical hypotheses and to
draw conclusions from them based on the data. We always start with a null hypothesis,
denoted with H0, and – informally – we compare how compatible the data is with respect
to this hypothesis.
Simply stated, starting from a statistical hypothesis a statistical test calculates a value
from the data and places that value in the context of the hypothetical density induced
by the statistical hypothesis. If the value is unlikely to occur, we argue that the data
provides evidence against the hypothesis.
More formally, if we want to test a certain parameter, say θ, we need an estimator θ̂ for that parameter. We often need to transform the estimator such that its distribution does not depend on (the) parameter(s). We call this random variable (a function of the random sample) the test statistic. Some of these test statistics are well known and have been named, as we shall see later. Once the test statistic has been determined, we
evaluate it at the observed sample and compare it with the quantiles of its distribution under the null hypothesis. The comparison is typically expressed as a probability, i.e., the famous p-value. A more formal definition of the p-value follows.
Definition 4.1. The p-value is the probability, under the distribution of the null hypoth-
esis, of obtaining a result equal to or more extreme than the observed result. ♦
In practice, one often starts with a scientific hypothesis and then collects data or performs an experiment. The data is then “modeled statistically”, i.e., we determine a theoretical distribution for it. In our discussion here, the distribution typically involves parameters that are linked to the scientific question (the probability p of a binomial distribution for coin tosses, the mean µ of a Gaussian distribution for testing differences in food consumption). We formulate the null hypothesis H0. In many cases we pick a “known” test instead of “manually” constructing a test statistic. Of course this test has to be in sync with the statistical model. Based on the p-value we then summarize the evidence against the null hypothesis. We cannot make any statement in favor of the hypothesis.
Some authors summarize p-values in [1, 0.1] as no evidence, in [0.1, 0.01] as weak evidence, in [0.01, 0.001] as substantial evidence, and smaller ones as strong evidence (Held and Sabanés Bové, 2014). In R, the symbols ‘.’, ‘*’, ‘**’, and ‘***’ are used for similar ranges.
As a summary, the workflow of a statistical significance test can be summarized as:
i) Formulation of the scientific question or scientific hypothesis,
ii) Formulation of the statistical model (statistical assumptions),
iii) Formulation of the statistical test hypothesis (null hypothesis H0),
iv) Selection of the appropriate test,
v) Calculation of the p-value,
vi) Interpretation.
Although we discuss all six points in various settings, emphasis lies on points ii) to v).
4.2 Hypothesis Testing
A criticism of significance testing is that we have only one hypothesis. It is often easier to choose between two alternatives. This is the approach of hypothesis testing. However, we will still not choose one hypothesis but rather reject or fail to reject the null hypothesis.
More precisely, we start with a null hypothesis H0 and an alternative hypothesis, de-
noted by H1 or HA. These hypotheses are with respect to a parameter, say θ. Hypotheses
are classified as simple if the parameter θ assumes only a single value (e.g., H0: θ = 0), or composite if the parameter θ can take on a range of values (e.g., H0: θ ≤ 0). Often, you will encounter a simple null hypothesis with a simple or composite alternative hypothesis. For example, when testing a mean parameter we would have H0: µ = µ0 vs HA: µ ≠ µ0 for the latter. In practice the case of simple null and simple alternative hypothesis, e.g., H0: µ = µ0 vs HA: µ = µ1, is rarely used but has considerable didactic value.
Hypothesis tests may be either one-sided (directional), in which only a relationship in a
prespecified direction is of interest, or two-sided, in which a relationship in either direction
is tested. One could use a one-sided test for “Hb SS has a lower average hemoglobin value
than Hb Sβ”, but a two-sided test is needed for “Hb SS and Hb Sβ have different average
hemoglobin values”. Further examples of hypotheses are given in Rudolf and Kuhlisch
(2008). We strongly recommend always using two-sided tests. However, to illustrate certain concepts, a one-sided setting may be simpler and more accessible.
As in the case of a significance test, we compare the value of the test statistic with the quantiles of the distribution under the null hypothesis. In the case of small p-values we reject H0; if not, we fail to reject H0. The decision of whether the p-value is “small” or not is based on the so-called significance level, α. The corresponding critical value is the 1 − α quantile of the distribution of the test statistic under the null hypothesis.
Definition 4.2. The rejection region of a test includes all values of the test statistic with a p-value smaller than the significance level. The boundary values of the rejection region are called critical values. ♦
For one-sided (or directional) hypotheses like H0 : µ ≤ µ0 or H0 : µ ≥ µ0, the “=” case is used in the null hypothesis. That means that, from a calculation perspective, the null hypothesis is always simple.
Figure 4.2 shows the rejection regions of the two-sided test H0 : µ = µ0 vs HA : µ ≠ µ0 (left panel) and of the one-sided test H0 : µ ≤ µ0 vs HA : µ > µ0 (right panel).
Figure 4.2: Rejection regions for H0 : µ = µ0 = 0 (left) and H0 : µ ≤ µ0 = 0
(right) for α = 10%.
Table 4.1: Type I and Type II errors in the setting of significance tests.

                              True state
                          H0 true    H1 true
  do not reject H0         1 − α        β
  reject H0                  α        1 − β
Two types of errors can occur in testing: Type I errors, where we reject H0 although we should not, and Type II errors, where we fail to reject H0 although we should. The framework of hypothesis testing allows us to quantify the probability of committing these errors (see Table 4.1). Ideally we would like to construct tests that have small Type I and Type II errors. This is not possible, and one typically fixes the Type I error (committing a Type I error typically has more severe consequences than a Type II error). Type I and Type II errors are shown in Figure 4.3 for two different alternative hypotheses. This illustrates that reducing α leads to an increase in β.
We can further summarize. The Type I error:
• is defined a priori by selection of the significance level (often 5%, 1%),
• is not influenced by sample size,
• is increased with multiple testing of the same data
and the Type II error:
• depends on sample size and significance level α,
• is a function of the alternative hypothesis.
Figure 4.3: Type I error at rate α (red) and Type II error at rate β (blue) for
2 different alternative hypotheses (µ = 2 left, µ = 4 right) with H0 : µ ≤ µ0 = 0.
Figure 4.4: Power: one-sided (blue solid line) and two-sided (black dashed
line). The gray line represents the level of the test, here α = 5%. The vertical
lines represent the alternative hypotheses µ = 2 and µ = 4 of Figure 4.3. (See
R-Code 4.2.)
The value 1 − β is called the power of a test. High power is desirable in an experiment. R-Code 4.2 computes the power under a Gaussian assumption. More specifically, under the assumption of σ = 1 we test H0: µ = µ0 = 0 versus HA: µ ≠ µ0. The power can only be calculated for a specific assumption about the “actual” mean µ1. Thus, as typically done, Figure 4.4 plots the power as a function of µ1.
R-Code 4.2 A one-sided and two-sided power curve for a z-test. (See Figure 4.4.)
alpha <- 0.05 # significance level
mu0 <- 0 # H_0 mean
mu1 <- seq(-1, to=4, by=.5) # H_1 mean
power_onesided <- 1-pnorm( qnorm(1-alpha, mean=mu0), mean=mu1)
power_twosided <- pnorm( qnorm(alpha/2, mean=mu0), mean=mu1) +
pnorm( qnorm(1-alpha/2, mean=mu0), mean=mu1, lower.tail=FALSE)
plot( mu1, power_onesided, type='l', ylim=c(0,1),
xlab=expression(mu[1]-mu[0]), ylab="Power", col=4)
lines( mu1, power_twosided,lty=2)
abline( h=alpha, col='gray') # significance
abline( v=c(2,4), lty=4,col=7) # values from figure 4.3
The workflow of a hypothesis test is very similar to the one of a statistical significance
test and only point iii) and v) need to be slightly modified:
i) Formulation of the scientific question or scientific hypothesis,
ii) Formulation of the statistical model (assumptions),
iii) Formulation of the statistical test hypothesis and selection of significance level,
iv) Selection of the appropriate test,
v) Calculation of the test statistic value (or p-value), comparison and decision
vi) Interpretation.
The choice of test is again constrained by the assumptions. The significance level must, however, always be chosen before the computations.
The test statistic value tobs calculated in step v) is compared with tabulated critical values ttab in order to reach a decision. When the decision is based on the calculation of the p-value, it consists of a comparison with α. The p-value can be difficult to calculate, but is valuable because of its direct interpretation as the strength (or weakness) of the evidence against the null hypothesis.
In this course we consider various test situations. The choice of test is primarily
dependent on the parameter and secondly on the statistical assumptions. The following
list can be used as a decision tree.
• Mean
– One sample (Test 1)
– Two samples
∗ Two independent samples (Test 2 and Test 7 in Chapter 6)
∗ Two paired/dependent samples (Test 3 and Test 8 in Chapter 6)
– Several samples (Test 13 in Chapter 11)
• Variances (Test 4)
• Proportions (Test 6 in Chapter 5)
• Distributions (Test 5)
• . . .
Distribution tests (also called goodness-of-fit tests) differ from the rest in that they do not test only a single parameter. In the following sections we consider each test in detail. The tests are always presented in the same pattern (additional general remarks are given in the yellow text blocks).
General remarks about statistical tests
In the following pages, we will present several important statistical tests in
text blocks like this one.
In general, we call
• n, nx, ny, . . . the sample sizes;
• x̄, ȳ, . . . the arithmetic means;
• s², s²x, s²y, . . . the unbiased sample variance estimates, e.g.,

  s²x = 1/(nx − 1) · ∑ᵢ₌₁ⁿˣ (xᵢ − x̄)²

  (in the tests under consideration, the variance is unknown);
• α the significance level;
• ttab, Ftab, . . . the appropriate quantiles.
Our scientific question can only be formulated in general terms and we call
it a hypothesis. Generally, two-sided tests are performed. The significance
level is modified accordingly for one-sided tests.
In unclear cases, the statistical hypothesis is specified.
For most tests, there is a corresponding R function. The arguments x, y
usually represent vectors of realisations and alpha the significance level.
4.3 Comparing Means
In this section, we compare means from normally distributed samples. To start, we compare the mean of a sample with a hypothesized value. More formally, Y₁, . . . , Yₙ iid∼ N(µ, σ²) with unknown σ² and parameter of interest µ. Thus, from

Ȳ ∼ N(µ, σ²/n),   (Ȳ − µ)/(σ/√n) ∼ N(0, 1),   (Ȳ − µ)/(S/√n) ∼ Tn−1,   (4.1)

we can only use the last distribution, as σ² is unknown. Under the null hypothesis, µ is of course our theoretical value, say µT.

The test statistic is t-distributed (see Section 2.7.3) and so the function pt is used to calculate p-values. As we have only one sample, the test is typically called the “one-sample t-test”, illustrated in box Test 1 and in R-Code 4.3 with an example based on the dietary data.
Test 1: Comparing a sample mean with a theoretical value

Question: Does the experiment-based sample mean x̄ deviate significantly from the hypothesized (“theoretical”) mean µT?

Assumptions: The population, from which the sample arises, is normally distributed. The observed data are on the interval scale and the variance is unknown.

Calculation: tobs = (|x̄ − µT|/s) · √n.

Decision: Reject H0 : µ = µT if tobs > ttab = tn−1,1−α/2.

Calculation in R: t.test( x, mu=muT, conf.level=1-alpha)
Example 4.2. We test the hypothesis that the normal number of dietary portions differs from 9. We have one sample and we want to know whether its mean deviates from a hypothesized value.

The following values are given: mean: 7.73, standard deviation: 2.06, sample size: 17.

H0 : µ = 9
tobs = |7.73 − 9| / (2.06/√17) = 2.54
ttab = t16,1−0.05/2 = 2.12,   p-value: 0.022.

As shown by Figure 4.1, there is evidence against the null hypothesis (R-Code 4.3). ♣
R-Code 4.3 One sample t-test, portions (see Example 4.2 and Test 1)
t.test( portions.normal, mu=9)
##
## One Sample t-test
##
## data: portions.normal
## t = -2.54, df = 16, p-value = 0.022
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
## 6.6656 8.7886
## sample estimates:
## mean of x
## 7.7271
Often we want to compare two different means, say x̄ and ȳ. We assume that both random samples are normally distributed, i.e., X₁, . . . , Xₙ iid∼ N(µx, σ²), Y₁, . . . , Yₙ iid∼ N(µy, σ²), and independent. Then

X̄ ∼ N(µx, σ²/n),   Ȳ ∼ N(µy, σ²/n),   X̄ − Ȳ ∼ N(µx − µy, 2σ²/n),   (4.2)

(X̄ − Ȳ − (µx − µy))/(σ/√(n/2)) ∼ N(0, 1),   and under H0: µx = µy,   (X̄ − Ȳ)/(σ/√(n/2)) ∼ N(0, 1).   (4.3)

The difficulty is that we do not know σ and we have to estimate it. The estimate takes a somewhat complicated form, as both samples need to be taken into account, with possibly different lengths and means. This pooled estimate is denoted by sp and given in Test 2. Ultimately, we again have a t-distribution for the test statistic, as we use an estimate of the standard deviation in the standardization of a normal random variable. As the calculation of sp requires the estimates of µx and µy, we adjust the degrees of freedom to nx + ny − 2.

R-Code 4.4 is based on the dietary data again and compares the portions during the menstrual cycle with the normal amount.
Test 2: Comparing means from two independent samples

Question: Are the means x̄ and ȳ of two samples significantly different?

Assumptions: Both populations are normally distributed with the same unknown variance. The samples are independent.

Calculation: tobs = (|x̄ − ȳ|/sp) · √(nx · ny/(nx + ny)),

where s²p = ((nx − 1)s²x + (ny − 1)s²y) / (nx + ny − 2).

Decision: Reject H0 : µx = µy if tobs > ttab = tnx+ny−2,1−α/2.

Calculation in R: t.test( x, y, var.equal=TRUE, conf.level=1-alpha)
In practice, the variances of the two samples are often different, say σ²x and σ²y. In such a setting, we have to normalize the mean difference by √(s²x/nx + s²y/ny). While this estimate seems simpler than the pooled estimate sp, the degrees of freedom of the resulting t-distribution are not intuitive and difficult to derive, and we refrain from elaborating here. In the literature, this test is called Welch’s t-test and it is actually the default choice of t.test( x, y, conf.level=1-alpha).
Example 4.3. For the two samples normal and during, we have the following summaries: means 7.73 and 8.21; standard deviations: 2.06 and 1.90; sample sizes: 17 and 17. Hence, using the formulas given in Test 2, we have

H0 : µx = µy
s²p = (16 · 2.06² + 16 · 1.90²) / (17 + 17 − 2) = 3.93,   i.e., sp = 1.98
tobs = (|7.73 − 8.21| / 1.98) · √(17 · 17/(17 + 17)) = 0.71
ttab = t32,1−0.05/2 = 2.04,   p-value: 0.48.

Hence, 7.73 and 8.21 are not statistically different. See also R-Code 4.4. ♣
R-Code 4.4 Two-sample t-test with independent samples, portions (see Example 4.3
and Test 2).
t.test( portions.normal, portions.during, var=TRUE)
##
## Two Sample t-test
##
## data: portions.normal and portions.during
## t = -0.709, df = 32, p-value = 0.48
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.86966 0.90378
## sample estimates:
## mean of x mean of y
## 7.7271 8.2100
The assumption of independence of the two samples in Example 4.3 may not be realistic: we have observed the same individuals at two different instances in time. In such settings, where we have a “before” and an “after” measurement, it is better to take this pairing into account by considering the differences instead of the two samples. Hence, instead of constructing a test statistic based on X̄ − Ȳ, we consider

X₁ − Y₁, . . . , Xₙ − Yₙ iid∼ N(µx − µy, σ²d),   X̄ − Ȳ ∼ N(µx − µy, σ²d/n),   (4.4)

and under H0: µx = µy,   (X̄ − Ȳ)/(σd/√n) ∼ N(0, 1),   (4.5)

where σ²d is essentially the sum of the variances minus the “dependence” between Xᵢ and Yᵢ. We formalize this dependence, called covariance, starting in Chapter 7.

The paired two-sample t-test can thus be considered a one-sample t-test of the differences with mean µT = 0.
Test 3: Comparing means from two paired samples

Question: Are the means x̄ and ȳ of two paired samples significantly different?

Assumptions: The samples are paired, the observed values are on the interval scale. The differences are normally distributed with unknown mean δ. The variance is unknown.

Calculation: tobs = (|d̄|/sd) · √n, where

• dᵢ = xᵢ − yᵢ is the i-th observed difference,
• d̄ is the arithmetic mean and sd the standard deviation of the differences dᵢ.

Decision: Reject H0 : δ = 0 if tobs > ttab = tn−1,1−α/2.

Calculation in R: t.test(x-y, conf.level=1-alpha) or
t.test(x, y, paired=TRUE, conf.level=1-alpha)
Example 4.4. Again working with the two samples normal and during, we have the following summaries for the differences (see R-Code 4.5 and Test 3): mean: −0.48; standard deviation: 0.91; sample size: 17.

H0 : δ = 0, i.e., H0 : µx = µy
tobs = |−0.48| / (0.91/√17) = 2.19
ttab = t16,1−0.05/2 = 2.12,   p-value: 0.045.

Now, by taking into account that we have two measurements on the same individual, we have moderate evidence that there is a difference between the two periods. ♣
The t-tests require normally distributed data. The tests are relatively robust against deviations from normality, as long as there are no extreme outliers. Otherwise, rank-based tests can be used (see Chapter 6):
• U-test (Wilcoxon–Mann–Whitney test),
• Wilcoxon test.
The assumption of normality can be verified quantitatively with formal normality tests (χ²-test, Shapiro–Wilk test, Kolmogorov–Smirnov test). Often, however, a qualitative verification is sufficient (e.g., with the help of a Q-Q plot).
R-Code 4.5 Two-sample t-test with paired samples, portions (see Example 4.4 and
Test 3).
t.test( portions.normal, portions.during, paired=TRUE)
##
## Paired t-test
##
## data: portions.normal and portions.during
## t = -2.19, df = 16, p-value = 0.044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.95124 -0.01464
## sample estimates:
## mean of the differences
## -0.48294
# Same result as with
# t.test( portions.normal - portions.during)
4.4 Duality of Tests and Confidence Intervals
There is a close connection between a significance test and a confidence interval for the
same parameter. Rejection of H0 : θ = θ0 with a significance level α is equivalent to θ0 not
being included in the (1−α) confidence interval of θ. The duality of tests and confidence
intervals is quickly apparent when the criterion for the test decision is reworded, here for
example, “compare a mean value with the theoretical value”.
As shown above, H0 is rejected if tobs > ttab, i.e., if

tobs = (|x̄ − µT|/s) · √n > ttab.   (4.6)

H0 is not rejected if

tobs = (|x̄ − µT|/s) · √n ≤ ttab.   (4.7)

This can be rewritten as

|x̄ − µT| ≤ ttab · s/√n,   (4.8)

which also means that H0 is not rejected if both

x̄ − µT ≥ −ttab · s/√n   and   x̄ − µT ≤ ttab · s/√n.   (4.9)

We can again rewrite this as

µT ≤ x̄ + ttab · s/√n   and   µT ≥ x̄ − ttab · s/√n,   (4.10)
which corresponds to the boundaries of the (1−α) confidence interval for µT . Analogously,
this duality can be established for the other tests described in this chapter.
Example 4.5. Consider the situation from Example 4.2. Instead of comparing the p-
value, we can also consider the confidence interval, whose boundary values are 6.67 and
8.79. Since 9 is not in this range, the null hypothesis is rejected.
This is shown in Figure 4.1. ♣
In R most test functions give the corresponding confidence intervals with the value of
the statistic and p-value. Some functions require the additional argument conf.int=TRUE,
as well.
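The duality is easy to see with the portions data from Example 4.2 (values typed in from R-Code 4.1):

```r
# Sketch: test decision and CI coverage coincide (data from R-Code 4.1)
portions.normal <- c(5.17, 9.00, 12.69, 6.50, 5.76, 7.60, 6.40, 5.19, 9.90,
                     7.60, 6.40, 8.39, 6.06, 8.40, 10.80, 8.80, 6.70)
tt <- t.test(portions.normal, mu = 9)
tt$p.value < 0.05                        # TRUE: reject H0 at the 5% level
9 < tt$conf.int[1] | 9 > tt$conf.int[2]  # TRUE: 9 lies outside the 95% CI
```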
4.5 Multiple Testing
In many cases, we want to perform not just one test, but a series of tests. We must then
be aware that the significance level α only holds for a single test. In the case of a single
test, the probability of a falsely significant test result equals the significance level, usually
α = 0.05. The probability that the null hypothesis H0 is correctly not rejected is then
1− 0.05 = 0.95.
Consider the situation in which m > 1 tests are performed. The probability of obtaining at least one falsely significant test result is then equal to one minus the probability that no falsely significant test results are obtained. It holds that

P(at least one false significant result) = 1 − P(no false significant result)   (4.11)
                                         = 1 − (1 − α)^m.   (4.12)
In Table 4.2 the probabilities of at least one false significant result for α = 0.05 and
various m are presented. Even for just a few tests, the probability increases drastically,
which should not be tolerated.
Table 4.2: Probabilities of at least one false significant test result when per-
forming m tests at level α = 5%.
m              1     2     3     4     5     6     8    10    20   100
1 − (1 − α)^m  0.05  0.10  0.14  0.19  0.23  0.26  0.34  0.40  0.64  0.99
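The entries of Table 4.2 follow directly from Equation (4.12), e.g., in R:

```r
# Reproducing the second row of Table 4.2 from 1 - (1 - alpha)^m
m <- c(1, 2, 3, 4, 5, 6, 8, 10, 20, 100)
round(1 - (1 - 0.05)^m, 2)
```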
There are several different methods that allow multiple tests to be performed while
maintaining the selected significance level. The simplest and most well-known of them is
the Bonferroni correction. Instead of comparing the p-value of every test to α, they are
compared to a new significance level, αnew = α/m. There are several alternative methods,
which, according to the situation, may be more appropriate. See, for example, Farcomeni
(2008).
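The Bonferroni correction is implemented in R's p.adjust: instead of lowering the level to α/m, the p-values are scaled up by m, which is equivalent. A sketch with hypothetical p-values of our own:

```r
# Sketch: Bonferroni adjustment of four hypothetical p-values
pvals <- c(0.001, 0.015, 0.030, 0.210)
p.adjust(pvals, method = "bonferroni")  # same as pmin(pvals * 4, 1)
# Decisions at alpha = 0.05: only the first remains significant
```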
4.6 Additional Tests
Just as means can be compared, there are also tests for comparing variances. However,
instead of taking differences as in the case of means, we take the ratio of the estimated
variances. If the two estimates are similar, the ratio should be close to one. The test
statistic is accordingly S2y
/S2y and the distribution thereof is an F -distribution (see also
Section 2.7.4). This “classic” F -test is given in Test 4.
In latter chapters we will see more natural settings where we need to compare variances
(not necessary from a priori two different samples).
Test 4: Comparing two variances

Question: Are the variances s²x and s²y of two samples significantly different?

Assumptions: Both populations, from which the samples arise, are normally distributed. The samples are independent and the observed values are from the interval scale.

Calculation: Fobs = s²x/s²y (the larger variance always goes in the numerator, so that s²x > s²y).

Decision: Reject H0: σ²x = σ²y if Fobs > Ftab = Fnx−1,ny−1,1−α.

Calculation in R: var.test( x, y, conf.level=1-alpha)
Example 4.6. The two-sample t-test (Test 2) requires equal variances. The data portions does not contradict this null hypothesis of equal variances, as R-Code 4.6 shows (compare Example 4.3 and R-Code 4.4). ♣
The so-called chi-square test (χ²-test) verifies whether the observed data follow a particular distribution. This test is based on a comparison of the observed with the expected frequencies and can also be used in other situations, for example, to test whether two categorical variables are independent (Test 6).

Under the null hypothesis, the test statistic of the chi-square test is χ² distributed. The quantile, density, and distribution functions are implemented in R with [q,d,p]chisq (see Section 2.7.2).

In Test 5, the categories should be aggregated so that all eᵢ ≥ 5. Additionally, N − k > 1.
R-Code 4.6 Comparison of two variances, portions (see Example 4.6 and Test 4).
var.test( portions.normal, portions.during)
##
## F test to compare two variances
##
## data: portions.normal and portions.during
## F = 1.18, num df = 16, denom df = 16, p-value = 0.75
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4268 3.2544
## sample estimates:
## ratio of variances
## 1.1785
Test 5: Comparison of observations with expected frequencies

Question: Do the observed frequencies oi of a sample deviate significantly
from the expected frequencies ei of a certain distribution?

Assumptions: Sample with data from any scale.

Calculation: Calculate the observed frequencies oi and the expected frequencies
ei (with the help of the expected distribution) and then compute

χ²obs = Σ_{i=1}^{N} (oi − ei)² / ei,

where N is the number of categories.

Decision: Reject H0: “no deviation between the observed and expected”
if χ²obs > χ²tab = χ²_{N−1−k, 1−α}, where k is the number of parameters
estimated from the data.

Calculation in R: chisq.test( obs, p=expect/sum(obs))
or chisq.test( obs, p=expect, rescale.p=TRUE)
Example 4.7. With only 17 observations it is often pointless to test for normality of the
data. Even for larger samples, a Q-Q plot is more informative. For completeness, the
procedure is shown using the dietary observations for the time periods “normal” and
“during menstruation” in R-Code 4.7.
The binning of the data is done through a histogram-type binning (using a table(
cut( )) construction would be an alternative approach). As we have fewer than five
observations in the last two bins, the function issues a warning. Additionally, the degrees
of freedom should be N − 1 − k = 5 − 1 − 2 = 2 rather than four, as we estimate the mean
and standard deviation to determine the expected counts. Both effects could be mitigated
if we calculate the p-value using a bootstrap simulation by setting the argument
simulate.p.value=TRUE. ♣
R-Code 4.7 Testing normality, portions (see Example 4.7 and Test 5).
portions.normal.during <- c(portions.normal, portions.during)
observed <- hist( portions.normal.during, plot=FALSE, breaks=4)
# without 'breaks' argument there are too many categories
observed[1:2]
## $breaks
## [1] 4 6 8 10 12 14
##
## $counts
## [1] 4 13 13 2 2
m <- mean( portions.normal.during)
s <- sd( portions.normal.during)
p <- pnorm( observed$breaks, mean=m, sd=s)
chisq.test( observed$counts, p=diff( p), rescale.p=TRUE)
## Warning in chisq.test(observed$counts, p = diff(p), rescale.p =
TRUE): Chi-squared approximation may be incorrect
##
## Chi-squared test for given probabilities
##
## data: observed$counts
## X-squared = 4.36, df = 4, p-value = 0.36
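As discussed in Example 4.7, the degrees of freedom should be 2 rather than 4. The p-value can be recomputed by hand from the reported test statistic (a sketch; the value 4.36 is taken from the output above):

```r
X2 <- 4.36                           # observed test statistic from chisq.test()
pchisq(X2, df=4, lower.tail=FALSE)   # p-value as reported (df = N - 1 = 4)
pchisq(X2, df=2, lower.tail=FALSE)   # corrected df = N - 1 - k = 2, roughly 0.11
```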
4.7 Bibliographic Remarks and Links
There are many books about statistical testing, ranging from accessible to very mathe-
matical, too many to even start a list here.
Chapter 5
A Closer Look: Proportions
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter05.R.
Pre-eclampsia is a hypertensive disorder occurring during pregnancy (gestational hy-
pertension) with symptoms including: edema, high blood pressure, and proteinuria. In
a double-blinded randomized controlled trial (RCT) 2706 pregnant women were treated
with either a diuretic or with a placebo (Landesman et al., 1965). Pre-eclampsia was
diagnosed in 138 of the 1370 subjects in the treatment group and in 175 of the subjects
receiving placebo. The medical question is whether diuretic medications, which reduce
water retention, reduce the risk of pre-eclampsia.
In this chapter we have a closer look at statistical techniques that help us to correctly
answer this question. More precisely, we will estimate proportions and compare them
with each other.
5.1 Estimation
We start with a simple setting where we observe occurrences of a certain event and are
interested in the proportion of the events over the total population. More specifically,
we consider the number of successes in a sequence of experiments, i.e., whether a certain
treatment had an effect. We often use a binomial random variable X ∼ Bin(n, p) for
such a setting, where n is given or known and p is unknown, the parameter of interest.
Intuitively, we find x/n to be an estimate of p and X/n the corresponding estimator. We
will construct a confidence interval for p in the next section.
With the method of moments we obtain the estimator pMM = X/n, since np = E(X)
and we have only one observation (total number of cases). The estimator is identical to
the intuitive estimator.
The likelihood estimator is constructed as follows:

L(p) = (n choose x) p^x (1 − p)^(n−x),   (5.1)

ℓ(p) = log( L(p) ) = log (n choose x) + x log(p) + (n − x) log(1 − p),   (5.2)

dℓ(p)/dp = x/p − (n − x)/(1 − p) = 0  ⇒  x/p̂ML = (n − x)/(1 − p̂ML).   (5.3)

Thus, the maximum likelihood estimator is (again) p̂ML = X/n.
In our example we have the following estimates: p̂T = 138/1370 ≈ 10% in the treatment
group and p̂C = 175/1336 ≈ 13% in the control group. The question then arises as to
whether the two proportions are different enough to be able to speak of an effect of the drug.
Once we have an estimator, we can now answer questions, such as:
i) How many cases of pre-eclampsia can be expected in a group of 100 pregnant women?
ii) What is the probability that more than 20 cases occur?
Note, however, that the estimator p̂ = X/n does not have a “classical” distribution.
Figure 5.1 illustrates the probability mass function based on the estimate for the pre-
eclampsia cases in the treated group. The figure visually suggests the use of a Gaussian
approximation, which is well justified here as np̂(1 − p̂) ≈ 124 ≥ 9. The Gaussian
approximation for X is then used to state that the estimator p̂ is also approximately Gaussian.
Figure 5.1: Probability mass function with the normal approximation (left panel:
full range 0 ≤ x ≤ 1400; right panel: detail for 100 ≤ x ≤ 180).
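A plot in the spirit of Figure 5.1 can be sketched with a few lines, assuming p̂ = 138/1370 as in the treated group:

```r
n <- 1370
p <- 138/1370                 # estimate from the treated group
x <- 90:190                   # region around the mean n*p = 138
plot( x, dbinom(x, size=n, prob=p), type='h', xlab='x', ylab='probability')
# normal approximation N(np, np(1-p)), justified since np(1-p) is large
curve( dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=TRUE, col=2)
```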
When dealing with proportions we often speak of odds, or simply of chance, defined
by ω = p/(1 − p). The corresponding intuitive estimator (and estimate) is ω̂ = p̂/(1 − p̂).
As a side note, this estimator also coincides with the maximum likelihood estimator.
5.2 Confidence Intervals
To construct a confidence interval for the parameter p we use the Gaussian approximation
for the binomial distribution:

1 − α ≈ P( z_{α/2} ≤ (X − np) / √(np(1 − p)) ≤ z_{1−α/2} ).   (5.4)
This can be rewritten as

1 − α ≈ P( z_{α/2} √(np(1 − p)) ≤ X − np ≤ z_{1−α/2} √(np(1 − p)) )   (5.5)

= P( −X/n + z_{α/2} (1/n) √(np(1 − p)) ≤ −p ≤ −X/n + z_{1−α/2} (1/n) √(np(1 − p)) ).   (5.6)

As a further approximation we replace p inside the square root by the estimator p̂ = X/n
and have

1 − α ≈ P( −X/n + z_{α/2} (1/n) √(np̂(1 − p̂)) ≤ −p ≤ −X/n + z_{1−α/2} (1/n) √(np̂(1 − p̂)) ).   (5.7)
Since p̂ = x/n (as estimate) and q := z_{1−α/2} = −z_{α/2}, we have as empirical confidence
interval

b_{l,u} = p̂ ± q · √( p̂(1 − p̂)/n ) = p̂ ± q · SE(p̂).   (5.8)
Remark 5.1. For (regular) models with parameter θ, as n → ∞, likelihood theory
states that the estimator θ̂ML is asymptotically normally distributed with expected value θ
and variance Var(θ̂ML).

Since Var(X/n) = p(1 − p)/n, one can take SE(p̂) = √( p̂(1 − p̂)/n ). The so-called
Wald confidence interval rests upon this assumption (which can naturally be shown
more formally) and is identical to (5.8).
If the inequality in (5.4) is solved through a quadratic equation (in p), we obtain the
empirical Wilson confidence interval

b_{l,u} = 1/(1 + q²/n) · ( p̂ + q²/(2n) ± q · √( p̂(1 − p̂)/n + q²/(4n²) ) ),   (5.9)

where we use the estimate p̂ = x/n.
The Wilson confidence interval is “more complicated” than the Wald confidence
interval. Is it also “better” because one less approximation is required?
Confidence intervals for proportions

An approximate (1 − α) Wald confidence interval for a proportion is

B_{l,u} = p̂ ± q · √( p̂(1 − p̂)/n )   (5.10)

with estimator p̂ = X/n. An approximate (1 − α) Wilson confidence interval
for a proportion is

B_{l,u} = 1/(1 + q²/n) · ( p̂ + q²/(2n) ± q · √( p̂(1 − p̂)/n + q²/(4n²) ) ).   (5.11)

In R an exact (1 − α) confidence interval for a proportion is computed with
binom.test( x, n)$conf.int.
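For the treatment group of the pre-eclampsia data (x = 138, n = 1370), the Wald interval (5.10), the Wilson interval (5.11) and the exact interval can be compared; a sketch:

```r
x <- 138; n <- 1370
q <- qnorm(0.975)
phat <- x/n
wald <- phat + c(-1, 1)*q*sqrt( phat*(1 - phat)/n)              # eq (5.10)
wilson <- (phat + q^2/(2*n) +                                   # eq (5.11)
    c(-1, 1)*q*sqrt( phat*(1 - phat)/n + q^2/(4*n^2)))/(1 + q^2/n)
exact <- binom.test(x, n)$conf.int                              # Clopper-Pearson
rbind( Wald=wald, Wilson=wilson, Exact=exact)
```

For this large n all three intervals are very close.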
Ideally the coverage probability of a (1 − α) confidence interval should be 1 − α. For
a discrete random variable, the coverage is

P( p ∈ CI ) = Σ_{x=0}^{n} P(X = x) · I_{p ∈ CI(x)}.   (5.12)

R-Code 5.1 calculates the coverage of the 95% confidence intervals for X ∼ Bin(n =
50, p = 0.4) and demonstrates that the Wilson confidence interval has better coverage
(96% compared to 93%).
R-Code 5.1: Coverage of 95% confidence intervals for X ∼ Bin(n = 50, p =
0.4).
p <- .4
n <- 20
x <- 0:n
WaldCI <- function(x, n) {  # eq (5.10)
  mid <- x/n
  se <- sqrt( x*(n-x)/n^3)
  cbind( pmax(0, mid - 1.96*se), pmin(1, mid + 1.96*se))
}
WaldCIs <- WaldCI(x,n)
Waldind <- (WaldCIs[,1] <= p) & (WaldCIs[,2] >= p)
Waldcoverage <- sum( dbinom(x, n, p)*Waldind) # eq (5.12)
WilsonCI <- function(x, n) {  # eq (5.11)
  mid <- (x + 1.96^2/2)/(n + 1.96^2)
  se <- sqrt(n)/(n + 1.96^2)*sqrt( x/n*(1 - x/n) + 1.96^2/(4*n))
  cbind( pmax(0, mid - 1.96*se), pmin(1, mid + 1.96*se))
}
WilsonCIs <- WilsonCI(x,n)
Wilsonind <- (WilsonCIs[,1] <= p) & (WilsonCIs[,2] >= p)
Wilsoncoverage <- sum( dbinom(x, n, p)*Wilsonind)
print( c(true=0.95, Wald=Waldcoverage, Wilson=Wilsoncoverage))
## true Wald Wilson
## 0.95000 0.92802 0.96301
R-Code 5.2 Tips for construction of Figure 5.2.
p <- seq(0, 1, .001)
#
# either a loop over all elements in p or a few applies
# recomputing 'Waldind'/'Waldcoverage' and 'Wilsonind'/'Wilsoncoverage'
#
# Waldcoverage and Wilsoncoverage are thus vectors!
Waldsmooth <- loess( Waldcoverage ~ p, span=.1)
Wilsonsmooth <- loess( Wilsoncoverage ~ p, span=.1)
plot( p, Waldcoverage, type='l', ylim=c(.8,1))
lines( c(-1, 2), c(.95, .95), col=2, lty=2)
lines( Waldsmooth$x, Waldsmooth$fitted, col=3, lwd=2)
The coverage depends on p, as shown in Figure 5.2 (from R-Code 5.2). The Wilson
confidence interval has coverage closer to the nominal level in the center. This observation
also holds when n is varied, as in Figure 5.3.
The width of an empirical confidence interval is b_u − b_l. For the Wald confidence
interval we obtain

2q · √( p̂(1 − p̂)/n )   (5.13)

and for the Wilson confidence interval

( 2q / (1 + q²/n) ) · √( p̂(1 − p̂)/n + q²/(4n²) ).   (5.14)
Figure 5.2: Coverage of the 95% confidence intervals for X ∼ Bin(n = 50, p), as a
function of p (left panel: Wald; right panel: Wilson). The red dashed line is the
goal 1 − α and in green we have a smoothed curve.
Figure 5.3: Coverage of the 95% confidence intervals for X ∼ Bin(n, p) as
functions of p and n (left panel: Wald CI; right panel: Wilson CI). The
probabilities are symmetric around p = 1/2. All values smaller than 0.7 are
represented with dark red.
The widths are depicted in Figure 5.4. For 5 < x < 36, the Wilson confidence interval
has a smaller width.
Figure 5.4: Widths of the empirical 95% confidence intervals for X ∼ Bin(n =
40, p), as a function of x (the Wald interval in solid green, the Wilson interval
in dashed blue).
5.3 Tests
The pre-eclampsia data can be presented with a so-called two-dimensional contingency
table (two-way table), as, for example, in Table 5.1.
If one wants to know whether the proportions in both groups are equal or whether they
come from the same distribution, a test for equality of the proportions can be conducted,
i.e., H0 : p1 = p2. This test is also called Pearson’s χ2 test and is given in Test 6. It can
also be considered a special case of Test 5.
Table 5.1: Example of a two-dimensional contingency table displaying frequencies.
The first index refers to the presence of the risk factor, the second to the diagnosis.

                                Diagnosis
                          positive   negative
   Risk    with factor      h11        h12       n1
   factor  without          h21        h22       n2
R-Code 5.3 Contingency table for the pre-eclampsia data example.
xD <- 138; xC <- 175 # positive diagnosed counts
nD <- 1370; nC <- 1336 # totals in both groups
tab <- rbind(Diuretic=c(xD, nD-xD), Control=c(xC, nC-xC))
colnames( tab) <- c('pos','neg')
tab
## pos neg
## Diuretic 138 1232
## Control 175 1161
Test 6: Test of proportions

Question: Are the proportions in the two groups the same?

Assumptions: Two sufficiently large, independent samples of binary data.

Calculation: We use the notation for cells of a contingency table, as in Table
5.1. The test statistic is

χ²obs = (h11 h22 − h12 h21)² (h11 + h12 + h21 + h22) / ( (h11 + h12)(h21 + h22)(h12 + h22)(h11 + h21) )

and, under the null hypothesis that the proportions are the same, is
X² distributed with one degree of freedom.

Decision: Reject if χ²obs > χ²tab = χ²_{1, 1−α}.

Calculation in R: prop.test( tab) or chisq.test( tab)
Example 5.1. The R-Code 5.4 shows the results for the pre-eclampsia data, once using
a proportion test and once using a Chi-squared test (comparing expected and observed
frequencies). ♣
Remark 5.2. We have presented the rows of Table 5.1 in terms of two binomials, i.e., with
two fixed marginals. In certain situations, such a table can be seen from a hypergeometric
distribution point of view (see help( dhyper)), where three margins are fixed. For this
latter view, fisher.test is the test of choice.
It is natural to extend the 2× 2-tables to general two-way tables, where many of the
concepts discussed here need to be extended. Often, at the very end, a test based on a
chi-squared distributed test statistic is used.
R-Code 5.4 Test of proportions
print(RD <- tab[2,1]/ sum(tab[2,]) - tab[1,1]/ sum(tab[1,]) )
## [1] 0.030258
prop.test( tab)
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tab
## X-squared = 5.76, df = 1, p-value = 0.016
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0551074 -0.0054088
## sample estimates:
## prop 1 prop 2
## 0.10073 0.13099
chisq.test( tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 5.76, df = 1, p-value = 0.016
5.4 Comparison of Proportions
The goal of this section is to introduce formal approaches for a comparison of two
proportions p1 and p2. This can be accomplished using (i) the difference p1 − p2, (ii) the
quotient p1/p2, or (iii) the odds ratio (p1/(1 − p1)) / (p2/(1 − p2)), which we consider
in the following three sections.
5.4.1 Difference Between Proportions

Based on the binomial distribution, the difference h11/(h11 + h12) − h21/(h21 + h22)
(seen as an estimator) is approximately normally distributed,

N( p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2 ),   (5.15)

and a corresponding confidence interval can be derived.
The risk difference (or risk reduction) describes the (absolute) difference in the
probability of experiencing the event in question.
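An approximate confidence interval based on (5.15) can be computed by hand for the pre-eclampsia data; a sketch, using the counts filled into the table in R-Code 5.3:

```r
x1 <- 138; n1 <- 1370    # diuretic group
x2 <- 175; n2 <- 1336    # control group
p1 <- x1/n1
p2 <- x2/n2
se <- sqrt( p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
# approximate 95% CI for the risk difference p1 - p2, based on (5.15)
ci <- (p1 - p2) + qnorm( c(0.025, 0.975))*se
ci       # does not contain zero
```

The interval is close to the one reported by prop.test, which additionally applies a continuity correction.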
5.4.2 Relative Risk

The relative risk estimates the size of the effect of a risk factor compared with the effect
size when the risk factor is not present:

RR = P(Positive diagnosis with risk factor) / P(Positive diagnosis without risk factor).   (5.16)
The groups with or without the risk factor can also be considered the treatment and
control groups.
The relative risk assumes positive values. A value of 1 means that the risk is the same in
both groups and there is no evidence of a correlation between the diagnosis/disease/event
and the risk factor. A value greater than one is evidence of a possible positive correlation
between a risk factor and a diagnosis/disease. If the relative risk is less than one, the
exposure has a protective effect, as is the case, for example, for vaccinations.
An estimate of the relative risk is (see Table 5.1)

R̂R = p̂1/p̂2 = ( h11/(h11 + h12) ) / ( h21/(h21 + h22) ).   (5.17)
To construct confidence intervals, we consider first θ = log(RR). The standard error of θ̂
is determined with the delta method and based on Equation (2.44), applied to a binomial
instead of a Bernoulli:

Var(θ̂) = Var( log(R̂R) ) = Var( log(p̂1/p̂2) ) = Var( log(p̂1) − log(p̂2) )   (5.18)

= Var( log(p̂1) ) + Var( log(p̂2) ) ≈ (1 − p̂1)/(n1 · p̂1) + (1 − p̂2)/(n2 · p̂2)   (5.19)

≈ ( 1 − h11/(h11 + h12) ) / ( (h11 + h12) · h11/(h11 + h12) )
  + ( 1 − h21/(h21 + h22) ) / ( (h21 + h22) · h21/(h21 + h22) )   (5.20)

= 1/h11 − 1/(h11 + h12) + 1/h21 − 1/(h21 + h22).   (5.21)
A back-transformation

[ exp( θ̂ ± z_{1−α/2} SE(θ̂) ) ]   (5.22)

yields positive confidence boundaries. Note that with the back-transformation we lose
the ‘symmetry’ of estimate plus/minus standard error.
Confidence interval for relative risk (RR)

An approximate (1 − α) confidence interval for RR, based on the two-
dimensional contingency table (Table 5.1), is

[ exp( log(R̂R) ± z_{1−α/2} SE(log(R̂R)) ) ],   (5.23)

where for the empirical confidence intervals we use the estimates

R̂R = h11(h21 + h22) / ( (h11 + h12) h21 ),
SE(log(R̂R)) = √( 1/h11 − 1/(h11 + h12) + 1/h21 − 1/(h21 + h22) ).
Example 5.2. The relative risk and corresponding confidence interval for the pre-eclampsia
data are given in R-Code 5.5. The relative risk is smaller than one (diuretics reduce the
risk). An approximate 95% confidence interval does not include one. ♣
R-Code 5.5 Relative Risk with confidence interval.
print( RR <- ( tab[1,1]/ sum(tab[1,])) / ( tab[2,1]/ sum(tab[2,])) )
## [1] 0.769
s <- sqrt(
1/tab[1,1] + 1/tab[2,1] - 1/sum(tab[1,]) - 1/sum(tab[2,]) )
exp( log(RR) + qnorm(c(.025,.975))*s)
## [1] 0.62333 0.94872
5.4.3 Odds Ratio

The relative risk is closely related to the odds ratio, which is defined as

OR = ( P(Positive diagnosis with risk factor) / P(Negative diagnosis with risk factor) )
   / ( P(Positive diagnosis without risk factor) / P(Negative diagnosis without risk factor) )   (5.24)

= ( P(A)/(1 − P(A)) ) / ( P(B)/(1 − P(B)) ) = P(A)(1 − P(B)) / ( P(B)(1 − P(A)) ).   (5.25)
The odds ratio indicates the strength of an association between factors (association mea-
sure). The calculation of the odds ratio also makes sense when the number of diseased is
determined by study design, as is the case for case-control studies.
When a disease is rare (very low probability of disease), the odds ratio and relative
risk are approximately equal.
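This can be verified numerically; a small sketch with hypothetical probabilities:

```r
p1 <- 0.010; p2 <- 0.005    # hypothetical small disease probabilities
RR <- p1/p2
OR <- (p1/(1 - p1))/(p2/(1 - p2))
c( RR=RR, OR=OR)            # nearly identical for small p1 and p2
```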
An estimate of the odds ratio is

ÔR = (h11/h12) / (h21/h22) = h11 h22 / (h12 h21).   (5.26)
The construction of confidence intervals for the odds ratio is based on Equation (2.43)
and Equation (2.44), analogous to that of the relative risk.
Example 5.3. The odds ratio with confidence interval for the pre-eclampsia data is given
in R-Code 5.6. Notice that the function fisher.test (see Remark 5.2) also calculates
the odds ratios. As they are based on a likelihood calculation, there are minor differences
between both estimates. ♣
Confidence intervals for the odds ratio (OR)

An approximate (1 − α) confidence interval for OR, based on a two-
dimensional contingency table (Table 5.1), is

[ exp( log(ÔR) ± z_{1−α/2} SE(log(ÔR)) ) ],   (5.27)

where for the empirical confidence intervals we use the estimates

ÔR = h11 h22 / (h12 h21)  and  SE(log(ÔR)) = √( 1/h11 + 1/h12 + 1/h21 + 1/h22 ).
R-Code 5.6 Odds ratio with confidence interval, approximate and exact.
print( OR <- tab[1]*tab[4]/(tab[2]*tab[3]))
## [1] 0.74313
s <- sqrt( sum( 1/tab) )
exp( log(OR) + qnorm(c(.025,.975))*s)
## [1] 0.58626 0.94196
# Exact test from Fisher:
fisher.test(tab)
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.016
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.58166 0.94826
## sample estimates:
## odds ratio
## 0.74321
5.5 Bibliographic Remarks and Links
See also the package binom and the references therein for further confidence intervals for
proportions.
Chapter 6
Rank-Based Methods
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter06.R.
In this chapter, we discuss basic approaches to estimation and testing for cases in
which the data are not normally distributed. This means that there could be outliers, the
data might not be measured on the interval/ratio scale, etc.
6.1 Robust Point Estimates
As seen in previous chapters, “classic” estimates of the mean and the variance are

x̄ = (1/n) Σ_{i=1}^{n} xi  and  s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)².   (6.1)
If, hypothetically, we set an arbitrary value xi to an infinitely large value (i.e., we create
an extreme outlier), these estimates above also “explode”.
Robust estimators are not sensitive to one or possibly several outliers. They often do
not require specific distributional assumptions on the random sample.
The mean is therefore not a robust estimate of location. A robust estimate of location
is the trimmed mean, in which the biggest and smallest values are trimmed away and
not considered. The (empirical) median (the middle value of an odd amount of data or
the center of the two middle-most values of an even amount of data) is another robust
estimate of location.
Robust estimates of the dispersion (data spread) are (1) the (empirical) interquartile
range (IQR), calculated as the difference between the third and first quartiles, and (2)
the (empirical) median absolute deviation (MAD), calculated as

MAD = c · median( |xi − median(x1, . . . , xn)| ),   (6.2)

where most software programs (including R) use c = 1.4826. The choice of c is such that,
for normally distributed random variables, the MAD seen as an estimator is unbiased,
i.e., E(MAD) = σ. Since for normally distributed random variables IQR = 2Φ−1(3/4)σ ≈
1.349σ, IQR/1.349, seen as an estimator, is an estimator of σ.
Example 6.1. Let the values 1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7 be given. Unfortunately, we
have entered the final number as 27. R-Code 6.1 compares several statistics and illustrates
the effect of a single outlier on the estimates. ♣
R-Code 6.1 Classic and robust estimates from Example 6.1.
sam <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 27)
print( c(mean(sam), mean(sam, trim=.2), median(sam)))
## [1] 5.6143 2.2400 2.1000
print( c(sd(sam), IQR(sam), mad(sam)))
## [1] 9.45083 0.85000 0.44478
The estimators of the trimmed mean or the median do not possess simple distribution
functions and for this reason the corresponding confidence intervals are not easy to
calculate. Based on

Mean ± z_{1−α/2} · √( Variance / n ),   (6.3)

approximate confidence intervals can, however, be easily constructed. For example,
median( x)+c(-2,2)*mad( x)/sqrt(length( x)) is an approximate empirical 95% confidence
interval for the median.
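As a small sketch, using the values of Example 6.1 entered correctly (without the outlier typo):

```r
x <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7)
# approximate empirical 95% confidence interval for the median, based on (6.3)
ci <- median( x) + c(-2, 2)*mad( x)/sqrt( length( x))
ci
```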
A second disadvantage of robust estimators is their low efficiency, i.e., these estimators
have larger variances. Formally, the efficiency is the ratio of the variance of one estimator
to the variance of the second estimator.
In some cases the exact variance of robust estimators can be determined; more often,
approximations or asymptotic results exist. For a continuous random variable with
cdf F(x), the empirical median is asymptotically normally distributed around the true
median η = F−1(1/2) with variance (4n f(η)²)−1, where f(x) is the density function. The
following example and R-Code 6.2 illustrate this result.
Example 6.2. Let X1, . . . , X10 iid∼ N(0, σ²). Figure 6.1 shows the (smoothed) empirical
densities of the sample mean and the sample median based on N = 1000 repetitions. The
empirical efficiency is roughly 74%. Because the density is symmetric, η = µ = 0 and
thus the asymptotic efficiency is

(σ²/n) / ( 1/(4n f(0)²) ) = (σ²/n) · 4n · ( 1/(√(2π)σ) )² = 2/π ≈ 64%.   (6.4)

Of course, if we change the distribution of the Xi, the efficiency changes. For example,
with a density with heavier tails than the normal, like a t-distribution with 4 degrees of
freedom, the empirical efficiency is roughly 1.23, which means that the median is better. ♣
6.1. ROBUST POINT ESTIMATES 79
R-Code 6.2 Distribution of empirical mean and median, see Example 6.2. (See
Figure 6.1.)
n <- 10
N <- 1000
samples <- array( rnorm(n*N), c(N,n))
means <- apply( samples, 1, mean)
medians <- apply( samples, 1, median)
print( c( var(means), var(medians), var(means)/var(medians)))
## [1] 0.099383 0.134253 0.740266
hist( medians, border=7, col=7, prob=T, main='', ylim=c(0,1.2))
hist( means, add=T, prob=T)
lines( density( medians), col=2)
lines( density( means), col=1)
# with a t_4 the situation is different!
samples <- array( rt(n*N, df=4), c(N,n))
means <- apply( samples, 1, mean)
medians <- apply( samples, 1, median)
print( c( var(means), var(medians), var(means)/var(medians)))
## [1] 0.19186 0.15634 1.22719
Figure 6.1: Efficiency of estimators. Medians in yellow with red density, means
in black. (See R-Code 6.2.)
The decision as to whether a realization of a random sample contains outliers is not
always easy and some care is needed. For example, for all distributions with support on
all of R, observations will lie outside the whiskers of a box plot when n is sufficiently
large and will thus be “marked” as outliers. Obvious outliers are easy to identify and
eliminate, but in less clear cases robust estimation methods are preferred.
Outliers can be very difficult to recognize in multivariate random samples, because they
are not readily apparent with respect to the marginal distributions. Robust methods for
random vectors exist, but are often computationally intense and not as intuitive as for
scalar values.
6.2 Rank-Based Tests
In Chapter 4 we considered tests to compare means. These tests assume normally dis-
tributed data for exact results. Slight deviations from a normal distribution has typically
negligible consequences as the central limit theorem reassures that the mean is approx-
imately normal. However, if outliers are present or the data is measured on the ordinal
scale, the use of so-called ‘rank-based’ tests is recommended. Classical tests typically as-
sume a distribution that is parametrized (e.g., µ, σ2 in N (µ, σ2) or p in Bin(n, p)). Rank
based test do not prescribe a detailed distribution and are thus also called non-parametric
tests.
The rank of a value in a sequence is the position (order) of that value in the ordered
sequence (from smallest to largest). In particular, the smallest value has rank 1 and the
largest rank n. In the case of ties, the arithmetic mean of the ranks is used.
Example 6.3. The values 1.1, −0.6, 0.3, 0.1, 0.6, 2.1 have ranks 5, 1, 3, 2, 4 and 6.
However, the ranks of the absolute values are 5, (3+4)/2, 2, 1, (3+4)/2 and 6. ♣
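The ranks of Example 6.3 can be verified with the function rank, which averages ranks in case of ties:

```r
x <- c(1.1, -0.6, 0.3, 0.1, 0.6, 2.1)
rank( x)         # 5 1 3 2 4 6
rank( abs( x))   # 5 3.5 2 1 3.5 6, with the tie between |-0.6| and 0.6
```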
Rank-based tests only consider the ranks of the observations, not the magnitude of
the differences between the observations. The largest value always has the same rank and
therefore always has the same influence on the test statistic.
We now introduce two classical rank tests: (i) the Mann–Whitney U test (Wilcoxon–
Mann–Whitney U test) and (ii) the Wilcoxon signed rank test, i.e., rank-based versions
of Test 2 and Test 3, respectively.
6.2.1 Wilcoxon–Mann–Whitney Test
When using rank tests, the normality assumption is dropped and we test whether the
samples come from the same distribution up to a location shift, i.e., the two distributions
have the same shape but may be shifted. The Wilcoxon–Mann–Whitney test can then be
interpreted as comparing the medians of the two populations, see Test 7.
The quantile, density and distribution functions of the test statistic U are implemented
in R with [q,d,p]wilcox. For example, the critical value Utab(nx, ny;α/2) mentioned in
Test 7 is qwilcox( .025, nx, ny) for α = 5% and corresponding p-value 2*pwilcox(
Uobs, nx, ny).
Test 7: Comparison of location for two independent samples

Question: Are the medians of two independent samples significantly different?

Assumptions: Both populations follow continuous distributions of the same
shape, the samples are independent and the data are at least ordinally
scaled.

Calculation: Let nx ≤ ny, otherwise switch the samples. Assemble the
(nx + ny) sample values into a single set ordered by rank and calculate
the sums Rx and Ry of the sample ranks. Calculate Ux and Uy:

• Ux = nx · ny + nx(nx + 1)/2 − Rx
• Uy = nx · ny + ny(ny + 1)/2 − Ry   (Ux + Uy = nx · ny)
• Uobs = min(Ux, Uy)

Decision: Reject H0: “medians are the same” if Uobs < Utab(nx, ny; α/2),
where Utab is the tabulated critical value.

Calculation in R: wilcox.test(x, y, conf.level=1-alpha)
It is possible to approximate the distribution of the test statistic U by a Gaussian one.
The U statistic is then transformed by

zobs = ( Uobs − nx·ny/2 ) / √( nx·ny·(nx + ny + 1)/12 ),   (6.5)

where nx·ny/2 is the mean of U and the denominator is its standard deviation. This
value is then compared with the respective quantile of the standard normal distribution.
The normal approximation may be used with sufficiently large samples, nx ≥ 2 and
ny ≥ 8. With additional continuity corrections, the approximation may be improved.
To construct confidence intervals, the argument conf.int=TRUE must be used in the
function wilcox.test.
In case of ties, R may not be capable to calculate exact p-values and thus will issue a
warning. The warning can be avoided by not requiring exact p-values through the setting
of the argument exact=FALSE.
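The quality of the normal approximation (6.5) can be checked against the exact distribution of U; a sketch with hypothetical values of nx, ny and Uobs:

```r
nx <- 8; ny <- 10          # hypothetical sample sizes
Uobs <- 15                 # hypothetical observed U statistic
zobs <- (Uobs - nx*ny/2)/sqrt( nx*ny*(nx + ny + 1)/12)   # eq (6.5)
# two-sided p-values: exact versus normal approximation
c( exact=2*pwilcox( Uobs, nx, ny), approx=2*pnorm( zobs))
```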
6.2.2 Wilcoxon Signed Rank Test

The Wilcoxon signed rank test is for matched/paired samples (or repeated measurements
on the same sample) and can also be used to compare the median with a theoretical value.
As in the paired two-sample t-test, we calculate the differences. However, here we compare
the ranks of the negative and positive differences. If the two samples are from the same
distribution, the ranks and thus the rank sums should be comparable. Zero differences are
omitted from the sample and the sample size is reduced accordingly. Details are given in
Test 8.
The quantile, density and distribution functions of the Wilcoxon signed rank test
statistic are implemented in R with [q,d,p]signrank. For example, the critical value
Wtab(n; α/2) mentioned in Test 8 is qsignrank( .025, n) for α = 5% and the
corresponding p-value 2*psignrank( Wobs, n).
Test 8: Comparison of location of two dependent samples

Question: Are the medians of two dependent samples significantly different?

Assumptions: Both populations follow continuous distributions of the same
shape, the samples are dependent and the data are measured at least
on the ordinal scale.

Calculation: (1) Calculate the differences di = xi − yi between the sample
values. Ignore all differences di = 0 and consider only the remaining
n⋆ differences di ≠ 0.

(2) Order the n⋆ differences di by their absolute values |di|.

(3) Calculate W+, the sum of the ranks of all the positive differences
di > 0, and also W−, the sum of the ranks of all the negative
differences di < 0:

Wobs = min(W+, W−)   (W+ + W− = n⋆(n⋆ + 1)/2)

Decision: Reject H0: “the medians are the same” if Wobs < Wtab(n⋆; α/2),
where Wtab is the tabulated critical value.

Calculation in R: wilcox.test(x-y, conf.level=1-alpha) or
wilcox.test(x, y, paired=TRUE, conf.level=1-alpha)
As in the case of the Wilcoxon–Mann–Whitney test, it is again possible to approximate
the distribution of the test statistic with a normal distribution. The statistic is
transformed as follows:

zobs = ( Wobs − n⋆(n⋆ + 1)/4 ) / √( n⋆(n⋆ + 1)(2n⋆ + 1)/24 ),   (6.6)

and then zobs is compared with the corresponding quantile of the standard normal
distribution. This approximation may be used when the sample is sufficiently large,
n⋆ ≥ 20.
Example 6.4. Consider the portions data again. R-Code 6.3 performs these tests. As
expected, the p-values are similar to those in Chapter 4. Because ties exist, exact
p-values cannot be calculated. To avoid warnings, use the argument exact=FALSE.
The advantage of robust methods becomes clear when the first value of the second
series is badly corrupted, as shown towards the end of the same R-Code. While the
p-value of the signed rank test virtually does not change, the one from the paired two-sample
t-test changes from 0.4 (see R-Code 4.5) to 0.27. ♣
R-Code 6.3: Rank tests and comparison of paired tests with a corrupted
observation.
wilcox.test( portions.normal, mu=9, exact=FALSE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: portions.normal
## V = 25, p-value = 0.028
## alternative hypothesis: true location is not equal to 9
wilcox.test( portions.normal, portions.during, exact=FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: portions.normal and portions.during
## W = 118, p-value = 0.37
## alternative hypothesis: true location shift is not equal to 0
wilcox.test( portions.normal, portions.during, paired=TRUE)
## Warning in wilcox.test.default(portions.normal, portions.during,
## paired = TRUE): cannot compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: portions.normal and portions.during
## V = 31.5, p-value = 0.035
## alternative hypothesis: true location shift is not equal to 0
portions.during[1] <- 62.5 # badly corrupted value
t.test( portions.normal, portions.during, paired=TRUE)[3:4]
## $p.value
## [1] 0.27481
##
## $conf.int
## [1] -10.9003 3.3167
## attr(,"conf.level")
## [1] 0.95
wilcox.test( portions.normal, portions.during, paired=TRUE,
conf.int=TRUE, exact=FALSE)[c("p.value","conf.int")]
## $p.value
## [1] 0.031226
##
## $conf.int
## [1] -1.174996 -0.050017
## attr(,"conf.level")
## [1] 0.95
6.3 Other Tests
6.3.1 Sign Test
Given paired data X1, . . . , Xn, Y1, . . . , Yn from symmetric distributions, we consider the
signs of the differences Di = Yi−Xi. Under the null hypothesis H0 : P(Di > 0) = P(Di <
0) = 1/2, the signs are binomially distributed with probability p = 0.5. If too many or
too few negative signs are observed, the null hypothesis is rejected. If an observed
difference di is zero, it is dropped from the sample and n is reduced accordingly.
The test is based on very weak assumptions and therefore has little power. The test
procedure is summarized in Test 9, and R-Code 6.4 illustrates the sign test along Exam-
ple 6.4. The resulting p-values are very similar to those of the Wilcoxon signed rank test.
Test 9: Sign test
Question: Are the medians of two paired/matched samples significantly dif-
ferent?
Assumptions: Both populations follow continuous distributions of the same
shape, the samples are paired (dependent) and the data are at least ordinally
scaled.
Calculation: (1) Calculate the differences di = xi − yi. Ignore all differences
di = 0 and consider only the n* differences di ≠ 0.
(2) Categorize each difference di by its sign (+ or −) and count the
number of positive signs: k = Σ_{i=1}^{n*} I(di > 0).
(3) If the medians of both samples are the same, k is a realization
from a binomial distribution Bin(n*, p = 0.5).
Decision: Reject H0: p = 0.5 (i.e., the medians are the same) if k > b_tab =
b(n*, 0.5, 1 − α/2) or k < b_tab = b(n*, 0.5, α/2), where b(n*, 0.5, q)
denotes the q-quantile of Bin(n*, 0.5).
Calculation in R: binom.test( sum( d>0), sum( d!=0),
conf.level=1-alpha)
R-Code 6.4 Sign test.
d <- portions.normal - portions.during
binom.test( sum( d>0), sum( d!=0))
##
## Exact binomial test
##
## data: sum(d > 0) and sum(d != 0)
## number of successes = 4, number of trials = 17, p-value =
## 0.049
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.068108 0.498993
## sample estimates:
## probability of success
## 0.23529
6.3.2 Permutation Test
Permutation tests can be used to answer many different questions. They are based on
the idea that, under the null hypothesis, the samples being compared are the same. In
this case, the result of the test would not change if the values of both samples were to be
randomly reassigned to either group. In R-Code 6.5 we show an example of a permutation
test for comparing the means of two independent samples.
Permutation tests are straightforward to implement manually and thus are often used
in settings where the distribution of the test statistic is complex or even unknown.
Note that the package exactRankTests is no longer actively developed; use of its
function perm.test is therefore discouraged. The functions of the package coin can be
used instead, and this package also includes extensions to other rank-based tests.
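The manual construction described above can be sketched as follows; the two samples x and y are small hypothetical data, and a Monte Carlo sample of random reallocations replaces the full enumeration of all (m+n choose n) allocations.

```r
# Sketch: Monte Carlo permutation test for a difference in means,
# using small hypothetical samples x and y.
set.seed(1)
x <- c(5.2, 4.8, 6.1, 5.5, 5.9)            # group 1, m = 5
y <- c(4.1, 4.6, 5.0, 4.4)                 # group 2, n = 4
dobs <- mean(x) - mean(y)                  # observed difference
pooled <- c(x, y)
perm <- replicate(10000, {                 # random reallocations
  s <- sample(length(pooled), length(x))   # indices of new group 1
  mean(pooled[s]) - mean(pooled[-s])
})
p.value <- mean(abs(perm) >= abs(dobs))    # proportion at least as extreme
p.value
```

With only 9 observations an exact enumeration of all choose(9, 5) = 126 allocations would also be feasible; the Monte Carlo version generalizes to larger samples.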
Test 10: Permutation test
Question: Are the means of two independent samples significantly different?
Assumptions: The null hypothesis is formulated such that, under H0, the
groups are exchangeable.
Calculation: (1) Calculate the difference dobs in the means of the two
groups to be compared (m observations in group 1, n observa-
tions in group 2).
(2) Form a random permutation of the values of both groups by ran-
domly allocating the observed values to the two groups (again m obser-
vations in group 1, n observations in group 2). There are
(m+n choose n) possibilities.
(3) Calculate the difference in means of the two new groups.
(4) Repeat this procedure many times.
Decision: Compare the p-value

(number of permuted differences at least as extreme as dobs) / (m+n choose n)

with the selected significance level.
Calculation in R: require(coin); oneway_test( formula, data)
R-Code 6.5 Permutation test comparing means of two samples.
require(coin)
Portions <- c(portions.normal, portions.during)
Group <- factor( c( rep(0, length(portions.normal)),
                    rep(1, length(portions.during))))
oneway_test(Portions ~ Group)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: Portions by Group (0, 1)
## Z = -0.715, p-value = 0.47
## alternative hypothesis: true mu is not equal to 0
6.4 Bibliographic Remarks and Links
There are many (somewhat “elderly”) books about robust statistics and rank tests. Siegel
and Castellan Jr (1988) was a very accessible classic for many decades. The book by
Hollander and Wolfe (1999) treats the topic in much more detail and depth.
Chapter 7
Multivariate Normal Distribution
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter07.R.
In Chapter 2 we introduced univariate random variables. We now extend the
framework to random vectors (i.e., multivariate random variables). We will mainly focus
on continuous random vectors, especially Gaussian random vectors. In the framework of
this document we can only cover a tiny part of the beautiful theory; we are pragmatic
and discuss what will be needed in the sequel.
7.1 Random Vectors
A random vector is a (column) vector X = (X1, . . . , Xp)T with p random variables as
components. The following definition is the generalization of the univariate cdf to the
multivariate setting (see Definition 2.1).
Definition 7.1. The multidimensional, i.e. multivariate, distribution function of X is
defined as
FX(x ) = P(X ≤ x ) = P(X1 ≤ x1, . . . , Xp ≤ xp), (7.1)
where the list is to be understood as the intersection (∩). ♦
The multivariate distribution function generally contains more information than the
set of marginal distributions, because (7.1) only simplifies to

F_X(x) = ∏_{i=1}^p P(Xi ≤ xi)

under independence (compare to Equation (2.3)).
The probability density function for a continuous random vector is defined in a similar
manner as for random variables.
Definition 7.2. The probability density function (density function, pdf) fX(x) of a p-
dimensional continuous random vector X is defined by

P(X ∈ A) = ∫_A fX(x) dx,   for all A ⊂ R^p.    (7.2)

♦
For convenience, we summarize here a few facts of random vectors with two continuous
components, i.e., for a bivariate random vector (X, Y )T. The univariate counterparts are
stated in Properties 2.1 and 2.3.
• The distribution function is monotonically increasing:
for x1 ≤ x2 and y1 ≤ y2, FX,Y (x1, y1) ≤ FX,Y (x2, y2).
• The distribution function is normalized:
lim_{x,y→∞} FX,Y (x, y) = FX,Y (∞, ∞) = 1.
We use the slight abuse of notation of writing ∞ in arguments instead of a limit.
• FX,Y (−∞, −∞) = FX,Y (x, −∞) = FX,Y (−∞, y) = 0.
• FX,Y (x, y) and fX,Y (x, y) are continuous (almost) everywhere.
• fX,Y (x, y) = ∂²/∂x∂y FX,Y (x, y).
• P(a < X ≤ b, c < Y ≤ d) = ∫_a^b ∫_c^d fX,Y (x, y) dy dx
= FX,Y (b, d) − FX,Y (b, c) − FX,Y (a, d) + FX,Y (a, c).
In the multivariate setting there is also the concept of marginalization, i.e., reducing
a higher-dimensional random vector to a lower-dimensional one. Intuitively, we “neglect”
components of the random vector by allowing them to take any value. In two dimensions,
we have
• FX(x) = P(X ≤ x, Y arbitrary) = FX,Y (x, ∞);
• fX(x) = ∫_R fX,Y (x, y) dy.
Definition 7.3. The expected value of a random vector X is defined as

E(X) = E( (X1, . . . , Xp)^T ) = ( E(X1), . . . , E(Xp) )^T.    (7.3)

♦
Hence the expectation of a random vector is simply the vector of the individual ex-
pectations. Of course, to calculate these, we only need the marginal univariate densities
fXi(x), and thus the expectation does not depend on whether (7.1) factors or not. The
variance of a random vector requires a bit more thought and we first need the following.
Definition 7.4. The covariance between two arbitrary random variables X1 and X2 is
defined as

Cov(X1, X2) = E( (X1 − E(X1)) (X2 − E(X2)) ).    (7.4)

♦
Using the linearity properties of the expectation operator, it is possible to show the
following handy properties.
Property 7.1. We have for arbitrary random variables X1, X2 and X3:
i) Cov(X1, X2) = Cov(X2, X1),
ii) Cov(X1, X1) = Var(X1),
iii) Cov(a + bX1, c + dX2) = bd Cov(X1, X2),
iv) Cov(X1, X2 +X3) = Cov(X1, X2) + Cov(X1, X3).
The covariance describes the linear relationship between the random variables. The
correlation between two random variables X1 and X2 is defined as

Corr(X1, X2) = Cov(X1, X2) / sqrt( Var(X1) Var(X2) )    (7.5)

and corresponds to the normalized covariance. It holds that −1 ≤ Corr(X1, X2) ≤ 1,
with equality only in the degenerate case X2 = a + bX1 for some a and b ≠ 0.
Definition 7.5. The variance of a p-variate random vector X = (X1, . . . , Xp)^T is defined
as

Var(X) = E( (X − E(X)) (X − E(X))^T )    (7.6)
       = the p × p matrix with entries Cov(Xi, Xj) and diagonal Var(X1), . . . , Var(Xp),    (7.7)

called the covariance matrix or variance–covariance matrix. ♦
Similar to Properties 3.4, we have the following properties for random vectors.
Property 7.2. For an arbitrary p-variate random vector X, vectors a ∈ Rq and matrices
B ∈ Rq×p it holds:
i) Var(X) = E(XXT)− E(X) E(X)T,
ii) E(a + BX) = a + B E(X),
iii) Var(a + BX) = B Var(X)BT.
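Property 7.2 iii) is easy to check empirically by simulation; the vector a, matrix B and covariance matrix Sigma below are arbitrary illustrative choices, and only base R is used (the property holds for any random vector, not only Gaussian ones).

```r
# Sketch: empirical check of Var(a + BX) = B Var(X) B^T with
# illustrative a, B and Sigma; X is simulated such that Var(X) = Sigma.
set.seed(2)
Sigma <- matrix(c(2, 1, 1, 3), 2)                  # Var(X)
B <- matrix(c(1, 0, 2, -1), 2)                     # transformation matrix
a <- c(1, -1)
X <- matrix(rnorm(2e5), ncol = 2) %*% chol(Sigma)  # Var(X) = Sigma
Y <- t(a + B %*% t(X))                             # rows are a + B x_i
var(Y)                                             # close to B Sigma B^T
B %*% Sigma %*% t(B)
```

Adding the constant vector a shifts the mean but, as the property states, leaves the variance untouched.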
7.2 Multivariate Normal Distribution
We now consider a special multivariate distribution: the multivariate normal distribution,
by first considering the bivariate case.
7.2.1 Bivariate Normal Distribution
Definition 7.6. The random variable pair (X, Y ) has a bivariate normal distribution if

FX,Y (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y (u, v) dv du    (7.8)

with density

f(x, y) = fX,Y (x, y)    (7.9)
  = 1 / ( 2π σx σy sqrt(1 − ρ²) )
    · exp( − 1/(2(1 − ρ²)) · [ (x − µx)²/σx² + (y − µy)²/σy² − 2ρ(x − µx)(y − µy)/(σx σy) ] ),

for all x and y, where µx ∈ R, µy ∈ R, σx > 0, σy > 0 and −1 < ρ < 1. ♦
The role of some of the parameters µx, µy, σx, σy and ρ might be guessed. We will
discuss their precise meaning after the following example.
Example 7.1. R-Code 7.1 and Figure 7.1 show the density of a bivariate normal dis-
tribution with µx = µy = 0, σx = 1, σy = √5, and ρ = 2/√5 ≈ 0.9. Because of the
quadratic form in (7.9), the contour lines (isolines) are ellipses.
Several R packages implement the bivariate/multivariate normal distribution. We
recommend mvtnorm. ♣
R-Code 7.1 Density of a bivariate normal distribution. (See Figure 7.1.)
require( mvtnorm)
require( fields) # providing tim.colors()
Sigma <- array( c(1,2,2,5), c(2,2))
x1 <- seq( -3, to=3, length=100)
x2 <- seq( -3, to=3, length=99)
grid <- expand.grid( x1, x2)
densgrid <- dmvnorm( grid, mean=c(0,0), sigma=Sigma)
dens <- matrix( densgrid, length( x1), length( x2))
image(x1, x2, dens, col=tim.colors()) # left panel
faccol <- tim.colors()[cut(dens[-1,-1],64)]
persp(x1, x2, dens, col=faccol, border = NA, # right panel
tick='detailed', theta=120, phi=30, r=100)
Figure 7.1: Density of a bivariate normal distribution. (See R-Code 7.1.)
The multivariate normal distribution has many nice properties.
Property 7.3. For the bivariate normal distribution the marginal distributions
are X ∼ N (µx, σx²) and Y ∼ N (µy, σy²), and

E( (X, Y )^T ) = (µx, µy)^T,    Var( (X, Y )^T ) = [ σx²  ρσxσy ;  ρσxσy  σy² ].    (7.10)

Thus,

Cov(X, Y ) = ρσxσy,    Corr(X, Y ) = ρ.    (7.11)

If ρ = 0, X and Y are independent and vice versa.
Note, however, that the equivalence of independence and uncorrelatedness is specific
to jointly normal variables and cannot be assumed for random variables that are not
jointly normal.
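A classical counterexample can be sketched quickly: for X ∼ N (0, 1) and Y = X², Cov(X, Y ) = E(X³) = 0, so X and Y are uncorrelated, yet Y is completely determined by X.

```r
# Sketch: uncorrelated but dependent random variables (not jointly
# normal): X ~ N(0,1) and Y = X^2.
set.seed(3)
x <- rnorm(1e5)
y <- x^2                      # deterministic function of x
cor(x, y)                     # close to zero: uncorrelated
cor(x^2, y)                   # equal to one: perfectly dependent
```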
Example 7.2. R-Code 7.2 and Figure 7.2 show realizations from a bivariate normal
distribution for various values of the correlation ρ. Even for the large samples shown
here (n = 500), correlations between −0.25 and 0.25 are barely perceptible. ♣
R-Code 7.2 Realizations from a bivariate normal distribution for various values of ρ
(binorm data). (See Figure 7.2.)
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
  Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
  sample <- rmvnorm( 500, sigma=Sigma)
  plot( sample, pch='.', xlab='', ylab='')
  legend( "topleft", legend=bquote(rho==.(rho[i])), bty='n')
}
Figure 7.2: Realizations from a bivariate normal distribution. (See R-
Code 7.2.)
7.2.2 Multivariate Normal Distribution
In the general case we have to use vector notation. Surprisingly, we gain clarity even
compared to the bivariate case.
Definition 7.7. The random vector X = (X1, . . . , Xp)^T is multivariate normally dis-
tributed if

FX(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xp} fX(u1, . . . , up) du1 · · · dup    (7.12)

with density

fX(x1, . . . , xp) = fX(x) = 1 / ( (2π)^{p/2} det(Σ)^{1/2} ) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )    (7.13)

for all x ∈ R^p (µ ∈ R^p and symmetric, positive-definite Σ). We denote this distribution
by X ∼ Np(µ, Σ). ♦
Property 7.4. For the multivariate normal distribution we have:
E(X) = µ , Var(X) = Σ . (7.14)
Property 7.5. Let a ∈ R^q, B ∈ R^{q×p}, q ≤ p, rank(B) = q, and X ∼ Np(µ, Σ). Then

a + BX ∼ Nq( a + Bµ, BΣB^T ).    (7.15)
This last property has profound consequences. It asserts, for example, that the one-
dimensional marginal distributions are again Gaussian, with Xi ∼ N( (µ)i, (Σ)ii ),
i = 1, . . . , p. Similarly, any subset of components of X is again Gaussian with the
appropriate selection of the mean and covariance matrix.
We now discuss how to draw realizations from an arbitrary Gaussian random vector,
much in the spirit of Property 2.6.ii. Let I ∈ R^{p×p} be the identity matrix, the square
matrix with ones on the main diagonal and zeros elsewhere, and let L ∈ R^{p×p} be such
that LL^T = Σ. That means, L is like a “matrix square root” of Σ.
To draw a realization x from a p-variate random vector X ∼ Np(µ, Σ), one starts by
drawing p values z1, . . . , zp from Z1, . . . , Zp iid∼ N (0, 1) and sets z = (z1, . . . , zp)^T.
The vector z is then (linearly) transformed to µ + Lz. Since Z ∼ Np(0, I), Property 7.5
asserts that X = µ + LZ ∼ Np(µ, LL^T).
In practice, the Cholesky decomposition of Σ is often used, which factors a symmetric
positive-definite matrix into the product of a lower triangular matrix L and its
transpose. It holds that det(Σ) = det(L)² = ∏_{i=1}^p (L)_{ii}².
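In base R this construction reads as follows; the mean vector and covariance matrix are illustrative choices, and note that chol() returns the upper triangular factor, hence the transpose.

```r
# Sketch: drawing one realization of N_p(mu, Sigma) via the Cholesky
# factor, for an illustrative 3-dimensional example.
set.seed(4)
p <- 3
mu <- c(1, 0, -1)
Sigma <- matrix(c(2, 1, 0,
                  1, 2, 1,
                  0, 1, 2), p, p)
L <- t(chol(Sigma))             # chol() gives upper triangular R with R^T R = Sigma
z <- rnorm(p)                   # p iid standard normal draws
x <- mu + L %*% z               # one realization of N_p(mu, Sigma)
x
c(det(Sigma), prod(diag(L))^2)  # illustrates det(Sigma) = prod (L)_ii^2
```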
7.2.3 Conditional Distributions
We now consider properties of parts of the random vector X. For simplicity we write

X = ( X1^T, X2^T )^T,   X1 ∈ R^q, X2 ∈ R^{p−q},    (7.16)

and divide the mean vector µ and the matrix Σ into corresponding blocks:

X = ( X1^T, X2^T )^T ∼ Np( ( µ1^T, µ2^T )^T, [ Σ11  Σ12 ;  Σ21  Σ22 ] ).    (7.17)
Both (multivariate) marginal distributions X1 and X2 are again normally distributed with
X1 ∼ Nq(µ1,Σ11) and X2 ∼ Np−q(µ2,Σ22).
X1 and X2 are independent if Σ21 = 0 and vice versa.
Property 7.6. If one conditions a multivariate normally distributed random vector (7.17)
on a subvector, the result is itself multivariate normally distributed, with

X2 | X1 = x1 ∼ Np−q( µ2 + Σ21 Σ11^{−1} (x1 − µ1),  Σ22 − Σ21 Σ11^{−1} Σ12 ).    (7.18)

The expected value depends linearly on the value of x1, but the variance is independent
of the value of x1. The conditional expected value represents an update of X2 through
X1 = x1: the difference x1 − µ1 is normalized by the variance and scaled by the covariance.
Notice that for p = 2 (and q = 1), Σ21 Σ11^{−1} = ρσy/σx.
Equation (7.18) is probably one of the most important formulas you will encounter in
statistics, albeit not always explicitly.
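Equation (7.18) can be evaluated directly; the mean vector, covariance matrix and conditioning value x1 below are illustrative choices with q = 1.

```r
# Sketch: conditional mean and variance (7.18) for an illustrative
# 3-dimensional normal, conditioning on the first component (q = 1).
mu <- c(1, 0, 2)
Sigma <- matrix(c(4, 2, 1,
                  2, 3, 1,
                  1, 1, 2), 3, 3)
i1 <- 1; i2 <- 2:3                          # indices of X1 and X2
x1 <- 2.5                                   # observed value of X1
S11 <- Sigma[i1, i1, drop = FALSE]
S12 <- Sigma[i1, i2, drop = FALSE]
S21 <- Sigma[i2, i1, drop = FALSE]
S22 <- Sigma[i2, i2]
condmean <- mu[i2] + S21 %*% solve(S11, x1 - mu[i1])   # linear in x1
condvar <- S22 - S21 %*% solve(S11, S12)               # free of x1
drop(condmean)
condvar
```

Changing x1 moves the conditional mean along a line while condvar stays fixed, exactly as stated below (7.18).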
7.3 Estimation
The estimators are constructed in a manner similar to that for the univariate case. Let
x1, . . . , xn be a realization of the random sample X1, . . . , Xn iid∼ Np(µ, Σ). We have the
following estimators

µ̂ = X̄ = (1/n) ∑_{i=1}^n Xi,    Σ̂ = 1/(n − 1) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)^T    (7.19)

and corresponding estimates

µ̂ = x̄ = (1/n) ∑_{i=1}^n xi,    Σ̂ = 1/(n − 1) ∑_{i=1}^n (xi − x̄)(xi − x̄)^T.    (7.20)
A normally distributed random variable is determined by two parameters. A multi-
variate normally distributed random vector is determined by p parameters (for µ) plus
p(p + 1)/2 parameters (for Σ); the remaining p(p − 1)/2 entries of Σ are determined by
symmetry. However, the p(p + 1)/2 values cannot be chosen arbitrarily, since Σ must be
positive-definite (in the univariate case, σ > 0 must be satisfied). As long as n > p, the
estimator in (7.19) satisfies this condition. If p ≥ n, additional assumptions about the
structure of the matrix Σ are needed.
Example 7.3. R-Code 7.3 estimates the expected value and the covariance matrix of
the realization with ρ = 0.9 from Figure 7.2. Since n = 500, the estimates are close to
the “true” values.
The function ellipse from the package ellipse draws confidence regions for the esti-
mated parameters. R-Code 7.4 and Figure 7.3 show the estimated 95% and 50% confidence
regions. As n increases, the estimation improves. ♣
R-Code 7.3 Estimates for the realization ρ = 0.9 from Figure 7.2.
colMeans(sample) # apply( sample, 2, mean) # is identical
## [1] 0.0027344 -0.0136907
var(sample) # cov( sample) # is identical
## [,1] [,2]
## [1,] 1.01773 0.91706
## [2,] 0.91706 0.99525
R-Code 7.4 Bivariate normally distributed random numbers for various sample sizes
with contour lines of the density and estimated moments. (See Figure 7.3.)
require( ellipse)
n <- c(10, 100, 500, 1000)
mu <- c(2,1)
Sigma <- matrix( c(4,2,2,2), 2)
for (i in 1:4) {
  plot( ellipse( Sigma, cent=mu, level=.95), col='gray',
        xaxs='i', yaxs='i', xlim=c(-4,8), ylim=c(-4,6), type='l')
  lines( ellipse( Sigma, cent=mu, level=.5), col='gray')
  sample <- rmvnorm( n[i], mean=mu, sigma=Sigma)
  points( sample, pch='.', cex=2)
  Sigmahat <- cov( sample)
  muhat <- apply( sample, 2, mean)
  lines( ellipse( Sigmahat, cent=muhat, level=.95), col=2, lwd=2)
  lines( ellipse( Sigmahat, cent=muhat, level=.5), col=4, lwd=2)
  points( rbind( muhat), col=3, cex=2)
  text( -2, 4, paste('n =', n[i]))
}
Figure 7.3: Bivariate normally distributed random numbers. The contour lines
of the density are in gray, the corresponding estimated 95% (50%) confidence
region in red (blue) and the arithmetic mean in green. (See R-Code 7.4.)
7.4 Bibliographic Remarks and Links
The online book “Matrix Cookbook” is a very good synopsis of the important formulas
related to vectors and matrices (Petersen and Pedersen, 2008).
7.5 Appendix: Positive-Definite Matrices
Definition 7.8. An n × n matrix A is positive-definite if x^T A x > 0 for all x ≠ 0. If
A = A^T, the matrix is symmetric positive-definite. ♦
Property 7.7. A few important properties of symmetric positive-definite matrices A =
(aij) ∈ Rp×p are:
i) rank(A) = p;
ii) det(A) > 0;
iii) All eigenvalues of A are positive, λi > 0;
iv) aii > 0;
v) aii ajj − aij² > 0, i ≠ j;
vi) aii + ajj − 2|aij| > 0, i ≠ j;
vii) A−1 is symmetric positive-definite;
viii) All principal submatrices of A are symmetric positive-definite;
ix) There exists a nonsingular lower triangular matrix L, such that A = LLT.
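Several of these properties are easy to verify numerically; the matrix A below is a small illustrative example.

```r
# Sketch: numerical checks of some properties in Property 7.7 for an
# illustrative symmetric positive-definite 2 x 2 matrix A.
A <- matrix(c(4, 2, 2, 3), 2, 2)
eigen(A)$values               # all eigenvalues positive (iii)
det(A)                        # positive determinant (ii); here 4*3 - 2*2 = 8
L <- t(chol(A))               # lower triangular factor with A = L L^T (ix)
all.equal(L %*% t(L), A)
solve(A)                      # the inverse, again symmetric pd (vii)
```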
If we write a nonsingular matrix A as a 2 × 2 block matrix, we have

A^{−1} = [ A11  A12 ;  A21  A22 ]^{−1}
       = [ C1^{−1}   −A11^{−1} A12 C2^{−1} ;  −C2^{−1} A21 A11^{−1}   C2^{−1} ]    (7.21)

with the Schur complements

C1 = A11 − A12 A22^{−1} A21,    C2 = A22 − A21 A11^{−1} A12    (7.22)

and

C1^{−1} = A11^{−1} + A11^{−1} A12 C2^{−1} A21 A11^{−1},    C2^{−1} = A22^{−1} + A22^{−1} A21 C1^{−1} A12 A22^{−1}.    (7.23)
It also holds that det(A) = det(A11) det(C2) = det(A22) det(C1).
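These block identities can be verified numerically; the matrix A below is an arbitrary symmetric positive-definite 4 × 4 example split into 2 × 2 blocks.

```r
# Sketch: numerical check of the block inverse (7.21) and the
# determinant identity, for an illustrative 4 x 4 spd matrix.
set.seed(5)
M <- matrix(rnorm(16), 4)
A <- crossprod(M) + diag(4)              # symmetric positive-definite
i1 <- 1:2; i2 <- 3:4
A11 <- A[i1, i1]; A12 <- A[i1, i2]
A21 <- A[i2, i1]; A22 <- A[i2, i2]
C1 <- A11 - A12 %*% solve(A22) %*% A21   # Schur complements (7.22)
C2 <- A22 - A21 %*% solve(A11) %*% A12
Ainv <- rbind(cbind(solve(C1), -solve(A11) %*% A12 %*% solve(C2)),
              cbind(-solve(C2) %*% A21 %*% solve(A11), solve(C2)))
all.equal(Ainv, solve(A))                # block formula (7.21)
all.equal(det(A), det(A11) * det(C2))    # determinant identity
```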
Chapter 8
Correlation and Simple Regression
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter08.R.
In this chapter (and the following three chapters) we consider “linear models.” A
detailed discussion of linear models would fill an entire lecture module, hence we consider
here only the most important elements.
Simple linear regression is commonly considered the archetypal task of statistics
and is often introduced as early as middle school.
In this chapter, we will (i) quantify the (linear) relationship between two variables, (ii)
explain one variable through another variable with the help of a “model” and (iii) give
further tips.
8.1 Correlation
Our goal is to quantify the linear relationship between X and Y with the help of n pairs
of values (x1, y1), . . . , (xn, yn), the realizations of the two random variables (X, Y ).
An intuitive estimate of the correlation between X and Y (see Equation (7.5)) replaces
the covariance and variances by their sample counterparts:

r = ∑_i (xi − x̄)(yi − ȳ) / sqrt( ∑_i (xi − x̄)² · ∑_j (yj − ȳ)² ),    (8.1)

which is also called the Pearson correlation coefficient. Just like the correlation, the
Pearson correlation coefficient lies in the interval [−1, 1].
Example 8.1. R-Code 8.1 estimates the correlation of the scatter plot data from Fig-
ure 7.2. Although n = 500, in the case ρ = 0.1, the estimate is more than two times too
large. ♣
R-Code 8.1 binorm data: Pearson correlation coefficient of the scatter plot from
Figure 7.2. Spearman’s ρ or Kendall’s τ are given as well.
require( mvtnorm)
set.seed(12)
rho <- array(0,c(4,6))
rownames(rho) <- c("rho","Pearson","Spearman","Kendall")
rho[1,] <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
  Sigma <- array( c(1, rho[1,i], rho[1,i], 1), c(2,2))
  sample <- rmvnorm( 500, sigma=Sigma)
  rho[2,i] <- cor( sample)[2]
  rho[3,i] <- cor( sample, method="spearman")[2]
  rho[4,i] <- cor( sample, method="kendall")[2]
}
print( rho, digits=2)
## [,1] [,2] [,3] [,4] [,5] [,6]
## rho -0.25 0.000 0.10 0.25 0.75 0.90
## Pearson -0.22 0.048 0.22 0.28 0.78 0.91
## Spearman -0.22 0.066 0.18 0.26 0.77 0.91
## Kendall -0.15 0.045 0.12 0.18 0.57 0.74
The correlation is a measure for the linear relationship; however, the correlation coef-
ficient is not robust, see R-Code 8.2 and Figure 8.1.
Rank correlation coefficients, such as Spearman’s ρ or Kendall’s τ , are “robust” and
are seen as non-parametric correlation estimates. Spearman’s ρ is calculated similarly to
(8.1), where the values are replaced by their ranks. Kendall’s τ compares the number
of concordant (if xi < xj then yi < yj) and discordant pairs (if xi < xj then yi > yj).
R-Code 8.2 compares the various correlation estimates of the anscombe data in Figure 8.1.
R-Code 8.2: anscombe data: visualization and correlation estimates. (See
Figure 8.1.)
library( faraway)
data( anscombe)
head( anscombe, 3)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
with( anscombe, c(cor(x1, y1), cor(x2, y2), cor(x3, y3), cor(x4, y4)))
## [1] 0.81642 0.81624 0.81629 0.81652
with( anscombe, { plot(x1, y1); plot(x2, y2); plot(x3, y3); plot(x4, y4) })
sel <- c(0:3*9+5) # extract diagonal entries of sub-block
print(rbind( pearson=cor(anscombe)[sel],
spearman=cor(anscombe, method='spearman')[sel],
kendall=cor(anscombe, method='kendall')[sel]), digits=2)
## [,1] [,2] [,3] [,4]
## pearson 0.82 0.82 0.82 0.82
## spearman 0.82 0.69 0.99 0.50
## kendall 0.64 0.56 0.96 0.43
Figure 8.1: anscombe data, the four cases all have a Pearson correlation coef-
ficient of 0.82. (See R-Code 8.2.)
For bivariate normally distributed random variables, r is an estimate of the correlation
parameter ρ. Let R be the corresponding estimator of ρ based on (8.1). The random
variable

T = R sqrt(n − 2) / sqrt(1 − R²)    (8.2)
is, under H0 : ρ = 0, t-distributed with n− 2 degrees of freedom. The corresponding test
is described under Test 11.
Test 11: Test of correlation
Question: Is the correlation between two paired samples significant?
Assumptions: The pairs of values come from a bivariate normal distribution.
Calculation: t_obs = |r| sqrt(n − 2) / sqrt(1 − r²),
where r is the Pearson correlation coefficient (8.1).
Decision: Reject H0: ρ = 0 if tobs > ttab = tn−2,1−α/2.
Calculation in R: cor.test( x, y, conf.level=1-alpha)
In order to construct confidence intervals for correlation estimates, we typically need
the so-called Fisher transformation

W(r) = (1/2) log( (1 + r)/(1 − r) ) = arctanh(r)    (8.3)

and the fact that, for bivariate normally distributed random variables, the distribution
of W(R) is approximately N( W(ρ), 1/(n − 3) ).
Astoundingly large sample sizes are needed for correlation estimates around 0.25 to
be significant.
Confidence interval for the Pearson correlation coefficient
An approximate (1 − α) confidence interval for ρ is

[ tanh( arctanh(r) − z_{1−α/2}/sqrt(n − 3) ),  tanh( arctanh(r) + z_{1−α/2}/sqrt(n − 3) ) ],

where z_{1−α/2} is the corresponding standard normal quantile and tanh and arctanh
are the hyperbolic and inverse hyperbolic tangent functions.
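The interval can be computed by hand and compared with cor.test(), which uses exactly this Fisher-z construction for the Pearson case; the data below are simulated for illustration.

```r
# Sketch: Fisher-z confidence interval for the correlation, computed
# by hand and compared with cor.test() (simulated illustrative data).
set.seed(6)
n <- 50
x <- rnorm(n)
y <- 0.25 * x + rnorm(n)                 # moderate true correlation
r <- cor(x, y)
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
ci <- tanh(atanh(r) + c(-1, 1) * z / sqrt(n - 3))
ci
cor.test(x, y)$conf.int                  # essentially the same interval
```

Note how wide the interval is for n = 50, illustrating the remark above about the sample sizes needed for moderate correlations.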
8.2 Simple Linear Regression
In simple linear regression a (dependent) variable is explained linearly through a single
independent variable:
Yi = µi + εi (8.4)
= β0 + β1xi + εi, i = 1, . . . , n, (8.5)
with
• Yi: dependent variable, measured values, observations
• xi: independent variable, predictor
• β0, β1: parameters (unknown)
• εi: measurement error, error term, noise (unknown), with symmetric distribution
around zero, E(εi) = 0.
It is often also assumed that Var(εi) = σ² and/or that the errors are independent of
each other. For simplicity, we assume εi iid∼ N (0, σ²) with unknown σ². Thus, Yi ∼
N (β0 + β1 xi, σ²), i = 1, . . . , n, and Yi and Yj are independent for i ≠ j.
Fitting a linear model in R is straightforward as shown in a motivating example below.
Example 8.2. (hardness data) One of the steps in the manufacturing of metal springs
is a quenching bath. The temperature of the bath has an influence on the hardness of the
springs. Data is taken from Abraham and Ledolter (2006). Figure 8.2 shows the Rockwell
hardness of coil springs as a function of the temperature of the quenching bath, as well as
the line of best fit. The Rockwell scale measures the hardness of technical materials and
is denoted by HR (Hardness Rockwell). R-Code 8.3 shows how a simple linear regression
for this data is performed. ♣
R-Code 8.3 hardness data from Example 8.2, see Figure 8.2.
Temp <- rep( 10*3:6, c(4, 3, 3, 4))
Hard <- c(55.8, 59.1, 54.8, 54.6, 43.1, 42.2, 45.2,
31.6, 30.9, 30.8, 17.5, 20.5, 17.2, 16.9)
plot( Temp, Hard, # scatter plot of the data
xlab="Temperature [C]", ylab="Hardness [HR]")
lm1 <- lm( Hard~Temp) # fit a linear model
abline( lm1)
Figure 8.2: hardness data: hardness as a function of temperature (see R-
Codes 8.3 and 8.4).
The main idea of regression is to estimate the parameters β0 and β1 such that the sum
of squared errors

∑_{i=1}^n (yi − β0 − β1 xi)²    (8.6)

is minimized. This concept is also called the least squares method. The solution is

β̂1 = s_xy / s_x² = r s_y / s_x = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)²,    (8.7)
β̂0 = ȳ − β̂1 x̄.    (8.8)

The predicted values are

ŷi = β̂0 + β̂1 xi,    (8.9)

which lie on the estimated regression line

ŷ = β̂0 + β̂1 x.    (8.10)
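The estimates (8.7) and (8.8) can be computed by hand for the hardness data and compared with the lm() fit from R-Code 8.3.

```r
# Sketch: least squares estimates (8.7) and (8.8) computed by hand
# for the hardness data and compared with lm().
Temp <- rep(10 * 3:6, c(4, 3, 3, 4))
Hard <- c(55.8, 59.1, 54.8, 54.6, 43.1, 42.2, 45.2,
          31.6, 30.9, 30.8, 17.5, 20.5, 17.2, 16.9)
beta1 <- sum((Temp - mean(Temp)) * (Hard - mean(Hard))) /
  sum((Temp - mean(Temp))^2)
beta0 <- mean(Hard) - beta1 * mean(Temp)
c(beta0 = beta0, beta1 = beta1)   # matches coef(lm(Hard ~ Temp))
```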
R-Code 8.4 hardness data from Example 8.2, see Figure 8.2.
summary( lm1) # summary of the fit
##
## Call:
## lm(formula = Hard ~ Temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.550 -1.190 -0.369 0.599 2.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.1341 1.5750 59.8 3.2e-16 ***
## Temp -1.2662 0.0339 -37.4 8.6e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.5 on 12 degrees of freedom
## Multiple R-squared: 0.991,Adjusted R-squared: 0.991
## F-statistic: 1.4e+03 on 1 and 12 DF, p-value: 8.58e-14
coef( lm1)
## (Intercept) Temp
## 94.1341 -1.2662
head( fitted( lm1))
## 1 2 3 4 5 6
## 56.149 56.149 56.149 56.149 43.488 43.488
head( resid( lm1))
## 1 2 3 4 5 6
## -0.34945 2.95055 -1.34945 -1.54945 -0.38791 -1.28791
head( Hard - (fitted( lm1) + resid( lm1)))
## 1 2 3 4 5 6
## 0 0 0 0 0 0
In simple linear regression, the corresponding estimators can be directly derived from
Equation (8.5). For multiple regression, meaning for models with more than one indepen-
dent variable, it is better to begin directly with matrix notation to determine estimators
for an arbitrary number of predictors (see Chapter 10).
In simple linear regression, the central task is often to determine whether there exists a
linear relationship between the dependent and independent variables. This can be tested
with the hypothesis H0 : β1 = 0 (Test 12).
Test 12: Test of a linear relationship
Question: Is there a linear relationship between two paired samples?
Assumptions: Based on the sample x1, . . . , xn, the second sample is a real-
ization of a normally distributed random variable with expected value
β0 + β1xi and variance σ2.
Calculation: t_obs = |β̂1| (s_x/s_y) sqrt(n − 2) / sqrt(1 − r²) = |r| sqrt(n − 2) / sqrt(1 − r²),
with r the Pearson correlation coefficient (8.1).
Decision: Reject H0: β1 = 0 if t_obs > t_tab = t_{n−2,1−α/2}.
Calculation in R: summary( lm( y ∼ x ))
Comparing the expression of Pearson’s correlation coefficient (8.1) and the estimate
of β1 (8.7), it is not surprising that the p-values of Tests 12 and 11 coincide.
For prediction of the dependent variable for a given independent variable x0, the value
x0 is used in Equation (8.10). The function predict can be used for prediction in R. The
uncertainty of the prediction depends on the uncertainty of the estimated parameter (and
on the variance of the error term). Specifically,

Var(µ̂0) = Var(β̂0 + β̂1 x0).    (8.11)

In general, however, β̂0 and β̂1 are not independent and the variance is not easy to
calculate. We return to this in Chapter 11. To construct confidence intervals for a
prediction, we must first discern whether it is for y0 or for µ0.
The prediction can also be written as

β̂0 + β̂1 x0 = ȳ − β̂1 x̄ + β̂1 x0 = ȳ + s_xy (s_x²)^{−1} (x0 − x̄).    (8.12)
This equation is equivalent to Equation (7.18) but with estimates instead of (unknown)
parameters.
Linear regression makes assumptions about the errors εi. After calculation, we must
check whether these assumptions are realistic. For this, the residuals and their properties
are considered. We return to this in Chapter 10.
R-Code 8.5 hardness data: predictions and pointwise confidence intervals. (See
Figure 8.3.)
new <- data.frame( Temp = seq(15, 75, by=.5))
pred.w.clim <- predict( lm1, new, interval="confidence")
# for hat( mu )
pred.w.plim <- predict( lm1, new, interval="prediction")
# for hat( y ), hence wider!
plot( Temp, Hard,
xlab="Temperature [C]", ylab="Hardness [HR]")
matlines( new$Temp, cbind(pred.w.clim, pred.w.plim[,-1]),
col=c(1,2,2,3,3),lty=c(1,1,1,2,2))
[Figure: scatter plot of Hardness [HR] versus Temperature [C], overlaid with pointwise confidence and prediction bands.]
Figure 8.3: hardness data: hardness as a function of temperature. (See R-Code 8.5.)
8.3 Bibliographic Remarks and Links
We recommend the book by Fahrmeir et al. (2009) (German) or Fahrmeir et al. (2013) (English), which is both detailed and accessible.
Chapter 9
Multiple Regression
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter09.R.
The simple linear regression model can be extended by the addition of independent
variables. We first introduce the model and estimators. Subsequently, we become ac-
quainted with the most important steps in model validation. At the end, several typical
examples of multiple regression are demonstrated.
9.1 Model and Estimators
Equation (8.5) is generalized to p independent variables
Yi = µi + εi  (9.1)
   = β0 + β1 x_{i1} + · · · + βp x_{ip} + εi  (9.2)
   = x_i^T β + εi,   i = 1, . . . , n,  n > p  (9.3)
with
• Yi: dependent variable, measured value, observation
• x_i = (1, x_{i1}, . . . , x_{ip})^T: independent/explanatory variables, predictors
• β = (β0, . . . , βp)T: (unknown) parameter vector
• εi: (unknown) error term, noise, “measurement error”, with symmetric distribution
around zero, E(εi) = 0.
It is often also assumed that Var(εi) = σ² and/or that the errors are independent of each other. For simplicity, we assume that εi iid∼ N(0, σ²), with unknown σ². In matrix notation, (9.3) is written as
Y = Xβ + ε  (9.4)
with X an n× (p + 1) matrix with rows x i. We assume that the rank of X equals p+ 1
(rank(X) = p + 1, column rank). This assumption guarantees that the inverse of XTX
exists.
The least squares concept is thus applied as follows:
β̂ = argmin_β (y − Xβ)^T (y − Xβ)  (9.5)
⇒ d/dβ (y − Xβ)^T (y − Xβ)  (9.6)
  = d/dβ (y^T y − 2 β^T X^T y + β^T X^T X β) = −2 X^T y + 2 X^T X β  (9.7)
⇒ X^T X β̂ = X^T y  (9.8)
⇒ β̂ = (X^T X)^{−1} X^T y  (9.9)
Equation (9.8) is also called the normal equation.
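As a sanity check, the closed-form solution (9.9) can be compared with the coefficients returned by lm. The following sketch uses simulated data; all names, seeds and values are chosen here purely for illustration:

```r
# Verify the normal equations: lm() and (X^T X)^{-1} X^T y agree
set.seed(1)
n <- 40
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2*x1 - x2 + rnorm(n, sd=0.3)
X <- cbind(1, x1, x2)                       # design matrix with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X beta = X^T y
fit <- lm(y ~ x1 + x2)
max(abs(beta.hat - coef(fit)))              # essentially zero
```

In practice, solve(crossprod(X), crossprod(X, y)) or the QR decomposition used internally by lm is numerically preferable to forming the inverse explicitly.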
It can be shown that
ε ∼ N_n(0, σ² I),  (9.10)
Y ∼ N_n(Xβ, σ² I),  (9.11)
β̂ = (X^T X)^{−1} X^T y,   β̂ ∼ N_{p+1}(β, σ² (X^T X)^{−1}),  (9.12)
ŷ = X (X^T X)^{−1} X^T y = Hy,   Ŷ ∼ N_n(Xβ, σ² H),  (9.13)
r = y − ŷ = (I − H) y,   R ∼ N_n(0, σ² (I − H)).  (9.14)
In the left column we find the estimates, in the right column the functions of the random samples. The marginal distributions of the individual coefficients β̂i are determined by distribution (9.12):
β̂i ∼ N(βi, σ² v_ii)   with   v_ii = ((X^T X)^{−1})_ii,   i = 0, . . . , p.  (9.15)
Hence (β̂i − βi)/√(σ² v_ii) ∼ N(0, 1). However, since σ² is usually unknown, we need
(β̂i − βi)/√(σ̂² v_ii) ∼ t_{n−p−1}   with   σ̂² = 1/(n − p − 1) · Σ_{i=1}^n (yi − ŷi)² = 1/(n − p − 1) · r^T r  (9.16)
as our statistic for testing and for deriving confidence intervals for the individual coefficients βi. For testing, we are often interested in H0 : βi = 0. Confidence intervals are constructed along the lines of Equation (3.33) and summarized in the subsequent blue box.
Confidence intervals for regression coefficients
An empirical (1 − α) confidence interval for βi is
[ β̂i ± t_{n−p−1, 1−α/2} · √( 1/(n − p − 1) · r^T r · v_ii ) ]  (9.17)
with r = y − ŷ and v_ii = ((X^T X)^{−1})_ii.
9.2 Model Validation
Model validation essentially verifies if (9.3) is an adequate model for using the data to
test the given hypothesis. The question is not whether a model is correct, but rather if it
is useful (“Essentially, all models are wrong, but some are useful”, Box and Draper, 1987
page 424).
Validation is based on (a) graphical summaries, typically (standardized) residuals versus fitted values, individual predictors, summary values of predictors, or simply Q-Q plots, and (b) summary statistics like
Coefficient of determination, or R²:   R² = 1 − SSE/SST = 1 − Σ_i (yi − ŷi)² / Σ_i (yi − ȳ)²,  (9.18)
Adjusted R²:   R²_adj = R² − (1 − R²) · p/(n − p − 1),  (9.19)
Observed value of F-test:   ((SST − SSE)/p) / (SSE/(n − p − 1)),  (9.20)
where SS stands for sums of squares. The last one is essentially equivalent to Test 4 and performs the omnibus test H0 : β1 = β2 = · · · = βp = 0. Rejecting merely means that at least one of the coefficients is significant. This test can also be used to compare nested models. Let M0 be the simpler model with only q out of the p predictors of the more complex model M1 (0 ≤ q < p). The test H0 : “M0 is sufficient” is based on the statistic
(SS_simple model − SS_complex model)/(p − q) / ( SS_complex model/(n − p − 1) )  (9.21)
and often runs under the name ANOVA (analysis of variance). We see an alternative derivation thereof in Chapter 10. Further, notice that in Section 9.3 we discuss two more often-used summary statistics to assess model fit.
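The nested-model statistic (9.21) is what anova( m0, m1) computes; a small illustration on simulated data with q = 1 out of p = 2 predictors (names and seed chosen for illustration):

```r
# Nested model comparison (9.21) via anova(): does x2 improve the model?
set.seed(3)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2*x1 + 1.5*x2 + rnorm(n, sd=0.3)
m0 <- lm(y ~ x1)            # simpler model, q = 1 predictor
m1 <- lm(y ~ x1 + x2)       # complex model, p = 2 predictors
# F statistic by hand, as in (9.21):
f.by.hand <- ((sum(residuals(m0)^2) - sum(residuals(m1)^2)) / (2 - 1)) /
             (sum(residuals(m1)^2) / (n - 2 - 1))
anova(m0, m1)$F[2] - f.by.hand   # essentially zero
```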
Validation verifies (i) the fixed components (or fixed part) µi and (ii) the stochastic
components (or stochastic part) εi and is normally an iterative process.
In terms of the fixed components, it must be verified whether the necessary predictors are in the model. We want neither too many nor too few. Unnecessary
predictors are often identified through insignificant coefficients. When predictors are missing, the residuals show (in the ideal case) structure indicative of a need for model improvement. In other cases, the quality of the regression is low (F-test insignificant, R² (too) small). Example 9.1 below illustrates the most important elements.
It is important to understand that the stochastic part εi does not only represent
measurement error. In general, the error is the remaining “variability” (also noise) that
is not explained through the predictors (“signal”). Therefore it is possible for the error
to not be iid normally distributed.
With respect to the stochastic part εi, the following points should be verified:
i) constant variance: if the (absolute) residuals are displayed as a function of the predictors, the estimated values, or the index, no structure should be discernible. The observations can often be transformed in order to achieve constant variance. Constant variance is also called homoscedasticity; otherwise one speaks of heteroscedasticity or variance heterogeneity.
ii) independence: correlation between the residuals should be negligible. Types of error
structure will be discussed in Chapter 11.
iii) symmetric distribution: it is not easy to find evidence against this assumption. If
the distribution is strongly right- or left-skewed, the scatter plots of the residuals
will have structure. Transformations or generalized linear models may help. We
have a quick look at a generalized linear model in Section 11.3.
Example 9.1. Here we construct synthetic data in order to better illustrate the difficulty
of detecting a suitable model. Table 9.1 gives the actual models and the five fitted models.
In all cases we use a small dataset of size n = 50 and predictors (x1, x2 and x3) that we
construct from a uniform distribution. Further, we set εi iid∼ N(0, 0.25²). R-Code 9.1 and the corresponding Figure 9.1 illustrate how model deficiencies manifest.
We illustrate how residual plots may or may not show missing or unnecessary predic-
tors. Because of the ‘textbook’ example, the adjusted R2 values are very high and the
p-value of the F -Test is – as often in practice – of little value.
Since the output of summary(.) is quite long, here we show only elements from it. This is achieved with the functions print and cat. For Examples 2 to 5 the output has been constructed by a call to a function sres_summary(), which produces the same output elements as shown for the first example.
The plots should supplement a classical graphical analysis through plot( res). ♣
Table 9.1: Fitted models for the five examples of Example 9.1. The true model is always Yi = β0 + β1 x1 + β2 x1² + β3 x2 + εi.

Example   Fitted model
1   Yi = β0 + β1 x1 + β2 x1² + β3 x2 + εi            correct model
2   Yi = β0 + β1 x1 + β2 x2 + εi                     missing predictor x1²
3   Yi = β0 + β1 x1² + β2 x2 + εi                    missing predictor x1
4   Yi = β0 + β1 x1 + β2 x1² + εi                    missing predictor x2
5   Yi = β0 + β1 x1 + β2 x1² + β3 x2 + β4 x3 + εi    unnecessary predictor x3
R-Code 9.1: Illustration of missing and unnecessary predictors for an
artificial dataset. (See Figure 9.1.)
set.seed( 18)
n <- 50
x1 <- runif( n); x2 <- runif( n); x3 <- runif( n)
eps <- rnorm( n, sd=0.16)
y <- -1 + 3*x1 + 2.5*x1^2 + 1.5*x2 + eps
# Example 1: Correct model
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 ))
print( sres$coef, digits=2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.96 0.067 -14.4 1.3e-18
## x1 2.90 0.264 11.0 1.8e-14
## I(x1^2) 2.54 0.268 9.5 2.0e-12
## x2 1.49 0.072 20.7 5.8e-25
cat("Adjusted R-squared: ", formatC( sres$adj.r.squared),
" F-Test: ", pf(sres$fstatistic[1], 1, n-2, lower.tail = FALSE))
## Adjusted R-squared: 0.9927 F-Test: 6.5201e-42
plotit( res$fitted, res$resid) # a nice plot, essentially equivalent to
# plot( res, which=1, caption=NA, sub.caption=NA, id.n=0) with ylim=c(-1,1)
# i.e, plot(), followed by a panel.smooth()
plotit( x1, res$resid)
plotit( x2, res$resid)
# Example 2: Missing predictor x1^2
sres <- summary( res <- lm( y ~ x1 + x2 ))
sres_summary( ) # as above
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3 0.093 -14 9.0e-19
## x1 5.3 0.113 47 2.5e-41
## x2 1.5 0.122 13 1.1e-16
## Adjusted R-squared: 0.9789 F-Test: 4.2729e-35
# Example 3: Missing predictor x1
sres <- summary( res <- lm( y ~ I(x1^2) + x2 ))
sres_summary( )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.45 0.091 -5 9.3e-06
## I(x1^2) 5.39 0.126 43 3.0e-39
## x2 1.41 0.135 10 8.4e-14
## Adjusted R-squared: 0.9742 F-Test: 4.945e-33
# Example 4: Missing predictor x2
sres <- summary( res <- lm( y ~ x1 + I(x1^2) ))
sres_summary( )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.042 0.16 -0.26 0.7942
## x1 2.307 0.83 2.76 0.0081
## I(x1^2) 2.963 0.85 3.49 0.0011
## Adjusted R-squared: 0.9266 F-Test: 1.3886e-22
# Example 5: Too many predictors x3
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 + x3))
sres_summary( )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.973 0.071 -13.8 9.5e-18
## x1 2.908 0.266 10.9 3.0e-14
## I(x1^2) 2.537 0.270 9.4 3.4e-12
## x2 1.483 0.075 19.9 6.5e-24
## x3 0.044 0.074 0.6 5.5e-01
## Adjusted R-squared: 0.9926 F-Test: 7.6574e-39
# The results are in sync with:
# tmp <- cbind( y, x1, "x1^2"=x1^2, x2, x3)
# pairs( tmp, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
# cor( tmp)
[Figure 9.1: a 5 × 3 grid of residual plots, residuals ranging from −1 to 1.]
Figure 9.1: Residual plots. Residuals versus fitted values (left column), predictor x1 (middle) and x2 (right column). The rows correspond to the different fitted models. The panels in the left column have different scaling of the x-axis. (See R-Code 9.1.)
9.3 Information Criterion
An information criterion in statistics is a tool for model selection. It follows the idea of Occam's razor, in that a model should not be unnecessarily complex. It balances the goodness of fit of the estimated model with its complexity, measured by the number of parameters. There is a penalty for the number of parameters; otherwise complex models with numerous parameters would be preferred.
We assume that the distribution of the variables follows a known distribution with an unknown parameter θ. In maximum likelihood estimation, the larger the likelihood function L(θ̂) or, equivalently, the smaller the negative log-likelihood function −ℓ(θ̂), the better the model. The oldest criterion was proposed as “an information criterion” in 1973 by Hirotugu Akaike and is known today as the Akaike information criterion (AIC):
AIC = −2 ℓ(θ̂) + 2p.  (9.22)
In regression models with normally distributed errors, the maximized log-likelihood is linear in log σ̂² and so the first term describes the goodness of fit.
The disadvantage of AIC is that the penalty term is independent of the sample size.
The Bayesian information criterion (BIC)
BIC = −2 ℓ(θ̂) + log(n) p  (9.23)
penalizes the model more heavily based on both the number of parameters p and the sample size n, and its use is recommended.
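Both criteria can be reproduced from the log-likelihood. Note that for lm objects, R counts σ̂² as an additional estimated parameter, so the effective parameter count is the df attribute of logLik rather than p alone. A minimal check on simulated data (names and seed chosen for illustration):

```r
# AIC and BIC from the maximized log-likelihood, as in (9.22)/(9.23)
set.seed(4)
n <- 50; x <- runif(n)
y <- 1 + 2*x + rnorm(n)
fit <- lm(y ~ x)
ll <- logLik(fit)
k <- attr(ll, "df")                 # all estimated parameters, incl. sigma^2
aic <- -2 * as.numeric(ll) + 2 * k
bic <- -2 * as.numeric(ll) + log(n) * k
c(aic - AIC(fit), bic - BIC(fit))   # both zero
```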
9.4 Examples
In this section we give two more examples of multiple regression problems, based on
classical datasets.
Example 9.2. (abrasion data) The data comes from an experiment investigating how
rubber’s resistance to abrasion is affected by the hardness of the rubber and its tensile
strength (Cleveland, 1993). Each of the 30 rubber samples was tested for hardness and
tensile strength, and then subjected to steady abrasion for a fixed time.
R-Code 9.2 performs the regression analysis based on two predictors. The empirical
confidence intervals confint( res) do not contain zero. Accordingly, the p-values of the
three t-tests are small.
A quadratic term for strength is not necessary. However, the residuals appear to be
slightly correlated. We do not have further information about the data here and cannot
investigate this aspect further. ♣
[Figure 9.2: pairs plot of loss, hardness and strength.]
Figure 9.2: abrasion data: EDA in form of a pairs plot. (See R-Code 9.2.)
R-Code 9.2: abrasion data: EDA, fitting a linear model and model vali-
dation.
abrasion <- read.csv('data/abrasion.csv')
str(abrasion)
## 'data.frame': 30 obs. of 3 variables:
## $ loss : int 372 206 175 154 136 112 55 45 221 166 ...
## $ hardness: int 45 55 61 66 71 71 81 86 53 60 ...
## $ strength: int 162 233 232 231 231 237 224 219 203 189 ...
pairs(abrasion, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
res <- lm(loss~hardness+strength, data=abrasion)
summary( res)
##
## Call:
## lm(formula = loss ~ hardness + strength, data = abrasion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hardness -6.571 0.583 -11.27 1.0e-11 ***
## strength -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84,Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
confint( res)
## 2.5 % 97.5 %
## (Intercept) 758.4573 1011.86490
## hardness -7.7674 -5.37423
## strength -1.7730 -0.97562
# Fitted values
plot( loss~hardness, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~hardness, col=4, data=abrasion)
plot( loss~strength, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~strength, col=4, data=abrasion)
# Residual vs ...
plot( res$resid~res$fitted)
lines( lowess( res$fitted, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~hardness, data=abrasion)
lines( lowess( abrasion$hardness, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~strength, data=abrasion)
lines( lowess( abrasion$strength, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid[-1]~res$resid[-30])
abline( h=0, col='gray')
[Figure 9.3: six panels of fitted values and residual diagnostics.]
Figure 9.3: abrasion data: model validation. Top row shows the loss (black) and fitted values (blue) as a function of hardness (left) and strength (right). Middle and bottom panels are different residual plots. (See R-Code 9.2.)
Example 9.3. (LifeCycleSavings data) Under the life-cycle savings hypothesis de-
veloped by Franco Modigliani, the savings ratio (aggregate personal savings divided by
disposable income) is explained by per-capita disposable income, the percentage rate of
change in per-capita disposable income, and two demographic variables: the percentage
of population less than 15 years old and the percentage of the population over 75 years
old. The data are averaged over the decade 1960–1970 to remove the business cycle or
other short-term fluctuations.
The dataset contains information from 50 countries about these five variables:
• sr aggregate personal savings,
• pop15 % of population under 15,
• pop75 % of population over 75,
• dpi real per-capita disposable income,
• ddpi % growth rate of dpi.
Scatter plots are shown in Figure 9.4. R-Code 9.3 fits a multiple linear model, selects
models through comparison of various goodness of fit criteria (AIC, BIC) and shows the
model validation plots for the model selected using AIC.
Different models may result from different criteria: when using BIC for model selection,
pop75 drops out of the model. ♣
[Figure 9.4: pairs plot of sr, pop15, pop75, dpi and ddpi.]
Figure 9.4: LifeCycleSavings data: scatter plots of the data.
R-Code 9.3: LifeCycleSavings data: EDA, linear model and model selec-
tion. (Figure 9.4).
data( LifeCycleSavings)
head( LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
pairs(LifeCycleSavings, upper.panel=panel.smooth, lower.panel=NULL,
gap=0)
lcs.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary( lcs.all)
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.242 -2.686 -0.249 2.428 9.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.566087 7.354516 3.88 0.00033 ***
## pop15 -0.461193 0.144642 -3.19 0.00260 **
## pop75 -1.691498 1.083599 -1.56 0.12553
## dpi -0.000337 0.000931 -0.36 0.71917
## ddpi 0.409695 0.196197 2.09 0.04247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared: 0.338,Adjusted R-squared: 0.28
## F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
lcs.aic <- step( lcs.all) # AIC is default choice
## Start: AIC=138.3
## sr ~ pop15 + pop75 + dpi + ddpi
##
## Df Sum of Sq RSS AIC
## - dpi 1 1.9 653 136
## <none> 651 138
## - pop75 1 35.2 686 139
## - ddpi 1 63.1 714 141
## - pop15 1 147.0 798 146
##
## Step: AIC=136.45
## sr ~ pop15 + pop75 + ddpi
##
## Df Sum of Sq RSS AIC
## <none> 653 136
## - pop75 1 47.9 701 138
## - ddpi 1 73.6 726 140
## - pop15 1 145.8 798 144
summary( lcs.aic)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.12466 7.18379 3.9150 0.00029698
## pop15 -0.45178 0.14093 -3.2056 0.00245154
## pop75 -1.83541 0.99840 -1.8384 0.07247270
## ddpi 0.42783 0.18789 2.2771 0.02747818
plot( lcs.aic) # 4 plots to assess the models
# BIC:
summary( step( lcs.all, k=log(50), trace=0))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.59958 2.334394 6.6825 2.4796e-08
## pop15 -0.21638 0.060335 -3.5863 7.9597e-04
## ddpi 0.44283 0.192401 2.3016 2.5837e-02
[Figure 9.5: the four standard lm diagnostic plots — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; the countries Zambia, Chile, Philippines, Libya and Japan are flagged.]
Figure 9.5: LifeCycleSavings data: model validation.
Chapter 10
Analysis of Variance
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter10.R.
In Test 2 discussed in Chapter 4, we compared the means of two independent samples
with each other. Naturally, the same procedure can be applied to I independent samples,
which would amount to (I choose 2) = I(I − 1)/2 tests and would require adjustments due to multiple testing.
In this chapter we learn a “better” method, based on the concept of analysis of variance,
termed ANOVA. We focus on a linear model approach to ANOVA. Due to historical
reasons, the notation is slightly different than what we have seen in the last two chapters;
but we try to link and unify as much as possible.
10.1 One-Way ANOVA
The model of this section is tailored to compare I different groups where the variability
of the observations around the mean is the same in all groups. That means that there is
one common variance parameter and we pool the information across all observations to
estimate it.
More formally, the model consists of I groups, in the ANOVA context called factor levels,
i = 1, . . . , I, and every level contains a sample of size ni. Thus, in total we have N =
n1 + · · ·+ nI observations. The model is given by
Yij = µi + εij  (10.1)
    = µ + βi + εij,  (10.2)
where we use the indices to indicate the group and the within-group observation. We again assume εij iid∼ N(0, σ²). Formulation (10.1) represents the individual group means directly,
whereas formulation (10.2) models an overall mean and deviations from the mean.
However, model (10.2) is overparameterized (I levels and I + 1 parameters) and an additional constraint on the parameters is necessary. Often, the sum-to-zero contrast or the treatment contrast, written as
Σ_{i=1}^I βi = 0   or   β1 = 0,  (10.3)
is used.
We are inherently interested in whether there exists a difference between the groups
and so our null hypothesis is H0 : β1 = β2 = · · · = βI = 0. Note that the hypothesis is
independent of the constraint. To develop the associated test, we proceed in several steps.
We first link the two group case to the notation from the last chapter. In a second step
we intuitively derive estimates in the general setting. Finally, we state the test statistic.
10.1.1 Two Level ANOVA Written as a Regression Model
Model (10.2) with I = 2 and treatment constraint β1 = 0 can be written as a regression problem
Y*_i = β*_0 + β*_1 x_i + ε*_i,   i = 1, . . . , N  (10.4)
with Y*_i the components of the vector (Y11, Y12, . . . , Y1n1, Y21, . . . , Y2n2)^T and x_i = 0 if i = 1, . . . , n1 and x_i = 1 otherwise. We simplify the notation and spare ourselves from writing the index denoted by the asterisk with
(Y1^T, Y2^T)^T = Xβ + ε,   X = [1 0; 1 1],   β = (β0, β1)^T,  (10.5)
where Y1 and Y2 collect the observations of the two groups, matrix rows are separated by semicolons and, blockwise, 1 and 0 denote vectors of ones and zeros of lengths n1 and n2. Thus we have as least squares estimate
β̂ = (β̂0, β̂1)^T = (X^T X)^{−1} X^T (y1^T, y2^T)^T = [N n2; n2 n2]^{−1} [1^T 1^T; 0^T 1^T] (y1^T, y2^T)^T  (10.6)
 = 1/(n1 n2) · [n2 −n2; −n2 N] (Σ_{ij} y_ij, Σ_j y_2j)^T = ( 1/n1 · Σ_j y_1j,  1/n2 · Σ_j y_2j − 1/n1 · Σ_j y_1j )^T.  (10.7)
Thus the least squares estimates of µ and β2 in (10.2) for two groups are the mean of the
first group and the difference between the two group means.
The null hypothesis H0 : β1 = β2 = 0 in Model (10.2) is equivalent to the null hypothesis
H0 : β∗1 = 0 in Model (10.4) or to the null hypothesis H0 : β1 = 0 in Model (10.5). The
latter is of course based on a t-test for a linear association (Test 12) and coincides with
the two sample t-test for two independent samples (Test 2).
Estimators can also be derived in a similar fashion under other constraints or for more
factor levels.
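The equivalence can be checked numerically: the t-test on the slope of the dummy regression reproduces the pooled two-sample t-test, and the coefficients are the first group mean and the difference of group means. A minimal sketch with simulated groups (names and seed chosen for illustration):

```r
# Two-level ANOVA as regression: lm() t-test equals the two-sample t-test
set.seed(5)
y1 <- rnorm(10, mean=0);  y2 <- rnorm(12, mean=1)
y <- c(y1, y2)
group <- factor(rep(c("A", "B"), c(10, 12)))
fit <- lm(y ~ group)                  # default treatment contrasts
coef(fit)                             # intercept = mean(y1), slope = mean(y2) - mean(y1)
p.lm <- summary(fit)$coefficients["groupB", "Pr(>|t|)"]
p.t  <- t.test(y2, y1, var.equal=TRUE)$p.value
p.lm - p.t                            # zero: the two tests coincide
```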
10.1.2 Sums of Squares Decomposition in the Balanced Case
Historically, we often had the case n1 = · · · = nI = J , representing a balanced setting.
In this case, there is another simple approach to deriving the estimators under the sum-
to-zero constraint. We use dot notation in order to show that we work with averages, for
example,
ȳ_i· = 1/n_i · Σ_{j=1}^{n_i} y_ij   and   ȳ_·· = 1/N · Σ_{i=1}^{I} Σ_{j=1}^{n_i} y_ij.  (10.8)
Based on Yij = µ + βi + εij we use the following ansatz to derive estimates
y_ij = ȳ_·· + (ȳ_i· − ȳ_··) + (y_ij − ȳ_i·).  (10.9)
With the least squares method, µ̂ and β̂i are chosen such that
Σ_{i,j} (y_ij − µ − βi)² = Σ_{i,j} (ȳ_·· + ȳ_i· − ȳ_·· + y_ij − ȳ_i· − µ − βi)²  (10.10)
 = Σ_{i,j} ( (ȳ_·· − µ) + (ȳ_i· − ȳ_·· − βi) + (y_ij − ȳ_i·) )²  (10.11)
is minimized. We evaluate the square of this last equation and note that the cross terms are zero since
Σ_{j=1}^{J} (y_ij − ȳ_i·) = 0   and   Σ_{i=1}^{I} (ȳ_i· − ȳ_·· − β̂i) = 0  (10.12)
(the latter due to the sum-to-zero constraint) and thus µ̂ = ȳ_·· and β̂i = ȳ_i· − ȳ_··. Hence, writing r_ij = y_ij − ȳ_i·, we have
y_ij = µ̂ + β̂i + r_ij.  (10.13)
As in regression, the observations are orthogonally projected in the space spanned by µ
and βi. This orthogonal projection allows for the division of the sums of squares of the
observations (mean corrected to be precise) into the sums of squares of the model and sum
of squares of the error component. These sums of squares are then weighted and compared.
The representation of this process in table form and the subsequent interpretation is often
equated with the analysis of variance, denoted ANOVA.
The decomposition of the sums of squares can be derived with help from (10.9). No assumptions about constraints or n_i are made:
Σ_{i,j} (y_ij − ȳ_··)² = Σ_{i,j} (ȳ_i· − ȳ_·· + y_ij − ȳ_i·)²  (10.14)
 = Σ_{i,j} (ȳ_i· − ȳ_··)² + Σ_{i,j} (y_ij − ȳ_i·)² + Σ_{i,j} 2 (ȳ_i· − ȳ_··)(y_ij − ȳ_i·),  (10.15)
where the cross term is again zero because Σ_{j=1}^{n_i} (y_ij − ȳ_i·) = 0. Hence we have the decomposition of the sums of squares
Σ_{i,j} (y_ij − ȳ_··)²  (Total)  =  Σ_{i,j} (ȳ_i· − ȳ_··)²  (Model)  +  Σ_{i,j} (y_ij − ȳ_i·)²  (Error)  (10.16)
or SST = SSA + SSE. We choose deliberately SSA instead of SSM as this will simplify subsequent extensions. Using the least squares estimates µ̂ = ȳ_·· and β̂i = ȳ_i· − ȳ_··, this equation can be read as
1/N · Σ_{i,j} (y_ij − µ̂)² = 1/N · Σ_i n_i (µ̂ + β̂i − µ̂)² + 1/N · Σ_{i,j} (y_ij − µ̂ − β̂i)²  (10.17)
Var̂(y_ij) = σ̂² + 1/N · Σ_i n_i β̂i²,  (10.18)
(where we could have used some divisor other than N). The test statistic for the statistical
hypothesis H0 : β1 = β2 = · · · = βI = 0 is based on the idea of decomposing the variance
into variance between groups and variance within groups, just as illustrated in (10.18),
and comparing them. Formally, this must be made more precise. A good model has
a small estimate for σ2 in comparison to that for the second sum. We now develop a
quantitative comparison of the sums.
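The decomposition (10.16) holds exactly for any group sizes n_i; a quick numerical check with simulated, unbalanced data (names and seed chosen for illustration):

```r
# Sum of squares decomposition (10.16): SST = SSA + SSE
set.seed(6)
I <- 3; n_i <- c(4, 5, 6)                      # unbalanced groups are fine
group <- factor(rep(1:I, n_i))
y <- rnorm(sum(n_i), mean = rep(c(0, 1, 2), n_i))
gm <- ave(y, group)                            # group means y_i., one per observation
SST <- sum((y - mean(y))^2)
SSA <- sum((gm - mean(y))^2)
SSE <- sum((y - gm)^2)
SST - (SSA + SSE)                              # zero
```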
A raw comparison of both variance terms is not sufficient; the number of observations must be considered. In order to weight the individual sums of squares, we divide them
by their degrees of freedom. Under the null hypothesis, the mean squares are chi-square
distributed and thus their quotients are F distributed. Hence, an F -test as illustrated
in Test 4 is needed again. Historically, such a test has been “constructed” via a table
and is still represented as such. This so-called ANOVA table consists of columns for the
sums of squares, degrees of freedom, mean squares and F -test statistic due to variance
between groups, within groups, and the total variance. Table 10.1 illustrates such a
generic ANOVA table, numerical examples are given in Example 10.1 later in this section.
The expected values of the mean squares (MS) are
E(MSA) = σ² + Σ_i n_i βi² / (I − 1),    E(MSE) = σ².  (10.19)
Note that only the latter is intuitive.
Thus under H0 : β1 = · · · = βI = 0, E(MSA) = E(MSE) and hence the ratio MSA/MSE
is close to one, but typically larger. We reject H0 for large values of the ratio (see
Section 2.7.4).
Table 10.1: Table for one-way analysis of variance.

Source                       Sum of squares (SS)             DF      Mean squares (MS)     Test statistic
Between groups               SSA = Σ_{i,j} (ȳ_i· − ȳ_··)²    I − 1   MSA = SSA/(I − 1)     Fobs = MSA/MSE
(Factors, levels, . . . )
Within groups (Error)        SSE = Σ_{i,j} (y_ij − ȳ_i·)²    N − I   MSE = SSE/(N − I)
Total                        SST = Σ_{i,j} (y_ij − ȳ_··)²    N − 1
Test 13: Performing a one-way analysis of variance
Question: Are at least two of the means ȳ1, ȳ2, . . . , ȳI significantly different?
Assumptions: The I populations are normally distributed with homogeneous variances. The samples are independent.
Calculation: Construct a one-way ANOVA table. The quotient of the mean squares of the factor and the error is needed:
Fobs = (SSA/(I − 1)) / (SSE/(N − I)) = MSA/MSE
Decision: Reject H0 : β1 = β2 = · · · = βI = 0 if Fobs > Ftab = F_{I−1, N−I, 1−α}, where I − 1 gives the degrees of freedom “between” and N − I the degrees of freedom “within”.
Calculation in R: summary( lm(...)) for the value of the test statistic or anova( lm(...)) for the explicit ANOVA table.
Test 13 summarizes the procedure. The test statistic of Table 10.1 naturally agrees with
that of Test 13. Observe that when MSA ≤ MSE, i.e., Fobs ≤ 1, H0 is never rejected.
Details about F -distributed random variables are given in Section 2.7.4.
Example 10.1 discusses a simple analysis of variance.
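Before turning to the example, the entries of the ANOVA table can be reproduced by hand and compared with anova( lm(...)); a minimal sketch with simulated balanced groups (names and seed chosen for illustration):

```r
# One-way ANOVA table by hand versus anova(lm(...))
set.seed(7)
I <- 3; J <- 8; N <- I * J
group <- factor(rep(1:I, each=J))
y <- rnorm(N, mean=rep(c(0, 0.5, 1), each=J))
gm <- ave(y, group)                            # group means y_i.
MSA <- sum((gm - mean(y))^2) / (I - 1)         # mean square between groups
MSE <- sum((y - gm)^2) / (N - I)               # mean square within groups
tab <- anova(lm(y ~ group))
c(tab[1, "F value"] - MSA / MSE,               # same F statistic
  tab[1, "Pr(>F)"] - pf(MSA/MSE, I - 1, N - I, lower.tail=FALSE))
```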
Example 10.1. (retardant data) Many substances related to human activities end up
in wastewater and accumulate in sewage sludge. The present study focuses on hexabro-
mocyclododecane (HBCD) detected in sewage sludge collected from a monitoring network
in Switzerland. HBCD’s main use is in expanded and extruded polystyrene for thermal
insulation foams, in building and construction. HBCD is also applied in the backcoating
of textiles, mainly in furniture upholstery. A very small application of HBCD is in high
impact polystyrene, which is used for electrical and electronic appliances, for example in
audio visual equipment. Data and more detailed background information are given in
Kupper et al. (2008) where it is also argued that loads from different types of monitor-
ing sites showed that brominated flame retardants ending up in sewage sludge originate
mainly from surface runoff, industrial and domestic wastewater.
HBCD is harmful to one’s health, may affect reproductive capacity, and may harm children
in the mother’s womb.
In R-Code 10.1 the data are loaded and reduced to Hexabromocyclododecane.
First we use constraint β1 = 0, i.e., Model (10.5). The estimates naturally agree with
those from (10.7).
Then we use the sum-to-zero constraint and compare the results. The estimates and the
standard errors changed (and thus the p-values of the t-test). The p-values of the F -test
are, however, identical, since the same test is used.
The R command aov is an alternative for performing ANOVA and its use is illustrated
in R-Code 10.2. We prefer, however, the more general lm approach. Nevertheless we
need a function which provides results on which, for example, Tukey’s honest significant
difference (HSD) test can be performed with the function TukeyHSD. The differences can
also be calculated from the coefficients in R-Code 10.1. The p-values are larger because multiple tests are accounted for. ♣
R-Code 10.1: retardant data: ANOVA with lm command and illustration
of various contrasts.
tmp <- read.csv('./data/retardant.csv')
retardant <- read.csv('./data/retardant.csv', skip=1)
names( retardant) <- names(tmp)
HBCD <- retardant$cHBCD
str( retardant$StationLocation)
## Factor w/ 16 levels "A11","A12","A15",..: 1 2 3 4 5 6 7 8 11 12 ...
type <- as.factor( rep(c("A","B","C"), c(4,4,8)))
lmout <- lm( HBCD ~ type )
summary(lmout)
##
## Call:
## lm(formula = HBCD ~ type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.7 42.4 1.78 0.098 .
## typeB 77.2 60.0 1.29 0.220
## typeC 107.8 51.9 2.08 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
options( "contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
# manually construct the estimates:
c( mean(HBCD[1:4]), mean(HBCD[5:8])-mean(HBCD[1:4]),
mean(HBCD[9:16])-mean(HBCD[1:4]))
## [1] 75.675 77.250 107.788
# change the contrasts to sum-to-zero
options(contrasts=c("contr.sum","contr.sum"))
lmout1 <- lm( HBCD ~ type )
summary(lmout1)
##
## Call:
## lm(formula = HBCD ~ type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.4 22.3 6.15 3.5e-05 ***
## type1 -61.7 33.1 -1.86 0.086 .
## type2 15.6 33.1 0.47 0.646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
beta <- as.numeric(coef(lmout1))
# Reconstruct the 'contr.treatment' coefficients:
c( beta[1]+beta[2], beta[3]-beta[2], -2*beta[2]-beta[3])
## [1] 75.675 77.250 107.787
10.2 Two-Way and Complete Two-Way ANOVA
Model (10.2) can be extended for additional factors. Adding one factor to a one-way
model leads us to a two-way model
Y_ijk = µ + β_i + γ_j + ε_ijk   (10.20)

with i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n_ij and ε_ijk iid∼ N(0, σ²). The indices again
specify the levels of the first and second factor as well as the count within that configuration.
As stated, the model is overparameterized and additional constraints are again necessary;

∑_{i=1}^{I} β_i = 0, ∑_{j=1}^{J} γ_j = 0   or   β_1 = 0, γ_1 = 0   (10.21)

are often used.
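The two types of constraints in (10.21) correspond to the contrast options contr.sum and contr.treatment seen in R-Code 10.1. As a minimal sketch (with a hypothetical three-level factor), one can inspect the resulting design-matrix columns directly:

```r
# Hypothetical three-level factor to inspect the two constraints
type <- factor(rep(c("A", "B", "C"), each = 2))

options(contrasts = c("contr.treatment", "contr.poly"))  # beta_1 = 0
X_treat <- model.matrix(~ type)   # columns: intercept, typeB, typeC

options(contrasts = c("contr.sum", "contr.poly"))        # sum-to-zero
X_sum <- model.matrix(~ type)     # columns: intercept, type1, type2

options(contrasts = c("contr.treatment", "contr.poly"))  # restore default
```

Under contr.treatment the first level is absorbed into the intercept; under contr.sum the coefficient of the last level is the negative sum of the displayed coefficients.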
When the n_ij in each group are not equal, the decomposition of the sums of squares is not
necessarily unique. In practice this is often unimportant since, above all, the estimated
coefficients are compared with each other ("contrasts"). We recommend always using the
command lm(...). In case we do compare sums of squares, the resulting ambiguities mean
that factors need to be included in decreasing order of "natural" importance.
For the sake of illustration, we consider the balanced case n_ij = K, called complete
two-way ANOVA. More precisely, the model consists of I · J groups and every group
R-Code 10.2 retardant data: ANOVA with aov and multiple testing of the means.
aovout <- aov( HBCD ~ type )
options("contrasts")
## $contrasts
## [1] "contr.sum" "contr.sum"
coefficients( aovout) # coef( aovout) is sufficient as well.
## (Intercept) type1 type2
## 137.354 -61.679 15.571
summary(aovout)
## Df Sum Sq Mean Sq F value Pr(>F)
## type 2 31069 15534 2.16 0.15
## Residuals 13 93492 7192
TukeyHSD( aovout)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = HBCD ~ type)
##
## $type
## diff lwr upr p adj
## B-A 77.250 -81.085 235.59 0.42602
## C-A 107.788 -29.335 244.91 0.13376
## C-B 30.538 -106.585 167.66 0.82882
contains K samples, hence N = I · J · K. The calculation of the estimates is easier than
in the unbalanced case, as illustrated in the following.
As in the one-way case, we can derive the least squares estimates

y_ijk = ȳ_···  +  (ȳ_i·· − ȳ_···)  +  (ȳ_·j· − ȳ_···)  +  (y_ijk − ȳ_i·· − ȳ_·j· + ȳ_···)   (10.22)
       =  µ̂    +      β̂_i        +       γ̂_j        +            r_ijk

and separate the sums of squares

∑_{i,j,k} (y_ijk − ȳ_···)² = ∑_{i,j,k} (ȳ_i·· − ȳ_···)² + ∑_{i,j,k} (ȳ_·j· − ȳ_···)² + ∑_{i,j,k} (y_ijk − ȳ_i·· − ȳ_·j· + ȳ_···)²   (10.23)
           SST            =          SSA             +          SSB             +                 SSE
We are interested in the statistical hypotheses H_0: β_1 = · · · = β_I = 0 and H_0: γ_1 =
· · · = γ_J = 0. The test statistics of both of these tests are given in the last column of
Table 10.2. The test statistic F_obs,A is compared with the quantiles of the F
distribution with I − 1 and N − I − J + 1 degrees of freedom; similarly, for F_obs,B we use
J − 1 and N − I − J + 1 degrees of freedom. The tests are equivalent to that of Test 13.
Table 10.2: Table of the complete two-way ANOVA.

Source     SS                                                DF             MS                 Test statistic
Factor A   SSA = ∑_{i,j,k} (ȳ_i·· − ȳ_···)²                  I − 1          MSA = SSA/(I − 1)  F_obs,A = MSA/MSE
Factor B   SSB = ∑_{i,j,k} (ȳ_·j· − ȳ_···)²                  J − 1          MSB = SSB/(J − 1)  F_obs,B = MSB/MSE
Error      SSE = ∑_{i,j,k} (y_ijk − ȳ_i·· − ȳ_·j· + ȳ_···)²  N − I − J + 1  MSE = SSE/DFE
Total      SST = ∑_{i,j,k} (y_ijk − ȳ_···)²                  N − 1
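The decomposition (10.23) can be verified numerically; the following sketch uses hypothetical balanced data with I = 2, J = 3 and K = 4:

```r
set.seed(1)
I <- 2; J <- 3; K <- 4
A <- gl(I, J * K)                    # first factor
B <- gl(J, K, length = I * J * K)    # second factor, balanced within A
y <- 2 + as.numeric(A) + 0.5 * as.numeric(B) + rnorm(I * J * K)

ybar  <- mean(y)
ybarA <- ave(y, A)                   # group means of factor A
ybarB <- ave(y, B)                   # group means of factor B
SST <- sum((y - ybar)^2)
SSA <- sum((ybarA - ybar)^2)
SSB <- sum((ybarB - ybar)^2)
SSE <- sum((y - ybarA - ybarB + ybar)^2)
all.equal(SST, SSA + SSB + SSE)      # exact in the balanced case
```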
An interaction (βγ)_ij can also be included in the model, if the effects are not purely additive:

Y_ijk = µ + β_i + γ_j + (βγ)_ij + ε_ijk   (10.24)

with ε_ijk iid∼ N(0, σ²). In addition to constraints (10.21),

∑_{i=1}^{I} (βγ)_ij = 0 and ∑_{j=1}^{J} (βγ)_ij = 0 for all i and j   (10.25)

or analogous treatment constraints are often used. As in the previous two-way case, we
can derive the least squares estimates

y_ijk = ȳ_··· + (ȳ_i·· − ȳ_···) + (ȳ_·j· − ȳ_···) + (ȳ_ij· − ȳ_i·· − ȳ_·j· + ȳ_···) + (y_ijk − ȳ_ij·)   (10.26)
       =  µ̂   +      β̂_i      +       γ̂_j      +             (β̂γ)_ij            +     r_ijk
and a decomposition into sums of squares is straightforward.
Note that for K = 1 the error and the interaction coincide and cannot be separated.
Table 10.3 shows the table of a complete two-way ANOVA with interaction. The test statistics F_obs,A,
F_obs,B and F_obs,AB are compared with the quantiles of the F distribution with I − 1, J − 1
and (I − 1)(J − 1) numerator degrees of freedom, respectively, and N − IJ denominator degrees of freedom.
Table 10.3: Table of the complete two-way ANOVA with interaction.

Source       SS                                                 DF              MS               Test statistic
Factor A     SSA = ∑_{i,j,k} (ȳ_i·· − ȳ_···)²                   I − 1           MSA = SSA/DFA    F_obs,A = MSA/MSE
Factor B     SSB = ∑_{i,j,k} (ȳ_·j· − ȳ_···)²                   J − 1           MSB = SSB/DFB    F_obs,B = MSB/MSE
Interaction  SSAB = ∑_{i,j,k} (ȳ_ij· − ȳ_i·· − ȳ_·j· + ȳ_···)²  (I − 1)(J − 1)  MSAB = SSAB/DFAB F_obs,AB = MSAB/MSE
Error        SSE = ∑_{i,j,k} (y_ijk − ȳ_ij·)²                   N − IJ          MSE = SSE/DFE
Total        SST = ∑_{i,j,k} (y_ijk − ȳ_···)²                   N − 1
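In R the rows of this table are reproduced by summary(aov(...)) with an interaction term; a sketch with hypothetical balanced data (I = J = 2, K = 3):

```r
set.seed(7)
A <- gl(2, 6)                         # I = 2
B <- gl(2, 3, length = 12)            # J = 2, K = 3, hence N = 12
y <- rnorm(12, mean = 1 + as.numeric(A) * as.numeric(B))
fit <- aov(y ~ A * B)
summary(fit)   # rows: A, B, A:B, Residuals with N - IJ = 8 df
```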
10.3 Analysis of Covariance
We now combine regression and analysis of variance elements, i.e., continuous predictors
and factor levels, by adding "classical" predictors to models (10.2) or (10.20). Such models
are sometimes called ANCOVA models, for example

Y_ijk = µ + β_1 x_i + γ_j + ε_ijk,   (10.27)

with j = 1, . . . , J, k = 1, . . . , n_j and ε_ijk iid∼ N(0, σ²). Additional constraints are again
necessary. Keeping track of indices and Greek letters quickly gets cumbersome, and one
often resorts to R's formula notation. For example, if the predictor x_i is in the variable
Population and γ_j is in the variable Treatment, in the form of a factor, then

y ~ Population + Treatment   (10.28)

represents (10.27). An interaction is indicated with :, for example Treatment:Month.
The * operator denotes 'factor crossing': a*b is interpreted as a + b + a:b. See the section
Details of help(formula).
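A small sketch (with hypothetical variable names a and b) shows how R expands the * operator; terms() does not evaluate the variables, so none need to exist:

```r
# terms() reveals the expansion a*b -> a + b + a:b
f <- y ~ a * b
attr(terms(f), "term.labels")
## "a"   "b"   "a:b"
```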
ANOVA tables are constructed similarly, and the individual rows thereof are associated with the
individual elements of the formula specification.
Hence our unified approach via lm(...). Notice that the order of the variables in
formula (10.28) does not play a role for the estimates; for the decomposition of the sums of squares
it does. Different statistical software packages take different approaches and may thus
lead to minor differences.
10.4 Example
Example 10.2. (chemosphere data) Octocrylene is an organic UV filter found in sun-
screen and cosmetics. The substance is classified as a contaminant and dangerous for the
environment by the EU under the CLP Regulation.
Because the substance is difficult to break down, the environmental burden of Octocrylene
can be estimated through measurement of its concentration in sludge from waste treatment
facilities.
The study by Plagellat et al. (2006) analyzed Octocrylene (OC) concentrations from 24 dif-
ferent purification plants (of three different types, Treatment), each with
two samples (Month). Additionally, the catchment area (Population) and the amount of
sludge (Production) are known. Treatment type A refers to small plants, B to medium-sized
plants without considerable industry and C to medium-sized plants with industry.
R-Code 10.3 prepares the data and shows a one-way ANOVA. R-Code 10.4 shows a two-
way ANOVA (with and without interactions).
Figure 10.1 shows why the interaction is not significant: first, the seasonal effects in
groups A and B are very similar and, second, the variability in group C is too large. ♣
R-Code 10.3: UVfilter data: one-way ANOVA using lm.
UV <- read.csv( './data/chemosphere.csv')
UV <- UV[,c(1:6,10)] # reduce to OT
str(UV, strict.width='cut')
## 'data.frame': 24 obs. of 7 variables:
## $ Treatment : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 3 3 ..
## $ Site_code : Factor w/ 12 levels "A11","A12","A15",..: 1 2 3 4 7 ..
## $ Site : Factor w/ 12 levels "Chevilly","Cronay",..: 1 2 10 7..
## $ Month : Factor w/ 2 levels "jan","jul": 1 1 1 1 1 1 1 1 1 1 ..
## $ Population: int 210 284 514 214 674 5700 8460 11300 6500 7860 ...
## $ Production: num 2.7 3.2 12 3.5 13 80 150 220 80 250 ...
## $ OT : int 1853 1274 1342 685 1003 3502 4781 3407 11073 33..
with( UV, table(Treatment, Month))
## Month
## Treatment jan jul
## A 5 5
## B 3 3
## C 4 4
options( contrasts=c("contr.sum", "contr.sum"))
lmout <- lm( log(OT) ~ Treatment, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment, data = UV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.952 -0.347 -0.136 0.343 1.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.122 0.116 70.15 < 2e-16 ***
## Treatment1 -0.640 0.154 -4.16 0.00044 ***
## Treatment2 0.438 0.175 2.51 0.02049 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.555 on 21 degrees of freedom
## Multiple R-squared: 0.454,Adjusted R-squared: 0.402
## F-statistic: 8.73 on 2 and 21 DF, p-value: 0.00174
R-Code 10.4: UVfilter data: two-way ANOVA and two-way ANOVA with
interactions using lm. (See Figure 10.1.)
lmout <- lm( log(OT) ~ Treatment + Month, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment + Month, data = UV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7175 -0.3452 -0.0124 0.1691 1.2236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.122 0.106 76.78 < 2e-16 ***
## Treatment1 -0.640 0.141 -4.55 0.00019 ***
## Treatment2 0.438 0.160 2.74 0.01254 *
## Month1 -0.235 0.104 -2.27 0.03444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.507 on 20 degrees of freedom
## Multiple R-squared: 0.566,Adjusted R-squared: 0.501
## F-statistic: 8.69 on 3 and 20 DF, p-value: 0.000686
summary( aovout <- aov( log(OT) ~ Treatment * Month, data=UV))
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 5.38 2.688 11.67 0.00056 ***
## Month 1 1.32 1.325 5.75 0.02752 *
## Treatment:Month 2 1.00 0.499 2.17 0.14355
## Residuals 18 4.15 0.230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD( aovout, which=c('Treatment'))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log(OT) ~ Treatment * Month, data = UV)
##
## $Treatment
## diff lwr upr p adj
## B-A 1.07760 0.44516 1.71005 0.00107
## C-A 0.84174 0.26080 1.42268 0.00446
## C-B -0.23586 -0.89729 0.42556 0.64105
boxplot(log(OT)~Treatment, data=UV, col=7, boxwex=.5)
at <- c(0.7, 1.7, 2.7, 1.3, 2.3, 3.3)
boxplot( log(OT) ~ Treatment + Month, data=UV, add=T, at=at, xaxt='n',
boxwex=.2)
with( UV, interaction.plot( Treatment, Month, log(OT), col=2:3))
Figure 10.1: UVfilter data: box plots sorted by treatment and interaction plot.
(See R-Code 10.4.)
Chapter 11
Extending the Linear Model
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter11.R.
In Chapters 8 to 10 we considered linear modeling in terms of simple regression, multiple
regression and ANOVA. The assumptions of the model were linearity in the parameters of
interest (inherently) and Gaussian, uncorrelated errors. Very often, however, there is
evidence in the data against these assumptions; for example, data may be correlated due
to spatial proximity. In this chapter we consider various possible extensions of the linear
model that address deviations from these assumptions.
11.1 Unifying the Linear Model
We first recall that "all" models we have seen in the last three chapters are linear models,

Y = Xβ + ε,   ε ∼ N_n(0, σ²I),   (11.1)

even if the observations or some of the predictors need to be transformed, or if we have
interactions. In practice we hardly ever need to specify the design matrix X explicitly; R
takes over that role.
Pre-multiplying the equation by a known matrix only changes the variance of the error,
not the linearity or Gaussianity of the model. This trick is used in the next section.
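Written out, the trick reads as follows (with A a known, invertible n × n matrix):

```latex
\mathbf{A}\mathbf{Y} = \mathbf{A}\mathbf{X}\boldsymbol{\beta} + \mathbf{A}\boldsymbol{\varepsilon},
\qquad
\mathbf{A}\boldsymbol{\varepsilon} \sim \mathcal{N}_n\bigl(\mathbf{0},\; \sigma^2 \mathbf{A}\mathbf{A}^\top\bigr),
```

so the mean is still linear in β and the errors remain Gaussian; only their covariance changes. Choosing A as an "inverse square root" of the error covariance renders the transformed errors uncorrelated again.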
11.2 Regression with Correlated Data
Equation (11.1) is generalized to
Y = Xβ + ε, ε ∼ Nn(0,Σ). (11.2)
This means that the errors are no longer independent. The least squares estimator of β
is still β̂ = (XᵀX)⁻¹XᵀY. This estimator is not optimal, however, since it does not take
into account all information about the model (here Σ). The "best" estimator is

β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹Y,   (11.3)

which correctly takes into account the correlation between the observations and the possibly
different variances.
In practice, the following special cases are often observed:
i) Σ = σ²V, where V is a (known) diagonal matrix. In R this diagonal matrix is
specified with the argument weights (weights=1/diag(V)); weighted least squares
(WLS).

ii) a) Σ = σ²R, where R is a (known) correlation matrix. In this case the data
and predictors are premultiplied by the inverse square root of R; generalized least
squares (GLS).

ii) b) Σ is parameterized, but the parameters are unknown. In R, the package nlme
(Nonlinear Mixed-Effects Models) contains the function gls.
The differences between (ii a) and (ii b) are subtle. We now consider situations illustrating
some of these cases in greater detail.
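A minimal sketch of case (ii a), assuming a known AR(1)-type correlation matrix R (hypothetical data): pre-multiplying by the inverse square root of R and applying ordinary least squares reproduces the closed form (11.3).

```r
set.seed(2)
n <- 50
x <- 1:n
R <- 0.5^abs(outer(1:n, 1:n, "-"))        # known correlation matrix
y <- 1 + 2 * x + as.vector(t(chol(R)) %*% rnorm(n))  # correlated errors

e <- eigen(R)                             # inverse square root of R
Rinvsqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
X  <- cbind(1, x)
yt <- as.vector(Rinvsqrt %*% y)           # transformed response
Xt <- Rinvsqrt %*% X                      # transformed predictors
beta_trans  <- coef(lm(yt ~ Xt - 1))      # OLS on transformed data
beta_closed <- solve(t(X) %*% solve(R) %*% X, t(X) %*% solve(R) %*% y)
```

Both vectors agree up to numerical error, which is the content of the GLS trick.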
11.2.1 Heteroscedasticity
The case of non-constant variance is quite ubiquitous. In terms of Model (9.3), we have
to relax ε_i iid∼ N(0, σ²) to ε_i indep∼ N(0, σ_i²). Often, and to a certain degree intuitively, the
variance increases with increasing response Y_i. Such a relationship is depicted in the so-
called "Scale-Location" plot in regression diagnostics (e.g., lower left panel of Figure 9.5).
The following example gives a couple of informal tests.
Example 11.1. (LifeCycleSavings data continued from Example 9.2) Starting
from the final model, lcs.aic, containing the variables pop15 and dpi, we compare the
variances of the residuals for different subgroups. R-Code 11.1 indicates that there is no
strong evidence of heteroscedasticity. Alternatively, we could assess whether there is evidence
of structure in the "Scale-Location" plot. While the red line in Figure 9.5 suggests some
heteroscedasticity, an informal regression model is not conclusive. Note that we should not
take the statistical tests too literally, as not all assumptions are satisfied. ♣
An alternative way to correct for heteroscedasticity is a transformation of the response
variable. We discuss this approach further in Section 11.3.1.
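When individual variances are modeled instead (case (i) above), the weights argument of lm implements WLS. A sketch with hypothetical data whose error standard deviation grows with x:

```r
set.seed(3)
x <- 1:50
y <- 1 + 0.5 * x + rnorm(50, sd = 0.1 * x)   # Var(eps_i) = (0.1 x_i)^2
fit_ols <- lm(y ~ x)
fit_wls <- lm(y ~ x, weights = 1 / x^2)      # weights = 1/diag(V)
rbind(ols = coef(fit_ols), wls = coef(fit_wls))
```

With the correct weights, the WLS estimates are more efficient than the OLS ones.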
R-Code 11.1 LifeCycleSavings data: informally assessing heteroscedasticity.
resids <- residuals(lcs.aic)
summary( lm( sqrt( abs( resids)) ~ fitted(lcs.aic)))
##
## Call:
## lm(formula = sqrt(abs(resids)) ~ fitted(lcs.aic))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.432 -0.478 0.031 0.537 1.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2496 0.4915 4.58 3.4e-05 ***
## fitted(lcs.aic) -0.0750 0.0496 -1.51 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.767 on 48 degrees of freedom
## Multiple R-squared: 0.0455,Adjusted R-squared: 0.0256
## F-statistic: 2.29 on 1 and 48 DF, p-value: 0.137
ind <- LifeCycleSavings$dpi<1200
var.test( resids[ ind], resids[ !ind])
##
## F test to compare two variances
##
## data: resids[ind] and resids[!ind]
## F = 1.27, num df = 30, denom df = 18, p-value = 0.6
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.52083 2.84351
## sample estimates:
## ratio of variances
## 1.2732
# slightly more evidence using: ind <- LifeCycleSavings$pop15<35
11.2.2 Serially Correlated Data
If data are taken over time, observations might be serially correlated or dependent. That
means Corr(ε_{i−1}, ε_i) ≠ 0 and Var(ε) = Σ, which is no longer a diagonal matrix. This is easy
to test and to visualize through the residuals, as illustrated in Example 11.2. The example
shows that there is a strong dependency and proposes a model for the residuals:

ε_i = φ ε_{i−1} + ν_i,   i = 2, . . . , n,   (11.4)

where now ν_i iid∼ N(0, σ_ν²) and φ takes the role of the regression slope. As the residuals
have zero mean, we should not include an intercept here. Note that for ε_i we see ε_{i−1}
as a fixed predictor and not as random. We prefer this slight abuse of notation over a
cumbersome mathematical notation.
We estimate φ based on the residuals and use the estimate φ̂ to construct Σ̂ for a GLS
estimation of β. This joint estimation is not possible within a least squares approach,
and thus in practice maximum likelihood (type) procedures are used. Whether or not
model (11.4) is sufficient is beyond the scope of this document and would be formally
discussed in time series lectures.
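A sketch of this maximum-likelihood-based joint estimation with nlme::gls, using simulated data with AR(1) errors (hypothetical parameter values):

```r
library(nlme)                                # recommended package, ships with R
set.seed(4)
n <- 100
x <- seq(0, 1, length.out = n)
eps <- as.vector(arima.sim(list(ar = 0.5), n = n, sd = 0.5))
y <- 1 + 2 * x + eps
fit <- gls(y ~ x, correlation = corAR1())    # jointly estimates beta and phi
coef(fit)
```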
Example 11.2. (abrasion data continued from Example 9.3) Starting from the
final model res, we plot r_i versus r_{i−1} and fit a linear model for the
residuals (R-Code 11.2 and Figure 11.1). There is a strong serial correlation in the resid-
uals: a correlation of 0.53. Compared to this lag-one correlation, the lag-two correlation is
much smaller: Corr(r_i, r_{i−2}) = 0.26. An improved model would be based on the function
nlme::gls with argument correlation = corAR1(). ♣
Figure 11.1: abrasion data: serial correlation and linear model fit for the
residuals (left: lag one; right: lag two). (See R-Code 11.2.)
R-Code 11.2 abrasion data: informally assessing serial correlation in the residuals. (See Figure 11.1.)
resids <- residuals( res)
n <- length( resids)
plot( resids[-n], resids[-1], type = "p", pch = 19)
cor.test(resids[-1], resids[-n]) # estimate is given at end
##
## Pearson's product-moment correlation
##
## data: resids[-1] and resids[-n]
## t = 3.24, df = 27, p-value = 0.0031
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.20233 0.75042
## sample estimates:
## cor
## 0.52956
# model is r_i = phi * r_{i-1} + nu_i; no intercept!
abline( residlm <- lm( resids[-1]~resids[-n]-1))
# summary( residlm) # gives similar results as the cor.test.
plot( resids[-c(n,n-1)], resids[-c(1:2)], type = "p", pch = 19)
cor.test(resids[-c(n,n-1)], resids[-c(1:2)])$p.value
## [1] 0.18385
11.2.3 Spatial Data
Spatial statistics is concerned with data/measurements that possess topological, geometric,
or geographic properties. Examples, reflecting the basic types of spatial data,
are:
[A] geostatistical data: Cadmium (poisonous heavy metal) concentration in the sedi-
ment of Lake Geneva, see Example 11.3.
[B] lattice data: number of Bluetongue disease cases in cattle within 150+ Swiss districts
[C] point process data: cattle farms on which Bluetongue disease is diagnosed.
In these cases there appears to be correlation between the data: measurements taken
closer to each other in space appear more similar than those taken farther apart.
W. Tobler, an American-Swiss geographer with close ties to UZH, stated, "Everything
is related to everything else, but near things are more related than distant things," which
is now referred to as the first law of geography (Tobler, 1970). We experience this “law”
every day, for example, when we sense the weather: between Zurich and Effretikon there
is no large temperature difference, but there is between Visp (VS) and Zurich. This
dependence is utilized in spatial statistics.
Cases [A] and [B] can often be modeled with the multivariate normal distribution. The
next section considers case [A].
Hence, the problem can be framed in the context of a conditional distribution as given in
Equation (7.18). In practice, the expectation and variance are not known and need to be
estimated. However, we often have only one observation and the classical estimators, as
given in (7.19), cannot be used.
The estimation of the dependency structure is often referred to as variogram estimation
and the prediction of the “process” at unobserved locations as kriging. We will not discuss
these latter two points in detail but give a simple illustration.
Example 11.3. (leman data) Here we have 293 measurements of nickel concentration
in the sediment of Lake Geneva from the year 1983; see Furrer and Genton (1999).
The measurements are displayed in R-Code 11.3 and Figure 11.2. We illustrate kriging
through a black-box use of the function Krig of the package fields. The result is also
given in Figure 11.2. ♣
R-Code 11.3: leman data: EDA. (See Figure 11.2.)
load( "data/leman83_red.RData")
# str( leman); dim(leman) # etc...
names( leman)
## [1] "x" "y" "Zn" "Cr" "Cd" "Co" "Sr" "Hg" "Pb" "Ni"
library( fields)
plot( y~x, col=tim.colors()[cut(Ni,64)], pch=19, data=leman)
# simpler alternative is quilt.plot
lines( lake.data)
hist( leman$Ni, main='') # not quite bell shaped but no outliers
# Construct a grid within the lake boundaries:
library(splancs)
xr <- seq(min(lake.data[,1]),to=max(lake.data[,1]),l=110)
yr <- seq(min(lake.data[,2]),to=max(lake.data[,2]),by=xr[2]-xr[1])
locs <- data.frame( x=lake.data[,1], y=lake.data[,2])
grid <- expand.grid( x=xr, y=yr) # create a 2-dim grid
pts <- pip( grid, locs, bound=TRUE) # pip points-in-polygon
# pts contains a fine grid
# Kriging:
out <- Krig( cbind(leman$x,leman$y), leman$Ni, give.warnings=FALSE)
pout <- predict( out, pts)
quilt.plot(pts, pout, nx=length(xr)-2, ny=length(yr)-1)
Figure 11.2: leman data: EDA and prediction. (See R-Code 11.3.)
11.3 Nonlinear Regression
Naturally, the linearity assumption (in the individual parameters) of the linear model
is central. There are situations where we have a nonlinear relationship between the
observations and the parameters. In this section we enumerate some of the alternatives,
so that at least the relevant statistical terms can be placed in context.
Model (9.3) is generalized to

Y_i ≈ f(x_i, β),   (11.5)

where f(x, β) is a (sufficiently well-behaved) function depending on a vector of covariates
x_i and the vector of parameters β. We write the relationship as approximate,
since we do not necessarily claim that we have additive Gaussian errors. For
example, we may have multiplicative errors, or the right-hand side of model (11.5) may
be used for the expectation of the response.
11.3.1 Transformation of the Response
As seen in Section 11.2.1, it is possible that the variance of the residuals increases with
increasing observations. A classical example is a Poisson random variable. Instead of mod-
eling individual variances, as done there, transforming the response is an alternative approach.
For example, instead of a linear model for Y_i, a linear model for log(Y_i) is constructed.
There are formal approaches to determine an optimal transformation of the data, notably
the function boxcox from the package MASS. In practice, however, log(·) and √· are used
predominantly.
If the original data truly stem from a linear model, a transformation typically leads only to
an approximation. Conversely, the multiplicative relation

Y_i = β_0 · x_i^{β_1} · e^{ε_i}   (11.6)

is turned by the log transformation into the truly linear relationship

log(Y_i) = log(β_0) + β_1 log(x_i) + ε_i.   (11.7)
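A quick sketch of this log-linearization, with hypothetical values β_0 = 3 and β_1 = 1.5:

```r
set.seed(5)
x <- runif(60, 1, 10)
y <- 3 * x^1.5 * exp(rnorm(60, sd = 0.2))   # multiplicative errors as in (11.6)
fit <- lm(log(y) ~ log(x))
coef(fit)   # intercept near log(3), slope near 1.5
```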
11.3.2 Nonlinear and Non-parametric Regression
When model (11.5) has additive (Gaussian) errors, we can estimate β based on least
squares, that is,

β̂ = argmin_β ∑_{i=1}^{n} (y_i − f(x_i, β))².   (11.8)
However, we do not have closed forms for the resulting estimates and iterative approaches
are needed. That is, starting from an initial condition we improve the solution by small
steps. The correction factor typically depends on the gradient. Such algorithms are
variants of the so-called Gauss–Newton algorithm. The details of these are beyond the
scope of this document.
There is no guarantee that an optimal solution (a global minimum) exists. Using nonlinear
least squares as a black-box approach is dangerous and should be avoided. In R, the
function nls (nonlinear least squares) can be used; general optimizer functions are nlm
and optim.
There are some prototype nonlinear functions that are often used:

Y_i = β_1 exp(−(x_i/β_2)^{β_3}) + ε_i                          Weibull model,   (11.9)

Y_i = { β_1 + ε_i, if x_i < β_3;   β_2 + ε_i, if x_i ≥ β_3 }   break point model.   (11.10)
The exponential growth model is a special case of the Weibull model.
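A sketch fitting the Weibull model (11.9) with nls on simulated data (hypothetical parameter values; in practice the choice of starting values is crucial):

```r
set.seed(6)
x <- seq(0.1, 5, length.out = 50)
y <- 2 * exp(-(x / 2)^1.5) + rnorm(50, sd = 0.05)  # b1 = 2, b2 = 2, b3 = 1.5
fit <- nls(y ~ b1 * exp(-(x / b2)^b3),
           start = list(b1 = 1, b2 = 1, b3 = 1))   # Gauss-Newton iterations
coef(fit)
```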
An alternative way to express a response in a nonlinear fashion is through a virtually arbitrary
flexible function, similar to the (red) guide-the-eye curves in many of our plots. Such
smooth curves are constructed through a so-called "non-parametric" approach.
A detailed discussion of non-parametric regression is beyond the scope here, but we note
that non-parametric regression can also be formulated as a linear model with many,
many parameters.
11.3.3 Generalized Linear Regression
The responses Y_i in the previous chapters, as well as in models (11.1) and (11.2), are Gaussian
(albeit not necessarily independent nor with constant variance). If this is not the case,
for example when the observations y_i are count data, then we need another approach. The idea
of this new approach is to model the expected value of Y_i, or a function thereof, as a linear
function of some predictors. This approach is called the generalized linear model (GLM)
and contains classical linear regression as a special case. The estimation procedure is
quite similar to a nonlinear approach, but the details are again beyond the scope.
To illustrate the concept we look at simple logistic regression, one particular case of the
GLM. Example 11.4 is based on widely discussed data.
Example 11.4. (orings data) In January 1986 the space shuttle Challenger exploded
shortly after taking off. Part of the problem was with the rubber seals, the so-called O-
rings, of the booster rockets. The data set data( orings, package="faraway") contains
the number of defects among the six seals in 23 previous launches (Figure 11.3). The question
we ask here is whether the probability of a defect for an arbitrary seal can be predicted
for an air temperature of 31°F (as in January 1986). See Dalal et al. (1989) for a detailed
account. ♣
Figure 11.3 shows why linear models are often inadequate here: the variable of interest
is a probability, meaning y_i ∈ [0, 1], but a linear model cannot guarantee predictions in [0, 1].
In this and similar cases, logistic regression is appropriate. The logistic regression models
the probability of a defect as

p = P(defect) = 1 / (1 + exp(−β_0 − β_1 x)),   (11.11)

where x is the air temperature. Through inversion one obtains a linear model for the log
odds,

g(p) = log( p / (1 − p) ) = β_0 + β_1 x,   (11.12)

where g(·) is generally called the link function. In this special case, the inverse g⁻¹(·) is
called the logistic function.
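The link g and its inverse are available in R as qlogis and plogis, which gives a quick check of (11.11)–(11.12):

```r
p <- 0.2
qlogis(p)           # log(p / (1 - p)), the log odds
plogis(qlogis(p))   # inverting the link recovers p = 0.2
```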
R-Code 11.4 orings data and estimated probability of defect dependent on air
temperature. (See Figure 11.3.)
data( orings, package="faraway")
str(orings)
## 'data.frame': 23 obs. of 2 variables:
## $ temp : num 53 57 58 63 66 67 67 67 68 69 ...
## $ damage: num 5 1 1 1 0 0 0 0 0 0 ...
plot( damage/6~temp, xlim=c(21,80), ylim=c(0,1), data=orings,
xlab="Temperature [F]", ylab='Probability of damage') # data
abline(lm(damage/6~temp, data=orings), col='gray') # regression line
glm1 <- glm( cbind(damage,6-damage)~temp, family=binomial, data=orings)
points( orings$temp, glm1$fitted, col=2) # fitted values
ct <- seq(20, to=85, length=100) # vector to predict
p.out <- predict( glm1, new=data.frame(temp=ct), type="response")
lines(ct, p.out)
abline( v=31, col='gray', lty=2) # actual temp. at start
unlist( predict( glm1, new=data.frame(temp=31), type="response",
se.fit=TRUE))[1:2] # ('residual.scale' not relevant here).
## fit.1 se.fit.1
## 0.993034 0.011533
Figure 11.3: orings data (black dots) and estimated probability of defect (red
dots) dependent on air temperature. Linear fit is given by the gray solid line.
Dotted vertical line is the ambient launch temperature at the time of launch.
(See R-Code 11.4.)
11.4 Bibliographic Remarks and Links
The book by Faraway (2006) nicely summarizes the many facets of this chapter.
In R the packages fields and geoR are very helpful for modelling spatial data.
Chapter 12
Bayesian Methods
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter12.R.
In statistics there exist two different philosophical approaches to inference: frequentist
and Bayesian inference. Past chapters dealt with the frequentist approach; now we turn
to the Bayesian approach. Here, we consider the parameter as a random variable with
a suitable distribution, which is chosen a priori, i.e., before the data are collected and
analyzed. The goal is to update this prior knowledge after observation of the data in
order to draw conclusions (with the help of the so-called posterior distribution).
12.1 Motivating Example and Terminology
Bayesian statistics is often introduced by recalling the so-called Bayes theorem, which
states for two events A and B

P(A | B) = P(B | A) P(A) / P(B),   for P(B) ≠ 0,   (12.1)

and is shown by using Equation (2.2) twice. Bayes' theorem is often used in probability
theory to calculate probabilities along an event tree, as illustrated in the archetypal example
below.
Example 12.1. A patient sees a doctor and gets tested for a (relatively) rare disease.
The prevalence of this disease is 0.5%. As is typical, the screening test is not perfect: it
has a sensitivity of 99% (true positive rate; the disease is properly identified in a sick
patient) and a specificity of 98% (true negative rate; a healthy person is correctly
identified as disease free). What is the probability that the patient has the disease, given
that the test is positive?
Denoting the events D = 'Patient has disease' and + = 'test is positive', we have, us-
ing (12.1),

P(D | +) = P(+ | D) P(D) / P(+)
         = P(+ | D) P(D) / ( P(+ | D) P(D) + P(+ | ¬D) P(¬D) )   (12.2)
         = 99% · 0.5% / (99% · 0.5% + 2% · 99.5%) ≈ 20%.   (12.3)
Note that for the denominator we have used the so-called law of total probability to get
an expression for P(+). ♣
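The calculation (12.2)–(12.3) takes two lines of R:

```r
prev <- 0.005; sens <- 0.99; spec <- 0.98
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # law of total probability
sens * prev / p_pos                             # P(D | +), roughly 0.20
```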
An interpretation of the previous example from a frequentist view is in terms of proportions
of outcomes (in a repeated sampling framework). In the Bayesian view, we regard the probabilities
as "degrees of belief", where we have some proposition (event A in (12.1) or D in the
example) and some evidence (event B in (12.1) or + in the example). More specifically,
P(D) represents the prior belief in our proposition, P(+ | D)/P(+) is the support of the
evidence for the proposition and P(D | +) is the posterior belief in the proposition after
having accounted for the new evidence.
Extending Bayes' theorem to the setting of two continuous random variables X and Y we
have

f_{X|Y=y}(x) = f_{Y|X=x}(y | x) f_X(x) / f_Y(y).   (12.4)

In the context of Bayesian inference the random variable X will now be a parameter,
typically of the distribution of Y:

f_{Θ|Y=y}(θ | y) = f_{Y|Θ=θ}(y | θ) f_Θ(θ) / f_Y(y).   (12.5)
Hence, current knowledge about the parameter is expressed by a probability distribution
on the parameter: the prior distribution. The model for our observations is called the
likelihood. We use our observed data to update the prior distribution and thus obtain
the posterior distribution.
Notice that P(B) in (12.1), P(+) in (12.2), and f_Y(y) in (12.4) serve as normalizing
constants, i.e., they do not depend on A, D, x or the parameter θ. This can be expressed in
several ways:
fΘ|Y=y(θ | y) ∝ fY |Θ=θ(y | θ)× fΘ(θ), (12.6)
f(θ | y) ∝ f(y | θ)× f(θ). (12.7)
The symbol “∝” means “proportional to” and is used because the normalizing constants
are not considered:
f(θ | y) = f(y, θ) / f(y) = f(y | θ) f(θ) / f(y) ∝ f(y | θ) f(θ).   (12.8)
Finally, we can summarize the most important result in Bayesian inference: the posterior
density is proportional to the likelihood multiplied by the prior density, i.e.,

Posterior density ∝ Likelihood × Prior density.   (12.9)
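The proportionality (12.9) also suggests a simple numerical recipe: evaluate likelihood × prior on a grid of parameter values and normalize by the sum. A minimal sketch (our own toy setup with an assumed Beta(3, 3) prior and a binomial likelihood with 10 successes in 13 trials):

```r
theta <- seq(0.005, 0.995, by=0.01)   # grid over the parameter
prior <- dbeta( theta, 3, 3)          # assumed prior density on the grid
lik <- dbinom( 10, 13, theta)         # likelihood of the observed data
unnorm <- lik * prior                 # posterior up to the normalizing constant
post <- unnorm / sum( unnorm * 0.01)  # normalize numerically
sum( post * 0.01)                     # integrates to one
```

The normalizing constant never has to be known in closed form; it is recovered numerically from the grid.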
Advantages of using a Bayesian framework are:
• a formal way to incorporate prior knowledge;
• an intuitive interpretation of the posterior;
• much easier modeling of complex systems;
• no dependence of ‘significance’ on the sample size n, as is the case with the p-value.
As nothing comes for free, there are also some disadvantages:
• more ‘elements’ have to be specified for a model;
• in virtually all cases, computationally more demanding.
Until recently, there were clear fronts between frequentists and Bayesians. Luckily, these
fronts have largely vanished.
12.2 Examples
We start with two very typical examples that are tractable and illustrate the concept of
Bayesian inference well.
Example 12.2. (Beta-Binomial) Let Y ∼ Bin(n, p). We observe y successes (out of n).
As was shown in Section 5.1, p̂ = y/n. We often have additional knowledge about the
parameter p. For example, let n = 13 be the number of autumn lambs in a herd of sheep.
We count the number of male lambs. It is highly unlikely that p ≤ 0.1. Thus we assume
that p is beta distributed, i.e.,

f(p) = c · p^{α−1} (1 − p)^{β−1},   p ∈ [0, 1], α > 0, β > 0,   (12.10)

with normalization constant c. We write p ∼ Beta(α, β). Figure 2.8 shows densities for
various pairs (α, β).
The posterior density is then

f(p | y) ∝ \binom{n}{y} p^y (1 − p)^{n−y} × c · p^{α−1} (1 − p)^{β−1}   (12.11)

∝ p^y p^{α−1} (1 − p)^{n−y} (1 − p)^{β−1}   (12.12)

∝ p^{y+α−1} (1 − p)^{n−y+β−1},   (12.13)
which can be recognized as a Beta(y + α, n − y + β) distribution.
The expected value of a Beta(α, β) random variable is α/(α + β) (here: the prior
distribution). The posterior expected value is thus

E(p | Y = y) = (y + α) / (n + α + β).   (12.14)
The beta distribution Beta(1, 1), i.e., α = 1, β = 1, is equivalent to the uniform distribution
U(0, 1). The uniform distribution, however, is not “information-free”: by Equation (12.14),
a uniform prior is “equivalent” to two experiments, of which one is a success, i.e., one of
two lambs is male.
Figure 12.1 illustrates the case of y = 10, n = 13 with prior Beta(3, 3). The frequentist
empirical 95% CI is [0.5, 0.92], see Equation (5.9). The 2.5% and 97.5% quantiles of the
posterior are 0.45 and 0.83, respectively. ♣
Figure 12.1: Beta-binomial model with prior density (cyan), data/likelihood
(green) and posterior density (blue).
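The quantities of Example 12.2 can be reproduced directly; a short sketch using the conjugacy result Beta(y + α, n − y + β):

```r
y <- 10; n <- 13       # observed successes out of n
alpha <- 3; beta <- 3  # prior Beta(3, 3)
# Posterior is Beta(y + alpha, n - y + beta) = Beta(13, 6):
( postmean <- (y + alpha) / (n + alpha + beta))   # Equation (12.14)
qbeta( c(0.025, 0.975), y + alpha, n - y + beta)  # posterior quantiles
```

The posterior mean is 13/19 ≈ 0.68, shrunk from the maximum likelihood estimate 10/13 toward the prior mean 1/2.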
Example 12.3. (Normal-Normal) Let Y1, . . . , Yn iid ∼ N(µ, σ²). We assume that µ ∼
N(η, τ²) and that σ is known. Thus we have the Bayesian model:

Yi | µ iid ∼ N(µ, σ²),   i = 1, . . . , n,   (12.15)

µ | η ∼ N(η, τ²).   (12.16)
The posterior density is then

f(µ | y1, . . . , yn) ∝ f(y1, . . . , yn | µ) × f(µ)   (12.17)

∝ ∏_{i=1}^n f(yi | µ) × f(µ)   (12.18)

∝ ∏_{i=1}^n exp( −(1/2) (yi − µ)²/σ² ) · exp( −(1/2) (µ − η)²/τ² )   (12.19)

∝ exp( −(1/2) ∑_{i=1}^n (yi − µ)²/σ² − (1/2) (µ − η)²/τ² ),   (12.20)
where the normalization constants (2πσ²)^{−1/2} (one per observation) and (2πτ²)^{−1/2} do
not need to be considered. Through further manipulation (completing the square in µ) one
obtains
∝ exp( −(1/2) (n/σ² + 1/τ²) [ µ − (n/σ² + 1/τ²)^{−1} (nȳ/σ² + η/τ²) ]² )   (12.21)
and thus the posterior distribution is

N( (n/σ² + 1/τ²)^{−1} (nȳ/σ² + η/τ²),  (n/σ² + 1/τ²)^{−1} ).   (12.22)
In other words, the posterior expected value

E(µ | y1, . . . , yn) = η · σ²/(nτ² + σ²) + ȳ · nτ²/(nτ² + σ²)   (12.23)

is a weighted mean of the prior mean η and the mean ȳ of the likelihood. The greater n
is, the less weight there is on the prior mean, since σ²/(nτ² + σ²) → 0 as n → ∞. ♣
R-Code 12.1 Normal-normal model. (See Figure 12.2.)
# Information about data:
ybar <- 2.1; n <- 4; sigma2 <- 1
# information about prior:
priormean <- 0; priorvar <- 2
# Calculating the posterior variance and mean:
postvar <- 1/( n/sigma2 + 1/priorvar)
postmean <- postvar*( ybar*n/sigma2 + priormean/priorvar )
# Plotting follows:
y <- seq(-2, to=4, length=500)
plot( y, dnorm( y, postmean, sqrt( postvar)), type='l', col=4,
ylab='Density', xlab=expression(mu))
lines( y, dnorm( y, ybar, sqrt( sigma2/n)), col=3)
lines( y, dnorm( y, priormean, sqrt( priorvar)), col=5)
legend( "topleft", legend=c("Data/likelihood", "Prior", "Posterior"),
col=c(3, 5, 4), bty='n', lty=1)
The posterior mode is often used as a summary statistic of the posterior distribution.
Naturally, the posterior median and posterior mean (i.e., expectation of the posterior
distribution) are intuitive alternatives.
With the frequentist approach we have constructed confidence intervals, but these intervals
cannot be interpreted with probabilities. An empirical (1−α)% confidence interval [bu, bo]
Figure 12.2: Normal-Normal model with prior (cyan), data/likelihood (green)
and posterior (blue). (See R-Code 12.1.)
contains the true parameter with a frequency of (1 − α)% in infinite repetitions of the
experiment. With a Bayesian approach, we can now make statements about the parameter
with probabilities. In Example 12.3, based on Equation (12.22),

P( v^{−1}m − z_{1−α/2} v^{−1/2} ≤ µ ≤ v^{−1}m + z_{1−α/2} v^{−1/2} ) = 1 − α,   (12.24)

with v = n/σ² + 1/τ² and m = nȳ/σ² + η/τ². That means that the bounds v^{−1}m ± z_{1−α/2} v^{−1/2}
can be used to construct a Bayesian counterpart to a confidence interval.
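With the numbers of R-Code 12.1 (ȳ = 2.1, n = 4, σ² = 1, η = 0, τ² = 2), these bounds are quickly computed:

```r
ybar <- 2.1; n <- 4; sigma2 <- 1   # data
eta <- 0; tau2 <- 2                # prior
v <- n/sigma2 + 1/tau2             # v = n/sigma^2 + 1/tau^2
m <- n*ybar/sigma2 + eta/tau2      # m = n*ybar/sigma^2 + eta/tau^2
m/v + c(-1, 1) * qnorm(0.975) / sqrt(v)   # 95% bounds v^{-1}m -+ z v^{-1/2}
```

Here v^{−1}m ≈ 1.87 is the posterior mean already computed in R-Code 12.1.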
Definition 12.1. The interval R with

∫_R f(θ | y1, . . . , yn) dθ = 1 − α   (12.25)

is called a (1 − α)% credible interval for θ with respect to the posterior density
f(θ | y1, . . . , yn), and 1 − α is the credible level of the interval. ♦
The definition states that the random variable whose density is given by f(θ | y1, . . . , yn)
is contained in the (1− α)% credible interval with probability (1− α).
Since the credible interval for a fixed α is not unique, the “narrowest” one is often used. This
is the so-called HPD interval (highest posterior density interval). A detailed discussion
can be found in Held (2008). Credible intervals are often determined numerically.
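A numerical determination is straightforward; a sketch for the posterior Beta(13, 6) of Example 12.2 that compares the equal-tailed interval with the (narrower) HPD interval by shifting the tail probabilities:

```r
a <- 13; b <- 6; alpha <- 0.05
qbeta( c(alpha/2, 1 - alpha/2), a, b)    # equal-tailed credible interval
# HPD: among all intervals with coverage 1 - alpha, take the shortest
p <- seq(0, alpha, length=1000)          # probability placed in the lower tail
len <- qbeta(1 - alpha + p, a, b) - qbeta(p, a, b)
i <- which.min(len)
qbeta( c(p[i], 1 - alpha + p[i]), a, b)  # HPD interval
```

By construction the HPD interval is at most as long as the equal-tailed one; for symmetric posteriors the two coincide.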
The Bayesian counterpart to hypothesis testing is done through a comparison of posterior
probabilities. For example, consider two models specified by two hypotheses H0 and H1.
By Bayes’ theorem,

P(H0 | y1, . . . , yn) / P(H1 | y1, . . . , yn) = P(y1, . . . , yn | H0) / P(y1, . . . , yn | H1) × P(H0) / P(H1),   (12.26)

(posterior odds = Bayes factor × prior odds),
that means that the Bayes factor summarizes the evidence in the data for the hypothesis
H0 versus the hypothesis H1. However, a Bayes factor needs to exceed 3 before one can
speak of substantial evidence for H0; for strong evidence we typically require Bayes factors
larger than 20.
Bayes factors are popular because they are linked to the BIC (Bayesian Information
Criterion) and thus automatically penalize model complexity. Further, they also work for
non-nested models.
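As a hypothetical illustration (not from the script): testing H0: p = 1/2 against H1: p ∼ U(0, 1) for the lamb data of Example 12.2. Under H1 the marginal likelihood of y successes out of n is the binomial likelihood integrated over the uniform prior, which equals 1/(n + 1):

```r
y <- 10; n <- 13
m0 <- dbinom( y, n, 0.5)  # marginal likelihood under H0: p = 1/2
m1 <- 1/(n + 1)           # under H1: binomial integrated over the uniform prior
( BF01 <- m0/m1 )         # Bayes factor of H0 versus H1
```

Here BF01 ≈ 0.49 < 1, i.e., the data slightly favor H1; with equal prior odds, the posterior odds equal the Bayes factor.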
12.3 Choice and Effect of Prior Distribution
The choice of the prior distribution belongs to the modeling process just like the choice of
the likelihood.
The examples in the last section were such that the posterior and prior distributions
belonged to the same class. Naturally, that is no coincidence. In these examples we have
chosen so-called conjugate prior distributions.
Figure 12.3: Normal-Normal model with prior (cyan), data/likelihood (green)
and posterior (blue). Two different priors (top) and increasing n (bottom;
n = 4, 36, 64, 100).
With other prior distributions we may obtain posterior distributions which we no longer
“recognize”, and normalizing constants must be explicitly calculated.
Example 12.4. We consider again the Normal-Normal model and compare the posterior
density for various n with the likelihood. We keep ȳ = 2.1, independent of n. As shown in
Figure 12.3, the maximum likelihood estimate does not depend on n (ȳ is kept constant by
design). The uncertainty decreases, however (the standard error is σ/√n). For increasing n,
the posterior approaches the likelihood density. In the limit there is no difference between
the posterior and the likelihood. ♣
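The effect can be traced numerically by extending R-Code 12.1 over the four sample sizes of Figure 12.3 (keeping ȳ = 2.1 fixed):

```r
ybar <- 2.1; sigma2 <- 1; priormean <- 0; priorvar <- 2
for (n in c(4, 36, 64, 100)) {
  postvar <- 1/( n/sigma2 + 1/priorvar)
  postmean <- postvar*( ybar*n/sigma2 + priormean/priorvar)
  cat( n, postmean, postvar, "\n")  # mean -> ybar, variance -> sigma2/n
}
```

The posterior mean moves toward ȳ = 2.1 and the posterior variance shrinks like σ²/n.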
12.4 Bibliographic Remarks and Links
The choice of the prior distribution gives rise to several discussions and we refer to Held
and Sabanés Bové (2014) for more details.
Chapter 13
A Closer Look: Monte Carlo
Methods
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter13.R.
Several examples in the previous chapter consisted of posterior and prior distributions of
the same distribution type (but different parameters). This was no coincidence; rather,
we chose so-called conjugate priors based on our likelihood functions.
With other prior distributions, under certain circumstances we have unknown, “compli-
cated” posterior distributions, for which, for example, we no longer know the expected
value. The calculation of the integral is often too complex and so here we consider classic
(Monte Carlo) simulation procedures as a solution to this problem.
In general, Monte Carlo simulation is used to numerically solve a complex problem through
repeated random sampling. This utilizes, above all, the law of large numbers.
13.1 Monte Carlo Integration
First let us consider how we can simply approximate integrals. Let fX(x) be a density
and g(x) an arbitrary function. We are looking for the expected value of g(X), i.e.,
E( g(X) ) = ∫_ℝ g(x) f_X(x) dx.   (13.1)

An approximation of this integral is (method of moments)

E( g(X) ) = ∫_ℝ g(x) f_X(x) dx ≈ Ê( g(X) ) = (1/n) ∑_{i=1}^n g(xi),   (13.2)

where x1, . . . , xn is a random sample from f_X(x).
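A minimal illustration of (13.2) (our own toy example): for X ∼ N(0, 1) and g(x) = x², the exact value is E(X²) = 1:

```r
set.seed(1)
n <- 100000
x <- rnorm(n)   # random sample from f_X
mean(x^2)       # Monte Carlo approximation of E(g(X)), close to 1
```

The approximation error decreases like 1/√n, which is slow but independent of the dimension of the integral.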
Example 13.1. Let g(x1, x2) be the indicator function of the set x1² + x2² ≤ 1. We have
the following approximation of the number π (the area of a circle with radius 1):

π = ∫∫ g(x1, x2) dx1 dx2 ≈ 4 × (1/n) ∑_{i=1}^n g(x1,i, x2,i),   (13.3)

where x1,i and x2,i, i = 1, . . . , n, are two random samples from U(0, 1).
It is important to note that the convergence is very slow, see Figure 13.1. ♣
R-Code 13.1 Approximation of π with the aid of Monte Carlo integration (See
Figure 13.1.)
set.seed(14)
m <- 49
n <- round( 10+1.4^(1:m))
piapprox <- numeric(m)
for (i in 1:m) {
  st <- matrix( runif( 2*n[i]), ncol=2)
  piapprox[i] <- 4*mean( rowSums( st^2) <= 1)
}
plot( n, abs( piapprox-pi)/pi, log='xy', type='l')
lines( n, 1/sqrt(n), col=2, lty=2)
cbind( n=n[1:7*7], pi.approx=piapprox[1:7*7], rel.error=
abs( piapprox[1:7*7]-pi)/pi, abs.error=abs( piapprox[1:7*7]-pi))
## n pi.approx rel.error abs.error
## [1,] 21 2.4762 0.21180409 0.66540218
## [2,] 121 3.0083 0.04243968 0.13332819
## [3,] 1181 3.1634 0.00694812 0.02182818
## [4,] 12358 3.1662 0.00783535 0.02461547
## [5,] 130171 3.1403 0.00040166 0.00126186
## [6,] 1372084 3.1424 0.00025959 0.00081554
## [7,] 14463522 3.1406 0.00032656 0.00102592
Figure 13.1: Convergence of the approximation for π: the relative error as a
function of n. (See R-Code 13.1.)
13.2 Rejection Sampling
When no method exists to directly sample from a distribution with density fY (y), rejection
sampling can be used. In this method values from a known density fZ(z) are drawn
and through rejection of unsuitable values, observations of fY (y) are generated. This
method can also be used when the normalizing constant of fY (y) is unknown. We write
fY (y) = c · f ∗(y).
The procedure is as follows: find an m < ∞ such that f*(y) ≤ m · fZ(y). Then draw y
from fZ(y) and u from a standard uniform distribution. If u ≤ f*(y)/(m · fZ(y)), then
y is accepted as a simulated value from fY(y); otherwise y is discarded and no longer
considered. For efficiency reasons, the constant m should be chosen as small as possible.
Example 13.2. We draw values from a Beta(6, 3) distribution with the rejection
sampling method; additionally we need a uniform distribution. That means fY(y) =
c · y^{6−1}(1 − y)^{3−1}, i.e., f*(y) = y^{6−1}(1 − y)^{3−1} with normalizing constant c = 1/B(6, 3) =
168, and fZ(y) = 1_{[0,1]}(y). We select m = 0.02, which fulfills the condition f*(y) ≤ m · fZ(y).
An implementation of the example is given in R-Code 13.2. The R-Code can be optimized
with respect to speed. It would then, however, be more difficult to read. Figure 13.2 shows
a histogram and the density of the simulated values. ♣
R-Code 13.2: Rejection sampling in the setting of a beta distribution.
(See Figure 13.2.)
set.seed( 14)
n.sim <- 1000
m <- 0.02
fst <- function(y) y^( 6-1) * (1-y)^(3-1)
f_Z <- function(y) ifelse( y >= 0 & y <= 1, 1, 0)
result <- sample <- rep( NA, n.sim)
for (i in 1:n.sim) {
  sample[i] <- runif(1)
  u <- runif(1)
  if( u < fst( sample[i]) /( m * f_Z( sample[i])) ) # if accepted ...
    result[i] <- sample[i]
}
mean( !is.na(result)) # proportion of accepted samples
## [1] 0.285
result <- result[ !is.na(result)]
## Figures:
hist( sample, xlab="y", main="", col="lightblue")
hist( result, add=TRUE, col=4)
curve( dbeta(x, 6, 3), frame =FALSE, ylab="", xlab='y', yaxt="n")
lines( density( result), lty=2, col=4)
legend( "topleft", legend=c("truth", "smoothed empirical"),
lty=1:2, col=c(1,4))
Figure 13.2: On the left we have a histogram of the simulated values of fZ(y)
(light blue) and fY (y) (dark blue). On the right the theoretical density (truth)
and the simulated density (smoothed empirical) are drawn. (See R-Code 13.2.)
13.3 Gibbs Sampling
The idea of Gibbs sampling is to simulate the posterior distribution through the use of a
Markov chain; it thus belongs to the family of Markov chain Monte Carlo (MCMC) methods.
We illustrate the principle with a joint distribution f(θ1, θ2 | y) of the two parameters θ1 and
θ2. The Gibbs sampler reduces the problem to two one-dimensional simulations: samples
are drawn alternately from f(θ1 | θ2, y) and f(θ2 | θ1, y), where we condition in each step
on the previously sampled values of θ2 and θ1, respectively.
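To make the alternation concrete, here is a minimal hand-written Gibbs sampler (our own sketch, not from the script) for the normal model with unknown mean µ and precision κ = 1/σ², using the conjugate full conditionals (normal for µ, gamma for κ):

```r
set.seed(14)
y <- rnorm(10, 1, 1); n <- length(y)   # simulated data
eta <- 0; tau2 <- 1.2                  # prior mu ~ N(eta, tau2)
a <- 1; b <- 0.2                       # prior kappa ~ Gamma(a, b)
n.iter <- 2000
mu <- kappa <- numeric(n.iter)
mu[1] <- mean(y); kappa[1] <- 1
for (i in 2:n.iter) {
  # full conditional of mu given kappa: normal (as in the normal-normal model)
  v <- 1/( n*kappa[i-1] + 1/tau2)
  mu[i] <- rnorm(1, v*( kappa[i-1]*sum(y) + eta/tau2), sqrt(v))
  # full conditional of kappa given mu: gamma (rate parameterization)
  kappa[i] <- rgamma(1, a + n/2, b + sum( (y - mu[i])^2)/2)
}
mean(mu); mean(kappa)   # posterior summaries
```

Each iteration updates one parameter at a time while holding the other fixed; the resulting chain has the joint posterior as its stationary distribution.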
In many cases one does not have to program a Gibbs sampler oneself but can use a pre-
programmed sampler. We use the sampler JAGS (Just Another Gibbs sampler) (Plum-
mer, 2003) with the R-Interface package rjags (Plummer, 2016).
When using MCMC methods, you may encounter situations in which the sampler does
not converge (or converges too slowly). In such a case the posterior distribution cannot
be approximated with the simulated values. It is therefore important to examine the
simulated values for eye-catching patterns. Additionally, diagnostic plots, for example
the trace plot in Figure 13.3 (left), are often used. Further information about these
diagnostics is found in Lunn et al. (2012), where a short introduction to Bayesian statistics
and numerous examples of (Open)BUGS models can also be found.
Example 13.3. The following three R-Codes give a short glimpse into Markov chain
Monte Carlo methods with JAGS. R-Code 13.3 implements the normal-normal model with
y = 1, n = 1 and known variance, R-Code 13.4 the normal-normal model with n = 10
and known variance, and R-Code 13.5 the normal-normal-gamma model with n = 10 and
unknown variance. The corresponding Figures 13.3, 13.4 and 13.5 give the empirical and
exact densities of the posterior, prior, and likelihood. ♣
R-Code 13.3: JAGS sampler for normal-normal model, with n = 1. (See
Figure 13.3.)
require( rjags)
writeLines("model {
  y ~ dnorm( mu, 1)
  mu ~ dnorm( 0, 1/1.2) # here Precision = 1/Variance
}", con="jags01.txt")
jagsModel <- jags.model( 'jags01.txt', data=list( 'y'=1))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 7
##
## Initializing model
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)
plot( postSamples, density=FALSE, main="", auto.layout=FALSE)
plot( postSamples, trace=FALSE, main="", auto.layout=FALSE,
xlim=c(-2,3), ylim=c(0,.56))
rug( summary( postSamples )$quantiles, lwd=2, tick=.2)
y <- seq(-3,to=4, length=100)
lines( y, dnorm(y, 1, 1), col=3)
lines( y, dnorm(y, 0, sqrt(1.2) ), col=4)
lines( y, dnorm(y, 1/(1+1/1.2), sqrt(1/(1+1/1.2))), col=2)
Figure 13.3: Left: trace plot of the posterior µ | y = 1. Right: empirical
densities: MCMC based posterior (black), exact (red), prior (blue), likelihood
(green). (See R-Code 13.3.)
R-Code 13.4: JAGS sampler for normal-normal model, with n = 10. (See
Figure 13.4.)
set.seed( 4)
n <- 10
obs <- rnorm( n, 1, 1)
writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, 1)
  }
  mu ~ dnorm( 0, 1/2)
}", con="jags02.txt")
jagsModel <- jags.model( 'jags02.txt', data=list('y'=obs, 'n'=n),
quiet=T)
postSamples <- coda.samples(jagsModel, 'mu', n.iter=2000)
plot( postSamples, density=FALSE, main="", auto.layout=FALSE)
plot( postSamples, trace=FALSE, main="", auto.layout=FALSE,
xlim=c(-.5, 3), ylim=c(0, 1.3))
rug( obs, col=3)
rug( summary( postSamples )$quantiles, lwd=2, tick=.2)
y <- seq(-.7, to=3.5, length=100)
lines( y, dnorm( y, mean(obs), sqrt(1/n)), col=3)
lines( y, dnorm( y, 0, sqrt(2) ), col=4)
lines( y, dnorm( y, n*mean(obs)/(n+1/2), sqrt(1/(n+1/2))), col=2)
Figure 13.4: Left: trace plot of the posterior µ | y = y1, . . . , yn. Right:
empirical densities: MCMC based posterior (black), exact (red), prior (blue),
likelihood (green). (See R-Code 13.4.)
R-Code 13.5: JAGS sampler for normal-normal-gamma model, with n =
10. (See Figure 13.5.)
nu <- 0 # hyperparameters
tau2 <- 1.2
writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, kappa)
  }
  mu ~ dnorm( nu, 1/tau2)
  kappa ~ dgamma(1, .2)
}", con="jags03.txt")
jagsModel <- jags.model('jags03.txt', quiet=T,
data=list('y'=obs, 'n'=n, 'nu'=nu, 'tau2'=tau2))
postSamples <- coda.samples(jagsModel, c('mu','kappa'), n.iter=2000)
plot( postSamples[,"mu"], trace=FALSE, auto.layout=FALSE,
xlim=c(-1,3), ylim=c(0, 1.2))
y <- seq( -2, to=5, length=100)
lines( y, dnorm(y, mean(obs), sd(obs)/sqrt(n)), col=3)
lines( y, dnorm(y, 0, sqrt(1.2) ), col=4)
plot( postSamples[,"kappa"], trace=FALSE, auto.layout=FALSE)
y <- seq( 0, to=5, length=100)
lines( y, dgamma( y, 1, .2), type='l', col=4)
lines( y, dgamma( y, n/2+1, n*var(obs)/2 ), col=3)
# Some cleanup of the files `jags0?.txt' might be necessary
13.4 Bibliographic Remarks and Links
An alternative to JAGS is BUGS (Bayesian inference Using Gibbs Sampling) which is
distributed as two main versions: WinBUGS and OpenBUGS. Additionally, there is the
R-Interface package (R2OpenBUGS, Sturtz et al., 2005). Other possibilities are the Stan
or INLA engines with convenient user interfaces to R through rstan and INLA (Gelman
et al., 2015; Rue et al., 2009; Lindgren and Rue, 2015).
Figure 13.5: Empirical posterior densities of µ | y1, . . . , yn (left) and of κ =
1/σ2 | y1, . . . , yn (right), MCMC based (black), prior (blue), likelihood (green).
(See R-Code 13.5.)
The list of textbooks discussing MCMC is long. Held and Sabanés Bové (2014) presents
some basic and accessible ideas. Accessible examples of actual implementations can be
found in Kruschke (2015) (JAGS and Stan) and Kruschke (2010) (BUGS).
Bibliography
Abraham, B. and Ledolter, J. (2006). Introduction To Regression Modeling. Duxbury
applied series. Thomson Brooks/Cole. 105
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-building and Response Surfaces.
Wiley. 113
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A.
118
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle:
Pre-challenger prediction of failure. Journal of the American Statistical Association,
84, 945–957. 149
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression: Modelle, Methoden und An-
wendungen. Springer, 2 edition. 109
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods
and Applications. Springer. 109
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed
Effects and Nonparametric Regression Models. CRC Press. 151
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular
attention to the false discovery proportion. Statistical Methods in Medical Research,
17, 347–388. 58
Furrer, R. and Genton, M. G. (1999). Robust spatial data analysis of lake Geneva sed-
iments with S+SpatialStats. Systems Research and Information Science, 8, 257–272.
146
Gelman, A., Lee, D., and Guo, J. (2015). Stan: A probabilistic programming language
for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics,
40, 530–543. 168
Gemeinde Staldenried (1994). Verwaltungsrechnung 1993 Voranschlag 1994. 5
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Springer,
Heidelberg. 44, 158
Held, L. and Sabanés Bové, D. (2014). Applied Statistical Inference. Springer. 44, 47,
160, 169
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. John Wiley
& Sons. 87
Hüsler, J. and Zimmermann, H. (2010). Statistische Prinzipien für medizinische Projekte.
Huber, 5 edition. 33
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions.
Wiley-Interscience, 3rd edition. 32
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distribu-
tions, Vol. 1. Wiley-Interscience, 2nd edition. 32
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distribu-
tions, Vol. 2. Wiley-Interscience, 2nd edition. 32
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
Academic Press, first edition. 169
Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and
Stan. Academic Press/Elsevier, second edition. 169
Kupper, T., De Alencastro, L., Gatsigazi, R., Furrer, R., Grandjean, D., and Tarradellas,
J. (2008). Concentrations and specific loads of brominated flame retardants in sewage
sludge. Chemosphere, 71, 1173–1180. 130
Landesman, R., Aguero, O., Wilson, K., LaRussa, R., Campbell, W., and Penaloza, O.
(1965). The prophylactic use of chlorthalidone, a sulfonamide diuretic, in pregnancy.
J. Obstet. Gynaecol., 72, 1004–1010. 63
Leemis, L. M. and McQueston, J. T. (2008). Univariate distribution relationships. The
American Statistician, 62, 45–53. 32
Lindgren, F. and Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of
Statistical Software, 63, i19. 168
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2012). The BUGS
Book: A Practical Introduction to Bayesian Analysis. Texts in Statistical Science.
Chapman & Hall/CRC. 165
Olea, R. A. (1991). Geostatistical Glossary and Multilingual Dictionary. Oxford University
Press. 3
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 2008-11-14,
http://matrixcookbook.com. 99
Plagellat, C., Kupper, T., Furrer, R., de Alencastro, L. F., Grandjean, D., and Tarradellas,
J. (2006). Concentrations and specific loads of UV filters in sewage sludge originating
from a monitoring network in Switzerland. Chemosphere, 62, 915–925. 136
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models us-
ing Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed
Statistical Computing (DSC 2003). Vienna, Austria. 165
Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version
4-6. 165
Rudolf, M. and Kuhlisch, W. (2008). Biostatistik: Eine Einführung für Biowissenschaftler.
Pearson Studium. 48
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
Royal Statistical Society B, 71, 319–392. 168
Siegel, S. and Castellan Jr, N. J. (1988). Nonparametric Statistics for The Behavioral
Sciences. McGraw-Hill, 2nd edition. 87
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680. 2
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS
from R. Journal of Statistical Software, 12, 1–16. 168
SWISS Magazine (2011). Key environmental figures, 10/2011–01/2012, p. 107. http:
//www.swiss.com/web/EN/fly swiss/on board/Pages/swiss magazine.aspx. 7, 8
Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46, 234–240. 146
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press. 12
Tufte, E. R. (1990). Envisioning Information. Graphics Press. 12
Tufte, E. R. (1997a). Visual and Statistical Thinking: Displays of Evidence for Making
Decisions. Graphics Press. 12
Tufte, E. R. (1997b). Visual Explanations: Images and Quantities, Evidence and Narra-
tive. Graphics Press. 12
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. 12
Glossary
Throughout the document we tried to be consistent with standard mathematical notation.
We write random variables as uppercase letters (X, Y , . . . ), realizations as lower case
letters (x, y, . . . ), matrices as bold uppercase letters (Σ, X, . . . ), and vectors as bold
italics lowercase letters (x , β, . . . ). (The only slight confusion arises with random vectors
and matrices.)
The following glossary contains a non-exhaustive list of the most important notation.
Standard operators or products are not repeatedly explained.
:= Define the left hand side by the expression on the other side.
♣, ♦ End of example, end of definition.
∫, ∑, ∏ Integration, summation and product symbols. If there is no ambiguity,
we omit the domain in inline formulas.
∪, ∩ Union, intersection of sets or events.
θ̂ Estimator or estimate of the parameter θ.
x̄ Arithmetic mean of the sample: ∑_{i=1}^n xi/n.
|x| Absolute value of the scalar x.
||x || Norm of the vector x .
X^T Transpose of a matrix X.
x(i) Order statistics of the sample xi.
0, 1 Vector or matrix with components 0 respectively 1.
Cov(X, Y ) Covariance between two random variables X and Y .
Corr(X, Y ) Correlation between two random variables X and Y .
d/dx, ′, ∂/∂x Derivative and partial derivative with respect to x.
diag(A) Diagonal entries of an (n× n)-matrix A.
ε, εi Random variable or process, usually measurement error.
E(X) Expectation of the random variable X.
e, exp(·) Transcendental number e = 2.71828 18284, the exponential function.
In = I Identity matrix, I = (δij).
I_A Indicator function, taking the value one if A is true and zero otherwise.
lim Limit.
log(·) Logarithmic function to the base e.
max A, min A Maximum, minimum of the set A.
N, Nd Space of natural numbers, of d-vectors with natural elements.
N (µ, σ2) Normal (Gaussian) distribution with mean µ and variance σ2.
N_p(µ, Σ) p-dimensional normal distribution with mean vector µ and variance
matrix Σ.
ϕ(x) Gaussian probability density function ϕ(x) = (2π)^{−1/2} exp(−x²/2).
Φ(x) Gaussian cumulative distribution function Φ(x) = ∫_{−∞}^x ϕ(z) dz.
π Transcendental number π = 3.14159 26535.
P(A) Probability of the event A.
R, Rn, Rn×m Space of real numbers, real n-vectors and real (n×m)-matrices.
rank(A) The rank of a matrix A is defined as the number of linearly independent
rows (or columns) of A.
tr(A) Trace of a matrix A, defined as the sum of its diagonal elements.
Var(X) Variance of the random variable X.
Z, Zd Space of integers, of d-vectors with integer elements.
The following table contains the abbreviations of the statistical distributions (dof denotes
degrees of freedom).
N(µ, σ²) Gaussian or normal random variable with parameters µ and σ².
N(0, 1), zp Standard normal random variable, p-quantiles thereof.
Bin(n, p), bn,p,1−α Binomial random variable with n trials and success probability p,
(1 − α)-quantile thereof.
χ²ν, χ²ν,p Chi-squared distribution with ν dof, p-quantile thereof.
Tn, tn,p Student’s t-distribution with n dof, p-quantile thereof.
Fm,n, fm,n,p F-distribution with m and n dof, p-quantile thereof.
Utab(nx, ny; 1−α) (1 − α)-quantile of the distribution of the Wilcoxon rank sum statistic.
Wtab(n?; 1−α) (1 − α)-quantile of the distribution of the Wilcoxon signed rank statistic.
The following table contains the abbreviations of the statistical methods, properties and
quality measures.
EDA Exploratory data analysis.
dof Degrees of freedom.
MAD Median absolute deviation.
MAE Mean absolute error.
ML Maximum likelihood (ML estimator or ML estimation).
MM Method of moments.
MSE Mean squared error.
OLS, LS Ordinary least squares.
RMSE Root mean squared error.
WLS Weighted least squares.
Index of Statistical Tests
Comparing a sample mean with a theoretical value, 53
Comparing means from two independent samples, 54
Comparing means from two paired samples, 56
Comparing two variances, 59
Comparison of location for two independent samples, 81
Comparison of location of two dependent samples, 82
Comparison of observations with expected frequencies, 60
General remarks about statistical tests, 52
Performing a one-way analysis of variance, 129
Permutation test, 86
Sign test, 85
Test of a linear relationship, 108
Test of correlation, 104
Test of proportions, 70
Dataset Index
abrasion, 118, 119, 121, 144, 145
anscombe, 102, 103
binorm, 94, 102
chemosphere, 136
hardness, 105–107, 109
leman, 3, 146, 147
LifeCycleSavings, 121, 122, 124, 142, 143
orings, 149–151
portions, 46, 53, 55–57, 59–61
retardant, 130, 133
UVfilter, 136, 137, 139