STA 120
Introduction to Statistics
Script
Reinhard Furrer
and the Applied Statistics Group
Version July 5, 2018
Contents

Preface

1 Exploratory Data Analysis
   1.1 Types of Data
   1.2 Descriptive Statistics
   1.3 Univariate Data
   1.4 Multivariate Data
   1.5 Examples of Poor Plots
   1.6 Bibliographic Remarks and Links

2 Random Variables
   2.1 Basics of Probability Theory
   2.2 Discrete Distributions
   2.3 Continuous Distributions
   2.4 Expectation, Variance and Moments
   2.5 Independent Random Variables
   2.6 Common Discrete Distributions
   2.7 Common Continuous Distributions
   2.8 Functions of Random Variables
   2.9 Bibliographic Remarks and Links

3 Estimation
   3.1 Point Estimation
   3.2 Construction of Estimators
   3.3 Comparison of Estimators
   3.4 Interval Estimators
   3.5 Bibliographic Remarks and Links

4 Statistical Testing
   4.1 The General Concept of Significance Testing
   4.2 Hypothesis Testing
   4.3 Comparing Means
   4.4 Duality of Tests and Confidence Intervals
   4.5 Multiple Testing
   4.6 Additional Tests
   4.7 Bibliographic Remarks and Links

5 A Closer Look: Proportions
   5.1 Estimation
   5.2 Confidence Intervals
   5.3 Tests
   5.4 Comparison of Proportions
   5.5 Bibliographic Remarks and Links

6 Rank-Based Methods
   6.1 Robust Point Estimates
   6.2 Rank-Based Tests
   6.3 Other Tests
   6.4 Bibliographic Remarks and Links

7 Multivariate Normal Distribution
   7.1 Random Vectors
   7.2 Multivariate Normal Distribution
   7.3 Estimation
   7.4 Bibliographic Remarks and Links
   7.5 Appendix: Positive-Definite Matrices

8 Correlation and Simple Regression
   8.1 Correlation
   8.2 Simple Linear Regression
   8.3 Bibliographic Remarks and Links

9 Multiple Regression
   9.1 Model and Estimators
   9.2 Model Validation
   9.3 Information Criterion
   9.4 Examples

10 Analysis of Variance
   10.1 One-Way ANOVA
   10.2 Two-Way and Complete Two-Way ANOVA
   10.3 Analysis of Covariance
   10.4 Example

11 Extending the Linear Model
   11.1 Unifying the Linear Model
   11.2 Regression with Correlated Data
   11.3 Nonlinear Regression
   11.4 Bibliographic Remarks and Links

12 Bayesian Methods
   12.1 Motivating Example and Terminology
   12.2 Examples
   12.3 Choice and Effect of Prior Distribution
   12.4 Bibliographic Remarks and Links

13 A Closer Look: Monte Carlo Methods
   13.1 Monte Carlo Integration
   13.2 Rejection Sampling
   13.3 Gibbs Sampling
   13.4 Bibliographic Remarks and Links

References
Glossary
Index of Statistical Tests
Dataset Index
Preface
This document accompanies the lecture STA120 Introduction to Statistics that has been
given each spring semester since 2013. The lecture is given in the framework of the minor
in Applied Probability and Statistics (www.math.uzh.ch/aws) and comprises 14 weeks of
two hours of lecture and one hour of exercises per week.
As the lecture’s topics are structured on a week-by-week basis, the script contains
thirteen chapters, each covering “one” topic. Some of the chapters contain consolidations
or in-depth studies of previous chapters. The last week is dedicated to a recap/review of
the material.
I have thought long and hard about an optimal structure for this script. Let me quickly
summarize my thoughts. It is very important that the document has a structure that
is tailored to the content I cover in class each week. This inherently leads to 13 “chapters.”
Instead of covering linear models over four weeks, I framed the material in four seemingly
different chapters. This structure helps me to better frame the lectures: each week has
a start, a set of learning goals and a predetermined end.
So to speak, the script covers not 13 but essentially only three topics:
1. Background
2. Statistical foundations in a (i) frequentist setting and (ii) Bayesian setting
3. Linear Modeling
We will not cover these topics chronologically. This is not necessary and I have opted for a
smoother setting. For example, we do not cover the multivariate Gaussian distribution at
the beginning but just before we need it. This also allows for a recap of several univariate
concepts. We use a path illustrated in Figure 1.
In case you use this document outside the lecture, here are several alternative paths
through the chapters, with a minimal impact on concepts that have not been covered:
• Focusing on linear models: Chapters 1, 2, 7, 3, 4, 8, 9, 10, 11
• Good background in probability: You may omit Chapters 2 and 7
[Figure: flowchart from Start to End through three blocks. Background: Exploratory
Data Analysis, Random Variables, Multivariate Normal Distribution. Statistical
Foundations, split into frequentist (Estimation, Statistical Testing, Proportions,
Rank-Based Methods) and Bayesian (Bayesian Approach, Monte Carlo Methods). Linear
Modeling: Correlation and Simple Regression, Multiple Regression, Analysis of
Variance, Extending the Linear Model.]
Figure 1: Structure of the script
All the datasets that are not part of regular CRAN packages are available via the
url www.math.uzh.ch/furrer/download/sta120/. The script is equipped with appropriate
links that facilitate the download.
The lecture STA120 Introduction to Statistics formally requires the prerequisites MAT183
Stochastic for the Natural Sciences and MAT141 Linear Algebra for the Natural Sciences
or equivalent modules. For their content we refer to the corresponding course web pages
www.math.uzh.ch/fs17/mat183 and www.math.uzh.ch/hs17/mat141. It is possible to
successfully pass the lecture without having taken the aforementioned modules, though
some self-study is then necessary. This script and the accompanying exercises require
some calculus and linear algebra: differentiation, integration, matrix notation and basic
matrix operations, and the concept of solving a linear system of equations. We review
and summarize the relevant concepts of probability theory in Chapter 2.
Many have contributed to this document. A big thanks to all of them, especially
(alphabetically) Zofia Baranczuk, Julia Braun, Eva Furrer, Florian Gerber, Lisa Hofer,
Mattia Molinaro, Franziska Robmann, Leila Schuh and many more. Kelly
Reeve spent many hours improving my English. Without their help, you would not be
reading these lines. Yet, this document needs more than a polishing. Please let me know
of any necessary improvements; I highly appreciate all forms of contributions, be it
errata, examples, or text blocks. Contributions can be deposited directly in the following
Google Doc sheet.
Major errors that were corrected after the lecture of the corresponding semester are
listed at www.math.uzh.ch/furrer/download/sta120/errata18FS.txt. I try hard to ensure
that the pagination of the document does not change after the lecture.
Reinhard Furrer
February 2018
Chapter 1
Exploratory Data Analysis
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter01.R.
Assuming that the data collection process is completed, the “analysis” of the data
is one of the next steps. Foremost, the data needs to be loaded into an appropriate
software environment; here we use R. At the beginning of any formal statistical analysis,
an exploratory data analysis (EDA) is performed, in which observations or measured
values are depicted and described qualitatively and quantitatively. At the end of a
study, results are often summarized graphically, because it is generally easier to interpret
and understand graphics than values in a table. As such, graphical representation of data
is an essential part of statistical analysis, from start to finish. In this chapter, we consider
how different types of data can be visualized and thereby get an introduction to what we
mean by EDA.
1.1 Types of Data
There are several different types of data. An important distinction is whether data is
qualitative or quantitative in nature. Non-numerical data, often gathered from open-
ended responses or in audio-visual form, is considered qualitative. This data can be
measured on a nominal or ordinal scale. With the nominal scale, one can determine the
equality of observations, while with the ordinal scale, rank order can also be determined.
Whereas qualitative data consists of categories, quantitative data is numeric and math-
ematical operations can be performed with it. Quantitative data can be either discrete,
taking on only certain values like qualitative data but in the form of integers, or continu-
ous, taking on any value on the real number line. Quantitative data can be measured on
an interval or ratio scale. Unlike the ordinal scale, the interval scale is uniformly spaced.
The ratio scale is characterized by a meaningful absolute zero in addition to the char-
acteristics of all previous scales. Depending on the measurement scale, certain statistics
are appropriate. The measurement scales are classified according to Stevens (1946) and
summarized in Figure 1.1.
[Figure: the four scales with their admissible mathematical operators and statistical
measures —
Nominal: =, ≠; mode.
Ordinal: additionally <, >; median.
Interval: additionally +, −; arithmetic mean, standard deviation, range.
Ratio: additionally ×, ÷; geometric mean, coefficient of variation, studentized range.]
Figure 1.1: Types of scales according to Stevens (1946) and possible math-
ematical operations. The statistical measures are for a description of location
and spread.
Example 1.1. When measuring temperature in Kelvin (absolute zero, corresponding to
−273.15 °C), a statement such as “The temperature has increased by 20%” can be made. ♣
1.2 Descriptive Statistics
Another very basic summary of the data is its size, the number of variables, and the
number of missing values. Although this is rudimentary information, it is often forgotten
or neglected.
A statistic is a single measure of some attribute of the data, which gives a good first
impression of the distribution of the data. Typical statistics for the location parameter
include the (arithmetic) mean, truncated/trimmed mean, and median, and for the scale
parameter, variance, standard deviation, and interquartile range.
Given discrete data, the mode, and sometimes the median, can be determined. The mode
is the most frequent value of an empirical frequency distribution; in order to calculate the
mode, only the operations = and ≠ are necessary. The (empirical) median is the middle
value (or 50th percentile) of the sorted data and thus the operators < and > are also
necessary. Continuous data are first divided into categories (discretized/binned) and then
the mode and median can be determined.
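As a small sketch in R (with simulated die rolls standing in for real data), the mode and median can be computed as follows:

```r
# Simulated discrete data: 200 rolls of a fair die (illustrative values)
set.seed(1)
x <- sample(1:6, size=200, replace=TRUE)

# Mode: the most frequent value; only the operations = and != are needed
tab <- table(x)
mode.x <- as.integer(names(tab)[which.max(tab)])

# Median: the middle value of the sorted data; < and > are also needed
median.x <- median(x)

c(mode=mode.x, median=median.x)
```

Note that R’s built-in mode() returns the storage mode of an object, not the statistical mode, hence the detour via table().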
Example 1.2. In R-Code 1.1 several summary statistics are calculated for 293 observations
of mercury content in Lake Geneva sediments (www.math.uzh.ch/furrer/download/
sta120/lemanHg.csv). ♣
R-Code 1.1 A quantitative EDA of the mercury dataset (subset of the ‘leman’
dataset).
Hg <- read.csv('data/lemanHg.csv')$Hg
str( Hg)
## num [1:293] 0.17 0.21 0.06 0.24 0.35 0.14 0.08 0.26 0.23 0.18 ...
c( mean=mean( Hg), tr.mean=mean( Hg, trim=.1), median=median( Hg))
## mean tr.mean median
## 0.46177 0.43238 0.40000
c( var=var( Hg), sd=sd( Hg), iqr=IQR( Hg))
## var sd iqr
## 0.090146 0.300243 0.380000
summary( Hg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 0.250 0.400 0.462 0.630 1.770
sum( is.na( Hg)) # or any( is.na( Hg))
## [1] 0
tail( sort( Hg)) # largest values
## [1] 1.28 1.29 1.30 1.31 1.47 1.77
Another important aspect is the identification of outliers, which are defined (verbatim
from Olea (1991)): “In a sample, any of the few observations that are separated so far
in value from the remaining measurements that the questions arise whether they belong
to a different population, or that the sampling technique is faulty. The determination
that an observation is an outlier may be highly subjective, as there is no strict criteria
for deciding what is and what is not an outlier.” Graphical representations of the data
often help in the identification of outliers.
1.3 Classic Plots for Univariate Data
Univariate data is composed of one variable, or a single scalar component. Graphical
depiction of univariate data is usually accomplished with histograms, box plots, Q-Q
plots, or bar plots.
Histograms illustrate the frequency distribution of observations graphically and are
easy to construct and to interpret. However, the number of bins (categories to break the
data down into) is a subjective choice that affects the look of the histogram and several
valid rules of thumb exist for choosing the optimal number of bins. R-Code 1.2 and the
associated Figure 1.2 illustrate the construction and resulting histograms of the mercury
dataset. In one of the histograms, a “smoothed density” has been superimposed. Such
curves will be helpful when comparing the data with different statistical models, as we
will see in later chapters. The histograms show that the data is unimodal and right-skewed,
with no exceptional values.
R-Code 1.2 Different histograms (good and bad ones) for the mercury dataset. (See
Figure 1.2.)
histout <- hist( Hg)
hist( Hg, col=7, probability=TRUE, main="With 'smoothed density'")
lines( density( Hg))
hist( Hg, col=7, breaks=90, main="Too many bins")
hist( Hg, col=7, breaks=2, main="Too few bins")
str( histout[1:3]) # Contains essentially all information of histogram
## List of 3
## $ breaks : num [1:10] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8
## $ counts : int [1:9] 58 94 61 38 26 9 5 1 1
## $ density: num [1:9] 0.99 1.604 1.041 0.648 0.444 ...
[Figure with four panels: “Histogram of Hg” (default), “With 'smoothed density'”,
“Too many bins”, “Too few bins”.]
Figure 1.2: Histograms with various bin sizes. (See R-Code 1.2.)
When constructing histograms for discrete data (e.g., integer values), one has to be
careful with the binning. Often it is better to force the bins manually. To represent the
result of many dice tosses, it would be advisable to use hist( x, breaks=seq( from=0.5,
to=6.5, by=1)), or possibly bar plots, as explained towards the end of this section.
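This advice can be sketched as follows, with simulated dice tosses (the data is made up for illustration):

```r
# Simulated result of 600 tosses of a fair die
set.seed(2)
x <- sample(1:6, size=600, replace=TRUE)

# Default binning may lump adjacent integer values into a single bar
hist(x, main="Default breaks")

# Forcing one bin per face value centers each bar on its integer
hist(x, breaks=seq(from=0.5, to=6.5, by=1), main="One bin per value")
```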
A stem-and-leaf plot is similar to a histogram, however, this plot is rarely used today
(Figure 1.3).
Figure 1.3: Stem-and-leaf plot of residents of the municipality of Staldenried
as of 31.12.1993 (Gemeinde Staldenried, 1994).
A box plot is a graphical representation of five statistics of the frequency distribution
of observations: the minimum and maximum values, the lower and upper quartiles, and
the median. Advantages of the box plot include
• It is a quantitative representation of important statistics.
• Symmetry is quickly detected visually.
• Outliers are denoted.
R-Code 1.3 and Figure 1.4 illustrate the box plot for the mercury data. Due to the right
skewness of the data, there are several data points beyond the whiskers.
R-Code 1.3 Box plot. (See Figure 1.4.)
boxplot( Hg)
boxplot( Hg, col="LightBlue", notch=TRUE, ylab="Hg",
outlty=1, outpch='', outcol=2)
quantile( Hg, c(0.25, 0.75)) # compare with summary( Hg)
## 25% 75%
## 0.25 0.63
IQR( Hg)
## [1] 0.38
quantile(Hg, 0.75) + 1.5 * IQR( Hg)
## 75%
## 1.2
Hg[ quantile(Hg, 0.75) + 1.5 * IQR( Hg) < Hg]
## [1] 1.25 1.30 1.47 1.31 1.28 1.29 1.77
[Figure with two panels, y-axis Hg.]
Figure 1.4: Box plots: simple and notched versions. (See R-Code 1.3.)
A quantile-quantile plot (Q-Q plot) is used to visually compare empirical data quantiles
with the quantiles of a theoretical distribution (we will talk more about “theoretical
distributions” in the next chapter). The ordered values are compared with the i/(n+ 1)-
quantiles. In practice, some software uses (i− a)/(n+ 1− 2a), a ∈ [0, 1]. R-Code 1.4 and
Figure 1.5 illustrate a Q-Q plot for the mercury dataset by comparing it to a normal
distribution and a (so-called) exponential distribution. In the case of a good fit, the points
align almost on a straight line. To guide the eye, one often adds a line to the
plot.
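The construction can be sketched by hand; the following code mimics the essence of qqnorm() using the plotting positions i/(n + 1), with simulated data standing in for the mercury values:

```r
# Simulated right-skewed data standing in for the mercury values
set.seed(3)
y <- rexp(293, rate=2.1)

n <- length(y)
p <- (1:n)/(n + 1)               # plotting positions i/(n+1)
plot(qnorm(p), sort(y),          # theoretical vs. ordered sample quantiles
     xlab="Theoretical Quantiles", ylab="Sample Quantiles")
```

R’s own ppoints(n) implements the variant (i − a)/(n + 1 − 2a), with a = 3/8 for n ≤ 10 and a = 1/2 otherwise.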
R-Code 1.4 Q-Q plot. (See Figure 1.5.)
qqnorm( Hg)
qqline( Hg, col=2)
qqplot( qexp( ppoints( 293), rate=2.1), Hg)
qqline( Hg, distribution=function(p) qexp( p, rate=2.1),
prob=c(0.1, 0.6), col=2)
[Figure with two panels: “Normal Q−Q Plot” (theoretical quantiles vs. sample
quantiles) and Hg vs. qexp(ppoints(293), rate = 2.1).]
Figure 1.5: Q-Q plots using the normal distribution (left) and exponential
distribution (right). (See R-Code 1.4.)
The frequency distribution of discrete data is often displayed with a bar plot. The
height of the bars is proportional to the frequency of the corresponding value. The
German language discriminates between bar plots with vertical and horizontal orientation
(‘Säulendiagramm’ and ‘Balkendiagramm’).
R-Code 1.5 and Figure 1.6 illustrate bar plots with data giving aggregated CO2 emis-
sions for the year 2005 (SWISS Magazine, 2011).
Do not use pie charts unless absolutely necessary. Pie charts are often difficult to
read. When slices are similar in size, it is nearly impossible to distinguish which is larger.
Bar plots allow an easier comparison.
R-Code 1.5 Emissions sources for the year 2005 as presented by the SWISS Magazine
10/2011-01/2012, page 107 (SWISS Magazine, 2011). (See Figure 1.6.)
dat <- c(2, 15, 16, 32, 25, 10) # see Figure 1.9
emissionsource <- c('Air', 'Transp', 'Manufac', 'Electr',
'Deforest', 'Other')
barplot( dat, names=emissionsource, ylab="Percent", las=2)
barplot( cbind('2005'=dat), col=c(2,3,4,5,6,7), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', xlim=c(0.2,4))
[Figure with two panels showing the emission sources Air, Transp, Manufac, Electr,
Deforest, Other in percent.]
Figure 1.6: Bar plots: juxtaposed bars (left), stacked (right). (See R-Code 1.5.)
1.4 Visualizing Multivariate Data
Multivariate data means that two or more variables are of interest and collected for each
observation. Visualization of multivariate data is often accomplished with scatter plots
(with plot(x,y) or pairs(...)) so that the relationship between the variables may be
illustrated. For three variables, an interactive visualization based on the package rgl
might be helpful.
R-Code 1.6 and Figure 1.7 illustrate the pairs plot for mercury, cadmium and zinc
content in sediment samples taken in Lake Geneva. For some of the samples, the content
is high for all three metals.
In scatter plots, “guide-the-eye” lines are often included. In such situations, some care
is needed, as there is a perception asymmetry between y versus x and x versus y. We
will discuss this further in Chapter 8.
R-Code 1.6 Mercury (multivariate), data available at www.math.uzh.ch/furrer/
download/sta120/lemanHgCdZn.csv.
metal <- read.csv( 'data/lemanHgCdZn.csv')
str( metal)
## 'data.frame': 293 obs. of 3 variables:
## $ Hg: num 0.17 0.21 0.06 0.24 0.35 0.14 0.08 0.26 0.23 0.18 ...
## $ Cd: num 0.23 0.37 0.14 0.3 0.56 0.3 0.17 0.44 0.39 0.26 ...
## $ Zn: num 72.2 98.2 81.6 131 160 125 48 131 127 106 ...
# More EDA should be here!!!
pairs( metal, gap=0, main='')
# top left is as plot( Hg~Cd, data=metal)
[Figure: pairwise scatter plots of Hg, Cd and Zn.]
Figure 1.7: Scatter plot matrix. (See R-Code 1.6.)
[Figure with two panels comparing the SWISS and c2es.org emission figures, sources
Air, Transp, Manufac, Electr, Deforest, Other in percent.]
Figure 1.8: Bar plots for two variables: stacked (left), grouped (right). (See
R-Code 1.7.)
In the case of several frequency distributions, bar plots, either stacked or grouped,
may also be used in an intuitive way. See R-Code 1.7 and Figure 1.8 for two slightly
different partitions of emission sources.
R-Code 1.7 Emissions sources for the year 2005 from www.c2es.org/facts-
figures/international-emissions/sector (approximate values). (See Figure 1.8.)
dat2 <- c(2, 10, 12, 28, 26, 22) # source c2es.org
mat <- cbind( SWISS=dat, c2es.org=dat2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(0.2,5), legend=emissionsource,
args.legend=list(bty='n') ,ylab='Percent', las=2)
barplot(mat, col=c(2,3,4,5,6,7), xlim=c(1,30), legend=emissionsource,
args.legend=list(bty='n'), ylab='Percent', beside=TRUE, las=2)
1.5 Examples of Poor Plots
Figures 1.9 and 1.10 are examples of faulty plots and charts. Additional examples from
renowned academic journals can be found at www.biostat.wisc.edu/~kbroman/toptenworstgraphs/.
The ‘discussion’ links point out the mistakes and give recommendations
for improvement.
Figure 1.9: SWISS Magazine 10/2011-01/2012, 107.
Figure 1.10: Bad example (above) and improved but still not ideal graphic
(below). Figures from university documents.
1.6 Bibliographic Remarks and Links
The term exploratory data analysis (EDA) was coined by Tukey (1977). The series of
books (Tufte, 1983, 1990, 1997b,a) is a visual pleasure to read and gives a plethora of
useful information on visualization and graphical communication.
Chapter 2
Random Variables
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter02.R.
2.1 Basics of Probability Theory
In order to define the term random variable, some basic principles are needed first.
The set of all possible outcomes of an experiment is called the sample space, denoted
by Ω. Each outcome of an experiment ω ∈ Ω is called an elementary event. A subset of
the sample space Ω is called an event, denoted by A ⊂ Ω.
Informally, a probability P can be considered the value of a function defined for an
event of the sample space and assuming values in the interval [0, 1], that is, P(A) ∈ [0, 1].
In order to introduce probability theory formally, one needs some technical terms (σ-
algebra, measure space, . . . ). However, the axiomatic structure of A. Kolmogorov can
also be described accessibly as follows and is sufficient as a basis for our purposes.
A probability measure must satisfy the following axioms:
i) 0 ≤ P(A) ≤ 1, for every event A,
ii) P(Ω) = 1,
iii) P(∪i Ai) = Σi P(Ai), for Ai ∩ Aj = ∅, i ≠ j.
Informally, a probability function P assigns a value in [0, 1], i.e., the probability, to
each event of the sample space, subject to:
i) the probability of an event is never smaller than 0 or greater than 1,
ii) the probability of the whole sample space is 1,
iii) the probability of several events is equal to the sum of the individual probabilities,
if the events are mutually exclusive.
Probabilities are often visualized with Venn diagrams (Figure 2.1), which clearly and
intuitively illustrate more complex facts, such as:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),   (2.1)
P(A | B) = P(A ∩ B) / P(B).   (2.2)
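Identity (2.1) can be checked empirically with relative frequencies; the sketch below simulates die rolls, with events A and B chosen purely for illustration:

```r
# Simulate many rolls of a fair die
set.seed(4)
omega <- sample(1:6, size=1e5, replace=TRUE)
A <- omega <= 4          # event A = {1, 2, 3, 4}
B <- omega %% 2 == 0     # event B = {2, 4, 6}

# Relative frequencies approximate the corresponding probabilities
pA <- mean(A); pB <- mean(B); pAB <- mean(A & B)

# P(A u B) = P(A) + P(B) - P(A n B); theoretically 4/6 + 3/6 - 2/6 = 5/6
mean(A | B)
pA + pB - pAB
```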
We consider a random variable as a function that assigns values to the outcomes
(events) of a random experiment; these values, or values in an interval, are assumed
with certain probabilities. They are called realizations of the random variable.
Figure 2.1: Venn diagram of events A, B and C in the sample space Ω
The following definition gives a (unique) characterization of random variables. In
subsequent sections, we will see additional characterizations. These, however, will depend
on what type of values the random variable can take.
Definition 2.1. The distribution function (cumulative distribution function, cdf) of a
random variable X is
F (x) = FX(x) = P(X ≤ x), for all x. (2.3)
♦
Property 2.1. A distribution function FX(x) is
i) monotonically increasing, i.e., for x < y, FX(x) ≤ FX(y);
ii) right-continuous, i.e., lim_{ε↓0} FX(x + ε) = FX(x), for all x ∈ R;
iii) normalized, i.e., lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.
Random variables are denoted with uppercase letters (e.g. X, Y ), while realizations
are denoted by the corresponding lowercase letters (x, y). This means that the theoretical
concept, or the random variable as a function, is denoted by uppercase letters. Actual
values or data, for example the columns in your dataset, would be denoted with lower-
case letters. For further characterizations of random variables, we need to differentiate
according to the sample space of the random variables.
2.2 Discrete Distributions
A random variable is called discrete when it can assume only a finite or countably infinite
number of values, as illustrated in the following two examples.
Example 2.1. Let X be the sum of the roll of two dice. The random variable X assumes
the values 2, 3, . . . , 12. Figure 2.2 illustrates the probabilities and distribution function.
The distribution function (as for all discrete random variables) is piece-wise constant with
jumps equal to the probability of that value. ♣
Example 2.2. A boy practices free throws, i.e., foul shots to the basket standing at a
distance of 15 ft to the board. Let the random variable X be the number of throws
that are necessary until the boy succeeds. Theoretically, there is no upper bound on this
number. Hence X can take the values 1, 2, . . . . ♣
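This example can be simulated; assuming, purely as an illustration, a success probability of 0.4 per throw (this value is not from the text), the number of throws until the first success is one plus a geometric random variable in R’s parametrization:

```r
set.seed(5)
p <- 0.4                        # assumed per-throw success probability
x <- rgeom(1e4, prob=p) + 1     # rgeom counts failures before the success

mean(x)        # close to the theoretical expectation 1/p = 2.5
max(x)         # no fixed upper bound: large values occur occasionally
```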
Another way of describing discrete random variables is the probability mass function,
defined as follows.
Definition 2.2. The probability mass function (pmf) of a discrete random variable X is
defined by fX(x) = P(X = x). ♦
In other words, the pmf gives the probability that the random variable takes one single
value, while the cdf gives the probability that the random variable takes that or any
smaller value.
Property 2.2. Let X be a discrete random variable with mass function fX(x) and dis-
tribution function FX(x). Then:
i) The probability mass function satisfies fX(x) ≥ 0 for all x ∈ R.
ii) Σi fX(xi) = 1.
iii) The values fX(xi) > 0 are the “jumps” in xi of FX(x).
iv) FX(xi) = Σ_{k: xk ≤ xi} fX(xk).
The last two points show that there is a one-to-one relation (also called a bijection)
between the cdf and pmf. Given one, we can construct the other.
Figure 2.2 illustrates the pmf and cdf of the random variable X as given in Example 2.1.
The jump locations and sizes (discontinuities) of the cdf correspond to the probabilities
given in the left panel. Notice that we have emphasized the right continuity of the cdf
(see Property 2.1.ii) with the additional dot.
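For Example 2.1, the bijection can be made explicit in R: cumulative sums of the probabilities yield the cdf at the jump locations, and the jump sizes recover the pmf:

```r
x <- 2:12
p <- c(1:6, 5:1)/36        # pmf of the sum of two dice
Fx <- cumsum(p)            # cdf evaluated at the jump locations

# Recover the pmf as the jump sizes of the cdf
p.back <- diff(c(0, Fx))
all.equal(p, p.back)       # TRUE
```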
R-Code 2.1 Probabilities and distribution function of X as given in Example 2.1.
(See Figure 2.2.)
x <- 2:12
p <- c(1:6,5:1)/36
plot( x, p, type='h', ylim=c(0, .2),
xlab=expression(x[i]), ylab=expression(p[i]))
points( x, p, pch = 19)
plot.ecdf( outer(1:6,1:6,"+"), ylab=expression(F[X](x)), main='')
[Figure with two panels over x = 2, . . . , 12: probabilities pi (left) and FX(x) (right).]
Figure 2.2: Probability mass function (left) and cumulative distribution func-
tion (right) of X = the sum of the roll of two dice. (See R-Code 2.1.)
2.3 Continuous Distributions
A random variable is called continuous if it can (theoretically) assume any value from one
or several intervals. This means that the number of possible values within the sample space
is infinite, and it is therefore impossible to assign positive probabilities to all the (elementary)
events. In other words, given an infinite number of possible outcomes, the probability of
one particular value being the outcome is zero. For this reason, we need to consider
outcomes that are contained in some interval. Hence, the probability is described by an
integral, as an area under the probability density function, which is formally defined as
follows.
Definition 2.3. The probability density function (density function, pdf) fX(x), or density
for short, of a continuous random variable X is defined by
P(a < X ≤ b) = ∫_a^b fX(x) dx,  a < b.   (2.4)
♦
The density function does not directly give a probability and thus cannot be compared
to the probability mass function. The following properties are nevertheless similar to
Property 2.2.
Property 2.3. Let X be a continuous random variable with density function fX(x) and
distribution function FX(x). Then:
i) The density function satisfies fX(x) ≥ 0 for all x ∈ R and fX(x) is continuous
almost everywhere.
ii) ∫_{−∞}^{∞} fX(x) dx = 1.
iii) fX(x) = F′X(x) = dFX(x)/dx.
iv) FX(x) = ∫_{−∞}^{x} fX(y) dy.
v) The cumulative distribution function FX(x) is continuous everywhere.
vi) P(X = x) = 0.
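Properties iii and iv can be checked numerically; the sketch below uses the standard normal density (a distribution introduced formally later) as an example:

```r
# P(a < X <= b) as the area under the density, for a standard normal
a <- -1; b <- 1
area <- integrate(dnorm, lower=a, upper=b)$value

# The same probability via the cdf: F(b) - F(a)
c(integral=area, cdf.diff=pnorm(b) - pnorm(a))
```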
As given by Property 2.3.iii and iv, there is again a bijection between the density
function and the cumulative distribution function: if we know one, we can construct the
other. Actually, there is a third characterization of random variables, called the quantile
function, which is essentially the inverse of the cdf. That means that we are interested in
values x for which F(x) = p. For (strictly) monotone cumulative distribution functions,
the quantile function is the inverse of the distribution function: QX(p) = FX^(−1)(p).
To be precise, we have to be slightly more careful, as the inverse might not always
exist, e.g., if the cdf has a plateau. The quantile function returns the minimum value of
x from amongst all those values with P(X ≤ x) ≥ p. Formally, this is defined as follows.
Definition 2.4. The quantile function Q_X(p) of a random variable X with cdf F_X(x) is
defined by

Q_X(p) = inf{x | F_X(x) ≥ p},   0 < p < 1.   (2.5)

♦
This latter definition is general, such that we do not have to distinguish between
continuous or discrete random variables. In any case, the quantile function should be
understood as the generalized inverse function of the cdf.
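In R, the built-in q* functions implement exactly this generalized inverse; a small base-R check:

```r
# For a strictly increasing cdf, the quantile function is the plain inverse:
x <- c(-1.5, 0, 2)
stopifnot(isTRUE(all.equal(qnorm(pnorm(x)), x)))   # Q_X(F_X(x)) = x

# For a discrete cdf with plateaus, qbinom returns the smallest k
# with P(X <= k) >= p, i.e., the generalized inverse of Definition 2.4.
stopifnot(pbinom(4, size = 10, prob = 0.5) < 0.5,
          pbinom(5, size = 10, prob = 0.5) >= 0.5,
          qbinom(0.5, size = 10, prob = 0.5) == 5)
```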
Example 2.3. The continuous uniform distribution U(a, b) is defined by a constant density
function over the interval [a, b], a < b, i.e.,

f(x) = 1/(b − a)  if a ≤ x ≤ b,   and   f(x) = 0  otherwise.

The quantile function is Q_X(p) = a + p(b − a) for 0 < p < 1. Figure 2.3 shows the density
and cumulative distribution function of the uniform distribution U(0, 1). ♣
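In R, the quantile function of the uniform distribution is qunif; it matches Q_X(p) = a + p(b − a) (interval bounds below chosen for illustration):

```r
a <- 2; b <- 5                      # illustrative interval bounds
p <- c(0.1, 0.5, 0.9)
stopifnot(isTRUE(all.equal(qunif(p, min = a, max = b), a + p * (b - a))))
```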
R-Code 2.2 Density and distribution function of a uniform distribution. (See Fig-
ure 2.3.)
plot( c(-1,0,NA,0,1,NA,1,2), c(0,0,NA,1,1,NA,0,0), type='l',
xlab='x', ylab=expression(f[X](x)))
plot( c(-1,0,1,2), c(0,0,1,1), type='l',
xlab='x', ylab=expression(F[X](x)))
Figure 2.3: Density and distribution function of the uniform distribution
U(0, 1). (See R-Code 2.2.)
2.4 Expectation, Variance and Moments
The density, cumulative distribution function or quantile function each uniquely characterize
a random variable. Often we do not require such a complete description and "summary"
values are sufficient. We introduce a measure of location and one of scale (more will be
seen in Chapter 6).
Definition 2.5. The expectation of a discrete random variable X is defined by

E(X) = Σ_i x_i P(X = x_i).   (2.6)

The expectation of a continuous random variable X is defined by

E(X) = ∫_R x f_X(x) dx,   (2.7)

where f_X(x) denotes the density of X. ♦
The expectation is "linked" to the average (or empirical mean, mean) if we have a set
of realizations thought to be from the particular random variable. Similarly, the variance,
the expectation of the squared deviation of a random variable from its expectation,
is "linked" to the empirical variance (var). This link will be formalized in later
chapters.
Definition 2.6. The variance of X is defined by

Var(X) = E((X − E(X))²).   (2.8)

♦
Property 2.4. For an "arbitrary" real function g we have:

i) E(g(X)) = Σ_i g(x_i) P(X = x_i), if X is discrete,
   E(g(X)) = ∫_R g(x) f_X(x) dx, if X is continuous.

Regardless of whether X is discrete or continuous, the following rules apply:

ii) Var(X) = E(X²) − (E(X))²,

iii) E(a + bX) = a + b E(X), for given constants a and b,

iv) Var(a + bX) = b² Var(X), for given constants a and b,

v) E(aX + bY) = a E(X) + b E(Y), for a random variable Y.
The second-to-last property seems somewhat surprising. But starting from the
definition of the variance, one quickly realizes that the variance is not a linear operator:

Var(a + bX) = E((a + bX − E(a + bX))²) = E((a + bX − (a + b E(X)))²)
            = E(b²(X − E(X))²) = b² Var(X).   (2.9)
Example 2.4. We consider again the setting of Example 2.1, and a straightforward calculation
shows that

E(X) = Σ_{i=2}^{12} i P(X = i) = 7, by equation (2.6),   (2.10)
     = 2 Σ_{i=1}^{6} i · 1/6 = 2 · 7/2 = 7, by using Property 2.4.v first.   (2.11)

♣
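The moments of the two-dice sum can be verified numerically from the probability mass function of R-Code 2.1:

```r
x <- 2:12
p <- c(1:6, 5:1) / 36                       # pmf of the sum of two dice
stopifnot(isTRUE(all.equal(sum(x * p), 7)))   # E(X), equation (2.6)
v <- sum(x^2 * p) - sum(x * p)^2              # Var(X) via Property 2.4.ii
stopifnot(isTRUE(all.equal(v, 35/6)))         # Var(X) = 35/6
```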
2.5 Independent Random Variables
Often we have not only one random variable but many of them. Here we see a first way
to characterize a set of random variables.
Definition 2.7. Two random variables X and Y are independent if

P({X ∈ A} ∩ {Y ∈ B}) = P(X ∈ A) P(Y ∈ B).   (2.12)

The random variables X_1, …, X_n are independent if

P(∩_{i=1}^n {X_i ∈ A_i}) = Π_{i=1}^n P(X_i ∈ A_i).   (2.13)

♦
The definition also implies that the joint density and joint cumulative distribution are
simply the products of the individual ones, also called the marginal ones.
We will often use many independent random variables with a common distribution
function.
Definition 2.8. A random sample X_1, …, X_n consists of n independent random variables
with the same distribution F. We write X_1, …, X_n iid∼ F, where iid stands for
"independent and identically distributed." The number n of random variables is called
the sample size. ♦
The iid assumption is crucial and relaxing it has severe implications for statistical
modeling. Luckily, independence also implies a simple formula for the variance
of the sum of two or more random variables.
Property 2.5. Let X and Y be two independent random variables. Then

i) Var(aX + bY) = a² Var(X) + b² Var(Y).

Let X_1, …, X_n iid∼ F with E(X_1) = µ and Var(X_1) = σ². Denote X̄ = (1/n) Σ_{i=1}^n X_i. Then

ii) E(X̄) = E((1/n) Σ_{i=1}^n X_i) = E(X_1) = µ,

iii) Var(X̄) = Var((1/n) Σ_{i=1}^n X_i) = (1/n) Var(X_1) = σ²/n.
The latter two properties will be used when we investigate statistical properties of the
sample mean, i.e., linking the empirical mean x̄ = (1/n) Σ_{i=1}^n x_i with the random
sample mean X̄ = (1/n) Σ_{i=1}^n X_i.
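A small simulation sketch of these two properties for the sample mean (sample size and σ chosen for illustration):

```r
set.seed(1)
n <- 20; sigma <- 2
# 10000 realizations of the sample mean of n iid N(0, sigma^2) variables
xbar <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
stopifnot(abs(mean(xbar)) < 0.05)                # E(Xbar) = mu = 0
stopifnot(abs(var(xbar) - sigma^2 / n) < 0.02)   # Var(Xbar) = sigma^2 / n
```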
2.6 Common Discrete Distributions
There is of course no limit on the number of different random variables. In practice,
we can often reduce our framework to a few common distributions.
2.6.1 Binomial Distribution
A random experiment with exactly two possible outcomes (for example: success/failure,
heads/tails, male/female) is called a Bernoulli trial. For simplicity, we code the sample
space with '1' (success) and '0' (failure):

P(X = 1) = p,   P(X = 0) = 1 − p,   0 < p < 1,   (2.14)
where the cases p = 0 and p = 1 are typically not relevant. Thus
E(X) = p, Var(X) = p(1− p). (2.15)
If a Bernoulli experiment is repeated n times (resulting in an n-tuple of zeros and ones),
the random variable X = "number of successes" is intuitive. The distribution of X is
called the binomial distribution, denoted by X ∼ Bin(n, p), and the following applies:

P(X = k) = (n choose k) p^k (1 − p)^{n−k},   0 < p < 1,   k = 0, 1, …, n,   (2.16)

and

E(X) = np,   Var(X) = np(1 − p).   (2.17)
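The binomial moments (2.17) can be checked numerically with dbinom (parameter values below are illustrative):

```r
n <- 10; p <- 0.3; k <- 0:n            # illustrative parameter values
pk <- dbinom(k, size = n, prob = p)    # pmf (2.16)
stopifnot(isTRUE(all.equal(sum(pk), 1)))
stopifnot(isTRUE(all.equal(sum(k * pk), n * p)))                          # E(X) = np
stopifnot(isTRUE(all.equal(sum(k^2 * pk) - (n * p)^2, n * p * (1 - p))))  # Var(X)
```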
2.6.2 Poisson Distribution
The Poisson distribution gives the probability of a given number of events occurring in
a fixed interval of time if these events occur at a known and constant rate over time.
One way to formally introduce such a random variable is by giving its probability mass
function. The exact form thereof is not very important.
Definition 2.9. A random variable X whose probability mass function is given by

P(X = k) = λ^k/k! · exp(−λ),   λ > 0,   k = 0, 1, …,   (2.18)

is said to follow a Poisson distribution, denoted by X ∼ Pois(λ). Further,

E(X) = λ,   Var(X) = λ.   (2.19)

♦
The Poisson distribution is also a good approximation for the binomial distribution
with large n and small p.
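A quick numerical illustration of this approximation, with the illustrative values n = 1000 and p = 0.002 (so λ = np = 2):

```r
n <- 1000; p <- 0.002; k <- 0:10
# binomial pmf vs its Poisson approximation with lambda = n*p
stopifnot(max(abs(dbinom(k, n, p) - dpois(k, lambda = n * p))) < 0.01)
```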
2.7 Common Continuous Distributions
Example 2.3 introduced a first common continuous distribution. In this section we first
see the most commonly used one, the normal distribution, and then some distributions
that are derived as functions of normally distributed random variables. We will
encounter these again later, for example, the t-distribution in Chapter 4 and the
F-distribution in Chapter 9.
2.7.1 The Normal Distribution
The normal or Gaussian distribution is probably the best-known distribution, having the
omnipresent "bell-shaped" density. Its importance is mainly due to the fact that the sum
of many random variables (under iid or more general settings) is distributed as a normal
random variable. This is due to the celebrated central limit theorem. As in the case of a
Poisson random variable, we define the normal distribution by giving its density.
Definition 2.10. The random variable X is said to be normally distributed if

F_X(x) = ∫_{−∞}^{x} f_X(y) dy   (2.20)

with density function

f(x) = f_X(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²)),   (2.21)

for all x (µ ∈ R, σ > 0). We denote this by X ∼ N(µ, σ²).
The random variable Z = (X − µ)/σ (the so-called z-transformation) is standard
normal and its density and distribution function are usually denoted by ϕ(z) and Φ(z),
respectively. ♦
While the exact form of the density (2.21) is not important, recognizing it will be very
useful. In particular, for a standard normal random variable, the density is proportional
to exp(−z²/2).
The following property is essential and will be used consistently throughout this work.
We justify the first statement later in this chapter. The second is a result of the particular
form of the density.
Property 2.6. i) Let X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1). Conversely, if
Z ∼ N(0, 1), then σZ + µ ∼ N(µ, σ²).

ii) Let X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²) be independent, then
aX_1 + bX_2 ∼ N(aµ_1 + bµ_2, a²σ_1² + b²σ_2²).
The cumulative distribution function Φ has no closed form and the corresponding
probabilities must be determined numerically. In the past, so-called "standard tables"
were often used and included in statistics books; Table 2.1 gives an excerpt of such a
table. Nowadays even "simple" pocket calculators have the corresponding functions. It is
probably worthwhile to remember Φ(1) ≈ 84%, Φ(2) ≈ 98%, Φ(3) ≈ 100%, as well as
Φ(1.64) ≈ 95% and Φ(1.96) ≈ 97.5%.
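These values are easily verified with pnorm:

```r
stopifnot(round(pnorm(1), 2) == 0.84)
stopifnot(round(pnorm(2), 2) == 0.98)
stopifnot(round(pnorm(1.64), 2) == 0.95)
stopifnot(round(pnorm(1.96), 3) == 0.975)
stopifnot(isTRUE(all.equal(pnorm(-2), 1 - pnorm(2))))  # symmetry of the density
```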
Table 2.1: Probabilities of the standard normal distribution. The table gives
the value of Φ(zp). For example, Φ(0.2 + 0.04) = 0.595.
z_p   0.00  0.02  0.04  0.06  0.08
0.0  0.500 0.508 0.516 0.524 0.532
0.1  0.540 0.548 0.556 0.564 0.571
0.2  0.579 0.587 0.595 0.603 0.610
0.3  0.618 0.626 0.633 0.641 0.648
0.4  0.655 0.663 0.670 0.677 0.684
0.5  0.691 0.698 0.705 0.712 0.719
 ...
1.0  0.841 0.846 0.851 0.855 0.860
 ...
1.6  0.945 0.947 0.949 0.952 0.954
1.7  0.955 0.957 0.959 0.961 0.962
1.8  0.964 0.966 0.967 0.969 0.970
1.9  0.971 0.973 0.974 0.975 0.976
2.0  0.977 0.978 0.979 0.980 0.981
 ...
3.0  0.999 0.999 ...
R-Code 2.3 Calculation of the “z-table”, see Table 2.1.
y <- seq( 0, by=.02, length=5)
x <- c( seq( 0, by=.1, to=.5), 1, seq(1.6, by=.1, to=2), 3)
round( pnorm( outer( x, y, "+")), 3)
Example 2.5. Let X ∼ N(4, 9). Then

i) P(X ≤ −2) = P((X − 4)/3 ≤ (−2 − 4)/3) = P(Z ≤ −2) = Φ(−2)
   = 1 − Φ(2) = 1 − 0.977 = 0.023.

ii) P(|X − 3| > 2) = 1 − P(|X − 3| ≤ 2) = 1 − P(−2 ≤ X − 3 ≤ 2)
   = 1 − (P(X ≤ 5) − P(X ≤ 1)) = 1 − Φ((5 − 4)/3) + Φ((1 − 4)/3) ≈ 0.5281. ♣
Using a central limit theorem argument, we can show that the distribution of a binomial
random variable X ∼ Bin(n, p) converges to that of a normal random variable
as n → ∞. Thus, the normal distribution N(np, np(1 − p)) can be used as an
approximation for the binomial distribution Bin(n, p). For the approximation,
n should be larger than 30 for p ≈ 0.5. For p closer to 0 or 1, n needs to be much larger.
R-Code 2.4 Density, distribution, and quantile functions of the standard normal
distribution. (See Figure 2.4.)
plot( dnorm, -2, 2, ylim=c(-1,2))
abline( c(0, 1), h=c(0,1), col='gray') # diag and horizontal lines
plot( pnorm, -2, 2, col=3, add=TRUE)
plot( qnorm, 0, 1, col=4, add=TRUE)
Figure 2.4: Density (black), distribution function (green), and quantile func-
tion (blue) of the standard normal distribution. (See R-Code 2.4.)
To calculate probabilities, we often apply a so-called continuity correction, as illustrated
in the following example.
Example 2.6. Let X ∼ Bin(30, 0.5). Then P(X ≤ 10) = 0.049, "exactly". However,

P(X ≤ 10) ≈ P((X − np)/√(np(1 − p)) ≤ (10 − np)/√(np(1 − p)))
          = Φ((10 − 15)/√(30/4)) = 0.034,   (2.22)

P(X ≤ 10) ≈ P((X + 0.5 − np)/√(np(1 − p)) ≤ (10 + 0.5 − np)/√(np(1 − p)))
          = Φ((10.5 − 15)/√(30/4)) = 0.05.   (2.23)
♣
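The three probabilities of Example 2.6 can be reproduced with pbinom and pnorm:

```r
n <- 30; p <- 0.5
exact <- pbinom(10, size = n, prob = p)                  # "exact" binomial cdf
appr  <- pnorm((10   - n * p) / sqrt(n * p * (1 - p)))   # without correction
apprc <- pnorm((10.5 - n * p) / sqrt(n * p * (1 - p)))   # with continuity correction
stopifnot(round(exact, 3) == 0.049,
          round(appr, 3)  == 0.034,
          round(apprc, 2) == 0.05)
stopifnot(abs(apprc - exact) < abs(appr - exact))        # the correction helps
```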
2.7.2 Chi-Square Distribution
Let Z_1, …, Z_n iid∼ N(0, 1). The distribution of the random variable

χ²_n = Σ_{i=1}^n Z_i²   (2.24)

is called the chi-square distribution (χ² distribution) with n degrees of freedom. The
following applies:

E(χ²_n) = n,   Var(χ²_n) = 2n.   (2.25)
Here and for the next two distributions, we do not give the densities as they are
very complex. Similarly, the expectation and the variance here and for the next two
distributions are for reference only.
The chi-square distribution is used in numerous statistical tests.
The definition of a chi-square random variable is based on a sum of independent
random variables. Hence, in view of the central limit theorem, the following results are
not surprising. If n > 50, we can approximate the chi-square distribution with a normal
distribution, i.e., χ²_n is distributed approximately N(n, 2n). Furthermore, for χ²_n with
n > 30, the random variable X = √(2χ²_n) is approximately normally distributed with
expectation √(2n − 1) and standard deviation 1.
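The first approximation can be checked numerically, e.g., for n = 100 (an illustrative choice):

```r
n <- 100
q <- seq(60, 140, by = 10)
# chi-square cdf vs its N(n, 2n) approximation on a grid of quantiles
stopifnot(max(abs(pchisq(q, df = n) - pnorm(q, mean = n, sd = sqrt(2 * n)))) < 0.03)
```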
R-Code 2.5 Chi-square distribution for various degrees of freedom. (See Figure 2.5.)
x <- seq( 0, to=50, length=150)
plot(x, dchisq( x, df=1), type='l', ylab='Density')
for (i in 1:6)
lines( x, dchisq(x, df=2^i), col=i+1)
legend( "topright", legend=2^(0:6), col=1:7, lty=1, bty="n")
2.7.3 Student’s t-Distribution
In later chapters, we use the so-called t-test when, for example, comparing empirical
means. The test is based on the distribution that we define next.
Let Z ∼ N(0, 1) and X ∼ χ²_m be two independent random variables. The distribution of

T_m = Z/√(X/m)   (2.26)

is called the t-distribution (or Student's t-distribution) with m degrees of freedom. We
have:

E(T_m) = 0, for m > 1;   (2.27)
Var(T_m) = m/(m − 2), for m > 2.   (2.28)
Figure 2.5: Densities of the Chi-square distribution for various degrees of free-
dom. (See R-Code 2.5.)
The density is symmetric around zero and as m → ∞ it converges to the
standard normal density ϕ(x) (see Figure 2.6).
R-Code 2.6 t-distribution for various degrees of freedom. (See Figure 2.6.)
x <- seq( -3, to=3, length=100)
plot( x, dnorm(x), type='l', ylab='Density')
for (i in 0:6)
lines( x, dt(x, df=2^i), col=i+2)
legend( "topright", legend=2^(0:6), col=2:8, lty=1, bty="n")
Remark 2.1. For m = 1, 2 the density is heavy-tailed and the variance of the distribution
does not exist. Realizations of this random variable occasionally manifest as extremely
large values. Of course, the empirical variance can still be calculated (see R-Code 2.7). We
come back to this issue in Chapter 5.
Figure 2.6: Densities of the t-distribution for various degrees of freedom. The
normal distribution is in black. A density with 2^7 = 128 degrees of freedom
would make the normal density function appear thicker. (See R-Code 2.6.)
R-Code 2.7 Empirical variances of the t-distribution with one degree of freedom.
set.seed( 14)
tmp <- rt( 1000, df=1)
print( c(summary( tmp), Var=var( tmp)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. Var
## -191.00 -1.12 -0.02 8.11 1.01 5730.00 37390.66
sort( tmp)[1:10] # many "large" values, but 2 exceptionally large
## [1] -190.929 -168.920 -60.603 -53.736 -47.764 -43.377 -36.252
## [8] -31.498 -30.029 -25.596
sort( tmp, decreasing=TRUE)[1:10]
## [1] 5726.531 2083.682 280.848 239.752 137.363 119.157 102.702
## [8] 47.376 37.887 32.443
2.7.4 F-Distribution
The F-distribution is mainly used to compare two empirical variances with each other,
as we will see in Chapter 11.
Let X ∼ χ²_m and Y ∼ χ²_n be two independent random variables. The distribution of

F_{m,n} = (X/m)/(Y/n)   (2.29)

is called the F-distribution with m and n degrees of freedom. It holds that:

E(F_{m,n}) = n/(n − 2), for n > 2;   (2.30)
Var(F_{m,n}) = 2n²(m + n − 2)/(m(n − 2)²(n − 4)), for n > 4.   (2.31)
Figure 2.7 shows the density for various degrees of freedom.
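The expectation (2.30) can be checked by numerical integration, e.g., for m = 5 and n = 10 (illustrative choices):

```r
m <- 5; n <- 10
# first moment of the F density
e <- integrate(function(x) x * df(x, df1 = m, df2 = n), 0, Inf)$value
stopifnot(abs(e - n / (n - 2)) < 1e-3)   # E(F_{m,n}) = 10/8 = 1.25
```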
R-Code 2.8 F -distribution for various degrees of freedom. (See Figure 2.7.)
x <- seq(0, to=4, length=500)
df1 <- c( 1, 2, 5, 10, 50, 100, 250)
df2 <- c( 1, 50, 10, 50, 50, 300, 250)
plot( x, df( x, df1=1, df2=1), type='l', ylab='Density')
for (i in 2:length(df1))
lines( x, df(x, df1=df1[i], df2=df2[i]), col=i)
legend( "topright", legend=c(expression(F[list(1,1)]),
expression(F[list(2,50)]), expression(F[list(5,10)]),
expression(F[list(10,50)]),expression(F[list(50,50)]),
expression(F[list(100,300)]),expression(F[list(250,250)])),
col=1:7, lty=1, bty="n")
Figure 2.7: Density of the F -distribution for various degrees of freedom. (See
R-Code 2.8.)
2.7.5 Beta Distribution
A random variable X with density

f_X(x) = c · x^{α−1} (1 − x)^{β−1},   x ∈ [0, 1],   α > 0,   β > 0,   (2.32)

where c is a normalization constant, is said to be beta distributed with parameters α and
β. We write this as X ∼ Beta(α, β). The normalization constant cannot be written in
closed form for all parameters α and β. For α = β the density is symmetric around 1/2,
and for α > 1, β > 1 the density is concave with mode (α − 1)/(α + β − 2). For arbitrary
α > 0, β > 0 we have:

E(X) = α/(α + β);   (2.33)
Var(X) = αβ/((α + β + 1)(α + β)²).   (2.34)
Figure 2.8 shows densities of the beta distribution for various pairs of (α, β).
The beta distribution is mainly used to model probabilities. We encounter this distri-
bution again in Chapter 12.
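The mode and the expectation (2.33) can be checked numerically, e.g., for α = 2, β = 4 (illustrative choices):

```r
a <- 2; b <- 4
# numerical mode of the density vs (alpha-1)/(alpha+beta-2)
mo <- optimize(function(x) dbeta(x, a, b), c(0, 1), maximum = TRUE)$maximum
stopifnot(abs(mo - (a - 1) / (a + b - 2)) < 1e-3)   # mode = 1/4
# first moment vs alpha/(alpha+beta)
m1 <- integrate(function(x) x * dbeta(x, a, b), 0, 1)$value
stopifnot(abs(m1 - a / (a + b)) < 1e-4)             # E(X) = 1/3
```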
R-Code 2.9 Densities of beta distributed random variables for various pairs of (α, β).
(See Figure 2.8.)
p <- seq( 0, to=1, length=100)
a.seq <- c( 1:6, .8, .4, .2, 1, .5, 2)
b.seq <- c( 1:6, .8, .4, .2, 4, 4, 4)
col <- c( 1:6, 1:6)
lty <- rep( 1:2, each=6)
plot( p, dbeta( p, 1, 1), type='l', ylab='Density', xlab='x',
xlim=c(0,1.3), ylim=c(0,3))
for ( i in 2:length(a.seq))
lines( p, dbeta(p, a.seq[i], b.seq[i]), col=col[i], lty=lty[i])
legend("topright", legend=c(expression( list( alpha, beta)),
paste(a.seq, b.seq)),
col=c(NA,col), lty=c(NA, lty), cex=.9, bty='n')
Figure 2.8: Densities of beta distributed random variables for various pairs of
(α, β). (See R-Code 2.9.)
2.8 Functions of Random Variables
In the previous sections we saw different examples of often-used, classical random
variables. These examples are often not enough, and through a modeling approach we need
additional ones. In this section we illustrate how to construct the cdf and pdf of a random
variable that is a function of one whose density we know.
Let X be a random variable with distribution function F_X(x). We define a random
variable Y = g(X). The cumulative distribution function of Y is written as

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y).   (2.35)

In many cases g(·) is invertible (and differentiable) and we obtain

F_Y(y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)), if g^{−1} is monotonically increasing,
F_Y(y) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y)), if g^{−1} is monotonically decreasing.   (2.36)

To derive the probability mass function we apply Property 2.2. In the more interesting
setting of continuous random variables, the density function is derived by Property 2.3.iii
and is thus

f_Y(y) = |d g^{−1}(y)/dy| · f_X(g^{−1}(y)).   (2.37)
Example 2.7. Let X be a random variable with cdf F_X(x) and pdf f_X(x). We consider
Y = a + bX, for b > 0 and a arbitrary. Hence, g(·) is a linear function and its inverse
g^{−1}(y) = (y − a)/b is monotonically increasing. The cdf of Y is thus F_X((y − a)/b) and
the pdf is f_X((y − a)/b) · 1/b. This fact has already been stated in Property 2.6 for
Gaussian random variables. ♣
Example 2.8. Let X ∼ U(0, 1) and, for 0 < x < 1, set g(x) = −log(1 − x), thus
g^{−1}(y) = 1 − exp(−y). Then the distribution and density function of Y = g(X) are

F_Y(y) = F_X(g^{−1}(y)) = g^{−1}(y) = 1 − exp(−y),   (2.38)
f_Y(y) = |d g^{−1}(y)/dy| · f_X(g^{−1}(y)) = exp(−y),   (2.39)

for y > 0. This random variable is called an exponential random variable (with rate
parameter one). Notice further that g(x) is the quantile function of this random variable.
♣
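Example 2.8 is the basis of inverse transform sampling; a short simulation check:

```r
set.seed(2)
u <- runif(1e5)
y <- -log(1 - u)                 # Y = g(X) from Example 2.8
# empirical cdf of y vs the exponential cdf with rate 1, at a few points
for (q in c(0.5, 1, 2))
  stopifnot(abs(mean(y <= q) - pexp(q, rate = 1)) < 0.01)
# g is indeed the quantile function of the rate-1 exponential
stopifnot(isTRUE(all.equal(qexp(0.3, rate = 1), -log(1 - 0.3))))
```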
As we are often interested in summarizing a random variable by its mean and variance,
there is a very convenient shortcut.
The expectation and the variance of a transformed random variable Y can be
approximated by the so-called delta method. The idea consists of a Taylor expansion
around the expectation E(X):

g(X) ≈ g(E(X)) + g′(E(X)) · (X − E(X))   (2.40)

(two terms of the Taylor series). Thus

E(Y) ≈ g(E(X)),   (2.41)
Var(Y) ≈ g′(E(X))² · Var(X).   (2.42)
Example 2.9. Let X ∼ B(1, p) and Y = X/(1 − X). Thus,

E(Y) ≈ p/(1 − p);   Var(Y) ≈ (1/(1 − p)²)² · p(1 − p) = p/(1 − p)³.   (2.43)
♣
Example 2.10. Let X ∼ B(1, p) and Y = log(X). Thus

E(Y) ≈ log(p),   Var(Y) ≈ (1/p)² · p(1 − p) = (1 − p)/p.   (2.44)
♣
Of course, in the case of a linear transformation (as, e.g., in Example 2.7), equation (2.40)
is an equality and thus relations (2.41) and (2.42) are exact, in line with Property 2.6.
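A small simulation sketch of the delta method, using the illustrative choice g(x) = exp(x) with X normal and small variance (values not from the script):

```r
set.seed(3)
mu <- 2; sig <- 0.05
x <- rnorm(1e5, mean = mu, sd = sig)
y <- exp(x)                                 # g(x) = exp(x), so g'(x) = exp(x)
stopifnot(abs(mean(y) - exp(mu)) < 0.01 * exp(mu))                      # (2.41)
stopifnot(abs(var(y) - exp(mu)^2 * sig^2) < 0.05 * exp(mu)^2 * sig^2)   # (2.42)
```

The approximation is good here because the variance of X is small, so the first-order Taylor expansion (2.40) is accurate over the range of realizations.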
2.9 Bibliographic Remarks and Links
In general, Wikipedia has nice summaries of many distributions. The page
https://en.wikipedia.org/wiki/List_of_probability_distributions lists many of them.
The ultimate reference for (univariate) distributions is the encyclopedic series of
Johnson et al. (2005, 1994, 1995). Figure 1 of Leemis and McQueston (2008) illustrates
extensively the links between countless univariate distributions.
Chapter 3
Estimation
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter03.R.
A central point of statistics is to draw information from observations (measurements,
data, a subset of a population) and to infer from the data towards our hypotheses or possibly
towards the population. This requires foremost data and a statistical model. Such a
statistical model describes the data through the use of distributions, which
means the data are considered a realization of the distribution. The model comprises
unknown quantities, so-called parameters. The goal of statistical estimation is to
determine plausible values for the parameters of the model using the observed data.
The following is an example of a statistical model:
Yi = µ+ εi, i = 1, . . . , n, (3.1)
where Y_i are the observations, µ is an unknown constant and ε_i are random variables
representing measurement error. It is often reasonable to assume E(ε_i) = 0 with a symmetric
density. Here, we even assume ε_i iid∼ N(0, σ²). Thus, Y_1, …, Y_n are normally distributed
with mean µ and variance σ².
Since typically both parameters µ and σ² are unknown, we first address the question of
how to determine plausible values for these model parameters from observed data.
Example 3.1. In R-Code 3.1, hemoglobin levels of blood samples from patients with Hb
SS and Hb S/β sickle cell disease are given (Husler and Zimmermann, 2010). The data
are summarized in Figure 3.1.
Equation (3.1) can be used as a simple statistical model for both diseases, with µ
representing the population mean and ε_i the individual deviation from the population
mean.
Natural questions that arise are: What are plausible values of the population mean?
How much do the individual deviations vary? ♣
R-Code 3.1 Hemoglobin levels of patients with sickle cell disease. (See Figure 3.1.)
HbSS <- c( 7.2, 7.7, 8, 8.1, 8.3, 8.4, 8.4, 8.5, 8.6, 8.7,
9.1, 9.1, 9.1, 9.8, 10.1, 10.3)
HbSb <- c(8.1, 9.2, 10,10.4,10.6,10.9,11.1,11.9,12.0,12.1)
boxplot( list(HbSS=HbSS, HbSb=HbSb), col=c(3,4))
qqnorm( HbSS, xlim=c(-2,2), ylim=c(7,12), col=3, main='')
qqline( HbSS, col=3)
tmp <- qqnorm( HbSb, plot.it=FALSE)
points( tmp, col=4)
qqline( HbSb, col=4)
Figure 3.1: Hemoglobin levels of patients with Hb SS and Hb S/β sickle cell
disease. (See R-Code 3.1.)
3.1 Point Estimation
For estimation, the data is considered a realization of a random sample with a certain
(parametric) distribution. The goal of point estimation is to provide a value for the
parameters of the distribution based on the data at hand.
Definition 3.1. A statistic is an arbitrary function of a random sample Y1, . . . , Yn and
is therefore also a random variable.
An estimator is a statistic used to extract information about a parameter from the
random sample.
A point estimate is the value of the estimator evaluated at y1, . . . , yn, the realizations
of the random sample.
Estimation (or estimating) is the process of finding a (point) estimate. ♦
3.2. CONSTRUCTION OF ESTIMATORS 35
Example 3.2. i) The numerical values shown in R-Code 3.2 are estimates.

ii) Ȳ = (1/n) Σ_{i=1}^n Y_i is an estimator.
ȳ = (1/10) Σ_{i=1}^{10} y_i = 10.6 is a point estimate.

iii) S² = 1/(n − 1) Σ_{i=1}^n (Y_i − Ȳ)² is an estimator.
s² = (1/9) Σ_{i=1}^{10} (y_i − ȳ)² = 1.65 or s = 1.28 is a point estimate. ♣
R-Code 3.2 Some summary statistics of hemoglobin levels of patients with sickle
cell disease.
c( mean( HbSS), mean( HbSb)) # means for both diseases
## [1] 8.7125 10.6300
var( HbSS) # here and below spread measures
## [1] 0.71317
c( var(HbSb), sum( (HbSb-mean(HbSb))^2)/(length(HbSb)-1) )
## [1] 1.649 1.649
c( sd( HbSS), sqrt( var( HbSS)))
## [1] 0.84449 0.84449
Often, we denote parameters with Greek letters (µ, σ, λ, …), with θ being the generic
one. The estimator and the estimate of a parameter θ are both denoted by θ̂. Context
makes clear which of the two is meant.
3.2 Construction of Estimators
We now consider three examples of how estimators are constructed for arbitrary distribu-
tions.
3.2.1 Ordinary Least Squares
The ordinary least squares method of parameter estimation minimizes the sum of squares
of the differences between observed responses and those predicted by a linear function of
the explanatory variables.
Let Y_1, …, Y_n be iid with E(Y_i) = µ. The least squares estimator is based on

µ̂ = µ̂_LS = argmin_µ Σ_{i=1}^n (Y_i − µ)²,   (3.2)

and thus one has the estimator µ̂_LS = Ȳ and the estimate µ̂_LS = ȳ.
Often, the parameter θ is linked to the expectation E(Y_i) through some function, say
g. In such a setting, we have

θ̂ = θ̂_LS = argmin_θ Σ_{i=1}^n (Y_i − g(θ))².   (3.3)

In many situations, for example in linear regression settings, simple closed-form solutions
exist.
3.2.2 Method of Moments
The method of moments is based on the following idea. The parameters of the distribution
are expressed as functions of the moments, e.g., E(Y), E(Y²). The sample moments
are then plugged into these equations in order to obtain the estimators:

µ := E(Y),   µ̂ = (1/n) Σ_{i=1}^n Y_i = Ȳ,   (3.4)
µ₂ := E(Y²),   µ̂₂ = (1/n) Σ_{i=1}^n Y_i².   (3.5)

By using the observed values of a random sample in the method of moments estimator,
the estimates of the corresponding parameters are obtained.
Example 3.3. Let Y_1, …, Y_n iid∼ Exp(λ). From

E(Y) = 1/λ,   Ȳ = 1/λ̂,   we obtain   λ̂ = λ̂_MM = 1/Ȳ.   (3.6)

Thus, the estimate of λ is the value 1/ȳ. ♣
Example 3.4. Let Y_1, …, Y_n iid∼ F with expectation µ and variance σ². Since
Var(Y) = E(Y²) − E(Y)² (Property 2.4.ii), we can write σ² = µ₂ − µ² and we have the estimator

σ̂²_MM = (1/n) Σ_{i=1}^n Y_i² − Ȳ² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)².   (3.7)

♣
3.2.3 Likelihood Method
The likelihood method chooses as estimate the value for which the observed data are
most likely to stem from the model. We consider the probability density function (for
continuous random variables) or the probability mass function (for discrete random
variables) as a function of the parameter θ, i.e.,

f_Y(y) = f_Y(y; θ) −→ L(θ) := f_Y(y; θ),   (3.8)
p_i = P(Y = y_i) = P(Y = y_i; θ) −→ L(θ) := P(Y = y_i; θ).   (3.9)

For a given distribution, we call L(θ) the likelihood.
Definition 3.2. The maximum likelihood estimator θ̂_ML of the parameter θ is based on
maximizing the likelihood, i.e.,

θ̂_ML = argmax_θ L(θ).   (3.10)

♦
Since a random sample contains independent and identically distributed random variables,
the likelihood is the product of the individual densities (see Section 2.5). Since
θ̂_ML = argmax_θ L(θ) = argmax_θ log(L(θ)), the log-likelihood ℓ(θ) := log(L(θ)) can be
maximized instead. The log-likelihood is often preferred because the expressions simplify
and maximizing sums is much easier than maximizing products.
Example 3.5. Let Y_1, …, Y_n iid∼ Exp(λ), thus

L(λ) = Π_{i=1}^n f_Y(y_i) = Π_{i=1}^n λ exp(−λ y_i) = λ^n exp(−λ Σ_{i=1}^n y_i).   (3.11)

Then

dℓ(λ)/dλ = d(n log(λ) − λ Σ_{i=1}^n y_i)/dλ = n/λ − Σ_{i=1}^n y_i,   (3.12)

which, set to zero, yields

λ̂ = λ̂_ML = n/Σ_{i=1}^n y_i = 1/ȳ.   (3.13)

In this case (as in others), λ̂_ML = λ̂_MM. ♣
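The estimate λ̂ = 1/ȳ can be illustrated on simulated data (the rate 2.5 is an arbitrary illustrative choice):

```r
set.seed(4)
y <- rexp(1e4, rate = 2.5)
lambda_hat <- 1 / mean(y)                # (3.13), also the MM estimate (3.6)
stopifnot(abs(lambda_hat - 2.5) < 0.1)
ll <- function(l) length(y) * log(l) - l * sum(y)   # log-likelihood of (3.11)
stopifnot(ll(lambda_hat) >= ll(0.9 * lambda_hat),
          ll(lambda_hat) >= ll(1.1 * lambda_hat))   # indeed a maximum
```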
In the vast majority of cases, maximum likelihood estimators possess very nice properties.
Intuitively, because we use information about the density and not only about the moments,
they are "better" than method of moments estimators. Further, for many common
random variables, the likelihood function has a single optimum, in fact a maximum, for
all permissible θ.
3.3 Comparison of Estimators
In Example 3.4 we saw an estimator that divides the sum of the squared deviances by n,
whereas the default denominator used in R is n−1 (see R-Code 3.2 or Example 3.2). There
are different estimators for a particular parameter possible. We now see two measures to
compare them.
Definition 3.3. An estimator θ̂ of a parameter θ is unbiased if

E(θ̂) = θ,   (3.14)

otherwise it is biased. The value E(θ̂) − θ is called the bias. ♦
Example 3.6. Let Y_1, …, Y_n iid∼ N(µ, σ²).

i) Ȳ is unbiased for µ, since

E(Ȳ) = E((1/n) Σ_{i=1}^n Y_i) = (1/n) · n · E(Y_i) = µ.   (3.15)

ii) S² = 1/(n − 1) Σ_{i=1}^n (Y_i − Ȳ)² is unbiased for σ². To show this we use the
following two identities:

(Y_i − Ȳ)² = (Y_i − µ + µ − Ȳ)² = (Y_i − µ)² + 2(Y_i − µ)(µ − Ȳ) + (µ − Ȳ)²,   (3.16)
Σ_{i=1}^n 2(Y_i − µ)(µ − Ȳ) = 2(µ − Ȳ) Σ_{i=1}^n (Y_i − µ) = 2(µ − Ȳ)(nȲ − nµ)   (3.17)
                            = −2n(µ − Ȳ)²,   (3.18)

i.e., we rewrite the square such that the cross-term simplifies with the second square
term. Collecting all terms finally leads to

(n − 1)S² = Σ_{i=1}^n (Y_i − µ)² − n(µ − Ȳ)²,   (3.19)
(n − 1) E(S²) = Σ_{i=1}^n Var(Y_i) − n E((µ − Ȳ)²).   (3.20)

By Property 2.5.iii, E((µ − Ȳ)²) = Var(Ȳ) = σ²/n and thus

(n − 1) E(S²) = nσ² − n · σ²/n = (n − 1)σ².   (3.21)

iii) σ̂² = (1/n) Σ_{i=1}^n (Y_i − Ȳ)² is biased, since

E(σ̂²) = ((n − 1)/n) · E(1/(n − 1) Σ_{i=1}^n (Y_i − Ȳ)²) = ((n − 1)/n) · σ²,   (3.22)

where the expectation in the middle term is E(S²) = σ².
The bias is

E(σ̂²) − σ² = ((n − 1)/n) σ² − σ² = −σ²/n,   (3.23)

which amounts to an underestimation of the variance. ♣
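Both statements can be illustrated by a small simulation (n = 5 and σ² = 4 chosen for illustration):

```r
set.seed(5)
n <- 5; sigma2 <- 4
s2  <- replicate(20000, var(rnorm(n, sd = sqrt(sigma2))))  # denominator n-1
s2b <- s2 * (n - 1) / n                                    # denominator n
stopifnot(abs(mean(s2) - sigma2) < 0.1)                    # S^2 is unbiased
stopifnot(abs(mean(s2b) - (n - 1) / n * sigma2) < 0.1)     # bias of -sigma2/n
```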
A further possibility for comparing estimators is the mean squared error

MSE(θ̂) = E((θ̂ − θ)²).   (3.24)

The mean squared error can also be written as MSE(θ̂) = bias(θ̂)² + Var(θ̂).
Example 3.7. Let Y_1, …, Y_n iid∼ N(µ, σ²). Using the result (3.15) and Property 2.5.iii,
we have

MSE(Ȳ) = bias(Ȳ)² + Var(Ȳ) = 0 + σ²/n.   (3.25)

Hence, the MSE vanishes as n increases. ♣
There is a second “classical” example for the calculation of the mean squared error,
however it requires some properties of squared Gaussian variables.
Example 3.8. If Y_1, …, Y_n iid∼ N(µ, σ²), it is possible to show that (n − 1)S²/σ² ∼ χ²_{n−1}.
Then, using the variance of a chi-square random variable (2.25), we have

MSE(S²) = Var(S²) = σ⁴/(n − 1)² · Var((n − 1)S²/σ²) = σ⁴/(n − 1)² · (2n − 2) = 2σ⁴/(n − 1).   (3.26)

Analogously, one can show that MSE(σ̂²_MM) is smaller than (3.26). Moreover,
the estimator (n − 1)S²/(n + 1) possesses the smallest MSE. ♣
3.4 Interval Estimators
Point estimates are inherently single values and do not provide us with uncertainties.
For this we have to extend the idea towards interval estimates and interval estimators.
We start with a motivating example.
Let Y_1, …, Y_n iid∼ N(µ, σ²) with known σ². Thus Ȳ ∼ N(µ, σ²/n) and, by
standardizing, (Ȳ − µ)/(σ/√n) ∼ N(0, 1). As a consequence,
1 − α = P(z_{α/2} ≤ (Ȳ − µ)/(σ/√n) ≤ z_{1−α/2})   (3.27)
      = P(z_{α/2} · σ/√n ≤ Ȳ − µ ≤ z_{1−α/2} · σ/√n)   (3.28)
      = P(−Ȳ + z_{α/2} · σ/√n ≤ −µ ≤ −Ȳ + z_{1−α/2} · σ/√n)   (3.29)
      = P(Ȳ − z_{α/2} · σ/√n ≥ µ ≥ Ȳ − z_{1−α/2} · σ/√n).   (3.30)
Definition 3.4. Let Y₁, . . . , Yₙ iid∼ N(µ, σ²) with known σ². The interval

[Ȳ − z1−α/2 · σ/√n, Ȳ + z1−α/2 · σ/√n]   (3.31)

is an exact (1 − α) confidence interval for the parameter µ. ♦
If we evaluate the bounds of the confidence interval [Bl, Bu] at a realization, we denote [bl, bu] as the empirical confidence interval or the observed confidence interval.

The interpretation of an exact confidence interval is as follows: if a large number of realisations of the random sample are drawn, on average (1 − α) · 100% of the confidence intervals will cover the true parameter µ.

Empirical confidence intervals do not contain random variables and therefore it is not possible to make probability statements about the parameter.
If the standard deviation σ is unknown, the ansatz must be modified by using a point estimate for σ, typically S = √S², with S² = 1/(n − 1) · ∑ᵢ (Yᵢ − Ȳ)². Since (Ȳ − µ)/(S/√n) ∼ Tn−1, the corresponding quantile must be modified:

1 − α = P(tn−1,α/2 ≤ (Ȳ − µ)/(S/√n) ≤ tn−1,1−α/2).   (3.32)
Definition 3.5. Let Y₁, . . . , Yₙ iid∼ N(µ, σ²). The interval

[Ȳ − tn−1,1−α/2 · S/√n, Ȳ + tn−1,1−α/2 · S/√n]   (3.33)

is an exact (1 − α) confidence interval for the parameter µ. ♦
Confidence interval for the mean µ

Under the assumption of a normal random sample,

[Ȳ − tn−1,1−α/2 · S/√n, Ȳ + tn−1,1−α/2 · S/√n]   (3.34)

is an exact (1 − α) confidence interval and

[Ȳ − z1−α/2 · S/√n, Ȳ + z1−α/2 · S/√n]   (3.35)

an approximate (1 − α) confidence interval for µ.
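The t-based interval (3.34) is exactly what t.test returns; a quick sketch with simulated data (our own example, arbitrary settings) confirms this:

```r
# Sketch: the t-based CI computed by hand agrees with t.test()
set.seed(4)
y <- rnorm(12, mean = 5, sd = 2)
n <- length(y); alpha <- 0.05
manual <- mean(y) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sd(y)/sqrt(n)
manual
t.test(y, conf.level = 1 - alpha)$conf.int  # same two numbers
```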
Confidence intervals are, as shown in the previous two definitions, constituted by
random variables (functions of Y1, . . . , Yn). Similar to estimators and estimates, confidence
intervals are computed with the corresponding realization y1, . . . , yn of the random sample.
Subsequently, confidence intervals will be outlined in the blue-highlighted text boxes, as
shown here.
Example 3.9. Let Y₁, . . . , Y₄ iid∼ N(0, 1). R-Code 3.3 and Figure 3.2 show 100 empirical confidence intervals based on Equation (3.31) (top) and Equation (3.33) (bottom). Because n is small, the difference between the normal and the t-distribution is quite pronounced. This becomes clear when

[Ȳ − z1−α/2 · S/√n, Ȳ + z1−α/2 · S/√n]   (3.36)

is used as an approximation (Figure 3.2, middle). ♣
Confidence intervals can often be constructed from a starting estimator and its distribution. In many cases it is possible to extract the parameter to arrive at 1 − α = P(Bl ≤ θ ≤ Bu); often some approximations are necessary.

The coverage probability of an exact confidence interval amounts to exactly 1 − α. This is not the case for approximate confidence intervals.
Example 3.10. Let Y₁, . . . , Yₙ iid∼ N(µ, σ²). The estimator S² for the parameter σ² is such that (n − 1)S²/σ² ∼ χ²n−1, i.e., a chi-square distribution with n − 1 degrees of freedom. Hence

1 − α = P(χ²n−1,α/2 ≤ (n − 1)S²/σ² ≤ χ²n−1,1−α/2)   (3.38)
      = P((n − 1)S²/χ²n−1,α/2 ≥ σ² ≥ (n − 1)S²/χ²n−1,1−α/2),   (3.39)

where χ²n−1,p is the p-quantile of the chi-square distribution with n − 1 degrees of freedom. The corresponding exact (1 − α) confidence interval no longer has the form θ̂ ± q1−α/2 · sd(θ̂), because the chi-square distribution is not symmetric.

For the Hb SS data with sample variance 0.71, we have the confidence interval [0.39, 1.71], computed with (16-1)*var(HbSS)/qchisq( c(.975, .025), df=16-1). ♣
Confidence interval for the variance σ²

Under the assumption of a normal random sample,

[(n − 1)S²/χ²n−1,1−α/2, (n − 1)S²/χ²n−1,α/2]   (3.37)

is an exact (1 − α) confidence interval for σ².
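That the interval (3.37) is exact can be illustrated by simulation; the sketch below is our own (arbitrary settings n = 16, σ² = 1) and estimates the empirical coverage:

```r
# Sketch: empirical coverage of the exact chi-square CI for sigma^2
set.seed(5)
n <- 16; alpha <- 0.05; sigma2 <- 1
covered <- replicate(10000, {
  s2 <- var(rnorm(n, sd = sqrt(sigma2)))
  ci <- (n - 1) * s2 / qchisq(c(1 - alpha/2, alpha/2), df = n - 1)
  ci[1] <= sigma2 && sigma2 <= ci[2]
})
mean(covered)  # close to the nominal 0.95
```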
R-Code 3.3 100 confidence intervals for the parameter µ = 0, with known σ = 1 and with unknown σ.

set.seed( 1)
ex.n <- 100    # 100 confidence intervals
alpha <- .05   # 95% confidence intervals
n <- 4         # sample size
mu <- 0
sigma <- 1
sample <- array( rnorm( ex.n * n, mu, sigma), c(n, ex.n))
yl <- c( -2.9, 2.9)    # same y-axis for all
ybar <- apply( sample, 2, mean)    # means

# Sigma known:
sigmaybar <- sigma/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
     main=expression(sigma~known))
abline( h=0)
for ( i in 1:ex.n) {
  ci <- ybar[i] + sigmaybar * qnorm( c(alpha/2, 1-alpha/2))
  lines( c(i,i), ci, col=ifelse( ci[1]>0 | ci[2]<0, 2, 1))
}

# Sigma unknown, normal approximation:
sybar <- apply( sample, 2, sd)/sqrt(n)
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
     main="Gaussian Approximation")
abline( h=0)
for ( i in 1:ex.n) {
  ci <- ybar[i] + sybar[i] * qnorm( c(alpha/2, 1-alpha/2))
  lines( c(i,i), ci, col=ifelse( ci[1]>0 | ci[2]<0, 2, 1))
}

# Sigma unknown, t-based:
plot( 1:ex.n, 1:ex.n, type='n', ylim=yl, xaxt='n', ylab='',
     main='t-distribution')
abline( h=0)
for ( i in 1:ex.n) {
  ci <- ybar[i] + sybar[i] * qt( c(alpha/2, 1-alpha/2), n-1)
  lines( c(i,i), ci, col=ifelse( ci[1]>0 | ci[2]<0, 2, 1))
}
[Figure 3.2 appears here: three panels titled “σ known”, “Gaussian Approximation”, and “t-distribution”, each with a y-axis from −3 to 3.]
Figure 3.2: Normal and t-based confidence intervals for the parameter µ = 0, with known σ = 1 (above) and unknown σ (middle and below). The sample size is n = 4 and the confidence level is (1 − α) = 95%. Confidence intervals which do not cover the true value zero are drawn in red. (Figure based on R-Code 3.3.)
3.5 Bibliographic Remarks and Links
Statistical estimation theory is very classical and many books are available. For example, Held and Sabanés Bové (2014) (or its German predecessor, Held, 2008) is written at an accessible level.
Chapter 4
Statistical Testing
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter04.R.
Even with a fair coin, we do not always toss heads or tails exactly n/2 times. This
is illustrated in R with rbinom(1, size=n, prob=1/2) for n tosses, since rbinom(1,
size=1, prob=1/2) “is” a fair coin. Suppose we observe 13 heads in 17 tosses. Is this a
fair coin? Do we have evidence against a fair coin?
We will formulate a formal statistical procedure to answer such questions.
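Before doing so, the question can already be quantified in one line: under a fair coin, the probability of a result at least as extreme as 13 heads in 17 tosses is about 0.049. The sketch below previews the exact binomial test (proportions are treated in Chapter 5):

```r
# Sketch: exact two-sided binomial test for 13 heads in 17 tosses
binom.test(13, n = 17, p = 1/2)$p.value  # approximately 0.049
```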
4.1 The General Concept of Significance Testing
Example 4.1. A student, by observing her eating habits during her menstrual cycle,
found that she felt hungrier a few days before menstruation, but had almost no appetite
after. Based on these observations, she hypothesized that a recurring pattern is recogniz-
able in the eating habits of women, which can be explained by the menstrual cycle.
As part of her Maturitätsarbeit (high-school thesis), a survey was created and over the course of two months
17 women kept daily records of what and how much they ate. This survey was completed
by subjects daily and captured information on how many portions from each food group
(vegetables, fruits, carbohydrates, milk products, proteins) were eaten by the subjects at
breakfast, lunch, and dinner. A fact sheet was used to explain to the subjects how to
categorize foods and how much is considered one serving.
The menstrual cycle is divided into 4 phases (before/during/after and normal) and the data for the respective phases are given in R-Code 4.1, denoted by portions.normal, portions.during, etc.
Before we start to compare the means between phases, we consider the normal phase
only. The hypothesis is that the population mean is different than (not equal to) 9
portions. (The observed mean is 7.73 with a standard deviation of 2.06.) R-Code 4.1
illustrates the computation of these statistics and the construction of Figure 4.1. ♣
R-Code 4.1 Eating habits, dataset portions . (See Figure 4.1.)
portions.normal
## [1] 5.17 9.00 12.69 6.50 5.76 7.60 6.40 5.19 9.90 7.60
## [11] 6.40 8.39 6.06 8.40 10.80 8.80 6.70
print( me <- mean( portions.normal))
## [1] 7.7271
print( se <- sd( portions.normal))
## [1] 2.0646
plot( 0, type='n', yaxt='n', xaxt='n', bty='n', xlim=c(4,14), ylab='')
axis(1)
rug( portions.normal,ticksize = 0.3)
abline( v=me, col=3)
abline( v=9, lwd=3, col=2)
# t-based CI in green and Gaussian CI (approximation!) in blue
lines( me + qt(c(.025,.975),16)*se/sqrt(17), c(.1,.1), col=3,lwd=5)
lines( me + qnorm(c(.025,.975))*se/sqrt(17), c(-.1,-.1), col=4, lwd=2)
Figure 4.1: Rug plot with sample mean (vertical green) and confidence intervals
based on the t-distribution (green) and normal approximation (blue). (See R-
Code 4.1.)
The idea of a statistical testing procedure is to formulate statistical hypotheses and to
draw conclusions from them based on the data. We always start with a null hypothesis,
denoted with H0, and – informally – we compare how compatible the data is with respect
to this hypothesis.
Simply stated, starting from a statistical hypothesis a statistical test calculates a value
from the data and places that value in the context of the hypothetical density induced
by the statistical hypothesis. If the value is unlikely to occur, we argue that the data
provides evidence against the hypothesis.
More formally, if we want to test a certain parameter, say θ, we need an estimator θ̂ for that parameter. We often need to transform the estimator such that its distribution does not depend on (the) parameter(s). We call this random variable (a function of the random sample) the test statistic. Some of these test statistics are well known and have been named, as we shall see later. Once the test statistic has been determined, we
evaluate it at the observed sample and compare it with the quantiles of its distribution under the null hypothesis. The comparison is typically expressed as a probability, i.e., the famous p-value. A more formal definition of the p-value follows.
Definition 4.1. The p-value is the probability, under the distribution of the null hypoth-
esis, of obtaining a result equal to or more extreme than the observed result. ♦
In practice, one often starts with a scientific hypothesis and then collects data or performs an experiment. The data is then “modeled statistically”, i.e., we determine a theoretical distribution for it. In our discussion here, the distribution typically involves parameters that are linked to the scientific question (the probability p of a binomial distribution for coin tosses, the mean µ of a Gaussian distribution for testing differences in food consumption). We formulate the null hypothesis H0. In many cases we pick a “known” test instead of “manually” constructing a test statistic. Of course this test has to be in sync with the statistical model. Based on the p-value we then summarize the evidence against the null hypothesis. We cannot make any statement in favor of the hypothesis.
Some authors summarize p-values in [1, 0.1] as no evidence, in [0.1, 0.01] as weak evidence, in [0.01, 0.001] as substantial evidence, and smaller ones as strong evidence (Held and Sabanés Bové, 2014). In R, the symbols ‘.’, ‘*’, ‘**’, and ‘***’ are used for similar ranges.
As a summary, the workflow of a statistical significance test can be summarized as:
i) Formulation of the scientific question or scientific hypothesis,
ii) Formulation of the statistical model (statistical assumptions),
iii) Formulation of the statistical test hypothesis (null hypothesis H0),
iv) Selection of the appropriate test,
v) Calculation of the p-value,
vi) Interpretation.
Although we discuss all six points in various settings, emphasis lies on points ii) to v).
4.2 Hypothesis Testing
A criticism of significance testing is that we have only one hypothesis. It is often easier to choose between two alternatives. This is the approach of hypothesis testing. However, we will still not choose one hypothesis but rather reject or fail to reject the null hypothesis.
More precisely, we start with a null hypothesis H0 and an alternative hypothesis, de-
noted by H1 or HA. These hypotheses are with respect to a parameter, say θ. Hypotheses
are classified as simple if the parameter θ assumes only a single value (e.g., H0: θ = 0), or composite if the parameter θ can take on a range of values (e.g., H0: θ ≤ 0). Often, you will encounter a simple null hypothesis with a simple or composite alternative hypothesis. For example, when testing a mean parameter we would have H0: µ = µ0 vs HA: µ ≠ µ0 for the latter. In practice the case of simple null and simple alternative hypothesis, e.g., H0: µ = µ0 vs HA: µ = µ1, is rarely used but has considerable didactic value.
Hypothesis tests may be either one-sided (directional), in which only a relationship in a
prespecified direction is of interest, or two-sided, in which a relationship in either direction
is tested. One could use a one-sided test for “Hb SS has a lower average hemoglobin value
than Hb Sβ”, but a two-sided test is needed for “Hb SS and Hb Sβ have different average
hemoglobin values”. Further examples of hypotheses are given in Rudolf and Kuhlisch
(2008). We strongly recommend always using two-sided tests. However, to illustrate certain concepts, a one-sided setting may be simpler and more accessible.
As in the case of a significance test, we compare the value of the test statistic with the quantiles of the distribution under the null hypothesis. In the case of small p-values we reject H0; if not, we fail to reject H0. The decision of whether the p-value is “small” or not is based on the so-called significance level, α. The corresponding critical value is the 1 − α quantile of the distribution of the test statistic under the null hypothesis.
Definition 4.2. The rejection region of a test includes all values of the test statistic with a p-value smaller than the significance level. The boundary values of the rejection region are called critical values. ♦
For one-sided (or directional) hypotheses like H0 : µ ≤ µ0 or H0 : µ ≥ µ0, the “=” case is used in the null hypothesis. That means that, from a calculation perspective, the null hypothesis is always simple.
Figure 4.2 shows the rejection regions of the two-sided test H0 : µ = µ0 vs HA : µ ≠ µ0 (left panel) and of the one-sided test H0 : µ ≤ µ0 vs HA : µ > µ0 (right panel).
Figure 4.2: Rejection regions for H0 : µ = µ0 = 0 (left) and H0 : µ ≤ µ0 = 0
(right) for α = 10%.
Table 4.1: Type I and Type II errors in the setting of significance tests.

                              True state
                          H0 true    H1 true
  do not reject H0         1 − α        β
  reject H0                  α        1 − β
Two types of errors can occur in testing: Type I errors, where we reject H0 although we should not, and Type II errors, where we fail to reject H0 although we should. The framework of hypothesis testing allows us to quantify the probability of committing these errors (see Table 4.1). Ideally we would like to construct tests that have small Type I and Type II errors. This is not possible, and one typically fixes the Type I error (committing a Type I error typically has more severe consequences than a Type II error). Type I and Type II errors are shown in Figure 4.3 for two different alternative hypotheses. This illustrates that reducing α leads to an increase in β.
We can further summarize. The Type I error:
• is defined a priori by selection of the significance level (often 5%, 1%),
• is not influenced by sample size,
• is increased with multiple testing of the same data
and the Type II error:
• depends on sample size and significance level α,
• is a function of the alternative hypothesis.
Figure 4.3: Type I error at rate α (red) and Type II error at rate β (blue) for
2 different alternative hypotheses (µ = 2 left, µ = 4 right) with H0 : µ ≤ µ0 = 0.
Figure 4.4: Power: one-sided (blue solid line) and two-sided (black dashed
line). The gray line represents the level of the test, here α = 5%. The vertical
lines represent the alternative hypotheses µ = 2 and µ = 4 of Figure 4.3. (See
R-Code 4.2.)
The value 1 − β is called the power of a test. High power is desirable in an experiment. R-Code 4.2 computes the power under a Gaussian assumption. More specifically, under the assumption of σ = 1 we test H0: µ = µ0 = 0 versus HA: µ ≠ µ0. The power can only be calculated for a specific assumption about the “actual” mean µ1. Thus, as typically done, Figure 4.4 plots the power as a function of µ1.
R-Code 4.2 A one-sided and two-sided power curve for a z-test. (See Figure 4.4.)
alpha <- 0.05 # significance level
mu0 <- 0 # H_0 mean
mu1 <- seq(-1, to=4, by=.5) # H_1 mean
power_onesided <- 1-pnorm( qnorm(1-alpha, mean=mu0), mean=mu1)
power_twosided <- pnorm( qnorm(alpha/2, mean=mu0), mean=mu1) +
pnorm( qnorm(1-alpha/2, mean=mu0), mean=mu1, lower.tail=FALSE)
plot( mu1, power_onesided, type='l', ylim=c(0,1),
xlab=expression(mu[1]-mu[0]), ylab="Power", col=4)
lines( mu1, power_twosided,lty=2)
abline( h=alpha, col='gray') # significance
abline( v=c(2,4), lty=4,col=7) # values from figure 4.3
The workflow of a hypothesis test is very similar to the one of a statistical significance
test and only point iii) and v) need to be slightly modified:
i) Formulation of the scientific question or scientific hypothesis,
ii) Formulation of the statistical model (assumptions),
iii) Formulation of the statistical test hypothesis and selection of significance level,
iv) Selection of the appropriate test,
v) Calculation of the test statistic value (or p-value), comparison and decision
vi) Interpretation.
The choice of test is again constrained by the assumptions. The significance level must, however, always be chosen before the computations.
The test statistic value tobs calculated in step v) is compared with tabulated critical values ttab in order to reach a decision. When the decision is based on the calculation of the p-value, it consists of a comparison with α. The p-value can be difficult to calculate, but is valuable because of its direct interpretation as the strength (or weakness) of the evidence against the null hypothesis.
In this course we consider various test situations. The choice of test is primarily
dependent on the parameter and secondly on the statistical assumptions. The following
list can be used as a decision tree.
• Mean
– One sample (Test 1)
– Two samples
∗ Two independent samples (Test 2 and Test 7 in Chapter 6)
∗ Two paired/dependent samples (Test 3 and Test 8 in Chapter 6)
– Several samples (Test 13 in Chapter 11)
• Variances (Test 4)
• Proportions (Test 6 in Chapter 5)
• Distributions (Test 5)
• . . .
Distribution tests (also called goodness-of-fit tests) differ from the rest in that they do not test only a single parameter. In the following sections we consider each test in detail. The tests are always presented in the same pattern (additional general remarks are given in the yellow text blocks).
General remarks about statistical tests
In the following pages, we will present several important statistical tests in
text blocks like this one.
In general, we call
• n, nx, ny, . . . the sample sizes;
• x̄, ȳ, . . . the arithmetic means;
• s², s²x, s²y, . . . the unbiased sample variance estimates, e.g.,

  s²x = 1/(nx − 1) · ∑ᵢ₌₁ⁿˣ (xᵢ − x̄)²

  (in the tests under consideration, the variance is unknown);
• α the significance level;
• ttab, Ftab, . . . the appropriate quantiles.
Our scientific question can only be formulated in general terms and we call
it a hypothesis. Generally, two-sided tests are performed. The significance
level is modified accordingly for one-sided tests.
In unclear cases, the statistical hypothesis is specified.
For most tests, there is a corresponding R function. The arguments x, y
usually represent vectors of realisations and alpha the significance level.
4.3 Comparing Means
In this section, we compare means from normally distributed samples. To start, we compare the mean of a sample with a hypothesized value. More formally, Y₁, . . . , Yₙ iid∼ N(µ, σ²) with unknown σ² and parameter of interest µ. Thus, from

Ȳ ∼ N(µ, σ²/n),   (Ȳ − µ)/(σ/√n) ∼ N(0, 1),   (Ȳ − µ)/(S/√n) ∼ Tn−1,   (4.1)

we can only use the last distribution, as σ² is unknown. Under the null hypothesis, µ is of course our theoretical value, say µT.

The test statistic is t-distributed (see Section 2.7.3) and so the function pt is used to calculate p-values. As we have only one sample, the test is typically called the “one-sample t-test”, illustrated in box Test 1 and in R-Code 4.3 with an example based on the dietary data.
Test 1: Comparing a sample mean with a theoretical value

Question: Does the experiment-based sample mean x̄ deviate significantly from the hypothesized (“theoretical”) mean µT?

Assumptions: The population, from which the sample arises, is normally distributed. The observed data are on the interval scale and the variance is unknown.

Calculation: tobs = (|x̄ − µT|/s) · √n.

Decision: Reject H0 : µ = µT if tobs > ttab = tn−1,1−α/2.

Calculation in R: t.test( x, mu=muT, conf.level=1-alpha)
Example 4.2. We test the hypothesis that the normal number of dietary portions differs from 9. We have one sample and we want to know whether its mean deviates from a hypothesized value.

The following values are given: mean: 7.73, standard deviation: 2.06, sample size: 17.

H0 : µ = 9
tobs = |7.73 − 9| / (2.06/√17) = 2.54
ttab = t16,1−0.05/2 = 2.12,   p-value: 0.022.

As shown by Figure 4.1, there is evidence against the null hypothesis (R-Code 4.3). ♣
R-Code 4.3 One sample t-test, portions (see Example 4.2 and Test 1)
t.test( portions.normal, mu=9)
##
## One Sample t-test
##
## data: portions.normal
## t = -2.54, df = 16, p-value = 0.022
## alternative hypothesis: true mean is not equal to 9
## 95 percent confidence interval:
## 6.6656 8.7886
## sample estimates:
## mean of x
## 7.7271
Often we want to compare two different means, say x̄ and ȳ. We assume that both random samples are normally distributed, i.e., X₁, . . . , Xₙ iid∼ N(µx, σ²), Y₁, . . . , Yₙ iid∼ N(µy, σ²), and independent. Then

X̄ ∼ N(µx, σ²/n),   Ȳ ∼ N(µy, σ²/n),   X̄ − Ȳ ∼ N(µx − µy, 2σ²/n),   (4.2)

(X̄ − Ȳ − (µx − µy))/(σ/√(n/2)) ∼ N(0, 1),   and under H0: µx = µy,   (X̄ − Ȳ)/(σ/√(n/2)) ∼ N(0, 1).   (4.3)

The difficulty is that we do not know σ and we have to estimate it. The estimate takes a somewhat complicated form, as both samples need to be taken into account, with possibly different lengths and means. This pooled estimate is denoted by sp and given in Test 2. Ultimately, we again have a t-distribution for the test statistic, as we use an estimate of the standard deviation in the standardization of a normal random variable. As the calculation of sp requires the estimates of µx and µy, we adjust the degrees of freedom to nx + ny − 2.

R-Code 4.4 is based on the dietary data again and compares the portions during the menstrual cycle with the normal amount.
Test 2: Comparing means from two independent samples

Question: Are the means x̄ and ȳ of two samples significantly different?

Assumptions: Both populations are normally distributed with the same unknown variance. The samples are independent.

Calculation: tobs = (|x̄ − ȳ|/sp) · √(nx · ny/(nx + ny)),

where s²p = ((nx − 1)s²x + (ny − 1)s²y) / (nx + ny − 2).

Decision: Reject H0 : µx = µy if tobs > ttab = tnx+ny−2,1−α/2.

Calculation in R: t.test( x, y, var.equal=TRUE, conf.level=1-alpha)
In practice, the variances of the two samples are often different, say σ²x and σ²y. In such a setting, we have to normalize the mean difference by √(s²x/nx + s²y/ny). While this estimate seems simpler than the pooled estimate sp, the degrees of freedom of the resulting t-distribution are not intuitive and difficult to derive, and we refrain from elaborating here. In the literature, this test is called Welch’s t-test and it is actually the default choice of t.test( x, y, conf.level=1-alpha).
Example 4.3. For the two samples normal and during, we have the following summaries: means 7.73 and 8.21; standard deviations: 2.06 and 1.90; sample sizes: 17 and 17. Hence, using the formulas given in Test 2, we have

H0 : µx = µy
s²p = (16 · 2.06² + 16 · 1.90²) / (17 + 17 − 2) = 3.93,   i.e., sp = 1.98
tobs = (|7.73 − 8.21| / 1.98) · √(17 · 17/(17 + 17)) = 0.71
ttab = t32,1−0.05/2 = 2.04,   p-value: 0.48.

Hence, 7.73 and 8.21 are not statistically different. See also R-Code 4.4. ♣
R-Code 4.4 Two-sample t-test with independent samples, portions (see Example 4.3
and Test 2).
t.test( portions.normal, portions.during, var=TRUE)
##
## Two Sample t-test
##
## data: portions.normal and portions.during
## t = -0.709, df = 32, p-value = 0.48
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.86966 0.90378
## sample estimates:
## mean of x mean of y
## 7.7271 8.2100
The assumption of independence of the two samples in Example 4.3 may not be realistic: we have observed the same individuals at two different instances in time. In such settings, where we have a “before” and an “after” measurement, it is better to take this pairing into account by considering the differences instead of the two samples. Hence, instead of constructing a test statistic based on X̄ − Ȳ, we consider

X₁ − Y₁, . . . , Xₙ − Yₙ iid∼ N(µx − µy, σ²d),   X̄ − Ȳ ∼ N(µx − µy, σ²d/n),   (4.4)

and under H0: µx = µy,   (X̄ − Ȳ)/(σd/√n) ∼ N(0, 1),   (4.5)

where σ²d is essentially the sum of the variances minus the “dependence” between Xᵢ and Yᵢ. We formalize this dependence, called covariance, starting in Chapter 7.

The paired two-sample t-test can thus be considered a one-sample t-test of the differences with mean µT = 0.
Test 3: Comparing means from two paired samples

Question: Are the means x̄ and ȳ of two paired samples significantly different?

Assumptions: The samples are paired, the observed values are on the interval scale. The differences are normally distributed with unknown mean δ. The variance is unknown.

Calculation: tobs = (|d̄|/sd) · √n, where

• dᵢ = xᵢ − yᵢ is the i-th observed difference,
• d̄ is the arithmetic mean and sd the standard deviation of the differences dᵢ.

Decision: Reject H0 : δ = 0 if tobs > ttab = tn−1,1−α/2.

Calculation in R: t.test(x-y, conf.level=1-alpha) or
t.test(x, y, paired=TRUE, conf.level=1-alpha)
Example 4.4. Again working with the two samples normal and during, we have the following summaries for the differences (see R-Code 4.5 and Test 3): mean: −0.48; standard deviation: 0.91; sample size: 17.

H0 : δ = 0, i.e., H0 : µx = µy
tobs = |−0.48| / (0.91/√17) = 2.19
ttab = t16,1−0.05/2 = 2.12,   p-value: 0.045.

Now, by taking into account that we have two measurements on the same individual, we have moderate evidence that there is a difference between the two periods. ♣
The t-tests require normally distributed data. The tests are relatively robust against deviations from normality, as long as there are no extreme outliers. Otherwise, rank-based tests can be used (see Chapter 6):
• U-test (Wilcoxon–Mann–Whitney test),
• Wilcoxon test.
The assumption of normality can be verified quantitatively with formal normality tests (χ²-test, Shapiro–Wilk test, Kolmogorov–Smirnov test). Often, however, a qualitative verification is sufficient (e.g., with the help of a Q-Q plot).
R-Code 4.5 Two-sample t-test with paired samples, portions (see Example 4.4 and
Test 3).
t.test( portions.normal, portions.during, paired=TRUE)
##
## Paired t-test
##
## data: portions.normal and portions.during
## t = -2.19, df = 16, p-value = 0.044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.95124 -0.01464
## sample estimates:
## mean of the differences
## -0.48294
# Same result as with
# t.test( portions.normal - portions.during)
4.4 Duality of Tests and Confidence Intervals
There is a close connection between a significance test and a confidence interval for the
same parameter. Rejection of H0 : θ = θ0 with a significance level α is equivalent to θ0 not
being included in the (1−α) confidence interval of θ. The duality of tests and confidence
intervals is quickly apparent when the criterion for the test decision is reworded, here for
example, “compare a mean value with the theoretical value”.
As shown above, H0 is rejected if tobs > ttab, i.e., if

tobs = (|x̄ − µT|/s) · √n > ttab.   (4.6)

H0 is not rejected if

tobs = (|x̄ − µT|/s) · √n ≤ ttab.   (4.7)

This can be rewritten as

|x̄ − µT| ≤ ttab · s/√n,   (4.8)

which also means that H0 is not rejected if both

x̄ − µT ≥ −ttab · s/√n   and   x̄ − µT ≤ ttab · s/√n.   (4.9)

We can again rewrite this as

µT ≤ x̄ + ttab · s/√n   and   µT ≥ x̄ − ttab · s/√n,   (4.10)
which corresponds to the boundaries of the (1−α) confidence interval for µT . Analogously,
this duality can be established for the other tests described in this chapter.
Example 4.5. Consider the situation from Example 4.2. Instead of comparing the p-
value, we can also consider the confidence interval, whose boundary values are 6.67 and
8.79. Since 9 is not in this range, the null hypothesis is rejected.
This is shown in Figure 4.1. ♣
In R most test functions give the corresponding confidence intervals with the value of
the statistic and p-value. Some functions require the additional argument conf.int=TRUE,
as well.
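The duality is easy to see with the portions data from Example 4.2 (values typed in from R-Code 4.1):

```r
# Sketch: test decision and CI coverage coincide (data from R-Code 4.1)
portions.normal <- c(5.17, 9.00, 12.69, 6.50, 5.76, 7.60, 6.40, 5.19, 9.90,
                     7.60, 6.40, 8.39, 6.06, 8.40, 10.80, 8.80, 6.70)
tt <- t.test(portions.normal, mu = 9)
tt$p.value < 0.05                        # TRUE: reject H0 at the 5% level
9 < tt$conf.int[1] | 9 > tt$conf.int[2]  # TRUE: 9 lies outside the 95% CI
```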
4.5 Multiple Testing
In many cases, we want to perform not just one test, but a series of tests. We must then
be aware that the significance level α only holds for a single test. In the case of a single
test, the probability of a falsely significant test result equals the significance level, usually
α = 0.05. The probability that the null hypothesis H0 is correctly not rejected is then
1− 0.05 = 0.95.
Consider the situation in which m > 1 tests are performed. The probability of obtaining at least one falsely significant test result is then equal to one minus the probability that no falsely significant test results are obtained. It holds that

P(at least one false significant result) = 1 − P(no false significant result)   (4.11)
                                         = 1 − (1 − α)^m.   (4.12)
In Table 4.2 the probabilities of at least one false significant result for α = 0.05 and
various m are presented. Even for just a few tests, the probability increases drastically,
which should not be tolerated.
Table 4.2: Probabilities of at least one false significant test result when per-
forming m tests at level α = 5%.
m              1     2     3     4     5     6     8    10    20   100
1 − (1 − α)^m  0.05  0.10  0.14  0.19  0.23  0.26  0.34  0.40  0.64  0.99
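The entries of Table 4.2 follow directly from Equation (4.12), e.g., in R:

```r
# Reproducing the second row of Table 4.2 from 1 - (1 - alpha)^m
m <- c(1, 2, 3, 4, 5, 6, 8, 10, 20, 100)
round(1 - (1 - 0.05)^m, 2)
```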
There are several different methods that allow multiple tests to be performed while
maintaining the selected significance level. The simplest and most well-known of them is
the Bonferroni correction. Instead of comparing the p-value of every test to α, they are
compared to a new significance level, αnew = α/m. There are several alternative methods,
which, according to the situation, may be more appropriate. See, for example, Farcomeni
(2008).
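The Bonferroni correction is implemented in R's p.adjust: instead of lowering the level to α/m, the p-values are scaled up by m, which is equivalent. A sketch with hypothetical p-values of our own:

```r
# Sketch: Bonferroni adjustment of four hypothetical p-values
pvals <- c(0.001, 0.015, 0.030, 0.210)
p.adjust(pvals, method = "bonferroni")  # same as pmin(pvals * 4, 1)
# Decisions at alpha = 0.05: only the first remains significant
```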
4.6 Additional Tests
Just as means can be compared, there are also tests for comparing variances. However,
instead of taking differences as in the case of means, we take the ratio of the estimated
variances. If the two estimates are similar, the ratio should be close to one. The test
statistic is accordingly S2y
/S2y and the distribution thereof is an F -distribution (see also
Section 2.7.4). This “classic” F -test is given in Test 4.
In latter chapters we will see more natural settings where we need to compare variances
(not necessary from a priori two different samples).
Test 4: Comparing two variances

Question: Are the variances s²x and s²y of two samples significantly different?

Assumptions: Both populations, from which the samples arise, are normally distributed. The samples are independent and the observed values are from the interval scale.

Calculation: Fobs = s²x/s²y (the larger variance always goes in the numerator, so that s²x > s²y).

Decision: Reject H0: σ²x = σ²y if Fobs > Ftab = Fnx−1,ny−1,1−α.

Calculation in R: var.test( x, y, conf.level=1-alpha)
Example 4.6. The two-sample t-test (Test 2) requires equal variances. The data portions does not contradict this null hypothesis of equal variances, as R-Code 4.6 shows (compare Example 4.3 and R-Code 4.4). ♣
The so-called chi-square test (χ²-test) verifies whether the observed data follow a particular distribution. This test is based on a comparison of the observed with the expected frequencies and can also be used in other situations, for example, to test whether two categorical variables are independent (Test 6).

Under the null hypothesis, the test statistic of the chi-square test is χ² distributed. The quantile, density, and distribution functions are implemented in R with [q,d,p]chisq (see Section 2.7.2).

In Test 5, the categories should be aggregated so that all eᵢ ≥ 5. Additionally, N − k > 1.
R-Code 4.6 Comparison of two variances, portions (see Example 4.6 and Test 4).
var.test( portions.normal, portions.during)
##
## F test to compare two variances
##
## data: portions.normal and portions.during
## F = 1.18, num df = 16, denom df = 16, p-value = 0.75
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4268 3.2544
## sample estimates:
## ratio of variances
## 1.1785
Test 5: Comparison of observations with expected frequencies

Question: Do the observed frequencies oi of a sample deviate significantly
from the expected frequencies ei of a certain distribution?

Assumptions: Sample with data from any scale.

Calculation: Calculate the observed frequencies oi and the expected frequencies
ei (with the help of the expected distribution) and then compute

χ²obs = Σ_{i=1}^{N} (oi − ei)² / ei,

where N is the number of categories.

Decision: Reject H0: “no deviation between the observed and expected”
if χ²obs > χ²tab = χ²_{N−1−k, 1−α}, where k is the number of parameters
estimated from the data.

Calculation in R: chisq.test( obs, p=expect/sum(obs))
or chisq.test( obs, p=expect, rescale.p=TRUE)
Example 4.7. With only 17 observations it is often pointless to test for normality of the
data. Even for larger samples, a Q-Q plot is more informative. For completeness, the
procedure is shown using the dietary observations for the time periods “normal” and
“during menstruation” in R-Code 4.7.
The binning of the data is done through a histogram-type binning (using a table(
cut( )) construction would be an alternative approach). As we have fewer than five
observations in the last two bins, the function issues a warning. Additionally, the degrees
of freedom should be N − 1 − k = 5 − 1 − 2 = 2 rather than four, as we estimate the mean
and standard deviation to determine the expected counts. Both effects could be mitigated
if we calculate the p-value using a bootstrap simulation by setting the argument
simulate.p.value=TRUE. ♣
R-Code 4.7 Testing normality, portions (see Example 4.7 and Test 5).
portions.normal.during <- c(portions.normal, portions.during)
observed <- hist( portions.normal.during, plot=FALSE, breaks=4)
# without 'breaks' argument there are too many categories
observed[1:2]
## $breaks
## [1] 4 6 8 10 12 14
##
## $counts
## [1] 4 13 13 2 2
m <- mean( portions.normal.during)
s <- sd( portions.normal.during)
p <- pnorm( observed$breaks, mean=m, sd=s)
chisq.test( observed$counts, p=diff( p), rescale.p=TRUE)
## Warning in chisq.test(observed$counts, p = diff(p), rescale.p =
TRUE): Chi-squared approximation may be incorrect
##
## Chi-squared test for given probabilities
##
## data: observed$counts
## X-squared = 4.36, df = 4, p-value = 0.36
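As discussed in Example 4.7, the degrees of freedom should be 2 rather than 4. The p-value can be recomputed by hand from the reported test statistic (a sketch; the value 4.36 is taken from the output above):

```r
X2 <- 4.36                           # observed test statistic from chisq.test()
pchisq(X2, df=4, lower.tail=FALSE)   # p-value as reported (df = N - 1 = 4)
pchisq(X2, df=2, lower.tail=FALSE)   # corrected df = N - 1 - k = 2, roughly 0.11
```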
4.7 Bibliographic Remarks and Links
There are many books about statistical testing, ranging from accessible to very mathe-
matical, too many to even start a list here.
Chapter 5
A Closer Look: Proportions
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter05.R.
Pre-eclampsia is a hypertensive disorder occurring during pregnancy (gestational hy-
pertension) with symptoms including: edema, high blood pressure, and proteinuria. In
a double-blinded randomized controlled trial (RCT) 2706 pregnant women were treated
with either a diuretic or with a placebo (Landesman et al., 1965). Pre-eclampsia was
diagnosed in 138 of the 1370 subjects in the treatment group and in 175 of the subjects
receiving placebo. The medical question is whether diuretic medications, which reduce
water retention, reduce the risk of pre-eclampsia.
In this chapter we have a closer look at statistical techniques that help us to correctly
answer this question. More precisely, we will estimate proportions and compare them
with each other.
5.1 Estimation
We start with a simple setting where we observe occurrences of a certain event and are
interested in the proportion of the events over the total population. More specifically,
we consider the number of successes in a sequence of experiments, i.e., whether a certain
treatment had an effect. We often use a binomial random variable X ∼ Bin(n, p) for
such a setting, where n is given or known and p is unknown, the parameter of interest.
Intuitively, we find x/n to be an estimate of p and X/n the corresponding estimator. We
will construct a confidence interval for p in the next section.
With the method of moments we obtain the estimator pMM = X/n, since np = E(X)
and we have only one observation (total number of cases). The estimator is identical to
the intuitive estimator.
The likelihood estimator is constructed as follows:

L(p) = (n choose x) p^x (1 − p)^(n−x),   (5.1)

ℓ(p) = log( L(p) ) = log (n choose x) + x log(p) + (n − x) log(1 − p),   (5.2)

dℓ(p)/dp = x/p − (n − x)/(1 − p) = 0  ⇒  x/p̂ML = (n − x)/(1 − p̂ML).   (5.3)

Thus, the maximum likelihood estimator is (again) p̂ML = X/n.
In our example we have the following estimates: p̂T = 138/1370 ≈ 10% in the treatment
group and p̂C = 175/1336 ≈ 13% in the control group. The question then arises as to
whether the two proportions are different enough to be able to speak of an effect of the drug.
Once we have an estimator, we can now answer questions, such as:
i) How many cases of pre-eclampsia can be expected in a group of 100 pregnant women?
ii) What is the probability that more than 20 cases occur?
Note, however, that the estimator p̂ = X/n does not have a “classical” distribution.
Figure 5.1 illustrates the probability mass function based on the estimate for the pre-
eclampsia cases in the treated group. The figure visually suggests the use of a Gaussian
approximation, which is well justified here as np̂(1 − p̂) ≈ 124 ≥ 9. The Gaussian
approximation for X is then used to state that the estimator p̂ is also approximately Gaussian.
Figure 5.1: Probability mass function with the normal approximation (left panel:
full range 0 ≤ x ≤ 1400; right panel: detail for 100 ≤ x ≤ 180).
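A plot in the spirit of Figure 5.1 can be sketched with a few lines, assuming p̂ = 138/1370 as in the treated group:

```r
n <- 1370
p <- 138/1370                 # estimate from the treated group
x <- 90:190                   # region around the mean n*p = 138
plot( x, dbinom(x, size=n, prob=p), type='h', xlab='x', ylab='probability')
# normal approximation N(np, np(1-p)), justified since np(1-p) is large
curve( dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), add=TRUE, col=2)
```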
When dealing with proportions we often speak of odds, or simply of chance, defined
by ω = p/(1 − p). The corresponding intuitive estimator (and estimate) is ω̂ = p̂/(1 − p̂).
As a side note, this estimator also coincides with the maximum likelihood estimator.
5.2 Confidence Intervals
To construct a confidence interval for the parameter p we use the Gaussian approximation
for the binomial distribution:

1 − α ≈ P( z_{α/2} ≤ (X − np) / √(np(1 − p)) ≤ z_{1−α/2} ).   (5.4)
This can be rewritten as

1 − α ≈ P( z_{α/2} √(np(1 − p)) ≤ X − np ≤ z_{1−α/2} √(np(1 − p)) )   (5.5)

= P( −X/n + z_{α/2} (1/n) √(np(1 − p)) ≤ −p ≤ −X/n + z_{1−α/2} (1/n) √(np(1 − p)) ).   (5.6)

As a further approximation we replace p inside the square root by the estimator p̂ = X/n
and have

1 − α ≈ P( −X/n + z_{α/2} (1/n) √(np̂(1 − p̂)) ≤ −p ≤ −X/n + z_{1−α/2} (1/n) √(np̂(1 − p̂)) ).   (5.7)
Since p̂ = x/n (as estimate) and q := z_{1−α/2} = −z_{α/2}, we have as empirical confidence
interval

b_{l,u} = p̂ ± q · √( p̂(1 − p̂)/n ) = p̂ ± q · SE(p̂).   (5.8)
Remark 5.1. For (regular) models with parameter θ, as n → ∞, likelihood theory
states that the estimator θ̂ML is asymptotically normally distributed with expected value θ
and variance Var(θ̂ML).

Since Var(X/n) = p(1 − p)/n, one can take SE(p̂) = √( p̂(1 − p̂)/n ). The so-called
Wald confidence interval rests upon this assumption (which can naturally be shown
more formally) and is identical to (5.8).
If the inequality in (5.4) is solved through a quadratic equation (in p), we obtain the
empirical Wilson confidence interval

b_{l,u} = 1/(1 + q²/n) · ( p̂ + q²/(2n) ± q · √( p̂(1 − p̂)/n + q²/(4n²) ) ),   (5.9)

where we use the estimate p̂ = x/n.
The Wilson confidence interval is “more complicated” than the Wald confidence
interval. Is it also “better” because one less approximation is required?
Confidence intervals for proportions

An approximate (1 − α) Wald confidence interval for a proportion is

B_{l,u} = p̂ ± q · √( p̂(1 − p̂)/n )   (5.10)

with estimator p̂ = X/n. An approximate (1 − α) Wilson confidence interval
for a proportion is

B_{l,u} = 1/(1 + q²/n) · ( p̂ + q²/(2n) ± q · √( p̂(1 − p̂)/n + q²/(4n²) ) ).   (5.11)

In R an exact (1 − α) confidence interval for a proportion is computed with
binom.test( x, n)$conf.int.
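For the treatment group of the pre-eclampsia data (x = 138, n = 1370), the Wald interval (5.10), the Wilson interval (5.11) and the exact interval can be compared; a sketch:

```r
x <- 138; n <- 1370
q <- qnorm(0.975)
phat <- x/n
wald <- phat + c(-1, 1)*q*sqrt( phat*(1 - phat)/n)              # eq (5.10)
wilson <- (phat + q^2/(2*n) +                                   # eq (5.11)
    c(-1, 1)*q*sqrt( phat*(1 - phat)/n + q^2/(4*n^2)))/(1 + q^2/n)
exact <- binom.test(x, n)$conf.int                              # Clopper-Pearson
rbind( Wald=wald, Wilson=wilson, Exact=exact)
```

For this large n all three intervals are very close.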
Ideally the coverage probability of a (1 − α) confidence interval should be 1 − α. For
a discrete random variable, the coverage is

P( p ∈ CI ) = Σ_{x=0}^{n} P(X = x) · I_{p ∈ CI(x)}.   (5.12)

R-Code 5.1 calculates the coverage of the 95% confidence intervals for X ∼ Bin(n =
50, p = 0.4) and demonstrates that the Wilson confidence interval has better coverage
(96% compared to 93%).
R-Code 5.1: Coverage of 95% confidence intervals for X ∼ Bin(n = 50, p =
0.4).
p <- .4
n <- 20
x <- 0:n
WaldCI <- function(x, n) {  # eq (5.10)
  mid <- x/n
  se <- sqrt( x*(n-x)/n^3)
  cbind( pmax(0, mid - 1.96*se), pmin(1, mid + 1.96*se))
}
WaldCIs <- WaldCI(x,n)
Waldind <- (WaldCIs[,1] <= p) & (WaldCIs[,2] >= p)
Waldcoverage <- sum( dbinom(x, n, p)*Waldind) # eq (5.12)
WilsonCI <- function(x, n) {  # eq (5.11)
  mid <- (x + 1.96^2/2)/(n + 1.96^2)
  se <- sqrt(n)/(n + 1.96^2)*sqrt( x/n*(1 - x/n) + 1.96^2/(4*n))
  cbind( pmax(0, mid - 1.96*se), pmin(1, mid + 1.96*se))
}
WilsonCIs <- WilsonCI(x,n)
Wilsonind <- (WilsonCIs[,1] <= p) & (WilsonCIs[,2] >= p)
Wilsoncoverage <- sum( dbinom(x, n, p)*Wilsonind)
print( c(true=0.95, Wald=Waldcoverage, Wilson=Wilsoncoverage))
## true Wald Wilson
## 0.95000 0.92802 0.96301
R-Code 5.2 Tips for construction of Figure 5.2.
p <- seq(0, 1, .001)
#
# either a loop over all elements in p or a few applies
# recomputing 'Waldind'/'Waldcoverage' and 'Wilsonind'/'Wilsoncoverage'
#
# Waldcoverage and Wilsoncoverage are thus vectors!
Waldsmooth <- loess( Waldcoverage ~ p, span=.1)
Wilsonsmooth <- loess( Wilsoncoverage ~ p, span=.1)
plot( p, Waldcoverage, type='l', ylim=c(.8,1))
lines( c(-1, 2), c(.95, .95), col=2, lty=2)
lines( Waldsmooth$x, Waldsmooth$fitted, col=3, lwd=2)
The coverage depends on p, as shown in Figure 5.2 (from R-Code 5.2). The Wilson
confidence interval has coverage closer to the nominal level in the center. This observation
also holds when n is varied, as in Figure 5.3.
The width of an empirical confidence interval is b_u − b_l. For the Wald confidence
interval we obtain

2q · √( p̂(1 − p̂)/n )   (5.13)

and for the Wilson confidence interval

( 2q / (1 + q²/n) ) · √( p̂(1 − p̂)/n + q²/(4n²) ).   (5.14)
Figure 5.2: Coverage of the 95% confidence intervals for X ∼ Bin(n = 50, p), as a
function of p (left panel: Wald; right panel: Wilson). The red dashed line is the
goal 1 − α and in green we have a smoothed curve.
Figure 5.3: Coverage of the 95% confidence intervals for X ∼ Bin(n, p) as
functions of p and n (left panel: Wald CI; right panel: Wilson CI). The
probabilities are symmetric around p = 1/2. All values smaller than 0.7 are
represented with dark red.
The widths are depicted in Figure 5.4. For 5 < x < 36, the Wilson confidence interval
has a smaller width.
Figure 5.4: Widths of the empirical 95% confidence intervals for X ∼ Bin(n =
40, p), as a function of x (the Wald interval in solid green, the Wilson interval
in dashed blue).
5.3 Tests
The pre-eclampsia data can be presented with a so-called two-dimensional contingency
table (two-way table), as, for example, in Table 5.1.
If one wants to know whether the proportions in both groups are equal or whether they
come from the same distribution, a test for equality of the proportions can be conducted,
i.e., H0 : p1 = p2. This test is also called Pearson’s χ2 test and is given in Test 6. It can
also be considered a special case of Test 5.
Table 5.1: Example of a two-dimensional contingency table displaying frequencies.
The first index refers to the presence of the risk factor, the second to the diagnosis.

                                Diagnosis
                          positive   negative
   Risk    with factor      h11        h12       n1
   factor  without          h21        h22       n2
R-Code 5.3 Contingency table for the pre-eclampsia data example.
xD <- 138; xC <- 175 # positive diagnosed counts
nD <- 1370; nC <- 1336 # totals in both groups
tab <- rbind(Diuretic=c(xD, nD-xD), Control=c(xC, nC-xC))
colnames( tab) <- c('pos','neg')
tab
## pos neg
## Diuretic 138 1232
## Control 175 1161
Test 6: Test of proportions

Question: Are the proportions in the two groups the same?

Assumptions: Two sufficiently large, independent samples of binary data.

Calculation: We use the notation for cells of a contingency table, as in Table
5.1. The test statistic is

χ²obs = (h11 h22 − h12 h21)² (h11 + h12 + h21 + h22) / ( (h11 + h12)(h21 + h22)(h12 + h22)(h11 + h21) )

and, under the null hypothesis that the proportions are the same, is
X² distributed with one degree of freedom.

Decision: Reject if χ²obs > χ²tab = χ²_{1, 1−α}.

Calculation in R: prop.test( tab) or chisq.test( tab)
Example 5.1. The R-Code 5.4 shows the results for the pre-eclampsia data, once using
a proportion test and once using a Chi-squared test (comparing expected and observed
frequencies). ♣
Remark 5.2. We have presented the rows of Table 5.1 in terms of two binomials, i.e., with
two fixed marginals. In certain situations, such a table can be seen from a hypergeometric
distribution point of view (see help( dhyper)), where three margins are fixed. For this
latter view, fisher.test is the test of choice.
It is natural to extend the 2× 2-tables to general two-way tables, where many of the
concepts discussed here need to be extended. Often, at the very end, a test based on a
chi-squared distributed test statistic is used.
R-Code 5.4 Test of proportions
print(RD <- tab[2,1]/ sum(tab[2,]) - tab[1,1]/ sum(tab[1,]) )
## [1] 0.030258
prop.test( tab)
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tab
## X-squared = 5.76, df = 1, p-value = 0.016
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0551074 -0.0054088
## sample estimates:
## prop 1 prop 2
## 0.10073 0.13099
chisq.test( tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 5.76, df = 1, p-value = 0.016
5.4 Comparison of Proportions
The goal of this section is to introduce formal approaches for a comparison of two
proportions p1 and p2. This can be accomplished using (i) the difference p1 − p2, (ii) the
quotient p1/p2, or (iii) the odds ratio (p1/(1 − p1)) / (p2/(1 − p2)), which we consider
in the following three sections.
5.4.1 Difference Between Proportions

Based on the binomial distribution, the difference h11/(h11 + h12) − h21/(h21 + h22)
(seen as an estimator) is approximately normally distributed,

N( p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2 ),   (5.15)

and a corresponding confidence interval can be derived.
The risk difference (or risk reduction) describes the (absolute) difference in the
probability of experiencing the event in question.
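An approximate confidence interval based on (5.15) can be computed by hand for the pre-eclampsia data; a sketch, using the counts filled into the table in R-Code 5.3:

```r
x1 <- 138; n1 <- 1370    # diuretic group
x2 <- 175; n2 <- 1336    # control group
p1 <- x1/n1
p2 <- x2/n2
se <- sqrt( p1*(1 - p1)/n1 + p2*(1 - p2)/n2)
# approximate 95% CI for the risk difference p1 - p2, based on (5.15)
ci <- (p1 - p2) + qnorm( c(0.025, 0.975))*se
ci       # does not contain zero
```

The interval is close to the one reported by prop.test, which additionally applies a continuity correction.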
5.4.2 Relative Risk

The relative risk estimates the size of the effect of a risk factor compared with the effect
size when the risk factor is not present:

RR = P(Positive diagnosis with risk factor) / P(Positive diagnosis without risk factor).   (5.16)
The groups with or without the risk factor can also be considered the treatment and
control groups.
The relative risk assumes positive values. A value of 1 means that the risk is the same in
both groups and there is no evidence of a correlation between the diagnosis/disease/event
and the risk factor. A value greater than one is evidence of a possible positive correlation
between a risk factor and a diagnosis/disease. If the relative risk is less than one, the
exposure has a protective effect, as is the case, for example, for vaccinations.
An estimate of the relative risk is (see Table 5.1)

R̂R = p̂1/p̂2 = ( h11/(h11 + h12) ) / ( h21/(h21 + h22) ).   (5.17)
To construct confidence intervals, we consider first θ = log(RR). The standard error of θ̂
is determined with the delta method and based on Equation (2.44), applied to a binomial
instead of a Bernoulli:

Var(θ̂) = Var( log(R̂R) ) = Var( log(p̂1/p̂2) ) = Var( log(p̂1) − log(p̂2) )   (5.18)

= Var( log(p̂1) ) + Var( log(p̂2) ) ≈ (1 − p̂1)/(n1 · p̂1) + (1 − p̂2)/(n2 · p̂2)   (5.19)

≈ ( 1 − h11/(h11 + h12) ) / ( (h11 + h12) · h11/(h11 + h12) )
  + ( 1 − h21/(h21 + h22) ) / ( (h21 + h22) · h21/(h21 + h22) )   (5.20)

= 1/h11 − 1/(h11 + h12) + 1/h21 − 1/(h21 + h22).   (5.21)
A back-transformation

[ exp( θ̂ ± z_{1−α/2} SE(θ̂) ) ]   (5.22)

yields positive confidence boundaries. Note that with the back-transformation we lose
the ‘symmetry’ of estimate plus/minus standard error.
Confidence interval for relative risk (RR)

An approximate (1 − α) confidence interval for RR, based on the two-
dimensional contingency table (Table 5.1), is

[ exp( log(R̂R) ± z_{1−α/2} SE(log(R̂R)) ) ],   (5.23)

where for the empirical confidence intervals we use the estimates

R̂R = h11(h21 + h22) / ( (h11 + h12) h21 ),
SE(log(R̂R)) = √( 1/h11 − 1/(h11 + h12) + 1/h21 − 1/(h21 + h22) ).
Example 5.2. The relative risk and corresponding confidence interval for the pre-eclampsia
data are given in R-Code 5.5. The relative risk is smaller than one (diuretics reduce the
risk). An approximate 95% confidence interval does not include one. ♣
R-Code 5.5 Relative Risk with confidence interval.
print( RR <- ( tab[1,1]/ sum(tab[1,])) / ( tab[2,1]/ sum(tab[2,])) )
## [1] 0.769
s <- sqrt(
1/tab[1,1] + 1/tab[2,1] - 1/sum(tab[1,]) - 1/sum(tab[2,]) )
exp( log(RR) + qnorm(c(.025,.975))*s)
## [1] 0.62333 0.94872
5.4.3 Odds Ratio

The relative risk is closely related to the odds ratio, which is defined as

OR = ( P(Positive diagnosis with risk factor) / P(Negative diagnosis with risk factor) )
   / ( P(Positive diagnosis without risk factor) / P(Negative diagnosis without risk factor) )   (5.24)

= ( P(A)/(1 − P(A)) ) / ( P(B)/(1 − P(B)) ) = P(A)(1 − P(B)) / ( P(B)(1 − P(A)) ).   (5.25)
The odds ratio indicates the strength of an association between factors (association mea-
sure). The calculation of the odds ratio also makes sense when the number of diseased is
determined by study design, as is the case for case-control studies.
When a disease is rare (very low probability of disease), the odds ratio and relative
risk are approximately equal.
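This can be verified numerically; a small sketch with hypothetical probabilities:

```r
p1 <- 0.010; p2 <- 0.005    # hypothetical small disease probabilities
RR <- p1/p2
OR <- (p1/(1 - p1))/(p2/(1 - p2))
c( RR=RR, OR=OR)            # nearly identical for small p1 and p2
```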
An estimate of the odds ratio is

ÔR = (h11/h12) / (h21/h22) = h11 h22 / (h12 h21).   (5.26)
The construction of confidence intervals for the odds ratio is based on Equation (2.43)
and Equation (2.44), analogous to that of the relative risk.
Example 5.3. The odds ratio with confidence interval for the pre-eclampsia data is given
in R-Code 5.6. Notice that the function fisher.test (see Remark 5.2) also calculates
the odds ratios. As they are based on a likelihood calculation, there are minor differences
between both estimates. ♣
Confidence intervals for the odds ratio (OR)

An approximate (1 − α) confidence interval for OR, based on a two-
dimensional contingency table (Table 5.1), is

[ exp( log(ÔR) ± z_{1−α/2} SE(log(ÔR)) ) ],   (5.27)

where for the empirical confidence intervals we use the estimates

ÔR = h11 h22 / (h12 h21)  and  SE(log(ÔR)) = √( 1/h11 + 1/h12 + 1/h21 + 1/h22 ).
R-Code 5.6 Odds ratio with confidence interval, approximate and exact.
print( OR <- tab[1]*tab[4]/(tab[2]*tab[3]))
## [1] 0.74313
s <- sqrt( sum( 1/tab) )
exp( log(OR) + qnorm(c(.025,.975))*s)
## [1] 0.58626 0.94196
# Exact test from Fisher:
fisher.test(tab)
##
## Fisher's Exact Test for Count Data
##
## data: tab
## p-value = 0.016
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.58166 0.94826
## sample estimates:
## odds ratio
## 0.74321
5.5 Bibliographic Remarks and Links
See also the package binom and the references therein for further confidence intervals for
proportions.
Chapter 6
Rank-Based Methods
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter06.R.
In this chapter, we discuss basic approaches to estimation and testing for cases in
which the data are not normally distributed. This means that there could be outliers, the
data might not be measured on the interval/ratio scale, etc.
6.1 Robust Point Estimates
As seen in previous chapters, “classic” estimates of the mean and the variance are

x̄ = (1/n) Σ_{i=1}^{n} xi  and  s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)².   (6.1)
If, hypothetically, we set an arbitrary value xi to an infinitely large value (i.e., we create
an extreme outlier), these estimates above also “explode”.
Robust estimators are not sensitive to one or possibly several outliers. They often do
not require specific distributional assumptions on the random sample.
The mean is therefore not a robust estimate of location. A robust estimate of location
is the trimmed mean, in which the biggest and smallest values are trimmed away and
not considered. The (empirical) median (the middle value of an odd amount of data or
the center of the two middle-most values of an even amount of data) is another robust
estimate of location.
Robust estimates of the dispersion (data spread) are (1) the (empirical) interquartile
range (IQR), calculated as the difference between the third and first quartiles, and (2)
the (empirical) median absolute deviation (MAD), calculated as

MAD = c · median( |xi − median(x1, . . . , xn)| ),   (6.2)

where most software programs (including R) use c = 1.4826. The choice of c is such that,
for normally distributed random variables, the MAD seen as an estimator is unbiased,
i.e., E(MAD) = σ. Since for normally distributed random variables IQR = 2Φ−1(3/4)σ ≈
1.349σ, IQR/1.349, seen as an estimator, is an estimator of σ.
Example 6.1. Let the values 1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7 be given. Unfortunately, we
have entered the final number as 27. R-Code 6.1 compares several statistics and illustrates
the effect of a single outlier on the estimates. ♣
R-Code 6.1 Classic and robust estimates from Example 6.1.
sam <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 27)
print( c(mean(sam), mean(sam, trim=.2), median(sam)))
## [1] 5.6143 2.2400 2.1000
print( c(sd(sam), IQR(sam), mad(sam)))
## [1] 9.45083 0.85000 0.44478
The estimators of the trimmed mean or the median do not possess simple distribution
functions and for this reason the corresponding confidence intervals are not easy to
calculate. Based on

Mean ± z_{1−α/2} · √( Variance / n ),   (6.3)

approximate confidence intervals can, however, be easily constructed. For example,
median( x)+c(-2,2)*mad( x)/sqrt(length( x)) is an approximate empirical 95% confidence
interval for the median.
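As a small sketch, using the values of Example 6.1 entered correctly (without the outlier typo):

```r
x <- c(1.1, 3.2, 2.2, 1.8, 1.9, 2.1, 2.7)
# approximate empirical 95% confidence interval for the median, based on (6.3)
ci <- median( x) + c(-2, 2)*mad( x)/sqrt( length( x))
ci
```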
A second disadvantage of robust estimators is their low efficiency, i.e., these estimators
have larger variances. Formally, the efficiency is the ratio of the variance of one estimator
to the variance of the second estimator.
In some cases the exact variance of robust estimators can be determined; more often,
approximations or asymptotic results exist. For a continuous random variable with
cdf F(x), the empirical median is asymptotically normally distributed around the true
median η = F−1(1/2) with variance (4n f(η)²)−1, where f(x) is the density function. The
following example and R-Code 6.2 illustrate this result.
Example 6.2. Let X1, . . . , X10 iid∼ N(0, σ²). Figure 6.1 shows the (smoothed) empirical
densities of the sample mean and the sample median based on N = 1000 repetitions. The
empirical efficiency is roughly 74%. Because the density is symmetric, η = µ = 0 and
thus the asymptotic efficiency is

(σ²/n) / ( 1/(4n f(0)²) ) = (σ²/n) · 4n · ( 1/(√(2π)σ) )² = 2/π ≈ 64%.   (6.4)

Of course, if we change the distribution of the Xi, the efficiency changes. For example,
with a density with heavier tails than the normal, like a t-distribution with 4 degrees of
freedom, the empirical efficiency is roughly 1.23, which means that the median is better. ♣
6.1. ROBUST POINT ESTIMATES 79
R-Code 6.2 Distribution of empirical mean and median, see Example 6.2. (See
Figure 6.1.)
n <- 10
N <- 1000
samples <- array( rnorm(n*N), c(N,n))
means <- apply( samples, 1, mean)
medians <- apply( samples, 1, median)
print( c( var(means), var(medians), var(means)/var(medians)))
## [1] 0.099383 0.134253 0.740266
hist( medians, border=7, col=7, prob=T, main='', ylim=c(0,1.2))
hist( means, add=T, prob=T)
lines( density( medians), col=2)
lines( density( means), col=1)
# with a t_4 the situation is different!
samples <- array( rt(n*N, df=4), c(N,n))
means <- apply( samples, 1, mean)
medians <- apply( samples, 1, median)
print( c( var(means), var(medians), var(means)/var(medians)))
## [1] 0.19186 0.15634 1.22719
Figure 6.1: Efficiency of estimators. Medians in yellow with red density, means
in black. (See R-Code 6.2.)
The decision as to whether a realization of a random sample contains outliers is not
always easy and some care is needed. For example, for all distributions with support on
all of R, observations will lie outside the whiskers of a box plot when n is sufficiently
large and will thus be “marked” as outliers. Obvious outliers are easy to identify and
eliminate, but in less clear cases robust estimation methods are preferred.
Outliers can be very difficult to recognize in multivariate random samples, because they
are not readily apparent with respect to the marginal distributions. Robust methods for
random vectors exist, but are often computationally intense and not as intuitive as for
scalar values.
6.2 Rank-Based Tests
In Chapter 4 we considered tests to compare means. These tests assume normally dis-
tributed data for exact results. Slight deviations from a normal distribution has typically
negligible consequences as the central limit theorem reassures that the mean is approx-
imately normal. However, if outliers are present or the data is measured on the ordinal
scale, the use of so-called ‘rank-based’ tests is recommended. Classical tests typically as-
sume a distribution that is parametrized (e.g., µ, σ2 in N (µ, σ2) or p in Bin(n, p)). Rank
based test do not prescribe a detailed distribution and are thus also called non-parametric
tests.
The rank of a value in a sequence is the position (order) of that value in the ordered
sequence (from smallest to largest). In particular, the smallest value has rank 1 and the
largest rank n. In the case of ties, the arithmetic mean of the ranks is used.
Example 6.3. The values 1.1, −0.6, 0.3, 0.1, 0.6, 2.1 have ranks 5, 1, 3, 2, 4 and 6.
However, the ranks of the absolute values are 5, (3+4)/2, 2, 1, (3+4)/2 and 6. ♣
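The ranks of Example 6.3 can be verified with the function rank, which averages ranks in case of ties:

```r
x <- c(1.1, -0.6, 0.3, 0.1, 0.6, 2.1)
rank( x)         # 5 1 3 2 4 6
rank( abs( x))   # 5 3.5 2 1 3.5 6, with the tie between |-0.6| and 0.6
```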
Rank-based tests only consider the ranks of the observations, not the magnitude of
the differences between the observations. The largest value always has the same rank and
therefore always has the same influence on the test statistic.
We now introduce two classical rank tests: (i) the Mann–Whitney U test (Wilcoxon–
Mann–Whitney U test) and (ii) the Wilcoxon signed rank test, i.e., rank-based versions
of Test 2 and Test 3, respectively.
6.2.1 Wilcoxon–Mann–Whitney Test
When using rank tests, the normality assumption is dropped and we test whether the
samples come from the same distribution up to a location shift, i.e., the two distributions
have the same shape but may be shifted. The Wilcoxon–Mann–Whitney test can then be
interpreted as comparing the medians of the two populations, see Test 7.
The quantile, density and distribution functions of the test statistic U are implemented
in R with [q,d,p]wilcox. For example, the critical value Utab(nx, ny;α/2) mentioned in
Test 7 is qwilcox( .025, nx, ny) for α = 5% and corresponding p-value 2*pwilcox(
Uobs, nx, ny).
Test 7: Comparison of location for two independent samples

Question: Are the medians of two independent samples significantly different?

Assumptions: Both populations follow continuous distributions of the same
shape, the samples are independent and the data are at least ordinally
scaled.

Calculation: Let nx ≤ ny, otherwise switch the samples. Assemble the
(nx + ny) sample values into a single set ordered by rank and calculate
the sums Rx and Ry of the sample ranks. Calculate Ux and Uy:

• Ux = nx · ny + nx(nx + 1)/2 − Rx
• Uy = nx · ny + ny(ny + 1)/2 − Ry   (Ux + Uy = nx · ny)
• Uobs = min(Ux, Uy)

Decision: Reject H0: “medians are the same” if Uobs < Utab(nx, ny; α/2),
where Utab is the tabulated critical value.

Calculation in R: wilcox.test(x, y, conf.level=1-alpha)
It is possible to approximate the distribution of the test statistic U by a Gaussian one.
The U statistic is then transformed by

zobs = ( Uobs − nx·ny/2 ) / √( nx·ny·(nx + ny + 1)/12 ),   (6.5)

where nx·ny/2 is the mean of U and the denominator is its standard deviation. This
value is then compared with the respective quantile of the standard normal distribution.
The normal approximation may be used with sufficiently large samples, nx ≥ 2 and
ny ≥ 8. With additional continuity corrections, the approximation may be improved.
To construct confidence intervals, the argument conf.int=TRUE must be used in the
function wilcox.test.
In case of ties, R may not be capable to calculate exact p-values and thus will issue a
warning. The warning can be avoided by not requiring exact p-values through the setting
of the argument exact=FALSE.
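The quality of the normal approximation (6.5) can be checked against the exact distribution of U; a sketch with hypothetical values of nx, ny and Uobs:

```r
nx <- 8; ny <- 10          # hypothetical sample sizes
Uobs <- 15                 # hypothetical observed U statistic
zobs <- (Uobs - nx*ny/2)/sqrt( nx*ny*(nx + ny + 1)/12)   # eq (6.5)
# two-sided p-values: exact versus normal approximation
c( exact=2*pwilcox( Uobs, nx, ny), approx=2*pnorm( zobs))
```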
6.2.2 Wilcoxon Signed Rank Test

The Wilcoxon signed rank test is for matched/paired samples (or repeated measurements
on the same sample) and can also be used to compare the median with a theoretical value.
As in the paired two-sample t-test, we calculate the differences. However, here we compare
the ranks of the negative and positive differences. If the two samples are from the same
distribution, the ranks and thus the rank sums should be comparable. Zero differences are
omitted from the sample and the sample size is reduced accordingly. Details are given in
Test 8.
The quantile, density and distribution functions of the Wilcoxon signed rank test
statistic are implemented in R with [q,d,p]signrank. For example, the critical value
Wtab(n; α/2) mentioned in Test 8 is qsignrank( .025, n) for α = 5% and the
corresponding p-value 2*psignrank( Wobs, n).
Test 8: Comparison of location of two dependent samples

Question: Are the medians of two dependent samples significantly different?

Assumptions: Both populations follow continuous distributions of the same
shape, the samples are dependent and the data are measured at least
on the ordinal scale.

Calculation: (1) Calculate the differences di = xi − yi between the sample
values. Ignore all differences di = 0 and consider only the remaining
n⋆ differences di ≠ 0.

(2) Order the n⋆ differences di by their absolute values |di|.

(3) Calculate W+, the sum of the ranks of all the positive differences
di > 0, and also W−, the sum of the ranks of all the negative
differences di < 0:

Wobs = min(W+, W−)   (W+ + W− = n⋆(n⋆ + 1)/2)

Decision: Reject H0: “the medians are the same” if Wobs < Wtab(n⋆; α/2),
where Wtab is the tabulated critical value.

Calculation in R: wilcox.test(x-y, conf.level=1-alpha) or
wilcox.test(x, y, paired=TRUE, conf.level=1-alpha)
As in the case of the Wilcoxon–Mann–Whitney test, it is again possible to approximate
the distribution of the test statistic with a normal distribution. The statistic is
transformed as follows:

zobs = ( Wobs − n⋆(n⋆ + 1)/4 ) / √( n⋆(n⋆ + 1)(2n⋆ + 1)/24 ),   (6.6)

and then zobs is compared with the corresponding quantile of the standard normal
distribution. This approximation may be used when the sample is sufficiently large,
n⋆ ≥ 20.
Example 6.4. Consider the portions data again. R-Code 6.3 performs these tests. As
expected, the p-values are similar to those in Chapter 4. Because ties exist, exact
p-values cannot be calculated. To avoid warnings, use the argument exact=FALSE.
The advantage of robust methods becomes clear when the first value of the second
series is badly corrupted, as shown towards the end of the same R-Code. While the
p-value of the signed rank test virtually does not change, the one from the paired two-sample
t-test changes from 0.4 (see R-Code 4.5) to 0.27. ♣
R-Code 6.3: Rank tests and comparison of paired tests with a corrupted
observation.
wilcox.test( portions.normal, mu=9, exact=FALSE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: portions.normal
## V = 25, p-value = 0.028
## alternative hypothesis: true location is not equal to 9
wilcox.test( portions.normal, portions.during, exact=FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: portions.normal and portions.during
## W = 118, p-value = 0.37
## alternative hypothesis: true location shift is not equal to 0
wilcox.test( portions.normal, portions.during, paired=TRUE)
## Warning in wilcox.test.default(portions.normal, portions.during,
## paired = TRUE): cannot compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: portions.normal and portions.during
## V = 31.5, p-value = 0.035
## alternative hypothesis: true location shift is not equal to 0
portions.during[1] <- 62.5 # badly corrupted value
t.test( portions.normal, portions.during, paired=TRUE)[3:4]
## $p.value
## [1] 0.27481
##
## $conf.int
## [1] -10.9003 3.3167
## attr(,"conf.level")
## [1] 0.95
wilcox.test( portions.normal, portions.during, paired=TRUE,
conf.int=TRUE, exact=FALSE)[c("p.value","conf.int")]
## $p.value
## [1] 0.031226
##
## $conf.int
## [1] -1.174996 -0.050017
## attr(,"conf.level")
## [1] 0.95
6.3 Other Tests
6.3.1 Sign Test
Given paired data X1, . . . , Xn, Y1, . . . , Yn from symmetric distributions, we consider the
signs of the differences Di = Yi−Xi. Under the null hypothesis H0 : P(Di > 0) = P(Di <
0) = 1/2, the signs are binomially distributed with probability p = 0.5. If too many or
too few negative signs are observed, the null hypothesis is rejected. If an observed
difference di is zero, it is dropped from the sample and n is reduced accordingly.
The test is based on very weak assumptions and therefore has little power. The test
procedure is summarized in Test 9, and R-Code 6.4 illustrates the sign test along Exam-
ple 6.4. The resulting p-values are very similar to those of the Wilcoxon signed rank test.
Test 9: Sign test
Question: Are the medians of two paired/matched samples significantly dif-
ferent?
Assumptions: Both populations follow continuous distributions of the same
shape, the samples are paired (dependent) and the data are at least ordinally
scaled.
Calculation: (1) Calculate the differences di = xi − yi. Ignore all differences
di = 0 and consider only the n* differences di ≠ 0.
(2) Categorize each difference di by its sign (+ or −) and count the
number of positive signs: k = Σ_{i=1}^{n*} I(di > 0).
(3) If the medians of both samples are the same, k is a realization
from a binomial distribution Bin(n*, p = 0.5).
Decision: Reject H0: p = 0.5 (i.e., the medians are the same) if k > b_tab =
b(n*, 0.5, 1 − α/2) or k < b_tab = b(n*, 0.5, α/2), where b(n*, 0.5, q)
denotes the q-quantile of Bin(n*, 0.5).
Calculation in R: binom.test( sum( d>0), sum( d!=0),
conf.level=1-alpha)
R-Code 6.4 Sign test.
d <- portions.normal - portions.during
binom.test( sum( d>0), sum( d!=0))
##
## Exact binomial test
##
## data: sum(d > 0) and sum(d != 0)
## number of successes = 4, number of trials = 17, p-value =
## 0.049
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.068108 0.498993
## sample estimates:
## probability of success
## 0.23529
6.3.2 Permutation Test
Permutation tests can be used to answer many different questions. They are based on
the idea that, under the null hypothesis, the samples being compared are the same. In
this case, the result of the test would not change if the values of both samples were to be
randomly reassigned to either group. In R-Code 6.5 we show an example of a permutation
test for comparing the means of two independent samples.
Permutation tests are straightforward to implement manually and thus are often used
in settings where the distribution of the test statistic is complex or even unknown.
Note that the package exactRankTests is no longer actively developed; use of its
function perm.test is therefore discouraged. The functions of the package coin can be
used instead, and this package also includes extensions to other rank-based tests.
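The manual construction described above can be sketched as follows; the two samples x and y are small hypothetical data, and a Monte Carlo sample of random reallocations replaces the full enumeration of all (m+n choose n) allocations.

```r
# Sketch: Monte Carlo permutation test for a difference in means,
# using small hypothetical samples x and y.
set.seed(1)
x <- c(5.2, 4.8, 6.1, 5.5, 5.9)            # group 1, m = 5
y <- c(4.1, 4.6, 5.0, 4.4)                 # group 2, n = 4
dobs <- mean(x) - mean(y)                  # observed difference
pooled <- c(x, y)
perm <- replicate(10000, {                 # random reallocations
  s <- sample(length(pooled), length(x))   # indices of new group 1
  mean(pooled[s]) - mean(pooled[-s])
})
p.value <- mean(abs(perm) >= abs(dobs))    # proportion at least as extreme
p.value
```

With only 9 observations an exact enumeration of all choose(9, 5) = 126 allocations would also be feasible; the Monte Carlo version generalizes to larger samples.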
Test 10: Permutation test
Question: Are the means of two independent samples significantly different?
Assumptions: The null hypothesis is formulated such that, under H0, the
groups are exchangeable.
Calculation: (1) Calculate the difference dobs in the means of the two
groups to be compared (m observations in group 1, n observa-
tions in group 2).
(2) Form a random permutation of the values of both groups by ran-
domly allocating the observed values to the two groups (again m obser-
vations in group 1, n observations in group 2). There are
(m+n choose n) possibilities.
(3) Calculate the difference in means of the two new groups.
(4) Repeat this procedure many times.
Decision: Compare the p-value

(number of permuted differences at least as extreme as dobs) / (m+n choose n)

with the selected significance level.
Calculation in R: require(coin); oneway_test( formula, data)
R-Code 6.5 Permutation test comparing means of two samples.
require(coin)
Portions <- c(portions.normal, portions.during)
Group <- factor( c( rep(0, length(portions.normal)),
                    rep(1, length(portions.during))))
oneway_test(Portions ~ Group)
##
## Asymptotic Two-Sample Fisher-Pitman Permutation Test
##
## data: Portions by Group (0, 1)
## Z = -0.715, p-value = 0.47
## alternative hypothesis: true mu is not equal to 0
6.4 Bibliographic Remarks and Links
There are many (somewhat “elderly”) books about robust statistics and rank tests. Siegel
and Castellan Jr (1988) was a very accessible classic for many decades. The book by
Hollander and Wolfe (1999) treats the topic in much more detail and depth.
Chapter 7
Multivariate Normal Distribution
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter07.R.
In Chapter 2 we introduced univariate random variables. We now extend the
framework to random vectors (i.e., multivariate random variables). We will mainly focus
on continuous random vectors, especially Gaussian random vectors. In the framework of
this document we can only cover a tiny part of the beautiful theory; we are pragmatic
and discuss what will be needed in the sequel.
7.1 Random Vectors
A random vector is a (column) vector X = (X1, . . . , Xp)T with p random variables as
components. The following definition is the generalization of the univariate cdf to the
multivariate setting (see Definition 2.1).
Definition 7.1. The multidimensional, i.e. multivariate, distribution function of X is
defined as
FX(x ) = P(X ≤ x ) = P(X1 ≤ x1, . . . , Xp ≤ xp), (7.1)
where the list is to be understood as the intersection (∩). ♦
The multivariate distribution function generally contains more information than the
set of marginal distributions, because (7.1) only simplifies to

F_X(x) = ∏_{i=1}^p P(Xi ≤ xi)

under independence (compare to Equation (2.3)).
The probability density function for a continuous random vector is defined in a similar
manner as for random variables.
Definition 7.2. The probability density function (density function, pdf) fX(x) of a p-
dimensional continuous random vector X is defined by

P(X ∈ A) = ∫_A fX(x) dx,   for all A ⊂ R^p.    (7.2)

♦
For convenience, we summarize here a few facts of random vectors with two continuous
components, i.e., for a bivariate random vector (X, Y )T. The univariate counterparts are
stated in Properties 2.1 and 2.3.
• The distribution function is monotonically increasing:
for x1 ≤ x2 and y1 ≤ y2, FX,Y (x1, y1) ≤ FX,Y (x2, y2).
• The distribution function is normalized:
lim_{x,y→∞} FX,Y (x, y) = FX,Y (∞, ∞) = 1.
We use the slight abuse of notation of writing ∞ in arguments instead of a limit.
• FX,Y (−∞, −∞) = FX,Y (x, −∞) = FX,Y (−∞, y) = 0.
• FX,Y (x, y) and fX,Y (x, y) are continuous (almost) everywhere.
• fX,Y (x, y) = ∂²/∂x∂y FX,Y (x, y).
• P(a < X ≤ b, c < Y ≤ d) = ∫_a^b ∫_c^d fX,Y (x, y) dy dx
= FX,Y (b, d) − FX,Y (b, c) − FX,Y (a, d) + FX,Y (a, c).
In the multivariate setting there is also the concept of marginalization, i.e., reducing
a higher-dimensional random vector to a lower-dimensional one. Intuitively, we “neglect”
components of the random vector by allowing them to take any value. In two dimensions,
we have
• FX(x) = P(X ≤ x, Y arbitrary) = FX,Y (x, ∞);
• fX(x) = ∫_R fX,Y (x, y) dy.
Definition 7.3. The expected value of a random vector X is defined as

E(X) = E( (X1, . . . , Xp)^T ) = ( E(X1), . . . , E(Xp) )^T.    (7.3)

♦
Hence the expectation of a random vector is simply the vector of the individual ex-
pectations. Of course, to calculate these, we only need the marginal univariate densities
fXi(x), and thus the expectation does not depend on whether (7.1) factors or not. The
variance of a random vector requires a bit more thought and we first need the following.
Definition 7.4. The covariance between two arbitrary random variables X1 and X2 is
defined as

Cov(X1, X2) = E( (X1 − E(X1)) (X2 − E(X2)) ).    (7.4)

♦
Using the linearity properties of the expectation operator, it is possible to show the
following handy properties.
Property 7.1. We have for arbitrary random variables X1, X2 and X3:
i) Cov(X1, X2) = Cov(X2, X1),
ii) Cov(X1, X1) = Var(X1),
iii) Cov(a + bX1, c + dX2) = bd Cov(X1, X2),
iv) Cov(X1, X2 +X3) = Cov(X1, X2) + Cov(X1, X3).
The covariance describes the linear relationship between the random variables. The
correlation between two random variables X1 and X2 is defined as

Corr(X1, X2) = Cov(X1, X2) / sqrt( Var(X1) Var(X2) )    (7.5)

and corresponds to the normalized covariance. It holds that −1 ≤ Corr(X1, X2) ≤ 1,
with equality only in the degenerate case X2 = a + bX1 for some a and b ≠ 0.
Definition 7.5. The variance of a p-variate random vector X = (X1, . . . , Xp)^T is defined
as

Var(X) = E( (X − E(X)) (X − E(X))^T )    (7.6)
       = the p × p matrix with entries Cov(Xi, Xj) and diagonal Var(X1), . . . , Var(Xp),    (7.7)

called the covariance matrix or variance–covariance matrix. ♦
Similar to Properties 3.4, we have the following properties for random vectors.
Property 7.2. For an arbitrary p-variate random vector X, vectors a ∈ Rq and matrices
B ∈ Rq×p it holds:
i) Var(X) = E(XXT)− E(X) E(X)T,
ii) E(a + BX) = a + B E(X),
iii) Var(a + BX) = B Var(X)BT.
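Property 7.2 iii) is easy to check empirically by simulation; the vector a, matrix B and covariance matrix Sigma below are arbitrary illustrative choices, and only base R is used (the property holds for any random vector, not only Gaussian ones).

```r
# Sketch: empirical check of Var(a + BX) = B Var(X) B^T with
# illustrative a, B and Sigma; X is simulated such that Var(X) = Sigma.
set.seed(2)
Sigma <- matrix(c(2, 1, 1, 3), 2)                  # Var(X)
B <- matrix(c(1, 0, 2, -1), 2)                     # transformation matrix
a <- c(1, -1)
X <- matrix(rnorm(2e5), ncol = 2) %*% chol(Sigma)  # Var(X) = Sigma
Y <- t(a + B %*% t(X))                             # rows are a + B x_i
var(Y)                                             # close to B Sigma B^T
B %*% Sigma %*% t(B)
```

Adding the constant vector a shifts the mean but, as the property states, leaves the variance untouched.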
7.2 Multivariate Normal Distribution
We now consider a special multivariate distribution: the multivariate normal distribution,
by first considering the bivariate case.
7.2.1 Bivariate Normal Distribution
Definition 7.6. The random variable pair (X, Y ) has a bivariate normal distribution if

FX,Y (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y (u, v) dv du    (7.8)

with density

f(x, y) = fX,Y (x, y)    (7.9)
  = 1 / ( 2π σx σy sqrt(1 − ρ²) )
    · exp( − 1/(2(1 − ρ²)) · [ (x − µx)²/σx² + (y − µy)²/σy² − 2ρ(x − µx)(y − µy)/(σx σy) ] ),

for all x and y, where µx ∈ R, µy ∈ R, σx > 0, σy > 0 and −1 < ρ < 1. ♦
The role of some of the parameters µx, µy, σx, σy and ρ might be guessed. We will
discuss their precise meaning after the following example.
Example 7.1. R-Code 7.1 and Figure 7.1 show the density of a bivariate normal dis-
tribution with µx = µy = 0, σx = 1, σy = √5, and ρ = 2/√5 ≈ 0.9. Because of the
quadratic form in (7.9), the contour lines (isolines) are ellipses.
Several R packages implement the bivariate/multivariate normal distribution. We
recommend mvtnorm. ♣
R-Code 7.1 Density of a bivariate normal distribution. (See Figure 7.1.)
require( mvtnorm)
require( fields) # providing tim.colors()
Sigma <- array( c(1,2,2,5), c(2,2))
x1 <- seq( -3, to=3, length=100)
x2 <- seq( -3, to=3, length=99)
grid <- expand.grid( x1, x2)
densgrid <- dmvnorm( grid, mean=c(0,0), sigma=Sigma)
dens <- matrix( densgrid, length( x1), length( x2))
image(x1, x2, dens, col=tim.colors()) # left panel
faccol <- tim.colors()[cut(dens[-1,-1],64)]
persp(x1, x2, dens, col=faccol, border = NA, # right panel
tick='detailed', theta=120, phi=30, r=100)
Figure 7.1: Density of a bivariate normal distribution. (See R-Code 7.1.)
The multivariate normal distribution has many nice properties.
Property 7.3. For the bivariate normal distribution the marginal distributions
are X ∼ N (µx, σx²) and Y ∼ N (µy, σy²), and

E( (X, Y )^T ) = (µx, µy)^T,    Var( (X, Y )^T ) = [ σx²  ρσxσy ;  ρσxσy  σy² ].    (7.10)

Thus,

Cov(X, Y ) = ρσxσy,    Corr(X, Y ) = ρ.    (7.11)

If ρ = 0, X and Y are independent and vice versa.
Note, however, that the equivalence of independence and uncorrelatedness is specific
to jointly normal variables and cannot be assumed for random variables that are not
jointly normal.
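A classical counterexample can be sketched quickly: for X ∼ N (0, 1) and Y = X², Cov(X, Y ) = E(X³) = 0, so X and Y are uncorrelated, yet Y is completely determined by X.

```r
# Sketch: uncorrelated but dependent random variables (not jointly
# normal): X ~ N(0,1) and Y = X^2.
set.seed(3)
x <- rnorm(1e5)
y <- x^2                      # deterministic function of x
cor(x, y)                     # close to zero: uncorrelated
cor(x^2, y)                   # equal to one: perfectly dependent
```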
Example 7.2. R-Code 7.2 and Figure 7.2 show realizations from a bivariate normal
distribution for various values of the correlation ρ. Even for the large samples shown
here (n = 500), correlations between −0.25 and 0.25 are barely perceptible. ♣
R-Code 7.2 Realizations from a bivariate normal distribution for various values of ρ
(binorm data). (See Figure 7.2.)
set.seed(12)
rho <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
  Sigma <- array( c(1, rho[i], rho[i], 1), c(2,2))
  sample <- rmvnorm( 500, sigma=Sigma)
  plot( sample, pch='.', xlab='', ylab='')
  legend( "topleft", legend=bquote(rho==.(rho[i])), bty='n')
}
Figure 7.2: Realizations from a bivariate normal distribution. (See R-
Code 7.2.)
7.2.2 Multivariate Normal Distribution
In the general case we have to use vector notation. Surprisingly, we gain clarity even
compared to the bivariate case.
Definition 7.7. The random vector X = (X1, . . . , Xp)^T is multivariate normally dis-
tributed if

FX(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xp} fX(u1, . . . , up) du1 · · · dup    (7.12)

with density

fX(x1, . . . , xp) = fX(x) = 1 / ( (2π)^{p/2} det(Σ)^{1/2} ) · exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )    (7.13)

for all x ∈ R^p (µ ∈ R^p and symmetric, positive-definite Σ). We denote this distribution
by X ∼ Np(µ, Σ). ♦
Property 7.4. For the multivariate normal distribution we have:
E(X) = µ , Var(X) = Σ . (7.14)
Property 7.5. Let a ∈ R^q, B ∈ R^{q×p}, q ≤ p, rank(B) = q, and X ∼ Np(µ, Σ). Then

a + BX ∼ Nq( a + Bµ, BΣB^T ).    (7.15)
This last property has profound consequences. It asserts, for example, that the one-
dimensional marginal distributions are again Gaussian, with Xi ∼ N( (µ)i, (Σ)ii ),
i = 1, . . . , p. Similarly, any subset of components of X is again Gaussian with the
appropriate selection of the mean and covariance matrix.
We now discuss how to draw realizations from an arbitrary Gaussian random vector,
much in the spirit of Property 2.6.ii. Let I ∈ R^{p×p} be the identity matrix, the square
matrix with ones on the main diagonal and zeros elsewhere, and let L ∈ R^{p×p} be such
that LL^T = Σ. That means, L is like a “matrix square root” of Σ.
To draw a realization x from a p-variate random vector X ∼ Np(µ, Σ), one starts by
drawing p values z1, . . . , zp from Z1, . . . , Zp iid∼ N (0, 1) and sets z = (z1, . . . , zp)^T.
The vector z is then (linearly) transformed to µ + Lz. Since Z ∼ Np(0, I), Property 7.5
asserts that X = µ + LZ ∼ Np(µ, LL^T).
In practice, the Cholesky decomposition of Σ is often used, which factors a symmetric
positive-definite matrix into the product of a lower triangular matrix L and its
transpose. It holds that det(Σ) = det(L)² = ∏_{i=1}^p (L)_{ii}².
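In base R this construction reads as follows; the mean vector and covariance matrix are illustrative choices, and note that chol() returns the upper triangular factor, hence the transpose.

```r
# Sketch: drawing one realization of N_p(mu, Sigma) via the Cholesky
# factor, for an illustrative 3-dimensional example.
set.seed(4)
p <- 3
mu <- c(1, 0, -1)
Sigma <- matrix(c(2, 1, 0,
                  1, 2, 1,
                  0, 1, 2), p, p)
L <- t(chol(Sigma))             # chol() gives upper triangular R with R^T R = Sigma
z <- rnorm(p)                   # p iid standard normal draws
x <- mu + L %*% z               # one realization of N_p(mu, Sigma)
x
c(det(Sigma), prod(diag(L))^2)  # illustrates det(Sigma) = prod (L)_ii^2
```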
7.2.3 Conditional Distributions
We now consider properties of parts of the random vector X. For simplicity we write

X = ( X1^T, X2^T )^T,   X1 ∈ R^q, X2 ∈ R^{p−q},    (7.16)

and divide the mean vector µ and the matrix Σ into corresponding blocks:

X = ( X1^T, X2^T )^T ∼ Np( ( µ1^T, µ2^T )^T, [ Σ11  Σ12 ;  Σ21  Σ22 ] ).    (7.17)
Both (multivariate) marginal distributions X1 and X2 are again normally distributed with
X1 ∼ Nq(µ1,Σ11) and X2 ∼ Np−q(µ2,Σ22).
X1 and X2 are independent if Σ21 = 0 and vice versa.
Property 7.6. If one conditions a multivariate normally distributed random vector (7.17)
on a subvector, the result is itself multivariate normally distributed, with

X2 | X1 = x1 ∼ Np−q( µ2 + Σ21 Σ11^{−1} (x1 − µ1),  Σ22 − Σ21 Σ11^{−1} Σ12 ).    (7.18)

The expected value depends linearly on the value of x1, but the variance is independent
of the value of x1. The conditional expected value represents an update of X2 through
X1 = x1: the difference x1 − µ1 is normalized by the variance and scaled by the covariance.
Notice that for p = 2 (and q = 1), Σ21 Σ11^{−1} = ρσy/σx.
Equation (7.18) is probably one of the most important formulas you will encounter in
statistics, albeit not always explicitly.
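Equation (7.18) can be evaluated directly; the mean vector, covariance matrix and conditioning value x1 below are illustrative choices with q = 1.

```r
# Sketch: conditional mean and variance (7.18) for an illustrative
# 3-dimensional normal, conditioning on the first component (q = 1).
mu <- c(1, 0, 2)
Sigma <- matrix(c(4, 2, 1,
                  2, 3, 1,
                  1, 1, 2), 3, 3)
i1 <- 1; i2 <- 2:3                          # indices of X1 and X2
x1 <- 2.5                                   # observed value of X1
S11 <- Sigma[i1, i1, drop = FALSE]
S12 <- Sigma[i1, i2, drop = FALSE]
S21 <- Sigma[i2, i1, drop = FALSE]
S22 <- Sigma[i2, i2]
condmean <- mu[i2] + S21 %*% solve(S11, x1 - mu[i1])   # linear in x1
condvar <- S22 - S21 %*% solve(S11, S12)               # free of x1
drop(condmean)
condvar
```

Changing x1 moves the conditional mean along a line while condvar stays fixed, exactly as stated below (7.18).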
7.3 Estimation
The estimators are constructed in a manner similar to that for the univariate case. Let
x1, . . . , xn be a realization of the random sample X1, . . . , Xn iid∼ Np(µ, Σ). We have the
following estimators

µ̂ = X̄ = (1/n) ∑_{i=1}^n Xi,    Σ̂ = 1/(n − 1) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)^T    (7.19)

and corresponding estimates

µ̂ = x̄ = (1/n) ∑_{i=1}^n xi,    Σ̂ = 1/(n − 1) ∑_{i=1}^n (xi − x̄)(xi − x̄)^T.    (7.20)
A normally distributed random variable is determined by two parameters. A multi-
variate normally distributed random vector is determined by p parameters (for µ) plus
p(p + 1)/2 parameters (for Σ); the remaining p(p − 1)/2 entries of Σ are determined by
symmetry. However, the p(p + 1)/2 values cannot be chosen arbitrarily, since Σ must be
positive-definite (in the univariate case, σ > 0 must be satisfied). As long as n > p, the
estimator in (7.19) satisfies this condition. If p ≥ n, additional assumptions about the
structure of the matrix Σ are needed.
Example 7.3. R-Code 7.3 estimates the expected value and the covariance matrix of
the realization with ρ = 0.9 from Figure 7.2. Since n = 500, the estimates are close to
the “true” values.
The function ellipse from the package ellipse draws confidence regions for the esti-
mated parameters. R-Code 7.4 and Figure 7.3 show the estimated 95% and 50% confidence
regions. As n increases, the estimation improves. ♣
R-Code 7.3 Estimates for the realization ρ = 0.9 from Figure 7.2.
colMeans(sample) # apply( sample, 2, mean) # is identical
## [1] 0.0027344 -0.0136907
var(sample) # cov( sample) # is identical
## [,1] [,2]
## [1,] 1.01773 0.91706
## [2,] 0.91706 0.99525
R-Code 7.4 Bivariate normally distributed random numbers for various sample sizes
with contour lines of the density and estimated moments. (See Figure 7.3.)
require( ellipse)
n <- c(10, 100, 500, 1000)
mu <- c(2,1)
Sigma <- matrix( c(4,2,2,2), 2)
for (i in 1:4) {
  plot( ellipse( Sigma, cent=mu, level=.95), col='gray',
        xaxs='i', yaxs='i', xlim=c(-4,8), ylim=c(-4,6), type='l')
  lines( ellipse( Sigma, cent=mu, level=.5), col='gray')
  sample <- rmvnorm( n[i], mean=mu, sigma=Sigma)
  points( sample, pch='.', cex=2)
  Sigmahat <- cov( sample)
  muhat <- apply( sample, 2, mean)
  lines( ellipse( Sigmahat, cent=muhat, level=.95), col=2, lwd=2)
  lines( ellipse( Sigmahat, cent=muhat, level=.5), col=4, lwd=2)
  points( rbind( muhat), col=3, cex=2)
  text( -2, 4, paste('n =', n[i]))
}
Figure 7.3: Bivariate normally distributed random numbers. The contour lines
of the density are in gray, the corresponding estimated 95% (50%) confidence
region in red (blue) and the arithmetic mean in green. (See R-Code 7.4.)
7.4 Bibliographic Remarks and Links
The online book “Matrix Cookbook” is a very good synopsis of the important formulas
related to vectors and matrices (Petersen and Pedersen, 2008).
7.5 Appendix: Positive-Definite Matrices
Definition 7.8. An n × n matrix A is positive-definite if x^T A x > 0 for all x ≠ 0. If
A = A^T, the matrix is symmetric positive-definite. ♦
Property 7.7. A few important properties of symmetric positive-definite matrices A =
(aij) ∈ Rp×p are:
i) rank(A) = p;
ii) det(A) > 0;
iii) All eigenvalues of A are positive, λi > 0;
iv) aii > 0;
v) aii ajj − aij² > 0, i ≠ j;
vi) aii + ajj − 2|aij| > 0, i ≠ j;
vii) A−1 is symmetric positive-definite;
viii) All principal submatrices of A are symmetric positive-definite;
ix) There exists a nonsingular lower triangular matrix L, such that A = LLT.
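Several of these properties are easy to verify numerically; the matrix A below is a small illustrative example.

```r
# Sketch: numerical checks of some properties in Property 7.7 for an
# illustrative symmetric positive-definite 2 x 2 matrix A.
A <- matrix(c(4, 2, 2, 3), 2, 2)
eigen(A)$values               # all eigenvalues positive (iii)
det(A)                        # positive determinant (ii); here 4*3 - 2*2 = 8
L <- t(chol(A))               # lower triangular factor with A = L L^T (ix)
all.equal(L %*% t(L), A)
solve(A)                      # the inverse, again symmetric pd (vii)
```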
If we write a nonsingular matrix A as a 2 × 2 block matrix, we have

A^{−1} = [ A11  A12 ;  A21  A22 ]^{−1}
       = [ C1^{−1}   −A11^{−1} A12 C2^{−1} ;  −C2^{−1} A21 A11^{−1}   C2^{−1} ]    (7.21)

with the Schur complements

C1 = A11 − A12 A22^{−1} A21,    C2 = A22 − A21 A11^{−1} A12    (7.22)

and

C1^{−1} = A11^{−1} + A11^{−1} A12 C2^{−1} A21 A11^{−1},    C2^{−1} = A22^{−1} + A22^{−1} A21 C1^{−1} A12 A22^{−1}.    (7.23)
It also holds that det(A) = det(A11) det(C2) = det(A22) det(C1).
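These block identities can be verified numerically; the matrix A below is an arbitrary symmetric positive-definite 4 × 4 example split into 2 × 2 blocks.

```r
# Sketch: numerical check of the block inverse (7.21) and the
# determinant identity, for an illustrative 4 x 4 spd matrix.
set.seed(5)
M <- matrix(rnorm(16), 4)
A <- crossprod(M) + diag(4)              # symmetric positive-definite
i1 <- 1:2; i2 <- 3:4
A11 <- A[i1, i1]; A12 <- A[i1, i2]
A21 <- A[i2, i1]; A22 <- A[i2, i2]
C1 <- A11 - A12 %*% solve(A22) %*% A21   # Schur complements (7.22)
C2 <- A22 - A21 %*% solve(A11) %*% A12
Ainv <- rbind(cbind(solve(C1), -solve(A11) %*% A12 %*% solve(C2)),
              cbind(-solve(C2) %*% A21 %*% solve(A11), solve(C2)))
all.equal(Ainv, solve(A))                # block formula (7.21)
all.equal(det(A), det(A11) * det(C2))    # determinant identity
```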
Chapter 8
Correlation and Simple Regression
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter08.R.
In this chapter (and the following three chapters) we consider “linear models.” A
detailed discussion of linear models would fill an entire lecture module, hence we consider
here only the most important elements.
Simple linear regression is commonly considered the archetypal task of statistics
and is often introduced as early as middle school.
In this chapter, we will (i) quantify the (linear) relationship between two variables, (ii)
explain one variable through another variable with the help of a “model” and (iii) give
further tips.
8.1 Correlation
Our goal is to quantify the linear relationship between X and Y with the help of n pairs
of values (x1, y1), . . . , (xn, yn), the realizations of the two random variables (X, Y ).
An intuitive estimate of the correlation between X and Y (see Equation (7.5)) replaces
the covariance and variances by their sample counterparts:

r = ∑_i (xi − x̄)(yi − ȳ) / sqrt( ∑_i (xi − x̄)² · ∑_j (yj − ȳ)² ),    (8.1)

which is also called the Pearson correlation coefficient. Just like the correlation, the
Pearson correlation coefficient lies in the interval [−1, 1].
Example 8.1. R-Code 8.1 estimates the correlation of the scatter plot data from Fig-
ure 7.2. Although n = 500, in the case ρ = 0.1, the estimate is more than two times too
large. ♣
R-Code 8.1 binorm data: Pearson correlation coefficient of the scatter plot from
Figure 7.2. Spearman’s ρ or Kendall’s τ are given as well.
require( mvtnorm)
set.seed(12)
rho <- array(0,c(4,6))
rownames(rho) <- c("rho","Pearson","Spearman","Kendall")
rho[1,] <- c(-.25, 0, .1, .25, .75, .9)
for (i in 1:6) {
  Sigma <- array( c(1, rho[1,i], rho[1,i], 1), c(2,2))
  sample <- rmvnorm( 500, sigma=Sigma)
  rho[2,i] <- cor( sample)[2]
  rho[3,i] <- cor( sample, method="spearman")[2]
  rho[4,i] <- cor( sample, method="kendall")[2]
}
print( rho, digits=2)
## [,1] [,2] [,3] [,4] [,5] [,6]
## rho -0.25 0.000 0.10 0.25 0.75 0.90
## Pearson -0.22 0.048 0.22 0.28 0.78 0.91
## Spearman -0.22 0.066 0.18 0.26 0.77 0.91
## Kendall -0.15 0.045 0.12 0.18 0.57 0.74
The correlation is a measure for the linear relationship; however, the correlation coef-
ficient is not robust, see R-Code 8.2 and Figure 8.1.
Rank correlation coefficients, such as Spearman’s ρ or Kendall’s τ , are “robust” and
are seen as non-parametric correlation estimates. Spearman’s ρ is calculated similarly to
(8.1), where the values are replaced by their ranks. Kendall’s τ compares the number
of concordant (if xi < xj then yi < yj) and discordant pairs (if xi < xj then yi > yj).
R-Code 8.2 compares the various correlation estimates of the anscombe data in Figure 8.1.
R-Code 8.2: anscombe data: visualization and correlation estimates. (See
Figure 8.1.)
library( faraway)
data( anscombe)
head( anscombe, 3)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
with( anscombe, c(cor(x1, y1), cor(x2, y2), cor(x3, y3), cor(x4, y4)))
## [1] 0.81642 0.81624 0.81629 0.81652
with( anscombe, { plot(x1, y1); plot(x2, y2); plot(x3, y3); plot(x4, y4) })
sel <- c(0:3*9+5) # extract diagonal entries of sub-block
print(rbind( pearson=cor(anscombe)[sel],
spearman=cor(anscombe, method='spearman')[sel],
kendall=cor(anscombe, method='kendall')[sel]), digits=2)
## [,1] [,2] [,3] [,4]
## pearson 0.82 0.82 0.82 0.82
## spearman 0.82 0.69 0.99 0.50
## kendall 0.64 0.56 0.96 0.43
Figure 8.1: anscombe data, the four cases all have a Pearson correlation coef-
ficient of 0.82. (See R-Code 8.2.)
For bivariate normally distributed random variables, r is an estimate of the correlation
parameter ρ. Let R be the corresponding estimator of ρ based on (8.1). The random
variable

T = R sqrt(n − 2) / sqrt(1 − R²)    (8.2)
is, under H0 : ρ = 0, t-distributed with n− 2 degrees of freedom. The corresponding test
is described under Test 11.
Test 11: Test of correlation
Question: Is the correlation between two paired samples significant?
Assumptions: The pairs of values come from a bivariate normal distribution.
Calculation: t_obs = |r| sqrt(n − 2) / sqrt(1 − r²),
where r is the Pearson correlation coefficient (8.1).
Decision: Reject H0: ρ = 0 if tobs > ttab = tn−2,1−α/2.
Calculation in R: cor.test( x, y, conf.level=1-alpha)
In order to construct confidence intervals for correlation estimates, we typically need
the so-called Fisher transformation

W(r) = (1/2) log( (1 + r)/(1 − r) ) = arctanh(r)    (8.3)

and the fact that, for bivariate normally distributed random variables, the distribution
of W(R) is approximately N( W(ρ), 1/(n − 3) ).
Astoundingly large sample sizes are needed for correlation estimates around 0.25 to
be significant.
Confidence interval for the Pearson correlation coefficient
An approximate (1 − α) confidence interval for ρ is

[ tanh( arctanh(r) − z_{1−α/2}/sqrt(n − 3) ),  tanh( arctanh(r) + z_{1−α/2}/sqrt(n − 3) ) ],

where z_{1−α/2} is the corresponding standard normal quantile and tanh and arctanh
are the hyperbolic and inverse hyperbolic tangent functions.
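The interval can be computed by hand and compared with cor.test(), which uses exactly this Fisher-z construction for the Pearson case; the data below are simulated for illustration.

```r
# Sketch: Fisher-z confidence interval for the correlation, computed
# by hand and compared with cor.test() (simulated illustrative data).
set.seed(6)
n <- 50
x <- rnorm(n)
y <- 0.25 * x + rnorm(n)                 # moderate true correlation
r <- cor(x, y)
alpha <- 0.05
z <- qnorm(1 - alpha / 2)
ci <- tanh(atanh(r) + c(-1, 1) * z / sqrt(n - 3))
ci
cor.test(x, y)$conf.int                  # essentially the same interval
```

Note how wide the interval is for n = 50, illustrating the remark above about the sample sizes needed for moderate correlations.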
8.2 Simple Linear Regression
In simple linear regression a (dependent) variable is explained linearly through a single
independent variable:
Yi = µi + εi (8.4)
= β0 + β1xi + εi, i = 1, . . . , n, (8.5)
with
• Yi: dependent variable, measured values, observations
• xi: independent variable, predictor
• β0, β1: parameters (unknown)
• εi: measurement error, error term, noise (unknown), with symmetric distribution
around zero, E(εi) = 0.
It is often also assumed that Var(εi) = σ² and/or that the errors are independent of
each other. For simplicity, we assume εi iid∼ N (0, σ²) with unknown σ². Thus, Yi ∼
N (β0 + β1 xi, σ²), i = 1, . . . , n, and Yi and Yj are independent for i ≠ j.
Fitting a linear model in R is straightforward as shown in a motivating example below.
Example 8.2. (hardness data) One of the steps in the manufacturing of metal springs
is a quenching bath. The temperature of the bath has an influence on the hardness of the
springs. Data is taken from Abraham and Ledolter (2006). Figure 8.2 shows the Rockwell
hardness of coil springs as a function of the temperature of the quenching bath, as well as
the line of best fit. The Rockwell scale measures the hardness of technical materials and
is denoted by HR (Hardness Rockwell). R-Code 8.3 shows how a simple linear regression
for this data is performed. ♣
R-Code 8.3 hardness data from Example 8.2, see Figure 8.2.
Temp <- rep( 10*3:6, c(4, 3, 3, 4))
Hard <- c(55.8, 59.1, 54.8, 54.6, 43.1, 42.2, 45.2,
31.6, 30.9, 30.8, 17.5, 20.5, 17.2, 16.9)
plot( Temp, Hard, # scatter plot of the data
xlab="Temperature [C]", ylab="Hardness [HR]")
lm1 <- lm( Hard~Temp) # fit a linear model
abline( lm1)
Figure 8.2: hardness data: hardness as a function of temperature (see R-
Codes 8.3 and 8.4).
The main idea of regression is to estimate the parameters β0 and β1 such that the sum
of squared errors

∑_{i=1}^n (yi − β0 − β1 xi)²    (8.6)

is minimized. This concept is also called the least squares method. The solution is

β̂1 = s_xy / s_x² = r s_y / s_x = ∑_{i=1}^n (xi − x̄)(yi − ȳ) / ∑_{i=1}^n (xi − x̄)²,    (8.7)
β̂0 = ȳ − β̂1 x̄.    (8.8)

The predicted values are

ŷi = β̂0 + β̂1 xi,    (8.9)

which lie on the estimated regression line

ŷ = β̂0 + β̂1 x.    (8.10)
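The estimates (8.7) and (8.8) can be computed by hand for the hardness data and compared with the lm() fit from R-Code 8.3.

```r
# Sketch: least squares estimates (8.7) and (8.8) computed by hand
# for the hardness data and compared with lm().
Temp <- rep(10 * 3:6, c(4, 3, 3, 4))
Hard <- c(55.8, 59.1, 54.8, 54.6, 43.1, 42.2, 45.2,
          31.6, 30.9, 30.8, 17.5, 20.5, 17.2, 16.9)
beta1 <- sum((Temp - mean(Temp)) * (Hard - mean(Hard))) /
  sum((Temp - mean(Temp))^2)
beta0 <- mean(Hard) - beta1 * mean(Temp)
c(beta0 = beta0, beta1 = beta1)   # matches coef(lm(Hard ~ Temp))
```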
R-Code 8.4 hardness data from Example 8.2, see Figure 8.2.
summary( lm1) # summary of the fit
##
## Call:
## lm(formula = Hard ~ Temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.550 -1.190 -0.369 0.599 2.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.1341 1.5750 59.8 3.2e-16 ***
## Temp -1.2662 0.0339 -37.4 8.6e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.5 on 12 degrees of freedom
## Multiple R-squared: 0.991,Adjusted R-squared: 0.991
## F-statistic: 1.4e+03 on 1 and 12 DF, p-value: 8.58e-14
coef( lm1)
## (Intercept) Temp
## 94.1341 -1.2662
head( fitted( lm1))
## 1 2 3 4 5 6
## 56.149 56.149 56.149 56.149 43.488 43.488
head( resid( lm1))
## 1 2 3 4 5 6
## -0.34945 2.95055 -1.34945 -1.54945 -0.38791 -1.28791
head( Hard - (fitted( lm1) + resid( lm1)))
## 1 2 3 4 5 6
## 0 0 0 0 0 0
In simple linear regression, the corresponding estimators can be directly derived from
Equation (8.5). For multiple regression, meaning for models with more than one indepen-
dent variable, it is better to begin directly with matrix notation to determine estimators
for an arbitrary number of predictors (see Chapter 10).
In simple linear regression, the central task is often to determine whether there exists a
linear relationship between the dependent and independent variables. This can be tested
with the hypothesis H0 : β1 = 0 (Test 12).
Test 12: Test of a linear relationship
Question: Is there a linear relationship between two paired samples?
Assumptions: Based on the sample x1, . . . , xn, the second sample is a real-
ization of a normally distributed random variable with expected value
β0 + β1xi and variance σ2.
Calculation: t_obs = |β̂1| (s_x/s_y) sqrt(n − 2) / sqrt(1 − r²) = |r| sqrt(n − 2) / sqrt(1 − r²),
with r the Pearson correlation coefficient (8.1).
Decision: Reject H0: β1 = 0 if t_obs > t_tab = t_{n−2,1−α/2}.
Calculation in R: summary( lm( y ∼ x ))
Comparing the expression of Pearson’s correlation coefficient (8.1) and the estimate
of β1 (8.7), it is not surprising that the p-values of Tests 12 and 11 coincide.
For prediction of the dependent variable for a given independent variable x0, the value
x0 is used in Equation (8.10). The function predict can be used for prediction in R. The
uncertainty of the prediction depends on the uncertainty of the estimated parameter (and
on the variance of the error term). Specifically,

Var(µ̂0) = Var(β̂0 + β̂1 x0).    (8.11)

In general, however, β̂0 and β̂1 are not independent and the variance is not easy to
calculate. We return to this in Chapter 11. To construct confidence intervals for a
prediction, we must first discern whether it is for y0 or for µ0.
The prediction can also be written as

β̂0 + β̂1 x0 = ȳ − β̂1 x̄ + β̂1 x0 = ȳ + s_xy (s_x²)^{−1} (x0 − x̄).    (8.12)
This equation is equivalent to Equation (7.18) but with estimates instead of (unknown)
parameters.
Linear regression makes assumptions about the errors εi. After calculation, we must
check whether these assumptions are realistic. For this, the residuals and their properties
are considered. We return to this in Chapter 10.
R-Code 8.5 hardness data: predictions and pointwise confidence intervals. (See
Figure 8.3.)
new <- data.frame( Temp = seq(15, 75, by=.5))
pred.w.clim <- predict( lm1, new, interval="confidence")
# for hat( mu )
pred.w.plim <- predict( lm1, new, interval="prediction")
# for hat( y ), hence wider!
plot( Temp, Hard,
xlab="Temperature [C]", ylab="Hardness [HR]")
matlines( new$Temp, cbind(pred.w.clim, pred.w.plim[,-1]),
col=c(1,2,2,3,3),lty=c(1,1,1,2,2))
[Figure: scatter plot of Hardness [HR] versus Temperature [C], overlaid with pointwise confidence and prediction bands.]
Figure 8.3: hardness data: hardness as a function of temperature. (See R-Code 8.5.)
8.3 Bibliographic Remarks and Links
We recommend the book by Fahrmeir et al. (2009) (German) or Fahrmeir et al. (2013) (English), which is both detailed and accessible.
Chapter 9
Multiple Regression
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter09.R.
The simple linear regression model can be extended by the addition of independent
variables. We first introduce the model and estimators. Subsequently, we become ac-
quainted with the most important steps in model validation. At the end, several typical
examples of multiple regression are demonstrated.
9.1 Model and Estimators
Equation (8.5) is generalized to p independent variables
Yi = µi + εi  (9.1)
   = β0 + β1 x_{i1} + · · · + βp x_{ip} + εi  (9.2)
   = x_i^T β + εi,   i = 1, . . . , n,  n > p  (9.3)
with
• Yi: dependent variable, measured value, observation
• x_i = (1, x_{i1}, . . . , x_{ip})^T: independent/explanatory variables, predictors
• β = (β0, . . . , βp)T: (unknown) parameter vector
• εi: (unknown) error term, noise, “measurement error”, with symmetric distribution
around zero, E(εi) = 0.
It is often also assumed that Var(εi) = σ² and/or that the errors are independent of each other. For simplicity, we assume that εi iid∼ N(0, σ²), with unknown σ². In matrix notation, (9.3) is written as
Y = Xβ + ε  (9.4)
with X an n× (p + 1) matrix with rows x i. We assume that the rank of X equals p+ 1
(rank(X) = p + 1, column rank). This assumption guarantees that the inverse of XTX
exists.
The least squares concept is thus applied as follows:
β̂ = argmin_β (y − Xβ)^T (y − Xβ)  (9.5)
⇒ d/dβ (y − Xβ)^T (y − Xβ)  (9.6)
  = d/dβ (y^T y − 2 β^T X^T y + β^T X^T X β) = −2 X^T y + 2 X^T X β  (9.7)
⇒ X^T X β̂ = X^T y  (9.8)
⇒ β̂ = (X^T X)^{−1} X^T y  (9.9)
Equation (9.8) is also called the normal equation.
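As a sanity check, the closed-form solution (9.9) can be compared with the coefficients returned by lm. The following sketch uses simulated data; all names, seeds and values are chosen here purely for illustration:

```r
# Verify the normal equations: lm() and (X^T X)^{-1} X^T y agree
set.seed(1)
n <- 40
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2*x1 - x2 + rnorm(n, sd=0.3)
X <- cbind(1, x1, x2)                       # design matrix with intercept column
beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # solve X^T X beta = X^T y
fit <- lm(y ~ x1 + x2)
max(abs(beta.hat - coef(fit)))              # essentially zero
```

In practice, solve(crossprod(X), crossprod(X, y)) or the QR decomposition used internally by lm is numerically preferable to forming the inverse explicitly.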
It can be shown that
ε ∼ N_n(0, σ² I),  (9.10)
Y ∼ N_n(Xβ, σ² I),  (9.11)
β̂ = (X^T X)^{−1} X^T y,   β̂ ∼ N_{p+1}(β, σ² (X^T X)^{−1}),  (9.12)
ŷ = X (X^T X)^{−1} X^T y = Hy,   Ŷ ∼ N_n(Xβ, σ² H),  (9.13)
r = y − ŷ = (I − H) y,   R ∼ N_n(0, σ² (I − H)).  (9.14)
In the left column we find the estimates, in the right column the functions of the random samples. The marginal distributions of the individual coefficients β̂i are determined by distribution (9.12):
β̂i ∼ N(βi, σ² v_ii)   with   v_ii = ((X^T X)^{−1})_ii,   i = 0, . . . , p.  (9.15)
Hence (β̂i − βi)/√(σ² v_ii) ∼ N(0, 1). However, since σ² is usually unknown, we need
(β̂i − βi)/√(σ̂² v_ii) ∼ t_{n−p−1}   with   σ̂² = 1/(n − p − 1) · Σ_{i=1}^n (yi − ŷi)² = 1/(n − p − 1) · r^T r  (9.16)
as our statistic for testing and for deriving confidence intervals for the individual coefficients βi. For testing, we are often interested in H0 : βi = 0. Confidence intervals are constructed along the lines of Equation (3.33) and summarized in the subsequent blue box.
Confidence intervals for regression coefficients
An empirical (1 − α) confidence interval for βi is
[ β̂i ± t_{n−p−1, 1−α/2} · √( 1/(n − p − 1) · r^T r · v_ii ) ]  (9.17)
with r = y − ŷ and v_ii = ((X^T X)^{−1})_ii.
9.2 Model Validation
Model validation essentially verifies if (9.3) is an adequate model for using the data to
test the given hypothesis. The question is not whether a model is correct, but rather if it
is useful (“Essentially, all models are wrong, but some are useful”, Box and Draper, 1987
page 424).
Validation is based on (a) graphical summaries, typically (standardized) residuals versus fitted values, individual predictors, summary values of predictors, or simply Q-Q plots, and (b) summary statistics like
Coefficient of determination, or R²:   R² = 1 − SSE/SST = 1 − Σ_i (yi − ŷi)² / Σ_i (yi − ȳ)²,  (9.18)
Adjusted R²:   R²_adj = R² − (1 − R²) · p/(n − p − 1),  (9.19)
Observed value of F-test:   ((SST − SSE)/p) / (SSE/(n − p − 1)),  (9.20)
where SS stands for sums of squares. The last one is essentially equivalent to Test 4 and performs the omnibus test H0 : β1 = β2 = · · · = βp = 0. Rejecting merely means that at least one of the coefficients is significant. This test can also be used to compare nested models. Let M0 be the simpler model with only q out of the p predictors of the more complex model M1 (0 ≤ q < p). The test H0 : “M0 is sufficient” is based on the statistic
(SS_simple model − SS_complex model)/(p − q) / ( SS_complex model/(n − p − 1) )  (9.21)
and often runs under the name ANOVA (analysis of variance). We see an alternative derivation thereof in Chapter 10. Further, notice that in Section 9.3 we discuss two more often-used summary statistics to assess model fit.
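The nested-model statistic (9.21) is what anova( m0, m1) computes; a small illustration on simulated data with q = 1 out of p = 2 predictors (names and seed chosen for illustration):

```r
# Nested model comparison (9.21) via anova(): does x2 improve the model?
set.seed(3)
n <- 50
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2*x1 + 1.5*x2 + rnorm(n, sd=0.3)
m0 <- lm(y ~ x1)            # simpler model, q = 1 predictor
m1 <- lm(y ~ x1 + x2)       # complex model, p = 2 predictors
# F statistic by hand, as in (9.21):
f.by.hand <- ((sum(residuals(m0)^2) - sum(residuals(m1)^2)) / (2 - 1)) /
             (sum(residuals(m1)^2) / (n - 2 - 1))
anova(m0, m1)$F[2] - f.by.hand   # essentially zero
```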
Validation verifies (i) the fixed components (or fixed part) µi and (ii) the stochastic
components (or stochastic part) εi and is normally an iterative process.
In terms of the fixed components, it must be verified whether the necessary predictors are in the model. We want neither too many nor too few. Unnecessary
predictors are often identified through insignificant coefficients. When predictors are missing, the residuals show (in the ideal case) structure indicative of a need for model improvement. In other cases, the quality of the regression is low (F-test insignificant, R² (too) small). Example 9.1 below illustrates the most important elements.
It is important to understand that the stochastic part εi does not only represent
measurement error. In general, the error is the remaining “variability” (also noise) that
is not explained through the predictors (“signal”). Therefore it is possible for the error
to not be iid normally distributed.
With respect to the stochastic part εi, the following points should be verified:
i) constant variance: if the (absolute) residuals are displayed as a function of the predictors, the estimated values, or the index, no structure should be discernible. The observations can often be transformed in order to achieve constant variance. Constant variance is also called homoscedasticity; otherwise one speaks of heteroscedasticity or variance heterogeneity.
ii) independence: correlation between the residuals should be negligible. Types of error
structure will be discussed in Chapter 11.
iii) symmetric distribution: it is not easy to find evidence against this assumption. If
the distribution is strongly right- or left-skewed, the scatter plots of the residuals
will have structure. Transformations or generalized linear models may help. We
have a quick look at a generalized linear model in Section 11.3.
Example 9.1. Here we construct synthetic data in order to better illustrate the difficulty
of detecting a suitable model. Table 9.1 gives the actual models and the five fitted models.
In all cases we use a small dataset of size n = 50 and predictors (x1, x2 and x3) that we
construct from a uniform distribution. Further, we set εi iid∼ N(0, 0.25²). R-Code 9.1 and the corresponding Figure 9.1 illustrate how model deficiencies manifest.
We illustrate how residual plots may or may not show missing or unnecessary predic-
tors. Because of the ‘textbook’ example, the adjusted R2 values are very high and the
p-value of the F -Test is – as often in practice – of little value.
Since the output of summary(.) is quite long, here we show only elements from it. This is achieved with the functions print and cat. For Examples 2 to 5 the output has been constructed by a call to a function sres_summary(), which produces the same output elements as shown for the first example.
The plots should supplement a classical graphical analysis through plot( res). ♣
Table 9.1: Fitted models for the five examples of Example 9.1. The true model is always Yi = β0 + β1 x1 + β2 x1² + β3 x2 + εi.

Example   Fitted model
1   Yi = β0 + β1 x1 + β2 x1² + β3 x2 + εi            correct model
2   Yi = β0 + β1 x1 + β2 x2 + εi                     missing predictor x1²
3   Yi = β0 + β1 x1² + β2 x2 + εi                    missing predictor x1
4   Yi = β0 + β1 x1 + β2 x1² + εi                    missing predictor x2
5   Yi = β0 + β1 x1 + β2 x1² + β3 x2 + β4 x3 + εi    unnecessary predictor x3
R-Code 9.1: Illustration of missing and unnecessary predictors for an
artificial dataset. (See Figure 9.1.)
set.seed( 18)
n <- 50
x1 <- runif( n); x2 <- runif( n); x3 <- runif( n)
eps <- rnorm( n, sd=0.16)
y <- -1 + 3*x1 + 2.5*x1^2 + 1.5*x2 + eps
# Example 1: Correct model
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 ))
print( sres$coef, digits=2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.96 0.067 -14.4 1.3e-18
## x1 2.90 0.264 11.0 1.8e-14
## I(x1^2) 2.54 0.268 9.5 2.0e-12
## x2 1.49 0.072 20.7 5.8e-25
cat("Adjusted R-squared: ", formatC( sres$adj.r.squared),
" F-Test: ", pf(sres$fstatistic[1], 1, n-2, lower.tail = FALSE))
## Adjusted R-squared: 0.9927 F-Test: 6.5201e-42
plotit( res$fitted, res$resid) # a nice plot, essentially equivalent to
# plot( res, which=1, caption=NA, sub.caption=NA, id.n=0) with ylim=c(-1,1)
# i.e, plot(), followed by a panel.smooth()
plotit( x1, res$resid)
plotit( x2, res$resid)
# Example 2: Missing predictor x1^2
sres <- summary( res <- lm( y ~ x1 + x2 ))
sres_summary( ) # as above
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3 0.093 -14 9.0e-19
## x1 5.3 0.113 47 2.5e-41
## x2 1.5 0.122 13 1.1e-16
## Adjusted R-squared: 0.9789 F-Test: 4.2729e-35
# Example 3: Missing predictor x1
sres <- summary( res <- lm( y ~ I(x1^2) + x2 ))
sres_summary( )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.45 0.091 -5 9.3e-06
## I(x1^2) 5.39 0.126 43 3.0e-39
## x2 1.41 0.135 10 8.4e-14
## Adjusted R-squared: 0.9742 F-Test: 4.945e-33
# Example 4: Missing predictor x2
sres <- summary( res <- lm( y ~ x1 + I(x1^2) ))
sres_summary( )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.042 0.16 -0.26 0.7942
## x1 2.307 0.83 2.76 0.0081
## I(x1^2) 2.963 0.85 3.49 0.0011
## Adjusted R-squared: 0.9266 F-Test: 1.3886e-22
# Example 5: Too many predictors x3
sres <- summary( res <- lm( y ~ x1 + I(x1^2) + x2 + x3))
sres_summary( )
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.973 0.071 -13.8 9.5e-18
## x1 2.908 0.266 10.9 3.0e-14
## I(x1^2) 2.537 0.270 9.4 3.4e-12
## x2 1.483 0.075 19.9 6.5e-24
## x3 0.044 0.074 0.6 5.5e-01
## Adjusted R-squared: 0.9926 F-Test: 7.6574e-39
# The results are in sync with:
# tmp <- cbind( y, x1, "x1^2"=x1^2, x2, x3)
# pairs( tmp, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
# cor( tmp)
[Figure 9.1: a 5 × 3 grid of residual plots, residuals ranging from −1 to 1.]
Figure 9.1: Residual plots. Residuals versus fitted values (left column), predictor x1 (middle) and x2 (right column). The rows correspond to the different fitted models. The panels in the left column have different scaling of the x-axis. (See R-Code 9.1.)
9.3 Information Criterion
An information criterion in statistics is a tool for model selection. It follows the idea of Occam's razor, in that a model should not be unnecessarily complex. It balances the goodness of fit of the estimated model with its complexity, measured by the number of parameters. There is a penalty for the number of parameters; otherwise complex models with numerous parameters would be preferred.
We assume that the distribution of the variables follows a known distribution with an unknown parameter θ. In maximum likelihood estimation, the larger the likelihood function L(θ̂) or, equivalently, the smaller the negative log-likelihood function −ℓ(θ̂), the better the model. The oldest criterion was proposed as “an information criterion” in 1973 by Hirotugu Akaike and is known today as the Akaike information criterion (AIC):
AIC = −2 ℓ(θ̂) + 2p.  (9.22)
In regression models with normally distributed errors, the maximized log-likelihood is linear in log σ̂² and so the first term describes the goodness of fit.
The disadvantage of AIC is that the penalty term is independent of the sample size.
The Bayesian information criterion (BIC)
BIC = −2 ℓ(θ̂) + log(n) p  (9.23)
penalizes the model more heavily based on both the number of parameters p and the sample size n, and its use is recommended.
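Both criteria can be reproduced from the log-likelihood. Note that for lm objects, R counts σ̂² as an additional estimated parameter, so the effective parameter count is the df attribute of logLik rather than p alone. A minimal check on simulated data (names and seed chosen for illustration):

```r
# AIC and BIC from the maximized log-likelihood, as in (9.22)/(9.23)
set.seed(4)
n <- 50; x <- runif(n)
y <- 1 + 2*x + rnorm(n)
fit <- lm(y ~ x)
ll <- logLik(fit)
k <- attr(ll, "df")                 # all estimated parameters, incl. sigma^2
aic <- -2 * as.numeric(ll) + 2 * k
bic <- -2 * as.numeric(ll) + log(n) * k
c(aic - AIC(fit), bic - BIC(fit))   # both zero
```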
9.4 Examples
In this section we give two more examples of multiple regression problems, based on
classical datasets.
Example 9.2. (abrasion data) The data comes from an experiment investigating how
rubber’s resistance to abrasion is affected by the hardness of the rubber and its tensile
strength (Cleveland, 1993). Each of the 30 rubber samples was tested for hardness and
tensile strength, and then subjected to steady abrasion for a fixed time.
R-Code 9.2 performs the regression analysis based on two predictors. The empirical
confidence intervals confint( res) do not contain zero. Accordingly, the p-values of the
three t-tests are small.
A quadratic term for strength is not necessary. However, the residuals appear to be
slightly correlated. We do not have further information about the data here and cannot
investigate this aspect further. ♣
[Figure 9.2: pairs plot of loss, hardness and strength.]
Figure 9.2: abrasion data: EDA in form of a pairs plot. (See R-Code 9.2.)
R-Code 9.2: abrasion data: EDA, fitting a linear model and model vali-
dation.
abrasion <- read.csv('data/abrasion.csv')
str(abrasion)
## 'data.frame': 30 obs. of 3 variables:
## $ loss : int 372 206 175 154 136 112 55 45 221 166 ...
## $ hardness: int 45 55 61 66 71 71 81 86 53 60 ...
## $ strength: int 162 233 232 231 231 237 224 219 203 189 ...
pairs(abrasion, upper.panel=panel.smooth, lower.panel=NULL, gap=0)
res <- lm(loss~hardness+strength, data=abrasion)
summary( res)
##
## Call:
## lm(formula = loss ~ hardness + strength, data = abrasion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hardness -6.571 0.583 -11.27 1.0e-11 ***
## strength -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84,Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
confint( res)
## 2.5 % 97.5 %
## (Intercept) 758.4573 1011.86490
## hardness -7.7674 -5.37423
## strength -1.7730 -0.97562
# Fitted values
plot( loss~hardness, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~hardness, col=4, data=abrasion)
plot( loss~strength, ylim=c(0,400), yaxs='i', data=abrasion)
points( res$fitted~strength, col=4, data=abrasion)
# Residual vs ...
plot( res$resid~res$fitted)
lines( lowess( res$fitted, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~hardness, data=abrasion)
lines( lowess( abrasion$hardness, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid~strength, data=abrasion)
lines( lowess( abrasion$strength, res$resid), col=2)
abline( h=0, col='gray')
plot( res$resid[-1]~res$resid[-30])
abline( h=0, col='gray')
[Figure 9.3: six panels of fitted values and residual diagnostics.]
Figure 9.3: abrasion data: model validation. Top row shows the loss (black) and fitted values (blue) as a function of hardness (left) and strength (right). Middle and bottom panels are different residual plots. (See R-Code 9.2.)
Example 9.3. (LifeCycleSavings data) Under the life-cycle savings hypothesis de-
veloped by Franco Modigliani, the savings ratio (aggregate personal savings divided by
disposable income) is explained by per-capita disposable income, the percentage rate of
change in per-capita disposable income, and two demographic variables: the percentage
of population less than 15 years old and the percentage of the population over 75 years
old. The data are averaged over the decade 1960–1970 to remove the business cycle or
other short-term fluctuations.
The dataset contains information from 50 countries about these five variables:
• sr aggregate personal savings,
• pop15 % of population under 15,
• pop75 % of population over 75,
• dpi real per-capita disposable income,
• ddpi % growth rate of dpi.
Scatter plots are shown in Figure 9.4. R-Code 9.3 fits a multiple linear model, selects
models through comparison of various goodness of fit criteria (AIC, BIC) and shows the
model validation plots for the model selected using AIC.
Different models may result from different criteria: when using BIC for model selection,
pop75 drops out of the model. ♣
[Figure 9.4: pairs plot of sr, pop15, pop75, dpi and ddpi.]
Figure 9.4: LifeCycleSavings data: scatter plots of the data.
R-Code 9.3: LifeCycleSavings data: EDA, linear model and model selec-
tion. (Figure 9.4).
data( LifeCycleSavings)
head( LifeCycleSavings)
## sr pop15 pop75 dpi ddpi
## Australia 11.43 29.35 2.87 2329.68 2.87
## Austria 12.07 23.32 4.41 1507.99 3.93
## Belgium 13.17 23.80 4.43 2108.47 3.82
## Bolivia 5.75 41.89 1.67 189.13 0.22
## Brazil 12.88 42.19 0.83 728.47 4.56
## Canada 8.79 31.72 2.85 2982.88 2.43
pairs(LifeCycleSavings, upper.panel=panel.smooth, lower.panel=NULL,
gap=0)
lcs.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
summary( lcs.all)
##
## Call:
## lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.242 -2.686 -0.249 2.428 9.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.566087 7.354516 3.88 0.00033 ***
## pop15 -0.461193 0.144642 -3.19 0.00260 **
## pop75 -1.691498 1.083599 -1.56 0.12553
## dpi -0.000337 0.000931 -0.36 0.71917
## ddpi 0.409695 0.196197 2.09 0.04247 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.8 on 45 degrees of freedom
## Multiple R-squared: 0.338,Adjusted R-squared: 0.28
## F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
lcs.aic <- step( lcs.all) # AIC is default choice
## Start: AIC=138.3
## sr ~ pop15 + pop75 + dpi + ddpi
##
## Df Sum of Sq RSS AIC
## - dpi 1 1.9 653 136
## <none> 651 138
## - pop75 1 35.2 686 139
## - ddpi 1 63.1 714 141
## - pop15 1 147.0 798 146
##
## Step: AIC=136.45
## sr ~ pop15 + pop75 + ddpi
##
## Df Sum of Sq RSS AIC
## <none> 653 136
## - pop75 1 47.9 701 138
## - ddpi 1 73.6 726 140
## - pop15 1 145.8 798 144
summary( lcs.aic)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.12466 7.18379 3.9150 0.00029698
## pop15 -0.45178 0.14093 -3.2056 0.00245154
## pop75 -1.83541 0.99840 -1.8384 0.07247270
## ddpi 0.42783 0.18789 2.2771 0.02747818
plot( lcs.aic) # 4 plots to assess the models
# BIC:
summary( step( lcs.all, k=log(50), trace=0))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.59958 2.334394 6.6825 2.4796e-08
## pop15 -0.21638 0.060335 -3.5863 7.9597e-04
## ddpi 0.44283 0.192401 2.3016 2.5837e-02
[Figure 9.5: the four standard lm diagnostic plots — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; the countries Zambia, Chile, Philippines, Libya and Japan are flagged.]
Figure 9.5: LifeCycleSavings data: model validation.
Chapter 10
Analysis of Variance
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter10.R.
In Test 2 discussed in Chapter 4, we compared the means of two independent samples
with each other. Naturally, the same procedure can be applied to I independent samples,
which would amount to (I choose 2) = I(I − 1)/2 tests and would require adjustments due to multiple testing.
In this chapter we learn a “better” method, based on the concept of analysis of variance,
termed ANOVA. We focus on a linear model approach to ANOVA. Due to historical
reasons, the notation is slightly different than what we have seen in the last two chapters;
but we try to link and unify as much as possible.
10.1 One-Way ANOVA
The model of this section is tailored to compare I different groups where the variability
of the observations around the mean is the same in all groups. That means that there is
one common variance parameter and we pool the information across all observations to
estimate it.
More formally, the model consists of I groups, in the ANOVA context called factor levels,
i = 1, . . . , I, and every level contains a sample of size ni. Thus, in total we have N =
n1 + · · ·+ nI observations. The model is given by
Yij = µi + εij  (10.1)
    = µ + βi + εij,  (10.2)
where we use the indices to indicate the group and the within-group observation. We again assume εij iid∼ N(0, σ²). Formulation (10.1) represents the individual group means directly,
whereas formulation (10.2) models an overall mean and deviations from the mean.
However, model (10.2) is overparameterized (I levels and I + 1 parameters) and an additional constraint on the parameters is necessary. Often, the sum-to-zero contrast or the treatment contrast, written as
Σ_{i=1}^I βi = 0   or   β1 = 0,  (10.3)
is used.
We are inherently interested in whether there exists a difference between the groups
and so our null hypothesis is H0 : β1 = β2 = · · · = βI = 0. Note that the hypothesis is
independent of the constraint. To develop the associated test, we proceed in several steps.
We first link the two group case to the notation from the last chapter. In a second step
we intuitively derive estimates in the general setting. Finally, we state the test statistic.
10.1.1 Two Level ANOVA Written as a Regression Model
Model (10.2) with I = 2 and treatment constraint β1 = 0 can be written as a regression problem
Y*_i = β*_0 + β*_1 x_i + ε*_i,   i = 1, . . . , N  (10.4)
with Y*_i the components of the vector (Y11, Y12, . . . , Y1n1, Y21, . . . , Y2n2)^T and x_i = 0 if i = 1, . . . , n1 and x_i = 1 otherwise. We simplify the notation and spare ourselves from writing the index denoted by the asterisk with
(Y1^T, Y2^T)^T = Xβ + ε,   X = [1 0; 1 1],   β = (β0, β1)^T,  (10.5)
where Y1 and Y2 collect the observations of the two groups, matrix rows are separated by semicolons and, blockwise, 1 and 0 denote vectors of ones and zeros of lengths n1 and n2. Thus we have as least squares estimate
β̂ = (β̂0, β̂1)^T = (X^T X)^{−1} X^T (y1^T, y2^T)^T = [N n2; n2 n2]^{−1} [1^T 1^T; 0^T 1^T] (y1^T, y2^T)^T  (10.6)
 = 1/(n1 n2) · [n2 −n2; −n2 N] (Σ_{ij} y_ij, Σ_j y_2j)^T = ( 1/n1 · Σ_j y_1j,  1/n2 · Σ_j y_2j − 1/n1 · Σ_j y_1j )^T.  (10.7)
Thus the least squares estimates of µ and β2 in (10.2) for two groups are the mean of the
first group and the difference between the two group means.
The null hypothesis H0 : β1 = β2 = 0 in Model (10.2) is equivalent to the null hypothesis
H0 : β∗1 = 0 in Model (10.4) or to the null hypothesis H0 : β1 = 0 in Model (10.5). The
latter is of course based on a t-test for a linear association (Test 12) and coincides with
the two sample t-test for two independent samples (Test 2).
Estimators can also be derived in a similar fashion under other constraints or for more
factor levels.
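The equivalence can be checked numerically: the t-test on the slope of the dummy regression reproduces the pooled two-sample t-test, and the coefficients are the first group mean and the difference of group means. A minimal sketch with simulated groups (names and seed chosen for illustration):

```r
# Two-level ANOVA as regression: lm() t-test equals the two-sample t-test
set.seed(5)
y1 <- rnorm(10, mean=0);  y2 <- rnorm(12, mean=1)
y <- c(y1, y2)
group <- factor(rep(c("A", "B"), c(10, 12)))
fit <- lm(y ~ group)                  # default treatment contrasts
coef(fit)                             # intercept = mean(y1), slope = mean(y2) - mean(y1)
p.lm <- summary(fit)$coefficients["groupB", "Pr(>|t|)"]
p.t  <- t.test(y2, y1, var.equal=TRUE)$p.value
p.lm - p.t                            # zero: the two tests coincide
```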
10.1.2 Sums of Squares Decomposition in the Balanced Case
Historically, we often had the case n1 = · · · = nI = J , representing a balanced setting.
In this case, there is another simple approach to deriving the estimators under the sum-
to-zero constraint. We use dot notation in order to show that we work with averages, for
example,
ȳ_i· = 1/n_i · Σ_{j=1}^{n_i} y_ij   and   ȳ_·· = 1/N · Σ_{i=1}^{I} Σ_{j=1}^{n_i} y_ij.  (10.8)
Based on Yij = µ + βi + εij we use the following ansatz to derive estimates
y_ij = ȳ_·· + (ȳ_i· − ȳ_··) + (y_ij − ȳ_i·).  (10.9)
With the least squares method, µ̂ and β̂i are chosen such that
Σ_{i,j} (y_ij − µ − βi)² = Σ_{i,j} (ȳ_·· + ȳ_i· − ȳ_·· + y_ij − ȳ_i· − µ − βi)²  (10.10)
 = Σ_{i,j} ( (ȳ_·· − µ) + (ȳ_i· − ȳ_·· − βi) + (y_ij − ȳ_i·) )²  (10.11)
is minimized. We evaluate the square of this last equation and note that the cross terms are zero since
Σ_{j=1}^{J} (y_ij − ȳ_i·) = 0   and   Σ_{i=1}^{I} (ȳ_i· − ȳ_·· − β̂i) = 0  (10.12)
(the latter due to the sum-to-zero constraint) and thus µ̂ = ȳ_·· and β̂i = ȳ_i· − ȳ_··. Hence, writing r_ij = y_ij − ȳ_i·, we have
y_ij = µ̂ + β̂i + r_ij.  (10.13)
As in regression, the observations are orthogonally projected in the space spanned by µ
and βi. This orthogonal projection allows for the division of the sums of squares of the
observations (mean corrected to be precise) into the sums of squares of the model and sum
of squares of the error component. These sums of squares are then weighted and compared.
The representation of this process in table form and the subsequent interpretation is often
equated with the analysis of variance, denoted ANOVA.
The decomposition of the sums of squares can be derived with help from (10.9). No assumptions about constraints or n_i are made:
Σ_{i,j} (y_ij − ȳ_··)² = Σ_{i,j} (ȳ_i· − ȳ_·· + y_ij − ȳ_i·)²  (10.14)
 = Σ_{i,j} (ȳ_i· − ȳ_··)² + Σ_{i,j} (y_ij − ȳ_i·)² + Σ_{i,j} 2 (ȳ_i· − ȳ_··)(y_ij − ȳ_i·),  (10.15)
where the cross term is again zero because Σ_{j=1}^{n_i} (y_ij − ȳ_i·) = 0. Hence we have the decomposition of the sums of squares
Σ_{i,j} (y_ij − ȳ_··)²  (Total)  =  Σ_{i,j} (ȳ_i· − ȳ_··)²  (Model)  +  Σ_{i,j} (y_ij − ȳ_i·)²  (Error)  (10.16)
or SST = SSA + SSE. We choose deliberately SSA instead of SSM as this will simplify subsequent extensions. Using the least squares estimates µ̂ = ȳ_·· and β̂i = ȳ_i· − ȳ_··, this equation can be read as
1/N · Σ_{i,j} (y_ij − µ̂)² = 1/N · Σ_i n_i (µ̂ + β̂i − µ̂)² + 1/N · Σ_{i,j} (y_ij − µ̂ − β̂i)²  (10.17)
Var̂(y_ij) = σ̂² + 1/N · Σ_i n_i β̂i²,  (10.18)
(where we could have used some divisor other than N). The test statistic for the statistical
hypothesis H0 : β1 = β2 = · · · = βI = 0 is based on the idea of decomposing the variance
into variance between groups and variance within groups, just as illustrated in (10.18),
and comparing them. Formally, this must be made more precise. A good model has
a small estimate for σ2 in comparison to that for the second sum. We now develop a
quantitative comparison of the sums.
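The decomposition (10.16) holds exactly for any group sizes n_i; a quick numerical check with simulated, unbalanced data (names and seed chosen for illustration):

```r
# Sum of squares decomposition (10.16): SST = SSA + SSE
set.seed(6)
I <- 3; n_i <- c(4, 5, 6)                      # unbalanced groups are fine
group <- factor(rep(1:I, n_i))
y <- rnorm(sum(n_i), mean = rep(c(0, 1, 2), n_i))
gm <- ave(y, group)                            # group means y_i., one per observation
SST <- sum((y - mean(y))^2)
SSA <- sum((gm - mean(y))^2)
SSE <- sum((y - gm)^2)
SST - (SSA + SSE)                              # zero
```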
A raw comparison of both variance terms is not sufficient; the number of observations must be considered. In order to weight the individual sums of squares, we divide them
by their degrees of freedom. Under the null hypothesis, the mean squares are chi-square
distributed and thus their quotients are F distributed. Hence, an F -test as illustrated
in Test 4 is needed again. Historically, such a test has been “constructed” via a table
and is still represented as such. This so-called ANOVA table consists of columns for the
sums of squares, degrees of freedom, mean squares and F -test statistic due to variance
between groups, within groups, and the total variance. Table 10.1 illustrates such a
generic ANOVA table, numerical examples are given in Example 10.1 later in this section.
The expected values of the mean squares (MS) are
E(MSA) = σ² + Σ_i n_i βi² / (I − 1),    E(MSE) = σ².  (10.19)
Note that only the latter is intuitive.
Thus under H0 : β1 = · · · = βI = 0, E(MSA) = E(MSE) and hence the ratio MSA/MSE
is close to one, but typically larger. We reject H0 for large values of the ratio (see
Section 2.7.4).
Table 10.1: Table for one-way analysis of variance.

Source                       Sum of squares (SS)             DF      Mean squares (MS)     Test statistic
Between groups               SSA = Σ_{i,j} (ȳ_i· − ȳ_··)²    I − 1   MSA = SSA/(I − 1)     Fobs = MSA/MSE
(Factors, levels, . . . )
Within groups (Error)        SSE = Σ_{i,j} (y_ij − ȳ_i·)²    N − I   MSE = SSE/(N − I)
Total                        SST = Σ_{i,j} (y_ij − ȳ_··)²    N − 1
Test 13: Performing a one-way analysis of variance
Question: Are at least two of the means ȳ1, ȳ2, . . . , ȳI significantly different?
Assumptions: The I populations are normally distributed with homogeneous variances. The samples are independent.
Calculation: Construct a one-way ANOVA table. The quotient of the mean squares of the factor and the error is needed:
Fobs = (SSA/(I − 1)) / (SSE/(N − I)) = MSA/MSE
Decision: Reject H0 : β1 = β2 = · · · = βI = 0 if Fobs > Ftab = F_{I−1, N−I, 1−α}, where I − 1 gives the degrees of freedom “between” and N − I the degrees of freedom “within”.
Calculation in R: summary( lm(...)) for the value of the test statistic or anova( lm(...)) for the explicit ANOVA table.
Test 13 summarizes the procedure. The test statistic of Table 10.1 naturally agrees with
that of Test 13. Observe that when MSA ≤ MSE, i.e., Fobs ≤ 1, H0 is never rejected.
Details about F -distributed random variables are given in Section 2.7.4.
Example 10.1 discusses a simple analysis of variance.
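Before turning to the example, the entries of the ANOVA table can be reproduced by hand and compared with anova( lm(...)); a minimal sketch with simulated balanced groups (names and seed chosen for illustration):

```r
# One-way ANOVA table by hand versus anova(lm(...))
set.seed(7)
I <- 3; J <- 8; N <- I * J
group <- factor(rep(1:I, each=J))
y <- rnorm(N, mean=rep(c(0, 0.5, 1), each=J))
gm <- ave(y, group)                            # group means y_i.
MSA <- sum((gm - mean(y))^2) / (I - 1)         # mean square between groups
MSE <- sum((y - gm)^2) / (N - I)               # mean square within groups
tab <- anova(lm(y ~ group))
c(tab[1, "F value"] - MSA / MSE,               # same F statistic
  tab[1, "Pr(>F)"] - pf(MSA/MSE, I - 1, N - I, lower.tail=FALSE))
```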
Example 10.1. (retardant data) Many substances related to human activities end up
in wastewater and accumulate in sewage sludge. The present study focuses on hexabro-
mocyclododecane (HBCD) detected in sewage sludge collected from a monitoring network
in Switzerland. HBCD’s main use is in expanded and extruded polystyrene for thermal
insulation foams, in building and construction. HBCD is also applied in the backcoating
of textiles, mainly in furniture upholstery. A very small application of HBCD is in high
impact polystyrene, which is used for electrical and electronic appliances, for example in
audio visual equipment. Data and more detailed background information are given in
Kupper et al. (2008) where it is also argued that loads from different types of monitor-
ing sites showed that brominated flame retardants ending up in sewage sludge originate
mainly from surface runoff, industrial and domestic wastewater.
HBCD is harmful to one’s health, may affect reproductive capacity, and may harm children
in the mother’s womb.
In R-Code 10.1 the data are loaded and reduced to Hexabromocyclododecane.
First we use constraint β1 = 0, i.e., Model (10.5). The estimates naturally agree with
those from (10.7).
Then we use the sum-to-zero constraint and compare the results. The estimates and the
standard errors changed (and thus the p-values of the t-test). The p-values of the F -test
are, however, identical, since the same test is used.
The R command aov is an alternative for performing ANOVA and its use is illustrated
in R-Code 10.2. We prefer, however, the more general lm approach. Nevertheless we
need a function which provides results on which, for example, Tukey’s honest significant
difference (HSD) test can be performed with the function TukeyHSD. The differences can
also be calculated from the coefficients in R-Code 10.1. The p-values are larger because multiple tests are accounted for. ♣
R-Code 10.1: retardant data: ANOVA with lm command and illustration
of various contrasts.
tmp <- read.csv('./data/retardant.csv')
retardant <- read.csv('./data/retardant.csv', skip=1)
names( retardant) <- names(tmp)
HBCD <- retardant$cHBCD
str( retardant$StationLocation)
## Factor w/ 16 levels "A11","A12","A15",..: 1 2 3 4 5 6 7 8 11 12 ...
type <- as.factor( rep(c("A","B","C"), c(4,4,8)))
lmout <- lm( HBCD ~ type )
summary(lmout)
##
## Call:
## lm(formula = HBCD ~ type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.7 42.4 1.78 0.098 .
## typeB 77.2 60.0 1.29 0.220
## typeC 107.8 51.9 2.08 0.058 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
options( "contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
# manually construct the estimates:
c( mean(HBCD[1:4]), mean(HBCD[5:8])-mean(HBCD[1:4]),
mean(HBCD[9:16])-mean(HBCD[1:4]))
## [1] 75.675 77.250 107.788
# change the contrasts to sum-to-zero
options(contrasts=c("contr.sum","contr.sum"))
lmout1 <- lm( HBCD ~ type )
summary(lmout1)
##
## Call:
## lm(formula = HBCD ~ type)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.6 -44.4 -26.3 22.0 193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.4 22.3 6.15 3.5e-05 ***
## type1 -61.7 33.1 -1.86 0.086 .
## type2 15.6 33.1 0.47 0.646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.8 on 13 degrees of freedom
## Multiple R-squared: 0.249,Adjusted R-squared: 0.134
## F-statistic: 2.16 on 2 and 13 DF, p-value: 0.155
beta <- as.numeric(coef(lmout1))
# Reconstruct the 'contr.treatment' coefficients:
c( beta[1]+beta[2], beta[3]-beta[2], -2*beta[2]-beta[3])
## [1] 75.675 77.250 107.787
10.2 Two-Way and Complete Two-Way ANOVA
Model (10.2) can be extended for additional factors. Adding one factor to a one-way
model leads us to a two-way model
Y_ijk = µ + β_i + γ_j + ε_ijk   (10.20)

with i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , n_ij and ε_ijk iid∼ N(0, σ²). The indices again
specify the levels of the first and second factor as well as the count within that configuration.
As stated, the model is overparameterized and additional constraints are again necessary;

∑_{i=1}^{I} β_i = 0, ∑_{j=1}^{J} γ_j = 0   or   β_1 = 0, γ_1 = 0   (10.21)

are often used.
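The two types of constraints in (10.21) correspond to the contrast options contr.sum and contr.treatment seen in R-Code 10.1. As a minimal sketch (with a hypothetical three-level factor), one can inspect the resulting design-matrix columns directly:

```r
# Hypothetical three-level factor to inspect the two constraints
type <- factor(rep(c("A", "B", "C"), each = 2))

options(contrasts = c("contr.treatment", "contr.poly"))  # beta_1 = 0
X_treat <- model.matrix(~ type)   # columns: intercept, typeB, typeC

options(contrasts = c("contr.sum", "contr.poly"))        # sum-to-zero
X_sum <- model.matrix(~ type)     # columns: intercept, type1, type2

options(contrasts = c("contr.treatment", "contr.poly"))  # restore default
```

Under contr.treatment the first level is absorbed into the intercept; under contr.sum the coefficient of the last level is the negative sum of the displayed coefficients.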
When the n_ij in each group are not equal, the decomposition of the sums of squares is not
necessarily unique. In practice this is often unimportant since, above all, the estimated
coefficients are compared with each other ("contrasts"). We recommend always using the
command lm(...). In case we do compare sums of squares, the resulting ambiguities mean
that factors need to be included in decreasing order of "natural" importance.
For the sake of illustration, we consider the balanced case n_ij = K, called complete
two-way ANOVA. More precisely, the model consists of I · J groups and every group
R-Code 10.2 retardant data: ANOVA with aov and multiple testing of the means.
aovout <- aov( HBCD ~ type )
options("contrasts")
## $contrasts
## [1] "contr.sum" "contr.sum"
coefficients( aovout) # coef( aovout) is sufficient as well.
## (Intercept) type1 type2
## 137.354 -61.679 15.571
summary(aovout)
## Df Sum Sq Mean Sq F value Pr(>F)
## type 2 31069 15534 2.16 0.15
## Residuals 13 93492 7192
TukeyHSD( aovout)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = HBCD ~ type)
##
## $type
## diff lwr upr p adj
## B-A 77.250 -81.085 235.59 0.42602
## C-A 107.788 -29.335 244.91 0.13376
## C-B 30.538 -106.585 167.66 0.82882
contains K samples, hence N = I · J · K. The calculation of the estimates is easier than
in the unbalanced case, as illustrated in the following.
As in the one-way case, we can derive the least squares estimates

y_ijk = ȳ_···  +  (ȳ_i·· − ȳ_···)  +  (ȳ_·j· − ȳ_···)  +  (y_ijk − ȳ_i·· − ȳ_·j· + ȳ_···)   (10.22)
       =  µ̂    +      β̂_i        +       γ̂_j        +            r_ijk

and separate the sums of squares

∑_{i,j,k} (y_ijk − ȳ_···)² = ∑_{i,j,k} (ȳ_i·· − ȳ_···)² + ∑_{i,j,k} (ȳ_·j· − ȳ_···)² + ∑_{i,j,k} (y_ijk − ȳ_i·· − ȳ_·j· + ȳ_···)²   (10.23)
           SST            =          SSA             +          SSB             +                 SSE
We are interested in the statistical hypotheses H_0: β_1 = · · · = β_I = 0 and H_0: γ_1 =
· · · = γ_J = 0. The test statistics of both of these tests are given in the last column of
Table 10.2. The test statistic F_obs,A is compared with the quantiles of the F
distribution with I − 1 and N − I − J + 1 degrees of freedom; similarly, for F_obs,B we use
J − 1 and N − I − J + 1 degrees of freedom. The tests are equivalent to that of Test 13.
Table 10.2: Table of the complete two-way ANOVA.

Source     SS                                                DF             MS                 Test statistic
Factor A   SSA = ∑_{i,j,k} (ȳ_i·· − ȳ_···)²                  I − 1          MSA = SSA/(I − 1)  F_obs,A = MSA/MSE
Factor B   SSB = ∑_{i,j,k} (ȳ_·j· − ȳ_···)²                  J − 1          MSB = SSB/(J − 1)  F_obs,B = MSB/MSE
Error      SSE = ∑_{i,j,k} (y_ijk − ȳ_i·· − ȳ_·j· + ȳ_···)²  N − I − J + 1  MSE = SSE/DFE
Total      SST = ∑_{i,j,k} (y_ijk − ȳ_···)²                  N − 1
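The decomposition (10.23) can be verified numerically; the following sketch uses hypothetical balanced data with I = 2, J = 3 and K = 4:

```r
set.seed(1)
I <- 2; J <- 3; K <- 4
A <- gl(I, J * K)                    # first factor
B <- gl(J, K, length = I * J * K)    # second factor, balanced within A
y <- 2 + as.numeric(A) + 0.5 * as.numeric(B) + rnorm(I * J * K)

ybar  <- mean(y)
ybarA <- ave(y, A)                   # group means of factor A
ybarB <- ave(y, B)                   # group means of factor B
SST <- sum((y - ybar)^2)
SSA <- sum((ybarA - ybar)^2)
SSB <- sum((ybarB - ybar)^2)
SSE <- sum((y - ybarA - ybarB + ybar)^2)
all.equal(SST, SSA + SSB + SSE)      # exact in the balanced case
```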
An interaction (βγ)_ij can also be included in the model, if the effects are not purely additive:

Y_ijk = µ + β_i + γ_j + (βγ)_ij + ε_ijk   (10.24)

with ε_ijk iid∼ N(0, σ²). In addition to constraints (10.21),

∑_{i=1}^{I} (βγ)_ij = 0 and ∑_{j=1}^{J} (βγ)_ij = 0 for all i and j   (10.25)

or analogous treatment constraints are often used. As in the previous two-way case, we
can derive the least squares estimates

y_ijk = ȳ_··· + (ȳ_i·· − ȳ_···) + (ȳ_·j· − ȳ_···) + (ȳ_ij· − ȳ_i·· − ȳ_·j· + ȳ_···) + (y_ijk − ȳ_ij·)   (10.26)
       =  µ̂   +      β̂_i      +       γ̂_j      +             (β̂γ)_ij            +     r_ijk
and a decomposition into sums of squares is straightforward.
Note that for K = 1 the error and the interaction coincide and cannot be separated.
Table 10.3 shows the table of a complete two-way ANOVA with interaction. The test statistics F_obs,A,
F_obs,B and F_obs,AB are compared with the quantiles of the F distribution with I − 1, J − 1
and (I − 1)(J − 1) numerator degrees of freedom, respectively, and N − IJ denominator degrees of freedom.
Table 10.3: Table of the complete two-way ANOVA with interaction.

Source       SS                                                 DF              MS               Test statistic
Factor A     SSA = ∑_{i,j,k} (ȳ_i·· − ȳ_···)²                   I − 1           MSA = SSA/DFA    F_obs,A = MSA/MSE
Factor B     SSB = ∑_{i,j,k} (ȳ_·j· − ȳ_···)²                   J − 1           MSB = SSB/DFB    F_obs,B = MSB/MSE
Interaction  SSAB = ∑_{i,j,k} (ȳ_ij· − ȳ_i·· − ȳ_·j· + ȳ_···)²  (I − 1)(J − 1)  MSAB = SSAB/DFAB F_obs,AB = MSAB/MSE
Error        SSE = ∑_{i,j,k} (y_ijk − ȳ_ij·)²                   N − IJ          MSE = SSE/DFE
Total        SST = ∑_{i,j,k} (y_ijk − ȳ_···)²                   N − 1
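In R the rows of this table are reproduced by summary(aov(...)) with an interaction term; a sketch with hypothetical balanced data (I = J = 2, K = 3):

```r
set.seed(7)
A <- gl(2, 6)                         # I = 2
B <- gl(2, 3, length = 12)            # J = 2, K = 3, hence N = 12
y <- rnorm(12, mean = 1 + as.numeric(A) * as.numeric(B))
fit <- aov(y ~ A * B)
summary(fit)   # rows: A, B, A:B, Residuals with N - IJ = 8 df
```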
10.3 Analysis of Covariance
We now combine regression and analysis of variance elements, i.e., continuous predictors
and factor levels, by adding "classical" predictors to models (10.2) or (10.20). Such models
are sometimes called ANCOVA models, for example

Y_ijk = µ + β_1 x_i + γ_j + ε_ijk,   (10.27)

with j = 1, . . . , J, k = 1, . . . , n_j and ε_ijk iid∼ N(0, σ²). Additional constraints are again
necessary. Keeping track of indices and Greek letters quickly gets cumbersome, and one
often resorts to R's formula notation. For example, if the predictor x_i is in the variable
Population and γ_j is in the variable Treatment, in the form of a factor, then

y ~ Population + Treatment   (10.28)

represents (10.27). An interaction is indicated with :, for example Treatment:Month.
The * operator denotes 'factor crossing': a*b is interpreted as a + b + a:b. See the section
Details of help(formula).
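A small sketch (with hypothetical variable names a and b) shows how R expands the * operator; terms() does not evaluate the variables, so none need to exist:

```r
# terms() reveals the expansion a*b -> a + b + a:b
f <- y ~ a * b
attr(terms(f), "term.labels")
## "a"   "b"   "a:b"
```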
ANOVA tables are constructed similarly, and the individual rows thereof are associated with the
individual elements of the formula specification.
Hence our unified approach via lm(...). Notice that the order of the variables in
formula (10.28) does not play a role for the estimates; for the decomposition of the sums of squares
it does. Different statistical software packages take different approaches and may thus
lead to minor differences.
10.4 Example
Example 10.2. (chemosphere data) Octocrylene is an organic UV filter found in sun-
screen and cosmetics. The substance is classified as a contaminant and dangerous for the
environment by the EU under the CLP Regulation.
Because the substance is difficult to break down, the environmental burden of Octocrylene
can be estimated through measurement of its concentration in sludge from waste treatment
facilities.
The study by Plagellat et al. (2006) analyzed Octocrylene (OC) concentrations from 24 dif-
ferent purification plants (of three different types, Treatment), each with
two samples (Month). Additionally, the catchment area (Population) and the amount of
sludge (Production) are known. Treatment type A refers to small plants, B to medium-sized
plants without considerable industry and C to medium-sized plants with industry.
R-Code 10.3 prepares the data and shows a one-way ANOVA. R-Code 10.4 shows a two-
way ANOVA (with and without interactions).
Figure 10.1 shows why the interaction is not significant: first, the seasonal effects in
groups A and B are very similar and, second, the variability in group C is too large. ♣
R-Code 10.3: UVfilter data: one-way ANOVA using lm.
UV <- read.csv( './data/chemosphere.csv')
UV <- UV[,c(1:6,10)] # reduce to OT
str(UV, strict.width='cut')
## 'data.frame': 24 obs. of 7 variables:
## $ Treatment : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 3 3 ..
## $ Site_code : Factor w/ 12 levels "A11","A12","A15",..: 1 2 3 4 7 ..
## $ Site : Factor w/ 12 levels "Chevilly","Cronay",..: 1 2 10 7..
## $ Month : Factor w/ 2 levels "jan","jul": 1 1 1 1 1 1 1 1 1 1 ..
## $ Population: int 210 284 514 214 674 5700 8460 11300 6500 7860 ...
## $ Production: num 2.7 3.2 12 3.5 13 80 150 220 80 250 ...
## $ OT : int 1853 1274 1342 685 1003 3502 4781 3407 11073 33..
with( UV, table(Treatment, Month))
## Month
## Treatment jan jul
## A 5 5
## B 3 3
## C 4 4
options( contrasts=c("contr.sum", "contr.sum"))
lmout <- lm( log(OT) ~ Treatment, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment, data = UV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.952 -0.347 -0.136 0.343 1.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.122 0.116 70.15 < 2e-16 ***
## Treatment1 -0.640 0.154 -4.16 0.00044 ***
## Treatment2 0.438 0.175 2.51 0.02049 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.555 on 21 degrees of freedom
## Multiple R-squared: 0.454,Adjusted R-squared: 0.402
## F-statistic: 8.73 on 2 and 21 DF, p-value: 0.00174
R-Code 10.4: UVfilter data: two-way ANOVA and two-way ANOVA with
interactions using lm. (See Figure 10.1.)
lmout <- lm( log(OT) ~ Treatment + Month, data=UV)
summary( lmout)
##
## Call:
## lm(formula = log(OT) ~ Treatment + Month, data = UV)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7175 -0.3452 -0.0124 0.1691 1.2236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.122 0.106 76.78 < 2e-16 ***
## Treatment1 -0.640 0.141 -4.55 0.00019 ***
## Treatment2 0.438 0.160 2.74 0.01254 *
## Month1 -0.235 0.104 -2.27 0.03444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.507 on 20 degrees of freedom
## Multiple R-squared: 0.566,Adjusted R-squared: 0.501
## F-statistic: 8.69 on 3 and 20 DF, p-value: 0.000686
summary( aovout <- aov( log(OT) ~ Treatment * Month, data=UV))
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 5.38 2.688 11.67 0.00056 ***
## Month 1 1.32 1.325 5.75 0.02752 *
## Treatment:Month 2 1.00 0.499 2.17 0.14355
## Residuals 18 4.15 0.230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD( aovout, which=c('Treatment'))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log(OT) ~ Treatment * Month, data = UV)
##
## $Treatment
## diff lwr upr p adj
## B-A 1.07760 0.44516 1.71005 0.00107
## C-A 0.84174 0.26080 1.42268 0.00446
## C-B -0.23586 -0.89729 0.42556 0.64105
boxplot(log(OT)~Treatment, data=UV, col=7, boxwex=.5)
at <- c(0.7, 1.7, 2.7, 1.3, 2.3, 3.3)
boxplot( log(OT) ~ Treatment + Month, data=UV, add=T, at=at, xaxt='n',
boxwex=.2)
with( UV, interaction.plot( Treatment, Month, log(OT), col=2:3))
Figure 10.1: UVfilter data: box plots sorted by treatment and interaction plot.
(See R-Code 10.4.)
Chapter 11
Extending the Linear Model
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter11.R.
In Chapters 8 to 10 we considered linear modeling in terms of simple regression, multiple
regression and ANOVA. The assumptions of the model were linearity in the parameters of
interest (inherently) and Gaussian, uncorrelated errors. Very often, however, there is
evidence in the data against these assumptions; for example, data may be correlated due
to spatial proximity. In this chapter we consider various possible extensions of the linear
model that address deviations from these assumptions.
11.1 Unifying the Linear Model
We first recall that "all" models we have seen in the last three chapters are linear models,

Y = Xβ + ε,   ε ∼ N_n(0, σ²I),   (11.1)

even if the observations or some of the predictors need to be transformed, or if we have
interactions. In practice we hardly ever need to specify the design matrix X explicitly; R
takes over that role.
Pre-multiplying the equation by a known matrix only changes the variance of the error,
not the linearity or Gaussianity of the model. This trick is used in the next section.
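Written out, the trick reads as follows (with A a known, invertible n × n matrix):

```latex
\mathbf{A}\mathbf{Y} = \mathbf{A}\mathbf{X}\boldsymbol{\beta} + \mathbf{A}\boldsymbol{\varepsilon},
\qquad
\mathbf{A}\boldsymbol{\varepsilon} \sim \mathcal{N}_n\bigl(\mathbf{0},\; \sigma^2 \mathbf{A}\mathbf{A}^\top\bigr),
```

so the mean is still linear in β and the errors remain Gaussian; only their covariance changes. Choosing A as an "inverse square root" of the error covariance renders the transformed errors uncorrelated again.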
11.2 Regression with Correlated Data
Equation (11.1) is generalized to
Y = Xβ + ε, ε ∼ Nn(0,Σ). (11.2)
This means that the errors are no longer independent. The least squares estimator of β
is still β̂ = (XᵀX)⁻¹XᵀY. This estimator is not optimal, however, since it does not take
into account all information about the model (here Σ). The "best" estimator is

β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹Y,   (11.3)

which correctly takes into account the correlation between the observations and the possibly
different variances.
In practice, the following special cases are often observed:
i) Σ = σ²V, where V is a (known) diagonal matrix. In R this diagonal matrix is
specified with the argument weights (weights=1/diag(V)); weighted least squares
(WLS).

ii) a) Σ = σ²R, where R is a (known) correlation matrix. In this case the data
and predictors are premultiplied by the inverse square root of R; generalized least
squares (GLS).

ii) b) Σ is parameterized, but the parameters are unknown. In R, the package nlme
(Nonlinear Mixed-Effects Models) contains the function gls.
The differences between (ii a) and (ii b) are subtle. We now consider situations illustrating
some of these cases in greater detail.
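A minimal sketch of case (ii a), assuming a known AR(1)-type correlation matrix R (hypothetical data): pre-multiplying by the inverse square root of R and applying ordinary least squares reproduces the closed form (11.3).

```r
set.seed(2)
n <- 50
x <- 1:n
R <- 0.5^abs(outer(1:n, 1:n, "-"))        # known correlation matrix
y <- 1 + 2 * x + as.vector(t(chol(R)) %*% rnorm(n))  # correlated errors

e <- eigen(R)                             # inverse square root of R
Rinvsqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
X  <- cbind(1, x)
yt <- as.vector(Rinvsqrt %*% y)           # transformed response
Xt <- Rinvsqrt %*% X                      # transformed predictors
beta_trans  <- coef(lm(yt ~ Xt - 1))      # OLS on transformed data
beta_closed <- solve(t(X) %*% solve(R) %*% X, t(X) %*% solve(R) %*% y)
```

Both vectors agree up to numerical error, which is the content of the GLS trick.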
11.2.1 Heteroscedasticity
The case of non-constant variance is quite ubiquitous. In terms of Model (9.3), we have
to relax ε_i iid∼ N(0, σ²) to ε_i indep∼ N(0, σ_i²). Often, and to a certain degree intuitively, the
variance increases with increasing response Y_i. Such a relationship is depicted in the so-
called "Scale-Location" plot in regression diagnostics (e.g., lower left panel of Figure 9.5).
The following example gives a couple of informal tests.
Example 11.1. (LifeCycleSavings data continued from Example 9.2) Starting
from the final model, lcs.aic, containing the variables pop15 and dpi, we compare the
variances of the residuals for different subgroups. R-Code 11.1 indicates that there is no
strong evidence of heteroscedasticity. Alternatively, we could assess whether there is evidence
of structure in the "Scale-Location" plot. While the red line in Figure 9.5 suggests some
heteroscedasticity, an informal regression model is not conclusive. Note that we should not
take the statistical tests too literally, as not all assumptions are satisfied. ♣
An alternative way to correct for heteroscedasticity is a transformation of the response
variable. We discuss this approach further in Section 11.3.1.
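When individual variances are modeled instead (case (i) above), the weights argument of lm implements WLS. A sketch with hypothetical data whose error standard deviation grows with x:

```r
set.seed(3)
x <- 1:50
y <- 1 + 0.5 * x + rnorm(50, sd = 0.1 * x)   # Var(eps_i) = (0.1 x_i)^2
fit_ols <- lm(y ~ x)
fit_wls <- lm(y ~ x, weights = 1 / x^2)      # weights = 1/diag(V)
rbind(ols = coef(fit_ols), wls = coef(fit_wls))
```

With the correct weights, the WLS estimates are more efficient than the OLS ones.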
R-Code 11.1 LifeCycleSavings data: informally assessing heteroscedasticity.
resids <- residuals(lcs.aic)
summary( lm( sqrt( abs( resids)) ~ fitted(lcs.aic)))
##
## Call:
## lm(formula = sqrt(abs(resids)) ~ fitted(lcs.aic))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.432 -0.478 0.031 0.537 1.654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2496 0.4915 4.58 3.4e-05 ***
## fitted(lcs.aic) -0.0750 0.0496 -1.51 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.767 on 48 degrees of freedom
## Multiple R-squared: 0.0455,Adjusted R-squared: 0.0256
## F-statistic: 2.29 on 1 and 48 DF, p-value: 0.137
ind <- LifeCycleSavings$dpi<1200
var.test( resids[ ind], resids[ !ind])
##
## F test to compare two variances
##
## data: resids[ind] and resids[!ind]
## F = 1.27, num df = 30, denom df = 18, p-value = 0.6
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.52083 2.84351
## sample estimates:
## ratio of variances
## 1.2732
# slightly more evidence using: ind <- LifeCycleSavings$pop15<35
11.2.2 Serially Correlated Data
If data are taken over time, observations might be serially correlated or dependent. That
means Corr(ε_{i−1}, ε_i) ≠ 0 and Var(ε) = Σ, which is no longer a diagonal matrix. This is easy
to test and to visualize through the residuals, as illustrated in Example 11.2. The example
shows that there is a strong dependency and proposes a model for the residuals:

ε_i = φ ε_{i−1} + ν_i,   i = 2, . . . , n,   (11.4)

where now ν_i iid∼ N(0, σ_ν²) and φ takes the role of the regression slope. As the residuals
have zero mean, we should not include an intercept here. Note that for ε_i we see ε_{i−1}
as a fixed predictor and not as random. We prefer this slight abuse of notation over a
cumbersome mathematical notation.
We estimate φ based on the residuals and use the estimate φ̂ to construct Σ̂ for a GLS
estimation of β. This joint estimation is not possible within a least squares approach,
and thus in practice maximum likelihood (type) procedures are used. Whether or not
model (11.4) is sufficient is beyond the scope of this document and would be formally
discussed in time series lectures.
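A sketch of this maximum-likelihood-based joint estimation with nlme::gls, using simulated data with AR(1) errors (hypothetical parameter values):

```r
library(nlme)                                # recommended package, ships with R
set.seed(4)
n <- 100
x <- seq(0, 1, length.out = n)
eps <- as.vector(arima.sim(list(ar = 0.5), n = n, sd = 0.5))
y <- 1 + 2 * x + eps
fit <- gls(y ~ x, correlation = corAR1())    # jointly estimates beta and phi
coef(fit)
```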
Example 11.2. (abrasion data continued from Example 9.3) Starting from the
final model res, we plot r_i versus r_{i−1} and fit a linear model for the
residuals (R-Code 11.2 and Figure 11.1). There is a strong serial correlation in the resid-
uals: a correlation of 0.53. Compared to this lag-one correlation, the lag-two correlation is
much smaller: Corr(r_i, r_{i−2}) = 0.26. An improved model would be based on the function
nlme::gls with argument correlation = corAR1(). ♣
Figure 11.1: abrasion data: serial correlation and linear model fit for the
residuals (left: lag one; right: lag two). (See R-Code 11.2.)
R-Code 11.2 abrasion data: informally assessing serial correlation in the residuals. (See Figure 11.1.)
resids <- residuals( res)
n <- length( resids)
plot( resids[-n], resids[-1], type = "p", pch = 19)
cor.test(resids[-1], resids[-n]) # estimate is given at end
##
## Pearson's product-moment correlation
##
## data: resids[-1] and resids[-n]
## t = 3.24, df = 27, p-value = 0.0031
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.20233 0.75042
## sample estimates:
## cor
## 0.52956
# model is r_i = phi * r_{i-1} + nu_i; no intercept!
abline( residlm <- lm( resids[-1]~resids[-n]-1))
# summary( residlm) # gives similar results as the cor.test.
plot( resids[-c(n,n-1)], resids[-c(1:2)], type = "p", pch = 19)
cor.test(resids[-c(n,n-1)], resids[-c(1:2)])$p.value
## [1] 0.18385
11.2.3 Spatial Data
Spatial statistics is concerned with data/measurements that possess topological, geometric,
or geographic properties. Examples, reflecting the basic types of spatial data,
are:
[A] geostatistical data: Cadmium (poisonous heavy metal) concentration in the sedi-
ment of Lake Geneva, see Example 11.3.
[B] lattice data: number of Bluetongue disease cases in cattle within 150+ Swiss districts
[C] point process data: cattle farms on which Bluetongue disease is diagnosed.
In these cases there appears to be correlation between the data: measurements taken
closer to each other in space appear more similar than those taken farther apart.
W. Tobler, an American-Swiss geographer with close ties to UZH, stated, "Everything
is related to everything else, but near things are more related than distant things," which
is now referred to as the first law of geography (Tobler, 1970). We experience this “law”
every day, for example, when we sense the weather: between Zurich and Effretikon there
is no large temperature difference, but there is between Visp (VS) and Zurich. This
dependence is utilized in spatial statistics.
Cases [A] and [B] can often be modeled with the multivariate normal distribution. The
next section considers case [A].
Hence, the problem can be framed in the context of a conditional distribution as given in
Equation (7.18). In practice, the expectation and variance are not known and need to be
estimated. However, we often have only one observation and the classical estimators, as
given in (7.19), cannot be used.
The estimation of the dependency structure is often referred to as variogram estimation
and the prediction of the “process” at unobserved locations as kriging. We will not discuss
these latter two points in detail but give a simple illustration.
Example 11.3. (leman data) Here we have 293 measurements of nickel concentration
in the sediment of Lake Geneva from the year 1983; see Furrer and Genton (1999).
The measurements are displayed in R-Code 11.3 and Figure 11.2. We illustrate kriging
through a black-box use of the function Krig of the package fields. The result is also
given in Figure 11.2. ♣
R-Code 11.3: leman data: EDA. (See Figure 11.2.)
load( "data/leman83_red.RData")
# str( leman); dim(leman) # etc...
names( leman)
## [1] "x" "y" "Zn" "Cr" "Cd" "Co" "Sr" "Hg" "Pb" "Ni"
library( fields)
plot( y~x, col=tim.colors()[cut(Ni,64)], pch=19, data=leman)
# simpler alternative is quilt.plot
lines( lake.data)
hist( leman$Ni, main='') # not quite bell shaped but no outliers
# Construct a grid within the lake boundaries:
library(splancs)
xr <- seq(min(lake.data[,1]),to=max(lake.data[,1]),l=110)
yr <- seq(min(lake.data[,2]),to=max(lake.data[,2]),by=xr[2]-xr[1])
locs <- data.frame( x=lake.data[,1], y=lake.data[,2])
grid <- expand.grid( x=xr, y=yr) # create a 2-dim grid
pts <- pip( grid, locs, bound=TRUE) # pip points-in-polygon
# pts contains a fine grid
# Kriging:
out <- Krig( cbind(leman$x,leman$y), leman$Ni, give.warnings=FALSE)
pout <- predict( out, pts)
quilt.plot(pts, pout, nx=length(xr)-2, ny=length(yr)-1)
Figure 11.2: leman data: EDA and prediction. (See R-Code 11.3.)
11.3 Nonlinear Regression
Naturally, the linearity assumption (in the individual parameters) of the linear model
is central. There are situations where we have a nonlinear relationship between the
observations and the parameters. In this section we enumerate some of the alternatives,
so that at least the relevant statistical terms can be placed in context.
Model (9.3) is generalized to

Y_i ≈ f(x_i, β),   (11.5)

where f(x, β) is a (sufficiently well-behaved) function depending on a vector of covariates
x_i and the vector of parameters β. We write the relationship as approximate,
since we do not necessarily claim that we have additive Gaussian errors. For
example, we may have multiplicative errors, or the right-hand side of model (11.5) may
be used for the expectation of the response.
11.3.1 Transformation of the Response
As seen in Section 11.2.1, it is possible that the variance of the residuals increases with
increasing observations. A classical example is a Poisson random variable. Instead of mod-
eling individual variances, as done there, transforming the response is an alternative approach.
For example, instead of a linear model for Y_i, a linear model for log(Y_i) is constructed.
There are formal approaches to determine an optimal transformation of the data, notably
the function boxcox from the package MASS. In practice, however, log(·) and √· are used
predominantly.
If the original data truly stem from a linear model, a transformation typically leads only to
an approximation. Conversely, the multiplicative relation

Y_i = β_0 · x_i^{β_1} · e^{ε_i}   (11.6)

is turned by the log transformation into the truly linear relationship

log(Y_i) = log(β_0) + β_1 log(x_i) + ε_i.   (11.7)
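A quick sketch of this log-linearization, with hypothetical values β_0 = 3 and β_1 = 1.5:

```r
set.seed(5)
x <- runif(60, 1, 10)
y <- 3 * x^1.5 * exp(rnorm(60, sd = 0.2))   # multiplicative errors as in (11.6)
fit <- lm(log(y) ~ log(x))
coef(fit)   # intercept near log(3), slope near 1.5
```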
11.3.2 Nonlinear and Non-parametric Regression
When model (11.5) has additive (Gaussian) errors, we can estimate β based on least
squares, that is,

β̂ = argmin_β ∑_{i=1}^{n} (y_i − f(x_i, β))².   (11.8)
However, we do not have closed forms for the resulting estimates and iterative approaches
are needed. That is, starting from an initial condition we improve the solution by small
steps. The correction factor typically depends on the gradient. Such algorithms are
variants of the so-called Gauss–Newton algorithm. The details of these are beyond the
scope of this document.
There is no guarantee that an optimal solution (a global minimum) exists. Using nonlinear
least squares as a black-box approach is dangerous and should be avoided. In R, the
function nls (nonlinear least squares) can be used; general optimizer functions are nlm
and optim.
There are some prototype nonlinear functions that are often used:

Y_i = β_1 exp(−(x_i/β_2)^{β_3}) + ε_i                          Weibull model,   (11.9)

Y_i = { β_1 + ε_i, if x_i < β_3;   β_2 + ε_i, if x_i ≥ β_3 }   break point model.   (11.10)
The exponential growth model is a special case of the Weibull model.
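A sketch fitting the Weibull model (11.9) with nls on simulated data (hypothetical parameter values; in practice the choice of starting values is crucial):

```r
set.seed(6)
x <- seq(0.1, 5, length.out = 50)
y <- 2 * exp(-(x / 2)^1.5) + rnorm(50, sd = 0.05)  # b1 = 2, b2 = 2, b3 = 1.5
fit <- nls(y ~ b1 * exp(-(x / b2)^b3),
           start = list(b1 = 1, b2 = 1, b3 = 1))   # Gauss-Newton iterations
coef(fit)
```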
An alternative way to express a response in a nonlinear fashion is through a virtually arbitrary
flexible function, similar to the (red) guide-the-eye curves in many of our plots. Such
smooth curves are constructed through a so-called "non-parametric" approach.
A detailed discussion of non-parametric regression is beyond the scope here, but we note
that non-parametric regression can also be formulated as a linear model with many,
many parameters.
11.3.3 Generalized Linear Regression
The responses Y_i in the previous chapters, as well as in models (11.1) and (11.2), are Gaussian
(albeit not necessarily independent nor with constant variance). If this is not the case,
for example when the observations y_i are count data, then we need another approach. The idea
of this new approach is to model the expected value of Y_i, or a function thereof, as a linear
function of some predictors. This approach is called the generalized linear model (GLM)
and contains classical linear regression as a special case. The estimation procedure is
quite similar to a nonlinear approach, but the details are again beyond the scope.
To illustrate the concept we look at simple logistic regression, one particular case of the
GLM. Example 11.4 is based on widely discussed data.
Example 11.4. (orings data) In January 1986 the space shuttle Challenger exploded
shortly after taking off. Part of the problem was with the rubber seals, the so-called O-
rings, of the booster rockets. The data set data( orings, package="faraway") contains
the number of defects among the six seals in 23 previous launches (Figure 11.3). The question
we ask here is whether the probability of a defect for an arbitrary seal can be predicted
for an air temperature of 31°F (as in January 1986). See Dalal et al. (1989) for a detailed
account. ♣
Figure 11.3 shows why linear models are often inadequate here: the variable of interest
is a probability, meaning y_i ∈ [0, 1], but a linear model cannot guarantee predictions in [0, 1].
In this and similar cases, logistic regression is appropriate. The logistic regression models
the probability of a defect as

p = P(defect) = 1 / (1 + exp(−β_0 − β_1 x)),   (11.11)

where x is the air temperature. Through inversion one obtains a linear model for the log
odds,

g(p) = log( p / (1 − p) ) = β_0 + β_1 x,   (11.12)

where g(·) is generally called the link function. In this special case, the inverse g⁻¹(·) is
called the logistic function.
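The link g and its inverse are available in R as qlogis and plogis, which gives a quick check of (11.11)–(11.12):

```r
p <- 0.2
qlogis(p)           # log(p / (1 - p)), the log odds
plogis(qlogis(p))   # inverting the link recovers p = 0.2
```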
R-Code 11.4 orings data and estimated probability of defect dependent on air
temperature. (See Figure 11.3.)
data( orings, package="faraway")
str(orings)
## 'data.frame': 23 obs. of 2 variables:
## $ temp : num 53 57 58 63 66 67 67 67 68 69 ...
## $ damage: num 5 1 1 1 0 0 0 0 0 0 ...
plot( damage/6~temp, xlim=c(21,80), ylim=c(0,1), data=orings,
xlab="Temperature [F]", ylab='Probability of damage') # data
abline(lm(damage/6~temp, data=orings), col='gray') # regression line
glm1 <- glm( cbind(damage,6-damage)~temp, family=binomial, data=orings)
points( orings$temp, glm1$fitted, col=2) # fitted values
ct <- seq(20, to=85, length=100) # vector to predict
p.out <- predict( glm1, new=data.frame(temp=ct), type="response")
lines(ct, p.out)
abline( v=31, col='gray', lty=2) # actual temp. at start
unlist( predict( glm1, new=data.frame(temp=31), type="response",
se.fit=TRUE))[1:2] # ('residual.scale' not relevant here).
## fit.1 se.fit.1
## 0.993034 0.011533
Figure 11.3: orings data (black dots) and estimated probability of defect (red
dots) dependent on air temperature. Linear fit is given by the gray solid line.
Dotted vertical line is the ambient launch temperature at the time of launch.
(See R-Code 11.4.)
11.4 Bibliographic Remarks and Links
The book by Faraway (2006) nicely summarizes the many facets of this chapter.
In R the packages fields and geoR are very helpful for modelling spatial data.
Chapter 12
Bayesian Methods
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter12.R.
In statistics there exist two different philosophical approaches to inference: frequentist
and Bayesian inference. Past chapters dealt with the frequentist approach; now we turn
to the Bayesian approach. Here, we consider the parameter as a random variable with
a suitable distribution, which is chosen a priori, i.e., before the data are collected and
analyzed. The goal is to update this prior knowledge after observation of the data in
order to draw conclusions (with the help of the so-called posterior distribution).
12.1 Motivating Example and Terminology
Bayesian statistics is often introduced by recalling the so-called Bayes theorem, which
states for two events A and B

P(A | B) = P(B | A) P(A) / P(B),   for P(B) ≠ 0,   (12.1)

and is shown by using Equation (2.2) twice. Bayes' theorem is often used in probability
theory to calculate probabilities along an event tree, as illustrated in the archetypal example
below.
Example 12.1. A patient sees a doctor and gets tested for a (relatively) rare disease.
The prevalence of this disease is 0.5%. As is typical, the screening test is not perfect: it
has a sensitivity of 99% (true positive rate; the disease is properly identified in a sick
patient) and a specificity of 98% (true negative rate; a healthy person is correctly
identified as disease free). What is the probability that the patient has the disease, given
that the test is positive?
Denoting the events D = 'Patient has disease' and + = 'test is positive', we have, us-
ing (12.1),

P(D | +) = P(+ | D) P(D) / P(+)
         = P(+ | D) P(D) / ( P(+ | D) P(D) + P(+ | ¬D) P(¬D) )   (12.2)
         = 99% · 0.5% / (99% · 0.5% + 2% · 99.5%) ≈ 20%.   (12.3)
Note that for the denominator we have used the so-called law of total probability to get
an expression for P(+). ♣
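The calculation (12.2)–(12.3) takes two lines of R:

```r
prev <- 0.005; sens <- 0.99; spec <- 0.98
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # law of total probability
sens * prev / p_pos                             # P(D | +), roughly 0.20
```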
An interpretation of the previous example from a frequentist view is in terms of proportions
of outcomes (in a repeated sampling framework). In the Bayesian view, we regard the probabilities
as "degrees of belief", where we have some proposition (event A in (12.1) or D in the
example) and some evidence (event B in (12.1) or + in the example). More specifically,
P(D) represents the prior belief in our proposition, P(+ | D)/P(+) is the support of the
evidence for the proposition and P(D | +) is the posterior belief in the proposition after
having accounted for the new evidence.
Extending Bayes' theorem to the setting of two continuous random variables X and Y we
have

f_{X|Y=y}(x) = f_{Y|X=x}(y | x) f_X(x) / f_Y(y).   (12.4)

In the context of Bayesian inference the random variable X will now be a parameter,
typically of the distribution of Y:

f_{Θ|Y=y}(θ | y) = f_{Y|Θ=θ}(y | θ) f_Θ(θ) / f_Y(y).   (12.5)
Hence, current knowledge about the parameter is expressed by a probability distribution
on the parameter: the prior distribution. The model for our observations is called the
likelihood. We use our observed data to update the prior distribution and thus obtain
the posterior distribution.
Notice that P(B) in (12.1), P(+) in (12.2), and f_Y(y) in (12.4) serve as normalizing
constants, i.e., they do not depend on A, D, x or the parameter θ. This can be expressed in
several ways:
fΘ|Y=y(θ | y) ∝ fY |Θ=θ(y | θ)× fΘ(θ), (12.6)
f(θ | y) ∝ f(y | θ)× f(θ). (12.7)
The symbol “∝” means “proportional to” and is used because the normalizing constants
are not considered:
f(θ | y) = f(y, θ) / f(y) = f(y | θ) f(θ) / f(y) ∝ f(y | θ) f(θ).   (12.8)
Finally, we can summarize the most important result in Bayesian inference: the posterior
density is proportional to the likelihood multiplied by the prior density, i.e.,

Posterior density ∝ Likelihood × Prior density.   (12.9)
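The proportionality (12.9) also suggests a simple numerical recipe: evaluate likelihood × prior on a grid of parameter values and normalize by the sum. A minimal sketch (our own toy setup with an assumed Beta(3, 3) prior and a binomial likelihood with 10 successes in 13 trials):

```r
theta <- seq(0.005, 0.995, by=0.01)   # grid over the parameter
prior <- dbeta( theta, 3, 3)          # assumed prior density on the grid
lik <- dbinom( 10, 13, theta)         # likelihood of the observed data
unnorm <- lik * prior                 # posterior up to the normalizing constant
post <- unnorm / sum( unnorm * 0.01)  # normalize numerically
sum( post * 0.01)                     # integrates to one
```

The normalizing constant never has to be known in closed form; it is recovered numerically from the grid.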
Advantages of using a Bayesian framework are:
• a formal way to incorporate prior knowledge;
• an intuitive interpretation of the posterior;
• much easier modeling of complex systems;
• no dependence of ‘significance’ on the sample size n, as is the case with the p-value.
As nothing comes for free, there are also some disadvantages:
• more ‘elements’ have to be specified for a model;
• in virtually all cases, computationally more demanding.
Until recently, there were clear fronts between frequentists and Bayesians. Luckily, these
fronts have largely vanished.
12.2 Examples
We start with two very typical examples that are tractable and illustrate the concept of
Bayesian inference well.
Example 12.2. (Beta-Binomial) Let Y ∼ Bin(n, p). We observe y successes (out of n).
As was shown in Section 5.1, p̂ = y/n. We often have additional knowledge about the
parameter p. For example, let n = 13 be the number of autumn lambs in a herd of sheep.
We count the number of male lambs. It is highly unlikely that p ≤ 0.1. Thus we assume
that p is beta distributed, i.e.,

f(p) = c · p^{α−1} (1 − p)^{β−1},   p ∈ [0, 1], α > 0, β > 0,   (12.10)

with normalization constant c. We write p ∼ Beta(α, β). Figure 2.8 shows densities for
various pairs (α, β).
The posterior density is then

f(p | y) ∝ \binom{n}{y} p^y (1 − p)^{n−y} × c · p^{α−1} (1 − p)^{β−1}   (12.11)

∝ p^y p^{α−1} (1 − p)^{n−y} (1 − p)^{β−1}   (12.12)

∝ p^{y+α−1} (1 − p)^{n−y+β−1},   (12.13)
which can be recognized as a Beta(y + α, n − y + β) distribution.
The expected value of a Beta(α, β) random variable is α/(α + β) (here: the prior
distribution). The posterior expected value is thus

E(p | Y = y) = (y + α) / (n + α + β).   (12.14)
The beta distribution Beta(1, 1), i.e., α = 1, β = 1, is equivalent to the uniform distribution
U(0, 1). The uniform distribution, however, is not “information-free”: by Equation (12.14),
a uniform prior is “equivalent” to two experiments, of which one is a success, i.e., one of
two lambs is male.
Figure 12.1 illustrates the case of y = 10, n = 13 with prior Beta(3, 3). The frequentist
empirical 95% CI is [0.5, 0.92], see Equation (5.9). The 2.5% and 97.5% quantiles of the
posterior are 0.45 and 0.83, respectively. ♣
Figure 12.1: Beta-binomial model with prior density (cyan), data/likelihood
(green) and posterior density (blue).
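The quantities of Example 12.2 can be reproduced directly; a short sketch using the conjugacy result Beta(y + α, n − y + β):

```r
y <- 10; n <- 13       # observed successes out of n
alpha <- 3; beta <- 3  # prior Beta(3, 3)
# Posterior is Beta(y + alpha, n - y + beta) = Beta(13, 6):
( postmean <- (y + alpha) / (n + alpha + beta))   # Equation (12.14)
qbeta( c(0.025, 0.975), y + alpha, n - y + beta)  # posterior quantiles
```

The posterior mean is 13/19 ≈ 0.68, shrunk from the maximum likelihood estimate 10/13 toward the prior mean 1/2.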
Example 12.3. (Normal-Normal) Let Y1, . . . , Yn iid ∼ N(µ, σ²). We assume that µ ∼
N(η, τ²) and that σ is known. Thus we have the Bayesian model:

Yi | µ iid ∼ N(µ, σ²),   i = 1, . . . , n,   (12.15)

µ | η ∼ N(η, τ²).   (12.16)
The posterior density is then

f(µ | y1, . . . , yn) ∝ f(y1, . . . , yn | µ) × f(µ)   (12.17)

∝ ∏_{i=1}^n f(yi | µ) × f(µ)   (12.18)

∝ ∏_{i=1}^n exp( −(1/2) (yi − µ)²/σ² ) · exp( −(1/2) (µ − η)²/τ² )   (12.19)

∝ exp( −(1/2) ∑_{i=1}^n (yi − µ)²/σ² − (1/2) (µ − η)²/τ² ),   (12.20)
where the normalization constants (2πσ²)^{−1/2} (one per observation) and (2πτ²)^{−1/2} do
not need to be considered. Through further manipulation (completing the square in µ) one
obtains
∝ exp( −(1/2) (n/σ² + 1/τ²) [ µ − (n/σ² + 1/τ²)^{−1} (nȳ/σ² + η/τ²) ]² )   (12.21)
and thus the posterior distribution is

N( (n/σ² + 1/τ²)^{−1} (nȳ/σ² + η/τ²),  (n/σ² + 1/τ²)^{−1} ).   (12.22)
In other words, the posterior expected value

E(µ | y1, . . . , yn) = η · σ²/(nτ² + σ²) + ȳ · nτ²/(nτ² + σ²)   (12.23)

is a weighted mean of the prior mean η and the mean ȳ of the likelihood. The greater n
is, the less weight there is on the prior mean, since σ²/(nτ² + σ²) → 0 as n → ∞. ♣
R-Code 12.1 Normal-normal model. (See Figure 12.2.)
# Information about data:
ybar <- 2.1; n <- 4; sigma2 <- 1
# information about prior:
priormean <- 0; priorvar <- 2
# Calculating the posterior variance and mean:
postvar <- 1/( n/sigma2 + 1/priorvar)
postmean <- postvar*( ybar*n/sigma2 + priormean/priorvar )
# Plotting follows:
y <- seq(-2, to=4, length=500)
plot( y, dnorm( y, postmean, sqrt( postvar)), type='l', col=4,
ylab='Density', xlab=expression(mu))
lines( y, dnorm( y, ybar, sqrt( sigma2/n)), col=3)
lines( y, dnorm( y, priormean, sqrt( priorvar)), col=5)
legend( "topleft", legend=c("Data/likelihood", "Prior", "Posterior"),
col=c(3, 5, 4), bty='n', lty=1)
The posterior mode is often used as a summary statistic of the posterior distribution.
Naturally, the posterior median and posterior mean (i.e., expectation of the posterior
distribution) are intuitive alternatives.
With the frequentist approach we have constructed confidence intervals, but these intervals
cannot be interpreted with probabilities. An empirical (1−α)% confidence interval [bu, bo]
Figure 12.2: Normal-Normal model with prior (cyan), data/likelihood (green)
and posterior (blue). (See R-Code 12.1.)
contains the true parameter with a frequency of (1 − α)% in infinite repetitions of the
experiment. With a Bayesian approach, we can now make statements about the parameter
with probabilities. In Example 12.3, based on Equation (12.22),

P( v^{−1}m − z_{1−α/2} v^{−1/2} ≤ µ ≤ v^{−1}m + z_{1−α/2} v^{−1/2} ) = 1 − α,   (12.24)

with v = n/σ² + 1/τ² and m = nȳ/σ² + η/τ². That means that the bounds v^{−1}m ± z_{1−α/2} v^{−1/2}
can be used to construct a Bayesian counterpart to a confidence interval.
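With the numbers of R-Code 12.1 (ȳ = 2.1, n = 4, σ² = 1, η = 0, τ² = 2), these bounds are quickly computed:

```r
ybar <- 2.1; n <- 4; sigma2 <- 1   # data
eta <- 0; tau2 <- 2                # prior
v <- n/sigma2 + 1/tau2             # v = n/sigma^2 + 1/tau^2
m <- n*ybar/sigma2 + eta/tau2      # m = n*ybar/sigma^2 + eta/tau^2
m/v + c(-1, 1) * qnorm(0.975) / sqrt(v)   # 95% bounds v^{-1}m -+ z v^{-1/2}
```

Here v^{−1}m ≈ 1.87 is the posterior mean already computed in R-Code 12.1.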
Definition 12.1. The interval R with

∫_R f(θ | y1, . . . , yn) dθ = 1 − α   (12.25)

is called a (1 − α)% credible interval for θ with respect to the posterior density
f(θ | y1, . . . , yn), and 1 − α is the credible level of the interval. ♦
The definition states that the random variable whose density is given by f(θ | y1, . . . , yn)
is contained in the (1− α)% credible interval with probability (1− α).
Since the credible interval for a fixed α is not unique, the “narrowest” one is often used. This
is the so-called HPD interval (highest posterior density interval). A detailed discussion
can be found in Held (2008). Credible intervals are often determined numerically.
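A numerical determination is straightforward; a sketch for the posterior Beta(13, 6) of Example 12.2 that compares the equal-tailed interval with the (narrower) HPD interval by shifting the tail probabilities:

```r
a <- 13; b <- 6; alpha <- 0.05
qbeta( c(alpha/2, 1 - alpha/2), a, b)    # equal-tailed credible interval
# HPD: among all intervals with coverage 1 - alpha, take the shortest
p <- seq(0, alpha, length=1000)          # probability placed in the lower tail
len <- qbeta(1 - alpha + p, a, b) - qbeta(p, a, b)
i <- which.min(len)
qbeta( c(p[i], 1 - alpha + p[i]), a, b)  # HPD interval
```

By construction the HPD interval is at most as long as the equal-tailed one; for symmetric posteriors the two coincide.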
The Bayesian counterpart to hypothesis testing is done through a comparison of posterior
probabilities. For example, consider two models specified by two hypotheses H0 and H1.
By Bayes’ theorem,

P(H0 | y1, . . . , yn) / P(H1 | y1, . . . , yn) = P(y1, . . . , yn | H0) / P(y1, . . . , yn | H1) × P(H0) / P(H1),   (12.26)

(posterior odds = Bayes factor × prior odds),
that means that the Bayes factor summarizes the evidence in the data for the hypothesis
H0 versus the hypothesis H1. However, a Bayes factor needs to exceed 3 before one can
speak of substantial evidence for H0; for strong evidence we typically require Bayes factors
larger than 20.
Bayes factors are popular because they are linked to the BIC (Bayesian Information
Criterion) and thus automatically penalize model complexity. Further, they also work for
non-nested models.
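As a hypothetical illustration (not from the script): testing H0: p = 1/2 against H1: p ∼ U(0, 1) for the lamb data of Example 12.2. Under H1 the marginal likelihood of y successes out of n is the binomial likelihood integrated over the uniform prior, which equals 1/(n + 1):

```r
y <- 10; n <- 13
m0 <- dbinom( y, n, 0.5)  # marginal likelihood under H0: p = 1/2
m1 <- 1/(n + 1)           # under H1: binomial integrated over the uniform prior
( BF01 <- m0/m1 )         # Bayes factor of H0 versus H1
```

Here BF01 ≈ 0.49 < 1, i.e., the data slightly favor H1; with equal prior odds, the posterior odds equal the Bayes factor.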
12.3 Choice and Effect of Prior Distribution
The choice of the prior distribution belongs to the modeling process just like the choice of
the likelihood.
The examples in the last section were such that the posterior and prior distributions
belonged to the same class. Naturally, that is no coincidence. In these examples we have
chosen so-called conjugate prior distributions.
Figure 12.3: Normal-Normal model with prior (cyan), data/likelihood (green)
and posterior (blue). Two different priors (top) and increasing n (bottom;
n = 4, 36, 64, 100).
With other prior distributions we may obtain posterior distributions which we no longer
“recognize”, and normalizing constants must be explicitly calculated.
Example 12.4. We consider again the Normal-Normal model and compare the posterior
density for various n with the likelihood. We keep ȳ = 2.1, independent of n. As shown in
Figure 12.3, the maximum likelihood estimate does not depend on n (ȳ is kept constant by
design). The uncertainty decreases, however (the standard error is σ/√n). For increasing n,
the posterior approaches the likelihood density. In the limit there is no difference between
the posterior and the likelihood. ♣
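The effect can be traced numerically by extending R-Code 12.1 over the four sample sizes of Figure 12.3 (keeping ȳ = 2.1 fixed):

```r
ybar <- 2.1; sigma2 <- 1; priormean <- 0; priorvar <- 2
for (n in c(4, 36, 64, 100)) {
  postvar <- 1/( n/sigma2 + 1/priorvar)
  postmean <- postvar*( ybar*n/sigma2 + priormean/priorvar)
  cat( n, postmean, postvar, "\n")  # mean -> ybar, variance -> sigma2/n
}
```

The posterior mean moves toward ȳ = 2.1 and the posterior variance shrinks like σ²/n.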
12.4 Bibliographic Remarks and Links
The choice of the prior distribution gives rise to several discussions and we refer to Held
and Sabanés Bové (2014) for more details.
Chapter 13
A Closer Look: Monte Carlo
Methods
R-Code for this chapter: www.math.uzh.ch/furrer/download/sta120/chapter13.R.
Several examples in the previous chapter consisted of posterior and prior distributions of
the same distribution type (but different parameters). This was no coincidence; rather,
we chose so-called conjugate priors based on our likelihood functions.
With other prior distributions, under certain circumstances we have unknown, “compli-
cated” posterior distributions, for which, for example, we no longer know the expected
value. The calculation of the integral is often too complex and so here we consider classic
(Monte Carlo) simulation procedures as a solution to this problem.
In general, Monte Carlo simulation is used to numerically solve a complex problem through
repeated random sampling. This utilizes, above all, the law of large numbers.
13.1 Monte Carlo Integration
First let us consider how we can simply approximate integrals. Let fX(x) be a density
and g(x) an arbitrary function. We are looking for the expected value of g(X), i.e.,
E( g(X) ) = ∫_ℝ g(x) f_X(x) dx.   (13.1)

An approximation of this integral is (method of moments)

E( g(X) ) = ∫_ℝ g(x) f_X(x) dx ≈ Ê( g(X) ) = (1/n) ∑_{i=1}^n g(xi),   (13.2)

where x1, . . . , xn is a random sample from f_X(x).
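A minimal illustration of (13.2) (our own toy example): for X ∼ N(0, 1) and g(x) = x², the exact value is E(X²) = 1:

```r
set.seed(1)
n <- 100000
x <- rnorm(n)   # random sample from f_X
mean(x^2)       # Monte Carlo approximation of E(g(X)), close to 1
```

The approximation error decreases like 1/√n, which is slow but independent of the dimension of the integral.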
Example 13.1. Let g(x1, x2) be the indicator function of the set x1² + x2² ≤ 1. We have
the following approximation of the number π (the area of a circle with radius 1):

π = ∫∫ g(x1, x2) dx1 dx2 ≈ 4 × (1/n) ∑_{i=1}^n g(x1,i, x2,i),   (13.3)

where x1,i and x2,i, i = 1, . . . , n, are two random samples from U(0, 1).
It is important to note that the convergence is very slow, see Figure 13.1. ♣
R-Code 13.1 Approximation of π with the aid of Monte Carlo integration (See
Figure 13.1.)
set.seed(14)
m <- 49
n <- round( 10+1.4^(1:m))
piapprox <- numeric(m)
for (i in 1:m) {
  st <- matrix( runif( 2*n[i]), ncol=2)
  piapprox[i] <- 4*mean( rowSums( st^2) <= 1)
}
plot( n, abs( piapprox-pi)/pi, log='xy', type='l')
lines( n, 1/sqrt(n), col=2, lty=2)
cbind( n=n[1:7*7], pi.approx=piapprox[1:7*7], rel.error=
abs( piapprox[1:7*7]-pi)/pi, abs.error=abs( piapprox[1:7*7]-pi))
## n pi.approx rel.error abs.error
## [1,] 21 2.4762 0.21180409 0.66540218
## [2,] 121 3.0083 0.04243968 0.13332819
## [3,] 1181 3.1634 0.00694812 0.02182818
## [4,] 12358 3.1662 0.00783535 0.02461547
## [5,] 130171 3.1403 0.00040166 0.00126186
## [6,] 1372084 3.1424 0.00025959 0.00081554
## [7,] 14463522 3.1406 0.00032656 0.00102592
Figure 13.1: Convergence of the approximation for π: the relative error as a
function of n. (See R-Code 13.1.)
13.2 Rejection Sampling
When no method exists to directly sample from a distribution with density fY (y), rejection
sampling can be used. In this method values from a known density fZ(z) are drawn
and through rejection of unsuitable values, observations of fY (y) are generated. This
method can also be used when the normalizing constant of fY (y) is unknown. We write
fY (y) = c · f ∗(y).
The procedure is as follows: find an m < ∞ such that f*(y) ≤ m · fZ(y). Then draw y
from fZ(y) and u from a standard uniform distribution. If u ≤ f*(y)/(m · fZ(y)), then
y is accepted as a simulated value from fY(y); otherwise y is discarded and no longer
considered. For efficiency reasons, the constant m should be chosen as small as possible.
Example 13.2. We draw values from a Beta(6, 3) distribution with the rejection
sampling method; additionally we need a uniform distribution. That means fY(y) =
c · y^{6−1}(1 − y)^{3−1}, i.e., f*(y) = y^{6−1}(1 − y)^{3−1} with normalizing constant c = 1/B(6, 3) =
168, and fZ(y) = 1_{[0,1]}(y). We select m = 0.02, which fulfills the condition f*(y) ≤ m · fZ(y).
An implementation of the example is given in R-Code 13.2. The R-Code can be optimized
with respect to speed. It would then, however, be more difficult to read. Figure 13.2 shows
a histogram and the density of the simulated values. ♣
R-Code 13.2: Rejection sampling in the setting of a beta distribution.
(See Figure 13.2.)
set.seed( 14)
n.sim <- 1000
m <- 0.02
fst <- function(y) y^( 6-1) * (1-y)^(3-1)
f_Z <- function(y) ifelse( y >= 0 & y <= 1, 1, 0)
result <- sample <- rep( NA, n.sim)
for (i in 1:n.sim) {
  sample[i] <- runif(1)
  u <- runif(1)
  if( u < fst( sample[i]) /( m * f_Z( sample[i])) ) # if accepted ...
    result[i] <- sample[i]
}
mean( !is.na(result)) # proportion of accepted samples
## [1] 0.285
result <- result[ !is.na(result)]
## Figures:
hist( sample, xlab="y", main="", col="lightblue")
hist( result, add=TRUE, col=4)
curve( dbeta(x, 6, 3), frame =FALSE, ylab="", xlab='y', yaxt="n")
lines( density( result), lty=2, col=4)
legend( "topleft", legend=c("truth", "smoothed empirical"),
lty=1:2, col=c(1,4))
Figure 13.2: On the left we have a histogram of the simulated values of fZ(y)
(light blue) and fY (y) (dark blue). On the right the theoretical density (truth)
and the simulated density (smoothed empirical) are drawn. (See R-Code 13.2.)
13.3 Gibbs Sampling
The idea of Gibbs sampling is to simulate the posterior distribution through the use of a
Markov chain; it thus belongs to the family of Markov chain Monte Carlo (MCMC) methods.
We illustrate the principle with a joint distribution f(θ1, θ2 | y) of the two parameters θ1 and
θ2. The Gibbs sampler reduces the problem to two one-dimensional simulations: samples
are drawn alternately from f(θ1 | θ2, y) and f(θ2 | θ1, y), where we condition in each step
on the previously sampled values of θ2 and θ1, respectively.
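To make the alternation concrete, here is a minimal hand-written Gibbs sampler (our own sketch, not from the script) for the normal model with unknown mean µ and precision κ = 1/σ², using the conjugate full conditionals (normal for µ, gamma for κ):

```r
set.seed(14)
y <- rnorm(10, 1, 1); n <- length(y)   # simulated data
eta <- 0; tau2 <- 1.2                  # prior mu ~ N(eta, tau2)
a <- 1; b <- 0.2                       # prior kappa ~ Gamma(a, b)
n.iter <- 2000
mu <- kappa <- numeric(n.iter)
mu[1] <- mean(y); kappa[1] <- 1
for (i in 2:n.iter) {
  # full conditional of mu given kappa: normal (as in the normal-normal model)
  v <- 1/( n*kappa[i-1] + 1/tau2)
  mu[i] <- rnorm(1, v*( kappa[i-1]*sum(y) + eta/tau2), sqrt(v))
  # full conditional of kappa given mu: gamma (rate parameterization)
  kappa[i] <- rgamma(1, a + n/2, b + sum( (y - mu[i])^2)/2)
}
mean(mu); mean(kappa)   # posterior summaries
```

Each iteration updates one parameter at a time while holding the other fixed; the resulting chain has the joint posterior as its stationary distribution.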
In many cases one does not have to program a Gibbs sampler oneself but can use a pre-
programmed sampler. We use the sampler JAGS (Just Another Gibbs sampler) (Plum-
mer, 2003) with the R-Interface package rjags (Plummer, 2016).
When using MCMC methods, you may encounter situations in which the sampler does
not converge (or converges too slowly). In such a case the posterior distribution cannot
be approximated with the simulated values. It is therefore important to examine the
simulated values for eye-catching patterns. Additionally, diagnostic plots, for example
the trace plot in Figure 13.3 (left), are often used. Further information about these
diagnostics is found in Lunn et al. (2012), where a short introduction to Bayesian statistics
and numerous examples of (Open)BUGS models can also be found.
Example 13.3. The following three R-Codes give a short glimpse into Markov chain
Monte Carlo methods with JAGS. R-Code 13.3 implements the normal-normal model with
y = 1, n = 1 and known variance, R-Code 13.4 the normal-normal model with n = 10
and known variance, and R-Code 13.5 the normal-normal-gamma model with n = 10 and
unknown variance. The corresponding Figures 13.3, 13.4 and 13.5 give the empirical and
exact densities of the posterior, prior, and likelihood. ♣
R-Code 13.3: JAGS sampler for normal-normal model, with n = 1. (See
Figure 13.3.)
require( rjags)
writeLines("model {
  y ~ dnorm( mu, 1)
  mu ~ dnorm( 0, 1/1.2) # here Precision = 1/Variance
}", con="jags01.txt")
jagsModel <- jags.model( 'jags01.txt', data=list( 'y'=1))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 7
##
## Initializing model
postSamples <- coda.samples( jagsModel, 'mu', n.iter=2000)
plot( postSamples, density=FALSE, main="", auto.layout=FALSE)
plot( postSamples, trace=FALSE, main="", auto.layout=FALSE,
xlim=c(-2,3), ylim=c(0,.56))
rug( summary( postSamples )$quantiles, lwd=2, tick=.2)
y <- seq(-3,to=4, length=100)
lines( y, dnorm(y, 1, 1), col=3)
lines( y, dnorm(y, 0, sqrt(1.2) ), col=4)
lines( y, dnorm(y, 1/(1+1/1.2), sqrt(1/(1+1/1.2))), col=2)
Figure 13.3: Left: trace plot of the posterior µ | y = 1. Right: empirical
densities: MCMC based posterior (black), exact (red), prior (blue), likelihood
(green). (See R-Code 13.3.)
R-Code 13.4: JAGS sampler for normal-normal model, with n = 10. (See
Figure 13.4.)
set.seed( 4)
n <- 10
obs <- rnorm( n, 1, 1)
writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, 1)
  }
  mu ~ dnorm( 0, 1/2)
}", con="jags02.txt")
jagsModel <- jags.model( 'jags02.txt', data=list('y'=obs, 'n'=n),
quiet=T)
postSamples <- coda.samples(jagsModel, 'mu', n.iter=2000)
plot( postSamples, density=FALSE, main="", auto.layout=FALSE)
plot( postSamples, trace=FALSE, main="", auto.layout=FALSE,
xlim=c(-.5, 3), ylim=c(0, 1.3))
rug( obs, col=3)
rug( summary( postSamples )$quantiles, lwd=2, tick=.2)
y <- seq(-.7, to=3.5, length=100)
lines( y, dnorm( y, mean(obs), sqrt(1/n)), col=3)
lines( y, dnorm( y, 0, sqrt(2) ), col=4)
lines( y, dnorm( y, n*mean(obs)/(n+1/2), sqrt(1/(n+1/2))), col=2)
Figure 13.4: Left: trace plot of the posterior µ | y = y1, . . . , yn. Right:
empirical densities: MCMC based posterior (black), exact (red), prior (blue),
likelihood (green). (See R-Code 13.4.)
R-Code 13.5: JAGS sampler for normal-normal-gamma model, with n =
10. (See Figure 13.5.)
nu <- 0 # hyperparameters
tau2 <- 1.2
writeLines("model {
  for (i in 1:n) {
    y[i] ~ dnorm( mu, kappa)
  }
  mu ~ dnorm( nu, 1/tau2)
  kappa ~ dgamma(1, .2)
}", con="jags03.txt")
jagsModel <- jags.model('jags03.txt', quiet=T,
data=list('y'=obs, 'n'=n, 'nu'=nu, 'tau2'=tau2))
postSamples <- coda.samples(jagsModel, c('mu','kappa'), n.iter=2000)
plot( postSamples[,"mu"], trace=FALSE, auto.layout=FALSE,
xlim=c(-1,3), ylim=c(0, 1.2))
y <- seq( -2, to=5, length=100)
lines( y, dnorm(y, mean(obs), sd(obs)/sqrt(n)), col=3)
lines( y, dnorm(y, 0, sqrt(1.2) ), col=4)
plot( postSamples[,"kappa"], trace=FALSE, auto.layout=FALSE)
y <- seq( 0, to=5, length=100)
lines( y, dgamma( y, 1, .2), type='l', col=4)
lines( y, dgamma( y, n/2+1, n*var(obs)/2 ), col=3)
# Some cleanup of the files `jags0?.txt' might be necessary
13.4 Bibliographic Remarks and Links
An alternative to JAGS is BUGS (Bayesian inference Using Gibbs Sampling) which is
distributed as two main versions: WinBUGS and OpenBUGS. Additionally, there is the
R-Interface package (R2OpenBUGS, Sturtz et al., 2005). Other possibilities are the Stan
or INLA engines with convenient user interfaces to R through rstan and INLA (Gelman
et al., 2015; Rue et al., 2009; Lindgren and Rue, 2015).
Figure 13.5: Empirical posterior densities of µ | y1, . . . , yn (left) and of κ =
1/σ2 | y1, . . . , yn (right), MCMC based (black), prior (blue), likelihood (green).
(See R-Code 13.5.)
The list of textbooks discussing MCMC is long. Held and Sabanés Bové (2014) presents
some basic and accessible ideas. Accessible examples of actual implementations can be
found in Kruschke (2015) (JAGS and Stan) and Kruschke (2010) (BUGS).
Bibliography
Abraham, B. and Ledolter, J. (2006). Introduction To Regression Modeling. Duxbury
applied series. Thomson Brooks/Cole. 105
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-building and Response Surfaces.
Wiley. 113
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, U.S.A.
118
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989). Risk analysis of the space shuttle:
Pre-challenger prediction of failure. Journal of the American Statistical Association,
84, 945–957. 149
Fahrmeir, L., Kneib, T., and Lang, S. (2009). Regression: Modelle, Methoden und An-
wendungen. Springer, 2 edition. 109
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). Regression: Models, Methods
and Applications. Springer. 109
Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed
Effects and Nonparametric Regression Models. CRC Press. 151
Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular
attention to the false discovery proportion. Statistical Methods in Medical Research,
17, 347–388. 58
Furrer, R. and Genton, M. G. (1999). Robust spatial data analysis of lake Geneva sed-
iments with S+SpatialStats. Systems Research and Information Science, 8, 257–272.
146
Gelman, A., Lee, D., and Guo, J. (2015). Stan: A probabilistic programming language
for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics,
40, 530–543. 168
Gemeinde Staldenried (1994). Verwaltungsrechnung 1993 Voranschlag 1994. 5
Held, L. (2008). Methoden der statistischen Inferenz: Likelihood und Bayes. Springer,
Heidelberg. 44, 158
Held, L. and Sabanés Bové, D. (2014). Applied Statistical Inference. Springer. 44, 47,
160, 169
Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. John Wiley
& Sons. 87
Hüsler, J. and Zimmermann, H. (2010). Statistische Prinzipien für medizinische Projekte.
Huber, 5 edition. 33
Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions.
Wiley-Interscience, 3rd edition. 32
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994). Continuous Univariate Distribu-
tions, Vol. 1. Wiley-Interscience, 2nd edition. 32
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distribu-
tions, Vol. 2. Wiley-Interscience, 2nd edition. 32
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
Academic Press, first edition. 169
Kruschke, J. K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and
Stan. Academic Press/Elsevier, second edition. 169
Kupper, T., De Alencastro, L., Gatsigazi, R., Furrer, R., Grandjean, D., and Tarradellas,
J. (2008). Concentrations and specific loads of brominated flame retardants in sewage
sludge. Chemosphere, 71, 1173–1180. 130
Landesman, R., Aguero, O., Wilson, K., LaRussa, R., Campbell, W., and Penaloza, O.
(1965). The prophylactic use of chlorthalidone, a sulfonamide diuretic, in pregnancy.
J. Obstet. Gynaecol., 72, 1004–1010. 63
Leemis, L. M. and McQueston, J. T. (2008). Univariate distribution relationships. The
American Statistician, 62, 45–53. 32
Lindgren, F. and Rue, H. (2015). Bayesian spatial modelling with R-INLA. Journal of
Statistical Software, 63, i19. 168
Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2012). The BUGS
Book: A Practical Introduction to Bayesian Analysis. Texts in Statistical Science.
Chapman & Hall/CRC. 165
Olea, R. A. (1991). Geostatistical Glossary and Multilingual Dictionary. Oxford University
Press. 3
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 2008-11-14,
http://matrixcookbook.com. 99
Plagellat, C., Kupper, T., Furrer, R., de Alencastro, L. F., Grandjean, D., and Tarradellas,
J. (2006). Concentrations and specific loads of UV filters in sewage sludge originating
from a monitoring network in Switzerland. Chemosphere, 62, 915–925. 136
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models us-
ing Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed
Statistical Computing (DSC 2003). Vienna, Austria. 165
Plummer, M. (2016). rjags: Bayesian Graphical Models using MCMC. R package version
4-6. 165
Rudolf, M. and Kuhlisch, W. (2008). Biostatistik: Eine Einführung für Biowissenschaftler.
Pearson Studium. 48
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the
Royal Statistical Society B, 71, 319–392. 168
Siegel, S. and Castellan Jr, N. J. (1988). Nonparametric Statistics for The Behavioral
Sciences. McGraw-Hill, 2nd edition. 87
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680. 2
Sturtz, S., Ligges, U., and Gelman, A. (2005). R2WinBUGS: A package for running WinBUGS
from R. Journal of Statistical Software, 12, 1–16. 168
SWISS Magazine (2011). Key environmental figures, 10/2011–01/2012, p. 107. http:
//www.swiss.com/web/EN/fly swiss/on board/Pages/swiss magazine.aspx. 7, 8
Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46, 234–240. 146
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press. 12
Tufte, E. R. (1990). Envisioning Information. Graphics Press. 12
Tufte, E. R. (1997a). Visual and Statistical Thinking: Displays of Evidence for Making
Decisions. Graphics Press. 12
Tufte, E. R. (1997b). Visual Explanations: Images and Quantities, Evidence and Narra-
tive. Graphics Press. 12
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. 12
Glossary
Throughout the document we tried to be consistent with standard mathematical notation.
We write random variables as uppercase letters (X, Y , . . . ), realizations as lower case
letters (x, y, . . . ), matrices as bold uppercase letters (Σ, X, . . . ), and vectors as bold
italics lowercase letters (x , β, . . . ). (The only slight confusion arises with random vectors
and matrices.)
The following glossary contains a non-exhaustive list of the most important notation.
Standard operators or products are not repeatedly explained.
:= Define the left hand side by the expression on the other side.
♣, ♦ End of example, end of definition.
∫, ∑, ∏ Integration, summation and product symbols. If there is no ambiguity,
we omit the domain in inline formulas.
∪, ∩ Union, intersection of sets or events.
θ̂ Estimator or estimate of the parameter θ.
x̄ Arithmetic mean of the sample: ∑_{i=1}^n xi/n.
|x| Absolute value of the scalar x.
||x || Norm of the vector x .
X^T Transpose of a matrix X.
x(i) Order statistics of the sample xi.
0, 1 Vector or matrix with components 0 respectively 1.
Cov(X, Y ) Covariance between two random variables X and Y .
Corr(X, Y ) Correlation between two random variables X and Y .
d/dx, ′, ∂/∂x Derivative and partial derivative with respect to x.
diag(A) Diagonal entries of an (n× n)-matrix A.
ε, εi Random variable or process, usually measurement error.
E(X) Expectation of the random variable X.
e, exp(·) Transcendental number e = 2.71828 18284, the exponential function.
In = I Identity matrix, I = (δij).
I_A Indicator function, taking the value one if A is true and zero otherwise.
lim Limit.
log(·) Logarithmic function to the base e.
max A, min A Maximum, minimum of the set A.
N, Nd Space of natural numbers, of d-vectors with natural elements.
N (µ, σ2) Normal (Gaussian) distribution with mean µ and variance σ2.
N_p(µ, Σ) p-dimensional normal distribution with mean vector µ and variance
matrix Σ.
ϕ(x) Gaussian probability density function ϕ(x) = (2π)^{−1/2} exp(−x²/2).
Φ(x) Gaussian cumulative distribution function Φ(x) = ∫_{−∞}^x ϕ(z) dz.
π Transcendental number π = 3.14159 26535.
P(A) Probability of the event A.
R, Rn, Rn×m Space of real numbers, real n-vectors and real (n×m)-matrices.
rank(A) The rank of a matrix A is defined as the number of linearly independent
rows (or columns) of A.
tr(A) Trace of a matrix A, defined as the sum of its diagonal elements.
Var(X) Variance of the random variable X.
Z, Zd Space of integers, of d-vectors with integer elements.
The following table contains the abbreviations of the statistical distributions (dof denotes
degrees of freedom).
N(µ, σ²) Gaussian or normal random variable with parameters µ and σ².
N(0, 1), zp Standard normal random variable, p-quantiles thereof.
Bin(n, p), bn,p,1−α Binomial random variable with n trials and success probability p,
(1 − α)-quantile thereof.
χ²ν, χ²ν,p Chi-squared distribution with ν dof, p-quantile thereof.
Tn, tn,p Student’s t-distribution with n dof, p-quantile thereof.
Fm,n, fm,n,p F-distribution with m and n dof, p-quantile thereof.
Utab(nx, ny; 1−α) (1 − α)-quantile of the distribution of the Wilcoxon rank sum statistic.
Wtab(n?; 1−α) (1 − α)-quantile of the distribution of the Wilcoxon signed rank statistic.
The following table contains the abbreviations of the statistical methods, properties and
quality measures.
EDA Exploratory data analysis.
dof Degrees of freedom.
MAD Median absolute deviation.
MAE Mean absolute error.
ML Maximum likelihood (ML estimator or ML estimation).
MM Method of moments.
MSE Mean squared error.
OLS, LS Ordinary least squares.
RMSE Root mean squared error.
WLS Weighted least squares.
Index of Statistical Tests
Comparing a sample mean with a theoretical value, 53
Comparing means from two independent samples, 54
Comparing means from two paired samples, 56
Comparing two variances, 59
Comparison of location for two independent samples, 81
Comparison of location of two dependent samples, 82
Comparison of observations with expected frequencies, 60
General remarks about statistical tests, 52
Performing a one-way analysis of variance, 129
Permutation test, 86
Sign test, 85
Test of a linear relationship, 108
Test of correlation, 104
Test of proportions, 70
Dataset Index
abrasion, 118, 119, 121, 144, 145
anscombe, 102, 103
binorm, 94, 102
chemosphere, 136
hardness, 105–107, 109
leman, 3, 146, 147
LifeCycleSavings, 121, 122, 124, 142, 143
orings, 149–151
portions, 46, 53, 55–57, 59–61
retardant, 130, 133
UVfilter, 136, 137, 139