MATH 105: [Probability and] Statistics

Joe Whittaker, B25 Fylde College

Department of Mathematics and Statistics, Lancaster University

April 2010

LUVLE: https://domino.lancs.ac.uk/09-10/MATH/MATH105.nsf

Organization

The module runs for five weeks, weeks 21-25, with four lectures a week, a weekly workshop and a weekly Lab100 help session. Handouts:

• Course notes

• Exercises: Workshop, Quiz, Course Work.

Please bring both to the lectures and workshops. The notes have gaps which are to be filled in during the lectures.

Your participation in the course, by taking part in experiments, contributing in lectures and workshops and responding to the questionnaire, is much appreciated.

Timetable

sun mon tue wed thu fri

9-10 GFoxLT1

10-11 GFoxLT1 WkShop4

11-12 OfficeB25 Faraday OfficeB25

12-1 GFoxLT1

1-2 105-QZCW due WkShop1

2-3

3-4

4-5 WkShop2

5-6 WkShop3

11pm 100-QZ due

Lectures: are held at 10am Tuesday, 11am Wednesday, 9am Thursday and 12 noon Friday.

Workshops: will be held in Management School Lecture Theatre 7. Lists of groups are posted outside the Maths and Stats Department Office in Fylde College.

Workshops start in the first week.

Labs: continue in the lab100 stream: Monday at 10; 12; 4; 5; Tuesday at 9; 11; 12; 5. Labs start in the first week, and there is a test in week 24.

Any problems, please see Julia in B4c Fylde.

Assessment

• 20% Course Work (10% quiz + 10% written),

• 30% end-of-term test (Friday week 25),

• 50% final exam.

Deadlines

Online quiz questions (labelled QZ) should be completed by 2pm on the following Wednesday.

Homework questions (labelled CW) should be handed in by 2pm on the following Wednesday in your tutor's pigeonhole.

Solutions are posted on the course webpage.

Labs

The Lab100 course is running in parallel with math105. Weekly help sessions are available. You are expected to have downloaded R on to your computer. The first lectures are on R, and are examinable in math105.

Preliminaries

The math105 course continues on from math104, very directly. Firstly you have met R in the lab100 work associated with math104. Both math104 and math105 require R and the first Chapter here goes over a tutorial introduction to some of the basic concepts of the language.

Secondly, the extension of probability from discrete random variables, discussed in math104, to continuous random variables is discussed here. Both the discrete and the continuous cases are needed for statistics. The mathematical prerequisite for the analysis of continuous random variables is the integral calculus of math101.

The third part of the course introduces the statistical methods which are required for tackling a range of applied problems. The focus is on strategies for data modelling rather than mathematical theory. However, there is some theory, and we aim to introduce the basic concepts, which will be taught fully in later statistics courses.

Data examples are used throughout the course, to illustrate the techniques that the course aims to teach you. The course data sets are on the LUVLE course web.

At the end of this course, you should be able to:

• understand the basic concepts and objects of the R language, including some elements of programming;

• define the basic concepts of continuous random variables, the probability density function and the cumulative distribution function;

• have familiarity with some standard continuous random variables, such as the Uniform, Exponential and Normal; be aware of their parameters and how these relate to expectations;

• use R to make computations and plots of the cdf, and of quantiles derived from it;

• use R to simulate from standard distributions;

• use graphical tools such as histograms, scatterplots, the empirical distribution function and the boxplot;

• calculate and understand numerical summary statistics such as mean, median, variance, quantiles and the correlation coefficient;

• discuss a range of modelling assumptions that can play a part in statistical analysis.

Background reading

Although the lectures and these accompanying notes are self-contained, further details can be found in the following recommended texts:

Clarke, G.M. and Cooke, D. (1998). A Basic Course in Statistics. 4th ed, Arnold.

Daly, F., Hand, D., Jones, M., Lunn, A. and McConway, K. (1995). Elements of Statistics. Addison Wesley.

Lindsey, J. (1995). Introductory Statistics: A Modelling Approach. Oxford Science Publications.

Chapter 1

Introduction to R

R is a software package, and a language, that provides a statistical computing environment. R is open source and can be downloaded from http://www.r-project.org. More information on obtaining R, and this tutorial, can be found on our /department/info/intranet/comppages.

1.1 The tutorial

Objects

Type x = 3 or x <- 3 to create a new object called x which has the value 3. The operator = or <- is not the mathematical = but an assignment operator.

Predict the value of y to understand what is happening:

x <- 6

x

x^2

y <- x*(4+x/2)

y # the answer

# is a comment, and all to its right is ignored. The arithmetic operators + - * / ( ) work as expected. The hat is used for exponentials, so 3^2 is 9.

Exercise 1.1 Create a new variable called z with the value "five cubed divided by seven plus two".

Sol:

z = 5^3/(7+2) # 13.88889

z = 5^3/7+2 # 19.85714

Use precedence to resolve ambiguity.
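The two readings of Exercise 1.1 can be compared side by side. This is a minimal sketch; the names z1 and z2 are illustrative and not part of the course files.

```r
# Exponentiation is evaluated first, then division, then addition.
z1 <- 5^3 / (7 + 2)   # brackets force the sum first: 125 / 9
z2 <- 5^3 / 7 + 2     # no brackets: (125 / 7) + 2

print(c(z1, z2))
```

If in doubt about how R will read an expression, adding brackets costs nothing and makes the intention explicit.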

There are some tricks you can use to save on typing: whenever possible paste R code from the pdf file into a text editor; use the up and down arrow keys to recall and edit previous commands.

Functions

Most statements in R involve functions, and usually involve the use of round brackets (). Functions are ways of running commands in R on given inputs; there may or may not be an output.

y <- sin(pi/4) # gives y the sine of pi/4

round(y)

ls() # lists your objects

rm(y) # removes y

q() # quit R.

At exit you are asked if you would like to save the objects you created. If you answer "yes" all the objects will still be there the next time you start R.

Type ls without the (). Notice that this shows the code of the ls function, but does not run the function.

Vectors

Vectors are used to store more than one number in an object.

x <- c(pi, 1, 8.6, -1, 0) # "c" function creates vector

y <- 1:5 # y gets (1, 2, 3, 4, 5)

y <- seq(from=0,to=6,by=2)# a sequence, takes 3 args

length(y) # number of elements in y

z <- c("a","b","c") # z gets 3 characters

Vectors are indexed using the square brackets [].

x[3] # element 3 of x

x[c(4,1)] # elements 4 and 1 of x

x[-3] # x without element 3

x+2 # element by element arithmetic

round(x) # rounds each element of x

Notice how functions work on vectors: they apply to each element of the vector.

Exercise 1.2 For each of the numbers 2, 3, 4, 5, 6 and 12 find the square of the number divided by 2.

Sol:

x = c(2, 3, 4, 5, 6, 12)

x^2/2 # or is it?

(x/2)^2

Graphics: simple plots

The function plot() starts a new plot. Usually it requires a vector of x-coordinates and a vector of y-coordinates as input:

x <- 1:20

y <- x^3

plot(x,y) # function with 2 (or more) arguments

A new graphics window pops up for the plot. Subsequent plots overwrite the current plot in this window.

To get a line plot instead of a point plot, use the optional argument type="l" with plot():

plot(x,y,type="l")

You can add points or lines to an existing plot, using the points or lines functions:

plot(x,y)

points(rev(x),y) # rev reverses the order

lines(x,8000-y)

Different characters for points need a pch= argument. Numbers give various symbols, and characters use that letter as a marker:

plot(x,y)

points(rev(x),y,pch=3) # add crosses

points(x,8000-y,pch="x") # a character

Change line widths with lwd=, or line styles with lty=. Colours are set with col=.

plot(x,y, col="red")

lines(x,y,lwd=4) # thick line

lines(rev(x),y,lty=2) # dashed line

You can label your axes with the xlab= and ylab= arguments, and you can give your plot a title with the main= argument.

plot(x,y,xlab="X Is Across",

ylab="Y Is Up",

main="Main Title")

Exercise 1.3 Draw a blue circle, put a nice title on your graph but no axis labels. Hint: think radial.

theta <- seq(0, 2*pi, length=100)

x = cos(theta)

y = sin(theta)

plot(x,y,type='n', # sets out axes, no points

xlab="", ylab="", main="circle")

lines(x,y,col="blue")

Getting Help in R

You can start R’s help system by typing help.start(), or by using the menus.

Either one will start a web browser window showing the R help web page. If this doesn't work, go to http://stat.ethz.ch/R-manual/R-patched/doc/html/index.html or http://tinyurl.com/cny9k.

e.g. To find out more about the seq function either enter ?seq, or use the menus.

Exercise 1.4 Use the text function to put the name of your favourite philosopher in the centre of your blue circle.

Sol:

?text

text(0,0,'Plato')

Reading data into R

The read.table() function is used to read a data file into R. Save the file class96.dat in your home directory. Look at the file in your favourite text editor.

Load it into R with a command such as

class96 <- read.table("h:/class96.dat")# windows

class96 <- read.table("class96.dat") # linux local dir

class96 <- read.table("~/class96.dat") # linux top dir

Notice that the forward slash, not the backward slash, is used to delimit folders, even in Windows.

Matrices

The class96 dataset contains the heights and weights of the students enrolled in GSSE401 in 96. The four columns in the matrix are: number in list, height (cm), weight (kg), and sex, where 1=female, 2=male.

class(class96) # data.frame, more than a matrix

class96 # displays the values

dim(class96) # dimensions of the matrix

class96[3,4] # element in row 3 column 4

class96[1:3,] # first three rows and all the cols

class96[,c(1,3)] # 1st and 3rd column

hist(class96[,2],

main="Student Heights", xlab="cm")

# histogram of student heights

Headers and column names

names(class96) # default names

names(class96) = c("number", "height",

"weight", "gender")

class96

class96[1:3, c("height", "weight")]

# same as class96[1:3,c(2,3)]

class96$height[1:3] # list access

There are more options for read.table, e.g. to read in files separated by commas, use sep=",". The na.strings argument tells R how missing values are coded. R expects missing values to be written as NA. If the missing values are coded differently, a dot say, use na.strings=".".
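The sep= and na.strings= options can be tried without any course file. The sketch below writes a small made-up file to a temporary location and reads it back; the data values and the object names tmp and demo are assumptions for illustration only.

```r
# Write a tiny comma-separated file to a temporary location (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("number,height,weight",
             "1,170,65",
             "2,182,.",
             "3,165,58"), tmp)

# header=TRUE takes the column names from the first line,
# sep="," splits on commas, na.strings="." reads a lone dot as NA.
demo <- read.table(tmp, header = TRUE, sep = ",", na.strings = ".")

demo
is.na(demo$weight)   # flags the student whose weight was coded as a dot
```

Without na.strings=".", the weight column would be read as text rather than numbers, because the dot is not a valid number.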

Writing functions

Functions are of the generic form

name <- function(input args) statements

For example

myfun <- function(x) plot(x, x^2-x)

myfun takes one argument, x, and makes a graph with it. It expects x to be a vector with several values.

x = seq(-2, 3, len=100)

myfun(x)

myfun # lists the code

Storing commands in files

Usually we write our R functions in separate files and then load them into R. Tradition has it that we give these files .R extensions. You can use any editor to write functions, though Emacs recognises R code and has some special R features. Create a new file called joe.R containing

theplot <- function() {

x <- seq(-2, 2, len=1000)

y <- log(abs(x))

plot(x, y, main="the plot", lwd=2, col="green")

}

Notice that the statements of the function are enclosed in curly braces {}. There are three ways of loading this file into R: (1) typing source("joe.R"); (2) if you're running R in Emacs, pull up the file joe.R and press Ctrl-C Ctrl-L, or use the menus ESS -> Load File; (3) if your R interface has menus (such as the Gnome or Windows GUI), select File -> source. Having sourced the file joe.R, run the function theplot().

Arguments

Change your function theplot to the following

theplot <- function(minx, maxx) {

x <- seq(minx, maxx, len=1000)

y <- log(abs(x))

plot(x, y, main="the plot", lwd=2, col="green")

}

As you can probably guess, if you type theplot(-1, 3) the graph that results will have an x axis going from −1 to 3. You could get the same result by running

theplot(maxx=3, minx=-1) # or

theplot(min=-1, ma=3)

If the first line of the file is changed to

theplot <- function(minx=-2, maxx=2)

the default values of minx and maxx will be −2 and 2. Try running theplot(max=5).
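Argument handling can be explored without drawing anything. The sketch below uses a hypothetical helper, thegrid, which is not part of the course files; it returns the x values that a function like theplot would use, so the effect of defaults, named arguments and abbreviated names is visible directly.

```r
# Hypothetical helper: same arguments as theplot, but returns the grid
# instead of plotting it. The extra argument len is also an assumption.
thegrid <- function(minx = -2, maxx = 2, len = 5) {
  seq(minx, maxx, length.out = len)
}

thegrid()                   # both defaults are used
thegrid(maxx = 5)           # a named argument overrides one default
thegrid(max = 5, min = 0)   # abbreviated names are matched to maxx and minx
```

Abbreviated names only work when the abbreviation identifies a single argument, so spelling names out in full is the safer habit in code you intend to keep.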

Types of R objects

There is a difference between numbers and characters. 1 is a number. "one" and "1" are character strings. Try the following:

x <- c("1", "2")

x

x+1

as.numeric(x) + 1

as.character(1:4)

class(x) # gives class of an object

Logicals

Logicals or binary data consist of TRUE (or T or 1) or FALSE (or F or 0). Predict the output from running the following in R.

x <- 1:4

x == 2 # notice the double ==

x > 2

x != 2

x[x!=2] # clever

x[x!=2] <- 7

Exercise 1.5 Write a function that changes every negative element in a vector to NA.

Sol:

neg2na = function(x) { x[x<0] = NA; return(x) }

y = c(1,2,-3,4)

neg2na(y)

Lists

Lists are collections of stuff; an element of a list can be anything: a matrix, a vector, a function, or even another list.

x <- list(lenin = sum,

marx = c("bourgeois", "class struggle"),

engels = matrix(1:4, nrow=2))

x

x$marx

x[[2]]

x[[1]](1:4) # is 1+2+3+4

x[["engels"]]

length(x) ; names(x); summary(x)

Most statistical functions in R return lists.

Nothing

NULL means nothing. NA is a missing value. NaN means not a number. Try the following to see if there is a difference between NULL and NA.

c()

c(1:3, NA)

c(1:3, NULL)

x <- c(NA, NA, 3, pi)

x == 3

is.na(x)

is.na(NULL)

is.null(x)

is.null(NULL)

Typing x==NA fails; you need is.na(x). Other functions to test the class of an object are is.list, is.matrix, is.logical, is.numeric.
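A short sketch of why x==NA is not useful and what to use instead. The na.rm= argument shown at the end is a standard option of mean and similar summaries; it is not mentioned in the notes above.

```r
x <- c(NA, NA, 3, pi)

x == NA                 # every comparison with NA is itself NA
is.na(x)                # TRUE where the value is missing
sum(is.na(x))           # counts the missing values
mean(x)                 # NA, because some values are missing
mean(x, na.rm = TRUE)   # drops the missing values before averaging
```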

Programming: Loops

The for loop is an important programming tool. A simple loop is

for( x in c(4,2,6) ) print(x)

There are other kinds of loops which are more difficult to use but are faster than for loops. The by function is quite clever. To find the mean height of the class96 students according to sex (the column named gender above),

by(class96$height, class96$gender, mean)

Apply

To apply a function to every column of a matrix use apply, and for lists use lapply.
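A minimal sketch of apply and lapply on small made-up objects, so the results can be checked by hand; the names m and lst are illustrative.

```r
m <- matrix(1:6, nrow = 2)         # a 2 x 3 matrix, filled column by column

col_sums <- apply(m, 2, sum)       # second argument 2 means "by column"
row_sums <- apply(m, 1, sum)       # second argument 1 means "by row"

lst <- list(a = 1:3, b = c(10, 20))
means <- lapply(lst, mean)         # a list with one mean per element

col_sums
row_sums
means
```

The second argument of apply is the one that is easiest to get wrong: 1 refers to rows and 2 to columns.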

Exercise 1.6 Use apply to find the sum of each column of class96.

Sol:

sum(class96)

apply(class96,2,sum)

mean(class96) # mean is special [class96 has names]

apply(class96,2,mean)

If statements

Predict the output from the following code

for (x in 1:4) {

if (x>3) {

print("big")

} else {

print("small")

}

}

Exercise 1.7 Write the code to draw four circles with decreasing radii. Put an if statement in the for loop to make the 2 smallest circles red.

theta = seq(0,2*pi,length=100)

x0 = cos(theta)

y0 = sin(theta)

plot(x0,y0,type='n')

for(i in 1:4) {

r = (.8)^i

x = r*x0

y = r*y0

if(i>2) lines(x,y,col='red')

else lines(x,y,col='blue')

} # end for

Objects within functions

Any objects you create within a function die when the function ends.

dumbfun <- function() x <- 1

x <- 2

dumbfun()

x

x only had the value 1 inside the function while it was running. That x was local to the function: the x outside the function kept the value 2 throughout.

Saving objects

save(x, y, file="xandy.RData")

load("xandy.RData") # load the file, with x and y

save.image() # save all objects to .RData

When you quit with the q() function, R runs save.image(). Whenever you start R, R runs load(".RData"). Emacs asks you what directory to start R in, because it will load the .RData file from that directory.

Multiple plots

The graphics device can display several small plots at a time instead of one big one. Use the par() (parameter) function:

par(mfrow=c(2,3)) # 2x3 array

for(plato in 1:6) plot(1:5, pch=plato)

mtext("The Republic", outer=T,

line=-3, cex=2, col="red")

The screen clears when you try to plot the seventh graph. To reset the plotting window back to normal, do par(mfrow=c(1,1)).

Printing plots

Plots can be printed directly but it is best to save them to a file. This is particularly useful for essays and reports since these files can be easily read into standard word processors (Latex, Word, etc). Try the following:

par(mfrow=c(1,1))

plot(sin(1:1000))

pdf(file="sines.pdf", height=4, width=6) # give file name size

plot(sin(1:1000)) # plot to file

dev.off() # finished writing

Check this works: xpdf sines.pdf.

Functions for probability distributions

There are four functions related to standard distributions in R, prefixed by one of d, p, q, r. For the Poisson distribution, the pmf is $p(x) = \exp(-\lambda)\lambda^x/x!$ for $x = 0, 1, \ldots$.

dpois(2, lambda=1 ) # d gives the pmf

# parameter lambda

exp(-1)*1^2/factorial(2)

ppois(2, lambda=1 ) # p gives the cdf

exp(-1)*(1^0+1^1/factorial(1)+1^2/factorial(2))

qpois(.7, lambda=1 ) # q gives the quantile

rpois(10, lambda=1 ) # r gives random numbers
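The four prefixes fit together in a way that can be checked numerically. The sketch below does so for the Poisson distribution with lambda = 1; the seed value and object names are arbitrary choices for illustration.

```r
lambda <- 1
pmf <- dpois(0:2, lambda)

# The cdf at 2 is the sum of the pmf over 0, 1, 2.
all.equal(sum(pmf), ppois(2, lambda))

# qpois returns the smallest x whose cdf reaches the requested probability.
q <- qpois(0.7, lambda)
ppois(q - 1, lambda) < 0.7 && ppois(q, lambda) >= 0.7

# set.seed makes the random draws repeatable from one run to the next.
set.seed(105)
draws <- rpois(10, lambda)
draws
```

The same d, p, q, r pattern applies to the continuous distributions met later, for example dexp, pexp, qexp and rexp.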

Plotting mass functions using barplot

There are some examples of this in math104/lab100 exercises.

Exercise 1.8 Make a free hand plot from this code.

par(mfrow=c(1,2)) # sets up subplots

barplot( dpois(0:10, lambda=1 ),

names.arg=0:10, ylim=c(0,.4))

barplot( dpois(0:10, lambda=3 ),

names.arg=0:10, ylim=c(0,.4))

Plotting the pdf

The probability density function (pdf), introduced in the next section, is the analogue of the pmf for a continuous rv.

Exercise 1.9 The exponential distribution is a good example, and has pdf f(x) = θ exp(−θx) for x > 0.

dexp(2, rate=1 ) # d gives the pdf

# parameter rate=theta

exp(-2)

pexp(2, rate=1 ) # p gives the cdf

1-exp(-2)

qexp(.7, rate=1 ) # q gives the quantile

rexp(10, rate=1 ) # r gives random numbers

xval = seq(0,4,len=100)

f = dexp(xval, rate=1 )

F = pexp(xval, rate=1 )

plot( xval, f, type='n') ; grid()

lines(xval, f, col='red')

lines(xval, F, col='blue')

Accessing course datasets

Throughout the session, we will use some data examples, which you can download from the course webpage.

Save the file in your working directory, using the filename m105.Rdata. Then in R, type load("m105.Rdata"); ls() or load("YOURPATH/m105.Rdata"). This is needed for each new R session.

1.2 Chapter summary

The basic constructs of the R language are objects and functions, and are introduced by way of example. Examples of objects are vectors, matrices and dataframes. These can contain numbers, characters or mixtures of such. Examples of functions are methods of manipulating these objects including extracting arithmetic summaries, plots and transformations.

Some instances of writing functions are given, together with a brief summary of some programming constructs, such as the for loop and the if statement.

Methods specific to plotting pdfs are given, and to reading data from file.

Chapter 2

Continuous random variables

2.1 Review of probability

Math104 introduced the concept of probability and of a discrete random variable. Here we review some of the basics and introduce continuous random variables.

Probability

Probability considers an experiment before it is performed. Probability, P, is a measure of the chance that an event may occur in the experiment. Tossing a coin and conducting an election survey are examples of experiments. An event, A, is a subset of the sample space, Ω, the set of all possible outcomes.

Observing a tail in a coin throw or hearing a yes response to a survey question are both events. Legitimate questions are then: What is the probability of seeing the tail twice in the experiment of tossing two coins? What is the probability of getting no positive responses in the survey?

The Axioms of Probability

Mathematically, probability is a function P which assigns to each event A in the sample space Ω a number P(A) in [0, 1] such that

• Axiom 1: P (A) ≥ 0 for all A ⊆ Ω;

• Axiom 2: P (Ω) = 1;

• Axiom 3: P (A ∪ B) = P (A) + P (B) if A ∩ B = ∅ for any A, B ⊆ Ω.

For mathematicians probability is a function. In everyday English probability is closely associated with words such as chance, uncertainty, randomness, likelihood.

If probability considers the experiment before it is performed, statistics considers the experiment after it is performed.

Examples of discrete random variables

The sample space Ω for a discrete rv X is countable, and we usually take it to be a subset of the integers or the non-negative integers.

Exercise 2.1 Give examples related to University, family and sport.

Sol:

college membership, Ω = {Bow, Car, . . .},

exam grades, Ω = {A, B, C, D, E},

number of goals in a match, Ω = {0, 1, 2, . . .},

number of children in a family, same.

Probability mass function

Definition: The probability mass function (pmf) of a discrete random variable X is p(x), where

p(x) = P (X = x) for x = 0, 1, 2, . . .

Result: (Properties of the pmf). The probability mass function p(x) satisfies

• 0 ≤ p(x) ≤ 1 for all x;

• $\sum_{x=0}^{\infty} p(x) = 1$;

• For any event A, $P(X \in A) = \sum_{x \in A} p(x)$.

For example,

P (a < X ≤ b) = P (X = a + 1) + P (X = a + 2) + · · ·+ P (X = b)

= p(a + 1) + p(a + 2) + · · · + p(b).

Definition: The cumulative distribution function (cdf) is defined as

F (x) = P (X ≤ x) for −∞ < x < ∞.

Result: The cdf simplifies to

$$F(x) = P(X \le x) = P(X \le \mathrm{int}(x)) = \sum_{k=0}^{\mathrm{int}(x)} P(X = k),$$

where int(x) denotes the largest integer smaller than or equal to x, e.g. int(5.2) = 5, int(3) = 3, int(−2.1) = −3.
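The step behaviour of a discrete cdf can be seen in R, where the built-in floor function plays the role of int. The sketch uses the Poisson distribution with lambda = 1 as an assumed example of a rv on the non-negative integers.

```r
# floor() gives the largest integer not exceeding x.
floor(c(5.2, 3, -2.1))

# The Poisson cdf is flat between integers, so F(5.2) equals F(5).
ppois(5.2, lambda = 1) == ppois(5, lambda = 1)

# Summing the pmf from 0 to int(x) reproduces the cdf at x.
x <- 5.2
sum(dpois(0:floor(x), lambda = 1))
ppois(x, lambda = 1)
```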

This is a step function, and is not continuous.

Exercise 2.2 For a random variable X that takes values 0, 1 with probabilities θ, 1 − θ, obtain P(X ≤ x) for all x ≥ 0.

Sol:

$$P(X \le x) = \begin{cases} 0 & \text{if } x < 0, \\ \theta & \text{if } 0 \le x < 1, \\ 1 & \text{if } 1 \le x. \end{cases}$$

Add graph.

Exercise 2.3 Use this R code to plot the cdf at the points (−1, 0, 1, 2) when θ = .4. Draw the graph in your notes.

theta=.4

p0 = theta ; p1 = 1-theta

xval = c(-1, 0, 1, 2)

F = c( 0, p0, p0+p1, 1)

plot( xval ,F)

plot( xval ,F, type = 's') # adds in step function

points(xval ,F)

2.2 Continuous and discrete rvs

A mathematical way of describing a probability experiment and its events is to definea random variable associated with it.

Definition: A random variable X is a function from the sample space Ω to the real numbers R (continuous) or to the integers Z (discrete).

Exercise 2.4 Experiment 1: In a presidential election with two candidates B and C, the possible outcomes are Ω = {B, C}. Define a random variable X that maps from Ω to {0, 1}:

X(B) = 0, X(C) = 1.

Then the probability of the event C is equivalent to P (X = 1).

Exercise 2.5 Experiment 2: A national air quality monitoring system automatically collects measurements of ozone level at designated sites. The possible outcomes are Ω = {x : x ≥ 0}. Define a random variable X to be the value of the measurement,

X(x) = x,

the identity map. Then the probability that ozone level falls below a certain level c isgiven by P (X ≤ c).

Remarks on rvs

• A random variable (rv) X is a function that associates a unique number with each possible outcome of an experiment.

• Associated with each discrete random variable X is a probability mass function (pmf) p(x) from which probabilities of all possible events involving X may be computed.

• Associated with each continuous random variable X is a probability density function (pdf) f(x) from which probabilities of all possible events involving X may be computed.

• Associated with the pmf and with the pdf is the cumulative distribution function (cdf) F(x) that gives particular probabilities.

• A continuous rv is by definition one that has a continuous cdf. The cdf of a discrete rv is a step function.

• Associated with the pmf and the pdf are numerical summaries such as E(X), var(X) and, for continuous rvs, quantiles of F(x).

• Often in scientific investigation X represents the variable of main interest that can be measured or observed.

The cumulative distribution function (cdf)

In order to describe all possible outcomes of an experiment, we focus on an event of the basic form

{X ≤ x}

for fixed x, where x can take any value.

Exercise 2.6 Express a general event {a < X ≤ b} using the basic form with set operations.

Sol:

$\{a < X \le b\} = \{X \le b\} \cap \{X \le a\}^c$.

If we have a rule of assigning probability to an event of the basic form, then the probability of any event can be determined.

Definition: For any discrete or continuous univariate random variable X, the cumulative distribution function, cdf, F : R→[0, 1], is defined by

F (x) = P (X ≤ x).

In terms of the original sample space the event {X ≤ x} is interpreted as {ω : X(ω) ≤ x}.

F is defined for −∞ < x < ∞, and we require F(−∞) = 0 and F(∞) = 1 to avoid having to deal with degenerate rvs.

xvals = seq(-6,6,length=100)

F = pnorm(xvals)

plot(xvals,F,type='n'); grid()

lines(xvals,F,col='red')

It is a result that F is a non-decreasing function.

Exercise 2.7 Prove that for a ≤ b,

P (a < X ≤ b) = F (b) − F (a).

Sol:

{X ≤ b} = {a < X ≤ b} ∪ {X ≤ a} from above,

a union of disjoint events, so

P (X ≤ b) = P (a < X ≤ b) + P (X ≤ a) or

F (b) = P (a < X ≤ b) + F (a).

Probability for continuous rvs

When the cdf F(x) = P(X ≤ x) is continuous the outcomes of the experiment have to be measurements on a continuous scale, and the rv is said to be continuous. Examples include

ozone level, weight, direction, waiting times, stock price,. . .

Result: (Zero probability). If X is a continuous rv

P (X = x) = 0 for all x.

Proof:

$P(x - h < X \le x + h) = F(x + h) - F(x - h)$ by the result above,

so if F is continuous

$$P(X = x) = \lim_{h \to 0} P(x - h < X \le x + h) = \lim_{h \to 0} \left[ F(x + h) - F(x - h) \right] = 0.$$

Therefore, unlike the discrete case, the probability distribution cannot be reduced to a sum of single events. To describe the probability of an event for a continuous random variable, we need new mathematical tools!

Probability density function

Assume the cdf is differentiable as well as continuous.

Definition: The probability density function, pdf, f(x) of a continuous random variable X is defined by

$$f(x) = \frac{d}{dx} F(x).$$

Result: (The cdf as a definite integral). The cdf satisfies

$$F(x) = \int_{-\infty}^{x} f(u)\,du.$$

Proof: Standard rules of integral calculus.

The cdf is a definite integral of the pdf. (If discrete the cdf is the definite sum of the pmf.)

Result: The probability density function f(x) satisfies

• f(x) ≥ 0 for all x;

• $\int_{-\infty}^{\infty} f(x)\,dx = 1$;

• For any event A, $P(X \in A) = \int_{x \in A} f(x)\,dx$.

However it may be that f(x) ≥ 1 for some x.

Interpretation of the pdf

Result: (Area under the curve.) Using calculus,

$$\begin{aligned}
P(a < X \le b) &= F(b) - F(a) \\
&= \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx \\
&= \int_{a}^{b} f(x)\,dx,
\end{aligned}$$

but this is the area under the curve (x, f(x)) over the interval (a, b]. Hence this area represents the probability that the rv X lies in this interval.
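The equality of the area under the pdf and the difference of cdf values can be checked numerically. The sketch below assumes the Exponential distribution with rate 1 from Chapter 1 and arbitrary end points a and b; integrate is R's general-purpose numerical integration function.

```r
a <- 0.5
b <- 2

# Area under the Exponential(1) density between a and b, by numerical integration.
area <- integrate(dexp, lower = a, upper = b, rate = 1)$value

# The same probability from the cdf, F(b) - F(a).
diff_cdf <- pexp(b, rate = 1) - pexp(a, rate = 1)

c(area, diff_cdf)   # the two numbers agree up to numerical error
```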

Figure (Probability density function): Example of a pdf, f(x) plotted against x. P(a < X ≤ b) is the area under the curve between a and b.

Note that the density function f(x) itself does NOT represent the probability of any event.

Exercise 2.8 For a random variable X with cumulative distribution function

$$F(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases}$$

(a) Find P (0.3 < X ≤ 0.5).

(b) Find the pdf of X.

(c) Sketch the function pdf and shade area under the curve between 0.3 and 0.5.

Sol:

(a) P (0.3 < X ≤ 0.5) = F (0.5) − F (0.3) = 0.5 − 0.3 = 0.2.

(b) $$f(x) = \frac{d}{dx} F(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1, \\ 0 & \text{if } x < 0 \text{ or } x > 1. \end{cases}$$

(c) Sketch.

Simulating rvs

It is desirable to do experiments with simulated data, where we know the true underlying distribution, which is never the case with real life data!

If a random variable X has the Uniform distribution on the interval (0, 1) then the pdf is

f(x) = 1 for 0 < x < 1, and 0 otherwise.

We write X ∼ Uniform(0, 1).

The area under the curve is the probability

$$P(a < X < b) = \int_{a}^{b} f(x)\,dx = b - a, \quad \text{for } 0 < a < b < 1.$$

The shaded area represents P (0.2 < X < 0.5)

Figure: Uniform(0,1) density plotted against x, with the region P(0.2<X<0.5) marked.

Exercise 2.9 Uniform. Simulate 1000 realisations of the rv X ∼ Uniform(0, 1) using runif. Draw the histogram.

Plot the pdf on the range (−.5, 1.5) using the function dunif to give 100 points and find the probability P(0.2 < X < 0.5) using the function punif.

x = runif(1000) # r=rv unif=Uniform

hist(x, prob=T, breaks=20, col='yellow',xlim=c(-.5,1.5))

range = seq(-.5,1.5,length=100) # plotting points

f = dunif(range) # d=pdf

plot(range, f, type='n')

lines(range, f)

punif(0.5) - punif(0.2)

y = (0.2<x) & (x<0.5)

sum(y) # the frequency of 1’s

Sol:

Theoretically P(0.2 < X < 0.5) = 0.3. The relative number of points in (.2, .5) is 282/1000.

2.3 Expected values

Expectation

Definition: If X is a discrete rv with pmf p(x) on {0, 1, . . .}, then the expected value of X is

$$\mu = E[X] = \sum_{x=0}^{\infty} x\,p(x).$$

If X is a continuous random variable with pdf f(x) on (−∞,∞), then the expected value of X is

$$\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$

We can think of this as an average of the different values that X may take, weighted according to their chance of occurrence.

Expectations of functions of rvs

Consider g(X) where g is a fixed function.

Definition: If X is a discrete rv with probability mass function p(x) on {0, 1, . . .}, then the expected value of g(X) is

$$E[g(X)] = \sum_{x=0}^{\infty} g(x)\,p(x).$$

If X is a continuous rv with probability density function f(x) on (−∞,∞), then the expected value of g(X) is

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$

Exercise 2.10 Show that E[3] = 3.

Sol:

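Proof: We regard 3 as a constant function of X,

$$\begin{aligned}
E[3] &= \int_{-\infty}^{\infty} 3 f(x)\,dx && \text{definition of } E \\
&= 3 \int_{-\infty}^{\infty} f(x)\,dx && \text{calculus} \\
&= 3\,[F(\infty) - F(-\infty)] && \text{result above} \\
&= 3\,[1 - 0] && \text{non-degenerate} \\
&= 3.
\end{aligned}$$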

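Exercise 2.11 Let X have the pdf f(x) = exp(−x) for all x ≥ 0. The expectation E[X] (with value = µ) is

$$\begin{aligned}
\mu = E[X] &= \int_{0}^{\infty} x \exp(-x)\,dx \\
&= \left[ -x \exp(-x) \right]_{0}^{\infty} + \int_{0}^{\infty} \exp(-x)\,dx && \text{integration by parts} \\
&= 0 + \left[ -\exp(-x) \right]_{0}^{\infty} = 0 - (-1) = 1.
\end{aligned}$$

Find $E[X^2]$ and $E[(X - \mu)^2]$.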

Sol:

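$$\begin{aligned}
E[X^2] &= \int_{0}^{\infty} x^2 \exp(-x)\,dx \\
&= \left[ -x^2 \exp(-x) \right]_{0}^{\infty} + \int_{0}^{\infty} 2x \exp(-x)\,dx \\
&= 0 + 2 \times 1 = 2,
\end{aligned}$$

$$\begin{aligned}
E[(X - \mu)^2] &= \int_{0}^{\infty} (x - 1)^2 \exp(-x)\,dx \\
&= \int_{0}^{\infty} (x^2 - 2x + 1) \exp(-x)\,dx \\
&= 2 - 2 \times 1 + 1 = 1.
\end{aligned}$$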
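The three integrals in this exercise can be checked numerically with R's integrate function. This is only a sanity check under the stated pdf f(x) = exp(−x) on x ≥ 0; the object names are illustrative.

```r
f <- function(x) exp(-x)    # the pdf, for x >= 0

EX  <- integrate(function(x) x * f(x), 0, Inf)$value
EX2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value
VX  <- integrate(function(x) (x - EX)^2 * f(x), 0, Inf)$value

round(c(EX, EX2, VX), 4)    # compare with the hand calculations 1, 2, 1
```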

Properties of expectation

Result: (Linearity of expectation). If X has expectation E[X] and Y is a linear function of X as Y = aX + b then Y has expectation

E[Y ] = a E[X] + b .

Result: More generally,

E[g(X) + h(X)] = E[g(X)] + E[h(X)] (2.1)

E[cg(X)] = c E[g(X)] (2.2)

E[aX + b] = a E[X] + b (2.3)

Note that we proved them in MATH 104 for discrete random variables.

Using the linear properties of expectation, we may compute $E[(X - a)^2]$ by

$$\begin{aligned}
E[(X - a)^2] &= E[X^2 - 2aX + a^2] && \text{algebra} \\
&= E[X^2] - E[2aX] + E[a^2] && \text{by (2.1) twice} \\
&= E[X^2] - 2a E[X] + a^2 && \text{by (2.3).}
\end{aligned}$$

Variance and standard deviation

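Definition: If X is a random variable with expected value µ = E[X], the variance of X is

$$\sigma^2 = \mathrm{var}[X] = E[(X - \mu)^2] = \begin{cases} \sum_{x=0}^{\infty} (x - \mu)^2 p(x) & \text{for a discrete rv on } \{0, 1, \ldots\}, \\ \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx & \text{for a continuous rv on } (-\infty, \infty). \end{cases}$$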

Result: The variance of X can be calculated as

$$\sigma^2 = E[X^2] - \mu^2.$$

Proof: Use the above result.
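Spelled out, the step is to set a = µ in the expansion of $E[(X - a)^2]$ obtained from the linearity properties, and to use µ = E[X]:

```latex
\begin{aligned}
\sigma^2 = E[(X - \mu)^2] &= E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - \mu^2 .
\end{aligned}
```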

Definition: The standard deviation of X is

$$\sigma = \sqrt{\mathrm{var}[X]}.$$

The variance, or better the standard deviation, is a measure of the spread of a random variable about its expectation.

Exercise 2.12 For f(x) = exp(−x) for all x ≥ 0, find the standard deviation of X.

Sol:

From above the variance $\sigma^2 = E[(X - \mu)^2] = 1$. Consequently the std is $\sigma = \sqrt{1} = 1$.

Properties of the variance

Result: If var[X] exists and Y = a + bX, then $\mathrm{var}[Y] = b^2\,\mathrm{var}[X]$. Hence, the standard deviation of Y is $\sigma_Y = |b|\sigma$.

Exercise 2.13 Why is the absolute value needed in the above expression?

Sol:

Use a counterexample: var(−3X) = 9 var(X); taking the square root gives $3\sqrt{\mathrm{var}(X)}$, which is not the same as $-3\sqrt{\mathrm{var}(X)}$.
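The same scaling holds for the sample variance and sample standard deviation, which gives a quick check in R. The sketch assumes simulated Exponential(1) data and the arbitrary choice a = 2, b = −3; the seed is only there to make the run repeatable.

```r
set.seed(105)
x <- rexp(1000, rate = 1)
y <- 2 - 3 * x          # a = 2, b = -3

var(y) / var(x)         # equals b^2 = 9, up to rounding error
sd(y) / sd(x)           # equals |b| = 3, never -3
```

The shift a has no effect on either ratio, since adding a constant moves the data without changing its spread.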

Figure: Means and standard deviations for discrete and continuous rvs. Panels show probability mass functions (µ = 2.5, σ = 1.1) and densities (µ = 0.83, σ = 0.83), each plotted against x.

2.4 Standard continuous distributions

We specify several standard distributions in terms of given pdfs: the uniform, the exponential and the normal.

Uniform

This distribution is used to model variables that can take any value on a fixed interval, when the probability of occurrence does not vary over the interval.

Definition: The pdf of a Uniform rv X, distributed on the interval (a, b), is given by:

$$f(x; a, b) = \begin{cases} \dfrac{1}{b - a} & \text{if } a < x < b; \\ 0 & \text{otherwise,} \end{cases}$$

where the parameters are (a, b) and −∞ < a < b < ∞. This is written as X ∼ Uniform(a, b). We often write f(x) = 1/(b − a) for a < x < b, and suppress the fact that (i) there are other arguments, f(x; a, b) and (ii) f(x) = 0 when x < a or x > b.

[Figure: Uniform(a, b) pdf of height 1/(b − a) on (a, b), plotted against x, with the area between a and x0 shaded.]

pdf for Uniform(a, b) random variable. Shaded area represents P(a < X ≤ x0).

Result: The expected value and variance of X ∼ Uniform(a, b) are

$$E[X] = \frac{a+b}{2}, \qquad \mathrm{var}[X] = \frac{(b-a)^2}{12} .$$

Proof:

$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b = \frac{b+a}{2} .$$

Similar calculations work for the variance.
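For completeness, one way to write out the variance calculation is sketched here, using the shortcut var[X] = E[X²] − µ² with µ = (a + b)/2:

```latex
\begin{aligned}
E[X^2] &= \int_a^b \frac{x^2}{b-a}\,dx = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}, \\
\mathrm{var}[X] &= E[X^2] - \mu^2
  = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4}
  = \frac{a^2 - 2ab + b^2}{12}
  = \frac{(b-a)^2}{12}.
\end{aligned}
```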

Exercise 2.14 Evaluate the pdf and the cdf of a Uniform rv with parameters a = −2, b = 2, at x = 0.5 and then plot them on an interval. There is one ambiguity in the plot: identify it.

dunif(0.5, min=-2, max=2) # pdf of Unif(-2,2) at x=0.5: f(0.5)=0.25

punif(0.5, min=-2, max=2) # cdf of Unif(-2,2): F(0.5)=0.625

xval = seq(-2.5, 2.5, length=101)

f = dunif(xval, -2, 2) # pdf values on the grid

F = punif(xval, -2, 2) # cdf values on the grid

plot(xval, F, type='n') # empty axes scaled to the cdf

lines(xval, f, col='blue') # pdf

lines(xval, F, col='red') # cdf

Sol:

The vertical lines on the pdf should not be there.

Exponential

This distribution is often used to model variables that are the times until specific events happen, when the events occur at random at a given rate over time.

Definition: The pdf of an Exponential rv X is

$$f(x; \theta) = \begin{cases} \theta \exp(-\theta x) & \text{for } x > 0, \\ 0 & \text{otherwise,} \end{cases}$$

where the rate parameter θ > 0. This is written as X ∼ Exponential(θ), with θ ∈ (0, ∞).

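Result: The cdf of X ∼ Exponential(θ) is

$$F(x) = 1 - \exp(-\theta x) \ \text{ for } x > 0, \text{ and } 0 \text{ otherwise.}$$

Proof: For x > 0,

$$\begin{aligned} F(x) &= \int_{-\infty}^{x} f(u)\,du \\ &= \int_0^x f(u)\,du \\ &= \int_0^x \theta\exp(-\theta u)\,du \\ &= \left[-\exp(-\theta u)\right]_0^x \\ &= 1 - \exp(-\theta x). \end{aligned}$$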

Result:

$$E[X] = \frac{1}{\theta}, \qquad \mathrm{var}(X) = \frac{1}{\theta^2} .$$

Proof: Seen above.

The parameter θ is known as the rate parameter because, if X is the time until the next event occurs, then θ = 1/E[X] is the rate of occurrence.

Exercise 2.15 The value of θ influences the probability of different outcomes. How is the shape of the function related to the parameter θ? Which pdf in the figure has the lowest tail probability P(X > 10)?

xvals = seq(-.2, 6, length=100)

f1 = dexp(xvals, rate=1)

f2 = dexp(xvals, rate=2)

f3 = dexp(xvals, rate=1/2)

plot(xvals, f2, type='n') ; grid() # empty axes scaled to the tallest curve

lines(xvals, f1) # theta = 1

lines(xvals, f2, col='red') # theta = 2

lines(xvals, f3, col='blue') # theta = 1/2

Sol:

As f(0) = θ, the highest curve at 0 is θ = 2 (the pdf exceeds 1) and the lowest curve at 0 is θ = 0.5. The exponential decay of the function is quicker for larger θ, so the smallest tail probability is when θ = 2.

Exercise 2.16 Evaluate the pdf and the cdf of an Exponential distribution and plot; give an eyeball estimate of P(X < 1).

xval = seq(-0.2, 4, length=100)

f = dexp(xval, rate=2) # pdf

F = pexp(xval, rate=2) # cdf

plot(xval, f, type='n', ylab='') ; grid()

lines(xval, f, col='red')

lines(xval, F, col='blue')

Sol:

The pdf starts at (0, 2). A pdf is not a probability. From the cdf, P(X < 1) is about 0.9.

pexp(1, rate=2) # 0.86

Exercise 2.17 Suppose that the time until the first goal is scored can be modelled by an Exponential distribution with rate parameter θ = 2/3 per hour. Write down the cdf. Find the probability that the time until the goal occurs is (i) more than 30 minutes away, (ii) between 30 and 50 minutes.

Sol:

Let X be the random variable of the waiting time, in hours. Then X ∼ Exponential(2/3) and F(x) = 1 − exp(−(2/3)x) for x > 0.

(i) $P(X > 1/2) = 1 - F(1/2) = \exp(-\tfrac{2}{3}\cdot\tfrac{1}{2}) = 0.7165$,

(ii)

$$\begin{aligned} P(1/2 < X < 5/6) &= F(5/6) - F(1/2) \\ &= \exp(-\tfrac{2}{3}\cdot\tfrac{1}{2}) - \exp(-\tfrac{2}{3}\cdot\tfrac{5}{6}) \\ &= 0.1428, \end{aligned}$$

assuming no half time.
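The two probabilities can be reproduced with pexp, in the style of the earlier exercises. This is a small sketch in which time is measured in hours, so 30 and 50 minutes become 1/2 and 5/6; the variable names are illustrative.

```r
theta <- 2/3                                   # rate per hour, from the exercise

p_i  <- 1 - pexp(1/2, rate = theta)            # (i)  P(X > 30 min)
p_ii <- pexp(5/6, rate = theta) -
        pexp(1/2, rate = theta)                # (ii) P(30 min < X < 50 min)

round(c(p_i, p_ii), 4)                         # 0.7165 0.1428
```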

Normal distribution: background

quoted from gqview weblib/Gauss.html

The normal distribution was introduced by the French mathematician Abraham De Moivre in 1733. De Moivre used this distribution to approximate probabilities of winning in various games of chance involving coin tossing. It was later used by the German mathematician Karl Gauss to predict the location of astronomical bodies and became known as the Gaussian distribution. In the late nineteenth century statisticians started to believe that most data sets would have histograms with the Gaussian bell-shaped form and that all normal data sets would follow this form and so the curve came to be known as the normal curve.

This distribution is also known as the Gaussian distribution, after the German mathematician Carl Friedrich Gauss. The density was pictured on the German 10 mark note bearing Gauss’s image!

Normal distribution

Definition: The pdf of a Normal random variable X is

$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right),$$

where −∞ < x < ∞, and the parameters satisfy −∞ < µ < ∞ and 0 < σ. This is written as X ∼ N(µ, σ²), with parameter θ = (µ, σ) ∈ Θ = (−∞, ∞) × (0, ∞).

Result: E[X] = µ, var(X) = σ².

Proof: Too hard for math105.

The Normal distribution plays an important role in a result that is key to statistics, known as the central limit theorem. This theorem, discussed in Math230 and Math313, gives a theoretical basis to the empirical observation that many random phenomena seem to follow a Normal distribution. Usually, the mean parameter µ and the scale parameter σ are unknown, although sometimes it is assumed that σ is known as this simplifies things considerably. These parameters are crucial in determining probabilities.

Exercise 2.18 Consider the figure.

[Figure: pdfs for Normal(µ, σ²) random variables where µ = 0 and σ = 0.5, 1, 1.5, plotted against x from −3 to 3.]

Which one has the highest probability P(|X| > 3)?

xvals = seq(-3, 3, length=100) # plotting grid (assumed; range read off the figure axis)

f1 = dnorm(xvals, sd=1)

f2 = dnorm(xvals, sd=1.5) # sigma = 1.5, as in the figure

f3 = dnorm(xvals, sd=1/2)

plot(xvals, f3, type='n') ; grid() # empty axes scaled to the tallest curve

lines(xvals, f1)

lines(xvals, f2, col='red')

lines(xvals, f3, col='blue')

Sol:

The larger σ, the more spread. So σ = 1.5 has the largest probability P(|X| > 3) and σ = 0.5 has the smallest.

Exercise 2.19 Complete the code to establish that the dnorm function gives the same result as direct calculation of the pdf when X ∼ N(2, 4).

xvals = seq(-4,8, length=11)

pdf = dnorm(xvals,mean=2,sd=2)

f = 1/(sqrt(2*pi)*2)*

sum(f!=pdf) # 0 bingo

Sol:

f = 1/(sqrt(2*pi)*2)*exp(-0.5*((xvals-2)/2)^2)

Normal cdf and quantiles

The normal cdf is

$$F(x) = \int_{-\infty}^{x} f(u)\,du = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(u-\mu)^2}{2\sigma^2} \right) du .$$

This does not have a closed form expression so numerical evaluation is required, if we want to obtain probabilities of the form P(X ≤ x) or quantiles. Note that R functions for the Normal use the standard deviation σ, not the variance σ².

Exercise 2.20 Write down the numerical values of P (X ≤ x) corresponding to

pnorm(0,mean=2,sd=sqrt(5)) # X~N(2,5), P( )=0.1855467

pnorm(0,mean=2,sd=sqrt(3)) # X~N(2, ), P( )=0.1241065

1-pnorm(-2,mean=0, sd=2) # X~N(0, ), P( )=0.8413447

Exercise 2.21 A normal distribution is proposed to model the variation in height of women with parameters µ = 160 and σ² = 25 measured in cm. Find the proportion of tall women, defined as over 175cm tall, in terms of an integral.

Sol:

Let H be the random variable of a woman’s height; then H ∼ N(160, 25). So

$$P(H > 175) = \int_{175}^{\infty} \frac{1}{\sqrt{2\pi \cdot 25}} \exp\left( -\frac{(x-160)^2}{2 \cdot 25} \right) dx .$$

In the above example we have expressed the proportion in terms of an integral and as the number of deviations from the mean. The integral is impossible to calculate analytically so numerical evaluation is required to obtain probabilities or quantiles.
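To illustrate what numerical evaluation means here, the sketch below integrates the N(160, 25) density from 175 upwards with R's integrate function and compares the answer with the built-in cdf. The variable names are illustrative; note that R is given the standard deviation 5, not the variance 25.

```r
mu <- 160; sigma <- 5            # sigma^2 = 25, so the sd is 5

# Numerical integration of the density over (175, infinity)
p_int <- integrate(dnorm, lower = 175, upper = Inf,
                   mean = mu, sd = sigma)$value

# The same tail probability from the built-in cdf
p_cdf <- 1 - pnorm(175, mean = mu, sd = sigma)

c(integral = p_int, cdf = p_cdf) # both about 0.0013
```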

Standardization of the random variable

It is useful to express such probabilities in terms of a standardized random variable, with µ = 0 and σ = 1.

Result: If X ∼ N(µ, σ²) then

$$Z = \frac{X-\mu}{\sigma} \sim N(0, 1),$$

and conversely if Z ∼ N(0, 1), then

$$X = \mu + \sigma Z \sim N(\mu, \sigma^2) .$$

Proof: The formal proof will be given in math230 and here it is sufficient to note that

$$E[Z] = 0, \qquad \mathrm{var}[Z] = 1 .$$
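A quick empirical check of the standardization result is sketched below. The parameter values µ = 160 and σ = 5 and the sample size are illustrative assumptions; any values would do.

```r
set.seed(2)                              # arbitrary seed
mu <- 160; sigma <- 5                    # illustrative parameters

x <- rnorm(100000, mean = mu, sd = sigma)
z <- (x - mu) / sigma                    # standardized sample

c(mean(z), sd(z))                        # close to 0 and 1

# The cdfs agree: P(X <= 175) equals P(Z <= (175 - mu)/sigma)
pnorm(175, mean = mu, sd = sigma)
pnorm((175 - mu) / sigma)
```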

Definition: A random variable Z is said to have a standard normal distribution, with mean 0 and standard deviation 1, if its pdf is given by

$$f(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2),$$

where −∞ < z < ∞; this is denoted by Z ∼ N(0, 1).

The cdf, the area under the curve, of the standard normal variable Z is given by

$$\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\,dx .$$

Values of Φ(z) are obtained from a table of standard normal probabilities or from R:

for (z in seq(-3, 3, length=10)) # z = -3, -2.33, ..., 2.33, 3 in steps of 2/3

print( pnorm(z) )

The z values in the table are rounded to two decimals.

z      -3.00   -2.33   -1.67   -1.00   -0.33    0.33    1.00    1.67    2.33    3.00
Φ(z)  0.0013  0.0098  0.0478  0.1587  0.3694  0.6306  0.8413  0.9522  0.9902  0.9987

Exercise 2.22 Repeat the previous example to illustrate the standardization procedure:

$$\begin{aligned} P(H > 175) &= P\left( \frac{H-160}{5} > \frac{175-160}{5} \right) && \text{cunning} \\ &= P(Z > 3) = 1 - P(Z \le 3) \\ &= 1 - \Phi(3) \\ &= 1 - 0.9987 = 0.0013 && \text{from pnorm(3)}. \end{aligned}$$

The figure illustrates coverage properties of a Normal distribution.

[Figure: Normal pdf with the x axis marked at µ, µ ± σ, µ ± 2σ and µ ± 3σ.]

P (µ − σ < X < µ + σ) = 0.683

P (µ − 2σ < X < µ + 2σ) = 0.954

P (µ − 3σ < X < µ + 3σ) = 0.997
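These coverage probabilities do not depend on µ and σ: after standardization they are Φ(k) − Φ(−k) for k = 1, 2, 3. A minimal sketch of the calculation:

```r
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)   # P(mu - k*sigma < X < mu + k*sigma)
round(coverage, 3)                 # 0.683 0.954 0.997
```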

2.5 Quantiles and the cdf

Often interest is in the values of a continuous random variable which are not exceeded with a given probability, e.g. the income of the lowest 10% of income tax payers or the score of the top 5% of students.

Quantiles

Let X be a random variable and p any value such that 0 ≤ p ≤ 1.

Definition: The pth quantile of the distribution of X is the value $x_p$ that satisfies

$$P(X \le x_p) = p, \quad \text{or equivalently} \quad x_p = F^{-1}(p),$$

where $F^{-1}$ is the inverse function of F.

When p = 0.5, the quantile $x_{0.5}$ is called the median. When the cdf F is continuous and strictly increasing the inverse function is uniquely defined. [Life is more problematic with step functions.]

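p = 0.6

xp = qnorm(p, mean=2, sd=1) # 2.2533

xvals = seq(-1, 5, length=100) # plotting grid (assumed; any range covering 0 and xp works)

F = pnorm(xvals, mean=2, sd=1)

plot(xvals, F, type='n') ; grid()

lines(xvals, F)

abline(v=0, lty=3)

lines(c(0,xp), c(p,p), col='red') # horizontal line at height p

lines(c(xp,xp), c(0,p), col='red') # vertical line down to xp

[Figure: cumulative distribution function F(x) against x, with the level p on the vertical axis mapped to the quantile xp on the horizontal axis.]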

Quartiles

The quartiles of a distribution are the quantiles $(x_{0.25}, x_{0.5}, x_{0.75})$: those values at which we can cut the distribution into four equally probable slices.

[Figure: two panels, the cumulative distribution function F(x) and the density f(x), each against x, with the quartiles marked.]

Quartiles (x0.25, x0.5, x0.75) shown on cdf and pdf respectively.
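In R the quartiles come from the q-functions. The sketch below uses N(2, 1) purely as an illustrative choice and confirms that applying the cdf to the quartiles returns the probabilities 0.25, 0.5 and 0.75.

```r
p <- c(0.25, 0.5, 0.75)
q <- qnorm(p, mean = 2, sd = 1)    # quartiles of N(2, 1), illustrative parameters
q                                  # about 1.33 2.00 2.67

pnorm(q, mean = 2, sd = 1)         # recovers 0.25 0.50 0.75, since F(x_p) = p
```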

Exercise 2.23 Suppose X ∼ Uniform(a, b). Find the cdf, sketch its graph, and give a formula for the p-th quantile $x_p$.

a = -2 ; b = 4

xvals = seq(a-.5, b+.5, length=100)

F = punif(xvals, min=a, max=b)

plot(xvals, F, type='n') ; grid()

lines(xvals, F)

xmedian = qunif(0.5, min=a, max=b) # median, here 1

Sol:

$$F(x) = \int_a^x \frac{1}{b-a}\,du = \frac{x-a}{b-a} \quad \text{for } a \le x \le b .$$

So, by definition, $x_p = F^{-1}(p) = a + p(b-a)$.

Exercise 2.24 Find the mean and the median of X ∼ Uniform(a, b) and compare.

Sol:

$$E(X) = \int_a^b \frac{x}{b-a}\,dx = \frac{a+b}{2}, \qquad x_{0.5} = a + 0.5(b-a) = \frac{a+b}{2},$$

so the median is the same as the mean.

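Exercise 2.25 Suppose X ∼ Exponential(θ), derive the cdf from the pdf, and find the median. Verify that the mean of X is 1/θ using this calculation, and compare to the median.

$$\begin{aligned} E(X) &= \int_0^\infty u f(u)\,du && \text{def of expectation} \\ &= \left[-u\exp(-\theta u)\right]_0^\infty + \int_0^\infty \exp(-\theta u)\,du && \text{integ by parts} \\ &= 0 - 0 + \left[-\frac{1}{\theta}\exp(-\theta u)\right]_0^\infty \\ &= \frac{1}{\theta}. \end{aligned}$$

Sol:

Evaluate cdf:

$$\begin{aligned} F(x) &= \int_0^x f(u)\,du && \text{property of } F \\ &= \int_0^x \theta\exp(-\theta u)\,du \\ &= \left[-\exp(-\theta u)\right]_0^x = 1 - \exp(-\theta x) \quad \text{for } x > 0 . \end{aligned}$$

For x < 0, F(x) = 0.

Quantiles: solving $F(x_p) = p$ gives $x_p = \theta^{-1}\log\{(1-p)^{-1}\} = -\theta^{-1}\log(1-p)$.

For the median, p = 0.5 so the median is $x_{0.5} = \frac{1}{\theta}\log 2$.

Comparison: $\mu = 1/\theta > (1/\theta)\log 2 = x_{0.5}$.

The distribution is not symmetric so the mean and the median are not the same.

The median is smaller here because the distribution has a long right tail: most of the probability is concentrated at small values, while the large values to the right pull the mean up.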
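The median formula can be compared with R's quantile function for the Exponential. In this sketch the rate θ = 2 is an arbitrary illustrative value.

```r
theta <- 2                               # illustrative rate

x_med <- qexp(0.5, rate = theta)         # median from R
c(median = x_med, formula = log(2) / theta, mean = 1 / theta)
# the first two agree, and both are smaller than the mean 1/theta
```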

Exercise 2.26 Sample 200 realisations of X ∼ N(2, 4) and plot a scaled histogram. Overlay the theoretical pdf on this diagram. Overplot the empirical and theoretical cdfs. Calculate the 0.25, 0.5, and 0.75 sample quantiles and compare to the theoretical values. Make a brief record of these results in your notes. The empirical cdf and sample quantiles are discussed in the next chapter.

par(mfrow=c(1,2))

x = rnorm(200, mean=2, sd=2) # note sd=2, as the variance is 4

hist(x, prob=TRUE, breaks=20, col='yellow') # bell shaped or what

# overlay the true pdf to make comparison:

a = -5 ; b = 8 # trial and error

xvals = seq(a, b, length=101)

pdf = dnorm(xvals, mean=2, sd=2)

lines(xvals, pdf, col='red') # not bad

# now overlay the true cdf on the empirical cdf

plot(ecdf(x), pch='.')

F = pnorm(xvals, mean=2, sd=2)

lines(xvals, F, col='blue') # again good

quantile(x) # sample quantiles

qnorm(c(0.25,0.5,0.75), mean=2, sd=2) # close

min(x)

Exercise 2.27 Complete the missing parts of the code.

runif(50, min=0,max=1) # 50 obs Uniform(0, )

rnorm(20, =0,sd=5) # 20 obs Normal(0, )

rexp(100, rate=0.5) # 100 obs Exponential(0.5)

rpois(200, =3) # 200 obs Poisson(3)

rbinom(35,size=6,prob=0.2) # 35 obs Binomial( ,0.2)

rgeom(150,prob=1-0.2) # 150 obs Geometric(0.2)

The reason for the 1 − 0.2 in the Geometric case is that unfortunately, in R the probability specified is the success probability, whereas the parameter θ in the pmf of a Geometric random variable is the failure probability.

Transformations of rvs

In certain examples it is easy to obtain the cdf of a transformed rv Y = g(X), by a change of variable.

Exercise 2.28 Show that if X ∼ Uniform(0, 1) and Y = −log(X), then Y ∼ Exp(1).

Sol:

Proof: We need to find and identify the cdf of Y. For y > 0,

$$\begin{aligned} P(Y < y) &= P(-\log(X) < y) && \text{the key to this example} \\ &= P(\log(X) > -y) \\ &= P(X > \exp(-y)) && \text{monotonicity} \\ &= 1 - P(X \le \exp(-y)) \\ &= 1 - \exp(-y) && \text{as } X \sim \mathrm{Uniform}(0, 1). \end{aligned}$$

But this is the cdf of an Exp(1) random variable, so Y ∼ Exp(1).

Exercise 2.29 Run this code to verify empirically that if X ∼ Uniform(0, 1) and Y = −log(X), then Y ∼ Exp(1).

x = runif(10000)

y = -log(x)

hist(y, prob=TRUE, breaks=40, col='yellow') # scaled histogram of the transformed sample

yvals = seq(-.1, 4, length=200)

f = dexp(yvals) # Exp(1) pdf

lines(yvals, f, col='red')

2.6 Chapter summary

The Chapter starts with a review of probability and its axioms, and then reviews discrete random variables, the pmf, expectation and application to standard distributions, all material included in math104.

The math105 course continues probability theory to cover the extension to continuous rvs. Their properties are determined by the cumulative distribution function (cdf), which in turn leads to the definition of the probability density function (pdf). Pmfs and pdfs are compared and contrasted.

Expectation, and its notion of a weighted average, is generalised to cover the continuous case and its properties are discussed. Important definitions for the mean, variance and standard deviation are given in terms of expectation.

Standard continuous distributions, including the Uniform, Exponential and Normal distributions, are described. Quantiles are those values of the rv that cover a given probability, and are relatively easy to define for a continuous rv.

All these probabilistic concepts are illustrated throughout in the R language with special emphasis on plotting and simulation.

Chapter 3

Statistics and exploratory data analysis

In our everyday lives, we are surrounded by uncertainty due to random variation.

We often make decisions based on incomplete information.

Mostly, we can cope with this level of uncertainty, but in situations where the decision is of particular importance, it can be informative to understand this uncertainty in greater detail, to aid the decision making.

Statistics is unique in that it allows us to make formal statements quantifying uncertainty, and this provides a framework for decision making when faced with uncertainty.

3.1 Uncertainty

Sterling’s slide has continued, with the pound falling close to $1.37... The pound also weakened against the euro, with the single currency now worth 94 pence.

If I am planning to make a trip abroad in the summer, is it better to change currency now rather than later?

Is there evidence of global warming or is it simply random fluctuation?

Would the answer affect your way of living?

Decision making

We follow many different routes, rational or irrational, to find an answer and to cope with such situations.

Often it is useful to obtain some evidence in order to decide what the answer should be.

What sort of evidence would be useful in answering such questions?

For the UK economy, we may look at exchange rates over the past few months to figure out a trend, if any; we may want to include other factors that may explain the trend, or study similar periods in the past. To determine such factors or variables we may want to speak to economists.

For global warming, we may want to study the pattern in temperature over the past years in England, Europe or around the world. There may be other variables of interest, for example, an increasing number of floods or storms. Discussion with a climatologist or hydrologist would be helpful in deciding which variables should be considered.

What is data?

In statistical studies data refers to the information that is collected from experiments, surveys or observational studies.

For example, by themselves 4, 3.5, 3.2 are not data but only a sequence of numbers. However, if we know these numbers are measurements of new-born babies’ weights, then these numbers become data.

Numbers require metadata to become data.

Probability and statistics

In Probability, we consider an experiment before it is performed. The measurements to be observed are modelled as random variables. We may deduce the probability of various outcomes of the experiment in terms of certain basic parameters.

In Statistics, we have to infer things about the values of the parameters from the observed outcomes, the realisations, of an experiment after it has been performed.

Is Friday 13th bad for your health?

Consider the following claim:

I’ve heard that Friday 13th is unlucky, am I more likely to be involved in a car accident if I go out on Friday 13th than any other day?

What kind of evidence would be helpful? Perhaps hospital admissions.

Suppose that data is available of emergency admissions to hospitals in the Southwest Thames region due to transport accidents, on six Friday 13ths, and corresponding emergency admissions due to transport accidents for the Friday 6th immediately before each Friday 13th:

Number               1   2   3   4   5   6
Accidents on 6th     9   6  11  11   3   5
Accidents on 13th   13  12  14  10   4  12

Does the data support the claim?

Compare the number of accidents by finding the average (the unweighted mean) number of accidents on both days:

Average number of accidents = Total number of accidents / Total number of days, so that

$$\bar{x}_{6\text{th}} = \frac{9 + 6 + 11 + 11 + 3 + 5}{6} = 7.5$$

and

$$\bar{x}_{13\text{th}} = \frac{13 + 12 + 14 + 10 + 4 + 12}{6} = 10.83 .$$
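The same averages can be computed in R from the table. The vector names below are illustrative; the paired differences are included because each Friday 13th is matched with the Friday 6th just before it.

```r
sixth      <- c(9, 6, 11, 11, 3, 5)      # accidents on Friday 6th
thirteenth <- c(13, 12, 14, 10, 4, 12)   # accidents on Friday 13th

mean(sixth)                              # 7.5
mean(thirteenth)                         # 10.83333
thirteenth - sixth                       # 4 6 3 -1 1 7
```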

Exercise 3.1 Referring to the Friday 13th example,

• Why compare instead of focusing on accidents only on 13th Fridays? Need a baseline.

• Why have we chosen Friday 6th as the comparison day? Compare like with like.

• There are more accidents on Friday 13th than on Friday 6th, therefore I am more likely to be involved in a car accident if I go out on Friday 13th. Tentatively: yes.

What is this course about

• To illustrate scientific contexts where statistical issues may arise;

• to demonstrate where statistics can be useful, by showing the sort of questions it can answer, and the situations in which it is used;

• to understand sampling variation and quantify uncertainty;

• introduce various exploratory tools and summary statistics for data analysis;

• introduce specific techniques from statistical modelling and inference; and

• apply all this to real data. Wow, and this as well!!

Sources of variation

Exercise 3.2 Toss a coin 10 times. How many heads are expected? Record the outcomes:

H, H, H, T, T, H, H, H, H, T

• Are you surprised that you didn’t have exactly 5, half of the number of trials? Has the result changed your opinion about the coin?

• Are you surprised that your neighbors didn’t have exactly the same number of heads as you did?

• Repeat the experiment another two times, on two further coins, and record the number of heads. Did you get the same number of heads each time?

• What would happen if you tossed the coin 20 times?

You have witnessed sampling variation.
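The coin experiment can be imitated in R with rbinom, assuming a fair coin. This is only a sketch: the number of students and of repetitions are arbitrary choices, and the counts will change from run to run unless the seed is fixed.

```r
set.seed(3)                                    # arbitrary seed
heads10 <- rbinom(5, size = 10, prob = 0.5)    # five students, 10 tosses each
heads10                                        # number of heads per student

# The proportion of heads varies less when each student tosses 20 times
prop10 <- rbinom(10000, size = 10, prob = 0.5) / 10
prop20 <- rbinom(10000, size = 20, prob = 0.5) / 20
c(sd(prop10), sd(prop20))   # theory: sqrt(0.25/10) = 0.158 and sqrt(0.25/20) = 0.112
```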

Exercise 3.3 Think back to the Friday 13th example. Is the apparently higher chance of being in a car accident on Friday 13th due to sampling variation?

Sol:

Possibly: but nearly all Friday 13ths had elevated accidents.

The variation within Friday 13ths is not as great as between Friday 6ths and Friday 13ths.

Ultimate test: collect new data on Friday 13th dates.

Later we introduce a statistical framework to evaluate how much evidence there is for a true difference.

Population and sample

In the Friday 13th example, our interest is not limited to those available dates. Ideally we consider all the possible accidents occurring on all Friday 13ths. We call the complete group of units, or people, under study the population.

• Population: the set of all individuals or units of interest, exactly defined.

• Sample: a subset of the population, chosen to be representative of the population.

Statistical inference is learning about the population through the behaviour of a sample.

Where is statistics used?

Statistics is used in a surprisingly diverse range of areas. Here is a small selection of the fields to which statistics contributes.

Environmental monitoring: for the setting of regulatory standards and in deciding whether these are being met;

Engineering: to gauge the quality of products used in manufacturing and building;

Agriculture: to understand field trials of new varieties and choose the crops that will grow best in particular conditions;

Economics: to describe unemployment and inflation, which are used by the government and by business to decide economic policies and form financial strategies;

Finance: risk management, and prediction of the future behaviour of the markets;

Pharmaceutical industry: to judge the clinical effectiveness and safety of new drugs before they can be licensed;

Insurance: in setting premium sizes, to reflect the underlying risk of the events that are being insured against;

Medicine: to assess the reliability of clinical trials reported in journals, and choose the most effective treatment for patients;

Ecology: to monitor population sizes and to model interactions between different species;

Business: market research is used to plan sales strategies.

The Sally Clark Case

Statistics has played a key role in many topical news issues, including the controversial court case of Sally Clark. The case is a famous example of the misuse (or misunderstanding) of statistics contributing to a miscarriage of justice. The Royal Statistical Society were so concerned that they wrote a press release, highlighting the statistical mistakes made.

Sally Clark was a mother convicted of murder, when two of her babies died of ‘Cot Death’ - the name given to the unexplained death of a young infant (SIDS).

The paediatrician Sir Roy Meadow, acting as an expert witness for the prosecution in the case, famously claimed that the odds of two unexplained deaths in the same family was 1 in 73 million.

Where does this figure come from?

Exercise 3.4 The odds of a single unexplained death in an affluent, non-smoking family is estimated as 1 in 8500. The figure 73 million comes from multiplying these odds by themselves: 8500 × 8500 ≈ 73 million. Is this a reasonable calculation?

Sol:

It is only appropriate to multiply these odds together if the second death is independent of the first.

This is not reasonable since the children share the same parents, and hence genetic and environmental factors.

A second problem

A second problem is known as the ‘prosecutor’s fallacy’, which goes as follows:

The chance of two unexplained deaths in the same family occurring by chance is 1 in 73 million. Therefore, the chance of Sally Clark being innocent is 1 in 73 million also.

What is wrong with this argument? The following analogy will help.

Exercise 3.5 The idea behind the British National Lottery is that 49 balls are placed in a machine, and 6 of them are drawn. Before the draw takes place, a punter pays 1 pound to place a guess on which six balls will be drawn. There is a prize of one million pounds available for a correct guess, but the chance of getting it right is 1 : 14 million. You decide to play, and, amazingly, all six of your numbers come up! You travel to the headquarters of the national lottery to claim your winnings, but instead. . .

Sol:

. . . you are arrested – accused of cheating! The prosecuting lawyer argues “The chance of getting all six balls correct by chance is 1 : 14 million. Therefore, the chance of the defendant being innocent is 1 : 14 million also”.

Exercise 3.6 Formulate the Bayes calculation of the probability of innocence. Here is the code.

pb.a = 1/(14*10^6) # P(B|A)

pb.acomp = 0.99 # P(B|A^c)

pa = 1-1/(10^6) # P(A)

pa.b = pb.a*pa/( pb.a*pa + pb.acomp*(1-pa) ) # P(A|B) = 0.0673

Sol:

A = “innocence”, B = “six balls correct”. Want P(A|B).

$$\begin{aligned} P(A|B) &= \frac{P(B|A)P(A)}{P(B)} && \text{by Bayes,} \\ &= \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)} && \text{by TPT.} \end{aligned}$$

For calculations guesstimate:

P(B|A) = 1/(14 × 10^6)
P(B|A^c) = 0.99
P(A) = 1 − 1/10^6, the prior probability of innocence.

Posterior prob P(A|B) ≈ 0.06729469, i.e. about 1 in 15.

Data

In experiments and surveys certain specific attributes are measured on the units. These are called variables. For example, in the Friday 13th data, the unit is a Friday 13th, and the variable we measure is the number of accidents.

The variable is a random variable if it is determined at random or by some random process. To apply probability theory we convert the measurements to numerical scales.

Types of data

Most random variables fall into the following two categories, depending on the characteristic and how it is measured:

Discrete: Variables taking values in countable sets: e.g. gender, eye color, college membership, exam grades (A, B, C, D, E), number of goals in a match, children in a family, . . . .

Continuous: Variables taking values on some interval of the real line: e.g. height, weight, direction, time, . . . .

Sample survey data

We see that some data are useful in carrying out our investigation. But how do we choose data? What are the important considerations? Is there any limit to the amount of evidence that can be obtained from some given data? Think back to the data on Friday 13th – could we use it to decide whether car accidents were especially common on Fridays?

So if the evidence available is limited by the data we have, it makes sense that we should think very carefully about how we collect the data.

If you are not collecting the data yourself, it is always important to understand how the data is collected, so that you are aware of any limitations that may place on your analysis.

To illustrate the idea, we begin with an extreme example.

Exercise 3.7 Student study: There is interest in estimating how many hours students spend studying every week. So you design a survey and find participants.

Thinking to yourself where a good place would be to find students to fill in your survey, you have a brilliant idea. . . the Library! You sit outside and stop students as they leave to fill in your questionnaire. After some time you have enough results for analysis. You find that students spend, on average, 30 hours a week studying.

What is wrong with the way in which the study has been carried out?

• What is the population of interest for the survey? All UG students at UoL 2010.

• What property should the sample have? Be representative.

If you had stopped students outside the University Bar instead of the library, would you have got similar results? No.

Can you think of a better way to collect data for your survey? Yes.

For a sample to be representative of the population requires a rigorous definition of the population. Other populations for this survey could be

full time students, maths students, female students, . . . , students in 1964, . . .

For what population is sampling by stopping people outside the library appropriate? Library users!

A representative sample reflects the characteristics and nature of the population. If the sample is not representative, we usually introduce a systematic error called bias into the calculation.

Exercise 3.8 Beach comber: A measure of how polluted British beaches are is the volume of residual plastic found on the beach. A survey is proposed to estimate this. Write down the issues that need to be addressed.

Sol:

Issues:

How large is a large sample

The term n usually denotes the number of units or subjects in the sample. There are practical as well as statistical considerations to choosing the size of the sample. On the practical side, financial constraints may mean a sample has to be smaller than n = 1000. Some statistical considerations will be discussed later.

Random sample

The widely accepted method to obtain a representative sample of the population is by selecting a random sample. Statisticians like these.

A simple random sample of size n from a population is one in which each possible sample of that size has the same chance of being selected.

One method to ensure random sampling is to write the name of every member of the population on a slip of paper, place these slips into a hat, then draw out the required number for the sample.

A more practical method has been developed using the computer, called a random number generator. For an example of a pre-election poll, we may need n = 1000 random numbers between 1 and 40 million, for a sample of size n = 1000 out of the 40 million eligible voters in the UK. If we have all the voters written in a list, we can pick out the selected subjects for our sample.

sample(1:10,4) # 3 7 6 2
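Scaled up to the pre-election poll described above, the same idea might look as follows. This is a sketch under the assumption that the voters are numbered 1 to 40 million on a list; sample.int avoids building the full vector 1:N, and the names N, n and ids are illustrative.

```r
set.seed(4)                 # arbitrary seed
N <- 40e6; n <- 1000        # population and sample sizes from the text

ids <- sample.int(N, n)     # positions of the selected voters on the list
head(sort(ids))             # the first few selected positions
anyDuplicated(ids)          # 0: sampling is without replacement
```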

Other kinds of sampling

It is not always feasible to carry out sampling in a truly random fashion. It can be very expensive to contact 1000 randomly chosen people in a pre-election poll: they may be geographically dispersed and difficult to reach, causing long delays. We may have to resort to a sampling method that is not random for practical reasons. Provided we are careful, we can minimize the bias that is caused.

Exercise 3.9 Suppose we go to the city centre, stop passers-by in the street and ask who they are going to vote for in the next election. This is sometimes known as convenience sampling. An improved version is known as quota sampling. What kinds of bias may be introduced? Shoppers are not representative of voters.

Does increasing the size of a sample decrease the bias?

Exercise 3.10 For the student study hours example, one survey collects 1000 responses, with convenience sampling, with interviews made outside the library, stopping random students.

A second survey collects only 50 responses, with random sampling from a list of the entire student population of the University.

Which study should we believe more? It depends on the population of interest. It is almost always better to have a small, representative sample, than a large biased sample.
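A small simulation can make this point concrete. Everything in the sketch below is a hypothetical assumption made for illustration: the population of 10000 students, the Gamma distribution of weekly study hours, and the rule that a student's chance of being met outside the library is proportional to the hours they study.

```r
set.seed(5)                                        # arbitrary seed
hours <- rgamma(10000, shape = 4, scale = 5)       # hypothetical population, mean about 20

# Large convenience sample: heavy studiers are more likely to be stopped
biased <- sample(hours, 1000, prob = hours)

# Small simple random sample from the full list
random <- sample(hours, 50)

c(population = mean(hours), biased = mean(biased), random = mean(random))
```

Under these assumptions the large library sample overestimates the population mean by several hours however many students are stopped, whereas the error of the small random sample is due to sampling variation only.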

From here on we assume that the sample is random and study properties of simple random samples. This greatly simplifies our mathematical treatment of the problem and provides insights into important statistical ideas used.

3.2 Exploratory data analysis

We introduced some examples of discrete and continuous random variables and studied their properties. If we know the exact analytical form of the underlying distribution of interest (i.e. the population), there is no need to collect data nor make statistical analysis. In reality this is rarely the case, especially at the beginning of an investigation, and even if there is a conjectured model for the data, we always need to check if it is consistent with the data.

Data and variability

Data is measured information and is fixed. But in representing the population it also carries uncertainty. This may be due to inherent random variability in the characteristic of interest (e.g. a coin throw), to measurement variability from one day to another (e.g. weight), or to sampling variability (e.g. one individual is selected into the sample, another is not).

In mathematical terms, in all of these three cases, the characteristic being measured is represented by a random variable: e.g. X = today’s weight of an individual, e.g. X = number of plastic bottles on the beach selected, e.g. X records 1 if the throw is a head.

Random variables and realisations

There is an important difference between a random variable and its realisation, or observation. A random variable is always written in upper case and is a function with an associated probability distribution (pmf/pdf); e.g. X = Ozone level.

An observation on a random variable is written in lower case and is just a number; e.g. x = observed value of Ozone.

A data set of size n may be considered in two ways:

X1, . . . , Xn random variables

x1, . . . , xn given realisations.

The first is needed for probability and statistical modelling. The second is needed for exploratory data analysis.

Data analysis

The first stage in any analysis is to get to know the problem and the data. The first stages of data analysis usually involve a variety of graphical procedures to visualise the data, and the calculation of a few simple summary numbers, or summary statistics, that capture key features of the data.

The variability in the data is a reflection of, and an approximation to, the true underlying distribution and its features. We need to care how good the approximation is.

Role of exploratory data analysis

There are three essential roles:

Finding errors and anomalies: missing data, outliers, changes of scale, . . . .

However carefully data have been collected, it is always possible that they contain errors. Early detection of these errors can save time and confusion later on. These may be due to recording or transcription error or broken equipment, among other causes.

Suggesting subsequent analyses: plots of data and summary statistics give information on location, scale and shape of the distribution and relationships between variables. This builds up a feeling for the structure of the data, which gives insight into subsequent statistical modelling.

Augmenting understanding of the applied problem: exploratory tools sharpen the scientific questions addressed. Context and scientific rationale for analysis are paramount.
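As a minimal sketch of the first role, the code below screens a short vector for missing values and suspicious points. The readings are invented for illustration (they are not from the course data), with one missing value and one entry that looks as if it was recorded on the wrong scale.

```r
# Hypothetical daily readings: one NA and one value on the wrong scale
x = c(31, 28, 35, NA, 40, 33, 290, 29, 36, 32)

summary(x)                        # quartiles, mean and the number of NAs
missing = which(is.na(x))         # positions of the missing values
suspects = boxplot.stats(x)$out   # points more than 1.5 x IQR beyond the box
missing
suspects
```

Whether a flagged point is an error or a genuine extreme has to be decided from the context of the problem, not from the code.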

3.3 Examples with associated data sets

Each of these real life problems has an associated data set which we explore, to show the whole process involved in detailed statistical analysis from conception through to conclusion.

Marine science          Excess waves
Ecological              Diseased trees
Atmospheric Chemistry   Ozone and air pollution
Health                  Comparing hospitals

Offshore waves at Newlyn

Coastal engineers at the port of Newlyn, in the south west of England, require detailed understanding of oceanographic processes in order to estimate overtopping rates of the sea wall protecting the town. They can then assess whether the existing sea wall is adequate, or whether further protection should be built. Offshore waves are induced by meteorological conditions, and though complex, they can be summarised by their height and their period.

Here we will concentrate on the excess heights of these waves over a threshold.

[Figure: map showing the location of Newlyn; axes: Eastings and Northings.]

The specific problem for the engineers is:

Given a small probability of exceedance, what is the wave height that is exceeded with that probability?

How accurate is this estimate? This is a question for statistics.

Diseased trees

In an ecological study of diseased trees, trees along transects through a plantation were examined and assessed as diseased or healthy. Data collection goes as follows. First a diseased tree is found. Then the number of neighbouring trees in an unbroken run of diseased trees along the transect is recorded. Ecologists are interested in the following:

How does the disease spread between trees, and what is the probability that trees are infected by the disease?

The observations made on a total of 109 runs of diseased trees are recorded in the Table below. We use this data set to show the benefits of collecting more data. To do this we have broken down the data in the Table into data collected from the first 50 observations and data from the whole data set; we refer to these as the partial and full data sets respectively.

Run length                                 0    1    2    3    4    5
Number of runs (first 50 observations)    31   16    2    0    1    0
Number of runs (all 109 observations)     71   28    5    2    2    1
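The table can be entered into R directly. The sketch below (variable names are illustrative) checks the totals and compares the relative frequencies and mean run length of the partial and full data sets.

```r
run_length = 0:5
partial = c(31, 16, 2, 0, 1, 0)   # first 50 observations
full    = c(71, 28, 5, 2, 2, 1)   # all 109 observations

sum(partial); sum(full)           # totals: 50 and 109

# Relative frequency of each run length in the two data sets
rel = rbind(partial = partial / sum(partial),
            full    = full / sum(full))
colnames(rel) = run_length
round(rel, 3)

# Mean run length in each data set
mean_partial = sum(run_length * partial) / sum(partial)   # 24/50
mean_full    = sum(run_length * full) / sum(full)         # 57/109
c(mean_partial, mean_full)
```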

Urban and rural ozone

In the UK the Department for Environment, Food & Rural Affairs operates a national air quality monitoring system, with a network of sites at which air quality measurements are taken automatically. These measurements are used to summarise current air pollution levels, for forecasting of future levels and to provide data for scientific research into the atmospheric processes behind the pollution. We look at ground-level ozone (O3).

Ozone: the background

This pollutant is not emitted directly into the atmosphere, but is produced by chemical reactions between nitrogen dioxide (NO2), hydrocarbons and sunlight. When present at high levels, ozone can irritate the eyes and air passages, causing breathing difficulties, and may increase susceptibility to infection. Ozone is toxic to some crops, vegetation and trees and is a highly reactive chemical, capable of attacking surfaces, fabrics and rubber materials.

Whereas nitrogen dioxide participates in the formation of ozone, nitrogen oxide (NO) destroys ozone to form oxygen and nitrogen dioxide. For this reason, ozone levels are not as high in urban areas (where high levels of NO are emitted from vehicles) as in rural areas. As the nitrogen oxides and hydrocarbons are transported out of urban areas, the ozone-destroying NO is oxidised to NO2, which participates in ozone formation.

As sunlight provides the energy to initiate ozone formation, high levels of ozone are generally observed during hot, still, sunny, summertime weather in locations where the airmass has previously collected emissions of hydrocarbons and nitrogen oxides (e.g. urban areas with traffic). The resulting ozone pollution or summertime smog may persist for several days and be transported over long distances.

Ozone: the data

We focus on data from two monitoring sites:
- an urban site in Leeds city centre, and
- a rural site at Ladybower Reservoir, just west of Sheffield.

The data at each site are daily measurements of the maximum hourly mean concentration of O3 and NO2, recorded in parts per billion (ppb), from 1994 – 1998 inclusive. To focus on the question of whether there is any effect of season on ozone levels, we compare data from winter (November – February inclusive) and early summer (April – July inclusive).

We address the following questions:

How, if at all, does the distribution of ozone measurements vary between the urban and rural sites?

How, if at all, is the distribution of ozone measurements affected by season?

How, if at all, does the presence of other pollutants affect the levels of measured ozone?

The purpose of the statistical analysis is to provide an objective analysis of the data, by extracting the information in the data relevant to each of the scientific questions.

Comparing hospitals

League tables for many public institutions such as schools, hospitals and even universities try to compare the relative performances of the institutions. This very small example uses the outcomes of a difficult operation at two hospitals. Ten patients at each hospital underwent the operation. The patients were selected to make sure that they had similar severity of illness and other characteristics which are believed to influence the outcome of the operation. There is no connection between the two hospitals.

Each operation was classified as successful or unsuccessful. The first hospital had nine out of ten successful operations and the second hospital had five out of ten successful.

What can we conclude about the relative performances of the two hospitals?

R code for the data

The data sets are saved from R in the file m105.Rdata.

load("./m105.Rdata")   # file in the current working directory (Linux-style path)
ls()                   # "barley" "ozone.summer" "ozone.winter" "waveExcesses" "waves"
# Ozone
names(ozone.summer)    # "Leeds.O3" "Leeds.NO2" "Ladybower.O3" "Ladybower.NO2"
attach(ozone.summer)   # makes the columns available by name
hist(Leeds.O3)

Population and sample: examples

In the Ozone problem, there is data from a number of days during 1994-1998. However, interest is not solely in the levels of ozone on the days on which measurements were taken. The objective of a statistical analysis is to learn about the relationships between variables, and perhaps to extrapolate to future dates.

Exercise 3.11 For each of the problem data sets state the populations that we are trying to learn about:

Newlyn waves: All waves encountered offshore at Newlyn.

Ozone: Levels of ozone at the two locations given the time of year.

Diseased trees: All trees in similar forests.

Hospitals: Other operations at the two hospitals.

Exercise 3.12 For the diseased tree data set, define the variable of interest as X and its possible range of values. X = length of unbroken run of diseased trees. Discrete: X ∈ {0, 1, 2, . . .}.

Exercise 3.13 For the hospital data set, define the variables of interest and possible range of values: X is the number of successful operations in the first hospital, Y the number in the second. Discrete: X, Y ∈ {0, 1, . . . , 10}.

3.4 Graphical methods

Graphical methods are needed for visualising multivariate and univariate data. If the data is high dimensional, then it can be difficult to visualise since plots are two dimensional! Finding ways of overcoming this is an active area of computer science.

Here the focus is on methods for examining the distribution of a single variable and relationships between pairs of variables.

Historical note – Florence Nightingale

Good graphical display is the important first step in any data analysis. Choosing how to do it is part science, part art, and sometimes part politics! Florence Nightingale was the first female Fellow of the Royal Statistical Society. She pioneered the use of statistics as an organised way of learning, leading to improvements in medical and surgical practices. She developed the polar-area diagram, to dramatise the needless deaths caused by unsanitary conditions. Florence Nightingale revolutionised the idea that social phenomena could be objectively measured and subjected to mathematical analysis, innovating in the collection, interpretation, and graphical display of descriptive statistics.

Histograms

The standard histogram of observations on a variable displays the frequency, the number of observations, in each bin, where the bins divide up the range of the variable and are usually of equal width.

A technical definition is hard to write down, and requires a definition of the empirical cdf.

A histogram displays the variability and the distribution of the variable. It may suggest one pdf rather than another as a possible statistical model for the variable. In a sense the histogram is an empirical pdf.
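The sense in which a histogram is an empirical pdf can be checked by simulation. The sketch below uses simulated exponential data (an assumption made purely for illustration, since the true pdf is then known) and compares the bar heights on the density scale with the pdf at the bin midpoints.

```r
set.seed(1)
x = rexp(20000, rate = 1)                  # simulated data, true pdf exp(-x)

brk = seq(0, ceiling(max(x)), by = 0.5)    # bins of width 0.5 covering the data
h = hist(x, breaks = brk, plot = FALSE)    # compute the bins without drawing

# Bar heights (density scale) against the true pdf at the bin midpoints
head(cbind(mid = h$mids, histogram = h$density, pdf = dexp(h$mids)))

# To draw the comparison:
# hist(x, breaks = brk, prob = TRUE); curve(dexp(x), add = TRUE)
```

With a large sample the two columns are close; with a small sample the bar heights are much noisier.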

Exercise 3.14 Diseased trees. Plot histograms for the partial and full data sets and summarise the shape of the distributions displayed.

partial = c(31, 16, 2, 0, 1, 0)
full = c(71, 28, 5, 2, 2, 1)
# barplot(full) # is another way
# unbundle the data: repeat each run length by its count
Partial = rep(0:5,partial)
Full = rep(0:5,full)
par(mfrow=c(1,2))
hist(Partial, xlab="Run length",ylab="Count",main="Partial",
ylim=c(0,75),breaks=seq(-0.5,5.5,by=1), col='red')
hist(Full, xlab="Run length",ylab="Count",main="Full",
ylim=c(0,75),breaks=seq(-0.5,5.5,by=1), col='blue')

Sol:

Both indicate a geometric decay in the distribution of run lengths.

Scaled histogram

The histogram estimates the underlying pmf of a discrete variable or the pdf of a continuous variable. Recall that all pmfs sum to 1, and that all pdfs integrate to 1. It thus makes sense to plot histograms with relative frequency rather than raw frequency and so respect this summation.

Exercise 3.15 Diseased trees.

?hist # see the freq argument; prob=TRUE is equivalent to freq=FALSE
hist(Partial,prob=TRUE,
xlab="Run length",ylab="Rel freq",main="Partial",
breaks=seq(-0.5,5.5,by=1), col='red')
hist(Full,prob=TRUE,
xlab="Run length",ylab="Rel freq",main="Full",
breaks=seq(-0.5,5.5,by=1), col='blue')

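Each of these histograms has area 1. The shape of the histogram does not change. The vertical axis now represents relative frequency rather than raw frequency (more precisely, relative frequency divided by bin width; the two coincide here because the bins have width 1).

The benefit of rescaling is to make distributions easier to compare.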
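The claim about the area can be checked numerically. This sketch recomputes the partial-data histogram without drawing it and sums height times width over the bars; `plot = FALSE` is a standard argument of `hist()`, and the variable names are illustrative.

```r
partial = c(31, 16, 2, 0, 1, 0)
Partial = rep(0:5, partial)

h = hist(Partial, breaks = seq(-0.5, 5.5, by = 1), plot = FALSE)
h$counts                                 # raw frequencies, as in the table
h$density                                # bar heights used when prob=TRUE
area = sum(h$density * diff(h$breaks))   # height x width, summed over bars
area
```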

Exercise 3.16 Ozone: Comparing histograms. The histograms of the summer ozone data for both sites are given by this code. The scales need to be the same for a fair comparison; otherwise the conclusions drawn can differ, e.g. about peakedness.

load("./m105.Rdata")
attach(ozone.summer) ; names(ozone.summer)
par(mfrow=c(1,2))
hist(Leeds.O3); hist(Ladybower.O3) # raw counts, default scales
hist(Leeds.O3,prob=T); hist(Ladybower.O3,prob=T) # area 1, default scales
hist(Leeds.O3,prob=T,ylim=c(0,.05)) # common vertical scale
hist(Ladybower.O3,prob=T,ylim=c(0,.05))
hist(Leeds.O3,prob=T,ylim=c(0,.05),breaks=20) # about 20 bins
hist(Ladybower.O3,prob=T,ylim=c(0,.05),breaks=20)
hist(Leeds.O3,prob=T,ylim=c(0,.06),breaks=20) # larger vertical range
hist(Ladybower.O3,prob=T,ylim=c(0,.06),breaks=20)

The two distributions are clearly not identical, but the spread and shape of the histograms are sufficiently close that it is difficult to identify any specific difference by eye.

To really look at differences: consider differencing.

Exercise 3.17 Ozone: the differences. We have observations on the ozone level at each site, (xi, yi), for every day i = 1, . . . , n. Looking directly at the daily differences in ozone, di = xi − yi, removes variability that is common to the two locations (e.g. atmospheric conditions).

par(mfrow=c(1,1))
length(Leeds.O3) # 469
d = Leeds.O3 - Ladybower.O3
hist(d,freq=FALSE,col='yellow', # freq=FALSE is another way to write prob=TRUE
xlab="difference",ylab="Rel freq",main="O3 differences")
grid()

Exercise 3.18 Conclusions drawn: The variability of these differenced data is less than the variability of the measurements made at the separate sites. So common factors that affect both sites, and influence ozone values, are removed from the differenced data. Differencing is only possible if measurements are collected on the same unit (here, the same day). Most differences are negative: measurements at Ladybower are larger than at Leeds. This supports scientific expectations that rural ozone levels are generally higher than urban levels.
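Why differencing reduces variability can be seen in a simulation. The sketch below does not use the ozone data: it assumes each site's value is a site mean plus a shared daily "weather" effect plus independent noise, with all the numbers chosen only for illustration.

```r
set.seed(2)
n = 469
weather = rnorm(n, mean = 0, sd = 10)   # assumed effect common to both sites
x = 32 + weather + rnorm(n, sd = 4)     # stand-in for an urban site
y = 44 + weather + rnorm(n, sd = 4)     # stand-in for a rural site

d = x - y                               # the common effect cancels in d
c(sd_x = sd(x), sd_y = sd(y), sd_d = sd(d))
mean(d < 0)                             # proportion of negative differences
```

In this set-up the standard deviation of the differences reflects only the independent noise at the two sites, so it is smaller than the standard deviation at either site.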

Choice of bin size for a histogram

Constructing a histogram smooths the data, and the width of the bins determines how much smoothing is applied. Broad bins correspond to highly smoothed data, in which much of the structure of the data set is lost. Narrow bins undersmooth the data, leaving in random variation which obscures the structure of the data, but in a different way.

Exercise 3.19 Choosing bin size for the summer ozone data. Examples of very wide and very narrow bins are shown for the summer ozone data from the Leeds city centre site.

par(mfrow=c(1,2))
x = Leeds.O3
hist(x,prob=T,col='yellow',breaks=2) # very wide bins
hist(x,prob=T,col='red',breaks=500) # very narrow bins

Using a very large bin size has obscured the structure of the data. So has the very small bin size: the right hand plot just shows the raw data, although this is surprisingly informative here. The earlier plot lies somewhere in between, and was achieved by trial and error.

Heights of offshore waves at Newlyn

The data set waves gives the maximal levels (in metres) recorded over consecutive 15 hour windows, throughout the period 1971-77.

Typing in waves displays the whole vector.

Exercise 3.20 Find the length of this vector: length(waves) # 2894

Find the mean of the offshore wave heights: mean(waves) # 2.866

Display a histogram of the offshore wave heights: hist(waves)

[Figure: histogram titled "Offshore waves"; x-axis: Wave height; y-axis: Frequency.]

Describe the shape of this distribution and the range of this variable: Asymmetric, long right tail, all

What does the y-axis of this plot represent? Counts of observations that fall in each bin.

Scale the histogram to have area 1. hist(waves,prob=TRUE)

What does the y-axis of this plot represent now? Relative frequency, scaled by the bin width so that the total area is 1. The x-axis shows wave heights measured in metres.

3.5 Empirical cdf

The cumulative distribution function (cdf) of a random variable X is

\[ F(x) = P(X \le x), \quad \text{for } -\infty < x < \infty, \]

whether X is discrete or continuous. Define the indicator function

\[ I(X \le z) = \begin{cases} 1 & \text{if } X \le z, \\ 0 & \text{otherwise.} \end{cases} \]

Exercise 3.21

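Result: (Unbiased estimate of cdf.) Show, for any fixed z, that the expected value of I(X ≤ z) is F(z).

Sol:

\begin{align*}
E[I(X \le z)] &= \int_{-\infty}^{\infty} I(x \le z)\, f(x)\,dx && \text{(definition of } E\text{)} \\
&= \int_{-\infty}^{z} 1 \cdot f(x)\,dx + \int_{z}^{\infty} 0 \cdot f(x)\,dx \\
&= P(X \le z) + 0 \\
&= F(z). && \text{(definition of } F\text{)}
\end{align*}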
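The result can be checked by simulation: averaging the indicator over many independent draws should give a value close to F(z). The sketch below assumes a standard normal X and the point z = 0.5, both chosen only for illustration.

```r
set.seed(3)
z = 0.5
x = rnorm(100000)                 # draws of X, here standard normal
indicator = as.numeric(x <= z)    # I(X <= z) for each draw
est = mean(indicator)             # sample average of the indicator
c(estimate = est, F_z = pnorm(z)) # compare with the true cdf at z
```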

Definition: The empirical cdf is defined as

\[ F(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x). \]

Result: The ecdf can be calculated from F(x) = (1/n) × (number of i such that xi ≤ x). Each observation has an equal weight 1/n in this computation.

Exercise 3.22 Five realisations of a rv X are 2, 3, 4, 1, 2. Compute F(x) at x = 0.5, 1.5, 2.5, 3.5, 4.5. How would the calculation change if the points x = 0, 1, 2, 3, 4 are used?

Sol:

F (0.5) = 0/5

F (1.5) = 1/5

F (2.5) = 3/5

F (3.5) = 4/5

F (4.5) = 5/5.

Not much change: F(0) = F(0.5), F(1) = F(1.5), . . . . But this implies that the ecdf is a step function.
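The hand calculation can be reproduced in R, either directly from the definition or with the built-in `ecdf()` function. The helper name `Fhat` below is introduced only for this sketch.

```r
x = c(2, 3, 4, 1, 2)

# Directly from the definition: proportion of observations <= t
Fhat = function(t) mean(x <= t)
pts = c(0.5, 1.5, 2.5, 3.5, 4.5)
by_hand = sapply(pts, Fhat)
by_hand                    # 0.0 0.2 0.6 0.8 1.0

# ecdf() returns a function that can be evaluated at any points
Fn = ecdf(x)
Fn(pts)
Fn(0:4)                    # the same values: F is flat between data points
```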

Properties of the ecdf

The empirical cdf F (x) is a proper cdf and

• is a step function with jumps at the data points;

• F (x) = 1 if x ≥ max(x1, . . . , xn);

• F (x) = 0 if x < min(x1, . . . , xn).

An alternative calculation of the ecdf

As the ecdf is a step function with jumps at the data points, there is an easier way of calculation. Take the realisations x1, . . . , xn; order them with the smallest first; label these order statistics as x(1), x(2), . . . , x(n) so that

x(1) ≤ x(2) ≤ . . . ≤ x(n).

The subscripts give the ranks of the data points.

x=c(2, 3, 4, 1, 2)

rank(x) # 2.5 4.0 5.0 1.0 2.5

sort(x) # order statistics

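Result: the ecdf can be evaluated at the order statistics,

\[ F(x_{(i)}) = \frac{i}{n}, \]

and for values of x in between,

\[ F(x) = \frac{i}{n}, \quad \text{where } x_{(i)} \le x < x_{(i+1)}. \]

When several observations are tied, take i to be the largest rank among the tied values.

Proof: The number of observations less than or equal to x(i) is i.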

Exercise 3.23 For observations 2, 3, 4, 1, 2, find F (x) and sketch the plot.

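x      0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
F(x)

Sol:

Order the data: 1, 2, 2, 3, 4. The ecdf jumps at the data points x = 1, 2, 3, 4, where it takes the values 1/5, 3/5, 4/5, 5/5. Filling in the values in between gives

x      0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
F(x)   0/5  0/5  1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5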

Exercise 3.24 Summer ozone. Use the first 20 observations from the Leeds city centre summer ozone values to compute the ecdf.

n = 20
x = Leeds.O3[1:n]
xrank = sort(x) # order the data
Fn = seq(1,n)/n # a jump of 1/n
plot(xrank, Fn, type='s') ; grid() # step function
plot(ecdf(x),pch='.') ; grid() # ecdf is an R function
# for the whole data
par(mfrow=c(1,1))
x = Leeds.O3
plot(ecdf(x),pch='.') ; grid(12) # ecdf is an R function

Draw some conclusions from the complete data.

Sol:

On about 60% of days the daily maximum was less than 35 ppb, and on about 20% of days it was greater than 40 ppb. The cdf increases steadily between 20 and 40, and the largest values stretch out to about 80.

3.6 Summary statistics

In addition to visualising our data graphically, we can calculate some summary statistics which capture important features of our data. Numerical summaries of the data can

• facilitate the comparison of different variables;

• help make clear statements about aspects of the data.

Mathematical notation

Recall the notation

\[ \sum_{i=1}^{n} g(i) = g(1) + g(2) + \dots + g(n-1) + g(n) \]

for any positive integer value of n and any function g. In statistics we often have to do mathematics with sums of this form. The most common forms of this expression encountered are:

\[ \sum_{i=1}^{n} x_i = x_1 + \dots + x_n \quad \text{and} \quad \sum_{i=1}^{n} x_i^2 = x_1^2 + \dots + x_n^2. \]

Sample mean

Consider a random variable X from which we obtain n realisations x1, . . . , xn. To emphasise some mathematical properties of averaging we may write the realisations as a vector x = (x1, . . . , xn).

Definition: The sample mean of n observations x1, . . . , xn is denoted by x̄, or by m(x), and is obtained by summing all the xi and dividing by n:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad m(x) = \frac{1}{n} \sum_{i=1}^{n} x_i. \]

This measures the location of the sample. It is an estimate of the expectation E(X), or the mean of X.

Sample variance and standard deviation

Definition: The sample variance of n observations x1, . . . , xn is denoted s² and is given by:

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]

Note the divisor n. Many textbooks, and R's var() function, use the divisor (n − 1) instead of n here, which may look odd. There are technical reasons for this but, for large values of n, it makes little difference.

The sample variance is a measure of spread of X and also an estimate of the variance var(X). Ideally, a spread measure should have the same units as the original data.

Definition: The sample standard deviation of observations x1, . . . , xn is s = √(s²). The standard deviation σ of X is the square root of σ² = var(X); the sample standard deviation estimates this value from the data.
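The effect of the divisor can be seen on a small data set. The sketch below compares the divisor-n definition given above with R's `var()`, which uses the divisor n − 1; the variable names are illustrative.

```r
x = c(2, 3, 4, 1, 2)
n = length(x)

s2_n  = mean((x - mean(x))^2)   # divisor n, as in the definition: 5.2/5
s2_n1 = var(x)                  # R's var() uses divisor n-1:      5.2/4
c(divisor_n = s2_n, divisor_n_minus_1 = s2_n1)

s2_n1 * (n - 1) / n             # rescaling var() recovers the divisor-n value
sqrt(s2_n)                      # sample standard deviation, divisor n
```

For a sample as large as the wave data the factor (n − 1)/n is so close to 1 that the two versions agree to several decimal places.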

Exercise 3.25 Waves. Find the sample mean of the wave height data: mean(waves) # 2.866.

Find the sample variance: var(waves) # 2.564049.

Use the sqrt() function to derive the sample standard deviation: sqrt(var(waves)) # 1.601265.

Exercise 3.26 Ozone data. Calculate summary statistics of O3 to look more closely for differences between the locations and the seasons.

There are four groups, arising from the two levels of each of the two nominal variables, location and season. The means are given below, with standard deviations in parentheses.

          Leeds city       Ladybower
summer    31.78 (9.28)     43.63 (11.81)
winter    20.52 (10.77)    29.24 (8.40)

Give the R code to compute these numbers. Draw conclusions from these summary statistics.

Sol:

mean(Leeds.O3)

mean(ozone.summer$Leeds.O3) # list

mean(ozone.winter$Leeds.O3)

sd(ozone.winter$Ladybower.O3)

The conclusions are comparative: The mean values for Ladybower are higher than for Leeds. The summer mean values are higher than the winter ones. The spreads are roughly the same.
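One way to produce the whole table in a single pass is sketched below. Because m105.Rdata is not available here, the code builds simulated stand-in data frames (`fake.summer` and `fake.winter`, hypothetical names with invented values) that have the same column names as the ozone data; with the real data one would use ozone.summer and ozone.winter instead.

```r
set.seed(6)
# Simulated stand-ins with the same column names as the course data
fake.summer = data.frame(Leeds.O3 = rnorm(100, 32, 9),
                         Ladybower.O3 = rnorm(100, 44, 12))
fake.winter = data.frame(Leeds.O3 = rnorm(100, 21, 11),
                         Ladybower.O3 = rnorm(100, 29, 8))

seasons = list(summer = fake.summer, winter = fake.winter)

# One row per season, one column per site
means = t(sapply(seasons, function(d) sapply(d, mean)))
sds   = t(sapply(seasons, function(d) sapply(d, sd)))
round(means, 2)
round(sds, 2)
```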

Sample quantiles

Sample quantiles are calculated directly from the empirical cdf.

Definition: the pth sample quantile, x_p, satisfies

\[ F(x_p) = p \quad \text{for } 0 < p < 1. \]

The median x_{0.5} corresponds to p = 0.5; it is another widely used measure of location. The definition x_p = F^{-1}(p) does not work here because F is a step function and so its inverse is not defined.
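Because F is a step function, software has to adopt a convention for the sample quantile. The sketch below assumes one common convention, the smallest observation at which the ecdf reaches p, and compares it with R's `quantile()`; the helper `sample_quantile` is introduced only for this illustration.

```r
x = c(2, 3, 4, 1, 2)
Fn = ecdf(x)

# Assumed convention: smallest observation whose ecdf value is at least p
sample_quantile = function(p) min(x[Fn(x) >= p])

sample_quantile(0.5)    # 2
sample_quantile(0.7)    # 3

# quantile() offers this rule as type = 1; its default rule interpolates
quantile(x, probs = c(0.5, 0.7), type = 1)
quantile(x, probs = c(0.5, 0.7))
```

Different conventions can give slightly different answers on small samples, which is why a quantile read from a plot and one returned by R need not agree exactly.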

Exercise 3.27 Calculate the sample mean and the median, for each data set, using this code

stats = function(x) c(mean(x),median(x))
stats(c(2, 4, 6, 8, 10)) # 6 6
stats(c(2, 4, 6, 8, 100)) # 24 6
stats(c(2, 4, 6, 8, 1000)) # 204 6

The lesson learnt is that the median is insensitive to outliers.

Exercise 3.28 Find the 0.6 quantile of the Leeds summer ozone daily maxima.

[Figure: empirical cdf F(x) of the Leeds summer daily maxima, with the level p = 0.6 marked; x_p lies in the interval (33, 34).]

The 0.6 sample quantile lies in the interval (33, 34). quantile(Leeds.O3,prob=0.6) # 33

Exercise 3.29 The function quantile() calculates quantiles of a vector: quantile(waves). The minimum, maximum and median values are 0.32m, 11.05m, 2.46m.

Compare the median to the mean: the mean (2.87m) is higher, since the histogram is skewed to the right.

Exercise 3.30 Plot the empirical cdf of the waves data, plot(ecdf(waves),pch='.'); grid(21), to answer the following.

Find the median of the wave height distribution: 2.5m approx.

Find the 0.1 and 0.9 quantiles of the wave height distribution: 1.2m and 5.1m approx.

Estimate the probability of a randomly selected wave being less than 1.7m: 0.25.

Find the wave height exceeded by 25% of the waves: 3.7m approx.

Box-and-whisker plots

These plots summarise the observations in terms of quantiles. They display the extremes (the whiskers), and the central values (the box defined by the quartiles and the median).

Definition: the interquartile range is x_{0.75} − x_{0.25}. The length of the box is the interquartile range.
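As a small worked example of the definition, the sketch below computes the quartiles and the interquartile range of an invented five-point data set, using R's default quantile rule, and checks the result against the built-in `IQR()` function.

```r
x = c(2, 4, 6, 8, 100)

q = quantile(x, probs = c(0.25, 0.5, 0.75))   # quartiles, R's default rule
q
iqr = unname(q[3] - q[1])                     # x_0.75 - x_0.25 = 8 - 4
iqr
IQR(x)                                        # built-in equivalent
```

The outlying value 100 affects neither the median nor the interquartile range here, which is one reason boxplots are built from quantiles.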

Exercise 3.31 Boxplot for the Ozone data.

load("./m105.Rdata")

attach(ozone.summer) ; names(ozone.summer)

par(mfrow=c(1,2))

hist(Leeds.O3); hist(Ladybower.O3)

boxplot(Leeds.O3,ylim=c(0,110));

boxplot(Ladybower.O3,ylim=c(0,110))

quantile(Ladybower.O3) #

Features of the boxplots are: the thick line in the box is the median; the upper line in the box is the 75% quantile, and the lower line is the 25% quantile; the minimum and maximum are easily identified; and points appearing outside the limits may be considered outliers. Summarise conclusions to be drawn from these boxplots.

Sol:

Skewness is shown as asymmetry of the box around the median; here it is only the right hand tail that is long. The Ladybower distribution is a shift to the right of the Leeds distribution. Comparison requires the same scales.

3.7 Bivariate relationships

Histograms and empirical distribution functions are useful methods for visualising a single variable. However, with multivariate data, it is important to examine the relationships between variables as well as the structure of each variable by itself. The scatterplot simply plots the value of one variable against another.

Definition: if (xi, yi) are two observations on the same unit i = 1, 2, . . . , n, the plot of (xi, yi) is called a scatterplot.

Exercise 3.32 Ozone. Consider the effect of nitrogen dioxide (NO2) on ozone levels. We focus on the Leeds city centre measurements. Use this R code to give scatterplots of O3 and NO2 for summer and winter. Sketch the graph in your notes.

par(mfrow=c(1,2))
xsumm = ozone.summer$Leeds.O3
ysumm = ozone.summer$Leeds.NO2
lim = c(0,100) # vital for comparison
plot(xsumm,ysumm,type='n',xlim=lim,ylim=lim) ; grid()
points(xsumm,ysumm,col='red',pch='.',cex=2)
xwint = ozone.winter$Leeds.O3
ywint = ozone.winter$Leeds.NO2
plot(xwint,ywint,type='n',xlim=lim,ylim=lim) ; grid()
points(xwint,ywint,col='blue',pch='.',cex=2)

# stretch the graphic

Draw conclusions.

Sol:

Ozone. Similar joint distributions, main body slightly differently located. No obvious relationship between x and y, perhaps winter (x=small,y) difference. Many outliers.

The sample correlation coefficient

Consider two rvs X and Y on which we have iid observations (x1, y1), . . . , (xn, yn). Let m(x) denote the sample mean of x = (xi; i = 1, . . . , n), and let s(x) denote the sample standard deviation of the (xi; i = 1, . . . , n). Similarly define m(y) and s(y). Standardised versions of xi and yi are

\[ \frac{x_i - m(x)}{s(x)} \quad \text{and} \quad \frac{y_i - m(y)}{s(y)}. \]

Definition: the sample correlation coefficient r(x, y) is the average of the product of these standardised values,

\[ r(x, y) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - m(x)}{s(x)} \right) \left( \frac{y_i - m(y)}{s(y)} \right). \]

n = 20

x = runif(n) ; y = runif(n)

cor(x,y)

mean( (x-mean(x))/sd(x) * (y-mean(y))/sd(y) )

# why are these different?

f = sqrt((n-1)/n)

mean( (x-mean(x))/(f*sd(x)) * (y-mean(y))/(f*sd(y)) )

Result: (The correlation coefficient is invariant to standardisation.) For given scalars a, b, c, d with a ≠ 0 and c ≠ 0, and the vector of ones 1 = (1, 1, . . . , 1),

r(ax + b1, cy + d1) = sign(ac) r(x, y).

Proof: See exercises.
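Before attempting the proof, the result can be checked numerically. The sketch below uses simulated data and arbitrarily chosen constants, one pair with ac > 0 and one with ac < 0.

```r
set.seed(4)
n = 50
x = runif(n); y = x + runif(n)         # simulated, positively related data

r      = cor(x, y)
r_same = cor(2 * x + 3,  5 * y - 1)    # a = 2, c = 5:  ac > 0
r_flip = cor(2 * x + 3, -5 * y + 1)    # a = 2, c = -5: ac < 0
c(r = r, r_same = r_same, r_flip = r_flip)
```

Changing the location and scale of either variable leaves the size of the correlation unchanged; only the sign of ac matters.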

Result: The correlation coefficient always satisfies −1 ≤ r(x, y) ≤ 1.

Proof: Because of the invariance of the correlation coefficient to standardisation, take x, y to have mean 0 and variance 1. Thus

\[ \sum_i x_i = 0 \quad \text{and} \quad \sum_i x_i^2 = n, \]

and similarly for y. Consider the quadratic form

\begin{align*}
Q &= \frac{1}{n} \sum_i (x_i + y_i)^2 \\
&= \frac{1}{n} \sum_i (x_i^2 + y_i^2 + 2 x_i y_i) \\
&= \frac{1}{n} \sum_i x_i^2 + \frac{1}{n} \sum_i y_i^2 + \frac{2}{n} \sum_i x_i y_i \\
&= 1 + 1 + 2 r(x, y).
\end{align*}

Now Q ≥ 0, so that 0 ≤ 2 + 2r, and r ≥ −1. Similarly start with Q = \frac{1}{n} \sum_i (x_i - y_i)^2 and find 0 ≤ 2 − 2r, so that r ≤ 1.
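The steps of the proof can be traced numerically. The sketch below standardises two simulated vectors with the divisor n, as in the definition of the sample variance, and checks both quadratic forms; the helper `standardise` is introduced only for this illustration.

```r
set.seed(5)
n = 20
x = runif(n); y = runif(n)

# Standardise to mean 0 and variance 1, using the divisor n
standardise = function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))
xs = standardise(x); ys = standardise(y)

r = mean(xs * ys)                  # average product of standardised values
Q_plus  = mean((xs + ys)^2)        # should equal 2 + 2r
Q_minus = mean((xs - ys)^2)        # should equal 2 - 2r

c(r = r, cor = cor(x, y))          # agrees with R's cor()
c(Q_plus = Q_plus, check = 2 + 2 * r)
c(Q_minus = Q_minus, check = 2 - 2 * r)
```

Both Q values are averages of squares and so cannot be negative, which is exactly what forces r to lie between −1 and 1.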

The sample correlation coefficient is a measure of linear association, or clustering around a line. Interpretation: r(x, y) = 0 means no linear association, r(x, y) < 0 means negative linear association, r(x, y) > 0 means positive linear association; when r(x, y) is near ±1 the association is strong.

Exercise 3.33 Use this code to generate data with r = 0.5, roughly.

par(mfrow=c(1,1))
n = 400
z = rnorm(n) # common component shared by x and y
x = z + rnorm(n); y = z + rnorm(n)
plot(x,y, type='p', pch='x')
cor(x,y) # .47

Use other relations of x and y to z to give plots with r = −0.5, r = 0.9, r = 0, roughly.

Sol:

x = z + rnorm(n) ; y = -z + rnorm(n)
plot(x,y, type='p', pch='x'); cor(x,y) # -.52
x = 3*z + rnorm(n); y = 3*z + rnorm(n)
plot(x,y, type='p', pch='x'); cor(x,y) # .91
x = rnorm(n) ; y = rnorm(n)
plot(x,y, type='p', pch='x'); cor(x,y) # .04

Exercise 3.34 The sample correlation coefficient is not appropriate for detecting non-linear association.

x = z + rnorm(n) ; y = z^2 + rnorm(n)
plot(x,y,type='p', pch='x'); cor(x,y) # -.04

Exercise 3.35 Ozone data. Calculate the sample correlation coefficients between O3 and NO2 for the ozone data. There are four groups, arising from the two levels of each of the two nominal variables, location and season.

xsc = ozone.summer$Leeds.O3 # summer in the city

ysc = ozone.summer$Leeds.NO2

xwc = ozone.winter$Leeds.O3 # winter

ywc = ozone.winter$Leeds.NO2

xsr = ozone.summer$Ladybower.O3 # rural

ysr = ozone.summer$Ladybower.NO2

xwr = ozone.winter$Ladybower.O3

ywr = ozone.winter$Ladybower.NO2

cor(xsc,ysc) ; cor(xsr,ysr)

cor(xwc,ywc) ; cor(xwr,ywr)

Collating the results gives

          Leeds city    Ladybower reservoir
Summer    0.10          0.25
Winter    -0.24         -0.48

What conclusions can you draw from these statistics?

Sol:

The correlations between O3 and NO2 are small, with only one being moderate. By comparison with the earlier figure, one might worry about outliers and/or non-linearity. The fear is that outliers may distort the value of the coefficient. plot(xwr,ywr,type='p') shows an association, but a non-linear one.

Exercise 3.36 The sample correlation coefficient is an estimate of the population correlation between X and Y, denoted corr(X, Y). While its definition is beyond math105, consider how one might start by arguing by analogy with the relation between E(X) and x̄.

Sol:

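Compare

\[ \bar{x} = \sum_i x_i \cdot \frac{1}{n} \quad \text{(weighted average), and} \]

\[ E(X) = \int_{x=-\infty}^{\infty} x \cdot f(x)\,dx \quad \text{(weighted average).} \]

Now taking the standardised variables,

\[ r = \sum_i x_i y_i \cdot \frac{1}{n}, \]

\[ E(XY) = \int_{x=-\infty}^{\infty} \text{??} \; xy \cdot \text{??} \; dx \, \text{??} \]

\[ E(XY) = \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} xy \cdot f(x, y)\,dx\,dy \quad \text{(conjecture).} \]

We need to define a joint pdf f(x, y).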

3.8 Chapter summary

An introduction to statistics and exploratory data analysis is developed in terms of uncertainty, decision making and data. The symbiotic theories of probability and of statistics are contrasted in terms of the before analysis and the after analysis of a probability experiment.

Data drives statistics, and sources of variation between and within data sets are described. One source of random variation is sampling and the concept of the simple random sample is introduced. Conceptual issues, such as the representative nature of the sample and the population, and methodological issues, such as how to define a large sample and other forms of sampling, are briefly discussed.

Given a data set the first step in statistics is to understand its context and subject it to an exploratory data analysis in order to understand its structure and variability. Data sets for waves, trees, and ozone are used as running examples throughout the chapter. The histogram is perhaps the most well known graphical method of eda, and is one way to portray distributions. We use it to construct an empirical estimate of the pmf or pdf of the rv under study. However, the empirical cdf is just as practically important and theoretically has pride of place. Summary statistics related to the data set are the well known sample mean, variance and standard deviation, and the lesser known sample quantiles. Boxplots, which are condensed summaries of the histogram, are based on given quantiles. The Chapter ends with the extension to bivariate relationships and the definition of the sample correlation coefficient.

Throughout, these statistical concepts are illustrated in the R language with special emphasis on calculation, plotting and simulation.

Contents

1 Introduction to R
1.1 The tutorial
1.2 Chapter summary

2 Continuous random variables
2.1 Review of probability
2.2 Continuous and discrete rvs
2.3 Expected values
2.4 Standard continuous distributions
2.5 Quantiles and the cdf
2.6 Chapter summary

3 Statistics and exploratory data analysis
3.1 Uncertainty
3.2 Exploratory data analysis
3.3 Examples with associated data sets
3.4 Graphical methods
3.5 Empirical cdf
3.6 Summary statistics
3.7 Bivariate relationships
3.8 Chapter summary
