
STAT270

Probability and Inference

Dr Austina S S Clark and Dr Ting Wang

University of Otago

February, 2020


In first year statistics courses, we emphasise the methods of statistics: which techniques and tests are applied in which situations. In this course, you will learn some of the theory and mathematics behind those methods. This is important because you will

• better understand where those standard methods come from, and why they are used,

• learn how to conduct analyses and design statistical methods for the many cases where the ‘standard’ toolbox is inadequate.

Modern statistics is a dynamic and rapidly changing subject. If you are going to keep up with the changes and advances in statistical theory and methodology you will need a good grounding in mathematical statistics and probability theory.

Paper details

The mathematical requirements for this course are kept at first year level, i.e. MATH 160. No previous knowledge of probability (beyond that in STAT 110 and 115) is assumed. There will be lots of examples and practice problems.

Potential students

Any student who has taken either of the 100-level statistics papers and MATH 160 can take this paper. It is particularly useful for those majoring in mathematics, statistics, economics, finance and quantitative analysis, psychology, zoology, or any other field in which statistics is used.

Prerequisites

MATH 160 and one of STAT 110, STAT 115, COMO 101, BSNS 112

Intended learning outcomes

On completion of your study of STAT270, you are expected to:

a) Have a good understanding of basic probability theory, discrete and continuous random variables, basic univariate and joint probability distributions, expectation, variance and covariance;

b) Understand the concept of the sampling distribution and models;

c) Be able to carry out frequentist inference including point estimators, evaluation of point estimators, confidence intervals, and hypothesis testing;

d) Know basic ideas behind Bayesian inference;

e) Be able to use word processing/LaTeX and the statistical program R.


Main topics

• Introduction to probability

• Random variables and distributions

• Expectation and variance

• Transformations of random variables

• Statistical models

• Estimators and likelihood

• Confidence intervals and hypothesis testing

• Bayesian inference

References

The following two texts are available as electronic e-books in the library:

• A Modern Introduction to Probability and Statistics: Understanding Why and How byDekking, Kraaikamp, Lopuhaa and Meester.

• Modern Mathematical Statistics with Applications by Devore and Berk.

Additional reading:

• Mathematical Statistics with Applications by Wackerly, Mendenhall and Scheaffer

• An Introduction to Mathematical Statistics and its Applications by Larsen and Marx

Teaching/learning approaches

Students learn by attending lectures. There are 10 weekly assignments which are required to be handed in and marked as a part of internal assessment. There are also weekly exercises available on the Departmental website for students to practice. Students get help during tutorials regarding questions arising from lecture notes, assignments and exercises. Tutorials are held in the computer laboratory. It is very important to try all the homework assignment questions by yourself or in groups before each tutorial.

Lecturers

Dr Austina S S Clark (Room 117, Science III; Extension 7757; [email protected])
Dr Ting Wang (Room 518, Science III; Extension 7773; [email protected])

Lectures

Monday, Tuesday and Wednesday at 9 am in Room 241, Science III


Tutorials

Wednesdays at 3-5pm in Laboratory B21, Science III

Expected workload

This is an 18 point paper. Students are expected in most weeks to spend 3 hours in lectures, 1 hour in tutorials and at least 8 hours in self-directed study such as reading, making summary notes, completing assignments etc.

Internal Assessment

There will be 10 assignments, each of which contributes 6% of the internal assessment mark, and a mid-term test that contributes the remaining 40%.

What do I do if I need more time?

Unless you make arrangements with the lecturer prior to the due dates (supported by a medical letter or an arrangement organized beforehand usually involving some special circumstance), late submission of assignments is subject to a 10% penalty per day.

Exam

There will be a three-hour exam.

Final mark

If

• E is the Exam mark

• A is the Assignments mark

• T is the Test mark

all expressed out of 100, then your final mark F in this paper will be calculated according to this formula:

F = (2E + 0.6A + 0.4T)/3

Plagiarism

Plagiarism (including being party to someone else’s plagiarism) is a form of academic misconduct (https://www.otago.ac.nz/study/academicintegrity/index.html).

Plagiarism is defined as:


• Copying or paraphrasing another person’s work and presenting it as your own

• Being party to someone else’s plagiarism by letting them copy your work or helping them to copy the work of someone else without acknowledgement

• Using your own work in another situation, such as for the assessment of a different paper or program, without indicating the source

For further details, see https://www.otago.ac.nz/study/academicintegrity/otago006307.html

Any student found to be responsible for plagiarism in any piece of work submitted for assessment shall be subject to the University’s academic misconduct regulations which may result in various penalties, including forfeiture of marks for the piece of work submitted, a zero grade for the paper or in extreme cases exclusion from the University.


Contents

1 Introduction
  1.1 Introduction
  1.2 About the course
  1.3 Sample spaces
  1.4 Events

2 Probability and conditional probability
  2.1 Probability functions
  2.2 Identities

3 Independence and the Prosecutor’s fallacy
  3.1 Independence
  3.2 The Prosecutor’s fallacy

4 Discrete random variables
  4.1 Random variables
  4.2 The probability distribution of a discrete random variable
  4.3 Expectation
  4.4 Binomial distribution
  4.5 Discrete uniform distribution
  4.6 The Geometric distribution
  4.7 Negative binomial distribution
  4.8 Poisson distribution

5 Joint distributions of random variables
  5.1 Joint distributions of random variables
  5.2 Variance
  5.3 Covariances

6 Continuous random variables
  6.1 Probability density functions
  6.2 Expectation and variance
  6.3 Uniform distribution
  6.4 Exponential distribution
  6.5 Normal distribution
  6.6 Transforming random variables
  6.7 Joint probability density functions
  6.8 Joint density for an independent sample
  6.9 Maxima of independent random variables
  6.10 Sums of random variables
  6.11 Expectation and variance for sums

7 Statistical models
  7.1 Probability VS Inference
  7.2 Review of related concepts for statistical models
  7.3 Random samples and statistical models
  7.4 Sampling distributions and central limit theorem
  7.5 More on sampling distribution
  7.6 Distribution features

8 Point Estimation
  8.1 Estimators
  8.2 The behavior of an estimator
  8.3 Unbiased estimators
    8.3.1 Definition of unbiased estimators
    8.3.2 An example
    8.3.3 Unbiased estimators for expectation and variance
  8.4 Variance of an estimator
  8.5 Mean squared error

9 Methods of Point Estimation
  9.1 The Method of Moments
  9.2 The likelihood function
  9.3 The maximum likelihood principle
  9.4 Maximum likelihood estimates
  9.5 MLE - Discrete distributions
  9.6 MLE - Continuous distributions

10 Confidence Intervals
  10.1 General principle
  10.2 Normal data
    10.2.1 Critical values
    10.2.2 Variance known
    10.2.3 Variance unknown
  10.3 Large samples
  10.4 A general method
  10.5 Determining the sample size
  10.6 Bootstrap confidence intervals

11 Hypothesis tests
  11.1 Null hypothesis and test statistic
  11.2 Tail probabilities
  11.3 Type I and type II errors
  11.4 Significance level
  11.5 Critical region and critical values
  11.6 Power
  11.7 Relation with confidence intervals

12 Bayesian Analysis
  12.1 Bayesian Statistics
  12.2 Bayes’ theorem
  12.3 Bayesian Estimation


Chapter 1

Introduction

1.1 Introduction

What is the probability of a major earthquake in Dunedin?

1.2 About the course

Randomness
Probability theory and statistics do something remarkable: they allow us to study and make use of random phenomena using mathematics built for deterministic phenomena.

• Random phenomena can have predictable behaviour over the long run.

• Models of randomness provide proxies for reality that we can work with.

This is what this course is fundamentally about.

Textbooks

Available as pdfs in the university library, or ordered from Springer.


A formal treatment of probability
The axiomatic approach to probability was an attempt, mainly by mathematicians, to get a formal understanding of what probability is. There are four main ingredients:

• Sample space (what could happen)

• Events (statements about what could happen)

• Probability (assigning numbers to events based on our beliefs about how likely they are)

• Random variables (more convenient ways to work with events)

1.3 Sample spaces

Sample spaces
Four experiments:

1. Tossing a coin.

2. In which month does a stranger’s birthday fall?

3. Three different emails, sent by three different people. Which order do they appear in?

4. The weather next Tuesday at 10am.

Sample spaces
A sample space is a set of outcomes of the experiment (or random phenomenon). It is often denoted by Ω (capital Omega).

1. Tossing a coin: Ω = {H, T}

2. Month a birthday lies in:

Ω = {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec}

3. Orders of emails: (numbered 1,2,3):

Ω = {123, 132, 213, 231, 312, 321}.

4. The weather next Tuesday:

Ω = {sunny, overcast, rainy, something else}


Notes:

• Given a < b we use (a, b) to denote the open interval

  (a, b) = {x : a < x < b},

  and [a, b] to denote the closed interval

  [a, b] = {x : a ≤ x ≤ b}.

• An ordering of a set of objects is called a permutation. In general, if there are n uniquely identifiable objects then there are

  n! = n · (n − 1) · · · · · 3 · 2 · 1

  ways to order them. We define 0! = 1.

• In many situations the order of a set of objects is unimportant. Thus 124 and 412 are the same combination of three objects from four objects labeled 1 to 4. In this case the number of combinations of n objects taken r at a time is

  C(n, r) = n!/(r!(n − r)!)

Example
Consider the problem of selecting two applicants at random for a job from a group of five, and imagine that the applicants vary in competence, labeling these from 1 (the best) to 5 (the worst).

(a) What is the probability that at least one of the two best applicants is chosen? To answer this we list the seven possible choices that give us at least one of the two best applicants: {1, 2}, {1, 3}, {1, 4}, {1, 5}, {2, 3}, {2, 4}, {2, 5}. We then list the other three possible choices: {3, 4}, {3, 5}, {4, 5}. All of the ten possible choices are equally likely, and therefore each has a probability of 0.1.

(b) Find the number of ways of selecting two applicants out of five and hence the total number of sample points for part (a).

(c) Find the number of ways of selecting exactly one of the two best applicants in a selection of two out of five. Then find the probability of that event. (A short R check of these counts is sketched below.)
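The counts above can be checked in R; this is only a sketch of one way to enumerate the selections (the combn() enumeration and object names are my own, not part of the notes).

# All ways of choosing 2 applicants from 5 (part (b)): a 2 x 10 matrix
choices <- combn(5, 2)
ncol(choices)                    # 10, i.e. choose(5, 2) sample points

# Selections with at least one / exactly one of the two best applicants (1 and 2)
at.least.one <- apply(choices, 2, function(s) any(s %in% c(1, 2)))
exactly.one  <- apply(choices, 2, function(s) sum(s %in% c(1, 2)) == 1)
mean(at.least.one)               # 7/10, as in part (a)
mean(exactly.one)                # 6/10, for part (c)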

1.4 Events

The next step in the formalism
When making statements about probability we are actually making statements about events. Technically, these are just subsets of the sample space.

For the birthday example, let L denote the set of long (31 day) months,

L = {Jan, Mar, May, Jul, Aug, Oct, Dec}.

The set L corresponds to the event that the birthday falls in a long month. If R is the event that corresponds to months having the letter ‘r’ in their full name then

R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}.


Operations on events
Since events are just subsets we can manipulate them just like sets:

• The intersection, L ∩ R, of two events L and R occurs when both occur; often we say “L and R”

• The union, L ∪ R, of two events L and R occurs when at least one occurs; often we say “L or R”

• The complement Lc of an event L occurs when L does not occur, that is, Lc = {ω ∈ Ω : ω ∉ L}. Sometimes we use L′ to denote Lc.

Example

L = {Jan, Mar, May, Jul, Aug, Oct, Dec}
R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}

L ∩ R = {Jan, Mar, Oct, Dec}
L ∪ R = {Jan, Feb, Mar, Apr, May, Jul, Aug, Sep, Oct, Nov, Dec}

Lc = {Feb, Apr, Jun, Sep, Nov}
(L ∩ R)c = {Feb, Apr, May, Jun, Jul, Aug, Sep, Nov}
(L ∪ R)c =

Ωc =

(Lc)c =

Venn diagrams for events

[Venn diagram of two overlapping events A and B]

De Morgan’s Laws: For events A,B

(A ∪B)c = Ac ∩Bc

(A ∩B)c = Ac ∪Bc

Mutually exclusive events
A and B are said to be mutually exclusive if A ∩ B = ∅. The symbol ∅ denotes the empty set. In words, the occurrence of A precludes the occurrence of B. In the language of sets, we say A and B are disjoint.


Quick exercise
Let A denote “There is an earthquake in Dunedin in the next 5 years”, and B denote the event “There is an earthquake in Christchurch in the next 10 years”. Translate (A ∪ B)c and Ac ∩ Bc into English, and check they are equivalent.


Chapter 2

Probability and conditional probability

2.1 Probability functions

Formal definition
Definition: A probability function on a finite sample space Ω assigns to each event A in Ω a number P (A) in [0, 1] such that

(i) P (Ω) = 1, and

(ii) P (A ∪B) = P (A) + P (B)− P (A ∩B) for two events A and B.

(iii) P (A ∪B) = P (A) + P (B) whenever A and B are disjoint (i.e. A ∩B = ∅).

Note that (iii) implies that if we have events A1, A2, A3 . . . that are all disjoint then

P (A1 ∪ A2 ∪ A3 ∪ · · · ) = P (A1) + P (A2) + P (A3) + · · ·

Examples

1. Coin toss. Standard function is P (H) = 1/2, P (T) = 1/2, though others possible.

2. Month. One possibility is to use

P (Jan) = · · · = P (Dec) = 1/12.

Does this seem reasonable?

3. Altitude: (we’ll develop machinery for this later).

4. Emails: P (123) = P (132) = P (213) = P (231) = P (312) = P (321) = 1/6. What is the probability that letter 3 lies in the middle of the pile? Define the event M = {132, 231} so that

P (M) = P (132) + P (231) = 1/6 + 1/6 = 1/3.

5. Weather: Who knows?! Maybe P (sunny) = 0.3, P (overcast) = 0.3, P (rainy) = 0.35, and P (something else) = 0.05.


Working with probabilities: first trick
First trick: break down events as much as you can, and then use the fact that P (A ∪ B) = P (A) + P (B) whenever A and B are disjoint.

Example: show that

P (A ∪ B) + P (A ∩ B) = P (A) + P (B).

Conditional probability

Are Australians getting worse?

• Let O denote the event an Australian goes overseas and let A denote the event that the Australian is arrested.

• Article observes that P (O ∩ A) has doubled!

• BUT: the proportion of Australians going overseas has also doubled.

• The proportion P (O ∩ A)/P (O) is unchanged. This is the conditional probability of A given O.

Conditional probability definition
Let A, B ⊆ Ω be two events. If P (B) > 0 then we define

P (A|B) = P (A ∩ B)/P (B).

If P (B) = 0 then P (A|B) is undefined. We say that P (A|B) is the conditional probability of A given B, or simply the probability of A given B.


Month problem
Recall the experiment from Tuesday about the month a stranger is born in. We have the state space

Ω = {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec},

all with equal probability, and we defined two events

L = {Jan, Mar, May, Jul, Aug, Oct, Dec},
R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}.

If we’re told that the stranger was born in an ‘r’ month, what is the probability that they were also born in a ‘long’ month?

Month problem restated

L = {Jan, Mar, May, Jul, Aug, Oct, Dec}            P (L) = 7/12
R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}       P (R) = 8/12
L ∩ R = {Jan, Mar, Oct, Dec}                       P (L ∩ R) = 4/12

So

P (L|R) = P (L ∩ R)/P (R) = (4/12)/(8/12) = 1/2.

A useful way to think about what is happening is this: if we’re told that the stranger was born in an ‘r’ month, then the sample space effectively becomes R.

Classic puzzle
Mr Smith has two children, at least one of which is a boy. What is the probability that they are both boys?

• Sample space: Ω = {(b, b), (b, g), (g, b), (g, g)}, each pair with probability 1/4.

• Event ‘at least one is a boy’ is L = {(b, b), (b, g), (g, b)}.

• Event ‘both boys’ is B = {(b, b)}.

• P (B ∩ L) = 1/4.

• P (L) = 3/4.

• P (B|L) = 1/3.
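A quick simulation, not part of the notes (the names and number of runs are my own), agrees with the 1/3:

set.seed(1)
n <- 100000
older   <- sample(c("b", "g"), n, replace = TRUE)   # sex of first child
younger <- sample(c("b", "g"), n, replace = TRUE)   # sex of second child
at.least.one.boy <- older == "b" | younger == "b"
both.boys        <- older == "b" & younger == "b"
sum(both.boys) / sum(at.least.one.boy)              # estimates P(B|L), close to 1/3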


Puzzle
Recall the coins problem: I have three coins. One has tails on both sides, another has heads on both sides, the other has heads on one side and tails on the other. I draw a coin at random and look at one side. It is heads. What is the probability that the other side is also heads?

2.2 Identities

Multiplication Rule
From the definition

P (A|B) = P (A ∩ B)/P (B)

for P (B) > 0 we immediately get the multiplication rule

P (A ∩ B) = P (A|B)P (B).

We’ll now see some extensions of this.

Law of total probability
From Venn diagrams, we see that the sets A ∩ B and A ∩ Bc are disjoint and have union A. Hence

P (A) = P (A ∩ B) + P (A ∩ Bc)

Using the multiplication rule twice gives

P (A) = P (A|B)P (B) + P (A|Bc)P (Bc).

In fact, if B1, B2, . . . , Bk are disjoint events such that

B1 ∪B2 ∪ · · · ∪Bk = Ω

then

P (A) = P (A|B1)P (B1) + P (A|B2)P (B2) + · · · + P (A|Bk)P (Bk)

which is so useful it has a name.

Bayes’ rule
This one doesn’t seem worth a title, but it is really fundamental. Since

P (A ∩B) = P (A|B)P (B)

and

P (A ∩B) = P (B|A)P (A)

we get

P (B|A) = P (A|B)P (B)/P (A).


Chapter 3

Independence and the Prosecutor’s fallacy

Lewis Carroll’s puzzle
A bag contains either a black bean or a white bean [with equal chance]. A white bean is put into the bag. When a bean is drawn at random, it is white. What is the probability that the other bean is white?

3.1 Independence

Independence
The concept of independence is central to a lot of statistics (and a lot of the statistical methods you have already learnt).

Statisticians talk of an independent sample, where we assume that the value for one element in the sample has no effect on the other values.

To formalise this idea, we start by talking about independence of events (and extend this to random variables later).

Definition of independence
We say that event A is independent of event B if the probability of A occurring is unaffected by whether B occurs. Formally:

P (A|B) = P (A)

By plugging in the formula for P (A|B) and rearranging, this gives

P (A ∩B) = P (A)P (B).

In fact, these expressions are equivalent.

Independence
The following are all equivalent for two events A, B ⊆ Ω:

• A and B are independent,

• P (A ∩B) = P (A)P (B)


• P (A|B) = P (A)

• P (B|A) = P (B).

Example
Suppose a fair die is thrown twice. Let A denote the event that the two dice sum to 4 and let B denote the event that at least one of the dice is a 3. Are A and B independent?

We have Ω = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)}. Then

A = {(1, 3), (2, 2), (3, 1)}
B = {(1, 3), (2, 3), . . . , (6, 3), (3, 1), (3, 2), (3, 4), (3, 5), (3, 6)}
A ∩ B = {(1, 3), (3, 1)}.

So P (A) = 3/36, P (B) = 11/36 and P (A ∩ B) = 2/36.

P (A|B) = (2/36)/(11/36) ≠ P (A)

so A and B are not independent.

Three events
When are three events A, B, C independent? We need

P (A ∩B ∩ C) = P (A)P (B)P (C)

as well as pairwise independence

P (A ∩B) = P (A)P (B)

P (B ∩ C) = P (B)P (C)

P (A ∩ C) = P (A)P (C)

Can you see how this would extend to four or more events?

Tests for rare diseases
You have already been introduced to hypothesis tests, and we will be coming back to these several times in the semester.

Example (Example 2.30 in Devore and Berk).

The sensitivity of a medical test is the probability of a positive test given that the patient has the condition. The specificity of a test is the probability of a negative test, given that the patient doesn’t have the condition.

• An article in the October 29, 2010 New York Times reported that the sensitivity and specificity for a new DNA test for colon cancer were 86% and 93%, respectively.

• The PSA test for prostate cancer has sensitivity 85% and specificity about 30%.

• The mammogram for breast cancer has sensitivity 75% and specificity 92%.

QUESTION: Suppose that you tested positive. What is the probability that you have the disease or condition?


Tests for colon cancer
Define the events

‘A’ = event that patient has the condition.  ‘B’ = event that test is positive.

Sensitivity of 86% means P (B|A) = 0.86. Specificity of 93% means P (Bc|Ac) = 0.93.

The rate of colon cancer is (roughly) 45 per 100,000 (ignoring gender differences), so P (A) = 0.00045. We really want P (A|B).

Probability trees
At times it can be useful to organize conditional probabilities into trees.

[Probability tree for the colon cancer example:]

A, with P (A) = 0.00045
    B,  with P (B|A) = 0.86:          P (A ∩ B) = 0.00045 × 0.86
    Bc, with 1 − P (B|A) = 0.14:      P (A ∩ Bc) = .....
Ac, with 1 − P (A) = 0.99955
    B,  with 1 − P (Bc|Ac) = 0.07:    P (Ac ∩ B) = ...
    Bc, with P (Bc|Ac) = 0.93:        P (Ac ∩ Bc) =

Probability of condition
From the diagram we have

P (A ∩ B) = 0.00045 × 0.86 = 0.000387
P (A ∩ Bc) = 0.00045 × 0.14 = 0.000063
P (Ac ∩ B) = 0.99955 × 0.07 = 0.0699685
P (Ac ∩ Bc) = 0.99955 × 0.93 = 0.92958

Hence

P (B) = P (A ∩ B) + P (Ac ∩ B) = 0.0704

P (A|B) = P (A ∩ B)/P (B) = 0.0055

Still unlikely, but far more likely than for a random person.
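The same numbers can be reproduced in R (a sketch with my own variable names; only the three inputs from the slide are used):

p.A  <- 0.00045            # P(A): prevalence of the condition
sens <- 0.86               # P(B|A): sensitivity
spec <- 0.93               # P(Bc|Ac): specificity
p.B  <- sens * p.A + (1 - spec) * (1 - p.A)   # law of total probability
p.B                        # about 0.0704
sens * p.A / p.B           # P(A|B) by Bayes' rule, about 0.0055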


Some more identities
In the example we used the identity

P (Bc|A) = 1− P (B|A).

Is it valid?

P (Bc|A) = P (A ∩ Bc)/P (A)

P (B|A) = P (A ∩ B)/P (A)

P (B|A) + P (Bc|A) = (P (A ∩ Bc) + P (A ∩ B))/P (A) = P (A)/P (A) = 1.

Other tools
We could also have used Bayes’ rule:

P (A|B) = P (B|A)P (A)/P (B)

and the law of total probability

P (B) = P (B|A)P (A) + P (B|Ac)P (Ac).

Reversing conditional probability
In this example note that

P (B|A) = 0.86

(the probability of a positive test when the patient is sick), but

P (A|B) = 0.0055

(the probability the patient is sick given a positive test). These numbers are different.

In general P (A|B) ≠ P (B|A). The exact connection is, of course,

P (A|B) = P (B|A)P (A)/P (B)

so there is equality only when P (A) = P (B).

Gremlins

A = “I hear lots of bumping up in the attic”

B = “There are gremlins bowling in the attic”


3.2 The Prosecutor’s fallacy

The Prosecutor’s fallacy

A = “Mr Smith is innocent”

B = “Mr Smith’s DNA matches the DNA at the scene”

It may well be that P (B|A) is very small, but that does not imply that P (A|B) is also very small! This is called the prosecutor’s fallacy.

Exploring the prosecutor’s fallacy
Suppose that the police demonstrate that, if Mr Smith is indeed innocent, there is only a 1/1,000,000 chance that his DNA could have matched that found at the crime scene. They argue that there is only, at most, a 0.0001% chance that Mr Smith could be innocent. However, Mr Smith was identified by searching a database of 1.5 million records. The probability that a randomly selected person from this database is innocent is then 1 − 1/1,500,000.

We further assume that if Mr Smith was not innocent then P (B|Ac) = 1.

Mr Smith
We have

A = “Mr Smith is innocent”

B = “Mr Smith’s DNA matches the DNA at the scene”

P (A) = 1− 1/1500000

P (B|A) = 0.000001

P (B|Ac) = 1

and we want to find P (A|B).

Solution
Rather than draw up a probability tree, we just apply Bayes’ rule and the law of total probability:

P (B) = P (B|A)P (A) + P (B|Ac)P (Ac)
      = ...
      = 0.000001667

P (A|B) = P (B|A)P (A)/P (B)
        = ...
        = 0.6

There is still a 60% chance that Mr Smith is innocent.
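Filling in the dots numerically (a sketch; the object names are mine):

p.A    <- 1 - 1/1500000    # P(A): a randomly selected database entry is innocent
p.BgA  <- 1e-6             # P(B|A)
p.BgAc <- 1                # P(B|Ac)
p.B    <- p.BgA * p.A + p.BgAc * (1 - p.A)    # law of total probability
p.B                        # about 1.667e-06
p.BgA * p.A / p.B          # P(A|B) by Bayes' rule, about 0.6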


Chapter 4

Discrete random variables

Random variable
A random variable is a thing which takes on values randomly.

Random variable
A random variable is a function from the sample space Ω to the real numbers R.

4.1 Random variables

Example
Let Ω = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)} be the sample space for two rolls of a die.

Let S denote the sum of the two rolls. Then S is a random variable.
Let M denote the maximum of the two rolls. Then M is a random variable.

Note: it is customary to use capital letters for random variables and small letters for the values they can take.

Random variables as events
The statement S = 5 corresponds to an event (sometimes denoted {S = 5}). It is the event

{(1, 4), (2, 3), (3, 2), (4, 1)}.

We therefore can talk of the probability P (S = 5), which in this case equals 4/36.

In the same way, M = 3 corresponds to the event

{(1, 3), (2, 3), (3, 3), (3, 1), (3, 2)}

and so P (M = 3) = 5/36.

Discrete random variables
A random variable is a function from the sample space Ω to the real numbers R. A discrete random variable is a function from the sample space Ω to the real numbers R that takes on a finite or countable range of values.

For example, S can take on only one of 11 possible values, 2, 3, 4, . . . , 12. The random variable M can take on only one of 6 possible values, 1, 2, 3, 4, 5, 6.


What is countable?
A set is countable if we can list its members as a1, a2, a3, . . ., even if that list is infinite.

• Any set with a finite number of elements is countable.

• The set of numbers 1, 2, 3, . . . is countable.

• The set of integers . . . ,−3,−2,−1, 0, 1, 2, 3, . . . is countable, since we can list them as

0, 1,−1, 2,−2, 3,−3, 4,−4, . . .

• The set of real numbers is not countable, and neither is any interval on the real line.

4.2 The probability distribution of a discrete random variable

Probability mass function
The probability mass function (or pmf) of a discrete random variable X is the function p from R to [0, 1], defined by

p(a) = P (X = a)

for all real numbers a. We sometimes write pX(a) to make it clear which variable we are referring to.

We can tabulate these values for M :

a      1     2     3     4     5     6
p(a)  1/36  3/36  5/36  7/36  9/36  11/36

where p(a) = 0 for all other values of a.

Properties of the pmf
Suppose that A is a random variable which takes on values a1, a2, a3, . . .. Let p be its pmf.

Then

1. p(a) ≥ 0 for all a.

2. p(a1) + p(a2) + p(a3) + · · · = 1

Distribution function
The cumulative distribution function (cdf) F for a random variable X is defined by

F (a) = P (X ≤ a)

for all real numbers a.

This is not so useful for discrete random variables, but it is extremely important for continuous random variables (i.e. non-discrete variables). We’ll come back to this later.

Note: sometimes F is called the distribution function.


Exercise
Let Z denote the random variable equal to the number of sixes thrown in two (independent) throws of a die. List the outcomes which Z maps to the number 1. What is the probability mass function for Z?

Suppose that a gambler wins $20 if the roll is two sixes and zero otherwise. Let W be the amount won. What is the probability mass function for W?

Sigma notation

Sigma notation

∑_{i=1}^{n} x_i

is used as a shorthand for x1 + x2 + x3 + · · · + xn.

Note:

∑_{i=1}^{n} (x_i + y_i) = ∑_{i=1}^{n} x_i + ∑_{i=1}^{n} y_i

and

∑_{i=1}^{n} k x_i = k ∑_{i=1}^{n} x_i.

See http://www.stat.auckland.ac.nz/mathtutor/v1/index.html and the practice problems on the webpage.

4.3 Expectation

Expectation

• The expectation and variance of a random variable are two summary quantities for random variables.

• Roughly, the expectation describes the location or centre, while the variance describes the spread.

Definition of expectation
If X is a discrete random variable taking values a1, a2, . . . with pmf p then the expectation of X is defined

E[X] = ∑_i a_i P (X = a_i) = ∑_i a_i p(a_i).

The variance is defined

Var(X) = ∑_i (a_i − E[X])^2 p(a_i).

Expectation, mean and average

• Expectation (also called mean) and variance are properties of a random variable.

• You will have met sample mean and sample variance as properties of a sample.

• Under general conditions, the sample mean and sample variance will converge to the random variable expectation and variance.


Example
The number of fives in three rolls of a fair die has the pmf

a        0        1        2       3
p(a)  125/216  75/216  15/216  1/216

Find the expectation and variance. (1/2 and 5/12)
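The stated answers can be checked in R directly from the pmf (a sketch; the vectors are mine):

a <- 0:3
p <- c(125, 75, 15, 1) / 216      # pmf of the number of fives in three rolls
EX <- sum(a * p)                  # expectation, 1/2
sum((a - EX)^2 * p)               # variance, 5/12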

Functions of random variables
Suppose that X is a random variable and g is a function. Then Y = g(X) is also a random variable. What is pY (y)?

The sum of pX(x) over all x such that g(x) = y. In sigma notation:

pY (y) = ∑_{x : g(x) = y} pX(x).

Expectation of a function
Suppose that X is a discrete random variable taking on values a1, a2, a3, . . .. Let pX be its pmf. Recall that the expectation of X is defined

E[X] = ∑_i a_i pX(a_i).

If g is a function then we say that the expectation of g(X) is

E[g(X)] = ∑_i g(a_i) pX(a_i).

We can get at this another way. Let Y be the random variable Y = g(X). We could compute the pmf of Y (that is, pY ) and then its expectation

E[Y ] = ∑_y y pY (y).

We’d get the same value as E[g(X)]. (Exercise: why?)

Example
Let M be the maximum of two dice, and let X = (M − 3)^2. Then M has the pmf

a       1     2     3     4     5     6
pM(a)  1/36  3/36  5/36  7/36  9/36  11/36
X       4     1     0     1     4     9

and X has the pmf

x       0      1      4      9
pX(x)  5/36  10/36  10/36  11/36

Compute E[X] two ways.
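The two routes give the same value in R (a sketch; the vectors and the helper g are mine):

a  <- 1:6
pM <- c(1, 3, 5, 7, 9, 11) / 36     # pmf of M
g  <- function(m) (m - 3)^2

sum(g(a) * pM)                       # Way 1: E[g(M)] directly from pM
pX <- tapply(pM, g(a), sum)          # Way 2: first build the pmf of X = g(M)
sum(as.numeric(names(pX)) * pX)      # then take its expectation; same answer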


Fascinating facts about expectations

1. If a and b are constants then E[aX + b] = aE[X] + b.

2. If X and Y are any two random variables then E[X + Y ] = E[X] + E[Y ].

Independence
We’ve already defined independence of events:

P (A ∩B) = P (A)P (B).

Independence of random variables works the same way. X and Y are independent if

P ((X = x) ∩ (Y = y)) = P (X = x)P (Y = y)

for all values x and y.

Notes

• Usually we write P (X = x, Y = y) instead of P ((X = x) ∩ (Y = y)).

• This extends to three or more variables as you’d expect: X, Y, Z are independent if

P (X = x, Y = y, Z = z) = P (X = x)P (Y = y)P (Z = z)

for all x, y, z, and if P (X = x, Y = y) = P (X = x)P (Y = y), P (Y = y, Z = z) = P (Y = y)P (Z = z), P (X = x, Z = z) = P (X = x)P (Z = z).

• For simplicity, we often write pXY (x, y) for P (X = x, Y = y), or even p(x, y) if the context is clear.

4.4 Binomial distribution

Bernoulli distribution
A random variable X has a Bernoulli distribution with parameter p if the state space of X is {0, 1} and P (X = 1) = p.

e.g. If X = 1 when the coin flip is heads and X = 0 otherwise, then X has a Bernoulli distribution with parameter p = 1/2.

Exercise: Show E(X) = p and Var(X) = p(1 − p).

Binomial distribution
A random variable X has a binomial distribution with parameters n and p if it has pmf

P (X = k) = C(n, k) p^k (1 − p)^(n−k)

for k = 0, 1, 2, . . . , n.


The name comes from the connection with the binomial formula:

(x + y)^n = ∑_{k=0}^{n} C(n, k) x^k y^(n−k).

Note that

C(n, k) = n!/(k!(n − k)!) = [n × (n − 1) × · · · × 3 × 2 × 1] / [(k × (k − 1) × · · · × 2 × 1)((n − k) × (n − k − 1) × · · · × 2 × 1)]

is the number of subsets of size k in a set of size n (we’ll come back to this).

[Figure: the binomial distribution when n = 10 and p = 0.4 — plot of P (X = k) against k.]

Binomial distributions for a mathematician
Q: Is it a valid pmf? Plug in x = p and y = 1 − p into the binomial formula and we get

1 = (p + 1 − p)^n = ∑_{k=0}^{n} C(n, k) p^k (1 − p)^(n−k).

Q: What is the expectation? We want

∑_{k=0}^{n} k P (X = k) = ∑_{k=0}^{n} k C(n, k) p^k (1 − p)^(n−k).

We could use differentiation, or....


Binomial and repeated trials
If X1, X2, . . . , Xn are independent Bernoulli random variables with parameter p then

X = X1 + X2 + · · · + Xn

is binomial. (This is sometimes used as a definition of the binomial distribution.)

The formulation makes it easy to compute expectations:

E[X] = E[X1] + E[X2] + · · ·+ E[Xn] = p+ p+ · · ·+ p = np.

As we will see later, the variance of X is given by

Var(X) = np(1− p).

Rederiving the Binomial
The number of subsets of size k of a set of size n is denoted C(n, k), which is said ‘n choose k’.

For example, if you consider all possible strings of 10 0’s or 1’s:

0000000000 0000000010 0000000011 ... 1111111111

then exactly C(n, k) of these have k 1’s and (n − k) 0’s.

Binomial distribution
We have independent random variables X1, X2, . . . , Xn, each with Bernoulli distribution with parameter p. We want the probability for X = k where X = X1 + X2 + · · · + Xn.

The state space is Ω = {(0, 0, . . . , 0), . . . , (1, 1, . . . , 1)} and the event X = k is the set of outcomes with exactly k ones. There are C(n, k) of these outcomes; each has probability p^k (1 − p)^(n−k). Hence

P (X = k) = C(n, k) p^k (1 − p)^(n−k).

Coke Zero vs Diet Coke
There were 15 participants.

In experiment 1, subjects were offered two glasses both containing Coke Zero. Eight people said both samples were identical. Of the seven that said they were different, six preferred the first.

In experiment 2, subjects were offered a glass containing Diet Coke and another containing Coke Zero. Nine people said both samples were identical. Of the six that said they were different, there was equal preference for the two drinks.

Tests
In these experiments, are there any of those counts which should have a binomial distribution?


Extremeness
Of the seven people who said they could distinguish the two Coke Zero samples, six said they preferred the first. Could this arise by chance? What is the probability of getting something as ‘extreme’ as this?

Suppose X has a binomial distribution with parameters n = 7 and p = 0.5. We want

P (X ≤ 1) + P (X ≥ 6) = 2P (X ≤ 1) = 2(P (X = 1) + P (X = 0))

Evaluating probabilities
In the not that distant past, students were forced to use probability tables. However, you’re so much better off using R.

To compute entries in the pmf:

> dbinom(1,size=7,prob=0.5) [computes P(X = 1)]

To compute cumulative probabilities

> 2*pbinom(1,size=7,prob=0.5) [computes 2*P(X <= 1)]

0.125

Note the explicit use of argument names when I call the functions.

4.5 Discrete uniform distribution

Uniform distribution
A discrete random variable X taking on a finite set of values a1, a2, . . . , an has a discrete uniform distribution if

pX(a_i) = 1/n for all values a_i.

Counting probabilities
Many (textbook) problems in probability boil down to discrete uniform distributions with counting.

Examples

1. If I draw a card out of a standard deck, what is the probability it is a picture card (KQJ)?

2. What is the probability of a full house in poker (three of one value and two of another)?

3. What is the probability that three randomly selected letters will spell a valid three letter word?

Counting probabilities
These are all thinly-disguised counting problems (see MATH 170, MATH 272). You need to count

1. the total number of possibilities to compute the uniform distribution probabilities, and then

2. the total number of possibilities which give the desired outcome.

There are tricks for both.


Probability of a picture card
The number of cards in a standard deck is 52, so each has probability 1/52.

The number of picture cards is 4 × 3 = 12.

The probability of drawing a picture card is therefore 12/52.

Full house
Each hand is a subset (unordered) of size 5, and the number of these is C(52, 5).

For each ordered pair of values (e.g. 3,2 or A,5 or 5,A) there are C(4, 3) = 4 ways of choosing the triple and C(4, 2) = 6 ways of choosing the double. That makes 24 hands which give a full house for that pair of values.

To count the number of pairs we note there are 13 ways of choosing the first value, leaving 12 ways of choosing the second. That makes 156 ways of choosing the pair.

The number of full houses is therefore 156 × 24 and the probability of a full house is

156 × 24 / C(52, 5) = 0.00144.

(This calculation is checked in R below.)
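The same arithmetic with choose() in R (a sketch):

pairs    <- 13 * 12                       # ordered (triple value, pair value) choices
per.pair <- choose(4, 3) * choose(4, 2)   # 24 hands for each such choice
pairs * per.pair / choose(52, 5)          # about 0.00144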

Counting words
When picking letters, there are 26 choices for the first letter, 26 choices for the second and 26 choices for the third. The number of potential words is 26^3. According to the infallible internet, there are 918 or 1155 valid 3-letter English words (at least for Scrabble freaks). So the probability we want is

918/26^3 = 0.052  or  1155/26^3 = 0.066.

Some general principles

• Often useful to break the set into disjoint blocks of the same size (e.g. full house example).

• The number of pairs (a, b) where a ∈ A and b ∈ B equals |A||B|.

• The number of subsets of size k in a set A is C(|A|, k). The number of subsets of any size is 2^|A|.

• There are k! = k(k − 1) · · · 2 · 1 ways of ordering k distinct objects.

4.6 The Geometric distribution

4.6 The geometric distribution — If at first you don’t succeed, try, try again

I’m a really bad ice hockey player. Suppose that every game I play, I have a 0.01 probability of scoring a goal.

1. What is the probability that I score a goal in the next game?

2. What is the probability that I first score a goal in the 10th game I play? In the 100th game?

3. If I haven’t scored a goal in the first 100 games I play, what is the probability that I first score a goal in the next game?


Geometric distribution
A discrete random variable X has a geometric distribution with parameter p, where 0 < p ≤ 1, if its pmf is given by

pX(k) = P (X = k) = (1 − p)^(k−1) p   for k = 1, 2, . . ..

A geometric random variable has expectation

E[X] = ∑_{k=1}^{∞} k (1 − p)^(k−1) p = 1/p

and variance

Var(X) = (1 − p)/p^2.

Geometric distribution derived
If there is a probability of success of p, then

• the probability of success in the first move is p

• the probability of failure in the first move and success in the second is (1 − p)p

• the probability of failure in the first two moves and success in the third is (1 − p)^2 p

• the probability of failure for k − 1 moves and success in the kth is (1 − p)^(k−1) p.

By the way,

p + (1 − p)p + (1 − p)^2 p + (1 − p)^3 p + · · · = 1

by the geometric series, which some might have met in MATH 170.

Something very confusing
Some people say that X has a geometric distribution if it equals the move/trial on which there is the first success. Hence the pmf is

pX(k) = (1 − p)^(k−1) p.

Dekking et al. follow this custom. For these people k = 1, 2, 3, . . ..

Alternatively, people say that Y has a geometric distribution if it equals the number of failures before the success. Hence for these people the pmf is

pY (k) = (1 − p)^k p.

Devore and Berk follow this custom, as do the programmers of the R package. In this case, k = 0, 1, 2, . . ..

Note that since Y = X − 1 we have E[Y ] = E[X] − 1 = 1/p − 1.
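Since dgeom() in R uses the ‘number of failures’ convention, the pmf of X above corresponds to dgeom(k − 1, p). A quick sketch (p = 0.3 is an arbitrary choice of mine):

p <- 0.3
k <- 1:5
(1 - p)^(k - 1) * p            # pmf of X (trial of the first success)
dgeom(k - 1, prob = p)         # the same values from R's failures-before-success pmf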


[Figure: the Geometric distribution when p = 0.4 — plot of P (Y = k) against k.]

Some useful properties of the Geometric distribution
What is the probability of no successes in the first n trials?

• The probability of failure in the first trial is (1 − p), in the second is (1 − p), and so on.

• The probability of failure in all the first n trials is then (1 − p)^n.

The answer is (1 − p)^n.

Some useful properties of the Geometric distribution
If X has a geometric distribution, what is P (X > n)? The event X > n is the same as the event

No successes in the first n trials

So, P (X > n) = (1 − p)^n.

Cumulative Distribution Function
What is P (X ≤ n)? Easy: 1 − P (X > n), or 1 − (1 − p)^n.

Memoryless property
What is P (X = n + k | X > n)? Recall from conditional probabilities:

P (A|B) = P (A ∩ B)/P (B).


Therefore

P (X = n + k | X > n) = P (X = n + k ∩ X > n)/P (X > n)
                      = P (X = n + k)/P (X > n)
                      = (1 − p)^(n+k−1) p / (1 − p)^n
                      = (1 − p)^(k−1) p = P (X = k)

This is a geometric random variable! The fact that the result is independent of n means the conditional probability distribution of the number of additional trials does not depend on how many failures there have been already.
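A small simulation illustrating the memoryless property (a sketch; p, n, k and the object names are my own choices):

set.seed(2)
p <- 0.2; n <- 5; k <- 3
x <- rgeom(200000, prob = p) + 1     # simulate X, the trial of the first success
mean(x[x > n] == n + k)              # estimate of P(X = n + k | X > n)
mean(x == k)                         # estimate of P(X = k)
dgeom(k - 1, prob = p)               # exact P(X = k) = (1 - p)^(k-1) * p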

What is the difference between the binomial distribution and the geometric distribution?
The binomial distribution and geometric distribution are both related to sequences of Bernoulli trials. What is the difference?

The binomial distribution describes the number of successes in a fixed number of trials.
The geometric distribution describes the number of trials needed before there is a single success.

Which distribution for X?
Suppose there are 1500 students enrolled in an introductory psychology course. The term papers are graded by a team of teaching assistants; however, a sample of 10 papers is selected at random and checked by the professor. Experience indicates that around 1% of the papers are improperly graded. X is the number of papers in the sample that are improperly graded.

Which distribution for X?
During peak periods, 20% of the calls made to an information service cannot be completed. Each call has the same probability of being completed, independent of any other calls made. A certain caller has decided to keep dialing until she is connected. X is the number of calls she has to make.

Which distribution for X?
An experiment is performed whereby 20 men are presented with 20 computer generated pictures of babies. The pictures are created by morphing an image of each man with the image of an unrelated baby. The men are asked which baby they would like to adopt. Assume that each man is just as likely to choose any of the images. X is the number that choose babies formed by morphing their own face.

STAT 261 Exam 2012
Suppose that I roll a die until I get a 6, then I roll it until I get a 5, then I roll it until I get a 4, and then until a 3, a 2 and finally a 1. Let X be the total number of rolls that I make. Determine E[X] and Var(X).

(For now, just worry about E[X])
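One way to check your answer for E[X] is by simulation (a sketch, entirely my own code; work out the exact value from the geometric expectations first):

set.seed(3)
one.run <- function() {
  total <- 0
  for (target in 6:1) {            # roll for a 6, then a 5, ..., then a 1
    repeat {
      total <- total + 1
      if (sample(1:6, 1) == target) break
    }
  }
  total
}
mean(replicate(20000, one.run()))  # Monte Carlo estimate of E[X]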


4.7 Negative binomial distribution


There is a natural generalization of the Geometric distribution where you ask how many successes you have before there are r failures. (NB for the Geometric distribution we ask how many failures there are before we have 1 success.)

For there to be x successes we need

• Exactly r failures, and x − 1 successes, in the first x + r − 1 trials.

• The (x + r)-th trial to be a success.

This first probability is a binomial probability: C(x + r − 1, x − 1) (1 − p)^r p^(x−1).

The second probability equals p. Hence the probability of having x successes is

P (X = x) = C(x + r − 1, x − 1) (1 − p)^r p^x.

This is called the negative binomial. Its expectation and variance for the total number of trials, Y, are:

E[Y ] = x/p

and

Var(Y ) = x(1 − p)/p^2.

Example
A geological study indicates that an exploratory oil well drilled in a particular region should strike oil with probability 0.2. Find the probability that the third oil strike comes on the fifth well drilled.
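This can be computed from the formula or with dnbinom(), remembering that R parameterises by the number of failures (here 2 dry wells) before the required number of successes (here 3 strikes). A sketch:

p <- 0.2
choose(4, 2) * p^3 * (1 - p)^2     # third strike on the fifth well: 0.03072
dnbinom(2, size = 3, prob = p)     # same value from R's negative binomial pmf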

[Figure: the negative binomial distribution when r = 3 and p = 0.4 — plot of P (X = x) against x.]


4.8 Poisson distribution


The Poisson distribution often provides a good model for the probability distribution of the number X of rare events which occur independently, at random, at a constant mean rate, over space or time. It gives the probability of observing x events in a given area of space or a given interval of space or time, where λ is the average value of X. Two examples of random variables with Poisson distributions are the number of radioactive particles that decay in a particular time period, and the number of errors a typist makes in typing a page.

A discrete random variable X taking on values 0, 1, 2, 3, . . . is said to have a Poisson distribution with parameter λ if

P (X = k) = e^(−λ) λ^k / k!.

In this case E[X] = λ. We will come back to this in Chapter 6.
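In R the Poisson pmf is dpois(); a quick sketch with λ = 3.5 (matching the figure below):

lambda <- 3.5
dpois(2, lambda)                   # P(X = 2) = exp(-lambda) * lambda^2 / 2!
k <- 0:30
sum(k * dpois(k, lambda))          # approximately lambda, illustrating E[X] = lambda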

[Figure: the Poisson distribution when λ = 3.5 — plot of P (X = k) against k.]


Chapter 5

Joint distributions of random variables

5.1 Joint distributions of random variables

Discrete random variables
The joint probability mass function p of two discrete random variables X and Y is the function defined by

p(a, b) = P (X = a, Y = b) = P (X = a ∩ Y = b).

Remember we sometimes write pXY instead of just p when we need to be more precise.

The joint probability mass function p of multiple discrete random variables X1, X2, . . . , Xk is defined

p(a1, a2, . . . , ak) = P (X1 = a1, X2 = a2, . . . , Xk = ak).

Rolling dice
Suppose we roll two dice X, Y. The probabilities of the outcomes can be summarised in the following table:

            Y
        1     2     3     4     5     6
X  1  1/36  1/36  1/36  1/36  1/36  1/36
   2  1/36  1/36  1/36  1/36  1/36  1/36
   3  1/36  1/36  1/36  1/36  1/36  1/36
   4  1/36  1/36  1/36  1/36  1/36  1/36
   5  1/36  1/36  1/36  1/36  1/36  1/36
   6  1/36  1/36  1/36  1/36  1/36  1/36

Max and Min
Now let S equal the smaller of the two dice rolls, and L the larger.


            L
        1     2     3     4     5     6
S  1  1/36  2/36  2/36  2/36  2/36  2/36
   2    0   1/36  2/36  2/36  2/36  2/36
   3    0     0   1/36  2/36  2/36  2/36
   4    0     0     0   1/36  2/36  2/36
   5    0     0     0     0   1/36  2/36
   6    0     0     0     0     0   1/36

Marginal probabilities
The marginal probabilities for one random variable are found by summing over the values for the other variable:

            L
        1     2     3     4     5     6
S  1  1/36  2/36  2/36  2/36  2/36  2/36 | 11/36
   2    0   1/36  2/36  2/36  2/36  2/36 |  9/36
   3    0     0   1/36  2/36  2/36  2/36 |  7/36
   4    0     0     0   1/36  2/36  2/36 |  5/36
   5    0     0     0     0   1/36  2/36 |  3/36
   6    0     0     0     0     0   1/36 |  1/36
      1/36  3/36  5/36  7/36  9/36 11/36 |    1

Marginal properties
In this example, by summing along the rows we are obtaining the pmf for S. Hence these marginal probabilities sum to one. In the same way, the column sums sum to one.

Marginal probabilities again
In equations: if p is the joint pmf of X and Y then

pX(a) = ∑_b p(a, b)

and

pY (b) = ∑_a p(a, b).

In this way we can also compute conditional probabilities:

P (X = x|Y = y) = p(x, y)/pY (y) = p(x, y)/∑_a p(a, y).

Marginal probabilities don’t tell us everything
Note that the marginal probabilities do not determine the joint probabilities. Consider the following two random variables.


           X
          0          1
Y  0   1/4 + α    1/4 − α  | 1/2
   1   1/4 − α    1/4 + α  | 1/2
         1/2        1/2    |  1

By choosing different values for α we change the joint distribution without changing the marginal distributions.

Independence
The concept of independence tries to capture the idea that two random variables are unrelated: how one plays out doesn’t affect how the other plays out. Important because

• Used to study (and identify) how variables are related

• Mathematical convenience: it turns out to be way easier to compute probabilities of independent random variables.

Recall: Independent events
Recall: two events A and B are independent when any of the following hold:

• P (A|B) = P (A)

• P (B|A) = P (B)

• P (A ∩B) = P (A)P (B)

This last version is the one we will use for independence of random variables.

An aside
Independence ≠ disjoint.

Two events are disjoint if they can’t occur at the same time (i.e. A ∩ B = ∅).

Two events are independent if the fact that one occurs doesn’t affect the probability that the other occurs.

Independence for discrete random variables
Two discrete random variables X and Y are independent if

P (X = a, Y = b) = P (X = a)P (Y = b)

for all a, b. In terms of joint pmfs, this is

pXY (a, b) = pX(a)pY (b).

In this case we can determine all the joint distribution just using the marginals.


Height and siblings
Consider a randomly chosen person from the 2012 STAT 261 class.

• Let H be 1 if the person is at least 170cm tall, and 0 otherwise.

• Let S be 1 if the person has at least two siblings, and 0 otherwise.

Using the data obtained from the 2012 STAT 261 class, we have probability estimates of

        S = 0   S = 1
H = 0    0.25    0.35  | 0.6
H = 1    0.15    0.25  | 0.4
         0.4     0.6

Are S and H independent?
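One way to check is to compare each cell with the product of its marginals; a sketch in R (the matrix is just the table above):

joint <- matrix(c(0.25, 0.15, 0.35, 0.25), nrow = 2,
                dimnames = list(H = c("0", "1"), S = c("0", "1")))
pH <- rowSums(joint)        # marginal pmf of H
pS <- colSums(joint)        # marginal pmf of S
outer(pH, pS)               # joint pmf that independence would require
joint - outer(pH, pS)       # any nonzero entry means S and H are dependent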

Association studies
Let W = 1 if an individual has dry ear wax. Let M = 1 if an individual has AA at position rs17822931 in their genome. From studies (of Japanese) we have estimated joint probabilities

        M = 0   M = 1
W = 0    0.29    0.13  | 0.42
W = 1    0.01    0.57  | 0.58
         0.3     0.7

(Nature Genetics 38, 324–330.)

Are these variables independent?

Multiple discrete random variables
Discrete random variables X1, X2, . . . , Xn are independent if

P (X1 = a1, . . . , Xn = an) = P (X1 = a1) · · · · · P (Xn = an)

for all a1, . . . , an, and a similar result holds for all subsets of the random variables.

Equivalently, X1, X2, . . . , Xn are independent if

P (X1 ≤ a1, . . . , Xn ≤ an) = P (X1 ≤ a1) · · · · · P (Xn ≤ an)

for all a1, . . . , an, and a similar result holds for all subsets of the random variables.

Note: variables can be pairwise independent, but not triple-wise independent, etc.

5.2 Variance

Variance
Let µ = E[X]. The variance of X is defined

∑_i (x_i − µ)^2 p(x_i),

that is, E[(X − µ)^2].


Hence

Var(X) = ∑_i (x_i − µ)^2 p(x_i)
       = ∑_i (x_i^2 − 2µx_i + µ^2) p(x_i)
       = ∑_i x_i^2 p(x_i) − 2µ ∑_i x_i p(x_i) + µ^2 ∑_i p(x_i)
       = E[X^2] − 2µ^2 + µ^2.

We obtain the rule

Var(X) = E[X^2] − µ^2 = E[X^2] − (E[X])^2.

Identities for free, well almost
What is Var(aX)? We have

Var(aX) = E[(aX)^2] − (E[aX])^2
        = E[a^2 X^2] − (aE[X])^2
        = a^2 E[X^2] − a^2 (E[X])^2.

General rule:

Var(aX) = a^2 Var(X).

Exercise
Use the same approach to prove the rule

Var(X + b) = Var(X).

Multiple variables
Here is a classic high school problem. Let X and Y be random variables equal to two rolls of a die. What is the pmf of Z = X + Y?

Joint pmf of X and Y:

            Y
        1     2     3     4     5     6
X  1  1/36  1/36  1/36  1/36  1/36  1/36
   2  1/36  1/36  1/36  1/36  1/36  1/36
   3  1/36  1/36  1/36  1/36  1/36  1/36
   4  1/36  1/36  1/36  1/36  1/36  1/36
   5  1/36  1/36  1/36  1/36  1/36  1/36
   6  1/36  1/36  1/36  1/36  1/36  1/36

Values of X + Y:

            Y
        1   2   3   4   5   6
X  1    2   3   4   5   6   7
   2    3   4   5   6   7   8
   3    4   5   6   7   8   9
   4    5   6   7   8   9  10
   5    6   7   8   9  10  11
   6    7   8   9  10  11  12


General principle
Suppose that Z = X + Y. Then P (Z = z) is the sum of p(x, y) over all pairs x, y such that x + y = z. Hence

P (Z = z) = ∑_x p(x, z − x) = ∑_y p(z − y, y).

Dice example again
Let S be the smaller of two dice rolls and let L be the larger. What is the pmf of Z = S + L?

Joint pmf of S and L:

            L
        1     2     3     4     5     6
S  1  1/36  2/36  2/36  2/36  2/36  2/36
   2    0   1/36  2/36  2/36  2/36  2/36
   3    0     0   1/36  2/36  2/36  2/36
   4    0     0     0   1/36  2/36  2/36
   5    0     0     0     0   1/36  2/36
   6    0     0     0     0     0   1/36

Values of S + L:

            L
        1   2   3   4   5   6
S  1    2   3   4   5   6   7
   2    3   4   5   6   7   8
   3    4   5   6   7   8   9
   4    5   6   7   8   9  10
   5    6   7   8   9  10  11
   6    7   8   9  10  11  12

Adding independent variables
Suppose we want the pmf of Z = X + Y where X and Y are independent. Then p(x, y) = pX(x)pY (y) and so our formula becomes

P (Z = z) = ∑_x pX(x) pY (z − x) = ∑_y pX(z − y) pY (y).

Example: adding binomials
Let X be Bin(n1, p) and let Y be Bin(n2, p). Find the pmf for Z = X + Y when X and Y are independent. Using the addition formula we have

P (Z = c) = ∑_{b=0}^{n2} pX(c − b) pY (b)
          = ∑_{b=0}^{n2} [C(n1, c − b) p^(c−b) (1 − p)^(n1−(c−b))] [C(n2, b) p^b (1 − p)^(n2−b)]
          = ∑_{b=0}^{n2} C(n1, c − b) C(n2, b) p^c (1 − p)^(n1+n2−c)
          ...
          = C(n1 + n2, c) p^c (1 − p)^(n1+n2−c)

Actually there is a much easier way to get this result: summing the two binomials is like summing n1 + n2 Bernoulli trials, so the sum is binomial.

From this we conclude that adding two binomials with the same p parameter gives a binomial!
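The convolution identity can be checked numerically in R (a sketch; n1, n2, p and c are arbitrary values of mine):

n1 <- 4; n2 <- 6; p <- 0.3; c.val <- 5
b <- 0:n2
sum(dbinom(c.val - b, size = n1, prob = p) * dbinom(b, size = n2, prob = p))
dbinom(c.val, size = n1 + n2, prob = p)    # same value: Bin(n1 + n2, p) at c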


Review of sums
Suppose that X, Y have joint pmf p(x, y), and that Z = X + Y. What is P (Z = z)?

• The sum of p(x, y) over all x, y such that x+ y = z.

• The sum of p(x, z − x) over all x.

• The sum of p(z − y, y) over all y.

P (Z = z) = ∑_{x,y : x+y=z} p(x, y) = ∑_x p(x, z − x) = ∑_y p(z − y, y)

Expectations of sums
Expectations of sums are easy:

E[X + Y ] = E[X] + E[Y ].

Proving E[X + Y ] = E[X] + E[Y ]

E[X + Y ] = ∑_x ∑_y (x + y) p(x, y)
          = ∑_x ∑_y x p(x, y) + ∑_x ∑_y y p(x, y)
          = ∑_x x ∑_y p(x, y) + ∑_y y ∑_x p(x, y)
          = ∑_x x pX(x) + ∑_y y pY (y)
          = E[X] + E[Y ]

Variance of a sum

Var(X + Y) = Var(X) + Var(Y) + 2cov(X, Y)

where

cov(X, Y) = E[XY] − E[X]E[Y].

If µX = E[X] and µY = E[Y] then

cov(X, Y) = E[(X − µX)(Y − µY)].

Proving variance of sums
Suppose that Z = X + Y.

Var(Z) = E[Z²] − (E[Z])²
       = E[(X + Y)²] − (E[X] + E[Y])²
       = E[X² + 2XY + Y²] − ((E[X])² + 2E[X]E[Y] + (E[Y])²)
       = E[X²] − (E[X])² + E[Y²] − (E[Y])² + 2E[XY] − 2E[X]E[Y]
       = Var(X) + Var(Y) + 2cov(X, Y),

where cov(X, Y) = E[XY] − E[X]E[Y].

5.3 Covariances

Covariances
It can be shown that:

• For all X, Y, Z, cov(X + Y, Z) = cov(X, Z) + cov(Y, Z).

• If X, Y are independent then cov(X, Y) = 0. But not vice versa!

• cov(X, X) = Var(X).

• cov(aX, X) = a cov(X, X) = a Var(X).

Example
Suppose that X and Y are independent, and that Z = X + Y. What is cov(Z, X)?

cov(Z, X) = cov(X + Y, X)
          = cov(X, X) + cov(Y, X)
          = Var(X) + 0.

Products of random variables?
The covariance formula has an E[XY] term in it. What is the product of two random variables? Suppose that X, Y have joint pmf p(x, y), and that Z = XY. What is P(Z = z)?

• The sum of p(x, y) over all x, y such that xy = z.

• The sum of p(x, z/x) over all x.

• The sum of p(z/y, y) over all y.

P(Z = z) = Σ_{x,y: xy=z} p(x, y) = Σ_x p(x, z/x) = Σ_y p(z/y, y)

Little example
Consider variables A, B with the following joint pmf.

          B
A         0     1
0       0.3   0.2
1       0.1   0.4

What is E[A + B]? What is cov(A, B)? What is Var(A + B)?

Computing covariances

P (A = 1) =

P (B = 1) =

E[A] =

E[B] =

Var(A) =

Var(B) =

P (A = 1 ∩B = 1) =

E[AB] =

cov(A,B) =

Var(A+B) =
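One way to check your hand calculations is to encode the joint pmf in R and take the expectations directly. This is only a sketch for checking, not the official solutions.

p <- matrix(c(0.3, 0.2, 0.1, 0.4), nrow = 2, byrow = TRUE,
            dimnames = list(A = 0:1, B = 0:1))
a <- c(0, 1); b <- c(0, 1)
EA  <- sum(a * rowSums(p)); EB <- sum(b * colSums(p))
EAB <- sum(outer(a, b) * p)                      # E[AB]
covAB <- EAB - EA * EB
varA <- sum(a^2 * rowSums(p)) - EA^2
varB <- sum(b^2 * colSums(p)) - EB^2
c(E_AplusB = EA + EB, cov_AB = covAB, Var_AplusB = varA + varB + 2 * covAB)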


Chapter 6

Continuous random variables

Beyond discrete random variables

• A discrete random variable can be completely specified by the probabilities of each value it can take on (i.e. the probability mass function).

• There is a whole class of random variables for which this doesn't make sense: those describing continuous phenomena, e.g.

  – Height of the person sitting on the far left of the back row in this lecture.

  – Time it takes to read this sentence.

  – Magnitude of the next Auckland earthquake.

  – Time until we next see the sun in Dunedin.

• For these, we've developed a formalism which uses continuous functions and calculus to describe and manipulate event probabilities.

6.1 Probability density functions

Probability density functions
A random variable X is continuous if there is some continuous function f : R → R such that

• f is always non-negative;

• the area under f equals 1;

• for any a, b such that a ≤ b, P(a ≤ X ≤ b) is the area under f between a and b.

The function f is called the probability density function (pdf) of X.

Probability density functions: take two
A random variable X is continuous if there is some continuous function f : R → R such that

• f(x) ≥ 0 for all x;

• ∫_{−∞}^{∞} f(x) dx = 1;

• for any a, b such that a ≤ b,

  P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

The function f is called the probability density function (pdf) of X.

Endpoints don’t countIf X is a continuous random variable then

P (a ≤ X ≤ b) = P (a < X ≤ b) = P (a ≤ X < b) = P (a < X < b)

since the area under single points is zero.

Cumulative distribution function
Often it's easier to think in terms of the cdf

F(b) = P(X ≤ b)

which, for a continuous random variable, means

F(b) = ∫_{−∞}^{b} f(x) dx.

From calculus, this means that

f(x) = d/dx F(x)

(provided f is itself continuous).

Connections between f and F
Suppose that X has pdf f and cdf F.

1. What is P (X ≤ a) in terms of f?

2. What is P (X < a) in terms of f?

3. What is P (X ≤ a) in terms of F?

4. What is P (a ≤ X ≤ b) in terms of F?

5. What is P (X ≥ a) in terms of F?


6.2 Expectation and variance

Definition of expectation
If X is a discrete random variable taking values a1, a2, . . . with p.m.f. p then the expectation of X is defined as

E[X] = Σ_i a_i P(X = a_i) = Σ_i a_i p(a_i).

If X is a continuous random variable with density f then

E[X] = ∫_{−∞}^{∞} x f(x) dx.

Note: in some situations these sums and integrals do not exist.

Change of variable formula
Let X be a random variable, and let g be a function from R to R.

• If X is discrete then

  E[g(X)] = Σ_i g(a_i) P(X = a_i).

• If X is continuous then

  E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.
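The continuous formula can be evaluated numerically with integrate() in R. A small sketch, with an arbitrary choice of distribution and g: take X ~ Exp(rate = 2) and g(x) = x².

g <- function(x) x^2
integrate(function(x) g(x) * dexp(x, rate = 2), lower = 0, upper = Inf)$value
# compare with a simulation estimate of E[g(X)]; both are close to 0.5
mean(g(rexp(100000, rate = 2)))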

Variance
The variance measures the spread of a random variable: the average squared distance from the mean. Let µ = E[X]. Then

Var(X) = E[(X − µ)²]

or equivalently

Var(X) = E[X²] − µ².

Hence if X is continuous with density f and expectation µ,

Var(X) = ∫ (x − µ)² f(x) dx = ∫ x² f(x) dx − µ².

6.3 Uniform distribution

The uniform distribution
A random variable X with pdf f has a uniform distribution on the interval [a, b] if f(x) = 0 for all x < a and x > b, and f(x) = 1/(b − a) for all a ≤ x ≤ b.

[Figure: rectangular pdfs of uniform distributions on several intervals; the height of each rectangle is 1/(b − a).]

A variable that is uniform on [0, 1] has the standard uniform distribution.

More on uniforms
If X is uniform on [a, b] then F is 0 up to a, rises linearly from 0 to 1 between a and b, and equals 1 thereafter.

Suppose that X has the standard uniform distribution.

• What is P (X < 1/3)?

• What is P (X < −1/3)?

• What is P (X < 0.5)?

What is the expectation of a uniform?
Suppose that X is uniform on [a, b]. We'd expect the expectation to be in the middle of [a, b]. If we use integration:

E[X] = ∫_a^b x · 1/(b − a) dx = [x²/(2(b − a))]_a^b = (a + b)/2.

Variance of a uniform
Suppose that X is uniform on [a, b]. What is E[X²]?

E[X²] = (a² + ab + b²)/3.

And so:

Var(X) = E[X²] − (E[X])²
       = ...
       = (b − a)²/12.
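A quick simulation sanity check of this formula (a sketch; a = 2 and b = 5 are arbitrary choices):

a <- 2; b <- 5
x <- runif(100000, min = a, max = b)
c(simulated = var(x), theory = (b - a)^2 / 12)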


6.4 Exponential distribution

Waiting time distribution
The exponential distribution is widely used to model waiting times:

• The time between substitutions in evolutionary genetics

• Time before a particle decays

• Time until failure of a piece of equipment

• Time back to most recent common ancestor, in population genetics

But there is a strong implicit assumption that the rate of failure/mutation/decay is constant. For many problems this is unrealistic.

Exponential distribution
A continuous random variable has an exponential distribution with parameter λ if its pdf is given by f(x) = 0 for x < 0 and

f(x) = λe^{−λx} for x ≥ 0.

The cdf is given by

F(a) = ∫_{−∞}^{a} f(x) dx = 0 + ∫_0^a λe^{−λx} dx = 1 − e^{−λa}.

[Figure: the exponential pdf f(x) = λe^{−λx} with λ = 0.5, plotted for 0 ≤ x ≤ 10.]

Expectation and variance
Using the formulas for expectation and variance and integration by parts, we have

E[X] = ∫_0^∞ x λe^{−λx} dx = 1/λ

and

Var(X) = ∫_0^∞ (x − 1/λ)² λe^{−λx} dx = 1/λ².

This means we can interpret the parameter λ: it is the reciprocal of the mean.

When will the next major earthquake strike Dunedin?
The DCC estimates that a major earthquake (sufficient to collapse buildings) will strike every 3135 years (note the precision). The simplest model for earthquakes is that the waiting time between earthquakes has an exponential distribution (ask Ting about more realistic models). To get an expectation of 3135 years, we use λ = 1/3135.

An earthquake in the next 100 years
What is the probability of a major earthquake in the next 100 years? With λ = 1/3135, we have

P(X < 100) = 1 − e^{−λ×100} = 0.031.

(With heaps of caveats.)
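The same calculation can be done in R with the exponential cdf function (just a check of the number above):

pexp(100, rate = 1/3135)     # 1 - exp(-100/3135), approximately 0.031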

Alarm clocks
Suppose that X is exponential with parameter α and Y is (independently) exponential with parameter β. What is the distribution of Z = min(X, Y)?

Since Z = min(X, Y) we have

P(Z ≥ z) = P(X ≥ z) P(Y ≥ z).

From above, we have

P(X ≥ z) = e^{−αz}   and   P(Y ≥ z) = e^{−βz}.

Hence

P(Z ≥ z) = e^{−(α+β)z}.

That is, Z has cdf equal to

F(z) = 1 − P(Z ≥ z) = 1 − e^{−(α+β)z}.

If X, Y are independent exponentials with parameters α, β respectively, then

Z = min(X, Y)

is exponential with parameter α + β.
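A simulation check of the minimum-of-exponentials result (a sketch; α = 0.5 and β = 1.5 are arbitrary choices):

alpha <- 0.5; beta <- 1.5
z <- pmin(rexp(100000, alpha), rexp(100000, beta))
c(simulated_mean = mean(z), theory_mean = 1 / (alpha + beta))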


The joys of Jafa-land

Jafa disaster
The Auckland city council states that there have been 50 major volcanic eruptions in the past 100,000 years, starting with one underneath the University of Auckland (NZ's number one university, apparently).

Assuming that these also have exponential waiting times, how long will it be before there is either an earthquake in Dunedin or a volcanic eruption in Auckland? What is the probability that the volcanic eruption occurs first?

Random variables
Let X be the time to the Dunedin quake. We assume X has an exponential distribution with parameter α = 1/3135 ≈ 0.0003. Let Y be the time to the Auckland volcano. We assume an expected waiting time of 100000/50, so Y has parameter β = 50/100000 = 0.0005.

When?
Define Z = min(X, Y) so that

P(Z ≥ z) = P(X ≥ z) P(Y ≥ z).

From above, we have P(X ≥ z) = e^{−αz} and P(Y ≥ z) = e^{−βz}. Hence

P(Z ≥ z) = e^{−(α+β)z},

so Z has cdf

F(z) = 1 − P(Z ≥ z) = 1 − e^{−(α+β)z}.

When?
If X, Y are independent exponentials with parameters α, β respectively, then Z = min(X, Y) is exponential with parameter α + β. In this case, α + β = 0.0008, so E[Z] = 1/0.0008 = 1250 years.

The probability of one of these events in the next 100 years is 1 − e^{−100(α+β)} = 0.079.

Which one first?
Later we will show that if X and Y are independent exponential random variables with parameters α and β respectively, then

P(X < Y) = α/(α + β).

Hence the probability P(Y < X) of there being an Auckland volcano before a Dunedin earthquake is

β/(α + β) = 0.0005/(0.0003 + 0.0005) = 0.625.

Memoryless property
Suppose b > 0. What is P(X > a + b | X > a)?

P(X > a + b | X > a) = P(X > a + b, X > a) / P(X > a)
                     = P(X > a + b) / P(X > a)
                     = e^{−λ(a+b)} / e^{−λa}
                     = e^{−λb}.

Just like the geometric distribution, the exponential distribution is memoryless. The conditional distribution of the further waiting time does not depend on how long you have waited already.

Memoryless property

• The probability of an event/failure/mutation etc. at a given time does not depend on what happened before.

• Hence even if we have already waited a long time for the event, the remaining waiting time still has the same exponential distribution.

• Let B be the time to wait for a bus. The probability of a bus arriving in the next 20 minutes given that it hasn't come in the past 30 minutes is

  P(B < 30 + 20 | B > 30).

• By playing around with exponential cdfs we have

  P(B < 30 + 20 | B > 30) = F(20)

  if F is the cdf of the exponential distribution with the appropriate λ.
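A numerical check of memorylessness in R (a sketch; the rate λ = 0.1 per minute is an arbitrary choice):

lam <- 0.1
cond <- (pexp(50, lam) - pexp(30, lam)) / (1 - pexp(30, lam))   # P(B < 50 | B > 30)
c(conditional = cond, F_of_20 = pexp(20, lam))                  # the two values agree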


One final property

• Suppose that the waiting time between buses is exponential with parameter λ.

• The number of buses arriving in an hour is then a random variable.

• Turns out that the random variable has a Poisson distribution, with parameter λ.

• This has pmf

  P(X = k) = e^{−λ} λ^k / k!

  with E[X] = Var(X) = λ.

Advanced: Geometric to exponential
Consider a model for the time to failure of some machine. Suppose that we divide time up into segments of length 1/n and assume that the machine fails during any given segment with probability p = λ/n.

Let Y denote the number of segments until failure, so by the geometric distribution

P(Y = y) = (1 − p)^{y−1} p = (1 − λ/n)^{y−1} (λ/n),

and

P(Y > y) = (1 − p)^y = (1 − λ/n)^y.

Let T denote the time until failure.

• Each segment takes time 1/n.

• tn segments take time t.

P(T > t) = P(Y > tn) = (1 − λ/n)^{tn}.

As n goes to infinity, (1 − λ/n)^{tn} goes to e^{−λt}. Hence

P(T > t) = e^{−λt},   P(T < t) = 1 − e^{−λt}.

This is exactly the cdf of the exponential distribution.
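The limit is easy to see numerically; a small sketch (λ = 2 and t = 1 are arbitrary choices):

lam <- 2; t <- 1
n <- c(10, 100, 1000, 10000)
rbind(n = n, geometric = (1 - lam/n)^(t * n), limit = exp(-lam * t))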

Advanced: the Poisson process
Let X1, X2, X3, . . . be a sequence of random variables equal to the times of events (e.g. times of earthquakes in Dunedin). We say that X1, X2, . . . is a Poisson process if X1 and the waiting times X_{i+1} − X_i are independent, exponential random variables.

If X1, X2, . . . is a Poisson process with rate λ, then the number of events occurring within some interval [0, T] has a Poisson distribution with parameter λT.

6.5 Normal distribution

Normal distributions
A continuous random variable X has a normal distribution with parameters µ and σ² if its pdf f is given by

f(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²}.

Unfortunately, there is no simple closed-form expression for the cdf F(x). We use N(µ, σ²) as shorthand for this distribution. Using advanced calculus, it can be shown that if X is N(µ, σ²) then

E[X] = µ,   Var(X) = σ².

Sometimes we specify the standard deviation √Var(X) = σ instead of σ².

Standard normal
We say that X has a standard normal distribution if X is N(0, 1). In this case,

f(x) = (1/√(2π)) e^{−x²/2},

E[X] = 0 and Var(X) = 1.

Making sense of the pdf

[Figure: the quadratic −(1/2)((x−µ)/σ)² (left) and its exponential e^{−(1/2)((x−µ)/σ)²} (right), showing how the normal pdf gets its bell shape.]

6.6 Transforming random variables

Taking stock
You have now seen some of the most important named distributions:

Discrete              Continuous
Bernoulli             Uniform
Binomial              Exponential
Uniform               Normal
Geometric             Gamma (see later)
Negative Binomial
Poisson

Manipulating random variables
A common problem in statistics is how to "manipulate", transform or combine random variables. Typical problems include:

• Adding a constant and scaling

• General function of a random variable

• Summing random variables

• The maximum or minimum of a set of random variables

• Expectation and variance of a sum.

Manipulating random variables on a computer
R has standard functions for computing the pdf and cdf of random variables (avoiding the need to look up tables), as well as for generating random values with a given distribution. For a normal distribution with mean 2 and s.d. 3 we use

dnorm(5, mean = 2, sd = 3)

to find the value of the density at x = 5;

pnorm(5,mean = 2, sd = 3)

to find the cdf P (X ≤ 5) and

x <- rnorm(10000, mean = 2, sd = 3)

to fill x with 10000 random values from this distribution.

[Figure: histogram of the 10000 simulated N(2, 3²) values stored in x.]

Exploring distributions with simulation
The ability to simulate gives us a powerful tool for exploring random variables. To plot a histogram using the values just produced we use

hist(x, breaks = 20)

Similar commands exist for discrete distributions, except that in that case there is only a small set of possible values, so it's better to plot the counts exactly:

y <- table(x);

plot(y);

Scaling random variables
Suppose that X is a continuous random variable with pdf f. Define Y = rX + s. What is the pdf of Y? For example, if X is N(1, 4), what distribution does Y = 3X − 1 have?

x <- rnorm(10000,mean = 1,sd = sqrt(4))

y <- 3*x-1

hist(y,breaks=30);

[Figure: histogram of the simulated values of y = 3x − 1.]

Manipulating the cdf
Recall the definition of the cdf:

FX(x) = P (X ≤ x).


We have

FY(y) = P(Y ≤ y)
      = P(3X − 1 ≤ y)
      = P(X ≤ (y + 1)/3)
      = FX((y + 1)/3).

From cdf to pdf
Differentiate to get the pdf:

fY(y) = d/dy FY(y) = F'X((y + 1)/3) · (1/3) = fX((y + 1)/3) · (1/3).

Since X is N(1, 4), we have

fX(x) = (1/(2√(2π))) e^{−(1/2)((x−1)/2)²},

and so

fY(y) = fX((y + 1)/3) · (1/3)
      = ...
      = (1/(6√(2π))) e^{−(1/2)(((y+1)/3−1)/2)²}
      = (1/(6√(2π))) e^{−(1/2)((y−2)/6)²}.

Main result
If X is a continuous random variable with pdf fX and cdf FX, and r > 0, then

Y = rX + s

is a continuous random variable with pdf

fY(y) = (1/r) fX((y − s)/r)

and cdf

FY(y) = FX((y − s)/r).

Question: what happens when r < 0?

Chain rule
The result makes use of the chain rule from calculus. We have that

FY(y) = FX(g(y))

for some function g; in this case g(y) = (y − s)/r. To find the derivative of FY with respect to y we use the rule

F'Y(y) = F'X(g(y)) g'(y).


Uniform random variables
Suppose X is uniform on [a, b]. Then

fX(x) = 1/(b − a) for a ≤ x ≤ b,   and 0 otherwise.

Hence when r > 0, the random variable Y = rX + s has pdf

fY(y) = (1/r) · 1/(b − a) for a ≤ (y − s)/r ≤ b,   and 0 otherwise.

Rearranging:

fY(y) = 1/(r(b − a)) for ra + s ≤ y ≤ rb + s,   and 0 otherwise.

Transforming uniforms
If X is uniform on [c, d] and r > 0 then Y = rX + s is uniform on [rc + s, rd + s]. If U is standard uniform (i.e. on [0, 1]) then (b − a)U + a is uniform on [a, b].

Transforming exponentials
If X is exponential with parameter λ and r > 0 then the pdf of Y = rX is given by

fY(y) = (1/r) fX(y/r) = (1/r) λe^{−λy/r} = (λ/r) e^{−(λ/r)y}.

If X is exponential with parameter λ then rX is exponential with parameter λ/r.

Transforming normals
If X is N(µ, σ²) then Y = rX + s is N(rµ + s, r²σ²). Note that this works even when r ≤ 0. We haven't proven this, but it is not difficult.

Reciprocals
Taking the reciprocal is an example of a more general transformation of a random variable. If X has pdf fX and cdf FX, and P(X > 0) = 1, what are the cdf and pdf of Y = 1/X?

We have, for y > 0,

FY(y) = P(Y ≤ y) = P(1/X ≤ y).

In this case the assumption X > 0 makes life easier: this event holds exactly when X ≥ 1/y. Hence

FY(y) = P(X ≥ 1/y) = 1 − FX(1/y).

Differentiating, we obtain

fY(y) = fX(1/y) · (1/y²).


Reciprocal of the exponential
Suppose X is exponential with parameter λ. What is the pdf of Y = 1/X?

Answer:

fY(y) = fX(1/y) · (1/y²) = λe^{−λ/y} / y²,   y > 0.
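The corresponding cdf is FY(y) = 1 − FX(1/y) = e^{−λ/y}, and this is easy to check by simulation. A sketch (λ = 1 is an arbitrary choice):

lam <- 1
y <- 1 / rexp(100000, rate = lam)
pts <- c(0.5, 1, 2, 5)
rbind(simulated = sapply(pts, function(t) mean(y <= t)),
      theory    = exp(-lam / pts))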

6.7 Joint probability density functions

Joint pmfs
Recall what we previously covered for discrete random variables. The joint probability mass function p of two discrete random variables X and Y is the function defined by

p(a, b) = P(X = a, Y = b) = P(X = a ∩ Y = b).

We sometimes write pXY instead of just p.

            L
S       1     2     3     4     5     6
1     1/36  2/36  2/36  2/36  2/36  2/36
2       0   1/36  2/36  2/36  2/36  2/36
3       0     0   1/36  2/36  2/36  2/36
4       0     0     0   1/36  2/36  2/36
5       0     0     0     0   1/36  2/36
6       0     0     0     0     0   1/36

Facts about the joint pmf:

• p(a, b) ≥ 0 for all a, b.

• Σ_a Σ_b p(a, b) = 1.

• Marginal probabilities:

  pX(a) = Σ_b p(a, b)   and   pY(b) = Σ_a p(a, b).

• X and Y are independent if p(a, b) = pX(a)pY(b) for all a, b.

Continuous random variables

A probability density function (pdf) f : R → R satisfies:

• f(x) ≥ 0 for all x;

• the area under f equals 1 (that is, ∫ f(x) dx = 1);

• for any a ≤ b, P(a ≤ X ≤ b) = ∫_a^b f(x) dx is the area under f between a and b;

• the cdf is given by F(a) = ∫_{−∞}^{a} f(x) dx.


Joint probability density function

[Figure (from Dekking et al., Fig. 9.2): a bivariate normal probability density function.]

An example of a joint density function:

f(x, y) = (2/75)(2x²y + xy²) for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2,

and f(x, y) = 0 otherwise.

[Figure (from Dekking et al., Fig. 9.3): the probability density function f(x, y) = (2/75)(2x²y + xy²).]

The joint probability density function for two continuous random variables X, Y is a function f assigning a real value f(x, y) to every x ∈ R and y ∈ R. It satisfies:

• f(x, y) ≥ 0 for all x, y.

• The volume under f equals one. Using multi-dimensional integrals, this is written

  ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

From joint densities to probabilities
Recall that in one dimension

P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

This extends to two dimensions. We have

P((a ≤ X ≤ b) ∩ (c ≤ Y ≤ d)) = ∫_a^b ∫_c^d f(x, y) dy dx.

This is the volume above the rectangle bounded by [a, b] in the x direction and [c, d] in the y direction.

[Figure (from Dekking et al., Fig. 9.1): volume under a joint probability density function f on the rectangle [−0.5, 1] × [−1.5, 1].]

Definition. Random variables X and Y have a joint continuous distribution if for some function f : R² → R and for all numbers a1, a2 and b1, b2 with a1 ≤ b1 and a2 ≤ b2,

P(a1 ≤ X ≤ b1, a2 ≤ Y ≤ b2) = ∫_{a1}^{b1} ∫_{a2}^{b2} f(x, y) dx dy.

The function f has to satisfy f(x, y) ≥ 0 for all x and y, and ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1. We call f the joint probability density function of X and Y.

As in the one-dimensional case there is a simple relation between the joint distribution function F and the joint probability density function f:

F(a, b) = ∫_{−∞}^{a} ∫_{−∞}^{b} f(x, y) dx dy   and   f(x, y) = ∂²F(x, y)/∂x∂y.

A joint probability density function of two random variables is also called a bivariate probability density. An explicit example of such a density is the function

f(x, y) = (30/π) e^{−50x² − 50y² + 80xy}   for −∞ < x < ∞ and −∞ < y < ∞;

see Figure 9.2. This is an example of a bivariate normal density (see Remark 11.2 of Dekking et al. for a full description of bivariate normal distributions). We illustrate a number of properties of joint continuous distributions by means of the simple example given above.

The marginal density for X is given by

fX(x) = ∫_{−∞}^{∞} f(x, y) dy.

Important notice
In STAT 270 I do not expect you to carry out multi-dimensional integrals. These are not discussed in first year mathematics, and this is not the right place to teach them. HOWEVER:

• You need to understand the concept of volume giving probability;

• You may need to figure out volumes when f is simple (e.g. when f is constant); all such cases can be done by elementary geometry (volumes of boxes, etc.);

• When we study independent random variables you'll be able to avoid multi-dimensional integrals altogether!

More similarities with the discrete case
Suppose that X, Y have joint pdf f. The marginal density for X is given by

fX(x) = ∫_{−∞}^{∞} f(x, y) dy

and the marginal density for Y is given by

fY(y) = ∫_{−∞}^{∞} f(x, y) dx.

That is, we integrate (rather than sum) over the other variable.

Independence
We say that X and Y are independent if

f(x, y) = fX(x)fY (y)

for all x, y. A consequence of this is that if X and Y are independent then

P (a ≤ X ≤ b ∩ c ≤ Y ≤ d) = P (a ≤ X ≤ b)P (c ≤ Y ≤ d)

and we only need to carry out one dimensional integrations!

Example
Suppose that X and Y have a joint density given by

f(x, y) = a if 0 ≤ x ≤ y ≤ 1,   and 0 otherwise.

1. For which value of a is this a valid joint pdf?

2. What is P (X ≤ 1/2 ∩ Y ≤ 1/2)?

3. What are the marginal densities of X and Y ?

4. Are X and Y independent?

6.8 Joint density for an independent sample

Joint density for an independent sample
Later in this course you'll deal a lot with iid samples, where iid stands for

• independent,

• identically distributed.

By independence, the (joint) pdf for an iid sample is simply the product of the one-dimensional pdfs.


Sample examples
What is the density for an iid sample X1 = x1, X2 = x2 of size two from an exponential distribution with parameter λ?

f(x1, x2) = (λe^{−λx1})(λe^{−λx2}) = λ²e^{−λ(x1+x2)}.

How about a sample X1 = x1, X2 = x2, . . . , Xn = xn of size n?

f(x1, . . . , xn) = (λe^{−λx1}) · · · (λe^{−λxn}) = λⁿe^{−λ(x1+···+xn)}.

Samples from a normal
What is the joint pdf for an iid sample Z1 = z1, . . . , Zn = zn from a standard normal distribution?

f(z1, . . . , zn) = (1/√(2π)) e^{−z1²/2} · · · (1/√(2π)) e^{−zn²/2}
                 = (1/(√(2π))ⁿ) e^{−(z1² + ··· + zn²)/2}.

6.9 Maxima of independent random variables

Maxima of independent random variables
Suppose that X and Y are independent random variables, each with pdf f. What is the pdf of Z = max(X, Y)? Once again, it is easier to work with the cdf. We want to find an expression for

FZ(z) = P(Z ≤ z) = P(max(X, Y) ≤ z).

Max events
The event

max(X, Y) ≤ z

is exactly the same as the event

(X ≤ z) ∩ (Y ≤ z),

in the sense that one is true exactly when the other is true. Hence

FZ(z) = P(Z ≤ z)
      = P((X ≤ z) ∩ (Y ≤ z))
      = P(X ≤ z) P(Y ≤ z)
      = FX(z) FY(z).


Maxima of uniforms
Suppose that X and Y are independent standard uniform variables. What is the pdf of Z = max(X, Y)?

x <- runif(10000,min=0,max=1)

y <- runif(10000,min=0,max=1)

z <- pmax(x,y)

hist(z,breaks=30,prob=TRUE)

[Figure: density histogram of the simulated values of z = pmax(x, y).]

pdf of max
For x, y, z in [0, 1] we have

FX(x) = x,   FY(y) = y,

FZ(z) = FX(z) FY(z) = z².

Differentiating, we get

fZ(z) = 2z

on [0, 1], and fZ(z) = 0 otherwise.
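You can overlay this derived density on the simulation above (a sketch, assuming the vector z from the runif/pmax code is still in the workspace):

hist(z, breaks = 30, prob = TRUE)
curve(2 * x, from = 0, to = 1, add = TRUE, col = "red")   # fZ(z) = 2z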

Minima
What if Z = min(X, Y)? Flipping what we had before, we get

P(Z ≥ z) = P(X ≥ z ∩ Y ≥ z) = P(X ≥ z) P(Y ≥ z).

Hence

1 − FZ(z) = (1 − FX(z))(1 − FY(z)).

6.10 Sums of random variables

Sums of random variables
Suppose now that X and Y have joint pdf f(x, y) and that they are not necessarily independent. What is the pdf of Z = X + Y?

To compute fZ(z) we need to integrate f(x, y) over all pairs (x, y) such that x + y = z. There are two ways we could do this:

• integrate over all x, and for each put y = z − x, or

• integrate over all y, and for each put x = z − y.

fZ(z) = ∫_{−∞}^{∞} f(x, z − x) dx = ∫_{−∞}^{∞} f(z − y, y) dy.

Sum of independent random variables
When X and Y are independent, f(x, y) = fX(x)fY(y). In this case

fZ(z) = ∫_{−∞}^{∞} f(x, z − x) dx = ∫_{−∞}^{∞} fX(x) fY(z − x) dx

or, equivalently,

fZ(z) = ∫_{−∞}^{∞} f(z − y, y) dy = ∫_{−∞}^{∞} fX(z − y) fY(y) dy.

Sum of exponentials
Suppose that volcanoes occur according to a Poisson process with rate λ (i.e. each waiting time is independent exponential with parameter λ). What is the distribution of the waiting time until there have been two volcanoes? Equivalently: if X and Y are independent exponentials with parameter λ, what is the distribution of Z = X + Y?

Adding exponentials
We have fX(x) = λe^{−λx} and fY(y) = λe^{−λy} for x, y > 0.

fZ(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx
      = ∫_0^z fX(x) fY(z − x) dx
      = ∫_0^z λe^{−λx} λe^{−λ(z−x)} dx
      = ∫_0^z λ²e^{−λz} dx
      = zλ²e^{−λz}.

This is the pdf of a gamma distribution with parameters α = 2 and β = λ.
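A simulation check in R: the sum of two independent exponentials against the Gamma(shape = 2, rate = λ) density (a sketch; λ = 0.5 is an arbitrary choice).

lam <- 0.5
z <- rexp(100000, lam) + rexp(100000, lam)
hist(z, breaks = 50, prob = TRUE)
curve(dgamma(x, shape = 2, rate = lam), add = TRUE, col = "red")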


Sum of normals
Suppose that X is N(a, b²) and Y is N(c, d²), and that X and Y are independent. What kind of random variable is Z = X + Y?

x <- rnorm(10000,mean=0.1,sd=0.3)

y <- rnorm(10000,mean=0.2,sd=1.2)

z <- x+y

hist(z,breaks=30)

Sum of normals
Since X is N(a, b²) and Y is N(c, d²):

fX(x) = (1/(b√(2π))) e^{−(x−a)²/(2b²)},   fY(y) = (1/(d√(2π))) e^{−(y−c)²/(2d²)},

fZ(z) = ∫_{−∞}^{∞} (1/(b√(2π))) e^{−(x−a)²/(2b²)} (1/(d√(2π))) e^{−(z−x−c)²/(2d²)} dx
      = ...
      = (1/(√(b² + d²)√(2π))) e^{−(z−a−c)²/(2(b²+d²))}.

This is a bit mathematically fiddly; you will not be expected to work through the steps in this class.

Adding normals summary
Suppose that X and Y are independent normals, where X has expectation a and variance b², while Y has expectation c and variance d². Then X + Y is also normal! It has expectation a + c and variance b² + d².

In fact if α, β are real numbers then αX + βY is normal, with expectation αa + βc and variance α²b² + β²d². This is one reason why normal random variables get a lot of air time in statistics.

6.11 Expectation and variance for sums

Rules of expectation and variance
Let X and Y be continuous random variables and suppose Z = X + Y. Then

E[Z] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x + y) f(x, y) dx dy
     = ∫_{−∞}^{∞} x ∫_{−∞}^{∞} f(x, y) dy dx + ∫_{−∞}^{∞} y ∫_{−∞}^{∞} f(x, y) dx dy
     = ∫_{−∞}^{∞} x fX(x) dx + ∫_{−∞}^{∞} y fY(y) dy
     = E[X] + E[Y].

That is,

E[X + Y] = E[X] + E[Y].


Covariance
The continuous analogue of covariance is

cov(X, Y) = E[(X − µX)(Y − µY)] = E[XY] − E[X]E[Y]

where µX = E[X] and µY = E[Y]. Using integrals, this becomes

cov(X, Y) = ∫∫ (x − µX)(y − µY) f(x, y) dx dy.

Note that, as before,

cov(X, Y) = cov(Y, X)
cov(X + Y, Z) = cov(X, Z) + cov(Y, Z)
cov(X, X) = Var(X)
cov(aX, X) = a cov(X, X)

Easier variance proof

Var(X + Y) = cov(X + Y, X + Y)
           = cov(X, X + Y) + cov(Y, X + Y)
           = cov(X, X) + cov(X, Y) + cov(Y, X) + cov(Y, Y)
           = Var(X) + 2cov(X, Y) + Var(Y).

The correlation between X and Y is just the covariance normalised by the standard deviations:

ρ(X, Y) = cov(X, Y) / (√Var(X) √Var(Y)).

And independence...
Well, first we have that if X and Y are independent then

E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f(x, y) dx dy
      = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x fX(x) y fY(y) dx dy
      = ∫_{−∞}^{∞} x fX(x) dx ∫_{−∞}^{∞} y fY(y) dy
      = E[X]E[Y].

Hence if X and Y are independent,

E[XY] = E[X]E[Y].

Expectation and covariance summary sheet
Let X and Y be random variables, with µX = E[X] and µY = E[Y].

E[aX + bY] = aE[X] + bE[Y]

Var(X) = E[(X − µX)²] = E[X²] − (E[X])²

cov(X, Y) = E[(X − µX)(Y − µY)] = E[XY] − E[X]E[Y]

cov(X + Y, Z) = cov(X, Z) + cov(Y, Z)

cov(aX, Y) = a cov(X, Y)

Var(X + Y) = Var(X) + 2cov(X, Y) + Var(Y).

If X, Y are independent then

E[XY] = E[X]E[Y]   and   cov(X, Y) = 0.

You are now ready to learn some statistical inference.


Chapter 7

Statistical models

7.1 Probability VS Inference

In Probability:

• We were given a distribution and its parameters to calculate a certain probability. For example, suppose we are given two independent random variables X and Y: X has an exponential distribution with mean 1/40, and Y has an exponential distribution with mean 1/80.

• Then we can use this information to calculate P(X ≥ 60, Y ≥ 60).

In reality:

• We do not know the parameters. In fact, sometimes we do not even know the distribution. All we have is a data set.

• For example, suppose we were to design a credit card fraud detection system. What would we do? We are given a data set on a person's spending behaviour, and the aim is to detect any fraudulent transactions on this credit card. Assume that X is the amount of money this person spends in one transaction. A simple question for this problem is: what is P(X > 1000), i.e., what is the probability that this person will spend over $1000 in one transaction?

• If we know the distribution of X, then using what we learnt from probability we can calculate P(X > 1000). But given only a data set without any further information, how can we find an appropriate distribution for X? After finding a distribution for X, how do we determine the parameters of this distribution and thus find P(X > 1000)? Does this distribution capture the main features of the data? How do we justify our conclusions?

• Statistical inference will help us to solve these problems.

7.2 Review of related concepts for statistical models

REVIEW OF RELATED CONCEPTS IN STAT 110

• Statistical inference is the process of using sample information to make inferences about the population. Take the credit card fraud detection system, for example: we use the data collected from one's everyday use of their credit card to infer their spending behaviour.


• Population: the complete set of entities, elements, units or subjects that we wish to describe or make inference about

• Sample: a subset of a population

• Parameter: Fixed number that characterizes a population

• Statistic: a numerical summary of data

• Model: a mathematical description of the data generating mechanism

REVIEW OF RELATED CONCEPTS IN PROBABILITY

• Independence: If A1, A2, · · · , Am are independent, then

P (A1 ∩ A2 ∩ · · · ∩ Am) = P (A1)P (A2) · · ·P (Am)

• Expectation:

E[X + Y] = E[X] + E[Y]

E[aX + b] = aE[X] + b

• Variance: If X and Y are independent, then

Var(aX + bY) = a²Var(X) + b²Var(Y)

7.3 Random samples and statistical models

RANDOM SAMPLES
How are data collected? Could they be anything? We are interested in a population, for example, a person's spending behaviour. In order to infer about this behaviour, we need data which are representative. We are going to think of a data set as a set of random variables all produced by the same data generating mechanism.

A random sample is a collection of random variables X1, X2, · · · , Xn that have the same probability distribution and are mutually independent. To use the term mentioned in the Probability part of the notes, these random variables form an independent and identically distributed (iid) sample.

Example: randomly select 100 bottles of 350 ml JUST JUICE and measure the vitamin C content in each bottle.

STATISTICAL MODEL FOR REPEATED MEASUREMENTS
A data set consisting of values x1, x2, · · · , xn of repeated measurements of the same quantity is modeled as the realization of a random sample X1, X2, · · · , Xn. The model may include a partial specification of the probability distribution of each Xi. This probability distribution is called the model distribution. The parameter of this distribution is called the model parameter.

Once we have a data set for a population, we

• Use the numerical summaries and exploratory plots of the data set as a first indication of what an appropriate choice would be for the model distribution;


• Formulate a statistical model for a data set;

• Use the data set to get a good guess of the model parameter.

How do we know that this guess is good? In other words, how do we evaluate whether the specified model is a good representation of the true data generating mechanism? To do this, we need some criteria that can formally evaluate the guess of the parameter and the statistical model. We start with some properties derived from a random sample.

7.4 Sampling distributions and central limit theorem

SAMPLE STATISTIC
Numerical summaries of a data set can be obtained from functions of a random sample. A function h(X1, X2, · · · , Xn) of a random sample X1, X2, · · · , Xn is called a sample statistic.

For example,

Sample mean:  X̄n = (X1 + X2 + · · · + Xn)/n

Sample variance:  s²n = ((X1 − X̄)² + (X2 − X̄)² + · · · + (Xn − X̄)²)/(n − 1)

Sample median:  Medn = X_((n+1)/2) when n is odd;  Medn = (1/2)(X_(n/2) + X_(n/2+1)) when n is even.

SAMPLING DISTRIBUTION
Sample statistics h(X1, X2, · · · , Xn) are random variables: different samples produce different values. The probability distribution of a sample statistic is called the sampling distribution of the statistic. The standard error of the statistic is the standard deviation of its sampling distribution.

SAMPLING DISTRIBUTION - EXAMPLE
Given a random sample X1, X2, · · · , Xn from a normal distribution with expectation µ and variance σ², find the distribution of Y = 2X1.

Solution:

THE LAW OF LARGE NUMBERS
If X̄n is the average of n independent random variables X1, X2, · · · , Xn with the same expectation µ and variance σ² (we call X1, X2, · · · , Xn an independent and identically distributed sequence), then

E[X̄n] = (1/n) E[X1 + X2 + · · · + Xn] = µ

and

Var(X̄n) = (1/n²) Var(X1 + X2 + · · · + Xn) = σ²/n.

The law of large numbers suggests that the average of repeated measurements provides a more accurate answer.

If X̄n is the average of n independent random variables X1, X2, · · · , Xn with the same expectation µ and variance σ², then X̄n converges to µ

• in mean square: E[(X̄n − µ)²] → 0 as n → ∞;

• in probability: P(|X̄n − µ| > ε) → 0 as n → ∞, for any ε > 0.

CENTRAL LIMIT THEOREM
Let X1, X2, · · · , Xn be a sequence of independent and identically distributed random variables with expectation µ and variance σ². For n ≥ 1, let

Zn = (X̄n − µ) / (σ/√n).

Then the distribution function of Zn converges to the distribution function Φ of the standard normal distribution as n becomes large.

Another way to present the central limit theorem:

Let X1, X2, · · · , Xn be independent random variables from a distribution with mean µ and variance σ². Then

X̄n = (1/n) Σ_{i=1}^{n} Xi

tends to be distributed as N(µ, σ²/n) when n is large.
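A quick CLT demonstration in R (a sketch; the Exp(1) distribution and n = 30 are arbitrary choices): averages of 30 Exp(1) variables look approximately N(1, 1/30) even though the Exp(1) density is very skewed.

xbar <- replicate(10000, mean(rexp(30, rate = 1)))
hist(xbar, breaks = 40, prob = TRUE)
curve(dnorm(x, mean = 1, sd = 1/sqrt(30)), add = TRUE, col = "red")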

7.5 More on sampling distribution

WHICH DISTRIBUTION?

[Figure: frequency histogram of a data set taking values between 0 and 4. Which distribution might have generated it?]

THE GAMMA DISTRIBUTION
Let Ei be the waiting time for a single occurrence of a Poisson event. Then Ei ∼ Exp(λ). What about the waiting time for a specific number m of Poisson events to occur? It has an Erlang distribution, i.e., X = E1 + E2 + · · · + Em follows an Erlang distribution. Here m enables us to get different shapes in the density function.

The gamma distribution Gam(α, λ) is obtained by allowing m to take non-integer values; we denote this shape parameter by α. The parameter λ is called the rate parameter.

[Figure: density curves of Exp(2) (left) and Gam(1, 2), Gam(2, 2) and Gam(5, 2) (right); Gam(1, 2) coincides with Exp(2).]

The density function of a gamma distribution with parameters α > 0 and λ > 0 is

f(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx} for x > 0,   and f(x) = 0 for x ≤ 0.

For α > 0, the quantity Γ(α) is defined by

Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt.

The mean and variance of X ∼ Gam(α, λ) are α/λ and α/λ², respectively.

THE CHI-SQUARED DISTRIBUTION
Consider a special gamma distribution with λ = 1/2 and α = ν/2. The random variable X is said to follow a chi-squared distribution with ν degrees of freedom if

X ∼ Gam(ν/2, 1/2).

A special notation for this is X ∼ χ²_ν. The mean and variance of X ∼ χ²_ν are E[X] = ν and Var(X) = 2ν.

If Z has a standard normal distribution and X = Z², then the pdf of X is

f(x) = ((1/2)^{1/2} / Γ(1/2)) x^{−1/2} e^{−x/2} for x > 0,   and f(x) = 0 for x ≤ 0.

That is, X has a chi-squared distribution with 1 degree of freedom, X ∼ χ²_1.

Let X1, X2, · · · , Xn be a random sample from N(µ, σ²). Then Zi = (Xi − µ)/σ are independent standard normal random variables and

Σ_{i=1}^{n} Z²_i ∼ χ²_n.

Let X1, X2, · · · , Xn be a random sample from N(µ, σ²). Then

Σ_{i=1}^{n} (Xi − X̄)² / σ² = (n − 1)S² / σ²

has a χ² distribution with (n − 1) degrees of freedom.
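A simulation check of the (n − 1)S²/σ² result (a sketch; n = 10 and N(5, 2²) are arbitrary choices):

n <- 10; mu <- 5; sigma <- 2
stat <- replicate(10000, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)
hist(stat, breaks = 40, prob = TRUE)
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")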


THE t DISTRIBUTION
The sample variance S² is useful when making inference about σ², and the sample mean X̄ is useful when making inference about µ. The distribution of X̄ depends on σ². In applications σ² is often unknown, so the fact that √n(X̄ − µ)/σ ∼ N(0, 1) becomes useless.

Solution: replace σ by S in √n(X̄ − µ)/σ.

A continuous random variable has a t distribution with parameter m, where m ≥ 1 is an integer, if its probability density is given by

f(x) = k_m (1 + x²/m)^{−(m+1)/2},   −∞ < x < ∞,

where

k_m = Γ((m + 1)/2) / (Γ(m/2) √(mπ)).

This distribution is denoted by t(m) and is referred to as the t distribution with m degrees of freedom.

For a random sample X1, X2, · · · , Xn from an N(µ, σ²) distribution, the studentized mean

(X̄ − µ) / (S/√n)

has a t(n − 1) distribution, regardless of the values of µ and σ.

THE t DISTRIBUTION - DENSITY FUNCTION

[Figure: density curves of t(1), t(2), t(5) and N(0, 1); the t densities have heavier tails and approach the normal curve as the degrees of freedom increase.]

THE F DISTRIBUTION
Let X1 and X2 be independent chi-squared random variables with ν1 and ν2 degrees of freedom, respectively. Then the ratio

F = (X1/ν1) / (X2/ν2)

has an F distribution with ν1 numerator degrees of freedom and ν2 denominator degrees of freedom.

Suppose that we have a random sample of m observations from the normal distribution N(µ1, σ²1) and an independent random sample of n observations from a second normal distribution N(µ2, σ²2). Then

F = [((m − 1)S²1/σ²1)/(m − 1)] / [((n − 1)S²2/σ²2)/(n − 1)] = (S²1/σ²1) / (S²2/σ²2)

has an F distribution with m − 1 numerator degrees of freedom and n − 1 denominator degrees of freedom.

THE F DISTRIBUTION - DENSITY FUNCTION

[Figure: F density curves with (1, 10), (2, 10), (5, 10) and (10, 5) degrees of freedom.]

7.6 Distribution features

Exploratory plots of a data set can suggest an initial guess at what kind of distribution the data set was generated from. The plots in this section give a good indication of the features of different distributions.

THE EMPIRICAL DISTRIBUTION FUNCTION
Let X1, X2, · · · , Xn be a random sample from distribution function F. The empirical cumulative distribution function of the sample is

Fn(a) = (number of Xi in (−∞, a]) / n.

From the law of large numbers we have, for every ε > 0,

lim_{n→∞} P(|Fn(a) − F(a)| > ε) = 0.

This means that, for most realizations of the random sample,

Fn(a) ≈ F(a).

R command: ecdf()
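A small example of ecdf() in use (a sketch; the N(5, 4) sample mirrors the figure below):

x <- rnorm(200, mean = 5, sd = 2)
plot(ecdf(x))
curve(pnorm(x, mean = 5, sd = 2), add = TRUE, col = "red", lty = 2)   # true cdf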


[Figure 7.1: Empirical distribution functions of normal samples with sample size n = 20 (left) and n = 200 (right) from N(5, 4). Dashed red lines show the "true" distribution function.]

THE HISTOGRAM AND KERNEL DENSITY ESTIMATE
Histogram for a data set x1, x2, · · · , xn:

1. Divide the range of the data into bins (intervals) B1, B2, · · · , Bm.

2. Let the bin width (length of an interval) of Bi be |Bi|.

3. The area under the histogram on a bin Bi is

   (the number of xj in Bi) / n.

4. The height of the histogram on bin Bi is then

   (the number of xj in Bi) / (n|Bi|).

R command: hist()

THE HISTOGRAM AND KERNEL DENSITY ESTIMATE
Kernel density estimate:

[Figure 7.2: An illustration of a kernel density estimate.]

Kernel density estimate fn,h for a data set x1, x2, · · · , xn:

1. Choose a kernel K and a bandwidth h, satisfying

   (K1) K is a probability density, i.e., K(u) ≥ 0 and ∫_{−∞}^{∞} K(u) du = 1;

   (K2) K is symmetric around zero, i.e., K(u) = K(−u);

   (K3) K(u) = 0 for |u| > 1.

2. Let

   fn,h(t) = (1/(nh)) Σ_{i=1}^{n} K((t − xi)/h).

Sometimes one uses kernels that do not satisfy condition (K3), for example the normal kernel

K(u) = (1/√(2π)) e^{−u²/2},   −∞ < u < ∞.

R command: density()

Suppose the random sample X1, X2, · · · , Xn is generated from a continuous distribution with probability density f. By the law of large numbers,

(number of Xi in (x − h, x + h]) / (2hn) ≈ f(x).

Therefore,

height of the histogram on (x − h, x + h] ≈ f(x).

Similarly, for the kernel density estimate of a random sample,

fn,h(x) ≈ f(x).
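A small example of hist() and density() together (a sketch; the N(5, 4) sample matches the figures below):

x <- rnorm(200, mean = 5, sd = 2)
hist(x, prob = TRUE)
lines(density(x), col = "blue")                                       # kernel density estimate
curve(dnorm(x, mean = 5, sd = 2), add = TRUE, col = "red", lty = 2)   # true density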

[Figure 7.3: Histogram (left) and kernel density estimate (right) of a normal sample with sample size n = 200 from N(5, 4). Dashed red lines show the "true" density.]

[Figure 7.4: Histogram (left) and kernel density estimate (right) of a normal sample with sample size n = 20 from N(5, 4). Dashed red lines show the "true" density.]

• We have seen that if we have a data set from a given probability distribution, then the sample statistics approximate certain features of this distribution.

• In practice: we have a data set which is modeled as the realization of a random sample, but the probability distribution is unknown.

• Our goal is to use the data set to estimate a certain feature of this distribution that represents the quantity of interest.

Chapter 8

Point Estimation

8.1 Estimators

PROBABILITY → STATISTICAL INFERENCE
Probability: developing methods for studying the behavior of random variables. Given a specific probability distribution, we can calculate the probabilities of various events. For example, knowing that Y has a Bin(n = 50, p = 0.3) distribution, we can calculate P(14 ≤ Y ≤ 15).

Inference: given a data set, we can make a guess at a statistical model that describes the data generating mechanism. Assuming our guess is that Y has a Bin(n = 50, p) distribution, where the value of p is unknown, and having observed Y = y (say y = 14), what can we say about p?

ESTIMATORS - AN EXAMPLE
Given a data set of global large earthquakes, which distribution can we use to model this data set? What is the parameter? Can we say anything about the probability that an earthquake with magnitude ≥ 7.0 will occur in the next 30 days, i.e., P(X ≤ 30)?

[Figure 8.1: Map of global large earthquakes with magnitude ≥ 7.0 from January 1975 to April 2015. Source: USGS, http://earthquake.usgs.gov/earthquakes/search/]

[Figure 8.2: Plot of magnitude versus time for the global large earthquakes with magnitude ≥ 7.0 from January 1975 to April 2015.]

[Figure 8.3: Histogram of the days between two adjacent earthquakes in the above example.]

We consider two questions: 1) What is a reasonable guess of the parameter of this distribution? 2) What are plausible values of this parameter? Question 1 introduces a type of inference that statisticians call point estimation. Question 2 introduces a type of inference that statisticians call confidence intervals.

ESTIMATE
The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter. In a general setup, we denote the parameter of interest by the Greek letter θ. Whatever method we use to estimate the parameter of interest θ, the result depends only on our data set.

ESTIMATE - DEFINITION
More formally: an estimate is a value t that depends only on the data set x1, x2, · · · , xn, i.e., t is some function of the data set only: t = h(x1, x2, · · · , xn). Sometimes we write θ̂ instead of t.

ESTIMATORS - DEFINITION
Estimator: Let t = h(x1, x2, · · · , xn) be an estimate based on the data set x1, x2, · · · , xn. Then t is a realization of the random variable

T = h(X1, X2, · · · , Xn).

The random variable T is called an estimator.

EXAMPLE - POISSON DISTRIBUTION
Poisson distribution: the number of events occurring independently (not simultaneously) in a fixed period of time with a known average rate λ.

A Poisson process is a sequence of random events which occur at a constant average rate λ per interval. Suppose that different occurrences are independent and cannot occur simultaneously. The number of occurrences X in a particular interval will then follow a Poisson distribution with rate λ.

Consider the occurrences of global large earthquakes with magnitude ≥ 7.0. One is interested in

1) the intensity at which global large earthquakes with magnitude ≥ 7.0 occur in a generic week, and

2) the percentage of weeks during which no large earthquake occurs.

If we assume that the large earthquakes occur completely at random in time, this earthquake process can be modeled by a Poisson process. Let X be the number of occurrences of global large earthquakes during one week, so that X has a Pois(λ) distribution. Here λ is an unknown parameter. The parameters of interest are then

1. λ: the intensity of the occurrences of global large earthquakes;

2. e^{−λ}: the percentage of weeks during which no large earthquake occurs (i.e., the probability of zero occurrences).

We are given a data set gathered from the large earthquake occurrence process, x1, x2, · · · , xn, where xi is the number of large earthquakes that occurred in the ith week.

To estimate λ (the expectation, or the variance), we can use

t = x̄n   or   t = s²n.

To estimate e^{−λ}, we can use

t = (number of xi equal to 0)/n   or   t = e^{−x̄n}.


8.2 The behavior of an estimator

THE BEHAVIOR OF AN ESTIMATOR
For the earthquake example, we have the following possible estimators for the probability p0 of zero occurrences:

S = (number of Xi equal to 0)/n   and   T = e^{−X̄n}.

Which estimator is better?

Assume that we know λ = 0.3, so that p0 = e^{−0.3} = 0.74. We can use the following simulation study to compare the two estimators:

* Draw 30 values from a Pois(0.3) distribution;

* Compute the values of the estimators S and T;

* Repeat this 1000 times, so that we have 1000 values for each estimator.
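A sketch of this simulation study in R (not the official code for the course):

sim <- replicate(1000, {
  x <- rpois(30, lambda = 0.3)
  c(S = mean(x == 0), T = exp(-mean(x)))
})
rowMeans(sim)          # both should be close to exp(-0.3) = 0.74
apply(sim, 1, var)     # compare the spread of the two estimators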

[Figure 8.4: Frequency histograms of 1000 simulated values of the estimators S (left) and T (right) of p0 = e^{−0.3} = 0.74.]

THE SAMPLING DISTRIBUTION
Recall: let T = h(X1, X2, · · · , Xn) be an estimator based on a random sample X1, X2, · · · , Xn. The probability distribution of T is called the sampling distribution of T.

THREE CRITERIA FOR EVALUATING ESTIMATORS

• Unbiasedness

• Variance of an estimator

• Mean squared errors


8.3 Unbiased estimators

8.3.1 Definition of unbiased estimators

Unbiased estimator: An estimator θ̂ is called an unbiased estimator for the parameter θ if E[θ̂] = θ, irrespective of the value of θ. The difference E[θ̂] − θ is called the bias of θ̂. If this difference is nonzero, then θ̂ is called biased.

8.3.2 An example

Consider the example of large earthquakes that occur completely at random in time. Let X be the number of occurrences of global large earthquakes during one week. Then X has a Pois(λ) distribution with unknown parameter λ. A realization of a random sample X1, X2, · · · , Xn of the number of occurrences during each week is x1, x2, · · · , xn. We want to use either S or T to estimate the probability of zero occurrences: p0 = e^{−λ}.

a. Show that S is an unbiased estimator of p0.

b. Show that T is a biased estimator of p0.

[Figure 8.5: Probability mass functions of S (left) and T (right).]

8.3.3 Unbiased estimators for expectation and variance

Suppose X1, X2, · · · , Xn is a random sample from a distribution with finite expectation µ and finite variance σ². Then

X̄n = (X1 + X2 + · · · + Xn)/n

is an unbiased estimator for µ, and

S²n = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄n)²

is an unbiased estimator for σ².

EXPECTATION
Proof: Using linearity of expectations we have

E[X̄n] = (1/n) E[X1 + X2 + · · · + Xn] = µ.

VARIANCE
Proof: Using linearity of expectations we have

E[S²n] = E[(1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄n)²]
       = (1/(n − 1)) E[Σ_{i=1}^{n} (Xi − X̄n − µ + µ)²]
       = (1/(n − 1)) E[Σ_{i=1}^{n} (Xi − µ)² − 2 Σ_{i=1}^{n} (Xi − µ)(X̄n − µ) + Σ_{i=1}^{n} (X̄n − µ)²]
       = (1/(n − 1)) (Σ_{i=1}^{n} Var(Xi) − 2E[(Σ_{i=1}^{n} Xi − nµ)(X̄n − µ)] + nVar(X̄n))
       = (1/(n − 1)) (nσ² − 2nVar(X̄n) + σ²)
       = (1/(n − 1)) (nσ² − 2n · σ²/n + σ²)
       = σ².

EXERCISE
Let X1, X2, · · · , Xn be a random sample from a population with expectation µ. Is

µ̂1 = (1/4)X1 + (1/8)X2 + (5/8)X3

an unbiased estimator for µ?

8.4 Variance of an estimator

EFFICIENCY
Let θ̂1 and θ̂2 be two unbiased estimators for the same parameter θ. Then the estimator θ̂2 is called more efficient than the estimator θ̂1 if Var(θ̂2) < Var(θ̂1), irrespective of the value of θ.

Low variance is good!


STANDARD ERROR OF AN ESTIMATOR
The standard error of an estimator θ̂ is its standard deviation

σ_θ̂ = √Var(θ̂).

EXERCISE
Suppose X1, X2, · · · , Xn are independent and identically distributed with finite expectation µ and variance σ². Both

X̄3 = (X1 + X2 + X3)/3   and   X̄6 = (X1 + X2 + · · · + X6)/6

are unbiased estimators for the parameter µ. Which estimator is more efficient?

8.5 Mean squared error

MEAN SQUARED ERROR - DEFINITION
Let θ̂ be an estimator for a parameter θ. The mean squared error of θ̂ is the number

MSE(θ̂) = E[(θ̂ − θ)²].

MEAN SQUARED ERROR
An estimator θ̂1 performs better than an estimator θ̂2 if MSE(θ̂1) < MSE(θ̂2). Note that

MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + (E[θ̂] − θ)².

Proof:

• For unbiased estimators, the MSE is equal to the variance.

MEAN SQUARED ERROR - EXAMPLE
Suppose X1, X2, · · · , Xn is a random sample from a distribution with finite expectation µ and finite variance σ². Then X̄n can be used to estimate µ.

• Calculate MSE(X̄n).

UNBIASEDNESS AND EFFICIENCY
A biased estimator with a small variance may be more useful than an unbiased estimator with a large variance. Take the earthquake data for example. Assume that the true population has a Pois(0.3) distribution, and n = 30. We take 1000 random samples from this population (by simulation) and then calculate 1000 values of S and T.


[Figure 8.6: MSEs calculated from the 1000 simulated values of S and T, as a function of λ.]


Chapter 9

Methods of Point Estimation

METHODS OF POINT ESTIMATION

• The Method of Moments

• Maximum Likelihood Estimation

9.1 The Method of Moments

MOMENTS
Let X1, X2, · · · , Xn be a random sample from a p.m.f. or p.d.f. f(x). For k = 1, 2, 3, · · · , the kth population moment, or kth moment of the distribution f(x), is E(X^k). The kth sample moment is (1/n) Σ_{i=1}^{n} X^k_i.

THE METHOD OF MOMENTS
Let X1, X2, · · · , Xn be a random sample from a distribution with p.m.f. or p.d.f. f(x; θ1, · · · , θm), where θ1, · · · , θm are parameters whose values are unknown. Then the moment estimators θ̂1, · · · , θ̂m are obtained by equating the first m sample moments to the corresponding first m population moments and solving for θ1, · · · , θm.

EXAMPLE
Let X1, X2, · · · , Xn be a random sample from a gamma distribution with parameters α and λ. Use the method of moments to estimate α and λ.

Solution:

We know that E(X) = α/λ and E(X²) = (α² + α)/λ². The moment estimators of α and λ are obtained by solving

X̄n = α/λ,   (1/n) Σ_{i=1}^{n} X²_i = (α² + α)/λ²

simultaneously, which gives

α̂ = (X̄n)² / ((1/n) Σ_{i=1}^{n} X²_i − (X̄n)²),   λ̂ = X̄n / ((1/n) Σ_{i=1}^{n} X²_i − (X̄n)²).
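These estimators are easy to compute in R. A sketch, using simulated data purely for illustration (true shape 2 and rate 0.5 are arbitrary choices):

x <- rgamma(500, shape = 2, rate = 0.5)
m1 <- mean(x); m2 <- mean(x^2)
alpha_hat  <- m1^2 / (m2 - m1^2)
lambda_hat <- m1   / (m2 - m1^2)
c(alpha_hat = alpha_hat, lambda_hat = lambda_hat)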


9.2 The likelihood function

LIKELIHOOD - AN EXAMPLE
Suppose we perform an experiment 20 times and get 6 successes. How do we estimate the probability of success, p?

We have a single observation x = 6, which we can assume comes from a Binomial(20, p) distribution. The probability of the result is

P(X = 6) = (20 choose 6) p⁶ (1 − p)¹⁴.

LIKELIHOOD - AN EXAMPLE - CONTINUED
We can observe how this expression changes for different values of p:

[Figure: P(X = 6) as a function of p for 0 ≤ p ≤ 1; the curve peaks near p = 0.3.]

LIKELIHOOD - AN EXAMPLE - CONTINUED
The probability P(X = 6) is maximized at p = 0.3. This gives us a new way of estimating p:

• We take the value of p which makes the observed data as likely as possible.

LIKELIHOOD FUNCTION
More generally, if the number of successes from 20 trials is x, then

P(X = x) = (20 choose x) p^x (1 − p)^{20−x}.

This is a function of the unknown parameter p, with the observed data x known and regarded as fixed.

• Such a function is called the likelihood of the parameter.


LIKELIHOOD FUNCTION
Probability function:

Probability mass function: pX(x)
Probability density function: fX(x)

• Treated as a function of the data
• The parameter θ is treated as constant

Given a data set x1, x2, · · · , xn which is a realization of an iid sample X1, · · · , Xn from a discrete distribution, the likelihood function is

L(θ) = P(X1 = x1, X2 = x2, · · · , Xn = xn)
     = pθ(x1) pθ(x2) · · · pθ(xn)   ← independence

Given a data set x1, x2, · · · , xn which is a realization of an iid sample X1, · · · , Xn from a continuous distribution, the likelihood function is

L(θ) = fθ(x1, x2, · · · , xn)
     = fθ(x1) fθ(x2) · · · fθ(xn)   ← independence

Likelihood function:

• Treated as a function of θ
• The data are treated as known constants

LOG LIKELIHOOD FUNCTION
Given that the likelihood function is a product of terms, it is often more convenient to work with the log likelihood function, defined by

l(θ) = ln(L(θ)).

SOME MATHEMATICS

e^a e^b = e^{a+b}
x^a x^b = x^{a+b}
x^a y^a = (xy)^a
ln(e) = 1
ln(1) = 0
ln(ab) = ln(a) + ln(b)
ln(a/b) = ln(a) − ln(b)
ln(x^a) = a ln(x)
ln(e^a) = a

If g(x) = x^a then g'(x) = a x^{a−1}.
If g(x) = e^{ax} then g'(x) = a e^{ax}.
If g(x) = ln(ax) then g'(x) = 1/x.


AN EXAMPLE – BINOMIAL DISTRIBUTION
Suppose x1, x2, · · · , xn are observed values of an iid sample from a Binomial(N, p) distribution, with N known and p unknown. The likelihood function is

L(p) = K p^{Σ_{i=1}^{n} xi} (1 − p)^{nN − Σ_{i=1}^{n} xi},

where K = (N choose x1)(N choose x2) · · · (N choose xn) is a constant for a given sample.

The log likelihood function is

l(p) = ln(L(p)) = Σ_{i=1}^{n} xi ln(p) + (nN − Σ_{i=1}^{n} xi) ln(1 − p) + ln(K).

[Figure 9.1: The likelihood (top) and log likelihood (bottom) functions for the 5 outcomes 2, 5, 8, 3, 4 from a Binomial(15, p) distribution.]

9.3 The maximum likelihood principle

LIKELIHOOD EXAMPLE
Suppose a dealer of computer chips is offered, on the black market, two batches of 10 000 chips each. According to the seller, in one batch about 50% of the chips are defective, while this percentage is about 10% in the other batch. Our dealer is only interested in this last batch. Unfortunately the seller cannot tell the two batches apart. To help him make up his mind, the seller offers our dealer one batch, from which he is allowed to select and test 10 chips. After selecting 10 chips arbitrarily, it turns out that only the second one is defective. Our dealer at once decides to buy this batch. Is this a wise decision?

THE MAXIMUM LIKELIHOOD PRINCIPLE
The Maximum Likelihood Principle: Given a data set, choose the parameter(s) of interest in such a way that the data are most likely.

The batch with 50% defective chips → more likely that defective chips will appear.

The batch with 10% defective chips → hardly any defective chips expected.

The dealer chooses the batch for which it is most likely that only one chip out of ten is defective.


LIKELIHOOD FUNCTION OF THE CHIP EXAMPLE
Set Ri = 1 in case the ith tested chip was defective, and Ri = 0 in case it was operational, where i = 1, · · · , 10. R1, · · · , R10 are independent Ber(p) distributed random variables, where p is the probability that a randomly selected chip is defective. The probability that the observed data occur is equal to

L(p) = P(R_1 = 0, R_2 = 1, R_3 = 0, \cdots, R_{10} = 0) = p(1 - p)^9.


Figure 9.2: The likelihood function L(p) of the probability that the observed data occur.

To find the value of p at which the function L(p) reaches its maximum, we first differentiate L(p) with respect to p, which yields

\frac{dL(p)}{dp} = (1 - p)^9 - 9p(1 - p)^8 = (1 - p)^8(1 - 10p).

We then solve dL(p)/dp = 0 for p, and obtain p = 1 or p = 1/10 = 0.1. The value of p which maximizes this likelihood function L(p) is p = 0.1. We say that 0.1 is the maximum likelihood estimate of p for the chips.
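A minimal R sketch that checks this numerically by maximizing L(p) = p(1 − p)^9 over (0, 1):

L <- function(p) p * (1 - p)^9
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum   # approximately 0.1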

9.4 Maximum likelihood estimates

MAXIMUM LIKELIHOOD ESTIMATES
The maximum likelihood estimate of θ is the value t = h(x1, x2, · · · , xn) that maximizes the likelihood function L(θ). The corresponding random variable T = h(X1, X2, · · · , Xn) is called the maximum likelihood estimator for θ.

It is common to refer to maximum likelihood estimators as MLEs.

EXAMPLE - EXPONENTIAL
Suppose x1, x2, · · · , xn is a realization of a random sample from an Exp(λ) distribution. The likelihood is given by

L(\lambda) = \lambda^n e^{-\lambda \sum_{i=1}^n x_i}.


The derivative of L(λ) is

\frac{dL(\lambda)}{d\lambda} = n\lambda^{n-1} e^{-\lambda \sum_{i=1}^n x_i} - \lambda^n \left(\sum_{i=1}^n x_i\right) e^{-\lambda \sum_{i=1}^n x_i} = n\lambda^{n-1} e^{-\lambda \sum_{i=1}^n x_i} \left(1 - \frac{\lambda}{n} \sum_{i=1}^n x_i\right).

By setting dL(λ)/dλ = 0 and solving for λ, we have

1 - \lambda \bar{x}_n = 0

Therefore, λ̂ = 1/x̄n.

LOG LIKELIHOOD
Both the likelihood function L(θ) and the log likelihood function l(θ) = ln(L(θ)) are maximized for the same value of θ. The logarithm of L(θ) changes the product of the terms involving θ into a sum of logarithms of these terms. This makes the process of differentiating easier.

EXAMPLE - EXPONENTIAL
In the situation that we have a realization x1, x2, · · · , xn of a random sample from an Exp(λ) distribution, the log likelihood function is given by

l(\lambda) = n \ln(\lambda) - \lambda \sum_{i=1}^n x_i.

The derivative of l(λ) is

\frac{dl(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i.

Setting dl(λ)/dλ = 0, we obtain λ̂ = 1/x̄n.

NOTE
Note that solving dl(λ)/dλ = 0 does not always maximize the likelihood. We should therefore ensure that we do have a maximum by checking that

\left. \frac{d^2 l(\lambda)}{d\lambda^2} \right|_{\lambda = \hat{\lambda}} < 0

EXAMPLE - EXPONENTIAL

\left. \frac{d^2 l(\lambda)}{d\lambda^2} \right|_{\lambda = \hat{\lambda}} = \left. -n\lambda^{-2} \right|_{\lambda = \hat{\lambda}} = -n\hat{\lambda}^{-2} = -n(1/\bar{x}_n)^{-2} < 0

Therefore, λ̂ = 1/x̄n is the maximum likelihood estimate of λ.


EXAMPLE - EXPONENTIAL DISTRIBUTION
Write down the likelihood and log likelihood functions for the observed random sample 2.04,

0.31, 2.15, 3.18, 1.43, 3.00 from an Exp(λ) distribution.


Figure 9.3: The likelihood (top) and log likelihood (bottom) functions for the observed random sample 2.04, 0.31, 2.15, 3.18, 1.43, 3.00 from an Exp(λ) distribution.
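A minimal R sketch that reproduces Figure 9.3 and computes the maximum likelihood estimate λ̂ = 1/x̄n for these data:

x <- c(2.04, 0.31, 2.15, 3.18, 1.43, 3.00)
lambda <- seq(0.01, 3, by = 0.01)
loglik <- sapply(lambda, function(l) sum(dexp(x, rate = l, log = TRUE)))
par(mfrow = c(2, 1))
plot(lambda, exp(loglik), type = "l", ylab = "L(lambda)")   # likelihood
plot(lambda, loglik, type = "l", ylab = "l(lambda)")        # log likelihood
1 / mean(x)   # maximum likelihood estimate, approximately 0.50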

INVARIANCE PRINCIPLE
If θ̂ is the maximum likelihood estimator of a parameter θ and if g(θ) is an invertible (one-to-one) function of θ, then g(θ̂) is the maximum likelihood estimator for g(θ).

9.5 MLE - Discrete distributions

GEOMETRIC DISTRIBUTION
Given the observations x1, x2, · · · , xn of a random sample from a Geo(p) distribution, e.g., n independent experiments, each tossing a coin until the first head appears and recording the number of tosses xi.

The likelihood function is

L(p) = (1 - p)^{\sum_{i=1}^n x_i - n}\, p^n

The log likelihood function is

l(p) = \ln(L(p)) = \left(\sum_{i=1}^n x_i - n\right) \ln(1 - p) + n \ln(p)

The maximum likelihood estimator for p is:

GEOMETRIC DISTRIBUTION - EXAMPLE
Write down the likelihood and log likelihood functions for the observed random sample 6, 9, 8,

4, 2, 2 from a Geo(p) distribution.



Figure 9.4: The likelihood (top) and log likelihood (bottom) functions for the observed random sample 6, 9, 8, 4, 2, 2 from a Geo(p) distribution.
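A minimal R sketch that reproduces Figure 9.4. It uses the likelihood formula above directly, since R's dgeom is parameterised in terms of the number of failures rather than the number of tosses:

x <- c(6, 9, 8, 4, 2, 2)
n <- length(x)
p <- seq(0.001, 0.999, by = 0.001)
L <- (1 - p)^(sum(x) - n) * p^n
par(mfrow = c(2, 1))
plot(p, L, type = "l", ylab = "L(p)")          # likelihood
plot(p, log(L), type = "l", ylab = "l(p)")     # log likelihood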

9.6 MLE - Continuous distributions

NORMAL DISTRIBUTION
Suppose that x1, x2, · · · , xn is a realization of a random sample from an N(µ, σ2) distribution, with µ and σ unknown. What are the maximum likelihood estimates for µ and σ? The likelihood function is

L(\mu, \sigma) = (2\pi\sigma^2)^{-n/2} e^{-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}}

The log likelihood function is

l(\mu, \sigma) = \ln(L(\mu, \sigma)) = -n \ln(\sigma) - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} - \frac{n}{2} \ln(2\pi)

NORMAL DISTRIBUTION


Figure 9.5: The likelihood (top) and log likelihood (bottom) functions for the observed random sample 1.21, -1.08, -2.91, 1.71, 1.13, 1.79 from an N(0, σ2) distribution.


NORMAL DISTRIBUTION
The log likelihood function is a function of two parameters, µ and σ:

l(\mu, \sigma) = -n \ln(\sigma) - n \ln(\sqrt{2\pi}) - \frac{1}{2\sigma^2}\left[(x_1 - \mu)^2 + \cdots + (x_n - \mu)^2\right]

NORMAL DISTRIBUTION (CONTINUED)
The partial derivatives of l are

\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\left[(x_1 - \mu) + \cdots + (x_n - \mu)\right] = \frac{n}{\sigma^2}(\bar{x}_n - \mu)

\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left[(x_1 - \mu)^2 + \cdots + (x_n - \mu)^2\right] = -\frac{n}{\sigma^3}\left(\sigma^2 - \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2\right)

Solving \frac{\partial l}{\partial \mu} = 0 and \frac{\partial l}{\partial \sigma} = 0 yields

\hat{\mu} = \bar{x}_n, \qquad \hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)^2}.
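A minimal R sketch of these estimates, using the sample from Figure 9.5 purely as illustrative data:

y <- c(1.21, -1.08, -2.91, 1.71, 1.13, 1.79)
mu_hat <- mean(y)
sigma_hat <- sqrt(mean((y - mu_hat)^2))   # note: divides by n, not n - 1
c(mu_hat, sigma_hat)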

UNIFORM DISTRIBUTION
Suppose we have the observations x1, x2, · · · , xn of a random sample from a U(0, θ) distribution with θ unknown. The likelihood function is

L(\theta) = \begin{cases} \frac{1}{\theta^n}, & \text{if } \theta \ge \max(x_1, x_2, \cdots, x_n) \\ 0, & \text{if } \theta < \max(x_1, x_2, \cdots, x_n) \end{cases}

The log likelihood function is

l(θ) = ln(L(θ)) = −n ln(θ), if θ ≥ max(x1, x2, · · · , xn)

ESTIMATING THE UPPER ENDPOINT OF A UNIFORM DISTRIBUTION
Suppose x1 = 0.98, x2 = 1.57, x3 = 0.31 is the observation of a random sample from a U(0, θ) distribution with θ > 0 unknown. The likelihood is given by

L(\theta) = \begin{cases} \frac{1}{\theta^3}, & \text{if } \theta \ge \max(x_1, x_2, x_3) = 1.57 \\ 0, & \text{if } \theta < \max(x_1, x_2, x_3) = 1.57 \end{cases}

ESTIMATING THE UPPER ENDPOINT OF A UNIFORM DISTRIBUTION
L(θ) attains its maximum at max(x1, x2, x3) = 1.57.



Figure 9.6: Likelihood function L(θ) for the observed random sample 0.98, 1.57, 0.31 from a U(0, θ) distribution.
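A minimal R sketch that reproduces Figure 9.6:

x <- c(0.98, 1.57, 0.31)
theta <- seq(0.01, 4, by = 0.01)
L <- ifelse(theta >= max(x), 1 / theta^length(x), 0)
plot(theta, L, type = "l", ylab = "L(theta)")
max(x)   # the value of theta at which L(theta) attains its maximum: 1.57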


Chapter 10

Confidence Intervals

10.1 General principle

INTERVAL ESTIMATE
Two ways to carry out estimation:

Point estimate: A single best value for a parameter. Most of the time, this estimate is not identical to the true parameter.

Interval estimate: An interval within which the true parameter is expected to lie with a high probability. It gives the precision with which the parameter is determined.

A GENERAL DEFINITION
Suppose a data set x1, · · · , xn is given, modeled as a realization of random variables X1, · · · , Xn. Let θ be the parameter of interest, and γ a number between 0 and 1. If there exist sample statistics Ln = g(X1, · · · , Xn) and Un = h(X1, · · · , Xn) such that

P(Ln < θ < Un) = γ

for every value of θ, then

(ln, un),

where ln = g(x1, · · · , xn) and un = h(x1, · · · , xn), is called a 100γ% confidence interval for θ. The number γ is called the confidence level.

INTERPRETING A CONFIDENCE INTERVAL
If 100 different samples are used to construct 100 intervals, then for 90% confidence intervals we expect about 90 of these intervals to contain the parameter θ, and about 10 of them to miss it.



Figure 10.1: A hundred 90% confidence intervals for µ = 0 using samples from an N(0, 1) distribution.

10.2 Normal data

10.2.1 Critical values

CRITICAL VALUES
The critical value zp of an N(0, 1) distribution is the number that has right tail probability p. It is defined by

P(Z ≥ zp) = p,

where Z is an N(0, 1) random variable.


Figure 10.2: Critical values of the standard normal distribution.


10.2.2 Variance known

CONFIDENCE INTERVAL - VARIANCE KNOWN
Let X1, X2, · · · , Xn be a random sample from an N(µ, σ2) distribution with µ unknown and σ known. If cl and cu are chosen such that P(cl < Z < cu) = γ for an N(0, 1) distributed random variable Z, then the interval

\left(\bar{x}_n - c_u \frac{\sigma}{\sqrt{n}},\; \bar{x}_n - c_l \frac{\sigma}{\sqrt{n}}\right)

is a 100γ% confidence interval for µ.

A COMMON CHOICE
The 100(1 − α)% confidence interval for µ is

\left(\bar{x}_n - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\; \bar{x}_n + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right).

e.g., the 95% confidence interval is (α = 0.05, so z0.025 = 1.96, using qnorm(0.025, lower.tail = FALSE))

\left(\bar{x}_n - 1.96 \frac{\sigma}{\sqrt{n}},\; \bar{x}_n + 1.96 \frac{\sigma}{\sqrt{n}}\right)

The 99% confidence interval is (α = 0.01, so z0.005 = 2.58, using qnorm(0.005, lower.tail = FALSE))

\left(\bar{x}_n - 2.58 \frac{\sigma}{\sqrt{n}},\; \bar{x}_n + 2.58 \frac{\sigma}{\sqrt{n}}\right)
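A minimal R sketch of the 95% interval, using hypothetical summary values for x̄n, σ and n (these numbers are not from the notes):

xbar <- 5.2; sigma <- 2; n <- 30            # hypothetical values
z <- qnorm(0.025, lower.tail = FALSE)       # 1.96
c(xbar - z * sigma / sqrt(n), xbar + z * sigma / sqrt(n))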

EXAMPLE
If (ln, un) is a 95% confidence interval for the parameter µ of an N(µ, 4) distribution, find a

95% confidence interval for θ = 4µ.

10.2.3 Variance unknown

VARIANCE UNKNOWN
S2 — useful when making inferences about σ2

X̄ — useful when making inferences about µ

The distribution of X̄ depends on σ2

If σ2 is unknown, the fact that (X̄ − µ)/(σ/√n) ∼ N(0, 1) becomes useless

Solution: replace σ by S in (X̄ − µ)/(σ/√n)

STUDENTIZED MEAN OF A NORMAL RANDOM SAMPLE
The studentized mean of a normal random sample: For a random sample X1, · · · , Xn from an N(µ, σ2) distribution, the studentized mean

\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}

has a t(n − 1) distribution, regardless of the values of µ and σ, where n − 1 is the degrees of freedom.


THE t DISTRIBUTION - SYMMETRY
The critical value tm,p is the number satisfying P(T ≥ tm,p) = p. The t-distribution is symmetric around 0, so that

t_{m,1-p} = -t_{m,p}

CONFIDENCE INTERVAL - VARIANCE UNKNOWN
Let X1, X2, · · · , Xn be a random sample from an N(µ, σ2) distribution with µ and σ unknown. The 100(1 − α)% confidence interval for µ is given by

\left(\bar{x}_n - t_{n-1,\alpha/2} \frac{s_n}{\sqrt{n}},\; \bar{x}_n + t_{n-1,\alpha/2} \frac{s_n}{\sqrt{n}}\right).
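A minimal R sketch of the t-based interval, using a small hypothetical sample (not from the notes):

x <- c(4.8, 5.1, 5.6, 4.9, 5.3)                      # hypothetical data
n <- length(x)
tcrit <- qt(0.025, df = n - 1, lower.tail = FALSE)
mean(x) + c(-1, 1) * tcrit * sd(x) / sqrt(n)         # 95% confidence interval for mu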

10.3 Large samples

LARGE SAMPLE CONFIDENCE INTERVALS
Suppose X1, · · · , Xn is a random sample from some distribution F with expectation µ. If n is large enough, we may use

P\left(-z_{\alpha/2} < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha.

If x1, · · · , xn is a realization of a random sample from some unknown distribution with expectation µ and if n is large enough, then

\left(\bar{x}_n - z_{\alpha/2} \frac{s_n}{\sqrt{n}},\; \bar{x}_n + z_{\alpha/2} \frac{s_n}{\sqrt{n}}\right)

is an approximate 100(1− α)% confidence interval for µ.

10.4 A general method

A GENERAL METHOD
Suppose X1, · · · , Xn is a random sample from some distribution depending on an unknown parameter θ. Let θ̂ be a sample statistic that is an estimator for θ.

If some function g(θ̂, θ) of the sample statistic θ̂ and the unknown parameter θ has a known distribution, then we can use the distribution of g(θ̂, θ) to derive probability statements about θ and construct a confidence interval for θ accordingly.

This is called the pivotal method.


EXAMPLE I
The normal distribution case:

• Let X1, X2, · · · , Xn be a random sample from N(µ, σ2) with σ known. We want to estimate µ.

• µ̂ = X̄n is an estimator for µ

• Let g(\hat{\mu}, \mu) = \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} \implies g(\hat{\mu}, \mu) \sim N(0, 1)

• We can use the distribution of g(µ̂, µ) to construct a confidence interval for µ:

P\left(-1.96 < \frac{\hat{\mu} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95

\implies P\left(\hat{\mu} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \hat{\mu} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95

• \left(\bar{x}_n - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{x}_n + 1.96\frac{\sigma}{\sqrt{n}}\right) is a 95% confidence interval for µ.

EXAMPLE II
A data set x1, x2, · · · , xn is given, modeled as a realization of a random sample X1, X2, · · · , Xn from an N(µ, σ2) distribution with µ and σ unknown. Find a 100(1 − α)% confidence interval for µ.

• We want to estimate µ.

• µ̂ = X̄n is an estimator for µ.

• Let g(\hat{\mu}, \mu) = \frac{\hat{\mu} - \mu}{S_n/\sqrt{n}} \implies g(\hat{\mu}, \mu) \sim t(n - 1).

• We can use the distribution of g(µ̂, µ) to construct a confidence interval for µ:

P\left(-t_{n-1,\alpha/2} < \frac{\hat{\mu} - \mu}{S_n/\sqrt{n}} < t_{n-1,\alpha/2}\right) = 1 - \alpha

\implies P\left(\hat{\mu} - t_{n-1,\alpha/2}\frac{S_n}{\sqrt{n}} < \mu < \hat{\mu} + t_{n-1,\alpha/2}\frac{S_n}{\sqrt{n}}\right) = 1 - \alpha

• \left(\bar{x}_n - t_{n-1,\alpha/2}\frac{s_n}{\sqrt{n}},\; \bar{x}_n + t_{n-1,\alpha/2}\frac{s_n}{\sqrt{n}}\right) is a 100(1 − α)% confidence interval for µ.

EXAMPLE III - EXPONENTIAL DISTRIBUTION
Suppose that we are to obtain a single observation x from an exponential distribution with

parameter λ. Use x to form a 90% confidence interval for λ.


EXAMPLE III - SOLUTION
The c.d.f. of X is given by

F(x) = 1 - e^{-\lambda x} \quad \text{for } x > 0.

Let Y = λX. Then Y has the exponential c.d.f. given by

F_Y(y) = 1 - e^{-y} \quad \text{for } y > 0.

EXAMPLE III - SOLUTION
We need to find two numbers a and b such that

P(a < Y < b) = 0.90

One way is to choose a and b to satisfy

P(Y ≤ a) = 0.05 and P(Y ≥ b) = 0.05

This gives us a = 0.051, b = 2.996. Therefore

P(0.051 < Y < 2.996) = P(0.051 < λX < 2.996) = 0.90

\implies \left(\frac{0.051}{x}, \frac{2.996}{x}\right) is a 90% confidence interval for λ.
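A minimal R sketch of this interval: 0.051 and 2.996 are the 5% and 95% quantiles of the standard exponential distribution, and x below is a hypothetical single observation.

a <- qexp(0.05)    # approximately 0.051
b <- qexp(0.95)    # approximately 2.996
x <- 1.8           # hypothetical observed value
c(a / x, b / x)    # 90% confidence interval for lambda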

10.5 Determining the sample size

DETERMINING THE SAMPLE SIZE
To make sure a 100(1 − α)% confidence interval is not wider than a certain accuracy ω, we find

the smallest sample size n such that

the width of the confidence interval ≤ ω.

e.g. 2 \cdot z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \omega
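A minimal R sketch of this calculation, using hypothetical values of σ, ω and α (the minimum n is the smallest integer satisfying the inequality above):

sigma <- 2; omega <- 0.5; alpha <- 0.05     # hypothetical values
z <- qnorm(alpha / 2, lower.tail = FALSE)
ceiling((2 * z * sigma / omega)^2)          # smallest n meeting the width requirement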

AN EXAMPLE
Assume that we have a data set x1, x2, · · · , xn from a random variable which has a normal distribution with standard deviation σ = 0.1. We want to obtain a 95% confidence interval for the mean µ of the random variable. The requirement is that the confidence interval should not be wider than 0.05.

Find the minimum sample size that we need.


10.6 Bootstrap confidence intervals

THE BOOTSTRAP
In real-world applications, we will encounter scenarios where the estimator θ̂ of the parameter θ is in such a complicated form that it is not easy, or not possible, to find the distribution of g(θ̂, θ), some function of θ̂ and the unknown parameter θ. In this case, the pivotal method is no longer useful. Thanks to the development of computational software, we can use simulation-based methods to derive confidence intervals for unknown parameters. One popular method is the bootstrap.

NONPARAMETRIC BOOTSTRAP
Given a data set x1, · · · , xn observed from the random variables X1, · · · , Xn, one way to obtain a 100(1 − α)% bootstrap confidence interval for θ is to

1. Sample n values with equal probability from x1, · · · , xn with replacement, and obtain a data set x∗1, · · · , x∗n of a bootstrap sample;

2. Calculate θ̂∗1 = θ̂(x∗1, · · · , x∗n), i.e., use the bootstrap sample to compute the estimate θ̂ and denote it as θ̂∗1;

3. Repeat steps 1 and 2 B times, and denote the estimate calculated from the bth bootstrap sample as θ̂∗b;

4. Denote the α/2 and 1 − α/2 quantiles of θ̂∗1, · · · , θ̂∗B as l∗ and u∗. Then (l∗, u∗) is a 100(1 − α)% bootstrap confidence interval for θ.

The number of repetitions B should be sufficiently large, for example 200, or nowadays often 1000.

PARAMETRIC BOOTSTRAP
Suppose we have a random sample X1, · · · , Xn from a distribution F(x, θ). Given a data set x1, · · · , xn observed from the random sample X1, · · · , Xn, compute an estimate θ̂ of θ, and use F(x, θ̂) to estimate F(x, θ). Another way to obtain a 100(1 − α)% bootstrap confidence interval for θ is to

1. Simulate a data set x∗1, · · · , x∗n from F(x, θ̂);

2. Calculate θ̂∗1 = θ̂(x∗1, · · · , x∗n), i.e., use the bootstrap sample to compute the estimate θ̂ and denote it as θ̂∗1;

3. Repeat steps 1 and 2 B times, and denote the estimate calculated from the bth bootstrap sample as θ̂∗b;

4. Denote the α/2 and 1 − α/2 quantiles of θ̂∗1, · · · , θ̂∗B as l∗ and u∗. Then (l∗, u∗) is a 100(1 − α)% bootstrap confidence interval for θ.

The number of repetitions B should be sufficiently large, for example 200, or nowadays often 1000.


THE BOOTSTRAP - EXAMPLE
Below is a data set x1, · · · , xn of the number of measles cases reported in New York City in each month from Jan 1960 to Nov 1963.

4097, 6780, 6492, 3387, 1822, 469, 129, 51, 43, 78, 105, 227, 298, 374, 384, 644, 683, 343, 185, 109, 123, 383, 1043, 1725, 3056, 5839, 7875, 6555, 2866, 1075, 266, 58, 86, 125, 145, 184, 260, 476, 782, 1200, 1289, 901, 362, 168, 221, 423, 1140

Suppose that the data set x1, · · · , xn is observed from the random variables X1, · · · , Xn, and X1, · · · , Xn are independent and identically distributed.

We assume that Xi has a Pois(λ) distribution. How do we estimate λ and obtain a 95% confidence interval for λ?

NONPARAMETRIC BOOTSTRAP
We can use λ̂ = x̄n to estimate λ.
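The R snippets below assume that the measles counts have been entered as a vector x with n = length(x), for example:

x <- c(4097, 6780, 6492, 3387, 1822, 469, 129, 51, 43, 78, 105, 227,
       298, 374, 384, 644, 683, 343, 185, 109, 123, 383, 1043, 1725,
       3056, 5839, 7875, 6555, 2866, 1075, 266, 58, 86, 125, 145, 184,
       260, 476, 782, 1200, 1289, 901, 362, 168, 221, 423, 1140)
n <- length(x)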

1. Sample n values with equal probability from x1, · · · , xn with replacement, and obtain a data set x∗1, · · · , x∗n of a bootstrap sample;

xstar <- sample(x,size = n, replace = TRUE)

2. Calculate θ̂∗1 = θ̂(x∗1, · · · , x∗n), i.e., use the bootstrap sample to compute the estimate θ̂ and denote it as θ̂∗1;

hlamst <- mean(xstar)

NONPARAMETRIC BOOTSTRAP

3. Repeat steps 1 and 2 B times, and denote the estimate calculated from the bth bootstrap sample as θ̂∗b;

B <- 1000
hlamst <- NULL
for (b in 1:B) {
  xstar <- sample(x, size = n, replace = TRUE)   # resample the data with replacement
  hlamst[b] <- mean(xstar)                       # bootstrap estimate of lambda
}

4. Denote the α/2 and 1 − α/2 quantiles of θ̂∗1, · · · , θ̂∗B as l∗ and u∗. Then (l∗, u∗) is a 100(1 − α)% bootstrap confidence interval for θ.

quantile(hlamst,c(0.025,0.975))


PARAMETRIC BOOTSTRAP

1. Simulate a data set x∗1, · · · , x∗n from F(x, θ̂);

xbar <- mean(x)

xstar <- rpois(n,xbar)

2. Calculate θ̂∗1 = θ̂(x∗1, · · · , x∗n), i.e., use the bootstrap sample to compute the estimate θ̂ and denote it as θ̂∗1;

hlamst <- mean(xstar)

PARAMETRIC BOOTSTRAP

3. Repeat steps 1 and 2 B times, and denote the estimate calculated from the bth bootstrap sample as θ̂∗b;

B <- 1000
hlamst <- NULL
for (b in 1:B) {
  xstar <- rpois(n, xbar)      # simulate from the fitted Poisson model
  hlamst[b] <- mean(xstar)     # bootstrap estimate of lambda
}

4. Denote the α/2 and 1 − α/2 quantiles of θ̂∗1, · · · , θ̂∗B as l∗ and u∗. Then (l∗, u∗) is a 100(1 − α)% bootstrap confidence interval for θ.

quantile(hlamst,c(0.025,0.975))


Chapter 11

Hypothesis tests

HYPOTHESIS TESTS
In Chapter 8, we gave a guess of a statistical model and its parameters to describe the data generation mechanism. How do we evaluate whether it was a good guess? Sometimes we are interested in testing whether a certain claim about the value of a parameter is true, or whether a particular distribution captures the main features of the data. In these cases, hypothesis tests are useful.

1. Set up the null hypothesis, H0

2. Propose the alternative hypothesis, H1 or HA

3. Calculate the test statistic.

4. Calculate the critical/rejection region (or p-value)

5. Conclusion

11.1 Null hypothesis and test statistic

NULL HYPOTHESIS vs ALTERNATIVE HYPOTHESIS
Hypothesis testing: I toss a coin 20 times and get 17 heads. How unlikely is that? Can we continue to believe that the coin is fair when it produces 17 heads out of 20 tosses?

Two sided test: e.g. H0 : θ = θ0, H1 : θ ≠ θ0

If we had a good reason, before conducting an experiment, to believe that the parameter θ could only be = θ0 or > θ0, i.e. that it would be impossible to have θ < θ0, then we use

One sided test: e.g. H0 : θ = θ0, H1 : θ > θ0.

TEST STATISTIC
Test Statistic: Suppose the data set is modeled as the realization of random variables X1, X2, · · · , Xn. A test statistic is any sample statistic T = h(X1, X2, · · · , Xn), whose numerical value is used to decide whether we reject H0.

Essentially: some random variable with a distribution that we can specify exactly under H0.

The difficult part is finding the distribution.


EXAMPLE I - COIN TOSSING
Can we continue to believe that the coin is fair when it produces 17 heads out of 20 tosses? We have a null hypothesis H0: p = 0.5, and an alternative hypothesis H1: p ≠ 0.5. We can use X = the number of heads out of 20 tosses as the test statistic T.

11.2 Tail probabilities

p-VALUES
The p-value is the probability of obtaining a test statistic as extreme or more extreme than

the observed value assuming that the null hypothesis is true.

The smaller the p-value, the more evidence there is against the null hypothesis.

The p-value for the coin-tossing example is 0.0026.
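A minimal R sketch of this calculation: under H0 the number of heads X has a Bin(20, 0.5) distribution, and by symmetry the two-sided p-value is twice the upper tail probability.

2 * pbinom(16, size = 20, prob = 0.5, lower.tail = FALSE)   # P(X >= 17) + P(X <= 3), approximately 0.0026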

11.3 Type I and type II errors

TYPE I AND TYPE II ERRORS
Type I error: A type I error occurs if we reject the null hypothesis H0 when it is true.

Type II error: A type II error occurs if we do NOT reject H0 when it is false.

TYPE I AND TYPE II ERRORS

EXAMPLE II
Assume that the underlying distribution is normal with unknown expectation µ but known variance σ2 = 100. We observed a sample mean x̄ = 52.75 based on n = 52 observations. Now we would like to know whether µ is 60.

Null hypothesis H0: µ = 60

Alternative hypothesis H1: µ ≠ 60

Test statistic T: X̄52


EXAMPLE II - TYPE I ERROR
Question: One decides to reject H0 in favor of H1 if |T − 60| ≥ 25. What is the probability of

committing a type I error?

Solution: < 0.000001
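A minimal R sketch of this tail probability: under H0, T = X̄52 has an N(60, 100/52) distribution, so

2 * pnorm(25, mean = 0, sd = 10 / sqrt(52), lower.tail = FALSE)   # P(|T - 60| >= 25) under H0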

EXAMPLE III
After the introduction of the Euro, Polish mathematicians claimed that the Belgian 1 Euro coin is not a fair coin (see, for instance, the New Scientist, January 4, 2002).

Suppose we put a 1 Euro coin to the test. We will toss it ten times and record X, the number of heads.

X ∼ Bin(10, p)

We would like to find out whether p differs from 1/2.

EXAMPLE III - CONTINUED
Null hypothesis H0:

Alternative hypothesis H1:

Test statistic T:

11.4 Significance level

SIGNIFICANCE LEVEL
Significance level: The significance level is the largest acceptable probability of committing a type I error and is denoted by α, where 0 < α < 1.

Saying the test is significant is a quick way of saying that there is evidence against the null hypothesis at the significance level α.

A small p-value does NOT mean that H0 is definitely wrong.

11.5 Critical region and critical values

CRITICAL REGION AND CRITICAL VALUES
Critical region (or rejection region) and critical values: Suppose we test H0 against H1 at significance level α by means of a test statistic T. The set K ⊂ R that corresponds to all values of T for which we reject H0 in favor of H1 is called the critical region (or rejection region).

Values on the boundary of the critical region are called critical values.


Figure 11.1: Critical region for 1-sided (left) and 2-sided (right) tests.

CRITICAL VALUE AND p-VALUE

Figure 11.2: P-value and critical value.

11.6 Power

POWER
Power: The probability that the null hypothesis is rejected when it is false is called the power

of the test.

EXAMPLE I
Polygraphs that are used in criminal investigations are supposed to indicate whether a person is lying or telling the truth.

An experienced polygraph examiner was asked to make an overall judgment for each of a total of 280 records, of which 140 were from guilty suspects and 140 from innocent suspects. The results are


                            Suspect’s true status
                            Innocent    Guilty
Examiner’s    Acquitted        131         15
assessment    Convicted          9        125

We view each judgment as a problem of hypothesis testing, with H0 corresponding to ‘suspect is innocent’ and H1 to ‘suspect is guilty’. Estimate the probabilities of a type I error and a type II error.

EXAMPLE II
Assume that the underlying distribution is normal with unknown mean µ but known variance σ2 = 100. We observed a sample mean x̄ = 52.75 based on n = 52 observations. Now we would like to know whether µ is less than 60. What is the rejection region at significance level 0.05?

Null hypothesis H0: µ = 60

Alternative hypothesis H1: µ < 60

Test statistic T: \frac{\bar{X} - 60}{\sigma/\sqrt{n}} = \frac{\bar{X} - 60}{10/\sqrt{52}} \sim N(0, 1)

Rejection region:

EXAMPLE III
After the introduction of the Euro, Polish mathematicians claimed that the Belgian 1 Euro coin is not a fair coin (see, for instance, the New Scientist, January 4, 2002).

Suppose we put a 1 Euro coin to the test. We will throw it ten times and record X, the number of heads.

X ∼ Bin(10, p)

We would like to find out whether p differs from 1/2.

EXAMPLE III - CONTINUED
Null hypothesis H0:

Alternative hypothesis H1:

Test statistic T:

Rejection region:

Type I error:


11.7 Relation with confidence intervals

HYPOTHESIS TESTS VS CONFIDENCE INTERVALS
If the random variable that is used to construct the confidence interval relates appropriately to the test statistic, then the following holds.

Suppose that for some parameter θ we test H0 : θ = θ0. Then

• We reject H0 : θ = θ0 in favor of H1 : θ > θ0 at level α if and only if θ0 is not in the 100(1 − α)% one-sided confidence interval for θ.

• The same relation holds for testing against H1 : θ < θ0.

• We reject H0 : θ = θ0 in favor of H1 : θ ≠ θ0 at level α if and only if θ0 is not in the 100(1 − α)% two-sided confidence interval for θ.


Chapter 12

Bayesian Analysis

12.1 Bayesian Statistics

Frequentist Statistics
Estimation is based on finding a good point estimate. The performance of the estimate is

assessed under repetitions of the experiment (or survey) that gave rise to the data.

Bayesian Statistics
Estimation is based on the idea of expressing uncertainty about the (unknown) state of nature in terms of probability. We start with a probability distribution reflecting our current state of knowledge (prior distribution). When new data become available, we update the probability distribution in light of the new data (posterior distribution).

Bayesian Statistics

• Prior distribution — h(θ)

• Likelihood — p(y|θ)

• Posterior probability — p(θ|y)

12.2 Bayes’ theorem

Bayes’ theorem
Suppose the events B1, B2, · · · , Bm are disjoint and B1 ∪ B2 ∪ · · · ∪ Bm = Ω. The conditional

probability of Bi, given an arbitrary event A, can be expressed as:

P(B_i \mid A) = \frac{P(A \mid B_i) \cdot P(B_i)}{\sum_{j=1}^m P(A \mid B_j)\, P(B_j)}.


Bayes’ theorem: Example I
An item is produced in 3 different factories, C1, C2, C3.

The proportions produced in the 3 factories, and the proportions defective in each, are as follows:

An item is purchased and found to be defective. This is event D.

What is the probability that it was from factory C1?

Bayes’ theorem for Continuous Random Variables
Suppose X and Y are continuous random variables with probability density functions fX(x) and fY (y) respectively, and with joint probability density function fXY (x, y). If fX|Y (x | y) is the probability density of X given that Y = y, then

f_{Y|X}(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x \mid y)\, f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x \mid y)\, f_Y(y)\, dy}.

Bayes’ theorem: Example II
Consider two random variables X and Y. Given the joint density function

f_{XY}(x, y) = \begin{cases} \lambda^2 e^{-\lambda y}, & 0 \le x \le y \\ 0, & \text{otherwise,} \end{cases}

find the marginal density fY (y), and the conditional density of X given Y = y.

12.3 Bayesian Estimation

POSTERIOR DISTRIBUTION
In Bayesian inference:

• h(θ) — describes our prior belief or knowledge of the parameter θ

• fY |θ(y) — describes our model for the data conditional on a particular value of θ

• p(θ|y) — describes the uncertainty about the parameter θ once we have observed data

p(\theta \mid y) = \frac{f_Y(y \mid \theta)\, h(\theta)}{\int_{-\infty}^{\infty} f_Y(y \mid \theta)\, h(\theta)\, d\theta},

or

p(\theta \mid y) \propto f_Y(y \mid \theta)\, h(\theta)

Posterior ∝ Likelihood × Prior


CREDIBLE INTERVAL
In Bayesian inference:

• we refer to credible intervals or credible sets to describe interval estimates

• a 95% credible interval for θ is given by two limits L and U such that P(L < θ < U) = 0.95, and is evaluated from the posterior distribution

• Formally, if A defines a credibility set (A must be contained within the parameter space), then the credible probability of A is

P(\theta \in A \mid y) = \int_A p(\theta \mid y)\, d\theta.

AN EXAMPLE
Let X denote the annual number of global large earthquakes with minimum magnitude 7. Assume that X has a Poisson distribution with parameter λ.

Given that λ is greater than 0, we can use a gamma distribution to model the uncertainty about λ.

The number of earthquakes that occurred each year from 1990 to 1992 is x1 = 18, x2 = 16, x3 = 13. Suppose that the number of large earthquakes in each year is independent of that in other years. Find the posterior distribution for λ.

AN EXAMPLE (CONTINUED)
The prior distribution is:

h(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\lambda\beta}

The likelihood is:

f(x_1, x_2, x_3 \mid \lambda) = \frac{e^{-\lambda}\lambda^{x_1}}{x_1!} \cdot \frac{e^{-\lambda}\lambda^{x_2}}{x_2!} \cdot \frac{e^{-\lambda}\lambda^{x_3}}{x_3!}

The posterior distribution of λ is:

PRIORS
Choosing priors is often subjective. In the example above, we can use a gamma distribution with parameters α = 17 and β = 1.2 as our prior. With this prior and the observed numbers of earthquakes, we can update our knowledge about λ using the posterior distribution. The posterior distribution is a gamma distribution with parameters α = 47 + 17 and β = 3 + 1.2.
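A minimal R sketch of this update and of a 95% credible interval taken from the posterior quantiles (the shape/rate parameterisation matches the form of the prior density above):

alpha_post <- 47 + 17
beta_post <- 3 + 1.2
qgamma(c(0.025, 0.975), shape = alpha_post, rate = beta_post)   # 95% credible interval for lambda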

PRIORS
We can also assign a prior that is much less informative. For example, we can use a gamma distribution with parameters α = 0 and β = 0 as a prior,

\frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\lambda\beta} = \frac{1}{\lambda},

where λ ∈ (0,∞). The posterior distribution of λ is then a gamma distribution with parameters α = 47 and β = 3.

Priors that do not have a finite integral are called improper priors. Improper priors can still result in proper posteriors.
