MA295S.10 - ELEMENTS OF PROBABILITY

This course introduces the essential concepts of probability to MS and PhD Math students through topics such as random variables and their transformations, special probability distributions, expectations and cumulants, and probability generating functions. It also discusses modes of convergence, the Central Limit Theorem, the Laws of Large Numbers, and other special topics. Emphasis is on proving results, to prepare students for a more intensive probability course grounded on measure theory.

Summer 2013, 12:00 - 1:30 pm, SEC A209

Richard B. Eden, Ph.D.
[email protected]
Mathematics Department
Ateneo de Manila University

Course Requirements

1. Regular problem sets
2. Midterm exams - written
3. Oral presentation on a special topic or paper on probability at the level of this course

1. Sample Space and Probability

1.1. Probabilistic Models

There are two types of phenomena.

1. Deterministic - repeated observations under a specified set of conditions invariably lead to the same outcome. A ball initially at rest is dropped from a height of s meters inside an evacuated cylinder. We observe the time it falls: t = √(2s/g).

2. Random or nondeterministic - repeated observations under a specified set of conditions do not lead to the same result. A coin is tossed. We observe the result: heads or tails.

A probabilistic model is a mathematical description of an uncertain event. It involves an underlying process, called the experiment, that will produce exactly one out of several possible outcomes.

Given an experiment:

• The sample space, denoted by Ω, is the set of all possible outcomes.
• A sample outcome or a sample point is any element ω ∈ Ω.
• An event is any subset A ⊆ Ω.

Example 1 Consider the experiment of tossing a coin three times.

• Suppose we receive $1 each time a head comes up. We can then work with Ω = {0, 1, 2, 3}.
• Suppose we receive $1 for every coin toss, up to and including the first time a head comes up; then we receive $2 for every coin toss, up to the second time a head comes up. In general, the amount per toss is doubled each time a head comes up. Then we need to work with Ω = {(a, b, c) : a, b, c = H or T}.

Example 2 Consider the experiment of choosing real values for the coefficients of ax² + bx + c = 0. Then Ω = {(a, b, c) | a ≠ 0}. If A is the event for which the equation has no real solution, then A = {(a, b, c) | b² − 4ac < 0}.

Next, for each experiment with sample space Ω, we want to choose events to which we must associate probabilities. Let F be a collection of events, i.e., a collection of subsets of Ω.

Definition F is called a σ-field of Ω if the following are satisfied.

1. Ω ∈ F
2. A ∈ F ⇒ Aᶜ ∈ F
3. A1, A2, . . . ∈ F ⇒ ⋃_{j=1}^∞ Aj ∈ F

Example 3 Here are three simple examples of σ-fields of Ω.

• F = {∅, Ω}
• F = power set of Ω = all subsets of Ω
• If A ⊂ Ω, F = {∅, A, Aᶜ, Ω}.

Remark A σ-field is closed under the taking of complements, countable unions and countable intersections.


• A1, A2, . . . ∈ F ⇒ ⋂_{j=1}^∞ Aj ∈ F
• If A, B ∈ F, then the following are also in F: A ∪ B, A ∩ B, A − B = A ∩ Bᶜ, B − A, and A △ B = (A − B) ∪ (B − A).

Finally, we want to allocate a probability to each event A ∈ F .

Definition A mapping P : F → R is called a probability measure on (Ω, F) if the following properties are satisfied.

(K1) P(A) ≥ 0 for all A ∈ F
(K2) P(Ω) = 1
(K3) If A1, A2, . . . ∈ F and Ai ∩ Aj = ∅ for i ≠ j, then P(⋃_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).

Remark If A ∩ B = ∅, then we say that A and B are mutually exclusive. (K1), (K2) and (K3) are called Kolmogorov’s Axioms, named after A. Kolmogorov, one of the fathers of probability theory. The triple (Ω, F, P) is called a probability space. P is also referred to as a probability law.

Remark In many instances, we will talk about probability models without explicitly mentioning the underlying probability space. It will be assumed that we are working under an appropriate probability space (Ω, F, P).

Theorem Let A,B ∈ F .

1. P(Aᶜ) = 1 − P(A)
2. P(∅) = 0
3. If A ⊆ B, then P(A) ≤ P(B).
4. P(A) ≤ 1
5. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Corollary Let A,B,C ∈ F .

1. P(A ∪ B) ≤ P(A) + P(B)
2. P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(C ∩ A) + P(A ∩ B ∩ C)
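
Inclusion-exclusion can be verified by exact enumeration. A small sketch of my own, using two fair dice with A = {sum is 7} and B = {first die is 4}:

```python
# Exact check of P(A ∪ B) = P(A) + P(B) − P(A ∩ B) on 36 equally likely outcomes.
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = lambda ev: Fraction(sum(ev(w) for w in outcomes), len(outcomes))

A = lambda w: w[0] + w[1] == 7
B = lambda w: w[0] == 4

lhs = P(lambda w: A(w) or B(w))
rhs = P(A) + P(B) - P(lambda w: A(w) and B(w))
print(lhs, rhs, lhs == rhs)   # 11/36 11/36 True
```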

Example 4 Suppose that P (A) = 0.9 and P (B) = 0.8.

1. Show that P(A ∩ B) ≥ 0.7.
2. If P(A ∩ B) = 0.75, find P(A − B) and P(Aᶜ ∩ Bᶜ).

Example 5 A sequence of events {An}n≥1 is said to be an increasing sequence if A1 ⊆ A2 ⊆ ⋯. Define A = lim_{n→∞} An = ⋃_{n=1}^∞ An. Prove that lim_{n→∞} P(An) = P(A).

Example 6 Consider an experiment involving three tosses of a fair coin. What are Ω, F and P? If A = {exactly 2 heads occur}, find P(A).

Theorem If the sample space consists of a finite number of possible outcomes, then for any event A = {s1, s2, . . . , sm}, P(A) = P(s1) + P(s2) + ⋯ + P(sm).


Theorem If the sample space consists of n possible outcomes which are equally likely, then the probability of any event A is given by P(A) = #A / n.

Example 7 One card is drawn at random from a deck of 52 cards. What is the probability that it is neither a heart nor a queen?

Example 8 A fair coin is tossed three times. What is the probability that heads show up an even number of times?
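
Both examples reduce to counting under P(A) = #A/n. A sketch of my own, assuming a standard 52-card deck and a fair coin:

```python
from itertools import product

# Example 7: neither a heart nor a queen (13 hearts + 4 queens − 1 overlap = 16).
ranks = [str(r) for r in range(2, 11)] + list("JQKA")
deck = [(r, s) for r in ranks for s in ["hearts", "diamonds", "clubs", "spades"]]
good = [c for c in deck if c[1] != "hearts" and c[0] != "Q"]
print(len(good), "/", len(deck))      # 36 / 52, i.e. 9/13

# Example 8: an even number of heads (0 or 2) in three tosses.
tosses = list(product("HT", repeat=3))
even = [t for t in tosses if t.count("H") % 2 == 0]
print(len(even), "/", len(tosses))    # 4 / 8, i.e. 1/2
```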

1.2 Conditional Probability

On a throw of a fair die, the probability we get a 4 is 1/6. But if we already know beforehand that the outcome is an even number, the probability is 1/3. We write P(A|B) to denote the probability of event A given event B, i.e., given that event B has occurred.

Definition Let A and B be events with P(B) > 0. The (conditional) probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

Example 9 The probability that a regularly scheduled flight departs on time is P(D) = 0.83; the probability that it arrives on time is P(A) = 0.82; and the probability that it departs and arrives on time is 0.78. Find the probability that a plane

1. arrives on time given that it departed on time,
2. departed on time given that it has arrived on time, and
3. arrives on time given that it did not depart on time.
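
A quick sketch computing the three answers directly from the definition P(A|B) = P(A ∩ B)/P(B); the only extra step is P(A ∩ Dᶜ) = P(A) − P(A ∩ D).

```python
P_D, P_A, P_DA = 0.83, 0.82, 0.78     # depart on time, arrive on time, both

print(P_DA / P_D)                      # 1. P(A|D)   ≈ 0.94
print(P_DA / P_A)                      # 2. P(D|A)   ≈ 0.95
print((P_A - P_DA) / (1 - P_D))        # 3. P(A|D^c) ≈ 0.24
```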

Theorem Let (Ω, F, P) be a probability space, and B ∈ F with P(B) > 0. Then (Ω, F, Q) is also a probability space, where Q : F → R is defined as Q(A) = P(A|B).

Since conditional probabilities constitute a legitimate probability law, all general properties of probability laws remain valid.

Corollary Let A, B, C ∈ F with P(B) > 0.

1. P(Aᶜ|B) = 1 − P(A|B)
2. If A ⊆ C, then P(A|B) ≤ P(C|B).
3. P(A ∪ C|B) = P(A|B) + P(C|B) − P(A ∩ C|B)

Example 10 We toss a fair coin three times. Find P(A|B) where A = {more heads than tails come up} and B = {1st toss is a head}.

Example 11 Find P (A ∩B) if P (A) = 0.2, P (B) = 0.4 and P (A|B) + P (B|A) = 0.75.

Theorem P(A ∩ B) = P(A|B)P(B) and P(A ∩ B ∩ C) = P(A|B ∩ C)P(B|C)P(C).

Example 12 If an aircraft is present in a certain area, a radar correctly registers its presence with probability 0.99. If it is not present, the radar falsely registers an aircraft presence with probability 0.10. We assume that an aircraft is present with probability 0.05. What is the probability of false alarm (a false indication of aircraft presence), and the probability of missed detection (nothing registers, even though an aircraft is present)?


Example 13 Three cards are drawn (in order) from an ordinary deck without replacement. Find the probability that none of the cards is a heart.

Example 14 Consider the set of families having two children. Assume that the four possible birth sequences, taking gender into account - BB, BG, GB, GG - are equally likely. What is the probability that both children are boys given that at least one is a boy?

1.3 Total Probability Theorem and Bayes’ Rule

A partition of Ω is a collection of events A1, A2, . . . , An such that Ω = ⋃_{i=1}^n Ai and Ai ∩ Aj = ∅ for i ≠ j.

Theorem (Total Probability Theorem) Let A1, A2, . . . , An be a partition of Ω with P(Ai) > 0 for i = 1, 2, . . . , n. Then for any B ∈ F,

P(B) = Σ_{i=1}^n P(B|Ai)P(Ai) = P(B|A1)P(A1) + ⋯ + P(B|An)P(An).

Example 15 Two numbers are chosen at random from among the numbers 1 to 10 without replacement. Find the probability that the second number chosen is 5.

Example 16 We roll a fair four-sided die. If the result is 1 or 2, we roll once more, but otherwise we stop. What is the probability that the sum total of our rolls is at least 4?

Example 17 Alice is taking a probability class, and at the end of each week she can be either up-to-date or she may have fallen behind. If she is up-to-date in a given week, the probability that she will be up-to-date (or behind) in the next week is 0.8 (or 0.2, respectively). If she is behind in a given week, the probability that she will be up-to-date (or behind) in the next week is 0.6 (or 0.4, respectively). Alice is (by default) up-to-date when she starts the class. What is the probability that she is up-to-date after three weeks?
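
This is a total-probability recursion: writing Uk for the event that Alice is up-to-date after week k, P(U_{k+1}) = 0.8 P(Uk) + 0.6 (1 − P(Uk)). A sketch:

```python
p_up = 1.0                                  # up-to-date at the start (week 0)
for week in range(3):
    p_up = 0.8 * p_up + 0.6 * (1 - p_up)    # total probability theorem
print(p_up)                                 # 0.752
```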

Theorem (Bayes’ Rule) Let A1, A2, . . . , An be a partition of Ω with P(Ai) > 0 for i = 1, 2, . . . , n. Then for any B ∈ F with P(B) > 0,

P(Ai|B) = P(Ai)P(B|Ai) / P(B) = P(Ai)P(B|Ai) / [P(A1)P(B|A1) + ⋯ + P(An)P(B|An)].

Remark Bayes’ Rule is often used for inference. There are a number of causes A1, A2, . . . , An that may result in a certain effect B. Given that the effect B has been observed, we wish to evaluate the probability that the cause Ai is present.

Example 18 A company producing electric relays has three manufacturing plants producing 50, 30 and 20 percent, respectively, of its product. Suppose that the probabilities that a relay manufactured by these plants is defective are 0.02, 0.05, and 0.01, respectively.

1. If a relay is selected at random from the output of the company, what is the probability that it is defective?

2. If a relay selected at random is found to be defective, what is the probability that it was manufactured by plant 2?
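
A sketch of the two computations, using the total probability theorem for part 1 and Bayes’ Rule for part 2:

```python
priors = [0.50, 0.30, 0.20]     # P(plant i)
defect = [0.02, 0.05, 0.01]     # P(defective | plant i)

p_def = sum(p * d for p, d in zip(priors, defect))   # total probability
print(p_def)                                         # 0.027
print(priors[1] * defect[1] / p_def)                 # Bayes: P(plant 2 | defective) ≈ 0.556
```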


Example 19 A simple binary communication channel carries messages by using only two signals, 0 and 1. For this channel, 40% of the time a 1 is transmitted. The probability that a transmitted 0 is correctly received is 0.90, and the probability that a transmitted 1 is correctly received is 0.95.

1. What is the probability of a 1 being received?
2. Given that a 1 is received, what is the probability that a 1 was transmitted?

Example 20 You know that a certain letter is equally likely to be in any one of three different folders. Let αi < 1 (i = 1, 2, 3) be the probability that you will find your letter upon making a quick examination of Folder i if the letter is, in fact, in Folder i. Suppose you look in Folder 1 and do not find the letter. What is the probability that the letter is in Folder 1?

1.4 Independence

If P(A|B) = P(A), the knowledge that event B has occurred has no effect on the probability of event A. Consequently, P(A ∩ B) = P(A)P(B), and also, P(B|A) = P(B).

Definition The events A and B are independent if P (A ∩B) = P (A)P (B).

If P (A) > 0 and P (B) > 0, then A and B are independent iff P (A|B) = P (A) or P (B|A) = P (B).

Example 21 Consider the experiment of throwing two fair dice. Let A be the event that the sum of the dice is 7, B the event that the sum is 6, and C the event that the first die is 4. Show that the events A and C are independent, but events B and C are not independent.

Example 22 Bill and George go target shooting together. Both shoot at a target at the same time. Suppose Bill hits the target with probability 0.7, whereas George, independently, hits the target with probability 0.4.

1. Given that exactly one shot hit the target, what is the probability that it was George’s shot?
2. Given that the target is hit, what is the probability that George hit it?

Theorem If A and B are independent, then so are (a) A and Bᶜ, (b) Aᶜ and B, and (c) Aᶜ and Bᶜ.

Definition The events A1, A2, . . . , An are independent if P(⋂_{i∈S} Ai) = ∏_{i∈S} P(Ai) for every subset S of {1, 2, . . . , n}.

Remark The three events A, B, C are independent if all the following are satisfied: (i) P(A ∩ B) = P(A)P(B), (ii) P(B ∩ C) = P(B)P(C), (iii) P(C ∩ A) = P(C)P(A), and (iv) P(A ∩ B ∩ C) = P(A)P(B)P(C). A, B, C are said to be pairwise independent if (i)-(iii) hold.

Example 23 Let A1, A2, . . . , An be independent events, and P(Ai) = p for each Ai.

1. What is the probability that none of the Ai’s occur?
2. Show that the probability that an even number of the Ai’s occur is (1/2)(1 + (q − p)ⁿ), where q = 1 − p.
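
The formula in part 2 can also be checked by simulation. A Monte Carlo sketch of my own; the simulated value should agree with the exact one up to sampling noise:

```python
import random

def even_count_prob(n: int, p: float, trials: int = 100_000) -> float:
    hits = 0
    for _ in range(trials):
        k = sum(random.random() < p for _ in range(n))   # how many A_i occurred
        hits += (k % 2 == 0)
    return hits / trials

n, p = 5, 0.3
q = 1 - p
print(even_count_prob(n, p))       # simulated
print((1 + (q - p) ** n) / 2)      # exact: 0.50512
```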


1.5 Problems

1. Prove that P(⋃_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai).

2. A sequence of events {An}n≥1 is said to be a decreasing sequence if A1 ⊇ A2 ⊇ ⋯. Define A = lim_{n→∞} An = ⋂_{n=1}^∞ An. Prove that lim_{n→∞} P(An) = P(A).

3. The numbers 1, 2, 3, . . . , n are arranged in random order. Find the probability that 1, 2, 3 are neighbors in the ordering.

4. A committee of 5 persons is to be selected randomly from a group of 5 men and 10 women.

(a) Find the probability that the committee consists of 2 men and 3 women.
(b) Find the probability that the committee consists of all women.

5. A die is loaded in such a way that the probability of each face turning up is proportional to the number of dots on that face. What is the probability of getting an even number in one throw?

6. 90% of men and 95% of women in a population are right-handed. In the population, 52% are men and 48% are women. What is the probability that a randomly selected person is right-handed?

7. Show that if P (A|B) > P (A), then P (B|A) > P (B).

8. Suppose P(A) = 2/5, P(A ∪ B) = 3/5, P(B|A) = 1/4, P(C|B) = 1/3 and P(C|A ∩ B) = 1/2. Find P(A|B ∩ C).

9. Prove: P(⋂_{i=1}^n Ai) = P(An|A1 ∩ A2 ∩ ⋯ ∩ An−1) ⋯ P(A3|A1 ∩ A2)P(A2|A1)P(A1).

10. Justin hasn’t decided yet where to have his next concert. There is a 50% chance that it will be in China, a 30% chance it will be in India, and a 20% chance it will be in the Philippines. The probability of Selena attending the concert is 30% if it is in China, 10% if in India, and 20% if in the Philippines. If Selena ends up attending the concert, what is the probability that Justin chose the Philippines?

11. One bag contains 4 white balls and 3 black balls, and a second bag contains 3 white balls and 5 black balls. One ball is drawn from the first bag and placed unseen in the second bag. What is the probability that a ball now drawn from the second bag is black?

12. If the events A, B and C are independent, are A and B ∪ C necessarily independent? Prove, or show a counterexample.

13. Given an event C, A and B are said to be conditionally independent if P(A ∩ B|C) = P(A|C)P(B|C). In this case, prove that P(A|B ∩ C) = P(A|C), i.e., if C is known to have occurred, the additional knowledge that B also occurred does not change the probability of A.

14. How many times should a fair die be rolled so that the probability of

(a) at least one six is at least 0.9?
(b) at least two sixes is at least 0.9?

15. In the experiment of throwing two fair dice, consider the events A = {the first die is odd}, B = {the second die is odd}, and C = {the sum is odd}. Show that A, B, C are pairwise independent, but not independent.

16. In the experiment of throwing two fair dice, consider the events A = {the first die is 1, 2 or 3}, B = {the first die is 3, 4 or 5}, and C = {the sum is 9}. Show that P(A ∩ B ∩ C) = P(A)P(B)P(C), but A, B, C are not independent.


2. Random Variables

2.1. Concept of a Random Variable

In many probabilistic models, the outcomes are of a numerical nature (e.g., instrument readings, stock prices). In other experiments, the outcomes are not numerical, but they may be associated with some numerical value of interest.

Definition A random variable (r.v.) X is a real-valued function of the experimental outcome.

A r.v. is a function that associates a real number with each sample point. Given the probability space (Ω, F, P), a (measurable) function X : Ω → R is a r.v. A r.v. is usually denoted by an upper case letter.

Example 1 Consider the experiment of tossing a coin three times, and noting whether each toss results in a head or a tail. A sample point is ω1 = HTH. Let X be the r.v. which denotes the number of heads. Then for this outcome, X = 2; strictly speaking, X(ω1) = 2. For ω2 = HTT, X = 1, i.e., X(ω2) = 1.

Example 2 Consider the experiment of tossing two dice. Let X and Y be the r.v.s representing the sum and the outcome of the second die, respectively. For the outcome ω = (3, 4), X = 7 and Y = 4, i.e., X(ω) = 7 and Y(ω) = 4.

Example 3 Suppose Ω = [−1, 1]. Define the r.v.s X, Y and Z as X(ω) = ω, Y(ω) = ω² and Z(ω) = sgn(ω). If we think of the experiment of randomly choosing any number between −1 and 1 (inclusive), then X is the outcome, Y = X² is its square, and Z is any of three possible values: −1 if we picked a negative number, 0 if we picked 0, and 1 if we picked a positive number.

Definition A random variable is discrete if its set of possible values is finite or at most countably infinite,and continuous if it takes on values on a continuous scale.

In the preceding example, Z is a discrete r.v. while X and Y are continuous. X takes on all values in [−1, 1], Y takes on all values in [0, 1], and Z takes on the values −1, 0, 1. In practical problems, continuous r.v.’s represent measured data such as all possible heights, temperatures, or distances, whereas discrete random variables represent count data, such as the number of defectives in a sample of 7 items, or the number of highway fatalities per month in a city.

2.2 Discrete Random Variables

A discrete random variable can be analyzed through its probability mass function.

Definition Given a discrete r.v. X whose range of values is S, its probability mass function (PMF) or probability distribution is a function p : S → R for which

1. p(x) ≥ 0 for all x ∈ S,
2. Σ_{x∈S} p(x) = 1, and
3. P(X = x) = p(x) for all x ∈ S.

Remark Strictly speaking, when we write P(X = x), we mean P({X = x}), the probability of the event where X = x, i.e., the event {ω ∈ Ω : X(ω) = x}. If there’s more than one r.v. being considered, we may choose to avoid ambiguity by writing pX instead of p. Also, we may extend the domain of p from S to R by letting p(x) = 0 if x ∉ S.


Example 4 A r.v. X has possible outcomes x = 0, 1, 2, 3. Find c if its PMF is p(x) = c(x² + 4).

Example 5 A fair coin is tossed until a head comes up for the first time. Let the r.v. Y be the number of the toss on which this first head comes up. What is the PMF of Y? Verify this is a valid PMF. What is the probability that the first head happens on an odd-numbered toss?

Example 6 A biased coin is tossed, which comes up a head with probability p, and a tail with probability 1 − p. Let X = 1 if the outcome is a head, and X = 0 if the outcome is a tail. What is the PMF of X?

An experiment with two possible outcomes - “success” and “failure” - is called a Bernoulli trial. Define the r.v. X to be X = 1 when the result is a “success” and X = 0 when the result is a “failure”. X is then said to be a Bernoulli random variable. We also say X has a Bernoulli distribution, with PMF as in the preceding example.

If Y = g(X) is a function of a r.v. X, then Y is also a r.v. If X is discrete and has PMF pX, then Y is also discrete, with PMF pY characterized by pY(y) = Σ_{x : g(x)=y} pX(x).

Example 7 The r.v. X has PMF

pX(x) = 2/7 if x is an integer in the range [−1, 1],
        1/7 if x = 2,
        0   otherwise.

Find the PMF of Y = X².
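
The PMF of Y = g(X) is obtained mechanically by accumulating pX(x) over each fiber {x : g(x) = y}. A sketch using the PMF above:

```python
from fractions import Fraction
from collections import defaultdict

p_X = {-1: Fraction(2, 7), 0: Fraction(2, 7), 1: Fraction(2, 7), 2: Fraction(1, 7)}
g = lambda x: x * x

p_Y = defaultdict(Fraction)            # Fraction() is 0
for x, p in p_X.items():
    p_Y[g(x)] += p
print(dict(p_Y))                       # p_Y: 0 -> 2/7, 1 -> 4/7, 4 -> 1/7
```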

Definition Given a discrete r.v. X with PMF p, its cumulative distribution function (CDF) is the function F defined by F(x) = P(X ≤ x) = Σ_{t≤x} p(t).

Example 8 Find the CDF F of the r.v. X from the preceding example and graph it. Use the CDF to find P(X = 0).

If xi is a value assumed by X, and xi−1, another value assumed by X, is the largest possible value less than xi, then clearly, p(xi) = F(xi) − F(xi−1). The CDF of X has a piecewise constant, staircase-like form.

2.3 Continuous Random Variables

Definition Given a continuous r.v. X, its probability density function (PDF), if it exists, is a function f : R → R for which

1. f(x) ≥ 0 for all x ∈ R,
2. ∫_{−∞}^∞ f(x) dx = 1, and
3. P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

Remark A continuous r.v. has a probability of 0 of assuming exactly any of its values. If X has PDF f,

P(X = a) = ∫_a^a f(x) dx = 0

and so

P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a < X < b).

Consider a r.v. whose values are the heights of all people. Between any two values, say, 163.99 and 164.01 cm, there are an infinite number of heights. The probability of selecting a person at random who is exactly 164 cm tall, and not one of the infinitely large set of heights so close to 164 cm that you cannot humanly measure the difference, is remote, and thus we assign a probability of 0 to the event.

Remark If B ∈ B(R), then P(X ∈ B) = ∫_B f(x) dx.

Example 9 A r.v. X has PDF

f(x) = c/(x + 5)³ if x > 0,
       0          elsewhere,

for some constant c. Find (a) c, (b) P(X > 7), (c) P(X + 8/X > 6), and (d) P(X is an even integer).

Remark For convenience, we will refer to the support of a continuous r.v. X, denoted SX, as the set of values x where its density is positive: f(x) > 0. The support is essentially the set of values that X assumes. In the example above, SX = (0, ∞). Strictly speaking, integrations involving the density should only be over the support. For convenience, we will indicate these as integrations over (−∞, ∞) without sacrificing correctness, since f(x) = 0 anyway outside the support.

Definition Given a continuous r.v. X with PDF f , its cumulative distribution function (CDF) is the

function F defined by F (x) = P (X ≤ x) =∫ x

−∞f(x) dx.

Remark It follows that (d/dx)F(x) = f(x) at points x where F has a derivative, P(a < X < b) = F(b) − F(a), and P(X > a) = 1 − F(a). Also, F is continuous on (−∞, ∞).

Example 10 Let λ > 0. A r.v. X with density f(x) = λe^{−λx}, x > 0, is said to be an exponential random variable with parameter λ (we also say X has exponential distribution). Find the CDF of X.

Example 11 Let X have CDF

F(x) = 0     if x < 0,
       x²/R² if 0 ≤ x ≤ R,
       1     if x > R.

Find its PDF.

Remark Even though in the preceding example F is not differentiable at R, we can “define” f(R) = 0; the resulting f will be a density for X.

If the r.v. Y = g(X) is a function of X, we can find the PDF of Y by differentiating the CDF of Y. We can find the latter by expressing it in terms of the CDF of X, because FY(y) = P(g(X) ≤ y).

Example 12 Let X be a r.v. with PDF fX(x) = (1 + x)/2, −1 < x < 1. Find the PDF of W = 4X, Y = 2 − 3X and Z = X².
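
A simulation sketch of my own for the CDF method applied to Z = X². Here FX(x) = (x + 1)²/4 on (−1, 1), so inverse-CDF sampling gives X = 2√U − 1 for uniform U, and the CDF method yields FZ(z) = FX(√z) − FX(−√z) = √z for 0 < z < 1.

```python
import math, random

def sample_X() -> float:
    return 2 * math.sqrt(random.random()) - 1   # inverse-CDF sampling of f_X

N, z = 100_000, 0.25
emp = sum(sample_X() ** 2 <= z for _ in range(N)) / N
print(emp, math.sqrt(z))                        # both ≈ 0.5
```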

Are there random variables that are neither discrete nor continuous? Yes. Consider the r.v. X which assumes all values in [1, 3], with the following properties: P(X = 1) = 2/9, P(X = 3) = 1/9, and it has density p(x) = 1/x² for 1 < x < 3, i.e., P(a < X < b) = ∫_a^b (1/x²) dx for 1 < a ≤ b < 3.


2.4 Expectation and Variance

A biased coin is more likely to turn up heads than tails. Let X = 1 if it turns up heads and X = −1 if it turns up tails. Suppose P(X = 1) = 0.6, and that in fact, in 100 throws of the coin, 60 turned up heads and 40 turned up tails. X is recorded for each throw. What is the average of the values of X?

Definition Let X be a r.v. with probability distribution (PMF/PDF) f(x). The mean or expected value or expectation of X, denoted by E(X), µ or µX, is

E(X) = Σ_x x f(x) if X is discrete

and

E(X) = ∫_{−∞}^∞ x f(x) dx if X is continuous.

Example 13 A r.v. X has Poisson distribution with parameter λ > 0 if it has PMF p(x) = λˣe^{−λ}/x! for x = 0, 1, 2, . . .. Verify that this is a valid PMF, and find E(X).
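
A numeric sketch of my own: truncating the series verifies, up to a negligible tail, that the PMF sums to 1 and that E(X) = λ.

```python
import math

lam = 2.5
p = lambda x: lam ** x * math.exp(-lam) / math.factorial(x)

# 100 terms are plenty for lam = 2.5; the tail beyond is negligible.
print(sum(p(x) for x in range(100)))       # ≈ 1.0
print(sum(x * p(x) for x in range(100)))   # ≈ 2.5 = lam
```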

Example 14 Recall that a r.v. X has exponential distribution with parameter λ > 0 if it has PDF f(x) = λe^{−λx} for x > 0. Find E(X).

Remark If X assumes infinitely many values, the infinite sum or improper integral defining E(X) may not be well-defined. To remedy this, we will require that the sum/integral in the definition converge absolutely: Σ_x |x| f(x) < ∞, ∫_{−∞}^∞ |x| f(x) dx < ∞. Otherwise, X does not have finite expectation and E(X) is undefined. In fact, E(X) < ∞ iff E(|X|) < ∞.

Theorem Let X be a r.v. with probability distribution f(x). Given a function g : R → R, the r.v. g(X) has expected value

E(g(X)) = Σ_x g(x) f(x) if X is discrete

and

E(g(X)) = ∫_{−∞}^∞ g(x) f(x) dx if X is continuous,

provided Σ_x |g(x)| f(x) < ∞ or ∫_{−∞}^∞ |g(x)| f(x) dx < ∞.

Example 15 Find E(7X + 1) for the r.v. in Example 7, and E(X²) for the r.v.s in Example 13 and Example 14.

Theorem Let the r.v. X have finite expectation. Let g and h be functions for which g(X) and h(X) both have finite expectations.

1. E(aX + b) = aE(X) + b for any constants a and b.
2. E(g(X) + h(X)) = E(g(X)) + E(h(X))
3. If g(X) ≥ h(X), then E(g(X)) ≥ E(h(X)).
4. |E(X)| ≤ E(|X|)


Theorem Let X be a nonnegative integer-valued random variable. If X has finite expectation, then

E(X) = Σ_{x=1}^∞ P(X ≥ x).

Example 16 If 0 < p < 1, then X is called a geometric random variable with parameter p if it has mass function p(x) = p(1 − p)ˣ for x = 0, 1, 2, . . .. Find E(X).

Definition Let p ≥ 0. The pth moment, or moment of order p, of X is E(Xᵖ), if this is finite.

Example 17 Let X have density p(x) = 2/x³ for x > 1. Find its first and second moments.

Theorem (Jensen’s Inequality) Let g be convex on [a, b], and X a r.v. such that a ≤ X ≤ b. Then g(E(X)) ≤ E(g(X)).

Corollary For 0 < q < p, if the pth moment exists, then so does the qth moment.

Definition Let X be a r.v. with probability distribution f(x), finite second moment, and mean µ. The variance of X, denoted by Var(X), σ² or σX², is

E[(X − µ)²] = Σ_x (x − µ)² f(x) if X is discrete

and

E[(X − µ)²] = ∫_{−∞}^∞ (x − µ)² f(x) dx if X is continuous.

The standard deviation of X is σX = √Var(X).

Remark For a r.v. X, E(X) describes where the distribution is centered. Var(X), on the other hand, describes the variability by considering the spread of values of X from the mean.

Example 18 Find the variance of X with mass function p for which p(0) = 1/4, p(1) = 1/8, p(2) = 1/2, p(3) = 1/8.

Theorem The variance of X is Var(X) = E(X²) − [E(X)]² = E(X²) − µ².

Example 19 Verify this theorem for the preceding example. Find also the variance of Y = 2X + 3.
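
A sketch verifying the shortcut formula on Example 18 (where µ = 3/2 and E(X²) = 13/4, so Var(X) = 1), together with Var(2X + 3) = 4 Var(X):

```python
from fractions import Fraction as F

p = {0: F(1, 4), 1: F(1, 8), 2: F(1, 2), 3: F(1, 8)}
E = lambda g: sum(g(x) * px for x, px in p.items())

mu = E(lambda x: x)                                    # 3/2
var = E(lambda x: (x - mu) ** 2)                       # from the definition
print(var, E(lambda x: x * x) - mu ** 2)               # 1 1: shortcut agrees
print(E(lambda x: (2 * x + 3 - (2 * mu + 3)) ** 2))    # 4 = 2^2 * Var(X)
```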

Theorem If X has finite second moment, Var(aX + b) = a² Var(X) for any constants a, b.

Example 20 Find the variance of a r.v. X having Poisson distribution with parameter λ, and a r.v. Y having exponential distribution with parameter λ.

2.5 Joint Probability Distributions

Many probabilistic situations involve several random variables of interest. Given two r.v.s X and Y in the same probability space (Ω, F, P), we will write P(X = x, Y = y) to denote the probability of the event {X = x} ∩ {Y = y}, i.e., the event {ω : X(ω) = x and Y(ω) = y}.

Definition The function pX,Y(x, y) = p(x, y) is a joint probability mass function or joint probability distribution of the discrete random variables X and Y if

1. p(x, y) ≥ 0 for all x, y,
2. Σ_x Σ_y p(x, y) = 1, and
3. P(X = x, Y = y) = p(x, y).

For any region A in the xy-plane, P[(X, Y) ∈ A] = Σ_{(x,y)∈A} p(x, y).

Example 21 Suppose X and Y have the following joint probability distribution:

           X = 2    X = 4
  Y = 1     0.10     0.15
  Y = 3     0.20     0.30
  Y = 5     0.10     0.15

Find P(X + Y ≥ 5), P(X = 2) and P(Y = 3).
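
A sketch of my own: storing the joint PMF as a dictionary keyed by (x, y) makes such probabilities one-line sums.

```python
p = {(2, 1): 0.10, (4, 1): 0.15,
     (2, 3): 0.20, (4, 3): 0.30,
     (2, 5): 0.10, (4, 5): 0.15}

print(sum(v for (x, y), v in p.items() if x + y >= 5))   # P(X + Y >= 5) = 0.90
print(sum(v for (x, y), v in p.items() if x == 2))       # P(X = 2) = 0.40
print(sum(v for (x, y), v in p.items() if y == 3))       # P(Y = 3) = 0.50
```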

Definition The function fX,Y(x, y) = f(x, y) is a joint probability density function of the continuous random variables X and Y if

1. f(x, y) ≥ 0 for all x, y,
2. ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = 1, and
3. for any region A in the xy-plane, P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy.

Example 22 Let X and Y have joint density fX,Y(x, y) = 2e^{−(x+y)}, 0 < x < y, 0 < y, i.e., the support is the region in the first quadrant above the line y = x. Find P(Y < 3X).

Given the joint distribution fX,Y(x, y) of X and Y, how do we determine the distribution of X and of Y alone?

Definition If the joint distribution of X and Y is fX,Y(x, y), the marginal distributions of X alone and of Y alone are

fX(x) = Σ_y fX,Y(x, y) and fY(y) = Σ_x fX,Y(x, y)

for the discrete case, and

fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy and fY(y) = ∫_{−∞}^∞ fX,Y(x, y) dx

for the continuous case.

Example 23 Find the marginal distributions of X and Y in Example 21 and Example 22.

Definition The joint cumulative distribution function of the r.v.s X and Y is FX,Y(x, y) = P(X ≤ x, Y ≤ y). If their joint distribution is fX,Y(x, y), then

FX,Y(x, y) = Σ_{u≤x} Σ_{v≤y} fX,Y(u, v)

for the discrete case, and

FX,Y(x, y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(u, v) dv du

for the continuous case.

Theorem Let X and Y be continuous r.v.s with joint CDF FX,Y(x, y). Their joint PDF is fX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y, provided FX,Y has continuous second partial derivatives.

Example 24 Both the r.v.s X and Y assume values in [0, 1]. If their joint CDF is FX,Y(x, y) = (1/3)x²(2y + y²) on [0, 1] × [0, 1], find their PDF.

Remark The definitions and theorems in this section extend in a very straightforward way to situationsinvolving more than two variables.

2.6 Further Properties of Expectations

Definition Suppose X and Y are r.v.s with joint distribution fX,Y(x, y). Then the expected value of the random variable g(X, Y), a function of X and Y, is given by

E[g(X, Y)] = Σ_x Σ_y g(x, y) fX,Y(x, y)

for the discrete case, provided Σ_x Σ_y |g(x, y)| fX,Y(x, y) < ∞, and

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) fX,Y(x, y) dx dy

for the continuous case, provided ∫_{−∞}^∞ ∫_{−∞}^∞ |g(x, y)| fX,Y(x, y) dx dy < ∞.

Theorem If X and Y both have finite mean, then E(aX + bY ) = aE(X) + bE(Y ).

Expectation is a linear operator: E[ag(X,Y ) + bh(X,Y )] = aE[g(X,Y )] + bE[h(X,Y )].

Example 25 An electrical circuit has three resistors wired in parallel. Their actual resistances, X, Y and Z, vary between 10 and 20 according to their joint PDF fX,Y,Z(x, y, z) = (1/675000)(xy + xz + yz), 10 ≤ x, y, z ≤ 20. What is the expected resistance of the circuit? Fact: if R is the circuit’s resistance, then 1/R = 1/X + 1/Y + 1/Z.

Example 26 A disgruntled assistant is upset about having to stuff envelopes. Handed a box of n letters and n envelopes, he vents his frustration by putting the letters into the envelopes at random. How many people, on average, will receive their correct mail?

Example 27 Ten fair dice are rolled. Calculate the expected value of the sum of the faces showing.

Definition The covariance of the r.v.s X and Y, denoted Cov(X, Y) or σXY, is E[(X − µX)(Y − µY)]. Their correlation is

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = σXY / (σX σY).

The covariance is a measurement of the association between two random variables. For example, if large (small) values of X often result in large (small) values of Y, positive (negative) X − µX will often result in positive (negative) Y − µY, so (X − µX)(Y − µY) will tend to be positive. Also, note that we use σXY for the covariance, not for the standard deviation of the product XY.

Theorem Cov(X,Y ) = σXY = E(XY )− E(X)E(Y ) = E(XY )− µXµY .

Example 28 Find the covariance and correlation of X and Y if they have the following joint distribution:


           X = 2    X = 4
  Y = 1     0.10     0.20
  Y = 3     0.20     0.25
  Y = 5     0.10     0.15
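
A sketch computing Example 28 via the shortcut formulas Cov(X, Y) = E(XY) − µXµY and ρ = σXY/(σXσY):

```python
import math

p = {(2, 1): 0.10, (4, 1): 0.20,
     (2, 3): 0.20, (4, 3): 0.25,
     (2, 5): 0.10, (4, 5): 0.15}
E = lambda g: sum(g(x, y) * v for (x, y), v in p.items())

mx, my = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - mx * my                  # ≈ −0.08
sx = math.sqrt(E(lambda x, y: x * x) - mx ** 2)
sy = math.sqrt(E(lambda x, y: y * y) - my ** 2)
print(cov, cov / (sx * sy))
```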

Theorem We have the following properties of variance and covariance.

1. Cov(X, Y) = Cov(Y, X), i.e., σXY = σYX
2. Cov(X, X) = Var(X), i.e., σXX = σX²
3. Cov(aX + b, cY + d) = ac Cov(X, Y)
4. Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y), i.e., σ²_{aX+bY} = a²σX² + b²σY² + 2ab σXY
5. Var(a1X1 + a2X2 + ⋯ + anXn) = Σ_{i=1}^n ai² Var(Xi) + 2 Σ_{j<k} aj ak Cov(Xj, Xk)

Theorem (Cauchy-Schwarz Inequality) |E(XY)|² ≤ E(X²)E(Y²)

Corollary |ρ(X,Y )| ≤ 1

Theorem (Markov’s Inequality) Let X be a nonnegative random variable with finite mean. Then for any t > 0,

P(X ≥ t) ≤ E(X)/t.

Corollary Let X be a nonnegative random variable with finite pth moment. Then for any t > 0,

P(X ≥ t) ≤ E(Xᵖ)/tᵖ.

Remark We expect that as t → ∞, P(|X| ≥ t) → 0. From the above result, we get more information about how fast P(|X| ≥ t) decays to 0, based on the moments that we know are finite for X.

Theorem (Chebyshev’s Inequality) Let X have finite second moment. Then for any a > 0,

P[|X − E(X)| ≥ a] ≤ Var(X)/a².

Chebyshev’s Inequality sometimes takes the following form: if X has finite second moment, then for any k > 0,

P(µ − kσ < X < µ + kσ) ≥ 1 − 1/k².

The probability that any random variable will assume a value within k standard deviations of the mean is at least 1 − 1/k².

Example 29 A r.v. X has mean µ = 10 and variance σ² = 4. Use Chebyshev’s Inequality to estimate P(5 < X < 15). For what values of c does Chebyshev’s Inequality guarantee that P(|X − 10| ≥ c) ≤ 0.04?
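
A sketch of the two computations: P(5 < X < 15) = P(|X − 10| < 5) ≥ 1 − σ²/5², and Var(X)/c² ≤ 0.04 forces c ≥ 10.

```python
import math

mu, var = 10, 4
print(1 - var / 5 ** 2)          # P(5 < X < 15) ≥ 0.84
print(math.sqrt(var / 0.04))     # need c ≥ 10.0
```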


2.7 Independent Random Variables

Definition The two random variables X and Y are said to be independent if for every interval A and every interval B, P(X ∈ A and Y ∈ B) = P(X ∈ A)P(Y ∈ B).

Theorem The random variables X and Y are independent if and only if there are functions g(x) and h(y) such that fX,Y(x, y) = g(x)h(y). If this equation holds, there is a constant k such that fX(x) = k g(x) and fY(y) = (1/k) h(y).

Example 30 The joint density of X and Y is given. Are X and Y independent?

1. fX,Y(x, y) = 12xy(1 − y), 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
2. fX,Y(x, y) = 8xy, 0 ≤ y ≤ x ≤ 1

Example 31 If two random variables X and Y are defined over a region in the xy-plane that is not a rectangle (possibly infinite) with sides parallel to the coordinate axes, can X and Y be independent?

Theorem X and Y are independent iff FX,Y (x, y) = FX(x)FY (y).

Remark If X and Y are independent, and A and B are formed from intervals under the operations of countable intersection, countable union, and complement, then P(X ∈ A and Y ∈ B) = P(X ∈ A)P(Y ∈ B).

Example 32 Show that if X and Y are independent, then so are eX and Y 2.

Remark If X and Y are independent, and g and h are (Borel) functions (e.g., continuous functions), then g(X) and h(Y) are independent.

Theorem If X and Y are independent, then

1. E(XY) = E(X)E(Y)
2. Cov(X, Y) = ρ(X, Y) = 0
3. Var(aX + bY) = a² Var(X) + b² Var(Y)

Remark However, if Cov(X, Y) = 0, it doesn’t mean X and Y are independent. Suppose X takes the values −1, 0, 1 with equal probability 1/3, and let Y = X². Clearly, X and Y are dependent (not independent), but Cov(X, Y) = 0.

Definition The random variables X1, X2, . . . , Xn are independent if for any collection A1, A2, . . . , An of subsets of R formed by taking countable unions, countable intersections, and complements of intervals,

P(X1 ∈ A1, X2 ∈ A2, . . . , Xn ∈ An) = P(X1 ∈ A1)P(X2 ∈ A2) ⋯ P(Xn ∈ An).

Theorem If X1, X2, . . . , Xn are independent, then fX1,X2,...,Xn(x1, x2, . . . , xn) = fX1(x1)fX2(x2) ⋯ fXn(xn), FX1,X2,...,Xn(x1, x2, . . . , xn) = FX1(x1)FX2(x2) ⋯ FXn(xn), and E[X1X2 ⋯ Xn] = E[X1]E[X2] ⋯ E[Xn].

Example 33 Consider k urns, each holding n chips, numbered 1 through n. A chip is to be drawn at random from each urn. What is the probability that all k chips will bear the same number?

Example 34 Suppose that X1, X2, X3, X4 are independent random variables, each with PDF fXi(xi) = 4xi³, 0 ≤ xi ≤ 1. Find


1. P(X1 < 1/2)
2. P(exactly one Xi < 1/2)
3. fX1,X2,X3,X4(x1, x2, x3, x4)
4. FX2,X3(x2, x3).

Example 35 Let X and Y be independent random variables, each geometrically distributed with parameter p, i.e., fX(x) = p(1 − p)ˣ, x = 0, 1, 2, . . ..

1. Find the distribution of min(X, Y) and of max(X, Y).
2. Find P(min(X, Y) = X) = P(Y ≥ X).
3. Find the distribution of X + Y.
4. Find P(Y = y|X + Y = z) for y = 0, 1, . . . , z.

Example 36 A random sample of 8 independent and identically distributed random variables X1, X2, . . . , X8 is obtained. Each has density function f(x) = (1/4)x³, 0 < x < 2. Let Y = min(X1, X2, . . . , X8) and Z = max(X1, X2, . . . , X8). Find the densities of Y and of Z.


2.8 Problems

1. Let F be the CDF of a r.v. X (make no assumptions that it is discrete or continuous). Prove that lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0.

Hint: To prove the first statement, note that it suffices to show that if {yn} is any increasing sequence of numbers with yn → ∞, then lim_{n→∞} F(yn) = 1. Now, define the sequence of events Ai by A1 = {X ≤ y1} and An = {yn−1 < X ≤ yn} for n = 2, 3, . . .. What is ⋃_{n=1}^∞ An? Express F(yn) as the probability of the union of some of the events Ai.

2. Let p ∈ (0, 1), and let X be a continuous r.v. If xp is a number such that F(xp) = P[X ≤ xp] = p, then xp is referred to as the pth quantile of X. If X ∼ Exp(λ), find x0.5 (the median), x0.25 (the lower quartile) and x0.75 (the upper quartile).

3. The joint density of X and Y is fX,Y(x, y) = 10xy² for 0 < x < y < 1.

(a) Find the marginal densities of X and of Y. Use these to find the cumulative distributions of X and of Y.
(b) Use fX,Y to derive the joint cumulative distribution FX,Y(x, y) = (5/3)x²y³ − (2/3)x⁵ for points (x, y) in the support.
(c) Find lim_{y→∞} FX,Y(x, y) = FX,Y(x, 1) and lim_{x→∞} FX,Y(x, y) = FX,Y(y, y). Compare these answers with those from (a).
(d) For general r.v.s X and Y with joint distribution, explain why lim_{y→∞} FX,Y(x, y) = FX(x) and lim_{x→∞} FX,Y(x, y) = FY(y).

4. Let the r.v. X take the values 2ᵏ and −2ᵏ for k = 2, 3, . . . with PMF p defined by p(2ᵏ) = p(−2ᵏ) = 2^{−k}. Show that p is a valid PMF. Note that p is symmetric around 0; is E(X) = 0?

5. Let N be a positive integer. Suppose the r.v. X has PMF p(x) = 2x/(N(N + 1)) for x = 1, 2, . . . , N. Show that p is a valid PMF, and find the mean of X.

6. The speed of a molecule in a uniform gas at equilibrium is a random variable V whose probability distribution is f(v) = kv²e^{−bv²}, v > 0, where k is an appropriate constant and b depends on the absolute temperature and mass of the molecule. Find the probability distribution of the kinetic energy of the molecule, W, where W = mV²/2.

7. If X ∼ Poi(λ), find E[(1 + X)^{−1}].

8. The gamma function Γ is defined as the function Γ(k) = ∫_0^∞ t^{k−1}e^{−t} dt for k > 0. It has the property Γ(z + 1) = zΓ(z) for any z > 0. Let k > 0 and θ > 0. If a random variable X has density

f(x) = c x^{k−1} e^{−x/θ}, x > 0,

for some constant c, we say that X has Gamma distribution with parameters k and θ.

(a) Show that c = 1/[Γ(k)θᵏ]. Find E(X) and Var(X) in terms of k and θ.
(b) The skewness and kurtosis of a r.v. X are defined respectively as E[((X − µ)/σ)³] and E[((X − µ)/σ)⁴]. Find the kurtosis and skewness of X above.


(c) If a unique x′ exists such that f(x) is maximized at x = x′, then we say x′ is the mode of X. Show that X above has a mode only if k > 1. In this case, what is the mode?

9. Show that Var(X) = min_{a∈R} E[(X − a)²] by letting h(a) = E[(X − a)²] and finding the minimum value of the quadratic function h.

10. Suppose an experiment having r possible outcomes 1, 2, . . . , r that occur with probabilities p1, . . . , pr is repeated n times. Let X be the number of times the first outcome occurs, and Y the number of times the second outcome occurs. Show that

ρ(X, Y) = −√( p1p2 / ((1 − p1)(1 − p2)) ).

Hint: Let Ii = 1 if the ith trial yields outcome 1, and Ii = 0 otherwise. Similarly, let Ji = 1 if the ith trial yields outcome 2, and Ji = 0 otherwise. Then X = I1 + ⋯ + In and Y = J1 + ⋯ + Jn. Show that E(IiJi) = 0 and, if i ≠ j, E(IiJj) = p1p2. Note too that I1, . . . , In are independent, and so are J1, . . . , Jn.

11. Let Xn take on the values 1, 2, . . . , n, each with probability 1/n. Define Yn = Xn². Find ρ(Xn, Yn) and lim_{n→∞} ρ(Xn, Yn).

Hint: Σ_{i=1}^n i = n(n+1)/2, Σ_{i=1}^n i² = n(n+1)(2n+1)/6, Σ_{i=1}^n i³ = [n(n+1)/2]², Σ_{i=1}^n i⁴ = n(n+1)(2n+1)(3n²+3n−1)/30.

12. Let X have uniform distribution on (0, 1), which means fX(x) = 1, 0 < x < 1. Find the density of Y = −λ^{−1} ln(1 − X) for λ > 0.

13. Let X be a r.v. with mass function

p(x) = 1/18  if x = 1 or 3,
       16/18 if x = 2.

Find a value of δ such that P(|X − µ| ≥ δ) = Var(X)/δ². This shows that in general, the bound given by Chebyshev’s inequality cannot be improved.

14. Show that if X and Y have finite variance, then Cov(X, Y) = (1/4)[Var(X + Y) − Var(X − Y)] and Var(X + Y) + Var(X − Y) = 2[Var(X) + Var(Y)].

15. Show that if the covariances are finite, Cov(X + W, Y + Z) = Cov(X, Y) + Cov(X, Z) + Cov(W, Y) + Cov(W, Z).

16. Let X and Y have finite positive variance, and m, n, p, q real numbers with n, q nonzero. Show that

ρ(m + nX, p + qY) = ρ(X, Y) if nq > 0

and

ρ(m + nX, p + qY) = −ρ(X, Y) if nq < 0.

17. Let X and Y be independent random variables each having geometric distribution with parameterp. Set Z = Y −X and M = min(X,Y ).

(a) Show that for integers z and m ≥ 0,

pM,Z(m, z) = P(X = m − z)P(Y = m) if z < 0,
             P(X = m)P(Y = m + z) if z ≥ 0.

(b) Conclude from (a) that for integers z and m ≥ 0, pM,Z(m, z) = p²(1 − p)^{2m}(1 − p)^{|z|}.
(c) Find the marginal distributions of M and Z. Are M and Z independent?


3. Special Distributions

3.x Conditional Distributions

Definition The conditional PMF of a discrete r.v. X, conditioned on an event A with P(A) > 0, is defined by

pX|A(x) = P(X = x|A) = P({X = x} ∩ A) / P(A).

Example 1 Let X be the roll of a die, and let A = {the roll is an even number}. Find pX|A(x).

If X and Y are discrete r.v.s in the same probability space, the above definition can be specialized to events A of the form {Y = y}.

Definition Let X and Y be discrete r.v.s. The conditional PMF of X, given Y = y, is

pX|Y(x|y) = pX(x|y) = pX|y(x) = P(X = x|Y = y) = pX,Y(x, y) / pY(y)

for pY(y) ≠ 0.

Example 2 A fair coin is tossed five times. Let Y denote the total number of heads that occur, and X the number of heads occurring on the last two tosses. Find the conditional PMF pY|x(y) for each x.

The conditional PMF can be used to calculate the marginal PMFs:

pX(x) = Σ_y pX,Y(x, y) = Σ_y pY(y) pX|Y(x|y).

Example 3 A transmitter sends messages over a computer network. X is the travel time of a given message and Y is the length of the given message. Suppose

pY(y) = 5/6 if y = 10²,
        1/6 if y = 10⁴.

The travel time depends on the message length. In particular,

pX|Y(x|y) = 1/2 if x = 10⁻⁴y,
            1/3 if x = 10⁻³y,
            1/6 if x = 10⁻²y.

Find the distribution of X. Use this to find E(X).
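
A sketch building pX from pX(x) = Σy pY(y) pX|Y(x|y) and then computing E(X). Exact fractions are used because the value x = 1 arises from both message lengths and must be merged.

```python
from fractions import Fraction as Fr
from collections import defaultdict

p_Y = {100: Fr(5, 6), 10_000: Fr(1, 6)}
cond = {Fr(1, 10_000): Fr(1, 2),     # x = 1e-4 * y with probability 1/2, etc.
        Fr(1, 1_000): Fr(1, 3),
        Fr(1, 100): Fr(1, 6)}

p_X = defaultdict(Fr)
for y, py in p_Y.items():
    for factor, pxy in cond.items():
        p_X[factor * y] += py * pxy

print({float(x): str(px) for x, px in p_X.items()})
print(float(sum(x * px for x, px in p_X.items())))   # E(X) ≈ 3.59
```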

Conditional expectations are defined as expected. For example,

E[g(X)|A] = Σ_x g(x) pX|A(x) and E[X|Y = y] = Σ_x x pX|Y(x|y).

Example 3 (continued) For the previous example, find E[X|Y = 10²] and E[X|Y = 10⁴].


Theorem (Total Expectation Theorem) E(X) = Σ_y pY(y) E[X|Y = y]. In general, if A1, A2, . . . , An form a partition of Ω with P(Ai) > 0 for each i, then E(X) = Σ_{i=1}^n P(Ai) E[X|Ai].

This means the unconditional average can be obtained by averaging the conditional averages. It is also instructive to compare this with P(X = x) = Σ_y pY(y) P[X = x|Y = y].

Example 3 (continued) Use the conditional expectations of X given Y to find E(X).
