TRANSCRIPT
2301 690 (2011/1)
Special Topics in Advanced Mathematics:
Theory of Copulae
Songkiat Sumetkijakan
Contents
1 Necessary Probability Theory 3
1.1 Riemann and Lebesgue integrations . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Expected values and variances . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Conditional probability and independence . . . . . . . . . . . . . . . . . . . 20
1.7 More on expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.8 Moment generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.9 Laws of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.10 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2301 690 Special Topics in Advanced Mathematics: Copulae 2011/1 (Songkiat) p.2
1 Necessary Probability Theory
N = {1, 2, 3, . . . }, the set of all natural numbers.
N_0 = {0, 1, 2, 3, . . . }, the set of all nonnegative integers.
Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . . }, the set of all integers.
Q = {m/n : m ∈ Z, n ∈ Z, n ≠ 0}, the set of all rational numbers.
R = the set of all real numbers.
1.1 Riemann and Lebesgue integrations
Riemann integration and its shortcomings
Given a bounded function f : [a, b] → R⁺, one may estimate the area between the curve
y = f(x) and the x-axis over the interval [a, b] by first subdividing [a, b] into n
subintervals. This first step is done by giving the n + 1 endpoints, called a partition
P = {x_0, x_1, . . . , x_n}, of those n subintervals. It is customary to order the endpoints in
the partition P so that a = x_0 < x_1 < · · · < x_n = b. We then compute the lower sum and
the upper sum of f with respect to the partition P by

    L(P ; f) = ∑_{i=1}^{n} (x_i − x_{i−1}) m_i(f),
    U(P ; f) = ∑_{i=1}^{n} (x_i − x_{i−1}) M_i(f),

where m_i(f) is the infimum of f(x) and M_i(f) is the supremum of f(x) for x ∈ [x_{i−1}, x_i].
If lim_{‖P‖→0} L(P ; f) = lim_{‖P‖→0} U(P ; f), then f is said to be Riemann-integrable
on [a, b] and the Riemann integral of f on [a, b] is

    ∫_a^b f(x) dx = lim_{‖P‖→0} L(P ; f) = lim_{‖P‖→0} U(P ; f).

Here ‖P‖, the mesh of P, is the largest distance between two adjacent points in the
partition P. As the partition gets finer and finer, the lower and upper sums of a
Riemann-integrable function become increasingly better lower and upper bounds of the
area under the curve. The Riemann-integrability of an arbitrary bounded function
f : [a, b] → R is defined in exactly the same way; only the area interpretation is lost.
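A quick numerical sketch (not from the text) of the lower and upper sums above, for the
increasing function f(x) = x² on [0, 1], where m_i(f) = f(x_{i−1}) and M_i(f) = f(x_i):

```python
# Lower and upper sums L(P; f) and U(P; f) over the equal partition of [a, b]
# into n subintervals, assuming f is nondecreasing on [a, b].
def lower_upper_sums(f, a, b, n):
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    lower = sum(h * f(xs[i - 1]) for i in range(1, n + 1))  # uses m_i(f)
    upper = sum(h * f(xs[i]) for i in range(1, n + 1))      # uses M_i(f)
    return lower, upper

f = lambda x: x * x
for n in (10, 100, 1000):
    L, U = lower_upper_sums(f, 0.0, 1.0, n)
    print(n, L, U)   # both approach the true area 1/3 as n grows
```

The gap U(P; f) − L(P; f) here equals h · (f(1) − f(0)) = 1/n, so it shrinks as the mesh does.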
Even though the definition of the Riemann integral is very intuitive and behaves nicely
when the integrand f is nice, e.g. continuous, it has some serious drawbacks, some of
which are illustrated below. We first define uniform convergence, a very strong mode of
convergence.
Figure 1: Riemann integration
Definition 1. A sequence of functions f_n : D → R converges uniformly to f : D → R if

    lim_{n→∞} sup{ |f_n(x) − f(x)| : x ∈ D } = 0.

Note that if f_n converges uniformly to f then f_n converges pointwise to f, i.e. f_n(x)
converges to f(x) at every x ∈ D. The converse is false, as can be seen by considering the
sequence f_n(x) = nx/(nx + 1) on [0, 1].
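A numerical look (not from the text) at why this sequence fails to converge uniformly: its
pointwise limit is f(0) = 0 and f(x) = 1 for x > 0, yet the supremum of the error stays at 1
for every n.

```python
# f_n(x) = n*x/(n*x + 1) converges pointwise on [0, 1] but not uniformly:
# sup |f_n - f| does not go to 0 because of points x close to 0.
def f_n(n, x):
    return n * x / (n * x + 1)

def f_limit(x):
    return 0.0 if x == 0 else 1.0

# grid includes very small positive x, where the error is largest
grid = [0.0] + [10.0 ** (-k) for k in range(12, 0, -1)] + [0.5, 1.0]
for n in (10, 1000, 10 ** 6):
    sup_err = max(abs(f_n(n, x) - f_limit(x)) for x in grid)
    print(n, sup_err)   # stays close to 1, so convergence is not uniform
```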
Example. Determine whether each of the following sequences of functions converges
uniformly on [0, 1]: i) f_n(x) = nx/(n²x² + 1); ii) f_n(x) = nx²/(nx + 1);
iii) f_n(x) = nx(1 − x)ⁿ.
Theorem 1.1 (Limits of Riemann Integrals). Assume that f_n is a sequence of continuous
functions defined on [a, b] and f_n converges uniformly to a function f. Then f is Riemann-
integrable (in fact, f is continuous) and

    lim_{n→∞} ∫_a^b f_n(x) dx = ∫_a^b lim_{n→∞} f_n(x) dx = ∫_a^b f(x) dx.
Theorem 1.2 (Interchanging the order of Riemann Integration). Assume that
f : [a, b] × [c, d] → R is continuous. Then the functions

    F(y) = ∫_a^b f(x, y) dx  and  G(x) = ∫_c^d f(x, y) dy,

where a ≤ x ≤ b and c ≤ y ≤ d, are Riemann-integrable and ∫_c^d F(y) dy = ∫_a^b G(x) dx.
In other words,

    ∫_c^d ∫_a^b f(x, y) dx dy = ∫_a^b ∫_c^d f(x, y) dy dx.
Example. i) A simple function that is not Riemann-integrable. Define f : [0, 1] → R by

    f(x) = 1 if x ∈ Q,
           0 if x ∉ Q.

Since Q is dense in R, which means that every open interval in R contains a member of Q,
every partition P = {x_0, . . . , x_n} of [0, 1] yields M_i(f) = sup_{x∈[x_{i−1},x_i]} f(x) = 1
and hence U(P ; f) = ∑_{i=1}^n (x_i − x_{i−1}) = 1. Thus lim_{‖P‖→0} U(P ; f) = 1.
Likewise, since R \ Q is also dense in R, any partition P of [0, 1] gives
lim_{‖P‖→0} L(P ; f) = 0. Therefore, f is not Riemann-integrable even though it is just
the characteristic function of Q on [0, 1].
ii) An increasing sequence of Riemann-integrable functions whose limit is not Riemann-
integrable. Let r_1, r_2, . . . be an enumeration of all rational numbers in [0, 1]. We then
define, for each n ∈ N, f_n : [0, 1] → R by

    f_n(x) = 1 if x ∈ {r_1, . . . , r_n},
             0 elsewhere.

Obviously, (f_n) is an increasing sequence and lim_{n→∞} f_n(x) = f(x) for all x ∈ [0, 1],
where f is the function from part i). We have seen that f is not Riemann-integrable,
while each f_n, being a finite sum of characteristic functions of singletons, is clearly
Riemann-integrable.
Lebesgue integration
Given f : [a, b] → R⁺ with f([a, b]) ⊆ [A, B), let us subdivide [A, B) into n subintervals
by a partition {y_0, y_1, . . . , y_n} and define, for i = 1, . . . , n,

    E_i = {x ∈ [a, b] : f(x) ∈ [y_{i−1}, y_i)}.

We then estimate the area between the curve y = f(x) and the x-axis over the interval
[a, b] by ∑_{i=1}^n y_{i−1} m(E_i), where m(E_i) should be some meaningful “length” of the
set E_i. This would be easy if E_i were just an interval with endpoints α < β, i.e. (α, β),
[α, β), (α, β], or [α, β], whose length is simply β − α. But E_i could be a very complicated
set, as f is an arbitrary function, and hence the length of E_i is not clearly defined. In
fact, it turns out that there is no way to assign a “length” to every subset of R in a
consistent manner. This is where measure theory comes in. A solution is to regard some
sets as “measurable” and discard the rest. The class of measurable sets is so large that
constructing a non-measurable set requires invoking the Axiom of Choice. Then, since the
shape of the set E_i depends directly on f, it is also necessary to define the Lebesgue
integral only for “measurable functions.” Nonetheless, we shall postpone a detailed
discussion of measure/probability theory to the next chapter.
Figure 2: Lebesgue integration
Let us now focus on defining the Lebesgue integral. Observe that the finer the partition
P = {y_0, . . . , y_n}, the larger the approximate area S(P ; f) = ∑_{i=1}^n y_{i−1} m(E_i).
We then define

    ∫_{[a,b]} f dm = lim_{‖P‖→0} S(P ; f).
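An illustrative sketch (not from the text) of this range-partition sum for f(x) = x² on
[0, 1]: here each preimage E_i = {x : f(x) ∈ [y_{i−1}, y_i)} is just the interval
[√y_{i−1}, √y_i), so its measure is known exactly.

```python
import math

def lebesgue_sum(n):
    """Partition the range [0, 1) of f(x) = x**2 into n equal pieces and
    form S(P; f) = sum_i y_{i-1} * m(E_i)."""
    total = 0.0
    for i in range(1, n + 1):
        y_lo, y_hi = (i - 1) / n, i / n
        measure = math.sqrt(y_hi) - math.sqrt(y_lo)   # m(E_i) exactly
        total += y_lo * measure
    return total

for n in (10, 100, 10000):
    print(n, lebesgue_sum(n))   # increases toward the true integral 1/3
```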
This deceptively simple idea of partitioning the range of f (in Lebesgue integration)
instead of partitioning the domain (in Riemann integration) has proved to be very crucial
in developing a much more satisfactory and broader theory of integration. We state here
some fundamental theorems without proofs.
Theorem 1.3 (Monotone Convergence Theorem). Let f_n be a sequence of non-negative
measurable functions on [a, b]. Suppose that 0 ≤ f_1 ≤ f_2 ≤ . . . and
lim_{n→∞} f_n(x) = f(x) at each x ∈ [a, b]. Then

    lim_{n→∞} ∫_{[a,b]} f_n dm = ∫_{[a,b]} f dm.
Theorem 1.4 (Dominated Convergence Theorem). Let f_n and f be measurable functions
for which f_n converges pointwise to f. Suppose there is a function g for which
∫_{[a,b]} g dm is finite and |f_n| ≤ g for all n. Then

    lim_{n→∞} ∫_{[a,b]} f_n dm = ∫_{[a,b]} f dm.
Theorem 1.5 (Fubini’s Theorem). Let f : [a, b] × [c, d] → R be measurable. Assume that
either

    ∫_{[a,b]} ∫_{[c,d]} |f(x, y)| dm(y) dm(x)  or  ∫_{[c,d]} ∫_{[a,b]} |f(x, y)| dm(x) dm(y)

is finite. Then

    ∫_{[a,b]} ∫_{[c,d]} f(x, y) dm(y) dm(x) = ∫_{[c,d]} ∫_{[a,b]} f(x, y) dm(x) dm(y).
Example. 1. The functions f_n in the Example after Theorem 1.2 all have
∫_{[0,1]} f_n dm = 0. And since f = 0 everywhere except on Q ∩ [0, 1], which has measure
zero, ∫_{[0,1]} f dm = 0 as well.

2. The sequence f_n defined by f_n(x) = nx(1 − x)ⁿ for x ∈ [0, 1] converges pointwise to
f = 0 but not uniformly. But one can check that

    ∫_0^1 f_n(x) dx = ∫_0^1 nx(1 − x)ⁿ dx = n/((n + 1)(n + 2)) → 0 = ∫_0^1 f(x) dx   (1)

as n → ∞. Thus lim_{n→∞} ∫_0^1 f_n(x) dx = ∫_0^1 f(x) dx even though the sequence does
not satisfy the uniform convergence assumption in Theorem 1.1. On the other hand, every
f_n is bounded by e^{−1}, a constant function with finite integral on [0, 1], and hence (1)
follows straightforwardly from Theorem 1.4.
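A numerical check (not from the text) of this example: a midpoint Riemann sum matches
the exact value n/((n + 1)(n + 2)), and the maximum of f_n, attained at x = 1/(n + 1),
indeed stays below the dominating constant 1/e.

```python
import math

def f_n(n, x):
    return n * x * (1 - x) ** n

def midpoint_integral(n, steps=20000):
    # simple midpoint rule for the integral of f_n over [0, 1]
    h = 1.0 / steps
    return h * sum(f_n(n, (k + 0.5) * h) for k in range(steps))

for n in (1, 5, 50):
    exact = n / ((n + 1) * (n + 2))
    approx = midpoint_integral(n)
    peak = f_n(n, 1.0 / (n + 1))          # the maximum of f_n on [0, 1]
    print(n, exact, approx, peak <= math.exp(-1))
```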
1.2 Probability spaces
Definition 2 (Algebras). Let Ω be any nonempty set. A collection F of subsets of Ω is
called an algebra if

1. Ω ∈ F;
2. if A ∈ F then A^c ∈ F; and
3. if A_1, A_2, . . . , A_n are in F then ⋃_{k=1}^n A_k ∈ F.

Definition 3 (σ-algebras or event spaces). Let Ω be any nonempty set. A collection F of
subsets of Ω is called a σ-algebra if

1. Ω ∈ F;
2. if A ∈ F then A^c ∈ F; and
3. if A_1, A_2, . . . are in F then ⋃_{k=1}^∞ A_k ∈ F.

The pair (Ω, F) is called a measurable space, and the members of F are called F-measurable
sets or events. The σ-algebra F itself is sometimes referred to as an event space.
Interpretation. To model an experiment, Ω serves as the set of all possible outcomes,
the so-called sample space. We think of any member of the σ-algebra F as an event that
may take place. For example, in the experiment of rolling a fair die once, we should take
Ω = {1, 2, . . . , 6}. How would we choose a σ-algebra? If we observe the roll directly and
are interested in the number that turns up, then our σ-algebra must contain all of the
singletons {1}, {2}, . . . , {6}, and the only choice is P(Ω). On the other hand, if we are
only interested in whether the roll turns up high (4-6) or low (1-3), then we don’t want to
distinguish, e.g., between 1, 2, and 3. So a natural σ-algebra is {∅, {1, 2, 3}, {4, 5, 6}, Ω}.
Example. 1. P(Ω) is the finest σ-algebra on Ω while {∅, Ω} is the coarsest σ-algebra.

2. In an experiment of tossing a fair coin twice, our sample space is

    Ω = {HH, HT, TH, TT}

where, for example, HT means the coin turns up heads then tails. If we are told only
how many heads turn up, then our σ-algebra must contain {HH}, {HT, TH}, and {TT}.
After taking all possible set operations, the σ-algebra is

    {∅, {HH}, {HT, TH}, {TT}, {HH, HT, TH}, {HT, TH, TT}, {HH, TT}, Ω}.
Proposition 1.6. Let (Ω, F) be a measurable space. Then

1. the empty set ∅ is in F;
2. if A_1, A_2, . . . are in F then ⋂_{n=1}^∞ A_n ∈ F; and
3. if A, B ∈ F then A \ B ∈ F.
In practice, we usually know exactly what kind of sets we wish to consider as events, but
this collection is rarely a σ-algebra. So we want to find the most suitable σ-algebra for
the problem, i.e., one large enough to contain the collection of sets we are interested in
but not too large. Certainly, P(Ω) is always large enough, and we can always use it
whenever Ω is a finite set. However, when Ω is an uncountably infinite set, using P(Ω) as
our σ-algebra would cause serious problems once we want to define a probability measure
on it.
Proposition 1.7. The intersection of an arbitrary nonempty collection of σ-algebras on a
set Ω is a σ-algebra.
Corollary 1.8. Let Ω be a set and let C be a family of subsets of Ω. Then there exists
a smallest σ-algebra on Ω containing all members of C. It is called the σ-algebra on Ω
generated by C, and is denoted by σ(C).

Proof. Observe that P(Ω) is a σ-algebra containing all members of C. Then

    σ(C) = ⋂ {F : F is a σ-algebra on Ω and C ⊆ F}

is the smallest σ-algebra on Ω containing all members of C.
Proposition 1.9. Let A ⊆ P(Ω).

1. If A is a σ-algebra then σ(A) = A.

2. If B is a σ-algebra and A ⊆ B then σ(A) ⊆ B.
Example. 1. Ω = {1, 2, . . . , 6} and C = {{1}, {2}, . . . , {6}} ⇒ σ(C) = P(Ω).

2. Ω = {1, 2, . . . , 6} and C = {{1}, {2, 3}} ⇒
σ(C) = {∅, {1}, {2, 3}, {4, 5, 6}, {1, 2, 3}, {2, 3, 4, 5, 6}, {1, 4, 5, 6}, Ω}.

3. In general, it is not always possible to list all members of σ(C). However, if C is a
countable set of mutually disjoint sets, say C = {A_1, A_2, . . . }, such that
⋃_{n=1}^∞ A_n = Ω, then σ(C) = {⋃_{i∈I} A_i : I ⊆ N}.
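On a finite Ω the generated σ-algebra can be computed by brute force, since every algebra
on a finite set is automatically a σ-algebra. A sketch (not from the text) that recovers the
8-element σ-algebra of Example 2:

```python
def generate_sigma_algebra(omega, generators):
    """Close the generating collection under complements and pairwise unions."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        for a in list(sets):
            c = omega - a                       # complement
            if c not in sets:
                sets.add(c); changed = True
            for b in list(sets):
                u = a | b                       # union
                if u not in sets:
                    sets.add(u); changed = True
    return sets

sigma = generate_sigma_algebra({1, 2, 3, 4, 5, 6}, [{1}, {2, 3}])
print(len(sigma))                    # 8, matching Example 2 above
print(sorted(map(sorted, sigma)))
```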
Definition 4 (Probability spaces). Let (Ω, F) be a measurable space. A measure on F is
a function µ : F → [0, ∞] such that µ(∅) = 0 and, for any countable collection of disjoint
sets {A_n}_{n=1}^∞ in F,

    µ(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ µ(A_n).   (countably additive)

(Ω, F, µ) is called a measure space. A measure P is said to be a probability measure if
P(Ω) = 1 and, in this case, (Ω, F, P) is called a probability space.
Example. Let Ω be any set and F = P(Ω).

1. [Relative frequency] For each A ⊆ Ω, define

    µ(A) = |A| ≡ the number of elements in A, if A is a finite set,
           ∞, if A is an infinite set.

Then µ is a measure, called the counting measure on Ω.

If |Ω| = N < ∞ and each ω ∈ Ω is equally likely to happen, then we let
P(A) = µ(A)/µ(Ω) = |A|/N. This defines a probability measure on (Ω, P(Ω)). Notice that
in this case of a finite sample space we can always take P(Ω) as the σ-algebra.
2. Fix an ω_0 ∈ Ω and define, for each A ⊆ Ω,

    µ_{ω_0}(A) = 1 if ω_0 ∈ A,
                 0 if ω_0 ∉ A.

Then µ_{ω_0} is a measure on Ω (verify!), called the Dirac measure concentrated at ω_0.
In this probability space, ω_0 happens with probability 1.
3. If Ω is countably infinite then it is impossible to construct a probability measure under
which each ω ∈ Ω is equally likely to happen.

4. If Ω is uncountably infinite then it is, in general, impossible to assign a probability
to every subset of Ω. So we have to opt for a smaller σ-algebra which still contains
simple sets such as intervals.
Definition 5 (Borel σ-algebra). The σ-algebra generated by the collection of all open sets
in R is called the Borel σ-algebra on R, denoted by B. Each element in the Borel σ-algebra
is called a Borel set.
Since every open set in R is a countable union of open intervals, the Borel σ-algebra on
R can also be generated by open intervals (a, b).
Proposition 1.10. Let Ω ≠ ∅, let {A_i}_{i=1}^∞ be a sequence of mutually disjoint subsets
of Ω (mutually exclusive events) such that ⋃_{i=1}^∞ A_i = Ω, and let F be the σ-algebra
generated by {A_i}_{i=1}^∞. If p_1, p_2, . . . are real numbers in [0, 1] for which
∑_{i=1}^∞ p_i = 1, then the set function P : F → [0, 1] defined by

    P(⋃_{i∈I} A_i) = ∑_{i∈I} P(A_i) = ∑_{i∈I} p_i,  I ⊆ N,

is a probability measure on F.
Proposition 1.11. Let (Ω, F, P) be a probability space.

1. If A_1, A_2, . . . , A_N ∈ F are mutually disjoint, then

    P(⋃_{n=1}^N A_n) = ∑_{n=1}^N P(A_n).   (finitely additive)

2. If A, B ∈ F and A ⊆ B, then P(A) ≤ P(B).

3. If A, B ∈ F and A ⊆ B, then P(B \ A) = P(B) − P(A).
4. For any events A_1, A_2, . . . ,

    P(⋃_{n=1}^∞ A_n) ≤ ∑_{n=1}^∞ P(A_n).
Theorem 1.12 (Continuity of probability measures). Let F be a σ-algebra on Ω, let P be
a probability measure on F, and let {A_n}_{n=1}^∞ be a sequence of events. Then:

1. If A_1 ⊆ A_2 ⊆ . . . , then P(⋃_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

2. If A_1 ⊇ A_2 ⊇ . . . , then P(⋂_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).
Proof. Put B_1 = A_1, B_2 = A_2 \ A_1, . . . , B_n = A_n \ A_{n−1}, so that the B_n’s are
mutually disjoint sets in F,

    A_n = ⋃_{k=1}^n B_k,  and  ⋃_{n=1}^∞ A_n = ⋃_{n=1}^∞ B_n.

Then

    P(⋃_{k=1}^∞ A_k) = P(⋃_{k=1}^∞ B_k) = ∑_{k=1}^∞ P(B_k)
                     = lim_{n→∞} ∑_{k=1}^n P(B_k) = lim_{n→∞} P(⋃_{k=1}^n B_k)
                     = lim_{n→∞} P(A_n).

For 2., let A = ⋂_{n=1}^∞ A_n and, for each n = 1, 2, . . . , let D_n = A_1 \ A_n. Then

    D_1 ⊆ D_2 ⊆ . . .  and  A_1 \ A = ⋃_{n=1}^∞ (A_1 \ A_n) = ⋃_{n=1}^∞ D_n.

Applying 1. to the D_n gives P(A_1 \ A) = lim_{n→∞} P(D_n), i.e., by Proposition 1.11.3,

    P(A_1) − P(A) = lim_{n→∞} (P(A_1) − P(A_n)) = P(A_1) − lim_{n→∞} P(A_n),

and hence P(A) = lim_{n→∞} P(A_n).
The following example illustrates that constructing a probability space on an infinite
sample space is not at all trivial.

Example. Consider the experiment of tossing a fair coin successively infinitely many
times. The sample space is the sequence space

    Ω = {H, T}^∞ = {(ω_n) ≡ ω_1 ω_2 · · · ω_n · · · : ω_i is either H or T for each i = 1, 2, . . . }.
The infinite cardinality of Ω prevents us from using the idea of relative frequency to define
probability here. So we start by defining easy events:

    A_{a_1 a_2 ··· a_k} ≡ {(ω_n) ∈ {H, T}^∞ : ω_1 = a_1, ω_2 = a_2, . . . , ω_k = a_k},

where a_1 a_2 · · · a_k ∈ {H, T}^k. It is natural to assign probability 1/2^k to each of these
events. More generally, if S ⊆ {H, T}^k then the event

    A_S ≡ {(ω_n) ∈ {H, T}^∞ : ω_1 ω_2 · · · ω_k ∈ S}

should be assigned probability |S|/2^k. We then take as the event space the σ-algebra σ(A)
generated by

    A = {A_S : k ∈ N, S ⊆ {H, T}^k} ∪ {Ω}.

σ(A) contains some events that are not in A but can certainly be obtained from events in
A through a series of countable unions and complements. For instance, show that the
following are events in σ(A) \ A:

• “all tosses after the third toss come up heads”
• “infinitely many heads come up throughout the experiment”

The next step is to define a probability measure on σ(A). We only know what probabilities
to assign to events in A. For the members of σ(A) \ A, the probabilities are determined by
Carathéodory’s Extension Theorem.

Theorem 1.13 (Carathéodory’s Extension Theorem). Let A be an algebra on Ω ≠ ∅ and
let µ : A → [0, 1] satisfy µ(Ω) = 1. If µ is countably additive on A, then there exists a
probability measure P on σ(A) such that P(A) = µ(A) for all A ∈ A.

To apply Carathéodory’s Extension Theorem to the above example, we define µ on A by
µ(Ω) = 1 and

    µ(A_S) = |S|/2^k  for S ⊆ {H, T}^k and k ∈ N.

Although it is not at all easy, it is possible to verify that µ is countably additive on A.
By Carathéodory’s Extension Theorem, we have a unique probability measure P on σ(A)
which agrees with µ on each A_S. Let us now compute the probabilities of the two events
listed above.
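A sketch of the two computations, using Theorem 1.12 and countable subadditivity (the
event names B_n and C_N are ours, not from the text):

```latex
% Event 1: all tosses after the third come up heads.  With
% B_n = \{\omega : \omega_4 = \cdots = \omega_{3+n} = H\}, the B_n decrease, so
P\Big(\bigcap_{n=1}^{\infty} B_n\Big)
  = \lim_{n\to\infty} P(B_n)
  = \lim_{n\to\infty} 2^{-n} = 0.

% Event 2: infinitely many heads.  Its complement is \bigcup_{N=1}^{\infty} C_N
% with C_N = \{\omega : \omega_n = T \text{ for all } n > N\}, and each C_N has
% probability 0 by the same limit argument, so subadditivity gives
P(\text{infinitely many heads})
  \;\ge\; 1 - \sum_{N=1}^{\infty} P(C_N) \;=\; 1.
```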
1.3 Random variables
Definition 6 (Random variables). Let Ω be a set, F a σ-algebra on Ω, and X : Ω → R.
X is called a random variable if {ω ∈ Ω : X(ω) < α} ∈ F for all α ∈ R. A random
variable is called discrete if X(Ω) is a countable set.

Remark. 1. If F = P(Ω) then every function X : Ω → R is a random variable. In
particular, whenever Ω is finite, we can take F = P(Ω) and every X is a random variable.

2. If X is discrete then X is a random variable if and only if {ω ∈ Ω : X(ω) = α} ∈ F
for all α ∈ R. This is not true for general random variables.

3. The σ-algebra depends on the random variable we want to consider.
Example. 1. Two dice are rolled and Jack bets on the sum being 10. If he wins, he’ll
get 2 baht; otherwise, he’ll have to pay 1 baht. Then a random variable representing
Jack’s gain from this game is

    X(ω) = 2 if ω = (4, 6), (5, 5), or (6, 4),
           −1 otherwise.

What is the sample space Ω? In each of the following situations, what is the σ-algebra F,
and is X a random variable on such (Ω, F)?

• Jack observes the dice himself, i.e., he has complete information about the outcome
of the experiment.
• Jack is told only the sum of the two dice.
• Jack is told only whether each die is Hi (4-6) or Lo (1-3).

In the situations where X is a random variable, assuming the dice are fair, find P(X = 2),
i.e. the probability that Jack wins.
2. Even though the head-to-head record of Federer versus Roddick is something like 10-1,
let’s assume that the probability that Federer wins a set from Roddick is 2/3. In a
five-set match, let X be the number of sets played. For example, if somebody wins in
three sets, then X = 3. Compute P(X = 3), P(X = 4), and P(X = 5). What are the
underlying sample space and σ-algebra?
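A brute-force sketch (not from the text) of this best-of-five computation: enumerate all
length-5 strings of set winners with per-set probability 2/3 for Federer, and read off from
each string’s prefix how many sets the match lasts. Weighting full-length strings is valid
because the probabilities of the unused suffixes sum to 1.

```python
from itertools import product
from fractions import Fraction

p = Fraction(2, 3)                       # P(Federer wins a set), from the text
dist = {3: Fraction(0), 4: Fraction(0), 5: Fraction(0)}

for outcome in product("FR", repeat=5):
    prob = Fraction(1)
    for s in outcome:
        prob *= p if s == "F" else 1 - p
    f_wins = r_wins = 0
    for sets_played, s in enumerate(outcome, start=1):
        if s == "F":
            f_wins += 1
        else:
            r_wins += 1
        if f_wins == 3 or r_wins == 3:   # match ends here
            break
    dist[sets_played] += prob

print(dist)   # {3: Fraction(1, 3), 4: Fraction(10, 27), 5: Fraction(8, 27)}
```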
3. Let (Ω, F, P) be a probability space. For any event A ∈ F, the indicator function of A
is the function 1_A : Ω → {0, 1} defined as

    1_A(ω) = 1 if ω ∈ A,
             0 otherwise.

For example, the random variable X in Example 1 is 2 · 1_A − 1_{Ω\A} where
A = {(4, 6), (5, 5), (6, 4)}.
Theorem 1.14. The following statements are equivalent.

1. {ω ∈ Ω : X(ω) < α} ∈ F for all α ∈ R.

2. {ω ∈ Ω : X(ω) ≤ α} ∈ F for all α ∈ R.

3. {ω ∈ Ω : X(ω) > α} ∈ F for all α ∈ R.

4. {ω ∈ Ω : X(ω) ≥ α} ∈ F for all α ∈ R.

Proof. 1. ⇒ 2.: Let α ∈ R. Since {ω : X(ω) < α + 1/n} is measurable for each
n = 1, 2, . . . , so is the countable intersection

    {ω : X(ω) ≤ α} = ⋂_{n=1}^∞ {ω : X(ω) < α + 1/n}.

2. ⇒ 3.: For each real number α, if {ω : X(ω) ≤ α} is measurable then
{ω : X(ω) > α} = {ω : X(ω) ≤ α}^c is clearly measurable as well.

3. ⇒ 4.: This follows from observing that

    {ω : X(ω) ≥ α} = ⋂_{n=1}^∞ {ω : X(ω) > α − 1/n}.

4. ⇒ 1.: {ω : X(ω) < α} = {ω : X(ω) ≥ α}^c ∈ F.
Theorem 1.15. If X and Y are random variables and c ∈ R, then cX, X + Y, X · Y, and
|X| are random variables.

Proposition 1.16. If X is a random variable on (Ω, F, P) and f is Borel measurable on
R, then f ∘ X is a random variable.
1.4 Distributions
Definition 7 (Distribution functions). Let (Ω, F, P) be a probability space and X a
random variable. Define F_X : R → [0, 1] by

    F_X(x) = P({ω ∈ Ω : X(ω) ≤ x}).

F_X is called the distribution function of X. In probability, we (almost) always write
P(X ≤ x) for P({ω ∈ Ω : X(ω) ≤ x}). Similarly, for E ⊆ R, P(X ∈ E) means
P({ω ∈ Ω : X(ω) ∈ E}). This is an abuse of notation where Ω is understood from the
context.
Example. 1. Consider the game Jack played. Find F_X. Now, if Jack plays the game
twice, find the distribution of Jack’s total gain.

2. By randomly choosing a real number in Ω = [0, 1], we mean that the probability of
a number in an interval (a, b) ⊆ [0, 1] being chosen is b − a, i.e. P((a, b)) = b − a.
Here, we use the Borel σ-algebra on [0, 1] and the probability measure P determined
by the probabilities of open intervals. If a is chosen, let X = a. Find F_X and F_{X²}.

3. Two random variables on different probability spaces may have the same distribution
function. As should be apparent from the definition, the distribution function extracts
only the information on the probability of the random variable taking values in any
given interval.
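For Example 2 above, a worked sketch: since P((a, b)) = b − a and X(a) = a,

```latex
F_X(x) = P(X \le x) =
\begin{cases} 0, & x < 0,\\ x, & 0 \le x \le 1,\\ 1, & x > 1, \end{cases}
\qquad
F_{X^2}(x) = P\big(X \le \sqrt{x}\,\big) =
\begin{cases} 0, & x < 0,\\ \sqrt{x}, & 0 \le x \le 1,\\ 1, & x > 1. \end{cases}
```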
Proposition 1.17. Let (Ω, F, P) be a probability space. Then X is a random variable if
and only if X^{−1}(E) = {ω ∈ Ω : X(ω) ∈ E} ∈ F for all Borel sets E ∈ B(R).

Definition 8 (Distributions). Let X be a random variable on the probability space
(Ω, F, P). The set function P_X defined on (R, B(R)) by

    P_X(E) = P(X ∈ E) = P(X^{−1}(E))  for all E ∈ B(R)

is a probability measure called the distribution of X.
Theorem 1.18. A function F : R → [0, 1] is the distribution function of a random variable
if and only if

1. F is nondecreasing;

2. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1; and

3. F is right continuous, i.e., for all x_0 ∈ R, lim_{x→x_0⁺} F(x) = F(x_0).
Definition 9 (Probability functions). If X is a discrete random variable on (Ω, F, P) then
the function f : R → [0, 1] defined as

    f(x) = P(X = x)  for all x ∈ R

is called the probability function of X.

Note that 0 ≤ f(x) ≤ 1 and F_X(x) = ∑_{t≤x} f(t). Clearly, if t ∉ X(Ω) then f(t) = 0.
Definition 10 (Continuous random variables). A random variable X on (Ω, F, P) is called
a continuous random variable if there is a Lebesgue integrable function f : R → [0, ∞)
such that

    F_X(x) = ∫_{−∞}^x f(t) dt  for all x ∈ R.

f is called the probability density function (pdf) of X. Of course,

    ∫_{−∞}^∞ f(t) dt = lim_{x→∞} F_X(x) = 1.

Note that this is the same as saying

    P_X(A) = ∫_A f(t) dt  for all A ∈ B(R).
Example. 1. A continuous random variable is not a random variable that is continuous
as a function on Ω. Consider the constant functions.

2. If Ω is finite, no function on Ω is a continuous random variable.

3. A random variable can be neither discrete nor continuous. For example, consider an
experiment where we toss a coin; if it comes up heads, let X = 2. Otherwise, we
choose a number x between 0 and 1 randomly and let X = x. This random variable
X is neither discrete nor continuous.
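For the coin/uniform mixture in item 3, a sketch of the distribution function (assuming a
fair coin):

```latex
F_X(x) =
\begin{cases}
0, & x < 0,\\
x/2, & 0 \le x < 1,\\
1/2, & 1 \le x < 2,\\
1, & x \ge 2.
\end{cases}
```

The ramp on [0, 1) rules out discreteness, while the jump of size 1/2 at x = 2 rules out
the existence of a density, so X is neither discrete nor continuous.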
Definition 11. Two random variables X and Y, possibly on different probability spaces,
are said to be identically distributed if they have the same distribution function
(distribution), i.e. F_X = F_Y (P_X = P_Y).

Proposition 1.19. Let f : R → [0, ∞) be a Lebesgue integrable function for which
∫_R f(x) dx = 1. Then f is the pdf of a continuous random variable with distribution
function F : R → [0, 1] defined as

    F(x) = ∫_{−∞}^x f(t) dt,  x ∈ R.
Proposition 1.20. Suppose X is a continuous random variable on (Ω, F, P) with pdf f.
Then

1. for all a ∈ R, P(X = a) = 0; and

2. for any interval I with endpoints a < b, P(X ∈ I) = ∫_a^b f(x) dx.
Definition 12 (Some discrete distributions).

Bernoulli. A r.v. X with P(X = 1) = p and P(X = 0) = 1 − p is said to be a Bernoulli
random variable with parameter p.

Binomial. X is said to have a binomial distribution with parameters n and p if, for each
k = 0, 1, . . . , n,

    P(X = k) = C(n, k) p^k (1 − p)^{n−k},  where C(n, k) = n!/(k!(n − k)!).
Poisson. X is said to have a Poisson distribution with parameter λ > 0 if for each k ∈ N_0,

    P(X = k) = e^{−λ} λ^k / k!.
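A quick sanity check (not from the text) that these probability functions sum to 1 (for the
Poisson, up to truncation of the series):

```python
import math

def binomial_pmf(n, p, k):
    # C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    # e^(-lam) * lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

print(sum(binomial_pmf(10, 0.3, k) for k in range(11)))   # 1.0 (up to rounding)
print(sum(poisson_pmf(2.5, k) for k in range(100)))       # ~1.0
```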
Definition 13 (Some continuous distributions).

Uniform. X is said to have the uniform distribution on [a, b], or to be uniformly
distributed on [a, b], a < b, if its pdf is

    f(t) = 1/(b − a), a ≤ t ≤ b,
           0, otherwise.

Exponential. X has the exponential distribution with parameter λ, or is exponentially
distributed with parameter λ > 0, if its pdf is

    f(t) = λ e^{−λt}, t ≥ 0,
           0, otherwise.

Standard Normal. Define the standard normal density

    φ(x) = (1/√(2π)) e^{−x²/2}.

Then X is said to have the standard normal distribution if its pdf is φ.
Definition 14 (Almost surely). An event A is said to happen almost surely (a.s.) in
(Ω,F , P ) if P (A) = 1. For example,
1. X = Y a.s. on Ω means P (X = Y ) = 1.
2. |X| ≤M a.s. means P (|X| ≤M) = 1.
1.5 Expected values and variances
X is called a simple random variable if X(Ω) is a finite set. Note that simple random
variables are discrete. Also, a simple random variable can always be written in the standard
form
X =n∑k=1
ak1Ak(2)
where Ak = X = ak, X(Ω) = a1, a2, . . . , an, and all a1, a2, . . . , an are distinct.
Let us note that any given random variable X can be written as
X = X+ −X−
where X+(ω) = max(X(ω), 0), ω ∈ Ω, is the positive part and X−(ω) = −min(X(ω), 0),
ω ∈ Ω, is the negative part. Note that both X+ ≥ 0 and X− ≥ 0.
Definition 15 (Expected values/Expectations). Let X be a random variable on a
probability space (Ω, F, P). We define the expected value E[X] of X as follows.

• If X is a simple random variable with the standard form in (2), we define

    E[X] = ∫_Ω X dP = ∑_{k=1}^n a_k P(A_k).

In particular, if X = 1_A then E[X] = P(A).

• If X ≥ 0 then we define

    E[X] = ∫_Ω X dP = sup {E[Y] : Y is simple and Y ≤ X}.

• If at least one of E[X⁺] and E[X⁻] is finite, then we define

    E[X] = E[X⁺] − E[X⁻].

X is called integrable if both E[X⁺] and E[X⁻] are finite, i.e., E[|X|] < ∞. From now on,
we shall consider only integrable random variables, and we say that the expectation of X
does not exist if neither E[X⁺] nor E[X⁻] is finite.

This measure-theoretic approach to defining expected values needs a series of theorems in
measure theory, such as the monotone convergence theorem, Fatou’s lemma, and Lebesgue’s
dominated convergence theorem.
Proposition 1.21. Let X, Y be (integrable) random variables, a ∈ R, and A, B ∈ F. Then

1. If X ≤ Y a.s., then E[X] ≤ E[Y].

2. If X ≥ 0 and A ⊆ B, then E[X · 1_A] ≤ E[X · 1_B].

3. E[aX] = a E[X].

4. E[X + Y] = E[X] + E[Y].

5. If A and B are disjoint, then E[X · 1_{A∪B}] = E[X · 1_A] + E[X · 1_B].
Example. Let (Ω, F, P) be any probability space with Ω = {ω_1, ω_2, . . . } a countable
set. Then the expected value of a random variable X with ∑_{n=1}^∞ |X(ω_n)| P({ω_n}) < ∞ is

    E[X] = ∑_{n=1}^∞ X(ω_n) P({ω_n}).
Note that for a given random variable X on (Ω, F, P), P_X is a probability measure on
(R, B(R)) and so ∫_R x dP_X(x) is defined. In fact, by Proposition 1.23,

    E[X] = ∫_R x dP_X(x).
Proposition 1.22. Let (Ω, F, P) be a probability space. If X is a discrete random variable
on (Ω, F, P) with probability function f, then

    E[X] = ∑_{x∈X(Ω)} x f(x).

If X is a continuous random variable on (Ω, F, P) with pdf f, then

    E[X] = ∫_R x f(x) dx.   (3)
Definition 16 (Variances). Let X be a random variable for which both E[X] and E[X²]
exist. Then the variance of X is defined as

    Var(X) = E[(X − E[X])²].
To justify the equation (3), we have to make a change of variable and use the fact that
the distribution of a continuous random variable is absolutely continuous with respect to
Lebesgue measure on R. This is why continuous random variables are sometimes referred
to as absolutely continuous random variables.
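As a concrete check of Proposition 1.22 and Definition 16 (a sketch, not from the text):
the expectation and variance of Jack’s gain X from the two-dice bet of Section 1.3,
computed by enumerating the 36 equally likely outcomes.

```python
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
X = {w: (2 if w[0] + w[1] == 10 else -1) for w in outcomes}
p = Fraction(1, 36)                     # each of the 36 outcomes equally likely

EX = sum(p * X[w] for w in outcomes)                 # E[X]
var = sum(p * (X[w] - EX) ** 2 for w in outcomes)    # E[(X - E[X])^2]

print(EX, var)   # -3/4 11/16
```

So on average Jack loses 3/4 baht per game.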
Proposition 1.23 (Change of variables). Let X be a random variable on a probability
space (Ω, F, P), and let g be any random variable on (X(Ω), B(X(Ω)), P_X) with g ≥ 0
or E_{P_X}[|g|] < ∞. Then

    ∫_{X(Ω)} g(x) dP_X(x) = ∫_Ω (g ∘ X)(ω) dP(ω).
Definition 17 (Absolute continuity). Let (Ω, F) be a measurable space and let P, Q be
probability measures on (Ω, F). We say that P is absolutely continuous with respect to
Q, denoted by P ≪ Q, if, for all A ∈ F, Q(A) = 0 implies P(A) = 0.
Theorem 1.24 (Radon-Nikodym Theorem). Let (Ω, F) be a measurable space and let P,
Q be probability measures on (Ω, F). If P ≪ Q then there exists a unique (up to Q-a.e.
equality) f : Ω → [0, ∞) such that

    P(A) = ∫_A f dQ,  A ∈ F,

and, consequently,

    ∫_A g dP = ∫_A g · f dQ,  A ∈ F.
1.6 Conditional probability and independence
Definition 18 (Conditional probabilities). Let A, B ∈ F with P(A) > 0. We define the
conditional probability of B given A by

    P(B|A) = P(B ∩ A) / P(A).

If both P(A) and P(B) are positive, then P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B).
Theorem 1.25. Let (Ω, F, P) be a probability space and A ∈ F with P(A) > 0. Define
F_A = {A ∩ B : B ∈ F} and P_A : F_A → [0, 1] by P_A(B) = P(B|A). Then (A, F_A, P_A)
is a probability space.

We call a set of events {B_1, B_2, . . . , B_n} a partition of Ω if B_1, B_2, . . . , B_n are
mutually disjoint, ⋃_{i=1}^n B_i = Ω, and each P(B_i) > 0.
Theorem 1.26 (Bayes’ rule). Let {B_1, B_2, . . . , B_n} be a partition of (Ω, F, P) and let
P(A) > 0. Then

1. P(A) = ∑_{i=1}^n P(B_i) P(A|B_i);

2. P(B_k|A) = P(B_k) P(A|B_k) / P(A) = P(B_k) P(A|B_k) / ∑_{i=1}^n P(B_i) P(A|B_i).
Example. A football team has had a poor season and the manager is likely to be fired at
the end of the season. If the team wins its final game, his chance of being fired is 60%,
but if the team fails to win, then the chance of his being fired is 80%. The probability
that the team wins its final game is 0.3. Find:

1. the probability that the manager is fired;

2. given that he was fired, the probability that the team won the final game.
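The two answers follow directly from Theorem 1.26; a worked sketch of the numbers from
the example:

```python
from fractions import Fraction

p_win = Fraction(3, 10)
p_fired_given_win = Fraction(6, 10)
p_fired_given_not = Fraction(8, 10)

# Total probability: P(fired) = P(win)P(fired|win) + P(not win)P(fired|not win)
p_fired = p_win * p_fired_given_win + (1 - p_win) * p_fired_given_not
# Bayes' rule: P(win | fired)
p_win_given_fired = p_win * p_fired_given_win / p_fired

print(p_fired)            # 37/50, i.e. 0.74
print(p_win_given_fired)  # 9/37
```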
Definition 19 (Independence of events). Let (Ω, F, P) be a probability space. The events
A, B ∈ F are said to be independent if

    P(A ∩ B) = P(A) P(B).

A subcollection A of F is said to be pairwise independent if, for any pair of events
A, B ∈ A, P(A ∩ B) = P(A) P(B). More generally, a collection A of events in F is called
independent if

    P(A_1 ∩ A_2 ∩ · · · ∩ A_n) = P(A_1) P(A_2) · · · P(A_n)

for every finite subcollection {A_1, A_2, . . . , A_n} of distinct events in A.
Example. 1. Pairwise independence does not imply independence. Consider, for example,
the events A = {HH, HT}, B = {HH, TH}, and C = {HH, TT} when a fair coin is tossed
twice.

2. ∅ and Ω are independent of any event. More generally, events with probability 0 or 1
are independent of any event.

3. If two disjoint events A and B are independent, then either P(A) = 0 or P(B) = 0.
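A direct check (not from the text) of item 1: every pair among A, B, C multiplies
correctly, but the triple intersection {HH} has probability 1/4, not 1/8.

```python
from fractions import Fraction
from itertools import combinations

# two fair tosses: four equally likely outcomes
P = lambda E: Fraction(len(E), 4)   # E is a subset of {HH, HT, TH, TT}

A, B, C = {"HH", "HT"}, {"HH", "TH"}, {"HH", "TT"}

for E, F in combinations([A, B, C], 2):
    assert P(E & F) == P(E) * P(F)          # pairwise independence holds

print(P(A & B & C), P(A) * P(B) * P(C))     # 1/4 versus 1/8: not independent
```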
Proposition 1.27. 1. Suppose P(B) > 0. Then A, B are independent if and only if
P(A|B) = P(A).

2. If A and B are independent, then Ω \ A and B are independent.

3. If A ⊆ F is independent, then it is pairwise independent.

Given a random variable X on a probability space (Ω, F, P), we define the σ-algebra
generated by X, denoted by σ(X), to be the σ-algebra generated by all inverse images of
Borel sets under X, i.e.

    σ(X) = {X^{−1}(E) : E ∈ B(R)}.

Verify that σ(X) is a σ-algebra.
Definition 20 (Independence of random variables). Random variables X1, X2, . . . , Xm on
(Ω, F, P) are said to be independent if for any Ai ∈ σ(Xi), i = 1, 2, . . . , m, the collection
A1, A2, . . . , Am is independent. In other words, X and Y are independent if
P(X^{−1}(E) ∩ Y^{−1}(F)) = P(X^{−1}(E)) P(Y^{−1}(F)) = PX(E)PY(F)
for all Borel sets E, F ∈ B(R).
Example. 1. Consider two urns, red and blue, each holding k chips that are numbered
1, 2, . . . , k. A chip is to be drawn at random from each urn. Let Xr, Xb be the numbers
on the chips drawn from the red and blue urns, respectively. Then Xr and Xb are
independent.
2. A constant (a.s.) random variable is independent of any other random variable.
3. A random variable is independent of itself iff it is constant a.s.
4. Let X, Y be independent random variables and let f, g be Borel measurable functions
on R. Then the random variables f ∘ X and g ∘ Y are independent.
Proposition 1.28. X and Y are independent if and only if for all a, b ∈ R,
P (X ≤ a, Y ≤ b) = P (X ≤ a)P (Y ≤ b).
Proposition 1.29. Discrete random variables X1, . . . , Xm are independent iff for any
x1, . . . , xm ∈ R,
P (X1 = x1, . . . , Xm = xm) = P (X1 = x1) · · ·P (Xm = xm).
Example. 1. Toss a fair coin twice and consider X1(HH) = 2 and X1 = −1 otherwise;
X2(TT) = 2 and X2 = −1 otherwise;
Y1(ω) = 2 if ω ∈ {HH, HT} and −1 if ω ∈ {TH, TT}; and
Y2(ω) = 2 if ω ∈ {HH, TH} and −1 if ω ∈ {HT, TT}.
Then X1, X2 are not independent while Y1, Y2 are independent.
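Both claims can be checked directly with the criterion of Proposition 1.29; the sketch below tests the product rule at every pair of values.

```python
from itertools import product
from fractions import Fraction

# Two fair coin tosses, uniform probability on four outcomes.
omega = [a + b for a, b in product("HT", repeat=2)]
X1 = {w: 2 if w == "HH" else -1 for w in omega}
X2 = {w: 2 if w == "TT" else -1 for w in omega}
Y1 = {w: 2 if w in ("HH", "HT") else -1 for w in omega}
Y2 = {w: 2 if w in ("HH", "TH") else -1 for w in omega}

def independent(U, V):
    # Proposition 1.29: P(U = u, V = v) = P(U = u)P(V = v) for all values u, v.
    P = lambda pred: Fraction(sum(pred(w) for w in omega), len(omega))
    return all(
        P(lambda w: U[w] == u and V[w] == v)
        == P(lambda w: U[w] == u) * P(lambda w: V[w] == v)
        for u in set(U.values()) for v in set(V.values())
    )

print(independent(X1, X2))  # False
print(independent(Y1, Y2))  # True
```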
Definition 21. The joint distribution function of random variables X1, X2, . . . , Xm on
(Ω,F , P ) is the function FX1,...,Xm : Rm → [0, 1] defined by
FX1,...,Xm (x1, . . . , xm) = P (X1 ≤ x1, . . . , Xm ≤ xm) .
The joint probability density function of X1, X2, . . . , Xm, if it exists, is the function
fX1,X2,...,Xm : R^m → [0, ∞) for which
P(X1 ≤ x1, . . . , Xm ≤ xm) = ∫_{−∞}^{xm} · · · ∫_{−∞}^{x1} fX1,X2,...,Xm(t1, . . . , tm) dt1 · · · dtm.
So, for example, if FX1,X2 is twice differentiable then fX1,X2(x1, x2) = ∂²FX1,X2 / ∂x1∂x2.
Example. If fX,Y(x, y) = c e^{−2x} / (1 + y²) for x > 0 and y ∈ R (and 0 otherwise), what is c?
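Since the density factors into a function of x and a function of y, the normalizing constant comes from two one-dimensional integrals; the sketch below derives c and then sanity-checks the total mass numerically.

```python
import math

# Separability: fX,Y(x, y) = c · e^(−2x) · 1/(1 + y²), so the total mass is
# c · ∫_0^∞ e^(−2x) dx · ∫_R dy/(1 + y²) = c · (1/2) · π, forcing c = 2/π.
c = 2 / math.pi

# Crude numeric check of the normalization by midpoint rules on
# truncated ranges (the truncation error is tiny for these tails).
dx, dy = 1e-3, 1e-2
ix = sum(math.exp(-2 * (k + 0.5) * dx) for k in range(int(20 / dx))) * dx          # ≈ 1/2
iy = 2 * sum(1 / (1 + ((k + 0.5) * dy) ** 2) for k in range(int(1000 / dy))) * dy  # ≈ π, by symmetry
print(c * ix * iy)  # ≈ 1
```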
Remark. One can recover the pdf of Xi from the joint pdf of X1, . . . , Xn by integrating out
the other variables. For instance,
fX(x) = ∫_R fX,Y(x, y) dy, fY(y) = ∫_R fX,Y(x, y) dx.
Proposition 1.30. Let X1, X2, . . . , Xm be random variables with the joint distribution
function FX1,...,Xm. Then they are independent iff for any x1, . . . , xm ∈ R,
FX1,...,Xm (x1, . . . , xm) = FX1(x1) . . . FXm(xm).
In particular, if Xi’s are continuous independent random variables, then the joint probability
density function of X1, X2, . . . , Xm is
fX1,X2,...,Xm(t1, t2, . . . , tm) = fX1(t1)fX2(t2) · · · fXm(tm).
Example. 1. Toss a fair coin twice
1.7 More on expectations
Similar to the one-variable case, if g : R² → R is a Borel measurable function and X, Y are
random variables on (Ω, F, P) and (Σ, G, Q), respectively, then we define
Eg(X, Y) = ∫_{Ω×Σ} g(X, Y) d(P × Q).
Hence, if X, Y are discrete with joint probability function f(x, y) then
Eg(X, Y) = ∑_{x,y} g(x, y)f(x, y);
and if X, Y are continuous with joint density function f then
Eg(X, Y) = ∫_R ∫_R g(x, y)f(x, y) dx dy.
Theorem 1.31. Let X,Y be random variables on (Ω,F , P ) whose expectations exist.
1. If X,Y are independent then E[XY ] = E[X]E[Y ]. But the converse is not true in
general.
2. X,Y are independent if and only if for any measurable functions g, h for which both
Eg(X) and Eh(Y ) exist,
E[g(X)h(Y )] = E[g(X)]E[h(Y )].
The covariance of X and Y is defined as
Cov(X, Y) = E[(X − EX)(Y − EY)] = E[XY] − E[X]E[Y].
It then follows immediately that if X, Y are independent then their covariance is 0. And
since
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y),
if X, Y are independent then Var(X + Y) = Var(X) + Var(Y).
Example. Since a binomial random variable is the sum of n independent Bernoulli(p) random variables, its variance is np(1 − p).
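The binomial variance formula can be confirmed exactly from the pmf; the parameter values below are an illustrative choice.

```python
from math import comb

# Exact check that Var(Bin(n, p)) = n·p·(1 − p), computed from the pmf.
n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, var)  # 3.0 and 2.1 (= n·p and n·p·(1 − p)), up to float rounding
```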
Definition 22 (Conditional expectation). Let X be a discrete random variable whose
expectation exists and A be an event with P(A) > 0. The conditional expectation of X
given A is
E[X|A] = ∑_{x∈Im X} x P(X = x|A).
Theorem 1.32. Let X be a discrete random variable whose expectation exists and B1, . . . , Bn
be a partition of Ω. Then
E[X] = ∑_{i=1}^n E[X|Bi]P(Bi).
Example. Suppose that an urn contains N cards labeled x1, . . . , xN. Let X, Y be the
numbers on the first and second cards chosen at random. Suppose the cards are drawn
randomly without replacement. Find E[X], E[Y] and Cov(X, Y).
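For a small concrete deck one can compute these quantities exactly by enumerating all ordered draws; the deck {1, . . . , 5} below is an illustrative choice, not from the notes. Both marginals are uniform on the deck, and the covariance matches the without-replacement formula −σ²/(N − 1), with σ² the population variance of the labels.

```python
from itertools import permutations
from statistics import mean

# Enumerate all equally likely ordered draws (X, Y) of two distinct cards.
cards = [1, 2, 3, 4, 5]
N = len(cards)
draws = list(permutations(cards, 2))

EX = mean(x for x, y in draws)
EY = mean(y for x, y in draws)
cov = mean(x * y for x, y in draws) - EX * EY

# Compare with the closed form −σ²/(N − 1).
xbar = mean(cards)
sigma2 = mean((c - xbar) ** 2 for c in cards)
print(EX, EY, cov, -sigma2 / (N - 1))  # 3.0 3.0 -0.5 -0.5
```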
If P and Q are probability measures on (Ω, F), then we have expectations with respect
to different measures:
EP[X] = ∫_Ω X dP and EQ[X] = ∫_Ω X dQ.
If P ≪ Q (P is absolutely continuous with respect to Q) then one can write expectations
with respect to one measure in terms of the other:
EP[X] = ∫_Ω X dP = ∫_Ω X (dP/dQ) dQ = EQ[X dP/dQ],
where dP/dQ is the Radon–Nikodym derivative.
Definition 23. Let X, Y be random variables on (Ω, F, P). Then E[X|Y] is a random
variable on (Ω, σ(Y)) satisfying
∫_A X dP = ∫_A E[X|Y] dP for all A ∈ σ(Y).
Note that E[X|Y] is unique up to a.s. equivalence.
Example. Consider E[X|Y] where Y = 1A and where Y = ∑_{i=1}^n ai 1Ai.
Proposition 1.33. Let X,Y be random variables on (Ω,F , P ).
1. E [E[X|Y ]] = E[X]
2. If Y ≡ c then E[X|Y ] = E[X].
3. If X,Y are independent then E[X|Y ] = E[X].
In the above, what really matters is the σ-algebra that Y generates.
Definition 24. Let X be a random variable on (Ω, F, P) and G be a σ-algebra with G ⊆ F.
Then E[X|G] is a random variable on (Ω, G) satisfying
∫_A X dP = ∫_A E[X|G] dP for all A ∈ G.
Theorem 1.34. Let X,Y be random variables on (Ω,F , P ) and G ⊆ F be a σ-algebra. Let
a, b ∈ R.
1. If X is G-measurable, then E[X|G] = X a.s.
2. E[aX + bY |G] = E[aX|G] + E[bY |G] a.s.
3. If X is G-measurable and E[XY ] is finite then E[XY |G] = X E[Y |G] a.s.
4. If H ⊆ G is a σ-algebra then
E[X|H] = E [E[X|G]|H] = E [E[X|H]|G] a.s.
5. If X ≤ Y a.s., then E[X|G] ≤ E[Y |G] a.s.
1.8 Moment generating functions
Definition 25. The moment generating function of a random variable X on (Ω,F , P ) is
defined as
MX(t) = E[etX ].
It is understood that the domain of MX is the set of all t for which E[etX ] exists.
Example. Find MX if
1. X ∼ Ber(p)
2. X ∼ Poi(λ)
3. X ∼ N (µ, σ2)
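For the Bernoulli case the answer is MX(t) = 1 − p + p e^t; the sketch below checks this against E[e^{tX}] computed directly from the distribution, and recovers E[X] = M′(0) by a central difference (the values of p and t are illustrative).

```python
import math

# Bernoulli(p) mgf: M(t) = 1 − p + p·e^t.
p, t = 0.3, 0.7
M = lambda s: 1 - p + p * math.exp(s)

# E[e^{tX}] summed over the two outcomes X = 0, 1.
E_etX = (1 - p) * math.exp(t * 0) + p * math.exp(t * 1)
print(abs(M(t) - E_etX))  # 0.0

# Proposition 1.35, item 3: E[X] = M'(0), here via a central difference.
h = 1e-6
dM0 = (M(h) - M(-h)) / (2 * h)
print(dM0)  # ≈ 0.3 = p = E[X]
```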
Proposition 1.35. Let X be a random variable whose moment generating function exists
on (−δ, δ) for some δ > 0. Then
1. MX^(k)(t) = ∑_{x∈Im(X)} (d^k/dt^k)(e^{tx}) P(X = x) if X is discrete.
2. MX^(k)(t) = ∫_R (d^k/dt^k)(e^{tx}) f(x) dx if X is continuous.
3. E[X^k] = MX^(k)(0)
4. MX(t) = ∑_{k=0}^∞ (t^k/k!) E[X^k]
Theorem 1.36. If MX = MY on some interval (−δ, δ) then X and Y have the same
distribution.
Theorem 1.37. If X,Y are independent then MX+Y = MXMY .
Example. 1. Find MX if X is a binomial with parameters n, p.
2. If X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) where X, Y are independent, then X + Y ∼
N(µ1 + µ2, σ1² + σ2²).
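The sum-of-normals fact in item 2 can be illustrated by simulation; the sketch below checks only the sample mean and variance of X + Y against µ1 + µ2 and σ1² + σ2² (illustrative parameters, not a proof).

```python
import random
import statistics

# Monte Carlo sanity check that X + Y has mean µ1 + µ2 and
# variance σ1² + σ2² for independent normals.
random.seed(0)
mu1, s1, mu2, s2 = 1.0, 2.0, -0.5, 1.5
z = [random.gauss(mu1, s1) + random.gauss(mu2, s2) for _ in range(200_000)]
print(statistics.fmean(z))     # ≈ 0.5  (= µ1 + µ2)
print(statistics.variance(z))  # ≈ 6.25 (= σ1² + σ2²)
```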
1.9 Laws of large numbers
Theorem 1.38 (Chebyshev’s inequality). Let X be a random variable on (Ω, F, P) with
finite variance and ε > 0. Then
P(|X − EX| ≥ ε) ≤ Var(X)/ε².
More generally, for any r > 0,
P(|X − EX| ≥ ε) ≤ E(|X − EX|^r)/ε^r.
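Chebyshev's bound is usually far from tight; the sketch below compares the exact tail probability with the bound for a binomial example (X ∼ Bin(20, 0.5), so EX = 10 and Var(X) = 5; the numbers are an illustrative choice).

```python
from math import comb

# Exact P(|X − EX| ≥ ε) versus the Chebyshev bound Var(X)/ε².
n, p, eps = 20, 0.5, 4
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
exact = sum(pk for k, pk in enumerate(pmf) if abs(k - n * p) >= eps)
bound = n * p * (1 - p) / eps**2
print(exact, bound)  # exact ≈ 0.115, bound = 0.3125
```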
Definition 26 (Modes of convergence). Let Xn be a sequence of random variables on a
common probability space (Ω, F, P).
• Xn is said to converge almost surely to X (Xn → X a.s.) if
P(lim_{n→∞} Xn = X) = 1.
• Xn is said to converge in the mean of order p > 0 to X (Xn → X in Lp) if
lim_{n→∞} E(|Xn − X|^p) = 0.
• Xn is said to converge in probability to a random variable X (Xn → X in prob.) if
for all ε > 0,
lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
Observe that both convergence in the mean and convergence a.s. imply convergence
in probability. There are no other implications among these modes in general. Convergence
in probability is thus the weakest of the three modes of convergence.
Theorem 1.39 (Weak law of large numbers I). Let Xn be a sequence of independent
random variables, each with expected value µ and variance σ². Then
(1/n) ∑_{k=1}^n Xk → µ, as n → ∞, in probability.
Proof. (1/n) ∑_{k=1}^n Xk has expectation µ and variance σ²/n. Therefore, by Chebyshev’s
inequality,
P(|(X1 + · · · + Xn)/n − µ| ≥ ε) ≤ σ²/(nε²).
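The concentration of the sample mean can be seen in a short simulation; the fair-die setup below (µ = 3.5) is an illustrative choice, not a proof.

```python
import random
import statistics

# Illustration of the weak law: the sample mean of fair-die rolls
# settles near µ = 3.5 as n grows.
random.seed(1)
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, statistics.fmean(rolls))
```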
Theorem 1.40 (Weak law of large numbers II). Let X1, X2, . . . be uncorrelated random
variables (E[XiXj] = E[Xi]E[Xj] for i ≠ j) with means µ1, µ2, . . . and variances σ1², σ2², . . . .
Assume that
lim_{n→∞} (1/n²) ∑_{k=1}^n σk² = 0.
Then for each ε > 0,
lim_{n→∞} P(|(X1 + · · · + Xn)/n − (µ1 + · · · + µn)/n| ≥ ε) = 0.
Example. Suppose that an urn contains N cards labelled x1, . . . , xN such that ∑_{i=1}^N xi = 0.
Let X1, . . . , Xn be the numbers on the cards chosen at random. Suppose the cards are drawn
randomly without replacement.
Let A1, A2, . . . be events in a probability space (Ω, F, P). Then the event
⋂_{n=1}^∞ ⋃_{m=n}^∞ Am = {ω ∈ Ω : ω belongs to infinitely many of the An}
is called “An i.o.” (An infinitely often).
Lemma 1.41 (Borel-Cantelli). 1. If ∑_{n=1}^∞ P(An) < ∞ then P(An i.o.) = 0.
2. If ∑_{n=1}^∞ P(An) = ∞ and the An’s are independent, then P(An i.o.) = 1.
Example. 1. If Xn → X in probability then there is a subsequence (nk) such that Xnk
converges to X a.s.
2. What is the probability that two consecutive heads will come up infinitely often in the
repeated tossing of a fair coin?
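The answer to the coin question is 1: taking Ak = {tosses 2k − 1 and 2k are both heads} gives independent events with P(Ak) = 1/4, so ∑ P(Ak) = ∞ and Borel-Cantelli (part 2) applies. A simulation agrees that occurrences of "HH" keep accumulating; the block-event choice above and the sketch below are illustrations, not part of the notes.

```python
import random

# Count overlapping occurrences of "HH" in a long simulated toss sequence.
random.seed(2)
tosses = "".join(random.choice("HT") for _ in range(100_000))
count = sum(1 for i in range(len(tosses) - 1) if tosses[i : i + 2] == "HH")
print(count)  # roughly 100_000 / 4, and it grows without bound as tosses continue
```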
Theorem 1.42 (Strong law of large numbers). Let Xn be a sequence of independent,
identically distributed random variables with finite expected value µ = E[Xi]. Then
(1/n) ∑_{k=1}^n Xk → µ almost surely as n → ∞.
A simple proof assumes that the fourth moment is finite (E[Xi⁴] < ∞) and uses
Chebyshev’s inequality and the Borel-Cantelli lemma.
1.10 Central limit theorem
Theorem 1.43 (Central limit theorem). Let Xn be a sequence of independent, identically
distributed random variables with
E[Xi] = µ, Var(Xi) = σ².
Then for all (a, b) ⊆ (−∞, ∞),
lim_{n→∞} P(a ≤ ∑_{i=1}^n (Xi − µ)/(√n σ) ≤ b) = (1/√(2π)) ∫_a^b e^{−x²/2} dx.
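The theorem can be illustrated by simulation: standardized sums of uniform(0, 1) variables (µ = 1/2, σ² = 1/12, an illustrative choice) should land in (−1, 1) with probability near Φ(1) − Φ(−1) ≈ 0.6827.

```python
import math
import random

# Monte Carlo illustration of the CLT with uniform(0, 1) summands.
random.seed(3)
mu, sigma, n, trials = 0.5, math.sqrt(1 / 12), 30, 50_000
hits = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (math.sqrt(n) * sigma)  # standardized sum
    hits += -1 <= z <= 1
print(hits / trials)  # ≈ 0.68
```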