TRANSCRIPT
A Brief Analysis of Central Limit Theorem
Omid Khanmohamadi ([email protected])
Diego Hernán Díaz Martínez ([email protected])
Tony Wills ([email protected])
Kouadio David Yao ([email protected])
SIAM Chapter, Florida State University
March 17, 2014
Outline
Examples
Statement of Theorem
Modes of Convergence
Fourier Transform and Convolution
Outline of Proof
Generalizations
From Concrete to Abstract: Examples then Theorems!
"The source of all great mathematics is the special case, the concrete example. It is frequent in mathematics that every instance of a concept of seemingly great generality is in essence the same as a small and concrete special case."
Paul Halmos (1916-2006) [image source: Wikipedia]

"You should start with understanding the interesting examples and build up to explain what the general phenomena are. This was your progress from initial understanding to more understanding."
Michael Atiyah [image source: Wikipedia]
Sum of Dice Throws is (Eventually) Normally Distributed
Comparison of probability density functions $p(k)$ for the sum of $n$ fair 6-sided dice, showing convergence to a normal distribution with increasing $n$. [image source: Wikipedia]
[Figure: five panels for $n = 1, 2, 3, 4, 5$, each plotting $p(k)$ against $k$.]
Dice Throws (Cont'd)
- Roll a fair die $10^9$ times, with each roll independent of the others.
- fair = faces have equal probability (identically distributed)
- Let $X_i$ be the number that comes up on the $i$th roll and let $S_{10^9} = \sum_{i=1}^{10^9} X_i$ be the total (sum) of the numbers rolled.
- The probability that $S_{10^9}$ is less than $x$ standard deviations above its mean is (approximately) $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$. A small Monte Carlo check of this statement appears below.
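As a sanity check of this claim (an added sketch, not part of the original slides), the snippet below simulates dice sums for far fewer rolls than $10^9$; the roll and trial counts are arbitrary choices, and the comparison is against $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

n_rolls = 1_000      # far fewer than 10^9, but already enough for a close fit
n_trials = 50_000    # number of independent experiments
x = 1.5              # "x standard deviations above the mean"

# Mean and standard deviation of a single fair 6-sided die.
mu = 3.5
sigma = sqrt(35.0 / 12.0)

# Simulate n_trials experiments, each summing n_rolls dice.
rolls = rng.integers(1, 7, size=(n_trials, n_rolls), dtype=np.int8)
sums = rolls.sum(axis=1, dtype=np.int64)

# Empirical probability that the sum stays below mean + x standard deviations.
threshold = n_rolls * mu + x * sqrt(n_rolls) * sigma
empirical = np.mean(sums < threshold)

# Gaussian prediction Phi(x).
phi_x = 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(f"empirical Pr = {empirical:.4f}, Phi({x}) = {phi_x:.4f}")
```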
Definitions and Assumptions
Let $X_1, X_2, \ldots, X_n$ be a sequence of i.i.d. random variables, each with mean $\mu = 0$ and variance $\sigma^2 = 1$. Let $S_n = \sum_{i=1}^{n} X_i$.
- Any other finite $\mu$ and $\sigma^2$ may be reduced to this case.
- $E\left[\frac{S_n}{\sqrt{n}}\right] = \frac{1}{\sqrt{n}}\, E[S_n] = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} E[X_i] = 0$.
  - The mean ($E$) is a linear function.
- $\mathrm{Var}\left[\frac{S_n}{\sqrt{n}}\right] = \left(\frac{1}{\sqrt{n}}\right)^2 \mathrm{Var}[S_n] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{1}{n} \cdot n = 1$.
  - Var is not a linear function; it distributes over sums (when the random variables are independent) and it squares scalar multipliers.
A quick numeric check of these two computations follows.
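This is a minimal sketch (not from the slides) verifying the two computations above by simulation; the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ is an arbitrary mean-0, variance-1 choice.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100          # terms per sum
trials = 50_000  # independent copies of S_n / sqrt(n)

# X_i uniform on [-sqrt(3), sqrt(3)]: an arbitrary choice with mean 0 and variance 1.
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
z = x.sum(axis=1) / np.sqrt(n)   # normalized sums S_n / sqrt(n)

print(f"E[S_n/sqrt(n)]   ~ {z.mean():+.4f}  (theory: 0)")
print(f"Var[S_n/sqrt(n)] ~ {z.var():.4f}    (theory: 1)")
```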
Definitions and Assumptions (cont'd)
The Central Limit Theorem is a statement about the so-called normalized sum, defined as $\frac{S_n - n\mu}{\sqrt{n}\,\sigma}$, which in our case is $\frac{S_n}{\sqrt{n}}$.
- The normalized sum is the difference between the sum $S_n$ and its expected value $n\mu$, measured relative to (in units of) the standard deviation $\sqrt{n}\,\sigma$; it measures how many standard deviations the sum is from its expected value.
Statement of Central Limit Theorem
With the assumptions of the previous slide, we have
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt \quad \text{as } n \to \infty.$$
Convergence ($\to$) is in distribution.
- Convergence is not in probability or almost surely.
- Convergence is not uniform.
  - Tails of the distribution converge more slowly than its center.
Convergence in Distribution
The Central Limit Theorem is expressed in terms of convergence in distribution, which is defined as follows:
Definition (Convergence in Distribution)
A sequence of random variables $X_1, \ldots, X_n$ converges in distribution to $X$ if
$$F_{X_n}(x) \to F_X(x) \quad \text{as } n \to \infty$$
at all points $x$ where $F_X$ is continuous, where $F_X$ represents the distribution of the random variable $X$, given by
$$F_X(x) := \Pr(X \le x).$$
Characteristic Function and its Relation to Convergence in Distribution
Definition (Characteristic function)
The characteristic function of a real-valued random variable completely defines its probability distribution. Let $F_X$ be the distribution function of the random variable $X$; the characteristic function of $X$ is the function $\varphi_X$ given by
$$\varphi_X(\xi) = E[e^{i\xi X}] = \int_{-\infty}^{\infty} e^{i\xi x}\, dF_X(x) = \int_{-\infty}^{\infty} f_X(x)\, e^{i\xi x}\, dx,$$
where $f_X$ is the density function of $X$ (if it exists).
- Notice the relation to the Fourier transform when the density $f_X$ exists.
- Convergence in distribution and pointwise convergence of the characteristic functions are equivalent (Lévy's continuity theorem).
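A quick illustration of the definition (an added sketch, not from the slides): estimate $\varphi_X(\xi) = E[e^{i\xi X}]$ by a sample average and compare with a known closed form; for a standard normal $X$ the characteristic function is $e^{-\xi^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)

samples = rng.standard_normal(500_000)   # X ~ N(0, 1)

for xi in (0.0, 0.5, 1.0, 2.0):
    # Monte Carlo estimate of E[exp(i * xi * X)].
    estimate = np.mean(np.exp(1j * xi * samples))
    exact = np.exp(-xi**2 / 2.0)          # characteristic function of N(0, 1)
    print(f"xi = {xi:3.1f}: estimate = {estimate.real:+.4f}{estimate.imag:+.4f}i, exact = {exact:.4f}")
```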
Fourier Transform Pair
The convention we will be using is that the (1-dimensional) Fourier transform of a function $f(x)$ is
$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{i\xi x}\, dx$$
and the inverse Fourier transform of a function $\hat{f}(\xi)$ is
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{-i\xi x}\, d\xi.$$
Convolution
If $f$ and $g$ are integrable functions, we define the convolution $f \star g$ by
$$(f \star g)(x) = \int_{-\infty}^{\infty} f(x - y)\, g(y)\, dy.$$
- Convolution is sometimes also known by its German name, Faltung ("folding"). Later, in the proof section, we will see the $n$-fold convolution, which means convolution repeated $n$ times. A discrete illustration appears below.
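As a discrete analogue (an added sketch, not part of the slides), convolving the probability mass function of a fair die with itself gives the distribution of the sum of two dice, the triangular shape seen in the $n = 2$ panel of the earlier figure.

```python
import numpy as np

# Probability mass function of one fair 6-sided die, on faces 1..6.
die = np.full(6, 1.0 / 6.0)

# Discrete convolution: pmf of the sum of two independent dice (values 2..12).
two_dice = np.convolve(die, die)

for total, prob in zip(range(2, 13), two_dice):
    print(f"P(sum = {total:2d}) = {prob:.4f}  ({prob * 36:.0f}/36)")
```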
Basic Properties of Fourier Transform
There are a few basic properties of the Fourier transform that we will need. In particular, we need to know what the Fourier transform does to scaling, to a Gaussian, and to convolution.
- Scaling: For a non-zero real number $\alpha$, if $g(x) = f(\alpha x)$, then
$$\hat{g}(\xi) = \frac{1}{|\alpha|}\, \hat{f}\!\left(\frac{\xi}{\alpha}\right).$$
- Gaussian: If $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, then
$$\hat{f}(\xi) = \sqrt{2\pi}\, f(\xi) = e^{-\xi^2/2}.$$
- Convolution: Under the Fourier transform, convolution becomes multiplication:
$$\widehat{f \star g}(\xi) = \hat{f}(\xi)\, \hat{g}(\xi).$$
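The convolution property has an exact discrete counterpart that is easy to check numerically (an added sketch using the DFT, which is my own stand-in for the continuous transform): with zero-padding to the full output length, the DFT of a linear convolution equals the product of the DFTs.

```python
import numpy as np

rng = np.random.default_rng(3)

f = rng.random(8)
g = rng.random(5)

conv = np.convolve(f, g)                    # linear convolution, length 8 + 5 - 1
N = len(conv)

lhs = np.fft.fft(conv)                      # DFT of the convolution
rhs = np.fft.fft(f, N) * np.fft.fft(g, N)   # product of zero-padded DFTs

print("max |DFT(f*g) - DFT(f)DFT(g)| =", np.max(np.abs(lhs - rhs)))
```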
Overview, View, Review!
"Tell them what you're going to tell them, tell them, and tell them what you told them."
Paul Halmos (1916-2006) [image source: Wikipedia]
An Overview of the Outline of the Proof
Our goal is to outline the steps in showing:
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt$$
1. Write the density of the sum $S_n$ in terms of the density of its i.i.d. terms $X_i$ (by using an $n$-fold convolution) to go from $f$ to $f_{S_n}$.
2. Find the effect of scaling on the density (by using a substitution in the integral) to go from $f_{S_n}$ to $f_{S_n/\sqrt{n}}$.
3. Use the scaling results for the Fourier transform and the density, as well as convolution, to go from $f_{S_n/\sqrt{n}}$ to $\hat{f}_{S_n/\sqrt{n}}$.
4. Expand $\hat{f}$ around zero to find a useful converging expression.
5. Rewrite that converging expression for $\hat{f}_{S_n/\sqrt{n}}$ to get convergence to a Gaussian.
6. Take the inverse Fourier transform to arrive at the standard Gaussian density.
Step 1: From $f$ to $f_{S_n}$: $n$-Fold Convolution
We show the result for two i.i.d. variables, $X_1$ and $X_2$, with identical distributions $F_{X_1} \equiv F_{X_2} =: F$ and densities $f_{X_1} \equiv f_{X_2} =: f$.
- $f_{X_1+X_2}(a) = \frac{d}{da} F_{X_1+X_2}(a) = \frac{d}{da} \Pr(X_1 + X_2 \le a)$.
- $F_{X_1+X_2}(a)$ is given by the integral of $f_{X_1}(x_1)\, f_{X_2}(x_2) = f(x_1)\, f(x_2)$ over $\{(x_1, x_2) : x_1 + x_2 \le a\}$:
$$F_{X_1+X_2}(a) = \Pr(X_1 + X_2 \le a) = \int_{-\infty}^{\infty} \int_{-\infty}^{a - x_2} f(x_1)\, f(x_2)\, dx_1\, dx_2 = \int_{-\infty}^{\infty} F(a - x)\, f(x)\, dx$$
Differentiation gives
$$f_{X_1+X_2}(a) = \frac{d}{da} \int_{-\infty}^{\infty} F(a - x)\, f(x)\, dx = \int_{-\infty}^{\infty} f(a - x)\, f(x)\, dx = (f \star f)(a).$$
Step 2: From $f_{S_n}$ to $f_{S_n/\sqrt{n}}$: Effect of Scaling on the Density
The Central Limit Theorem involves the probability
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right).$$
Notice that if the density of $S_n$ is $f_{S_n}(t)$, then
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) = \Pr\left(a\sqrt{n} \le S_n \le b\sqrt{n}\right) = \int_{a\sqrt{n}}^{b\sqrt{n}} f_{S_n}(t)\, dt = \int_a^b \sqrt{n}\, f_{S_n}(\sqrt{n}\, s)\, ds,$$
by making the substitution $t = \sqrt{n}\, s$. This shows that the density of $\frac{S_n}{\sqrt{n}}$ is $\sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$. A numeric sanity check of this substitution follows.
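A quick check of the change of variables above (an added sketch; the particular stand-in density and the endpoints are arbitrary choices): both integrals agree to within quadrature error.

```python
import numpy as np
from scipy.integrate import quad

n = 9
# Arbitrary stand-in for f_{S_n}: the density of a N(0, n) random variable.
f_Sn = lambda t: np.exp(-t**2 / (2 * n)) / np.sqrt(2 * np.pi * n)

a, b = 0.5, 2.0

lhs, _ = quad(f_Sn, a * np.sqrt(n), b * np.sqrt(n))                # integral of f_Sn over [a*sqrt(n), b*sqrt(n)]
rhs, _ = quad(lambda s: np.sqrt(n) * f_Sn(np.sqrt(n) * s), a, b)   # substituted integral over [a, b]

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```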
Step 3: From $f_{S_n/\sqrt{n}}$ to $\hat{f}_{S_n/\sqrt{n}}$
Now we have everything we need to get from the density $f$ of a sequence of i.i.d. random variables to the characteristic function $\hat{f}_{S_n/\sqrt{n}}(\xi)$ of the corresponding normalized sum $S_n/\sqrt{n}$:
- $f_{S_n}(t) = (f \star \cdots \star f)(t)$ (an $n$-fold convolution).
- $\hat{f}_{S_n}(\xi) = \widehat{f \star \cdots \star f}(\xi) = (\hat{f}\,)^n(\xi)$.
- $f_{S_n/\sqrt{n}}(t) = \sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$.
Therefore, using the scaling property with $\alpha = \sqrt{n}$,
$$\hat{f}_{S_n/\sqrt{n}}(\xi) = \sqrt{n}\, \frac{1}{\sqrt{n}}\, \hat{f}_{S_n}\!\left(\frac{\xi}{\sqrt{n}}\right) = \hat{f}_{S_n}\!\left(\frac{\xi}{\sqrt{n}}\right) = (\hat{f}\,)^n\!\left(\frac{\xi}{\sqrt{n}}\right).$$
Step 4: Taylor Expansion of $\hat{f}$ at 0
The Fourier transform of the density $f$ (identical for all $X_i$) is
$$\hat{f}(\xi) = \int_{-\infty}^{\infty} e^{i\xi x}\, f(x)\, dx.$$
Differentiation under the integral sign can be done, so the Taylor expansion is
$$\hat{f}(\xi) = \hat{f}(0) + \hat{f}'(0)\, \xi + \frac{\hat{f}''(0)\, \xi^2}{2} + \varepsilon(\xi)\, \xi^2$$
as $\xi \to 0$, in which limit $\varepsilon(\xi) \to 0$ as well. Observe that
- $\hat{f}(0) = \int_{-\infty}^{\infty} f(x)\, dx = 1$
- $\hat{f}'(0) = i \int_{-\infty}^{\infty} x\, f(x)\, dx = 0$ (mean 0)
- $\hat{f}''(0) = -\int_{-\infty}^{\infty} x^2\, f(x)\, dx = -1$ (variance 1)
A symbolic check with a concrete density appears below.
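As a concrete instance (an added sketch; the uniform density on $[-\sqrt{3}, \sqrt{3}]$ is my arbitrary mean-0, variance-1 example, whose transform under the $e^{i\xi x}$ convention is $\sin(\sqrt{3}\xi)/(\sqrt{3}\xi)$), SymPy reproduces the expansion $1 - \xi^2/2 + O(\xi^4)$.

```python
import sympy as sp

xi = sp.symbols('xi')

# Fourier transform (e^{i xi x} convention) of the uniform density on [-sqrt(3), sqrt(3)],
# a mean-0, variance-1 example: fhat(xi) = sin(sqrt(3) xi) / (sqrt(3) xi).
fhat = sp.sin(sp.sqrt(3) * xi) / (sp.sqrt(3) * xi)

# Taylor expansion around 0: coefficients 1, 0, -1/2 match
# fhat(0) = 1, fhat'(0) = i * mean = 0, fhat''(0) = -variance = -1.
print(sp.series(fhat, xi, 0, 5))
```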
Taylor Expansion of $\hat{f}$ at 0 (cont'd)
So
$$\hat{f}(\xi) = 1 - \frac{\xi^2}{2} + \varepsilon(\xi)\, \xi^2$$
as $\xi \to 0$, which is the same as
$$\xi^{-2} \left| \hat{f}(\xi) - \left(1 - \frac{\xi^2}{2}\right) \right| \to 0$$
as $\xi \to 0$.
Step 5: Convergence of $\hat{f}_{S_n/\sqrt{n}}(\xi)$ to $e^{-\xi^2/2}$
Hoping that we may get a similar convergence result for $\hat{f}_{S_n/\sqrt{n}}$, we write (using the factorization $a^n - b^n = (a - b) \sum_{k=0}^{n-1} a^k b^{n-k-1}$)
$$\left| (\hat{f}\,)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \right| = \left| \hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right) \right| \left| \sum_{k=0}^{n-1} (\hat{f}\,)^k(\xi/\sqrt{n}) \left(1 - \frac{\xi^2}{2n}\right)^{\!n-k-1} \right|$$
$$\le \left| \hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right) \right| \sum_{k=0}^{n-1} \left| \hat{f}(\xi/\sqrt{n}) \right|^k \left| 1 - \frac{\xi^2}{2n} \right|^{n-k-1}$$
Convergence of $\hat{f}_{S_n/\sqrt{n}}(\xi)$ to $e^{-\xi^2/2}$ (cont'd)
Since $|\hat{f}(\xi)| \le \|\hat{f}\|_{L^\infty} \le \|f\|_{L^1} = 1$, for $n$ large enough we have
$$\left| (\hat{f}\,)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \right| \le n \left| \hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right) \right|.$$
It is clear that as $n \to \infty$, $\xi/\sqrt{n} \to 0$, so by Step 4 the right-hand side equals $n \cdot |\varepsilon(\xi/\sqrt{n})|\, \xi^2 / n \to 0$; hence
$$\left| (\hat{f}\,)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \right| \to 0$$
as $n \to \infty$. Since $\left(1 - \frac{\xi^2}{2n}\right)^{\!n} \to e^{-\xi^2/2}$, we conclude
$$\hat{f}_{S_n/\sqrt{n}}(\xi) = (\hat{f}\,)^n(\xi/\sqrt{n}) \to e^{-\xi^2/2}.$$
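A numeric illustration of this limit (an added sketch, again using the uniform density on $[-\sqrt{3}, \sqrt{3}]$ as an arbitrary example, for which $\hat{f}(\xi) = \sin(\sqrt{3}\xi)/(\sqrt{3}\xi)$):

```python
import numpy as np

def fhat(xi):
    """Fourier transform (e^{i xi x} convention) of the uniform density on [-sqrt(3), sqrt(3)]."""
    return np.sin(np.sqrt(3) * xi) / (np.sqrt(3) * xi)

xi = 2.0                         # evaluate away from xi = 0 to avoid 0/0
target = np.exp(-xi**2 / 2.0)

for n in (1, 10, 100, 1000, 10000):
    value = fhat(xi / np.sqrt(n)) ** n       # (fhat)^n (xi / sqrt(n))
    print(f"n = {n:6d}: {value:.6f}   (target e^(-xi^2/2) = {target:.6f})")
```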
Step 6: Convergence of $f_{S_n/\sqrt{n}}(x)$ to $e^{-x^2/2}/\sqrt{2\pi}$: Inverse Fourier Transform
Taking the inverse Fourier transform, we obtain
$$f_{S_n/\sqrt{n}}(x) \to \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
as $n \to \infty$, which is the conclusion of the Central Limit Theorem!
- Observe that this is pointwise convergence of the densities, which in turn gives convergence in distribution.
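The conclusion can be seen numerically by carrying out Steps 1 and 2 on a grid (an added sketch; the uniform base density, the grid size, and $n = 8$ are arbitrary choices): the $n$-fold convolution, rescaled, is already close to the standard Gaussian density.

```python
import numpy as np

# Base density: uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1), sampled on a grid.
a = np.sqrt(3)
m = 2001
x = np.linspace(-a, a, m)
h = x[1] - x[0]
f = np.full(m, 1.0 / (2 * a))

n = 8   # number of i.i.d. terms in the sum

# Step 1 on the grid: n-fold convolution approximates the density of S_n
# (each convolution of sampled densities carries a factor of the grid spacing h).
density = f.copy()
for _ in range(n - 1):
    density = np.convolve(density, f) * h

grid = np.linspace(-n * a, n * a, len(density))   # support of S_n

# Step 2: density of S_n / sqrt(n) is sqrt(n) * f_{S_n}(sqrt(n) * t).
for t in (0.0, 1.0, 2.0):
    approx = np.sqrt(n) * np.interp(np.sqrt(n) * t, grid, density)
    gauss = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
    print(f"t = {t:.1f}: n-fold convolution {approx:.4f}, Gaussian {gauss:.4f}")
```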
Directions for Generalization
Three general versions of the CLT will be discussed:
- Lyapunov's CLT, which weakens the hypothesis of identical distribution at the price of strengthening the finite-variance hypothesis to a moment condition of order $2 + \delta$ (Lyapunov's condition).
- Lindeberg's CLT, which replaces Lyapunov's condition with a weaker one (Lindeberg's condition) while keeping the same weak requirements on the distributions of the random variables.
- The multivariate CLT, which uses the covariance matrix of the random vectors for the generalization.
Lyapunov's CLT
Suppose $X_1, X_2, \ldots, X_n$ is a sequence of independent random variables, each with finite expected value $\mu_i$ and variance $\sigma_i^2$ (i.e., not necessarily identically distributed). Let
$$s_n^2 = \sum_{i=1}^{n} \sigma_i^2.$$
If, for some $\delta > 0$, the following condition (called Lyapunov's condition) holds,
$$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E\left[\, |X_i - \mu_i|^{2+\delta} \right] = 0,$$
then the sum $\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$ converges in distribution to a standard normal random variable as $n \to \infty$.
Lindeberg's CLT
Suppose $X_1, X_2, \ldots, X_n$ is a sequence of independent random variables, each with finite expected value $\mu_i$ and variance $\sigma_i^2$ (i.e., not necessarily identically distributed). Let
$$s_n^2 = \sum_{i=1}^{n} \sigma_i^2.$$
If, for every $\varepsilon > 0$, the following condition (called Lindeberg's condition) holds,
$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{i=1}^{n} E\left[ (X_i - \mu_i)^2 \cdot \mathbf{1}_{\{|X_i - \mu_i| > \varepsilon s_n\}} \right] = 0,$$
then the sum $\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$ converges in distribution to a standard normal random variable as $n \to \infty$. A numeric check of the condition for a concrete sequence appears below.
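A small check of the Lindeberg condition (an added sketch; the family $X_i \sim N(0, \sigma_i^2)$ with $\sigma_i = i^{1/4}$ is my arbitrary example of independent, non-identically distributed variables): the truncated second moments come from the Gaussian tail formula, and the Lindeberg sum tends to 0 as $n$ grows.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def truncated_second_moment(sigma, t):
    """E[X^2 * 1{|X| > t}] for X ~ N(0, sigma^2), via integration by parts."""
    c = t / sigma
    phi = exp(-c * c / 2.0) / sqrt(2.0 * pi)   # standard normal density at c
    tail = 0.5 * (1.0 - erf(c / sqrt(2.0)))    # P(Z > c) for standard normal Z
    return 2.0 * sigma**2 * (c * phi + tail)

eps = 0.5
for n in (10, 100, 1000):
    sigmas = np.arange(1, n + 1) ** 0.25       # sigma_i = i^(1/4): independent, not identical
    s_n = sqrt(float(np.sum(sigmas**2)))
    lindeberg = sum(truncated_second_moment(s, eps * s_n) for s in sigmas) / s_n**2
    print(f"n = {n:5d}: Lindeberg sum = {lindeberg:.3e}")
```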
Comparison of Finite Variance Conditions
Lindeberg: $\displaystyle \int_{|x - \mu_i| > \varepsilon s_n} (x - \mu_i)^2\, f_i(x)\, dx < \infty$
Classical: $\displaystyle \int_{\mathbb{R}} (x - \mu_i)^2\, f_i(x)\, dx < \infty$
Lyapunov: $\displaystyle \int_{\mathbb{R}} |x - \mu_i|^{2+\delta}\, f_i(x)\, dx < \infty$
Observe that, in the classical CLT, $\mu_i = \mu$ and $f_i(x) = f(x)$ for all $i$.
Generalizations in a Nutshell: CLT is Robust
- If one has a lot of small random terms that are mostly independent, and each contributes only a small fraction of the total sum, then the total sum must be approximately normally distributed. A simulation illustrating this robustness follows.
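A small simulation of this heuristic (an added sketch; the mix of distributions, the term count, and the trial count are arbitrary choices): sum many independent, heterogeneous, mean-zero terms and compare the standardized sum with the standard normal.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

m = 300           # terms of each kind below (900 small terms in total)
trials = 10_000   # independent copies of the sum

# Heterogeneous, independent, mean-zero terms (arbitrary illustrative choices):
u = rng.uniform(-1.0, 1.0, size=(trials, m))          # variance 1/3
e = rng.exponential(1.0, size=(trials, m)) - 1.0      # variance 1
c = rng.choice([-1.0, 1.0], size=(trials, m))         # variance 1

# Standardize the total sum by its exact standard deviation.
total_var = m * (1.0 / 3.0 + 1.0 + 1.0)
z = (u.sum(axis=1) + e.sum(axis=1) + c.sum(axis=1)) / sqrt(total_var)

for x in (-1.0, 0.0, 1.0):
    empirical = np.mean(z <= x)
    gaussian = 0.5 * (1.0 + erf(x / sqrt(2.0)))
    print(f"P(Z <= {x:+.1f}): empirical {empirical:.4f}, normal {gaussian:.4f}")
```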
Multivariate CLT
Suppose $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ is a sequence of i.i.d. random vectors with finite mean vector $E[X_i] = \mu$ and finite covariance matrix $\Sigma$. Then
$$\frac{1}{\sqrt{n}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \longrightarrow N_d(0, \Sigma)$$
in distribution as $n \to \infty$, where $N_d(0, \Sigma)$ is the multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma$.
Note: addition is done componentwise.
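A simulation sketch of the multivariate statement (an added example with arbitrary choices of $d = 2$, the base vectors, and the sample sizes): the normalized vector sums have sample mean close to $0$ and sample covariance close to $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(5)

n, trials = 1_000, 5_000

# i.i.d. random vectors in R^2 with dependent coordinates: X = (U, U + V),
# where U, V are independent uniform on [-1, 1] (an arbitrary choice; the mean vector is 0).
U = rng.uniform(-1.0, 1.0, size=(trials, n))
V = rng.uniform(-1.0, 1.0, size=(trials, n))

# Covariance of one vector: Var(U) = 1/3, Cov(U, U + V) = 1/3, Var(U + V) = 2/3.
Sigma = np.array([[1.0 / 3.0, 1.0 / 3.0],
                  [1.0 / 3.0, 2.0 / 3.0]])

# Normalized sums (1/sqrt(n)) * sum_i X_i, one 2-vector per trial.
Z = np.stack([U.sum(axis=1), (U + V).sum(axis=1)], axis=1) / np.sqrt(n)

print("sample mean of Z:\n", Z.mean(axis=0))
print("sample covariance of Z:\n", np.cov(Z, rowvar=False))
print("theoretical covariance Sigma:\n", Sigma)
```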
Thank you for your attention!
Figure: Laplace
Outline
More Details
Almost Sure Convergence and Convergence in Probability
Because of their relationship to convergence in distribution, it is useful to review almost sure convergence and convergence in probability. We let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of random variables defined on the probability space $(\Omega, \mathcal{F}, P)$.
- Almost sure convergence (strong convergence): $X_1, X_2, \ldots, X_n, \ldots$ converges almost surely to a random variable $X$ if, for every $\varepsilon > 0$,
$$P\left( \lim_{n \to \infty} |X_n - X| < \varepsilon \right) = 1.$$
- Convergence in probability (weak convergence): $X_1, X_2, \ldots, X_n, \ldots$ converges in probability to $X$ if, for every $\varepsilon > 0$,
$$\lim_{n \to \infty} P\left( |X_n - X| < \varepsilon \right) = 1 \quad \text{or} \quad \lim_{n \to \infty} P\left( |X_n - X| \ge \varepsilon \right) = 0.$$
Notable Relationship between Convergence Concepts
Almost sure convergence $\implies$ convergence in probability $\implies$ convergence in distribution.