TRANSCRIPT
A Brief Analysis of Central Limit Theorem
Omid Khanmohamadi ([email protected])
Diego Hernán Díaz Martínez ([email protected])
Tony Wills ([email protected])
Kouadio David Yao ([email protected])
SIAM Chapter, Florida State University
March 17, 2014
Outline
Examples
Statement of Theorem
Modes of Convergence
Fourier Transform and Convolution
Outline of Proof
Generalizations
From Concrete to Abstract: Examples then Theorems!
"The source of all great mathematics is the special case, the concrete example. It is frequent in mathematics that every instance of a concept of seemingly great generality is in essence the same as a small and concrete special case."
Paul Halmos (1916-2006) [image source: Wikipedia]

"You should start with understanding the interesting examples and build up to explain what the general phenomena are. This was your progress from initial understanding to more understanding."
Michael Atiyah [image source: Wikipedia]
Sum of Dice Throws is (Eventually) Normally Distributed
Comparison of probability density functions $p(k)$ for the sum of $n$ fair 6-sided dice, showing convergence to a normal distribution with increasing $n$. [image source: Wikipedia]
[Figure: five panels for $n = 1, 2, 3, 4, 5$, each plotting $p(k)$ against $k$.]
Dice Throws (Cont'd)
- Roll a fair die $10^9$ times, with each roll independent of the others.
- fair = faces have equal probability (identically distributed)
- Let $X_i$ be the number that comes up on the $i$th roll and let $S_{10^9} = \sum_{i=1}^{10^9} X_i$ be the total (sum) of the numbers rolled.
- The probability that $S_{10^9}$ is less than $x$ standard deviations above its mean is (approximately) $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$. A small Monte Carlo check of this statement appears below.
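As a sanity check of this claim (an added sketch, not part of the original slides), the snippet below simulates dice sums for far fewer rolls than $10^9$; the roll and trial counts are arbitrary choices, and the comparison is against $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

n_rolls = 1_000      # far fewer than 10^9, but already enough for a close fit
n_trials = 50_000    # number of independent experiments
x = 1.5              # "x standard deviations above the mean"

# Mean and standard deviation of a single fair 6-sided die.
mu = 3.5
sigma = sqrt(35.0 / 12.0)

# Simulate n_trials experiments, each summing n_rolls dice.
rolls = rng.integers(1, 7, size=(n_trials, n_rolls), dtype=np.int8)
sums = rolls.sum(axis=1, dtype=np.int64)

# Empirical probability that the sum stays below mean + x standard deviations.
threshold = n_rolls * mu + x * sqrt(n_rolls) * sigma
empirical = np.mean(sums < threshold)

# Gaussian prediction Phi(x).
phi_x = 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(f"empirical Pr = {empirical:.4f}, Phi({x}) = {phi_x:.4f}")
```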
Definitions and Assumptions
Let $X_1, X_2, \ldots, X_n$ be a sequence of i.i.d. random variables, each with mean $\mu = 0$ and variance $\sigma^2 = 1$. Let $S_n = \sum_{i=1}^{n} X_i$.
- Any other finite $\mu$ and $\sigma^2$ may be reduced to this case.
- $E\left[\frac{S_n}{\sqrt{n}}\right] = \frac{1}{\sqrt{n}}\, E[S_n] = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} E[X_i] = 0$.
  - The mean ($E$) is a linear function.
- $\mathrm{Var}\left[\frac{S_n}{\sqrt{n}}\right] = \left(\frac{1}{\sqrt{n}}\right)^2 \mathrm{Var}[S_n] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}[X_i] = \frac{1}{n} \cdot n = 1$.
  - Var is not a linear function; it distributes over sums (when the random variables are independent) and it squares scalar multipliers.
A quick numeric check of these two computations follows.
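This is a minimal sketch (not from the slides) verifying the two computations above by simulation; the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$ is an arbitrary mean-0, variance-1 choice.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100          # terms per sum
trials = 50_000  # independent copies of S_n / sqrt(n)

# X_i uniform on [-sqrt(3), sqrt(3)]: an arbitrary choice with mean 0 and variance 1.
x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))
z = x.sum(axis=1) / np.sqrt(n)   # normalized sums S_n / sqrt(n)

print(f"E[S_n/sqrt(n)]   ~ {z.mean():+.4f}  (theory: 0)")
print(f"Var[S_n/sqrt(n)] ~ {z.var():.4f}    (theory: 1)")
```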
Definitions and Assumptions (cont'd)
The Central Limit Theorem is a statement about the so-called normalized sum, defined as $\frac{S_n - n\mu}{\sqrt{n}\,\sigma}$, which in our case is $\frac{S_n}{\sqrt{n}}$.
- The normalized sum is the difference between the sum $S_n$ and its expected value $n\mu$, measured relative to (in units of) the standard deviation $\sqrt{n}\,\sigma$; it measures how many standard deviations the sum is from its expected value.
Statement of Central Limit Theorem
With the assumptions of the previous slide, we have
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt \quad \text{as } n \to \infty.$$
Convergence ($\to$) is in distribution.
- Convergence is not in probability or almost surely.
- Convergence is not uniform.
  - Tails of the distribution converge more slowly than its center.
Convergence in Distribution
The Central Limit Theorem is expressed in terms of convergence in distribution, which is defined as follows:
Definition (Convergence in Distribution)
A sequence of random variables $X_1, \ldots, X_n$ converges in distribution to $X$ if
$$F_{X_n}(x) \to F_X(x) \quad \text{as } n \to \infty$$
at all points $x$ where $F_X$ is continuous, where $F_X$ represents the distribution of the random variable $X$, given by
$$F_X(x) := \Pr(X \le x).$$
Characteristic Function and its Relation to Convergence in Distribution
Definition (Characteristic function)
The characteristic function of a real-valued random variable completely defines its probability distribution. Let $F_X$ be the distribution function of the random variable $X$; the characteristic function of $X$ is the function $\varphi_X$ given by
$$\varphi_X(\xi) = E[e^{i\xi X}] = \int_{-\infty}^{\infty} e^{i\xi x}\, dF_X(x) = \int_{-\infty}^{\infty} f_X(x)\, e^{i\xi x}\, dx,$$
where $f_X$ is the density function of $X$ (if it exists).
- Notice the relation to the Fourier transform when the density $f_X$ exists.
- Convergence in distribution and pointwise convergence of the characteristic functions are equivalent (Lévy's continuity theorem).
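A quick illustration of the definition (an added sketch, not from the slides): estimate $\varphi_X(\xi) = E[e^{i\xi X}]$ by a sample average and compare with a known closed form; for a standard normal $X$ the characteristic function is $e^{-\xi^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)

samples = rng.standard_normal(500_000)   # X ~ N(0, 1)

for xi in (0.0, 0.5, 1.0, 2.0):
    # Monte Carlo estimate of E[exp(i * xi * X)].
    estimate = np.mean(np.exp(1j * xi * samples))
    exact = np.exp(-xi**2 / 2.0)          # characteristic function of N(0, 1)
    print(f"xi = {xi:3.1f}: estimate = {estimate.real:+.4f}{estimate.imag:+.4f}i, exact = {exact:.4f}")
```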
Fourier Transform Pair
The convention we will be using is that the (1-dimensional) Fourier transform of a function $f(x)$ is
$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{i\xi x}\, dx$$
and the inverse Fourier transform of a function $\hat{f}(\xi)$ is
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{-i\xi x}\, d\xi.$$
Convolution
If $f$ and $g$ are integrable functions, we define the convolution $f \star g$ by
$$(f \star g)(x) = \int_{-\infty}^{\infty} f(x - y)\, g(y)\, dy.$$
- Convolution is sometimes also known by its German name, Faltung ("folding"). Later, in the proof section, we will see the $n$-fold convolution, which means convolution repeated $n$ times. A discrete illustration appears below.
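As a discrete analogue (an added sketch, not part of the slides), convolving the probability mass function of a fair die with itself gives the distribution of the sum of two dice, the triangular shape seen in the $n = 2$ panel of the earlier figure.

```python
import numpy as np

# Probability mass function of one fair 6-sided die, on faces 1..6.
die = np.full(6, 1.0 / 6.0)

# Discrete convolution: pmf of the sum of two independent dice (values 2..12).
two_dice = np.convolve(die, die)

for total, prob in zip(range(2, 13), two_dice):
    print(f"P(sum = {total:2d}) = {prob:.4f}  ({prob * 36:.0f}/36)")
```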
Basic Properties of Fourier Transform
There are a few basic properties of the Fourier transform that we will need. In particular, we need to know what the Fourier transform does to scaling, to a Gaussian, and to convolution.
- Scaling: For a non-zero real number $\alpha$, if $g(x) = f(\alpha x)$, then
$$\hat{g}(\xi) = \frac{1}{|\alpha|}\, \hat{f}\!\left(\frac{\xi}{\alpha}\right).$$
- Gaussian: If $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, then
$$\hat{f}(\xi) = \sqrt{2\pi}\, f(\xi) = e^{-\xi^2/2}.$$
- Convolution: Under the Fourier transform, convolution becomes multiplication:
$$\widehat{f \star g}(\xi) = \hat{f}(\xi)\, \hat{g}(\xi).$$
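The convolution property has an exact discrete counterpart that is easy to check numerically (an added sketch using the DFT, which is my own stand-in for the continuous transform): with zero-padding to the full output length, the DFT of a linear convolution equals the product of the DFTs.

```python
import numpy as np

rng = np.random.default_rng(3)

f = rng.random(8)
g = rng.random(5)

conv = np.convolve(f, g)                    # linear convolution, length 8 + 5 - 1
N = len(conv)

lhs = np.fft.fft(conv)                      # DFT of the convolution
rhs = np.fft.fft(f, N) * np.fft.fft(g, N)   # product of zero-padded DFTs

print("max |DFT(f*g) - DFT(f)DFT(g)| =", np.max(np.abs(lhs - rhs)))
```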
Overview, View, Review!
"Tell them what you're going to tell them, tell them, and tell them what you told them."
Paul Halmos (1916-2006) [image source: Wikipedia]
An Overview of the Outline of the Proof
Our goal is to outline the steps in showing:
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) \to \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2}\, dt$$
1. Write the density of the sum $S_n$ in terms of the density of its i.i.d. terms $X_i$ (by using an $n$-fold convolution) to go from $f$ to $f_{S_n}$.
2. Find the effect of scaling on the density (by using a substitution in the integral) to go from $f_{S_n}$ to $f_{S_n/\sqrt{n}}$.
3. Use the scaling results for the Fourier transform and the density, as well as convolution, to go from $f_{S_n/\sqrt{n}}$ to $\hat{f}_{S_n/\sqrt{n}}$.
4. Expand $\hat{f}$ around zero to find a useful converging expression.
5. Rewrite that converging expression for $\hat{f}_{S_n/\sqrt{n}}$ to get convergence to a Gaussian.
6. Take the inverse Fourier transform to arrive at the standard Gaussian density.
Step 1: From $f$ to $f_{S_n}$: $n$-Fold Convolution
We show the result for two i.i.d. variables, $X_1$ and $X_2$, with identical distributions $F_{X_1} \equiv F_{X_2} =: F$ and densities $f_{X_1} \equiv f_{X_2} =: f$.
- $f_{X_1+X_2}(a) = \frac{d}{da} F_{X_1+X_2}(a) = \frac{d}{da} \Pr(X_1 + X_2 \le a)$.
- $F_{X_1+X_2}(a)$ is given by the integral of $f_{X_1}(x_1)\, f_{X_2}(x_2) = f(x_1)\, f(x_2)$ over $\{(x_1, x_2) : x_1 + x_2 \le a\}$:
$$F_{X_1+X_2}(a) = \Pr(X_1 + X_2 \le a) = \int_{-\infty}^{\infty} \int_{-\infty}^{a - x_2} f(x_1)\, f(x_2)\, dx_1\, dx_2 = \int_{-\infty}^{\infty} F(a - x)\, f(x)\, dx$$
Differentiation gives
$$f_{X_1+X_2}(a) = \frac{d}{da} \int_{-\infty}^{\infty} F(a - x)\, f(x)\, dx = \int_{-\infty}^{\infty} f(a - x)\, f(x)\, dx = (f \star f)(a).$$
Step 2: From $f_{S_n}$ to $f_{S_n/\sqrt{n}}$: Effect of Scaling on the Density
The Central Limit Theorem involves the probability
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right).$$
Notice that if the density of $S_n$ is $f_{S_n}(t)$, then
$$\Pr\left(a \le \frac{S_n}{\sqrt{n}} \le b\right) = \Pr\left(a\sqrt{n} \le S_n \le b\sqrt{n}\right) = \int_{a\sqrt{n}}^{b\sqrt{n}} f_{S_n}(t)\, dt = \int_a^b \sqrt{n}\, f_{S_n}(\sqrt{n}\, s)\, ds,$$
by making the substitution $t = \sqrt{n}\, s$. This shows that the density of $\frac{S_n}{\sqrt{n}}$ is $\sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$. A numeric sanity check of this substitution follows.
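A quick check of the change of variables above (an added sketch; the particular stand-in density and the endpoints are arbitrary choices): both integrals agree to within quadrature error.

```python
import numpy as np
from scipy.integrate import quad

n = 9
# Arbitrary stand-in for f_{S_n}: the density of a N(0, n) random variable.
f_Sn = lambda t: np.exp(-t**2 / (2 * n)) / np.sqrt(2 * np.pi * n)

a, b = 0.5, 2.0

lhs, _ = quad(f_Sn, a * np.sqrt(n), b * np.sqrt(n))                # integral of f_Sn over [a*sqrt(n), b*sqrt(n)]
rhs, _ = quad(lambda s: np.sqrt(n) * f_Sn(np.sqrt(n) * s), a, b)   # substituted integral over [a, b]

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```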
Step 3: From $f_{S_n/\sqrt{n}}$ to $\hat{f}_{S_n/\sqrt{n}}$
Now we have everything we need to get from the density $f$ of a sequence of i.i.d. random variables to the characteristic function $\hat{f}_{S_n/\sqrt{n}}(\xi)$ of the corresponding normalized sum $S_n/\sqrt{n}$:
- $f_{S_n}(t) = (f \star \cdots \star f)(t)$ (an $n$-fold convolution).
- $\hat{f}_{S_n}(\xi) = \widehat{f \star \cdots \star f}(\xi) = (\hat{f}\,)^n(\xi)$.
- $f_{S_n/\sqrt{n}}(t) = \sqrt{n}\, f_{S_n}(\sqrt{n}\, t)$.
Therefore, using the scaling property with $\alpha = \sqrt{n}$,
$$\hat{f}_{S_n/\sqrt{n}}(\xi) = \sqrt{n}\, \frac{1}{\sqrt{n}}\, \hat{f}_{S_n}\!\left(\frac{\xi}{\sqrt{n}}\right) = \hat{f}_{S_n}\!\left(\frac{\xi}{\sqrt{n}}\right) = (\hat{f}\,)^n\!\left(\frac{\xi}{\sqrt{n}}\right).$$
Step 4: Taylor Expansion of $\hat{f}$ at 0
The Fourier transform of the density $f$ (identical for all $X_i$) is
$$\hat{f}(\xi) = \int_{-\infty}^{\infty} e^{i\xi x}\, f(x)\, dx.$$
Differentiation under the integral sign can be done, so the Taylor expansion is
$$\hat{f}(\xi) = \hat{f}(0) + \hat{f}'(0)\, \xi + \frac{\hat{f}''(0)\, \xi^2}{2} + \varepsilon(\xi)\, \xi^2$$
as $\xi \to 0$, in which limit $\varepsilon(\xi) \to 0$ as well. Observe that
- $\hat{f}(0) = \int_{-\infty}^{\infty} f(x)\, dx = 1$
- $\hat{f}'(0) = i \int_{-\infty}^{\infty} x\, f(x)\, dx = 0$ (mean 0)
- $\hat{f}''(0) = -\int_{-\infty}^{\infty} x^2\, f(x)\, dx = -1$ (variance 1)
A symbolic check with a concrete density appears below.
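As a concrete instance (an added sketch; the uniform density on $[-\sqrt{3}, \sqrt{3}]$ is my arbitrary mean-0, variance-1 example, whose transform under the $e^{i\xi x}$ convention is $\sin(\sqrt{3}\xi)/(\sqrt{3}\xi)$), SymPy reproduces the expansion $1 - \xi^2/2 + O(\xi^4)$.

```python
import sympy as sp

xi = sp.symbols('xi')

# Fourier transform (e^{i xi x} convention) of the uniform density on [-sqrt(3), sqrt(3)],
# a mean-0, variance-1 example: fhat(xi) = sin(sqrt(3) xi) / (sqrt(3) xi).
fhat = sp.sin(sp.sqrt(3) * xi) / (sp.sqrt(3) * xi)

# Taylor expansion around 0: coefficients 1, 0, -1/2 match
# fhat(0) = 1, fhat'(0) = i * mean = 0, fhat''(0) = -variance = -1.
print(sp.series(fhat, xi, 0, 5))
```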
Taylor Expansion of $\hat{f}$ at 0 (cont'd)
So
$$\hat{f}(\xi) = 1 - \frac{\xi^2}{2} + \varepsilon(\xi)\, \xi^2$$
as $\xi \to 0$, which is the same as
$$\xi^{-2} \left| \hat{f}(\xi) - \left(1 - \frac{\xi^2}{2}\right) \right| \to 0$$
as $\xi \to 0$.
Step 5: Convergence of $\hat{f}_{S_n/\sqrt{n}}(\xi)$ to $e^{-\xi^2/2}$
Hoping that we may get a similar convergence result for $\hat{f}_{S_n/\sqrt{n}}$, we write (using the factorization $a^n - b^n = (a - b) \sum_{k=0}^{n-1} a^k b^{n-k-1}$)
$$\left| (\hat{f}\,)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \right| = \left| \hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right) \right| \left| \sum_{k=0}^{n-1} (\hat{f}\,)^k(\xi/\sqrt{n}) \left(1 - \frac{\xi^2}{2n}\right)^{\!n-k-1} \right|$$
$$\le \left| \hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right) \right| \sum_{k=0}^{n-1} \left| \hat{f}(\xi/\sqrt{n}) \right|^k \left| 1 - \frac{\xi^2}{2n} \right|^{n-k-1}$$
Convergence of $\hat{f}_{S_n/\sqrt{n}}(\xi)$ to $e^{-\xi^2/2}$ (cont'd)
Since $|\hat{f}(\xi)| \le \|\hat{f}\|_{L^\infty} \le \|f\|_{L^1} = 1$, for $n$ large enough we have
$$\left| (\hat{f}\,)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \right| \le n \left| \hat{f}(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right) \right|.$$
It is clear that as $n \to \infty$, $\xi/\sqrt{n} \to 0$, so by Step 4 the right-hand side equals $n \cdot |\varepsilon(\xi/\sqrt{n})|\, \xi^2 / n \to 0$; hence
$$\left| (\hat{f}\,)^n(\xi/\sqrt{n}) - \left(1 - \frac{\xi^2}{2n}\right)^{\!n} \right| \to 0$$
as $n \to \infty$. Since $\left(1 - \frac{\xi^2}{2n}\right)^{\!n} \to e^{-\xi^2/2}$, we conclude
$$\hat{f}_{S_n/\sqrt{n}}(\xi) = (\hat{f}\,)^n(\xi/\sqrt{n}) \to e^{-\xi^2/2}.$$
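A numeric illustration of this limit (an added sketch, again using the uniform density on $[-\sqrt{3}, \sqrt{3}]$ as an arbitrary example, for which $\hat{f}(\xi) = \sin(\sqrt{3}\xi)/(\sqrt{3}\xi)$):

```python
import numpy as np

def fhat(xi):
    """Fourier transform (e^{i xi x} convention) of the uniform density on [-sqrt(3), sqrt(3)]."""
    return np.sin(np.sqrt(3) * xi) / (np.sqrt(3) * xi)

xi = 2.0                         # evaluate away from xi = 0 to avoid 0/0
target = np.exp(-xi**2 / 2.0)

for n in (1, 10, 100, 1000, 10000):
    value = fhat(xi / np.sqrt(n)) ** n       # (fhat)^n (xi / sqrt(n))
    print(f"n = {n:6d}: {value:.6f}   (target e^(-xi^2/2) = {target:.6f})")
```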
Step 6: Convergence of $f_{S_n/\sqrt{n}}(x)$ to $e^{-x^2/2}/\sqrt{2\pi}$: Inverse Fourier Transform
Taking the inverse Fourier transform, we obtain
$$f_{S_n/\sqrt{n}}(x) \to \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
as $n \to \infty$, which is the conclusion of the Central Limit Theorem!
- Observe that this is pointwise convergence of the densities, which in turn gives convergence in distribution.
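The conclusion can be seen numerically by carrying out Steps 1 and 2 on a grid (an added sketch; the uniform base density, the grid size, and $n = 8$ are arbitrary choices): the $n$-fold convolution, rescaled, is already close to the standard Gaussian density.

```python
import numpy as np

# Base density: uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1), sampled on a grid.
a = np.sqrt(3)
m = 2001
x = np.linspace(-a, a, m)
h = x[1] - x[0]
f = np.full(m, 1.0 / (2 * a))

n = 8   # number of i.i.d. terms in the sum

# Step 1 on the grid: n-fold convolution approximates the density of S_n
# (each convolution of sampled densities carries a factor of the grid spacing h).
density = f.copy()
for _ in range(n - 1):
    density = np.convolve(density, f) * h

grid = np.linspace(-n * a, n * a, len(density))   # support of S_n

# Step 2: density of S_n / sqrt(n) is sqrt(n) * f_{S_n}(sqrt(n) * t).
for t in (0.0, 1.0, 2.0):
    approx = np.sqrt(n) * np.interp(np.sqrt(n) * t, grid, density)
    gauss = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
    print(f"t = {t:.1f}: n-fold convolution {approx:.4f}, Gaussian {gauss:.4f}")
```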
Directions for Generalization
Three general versions of the CLT will be discussed:
- Lyapunov's CLT, which weakens the hypothesis of identical distribution at the price of strengthening the finite-variance hypothesis to a moment condition of order $2 + \delta$ (Lyapunov's condition).
- Lindeberg's CLT, which replaces Lyapunov's condition with a weaker one (Lindeberg's condition) while keeping the same weak requirements on the distributions of the random variables.
- The multivariate CLT, which uses the covariance matrix of the random vectors for the generalization.
Lyapunov's CLT
Suppose $X_1, X_2, \ldots, X_n$ is a sequence of independent random variables, each with finite expected value $\mu_i$ and variance $\sigma_i^2$ (i.e., not necessarily identically distributed). Let
$$s_n^2 = \sum_{i=1}^{n} \sigma_i^2.$$
If, for some $\delta > 0$, the following condition (called Lyapunov's condition) holds,
$$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E\left[\, |X_i - \mu_i|^{2+\delta} \right] = 0,$$
then the sum $\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$ converges in distribution to a standard normal random variable as $n \to \infty$.
Lindeberg's CLT
Suppose $X_1, X_2, \ldots, X_n$ is a sequence of independent random variables, each with finite expected value $\mu_i$ and variance $\sigma_i^2$ (i.e., not necessarily identically distributed). Let
$$s_n^2 = \sum_{i=1}^{n} \sigma_i^2.$$
If, for every $\varepsilon > 0$, the following condition (called Lindeberg's condition) holds,
$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{i=1}^{n} E\left[ (X_i - \mu_i)^2 \cdot \mathbf{1}_{\{|X_i - \mu_i| > \varepsilon s_n\}} \right] = 0,$$
then the sum $\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$ converges in distribution to a standard normal random variable as $n \to \infty$. A numeric check of the condition for a concrete sequence appears below.
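A small check of the Lindeberg condition (an added sketch; the family $X_i \sim N(0, \sigma_i^2)$ with $\sigma_i = i^{1/4}$ is my arbitrary example of independent, non-identically distributed variables): the truncated second moments come from the Gaussian tail formula, and the Lindeberg sum tends to 0 as $n$ grows.

```python
import numpy as np
from math import erf, sqrt, pi, exp

def truncated_second_moment(sigma, t):
    """E[X^2 * 1{|X| > t}] for X ~ N(0, sigma^2), via integration by parts."""
    c = t / sigma
    phi = exp(-c * c / 2.0) / sqrt(2.0 * pi)   # standard normal density at c
    tail = 0.5 * (1.0 - erf(c / sqrt(2.0)))    # P(Z > c) for standard normal Z
    return 2.0 * sigma**2 * (c * phi + tail)

eps = 0.5
for n in (10, 100, 1000):
    sigmas = np.arange(1, n + 1) ** 0.25       # sigma_i = i^(1/4): independent, not identical
    s_n = sqrt(float(np.sum(sigmas**2)))
    lindeberg = sum(truncated_second_moment(s, eps * s_n) for s in sigmas) / s_n**2
    print(f"n = {n:5d}: Lindeberg sum = {lindeberg:.3e}")
```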
Comparison of Finite Variance Conditions
Lindeberg: $\displaystyle \int_{|x - \mu_i| > \varepsilon s_n} (x - \mu_i)^2\, f_i(x)\, dx < \infty$
Classical: $\displaystyle \int_{\mathbb{R}} (x - \mu_i)^2\, f_i(x)\, dx < \infty$
Lyapunov: $\displaystyle \int_{\mathbb{R}} |x - \mu_i|^{2+\delta}\, f_i(x)\, dx < \infty$
Observe that, in the classical CLT, $\mu_i = \mu$ and $f_i(x) = f(x)$ for all $i$.
Generalizations in a Nutshell: CLT is Robust
- If one has a lot of small random terms that are mostly independent, and each contributes only a small fraction of the total sum, then the total sum must be approximately normally distributed. A simulation illustrating this robustness follows.
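A small simulation of this heuristic (an added sketch; the mix of distributions, the term count, and the trial count are arbitrary choices): sum many independent, heterogeneous, mean-zero terms and compare the standardized sum with the standard normal.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

m = 300           # terms of each kind below (900 small terms in total)
trials = 10_000   # independent copies of the sum

# Heterogeneous, independent, mean-zero terms (arbitrary illustrative choices):
u = rng.uniform(-1.0, 1.0, size=(trials, m))          # variance 1/3
e = rng.exponential(1.0, size=(trials, m)) - 1.0      # variance 1
c = rng.choice([-1.0, 1.0], size=(trials, m))         # variance 1

# Standardize the total sum by its exact standard deviation.
total_var = m * (1.0 / 3.0 + 1.0 + 1.0)
z = (u.sum(axis=1) + e.sum(axis=1) + c.sum(axis=1)) / sqrt(total_var)

for x in (-1.0, 0.0, 1.0):
    empirical = np.mean(z <= x)
    gaussian = 0.5 * (1.0 + erf(x / sqrt(2.0)))
    print(f"P(Z <= {x:+.1f}): empirical {empirical:.4f}, normal {gaussian:.4f}")
```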
Multivariate CLT
Suppose $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ is a sequence of i.i.d. random vectors with finite mean vector $E[X_i] = \mu$ and finite covariance matrix $\Sigma$. Then
$$\frac{1}{\sqrt{n}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \longrightarrow N_d(0, \Sigma)$$
in distribution as $n \to \infty$, where $N_d(0, \Sigma)$ is the multivariate normal distribution with mean vector $0$ and covariance matrix $\Sigma$.
Note: addition is done componentwise.
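A simulation sketch of the multivariate statement (an added example with arbitrary choices of $d = 2$, the base vectors, and the sample sizes): the normalized vector sums have sample mean close to $0$ and sample covariance close to $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(5)

n, trials = 1_000, 5_000

# i.i.d. random vectors in R^2 with dependent coordinates: X = (U, U + V),
# where U, V are independent uniform on [-1, 1] (an arbitrary choice; the mean vector is 0).
U = rng.uniform(-1.0, 1.0, size=(trials, n))
V = rng.uniform(-1.0, 1.0, size=(trials, n))

# Covariance of one vector: Var(U) = 1/3, Cov(U, U + V) = 1/3, Var(U + V) = 2/3.
Sigma = np.array([[1.0 / 3.0, 1.0 / 3.0],
                  [1.0 / 3.0, 2.0 / 3.0]])

# Normalized sums (1/sqrt(n)) * sum_i X_i, one 2-vector per trial.
Z = np.stack([U.sum(axis=1), (U + V).sum(axis=1)], axis=1) / np.sqrt(n)

print("sample mean of Z:\n", Z.mean(axis=0))
print("sample covariance of Z:\n", np.cov(Z, rowvar=False))
print("theoretical covariance Sigma:\n", Sigma)
```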
Thank you for your attention!
Figure: Laplace
Outline
More Details
Almost Sure Convergence and Convergence in Probability
Because of their relationship to convergence in distribution, it is useful to review almost sure convergence and convergence in probability. We let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of random variables defined on the probability space $(\Omega, \mathcal{F}, P)$.
- Almost sure convergence (strong convergence): $X_1, X_2, \ldots, X_n, \ldots$ converges almost surely to a random variable $X$ if, for every $\varepsilon > 0$,
$$P\left( \lim_{n \to \infty} |X_n - X| < \varepsilon \right) = 1.$$
- Convergence in probability (weak convergence): $X_1, X_2, \ldots, X_n, \ldots$ converges in probability to $X$ if, for every $\varepsilon > 0$,
$$\lim_{n \to \infty} P\left( |X_n - X| < \varepsilon \right) = 1 \quad \text{or} \quad \lim_{n \to \infty} P\left( |X_n - X| \ge \varepsilon \right) = 0.$$
Notable Relationship between Convergence Concepts
Almost sure convergence $\implies$ convergence in probability $\implies$ convergence in distribution.