CS 109 Lecture 19, May 9th, 2016 - Stanford University
TRANSCRIPT
CS 109, Lecture 19
May 9th, 2016
Midterm Distribution
Histogram of scores (5 to 120, in bins of 5). E[X] = 87, √Var(X) = 15, median = 90.
Midterm Distribution
The same histogram, annotated with grade cutoffs: A+, A, A-, B+, B, B-, Cs. E[X] = 87, √Var(X) = 15, median = 90.
Midterm Distribution
The same histogram, annotated with Core Understanding and Advanced Understanding regions. E[X] = 87, √Var(X) = 15, median = 90.
Midterm Beta
Beta fit to the score distribution: ɑ = 7.1, β = 2.7. E[X] = 87, √Var(X) = 15, median = 90.
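As a sanity check on the fitted Beta, the closed-form moments of Beta(7.1, 2.7), scaled to the exam's point range, roughly reproduce the reported mean. A minimal sketch; the 120-point maximum is an assumption read off the histogram axis:

```python
# Moments of a Beta(a, b) distribution scaled to a 120-point exam.
# a and b come from the slide's fit; the 120-point maximum is an assumption.
a, b, max_score = 7.1, 2.7, 120

mean = a / (a + b) * max_score
var = a * b / ((a + b) ** 2 * (a + b + 1)) * max_score ** 2

print(round(mean, 1))        # ~86.9, close to the reported E[X] = 87
print(round(var ** 0.5, 1))  # ~16.3, in the neighborhood of the reported 15
```

The fitted standard deviation need not match the empirical one exactly, since the Beta fit smooths the raw histogram.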
Midterm Beta
Left: CS109 Midterm CDF (you can interpret this as a percentile function). Right: CS109 Midterm PDF (the derivative: percentile per point).
Midterm Question Correlations
Score histograms for Problems 1-6 (y-axis 0-100; x-axis: points on each problem, up to 24 for Problems 1 and 6, 20 for Problems 2-4, and 12 for Problem 5).
Midterm Question Correlations
Correlation graph over question parts 1.a-1.d, 2.a-2.c, 3.a-3.c, 4.a-4.c, 5, and 6.a-6.c, with edge weights shown at 0.22, 0.40, and 0.60.
*Correlations below 0.22 are not shown. **Almost all correlations were positive.
Joint PMF for Questions 6.a, 6.c
Scatter of points on 6.a (x-axis, -1 to 8) vs. points on 6.c (y-axis), with regression line y = 0.46x + 2.6.
Joint PMF for Questions 4, 6
Scatter of total points on 6 (x-axis, 0-24) vs. total points on 4 (y-axis, 0-20), with regression line y = 0.2x + 11.
Conditional Expectation
• Let X be your score on problem 4.
• Let Y be your score on problem 6.
• What is E[X | Y]?
Plot: E[X | Y = y] vs. y, with standard error bars ("I use standard error because the expectation is from a sample").
Four Prototypical Trajectories
1. Inequalities
Inequality, Probability and Joviality
• In many cases, we don't know the true form of a probability distribution
  § E.g., midterm scores
  § But, we know the mean
  § May also have other measures/properties
    o Variance
    o Non-negativity
    o Etc.
  § Inequalities and bounds still allow us to say something about the probability distribution in such cases
    o May be imprecise compared to knowing the true distribution!
Markov's Inequality
• Say X is a non-negative random variable. Then:
  P(X ≥ a) ≤ E[X]/a, for all a > 0
• Proof:
  § Let I = 1 if X ≥ a, 0 otherwise
  § Since X ≥ 0 and a > 0: I ≤ X/a
  § Taking expectations: P(X ≥ a) = E[I] ≤ E[X/a] = E[X]/a
• Andrey Andreyevich Markov (1856-1922) was a Russian mathematician
§ Markov’s Inequality is named after him§ He also invented Markov Chains…
o …which are the basis for Google’s PageRank algorithm
Andrey Markov
Markov and the Midterm
• Statistics from a previous quarter's CS109 midterm
  § X = midterm score
  § Using sample mean X̄ = 86.7 ≈ E[X]
  § What is P(X ≥ 100)?
  § Markov bound: P(X ≥ 100) ≤ E[X]/100 ≈ 86.7/100 = 0.867
    o I.e., at most 86.7% of the class scored 100 or greater
  § In fact, 20.1% of the class scored 100 or greater
    o Markov's inequality can be a very loose bound
    o But, it made no assumption at all about the form of the distribution!
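The looseness is easy to see numerically. A minimal sketch comparing Markov's bound against a simulated tail; exponential samples with the same mean are a stand-in assumption, since the real score data isn't available here:

```python
import random

# Markov: P(X >= a) <= E[X]/a for non-negative X -- any distribution.
# Exponential samples with mean 86.7 are a stand-in for the real scores.
random.seed(0)
xs = [random.expovariate(1 / 86.7) for _ in range(100_000)]

a = 100
empirical = sum(x >= a for x in xs) / len(xs)
bound = 86.7 / a  # Markov's bound: E[X]/a

print(empirical, "<=", bound)  # the bound holds, but is loose
```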
Chebyshev's Inequality
• X is a random variable with E[X] = µ, Var(X) = σ². Then:
  P(|X − µ| ≥ k) ≤ σ²/k², for all k > 0
• Proof:
  § Since (X − µ)² is a non-negative random variable, apply Markov's Inequality with a = k²:
    P((X − µ)² ≥ k²) ≤ E[(X − µ)²]/k² = σ²/k²
  § Note that (X − µ)² ≥ k² ⇔ |X − µ| ≥ k, yielding:
    P(|X − µ| ≥ k) ≤ σ²/k²
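A quick numerical illustration of the bound; Uniform(0, 1) is an arbitrary test distribution chosen for this sketch:

```python
import random

# Chebyshev: P(|X - mu| >= k) <= sigma^2 / k^2 for any distribution.
# Uniform(0, 1) is an arbitrary test case: mu = 1/2, sigma^2 = 1/12.
random.seed(1)
mu, var = 0.5, 1 / 12
xs = [random.random() for _ in range(100_000)]

for k in (0.3, 0.4, 0.45):
    tail = sum(abs(x - mu) >= k for x in xs) / len(xs)
    assert tail <= var / k ** 2  # bound holds (here quite loosely)
    print(k, tail, var / k ** 2)
```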
• Pafnuty Lvovich Chebyshev (1821-1894) was also a Russian mathematician
§ Chebyshev’s Inequality is named after himo But actually formulated by his colleague Irénée-Jules Bienaymé
§ He was Markov’s doctoral advisoro And sometimes credited with first deriving Markov’s Inequality
§ There is a crater on the moon named in his honor
Pafnuty Chebyshev
2. Law of Large Numbers
Weak Law of Large Numbers
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) Σᵢ₌₁ⁿ Xi
  § For any ε > 0: P(|X̄ − µ| ≥ ε) → 0 as n → ∞
• Proof:
  § E[X̄] = E[(X1 + X2 + ... + Xn)/n] = µ
  § Var(X̄) = Var((X1 + X2 + ... + Xn)/n) = σ²/n
  § By Chebyshev's inequality: P(|X̄ − µ| ≥ k) ≤ Var(X̄)/k²
  § So: P(|X̄ − µ| ≥ ε) ≤ σ²/(nε²) → 0 as n → ∞
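The weak law can be watched happening. A sketch, with fair die rolls as an arbitrary choice of Xi, estimating P(|X̄ − µ| ≥ ε) for growing n:

```python
import random

# Weak LLN: P(|sample mean - mu| >= eps) -> 0 as n grows.
# Fair die rolls are an arbitrary choice; mu = 3.5, eps = 0.25.
random.seed(2)
mu, eps, trials = 3.5, 0.25, 2000

def tail_prob(n):
    """Estimate P(|mean of n rolls - mu| >= eps) over many repetitions."""
    hits = 0
    for _ in range(trials):
        mean = sum(random.randint(1, 6) for _ in range(n)) / n
        hits += abs(mean - mu) >= eps
    return hits / trials

probs = [tail_prob(n) for n in (10, 100, 1000)]
print(probs)  # decreasing toward 0
```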
Strong Law of Large Numbers
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ
  § Let X̄ = (1/n) Σᵢ₌₁ⁿ Xi
  § Strong Law: P( limₙ→∞ (X1 + X2 + ... + Xn)/n = µ ) = 1
  § Strong Law ⇒ Weak Law, but not vice versa
  § The Strong Law implies that for any ε > 0, there are only a finite number of values of n for which the condition of the Weak Law, |X̄ − µ| ≥ ε, holds.
Intuitions and Misconceptions of LLN
• Say we have repeated trials of an experiment
  § Let event E = some outcome of the experiment
  § Let Xi = 1 if E occurs on trial i, 0 otherwise
  § The Strong Law of Large Numbers (Strong LLN) yields:
    (X1 + X2 + ... + Xn)/n → E[Xi] = P(E)
  § Recall from the first week of class: P(E) = limₙ→∞ n(E)/n
  § The Strong LLN justifies the "frequency" notion of probability
  § Misconception arising from the LLN:
    o Gambler's fallacy: "I'm due for a win"
    o Consider being "due for a win" with repeated coin flips...
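The "due for a win" intuition can be checked directly. A sketch: the running frequency of heads converges (the frequency notion), yet the flip right after a run of five tails is still fair:

```python
import random

# Strong LLN: running frequency of heads -> 1/2. Gambler's fallacy check:
# even right after 5 tails in a row, the next flip is still 50/50.
random.seed(3)
flips = [random.randint(0, 1) for _ in range(200_000)]  # 1 = heads

overall = sum(flips) / len(flips)
after_5_tails = [flips[i] for i in range(5, len(flips))
                 if flips[i - 5:i] == [0, 0, 0, 0, 0]]
conditional = sum(after_5_tails) / len(after_5_tails)

print(overall, conditional)  # both near 0.5 -- nobody is "due for a win"
```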
• History of the Law of Large Numbers§ 1713: Weak LLN described by Jacob Bernoulli
§ 1835: Poisson calls it “La Loi des Grands Nombres”o That would be “Law of Large Numbers” in French
§ 1909: Émile Borel develops Strong LLN for Bernoulli random variables
§ 1928: Andrei Nikolaevich Kolmogorov proves Strong LLN in general case
La Loi des Grands Nombres
And now a moment of silence...
...before we present...
...the greatest result of probability theory!
Silence!!
3. Central Limit Theorem
The Central Limit Theorem
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) Σᵢ₌₁ⁿ Xi
  § Central Limit Theorem: X̄ ~ N(µ, σ²/n) as n → ∞
Interactive demo: http://onlinestatbook.com/stat_sim/sampling_dist/
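The simulation behind the linked demo is easy to reproduce. A sketch, with Uniform(0, 1) draws as an arbitrary parent distribution, checking that sample means have mean µ and variance σ²/n:

```python
import random, statistics

# CLT: the mean of n i.i.d. draws is approximately N(mu, sigma^2 / n).
# Uniform(0, 1) draws (mu = 0.5, sigma^2 = 1/12) are an arbitrary choice.
random.seed(4)
n, reps = 30, 20_000

means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))     # near mu = 0.5
print(statistics.variance(means))  # near sigma^2 / n = 1/360 ~ 0.00278
```

Plotting a histogram of `means` would show the bell shape, even though the parent distribution is flat.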
Another form of the Central Limit Theorem
• Consider I.I.D. random variables X1, X2, ...
  § Xi have distribution F with E[Xi] = µ and Var(Xi) = σ²
  § Let X̄ = (1/n) Σᵢ₌₁ⁿ Xi, so X̄ ~ N(µ, σ²/n) as n → ∞
  § Recall that Z ~ N(0, 1), where Z = (X̄ − µ)/√(σ²/n)
  § Equivalently:
    Z = (X̄ − µ)/√(σ²/n) = [(1/n) Σᵢ₌₁ⁿ Xi − µ]·(√n/σ) = (Σᵢ₌₁ⁿ Xi − nµ)/(σ√n)
  § So: (X1 + X2 + ... + Xn − nµ)/(σ√n) → N(0, 1) as n → ∞
Once Upon a Time… (1733): Abraham de Moivre
• History of the Central Limit Theorem§ 1733: CLT for X ~ Ber(1/2) postulated by Abraham de Moivre
§ 1823: Pierre-Simon Laplace extends de Moivre’s work to approximating Bin(n, p) with Normal
§ 1901: Aleksandr Lyapunov provides precise definition and rigorous proof of CLT
§ 2003: Charlie Sheen stars in television series “Two and a Half Men”o By end of the 7th (final) season, there were 161 episodes
o Mean quality of subsamples of episodes is Normally distributed (thanks to the Central Limit Theorem)
Once Upon a Time…
Central Limit Theorem in the Real World
• The CLT is why many things in the "real world" appear Normally distributed
  § Many quantities are sums of independent variables
  § Exam scores
    o Sum of individual problems on the SAT
    o Why does the CLT not apply to our midterm?
  § Election polling
    o Ask 100 people if they will vote for candidate X (p1 = # "yes"/100)
    o Repeat this process with different groups to get p1, ..., pn
    o The pi will have a normal distribution
    o Can produce a "confidence interval"
      • How likely is it that the estimate for the true p is correct?
Binomial Approximation
• Consider I.I.D. Bernoulli variables X1, X2, ... with probability p
  § Xi have E[Xi] = µ = p and Var(Xi) = σ² = p(1 − p)
  § Let X̄ = (1/n) Σᵢ₌₁ⁿ Xi, so X̄ ~ N(µ, σ²/n) as n → ∞
  § Let Y = nX̄, so Y ~ N(nµ, n²·(σ²/n)) = N(nµ, nσ²)
  § That is: Y ~ N(np, np(1 − p))
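A sketch comparing the exact binomial tail with the N(np, np(1 − p)) approximation; n = 100 and p = 0.3 are chosen arbitrarily for illustration:

```python
from math import comb
from statistics import NormalDist

# Bin(n, p) vs its CLT approximation N(np, np(1 - p)).
# n = 100, p = 0.3 are chosen arbitrarily for illustration.
n, p = 100, 0.3
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(40, n + 1))

mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
approx = 1 - NormalDist(mu, sigma).cdf(39.5)  # 0.5 continuity correction

print(exact, approx)  # the tails agree closely
```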
Your Midterm on the Central Limit Theorem
• Start with 370 midterm scores: X1, X2, ..., X370
  § E[Xi] = 87 and Var(Xi) = 225
  § Created 50 samples (Yi) of size n = 10
  § Prediction by CLT: Yi ~ N(87, 22.5)
  § Standardize each sample mean: Zi = (Yi − E[Xi])/√(σ²/n) = (Yi − 87)/√22.5
  § Empirically: Z̄ = (1/50) Σᵢ₌₁⁵⁰ Zi = 4×10⁻¹⁶ and Var(Z) = 0.997
Histogram of the Zi over [−3, 3].
Estimating Clock Running Time
• Have a new algorithm to test for running time
  § Mean (clock) running time: µ = t sec.
  § Variance of running time: σ² = 4 sec².
  § Run the algorithm repeatedly (I.I.D. trials), measure time
    o How many trials n so that estimated time = t ± 0.5 with 95% certainty?
    o Xi = running time of the i-th run (for 1 ≤ i ≤ n)
    o By the Central Limit Theorem, Zn ~ N(0, 1), where:
      Zn = (Σᵢ₌₁ⁿ Xi − nµ)/(σ√n) = (Σᵢ₌₁ⁿ Xi − nt)/(2√n)
• Require:
  0.95 = P(−0.5 ≤ (Σᵢ₌₁ⁿ Xi)/n − t ≤ 0.5)
       = P(−0.5·(√n/2) ≤ [(Σᵢ₌₁ⁿ Xi)/n − t]·(√n/2) ≤ 0.5·(√n/2))
       = P(−0.5·(√n/2) ≤ (Σᵢ₌₁ⁿ Xi − nt)/(2√n) ≤ 0.5·(√n/2))
       = P(−√n/4 ≤ Z ≤ √n/4)
Continuing:
  0.95 = Φ(√n/4) − Φ(−√n/4)
       = Φ(√n/4) − (1 − Φ(√n/4))
       = 2Φ(√n/4) − 1
  ⇒ Φ(√n/4) = 0.975
  ⇒ √n/4 = Φ⁻¹(0.975) = 1.96
  ⇒ n ≈ 61.4
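The final step can be checked with the standard library's normal distribution, which supplies both Φ and Φ⁻¹:

```python
from statistics import NormalDist

# Solve 0.975 = Phi(sqrt(n)/4) for n: sqrt(n)/4 = Phi^{-1}(0.975) ~ 1.96.
z = NormalDist().inv_cdf(0.975)
n = (4 * z) ** 2

print(round(z, 2))  # 1.96
print(n)            # ~61.5, consistent with the slide's n = 61.4 (so run 62 trials)
```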
William Sealy Gosset (a.k.a. "Student")
Estimating Time With Chebyshev
• Have a new algorithm to test for running time
  § Mean (clock) running time: µ = t sec.
  § Variance of running time: σ² = 4 sec².
  § Run the algorithm repeatedly (I.I.D. trials), measure time
    o How many trials n so that estimated time = t ± 0.5 with 95% certainty?
    o Xi = running time of the i-th run (for 1 ≤ i ≤ n), and X̄ = (1/n) Σᵢ₌₁ⁿ Xi
  § What would Chebyshev say? P(|X̄ − µ| ≥ k) ≤ Var(X̄)/k²
    E[X̄] = E[(1/n) Σᵢ₌₁ⁿ Xi] = t
    Var(X̄) = Var((1/n) Σᵢ₌₁ⁿ Xi) = (1/n²) Σᵢ₌₁ⁿ Var(Xi) = σ²/n = 4/n
    P(|X̄ − t| ≥ 0.5) ≤ (4/n)/(0.5)² = 16/n = 0.05 ⇒ n ≥ 320
Thanks for playing, Pafnuty!
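The gap between the two requirements is easy to verify. A sketch; Gaussian run times (σ = 2, an arbitrary t) are a stand-in assumption, since Chebyshev itself assumes nothing about the distribution:

```python
import random

# Chebyshev demands 16/n <= 0.05, i.e. n >= 320 -- versus ~62 via the CLT.
assert 16 / 320 == 0.05

# Sanity check with n = 320: the estimate almost never misses by 0.5.
# Gaussian run times (sigma = 2, arbitrary t = 3.0) are a stand-in assumption.
random.seed(5)
t, n = 3.0, 320
misses = 0
for _ in range(2000):
    xbar = sum(random.gauss(t, 2) for _ in range(n)) / n
    misses += abs(xbar - t) >= 0.5
print(misses / 2000)  # far below the 0.05 that Chebyshev allows
```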
Crashing Your Website
• Number of visitors to a web site per minute: X ~ Poi(100)
  § Server crashes if ≥ 120 requests/minute
  § What is P(crash in next minute)?
  § Exact solution: P(X ≥ 120) = Σᵢ₌₁₂₀^∞ e⁻¹⁰⁰(100)ⁱ/i! ≈ 0.0282
  § Use the CLT, where Y = Σᵢ₌₁ⁿ Poi(100/n) (all I.I.D.)
    o Note: the Normal can be used to approximate the Poisson
    P(X ≥ 120) = P(Y ≥ 119.5) = P((Y − 100)/√100 ≥ (119.5 − 100)/√100) ≈ 1 − Φ(1.95) ≈ 0.0256
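Both numbers are reproducible. A sketch: the exact tail is summed in log space to avoid overflow in 100ⁱ and i!, and `statistics.NormalDist` supplies Φ for the approximation:

```python
from math import exp, log, lgamma
from statistics import NormalDist

lam = 100.0

# Exact Poisson tail P(X >= 120), computed in log space (terms past ~500
# are negligibly small):
exact = sum(exp(i * log(lam) - lam - lgamma(i + 1)) for i in range(120, 500))

# Normal approximation with continuity correction, per the slide:
approx = 1 - NormalDist(100, 10).cdf(119.5)

print(round(exact, 4))   # ~0.0282, as on the slide
print(round(approx, 4))  # ~0.0256, as on the slide
```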
It’s play time!
Sum of Dice
• You will roll 10 six-sided dice (X1, X2, …, X10)
  § X = total value of all 10 dice = X1 + X2 + … + X10
  § Win if: X ≤ 25 or X ≥ 45
  § Roll!
• And now the truth (according to the CLT)…
Sum of Dice
• You will roll 10 six-sided dice (X1, X2, …, X10)
  § X = total value of all 10 dice = X1 + X2 + … + X10
  § Win if: X ≤ 25 or X ≥ 45
• Recall the CLT: (X1 + X2 + ... + Xn − nµ)/(σ√n) → N(0, 1) as n → ∞
  § Here µ = E[Xi] = 3.5 and σ² = Var(Xi) = 35/12
  § Determine P(X ≤ 25 or X ≥ 45) using the CLT (with continuity correction):
    1 − P(25.5 ≤ X ≤ 44.5)
      = 1 − P((25.5 − 10(3.5))/√(10·35/12) ≤ Z ≤ (44.5 − 10(3.5))/√(10·35/12))
      ≈ 1 − (2Φ(1.76) − 1) = 2(1 − 0.9608) = 0.0784
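The CLT estimate can be compared with the exact probability. A sketch that builds the pmf of the sum by convolving a single-die pmf ten times:

```python
# Exact P(win) for X = sum of 10 fair dice; win if X <= 25 or X >= 45.
# Build the pmf of the sum by convolving a single-die pmf ten times.
pmf = {0: 1.0}
for _ in range(10):
    new = {}
    for total, p in pmf.items():
        for face in range(1, 7):
            new[total + face] = new.get(total + face, 0.0) + p / 6
    pmf = new

p_win = sum(p for total, p in pmf.items() if total <= 25 or total >= 45)
print(round(p_win, 4))  # close to the CLT estimate of 0.0784
```

By symmetry of the sum about 35, the two tails contribute equally.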
I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "[Central limit theorem]". The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.
-Sir Francis Galton
Wonderful Form of Cosmic Order
Closing slide: the midterm Beta fit again (ɑ = 7.1, β = 2.7; E[X] = 87, √Var(X) = 15, median = 90), annotated with the grade cutoffs (A+, A, A-, B+, B, B-, Cs) and the Core Understanding / Advanced Understanding regions.