statistical paradises and paradoxes in big data (i): law ... · menu 1 xiao-li meng department of...
TRANSCRIPT
Menu 1
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Statistical Paradises and Paradoxes in Big Data (I):
Law of Large Populations, Big DataParadox, and the 2016 US Election
Xiao-Li MengDepartment of Statistics, Harvard University
Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng
Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.
Menu 1
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Statistical Paradises and Paradoxes in Big Data (I):
Law of Large Populations, Big DataParadox, and the 2016 US Election
Xiao-Li MengDepartment of Statistics, Harvard University
Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng
Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.
Menu 1
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Statistical Paradises and Paradoxes in Big Data (I):
Law of Large Populations, Big DataParadox, and the 2016 US Election
Xiao-Li MengDepartment of Statistics, Harvard University
Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng
Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.
Menu 2
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Motivating questions
We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).
But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)
“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)
Menu 2
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Motivating questions
We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).
But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)
“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)
Menu 2
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Motivating questions
We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).
But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)
“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)
The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 3
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Bit of History: Theory and Practice
Law of Large Numbers:Jakob Bernoulli (1713)
Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√
n: n − sample size
Survey Sampling:
Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway
Landmark paper: JerzyNeyman (1934)
The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)
First implementation inUS Census: 1940 led byMorris Hansen
Menu 4
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
When/why can we ignore the population size?
Think about tastingsoup ...
Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!
⇐⇒
Menu 4
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
When/why can we ignore the population size?
Think about tastingsoup ...
Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!
⇐⇒
Menu 4
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
When/why can we ignore the population size?
Think about tastingsoup ...
Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!
⇐⇒
Menu 4
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
When/why can we ignore the population size?
Think about tastingsoup ...
Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!
⇐⇒
Menu 4
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
When/why can we ignore the population size?
Think about tastingsoup ...
Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!
⇐⇒
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 5
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
2016 US Presidential Election
n: number of respondents to an election survey
N: number of (actual) voters in US
Xj = 1: plan to vote for Trump; Xj = 0 otherwise
Rj = 1: report (honestly) voting plan; Rj = 0 otherwise
Estimatinng Trump’s share: µN
= Ave(Xj) by sample average:
µ̂n =R1X1 + . . .+ RNXN
n=
Ave(RjXj)
Ave(Rj)
Actual estimation error
µ̂n − µN=
Ave(RjXj)
Ave(Rj)− Ave(Xj)
=
[Ave(RjXj)− Ave(Rj)Ave(Xj)
σRσ
X
]× σ
R
Ave(Rj)× σ
X
Menu 6
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data quality, quantity, and uncertainty
Because σ2R = f (1− f ), f = Ave{Rj} = n
N , we have
Error = ρ̂R,X︸︷︷︸
Data Quality
×
√N − n
n︸ ︷︷ ︸Data Quantity
× σX︸︷︷︸
Problem Difficulty
Menu 6
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data quality, quantity, and uncertainty
Because σ2R = f (1− f ), f = Ave{Rj} = n
N , we have
Error = ρ̂R,X︸︷︷︸
Data Quality
×√
N − n
n︸ ︷︷ ︸Data Quantity
×
σX︸︷︷︸
Problem Difficulty
Menu 6
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data quality, quantity, and uncertainty
Because σ2R = f (1− f ), f = Ave{Rj} = n
N , we have
Error = ρ̂R,X︸︷︷︸
Data Quality
×√
N − n
n︸ ︷︷ ︸Data Quantity
× σX︸︷︷︸
Problem Difficulty
Menu 7
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data Defect Index (d.d.i.)
Mean Squared Error (MSE)
MSE(µ̂n) = ER(ρ̂2)× N − n
n× σ2
X
Data Defect Index (d.d.i): DI = ER(ρ̂2)
For Simple Random Sample (SRS): DI = (N − 1)−1
For probabilistic samples in general: DI ∝ N−1
Deep trouble when DI does not vanish with N−1;
or equivalently when ρ̂ does not vanish with N−1/2 ...
Menu 7
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data Defect Index (d.d.i.)
Mean Squared Error (MSE)
MSE(µ̂n) = ER(ρ̂2)× N − n
n× σ2
X
Data Defect Index (d.d.i): DI = ER(ρ̂2)
For Simple Random Sample (SRS): DI = (N − 1)−1
For probabilistic samples in general: DI ∝ N−1
Deep trouble when DI does not vanish with N−1;
or equivalently when ρ̂ does not vanish with N−1/2 ...
Menu 7
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data Defect Index (d.d.i.)
Mean Squared Error (MSE)
MSE(µ̂n) = ER(ρ̂2)× N − n
n× σ2
X
Data Defect Index (d.d.i): DI = ER(ρ̂2)
For Simple Random Sample (SRS): DI = (N − 1)−1
For probabilistic samples in general: DI ∝ N−1
Deep trouble when DI does not vanish with N−1;
or equivalently when ρ̂ does not vanish with N−1/2 ...
Menu 7
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data Defect Index (d.d.i.)
Mean Squared Error (MSE)
MSE(µ̂n) = ER(ρ̂2)× N − n
n× σ2
X
Data Defect Index (d.d.i): DI = ER(ρ̂2)
For Simple Random Sample (SRS): DI = (N − 1)−1
For probabilistic samples in general: DI ∝ N−1
Deep trouble when DI does not vanish with N−1;
or equivalently when ρ̂ does not vanish with N−1/2 ...
Menu 7
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Data Defect Index (d.d.i.)
Mean Squared Error (MSE)
MSE(µ̂n) = ER(ρ̂2)× N − n
n× σ2
X
Data Defect Index (d.d.i): DI = ER(ρ̂2)
For Simple Random Sample (SRS): DI = (N − 1)−1
For probabilistic samples in general: DI ∝ N−1
Deep trouble when DI does not vanish with N−1;
or equivalently when ρ̂ does not vanish with N−1/2 ...
Menu 8
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Law of Large Populations (LLP)
If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:
Actual Error
Benchmark SRS Standard Error=√N − 1ρ̂
The (lack-of) design effect (Deff)
Deff =MSE
Benchmark SRS MSE= (N − 1)DI
Paradigm shift for “Big Data”:
Fromσ√n︸︷︷︸
random error
to ρ̂√N︸ ︷︷ ︸
relative systemtic bias
Menu 8
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Law of Large Populations (LLP)
If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:
Actual Error
Benchmark SRS Standard Error=√N − 1ρ̂
The (lack-of) design effect (Deff)
Deff =MSE
Benchmark SRS MSE= (N − 1)DI
Paradigm shift for “Big Data”:
Fromσ√n︸︷︷︸
random error
to ρ̂√N︸ ︷︷ ︸
relative systemtic bias
Menu 8
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
A Law of Large Populations (LLP)
If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:
Actual Error
Benchmark SRS Standard Error=√N − 1ρ̂
The (lack-of) design effect (Deff)
Deff =MSE
Benchmark SRS MSE= (N − 1)DI
Paradigm shift for “Big Data”:
Fromσ√n︸︷︷︸
random error
to ρ̂√N︸ ︷︷ ︸
relative systemtic bias
Menu 9
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Effective Sample Size
The Effective Sample Size neff of a “Big Data” set
Equate its MSE to that from a SRS with size neff :
DI
[N − n
n
]σ2 =
1
N − 1
[N − neff
neff
]σ2
What matters is the relative size f = n/N
neff =n
1 + (1− f )[(N − 1)DI − 1]≈ f
1− f
1
ρ̂2.
Menu 9
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Effective Sample Size
The Effective Sample Size neff of a “Big Data” set
Equate its MSE to that from a SRS with size neff :
DI
[N − n
n
]σ2 =
1
N − 1
[N − neff
neff
]σ2
What matters is the relative size f = n/N
neff =n
1 + (1− f )[(N − 1)DI − 1]≈ f
1− f
1
ρ̂2.
Menu 9
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Effective Sample Size
The Effective Sample Size neff of a “Big Data” set
Equate its MSE to that from a SRS with size neff :
DI
[N − n
n
]σ2 =
1
N − 1
[N − neff
neff
]σ2
What matters is the relative size f = n/N
neff =n
1 + (1− f )[(N − 1)DI − 1]≈ f
1− f
1
ρ̂2.
Menu 10
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)
CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers
on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)
Poll underestimated
Clinton support
Poll overestimated
Clinton support
0%
50%
100%
0% 50% 100%
Final Clinton Popular Vote Share
Val
idat
ed V
oter
Pol
l Est
imat
e,C
linto
n S
uppo
rt
Root Mean Squared Error: 0.07
Reasonable predictions forClinton’s Vote Share
Poll underestimated
Trump support
Poll overestimated
Trump support
0%
50%
100%
0% 50% 100%
Final Trump Popular Vote Share
Val
idat
ed V
oter
Pol
l Est
imat
e,Tr
ump
Sup
port
Root Mean Squared Error: 0.15
Serious underestimation ofTrump’s Vote Share
Menu 10
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)
CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers
on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)
Poll underestimated
Clinton support
Poll overestimated
Clinton support
0%
50%
100%
0% 50% 100%
Final Clinton Popular Vote Share
Val
idat
ed V
oter
Pol
l Est
imat
e,C
linto
n S
uppo
rt
Root Mean Squared Error: 0.07
Reasonable predictions forClinton’s Vote Share
Poll underestimated
Trump support
Poll overestimated
Trump support
0%
50%
100%
0% 50% 100%
Final Trump Popular Vote Share
Val
idat
ed V
oter
Pol
l Est
imat
e,Tr
ump
Sup
port
Root Mean Squared Error: 0.15
Serious underestimation ofTrump’s Vote Share
Menu 10
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)
CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers
on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)
Poll underestimated
Clinton support
Poll overestimated
Clinton support
0%
50%
100%
0% 50% 100%
Final Clinton Popular Vote Share
Val
idat
ed V
oter
Pol
l Est
imat
e,C
linto
n S
uppo
rt
Root Mean Squared Error: 0.07
Reasonable predictions forClinton’s Vote Share
Poll underestimated
Trump support
Poll overestimated
Trump support
0%
50%
100%
0% 50% 100%
Final Trump Popular Vote Share
Val
idat
ed V
oter
Pol
l Est
imat
e,Tr
ump
Sup
port
Root Mean Squared Error: 0.15
Serious underestimation ofTrump’s Vote Share
Menu 11
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Assessing ρ̂ using state-level data
Let µN
be the true share, and µ̂n the estimated share. Then
ρ̂ =µ̂n − µN√
N−nn σ2
, & σ2 = µN
(1− µN
)
−0.00021 ± 0.00061
0
4
8
12
−0.010 −0.005 0.000 0.005 0.010
Clinton ρN
Cou
nt
Clinton: ρ̂ ≈ −0.0002± 0.0006
−0.0045 ± 0.00056
0
4
8
12
−0.010 −0.005 0.000 0.005 0.010
Trump ρN
Cou
nt
Trump: ρ̂ ≈ −0.0045± 0.0006
Menu 11
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Assessing ρ̂ using state-level data
Let µN
be the true share, and µ̂n the estimated share. Then
ρ̂ =µ̂n − µN√
N−nn σ2
, & σ2 = µN
(1− µN
)
−0.00021 ± 0.00061
0
4
8
12
−0.010 −0.005 0.000 0.005 0.010
Clinton ρN
Cou
nt
Clinton: ρ̂ ≈ −0.0002± 0.0006
−0.0045 ± 0.00056
0
4
8
12
−0.010 −0.005 0.000 0.005 0.010
Trump ρN
Cou
nt
Trump: ρ̂ ≈ −0.0045± 0.0006
Menu 11
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Assessing ρ̂ using state-level data
Let µN
be the true share, and µ̂n the estimated share. Then
ρ̂ =µ̂n − µN√
N−nn σ2
, & σ2 = µN
(1− µN
)
−0.00021 ± 0.00061
0
4
8
12
−0.010 −0.005 0.000 0.005 0.010
Clinton ρN
Cou
nt
Clinton: ρ̂ ≈ −0.0002± 0.0006
−0.0045 ± 0.00056
0
4
8
12
−0.010 −0.005 0.000 0.005 0.010
Trump ρNC
ount
Trump: ρ̂ ≈ −0.0045± 0.0006
Menu 12
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
What’s the implication of ρ̂ = −0.005?
Many (major) survey results published before Nov 8, 2016;
Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;
Equivalent to 2,300 surveys of 1,000 respondents each.
When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence
neff =f
1− f
1
DI=
1
99× 40000 ≈ 404!
A 99.98% reduction in n, caused by ρ̂ = −0.005.
Butterfly Effect due to Law of Large Populations (LLP)
Relative Error =√
N− 1ρ̂
Menu 12
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
What’s the implication of ρ̂ = −0.005?
Many (major) survey results published before Nov 8, 2016;
Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;
Equivalent to 2,300 surveys of 1,000 respondents each.
When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence
neff =f
1− f
1
DI=
1
99× 40000 ≈ 404!
A 99.98% reduction in n, caused by ρ̂ = −0.005.
Butterfly Effect due to Law of Large Populations (LLP)
Relative Error =√
N− 1ρ̂
Menu 12
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
What’s the implication of ρ̂ = −0.005?
Many (major) survey results published before Nov 8, 2016;
Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;
Equivalent to 2,300 surveys of 1,000 respondents each.
When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence
neff =f
1− f
1
DI=
1
99× 40000 ≈ 404!
A 99.98% reduction in n, caused by ρ̂ = −0.005.
Butterfly Effect due to Law of Large Populations (LLP)
Relative Error =√
N− 1ρ̂
Menu 12
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
What’s the implication of ρ̂ = −0.005?
Many (major) survey results published before Nov 8, 2016;
Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;
Equivalent to 2,300 surveys of 1,000 respondents each.
When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence
neff =f
1− f
1
DI=
1
99× 40000 ≈ 404!
A 99.98% reduction in n, caused by ρ̂ = −0.005.
Butterfly Effect due to Law of Large Populations (LLP)
Relative Error =√
N− 1ρ̂
Menu 12
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
What’s the implication of ρ̂ = −0.005?
Many (major) survey results published before Nov 8, 2016;
Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;
Equivalent to 2,300 surveys of 1,000 respondents each.
When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence
neff =f
1− f
1
DI=
1
99× 40000 ≈ 404!
A 99.98% reduction in n, caused by ρ̂ = −0.005.
Butterfly Effect due to Law of Large Populations (LLP)
Relative Error =√
N− 1ρ̂
Menu 13
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Visualizing LLP: Actual Coverage for Clinton
AL
AK
AZ
AR
CA
CO
CTDEDC
FL
GA
HI
ID
IL
IN
IA
KS
KY
LA
ME
MDMA
MI
MN
MSMO
MT
NE
NV
NH
NJNM
NY
NCND
OH
OK
OR
PA
RISC
SDTN
TX
UT
VT VA
WA
WVWI
WY
−10
−5
−2
0
2
5
5.5 6.0 6.5 7.0
log10 (Total Voters)
Clin
ton
Zn
Menu 14
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Visualizing LLP: Actual Coverage for Trump
AL
AKAZ
AR
CACOCT
DE
DC
FL
GA
HI
ID
IL
IN
IA
KS KY
LAME MDMA
MI
MN
MS
MO
MT
NE
NV
NH
NJ
NM
NY
NC
ND
OH
OK
OR
PA
RI
SC
SD
TN
TX
UT
VT
VA
WAWV
WI
WY
−10
−5
−2
0
2
5
5.5 6.0 6.5 7.0
log10 (Total Voters)
Tru
mp
Zn
Menu 15
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
The Big Data Paradox:
If we do not pay attention to data quality, then
The bigger the data,
the surer we fool ourselves.
Menu 16
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Lessons Learned ...
Lesson 1: What matters most is the quality, not thequantity.
Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.
Lesson 3: Watch the relative size, not the absolutesize.
Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.
Menu 16
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Lessons Learned ...
Lesson 1: What matters most is the quality, not thequantity.
Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.
Lesson 3: Watch the relative size, not the absolutesize.
Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.
Menu 16
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Lessons Learned ...
Lesson 1: What matters most is the quality, not thequantity.
Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.
Lesson 3: Watch the relative size, not the absolutesize.
Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.
Menu 16
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
Lessons Learned ...
Lesson 1: What matters most is the quality, not thequantity.
Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.
Lesson 3: Watch the relative size, not the absolutesize.
Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.
Menu 17
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
In case you are kind enough to invite me again ...
The sequel: Meng (2018/9)
Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s
Paradox, and Individualized Treatments
Menu 17
Xiao-Li MengDepartment of
Statistics,Harvard
University
Motivation
Soup
”Trio” Identity
Trio
LLP
What’s Big?
CCES
Assessing d.d.i
Paradox
Lessons
In case you are kind enough to invite me again ...
The sequel: Meng (2018/9)
Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s
Paradox, and Individualized Treatments