statistical paradises and paradoxes in big data (i): law ... · menu 1 xiao-li meng department of...

61
Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup ”Trio” Identity Trio LLP What’s Big? CCES Assessing d.d.i Paradox Lessons Statistical Paradises and Paradoxes in Big Data (I): Law of Large Populations, Big Data Paradox, and the 2016 US Election Xiao-Li Meng Department of Statistics, Harvard University

Upload: others

Post on 28-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Page 2: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Page 3: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Page 4: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 2

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Page 5: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 2

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Page 6: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 2

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Page 7: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 8: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 9: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)

The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 10: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 11: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 12: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 13: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 14: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 15: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 3

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Page 16: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Page 17: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Page 18: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Page 19: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Page 20: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 4

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Page 21: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 22: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 23: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 24: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 25: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 26: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 27: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 5

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

n=

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

=

[Ave(RjXj)− Ave(Rj)Ave(Xj)

σRσ

X

]× σ

R

Ave(Rj)× σ

X

Page 28: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 6

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

×

√N − n

n︸ ︷︷ ︸Data Quantity

× σX︸︷︷︸

Problem Difficulty

Page 29: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 6

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

×√

N − n

n︸ ︷︷ ︸Data Quantity

×

σX︸︷︷︸

Problem Difficulty

Page 30: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 6

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

×√

N − n

n︸ ︷︷ ︸Data Quantity

× σX︸︷︷︸

Problem Difficulty

Page 31: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Page 32: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Page 33: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Page 34: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Page 35: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 7

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

X

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Page 36: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 8

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸ ︷︷ ︸

relative systemtic bias

Page 37: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 8

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸ ︷︷ ︸

relative systemtic bias

Page 38: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 8

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸ ︷︷ ︸

relative systemtic bias

Page 39: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 9

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

DI

[N − n

n

]σ2 =

1

N − 1

[N − neff

neff

]σ2

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

1

ρ̂2.

Page 40: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 9

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

DI

[N − n

n

]σ2 =

1

N − 1

[N − neff

neff

]σ2

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

1

ρ̂2.

Page 41: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 9

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

DI

[N − n

n

]σ2 =

1

N − 1

[N − neff

neff

]σ2

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

1

ρ̂2.

Page 42: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 10

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0%

50%

100%

0% 50% 100%

Final Clinton Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,C

linto

n S

uppo

rt

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0%

50%

100%

0% 50% 100%

Final Trump Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,Tr

ump

Sup

port

Root Mean Squared Error: 0.15

Serious underestimation ofTrump’s Vote Share

Page 43: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 10

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0%

50%

100%

0% 50% 100%

Final Clinton Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,C

linto

n S

uppo

rt

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0%

50%

100%

0% 50% 100%

Final Trump Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,Tr

ump

Sup

port

Root Mean Squared Error: 0.15

Serious underestimation ofTrump’s Vote Share

Page 44: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 10

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0%

50%

100%

0% 50% 100%

Final Clinton Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,C

linto

n S

uppo

rt

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0%

50%

100%

0% 50% 100%

Final Trump Popular Vote Share

Val

idat

ed V

oter

Pol

l Est

imat

e,Tr

ump

Sup

port

Root Mean Squared Error: 0.15

Serious underestimation ofTrump’s Vote Share

Page 45: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 11

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

)

−0.00021 ± 0.00061

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Cou

nt

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Trump ρN

Cou

nt

Trump: ρ̂ ≈ −0.0045± 0.0006

Page 46: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 11

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

)

−0.00021 ± 0.00061

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Cou

nt

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Trump ρN

Cou

nt

Trump: ρ̂ ≈ −0.0045± 0.0006

Page 47: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 11

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

)

−0.00021 ± 0.00061

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Cou

nt

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

0

4

8

12

−0.010 −0.005 0.000 0.005 0.010

Trump ρNC

ount

Trump: ρ̂ ≈ −0.0045± 0.0006

Page 48: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Page 49: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Page 50: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Page 51: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Page 52: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 12

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

1

DI=

1

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Page 53: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 13

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Visualizing LLP: Actual Coverage for Clinton

AL

AK

AZ

AR

CA

CO

CTDEDC

FL

GA

HI

ID

IL

IN

IA

KS

KY

LA

ME

MDMA

MI

MN

MSMO

MT

NE

NV

NH

NJNM

NY

NCND

OH

OK

OR

PA

RISC

SDTN

TX

UT

VT VA

WA

WVWI

WY

−10

−5

−2

0

2

5

5.5 6.0 6.5 7.0

log10 (Total Voters)

Clin

ton

Zn

Page 54: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 14

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Visualizing LLP: Actual Coverage for Trump

AL

AKAZ

AR

CACOCT

DE

DC

FL

GA

HI

ID

IL

IN

IA

KS KY

LAME MDMA

MI

MN

MS

MO

MT

NE

NV

NH

NJ

NM

NY

NC

ND

OH

OK

OR

PA

RI

SC

SD

TN

TX

UT

VT

VA

WAWV

WI

WY

−10

−5

−2

0

2

5

5.5 6.0 6.5 7.0

log10 (Total Voters)

Tru

mp

Zn

Page 55: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 15

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

The Big Data Paradox:

If we do not pay attention to data quality, then

The bigger the data,

the surer we fool ourselves.

Page 56: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Page 57: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Page 58: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Page 59: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 16

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Page 60: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 17

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

In case you are kind enough to invite me again ...

The sequel: Meng (2018/9)

Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s

Paradox, and Individualized Treatments

Page 61: Statistical Paradises and Paradoxes in Big Data (I): Law ... · Menu 1 Xiao-Li Meng Department of Statistics, Harvard University Motivation Soup "Trio" Identity Trio LLP What’s

Menu 17

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

Soup

”Trio” Identity

Trio

LLP

What’s Big?

CCES

Assessing d.d.i

Paradox

Lessons

In case you are kind enough to invite me again ...

The sequel: Meng (2018/9)

Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s

Paradox, and Individualized Treatments