statistical paradises and paradoxes in big data (i): law ... · menu 1 xiao-li meng department of...

Menu 1

Xiao-Li MengDepartment of

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Statistical Paradises and Paradoxes in Big Data (I):

Law of Large Populations, Big DataParadox, and the 2016 US Election

Xiao-Li MengDepartment of Statistics, Harvard University

Meng (2018). Annals of Applied Statistics, No. 2, 685-726.https://statistics.fas.harvard.edu/people/xiao-li-meng

Many thanks to Stephen Ansolabehere and ShiroKuriwaki for the CCES (Cooperative CongressionalElection Study) data and analysis on 2016 US election.

Menu 1

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Menu 1

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Menu 2

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Motivating questions

We know that a 5% random sample is better than a 5%non-random sample in measurable ways (e.g., bias,predictive power).

But is an 80% non-random sample “better” than a5% random sample in measurable terms? 90%?95%? 99%? (Jeremy Wu of US Census Bureau, 2012,Seminar at Harvard Statistics)

“Which one should we trust more: a 1% survey with60% response rate or a non-probabilistic datasetcovering 80% of the population?” (Keiding and Louis,2015, Joint Statistical Meetings; and JRSSB, 2016)

Menu 2

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Menu 2

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

A Bit of History: Theory and Practice

Law of Large Numbers:Jakob Bernoulli (1713)

Central Limit Theorem:Abraham de Moivre (1733):error ∝ 1√

n: n − sample size

Survey Sampling:

Graunt (1662); Laplace (1882)The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Landmark paper: JerzyNeyman (1934)

The “revolution” lastedabout 50 years (JelkeBethlehem, 2009)

First implementation inUS Census: 1940 led byMorris Hansen

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Graunt (1662); Laplace (1882)

The “intellectually violentrevolution” in 1895 by AndersKiær, Statistics Norway

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 3

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Survey Sampling:

Menu 4

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

When/why can we ignore the population size?

Think about tastingsoup ...

Stir it well, then afew bits aresufficient regardlessof the size of thecontainer!

⇐⇒

Menu 4

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

⇐⇒

Menu 4

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

⇐⇒

Menu 4

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

⇐⇒

Menu 4

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

⇐⇒

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

2016 US Presidential Election

n: number of respondents to an election survey

N: number of (actual) voters in US

Xj = 1: plan to vote for Trump; Xj = 0 otherwise

Rj = 1: report (honestly) voting plan; Rj = 0 otherwise

Estimatinng Trump’s share: µN

= Ave(Xj) by sample average:

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

Actual estimation error

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

[Ave(RjXj)− Ave(Rj)Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 5

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

µ̂n =R1X1 + . . .+ RNXN

Ave(RjXj)

Ave(Rj)

µ̂n − µN=

Ave(RjXj)

Ave(Rj)− Ave(Xj)

]× σ

Ave(Rj)× σ

Menu 6

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Data quality, quantity, and uncertainty

Because σ2R = f (1− f ), f = Ave{Rj} = n

N , we have

Error = ρ̂R,X︸︷︷︸

Data Quality

√N − n

n︸︷︷︸Data Quantity

× σX︸︷︷︸

Problem Difficulty

Menu 6

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

N , we have

Data Quality

N − n

σX︸︷︷︸

Problem Difficulty

Menu 6

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

N , we have

Data Quality

N − n

× σX︸︷︷︸

Problem Difficulty

Menu 7

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Data Defect Index (d.d.i.)

Mean Squared Error (MSE)

MSE(µ̂n) = ER(ρ̂2)× N − n

n× σ2

Data Defect Index (d.d.i): DI = ER(ρ̂2)

For Simple Random Sample (SRS): DI = (N − 1)−1

For probabilistic samples in general: DI ∝ N−1

Deep trouble when DI does not vanish with N−1;

or equivalently when ρ̂ does not vanish with N−1/2 ...

Menu 7

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

n× σ2

Menu 7

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

n× σ2

Menu 7

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

n× σ2

Menu 7

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

n× σ2

Menu 8

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

A Law of Large Populations (LLP)

If ρ = ER(ρ̂) 6= 0, then on average, the relative error ↑√N:

Actual Error

Benchmark SRS Standard Error=√N − 1ρ̂

The (lack-of) design effect (Deff)

Deff =MSE

Benchmark SRS MSE= (N − 1)DI

Paradigm shift for “Big Data”:

Fromσ√n︸︷︷︸

random error

to ρ̂√N︸︷︷︸

relative systemtic bias

Menu 8

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Actual Error

Deff =MSE

random error

to ρ̂√N︸︷︷︸

Menu 8

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Actual Error

Deff =MSE

random error

to ρ̂√N︸︷︷︸

Menu 9

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Effective Sample Size

The Effective Sample Size neff of a “Big Data” set

Equate its MSE to that from a SRS with size neff :

[N − n

]σ2 =

N − 1

[N − neff

What matters is the relative size f = n/N

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

ρ̂2.

Menu 9

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

[N − n

]σ2 =

N − 1

[N − neff

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

ρ̂2.

Menu 9

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

[N − n

]σ2 =

N − 1

[N − neff

neff =n

1 + (1− f )[(N − 1)DI − 1]≈ f

1− f

ρ̂2.

Menu 10

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Gaining 2020 Vision: Assessing the behavioral ρ̂using validated voter counts (≈ 35, 000)

CCES: Cooperative Congressional Election Study(Conducted by Stephen Ansolabehere, Brian Schaffner, Sam Luks, Douglas Rivers

on Oct 4 - Nov 6, 2016 (YouGov); Analysis assisted by Shiro Kuriwaki)

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0% 50% 100%

Final Clinton Popular Vote Share

Root Mean Squared Error: 0.07

Reasonable predictions forClinton’s Vote Share

Poll underestimated

Trump support

Poll overestimated

Trump support

0% 50% 100%

Final Trump Popular Vote Share

Serious underestimation ofTrump’s Vote Share

Menu 10

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0% 50% 100%

Poll underestimated

Trump support

Poll overestimated

Trump support

0% 50% 100%

Menu 10

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Poll underestimated

Clinton support

Poll overestimated

Clinton support

0% 50% 100%

Poll underestimated

Trump support

Poll overestimated

Trump support

0% 50% 100%

Menu 11

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Assessing ρ̂ using state-level data

Let µN

be the true share, and µ̂n the estimated share. Then

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

−0.00021 ± 0.00061

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

−0.010 −0.005 0.000 0.005 0.010

Trump ρN

Trump: ρ̂ ≈ −0.0045± 0.0006

Menu 11

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Let µN

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

−0.00021 ± 0.00061

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

−0.010 −0.005 0.000 0.005 0.010

Trump ρN

Trump: ρ̂ ≈ −0.0045± 0.0006

Menu 11

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Let µN

ρ̂ =µ̂n − µN√

N−nn σ2

, & σ2 = µN

(1− µN

−0.00021 ± 0.00061

−0.010 −0.005 0.000 0.005 0.010

Clinton ρN

Clinton: ρ̂ ≈ −0.0002± 0.0006

−0.0045 ± 0.00056

−0.010 −0.005 0.000 0.005 0.010

Trump ρNC

Trump: ρ̂ ≈ −0.0045± 0.0006

Menu 12

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

What’s the implication of ρ̂ = −0.005?

Many (major) survey results published before Nov 8, 2016;

Roughly amounts to 1% of eligible voters: n ≈ 2, 300, 000;

Equivalent to 2,300 surveys of 1,000 respondents each.

When ρ̂ = −0.005 = −1/200,DI = 1/40000, and hence

neff =f

1− f

99× 40000 ≈ 404!

A 99.98% reduction in n, caused by ρ̂ = −0.005.

Butterfly Effect due to Law of Large Populations (LLP)

Relative Error =√

N− 1ρ̂

Menu 12

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

neff =f

1− f

99× 40000 ≈ 404!

Relative Error =√

N− 1ρ̂

Menu 12

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

neff =f

1− f

99× 40000 ≈ 404!

Relative Error =√

N− 1ρ̂

Menu 12

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

neff =f

1− f

99× 40000 ≈ 404!

Relative Error =√

N− 1ρ̂

Menu 12

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

neff =f

1− f

99× 40000 ≈ 404!

Relative Error =√

N− 1ρ̂

Menu 13

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Visualizing LLP: Actual Coverage for Clinton

CTDEDC

5.5 6.0 6.5 7.0

log10 (Total Voters)

Menu 14

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Visualizing LLP: Actual Coverage for Trump

CACOCT

LAME MDMA

5.5 6.0 6.5 7.0

log10 (Total Voters)

Menu 15

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

The Big Data Paradox:

If we do not pay attention to data quality, then

The bigger the data,

the surer we fool ourselves.

Menu 16

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Lesson 1: What matters most is the quality, not thequantity.

Lesson 2: Don’t ignore seemingly tiny probabilisticdatasets when combining data sources.

Lesson 3: Watch the relative size, not the absolutesize.

Lesson 4: Classical theory is BIG for “big data”, aslong as we let it go outside the classical box.

Menu 16

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Menu 16

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Menu 16

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

Lessons Learned ...

Menu 17

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

In case you are kind enough to invite me again ...

The sequel: Meng (2018/9)

Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s

Paradox, and Individualized Treatments

Menu 17

Statistics,Harvard

University

Motivation

”Trio” Identity

What’s Big?

Assessing d.d.i

Paradox

Lessons

In case you are kind enough to invite me again ...

The sequel: Meng (2018/9)

Statistical Paradises and Paradoxes in Big Data (II):Multi-resolution Inference, Simpson’s

Paradox, and Individualized Treatments

statistical paradises and paradoxes in big data (i): law ... · menu 1 xiao-li meng department of...

Documents