How to Fake Data if you must Department of Statistics Rachel Fewster

Post on 28-Mar-2015


Page 1: How to Fake Data if you must Department of Statistics Rachel Fewster

How to Fake Data if you must

Department of Statistics

Rachel Fewster

Page 2: How to Fake Data if you must Department of Statistics Rachel Fewster

Who wants to fake data?

• Electoral finance returns…

• Toxic emissions reports…

• Business tax returns…

Page 3: How to Fake Data if you must Department of Statistics Rachel Fewster

Land areas of world countries: real or fake?

Page 4: How to Fake Data if you must Department of Statistics Rachel Fewster

Land areas of world countries: real or fake?

[Figure: tally chart of first digits 1 to 9]

Page 5: How to Fake Data if you must Department of Statistics Rachel Fewster

Land areas of world countries: real or fake?

[Figure: two tally charts of first digits 1 to 9, one real and one fake]

Page 6: How to Fake Data if you must Department of Statistics Rachel Fewster

Land areas of world countries: real or fake?

[Figure: two tally charts of first digits 1 to 9, one real and one fake]

This one seems more even…

This one has as many 1s as 5-9s put together!

This one is right!

Page 7: How to Fake Data if you must Department of Statistics Rachel Fewster

Real land areas of world countries

[Figure: tally chart of first digits 1 to 9]

11 of them begin with digits 1 – 4…

Only 5 begin with digits 5 – 9…

Page 8: How to Fake Data if you must Department of Statistics Rachel Fewster

Friday’s Newspaper:

[Figure: tally chart of first digits 1 to 9]

10 out of 34 numbers began with a 1…

None out of 34 began with a 9!

Page 9: How to Fake Data if you must Department of Statistics Rachel Fewster

The Curious Case of the Grimy Log-books

• In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…

Page 10: How to Fake Data if you must Department of Statistics Rachel Fewster

The Curious Case of the Grimy Log-books

The books always seemed grubby on the first pages…

… but clean on the last pages

The first pages are for numbers beginning with digits 1 and 2…

The last pages are for numbers beginning with digits 8 and 9…

Page 11: How to Fake Data if you must Department of Statistics Rachel Fewster

The Curious Case of the Grimy Log-books

People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9.

Why?

Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!

Page 12: How to Fake Data if you must Department of Statistics Rachel Fewster

Newcomb’s Law

American Journal of Mathematics, 1881

30% of numbers begin with a 1 !!

< 5% of numbers begin with a 9 !!

Page 13: How to Fake Data if you must Department of Statistics Rachel Fewster

The First Digits…

Over 30% of numbers begin with a 1

Only 5% of numbers begin with a 9

Page 14: How to Fake Data if you must Department of Statistics Rachel Fewster

The First Digits…

Numbers beginning with a 1

Numbers beginning with a 9

There is the same “opportunity” for numbers to begin with 9 as with 1 …

but for some reason they don’t!

Page 15: How to Fake Data if you must Department of Statistics Rachel Fewster

Chance of a number starting with digit d:

$\Pr(\text{first digit} = d) = \log_{10}\left(\frac{d+1}{d}\right)$

0.301 = log10(2/1)

0.176 = log10(3/2)

0.125 = log10(4/3)
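
The logarithmic law on this slide can be tabulated for all nine digits. The following is a minimal sketch; the names `benford_probability` and `benford_table` are illustrative and do not come from the talk.

```python
import math


def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d under the logarithmic law."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10((d + 1) / d)


# Table of probabilities for first digits 1 to 9.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"digit {d}: {p:.3f}")
```

The nine probabilities sum to log10(10) = 1, so every number is accounted for.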

Page 16: How to Fake Data if you must Department of Statistics Rachel Fewster

Reactions to Newcomb’s law

Nothing!

…for 57 years!

Page 17: How to Fake Data if you must Department of Statistics Rachel Fewster

Enter Frank Benford: 1938

Physicist with the General Electric Company

Assembled over 20,000 numbers and counted their first digits!

‘A study as wide as time and energy permitted.’

Page 18: How to Fake Data if you must Department of Statistics Rachel Fewster

Populations

Numbers from newspapers

Drainage rates of rivers

Numbers from Reader’s Digest articles

Street addresses of American Men of Science

Page 19: How to Fake Data if you must Department of Statistics Rachel Fewster

About 30% begin with a 1

About 5% begin with a 9

Page 20: How to Fake Data if you must Department of Statistics Rachel Fewster

Benford gave the ‘law’ its name…

…but no explanation.

Anomalous numbers!!

Page 21: How to Fake Data if you must Department of Statistics Rachel Fewster

“…The logarithmic law applies to outlaw numbers that are without known relationship,

rather than to those that follow an orderly course;

and so the logarithmic relation is essentially a Law of Anomalous Numbers.”

Page 22: How to Fake Data if you must Department of Statistics Rachel Fewster

Explanations for Benford’s Law

• Numbers from a wide range of data sources have about 30% of 1’s, down to only 5% of 9’s.

• Benford called these ‘outlaw’ or ‘anomalous’ numbers. They include street addresses of American Men of Science, populations, areas, numbers from magazines and newspapers.

• Benford’s ‘orderly’ numbers don’t follow the law – like atomic weights and physical constants

What is the explanation?

Page 23: How to Fake Data if you must Department of Statistics Rachel Fewster

Popular Explanations

• Scale Invariance

• Base Invariance

• Complicated Measure Theory

• Divine choice

• Mystery of Nature

These two (Scale Invariance and Base Invariance) say that IF there is a universal law, it must be Benford’s.

They don’t explain why there should be a law to start with!

Page 24: How to Fake Data if you must Department of Statistics Rachel Fewster

Complicated Measure Theory

In a nutshell… If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford’s Law

Page 25: How to Fake Data if you must Department of Statistics Rachel Fewster

That’s why THIS works well

Page 26: How to Fake Data if you must Department of Statistics Rachel Fewster

It doesn’t explain why street addresses of American Men of Science work well!

It doesn’t really explain WHAT will work well, nor why

Page 27: How to Fake Data if you must Department of Statistics Rachel Fewster

The Key Idea…

If a hat is covered evenly in red and white stripes…

Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon

Page 28: How to Fake Data if you must Department of Statistics Rachel Fewster

Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon

The Key Idea…

If a hat is covered evenly in red and white stripes…

… it will be half red and half white.

Page 29: How to Fake Data if you must Department of Statistics Rachel Fewster
Page 30: How to Fake Data if you must Department of Statistics Rachel Fewster
Page 31: How to Fake Data if you must Department of Statistics Rachel Fewster

The red stripes and the white stripes even out over the shape of the hat

If the red stripes cover half the base, they’ll cover about half the hat

Page 32: How to Fake Data if you must Department of Statistics Rachel Fewster

What if the red stripes cover 30% of the base?

0 0.3 1 1.3 2 2.3 3 3.3 4 4.3 5 5.3 6

Then they’ll cover about 30% of the hat.

Page 33: How to Fake Data if you must Department of Statistics Rachel Fewster

What if the red stripes cover precisely fraction 0.301 of the base?

0.301 = log10(2/1)

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

Then they’ll cover fraction ~0.301 of the hat.
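
The stripe claim can be checked numerically. The sketch below assumes a bell-shaped hat centred at 3 on a base running from 0 to 6; that shape is a hypothetical choice for illustration, not one taken from the slides. It slices the hat thinly and adds up the area that falls on red stripes.

```python
import math

STEPS_PER_UNIT = 1000   # resolution of the numerical integration
STRIPE_WIDTH = 301      # red stripe width in steps, i.e. 0.301 of each unit
BASE_UNITS = 6          # the base runs from 0 to 6, as on the slide


def hat_height(x: float) -> float:
    """Hypothetical bell-shaped hat centred at 3 (an assumption)."""
    return math.exp(-0.5 * (x - 3.0) ** 2)


red_area = 0.0
total_area = 0.0
for i in range(BASE_UNITS * STEPS_PER_UNIT):
    x = (i + 0.5) / STEPS_PER_UNIT           # midpoint of this thin slice
    slice_area = hat_height(x) / STEPS_PER_UNIT
    total_area += slice_area
    if i % STEPS_PER_UNIT < STRIPE_WIDTH:    # slice lies in [n, n + 0.301)
        red_area += slice_area

red_fraction = red_area / total_area
print(f"red stripes cover {red_fraction:.3f} of the hat")
```

Changing `hat_height` to another smooth, wide shape should leave the red fraction close to 0.301, which is the point of the hat picture.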

Page 34: How to Fake Data if you must Department of Statistics Rachel Fewster

Think of X as a random number…

We want the probability that X has first digit = 1

Let the ‘hat’ be a probability density curve for X

Then AREAS on the hat give PROBABILITIES for X

Page 35: How to Fake Data if you must Department of Statistics Rachel Fewster

Think of X as a random number…

We want the probability that X has first digit = 1

Let the ‘hat’ be a probability density curve for X

Then AREAS on the hat give PROBABILITIES for X

Pr(1 < X < 5) = 0.95

Area = 0.95 from 1 to 5

Total area = 1

Page 36: How to Fake Data if you must Department of Statistics Rachel Fewster

In the same way ….

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

If the red stripes somehow represent the X values with first digit = 1,

and the red stripes have area ~ 0.301,

then Pr(X has first digit 1) ~ 0.301.

Page 37: How to Fake Data if you must Department of Statistics Rachel Fewster

So X values with first digit=1 somehow lie on a set of evenly spaced stripes?

Write X in Scientific Notation:

Page 38: How to Fake Data if you must Department of Statistics Rachel Fewster

So X values with first digit=1 somehow lie on a set of evenly spaced stripes?

Write X in Scientific Notation:

$X = r \times 10^n$

r is between 1 and 10

n is an integer

Page 39: How to Fake Data if you must Department of Statistics Rachel Fewster

For example…

$X = r \times 10^n$

r is between 1 and 10

n is an integer

$124 = 1.24 \times 10^2$

$76 = 7.6 \times 10^1$
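
The split of X into r and n can be sketched in a few lines. The helper name `to_scientific` is hypothetical, and the rounding guard is an implementation detail added here for floating-point safety.

```python
import math


def to_scientific(x: float) -> tuple[float, int]:
    """Split a positive x into (r, n) with x = r * 10**n and 1 <= r < 10."""
    if x <= 0:
        raise ValueError("x must be positive")
    n = math.floor(math.log10(x))
    r = x / 10 ** n
    # Floating-point rounding can push r just outside [1, 10); nudge it back.
    if r >= 10:
        r, n = r / 10, n + 1
    elif r < 1:
        r, n = r * 10, n - 1
    return r, n


# The two examples from the slide.
for x in (124, 76):
    r, n = to_scientific(x)
    print(f"{x} = {r:g} x 10^{n}, first digit {int(r)}")
```

The first digit of X is simply the integer part of r, whatever the value of n.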

Page 40: How to Fake Data if you must Department of Statistics Rachel Fewster

For example…

$X = r \times 10^n$

For the first digit of X, only r matters!

$124 = 1.24 \times 10^2$

$76 = 7.6 \times 10^1$

X has first digit 1 exactly when $1 \le r < 2$

Page 41: How to Fake Data if you must Department of Statistics Rachel Fewster

For example…

$X = r \times 10^n$

For the first digit of X, only r matters!

$124 = 1.24 \times 10^2$ (1 < r < 2)

$76 = 7.6 \times 10^1$ (r > 2)

X has first digit 1 exactly when $1 \le r < 2$

Page 42: How to Fake Data if you must Department of Statistics Rachel Fewster

$X = r \times 10^n$

X has first digit 1 exactly when $1 \le r < 2$

Take logs to base 10…

$\log X = \log(r \times 10^n)$

Or in other words…

$\log X = \log r + n$

Page 43: How to Fake Data if you must Department of Statistics Rachel Fewster

$\log X = \log r + n$

r is between 1 and 10

n is an integer

Page 44: How to Fake Data if you must Department of Statistics Rachel Fewster

$\log X = \log r + n$

r is between 1 and 10

n is an integer

X has first digit 1 when $1 \le r < 2$

i.e. when $\log 1 \le \log r < \log 2$

Page 45: How to Fake Data if you must Department of Statistics Rachel Fewster

$\log X = \log r + n$

r is between 1 and 10

n is an integer

X has first digit 1 when $1 \le r < 2$

i.e. when $\log 1 \le \log r < \log 2$

i.e. when $0 \le \log r < 0.301$

Page 46: How to Fake Data if you must Department of Statistics Rachel Fewster

$\log X = \log r + n$

n is an integer

X has first digit 1 when $0 \le \log r < 0.301$

X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n

n = 0 : $0 \le \log X < 0.301$ (X from 1 to 2)

n = 1 : $1 \le \log X < 1.301$ (X from 10 to 20)

n = 2 : $2 \le \log X < 2.301$ (X from 100 to 200)
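
The stripe rule can be tried on a few numbers. This is a small sketch with hypothetical helper names; it uses log10(2) as the stripe width rather than the rounded 0.301, and it avoids values sitting exactly on a stripe edge, where floating-point rounding could go either way.

```python
import math

STRIPE_WIDTH = math.log10(2)   # about 0.301


def in_first_digit_one_stripe(x: float) -> bool:
    """True when log10(x) lies in [n, n + log10(2)) for some integer n."""
    return math.log10(x) % 1.0 < STRIPE_WIDTH


def first_digit(x: float) -> int:
    """First significant digit, read from the number written in e-notation."""
    return int(f"{x:e}"[0])


# Sample values chosen away from stripe edges such as exactly 2, 20 or 200.
samples = [1.5, 15, 124, 1999, 0.015, 76, 250, 9, 0.5, 3000]
for x in samples:
    print(x, first_digit(x), in_first_digit_one_stripe(x))
```

For every sample, the stripe test agrees with reading off the first digit directly, including for numbers below 1.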

Page 47: How to Fake Data if you must Department of Statistics Rachel Fewster

$\log X = \log r + n$

n is an integer

X has first digit 1 when $0 \le \log r < 0.301$

X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n

n = 0 : $0 \le \log X < 0.301$

n = 1 : $1 \le \log X < 1.301$

n = 2 : $2 \le \log X < 2.301$

STRIPES!!

Page 48: How to Fake Data if you must Department of Statistics Rachel Fewster

X values with first digit = 1 satisfy:

n = 0 : $0 \le \log X < 0.301$

n = 1 : $1 \le \log X < 1.301$

n = 2 : $2 \le \log X < 2.301$

and so on!

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

The ‘hat’ is the probability density curve for log(X)

Page 49: How to Fake Data if you must Department of Statistics Rachel Fewster

X values with first digit = 1 satisfy:

n = 0 : $0 \le \log X < 0.301$ (X from 1 to 2)

n = 1 : $1 \le \log X < 1.301$ (X from 10 to 20)

n = 2 : $2 \le \log X < 2.301$ (X from 100 to 200)

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

The ‘hat’ is the probability density curve for log(X)

Page 50: How to Fake Data if you must Department of Statistics Rachel Fewster

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

So X values with first digit=1 DO lie on evenly spaced stripes, on the log scale!

The PROBABILITY of getting first digit 1 is the AREA of the red stripes, which is approximately the fraction of the base they cover: 0.301.

Page 51: How to Fake Data if you must Department of Statistics Rachel Fewster

We’ve done it!

We’ve shown that we really should expect the first digit to be 1 about 30% of the time!

Page 52: How to Fake Data if you must Department of Statistics Rachel Fewster

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

Intuitively…

The log scale distorts: small numbers (e.g. 100) are stretched out; larger numbers (e.g. 900) are bunched up.

The first digit corresponds to regularly spaced stripes on the log scale.

So the smallest numbers (first digit = 1) are stretched out, and get the highest probability!

Page 53: How to Fake Data if you must Department of Statistics Rachel Fewster

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

When is this going to work?

We need a lot of stripes to balance out big ones and little ones! We get one stripe every integer… So we need a lot of integers!

The distribution of X needs to be WIDE on the log scale!

Page 54: How to Fake Data if you must Department of Statistics Rachel Fewster

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

When is this going to work?

X ranges from 0 to 6 on the log scale… So it ranges from 1 to $10^6$ on the usual scale!

1 .. 2 .. Miss a few ... 999,999 .. 1,000,000

Page 55: How to Fake Data if you must Department of Statistics Rachel Fewster

0 0.301 1 1.301 2 2.301 3 3.301 4 4.301 5 5.301 6

These are Benford’s ‘Outlaw Numbers’!

All we need is a distribution that is:

• WIDE (4 – 6 orders of magnitude or more)

• Reasonably SMOOTH

…Then the red stripes will even out to cover about 30% of the total area.
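
The effect of width can be seen in a small simulation. The sketch below assumes log10(X) follows a normal distribution centred at 3; the distribution, the spreads and the seed are all illustrative assumptions, not data from the talk.

```python
import random


def first_digit_from_log(log_x: float) -> int:
    """First digit of X, read off the fractional part of log10(X)."""
    mantissa = 10 ** (log_x % 1.0)   # r in X = r * 10**n, so 1 <= r < 10
    return min(int(mantissa), 9)     # guard against rounding up to 10.0


def share_of_ones(spread: float, samples: int = 100_000, seed: int = 1) -> float:
    """Share of first digits equal to 1 when log10(X) ~ Normal(3, spread)."""
    rng = random.Random(seed)
    ones = sum(
        first_digit_from_log(rng.gauss(3.0, spread)) == 1
        for _ in range(samples)
    )
    return ones / samples


wide_share = share_of_ones(spread=1.5)    # roughly six orders of magnitude
narrow_share = share_of_ones(spread=0.1)  # well under one order of magnitude

print(f"wide distribution:   {wide_share:.3f} of first digits are 1")
print(f"narrow distribution: {narrow_share:.3f} of first digits are 1")
```

The wide case should land near 0.301, while the narrow case sits on less than one stripe and drifts far from it.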

Page 56: How to Fake Data if you must Department of Statistics Rachel Fewster

In Real Life…

World Populations:

From 50 for the Pitcairn Islands…

To $1.3 \times 10^9$ for China…

Wide (9 integers => 9 stripes)

First digits very good fit to Benford!

Page 57: How to Fake Data if you must Department of Statistics Rachel Fewster

In Real Life…

World Populations:

From 50 for the Pitcairn Islands…

To $1.3 \times 10^9$ for China…

Page 58: How to Fake Data if you must Department of Statistics Rachel Fewster

Electorate populations? From 583,000 to 773,000 in California:

Of course not! All the first digits are 5, 6, or 7…

The hat has less than one stripe! Benford doesn’t work here.

Page 59: How to Fake Data if you must Department of Statistics Rachel Fewster

But naturally occurring populations are a different story!

Cities in California:

- from 94 in the city of Vernon…

- to 3.9 million in Los Angeles…

Yes! It’s Benford!

Wide enough (5 integers => 5 stripes)

Page 60: How to Fake Data if you must Department of Statistics Rachel Fewster

Powerball Jackpots?

- from $10 million to $365 million…

Not bad!

Orders of magnitude only 1.5 …

… but sometimes you just hit lucky!

Data with kind permission from www.lottostrategies.com
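
The widths of the real-life examples can be compared on the log scale. This sketch uses only the smallest and largest values quoted on the slides; the function name `orders_of_magnitude` is illustrative.

```python
import math


def orders_of_magnitude(smallest: float, largest: float) -> float:
    """Width of the range [smallest, largest] on the log10 scale."""
    return math.log10(largest / smallest)


# Smallest and largest values as quoted in the talk.
ranges = {
    "World populations": (50, 1.3e9),
    "California electorates": (583_000, 773_000),
    "California cities": (94, 3.9e6),
    "Powerball jackpots": (10e6, 365e6),
}

for name, (low, high) in ranges.items():
    width = orders_of_magnitude(low, high)
    print(f"{name}: {width:.2f} orders of magnitude")
```

The electorates span far less than one order of magnitude, which is why the hat there has less than one stripe.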

Page 61: How to Fake Data if you must Department of Statistics Rachel Fewster

Your tax return….?

If you plan to fake data, you should first check whether it ought to be Benford!

BUT the IRD has a few other tricks up its sleeve too….
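
One simple way to compare a dataset's first digits with Benford's law is sketched below. The first 1000 powers of 2 are used as stand-in data, purely as an assumption for illustration; they are not from the talk, and the gap measure is not a formal statistical test.

```python
import math
from collections import Counter


def first_digit_shares(values: list[int]) -> dict[int, float]:
    """Share of positive integers in `values` starting with each digit 1 to 9."""
    counts = Counter(int(str(v)[0]) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


def largest_gap_from_benford(shares: dict[int, float]) -> float:
    """Largest absolute difference between observed shares and Benford's law."""
    return max(abs(shares[d] - math.log10((d + 1) / d)) for d in range(1, 10))


# Stand-in data (an assumption): powers of 2 span about 300 orders of magnitude.
sample = [2 ** k for k in range(1000)]
shares = first_digit_shares(sample)
gap = largest_gap_from_benford(shares)

for d in range(1, 10):
    print(f"digit {d}: observed {shares[d]:.3f}, Benford {math.log10((d + 1) / d):.3f}")
print(f"largest gap: {gap:.3f}")
```

A small gap suggests the data are consistent with Benford's law; a large one means either the data are not expected to follow it or something deserves a closer look.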

Page 62: How to Fake Data if you must Department of Statistics Rachel Fewster

To find out more:

• A Simple Explanation of Benford’s Law by R. M. Fewster, The American Statistician, to appear. PDF from www.stat.auckland.ac.nz/~fewster/benford.html

• Judy Paterson’s CMCT course, Term 1 2009: Centre for Mathematical Content in Teaching

Thanks for listening!