Dark Data by Prof. Mark Whitehorn


Upload: colin-mackay

Post on 14-Jul-2015


TRANSCRIPT

Page 1: Dark Data by Prof. Mark Whitehorn

Dark Data


Page 2: Dark Data by Prof. Mark Whitehorn

It’s all about me…

Prof. Mark Whitehorn Emeritus Professor of Analytics

School of Computing

University of Dundee

Consultant

Writer (author)

[email protected]


Page 3: Dark Data by Prof. Mark Whitehorn

It’s all about me…

School of Computing

Teach Masters in:

Data Science

Part time - aimed at existing data professionals

Data Engineering


Page 4: Dark Data by Prof. Mark Whitehorn

Dark Data

First mention: “Shedding Light on the Dark Data in the Long Tail of Science”, P. Bryan Heidorn, Library Trends, Vol. 57, No. 2, Fall 2008 (“Institutional Repositories: Current State and Future,” edited by Sarah L. Shreeves and Melissa H. Cragin), pp. 280–299

Heidorn writes that:

‘“Dark data” is not carefully indexed and stored so it becomes nearly invisible to

scientists and other potential users and therefore is more likely to remain

underutilized and eventually lost.’

However this is not synonymous with the modern usage. Heidorn was talking

about scientific data that is not published.

Page 5: Dark Data by Prof. Mark Whitehorn

Dark Data

First usage (modern sense) of which I am aware was at the

BIDS (Business Intelligence & Data Science) conference,

Lands of Loyal, Alyth, 2013.

Page 6: Dark Data by Prof. Mark Whitehorn

Dark Data

Companies currently collect vast swathes of data; some is

analysed, much isn’t. The stuff that isn’t analysed is Dark.

Clearly the origin of the term is ‘Dark Matter’.

Donald Rumsfeld told us about:

Known knowns

Known unknowns

Unknown unknowns

Page 7: Dark Data by Prof. Mark Whitehorn

Dark Data

I think we have three identifiable flavours:

Dark Data 1.0 – The data that is in the transactional systems

but never makes it to the analytical - visible but unseen

DD 2.0 – Data that is published but which the publisher

does not expect to be analysed

DD 3.0 – Data that vanishes and the disappearance is highly

significant

Page 8: Dark Data by Prof. Mark Whitehorn

Dark Data of the first kind

Is all around us and I believe it is very important. We are

going to be moving a great deal of this data into analytical

systems.

Page 9: Dark Data by Prof. Mark Whitehorn

The Andrea Doria

Dark Data of the first kind

Page 10: Dark Data by Prof. Mark Whitehorn

The Andrea Doria

Collision between the Andrea Doria and the Stockholm.

Worst maritime disaster to occur in US waters

25 July 1956. Happily not too great a loss of life.

Collision near Nantucket, Mass.

Dark Data of the first kind

Page 11: Dark Data by Prof. Mark Whitehorn

Dark Data of the first kind

The Andrea Doria

Officers on board the Andrea Doria did not use the

plotting equipment that was available to them to calculate

the speed and position of the Stockholm.

The navigation officer on board the Stockholm thought his

radar was set to 15 miles when in fact it was set to 5 miles.

Page 12: Dark Data by Prof. Mark Whitehorn

Dark Data of the first kind

The Andrea Doria

Both had radar running and yet they collided.

Page 13: Dark Data by Prof. Mark Whitehorn

Dark Data of the first kind

The Andrea Doria Effect

Bats flying into nets in caves

Griffin, Donald R. *

*Listening in the dark: the acoustic orientation of bats and men. New Haven, Yale University Press,

1958. 415 p.

Page 14: Dark Data by Prof. Mark Whitehorn

Dark Data of the first kind

Bats

“For example, Griffin (1958) describes bats, returning to the roost after an evening's hunt, crashing into a newly erected cave barricade even though it reflected strong echoes. Griffin coins this mishap the “Andrea Doria Effect,” because the bats seemed to ignore important information from echo returns to guide their behavior and instead favored spatial memory.” http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927269/

Page 15: Dark Data by Prof. Mark Whitehorn

Dark Data of the first kind

You are unlikely to collect data or information from or

about bats and/or ocean-going liners.

But your organisation is likely to collect data that is not

being analysed (in the sense that the information it

contains is not being extracted).

• Patient monitors in intensive care

• Robots in car production - Volvo

Page 16: Dark Data by Prof. Mark Whitehorn

Dark Data of the second kind

DD 2.0 – Data that is published but which the publisher

does not expect to be analysed

Data leakage

Amazon publishes the position of a book relative to other

books. “One” is good (think J. K. Rowling).

I collected the position of my books for several years.

Page 17: Dark Data by Prof. Mark Whitehorn

Dark Data of the second kind

[Chart: Amazon positional data (IRDB) for Aug 98 – Mar 00, plotted by month]

Page 19: Dark Data by Prof. Mark Whitehorn

I also had the sales figures (number of books sold) from my

publisher. I could correlate these with the Amazon position

figures.

I also tracked the Amazon position of ‘competing’ books.

So I could estimate the sales figures of my competitors.

Dark Data of the second kind

Page 21: Dark Data by Prof. Mark Whitehorn

Dark Data of the second kind

The estimation of the maximum of a discrete

uniform distribution by sampling (without

replacement).

A superb example of this is:

The German Tank problem

Page 22: Dark Data by Prof. Mark Whitehorn

Dark Data of the second kind

The German Tank problem

The Allies were very interested in how many tanks the

Germans were producing during the Second World War.

Intelligence suggested the production was about 1,000 -

1,500 tanks per month.

Page 23: Dark Data by Prof. Mark Whitehorn

Dark Data of the second kind

The German Tank problem

The Allies started taking the serial numbers from the

captured/disabled German tanks.

It turned out that the gearbox numbers were the most

useful. (However the chassis numbers and wheel moulds

were also used.)

These numbers were also used to estimate production.

Page 24: Dark Data by Prof. Mark Whitehorn

The German Tank problem

As an example, in June 1941, intelligence suggested tank

production was 1,550.

The estimate from serial numbers was 244.

The actual value (verified after the war) was 271.

In August 1942 the corresponding numbers were 1,550 (intelligence), 327 (serial numbers) and 342 (actual).

It worked.

Dark Data of the second kind

Page 25: Dark Data by Prof. Mark Whitehorn

An Empirical Approach to Economic

Intelligence in World War II Author(s):

Richard Ruggles and Henry Brodie

Source: Journal of the American

Statistical Association, Vol. 42, No. 237

(Mar., 1947), pp. 72- 91 Published by:

American Statistical Association

Stable URL:

http://www.jstor.org/stable/2280189

Page 26: Dark Data by Prof. Mark Whitehorn

So how is it done?

Suppose (simplistically) that the actual numbers run from 1

to n.

If we see a number like 24 then at least 24 tanks have been

produced.

But we can be smarter. If we see numbers like:

6, 7, 9, 11, 12, 23, 16, 17, 24

then an intelligent guess would be around 30-40.

Dark Data of the second kind

Page 27: Dark Data by Prof. Mark Whitehorn

So how is it done?

If we capture 5 tanks (we’ll call this number C) with serial numbers 432, 435,

1,542, 2,187 and 3,245, then the guess has to be higher than 3,245,

perhaps 4,000.

But we can apply logic and a formula.

The biggest number (B) is 3,245, and the smallest (s) is 432.

We could reasonably expect as many ‘hidden’ numbers above

3,245 as there are below 432.

Dark Data of the second kind

Page 28: Dark Data by Prof. Mark Whitehorn

So how is it done?

C=5, B=3,245, s=432, n=number of tanks produced

1 ..... 432 (s) ........................................ 3,245 (B) ..... n

So n is about 3,245 + 432 = 3,677

Dark Data of the second kind

Page 29: Dark Data by Prof. Mark Whitehorn

So how is it done?

C=5, B=3,245, s=432, n=number of tanks produced

Well, we hypothesised that (n-B) is roughly equal to (s-1),

so if n-B = s-1

then:

n=B+(s-1)

n= 3,245 + (432-1) = 3,676

Dark Data of the second kind
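The symmetry argument above can be written as a couple of lines of Python. This is only a sketch of the arithmetic on the slide; the function name is made up for illustration, and it assumes (as the slide does) that serial numbers start at 1 and run without gaps.

```python
def estimate_by_symmetry(serials):
    """Estimate n as B + (s - 1).

    Assumes as many unseen serial numbers lie above the biggest
    observed number (B) as lie below the smallest observed number (s).
    """
    biggest = max(serials)    # B on the slide
    smallest = min(serials)   # s on the slide
    return biggest + (smallest - 1)


# The five serial numbers used in the worked example.
captured = [432, 435, 1542, 2187, 3245]
print(estimate_by_symmetry(captured))  # 3676
```

The same function applied to the earlier small example (6, 7, 9, 11, 12, 23, 16, 17, 24) gives 29.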

Page 30: Dark Data by Prof. Mark Whitehorn

You can also use:

n = (1+1/C)B

n = (1 +1/5)3,245 = 3,894*

* Roger Johnson of Carleton College in Northfield, Minnesota

Teaching Statistics Vol. 16, number 2 Summer 1994

http://www.mcs.sdsmt.edu/rwjohnso/html/tank.pdf

Dark Data of the second kind
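The second formula, n = (1 + 1/C)B, is just as short in code. Again this is a sketch with a made-up function name; it uses only the count of captured tanks (C) and the biggest serial number (B).

```python
def estimate_by_max(serials):
    """Estimate n as (1 + 1/C) * B, where C is the number of
    captured serials and B is the biggest one observed."""
    count = len(serials)      # C on the slide
    biggest = max(serials)    # B on the slide
    return (1 + 1 / count) * biggest


# The five serial numbers used in the worked example.
captured = [432, 435, 1542, 2187, 3245]
print(round(estimate_by_max(captured)))  # 3894
```

Note that this estimate ignores the smallest number entirely, which is why it differs from the B + (s - 1) figure.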

Page 31: Dark Data by Prof. Mark Whitehorn

Dark Data of the second kind

Page 32: Dark Data by Prof. Mark Whitehorn

There are various ways you can do the sums.

I certainly wouldn’t want to argue for any particular one.

I certainly would argue that this may well be worth doing.

Dark Data of the second kind
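One way to get a feel for the different sums is to simulate them. The sketch below is an illustration, not part of the original talk: it assumes a true production figure (4,000 is an arbitrary choice), repeatedly "captures" 5 serial numbers at random without replacement, and averages the two estimates discussed earlier. All names and parameter values are hypothetical.

```python
import random


def by_symmetry(serials):
    # n is roughly B + (s - 1)
    return max(serials) + (min(serials) - 1)


def by_max(serials):
    # n is roughly (1 + 1/C) * B
    return (1 + 1 / len(serials)) * max(serials)


def simulate(true_n, captured, trials, seed=0):
    """Average each estimator over many random captures.

    true_n, captured, trials and seed are illustrative parameters.
    """
    rng = random.Random(seed)
    totals = {"symmetry": 0.0, "max": 0.0}
    for _ in range(trials):
        # Sampling without replacement from serial numbers 1..true_n.
        sample = rng.sample(range(1, true_n + 1), captured)
        totals["symmetry"] += by_symmetry(sample)
        totals["max"] += by_max(sample)
    return {name: total / trials for name, total in totals.items()}


averages = simulate(true_n=4000, captured=5, trials=20000)
print(averages)
```

On average both approaches should land close to the assumed true value; individual estimates from only five tanks can still be a long way out, which is a good reason not to argue too hard for one formula over another.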

Page 33: Dark Data by Prof. Mark Whitehorn

Other uses

The serial number data was also used to deduce the:

• number of tank factories and their relative production

• logistics of tank deployment

• changes in production due to, for example, bombing

Dark Data of the second kind

Page 34: Dark Data by Prof. Mark Whitehorn

Of course, if the Germans had been aware of this they

could have:

• Coded the serial numbers, perhaps using a random

surrogate key (boring)

• Created misinformation (a much more interesting option)

Dark Data of the second kind

Page 35: Dark Data by Prof. Mark Whitehorn

Local paper – certain pages were printed out of register

enabling us to estimate the number of printing presses in

use.

This is an example of “It’s the exception that proves the

rule.”

Dark Data of the second kind

Page 36: Dark Data by Prof. Mark Whitehorn

Again, don’t think of this technique as being limited to

tanks.

What data is your company leaking?

What can you do about it?

Which of your competitors is leaking?

(satellite photographs of the opposition’s parking lot)

(classic car advertisements)

Dark Data of the second kind

Page 37: Dark Data by Prof. Mark Whitehorn
Page 38: Dark Data by Prof. Mark Whitehorn

Let’s see.

7517 – 6108 = 1409 in 130 minutes = about 10 per minute.

Page 39: Dark Data by Prof. Mark Whitehorn

70836 - 67517 = 3,319 in 1,105 minutes = about 3 per minute; implying the rate drops overnight.
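The arithmetic on these two slides is a simple rate calculation. The sketch below reproduces it; it assumes the figures are sequentially issued numbers observed at two points in time, and the function and variable names are invented for illustration.

```python
def rate_per_minute(first_number, last_number, minutes):
    """Numbers issued per minute between two observations,
    assuming the numbers are handed out sequentially."""
    return (last_number - first_number) / minutes


# Figures as they appear on the slides.
first_interval = rate_per_minute(6108, 7517, 130)
second_interval = rate_per_minute(67517, 70836, 1105)

print(f"{first_interval:.1f} per minute")   # 10.8 per minute
print(f"{second_interval:.1f} per minute")  # 3.0 per minute
```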

Page 40: Dark Data by Prof. Mark Whitehorn

Quick test.

70845 followed, one minute later, by 70836. 9 in one minute and it looks like they are sequential (but much more testing required).

Page 41: Dark Data by Prof. Mark Whitehorn

The German Bomb problem

DD 3.0 – Data that vanishes and the disappearance is highly

significant

My mother worked on explosives and casings from German

bombs and shells during the war.

At the start of the war the Germans used very good quality

alloys. The Allies tracked the elements in the alloy. As the

war progressed, certain elements disappeared completely

from the mix.

Dark Data of the third kind

Page 42: Dark Data by Prof. Mark Whitehorn

The German Bomb problem

If the Allies had started analysing in the middle of the war,

they would have missed the missing data.

(Would that have made it an unknown, unknown,

unknown?)

As it was, they could see what elements were becoming

scarce in Germany and plan accordingly.

Dark Data of the third kind

Page 43: Dark Data by Prof. Mark Whitehorn

Take home message

Collect data and use it imaginatively.

Dark Data

Page 44: Dark Data by Prof. Mark Whitehorn

Simpson’s Paradox

Why is Homer Simpson yellow?

Page 45: Dark Data by Prof. Mark Whitehorn

Simpson’s paradox can be seen when percentages and/or actions are

combined without thought.

Two schools and two teachers –

School 1 – Sally

School 2 – Brian

But they occasionally work in each other’s schools.

They are scored on the percentage of their students who pass the exam.

Simpson’s Paradox

Page 46: Dark Data by Prof. Mark Whitehorn

Brian beats Sally in both schools.

Simpson’s Paradox

            Brian   Sally
School 1     90%     88%
School 2     75%     60%

Page 47: Dark Data by Prof. Mark Whitehorn

But when we do the sums for all children, Sally beats Brian.

Simpson’s Paradox

            Brian   Sally
School 1     90%     88%
School 2     75%     60%
Overall      75%     87%

Page 48: Dark Data by Prof. Mark Whitehorn

The paradox arises because the teachers teach very

different ability groups in the two schools and the schools

themselves differ in intake.

Simpson’s Paradox

Number of children:

                     Brian                        Sally
             Pass  Total  Percentage      Pass  Total  Percentage
School 1        9     10     90%           700    800     88%
School 2      600    800     75%             6     10     60%
Combined      609    810     75%           706    810     87%
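The reversal can be checked directly from the pass and total counts. The sketch below is an illustration only (the dictionary layout and function name are made up); it recomputes the combined figures from the per-school counts rather than averaging the percentages.

```python
# (passes, total children) per teacher per school, from the table.
results = {
    "Brian": {"School 1": (9, 10), "School 2": (600, 800)},
    "Sally": {"School 1": (700, 800), "School 2": (6, 10)},
}


def combined_rate(by_school):
    """Pass rate over all children a teacher taught, in both schools."""
    passed = sum(p for p, _ in by_school.values())
    total = sum(t for _, t in by_school.values())
    return passed / total


for teacher, by_school in results.items():
    per_school = {school: f"{p / t:.1%}" for school, (p, t) in by_school.items()}
    print(teacher, per_school, f"combined {combined_rate(by_school):.1%}")
```

Brian is ahead in each school taken separately, yet Sally is ahead once the counts are pooled, because most of her children were in the school with the higher pass rates.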

Page 49: Dark Data by Prof. Mark Whitehorn

Brian may have achieved 90% in school 1, but this is based on a very

small sample – he taught far more students in school 2.

And Sally taught very few children in school 2 and far more in school 1.

Simpson’s Paradox


Page 50: Dark Data by Prof. Mark Whitehorn

Both small and large samples are producing the same kind of value (a

percentage). Combining these means that samples with the high

numbers of students ‘overwhelm’ the samples with low numbers.

Simpson’s Paradox


Page 51: Dark Data by Prof. Mark Whitehorn

We have one set of data

x: 1, 9, 10, 12, 14, 15, 15, 17, 17, 20, 20, 22, 23, 23, 34, 34, 45, 56, 60

y: 140, 100, 190, 150, 140, 170, 150, 150, 200, 250, 270, 270, 450, 270, 430, 370, 490, 480, 330

[Scatter plot with fitted line; x axis 0 to 70, y axis 0 to 600]

y = 6.4397x + 111.66

R² = 0.6344

Page 52: Dark Data by Prof. Mark Whitehorn

And another set for the same two axes

x: 98, 125, 134, 148, 152, 200, 200, 220, 230, 230, 340, 340, 450, 560

y: 95, 180, 70, 85, 75, 125, 135, 135, 120, 135, 180, 185, 180, 140

[Scatter plot with fitted line; x axis 0 to 600, y axis 0 to 200]

y = 0.1673x + 90.477

R² = 0.3125

Page 53: Dark Data by Prof. Mark Whitehorn

But when we combine the sets

[Scatter plot of all 33 points from the two sets with fitted line; x axis 0 to 600, y axis 0 to 600]

y = -0.2662x + 238.52

R² = 0.0989
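The three trend lines can be reproduced with an ordinary least-squares fit. The sketch below is an illustration in plain Python (the function and variable names are made up, and it is an assumption that the slide's trend lines were produced by ordinary least squares); the data points are the ones listed on the slides.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


set_one = [(1, 140), (9, 100), (10, 190), (12, 150), (14, 140), (15, 170),
           (15, 150), (17, 150), (17, 200), (20, 250), (20, 270), (22, 270),
           (23, 450), (23, 270), (34, 430), (34, 370), (45, 490), (56, 480),
           (60, 330)]
set_two = [(98, 95), (125, 180), (134, 70), (148, 85), (152, 75), (200, 125),
           (200, 135), (220, 135), (230, 120), (230, 135), (340, 180),
           (340, 185), (450, 180), (560, 140)]

for name, points in [("set one", set_one), ("set two", set_two),
                     ("combined", set_one + set_two)]:
    slope, intercept = fit_line(points)
    print(f"{name}: y = {slope:.4f}x + {intercept:.2f}")
```

Each set on its own slopes upwards, but the pooled data slopes downwards: the same effect as in the teachers example, this time with trend lines rather than percentages.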

Page 54: Dark Data by Prof. Mark Whitehorn

How likely is Simpson’s paradox?

Marios G. Pavlides and Michael D. Perlman (August 2009).

The American Statistician 63 (3): 226–233.

www.stat.washington.edu/research/reports/2009/tr558.pdf

Simpson’s Paradox

Page 55: Dark Data by Prof. Mark Whitehorn

Simpson’s Paradox

Simpson’s Paradox is usually attributed to Simpson in 1951.

However Pavlides and Perlman quote various other sources

which suggest that the underlying idea goes back at least

as far as 1899.

Page 56: Dark Data by Prof. Mark Whitehorn

Simpson’s Paradox

This is a wonderful example of Stigler's law of eponymy.

Stephen Stigler set it out in his 1980 publication "Stigler’s law of eponymy".

Essentially this postulates that scientific discoveries are not usually named

after the original discoverer.

Stigler’s law was originally postulated by the sociologist Robert K. Merton,

thus ensuring that Stigler’s law obeys Stigler’s law.

Gieryn, T. F., ed. (1980). Science and social structure: a festschrift for Robert K. Merton. New York: NY Academy of Sciences. pp. 147–57. ISBN 0-89766-043-9. Republished in Stigler's collection "Statistics on the Table".