Dark Data by Prof. Mark Whitehorn

Dark Data


It’s all about me…

Prof. Mark Whitehorn Emeritus Professor of Analytics

School of Computing

University of Dundee

Consultant

Writer (author)

[email protected]


It’s all about me…

School of Computing

Teach Masters in:

Data Science

Part time - aimed at existing data professionals

Data Engineering


Dark Data

First mention: “Shedding Light on the Dark Data in the Long Tail of Science”, P. Bryan Heidorn, Library Trends, Vol. 57, No. 2, Fall 2008 (“Institutional Repositories: Current State and Future,” edited by Sarah L. Shreeves and Melissa H. Cragin), pp. 280–299

Heidorn writes that:

‘“Dark data” is not carefully indexed and stored so it becomes nearly invisible to

scientists and other potential users and therefore is more likely to remain

underutilized and eventually lost.’

However, this is not the same as the modern usage. Heidorn was talking

about scientific data that is not published.

Dark Data

The first usage in the modern sense of which I am aware was at the

BIDS (Business Intelligence & Data Science) conference,

Alyth, Lands of Loyal, 2013.

Dark Data

Companies currently collect vast swathes of data; some is

analysed, much isn’t. The stuff that isn’t analysed is Dark.

Clearly the origin of the term is ‘Dark Matter’.

Donald Rumsfeld told us about:

Known knowns

Known unknowns

Unknown unknowns

Dark Data

I think we have three identifiable flavours:

Dark Data 1.0 – The data that is in the transactional systems

but never makes it to the analytical - visible but unseen

DD 2.0 – Data that is published but which the publisher

does not expect to be analysed

DD 3.0 – Data that vanishes and the disappearance is highly

significant

Dark Data of the first kind

Is all around us and I believe it is very important. We are

going to be moving a great deal of this data into analytical

systems.

The Andrea Doria


http://en.wikipedia.org/wiki/File:Andreadoria02.jpg

The Andrea Doria

Collision between the Andrea Doria and the Stockholm.

Worst maritime disaster to occur in US waters

25 July 1956. Happily not too great a loss of life.

Collision near Nantucket, Mass.



The Andrea Doria

Officers on board the Andrea Doria did not use the

plotting equipment that was available to them to calculate

the speed and position of the Stockholm.

The navigation officer on board the Stockholm thought his

radar was set to 15 miles when in fact it was set to 5 miles.


The Andrea Doria

Both had radar running and yet they collided.


The Andrea Doria Effect

Bats flying into nets in caves

Griffin, Donald R. Listening in the Dark: The Acoustic Orientation of Bats and Men. New Haven, Yale University Press, 1958. 415 p.


Bats

“For example, Griffin (1958) describes bats, returning to the roost after an evening's hunt, crashing into a newly erected cave barricade even though it reflected strong echoes. Griffin coins this mishap the “Andrea Doria Effect,” because the bats seemed to ignore important information from echo returns to guide their behavior and instead favored spatial memory.” http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2927269/



You are unlikely to collect data or information from or

about bats and/or ocean-going liners.

But your organisation is likely to collect data that is not

being analysed (in the sense that the information it

contains is not being extracted).

• Patient monitors in intensive care

• Robots in car production - Volvo

Dark Data of the second kind

DD 2.0 – Data that is published but which the publisher

does not expect to be analysed

Data leakage

Amazon publishes the position of a book relative to other

books. “One” is good (think J. K. Rowling).

I collected the position of my books for several years.


[Chart: Amazon positional data (IRDB) for Aug 98 – Mar 00, plotted month by month]

I also had the sales figures (number of books sold) from my

publisher. I could correlate these with the Amazon position

figures.

I also tracked the Amazon position of ‘competing’ books.


So I could estimate the sales figures of my competitors.



The estimation of the maximum of a discrete

uniform distribution by sampling (without

replacement).



A superb example of this is:

The German Tank problem



The Allies were very interested in how many tanks the

Germans were producing during the Second World War.

Intelligence suggested the production was about 1,000 -

1,500 tanks per month.



The Allies started taking the serial numbers from the

captured/disabled German tanks.

It turned out that the gearbox numbers were the most

useful. (However the chassis numbers and wheel moulds

were also used.)

These numbers were also used to estimate production.


As an example, in June 1941, intelligence suggested tank

production was 1,550.

The estimate from serial numbers was 244.

The actual value (verified after the war) was 271.

In August 1942 the numbers were 1,550 (intelligence), 327 (serial numbers) and 342 (actual).

It worked.


Richard Ruggles and Henry Brodie, “An Empirical Approach to Economic Intelligence in World War II”, Journal of the American Statistical Association, Vol. 42, No. 237 (Mar., 1947), pp. 72–91. Published by the American Statistical Association.

Stable URL: http://www.jstor.org/stable/2280189

So how is it done?

Suppose (simplistically) that the actual numbers run from 1

to n.

If we see a number like 24 then at least 24 tanks have been

produced.

But we can be smarter. If we see numbers like:

6, 7, 9, 11, 12, 23, 16, 17, 24

then an intelligent guess would be around 30-40.


So how is it done?

If we capture 5 tanks (we’ll call this value C) numbered 432, 435, 1,542,

2,187 and 3,245, then the guess has to be at least 3,245;

perhaps 4,000.

But we can apply logic and a formula.

The biggest number (B) is 3,245, and the smallest (s) is 432.

We could reasonably expect as many ‘hidden’ numbers above

3,245 as there are below 432.


So how is it done?

C=5, B=3,245, s=432, n=number of tanks produced

[Number line: 1 ... s (432) ... 1K ... 2K ... 3K ... B (3,245) ... n]

So n is about 3,245 + 432 = 3,677


So how is it done?

C=5, B=3,245, s=432, n=number of tanks produced

Well, we hypothesised that (n-B) is roughly equal to (s-1),

so if n-B = s-1

then:

n=B+(s-1)

n= 3,245 + (432-1) = 3,676


You can also use:

n = (1+1/C)B

n = (1 +1/5)3,245 = 3,894*

* Roger Johnson of Carleton College in Northfield, Minnesota

Teaching Statistics Vol. 16, number 2 Summer 1994

http://www.mcs.sdsmt.edu/rwjohnso/html/tank.pdf
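The two rules of thumb above are easy to try out in code. What follows is only a minimal sketch in Python: the function names `gap_estimate` and `johnson_estimate` are made up for this illustration, and the sample is the five captured serial numbers from the slides. It assumes, as the slides do, that the serial numbers run from 1 to n without gaps.

```python
def gap_estimate(serials):
    """n = B + (s - 1): assume as many unseen numbers above the
    biggest serial (B) as there are below the smallest (s)."""
    return max(serials) + (min(serials) - 1)


def johnson_estimate(serials):
    """n = (1 + 1/C) * B, where C is the number of captured tanks
    and B is the biggest serial number seen."""
    c = len(serials)
    return (1 + 1 / c) * max(serials)


# The five captured tanks from the worked example.
captured = [432, 435, 1542, 2187, 3245]

print(gap_estimate(captured))             # 3676
print(round(johnson_estimate(captured)))  # 3894
```

Both functions need only the extremes of the sample and its size, which is why a handful of captured gearboxes was enough to be useful.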




There are various ways you can do the sums.

I certainly wouldn’t want to argue for any particular one.

I certainly would argue that this may well be worth doing.


Other uses

The serial number data was also used to deduce the:

• number of tank factories and their relative production

• logistics of tank deployment

• changes in production due to, for example, bombing


Of course, if the Germans had been aware of this they

could have:

• Coded the serial numbers, perhaps using a random

surrogate key (boring)

• Created misinformation (a much more interesting option)
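For the ‘boring’ option, here is a minimal sketch of what a random surrogate key might look like in Python. The function name `assign_surrogate_keys` and the key length are assumptions chosen for illustration; the idea is simply that the published identifier carries no information about the order or volume of production, while the real sequence number stays in an internal lookup.

```python
import secrets


def assign_surrogate_keys(sequence_numbers, n_bytes=8):
    """Map each internal sequence number to a random hex key.

    The mapping is kept privately; only the random key is stamped on
    the product, so an outsider cannot read production volume from it.
    """
    mapping = {}
    used = set()
    for seq in sequence_numbers:
        key = secrets.token_hex(n_bytes)
        while key in used:  # vanishingly unlikely, but keep keys unique
            key = secrets.token_hex(n_bytes)
        used.add(key)
        mapping[seq] = key
    return mapping


# Hypothetical usage: five items with internal numbers 1 to 5.
keys = assign_surrogate_keys(range(1, 6))
for seq, key in keys.items():
    print(seq, key)
```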


Local paper – certain pages were printed out of register

enabling us to estimate the number of printing presses in

use.

This is an example of “It’s the exception that proves the

rule.”


Again, don’t think of this technique as being limited to

tanks.

What data is your company leaking?

What can you do about it?

Which of your competitors is leaking?

(satellite photographs of the opposition’s parking lot)

(classic car advertisements)


Let’s see.

7517 – 6108 = 1,409 in 130 minutes = about 10 per minute.

70836 – 67517 = 3,319 in 1,105 minutes = about 3 per minute; implying the rate drops overnight.

Quick test. 70836 followed, one minute later, by 70845. 9 in one minute and it looks like they are sequential (but much more testing required).
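The arithmetic is just the difference between two observed numbers divided by the time between the observations. A minimal sketch, assuming the numbers are issued sequentially (which, as noted, still needs testing); the function name `rate_per_minute` is made up for this illustration and the figures are the ones on the slide.

```python
def rate_per_minute(first_number, later_number, minutes_elapsed):
    """Estimate how many numbers were issued per minute, assuming
    every number between the two observations was actually used."""
    return (later_number - first_number) / minutes_elapsed


daytime = rate_per_minute(6108, 7517, 130)
overnight = rate_per_minute(67517, 70836, 1105)

print(f"daytime:   {daytime:.1f} per minute")
print(f"overnight: {overnight:.1f} per minute")
```

If the publisher skips numbers or runs several sequences in parallel, these rates are over- or under-estimates, so a few closely spaced observations are worth collecting before trusting them.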

The German Bomb problem

DD 3.0 – Data that vanishes and the disappearance is highly

significant

My mother worked on explosives and casings from German

bombs and shells during the war.

At the start of the war the Germans used very good quality

alloys. The Allies tracked the elements in the alloy. As the

war progressed, certain elements disappeared completely

from the mix.

Dark Data of the third kind

The German Bomb problem

If the Allies had started analysing in the middle of the war,

they would have missed the missing data.

(Would that have made it an unknown, unknown,

unknown?)

As it was, they could see what elements were becoming

scarce in Germany and plan accordingly.

Dark Data of the third kind

Take home message

Collect data and use it imaginatively.

Dark Data

Simpson’s Paradox

Why is Homer Simpson yellow?

Simpson’s paradox can be seen when percentages and/or actions are

combined without thought.

Two schools and two teachers –

School 1 – Sally

School 2 – Brian

But they occasionally work in each other’s schools.

Scored on the percentage exam pass of the students.

Simpson’s Paradox

Brian beats Sally in both schools.

Simpson’s Paradox

            Brian   Sally

School 1    90%     88%

School 2    75%     60%

But when we do the sums for all children, Sally beats Brian.

Simpson’s Paradox

            Brian   Sally

School 1    90%     88%

School 2    75%     60%

Overall     75%     87%

The paradox arises because the teachers teach very

different ability groups in the two schools and the schools

themselves differ in intake.

Simpson’s Paradox

                             Brian                       Sally

No. of children     Pass  Total  Percentage     Pass  Total  Percentage

School 1               9     10     90%          700    800     88%

School 2             600    800     75%            6     10     60%

Combined             609    810     75%          706    810     87%

Brian may have achieved 90% in school 1, but this is based on a very

small sample – he taught far more students in school 2.

And Sally taught very few children in school 2 and far more in school 1.
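The sums can be checked in a few lines. This is a minimal sketch using the pass and total counts from the table; the dictionary layout and the function names `pass_rate` and `overall_rate` are simply choices made for this illustration.

```python
# (passes, total children) for each teacher in each school.
results = {
    "Brian": {"School 1": (9, 10), "School 2": (600, 800)},
    "Sally": {"School 1": (700, 800), "School 2": (6, 10)},
}


def pass_rate(passed, total):
    """Percentage of children who passed."""
    return 100 * passed / total


def overall_rate(schools):
    """Pool the raw counts across schools, then take the percentage."""
    passed = sum(p for p, _ in schools.values())
    total = sum(t for _, t in schools.values())
    return pass_rate(passed, total)


for teacher, schools in results.items():
    for school, (passed, total) in schools.items():
        print(f"{teacher} {school}: {pass_rate(passed, total):.1f}%")
    print(f"{teacher} overall: {overall_rate(schools):.1f}%")
```

Brian wins in each school taken separately, yet Sally wins once the counts are pooled, because each teacher's overall figure is dominated by the school where they taught 800 children.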


Both small and large samples are producing the same kind of value (a

percentage). Combining these means that samples with the high

numbers of students ‘overwhelm’ the samples with low numbers.


We have one set of data

  x     y
  1   140
  9   100
 10   190
 12   150
 14   140
 15   170
 15   150
 17   150
 17   200
 20   250
 20   270
 22   270
 23   450
 23   270
 34   430
 34   370
 45   490
 56   480
 60   330

[Scatter plot with trend line: y = 6.4397x + 111.66, R² = 0.6344]

And another set for the same two axes

  x     y
 98    95
125   180
134    70
148    85
152    75
200   125
200   135
220   135
230   120
230   135
340   180
340   185
450   180
560   140

[Scatter plot with trend line: y = 0.1673x + 90.477, R² = 0.3125]

But when we combine the sets (all 33 points: the 19 from the first set plus the 14 from the second)

[Scatter plot with trend line: y = -0.2662x + 238.52, R² = 0.0989]
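The same effect can be reproduced without a spreadsheet. What follows is a minimal sketch that fits an ordinary least-squares line to each set and then to the two combined; the function name `least_squares` is made up for this illustration, and the points are the ones listed on the slides. The slopes it prints should come out close to the trend lines quoted above.

```python
def least_squares(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


first_set = [
    (1, 140), (9, 100), (10, 190), (12, 150), (14, 140), (15, 170), (15, 150),
    (17, 150), (17, 200), (20, 250), (20, 270), (22, 270), (23, 450), (23, 270),
    (34, 430), (34, 370), (45, 490), (56, 480), (60, 330),
]
second_set = [
    (98, 95), (125, 180), (134, 70), (148, 85), (152, 75), (200, 125), (200, 135),
    (220, 135), (230, 120), (230, 135), (340, 180), (340, 185), (450, 180), (560, 140),
]

slope_first, _ = least_squares(first_set)
slope_second, _ = least_squares(second_set)
slope_combined, _ = least_squares(first_set + second_set)

print(f"first set slope:  {slope_first:+.4f}")
print(f"second set slope: {slope_second:+.4f}")
print(f"combined slope:   {slope_combined:+.4f}")
```

Each set on its own slopes upwards, but the second set sits lower and further to the right than the first, so the line through all the points slopes downwards.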

How likely is Simpson’s paradox?

Marios G. Pavlides and Michael D. Perlman (August 2009).

The American Statistician 63 (3): 226–233.

http://www.stat.washington.edu/research/reports/2009/tr558.pdf

Simpson’s Paradox

Simpson’s Paradox is usually attributed to Simpson in 1951.

However, Pavlides and Perlman quote various other sources

which suggest that the underlying idea goes back at least

as far as 1899.

Simpson’s Paradox

This is a wonderful example of Stigler’s law of eponymy,

set out by Stephen Stigler in his 1980 publication "Stigler’s law of eponymy".

Essentially this postulates that scientific discoveries are not usually named

after the original discoverer.

Stigler’s law was originally postulated by the sociologist Robert K. Merton,

thus ensuring that Stigler’s law obeys Stigler’s law.

Gieryn, T. F., ed. (1980). Science and Social Structure: A Festschrift for Robert K. Merton. New York: NY Academy of Sciences. pp. 147–57. ISBN 0-89766-043-9. Republished in Stigler's collection "Statistics on the Table".
