big data as a source for official statistics

Big Data as a source for Official Statistics Edwin de Jonge and Piet Daas November 12, London

Upload: edwin-de-jonge

Post on 19-Jun-2015

DESCRIPTION

Presentation given at Strata conf London 2013.

TRANSCRIPT

Page 1: Big data as a source for official statistics

Big Data as a source for Official Statistics

Edwin de Jonge and Piet Daas

November 12, London

Page 2: Big data as a source for official statistics

Overview

• Big Data
  • Research ‘theme’ at Stat. Netherlands

• Data driven approach
  • Visualization as a tool
  • Why?
  • Examples in our office

• Issues & challenges
  • From an official statistical perspective
  • Focus on methodological and legal ones

Page 3: Big data as a source for official statistics

Why Visualization?

October 1st 2013, Statistics Netherlands

Page 4: Big data as a source for official statistics

Effective Display!

(see Tor Norretranders, “Bandwidth of our senses”)

Page 5: Big data as a source for official statistics

Anscombe’s quartet…

     DS1            DS2            DS3            DS4
  x      y       x      y       x      y       x      y
 10    8.04     10    9.14     10    7.46      8    6.58
  8    6.95      8    8.14      8    6.77      8    5.76
 13    7.58     13    8.74     13   12.74      8    7.71
  9    8.81      9    8.77      9    7.11      8    8.84
 11    8.33     11    9.26     11    7.81      8    8.47
 14    9.96     14    8.10     14    8.84      8    7.04
  6    7.24      6    6.13      6    6.08      8    5.25
  4    4.26      4    3.10      4    5.39     19   12.50
 12   10.84     12    9.13     12    8.15      8    5.56
  7    4.82      7    7.26      7    6.42      8    7.91
  5    5.68      5    4.74      5    5.73      8    6.89

Page 6: Big data as a source for official statistics

Anscombe’s quartet

Property                                   Value
Mean of x1, x2, x3, x4                     All equal: 9
Variance of x1, x2, x3, x4                 All equal: 11
Mean of y1, y2, y3, y4                     All equal: 7.50
Variance of y1, y2, y3, y4                 All equal: 4.1
Correlation for ds1, ds2, ds3, ds4         All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4   All equal: y = 3.00 + 0.500x

Looks the same, right?
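
The summary properties listed above can be recomputed directly from the four datasets. The following is a minimal sketch using only the Python standard library. The names `QUARTET`, `summarize` and `SUMMARIES` are illustrative choices, not part of the original slides.

```python
from statistics import mean, variance

# Anscombe's quartet as listed on the slide (DS1-DS3 share the same x values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, Pearson correlation and OLS line for one dataset."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "intercept": my - slope * mx,
        "slope": slope,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

The four rows of output agree to two or three decimals, which is exactly why the numbers alone do not reveal how different the datasets are.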

Page 7: Big data as a source for official statistics

Let’s plot!

Page 8: Big data as a source for official statistics

Assumptions…


Page 9: Big data as a source for official statistics

Why visualization?

Tool for data analysis
– Effective display of information
– Summary of data
– Show outliers / patterns
– Helps exploring data
– Helps checking assumptions

Page 10: Big data as a source for official statistics

Often Maps

Many visualizations are maps
– Positive:
  ‐ Is familiar
  ‐ Attractive

But: a map only makes sense:
‐ When data is geographically distributed
‐ When locality is meaningful
‐ When data is correctly normalized

Page 11: Big data as a source for official statistics

Huh, normalized?

Page 12: Big data as a source for official statistics
Page 13: Big data as a source for official statistics

Many maps are just population maps!

A better map:

‐ Takes population size into account (e.g. by making figures relative)

‐ May plot difference w.r.t. an expected value.
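
Both suggestions can be sketched in a few lines. The example below is only an illustration. The region names, counts and population figures are made up, and the "expected value" is assumed here to be the count a region would have if the overall rate applied everywhere. The slides do not specify how the expected value is defined.

```python
def per_capita_rates(counts, population, per=1000):
    """Counts made relative to population size (events per `per` inhabitants)."""
    return {region: counts[region] / population[region] * per for region in counts}


def difference_from_expected(counts, population):
    """Observed count minus the count expected under the overall (pooled) rate."""
    overall_rate = sum(counts.values()) / sum(population[r] for r in counts)
    return {region: counts[region] - overall_rate * population[region] for region in counts}


# Hypothetical figures, for illustration only.
counts = {"A": 500, "B": 50, "C": 50}
population = {"A": 100_000, "B": 5_000, "C": 20_000}

rates = per_capita_rates(counts, population)
diffs = difference_from_expected(counts, population)
print(rates)
print(diffs)
```

Region A has by far the highest raw count, yet region B has the highest rate per inhabitant. A map of raw counts would mostly show where people live.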

Page 14: Big data as a source for official statistics

Visualization is not easy

– Creating good visualizations is hard
– “Easy Reading” is not “Easy Writing”

Visualization must be:
– Faithful
– Objective

Thus it must not introduce perceptual bias

Page 15: Big data as a source for official statistics

Visualization

– Use appropriate chart
– Use appropriate scales
  ‐ x, y, color, time
– Use appropriate granularity

Research: What works for which data?

Page 16: Big data as a source for official statistics

Example: Census

Page 17: Big data as a source for official statistics

Example Virtual Census

‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
  • Last traditional census was in 1971
‐ Now by (re-)using existing information
  • Linking administrative sources and available sample survey data at a large scale
  • Check result
  • How?
  • With a visualisation method: the Tableplot

Page 18: Big data as a source for official statistics

Making the Tableplot

1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
   • Age in this example
3. Combine records into 100 groups (170,000 records each)
   • Numeric variables
     • Calculate average (avg. age)
   • Categorical variables
     • Ratio between categories present (male vs. female)
4. Plot figure of select number of variables (up to 12)
   • Colours used are important
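
The sort-and-bin steps listed above can be sketched as follows. This is a minimal standard-library illustration of the aggregation only, not the authors' implementation. The function name `tableplot_bins`, its parameters and the toy records are assumptions, and the plotting step is left out.

```python
from collections import Counter


def tableplot_bins(records, key, numeric=(), categorical=(), n_bins=100):
    """Sort records on `key`, split them into `n_bins` equal groups and summarize each group.

    Numeric variables are averaged per group; for categorical variables the
    share of each category within the group is returned.
    """
    ordered = sorted(records, key=lambda r: r[key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:  # fewer records than bins
            continue
        row = {"n": len(chunk)}
        for var in numeric:
            row[var] = sum(r[var] for r in chunk) / len(chunk)
        for var in categorical:
            counts = Counter(r[var] for r in chunk)
            row[var] = {cat: c / len(chunk) for cat, c in counts.items()}
        bins.append(row)
    return bins


# Toy data: ten made-up records, deliberately supplied in reverse age order.
records = [{"age": a, "sex": "male" if a % 2 == 0 else "female"} for a in reversed(range(10))]
bins = tableplot_bins(records, key="age", numeric=["age"], categorical=["sex"], n_bins=5)
for row in bins:
    print(row)
```

Each output row corresponds to one horizontal band of the tableplot: a bar length for every numeric variable and a stacked bar of category shares for every categorical variable.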

Page 19: Big data as a source for official statistics
Page 20: Big data as a source for official statistics

Tableplot of the census test file

Page 21: Big data as a source for official statistics

Tableplot: Monitor data quality

– All data in Office passes stages:
  ‐ Raw data (collected)
  ‐ Preprocessed (technically correct)
  ‐ Edited (completed data)
  ‐ Final (removal of outliers etc.)

Page 22: Big data as a source for official statistics

Processing of data

[Figure panels: Raw (unedited) data; Edited data; Final data]

Page 23: Big data as a source for official statistics

Example 2: Social Security Register

Page 24: Big data as a source for official statistics

– Contains all financial data on jobs, benefits and pensions in the Netherlands
  ‐ Collected by the Dutch Tax office
  ‐ A total of 20 million records each month
  ‐ How to obtain insight into so much data?
    • With a visualisation method: a heat map

Page 25: Big data as a source for official statistics

Heat map: Age vs. ‘Income’

[Figure: heat map with axes age and income (euro)]

Page 26: Big data as a source for official statistics

After ‘data reduction’

[Figure: two heat map panels, each with axes age and amount]

Page 27: Big data as a source for official statistics

Visualization helps with volume of data

– Summarize by “binning”
– Tableplot
– Histogram
– Heatmap (2D histogram)
– Smoothing?
– Detect unexpected patterns

We use it as a tool to check, explore and communicate data
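
Binning is what makes a heat map of tens of millions of records feasible: only the cell counts need to be kept and drawn. A minimal sketch of a 2D histogram is shown below. The function name `bin_2d`, the bin widths and the sample points are illustrative assumptions.

```python
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count points per rectangular cell; cells are indexed by (x_bin, y_bin)."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Made-up (age, amount) pairs, for illustration only.
points = [(23, 1200), (27, 1900), (31, 2500), (24, 1500)]
cells = bin_2d(points, x_width=10, y_width=1000)
print(dict(cells))
```

Because the counter is updated one record at a time, the same loop works on a file that is streamed rather than loaded into memory. The number of cells, not the number of records, determines the size of the result.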

Page 28: Big data as a source for official statistics

Big Data: Issues and challenges

Page 29: Big data as a source for official statistics

Big Data: issues & challenges

During our exploratory studies we identified a number of issues & challenges.

Focussing on the methodological and legal ones, we found that there is a need to:
1) deal with noisy and dirty data
2) deal with selectivity
3) go beyond correlation
4) cope with privacy and security issues

We have only solved some of them (partially)

Page 30: Big data as a source for official statistics

1) Deal with noisy and dirty data

– Big Data is often
  ‐ noisy
  ‐ dirty
  ‐ redundant
  ‐ unstructured
    • e.g. texts, images

– How to extract information from Big data?
  ‐ In the best/most efficient way

Page 31: Big data as a source for official statistics

Noisy and dirty data

Aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, models (capture structure), ‘Google approach’ (80/20 rule)

Preferably do NOT use samples!

[Figures: Social media sentiment; Traffic loop data]

Page 32: Big data as a source for official statistics

Noise reduction

Social media: daily sentiment in Dutch messages


Page 33: Big data as a source for official statistics

Noise reduction

Social media: daily & weekly sentiment in Dutch messages

Page 34: Big data as a source for official statistics

Noise reduction

Social media: daily, weekly & monthly sentiment in Dutch messages

Page 35: Big data as a source for official statistics

Noise reduction

Social media: monthly sentiment in Dutch messages
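
The step from daily to weekly to monthly sentiment is an aggregation: averaging over longer periods suppresses day-to-day noise. A minimal sketch follows, assuming fixed-length, non-overlapping blocks of days (calendar weeks and months would need real dates). The function name `block_means` and the daily values are made up for illustration.

```python
def block_means(values, block_size):
    """Average consecutive, non-overlapping blocks of `block_size` values.

    A shorter final block is averaged over the values it does contain.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    means = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        means.append(sum(block) / len(block))
    return means


# Hypothetical daily sentiment scores for two weeks.
daily = [0.2, -0.4, 0.1, 0.5, -0.3, 0.0, 0.6, 0.3, -0.1, 0.4, 0.2, -0.2, 0.1, 0.7]
weekly = block_means(daily, 7)
print(weekly)
```

The weekly values vary far less than the daily ones, at the cost of temporal detail. That is the trade-off the sequence of slides illustrates.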

Page 36: Big data as a source for official statistics

Social media sentiment & Consumer confidence

Social media: monthly sentiment in Dutch messages & Consumer confidence

Corr: 0.88

Page 37: Big data as a source for official statistics

Dirty data

[Figure: total number of vehicles detected by traffic loops during the day; x-axis: time (hour)]

Page 38: Big data as a source for official statistics

Loop activity varies during the day (first 10 min)

Page 39: Big data as a source for official statistics

Correct for dirty data

Use data from same location from previous/next minute (5 min. window)

Before: Total = ~ 295 million vehicles
After: Total = ~ 330 million vehicles (+ 12%)
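
The correction described above can be sketched as a simple window-based imputation. The slide only says that data from the previous/next minutes at the same location are used. Filling a gap with the mean of the available neighbours is an assumption made here, as are the function name `impute_window` and the example counts.

```python
def impute_window(counts, radius=2):
    """Fill missing per-minute counts (None) for one loop from neighbouring minutes.

    A radius of 2 gives a 5-minute window centred on the missing minute.
    Gaps with no observed value inside the window are left as None.
    """
    filled = list(counts)
    for i, value in enumerate(counts):
        if value is None:
            lo, hi = max(0, i - radius), min(len(counts), i + radius + 1)
            neighbours = [c for c in counts[lo:hi] if c is not None]
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled


# Hypothetical vehicle counts per minute for a single loop.
print(impute_window([10, None, 14]))
print(impute_window([None, None, None, None, None, None, 6]))
```

Only originally observed values are used as neighbours, so imputed values never feed further imputations. The reported increase is consistent with the totals: 330 / 295 ≈ 1.12, i.e. about 12% more vehicles after correction.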

Page 40: Big data as a source for official statistics

2) Deal with selectivity

– Big data sources are selective (they do NOT cover the entire population considered)
  ‐ Some probably more than others

– AND: all Big Data sources studied so far contain events!
  ‐ E.g. social media messages created, calls made and vehicles detected
  ‐ Events are probably the reason why these sources are so Big

– When there is a need to correct for selectivity
  1) Convert events to units
     How to identify units?
  2) Correct for selectivity of units included
     How to cope with units that are truly absent and part of the population under study?

Page 41: Big data as a source for official statistics

Units / events

– Big Data contains events
  ‐ Social media messages are generated by usernames
  ‐ Traffic loops count vehicles (Dutch roads are units)
  ‐ Call detail records of mobile phone IDs
  ‐ Convert events to units
    • By profiling
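
The first step of converting events to units is grouping the events by whatever identifies the unit (a username, a phone ID, a road). The sketch below shows only that grouping plus a very crude profile. The field names `user` and `day` and the profile contents are hypothetical. Real profiling, as meant on the slide, would derive many more characteristics per unit.

```python
from collections import defaultdict


def events_to_units(events, unit_key, day_key):
    """Group event records by unit and derive a small profile for each unit."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event[unit_key]].append(event)
    return {
        unit: {
            "n_events": len(unit_events),
            "active_days": len({e[day_key] for e in unit_events}),
        }
        for unit, unit_events in grouped.items()
    }


# Made-up social media events: one record per message.
events = [
    {"user": "a", "day": 1},
    {"user": "a", "day": 1},
    {"user": "a", "day": 3},
    {"user": "b", "day": 2},
]
profiles = events_to_units(events, unit_key="user", day_key="day")
print(profiles)
```

Four events collapse into two units here. Statements about the population can only be made at this unit level, which is where the selectivity correction has to start.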

Page 42: Big data as a source for official statistics

Profiling of Big data


Page 43: Big data as a source for official statistics

Travel behaviour of active mobile phones

Mobility of very active mobile phone users
- during a 14-day period

Based on:
- Call- and text-activity multiple times a day
- Location based on phone masts

Clearly selective:
- North and South-west of the country hardly included

Page 44: Big data as a source for official statistics

3) Go beyond correlation

– You will very likely use correlation to check Big Data findings with those in other (survey) data

– When correlation is high:
  1) Try falsifying it first (is it coincidental/spurious?); correlation ≠ causation
  2) If this fails, you may have found something interesting!
  3) Perform additional analysis (look for causality): cointegration, structural time-series approach

Use common sense!
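
One simple way to try falsifying a high correlation between two time series is to check whether it survives once the shared trend is removed. The sketch below compares the correlation of the levels with that of the first differences. This is a basic sanity check, not the cointegration or structural time-series analysis named above. Both series are synthetic and constructed so that they share nothing except an upward trend.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def first_differences(series):
    """Period-to-period changes of a series."""
    return [b - a for a, b in zip(series, series[1:])]


# Synthetic series: each is a linear trend plus its own unrelated fluctuation.
T = range(41)
series_a = [t + (1 if t % 2 == 0 else -1) for t in T]
series_b = [2 * t + (1 if t % 4 < 2 else -1) for t in T]

corr_levels = pearson(series_a, series_b)
corr_changes = pearson(first_differences(series_a), first_differences(series_b))
print(round(corr_levels, 3), round(corr_changes, 3))
```

The levels correlate almost perfectly because both series rise over time, while their changes are uncorrelated. A correlation that vanishes in this way is a warning that the relationship may be spurious.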

Page 45: Big data as a source for official statistics

An illustrative example

Official unemployment percentage vs. number of social media messages including the word “unemployment”

Corr: 0.90 ? X

Page 46: Big data as a source for official statistics

4) Privacy and security issues

– The Dutch privacy and security law allows the study of privacy sensitive data for scientific and statistical research

– Still appropriate measures need to be taken
  • Prior to new research studies, check privacy sensitivity of data
  • In case of privacy sensitive data:
    • Try to anonymize micro data or use aggregates
    • Use a secure environment

– Legal issues that enable the use of Big Data for official statistics production are currently being looked at
  ‐ No problems for Big Data that can be considered ‘Administrative data’: i.e. Big Data that is managed by a (semi-)governmentally funded organisation

Page 47: Big data as a source for official statistics

Conclusions

– Big data is a very interesting data source
  ‐ Also for official statistics

– Visualisation is a great way of getting/creating insight
  ‐ Not only for data exploration

– A number of fundamental issues need to be resolved
  ‐ Methodological
  ‐ Legal
  ‐ Technical (not discussed here)

– We expect great things in the near future!

Page 48: Big data as a source for official statistics

The future of statistics?