big data as a source for official statistics
DESCRIPTION
Presentation given at Strata conf London 2013.
TRANSCRIPT
Big Data as a source for Official Statistics
Edwin de Jonge and Piet Daas
November 12, London
Overview
• Big Data
  • Research ‘theme’ at Stat. Netherlands
• Data driven approach
  • Visualization as a tool
  • Why?
  • Examples in our office
• Issues & challenges
  • From an official statistical perspective
  • Focus on methodological and legal ones
Why Visualization?
October 1st 2013, Statistics Netherlands
Effective Display!
(see Tor Norretranders, “Bandwidth of our senses”)
Anscombe’s quartet…
    DS1          DS2          DS3          DS4
  x      y     x      y     x      y     x      y
 10   8.04    10   9.14    10   7.46     8   6.58
  8   6.95     8   8.14     8   6.77     8   5.76
 13   7.58    13   8.74    13  12.74     8   7.71
  9   8.81     9   8.77     9   7.11     8   8.84
 11   8.33    11   9.26    11   7.81     8   8.47
 14   9.96    14   8.10    14   8.84     8   7.04
  6   7.24     6   6.13     6   6.08     8   5.25
  4   4.26     4   3.10     4   5.39    19  12.50
 12  10.84    12   9.13    12   8.15     8   5.56
  7   4.82     7   7.26     7   6.42     8   7.91
  5   5.68     5   4.74     5   5.73     8   6.89
Anscombe’s quartet
Property                                   Value
Mean of x1, x2, x3, x4                     All equal: 9
Variance of x1, x2, x3, x4                 All equal: 11
Mean of y1, y2, y3, y4                     All equal: 7.50
Variance of y1, y2, y3, y4                 All equal: 4.1
Correlation for ds1, ds2, ds3, ds4         All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4   All equal: y = 3.00 + 0.500x
Looks the same, right?
Let’s plot!
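The summary table can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the helper name `summarize` is an illustrative choice and not part of the presentation. It recomputes the means, sample variances, correlation and regression line for each of the four data sets.

```python
from statistics import mean, variance

# Anscombe's quartet, values as listed in the data table.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "DS1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, sample variances, Pearson correlation and the least-squares line."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)  # sample variances (divide by n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The four rows printed agree to about two decimals, which is exactly why the numbers alone cannot distinguish the data sets and a plot is needed.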
Assumptions…
Why visualization?
Tool for data analysis
– Effective display of information
– Summary of data
– Show outliers / patterns
– Helps exploring data
– Helps checking assumptions
Often Maps
Many visualizations are maps
– Positive:
‐ Is familiar
‐ Attractive
But: only makes sense:
‐ When data is geographically distributed
‐ When locality is meaningful
‐ When data is correctly normalized
Huh, normalized?
Many maps are just population maps!
A better map:
‐ Takes population size into account (e.g. by making figures relative)
‐ May plot difference w.r.t. an expected value.
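Both suggestions can be illustrated with a small calculation. This is a minimal sketch with made-up regions and figures (none of the numbers come from the presentation): it turns raw counts into a rate per 1,000 inhabitants and into a difference from the count expected if every region had the overall rate.

```python
# Hypothetical regions: (number of events, population). All figures are made up.
regions = {
    "A": (500, 100_000),
    "B": (120, 10_000),
    "C": (30, 5_000),
}

total_events = sum(events for events, _ in regions.values())
total_pop = sum(pop for _, pop in regions.values())
overall_rate = total_events / total_pop

rows = {}
for name, (events, pop) in regions.items():
    rate_per_1000 = 1000 * events / pop   # relative figure
    expected = overall_rate * pop         # count expected under the overall rate
    rows[name] = (rate_per_1000, events - expected)
    print(f"{name}: {events} events, {rate_per_1000:.1f} per 1,000, "
          f"{events - expected:+.1f} vs. expected")
```

Region A has the most events, yet region B has the highest rate; a map of raw counts would highlight A simply because more people live there.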
Visualization is not easy
– Creating good visualizations is hard
– “Easy Reading” is not “Easy Writing”
Visualization must be:
– Faithful
– Objective
Thus it must not introduce perceptual bias
Visualization
– Use appropriate chart
– Use appropriate scales
‐ x, y, color, time
– Use appropriate granularity
Research: What works for which data?
Example: Census
Example Virtual Census
‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
• Last traditional census was in 1971
‐ Now by (re-)using existing information
• Linking administrative sources and available sample survey data at a large scale
• Check result
• How?
• With a visualisation method: the Tableplot
Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable
• Age in this example
3. Combine records into 100 groups (170,000 records each)
• Numeric variables
  • Calculate average (avg. age)
• Categorical variables
  • Ratio between categories present (male vs. female)
4. Plot figure of select number of variables
• Colours used are important, up to 12
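Steps 2 and 3 can be sketched in plain Python. This is an illustrative sketch, not the implementation used at Statistics Netherlands; the function name `tableplot_bins`, the ascending sort order and the tiny made-up data set are all assumptions, and step 4 (the plotting itself) is left out.

```python
from collections import Counter
from statistics import mean

def tableplot_bins(records, sort_key, numeric, categorical, n_bins=100):
    """Sort records on sort_key, cut them into n_bins groups of (nearly) equal
    size and summarise each group: mean per numeric variable, share of each
    category per categorical variable."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        group = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue  # fewer records than bins
        summary = {"size": len(group)}
        for var in numeric:
            summary[var] = mean(r[var] for r in group)
        for var in categorical:
            counts = Counter(r[var] for r in group)
            summary[var] = {cat: c / len(group) for cat, c in counts.items()}
        bins.append(summary)
    return bins

# Tiny made-up data set: 1,000 records instead of 17 million.
records = [{"age": i % 100, "sex": "male" if i % 2 else "female"} for i in range(1000)]
bins = tableplot_bins(records, sort_key="age", numeric=["age"],
                      categorical=["sex"], n_bins=10)
print(bins[0])
```

Each entry of `bins` corresponds to one row of the tableplot: a bar length for every numeric variable and a stacked bar of category shares for every categorical variable.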
Tableplot of the census test file
Tableplot: Monitor data quality
– All data in Office passes stages:
‐ Raw data (collected)
‐ Preprocessed (technically correct)
‐ Edited (completed data)
‐ Final (removal of outliers etc.)
Processing of data: raw (unedited) data, edited data, final data
Example 2: Social Security Register
– Contains all financial data on jobs, benefits and pensions in the Netherlands
‐ Collected by the Dutch Tax office
‐ A total of 20 million records each month
‐ How to obtain insight into so much data?
• With a visualisation method: a heat map
Heat map: Age vs. ‘Income’
Axes: Age, Income (euro)
After ‘data reduction’
Axes: age, amount
Visualization helps with volume of data
– Summarize by “binning”
– Tableplot
– Histogram
– Heatmap (2D histogram)
– Smoothing?
– Detect unexpected patterns
We use it as a tool to check, explore and communicate data
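A heat map seen as a 2D histogram amounts to counting records per cell of a grid. The sketch below shows that binning step with the standard library only; the function name `heatmap_counts`, the bin widths and the sample pairs are illustrative assumptions, not values from the register.

```python
from collections import Counter

def heatmap_counts(points, x_width, y_width):
    """Count points per (x_bin, y_bin) cell; a cell is labelled by its lower edges."""
    cells = Counter()
    for x, y in points:
        cell = (x // x_width * x_width, y // y_width * y_width)
        cells[cell] += 1
    return cells

# Made-up (age, amount) pairs; a real input would hold millions of records.
points = [(23, 1250), (24, 1900), (37, 3100), (38, 3400), (38, 3900), (61, 2100)]
cells = heatmap_counts(points, x_width=5, y_width=1000)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

However many records go in, the result never has more entries than there are grid cells, so the volume of data to draw stays bounded; the cell counts are then mapped to colours.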
Big Data: Issues and challenges
Big Data: issues & challenges
During our exploratory studies we identified a number of issues & challenges.
Focussing on the methodological and legal ones,
we found that there is a need to:
1) deal with noisy and dirty data
2) deal with selectivity
3) go beyond correlation
4) cope with privacy and security issues
We have only solved some of them (partially)
1) Deal with noisy and dirty data
– Big Data is often
‐ noisy
‐ dirty
‐ redundant
‐ unstructured
• e.g. texts, images
– How to extract information from Big data?
‐ In the best/most efficient way
Noisy and dirty data
Aggregate, apply filters (Poisson/Kalman), try to exclude noisy records, models (capture structure), ‘Google approach’ (80/20 rule)
Preferably do NOT use samples!
Social media sentiment
Traffic loop data
Noise reduction
Social media: daily sentiment in Dutch messages
Noise reduction
Social media: daily & weekly sentiment in Dutch messages
Noise reduction
Social media: daily, weekly & monthly sentiment in Dutch messages
Noise reduction
Social media: monthly sentiment in Dutch messages
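The effect of moving from daily to weekly or monthly figures can be illustrated by averaging consecutive blocks of values. This is a minimal sketch on a made-up series; the slides do not say how the weekly and monthly sentiment were computed, so block averaging is an assumption, and `block_means` is a hypothetical helper.

```python
from statistics import mean

def block_means(values, block):
    """Average consecutive blocks of `block` values (e.g. 7 daily values -> 1 week)."""
    return [mean(values[i:i + block]) for i in range(0, len(values), block)]

# Made-up daily sentiment: a constant level of 10 with alternating noise of +/- 3.
daily = [10 + (3 if d % 2 else -3) for d in range(28)]
weekly = block_means(daily, 7)

print("daily spread: ", max(daily) - min(daily))
print("weekly spread:", round(max(weekly) - min(weekly), 2))
```

The weekly series varies far less than the daily one, at the cost of fewer data points and a coarser time resolution.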
Social media sentiment & Consumer confidence
Social media: monthly sentiment in Dutch messages & Consumer confidence
Corr: 0.88
Dirty data
Total number of vehicles detected by traffic loops during the day (by time, in hours)
Loop active varies during the day (first 10 min)
Correct for dirty data
Use data from same location from previous/next minute (5 min. window)
Before: Total = ~295 million vehicles
After: Total = ~330 million vehicles (+12%)
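The correction can be sketched as filling a missing minute from the neighbouring minutes at the same loop. The slide does not say how the neighbouring values are combined; taking their mean within a 5-minute window (two minutes either side) is an assumption, and `fill_gaps` and the counts are made up for illustration.

```python
def fill_gaps(counts, half_window=2):
    """Replace missing minute counts (None) by the mean of the observed counts
    within +/- half_window minutes at the same location."""
    filled = []
    for i, value in enumerate(counts):
        if value is None:
            lo = max(0, i - half_window)
            hi = min(len(counts), i + half_window + 1)
            neighbours = [c for c in counts[lo:hi] if c is not None]
            # Stays None when the whole window is missing.
            value = sum(neighbours) / len(neighbours) if neighbours else None
        filled.append(value)
    return filled

# Made-up vehicle counts per minute for one loop; None marks a minute without data.
minute_counts = [12, 14, None, 13, 15, None, None, 16]
filled = fill_gaps(minute_counts)
print(filled)
```

Because missing minutes are replaced by positive estimates, the corrected total is higher than the raw total, which is the direction of the before/after change reported on the slide.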
2) Deal with selectivity
– Big data sources are selective (they do NOT cover the entire population considered)
‐ Some probably more than others
– AND: all Big Data sources studied so far contain events!
‐ E.g. social media messages created, calls made and vehicles detected
‐ Events are probably the reason why these sources are so Big
– When there is a need to correct for selectivity
1) Convert events to units
‐ How to identify units?
2) Correct for selectivity of units included
‐ How to cope with units that are truly absent and part of the population under study?
Units / events
– Big Data contains events
‐ Social media messages are generated by usernames
‐ Traffic loops count vehicles (Dutch roads are units)
‐ Call detail records of mobile phone ID’s
‐ Convert events to units
• By profiling
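In its simplest form, converting events to units means grouping the events by the identifier that produced them and deriving a few features per identifier. The sketch below is only an illustration of that idea; the profile features (event count, number of active days), the function name `profile_units` and the sample events are assumptions, not the profiling method used in the studies.

```python
from collections import defaultdict

def profile_units(events):
    """Group (unit_id, day) events by unit and build a simple profile per unit:
    number of events and number of distinct active days."""
    per_unit = defaultdict(lambda: {"events": 0, "days": set()})
    for unit_id, day in events:
        per_unit[unit_id]["events"] += 1
        per_unit[unit_id]["days"].add(day)
    return {unit: {"events": p["events"], "active_days": len(p["days"])}
            for unit, p in per_unit.items()}

# Made-up events: (username, day on which a message was posted).
events = [("anna", 1), ("anna", 1), ("anna", 3), ("bob", 2), ("carl", 2), ("carl", 5)]
profiles = profile_units(events)
print(len(events), "events from", len(profiles), "units")
print(profiles["anna"])
```

The number of units is much smaller than the number of events, and it is the units, not the events, that have to be compared with the population under study.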
Profiling of Big data
Travel behaviour of active mobile phones
Mobility of very active mobile phone users
- during a 14-day period
Based on:
- Call and text activity multiple times a day
- Location based on phone masts
Clearly selective:
- North and South-west of the country hardly included
3) Go beyond correlation
– You will very likely use correlation to check Big Data findings with those in other (survey) data
– When correlation is high:
1) Try falsifying it first (is it coincidental/spurious?); correlation ≠ causation
2) If this fails, you may have found something interesting!
3) Perform additional analysis (look for causality): cointegration, structural time-series approach
Use common sense!
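One simple way to try to falsify a high correlation between two time series is to correlate their period-to-period changes instead of their levels: two unrelated series that both trend upwards correlate strongly in levels but not in changes. The sketch below uses simulated data and is only one possible check, chosen for illustration; it is not the cointegration or structural time-series analysis named on the slide, and all names and parameters are assumptions.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def diff(series):
    """First differences: change from one period to the next."""
    return [b - a for a, b in zip(series, series[1:])]

# Two simulated series that share nothing but an upward trend.
rng = random.Random(42)
n = 200
a = [t + rng.gauss(0, 5) for t in range(n)]
b = [2 * t + rng.gauss(0, 5) for t in range(n)]

corr_levels = pearson(a, b)
corr_changes = pearson(diff(a), diff(b))
print(round(corr_levels, 2), round(corr_changes, 2))
```

A correlation that is high in levels but disappears in the changes points to a shared trend rather than a relationship between the two series.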
An illustrative example
Official unemployment percentage vs. number of social media messages including the word “unemployment”
Corr: 0.90 ? X
4) Privacy and security issues
– The Dutch privacy and security law allows the study of privacy sensitive data for scientific and statistical research
– Still, appropriate measures need to be taken
• Prior to new research studies, check privacy sensitivity of data
• In case of privacy sensitive data:
  • Try to anonymize micro data or use aggregates
  • Use a secure environment
– Legal issues that enable the use of Big Data for official statistics production are currently being looked at
‐ No problems for Big Data that can be considered ‘Administrative data’: i.e. Big Data that is managed by a (semi-)governmentally funded organisation
Conclusions
– Big data is a very interesting data source
‐ Also for official statistics
– Visualisation is a great way of getting/creating insight
‐ Not only for data exploration
– A number of fundamental issues need to be resolved
‐ Methodological
‐ Legal
‐ Technical (not discussed here)
– We expect great things in the near future!
The future of statistics?