From Big Data to Official Statistics
Piet J.H. Daas and all my Big Data colleagues/data scientists at the CBDS
31 Jan., Leuven
Statistics Netherlands
Tension between the theory-driven and the data-driven way of working
Overview
• Big Data and Statistics Netherlands
• A Big Data based official statistic
• Skills needed
• Results of other Big Data projects
• Some concluding remarks
Statistics Netherlands
– Where?
Heerlen
Den Haag
We love Big Data!!
Center for Big Data Statistics (CBDS)
• Produce new, real-time statistics, and enrich and deepen the statistics already produced (such as regional indicators)
• Reduce the burden on society (‘response burden’)
• Deepen the methodological knowledge and privacy considerations for using Big Data in official statistics
• Stimulate cooperation by creating an ecosystem of partners
CBDS Scope
Data scouting and data access
Ethics and privacy
Methodology and data integration
Big data in official statistics
Social statistics: safety, housing and health
Sustainable Development Goals
Smart Cities
Economic statistics: internet economy, labour market, energy transition
Mobility: daytime population, traffic flows
Why is Big Data important?
Big Data has the potential to offer:
– Shorter time to publication
– The ability to respond to current events
– Higher reliability
– More detail
– More efficient processes
Considerations:
- Infrastructure
- Skills
- Culture
Big data based official statistics
– Big Data can be used for official statistics in several ways
1) As a single source
- census like
2) As an additional source
- combined with survey data
- combined with admin data
3) Other ways
- add missing data for some variables and/or units
– Road sensor data is used by our office to produce the first Big Data based official statistic!
‐ Use this to illustrate the (new) skills needed!
Road sensors
Road sensor data
– Passing vehicle counts for each minute (24/7) by about 60,000 sensors
– 20,000 on the Dutch highways
– Types of sensors:
‐ Induction loop
‐ Camera
‐ Bluetooth
– Large volume: approx. 230 million records/day
Dutch highways
Dutch highways + road sensors
20,000 sensors on highways
Minute data of 1 sensor for 196 days
‘Afsluitdijk’ (IJsselmeer dam)
‘Afsluitdijk’ (IJsselmeer dam) (2)
Overall process
(1) Transform + Select
(2) Cleaning
(A) Frame
(3) Estimation
– Regional estimates
– Month/quarter/year
‘Reducing’ Big Data
Big Data steps: (1), (2), (3)
Process steps
(1) Transform and Select
(2) Cleaning
(A) Frame
(3) Estimation
Skills needed?
Data Science Venn Diagram
(1) Transform + Select
– Convert raw data to more compact data (without information loss)
‐ Remove unneeded data (variables and erroneous records)
‐ Recalculate values
‐ Store as compactly as possible
‐ Implement the process as efficiently as possible
– Reduces size by more than 1000x!
Skills involved: Statistics (2x), IT (2x)
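The Transform + Select step can be illustrated with a minimal sketch. It assumes a hypothetical raw record layout (dicts with `minute`, `count` and `status` keys) and a hypothetical sentinel value for missing minutes; none of these names come from the slides, and the sketch does not try to reproduce the reduction factor quoted there. It only shows the idea of dropping unneeded variables and erroneous records and storing one fixed-size array of small integers per sensor per day.

```python
from array import array

MINUTES_PER_DAY = 24 * 60
MISSING = 0xFFFF  # hypothetical sentinel for "no valid count this minute"


def compact_day(records):
    """Pack one sensor's raw minute records for one day into a compact array.

    `records` is an iterable of dicts with the hypothetical keys 'minute'
    (0..1439), 'count' and 'status'. Only the count is kept; every other
    variable is dropped.
    """
    counts = array("H", [MISSING] * MINUTES_PER_DAY)  # unsigned short per minute
    for rec in records:
        minute, count = rec["minute"], rec["count"]
        # Drop erroneous records: bad status, out-of-range minute or count.
        if rec.get("status") != "ok":
            continue
        if not 0 <= minute < MINUTES_PER_DAY or not 0 <= count < MISSING:
            continue
        counts[minute] = count
    return counts


# Invented example records; 'firmware' stands in for an unneeded variable.
raw = [
    {"minute": 0, "count": 12, "status": "ok", "firmware": "x1"},
    {"minute": 1, "count": -5, "status": "ok", "firmware": "x1"},     # erroneous
    {"minute": 2, "count": 9, "status": "fault", "firmware": "x1"},   # erroneous
    {"minute": 3, "count": 15, "status": "ok", "firmware": "x1"},
]
day = compact_day(raw)
print(day[0], day[1] == MISSING, day[3], len(day))
```

Keeping the minute implicit in the array position is one way to avoid storing timestamps at all; whether that matches the storage format actually used is not stated in the slides.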
(2) Cleaning
– Check quality of daily sensor data
– Correct for missing data
– Implement process as efficiently as possible
Bayesian filter (‘a Kalman filter for a semi-Poisson process’)
Skills involved: IT (1x), Statistics (2x)
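The slides do not describe the Bayesian filter itself, so the sketch below swaps in a different, much simpler technique: a discounted gamma-Poisson filter, which is a standard Bayesian filter for count data. It is an illustration of how missing minutes might be filled in, not the method used at Statistics Netherlands. The discount factor and the prior parameters `a0` and `b0` are arbitrary assumptions.

```python
def gamma_poisson_filter(counts, discount=0.9, a0=1.0, b0=0.1):
    """Filter a series of per-minute counts; None marks a missing minute.

    The rate is given a Gamma(a, b) distribution. Returns the filtered rate
    estimate (posterior mean a / b) for every minute.
    """
    a, b = a0, b0
    estimates = []
    for y in counts:
        # Predict: discounting keeps the mean a / b but inflates the variance,
        # so older observations gradually lose influence.
        a, b = discount * a, discount * b
        if y is not None:
            # Update: Gamma prior + Poisson observation -> Gamma posterior.
            a, b = a + y, b + 1.0
        estimates.append(a / b)
    return estimates


# Invented counts with two missing minutes.
observed = [10, 12, None, None, 11, 30, 12]
filtered = gamma_poisson_filter(observed)
# Fill the gaps with the rounded filtered rate; keep observed values as-is.
imputed = [round(f) if y is None else y for y, f in zip(observed, filtered)]
print(imputed)
```

During a gap the estimate simply stays at the last filtered level, which is a reasonable default for short outages only.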
(A) Frame
– Use sensors on main route of Dutch Highways
– Project geolocation of sensors on roads
– Metadata quality checking and editing
– Calculate weights for sensors on road segments
Skills involved: Statistics (3x), IT (1x)
(3) Estimation
– Calculate number of vehicles per road segment
– Calculate traffic intensity per region
– Check/compare time series
– Adjust extremes where needed (if unexplained)
Skills involved: Statistics (3x), Content (1x)
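As a rough illustration of the Frame and Estimation steps, the sketch below goes from sensor counts to a figure per road segment and then per region. The weighting scheme is an assumption made for this example (sensors on the same segment get equal weight, and segments are weighted by length within a region); the slides do not specify the actual weights. All identifiers, segment names and numbers are invented.

```python
from collections import defaultdict

# Hypothetical daily vehicle counts per sensor, with the road segment each
# sensor sits on. All numbers are invented for illustration.
sensors = [
    {"id": "s1", "segment": "A2-01", "count": 41000},
    {"id": "s2", "segment": "A2-01", "count": 43000},
    {"id": "s3", "segment": "A2-02", "count": 38000},
    {"id": "s4", "segment": "A76-01", "count": 21000},
]
# Hypothetical frame: segment -> (region, length in km).
frame = {
    "A2-01": ("Limburg", 4.0),
    "A2-02": ("Limburg", 6.0),
    "A76-01": ("Limburg", 10.0),
}


def vehicles_per_segment(sensors):
    """Average the sensors on each segment (equal weight 1/n per sensor)."""
    totals, n = defaultdict(float), defaultdict(int)
    for s in sensors:
        totals[s["segment"]] += s["count"]
        n[s["segment"]] += 1
    return {seg: totals[seg] / n[seg] for seg in totals}


def intensity_per_region(segment_counts, frame):
    """Length-weighted mean of the segment counts within each region."""
    weighted, length = defaultdict(float), defaultdict(float)
    for seg, count in segment_counts.items():
        region, km = frame[seg]
        weighted[region] += count * km
        length[region] += km
    return {r: weighted[r] / length[r] for r in weighted}


per_segment = vehicles_per_segment(sensors)
per_region = intensity_per_region(per_segment, frame)
print(per_segment)
print(per_region)
```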
Skills when using Big Data
For Big Data we need Data Scientists (statisticians with IT skills!)
Skill labels counted over the four process steps: Statistics 10x, IT 4x, Content 1x
Data journalism and fast statistics
Produced within two days!
Produce statistics that are available very rapidly
Traffic reduced by half because of glazed frost
Traffic intensity and GDP
Plot legend: GDP, Traffic
Traffic precedes GDP!
• By 1 quarter
Correlation
• 91% from 2011-Q2 to 2014-Q4
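A lead of one quarter can be checked with a lagged correlation: correlate traffic in quarter t with GDP in quarter t + lag and see which lag fits best. The sketch below is a minimal version of that check. The two series are invented and constructed so that GDP follows traffic by exactly one quarter; they are not the series behind the slide, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for lag >= 0."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


# Invented quarterly index values; gdp is traffic shifted by one quarter.
traffic = [100, 103, 101, 106, 104, 109, 107, 112]
gdp = [99] + traffic[:-1]

for lag in range(3):
    print(lag, round(lagged_correlation(traffic, gdp, lag), 3))
```

With real series the correlation at the best lag will be below 1, and a high correlation between two trending series should be interpreted with care.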
Social media sentiment
Consumer confidence
- Correlation > 0.9; Facebook is the most important data source (Twitter is the other one)
- Including social media in the survey-based consumer confidence increases the precision of the estimate
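The slides do not say how the social media signal is combined with the survey. One textbook way to see why a second source can increase precision is inverse-variance weighting of two estimates, sketched below. It assumes both estimates are unbiased and independent, which may not hold in practice, and all numbers are invented.

```python
def combine(est_a, var_a, est_b, var_b):
    """Inverse-variance weighted mean of two independent, unbiased estimates.

    Returns the combined estimate and its variance.
    """
    w_a, w_b = 1.0 / var_a, 1.0 / var_b
    estimate = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    variance = 1.0 / (w_a + w_b)
    return estimate, variance


# Invented values: a survey-based estimate and a social-media-based estimate
# of the same indicator, each with variance 4.0.
survey_est, survey_var = -12.0, 4.0
social_est, social_var = -10.0, 4.0

combined_est, combined_var = combine(survey_est, survey_var, social_est, social_var)
print(combined_est, combined_var)
```

Under these assumptions the combined variance is never larger than the smaller of the two input variances, which is the sense in which the extra source adds precision.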
Social unrest indicator (near ‘real time’)
Social unrest indicator (2)
Panels: Year, Month, Week, Day
Cyber security
Study of DDoS attacks in various sources
These are all reactions to the attack, not the attack itself
Automatic Identification System data
Data of ships (GPS signal)
200 million records/day worldwide
Courtesy of Maarten Pouwels
Innovation in the Netherlands
New (and fun) indicators
‘Pepernoten’ index: result of a data-driven exploratory study on scanner data
(Friday afternoon projects)
Turnover of ‘cookies’ specific to the Saint Nicholas festivities (2015 and 2016, weekly)
Spring in the Netherlands
2013: mean 2.5, 8 days below zero
2014: mean 8.3, 0 days below zero
Flowering of the wood anemone
Big Data and CBS
[Figure: sources (bits) versus statistics (bits) for ‘Big Data’, administrative data and survey data, with scanner data marked separately]
Percentages shown in the figure: 16.00%, 0.62%, 13.62%, 0.38%, 23.95%, 14.52%, 5.09%, 3.07%, 3.05%, 19.69%
Concluding remarks
– Big Data has potential for official statistics
– There is one example, more are on the way
– Interesting (first) results, but
‐ It is a relatively new area for official statistics, so a lot needs to be checked
‐ People need to adjust to the ‘Big Data’ way of working
– The skill set of ‘statisticians’ needs to be extended
‐ Programming and optimization
– Definite need for a methodological foundation
‐ Population view
‐ Interpret and assess data-driven results
Big Data !!!
The Future
The future of statistics looks BIG
Thank you for your attention! @pietdaas
Questions?