Barteld Braaksma and Kees Zeelenberg
“Re-make / Re-model”:
Should big data change the modelling paradigm in official statistics?
Layout of presentation
– Sources and modes of inference
– Big data examples at Statistics Netherlands
– How to use big data?
‐ ‘as is’
‐ models
– But how about quality?
– More examples
– Conclusions
Sources for official statistics
Always start from observations
– Traditional surveys
• Statistical populations
• Owned by statistical offices (full control)
• Costly and burdensome
– Administrative sources
• Administrative populations
• Owned by government bodies (limited control)
• Cheaper to obtain
– Big (‘organic’) data
• Unclear populations
• Owned by private companies (no control)
• Cost unclear
Modes of inference in official statistics
Main approaches for collecting and processing data
– Design-based
‐ Stratified sample survey of sales
– Model-assisted
‐ Combine tax data with sales survey (regression)
– Model-based
‐ Add up all sales from tax declarations
‐ (small-area estimates)
‐ (seasonal adjustment)
‐ (…)
– Sometimes ‘implicit models’
‐ Imputation of missing values
‐ Preliminary estimates of GDP
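To make the difference between the first two modes concrete, here is a minimal Python sketch. It assumes simple random sampling within strata for the design-based case, and a single auxiliary variable (tax turnover, known for the whole population) for the model-assisted case. The function names and all figures are hypothetical illustrations, not a description of production methods at any statistical office.

```python
from statistics import mean


def stratified_total(strata):
    """Design-based: expand each stratum's sample mean by its population size N."""
    return sum(s["N"] * mean(s["sales"]) for s in strata)


def regression_total(sample_y, sample_x, pop_size, pop_x_total):
    """Model-assisted: regression estimator of a total under simple random sampling.

    sample_y     surveyed sales for the sampled units
    sample_x     tax turnover for the same units
    pop_size     number of units in the population
    pop_x_total  tax turnover summed over the whole population (from the register)
    """
    n = len(sample_y)
    x_bar = sum(sample_x) / n
    y_bar = sum(sample_y) / n
    # Least-squares slope of surveyed sales on tax turnover.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sample_x, sample_y))
    s_xx = sum((x - x_bar) ** 2 for x in sample_x)
    slope = s_xy / s_xx
    # Shift the sample mean by the gap between population and sample mean of x.
    pop_x_bar = pop_x_total / pop_size
    return pop_size * (y_bar + slope * (pop_x_bar - x_bar))


# Hypothetical data: two size classes of shops.
strata = [
    {"N": 100, "sales": [10, 12, 14]},
    {"N": 20, "sales": [100, 120]},
]
print(stratified_total(strata))

# Hypothetical data: four sampled shops out of 50, register total of 600.
print(regression_total([9, 11, 13, 15], [8, 10, 12, 14], 50, 600))
```

The design-based estimate relies only on the sampling design. The model-assisted estimate uses the tax data to correct the survey mean but still starts from the sample, so it stays approximately valid even if the linear relation is only a rough description.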
Big data at Statistics Netherlands
Experiments discussed today
– Traffic detection loops
– Social media messages
– Mobile phone data
Other examples, not discussed here
– Scanner data (in production)
– Satellite images
– Financial transactions
– Internet robots (close to production)
– Google Trends
– For the record: administrative data (in production)
Big data ‘as is’
– Imperfect, yet timely, indicator of trends
– “These data exist and that’s why they are interesting”
– Example: social media messages
‐ Signals of human activity and feelings
Dutch social media activity, 2010-2012
Big data and statistics
Important issues:
– Undercoverage
– Selectivity
– Volatility
– Interpretation
– Continuity
Traditionalists’ view:
– These sources are useless for producing quality statistics
Modernists’ view:
– We should stop doing surveys; everything is already out there
Déjà-vu:
– Similar discussions when introducing administrative data…
How to use big data?
– Many methodological issues
– No linking variables (often)
– Additional information may be available
– Possible approach: combine available information
‐ By old or new mathematical methods (often Bayesian)
‐ By integration techniques (“National accounts”-style)
– But how about models?
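One simple way to picture "combining available information" in a Bayesian spirit is a precision-weighted average of two estimates of the same quantity, for instance a survey estimate and a big-data indicator. The sketch below is an illustration only: it assumes both estimates are unbiased, independent and approximately normal with known variances, conditions that selective big-data sources will often violate. The function name and the numbers are hypothetical.

```python
def combine_normal(est_a, var_a, est_b, var_b):
    """Normal-normal combination: weight each estimate by its precision (1 / variance)."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    prec_a = 1.0 / var_a
    prec_b = 1.0 / var_b
    combined_var = 1.0 / (prec_a + prec_b)
    combined_mean = combined_var * (prec_a * est_a + prec_b * est_b)
    return combined_mean, combined_var


# Hypothetical inputs: a survey estimate (100, variance 4)
# and a noisier big-data indicator (110, variance 16).
mean_, var_ = combine_normal(100.0, 4.0, 110.0, 16.0)
print(mean_, var_)
```

The combined estimate lies closer to the more precise source, and its variance is smaller than either input variance. If the big-data source is biased, that bias carries into the result, which is why the selectivity and coverage issues listed earlier matter.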
Examples of models in official statistics
– Correction by weighting for non-response
– Imputation for item non-response
– Seasonal adjustment
– Estimates for small areas
– Capture-recapture models for hard-to-observe populations
– Preliminary (flash) estimates of GDP
– So we are already using models in official statistics!
– But we should look carefully at principles and conditions
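Two of the models listed above are small enough to sketch in a few lines of Python. The first is a weighting-class adjustment for non-response, which assumes that respondents and non-respondents within a class are alike. The second is the two-source Lincoln-Petersen capture-recapture estimator, which assumes a closed population, independent sources and perfect linkage between them. Function names and figures are hypothetical and serve only as illustration.

```python
def nonresponse_weights(sampled, responded):
    """Weighting-class adjustment: each respondent in a class stands in for
    sampled / responded units of that class."""
    return {cls: sampled[cls] / responded[cls] for cls in sampled}


def lincoln_petersen(n_first, n_second, n_both):
    """Two-source capture-recapture estimate of the population size.

    n_first   units found in the first source
    n_second  units found in the second source
    n_both    units found in both sources
    """
    if n_both == 0:
        raise ValueError("no overlap between sources; estimate is undefined")
    return n_first * n_second / n_both


# Hypothetical response counts by age class.
weights = nonresponse_weights(
    sampled={"young": 200, "old": 300},
    responded={"young": 100, "old": 240},
)
print(weights)

# Hypothetical register counts: 400 in source A, 300 in source B, 120 in both.
print(lincoln_petersen(400, 300, 120))
```

Both functions are models in the sense of the slide: each estimate is only as good as the assumption behind it, which is why principles and conditions deserve a careful look.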
Guiding principles of official statistics
European Statistical System, mission statement
– “We provide the European Union, the world and the public with independent high quality information on the economy and society on European, national and regional levels and make the information available to everyone for decision-making purposes, research and debate.”
ESS Code of Practice, principle 6:
– “Statistical authorities develop, produce and disseminate European Statistics respecting scientific independence and in an objective, professional and transparent manner in which all users are treated equitably.”
ESS Code of Practice, principle 7:
– “Sound methodology underpins quality statistics. This requires adequate tools, procedures and expertise.”
ESS Code of Practice, principle 12:
– “European Statistics accurately and reliably portray reality.”
So how about quality?
For use of models this implies:
– Objectivity:
‐ Do not move too far from observed data
‐ Objects and populations for the model correspond to the statistical phenomenon
‐ No forecasting
– Reliability:
‐ Extensive specification to guarantee robustness against model failure
‐ No behavioural models
Some model-based examples
– Relation assumed between observations and phenomena
– Sophisticated modelling
– Trial and error
– Signal and noise
Conclusions
– Big data leads to new opportunities
‐ Better accuracy and more details
‐ More frequent and more timely estimates
‐ Statistics in new areas
– Big data based statistics are useful in their own right
– Don’t be afraid to use models
‐ Documented and transparent
‐ Well tested
‐ Describe, do not judge