big data in official statistics - experiences at statistics...

Post on 22-May-2020

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Poznań, Poland, 10 September 2015

Big Data and Official Statistics – Experiences

at Statistics Netherlands

Peter Struijs

Outline

¾ Big Data and official statistics

¾ Experiences at Statistics Netherlands with:

- use of road sensor data- use of mobile phone location data- use of public social media messages

¾ Issues and solutions

¾ Strategic, policy and organisational challenges

¾ Cooperation and collaboration

2

What is Big Data?

3

4

Data sources and approaches

Surveys / questionnaires

sampling theory

Administrative data sources

Where does Big Data fit in?

New methods may be needed, e.g. modeling fornowcasting and other methods not based on sampling theory

5

Potential Opportunities

¾ New statistics

¾ More detailed statistics

¾ More timely statistics

¾ Nowcasts and early indicators

¾ Quality improvement

¾ Response burden reduction

¾ Cost reduction and higher efficiency

6

Examples of possible Big Data sources

¾ Road sensor data

¾ Mobile phone location data

¾ Public social media messages

¾Websites

¾ Google Trends

¾ Satellite information

¾ Etc…

7

Data scientists involved in the research shown

¾ Piet Daas (photo)

¾May Offermans

¾Marco Puts

¾Martijn Tennekes

8

Statistics based on road sensor data

¾ Aim: Statistics on traffic intensities

¾ Characteristics of the data source

¾ Research on the usability of the data

¾ Process of using the data for statistics

¾ Issues when using traffic loop data

9

Road sensor data

¾ Source: National Data Warehouse for Traffic Information (NDW)

¾There are 20.000 traffic loops on Dutch motorways, and 40.000 on provincial roads

¾ Each minute (24/7) the number of passing vehicles is counted, and their average speed

¾ Three different length classes are distinguished

¾ No identification of vehicles¾ Around 230 million records a day used

Locations

10

The main roads

11

A special dike

12

Road sensors in the dike

13

Minute data of one sensor for 196 days

14

Researching the data

Cross correlation between sensor pairs- Used to validate metadata

Trajectory speed vs. point speed- Average speed is 98 Km/h

Small, medium-sized & large vehicles

22

Sensors in a road segment

17

Process of making traffic intensities statistics

¾ Select sensors on Dutch highways

¾ Preprocessing -­‐ Remove non-informative variables-­‐ Remove bad records-­‐ Exclude bad sensors-­‐ Quality indicators for daily data per sensor

¾ Processing-­‐ Reduce dimensions on same road and region

-­‐ Obtain number of vehicles for each road and region-­‐ For each road and region, calculate monthly traffic intensity-­‐ Use of R-Hadoop

¾ Validation and publication18

Data options

Historical database-­‐ Request data via web interface-­‐ Minute data for all highways (48 variables, Jan 2010-

April 2014: around 2.5 TB)

Data stream-­‐ Every minute, all data for all active sensors-­‐ Continuously collected

19

Road sensor data: Issues and non-issues

Non-issues:¾ Privacy¾ Data acquisition

Issues:¾Methodology

- Selectivity- Quality

¾ Infrastructural needs¾Other issues

- Skills needed- Transition from research to

regular statistics 20

References: statistical use of road sensor data

¾ Publication of statistical results (in Dutch):http://www.cbs.nl/nl-NL/menu/themas/verkeer-vervoer/publicaties/artikelen/archief/2015/a13-drukste-rijksweg.htm

¾ Explanation in English:http://www.cbs.nl/NR/rdonlyres/25CE3592-A756-42B7-BABF-C3E4C4E9375B/0/a13busiestnationalmotorwayinthenetherlands.pdf

¾ Research reference:Puts, M., Tennekes, M. and Daas, P. (2014) Using Road Sensor Data for Official Statistics: Towards a Big Data Methodology. Paper for Strata + Hadoop World, Barcelona, Spain.

21

Statistics based on mobile phone location data

¾ Why use mobile phone data for official statistics?

¾Characteristics of the data source

¾ Research on the usability of the data

¾ Issues when using mobile phone location data

¾ Solutions to the issues

22

Possible uses of mobile phone data

¾ Daytime population statistics

¾ Mobility statistics

¾ Tourism statistics

¾ Other uses

23

Mobile phone activity as a data source

¾ Nearly every person in the Netherlands has a mobile phone-­‐ Usually on them -­‐ Almost always switched on-­‐ Many people are very active during the day

¾ There is a grid of antennas with good coverage

¾ Data of a single mobile company was used-­‐ Hourly aggregates per area-­‐ Threshold of 15 events

24

Daytime population based on mobile phone data

Issues when using mobile phone data

¾ Privacy

¾ Data acquisition

¾Methodology- Representativeness- Selectivity- Quality

¾Other issues- Infrastructure- Skills needed

26

Solutions

¾ Agreement with data provider to provide onlyaggregates and apply a threshold

¾ Data provider performed analysis of mobile phoneownership characteristics

¾ A large number of analyses were made, with the regularpopulation registration data as a reference

¾ A number of assumptions had to be made

27

References: statistical use of mobile phone data

¾ Research references:

Daas, P.J.H., Puts, M.J., Buelens, B. and van den Hurk, P.A.M. (2015) Big Data as a Source for Official Statistics. Journal of Official Statistics 31(2), pp. 249-262.

Daas, P. and Burger, J. (2014) Profiling big data sources to assess their selectivity. Paper for the 2015 New Techniques and Technologies for Statistics conference, Brussels, Belgium.

28

Mobile phone data versus road sensor data

29

Statistics based on social media data

¾ Why use social media data for official statistics?

¾Characteristics of the data source

¾ Research on the usability of the data

¾ Issues when using social media data

30

Possible uses of social media data

¾ Sentiment indicators

- e.g. consumer confidence index

¾ Social indicators

- e.g. social coherence indices

¾Other uses

31

Social media

– Dutch are very active on social media!-­‐ Around 60% according to a surveyna altijd bij zich en staat vrijwel altijd aan

• Steeds meer mensen hebben een smartphone!

– Mogelijke informatiebron voor:-­‐ Welke onderwerpen zijn actueel:

• Aantal berichten en sentiment hierover

-­‐ Als meetinstrument te gebruiken voor:

• .

Map by Eric Fischer (via Fast Company)

32

The data

¾ All social media messages:- that are written in Dutch- and are public

¾ These messages are systematically and instantly collected by the Dutch firm Coosto

33

¾ Dataset of more than 3.5 billion messages:- covering June 2010 till the present- between 3-4 million new messages added per day

Research question

C an we replicate the consumer confidence index by only using social media data, while reducing production time?

34

Sentiment determination

¾ ‘Bag  of  words’  approach- list of Dutch words with their associated sentiment - added social  media  specific  words  (‘FAIL’,  ‘LOL’,  ‘OMG’  etc.)

¾ Use overall score to determine sentiment- is either positive, negative or neutral

¾ Average sentiment per period (day / week / month)- (#positive - #negative)/#total * 100%

35

Sentiment per platform

(~10%) (~80%)

Build a model

Idea: Fitting characteristics derived from social media messages to consumer confidence

Success: If correlation can be found that is high andremains high, that is, has predictive power

37

38

Figure 1. Development of daily, weekly and monthly aggregates of social media sentiment from June 2010 until November 2013, in green, redand black, respectively. In the insert the development of consumer confidence is shown for the identical period.

Results

¾ High correlation achieved (0.9)

¾ Changes in consumer confidence preceed changes in sentiment by one week

¾ Short processing time, so time-to-market may bereduced.

¾ Sentiment index can be produced on a weekly basis

¾ To be considered:

- Use model-based figures as early indicators- Reduce sampling of consumer confidence index

39

General sentiment indicator (draft version)

40

Issues when using social media data

Lesser issues: ¾ Privacy¾ Data acquisition

Main issues:¾Methodology

- Selectivity- Meaning of the data- Validity of methods used

¾Other issues- Skills needed

41

Questions on the validity of methods used

¾ Is it acceptable, under certain conditions, to base official statistics on correlations?

¾ If so, what are the conditions?

¾What to do if there is a shock?

42

Reference: statistical use of social media data

¾ Research reference:

Daas, P.J.H. and Puts, M.J.H. (2014) Social Media Sentiment andConsumer Confidence. European Central Bank Statistics Paper Series No. 5, Frankfurt, Germany.

43

Big Data Characteristics

44

Definition: ¾ Volume¾ Velocity¾ Variety

Data characteristics: ¾ Unstructured data¾ Selectivity¾ Population dynamics¾ Event data¾ Organic data¾ Distributed data

Data use: ¾ Other ways of processing¾ Fundamentally new applications

Overview of Issues

¾ Getting access to the data

¾ Usability of the data

- Meaning of the data, stability of the source, reproducability

¾ Methodologal issues

- Selectivity, representativeness, unknown population, quality and validity

¾ Privacy, confidentiality and reputation

¾ IT-infrastructure and security

¾ Knowledge and skills

¾ Transition from research to production

¾ Strategic challenges45

Possible responses to the issues

¾ Invest in good relations with the data provider

¾ Invest in methodological research and play with the data to get a grip on quality

¾ Use only aggregate data if possible

¾ Explore alternatives to population-based estimationmethods

¾ Keep an open mindset

¾ Take the strategic challenges seriously

46

Strategic aspects

¾ Others start producing statistics- there may be quality issues- but they are extremely rapid- and there is obviously demand

¾ Need for good, impartial information(benchmark information) will remain

- without a monopoly for NSIs

¾ There is a need for validation of information produced by others

47

Billion Prices Project MIT

48

49

The Roadmap Approach

¾ Awareness that Big Data is a strategic issue

¾ Position paper for Board of Directors

¾ Roadmap Big Data

¾ External validation of the Roadmap

¾ Roadmap updated twice a year for Board of Directors

¾ Roadmap monitor

¾ Deputy Director General responsible at strategic level

¾ Coordination group for Big Data

50

The Scope of the Roadmap

¾ Identification of outputs to be based on Big Data

¾ For each output, definition of time target and ownership

¾ Identification by owner of conditions to be fulfilled

¾ Commitment by supporting services for fulfilling the

conditions  (IT,  data  collection,  methodological  support,  …)

¾ Supporting programmes

51

The Roadmap Projects

Focus projectsRoad sensor data for traffic intensities statisticsMobile phone data for daytime population statistics

Other projectsInternet data for price statisticsFinancial transactions data for statisticsSocial media data for detecting trends in social cohesionInternet data for encoding enterprise purchases and sales

52

Supporting Programmes

Big Data features in:

¾ Innovation programme

¾Methodological research programme

53

Cooperation and Collaboration on Big Data

Statistics Netherlands works together with:

¾ Other NSIs¾ UN, UNECE, EU, WorldBank¾ ESSnet on Big Data (to be confirmed)¾ Government organisations¾ Universities and research organisations¾ Data providers¾ IT providers¾ Big Data firms¾ Research consortia (e.g. H2020)

54

UNECE Big Data Activities

– Classification of Big Data sources– Big Data project in 2014, with three Task Teams:

-­‐ Partnerships-­‐ Privacy-­‐ Quality

– Sandbox in 2014, 2015 and possibly beyond– Big Data survey, together with UNSD

Results: http://www1.unece.org/stat/platform/display/bigdata/2014+Project

55

UN Big Data Activities

¾ Global Working Group on Big Data for Official Statistics with eight Task Teams:

-­‐ Mobile phone data-­‐ Satellite imagery-­‐ Social media data-­‐ Access / partnerships-­‐ Advocacy / communication-­‐ Big Data and SDGs-­‐ Training / skills / capacity building-­‐ Cross-cutting issues

¾ UNSD survey on Big Data for official statistics

56

Draft Big Data Access Principles (UN)

¾ Social responsibility

¾ Level playing field

¾ Equal treatment

¾Confidentiality and security

¾Transparency

¾ Respect for business interest

¾ Proportionality

57

Conclusion: The Way Forward

¾ Get to know Big Data

¾ Use Big Data for efficiency andresponse burden reduction

¾ Use Big Data for early indicators

¾ Use Big Data for filling gaps and new demands

¾ Use new professional methodswhere needed

¾ Create the right environment

¾ Don’t do it alone!58

General references

Glasson, M., Trepanier, J., Patruno, V., Daas, P., Skaliotis, M. and Khan, A. (2013) What does "Big Data" mean for Official Statistics? Paper for the High-Level Group for the Modernization of Statistical Production and Services, March 10.

Struijs, P., Braaksma, B. and Daas, P. (2014) Official Statistics and Big Data. Big Data & Society, April–June, pp. 1–6.

Struijs, P. and Daas, P.J.H. (2013) Big Data, Big Impact? Paper for the Seminar on Statistical Data Collection, Geneva, Switzerland.

Struijs, P. and Daas, P. (2014) Quality Approaches to Big Data in Official Statistics. Paper for the European Conference on Quality in Official Statistics 2014, Vienna, Austria. 59

The Future

60

Questions?

Thank you for your attention!

p.struijs@cbs.nl61

top related