Big Data (and official statistics)

Piet Daas and Mark van der Loo*, Statistics Netherlands. MSIS 2013, April 25, Paris. * With contributions of: Edwin de Jonge and Paul van den Hurk.


Page 1: Big Data (and  official statistics)

Big Data (and official statistics)

Piet Daas and Mark van der Loo*

Statistics Netherlands

MSIS 2013, April 25, Paris

* With contributions of: Edwin de Jonge and Paul van den Hurk


Page 2: Big Data (and  official statistics)

Overview

• What’s Big Data?
  • Definition and the 3 V’s
• Can Big Data be used for official statistics?
  • Examples from Statistics Netherlands
• Future challenges
  • What has to change?

Page 3: Big Data (and  official statistics)

Data, data everywhere!

Page 4: Big Data (and  official statistics)

What is Big Data?

• According to a group of experts
  Big Data are data sources that can, generally, be described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.”
• According to a user
  “Data so big that it becomes awkward to work with”

Page 5: Big Data (and  official statistics)

The 3 most important characteristics of Big Data

• Velocity: rapid availability
• Volume: amount
• Variety: complexity (unstructured data, text)

Page 6: Big Data (and  official statistics)

3 Big Data case studies

Can Big Data be used for official statistics?

Examples from Statistics Netherlands
1. Traffic loop detection data (100 million records/day)
   • Traffic & transport statistics
2. Mobile phone data (35 million records/day)
   • Day time population, tourism
3. Dutch social media messages (1 to 2 million messages/day)
   • Topics and sentiment

Page 7: Big Data (and  official statistics)

1. Traffic loop detection data

• Traffic ‘loops’
  • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
  • Total vehicles and in different length classes
• Interesting source to produce traffic and transport statistics (and more)
  • Huge amounts of data, about 100 million records a day

[Figure: map of sensor locations]

Page 8: Big Data (and  official statistics)

Number of detected vehicles on a single day

[Figure: vehicles detected by all loops on a single day. Total = ~295 million]

Page 9: Big Data (and  official statistics)

Traffic loop detection activity (only first 10 min.)

Page 10: Big Data (and  official statistics)

Correct for missing data

• ‘Corrected’ data (for blocks of 5 min)

Before: Total = ~295 million
After: Total = ~330 million (+12%)
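The slides do not say how the 5-minute correction was done. As a minimal sketch of one possible rule, assumed here purely for illustration, each 5-minute block with gaps can be scaled up by the share of minutes that were actually observed. The function name and the sample readings are hypothetical.

```python
from typing import Optional, Sequence

BLOCK = 5  # minutes per block, as stated on the slide


def corrected_block_totals(
    minute_counts: Sequence[Optional[int]],
) -> list[Optional[float]]:
    """Aggregate per-minute counts into block totals, scaling up blocks with gaps.

    Assumed rule (not taken from the slides): a block with k observed minutes
    out of n gets its observed sum multiplied by n / k. A block without any
    observation stays None.
    """
    totals: list[Optional[float]] = []
    for start in range(0, len(minute_counts), BLOCK):
        block = minute_counts[start:start + BLOCK]
        observed = [c for c in block if c is not None]
        if not observed:
            totals.append(None)
        else:
            totals.append(sum(observed) * len(block) / len(observed))
    return totals


# Hypothetical readings of one sensor over 10 minutes (None = missing minute).
readings = [12, 10, None, 11, 13, 9, None, None, 8, 7]
print(corrected_block_totals(readings))  # [57.5, 40.0]

# Sanity check of the slide's figures: 295 million -> 330 million.
print(round((330 - 295) / 295 * 100))  # 12 (percent)
```

Working on blocks rather than single minutes also shrinks the data by a factor of five before any further processing.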

Page 11: Big Data (and  official statistics)

For different vehicle lengths

• Small vehicles: <= 5.6 m
• Medium sized vehicles: > 5.6 m & <= 12.2 m
• Large vehicles: > 12.2 m

Length classifications:
• 1 category: Total
• 3 categories: Total; <= 5.6 m; > 5.6 & <= 12.2 m; > 12.2 m
• 5 categories: Total; > 1.85 & <= 2.4 m; > 2.4 & <= 5.6 m; > 5.6 & <= 11.5 m; > 11.5 & <= 12.2 m; > 12.2 m
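The three-class split uses the thresholds 5.6 m and 12.2 m. A small sketch of that rule follows; the function name and the sample lengths are hypothetical, only the thresholds come from the slide.

```python
from collections import Counter


def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to the three classes on the slide."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


# Hypothetical vehicle lengths in metres.
lengths = [4.2, 5.6, 7.9, 12.2, 16.5, 3.8]
print(Counter(length_class(x) for x in lengths))
```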

Page 12: Big Data (and  official statistics)

Small vehicles

~75% of total

Page 13: Big Data (and  official statistics)

Small & medium vehicles

Page 14: Big Data (and  official statistics)

Small, medium & large vehicles

Page 15: Big Data (and  official statistics)

2. Mobile phone data

• Nearly every person in the Netherlands has a mobile phone
  • On them and almost always switched on!
  • An increasing number of people have a smartphone
• Ideal source of information
  • Use mobile phone data of mobile phone companies to study:
    • Travel behaviour (‘day time’ population)
    • Tourism (new phones that register to the network)
    • Crowd info (for example during events)

Page 16: Big Data (and  official statistics)

Travel behaviour of mobile phones

Mobility of very active mobile phone users
- during a 14-day period
- data of a single mobile phone company

Based on:
- Call and text activity multiple times a day
- Location based on phone masts

Clearly selective:
- Includes major cities
- But the North and South-east of the country much less
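Selecting "very active" phones means going from events (calls, texts) to units (phones). The slide gives no exact rule, so the sketch below assumes one: a phone qualifies if it has at least `min_per_day` events on every day of the period. The threshold of 2, the function name and the sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable


def very_active_phones(
    events: Iterable[tuple[str, int]],
    n_days: int = 14,
    min_per_day: int = 2,  # assumed reading of "multiple times a day"
) -> set[str]:
    """Return the phones with at least min_per_day events on each of n_days days.

    Each event is a (phone_id, day_index) pair, with day_index in 0..n_days-1.
    """
    per_phone_day = Counter(events)  # (phone_id, day) -> number of events
    phones = {phone for phone, _ in per_phone_day}
    return {
        phone
        for phone in phones
        if all(per_phone_day[(phone, day)] >= min_per_day for day in range(n_days))
    }


# Hypothetical events: phone_a is used 3 times a day, phone_b once a day.
events = [("phone_a", day) for day in range(14) for _ in range(3)]
events += [("phone_b", day) for day in range(14)]
print(very_active_phones(events))  # {'phone_a'}
```

A filter like this is itself a source of selectivity, which matches the slide's remark that the resulting picture is clearly selective.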

Page 17: Big Data (and  official statistics)

3. Social media messages

• Dutch are very active on social media platforms
  • Almost always carried and almost always switched on
  • More and more people have a smartphone!
• Possible source of information for:
  • Which topics are current:
    • Number of messages and sentiment about them
• Usable as a measuring instrument for: .

Map by Eric Fischer (via Fast Company)

Page 18: Big Data (and  official statistics)

3. Social media messages

• The Dutch are very active on social media platforms
• Potential information source for:
  • Topics discussed and sentiment about these topics (quickly available!) and probably more?
• Investigate it to find out its potential use

3a. Content:
- Collected Dutch Twitter messages for study: ‘selection’ of 12 million

3b. Sentiment:
- Sentiment in Dutch social media messages: ‘all’ ~2 billion

Page 19: Big Data (and  official statistics)

Social media: Dutch Twitter topics

[Figure: chart of topic shares in 12 million messages, with segment shares of 46%, 10%, 7%, 3%, 5%, 7%, 3% and 3%]

Page 20: Big Data (and  official statistics)

Sentiment in social media

• Access to Coosto database
  • ~2 billion publicly available messages
  • Twitter, Facebook, Hyves, web forums, blogs etc.
• Sentiment of each message
  • Positive, negative or neutral
• Interesting finding
  • Looked at the so-called ‘Mood of the nation’ compared to the Consumer confidence of Statistics Netherlands

Page 21: Big Data (and  official statistics)

Consumer confidence, survey data

[Figure: sentiment towards the economic climate. Vertical axis: (pos - neg) as % of total. ~1000 respondents/month]
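The vertical axis, (pos - neg) as % of total, is a balance indicator. A minimal sketch of that calculation follows; it assumes that "total" includes neutral answers, which the slide does not state, and the respondent counts are hypothetical.

```python
def balance_pct(pos: int, neg: int, neutral: int = 0) -> float:
    """Positive minus negative answers, as a percentage of all answers.

    Assumption: the total includes neutral answers.
    """
    total = pos + neg + neutral
    if total == 0:
        raise ValueError("no answers to compute a balance from")
    return 100.0 * (pos - neg) / total


# Hypothetical month with 1000 respondents.
print(balance_pct(pos=300, neg=450, neutral=250))  # -15.0
```

The same function applies unchanged to message counts by sentiment, which makes the survey series and a social media series directly comparable.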

Page 22: Big Data (and  official statistics)

Sentiment in social media messages

[Figure: sentiment towards the economic climate & social media message sentiment. Vertical axis: (pos - neg) as % of total. ~25 million messages/month. Corr: 0.88]
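The slide reports a correlation of 0.88 between the two series but not the type of coefficient. Assuming it is the usual Pearson correlation, a sketch of the computation on two monthly series could look like this. The series values below are hypothetical and are not the data behind the 0.88.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


# Hypothetical monthly balances (pos - neg, as % of total).
consumer_confidence = [-20.0, -25.0, -31.0, -28.0, -35.0, -38.0]
social_media_sentiment = [4.0, 2.5, 1.0, 1.5, -0.5, -1.0]
print(round(pearson(consumer_confidence, social_media_sentiment), 2))
```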

Page 23: Big Data (and  official statistics)

Challenges: Big Data and statistics

• Legal
  • Is access routinely allowed (not only for research)?
• Privacy
  • With more and more data, privacy demands increase
  • We have to be careful here!
• Costs
  • In the Netherlands we don’t pay for admin data.
  • Should we pay for Big Data?
• Manage
  • Who owns the data? Stability of delivery/source
  • Because of its volume, run queries in the database of the data source holder

Page 24: Big Data (and  official statistics)

Challenges: Big Data and statistics (2)

• Methodological
  • Big Data sources register events, not units, and they are selective!
  • Methods & models specific for large datasets (fast and ‘robust’)
  • Try to ‘make big data small’ ASAP (noise reduction)
• Technological
  • Learn from ‘computational statistical’ research areas
  • High Performance Computing needs, parallel processing
• People
  • Need ‘data scientists’ (statistically minded people with programming skills who are curious)
  • Who are able to think outside the traditional sample survey based paradigm!
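One way to read ‘make big data small’ is to aggregate records while they stream in, so the raw records never have to be held in memory. The sketch below illustrates that idea only; it is not the method used at Statistics Netherlands, and the record layout (sensor id, minute of day, vehicle count) is an assumption.

```python
from collections import defaultdict
from typing import Iterable

# Assumed record layout: (sensor_id, minute_of_day, vehicle_count)
Record = tuple[str, int, int]


def hourly_totals(records: Iterable[Record]) -> dict[tuple[str, int], int]:
    """Reduce per-minute records to per-sensor, per-hour totals in one pass."""
    totals: dict[tuple[str, int], int] = defaultdict(int)
    for sensor_id, minute, count in records:
        totals[(sensor_id, minute // 60)] += count
    return dict(totals)


def sample_stream() -> Iterable[Record]:
    """Hypothetical stream; in practice this would read from a file or a feed."""
    yield ("A", 0, 3)
    yield ("A", 59, 2)
    yield ("A", 60, 4)
    yield ("B", 61, 1)


print(hourly_totals(sample_stream()))
# {('A', 0): 5, ('A', 1): 4, ('B', 1): 1}
```

Because the totals are simple sums, separate chunks of records can be reduced independently and their results added afterwards, which fits the parallel processing mentioned under the technological challenges.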

Page 25: Big Data (and  official statistics)

The future of Stat Neth?