Dealing with Big Data for Official Statistics: IT Issues

Giulio Barcaroli, Stefano De Francisci, Monica Scannapieco, Donato Summa

Istat – Italian National Institute of Statistics

Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues , MSIS 2014 2

Outline

1. Background: Istat Big Data strategy and experimental projects

2. IT issues in experimental projects

3. Final remarks

Istat Big Data Strategy - 1

Istat (the Italian National Institute of Statistics) set up a technical Commission with the objective of orienting investments on Big Data adoption in statistical production processes

Duration: from February 2013 to February 2015

Members coming from different areas: Official Statistics, Academia, Private Sector

Objective of the talk

I will NOT deal with (just) technological issues

I will deal instead (mainly) with IT methodological issues

Example:
• MapReduce-Hadoop: Open Source framework
• Map-Reduce programming model: are Official Statistics (OS) methods mapreduce-able? Current research is investigating the mapreduce-ability of (classes of) computational problems
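
To make the "mapreduce-able" question concrete, here is a minimal sketch, not taken from the talk, of a statistic that fits the Map-Reduce programming model: mean and variance computed from per-partition summaries that can be merged in any order. The function names and the sample partitions are illustrative assumptions.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return (n, s, ss)

def combine(a, b):
    """Reduce step: summaries add component-wise (associative and commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    n, s, ss = reduce(combine, map(map_partition, partitions))
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return mean, variance

# Illustrative data split across three hypothetical worker nodes.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

The approach works because the partial summaries can be merged without revisiting the raw data. By contrast, an exact median cannot be assembled from per-partition medians alone, which is the kind of difficulty the mapreduce-ability question refers to.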

Istat Big Data Strategy - 2

The Commission will release a strategy for Big Data adoption

Three experimental projects launched and monitored by the Commission:
• Persons and Places
• Labour Market Estimation based on Google Trends
• ICT Usage in enterprises based on Internet as a Data Source (IaD)

Status: advanced implementation (first results already available)

Persons and Places

Purpose
• Production of the origin/destination matrix of daily mobility for purposes of work and study, at the spatial granularity of municipalities, starting from phone (tracking) data

Actors involved in the project
• Istat
• National Research Council
• University of Pisa

Methodology
• Inference of population mobility profiles from GSM Call Data Records (CDRs)
• Comparison with data derived from administrative sources
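
As a rough illustration of how an origin/destination matrix could be derived from call records, the sketch below uses a simple heuristic that is an assumption of this example, not the project's method: a user's origin is the municipality seen most often at night, the destination the one seen most often in working hours. The record layout `(user_id, hour_of_day, municipality)` and the hour ranges are hypothetical.

```python
from collections import Counter, defaultdict

NIGHT_HOURS = set(range(0, 7)) | set(range(20, 24))  # assumed "at home" hours
WORK_HOURS = set(range(9, 18))                       # assumed "at work/study" hours

def most_common_place(places):
    """Most frequent municipality in a list, or None for an empty list."""
    return Counter(places).most_common(1)[0][0] if places else None

def od_matrix(cdrs):
    """cdrs: iterable of (user_id, hour_of_day, municipality) tuples."""
    night = defaultdict(list)
    day = defaultdict(list)
    for user, hour, municipality in cdrs:
        if hour in NIGHT_HOURS:
            night[user].append(municipality)
        elif hour in WORK_HOURS:
            day[user].append(municipality)
    matrix = Counter()
    for user in night:
        origin = most_common_place(night[user])
        destination = most_common_place(day.get(user, []))
        if destination is not None:  # skip users never seen in working hours
            matrix[(origin, destination)] += 1
    return matrix

# Made-up records, for illustration only.
cdrs = [
    ("u1", 22, "Pisa"), ("u1", 6, "Pisa"), ("u1", 10, "Firenze"), ("u1", 15, "Firenze"),
    ("u2", 23, "Pisa"), ("u2", 11, "Pisa"),
    ("u3", 21, "Lucca"), ("u3", 12, "Pisa"),
    ("u4", 10, "Roma"),  # no night-time record, so no origin can be assigned
]
matrix = od_matrix(cdrs)
```

Each cell `matrix[(origin, destination)]` counts users, so the result can be compared cell by cell with a matrix built from administrative sources.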

Labour Market Estimation

Purpose
• Test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain

Actors involved in the project
• Istat: Central Methodology Sector and Labour Force Survey

Methodology
• Autoregressive model vs. usage of Google Trends as prediction models
• Comparison extended to macroeconomic prediction models
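
The comparison named in the methodology can be sketched as two least-squares fits: a baseline AR(1) model, and the same model with a search-volume index added as an extra regressor. This is only an illustration under stated assumptions: both series are made up, the model order is arbitrary, and the helper functions are written here for the example rather than taken from the project.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def rss(rows, y, beta):
    """Residual sum of squares of the fitted model."""
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, y))

# Made-up monthly series, for illustration only.
unemployment = [8.1, 8.3, 8.2, 8.6, 8.9, 8.8, 9.1, 9.4, 9.3, 9.6, 9.9, 10.1]
search_index = [40, 44, 41, 50, 55, 52, 58, 63, 60, 66, 71, 74]

y = unemployment[1:]
# Baseline: y_t = a + b * y_{t-1}
ar_rows = [[1.0, prev] for prev in unemployment[:-1]]
# Augmented: y_t = a + b * y_{t-1} + c * search_t (current-month index, i.e. nowcasting)
gt_rows = [[1.0, prev, gt] for prev, gt in zip(unemployment[:-1], search_index[1:])]

rss_ar = rss(ar_rows, y, ols(ar_rows, y))
rss_gt = rss(gt_rows, y, ols(gt_rows, y))
```

The in-sample residual sum of squares of the augmented model can never exceed the baseline's, because the models are nested. A meaningful evaluation would therefore compare out-of-sample prediction errors instead.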

ICT Usage in Enterprises

Purpose
• Evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions

Actors involved in the project
• Istat: Survey on the ICT Usage in Enterprises
• Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)

Methodology
• Scraping of web sites for data extraction
• Supervised classification task
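
The data-extraction step of scraping can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the project's scraper: it only pulls visible text out of an HTML string, and the sample page is invented. Downloading pages, respecting robots rules and handling errors are deliberately left out.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text found outside <script> and <style> elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def page_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Invented sample page, for illustration only.
sample = (
    "<html><head><title>ACME</title><style>p{color:red}</style></head>"
    "<body><p>Buy online</p><script>var x=1;</script>"
    "<a href='/cart'>Cart</a></body></html>"
)
text = page_text(sample)
```

The extracted text is what a later text-mining or classification step would take as input.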

Features of Experimental Projects

1: Persons & Places
• Data source: mobility data
• Scenario (impact on the production process): deep impact, source replaces traditional sampling and collection
• Key technologies: machine learning libraries; MapReduce/Hadoop (future)

2: Google Trends
• Data source: Web search records
• Scenario (impact on the production process): considerable impact, estimation phase
• Key technologies: Google Trends

3: ICT Usage
• Data source: Web data extraction
• Scenario (impact on the production process): limited impact, subset of data gathered by using IaD
• Key technologies: scraping; NoSQL; machine learning libraries; MapReduce/Hadoop (future?)

Statistical Phases for Big Data Management

• Principal selected phases
• Inversion due to the fact that the “traditional” design phase is no longer present for Big Data
• Collapse due to the fact that the same methods can be used for both phases
• Other phases, e.g. Dissemination, not (yet) involved in Big Data

[Diagram labels: Inversion of the two phases; Collapsed phases]

Collect: IT issues - 1

Access to Big Data sources:
• Type 1: Access control mechanisms that the Big Data provider deliberately sets up, and/or
• Type 2: Technological barriers

Google Trends:
• Absence of APIs, which prevents a software program from accessing GT data
• Not possible to foresee the usage of such a facility in production processes

Collect: IT issues - 2

ICT Usage: both Type 1 and Type 2 problems
• 8,647 URLs of enterprises’ Web sites, but only about 5,600 (roughly 65%) were actually accessed
• Type 1: Scrapers deliberately blocked, e.g. by mechanisms in place to verify human access to sites, like CAPTCHA
• Type 2: Usage of some technologies, like Adobe Flash, simply prevented access to the content

Design: IT Issues - 1

Even if a traditional survey design cannot take place, the problem of “understanding” the data is still present

Semantic extraction techniques
• Knowledge representation and natural language processing
• E.g.: FRED (http://wit.istc.cnr.it/stlab-tools/fred) extracts an ontology from sentences in natural language

Design: IT Issues - 2

ICT Usage: human inspection refined by:
• Some NLP techniques: tokenization, stemming, stopword removal, etc.
• Semantic enrichment by semantic dictionaries (WordNet)

Images: tag extraction
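
The NLP steps listed above (tokenization, stopword removal, stemming) can be sketched in a few lines. This is a toy illustration under stated assumptions: the stopword list is a small hand-picked set, and the stemmer just strips a few English suffixes; it is not the Porter algorithm or whatever tooling the project actually used.

```python
import re

# Small illustrative stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "our", "for", "on", "with"}
# Toy suffix list, checked in order; a real stemmer is considerably more careful.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

tokens = preprocess("Ordering products online is supported in our shops")
# tokens == ["order", "product", "online", "support", "shop"]
```

The resulting tokens are the kind of normalised terms that can then be counted and passed to a classifier as features.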

Process/Analyse: IT Issues - 1

Big size, possibly solvable by Map-Reduce algorithms

Model absence, possibly solvable by learning techniques

Privacy constraints, solvable by privacy-preserving techniques

Process/Analyse: IT Issues - 2

Map-Reduce algorithms: problem of formulating algorithms that can be implemented according to such a paradigm -> “mapreduce-ability”

Recent state-of-the-art Map-Reduce algorithms for:
• Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
• Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering
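
As one example of how an algorithm is recast in this paradigm, the sketch below expresses a single iteration of the standard k-means update (Lloyd's algorithm) as a map phase and a reduce phase. It is a simplified single-process illustration, not one of the state-of-the-art algorithms referred to above; the function names and data are assumptions of this example.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_phase(partition, centroids):
    """Map: emit (centroid index, (point, 1)) for every point in one partition."""
    return [(nearest(p, centroids), (p, 1)) for p in partition]

def reduce_phase(pairs, centroids):
    """Reduce: per centroid index, add up points and counts, then average."""
    dim = len(centroids[0])
    totals = {i: ([0.0] * dim, 0) for i in range(len(centroids))}
    for idx, (point, count) in pairs:
        vec, n = totals[idx]
        totals[idx] = ([v + p for v, p in zip(vec, point)], n + count)
    # Keep the old centroid when a cluster received no points.
    return [tuple(v / n for v in vec) if n else centroids[i]
            for i, (vec, n) in sorted(totals.items())]

def kmeans_step(partitions, centroids):
    """One k-means iteration over data split into partitions."""
    pairs = [pair for part in partitions for pair in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)

# Two hypothetical data partitions and two starting centroids.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
# new_centroids == [(0.0, 1.0), (10.0, 11.0)]
```

Each iteration needs only the current centroids to be shared with every mapper, which is why the assignment and averaging steps parallelise well. The iteration loop itself still runs sequentially.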

Process/Analyse: IT Issues - 3

Persons and Places:
• Match mobility-related data with data stored in Istat archives
• Record linkage problem should be solved (future task)

Model absence: neither survey-based nor “traditional” model-based approaches are directly applicable to Big Data
• Possible usage of machine learning techniques

Process/Analyse: IT Issues - 4

ICT Usage:
• Supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)

Persons and Places:
• Unsupervised learning technique, namely SOM (Self Organizing Map), to learn mobility profiles
• E.g. “free city users” vs. “embedded city users” (more confidently estimated by deterministic constraints)
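
To illustrate the supervised YES/NO task, the sketch below trains a decision stump, that is a classification tree of depth one, which is the simplest member of the tree family named above. It is an illustration only: the binary features ("cart", "login") stand for hypothetical keyword indicators extracted from a web site, and the label stands for a hypothetical questionnaire answer such as "sells online".

```python
def train_stump(rows, labels):
    """Pick the single binary feature (and polarity) with the fewest training errors."""
    best = None
    for feature in rows[0]:
        for yes_when in (True, False):
            errors = sum((row[feature] == yes_when) != label
                         for row, label in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, feature, yes_when)
    _, feature, yes_when = best
    return feature, yes_when

def predict(stump, row):
    """Answer YES (True) when the chosen feature has the chosen value."""
    feature, yes_when = stump
    return row[feature] == yes_when

# Hypothetical training set: keyword indicators per web site, and the survey answer.
rows = [
    {"cart": True, "login": True},
    {"cart": True, "login": False},
    {"cart": False, "login": True},
    {"cart": False, "login": False},
]
labels = [True, True, False, False]  # True = YES

stump = train_stump(rows, labels)
answer = predict(stump, {"cart": True, "login": False})
```

Random forests and adaptive boosting both combine many trees, and boosting is often run with stumps like this one as its base learner.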

Process/Analyse: IT Issues - 5

Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data

Privacy-preserving data integration, e.g. [DMKM-2004]

Privacy-preserving data mining, e.g. [TKDE - 2004]

Persons and Places: Anonymous matching of CDRs with Istat archives via privacy-preserving record linkage
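
One simple way to picture anonymous matching is exact linkage on keyed hashes: each data holder replaces identifiers with an HMAC computed under a shared secret key, and only the pseudonyms are compared. The sketch below is an assumption-laden illustration, not the technique of the cited references or of the project: the identifiers, payloads and key are invented, and the scheme does not tolerate typos or variant spellings of an identifier.

```python
import hashlib
import hmac

def pseudonym(identifier, secret_key):
    """Keyed hash (HMAC-SHA256) of a normalised identifier."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()

def pseudonymise(records, secret_key):
    """Each data holder runs this locally: {pseudonym: payload}."""
    return {pseudonym(ident, secret_key): payload for ident, payload in records}

def link(hashed_a, hashed_b):
    """Join two pseudonymised datasets on the pseudonyms they share."""
    return sorted((hashed_a[h], hashed_b[h]) for h in hashed_a.keys() & hashed_b.keys())

# Invented key and records, for illustration only.
key = b"shared-secret-for-illustration"
operator = pseudonymise([("ID-001 ", "profile: commuter"),
                         ("id-002", "profile: resident")], key)
archive = pseudonymise([("id-001", "municipality: Pisa"),
                        ("id-003", "municipality: Lucca")], key)
matches = link(operator, archive)
# matches == [("profile: commuter", "municipality: Pisa")]
```

The party performing the join sees only pseudonyms and payloads, never the raw identifiers. The protection depends on the key staying secret and on the payloads themselves not being identifying.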

Concluding Remarks

• Illustration of some IT issues considered relevant for Big Data adoption by OS, on the basis of practical experiences
• Technology is probably not an issue, but IT methodology is!
• Some IT issues also share some statistical methodological aspects
• Other relevant IT issues: event data management, on-line analytics

Thank you for the attention!