wp7 multi domains - europa · wp7 multi domains the aim ... data sources: social...


Post on 20-Sep-2018


Page 1: WP7 Multi Domains - Europa · WP7 Multi Domains The aim ... Data sources: Social media/Blogs/Internet portals Methodology: Webscraping, Data/Text/Web mining, Machine learning. As

State of affairs

Łukasz Błaszczyk & Anna Nowicka

WP7 Multi Domains


Agenda

• General information on WP7

• SGA-1

• SGA-2

• Future perspectives


WP7 Multi Domains TEAM

WP7 is led by GUS (Statistics Poland). The other ESSnet Big Data partners carrying out this work package are CBS (Statistics Netherlands), CSO (Statistics Ireland) and ONS (Statistics United Kingdom).

ESSnet BIG DATA work packages:

• WP0: Co-ordination

• WP1: Webscraping / Job Vacancies

• WP2: Webscraping / Enterprise Characteristics

• WP3: Smart Meters

• WP4: AIS Data

• WP5: Mobile Phone Data

• WP6: Early Estimates

• WP7: Multi Domains

• WP8: Methodology

• WP9: Dissemination

Portugal joined the team from SGA-2 (March 2017).

WP7 Multi Domains TEAM

Anna Nowicka, Leader cooperation

Janusz Dygaszewicz, Project Manager of Polish work

Jacek Maślankowski, Coordinator of methodology

Country team for each domain

Population

• Regional statistical office in Poznań

• Regional statistical office in Bydgoszcz

Tourism/border crossing

• Regional statistical office in Rzeszów

• Department of Social Research

Agriculture

• Department of Agriculture

• Regional statistical office in Olsztyn

PARTNERS

• John Sheridan, Sinead Bracken

• Piet Daas

• Alessandra Sozzi, Nigel Swier, Leone Wardman

• Rui Alves, Sónia Quaresma

WP7 Multi Domains

The aim

• to investigate how a combination of big data sources and existing official statistical data can be used to improve current statistics and create new statistics in statistical domains.

• the work package focusses on the statistical domains: Population, Tourism/border crossings and Agriculture.

• the work package team will describe the data collection, data linking, data processing and methodological aspects when combining data in statistical domains.

Challenges ahead are:

•representativity issues, linking to other datasets, metadata, international comparability and long lasting solutions with sustainable cost.


General summary of WP7 work done


Annotated bibliography

• Several research papers and statistical reports relevant for WP7 work, e.g.:

• Social Media Sentiment and Consumer Confidence: Piet J.H. Daas and Marco J.H. Puts;

• Experiment report: Social Media - Sentiment Analysis, UNECE, created by: Antonino Virgillito, modified by Steven Vale;

• Twitter Sentiment Classification using Distant Supervision; Alec Go, Richa Bhayani, Lei Huang.


Work done - general overview

• Brainstorming on data sources

• Questionnaire on different aspects of Big Data implementation – e.g., data access, data quality, combining data, methodology

• Final use cases preparation

• Several videoconferences

• Annotated bibliography

• Pre-Pilot use cases implementation

SGA-2 perspectives

• Extend the scope of pilot surveys

• Combining data within a domain as well as inter-domain data combination

• Sharing general framework

To achieve its main goals under SGA-2, WP7 carries out experimental work in the form of the following three case studies.

POPULATION – three pilots

Responsibility: PL – coordinator, supported by UK, PT.

Data sources: Social media/Blogs/Internet portals.

Methodology: Webscraping, Data/Text/Web mining, Machine learning. As the data sources are selective, i.e. they only cover units that put text on social media and the internet, the methodology will aim at yielding valid information for the population as a whole. Use will be made of methods described in the literature (such as research on the use of public social media messages done in the Netherlands).

Three pilots:

1) to examine the level of daily satisfaction of the population by analyzing the content of messages for the presence of defined expressions describing emotional states, e.g., happiness, joy, sadness, fear, anger;

2) to present the moods of the population associated with various public events;

3) to observe morbidity areas, e.g., flu.

Plan of Combining Datasets: (1) combine in one repository the selected data from all Big Data sources, (2) compare with the results of social studies to add more detailed information, (3) supplement the information gained in social studies.

Main benefits and value added for official statistics: support the traditional European Social Survey; supplement the research methodology for phenomena that are difficult to measure through traditional polls.

Framework used in the PL Population Use Case (individuals' well-being) to scrape and process Twitter data

We use the following classification, based on emotional states from the European Social Survey, the Social Cohesion Survey and EU Statistics on Income and Living Conditions (EU-SILC):

• szczęśliwy (happy),

• neutralny (neutral),

• spokojny (calm),

• zdenerwowany (upset),

• przygnębiony (depressed),

• zniechęcony (discouraged),

• nie wiem (indeterminate).

The task is divided into four stages, presented in the figure below.

[Figure: processing framework. Elements shown: Twitter data, Tweepy, data extracting, training dataset, labels, feature vectors, machine learning algorithm (Sklearn), predictive model, result set.]
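The elements of the framework (labelled training dataset, feature vectors, learning algorithm, predictive model, result set) can be illustrated with a minimal sketch. It is not the WP7 implementation: the use case relies on Tweepy and Sklearn, whereas this sketch uses only the Python standard library and a hand-written multinomial naive Bayes as a stand-in for the learning algorithm. The messages, labels and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Turn a message into lower-case word tokens (a crude feature extractor)."""
    return text.lower().split()

def train(messages, labels):
    """Count words per label; these counts act as the 'predictive model'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # class prior
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training dataset (two of the seven classes, English text for brevity)
training_messages = [
    "what a great sunny day",
    "i am so happy today",
    "this traffic makes me angry",
    "i am upset about the delay",
]
training_labels = ["happy", "happy", "upset", "upset"]

model = train(training_messages, training_labels)
result = predict(model, "so happy about this sunny day")
print(result)
```

In practice the feature extraction and the learning algorithm would be replaced by library components, and the training set would contain all seven classes.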


Structure of the training dataset

Total: 4622

• szczęśliwy (happy): 1181 (25.6%)

• neutralny (neutral): 915 (19.8%)

• spokojny (calm): 675 (14.6%)

• zdenerwowany (upset): 646 (14.0%)

• przygnębiony (depressed): 615 (13.3%)

• zniechęcony (discouraged): 552 (11.9%)

• nie wiem (indeterminate): 38 (0.8%)
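The shares follow directly from the class counts. The short sketch below recomputes them and adds a simple imbalance ratio between the largest and the smallest class; the ratio is an illustrative measure, not one taken from the slides.

```python
# Class counts of the training dataset, as listed on the slide
counts = {
    "szczęśliwy (happy)": 1181,
    "neutralny (neutral)": 915,
    "spokojny (calm)": 675,
    "zdenerwowany (upset)": 646,
    "przygnębiony (depressed)": 615,
    "zniechęcony (discouraged)": 552,
    "nie wiem (indeterminate)": 38,
}

total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Ratio between the largest and the smallest class (illustrative imbalance measure)
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share}%")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```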


• the structure of the training dataset is critical: disproportions between different attributes may lead to wrong conclusions;

• the training dataset has to be maintained and modified during the lifetime of the tool;

• representativeness – Twitter popularity in your country (e.g., Poland: 20 thousand tweets per hour; worldwide: 400 million tweets a day in 2013);

• daily life satisfaction (value added) – how many tweets a day can you collect to supplement surveys such as EHIS or EU-SILC?

• concentrate only on text and remove usernames; lemmatization and stemming may not work;

• code page – when the text contains diacritic characters, the encoding has to be unified (UTF-8, cp1250 (windows-1250) vs. ISO-8859-2);

• precision of ML – depending on the training dataset it may range from 0.49 to 0.80;

• retweets – according to Dutch experience, they should also be included;

• attributes for the structure of the population – the region can be extracted, but gender has to be derived with separate algorithms, as it is not an attribute provided by social media websites.
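Two of the points above, removing usernames and unifying code pages, can be sketched with the Python standard library. The regular expression, function names and sample texts are illustrative assumptions, not the project's code.

```python
import re

# Twitter-style mentions: "@" followed by letters, digits or underscores
MENTION = re.compile(r"@\w+")

def remove_usernames(text):
    """Drop @mentions and collapse the remaining whitespace."""
    return " ".join(MENTION.sub("", text).split())

def to_utf8(raw, source_encoding):
    """Decode bytes from a known code page and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

cleaned = remove_usernames("@jan_k Dzień dobry @anna")

# The same Polish text has different byte representations in the two code pages
sample = "zażółć gęślą jaźń"
as_cp1250 = sample.encode("cp1250")
as_iso = sample.encode("iso-8859-2")

# Decoding with the wrong code page silently corrupts the diacritics
misdecoded = as_cp1250.decode("iso-8859-2")

# After unification both inputs yield identical UTF-8 bytes
unified_a = to_utf8(as_cp1250, "cp1250")
unified_b = to_utf8(as_iso, "iso-8859-2")
```

The source encoding has to be known or detected beforehand; guessing it wrong produces text that looks valid but no longer matches any keyword or training example.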


Results

• Population – more recent data

• Data from social networks describe the current state; they cannot predict developments

• Support European surveys by extending selected attributes of Population

Population use case in ONS, UK

Background

The purpose of the case study is to examine the level of daily satisfaction by analysing the content of messages for the presence of defined expressions describing emotional states, positive/negative, or e.g. joy, sadness, fear, anger.

The idea is to explore how we might produce statistics on social sentiment from news sites/blogs/social media towards events/topics and how those can be linked to existing official statistics which annually measure population well-being.

Collecting data from the Facebook API

Main Source: The Guardian Facebook Page

Daily collection of:

- Posts: reaction counts, comment count, Guardian URL of the article, created time, message of the post

- other information taken from the Guardian website about the article, mainly category and tags

- Comments: the id of the post to which the comment refers, comment count, like count, user, created time, message of the comment, parent comment id (for comment replies only)

Results

● Sentiment scores were converted into three categories: positive, negative and neutral. Assessment against a manually graded sample of the data suggests that VADER is the best-performing lexicon for detecting sentiment.

● The sentiment produced by the different lexicons over time tends to follow more or less the same trajectory.

● It is possible to detect events that trigger spikes of positive and negative sentiment.

● The emotions analysis based on the NRC lexicon did not provide any meaningful result.
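The conversion of sentiment scores into three categories and the assessment against a manually graded sample can be sketched as follows. The threshold of 0.05, the lexicon names and all scores are assumptions chosen for illustration; the slides do not state the cut-offs that were used.

```python
def to_category(score, threshold=0.05):
    """Map a sentiment score in [-1, 1] to positive / negative / neutral.

    The threshold is an assumed value, not one reported by the case study.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

def accuracy(predicted, manual):
    """Share of comments where the lexicon agrees with the manual grade."""
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)

# Hypothetical manually graded sample and scores from two hypothetical lexicons
manual_grades = ["positive", "negative", "neutral", "negative"]
lexicon_scores = {
    "lexicon_a": [0.6, -0.4, 0.01, 0.2],
    "lexicon_b": [-0.2, 0.3, 0.0, -0.5],
}

results = {
    name: accuracy([to_category(s) for s in scores], manual_grades)
    for name, scores in lexicon_scores.items()
}
best_lexicon = max(results, key=results.get)
```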


Obstacles

• Long text

• Noisy comments: many comments consist of just a name

• Context is relevant

• A keyword-based approach depends entirely on the set of keywords: a sentence without any keyword is treated as carrying no sentiment at all.

• Keywords can have multiple and vague meanings, as most words change their meaning according to usage and context.
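Both keyword-related obstacles show up even in a toy scorer. The keyword list and the sentences below are hypothetical and serve only to illustrate the two failure modes.

```python
import re

# Hypothetical sentiment keywords with their polarity
KEYWORDS = {"happy": 1, "great": 1, "sad": -1, "angry": -1}

def keyword_score(sentence):
    """Sum the polarity of every keyword found in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(KEYWORDS.get(token, 0) for token in tokens)

# Clearly positive for a human reader, but contains no keyword: scored as 0
no_keyword = keyword_score("I could not stop smiling all day")

# Sarcastic use of "great": the keyword is counted as positive regardless of context
sarcasm = keyword_score("Great, another delayed train")
```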


Google Trends as a source for measuring sentiment and personal well-being

[Figure: interest over time (vertical axis, 0 to 80) by week (horizontal axis, 1 to 51) for Poland, Spain and the UK, with the summer holiday period and the Christmas period marked.]

Results

In all three countries there is a distinct drop in the volume of searches during the main summer holiday period, i.e. from July through to the first half of August. Interestingly, in all three countries there is a strong increase in the last week of August, about the time when many people are returning from summer holidays. There is also a further dip at the end of the year around the Christmas period. This suggests an inverse relationship between searches for “depression” and holiday periods. These changes do not necessarily represent real change in the levels of clinical depression. It is likely that at least some of this is explained by people who, for example, are simply not looking forward to going back to work after a pleasant holiday. However, this in turn may represent something closer to the more general feeling of well-being prevalent across the population as a whole.

It is also interesting that this seasonal pattern is more distinct in Poland, less distinct in Spain with the UK in between. Although this sample of three is too small to draw firm conclusions, this pattern hints at a possible correlation between the gap between average maximum and minimum interest levels in searching for “depression”, and climatic factors, possibly the relative severity of the winter climate in different countries.
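The comparison of seasonal patterns between countries can be made concrete with a simple gap measure. The sketch below uses the difference between the highest and the lowest interest value as a simplified stand-in for the gap between average maximum and minimum interest levels; the numbers are invented and are not Google Trends data.

```python
# Hypothetical interest values on a 0-100 scale (not real Google Trends data)
interest = {
    "Poland": [70, 72, 45, 40, 68, 50],
    "Spain": [60, 61, 55, 54, 60, 56],
    "UK": [65, 66, 50, 48, 64, 52],
}

def seasonal_gap(series):
    """Difference between the highest and the lowest interest value."""
    return max(series) - min(series)

gaps = {country: seasonal_gap(values) for country, values in interest.items()}

# Countries ordered from the most to the least distinct seasonal pattern
ranking = sorted(gaps, key=gaps.get, reverse=True)
```

With only three countries such a ranking can hint at a pattern but cannot establish a correlation with climate.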


AGRICULTURE – Estimation of Agricultural statistics – pilot case study on crop types based on satellite data

Responsibility: PL – coordinator, supported by IE.

Data sources: Satellite images, administrative data, in situ surveys.

Methodology: combining data – data fusion of radar and optical remote sensing data; data comparison with traditional surveys, e.g. FSS; combining data – administrative data sources with satellite data.

The goal of the case study: Crop type: look at the types of crops being grown and see whether they can be identified accurately from the imagery; analysis of the possibilities of using satellite images.

Plan of Combining Datasets: Data fusion – combining data sources by spatial reference.

Main benefits and value added for official statistics: increase in the quality of agricultural surveys; decrease in respondents' burden; more detailed data published by official statistics; potential decrease in the cost of conducting surveys.

ADMINISTRATIVE SOURCES USED:

• cadastral parcels vector data from the Land Parcel Identification System (over 34 mln records, 13 GB of data) [1] - used for in situ plots selection; will also be used for satellite data segmentation,

• agricultural plots vector data from the Land Parcel Identification System (over 33 mln records, 23 GB of data) [1] - used for in situ plots selection; will also be used for satellite data segmentation,

• General Geographic Database (BDOO) [2] - used for in situ plots selection.

[1] Agency for Restructuring and Modernisation of Agriculture (ARMA)

[2] Head Office of Geodesy and Cartography (GUGiK)

COLLECTED DATA:

• SENTINEL-1 one-year time series of raw data for Poland: approx. 2.5 TB (approx. 20 TB after postprocessing),

• IN SITU GEODATABASES:

- sample data geodatabase – 5084 records,

- image geodatabase – 21747 geotagged images, 32 GB of data.

IN SITU RESULTS:

Crop Number of samples

spring barley 481

winter barley 340

corn 509

spring cereal mixes 448

winter cereal mixes 207

oat 426

spring wheat 432

winter wheat 559

spring triticale 218

winter triticale 448

spring rape 134

winter rape 476

rye 406

Total 5084

PRESENT ACTIVITIES:

• satellite images preprocessing and calibration with in situ samples

FURTHER ACTIVITIES:

• satellite images segmentation,

• object based image classification,

• methodology development for calculating agricultural statistics from remote sensing data.

The use of Remote sensing, GIS and Big Data to generate crop statistics in Ireland

• there is an increase in the quality of agricultural surveys,

• there can be a decrease in respondents’ burden,

• there is a potential decrease in the cost of conducting surveys,

• more detailed data can be published.

Data:

• Sentinel 1 (SAR) and Sentinel 2 (MSI) data.

• Land parcel identification system (LPIS) data.

Pre-processing procedures carried out in SNAP:

Sentinel 2 L1C:

• atmospheric correction

• image resampling

• mosaic

Sentinel 1 GRD (VV and VH), batch pre-processed:

• radiometric correction -> sigma bands

• geometric correction

• speckle filtering

• mosaic

Limitations of the project October 2016-March 2017:

• Data: Data not covering areas as specified, cloud cover, length of download time

• Pre-processing: Pre-processing took several attempts to complete in full.

• Current results: Images have had an overall accuracy of at least 85%.


March 2017-present:

Remote sensing and GIS professionals were contacted in relation to this project (Dermot Corcoran, CSO; Aoife Shinners, OSI; Claire Fitzgerald, Mapsphere; Guy Serbin, Teagasc; Gavin Smith, EPA). They made a series of suggestions:

• Focus on an area closer to home; this will enable physical examination of the area.

• Source Landsat 8 data; this may tackle the data availability issue.

• Source crop calendars and be aware that crops have different flowering and harvest times.

• Obtain the Sen2cor plugin for SNAP to do atmospheric correction on Sentinel 2 imagery.

• Look at identifying crops using crop indices (e.g. NDVI, EVI).

• When classifying, classify one crop at a time, i.e. two classes, ‘wheat’ and ‘all other except wheat’, then ‘barley’ and ‘all other except barley’.

• Use LPIS polygon data + Prime 2 data for a more accurate reference set, to obtain more accurate boundaries and more accurate signatures for the signature library.

• Concentrate on smaller, more crop-intensive study areas, to create a spectral and backscatter library.

• Look at both winter and summer crops (area dependent).

• Generate classifications with an accuracy of 85% or more.
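Two of the suggestions, crop indices and one-crop-at-a-time classification, lend themselves to a short sketch. NDVI and SAVI are computed here with their standard definitions from near-infrared and red reflectance; the reflectance values, the soil factor of 0.5 and the helper names are illustrative assumptions, not project settings.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    if nir + red == 0:
        return 0.0  # avoid division by zero for empty pixels
    return (nir - red) / (nir + red)

def savi(nir, red, soil_factor=0.5):
    """Soil Adjusted Vegetation Index with soil brightness factor L."""
    return (1 + soil_factor) * (nir - red) / (nir + red + soil_factor)

def one_vs_rest(labels, target):
    """Relabel reference data into two classes: the target crop and everything else."""
    rest = f"all other except {target}"
    return [label if label == target else rest for label in labels]

# Hypothetical surface reflectance of a single pixel
pixel_nir, pixel_red = 0.5, 0.1
pixel_ndvi = ndvi(pixel_nir, pixel_red)
pixel_savi = savi(pixel_nir, pixel_red)

# Hypothetical reference labels, prepared for a 'wheat' versus rest classification
reference = ["wheat", "barley", "oat", "wheat"]
wheat_labels = one_vs_rest(reference, "wheat")
```

Repeating the relabelling for each crop in turn yields one binary classification task per crop.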

Future plans for this project (as of March 2017):


Data:

• Sentinel 1 image

• Sentinel 2 image

• Landsat 8 image(s)

• Land parcel identification system (LPIS) data set

• Prime 2 Ordnance Survey data

Software:

• SNAP (Sen2cor)

• QGIS

What have we learned?

Skills:

• How to obtain open source Big Data satellite imagery

• How to obtain open source software for Big Data

• How to pre-process this data (correct, subset)

• How to process (classify) this data

• How to assess this data for accuracy

Knowledge:

• Data requires large storage space (each image 5-7 GB, mosaic of the country 21-27 GB).

• Crop dates influence the spectral signature (different crops are flowering or harvested at different times; climate and planting date affect the spectral signature; images must be layered for the full signature).

• Sentinel files must be unzipped with the 7-Zip tool, as the WinZip tool does not work.

• False composites and the SAVI index highlight crop type and are accurate.

• Combining VV and VH bands increases accuracy for Sentinel 1.

Plans for the remainder of the project:

• Assess image fusion between indices from Sentinel 2 and Landsat data.

• Use gained knowledge to classify complete satellite images.

• Investigate uses of satellite imagery for statistics.

TOURISM/BORDER CROSSING – Border movement

Responsibility: PL – coordinator, supported by NL and PT.

Data sources: Traffic sensors (data already acquired from Polish and German data owners), traditional surveys on tourism, flight statistics such as origin, destination, estimation of number of passengers from Civil Aviation Authority of the Republic of Poland and webscraping. Depending on availability, Mobile Call Data will also be used, building on the results of WP 5.

Methodology:

• spatial-temporal models and graph interpolation methods;

• cross-entropy econometrics for combining data sets.

The goal of the case study: to estimate border traffic across the internal borders of the EU (Polish-German, Polish-Slovakian, Polish-Czech and Polish-Lithuanian), also with regard to mirror statistics. A partial estimate of domestic traffic may be an additional result. Selected data sources from national authorities show the scale of border movement that is regarded as tourism in terms of statistical surveys.

Plan of Combining Datasets:

• Unifying structure of data sets;

• Collecting exogenous variables (road class, etc.);

• Preparing distance and graph matrices;

• Quantifying reliability of each data source (expected standard error);

• Combining traffic data from different sources with cross-entropy econometrics.

Main benefits and value added for official statistics: decreased burden on interviewers, more detailed results than from the survey alone, data consistent with mirror statistics.

SPECIFICITY OF DATA SOURCES

• Main objective: estimation of road traffic at the EU's internal border, including mirror statistics

• Data sources:

– data from road sensors (currently for Poland, Germany, Lithuania and Slovakia)

– sample survey coordinated by the Statistical Office in Rzeszów

• Methodology: relative non-extensive entropy econometrics

• Problems: representativeness, pooling of data sets, cost-effective solutions


SPECIFICITY OF DATA SOURCES

Large amount of data but only at several time points (general measurement)

High frequency data but in small amount (continuous measurement)

Irregular data gaps

Recording of internal traffic (bias of result)

Changes of road network

[Figure: Availability of data – continuous traffic measurement, by measurement point number and year (2005–2015).]

[Figure: Availability of data – general traffic measurement, by measurement point number and year (2005–2015).]

[Figure: Availability of data – BASt, by measurement point number and year (2005–2015).]

[Figure: Availability of data – three sources of data, by measurement point number and year (2005–2015).]

[Figure: Szczecin and Kołbaskowo, with BASt and GDDKiA data marked.]

[Figure: Słubice, quarterly data for 2003–2007 (vertical axis 0 to 1,100,000); series: BASt and SG.]

[Figure: Olszyna, quarterly data for 2003–2007 (vertical axis 0 to 1,100,000); series: BASt and SG.]

[Figure: ŚDRR (annual average daily traffic), 2006–2015 (vertical axis 0 to 16,000); series: Wierzbica - Pułtusk and Stolno-Kończewice.]

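• Minimization of the formula

• (eq.1) $H_q(p\|p^0, r\|r^0, w\|w^0) = \alpha \sum p_{km} \frac{(p_{km}/p^0_{km})^{q-1} - 1}{q-1} + \beta \sum r_{nj} \frac{(r_{nj}/r^0_{nj})^{q-1} - 1}{q-1} + \delta \sum w_{ts} \frac{(w_{ts}/w^0_{ts})^{q-1} - 1}{q-1}$

• under conditions

• (eq.2) $Y = X \cdot \beta + e = X \cdot \sum_{m=1}^{M} v_m \frac{p_{km}^q}{\sum_{m=1}^{M} p_{km}^q} + \sum_{j=1}^{J} z_j \frac{r_{nj}^q}{\sum_{j=1}^{J} r_{nj}^q}$

• (eq.3) $\sum_{k=1}^{K} \sum_{m>2}^{M} p_{km} = 1$

• (eq.4) $\sum_{n=1}^{N} \sum_{j>2}^{J} r_{nj} = 1$

• (eq.5) $\sum_{t=1}^{T} \sum_{s>2}^{S} w_{ts} = 1$

• and other a priori information.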
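Each of the three terms in (eq.1) is a relative entropy of order q between a distribution and its prior. The sketch below evaluates one such term for a single pair of distributions; the function name and the example values are illustrative, and the constraints (eq.2) to (eq.5) are not modelled. As q approaches 1 the term approaches the ordinary Kullback-Leibler cross-entropy.

```python
import math

def relative_entropy_q(p, p0, q):
    """One term of eq.1: sum of p * ((p / p0)^(q-1) - 1) / (q - 1).

    For q close to 1 the Kullback-Leibler limit sum of p * ln(p / p0) is returned.
    """
    if abs(q - 1.0) < 1e-12:
        return sum(pk * math.log(pk / p0k) for pk, p0k in zip(p, p0) if pk > 0)
    return sum(pk * ((pk / p0k) ** (q - 1) - 1) for pk, p0k in zip(p, p0)) / (q - 1)

# Hypothetical posterior and prior probabilities
p = [0.5, 0.5]
prior = [0.9, 0.1]

same = relative_entropy_q(p, p, 1.5)             # identical distributions
different = relative_entropy_q(p, prior, 1.5)    # distance from the prior
near_one = relative_entropy_q(p, prior, 1.000001)
kullback_leibler = relative_entropy_q(p, prior, 1.0)
```

Minimizing the weighted sum of such terms keeps the estimated probabilities as close to their priors as the data constraints allow.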

[Flowchart, steps in order:]

1. Find data common to both sources: S1, S2.

2. Link available data: Y.

3. Save result in the resulting matrix.

4. Complete data so as to minimize cross-entropy between remaining S1 data and linked Y data.

5. Complete data so as to minimize cross-entropy between remaining S2 data and linked Y data.

6. Has iteration been completed? If no, iterate again; if yes, continue with the imputation algorithm.

• Find data gaps. Designate base row and base column

• For each row, count the number of common data

• For each column, count the number of common data

• For each row, examine similarity to base row

• For each column, examine similarity to base column

• Select optimal vector (row or column)

• Complete data so as to minimize cross-entropy between optimal and base vectors

• Select records that meet information criteria

• Has iteration been completed? (no: continue iterating; yes: Substantive verification)
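The selection and completion steps of the imputation flowchart can be illustrated with a simplified sketch. It is only an approximation under stated assumptions: gaps are marked with `None`, similarity is measured as the Kullback-Leibler divergence between the shares of two vectors on their common positions, and the gap is filled by proportional scaling of the chosen donor vector instead of a full cross-entropy minimisation. The function names and the `min_common` threshold are hypothetical. Candidate vectors may be rows or columns.

```python
import math


def common_positions(a, b):
    """Indices at which both vectors have data (None marks a gap)."""
    return [i for i, (x, y) in enumerate(zip(a, b))
            if x is not None and y is not None]


def kl_divergence(a, b, idx):
    """Kullback-Leibler divergence between the shares of a and b on idx.

    Values are assumed to be strictly positive.
    """
    sa = sum(a[i] for i in idx)
    sb = sum(b[i] for i in idx)
    return sum((a[i] / sa) * math.log((a[i] / sa) / (b[i] / sb)) for i in idx)


def impute_gap(base, candidates, gap, min_common=2):
    """Fill base[gap] from the candidate vector most similar to base.

    Returns None when no candidate has data at the gap together with
    at least min_common positions in common with base.
    """
    best = None
    for cand in candidates:
        if cand[gap] is None:
            continue  # a donor must have data where base has the gap
        idx = common_positions(base, cand)
        if len(idx) < min_common:
            continue  # too little common data to judge similarity
        d = kl_divergence(base, cand, idx)
        if best is None or d < best[0]:
            best = (d, cand, idx)
    if best is None:
        return None
    _, donor, idx = best
    # Simplification: rescale the donor to the level of the base vector.
    scale = sum(base[i] for i in idx) / sum(donor[i] for i in idx)
    return donor[gap] * scale


# Illustrative data only.
base_row = [100.0, 200.0, None, 400.0]
other_rows = [[50.0, 10.0, 99.0, 5.0], [10.0, 20.0, 30.0, 40.0]]
filled = impute_gap(base_row, other_rows, gap=2)
```

Here the second candidate has the same profile as the base row on the common positions, so it is chosen as donor and its value is rescaled to the level of the base row.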

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: First source, Second source]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: First source - data to be linked, Second source - data to be linked, Data not used during linking]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: First source - data to be linked, Second source - data to be linked, Data not used during linking, After sources have been combined]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: Data not used during linking, After sources have been combined]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: Data not used during linking, Imputation to first source]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: Data not used during linking, Imputation to second source]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: Data not used during linking, Imputation using crossword algorithm]

[Figure: time series, 2005-2015, y-axis 6000-9000; legend: First source, Second source, Final result]

How big is Tourism's Potential?


How big is Tourism's Potential?

Border movement or immigration?

Trains and other modes of transport for travelling

Air traffic, e.g. distribution of flights

Hotel chains


Air traffic data obtained by day-by-day webscraping of 29 airports: 15 Polish airports and 14 hub airports (selected according to traffic intensity from Polish airports)

Data consists of
• origin and destination airport,
• IATA and ICAO codes,
• type of aircraft,
• date of arrival and departure

Supplementary information
• database of IATA, ICAO and FAA codes, used to retrieve the origin and destination country,
• sample survey on tourism conducted by the Statistical Office in Rzeszów


Air traffic data issues
• duplicates,
• coding problems,
• aircraft capacity differs between airlines, even for the same type of aircraft,
• the database of IATA, ICAO and FAA codes needs updates
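Two of the issues listed above, duplicates and airline-specific capacity, can be illustrated with a short sketch. It is not the project's scraping code. The record layout, the `airline` field, the airline codes `X1` and `X2` and all seat numbers are hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical record layout for one scraped flight.
Flight = namedtuple(
    "Flight", "origin destination aircraft_type airline departure"
)

# Hypothetical seat capacities: the same aircraft type may be
# configured differently by each airline.
CAPACITY = {("X1", "B738"): 186, ("X2", "B738"): 189}
# Hypothetical fallback used when the airline-specific value is unknown.
DEFAULT_CAPACITY = {"B738": 180}


def deduplicate(flights):
    """Keep the first record for each (origin, destination, airline, departure)."""
    seen = set()
    unique = []
    for f in flights:
        # Codes are upper-cased so that "waw" and "WAW" count as the same airport.
        key = (f.origin.upper(), f.destination.upper(),
               f.airline.upper(), f.departure)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


def estimated_seats(flights):
    """Sum seat capacity, preferring the airline-specific value."""
    total = 0
    for f in flights:
        fallback = DEFAULT_CAPACITY.get(f.aircraft_type, 0)
        total += CAPACITY.get((f.airline, f.aircraft_type), fallback)
    return total


# Illustrative records: the second one duplicates the first.
flights = [
    Flight("WAW", "FRA", "B738", "X1", "2017-05-01T10:00"),
    Flight("waw", "fra", "B738", "X1", "2017-05-01T10:00"),
    Flight("KRK", "MUC", "B738", "X2", "2017-05-01T12:00"),
]
unique = deduplicate(flights)
seats = estimated_seats(unique)
```

Seat capacity is only an upper bound on passengers; turning it into a passenger estimate would need a load factor, which this sketch does not assume.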


Direct flights from Poland to
• 198 airports in 54 countries
• 19% - internal traffic
• 81% - external traffic

Direct flights from Poland and hub airports to
• 484 airports in 117 countries

Sample survey covers up to 80 countries


Countries by direct flights from Poland


…by one connecting flight in Germany


… by one connecting flight outside Poland

[Figure: number of flights directly from Poland, 51820 flights to 53 countries; y-axis 0-12000]

[Figure: number of flights with one connecting flight in Germany, 72368 flights to 102 countries; y-axis 0-12000]

[Figure: number of flights with one connecting flight outside Poland, 95311 flights to 145 countries; y-axis 0-12000]

Distribution of Polish residents' destinations


Big Data can be implemented in the Integrated Survey of Tourism

Data on road traffic intensity may reduce the burden on interviewers at the border crossings for which such data are obtained

Data on air traffic may increase the number of countries for which statistics are produced


• Selection of ca. 35 data sources for the pilot city

• Duplicates and accuracy

• Benefits
– List of units by type, with their facilities
– Prices in time series, as category or exact prices

• Additional outcomes
– Using Big Data to collect missing information in the Business Register
– Monitoring of NACE


Elementary Data. List of units with names and additional attributes
1. Villa Meduza
2. Park Design - Apartments M&M (Polanki Park)
3. Apartament Manhattan
4. Bursztyn Spa
5. Jantar Apartament - Exclusive Marine Polanki
6. Apartament 233 w Hotelu Diva Kołobrzeg
7. 100-SIO Apartamenty
8. Apartament Solna - Eldorado z garażem
9. Apartamenty Bodnar Polanki
10. Kurhotel Etna
11. Penthouse Apartment Morska
12. Luksusowy Apartament 553
13. Taaakaryba
14. Apartamenty Sun&Snow Polanki
…

• More accurate elementary data

• Aggregated data


WP7 will prepare recommendations (as a final product of its work) for the second wave of pilots in 3 domains. One of them is the Tourism domain.

In the first stage of working with big data on air traffic we estimated the distribution of flights. Relying on big data alone led to a solution with some drawbacks. We therefore supplemented the big data sources with statistics from the sample survey on trips and estimated the distribution of trips. From this experience we conclude that we can go much further and connect big data with every statistic in the sample survey on trips. The connection can be direct, as in the case of trips, or indirect, as in the case of expenditures in specific domains.
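As a rough illustration of these two stages, the sketch below first computes the distribution of flights by destination country and then uses it to extend survey-based trip estimates to countries the survey does not cover. The combination rule is an assumption made for this example and is not the method used in the project: a hypothetical survey residual ("other countries") is split across the uncovered countries in proportion to their flight counts. All names and numbers are illustrative.

```python
from collections import Counter


def flight_shares(flight_counts):
    """Share of flights per destination country."""
    total = sum(flight_counts.values())
    return {country: n / total for country, n in flight_counts.items()}


def extend_survey(survey_trips, residual_trips, flight_counts):
    """Spread the survey residual over countries missing from the survey.

    survey_trips: trips per country estimated from the sample survey.
    residual_trips: trips the survey cannot attribute to a listed country.
    flight_counts: scraped number of flights per destination country.
    """
    uncovered = {c: n for c, n in flight_counts.items()
                 if c not in survey_trips}
    total = sum(uncovered.values())
    result = dict(survey_trips)
    if total == 0:
        return result  # nothing to allocate to
    for country, n in uncovered.items():
        result[country] = residual_trips * n / total
    return result


# Illustrative data only: one entry per scraped flight.
destinations = ["DE", "DE", "GB", "JP", "JP", "BR"]
counts = Counter(destinations)
shares = flight_shares(counts)
trips = extend_survey({"DE": 1000.0, "GB": 500.0}, 90.0, counts)
```

Countries already covered by the survey keep their survey estimates; only the residual is redistributed, so the total number of trips is preserved.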

We are currently working on an inter-domain case connecting agriculture and tourism. By combining big data from webscraping with the sample survey we want to obtain more detailed information on agrotourism than the sample survey alone provides.


• Dedicated web portals for agrotourism

• First results can show price changes over time

• Will also be used for the work on combining data


Summing up

Hotel chains

Air traffic - distribution of flights

Train/various modes of transport for travelling

Top destinations in the EU

Border movement or immigration?

The level of difficulty grows when we combine big data with official research, but this is necessary to improve the quality of data sources

Sometimes Big Data helps to improve official statistics and sometimes official data improves Big Data

Main statistical findings

Top destinations in the EU

Economic aspects of international travel

Statistics on tourism demand are collected in relation to the number of tourism trips made (and the number of nights spent on those trips), separated by:
• destination country;
• purpose;
• length of stay;
• accommodation type;
• departure month;
• transport mode;
• expenditure.


The great potential of these sources…

… for the production of statistics, more advanced methodology is needed in order to face the challenges.


POPULATION
• PL Use Case 1: a training data set must be prepared for your language; test our machine learning tool (we can share three different tools for this purpose). Implemented in Python.
• UK Use Case 2: a lexicon is available for different languages, but not for Polish; Python scripts are available for querying the API and for assessing the scores of the comments.
• UK Use Case 3: use Google Trends (freely available) to identify morbidity areas; the use case on depression for Spain, UK and Poland was conducted by ONS UK.

TOURISM
• PL Use Case 1: Flight movement - border crossing - Python scripts to web scrape the data on flights can be shared with other countries; the number of passengers needs to be estimated based on the aircraft type; before starting implementation you need to select airports in your country.
• PL Use Case 2: Road sensors - border crossing - the data need to be obtained, which may require signing agreements with data owners; the methodology can be shared.

AGRICULTURE
• PL Use Case 1: Sentinel data are free; access to LPIS; in situ survey data needed for training; GIS software: the Polish use case uses dedicated software, but the free SNAP is an alternative; a methodology for the segmentation is needed; all aspects of the PL implementation can be shared.
• IE Use Case 2: storage space; Sentinel-2 data are free; SNAP software is free; methodology and implementation guidelines can be shared.


After the preliminary conditions and obstacles to implementing the pilots had been identified, the discussion turned to the possibility of other countries implementing the presented use cases. The results are presented in the table below: blue means that the pilot was implemented, red means that there will be an attempt to implement the pilot.

Identifier | Topic/source | IE | NL | PL | PT | UK
Use Case 1 Population PL | Life satisfaction (Twitter) | | | | |
Use Case 2 Population UK | People’s opinion (Guardian) | | | | |
Use Case 3 Population UK | Depression (Google Trends) | | | | |
Use Case 1 Tourism PL | Border crossing (road sensors) | | | | |
Use Case 2 Tourism PL | Border crossing (flight websites) | | | | |
Use Case 1 Agriculture PL | Crops (satellite images) | | | | |
Use Case 2 Agriculture IE | Crops (satellite images) | | | | |


Domain | Topic/source | IE NL PL PT UK
Population | Life satisfaction by gender based on Twitter | P P D P P
Population | People’s opinion/interestingness by topics based on websites | P P D
Population | Depression by country | P D
Tourism | Internal EU border crossing by country and type of transportation | D
Agriculture | Crops by region/crop type | D P D

Legend: D: will deliver the values for the indicator, P: possible delivery depending on the pilot implementation


• Human well-being (Python)

• Border movement - traffic sensors (R language)

• Flight movement (Python)

• Hotel chains (Python)

• Agrotourism (Python)

• Crop types (dedicated tools)

• Life satisfaction - pilot by ONS UK (Python)

• And many more…
