State of affairs
Łukasz Błaszczyk & Anna Nowicka
WP7 Multi Domains
Agenda
• General information on WP7
• SGA-1
• SGA-2
• Future perspectives
2
WP7 Multi Domains TEAM
In addition to GUS (Statistics Poland), which leads WP7, and CBS (Statistics Netherlands), the work package has been carried out by two other ESSnet Big Data partners: CSO (Statistics Ireland) and ONS (Statistics United Kingdom).
3
WP0: CO-ORDINATION
WP9: DISSEMINATION
ESSnet BIG DATA
• WP1: Webscraping / Job Vacancies
• WP2: Webscraping / Enterprise Characteristics
• WP3: Smart Meters
• WP4: AIS Data
• WP5: Mobile Phone Data
• WP6: Early Estimates
• WP8: Methodology
• WP7: Multi Domains
Portugal joined the team from SGA-2 (March 2017).
Country team for each domain
Population
• Regional statistical office in Poznań
• Regional statistical office in Bydgoszcz
Tourism/border crossing
• Regional statistical office in Rzeszów
• Department of Social Research
Agriculture
• Department of Agriculture
• Regional statistical office in Olsztyn
WP7 Multi Domains TEAM
• Anna Nowicka – Leader, cooperation
• Janusz Dygaszewicz – Project Manager of Polish work
• Jacek Maślankowski – Coordinator of methodology
PARTNERS
John Sheridan, Sinead Bracken, Piet Daas, Alessandra Sozzi, Nigel Swier, Leone Wardman, Rui Alves, Sónia Quaresma
4
WP7 Multi Domains
The aim
• To investigate how a combination of big data sources and existing official statistical data can be used to improve current statistics and create new statistics in statistical domains.
• The work package focuses on three statistical domains: Population, Tourism/border crossings and Agriculture.
• The work package team will describe the data collection, data linking, data processing and methodological aspects of combining data in statistical domains.
Challenges ahead are:
• representativeness issues, linking to other datasets, metadata, international comparability and long-lasting solutions at a sustainable cost.
5
General summary of WP7 work done
6
Annotated bibliography
• Several research papers and statistical reports relevant for WP7 work, e.g.:
• Social Media Sentiment and Consumer Confidence: Piet J.H. Daas and Marco J.H. Puts;
• Experiment report: Social Media - Sentiment Analysis, UNECE, created by: Antonino Virgillito, modified by Steven Vale;
• Twitter Sentiment Classification using Distant Supervision; Alec Go, Richa Bhayani, Lei Huang.
7
Work done - general overview
• Brainstorming on data sources
• Questionnaire on different aspects of Big Data implementation, e.g., data access, data quality, combining data, methodology
• Final use cases preparation
• Several videoconferences
• Annotated bibliography
• Pre-pilot use cases implementation
8
9
Warsaw, 12-13 June 2017
10
SGA-2 perspectives
• Extend the scope of pilot surveys
• Combining data within a domain as well as inter-domain data combination
• Sharing general framework
11
To achieve its main goals under SGA-2, WP7 has carried out experimental work in the form of the following three case studies.
POPULATION – three pilots
Responsibility: PL – coordinator, supported by UK, PT
Data sources: Social media/Blogs/Internet portals
Methodology: Webscraping, Data/Text/Web mining, Machine learning. As the data sources are selective, i.e. they only cover units that put text on social media and the internet, the methodology will aim at yielding valid information for the population as a whole. Use will be made of methods described in the literature (such as research on the use of public social media messages done in the Netherlands).
Three pilots:
1) to examine the level of daily satisfaction of the population by analysing the content of messages for the presence of defined expressions describing emotional states, e.g., happiness, joy, sadness, fear, anger;
2) to present the moods of the population associated with various public events;
3) to observe morbidity areas, e.g., flu.
Plan of Combining Datasets: (1) combine in one repository the selected data from all Big Data sources, (2) compare with the results of social studies to add more detailed information, (3) supplement the information gained in social studies.
Main benefits and value added for official statistics: support the traditional European Social Survey; supplement the research methodology for phenomena that are difficult to measure through traditional polls.
12
Framework used in the PL Population Use Case (individuals' well-being) to scrape and process Twitter data
We use the following classification based on emotional states from European Social Survey, Social Cohesion Survey and EU Statistics on Income and Living Conditions (EU-SILC):
• szczęśliwy (happy),
• neutralny (neutral),
• spokojny (calm),
• zdenerwowany (upset),
• przygnębiony (depressed),
• zniechęcony (discouraged),
• nie wiem (indeterminate).
We need to divide this task into four stages, presented in the picture below.
[Diagram elements: data, data extracting (Tweepy), training dataset, labels, feature vectors, machine learning algorithm (Sklearn), predictive model, result set]
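The flow from labelled training texts to a predictive model can be illustrated with a very small sketch. It is not the project's tool: the slide names Tweepy and Sklearn, whereas this stand-in uses only the Python standard library and a hand-written multinomial Naive Bayes classifier so that it runs anywhere. The function names and the four toy training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Very rough tokenisation: lower-case and split on whitespace.
    return text.lower().split()

def train(samples):
    """Build word counts per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> token frequencies
    label_counts = Counter()             # label -> number of messages
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set; a real one holds thousands of labelled tweets.
TOY_SAMPLES = [
    ("jestem bardzo szczęśliwy", "szczęśliwy"),
    ("super dzień radość", "szczęśliwy"),
    ("jestem zdenerwowany i zły", "zdenerwowany"),
    ("co za złość", "zdenerwowany"),
]
model = train(TOY_SAMPLES)
print(predict(model, "bardzo szczęśliwy dzień"))
```

With a real training set the same two steps (fit on labelled feature vectors, then predict labels for new messages) would normally be delegated to a library such as the one named on the slide.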
13
Structure of the training dataset
• szczęśliwy (happy): 1181 (25.6%)
• neutralny (neutral): 915 (19.8%)
• spokojny (calm): 675 (14.6%)
• zdenerwowany (upset): 646 (14.0%)
• przygnębiony (depressed): 615 (13.3%)
• zniechęcony (discouraged): 552 (11.9%)
• nie wiem (indeterminate): 38 (0.8%)
Total: 4622
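The class shares in this structure can be recomputed from the counts, which also gives a quick way to spot classes that are too small to train on. The sketch below uses the counts from the slide; the 5% cut-off in `underrepresented` is an arbitrary assumption for illustration, not a project rule.

```python
# Class counts of the training dataset, as given on the slide.
TRAINING_COUNTS = {
    "szczęśliwy": 1181,
    "neutralny": 915,
    "spokojny": 675,
    "zdenerwowany": 646,
    "przygnębiony": 615,
    "zniechęcony": 552,
    "nie wiem": 38,
}

def class_shares(counts):
    """Share of each class in the whole dataset (values sum to 1)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def underrepresented(counts, min_share=0.05):
    """Labels whose share falls below min_share (assumed threshold)."""
    return [label for label, share in class_shares(counts).items()
            if share < min_share]

for label, share in class_shares(TRAINING_COUNTS).items():
    print(f"{label}: {share:.1%}")
print("Small classes:", underrepresented(TRAINING_COUNTS))
```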
14
• structure of the training dataset is critical – it may lead to wrong conclusions if the attributes are disproportionately represented;
• we have to maintain and modify the training dataset during the lifetime of the tool;
• representativeness – Twitter popularity in your country (e.g., Poland: 20 thousand tweets per hour; worldwide: 400 million tweets a day in 2013),
• daily life satisfaction (value added) – how many tweets a day can you collect to supplement surveys such as EHIS or EU-SILC?
• concentrate only on text, remove usernames; lemmatization and stemming may not work,
• code page – when there are special diacritic characters, you have to unify the encoding (UTF-8, cp1250 (windows-1250) vs. ISO-8859-2),
• precision of ML – depending on the training dataset it may be in the range 0.49-0.80,
• retweets – according to Dutch experience, should also be included,
• attributes for the structure of population – we can extract the region, but the gender has to be extracted with different algorithms, as it is not an attribute given by social media websites.
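Two of the points above, removing usernames and unifying the code page, can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, stripping URLs is an extra step not mentioned on the slide, and the fallback to cp1250 is a guess that only makes sense when the data are known to come from Windows systems. cp1250 and ISO-8859-2 cannot be told apart reliably from the bytes alone.

```python
import re

USER_RE = re.compile(r"@\w+")          # Twitter-style user mentions
URL_RE = re.compile(r"https?://\S+")   # links (assumed to be noise)

def to_text(raw: bytes, fallback: str = "cp1250") -> str:
    """Decode bytes as UTF-8; if that fails, use the assumed fallback code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")

def clean_tweet(text: str) -> str:
    """Drop links and usernames, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    return " ".join(text.split())

raw = "gęś".encode("cp1250")           # bytes as a Windows source might deliver them
print(to_text(raw))                     # decoded via the cp1250 fallback
print(raw.decode("iso8859_2") == "gęś") # the other Polish code page gives a different result
print(clean_tweet("@jan_k Dziś jest super! https://t.co/abc"))
```

The middle `print` shows why the encoding has to be known or agreed in advance: the same bytes decode without error under both code pages but yield different characters.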
15
Results
• Population – more recent data
• Data from social networks are information about the state; they cannot predict developments
• Support European Surveys by extending selected attributes of Population
16
Population use case in ONS, UK
Background
The purpose of the case study is to examine the level of daily satisfaction by analysing the content of messages for the presence of defined expressions describing emotional states, positive/negative, or e.g. joy, sadness, fear, anger.
The idea is to explore how we might produce statistics on social sentiment from news sites/blogs/social media towards events/topics and how those can be linked to existing official statistics which annually measure population well-being.
17
Collecting data from the Facebook API
Main Source: The Guardian Facebook Page
Daily collection of:
- Posts: reaction counts, comment count, Guardian URL of the article, created time, message of the post
- other information taken from the Guardian website about the article, mainly category and tags
- Comments: the id of the post to which the comment refers, comment count, like count, user, created time, message of the comment text, parent comment id (for comment replies only)
18
19
20
21
22
Results
● Sentiment scores were converted to a multinomial classification with three categories: positive, negative and neutral. Assessment against a manually graded sample of the data suggests that VADER is the best performing lexicon for detecting sentiment.
● Looking at the sentiment produced by the different lexicons over time, they appear to follow more or less the same sentiment trajectory.
● It is possible to detect events triggering spikes of positive and negative sentiment.
● The outcome of the emotions analysis based on the NRC lexicon is not clear: in this case it failed to provide any meaningful result.
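Converting a continuous sentiment score into the three categories, and checking the result against a manually graded sample, can be sketched as follows. The slides do not give the cut-off values used; the ±0.05 threshold below is the convention commonly quoted for VADER compound scores and is an assumption here, as are the function names and the toy scores.

```python
def to_class(score: float, threshold: float = 0.05) -> str:
    """Map a score in [-1, 1] to positive / negative / neutral (assumed cut-offs)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

def accuracy(predicted, graded):
    """Share of items where the lexicon class matches the manual grade."""
    if len(predicted) != len(graded):
        raise ValueError("both lists must have the same length")
    return sum(p == g for p, g in zip(predicted, graded)) / len(graded)

# Hypothetical lexicon scores and manual grades for three comments.
scores = [0.6, -0.3, 0.0]
manual = ["positive", "neutral", "neutral"]
classes = [to_class(s) for s in scores]
print(classes, accuracy(classes, manual))
```

Running the same comparison for each lexicon on the same graded sample gives a simple basis for ranking them.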
23
Obstacles
• Long text
• Noisy comments: many comments contain just a name
• Context relevance
• The keyword-based approach depends entirely on the set of keywords: sentences without any keyword are treated as carrying no sentiment at all.
• Keywords can have multiple and vague meanings, as most words change their meaning according to usage and context.
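The two keyword-related obstacles are easy to reproduce with a minimal keyword matcher. The keyword lists and the function name below are hypothetical and far smaller than any real lexicon; the sketch only shows the failure modes described above.

```python
import re

# Hypothetical, deliberately tiny emotion lexicon.
EMOTION_KEYWORDS = {
    "joy": {"happy", "joy", "delighted"},
    "sadness": {"sad", "unhappy", "miserable"},
    "anger": {"angry", "furious"},
    "fear": {"afraid", "scared"},
}

def detect_emotions(text: str) -> list:
    """Return the emotions whose keywords occur in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(e for e, kws in EMOTION_KEYWORDS.items() if tokens & kws)

print(detect_emotions("I am so happy today"))                # keyword found
print(detect_emotions("Well, that went exactly as expected."))  # no keyword: looks neutral
print(detect_emotions("I am not happy about this"))          # negation is ignored
```

The second call returns an empty list although the sentence may well be sarcastic, and the third is tagged with "joy" despite the negation, which is the context problem noted on the slide.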
24
Google Trends as a source for measuring sentiment and personal well-being
[Chart: "Interest over time" (0-80) by week of the year (1-51) for Poland, Spain and the UK, with the summer holiday period and the Christmas period marked]
25
Results
In all three countries there is a distinct drop in the volume of searches during the main summer holiday period, i.e. from July through to the first half of August. Interestingly, in all three countries, there is a strong increase in the last week of August, about the time when many people are returning from summer holidays. There is also a further dip at the end of the year around the Christmas period. This suggests an inverse relationship between searches for “depression” and holiday periods. These changes may not necessarily represent real change in the levels of clinical depression. It is likely that at least some of this is explained by people who, for example, are simply not looking forward to going back to work after a pleasant holiday. However, this in turn may represent something closer to the more general feeling of well-being prevalent across the population as a whole.
It is also interesting that this seasonal pattern is more distinct in Poland, less distinct in Spain with the UK in between. Although this sample of three is too small to draw firm conclusions, this pattern hints at a possible correlation between the gap between average maximum and minimum interest levels in searching for “depression”, and climatic factors, possibly the relative severity of the winter climate in different countries.
26
AGRICULTURE – Estimation of Agricultural statistics – pilot case study on crop types based on satellite data
Responsibility: PL – coordinator, supported by IE.
Data sources: Satellite images, administrative data, in situ surveys.
Methodology: combining data – data fusion of radar and optical remote sensing data; data comparison with traditional surveys, e.g. FSS; combining data – administrative data sources with satellite data.
The goal of the case study: Crop type: look at the types of crops being grown and see if we can tell this accurately from the imagery; analysis of the possibilities of using satellite images.
Plan of Combining Datasets: Data fusion – combining data sources by spatial reference.
Main benefits and value added for official statistics: increased quality of the agricultural surveys; decreased respondent burden; more detailed data published by official statistics; potential decrease in the cost of conducting surveys.
27
28
ADMINISTRATIVE SOURCES USED:
• cadastral parcels vector data from Land Parcel Identification System (over 34 million records, 13 GB of data) [1] - used for in situ plots selection and also will be used for satellite data segmentation,
• agricultural plots vector data from Land Parcel Identification System (over 33 million records, 23 GB of data) [1] - used for in situ plots selection and also will be used for satellite data segmentation,
• General Geographic Database (BDOO) [2] - used for in situ plots selection.
[1] Agency for Restructuring and Modernisation of Agriculture (ARMA)
[2] Head Office of Geodesy and Cartography (GUGiK)
29
COLLECTED DATA:
• SENTINEL-1: one-year time series of raw data for Poland, approx. 2.5 TB (approx. 20 TB after post-processing),
• IN SITU GEODATABASES:
- sample data geodatabase – 5084 records,
30
- image geodatabase – 21747 geotagged images, 32 GB of data,
31
IN SITU RESULTS:
Crop Number of samples
spring barley 481
winter barley 340
corn 509
spring cereal mixes 448
winter cereal mixes 207
oat 426
spring wheat 432
winter wheat 559
spring triticale 218
winter triticale 448
spring rape 134
winter rape 476
rye 406
Total 5084
PRESENT ACTIVITIES:
• satellite images preprocessing and calibration with in situ samples
FURTHER ACTIVITIES:
• satellite images segmentation,
• object based image classification,
• methodology development for calculating agricultural statistics from remote sensing data.
The use of Remote sensing, GIS and Big Data to generate crop statistics in Ireland
• there is an increase in the quality of agricultural surveys,
• we can have a decrease in respondents’ burden,
• there is a potential decrease in the cost of conducting surveys,
• more detailed data can be published.
32
Data:
• Sentinel 1 (SAR) and Sentinel 2 (MSI) data.
• Land parcel identification system (LPIS) data.
33
Pre-processing procedures carried out in SNAP:
Sentinel 2 L1C:
• atmospheric correction
• image resampling
• mosaic
34
Pre-processing procedures carried out in SNAP:
Sentinel 1 GRD images (VV and VH) were batch pre-processed:
• radiometric correction -> sigma bands
• geometric correction
• speckle filtering
• mosaic
35
Limitations of the project October 2016-March 2017:
• Data: Data not covering areas as specified, cloud cover, length of download time
• Pre-processing: Pre-processing took several attempts to complete in full.
• Current results: Images have had an overall accuracy of at least 85%.
36
March 2017-present:
Remote sensing and GIS professionals were contacted (Dermot Corcoran, CSO; Aoife Shinners, OSI; Claire Fitzgerald, Mapsphere; Guy Serbin, Teagasc; Gavin Smith, EPA) in relation to this project. These professionals made a series of suggestions:
• Focus on an area closer to home; this will enable physical examination of the area.
• Source Landsat 8 data; this may tackle the data availability issue.
• Source crop calendars and be aware that crops have different flowering times and harvest times.
• Obtain the Sen2cor plugin for SNAP to do atmospheric correction on Sentinel 2 imagery.
• Look at identifying crops using crop indices (e.g. NDVI, EVI).
• When classifying, classify one crop at a time, i.e. two classes ‘wheat’ and ‘all other except wheat’, then ‘barley’ and ‘all other except barley’.
• Use LPIS polygon data + Prime 2 data for a more accurate reference set, to obtain more accurate boundaries and more accurate signatures for the signature library.
• Concentrate on smaller, more crop-intensive study areas, to create a spectral and backscatter library.
• Look at both winter and summer crops (area dependent).
• Generate classifications with an accuracy of 85%+.
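Two of the suggestions, crop indices and classifying one crop at a time, can be illustrated with a short sketch. NDVI is computed with its standard definition, (NIR - Red) / (NIR + Red); for Sentinel 2 these are bands 8 and 4. The reflectance values, crop labels and function names are hypothetical, and returning 0.0 for an empty pixel is a simplifying assumption.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalised Difference Vegetation Index for one pixel."""
    denom = nir + red
    if denom == 0:
        return 0.0  # assumption: treat an all-zero pixel as "no vegetation signal"
    return (nir - red) / denom

def one_vs_rest(labels, target):
    """Relabel a reference set into 'target' and 'all other except target'."""
    other = f"all other except {target}"
    return [lab if lab == target else other for lab in labels]

# Hypothetical surface reflectances (NIR, red) for three parcels.
pixels = [(0.50, 0.10), (0.30, 0.20), (0.20, 0.20)]
print([round(ndvi(nir, red), 3) for nir, red in pixels])

reference = ["wheat", "barley", "oat", "wheat"]
print(one_vs_rest(reference, "wheat"))
print(one_vs_rest(reference, "barley"))
```

Each one-vs-rest relabelling yields a separate binary classification problem, so the accuracy target can be checked crop by crop.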
Future plans for this project (as of March 2017):
37
Data:
• Sentinel 1 image
• Sentinel 2 image
• Landsat 8 image(s)
• Land parcel identification system (LPIS) data set
• Prime 2 Ordnance Survey data
Software:
• SNAP (Sen2cor)
• QGIS
38
What have we learned?
Skills:
• How to obtain open source Big Data satellite imagery
• How to obtain open source software for BD
• How to pre-process this data (correct, subset)
• How to process (classify) this data
• How to assess this data for accuracy
Knowledge:
• Data requires large storage space (each image 5-7 GB, mosaic of the country 21-27 GB).
• Crop dates influence the spectral signature (different crops flower and are harvested at different times; climate and planting date affect the spectral signature; images must be layered for the full signature).
• Sentinel files must be unzipped with the 7-Zip tool, as the WinZip tool does not work.
• False colour composites and the SAVI index highlight crop type and are accurate.
• Combining VV and VH bands increases accuracy for Sentinel 1.
39
Plans for the remainder of the project:
• Assess image fusion between indices from Sentinel 2 and Landsat data.
• Use gained knowledge to classify complete satellite images.
• Investigate uses of satellite imagery for statistics.
40
TOURISM/BORDER CROSSING – Border movement
Responsibility: PL – coordinator, supported by NL and PT.
Data sources: Traffic sensors (data already acquired from Polish and German data owners), traditional surveys on tourism, flight statistics such as origin, destination, estimation of number of passengers from Civil Aviation Authority of the Republic of Poland and webscraping. Depending on availability, Mobile Call Data will also be used, building on the results of WP 5.
Methodology:
• spatial-temporal models and graph interpolation methods;
• cross-entropy econometrics for combining data sets.
The goal of the case study: to estimate traffic across the EU's internal borders (Polish-German, Polish-Slovakian, Polish-Czech and Polish-Lithuanian), also with regard to mirror statistics. Partial estimation of domestic traffic may be an extra result. Selected data sources from national authorities show the scale of border movement that is regarded as tourism in terms of statistical surveys.
Plan of Combining Datasets:
• Unifying structure of data sets;
• Collecting exogenous variables (road class, etc.);
• Preparing distance and graph matrices;
• Quantifying reliability of each data source (expected standard error);
• Combining traffic data from different sources with cross-entropy econometrics.
Main benefits and value added for official statistics: decreased burden on interviewers, more detailed results than from the survey alone, data consistent with mirror statistics.
41
SPECIFICITY OF DATA SOURCES
• Main objective: estimation of road traffic at the EU's internal border, including mirror statistics
• Data sources:
– data from road sensors (currently for Poland, Germany, Lithuania and Slovakia)
– sample survey coordinated by the Statistical Office in Rzeszów
• Methodology: relative non-extensive entropy econometrics
• Problems: representativeness, pooling of data sets, cost-effective solutions
42
SPECIFICITY OF DATA SOURCES
Large amount of data but only at several time points (general measurement)
High frequency data but in small amount (continuous measurement)
Irregular data gaps
Recording of internal traffic (bias of result)
Changes of road network
43
No. of measurement point (chart rows; columns are the years 2005-2015): 71001, 70814, 80201, 80303, 11401, 10702, 11309, 71201, 10911, 51108, 51201, 51106, 51408, 60403, 60209, 30801, 20802, 12074, 70803, 10911, 71106, 40534, 1091, 30604, 30412, 30801, 40310, 30601, 10604, 60315, 60701, 31708, 31502, 31507, 20602, 12071, 81106, 50601, 60714, 31501, 31701, 31718, 31201, 32048, 32068
[Chart: Availability of data – continuous traffic measurement]
44
[Chart: Availability of data – general traffic measurement]
45
[Chart: Availability of data – BAST]
46
[Chart: Availability of data – three sources of data]
47
Szczecin
BAST
GDDKiA
Kołbaskowo
48
49
[Chart: Słubice, quarterly traffic I-IV, 2003-2007 (0-1,100,000), BASt vs SG]
[Chart: Olszyna, quarterly traffic I-IV, 2003-2007 (0-1,100,000), BASt vs SG]
[Chart: ŚDRR, 2006-2015 (0-16000), Wierzbica - Pułtusk and Stolno-Kończewice]
50
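• Minimization of the formula
• (eq.1) $$H_q(p//p^0, r//r^0, w//w^0) = \alpha \sum p_{km} \frac{(p_{km}/p_{km}^0)^{q-1} - 1}{q-1} + \beta \sum r_{nj} \frac{(r_{nj}/r_{nj}^0)^{q-1} - 1}{q-1} + \delta \sum w_{ts} \frac{(w_{ts}/w_{ts}^0)^{q-1} - 1}{q-1}$$
• under conditions
• (eq.2) $$Y = X \cdot \beta + e = X \cdot \sum_{m=1}^{M} v_m \frac{p_{km}^q}{\sum_{m=1}^{M} p_{km}^q} + \sum_{j=1}^{J} z_j \frac{r_{nj}^q}{\sum_{j=1}^{J} r_{nj}^q}$$
• (eq.3) $$\sum_{k=1}^{K} \sum_{m>2}^{M} p_{km} = 1$$
• (eq.4) $$\sum_{n=1}^{N} \sum_{j>2}^{J} r_{nj} = 1$$
• (eq.5) $$\sum_{t=1}^{T} \sum_{s>2}^{S} w_{ts} = 1$$
• and other a priori information.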
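Each of the three terms in (eq.1) has the same form: a relative (cross-) entropy of order q between a probability vector and its prior. A minimal sketch of one such term is given below; it evaluates the criterion only and does not perform the constrained minimisation. The function name is hypothetical, and the special case q = 1, where the expression tends to the Kullback-Leibler divergence, is added for completeness.

```python
import math

def tsallis_relative_entropy(p, p0, q):
    """Sum over k of p_k * ((p_k / p0_k)**(q - 1) - 1) / (q - 1).

    p and p0 are probability vectors of equal length; p0 must be strictly positive.
    """
    if len(p) != len(p0):
        raise ValueError("p and p0 must have the same length")
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: Kullback-Leibler divergence.
        return sum(pk * math.log(pk / p0k) for pk, p0k in zip(p, p0) if pk > 0)
    return sum(pk * ((pk / p0k) ** (q - 1) - 1) / (q - 1)
               for pk, p0k in zip(p, p0) if pk > 0)

prior = [0.25, 0.75]
print(tsallis_relative_entropy(prior, prior, q=2))       # identical vectors give 0
print(tsallis_relative_entropy([0.5, 0.5], prior, q=2))  # departure from the prior is penalised
```

The weighted sum of three such terms, with weights α, β and δ, gives the objective in (eq.1); a numerical optimiser would then search for p, r and w that minimise it subject to (eq.2)-(eq.5).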
51
[Flowchart steps]
1. Find data common for both sources: S1, S2.
2. Link available data: Y.
3. Save result in resulting matrix.
4. Complete data so as to minimize cross-entropy between remaining S1 data and linked Y data.
5. Complete data so as to minimize cross-entropy between remaining S2 data and linked Y data.
6. Has iteration been completed? If no, repeat; if yes, continue with the imputation algorithm.
52
[Flowchart steps]
1. Find data gaps. Designate base row and base column.
2. For each row, count the number of common data; for each column, count the number of common data.
3. For each row, examine similarity to base row; for each column, examine similarity to base column.
4. Select optimal vector (row or column).
5. Complete data so as to minimize cross-entropy between optimal and base vectors.
6. Select records that meet information criteria.
7. Has iteration been completed? If no, repeat; if yes, proceed to substantive verification.
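The row-based part of these steps can be sketched in a heavily simplified form. The sketch below counts common (jointly observed) data between a base row and candidate rows, selects the row with the largest overlap, and fills the gaps of the base row from it. Note the substitution: instead of minimising cross-entropy, it rescales the donor row by the ratio of the sums over the common positions. Function names and the numbers are hypothetical.

```python
def common_count(a, b):
    """Number of positions observed (not None) in both vectors."""
    return sum(x is not None and y is not None for x, y in zip(a, b))

def fill_from_best_donor(base, donors):
    """Fill gaps in `base` from the donor row sharing the most observations.

    Simplification: ratio scaling stands in for cross-entropy minimisation.
    `donors` must contain at least one row.
    """
    best = max(donors, key=lambda d: common_count(base, d))
    pairs = [(x, y) for x, y in zip(base, best)
             if x is not None and y is not None]
    donor_sum = sum(y for _, y in pairs)
    if not pairs or donor_sum == 0:
        return list(base)  # nothing to calibrate on: leave the gaps as they are
    ratio = sum(x for x, _ in pairs) / donor_sum
    return [x if x is not None else (None if y is None else y * ratio)
            for x, y in zip(base, best)]

# Hypothetical yearly traffic counts; None marks a gap.
base_row = [100, None, 120]
candidate_rows = [[50, 55, 60], [None, 7, None]]
print(fill_from_best_donor(base_row, candidate_rows))
```

The same functions can be applied to columns by transposing the matrix, after which the better of the row-based and column-based candidates would be chosen.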
53
[Chart, 2005-2015, values 6000-9000: First source; Second source]
54
[Chart, 2005-2015, values 6000-9000: First source – data to be linked; Second source – data to be linked; Data not used during linking]
55
[Chart, 2005-2015, values 6000-9000: First source – data to be linked; Second source – data to be linked; Data not used during linking; After sources have been combined]
56
[Chart, 2005-2015, values 6000-9000: Data not used during linking; After sources have been combined]
57
[Chart, 2005-2015, values 6000-9000: Data not used during linking; Imputation to first source]
58
[Chart, 2005-2015, values 6000-9000: Data not used during linking; Imputation to second source]
59
[Chart, 2005-2015, values 6000-9000: Data not used during linking; Imputation using crossword algorithm]
60
[Chart, 2005-2015, values 6000-9000: First source; Second source; Final result]
61
How big is Tourism's Potential?
62
63
How big is Tourism's Potential?
• Border movement or immigration?
• Train and various other modes of transport for travelling
• Air traffic – e.g. distribution of flights
• Hotel chains
64
Air traffic data obtained by day-by-day webscraping
29 airports: 15 Polish airports and 14 hub airports (selected according to traffic intensity from Polish airports)
Data consists of
• origin and destination airport,
• IATA and ICAO codes,
• type of aircraft,
• date of arrival and departure
Supplementary information
• IATA, ICAO and FAA code database for retrieving origin and destination country,
• sample survey on tourism conducted by the Statistical Office in Rzeszów
65
Air traffic data issues
• duplicates,
• coding problems,
• capacity of aircraft differs between airlines even for the same type of aircraft,
• the IATA, ICAO and FAA code database needs updates
66
Direct flights from Poland to
• 198 airports in 54 countries
• 19% - internal traffic
• 81% - external traffic
Direct flights from Poland and hub airports to
• 484 airports from 117 countries
Sample survey covers up to 80 countries
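How scraped flight records might be de-duplicated and split into internal and external traffic can be sketched as follows. The record layout, field names and the tiny airport-to-country lookup are assumptions for illustration; a real lookup would come from the IATA/ICAO/FAA code database mentioned above.

```python
from collections import Counter

# Hypothetical excerpt of an IATA airport code -> country code lookup.
AIRPORT_COUNTRY = {"WAW": "PL", "KRK": "PL", "GDN": "PL", "FRA": "DE", "LHR": "GB"}

def deduplicate(flights):
    """Keep the first record for each (origin, destination, departure, aircraft)."""
    seen, unique = set(), []
    for f in flights:
        key = (f["origin"], f["destination"], f["departure"], f["aircraft"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique

def traffic_shares(flights, home="PL"):
    """Share of flights by destination type: internal, external or unknown."""
    kinds = Counter()
    for f in flights:
        country = AIRPORT_COUNTRY.get(f["destination"])
        if country is None:
            kinds["unknown"] += 1   # code missing from the lookup: needs an update
        elif country == home:
            kinds["internal"] += 1
        else:
            kinds["external"] += 1
    total = sum(kinds.values())
    return {kind: n / total for kind, n in kinds.items()}

# Hypothetical scraped records; the third one duplicates the second.
scraped = [
    {"origin": "WAW", "destination": "KRK", "departure": "2017-05-01T08:00", "aircraft": "DH8D"},
    {"origin": "WAW", "destination": "FRA", "departure": "2017-05-01T09:30", "aircraft": "A320"},
    {"origin": "WAW", "destination": "FRA", "departure": "2017-05-01T09:30", "aircraft": "A320"},
    {"origin": "KRK", "destination": "LHR", "departure": "2017-05-01T10:15", "aircraft": "B738"},
    {"origin": "GDN", "destination": "FRA", "departure": "2017-05-01T11:00", "aircraft": "A319"},
]
flights = deduplicate(scraped)
print(len(flights), traffic_shares(flights))
```

Counting unknown codes separately, rather than treating them as external, makes gaps in the code database visible instead of silently biasing the shares.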
67
68
Countries by direct flights from Poland
69
…by one connecting flight in Germany
… by one connecting flight outside Poland
70
[Bar chart, 0-12000: Number of flights directly from Poland – 51820 flights to 53 countries]
71
[Bar chart, 0-12000: …with one connecting flight in Germany – 72368 flights to 102 countries]
72
[Bar chart, 0-12000: … with one connecting flight outside Poland – 95311 flights to 145 countries]
73
74
Distribution of Polish residents destinations
75
Big Data can be implemented in the Integrated Survey of Tourism
• Data on road traffic intensity may decrease the burden on interviewers at the border crossings for which data are obtained
• Data on air traffic may increase the coverage of countries for which statistics are produced
• Selection of ca. 35 data sources for the pilot city
• Duplicates and accuracy
• Benefits
– List by type of units and its facilities
– Prices in time series, as category or exact prices
• Additional outcomes
– Using Big Data to collect missing information in Business Register
– Monitoring of NACE
76
Elementary Data. List of units with names and additional attributes
1. Villa Meduza
2. Park Design - Apartments M&M (Polanki Park)
3. Apartament Manhattan
4. Bursztyn Spa
5. Jantar Apartament - Exclusive Marine Polanki
6. Apartament 233 w Hotelu Diva Kołobrzeg
7. 100-SIO Apartamenty
8. Apartament Solna - Eldorado z garażem
9. Apartamenty Bodnar Polanki
10. Kurhotel Etna
11. Penthouse Apartment Morska
12. Luksusowy Apartament 553
13. Taaakaryba
14. Apartamenty Sun&Snow Polanki
…
• More accurate elementary data
• Aggregated data
77
WP7 will prepare recommendations (as a final product of its work) for a second wave of pilots in the 3 domains. One of the themes is the Tourism domain.
In the first stage of working with big data on air traffic we estimated the distribution of flights. Relying on big data alone led to a solution with some drawbacks. We therefore supplemented the big data sources with statistics from the sample survey on trips and estimated the distribution of trips. Following this experience we conclude that we can go much further and connect big data with every statistic in the sample survey on trips. That connection can be direct, as in the case of trips, or indirect, as in the case of expenditures in specific domains.
At the moment we are working on an inter-domain case connecting agriculture and tourism. By combining big data from webscraping with the sample survey, we want to obtain more detailed information on agrotourism than from the sample survey alone.
78
• Dedicated web portals for agrotourism
• First results can show the difference of price in time series
• Will be also used for the work on combining data
79
Summing up
• Hotel chains
• Air traffic – distribution of flights
• Train/various modes of transport for travelling
• Top destinations in the EU
• Border movement or immigration?
The level of difficulty grows when we combine big data with official research, but this is necessary to improve the quality of data sources.
Sometimes Big Data helps to improve official statistics and sometimes official data improves Big Data.
Main statistical findings
• Top destinations in the EU
• Economic aspects of international travel
Statistics on tourism demand are collected in relation to the number of tourism trips made (and the number of nights spent on those trips), separated by:
• destination country;
• purpose;
• length of stay;
• accommodation type;
• departure month;
• transport mode;
• expenditure.
80
To realise the great potential of these sources for the production of statistics, more advanced methodology is needed in order to face the challenges.
81
POPULATION
• PL Use Case 1: the need to prepare a training data set for your language; test our machine learning tool (we can share three different tools for this purpose). Implemented in Python.
• UK Use Case 2: the lexicon is available for different languages, but not for Polish; Python scripts for querying the API and for assessing the scores of the comments are available.
• UK Use Case 3: use Google Trends (freely available) to identify morbidity areas; the use case on depression for Spain, UK and Poland was conducted by ONS UK.
TOURISM
• PL Use Case 1: Flight movement – border crossing: Python scripts to web scrape the data on flights can be shared with other countries; the number of passengers needs to be estimated based on the aircraft type; before starting implementation you need to select airports in your country.
• PL Use Case 2: Road sensors – border crossing: the data need to be obtained, which may require signing agreements with data owners; the methodology can be shared.
AGRICULTURE
• PL Use Case 1: Sentinel data are free; access to LPIS; in situ survey data needed for training; GIS software: the Polish use case uses dedicated software, but the free SNAP is an alternative; a methodology for the segmentation is needed; all aspects of the PL implementation can be shared.
• IE Use Case 2: storage space; Sentinel-2 data are free; SNAP software is free; methodology and implementation guidelines can be shared.
82
After the preliminary conditions and obstacles to pilot implementation were identified, the discussion turned to the possibilities of other countries implementing the presented use cases. The results are presented in the table below: blue means that the pilot was implemented, red means that there will be an attempt to implement the pilot.
Identifier Topic/source IE NL PL PT UK
Use Case 1 Population PL: Life satisfaction (Twitter)
Use Case 2 Population UK: People’s opinion (Guardian)
Use Case 3 Population UK: Depression (Google Trends)
Use Case 1 Tourism PL: Border crossing (road sensors)
Use Case 2 Tourism PL: Border crossing (flight websites)
Use Case 1 Agriculture PL: Crops (satellite images)
Use Case 2 Agriculture IE: Crops (satellite images)
83
84
Domain Topic/source IE NL PL PT UK
Population Life satisfaction by gender based on Twitter P P D P P
Population People’s opinion/interestingness by topics based on websites P P D
Population Depression by country P D
Tourism Internal EU border crossing by country and type of transportation D
Agriculture Crops by region/crop type D P D
Legend: D – will deliver the values for the indicator, P – possible delivery depending on the pilot implementation
85
• Human well-being (Python)
• Border movement – traffic sensors (R language)
• Flight movement (Python)
• Hotel chains (Python)
• Agrotourism (Python)
• Crop types (dedicated tools)
• Life satisfaction – pilot by ONS UK (Python)
• And many more…
86
87