DSS implementation and data ingestion
Summary

Background and motivations
AI Methodology
    Data research, evaluation and selection
    KPIs identification
    Identification of the levers to improve KPIs
    Scenarios analysis
DSS Data
    Data providers
DSS Architecture
    Architectural choices
        Apache Mesos
        Integration with Apache Marathon
        Datacenter Operating System
        Kafka
        Spark
        Cassandra
        WSO2 API Management
        Data flow
Focus on Lombardy Region
    Innovation Hub for green business
BACKGROUND AND MOTIVATIONS
The availability of models able to highlight the specific needs of a territory and
suggest efficient levers is a valuable support tool for decision-making processes
aimed at establishing targeted interventions that achieve concrete results. Such models can be
obtained through the systematic use of data-driven methodologies, in combination with Big
Data Analytics and AI tools. Artificial Intelligence can be used to identify the main levers on
which it is possible to act to overcome shortcomings, or to enhance situations that are already
virtuous but still have room for improvement, so as to work on KPIs in the most
efficient and effective way by "learning" from the past experience of the territory itself or of
similar territories. In this way, it is possible both to simulate the impact of
positive actions undertaken by competitor regions in the same areas and to prevent
negative trends that have already occurred in other regions.
The starting point for building such models is the availability of quality data. The data can be
collected from heterogeneous data providers and subsequently integrated. Once
integrated, they have to be properly normalized and pre-processed before being provided as
input to AI algorithms that automatically analyse them to build models that quantify the
relationships between phenomena.
In this scenario, the use of automated processes for data collection, normalization, and
pre-processing helps to minimize the time required and to maximize the accuracy of
the process.
For the above-mentioned reasons, the design of the platform has been completed. The
platform will be used to automatically collect data and to host the algorithms needed to
process those data.
In particular, once implemented, the platform will support both data collection and
data analysis, in order to obtain models that support the decision-making process.
AI METHODOLOGY
The underlying methodological approach involves comparing the region, or
set of regions, of interest with a set of reference regions, following a "data-driven" approach.
The real innovation of this methodology consists in the use of automatic analysis
techniques, not only to identify the main problems or shortcomings of the region of interest,
but also to identify the main levers on which it is possible to act to overcome the
shortcomings or to enhance virtuous situations.
The flow of the methodological approach is schematised in the following figure, which lists
the relevant steps described in detail below.
Figure 1: Data-driven analysis
Data research, evaluation and selection
The first step is to find reliable, objective-oriented quantitative information,
taking into account both the specific territorial context and the surrounding one. To this end,
datasets containing values for as large a set as possible of European regions at NUTS2, or
possibly NUTS1, level are necessary.
KPIs identification
To identify key performance indicators (KPIs), it is first necessary to analyse the relevant
high-level indicators found in the different regional plans for the field of interest.
The KPI research is then completed by analysing the literature related to the topics of interest,
in order to identify further potential levers with documented impacts on
the relevant KPIs in the European territory.
For the chosen set of candidate KPIs, the performance of the region of interest is compared
with that of a set of competitor regions to identify the areas of intervention. Competitor
regions are selected on the basis of a context analysis where a set of regions represents
known "competitors". Where the literature in the sector does not suggest consolidated
groups of competing regions, the selection is made from the data, based on similarity with
the region of interest, computed automatically on a set of context indicators. For example,
if the theme of social inclusion is considered, the regions most similar to the target region
in terms of demographic and industrial structure can be selected.
The main purpose of this step is to provide evidence of the positioning of the region under
analysis with respect to the competitor regions, with reference to a particular KPI.
Identification of the levers to improve KPIs
Artificial Intelligence is used to identify the main levers on which it is possible to act
to overcome shortcomings or to enhance already virtuous situations but with further room
for improvement, in order to work on KPIs in the most efficient and effective way.
The data-driven lever selection is based on the use of AI algorithms which process data
related both to an output indicator (the KPI), the dependent variable, and to a large pool of
input indicators (the levers), the independent variables. The aims of this processing are to
identify the levers having the greatest impact on the KPI and to assess whether that impact
is positive or negative. The identification of the relevant levers and of their impact
takes place with the so-called "learning by examples" approach, where each example is
represented by the data of a specific region.
Specifically, the levers-KPIs relationship is modelled as a multivariate regression (i.e.
dependent on several variables) in which the KPI is a function of the lever indicators.
Therefore, each training region is described by a set of values for the lever indicators
(inputs) and a single value for the KPI (output). Multiple KPIs are treated in parallel.
From the Machine Learning point of view, each region represents an "example", while the
whole set of regions constitutes the training set used to train the machine learning algorithm,
in order to identify the multivariate function that best models the relationship between levers
and KPIs, based on the available examples. For this analysis, the decision tree approach is
used, as it identifies which of the levers are actually relevant to the KPIs by creating
an input-output function that depends on a subset of the inputs.
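The "learning by examples" setup can be illustrated with a minimal sketch, in which a single-split regression stump (the building block of a decision tree) selects the lever and threshold that best explain the KPI across the training regions. All region data, lever names and KPI values below are invented for illustration; a real analysis would use a full decision tree over the actual indicators.

```python
# Each region is one training example: lever values (inputs) and a KPI
# value (output). The stump picks the lever and threshold that minimise
# the squared error of the resulting two-group prediction, i.e. the
# lever most relevant to the KPI.

def sse(ys):
    """Sum of squared errors around the mean of ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(examples, kpi):
    """Return (lever_index, threshold, error) of the best single split."""
    n_levers = len(examples[0])
    best = None
    for j in range(n_levers):
        values = sorted(set(r[j] for r in examples))
        for t in values[:-1]:
            left = [k for r, k in zip(examples, kpi) if r[j] <= t]
            right = [k for r, k in zip(examples, kpi) if r[j] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[2]:
                best = (j, t, err)
    return best

# Four fictional regions, two candidate levers: R&D spend, broadband coverage.
levers = [(1.0, 40.0), (1.2, 45.0), (3.0, 42.0), (3.5, 41.0)]
kpi    = [10.0, 11.0, 20.0, 21.0]   # e.g. patents per 100k inhabitants

lever_idx, threshold, _ = best_split(levers, kpi)
print(lever_idx)  # 0 -> the first lever (R&D spend) explains the KPI best
```

The split with the lowest residual error identifies the relevant lever, mirroring how the decision tree ends up depending only on a subset of the inputs.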
Scenarios analysis
The last stage consists in building predictive scenarios for the chosen KPIs, in
order to define realistic targets. In fact, the quantitative modelling of the relationship
between levers and KPIs makes it possible to create projective scenarios based on
assumptions about the future values of the lever indicators, and hence to compute
automatically the corresponding value of the KPI, setting realistic targets.
In particular, starting from the analysis of the trends of competitor regions, the
methodology makes it possible to estimate the KPI evolution in the following scenarios:
- Neutral scenario: the region keeps pursuing the same KPI-related policies as in the
past.
- Best-case scenario: the region improves its effectiveness in pursuing policies
related to the KPI, until it reaches the same performance as the leading
competitor region.
- Worst-case scenario: the region worsens its effectiveness in pursuing policies
related to the KPI, following the trend of the least performing competitor region.
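As an illustration only, the three scenarios can be sketched as simple trend extrapolations. The KPI series and competitor trends below are invented, and a real implementation would project through the fitted levers-KPIs model rather than raw growth rates.

```python
# The KPI is extrapolated with the region's own historical annual growth
# (neutral), the leading competitor's growth (best case) and the least
# performing competitor's growth (worst case).

def annual_growth(series):
    """Average annual growth factor of a time series."""
    steps = len(series) - 1
    return (series[-1] / series[0]) ** (1 / steps)

def project(last_value, growth, years):
    """Project last_value forward with a constant growth factor."""
    return [last_value * growth ** y for y in range(1, years + 1)]

region_kpi       = [100, 103, 106, 110]   # region of interest, last 4 years
best_competitor  = [90, 99, 109, 120]     # leading competitor
worst_competitor = [95, 93, 91, 88]       # least performing competitor

scenarios = {
    "neutral": project(region_kpi[-1], annual_growth(region_kpi), 3),
    "best":    project(region_kpi[-1], annual_growth(best_competitor), 3),
    "worst":   project(region_kpi[-1], annual_growth(worst_competitor), 3),
}
for name, values in scenarios.items():
    print(name, [round(v, 1) for v in values])
```

The best-case projection then serves as a realistic upper-bound target, since it reflects a performance level actually achieved by a comparable region.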
DSS DATA
In order to perform the analysis, the platform requires the availability of "quality"
data, i.e. data that are: available in a digital and homogeneous format, so that they can
be processed automatically; reliable, in order to guarantee the accuracy of the results of
the analysis; regularly updated, in order to analyse the effect of a specific intervention;
and highly granular, in order to identify the most advantageous conditions towards which
interventions should be oriented.
These data need to be extracted from heterogeneous data providers and then integrated in
compliance with data processing regulations (i.e. the GDPR).
Data providers
The following initial set of relevant data providers has been identified for ingestion into
the DSS:
- OECD – Regional Statistics and Indicators
- OECD REGPAT – Regional Patent Database
- ISTAT – Indicators of Fair and Sustainable Welfare
- ISTAT – Regional Statistics
- SALUTEGOV – Regional Economic and Financial Database archive
- SALUTEGOV – ASL Structures and activities
- AIFI – Venture Capital Investments
- Terna – Electric energy consumption
- INAPP – IeFP courses data
- INDIRE – Dual system data
- CRISP and Fondazione Agnelli – Technical and professional schools performance data
- COEWEB – Import-export data
- EQI QOG – European Quality of Government Index
- ISPRA – Environmental data
- 2017 RIS
- EUROSTAT
DSS ARCHITECTURE
Architectural choices
This section details the architectural choices for the realisation of a distributed and
scalable computing platform for the storage and analysis of large amounts of data and
real-time information flows. All the technological components and the interactions among
them are analysed.
Apache Mesos
Mesos (http://mesos.apache.org) is a kernel for distributed systems which enables cluster
management by abstracting resources such as memory, storage space and CPU, and
making them available, in isolation, to the services that request them.
It can be installed on local servers or on the main cloud providers (AWS, Azure, GCP), and
is compatible with Linux, macOS and Windows servers.
Figure 2: Dynamic configuration of nodes with Apache Mesos
Mesos uses a master/agent architecture. The master nodes manage the distribution of
resources on the agent nodes, i.e. the nodes where the services actually run. Replicating
the master nodes yields a fault-tolerant configuration.
Mesos natively supports containers (with Docker, appc and OCI images), relying either
directly on Docker or on its own containerizer, which allows greater control over
resources. An integrated REST API makes it possible to automate cluster monitoring and
management procedures.
Programs that run on Mesos are called frameworks; they receive resource offers from the
master and allocate those resources to the services they manage.
Figure 3: Allocation of resources to different frameworks via Mesos
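Mesos's two-level scheduling (the master offers agent resources; frameworks accept the offers that fit their pending tasks) can be mimicked with a toy allocation loop. Agent names, task names and resource figures below are invented for illustration:

```python
# Toy model of Mesos two-level scheduling: the master offers each agent's
# free resources to a framework, which accepts offers that fit its tasks.

agents = {"agent-1": {"cpus": 4, "mem": 8192},
          "agent-2": {"cpus": 2, "mem": 4096}}

pending_tasks = [{"name": "kafka-broker", "cpus": 2, "mem": 4096},
                 {"name": "spark-driver", "cpus": 2, "mem": 2048}]

placements = {}
for agent, free in agents.items():          # master offers the agent's resources
    for task in list(pending_tasks):        # framework inspects the offer
        if task["cpus"] <= free["cpus"] and task["mem"] <= free["mem"]:
            free["cpus"] -= task["cpus"]    # accept: resources are reserved
            free["mem"] -= task["mem"]
            placements[task["name"]] = agent
            pending_tasks.remove(task)

print(placements)
```

The point of the two-level design is that the master only tracks and offers resources, while placement decisions stay with each framework.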
Integration with Apache Marathon
Marathon (http://mesosphere.github.io/marathon) is a framework for Mesos that
orchestrates services and other frameworks on Mesos, similarly to what an init system like
systemd does, while also providing service discovery and load balancing capabilities.
Marathon runs services on the cluster according to given constraints. If a
service, or an instance of it, stops working, it is restarted; if a node crashes, its allocated
services are automatically relocated to other machines. Marathon also scales service
instances according to their load, and provides persistent volumes to services such as
databases. As with Mesos, a REST API is available to automate and customise the
management of services.
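As a sketch of how services reach Marathon, an application is described by a JSON document submitted to its REST API (POST /v2/apps); Marathon then keeps the requested number of instances running. The service id, image and resource figures below are invented for illustration:

```python
import json

# Hypothetical Marathon application definition for a DSS ingestion worker.
app = {
    "id": "/dss/ingestion-worker",            # invented service id
    "cmd": "python worker.py",
    "cpus": 0.5,
    "mem": 512,
    "instances": 3,                           # Marathon restarts/relocates these
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/ingestion-worker:latest"},
    },
}
payload = json.dumps(app)
# e.g. submit with: curl -X POST http://<marathon-host>:8080/v2/apps -d @app.json
print(payload[:40])
```

Because the definition is plain JSON over a REST API, deployments can be versioned and automated alongside the rest of the platform configuration.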
Datacenter Operating System
Datacenter Operating System, or DC/OS (http://dcos.io), is a commercial product based
on Mesos and Marathon, also available as a free open-source version. DC/OS
simplifies the installation on nodes, provides a repository of services that can be installed
directly on the cluster, and offers an interface that simplifies the deployment,
configuration and updating of services.
Kafka
Kafka (http://kafka.apache.org) is a distributed streaming platform. Input data are
immediately persisted to disk and retained for a configurable period of time.
Data streams are organised into topics, which are divided into partitions stored on
several servers.
A software program can subscribe to the Kafka topics it wants to receive messages from,
and can read all retained messages, starting from any position in the log.
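The topic/partition/offset semantics described above can be illustrated with a toy in-memory model. This is not the real Kafka API, and all names below are invented; it only shows how messages are appended to a partition log and remain re-readable from any offset:

```python
from collections import defaultdict

class Topic:
    """Toy model of a Kafka topic: partitioned, append-only, re-readable."""

    def __init__(self, partitions=2):
        self.n = partitions
        self.logs = defaultdict(list)          # partition -> append-only log

    def produce(self, key, value):
        partition = hash(key) % self.n         # the key decides the partition
        self.logs[partition].append(value)
        return partition, len(self.logs[partition]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        return self.logs[partition][offset:]   # read again from any offset

topic = Topic()
p, off = topic.produce("region-lombardy", "kpi-update-1")
topic.produce("region-lombardy", "kpi-update-2")
print(topic.consume(p, off))   # both messages, in append order
```

Retention on disk is what makes this possible in Kafka itself: consumers track their own offsets, so a slow or restarted consumer simply resumes reading from where it left off.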
Spark
Spark (http://spark.apache.org) is a framework for distributed data processing,
particularly well suited to machine learning applications. It is very fast (up to two orders
of magnitude faster than Hadoop MapReduce) thanks to in-memory data processing and
to the optimisation of the computation pipeline.
Several high-level libraries already integrated in Spark provide functionality
for Machine Learning (ML), graph management and analysis, data manipulation through
SQL-like queries, and real-time data streaming. The use of these libraries, together with
other non-native tools that can be integrated with Spark, accelerates development and
implementation.
Figure 4: The Spark integrated library ecosystem
Furthermore, Spark offers an ML library (MLlib) and integrates partially or fully with other
libraries, creating a complete ecosystem for prototyping, developing and deploying machine
learning solutions.
MLlib. The native library for ML in Spark, originally inspired by scikit-learn (from which it
inherits the pipeline concept); it provides a Python interface and includes a complete
collection of machine learning algorithms and data transformation capabilities. It is actively
supported, open-source, and has a growing community of developers and users that
guarantees support and improvements over time.
H2O. A machine learning platform that, through a complete integration ("Sparkling
Water"), calls itself "the killer application for Spark"
(https://blog.h2o.ai/2014/06/h2o-killer-application-spark/). It allows the construction of
machine learning models and pipelines in distributed computing environments, and also
has a Python interface. It partially overlaps with and complements the MLlib algorithm
library (e.g. with dedicated Deep Learning functionality).
Other ML libraries integrate partially or fully with Spark, further expanding and completing
the pool of algorithms provided by the two main libraries above.
Some examples of integration with other libraries are:
XGBoost. A library implementing learning algorithms from the gradient boosting
framework, compatible with the Spark environment and successfully used in other
use cases. Several benchmarks show it to be more efficient and better performing than
MLlib and H2O in the class of algorithms in which it specialises
(http://datascience.la/benchmarking-random-forest-implementations/).
Spark-sklearn bridge. Parallelisation of scikit-learn tasks on distributed environments;
scikit-learn is the main Python library used successfully in the past for prototyping and
developing other use cases.
Cassandra
Cassandra (http://cassandra.apache.org) is a distributed NoSQL database that uses a
column-family model. Data are automatically replicated across multiple nodes in the
cluster, or across multiple clusters, to ensure fault tolerance. The Cassandra cluster is
based on a peer-to-peer architecture that tolerates the failure of multiple nodes and
distributes the workload.
Cassandra is extremely scalable (clusters with more than 50k nodes are known).
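The peer-to-peer replica placement can be sketched roughly as hashing a row key onto a ring of nodes and writing to the next few of them. Node names and replication factor below are invented, and real Cassandra uses a more sophisticated partitioner and configurable replication strategies; this only illustrates why the loss of a node does not lose data:

```python
import hashlib

nodes = ["node-a", "node-b", "node-c", "node-d"]
replication_factor = 3

def replicas(key):
    """Hash the row key onto the ring, take the next RF nodes as replicas."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

owners = replicas("region:ITC4")   # hypothetical row key
print(owners)                      # three distinct nodes hold the row
```

With three copies of every row, any single node can drop out of the cluster while reads and writes continue against the remaining replicas.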
WSO2 API Management
The API Management functionalities of the WSO2 framework (http://wso2.com/api-
management) are used to manage access to APIs on different backends. This module
simplifies API management by controlling access and supporting different types of
authentication and authorisation, as well as validating content and providing protection
against bots and forged tokens. It can limit the number of requests based on the service
and the specific user, and provides advanced statistics on API usage for monitoring
purposes.
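The per-user request limiting described above can be illustrated with a toy token bucket. Limits and user ids below are invented, and WSO2's actual throttling is configured through policies rather than coded like this; the sketch only shows the underlying idea:

```python
import time

class TokenBucket:
    """Toy token bucket: allows bursts up to `capacity`, refills at `rate`/s."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                           # request is throttled

buckets = {}
def allow_request(user):
    """One bucket per user, so limits apply to each user independently."""
    bucket = buckets.setdefault(user, TokenBucket(rate=1, capacity=2))
    return bucket.allow()

results = [allow_request("analyst-1") for _ in range(4)]
print(results)   # the first two requests pass, then the burst is exhausted
```

Keeping one bucket per (user, service) pair is what allows the gateway to protect each backend from any single consumer without affecting the others.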
Data flow
The real-time data flow, based on the components described above, is summarised in
the figure.
Figure 5: Real-time data flow
Real-time data are taken from remote streams and collected by Kafka, which retains them
for a defined period of time, guaranteeing their integrity. From Kafka, the data are picked
up by the real-time flow, based on Apache Spark Streaming, which processes them in
micro-batches, filtering, transforming and saving them to a Cassandra database. The
batch flow, based on Apache Spark and exploiting the analysis capabilities provided by
MLlib and H2O, accesses the data in Cassandra, and possibly the Kafka streams directly,
and performs the analyses, saving the results in the database. Through APIs, a web
service can access the results of the analysis recorded in the database and present them
to the user.
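The flow above can be sketched end to end with plain Python standing in for Kafka, Spark Streaming and Cassandra. Records, field names and the validity rule below are invented; the sketch only shows the stream-to-micro-batch-to-store shape of the pipeline:

```python
# A list stands in for the Kafka stream, a dict for the Cassandra table.
stream = [
    {"region": "ITC4", "kpi": "patents", "value": 21.0},
    {"region": "ITC4", "kpi": "patents", "value": -1.0},   # invalid reading
    {"region": "AT13", "kpi": "patents", "value": 18.5},
]

database = {}                                 # stands in for Cassandra

def micro_batches(records, size=2):
    """Chunk the stream into fixed-size micro-batches, Spark Streaming style."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

for batch in micro_batches(stream):
    clean = [r for r in batch if r["value"] >= 0]          # filter step
    for r in clean:
        database[(r["region"], r["kpi"])] = r["value"]     # save step

print(database)
```

The batch analysis flow then reads from this store (or from the stream directly), and the web service exposes its results through the API layer.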
FOCUS ON LOMBARDY REGION
Innovation Hub for green business
Within the AlpGov2 project, besides the activities carried out by each Action Group, five
transversal Strategic Policy Areas are pursued through the collaboration of different Action
Groups: besides the EIF - EUSALP Innovation Facility, the others are thematic initiatives
focused on different topics: Innovation Hub for Green Business, Smart Villages, Spatial
Planning, and Carbon Neutral Alpine Region.
For each of the above-mentioned topics, several indicators can be identified and different
datasets can be compared, depending on data availability, consistency and comparability.
In fact, not all data providers regularly update the data they make available in open
form, and the available datasets are often not standardized, which makes comparisons
difficult or even impossible; moreover, some datasets are not significant, or not
sufficiently thorough with respect to the geographical depth of interest. For these reasons,
the choice of the topics and the associated indicators is deliberately redundant, in order to
have a wider spectrum of possibilities to choose from. Further steps will thus deal with the
identification of the strategic areas of interest where the predictive scenarios for the
chosen KPIs are to be applied.