DG DIGIT
Unit.D.1
D03.1 DESIGN OF THE BIG DATA TEST
INFRASTRUCTURE
ISA2 action 2016.03 – Big Data for Public Administrations
“Big Data Test Infrastructure”
Specific contract n°406 under Framework Contract n° DI/07172 – ABCIII
October 2017
Specific Contract 406 – D03.1 Design of the Big Data Test Infrastructure
Date: 27/10/2017 Doc. Version: 2.0 1 / 63
This study was carried out for the ISA2 Programme by KPMG Italy.
Authors:
Lorenzo CARBONE
Simone FRANCIOSI
Silvano GALASSO
Pavel JEZ
Valerio MEZZAPESA
Alessandro TRAMONTOZZI
Stefano TURCHETTA
Specific Contract No: 406
Framework Contract: DI/07172
Disclaimer
The information and views set out in this publication are those of the author(s) and do not
necessarily reflect the official opinion of the Commission. The Commission does not guarantee the
accuracy of the data included in this study. Neither the Commission nor any person acting on the
Commission’s behalf may be held responsible for the use which may be made of the information
contained therein.
© European Union, 2017
Document Control Information
Settings Value
Document Title: D03.1 Design of the Big Data Test Infrastructure
Project Title: ISA2 Action 2016.03 – Big Data for Public Administrations – Big Data Test Infrastructure
Document Authors:
Lorenzo CARBONE, Simone FRANCIOSI, Silvano GALASSO, Pavel JEZ, Valerio MEZZAPESA, Alessandro TRAMONTOZZI, Stefano TURCHETTA
Commission Project Officer:
Marco FICHERA – European Commission – DIGIT D.1
External Contractor Project Manager:
Lorenzo CARBONE
Doc. Version: 2.0
Sensitivity: Internal
Date: October 2017
Revision History
The following table shows the development of this document.
Version | Date | Description | Created by | Reviewed by
0.1 | June 2017 | Proposal for a Table of Contents for the Report | Simone FRANCIOSI, Pavel JEZ, Valerio MEZZAPESA, Alessandro TRAMONTOZZI, Stefano TURCHETTA | Lorenzo CARBONE, Silvano GALASSO
0.2 | July 2017 | Draft version of Chapters 4, 5.1 and 5.2 | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
0.3 | July 2017 | Consolidated version of Chapters 4, 5.1 and 5.2 | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
0.9 | September 2017 | Draft version of Chapters 7 and 8 | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
1.0 | October 2017 | Complete drafted version | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
2.0 | October 2017 | Final version | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO, Marco FICHERA
TABLE OF CONTENTS
EXECUTIVE SUMMARY ................................................................................................................ 5
1. INTRODUCTION .......................................................................................................................... 10
1.1. Objectives of the document ...................................................................................................... 12
1.2. Structure of the document ........................................................................................................ 13
2. CONTEXT .................................................................................................................................. 15
2.1. The ISA2 programme and the Action 2016.03 “Big Data for Public Administrations” .............. 15
2.2. EIRA ............................................................................................................................................ 17
2.3. CEF Programme ......................................................................................................................... 20
3. METHODOLOGY FOLLOWED .......................................................................................................... 23
4. USER STORIES ........................................................................................................................... 27
5. TARGET ARCHITECTURE ................................................................................................................ 30
5.1. Architecture principles ............................................................................................................... 30
5.2. Business architecture ................................................................................................................. 31
5.3. Application-Data-Technology architecture................................................................................ 36
5.3.1. High-level architecture view......................................................................................... 37
5.3.2. Detailed view of the Analytics building block .............................................................. 44
5.4. Solution architecture ................................................................................................................. 47
5.4.1. Criteria for the identification of tools/technologies ..................................................... 47
5.4.2. Identified Solution building blocks ............................................................................... 47
6. ILLUSTRATIVE IMPLEMENTATION OF A USER STORY ............................................................................. 55
7. GOVERNANCE AND OPERATIONAL MODEL FOR THE BIG DATA TEST INFRASTRUCTURE.................................. 58
8. NEXT STEPS............................................................................................................................... 62
LIST OF FIGURES
Figure 1 – Narratives of the Big Data Test Infrastructure .................................................................. 11
Figure 2 – Key concepts in EIRA ......................................................................................................... 19
Figure 3 – Interoperability levels of the EIF ....................................................................................... 19
Figure 4 – The CEF GOFA model ........................................................................................................ 22
Figure 5 – Methodological approach followed under Task 3 ............................................................ 23
Figure 6 – Adopted Methodology for the overall study .................................................................... 26
Figure 7 – Targeted personas in scope for the User Stories .............................................................. 27
Figure 8 – Business view of the Big Data Test Infrastructure and related user stories ..................... 32
Figure 9 – Business services mapped in the template of the CEF service offering ........................... 36
Figure 10 – Preview of the logical architecture of the Big Data Test Infrastructure ......................... 37
Figure 11 – High-level architecture view of the Big Data Test Infrastructure ................................... 38
Figure 12 – Drill-down of the Analytics building block ...................................................................... 44
Figure 13 – Final priority for the identified Big Data use cases ......................................................... 45
Figure 14 – Solution Architecture of the Big Data Test Infrastructure .............................................. 48
Figure 15 – Illustrative implementation of the User Story “Big Data pilot implementation” ........... 55
Figure 16 – Reference framework for the strategy, governance and operational model ................ 58
Figure 17 – Strategy layer .................................................................................................................. 58
Figure 18 – Governance layer ............................................................................................................ 59
Figure 19 – Development and Operational layer............................................................................... 60
Figure 20 – High-level roadmap for the implementation of the Big Data Test Infrastructure .......... 62
Figure 21 – Detailed roadmap for the implementation of the Big Data Test Infrastructure ............ 63
LIST OF TABLES
Table 1 - Methodology followed for Task 3 of the study ................................................................... 26
Table 2 - "User Stories" in scope for the Big Data Test Infrastructure .............................................. 29
Table 3 - Architecture principles ........................................................................................................ 31
Table 4 – Description of the building blocks for the Application-Data-Technology Architecture ..... 42
Table 5 – Mapping between architecture building blocks and business services ............................. 44
Table 6 – Description of the architectural detailed building block .................................................... 45
Table 7 – Mapping between architecture and solution building blocks ........................................... 50
Table 8 – Description of the solution building blocks ........................................................................ 54
EXECUTIVE SUMMARY
The present Report has been issued under the ISA2 Action 2016.03 – Big Data for Public
Administrations – Big Data Test Infrastructure and is the outcome of Task 3 of the “Big Data Test
Infrastructure” project. The objectives of the project are briefly described below:
Focusing on Task 3, the present Report illustrates the designed Target architecture (business and
technical architecture) and the target governance/operational model of a Big Data Test
Infrastructure to be made available by the European Commission to other EC DGs, Member States’
Public Administrations and EU Institutions in order to:
1. Facilitate the launch of pilot projects on big data, data analytics or text mining, by
providing the infrastructure and the software tools needed to start a small project;
2. Foster the sharing of various data sources across policy domains and organisations to
support better policy-making;
3. Support Public Administrations through the creation of a Big Data community around
best practices, methodologies and artefacts (algorithms, analytical models, pilot outputs,
etc.) on big data for policy-making.
The following methodological approach (see Chapter 3) has been applied:

Step 3.1 – Identification of the User Stories
Step 3.2 – Design of the target architecture
Step 3.3 – Design of the Governance and Operational model
Step 3.4 – Report on final results

Identification of User Stories (see Chapter 4): the present Report illustrates key scenarios that
users may encounter in a Big Data context, based on business needs collected during the project
through primary data collection activities (interviews with ISA Coordination group members).
Each user story highlights a specific purpose of a potential user – a Targeted Persona – in the Big
Data field, to be supported by the future Big Data Test Infrastructure through business services
and solutions. The table below summarises the identified user stories:
User story | Big Data Test Infrastructure solution

Learning from other European PAs
The IT Director can be supported by a Big Data Community in which they can find use cases implemented by other Public Administrations and share Big Data methodologies, strategies, artefacts and outcomes, in order to spread know-how on Big Data among Public Administrations at European, national and local level.

Test off-the-shelf analytical tools
The IT Practitioner can be supported with off-the-shelf Big Data analytical tools that they can download from a catalogue to implement Big Data solutions, hiding technical complexity in order to offer easy-to-use analytical functionalities and optimise experimentation costs. The IT Practitioner can also be supported by a specialised team in implementing analytical functionalities and in finding new contexts in which to apply Analytics.

Experimenting with a Big Data platform
The IT Practitioner can be supported with a ready-to-use Big Data platform, provisioned through a structured process from a marketplace, respecting privacy policies and using open source tools, thus fostering the adoption of Big Data technologies, the acquisition of Analytics skills and an understanding of the added value of Big Data and Analytics tools in the public sector.

Big Data pilot implementation
The Policymaker can be supported with a dedicated specialised team to implement and execute a Big Data pilot on a ready-to-use Big Data platform, using pre-built analytical functionalities to save cost and time. The Policymaker can also share the pilot's results (impacts, advantages, etc.) through the Big Data Community, thus fostering the spread of know-how.

Integrating open datasets
The Data Scientist can be supported with a repository shared with other Member States, enabling the gathering and usage of the desired datasets, which can easily be downloaded from a catalogue, thus encouraging the sharing of open data among PAs working on different policy domains and helping to find solutions to problems whose correlation with the analysed policy domains was originally hidden and/or unclear.
Design of the Target Business and Technical Architecture (see Chapter 5): starting from clear and
agreed architecture principles (e.g. SW openness, reusability, etc.), the present Report describes
the designed Target Business and Technical Architecture for the future Big Data Test Infrastructure.
The target Business Architecture includes a set of business services linked to the User Stories,
which will represent the service offering of the Big Data Test Infrastructure. The set of business
services is summarised in the following table:
Business service | Description

PaaS for implementing Big Data Use Cases
This service aims at providing the Big Data platform and all the tools supplied by the European Commission. The initialisation of the platform will be a wizard-driven process in which users choose specific platform templates through a marketplace. Should new needs arise, the platform will be enriched with new functionalities. Once the platform is instantiated, its resource configuration and tool list will be kept in a catalogue for future reuse.

Data Catalogue and Data Exchange APIs
This service aims at providing data and/or exchange APIs that enable the gathering and usage of the desired data to be correlated with user data. It provides a catalogue of data sources from which users can retrieve links to data that is already available (e.g. coming from previous pilots), or access to sample datasets (classified by policy domain) which the European Commission makes available in a centralised repository. The catalogue of data sources could be enriched by users during the implementation of their pilots, for future reuse.

Analytics as a Service
This service aims at providing a list of analytical functionalities which enables any EU Institution / Public Administration to quickly access a series of customisable pre-built elaborations. The following non-exhaustive list represents a set of analytical services that could be available: extracting information from documents (text mining), time-series forecasting, geo-information normalisation, population / customer segmentation.

Community Building and Innovation Portal
This service aims at building a Big Data community where users can share knowledge and Big Data artefacts (e.g. methodologies, statistical models, pilot outcomes and datasets). It also consists in providing an innovation portal where users can contribute their own ideas in order to launch new propositions. The innovation portal will have a recommendation engine for aggregating contributions aligned to users' searches.

Big Data and Analytics software catalogue
This service aims at providing a catalogue of Analytics artefacts and software tools that users will be able to download for implementing Big Data solutions. Part of this catalogue will be a special software stack (like a Sandbox) usable in the preferred user environment (on premise or cloud): a preconfigured environment that contains services, sample data and interactive tutorials useful for testing Big Data technologies.

Support for Analytics implementation
This service aims at providing technical support by a specialised team (e.g. Business Analyst, Data Scientist or Data Engineer) which helps PAs implement Big Data pilots. For example, support could be provided in terms of sizing of the infrastructure, selection of the involved technologies, execution of the pilot, and sharing and presentation of the results. Any outcome of a Big Data pilot could feed the community portal in order to be shared with the community.

Advisory
This service aims at providing advisory support for activities related to the Big Data Test Infrastructure. It covers activities such as pilot scoping, business case definition, evaluation of risks related to the implementation of a Big Data solution, data source identification and integration, and service identification and integration.
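To illustrate the kind of customisable pre-built elaboration the "Analytics as a Service" offering describes (here, time-series forecasting), the sketch below shows a deliberately minimal moving-average forecaster. The function name and the sample data are purely illustrative assumptions, not part of the designed service:

```python
def moving_average_forecast(series, window=3, steps=2):
    """Forecast the next `steps` values as the mean of the last `window` observations,
    rolling each forecast forward as if it were an observation."""
    values = list(series)
    forecasts = []
    for _ in range(steps):
        avg = sum(values[-window:]) / window
        forecasts.append(avg)
        values.append(avg)  # treat the forecast as the next observation
    return forecasts

# Hypothetical monthly request counts from a public-administration dataset
monthly_requests = [120, 130, 125, 140, 150, 145]
print(moving_average_forecast(monthly_requests, window=3, steps=2))
```

A production service would of course expose far richer, configurable models; the point here is only that such functionalities can be offered as parameterised, ready-to-call building blocks.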
The Big Data service offering will be enabled through a Technical Architecture composed of a set of architectural building blocks describing the functional capabilities of the future Big Data Test Infrastructure. Furthermore, a preliminary view of potential solutions to be used for the implementation of the target architecture is provided (focusing on open source software).
Design of the Governance & Operational model (see Chapter 7): the present Report describes the defined high-level governance and operational model of the future Big Data Test Infrastructure, to be further elaborated in a detailed Target Operating Model. The high-level governance and operational model has been set up based on a well-defined framework and is structured in the following layers:
• Strategy layer, proposing an Operational Management Board (OMB) which will convene every month to take strategic decisions on the development of the Big Data Test Infrastructure (main processes: Big Data Test Infrastructure Strategy, Demand Management and Financial Management).
• Governance layer, focusing on service management activities based on well-known, market-leading methodologies (e.g. the ITIL framework), ensuring proper information security management, monitoring overall performance and, furthermore, managing the staff, the IT providers (Procurement Management) and the business stakeholders such as European Public Administrations and Institutions (Stakeholder Management).
• Operational layer, covering activities in the field of service development and evolution (implementation and evolution of business services and technical building blocks), day-to-day operations of the Big Data Test Infrastructure in terms of ICT Infrastructure Management, Availability and Capacity Management, and a dedicated Service Desk implementing IT service support. Finally, a Communication Office (CO) will be responsible for Big Data Community Building Management, including the promotion of the Big Data Test Infrastructure and its business services.
The next steps expected for the implementation of the Big Data Test Infrastructure are described in detail in Chapter 8. Assuming that the implementation of the Big Data Test Infrastructure would be part of the CEF Work Programme, it is planned to implement the infrastructure with an incremental approach, starting in 2018 with the implementation of the "core" Big Data services and implementing all the other business services by 2019.
1. INTRODUCTION
The amount of data generated worldwide keeps increasing at an astounding pace, growing by 40% each
year, and forecasts expect it to rise 30-fold between 2010 and 2020. Since non-interoperable means
are being used to describe data generated in the public sector, most of this data cannot be re-used.
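As a quick arithmetic check of the two figures above, 40% annual growth compounded over the ten years from 2010 to 2020 yields roughly a 29-fold increase, consistent with the cited 30-fold forecast:

```python
# Compound growth: 40% per year over the 10-year span 2010-2020.
annual_growth = 0.40
years = 10
multiple = (1 + annual_growth) ** years
print(f"Data volume multiple after {years} years: {multiple:.1f}x")  # ~28.9x
```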
Previous studies have already investigated Big Data and data analytics initiatives launched by
EU Public Administrations (PAs) and EU Institutions (EUIs) both at European and National level.
Indeed, their focus was geared towards studying the potential or added value of Big Data analytics
to help public authorities at all levels of government and in different domains in reaching their
goals, as well as towards capturing valuable lessons learned and best practices of mature public
organisations to inspire peers while helping them in further use of Big Data analytics and to become
more insight-driven. That being said, despite the various use cases covered by these
aforementioned studies, the adoption of some analytics technologies in public administrations is
still lacking. At the moment, several Cloud environments exist in the European Commission but no
Big Data infrastructure is available to any PA or EUI with a full stack of technologies
(infrastructure in terms of storage and computing capacity, analytics tools and test datasets) to test
the value of new ways of processing Big Data and display its benefits to their management.
Providing these analytics technologies to PAs and EUIs would both significantly increase the
adoption of analytics technologies and encourage users to initiate research and test projects in the
Big Data field, and as a result boost innovation and R&D (Research and Development).
Therefore, the ISA2 Action 2016.03 – “Big Data for Public Administrations” aims to address the use
of Big Data within PAs to support better decision-making.1 The study “Big Data Test Infrastructure”,
launched at the beginning of January 2017 under the above-mentioned activities, aims at filling the
gap within PAs in the Big Data field, providing the design of a centralised European Big Data Test
Infrastructure to be used by any PA and EUI in Europe.
1 https://ec.europa.eu/isa2/sites/isa/files/library/documents/isa2-work-programme-2016-summary_en.pdf
Indeed, the purpose of this study is to identify the key features of the "Big Data Test
Infrastructure" and design its architecture, which the European Commission (EC) will make
available to any interested EC DG, PA and EUI in Europe in order to:
1. Facilitate the launch of pilot projects on Big Data, data analytics or text mining, by
providing the infrastructure and software tools needed to start a pilot project;
2. Foster the sharing of various data sources across policy domains and organisations to
support better policy-making; and
3. Support PAs through the creation of a Big Data community around best practices,
methodologies and artefacts (big data algorithms, analytical models, pilots’ outputs, etc.) on
Big Data for policy-making.
A cross-border aggregation of data through a ready-to-use Big Data Test Infrastructure would enable
and increase the adoption of meaningful analytics services that will benefit European and national
PAs / EUIs and the European Union as a whole.
The following examples will facilitate readers in understanding the potential of this Big Data Test
Infrastructure:
Figure 1 – Narratives of the Big Data Test Infrastructure
The entire "Big Data Test Infrastructure" study is structured in accordance with three main tasks.
The objective of this document is to report on the final results of Task 3.
1.1. OBJECTIVES OF THE DOCUMENT
As anticipated in the Introduction, the objective of this document (outcome of Task 3 of the overall
study) is to describe the service offering of the initiative and the overall business and technical
architecture of the Big Data Test Infrastructure to be used by EU Institutions and EU Public
Administrations to launch pilot projects on Big Data.
On one side, a list of simple User Stories has been described in order to better define the set of
services the Big Data Test Infrastructure will offer to future users, while, on the other side, good
practices and a set of business / technical requirements identified during Task 1 were used with
well-defined architecture principles to define the key features and main functionalities of the Big
Data Test Infrastructure (output of Task 1 can be found in “D02.1_Requirements and good
practices for a Big Data Test Infrastructure”).
The main objectives of this document can therefore be summarised as follows:
• Identification of the User Stories (quick reference: Chapter 4);
• Target Architecture (business and technical architecture) (quick reference: Chapter 5);
• Illustrative implementation of a User Story (quick reference: Chapter 6);
• Governance and Operational scenarios (quick reference: Chapter 7).
1.2. STRUCTURE OF THE DOCUMENT
This document represents the final deliverable of TASK 3 of the overall study related to the Big Data
Test Infrastructure. This document contains seven main sections, structured according to the
approach to the study, as listed below:
• Introduction – presents the entire study and its main objectives (Chapter 1);
• Context – outlines the context of the study, pointing out the ISA2 Programme, the ISA2
Action 2016.03 – "Big Data for Public Administrations" and the European Digital Pole in
Luxembourg (Chapter 2);
• Methodology followed – introduces the methodological approach used for TASK 3,
highlighting the steps adopted (Chapter 3);
• User Stories – summarises the main user stories (using Personas) for usage of the Big Data
Infrastructure by public administrations (Chapter 4);
• Target architecture – describes the overall architecture of the Big Data Test Infrastructure
both in terms of services provided and the resulting business, technical and functional
architecture, linked to the business / technical requirements and the solutions adopted in
the selected Good Practices coming from Task 1; it also shows how the obtained
architecture supports the selected use cases coming from Task 1 (Chapter 5);
• Illustrative implementation of a User Story – contains a complete implementation of a
user story with both the services provided and the activated architectural / solution
building blocks of the Big Data Test Infrastructure (Chapter 6);
• Governance and Operational scenario for the Big Data Test Infrastructure – contains the
demand management and access policies, the infrastructure standard templates and the
outcomes sharing (Chapter 7).
• Next Steps – provides information related to the future implementation of a small-scale
Big Data Test Infrastructure (Chapter 8).
2. CONTEXT
2.1. THE ISA2 PROGRAMME AND THE ACTION 2016.03 “BIG DATA FOR
PUBLIC ADMINISTRATIONS”
Nowadays, European Public Administrations are expected to provide efficient and effective
electronic cross-border or cross-sector interactions, not only between PAs but also between PAs
and citizens and businesses, without any disruption. By implementing and executing the ISA2
Programme (commonly referred to as ISA2) from 1 January 2016 to 31 December 2020, the EC
finances thirty-five (35) clusters of actions2 with an operational financial envelope of
approximately EUR 131 million. This programme will continue to ensure that Member States (MSs)
are provided with high-quality, fast, simple and low-cost interoperable digital services.
By supporting and developing new actions and interoperability solutions, the Council and the
European Parliament ensure that ISA2 will contribute to increasing interoperability that will in turn
advance the services offered, cut overall costs and result in a better-functioning internal market.
Under ISA2, the Presidency will prioritise actions and develop provisions to prevent any overlaps
and promote full coordination and consistency with other EU programmes (Connecting Europe
Facility Programme, DSM Strategy).
The 5-year ISA2 Programme 2016–2020 has been developed as a follow-up to its predecessor ISA, which ran from
2010 to 2015. Still managed by the ISA Unit (up to 2016, DIGIT.B6, now DIGIT.D1) of DG Informatics of the EC, the
ISA2 Programme will focus on specific aspects such as ensuring correct coordination of interoperability activities
at EU level; expanding the development of solutions for public administrations according to businesses’ and
citizens’ needs; proposing updated versions of tools that boost interoperability at EU and national level, namely
the European Interoperability Framework (EIF) and the European Interoperability Strategy (EIS); the European
Interoperability Reference Architecture (EIRA) and a cartography of solutions: the European Interoperability
Cartography (EIC).
With the adoption of ISA2, the EC commits to developing necessary IT services and solutions for the advancement
of public-sector innovation and digital public service delivery to citizens and businesses.
In order to remain in line with the European DSM Strategy, ISA2 monitors and supports EIF implementation in
Europe.
ISA is also well aligned with the Connecting Europe Facility Programme (CEF Programme), the Union’s funding
instrument for trans-European networks in the fields of transport, energy and telecommunications. The CEF
supports the deployment and operation of key cross-border digital services. ISA2 supports the quality
improvement of selected services and brings them to the operational level required to become a CEF service. It is
also one of the enabler and contributor programmes for public-sector innovation in Europe.
2 See: https://ec.europa.eu/isa2/dashboard/isadashboard
The ISA2 Programme currently covers 35 actions, in which the “Big Data for Public
Administrations” represents the third, namely Action 2016.03. ISA2 is structured in such a way
that actions are grouped into packages of similar policy areas, which are agreed by the
Commission and Member States. Action 2016.03 belongs to the package “access the data / data
sharing / open data” under which the ISA2 programme funds actions to help open up national data
repositories, facilitate the reuse of data across borders and sectors, and widen access to data
created by the public sector.3
Phase 1 of this Action is aimed at carrying out a landscape analysis in order to identify: (i) the
requirements and challenges of PAs in Europe and the Commission in the context of Big Data;
(ii) ongoing initiatives and best practices in these areas, including an assessment of the tools and
solutions that these initiatives have implemented; and (iii) synergies and areas of cooperation with
the policy DGs and the MSs in this domain. Furthermore, phase 1 also intends to execute some
pilots that showcase the usefulness and policy benefits that Big Data can bring.
This action will continue to build upon the results of phase 1, focusing on the following activities:
Track 1: continue with the identification of further opportunities and areas of interest
whereby the use of Big Data could help improve working methods as well as ensure better
policy-making for policy DGs as well as Member States' Public Administrations;
Track 2: continue the implementation of the already identified pilots by generalising the
developed functionalities, thus extending their use to policy-agnostic contexts in order
to maximise the benefit and return on investment of the proposed solutions;
Track 3: launch a new wave of pilots in specific domains which hold potential for later
being generalised and scaled up, so as to be made available to different services
regardless of their specific policy area.
Moreover, in order to encourage the use of Big Data tools, ISA2 funded several Big Data
pilots under the same action that may serve as motivating examples for PAs.
The ISA2 Action 2016.03 is a natural continuation of the ISA Action (1–22) “Big Data and Open
Knowledge for Public Administrations”, carried out in the context of the 2010–2015 ISA
programme. It aimed at identifying “the challenges and opportunities that Member States and the
Commission face in the context of Big Data and open knowledge [and] to create synergies and
3 See: https://ec.europa.eu/isa2/sites/isa/files/isa2_2017_work_programme_summary.pdf
cooperation between the Commission and Member States, leading to more effective and informed
actions by public administrations”.4 Under this action, a study by Deloitte was conducted on
Big Data entitled “Big Data Analytics for Policy Making”5 and the initiative “Big Data Test
Infrastructure” represents a technical follow-up of this Deloitte report. The final report assigns
specific attention to Big Data and data analytics initiatives launched by European PAs in order to
provide insights. The study first analyses the added value of Big Data analytics in assisting
public authorities at all levels of government and in different domains to achieve their goals.
Second, it captures valuable lessons learned and best practices of mature public organisations
to inspire peers and assist them on their path to using Big Data analytics and becoming more
insight-driven. The study gathered over 100 cases where PAs mine Big Data or use data analytics
to gain better insights and increase their impact; 10 of these, covering a wide range of data
sources, types of analytics, policy domains and levels of government, were selected for more
in-depth case studies and to gather key lessons learned from the use of Big Data and data
analytics within these public authorities.
Based on all use cases and best practices, Deloitte’s study developed several recommendations
addressed to any public organisation that is willing to work with data analytics and Big Data. All
these useful insights are published in the above-mentioned final report: “Big Data Analytics for
Policy Making”.
2.2. EIRA
In order to better understand EIRA’s role and objectives, the document “Introduction to the
European Interoperability Reference Architecture (EIRA©) v2.0.0”6 has been taken into account.
The document provided by the European Commission has been used as a guideline in order to
create the architecture principles (based on the European Interoperability Framework underlying
principles) and define the Architecture Building Blocks of the future Big Data Test Infrastructure.
The European Interoperability Reference Architecture (EIRA) is a reference architecture focused on
the interoperability of digital public services. It is composed of the most salient Architecture
Building Blocks (ABBs) needed to promote cross-border and cross-sector interactions between
public administrations. This interoperability aims at improving cooperation between public
4 See: http://ec.europa.eu/isa/actions/01-trusted-information-exchange/1-22action_en.htm
5 See: https://joinup.ec.europa.eu/asset/isa_bigdata/document/big-data-analytics-policy-making-report
6 See: https://joinup.ec.europa.eu/catalogue/distribution/eira-v200-overview
administrations – removing barriers for administration, businesses and citizens. The New European
Interoperability Framework defines interoperability as the ability of organisations to interact to
achieve mutually beneficial goals, involving the sharing of information and knowledge between
these organisations, through the business processes they support, by means of exchange of data
between their ICT systems. The EIRA is a four-view reference architecture for delivering
interoperable digital public services across borders and sectors and it has four main characteristics:
1. Common terminology to achieve coordination – It provides a common understanding of
the most salient Architecture Building Blocks needed to build interoperable public services.
2. Reference architecture for delivering digital public services – It offers a framework to
categorise Solution Building Blocks (SBBs) of an eGovernment solution. It allows portfolio
managers to rationalise, manage and document their portfolio of solutions.
3. Technology and product neutral and a service-oriented architecture (SOA) style – The EIRA
adopts a service-oriented architecture style and promotes ArchiMate as a modelling
notation. ArchiMate7 is an open and independent enterprise architecture modelling
language to support the description, analysis and visualisation of architecture within and
across business domains in an unambiguous way.
4. Alignment with EIF and TOGAF – EIRA is aligned with the New European Interoperability
Framework (EIF). The views of EIRA correspond to the interoperability levels in the EIF: legal,
organisational, semantic and technical interoperability which are already anchored in the
National Interoperability Frameworks (NIFs) of the Member States. EIRA also reuses
terminology and paradigms from TOGAF, such as architecture patterns, building blocks and
views.
The main objective of EIRA is to support users within the public administrations of Member States
or EU Institutions (architects, business analysts and portfolio managers) in the implementation of
some use cases (design and document solution architecture, compare solution architectures,
structure impact assessment and create, manage or rationalise a portfolio of solutions).
Figure 2 below provides an overview of the key concepts of EIRA and its relationships.
7 See: http://www.opengroup.org/subjectareas/enterprise/archimate-overview
Figure 2 – Key concepts in EIRA
The key concepts of EIRA are defined as follows:
EIF interoperability levels cover legal, organisational, semantic and technical
interoperability;
Figure 3 – Interoperability levels of the EIF
EIF principles consist of 12 underlying principles of European public services that are
relevant to the process of establishing European public services;
EIRA views consist of several views, including one view for each of the EIF interoperability
levels;
EIRA viewpoints provide a perspective keeping the concerns of specific stakeholders
in mind;
Architecture Building Blocks are abstract components that capture architecture
requirements and guide the development of Solution Building Blocks;
Solution Building Blocks are concrete elements that define the implementation of one or
more Architecture Building Blocks;
Solution Architecture Template (SAT) focuses on the most salient building blocks needed to
build an interoperable solution;
Reference Architecture is a generalised, domain-neutral architecture of a solution, based
on best practices and focused on a particular aspect. The goal of a reference architecture
is reusability: it reduces the amount of work and the number of errors, and accelerates the
development of solutions;
Solution Architecture is a description of a discrete and focused business operation or
activity and how information systems/technical infrastructure supports that operation. It
can be derived from a Solution Architecture Template (SAT);
Solution consists of one or more Solution Building Blocks to meet a certain
stakeholder need.
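The concept hierarchy above (ABBs guiding SBBs, SBBs composing Solutions) can be sketched as a small type model. A minimal illustration in Python, where the class names follow the report's terminology but the attributes and the catalogue example are hypothetical assumptions:

```python
from dataclasses import dataclass

@dataclass
class ArchitectureBuildingBlock:
    """Abstract component that captures architecture requirements."""
    name: str
    eif_view: str  # "legal", "organisational", "semantic" or "technical"

@dataclass
class SolutionBuildingBlock:
    """Concrete element implementing one or more ABBs."""
    name: str
    implements: list  # the ABBs this SBB realises

@dataclass
class Solution:
    """One or more SBBs meeting a certain stakeholder need."""
    name: str
    building_blocks: list  # the SBBs composing the solution

# Illustrative example: a concrete catalogue product realising a
# semantic-view ABB (the product name is an assumption).
abb = ArchitectureBuildingBlock("Data Catalogue", eif_view="semantic")
sbb = SolutionBuildingBlock("CKAN-based catalogue", implements=[abb])
solution = Solution("Open Data Portal", building_blocks=[sbb])
assert abb in solution.building_blocks[0].implements
```

The point of the sketch is the direction of the relationships: abstract ABBs constrain concrete SBBs, and a Solution is only a composition of SBBs.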
EIRA’s approach and methodologies have been used as guidelines in order to create the
architecture principles and to define the architecture building blocks of the future Big Data Test
Infrastructure.
2.3. CEF PROGRAMME
The Connecting Europe Facility (CEF) represents a key EU funding instrument that promotes
growth, jobs and competitiveness through targeted infrastructure investment at the European
level. CEF aims to support the development of high-performing, sustainable and efficiently
interconnected trans-European networks in the fields of transport, energy and digital services. The
programme’s investments fill the missing links in Europe's energy, transport and digital backbone.8
The CEF offers multiple benefits to citizens across the EU MSs, especially within the following sectors:
Transport: travel will be easier and more sustainable;
Energy: Europe’s energy security will be enhanced, while enabling wider use of
renewables;
Telecom: cross-border interaction between public administrations, businesses and citizens
will be facilitated;
Economy: the CEF offers financial support to projects through innovative financial
instruments such as guarantees and project bonds. These instruments create significant
leverage in their use of the EU budget and act as a catalyst to attract further funding
from the private sector and other public-sector actors.
8 See: https://ec.europa.eu/inea/en/connecting-europe-facility
Moreover, in order to facilitate the delivery of digital public services across borders, the EU MSs
have created interoperability agreements aimed at deploying trans-European Digital Service
Infrastructures (the DSIs) to be run by CEF Digital.9 This programme supports the provision of basic
and re-usable digital services, known as the CEF building blocks,10 such as eDelivery, eID,
eSignature and eInvoicing. The CEF building blocks can be combined with each other, adopting a
Service Oriented Architecture approach, and integrated with more complex services (e.g. eJustice).
Building blocks denote the basic digital service infrastructures, which are key enablers to be
reused in more complex digital services.11
The CEF building blocks offer basic capabilities that can be used in any European project, and they
can be combined and used in projects in any domain or sector at European, national or local level.
The building blocks are based on existing formalised technical specifications and standards.
The main goals of the CEF building blocks are listed below:
Facilitating the adoption of common technical specifications by PAs;
Ensuring interoperability between IT systems so that citizens, businesses and
administrations can benefit from seamless digital public services wherever they may be in
Europe;
Facilitating the adoption of common technical specifications by projects across different
policy domains with minimal (or no) adaptation by providing services and sometimes
sample software.
The CEF Regulation and the CEF Principles set the context and objectives for the CEF Programme
and define the conditions for providing funding to current and future Building Block DSIs. Each DSI
is implemented through its Service Offering. The GOFA model describes the four aspects
(Governance, Operations, Financing and Architecture) that need to be managed to deliver this
Service Offering:
9 See: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+Digital+Home
10 See: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+building+blocks
11 See: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+Definitions
Figure 4 – The CEF GOFA model
Preparations are currently ongoing for the next CEF funding Programme, in particular the
prioritisation of the candidate CEF Building Blocks by the members of the CEF Expert Group
based on their national needs. As a result, some candidates will become part of the CEF
Programme, requiring strong governance of the delivery of those candidates among Member States.
The future “Big Data Test Infrastructure” is in line with the above-mentioned CEF Regulation and
principles and the future infrastructure will be designed around the four aspects of the GOFA
model.
In order to participate in this programme, DG DIGIT.D1 prepared, with KPMG support, a list of
deliverables describing the Big Data Test Infrastructure (i.e. a Maturity form, Synopsis and
Narratives document), and the candidate attracted strong interest from several Member States
in the prioritisation process (carried out by the MS representatives of the so-called CEF
Expert Group), officially entering the short-list of candidate building blocks.
It has been agreed with DG CNECT and the Context Broker and European Data Portal DSIs that BDTI
could become part of the CEF 2018 Work Programme; in the case of a positive outcome, the
implementation of the Big Data Test Infrastructure will be jointly coordinated under the new CEF "Data" DSI.
Currently, the 2018 Work Programme is under validation by the CEF Member States and the
candidate building block has already been presented to the CEF Expert Group during a webinar
session.
3. METHODOLOGY FOLLOWED
This section provides a description of the methodological approach applied in order to guarantee
achievement of the objectives of Task 3.
Figure 5 – Methodological approach followed under Task 3
Step 3.1 – Identification of the User Stories
In order to give the reader a clear understanding of how the Big Data
Test Infrastructure can be used and how it can satisfy specific needs
that users may encounter in a Big Data context, a list of examples has
been created in this step. These examples have been translated into
User Stories through the following activities:
1. Identification of the User Personas – based on the information
provided by ISA2 and on KPMG expertise, a list of the users most
likely to use the Big Data Test Infrastructure has been
identified. These users have been defined as Targeted Personas.
2. Identification of the User Stories – starting from the User
Personas identified in the first activity, the User Stories have been
created considering specific problems/needs which users may
encounter in a Big Data context.
Further details regarding the User Stories can be found in Chapter 4.
Step 3.2 – Design of the target architecture
The objective of this step has been to provide the target architecture of
the Big Data Test Infrastructure. The first task was to identify the
architecture principles, taking into account the new European
Interoperability Framework document, KPMG expertise and the business
requirements collected during Task 1. The architecture principles have
been used as the high-level driver for the design of the Big Data Test
Infrastructure. Subsequently, the activities performed for the creation
of the target architecture can be grouped into three tasks:
1. Identification of the Business Services – starting from the User
Stories and the architecture principles identified, a list of
business services has been defined. Each business service has
been linked with the User Stories.
2. Design of the Application-Data-Technology architecture – in
this task an overall perspective of the Application-Data-Technology
architecture has been developed in order to provide the business
services identified in the previous task. This architecture contains
an Analytics building block that has been broken down into
architecture building blocks in order to map the technical
requirements collected during Task 1.
3. Design of the Solution architecture – in this task, an example of
a solution for the implementation of the relevant building blocks
has been provided for each use case in scope. Good practices/pilots,
KPMG experience and available studies on Big Data software
(e.g. Gartner) have been taken into account in order to identify the solutions.
Step 3.3 – Design of the Governance and Operational model
The goal of this step has been the creation of the governance and
operational model of the Big Data Test Infrastructure, covering the
aspects of control and management of the platform. This step provides
a set of guidelines for the future implementation of the engine; to
classify the main aspects in a simple way, the GOFA model used by CEF
was taken into account.
Step 3.4 – Report on final results
The final step of Task 3 has been the delivery of the report "D03.1
Design of the Big Data Test Infrastructure", based on the consolidated
results of Task 3:
a list of User Stories describing specific purposes supported
by the services provided by the Big Data Test Infrastructure;
a list of relevant architecture principles that have been used as a
driver for the design of the architecture of the future Big Data
Test Infrastructure;
the Business architecture, composed of a series of business
services to be supported by the future Big Data Test
Infrastructure;
the Application-Data-Technology architecture and the Solution
architecture, providing all the needed building blocks and an
example of technology solutions for each of them;
an illustrative implementation of a User Story, providing a
complete end-to-end example;
the Governance and Operational model, identifying the
elements of the overall architecture that can be implemented in
different scenarios.
Table 1 - Methodology followed for Task 3 of the study
As described above, the methodological approach adopted for Task 3 of the study is focused on the
design of the target architecture of the Big Data Test Infrastructure. In order to provide a clear
picture of the overall methodological approach defined for the complete study, the following figure
highlights the principal interconnections between the three Tasks of the study.
Figure 6 – Adopted Methodology for the overall study
As highlighted in Figure 6 above, all the information collected during Task 1 has been fundamental
in identifying the User Stories, which have been used as guidelines for the creation of the so-called
“service offering” of the business architecture, while the architecture principles have been used as
the high-level driver for the identification of the Target Architecture. Finally, all the needed building
blocks of the Application-Data-Technology architecture have been identified to support the
implementation of the target architecture of the Big Data Test Infrastructure.
4. USER STORIES
This chapter describes how business needs collected during Task 1 have been translated into experimental scenarios – User Stories – that users
may encounter in a Big Data context. Each of them highlights a specific purpose supported by a set of services (described in Chapter 5.2)
provided by the future Big Data Test Infrastructure. The figure below gives a brief description of potential users – Targeted Personas – around
whom the User Stories have been developed, also taking into account the documentation provided by the ISA2 unit.
Figure 7 – Targeted personas in scope for the User Stories
The following attributes have been used for each User Story:
Name – the name of the User Story;
Targeted Persona – a fictional user who has expressed a business need and who could potentially be supported by the services
provided by the Big Data Test Infrastructure;
Problem statement – a description of the needs expressed by the Targeted Persona;
Solution – a description of the solution that Targeted Personas could adopt to solve their Big Data problems;
Benefits – a list of benefits that users gain by utilising a Big Data solution;
Example – an example of the User Story’s problem.
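The attribute list above amounts to a simple record schema. As a minimal sketch (Python is assumed purely as an illustration language; the report does not prescribe any format), the first User Story of Table 2 could be captured as:

```python
# Field names follow the report's attribute list; the values are
# shortened from the first User Story in Table 2.
user_story = {
    "name": "Learning from other European Public Administrations",
    "targeted_persona": "IT Director",
    "problem_statement": ("Obtain Big Data knowledge and methodologies "
                          "from other MSs to avoid replicating efforts"),
    "solution": ("A Big Data Community for finding and sharing use cases, "
                 "methodologies, strategies, artefacts and outcomes"),
    "benefits": [
        "Obtaining/sharing knowledge and pilot outcomes",
        "Increasing know-how among PAs at European, national and local level",
    ],
    "example": ("Ingrid wants to discover ideas adopted by other MSs for "
                "her real-time traffic-congestion pilot"),
}

# A story is well-formed only if every attribute from the list is present.
REQUIRED = {"name", "targeted_persona", "problem_statement",
            "solution", "benefits", "example"}
assert REQUIRED <= user_story.keys()
```

Representing stories as uniform records of this kind is what allows them to be linked, one by one, to the business services of Chapter 5.2.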
IT Director – Ingrid
Age: 44 | Work: Local Government | Location: City of Hamburg, Germany
Description: Ingrid has a background in IT but she moved into managing other areas of
government several years ago. Now back in IT, she puts a lot of effort into being
well-informed through industry publications and conferences. She is also responsible for
several services being digitalised across the city and she spends a lot of time discussing
details with technicians to understand their choices, because she is accountable for them.
IT Practitioner – Marko
Age: 46 | Work: Central Government | Location: Slovenia
Description: Marko works in the Digital Transformation Team and, as a practitioner, he
feels remote from politics and strategy. He is not a manager; he used to do the IT for a
hospital and for a small business, so he knows the private as well as the public sector.
Name: Learning from other European Public Administrations
Targeted Persona: IT Director
Problem statement: As an IT Director, I want to obtain knowledge and methodologies in the Big Data field of expertise from other MSs in order to reduce the risk of replicating their efforts by implementing similar use cases. Unfortunately, I do not have a dedicated platform where all MSs share their Big Data pilots' outcomes and methodologies.
Solution: The IT Director can be supported by a Big Data Community where he can find use cases implemented by other Public Administrations or share Big Data methodologies, strategies, artefacts and outcomes.
Benefits: Obtaining/sharing knowledge and pilot outcomes (algorithms, models, etc.) in the Big Data field of expertise; increasing the know-how among Public Administrations at European, National and Local level.
Example: Ingrid, as an IT Director, is responsible for several Big Data pilots and she wants to discover new ideas adopted by other MSs in order to implement her Big Data pilot for real-time traffic congestion.

Name: Test off-the-shelf analytical tools
Targeted Persona: IT Practitioner
Problem statement: As an IT Practitioner, I want to use Big Data and analytics technologies in order to handle a large volume of data, but I have neither analytics skills nor Big Data analytical tools.
Solution: The IT Practitioner can be supported with off-the-shelf Big Data analytical tools that can be downloaded from a catalogue for implementing Big Data solutions. The IT Practitioner can also be supported by a specialised team in order to implement analytical functionalities.
Benefits: Hiding technical complexity in order to have easy-to-use analytical functionalities, optimising experimentation costs; extracting benefits from the application of analytical functionalities in a Big Data/Analytics context; finding new contexts in which to apply Analytics.
Example: Marko, as an IT Practitioner working in a hospital, has to deal with huge volumes of unstructured documents and is currently struggling to extract information without a labour-intensive manual approach. Big Data tools, text-mining algorithms and analytics skills can help to tackle this kind of problem, but he does not have a Big Data analytical tool.

Name: Experimenting with the Big Data platform
Targeted Persona: IT Practitioner
Problem statement: As an IT Practitioner, I want to experiment with Big Data/Analytics technologies in order to evaluate the Big Data added value and the differences with regard to traditional systems, but I do not have a Big Data environment with all the software components.
Solution: The IT Practitioner can be supported with a ready-to-use Big Data platform, respecting privacy policies and using open source tools. The IT Practitioner can initialise the Big Data platform following a structured process from a marketplace.
Benefits: Fostering the adoption of Big Data technologies and the acquisition of Analytics skills; understanding the added value of Big Data and Analytics tools in the public sector.
Example: Marko, as an IT Practitioner, needs an analytical "sandbox" environment to experiment with new (open) technologies and to quickly test new Big Data use cases before deploying them into the production environment.

Name: Big Data pilot implementation
Targeted Persona: Policymaker
Problem statement: As a Policymaker, I want to start a Big Data pilot in order to support the policy-making process and to evaluate and share the impacts and advantages of the implementation of a Big Data solution. Unfortunately, I am not working in a Big Data environment and do not have the technical skills to apply Big Data methodologies.
Solution: The Policymaker can be supported by a dedicated specialised team to implement and execute the Big Data pilot using a ready-to-use Big Data platform, employing pre-built analytical functionalities to save costs and time. The Policymaker can also share the pilot's results (impacts, advantages, etc.) using a Big Data Community.
Benefits: Understanding the added value and the impact of Big Data and Analytics technologies in the policy-making process; understanding which technologies and which data sources to use for the execution of a Big Data pilot.
Example: Laura, as a Policymaker, has to analyse the impact of a new tax policy introduced six months previously. Her colleagues advised her to analyse the general feeling about this change with Big Data technologies using social media datasets.

Name: Integrating Open Datasets
Targeted Persona: Data Scientist
Problem statement: As a Data Scientist, I want to integrate internal data with external open datasets in order to improve the effectiveness of the analysis. Unfortunately, I do not have a structured process and the tools to collect and integrate open data shared by other PAs.
Solution: The Data Scientist can be supported with a repository shared with other MSs which enables the gathering and usage of the desired datasets, which can be easily downloaded from a catalogue.
Benefits: Encouraging the sharing of open data among different PAs working in different policy domains; finding solutions to problems where the correlation with the analysed policy domains was originally hidden and/or unclear.
Example: David, as a Data Scientist, wants to pinpoint correlations between the internal data of the Local Government and open data in the policy domain "Energy" in order to enrich his statistical model for the improvement of energy efficiency in central public administration buildings.

Table 2 - "User Stories" in scope for the Big Data Test Infrastructure
Each of these User Stories has been analysed in order to be supported by the future Big Data Test Infrastructure.
5. TARGET ARCHITECTURE
This chapter provides information on the target architecture for the Big Data Test Infrastructure.
Specifically, it describes the architecture principles and why they are relevant to the design of the
Big Data Platform (5.1); subsequently, the chapter illustrates a set of business services contained in
the Big Data Test Infrastructure (service offering) linked with the user stories identified in Chapter 4
(5.2), and provides an overall view of all the building blocks needed to deliver the identified
business services (5.3). Finally, the chapter furnishes an example of a solution for the
implementation of the building blocks for each Big Data use case in scope, as resulting from
the prioritisation process undertaken during Task 1 of the initiative (5.4); that study was
the principal source for determining the contents of this section and designing the final
target architecture.
5.1. ARCHITECTURE PRINCIPLES
The architecture principles are guidelines that have been used as a high-level driver for the design
of the architecture for the future Big Data Test Infrastructure. This section presents the most
relevant principles, taking into account:
the underlying principles of European public services outlined in the new European
Interoperability Framework12 document;
KPMG expertise;
business requirements collected during Task 1 of the initiative;
Openness – This principle refers to data, specifications and software. Openness is an important selection criterion: it is preferable to have software products that are provided as open source, with an active and solid community, documentation and a high level of maturity.
Reusability – This relates to the ability to use solutions or components adopted by others; it mainly refers to analytical models, APIs, data connectors, training material, etc. that can be reused beyond the domain for which they were originally developed. It is preferable to have solutions that represent an opportunity to reduce costs and increase the velocity of the experimentation process.
Test environment – This principle relates to the nature of the infrastructure. It is preferable to have an environment for experimentation which gives users a ready-to-use platform; production-environment functionalities (e.g. back-up, high availability, logging, etc.) are therefore not relevant for the solution.
Flexibility – This is a key aspect of the infrastructure in order to support different Big Data use cases. It is preferable to have modular solutions (composed of customisable building blocks) able to address the various business requirements arising from EU Institutions and/or PAs.
Scalability – This principle refers to the capability of the environment to scale up, both in terms of services and of technical specification, in order to implement pilots and handle data. It is preferable to design an infrastructure that is scalable in terms of volume, dimension and performance of services and pilots, based on the resources and budget available.
Security and Privacy – This principle refers to the capability to comply with Security and Privacy policies at European and National level, so it is preferable to have software products and solutions in line with these policies. This is the most important principle to take into account in order to allow public administrations to handle citizens' information while avoiding unauthorised access and disclosure.
12 See: https://ec.europa.eu/isa2/sites/isa/files/eif_brochure_final.pdf
Table 3 - Architecture principles
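As an illustration of how the principles in Table 3 could be operationalised, the following hypothetical sketch screens a candidate software product against the six principles. The boolean scoring scheme and the sample candidate are assumptions for the example, not part of the report:

```python
# The six architecture principles of Table 3, as machine-readable keys.
PRINCIPLES = ["openness", "reusability", "test_environment",
              "flexibility", "scalability", "security_and_privacy"]

def screen(product: dict) -> bool:
    """A candidate passes only if it satisfies every principle.

    `product` maps each principle key to True/False; a missing key
    counts as not satisfied.
    """
    return all(product.get(p, False) for p in PRINCIPLES)

# A candidate satisfying everything passes the screen.
candidate = {p: True for p in PRINCIPLES}
assert screen(candidate)

# Security and Privacy is described as the most important principle,
# so failing it alone is enough to reject the candidate.
candidate["security_and_privacy"] = False
assert not screen(candidate)
```

A real assessment would of course weigh principles rather than treat them as pass/fail, but the sketch shows how the table can act as an explicit checklist during tool selection.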
5.2. BUSINESS ARCHITECTURE
This paragraph illustrates the list of services composing the business architecture that provide the
means to satisfy the business requirements. These services are provided to Member States in order
to experiment Big Data solutions and support all the User Stories described in Chapter 4. For design
of the target architecture of the future Big Data Test Infrastructure, the ArchiMate language has
been used (using the Archi tool). Figure 8 below shows the motivation layer, which was identified
using the user stories previously described, and the business layer which contains the following
business services:
PaaS for implementing the Big Data use case;
Data Catalogue and Data Exchange APIs;
Big Data and Analytics software catalogue;
Analytics as a Service;
Community Building and Innovation Portal;
Support for Analytics implementation;
Advisory.
Figure 8 – Business view of the Big Data Test Infrastructure and related user stories
Below is a brief description of the relationships among the entities of the Motivation layer
(user stories) and Business layer (business services).
Narrative: The [Data Scientist] is associated with the User Story [Integrating open datasets], which
is realised by the [Data Catalogue and Data Exchange APIs]; the [IT Practitioner] is associated with
the User Story [Experimenting with the Big Data platform], which is realised by [PaaS for
implementing Big Data use cases]. The IT Practitioner is also associated with [Test off-the-shelf
analytical tools], which is realised by [Analytics as a Service] and [Support for Analytics
implementation] business services. The [Policymaker] is associated with the User Story [Big Data
pilot implementation], which is realised by the business services [PaaS for implementing Big Data
use cases], [Analytics as a Service], [Community Building and Innovation Portal], [Support for
Analytics implementation] and [Advisory]. Finally, the [IT Director] is associated with the User Story
[Learning from other European Public Administrations], which is realised by the [Community
Building and Innovation Portal].
A list of the identified business services is provided below. For each business service, a detailed
factsheet has been drafted containing the following information:
Name – name of the business service;
Description – a description of the business service;
Prerequisites – a set of prerequisites for implementation of the business service;
Outcome – a brief description of the business service’s outcome;
Linked User Stories – a list of the User Stories which the business service satisfies.
Business Service Description
PaaS for
implementing Big Data Use Case
This service aims at providing the Big Data platform and all the tools supplied by the European Commission. Platform initialisation will be a wizard-driven process in which users choose specific platform templates from a marketplace. Should new needs arise, the platform will be enriched with new functionalities. Once the platform is established, its resource configuration and list of tools will be kept in a catalogue for future reuse.
Prerequisites Outcome
the volume of data to be stored and elaborated in order to perform the sizing of the platform;
privacy policies linked to the geographic location of the servers;
the nature of data.
A Big Data Infrastructure with a ready-to-use software environment.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Experimenting with the Big Data platform since the objective is to have a ready-to-use Big Data infrastructure in order to test Big Data technologies and acquire know-how and skills in the Big Data field of expertise.
Big Data Pilot implementation since the objective is the availability of a Big Data environment to initiate a Big Data pilot.
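The sizing prerequisite above (the volume of data to be stored and elaborated) lends itself to a simple illustration. Below is a minimal sketch of how a wizard-driven initialisation might translate a declared data volume into a number of worker nodes; the function, the replication factor and the per-node capacity are illustrative assumptions, not part of the BDTI design:

```python
import math

def size_platform(volume_tb, replication=3, node_capacity_tb=8):
    """Estimate the number of worker nodes for a given raw data volume.

    Hypothetical heuristic: the raw volume is multiplied by an HDFS-style
    replication factor, then divided by the usable storage per node.
    """
    required_tb = volume_tb * replication
    return max(1, math.ceil(required_tb / node_capacity_tb))

# e.g. 10 TB of raw data, 3x replication, 8 TB usable per node -> 4 nodes
nodes = size_platform(10)
```

A real sizing exercise would also weigh the nature of the data and the privacy constraints on server location listed among the prerequisites.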
Business Service Description
Data Catalogue and Data Exchange APIs
This service aims at providing data and/or data-exchange APIs that enable users to gather the desired data and correlate it with their own. It provides a catalogue of data sources from which users can retrieve links to data if already available (e.g. coming from previous pilots), or access sample datasets (classified by policy domain) which the European Commission makes available in a centralised repository. The catalogue of data sources could be enriched with both datasets and connectors that users develop during the implementation of their pilots, for future reuse.
Prerequisites Outcome
data privacy policies (e.g. in order to secure sensitive data);
A catalogue to easily access desired data from available data sources.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Integrating Open Datasets since the objective is to integrate internal data with external open datasets provided by the European Commission in a shared repository.
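The catalogue behaviour described above (looking up data sources by policy domain and enriching the catalogue with pilot outputs for future reuse) can be sketched with a toy in-memory structure; the policy domains, dataset names and URLs below are purely illustrative:

```python
# Toy data-source catalogue keyed by policy domain (illustrative entries only).
CATALOGUE = {
    "Energy": [{"name": "eu_energy_sample", "url": "https://example.org/energy.csv"}],
    "Health": [{"name": "health_indicators", "url": "https://example.org/health.csv"}],
}

def find_datasets(policy_domain):
    """Return the catalogued dataset entries for a policy domain, if any."""
    return CATALOGUE.get(policy_domain, [])

def register_dataset(policy_domain, name, url):
    """Enrich the catalogue with a dataset produced by a pilot, for future reuse."""
    CATALOGUE.setdefault(policy_domain, []).append({"name": name, "url": url})
```

In the actual service the catalogue would of course sit behind data-exchange APIs rather than in memory, with the privacy policies listed among the prerequisites enforced on access.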
Business Service Description
Analytics as a Service
It aims at providing a list of analytical functionalities which enable any EU Institution/Public Administration to quickly access a series of customisable pre-built elaborations. The following non-exhaustive list represents a set of analytical services that could be available:
extraction of information from documents (text mining);
time-series forecasting;
geo information normalisation;
population / customer segmentation.
Prerequisites Outcome
the type of data that users want to handle;
the type of analysis that users want to create.
A list of analytics to quickly access functionalities related to the elaboration of data.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Test off-the-shelf analytical tools since the goal is to use built-in analytical functionalities.
Big Data Pilot implementation to use analytical functionalities in order to analyse the general sentiment about a new tax policy.
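As an illustration of one such customisable pre-built elaboration, the sketch below implements population/customer segmentation as simple rank-based binning; it is a toy stand-in for an actual Analytics as a Service function, and the data are illustrative:

```python
def segment(values, n_segments=3):
    """Assign each value a segment label 0..n_segments-1 by rank.

    Values are ranked ascending and the ranks are split into equally
    sized bins, a crude form of quantile-based segmentation.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(rank * n_segments // len(values), n_segments - 1)
    return labels

incomes = [12, 45, 23, 78, 51, 19]
segments = segment(incomes)  # three segments: low / mid / high earners
```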
Business Service Description
Community Building
and Innovation Portal
This service aims at building a Big Data community where users can share knowledge and Big Data artefacts (e.g. methodologies, statistical models, pilots' outcomes and datasets). It also provides an innovation portal where users can contribute their own ideas in order to launch new propositions. The innovation portal will have a recommendation engine for aggregating contributions aligned with the user's search.
Prerequisites Outcome
Sharing mind-set. A Big Data community and an innovation portal for users’ ideas with a recommendation engine.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
IT Director
Learning from other European Public Administrations since the goal is to obtain knowledge and methodologies regarding the Big Data field of expertise from other MSs.
Big Data Pilot implementation since the Policymaker wants to share the impacts and advantages of implementation of a Big Data solution.
Business Service Description
Big Data and Analytics
software catalogue
This service aims at providing a catalogue of Analytics artefacts or software tools that users will be able to download for implementing Big Data solutions. Part of this catalogue will be a special software stack (like a Sandbox) usable in the preferred user environment (on premise or cloud): it is a preconfigured environment that contains services, sample data and interactive tutorials useful for testing Big Data technologies.
Prerequisites Outcome
System requirements of the software solution. A catalogue of Big Data and analytics software that users will be able to download and use.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Test off-the-shelf analytical tools since the IT Practitioner could use software tools contained in the Big Data and Analytics software catalogue.
Business Service Description
Support for Analytics
implementation
This service aims at providing technical support for the implementation of a pilot through the provisioning of a specialised team (e.g. Business Analyst, Data Scientist or Data Engineer) which helps Public Administrations to implement Big Data pilots. For example, support could be provided in terms of:
sizing of the necessary infrastructure resources;
choice of the involved technologies in the pilot;
execution of the pilot;
sharing and presentation of the pilot’s outcome. Any outcome of the Big Data pilot could feed the community portal in order to be shared with the community.
Prerequisites Outcome
Activation of one or more of the aforementioned services.
Support for the whole life-cycle of a pilot.
Linked Targeted Personas Linked User Stories
IT Practitioner
Policymaker
Test off-the-shelf analytical tools since the IT Practitioner does not know which software tools should be used.
Big Data Pilot implementation since PAs may not have the availability of technicians with Big Data skills (Data Scientists and Data Engineers) to execute a Big Data pilot.
Business Service Description
Advisory
This service aims at providing advisory support for assistance in activities related to the Big Data Test Infrastructure. It covers a series of activities such as:
pilot scoping;
business case definition;
evaluation of risks related to the implementation of a Big Data solution;
data sources identification and integration;
services identification and integration.
Prerequisites Outcome
Activation of one or more of the aforementioned services
Advisory support.
Linked Targeted Personas Linked User Stories
IT Practitioner
Policymaker
IT Director
Big Data Pilot implementation since the user wants to evaluate the impacts and advantages of the implementation of a Big Data solution.
All the User Stories could be linked with this service.
Assuming that the implementation of the Big Data Test Infrastructure would be part of the CEF
2018 work programme (see Chapter 2.3), each business service and targeted persona can be
mapped onto the template of the CEF Service Offering, as follows:
Figure 9 – Business services mapped in the template of the CEF service offering
5.3. APPLICATION-DATA-TECHNOLOGY ARCHITECTURE
This paragraph presents the Application-Data-Technology architecture required in order to identify
all the needed building blocks that enable the business services previously described. The TOGAF
framework has been taken into account to design the Application-Data-Technology architecture;
TOGAF (The Open Group Architecture Framework) is a framework that provides an approach for
designing, planning, implementing and governing an enterprise information technology
architecture. The Application-Data-Technology architecture has been designed using the Archi tool,
starting from the framework provided in Task 1; in this case, the logical areas identified in Task 1
have been considered as application functions, which include building blocks
(application components).
Figure 10 below provides a preview of the logical architecture which will be further detailed in the
following paragraphs using the TOGAF framework.
Figure 10 – Preview of the logical architecture of the Big Data Test Infrastructure
5.3.1. HIGH-LEVEL ARCHITECTURE VIEW
In order to provide an overall view of the Application-Data-Technology architecture for the Big Data
Test Infrastructure, the TOGAF standard has been considered as guideline. The Application-Data-
Technology architecture provided is composed of two architecture domains:
The Information System Architecture – which covers the development of Data and
Application architecture. In this domain, the main goal is to develop target Data and
Application architectures that enable the business architecture.
The Technology Architecture – this domain covers the hardware configurations,
the infrastructure services that enable the Application and Data architecture, the protocols
and networks that connect applications.
Figure 11 below provides a high-level architecture view of the Application-Data-Technology layers
of the target architecture and the business layer previously presented.
Figure 11 – High-level architecture view of the Big Data Test Infrastructure
A brief description of the relationships among the entities is provided below, extending the
narrative previously given for the business architecture.
Narrative: The business services are realised by the application services [Big Data platform] (which
is assigned to the interface [Interoperability]) and [Community]. The [Big Data platform] is realised
by the functions of [Governance and Security] (composed of the [Software/Data Catalogue], the
[Privacy and Security Policy] and [Infrastructure Monitoring Management]),
[Data Ingestion/Storage] (composed of [Batch Ingestion], [Real Time Ingestion] and the [Distributed
File System]), [Data Elaboration] (composed of [Data Transformation] and [Analytics]),
[Data Consumption] (composed of [Data Discovery and Exploration] and [Data Visualisation]) and
[API Management], which accesses [Data Sources]. The whole application layer is served by the
[Infrastructure] technology service, which also realises the [Community] application service.
[Infrastructure] is realised by the functions of [Storage] and [Distribution], which are assigned to
one or more [Servers].
Starting from the analysis of datasets and APIs carried out during Task 2,13 five data sources have
been identified (National Data Portal, European Data Portal, EU Open Data Portal, Social Media
and Third-Party Data provider).
Table 4 below illustrates, for each building block, the following attributes:
Reference architecture – an image of the Application-Data-Technology architecture where
the application function which contains the building block has been highlighted;
Name – the name of the building block;
Description – a brief description of the building block;
TOGAF component – the name of the component of the TOGAF framework (Technology,
Data or Application Architecture).
Reference architecture
Name Description TOGAF
component
Infrastructure
Servers
It refers to hardware components of the Big Data Test Infrastructure. It includes the choice of RAM and virtual cores of the infrastructure, the choice of the physical location of each node of the Big Data Platform and the choice of operating system.
Technology architecture
13 See the document “D01.01 Study on interoperable data ontologies and data exchange APIs”.
Storage
This building block is mainly related to the storage capacity of the infrastructure; it also takes into account the performance, replication, reliability, encryption, synchronisation, back-up and restore of data.
Technology architecture
Network
It is related to the communication service of the Big Data Test Infrastructure to manage both traffic among the platform nodes and to the Internet, also taking into account security concerns.
Technology architecture
Distribution
Distribution refers to the ensemble of products, tools and projects that enable users to use Big Data applications. Apache Hadoop is the technology most commonly associated with Big Data applications; it enables the use of the preferred computation engine.
Technology architecture
Data Ingestion/Storage
Batch Ingestion
This building block refers to the capability of the platform to collect and load data into the Big Data platform on a regular time basis. Batch ingestion usually involves a large volume of data and is scheduled within a defined time window.
Application architecture
Real Time Ingestion
This building block is responsible for managing real time data streams. Real time ingestion allows for the analysis of data as soon as they are issued by the source. It may be implemented using a message queue or a memory channel.
Application architecture
Distributed File
System
This system allows clients to access and process data stored on the server as if it were on their own computer; when a user accesses a file on the server, the server sends the user a copy of the file. It organises file and directory services into a global directory in such a way that remote data access is not location-specific but is identical from any client. All files are accessible to all users of the global file system, and organisation is hierarchical and directory-based.
Application architecture
Data Elaboration
Data
Transformation
This building block refers to the process of data quality management. It includes, for example, tasks of data cleaning, data quality, data enrichment and data integration. It is typically performed via a mixture of manual and automated steps.
Application architecture
Analytics
Analytics refers to a series of customisable pre-built elaborations mainly related to the data mining and machine learning areas. It allows users to apply supervised and unsupervised algorithms, forecasting methodologies and any other analytical services.
Application architecture
Data Consumption
Data Discovery and Exploration
This building block gives business users the opportunity to search for specific information, write queries, compute statistics, visualise datasets and perform any other data exploration activity. It also allows users to quickly and simply view the most relevant features of their datasets through the use of descriptive statistics.
Application architecture
Data
Visualisation
Data visualisation addresses the classical Business Intelligence use cases represented by dashboards and reporting. It also provides the capability to monitor certain events in real time.
Application architecture
Governance & Security
Software/Data
Catalogue
This building block refers to a list of Big Data and Analytics software tools and artefacts. This list is divided into various sections, based on the category of software tools. It also contains some services and solutions funded by the EU (e.g. the DORIS service14) which users are able to integrate with the Big Data platform. Software/Data Catalogue also refers to a list of datasets related to different policy domains (e.g. Energy, Transportation, Health, etc.) which the European Commission makes available in a centralised repository.
Application architecture
Privacy and
Security Policy
This building block enables the governance of the creation, acquisition, integrity and use of data and information. It provides mechanisms to ensure that only authorised users can access and perform actions on IT resources.
Application architecture
Infrastructure
Monitoring Management
This building block provides insight into the status of physical, virtual, and cloud systems and helps ensure availability and performance. Infrastructure monitoring management covers the sub-categories of systems management, network management and storage management.
Application architecture
14http://www.doris-project.eu/
API
Management
API Management refers to the capability of the platform to interact with data sources. This building block also allows users to control and manage access and usage policies for APIs.
Application architecture
Community
This building block refers to a platform that allows users to share Big Data artefacts or methodologies among European Public Administrations and Public Institutions. The platform will contain a special section – Innovation Portal – with a recommendation engine (using Machine Learning techniques) where users can contribute with their own ideas in order to launch new propositions.
Application architecture
Interoperability
Interoperability provides users with the ability to work and interact with external systems and products.
Application architecture
Table 4 – Description of the building blocks for the Application-Data-Technology Architecture
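The difference between the Batch Ingestion and Real Time Ingestion building blocks described in Table 4 can be sketched in a few lines of code; here an in-memory queue stands in for the message queue a real-time channel might use, and the functions and records are illustrative only:

```python
import queue

def batch_ingest(records, store):
    """Batch ingestion: load a whole window of records in one scheduled run."""
    store.extend(records)

def real_time_ingest(q, store):
    """Real-time ingestion: drain events from a message queue as they arrive."""
    while not q.empty():
        store.append(q.get())

store = []                             # stands in for the distributed file system
batch_ingest(["row1", "row2"], store)  # nightly-style bulk load
q = queue.Queue()
q.put("event1")                        # event issued by a source
real_time_ingest(q, store)             # stream-style consumption
```

In the actual platform the "store" would be the Distributed File System and the queue a distributed messaging layer, but the contrast between scheduled bulk loads and per-event consumption is the same.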
As mentioned above, these building blocks enable the business services identified in the previous
paragraph (5.2); a business service can be covered by one or more architecture building blocks of
different logical areas of the Application-Data-Technology architecture.
The following matrix provides the mapping between the architecture building blocks and the
business services identified in the business architecture, except for the “Advisory” and “Support for
Analytics implementation” services.
Business services: (1) PaaS for implementing Big Data use case; (2) Data Catalogue and Data Exchange APIs; (3) Big Data and Analytics software catalogue; (4) Analytics as a Service; (5) Community Building and Innovation Portal.

Architecture Building Blocks | (1) | (2) | (3) | (4) | (5)
Infrastructure – Servers | ✓ | | | |
Infrastructure – Storage | ✓ | | | |
Infrastructure – Network | ✓ | | | |
Infrastructure – Distribution | ✓ | | | |
Data Ingestion/Storage – Batch Ingestion | ✓ | ✓ | | |
Data Ingestion/Storage – Real Time Ingestion | ✓ | ✓ | | |
Data Ingestion/Storage – Distributed File System | ✓ | | | |
Data Elaboration – Data Transformation | ✓ | | | ✓ |
Data Elaboration – Analytics | ✓ | | | ✓ |
Data Consumption – Data Discovery and Exploration | ✓ | | | ✓ |
Data Consumption – Data Visualisation | ✓ | | | ✓ |
Governance & Security – Software/Data Catalogue | | ✓ | ✓ | |
Governance & Security – Privacy and Security Policy | ✓ | ✓ | | |
Governance & Security – Infrastructure Monitoring Management | ✓ | | | |
API Management | | ✓ | | |
Community | | | | | ✓
Interoperability | ✓ | | | |
Table 5 – Mapping between architecture building blocks and business services
5.3.2. DETAILED VIEW OF THE ANALYTICS BUILDING BLOCK
This paragraph provides a detailed overview of the Analytics building block in order to help users to
better understand how technical requirements collected during Task 1 are satisfied. This building
block is the most important in the final architecture of the Big Data Test Infrastructure since it
enables the implementation of all the use cases in scope for Task 1. Figure 12 below shows a
drill-down of the Analytics building block, which is composed of lower level building blocks related
to the main Data Scientist’s activities.
Figure 12 – Drill-down of the Analytics building block
Table 6 below provides a detailed description of each lower level building block shown, containing
the following information:
Name – the name of the lower level building block;
Description – a brief description of the lower level building block;
Linked Technical Requirements – the ID and the short-name of the technical requirements
collected during Task 1.
Name Description Linked Technical
Requirements (ID – short-name)
Data Mining
This building block refers to the activity of discovering patterns in large quantities of data. These patterns can be seen as a kind of summary of the input data which can be used for further analysis (e.g. machine learning or predictive analytics) in order to obtain more accurate prediction results.
32 – Tool support
34 – Data processing
36 – Advanced Analytics
Statistical Modelling
Statistical Modelling includes statistical methods that allow users to manage and analyse structured datasets (cross-sectional or time series data), in order to describe or summarise features and provide a descriptive view of a phenomenon.
32 – Tool support
36 – Advanced Analytics
Machine Learning
This building block provides users with the tools needed to train, test and validate analytical models built using machine learning and statistical algorithms. With this module, users can perform classification, clustering or regression tasks.
32 – Tool support
36 – Advanced Analytics
Table 6 – Description of the architectural detailed building block
As shown in Table 6 above, a technical requirement can be covered by the union of multiple lower
level building blocks and a lower level building block can cover the needs of multiple technical
requirements.
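As an illustration of the train/test workflow the Machine Learning lower level building block supports, the sketch below trains and applies a tiny nearest-centroid classifier written from scratch; a real pilot would rely on a proper library, and all names and data here are illustrative:

```python
def train(points, labels):
    """Compute one centroid per class from labelled 1-D points."""
    centroids = {}
    for lbl in set(labels):
        members = [p for p, l in zip(points, labels) if l == lbl]
        centroids[lbl] = sum(members) / len(members)
    return centroids

def predict(centroids, point):
    """Classify a point by its nearest class centroid."""
    return min(centroids, key=lambda lbl: abs(centroids[lbl] - point))

# Train on a toy labelled dataset, then classify unseen points.
model = train([1.0, 1.2, 8.0, 8.4], ["low", "low", "high", "high"])
```

Validation would follow the same pattern: hold out part of the labelled data, predict on it and compare against the known labels.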
During Task 1, a prioritisation process of all the use cases was carried out in order to identify the
final priority for the Big Data use cases; Figure 13 below illustrates the results of the process.
Figure 13 – Final priority for the identified Big Data use cases
For each use case with high priority, the following list provides readers with a clear explanation of
how the Analytics’ lower level building blocks enable the implementation of the use cases in scope.
Predictive analysis: it consists of applying statistical techniques (e.g. predictive modelling,
machine learning) that analyse current and historical facts to make predictions about future
or unknown events. In order to implement this use case, users have to use Data Mining and
Machine Learning techniques to discover hidden patterns, extract useful information and
make predictions on data.
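A minimal sketch of predictive analysis in these terms: fitting a straight line to historical facts by ordinary least squares and projecting it one step ahead (the series and function names are illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Historical facts (illustrative) and a one-step-ahead prediction.
years = [2013, 2014, 2015, 2016]
values = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(years, values)
forecast_2017 = slope * 2017 + intercept
```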
Web analysis (scraping/monitoring): it consists of gathering information from websites,
involving data scraping (using bot or web-crawler) and data parsing to extract the
unorganised web data as well as converting data from APIs into a manageable format. In
order to implement this use case, users have to use Data Mining techniques to discover
patterns from the web using automated processes to extract data from servers and
web reports.
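The parsing step of web analysis can be sketched with the standard-library HTML parser; the example below extracts links from an already-downloaded page (a real crawler would first fetch pages over HTTP, and the markup here is illustrative):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of all anchor tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = '<html><body><a href="https://example.org/a">A</a><a href="/b">B</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
```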
Text analysis: it consists of using natural language processing to analyse unstructured text
data, deriving patterns and trends, possibly extracting the text content and evaluating and
interpreting the output data. In order to implement this use case, users have to use Text
Data Mining techniques which include text categorisation, text clustering and
document summarisation.
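A toy sketch of text categorisation, one of the Text Data Mining techniques mentioned above: each document is scored against per-category keyword sets and assigned the best-matching category (categories and keywords are illustrative):

```python
# Illustrative category keyword sets; a real pilot would learn these.
CATEGORIES = {
    "tax": {"tax", "vat", "income", "rate"},
    "health": {"hospital", "patient", "vaccine"},
}

def categorise(text):
    """Assign the category whose keywords overlap most with the text."""
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORIES.items()}
    return max(scores, key=scores.get)
```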
Descriptive analysis: this refers to the use of statistics to quantitatively describe or
summarise features of a collection of information. Such descriptions may be either
quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. In order to
implement this use case, users have to use a Statistical Modelling approach which refers to
finding a statistical correlation among features or detecting outliers.
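A minimal sketch of descriptive analysis: summary statistics computed with the Python standard library, plus a simple z-score rule for detecting outliers (the threshold and data are illustrative):

```python
import statistics

def describe(values):
    """Summary statistics quantitatively describing a collection of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def outliers(values, z=2.0):
    """Flag values further than z standard deviations from the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs(v - m) > z * s]
```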
Time-Series analysis: this comprises methods for analysing time-series data to extract
meaningful statistics and other characteristics of data. In order to implement this use case,
users have to use Statistical Modelling techniques to identify trends or cyclic behaviours
and forecast future trends.
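As a sketch of trend identification in time-series analysis, the example below smooths a monthly series with a centred moving average (window size and data are illustrative):

```python
def moving_average(series, window=3):
    """Smooth a series with a simple moving average of the given window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly = [100, 102, 98, 104, 106, 103, 108]
trend = moving_average(monthly)  # rising trend emerges once noise is smoothed
```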
Social media analysis: it consists of gathering data from blogs and social media websites and
analysing data to make business decisions. In order to implement this use case, users have
to use Machine Learning algorithms, performing rules-based analysis or keywords-based
analysis to identify posts that fit into specific categories.
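A sketch of the keywords-based analysis mentioned above: each post is tagged with the categories whose keywords it mentions (the rules and posts are illustrative):

```python
# Illustrative rule sets mapping categories to trigger keywords.
RULES = {
    "complaint": {"delay", "broken", "refund"},
    "praise": {"great", "thanks", "excellent"},
}

def tag_post(post):
    """Return the sorted categories whose keywords appear in the post."""
    words = {w.strip(".,!?") for w in post.lower().split()}
    return sorted(cat for cat, kws in RULES.items() if words & kws)

posts = ["Great service, thanks!", "Still waiting for my refund"]
```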
Network analysis: this consists of investigating any structures through the use of network
and graph theories, characterising networked structures in terms of nodes and the ties,
edges, or links (relationships or interactions) that connect them. In order to implement this
use case, users have to use Machine Learning techniques, with a particular focus on neural
network and deep learning methods which allow the evaluation of the presence or absence
of a relationship among elements.
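The nodes-and-ties representation underlying network analysis can be sketched as an adjacency structure, with helpers to test for the presence of a relationship and rank nodes by degree (the graph data is illustrative; real pilots would use dedicated graph and machine learning tooling):

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency structure from (node, node) edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def connected(adj, a, b):
    """Is there a direct tie (edge) between nodes a and b?"""
    return b in adj[a]

def degree_ranking(adj):
    """Nodes ordered by number of ties, most connected first."""
    return sorted(adj, key=lambda n: len(adj[n]), reverse=True)

g = build_graph([("PA1", "PA2"), ("PA1", "PA3"), ("PA2", "PA3"), ("PA1", "PA4")])
```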
5.4. SOLUTION ARCHITECTURE
This section introduces the identified solution building blocks (i.e. the software/technology items
that could support the implementation of the architecture building blocks), which make up the
solution architecture. The solution building blocks were identified following high-level criteria and
the results emerging from Task 1. This section also illustrates how such solution building blocks will
support the implementation of the Big Data use cases in scope.
5.4.1. CRITERIA FOR THE IDENTIFICATION OF TOOLS/TECHNOLOGIES
The aspects that drive the identification of the software solutions for each architecture building
block of the target architecture are the following:
Architecture principles, the guidelines that have been used as a high-level driver for the
design of the architecture of the Big Data Test Infrastructure (see section 5.1);
Good practices, the recommendations and results coming from the most relevant projects
analysed in Task 1;
KPMG expertise, KPMG credentials/experience and Competence Centre in the
Data & Analytics area, including the KAVE platform and relevant projects.
5.4.2. IDENTIFIED SOLUTION BUILDING BLOCKS
This section introduces, for each architecture building block, a non-exhaustive list of software
solutions that have been identified according to the criteria presented in the previous section.
These software solutions can be refined in future analyses. During this activity, cloud solutions
have been taken into account in order to respect the architecture principles previously defined.
A cloud infrastructure is an abstraction layer that virtualises resources and presents them to
users through APIs; these virtualised resources are hosted by a service provider (such as Amazon
Web Services). This leads to greater flexibility, since users can reallocate infrastructure resources
at any time without having to find new physical servers that meet their needs. Below are
some examples which can help the reader to better understand how BDTI users can take advantage
of using a cloud infrastructure:
a Public Administration wants to start a new pilot using the Big Data Test Infrastructure; the
amount of initial data is low but a substantial increase is expected. In this case, a cloud
infrastructure makes it easy to scale up the disk space of its servers;
a Public Administration (which already uses the Big Data Test Infrastructure) wants to test a
more complex machine learning algorithm which requires more hardware resources. In this
case, a cloud infrastructure makes it easy to increase the RAM and CPU of its servers.
Figure 14 below illustrates the solution architecture, starting from the Application-Data-Technology
architecture previously seen and the criteria identified.
Figure 14 – Solution Architecture of the Big Data Test Infrastructure
Table 7 below provides the mapping between architecture building blocks and the identified
solution building blocks; there is no exclusive relationship between them: the requirements of an
architecture building block can be covered by multiple solution building blocks, and a solution
building block can cover the needs of multiple architecture building blocks. The notes provide a
description of how each solution building block serves the architecture building block.
| Architecture Building Block | Solution Building Block | Notes |
| --- | --- | --- |
| Servers | Amazon Web Services EC2 | Server instance provider |
| Storage | Amazon Web Services S3 | Storage provider |
| Network | Amazon Web Services EC2 | Network configuration |
| Distribution | Hortonworks Data Platform | Open source platform |
| Batch Ingestion | Flume | Distributed data flow |
| Real Time Ingestion | Storm | Distributed real-time data flow |
| Distributed File System | HDFS | Distributed storage |
| Data Transformation | RHadoop | Preliminary analysis in R |
| Data Transformation | SparkR | Preliminary analysis in Spark |
| Analytics | Storm | Real-time analytics |
| Analytics | HDFS | Data Lab environment |
| Analytics | RHadoop | R analysis in the Hadoop environment |
| Analytics | SparkR | R analysis in Spark |
| Data Discovery and Exploration | Solr | Indexing/search platform |
| Data Discovery and Exploration | Zeppelin | Web-based notebook for data discovery |
| Data Visualisation | Zeppelin | Web-based notebook for interactive data visualisation |
| Software/Data Catalogue | AWS Service Catalog | Catalogue of IT services |
| Software/Data Catalogue | Ambari | Services management |
| Privacy and Security Policy | Sentry | User access management |
| Infrastructure Monitoring Management | Ambari | Alerts management |
| API Management | WSO2 API Manager | Platform for APIs |
| Community | Confluence | Team collaboration tool |
| Interoperability | Knox | Application gateway for REST and HTTP interactions |

Table 7 – Mapping between architecture and solution building blocks
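The many-to-many relationship in Table 7 can also be represented programmatically, for example when building the Software/Data Catalogue. The sketch below is purely illustrative (it encodes only a subset of Table 7) and the function names are hypothetical:

```python
# Illustrative sketch of the many-to-many mapping between architecture
# building blocks (ABBs) and solution building blocks (SBBs), from Table 7.
ABB_TO_SBB = {
    "Batch Ingestion": ["Flume"],
    "Real Time Ingestion": ["Storm"],
    "Analytics": ["Storm", "HDFS", "RHadoop", "SparkR"],
    "Data Visualisation": ["Zeppelin"],
    "Data Discovery and Exploration": ["Solr", "Zeppelin"],
}

def solutions_for(abb):
    """SBBs covering a given architecture building block."""
    return ABB_TO_SBB.get(abb, [])

def blocks_covered_by(sbb):
    """ABBs whose requirements a given SBB helps to cover."""
    return [abb for abb, sbbs in ABB_TO_SBB.items() if sbb in sbbs]

print(blocks_covered_by("Storm"))  # ['Real Time Ingestion', 'Analytics']
```

The lookup in both directions reflects the non-exclusive relationship described above: Storm, for instance, serves both real-time ingestion and analytics.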
The following table summarises the solution building blocks, describing each of them in terms of:
- Name – the name of the identified solution building block;
- Description – a brief description of the solution;
- License – the license type of the solution;
- Support – the support type offered;
- Community – community size and documentation provided;
- Maturity – the maturity level.
| Solution Building Block | Description | License | Support | Community | Maturity |
| --- | --- | --- | --- | --- | --- |
| Amazon Web Services EC2 | Amazon Elastic Compute Cloud [15] (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. | Proprietary software | ✓ Three levels of support | Millions of results | Since 2006 |
| Amazon Web Services S3 | Amazon Simple Storage Service [16] (Amazon S3) is object storage built to store and retrieve any amount of data from anywhere. It can be used for media storage or as a data lake for Big Data analytics. | Proprietary software | ✓ Three levels of support | Millions of results | Since 2006 |
| Hortonworks Data Platform | HDP [17] is an open source Apache Hadoop distribution based on a centralised architecture. HDP provides a complete big data ecosystem composed of the most common tools; it can also be easily extended by developing new software or adding third-party software. | Apache 2.0 | ✓ Three levels of support | Many thousands of results | Since 2011; Version 2.6 |
| Flume | Flume [18] is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, and uses a simple, extensible data model that allows for online analytic applications. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2012; Version 1.7.0 |
| Storm | Storm [19] is a free and open source distributed real-time computation system that allows users to process large volumes of high-velocity data. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2011; Version 1.1.1 |
| HDFS | The Hadoop Distributed File System [20] (HDFS) is the core of the Hadoop ecosystem; it is the layer that lets users manage data as if it were stored on a single node, instead of being divided and replicated across multiple nodes. It manages replication automatically, ensuring fault tolerance and scalability. | Apache 2.0 | ✓ Different support among platform providers | Many thousands of results | Since 2009; Version 2.8.1 |
| RHadoop | RHadoop [21] is a collection of five R packages that allow users to manage and analyse data within the Hadoop environment; it lets users work with HDFS and other elements of the ecosystem and exploit the MapReduce framework. | GNU GPL | N/A | Thousands of results | Since 2011; different versions of the R packages |
| SparkR | SparkR [22] is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR provides a distributed data frame implementation that supports operations like selection, filtering and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2011; Version 2.2.0 |
| Solr | Solr [23] is a popular open source search platform that provides distributed indexing. It has enterprise production features such as high reliability, scalability and fault tolerance, replication and load-balanced querying, automated failover and recovery, centralised configuration and more. It supports batch, real-time and on-demand indexing of data. | Apache 2.0 | ✓ Project mailing lists and an IRC channel | Millions of results | Since 2004; Version 6.6.0 |
| Zeppelin | Zeppelin [24] is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Python and more. It offers multiple interpreters (connectors) for different data sources. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2015; Version 0.7.2 |
| Amazon Web Services Service Catalog | AWS Service Catalog [25] allows users to manage a catalogue of IT services which include, for example, virtual machine images, servers, software and databases. | Proprietary software | ✓ Different levels of support | Many thousands of results | Since 2015 |
| Ambari | Ambari [26] is an open source tool for provisioning, managing and monitoring clusters. Ambari provides an intuitive, easy-to-use platform management web UI backed by its RESTful APIs. | Apache 2.0 | ✓ Project mailing lists, an IRC channel and different support among platform providers | Many thousands of results | Since 2011; Version 2.5.1 |
| Sentry | Sentry [27] provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Big Data cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala and HDFS (limited to Hive table data). | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2012; Version 1.8.0 |
| WSO2 API Manager | WSO2 API Manager [28] supports API publishing, lifecycle management, application development, access control, rate limiting and analytics in one cleanly integrated system. It includes a store that allows users to discover APIs, get the related documentation and also try the APIs out. | Apache 2.0 | ✓ Different levels of support | Many thousands of results | Since 2012; Version 2.1.0 |
| Confluence | Confluence [29] is content collaboration software that allows users to create, share and collaborate on projects all in one place; it supports meeting notes, project plans, product requirements, multimedia and dynamic content. | Proprietary; requires a Confluence Server license | ✓ Technical support | Many thousands of results | Since 2004; Version 6.2 |
| Knox | Knox [30] is an application gateway for interacting with the REST APIs and UIs of a Big Data platform. The Knox Gateway provides a single access point for all REST and HTTP interactions with the Big Data cluster. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2013; Version 0.13.0 |

Table 8 – Description of the solution building blocks

[15] https://aws.amazon.com/ec2/?nc1=h_ls
[16] https://aws.amazon.com/s3/?nc1=h_ls
[17] https://hortonworks.com/products/data-center/hdp/
[18] https://flume.apache.org/
[19] http://storm.apache.org/
[20] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[21] https://github.com/RevolutionAnalytics/RHadoop/wiki
[22] https://spark.apache.org/docs/latest/sparkr.html#overview
[23] http://lucene.apache.org/solr/
[24] https://zeppelin.apache.org/
[25] https://aws.amazon.com/servicecatalog/?nc1=h_ls
[26] https://ambari.apache.org/
[27] https://sentry.apache.org/
[28] http://wso2.com/api-management/
[29] https://www.atlassian.com/software/confluence
[30] https://knox.apache.org/
6. ILLUSTRATIVE IMPLEMENTATION OF A USER STORY
This chapter provides an illustrative implementation of the User Story “Big Data pilot
implementation” described in Chapter 4. It aims to give the reader a clearer understanding of the
steps users have to follow to implement a Big Data pilot using the Big Data Test Infrastructure. In
this User Story, the Policymaker wants to start a Big Data pilot using a ready-to-use Big Data
environment in order to support the policy-making process and to evaluate the impacts and
advantages of implementing a Big Data solution. Figure 15 below illustrates the main steps of the
governance process of the identified User Story and their relations with the main entities of the
Big Data Test Infrastructure (business services, architecture building blocks and solution building
blocks).
Figure 15 – Illustrative implementation of the User Story “Big Data pilot implementation”
[Figure 15 maps the User Story’s activities (Initial Assessment; Big Data Test Infrastructure
initialisation; Analytics implementation; Outcomes sharing & presentation) to the business services
(Advisory; PaaS for implementing Big Data use cases; Support for Analytics implementation;
Analytics as a Service; Community Building and Innovation Portal) and to the functional and
solution building blocks involved (Servers, Storage, Network, Community, Distribution, Distributed
File System, Infrastructure Monitoring Management, Privacy and Security Policy, Batch Ingestion,
Real Time Ingestion, Data Transformation, Analytics, Data Discovery and Exploration, Data
Visualisation).]
The steps of the User Story’s activities are detailed below, following the example associated with
the user story “Big Data Pilot implementation” defined in Chapter 4, with the rationale for the
business services chosen at each step and the associated architecture and solution building blocks.
1. Initial Assessment: in order to evaluate the impact of a new tax policy introduced six months
earlier, the Advisory business service helps the Policymaker to define the pilot scope and to
identify the hardware requirements of the Big Data platform. In more detail, the advisory team
evaluates which data sources are useful to analyse the general sentiment regarding the new tax
policy (e.g. social media datasets) and the duration of the pilot, and defines the storage capacity
of the future platform.
The Advisory business service does not include architecture building blocks.
2. Big Data Test Infrastructure initialisation: in this phase the Policymaker uses the PaaS for
implementing Big Data use cases business service in order to obtain the template for the
initialisation of the Big Data platform. This phase starts with the choice of the most suitable Big
Data distribution based on the requirements collected in the first phase, and includes the
configuration of the number of servers, the inbound and outbound traffic ports, the storage
capacity and the choice of the Big Data technologies identified in the assessment phase. At the
end of this phase, the Policymaker will have a ready-to-use Big Data platform and will be ready to
implement analytics.
The PaaS for implementing Big Data use cases business service covers the initialisation of the
platform, which involves most of the architecture building blocks of the target architecture.
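To make the initialisation step concrete, a platform template of the kind described above could be expressed as a simple configuration structure. The sketch below is a minimal, hypothetical example: every value (distribution, server counts, ports, storage size, service list) is an illustrative assumption derived from a fictitious assessment phase, not a BDTI default.

```python
# Hypothetical initialisation template for a BDTI pilot platform.
# All values are illustrative assumptions, not actual BDTI settings.
pilot_template = {
    "distribution": "Hortonworks Data Platform",
    "servers": {"count": 4, "vcpus": 8, "ram_gb": 32},
    "storage_gb": 1000,
    "network": {"inbound_ports": [22, 443, 8080], "outbound_ports": [443]},
    "services": ["HDFS", "Storm", "Zeppelin", "Solr"],
}

def validate(template):
    """Basic sanity checks before the platform is provisioned."""
    assert template["servers"]["count"] >= 1, "at least one server is required"
    assert template["storage_gb"] > 0, "storage capacity must be set"
    assert template["services"], "choose the Big Data technologies to deploy"
    return True

print(validate(pilot_template))  # True
```

In practice, such a template would be handed to the provisioning layer (e.g. the cloud provider's APIs), so that the Policymaker receives a ready-to-use platform without manual configuration.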
3. Analytics implementation: in this phase the Policymaker and a specialised team, composed of
Data Engineers and Data Scientists and provided by the Support for Analytics implementation
business service (which does not include architecture building blocks), implement the analytics. In
order to obtain feedback on the new tax policy, they can use customisable pre-built
functionalities for the analysis of social media data. These functionalities are provided by the
Analytics as a Service business service, which saves the Policymaker and the specialised team the
cost and time of implementing ad-hoc algorithms to satisfy their needs.
The Analytics as a Service business service covers the following architecture building blocks:
- data ingestion (batch and real time): allows Data Engineers to gather unstructured datasets from
  social media (e.g. Twitter or Facebook) and to store the data in the distributed file system;
- data transformation: allows Data Engineers to perform ETL operations (e.g. joining data from
  multiple sources, encoding free-form values, etc.) in order to prepare datasets for further
  analysis;
- analytics: allows Data Scientists to implement social media analytics (e.g. a word cloud, which
  shows the Policymaker how often each word has been associated with the tax policy);
- data discovery: allows Data Scientists to explore data in order to view the most relevant
  features of the social media datasets;
- data visualisation: allows Data Scientists to present the results obtained using interactive data
  visualisation tools.
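The word-cloud analytic mentioned above essentially boils down to counting term frequencies across social-media posts. The library-free sketch below illustrates the idea; the sample posts and stop-word list are invented for illustration, and a real pilot would ingest data through the batch/real-time ingestion building blocks rather than hard-coding it.

```python
from collections import Counter

# Toy word-frequency analysis of the kind behind a word cloud.
# The posts below are invented examples about a fictitious tax policy.
posts = [
    "new tax policy is confusing",
    "the new tax policy helps small business",
    "tax policy deadline extended",
]
stopwords = {"the", "is", "a", "new"}

words = Counter(
    w for post in posts for w in post.lower().split() if w not in stopwords
)
print(words.most_common(2))  # [('tax', 3), ('policy', 3)]
```

The resulting frequencies feed the data visualisation building block, where term sizes in the word cloud are scaled by count.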
4. Outcomes sharing and presentation: in this last phase, the Policymaker shares the results in a
community with other Member States in order to obtain feedback on the methodologies and
algorithms used during the pilot, or simply to share the pilot’s outcomes. The Community Building
and Innovation Portal business service provides a portal in a configured environment. It covers the
community building block and the dashboard and reporting building block, which allow the
Policymaker to create and share the graphs obtained.
7. GOVERNANCE AND OPERATIONAL MODEL FOR THE BIG DATA
TEST INFRASTRUCTURE
This chapter outlines a high-level governance and operational model of the future Big Data Test
Infrastructure, to be further elaborated in a detailed Target Operating Model. Assuming that the
implementation will be part of the CEF 2018 work programme (see Chapter 2.3), the high-level
governance and operational model has been set up based on the following framework:
Figure 16 – Reference framework for the strategy, governance and operational model
The framework outlines the main IT capabilities and processes needed for the governance and
management of a complex IT Platform such as the future Big Data Test Infrastructure.
Strategy layer
Figure 17 – Strategy layer

For the Strategy layer, the proposed model anticipates an Operational Management Board (OMB)
for the Big Data Test Infrastructure. The OMB will be composed of DG DIGIT and DG CNECT top
management and CEF Member States representatives (or appropriate governance bodies in the
case that the Big Data Test Infrastructure is not part of the CEF 2018 work programme), and will
meet every month to discuss and take decisions on all types of operational matters (strategy,
planning and budgeting). The main processes managed by the OMB are listed below:
- Big Data Test Infrastructure Strategy – process focused on the definition and maintenance of a
  strategic roadmap for the scale-up and improvement of the Big Data Test Infrastructure, using as
  input the outcomes of the Demand Management process;
- Demand Management – process focused on anticipating and understanding Member States’
  demand for Big Data services (e.g. requests for new pilots);
- Financial Management – process focused on the management of the Big Data Test
  Infrastructure’s budgeting and financing.
Furthermore, the OMB will ensure that the architecture principles (see Chapter 5.1) are respected
during the development and improvement of the Big Data Test Infrastructure, participating actively
in the Change Management decision process.
Governance layer
The governance layer will focus on service management activities, all based on well-known,
market-leading methodologies such as the ITIL Framework, ensuring proper information security
management, monitoring the overall performance of the Big Data Test Infrastructure and, in
addition, managing the staff.
This role will be played by DG DIGIT and DG CNECT representatives (or appropriate governance
bodies in the case that the Big Data Test Infrastructure is not part of the CEF 2018 work
programme), who will oversee relations with IT providers (Procurement Management) and with
business stakeholders such as European public administrations (Stakeholders Management).
Figure 18 – Governance layer
Operational layer
This layer covers activities in the field of service development, service evolution,
project/programme management of IT developments, and service delivery. The role of the
Solution Provider (SP), who will be accountable for the development and delivery of the Big Data
Test Infrastructure’s building blocks and related services, will be played by DIGIT and/or external
contractors.
Operations refers to the day-to-day running of the Big Data Test Infrastructure and includes
processes to guarantee that services run without interruption.
The main operational processes which will be managed by the SP are listed below:
- ICT Infrastructure Management – manages and monitors the IT infrastructure, including facilities
  management covering all aspects of the physical environment (for example power and cooling,
  building access management and environmental monitoring) as well as actions to monitor and
  control the IT services of the underlying infrastructure. The SP executes day-to-day routine tasks
  related to the operation of infrastructure components and applications (including pilot
  scheduling, backup and restore activities, routine maintenance, etc.);
- Availability and Capacity Management – establishes and maintains capacity and availability at a
  justifiable cost and with an efficient use of resources. This includes the appropriate provision of
  resources to the Big Data Test Infrastructure; monitoring, analysing, understanding and reporting
  on current and future demand for services, use of resources, capacity, service system
  performance and service availability; and determining corrective actions to ensure appropriate
  capacity and availability while balancing costs against resources needed and supply against
  demand. It also grants authorised users the right to use a Big Data Test Infrastructure service
  while preventing access by non-authorised users, executing the policies defined in the
  Information Security field;
- IT Service support – will be implemented by a dedicated Service Desk (SD), the Single Point of
  Contact (SPOC) for users/customers on a day-to-day basis. It will also be a focal point for
  reporting and managing problems and incidents (disruptions or potential disruptions in service
  availability or quality) and for users making service requests (routine requests for services).

Figure 19 – Development and Operational layer
Finally, a Communication Office (CO) will be responsible for the promotion of the Big Data Test
Infrastructure and the management of the Community Innovation portal.
8. NEXT STEPS
This final chapter provides information about the next steps for the implementation of the future
Big Data Test Infrastructure. Figure 20 below illustrates a high-level timeline summarising the main
steps performed so far and the next steps foreseen over the coming years.
Figure 20 – High-level roadmap for the implementation of the Big Data Test Infrastructure
As shown in the figure above, the infrastructure will be implemented incrementally, starting in
2018 with a first set of Big Data services and finalising the implementation of all the services by
2019.
Regarding the practical steps to be followed for the implementation of the Big Data Test
Infrastructure, Figure 21 below provides an implementation roadmap, which will enable the
implementation of the “core” services (“PaaS for implementing Big Data use cases”, “Community
Building and Innovation Portal” and “Big Data and Analytics software catalogue”) and the execution
of a first set of pilot projects with some Member States, supported by the business services
“Advisory” and “Support for Analytics implementation”.
Figure 21 – Detailed roadmap for the implementation of the Big Data Test Infrastructure