DG DIGIT
Unit.D.1
D03.1 DESIGN OF THE BIG DATA TEST
INFRASTRUCTURE
ISA2 action 2016.03 – Big Data for Public Administrations
“Big Data Test Infrastructure”
Specific contract n°406 under Framework Contract n° DI/07172 – ABCIII
October 2017
Specific Contract 406 – D03.1 Design of the Big Data Test Infrastructure
Date: 27/10/2017 Doc. Version: 2.0 1 / 63
This study was carried out for the ISA2 Programme by KPMG Italy.
Authors:
Lorenzo CARBONE
Simone FRANCIOSI
Silvano GALASSO
Pavel JEZ
Valerio MEZZAPESA
Alessandro TRAMONTOZZI
Stefano TURCHETTA
Specific Contract No: 406
Framework Contract: DI/07172
Disclaimer
The information and views set out in this publication are those of the author(s) and do not
necessarily reflect the official opinion of the Commission. The Commission does not guarantee the
accuracy of the data included in this study. Neither the Commission nor any person acting on the
Commission’s behalf may be held responsible for the use which may be made of the information
contained therein.
© European Union, 2017
Document Control Information
Settings Value
Document Title: D03.1 Design of the Big Data Test Infrastructure
Project Title: ISA2 Action 2016.03 – Big Data for Public Administrations – Big Data Test Infrastructure
Document Authors:
Lorenzo CARBONE, Simone FRANCIOSI, Silvano GALASSO, Pavel JEZ, Valerio MEZZAPESA, Alessandro TRAMONTOZZI, Stefano TURCHETTA
Commission Project Officer:
Marco FICHERA – European Commission – DIGIT D.1
External Contractor Project Manager:
Lorenzo CARBONE
Doc. Version: 2.0
Sensitivity: Internal
Date: October 2017
Revision History
The following table shows the development of this document.
Version | Date | Description | Created by | Reviewed by
0.1 | June 2017 | Proposal for a Table of Contents for the Report | Simone FRANCIOSI, Pavel JEZ, Valerio MEZZAPESA, Alessandro TRAMONTOZZI, Stefano TURCHETTA | Lorenzo CARBONE, Silvano GALASSO
0.2 | July 2017 | Draft version of Chapters 4, 5.1 and 5.2 | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
0.3 | July 2017 | Consolidated version of Chapters 4, 5.1 and 5.2 | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
0.9 | September 2017 | Draft version of Chapters 7 and 8 | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
1.0 | October 2017 | Complete drafted version | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO
2.0 | October 2017 | Final version | Simone FRANCIOSI, Valerio MEZZAPESA, Alessandro TRAMONTOZZI | Lorenzo CARBONE, Silvano GALASSO, Marco FICHERA
TABLE OF CONTENTS
EXECUTIVE SUMMARY ................................................................................................................ 5
1. INTRODUCTION .......................................................................................................................... 10
1.1. Objectives of the document ...................................................................................................... 12
1.2. Structure of the document ........................................................................................................ 13
2. CONTEXT .................................................................................................................................. 15
2.1. The ISA2 programme and the Action 2016.03 “Big Data for Public Administrations” .............. 15
2.2. EIRA ............................................................................................................................................ 17
2.3. CEF Programme ......................................................................................................................... 20
3. METHODOLOGY FOLLOWED .......................................................................................................... 23
4. USER STORIES ........................................................................................................................... 27
5. TARGET ARCHITECTURE ................................................................................................................ 30
5.1. Architecture principles ............................................................................................................... 30
5.2. Business architecture ................................................................................................................. 31
5.3. Application-Data-Technology architecture................................................................................ 36
5.3.1. High-level architecture view......................................................................................... 37
5.3.2. Detailed view of the Analytics building block .............................................................. 44
5.4. Solution architecture ................................................................................................................. 47
5.4.1. Criteria for the identification of tools/technologies ..................................................... 47
5.4.2. Identified Solution building blocks ............................................................................... 47
6. ILLUSTRATIVE IMPLEMENTATION OF A USER STORY ............................................................................. 55
7. GOVERNANCE AND OPERATIONAL MODEL FOR THE BIG DATA TEST INFRASTRUCTURE.................................. 58
8. NEXT STEPS............................................................................................................................... 62
LIST OF FIGURES
Figure 1 – Narratives of the Big Data Test Infrastructure .................................................................. 11
Figure 2 – Key concepts in EIRA ......................................................................................................... 19
Figure 3 – Interoperability levels of the EIF ....................................................................................... 19
Figure 4 – The CEF GOFA model ........................................................................................................ 22
Figure 5 – Methodological approach followed under Task 3 ............................................................ 23
Figure 6 – Adopted Methodology for the overall study .................................................................... 26
Figure 7 – Targeted personas in scope for the User Stories .............................................................. 27
Figure 8 – Business view of the Big Data Test Infrastructure and related user stories ..................... 32
Figure 9 – Business services mapped in the template of the CEF service offering ........................... 36
Figure 10 – Preview of the logical architecture of the Big Data Test Infrastructure ......................... 37
Figure 11 – High-level architecture view of the Big Data Test Infrastructure ................................... 38
Figure 12 – Drill-down of the Analytics building block ...................................................................... 44
Figure 13 – Final priority for the identified Big Data use cases ......................................................... 45
Figure 14 – Solution Architecture of the Big Data Test Infrastructure .............................................. 48
Figure 15 – Illustrative implementation of the User Story “Big Data pilot implementation” ........... 55
Figure 16 – Reference framework for the strategy, governance and operational model ................ 58
Figure 17 – Strategy layer .................................................................................................................. 58
Figure 18 – Governance layer ............................................................................................................ 59
Figure 19 – Development and Operational layer............................................................................... 60
Figure 20 – High-level roadmap for the implementation of the Big Data Test Infrastructure .......... 62
Figure 21 – Detailed roadmap for the implementation of the Big Data Test Infrastructure ............ 63
LIST OF TABLES
Table 1 - Methodology followed for Task 3 of the study ................................................................... 26
Table 2 - "User Stories" in scope for the Big Data Test Infrastructure .............................................. 29
Table 3 - Architecture principles ........................................................................................................ 31
Table 4 – Description of the building blocks for the Application-Data-Technology Architecture ..... 42
Table 5 – Mapping between architecture building blocks and business services ............................. 44
Table 6 – Description of the architectural detailed building block .................................................... 45
Table 7 – Mapping between architecture and solution building blocks ........................................... 50
Table 8 – Description of the solution building blocks ........................................................................ 54
EXECUTIVE SUMMARY
The present Report has been issued under the ISA2 Action 2016.03 – Big Data for Public
Administrations – Big Data Test Infrastructure and is the outcome of Task 3 of the “Big Data Test
Infrastructure” project. The objectives of the project are briefly described below:
Focusing on Task 3, the present Report illustrates the designed Target architecture (business and
technical architecture) and the target governance/operational model of a Big Data Test
Infrastructure to be made available by the European Commission to other EC DGs, Member States’
Public Administrations and EU Institutions in order to:
1. Facilitate the launch of pilot projects on big data, data analytics or text mining, by
providing the infrastructure and the software tools needed to start a small project;
2. Foster the sharing of various data sources across policy domains and organisations to
support better policy-making;
3. Support Public Administrations through the creation of a Big Data community around
best practices, methodologies and artefacts (algorithms, analytical models, pilot outputs,
etc.) on big data for policy-making.
The following methodological approach (see Chapter 3) has been applied:

Step 3.1 – Identification of the User Stories
Step 3.2 – Design of the target architecture
Step 3.3 – Design of the Governance and Operational model
Step 3.4 – Report on final results

Identification of User Stories (see Chapter 4): the present Report illustrates key scenarios that
users may encounter in a Big Data context, based on business needs collected during the project
through primary data collection activities (interviews with ISA Coordination group members).
Each user story highlights a specific purpose of a potential user – a Targeted Persona – in the Big
Data field, to be supported by the future Big Data Test Infrastructure through business services
and solutions. The table below summarises the identified user stories:
User story | Big Data Test Infrastructure solution

Learning from other European PAs
The IT Director can be supported by a Big Data Community in which they can find use cases implemented by other Public Administrations and share Big Data methodologies, strategies, artefacts and outcomes, in order to spread know-how on Big Data among Public Administrations at European, national and local level.

Test off-the-shelf analytical tools
The IT Practitioner can be supported with off-the-shelf Big Data analytical tools that they can download from a catalogue to implement Big Data solutions, hiding technical complexity in order to offer easy-to-use analytical functionalities and optimise experimentation costs. The IT Practitioner can also be supported by a specialised team in implementing analytical functionalities and in finding new contexts in which to apply Analytics.

Experimenting with a Big Data platform
The IT Practitioner can be supported with a ready-to-use Big Data platform, provisioned through a structured process from a marketplace, respecting privacy policies and using open source tools, thus fostering the adoption of Big Data technologies, the acquisition of Analytics skills and an understanding of the added value of Big Data and Analytics tools in the public sector.

Big Data pilot implementation
The Policymaker can be supported with a dedicated specialised team to implement and execute a Big Data pilot on a ready-to-use Big Data platform, using pre-built analytical functionalities to save cost and time. The Policymaker can also share the pilot's results (impacts, advantages, etc.) through the Big Data Community, thus fostering the spread of know-how.

Integrating open datasets
The Data Scientist can be supported with a repository shared with other Member States, enabling the gathering and usage of the desired datasets, which can easily be downloaded from a catalogue, thus encouraging the sharing of open data among PAs working on different policy domains and helping to find solutions to problems whose correlation with the analysed policy domains was originally hidden and/or unclear.
Design of the Target Business and Technical Architecture (see Chapter 5): starting from clear and
agreed architecture principles (e.g. SW openness, reusability, etc.), the present Report describes
the designed Target Business and Technical Architecture for the future Big Data Test Infrastructure.
The target Business Architecture includes a set of business services linked to the User Stories,
which will represent the service offering of the Big Data Test Infrastructure. The set of business
services is summarised in the following table:
Business service | Description

PaaS for implementing Big Data Use Cases
This service aims at providing the Big Data platform and all the tools supplied by the European Commission. The initialisation of the platform will be a wizard-driven process in which users choose specific platform templates through a marketplace. Should new needs arise, the platform will be enriched with new functionalities. Once the platform is instantiated, its resource configuration and tool list will be kept in a catalogue for future reuse.

Data Catalogue and Data Exchange APIs
This service aims at providing data and/or exchange APIs that enable the gathering and usage of the desired data to be correlated with user data. It provides a catalogue of data sources from which users can retrieve links to data that is already available (e.g. coming from previous pilots), or access to sample datasets (classified by policy domain) which the European Commission makes available in a centralised repository. The catalogue of data sources could be enriched by users during the implementation of their pilots, for future reuse.

Analytics as a Service
This service aims at providing a list of analytical functionalities which enables any EU Institution / Public Administration to quickly access a series of customisable pre-built elaborations. The following non-exhaustive list represents a set of analytical services that could be available: extracting information from documents (text mining), time-series forecasting, geo-information normalisation, population / customer segmentation.

Community Building and Innovation Portal
This service aims at building a Big Data community where users can share knowledge and Big Data artefacts (e.g. methodologies, statistical models, pilot outcomes and datasets). It also consists in providing an innovation portal where users can contribute their own ideas in order to launch new propositions. The innovation portal will have a recommendation engine for aggregating contributions aligned to users' searches.

Big Data and Analytics software catalogue
This service aims at providing a catalogue of Analytics artefacts and software tools that users will be able to download for implementing Big Data solutions. Part of this catalogue will be a special software stack (like a Sandbox) usable in the preferred user environment (on premise or cloud): a preconfigured environment that contains services, sample data and interactive tutorials useful for testing Big Data technologies.

Support for Analytics implementation
This service aims at providing technical support by a specialised team (e.g. Business Analyst, Data Scientist or Data Engineer) which helps PAs implement Big Data pilots. For example, support could be provided in terms of sizing of the infrastructure, selection of the involved technologies, execution of the pilot, and sharing and presentation of the results. Any outcome of a Big Data pilot could feed the community portal in order to be shared with the community.

Advisory
This service aims at providing advisory support for activities related to the Big Data Test Infrastructure. It covers activities such as pilot scoping, business case definition, evaluation of risks related to the implementation of a Big Data solution, data source identification and integration, and service identification and integration.
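To illustrate the kind of customisable pre-built elaboration the "Analytics as a Service" offering describes (here, time-series forecasting), the sketch below shows a deliberately minimal moving-average forecaster. The function name and the sample data are purely illustrative assumptions, not part of the designed service:

```python
def moving_average_forecast(series, window=3, steps=2):
    """Forecast the next `steps` values as the mean of the last `window` observations,
    rolling each forecast forward as if it were an observation."""
    values = list(series)
    forecasts = []
    for _ in range(steps):
        avg = sum(values[-window:]) / window
        forecasts.append(avg)
        values.append(avg)  # treat the forecast as the next observation
    return forecasts

# Hypothetical monthly request counts from a public-administration dataset
monthly_requests = [120, 130, 125, 140, 150, 145]
print(moving_average_forecast(monthly_requests, window=3, steps=2))
```

A production service would of course expose far richer, configurable models; the point here is only that such functionalities can be offered as parameterised, ready-to-call building blocks.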
The Big Data service offering will be enabled through a Technical Architecture composed of a set of architectural building blocks describing the functional capabilities of the future Big Data Test Infrastructure. Furthermore, a preliminary view of potential solutions to be used for the implementation of the target architecture is provided (focusing on open source software).
Design of the Governance & Operational model (see Chapter 7): the present Report describes the defined high-level governance and operational model of the future Big Data Test Infrastructure, to be further elaborated in a detailed Target Operating Model. The high-level governance and operational model has been set up based on a well-defined framework and is structured in the following layers:
• Strategy layer, proposing an Operational Management Board (OMB) which will convene every month to take strategic decisions on the development of the Big Data Test Infrastructure (main processes: Big Data Test Infrastructure Strategy, Demand Management and Financial Management).
• Governance layer, focusing on service management activities based on well-known, market-leading methodologies (e.g. the ITIL framework), ensuring proper information security management, monitoring overall performance and, furthermore, managing the staff, the IT providers (Procurement Management) and the business stakeholders such as European Public Administrations and Institutions (Stakeholder Management).
• Operational layer, covering activities in the field of service development and evolution (implementation and evolution of business services and technical building blocks), day-to-day operations of the Big Data Test Infrastructure in terms of ICT Infrastructure Management, Availability and Capacity Management, and a dedicated Service Desk implementing IT service support. Finally, a Communication Office (CO) will be responsible for Big Data Community Building Management, including the promotion of the Big Data Test Infrastructure and its business services.
The next steps expected for the implementation of the Big Data Test Infrastructure are described in detail in Chapter 8. Assuming that the implementation of the Big Data Test Infrastructure would be part of the CEF Work Programme, it is planned to implement the infrastructure with an incremental approach, starting in 2018 with the implementation of the "core" Big Data services and implementing all the other business services by 2019.
1. INTRODUCTION
The amount of data generated worldwide keeps increasing at an astounding pace, growing by 40% each
year, and forecasts expect it to rise 30-fold between 2010 and 2020. Since non-interoperable means
are being used to describe data generated in the public sector, most of this data cannot be re-used.
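As a quick arithmetic check of the two figures above, 40% annual growth compounded over the ten years from 2010 to 2020 yields roughly a 29-fold increase, consistent with the cited 30-fold forecast:

```python
# Compound growth: 40% per year over the 10-year span 2010-2020.
annual_growth = 0.40
years = 10
multiple = (1 + annual_growth) ** years
print(f"Data volume multiple after {years} years: {multiple:.1f}x")  # ~28.9x
```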
Previous studies have already investigated Big Data and data analytics initiatives launched by
EU Public Administrations (PAs) and EU Institutions (EUIs) both at European and National level.
Indeed, their focus was geared towards studying the potential or added value of Big Data analytics
to help public authorities at all levels of government and in different domains in reaching their
goals, as well as towards capturing valuable lessons learned and best practices of mature public
organisations to inspire peers while helping them in further use of Big Data analytics and to become
more insight-driven. That being said, despite the various use cases covered by these
aforementioned studies, the adoption of some analytics technologies in public administrations is
still lacking. At the moment, several Cloud environments exist in the European Commission but no
Big Data infrastructure is available to any PA or EUI with a full stack of technologies
(infrastructure in terms of storage and computing capacity, analytics tools and test datasets) to test
the value of new ways of processing Big Data and display its benefits to their management.
Providing these analytics technologies to PAs and EUIs would both significantly increase the
adoption of analytics technologies and encourage users to initiate research and test projects in the
Big Data field, and as a result boost innovation and R&D (Research and Development).
Therefore, the ISA2 Action 2016.03 – “Big Data for Public Administrations” aims to address the use
of Big Data within PAs to support better decision-making.1 The study “Big Data Test Infrastructure”,
launched at the beginning of January 2017 under the above-mentioned activities, aims at filling the
gap within PAs in the Big Data field, providing the design of a centralised European Big Data Test
Infrastructure to be used by any PA and EUI in Europe.
1 https://ec.europa.eu/isa2/sites/isa/files/library/documents/isa2-work-programme-2016-summary_en.pdf
Indeed, the purpose of this study is to identify the key features of the "Big Data Test
Infrastructure" and design its architecture, which the European Commission (EC) will make
available to any interested EC DG, PA and EUI in Europe in order to:
1. Facilitate the launch of pilot projects on Big Data, data analytics or text mining, by
providing the infrastructure and software tools needed to start a pilot project;
2. Foster the sharing of various data sources across policy domains and organisations to
support better policy-making; and
3. Support PAs through the creation of a Big Data community around best practices,
methodologies and artefacts (big data algorithms, analytical models, pilots’ outputs, etc.) on
Big Data for policy-making.
A cross-border aggregation of data through a ready-to-use Big Data Test Infrastructure would enable
and increase the adoption of meaningful analytics services that will benefit European and national
PAs / EUIs and the European Union as a whole.
The following examples will facilitate readers in understanding the potential of this Big Data Test
Infrastructure:
Figure 1 – Narratives of the Big Data Test Infrastructure
The entire "Big Data Test Infrastructure" study is structured in accordance with three main tasks.
The objective of this document is to report on the final results of Task 3.
1.1. OBJECTIVES OF THE DOCUMENT
As anticipated in the Introduction, the objective of this document (outcome of Task 3 of the overall
study) is to describe the service offering of the initiative and the overall business and technical
architecture of the Big Data Test Infrastructure to be used by EU Institutions and EU Public
Administrations to launch pilot projects on Big Data.
On one side, a list of simple User Stories has been described in order to better define the set of
services the Big Data Test Infrastructure will offer to future users, while, on the other side, good
practices and a set of business / technical requirements identified during Task 1 were used with
well-defined architecture principles to define the key features and main functionalities of the Big
Data Test Infrastructure (output of Task 1 can be found in “D02.1_Requirements and good
practices for a Big Data Test Infrastructure”).
The main objectives of this document can therefore be summarised as follows:
• Identification of the User Stories (quick reference: Chapter 4);
• Target Architecture (business and technical architecture) (quick reference: Chapter 5);
• Illustrative implementation of a User Story (quick reference: Chapter 6);
• Governance and Operational scenarios (quick reference: Chapter 7).
1.2. STRUCTURE OF THE DOCUMENT
This document represents the final deliverable of TASK 3 of the overall study related to the Big Data
Test Infrastructure. This document contains seven main sections, structured according to the
approach to the study, as listed below:
• Introduction – presents the entire study and its main objectives (Chapter 1);
• Context – outlines the context of the study, pointing out the ISA2 Programme, the ISA2
Action 2016.03 – "Big Data for Public Administrations" and the European Digital Pole in
Luxembourg (Chapter 2);
• Methodology followed – introduces the methodological approach used for TASK 3,
highlighting the steps adopted (Chapter 3);
• User Stories – summarises the main user stories (using Personas) for usage of the Big Data
Infrastructure by public administrations (Chapter 4);
• Target architecture – describes the overall architecture of the Big Data Test Infrastructure
both in terms of services provided and the resulting business, technical and functional
architecture, linked to the business / technical requirements and the solutions adopted in
the selected Good Practices coming from Task 1; it also shows how the obtained
architecture supports the selected use cases coming from Task 1 (Chapter 5);
• Illustrative implementation of a User Story – contains a complete implementation of a
user story with both the services provided and the activated architectural / solution
building blocks of the Big Data Test Infrastructure (Chapter 6);
• Governance and Operational scenario for the Big Data Test Infrastructure – contains the
demand management and access policies, the infrastructure standard templates and the
outcomes sharing (Chapter 7).
• Next Steps – provides information related to the future implementation of a small-scale
Big Data Test Infrastructure (Chapter 8).
2. CONTEXT
2.1. THE ISA2 PROGRAMME AND THE ACTION 2016.03 “BIG DATA FOR
PUBLIC ADMINISTRATIONS”
Nowadays, European Public Administrations are expected to provide efficient and effective
electronic cross-border or cross-sector interactions, not only between PAs but also between PAs
and citizens and businesses, without any disruption. By implementing and executing the ISA2
Programme (commonly referred to as ISA2) from 1 January 2016 to 31 December 2020, the EC
finances thirty-five (35) clusters of actions2 with an operational financial envelope of
approximately EUR 131 million. This programme will continue to ensure that Member States (MSs)
are provided with high-quality, fast, simple and low-cost interoperable digital services.
By supporting and developing new actions and interoperability solutions, the Council and the
European Parliament ensure that ISA2 will contribute to increasing interoperability that will in turn
advance the services offered, cut overall costs and result in a better-functioning internal market.
Under ISA2, the Presidency will prioritise actions and develop provisions to prevent any overlaps
and promote full coordination and consistency with other EU programmes (Connecting Europe
Facility Programme, DSM Strategy).
The 5-year ISA2 Programme 2016–2020 has been developed as a follow-up to its predecessor ISA, which ran from
2010 to 2015. Still managed by the ISA Unit (up to 2016, DIGIT.B6, now DIGIT.D1) of DG Informatics of the EC, the
ISA2 Programme will focus on specific aspects such as ensuring correct coordination of interoperability activities
at EU level; expanding the development of solutions for public administrations according to businesses’ and
citizens’ needs; proposing updated versions of tools that boost interoperability at EU and national level, namely
the European Interoperability Framework (EIF) and the European Interoperability Strategy (EIS); the European
Interoperability Reference Architecture (EIRA) and a cartography of solutions: the European Interoperability
Cartography (EIC).
With the adoption of ISA2, the EC commits to developing necessary IT services and solutions for the advancement
of public-sector innovation and digital public service delivery to citizens and businesses.
In order to remain in line with the European DSM Strategy, ISA2 monitors and supports EIF implementation in
Europe.
ISA is also well aligned with the Connecting Europe Facility Programme (CEF Programme), the Union’s funding
instrument for trans-European networks in the fields of transport, energy and telecommunications. The CEF
supports the deployment and operation of key cross-border digital services. ISA2 supports the quality
improvement of selected services and brings them to the operational level required to become a CEF service. It is
also one of the enabler and contributor programmes for public-sector innovation in Europe.
2 See: https://ec.europa.eu/isa2/dashboard/isadashboard
The ISA2 Programme currently covers 35 actions, in which the “Big Data for Public
Administrations” represents the third, namely Action 2016.03. ISA2 is structured in such a way
that actions are grouped into packages of similar policy areas, which are agreed by the
Commission and Member States. Action 2016.03 belongs to the package “access the data / data
sharing / open data” under which the ISA2 programme funds actions to help open up national data
repositories, facilitate the reuse of data across borders and sectors, and widen access to data
created by the public sector.3
Phase 1 of this Action is aimed at carrying out a landscape analysis in order to identify: (i) the
requirements and challenges of PAs in Europe and the Commission in the context of Big Data;
(ii) ongoing initiatives and best practices in these areas, including an assessment of the tools and
solutions that these initiatives have implemented; and (iii) synergies and areas of cooperation with
the policy DGs and the MSs in this domain. Furthermore, phase 1 also intends to execute some
pilots that showcase the usefulness and policy benefits that Big Data can bring.
This action will continue to build upon the results of phase 1, focusing on the following activities:
Track 1: continue with the identification of further opportunities and areas of interest
whereby the use of Big Data could help improve working methods as well as ensure better
policy-making for policy DGs as well as Member States' Public Administrations;
Track 2: continue the implementation of the already identified pilots by generalising the
developed functionalities, thus extending their use to policy-agnostic contexts in order
to maximise the benefit and return on investment of the proposed solutions;
Track 3: launch a new wave of pilots in specific domains which hold potential for later
being generalised and scaled up, so as to be made available to different services
regardless of their specific policy area.
Moreover, in order to encourage the use of Big Data tools, ISA2 funded several Big Data
pilots under the same action that may serve as motivating examples for PAs.
The ISA2 Action 2016.03 is a natural continuation of the ISA Action (1–22) “Big Data and Open
Knowledge for Public Administrations”, carried out in the context of the 2010–2015 ISA
programme. It aimed at identifying “the challenges and opportunities that Member States and the
Commission face in the context of Big Data and open knowledge [and] to create synergies and
3 See: https://ec.europa.eu/isa2/sites/isa/files/isa2_2017_work_programme_summary.pdf
cooperation between the Commission and Member States, leading to more effective and informed
actions by public administrations”.4 Under this action, a study by Deloitte was conducted on
Big Data entitled “Big Data Analytics for Policy Making”5 and the initiative “Big Data Test
Infrastructure” represents a technical follow-up of this Deloitte report. The final report assigns
specific attention to Big Data and data analytics initiatives launched by European PAs in order to
provide insights. The study first analyses the added value of Big Data analytics in assisting
public authorities at all levels of government and in different domains to achieve their goals.
Second, it captures valuable lessons learned and best practices of mature public organisations
to inspire peers and assist them on their path to using Big Data analytics and becoming more
insight-driven. The study gathered over 100 cases where PAs mine Big Data or use data analytics
to gain better insights and increase their impact; 10 of these, covering a wide range of data
sources, types of analytics, policy domains and levels of government, were selected for more
in-depth case studies and to gather key lessons learned from the use of Big Data and data
analytics within these public authorities.
Based on all use cases and best practices, Deloitte’s study developed several recommendations
addressed to any public organisation that is willing to work with data analytics and Big Data. All
these useful insights are published in the above-mentioned final report: “Big Data Analytics for
Policy Making”.
2.2. EIRA
In order to better understand EIRA’s role and objectives, the document “Introduction to the
European Interoperability Reference Architecture (EIRA©) v2.0.0”6 has been taken into account.
The document provided by the European Commission has been used as a guideline in order to
create the architecture principles (based on the European Interoperability Framework underlying
principles) and define the Architecture Building Blocks of the future Big Data Test Infrastructure.
The European Interoperability Reference Architecture (EIRA) is a reference architecture focused on
the interoperability of digital public services. It is composed of the most salient Architecture
Building Blocks (ABBs) needed to promote cross-border and cross-sector interactions between
public administrations. This interoperability aims at improving cooperation between public
4 See: http://ec.europa.eu/isa/actions/01-trusted-information-exchange/1-22action_en.htm
5 See: https://joinup.ec.europa.eu/asset/isa_bigdata/document/big-data-analytics-policy-making-report
6 See: https://joinup.ec.europa.eu/catalogue/distribution/eira-v200-overview
administrations – removing barriers for administration, businesses and citizens. The New European
Interoperability Framework defines interoperability as the ability of organisations to interact to
achieve mutually beneficial goals, involving the sharing of information and knowledge between
these organisations, through the business processes they support, by means of exchange of data
between their ICT systems. The EIRA is a four-view reference architecture for delivering
interoperable digital public services across borders and sectors and it has four main characteristics:
1. Common terminology to achieve coordination – It provides a common understanding of
the most salient Architecture Building Blocks needed to build interoperable public services.
2. Reference architecture for delivering digital public services – It offers a framework to
categorise Solution Building Blocks (SBBs) of an eGovernment solution. It allows portfolio
managers to rationalise, manage and document their portfolio of solutions.
3. Technology and product neutral and a service-oriented architecture (SOA) style – The EIRA
adopts a service-oriented architecture style and promotes ArchiMate as a modelling
notation. ArchiMate7 is an open and independent enterprise architecture modelling
language to support the description, analysis and visualisation of architecture within and
across business domains in an unambiguous way.
4. Alignment with EIF and TOGAF – EIRA is aligned with the New European Interoperability
Framework (EIF). The views of EIRA correspond to the interoperability levels in the EIF: legal,
organisational, semantic and technical interoperability which are already anchored in the
National Interoperability Frameworks (NIFs) of the Member States. EIRA also reuses
terminology and paradigms from TOGAF, such as architecture patterns, building blocks and
views.
The main objective of EIRA is to support users within the public administrations of Member States
or EU Institutions (architects, business analysts and portfolio managers) in the implementation of
some use cases (design and document solution architecture, compare solution architectures,
structure impact assessment and create, manage or rationalise a portfolio of solutions).
Figure 2 below provides an overview of the key concepts of EIRA and its relationships.
7 See: http://www.opengroup.org/subjectareas/enterprise/archimate-overview
Figure 2 – Key concepts in EIRA
The key concepts of EIRA are defined as follows:
EIF interoperability levels cover legal, organisational, semantic and technical
interoperability;
Figure 3 – Interoperability levels of the EIF
EIF principles consist of 12 underlying principles of European public services that are
relevant to the process of establishing European public services;
EIRA views consist of several views, including one view for each of the EIF interoperability
levels;
EIRA viewpoints provide a perspective keeping the concerns of specific stakeholders
in mind;
Architecture Building Blocks are abstract components that capture architecture
requirements and guide the development of Solution Building Blocks;
Solution Building Blocks are concrete elements that define the implementation of one or
more Architecture Building Blocks;
Solution Architecture Template (SAT) focuses on the most salient building blocks needed to
build an interoperable solution;
Reference Architecture is a generalised, domain-neutral architecture of a solution, based
on best practices and focused on a particular aspect. The goal of a reference architecture
is reusability: it reduces the amount of work and the number of errors, and accelerates the
development of solutions;
Solution Architecture is a description of a discrete and focused business operation or
activity and how information systems/technical infrastructure supports that operation. It
can be derived from a Solution Architecture Template (SAT);
Solution consists of one or more Solution Building Blocks to meet a certain
stakeholder need.
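The concept hierarchy above (ABBs guiding SBBs, SBBs composing Solutions) can be sketched as a small type model. A minimal illustration in Python, where the class names follow the report's terminology but the attributes and the catalogue example are hypothetical assumptions:

```python
from dataclasses import dataclass

@dataclass
class ArchitectureBuildingBlock:
    """Abstract component that captures architecture requirements."""
    name: str
    eif_view: str  # "legal", "organisational", "semantic" or "technical"

@dataclass
class SolutionBuildingBlock:
    """Concrete element implementing one or more ABBs."""
    name: str
    implements: list  # the ABBs this SBB realises

@dataclass
class Solution:
    """One or more SBBs meeting a certain stakeholder need."""
    name: str
    building_blocks: list  # the SBBs composing the solution

# Illustrative example: a concrete catalogue product realising a
# semantic-view ABB (the product name is an assumption).
abb = ArchitectureBuildingBlock("Data Catalogue", eif_view="semantic")
sbb = SolutionBuildingBlock("CKAN-based catalogue", implements=[abb])
solution = Solution("Open Data Portal", building_blocks=[sbb])
assert abb in solution.building_blocks[0].implements
```

The point of the sketch is the direction of the relationships: abstract ABBs constrain concrete SBBs, and a Solution is only a composition of SBBs.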
EIRA’s approach and methodologies have been used as guidelines in order to create the
architecture principles and to define the architecture building blocks of the future Big Data Test
Infrastructure.
2.3. CEF PROGRAMME
The Connecting Europe Facility (CEF) represents a key EU funding instrument that promotes
growth, jobs and competitiveness through targeted infrastructure investment at the European
level. CEF aims to support the development of high-performing, sustainable and efficiently
interconnected trans-European networks in the fields of transport, energy and digital services. The
programme’s investments fill the missing links in Europe's energy, transport and digital backbone.8
The CEF offers multiple benefits to citizens across the EU MSs, especially within the following sectors:
Transport: travel will be easier and more sustainable;
Energy: Europe’s energy security will be enhanced, while enabling wider use of
renewables;
Telecom: cross-border interaction between public administrations, businesses and citizens
will be facilitated;
Economy: the CEF offers financial support to projects through innovative financial
instruments such as guarantees and project bonds. These instruments create significant
leverage in their use of the EU budget and act as a catalyst to attract further funding
from the private sector and other public-sector actors.
8 See: https://ec.europa.eu/inea/en/connecting-europe-facility
Moreover, in order to facilitate the delivery of digital public services across borders, the EU MSs
have created interoperability agreements aimed at deploying trans-European Digital Service
Infrastructures (the DSIs) to be run by CEF Digital.9 This programme supports the provision of basic
and re-usable digital services, known as the CEF building blocks,10 such as eDelivery, eID,
eSignature and eInvoicing. The CEF building blocks can be combined with each other, adopting a
Service Oriented Architecture approach, and integrated with more complex services (e.g. eJustice).
Building blocks denote the basic digital service infrastructures, which are key enablers to be
reused in more complex digital services.11
The CEF building blocks offer basic capabilities that can be used in any European project, and they
can be combined and used in projects in any domain or sector at European, national or local level.
The building blocks are based on existing formalised technical specifications and standards.
The main goals of the CEF building blocks are listed below:
Facilitating the adoption of common technical specifications by PAs;
Ensuring interoperability between IT systems so that citizens, businesses and
administrations can benefit from seamless digital public services wherever they may be in
Europe;
Facilitating the adoption of common technical specifications by projects across different
policy domains with minimal (or no) adaptation by providing services and sometimes
sample software.
The CEF Regulation and the CEF Principles set the context and objectives for the CEF Programme
and define the conditions for providing funding to current and future Building Block DSIs. Each DSI
is implemented through its Service Offering. The GOFA model describes the four aspects
(Governance, Operations, Financing and Architecture) that need to be managed to deliver this
Service Offering:
9 See: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+Digital+Home
10 See: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+building+blocks
11 See: https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/CEF+Definitions
Figure 4 – The CEF GOFA model
Preparations are currently ongoing for the next CEF funding Programme, in particular the
prioritisation of the candidate CEF Building Blocks by the members of the CEF Expert Group
based on their national needs. As a result, some candidates will become part of the CEF
Programme, requiring strong governance of the delivery of those candidates among Member States.
The future “Big Data Test Infrastructure” is in line with the above-mentioned CEF Regulation and
principles and the future infrastructure will be designed around the four aspects of the GOFA
model.
In order to participate in this programme, DG DIGIT.D1 prepared, with KPMG support, a list of
deliverables describing the Big Data Test Infrastructure (i.e. a Maturity form, Synopsis and
Narratives document), and the candidate attracted strong interest from several Member States
in the prioritisation process (carried out by the MS representatives of the so-called CEF
Expert Group), officially entering the short-list of candidate building blocks.
It has been agreed with DG CNECT and the Context Broker and European Data Portal DSIs that BDTI
could become part of the CEF 2018 Work Programme; in the case of a positive outcome, the
implementation of the Big Data Test Infrastructure will be jointly coordinated under the new CEF "Data" DSI.
Currently, the 2018 Work Programme is under validation by the CEF Member States and the
candidate building block has already been presented to the CEF Expert Group during a webinar
session.
3. METHODOLOGY FOLLOWED
This section provides a description of the methodological approach applied in order to guarantee
achievement of the objectives of Task 3.
Figure 5 – Methodological approach followed under Task 3
Step 3.1 – Identification of the User Stories
In order to give the reader a clear understanding of how the Big Data
Test Infrastructure can be used and how it can satisfy specific needs
that users may encounter in a Big Data context, a list of examples has
been created in this step. These examples have been translated into
User Stories through the following activities:
1. Identification of the User Personas – based on the information
provided by ISA2 and on KPMG expertise, a list of the users most
likely to use the Big Data Test Infrastructure has been
identified. These users have been defined as Targeted Personas.
2. Identification of the User Stories – starting from the User
Personas identified in the first activity, the User Stories have been
created considering specific problems/needs which users may
encounter in a Big Data context.
Further details regarding the User Stories can be found in Chapter 4.
Step 3.2 – Design of the target architecture
The objective of this step has been to provide the target architecture of
the Big Data Test Infrastructure. The first task was to identify the
architecture principles, taking into account the new European
Interoperability Framework document, KPMG expertise and the business
requirements collected during Task 1. The architecture principles have
been used as the high-level driver for the design of the Big Data Test
Infrastructure. Subsequently, the activities performed for the creation
of the target architecture can be grouped into three tasks:
1. Identification of the Business Services – starting from the User
Stories and the architecture principles identified, a list of
business services has been defined. Each business service has
been linked with the User Stories.
2. Design of the Application-Data-Technology architecture – in
this task an overall perspective of the Application-Data-Technology
architecture has been developed in order to provide the business
services identified in the previous task. This architecture contains
an Analytics building block that has been broken down into
architecture building blocks in order to map the technical
requirements collected during Task 1.
3. Design of the Solution architecture – in this task, an example of
a solution for the implementation of the relevant building blocks
has been provided for each use case in scope. Good practices/pilots,
KPMG experience and available studies on Big Data software
(e.g. Gartner) have been taken into account in order to identify the solutions.
Step 3.3 – Design of the Governance and Operational model
The goal of this step has been the creation of the governance and
operational model of the Big Data Test Infrastructure, covering the
aspects of control and management of the platform. This step provides
a set of guidelines for the future implementation of the engine; to
classify the main aspects in a simple way, the GOFA model used by CEF
was taken into account.
Step 3.4 – Report on final results
The final step of Task 3 has been the delivery of the report "D03.1
Design of the Big Data Test Infrastructure", based on the consolidated
results of Task 3:
a list of User Stories describing specific purposes supported
by the services provided by the Big Data Test Infrastructure;
a list of relevant architecture principles that have been used as a
driver for the design of the architecture of the future Big Data
Test Infrastructure;
the Business architecture, composed of a series of business
services to be supported by the future Big Data Test
Infrastructure;
the Application-Data-Technology architecture and the Solution
architecture, providing all the needed building blocks and an
example of technology solutions for each of them;
an illustrative implementation of a User Story, providing a
complete end-to-end example;
the Governance and Operational model, identifying the
elements of the overall architecture that can be implemented in
different scenarios.
Table 1 - Methodology followed for Task 3 of the study
As described above, the methodological approach adopted for Task 3 of the study is focused on the
design of the target architecture of the Big Data Test Infrastructure. In order to provide a clear
picture of the overall methodological approach defined for the complete study, the following figure
highlights the principal interconnections between the three Tasks of the study.
Figure 6 – Adopted Methodology for the overall study
As highlighted in Figure 6 above, all the information collected during Task 1 has been fundamental
in identifying the User Stories, which have been used as guidelines for the creation of the so-called
“service offering” of the business architecture, while the architecture principles have been used as
the high-level driver for the identification of the Target Architecture. Finally, all the needed building
blocks of the Application-Data-Technology architecture have been identified to support the
implementation of the target architecture of the Big Data Test Infrastructure.
4. USER STORIES
This chapter describes how business needs collected during Task 1 have been translated into experimental scenarios – User Stories – that users
may encounter in a Big Data context. Each of them highlights a specific purpose supported by a set of services (described in Chapter 5.2)
provided by the future Big Data Test Infrastructure. The figure below gives a brief description of potential users – Targeted Personas – around
whom the User Stories have been developed, also taking into account the documentation provided by the ISA2 unit.
Figure 7 – Targeted personas in scope for the User Stories
The following attributes have been used for each User Story:
Name – the name of the User Story;
Targeted Persona – a fictional user who has expressed a business need and who could potentially be supported by the services
provided by the Big Data Test Infrastructure;
Problem statement – a description of the needs expressed by the Targeted Persona;
Solution – a description of the solution that Targeted Personas could adopt to solve their Big Data problems;
Benefits – a list of benefits that users gain by utilising a Big Data solution;
Example – an example of the User Story’s problem.
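The attribute list above amounts to a simple record schema. As a minimal sketch (Python is assumed purely as an illustration language; the report does not prescribe any format), the first User Story of Table 2 could be captured as:

```python
# Field names follow the report's attribute list; the values are
# shortened from the first User Story in Table 2.
user_story = {
    "name": "Learning from other European Public Administrations",
    "targeted_persona": "IT Director",
    "problem_statement": ("Obtain Big Data knowledge and methodologies "
                          "from other MSs to avoid replicating efforts"),
    "solution": ("A Big Data Community for finding and sharing use cases, "
                 "methodologies, strategies, artefacts and outcomes"),
    "benefits": [
        "Obtaining/sharing knowledge and pilot outcomes",
        "Increasing know-how among PAs at European, national and local level",
    ],
    "example": ("Ingrid wants to discover ideas adopted by other MSs for "
                "her real-time traffic-congestion pilot"),
}

# A story is well-formed only if every attribute from the list is present.
REQUIRED = {"name", "targeted_persona", "problem_statement",
            "solution", "benefits", "example"}
assert REQUIRED <= user_story.keys()
```

Representing stories as uniform records of this kind is what allows them to be linked, one by one, to the business services of Chapter 5.2.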
IT Director – Ingrid
Age: 44 | Work: Local Government | Location: City of Hamburg, Germany
Description: Ingrid has a background in IT but she moved into managing other areas of
government several years ago. Now back in IT, she puts a lot of effort into being
well-informed through industry publications and conferences. She is also responsible for
several services being digitalised across the city and she spends a lot of time discussing
details with technicians to understand their choices, because she is accountable for them.
IT Practitioner – Marko
Age: 46 | Work: Central Government | Location: Slovenia
Description: Marko works in the Digital Transformation Team and, as a practitioner, he
feels remote from politics and strategy. He is not a manager; he used to do the IT for a
hospital and for a small business, so he knows the private as well as the public sector.
Name: Learning from other European Public Administrations
Targeted Persona: IT Director
Problem statement: As an IT Director, I want to obtain knowledge and methodologies in the Big Data field of expertise from other MSs in order to reduce the risk of replicating their efforts by implementing similar use cases. Unfortunately, I do not have a dedicated platform where all MSs share their Big Data pilots' outcomes and methodologies.
Solution: The IT Director can be supported by a Big Data Community where he can find use cases implemented by other Public Administrations or share Big Data methodologies, strategies, artefacts and outcomes.
Benefits: Obtaining/sharing knowledge and pilot outcomes (algorithms, models, etc.) in the Big Data field of expertise; increasing the know-how among Public Administrations at European, National and Local level.
Example: Ingrid, as an IT Director, is responsible for several Big Data pilots and she wants to discover new ideas adopted by other MSs in order to implement her Big Data pilot for real-time traffic congestion.

Name: Test off-the-shelf analytical tools
Targeted Persona: IT Practitioner
Problem statement: As an IT Practitioner, I want to use Big Data and analytics technologies in order to handle a large volume of data, but I have neither analytics skills nor Big Data analytical tools.
Solution: The IT Practitioner can be supported with off-the-shelf Big Data analytical tools that can be downloaded from a catalogue for implementing Big Data solutions. The IT Practitioner can also be supported by a specialised team in order to implement analytical functionalities.
Benefits: Hiding technical complexity in order to have easy-to-use analytical functionalities, optimising experimentation costs; extracting benefits from the application of analytical functionalities in a Big Data/Analytics context; finding new contexts in which to apply Analytics.
Example: Marko, as an IT Practitioner working in a hospital, has to deal with huge volumes of unstructured documents and is currently struggling to extract information without a labour-intensive manual approach. Big Data tools, text-mining algorithms and analytics skills can help to tackle this kind of problem, but he does not have a Big Data analytical tool.

Name: Experimenting with the Big Data platform
Targeted Persona: IT Practitioner
Problem statement: As an IT Practitioner, I want to experiment with Big Data/Analytics technologies in order to evaluate the Big Data added value and the differences with regard to traditional systems, but I do not have a Big Data environment with all the software components.
Solution: The IT Practitioner can be supported with a ready-to-use Big Data platform, respecting privacy policies and using open source tools. The IT Practitioner can initialise the Big Data platform following a structured process from a marketplace.
Benefits: Fostering the adoption of Big Data technologies and the acquisition of Analytics skills; understanding the added value of Big Data and Analytics tools in the public sector.
Example: Marko, as an IT Practitioner, needs an analytical "sandbox" environment to experiment with new (open) technologies and to quickly test new Big Data use cases before deploying them into the production environment.

Name: Big Data pilot implementation
Targeted Persona: Policymaker
Problem statement: As a Policymaker, I want to start a Big Data pilot in order to support the policy-making process and to evaluate and share the impacts and advantages of the implementation of a Big Data solution. Unfortunately, I am not working in a Big Data environment and do not have the technical skills to apply Big Data methodologies.
Solution: The Policymaker can be supported by a dedicated specialised team to implement and execute the Big Data pilot using a ready-to-use Big Data platform, employing pre-built analytical functionalities to save costs and time. The Policymaker can also share the pilot's results (impacts, advantages, etc.) using a Big Data Community.
Benefits: Understanding the added value and the impact of Big Data and Analytics technologies in the policy-making process; understanding which technologies and which data sources to use for the execution of a Big Data pilot.
Example: Laura, as a Policymaker, has to analyse the impact of a new tax policy introduced six months previously. Her colleagues advised her to analyse the general feeling about this change with Big Data technologies using social media datasets.

Name: Integrating Open Datasets
Targeted Persona: Data Scientist
Problem statement: As a Data Scientist, I want to integrate internal data with external open datasets in order to improve the effectiveness of the analysis. Unfortunately, I do not have a structured process and the tools to collect and integrate open data shared by other PAs.
Solution: The Data Scientist can be supported with a repository shared with other MSs which enables the gathering and usage of the desired datasets, which can be easily downloaded from a catalogue.
Benefits: Encouraging the sharing of open data among different PAs working in different policy domains; finding solutions to problems where the correlation with the analysed policy domains was originally hidden and/or unclear.
Example: David, as a Data Scientist, wants to pinpoint correlations between the internal data of the Local Government and open data in the policy domain "Energy" in order to enrich his statistical model for the improvement of energy efficiency in central public administration buildings.

Table 2 - "User Stories" in scope for the Big Data Test Infrastructure
Each of these User Stories has been analysed in order to be supported by the future Big Data Test Infrastructure.
5. TARGET ARCHITECTURE
This chapter provides information on the target architecture for the Big Data Test Infrastructure.
Specifically, it describes the architecture principles and why they are relevant to the design of the
Big Data Platform (5.1); subsequently, the chapter illustrates a set of business services contained in
the Big Data Test Infrastructure (service offering) linked with the user stories identified in Chapter 4
(5.2), and provides an overall view of all the building blocks needed to deliver the identified
business services (5.3). Finally, the chapter furnishes an example of a solution for the
implementation of the building blocks for each Big Data use case in scope, as resulting from
the prioritisation process undertaken during Task 1 of the initiative (5.4); that study was
the principal source for determining the contents of this section and designing the final
target architecture.
5.1. ARCHITECTURE PRINCIPLES
The architecture principles are guidelines that have been used as a high-level driver for the design
of the architecture for the future Big Data Test Infrastructure. This section presents the most
relevant principles, taking into account:
the underlying principles of European public services outlined in the new European
Interoperability Framework12 document;
KPMG expertise;
business requirements collected during Task 1 of the initiative;
Openness – This principle refers to data, specifications and software. Openness is an important selection criterion: it is preferable to have software products that are provided as open source, with an active and solid community, documentation and a high level of maturity.
Reusability – This relates to the ability to use solutions or components adopted by others; it mainly refers to analytical models, APIs, data connectors, training material, etc. that can be reused beyond the domain for which they were originally developed. It is preferable to have solutions that represent an opportunity to reduce costs and increase the velocity of the experimentation process.
Test environment – This principle relates to the nature of the infrastructure. It is preferable to have an environment for experimentation which gives users a ready-to-use platform; production-environment functionalities (e.g. back-up, high availability, logging, etc.) are therefore not relevant for the solution.
Flexibility – This is a key aspect of the infrastructure in order to support different Big Data use cases. It is preferable to have modular solutions (composed of customisable building blocks) able to address the various business requirements arising from EU Institutions and/or PAs.
Scalability – This principle refers to the capability of the environment to scale up, both in terms of services and of technical specification, in order to implement pilots and handle data. It is preferable to design an infrastructure that is scalable in terms of volume, dimension and performance of services and pilots, based on the resources and budget available.
Security and Privacy – This principle refers to the capability to comply with Security and Privacy policies at European and National level, so it is preferable to have software products and solutions in line with these policies. This is the most important principle to take into account in order to allow public administrations to handle citizens' information while avoiding unauthorised access and disclosure.
12 See: https://ec.europa.eu/isa2/sites/isa/files/eif_brochure_final.pdf
Table 3 - Architecture principles
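As an illustration of how the principles in Table 3 could be operationalised, the following hypothetical sketch screens a candidate software product against the six principles. The boolean scoring scheme and the sample candidate are assumptions for the example, not part of the report:

```python
# The six architecture principles of Table 3, as machine-readable keys.
PRINCIPLES = ["openness", "reusability", "test_environment",
              "flexibility", "scalability", "security_and_privacy"]

def screen(product: dict) -> bool:
    """A candidate passes only if it satisfies every principle.

    `product` maps each principle key to True/False; a missing key
    counts as not satisfied.
    """
    return all(product.get(p, False) for p in PRINCIPLES)

# A candidate satisfying everything passes the screen.
candidate = {p: True for p in PRINCIPLES}
assert screen(candidate)

# Security and Privacy is described as the most important principle,
# so failing it alone is enough to reject the candidate.
candidate["security_and_privacy"] = False
assert not screen(candidate)
```

A real assessment would of course weigh principles rather than treat them as pass/fail, but the sketch shows how the table can act as an explicit checklist during tool selection.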
5.2. BUSINESS ARCHITECTURE
This paragraph illustrates the list of services composing the business architecture that provide the
means to satisfy the business requirements. These services are provided to Member States in order
to experiment Big Data solutions and support all the User Stories described in Chapter 4. For design
of the target architecture of the future Big Data Test Infrastructure, the ArchiMate language has
been used (using the Archi tool). Figure 8 below shows the motivation layer, which was identified
using the user stories previously described, and the business layer which contains the following
business services:
PaaS for implementing the Big Data use case;
Data Catalogue and Data Exchange APIs;
Big Data and Analytics software catalogue;
Analytics as a Service;
Community Building and Innovation Portal;
Support for Analytics implementation;
Advisory.
Figure 8 – Business view of the Big Data Test Infrastructure and related user stories
Below is a brief description of the relationships among the entities of the Motivation layer
(user stories) and Business layer (business services).
Narrative: The [Data Scientist] is associated with the User Story [Integrating open datasets], which
is realised by the [Data Catalogue and Data Exchange APIs]; the [IT Practitioner] is associated with
the User Story [Experimenting with the Big Data platform], which is realised by [PaaS for
implementing Big Data use cases]. The IT Practitioner is also associated with [Test off-the-shelf
analytical tools], which is realised by [Analytics as a Service] and [Support for Analytics
implementation] business services. The [Policymaker] is associated with the User Story [Big Data
pilot implementation], which is realised by the business services [PaaS for implementing Big Data
use cases], [Analytics as a Service], [Community Building and Innovation Portal], [Support for
Analytics implementation] and [Advisory]. Finally, the [IT Director] is associated with the User Story
[Learning from other European Public Administrations], which is realised by the [Community
Building and Innovation Portal].
A list of the identified business services is provided below. For each business service, a detailed
factsheet has been drafted containing the following information:
Name – name of the business service;
Description – a description of the business service;
Prerequisites – a set of prerequisites for implementation of the business service;
Outcome – a brief description of the business service’s outcome;
Linked User Stories – a list of the User Stories which the business service satisfies.
Business Service Description
PaaS for
implementing Big Data Use Case
This service aims at providing the Big Data platform and all the tools supplied by the European Commission. Platform initialisation will be a wizard-driven process in which users choose specific platform templates from a marketplace. Should new needs arise, the platform will be enriched with new functionalities. Once the platform is established, its resource configuration and list of tools will be kept in a catalogue for future reuse.
Prerequisites Outcome
the volume of data to be stored and elaborated in order to perform the sizing of the platform;
privacy policies linked to the geographic location of the servers;
the nature of data.
A Big Data Infrastructure with a ready-to-use software environment.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Experimenting with the Big Data platform since the objective is to have a ready-to-use Big Data infrastructure in order to test Big Data technologies and acquire know-how and skills in the Big Data field of expertise.
Big Data Pilot implementation since the objective is the availability of a Big Data environment to initiate a Big Data pilot.
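The sizing prerequisite above (the volume of data to be stored and elaborated) lends itself to a simple illustration. Below is a minimal sketch of how a wizard-driven initialisation might translate a declared data volume into a number of worker nodes; the function, the replication factor and the per-node capacity are illustrative assumptions, not part of the BDTI design:

```python
import math

def size_platform(volume_tb, replication=3, node_capacity_tb=8):
    """Estimate the number of worker nodes for a given raw data volume.

    Hypothetical heuristic: the raw volume is multiplied by an HDFS-style
    replication factor, then divided by the usable storage per node.
    """
    required_tb = volume_tb * replication
    return max(1, math.ceil(required_tb / node_capacity_tb))

# e.g. 10 TB of raw data, 3x replication, 8 TB usable per node -> 4 nodes
nodes = size_platform(10)
```

A real sizing exercise would also weigh the nature of the data and the privacy constraints on server location listed among the prerequisites.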
Business Service Description
Data Catalogue and Data Exchange APIs
This service aims at providing data and/or data-exchange APIs that enable users to gather the desired data and correlate it with their own. It provides a catalogue of data sources from which users can retrieve links to data if already available (e.g. coming from previous pilots), or access sample datasets (classified by policy domain) which the European Commission makes available in a centralised repository. The catalogue of data sources could be enriched with both datasets and connectors that users develop during the implementation of their pilots, for future reuse.
Prerequisites Outcome
data privacy policies (e.g. in order to secure sensitive data);
A catalogue to easily access desired data from available data sources.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Integrating Open Datasets since the objective is to integrate internal data with external open datasets provided by the European Commission in a shared repository.
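The catalogue behaviour described above (looking up data sources by policy domain and enriching the catalogue with pilot outputs for future reuse) can be sketched with a toy in-memory structure; the policy domains, dataset names and URLs below are purely illustrative:

```python
# Toy data-source catalogue keyed by policy domain (illustrative entries only).
CATALOGUE = {
    "Energy": [{"name": "eu_energy_sample", "url": "https://example.org/energy.csv"}],
    "Health": [{"name": "health_indicators", "url": "https://example.org/health.csv"}],
}

def find_datasets(policy_domain):
    """Return the catalogued dataset entries for a policy domain, if any."""
    return CATALOGUE.get(policy_domain, [])

def register_dataset(policy_domain, name, url):
    """Enrich the catalogue with a dataset produced by a pilot, for future reuse."""
    CATALOGUE.setdefault(policy_domain, []).append({"name": name, "url": url})
```

In the actual service the catalogue would of course sit behind data-exchange APIs rather than in memory, with the privacy policies listed among the prerequisites enforced on access.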
Business Service Description
Analytics as a Service
It aims at providing a list of analytical functionalities which enable any EU Institution/Public Administration to quickly access a series of customisable pre-built elaborations. The following non-exhaustive list represents a set of analytical services that could be available:
extraction of information from documents (text mining);
time-series forecasting;
geo information normalisation;
population / customer segmentation.
Prerequisites Outcome
the type of data that users want to handle;
the type of analysis that users want to create.
A list of analytics to quickly access functionalities related to the elaboration of data.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Test off-the-shelf analytical tools since the goal is to use built-in analytical functionalities.
Big Data Pilot implementation to use analytical functionalities in order to analyse the general sentiment about a new tax policy.
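As an illustration of one such customisable pre-built elaboration, the sketch below implements population/customer segmentation as simple rank-based binning; it is a toy stand-in for an actual Analytics as a Service function, and the data are illustrative:

```python
def segment(values, n_segments=3):
    """Assign each value a segment label 0..n_segments-1 by rank.

    Values are ranked ascending and the ranks are split into equally
    sized bins, a crude form of quantile-based segmentation.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(rank * n_segments // len(values), n_segments - 1)
    return labels

incomes = [12, 45, 23, 78, 51, 19]
segments = segment(incomes)  # three segments: low / mid / high earners
```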
Business Service Description
Community Building
and Innovation Portal
This service aims at building a Big Data community where users can share knowledge and Big Data artefacts (e.g. methodologies, statistical models, pilots' outcomes and datasets). It also provides an innovation portal where users can contribute their own ideas in order to launch new propositions. The innovation portal will have a recommendation engine for aggregating contributions aligned with the user's search.
Prerequisites Outcome
Sharing mind-set. A Big Data community and an innovation portal for users’ ideas with a recommendation engine.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
IT Director
Learning from other European Public Administrations since the goal is to obtain knowledge and methodologies regarding the Big Data field of expertise from other MSs.
Big Data Pilot implementation since the Policymaker wants to share the impacts and advantages of implementation of a Big Data solution.
Business Service Description
Big Data and Analytics
software catalogue
This service aims at providing a catalogue of Analytics artefacts or software tools that users will be able to download for implementing Big Data solutions. Part of this catalogue will be a special software stack (like a Sandbox) usable in the preferred user environment (on premise or cloud): it is a preconfigured environment that contains services, sample data and interactive tutorials useful for testing Big Data technologies.
Prerequisites Outcome
System requirements of the software solution. A catalogue of Big Data and analytics software that users will be able to download and use.
Linked Targeted Personas Linked User Stories
Data Scientist
IT Practitioner
Policymaker
Test off-the-shelf analytical tools since the IT Practitioner could use software tools contained in the Big Data and Analytics software catalogue.
Business Service Description
Support for Analytics
implementation
This service aims at providing technical support for the implementation of a pilot through the provisioning of a specialised team (e.g. Business Analyst, Data Scientist or Data Engineer) which helps Public Administrations to implement Big Data pilots. For example, support could be provided in terms of:
sizing of the necessary infrastructure resources;
choice of the involved technologies in the pilot;
execution of the pilot;
sharing and presentation of the pilot’s outcome. Any outcome of the Big Data pilot could feed the community portal in order to be shared with the community.
Prerequisites Outcome
Activation of one or more of the aforementioned services.
Support for the whole life-cycle of a pilot.
Linked Targeted Personas Linked User Stories
IT Practitioner
Policymaker
Test off-the-shelf analytical tools since the IT Practitioner does not know which software tools should be used.
Big Data Pilot implementation since PAs may not have the availability of technicians with Big Data skills (Data Scientists and Data Engineers) to execute a Big Data pilot.
Business Service Description
Advisory
This service aims at providing advisory support for assistance in activities related to the Big Data Test Infrastructure. It covers a series of activities such as:
pilot scoping;
business case definition;
evaluation of risks related to the implementation of a Big Data solution;
data sources identification and integration;
services identification and integration.
Prerequisites Outcome
Activation of one or more of the aforementioned services
Advisory support.
Linked Targeted Personas Linked User Stories
IT Practitioner
Policymaker
IT Director
Big Data Pilot implementation since the user wants to evaluate the impacts and advantages of the implementation of a Big Data solution.
All the User Stories could be linked with this service.
Assuming that the implementation of the Big Data Test Infrastructure would be part of the CEF
2018 work programme (see Chapter 2.3), each business service and targeted persona can be
mapped onto the template of the CEF Service Offering, as follows:
Figure 9 – Business services mapped in the template of the CEF service offering
5.3. APPLICATION-DATA-TECHNOLOGY ARCHITECTURE
This paragraph presents the Application-Data-Technology architecture required in order to identify
all the needed building blocks that enable the business services previously described. The TOGAF
framework has been taken into account to design the Application-Data-Technology architecture;
TOGAF (The Open Group Architecture Framework) is a framework that provides an approach for
designing, planning, implementing and governing an enterprise information technology
architecture. The Application-Data-Technology architecture has been designed using the Archi tool,
starting from the framework provided in Task 1; in this case, the logical areas identified in Task 1
have been considered as application functions, which include building blocks
(application components).
Figure 10 below provides a preview of the logical architecture which will be further detailed in the
following paragraphs using the TOGAF framework.
Figure 10 – Preview of the logical architecture of the Big Data Test Infrastructure
5.3.1. HIGH-LEVEL ARCHITECTURE VIEW
In order to provide an overall view of the Application-Data-Technology architecture for the Big Data
Test Infrastructure, the TOGAF standard has been considered as guideline. The Application-Data-
Technology architecture provided is composed of two architecture domains:
The Information System Architecture – which covers the development of Data and
Application architecture. In this domain, the main goal is to develop target Data and
Application architectures that enable the business architecture.
The Technology Architecture – this domain covers the hardware configurations,
the infrastructure services that enable the Application and Data architecture, the protocols
and networks that connect applications.
Figure 11 below provides a high-level architecture view of the Application-Data-Technology layers
of the target architecture and the business layer previously presented.
Figure 11 – High-level architecture view of the Big Data Test Infrastructure
A brief description of the relationships among the entities is provided below, extending the
narrative previously given for the business architecture.
Narrative: The business services are realised by the application services [Big Data platform] (which
is assigned to the interface [Interoperability]) and [Community]. The [Big Data platform] is realised
by the functions of [Governance and Security] (composed of the [Software/Data Catalogue], the
[Privacy and Security Policy] and [Infrastructure Monitoring Management]),
[Data Ingestion/Storage] (composed of [Batch Ingestion], [Real Time Ingestion] and the [Distributed
File System]), [Data Elaboration] (composed of [Data Transformation] and [Analytics]),
[Data Consumption] (composed of [Data Discovery and Exploration] and [Data Visualisation]) and
[API Management], which accesses [Data Sources]. The whole application layer is served by the
[Infrastructure] technology service, which also realises the [Community] application service.
[Infrastructure] is realised by the functions of [Storage] and [Distribution], which are assigned to
one or more [Servers].
Starting from the analysis of datasets and APIs carried out during Task 2,13 five data sources have
been identified (National Data Portal, European Data Portal, EU Open Data Portal, Social Media
and Third-Party Data provider).
Table 4 below illustrates, for each building block, the following attributes:
Reference architecture – an image of the Application-Data-Technology architecture where
the application function which contains the building block has been highlighted;
Name – the name of the building block;
Description – a brief description of the building block;
TOGAF component – the name of the component of the TOGAF framework (Technology,
Data or Application Architecture).
Reference architecture
Name Description TOGAF
component
Infrastructure
Servers
It refers to hardware components of the Big Data Test Infrastructure. It includes the choice of RAM and virtual cores of the infrastructure, the choice of the physical location of each node of the Big Data Platform and the choice of operating system.
Technology architecture
13 See the document “D01.01 Study on interoperable data ontologies and data exchange APIs”.
Storage
This building block is mainly related to the storage capacity of the infrastructure; it also takes into account the performance, replication, reliability, encryption, synchronisation, back-up and restore of data.
Technology architecture
Network
It is related to the communication service of the Big Data Test Infrastructure to manage both traffic among the platform nodes and to the Internet, also taking into account security concerns.
Technology architecture
Distribution
Distribution refers to the ensemble of products, tools and projects that enable users to use Big Data applications. Apache Hadoop is the technology most commonly associated with Big Data applications; it enables the use of the preferred computation engine.
Technology architecture
Data Ingestion/Storage
Batch Ingestion
This building block refers to the capability of the platform to collect and load data into the Big Data platform on a regular time basis. Batch ingestion usually involves a large volume of data and is scheduled within a defined time window.
Application architecture
Real Time Ingestion
This building block is responsible for managing real time data streams. Real time ingestion allows for the analysis of data as soon as they are issued by the source. It may be implemented using a message queue or a memory channel.
Application architecture
Distributed File
System
This system allows clients to access and process data stored on the server as if it were on their own computer; when a user accesses a file on the server, the server sends the user a copy of the file. It organises file and directory services into a global directory in such a way that remote data access is not location-specific but is identical from any client. All files are accessible to all users of the global file system, and organisation is hierarchical and directory-based.
Application architecture
Data Elaboration
Data
Transformation
This building block refers to the process of data quality management. It includes, for example, tasks of data cleaning, data quality, data enrichment and data integration. It is typically performed via a mixture of manual and automated steps.
Application architecture
Analytics
Analytics refers to a series of customisable pre-built elaborations mainly related to the data mining and machine learning areas. It allows users to apply supervised and unsupervised algorithms, forecasting methodologies and any other analytical services.
Application architecture
Data Consumption
Data Discovery and Exploration
This building block gives business users the opportunity to search for specific information, write queries, compute statistics, visualise datasets and perform any other data exploration activity. It also allows users to quickly and simply view the most relevant features of their datasets through the use of descriptive statistics.
Application architecture
Data
Visualisation
Data visualisation addresses the classical Business Intelligence use cases represented by dashboards and reporting. It also provides the capability to monitor certain events in real time.
Application architecture
Governance & Security
Software/Data
Catalogue
This building block refers to a list of Big Data and Analytics software tools and artefacts. This list is divided into various sections, based on the category of software tools. It also contains some services and solutions funded by the EU (e.g. the DORIS service14) which users are able to integrate with the Big Data platform. Software/Data Catalogue also refers to a list of datasets related to different policy domains (e.g. Energy, Transportation, Health, etc.) which the European Commission makes available in a centralised repository.
Application architecture
Privacy and
Security Policy
This building block enables the governance of the creation, acquisition, integrity and use of data and information. It provides mechanisms to ensure that only authorised users can access and perform actions on IT resources.
Application architecture
Infrastructure
Monitoring Management
This building block provides insight into the status of physical, virtual, and cloud systems and helps ensure availability and performance. Infrastructure monitoring management covers the sub-categories of systems management, network management and storage management.
Application architecture
14http://www.doris-project.eu/
API
Management
API Management refers to the capability of the platform to interact with data sources. This building block also allows users to control and manage access and usage policies for APIs.
Application architecture
Community
This building block refers to a platform that allows users to share Big Data artefacts or methodologies among European Public Administrations and Public Institutions. The platform will contain a special section – Innovation Portal – with a recommendation engine (using Machine Learning techniques) where users can contribute with their own ideas in order to launch new propositions.
Application architecture
Interoperability
Interoperability provides users with the ability to work and interact with external systems and products.
Application architecture
Table 4 – Description of the building blocks for the Application-Data-Technology Architecture
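The difference between the Batch Ingestion and Real Time Ingestion building blocks described in Table 4 can be sketched in a few lines of code; here an in-memory queue stands in for the message queue a real-time channel might use, and the functions and records are illustrative only:

```python
import queue

def batch_ingest(records, store):
    """Batch ingestion: load a whole window of records in one scheduled run."""
    store.extend(records)

def real_time_ingest(q, store):
    """Real-time ingestion: drain events from a message queue as they arrive."""
    while not q.empty():
        store.append(q.get())

store = []                             # stands in for the distributed file system
batch_ingest(["row1", "row2"], store)  # nightly-style bulk load
q = queue.Queue()
q.put("event1")                        # event issued by a source
real_time_ingest(q, store)             # stream-style consumption
```

In the actual platform the "store" would be the Distributed File System and the queue a distributed messaging layer, but the contrast between scheduled bulk loads and per-event consumption is the same.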
As mentioned above, these building blocks enable the business services identified in the previous
paragraph (5.2); a business service can be covered by one or more architecture building blocks of
different logical areas of the Application-Data-Technology architecture.
The following matrix provides the mapping between the architecture building blocks and the
business services identified in the business architecture, except for the “Advisory” and “Support for
Analytics implementation” services.
Business services: (1) PaaS for implementing Big Data use case; (2) Data Catalogue and Data Exchange APIs; (3) Big Data and Analytics software catalogue; (4) Analytics as a Service; (5) Community Building and Innovation Portal.

Architecture Building Blocks | (1) | (2) | (3) | (4) | (5)
Infrastructure – Servers | ✓ | | | |
Infrastructure – Storage | ✓ | | | |
Infrastructure – Network | ✓ | | | |
Infrastructure – Distribution | ✓ | | | |
Data Ingestion/Storage – Batch Ingestion | ✓ | ✓ | | |
Data Ingestion/Storage – Real Time Ingestion | ✓ | ✓ | | |
Data Ingestion/Storage – Distributed File System | ✓ | | | |
Data Elaboration – Data Transformation | ✓ | | | ✓ |
Data Elaboration – Analytics | ✓ | | | ✓ |
Data Consumption – Data Discovery and Exploration | ✓ | | | ✓ |
Data Consumption – Data Visualisation | ✓ | | | ✓ |
Governance & Security – Software/Data Catalogue | | ✓ | ✓ | |
Governance & Security – Privacy and Security Policy | ✓ | ✓ | | |
Governance & Security – Infrastructure Monitoring Management | ✓ | | | |
API Management | | ✓ | | |
Community | | | | | ✓
Interoperability | ✓ | | | |
Table 5 – Mapping between architecture building blocks and business services
5.3.2. DETAILED VIEW OF THE ANALYTICS BUILDING BLOCK
This paragraph provides a detailed overview of the Analytics building block in order to help users to
better understand how technical requirements collected during Task 1 are satisfied. This building
block is the most important in the final architecture of the Big Data Test Infrastructure since it
enables the implementation of all the use cases in scope for Task 1. Figure 12 below shows a
drill-down of the Analytics building block, which is composed of lower level building blocks related
to the main Data Scientist’s activities.
Figure 12 – Drill-down of the Analytics building block
Table 6 below provides a detailed description of each lower level building block shown, containing
the following information:
Name – the name of the lower level building block;
Description – a brief description of the lower level building block;
Linked Technical Requirements – the ID and the short-name of the technical requirements
collected during Task 1.
Name Description Linked Technical
Requirements (ID – short-name)
Data Mining
This building block refers to the activity of discovering patterns in large quantities of data. These patterns can be seen as a kind of summary of the input data which can be used for further analysis (e.g. machine learning or predictive analytics) in order to obtain more accurate prediction results.
32 – Tool support
34 – Data processing
36 – Advanced Analytics
Statistical Modelling
Statistical Modelling includes statistical methods that allow users to manage and analyse structured datasets (cross-sectional or time series data), in order to describe or summarise features and provide a descriptive view of a phenomenon.
32 – Tool support
36 – Advanced Analytics
Machine Learning
This building block provides users with the tools needed to train, test and validate analytical models built using machine learning and statistical algorithms. With this module, users can perform classification, clustering or regression tasks.
32 – Tool support
36 – Advanced Analytics
Table 6 – Description of the architectural detailed building block
As shown in Table 6 above, a technical requirement can be covered by the union of multiple lower
level building blocks and a lower level building block can cover the needs of multiple technical
requirements.
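As an illustration of the train/test workflow the Machine Learning lower level building block supports, the sketch below trains and applies a tiny nearest-centroid classifier written from scratch; a real pilot would rely on a proper library, and all names and data here are illustrative:

```python
def train(points, labels):
    """Compute one centroid per class from labelled 1-D points."""
    centroids = {}
    for lbl in set(labels):
        members = [p for p, l in zip(points, labels) if l == lbl]
        centroids[lbl] = sum(members) / len(members)
    return centroids

def predict(centroids, point):
    """Classify a point by its nearest class centroid."""
    return min(centroids, key=lambda lbl: abs(centroids[lbl] - point))

# Train on a toy labelled dataset, then classify unseen points.
model = train([1.0, 1.2, 8.0, 8.4], ["low", "low", "high", "high"])
```

Validation would follow the same pattern: hold out part of the labelled data, predict on it and compare against the known labels.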
During Task 1, a prioritisation process of all the use cases was carried out in order to identify the
final priority for the Big Data use cases; Figure 13 below illustrates the results of the process.
Figure 13 – Final priority for the identified Big Data use cases
For each use case with high priority, the following list provides readers with a clear explanation of
how the Analytics’ lower level building blocks enable the implementation of the use cases in scope.
Predictive analysis: it consists of applying statistical techniques (e.g. predictive modelling,
machine learning) that analyse current and historical facts to make predictions about future
or unknown events. In order to implement this use case, users have to use Data Mining and
Machine Learning techniques to discover hidden patterns, extract useful information and
make predictions on data.
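A minimal sketch of predictive analysis in these terms: fitting a straight line to historical facts by ordinary least squares and projecting it one step ahead (the series and function names are illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Historical facts (illustrative) and a one-step-ahead prediction.
years = [2013, 2014, 2015, 2016]
values = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(years, values)
forecast_2017 = slope * 2017 + intercept
```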
Web analysis (scraping/monitoring): it consists of gathering information from websites,
involving data scraping (using bot or web-crawler) and data parsing to extract the
unorganised web data as well as converting data from APIs into a manageable format. In
order to implement this use case, users have to use Data Mining techniques to discover
patterns from the web using automated processes to extract data from servers and
web reports.
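The parsing step of web analysis can be sketched with the standard-library HTML parser; the example below extracts links from an already-downloaded page (a real crawler would first fetch pages over HTTP, and the markup here is illustrative):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of all anchor tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = '<html><body><a href="https://example.org/a">A</a><a href="/b">B</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
```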
Text analysis: it consists of using natural language processing to analyse unstructured text
data, deriving patterns and trends, possibly extracting the text content and evaluating and
interpreting the output data. In order to implement this use case, users have to use Text
Data Mining techniques which include text categorisation, text clustering and
document summarisation.
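A toy sketch of text categorisation, one of the Text Data Mining techniques mentioned above: each document is scored against per-category keyword sets and assigned the best-matching category (categories and keywords are illustrative):

```python
# Illustrative category keyword sets; a real pilot would learn these.
CATEGORIES = {
    "tax": {"tax", "vat", "income", "rate"},
    "health": {"hospital", "patient", "vaccine"},
}

def categorise(text):
    """Assign the category whose keywords overlap most with the text."""
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORIES.items()}
    return max(scores, key=scores.get)
```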
Descriptive analysis: this refers to the use of statistics to quantitatively describe or
summarise features of a collection of information. Such descriptions may be either
quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs. In order to
implement this use case, users have to use a Statistical Modelling approach which refers to
finding a statistical correlation among features or detecting outliers.
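A minimal sketch of descriptive analysis: summary statistics computed with the Python standard library, plus a simple z-score rule for detecting outliers (the threshold and data are illustrative):

```python
import statistics

def describe(values):
    """Summary statistics quantitatively describing a collection of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def outliers(values, z=2.0):
    """Flag values further than z standard deviations from the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs(v - m) > z * s]
```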
Time-Series analysis: this comprises methods for analysing time-series data to extract
meaningful statistics and other characteristics of data. In order to implement this use case,
users have to use Statistical Modelling techniques to identify trends or cyclic behaviours
and forecast future trends.
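As a sketch of trend identification in time-series analysis, the example below smooths a monthly series with a centred moving average (window size and data are illustrative):

```python
def moving_average(series, window=3):
    """Smooth a series with a simple moving average of the given window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly = [100, 102, 98, 104, 106, 103, 108]
trend = moving_average(monthly)  # rising trend emerges once noise is smoothed
```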
Social media analysis: it consists of gathering data from blogs and social media websites and
analysing data to make business decisions. In order to implement this use case, users have
to use Machine Learning algorithms, performing rules-based analysis or keywords-based
analysis to identify posts that fit into specific categories.
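A sketch of the keywords-based analysis mentioned above: each post is tagged with the categories whose keywords it mentions (the rules and posts are illustrative):

```python
# Illustrative rule sets mapping categories to trigger keywords.
RULES = {
    "complaint": {"delay", "broken", "refund"},
    "praise": {"great", "thanks", "excellent"},
}

def tag_post(post):
    """Return the sorted categories whose keywords appear in the post."""
    words = {w.strip(".,!?") for w in post.lower().split()}
    return sorted(cat for cat, kws in RULES.items() if words & kws)

posts = ["Great service, thanks!", "Still waiting for my refund"]
```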
Network analysis: this consists of investigating any structures through the use of network
and graph theories, characterising networked structures in terms of nodes and the ties,
edges, or links (relationships or interactions) that connect them. In order to implement this
use case, users have to use Machine Learning techniques, with a particular focus on neural
network and deep learning methods which allow the evaluation of the presence or absence
of a relationship among elements.
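The nodes-and-ties representation underlying network analysis can be sketched as an adjacency structure, with helpers to test for the presence of a relationship and rank nodes by degree (the graph data is illustrative; real pilots would use dedicated graph and machine learning tooling):

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency structure from (node, node) edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def connected(adj, a, b):
    """Is there a direct tie (edge) between nodes a and b?"""
    return b in adj[a]

def degree_ranking(adj):
    """Nodes ordered by number of ties, most connected first."""
    return sorted(adj, key=lambda n: len(adj[n]), reverse=True)

g = build_graph([("PA1", "PA2"), ("PA1", "PA3"), ("PA2", "PA3"), ("PA1", "PA4")])
```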
5.4. SOLUTION ARCHITECTURE
This section introduces the identified solution building blocks (i.e. the software/technology items
that could support the implementation of the architecture building blocks), which make up the
solution architecture. The solution building blocks were identified following high-level criteria and
the results emerging from Task 1. This section also illustrates how such solution building blocks will
support the implementation of the Big Data use cases in scope.
5.4.1. CRITERIA FOR THE IDENTIFICATION OF TOOLS/TECHNOLOGIES
The aspects that drive the identification of the software solutions for each architecture building
block of the target architecture are the following:
Architecture principles, the guidelines that have been used as a high-level driver for the
design of the architecture of the Big Data Test Infrastructure (see section 5.1);
Good practices, the recommendations and results coming from the most relevant projects
analysed in Task 1;
KPMG expertise, KPMG credentials/experience and Competence Centre in the
Data & Analytics area, including the KAVE platform and relevant projects.
5.4.2. IDENTIFIED SOLUTION BUILDING BLOCKS
This section introduces, for each architecture building block, a non-exhaustive list of software
solutions that have been identified according to the criteria presented in the previous section.
These software solutions can be refined in future analyses. During this activity, cloud solutions
have been taken into account in order to respect the architecture principles previously defined.
A cloud infrastructure is an abstraction layer that virtualises resources and presents them to
users through APIs; these virtualised resources are hosted by a service provider (such as Amazon
Web Services). This leads to greater flexibility, since users can reallocate infrastructure resources
at any time without having to find new physical servers that meet their needs. Below are
some examples which can help the reader to better understand how BDTI users can take advantage
of using a cloud infrastructure:
a Public Administration wants to start a new pilot using the Big Data Test Infrastructure; the
amount of initial data is low but a substantial increase is expected. In this case, a cloud
infrastructure makes it easy to scale up the disk space of its servers;
a Public Administration (which already uses the Big Data Test Infrastructure) wants to test a
more complex machine learning algorithm which requires more hardware resources. In this
case, a cloud infrastructure makes it easy to increase the RAM and CPU of its servers.
Figure 14 below illustrates the solution architecture, starting from the Application-Data-Technology
architecture previously seen and the criteria identified.
Figure 14 – Solution Architecture of the Big Data Test Infrastructure
Table 7 below provides the mapping between architecture building blocks and the identified
solution building blocks; there is no exclusive relationship between them: the requirements of an
architecture building block can be covered by multiple solution building blocks, and a solution
building block can cover the needs of multiple architecture building blocks. The notes provide a
description of how each solution building block serves the architecture building block.
| Architecture Building Block | Solution Building Block | Notes |
| --- | --- | --- |
| Servers | Amazon Web Services EC2 | Server instance provider |
| Storage | Amazon Web Services S3 | Storage provider |
| Network | Amazon Web Services EC2 | Network configuration |
| Distribution | Hortonworks Data Platform | Open source platform |
| Batch Ingestion | Flume | Distributed data flow |
| Real Time Ingestion | Storm | Distributed real-time data flow |
| Distributed File System | HDFS | Distributed storage |
| Data Transformation | RHadoop | Preliminary analysis in R |
| Data Transformation | SparkR | Preliminary analysis in Spark |
| Analytics | Storm | Real-time analytics |
| Analytics | HDFS | Data Lab environment |
| Analytics | RHadoop | R analysis in the Hadoop environment |
| Analytics | SparkR | R analysis in Spark |
| Data Discovery and Exploration | Solr | Indexing/search platform |
| Data Discovery and Exploration | Zeppelin | Web-based notebook for data discovery |
| Data Visualisation | Zeppelin | Web-based notebook for interactive data visualisation |
| Software/Data Catalogue | AWS Service Catalog | Catalogue of IT services |
| Software/Data Catalogue | Ambari | Services management |
| Privacy and Security Policy | Sentry | User access management |
| Infrastructure Monitoring Management | Ambari | Alerts management |
| API Management | WSO2 API Manager | Platform for APIs |
| Community | Confluence | Team collaboration tool |
| Interoperability | Knox | Application gateway for REST and HTTP interactions |

Table 7 – Mapping between architecture and solution building blocks
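The many-to-many relationship in Table 7 can also be represented programmatically, for example when building the Software/Data Catalogue. The sketch below is purely illustrative (it encodes only a subset of Table 7) and the function names are hypothetical:

```python
# Illustrative sketch of the many-to-many mapping between architecture
# building blocks (ABBs) and solution building blocks (SBBs), from Table 7.
ABB_TO_SBB = {
    "Batch Ingestion": ["Flume"],
    "Real Time Ingestion": ["Storm"],
    "Analytics": ["Storm", "HDFS", "RHadoop", "SparkR"],
    "Data Visualisation": ["Zeppelin"],
    "Data Discovery and Exploration": ["Solr", "Zeppelin"],
}

def solutions_for(abb):
    """SBBs covering a given architecture building block."""
    return ABB_TO_SBB.get(abb, [])

def blocks_covered_by(sbb):
    """ABBs whose requirements a given SBB helps to cover."""
    return [abb for abb, sbbs in ABB_TO_SBB.items() if sbb in sbbs]

print(blocks_covered_by("Storm"))  # ['Real Time Ingestion', 'Analytics']
```

The lookup in both directions reflects the non-exclusive relationship described above: Storm, for instance, serves both real-time ingestion and analytics.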
The following table summarises the solution building blocks, describing each of them in terms of:
- Name – the name of the identified solution building block;
- Description – a brief description of the solution;
- License – the license type of the solution;
- Support – the support type offered;
- Community – community size and documentation provided;
- Maturity – the maturity level.
| Solution Building Block | Description | License | Support | Community | Maturity |
| --- | --- | --- | --- | --- | --- |
| Amazon Web Services EC2 | Amazon Elastic Compute Cloud [15] (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. | Proprietary software | ✓ Three levels of support | Millions of results | Since 2006 |
| Amazon Web Services S3 | Amazon Simple Storage Service [16] (Amazon S3) is object storage built to store and retrieve any amount of data from anywhere. It can be used for media storage or as a data lake for Big Data analytics. | Proprietary software | ✓ Three levels of support | Millions of results | Since 2006 |
| Hortonworks Data Platform | HDP [17] is an open source Apache Hadoop distribution based on a centralised architecture. HDP provides a complete big data ecosystem composed of the most common tools; it can also be easily extended by developing new software or adding third-party software. | Apache 2.0 | ✓ Three levels of support | Many thousands of results | Since 2011; Version 2.6 |
| Flume | Flume [18] is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, and uses a simple, extensible data model that allows for online analytic applications. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2012; Version 1.7.0 |
| Storm | Storm [19] is a free and open source distributed real-time computation system that allows users to process large volumes of high-velocity data. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2011; Version 1.1.1 |
| HDFS | The Hadoop Distributed File System [20] (HDFS) is the core of the Hadoop ecosystem; it is the layer that lets users manage data as if it were stored on a single node, instead of being divided and replicated across multiple nodes. It manages replication automatically, ensuring fault tolerance and scalability. | Apache 2.0 | ✓ Different support among platform providers | Many thousands of results | Since 2009; Version 2.8.1 |
| RHadoop | RHadoop [21] is a collection of five R packages that allow users to manage and analyse data within the Hadoop environment; it lets users work with HDFS and other elements of the ecosystem and exploit the MapReduce framework. | GNU GPL | N/A | Thousands of results | Since 2011; different versions of the R packages |
| SparkR | SparkR [22] is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR provides a distributed data frame implementation that supports operations like selection, filtering and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2011; Version 2.2.0 |
| Solr | Solr [23] is a popular open source search platform that provides distributed indexing. It has enterprise production features such as high reliability, scalability and fault tolerance, replication and load-balanced querying, automated failover and recovery, centralised configuration and more. It supports batch, real-time and on-demand indexing of data. | Apache 2.0 | ✓ Project mailing lists and an IRC channel | Millions of results | Since 2004; Version 6.6.0 |
| Zeppelin | Zeppelin [24] is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Python and more. It offers multiple interpreters (connectors) for different data sources. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2015; Version 0.7.2 |
| Amazon Web Services Service Catalog | AWS Service Catalog [25] allows users to manage a catalogue of IT services which include, for example, virtual machine images, servers, software and databases. | Proprietary software | ✓ Different levels of support | Many thousands of results | Since 2015 |
| Ambari | Ambari [26] is an open source tool for provisioning, managing and monitoring clusters. Ambari provides an intuitive, easy-to-use platform management web UI backed by its RESTful APIs. | Apache 2.0 | ✓ Project mailing lists, an IRC channel and different support among platform providers | Many thousands of results | Since 2011; Version 2.5.1 |
| Sentry | Sentry [27] provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Big Data cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala and HDFS (limited to Hive table data). | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2012; Version 1.8.0 |
| WSO2 API Manager | WSO2 API Manager [28] supports API publishing, lifecycle management, application development, access control, rate limiting and analytics in one cleanly integrated system. It includes a store that allows users to discover APIs, get the related documentation and also try the APIs out. | Apache 2.0 | ✓ Different levels of support | Many thousands of results | Since 2012; Version 2.1.0 |
| Confluence | Confluence [29] is content collaboration software that allows users to create, share and collaborate on projects all in one place; it supports meeting notes, project plans, product requirements, multimedia and dynamic content. | Proprietary; requires a Confluence Server license | ✓ Technical support | Many thousands of results | Since 2004; Version 6.2 |
| Knox | Knox [30] is an application gateway for interacting with the REST APIs and UIs of a Big Data platform. The Knox Gateway provides a single access point for all REST and HTTP interactions with the Big Data cluster. | Apache 2.0 | ✓ Project mailing lists and different support among platform providers | Many thousands of results | Since 2013; Version 0.13.0 |

Table 8 – Description of the solution building blocks

[15] https://aws.amazon.com/ec2/?nc1=h_ls
[16] https://aws.amazon.com/s3/?nc1=h_ls
[17] https://hortonworks.com/products/data-center/hdp/
[18] https://flume.apache.org/
[19] http://storm.apache.org/
[20] https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[21] https://github.com/RevolutionAnalytics/RHadoop/wiki
[22] https://spark.apache.org/docs/latest/sparkr.html#overview
[23] http://lucene.apache.org/solr/
[24] https://zeppelin.apache.org/
[25] https://aws.amazon.com/servicecatalog/?nc1=h_ls
[26] https://ambari.apache.org/
[27] https://sentry.apache.org/
[28] http://wso2.com/api-management/
[29] https://www.atlassian.com/software/confluence
[30] https://knox.apache.org/
6. ILLUSTRATIVE IMPLEMENTATION OF A USER STORY
This chapter provides an illustrative implementation of the User Story “Big Data pilot
implementation” described in Chapter 4. It aims to give the reader a clearer understanding of the
steps users have to follow to implement a Big Data pilot using the Big Data Test Infrastructure. In
this User Story, the Policymaker wants to start a Big Data pilot using a ready-to-use Big Data
environment in order to support the policy-making process and to evaluate the impacts and
advantages of implementing a Big Data solution. Figure 15 below illustrates the main steps of the
governance process of the identified User Story and their relations with the main entities of the
Big Data Test Infrastructure (business services, architecture building blocks and solution building
blocks).
Figure 15 – Illustrative implementation of the User Story “Big Data pilot implementation”
[Figure 15 maps the User Story’s activities (Initial Assessment; Big Data Test Infrastructure
initialisation; Analytics implementation; Outcomes sharing & presentation) to the business services
(Advisory; PaaS for implementing Big Data use cases; Support for Analytics implementation;
Analytics as a Service; Community Building and Innovation Portal) and to the functional and
solution building blocks involved (Servers, Storage, Network, Community, Distribution, Distributed
File System, Infrastructure Monitoring Management, Privacy and Security Policy, Batch Ingestion,
Real Time Ingestion, Data Transformation, Analytics, Data Discovery and Exploration, Data
Visualisation).]
The steps of the User Story’s activities are detailed below, following the example associated with
the user story “Big Data Pilot implementation” defined in Chapter 4, with the rationale for the
business services chosen at each step and the associated architecture and solution building blocks.
1. Initial Assessment: in order to evaluate the impact of a new tax policy introduced six months
earlier, the Advisory business service helps the Policymaker to define the pilot scope and to
identify the hardware requirements of the Big Data platform. In more detail, the advisory team
evaluates which data sources are useful to analyse the general sentiment regarding the new tax
policy (e.g. social media datasets) and the duration of the pilot, and defines the storage capacity
of the future platform.
The Advisory business service does not include architecture building blocks.
2. Big Data Test Infrastructure initialisation: in this phase the Policymaker uses the PaaS for
implementing Big Data use cases business service in order to obtain the template for the
initialisation of the Big Data platform. This phase starts with the choice of the most suitable Big
Data distribution based on the requirements collected in the first phase, and includes the
configuration of the number of servers, the inbound and outbound traffic ports, the storage
capacity and the choice of the Big Data technologies identified in the assessment phase. At the
end of this phase, the Policymaker will have a ready-to-use Big Data platform and will be ready to
implement analytics.
The PaaS for implementing Big Data use cases business service covers the initialisation of the
platform, which involves most of the architecture building blocks of the target architecture.
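To make the initialisation step concrete, a platform template of the kind described above could be expressed as a simple configuration structure. The sketch below is a minimal, hypothetical example: every value (distribution, server counts, ports, storage size, service list) is an illustrative assumption derived from a fictitious assessment phase, not a BDTI default.

```python
# Hypothetical initialisation template for a BDTI pilot platform.
# All values are illustrative assumptions, not actual BDTI settings.
pilot_template = {
    "distribution": "Hortonworks Data Platform",
    "servers": {"count": 4, "vcpus": 8, "ram_gb": 32},
    "storage_gb": 1000,
    "network": {"inbound_ports": [22, 443, 8080], "outbound_ports": [443]},
    "services": ["HDFS", "Storm", "Zeppelin", "Solr"],
}

def validate(template):
    """Basic sanity checks before the platform is provisioned."""
    assert template["servers"]["count"] >= 1, "at least one server is required"
    assert template["storage_gb"] > 0, "storage capacity must be set"
    assert template["services"], "choose the Big Data technologies to deploy"
    return True

print(validate(pilot_template))  # True
```

In practice, such a template would be handed to the provisioning layer (e.g. the cloud provider's APIs), so that the Policymaker receives a ready-to-use platform without manual configuration.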
3. Analytics implementation: in this phase the Policymaker and a specialised team, composed of
Data Engineers and Data Scientists and provided by the Support for Analytics implementation
business service (which does not include architecture building blocks), implement the analytics. In
order to obtain feedback on the new tax policy, they can use customisable pre-built
functionalities for the analysis of social media data. These functionalities are provided by the
Analytics as a Service business service, which saves the Policymaker and the specialised team the
cost and time of implementing ad-hoc algorithms to satisfy their needs.
The Analytics as a Service business service covers the following architecture building blocks:
- data ingestion (batch and real time): allows Data Engineers to gather unstructured datasets from
  social media (e.g. Twitter or Facebook) and to store the data in the distributed file system;
- data transformation: allows Data Engineers to perform ETL operations (e.g. joining data from
  multiple sources, encoding free-form values, etc.) in order to prepare datasets for further
  analysis;
- analytics: allows Data Scientists to implement social media analytics (e.g. a word cloud, which
  shows the Policymaker how often each word has been associated with the tax policy);
- data discovery: allows Data Scientists to explore data in order to view the most relevant
  features of the social media datasets;
- data visualisation: allows Data Scientists to present the results obtained using interactive data
  visualisation tools.
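The word-cloud analytic mentioned above essentially boils down to counting term frequencies across social-media posts. The library-free sketch below illustrates the idea; the sample posts and stop-word list are invented for illustration, and a real pilot would ingest data through the batch/real-time ingestion building blocks rather than hard-coding it.

```python
from collections import Counter

# Toy word-frequency analysis of the kind behind a word cloud.
# The posts below are invented examples about a fictitious tax policy.
posts = [
    "new tax policy is confusing",
    "the new tax policy helps small business",
    "tax policy deadline extended",
]
stopwords = {"the", "is", "a", "new"}

words = Counter(
    w for post in posts for w in post.lower().split() if w not in stopwords
)
print(words.most_common(2))  # [('tax', 3), ('policy', 3)]
```

The resulting frequencies feed the data visualisation building block, where term sizes in the word cloud are scaled by count.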
4. Outcomes sharing and presentation: in this last phase, the Policymaker shares the results in a
community with other Member States in order to obtain feedback on the methodologies and
algorithms used during the pilot, or simply to share the pilot’s outcomes. The Community Building
and Innovation Portal business service provides a portal in a configured environment. It covers the
community building block and the dashboard and reporting building block, which allow the
Policymaker to create and share the graphs obtained.
7. GOVERNANCE AND OPERATIONAL MODEL FOR THE BIG DATA
TEST INFRASTRUCTURE
This chapter outlines a high-level governance and operational model of the future Big Data Test
Infrastructure, to be further elaborated in a detailed Target Operating Model. Assuming that the
implementation will be part of the CEF 2018 work programme (see Chapter 2.3), the high-level
governance and operational model has been set up based on the following framework:
Figure 16 – Reference framework for the strategy, governance and operational model
The framework outlines the main IT capabilities and processes needed for the governance and
management of a complex IT Platform such as the future Big Data Test Infrastructure.
Strategy layer
Figure 17 – Strategy layer

For the Strategy layer, the proposed model anticipates an Operational Management Board (OMB)
for the Big Data Test Infrastructure. The OMB will be composed of DG DIGIT and DG CNECT top
management and CEF Member States representatives (or appropriate governance bodies in the
case that the Big Data Test Infrastructure is not part of the CEF 2018 work programme), and will
meet every month to discuss and take decisions on all types of operational matters (strategy,
planning and budgeting). The main processes managed by the OMB are listed below:
- Big Data Test Infrastructure Strategy – process focused on the definition and maintenance of a
  strategic roadmap for the scale-up and improvement of the Big Data Test Infrastructure, using as
  input the outcomes of the Demand Management process;
- Demand Management – process focused on anticipating and understanding Member States’
  demand for Big Data services (e.g. requests for new pilots);
- Financial Management – process focused on the management of the Big Data Test
  Infrastructure’s budgeting and financing.
Furthermore, the OMB will ensure that the architecture principles (see Chapter 5.1) are respected
during the development and improvement of the Big Data Test Infrastructure, participating actively
in the Change Management decision process.
Governance layer
The governance layer will focus on service management activities, all based on well-known,
market-leading methodologies such as the ITIL Framework, ensuring proper information security
management, monitoring the overall performance of the Big Data Test Infrastructure and, in
addition, managing the staff.
This role will be played by DG DIGIT and DG CNECT representatives (or appropriate governance
bodies in the case that the Big Data Test Infrastructure is not part of the CEF 2018 work
programme), who will oversee relations with IT providers (Procurement Management) and with
business stakeholders such as European public administrations (Stakeholders Management).
Figure 18 – Governance layer
Operational layer
This layer covers activities in the field of service development, service evolution,
project/programme management of IT developments, and service delivery. The role of the
Solution Provider (SP), who will be accountable for the development and delivery of the Big Data
Test Infrastructure’s building blocks and related services, will be played by DIGIT and/or external
contractors.
Operations refers to the day-to-day running of the Big Data Test Infrastructure and includes
processes to guarantee that services run without interruption.
The main operational processes which will be managed by the SP are listed below:
- ICT Infrastructure Management – manages and monitors the IT infrastructure, including facilities
  management covering all aspects of the physical environment (for example power and cooling,
  building access management and environmental monitoring) as well as actions to monitor and
  control the IT services of the underlying infrastructure. The SP executes day-to-day routine tasks
  related to the operation of infrastructure components and applications (including pilot
  scheduling, backup and restore activities, routine maintenance, etc.);
- Availability and Capacity Management – establishes and maintains capacity and availability at a
  justifiable cost and with an efficient use of resources. This includes the appropriate provision of
  resources to the Big Data Test Infrastructure; monitoring, analysing, understanding and reporting
  on current and future demand for services, use of resources, capacity, service system
  performance and service availability; and determining corrective actions to ensure appropriate
  capacity and availability while balancing costs against resources needed and supply against
  demand. It also grants authorised users the right to use a Big Data Test Infrastructure service
  while preventing access by non-authorised users, executing the policies defined in the
  Information Security field;
- IT Service support – will be implemented by a dedicated Service Desk (SD), the Single Point of
  Contact (SPOC) for users/customers on a day-to-day basis. It will also be a focal point for
  reporting and managing problems and incidents (disruptions or potential disruptions in service
  availability or quality) and for users making service requests (routine requests for services).

Figure 19 – Development and Operational layer
Finally, a Communication Office (CO) will be responsible for the promotion of the Big Data Test
Infrastructure and the management of the Community Innovation portal.
8. NEXT STEPS
This final chapter provides information about the next steps for the implementation of the future
Big Data Test Infrastructure. Figure 20 below illustrates a high-level timeline summarising the main
steps performed so far and the next steps foreseen over the coming years.
Figure 20 – High-level roadmap for the implementation of the Big Data Test Infrastructure
As shown in the figure above, the infrastructure will be implemented incrementally, starting in
2018 with a first set of Big Data services and finalising the implementation of all the services by
2019.
Regarding the practical steps to be followed for the implementation of the Big Data Test
Infrastructure, Figure 21 below provides an implementation roadmap, which will enable the
implementation of the “core” services (“PaaS for implementing Big Data use cases”, “Community
Building and Innovation Portal” and “Big Data and Analytics software catalogue”) and the execution
of a first set of pilot projects with some Member States, supported by the business services
“Advisory” and “Support for Analytics implementation”.
Figure 21 – Detailed roadmap for the implementation of the Big Data Test Infrastructure