

D4.1: Design of the integrated big and fast data eco-system

Author(s): Sandro Fiore (CMCC), Donatello Elia (CMCC), Walter dos Santos Filho (UFMG), Carlos Eduardo Pires (UFCG)
Status: Draft/Review/Approval/Final
Version: v1.0
Date: 01/07/2016
Dissemination Level: PU (Public)

Abstract: Europe-Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme and by the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT.

The purpose of this report is the design of the integrated big and fast data eco-system. The deliverable aims at identifying and describing in detail all the key architectural building blocks needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the EUBra-BIGSEA project.

EUBra-BIGSEA is funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 690116.

This project is the result of the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Brazilian Ministry of Science, Technology and Innovation (MCTI).


Document identifier: EUBRA BIGSEA-WP4-D4.1
Deliverable lead: CMCC
Related work package: WP4
Author(s): Sandro Fiore (CMCC), Donatello Elia (CMCC), Walter dos Santos Filho (UFMG), Carlos Eduardo Pires (UFCG)
Contributor(s): Ignacio Blanquer (UPV), Gustavo Avelar (UFMG), Wagner Meira (UFMG), Dorgival Guedes (UFMG), Luiz Fernando Carvalho (UFMG), Monica Vitali (POLIMI), Demetrio Mestre (UFCG), Tiago Brasileiro (UFCG), Nádia P. Kozievitch (UTFPR), Daniele Lezzi (BSC), Igor Oliveira (IBM)
Due date: 30/06/2016
Actual submission date: 01/07/2016
Reviewed by: Nádia P. Kozievitch (UTFPR), Cinzia Cappiello (POLIMI)
Approved by: PMB
Start date of Project: 01/01/2016
Duration: 24 months
Keywords: Big data eco-system, architecture design, analytics, machine learning

Versioning and contribution history

| Version | Date | Authors | Notes |
|---|---|---|---|
| 0.1 | 02/05/2016 | Sandro Fiore (CMCC) | Table of Contents |
| 0.2 | 17/05/2016 | Walter dos Santos Filho (UFMG) | Formatting |
| 0.3 | 30/05/2016 | Sandro Fiore, Donatello Elia (CMCC) | Requirements, general architecture sections and tools analysis definition |
| 0.4 | 10/06/2016 | Sandro Fiore, Donatello Elia (CMCC) | Updated ToC, introduction, executive summary, architecture |
| 0.5 | 15/06/2016 | Walter dos Santos Filho (UFMG) | Architecture sequence diagrams and management API |
| 0.6 | 16/06/2016 | Monica Vitali (POLIMI), Igor Oliveira (IBM), Donatello Elia (CMCC), Sandro Fiore (CMCC), Luiz Fernando Carvalho (UFMG), Walter dos Santos Filho (UFMG) | Data sources update, data quality as a service |
| 0.7 | 17/06/2016 | Carlos Eduardo Pires (UFCG) | Entity-Matching, general review |
| 0.8 | 20/06/2016 | All contributors | Review of the tools analysis section, tools assessment |
| 0.9 | 24/06/2016 | Sandro Fiore (CMCC) | General review of the document, conclusions |


Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0.

Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services.

While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose.

Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein.

Without derogating from the generality of the foregoing, neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any information, advice, inaccuracy or omission herein.


TABLE OF CONTENTS

EXECUTIVE SUMMARY
1. Introduction
   1.1. Scope of the Document
   1.2. Target Audience
   1.3. Structure
2. EUBra-BIGSEA Architectural Overview
3. Big and Fast Data Eco-system Requirements
   3.1. Use Case Requirements
   3.2. Technical Requirements
   3.3. Classes of Users
4. Data Sources
   4.1. External Data
      4.1.1. Stationary Data
      4.1.2. Dynamic Spatial Data
      4.1.3. Environmental Data
      4.1.4. Social Data
   4.2. Derived Data
   4.3. Platform-level Data
      4.3.1. QoS Monitoring Data
      4.3.2. Data Quality as a Service Data
5. Big and Fast Data Eco-system General Architecture
6. Big and Fast Data Eco-system Design
   6.1. Architectural Diagram
      6.1.1. Data Storage
      6.1.2. Big Data Technologies
      6.1.3. Entity Matching Service
      6.1.4. Data Quality as a Service
      6.1.5. Extraction, Transformation and Load
   6.2. Sequence Diagrams
      6.2.1. User Stories for UC1: Data Acquisition
      6.2.2. User Stories for UC2: Descriptive Models
      6.2.3. User Stories for UC3: Predictive Models
      6.2.4. Other interactions between WP4 components
   6.3. Exposed QoS metrics
      6.3.1. Java Virtual Machine metrics
      6.3.2. Data Storage Metrics
      6.3.3. Data Access Metrics
      6.3.4. Data Ingestion and Streaming Processing Metrics
      6.3.5. Data Analytics and Data Mining Metrics
      6.3.6. Data Mining and Analytical Toolbox
   6.4. Data Management API
   6.5. Security Aspects
7. Tools evaluation
   7.1. Procedure to describe components
   7.2. Data Storage
      7.2.1. HDFS
   7.3. Data Access
      7.3.1. PostGIS
      7.3.2. MongoDB
      7.3.3. Apache HBase
   7.4. Data Ingestion and Streaming Processing
      7.4.1. Apache Kafka
      7.4.2. Apache Storm
      7.4.3. Apache Flink
      7.4.4. Apache Spark Streaming
   7.5. Data Analytics and Mining
      7.5.1. Ophidia
      7.5.2. Apache Kylin
      7.5.3. Apache Hive
      7.5.4. Druid
      7.5.5. Spark
      7.5.6. Hadoop MapReduce
   7.6. Data Mining and Analytics Toolbox
      7.6.1. Apache Spark MLlib
      7.6.2. Ophidia Operators
      7.6.3. Ophidia Primitives
   7.7. Final Assessment
8. Preliminary Architectural Mapping
9. Conclusions
10. References
GLOSSARY

LIST OF TABLES
Table 1. List of Use Case requirements related to WP4
Table 2. List of technological requirements related to WP4
Table 3. Summary of external data sources
Table 4. Some examples of JVM metrics available
Table 5. Distributed file system metrics
Table 6. Data access metrics
Table 7. Data ingestion and streaming processing metrics
Table 8. Data analytics & mining metrics
Table 9. Template used to describe the potential components to be used in the Big Data eco-system

LIST OF FIGURES
Figure 1. High-level view of the EUBra-BIGSEA architecture
Figure 2. WRF outer (map) and inner (blue rectangle) domains
Figure 3. WP4 general architecture
Figure 4. Big and Fast Data eco-system detailed architecture
Figure 5. Data sources levels
Figure 6. Sequence diagram for scenario 1.1
Figure 7. Sequence diagram for scenario 1.2
Figure 8. Sequence diagram for scenario 2.1
Figure 9. Sequence diagram for scenario 2.2
Figure 10. Sequence diagram for scenario 3.1
Figure 11. Sequence diagram for scenario 3.2
Figure 12. Sequence diagram for scenario 4.1
Figure 13. Preliminary architectural mapping


EXECUTIVE SUMMARY

The EUBra-BIGSEA project aims at developing a set of cloud services empowering Big Data analytics to ease the development of massive data processing applications. EUBra-BIGSEA will develop models, predictive and reactive cloud infrastructure QoS techniques, efficient and scalable Big Data operators and a privacy and quality analysis framework, exposed to several programming environments. EUBra-BIGSEA aims at covering the general requirements of multiple application areas, although it will be showcased on the treatment of massive connected-society information, and particularly on traffic recommendation.

The integrated fast and Big Data eco-system represents the central component devoted to the data management aspects (i.e., access, analytics/mining, and quality) of the EUBra-BIGSEA platform. Its architecture has been defined in this document starting from the requirements gathered from the main project use cases and highlighted in D7.1.

To fulfil the requirements, the proposed architecture integrates multiple classes of big data systems to address fast data analysis over continuous streams from external data sources, general-purpose data mining and machine learning tools, as well as OLAP-based systems for multidimensional data analysis. A storage and access layer has also been defined to provide key low-level functionalities. Aspects related to the API exposed by WP4 are also reported in this document, as they are strongly connected to the programmability of the data management part.

Also relevant to this document are (i) a comprehensive evaluation and assessment of the big data tools available in the general landscape from the data storage, access, analytics and mining standpoints, and (ii) an in-depth analysis of the data sources in terms of data model, formats, volume, metadata and functional needs. In the former case, a comprehensive description of the data tools is provided, whereas in the latter a complete description of the data sources (from raw - level 0 - to derived - level 2 - data) and of their links with the functional components of the architecture is given. The analysis of the big data landscape also includes the technical evaluation of the tools as well as their final assessment based on evaluation criteria linked to the use case requirements.

As highlighted in the document, key features of the designed data management architecture are the integration of different classes of big/fast data tools to address multifaceted use case requirements, and the dynamicity and elasticity of the environment (which are linked to QoS metrics/policies), jointly with a secure-by-design eco-system (e.g., with respect to privacy aspects). The proposed architecture joins all these elements in a cloud environment, aiming at providing, to some extent, a general approach to deal with high-social-impact use cases and scenarios like the ones proposed in the project.

A preliminary mapping of the main architectural blocks onto infrastructural components is also proposed at the end of the document.


1. INTRODUCTION  

1.1. Scope of the Document
This document provides a complete overview of the design of the integrated big and fast data eco-system. It aims at identifying and describing in detail all the key architectural components needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the project. The document also includes the full list of the data-related requirements from D7.1, jointly with a comprehensive description of the data sources and of the big and fast data tools. In addition, UML diagrams are proposed to clarify architectural aspects and interactions among components. The links to the other WPs from the security, quality of service, user requirements and programming framework standpoints are also highlighted in the text.

1.2. Target Audience
The document is mainly intended for internal use, although it is publicly released. Its main target is the global team of technical experts of EUBra-BIGSEA, including WP3, WP4, WP5 and WP6, since the document goes beyond pure data management aspects to present the global architecture of WP4.

1.3. Structure
The rest of the document is structured into 8 main parts. First, Section 2 provides a general introduction to the EUBra-BIGSEA architecture. Section 3 summarizes the requirements in terms of use cases, technical requirements and classes of users. In Section 4, a complete description of the data sources according to the three identified classes (raw, derived and platform-level) is presented and discussed. Section 5 presents the general architecture of the big and fast data eco-system, highlighting the main building blocks of the system as well as the links among the different components and the relationships with the other work packages; it provides a general conceptual view of the proposed data management eco-system. Section 6 provides a detailed view of the architecture, with information about the internal components (storage, ETL, big data technologies, Entity-Matching and Data Quality services), sequence diagrams, the QoS metrics to be exposed at the WP4 level, the data management APIs and data-related security aspects. Section 7 provides a comprehensive tools evaluation based on the following characterization: storage, access, analytics/mining and related toolbox, ingestion and streaming processing components. A final assessment based on key dimensions coming from the D7.1 requirements is also presented at the end of the section. Section 8 provides a preliminary mapping of the components onto the architectural view to give some initial insights into the infrastructural implementation of the data eco-system. Finally, Section 9 draws the main conclusions of the deliverable.


2. EUBRA-BIGSEA ARCHITECTURAL OVERVIEW
The EUBra-BIGSEA general architecture, as described in deliverable D7.1, comprises four main blocks:

● QoS Cloud Infrastructure services, which integrate the modelling of the workload, the monitoring of the resources, the implementation of vertical and horizontal elasticity, and the contextualization.

● Big Data Analytics services, which provide operators to process huge datasets and which can be integrated in the programming models. Analytics services are characterized in the QoS cloud infrastructure models of the underlying layer, which will adjust resources to the expected workload, either automatically or explicitly driven by the analytics services, taking its specificities into account. This document mainly focuses on the big data eco-system block.

● Programming Models, which provide a higher-level programmatic framework and are also characterized by the models of the infrastructure. The programming models will ease the parallelization of the applications developed on top of them.

● Privacy and Security framework, which provides the means to annotate data and processing and ensures the proper protection of privacy and security.

On top of these four blocks, applications are developed using the programming models and the data analytics extensions. Application developers are expected to use the programming models and may use other features of the underlying layers, such as the user-level QoS metrics.

Figure 1 shows the high-level view of the EUBra-BIGSEA architecture, depicting the interactions among the main blocks.

Figure 1. High-level view of the EUBra-BIGSEA architecture


3. BIG AND FAST DATA ECO-SYSTEM REQUIREMENTS
The requirements analysis is an essential preliminary step in the design of a software environment. It defines the features, objectives and constraints that the system is expected to guarantee and comply with.

This section provides a summary of the requirements, both functional and non-functional, of the big and fast data eco-system developed within WP4 of the project. End-user requirements represent the initial set of requirements to be addressed; however, since the big data framework will also interact with other entities, such as programming frameworks, external data sources and infrastructure management systems, additional requirements, not directly connected to the end users, should also be targeted. In particular, since data have a key role in the whole eco-system, special attention has been devoted to the examination of the various data sources necessary for the end-user analysis, in order to identify and characterize requirements and constraints. A complete description of the data sources is provided in the next section, whereas this section mainly describes the end-user requirements of the big and fast data eco-system.

End-user requirements are essential for the design and implementation of the big data eco-system, since user applications will act as the main validator of the features provided by the whole EUBra-BIGSEA platform.

Deliverable D7.1, "End-User Requirements Elicitation", provides a complete description of the requirements elicitation phase and specifies a set of requirements, from the end-user point of view, that must or should be addressed by the various work packages. Some of them, the ones related to the data part, are highlighted in this document.

3.1. Use Case Requirements
The project's general user scenario has been split into different use cases to ease the elicitation phase. Hence, three use cases have been identified starting from the type of operations required on data. Briefly, as reported in more detail in D7.1, the use cases are:

● Use Case 1 (UC1), devoted to the acquisition of data into the system from different data sources;
● Use Case 2 (UC2), related to the processing of the data to extract historical knowledge from it;
● Use Case 3 (UC3), devoted to the creation of new knowledge by projecting existing models into the future or under different conditions.

Table 1 lists the requirements for the three use cases that must or should be taken into account during the design of the big data eco-system (those related to WP4); these cover both functional and non-functional aspects. The full description of the requirements and the use cases is available in D7.1.

 

| Req # | UC # | Description | Requirement Level | WP |
|---|---|---|---|---|
| R1.1 | UC1 | To integrate GIS data sources | MUST | WP4 |
| R1.2 | UC1 | To integrate meteorological/climate data sources | MUST | WP4 |
| R1.3 | UC1 | Metadata must be included into the application | MUST | WP4 |
| R1.5 | UC1 | Availability of an API | MUST | WP4/5 |
| R2.5 | UC2 | Selection of data sources for time-series analysis | MUST | WP7/4 |
| R2.6 | UC2 | Selection of the area of interest | MUST | WP7/4 |
| R2.8 | UC2 | Reuse aggregated results | SHOULD | WP7/4 |
| R3.5 | UC3 | Download of aggregated results | MUST | WP7/4 |
| R3.6 | UC3 | Selection of data sources | MUST | WP7/4 |

Table 1. List of Use Case requirements related to WP4

3.2. Technical Requirements
From the analysis of the Use Case requirements, general functional requirements regarding the project's infrastructure have been identified. These requirements, defined as "Technological Requirements", refer to data access, execution, and security and logging. A complete reference to the full description of the requirements is provided by D7.1.

Table 2 provides a list of the technological requirements that must or should be tackled by the big data eco-system (those related to WP4); these fall within the context of data access and security aspects. It is worth mentioning that, although execution requirements are mainly addressed by WP3, the big data eco-system provides the components necessary to perform big data processing on the underlying infrastructure, as well as the adaptations required by the QoS cloud infrastructure models. Hence, these aspects must also be considered during the WP4 design phase.

| Req # | Description | Requirement Level | WP |
|---|---|---|---|
| RD.1 | Integrate external existing data sources | MUST | WP4 |
| RD.2 | Automatic synchronization with original data sources | MUST | WP4 |
| RD.3 | Storage of processing products | MUST | WP4/6 |
| RD.4 | Authentication and Authorization | MUST | WP4/6 |
| RD.5 | Data Access | MUST | WP4 |
| RD.6 | Deal with poor-Internet connection limitations | SHOULD | WP4 |
| RA.2 | Data and applications ACL | MUST | WP6/4/5 |
| RA.4 | Data privacy protection | MUST | WP6/4 |

Table 2. List of technological requirements related to WP4

 


As displayed in Tables 1 and 2, several requirements are shared among different work packages. It is therefore critical to identify the links and dependencies of the big data eco-system with respect to security, QoS cloud infrastructure services, programming frameworks and end-user algorithms/applications, in order to correctly model these aspects in the eco-system design. These aspects have been the main focus of several cross-WP teleconferences held during the first period of the project.

3.3. Classes of Users

Different types of users could potentially exploit the functionalities provided by the big data eco-system. The following classes of users have been identified, based on the type of functionality and privileges required:

● Administrators have control over the big data eco-system at the infrastructural level. Different roles could be defined at different levels and granularities;

● Developers use the big data eco-system API to develop end-user applications, to test the system and to perform data inspection activities;

● Programming models exploit the eco-system technologies and the integrated data for the execution of data mining and analytics processing.


4. DATA SOURCES
The urban mobility scenario targeted by the EUBra-BIGSEA project requires the integration of very heterogeneous data sources to provide information regarding urban traffic, environmental conditions and people's sentiments/opinions (from social network sources). Even though these data mainly focus on the pilot case (the city of Curitiba), the EUBra-BIGSEA framework should be independent of a specific geographical area, or at least it should be easily re-usable in other scenarios with minor adaptation activities. WP4 copes with the management of several external data sources, addressing the challenges associated with these data in order to provide a fast and scalable environment for big data analytics and mining.

The data source analysis provides important input for the WP4 architecture. Deliverable D7.1 gives an overview of some main, preliminary, external data source types identified for the project scenario; in this report, additional data sources identified in subsequent analysis have been included.

The external data source types are classified as:

● Stationary data, describing static elements that compose the infrastructure and mobility in the city. It includes urban geographic information (e.g., legal limits, land cover, land use, hydrography), transportation infrastructure (e.g., street map, topology of the traffic network, bus stops), points of interest (e.g., schools, hospitals, squares, stadiums) and other information that is relevant to understand the location of the components present within the scenario;

● Dynamic spatial data, containing information valid for a specific point in time. It includes traffic geo-referenced information (e.g., vehicle GPS, routes of public transportation users), traffic status and news (e.g., accidents, traffic jams), the existence of events (e.g., high concentrations of people, concerts, protests), and all types of temporal information useful to measure the mobility conditions;

● Environmental data, presenting information about the environmental conditions and the weather forecasts that are relevant for understanding citizens' mobility;

● Social network data, providing streams of data useful to extract information about sentiments and unpredictable events.

External data could require some preliminary steps to be integrated into the big data eco-system. After these pre-processing steps, data can be accessed and processed by exploiting the big data technologies available in the eco-system. Derived data can be produced by running the analytics and machine learning algorithms defined for UC2 and UC3. These data can then be stored in the system to be used for subsequent processing and analysis.

Additionally, the big data eco-system could be exploited to integrate "platform-level" data sources that are mainly required for internal use. These are: (i) the monitoring data produced to evaluate QoS at the infrastructure level and (ii) the information produced by the Data Quality as a Service (DQaS) component to annotate data. These data can, in fact, be stored in the eco-system and managed by the same big data technologies used for the other available data sources.

The management and analysis of this (big) data raises several challenges that should be properly handled, such as:

● Data velocity: data can be stationary and valid for a long period, generated from periodic runs of weather forecast models, or produced by continuous streams;

● Data variety: data sources are very heterogeneous and come in different types, such as tabular, structured, non-structured, multi-dimensional, spatial or a mix of the aforementioned;


● Data volume: especially the dynamic and environmental sources are expected to continuously produce data, resulting in a big volume of data to be managed;

● Data veracity: data sources may need to be pre-processed, filtered and aligned before being integrated in the eco-system, in order to avoid affecting the overall quality and accuracy of the stored data.

The following sections provide a brief description of the main aspects of the (i) external data (Section 4.1), (ii) derived data (Section 4.2), and (iii) platform-level data (Section 4.3) from a WP4 perspective.

4.1. External Data
Table 3 summarizes the external data sources that will potentially be integrated in the data eco-system during the project lifetime, providing a set of characteristics for each of them. The following subsections briefly describe the data sources considered for each of the classes defined above.

| Data Source Name | Source Type | Domain | Expected Volume (size/freq.) | Data Format | Availability and services | Data Policy | Storage |
|---|---|---|---|---|---|---|---|
| Brazilian boundaries | Stationary | Geographic | 189 files, 397,639 records, 2 GB | Shape files | Public | Not applicable | PostGIS |
| Curitiba Infrastructure | Stationary | Geographic | 75,882 records, 100 MB | Shape files | Public | Not applicable | PostGIS |
| Events from globo.com | Dynamic | Events | 1 record/day | JSON | Public | Not applicable | NoSQL |
| Events from Google Search | Dynamic | Events | 1 record/day | JSON | Public | Not applicable | NoSQL |
| Events from Facebook | Social Data | Events | 26 records/day | JSON | Public, API access (requires authN) | Non distributable. Restricted to the project. | NoSQL |
| Twitter | Social Data | Traffic/Events | 150,000 records/day, 630 MB/day | JSON | Public, API access (requires authN) | Non distributable | NoSQL |
| Curitiba traffic news | Dynamic | Traffic | 20 records/day | JSON | Public | Not applicable | NoSQL |
| Traffic news | Dynamic | Traffic | 15 records/day | JSON | Public | Not applicable | NoSQL |
| Traffic status - "MapLink" | Dynamic | Traffic | 3,670 records/day, 6.5 MB/day | JSON | Public | Non distributable. Restricted to the project. | NoSQL |
| Points of interest - URBS | Stationary | POI | 2,537 records | JSON | Public | Restricted to the project. | NoSQL |
| Points of interest - Tripadvisor | Stationary | POI | 309 records | JSON | Public | Restricted to the project. | NoSQL |
| Curitiba Bus Cards | Dynamic | Mobility | Sample of 6 days, 3,970,059 records | CSV | By request (sensitive information) | Restricted to the project. | PostGIS |
| Curitiba bus service - URBS | Dynamic | Mobility | 20,000 records/hour, 2.6 MB/hour | JSON | By request | Non distributable. Restricted to the project. | PostGIS |
| WRF Climate Data | Environmental Data | Weather | 2 GB/day | NetCDF | Public | Not applicable | Filesystem |

Table 3. Summary of external data sources

4.1.1. Stationary Data

Stationary data provides long-living information describing the topology of the traffic network of the city, the street map, relevant city spots and other geographic information useful to identify the location of the components present in the urban mobility scenario. Possible external data sources belonging to this category are:

● Curitiba Infrastructure: this database is composed of multiple tables with information about land cover and land use in Curitiba, such as street layouts, rivers and squares, along with polygon boundaries. Most of the information is geo-located in PostGIS format (a loading sketch is given after this list);

● Brazilian boundaries: contains the name, ID and polygon points of the boundaries of all states, meso-regions, micro-regions, cities, districts, sub-districts and sectors of Brazil. It is an official database provided by IBGE (Brazilian Institute of Geography and Statistics), available on the Web. The data are static and rarely change;

● Points of interest from URBS: information about the points of interest of Curitiba. It includes the name, coordinates and type of each POI (e.g., schools, hospitals, hotels). The data are provided by URBS (the Curitiba public transportation company) through API requests and updated once a week;

● Points of interest from Tripadvisor: this database contains information about all the places in Curitiba listed on the Tripadvisor website. In addition to the name and location, it also provides popularity and evaluation metrics. The data acquisition is performed with a web crawler and updated once a week.
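As an illustration of how such shapefile-based sources could be loaded into the PostGIS storage indicated in Table 3, the following is a minimal sketch (not part of the project code) assuming the geopandas, SQLAlchemy and GeoAlchemy2 libraries; the file path, table name and connection string are hypothetical placeholders.

```python
# Minimal sketch: loading a boundary shapefile into a PostGIS table with geopandas.
# File path, table name and connection string are hypothetical placeholders.
import geopandas as gpd
from sqlalchemy import create_engine

# Read one of the shape files (e.g. administrative boundaries) into a GeoDataFrame
boundaries = gpd.read_file("curitiba_boundaries.shp")

# Normalize the coordinate reference system to WGS84 before storing
boundaries = boundaries.to_crs(epsg=4326)

# Write geometries and attributes into a PostGIS table (requires GeoAlchemy2)
engine = create_engine("postgresql://bigsea:secret@localhost:5432/geodata")
boundaries.to_postgis("brazilian_boundaries", engine, if_exists="replace")

print(f"Loaded {len(boundaries)} records")
```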

4.1.2. Dynamic Spatial Data

Dynamic spatial data provides geo-referenced information about vehicles and users at a specific point in time. Possible dynamic spatial data sources include:


● Curitiba bus cards: consists of information about the use of Curitiba bus cards. Each record includes the bus line, bus vehicle, date, time, card ID and the coordinates at which the card has been used. A sample of 6 days is available, provided by URBS (the Curitiba public transportation company);

● URBS bus service: contains real-time information about the geographic position of the bus vehicles. Each record describes the position of a bus vehicle at a specific date and time. The data are provided by URBS (the Curitiba public transportation company);

● Events from globo.com: information about the most important concerts scheduled in Curitiba, taken from the Globo website. The information includes the event name and place. The data are crawled from the Globo website once a day;

● Events from google.com: information about the most important concerts scheduled in Curitiba, taken from the top box shown during a Google search. The information includes the event name and place. The data are crawled from the Google search results once a day;

● Traffic status - "MapLink": this database contains the traffic status of the main avenues and streets of Curitiba. The information is crawled from the MapLink website and updated every 15 minutes. It includes the street name, geo-location, current average speed and traffic status;

● Curitiba traffic news: composed of news about the traffic in Curitiba from an official source. The records consist of text with traffic information and date/time. It is collected once a day through RSS requests from the traffic news feed on the Curitiba City Hall website;

● Traffic news: composed of news about the traffic in Curitiba from news websites. The data are crawled in real time from the main news websites through RSS requests. The information includes the news text, URL, keywords, date/time and source.
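As an illustration of how the RSS-based traffic news could be ingested into the NoSQL storage indicated in Table 3, the following is a minimal sketch assuming the feedparser and pymongo libraries and a local MongoDB instance; the feed URL, database and collection names are hypothetical placeholders.

```python
# Minimal sketch: ingesting traffic-news items from an RSS feed into MongoDB.
# Feed URL, database and collection names are hypothetical placeholders.
import feedparser
from pymongo import MongoClient

FEED_URL = "http://example.org/curitiba-traffic/rss"   # placeholder feed

client = MongoClient("mongodb://localhost:27017")
collection = client["bigsea"]["traffic_news"]

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    doc = {
        "title": entry.get("title"),
        "text": entry.get("summary"),
        "url": entry.get("link"),
        "published": entry.get("published"),
        "source": FEED_URL,
    }
    # upsert on the URL so re-running the crawl does not duplicate news items
    collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
```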

4.1.3. Environmental Data

Environmental data provides information about the weather conditions and forecasts that are also relevant for understanding citizens' mobility. This data will be produced using the Weather Research and Forecasting (WRF) model. WRF is a state-of-the-art regional numerical weather prediction system developed and maintained by several institutions and available as open source code to the whole community [R01]. In order to produce forecasts for the city of Curitiba, two nested domains centered on the city have been configured, with horizontal resolutions of 12 km and 4 km (Figure 2).

The WRF model uses input data from the Global Forecasting System (GFS). GFS is a global model developed by the National Center for Environmental Prediction and freely available for community use [R02]. The estimated data volume is 2 GB/day, of which 1.5 GB is input data from GFS (used only by the WRF model) and 0.5 GB is output data from WRF, which is used by the whole system. The WRF output consists of files in NetCDF format [R03] containing geo-referenced information on meteorological variables such as temperature, precipitation and winds.


Figure  2.  WRF  outer  (map)  and  inner  (blue  rectangle)  domains  
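For reference, the following minimal sketch shows how a WRF NetCDF output file could be inspected with the Python netCDF4 library; the file name and the variable names (T2, XLAT, XLONG follow common WRF output conventions) are assumptions and may differ in the actual model configuration.

```python
# Minimal sketch: inspecting a WRF output file in NetCDF format with netCDF4.
# File and variable names are assumptions based on common WRF conventions.
from netCDF4 import Dataset

with Dataset("wrfout_d02_2016-07-01_00:00:00.nc") as nc:
    # list the geo-referenced meteorological variables available in the file
    print(list(nc.variables.keys()))

    t2 = nc.variables["T2"][:]        # 2-metre temperature in Kelvin (time, y, x)
    lat = nc.variables["XLAT"][0]     # latitude grid of the inner (4 km) domain
    lon = nc.variables["XLONG"][0]    # longitude grid

    print("domain grid:", lat.shape, "temperature range (K):", t2.min(), t2.max())
```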

4.1.4. Social Data

Social data refers to data produced by the interaction between users in an online social network. Such data are very dynamic and can be obtained shortly after being published, making them an excellent source of near-real-time information. The extracted information may include facts about situations or events happening right now, for instance traffic jams or floods, and can be used to estimate public attendance or to collect users' demands, sentiments or opinions.

To extract information from social data it is necessary to filter, transform and enrich the raw data. In general, one of the most important attributes available in social data is the text of the message. For example, Twitter allows users to send texts of up to 140 characters. There are some other well-structured fields, such as the tweet date or the user login, but in general the message text contains important information mixed with poorly written text, misspellings, ambiguities and other hard-to-interpret expressions.
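As a simple illustration of such a filtering/enrichment step, the following sketch keeps only the tweets whose text mentions traffic-related keywords; the keyword list and the input file are illustrative placeholders, while the field names follow the Twitter REST API v1.1 tweet object.

```python
# Minimal sketch: a first filtering pass over raw tweet JSON, keeping only
# messages that mention traffic-related keywords (Portuguese terms as examples).
import json
import unicodedata

KEYWORDS = {"transito", "congestionamento", "acidente", "onibus", "alagamento"}

def normalize(text: str) -> str:
    """Lower-case and strip accents to make keyword matching more robust."""
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(c for c in text if not unicodedata.combining(c))

def is_traffic_related(tweet: dict) -> bool:
    words = set(normalize(tweet.get("text", "")).split())
    return bool(words & KEYWORDS)

with open("tweets.json") as f:              # one JSON object per line (placeholder)
    for line in f:
        tweet = json.loads(line)
        if is_traffic_related(tweet):
            # keep only the well-structured fields needed downstream
            print(json.dumps({
                "id": tweet.get("id_str"),
                "created_at": tweet.get("created_at"),
                "user": tweet.get("user", {}).get("screen_name"),
                "text": tweet.get("text"),
                "coordinates": tweet.get("coordinates"),
            }))
```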

We foresee social data as an important source of real-time information and an excellent input to processing algorithms in areas such as NLP (Natural Language Processing, for example to extract the entities referred to in the text), machine learning (for example, correlating social data with other data sources) and data mining (finding frequent patterns and sequences). Data sources from social networks include:

● Twitter:   composed   of   real   time   messages   from   Twitter.   The   data   are   extracted   through   API  [R04]  requests,  which  fetch  three  types  of  tweets:  

i. tweets  crawled  from  accounts  related  to  traffic  information.  These  accounts  were  manually  selected   and   there   is   no   guarantee   that   the   tweet   is   actually   about   traffic   status.   Just   a  small  piece  of  such  messages  are  geo  located;  

ii. tweets with keywords related to traffic. The keywords were manually selected and there is no guarantee that a tweet is actually about traffic status. Only a small fraction of such messages is geolocated;


iii. all geolocated tweets from Brazil. This set includes tweets related to all topics, therefore only a small fraction of the messages is related to traffic and mobility.

● Events from Facebook: contains information about Facebook events occurring within a 1000-meter radius of each subdistrict centroid in Curitiba. The dataset does not cover all events in the city. It includes information about the event location and popularity (e.g. number of invited people, interested people and people attending the event).

4.2. Derived Data

Derived data are produced by the execution of algorithms that implement UC2 and UC3, using the pre-processed integrated data or even other derived data.

The main aim of UC2 - Descriptive Models - is to extract and characterize trajectories and to obtain correlations among them and other associated metadata. Such models will characterize trajectories, deriving aggregated and non-aggregated statistics about routes, and will identify characteristic probability distributions for the various statistics, as a strategy to extrapolate the findings and the underlying phenomena. To build the Descriptive Model, data from different sources will be consumed, including stationary, spatial, environmental and social data. The resulting Descriptive Model will be stored in such a way that it favors fast queries (including queries with geo-spatial operators) and incremental updates. Any storage technology used in analytics solutions is a good candidate for the Descriptive Model storage, but a more “traditional” record- or document-oriented database will also be used to store metadata. Model updates may be performed by a batch or stream execution service. Some data mining techniques, such as frequent pattern mining, can lead to an explosion in the amount of derived data, and this should be a concern if such a technique is used to realize a scenario in UC2.
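As an illustration of the geo-spatial queries mentioned above, the following Python sketch queries a hypothetical trajectory_stats table in a PostGIS-enabled PostgreSQL database through psycopg2; the table schema, connection parameters and bounding box are illustrative assumptions, and other storage technologies considered in this deliverable could play the same role.

# Minimal sketch: geo-spatial query over stored descriptive-model results.
# Table "trajectory_stats(route_id, avg_speed, geom)" and credentials are hypothetical.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="bigsea", user="wp4", password="secret")
cur = conn.cursor()

# Average speed of routes intersecting a bounding box around central Curitiba (WGS84).
cur.execute("""
    SELECT route_id, avg_speed
    FROM trajectory_stats
    WHERE ST_Intersects(geom, ST_MakeEnvelope(%s, %s, %s, %s, 4326))
""", (-49.30, -25.47, -49.23, -25.40))

for route_id, avg_speed in cur.fetchall():
    print(route_id, avg_speed)

cur.close()
conn.close()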

The main goal of UC3 - Predictive Models - is to build applications for route recommendation that will help citizens check mobility conditions and give hints on how to better reach destinations with regard to multiple criteria, such as time, predicted traffic (and hence stress), pleasantness to walk, sights, and interestingness (taken from social media). The project will implement two tasks: classification and regression. Predictive models will be constructed using all the previous kinds of data, using a batch-model execution service. Predictive Models can have their own file format and store data in a distributed file system, or use another WP4 storage service such as NoSQL databases [R05].

4.3. Platform-­‐level  Data  

4.3.1. QoS  Monitoring  Data  

The QoS Monitoring System, defined in WP3, collects, processes, stores and displays information about metrics, alarms and logs related to applications or infrastructure components. A specific component, the metrics and alarms database, organizes data in a structure optimized to store time series at different time scales.

The QoS Monitoring System also contains other components that may exploit WP4 in order to scale. Such components include the analytics engine, the anomaly and prediction engine and the transform and aggregation engine.

The analytics engine consumes alarm state transitions and metrics from the message queue and performs anomaly detection and alarm clustering/correlation. The anomaly and prediction engine evaluates predictions and anomalies and generates predicted metrics as well as anomaly likelihoods and anomaly scores. The transform and aggregation engine transforms metric names and values, such as delta or time-based derivative calculations, and creates new metrics that are published to the message queue (optional).

WP4   components   also   generate   metrics   and   logs   at   application   level.   A   detailed   description   of   the  monitoring  system  is  available  in  deliverable  D3.1.

4.3.2. Data  Quality  as  a  Service  Data  

The Data Quality service provides additional information about the data sources managed by the platform. In general, data quality aims to evaluate the suitability of data for the processes and applications in which they are involved. Such “suitability” is assessed by means of a set of quality dimensions, whose selection and definition are partially dependent on the context of use. In detail, the data quality service can be considered as composed of a module that aims (i) to provide general information about the data, such as value ranges, the uniqueness degree of each attribute and the number of represented objects, and (ii) to evaluate specific data quality dimensions, e.g., accuracy, completeness, and timeliness.

Outputs are a set of metadata that can be used to:

● Trigger data cleaning activities: the definition of quality levels and related acceptability thresholds can help in activating activities that aim to improve data;

● Let the users be aware of the data quality level of the accessed data: data quality metadata can be shown together with the application results to let the users understand the trustworthiness of the information they are looking at;

● Drive the data integration: data quality levels can be used as a driver in the selection of equivalent or similar sources.


5. BIG AND FAST DATA ECO-SYSTEM GENERAL ARCHITECTURE

A general architecture of the Big and Fast Data eco-system is shown in Figure 3. This diagram highlights the main building blocks of the system, the links among the big and fast data eco-system internal components, and the relationships with the other work packages. It provides a general conceptual description of the big data eco-system, outlining the logical components that will be involved to address the requirements specified in the previous sections. The architectural view, along with the end-user requirements and the data sources description, has been used to drive the big data tools analysis phase, providing a basis for tool identification.

 Figure  3.  WP4  general  architecture  

The WP4 block contains the main components required by the big data eco-system:

● External data sources: the raw data described in Section 4.1 that are being integrated into the system. QoS monitoring data are also included at this level, as a platform-level data source, since they can undergo the same type of pre-processing required for the other sources;

● Data Storage: it includes several types of databases and storage systems necessary for efficiently storing and handling: (i) pre-processed data coming from the external data sources, (ii) derived data, and (iii) data concerning monitoring metrics related to infrastructure/application QoS and information from DQaS describing the quality of the data being integrated. These systems include relational databases, OLAP data warehouses, geo-spatial databases, NoSQL databases, and distributed storage;

● Fast and Big Data technologies: it is a macro-block including:

○ Data Ingestion and Streaming processing: it comprises the tools to (i) ingest and synchronize the data stored in the system with the external data sources and (ii) continuously process streams of data. This block mainly provides the tools to run pre-processing and ETL steps before loading the data into the storage systems;

○ Data Access: it consists of the technologies that allow selection, filtering and querying of the data stored in the system. It also provides functionalities to save and access derived data produced by the descriptive and predictive model services. These features are exploited both by external users and by the programming models;

○ Data Analytics and Mining systems: they provide the engines for data processing in order to perform data analytics and mining tasks. In particular, they include a toolbox with a set of analytics routines, mining functions and machine learning algorithms required for the execution of the processing tasks. Programming models, as well as the DQaS and Entity Matching services, will use the features of the technologies adopted for this block.

The architecture view also displays the relationships with other work packages:

● Programming models (WP5), developers and system administrators, as well as end-user applications (WP7), use the tools provided by the big data eco-system to access the data and run data analytics and machine learning models;

● Big data applications interact with the QoS cloud infrastructure services (WP3), providing information (e.g. metrics, alarms, logs) that the cloud infrastructure uses to elastically adjust the resources on which the applications run, in order to meet the expected QoS levels under the current workload;

● Security  (WP6)  is  orthogonal  to  the  whole  architecture  and  defines  the  measures  and  technologies  required  in  several  blocks  of  the  big  data  eco-­‐system,  mainly  referring  to:  privacy  and  protection  of  the  data  managed  by  the  eco-­‐system,  and  authentication  and  authorization  across  the  various  big  data  tools  involved.  


6. BIG AND FAST DATA ECO-SYSTEM DESIGN

This section focuses on the design of the big and fast data eco-system. The main components of the architecture and their relationships, also with other WPs, are thoroughly described (Section 6.1). Moreover, UML [R06] sequence diagrams derived from a set of user stories are provided to describe how the components interact with each other (Section 6.2). Finally, possible QoS metrics to be exposed (Section 6.3), the data management API (Section 6.4) and some security aspects to be addressed (Section 6.5) are also illustrated.

6.1. Architectural Diagram

The detailed architectural diagram of the big and fast data eco-system is displayed in Figure 4. This view provides an insight into the main blocks defined in the WP4 general architecture (Figure 3) and has been modeled taking into account the end-user requirements and the data source needs for the implementation of the use cases. It highlights the relationships among the big data eco-system components and the other work packages of the EUBra-BIGSEA platform.

Figure  4.  Big  and  Fast  Data  eco-­‐system  detailed  architecture  

In particular, programming models (defined in the context of WP5) and developers, administrators and applications (defined in the context of WP7) represent the main users of the eco-system. They will exploit the big data technologies to access, query and process the external data sources integrated into the system.


Security solutions, identified in the context of WP6, should be exploited to efficiently handle AAA on the different blocks defined in the picture and to properly address the privacy and protection of sensitive data. These solutions could be required at different levels in the eco-system. QoS cloud infrastructure services, designed in the context of WP3, are also orthogonal to the architecture, since most big data components will interact with these services, providing feedback to proactively adjust the resources on which the applications are being run in order to meet expected QoS levels.

The various data sources analyzed in Section 4.1 compose the External Data Sources macro-block, whereas the pre-processed, derived, QoS and DQaS data are integrated and physically stored within the eco-system storage layer. The Big Data Technologies macro-block provides the features to handle the data life-cycle, including ingestion, streaming processing, pre-processing, access, selection through queries, calculation of statistics, metadata management, and computation and storage of data derived from machine learning and analytics operations. The storage layer, along with the big data block, represents the actual components that define the big and fast data eco-system.

The  following  subsections  will  provide  a  detailed  description  of  the  main  blocks  and  aspects  defined  in  the  architecture.   It   is  worth  mentioning  that  additional   internal  modules  addressing  specific  operations  could  be  developed  during  the  project  lifetime.  

6.1.1. Data  Storage  

As described in Section 4.1, the set of available external data sources is very heterogeneous in terms of data format, volume and frequency. The big data eco-system is going to integrate these data source types, exploiting several storage technologies with different data models that are capable of dealing with the data variety and volume while allowing fast access to the data. Moreover, the storage layer will also handle derived data and platform-level data produced by the eco-system components.

Various  levels  of  data  have  been  defined  according  to  the  number  of  processing  tasks  applied  to  the  related  data  sources.  Figure  5  shows  the  flow  of  data  transformation  and  the  levels  of  the  data:

● Level-0 data: comprises the raw data from the external data sources described in Section 4.1. These are grouped in the External Data Sources block, which also includes the monitoring data gathered by the QoS cloud infrastructure service (see Figure 4);

● Level-1 data: includes the integrated data that is stored in the system after the execution of pre-processing steps (e.g. ETL). The Entity Matching service can be exploited during this phase to match entities in different external sources and produce additional data. Level-1 data are then used for analysis and mining;

● Level-­‐2  data:  consists  of  the  data  stored  into  the  system  derived  from  the  integrated  level-­‐1  data.  The  data  are  produced  as  a  result  of  the  execution  of  descriptive  and  predictive  models  on  level-­‐1  data.  Additionally,  the  models  could  also  use  level-­‐2  data  as  their  input;

● Platform-level data: includes sources required for internal use, i.e. data quality and QoS monitoring. DQaS can be executed during the pre-processing phase to identify the quality of data and annotate it. This metadata information is stored in a specific level-1 database and can be accessed to check the quality of the data stored in the system. Metrics and data for QoS can require an ETL phase, as in the case of the raw data, to integrate the information in a level-1 data warehouse that can then be used for subsequent analysis and mining operations, producing level-2 data.


 Figure  5.  Data  sources  levels  

Storage  technologies  should  be  capable  of  dealing  with  the  various  types  of   level-­‐1,   level-­‐2  and  platform-­‐level  data.  Technologies  exploited  for  this  data  may  include:  

● Distributed file systems (e.g., HDFS [R07, R08]);
● NoSQL databases (e.g., MongoDB);
● Data stores for multidimensional scientific data (e.g., Ophidia [R09, R10]);
● Databases to handle geo-spatial data (e.g., PostGIS).

The  data  storage  component  and  the  technologies  employed  have  to  address  the  following  requirements  (defined  in  D7.1):

R1.1.  The  application  must  integrate  the  GIS  data  sources  (dynamic  and  spatial).    The   integration   of   data  should   be   done   in   a   standardized  way   to   facilitate   the   future   integration   of   any   other   data   source.   The  information  should  be  updated  accordingly.  The  data  integration  procedure  should  be  clearly  described,  as  well  as  the  storage  architecture  required.  

R1.2.   The   application   must   integrate   meteorological/climate   data   sources.   This   information   should   be  attached  to  historic  records  and  new  (forecast)  information  should  be  accessible.  

R1.3.  Metadata  must  be  included  into  the  application  to  describe  the  area  covered,  the  year  of  acquisition  of  the  data,  the  type/format  of  the  data  and  any  other  technical  specification  that  is  necessary.

R2.8.  It  should  be  possible  to  download  aggregated  results  and  products  to  be  used  in  subsequent  analysis.


R3.5.  Download  of  aggregated  results  and  products  must  also  be  supported.  

RD.3.  The  infrastructure  must  store  the  data  processing  products,  taking  the  necessary  steps  to  ensure  data  persistence  and  data  protection,  when  necessary.  

6.1.2. Big  Data  Technologies  

This  macro-­‐block  comprises  a  set  of  big  data  systems  required  to  load,  handle  and  process  data.  These  can  be  grouped  in  classes  of  components  according  to  the  main  features  provided.  

Data  Ingestion  and  Streaming  Processing  modules

These  modules  mainly  take  care  of  the   loading  and  synchronization  of  the  data  stored  in  the  eco-­‐system,  targeting  in  particular  streams  of  data.  They  will  be  used  in  the  first  phase  of  the  data  management  process  to  extract  level-­‐1  data  from  the  external  data  sources,  especially  in  the  case  of  streaming  data  from  social  networks.   Moreover,   this   block   will   also   provide   the   features   to   synchronize   the   data   stored   into   the  storage  layer  with  the  external  sources.

Several   technologies,   for  example  Apache  Kafka,  Apache  Storm,  Spark  Streaming  or  Apache  Flink,   can  be  used   for   the   ingestion  phase   and   the  processing  of   real-­‐time   streaming  data.   These   systems   can   also  be  used  to  continuously  perform  ETL  pipelines.  
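As an illustration, the following PySpark Streaming sketch implements a continuous pre-processing step that filters traffic-related records arriving on a socket source in 10-second micro-batches; the endpoint and keyword are placeholders, and in the project the source would more likely be a message queue such as Kafka.

# Minimal Spark Streaming sketch of a continuous filtering (ETL) step.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="bigsea-streaming-etl-sketch")
ssc = StreamingContext(sc, 10)                      # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)     # placeholder ingestion endpoint
traffic = lines.filter(lambda line: "transito" in line.lower())
traffic.count().pprint()                            # here the records would be written to level-1 storage

ssc.start()
ssc.awaitTermination()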

General requirements related to this component are (defined in D7.1):

RD.1.   The   infrastructure   must   support   the   integration   of   external   data   from   existing   data   sources.   This  integration  must  be  complemented  with  methods  for  referencing  the  data  in  their  original  locations,  and  to  pre-­‐process   and  annotate   the  data  with  additional   information.  Metadata   standards  must   be  used  when  available  to  annotate  the  data.

RD.2.  Automatic  synchronization  with  original  data  sources  must  be  addressed  (updating  the  infrastructure  with  the  latest  releases  of  the  data),  considering  the  individual  needs  of  each  case,  which  range  from  simply  discovering  and  downloading  new  data  when  it  becomes  available,  to  running  complex  data  pre-­‐processing  before  storing  the  data  in  the  infrastructure.

RD.6.   User   Internet   connection   is   a   potential   bottleneck   for   performance,   especially   low   bandwidths   as  expected   in   field   conditions.   Therefore,   the   infrastructure   should   facilitate   the   access   to   the   data   even   in  poor  Internet  connections.

Data  Access  and  Query  modules  

Data access and query modules provide the features to access the data available in the system, search and filter the information, perform basic aggregations and store the results of the pre-processing phase and of analytics/mining computations. These systems will allow the execution of specific types of queries for the various data sources integrated. Access to metadata related to the data will also be available. These features will be directly exploited by application developers and administrators to get access to the information available in the eco-system.

Some  examples  of  technologies  that  could  be  potentially  included  in  this  block  are:  Ophidia,  Apache  HBase,  PostGIS  and  MongoDB.  
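As an illustration, the following Python sketch runs a geo-spatial query on a MongoDB collection through pymongo, retrieving geolocated tweets close to a point of interest; the database, collection and field names are illustrative and assume a 2dsphere index on the geo field.

# Minimal sketch of a data-access query on a NoSQL store (MongoDB via pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
tweets = client["bigsea"]["tweets"]                 # hypothetical database/collection

query = {
    "geo": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [-49.2733, -25.4284]},  # lon, lat
            "$maxDistance": 1000,                   # metres
        }
    }
}
for doc in tweets.find(query).limit(10):
    print(doc.get("user"), doc.get("text"))

client.close()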

Requirements  addressed  by  this  component  are  (provided  in  D7.1):


R1.5.  An  API  must  be  exposed   to  deal  with   the  storage  resources   to  authenticate,  populate  data,   retrieve  and  filter  data,  update  data.  Same  operations  for  metadata.  Data  access  should  have  a  short  latency  (near  real-­‐time  access).  

R2.5.   The   service  must   facilitate   the  end-­‐user   to   select   the  data   sources,   temporal  and   spatial   scales  and  output  format  for  historical  time-­‐series  analysis.  

R2.6.  The  service  must  facilitate  the  end-­‐user  to  select  an  area  of  interest  (e.g.  the  Batel  District  in  Curitiba,  Brazil),  for  that  area  the  available  data  (e.g.  GIS  stationary  data)  must  be  retrieved  from  the  application  and  the   end-­‐user   must   have   the   possibility   through   the   service   to   select   derived   information   for   that   area.  Trajectory  analysis  algorithms  may  be  implemented  in  an  incremental  way,  therefore  processing  just  recent  data  available  as  the  whole  data  will  not  fit  in  memory.

R3.6.  The  user  interface  must  facilitate  the  end-­‐user  to  select  the  data  sources,  temporal  and  spatial  scales  and  output  format  for  historical  time-­‐series  analysis.  

RD.5.   The   infrastructure  must   facilitate   the   end-­‐user   to   access   the   data,   providing   the  most   appropriate  protocols  and  data  formats  to  enable  developers  with  the  necessary  means  to  build  usable  user  interfaces.  Data  must  be  queried  in  a  variable  granularity.

Data  Analytics  and  Mining  toolbox  

The toolbox is the repository of the machine learning algorithms, analytics operators, array-based primitives and scientific libraries available in the eco-system for user analysis. It will also feature a market-place for user communities. Data mining algorithms will include, for example, regression, classification, clustering and correlation, whereas data analytics operators will include subsetting, reduction, aggregation and intercomparison. Scientific libraries will allow complex mathematical and statistical computations. Spark MLlib, Ophidia primitives and Ophidia operators are some examples that fall within this component.

Data  Analytics  and  Mining  modules  

The modules include the processing engines and computing frameworks to run data mining and analytics tasks on big volumes of data. The block includes several submodules such as Entity Matching (EM), Data Quality as a Service (DQaS), the predictive and descriptive model services and On-Line Analytical Processing (OLAP). These submodules will execute their computation through the engines and frameworks, exploiting a set of libraries and functionalities available in the Data Analytics and Mining toolbox. Several technologies could be used to implement the analytics and mining modules, such as Apache Spark [R11], Hadoop MapReduce [R12, R13], Ophidia, Apache Hive [R14] and Druid.

6.1.3. Entity  Matching  Service  

The Entity Matching (EM) task, i.e., the problem of identifying records that refer to the same real-world entity, is known to be challenging due to its pair-wise comparison nature, especially when the datasets involved in the matching process have a high volume (big data). Since the EM task is of critical importance for data cleaning and integration, e.g., to find duplicate points of interest in different databases, studying how EM can benefit from modern parallel computing programming models, such as Apache Spark (Spark), has become an important demand. For this reason, the EM service, to be provided by the main API of the WP4 architecture, consists of a bag of tools and functions that can process the EM task (e.g., geo-matching) in parallel by using Apache Spark and Hadoop's MapReduce (MR).

The EM service will serve requests from applications/systems interested in submitting EM tasks to the cluster environment. To this end, the service will establish a connection to the Hadoop eco-system (WP3) to perform the necessary operations, such as submitting artifacts (e.g. datasets) to HDFS or starting the execution of MR and Spark jobs.
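As an illustration of the pair-wise comparison at the core of EM, the following PySpark sketch blocks records from two sources by a crude key and compares only record pairs inside the same block; the blocking key, similarity function and threshold are illustrative choices, not the algorithms that the EM service will actually provide.

# Minimal PySpark sketch of blocking + pair-wise comparison for Entity Matching.
from difflib import SequenceMatcher
from pyspark import SparkContext

sc = SparkContext(appName="bigsea-em-sketch")

source_a = sc.parallelize([("a1", "Praca Tiradentes"), ("a2", "Jardim Botanico")])
source_b = sc.parallelize([("b1", "Praça Tiradentes"), ("b2", "Rua XV de Novembro")])

def block_key(record):
    return record[1][0].lower()                     # crude blocking: first character of the name

pairs = (source_a.keyBy(block_key)
                 .join(source_b.keyBy(block_key))   # only compare records in the same block
                 .map(lambda kv: kv[1]))            # -> ((id_a, name_a), (id_b, name_b))

def similarity(pair):
    (id_a, name_a), (id_b, name_b) = pair
    return (id_a, id_b, SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio())

matches = pairs.map(similarity).filter(lambda t: t[2] >= 0.8)
print(matches.collect())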

6.1.4. Data  Quality  as  a  Service    

The Data Quality (DQ) task assesses the quality level of the sources, addressing the veracity issues of the big data scenario. The DQ service annotates the sources with metadata able to provide knowledge about the reliability and usefulness of the data values involved in the various platform applications. The data quality values will be calculated periodically, or the service can be triggered by applications/systems interested in updated quality information. In order to meet the big data requirements related to velocity, the algorithms will be implemented using Apache Spark, which supports parallel programming. DQ metadata are stored in the DQ repository included in the platform-level data.
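As an illustration, the following PySpark sketch computes two simple quality indicators (completeness and uniqueness degree per attribute) over a level-1 dataset; the input path and the exact metric definitions are illustrative assumptions.

# Minimal sketch of a Spark-based data quality check on a level-1 dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bigsea-dq-sketch").getOrCreate()
df = spark.read.json("hdfs:///bigsea/level1/tweets/")      # hypothetical level-1 location

total = df.count()
for column in df.columns:
    non_null = df.filter(F.col(column).isNotNull()).count()
    distinct = df.select(column).distinct().count()
    print(column,
          "completeness=%.2f" % (non_null / float(total)),
          "uniqueness=%.2f" % (distinct / float(total)))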

6.1.5. Extraction,  Transformation  and  Load  

As described in the data sources section (Section 4), the eco-system can potentially include several heterogeneous data sources. This data could require some pre-processing steps in order to transform or normalize the data before loading it into the (level-1) storage: streaming data could require tools for continuous pre-processing; some data sources could also require anonymization techniques to protect personal information and comply with the data owner's policies; data sources providing information from the same domain could require various types of transformation to conform the data to a common view.

Extraction, Transformation and Load (ETL) procedures will mainly exploit the data ingestion and streaming processing components and, to a lesser extent, the analytical and mining functionalities (optionally with some security support regarding privacy). For example, the Entity Matching service could be exploited during the ETL phase to identify potential matches in records from different sources, whereas streaming processing could be used to filter and transform streams of social network data. Spatial data could require location filtering, format change and projection to a standard coordinate system.
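As an illustration of the projection step, the following Python sketch uses pyproj to convert coordinates from a UTM zone covering Curitiba to WGS84 longitude/latitude; the source reference system (EPSG:32722) and the sample coordinates are illustrative assumptions.

# Minimal sketch: re-project spatial coordinates to a standard coordinate system.
from pyproj import Proj, transform

utm22s = Proj(init="epsg:32722")    # WGS84 / UTM zone 22S (assumed source system)
wgs84 = Proj(init="epsg:4326")      # WGS84 lon/lat (common reference)

easting, northing = 673000.0, 7184000.0             # example UTM coordinates
lon, lat = transform(utm22s, wgs84, easting, northing)
print("lon=%.5f lat=%.5f" % (lon, lat))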

Data  quality  information  from  DQaS  will  provide  a  solid  basis  for  tagging  the  different  data  sources  in  terms  of  quality.  DQaS  could  run  on  the  data  during  the  ETL  phase  to  annotate  data  quality-­‐based  metadata.

6.2. Sequence Diagrams

Sequence diagrams are used to represent interactions between objects and the order in which they occur along the lifeline. In this section, the objects are the high-level architectural components defined in Section 6.1.

Use cases and all their user stories are defined in document D7.1. Starting from these user stories, for each of the use cases defined in D7.1 some scenarios and the corresponding UML sequence diagrams, relevant to highlight the interactions among WP4 components, are provided.

6.2.1. User  Stories  for  UC1:  Data  Acquisition  

The user stories for data acquisition and integration focus on retrieving the data, the periodicity and mechanisms for retrieving the data, the basic filtering of the data, and raw and filtered data visualization.


This  use  case  is  mainly  intended  for  data  curators  and  data  scientists  that  have  to  prepare  and  understand  the  main  features  of  the  data  sources.

Scenario  1.1:  Data  ingestion  

This scenario has been derived from US1.1 (see D7.1). The data load process retrieves the data from the original sources (level-0 data), including metadata, transforms it into a common format (level-1 data) and stores it into a database system. In the case of streaming data, a streaming processing technology consumes the data, transforms it and stores it into a proper database.

Figure 6. Sequence diagram for scenario 1.1

Scenario  1.2:  Data  selection  and  filtering  

This   scenario   has   been   derived   from   user   story   US1.2.   A   filtering   process,   invoked   by   a   user/developer,  selects  data  from  storage  according  to  a  particular  filter.  For  example,  in  the  case  of  dynamic  spatial  data,  filters  can  include  transportation  lines,  geographic  zones,  specific  users  or  data  periods.  

Figure  7.  Sequence  diagram  for  scenario  1.2  


6.2.2. User  Stories  for  UC2:  Descriptive  Models  

The user stories for the descriptive models focus on the analysis of trajectories and the associated variables that could affect their distribution (weather, date and time, social network information, etc.). The Descriptive Models are built as a service (DM service), targeting data scientists in traffic management who want to discover correlations and build up higher-level services.

Scenario  2.1:  Trajectory  extraction  and  analysis  

This scenario has been derived from user story US2.1. The DM service queries the storage to retrieve (level-1) trajectory data, performs statistical analysis and stores the results (level-2 data) in a database so that they can be reused in further analyses. Queries may specify a period of time or a geographical region to filter the trajectories.

Figure  8.  Sequence  diagram  for  scenario  2.1  

Scenario  2.2:  Trajectory  clustering    

This scenario has been derived from user story US2.3. The DM service requests a clustering of the trajectory data (level-1 data) available in the storage. The service can specify the type of clustering algorithm and the parameters to be used. The next diagram (Figure 9) is a specialization of the previous one for trajectory clustering.


Figure  9.  Sequence  diagram  for  scenario  2.2  
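As an illustration of how this scenario could be realized with the technologies discussed in Section 6.1, the following PySpark sketch clusters trajectory feature vectors with Spark MLlib's k-means and stores the result as level-2 data; the input path, feature columns and number of clusters are illustrative assumptions.

# Minimal sketch of trajectory clustering with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("bigsea-trajectory-clustering-sketch").getOrCreate()
df = spark.read.parquet("hdfs:///bigsea/level1/trajectories/")     # hypothetical level-1 data

assembler = VectorAssembler(
    inputCols=["origin_lon", "origin_lat", "dest_lon", "dest_lat"],  # assumed feature columns
    outputCol="features")
features = assembler.transform(df)

model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
clustered = model.transform(features)                              # adds a "prediction" column
clustered.write.mode("overwrite").parquet("hdfs:///bigsea/level2/trajectory_clusters/")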

6.2.3. User  Stories  for  UC3:  Predictive  Models  

The user stories for the Predictive Models deal with the training of predictive models on the descriptive data obtained in the previous use case. These user stories appear to be more compute-intensive than data-intensive. Prediction will also include the projection of models, which is not compute-intensive but should work in interactive time.

Scenario  3.1:  Prediction  model  training    

This scenario has been derived from user story US3.1. The PM service requests the training of a predictive model with the most recent data available in storage (level-2). This request can include the model type (e.g. random forest, recurrent ANN), training procedure (e.g. 10-fold cross-validation) and training data (e.g. last month) to be considered. After the training phase, the model is stored in the system (as level-2 data) to be accessed later to run predictions.

 Figure  10.  Sequence  diagram  for  scenario  3.1  
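As an illustration of how this scenario (and the subsequent prediction of scenario 3.2) could be realized with Spark MLlib, the following PySpark sketch trains a random forest classifier and persists the fitted model as level-2 data; paths, column names and hyper-parameters are illustrative assumptions.

# Minimal sketch of predictive model training and storage with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel

spark = SparkSession.builder.appName("bigsea-pm-training-sketch").getOrCreate()
train = spark.read.parquet("hdfs:///bigsea/level2/training_data/")  # assumed "features"/"label" columns

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50)
model = rf.fit(train)
model.write().overwrite().save("hdfs:///bigsea/level2/models/traffic_rf")   # stored as level-2 data

# Later (scenario 3.2): reload the model and run predictions on new feature data.
reloaded = RandomForestClassificationModel.load("hdfs:///bigsea/level2/models/traffic_rf")
predictions = reloaded.transform(spark.read.parquet("hdfs:///bigsea/level2/new_features/"))
predictions.select("prediction").show(5)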


Scenario  3.2:  Execute  prediction  

This scenario has been derived from user story US3.2. The PM service requests a prediction to be made using a previously trained model (level-2 data). Feature data and the model are loaded from the storage.

Figure  11.  Sequence  diagram  for  scenario  3.2  

6.2.4. Other  interactions  between  WP4  components  

During the requirement analysis, new interactions between WP4 components were found. Such interactions are related to background processing aimed at supporting the realization of the previously identified use cases.

Scenario  4.1:  Data  acquisition  and  online  streaming  processing  for  data  mining  

Figure 12 shows a data acquisition and online streaming processing scenario for data mining, where a data producer ingests data (level-0) into the processing pipeline. In this interaction, data are continuously processed as soon as they arrive in the data ingestion component. The process is started on a schedule, and the data ingestion component then reads data (zero to many records) from the producers and stores them in a message queueing system. Stream processing components consume messages from the queue, each containing one processing item (one record). Some operations can require loading different predictive models (level-2), for instance, to classify a processing item. Other available operations include filter (removes an item if it is considered invalid or irrelevant to the next stages in the pipeline), map (transforms a processing item, for example, by applying a predictive model) and reduce (aggregates data). Finally, derived data (level-1) are updated and, in some cases, descriptive models (level-2) are also updated (if an aggregation was performed).

This   interaction   is   similar   to   the  user   story  US1.1  defined   in  D7.1.  However,   the  user   story  US1.1   in  D7.1  only  mentions  stationary  data  and  a  process  started  by  a  user,  whereas  Figure  12  also  includes  other  types  of  data  (i.e.  social  data)  and  it  represents  a  synchronous  automatic  batch  process.  


 Figure  12.  Sequence  diagram  for  scenario  4.1  

6.3. Exposed QoS Metrics

The QoS IaaS (WP3) focuses on the deployment and configuration of the infrastructure where the data analytics jobs run, and on the execution of elasticity rules to adapt the resources. QoS profiles for the applications are defined in advance by measuring the applications' performance.

The performance of the data analytics and mining applications will be tracked by the monitoring system in order to request additional resources when needed. WP4 applications will provide metrics, alarms and logs to the WP3 monitoring system so it can estimate the performance level of the current execution. In this way the service can provide proactive elasticity, adjusting the system to guarantee the QoS of the applications.

WP4 components provide different application-level QoS metrics related to their function in the system. It is difficult to enumerate all QoS metrics at this early stage of the project, because not all technologies have been selected, nor are all possible uses for the metrics completely clear.

Generically,   we   can   define   a   set   of   common   metrics   for   each   type   of   WP4   component   that   might   be  relevant.  Future  documentation  will  describe  specific  metrics  used  in  the  project  and  their  purpose.

6.3.1. Java  Virtual  Machine  metrics  

Many  WP4   technologies   are   implemented   in   programming   languages   targeting   the   Java  Virtual  Machine  (JVM).  These  technologies  share  a  common  set  of  metrics  related  to  the  JVM  itself.  Table  4  contains  some  examples  of  metrics  available  for  the  JVM.  For  a  more  complete  list  of  possible  JVM  metrics,  see  [R15].  For  each  metric,  a  set  of  tags  or  naming  convention  is  needed  to  identify  the  JVM  process  generating  it.


Metric Description

MemHeapUsedM Current  heap  memory  used  in  MB

MemHeapMaxM Max  heap  memory  size  in  MB

GcCount Total  GC  (Garbage  Collector)  count  

GcTimeMillis Total  GC  time  in  msec

Table  4.  Some  examples  of  JVM  metrics  available  

6.3.2. Data  Storage  Metrics  

Data storage metrics are related to the distributed file system. Metric records contain tags such as HAState (high availability state) and Hostname (e.g. data node or name node hostname) as additional information along with the metrics. Metrics contain information about the cluster, block and file state, and the space used, available and total in the distributed file system. Some potential metrics are shown in the table below.

Metric Description

CapacityTotal Current  raw  capacity  of  data  nodes  in  bytes

CapacityUsed Current  used  capacity  across  all  data  nodes  in  bytes

CapacityRemaining Current  remaining  capacity  across  all  data  nodes  in  bytes

CorruptBlocks Current  number  of  blocks  with  corrupt  replicas

Table  5.  Distributed  file  system  metrics  

6.3.3. Data  Access  Metrics  

Data access metrics are used to monitor database size growth and concurrent user sessions in the data storage technologies used in the project. For the QoS monitoring system, these metrics can be useful to identify situations where a data storage system needs to be scaled, for instance, by data replication and sharding.

Metric Description

DatabaseSize Database  size,  in  bytes

ConcurrentUsers Number  of  concurrent  users  connected  to  server

Table  6.  Data  access  metrics  

6.3.4. Data  Ingestion  and  Streaming  Processing  Metrics  

Data  ingestion  is  the  process  of  obtaining  and  importing  data  for  immediate  use  or  storage  in  a  database.  In  the  context  of  WP4  components,  data  ingested  will  be  consumed  by  stream  processing  components  and  is  temporarily  stored  in  processing  queues.  

Some WP4 components may require different processing times. For example, it takes less time to execute an algorithm that simply classifies a record using a precomputed model than to execute a route optimization. Message queues help these tasks to operate efficiently by offering a buffer layer. This buffer controls and optimizes the data flow speed through the system.

Metrics in this category should have tags or use naming conventions to identify the queue associated with the metric. Some potential metrics are listed below:

Metric Description

QueueLength Quantity  of  items  waiting  in  the  queue  for  processing

ProducerRequestRate How  many  items  per  second  are  being  stored  in  the  queue

MessagesPerSec How  many  items  per  second  are  leaving  the  queue  (processed)

WaitingAck Quantity of items waiting for consumer acknowledgement (committed)

Table  7.  Data  ingestion  and  streaming  processing  metrics  

6.3.5. Data  Analytics  and  Data  Mining  Metrics  

For the QoS monitoring system, data analytics and data mining algorithms can be seen as processes or jobs running in the WP3 infrastructure. This abstraction allows the separation of metrics related to QoS from those related to the application. QoS metrics are used by other components of WP3 to model static and proactive policies for horizontal and vertical elasticity (T3.2 and T3.4). The required metrics are listed below. Complementary information should be provided as tags, or prepended/appended to the metric name, to segment jobs into different types, categories and domains.

Metric Description

JobExecutionTime Job  execution  time  in  seconds

JobTotalOfTasks Total number of tasks in a job

JobStartTime Job  start  time  (timestamp)

Table  8.  Data  analytics  &  mining  metrics  

6.3.6. Data  Mining  and  Analytical  Toolbox  

We do not foresee any relevant metric for the tools and libraries in this toolbox regarding QoS monitoring. All metrics are very specific, algorithm- and technology-dependent, and from the QoS monitoring perspective they could be replaced by one or more system metrics such as CPU or memory usage. Very specific metrics should be stored in another type of repository, not in the QoS monitoring system.

6.4. Data Management API

The data management API allows administrators, developers and programming models to manage data sources in the project infrastructure. Users can upload, download, create, update, delete and query both data and their associated metadata. This API is mainly intended to address requirement R1.5 from D7.1.

The API is based on REST principles, and client programs can be developed in different programming languages using the HTTP/HTTPS protocols. A command line tool that interacts with the API may be developed.


Requests  must   be   authenticated  using   a   client   token.   The   first   version  of   the  API   deals   only  with   simple  token  authentication  that  can  be  obtained  by  an  anonymous  request  containing  a  system  user  and  her/his  password   as   parameters.   Future   versions  may   use  more   sophisticated   authentication   processes,   such   as  OAuth.  

In this document, the data management API is described in a high-level manner. We foresee that the API will evolve during the project and that a dedicated tool, such as Swagger [R16], will be used to document it.

6.4.1. Resources  URI  

In addition to utilizing the HTTP verbs appropriately, resource naming is the most important concept to grasp when creating an understandable, easily leveraged service API. In a RESTful API, a resource is identified by a URI (Uniform Resource Identifier). The final version of the API may change, but the general URI format is protocol://server:port/v1.0/resource, where protocol is http or https, and server and port are the API HTTP server address and listen port, respectively. Note that v1.0 denotes the current API version.

6.4.2. Authentication  Calls  

Requests to the API are authenticated using a token. To obtain this token, a call to the authentication resource is required. This call must provide a valid user and password as parameters in order to receive a valid token. Authentication parameters are checked against services provided by the WP6 infrastructure.

Authentication calls follow the definition below:

● /v1/authenticate (POST)
Parameters: User, password
Return: Valid authentication token or error status
Description: Authenticates user and ensures he/she can manage data.

6.4.3. Database  Management  Calls  

These calls allow users to create, update, delete, edit, upload and download data and metadata stored in the level-1 and level-2 WP4 infrastructure. For each call to succeed, the caller (user) must have the required permission.

● /v1/databases (POST)
Parameters: Auth. token, database id, other metadata
Return: Status of the operation (success, failure)
Description: Creates a new empty database using metadata information.

● /v1/databases (GET)
Parameters: Auth. token, filter parameters
Return: Status of the operation. If success, returns the list of databases
Description: Lists all databases visible to the authenticated user according to the filter parameters.

● /v1/databases/id (PATCH)
Parameters: Auth. token, database id, other metadata
Return: Status of the operation (success, failure)
Description: Updates an existing database's information or metadata using the provided parameters.

● /v1/databases/id (DELETE)
Parameters: Auth. token, database id
Return: Status of the operation (success, failure)
Description: Drops an existing database.

● /v1/databases/grant/id (POST)
Parameters: Auth. token, database id, permission, user
Return: Status of the operation (success, failure)
Description: Grants a permission associated to the database to a user.

● /v1/databases/revoke/id (POST)
Parameters: Auth. token, database id, permission, user
Return: Status of the operation (success, failure)
Description: Revokes a permission associated to the database from a user.

● /v1/databases/upload/id (POST)
Parameters: Auth. token, database id, database content, action (replace, append)
Return: Status of the operation (success, failure)
Description: Uploads a compressed file to the API server and imports it into the database, replacing or appending to existing records.

● /v1/databases/export/id (POST)
Parameters: Auth. token, database id, target
Return: Status of the operation (success, failure)
Description: Exports a database to a target (existing database or web download).
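As an illustration, the following Python sketch shows how a client could use the /v1/authenticate and /v1/databases calls defined above; the base URL, the response fields and the way the token is passed back to the server are assumptions to be fixed in the final API documentation.

# Minimal sketch of a data management API client (REST over HTTPS).
import requests

BASE_URL = "https://api.example.org/v1"        # placeholder server address

# Obtain an authentication token with a system user and password.
resp = requests.post(BASE_URL + "/authenticate",
                     data={"user": "wp4user", "password": "secret"})
resp.raise_for_status()
token = resp.json()["token"]                   # assumed response field

# List the databases visible to the authenticated user.
resp = requests.get(BASE_URL + "/databases", params={"token": token})
resp.raise_for_status()
for database in resp.json():
    print(database)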

6.5. Security Aspects

The main security concerns of the big and fast data eco-system are related to user authentication/authorization for the various big data technologies, as well as data privacy. In particular:

I. each eco-system component and, in turn, each technology/software implementing the component, may require user authentication/authorization in order to identify the user, their role and privileges, and grant access to the data stored in the system. Some tools, on the other hand, may not provide any features for user authentication. Hence, since the eco-system is going to integrate several different big data and storage technologies, user authentication and authorization should transparently deal with the different modes provided by the various technologies in the eco-system. To this end, a system that transparently authenticates and authorizes a user across the different components in the eco-system would represent a preferable solution;

II. data coming from external sources, or derived from them through the execution of data mining and analytics, could contain sensitive information that must be handled appropriately to avoid data breaches and unintentional data disclosure. Additionally, even though raw data may not contain sensitive information, data derived from them could expose it. For example, in the case of descriptive models, derived data describing trajectories could be used to infer a user's habits. Data privacy and protection measures (e.g. anonymization, obfuscation, encryption, etc.) are, hence, required to avoid these types of threats. Privacy measures should also try to preserve, as much as possible, the original information content, reducing the loss of potentially useful information.

As depicted in Figure 4, security features could be required in two different sections of the WP4 eco-system: when a data source is ingested and when it is accessed, queried or processed. Data policy constraints and sensitive information in the external data sources may require some changes to the data before it is stored in the eco-system. These policies, defined by the source provider, could also define the conditions for accessing the data. Moreover, sensitive information could be inferred when accessing the data or when applying particular types of mining and analytical operations. This should also be taken into account when storing derived data. Security threats and possible solutions are described in more detail in deliverable D6.1.

The  requirements  of  the  big  data  eco-­‐system  related  to  security  are  (as  defined  in  D7.1):

RD.3.  The  infrastructure  must  store  the  data  processing  products,  taking  the  necessary  steps  to  ensure  data  persistence  and  data  protection,  when  necessary.  

RD.4.   The   infrastructure  must   provide   access   to   authorized   applications   to   access   and   process   the   data,  supporting  the  application  data  processing  model.  

RA.2.   The   infrastructure  must   support   end-­‐user   authorization   for   accessing   the  data  and   the  applications  deployed  with  the  infrastructure.  


7. TOOLS EVALUATION

The big data world comprises a wide range of software and technologies that provide, among others, OLAP, streaming processing, parallel computing frameworks, NoSQL databases, distributed storage, machine learning libraries, data analytics, job scheduling, system management and visualization. Hence, it is very important to identify, in this crowded environment, the right set of tools that could potentially fit the project needs. To this end, Section 7 is organized as follows: first, a procedure to describe the different components is presented in sub-section 7.1, whereas a complete list of tools is provided in sub-sections 7.2 (Data Storage), 7.3 (Data Access), 7.4 (Data Ingestion and Streaming Processing), 7.5 (Data Analytics and Mining), and 7.6 (Data Mining and Analytics Toolbox). A final assessment is then provided in Section 7.7.

7.1. Procedure to describe components

Table 8 defines a template to describe the potential components to be used in the EUBra-BIGSEA big data eco-system. A comprehensive set of candidate technologies has been evaluated (for each block depicted in Figure 3) in order to identify those most suitable for the EUBra-BIGSEA project purposes. The following subsections will use this template for the description and evaluation of some state-of-the-art big data technologies for the different components identified.

Identification Name   and   layer   where   the   component   will   be   applied   (a   component   may   be  applied  to  different  layers)

Type Database,   processing   engine,   machine   learning   library,   storage   system,   module,  subprogram,  control  procedure,  framework,  service,  etc.

License License  model.

Current  Version Current  version  and  release  date  (at  the  time  of  the  analysis).

Website URL  to  official  website,  documentation,  references  and  to  source  code  repositories.

Purpose Brief  description  of   the  key  features  that  could  be  relevant  to  the  project,  what  the  component   does,   its   main   purpose,   the   transformation   process,   the   specific  processed   inputs,   the  used  algorithms,   the  produced  outputs,  where  the  data   items  are  stored,  and  which  data  items  are  modified.

High  Level  Architecture

The  internal  structure  of  the  component  and  their  inner  interactions  that  are  relevant  for  the  project  requirements.

Dependencies Other components required by this component and how this component is used by other components. Interaction details such as timing, interaction conditions (such as order of execution and data sharing), and responsibility for creation, duplication, use, storage, and elimination of components.

Interfaces  and  languages  supported

Detailed   descriptions   of   all   external   and   internal   interfaces   as   well   as   of   any  mechanisms   for   communicating   through   messages,   parameters,   or   common   data  areas.  This  includes  possible  programming  languages  supported  by  the  system.


Security  support Security   mechanisms   provided   by   the   component   (if   any)   relevant   for   the   project,  such  as  AAA,  data  encryption,  ACL,  etc.

Data  source Type  of  data,  within  the  BIGSEA  project,  handled  or  processed  by  this  software.

Potential  usage  within  BIGSEA

How  the  tool  can  be  exploited  in  the  context  of  the  WP4  of  the  project.

Table  9.  Template  used  to  describe  the  potential  components  to  be  used  in  the  Big  Data  eco-­‐system.  

7.2. Data  Storage  

7.2.1. HDFS    

Identification Apache  Hadoop  Distributed  File  System  (HDFS)

Type Distributed  file  system  -­‐  Distributed  storage

License Apache  License:  V2.0

Current  Version

Stable release: V2.7.2 - Jan/2016.

Note:  HDFS  is  integrated  with  Apache  Hadoop.

Website Website/Documentation:  https://hadoop.apache.org/docs/r2.7.2/hadoop-­‐project-­‐dist/hadoop-­‐hdfs/HdfsUserGuide.html

Download  /  Source  code:

http://hadoop.apache.org/version_control.html

Purpose HDFS (Hadoop Distributed File System) provides high-throughput access to application data, runs on commodity hardware and is suitable for Big Data. It can store files up to terabyte or even petabyte scale. Its design was inspired by the Google File System (GFS) [R17]. Its main features include fault tolerance through data replication.

High  Level  Architecture

HDFS uses a WORM (write-once-read-many) model, in which data are written to disk a single time and then read many times. Scalability, robustness and accessibility make it suitable for use as a general-purpose file system.

The data stored on HDFS are replicated in the cluster to ensure fault tolerance. HDFS ensures data integrity and can detect loss of connectivity when a node is down. The main concepts are:

● Datanode: nodes that own the data;
● Namenode: node that manages the file access operations.


[Image  source:  http://www.ibm.com/developerworks/br/library/wa-­‐introhdfs/fig1.gif]

There is only one Namenode in a cluster and many Datanodes. The Namenode stores information such as the number of blocks, on which Datanode the data are stored, the number of data replicas, and other aspects. The Datanodes store the actual data.

Dependencies SSH connections between the nodes of the cluster, which establish communication and data transfer.

Interfaces  and  languages  supported

HDFS  was  designed  in  Java  for  Hadoop  Framework,  therefore  any  machine  that  supports  Java  is  able  to  run  it.

HDFS  Java  API,  WebHDFS  REST  API  and  libhdfs  C  API,  as  well  as  a  Web  interface  and  CLI  shells  are  available.

Security  support

Security is based on file permissions and user identity. In addition, HDFS supports Kerberos authentication (for users) and encryption (for data). The full permission configuration is described in the documentation.

Data source HDFS is the storage layer for many processing systems like Hadoop MapReduce and Spark, so different data types can be stored in it for processing. Data can be stored in HDFS in several formats, including SequenceFile.

Potential  usage  within  BIGSEA

HDFS can increase the capability of partner tools to handle big data, providing a scalable environment with fault tolerance and data replication.
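As a simple illustration of the WebHDFS REST API mentioned above, the minimal Python sketch below lists a directory and reads a file; the NameNode address, user name and paths are placeholders to be adapted to the actual deployment.

import requests

# Placeholder NameNode WebHDFS endpoint (default HTTP port for Hadoop 2.x is 50070).
NAMENODE = "http://namenode.example.org:50070"

# List the contents of an HDFS directory.
resp = requests.get(NAMENODE + "/webhdfs/v1/data/ingested?op=LISTSTATUS&user.name=bigsea")
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"])

# Read a file; the NameNode redirects the request to the DataNode holding the data.
data = requests.get(NAMENODE + "/webhdfs/v1/data/ingested/sample.csv?op=OPEN&user.name=bigsea")
print(data.text[:200])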

7.3. Data  Access  

7.3.1. PostGIS  

Identification PostGIS  


Type Spatial  database  extender  for  PostgreSQL

License GNU  GPL  v2.0

Current  Version

 v2.2.2  -­‐  Mar/2016  

Website Website:  http://postgis.net

Documentation:  http://postgis.net/documentation/

Download/Source  code:  http://postgis.net/install/

Purpose PostGIS is an open-source spatial database extension for the PostgreSQL ORDBMS, adding support for geographic objects and allowing location queries to be run in SQL. It follows the Open Geospatial Consortium's "Simple Features for SQL" specification [R18] and provides several features such as processing of vector and raster data, spatial reprojection, import/export of ESRI shapefiles, and 3D object support.

In   particular,   PostGIS   adds   extra   types   (geometry,   geography,   raster   and   others)   and  functions,  operators,  and  index  enhancements,  that  apply  to  these  spatial  types,  to  the  PostgreSQL  database.

High  Level  Architecture

PostGIS  extensions  run  within  the  PostgreSQL  DBMS  environment.

Dependencies PostgreSQL  and  some  spatial  tools  or  libraries  (Proj4,  GEOS,  GDAL).

Interfaces  and  languages  supported

SQL,   with   the   additional   geo-­‐spatial   functions   and   types,   is   the   language   used   to  perform  query  on  the  data.  

Interfaces and clients are those provided by PostgreSQL. Psql is the PostgreSQL interactive CLI terminal. A C client library (libpq) is also provided, and ECPG allows SQL to be embedded in C code.

Security  support

Database access permissions in PostgreSQL are role-based.

Different authentication methods are available to authenticate a client to the server; these include: password-based, GSSAPI, SSPI, Kerberos, ident-based, LDAP, RADIUS, certificate-based and PAM. SSL can be used to secure client-server connections, also exploiting client-side certificates. SSH tunnels can be used to encrypt the communication when SSL cannot be used.

Encryption  is  available  for  password  fields  (by  default),  table  columns  and  when  data  are  transferred  over  the  network.  

Data  source Provides  features  to  work  with  data  in  GIS  formats  (e.g.  shapefiles).

Potential  usage  within  BIGSEA

It could provide useful features for the management of geo-spatial data, such as stationary and dynamic spatial data sources, and for the integration of existing tools to show geo-spatial data as layers, including layers with boundaries and city infrastructure.
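To exemplify the kind of location query mentioned above, the following minimal Python sketch (using psycopg2) retrieves the bus stops within 500 metres of a reference point; the connection parameters and the bus_stops table with its geom column are hypothetical.

import psycopg2

# Hypothetical connection parameters and table layout.
conn = psycopg2.connect(host="postgis.example.org", dbname="mobility",
                        user="bigsea", password="secret")
cur = conn.cursor()

# Bus stops within 500 metres of a reference point (longitude, latitude in WGS84).
cur.execute("""
    SELECT name, ST_AsText(geom)
    FROM bus_stops
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, 500)
""", (-43.9378, -19.9208))

for name, wkt in cur.fetchall():
    print(name, wkt)
cur.close()
conn.close()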


7.3.2. MongoDB  

Identification MongoDB  

Type NoSQL  Database

License - Database server: GNU AGPL
- Drivers: Apache License v2.0
- Commercial license available

Current  Version

3.2.7  -­‐  June/2016

Website Website:  https://www.mongodb.com/

Documentation:  https://docs.mongodb.com/

Download/Source  code:  

https://www.mongodb.com/download-­‐center#community

Purpose MongoDB   is   a   multi-­‐platform   document-­‐oriented   database   that   provides   high  performance,  high  availability   and  easy   scalability.  MongoDB  works  on   the   concept  of  document  collections.

High  Level  Architecture

[Image  source:  https://www.mongodb.com/assets/images/products/application-­‐architecture.png]

With  MongoDB’s  flexible  storage  architecture,  the  database  automatically  manages  the  movement  of   data  between   storage  engine   technologies  using  native   replication.   This  approach  significantly  reduces  developer  and  operational  complexity  when  compared  to  running  multiple  distinct  database  technologies.  Users  can  leverage  the  same  MongoDB  query   language,  data  model,   scaling,   security,   and  operational   tooling  across  different  parts  of  their  application,  with  each  powered  by  the  optimal  storage  engine.

Through  the  use  of  a  flexible  storage  architecture,  MongoDB  can  be  extended  with  new  capabilities,   and   configured   for   optimal   use   of   specific   hardware   architectures.  MongoDB   uniquely   allows   users   to   mix   and  match  multiple   storage   engines   within   a  single   deployment.   This   flexibility   provides   a   more   simple   and   reliable   approach   to  meeting   diverse   application   needs   for   data.   Traditionally,   multiple   database  


technologies  would  need   to  be  managed   to  meet   these  needs,  with   complex,   custom  integration   code   to   move   data   between   the   technologies,   and   to   ensure   consistent,  secure  access.

Dependencies None

Interfaces  and  languages  supported

The  official  drivers  (distributed  under  the  Apache  License,  Version  2.0)  of  the  MongoDB  allow  developers  to  connect  from  various  programming  languages.

Security  support

MongoDB offers a basic set of security features:
1. control of read and write access to data;
2. protection of the integrity and confidentiality of the stored data;
3. control of modifications to the database system configuration;
4. privilege levels for different user types (administrators, applications, etc.);
5. auditing of sensitive operations;
6. stable and secure operation in a potentially hostile environment.

These   security   requirements   can   be   achieved   in   different   ways.   A   database   is   often  placed  unprotected  on  a  “secured”,  internal  network.  This  is  an  idealized  scenario  since  no   network   is   entirely   secure,   architecture   changes   over   time,   and   a   considerable  number  of  successful  breaches  are  from  internal  sources.  A  defense-­‐in-­‐depth  approach  is   therefore   recommended   when   implementing   an   application’s   infrastructure.   While  MongoDB’s   security   features   help   to   improve   the   overall   security   posture,   security   is  only  as  strong  as  the  weakest  link  in  the  chain.

Data source MongoDB will store weather data and information from social networks, as well as other raw data needing intermediate and temporary storage before transformation.

Potential  usage  within  BIGSEA

MongoDB should be used to store and manipulate data in JSON format, such as social network data.
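As a small illustration of how such JSON documents could be handled, the following minimal Python sketch (using pymongo) inserts and queries social-network-like documents; the host, database and collection names, as well as the document layout, are placeholders.

from pymongo import MongoClient

# Placeholder connection and namespace.
client = MongoClient("mongodb://mongo.example.org:27017")
posts = client["bigsea"]["social_posts"]

# Store a (simplified) social network message as a JSON document.
posts.insert_one({"user": "a1b2c3", "text": "Heavy traffic on Av. Antonio Carlos",
                  "lang": "pt", "created_at": "2016-06-01T08:15:00Z",
                  "geo": {"type": "Point", "coordinates": [-43.9378, -19.9208]}})

# Retrieve all messages mentioning traffic (case-insensitive).
for doc in posts.find({"text": {"$regex": "traffic", "$options": "i"}}):
    print(doc["user"], doc["text"])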

7.3.3. Apache  HBase  

Identification Apache  HBase

Type NoSQL  distributed  database

License Apache  License  V2.0

Current  Version

V1.1.5  -­‐  May/2016

Website Website:  https://hbase.apache.org

Documentation:  https://hbase.apache.org/book.html

Download:  http://www.apache.org/dyn/closer.cgi/hbase/


Purpose Apache HBase™ is the Hadoop database, a distributed, scalable big data store. It can be used when random, real-time read/write access to Big Data is required. The project's goal is the hosting of very large tables, composed of billions of rows and millions of columns, on top of clusters of commodity hardware.

HBase  is  an  open-­‐source,  distributed,  versioned,  non-­‐relational  database  modeled  after  Google's   “Bigtable:  A  Distributed   Storage   System   for   Structured  Data”  by  Chang  et   al.  [R19].   Similarly   to   Bigtable,   which   leverages   the   distributed   data   storage   provided   by  GFS,  Apache  HBase  provides  Bigtable-­‐like  capabilities  on  top  of  Hadoop  and  HDFS.

High  Level  Architecture

In a distributed configuration, an HBase cluster contains multiple nodes, each of which runs one or more HBase daemons. The following types of nodes are available:

● Master Server (HMaster): the server responsible for monitoring all RegionServer instances in the cluster and the interface for all metadata changes. A multi-Master environment includes a primary and a backup Master instance;
● Region Servers (HRegionServer): multiple servers responsible for serving and managing regions. Regions are the basic element of availability and distribution for tables, each consisting of a subset of the table's data. Data are stored on HDFS (in the Hadoop DataNodes);
● ZooKeeper nodes: a distributed Apache HBase installation depends on a running ZooKeeper cluster to coordinate the whole cluster.

Dependencies Apache Zookeeper for the coordination of the cluster. Fully-distributed mode also requires an HDFS cluster, since in this mode HBase can only run on HDFS. MapReduce and Spark can be integrated with HBase.

Interfaces  and  languages  supported

The   Apache   HBase   Shell   is   (J)Ruby's   Interactive   Ruby   Shell   (IRB)   with   some   HBase  commands  added.  

HBase   native   API   is   in   Java,   however   access   through   non-­‐JVM   languages   and   through  custom  protocols  is  possible.  

It  includes  a  REST  and  Thrift  API  as  well  as  C/C++,  Scala  and  Jython  client  drivers.  It  also  provides  a  web  interface.

Security  support

HBase  provides  mechanisms   to   secure   various   components   and   aspects   of  HBase   and  how  it  relates  to  the  rest  of  the  Hadoop  infrastructure,  as  well  as  clients  and  resources  outside  Hadoop:

● secure HTTP (HTTPS) connections to the Web UI;
● optional SASL authentication of clients;
● secure HBase requires secure ZooKeeper and HDFS, so that users cannot access and/or modify the metadata and data from under HBase;
● several strategies for securing data are available: Role-based Access Control (RBAC), which controls which users or groups can read and write to a given HBase resource or execute a coprocessor endpoint; Visibility Labels, which allow cells to be labelled and access to labelled cells to be controlled; transparent encryption of data at rest on the underlying file system.

Data  source HBase  can  read  data  from  HDFS  file  system  and,  only  in  stand-­‐alone  mode,  also  from  the  local   filesystem.   It   can  be  used   for  data  sources   that  can  be  better  handled   through  a  


NoSQL database. Some stored data will be replicated from other primary storage systems, to be used in batch and online processing.

Potential  usage  within  BIGSEA

HBase  can  be  exploited  at  the  data  access  layer  to  access  and  store  dynamic,  spatial  and  social   network   data   (or   other   semi-­‐structured/unstructured   data).   It   can   also   be  exploited  in  conjunction  with  Hive  to  run  HiveQL  queries  on  the  data.  
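A minimal sketch of how semi-structured records could be written to and scanned from HBase through the Thrift gateway, using the happybase Python client, is shown below; the host, table name, column family and row-key scheme are hypothetical, and a running HBase Thrift server is assumed.

import happybase

# Hypothetical Thrift gateway and table layout (column family "d").
connection = happybase.Connection("hbase-thrift.example.org", port=9090)
table = connection.table("social_posts")

# Row keys prefixed by user id and timestamp give cheap per-user time-range scans.
table.put(b"a1b2c3#20160601081500",
          {b"d:text": b"Heavy traffic downtown", b"d:lang": b"pt"})

# Scan all rows of a given user.
for key, data in table.scan(row_prefix=b"a1b2c3#"):
    print(key, data[b"d:text"])

connection.close()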

7.4. Data  Ingestion  and  Streaming  Processing    

7.4.1. Apache  Kafka  

Identification Apache  Kafka

Type Publish-­‐subscribe  messaging  system  

License Apache  License  V2.0

Current  Version

v0.10.0.0  -­‐  May/2016

Website Website:  http://kafka.apache.org

Documentation:  http://kafka.apache.org/documentation.html

Download/Source  code:  http://kafka.apache.org/code.html

Purpose Kafka   is   a   distributed,   partitioned  and   fault-­‐tolerant   commit   log   service.   It   can  handle  hundreds   of   megabytes   of   reads/writes   per   second   with   low   latencies.   Its   scalable  design  allows  streams  to  be  partitioned  over  a  cluster  of  multiple  nodes.  Messages  are  persisted  on  disk  and  replicated  in  the  cluster  to  ensure  durability  and  recoverability.

Kafka   maintains   published   messages   in   categories   called   topics.   Producers   are   the  processes   that   publish   these   messages   to   Kafka   topics,   whereas   Consumers   can  subscribe   to   topics   and   process   each   message.   So,   at   a   high   level,   producers   send  messages   over   the   network   to   the   Kafka   cluster   which,   in   turn,   serves   them   up   to  consumers  as  displayed  in  the  picture.

[Image  source:  http://kafka.apache.org/images/producer_consumer.png]

High Level Architecture

A Kafka cluster is composed of one or more "broker" nodes and, additionally, of Zookeeper nodes that coordinate the cluster.

Dependencies Apache  Zookeeper  for  the  coordination  of  the  cluster.

Apache  Storm  can  be  additionally  used  as  a  consumer  of  Kafka  data.

Interfaces  and  languages  supported

Communication   between   the   clients   and   the   servers   is   done   with   a   simple,   high-­‐performance,   language   agnostic   TCP   protocol.   Kafka   provides   Java   clients,   however  clients   in   other   languages   are   available   (e.g.   C/C++,   PHP,   Python,   Ruby,   Clojure,   etc.):  https://cwiki.apache.org/confluence/display/KAFKA/Clients  

Security  support

Security  measures  supported  are  (for  additional  information  on  configuration  http://kafka.apache.org/documentation.html#security):

1. Authentication of connections to brokers from clients (producers and consumers), other brokers and tools, using either SSL or SASL (Kerberos). SASL/PLAIN can also be used from release 0.10.0.0 onwards;
2. Authentication of connections from brokers to ZooKeeper;
3. Encryption of data transferred between brokers and clients, between brokers, or between brokers and tools, using SSL;
4. Authorization of read/write operations by clients;
5. Authorization is pluggable and integration with external authorization services is supported.

Data  source Kafka  can  be  used  with  streams  of  data  (e.g.  from  social  networks).  

Potential  usage  within  BIGSEA

Kafka could be used in the ingestion process to track particular terms/keys relevant for the urban mobility scenario from social networks, managing streams of messages that can then be consumed by the streaming processing modules.
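A minimal sketch of such an ingestion pipeline with the kafka-python client is shown below; the broker address and topic name are placeholders, and the messages stand in for the social network stream mentioned above.

from kafka import KafkaProducer, KafkaConsumer

BROKERS = "kafka.example.org:9092"   # placeholder broker list
TOPIC = "social_mobility"            # placeholder topic for urban mobility keywords

# Producer side: publish matched social network messages to the topic.
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(TOPIC, b'{"user": "a1b2c3", "text": "Accident on the ring road"}')
producer.flush()

# Consumer side: a streaming processing module subscribes to the topic.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                         auto_offset_reset="earliest", consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value)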

7.4.2. Apache  Storm  

Identification Apache  Storm

Type Real-­‐time  computation  system

License Apache  License  V2.0

Current  Version

V1.0.1  -­‐  Apr/2016

Website Website:  http://storm.apache.org

Documentation:  http://storm.apache.org/releases/current/index.html

Download/Source  code:  http://storm.apache.org/downloads.html

Purpose Apache Storm is a distributed realtime computation system that allows unbounded streams of data to be processed reliably. Storm can be used with any programming language and provides a simple and easy to use API, with a few types of abstractions:


● Spouts: sources of streams in a computation (e.g., from Kafka);
● Bolts: process any number of input streams and produce any number of new output streams;
● Topologies: networks of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.

The picture displays two spouts and multiple bolts.

[Image  source:  http://storm.apache.org/images/storm-­‐flow.png]

Storm  provides  inherent  parallelism  to  process  high  throughputs  of  messages  with  very  low  latency,  thus  allowing  applications  to  scale  over  the  resources  available.  

A  Storm  topology  consumes  streams  of  data  and  processes  those  streams  in  arbitrarily  complex   ways,   repartitioning   the   streams   between   each   stage   of   the   computation  however  needed.

High  Level  Architecture

A Storm cluster consists of:
● a Master node (running the Nimbus daemon), a single node responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures;
● Worker nodes (running the Supervisor daemon), which listen for work assigned to their machine and start/stop worker processes as necessary based on what Nimbus has assigned to them. Each worker process executes a subset of a topology;
● Zookeeper nodes, which coordinate the whole cluster.

[Image  source:  http://storm.apache.org/releases/current/images/storm-­‐cluster.png]

Dependencies Apache  Zookeeper  for  the  coordination  of  the  cluster.


Additionally,   it   can  be   integrated  with  Apache  Kafka   in  order   to  consume  data  coming  from  it.

Interfaces  and  languages  supported

Storm  has  a  simple  and  easy  to  use  API.   Its  components  (bolts,  spouts,  topologies)  can  be  defined  with  any  programming  language.  Non-­‐JVM  languages  communicate  to  Storm  over  a  JSON-­‐based  protocol.  Adapters  that  implement  the  communication  protocol  are  available  for  Ruby,  Python,  Javascript  and  Perl.

It   provides   a   CLI   client   to   interact   and  manage   a   remote   cluster,   while   the   Storm   UI  daemon  provides  a  REST  API  to  interact  with  a  Storm  cluster.  

Additionally,   it   provides   a   high-­‐level   abstraction   (Trident)   to   perform   real   time  computing  on  top  of  Storm.  It  allows  both  stream  analytics  operations  and  transactional  queries.

Security  support

Several options are available; by default all authentication/authorization is disabled [http://storm.apache.org/releases/1.0.1/SECURITY.html]:
1. Services allow users to configure SSL (also 2-way) for the connection;
2. Pluggable authentication support through Thrift and SASL (e.g. Kerberos);
3. Different authorization plugins are available for the various components/services;
4. Isolation in multi-tenancy;
5. User/group role management.

Data source Storm spouts read from different sources to produce streams of data. Spouts typically read from queueing brokers (e.g. Kafka, Kestrel, RabbitMQ) but can also generate their own streams or read from streaming APIs.

Potential  usage  within  BIGSEA

Storm   can   be   used   to   execute   a   set   of   operations   on   streaming   data.   In   particular,   it  could  be  used  to  apply  pre-­‐processing  and  ETL  operations  before  storing  data   into  the  system.   It   could  be   coupled  with  Apache  Kafka   to   ingest   and  process   streams  of   data  (e.g.,  produced  by  social  networks).  
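Since non-JVM components talk to Storm over the JSON-based multi-lang protocol mentioned above, the heavily simplified Python sketch below gives the flavour of a bolt written that way; it omits the initial handshake, heartbeat handling and the reading of Storm's replies to emit commands, and the keyword filter is purely illustrative (the storm.py helper shipped with Storm, or a library such as streamparse, would normally take care of these protocol details).

import json
import sys

def read_message():
    """Read one multi-lang message: JSON lines terminated by a line containing 'end'."""
    lines = []
    while True:
        line = sys.stdin.readline()
        if not line:              # EOF: Storm closed the pipe
            return None
        if line.strip() == "end":
            break
        lines.append(line)
    return json.loads("".join(lines)) if lines else None

def send_message(msg):
    sys.stdout.write(json.dumps(msg) + "\nend\n")
    sys.stdout.flush()

# Simplified bolt loop: keep tuples whose text mentions traffic-related keywords.
KEYWORDS = ("traffic", "accident", "jam")
while True:
    tup = read_message()
    if tup is None:
        break
    fields = tup.get("tuple") or [""]
    text = str(fields[0]).lower()
    if any(k in text for k in KEYWORDS):
        send_message({"command": "emit", "anchors": [tup.get("id")], "tuple": [text]})
    send_message({"command": "ack", "id": tup.get("id")})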

7.4.3. Apache  Flink  

Identification Apache  Flink

Type Execution  Framework

License Apache  License  V2.0

Current  Version

Stable:  V1.0.3

Latest:  V1.1

Website Website:  https://flink.apache.org/

Documentation:  https://ci.apache.org/projects/flink/flink-­‐docs-­‐master/

Download/  Source  code:  


https://flink.apache.org/downloads.html

https://flink.apache.org/community.html#source-­‐code

Purpose Apache Flink is an open source platform for distributed stream and batch data processing. Its core is a streaming dataflow engine that provides data distribution, communication and fault tolerance. Moreover, "it has native support for iterations, incremental iterations and programs consisting of large DAGs of operations". In Flink, batch processing runs as a special case of stream processing.

A Flink program is built out of "Streams" and "Transformations", both of which are naturally parallel and distributed. A Stream is a (potentially unbounded) flow of data records, whereas a Transformation is an operation that takes one or more streams as input and produces one or more output streams (e.g., Map, FlatMap, Filter, and so on). Both are necessary to have a Flink program and to compute results over streams.

[Image  source:  https://ci.apache.org/projects/flink/flink-­‐docs-­‐master/concepts/fig/program_dataflow.svg]

Many  other  operations  can  be  done  at  this  stage  to  process  streaming  data  (define  time  and   windows   of   each   process,   describe   state   and   fault   tolerance,   and   define  checkpoints).

A data stream is unbounded, so its elements cannot simply be counted. It is necessary to define a window to aggregate many data stream elements and process them. Windows can be delimited by time (for example, every 10 seconds) or by data (for example, every 1000 elements).

During processing, the state of each operation is stored in a key/value store, allowing operations to be performed correctly across the cluster and keeping the computation stateful. Checkpoints are required to save a consistent point that can be used to restore the state.

High  Level  Architecture

A Flink cluster consists of two types of runtime processes, plus a client:
● Master (the so-called JobManager): organizes the resources of the distributed execution (the whole Flink system). Its main assignments are to schedule tasks, perform checkpoints, and recover from failures;
● Worker (the so-called TaskManager): executes tasks or subtasks of the parallel program. There has to be at least one worker process;
● Client: the component responsible for planning (turning the program into the parallel dataflow form) and sending dataflows to the Master (JobManager).

[Image  source:  https://ci.apache.org/projects/flink/flink-­‐docs-­‐release-­‐0.8/img/ClientJmTm.svg]

Dependencies HDFS and YARN, both from the Apache Hadoop project, are required when Flink is deployed on a Hadoop cluster. By default, Flink is built against Hadoop 2.x dependencies.

For high availability it is necessary to use ZooKeeper and YARN.

Interfaces  and  languages  supported

Flink has three APIs to create an application: the DataStream API (stream processing) to handle streams, which can be used from Java or Scala; the DataSet API for static data (batch processing), embedded in Java, Scala and Python; and the Table API to interpret SQL-like expressions, embedded in Java and Scala. In addition, Flink has several libraries for specific domains, such as Complex Event Processing (CEP), Flink Machine Learning (ML) and graph processing (called Gelly).

Security  support

Flink   supports   Kerberos   authentication   of   Hadoop   services   such   as   HDFS,   YARN,   or  HBase.

Data source Apache Flink can access different data sources, for example the Hadoop Distributed File System (HDFS), Amazon S3, the MapR file system and Tachyon. Flink can also access MongoDB, although this integration is not yet mature (https://github.com/okkam-it/flink-mongodb-test).

Also,   data   (streams)   can   be   consumed   from   Apache   Kafka,   and   connected   to   several  data  storage  systems.  

Potential  usage  within  BIGSEA

Flink   can   be   used   to   process   real-­‐time   data   streams   and   integrate   these   data   with  historical   data   to   get   insights,   generate   knowledge,   and   produce   predictions   in   the  context  of  smart  cities.
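To give a concrete feel for the streams-and-transformations model described above, the sketch below uses the PyFlink DataStream API, which only became available in Flink releases later than the version evaluated here; the records and job name are made up, and in the evaluated release the same pipeline would be written in Java or Scala.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A toy bounded stream standing in for (bus_line, delay_in_minutes) events.
events = env.from_collection([("bus_42", 3), ("bus_7", 0), ("bus_42", 12)])

# Two chained transformations: keep delayed buses and convert minutes to seconds.
delayed = events.filter(lambda e: e[1] > 0).map(lambda e: (e[0], e[1] * 60))

delayed.print()              # sink: print the transformed stream
env.execute("delay_report")  # made-up job name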


7.4.4. Apache  Spark  Streaming  

Identification Apache  Spark  Streaming

Type Execution  Framework

License Apache  License  V2.0

Current  Version

Stable:  Apache  Spark  V1.6.1

Latest:  Preview  V2.0.0

Website Website:  http://spark.apache.org/streaming/

Documentation:  http://spark.apache.org/docs/latest/streaming-­‐programming-­‐guide.html

Download/  Source  code:  

http://spark.apache.org/downloads.html

Purpose Spark   Streaming   allows   stream   processing  with   high   scalability,   throughput,   and   fault  tolerance.

Data streams are ingested from many sources and then processed by Spark Streaming. The processed data are then written to file systems or databases, or presented via dashboards.

[Image  source:  http://spark.apache.org/docs/latest/img/streaming-­‐arch.png]

High  Level  Architecture

When Spark Streaming receives a data stream, it is split into batches to be processed by the Spark engine.

[Image  source:  http://spark.apache.org/docs/latest/img/streaming-­‐flow.png]

Spark   streaming   represents   data   as   a   DStream   (Discretized   Stream)   [R20].   High-­‐level  operations  can  be  performed  on  a  DStream  (like  a  sequence  of  RDDs  in  Apache  Spark).  


Each RDD contains the chunk of the data stream received during a short time interval.

[Image  source:  http://spark.apache.org/docs/latest/img/streaming-­‐dstream.png]

If necessary, an RDD can be transformed/processed, producing a new RDD. The framework hides most of these DStream/RDD transformation details.

Dependencies To ingest data from external sources, the corresponding artifact of each data source must be added (http://spark.apache.org/docs/latest/streaming-programming-guide.html#linking). The full list of supported sources and artifacts is available on the Maven Repository.

Interfaces  and  languages  supported

 Spark  Streaming  programs  can  be  created  in  Scala,  Java  and  Python.  

Security  support

Security options are those configured in Apache Spark. In addition, TCP ports (standalone/cluster) must be configured for network security.

[See  Section  7.4.5  Apache  Spark]

Data source Spark Streaming allows data to be ingested from two types of sources:
(1) Basic: file systems (HDFS), socket connections, and Akka actors;
(2) Advanced: Kafka, Flume, Kinesis, Twitter, ZeroMQ, and MQTT.

Potential  usage  within  BIGSEA

Spark Streaming can be integrated with Kafka to get streaming data from Twitter and process them. This helps in the recognition of events such as traffic jams and other situations in big cities.

There are many usages for this tool, such as:
● Extract, Transform, Load (ETL) over streaming data;
● detection of traffic problems;
● data enrichment, comparing real-time data with historical data (weather and traffic).

The choice depends only on the data to which we have access or that we can get through the internet.
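The following minimal PySpark Streaming sketch (a windowed keyword count over a socket source) illustrates the DStream model described above; the host/port, paths and keywords are placeholders, and in the project the source would more likely be Kafka via the corresponding connector artifact.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="keyword_counts")   # placeholder application name
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches
ssc.checkpoint("/tmp/bigsea-checkpoint")      # placeholder checkpoint directory

# Placeholder socket source; Kafka would be used through the spark-streaming-kafka artifact.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.lower().split())
               .filter(lambda w: w in ("traffic", "accident", "jam"))
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None,
                                     windowDuration=60, slideDuration=10))

counts.pprint()
ssc.start()
ssc.awaitTermination()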

7.5. Data  Analytics  and  Mining    

7.5.1. Ophidia  

Identification Ophidia  

Type Big  data  framework  for  scientific  data  analysis


License GNU  GPL  v3.0

Current  Version

v0.9  -­‐  Feb/2016  

Website Website:  http://ophidia.cmcc.it

Documentation:  http://ophidia.cmcc.it/documentation/

Download/Source  code:  http://ophidia.cmcc.it/download/

Purpose Ophidia  provides  a   complete  environment   for   the  execution  of  data-­‐intensive  analysis  exploiting   advanced   parallel   computing   techniques   and   smart   data   distribution  methods.   It   exploits   an   array-­‐based   storage   model   and   a   hierarchical   storage  organization   to   partition   and   distribute   multidimensional   scientific   datasets   over  multiple  nodes.

High  Level  Architecture

An Ophidia cluster is composed of the following components:
● Ophidia Server: the cluster front-end. It provides multiple interfaces for client-server interactions and manages job scheduling, submission and monitoring;
● Compute nodes: one or more nodes running the Ophidia parallel operators;
● I/O nodes: multiple nodes running one or more I/O servers responsible for the execution of array-based analytical primitives;
● Storage layer: comprises the resources to physically store data.

Dependencies MySQL  server  for  metadata  storage  and  management,  as  well  as  for  datacube  storage.  

SLURM  scheduler  for  the  management  and  execution  of  analytics  jobs.

MPI environment for the execution of parallel jobs.

Interfaces and languages supported

The Ophidia server provides a multi-interface server-side front-end. Interfaces available are: (i) web service interface compliant with WS-I Basic Profile v1.2, (ii) GSI interface with support for Virtual Organizations (VOMS).

The  Ophidia  Terminal  can  be  used   to   run   interactive  and  batch  analysis   sessions  using  the  interfaces  provided  by  the  server.  Additionally,  a  Python  binding  library  is  available.

Security  support

The   system   provides   mechanisms   to   authenticate   users   and   log   the   commands  executed.  It  also  defines  different  roles  (administrator/user)  and  allows  users  to  share  a  working  session  granting  privileges  to  other  users.  

Encryption  is  provided  to  secure  client-­‐server  communications.

Data  source Scientific  multi-­‐dimensional  data  like  environmental  (climate/weather)  data.

Potential  usage  within  BIGSEA

Ophidia  can  be  used  to  store,  manage,  access  and  analyse  environmental  data  providing  array-­‐based  primitives   for   time-­‐series  analysis.   It   natively   supports   the  NetCDF   format  and  features  climate/weather  domain  operations  for  analytics.  
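Since a Python binding library is available (see Interfaces above), a minimal sketch of its use could look like the following; the server address, credentials, port and operator string are placeholders, and the exact client API should be checked against the PyOphidia documentation for the deployed release.

from PyOphidia import client

# Placeholder credentials and server endpoint (assumed PyOphidia Client interface).
ophclient = client.Client(username="bigsea", password="secret",
                          server="ophidia.example.org", port="11732")

# Submit an Ophidia operator (here: list the datacubes visible in the current session).
ophclient.submit("oph_list level=2")
print(ophclient.last_response)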

7.5.2. Apache  Kylin  

Identification Apache  Kylin  

Type Distributed  Analytics  Engine

License Apache  License  V2.0

Current  Version

 v1.5.2.1  -­‐  Jun/2016  

Website Website:  http://kylin.apache.org

Documentation:  http://kylin.apache.org/docs15/

Download/Source  code:  http://kylin.apache.org/download/

Purpose Apache  Kylin™  is  an  open  source  Distributed  Analytics  Engine  designed  to  provide  SQL  interface  and  multidimensional  analysis  (OLAP)  on  Hadoop  supporting  large  datasets.

Apache Kylin™ allows big Hive tables to be queried at sub-second latency in three steps:
1. identify a set of Hive tables in a star schema;
2. build a cube from the Hive tables in an offline batch process;
3. query the Hive tables using SQL and get results in sub-seconds, via REST API, ODBC, or JDBC.

High  Level  Architecture

Kylin is commonly installed on a Hadoop client machine to allow interaction with the Hadoop cluster through the command line.

The picture shows this scenario. The application (e.g. Kylin Web) contains a web interface for cube building, querying and management. Kylin Web launches a query engine for querying the data and a cube build engine for building data cubes starting from a star schema. The two engines interact with the Hadoop components, like Hive for cube building and HBase for cube storage.

[Image  source:  http://kylin.apache.org/images/install/on_cli_install_scene.png]

Dependencies Hadoop  cluster,  Apache  Hive  to  read  data  from  it  and  Apache  HBase  to  store  data  cubes  into  it.

Interfaces  and  languages  supported

Kylin  provides  ODBC  and  JDBC  drivers  as  well  as  a  RESTful  API.  

The Kylin Web Interface allows queries to be run (using SQL) and the results to be visualized (also as charts).

Security  support

Kylin   supports   LDAP   authentication   for   enterprise   or   production   deployment.  Additionally,  from  v1.5,  Kylin  provides  SSO  with  SAML.  The  implementation  is  based  on  Spring  Security  SAML  Extension.  

Data  source Kylin  can  read  data  stored  in  HDFS  from  Hive  and  store  data  cubes  in  HBase.  

Potential  usage  within  BIGSEA

It can be used in scenarios where OLAP analysis is required. It also allows OLAP on streaming cubes (still a prototype).
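A minimal sketch of querying Kylin through its RESTful API from Python is shown below; the host, project, table and credentials are placeholders, and the request layout is based on Kylin's query endpoint and should be verified against the deployed version.

import requests

KYLIN = "http://kylin.example.org:7070/kylin/api"   # placeholder Kylin instance

# Kylin uses HTTP Basic authentication (ADMIN/KYLIN is only the default demo account).
auth = ("ADMIN", "KYLIN")

query = {
    "sql": "SELECT line_id, COUNT(*) AS trips FROM bus_trips GROUP BY line_id",  # hypothetical table
    "project": "mobility",   # placeholder project
    "offset": 0,
    "limit": 100,
}

resp = requests.post(KYLIN + "/query", json=query, auth=auth)
for row in resp.json().get("results", []):
    print(row)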

7.5.3. Apache  Hive  

Identification Apache  Hive  

Type Data  warehouse  software

License Apache  License  V2.0

Current  Version

v2.0.1  -­‐  May/2016  

Website Website:  https://hive.apache.org


Documentation:  https://cwiki.apache.org/confluence/display/Hive/Home

Download/Source  code:  https://hive.apache.org/downloads.html

Purpose Apache Hive™ is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It is built on top of Apache Hadoop™ and provides:

● tools to enable easy access to data via SQL, allowing data warehousing tasks such as ETL, reporting, and data analysis;
● a mechanism to impose structure on a variety of data formats;
● access to files stored directly in Apache HDFS™ or in other data storage systems like Apache HBase™;
● query execution via various frameworks (i.e. Apache Tez™, Apache Spark™ or MapReduce).

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.

High  Level  Architecture

The figure displays Hive's main components and their interactions with Hadoop. The components are:

● UI: the user interface to submit queries and other operations to the system;
● Driver: the component which receives the queries. It implements the notion of session handles and provides execute and fetch APIs;
● Compiler: it parses the query, does semantic analysis on the query blocks and expressions and, eventually, generates an execution plan based on the metadata available in the metastore;
● Metastore: it stores all the information related to the structure of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to write and read data, and the corresponding HDFS files where the data are stored. Serializers/deserializers provide the logic to read/write custom formats;
● Execution Engine: the component that executes the execution plan (a DAG of stages) created by the compiler.

[Image source: https://cwiki.apache.org/confluence/download/attachments/27362072/system_architecture.png?version=1&modificationDate=1414560669000&api=v2]

Dependencies Hadoop  cluster

Interfaces  and  languages  supported

Hive  defines  a  SQL-­‐like  language  called  HiveQL  (HQL)  to  perform  queries  on  data.  It  can  be   extended   with   user   code   through   user   defined   functions   (UDFs),   user   defined  aggregates  (UDAFs),  and  user  defined  table  functions  (UDTFs).

In  terms  of  clients,  Hive  provides  a  shell  CLI  and  a  GUI  (Hive  Web  Interface),  as  well  as  several   client   libraries   (JDBC,   ODBC,   Python,   PHP,   Thrift).   Additionally,   it   provides  HiveServer2   (HS2),   a   server   interface   that   enables   remote   clients   to   execute   queries  against  Hive  and  retrieve  the  results.

Security  support

Hive provides three authorization modes:
● Storage Based Authorization in the Metastore Server: the metastore controls access to the different metadata objects (databases, tables, partitions) by checking the file system permissions of the corresponding folders;
● SQL Standards Based Authorization in HiveServer2: it allows fine-grained control and is based on the SQL standard for authorization, using common grant/revoke statements;
● Default Hive Authorization (Legacy Mode): the authorization mode that has been available in earlier versions of Hive. It does not have a complete access control model.

Strong   authentication   for   tools   like   the   Hive   command   line   is   provided   through  Kerberos,   whereas   HiveServer2   provides   additional   authentication   options   (cookie-­‐based,  SASL,  PAM,  LDAP  and  Kerberos).

Data  source Data  stored  in  file  systems  (e.g.  HDFS,  Amazon  S3)  or  in  Apache  HBase  database.  

Potential  usage  within  BIGSEA

Apache  Hive  can  be  used  on  top  of  Hadoop  (or  HBase)  to  access,  filter  and  process  data  stored  in  the  HDFS  storage.  In  particular,  it  could  be  used  both  at  the  Data  Access  level,  to  select  the  pre-­‐processed  data  from  social  networks  (or  other  unstructured  data)  and  at  the  Data  Analytics  levels,  to  process  the  data.
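As an illustration of how HiveQL could be used at the Data Access level, the following Python sketch uses the PyHive client against HiveServer2; the host, table and column names are hypothetical.

from pyhive import hive

# Hypothetical HiveServer2 endpoint and schema.
conn = hive.Connection(host="hiveserver2.example.org", port=10000, username="bigsea")
cursor = conn.cursor()

# HiveQL: count geo-tagged social posts per day for a given keyword.
cursor.execute("""
    SELECT to_date(created_at) AS day, COUNT(*) AS n
    FROM social_posts
    WHERE text LIKE '%traffic%' AND geo IS NOT NULL
    GROUP BY to_date(created_at)
""")

for day, n in cursor.fetchall():
    print(day, n)
conn.close()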

7.5.4. Druid  

Identification Druid  

Type Big  data  solution  for  data  analytics

License Apache  License,  Version  2.0.

Current  Version

0.9.0  -­‐  Apr/2016

Website Website:  http://druid.io/

Documentation:  http://druid.io/docs/0.9.0/design/index.html


Download/Source  code:  http://druid.io/downloads.html

Purpose Druid is an open source data store designed for OLAP queries on event data. Data are organized as dimension and metric columns. A separate column type, timestamp, is treated specially because all queries center around the time axis. Druid provides a language with operations to load, index, query and group (roll-up) data. Data are partitioned in segments; segments are key to providing high availability in a cluster.

High  Level  Architecture

[Image  source:  https://upload.wikimedia.org/wikipedia/commons/0/0f/Druid_Open-­‐

Source_Data_Store%2C_architecture%2C_DruidArchitecture3.svg]

In a fault-tolerant architecture, a Druid cluster allows data to be partitioned and replicated. A cluster coordinator (Zookeeper) is needed to keep cluster information synchronized, allowing identification of new or removed nodes and leader election. MySQL or PostgreSQL is used as metadata storage, while for deep storage (historical data) a distributed file system such as HDFS or S3, or even the Cassandra NoSQL database, is required.

Druid  provides  data   ingestion   tools   to   load  data  directly   into   its   real-­‐time  nodes,  or   in  batch   into   historical   nodes.   Real-­‐time   nodes   accept   JSON-­‐formatted   data   from   a  streaming  data  source,  like  RabbitMQ  or  other  message  queueing  system.  Batch-­‐loaded  data  formats  can  be  JSON,  CSV,  or  TSV.

Data  are  partitioned  in  segments  and  designed  to  be  easily  moved  out  to  deep  storage.  The  location  of  segments  is  stored  in  the  relational  database  (MySQL)  and  all  transfer  is  coordinated  by  Zookeeper.  

Broker   nodes   are   responsible   for   receiving   client   queries   and   forward   them   to  appropriate  data  node  (historical  or  real-­‐time).  Brokers  interact  with  metadata  database  and  Zookeeper  in  order  to  know  in  which  nodes  segments  reside.  After  each  data  node  has   processed   the   query,   broker   nodes   merge   partial   results   before   returning   the  aggregated  result.

Dependencies MySQL   (or   Postgres)   as   a   metadata   storage;   HDFS,   Amazon   S3   or   any   sharable   and  mountable   file   system   as   deep   storage;   Apache   ZooKeeper,   "a   centralized   service   for  maintaining   configuration   information,   naming,   providing   distributed   synchronization,  and  providing  group  services."


Interfaces  and  languages  supported

REST  API,  with  clients  implemented  for  Ruby,  Python,  R,  Javascript  (Node.js)  and  others.  Data  ingestion  can  be  done  by  stream  pull  or  push.  

Security  support

No support for authentication or authorization is provided.

Data  source Data  ingestion  can  be  done  by  stream  pull  or  push,  directly  to  or  from  an  application.  

Potential  usage  within  BIGSEA

Druid could be used in analytics scenarios where an OLAP tool is sufficient; it provides fast TOP-N queries.
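A minimal sketch of a Druid native topN query issued to the broker's REST endpoint is shown below; the broker address, data source and column names are placeholders, and the JSON layout follows Druid's native query format and should be verified against the deployed release.

import requests

BROKER = "http://druid-broker.example.org:8082/druid/v2"   # placeholder broker endpoint

# Native topN query: the 10 bus lines with the most events in June 2016 (hypothetical data source).
query = {
    "queryType": "topN",
    "dataSource": "bus_events",
    "dimension": "line_id",
    "metric": "count",
    "threshold": 10,
    "granularity": "all",
    "aggregations": [{"type": "count", "name": "count"}],
    "intervals": ["2016-06-01/2016-07-01"],
}

resp = requests.post(BROKER, json=query)
print(resp.json())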

7.5.5. Spark  

Identification Apache  Spark  

Type Framework  for  processing  big  data

License Apache  License,  Version  2.0.

Current  Version

1.6.1  -­‐  Mar/2016

Website Website:  http://spark.apache.org

Documentation:  http://spark.apache.org/docs/latest/

Download/Source  code:  http://spark.apache.org/downloads.html

Purpose In-­‐memory  engine   for   large-­‐scale  data  processing.  Apache  Spark   is  a   fast  and  general-­‐purpose   cluster   computing   system   and   an   optimized   engine   that   supports   general  execution  graphs.

High  Level  Architecture

[Image  source:  http://spark.apache.org/docs/latest/img/cluster-­‐overview.png]

In   a   standalone   cluster   deployment,   the   cluster   manager   is   a   Spark   master   instance.  


When   using   Mesos,   the   Mesos   master   replaces   the   Spark   master   as   the   cluster  manager.

Spark   can  be  used   for   batch   jobs   through   spark-­‐submit,  which   can  use   local,   YARN  or  Mesos   resources,   among   others.   Spark-­‐submit   can   be   used   to   execute   binaries  remotely.  

Spark-shell is a Scala interactive console that can use a Mesos cluster as its back-end. This way, one can define data analytic operations and execute them interactively on a remote system.

Dependencies A  Hadoop,  YARN  or  Mesos  cluster.  

Interfaces  and  languages  supported

 It  provides  high-­‐level  APIs  in  Java,  Scala,  Python  and  R.  

Security  support

Spark  currently  supports  authentication  via  a  shared  secret.  Spark  supports  SSL  for  Akka  and  HTTP  (for  broadcast  and  file  server)  protocols.  SASL  encryption  is  supported  for  the  block   transfer   service.   Encryption   is   not   yet   supported   for   the   WebUI.  Encryption   is   not   yet   supported   for   data   stored   by   Spark   in   temporary   local   storage,  such  as   shuffle   files,   cached  data,  and  other  application   files.   If  encrypting   this  data   is  desired,  a  workaround  is  to  configure  the  cluster  manager  to  store  application  data  on  encrypted  disks.

Data stored in file systems (ext4, HDFS). There are many other connectors that allow Spark to read/write data from/to other data sources/storage systems.

Potential  usage  within  BIGSEA

Spark is one of the programming models supported in EUBra-BIGSEA.
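A minimal PySpark sketch showing the kind of batch analysis that this programming model enables is given below; the file path and field layout are placeholders, and the job would normally be launched through spark-submit on the YARN/Mesos cluster.

from pyspark import SparkContext

sc = SparkContext(appName="avg_trip_duration")   # placeholder application name

# Hypothetical CSV of bus trips: line_id,start_ts,end_ts,duration_minutes
trips = sc.textFile("hdfs:///data/bus_trips.csv")

durations = (trips.map(lambda line: line.split(","))
                  .filter(lambda f: len(f) == 4)
                  .map(lambda f: (f[0], (float(f[3]), 1))))

# Average trip duration per bus line.
averages = (durations.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                     .mapValues(lambda s: s[0] / s[1]))

for line_id, avg in averages.collect():
    print(line_id, avg)

sc.stop()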

7.5.6. Hadoop  MapReduce  

Identification Hadoop  MapReduce  

Type Framework  for  processing  big  data

License Apache  License  2.0

Current  Version

Release  2.7.2  -­‐  Jan/2016

Website Website:  http://hadoop.apache.org/

Documentation:  http://hadoop.apache.org/docs/r2.7.2/

Download/Source  code:  http://hadoop.apache.org/#Download+Hadoop

Purpose MapReduce is a programming model that allows the processing of massive data with parallel and distributed algorithms, usually on computer clusters. Hadoop is a collection of sub-projects, hosted by the Apache Software Foundation, related to distributed computing. Although the best-known Hadoop subprojects are MapReduce and its distributed file system (HDFS), other subprojects (e.g., Avro, Pig, HBase, Hive and ZooKeeper) offer complementary services or add a higher-level abstraction.
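Although the native MapReduce API is in Java (see the interfaces row below), the map/reduce programming model itself can be illustrated with a short Hadoop Streaming sketch in Python; the script name, HDFS paths and submission command are illustrative assumptions rather than part of the Hadoop distribution.

    # wordcount_streaming.py - minimal sketch of the MapReduce model via Hadoop Streaming.
    import sys

    def mapper():
        # Emit one tab-separated (word, 1) pair per word, as expected by Streaming.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word, 1))

    def reducer():
        # Input arrives sorted by key; sum the counts for each consecutive word.
        current_word, current_count = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t", 1)
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print("%s\t%d" % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

    # Example submission (jar location and paths depend on the installation):
    #   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
    #     -mapper "python wordcount_streaming.py map" \
    #     -reducer "python wordcount_streaming.py reduce" \
    #     -file wordcount_streaming.py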

High  Level  Architecture

Hadoop  Common:  The  common  utilities  that  support  the  other  Hadoop  modules.

Hadoop   Distributed   File   System   (HDFS):   A   distributed   file   system   that   provides   high-­‐throughput  access  to  application  data.

Hadoop  YARN:  A  framework  for  job  scheduling  and  cluster  resource  management.

Hadoop  MapReduce:  A  YARN-­‐based  system  for  parallel  processing  of  large  data  sets.

Dependencies Java  installed  in  all  nodes  (master  and  slaves).

If a high security level is required, Kerberos is a possible solution.

Interfaces  and  languages  supported

Hadoop MapReduce provides its high-level APIs in Java.

Security  support

Security   features   of   Hadoop   consist   of   authentication,   service   level   authorization,  authentication  for  Web  consoles  and  data  confidentiality.

Authentication: when service level authentication is turned on, end users using Hadoop in secure mode need to be authenticated by Kerberos.

Service   level   authorization:   it   is   the   initial   authorization  mechanism   to   ensure   clients  connecting   to   a   particular   Hadoop   service   have   the   necessary,   pre-­‐configured,  permissions  and  are  authorized  to  access  the  given  service.

Authentication for Web consoles: by default, Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow access without any form of authentication. On the other hand, Hadoop HTTP web-consoles can be configured to require Kerberos authentication using the HTTP SPNEGO protocol (supported by browsers such as Firefox and Internet Explorer).

Data confidentiality: the data transferred between Hadoop services and clients can be encrypted. Furthermore, the data transfer between web-consoles and clients is protected using SSL (HTTPS).

Data source The data can be stored in file systems (e.g., HDFS). Using YARN, different types of data (e.g., text, JAR, JSON) can be read from and written to the file system. Furthermore, Hadoop can be integrated with other data sources (e.g., HBase, MySQL and MongoDB).

Potential  usage  within  BIGSEA

Hadoop MapReduce is one of the supported programming models (to process big data in batch mode) in BIGSEA.

7.6. Data  Mining  and  Analytics  Toolbox    

7.6.1. Apache  Spark  MLlib  

Identification Apache  Spark  MLlib

Type Apache  Spark’s  scalable  machine  learning  (ML)  library.

License Apache License, Version 2.0 (MLlib is distributed as part of Apache Spark)

Current  Version

1.6.1  -­‐  March/2016

Website Website:  https://spark.apache.org/mllib/

Documentation:  https://spark.apache.org/docs/latest/mllib-­‐guide.html

Download/Source  code:  

https://github.com/apache/spark/tree/master/mllib

Purpose Its  goal   is  to  make  practical  machine  learning  scalable  and  easy.   It  consists  of  common  learning   algorithms   and   utilities,   including   classification,   regression,   clustering,  collaborative   filtering,   dimensionality   reduction,   as   well   as   lower-­‐level   optimization  primitives  and  higher-­‐level  pipeline  APIs.


High  Level  Architecture

Spark  MLlib  is  a  module  (a  library  /  an  extension)  of  Apache  Spark  to  provide  distributed  machine  learning  algorithms  on  top  of  Spark’s  RDD  abstraction.  Its  goal  is  to  simplify  the  development  and  usage  of  large  scale  machine  learning.

The following types of machine learning algorithms are available in MLlib:
● Classification
● Regression
● Frequent itemsets (via the FP-growth algorithm)
● Recommendation
● Feature extraction and selection
● Clustering
● Statistics
● Linear Algebra

The following can also be done using MLlib:
● Model import and export
● Pipelines

Dependencies -­‐  Apache  Spark  2.0.  MLlib  is  included  as  a  module.

-­‐   MLlib   uses   the   linear   algebra   package   Breeze,   which   depends   on   netlib-­‐java   for  optimized  numerical  processing.

-­‐  To  use  MLlib  in  Python,  NumPy  version  1.4  or  newer  is  required.

Interfaces  and  languages  supported

Usable  in  Java,  Scala,  Python,  and  SparkR.

Security  support

Spark  currently  supports  authentication  via  a  shared  secret.  Spark  supports  SSL  for  Akka  and  HTTP  (for  broadcast  and  file  server)  protocols.  SASL  encryption  is  supported  for  the  block   transfer   service.   Encryption   is   not   yet   supported   for   data   stored   by   Spark   in  temporary  local  storage,  such  as  shuffle  files,  cached  data,  and  other  application  files.  If  encrypting   this   data   is   desired,   a   workaround   is   to   configure   the   cluster   manager   to  store  application  data  on  encrypted  disks.


Data source All data sources that can be connected to Spark (there are many connectors that allow Spark to read/write data from/to other data sources and storage systems).

Potential  usage  within  BIGSEA

The  key  benefit  of  MLlib  to  BIGSEA  is  that  it  allows  data  scientists  to  focus  on  their  data  problems  and  models   instead  of   solving   the  complexities   surrounding  distributed  data  (such   as   infrastructure,   configurations,   and   so  on).   Just   as   important,   Spark  MLlib   is   a  general-­‐purpose  library,  providing  algorithms  for  most  ML  use  cases  while  at  the  same  time   allowing   the   data   scientists   to   build   upon   and   extend   it   for   specialized  ML   use  cases.
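As a minimal illustration of the above, the following sketch trains a k-means model with the RDD-based MLlib API from PySpark; the input path, feature encoding and number of clusters are illustrative assumptions.

    # Minimal MLlib sketch: k-means clustering on an RDD of numeric feature vectors.
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="bigsea-mllib-kmeans")

    # Each input line is assumed to contain space-separated numeric features.
    points = (sc.textFile("hdfs:///data/features.txt")      # hypothetical input path
                .map(lambda line: [float(x) for x in line.split()]))

    model = KMeans.train(points, k=5, maxIterations=20)     # illustrative parameters
    print(model.clusterCenters)

    sc.stop()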

7.6.2. Ophidia  Operators  

Identification Ophidia  analytics  operator

Type Datacube-­‐oriented  analytics  operators

License GNU  GPL  v3.0

Current  Version

v0.9  -­‐  Feb/2016  

Website Website:  http://ophidia.cmcc.it

Documentation:  http://ophidia.cmcc.it/documentation/

Download/Source  code:  http://ophidia.cmcc.it/download/

Purpose It provides around 50 parallel (MPI-based) operators that allow datacube-oriented analytics and metadata management, natively supporting the NetCDF format. These include:

● Data import/export
● Subsetting
● Reduction/aggregation
● Data exploration
● Cube intercomparison
● Metadata handling
● Script execution
● Run Ophidia primitives

High  Level  Architecture

Ophidia  operators  run  on  the  compute  nodes  of  an  Ophidia  cluster  (see  Ophidia  table  in  Section  7.5.1)

Dependencies Ophidia  framework  and  MPI  environment

Interfaces  and  languages  supported

Can  be  executed  in  the  Ophidia  environment


Security  support

See  Ophidia  table  in  Section  7.5.1

Data  source Scientific  multi-­‐dimensional  data  (NetCDF  format).

Potential  usage  within  BIGSEA

Ophidia   operators   allow   the   execution   of   a   wide   range   of   OLAP-­‐oriented   tasks   on  scientific   multi-­‐dimensional   data.   They   can   be   used   for   running   data   analytics  experiments  on  climate/weather  data.  Additionally,  some  operators  provide  support  for  metadata  management.  Data  operators  are  based  on  MPI  for  parallel  processing.
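As a hedged sketch of how such operators could be invoked from Python, the following assumes that the PyOphidia client library is available; the server address, credentials, NetCDF path and operator arguments are illustrative placeholders rather than a prescribed workflow (in practice the target datacube identifier would typically be passed explicitly to each operator).

    # Sketch: submitting Ophidia operators through the (assumed) PyOphidia client.
    from PyOphidia import client

    ophclient = client.Client(username="oph-user", password="oph-passwd",
                              server="ophidia.example.org", port="11732")

    # Import a NetCDF file as a datacube (hypothetical path, variable and dimension).
    ophclient.submit("oph_importnc src_path=/data/tasmax.nc;measure=tasmax;imp_dim=time;")

    # Reduce along the array (time) dimension, e.g. computing the average.
    ophclient.submit("oph_reduce operation=avg;")

    print(ophclient.last_response)   # raw response of the last submitted operator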

7.6.3. Ophidia  Primitives  

Identification Ophidia  primitives

Type Array-­‐based  analytical  primitives

License GNU  GPL  v3.0

Current  Version

v0.9  -­‐  Feb/2016  

Website Website:  http://ophidia.cmcc.it

Documentation:  http://ophidia.cmcc.it/documentation/

Download/Source  code:  http://ophidia.cmcc.it/download/

Purpose Ophidia primitives provide around 100 array-based primitives, based on well-known scientific libraries (e.g. GSL, matheval, etc.), that allow analytics, statistical and mathematical operations. Primitives are implemented as User Defined Functions (UDF) and can be executed directly from the Ophidia I/O server in SQL queries. Additionally, multiple primitives can be nested into a single query.

Among the functions provided by the primitives there are:
● Array subsetting and extraction
● Arithmetic/mathematical (e.g. multiplication, addition, absolute value)
● Array aggregation
● Statistical (e.g. max, min, average, quantiles, std. deviation, boxplot, histogram)
● Array manipulation (e.g. shift, permutation, concatenation)
● Data conversion and cast
● Mathematical computations (e.g. linear regression, interpolation, predicate evaluation, etc.)

High  Level  Architecture

Ophidia primitives run in the I/O servers of an Ophidia cluster (see Ophidia table in Section 7.5.1).

Dependencies Ophidia  framework

Interfaces and languages supported

Can be executed in the Ophidia environment and also standalone in SQL queries in MySQL.

Security  support

See  Ophidia  table  in  Section  7.5.1

Data  source Scientific  multi-­‐dimensional  data  (NetCDF  format).

Potential  usage  within  BIGSEA

Ophidia   primitives   provide   a   set   of   low-­‐level   functions   to   perform   analytics   on   data  stored   in   arrays   (e.g.   time   series).   They   are   especially   suited   for   scientific   array-­‐based  data.   In   the  context  of   the  project   these   features  can  be  used   in  conjunction  with   the  Ophidia  operators  to  perform  statistical  computation  and  analytics  through  the  Ophidia  framework  on  climate/weather  data.

7.7. Final  Assessment  

The following tables provide a summary of the various technologies analyzed for each block of the WP4 architecture. In particular, for each of them, a set of features necessary for the big data eco-system has been highlighted. In the following tables, Yes (Partially supported) means that the component provides proper (partial) support for that feature, so it is technically sound with regard to the project requirements. An empty cell is associated with components that would require too many adaptation/extension activities to address that feature, or that are not able to support it at all. The features have been identified from the end-user, data source and analytical requirements.

Data  Storage  Components

Columns (in order): Hadoop HDFS, PostGIS* (PostgreSQL), MongoDB* storage, Ophidia* storage

Store files from data sources: Yes
Store GIS (stationary/dynamic) data sources: Yes / Yes
Store environmental data sources: Yes
Store social network data sources: Yes / Yes
Store derived and platform-level data: Yes / Yes
Store metadata related to level-1 data: Yes / Yes / Yes

* Ophidia, PostGIS and MongoDB are mainly classified in Section 7.5 as data analytics and access tools. However, they also provide storage capabilities in their stack, which explains their role in this table.


Data  Access  Components  

Columns (in order): Apache HBase, PostGIS, MongoDB, Ophidia*

Import/export NetCDF data: Yes
Import/export JSON data: Yes / Yes
Import/export Shapefile data: Yes
Select/filter data: Yes / Yes / Yes / Yes
Temporal/spatial queries: Partially / Yes / Yes / Yes
Aggregation queries: Yes / Yes / Yes / Yes

* Ophidia is classified in Section 7.5 as a data analytics tool. However, it also provides access capabilities in its stack, which explains its role in this table.

Data  Ingestion  and  Streaming  Processing  Components

Columns (in order): Apache Kafka, Apache Storm, Apache Flink, Spark Streaming

Continuous data ingestion: Yes
Processing of data stream: Yes / Yes / Yes
Streaming analytics: Requires custom code / Requires custom code / Requires custom code
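To give an idea of what "requires custom code" means in practice, the following minimal Spark Streaming sketch computes word counts over micro-batches; the socket source, host, port and batch interval are illustrative.

    # Sketch: per-batch word counts with Spark Streaming (illustrative source/parameters).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="bigsea-streaming-counts")
    ssc = StreamingContext(sc, batchDuration=5)          # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)      # hypothetical stream source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                      # print each batch's counts

    ssc.start()
    ssc.awaitTermination()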

Data  Analytics  and  Mining  Components  

Columns (in order): Ophidia, Apache Hive, Apache Kylin, Druid, Apache Spark, Hadoop MapReduce

OLAP Analysis: Yes / Yes / Yes / Yes / Requires custom code / Requires custom code
Batch processing: Yes / Yes / Yes

Data  Analytics  and  Mining  Toolbox  

Columns (in order): Spark MLlib, Ophidia Primitives, Ophidia Operators

Data Mining Algorithms (Clustering, Classification, Regression): Yes / Only clustering
Statistical/mathematical analytics: Yes / Yes / Yes, exploiting Ophidia primitives
Time series analysis: Partially supported / Yes / Yes
Spatial analysis: Partially supported / Partially supported


8. PRELIMINARY ARCHITECTURAL MAPPING

The following diagram (Figure 13) shows a preliminary mapping of some technologies on the fast and big data eco-system architecture. These tools and systems have been selected based on the analysis and the assessment provided in the previous section. The mapping clearly highlights the different big data technologies that could be exploited to address the use case requirements.

 Figure  13.  Preliminary  architectural  mapping  


9. CONCLUSIONS

This document has provided a complete overview of the design of the integrated big and fast data eco-system. In particular, it has presented the general conceptual view of the proposed data management eco-system, describing in detail the key architectural components needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the project. Additionally, a detailed view of the architecture in terms of internal components (storage, ETL, big data technologies, Entity Matching and Data Quality services), sequence diagrams (UML notation), QoS metrics to be exposed at the WP4 level, data management APIs and data-related security aspects has also been presented.

The document has also included the full list of the data-related requirements from D7.1, together with a comprehensive description of the main data sources (classified as raw, derived and platform-level) and of the user classes.

Also worth mentioning is the presentation of the main big and fast data tools currently available in the data landscape that could fit into the WP4 software architecture, from the point of view of (i) storage, (ii) access, (iii) analytics/mining and the related toolbox, and (iv) ingestion and streaming processing components. The links to the other WPs from the security, quality of service, user requirements and programming framework standpoints have also been highlighted in the text to make clear how the WP4 architecture fits into the overall project picture.

This deliverable provides a comprehensive view of the main architectural aspects of the fast and big data management eco-system and a solid basis for moving forward with the implementation of the software stack.


10. REFERENCES

[R01] Michalakes, J., et al. "The weather research and forecast model: software architecture and performance." Proceedings of the 11th ECMWF Workshop on the Use of High Performance Computing in Meteorology. 2005.
[R02] Climate, Global, and Weather Modeling Branch, Environmental Modeling Center (2003). The GFS atmospheric model. Office Note 442.
[R03] Rew, R., E. Hartnett, and J. Caron. "NetCDF-4: software implementing an enhanced data model for the geosciences." 22nd International Conference on Interactive Information Processing Systems for Meteorology, Oceanography, and Hydrology. 2006.
[R04] https://dev.twitter.com/
[R05] Jing Han, Haihong E, Guan Le and Jian Du, "Survey on NoSQL database," Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, Port Elizabeth, 2011, pp. 363-366. doi: 10.1109/ICPCA.2011.6106531.
[R06] Grady Booch, James Rumbaugh, Ivar Jacobson. The Unified Modeling Language User Guide. Addison Wesley, First Edition, October 20, 1998, ISBN: 0-201-57168-4, 512 pages.
[R07] Borthakur D (2008) HDFS architecture guide. Hadoop Apache Project. http://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf
[R08] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proc. IEEE 26th Symp. Mass Storage Syst. Technol., 2010, pp. 1-10.
[R09] S. Fiore, C. Palazzo, A. D'Anca, I. T. Foster, D. N. Williams, G. Aloisio, "A big data analytics framework for scientific data management", IEEE BigData Conference 2013: 1-8.
[R10] Donatello Elia, Sandro Fiore, Alessandro D'Anca, Cosimo Palazzo, Ian T. Foster, Dean N. Williams: An in-memory based framework for scientific data analytics. Conf. Computing Frontiers 2016: 424-429.
[R11] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 2-2, Berkeley, CA, USA, 2012. USENIX Association.
[R12] Ekanayake J, Pallickara S, Fox G (2008) MapReduce for data intensive scientific analyses. In: Proceedings of the IEEE Fourth International Conference on eScience, pp. 277-284.
[R13] Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107-113.
[R14] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2):1626-1629.
[R15] https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Metrics.html
[R16] Swagger - http://swagger.io
[R17] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proc. 19th ACM Symp. Operating Syst. Principles, 2003, pp. 29-43.
[R18] http://www.opengeospatial.org/standards/sfs
[R19] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. "Bigtable: a distributed storage system for structured data." In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7 (OSDI '06), Vol. 7. USENIX Association, Berkeley, CA, USA, 15-15.
[R20] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, "Discretized streams: Fault-tolerant streaming computation at scale," in Proc. 24th ACM Symp. Operating Syst. Principles, 2013, pp. 423-438.


GLOSSARY  

Acronym Explanation Usage  Scope

AAA Authentication,  Authorization  and  Accounting Security

ACL Access  Control  List Security

Amazon  S3 Amazon  Simple  Storage  Service Storage  technology

API Application  Programming  Interface Interfaces

CLI Command  Line  Interface Interfaces

CSV Comma  Separated  Value Data  type

DAG Directed  Acyclic  Graph Execution  plan

DBMS Database  Management  System Storage  technology

DQaS Data  quality  as  a  service Service

EM Entity  Matching   Service

ESRI Environmental  Systems  Research  Institute   Data  type

ETL Extraction,  Transformation  and  Load Data  integration

GDAL Geospatial  Data  Abstraction  Library   Software  library

GEOS Geometry  Engine  Open  Source Software  library

GFS The  Google  File  System Storage  technology

GIS Geographic  Information  System Data  Type

GNU  GPL GNU  General  Public  License License  Type

GSL GNU  Scientific  Library Software  library

GSSAPI Generic  Security  Service  Application  Program  Interface   Security

GUI Graphical  User  Interface Interfaces

IaaS Infrastructure  as  a  Service WP3

LDAP Lightweight  Directory  Access  Protocol Security

JDBC Java  DataBase  Connectivity Database  API

JSON JavaScript  Object  Notation Data  Type

MPI Message  Passing  Interface Parallel  Computing

NetCDF Network  Common  Data  Form Data  Type

NoSQL Not  Only  SQL Database  Paradigm

ODBC Open  DataBase  Connectivity Database  API


OLAP On-­‐line  Analytical  Processing Type  of  processing

PAM Pluggable  Authentication  Module Security

QoS Quality  of  Service WP3

RADIUS Remote  Authentication  Dial-­‐In  User  Service Security

RDD Resilient  Distributed  Dataset Data  structure

REST REpresentational  State  Transfer Interfaces

RSS Rich  Site  Summary Data  type

SAML Security  Assertion  Markup  Language Security

SASL Simple  Authentication  and  Security  Layer Security

SSL Secure  Sockets  Layer   Security

SSH Secure  Shell Security

SSO Single  sign-­‐on   Security

SSPI Security  Support  Provider  Interface Security

TSV Tab-­‐separated  values Data  Type

UI User  Interface Interfaces

UML Unified  Modeling  Language Modelling  language

WP Work  package Project  management

WRF Weather  Research  and  Forecasting Weather  Forecast  Model