
www.eubra-bigsea.eu | [email protected] |@bigsea_eubr

D5.2: Programming abstractions design Author(s) Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG)

Status Draft/Review/Approval/Final

Version V1.0

Date 09/01/2017

Dissemination Level X PU: Public PP: Restricted to other programme participants (including the Commission) RE: Restricted to a group specified by the consortium (including the Commission) CO: Confidential, only for members of the consortium (including the Commission)

EUBra-BIGSEA is funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 690116.

This project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministry of Science, Technology and Innovation (MCTI).

Abstract: Europe - Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT.

The purpose of this report is to detail the design of the first version of the software components building the programming abstractions layer of the EUBra-BIGSEA platform. The document describes how each component has been extended following the overall architecture design provided in D5.1.

This document, together with the related software repositories, also realizes the milestone MS13 of the project.


Document identifier: EUBRA BIGSEA -WP5-D5.2

Deliverable lead BSC

Related work package WP5

Author(s) Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG)

Contributor(s) Dorgival Guedes (UFMG), Gustavo Avelar (UFMG)

Due date 31/12/2016

Actual submission date 10/01/2017

Reviewed by Andrey Brito (UFCG), Ignacio Blanquer (UPV)

Approved by PMB

Start date of Project 01/01/2016

Duration 24 months

Keywords Big Data, programming models

Versioning and contribution history

Version Date Authors Notes

0.1 30/11/2016 Daniele Lezzi (BSC) Table of contents

0.2 12/12/2016 Walter dos Santos Filho (UFMG) Lemonade sections

0.3 15/12/2016 Gustavo Avelar (UFMG) Edits on Lemonade

0.4 19/12/2016 Daniele Lezzi (BSC) COMPSs sections

0.5 20/12/2016 Daniele Lezzi (BSC), Walter dos Santos Filho (UFMG) References fixes and general review

0.6 23/12/2016 Daniele Lezzi (BSC) Final version for review

0.7 29/12/2016 Sandro Fiore (CMCC) Edits programming models section

1.0 09/01/2017 Daniele Lezzi (BSC) Reviewed version

Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0. Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services. While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied warranties of merchantability and fitness for a particular purpose. Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. Without derogating from the generality of the foregoing neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage caused by or arising from any information advice or inaccuracy or omission herein.


TABLE OF CONTENTS

EXECUTIVE SUMMARY
1. INTRODUCTION
    1.1. Scope of the document
    1.2. Target audience
    1.3. Structure of the document
2. PROGRAMMING ABSTRACTIONS LAYER
    2.1. Programming frameworks
        2.1.1. COMPSs
        2.1.2. Deployment of COMPSs in EUBra-BIGSEA
    2.2. Code generation with Lemonade Platform
        2.2.1. Motivation
        2.2.2. Related work
        2.2.3. Main concepts
        2.2.4. Lemonade architecture
        2.2.5. Metadata information
        2.2.6. Workflow execution and monitoring
        2.2.7. Source code conversion (transpiler or code to code compiler)
        2.2.8. Visualization of results (work in progress)
        2.2.9. Security and Privacy
        2.2.10. Integration with other project components and work packages
        2.2.11. Deployment
        2.2.12. Sample usage and application
3. CONCLUSIONS
ANNEXES
    Annex A. Example of code to code generation in Lemonade
    Annex B. Links to components and documentation
        Annex B.1. Lemonade
        Annex B.2. COMPSs
REFERENCES
GLOSSARY

LIST OF FIGURES

Figure 1 - Detailed view of WP5 components
Figure 2 - COMPSs Framework architecture
Figure 3 - Graph of the execution of a Wordcount application with PyCOMPSs
Figure 4 - Graph of the execution of a clustering algorithm developed with PyCOMPSs
Figure 5 - Architecture of COMPSs on top of persistent storage
Figure 6 - Integration of COMPSs with Mesos
Figure 7 - COMPSs framework created in a Mesos cluster
Figure 8 - Execution of COMPSs workers in Mesos
Figure 9 - Lemonade domain model
Figure 10 - Lemonade components
Figure 11 - Interaction between Lemonade components for job execution
Figure 12 - Path finder application (WP7 application) represented as a Lemonade workflow

LIST OF TABLES

Table 1 - Relation of Lemonade with other BIGSEA components
Table 2 - Lemonade source code repositories

LIST OF LISTINGS

Listing 1 – PyCOMPSs code example
Listing 2 – KMeans PyCOMPSs implementation
Listing 3 – Definition of COMPSs application execution with Chronos
Listing 4 – JSON representation of a Lemonade application
Listing 5 – Lemonade source code of a Spark application


EXECUTIVE SUMMARY

This document describes the implementation of the programming model prototypes developed as a part of the EUBra-BIGSEA platform. The programming models offer the tools to abstract the data services to the user scenarios and execute them on the QoS infrastructure. COMPSs and Apache Spark are the two available frameworks for the porting of the scenarios.

This document, together with the description of the software components available in the project’s repository, realizes the milestone MS13 First release of the programming layer.

COMPSs applications can be written in sequential Java, Python or C/C++, and make use of other higher-level software components, such as OPHIDIA workflows. Sequential code is instrumented with data flow information that COMPSs uses to infer parallelism. COMPSs is platform agnostic and deals both with the execution and the negotiation with the computing infrastructure to request the necessary resources for the execution of the workflows. In this project, COMPSs has been extended to create a Mesos framework and to support NoSQL storage. Additional dependencies are easily coded inside COMPSs jobs through the use of Docker containers.

Lemonade (Live Exploration and Mining Of Non-trivial Amount of Data from Everywhere) is a visual platform for distributed computing, aimed at enabling the implementation, experimentation, testing and deployment of data processing and machine learning applications. It provides developers with high-level abstractions, called operations, to build processing workflows through a graphical web interface. Lemonade currently generates Spark code, and it will be extended to support COMPSs workflows during the second year.

Lemonade provides (or will provide) many operations typically used for Extraction, Transformation and Loading (ETL), including data transformation, machine learning, statistical analysis, text processing and data visualization. Lemonade consists of a set of components that together provide this functionality.


1. INTRODUCTION

1.1. Scope of the document

This document describes the Programming Models prototypes that have been developed as part of the EUBra-BIGSEA abstractions layer. The previous document, D5.1, describes the high-level integration of the software components of the Programming Model Abstraction Layer and its integration with the other layers: applications (WP7), Big Data Ecosystem (WP4) and QoS infrastructure (WP3). This document focuses on the details of the implementations of the programming frameworks and also provides hints on the work needed to complete the deployment in the project's architecture.

1.2. Target audience

The document is mainly intended for internal use, although it is publicly released. Its main target is the global team of technical experts of EUBra-BIGSEA, including WP3, WP4, WP5 and WP6, and the application experts in WP7 who need to adopt the programming frameworks to build their applications.

1.3. Structure of the document

The main contribution of the document is provided in Section 2. Section 2.1 gives an overview of COMPSs, with special emphasis on the features related to the programming and execution of Big Data applications, followed by a description of the initial prototype of COMPSs for Mesos. Section 2.2 describes the developments for the first release of the Lemonade (Live Exploration and Mining Of Non-trivial Amount of Data from Everywhere) toolkit, with initial support for Spark. Annex A describes the development of a WP7 path finder application with Lemonade, to be used as an example of descriptive models and as a test for the infrastructure. Annex B contains links to software and documentation for COMPSs and Lemonade.


2. PROGRAMMING ABSTRACTIONS LAYER

The programming abstraction layer offers developers the functionalities needed to satisfy the requirements for the implementation of the application scenarios on top of the Big Data layer in the project. In this document we focus on the description of the components that enable the development of modules and libraries (building blocks) that hide the intricacies of the data layer from the applications.

2.1. Programming frameworks

Figure 1 depicts the detailed architecture of the programming frameworks. COMPSs is used to implement a set of high-level functionalities that could be workflows [1] of Ophidia operators [2] or modules implementing operations on multiple Big Data back-ends (e.g. frameworks). In the case of Ophidia, the code binding is performed using the Python module PyOphidia [3]. PyOphidia provides programmatic support for submitting anything from single operators to more complex workflows to an Ophidia cluster instance, while preserving client-side capabilities. In particular: (i) it allows the implementation of client-side data analysis and visualization activities by exploiting libraries available in the Python ecosystem (e.g. NumPy, SciPy, etc.); (ii) it allows the implementation of arbitrarily complex applications that interact with Ophidia (i.e. iteratively) to perform data analysis tasks. Such applications will be the ones implemented in EUBra-BIGSEA at WP4 level and will be managed by the programming framework layer (i.e. COMPSs) at WP5 level.

Spark is adopted to integrate existing application modules. It provides a library, called Spark ML, that supports the execution of different machine learning techniques (e.g. linear regression, classification, clustering) in a distributed way, by using programming abstractions and infrastructure of Spark. Lemonade is a main contribution of the project that provides a platform to compose Big Data applications starting from building blocks developed using COMPSs, Spark or directly through Ophidia operators.

The next sections provide details of the integration of COMPSs in the EUBra-BIGSEA QoS infrastructure.

[1] http://ophidia.cmcc.it/documentation/users/workflow/index.html
[2] http://ophidia.cmcc.it/documentation/users/operators/index.html
[3] https://github.com/OphidiaBigData/PyOphidia


Figure 1 - Detailed view of WP5 components

2.1.1. COMPSs

COMPSs [R10] is a framework, composed of a programming model and a runtime system, which aims to ease the development and deployment of distributed applications and web services. The core of the framework is its programming model, which allows the programmer to write applications in a sequential way and execute them on top of heterogeneous infrastructures, exploiting the inherent parallelism of the applications. The COMPSs programming model is task-based: the programmer selects the methods of the sequential application that are to be executed remotely. This selection is done by means of an annotated interface in which every method that must be considered a task is declared, with annotations describing its data accesses and its constraints on the execution resources. At execution time the runtime system uses this information to build a dependency graph and orchestrate the tasks on the available resources.

2.1.1.1. COMPSs Programming Model

COMPSs offers a simple programming model based on sequential development. As depicted in Figure 2, COMPSs has native support for Java applications, with bindings for C/C++ code and Python scripts. In this document, also considering the requirements of the project, we focus on the Python binding [R11], PyCOMPSs. In this model, the developer is mainly responsible for (i) identifying the functions to be executed as asynchronous parallel tasks and (ii) annotating them with a standard Python decorator. The runtime is then in charge of exploiting the inherent concurrency of the script: it automatically detects the data dependencies between tasks and spawns those tasks on the available resources, which can be nodes in a cluster, cloud instances or containers in a Mesos cluster. A key aspect of providing an infrastructure-unaware programming model is that programs can be developed once and run on multiple back-ends without changing the implementation. This is important when portability between clouds must be achieved. In COMPSs, the programmer is freed from having to deal with the details of the specific cloud, since these details are handled transparently by the runtime. The availability of different connectors, each implementing the API of a specific provider (e.g. a cloud provider), makes it possible to run computational loads on multiple backend environments without adapting the code. In cloud environments, COMPSs provides scaling and elasticity features that allow the number of utilized resources to be dynamically adapted to the actual execution needs.

Figure 2 - COMPSs Framework architecture

Calls to annotated functions are wrapped by a function of the Python binding, which forwards the function name and parameters to the Java runtime. With that information, the Java runtime creates a task and adds it to the data dependency graph, immediately returning control to Python. At this point, the main program can continue executing right after the task invocation, possibly invoking more tasks. The Java runtime therefore executes concurrently with the main program of the application: as the latter issues new task creation requests, the former dynamically builds a task dependency graph. This graph represents the inherent concurrency of the program and determines what can be run in parallel. When a task is free of dependencies, the runtime system schedules it on one of the available resources, which are specified in XML configuration files.
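The way a dependency graph can emerge from per-task data accesses can be sketched with a toy model. This is not the COMPSs runtime: the class `ToyRuntime`, its methods and the `IN`/`INOUT` labels are hypothetical names chosen for illustration, and only read-after-write and write-after-write dependencies are tracked.

```python
# Toy model of data-flow dependency detection; NOT the COMPSs runtime.
from collections import defaultdict

IN, INOUT = "IN", "INOUT"  # hypothetical direction labels


class ToyRuntime:
    def __init__(self):
        self.tasks = []               # task names in submission order
        self.deps = defaultdict(set)  # task index -> indices it depends on
        self.last_writer = {}         # datum name -> index of last writer

    def submit(self, name, accesses):
        """Register a task; accesses maps a datum name to IN or INOUT."""
        idx = len(self.tasks)
        self.tasks.append(name)
        for datum, direction in accesses.items():
            # The task depends on whichever task last wrote this datum.
            if datum in self.last_writer:
                self.deps[idx].add(self.last_writer[datum])
            if direction == INOUT:
                self.last_writer[datum] = idx
        return idx  # returns immediately, like an asynchronous task call

    def ready(self, finished):
        """Unfinished tasks whose dependencies have all finished."""
        return [i for i in range(len(self.tasks))
                if i not in finished and self.deps[i] <= finished]


rt = ToyRuntime()
rt.submit("word_count(block0)", {"block0": IN, "p0": INOUT})
rt.submit("word_count(block1)", {"block1": IN, "p1": INOUT})
rt.submit("reduce_count(result, p0)", {"result": INOUT, "p0": IN})
rt.submit("reduce_count(result, p1)", {"result": INOUT, "p1": IN})
print(rt.ready(set()))   # both word_count tasks can start at once
print(rt.ready({0, 1}))  # the first reduction becomes schedulable
```

In this sketch the two reductions are serialized because both update `result`, while the two counting tasks are independent of each other.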

The default scheduling policy of the runtime is locality-aware. When scheduling a task, the runtime system computes a score for all the available resources and chooses the one with the highest score. This score is the number of task input parameters that are already present on that resource and consequently do not need to be transferred.
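The scoring rule described above can be sketched in a few lines. The function names, the resource names and the tie-breaking behaviour (first resource with the highest score wins) are assumptions made for this illustration and are not taken from the COMPSs scheduler.

```python
def locality_score(task_inputs, resource_data):
    """Number of task inputs that are already stored on the resource."""
    return sum(1 for datum in task_inputs if datum in resource_data)


def pick_resource(task_inputs, resources):
    """resources maps a resource name to the set of data names it holds."""
    return max(resources,
               key=lambda r: locality_score(task_inputs, resources[r]))


# Hypothetical cluster state: which node already holds which data.
resources = {
    "node1": {"block0", "result"},
    "node2": {"block1"},
    "node3": set(),
}
chosen = pick_resource(["result", "block0"], resources)
print(chosen)  # node1 holds both inputs, so nothing has to be transferred
```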

The COMPSs model is well suited to building Big Data applications, since it provides a more flexible programming model than alternatives such as Spark. In particular, COMPSs has recently been extended to support a Big Data storage architecture in which the data stored in the backend is abstracted and accessed from the application in the form of persistent objects. More details are provided in the next section.

Listing 1 contains example code of a WordCount application whose execution graph with COMPSs is depicted in Figure 3. The main program of the application is a sequential Python script that contains calls to tasks (lines 9 and 10). Tasks can modify or generate data (e.g. a file or an object), and the main program can eventually access these data. Before doing so, however, the programmer needs to synchronise the data, i.e. stall the main control flow until the last version produced by the tasks is available, which can imply waiting for a task to finish (line 11). The main program then works with the correct version of the data.


Depending on the data type being synchronised, one of two API functions is invoked: compss_wait_on(obj, to_write=True), which waits for the last version of object obj and returns the synchronised object, and compss_open(filename, mode='r'), which waits for the last version of file filename and returns the file descriptor for that synchronised file.

 1  from collections import defaultdict
 2  import sys
 3  if __name__ == "__main__":
 4      from pycompss.api.api import compss_wait_on
 5      pathFile = sys.argv[1]
 6      sizeBlock = int(sys.argv[2])
 7      result = defaultdict(int)
 8      for block in read_file_by_block(pathFile, sizeBlock):
 9          presult = word_count(block)
10          reduce_count(result, presult)
11      output = compss_wait_on(result)
12      for (word, count) in output.items():
13          print("%s: %i" % (word, count))

    # Task definitions. In the actual script they must appear before the
    # main block above. The helper read_file_by_block is not shown in the
    # listing.
    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT

    @task(dict_1=INOUT)
    def reduce_count(dict_1, dict_2):
        # Accumulate the partial counts of dict_2 into dict_1 (updated in place)
        for k, v in dict_2.items():
            dict_1[k] += v

    @task(returns=dict)
    def word_count(collection):
        # Count the occurrences of each word in one block
        result = defaultdict(int)
        for word in collection:
            result[word] += 1
        return result

Listing 1 – PyCOMPSs code example
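For readers who want to try the pattern of Listing 1 without a COMPSs installation, the following self-contained variant may help. It is a sketch under stated assumptions: when the `pycompss` package cannot be imported, trivial stand-ins make the tasks run sequentially; `read_words_by_block` and `count_words` are hypothetical helpers that are not part of the original listing, and the input is an in-memory word list rather than a file.

```python
from collections import defaultdict

try:
    from pycompss.api.task import task
    from pycompss.api.parameter import INOUT
    from pycompss.api.api import compss_wait_on
except ImportError:
    # Stand-ins for local experiments only (assumption): tasks become
    # ordinary function calls and synchronisation is a no-op.
    INOUT = None

    def task(**_kwargs):
        def decorator(func):
            return func
        return decorator

    def compss_wait_on(obj):
        return obj


@task(returns=dict)
def word_count(collection):
    result = defaultdict(int)
    for word in collection:
        result[word] += 1
    return result


@task(dict_1=INOUT)
def reduce_count(dict_1, dict_2):
    for k, v in dict_2.items():
        dict_1[k] += v


def read_words_by_block(words, size_block):
    """Hypothetical helper: yield consecutive blocks of size_block words."""
    for start in range(0, len(words), size_block):
        yield words[start:start + size_block]


def count_words(words, size_block):
    result = defaultdict(int)
    for block in read_words_by_block(words, size_block):
        presult = word_count(block)    # asynchronous task under PyCOMPSs
        reduce_count(result, presult)  # depends on presult and on result
    return compss_wait_on(result)      # synchronise before reading


if __name__ == "__main__":
    text = "to be or not to be".split()
    for word, count in sorted(count_words(text, 2).items()):
        print("%s: %i" % (word, count))
```

The main program never inspects `result` before the `compss_wait_on` call, which is the property that lets a runtime execute the tasks asynchronously.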

Figure 3 - Graph of the execution of a Wordcount application with PyCOMPSs

The next listing contains another example of a PyCOMPSs application, in this case for clustering. K-means clustering is a method of cluster analysis that aims to partition n points into k clusters in which each point belongs to the cluster with the nearest mean. It follows an iterative refinement strategy to find the centres of natural clusters in the data.

When executed with COMPSs, K-means first generates the input points by means of initialization tasks. For parallelism purposes, the points are split into a number of fragments received as a parameter, each fragment being created by an initialization task and filled with random points.

    import time
    import numpy as np
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on

    # The enclosing function header is missing from the original listing;
    # the name and parameter list used here are assumed. The helpers
    # genFragment, init_random, has_converged and merge_reduce are not
    # shown in the listing either.
    def kmeans_frag(numV, k, dim, epsilon, maxIterations, numFrag):
        size = int(numV / numFrag)

        # One initialization task per fragment
        X = [genFragment(size, dim) for _ in range(numFrag)]
        mu = init_random(dim, k)
        oldmu = []
        n = 0
        startTime = time.time()
        while not has_converged(mu, oldmu, epsilon, n, maxIterations):
            oldmu = mu
            # Computation tasks, one per fragment
            clusters = [cluster_points_partial(X[f], mu, f * size) for f in range(numFrag)]
            partialResult = [partial_sum(X[f], clusters[f], f * size) for f in range(numFrag)]
            # Reduction phase, followed by synchronisation of the merged result
            mu = merge_reduce(reduceCentersTask, partialResult)
            mu = compss_wait_on(mu)
            # Post-processing: new centre = sum of points / number of points
            mu = [mu[c][1] / mu[c][0] for c in mu]
            n += 1
        return (n, mu)

    @task(returns=dict)
    def partial_sum(XP, clusters, ind):
        # For each cluster: (number of points, sum of points) in this fragment
        p = [(i, [(XP[j - ind]) for j in clusters[i]]) for i in clusters]
        dic = {}
        for i, l in p:
            dic[i] = (len(l), np.sum(l, axis=0))
        return dic

    @task(returns=dict, priority=True)
    def reduceCentersTask(a, b):
        # Merge two partial results by adding counts and sums per cluster
        for key in b:
            if key not in a:
                a[key] = b[key]
            else:
                a[key] = (a[key][0] + b[key][0],
                          a[key][1] + b[key][1])
        return a

    @task(returns=dict)
    def cluster_points_partial(XP, mu, ind):
        # Assign each point of the fragment to its nearest centre
        dic = {}
        for x in enumerate(XP):
            bestmukey = min([(i[0], np.linalg.norm(x[1] - mu[i[0]]))
                             for i in enumerate(mu)], key=lambda t: t[1])[0]
            if bestmukey not in dic:
                dic[bestmukey] = [x[0] + ind]
            else:
                dic[bestmukey].append(x[0] + ind)
        return dic

Listing 2 – KMeans PyCOMPSs implementation

After the initialization, the algorithm goes through a set of iterations. In every iteration, a computation task is created for each fragment; then, there is a reduction phase in which the results of the computations are accumulated two at a time by merge tasks; finally, at the end of the iteration, the main program post-processes the merged result, generating the current clusters that will be used in the next iteration. Consequently, if F is the total number of fragments, K-means generates F computation tasks and F-1 merge tasks per iteration. The graph of the COMPSs execution with 8 fragments and 4 iterations is represented in Figure 4.
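The count of F-1 merge tasks follows from combining partial results two at a time until a single one remains. The sketch below illustrates this with plain numbers; `merge_reduce_counted` is a hypothetical helper written for this illustration (the listing's own `merge_reduce` is not shown in the source), and it additionally returns the number of merges performed.

```python
def merge_reduce_counted(function, data):
    """Combine items two at a time until one remains; count the merges."""
    queue = list(data)
    merges = 0
    while len(queue) > 1:
        a = queue.pop(0)
        b = queue.pop(0)
        # The merged value goes to the back of the queue, so merges of
        # different pairs are independent and could run in parallel.
        queue.append(function(a, b))
        merges += 1
    return queue[0], merges


# Eight partial results, standing in for the output of eight fragments.
total, merges = merge_reduce_counted(lambda a, b: a + b, range(1, 9))
print(total, merges)  # 36 7
```

With 8 fragments the reduction needs 7 merges, arranged as a tree of depth 3 rather than a chain of 7 sequential steps.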


Figure 4 - Graph of the execution of a clustering algorithm developed with PyCOMPSs

2.1.1.2. COMPSs Runtime

One important feature of the COMPSs runtime is the ability to elastically adapt the amount of resources to the current workload. When the number of tasks is higher than the number of available cores, the runtime turns to the cloud, looking for the provider that offers the type of resources that best meets the requirements of the application at the lowest economic cost. Analogously, when the runtime detects an excess of resources for the actual workload, it powers off unused instances in a cost-efficient way. These decisions are based on resource descriptions that contain the details of the software images and instance templates available for every cloud provider. Since each cloud provider offers its own API, COMPSs defines a generic interface to manage resources and to query the execution cost of multiple cloud providers during one and the same execution. The components implementing this interface, called connectors, are responsible for translating the generic requests into calls to the actual provider's API. In the EUBra-BIGSEA project, this elasticity has been extended to support Mesos clusters, as explained in Section 2.1.2. This extension makes it possible to benefit from the advanced resource allocation mechanisms developed in the project: COMPSs negotiates with Mesos the number of resources appropriate to the current load, while the QoS policies can dynamically modify this set according to performance parameters.
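As a rough illustration of the connector idea and of the elasticity decision, the sketch below defines a generic interface, a fake provider, and two small decision functions. All names (`Connector`, `InstanceTemplate`, `cheapest_offer`, `scaling_decision`) and the prices are hypothetical; this is not the COMPSs connector API, only a minimal model of the behaviour described in the text.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceTemplate:
    name: str
    cores: int
    memory_gb: float
    price_per_hour: float


class Connector(ABC):
    """Generic interface; each provider translates it to its own API."""

    @abstractmethod
    def templates(self) -> List[InstanceTemplate]:
        ...

    @abstractmethod
    def create(self, template: InstanceTemplate) -> str:
        ...


class FakeProvider(Connector):
    """Stand-in provider used only to exercise the sketch."""

    def __init__(self, name: str, templates: List[InstanceTemplate]):
        self.name = name
        self._templates = templates

    def templates(self) -> List[InstanceTemplate]:
        return list(self._templates)

    def create(self, template: InstanceTemplate) -> str:
        return "{}:{}".format(self.name, template.name)


def cheapest_offer(connectors: List[Connector], min_cores: int,
                   min_memory_gb: float) -> Optional[Tuple[Connector, InstanceTemplate]]:
    """Pick the cheapest template, across providers, that meets the requirements."""
    candidates = [
        (connector, template)
        for connector in connectors
        for template in connector.templates()
        if template.cores >= min_cores and template.memory_gb >= min_memory_gb
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[1].price_per_hour)


def scaling_decision(pending_tasks: int, busy_cores: int, total_cores: int) -> str:
    """Very coarse policy: grow when tasks exceed free cores, shrink when idle."""
    free_cores = total_cores - busy_cores
    if pending_tasks > free_cores:
        return "scale_up"
    if pending_tasks == 0 and free_cores > 0:
        return "scale_down"
    return "keep"


provider_a = FakeProvider("a", [InstanceTemplate("small", 2, 4.0, 0.10),
                                InstanceTemplate("large", 8, 32.0, 0.50)])
provider_b = FakeProvider("b", [InstanceTemplate("medium", 4, 8.0, 0.18)])
choice = cheapest_offer([provider_a, provider_b], min_cores=4, min_memory_gb=8.0)
```

Here `choice` selects provider `b`, whose `medium` template is the cheapest one offering at least 4 cores and 8 GB.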

In a COMPSs application, tasks operate on data that is either in memory (regular Python objects and primitive types) or on disk (files). To make COMPSs capable of orchestrating applications that process large amounts of data, the data management of COMPSs was extended at both the programming model and the runtime system levels. Figure 5 represents the integration of PyCOMPSs with an object storage platform that provides transparent persistence to applications enhanced with COMPSs concurrency capabilities. The main idea of such a platform is to ease access to data, also from non-COMPSs applications, which can use in-memory objects or persistent objects in their functions. A Storage API must be implemented by the backend to create, delete, insert, retrieve and iterate over persistent data. The backend implements the specific functions to store and distribute the data, leaving to the applications the abstraction of the data as Python objects. This model facilitates the sharing of data between concurrent applications: objects held in memory are made persistent (and automatically updated) at some point of the execution, and thus become accessible from any other application that holds a reference to the object.
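The five operations listed for the Storage API can be pictured with a toy in-memory backend. The class and method names below (`InMemoryStorage`, `PersistentDict`, `create`, `delete`, `insert`, `retrieve`, `iterate`) are assumptions made for this sketch and do not reproduce the actual API; the point is only to show how an object reference lets a second application reach data made persistent by a first one.

```python
import uuid
from typing import Any, Dict, Iterator, Optional, Tuple


class InMemoryStorage:
    """Toy backend: one dictionary per persistent object, kept in memory."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict[Any, Any]] = {}

    def create(self) -> str:
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {}
        return object_id

    def delete(self, object_id: str) -> None:
        del self._objects[object_id]

    def insert(self, object_id: str, key: Any, value: Any) -> None:
        self._objects[object_id][key] = value

    def retrieve(self, object_id: str, key: Any) -> Any:
        return self._objects[object_id][key]

    def iterate(self, object_id: str) -> Iterator[Tuple[Any, Any]]:
        return iter(list(self._objects[object_id].items()))


class PersistentDict:
    """Python-object view of a persistent object, identified by its reference."""

    def __init__(self, storage: InMemoryStorage, object_id: Optional[str] = None):
        self._storage = storage
        self.object_id = object_id if object_id is not None else storage.create()

    def __setitem__(self, key: Any, value: Any) -> None:
        self._storage.insert(self.object_id, key, value)

    def __getitem__(self, key: Any) -> Any:
        return self._storage.retrieve(self.object_id, key)

    def items(self) -> Iterator[Tuple[Any, Any]]:
        return self._storage.iterate(self.object_id)


storage = InMemoryStorage()

# "Application 1" makes some data persistent.
producer_view = PersistentDict(storage)
producer_view["cluster_0"] = (3.5, 0.0)

# "Application 2" only holds the reference and sees the same data.
consumer_view = PersistentDict(storage, object_id=producer_view.object_id)
shared_value = consumer_view["cluster_0"]
```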

Figure 5 - Architecture of COMPSs on top of persistent storage

The API also provides a set of methods that enable PyCOMPSs to consider data locality when scheduling tasks that work with persistent objects. The information on the locality of an object or a block is used by PyCOMPSs to schedule a task on a resource where (at least part of) its input data is already present, thus avoiding remote accesses to the data from the task. A reference implementation of this API has been designed at BSC to provide access to non-relational key-value stores such as Cassandra. In this implementation, a Python dictionary is mapped to Cassandra's data model, since both consist of values indexed by keys. Each class of the application containing one or more persistent dictionaries is mapped to a Cassandra table. This table is indexed by the same key attributes as the persistent dictionaries in the class, and contains as many non-key attributes as there are persistent dictionaries.
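The locality-aware placement described above can be sketched as follows. The function `pick_node` and the block and node names are hypothetical; the actual PyCOMPSs scheduler takes more factors into account. The sketch simply prefers the node that already holds the largest number of a task's input blocks.

```python
from typing import Dict, Iterable, List, Set


def pick_node(input_blocks: Iterable[str],
              locations: Dict[str, Set[str]],
              nodes: List[str]) -> str:
    """Return the node holding most of the task's input blocks.

    `locations` maps a block identifier to the set of nodes storing it.
    Ties are resolved in favour of the node listed first in `nodes`.
    """
    blocks = list(input_blocks)

    def local_blocks(node: str) -> int:
        return sum(1 for block in blocks if node in locations.get(block, set()))

    return max(nodes, key=local_blocks)


locations = {
    "block-1": {"node-a"},
    "block-2": {"node-b", "node-c"},
    "block-3": {"node-c"},
}
chosen = pick_node(["block-2", "block-3"], locations, ["node-a", "node-b", "node-c"])
```

In this example `node-c` holds both input blocks and is therefore selected, so the task needs no remote access to its data.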

2.1.2. Deployment of COMPSs in EUBra-BIGSEA

Figure 6 depicts the implementation of COMPSs on top of Mesos. A framework running on top of Mesos consists of two components: a scheduler and an executor. The scheduler registers with the Mesos master, receives resource offers from it, and decides what to do with the offered resources within the framework. The executor is launched on slave nodes and runs the framework's tasks.

Figure 6 - Integration of COMPSs with Mesos

In the case of COMPSs, the scheduler is integrated in the runtime and the negotiation of resources is performed through a specific connector (Mesos Framework in Figure 6) that registers a new framework in Mesos. The executor, however, has not been implemented in this first version: once the resources are offered to COMPSs, it deploys the workers on the nodes, creating a direct connection between the COMPSs master and the workers (blue arrows in Figure 6). To make these direct connections possible, an overlay network must be created on the Mesos cluster. Each task is then executed on an available node by the COMPSs runtime, so the behaviour of the COMPSs runtime is not changed. As depicted in Figure 6, both the COMPSs runtime and the workers are executed on Mesos slaves within Docker containers. The adoption of containers allows easy and transparent deployment of applications, without the need to install COMPSs and the developed applications in the cluster, and it also makes it possible to configure each container without modifying the base instance. It is worth highlighting again that the integration of Mesos is completely transparent to application developers, who are not required to provide any information related to the resources in the definition of the COMPSs tasks. In the future, the implementation of a COMPSs Executor will be evaluated.

Figure 7 depicts the dashboard of Mesos with a COMPSs Framework running 4 tasks on a single node with one CPU while Figure 8 details the execution of the COMPSs workers (here using the default Mesos Executors) and the actual usage of the resources.

Figure 7 - COMPSs framework created in a Mesos cluster

Figure 8 - Execution of COMPSs workers in Mesos

A COMPSs application is submitted to a Mesos cluster through Chronos, by passing a JSON file that describes the command to be executed once the specified container is deployed. Listing 3 contains the description of a simple COMPSs application, with the definition of the Docker image to be deployed and the URIs of the files to be copied into the sandbox of each worker.

The user has to provide the package of the COMPSs application as a .tar.gz file and list it in the uris section of the JSON document; Chronos copies it into the sandbox of the container of the COMPSs runtime. The COMPSs configuration files also need to be provided, as a separate file. The application package is then transferred to the worker containers by the runtime. The Docker image is the COMPSs base image unless further dependencies are needed; in that case, the user should create a new image starting from the base one and push it to Docker Hub.

{
  "name": "COMPSs_2_chronos_test",
  "command": "/opt/COMPSs/Runtime/scripts/user/runcompss --project=/mnt/mesos/sandbox/project_mesosFramework.xml --resources=/mnt/mesos/sandbox/resources_mesosFramework.xml --debug --comm=integratedtoolkit.nio.master.NIOAdaptor --lang=java --classpath=/mnt/mesos/sandbox/Simple.jar simple.Simple 1 25 1 3 60",
  "shell": true,
  "epsilon": "PT30M",
  "executor": "",
  "executorFlags": "",
  "retries": 2,
  "owner": "[email protected]",
  "async": false,
  "successCount": 190,
  "errorCount": 3,
  "lastSuccess": "2014-03-08T16:57:17.507Z",
  "lastError": "2014-03-01T00:10:15.957Z",
  "cpus": 0.5,
  "disk": 5120,
  "mem": 512,
  "disabled": false,
  "container": {
    "type": "DOCKER",
    "image": "compss/compss:2.0-mesos-0.28.2",
    "network": "USER"
  },
  "uris": [
    "http://bscgrid05.bsc.es/~aserven/compss.2.0.cluster/Simple.tar.gz",
    "http://bscgrid05.bsc.es/~aserven/compss.2.0.cluster/conf.tar.gz",
    "http://bscgrid05.bsc.es/~aserven/compss.2.0.cluster/DockerKeys.tar.gz"
  ],
  "schedule": "R1//PT24H"
}

Listing 3 – Definition of COMPSs application execution with Chronos
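A job description like the one in Listing 3 can also be assembled programmatically. The helper below is a minimal sketch that only builds and serializes the JSON document, using a subset of the fields and values that appear in the listing; the function name `build_chronos_job` is hypothetical, and the actual submission to Chronos is deliberately left out.

```python
import json
from typing import Dict, List


def build_chronos_job(name: str, command: str, image: str, uris: List[str],
                      schedule: str, cpus: float = 0.5, mem: int = 512,
                      disk: int = 5120) -> Dict[str, object]:
    """Build a job description with the main fields used in Listing 3."""
    return {
        "name": name,
        "command": command,
        "shell": True,
        "cpus": cpus,
        "mem": mem,
        "disk": disk,
        "container": {"type": "DOCKER", "image": image, "network": "USER"},
        "uris": list(uris),
        "schedule": schedule,
    }


# Values copied from Listing 3.
sandbox = "/mnt/mesos/sandbox"
command = " ".join([
    "/opt/COMPSs/Runtime/scripts/user/runcompss",
    "--project={}/project_mesosFramework.xml".format(sandbox),
    "--resources={}/resources_mesosFramework.xml".format(sandbox),
    "--debug",
    "--comm=integratedtoolkit.nio.master.NIOAdaptor",
    "--lang=java",
    "--classpath={}/Simple.jar".format(sandbox),
    "simple.Simple 1 25 1 3 60",
])
base_url = "http://bscgrid05.bsc.es/~aserven/compss.2.0.cluster"
job = build_chronos_job(
    name="COMPSs_2_chronos_test",
    command=command,
    image="compss/compss:2.0-mesos-0.28.2",
    uris=[base_url + "/Simple.tar.gz",
          base_url + "/conf.tar.gz",
          base_url + "/DockerKeys.tar.gz"],
    schedule="R1//PT24H",
)
job_json = json.dumps(job, indent=2)
```

Building the document from code makes it harder to produce malformed JSON (for example mismatched quotes) than editing the file by hand.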

In the next versions, the adoption of the submission API as described in D3.1 will be evaluated. Further work around the proper network configuration will be performed in order to improve the portability of the proposed solution based on COMPSs.

2.2. Code generation with Lemonade Platform

Lemonade (Live Exploration and Mining Of Non-trivial Amount of Data from Everywhere) is a visual platform for distributed computing, aimed at enabling the implementation, experimentation, testing and deployment of data processing and machine learning applications. It provides high-level abstractions, called operations, for developers to build processing workflows using a graphical web interface. By using high-performance and scalable technologies, such as COMPSs, Ophidia and Spark, Lemonade can process very large amounts of data, hiding all backend complexity from the users and allowing them to focus mainly on the construction of the solution.

Lemonade is implemented as an open-source tool and is under development as a product of the EUBra-BIGSEA project. The project plans to deliver a first stable version by July 2017.

2.2.1. Motivation

Visual workflow tools provide a higher level of abstraction than general-purpose programming languages, even those created specifically for data processing, such as the R language [R08].

Currently, the increased capacity and reduced price of existing processing infrastructures, as well as the availability of large amounts of data, have democratized the development of new applications, previously restricted to very large companies and organizations. However, to fully exploit this opportunity, a team must combine different kinds of expertise, such as business domain knowledge, programming skills and infrastructure maintenance. Sometimes, researchers just want to test a hypothesis about the data. If a technical solution requires a complex learning process, it will not be used.

Available data processing tools cover a very wide spectrum of processing capacity, ranging from desktop spreadsheet tools to very large computer clusters. The level of abstraction also varies, from low-level programming (e.g. GPU processing) to completely black-box solutions. Visual programming [R1] is an approach where a procedure or program is constructed by arranging program elements graphically instead of writing it in a programming language. It has become popular thanks to the proliferation of tools such as KNIME [R2], Weka [R3], RapidMiner [R4], ClowdFlows [R5] and Microsoft Azure ML Studio [R6].

Lemonade shares many similarities with the aforementioned technologies. We believe Lemonade differs in its integration with the Big Data processing technologies included in the project (COMPSs, Spark, Ophidia) and with other project teams and work packages. Lemonade integrates with WP3 by allowing users to specify QoS constraints for the execution. These constraints will be used to statically or dynamically allocate resources to process the workflow. Integration with WP6 is in progress, and we foresee security and data privacy constraints implemented as parameters or operators in Lemonade.

All operations available in the Lemonade user interface are kept as metadata. New operations can be defined by adding metadata and changing the backend implementation.

2.2.2. Related work

KNIME [R2] is a framework that enables visual and interactive execution of a data pipeline for data mining. This environment was developed as a collaborative platform for teaching and research. It provides high-level development of workflows for data analysis and does not require programming skills. KNIME has three main attractions: an interactive framework, modularity and easy extensibility.

Weka [R3] is a project that aims to provide machine learning algorithms and data preprocessing tools to many kinds of users (teachers, students, researchers, etc.). Its advantage is its modular and extensible architecture. Weka has a wide variety of algorithms for regression, classification, clustering, association rule mining and attribute selection. Users may create sophisticated data mining processes to extract relevant information from data.

ClowdFlows [R5] is a cloud-based web application for distributed computing (with batch or real-time processing modes). It allows users to create and run data mining workflows through visual programming: an application is constructed by "dragging and dropping" graphical elements instead of writing the source code as text. For batch processing, ClowdFlows uses the MapReduce programming model through the DiscoMLL library. MapReduce simplifies the execution of parallel and distributed processing across multiple machines.

Azure Machine Learning Studio [R6] is a cloud service that enables users to easily build, deploy, and share predictive analytics solutions. No code is required, but a commercial license is needed to use the full set of functions and tools. Azure provides visual workflows with state-of-the-art algorithms that can be run in parallel in the cloud.

None of these platforms has been evaluated with Apache Spark, which reveals the potential for new platforms.

2.2.3. Main concepts

The terminology used in Lemonade is common, but some definitions are restated here to better explain the platform. The domain model diagram is shown in Figure 9.

Figure 9 - Lemonade domain model

An Operation is the smallest execution unit defined in Lemonade. Each operation has a set of parameters (defined as forms) and a set of ports. An operation is mapped into a block of code to be executed in the underlying execution platform. Different execution platforms may have different operations, but in general there is a set of common operations available in most of them, typically related to tasks such as extraction, transformation and loading (ETL). In Lemonade, there are different categories of operations:

● Related to data sources:
  ○ Read and write data from/into persistent storages
● Data transformation:
  ○ Add rows (union), and columns, perform aggregation, set difference, filtering, join, set intersection, sampling and selection/projection, clean missing values and remove duplicates, sorting, splitting and transformation of data, split data using built-in functions, assembly features, index features.
● Machine learning:
  ○ Apply ML models, perform cross-validation, create classification model (Naive Bayes, GBT, Random Forest, Decision Tree), create clustering model (K-Means), create regression model (Logistic Regression), evaluate model (classification and regression), find frequent item-sets (FP-Growth), score model (classification and regression), save and load model.
● Statistics:
  ○ Linear Regression, Pearson Correlation
● Data utilities:
  ○ Broadcast data
● Text processing (to be implemented):
  ○ Text normalization (TF-IDF), topic detection (LDA, word2vec)
● Data visualization:
  ○ Define visualization parameters
● Publish in production:
  ○ Publish as a web service, publish as a visualization service

A Workflow is a group of tasks (instances of operations), organized as a directed acyclic graph (DAG). Tasks communicate with each other through flows, each connecting a source port to a target port. Each port has a direction (INPUT or OUTPUT), a cardinality (ONE or MANY) and one or more interfaces (e.g., IData, IModel). Interfaces are used to validate flows, acting as a kind of type system: users cannot connect two ports that do not share at least one interface. The forms defined in operations are filled in through the user interface (Citron) for each task associated with those operations.
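The flow validation rule can be expressed in a few lines. The sketch below is an assumption about how such a check might look, not Lemonade's implementation: the `Port` class and the `can_connect` function are hypothetical, while the direction values and the interface names IData and IModel come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Port:
    name: str
    direction: str             # "INPUT" or "OUTPUT"
    cardinality: str           # "ONE" or "MANY"
    interfaces: FrozenSet[str]


def can_connect(source: Port, target: Port) -> bool:
    """A flow is valid if it goes from an OUTPUT port to an INPUT port
    and the two ports share at least one interface."""
    return (
        source.direction == "OUTPUT"
        and target.direction == "INPUT"
        and bool(source.interfaces & target.interfaces)
    )


reader_output = Port("output data", "OUTPUT", "MANY", frozenset({"IData"}))
sort_input = Port("input data", "INPUT", "ONE", frozenset({"IData"}))
model_input = Port("model", "INPUT", "ONE", frozenset({"IModel"}))
```

With these definitions, a data reader can feed a sort task, but its output cannot be plugged into a port that only accepts models.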

After a workflow is submitted, a job is launched to process it. The DAG formed by tasks and flows is evaluated and converted into code to be run in the underlying execution platform. The current version of Lemonade only allows the execution of the entire workflow, although there are plans to allow partial and even incremental execution, skipping tasks that have already been processed and have not been changed in the interface.
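Before code can be generated, the tasks of the DAG have to be placed in an order that respects the flows. One standard way to do this is Kahn's topological sort, sketched below; whether Juicer uses this particular algorithm is not stated in this document. The function name `execution_order` is hypothetical, while the `source_id` and `target_id` keys follow the JSON representation shown in Annex A.

```python
from collections import deque
from typing import Dict, List


def execution_order(task_ids: List[str], flows: List[Dict[str, str]]) -> List[str]:
    """Return the tasks in an order where every source precedes its targets."""
    successors = {task_id: [] for task_id in task_ids}
    indegree = {task_id: 0 for task_id in task_ids}
    for flow in flows:
        successors[flow["source_id"]].append(flow["target_id"])
        indegree[flow["target_id"]] += 1

    ready = deque(task_id for task_id in task_ids if indegree[task_id] == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for following in successors[current]:
            indegree[following] -= 1
            if indegree[following] == 0:
                ready.append(following)

    if len(order) != len(task_ids):
        raise ValueError("workflow contains a cycle, so it is not a DAG")
    return order


flows = [
    {"source_id": "reader", "target_id": "sort"},
    {"source_id": "sort", "target_id": "writer"},
]
order = execution_order(["writer", "sort", "reader"], flows)
```

The same traversal also detects invalid workflows: if a cycle exists, some tasks never become ready and a `ValueError` is raised.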

The Metadata related to data sources and their attributes are kept and used at design and execution time. Users configure tasks that require information about data source attributes by selecting them from a list of available ones. This list is provided by the metadata services described in Section 2.2.4 (Lemonade architecture).

2.2.4. Lemonade architecture

Lemonade has evolved since it was last documented in deliverable D5.1. It now consists of seven components, shown in Figure 10.

Figure 10 - Lemonade components

Each component is implemented as a separate, standalone process. Communication between components is performed either through an API call (synchronous) or by sending messages to a queue (asynchronous). The queue is implemented using third-party software, in this case Redis [R4].

Graphical user interface

Citron is a web-based user interface used in Lemonade to create workflows. Users compose a workflow by choosing among a set of predefined operations and dragging and dropping them into the design area. Among these, there are operations for reading and writing data in different storages, such as file systems (including distributed ones, such as HDFS) and databases.

Operations are grouped by category and configured through forms, including parameters for execution, appearance, quality of service (QoS) and security & privacy settings. Citron interacts with the Tahiti component to retrieve operation metadata and persist workflows, and with the Limonero component to retrieve and save data source metadata.

Workflow execution status is provided by Lemonade Stand and it is integrated with Citron by using web sockets [R9].

2.2.5. Metadata information

In order to provide extensibility, all operations in Lemonade are defined in two components: Tahiti and Juicer. Tahiti keeps all operation metadata, including names, ports and related forms, and makes this information available through an API, consumed by Citron when it starts rendering the user interface. Even though the adoption of Tahiti with Citron eliminated the dependency between the interface and the available abstractions, there is still a strong dependency between the metadata managed by Tahiti and the execution component, Juicer. In order to create a new operation or modify an existing one, a developer must insert or update information in Tahiti and implement the changes directly in the Juicer source code.

There is another class of metadata, related to the data sources themselves. As a design decision, we chose to keep, for each data source available in Lemonade, metadata about user access permissions, attributes (name, data type, precision, length, nullability, whether they are labels or features, and statistical data such as distribution, number of missing values, and mean, maximum and minimum values) and the format of the data (CSV, JSON, Parquet, etc.). This information is used when reading data, in order to avoid misinterpretation of formats, to validate the input and the workflow, to optimize reading and, finally, to integrate easily with the visualization component, Caipirinha (more details in Section 2.2.8).
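To illustrate how attribute metadata can drive input validation, the sketch below parses a raw record according to a list of attribute descriptions. The structure of `ATTRIBUTES` and the function `parse_row` are hypothetical simplifications; the type names (INTEGER, FLOAT, DOUBLE, CHARACTER) are the ones that appear in the generated code of Annex A.

```python
from typing import Any, Callable, Dict, List

# Conversion functions for the attribute types used in Annex A.
CASTS: Dict[str, Callable[[str], Any]] = {
    "INTEGER": int,
    "FLOAT": float,
    "DOUBLE": float,
    "CHARACTER": str,
}

# Hypothetical metadata for three attributes of a data source.
ATTRIBUTES: List[Dict[str, Any]] = [
    {"name": "id", "type": "INTEGER", "nullable": False},
    {"name": "age", "type": "FLOAT", "nullable": True},
    {"name": "name", "type": "CHARACTER", "nullable": True},
]


def parse_row(raw: Dict[str, str], attributes: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Convert raw text values to typed values, enforcing nullability."""
    parsed: Dict[str, Any] = {}
    for attribute in attributes:
        name = attribute["name"]
        value = raw.get(name, "")
        if value == "":
            if not attribute.get("nullable", False):
                raise ValueError("attribute '{}' must not be empty".format(name))
            parsed[name] = None
        else:
            parsed[name] = CASTS[attribute["type"]](value)
    return parsed


row = parse_row({"id": "1", "age": "29.5", "name": ""}, ATTRIBUTES)
```

Because the types and nullability come from metadata rather than from inference over the data, a malformed record is rejected explicitly instead of being silently misread.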

2.2.6. Workflow execution and monitoring

Citron allows users to start the workflow execution, and Juicer is responsible for retrieving information about the execution from the underlying execution platform. To keep both components decoupled, a third component, Stand, is needed.

Stand is a facade between the user interface (Citron) and the backend execution (Juicer). The user interface should be responsive while the backend processes the workflow in batch mode. Stand decouples the other two components by using asynchronous communication, implemented as a producer-consumer queue in Redis. Interactions between components are shown in Figure 11. When a user triggers the execution of a workflow, Citron invokes Stand in order to run the job (1a) and also connects to a websocket that provides feedback to the user interface (1b). Stand receives the request, pushes it into a queue (2a) and starts consuming the status queue (2b) that feeds the websocket. Juicer consumes the execution queue (3) and reports execution status by pushing it to a publisher-subscriber topic in Redis (volatile) and updating rows in MySQL (persistent) (5). Citron then receives notifications about task execution status (6) and updates the interface.
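The decoupling through queues can be imitated with the Python standard library. In the sketch below, `queue.Queue` objects stand in for the Redis queues and a thread stands in for Juicer; the message fields, the status names and the `None` sentinel are assumptions of this sketch, not part of Lemonade.

```python
import queue
import threading
from typing import Any, Dict, List


def juicer_worker(execution_queue: "queue.Queue", status_queue: "queue.Queue") -> None:
    """Consumer side: take jobs from the execution queue and report status."""
    while True:
        job = execution_queue.get()
        if job is None:  # sentinel used only by this sketch to stop the worker
            break
        for task_id in job["tasks"]:
            status_queue.put({"job": job["id"], "task": task_id,
                              "status": "COMPLETED"})
        status_queue.put({"job": job["id"], "task": None, "status": "FINISHED"})


def run_job(job: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Producer side: enqueue a job, then collect the status messages."""
    execution_queue: "queue.Queue" = queue.Queue()
    status_queue: "queue.Queue" = queue.Queue()
    worker = threading.Thread(target=juicer_worker,
                              args=(execution_queue, status_queue))
    worker.start()

    execution_queue.put(job)   # the facade pushes the request and returns
    execution_queue.put(None)  # ask the worker to stop after this job
    worker.join()

    events = []
    while not status_queue.empty():
        events.append(status_queue.get())
    return events


events = run_job({"id": 11, "tasks": ["reader", "sort", "writer"]})
```

The producer never calls the consumer directly: it only writes to one queue and reads from another, which is what allows the user interface to stay responsive while the job runs.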

Figure 11 - Interaction between Lemonade components for job execution

2.2.7. Source code conversion (transpiler or code to code compiler)

A workflow is represented internally in JSON format. When a new job is created to execute a workflow, Juicer converts code to code (transpiles), parsing the JSON representation and generating a Python script compatible with the underlying processing platform. The use of Python restricts the platforms that can be targeted, but COMPSs, Ophidia and Spark all currently support it. We do not foresee the need for a real compiler, but Juicer could be extended to support one in the future.
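A heavily simplified sketch of such a conversion is given below. It is not Juicer's implementation: the templates, the flattened `forms` layout and the `transpile` function are assumptions made for illustration, and only linear workflows (one flow per task) are handled. The operation slugs match those of the example in Annex A, and the emitted statements use PySpark calls (`read.csv`, `orderBy`, `write.csv`) only as text; nothing is executed against Spark.

```python
from typing import Any, Dict

# One code template per operation slug (hypothetical, much simpler than Juicer's).
TEMPLATES = {
    "data-reader": "{out} = spark_session.read.csv({path!r}, header=True)",
    "sort": "{out} = {inp}.orderBy({columns!r})",
    "data-writer": "{inp}.write.csv({path!r})",
}

WORKFLOW: Dict[str, Any] = {
    "tasks": [
        {"id": "reader", "operation": {"slug": "data-reader"},
         "forms": {"path": "/lemonade/samples/titanic.csv"}},
        {"id": "sort", "operation": {"slug": "sort"},
         "forms": {"columns": ["age"]}},
        {"id": "writer", "operation": {"slug": "data-writer"},
         "forms": {"path": "/lemonade/example/titanic_sorted"}},
    ],
    "flows": [
        {"source_id": "reader", "target_id": "sort"},
        {"source_id": "sort", "target_id": "writer"},
    ],
}


def transpile(workflow: Dict[str, Any]) -> str:
    """Generate Python source for a linear workflow, one statement per task."""
    tasks = {task["id"]: task for task in workflow["tasks"]}
    successor = {flow["source_id"]: flow["target_id"] for flow in workflow["flows"]}
    targets = set(successor.values())
    # The first task is the one without an incoming flow.
    current = next(task_id for task_id in tasks if task_id not in targets)

    lines = ["def run(spark_session):"]
    inp, index = None, 0
    while current is not None:
        task = tasks[current]
        slug = task["operation"]["slug"]
        out = "df{}".format(index)
        statement = TEMPLATES[slug].format(out=out, inp=inp, **task["forms"])
        lines.append("    " + statement)
        if slug != "data-writer":  # the writer produces no new data frame
            inp, index = out, index + 1
        current = successor.get(current)
    return "\n".join(lines) + "\n"


generated = transpile(WORKFLOW)
```

The result is a string containing a `run(spark_session)` function with three statements, which can be checked for syntactic validity with the built-in `compile` before being shipped to the execution platform.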

2.2.8. Visualization of results (work in progress)

An important requirement of Lemonade is to provide feedback to users about task execution and results. When a task modifies its input data, the user must be able to analyze the resulting data by inspecting it directly (e.g. using a table) or by viewing it through a visual metaphor (charts, graphs, custom visualizations, etc.). Caipirinha is a framework that integrates with Lemonade data and metadata in order to provide data visualization. The idea is to provide the user with a set of common visualizations, such as tables and pie, line and bar charts, with little customization effort. For more sophisticated visualizations, the user can configure a "generate visualization" operation with the type of the visualization and its parameters, similarly to the chart generation wizard present in common spreadsheet software. Visualizations will be enabled or disabled according to predefined requirements. These requirements are part of the visualization metadata and will be stored in Caipirinha.

This module is still being designed and we expect to have a first version in October 2017.

2.2.9. Security and Privacy

Thorn is the module responsible for providing security and privacy constraints in Lemonade. The current version only works with basic authentication and authorization, but working groups from WP5 and WP6 are cooperating in order to create a common solution. The interaction between all components, except Caipirinha, is shown in Figure 12.

Figure 12 - Interaction between different Lemonade components

2.2.10. Integration with other project components and work packages

Component | Context | Description | Status
--- | --- | --- | ---
COMPSs | WP5 | Generate code and perform batch execution | Proof of concept tests. Initial development. Basic ETL tasks integrated in Apr'17.
Ophidia | WP4 | Generate code and perform batch execution | Under conception. Efforts will start in Jan'17.
Spark | WP4 and WP5 | Generate code and perform interactive and batch execution | Most of Lemonade operations already implemented. Execution triggering and monitoring being developed. Expectation of first version in Feb'17.
QoS policies | WP3 | Apply QoS policies for workflow execution | Basic constraints already defined. Integration in progress.
Security & privacy policies | WP6 | Apply security & privacy constraints. Provide operations to protect data. | Integration in progress.
Path finder application | WP7 descriptive model application | Path finder is one of the applications created in WP7 to be examples of descriptive models and to test the infrastructure. | Application successfully mapped into Lemonade workflows and operations. Ready to be run as a Spark application.

Table 1 - Relation of Lemonade with other BIGSEA components

2.2.11. Deployment

Release versions of Lemonade will be distributed as source code and as Docker containers. Lemonade depends on MySQL 5.x, Redis 3.x, Python 2.x and Spark 2.0.2, with Hadoop HDFS 2.7.x.

2.2.12. Sample usage and application

Annex A presents a very basic workflow with only 3 tasks, kept small for lack of space (a bigger workflow would generate hundreds of lines of code). It is presented as its visual representation, the respective JSON and the generated Python+Spark source code.

A more complex workflow is shown in Figure 13. The Path finder application applies descriptive models to integrate data from 3 different data sources: bus positions, user bus card utilization and Census data, all from the city of Curitiba, Brazil. In the application, data from the bus cards and the bus positions are joined in order to infer the origin and destination of users. This information is not readily available, because the bus card utilization data does not include GPS coordinates; these are only available in the bus position data. The Census data is used to aggregate trips according to a set of economic, social and educational characteristics. To be able to join the first two data sets with the Census data, we used shapefiles with the boundaries of regions in the city of Curitiba, together with a geospatial function that determines whether a point is inside a polygon.
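The document does not say which point-in-polygon method is used. A common choice is the ray casting (even-odd) test, sketched below under that assumption; the function names and the toy regions are hypothetical, and polygons are given as lists of (x, y) vertices rather than read from shapefiles.

```python
from typing import Dict, List, Optional, Tuple

Polygon = List[Tuple[float, float]]


def point_in_polygon(x: float, y: float, polygon: Polygon) -> bool:
    """Ray casting: count how many edges a horizontal ray from (x, y) crosses."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def region_of(x: float, y: float, regions: Dict[str, Polygon]) -> Optional[str]:
    """Return the name of the first region containing the point, if any."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


regions = {
    "west": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "east": [(4.0, 0.0), (8.0, 0.0), (8.0, 4.0), (4.0, 4.0)],
}
```

Once each GPS position is mapped to a region in this way, the trips can be joined with Census data keyed by the same regions.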

3. CONCLUSIONS

This document describes the implementation of the first version of the programming abstraction layer of the EUBra-BIGSEA platform. Following the directives of the architecture design in document D5.1, partners have started the developments needed to extend the components required to implement the use case applications.

In particular, COMPSs has been extended to support the elastic execution of tasks on Mesos clusters, while Spark is adopted to support existing blocks. The Lemonade tool will provide programmers with high-level abstractions to build Big Data workflows on top of COMPSs and Spark without the need to directly access the data layer.

This report also documents the achievement of milestone MS13, "First release of the programming layer", with the links to the software repositories listed in Annex B.

ANNEXES

Annex A. Example of code to code generation in Lemonade

For the sake of illustration, we use a simple workflow consisting of only 3 tasks: reading, sorting and writing data. The different representations are shown below.

A.1. Visual representation

Figure 13 - Path finder application (WP7 application) represented as a Lemonade workflow

A.2. JSON representation

{
  "updated": "2016-12-12T16:20:03+00:00",
  "tasks": [
    {
      "z_index": 11,
      "top": 49,
      "forms": {
        "color": {"value": "#FFEBCC"},
        "header": {"value": "1"},
        "data_source": {"value": "4"}
      },
      "version": 2,
      "operation": {"slug": "data-reader", "id": 18, "name": "Data reader"},
      "id": "922ca54d-98d3-4e2f-a008-b02c93ca823f",
      "left": 127
    },
    {
      "z_index": 12,
      "top": 158,
      "forms": {
        "color": {"value": "#FFCCE0"},
        "attributes": {"value": [{"attribute": "age", "f": "asc", "alias": ""}]}
      },
      "version": 2,
      "operation": {"slug": "sort", "id": 32, "name": "Sort"},
      "id": "9dad14aa-c191-4d60-8045-6623af29ffc9",
      "left": 127
    },
    {
      "z_index": 13,
      "top": 265,
      "forms": {
        "color": {"value": "#CCFFD1"},
        "path": {"value": "/lemonade/example"},
        "storage": {"value": "1"},
        "name": {"value": "titanic_sorted"}
      },
      "version": 2,
      "operation": {"slug": "data-writer", "id": 30, "name": "Data writer"},
      "id": "dd99334c-954d-4881-a711-e71e80c25b91",
      "left": 127
    }
  ],
  "description": null,
  "created": "2016-12-07T16:53:06+00:00",
  "enabled": true,
  "flows": [
    {
      "source_port": 35,
      "source_id": "922ca54d-98d3-4e2f-a008-b02c93ca823f",
      "target_id": "9dad14aa-c191-4d60-8045-6623af29ffc9",
      "target_port": 61
    },
    {
      "source_port": 62,
      "source_id": "9dad14aa-c191-4d60-8045-6623af29ffc9",
      "target_id": "dd99334c-954d-4881-a711-e71e80c25b91",
      "target_port": 58
    }
  ],
  "platform": {
    "slug": "spark",
    "description": "Apache Spark 2.0 execution platform",
    "id": 1,
    "icon": "/static/spark.png",
    "name": "Spark"
  },
  "version": 2,
  "user": {"login": "admin", "id": 0, "name": "admin"},
  "id": 11,
  "name": "Sort titanic data by passenger age"
}

Listing 4 – JSON representation of a Lemonade application

A.3. Lemonade generated source code (Python + Spark)

# -*- coding: utf-8 -*-
#!/usr/bin/env python
#
# Auto-generated Spark code from Lemonade Workflow
# (c) Speed Labs - Departamento de Ciência da Computação
# Universidade Federal de Minas Gerais
# More information about Lemonade to be provided
#
from pyspark.ml import Pipeline
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.feature import *
from pyspark.ml.tuning import *
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *

def data_reader_generate_df0(spark_session):
    schema_df0 = StructType()
    schema_df0.add('id', IntegerType(), False,
                   {'type': u'INTEGER'})
    schema_df0.add('class', StringType(), True,
                   {'nullable': True, 'size': 50, 'type': u'CHARACTER'})
    schema_df0.add('survived', DoubleType(), True,
                   {'label': True, 'nullable': True, 'size': 1,
                    'type': u'DOUBLE'})
    schema_df0.add('name', StringType(), True,
                   {'nullable': True, 'size': 100, 'type': u'CHARACTER'})
    schema_df0.add('sex', StringType(), True,
                   {'enumeration': True, 'feature': True, 'nullable': True,
                    'size': 40,'type': u'CHARACTER'})
    schema_df0.add('age', FloatType(), True,
                   {'feature': True, 'nullable': True, 'type': u'FLOAT'})
    schema_df0.add('sibsp', IntegerType(), True,
                   {'nullable': True, 'type': u'INTEGER'})
    schema_df0.add('parch', IntegerType(), True,
                   {'nullable': True, 'type': u'INTEGER'})
    schema_df0.add('ticket', StringType(), True,
                   {'nullable': True, 'type': u'CHARACTER'})
    schema_df0.add('fare', FloatType(), True,
                   {'nullable': True, 'type': u'FLOAT'})
    schema_df0.add('cabin', StringType(), True,
                   {'nullable': True, 'size': 10, 'type': u'CHARACTER'})
    schema_df0.add('embarked', StringType(), True,
                   {'feature': True, 'nullable': True, 'size': 100, 'type':
                    u'CHARACTER'})
    schema_df0.add('boat', StringType(), True,
                   {'feature': True, 'nullable': True, 'type': u'CHARACTER'})
    schema_df0.add('body', StringType(), True,
                   {'nullable': True, 'type': u'CHARACTER'})
    schema_df0.add('home.dest', StringType(), True,
                   {'feature': True, 'nullable': True, 'size': 100, 'type':
                    u'CHARACTER'})
    url_df0 = 'hdfs://spark01:9000/lemonade/samples/titanic.csv'
    df0 = spark_session.read.option('nullValue', '')\
        .option('treatEmptyValuesAsNulls','true')\
        .csv(url_df0, schema=schema_df0,
             header=True, sep=',',
             inferSchema=False, mode='DROPMALFORMED')
    df0.cache()
    return df0

def sort_generate_df1(spark_session, df0):
    df1 = df0.orderBy(["age"], ascending=[1])
    return df1

def data_writer_generate_df1_tmp_2(spark_session, df1):

# Code to update Limonero metadata information

from metadata import MetadataPost

types_names = {

'IntegerType': "INTEGER",

'StringType': "TEXT",

'LongType': "LONG",

'DoubleType': "DOUBLE",

'TimestampType': "DATETIME",

}

Page 29: D5.2: Programming abstractions design · 2017-05-23 · SciPy, etc.); (ii) it allows the implementation of arbitrarily complex applications that interact with Ophidia (i.e. iteratively)

EUBra-BIGSEA D5.2: Programming abstractions design

www.eubra-bigsea.eu | [email protected] @bigsea_eubr 29

schema = []

# nullable information is also stored in metadata

# because Spark ignores this information when loading CSV files

for att in df1.schema:

schema.append({

'name': att.name,

'dataType': types_names[str(att.dataType)],

'nullable': att.nullable or attr.metadata.get('nullable'),

'metadata': att.metadata,

})

parameters = {

'name': "titanic_sorted",

'format': "None",

'storage_id': 1,

'description': "admin",

'user_id': "0",

'user_login': "admin",

'user_name': "admin",

'workflow_id': "11",

'url': "hdfs://spark01:9000/lemonade/exampletitanic_sorted",

}

instance = MetadataPost('123456', schema, parameters)

return df1_tmp_2

def main():

app_name = u'## Sort titanic data by passenger age ##'

spark_session = SparkSession.builder\

.appName(app_name)\

.getOrCreate()

df0 = data_reader_generate_df0(spark_session)

df1 = sort_generate_df1(spark_session, df0)

df1_tmp_2 = data_writer_generate_df1_tmp_2(spark_session, df1)

return {

'922ca54d-98d3-4e2f-a008-b02c93ca823f': (df0,),

'9dad14aa-c191-4d60-8045-6623af29ffc9': (df1,),

'dd99334c-954d-4881-a711-e71e80c25b91': (df1_tmp_2,),

} Listing 5 – Lemonade source code of a Spark application
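The function names and the `main()` body in Listing 5 follow a regular pattern: one function per task, named after the operation slug and its output variable, called in workflow order. The sketch below illustrates how such a `main()` function could be emitted as text from an ordered list of steps. It is only an illustration of the source-to-source idea, not the actual Juicer code generator: `function_name`, `emit_main` and the `steps` tuples are hypothetical, and the naming pattern is inferred from Listing 5.

```python
def function_name(slug, output):
    """Build a name such as 'sort_generate_df1' from an operation slug."""
    return "{}_generate_{}".format(slug.replace("-", "_"), output)


def emit_main(app_name, steps):
    """Emit the source text of a main() function.

    Each step is a tuple (task_id, slug, output_variable, input_variables),
    already sorted in execution order.
    """
    lines = [
        "def main():",
        "    app_name = {!r}".format(app_name),
        "    spark_session = SparkSession.builder"
        ".appName(app_name).getOrCreate()",
    ]
    # One call per task; every generated function receives the session
    # followed by the DataFrames produced by its predecessors.
    for task_id, slug, output, inputs in steps:
        args = ", ".join(["spark_session"] + list(inputs))
        lines.append("    {} = {}({})".format(
            output, function_name(slug, output), args))
    # The result maps each task id to a tuple with its output.
    lines.append("    return {")
    for task_id, slug, output, inputs in steps:
        lines.append("        {!r}: ({},),".format(task_id, output))
    lines.append("    }")
    return "\n".join(lines) + "\n"


steps = [
    ("922ca54d-98d3-4e2f-a008-b02c93ca823f", "data-reader", "df0", []),
    ("9dad14aa-c191-4d60-8045-6623af29ffc9", "sort", "df1", ["df0"]),
    ("dd99334c-954d-4881-a711-e71e80c25b91", "data-writer", "df1_tmp_2",
     ["df1"]),
]
source = emit_main("## Sort titanic data by passenger age ##", steps)

# Check that the emitted text is syntactically valid Python, then show it.
compile(source, "<generated>", "exec")
print(source)
```

Generating plain source text rather than executing the workflow directly keeps the output inspectable: the emitted program can be read, versioned and run like any hand-written Spark script.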


Annex B. Links to components and documentation

Annex B.1. Lemonade

Lemonade has been under development as part of the EUBra-BIGSEA project since July 2016 (the 7th month of the project) and is not yet finished. The current source code is distributed as open source, and each component is available in its own repository under the project's organisation on GitHub (https://github.com/eubr-bigsea/).

The available repositories and their current development status are listed in Table 2.

| Component | Repository | Status |
|---|---|---|
| Citron | https://github.com/eubr-bigsea/ember-citron | Integration with Stand under development, other integrations finished, dashboard construction in alpha. |
| Stand | https://github.com/eubr-bigsea/stand | Under development. |
| Tahiti | https://github.com/eubr-bigsea/tahiti | Alpha version. Missing integration with Thorn and its documentation. |
| Limonero | https://github.com/eubr-bigsea/limonero | Alpha version. Missing integration with Thorn and its documentation. |
| Juicer | https://github.com/eubr-bigsea/juicer | Under development. Integration with Spark almost finished. Missing integration with Thorn and its documentation. Integration with other WP4 technologies in early stage. |
| Thorn | https://github.com/eubr-bigsea/thorn | Early stage. Missing integration with all other components except Tahiti. |
| Caipirinha | No repository defined. | Under conception. User requirements identified. |

Table 2 - Lemonade source code repositories


Annex B.2. COMPSs

COMPSs is available for download as source code. Binary packages for several Linux distributions (Ubuntu, Debian, openSUSE and CentOS) can also be installed through package repositories.

A virtual appliance is also available in OVA format.

All the downloads and manuals are available at http://compss.bsc.es



GLOSSARY

| Term | Explanation |
|---|---|
| transpiler | A source-to-source compiler, transcompiler or transpiler is a type of compiler that takes the source code of a program written in one programming language as its input and produces the equivalent source code in another programming language. |
| shapefile | The shapefile format is a popular geospatial vector data format for geographic information system (GIS) software. |

| Acronym | Explanation | Usage Scope |
|---|---|---|
| AAA | Authentication, Authorization and Accounting | Security |
| API | Application Programming Interface | Interfacing |
| CSV | Comma Separated Value | Data type |
| ETL | Extraction, Transformation and Load | Data Integration |
| HDFS | Apache Hadoop Distributed File System | |
| JSON | JavaScript Object Notation | Data Type |
| JVM | Java Virtual Machine | Processing |
| Mesos | A Resource Management platform that abstracts CPU, memory, storage, and other compute resources away from machines | Resource Management |
| ML | Machine Learning | Processing |
| OLAP | Online Analytical Processing | Processing |