
Institute of Architecture of Application Systems
University of Stuttgart
Universitätsstraße 38
D–70569 Stuttgart

Master’s Thesis

Concepts for Handling Heterogeneous Data Transformation Logic
and their Integration with TraDE Middleware

Vladimir Yussupov

Course of Study: Computer Science

Examiner: Prof. Dr. Dr. h.c. Frank Leymann

Supervisor: Dipl.-Inf. Michael Hahn

Commenced: 2017-07-18

Completed: 2017-12-18

CR-Classification: C.2.4, D.2.11, H.4.1, I.7.2


Abstract

The concept of programming-in-the-Large became a substantial part of modern computer-based scientific research with the advent of web services and orchestration languages. While the notions of workflows and service choreographies help to reduce complexity by providing means to support the communication between involved participants, the process still remains generally complex. The TraDE Middleware and its underlying concepts were introduced in order to perform the modeled data exchange across choreography participants in a transparent and automated fashion. However, in order to achieve both transparency and automation, the TraDE Middleware must be capable of transforming the data along its path. Transparent data transformation can be difficult to achieve due to various factors, including the diversity of required execution environments, complicated configuration processes, and the heterogeneity of data transformation software, which results in tedious integration processes often involving the manual wrapping of software.

Having a method of handling data transformation applications in a standardized manner can help to simplify the modeling and execution of scientific service choreographies with the TraDE concepts applied. In this master’s thesis we analyze various aspects of this problem and conceptualize an extensible framework for handling data transformation applications. The resulting prototypical implementation of the presented framework provides means to address data transformation applications in a standardized manner.



Acknowledgements

I want to thank my supervisor Michael Hahn at the University of Stuttgart for his constructive feedback, patience, and continuous guidance during the progress of my master’s thesis.

I would also like to thank Professor Frank Leymann for giving me the opportunity to do my master’s thesis at the Institute of Architecture of Application Systems.

My heartfelt appreciation goes to my beloved wife Anna whose love and support made me more confident in chasing my dreams and made this thesis possible. I am immensely grateful to my parents Nail and Vera, and my sister Jane for their love, tremendous support and encouragement throughout my whole life. I am also very grateful to my other family members who have supported and encouraged me along the way.



Contents

1 Introduction
  1.1 Informal Thoughts on Topic’s Semantics
  1.2 TraDE: Background and Motivation
    1.2.1 e-Science
    1.2.2 Workflows And Choreographies
    1.2.3 SimTech Cluster of Excellence
    1.2.4 TraDE Middleware
    1.2.5 A High Level View On Data Transformation Tasks
    1.2.6 OPAL Simulation: A Motivational Example
  1.3 Problem statement
  1.4 Main goal
  1.5 Structure

2 Background and Related Work
  2.1 TraDE Concepts
  2.2 Software Reuse
  2.3 Legacy Code Wrapping
    2.3.1 What is Legacy Code Wrapping
    2.3.2 General Frameworks and Code Wrapping Techniques
    2.3.3 Service Oriented Wrapping Approaches
    2.3.4 Grid and Cloud-based Wrapping Approaches
  2.4 Virtualization
    2.4.1 Hypervisor and Container technology
    2.4.2 Comparison of Virtualization Techniques
    2.4.3 Virtualization For Research Reproducibility and Scientific Containers
  2.5 Container-based Approaches in e-Science

3 Concepts and Design
  3.1 Data Transformation Logic and TraDE Middleware
  3.2 Interaction Models and Main Actors
  3.3 Description and Packaging
    3.3.1 The conceptual meta-model of a data transformation application
    3.3.2 Packaging the data transformation applications
  3.4 Handling Data Transformation Applications: A Generic Framework
  3.5 Repository
    3.5.1 Publishing Scenarios
    3.5.2 Generation of a Provisioning-ready Package
    3.5.3 Storage
    3.5.4 Search
    3.5.5 Searching for Composite Data Transformations
    3.5.6 Optimizations
    3.5.7 Refining the Repository
  3.6 Task Manager and Provisioning Layer
  3.7 Request Router
  3.8 User Interaction and Security Layers
  3.9 Refining and Zoning the Framework
  3.10 Interaction with the Framework
  3.11 Integration with TraDE Middleware

4 Implementation
  4.1 Architecture of the Prototype
  4.2 API Specification
  4.3 Application Package Specification
  4.4 Publishing and Generation of Provisioning-ready Specification
  4.5 Requesting a Transformation

5 Case Study

6 Conclusion and Future Work

Bibliography



List of Abbreviations

ADL Architecture Description Language

API Application Programming Interface

BPEL Business Process Execution Language

BPMN Business Process Model and Notation

CBSE Component-Based Software Engineering

CLI Command Line Interface

CORBA Common Object Request Broker Architecture

COTS Commercial Off-The-Shelf

CSV Comma-Separated Values

DSL Domain-Specific Language

EJB Enterprise JavaBeans

ESB Enterprise Service Bus

FSA Finite State Automaton

GUI Graphical User Interface

HATEOAS Hypermedia As The Engine Of Application State

HTML Hypertext Markup Language

HTTP Hypertext Transfer Protocol

IaaS Infrastructure as a Service

IDE Integrated Development Environment

Java RMI Java Remote Method Invocation

JSON JavaScript Object Notation

KDD Knowledge Discovery in Databases

KMC Kinetic Monte Carlo

LIS Legacy Information Systems

MIME Multipurpose Internet Mail Extensions

OPAL Ostwald ripening of Precipitates on an Atomic Lattice

OS Operating System

PaaS Platform as a Service

RBAC Role Based Access Control

REST Representational State Transfer

SaaS Software as a Service

SOA Service Oriented Architecture

SSH Secure Shell

UI User Interface

UML Unified Modeling Language

URI Uniform Resource Identifier

URL Uniform Resource Locator

UTS UNIX Timesharing System

VM Virtual Machine

WFMS Workflow Management System

WS-CDL Web Service Choreography Description Language

WSDL Web Service Description Language

XML Extensible Markup Language

YAML YAML Ain’t Markup Language



1 Introduction

The idea of distinguishing between the development of atomic software modules and the composition of systems using such modules as building blocks is not novel and dates back at least to the mid-1970s, when DeRemer et al. [DK76] coined the terms programming-in-the-Small and programming-in-the-Large. The former describes the process of developing individual software modules using a traditional set of programming languages. The latter refers to the process of composing a system out of a large set of modules. DeRemer et al. describe these concepts as distinct types of intellectual activity requiring different programming languages. With the advent of web services and orchestration languages like the Business Process Execution Language (BPEL), the concept of programming-in-the-Large became a substantial part of modern computer-based scientific research. Splitting a complex task into a set of dedicated tasks involving distinct software modules and orchestrating them makes the overall process more controllable and understandable. Moreover, multiple sub-tasks are often conducted by completely separate groups of people, and the concept of orchestration can even be seen at higher levels when parts of an experiment orchestrated by distinct scientists need to be combined as a whole. This process resembles a choreography class where an elegantly executed dance is the result of the hard work required from every involved participant. The concept of service choreographies describes how multiple participants can communicate in a specified way in order to achieve a common goal. Applying this concept to the scientific setting, participants become parts of the computer-based experiment and the choreography becomes the model of how these parts need to interact.

However, the technical aspects of running computer-based experiments “in-the-Large” add more complexity to the process. The modeling and orchestration of such experiments becomes a task requiring knowledge of specific tools and technologies. However, the involved scientists do not always have expertise in the field of service technology or orchestration languages. Moreover, the software which is used for the experiments is not always suitable for orchestrations and choreographies. Preparing a computer-based experiment for the “in-the-Large” setting becomes a separate task and might become a serious obstacle on the way to running the experiment in such a way. This thesis focuses on how a particular type of application can be handled in the context of a choreographed interaction in the domain of e-Science.

In this chapter we focus on the high-level concepts underlying the topic of research. We discuss the motivation behind the topic and describe the problems that need to be solved. Based on this high-level view we define the main goal of this thesis and highlight the primary research questions. Finally, the general structure of this work is presented.


1.1 Informal Thoughts on Topic’s Semantics

A wrong interpretation of the thesis’ title might result in an improper direction of research. Prior to formulating the goal and the main research questions it can be helpful to start with an informal analysis of this work’s title. The concept of text segmentation in the field of natural language processing is a helpful way to identify the important parts of the title. Luckily, the human brain is able to perform a sophisticated text segmentation on-the-fly. We are interested in identifying and analyzing the important text segments in the following topic: “Concepts for Handling Heterogeneous Data Transformation Logic and their Integration with TraDE Middleware”. The most important part to highlight is Data Transformation Logic. In the context of computer science, this segment can be used to describe any form of software responsible for performing a data transformation task, e.g. a source code, an executable binary, or a web service. The definition of a data transformation task depends on the context in which it is applied. For instance, the data transformations using various functions in statistics differ from conversions between different data formats or structures in information integration. A more precise explanation of a data transformation task in the context of this thesis is needed.

The Data Transformation Logic text segment can serve as a center point, and we can start linking it to the other words or text segments of the topic and analyzing their relations. This rather informal approach might help to highlight the potential problems which require additional attention.

Heterogeneous Data Transformation Logic The word heterogeneous adds an emphasis on the diversity of software entities performing a data transformation task. Examples of such heterogeneity include a variety of factors such as the usage of different programming languages, various structures or formats of input and output data, or diverse invocation mechanisms.

(Concepts for) Handling Data Transformation Logic The segment Concepts for puts an emphasis on the plurality of ideas related to handling the data transformation logic. The term handling can be considered from the perspective of software reuse. An interesting question to answer is how multiple standalone data transformation applications can be handled in a standardized fashion. This problem, depending on the reuse methodology, can refer to various research topics including software re-engineering and legacy code wrapping. Moreover, the word handling still sounds ambiguous at this point. Being synonymous with the word management, it has a broad spectrum of interpretations in a computer science context. Assuming heterogeneity of the data transformation software, the selection of storage and search mechanisms is one of the key questions to answer. Another important issue is the remote execution of software that might use diverse invocation mechanisms or have different and potentially contradicting dependencies. Numerous additional questions, for instance, monitoring of the execution or composition of data transformations, might be addressed as well. Every stated interpretation leads to a different area of research, hence a more precise definition of the spectrum of targeted problems is needed.


Integration with TraDE Middleware This text segment is connected to the concepts for handling the heterogeneous data transformation logic. The word Integration refers to a process of coordination and synthesis of multiple diverse program entities in order to achieve some global goal which requires these programs to communicate. Enterprise Application Integration is one of the research fields which deals with such topics. TraDE Middleware is the name of the target software with which the produced concepts need to be integrated. The TraDE Middleware has a broad spectrum of potential application domains. It operates with terms like workflows, service choreographies, and simulations. The main idea is to support the modeling and execution of multiple participants’ workflows. One application domain where the TraDE Middleware might be particularly useful is e-Science.

Combining everything together we can highlight the following important points as a summary of this informal analysis:

• we need to produce concepts for handling heterogeneous applications,

• every such application is only responsible for data transformation tasks,

• the meaning of a data transformation task has to be defined more clearly in the context of this work,

• the term handling an application might be interpreted in various ways and we need to define its boundaries more precisely in the context of this work,

• the derived concepts have to be integrated with the TraDE Middleware, an existing research prototype which can be used in multiple fields including e-Science.

The task of integration with the TraDE Middleware is a significant part of the research. Before proceeding to the problem statement and the description of the main research questions, we need to provide more details about the related concepts along with a brief motivational example.

1.2 TraDE: Background and Motivation

1.2.1 e-Science

Traditionally, scientific research operated with the terms in vivo, in vitro, or in situ in order to describe the way experiments were performed. With the ubiquitous usage of computer-based simulations, researchers started to use the term in silico to properly describe this type of experiment [Joh01]. A drastic increase in the amounts of data [HT03; NEO03] and the growing computational complexity of experiments sparked interest in the search for effective collaboration methods for scientists. Coined in 1999 [Jan07], the term e-Science was originally meant to describe a large funding initiative in the UK. Since then, the definition of this term remains flexible, although the general idea is about innovative ways of doing science.


For instance, the IEEE international e-Science conference’s website provides two (short and long) definitions of e-Science [eSc17], which both focus on the notion of collaborative and data- or computationally-intensive research on a global scale. With the advent of distributed computing paradigms like grid and cloud computing, performing experiments on distributed systems over data spread across multiple locations became possible in a more transparent and convenient way.

A typical chain of steps in a scientific experiment is to formulate a hypothesis, design and run the experiment, obtain the results, and evaluate their importance [SGG07]. The repetitive nature of experiments often makes this chain of steps cyclic. A typical example would be transferring the data to supercomputers for analysis or simulations, running the experiments, storing and evaluating the output, modifying the parameters, and repeating the cycle again [Dee+09]. The application of traditional methodologies to the world of distributed computing required proper coordination mechanisms to be developed. One of the popular approaches helping to tackle this issue is the usage of workflow technology, which is well-established in production management and business intelligence.

1.2.2 Workflows And Choreographies

The term workflow in its most generic sense describes a sequence of steps which defines a certain process [Dee+09]. Workflow Management Systems (WFMS) are specialized programming and run-time environments that provide means to orchestrate composed sequences of tasks with the help of service-based computing [Dee+09; SGG07]. This approach is particularly common for the automation of production workflows [LR00] where the focus is on the automatic execution of relatively fixed business processes across organizational boundaries. Well-established languages like BPEL [OAS06] and the Business Process Model and Notation (BPMN) [Obj11] simplify the execution and modeling of workflows [SGG07].

Compared to business workflows, scientific workflows are very dynamic in their nature and impose different requirements on workflow management systems. Every instance of a scientific workflow is a separate experiment with its own set of configuration data and preferences. Additionally, scientific workflows usually undergo rapid changes [SGG07]. The development of specialized software aimed at supporting scientific workflows utilizing modern computing paradigms is an ongoing research area. Examples of e-Science workflow systems include Mayflower [SK13], Kepler [Alt+04], Triana [Tay+05] and many others.

The term choreography describes a global view on the conversations among multiple workflow orchestrations. Choreographies behave as specifications allowing to utilize existing workflows as well as to implement new ones [DKB08]. Collaborations can include multiple diverse partners with their own well-organized workflows willing to interact with each other. Modeling a choreography and looking at the overall process from a higher level can be very beneficial for the specification and understanding of an overall complex e-Science experiment composed of multiple scientific workflows. Various choreography modeling languages are available [DKB08], including BPMN, BPEL4Chor [Dec+07], and the Web Service Choreography Description Language (WS-CDL) [Wor05].


1.2.3 SimTech Cluster of Excellence

The SimTech Cluster of Excellence is “an interdisciplinary research association in the field of simulation sciences” [Uni17]. It combines more than 200 researchers from various faculties of the University of Stuttgart. The SimTech Cluster of Excellence focuses on six areas of research including molecular dynamics, numerical mathematics, and high performance computing. The Institute of Architecture of Application Systems (IAAS) has developed a workflow management system shaped specifically for scientific simulation workflows. In its initial state, the system allowed to orchestrate experiments as one workflow incorporating all the different simulation components. In order to simplify processes involving the communication of multiple participants operating on different scales, the notion of choreographies was introduced to the system [Wei+17]. Choreography modeling is performed using the BPEL4Chor [Dec+07] language and can later be transformed into executable BPEL workflows for every involved participant.

1.2.4 TraDE Middleware

Collaboration, being one of the key reasons for the emergence of e-Science, is at the same time an additional factor which adds implementation complexity. Choreography models which consist of multiple participants working with different formats and settings all have a specific data flow related to them. Transparent Data Exchange methods were introduced by Hahn et al. [HKL16] to support data-awareness during the choreography’s lifecycle. Data flow modeling capabilities were introduced on the level of choreographies. This allows separating the actual control flow logic from the data flow logic between participants as a prerequisite to enable transparent data exchange. We provide more details regarding these concepts in Section 2.1.

1.2.5 A High Level View On Data Transformation Tasks

The term data transformation is not new and has been used since at least 1947 [Bar47]. Commonly, it describes numeric data transformation in the context of statistics and exploratory data analysis [Fin09]. Transformation of the data might pursue various goals, e.g. meeting some assumptions for statistical inference or improving the overall readability of the resulting diagram. Some examples of transformation techniques include linear, square root, and Box-Cox transformations [GLH15]. Information integration is another field where this term is used to describe a data preprocessing technique [RD00]. However, in this context transformation often refers to combining multiple heterogeneous data sources using schema and instance-level transformations. For example, a format conversion might be needed if data sources have different representations.
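To make the statistical notion of such transformations more tangible, the following Python sketch applies a linear, a square root, and a Box-Cox transformation to a small numeric sample. The sample values and parameters are chosen purely for illustration and are not taken from the cited literature.

import math

def linear(y, a=2.0, b=1.0):
    # Linear transformation: scale and shift a value.
    return a * y + b

def square_root(y):
    # Square root transformation, often used to stabilize variance.
    return math.sqrt(y)

def box_cox(y, lmbda):
    # Box-Cox power transformation for positive values:
    # (y^lambda - 1) / lambda for lambda != 0, ln(y) for lambda == 0.
    if lmbda == 0:
        return math.log(y)
    return (y ** lmbda - 1) / lmbda

sample = [1.0, 4.0, 9.0, 16.0]
print([square_root(y) for y in sample])   # [1.0, 2.0, 3.0, 4.0]
print([box_cox(y, 0.5) for y in sample])  # variance-stabilizing power transform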

In the context of service choreographies and the TraDE Middleware, transparent data flow is only achievable by means of data transformation. The modeling of data flow needs to capture data transformations in often complex cross-participant interactions relying on different data formats and representations.


While the difference between the modeling of choreographies and data flow is not in the scope of this thesis, it is worth mentioning that they might introduce some ambiguity. More specifically, a modeler has to decide whether a particular constituent is part of the choreography or needs to be modeled as part of the data flow. In this thesis we are only interested in data transformation as a means to support transparent data exchange in cross-participant interactions. Ignoring the content of the transformation allows considering it as a function which transforms a vector of input parameters into a vector of output parameters.
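Under this black-box view, a data transformation can be sketched as a minimal interface mapping an input vector to an output vector. The following Python sketch is only an illustrative assumption of such an abstraction; it is not part of the TraDE Middleware or of the framework developed later in this thesis.

from abc import ABC, abstractmethod
from typing import Sequence

class DataTransformation(ABC):
    # A black-box mapping from a vector of inputs to a vector of outputs.

    @abstractmethod
    def transform(self, inputs: Sequence[bytes]) -> Sequence[bytes]:
        # Consume the input parameters and produce the output parameters.
        ...

class IdentityTransformation(DataTransformation):
    # Trivial example: passes the inputs through unchanged.

    def transform(self, inputs: Sequence[bytes]) -> Sequence[bytes]:
        return list(inputs)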

1.2.6 OPAL Simulation: A Motivational Example

To provide the reader with a better view on the topic, we briefly discuss how the concepts of service choreographies and transparent data exchange can be applied to a Kinetic Monte Carlo (KMC) simulation which uses a special software package called OPAL [BS03; Hah+17a]. Its name is an acronym which stands for Ostwald ripening of Precipitates on an Atomic Lattice (OPAL). This software allows simulating how copper precipitates are generated. More precisely, OPAL simulates the formation of clusters of atoms in a lattice due to thermal aging. Figure 1.1 illustrates a BPMN diagram of a simulation choreography using the OPAL software with the TraDE concepts applied. One important aspect is that the OPAL software is developed in Fortran and uses files as its unit of interchange. The software is provided as a set of executables representing different modules of OPAL which have to be wrapped as services in order to run as parts of the choreography. In Figure 1.1 the participants of the choreography are named with the Opal prefix, whereas the data flow is represented by the grey cross-partner data objects [Hah+17a] connected with the dotted arrows. Such separation between all the data which is relevant for a choreography and its participants improves the overall readability of the choreography as well as optimizes the graphical models by reducing the number of data objects and the inter-participant data flow.

Before the simulation process begins, an initial request containing parameters such as the total number of snapshots to take, the initial energy configuration, and a lattice has to be received by the participant named OpalPrep. Already at this point the data needs to be transformed. This data transformation is not a part of the simulation itself, but is required in order for the KMC simulation to start, as the parameters need to be transformed into a supported input format. In the sim_input data object shown in Figure 1.1, energy and params are transformed into opal_in by the service called “Prepare Input Files”. The KMC simulation is triggered by the OpalMC participant. Based on the passed parameters, the service responsible for the simulation creates a particular number of snapshots of the atom lattice’s state at certain points in time and produces the saturation data. Afterwards, this data is analyzed and visualized at the same time by the responsible participants of the choreography. Without going into details, each snapshot is first checked for the presence of clusters by the OpalCLUS participant and the resulting cluster data is then processed by the OpalXYZR participant in order to identify the position and size of the clusters. The resulting data for every snapshot is then grouped together. While the snapshots are being analyzed, the OpalVisual participant is responsible for visualizing the snapshot and saturation data.


Figure 1.1: A BPMN diagram of OPAL simulation choreography with TraDE concepts applied [Hah+17a]

The former is visualized as an animated video of 3D scatter plots and the latter is used for visualizing a saturation function of the precipitation process as a 2D plot. Finally, the output media is passed to OpalMC and the simulation process finishes [Hah+17a].
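To illustrate what such a “technical” transformation step might look like, the following Python sketch mimics the role of the “Prepare Input Files” service by merging separate parameter artifacts into a single simulation input file. The file names and the plain-text format are hypothetical, since the actual OPAL input format is not described here.

def prepare_opal_input(energy_path: str, params_path: str, out_path: str) -> None:
    # Hypothetical sketch: merge the energy configuration and the simulation
    # parameters into a single input file expected by the simulation.
    with open(energy_path) as energy, open(params_path) as params:
        merged = energy.read().strip() + "\n" + params.read().strip() + "\n"
    with open(out_path, "w") as opal_in:
        opal_in.write(merged)

# Example usage (file names are placeholders, not the real OPAL artifacts):
# prepare_opal_input("energy.dat", "params.dat", "opal_in.dat")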

During the execution of a simulation process one can notice several data transformations required for achieving the data flow’s transparency. Moreover, it is also possible to say that in some cases the data transformation is needed to produce new information, e.g. deriving the clusters, while in other cases the transformation serves more as a technical step. At this point, a modeler needs to decide which parts of the simulation process have to be modeled explicitly, i.e. as participants of the choreography. For instance, modeling the analysis of snapshots as participants in the choreography improves the overall readability of the simulation process. On the other hand, some parts can be modeled rather implicitly, i.e. as a part of the data flow. For example, generating a video and a plot can be seen as technicalities not directly related to the simulation process. A modeler, in order to simplify the choreography, can potentially hide the OpalVisual participant by modeling the involved transformations as parts of the data flow. Figure 1.2 illustrates the modeling of the OPAL simulation in the actual software supporting the discussed concepts using BPEL4Chor [Dec+09] as the underlying choreography modeling notation. The participants of the choreography are represented by white boxes with lists of activities inside. The cross-partner data objects are modeled as grey boxes with blue headers and the data flow is represented by dotted arrows connecting the data objects and participants’ activities. Even this relatively small model of the simulation looks quite complex, especially for a person not familiar with the software.


Figure 1.2: A modeling of OPAL simulation [Hah+17a]

In this thesis we examine how the handling of data transformations in service choreographies in the domain of e-Science can be simplified and combined with the TraDE concepts.

1.3 Problem statement

After an informal discussion of the topic’s semantics and the motivation behind it, we can formulate the problems which need to be solved:

How can heterogeneous data transformation logic be uniformly addressed?
Data transformation logic can be provided in multiple ways. For instance, it can be supplied directly by a participant of an experiment or by a trusted third party, e.g., a publicly-available service can be reused. In any case, this logic represents an atomic application supporting one or more data transformation types. One of the problems is to identify the optimal solution among the multiple available options of software reuse. Another part of this problem is to find a uniform way of communicating with heterogeneous reusable applications.


What are the most important aspects of handling a data transformation application?
The analysis of the ways data transformation logic can be invoked is one of the most important parts of the work. However, there are multiple other issues to tackle. For instance, the following tasks might be considered: storage and search options, an analysis of the data transformation code’s suitability for a specific data transformation task, error handling, mining data transformations in order to derive new insights, and searching for composite data transformations.

What architectural decision could unify the resulting concepts?
Apart from identifying and prioritizing the concepts, we need to understand which interaction models are suitable for combining the various aspects together and what zoning should be applied to the set of concepts in order to fit them into the overall picture.

How to integrate the resulting set of concepts with the TraDE Middleware?
The most suitable option has to be chosen from the possible integration styles. Additionally, the expected communication behavior in the context of the TraDE Middleware has to be defined.

1.4 Main goal

The goal of this research work is to develop a framework for handling the data transformation applications used in service choreographies with a focus on the domain of e-Science. The resulting set of assumptions, principles and practices should be able to solve the previously described problems in a technology- and domain-agnostic way. Moreover, the resulting framework should support the integration with the TraDE Middleware. Finally, we validate our concepts by developing a prototypical implementation of the proposed framework and by applying the results in a case study to the previously discussed motivational example as a proof of concept.

1.5 Structure

The thesis is structured as follows:

Chapter 2 – Background and Related Work describes the background concepts underlying the research and discusses the related scientific work relevant to the topic of research.

Chapter 3 – Concepts and Design introduces the concepts of handling the data transformation logic and describes the design of the framework.

Chapter 4 – Implementation elaborates on the details of the prototypical implementation of the introduced framework.


Chapter 5 – Case Study presents the results of the case study showing how the introduced concepts can be applied to a real-world example.

Chapter 6 – Conclusion and Future Work summarizes the work and discusses the potential directions for future work.


2 Background and Related Work

In this chapter we focus on the theory underlying our research work. We discuss the basic concepts and review the relevant scientific work from various fields of research. The chapter consists of five sections, each related to a certain research topic relevant for our work. Section 2.1 discusses the concepts behind the TraDE Middleware and briefly outlines its architecture. Section 2.2 describes the notion of software reuse and its types, and reviews the search methods for software component retrieval. Section 2.3 dives deeper into the topic of legacy code wrapping as one of the methods of software reuse and describes how this problem can be solved in various scenarios. Section 2.4 compares different virtualization techniques and provides performance comparison information. In addition, this section describes how container-based virtualization can be used in the e-Science world. Finally, Section 2.5 lists several container-based architectures in e-Science which focus on the encapsulation of scientific software.

2.1 TraDE Concepts

The introduction and motivational example briefly touch on the concept of transparent data exchange. In this section we provide more details on the underlying theory. Consider the simple choreography example shown in Figure 2.1. It is modeled using BPMN and consists of three participants interacting with each other using messages [Hah+17b].

Figure 2.1: A BPMN model of a choreography with TraDE concepts applied [Hah+17b]

The cross-partner data objects and cross-partner data flows modeled as separate constructs in the diagram describe how the data is exchanged independently from the message flow.


Such separation of concerns allows the modeler to focus on the common data model of a choreography without the need to include the data objects into the message flow using, e.g., standard BPMN modeling constructs. A cross-partner data object consists of one or more data elements describing the actual data. A cross-partner data flow is the path of data interchange modeled independently from the message flow. It can use the cross-partner data objects as well as certain data elements as units of interchange. In classical choreography modeling languages such as BPMN the resulting models are often not directly executable because of missing technical details, e.g., the transport protocols to use or the information required by a concrete process engine. The common approach is to transform the choreography into a set of private processes and refine them so that the resulting processes can be executed. In order to use the new TraDE modeling constructs the concept of choreography transformation was enhanced. As a result, at the transformation stage cross-partner data objects and cross-partner data flows are transformed into standard BPEL or BPMN constructs enriched with corresponding TraDE annotations. Linking the elements of cross-partner data objects with the standard data containers in private process models allows influencing the data interchange rules of the respective process engines. For instance, if during the execution a private process needs to store data in a data container which is linked to an element of a cross-partner data object, then instead of storing this data internally, the process engine uploads it to the corresponding data element of the cross-partner data object at the TraDE Middleware [Hah+17b].
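Conceptually, a cross-partner data object can thus be viewed as a named container of data elements that is written to and read by different participants. The following Python sketch merely restates this structure for illustration and makes no assumptions about how the TraDE Middleware actually implements cross-partner data objects or their upload mechanism.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DataElement:
    # A single element of a cross-partner data object, e.g. "energy" or "opal_in".
    name: str
    value: Optional[bytes] = None

@dataclass
class CrossPartnerDataObject:
    # A data object shared across choreography participants.
    name: str
    elements: Dict[str, DataElement] = field(default_factory=dict)

    def write(self, element_name: str, value: bytes) -> None:
        # A process engine would upload the value here instead of storing it locally.
        self.elements.setdefault(element_name, DataElement(element_name)).value = value

    def read(self, element_name: str) -> Optional[bytes]:
        return self.elements[element_name].value

sim_input = CrossPartnerDataObject("sim_input")
sim_input.write("opal_in", b"...")  # placeholder content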

The final goal of modeling a data-aware choreography is being able to execute it. Every participant’s process is executed using a possibly diverse process engine, i.e. a WFMS, which has to interact with other participants’ processes based on the newly introduced cross-partner data exchange concepts. For this reason the special software layer called TraDE Middleware is implemented. It serves as a hub for inter-participant data exchange.

Figure 2.2: Architecture of TraDE Middleware [Hah+17b]

The TraDE Middleware relies on its own choreography modeling language-agnostic metamodel and is not bound to a particular process engine. Figure 2.2 demonstrates the three-layered architecture of the TraDE Middleware consisting of the Presentation, Business Logic, and Resources layers.


The Presentation layer provides means to interact with the software either via a Web User Interface (UI) or a Representational State Transfer (REST) Application Programming Interface (API). The Business Logic layer is responsible for multiple tasks including the data management of cross-partner data objects, persistence which allows decoupling the data from the choreography instances, and auditing and monitoring tasks. Essentially, this layer groups all core functionality into several sub-components. The Resources layer provides all the data that is necessary for the Business Logic layer [Hah+17b].

2.2 Software Reuse

The problem of addressing heterogeneous data transformation applications in a standardized way, described in Section 1.3, has a lot in common with the topic of software reuse. In order to identify the most suitable solutions we first need to provide a general view on software reuse and discuss its various types.

Historical Perspective and Modern Days

The concept of software reuse is not novel and dates back to Malcolm Douglas McIlroy’s classic paper “Mass Produced Software Components” [McI68], presented back in 1968 at the NATO Software Engineering Conference. McIlroy describes the bottlenecks in software engineering processes and emphasizes the importance of using industrial mass production techniques in software engineering. More specifically, analogously to subassembly in industry, he proposes to develop software by means of component factories. Instead of engineering a system from scratch, it could be composed from reusable software components. In one of the examples, sine calculation is considered as a required-to-implement routine and the overall number of possible implementations complying to certain requirements equaled 300. Instead of implementing all routines it could be more beneficial to pick a reusable software component from a dedicated repository. A component’s suitability for reuse can be identified using a specific set of so-called sale time parameters. For instance, the following dimensions could be considered:

• Precision of the output

• component’s robustness, as a compromise between reliability and compactness

• choice of particular algorithm performing the calculation

• generality, as a degree of parameters’ adjustability during runtime

• desired interface and access methods

• choice of underlying data structures.


The introduced notion of software reuse raises multiple questions. Among these questions are techniques for developing and testing parametrized components, their categorization, distribution, and delivery [McI68].

Horowitz et al. [HM84] emphasize that reusability comes in multiple forms, namely prototyping, reusable code, reusable design, application generators, formal specification and transformation systems, and Commercial Off-The-Shelf (COTS) software. Most of the above-listed forms of reuse are linked to the concept of program generation based on some general-purpose language which, in contrast to a Domain-Specific Language (DSL), is meant to describe a wide spectrum of applications. On the other hand, the authors categorize the concept of code reuse as a rather narrow view on the problem. Additionally, Horowitz et al. list the following problems of code reuse:

• Identification methods: One of the main questions to answer is how to identify and specify reusable components independently of a domain.

• Description techniques: The choice of a specification formalism is the next issue after the component has been identified. How can one describe the component in an understandable manner?

• Implementation form: The next problem is which language should be used for the implementation of the component. For instance, a general-purpose programming language or a higher-level program design language might be used.

• Categorization: Expecting the number of components to be immense, how can similar components be described and grouped?

Moreover, the authors mention that code reusability could be problematic in situations when modification of the reusable component is needed. Such cases may require a deep knowledge of the component’s implementation which diminishes the advantages of code reuse [HM84].

Prieto-Díaz [PD93] presents a taxonomy of software reuse and assesses the state of research at that time. The following perspectives on reusability are discussed: (i) by substance: concepts, artifacts, processes; (ii) by scope: vertical, horizontal; (iii) by mode: systematic, ad-hoc; (iv) by technique: compositional, generative; (v) by intention: black-box, white-box; (vi) by product: source code, design, specifications, objects, text, architectures.

An extensive work by Mili et al. [MMM95] discusses various aspects and the state of research in the area of software reuse. In this work, emphasis is put on the type of artifacts being reused, listing data, architectures, detailed application design, and program reuse. The term reusable assets is used to describe both products and processes intended for reuse. Additionally, three categories of systems are listed:

• reusable program patterns, which describe the usage of source code or design patterns


• reusable processors, which are used for interpretation of high-level specifications

• reusable transformation systems, which involve developing activities in transformations

Reuse of domain knowledge in the form of models is helpful in multiple ways: it simplifies the developer’s understanding of the domain and serves as an initial point in system analysis. Also, it provides an application-dependent classification of reusable components. Domain models incorporate entities and operations on them, constraints and relationships between the entities, and properties of objects which will be used for searching components. In other words, a domain model is a field-specific vocabulary describing entities, behaviors, and constraints. Furthermore, Mili et al. differentiate between the issues of developing reusable assets and developing with reusable assets and describe details for both categories. While the former category is mostly about how to develop reusable building blocks using generative approaches, the latter incorporates issues related to component retrieval, composition, and adaptation.

Ravichandran and Rothenberger [RR03] focus on the advantages of black-box reuse in the context of the Component-Based Software Engineering (CBSE) paradigm, based on the previously discussed concepts of software reuse. Software is built using compositional techniques with prepackaged components as building blocks. In its essence, a component is an executable piece of code providing a specific functionality via some interface. As examples of improvements in reuse, component models like the Common Object Request Broker Architecture (CORBA) [Obj17] and Enterprise JavaBeans (EJB) [Ora17b] are mentioned. The authors discuss the differences of black-box vs. white-box reuse, arguing that usage of the former can be highly beneficial. Additionally, a component-level reuse decision tree is presented with a differentiation between in-house and market black-box components.

A literature review by Mohagheghi and Conradi [MC07] assesses software reuse effects in industry. For this purpose, eleven papers related to observational studies and experiments conducted in industry are analyzed. According to the authors, most of the papers are concerned with systematic reuse in the sense of having an explicit asset for reuse with source code being the unit of reuse. One of the findings is a productivity gain for small and medium-scale studies. However, no consistent results for actual productivity are given. Another important point is the importance of identifying the contexts which could benefit from reuse. The study suggests that black-box reuse and reuse with slight modifications resulted in lower development and correction efforts. An additional observation is the correlation between the size and complexity of reusable assets and the frequency of reuse. Small modules or functions could be reused more often while larger reusable assets require complex design decisions.

Zaimi et al. [Zai+15] analyze the reuse of third-party libraries in open-source software. The authors focus on the evaluation of reuse intensity, how reuse decisions evolve over time, and how reuse affects the product’s quality. The study includes five Java projects and considers more than ten versions of each in order to monitor the reuse history. As an outcome of the study the authors present several observations. Firstly, more libraries are reused over time, signifying an overall increase of reuse intensity during the lifespan of a project. Secondly, developers tend to not revisit their reuse decisions. Thirdly, library removals or updates occur rarely.


In general, the study advocates for a more systematic and standardized approach to reuse as well as a regular review of the decisions made.

The concept of domain knowledge reuse was already mentioned previously, both explicitly [MMM95] and implicitly. For instance, the term vertical reuse is used to describe the reuse of software within a domain, whereas horizontal reuse is applied across domains. Dabhade et al. [DSM16] present a survey on software reuse methods in the context of domain engineering processes. The development of components conforming to specific models should include such steps as the categorization of components, the specification of interfaces and protocols, as well as validation. While a rigorous classification of reuse methods is not in the scope of this thesis, some methods described by Dabhade et al. are worth mentioning. Program library and application framework reuse basically operate at different levels of granularity in the sense of reusing dedicated sets of source code units tailored for a particular goal. Legacy system wrapping is a method of reusing an existing system in a new environment or under different constraints. Service Oriented Systems is basically the concept of component-based software engineering applied to the world of service technology [Erl16].

The study by Vale et al. [Val+16] synthesizes academic research knowledge about CBSE in the period between 1984 and 2012. The authors studied several questions, including the research intensity in the field within the stated period of time, the most investigated research topics, and the domains in which CBSE has been applied. The interest in the field increased in the late nineties and was at stable, but lower, levels in the 2010s. The authors assume that the lower rate of interest in the topic is connected with more rigorous reviewing due to the large number of papers in the field as well as the integration of CBSE with new approaches such as software product lines, model-based engineering, service oriented architectures and cloud computing. The analysis of research topic popularity reveals that CBSE papers cover practically all areas of software engineering (such as testing, architectures, etc.) as well as specific concerns related to component-related topics (component specification, interaction and composition, component models, interfaces, etc.). Additionally, Vale et al. note that the field of CBSE is still active with multiple open problems.

To summarize, the concept of software reuse is an important part of modern software development. While various categorizations are presented in multiple papers, there is no standard classification of software reuse methods. In the context of this thesis, the concepts of black-box reuse, component specification, search, retrieval, and composition techniques are of great interest.

Component Retrieval Methods

In order to find components which are suitable for reuse, efficient retrieval mechanisms are needed. This topic has attracted researchers since at least the early 1990s.

Podgurski et al. [PP93] explain the retrieval of executable software components based on behavior sampling instead of standard text retrieval methods. A component’s behavior can be sampled by executing it on a user-provided input and comparing the output with the user-provided output.


The main idea is that if a component returns the expected values then it suits the requirements of the user.
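A minimal illustration of behavior sampling is given in the following Python sketch: candidate components are executed on a user-supplied sample input and only those reproducing the expected output are retrieved. The candidate components are invented for the example and the sketch does not reproduce the original approach in detail.

def behavior_sampling(candidates, sample_input, expected_output):
    # Retrieve components whose observed behavior matches the user's sample.
    matches = []
    for component in candidates:
        try:
            if component(*sample_input) == expected_output:
                matches.append(component)
        except Exception:
            # A component that fails on the sample input is simply not retrieved.
            pass
    return matches

# Illustrative candidates: the user is looking for a "maximum" component.
candidates = [min, max, sum]
print(behavior_sampling(candidates, ([3, 1, 2],), 3))  # retrieves only max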

Mili et al. [MMK+94] discuss various ways of component retrieval, separating between the keyword-based methods which aim at returning a single suitable component and the composition techniques which discover component chains that suit the user’s needs. The so-called component composers might be considered in case a standard search does not return any suitable component. Mili et al. refer to the issue of component composition as the function realization problem and prove it to be NP-complete.

Zaremski et al. [ZW95] describe a signature matching technique for component retrieval. For instance, if a user looks for a certain function, the function’s type could be used as a way to retrieve a suitable one. In this case the function’s type is the set of input and output parameters’ types. Such a signature is either provided or derivable because it is usually required by the compiler. The component can be either a function or a module consisting of multiple functions. Zaremski et al. discuss the differences between function matching and module matching. Additionally, the authors consider exact matches as well as how matching relaxation can be achieved.
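The core idea of signature matching can be illustrated by comparing input and output type tuples, as in the following Python sketch; the relaxed variant shown here (ignoring parameter order) is only one simplified example of matching relaxation and does not reproduce the formalism of Zaremski et al.

from typing import Tuple, Type

Signature = Tuple[Tuple[Type, ...], Tuple[Type, ...]]  # (input types, output types)

def exact_match(query: Signature, candidate: Signature) -> bool:
    # Exact signature match: input and output types agree position by position.
    return query == candidate

def relaxed_match(query: Signature, candidate: Signature) -> bool:
    # Relaxed match: ignore the order of the input parameters.
    return (sorted(query[0], key=str) == sorted(candidate[0], key=str)
            and query[1] == candidate[1])

query = ((int, str), (bool,))
candidate = ((str, int), (bool,))
print(exact_match(query, candidate))    # False
print(relaxed_match(query, candidate))  # True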

In subsequent work, Zaremski et al. [ZW97] discuss specification matching as a way to determine if two components are related. This process underlies multiple questions including the retrieval of a suitable component, reuse of the retrieved component, substitution of components, and whether one component is a subtype of another. A specification is considered to be a formal description of the component’s behavior. Two components match if two conditions are satisfied: their signatures match and their specifications match. As with the previous work, function and module matching are explained and matching relaxation techniques are discussed.

Bawa et al. [BK16] describe numerous retrieval techniques with their advantages and disadvantages, including:

Keyword-based retrieval is one of the simplest forms of search, using an uncontrolled vocabulary. Basically, a user supplies free-text keywords and the matching happens on a word-by-word basis with the list of indexed terms of the components. Techniques from information retrieval such as term frequency and inverse document frequency are used for component indexing (see the sketch after this list). The resulting ranked list of components is returned to the user.

Enumerated-based retrieval is a one-dimensional retrieval technique based on mutually-exclusive category codes which can be further sub-coded. An example of such a classification is the division between arts and science in a library. However, this technique is very strict in its mutual exclusiveness, which makes it impossible to classify components as belonging to several categories.

Attribute value retrieval is a retrieval technique which uses the component’s attributes. As with the library example, the attributes of a book such as its title, author, or publisher can be used to form a query looking for a specific book in a library. However, the set of attributes of a component might differ across repositories, or an attribute can be dynamic, meaning that it is not possible to know the set of attributes in advance.

Faceted retrieval is a technique which uses domain knowledge to form a pre-enumerated set of so-called facets. Domain specialists carefully identify facets, e.g. the terms “function”, “procedure”, and “objects” can be chosen as functional facets whereas the terms “operating system” and “network parameters” serve as environmental facets. One limitation of this approach is the need to balance the complexity of the facet structure: a complex structure makes it more difficult for a user to understand, while a simple structure results in many components falling into the same classification.

Query and browsing-based retrieval relies on Graphical User Interface (GUI)-based user interaction with the component repository. In complex and ambiguous cases it can be more efficient for a user to navigate through the query’s results and view the components’ details.

Other techniques include signature matching, the usage of Unified Modeling Language (UML) and Extensible Markup Language (XML) descriptions of components, behavior-based methods, the usage of genetic algorithms, and many others. Additionally, combinations of the methods can be used for retrieval.

2.3 Legacy Code Wrapping

Legacy code wrapping is one of the software reuse methods. Compared to other reuse approaches, legacy software wrapping is a necessity rather than a free choice. Handling data transformation applications can also be seen as working with legacy code, since every application which is intended for reuse is an existing piece of software. Commonly, such software is not developed with uniform access methods in mind, and wrapping is the only possible way of integration. Therefore, in this section we provide more details on legacy code wrapping methods.

2.3.1 What is Legacy Code Wrapping

The term Legacy Information Systems (LIS) describes mission-critical software which resists modification and evolution [Bis+99]. Such systems usually remain in operation due to historical reasons and share multiple common properties such as the use of obsolete hardware, high maintenance costs, lack of clean interfaces, and resistance to extension. Bisbal et al. [Bis+99] explain different strategies of coping with legacy systems, namely redevelopment, migration, and wrapping. Having the most impact on the system, redevelopment is basically a rewrite of the whole software system based on new requirements. Migration is less expensive and aims at moving the LIS to a new environment while keeping the original functionality and data. With the least impact on the system, the wrapping strategy aims to preserve the original functionality by surrounding it with new interfaces. Wrapping allows reusing legacy software in accordance with new requirements.

While the whole spectrum of legacy system modernization approaches (e.g. [ACD10; CD+00; FSB16; LZ16; RS16; SNLM16]) is not in the scope of this thesis, we further investigate available software wrapping approaches. Sneed [Sne00] discusses various types of wrappers and levels of granularity at which software can be wrapped. He lists four wrapper types, namely database, system service, application, and function wrappers. The main distinction is the type of object being wrapped. A database wrapper usually offers a common interface which simplifies access to one or more databases. Similarly, a system service wrapper allows clients to access standard services. Application and function wrappers operate on different scales. While an application wrapper encapsulates the whole program via a new interface, the function wrapper works with individual functions of a program.

In general, a wrapper is responsible for receiving an input from a source application, converting it to an internal format, and passing it to the target application. After the target program completes, the wrapper collects its output, converts it to a suitable external format, and passes it to the source application. Thus, wrappers should have two interfaces: an external public and an internal private interface. Moreover, Sneed lists a message handler, an interface converter, and an IO simulator as parts of a wrapper’s structure located between the external and internal interfaces. A message handler is responsible for maintaining input and output queues in order to guarantee processing order and correctness. An interface converter maps external and internal interfaces in both directions. In standard cases the mapping is a 1:1 relationship, however this is not always the case. For instance, a 1:n relationship means that parameters have to be duplicated, and an m:n relationship requires careful association of values. An IO simulator handles the input and output operations of the wrapped object. It passes parameters from the external interface to the input and copies the output into the external interface’s parameters. In other words, it encapsulates the target software via emulation of input and output functions without affecting it [Sne00]. Figure 2.3 demonstrates the connection between the introduced modules of a wrapper.

Figure 2.3: Modules of a wrapping framework [Sne00]
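
To make this structure more tangible, the following Python sketch mimics the three wrapper modules for a simple command-line target application. The parameter mapping, the executable path, and the restriction to a 1:1 mapping are hypothetical simplifications made for the example and are not part of the original work.

```python
import queue
import subprocess

class InterfaceConverter:
    """Maps parameters of the external (public) interface to the internal one."""
    def __init__(self, parameter_mapping):
        # Assumed 1:1 mapping, e.g. {"input_file": "--in", "output_file": "--out"}
        self.parameter_mapping = parameter_mapping

    def to_internal(self, external_parameters):
        arguments = []
        for name, value in external_parameters.items():
            arguments += [self.parameter_mapping[name], str(value)]
        return arguments

class IOSimulator:
    """Feeds converted parameters to the wrapped executable and collects its output."""
    def __init__(self, executable):
        self.executable = executable

    def invoke(self, arguments):
        result = subprocess.run([self.executable] + arguments,
                                capture_output=True, text=True)
        return result.stdout

class Wrapper:
    """Message handler: queues requests to preserve their order, then delegates."""
    def __init__(self, converter, simulator):
        self.requests = queue.Queue()
        self.converter = converter
        self.simulator = simulator

    def handle(self, external_parameters):
        self.requests.put(external_parameters)
        next_request = self.requests.get()  # FIFO processing of queued requests
        return self.simulator.invoke(self.converter.to_internal(next_request))

# Hypothetical target: a command-line transformation tool at /opt/legacy/transform.
wrapper = Wrapper(InterfaceConverter({"input_file": "--in", "output_file": "--out"}),
                  IOSimulator("/opt/legacy/transform"))
```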

Wrapping solutions are usually oriented towards specific technologies and architectures. In order to improve readability, we use the following informal categories to group our literature review: general frameworks and code wrapping techniques, service-oriented wrapping approaches, and grid and cloud-oriented wrapping approaches.

2.3.2 General Frameworks and Code Wrapping Techniques

Juhnke et al. [Juh+09] introduce a method of wrapping legacy code based on the Legacy Code Description Language (LCDL) framework. It provides an extensible legacy code specification model which can later be used for the generation of executable wrapper code. The model incorporates all information required to describe binary and source code legacy applications. Emphasis is placed on the extensibility of the model which, for instance, allows the inclusion of new input/output sources or binding types. The main entity in the LCDL model is called a service. It consists of one or many operations, which represent the legacy code’s methods, and bindings, which define a wrapper’s type. Figure 2.4 demonstrates a simplified UML diagram of LCDL entities without properties.

Figure 2.4: Simplified UML diagram of LCDL model [Juh+09]

An operation incorporates several entities, namely input, output, execute, and environment. An Execute entity represents information about the type of legacy code. A path to the library or binary needs to be specified; in case of a binary, explicit parameters might need to be defined. LCDL distinguishes the following input types: ElementInput, OptionInput, StaticInput, and FileInput. An ElementInput describes an arbitrary parameter of a particular type which is passed to a Binding. An OptionInput represents a flag which can be set by a caller. A StaticInput is a parameter which is always set. A FileInput represents a file (the way of handling it depends on a particular Binding). As a next part, the output information must be specified. It depends on two elements: return source and return type. The former describes from which source the return value is obtained; for instance, it could be the standard output of a command line or a file. The latter defines in which form the output should actually be returned. Additionally, an Operation might require the specification of environment information, for instance an environment variable [Juh+09].
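
The listing below is a minimal Python sketch of how such a description could be represented as data structures. LCDL itself is a specification model defined by Juhnke et al., so the class and field names used here are only illustrative assumptions, not the actual LCDL schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ElementInput:          # arbitrary typed parameter passed to a binding
    name: str
    type: str

@dataclass
class FileInput:             # file parameter; its handling depends on the binding
    name: str
    path: str

@dataclass
class Execute:               # kind and location of the legacy code
    kind: str                # e.g. "binary" or "library"
    path: str
    parameters: List[str] = field(default_factory=list)

@dataclass
class Operation:             # one method of the legacy code
    name: str
    execute: Execute
    inputs: List[object] = field(default_factory=list)
    return_source: str = "stdout"   # where the return value is read from
    return_type: str = "string"     # form in which the output is returned
    environment: Dict[str, str] = field(default_factory=dict)

@dataclass
class Service:               # main entity: operations plus a binding type
    name: str
    binding: str             # defines the type of wrapper to be generated
    operations: List[Operation] = field(default_factory=list)
```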

The concept of atomic domains [HT02] is introduced by Haddad et al. as an attempt to switch the focus from traditional component-level reuse to domain-level reuse. An atomic domain is a collection of reusable assets having a common set of properties specific to related domains. Essentially, a subset of the domain which cannot be reduced is called an atomic domain. The inability to remove a component from a subdomain without changing the functionality or identity of the domain is the main property of irreducibility. As an example, the authors describe a set of operations related to robotic arm control (such as create an arm of any segments, rotate an arm around a joint, extend/retract, etc.) as a robotic arm atomic domain. In a follow-up work they build a wrapper-based framework aimed at domain-specific software reuse on top of the atomic domain concept [HX06]. Figure 2.5 displays the general framework model.

Figure 2.5: The general framework model [HX06]

The main idea of the framework is to reuse source code components (modules, classes, or functions) related to atomic domains using domain-specific wrappers. Each atomic domain has a wrapper as its smart interface allowing the selection of a suitable component based on the application’s requirements. Haddad et al. introduce a wrapper structure consisting of:

Taxonomy: consolidates vocabulary, patterns, algorithms, and other information specific to an atomic domain,

Application Programming Interface: specifies data formats, parameters, constraints, and expected behavior,

Manager: based on the API description, the manager defines which component in the atomic domain is suitable for reuse and passes it to the client application,

Communication Mechanism: is responsible for presenting atomic domain components to the outside world, e.g. via messaging,

Control Mechanism: manages the program flow within the domain.

Assuming that atomic domains have been defined by domain experts, we briefly describe the remaining reuse steps. Initially, each atomic domain is mapped to a certain command which is referenced by an Integrated Development Environment (IDE). This command encapsulates all components within the domain, and the components can only be accessed via the manager. In case such a command is used in a developer’s code, during compile time the compiler sends the necessary data to a dedicated manager via the wrapper’s API. As a next step, the manager checks the command’s context, identifies the suitable component and, if this check is successful, passes the component to the compiler. Afterwards, the single command in the developer’s code is substituted with the received component [HX06].

Gannod et al. [GML00] introduce an architecture-based method for the specification and synthesis of wrappers for legacy command-line applications and discuss aspects of wrapper integration with client software. The authors distinguish the following set of properties which appropriately describes command-line applications: Command, Pre, Post, Signature, and Path. A Command property describes how legacy applications can be invoked. Pre and Post properties describe commands defining pre- and post-conditions for legacy component invocation. A Signature property identifies the names and types of the input and output of the application. Finally, a Path property describes the path to the legacy application. In the proposed approach, the wrappers are generated using specifications which consist of the properties listed above. Additionally, such specifications are created using the ACME [GMW10] Architecture Description Language (ADL).
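
As a rough illustration of how such a property set could drive a generated wrapper, the following Python sketch evaluates hypothetical Pre and Post commands around the invocation of a legacy command-line application. The property values and the shell-based condition checks are assumptions made only for this example and do not reflect the ACME-based specifications used in the original work.

```python
import subprocess

# Hypothetical wrapper specification following the five properties named above.
specification = {
    "Path": "/opt/legacy/convert",
    "Command": "{Path} {input} {output}",
    "Pre": "test -r {input}",       # pre-condition: the input file must be readable
    "Post": "test -s {output}",     # post-condition: the output file must be non-empty
    "Signature": {"input": "file", "output": "file"},
}

def invoke_wrapped(spec, **parameters):
    def run(template):
        command = template.format(Path=spec["Path"], **parameters)
        return subprocess.run(command, shell=True).returncode == 0

    if not run(spec["Pre"]):
        raise RuntimeError("pre-condition failed")
    if not run(spec["Command"]):
        raise RuntimeError("legacy component invocation failed")
    if not run(spec["Post"]):
        raise RuntimeError("post-condition failed")

# e.g. invoke_wrapped(specification, input="data.csv", output="data.xml")
```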

Lee et al. [LCJ03] propose an XML wrapper API for Command Line Interface (CLI) systems. Their work focuses on problems of network management, which commonly relies on the usage of CLI-based systems. When the syntax of a command changes, the implementation relying on it might have to change as well. In order to tackle this issue, the authors propose to model groups of CLI commands using XML templates and, for further processing, to convert these templates into actual commands using APIs. The main idea is to leverage hierarchical XML structures to define the order of commands and also to support error handling in case a command in a sequence fails.

Wettinger et al. [WBL15] introduce a generic technique for generating public API implementations (APIfication) for executable programs. The aim is to avoid tedious manual wrapping by means of standardized invocation mechanisms to support software reuse. Additionally, the authors present an extensible framework called any2api as a realization of the introduced approach. The presented method does not necessarily aim at legacy software and can be considered as a general reuse methodology focused on executable applications. As a use case, the paper describes a deployment automation scenario for a web application in the context of cloud computing. To deal with the heterogeneity of technologies, interfaces, and implementation details, Wettinger et al. propose to wrap the invocation of various executables in a generic API. In the context of deployment automation this method helps to hide details about placement, runtime dependencies, or invocation parameters of an executable, as well as simplifies the return of formatted results. One important assumption underlying the presented method is that every executable provides a metadata description of its input and output, dependencies, etc. The APIfication method, which consists of eight steps, is shown in Figure 2.6. Firstly, the desired executable is chosen for API implementation generation. The second and third steps are responsible for the definition of the interface type (e.g., REST) and the API implementation type (e.g., Java, Python). The next step (step 4) is about scanning the executable with its metadata to get information about input and output parameters. Optionally (step 5), the refinement of information obtained from the previous step might occur. The generation of an API implementation happens next (step 6). To support the API implementation’s portability, a self-contained package is created (step 7), e.g., using the Docker container engine.

Figure 2.6: APIfication framework [WBL15]

Finally, the resulting package is ready for use (step 8), and from this point it can be refined or modified by returning to the selection step (step 1) again [WBL15].

Figure 2.7: API implementation package [WBL15]

As a proof of concept, the any2api framework is implemented based on the described method. To this end, Wettinger et al. introduce the concepts of invokers, generators, and scanners. An invoker is a special piece of software which is responsible for running at least one type of executable program. If the required invoker is available in a corresponding invoker registry, the next steps are performed. In order to proceed to the generation of a portable API implementation, scanning has to be performed. A scanner is responsible for the analysis of a certain type of executable and the related metadata. As a result, it produces a specification which contains information about the input and output parameters, their data types, and the mappings between the executable and its API to be exposed. A generator uses the resulting specification and the corresponding invoker obtained from the dedicated registry to generate an API implementation. Finally, all artifacts are grouped by means of a self-contained package, resulting in a portable API implementation supporting the chosen executable [WBL15]. Figure 2.7 demonstrates the structure of such an API implementation package.
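
The interplay of these three roles can be sketched in Python as follows. This is not the actual any2api code; the specification fields, the class interfaces, and the example executable path are assumptions made only to illustrate the described flow.

```python
class Scanner:
    """Analyzes one type of executable and produces an input/output specification."""
    def scan(self, executable_path):
        # A real scanner would parse the metadata shipped with the executable;
        # this sketch simply returns a fixed, hypothetical specification.
        return {"executable": executable_path,
                "inputs": {"infile": "file"},
                "outputs": {"result": "file"}}

class Invoker:
    """Knows how to run one type of executable with concrete input values."""
    def invoke(self, specification, inputs):
        raise NotImplementedError("one subclass per supported executable type")

class Generator:
    """Combines a specification and a matching invoker into an API implementation."""
    def generate(self, specification, invoker, interface_type="REST"):
        return {"specification": specification,
                "invoker": invoker,
                "interface": interface_type}

# Simplified end-to-end flow: select an executable, scan it, and generate the API
# implementation, which would then be packaged into a portable, self-contained unit.
scanner, generator = Scanner(), Generator()
specification = scanner.scan("/opt/tools/align")
api_implementation = generator.generate(specification, invoker=Invoker())
```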

One can notice that the structure of a packaged API implementation resembles the traditional wrapper shown in Figure 2.3. The API endpoint plays the role of an external interface and the executable has its own internal interface. In its turn, the invoker incorporates the functions of the message handler, IO simulator, and interface converter. However, the concept of packaged API implementations is on a higher level of abstraction as it also proposes wrapping the environment to support portability.

2.3.3 Service Oriented Wrapping Approaches

Zdun [Zdu02] conceptualizes a process of migrating legacy applications to the web.

Figure 2.8: Simplistic Legacy to the Web architecture [Zdu02]

To this end, the generic simplified architecture shown in Figure 2.8 is presented. Based on it, Zdun derives a generic process model consisting of four steps:

1. Provide an API to the web either using wrapping or redevelopment approaches

2. Implementation of a Request Decoder which maps Hypertext Transfer Protocol (HTTP) requests to the wrapper’s or legacy API

3. Implementation of a Hypertext Markup Language (HTML) Decorator which creates an HTML representation of the response

4. Integration of implemented components with a web server

The simplified architecture is enriched by analyzing critical issues related to every step of the process listed above. The following issues are highlighted regarding legacy code wrapping: i) how requests should be mapped to the legacy system, i.e., how to approach such problems as responsible wrapper selection and invocation; ii) since HTTP is stateless, the wrapper has to maintain state and session information in order to handle asynchronous calls; iii) a wrapper might need to be implemented not only for HTTP but for multiple protocols, which raises the question of wrapper integration and reuse; iv) how to abstract service providers in a wrapper in case the legacy system provides multiple services. Based on that, Zdun discusses various aspects of HTTP protocol handling (HTTP request handling, Uniform Resource Locator (URL) handling, session management and state preservation, authorization, encryption, logging, testing, and deployment) as well as content creation and representation approaches. Finally, a complex reference architecture integrating the aforementioned concepts is presented.

Works by Sneed et al. [Sne06; Sne+06; SS03] discuss white-box approaches for wrapping selected functions from legacy software by means of XML descriptions and corresponding wrapper generation. The authors describe the tool SoftWrap which is able to transform legacy functions’ interfaces into Web Service Description Language (WSDL) interfaces. Supported languages are PL/I, COBOL, and C/C++. Apart from the WSDL interfaces, this transformation is responsible for the creation of two additional modules which are used in a wrapper. These modules are responsible for transferring input and output data between the WSDL and legacy code interfaces.

A white-box approach proposed by Guo et al. [Guo+05] focuses on the generation of web services from Microsoft .Net legacy applications. The authors present a tool called Web Services Wrapper (WSW) which consists of two parts: an analyzer and a wrapper. The former is responsible for parsing and displaying legacy source code. The latter generates web service code for a chosen function and compiles it using the .Net compiler if restrictions (e.g., the method must be public and not abstract) defined by developers are satisfied.

A method presented by Stroulia et al. [SERS02] in the context of the CelLEST project describes how to derive an executable specification of a service provided by legacy software by means of reverse engineering. This method follows the screen scraping approach in order to capture and model user interactions with legacy software. The main idea is to expose the desired parts of a legacy interface through a new interface. The presented CelLEST process is an interaction-based wrapping approach which creates a new interface working as a bridge between clients and the legacy software. It consists of five steps:

1. Tracing user interactions using the emulator tool

2. Modeling captured behavior as a state transition model

3. Mining task execution patterns

4. Modeling services based on mined patterns

5. Constructing web-based UIs for modeled services

Canfora et al. [Can+06; Can+08] introduce a technique for migrating interactive legacy applications to the web, i.e., applications involving form-based interactions with users. The goal is to provide a request/response-based interface for the legacy system. This cannot be easily addressed using techniques like screen scraping. Therefore, the authors propose a wrapping-based approach where the wrapper is responsible for the autonomous handling of conversations between the legacy system and the client. However, in order to support conversations, the wrapper must know the conversation rules and be able to adapt to the execution flow. Considering the fact that a wrapper treats the legacy system as a black box, a human-computer interaction model is needed in order to achieve this goal. The resulting solution uses a Finite State Automaton (FSA) to specify the conversation behavior. The FSA is composed of states, transitions, and actions. A state describes the changes from the system’s start to the present. A state change is described by a transition which can be enabled if some condition is satisfied. An action defines what has to be done at a given moment. The resulting wrapper consists of an Automaton Engine, a Terminal Emulator, and a State Identifier. Additionally, a Repository is used to persist the FSA and the corresponding Screen Templates (use case-dependent interpretable descriptions).

Diamantini et al. [DPP05] describe how to automatically generate wrappers for Knowledge Discovery in Databases (KDD) tools in order to expose them as services. The authors implement an Automatic Wrapping Service (AWS) which generates wrappers for KDD tools. This enables the generation of a WSDL descriptor for a KDD tool based on an XML specification provided by a developer. This specification not only contains information necessary for wrapping, but also provides a formal description of the functionalities using a KDD ontology, optional linkable applications, and tool performances. Furthermore, the paper provides a detailed description of the specification information required for wrapper generation. Diamantini et al. distinguish the following data which needs to be specified: name, input, output, language, and description. Input parameters can be obligatory, optional, and hidden (in case they are needed but not explicitly set by the client). The language specification describes Operating System (OS) and compiler-related settings.

Afanasiev et al. [ASV13] introduce MathCloud, a platform aimed at the publication and reuse of scientific applications by exposing them as RESTful web services. One of the key points is to provide a unified interface which can be used to communicate with any kind of computational service. Commonly, the interaction between a service and a client in computational applications involves the following steps: (i) the service receives a request to solve a task from a client; (ii) the request contains a description of the task and includes all the required input; (iii) after the request is processed, the service returns the output to the client. The authors propose a unified RESTful interface consisting of the following resources: Service, Job, and File. These resources are identified by Uniform Resource Identifiers (URIs) and can be accessed using standard HTTP methods. A Service resource can be accessed using the GET and POST methods. The former results in a service description being returned. The latter triggers the creation of a new task with the input specified in the request’s body. A Job is accessible via the GET method, which returns the status and results of the job. Additionally, the DELETE method can be used to delete the job and the resulting output data. A File resource can be accessed only using the GET method, which returns the results of the job.
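
A client-side interaction with such a unified interface could look roughly like the following Python sketch. The base URI, the JSON field names, and the polling interval are hypothetical and only illustrate the Service/Job resource pattern described above.

```python
import time
import requests

# Hypothetical base URI of a deployed computational service.
SERVICE = "https://mathcloud.example.org/services/solver"

# GET on the Service resource returns its description.
description = requests.get(SERVICE).json()

# POST on the Service resource creates a new Job from the task description in the body.
job = requests.post(SERVICE, json={"inputs": {"matrix": "matrix-a.dat"}}).json()
job_uri = job["uri"]  # assumed field holding the URI of the created Job resource

# GET on the Job resource returns its status and, once finished, its results.
while requests.get(job_uri).json().get("state") != "done":
    time.sleep(5)

# DELETE removes the Job together with its output data.
requests.delete(job_uri)
```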

The MathCloud platform consists of several components, including a Service Container also referred to as Everest, a Service Catalogue, and a Workflow Management System. Being the core of the system, the Service Container (Everest) provides adapters for various types of applications and implements a runtime for the resulting RESTful services. The Service Catalogue provides means to discover, monitor, and annotate the deployed services. The Workflow Management System handles the service composition tasks [ASV13].

2.3.4 Grid and Cloud-based Wrapping Approaches

Huang et al. [Hua+03] propose a semi-automatic approach to support running legacy C code as services which conform to the Triana [Tay+05] component model within a grid infrastructure. This approach relies on two tools, namely the Java-C Automatic Wrapper (JACAW) and the MEdiation of Data and Legacy code Interface (MEDLI). The former tool is responsible for the automatic wrapping of legacy C code as Java code. JACAW takes C header files as input and produces the corresponding C and Java files needed for making native calls. The latter tool provides a GUI to facilitate the conversion from Java code produced by JACAW to components used by the Triana scientific workflow system [Hua+03].

Delaitre et al. [Del+05] introduce the Grid Execution Management for Legacy Code Architecture (GEMLCA) approach aimed at deploying legacy applications as grid services. It is a black-box approach suitable for multiple source languages including, e.g., Fortran, C, and Java. Conceptually, GEMLCA allows exposing legacy applications which already run in the grid infrastructure as grid services. The so-called GEMLCA Resource layer is responsible for presenting legacy applications to a client using an XML-based Legacy Code Interface Description (LCID) descriptor. This file stores information about the execution environment and a set of parameters required for the legacy application. For instance, the environment section stores data about the name and binary file of the legacy application as well as grid-specific information like which job manager should be used or the maximum number of allowed jobs. The parameters section describes a set of parameters along with their specific characteristics like name, friendly name, type (input or output), status (obligatory or optional), file or command line, and a regular expression which will be used to validate the input. The internal design of GEMLCA consists of three layers: front-end, core, and back-end. The front-end layer provides a set of grid services related to the client’s communication with the legacy code. The core layer is responsible for the management of legacy code processes and jobs. The back-end layer facilitates communication with the grid middleware.

Glatard et al. [Gla+08] discuss how Service Oriented Architecture (SOA) can be advantageous for the execution of scientific applications in the grid infrastructure. One of the main parts is the description of a generic web service wrapper which allows running legacy applications through a standard service interface. Another significant part of the work shows how the generic wrapper allows dynamic service grouping using the MOTEUR workflow system. The idea of such a generic wrapper is to expose a standard interface which hides the grid infrastructure and which can be invoked by any client compliant with the web services specification. To achieve this, a generic service relies on two different kinds of input: (i) the legacy application’s command line format specification and (ii) the legacy application’s input parameters and data. Thus, the only required task a developer has to do is to create an XML descriptor containing the required information. Firstly, the application’s name and access method have to be specified. The proposed implementation allows choosing between two access methods, namely URL or Grid File Name (GFN), which define how a wrapper will fetch the data. Glatard et al. distinguish two types of input, namely data and parameters. The former describes input files which can also be accessed using different methods and have additional command line options to be specified. The latter specifies command line values which are not files, thus there is no need to define access methods for them. Moreover, an output data access method and command line options have to be specified. Finally, sandboxed files might be defined. Those are the files describing all external dependencies like scripts or dynamic libraries which are implicitly required for the execution. As further steps, the authors describe how generic wrapping might be used for service composition by means of a workflow enactment system.

Balis et al. [BBW08] present the Legacy to Grid Framework (LGF) aimed at exposing legacy applications as grid services in a semi-automatic manner. The proposed structure of the framework consists of three main parts: a Service Client, a Service Manager, and a Worker Job. A service client is responsible for transparent and isolated interactions between a client and a legacy application. In its turn, a service manager is a set of web services and their resources which are deployed in a hosting environment. This set of services is responsible for the communication management between the client service and the worker job, the creation of resources, and the submission of worker jobs. A worker job is responsible for running the exposed functions of the legacy application. Every client is mapped to exactly one worker job and the interaction between them happens via the service manager’s services. The worker job uses standard secure web service messaging for communication with other parts of the system.

Elkhouly et al. [Elk16] propose a Software Components as a Service (SCaaS) solution to deliver components for reuse based on the Software as a Service (SaaS) cloud computing delivery model. SCaaS allows deploying components on demand by means of cloud infrastructure and according to some licensing model. A queried component is first checked for compliance with requirements such as supported architecture, functionality, and interfaces. If the check is passed, black-box reuse is applied; otherwise, white-box reuse is proposed. As white-box reuse involves modifications, the newly derived component is sent back to the cloud repository for future reuse. The authors focus on descriptions of the add and retrieve modules. The former is responsible for adding components into a repository and consists of two indexing subsystems, namely concept (functional description) and signature (constraints) indexing. The latter allows searching for a component and proposes a suitable reuse method, e.g. black-box or white-box, depending on the matching results, e.g. exact match or similar. The model of SCaaS is shown in Figure 2.9.

Figure 2.9: SCaaS model [Elk16]

Sukhoroslov et al. [SA14; SVA15] present Everest, a web-based platform for the reuse of scientific applications which uses the Platform as a Service (PaaS) cloud delivery model. Everest is a further development of concepts from the aforementioned MathCloud [ASV13] platform. Everest consists of client-side and server-side parts. The former offers a web user interface and client libraries. The latter is composed of three layers:

REST API is a single entry point for accessing the platform’s functionalities. HTTP is used for requests with JavaScript Object Notation (JSON) as the interchange format. A unified web services interface for accessing the deployed applications is implemented as a part of the REST API.

Applications layer is responsible for hosting the applications intended for reuse. An application is treated as a stateless black box with sets of inputs and outputs which operates independently of other requests. Applications are created by clients who can also configure access permissions.

Compute layer is responsible for the execution of applications using remote computing resources. Everest does not offer any infrastructure for the execution of applications and fully relies on computing resources provided by users. This layer controls the process of job execution. After the application is invoked via the REST API and the corresponding job is created, the Compute layer controls such actions as staging the input files, task submission and monitoring, or obtaining the output.

Figure 2.10: Structure of Everest application [SVA15] (request handling steps: 1. authenticate and authorize client; 2. parse and validate input values; 3. create new job; 4. translate input values to tasks; 5. run tasks on computing resources; 6. process task results (generate new tasks); 7. translate job results to output values)

Figure 2.10 demonstrates how a request to an Everest-deployed application is handled. Some of the steps are application-agnostic and can be implemented similarly for any kind of application, while others depend on a certain application. Everest expects an application description consisting of two parts, namely public information and internal configuration. The former provides the specifications of input and output as well as the information related to application discovery and how a client can communicate with it. The latter is responsible for forwarding requests to an application and the generation of the results. For instance, the internal configuration of command-line applications includes the specification of mappings between the input values and the task command as well as the specification of how a task’s output is mapped to the output values. As Everest does not offer its own computing resources, users can attach their own resources. For this purpose, a piece of software called an agent, which runs on the resource, is used. An agent supports various types of resources and can be integrated with them using corresponding adapters [SVA15].
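
The application-specific part of this handling, i.e., mapping input values to a task command and task files back to output values, can be sketched in Python as follows. The configuration keys, the command template, and the file names are hypothetical and are not Everest’s actual configuration format.

```python
# Hypothetical internal configuration of a command-line application: it defines
# how input values map to the task command and how task files map to outputs.
internal_configuration = {
    "command": "transform --format {format} --in {input_file} --out result.xml",
    "outputs": {"result": "result.xml"},
}

def build_task(configuration, input_values):
    """Application-specific part of request handling: input values -> task command."""
    return {"command": configuration["command"].format(**input_values),
            "expected_files": list(configuration["outputs"].values())}

def collect_outputs(configuration, produced_files):
    """Application-specific part of request handling: task files -> output values."""
    return {name: produced_files[file_name]
            for name, file_name in configuration["outputs"].items()}

task = build_task(internal_configuration,
                  {"format": "csv", "input_file": "data.csv"})
# After the job has finished, the produced files are mapped back to output values:
outputs = collect_outputs(internal_configuration,
                          {"result.xml": "<result>...</result>"})
```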

Unlike in the cloud or grid-based wrapping approaches, we are interested in decoupling the process of handling data transformation applications from a particular infrastructure. Firstly, cloud and grid infrastructures are not always available to the scientist willing to perform a computer-based experiment. Setting up and configuring a private infrastructure can also be problematic in terms of resources and costs. Secondly, sharing applications by publishing them to, e.g., the cloud is not always wanted by the scientist. We are interested in finding a simple way to publish and run data transformation applications in a technology- and infrastructure-agnostic manner which requires less effort from the publisher.

2.4 Virtualization

The invocation of heterogeneous data transformation applications is another problem from Section 1.3 which requires special care. Depending on the implementation technology, software might rely on particular dependencies, e.g. additional applications, thus making it impossible to invoke it on a computer not having such dependencies available. One possible solution to this problem is to use virtualization technology. In this section we discuss the related work on various types of virtualization and their comparison. Additionally, we describe the relevant work on how container-based virtualization might be used in the context of computer-based scientific experiments and which drawbacks it might have.

2.4.1 Hypervisor and Container Technology

A Virtual Machine (VM) is commonly defined as an isolated and efficient replica of a real machine [PG74]. The concept of virtualization has been in the scope of numerous research papers since at least the 1970s, e.g. discussed by Meyer et al. [MS70], Goldberg [Gol73], or Popek and Goldberg [PG74]. Popek and Goldberg [PG74] define a virtual machine as an environment generated by a Virtual Machine Monitor (VMM), or hypervisor. A VMM is software which has three major characteristics: (i) it provides an environment for applications which is basically indistinguishable from the original machine; (ii) execution of the applications might only result in an insignificant decrease in performance; (iii) system resources are completely controlled by the VMM. The authors differentiate between bare-metal (type-1) and hosted (type-2) hypervisors. The latter type operates on top of a host’s operating system.

Hypervisor-based virtualization is a traditional way of achieving isolation and resource control for the co-location of multiple workloads [Fel+15]. Workload isolation guarantees that one workload cannot interfere with the execution of another. Resource control is the ability to bind a workload to a specific set of resources. When executing a workload inside a dedicated VM, these two key requirements are fulfilled because the VM both isolates the workload and constrains the resources (as it is configured with certain resource limits). Compared to native execution, this, however, is more expensive performance-wise.

As historical inspiration for container technology, Bernstein [Ber14] mentions the Unix operating system’s chroot command (introduced back in 1979). This command allows changing the root directory for a currently running process and its children. The resulting environment, for instance, can host a virtualized replica of a software system. An extended implementation of chroot called jail was added to FreeBSD in 1998. Six years later, an enhanced version of this concept was introduced in the Solaris 10 operating system and called zones. Solaris 11 presented containers which were based on zones. With the advent of the Linux OS, an evolved container technology based on such concepts as kernel namespaces [BN06] and control groups (cgroups) [Men+17] replaced these previous variants. We focus more on Linux containers later in this chapter.

Being a lightweight alternative to the hypervisor-based approach, container-based virtualization operates on a different level of abstraction [MKK15]. As hypervisors operate on the hardware level, they need to virtualize hardware and the related drivers. Moreover, every VM uses a separate OS on top of the virtualized hardware. Figure 2.11 shows the structure of hypervisor-based virtualization with respect to the types of hypervisors.

Figure 2.11: Hypervisor-based virtualization [MKK15]

In order to avoid such overhead, containers switch the level of abstraction from the hardware to the operating system level. This is achieved by sharing the operating system’s kernel. Every container runs on top of the host’s operating system. Figure 2.12 demonstrates the structure of container-based virtualization.

Figure 2.12: Container-based virtualization [MKK15]

As was mentioned previously, one of the features used in Linux containers is called kernel namespaces [Fel+15]. In a very general sense, namespaces establish “boundaries” for groups of logically related objects. Kernel namespaces allow creating “separate instances of previously-global namespaces” [Fel+15]. When a process is associated with a namespace, it can only see and access that namespace’s resources. There are several kinds of namespaces implemented in Linux and each is responsible for the isolation of a certain resource type. For instance, a mount (mnt) namespace is responsible for maintaining mounting points and a process ID (pid) namespace provides processes with unique identifiers. Other examples include network (net), interprocess communication (ipc), and UNIX Timesharing System (UTS) namespaces. This powerful mechanism can be used in multiple scenarios including the creation of isolated containers which cannot see and access objects located outside. As a result, processes seem to run within a container in the same way they run on a regular Linux system. However, they share the kernel with processes from other namespaces. In contrast to a VM, a container is not required to run a complete operating system inside and in fact can represent only one process. Thus, depending on the container’s contents, it could be either a system or an application container. The former acts like a complete OS with system daemons like init, inetd, and cron running. The latter only runs an application, which results in less consumption of resources. In some cases, there is a need to relax the degree of isolation for a container. This can be achieved by sharing resources among the containers, e.g. using bind mounts which allow making the same directory visible to multiple containers. Also, inter-container communication or an interaction between the host and a container happens with the same efficiency as regular Linux inter-process communication.
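
The namespace membership of a process can be observed directly on a Linux host, as the following small Python sketch shows. The listed namespace kinds match the ones named above; the printed identifiers are system-specific examples, and on non-Linux systems the information is simply unavailable.

```python
import os

# Each running process is linked to one namespace of every kind; the kernel
# exposes the identifiers as symbolic links below /proc/<pid>/ns/.
for kind in ("mnt", "pid", "net", "ipc", "uts"):
    try:
        # Two processes printing the same value share that namespace; a process
        # inside a container typically reports different identifiers than the host.
        print(kind, os.readlink(f"/proc/self/ns/{kind}"))
    except OSError:
        print(kind, "namespace information not available on this system")
```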

Another feature that is used in the implementation of Linux containers is cgroups [Fel+15], which allows grouping processes and controlling their resource usage. In the context of containers, cgroups are used for managing a container’s resources, e.g. memory limits and CPU consumption. Modifying a container’s boundaries is as easy as applying changes to the cgroup which corresponds to this container. Additionally, a cgroup allows terminating all processes inside the container. One problem related to resource management is that the resource constraints imposed on a container are not known to the processes running inside. This might lead to an over-allocation problem if an application tries to use all available resources of the system when in fact it can access only a subset of them.

Commonly, LXC [Lin17] is used as a synonym for Linux containers. However, LXC is one of many container management tools available today. Other examples of container engines are Warden [Fou17], Docker [Doc17b], and Rocket (rkt) [Cor17].

2.4.2 Comparison of Virtualization Techniques

According to Bernstein [Ber14], most commercial cloud providers use hypervisor-based virtualization, although there are examples of container usage to support cloud delivery models like Infrastructure as a Service (IaaS) and PaaS as well. The choice of a virtualization technique highly depends on the task’s specifics, and multiple research papers examine the differences between hypervisor-based and container-based virtualization focusing on various details [DRK14; Fel+15; Joy15; LK15; MKK15; Mor15; Yam15]. This subsection briefly discusses the findings of several of the mentioned comparisons.

Dua et al. [DRK14] compare the usage of hypervisor-based and container-based virtualization in the context of the PaaS cloud delivery model. A description of various implementation options of PaaS using containers and virtual machines and a comparison of several container engines are given. The authors mention an increase in the popularity of containers for PaaS solutions due to better performance and reduced startup time. Additionally, Dua et al. highlight three features that can help to improve the adoption of the technology, namely a standard container format, enhanced container security, and OS independence.

Felter et al. [Fel+15] compare the performance of non-virtualized Linux to hypervisor-based virtualization (specifically, the Linux KVM [Kiv+07] feature) and container-based virtualization (the Docker engine) by using various benchmarks and measuring the overall performance of database products like MySQL [Ora17a] and Redis [Red17]. Docker was either equal to or better than the KVM-based solution in every performed test. Furthermore, both KVM and Docker had insignificant overhead for memory and CPU performance. A common approach is to use VMs for IaaS and containers for PaaS. Felter et al. mention the lack of technical reasons for such a distinction, which leads to a broader application spectrum for container-based technology.

Yamato [Yam15] presents an evaluation and comparison of the performance of bare-metal, Docker-based, and KVM-based servers deployed on OpenStack [Ope17]. The resulting findings predictably show that bare metal outperformed both Docker and KVM. Docker performed better than KVM in most disciplines. However, file copy operations were better on the KVM-based server, which resulted in the total index of Docker being not much higher than the index of KVM. Additionally, the startup time was measured for all types of servers. In this case, it took significantly longer for a bare-metal server to start than for both Docker and KVM. Compared to KVM, Docker’s startup took less time because there is no need to boot an OS.

Morabito et al. [MKK15] compare the performance of KVM as a hypervisor-based approach, the LXC and Docker container engines, and OSv [Clo17], which is an open-source lightweight cloud OS intended to run on top of a hypervisor. One of the findings in this comparison shows that containers introduce insignificant overhead. Although both LXC and Docker perform well, security might be a problematic issue.

Morabito [Mor15] presents a comparison of the power consumption of various virtualization technologies including both hypervisors (KVM and Xen [Xen17]) and container engines (LXC and Docker). The results demonstrated no noticeable difference for the idle state, CPU, and memory performance. However, the network performance comparison showed different results for hypervisors: in contrast to containers, hypervisors consumed more power because of the additional layers in the hypervisor environment through which network packets have to pass.

2.4.3 Virtualization For Research Reproducibility and Scientific Containers

The notion of reproducibility is a basis of scientific research. The results of experiments are usually published together with the steps describing how they can be achieved. If a description allows obtaining a semantically equivalent outcome, the results are called reproducible. Follow-up scientific work can be confidently built on top of such findings. As was previously discussed, e-Science experiments are computer-based. Reusing scientific software in many cases relies on the idea of reproducibility. Although software might seem to be deterministic in its nature, many factors influence the outcome of computations. In order to reproduce a software-based experiment, a scientist often needs to repeat exactly the same packaging, installation, and configuration steps, which can be poorly documented or even partially omitted [PF16].

Piccolo et al. [PF16] describe various tools and techniques simplifying the computational reproducibility of research. Examples include the usage of custom scripts allowing the automatic execution of software or software frameworks which simplify the handling of dependencies. Literate programming as a combination of a narrative description with the code, workflow management systems, and hypervisor and container-based virtualization techniques are also among the options to support reproducible research. Being the lightweight alternative to hypervisors, container technology, and in particular Docker [Doc17b], attracts scientists as a means to achieve reproducibility. Containers allow preserving the whole execution environment including all the related dependencies. As a result, software which was packaged as a container is much simpler to install and configure.

Chamberlain et al. [CS14] discuss how scientific research can benefit from Docker containers. The main idea is to package software as layered containers depending on the use case. A container which is used as a basis is called a base container. It packages the libraries and instruments required for the software to run. Both development and release containers are built on top of the base container. The former includes additional software useful for the development and testing processes, while the source code is not included but accessed directly from the host’s file system. This simplifies keeping the source code in the latest state. The release container is also built on top of the base container. However, it includes the source code alongside its dependencies. As a result, a release container can be distributed for further scientific work, reproducing exactly the same environment which was used for the original experiments.

Boettiger [Boe15] analyzes technical challenges to computational reproducibility and discusses how Docker can be used as a solution. Boettiger lists four technical challenges preventing computational reproducibility. Firstly, there is a need to recreate the original environment of the experiments, i.e. tackling the “dependency hell”. Secondly, experiments often lack precise documentation on how to configure and run the software. Another issue affecting reproducibility is the so-called “code rot”. Dependencies are not static and evolve over time, e.g. through the release of new versions with fixed bugs or modified functionality. The knowledge of software’s tolerance to changes in its dependencies is also an important point in reproducibility. Additionally, Boettiger discusses the adoption complexity of some existing solutions like workflow technology or virtual machines. Often, scientists are not familiar enough with these technologies or it takes a lot of effort to adopt them in experiments. Using Docker containers, the dependencies can be packaged into a binary image of a container. This approach helps to encapsulate an environment, thus releasing the stress from a scientist who is willing to reproduce an experiment. Moreover, the Dockerfile, which is a file of a specific format allowing to build a Docker container image, can serve as proper documentation. Furthermore, the Docker engine supports image versioning, packaging images into tarballs, and layering images using the latest stable releases of well-known software, e.g. Linux distributions. As a result, the scientist can examine whether an image built using the latest versions is affected by code rot compared to the self-packaged tarball.
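
As an illustration of this workflow, the following Python sketch uses the Docker SDK for Python to build an image from a Dockerfile and to run the packaged analysis. The directory layout, image tag, command, and mount paths are hypothetical and only demonstrate the general pattern.

```python
import docker  # Docker SDK for Python; a local Docker daemon must be running

client = docker.from_env()

# Build an image from a Dockerfile that pins the experiment's dependencies;
# the Dockerfile itself then doubles as executable documentation of the setup.
image, build_logs = client.images.build(path="./experiment", tag="experiment:1.0")

# Run the packaged analysis: the bind mount makes the input data visible inside
# the container, and the container's stdout is returned once it exits.
output = client.containers.run(
    "experiment:1.0",
    command="python run_analysis.py /data/input.csv",
    volumes={"/home/user/data": {"bind": "/data", "mode": "ro"}},
    remove=True,
)
print(output.decode())
```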

Although Docker is one of the most widely used enterprise container engines, it was not designed with scientific applications in mind, e.g. running containers in a high-performance computing environment. One of its disadvantages is the requirement to run the Docker daemon with root privileges, which introduces additional security risks [KSB17]. To tackle Docker’s limitations in the context of scientific computing, a number of scientific container engines were introduced including Singularity [KSB17] and Charliecloud [PR16].

2.5 Container-based Approaches in e-Science

Nowadays, container-based virtualization is a common way to encapsulate scientific software. Not only single pieces of software can be encapsulated using container engines like Docker [Doc17b], but also whole software chains representing a particular computer-based experiment. This allows a scientist to focus on the computer-based experiment itself rather than on setting up the environment and configuring the respective software. This approach might be particularly useful in our context. Therefore, in this section we list some related solutions using containers as a means to encapsulate and run scientific software.

Hosny et al. [Hos+16; Alg17] present AlgoRun, a Docker-based container template which can be used for the packaging and provisioning of command-line algorithms via a REST interface. The main goal is to simplify the reuse of scientific algorithms by packaging their source code with the corresponding dependencies as Docker containers built on top of a specialized AlgoRun template container. A scientist who is willing to publish an algorithm needs to provide a description of the algorithm conforming to a specification format, provide its source code, and create a Dockerfile based on the AlgoRun Docker image which serves as a wrapper allowing to interact with the algorithm via a REST API. The packaged algorithm consists of three parts:

Description part is a file in JSON format stored in the algorun_info folder. It contains the information about the authors and publishers as well as the details about input, output, and how the algorithm can be invoked via a command line.

Algorithm’s code part is the complete source code of the algorithm, which is stored in the src folder.

Dockerfile is a Docker-specific file allowing to build a container image. This file has to be created by the publisher of the algorithm, and the image has to be built on top of the AlgoRun container image.

Moreews et al. [MSM+15] introduce BioShaDock, a registry for bioinformatics applications wrapped into Docker containers. This software provides means for authorized users to describe, register, and build public Docker images. In order to register a Docker container, a user needs to provide a Dockerfile and additional metadata describing the scientific application. After the container is registered, the image building process as well as additional integration steps are performed automatically on a specified server. After the image is available in BioShaDock it can be used by a scientist, e.g. on a local machine using Docker or on a cluster which integrates a Docker scheduler front-end.

Kim et al. [Kim+17] propose a technique for the execution of multi-step bioinformatics pipelines as pre-configured Docker containers called Bio-Docklets. The idea is to encapsulate the complex behavior into a Docker container and expose single input and output endpoints to the user which allow executing the pipeline. From the user’s point of view the invocation is identical to running one particular application. The instantiation and the control of the execution are handled by special Python scripts.

Belmann et al. [Bel+15] present an approach to expose bioinformatics software using Docker containers with a standardized interface called bioboxes, describing which input files and parameters are required and what the output result is. The focus is put on the need to standardize the interface for containers in order to achieve interchangeability in bioinformatics pipelines.

To summarize, the idea to use container technology for packaging an application looks very appealing. The container encapsulates all required dependencies, and a uniform interface might be created in case a proper specification is provided by the publisher of the software. None of the listed solutions suits our needs due to various reasons: tight coupling with a particular container technology, the need to manually create technology-specific descriptions, e.g. Dockerfiles, the lack of automation and extensibility of the underlying conceptual models, or an overall complicated reuse process. As we considered related work from the field of bioinformatics, most of the approaches aim at the targeted reuse of particular bioinformatics algorithms or pipelines. In contrast, we are interested in an extensible and technology-agnostic way of reusing data transformation applications requiring less effort on the consumer’s side.


3 Concepts and Design

In this chapter we present our concepts and design the software framework which supports handling data transformation applications. The chapter is divided into eleven sections and starts with the assumptions in Section 3.1 which serve as a basis for the concepts. Section 3.2 analyzes various interaction scenarios among the actors involved in the process of handling the data transformation applications. Section 3.3 focuses on the description and packaging of the data transformation applications. Section 3.4 presents the abstract design of the framework for handling data transformation applications. The remaining sections focus on specific parts of the framework, developing them and presenting refined versions. Section 3.9 introduces the refined framework's architecture. Section 3.10 presents more details on the interaction with the framework. Finally, a description of how the presented concepts can be integrated with the TraDE Middleware is given in Section 3.11.

3.1 Data Transformation Logic and TraDE Middleware

The specifics of a data transformation application might vary depending on the usage scenario. For instance, source code modifications or user interactions might be required by certain applications. The diversity of usage scenarios makes it difficult to derive a common abstract model for data transformation applications suitable for any possible case. Considering the high-level view on data transformation in the context of the TraDE Middleware, we can describe the target application as being similar to a first-order function which takes a vector of input values and produces a vector of output values. However, this simplistic view is not really suitable for cases where, e.g., user interaction is needed. In general, multiple details affect the way we can abstract data transformation applications. The concepts presented in this work are based on several assumptions which allow a clearer definition of the data transformation applications.

Black-box approach Treat heterogeneous data transformation applications as atomic reusable entities. This assumption is based on several ideas. Firstly, an application can be provided as a binary executable, thus making white-box analysis complex and unreliable. Secondly, the provided application must remain immutable throughout its lifecycle in order to guarantee the behavior desired by its provider. This also implies that provided applications require no modifications and are ready to use.

No user interaction required As was discussed in the related work in Section 2.3, the wrapping of user interaction is a complex task which can be tackled by maintaining the states of interactions, e.g., using a finite state automaton. The common way of performing data transformation in e-Science relies on files and often does not require a user's participation apart from the initial invocation routine. While handling an interactive data transformation application is an interesting and challenging task, in this thesis we focus on automatically performed data transformations.

Application is runnable We assume that a provider knows the requirements of the data transformation application including its dependencies, configurations and the overall execution routine. This implies that apart from what was specified by a provider no further information is needed to successfully run an application.

Applications exchange files There are various ways in which applications can accept and produce data. Files are frequently used in e-Science as one of the most portable ways to transfer data among diverse scientists. While options such as using input/output data streams or printing the data onto the screen are available, files can serve as a good starting point for describing the data transformation in the context of scientific workflows. In this work we mostly focus on applications accepting files as input and producing files as their output. Nevertheless, the concepts presented throughout this work can be extended in order to suit other data exchange cases.

3.2 Interaction Models and Main Actors

The process of reusing the data transformation logic relies on several distinct roles. As a starting point, we can distinguish three main actors analogous to the classic SOA roles: an application provider, an application consumer, and a registry. Basically, an application provider makes the data transformation logic available for reuse. An application consumer is interested in reusing the published data transformation logic. This interaction is made possible via the application registry which serves as a hub for providers and consumers. It is worth mentioning that the application publisher and the actual provider are not necessarily the same roles. At first, we consider a scenario where the publisher and provider of a service are the same role and the data transformation applications are only implemented as web services. The simplified SOA-like interaction model is shown in Figure 3.1. One of the biggest advantages of SOA which we can benefit from in this case is implementation transparency. The invocation of a desired service happens at the provider's side using only the binding information obtained from the registry. In such an interaction scenario, the registry only offers means to find a service and provides its corresponding binding information. The technical details about the actual implementation remain unknown to the consumer. However, the type of applications intended for reuse is not limited to web services. This means that the invocation of a published application is not achievable in a standardized manner as in SOA, e.g., using a WSDL interface. Hence, the handling of non-standardized applications puts additional stress on every distinct role in the interaction model. Although the application is still invoked on the side of a publisher, the information about the custom interfaces has to be present at the registry. Additionally, access information like Secure Shell (SSH) credentials might be needed as well. The lack of standardized access requires an application consumer to individually handle each desired application. If the reuse costs are high, the usage of such an architecture is not beneficial. Moreover, in many cases a standard invocation is not even possible without additional wrapping, e.g. an application's response might just be saved as a file on the publisher's side. An example of such an interaction model is shown in Figure 3.2.

Figure 3.1: Simplified SOA-like interaction model

Figure 3.2: More refined SOA-like interaction model

In fact, these problems are among the reasons why the SOA concept was introduced after all. Unfortunately, handling different application types in a SOA fashion is not directly applicable in our case unless all applications are wrapped as web services prior to publishing. Such a scenario is desirable yet not realistic. Wrapping an application as a web service is not a trivial task and requires special skills. However, the scientists who intend to publish their data transformation applications for reuse are not necessarily experts in the field of service technology. A requirement to wrap the target application as a web service before publishing is a limitation which should be avoided.

The publisher and provider roles were combined in the previous examples. The separation of these roles allows discussing the ways applications can be published and how this affects the provisioning. An example of the interaction model with distinct publisher and provider roles is shown in Figure 3.3. As with the previous models, the consumer role is interested in reusing the available applications matched by some search criteria. Multiple publishers are responsible for making the applications available via the registry. An intermediary zone referred to as a hub zone incorporates the registry and multiple providers. The hub zone is meant to separate the publishing from the consuming roles and highlight that provisioning happens independently. This modification is not really different from the previously discussed models, because the consumer faces the same difficulties working with the various custom interfaces as before.

Figure 3.3: Interaction model with separate publisher and provider

Interestingly, the provider's responsibilities depend on what information is offered by the registry, i.e. how the applications are published. On the conceptual level, publishing a web service implies that it is available at some location via a standardized interface. On the other hand, publishing a regular application does not necessarily mean that it has to be available somewhere. In fact, the way an application is published defines how it can be provided later. For instance, if the application is published as a standalone self-contained package, then a completely independent actor who has the means to invoke such packages might act as a provider. Conversely, if the interface and invocation details are the only information offered by the registry, then only the actual provider is capable of invocation. These different scenarios show that a published application can be classified either as remotely-available or self-provisioned. The former case is about a consumer communicating directly with an actual provider. The latter case allows the consumer or some intermediary role to become a provider, assuming that the capabilities to instantiate and invoke applications are supported. A combination of both is also possible; however, this introduces additional complexity, e.g. which provisioning type is preferable. The interaction model shown in Figure 3.3 does not reflect how a consumer can actually interact with different types of providers. In fact, this interaction is as inconvenient as in the previously discussed models. A consumer needs to handle each provider's technical details independently, which is problematic, e.g., in automatic reuse scenarios.

One way to standardize the interaction between a consumer and various data transformation applications of choice is to introduce some proxy role which is aware of an application's provisioning details. On the conceptual level, this role separates the outside transformation interface from the inner application-specific interfaces. Considering the black-box view on the data transformation, the outside interface can be generalized in order to uniformly support application-specific interfaces. In other words, the purpose of the proxy provider role is to wrap interfaces of data transformation applications. However, instead of forcing publishers to wrap their applications, the actual wrapping can happen afterwards by means of a proxy provider role which is aware of the application's provisioning details. Figure 3.4 illustrates an interaction model with a proxy provider role.

Figure 3.4: Interaction model with a proxy provider role

There are several ways in which a proxy provider role can fit into the interaction model. One obvious option is a standalone role which is responsible for the mediation between the uniform transformation requests and the actual application interfaces. From the implementation point of view, this role can either be a completely separate piece of software or be a part of a consumer. In the former case, a consumer must be able to invoke a standardized transformation interface at some location (potentially obtained from the registry), and the proxy provider role must be able to instantiate an application or handle the web service invocation based on the provisioning details obtained from the registry. As a result, less effort is needed from the consumer's point of view, but a separate proxy provider role has to be implemented. In the case when a proxy provider is a part of the consumer, a set of proxy providers related to the corresponding provisioning types must be implemented. Additionally, the consumer must be able to launch packaged applications or support other provisioning types. Such a scenario puts a lot of burden on the consumer and is not very practical. From the integration point of view, this is similar to the problem of multiple distinct wrappers versus an enterprise service bus. An increase in implementation complexity diminishes the benefits of reuse for multiple independent consumers. Sending a standardized request to some location and getting the result of the data transformation back looks much more appealing.

Another option is to unify the proxy provider role with the actual provider at publishing time based on its provisioning type, i.e. for every provisioning type a standardized wrapper type is used. As in the previous example, this requires having an implemented set of generic proxy providers conforming to a uniform transformation interface on the outside and tailored to a specific inner interface, e.g. handling a remote web service invocation or invoking a command line application. However, these generic proxy providers do not need to be implemented by a consumer because the unification happens at publishing time. This puts additional burden on the registry role. On one hand, the resulting interaction model looks similar to the case when a proxy provider is a standalone role. On the other hand, the consumers themselves must be able to launch the resulting wrapped application packages and interact with them. The registry in this case serves more like a static file repository for downloading the wrapped data transformation applications and running them on the consumer's side. Such a scenario has several disadvantages. Firstly, it introduces implementation overhead for the consumer. Secondly, the inclusion of an identical proxy provider inside every application package based on its type introduces redundancy. Moreover, the consumer might be willing to use a slightly different implementation of a proxy provider. The scenario with a standalone proxy provider role allows postponing this decision to the time when a transformation is requested.

In the context of this work, we are interested in providing a standard way to invoke any published data transformation application for any interested consumer. The interaction model with a standalone proxy provider shown in Figure 3.4 reduces the implementation overhead for consumers, thus making it more attractive for reuse. In the following sections we discuss the design and implementation details of this interaction model.
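
To make the intended interaction more tangible, the following minimal sketch shows how a consumer could send a uniform transformation request to a standalone proxy provider without knowing any provisioning details. The endpoint path, request fields, and the QName value are illustrative assumptions and do not prescribe the framework's actual interface.

import requests

# A uniform transformation request: the consumer only needs the proxy
# provider's endpoint, the abstract transformation name (QName), and the
# inputs; all provisioning details stay hidden behind the proxy provider.
def request_transformation(proxy_url, qname, files, parameters):
    response = requests.post(
        proxy_url + "/transformations",
        data={"qname": qname, **parameters},
        files={alias: open(path, "rb") for alias, path in files.items()},
    )
    response.raise_for_status()
    return response.content  # e.g. the produced output file

# Example: request a hypothetical CSV-to-JSON transformation.
result = request_transformation(
    "http://proxy.example.org", "{http://example.org/trafos}csv-to-json",
    files={"$input1": "measurements.csv"}, parameters={"$input2": "2"},
)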


3.3 Description and Packaging

3.3.1 The conceptual meta-model of a data transformation application

In order to reuse data transformation applications in a standardized way, uniform description and packaging mechanisms are needed. The previously described LCDL framework [Juh+09] and the AlgoRun container template [Hos+16] are used as an inspiration for how data transformation applications can be conceptualized for wrapping purposes. The resulting conceptual meta-model shown in Figure 3.5 illustrates the main entities of a data transformation application. This model is made extensible and can be modified to support additional types of entities depending on the use case.

Figure 3.5: The conceptual meta-model of a packaged data transformation application

The main entity in the model is called ApplicationPackage and it consists of multiple properties describing various aspects of the data transformation application: AppSpecification, TransformationSpecs, Dependencies, ConfigurationSpecs, InvocationSpecs, and TestRunSpecs. The AppSpecification property is of type Description and contains the general information about the application. The TransformationSpecs property is a set of one or more data transformation specifications described by the Transformation entity. The Dependencies property is a set of zero or more application dependencies described by the Dependency entity. The ConfigurationSpecs is a set of zero or more application-specific configuration commands described by the Configuration entity. The InvocationSpecs is a set of one or more specifications of how an application can be invoked, described by the Invocation entity. Finally, the TestRunSpecs is a set of zero or more specifications of how a particular data transformation supported by the application can be verified for correctness, described by the TestRun entity.
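
As an illustration of how these entities could later be materialized in a specification file, the following sketch renders a minimal application package specification as a Python dictionary. The concrete values and the serialization format are illustrative assumptions; the meta-model itself does not prescribe a particular format.

app_package = {
    "AppSpecification": {
        "AppName": "csv2json",
        "AppVersion": "1.0",
        "AppPublisher": "Example Lab",
        "AppDescription": "Converts two-column CSV files into JSON documents.",
        "Developers": ["Jane Doe"],
        "License": "MIT",
        "Tags": ["csv", "json", "example-project"],
    },
    "TransformationSpecs": [],  # one or more Transformation entities (examples follow below)
    "Dependencies": [],         # zero or more Dependency entities
    "ConfigurationSpecs": [],   # zero or more Configuration entities
    "InvocationSpecs": [],      # one or more Invocation entities
    "TestRunSpecs": [],         # zero or more TestRun entities
}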

Multiple applications can perform the same kinds of transformations or can even be semantically equivalent. A set of general properties related to the application's description is one of the parts required for the search and publish operations. The Description entity consists of the following properties containing textual information related to an application: name, version, publisher, description, developers, license, and the list of tags. While most of the property names are self-explanatory, some observations are relevant for further concepts, e.g. search and publishing, which will be described in later sections. A specification of the application's name is insufficient as a unique identifier because applications can have the same names. It is rather an informal way to address an application. As a consequence, some combination of the properties should be considered in order to uniquely identify the application. The actual format of the version is not enforced in the model and a simple string of any format might suffice. It can serve as an additional mechanism for versioning of the same applications (the ones which differ only in the version property). However, if used as a part of the unique application key, the version property will separate the different versions of the application as unrelated applications. The application's publisher is another string property which adds more details but cannot be used on its own to characterize applications. For instance, the same application can be published by multiple different publishers in case it was previously shared using other means. Depending on how this property is filled in, it can be either implicitly linked with a publisher's credentials or explicitly provided by the publisher. The next property in the list is the application's description. It is intended solely for a human-readable description of an application which can be used for GUI-based work with the registry. For instance, having a proper textual description might simplify the manual search for an application or manually performed administrative tasks. The application's developer and licensing information is reflected in the corresponding properties. Multiple developers can be specified and, in comparison with the publisher information, this information might look more trustworthy as a part of the application's identifier. However, as multiple developers can be specified, their order in the specification plays an important role for generating an identifier. Additionally, there is no guarantee that developer information will not be partially omitted if the same application is published several times by different publishers. The last property providing general information about the application is called Tags. More precisely, this is a set of strings which can be used for application search. For instance, some data transformation applications might be tailored for very specific tasks related to certain projects and general purpose applications simply do not fit in these scenarios. A publisher can provide a set of exact tags, e.g. a project name or a custom transformation format, in order to simplify the search process. However, the consumer will be expected to know the exact tags in order to find this application.

In theory, a single application can be responsible for multiple kinds of data transformations, e.g. a conversion to several image formats like JPEG and PNG based on the specified output file extension. Thus, a single application package might have multiple transformations specified. More precisely, at least one transformation has to be present, otherwise the package specification makes no sense. Perceiving the application as a black-box, we can describe its transformations using its inputs and outputs. Additionally, a transformation has to have a unique name. An application can consume one or more inputs and produce one or more outputs. The Transformation entity describes supported data transformations and is shown in Figure 3.6.

Figure 3.6: The constituents of a transformation specification

The property Name contains a string which is unique only in the scope of the current application and which can be used to address the distinct transformations. Another possible way to identify a certain type of transformation is to introduce an abstract functional description similar to a “PortType” in WSDL. The property QName is a unique string allowing a particular application's transformation to be linked to an abstract description used in choreography modeling with the TraDE concepts applied. The publisher of the application and the choreography modeler have to agree on the list of such abstract descriptions in advance. As a result, searching for an application can be simplified to looking directly for an identical QName. A generic Input entity can be described by three properties: its name, an alias, and whether it is optional or not. The name of the input is a string property which is necessary for improving the readability of the specification, especially in GUI-based interaction scenarios. Additionally, it can be used for renaming the input using a publisher-defined value in case an application supports only predefined input names. An alias is a unique string identifier of the input which can be used for ordering multiple input parameters in the invocation command. Each alias has to be unique only within the current application's transformation specification, meaning that the same aliases can be used for different transformations. Another input property is a Boolean value specifying whether the input parameter is optional. For instance, some transformations might allow specifying optional parameters like the output resolution in case of an image conversion. Thus, specifying that a parameter is optional signals that the transformation can proceed even if this parameter is not specified. For simplicity, only optional inputs can be supplied with this property, and if it is omitted then the input is considered to be mandatory. An excerpt from the model shown in Figure 3.6 illustrates three types of input: an InputParameter, an InputFile, and an InputFileSet. The InputParameter describes a value of a simple type, e.g. a numeric or string value, needed for the transformation to take place, for example the specification of some numeric coefficient. This input type is described by an optional value property. If the value is present, the input parameter is basically set by default at specification time and thus can be used directly during the invocation. However, if the value is not set then the parameter has to be provided in the transformation request. Additionally, parameters can be of different types, hence the type has to be specified in order to simplify the processing of the values. One of three options can be chosen: a string, an integer, or a float parameter.

As was described previously, the usage of files as the unit of interchange is quite common in scientific workflows, although other methods are possible. Assuming the interaction model with the standalone proxy provider role, all input or output data can be collected as intermediary files for further processing, which allows us to focus mainly on files. The InputFile is described using three properties: InputFormat, InputSchema, and RequiredPath. In the simplest cases the data transformation takes one input file, e.g. a plain text file, and produces one output file, e.g. a PDF file. In more complex cases, a data transformation might combine multiple input files of different formats into one or many possibly different output files. In both cases, an input format defines the file's specifics and serves as a validation marker describing which input files are accepted. The InputFormat property is a string which describes the input file's format. It can contain a regular file extension like PDF or PNG. Another option is to store a Multipurpose Internet Mail Extensions (MIME) type, e.g. “text/plain”. The latter option describes custom formats in a more understandable fashion as it can contain a hint on the type of data. This can be helpful in case several custom formats have the same extension but describe different types of data. Another property which characterizes the input is its schema. Two files of the same format may not necessarily be supported by the same data transformation application. For instance, we can consider two- versus three-column Comma-Separated Values (CSV) text files and a data transformation application which can only work with a two-column CSV. In such a case, the application can simply fail due to the wrong file format. The file's schema is usually stored in a separate file of a specific format depending on the input file's format, e.g. XML Schema for XML, or JSON Schema for JSON. The InputSchema property in the model is a string defining the path where the file's schema is stored. The RequiredPath property is optional and describes the path where an input file has to be stored in case an application requires a static path value. For example, an application might look for a file located in the same directory where the application's files are stored. Likewise, the output files are described by their format, schema, and path properties. The path describes the location where the resulting output file is going to be saved.

In some cases, multiple input files of the same type are expected by the application. In Section 1.2.6 we discussed an example of a simulation involving the transformation of multiple snapshots into a video. Technically, all snapshots are of the same type and are used by the application implicitly, meaning that the invocation command does not list the files which are going to be used for the transformation. Instead, an input parameter defines how many files the application needs to read from its root folder. Such a dynamic nature of the input requires special modeling. InputFileSet is a type of input describing a set of files having the same type. It has common properties with the InputFile, but in this case we assume that such a set is used only implicitly, thus the RequiredPath property is mandatory. The set's size can be described by the FileSetSize property which can be either a number or a reference to an alias. In the former case the application's input size is static and a numeric value is used to describe the set's size. The latter is related to cases where the number of input files is defined using an input parameter, as with the video transformation example. In such a case an input parameter's alias could be used to identify the size of the InputFileSet.
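
To illustrate how these input and output types play together, the following sketch expresses the snapshots-to-video example as a Transformation entity in the same dictionary notation as before. All concrete names, formats, and paths are illustrative assumptions.

frames_to_video = {
    "Name": "frames-to-video",
    "QName": "{http://example.org/transformations}frames-to-video",
    "Inputs": [
        {   # InputParameter: how many snapshot files have to be read
            "InputName": "frameCount", "Alias": "$frames", "Type": "Int",
        },
        {   # InputFileSet: the snapshots themselves, consumed implicitly
            "InputName": "snapshots", "Alias": "$snapshotSet",
            "InputFormat": "image/png", "RequiredPath": "./frames/",
            "FileSetSize": "$frames",  # set size given by the parameter above
        },
    ],
    "Outputs": [
        {   # FileOutput: the produced video file
            "OutputName": "video", "Alias": "$video",
            "OutputFormat": "video/mp4", "AccessPath": "./out/result.mp4",
        },
    ],
}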

Commonly, applications can be executed only if their dependencies are available in the environment. A self-contained application package must have all the necessary dependencies specified. This problem is referred to as dependency hell by Boettiger [Boe15]. In some specific cases, software might require hardware dependencies to be fulfilled, e.g. a certain amount of processing power or memory has to be available. In this work we focus on software-related dependencies which can be of multiple types. The model shown in Figure 3.5 can be extended by adding other types of dependencies required for a specific use case. The excerpt from the model shown in Figure 3.7 demonstrates the Dependency entity containing software-related dependencies required for the invocation of an application.

Figure 3.7: The constituents of a dependency specification

A generic dependency entity is described by three properties, namely DepName, Alias, and Description. The DepName is a string identifier containing the dependency's name which can be used for readability and identification purposes. The property called Alias is a unique identifier which can be used in installation or configuration commands related to this dependency. For example, a dependency might need to be configured for the proper invocation of the application. An alias only needs to be unique within the boundaries of the current application's specification. A textual description provides additional information about the dependency which can be used for a GUI-based interaction with the registry. We distinguish three types of dependencies which extend the generic dependency entity: EnvironmentDependency, SoftwareDependency, and FileDependency. In some cases, an application might depend on a specific environment variable to be set in the operating system. One example is the presence of Java's path variable in case the invocation relies on the java alias in the command line interface. As environment variables are key-value pairs, this type of dependency has to have a name and a value. For this purpose, the EnvValue property contains the value and the DepName property inherited from the Dependency entity contains the name of the environment variable. The next dependency type in the conceptual model is called SoftwareDependency. Basically, it describes any kind of third-party software which is required by the data transformation application in order to run. A software dependency is described by a version, a path, and a set of commands required for its installation. An application might only work with a particular version of some software, thus a version number should be provided. This is also useful for providing information about the dependencies in GUI-based interaction with packaged applications. A publisher might decide to provide the dependency's files for installation. The path property describes where the corresponding dependency's files are located. Furthermore, the installation commands have to be documented as a set of strings, which makes the dependency specification complete. However, the software installation process is influenced by various aspects, e.g. the operating system, package managers, and network accessibility. Thus, it is difficult to provide installation commands which will suffice for every single provisioning method, for instance commands which work simultaneously on Windows and Linux distributions, or even on Linux distributions with different package managers. As a consequence, this requires some assumptions on the implementation level regarding the supported format of the commands. Another dependency type is FileDependency. In some cases, software might depend on external files which need to be modeled explicitly rather than as a part of the application's files. For instance, some patched file might be modeled explicitly to highlight the fact that an application is updated. Other examples are documentation or localization files. Additionally, a software dependency might itself depend on a file, e.g. an installation script. Thus, a software dependency's command property can be specified using the alias of an installation script modeled as a file dependency.
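
The following sketch shows how the three dependency types could be expressed in the same dictionary notation, including a software dependency whose installation command refers to a file dependency via its alias. The concrete names, versions, and paths are illustrative assumptions.

dependencies = [
    {   # EnvironmentDependency: a variable the application expects to be set
        "DepName": "JAVA_HOME", "Alias": "$javaHome",
        "Description": "Location of the Java installation",
        "EnvValue": "/usr/lib/jvm/java-8-openjdk",
    },
    {   # FileDependency: an installation script shipped with the package
        "DepName": "install-encoder.sh", "Alias": "$encoderScript",
        "Description": "Installs the video encoder",
        "FilePath": "Dep/FileDeps/install-encoder.sh",
    },
    {   # SoftwareDependency: installed by running the file dependency above
        "DepName": "video-encoder", "Alias": "$encoder",
        "Description": "Encoder used by the frames-to-video transformation",
        "DepVersion": "3.4", "Commands": ["sh $encoderScript"],
    },
]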

Apart from its dependencies, an application might require additional configuration prior to invocation. For instance, file dependencies have to be copied to another location or installed software dependencies need to be configured. The Configuration entity shown in Figure 3.8 contains the specifications of additional configurations which are needed by the application prior to invocation. A configuration is defined by two properties: a name and a set of required commands. As was discussed previously, the command format has to be specified more precisely on the implementation level.


Figure 3.8: The configuration specification

As a next step in the application packaging specification, the ways in which the application can be invoked have to be defined. The Invocation entity shown in Figure 3.9 contains the available invocation methods for the packaged application.

Figure 3.9: The invocation specification

The provisioning types discussed in Section 3.2 determine how an application can be invoked. Commonly, applications support CLI-based invocation. Considering only data transformation applications and assuming that no user interaction is needed, we assume that a data transformation application provisioned alongside its files, dependencies, and other related specifications can be invoked using the command line interface. In case of a web service invocation, a command line utility such as curl [cur17] can still be used. The results of invocations have to be obtained differently depending on the application's or web service's response specification. A packaged application might support multiple invocation methods, e.g. a web service offering both REST and WSDL interfaces. Thus, one or more invocation specifications can be provided by the publisher. The generic invocation entity has a name and a description as its properties. Both properties serve the readability of the specification and can be used for GUI-based interaction with the registry. A command line invocation is an extension of the generic invocation entity which is described by an invocation command. A command which invokes an application includes the input parameters, usually ordered in a predefined way. Input specifications in the model are provided with the alias property which can be used for ordering the inputs in the invocation command specification. For example, an application has three input parameters with aliases $input1, $input2, and $input3. Then, the invocation command can be specified as “java application $input2 $input1 $input3”. On the conceptual level, the model does not enforce checks of how many input parameters are specified and whether this number is equivalent to the number of aliases used in the input mapping string. However, these checks can be introduced in the implementation, specifically at the time when an application is going to be published. This raises the questions of how we can verify the “correctness” of a packaged application and what a correctly packaged data transformation application is.
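
Before turning to the question of correctness, the alias-based ordering described above can be illustrated with a small sketch that builds a concrete command line from an invocation template and the values supplied in a transformation request; the helper function is purely illustrative and not part of the prototype.

# Substitute input aliases in a CommandLineInvocation template with the
# concrete values of a transformation request.
def build_command(template, inputs):
    command = template
    # Replace longer aliases first so that "$input10" is not clobbered by "$input1".
    for alias in sorted(inputs, key=len, reverse=True):
        command = command.replace(alias, str(inputs[alias]))
    return command

# Example with three input parameters ordered as required by the application.
cmd = build_command(
    "java application $input2 $input1 $input3",
    {"$input1": "in.csv", "$input2": "out.json", "$input3": 2},
)
# cmd is now "java application out.json in.csv 2"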

There are several aspects which can be checked in order to verify the correctness of a data transformation application. First of all, it is useful to make sure that the provided transformation can be invoked and behaves as intended by its publisher. With a black-box view on the application, checks related to the source code are not an option. The conceptual model provides means to make a published application verifiable. The TestRun entity shown in Figure 3.10 describes the specification of a sample application run.

Figure 3.10: The test run specification

The main idea is to provide a set of predefined test inputs and corresponding canonical outputs related to a particular transformation supported by the application. Essentially, such predefined sets represent instances of successful invocations. In other words, running the transformation using a test input should produce the specified output. Before the application is published, it can be invoked using the provided test inputs. The resulting outputs are compared with the test outputs of the corresponding TestRun specification. If the results are identical, the application can be considered correct. Each TestRun specification is described by a unique name, the related transformation's name, sets of sample inputs and resulting outputs, and a description of the TestRun which can be used for providing additional information to the consumer. Specifications of both inputs and outputs have to include the aliases used in the actual transformation's specification. This allows correlating sample inputs in order to properly invoke the application and to correctly identify the outputs. In fact, having input and output examples helps users to better understand the data transformation application by analyzing them in conjunction with the other available information. However, this is a very simplistic view on application correctness, e.g. the presence of bugs in the code is not considered due to the black-box view on the application. Moreover, from the security point of view, blind trust in a user-submitted application is a serious risk as malicious code can be published.
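
Assuming the transformation has already been executed with the sample inputs, the comparison step itself can be kept simple. The following sketch performs a byte-wise comparison of the produced files against the canonical outputs of a TestRun; the dictionary layout mirrors the earlier illustrative examples and is an assumption.

import filecmp

def verify_test_run(test_run, produced_outputs):
    """Compare produced output files against the canonical outputs of a TestRun."""
    for expected in test_run["ResultingOutput"]:
        produced = produced_outputs.get(expected["Alias"])
        if produced is None:
            return False  # an expected output was not produced at all
        # shallow=False forces a byte-wise comparison of the file contents
        if not filecmp.cmp(produced, expected["SampleOutputPath"], shallow=False):
            return False
    return True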

3.3.2 Packaging the data transformation applications

The conceptual model discussed above allows creating a standalone package of a data transformation application. However, the idea of publishing the application files and dependencies in a standardized fashion implies having a precise description of how files can be accessed. Figure 3.11 demonstrates an example of a standardized application package structure.

Figure 3.11: An example structure of an application package (folders App, Dep with SoftDeps and FileDeps, IO_Schemas, TestRun, and the Application Package Specification file)

For instance, if we assume that the application's dependencies are always stored in the Dep folder, then the path for every dependency only needs to include the actual dependency-related information and the parent folders can be ignored. Similarly, the input and output schemas, application files and other related information can be accessed using the predefined package structure. The presented conceptual model has to be materialized as an Application Package Specification file of some format and stored together with the other artifacts. Conversely, if the package's structure is not standardized, this introduces additional clutter in the registry.

Especially with respect to the several path properties discussed before, it makes sense to focus more on potential ways of accessing the files. One important point related to publishing the files is whether it is acceptable to use links to remote files or not. There are two stages at which this question is meaningful: before and after the application is published. In the former case it is a suitable scenario as it gives more freedom to publishers. A publisher might simply use the link to a public file sharing service instead of downloading the actual files and placing them in the package's folders. However, when the application is already published it is not safe to use links, as remote files might change, e.g. when a new version is released, or become inaccessible. A published application represents a certain state with a specific version, dependencies, and invocation details. This information is fixed for a package and with new versions the application should either be updated or published as another version. If the links represent remote files, then all changes to these files happen transparently for the consumer and this might lead to an improper and inconsistent data transformation process. Having these considerations in mind, it is preferable if the application package contains the actual files which are listed in the specification. As a result, two different states of the packaged application representation can be distinguished: one that contains both file paths and remote links, and another which only contains file paths. The former reflects the pre-published state and the latter represents the application package after it was published. The “materialization” of the remote links has to be done in between these states. This step can also be considered as a part of testing an application for correctness: in case some of the specified remote files are not available, it makes no sense to publish the application and the process can be stopped. After the remote files are successfully accessed they can be copied and packaged. Then, the representation of the application model needs to be changed in order to reflect the local paths instead of remote links. The corresponding paths in such a case must either explicitly reflect the package structure, or the package structure has to be standardized, e.g. as shown in Figure 3.11.
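
A minimal sketch of this materialization step, under the assumption that remote files are referenced by HTTP(S) URLs and that the package follows the example folder layout, could look as follows; the field names reuse the earlier illustrative dictionaries.

import os
import shutil
import urllib.request

def materialize_remote_links(spec, package_root):
    """Download remote file dependencies and rewrite their paths to local ones."""
    for dep in spec.get("Dependencies", []):
        path = dep.get("FilePath", "")
        if path.startswith(("http://", "https://")):
            local_path = os.path.join("Dep", "FileDeps", os.path.basename(path))
            target = os.path.join(package_root, local_path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # If a remote file is unreachable, this raises and publishing stops.
            with urllib.request.urlopen(path) as response, open(target, "wb") as out:
                shutil.copyfileobj(response, out)
            dep["FilePath"] = local_path  # reflect the new, package-relative path
    return spec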

3.4 Handling Data Transformation Applications: A Generic Framework

When introducing the interaction model with a proxy provider role in Section 3.2, we discussed why shifting many responsibilities to the consumer makes the reuse process very consumer-specific. On the contrary, we are interested in consumer-independent methods of reuse. One consequence of such a view on the role of a consumer is a simplified process of integration with the TraDE Middleware which, in this case, can be considered just as another consumer type. Although the consumer role is the main beneficiary of the application reuse process, the key responsibilities fall on the publishing, storing, and provisioning roles. A publisher is expected to provide a correctly-formed application package. After it is published, no further participation from the publisher is required. The storing and invocation-related functionalities thus have to be handled somewhere in between the publisher and the consumer. Basically, the interaction model with a proxy provider and the registry located in the hub zone described in Section 3.2 covers these requirements. Figure 3.12 illustrates a generic layered architecture for the reuse of the data transformation logic which consists of several parts: (i.) a user interaction layer; (ii.) a security layer; (iii.) a request router; (iv.) a repository; (v.) a task manager; and (vi.) a provisioning layer. The two main components, namely the repository and the task manager, are located on the bottom level of the architecture. The former is responsible for publishing and lookup of the applications. The latter handles the transformation requests depending on how an application was published. In a scenario where an application is published as a self-contained package, a provisioning layer is needed for handling the invocation and returning the results. If the application was published as a reference to a web service, then the task manager is responsible for the mediation of the request and response. The three top layers provide means for a secure user interaction with the repository and the task manager. Starting from the bottom, each part of the architecture will be described in more detail in the next sections, leading to a more detailed architecture derived in the end.

Figure 3.12: A generic layered architecture for data transformation applications reuse

3.5 Repository

3.5.1 Publishing Scenarios

The main purpose of the repository component in the architecture is to allow storing and accessing provisioning-ready data transformation applications. An application is considered to be ready for provisioning when it is successfully published, which means that at least an application package has to be generated and at most it has to be tested and provisioned using a provisioning technology like container virtualization, e.g. Docker. An important point here is that a package provided by the publisher differs from a provisioning-ready package that has to be generated by means of the repository. The publisher is only required to provide an application package conforming to the structure shown in Figure 3.11, which consists of application-related files and an application package specification file. The latter is a technology-agnostic description of the application package based on the conceptual model presented in Section 3.3. Storing an application package without relying on a specific provisioning technology allows reusing the same package for multiple provisioning engines of choice. Although this format neutrality adds the implementation complexity of performing the conversion into a target format supported by the engine of choice, the publisher is not required to provide technology-specific artifacts in advance. This approach simplifies the publishing routine for scientists who are not familiar with the implemented provisioning technology.

The publishing process might include a combination of several steps, such as downloading remote files, generating the Provisioning Specification meta-description file supported by an engine of choice, e.g. a Dockerfile, and testing the package using this generated specification. Several of the listed steps already rely on actual provisioning, which leads to the question of whether the repository needs to support provisioning-related functionalities. To better understand the connection between provisioning and the actual consumption of the data transformation logic, we start from the publishing process. More specifically, there are several scenarios of how an application can be published, summarized in the following list and sketched in the example after it:

Publishing only In this case, a package has to be generated and stored without any further steps. However, in case an application is packaged together with its files and dependencies, the package generation itself requires some modifications to the original publisher's package. Firstly, the remote files have to be downloaded and stored inside the corresponding folders conforming to the directory structure described in Section 3.3 and shown in Figure 3.11. Next, the original application package specification has to be edited in order to reflect the changes in file paths. Note that it is beneficial to store both versions of the application package specification: the original which was provided by the publisher and the modified one. In such a scenario, more information about the package is available, which can potentially be useful for the analysis and optimization of packages. Furthermore, for the package to become provisioning-ready it has to possess a provisioning specification in a format which is used by the corresponding provisioning engine, e.g. a Dockerfile. The generation of this specification must happen as a part of the publishing procedure. The resulting package is portable and can be used by any authorized third party which supports the provisioning engine of choice. Potentially, multiple provisioning engines can be used, implying that several provisioning specifications can be generated. A package then needs to be stored using the storage implementation of choice. Publishing an application as a remote web service requires much less overhead, although test specifications can be provided together with the application package specification for verification purposes.

Publishing with verification The next option is to publish an application and verify its correctness using the provided test run specifications. The publishing part is similar to the previous option, whereas the verification part involves some additional steps. After the provisioning-ready package is generated, the repository has to invoke the application using the provided test specifications. However, this task requires support of the provisioning engine of choice if the invocation must happen on the side of the repository. While a combination of the repository with actual provisioning is possible, it makes the implementation heavier and more monolithic. Additionally, it makes the repository technology-dependent, and in case several provisioning technologies are planned to be used it is unclear if the repository needs to implement all of them or only a specific subset. The task manager component, with the help of the provisioning layer, is responsible for handling the provisioning-related tasks. In fact, the repository can have a dedicated task manager only responsible for verification purposes. In order to run the test specification, a portable application package has to be “deployed” on the provisioning engine of choice. Thus, a request for running the application has to be sent from the repository to the responsible component, i.e. the task manager. After the application is successfully invoked and the results are obtained from the task manager, the comparison must happen. In case the results are identical to the provided output in the test specification, the application is marked as verified. This ensures correctness and might be used in the consumer's decision process in case several compatible data transformation applications are found and one has to be chosen. Note that the actual provisioning already happened at this point, e.g. a Docker image was built and is available for the task manager. Hence, there are two options: to keep the built application image (or, e.g., a virtual machine) or to delete it, as it was meant only for testing purposes. We delete the image by default after the test invocation takes place and the results are returned.

Publishing with provisioning While publishing is similar to the first option, the testing does not take place in this scenario. Instead, an application is provisioned using one (or potentially several) task managers relying on their own provisioning layers. In order to reflect that an application was provisioned, the repository has to maintain this information. After the provisioning took place, information about the corresponding task manager has to be added to the respective package's information. For instance, if a database is used, then a dedicated table or field has to store the related data. This simplifies addressing the corresponding task managers when the search results for a data transformation application are returned to the consumer.

Publishing with verification and provisioning In this scenario, an application is published, verified, and provisioned. Basically, this is a combination of all possible options. However, one additional issue that has to be considered in case an application's verification was not successful is whether the application has to be provisioned or not. Technically speaking, at the moment when the verification fails the application still remains provisioned, as the corresponding image was not deleted after the transformation result was returned back to the repository. As a consequence, if the provisioning has to be undone, another request from the repository has to be made.
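
The following sketch summarizes how a repository could dispatch these four scenarios. All helper functions are stand-ins for the package generator, task manager, and storage layer described in the following sections; they are illustrative assumptions rather than actual prototype code.

# Stand-ins for the package generator, task manager, and storage layer.
def generate_package(app): return dict(app, provisioning_spec="Dockerfile")
def run_tests(package): return True           # compare outputs as in Section 3.3.1
def provision(package): return "task-manager-1"
def store(package, metadata): print("stored:", metadata)

def publish(app, verify=False, provision_now=False):
    """Dispatch the four publishing scenarios: plain, verify, provision, or both."""
    package = generate_package(app)
    metadata = {"verified": None, "task_manager": None}
    if verify:
        metadata["verified"] = run_tests(package)
    if provision_now:
        metadata["task_manager"] = provision(package)
    store(package, metadata)
    return package, metadata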

3.5.2 Generation of a Provisioning-ready Package

After the publisher chooses the most suitable publishing scenario and provides the packaged data transformation application, the generation of the provisioning-ready package must take place. As discussed previously, the “materialization” of the remote links has to be completed first. In order to separate concerns, a sub-component of the repository called package generator is responsible for generating a provisioning-ready application package. The original application package specification must be updated in order to reflect the changes in file paths. This updated version of the specification has to be included inside the package as the main specification required for the creation of a provisioning specification. While various provisioning technologies can be used, e.g. different container engines or virtual machines, it is difficult to support all possible options. Firstly, the conversion of an application package specification has to be done for every supported format. Secondly, all resulting provisioning specifications need to be packaged alongside the other artifacts. Although such difficulties can be overcome from a technical point of view, the overall process becomes more time-consuming. For this reason, we assume that the repository supports some default provisioning technology. This assumption allows focusing on a specific provisioning specification format but does not limit the usage of other chosen technologies. As the updated application package specification is included in the package, the conversion can happen on the corresponding task manager’s side. Another option is to allow publishers to choose which provisioning specifications have to be generated. The contents of a provisioning-ready package are shown in Figure 3.13: the application files, dependencies, I/O schemas, test run specifications, the updated application package specification, and the default provisioning specification.

Figure 3.13: Provisioning-ready application package

On one hand, the original application package specification, i.e. the one with a combination of file paths and remote links, could also be added to the provisioning-ready package. On the other hand, its presence there is redundant as it will not be used for the conversion into a provisioning specification format. It can only be used for provenance reasons, e.g. to analyze remote file locations or to compare whether the versions have changed. Basically, we can store the original application package specification separately without any consequences, e.g. as a separate file next to the provisioning-ready package or in the database together with the information required for searching the data transformation application.
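
As an illustration of what the default provisioning specification could look like, the following sketch renders a Dockerfile from an (updated) application package specification. The field names base_image, dependencies, files, and install_command are illustrative assumptions and not part of the specification model defined in this thesis.

def generate_dockerfile(spec):
    """Render a default provisioning specification (here: a Dockerfile) from an
    application package specification given as a dictionary."""
    lines = [f"FROM {spec['base_image']}"]
    for dependency in spec.get("dependencies", []):
        lines.append(f"RUN {dependency['install_command']}")
    for artifact in spec.get("files", []):
        lines.append(f"COPY {artifact['path']} /app/{artifact['path']}")
    lines.append("WORKDIR /app")
    # The invocation string stays in the included specification; the container
    # only has to make the application and its dependencies available.
    return "\n".join(lines)

# Example usage (hypothetical): write the result next to the other artifacts
# before archiving everything into the provisioning-ready package.
# (package_dir / "Dockerfile").write_text(generate_dockerfile(updated_spec))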

3.5.3 Storage

After the provisioning-ready package is generated, it has to be stored in the repository. This implies that a storage layer is available in the repository component. We focus on the following open issues:

• selection of the storage type to be used;

• which information has to be stored;

• how to avoid publishing duplicates.

Regarding the first question, a database technology is an obvious answer. However, storing the application files in a database might be cumbersome. This applies to the cases when an application is published as a self-contained package with all the related files and dependencies. In case an application is available only as a remote service, the most common scenario would be publishing a single application package specification without any other artifacts. On the other hand, for a remote web service it is still possible to provide test run specifications. The study by Sears et al. [SVIG07] indicates that for storing files larger than 1MB it is preferable to use the file system instead of a database. One of the most influencing factors is the fragmentation of a file system or a database. As file systems are designed specifically to deal with files, they show better performance over time as they get more and more fragmented. Assuming that files are stored on the file system, there is a need to maintain the file path information as well as other specification-related details. This is lightweight textual information which can be stored in a database. Depending on the choice of implementation this can be an RDBMS or a NoSQL system, e.g. a document or a column store. As a result, provisioning-ready application packages are going to be stored using a file system, and the specification-related information will be kept in a database.

In fact, the files belonging to the provisioning-ready application package can be used only when they are together, hence an archive can be created during the generation and placed in an application-specific directory as a way to optimize the usage of storage space. A database needs to store the path to this archive on the file system. However, it is not sufficient to store only the path information. In order to implement the search, the database needs to contain specification-related information. Basically, the whole modified application package specification or parts of it can be copied to the database. For instance, the general information and dependency names can be used to find an application. Additionally, the information about the task managers where a data transformation application was provisioned has to be stored in the database.
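
To make this separation concrete, the sketch below shows how a single entry in a document store could look. All field names are illustrative assumptions following the conceptual model of Section 3.3, and the identifier format anticipates the generation scheme discussed in the following paragraphs.

# A possible document-store entry for one published application; the actual
# conceptual model follows Section 3.3 and is only sketched here.
application_entry = {
    "identifier": "johndoe:imagetransform:1.0",
    "application": {                      # general information used for search
        "name": "Image Transform",
        "publisher": "johndoe",
        "version": "1.0",
        "tags": ["image", "conversion"],
        "transformations": [{"inputs": ["png"], "outputs": ["jpg"]}],
    },
    "provisioning": ["task-manager-1"],   # task managers where the app is deployed
    "verification": {"verified": True},
    "fs_path": "/repository/johndoe/imagetransform/1.0/package.tar.gz",
}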

An application-specific directory has to be unique for every published application. Its name has to be a unique identifier generated at publishing time. In theory, this could be any abstract numeric identifier of sufficient length. However, this approach makes it easier to publish duplicates as such an identifier does not correlate with the application in any way. Taking a set of variables, a Skolem function [EE01] returns a unique value. However, finding a suitable set of variables which can be used for the generation of the unique identifier is not trivial. The main obstacle is the absence of guarantees that some information in the specification will not be omitted by the publisher. Hence, the variables which form the unique identifier have to be made mandatory in the application package specification. In theory, an application might be published by different publishers in case of large and distributed projects. Thus, the publisher information is not a reliable variable even in combination with other variables. However, similar reasoning can be applied with regard to the developer information as well. People might leave and join projects, making the list of developers not constant. Both publishers and developers can be used for analyzing potential duplicates as well as for generating a unique identifier, but they do not eliminate the problem of duplicates. As a starting point, the publisher information will be used in combination with other variables to form an identifier. The reasoning behind this is that the publisher information is always present and easier to check. Multiple applications can have the same name, which makes this information useful only in conjunction with additional variables. Another variable we can add to the identifier is the application’s version. In this case different versions will have unique identifiers, and if the version information is used at the end of an identifier, the packages related to different versions of the same application will be located close to each other in the file system. For example, if user John Doe publishes version 1.0 of the application called Image Transform, then the generated identifier might look like johndoe:imagetransform:1.0. Note that the actual format of the concatenated identifier is implementation-specific and could look different, e.g. it could be a standard directory path johndoe/imagetransform/1.0, meaning that the folders will be nested in this order. Then, if the same user publishes version 2.0 of the same application, the generated identifier will look like johndoe:imagetransform:2.0.
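
A minimal sketch of how such an identifier could be derived at publishing time is shown below; the normalization rules and the separator are illustrative choices rather than fixed parts of the concept.

def normalize(value):
    # Illustrative normalization: lower-case and drop characters that would be
    # problematic in directory names.
    return "".join(ch for ch in value.lower() if ch.isalnum())

def generate_identifier(publisher, name, version, separator=":"):
    """Build an application identifier from mandatory specification fields."""
    return separator.join([normalize(publisher), normalize(name), version])

# generate_identifier("John Doe", "Image Transform", "1.0") -> "johndoe:imagetransform:1.0"
# With separator="/" the same call yields the directory-path variant
# johndoe/imagetransform/1.0.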

Using the (publisher, name, version) combination to generate a unique identifier for an application improves the chances of finding a duplicate during publishing, but it does not guarantee that duplicates will not be present. Moreover, it still gives no guarantee about the uniqueness of an identifier, as the same publisher might accidentally publish two applications having the same name and version but performing different transformations, which results in the same identifier being generated for both of them. One way to make the identifier more precise is to also include information about the developers. The chances that two different applications have the same names, versions, and developers and are published by the same publisher are low. Moreover, as the applications are published by the same person, ambiguous cases in which the same identifier is generated for different applications can be refined at publishing time. In order to concatenate the developers they have to be sorted beforehand, and some string optimization techniques might be applied, e.g. using only the first and the last letter of names and surnames. According to a study by Tagliacozzo et al. [TKR70], the error rate for the first and last letters of names is typically lower than for in-between letters. This can help to tackle misspelled names, although the possibility of mistakes is not eliminated. Additionally, using only parts of the names helps to reduce the overall length of the identifier, which can grow large in case of multiple developers with long names. Following the previous example, if the developers’ names are John Smith and John Doe while the rest of the information stays the same, the generated identifier might look like johndoe:jndejnsh:imagetransform:2.0, assuming the developers are sorted by their surnames and only the first and the last letters of name and surname are used.

The discussed separation between the database and the file system parts for an application can be visualized as shown in Figure 3.14. The figure lists the following details which need to be stored in the database: the information about the application, provisioning, and verification as well as the file system path linking to the provisioning-ready package. The data stored in the database is meant to provide means for searching and invoking the application and can be obtained from the application package specification provided by the publisher. Basically, the conceptual model of the application’s description stored in the database needs to reflect the model described in Section 3.3. However, it can be simplified as many technical details can be omitted or shortened. Additional data can be derived from the specification to provide a consumer with statistics about the application, e.g. by summarizing the total number of dependencies. On the other hand, some parts of the application package specification have to be optimized for the search implementation. One example is the set of transformations supported by the data transformation application.

Figure 3.14: Storing the published data transformation application (file system: one archived provisioning-ready package per application identifier; database: application, provisioning, and verification information, the original application package specification, and the file system path)

3.5.4 Search

Assuming that multiple applications were successfully published, the repository now must provide means to search for those applications which are compatible with the consumer’s request. Depending on the level of detail there are several ways to check an application’s compatibility. Although other input types such as input parameters are possible, the input files play the most important role in defining the transformation. In a simple case where an application only takes one input file and produces one output file it may look as simple as checking the compatibility of an input/output format pair. For instance, if an application transforms data from .txt to .xml and a consumer requests a transformation providing matching information about the input and output formats, then the transformation might be considered compatible. However, on a more detailed level one can argue that if the schema of the provided input does not match the schema supported by the data transformation application, then compatibility cannot be assured. For example, if an application requires a 4-column CSV text file as its input and the user provides only a 2-column CSV, how the application handles the missing details is implementation-specific. The behavior of such a data transformation is non-deterministic as it may succeed, fail, or produce an incorrect output. In case both the consumer and the publisher provide input schemas, a comparison can happen to verify an application’s suitability for the request. However, this is not always achievable as providing an input schema is not obligatory.

With these considerations in mind it makes sense to differentiate between a search without and with schema verification. The former relies only on a comparison of the information provided in the application package specification. This is the simplest way of looking for a suitable application and is applicable if an application does not impose specific requirements on input schemas. However, as described previously, it might lead to incorrect application behavior, and the error handling therefore should be implemented on the task manager’s side. A request for an application without verifying the schema’s compatibility can still contain various details about the potential application. The main prerequisite for suitability is the equivalence of the input and output file formats in the request itself and in the application’s meta-data. When specified, input and output files have a format description field. Note that we still consider the simplest example with one input file and one output file. In such a case, the transformation can be described by a tuple (input file format, output file format), e.g. (txt, pdf). This information is crucial for checking the application’s basic suitability by comparing it with the same tuple derived from the request. Such a comparison is similar to signature matching described by Zaremski et al. [ZW95] and allows limiting the number of applications for further refinement of the query. The signature of the application in this case is a combination of its input and output file formats. Other types of input or output are not yet considered in order to simplify a primary compatibility check. Additionally, the search can be refined using other information from the conceptual model, e.g. tags, using keyword-based retrieval techniques [BK16]. As a result, the response’s precision will be higher, but the returned applications are still not checked for actual compatibility.

In a scenario with multiple input and output files, the tuple (input file format, output file format) needs to be modified. Both the input and the output part of the tuple now become tuples themselves: ((input file format-1, ..., input file format-n), (output file format-1, ..., output file format-n)). One pitfall, however, is the ordering of the inputs and outputs in the tuples. On one hand, if only exact matches are allowed, then potentially suitable applications will be rejected due to a different ordering. On the other hand, if an exact match is not enforced, then the task manager has to deal with the mappings between the inputs/outputs supplied in the request and the actual applications’ inputs/outputs. For example, a signature derived from the request for the transformation of two input files (txt and dat formats) into two output files (txt and pdf formats) might look as follows: ((txt, dat), (txt, pdf)). If there is an application with a specification in a different order, e.g. ((dat, txt), (pdf, txt)), exact matching will result in an empty set. In contrast, if the order is ignored, then one suitable application is returned in response to this transformation request. At this point, if the application is to be invoked, the order of inputs has to be adjusted by the task manager and the order of outputs has to be restored for a correct response. Moreover, if the input/output contains several files of the same type, the mapping between the request and the application requires special handling. For instance, techniques like probabilistic mapping or invocations of the application for every possible order can take place. These invocation strategies might be chosen based on the total number of similar input/output formats or on consumer-provided constraints, e.g. the allowed number of invocations. An assumption that a consumer can deal with the ordering of outputs makes the system less reliable in case of automatic invocation, hence similar strategies have to be applied before a response is sent back to the consumer.
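
The following sketch illustrates such an order-insensitive signature comparison by treating the input and output formats as multisets; it deliberately ignores the mapping problem for duplicate formats discussed above.

from collections import Counter

def signature(input_formats, output_formats):
    """Represent a transformation signature as two multisets of file formats."""
    return (Counter(f.lower() for f in input_formats),
            Counter(f.lower() for f in output_formats))

def is_compatible(requested, offered):
    """Order-insensitive comparison of a requested and an offered signature."""
    return requested[0] == offered[0] and requested[1] == offered[1]

# The request ((txt, dat), (txt, pdf)) matches an application advertising
# ((dat, txt), (pdf, txt)) although the ordering differs.
request = signature(["txt", "dat"], ["txt", "pdf"])
application = signature(["dat", "txt"], ["pdf", "txt"])
assert is_compatible(request, application)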

The search with schema verification is only possible if input schemas are specified for the application. Moreover, the schema comparison is not a trivial task due to potential differences in schema formats. For instance, the schema in a request might be provided in XML, whereas the application’s input schema is a JSON schema. For such cases schemas have to be transformed into a common format for further analysis, e.g. building a parse tree and comparing the nodes. A naive way to compare schemas of the same format is to check for identity, i.e. the schemas have to be equal. This can be helpful for checking simple comma-separated formats. However, such a view on the problem does not take into consideration that the same structures can be characterized by different schemas, e.g. using substitutable data types. In general, schema comparison is time-consuming and has to be explicitly listed in the consumer’s request as an additional check. Moreover, the system has to implement the schema comparison mechanism to support certain types of schema definition languages, e.g. XML Schema or JSON Schema. Therefore, schema comparison is out of scope in the context of this work.

In fact, the comparison of the application’s signature with the signature derived from the request better ensures compatibility if other types of input are added to the signature. Considering the multiple input/output scenario and the ordering issue discussed previously, an equal number of input and output types in the request and in the application’s specification can be used to decide if an application is suitable for the response. As was discussed before, for both input and output files, an equal amount of corresponding formats is the prerequisite for compatibility. If input parameters are added to the signature, applications having an equal number of input and output files, but lacking or having more parameters than the request, can be considered less suitable and ranked accordingly in the response. However, the ordering problem discussed previously arises for other input types as well.

The signature comparison allows forming a preliminary candidate list which can be further refined based on the information available in the consumer’s request. Thus, it is beneficial to support as much information from the application package’s conceptual model in the request as possible. A search using tags is one of the options based on the general information about the application. However, searching through an uncontrolled vocabulary might result in imprecise results. On the other hand, limiting a publisher’s choices for describing an application might lead to a potentially large number of returned results. Not only the general information such as tags, developer, or publisher might help to find the desired application, but also certain dependency information in case the consumer knows exactly which application is needed. The structures of request and response were not discussed yet and will be conceptualized in the next sections.

One significant detail that has to be considered is that an application might support multiple different transformations having independent signatures. This implies that the same application can be returned in response to different consumer requests. For this reason, a signature string can be generated for each supported transformation and used for comparison. Moreover, the resulting signatures have to be associated with an application independently of each other. Basically, each transformation has to be considered as a separate application, but in the conceptual model of the information stored in the database they belong to the same application.

3.5.5 Searching for Composite Data Transformations

Apart from the simple search which directly returns matching applications, it is also possible to search for composite data transformations involving multiple piped applications. Basically, this task is similar to the function realization problem described by Mili et al. [MMK+94]. In our context every data transformation application is treated as a black box, meaning that the composition can happen based on the analysis of inputs and outputs. As a starting point, we consider a trivial composition technique shown in Figure 3.15. It is similar to one of the examples from Mili et al. [MMK+94] discussing how software components can be joined in order to get a desired output. In our case the data transformation applications can be “chained” in such a way that the desired output is produced. The overall idea is to check the compatibility of the requested input, intermediate outputs and inputs in order to derive the requested output. As a first step, the requested transformation’s input is checked for compatibility with the available applications’ inputs, then the resulting outputs need to be checked against the remaining applications’ inputs, and so on. The complete algorithm for finding trivial software component compositions is described by Mili et al. [MMK+94]. We informally discuss how it can be applied for chaining the data transformation applications. Firstly, a directed graph has to be constructed in order to reflect the connectivity information. For this purpose, for every available application we need to compare the compatibility of its inputs with the outputs of the other available applications and vice versa. If an output is compatible with an input then an outgoing edge is created, whereas an incoming edge describes the opposite case. Considering that one application might support several data transformations, the graph construction might take longer. Then a breadth-first search can be applied in order to find the shortest path from an application compatible with the requested input to an application producing the requested output. The overall procedure is costly performance-wise, and a couple of conditions can be checked prior to searching for a composition. If there is no application producing the requested output, the search for a trivial composition will always return an empty set. Additionally, if there is no application consuming the requested input, the search for a composition cannot be started either.

Figure 3.15: A trivial composition example based on [MMK+94] (applications A, B, and C chained so that Input(Request) = Input(A), Output(A) = Input(B), Output(B) = Input(C), and Output(C) = Output(Request))
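
A minimal sketch of this chaining procedure for the single input/output case is given below, using a breadth-first search over file formats; the representation of an application as a (name, input format, output format) triple is a simplification made for illustration.

from collections import deque

def find_chain(applications, requested_input, requested_output):
    """Find the shortest chain of applications transforming requested_input
    into requested_output; each application is a (name, input, output) triple."""
    # Preliminary checks: without a producer of the requested output or a
    # consumer of the requested input no trivial composition can exist.
    if not any(out == requested_output for _, _, out in applications):
        return None
    if not any(inp == requested_input for _, inp, _ in applications):
        return None
    queue = deque([(requested_input, [])])
    visited = {requested_input}
    while queue:
        current_format, chain = queue.popleft()
        if current_format == requested_output:
            return chain
        for name, inp, out in applications:
            if inp == current_format and out not in visited:
                visited.add(out)
                queue.append((out, chain + [name]))
    return None

# Example: chaining A (txt -> csv) and B (csv -> xml) to serve a txt -> xml request.
apps = [("A", "txt", "csv"), ("B", "csv", "xml"), ("C", "xml", "pdf")]
print(find_chain(apps, "txt", "xml"))  # ['A', 'B']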

However, this rather simplistic view on composition does not cover all possibilities. For instance, Figure 3.16 demonstrates an example inspired by the discussions of component composition in Mili et al. [MMK+94]. It shows how at first the requested input is split into two parts and consumed by two distinct data transformation applications, then the resulting outputs are combined in order to be consumed by another application which eventually produces the requested output. In general, we can say that the lack of applications consuming the requested input or producing the requested output does not necessarily result in an empty set of possible compositions, as inputs or outputs can be split. With this approach, various outputs could be combined in order to produce the requested output, leading to a larger number of available composition options. Furthermore, this leads to optimization problems like minimizing the number of redundant intermediate transformations. The search for component compositions in the work of Mili et al. [MMK+94] is proven to be NP-complete. The problem of application composition is not in the focus of this thesis. Instead, we briefly discuss how a composition of applications can be invoked and which issues might arise beforehand.

Figure 3.16: A composition involving input splitting inspired by [MMK+94] (Input1(Request) = Input(A), Input2(Request) = Input(B); Output(A) = Input1(C), Output(B) = Input2(C); Output(C) = Output(Request))

As with the simple search example, relying only on the compatibility of input and output types is not sufficient for declaring applications compatible with the request due to possible schema incompatibility issues. However, the overall process of checking a composition becomes very complex if schema comparison is enforced. Moreover, if several application compositions are available the process becomes even more complex due to the necessity of comparing the schemas for every case. Thus, it makes sense to leave the error handling to the invocation stage and perform only simple input and output type checks.

The process of searching for a composition might be time-consuming. Thus, it is better to explicitly allow searching for a composite data transformation. For this reason, the consumer’s request must contain a corresponding directive. In fact, there are two possibilities to search for composite data transformations: on-demand and offline search. The former is related to processing a consumer’s request which allows a composite search. The latter refers to searching for composite transformations independently of incoming requests and caching the returned results. While on-demand search is the most obvious solution, it also makes the request processing time significantly longer. On the other hand, the difficulty of knowing which composite transformations to look for in advance makes the offline search not trivial. One way to gather potential compositions is to cache the consumers’ requested transformations and use these for the offline search. Additionally, it is possible to select particular groups of input and output types suitable for transformations into each other. To be more precise, for many types a data transformation between them is useless, e.g. transforming video formats into textual formats. For this reason, the types can be separated into dedicated groups, e.g. textual or image, and composite transformations might be searched offline for groups of interest, e.g. composite transformations for the textual types. It is worth mentioning that the search will still involve all the available applications, but the starting and ending sets will be limited to those supporting the desired types. Offline search can be seen as an optimization which reduces the search time when an actual consumer request is received.

When a composition is found and chosen by the consumer to be invoked, the task manager has to handle the invocation process. Abstracting from the task manager’s specifics, which have not been discussed yet, we can mention a couple of issues requiring attention during the process. Firstly, different types of applications might be involved in a composite data transformation, e.g. a web service and a self-contained application. Hence, the task manager must support the invocation of every involved type of application. Another important issue is handling intermediate inputs and outputs in-between the invocations. The most important question is how their storage has to be organized. Moreover, errors require special handling. If an intermediate application fails, the task manager has to know how the results need to be handled, e.g. retry the application, let the consumer decide, or stop processing the transformation request.

3.5.6 Optimizations

There are potential optimizations which might be useful throughout the repository’s life cycle. As a means to group them we use an abstract sub-component of the repository called optimizer. In particular, this sub-component can be responsible for the task of deduplication, searching for composite data transformations, or performing data mining tasks like clustering the dependencies based on the supported transformations. In this subsection we briefly discuss how the optimizer can solve these tasks.

Unfortunately, the problem of publishing duplicates cannot be completely solved with application identifiers, as specifications might contain, e.g., spelling mistakes. The search for duplicates can be run by the optimizer in the background based on a defined set of parameters describing the application, mainly from the application’s description part. However, in contrast to the publishing case an efficient comparison technique has to be used in order to avoid comparing every possible pair of applications, which leads to a quadratic complexity. One option is to use the Sorted Neighborhood Method [HS95; HS98] which requires generating a key for every application, sorting the applications based on the key, and comparing all pairs of applications within a sliding window of some fixed size. This reduces the overall number of comparisons. In our case, the key for applications has to be based on a combination of the information which will most likely remain constant, e.g. the supported transformation and concatenated information about the developers. After potential duplicates have been identified they have to be marked and resolved. However, the actual resolution can happen only in a semi-automatic fashion, as one or several publishers have to decide whether the applications are in fact duplicates or not.
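
A minimal sketch of the Sorted Neighborhood Method in this setting could look as follows; the key construction, the exact-match similarity check, and the window size are illustrative assumptions that a real implementation would refine, e.g. with an edit-distance comparison.

def make_key(app):
    """Illustrative key: supported transformation plus sorted developer initials."""
    transformation = ",".join(sorted(app["inputs"])) + ">" + ",".join(sorted(app["outputs"]))
    developers = "".join(sorted(d[0].lower() for d in app["developers"]))
    return transformation + "|" + developers

def find_duplicate_candidates(applications, window=5):
    """Sort applications by key and compare only entries inside a sliding window
    of fixed size, avoiding the quadratic number of pairwise comparisons."""
    ordered = sorted(applications, key=make_key)
    candidates = []
    for i, app in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if make_key(app) == make_key(other):  # mark for semi-automatic resolution
                candidates.append((app, other))
    return candidates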

The composite data transformations can also be searched by the optimizer in the background, e.g. using one of the methods mentioned previously. After a composite transformation has been found, it needs to be stored for future use. Thus, the conceptual model of the database information has to allow storing such data. Additionally, as the set of applications is constantly changing, some composite transformations might become obsolete due to the removal of applications. For this reason, the list of found composite data transformations has to be refreshed periodically.

Another enhancement is the usage of data mining techniques in order to derive new information from the repository. This can be useful in various scenarios. For instance, the GUI-based interaction with the repository might rely on the presentation of applications, and clustered groups of applications based on the transformation type are much easier to analyze. Another scenario is the discovery of association rules based on previous consumer requests in order to suggest a suitable application. The analysis of the repository’s contents might also be implemented as part of the optimizer’s responsibilities.

3.5.7 Refining the Repository

Having discussed various aspects of publishing and storing data transformation applications, we can refine the architecture of the repository by zoning the responsibilities into separate sub-components as shown in Figure 3.17.

Figure 3.17: The refined architecture of the repository

The bottom part is the storage layer, which is based on the combination of a file system for storing provisioning-ready packages and a database for storing application-related information. On top of the storage layer there are three sub-components grouping the previously discussed aspects:

Package Manager is responsible for generating a provisioning-ready package and handling other package-related tasks.

Search Engine is responsible for searching for suitable transformations and search-related tasks.

Optimizer is responsible for possible optimizations and enhancements.

To be more precise, we also need to include the interaction and security layers in the repository, as it can be seen as a standalone piece of software functioning independently of other components. Thus, the repository has to provide means for secure interaction with the publishers and consumers. In our case the repository alone is not sufficient for handling the data transformation tasks, as the invocation part is missing in such a setting. For this purpose, we discuss the interaction and security layers as the topmost layers of the overall architecture, in the meantime omitting them from the refined architectures of the sub-components.

3.6 Task Manager and Provisioning Layer

Essentially, the task manager’s responsibilities can be divided into two categories: perform-ing the transformation and deploying an application in case it is not yet deployed. Thelatter can also be a sub-part of the former in case an application specified in the request wasnot previously deployed. Figure 3.18 illustrates the task manager’s architecture consistingof the application deployer, task processor which is responsible for running a set of tasks,and provisioning layer at the bottom. In the generic architecture we listed the provisioninglayer as a separate block to demonstrate that there is a need for a technology supportingthe invocation of self-contained application packages. In fact, this layer is coupled with thetask manager supporting a certain technology, e.g. invoking Docker containers via Docker’sAPI. Therefore, it makes sense to discuss the task manager and provisioning layer togetheras a whole. As with the repository, we omit the user interaction and security layers in thetask manager’s architecture. Although it also can be considered as a standalone piece ofsoftware, in our case it is preferable to discuss it as a part of the system having commonuser interaction and security layers.

Figure 3.18: Task Manager’s architecture (the Application Deployer and the Task Processor with its I/O Handler and Invoker & Monitor sub-components, on top of the Provisioning Layer)

The Application Deployer is a sub-component responsible for deploying the data transformation applications using the supported provisioning layer. In this work we focus on container-based provisioning as it is a lightweight alternative to hypervisors and may also serve as a building block for other available options when needed, e.g. containers used inside virtual machines. Although containers introduce simplicity, they also have limitations like a weaker security model. As was discussed in Chapter 2, there are scientific container solutions improving the security and other related issues. It is worth mentioning that the application deployer is only needed for reusing the applications which are published as self-contained packages. Deploying a web service would mean that it was also published alongside its files and other related information. The publishing scenarios discussed in Section 3.5 list several options of when an application can be deployed. Basically, all these options differ only in the type of request sent, i.e. either a deployment during publishing or during transformation, but the deployment process stays the same. These two examples of a request originate from different senders, a publisher or a consumer. In case of a deployment during publishing, the request explicitly states that an application has to be deployed as it is part of the publishing process. However, when a consumer requests a transformation, a list of suitable applications is returned in response. This implies that the consumer needs to choose one of the applications based on some criteria and send another request stating that the transformation has to be performed by the chosen application. At this point, there is a possibility that the application was not deployed previously, and before the transformation takes place the deployment must happen. This information is also available in the request itself due to the provisioning information maintained by the repository. As a result, the request has to be split into deployment and transformation requests processed by the task manager consecutively. When an application was successfully deployed, a confirmation must be sent to the repository in order to update the provisioning information for the respective application. In addition, the specification information of the newly deployed application might be cached in order to simplify the access to the invocation-related information. In this case caching also introduces the need to synchronize the data in order to reflect changes in the repository. Without caching, the task manager needs to get this information either from the repository or by accessing the specification directly inside the container in order to invoke the application.

The Task Processor is responsible for the invocation of the application and the control over the execution process. This includes processing of the input and output, invocation and monitoring, returning the results, and clean-up. If an application was published as a web service, the task manager only needs to act as a proxy redirecting the requests and responses. Figure 3.18 shows the task processor consisting of two sub-components: the Input/Output (I/O) Handler and the Invoker & Monitor. The former is responsible for the preparation of the inputs and outputs. Requests for a data transformation obviously should contain the inputs provided by the consumer of the application. The inputs might be provided in either a push- or a pull-based manner. The case when a consumer directly includes the inputs in the request is the push-based approach. It is not preferable in case large inputs need to be provided. As an alternative, a consumer might provide links referring to the actual inputs, and the task manager can then download the payload. With this approach a request becomes more lightweight and flexible, as various potential locations and public services can be used as a source of input. On the other hand, it makes no sense to specify parameter inputs as links. For this reason, the request can be a combination of both the push- and the pull-based approach: using links for file inputs and embedding parameter inputs into the request. As soon as the request is received, the I/O Handler needs to prepare the inputs, meaning that the input files need to be downloaded and the parameter inputs need to be saved as well. Then the application has to be invoked using the prepared inputs. Note that an application container can be instantiated prior to the input preparation and wait until everything is ready for the execution.

The Invoker & Monitor sub-component is responsible for the invocation of the application based on the information from the application package specification and for monitoring its state during the execution. Every request must be handled in a separate container to avoid overlapping of inputs and outputs. As a result, an individual task has to be created for every instantiated container, and after completion it can be eliminated. It is rather impossible to organize monitoring on the application level due to the legacy nature of the applications. In many cases an application does not provide a way to check its state. As an option, a monitor can, e.g., check the contents of the output folder or use the operating system’s tools to get information about the respective process. From a consumer’s perspective, as soon as the transformation task is started there must be a way to check the execution state via a standardized heartbeat-like request. Basically, the task manager must be able to tell if an application is still working, failed, or completed. Both the failed and the completed state can be identified by checking the state of the process in the operating system. However, if an application froze it might appear that it is still working. In such a case specifying a timeout either in the request or in the task manager could be helpful to decide if an execution was successful.
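
The sketch below illustrates how such a combined push/pull request could look and how the I/O Handler might prepare the inputs and assemble the invocation command. The field names, the example URL, and the convention of appending files and parameters to a static invocation string are illustrative assumptions, not the prescribed request format.

import urllib.request
from pathlib import Path

# A combined request: file inputs are passed as links (pull-based), parameter
# inputs are embedded directly (push-based).
request = {
    "application": "johndoe:imagetransform:1.0",
    "file_inputs": [{"name": "input.csv", "url": "https://example.org/data/input.csv"}],
    "parameter_inputs": {"delimiter": ";"},
    "timeout_seconds": 600,
}

def prepare_inputs(request, work_dir):
    """Download all file inputs into the task's working directory."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for item in request["file_inputs"]:
        urllib.request.urlretrieve(item["url"], work_dir / item["name"])
    return work_dir

def build_command(invocation_string, request, work_dir):
    """Concatenate the static invocation string from the application package
    specification with the prepared input files and parameter inputs."""
    files = " ".join(str(Path(work_dir) / f["name"]) for f in request["file_inputs"])
    params = " ".join(f"--{key} {value}" for key, value in request["parameter_inputs"].items())
    return f"{invocation_string} {files} {params}"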

3.7 Request Router

Essentially, the request router is analogous to an Enterprise Service Bus (ESB) [Cha04] in its purpose. On one hand, it is possible to omit the request router and work independently with the repository and the task manager via their interfaces. On the other hand, the number of repositories and task managers may be bigger than one, and using a single endpoint with a standard interface for the interaction is more convenient. Therefore, a request router serves as a single endpoint for potentially multiple task managers and repositories. Additionally, it enforces compliance with a standard interface. An interaction between the repository and the task manager is required in several scenarios, e.g. publishing with testing, and without a request router both the repository and the task manager need to know each other in order to communicate. If several repositories or task managers are available this task becomes even more complicated. Instead, every sub-component needs to communicate only with the request router which keeps track of the routing logic. Moreover, it can also provide load balancing in case multiple sub-components of the same type are used.

3.8 User Interaction and Security Layers

In order to safely interact with the overall architecture, user interaction and security layers are needed. The former provides means to communicate with the system in a standardized way. In reality, the user interaction process is different for publishing, i.e. working with the repository, and for reusing published applications, which involves interaction with both the repository and the provider of the logic. In our case the topmost user interaction layer is basically a common ground for the interfaces of the repository and the task manager. More specifically, defining a standard interface on the higher level forces any possible implementation of each sub-component to comply with it. User interaction can happen in different ways: in particular, we are interested in the separation between a GUI-based interaction and working directly with the API. These types of interaction are useful in different scenarios. For instance, publishing an application without relying on the GUI is less convenient for publishers as many details have to be specified and possibly refined. On the other hand, the process of reusing an application involves several distinct actions which may or may not benefit from GUI-based interaction. For instance, searching for an application can be done in both ways: either viewing a list of available applications via the GUI or sending a request containing all the required information and getting a list of suitable applications back. The latter option can be a part of an automated interaction when the consumer is another application. The returned list of available applications is then checked based on some criteria and the most suitable option is chosen for performing a transformation. With these considerations in mind, we describe the user interaction layer as consisting of two parts: a GUI and an API, as shown in Figure 3.19.

Figure 3.19: Updated User Interaction Layer (Graphical User Interface and Application Programming Interface) with the Security Layer

The concept of trustfully running third-party applications is far from being safe. A security layer needs to rely on a set of complex measures in order to provide consumers with a safe-enough user experience. The fact that a transformation runs inside a container makes it look safer. However, a consumer has to be protected from reusing malicious applications which can distort the result of a computer-based experiment, produce potentially harmful output, etc. A set of required security measures is not in the focus of this thesis, however we mention several aspects which might be helpful. The main security issue is the way applications are published. Unauthorized publishing should be prohibited or marked accordingly in the application’s information. In order to authorize publishers, the system has to keep track of users, e.g. by means of a database. Accordingly, a user registration procedure has to be implemented. The same approach can be used for consumers, prohibiting any unauthorized requests. Additional security mechanisms like Role Based Access Control (RBAC) can be applied to control who can consume the applications.

3.9 Refining and Zoning the Framework

Combining all discussed concepts, we can now refine the framework using the architecture demonstrated in Figure 3.12. After substituting every sub-component with its refined version, we get the result shown in Figure 3.20. Additionally, we use red dashed lines in order to group the sub-components into conceptually related zones which can be used when developing a prototype, e.g. as a set of microservices. The repository and the task manager are represented as atomic sub-components, whereas the combination of the request router, interaction, and security layers is called the Interaction Endpoint zone. The idea is to have a single endpoint for communication with all the available repositories and task managers which are compliant with the interface offered by the interaction zone. On one hand, this introduces additional complexity for handling requests, as they need to be routed and the actual repositories and task managers have to be known. Moreover, such an interaction endpoint is a bottleneck and a single point of failure in the system. On the other hand, having one interaction endpoint simplifies the overall interaction process for publishers and consumers and forces the sub-components to comply with a standardized interface. In the next section, we discuss how consumers and publishers can interact with the framework.

Figure 3.20: Refined view on the framework’s constituents (the Interaction Endpoint zone comprising the request router, the graphical user interface, the application programming interface, and the security layer; the repositories; and the task managers with their provisioning layers)

3.10 Interaction with the Framework

The discussion of the overall framework is not complete without explaining the interaction processes from various perspectives, i.e. describing the publisher’s, the consumer’s, and the framework’s view on the interaction. Before reusing an application, it has to be published. Figure 3.21 shows an abstract diagram of the generalized publishing process from the publisher’s perspective. In order to publish an application, the publisher has to provide the application package specification and prepare all related files and dependencies. Additionally, the publisher has to specify whether the application needs to be provisioned and tested. We consider publishing to be a semi-automatic process and assume that it is GUI-based, as this offers a more user-friendly way to control the various specification parameters and prepare the related files.

Figure 3.21: The generalized publishing process from the publisher’s perspective (choose an application, prepare the application package specification, prepare the application files and dependencies, choose the publishing scenario, publish the application)

From the user’s, i.e. publisher’s perspective, it is preferable that the process is simpleand intuitive. The hidden details of the publishing process are easier to discuss from thesystem’s perspective. By system, we mean the combination of framework’s sub-componentsinvolved in the process. Figure 3.22 illustrates the publishing process from the system’sperspective. After receiving the publishing request, several steps have to be performed for a

receive apublishing

request

materializeremote

artifacts

update theapplication

packagespecification

generate theprovisioningspecification

generate theprovisioning-

readypackage

store theprovisioning-

readypackage

test (and)provision theapplication 

inform aboutpublishing

results

optionalrequired

Figure 3.22: The generalized publishing process from the system’s perspective

process to finish. Optionally, in case the application package specification contained remotelinks to artifacts they have to be materialized and the specification have to be updated.The next step is to generate the provisioning specification supported by default, e.g. aDockerfile. The resulting set of files then needs to be archived into a provisioning-readypackage which will be stored on the file system. Corresponding database entries haveto be created in addition. Depending on the publishing scenario chosen, the testing andprovisioning of the package might be required. Finally, the publisher needs to be notifiedabout the results. Several steps might involve errors which need to be handled. For instance,the remote files specified in the application package specification might be unavailablewhich leads to stoppage of the process. As an optimization measure, several retries mightbe performed in order to get the files. However, in case the files are still not available thepublisher has to be informed about the problem as a next step due to the impossibility tocontinue the publishing process without the refinement of remote links. Basically, in caseof errors during the next steps the overall process has to be halted and the publisher needs

81

Page 82: Concepts for Handling Heterogeneous Data Transformation Logic and their Integration with TraDE

3 Concepts and Design

to be informed. One issue worth mentioning is the case when the test of an applicationis finished and the produced output is not equal to the output provided in the test runspecification. Essentially, at this step an application can be published. On the other hand,due to different results the application must be considered incorrect. Thus, it has to bemarked as producing inconsistent results and the publisher has to be notified about thisfact.

From the consumer’s perspective the process of reusing an application must be simple toimplement, leading to a lower cost of the software reuse. Figure 3.23 illustrates the abstractprocess of reusing a data transformation application as it can be seen from the consumer’spoint of view. Firstly, the consumer needs to search for a suitable application by sendinga request which eventually has to be delivered to the repository. Potentially, the numberof suitable solutions can be greater than one resulting in a list of applications which canbe used. For instance, such list might include single atomic applications and compositeapplications. The framework does not offer means to choose which particular applicationis the most suitable for a consumer’s use case. This decision making process has to beperformed on the consumer’s side. Potentially, the framework might rank the applicationsin the list based on various criteria such as performance properties or credibility which cansimplify choosing the most suitable application for a consumer. However, every use casemight have specific requirements and the overall task of choosing the right application isleft for the consumer. After the decision has been made, the request for transformation

receive a list ofsuitable

 applications

decide whichapplication to

use

request for atransformation

search for atransformation

check theexecution's

state

receivetransformation

results

optionalrequired

Figure 3.23: A consumer’s perspective on the process of reusing a data transformationapplication

has to be sent which eventually reach the task manager. At this point, all the consumer isinterested in is getting back the results of the transformation. Worth mentioning, that thewhole process of searching for an application is optional as the consumer might alreadyknow which transformation application is going to be used, e.g. by means of the QNameproperty discussed in Section 3.3.

One optional step that might be useful is the monitoring of the application’s execution process. For this reason, a standardized monitoring request has to be allowed. One important issue is that the transformation task is created after receiving a request, which leads to the requirement of sending a task reference to the consumer in order to allow monitoring. This process is transparent to the consumer, but the knowledge of the task’s identifier is also crucial for the correlation of the response. Therefore, the monitoring request has to contain the task’s reference in order for the system to process it. As soon as the transformation is completed, the results have to be returned to the consumer. There are several options of how the results can be returned. Obviously, the response can contain the results directly. Another option is to temporarily cache the results and send a generated unique link to the consumer. Basically, these are the push- and pull-based scenarios. The latter is preferable in case the transformation is needed only as a temporary step for performing one or more additional data transformations. In such a case, the consumer does not need to resend the same data as inputs for a new transformation, and the task manager can directly reuse the results stored locally.
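
From the consumer’s side, such heartbeat-like monitoring could be implemented as a simple polling loop like the one sketched below; the client operations get_status and get_results, the state names, and the timeout handling are illustrative assumptions.

import time

def wait_for_result(client, task_ref, timeout=600, interval=5):
    """Poll the task manager with heartbeat-like requests until the referenced
    task is completed or failed, or until the timeout is exceeded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = client.get_status(task_ref)      # e.g. RUNNING, FAILED, COMPLETED
        if state == "COMPLETED":
            return client.get_results(task_ref)  # direct payload or generated links
        if state == "FAILED":
            raise RuntimeError(f"Transformation task {task_ref} failed")
        time.sleep(interval)
    raise TimeoutError(f"Transformation task {task_ref} did not finish in time")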

Another view on reusing the data transformation applications is the system’s perspective. Figure 3.24 shows what happens after the transformation request is received by the task manager. Depending on whether the application was provisioned before or not, two optional steps might be needed. Firstly, the provisioning-ready package has to be downloaded from the repository. Considering that the task manager does not know the repository in advance, this information has to be provided. Since the information about provisioning is already known at the time of the transformation search, the response to the consumer with the list of suitable transformations can contain an abstract reference to the repository. When the consumer sends a transformation request, the abstract repository reference can be substituted with the actual endpoint information by the request router. As a result, when the task manager is ready to process the transformation request, it already knows all relevant information, i.e. the provisioning status and the repository information. Hence, the task manager can download the respective package in order to deploy it in a subsequent step.

Figure 3.24: A system’s perspective on the process of reusing a data transformation application (receive a transformation request, optionally get the provisioning-ready package from the repository and deploy the application, get the invocation details from the repository, instantiate the application’s container, prepare the inputs, invoke the application, monitor the execution status, collect the outputs when the execution is completed, return the results)

Next step is to get the invocation details from the repository. This information can alsobe accessed in the provisioning-ready package or cached during the deployment. Without


Figure 3.25: A UML sequence diagram of the interaction with the framework


Without going further into possible optimizations, we assume that this information is obtained from the repository. Afterwards, the application's container needs to be launched and the inputs have to be prepared as the application package specification states. The preparation of inputs includes obtaining the input files using the remote links in the request and placing them in the required location in the application container. For instance, if the application requires having inputs in a specific folder with respect to the executable, then the inputs have to be copied into this folder. Another option is when the invocation string contains the list of specified files. In such a case the inputs have to be copied into a predefined folder and the executable can be invoked using a concatenation of the static path and the file's name. The application can be invoked after these steps are completed. Unlike in the consumer's perspective, where monitoring is optional, the system has to monitor the execution as there is no other way to determine its state. Moreover, the published data transformation applications are diverse and offer no standardized way of checking their state. One way to check the application's state is to get the process information from the operating system. Another option would be to check the output folder, in case the files are saved somewhere, and compare the number of output files. However, even when the number is equal we cannot guarantee that these files are complete and not still being populated. In addition, it is also possible to log the application's console output to a file for additional information. For the case when the application runs for too long, a task can be stopped using a timeout specified in the request. As soon as the task is completed, the outputs have to be collected and returned to the consumer.
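A minimal sketch of how such an invocation with timeout-based monitoring and output collection could look is shown below; the helper name and the way outputs are detected are assumptions made for illustration.

import subprocess
from pathlib import Path

def run_transformation(command, workdir, timeout_seconds, log_file="app.log"):
    """Invoke a published application via its CLI command and supervise it with a timeout.

    The application's console output is logged to a file for additional information,
    and the output folder is inspected only after the process has terminated to avoid
    reading files that are still being populated.
    """
    workdir = Path(workdir)
    with open(workdir / log_file, "w") as log:
        process = subprocess.Popen(command, shell=True, cwd=workdir,
                                   stdout=log, stderr=subprocess.STDOUT)
        try:
            returncode = process.wait(timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            process.kill()
            raise RuntimeError("Transformation exceeded the specified timeout")
    if returncode != 0:
        raise RuntimeError(f"Transformation failed with exit code {returncode}")
    # Collect the produced files only after completion.
    return [path for path in workdir.iterdir() if path.name != log_file]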

The example shown in Figure 3.25 illustrates the generalized communication between the consumer and the framework's parts involved in a reuse scenario. Basically, the general process combines the consumer's and the system's perspectives on application reuse. The consumer communicates with the interaction endpoint without the need to care about the exact repository and task manager involved. This example does not contain all possible issues which might arise during the interaction, especially with respect to potential errors. For instance, in case the application was not deployed due to a failure, the transformation cannot proceed and a corresponding response has to be returned to the consumer. Similar behavior is expected when the invocation details are not accessible from the repository. As soon as the transformation is invoked, the consumer has to receive the task's reference to be able to send the monitoring requests. On the task manager's side the monitoring has to be performed constantly using a specified timeout. Based on the task's state the corresponding response has to be returned to the consumer.

3.11 Integration with TraDE Middleware

In order to use the framework in conjunction with the TraDE Middleware we need to identify the most suitable integration options. However, before going into details, first we must understand the goal of the integration. The publishing process can be seen as an independent part of the framework due to its semi-automatic nature. Obviously, it is an important part of the process, but the actor interested in performing the transformation assumes that applications are already present in the repository.


For this reason, we assume that the publishing is performed independently and does not depend on a specific choreography and its data flow in particular. More specifically, the main goal of the integration is to support the transparent data exchange by means of offering the data transformation logic for inclusion into the data flow. In this case, the TraDE Middleware can be seen as a consumer, first searching for and then reusing the chosen application.

There are four common ways to integrate software: file transfer, shared database, remote procedure invocation, and messaging [HW12]. File transfer implies using files as a common data transfer mechanism. This technique is not suitable for our use case for several reasons: handling a large number of files is costly performance-wise, and synchronization of the files might be needed to avoid staleness. The shared database integration is also not suitable for this particular integration due to a couple of issues, e.g., the database might become a bottleneck when many requests have to be processed, and asynchronous communication is complex to realize.

Instead, we focus more on the remote procedure invocation and messaging integration techniques. The former is about providing an interface for other applications which allows communicating directly with an application [HW12]. Considering the whole framework as a single piece of software which provides an interface, remote method invocation might be a suitable integration technique. Essentially, the TraDE Middleware needs to use the interface of the framework in order to send standardized requests and receive the results. Due to implementation specifics of the TraDE Middleware, a web service-like interaction is preferable to more tightly-coupled solutions like Java Remote Method Invocation (Java RMI).

Going a step further, we can try to apply the messaging approach to the integration of the framework with the TraDE Middleware. Using messages as a data transfer mechanism has multiple advantages, e.g. reducing the coupling between applications or supporting asynchronous communication [HW12]. It relies on the powerful concept of messaging systems, which are dedicated pieces of software supporting message exchange. In the framework we have a zone called the interaction endpoint which consists of the request router, user interaction and security layers. The combination of the API, the security measures and a request router can in fact be seen as a message-processing system with some messaging system serving as a basis. A secure GUI-based interaction in this case needs to be handled separately.

The messaging approach is preferable in case of a high volume of messages. In general, we consider a web service-like integration where the TraDE Middleware uses a standardized interface offered by the framework. For this purpose a simple API client has to be integrated on the TraDE Middleware's side to enable the communication.
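A minimal sketch of such an API client is shown below; the resource paths mirror the REST API described in Chapter 4, while the class and method names are illustrative assumptions.

import requests

class TransformationFrameworkClient:
    """A small client the TraDE Middleware could embed to talk to the framework's API."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def find_transformations(self, query_params):
        # query_params, e.g. {"pnum": 2, "infile[]": ["txt", "dat"], "outfile[]": ["pdf"]}
        response = requests.get(f"{self.base_url}/transformations", params=query_params)
        response.raise_for_status()
        return response.json()

    def request_transformation(self, task_request):
        response = requests.post(f"{self.base_url}/tasks", json=task_request)
        response.raise_for_status()
        return response.json()

    def get_task(self, task_id):
        response = requests.get(f"{self.base_url}/tasks/{task_id}")
        response.raise_for_status()
        return response.json()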


4 Implementation

This chapter discusses the details of the prototypical implementation of the framework for handling data transformation applications. In Section 4.1 we present the details about the prototype's architecture. Section 4.2 goes into the details of the framework's API specification. Section 4.3 describes how the conceptual model of a data transformation application introduced in Section 3.3 is represented in the implementation. Section 4.4 provides the details about the publishing and the generation of an application's provisioning specification. Section 4.5 describes the details about requesting a transformation.

4.1 Architecture of the Prototype

Although the refined framework design shows potentially multiple independent components communicating via the request router, for the prototypical implementation we choose to unify one repository and one task manager as parts of one monolithic web application. This allows us to focus on the public API instead of going into the details of inter-component communication. Figure 4.1 demonstrates the architecture of the prototype, which is based on the refined framework structure discussed in Section 3.9.

[Figure 4.1 depicts the prototype as a monolithic web application that exposes a graphical user interface and an application programming interface and comprises a Deployer module, a Task Processor module, a Package Manager module and a Search module, a provisioning layer based on the Docker SDK for Python, and a storage layer using the file system accessed via Python FS modules and PyMongo.]

Figure 4.1: The architecture of the framework’s prototypical implementation


We use Python as the main programming language. However, a stack of additional technologies is required to support the discussed concepts. For the provisioning layer we use Docker [Doc17b] as the technology supporting container-based virtualization (see Section 2.4). As a consequence, we need to generate Dockerfiles as provisioning specifications. While for the prototype we are bound to Docker, the application package specification supplied by the publisher is also included in the final application package. This ensures that the repository remains technology-agnostic and another type of provisioning specification can be generated if needed. In order to communicate with Docker programmatically, we use the Docker SDK for Python which allows using Docker's API from within a Python application. While we do not list every single sub-part of the framework used in the prototype, we introduce logical groupings of the framework's components and their sub-components into separate Python modules. This logical grouping allows splitting the prototype into standalone microservices exposing their own distinct APIs which can then be composed in the future to form the overall framework.
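To give an impression of how the provisioning layer can interact with Docker programmatically, the following sketch builds an image from a generated Dockerfile and starts a container; the function name, tag, and paths are placeholders chosen for the example.

import docker

def provision_application(package_dir, image_tag):
    """Build an image from the provisioning-ready package and start a container from it.

    package_dir points to the unpacked package containing the generated Dockerfile;
    image_tag is an arbitrary identifier chosen by the task manager.
    """
    client = docker.from_env()
    image, _build_logs = client.images.build(path=package_dir, tag=image_tag)
    # The container only runs its default command here; the actual invocation happens later.
    container = client.containers.run(image.id, detach=True)
    return container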

We implement the repository’s storage layer as a combination of a file system and adatabase using the approach discussed in Section 3.5.3. The file system is used forstoring the application-related files, whereas the database stores a metadata about theapplications. As a database, we use MongoDB3, a non-relational document-orienteddatabase. We choose to use a non-relational database due to several reasons. It offersexpressive querying capabilities, supports secondary indexes and at the same time it isschemaless which is advantageous for our use case. One important thing to consider isthat the repository is mainly used for reading. The only writes that can happen are eitherrelated to publishing or updates which are meant to happen rare. Most of the databaseoperations are reads in order to find a suitable transformation or obtain the provisioning-ready package. One of the biggest advantages is the absence of a schema which makesstoring the application’s data extremely flexible, e.g. when the conceptual model needsto be extended. In addition, the model of a data transformation application shownin Section 3.3 is nested which requires creating many redundant tables if a relationaldatabase is used and the data needs to be normalized. For instance, the informationabout dependencies, configurations, invocations is only useful when bound to a particularapplication. MongoDB stores JSON documents which simplifies storing nested objects.Moreover, it allows creating interconnected collections of documents making it possible toseparate parts of the application package specification like dependencies or transformationsinto separate collections. As JSON is commonly used as a format of data interchange, thewhole documents can be returned directly as a response to API calls. In order to workwith MongoDB we use PyMongo4 as a Python database driver. For working with the filesystem we use a set of standard Python libraries. One important point is that we rely onthe file system of the machine which runs the prototype. However, this can be extended,e.g. a distributed file system can be used. As with the task manager’s sub-components,

1 Python Software Foundation, Python: https://www.python.org
2 Docker Inc., Docker SDK for Python: http://docker-py.readthedocs.io/en/stable/
3 MongoDB, Inc., MongoDB: https://www.mongodb.com/
4 MongoDB, Inc., PyMongo: https://api.mongodb.com/python/current
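As an illustration, the storage layer could store and query application metadata with PyMongo roughly as follows; the database name, collection name, and helper functions are assumptions of this sketch, with field names taken from the listings of this chapter.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["transformation_repository"]

def store_application(app_document):
    """Store an application package specification enriched with repository metadata."""
    result = db.applications.insert_one(app_document)
    return str(result.inserted_id)

def find_applications_by_tags(tags):
    """Find applications whose tag list contains all of the given tags."""
    return list(db.applications.find({"appInfo.tags": {"$all": tags}}))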


the repository’s parts are separated into separate Python modules. The optimizer sub-component is not in the scope of this work, hence we omit it in the implementation.

The prototype’s functionality is exposed via a REST API and partially via the GUI (publishingthe applications and basic querying). To support web-related functionality we use PythonFlask microframework5. For specification of the REST API we use Swagger6 which isan open source implementation of the OpenAPI Specification [Ini17]. A Swagger APIspecification is a YAML Ain’t Markup Language (YAML) file containing all the interfaces-related information in a human-readable format which later can be used for the generationof the client and server boilerplate code for various languages and frameworks includingPython Flask.

4.2 API Specification

Having discussed the refined architecture of the prototype and the technologies used, we can go into the details of the API specification. One of the primary goals we need to achieve is the identification of the resources and the analysis of the operations which can be performed on them. Table 4.1 lists the resources of interest for the prototypical implementation along with their descriptions and root paths. The Applications resource describes a data transformation application and is based on the application package specification, which in its turn is a representation of the conceptual model introduced in Section 3.3. Essentially, the Transformations resource represents a part of the application which we also make accessible independently as a separate resource. The Tasks resource identifies the transformation tasks created by the task manager component described in Section 3.6. We use plural forms of the respective nouns to describe the collections of resources, which is one of the recommended ways for the specification of REST APIs [Mas11; Mul13; SB15]. Note that we do not identify security-related resources such as users or sessions, as we omit the security layer in the prototype.

Resource | Root Path | Description
Applications | /apps | This resource identifies a collection of data transformation applications which were published to the repository
Transformations | /transformations | As one application might support multiple transformations, we use a separate resource to identify a collection of available transformations
Tasks | /tasks | This resource identifies a collection of transformation tasks requested by the consumer

Table 4.1: Resource descriptions





According to Fowler’s description of Richardson’s REST maturity model [Fow10b] thenext step is to introduce a proper usage of HTTP verbs with respect to the identifiedresources. Table 4.2 demonstrates the usage of HTTP verbs for Applications and Transfor-mations resources. We introduce the latter due to the different search semantics: lookingfor a specific application versus searching for any suitable transformation available in therepository. For the prototype we support searching for both, applications and transforma-tions. The search for a suitable application relies on tags defined in the application packagespecification. Searching for a specific transformation is possible using the transformation’ssignature generation discussed in Section 3.5.4. The signature is generated using thedata supplied in the search request. The composite search described in Section 3.5.5 isomitted in the implementation. As shown in Table 4.2 the combination of an HTTP GET

Resource path | HTTP Method | Description
/apps | GET | Retrieve the list of data transformation applications from the repository
/apps | POST | Publish a new data transformation application to the repository
/apps/{appID} | GET | Find the data transformation application by its identifier
/apps/{appID} | PUT | Update an existing data transformation application by completely replacing it with a new one
/apps/{appID} | DELETE | Remove the data transformation application from the repository
/apps/{appID}/transformations | GET | Retrieve the list of transformations supported by a particular application
/transformations | GET | Retrieve the list of available transformations in the repository

Table 4.2: The usage of HTTP verbs for the Application and Transformation resources

As shown in Table 4.2, the combination of an HTTP GET request with a specific resource, e.g. /apps or /transformations, or a sub-resource, e.g. /apps/{appID}, results in triggering a corresponding search. While looking for a specific sub-resource requires providing only its identifier, searching for collections of resources relies on specific query parameters. For the /apps resource we use an HTTP GET request which includes the query parameter “tags”, allowing to search for applications by the tags specified in the application package specification. Other application-related query parameters can be introduced in addition. However, for the basic prototypical implementation we distinguish semantically between searching for an application, which groups potentially multiple transformations, and searching for a specific transformation, e.g. using its qualified name or based on signature matching.


Thus, for the /transformations resource we introduce the following query parameters: “qname”, “pnum”, “infile[]”, “infsets[]”, “outfile[]”, and “strict”. Searching for a transformation by its qualified name (the “qname” query parameter) is very convenient in scenarios where both the publisher and the consumer of the application agreed on using standardized abstract identifiers. The search process becomes a trivial task in such a case, as only transformations with matching qualified names are returned. The other parameters allow describing the transformation to search for. The “pnum” parameter allows specifying the number of input parameters a transformation should be able to process. In the prototype we do not differentiate between the types of parameters. The “infile[]”, “infsets[]”, and “outfile[]” parameters are arrays of formats for InputFiles, InputFileSets, and OutputFiles respectively. The “strict” query parameter defines which type of signature string discussed in Section 3.5.4 will be used for searching: either with optional input parameters or without them. The latter is less strict due to a smaller number of constraints. Note that searching for an application and searching for a transformation are semantically different, as the resulting responses contain different entities. The response for a transformation search returns a list of suitable transformations with links to their parent applications, whereas the response for an application search directly contains all suitable applications. When a consumer decides which transformation to use, the transformation request sent to the framework must contain references to both the application and its transformation. In general, more complex search capabilities can be implemented by introducing special search-related resources which support HTTP POST requests carrying a more complex structure. An example of a GET request which searches for Transformations with two input parameters, consuming a TXT and a DAT file as InputFiles and producing a PDF as OutputFile, looks as follows: /transformations?pnum=2&infile[]=txt&infile[]=dat&outfile[]=pdf. Note that for the formats the regular file extensions are used instead of MIME types, which take more space. On the repository side we have a static MongoDB collection for mapping between MIME types and file extensions for the construction of proper signature strings in case the application package specification uses MIME types to describe the formats.
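Issued from Python, the same search could look as follows; the base URL and the client-side handling of the response are assumptions of this sketch.

import requests

# Search for transformations with two input parameters that consume a TXT and a DAT
# file as InputFiles and produce a PDF as OutputFile.
params = {
    "pnum": 2,
    "infile[]": ["txt", "dat"],
    "outfile[]": ["pdf"],
}
response = requests.get("http://localhost:5000/transformations", params=params)
response.raise_for_status()
for transformation in response.json():
    # Each entry is expected to carry links to its parent application.
    print(transformation)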

An application can be published using the HTTP POST method with the /apps resource. The response in this case is a JSON representation of the newly-published application. We provide more details about the publishing in later sections. For the prototype we implement a simple update of applications which uses HTTP PUT, replacing the chosen application with the information from the request. Note that a sub-resource identifying a certain application has to be used in this case. It is worth mentioning that a set of checks is performed in order to update an application, and the update happens only if all of them were successful. Otherwise the publisher gets an HTTP status code 400: “Bad Request”. For more fine-grained updates it is possible to use HTTP PATCH referring to specific resources like dependencies or configurations. We omit this level of detail in the prototype to support only a sufficient set of resources. Likewise, an existing application can be deleted using the HTTP DELETE method. In addition, the set of transformations supported by a specific application can be retrieved by addressing the dedicated sub-resource.


In order to execute a transformation, a corresponding task has to be created by the Task Manager as introduced in Section 3.6. Therefore, the REST API allows consumers to create transformation tasks by interacting with the Tasks resource. Table 4.3 lists the task-related resources and the HTTP methods they support. An HTTP POST request to the /tasks resource returns the created task's resource representation. In case a consumer is interested in checking the task's state, an HTTP GET request to the corresponding sub-resource using the received task identifier should be made. Another action on the task resource supported by the prototype is the cancellation of a task by updating its state through an HTTP PUT request. One could argue that the HTTP DELETE method should be used to cancel a task instead. However, we are interested in separating the cancellation and deletion actions for logging purposes. Cancellation is basically an update of the task's state, while deletion means that the task will be completely removed. Only canceled tasks can be removed, hence the deletion might require canceling the task if it was not canceled before. For logging in the task manager we use plain text files.

Resource path | HTTP Method | Description
/tasks | POST | Trigger the creation of a new data transformation task
/tasks/{taskID} | GET | Check the state of the data transformation task
/tasks/{taskID} | PUT | Cancel an existing data transformation task
/tasks/{taskID} | DELETE | Delete an existing data transformation task

Table 4.3: The usage of HTTP verbs for the Task resource

The last level in Richardson's REST maturity model [Fow10b] relies on the so-called Hypertext As The Engine Of Application State (HATEOAS) concept. Basically, the main idea is to support the navigation through the API using links to relevant resources [SB15]. For instance, if a certain application was returned in response to a publishing action, it also needs to contain links at least to itself and to its transformations, as shown in Listing 4.1.

Listing 4.1 Example of Hypermedia Controls for the case when an application is published
...
"links": [
  {
    "rel": "self",
    "uri": "/applications/someAppID"
  },
  {
    "rel": "application.transformations",
    "uri": "/applications/someAppID/transformations"
  }, ...
]
...


The usage of “self” might be helpful in cases such as obtaining a resource representation from an HTTP POST request where the actual address of the resource is different from the originally requested one [SB15]. This particular example is only useful in case the publisher wants to use this application as a consumer after publishing. However, this approach simplifies the exploration of the details (if needed) about applications and transformations received as search results. For content negotiation in the prototype we use the JSON format, as it is less verbose than XML and integrates easily with MongoDB.

4.3 Application Package Specification

Before discussing the publishing details, we need to describe how the application package specification introduced in Section 3.3 is implemented and which way of publishing we use for the prototype. Essentially, an implementation of a data transformation application package specification can be seen as a DSL. For the implementation we choose it to be an external DSL, meaning that it does not use the syntax of some general purpose programming language [Fow10a]. This allows reusing the application package specification in a technology-agnostic manner. Although supporting control flow logic in the DSL might potentially be useful for describing data transformation applications, we choose to use a declarative style as it makes the overall description process easier for a publisher. An external DSL can be based either on a custom syntax, e.g. a newly-created language, or on commonly-used representation formats such as XML or JSON. We use JSON for several reasons. First of all, basically all the information in the prototype is exchanged in JSON, and a JSON-based description speeds up the overall processing, e.g. parts of the specification can be directly copied into the corresponding MongoDB collections. Additionally, a newly created syntax for a declarative DSL has to be thoroughly documented and the creation of a specification might require more effort. In contrast, for JSON we use JSON Schema and validate the publisher's specification against it.
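A minimal sketch of such a validation step is shown below; the use of the Python jsonschema package and the schema excerpt, which covers only a tiny part of the specification, are assumptions made for illustration.

import json
from jsonschema import validate, ValidationError

# A tiny excerpt of what the JSON Schema for the specification could require.
SPEC_SCHEMA = {
    "type": "object",
    "required": ["appInfo", "transformations"],
    "properties": {
        "appInfo": {
            "type": "object",
            "required": ["appName", "appPublisher"],
        },
        "transformations": {"type": "array", "minItems": 1},
    },
}

def validate_specification(path):
    """Validate a publisher-supplied specification file against the schema."""
    with open(path) as spec_file:
        specification = json.load(spec_file)
    try:
        validate(instance=specification, schema=SPEC_SCHEMA)
    except ValidationError as error:
        raise ValueError(f"Invalid application package specification: {error.message}")
    return specification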

The conceptual model of the application package specification introduced in Section 3.3 can be represented in JSON almost as a 1:1 mapping. Only some parts might differ slightly due to the choice of data structures. For instance, we group the same types of entities into separate arrays to simplify parsing of the specification. Listing 4.2 shows how different input types are represented by separate object properties. We use the same approach for other similar entities in the conceptual model, e.g. the types of dependencies. We omit listing an abstract example of the specification as most of the mappings are easy to reproduce. However, in Chapter 5 we demonstrate how concrete data transformation applications from the motivational example presented in Section 1.2.6 can be described using the discussed JSON format.


Listing 4.2 Example of input type groupings in transformation specification
...
"transformations": [
  {
    "name": "text-to-pdf",
    "qname": "uniqueQualifier",
    "inputParams": [ { ... }, ... ],
    "inputFiles": [ { ... }, ... ],
    "inputFileSets": [ { ... }, ... ], ...
  }
], ...
...

4.4 Publishing and Generation of Provisioning-ready Specification

When an application’s specification is created in JSON and all related files are prepared, thepublisher can choose the most suitable option among the publishing scenarios discussedin Section 3.5.1. Then, a corresponding HTTP POST request can be issued against theAPI. From the technical point of view there are several ways of publishing a specificationand related files. First of all, supplying the application’s files separately might become atedious task due to a potentially large number of files. The easiest way would be to attachan archive of all related files including the application package specification. However,sending an archive included in the body of the request is not directly possible because ofadditionally required information which has to be specified apart from the archive files,e.g. whether an application has to be provisioned and (or) tested. One possible way isto use multipart/form-data content type for sending a form containing the applicationmetadata and attached files or send structured JSON requests with the base64 encodedfiles inside. However, if the archive’s size is large, sending such POST requests will takemore time. Moreover, if the validation does not succeed the publisher needs to uploadthe files again. Another option is to split the publishing process into two stages, firstspecifying the metadata and then providing the files archive. From the conceptual pointof view, we cannot create an application entry in the database until the files are receivedand validation takes place. For the prototype we use a simple option of sending a JSONrequest containing the related metadata plus a remote link to the application archive, e.g.to a Dropbox7 or a GitHub8 repository. Sending such requests is simple and a publisherdoes not need to upload files again if the validation was not successful. In the metadata thepublisher also needs to specify which publishing scenario is going to be used, e.g. includingprovisioning. Listing 4.3 shows a publishing request’s body example. The specification ofapplication’s name is not necessary as the application package specification contains allthe required data. In case the selected publishing scenario succeeds, the publisher receivesthe JSON representation of a newly-published application. As MongoDB allows creation of

7 Dropbox, Inc.: https://www.dropbox.com
8 GitHub, Inc.: https://github.com/


Listing 4.3 Example of a publishing request
{
  "name": "dummyApp4Publishing",
  "deploy": true,
  "test": false,
  "archiveURL": "link/to/dropbox/archive"
}

As MongoDB allows the creation of interconnected collections of documents, for optimization purposes we also use a separate collection for storing the transformations and maintaining the references to the respective applications. Note that having denormalized data is common in non-relational databases, and some transformation-related information is also stored in the collection related to applications to speed up the querying process.
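From the publisher's side, issuing such a publishing request could look as follows; the endpoint URL and the archive link are placeholders.

import requests

publishing_request = {
    "name": "dummyApp4Publishing",
    "deploy": True,   # provision the application after publishing
    "test": False,    # skip the test run
    "archiveURL": "link/to/dropbox/archive",
}
response = requests.post("http://localhost:5000/apps", json=publishing_request)
response.raise_for_status()
published_app = response.json()  # JSON representation of the newly-published application
print(published_app)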

During the processing of a publishing request several steps are completed, as described in Section 3.5. Firstly, if remote dependencies are specified, then the materialization and the update of the application package specification happen. Next, the respective entries in the database collections are created based on the application package specification, but including additional information such as the generated transformations' signatures or the numbers of dependencies. We do not demonstrate a complete MongoDB document structure of the application as it is similar to the conceptual model discussed in Section 3.3. However, we focus on some details by showing excerpts of the document's model. Listing 4.4 shows how the general information about the application taken from the application package specification is enriched with additional data like information about the file system path, the providers, the validation status, and other statistical information.

Listing 4.4 Specification of the general information about the application
...
"appInfo": {
  "appID": "dummyID",
  "appName": "dummyApp",
  ...
  "tags": ["tag1", "tag2"],
  "validated": false,
  "path": "path/to/app/archive",
  "providers": [
    {
      "providerQName": "TM1",
      "pkgID": "some Docker Image ID"
    }
  ],
  "transformationsCount": 0,
  "envDepsCount": 0,
  "softDepsCount": 0,
  "fileDepsCount": 0
}, ...

The providers are specified using abstract identifiers and Docker image identifiers. The repository does not keep the information about task managers, assuming that routing is performed by request routers which can substitute the abstract identifiers with the real ones when needed.


Responses containing the providers' abstract identifiers can therefore be passed directly to consumers without the risk of exposing sensitive information about internal endpoints.

Afterwards, the provisioning-ready specification, i.e. a Dockerfile, needs to be generated based on the application package specification. The resulting set of files and specifications has to be archived and stored on the file system. Copies of the schema files and the sample inputs and outputs from the test run specification are also left outside the archive to allow analyzing them via the GUI. Additionally, based on the publishing scenario, provisioning and testing might happen. After everything is finished, the respective information in the database has to be updated and a successful response is returned with the JSON representation of the application. Commonly, a Dockerfile is constructed using some image as its basis [Nic16]. This defines which OS and potentially which other software will be used. One important issue is to use a proper OS which is compatible with the specified dependencies and especially with the commands which use specific package managers like apt-get. In the prototype we use the ubuntu:14.04 base image for the generation of Dockerfiles. This fact also leads to limitations like the usage of apt-get as the package manager. However, the conceptual model of the data transformation application can be extended to support the specification of a certain OS as a dependency, including, e.g., its exact version. The Dockerfile syntax is very concise and consists of short instructions, e.g. FROM, RUN, or COPY. The main task is to map the required information from the application package specification to Dockerfile instructions and, based on these mappings, to generate the actual Dockerfile. For instance, the publisher's information is used to specify the image maintainer with the respective LABEL maintainer=“John Doe” instruction. We do not include the full description of all instructions related to the creation of predefined folders for storing the application's artifacts as well as the instructions related to setting file permissions. Instead, as an example we describe how the folder for the application's files is created and the correct permissions are set. Consider the excerpt from a Dockerfile shown in Listing 4.5.

Listing 4.5 Example of Dockerfile instructions for copying application's files and setting permissions to execute them
FROM ubuntu:14.04
LABEL maintainer="[email protected]"
ENV APPHOME /app
RUN mkdir ${APPHOME}
COPY app ${APPHOME}
WORKDIR ${APPHOME}
RUN chmod -R a+x *
...

Firstly, a corresponding environment variable is generated with the help of the ENV instruction and a respective directory is created using RUN mkdir $APPHOME. Next, the application's files are copied using the COPY instruction into the predefined folder in the provisioning-ready package's structure described in Section 3.3.

9 Canonical Ltd., AptGet: https://help.ubuntu.com/community/AptGet/Howto


The last steps are changing the working directory using the WORKDIR instruction and setting the file permissions, e.g. RUN chmod -R a+x *.

In some cases it is a direct 1:1 mapping, whereas in others several instructions have to be used. Table 4.4 describes how particular information from the application package specification is mapped to Dockerfile instructions.

Application’s Information Docker Instruction Description

Publisher LABEL As the publisher is responsible forthe application, the value Author inmetadata will contain the informa-tion about publisher

Application files COPY Application’s files need to be copiedinto a respective predefined folder

Environment Dependency ENV The values of environment depen-dencies defined in application pack-age specification are set using thisinstruction

Software DependencyCOPY If provided, software dependency’s

files need to be copied into a respec-tive folder relative to the predefinedsoftware dependencies folder

RUN The set of specified installation com-mands have to be executed

File Dependency COPY File dependencies need to be copiedinto a respective folder relative tothe predefined file dependenciesfolder

Configuration RUN Specified configuration commandsneed to be executed

TestRun COPY Specified sample inputs and out-puts need to be copied into the pre-defined testrun folder

Table 4.4: Mapping of the application’s information to Dockerfile’s instructions

Basically, all files provided along with the application, such as software or file dependencies and the sample inputs and outputs in test run specifications, have to be copied using the COPY instruction. The environment dependencies can be set using the ENV instruction. Any specified command needs to be executed using the RUN instruction. As commands are provided in arrays, the generator loops through the items, creating a RUN instruction which executes multiple commands separated by semicolons. An important point is to use exactly one RUN instruction for semantically-related commands.


Essentially, when a RUN instruction is executed, a new instance of a shell is created. Hence, running two connected commands, e.g. opening a folder and renaming a file in this folder, using different RUN instructions will result in an error. We assume that the order of the items in the array is correctly specified by the publisher, hence this order is used for ordering the generated instructions. A similar approach is used for the overall ordering of dependency-related instructions in the Dockerfile. The overall generation process consists of several steps: parsing the application package specification, iterating over the groups of entities, and mapping the information onto Dockerfile instructions. In general, the main idea is to copy all of the application's artifacts into the respective predefined folders and run the specified commands.
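The generation step can be pictured roughly as follows; the sketch covers only a subset of the mappings from Table 4.4 and uses the field names that appear in the listings of this chapter, so it is an illustration rather than the prototype's actual generator.

def generate_dockerfile(spec, base_image="ubuntu:14.04"):
    """Generate Dockerfile instructions from a parsed application package specification."""
    lines = [
        f"FROM {base_image}",
        f'LABEL maintainer="{spec["appInfo"]["appPublisher"]}"',
        "ENV APPHOME /app",
        "RUN mkdir ${APPHOME}",
        "COPY app ${APPHOME}",
    ]
    for dependency in spec.get("dependencies", {}).get("softDeps", []):
        lines.append("#depName: " + dependency["depName"])
        # Semantically-related commands are joined into a single RUN instruction.
        lines.append("RUN " + ";".join(dependency["commands"]))
    lines += ["WORKDIR ${APPHOME}", "RUN chmod -R a+x *", 'CMD ["/sbin/init"]']
    return "\n".join(lines)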

4.5 Requesting a Transformation

After finding a suitable transformation, the consumer needs to send an HTTP POST request to the /tasks resource. Essentially, this request consists of the application and transformation identifiers, the provider information, and the specification of inputs including their access links. In case the application was found via searching the repository, the provider information is available as it can be taken directly from the response. However, if the consumer issues the request knowing only the application's details, the provider information can be omitted. In such a case the default provider is used in the prototype. Listing 4.6 shows the overall structure of a transformation request.

Listing 4.6 Structure of the transformation request
{
  "appID": "string",
  "transformationID": "string",
  "resultsEndpoint": "string",
  "providers": [ ... ],
  "inputParams": [ ... ],
  "inputFiles": [
    { "format": "string", "link": "string" }
  ],
  "inputFileSets": [
    { "format": "string", "count": 0, "linkToArchive": "string" }
  ]
}

As discussed in Section 3.6, inputs can be provided in a push- or pull-based manner. In the prototype we use a combination of both. Input parameters are included inside the request, whereas file-based input types are provided as links. This approach has several advantages. Firstly, the consumer does not need to upload files multiple times in case the same or similar requests have to be issued. Secondly, this approach fits nicely into the picture of communication with the TraDE Middleware, which already has a data model consisting of cross-partner data objects and their data elements which can be accessed via links. After receiving the request, the inputs are downloaded into the corresponding temporary folders created for the transformation task.


The process of handling transformation tasks might take a long time, thus it is more convenient to implement asynchronous communication. For this reason the request has to include information about the respective endpoint for providing the output results, which can either be pushed to the consumer or pulled by the consumer. In the latter case the endpoint is used for sending a notification that the task is completed. We choose to use the push-based approach as it requires less effort on the TraDE Middleware's side, since it does not need to implement polling. When the task is finished, the results are uploaded to the given address.
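To illustrate the push-based approach, the sketch below shows how a consumer such as the TraDE Middleware could expose a results endpoint to which the task manager uploads the outputs; the route, the multipart upload format, and the storage location are assumptions made for this example.

from flask import Flask, request

consumer_app = Flask(__name__)

@consumer_app.route("/results/<task_id>", methods=["POST"])
def receive_results(task_id):
    # The task manager is assumed to upload each output file as a multipart form part.
    for name, uploaded_file in request.files.items():
        uploaded_file.save(f"/tmp/{task_id}_{name}")
    return "", 204

if __name__ == "__main__":
    consumer_app.run(port=8080)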


5 Case Study

In this chapter we demonstrate how the introduced concepts can be applied to a real world example. More specifically, we analyze how parts of the motivational example discussed in Section 1.2.6 can be simplified by modeling them implicitly as data transformations and invoking them using the introduced framework for handling data transformation applications.

Recall the choreography modeling example shown in Figure 1.2 from Section 1.2.6 describing the KMC simulation using the OPAL software. From the conceptual point of view, one specific part of this choreography is not a part of the actual simulation process, but rather changes the representation of existing data values. Figure 5.1 highlights the participant OpalVisual, which transforms the snapshots data into an .mp4 video and creates a plot of the saturation data.

Figure 5.1: A highlighted OpalVisual participant in the OPAL simulation choreography that is solely responsible for the data transformation


The results of its execution do not influence the simulation, and even if this participant were removed from the choreography, the simulation could still be considered successfully completed. In fact, when following this explicit modeling approach, if other representations are required, then either the OpalVisual participant needs to be extended with additional activities or another participant has to be created, depending on the representation's semantics, e.g. whether it is a visualization or not. This is a perfectly valid approach which, however, leads to an increase of visual clutter in the choreography model, making it less readable and understandable.

Switching to an implicit modeling of these transformations might help to make the model more concise without making it less comprehensive. However, this approach implies having means to represent a transformation in the model without using the notions of participants and activities. In the TraDE modeling concepts, the connectors between data elements represent the data flow. One way to provide such means is to enhance the notion of the data flow by introducing different types of connectors. For instance, if a data element belonging to some cross-partner data object has to be transformed in order to be transferred to its next destination, a special type of connector can be used. Figure 5.2 shows an example of enhanced data flow connectors linking the respective data elements belonging to the “simResults” and “media” cross-partner data objects. The representation of the snapshots data is transformed into a video and a diagram of the saturation data is generated. The OpalVisual participant is removed from the model, making it less verbose.

Figure 5.2: The implicit data transformation modeling using enhanced data flow connectors

To provide the reader with better insight, Figure 5.3 demonstrates the complete OPAL choreography with the implicit data transformation modeling applied. We focus on the opal3dAnimatedLoopFile application responsible for the transformation of the snapshots into a video (the transformAllSnapshots2Video data flow in Figure 5.2) due to its more complex organization, which allows us to demonstrate more aspects of the process.


Figure 5.3: The modified OPAL simulation choreography with the implicit data transformation modeling

This application is written in the Python programming language and relies on several external modules, including: (i) numpy, a scientific computing package for Python; (ii) matplotlib, a plotting library; and (iii) ffmpeg, a multimedia framework for working with audio and video.

First, we need to create an application package specification for the opal3dAnimatedLoopFile application. Listing 5.1 shows the excerpt from the specification with the application's description. The next step is to describe the supported transformations.

Listing 5.1 The specification of general information about the application
"appInfo": {
  "appName": "opal3dAnimatedLoopFile",
  "appVersion": "1.0.0",
  "appPublisher": "Michael Hahn",
  "appDesc": "Transformation of the snapshots into video",
  "appDevelopers": ["Michael Hahn"],
  "appLicense": "Apache License 2.0",
  "tags": [ "opal", "simulation", "video"]
}

In this case, the application supports only one transformation, which takes two mandatory input parameters and transforms a set of .dat files into an .mp4 video. One important thing is that the file names are not used in the invocation command, and the application expects them to be found next to the executable.

1 NumPy developers, NumPy: http://www.numpy.org/index.html
2 The Matplotlib development team, Matplotlib: http://matplotlib.org/
3 FFmpeg team, FFmpeg: https://ffmpeg.org


The first parameter specifies a string prefix which is used in the files' names. The second defines how many files will be used for the transformation. As a result of the transformation, one video file is produced next to the executable. Additionally, the name and qname specifications need to be provided. Listing 5.2 shows the specification of the transformation for opal3dAnimatedLoopFile.

Listing 5.2 The application’s transformation specification"transformations": [

{

"name": "snapshots-to-video",

"qname": "simtech.opal.snapshots2video",

"inputParams": [

{

"inputName": "prefixName",

"alias": "$prefixName",

"isOptional": false,

"type": "string"

},

{

"inputName": "numberOfFilesToAnimate",

"alias": "$numberOfFilesToAnimate",

"isOptional": false,

"type": "integer"

},

],

"inputFileSets": [

{

"inputName": "snapshots",

"alias": "$inputsShapshots",

"isOptional": false,

"format": "text/plain",

"requiredPath": "{r}/",

"fileSetSize": "$numberOfFilesToAnimate"

}

],

"outputFiles": [

{

"outputName": "opalClusterSnapshotsVideo",

"alias": "$outputVideo",

"format": "video/mp4",

"accessPath": "{r}/"

}

]

}

]

The input files are modeled as an InputFileSet. In the prototype we use the {r} qualifier in front of the path string to specify that the rest of the string is a path relative to the application's folder. Another option to distinguish between relative and absolute paths is to extend the model by adding new properties.


We omit the discussion of a complete specification of the dependencies due to its size and the overall similarity of the process. One thing to notice is that we need to include every additional dependency apart from the aforementioned modules. For instance, the proper Python distribution is itself a dependency. Listing 5.3 shows how the aforementioned external modules and libraries are specified.

Listing 5.3 The specification of the application's dependencies
"dependencies": {
  "softDeps": [
    ...,
    {
      "depName": "numpy",
      "alias": "$numpy",
      "depDesc": "fundamental package for scientific computing with Python",
      "depVersion": ">1.13.0",
      "commands": ["sudo apt-get -y install python-numpy"]
    },
    {
      "depName": "matplotlib",
      "alias": "$matplotlib",
      "depDesc": "plotting library for the Python programming language",
      "depVersion": ">2.0.0",
      "commands": ["sudo apt-get -y install python-matplotlib"]
    },
    {
      "depName": "ffmpeg",
      "alias": "$ffmpeg",
      "depDesc": "Install FFmpeg storing animated plots as videos",
      "depVersion": ">3.0",
      "commands": [
        "sudo apt-get -y install ppa-purge software-properties-common",
        "sudo ppa-purge -y ppa:mc3man/trusty-media",
        "sudo add-apt-repository -y ppa:mc3man/trusty-media",
        "sudo apt-get update",
        "sudo apt-get -y install ffmpeg"
      ]
    }
  ]
}

In the prototype we use the version string only as additional information which has no influence on the execution of the commands. All dependencies are installed from remote repositories, meaning that no dependency-related files are provided. The remaining part is to specify the invocation. The command to invoke this application requires only the two input parameters and is shown in Listing 5.4.

As a next step, the Dockerfile is generated based on the application package specification, relying on the previously discussed mappings. Listing 5.5 demonstrates the generated Dockerfile. It consists of instructions specifying the general application information and the creation of system folders. Moreover, there is a separate RUN instruction for every dependency, as well as instructions for setting the working directory to the folder containing the application's executable and for specifying a default command.


Listing 5.4 The specification of the application's invocation
"invocations": {
  "invocationsCLI": [ {
    "invName": "opal3dAnimatedLoopFile CLI invocation",
    "invDesc": "Specify the prefix of the snapshots and the number of snapshots to use",
    "command": "python opal3dAnimatedLoopFile.py $prefixName $numberOfFilesToAnimate"
  } ]
}

Listing 5.5 The resulting Dockerfile generated based on the application package specification
FROM ubuntu:14.04
LABEL maintainer="Michael Hahn"
#system folders
ENV APPHOME /app
COPY app ${APPHOME}
WORKDIR ${APPHOME}
RUN chmod -R a+x *
#dependencies
...
#depName: numpy
RUN sudo apt-get -y install python-numpy
#depName: matplotlib
RUN sudo apt-get -y install python-matplotlib
#depName: ffmpeg
RUN sudo apt-get -y install ppa-purge software-properties-common;sudo ppa-purge -y ppa:mc3man/trusty-media;sudo add-apt-repository -y ppa:mc3man/trusty-media;sudo apt-get update;sudo apt-get -y install ffmpeg
#Set working directory and default command
WORKDIR ${APPHOME}
CMD ["/sbin/init"]

One aspect we want to focus on is the definition of the default command. According to the official recommendations [Doc17a], Docker containers should be as clean and modular as possible. The stricter rule is to have one process per container, meaning that it is dedicated to a specific application. While this approach suits our needs, we cannot directly use the invocation command as the entry point for applications. Before running the application, the inputs need to be copied into the required locations inside the container to be able to execute the application. This can only happen when the container is running, which leads to the requirement of running the container without actually running the application. Another possibility is to use Docker volumes, i.e. to run containers with temporary volumes attached. For the prototype, we use the “/sbin/init” command, which is the equivalent of only running the operating system inside the container. After the image is created using the Dockerfile, it can be used for the invocation of the application using the inputs obtained from transformation requests. We use a pull-based approach for getting the inputs based on the links provided in the request. As soon as the inputs are downloaded and mapped to the transformation's inputs, they are copied into the respective locations inside the container. After these preparations are finished, the application is invoked and the resulting outputs are pushed to the TraDE Middleware.
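A minimal sketch of this invocation sequence using the Docker SDK for Python is shown below; the tar-based copy, the fixed /app working directory, and the cleanup strategy are assumptions made for this example rather than the prototype's exact behavior.

import io
import tarfile
import docker

def execute_in_container(image_id, input_files, command, app_home="/app"):
    """Start a container from the application's image, copy the inputs into it,
    run the invocation command, and return the console output."""
    client = docker.from_env()
    container = client.containers.run(image_id, detach=True, tty=True)
    try:
        # Pack the downloaded input files into a tar archive and copy it into the container.
        archive = io.BytesIO()
        with tarfile.open(fileobj=archive, mode="w") as tar:
            for path in input_files:
                tar.add(path, arcname=path.split("/")[-1])
        archive.seek(0)
        container.put_archive(app_home, archive.read())
        # Invoke the application inside the running container.
        exit_code, output = container.exec_run(command, workdir=app_home)
        if exit_code != 0:
            raise RuntimeError(output.decode())
        return output.decode()
    finally:
        container.remove(force=True)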


6 Conclusion and Future Work

In general, computer-based experiments are complex due to multiple factors, including the heterogeneity of scientific software and its dependencies, diverse environments, the need to wrap the software, and complicated configuration processes. While service choreographies help to reduce this complexity by structuring the communication processes and the TraDE Middleware adds flexibility by introducing data flow transparency, the process still remains generally complex. Furthermore, technical aspects of the data flow, such as the transformation of its data elements along the path, could add unnecessary complexity to the models of experiments. The issue of orchestrating computer-based experiments is solved by workflow enactment systems relying on powerful yet technically complex orchestration languages like BPEL or BPMN. On the other hand, explicitly modeling and executing the data transformation applications independently for every involved participant is not practical, as it requires manually wrapping them as web services and implies that the modeler has expertise in orchestration languages. Having a way to handle data transformation applications in a standardized manner can simplify the process of modeling and executing scientific service choreographies with the TraDE concepts applied. Moreover, models without additional visual clutter are easier to understand. As we have shown in Chapter 5, an implicit modeling of data transformations in the context of the TraDE Middleware makes service choreographies more concise without losing their meaning.

In this thesis we conceptualize a framework which allows addressing this problem and provide its prototypical implementation. In Chapter 2 we discuss the background concepts and research topics which are relevant for this research. Next, in Chapter 3 we introduce the concepts for handling the data transformation applications and design the framework supporting these concepts. We start with a set of assumptions underlying our work, focusing on a black-box approach towards data transformation applications which do not require user interaction and use files as the main unit of interchange. We then discuss various interaction models between the actors involved in the process of interacting with such applications. Next, we derive the extensible conceptual model of data transformation applications and describe the application packaging details. As a starting point of the framework design, we present a generic framework which we then gradually improve by refining its parts. This process results in the unification of concepts in the refined framework. As the last parts, we explain the interaction scenarios in more detail and discuss how the framework can be integrated with the TraDE Middleware. Finally, Chapter 4 describes the details of the framework's prototypical implementation. The already mentioned case study discussed in Chapter 5 analyzes a real world use case and discusses how the presented concepts can be applied to one of its parts. In the next couple of paragraphs we discuss possible directions and concepts for future work.


General Ideas The concepts presented in this thesis are made extensible in order to support future modifications and enhancements. While we mostly discussed file-based communication, other input types might be added to the conceptual model, thus requiring improvements in every related component. Moreover, we discussed only the CLI-based invocation, and adding other types of invocations is another interesting modification. Likewise, data transformation software requiring user interaction is another kind of application which might be considered for addition to the prototype. In general, the publishing process can be improved by supporting different options such as uploading files separately in a two-step publishing process, e.g. in combination with authorization. The API can be made more fine-grained to support partial application updates and versioning. Any scenario involving file transfer can benefit from streaming. This might be particularly helpful in cases when large files are used in data transformations.

Repository Enhancements Several aspects related to the repository can be improved. For instance, a more complex search implementation can be introduced using dedicated search engines like Apache Solr or Elasticsearch. The composite transformation search is another interesting enhancement which requires a lot of effort not only from the repository's perspective, but also on the task manager's side. Additionally, an implementation of the repository's optimizer component might be useful in cases when a large number of applications is published. Discovering associative rules for dependencies and creating separate base images which can be used for the generation of the Dockerfiles might significantly reduce the image building times, making the overall deployment more convenient.

One significant point is adding support for other provisioning mechanisms, which requires generating different provisioning specifications as well as modifying the structure of provisioning-ready application packages and the corresponding metadata in the database. A particular example of supporting another provisioning technology is to implement the generation of TOSCA [OAS13] topologies based on application package specifications. This makes provisioning-ready packages platform-independent and allows using the OpenTOSCA [Bin+13] ecosystem for the automated provisioning of data transformation applications as well as the management of their life cycle. Furthermore, the ecosystem provides a GUI-based topology modeling tool, Winery [Kop+13], which allows to graphically specify and visualize the topology (e.g., components and their dependencies) of an application.
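A rough sketch of such a generator is given below. It maps a hypothetical application package specification to a strongly simplified TOSCA-like XML structure; a real implementation would have to emit schema-valid TOSCA v1.0 definitions including namespaces, node and relationship types, and deployment artifacts.

# A rough sketch of how a future repository component could derive a TOSCA
# topology from an application package specification. The element names follow
# the TOSCA XML vocabulary (Definitions, ServiceTemplate, TopologyTemplate,
# NodeTemplate), but the output is deliberately simplified and omits the
# namespaces, types, and artifacts a schema-valid TOSCA v1.0 document requires.
import xml.etree.ElementTree as ET

def package_to_tosca(spec: dict) -> str:
    """Map a (hypothetical) application package specification to a minimal topology."""
    definitions = ET.Element("Definitions", id=f"{spec['name']}-definitions")
    service = ET.SubElement(definitions, "ServiceTemplate", id=spec["name"])
    topology = ET.SubElement(service, "TopologyTemplate")

    # One node template for the data transformation application itself ...
    ET.SubElement(topology, "NodeTemplate", id=spec["name"], type="DataTransformationApplication")
    # ... and one per declared dependency, e.g. an interpreter or a library.
    for dependency in spec.get("dependencies", []):
        ET.SubElement(topology, "NodeTemplate", id=dependency, type="SoftwareDependency")

    return ET.tostring(definitions, encoding="unicode")

# Example specification resembling the package metadata used in this thesis.
print(package_to_tosca({"name": "opal-to-csv", "dependencies": ["python-2.7", "numpy"]}))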

Task Manager Enhancements As with the repository, adding support for other provisioning engines requires implementing corresponding deployers. Another interesting direction is supporting the invocation of different kinds of applications, e.g., web services. In the presented concepts we focused mainly on automated transformation scenarios due to the need to integrate the framework with the TraDE Middleware. However, GUI-based interaction methods are also an interesting approach, especially for use cases in which the framework is used independently of other systems.
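One possible way to keep the task manager open for additional provisioning engines is to hide each engine behind a common deployer interface, as sketched below. The interface and the class names are illustrative assumptions rather than the prototype’s actual code.

# An illustrative sketch (not the prototype's actual code) of how the task
# manager could abstract over provisioning engines: each engine is wrapped in a
# deployer implementing a common interface, so adding support for, e.g., an
# OpenTOSCA-based deployer does not affect the rest of the task manager.
from abc import ABC, abstractmethod

class Deployer(ABC):
    """Common contract for deploying provisioning-ready application packages."""

    @abstractmethod
    def deploy(self, package_id: str) -> str:
        """Provision the application and return an endpoint or container id."""

    @abstractmethod
    def undeploy(self, instance_id: str) -> None:
        """Tear the provisioned application instance down again."""

class DockerDeployer(Deployer):
    def deploy(self, package_id: str) -> str:
        # In the prototype this would build and run the generated Docker image;
        # here we only indicate where that call would happen.
        return f"container-for-{package_id}"

    def undeploy(self, instance_id: str) -> None:
        print(f"stopping {instance_id}")

# The task manager can then pick a deployer per package, e.g. from its metadata.
deployers = {"docker": DockerDeployer()}
instance = deployers["docker"].deploy("opal-to-csv-1.1.0")
deployers["docker"].undeploy(instance)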


Bibliography

[ACD10] A. A. Almonaies, J. R. Cordy, T. R. Dean. “Legacy system evolution towards service-oriented architecture.” In: International Workshop on SOA Migration and Evolution. 2010, pp. 53–62 (cit. on p. 29).

[Alt+04] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, S. Mock. “Kepler: an extensible system for design and execution of scientific workflows.” In: Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on. IEEE. 2004, pp. 423–424 (cit. on p. 14).

[ASV13] A. Afanasiev, O. Sukhoroslov, V. Voloshinov. “MathCloud: publication and reuse of scientific applications as RESTful web services.” In: International Conference on Parallel Computing Technologies. Springer. 2013, pp. 394–408 (cit. on pp. 36, 38).

[Bar47] M. S. Bartlett. “The use of transformations.” In: Biometrics 3.1 (1947), pp. 39–52 (cit. on p. 15).

[BBW08] B. Balis, M. Bubak, M. Wegiel. “LGF: A flexible framework for exposing legacy codes as services.” In: Future Generation Computer Systems 24.7 (2008), pp. 711–719 (cit. on p. 37).

[Bel+15] P. Belmann, J. Dröge, A. Bremges, A. C. McHardy, A. Sczyrba, M. D. Barton. “Bioboxes: standardised containers for interchangeable bioinformatics software.” In: Gigascience 4.1 (2015), p. 47 (cit. on p. 46).

[Ber14] D. Bernstein. “Containers and cloud: From lxc to docker to kubernetes.” In: IEEE Cloud Computing 1.3 (2014), pp. 81–84 (cit. on pp. 40, 42).

[Bin+13] T. Binz, U. Breitenbücher, F. Haupt, O. Kopp, F. Leymann, A. Nowak, S. Wagner. “OpenTOSCA - A Runtime for TOSCA-based Cloud Applications.” In: Proceedings of 11th International Conference on Service-Oriented Computing (ICSOC’13). Vol. 8274. LNCS. Springer Berlin Heidelberg, 2013, pp. 692–695. DOI: 10.1007/978-3-642-45005-1_62 (cit. on p. 108).

[Bis+99] J. Bisbal, D. Lawless, B. Wu, J. Grimson. “Legacy information systems: Issues and directions.” In: IEEE software 16.5 (1999), pp. 103–111 (cit. on p. 28).

[BK16] R. Bawa, I. Kaur. “Algorithmic Approach for Efficient Retrieval of Component Repositories in Component based Software Engineering.” In: Indian Journal of Science and Technology 9.48 (2016) (cit. on pp. 27, 70).

[BN06] E. W. Biederman, L. Networx. “Multiple instances of the global linux namespaces.” In: Proceedings of the Linux Symposium. Vol. 1. 2006, pp. 101–112 (cit. on p. 41).


[Boe15] C. Boettiger. “An introduction to Docker for reproducible research.” In: ACM SIGOPS Operating Systems Review 49.1 (2015), pp. 71–79 (cit. on pp. 44, 57).

[BS03] P. Binkele, S. Schmauder. “An atomistic Monte Carlo simulation of precipitation in a binary system.” In: Zeitschrift für Metallkunde 94.8 (2003), pp. 858–863 (cit. on p. 16).

[Can+06] G. Canfora, A. R. Fasolino, G. Frattolillo, P. Tramontana. “Migrating interactive legacy systems to web services.” In: Software Maintenance and Reengineering, 2006. CSMR 2006. Proceedings of the 10th European Conference on. IEEE. 2006, 10–pp (cit. on p. 35).

[Can+08] G. Canfora, A. R. Fasolino, G. Frattolillo, P. Tramontana. “A wrapping approach for migrating legacy system interactive functionalities to service oriented architectures.” In: Journal of Systems and Software 81.4 (2008), pp. 463–480 (cit. on p. 35).

[CD+00] S. Comella-Dorda, K. Wallnau, R. C. Seacord, J. Robert. A survey of legacy system modernization approaches. Tech. rep. Carnegie-Mellon Univ Pittsburgh PA Software Engineering Inst, 2000 (cit. on p. 29).

[Cha04] D. Chappell. Enterprise Service Bus: Theory in Practice. Theory in practice. O’Reilly Media, 2004. ISBN: 9781449391096 (cit. on p. 78).

[CS14] R. Chamberlain, J. Schommer. “Using Docker to support reproducible research.” In: (2014). DOI: https://doi.org/10.6084/m9.figshare.1101910 (cit. on p. 44).

[Dec+07] G. Decker, O. Kopp, F. Leymann, M. Weske. “BPEL4Chor: Extending BPEL for modeling choreographies.” In: Web Services, 2007. ICWS 2007. IEEE International Conference on. IEEE. 2007, pp. 296–303 (cit. on pp. 14, 15).

[Dec+09] G. Decker, O. Kopp, F. Leymann, M. Weske. “Interacting services: from specification to execution.” In: Data & Knowledge Engineering 68.10 (Apr. 2009), pp. 946–972 (cit. on p. 17).

[Dee+09] E. Deelman, D. Gannon, M. Shields, I. Taylor. “Workflows and e-Science: An overview of workflow system features and capabilities.” In: Future Generation Computer Systems 25.5 (2009), pp. 528–540 (cit. on p. 14).

[Del+05] T. Delaitre, T. Kiss, A. Goyeneche, G. Terstyanszky, S. Winter, P. Kacsuk. “GEMLCA: Running legacy code applications as grid services.” In: Journal of Grid Computing 3.1-2 (2005), pp. 75–90 (cit. on p. 37).

[DK76] F. DeRemer, H. H. Kron. “Programming-in-the-large versus programming-in-the-small.” In: IEEE Transactions on Software Engineering 2 (1976), pp. 80–86 (cit. on p. 11).

[DKB08] G. Decker, O. Kopp, A. Barros. “An Introduction to Service Choreographies.” In: Information Technology 50.2 (2008), pp. 122–127. DOI: 10.1524/itit.2008.0473 (cit. on p. 14).


[DPP05] C. Diamantini, D. Potena, M. Panti. “Wrapping Legacy Code for a Service Oriented Knowledge Discovery Support system.” In: Proc. of IADIS Int. conf. WWW/Internet 2005. 2005, pp. 19–22 (cit. on p. 36).

[DRK14] R. Dua, A. R. Raja, D. Kakadia. “Virtualization vs containerization to support paas.” In: Cloud Engineering (IC2E), 2014 IEEE International Conference on. IEEE. 2014, pp. 610–614 (cit. on p. 43).

[DSM16] M. Dabhade, S. Suryawanshi, R. Manjula. “A systematic review of software reuse using domain engineering paradigms.” In: Green Engineering and Technologies (IC-GET), 2016 Online International Conference on. IEEE. 2016, pp. 1–6 (cit. on p. 26).

[EE01] H. Enderton, H. Enderton. A Mathematical Introduction to Logic. Elsevier Science, 2001. ISBN: 9780080496467 (cit. on p. 67).

[Elk16] M. M. Elkhouly. “Software Component as a Service (SCaaS).” In: Australian Journal of Basic and Applied Sciences 10.16 (2016), pp. 45–51 (cit. on p. 38).

[Erl16] T. Erl. Service-oriented Architecture: Analysis and Design for Services and Microservices. The Prentice Hall Service Technology Series from Thomas Erl Series. Prentice Hall, 2016. ISBN: 9780133858587 (cit. on p. 26).

[eSc17] I. I. C. on eScience. IEEE International Conference on eScience. https://escience-conference.org. 2017 (cit. on p. 14).

[Fel+15] W. Felter, A. Ferreira, R. Rajamony, J. Rubio. “An updated performance comparison of virtual machines and linux containers.” In: Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium On. IEEE. 2015, pp. 171–172 (cit. on pp. 40, 42, 43).

[Fin09] E. L. Fink. “The FAQs on data transformation.” In: Communication Monographs 76.4 (2009), pp. 379–397 (cit. on p. 15).

[Fou17] C. Foundry. Warden Documentation. http://docs.cloudfoundry.org/concepts/architecture/warden.html. 2017 (cit. on p. 42).

[Fow10a] M. Fowler. Domain-Specific Languages. Addison-Wesley Signature Series (Fowler). Pearson Education, 2010. ISBN: 9780131392809 (cit. on p. 93).

[Fow10b] M. Fowler. Richardson Maturity Model. 2010. URL: https://www.martinfowler.com/articles/richardsonMaturityModel.html (cit. on pp. 90, 92).

[FSB16] T. C. Fanelli, S. C. Simons, S. Banerjee. “A Systematic Framework for Modernizing Legacy Application Systems.” In: Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on. Vol. 1. IEEE. 2016, pp. 678–682 (cit. on p. 29).

[Gla+08] T. Glatard, J. Montagnat, D. Emsellem, D. Lingrand. “A Service-Oriented Architecture enabling dynamic service grouping for optimizing distributed workflow execution.” In: Future Generation Computer Systems 24.7 (2008), pp. 720–730 (cit. on p. 37).

[GLH15] S. García, J. Luengo, F. Herrera. Data preprocessing in data mining. Springer, 2015 (cit. on p. 15).


[GML00] G. C. Gannod, S. V. Mudiam, T. E. Lindquist. “An architecture-based approach for synthesizing and integrating adapters for legacy software.” In: Reverse Engineering, 2000. Proceedings. Seventh Working Conference on. IEEE. 2000, pp. 128–137 (cit. on p. 32).

[GMW10] D. Garlan, R. Monroe, D. Wile. “Acme: An architecture description interchange language.” In: CASCON First Decade High Impact Papers. IBM Corp. 2010, pp. 159–173 (cit. on p. 32).

[Gol73] R. P. Goldberg. “Architecture of virtual machines.” In: Proceedings of the workshop on virtual computer systems. ACM. 1973, pp. 74–112 (cit. on p. 40).

[Guo+05] H. Guo, C. Guo, F. Chen, H. Yang. “Wrapping client-server application to Web services for Internet computing.” In: Parallel and Distributed Computing, Applications and Technologies, 2005. PDCAT 2005. Sixth International Conference on. IEEE. 2005, pp. 366–370 (cit. on p. 35).

[Hah+17a] M. Hahn, U. Breitenbücher, O. Kopp, F. Leymann. “Modeling and execution of data-aware choreographies: an overview.” In: Computer Science - Research and Development (2017). ISSN: 1865-2042. DOI: 10.1007/s00450-017-0387-y (cit. on pp. 16–18).

[Hah+17b] M. Hahn, U. Breitenbücher, F. Leymann, A. Weiß. “TraDE - A Transparent Data Exchange Middleware for Service Choreographies.” In: On the Move to Meaningful Internet Systems. OTM 2017 Conferences: Confederated International Conferences: CoopIS, C&TC, and ODBASE 2017, Rhodes, Greece, October 23-27, 2017, Proceedings, Part I. Vol. 10573. Lecture Notes in Computer Science. Cham: Springer International Publishing AG, Oct. 2017, pp. 252–270. ISBN: 978-3-319-69462-7. DOI: 10.1007/978-3-319-69462-7_16 (cit. on pp. 21–23).

[HKL16] M. Hahn, D. Karastoyanova, F. Leymann. “A Management Life Cycle for Data-Aware Service Choreographies.” In: Proceedings of the 23rd International Conference on Web Services (ICWS 2016). IEEE Computer Society, 2016, pp. 364–371. DOI: 10.1109/ICWS.2016.54 (cit. on p. 15).

[HM84] E. Horowitz, J. B. Munson. “An expansive view of reusable software.” In: IEEE Transactions on Software Engineering 5 (1984), pp. 477–487 (cit. on p. 24).

[Hos+16] A. Hosny, P. Vera-Licona, R. Laubenbacher, T. Favre. “AlgoRun: a Docker-based packaging system for platform-agnostic implemented algorithms.” In: Bioinformatics 32.15 (2016), pp. 2396–2398 (cit. on pp. 45, 53).

[HS95] M. A. Hernández, S. J. Stolfo. “The merge/purge problem for large databases.” In: ACM Sigmod Record. Vol. 24. 2. ACM. 1995, pp. 127–138 (cit. on p. 74).

[HS98] M. A. Hernández, S. J. Stolfo. “Real-world data is dirty: Data cleansing and the merge/purge problem.” In: Data mining and knowledge discovery 2.1 (1998), pp. 9–37 (cit. on p. 74).


[HT02] H. Haddad, H. Tesser. “Reusable subsystems: domain-based approach.” In: Proceedings of the 2002 ACM symposium on Applied computing. ACM. 2002, pp. 971–975 (cit. on p. 30).

[HT03] A. J. Hey, A. E. Trefethen. “The data deluge: An e-science perspective.” In: (2003) (cit. on p. 13).

[Hua+03] Y. Huang, I. Taylor, D. W. Walker, R. Davies. “Wrapping legacy codes for grid-based applications.” In: Parallel and Distributed Processing Symposium, 2003. Proceedings. International. IEEE. 2003, 7–pp (cit. on pp. 36, 37).

[HW12] G. Hohpe, B. Woolf. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Signature Series (Fowler). Pearson Education, 2012. ISBN: 9780133065107 (cit. on p. 86).

[HX06] H. M. Haddad, Y. Xie. “Wrapper-based framework for domain-specific software reuse.” In: Journal of information science and engineering 22.2 (2006), p. 269 (cit. on pp. 31, 32).

[Ini17] O. A. Initiative. Open API Initiative (OAI). https://www.openapis.org/. 2017 (cit. on p. 89).

[Jan07] N. W. Jankowski. “Exploring e-science: an introduction.” In: Journal of Computer-Mediated Communication 12.2 (2007), pp. 549–562 (cit. on p. 13).

[Joh01] G. Johnson. The World: In Silica Fertilization; All Science Is Computer Science. Ed. by T. N. Y. Times. [Online; posted 25-March-2001]. 2001. URL: http://www.nytimes.com/2001/03/25/weekinreview/the-world-in-silica-fertilization-all-science-is-computer-science.html (cit. on p. 13).

[Joy15] A. M. Joy. “Performance comparison between linux containers and virtual machines.” In: Computer Engineering and Applications (ICACEA), 2015 International Conference on Advances in. IEEE. 2015, pp. 342–346 (cit. on p. 43).

[Juh+09] E. Juhnke, D. Seiler, T. Stadelmann, T. Dörnemann, B. Freisleben. “LCDL: an extensible framework for wrapping legacy code.” In: Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services. ACM. 2009, pp. 648–652 (cit. on pp. 30, 53).

[Kim+17] B. Kim, T. A. Ali, C. Lijeron, E. Afgan, K. Krampis. “Bio-Docklets: Virtualization Containers for Single-Step Execution of NGS Pipelines.” In: bioRxiv (2017), p. 116962 (cit. on p. 46).

[Kiv+07] A. Kivity, Y. Kamay, D. Laor, U. Lublin, A. Liguori. “kvm: the Linux virtual machine monitor.” In: Proceedings of the Linux symposium. Vol. 1. 2007, pp. 225–230 (cit. on p. 43).

[Kop+13] O. Kopp, T. Binz, U. Breitenbücher, F. Leymann. “Winery - A Modeling Tool for TOSCA-based Cloud Applications.” In: Proceedings of 11th International Conference on Service-Oriented Computing (ICSOC’13). Vol. 8274. LNCS. Springer Berlin Heidelberg, 2013, pp. 700–704. DOI: 10.1007/978-3-642-45005-1_64 (cit. on p. 108).


[KSB17] G. M. Kurtzer, V. Sochat, M. W. Bauer. “Singularity: Scientific containers for mobility of compute.” In: PloS one 12.5 (2017), e0177459 (cit. on p. 45).

[LCJ03] B.-J. Lee, T. Choi, T. Jeong. “X-CLI: CLI-based management architecture using XML.” In: Integrated Network Management VIII. Springer, 2003, pp. 477–480 (cit. on p. 32).

[Lin17] LinuxContainers.org. LXC Documentation. https://linuxcontainers.org/lxc/documentation/. 2017 (cit. on p. 42).

[LK15] W. Li, A. Kanso. “Comparing containers versus virtual machines for achieving high availability.” In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE. 2015, pp. 353–358 (cit. on p. 43).

[LR00] F. Leymann, D. Roller. Production Workflow: Concepts and Techniques. Prentice Hall PTR, 2000. ISBN: 9780130217530 (cit. on p. 14).

[LZ16] D. Liu, C. Zhang. “Comparison and analysis of methods for migrating legacy systems to SOA.” In: MATEC Web of Conferences. Vol. 77. EDP Sciences. 2016, p. 04005 (cit. on p. 29).

[Mas11] M. Masse. REST API Design Rulebook. Oreilly and Associate Series. O’Reilly Media, 2011. ISBN: 9781449310509 (cit. on p. 90).

[MC07] P. Mohagheghi, R. Conradi. “Quality, productivity and economic benefits of software reuse: a review of industrial studies.” In: Empirical Software Engineering 12.5 (2007), pp. 471–516 (cit. on p. 25).

[McI68] D. M. McIlroy. “Mass-Produced Software Components.” In: Software Engineering Concepts and Techniques (1968 NATO Conference of Software Engineering). Ed. by J. M. Buxton, P. Naur, B. Randell. NATO Science Committee, 1968, pp. 138–155 (cit. on pp. 23, 24).

[Men+17] P. Menage et al. CGROUPS. https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt. 2017 (cit. on p. 41).

[MKK15] R. Morabito, J. Kjällman, M. Komu. “Hypervisors vs. lightweight virtualization: a performance comparison.” In: Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE. 2015, pp. 386–393 (cit. on pp. 41, 43).

[MMK+94] H. Mili, O. Marcotte, A. Kabbaj, et al. “Intelligent component retrieval for software reuse.” In: Proceedings of the Third Maghrebian Conference on Artificial Intelligence and Software Engineering. 1994, pp. 101–114 (cit. on pp. 27, 72, 73).

[MMM95] H. Mili, F. Mili, A. Mili. “Reusing software: Issues and research directions.” In: IEEE transactions on Software Engineering 21.6 (1995), pp. 528–562 (cit. on pp. 24, 26).

[Mor15] R. Morabito. “Power consumption of virtualization technologies: an empirical investigation.” In: Utility and Cloud Computing (UCC), 2015 IEEE/ACM 8th International Conference on. IEEE. 2015, pp. 522–527 (cit. on p. 43).

[MS70] R. A. Meyer, L. H. Seawright. “A virtual machine time-sharing system.” In: IBM Systems Journal 9.3 (1970), pp. 199–218 (cit. on p. 40).


[MSM+15] F. Moreews, O. Sallou, H. Ménager, et al. “BioShaDock: a community driven bioinformatics shared Docker-based tools registry.” In: F1000Research 4 (2015) (cit. on p. 46).

[Mul13] B. Mulloy. Web API design. 2013. URL: https://pages.apigee.com/rs/apigee/images/api-design-ebook-2012-03.pdf (cit. on p. 90).

[NEO03] H. B. Newman, M. H. Ellisman, J. A. Orcutt. “Data-intensive e-science frontier research.” In: Communications of the ACM 46.11 (2003), pp. 68–77 (cit. on p. 13).

[Nic16] J. Nickoloff. Docker in Action. 1st. Greenwich, CT, USA: Manning Publications Co., 2016. ISBN: 1633430235, 9781633430235 (cit. on p. 96).

[OAS06] OASIS. Web Services Business Process Execution Language Version 2.0. http://docs.oasis-open.org/wsbpel/2.0/wsbpel-specification-draft.html. 2006 (cit. on p. 14).

[PD93] R. Prieto-Díaz. “Status report: Software reusability.” In: IEEE software 10.3 (1993), pp. 61–66 (cit. on p. 24).

[PF16] S. R. Piccolo, M. B. Frampton. “Tools and techniques for computational reproducibility.” In: GigaScience 5.1 (2016), p. 30 (cit. on p. 44).

[PG74] G. J. Popek, R. P. Goldberg. “Formal requirements for virtualizable third generation architectures.” In: Communications of the ACM 17.7 (1974), pp. 412–421 (cit. on p. 40).

[PP93] A. Podgurski, L. Pierce. “Retrieving reusable software by sampling behavior.” In: ACM Transactions on Software Engineering and Methodology (TOSEM) 2.3 (1993), pp. 286–303 (cit. on p. 26).

[PR16] R. Priedhorsky, T. C. Randles. “Charliecloud: Unprivileged containers for user-defined software stacks.” In: (2016) (cit. on p. 45).

[RD00] E. Rahm, H. H. Do. “Data cleaning: Problems and current approaches.” In: IEEE Data Eng. Bull. 23.4 (2000), pp. 3–13 (cit. on p. 15).

[RR03] T. Ravichandran, M. A. Rothenberger. “Software reuse strategies and component markets.” In: Communications of the ACM 46.8 (2003), pp. 109–114 (cit. on p. 25).

[RS16] S. Rochimah, A. S. Sankoh. “Migration of Existing or Legacy Software Systems into Web Service-based Architectures (Reengineering Process): A Systematic Literature Review.” In: Migration 133.3 (2016) (cit. on p. 29).

[SA14] O. Sukhoroslov, A. Afanasiev. “Everest: A Cloud Platform for Computational Web Services.” In: CLOSER. 2014, pp. 411–416 (cit. on p. 38).

[SB15] P. Sturgeon, L. Bohill. Build APIs You Won’t Hate: Everyone and Their Dog Wants an API, So You Should Probably Learn How to Build Them. Philip J. Sturgeon, 2015. ISBN: 9780692232699 (cit. on pp. 90, 92, 93).

[SERS02] E. Stroulia, M. El-Ramly, P. Sorenson. “From legacy to web through interaction modeling.” In: Software Maintenance, 2002. Proceedings. International Conference on. IEEE. 2002, pp. 320–329 (cit. on p. 35).


[SGG07] J. Syed, M. Ghanem, Y. Guo. “Supporting scientific discovery processes in Discovery Net.” In: Concurrency and Computation: Practice and Experience 19.2 (2007), pp. 167–179 (cit. on p. 14).

[SK13] M. Sonntag, D. Karastoyanova. “Model-as-you-go: An Approach for an Advanced Infrastructure for Scientific Workflows.” In: Journal of Grid Computing 11.3 (2013), pp. 553–583. ISSN: 1570-7873. DOI: 10.1007/s10723-013-9268-1. URL: http://www2.informatik.uni-stuttgart.de/cgi-bin/NCSTRL/NCSTRL_view.pl?id=ART-2013-06&engl=0 (cit. on p. 14).

[Sne00] H. M. Sneed. “Encapsulation of legacy software: A technique for reusing legacy software components.” In: Annals of Software Engineering 9.1 (2000), pp. 293–313 (cit. on p. 29).

[Sne06] H. M. Sneed. “Integrating legacy software into a service oriented architecture.” In: Software Maintenance and Reengineering, 2006. CSMR 2006. Proceedings of the 10th European Conference on. IEEE. 2006, 11–pp (cit. on p. 34).

[Sne+06] H. M. Sneed et al. “Wrapping legacy software for reuse in a SOA.” In: Multikonferenz Wirtschaftsinformatik. Vol. 2. 2006, pp. 345–360 (cit. on p. 34).

[SNLM16] M. G. da Silva Neto, W. C. B. Lins, E. B. P. Mariz. “Services for Legacy Software Rejuvenation: A Systematic Mapping Study.” In: ICSEA 2016 (2016), p. 197 (cit. on p. 29).

[SS03] H. M. Sneed, S. H. Sneed. “Creating web services from legacy host programs.” In: Web Site Evolution, 2003. Theme: Architecture. Proceedings. Fifth IEEE International Workshop on. IEEE. 2003, pp. 59–65 (cit. on p. 34).

[SVA15] O. Sukhoroslov, S. Volkov, A. Afanasiev. “A web-based platform for publication and distributed execution of computing applications.” In: Parallel and Distributed Computing (ISPDC), 2015 14th International Symposium on. IEEE. 2015, pp. 175–184 (cit. on pp. 38, 39).

[SVIG07] R. Sears, C. Van Ingen, J. Gray. “To blob or not to blob: Large object storage in a database or a filesystem?” In: arXiv preprint cs/0701168 (2007) (cit. on p. 67).

[Tay+05] I. Taylor, M. Shields, I. Wang, A. Harrison. “Visual grid workflow in Triana.” In: Journal of Grid Computing 3.3-4 (2005), pp. 153–169 (cit. on pp. 14, 36).

[TKR70] R. Tagliacozzo, M. Kochen, L. Rosenberg. “Orthographic error patterns of author names in catalog searches.” In: Information Technology and Libraries 3.2 (1970), pp. 93–101 (cit. on p. 68).

[Val+16] T. Vale, I. Crnkovic, E. S. de Almeida, P. A. d. M. S. Neto, Y. C. Cavalcanti, S. R. de Lemos Meira. “Twenty-eight years of component-based software engineering.” In: Journal of Systems and Software 111 (2016), pp. 128–148 (cit. on p. 26).


[WBL15] J. Wettinger, U. Breitenbücher, F. Leymann. “Streamlining APIfication by Generating APIs for Diverse Executables Using Any2API.” In: International Conference on Cloud Computing and Services Science. Springer. 2015, pp. 216–238 (cit. on pp. 32, 33).

[Wei+17] A. Weiß, V. Andrikopoulos, M. Hahn, D. Karastoyanova. “Model-as-you-go for Choreographies: Rewinding and Repeating Scientific Choreographies.” In: IEEE Transactions on Services Computing PP.99 (2017), pp. 1–1. DOI: 10.1109/TSC.2017.2732988. URL: http://www2.informatik.uni-stuttgart.de/cgi-bin/NCSTRL/NCSTRL_view.pl?id=ART-2017-12&engl=0 (cit. on p. 15).

[Yam15] Y. Yamato. “OpenStack hypervisor, container and Baremetal servers performance comparison.” In: IEICE Communications Express 4.7 (2015), pp. 228–232 (cit. on p. 43).

[Zai+15] A. Zaimi, A. Ampatzoglou, N. Triantafyllidou, A. Chatzigeorgiou, A. Mavridis, T. Chaikalis, I. Deligiannis, P. Sfetsos, I. Stamelos. “An empirical study on the reuse of third-party libraries in open-source software development.” In: Proceedings of the 7th Balkan Conference on Informatics Conference. ACM. 2015, p. 4 (cit. on p. 25).

[Zdu02] U. Zdun. “Reengineering to the web: A reference architecture.” In: Software Maintenance and Reengineering, 2002. Proceedings. Sixth European Conference on. IEEE. 2002, pp. 164–173 (cit. on p. 34).

[ZW95] A. M. Zaremski, J. M. Wing. “Signature matching: a tool for using software libraries.” In: ACM Transactions on Software Engineering and Methodology (TOSEM) 4.2 (1995), pp. 146–170 (cit. on pp. 27, 70).

[ZW97] A. M. Zaremski, J. M. Wing. “Specification matching of software components.” In: ACM Transactions on Software Engineering and Methodology (TOSEM) 6.4 (1997), pp. 333–369 (cit. on p. 27).

[Alg17] AlgoRun project. AlgoRun v2.0 Documentation. http://algorun.org/documentation. 2017 (cit. on p. 45).

[Clo17] Cloudius Systems. OSv. http://osv.io/. 2017 (cit. on p. 43).

[Cor17] CoreOS, Inc. rkt overview. https://coreos.com/rkt. 2017 (cit. on p. 42).

[cur17] curl project. curl. https://curl.haxx.se/. 2017 (cit. on p. 59).

[Doc17a] Docker, Inc. Best practices for writing Dockerfiles. 2017. URL: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices (cit. on p. 106).

[Doc17b] Docker, Inc. What is Docker? https://www.docker.com/what-docker. 2017 (cit. on pp. 42, 44, 45, 88).

[OAS13] OASIS. Topology and Orchestration Specification for Cloud Applications (TOSCA) Version 1.0. http://docs.oasis-open.org/tosca/TOSCA/v1.0/os/TOSCA-v1.0-os.html. 2013 (cit. on p. 108).

[Obj11] Object Management Group. Business Process Model and Notation Version 2.0. http://www.omg.org/spec/BPMN/2.0/. 2011 (cit. on p. 14).


[Obj17] Object Management Group. CORBA. http://www.corba.org/. 2017 (cit. on p. 25).

[Ope17] OpenStack Foundation. OpenStack. https://www.openstack.org/software/. 2017 (cit. on p. 43).

[Ora17a] Oracle Corporation. MySQL. https://www.mysql.com/. 2017 (cit. on p. 43).

[Ora17b] Oracle. Enterprise JavaBeans Technology. http://www.oracle.com/technetwork/java/javaee/ejb/index.html. 2017 (cit. on p. 25).

[Red17] Redis Labs. Redis. https://redis.io/. 2017 (cit. on p. 43).

[Uni17] University of Stuttgart. Cluster of Excellence SimTech. http://www.simtech.uni-stuttgart.de/en/index.html. 2017 (cit. on p. 15).

[Wor05] World Wide Web Consortium. Web Services Choreography Description Language Version 1.0. https://www.w3.org/TR/ws-cdl-10/. 2005 (cit. on p. 14).

[Xen17] Xen Project. Xen. https://www.xenproject.org/. 2017 (cit. on p. 43).

All links were last followed on December 12, 2017.


List of Figures

1.1 A BPMN diagram of OPAL simulation choreography with TraDE concepts applied [Hah+17a] . . . 17

1.2 A modeling of OPAL simulation [Hah+17a] . . . . . . . . . . . . . . . . . . 18

2.1 A BPMN model of a choreography with TraDE concepts applied [Hah+17b] . . . 21
2.2 Architecture of TraDE Middleware [Hah+17b] . . . 22
2.3 Modules of a wrapping framework [Sne00] . . . 29
2.4 Simplified UML diagram of LCDL model [Juh+09] . . . 30
2.5 The general framework model [HX06] . . . 31
2.6 APIfication framework [WBL15] . . . 33
2.7 API implementation package [WBL15] . . . 33
2.8 Simplistic Legacy to the Web architecture [Zdu02] . . . 34
2.9 SCaaS model [Elk16] . . . 38
2.10 Structure of Everest application [SVA15] . . . 39
2.11 Hypervisor-based virtualization [MKK15] . . . 41
2.12 Container-based virtualization [MKK15] . . . 41

3.1 Simplified SOA-like interaction model . . . 49
3.2 More refined SOA-like interaction model . . . 49
3.3 Interaction model with separate publisher and provider . . . 50
3.4 Interaction model with a proxy provider role . . . 51
3.5 The conceptual meta-model of a packaged data transformation application . . . 53
3.6 The constituents of a transformation specification . . . 55
3.7 The constituents of a dependency specification . . . 57
3.8 The configuration specification . . . 59
3.9 The invocation specification . . . 59
3.10 The test run specification . . . 60
3.11 An example structure of an application package . . . 61
3.12 A generic layered architecture for data transformation applications reuse . . . 63
3.13 Provisioning-ready application package . . . 66
3.14 Storing the published data transformation application . . . 69
3.15 A trivial composition example based on [MMK+94] . . . 72
3.16 A composition involving input splitting inspired by [MMK+94] . . . 73
3.17 The refined architecture of the repository . . . 75
3.18 Task Manager’s architecture . . . 76
3.19 Updated User Interaction Layer with the Security Layer . . . 79
3.20 Refined view on the framework’s constituents . . . 80


3.21 The generalized publishing process from the publisher’s perspective . . . 81
3.22 The generalized publishing process from the system’s perspective . . . 81
3.23 A consumer’s perspective on the process of reusing a data transformation application . . . 82
3.24 A system’s perspective on the process of reusing a data transformation application . . . 83
3.25 An UML sequence diagram of the interaction with the framework . . . 84

4.1 The architecture of the framework’s prototypical implementation . . . . . . 87

5.1 A highlighted OpalVisual participant in the OPAL simulation choreography that is solely responsible for the data transformation . . . 101

5.2 The implicit data transformation modeling using enhanced data flow connectors . . . 102

5.3 The modified OPAL simulation choreography with the implicit data transformation modeling . . . 103


List of Tables

4.1 Resources descriptions . . . 89
4.2 The usage of HTTP verbs for Application and Transformation resources . . . 90
4.3 The usage of HTTP verbs for Task resource . . . 92
4.4 Mapping of the application’s information to Dockerfile’s instructions . . . 97


List of Listings

4.1 Example of Hypermedia Controls for the case when an application is published . . . 92
4.2 Example of input type groupings in transformation specification . . . 94
4.3 Example of a publishing request . . . 95
4.4 Specification of the general information about the application . . . 95
4.5 Example of Dockerfile instructions for copying application’s files and setting permissions to execute them . . . 96
4.6 Structure of the transformation request . . . 98

5.1 The specification of general information about the application . . . 103
5.2 The application’s transformation specification . . . 104
5.3 The specification of the application’s dependencies . . . 105
5.4 The specification of the application’s dependencies . . . 106
5.5 The resulting Dockerfile generated based on the application package specification . . . 106


Declaration

I hereby declare that the work presented in this thesis is entirely my own and that I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure. I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies.

place, date, signature