
PROVENANCE VIEWER FOR APACHE TAVERNA

Visualization tool for scientific workflows and provenance

Student: Claudiu Stefan Padurariu

Supervisor: Carole Goble

Degree: (BSc) Computer Science with

Business and Management

THE UNIVERSITY OF MANCHESTER

SCHOOL OF COMPUTER SCIENCE

THIRD YEAR PROJECT REPORT

May 2016


Abstract

Over the last two decades, scientific workflows have become more and more popular in conducting

research. One of the most widespread workflow management systems is Taverna Workbench.

Through the utilisation of this software, a researcher can design and run scientific workflows.

Depending on the complexity of the workflow and experiment the execution trace can be

substantially large. The analysis of large volumes of data poses challenging problems for researchers.

One of them is the visualisation of the provenance of scientific workflow runs.

Therefore, this project aims to present the solutions found through which a scientist can visualise and

explore the provenance resulting from running a scientific workflow with Taverna Workbench 2.5.0.

Also, it includes an application whose main functionality is to display the exported result data in an

interactive way.

The software solution for this project was successfully designed and implemented through the

medium of a well thought out software development plan. Since scientific workflows encourage

sharing and publication of the results of the conducted research, it was developed as a rich internet

application. Further possible improvements and future works were identified from the feedback

gathered in testing and evaluating the final product.

Keywords: scientific workflows, provenance, visualisation, Taverna, rich internet application, D3


Acknowledgements

I am grateful to my supervisor, Professor Carole Goble, for the patient guidance, generous advice

and life lessons provided throughout the entire duration of the project. I would also like to thank her

for allowing me to be temporarily a part of the myGrid team, which allowed me to grow as a research

scientist.

I will take this chance to express my gratitude to Dr. Milan Mihajlovic as well, for his constructive

feedback during the assessment meetings, which proved to be a valuable lesson in software

engineering, particularly with regard to the planning stage.

I am also indebted to the members of the myGrid team, especially Alan Williams, Stian Soiland-Reyes,

Donal Fellows, Stuart Owen, Aleksandra Nenadic and Niall Beard. They have welcomed me to their

group and helped me with developing new skills.

Finally, I am thankful to my family for their love, encouragement and support that they have offered

me throughout my entire life, without whom I would never be the person who I am today.

I declare that this report is my own work and the work of others has been properly acknowledged

and referenced in accordance with University policies.


Table of Contents

Abstract
Acknowledgements
Table of Contents
Table of Tables
Table of Figures
Chapter 1. Introduction
1.1 Project Overview
1.2 Area of Interest
1.3 Motivation
1.4 Aims and Objectives
1.5 Methodology
1.6 Report Structure
Chapter 2. Background
2.1 Overview
2.2 In Silico Experimentation
2.3 Scientific Workflows
2.4 Apache Taverna
2.5 Apache Taverna Plug-ins
2.6 Databundles
2.7 Provenance
2.8 myExperiment
Chapter 3. Planning
3.1 Overview
3.2 Context
3.2.1 Apache Taverna Databundle Viewer
3.2.2 Project Scope and Functionalities
3.2.3 High-level Tasks
3.3 Container
3.3.1 Technologies
3.3.1.1 Ruby on Rails
3.3.1.2 Node.js
3.3.1.3 D3.js
3.3.1.4 JSON
3.3.1.5 Bower
3.3.1.6 Slim
3.3.1.7 CoffeeScript
3.4 Container Diagram
Chapter 4. Implementation and Results
4.1 Overview
4.2 Development Environment
4.3 Development
4.3.1 Reading, parsing and querying
4.3.2 Diagrams
4.3.3 Sankey Diagram
4.3.3.1 Coloured Nodes
4.3.3.2 Draggable Nodes
4.3.3.3 Text labels for Nodes
4.3.3.4 Node information displayed on mouse-over
4.3.3.5 Coloured links between nodes
4.3.3.6 Path information on mouse-over
4.3.3.7 Single Click on node
4.3.3.8 Double click on node hides links
4.3.3.9 Zooming
4.3.4 Adjacency Matrix
4.3.4.1 Sorting
4.3.4.2 Coloured cells for paths
4.3.4.3 Saving the graph as picture
Chapter 5. Testing and Evaluation
5.1 Overview
5.2 Testing
5.2.1 Functional Testing
5.2.1.1 Data extraction testing
5.2.1.2 Diagrams Testing
5.2.1.3 Coupling of the data with the diagram
5.2.1.4 Observations and remarks
5.2.2 Cross-Browser Testing
5.3 Evaluation
Chapter 6. Conclusion
6.1 Overview
6.2 Achievements
6.3 Variation from the initial plan
6.4 Experience gained
6.5 Future Works
References
Appendix A: workflow.prov.ttl inside Hello Anyone databundle
Appendix B: Container Diagram
Appendix C: Model-View-Controller Design
Appendix D: Colours used in diagrams
Appendix E: Testing
Appendix F: Usability evaluation methods
Appendix G: More workflows

Table of Tables

Table 1. Objectives
Table 2. Databundle Structure
Table 3. Directions of relationships
Table 4. Evaluation Techniques
Table 5. Colours used for the Sankey Diagram and Adjacency Matrix
Table 6. Testing cases for the diagrams
Table 7. Usability Evaluation Methods

Table of Figures

Figure 1. Provenance visualisation of “weather forecast” workflow run within Taverna
Figure 2. Provenance visualisation of Hello Anyone workflow with Prov-O-Viz
Figure 3. The example of provenance visualisation with the Polish software
Figure 4. The “weather forecast” workflow opened with ATDV
Figure 5. Agile Methodology
Figure 6. Hello World Workflow
Figure 7. Taverna Workbench 2.5.0 – Perspectives
Figure 8. An example of the icicle tree representation and coloured bands
Figure 9. An example of the Provenance Matrix
Figure 10. Extraction of provenance data based on wfprov ontology
Figure 11. Context Diagram of ATDV
Figure 12. Diagram with the functionalities
Figure 13. Original JSON structure
Figure 14. New JSON structure
Figure 15. Example of SPARQL query - Extraction of Workflows Runs
Figure 16. Sankey Diagram – Legend
Figure 17. Sankey Diagram – Draggable Node
Figure 18. Sankey Diagram - Information on hovering nodes
Figure 19. Sankey Diagram – Information on hovering the coloured paths
Figure 20. Sankey Diagram – Highlight the path
Figure 21. Sankey Diagram – Hide outgoing links on double clicking
Figure 22. Adjacency Matrix Diagram
Figure 23. Container Diagram
Figure 24. Model-View-Controller Architecture
Figure 25. Weather forecast workflow
Figure 26. Explicit looping workflow


Chapter 1. Introduction

1.1 Project Overview

This project aims to deliver an application intended to speed up the work of scientists who

use the workflow management system Apache Taverna Workbench in their research. The solution

is a web platform, developed as a contribution to the Apache project, through which an end user

can visualize the information inside a bundle exported by Taverna Workbench. This

information describes a scientific workflow and the provenance that resulted from running the

workflow.

To meet the needs of as many scientists as possible, a prerequisite is researching the most

approachable methods for visualizing data. Once this requirement is fulfilled, and one or more

solutions have been identified, the next step is the development of the software.

1.2 Area of Interest

Science is a practical discipline whose purpose is to attain systematic knowledge through progressive

steps of observations, hypothesis creation, experiment proposal, execution, and completion.

In the beginning, scientific research was conducted through manual and ad hoc approaches. However,

new approaches to scientific research have increased the amount of data that needs to be

analysed. (Bowers, 2012) states that the traditional approaches are considered

controversial in practice for large-scale experiments.

Due to technological innovations and advances in computer science, many researchers started to

move to in silico experimentation. By automating the process of gathering and generating data,

the speed and productivity of performing common activities have improved for scientists. The benefit

is that researchers can now focus on important tasks by letting computers take care of common

tasks, while also minimizing human error.

When designing such an experiment, complex computational components such as input resources,

specialized libraries, web services and many other processes have to be brought together. This was

tackled by introducing the concept of a workflow into in silico experimentation. Workflows help

in designing an experiment as a multi-step process that provides an easy-to-use way of defining the

tasks that need to be included and executed for the completion of the research.

A workflow management system represents an environment in which a computational experiment

can be described and executed. Restating what (Taverna.org.uk, 2009a) asserts, a workflow


management system provides the infrastructure to design, run and monitor scientific workflows. For

this purpose, numerous systems have been devised to offer explicit support for managing workflows.

According to (Bowers, 2012) the most popular are: Taverna, Kepler, VisTrails, Triana, Pegasus,

KNIME, Galaxy.

This project focuses entirely on the Apache Taverna workflow management system. This software is

open-source and is used in many domains including arts, astronomy, biodiversity, heliophysics,

chemistry, databases, document and image processing and many others. It has a graphical user

interface that allows users to design workflows as directed graphs in which the nodes represent

data and processes, while the edges define the relationships between them.

Another notable feature of Taverna is its ability to capture information about the workflow

run. In other words, when a researcher executes a workflow, Taverna generates a bundle. As

mentioned in the overview, this bundle contains a log file that records the history of the steps

involved in the production of a piece of work. This log file contains information about the processes

executed, the input data, intermediate values, and output. The information inside that log file is called

provenance, and it is used to form opinions on the quality, trustworthiness, and reliability of the

research. These characteristics can be determined through the answers to several questions, such as:

what did the experiment achieve; how did it achieve it; why did it execute the way it did; where did the

data come from; are the sources trustworthy.
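
To make the "where did the data come from" question concrete, the sketch below records a tiny workflow run as two relations and traces an output back to the original inputs. The relation names `used` and `was_generated_by` are borrowed from the W3C PROV vocabulary; the process and data names are invented for illustration and do not come from a real Taverna bundle.

```python
# Hypothetical provenance of a tiny run: all names are illustrative only.
# Each process lists the data items it used.
used = {
    "concatenate": ["greeting", "name"],
    "to_upper": ["message"],
}
# Each derived data item names the process that generated it.
was_generated_by = {
    "message": "concatenate",
    "result": "to_upper",
}


def trace_origins(item):
    """Return the set of original inputs that a data item was derived from."""
    process = was_generated_by.get(item)
    if process is None:
        # Nothing generated this item, so it is a workflow input.
        return {item}
    origins = set()
    for source in used[process]:
        origins |= trace_origins(source)
    return origins


print(sorted(trace_origins("result")))  # ['greeting', 'name']
```

The same walk, run in the opposite direction, would answer which outputs were affected by a given input.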

1.3 Motivation

Currently, provenance can be visualized with Taverna in two different ways. The first is to read the

content of the provenance file, which proves to be a large Resource Description Framework (RDF) file.

As can be seen in Appendix A, trying to visualize such files as graphs is not a good idea, because the

result becomes convoluted and hard to read, especially when the workflow involves many

iterations.

The second method in which a Taverna user can visualize provenance of workflow runs is through

the Taverna tool itself. It uses the workflow diagram as a starting point. However, as can be seen

in Figure 1, this method is not very approachable, since the details of workflow elements are

displayed separately.


Source: (Nenadic, 2014a)

Figure 1. Provenance visualisation of “weather forecast” workflow run within Taverna

There are third-party applications that have tried to provide a way to visualize provenance. For example,

Data2Semantics created the project Prov-O-Viz. It lets the user visualize any provenance graph

that uses the PROV-O vocabulary as a Sankey diagram. However, Prov-O-Viz has not been

completed. Also, as can be seen in Figure 2, it does not display sufficient detail, and the nodes are

identified by their unique resource identifier rather than a meaningful name.

Source: (Hoekstra, 2013)

Figure 2. Provenance visualisation of Hello Anyone workflow with Prov-O-Viz
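
The labelling problem noted above could be addressed by looking up a human-readable name for each resource identifier before the diagram is drawn. The sketch below does this while building the nodes-and-links structure commonly passed to Sankey layouts, in which links refer to nodes by index. The identifiers, the labels and the helper `to_sankey` are hypothetical and are not taken from Prov-O-Viz or from this project.

```python
import json

# Hypothetical provenance edges (source id, target id) and labels; not from a real run.
edges = [
    ("urn:ex:d1", "urn:ex:p1"),
    ("urn:ex:d2", "urn:ex:p1"),
    ("urn:ex:p1", "urn:ex:d3"),
]
labels = {"urn:ex:d1": "greeting", "urn:ex:d2": "name", "urn:ex:p1": "concatenate"}


def to_sankey(edges, labels):
    """Build a {nodes, links} dictionary, falling back to the raw id when no label exists."""
    index = {}
    nodes = []
    links = []
    for source, target in edges:
        for node_id in (source, target):
            if node_id not in index:
                index[node_id] = len(nodes)
                nodes.append({"id": node_id, "name": labels.get(node_id, node_id)})
        # Every link gets the same weight here; a real diagram might use data size.
        links.append({"source": index[source], "target": index[target], "value": 1})
    return {"nodes": nodes, "links": links}


sankey = to_sankey(edges, labels)
print(json.dumps(sankey, indent=2))
```

Keeping the raw identifier alongside the name means the diagram can show the friendly label while still linking back to the exact resource.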


Another tool, suggested by Alan R. Williams, was developed by Maciej Gol, a Polish student. This

tool is thoughtfully designed. It also starts from the workflow diagram, but it adds the

ability to display sub-graphs of elements. For example, if the user double-clicks on a node, the tool

redraws the entire page and includes only the inputs and outputs of that node. An example of this

tool can be seen in Figure 3.

Source: (Gol, n.d.)

Figure 3. The example of provenance visualisation with the Polish software

1.4 Aims and Objectives

Therefore, this project has a single main aim: the development of a website that will

allow scientists to visualize the provenance of a workflow run in such a way that they can

interact with the diagrams. My supervisor suggested an extension of the platform Apache Taverna

Databundle Viewer (ATDV), which has as its core functionality the ability to draw the workflow

diagram. Figure 4 illustrates how ATDV displays information about the “weather forecast” workflow.


Source: (Apache Taverna Databundle Viewer, 2015)

Figure 4. The “weather forecast” workflow opened with ATDV

In order to add the functionality to visualize provenance in an efficient way, the aim of this project

can be split into a list of objectives. As mentioned in the overview, the first objective is to research

methods for visualising provenance, followed by a comprehensive analysis of the Apache Taverna

Databundle Viewer project. These two represent the most demanding tasks. Other objectives that

represent the development of the functionality are included in Table 1.

No.  Description
1    Research visualization methods for provenance
2    Understand functionality offered by Apache Taverna Workbench Plug-in
3    Understand the ATDV platform
4    Extract data from provenance file
5    Implement and add an interactive diagram
6    Link the extracted data with the implemented diagram
7    Add other diagram(s)

Table 1. Objectives.


1.5 Methodology

It is crucial to decide on the methodology that the project will follow before starting to

code. As this report is part of a third-year project, the student has a set deadline for its completion.

Hard deadlines apply to businesses as well. Therefore, a good strategy is required that will

allow the developer to fulfil the goals by the target date so that the clients will be satisfied.

To meet this standard, the project considered two of the most popular software

development life cycle methodologies. The first is the traditional Waterfall methodology,

which implies that all decisions need to be made before starting the implementation and that feedback

from users is provided only at the end of the project. This makes the Waterfall methodology inappropriate

for this project, which depends on feedback during development. The second model is the Agile

methodology, which focuses on an iterative and incremental approach.

The Agile methodology is used to split the whole development process into smaller tasks, as is done in Section

3.2.3. This allows access to constant feedback from the supervisor, researchers and users during the

implementation phase. In addition, testing can be done continuously, which leads to early discovery

of errors and more stable releases. As a result, the project can adapt easily to new

requirements. The adopted methodology workflow is shown in Figure 5.

Figure 5. Agile Methodology


1.6 Report Structure

This report reflects the life cycle of the project from the moment in which the concept of visualising

provenance was only a research idea to the stage at which it took the form of a web

application. Each design choice is justified by examining alternative approaches and

technologies throughout the process. The report is structured as

follows:

Chapter 1 ("Introduction") introduces science as the area of interest. In addition, it explains the motivation

for the work by presenting similar work done in the same area, such as Taverna

Workbench, Prov-O-Viz and a Polish project. Additionally, it sets the main aim of delivering a tool for

the visualization of provenance.

Chapter 2 ("Background") defines the prerequisites of the project, such as in silico

experimentation and the workflow management system Taverna. These prerequisites help the reader

understand the project better and are used in determining the requirements of the project. Also, it

offers two methods through which provenance can be visualized.

Chapter 3 ("Planning") presents an analysis of the ATDV platform and how it is extended to add

provenance visualisation. This analysis is made along with a discussion of the technologies and

approaches used. The last part of this chapter presents a complete architecture of the

system.

Chapter 4 ("Implementation and Results") reflects the development of the project. It discusses the environment

that covers the operating system, computer programs, editors and debuggers used for software

development. Following the Agile methodology, this chapter then presents the implementation

of each of the main tasks of this project with sample code.

Chapter 5 ("Testing and Evaluation") describes the testing and evaluation methods used throughout

the development stage of this project for the purpose of determining the quality and correctness of

the artefact. A result subsection follows and demonstrates the whole functionality of this project by

the use of examples.

Chapter 6 ("Conclusion") reflects on the extent to which the objectives of this project were met, suggests

a list of possible future improvements, and concludes with a short statement of the knowledge

gained during the process of software engineering.


Chapter 2. Background

2.1 Overview

The purpose of this chapter is to provide essential background information about the environment in

which this project will be used. It starts by introducing more details about in silico experimentation,

followed by an overview of scientific workflows as a solution to this type of scientific research. After

that, the discussion proceeds with a presentation of Taverna (a scientific workflow management

system) and myExperiment (a social website that contains workflows available to everyone). Next,

it introduces the platform on which the project is going to be built. Finally, the chapter concludes

with an analysis of provenance.

2.2 In Silico Experimentation

"In silico experimentation" is an expression which emerged in 1989 during the "Cellular Automata:

Theory and Applications" workshop held in Los Alamos, New Mexico. According to (Autoimmunity

Research Foundation, 2012), this expression can be interpreted as "performed on computer or via

computer simulation". This allowed scientists to conduct research and experiments on computers

using complex data that models and reflects the real world.

This new method of performing scientific experiments has numerous advantages and is possible due

to continuous developments in computer science. (Taverna.org.uk, 2009b) lists the following benefits

that are observable during in silico experimentation: "higher precision and better quality of

experimental data; better support for data-intensive research and access to vast sets of experimental

data generated by scientific communities; more accurate simulations through more sophisticated

models; faster individual experiments; higher work productivity".

However, in silico experimentation also has a disadvantage. Scientists need

a computing background in order to design, develop and maintain an in silico experiment.

Therefore, in order to reproduce an experiment, a scientist requires

technical skills that are not easy to acquire. Thus, the majority of researchers would not be able to

use this approach.

Despite this disadvantage, a solution is available. If a researcher does not have a computing

background, they can perform the in silico experiment by using scientific workflows.


2.3 Scientific Workflows

(Workflow Management Coalition, 1996) introduced the idea of workflow as "The automation of a

business process, in whole or part, during which documents, information or tasks are passed from

one participant to another for action, according to a set of procedural rules". A much more
comprehensive definition of a workflow is given by (Oxford Dictionaries, 2016), which defines it as
a sequence of steps undertaken by an activity from beginning to completion.

Workflows were first used mostly within the business domain. However, due to the new nature of
computationally intensive experimentation, another usage for them has emerged in the scientific
environment. This new type of workflow, referred to as a scientific workflow, can be thought of as
an elaborate description of what an in silico experiment is aiming to accomplish.

The most significant aspect of scientific workflows is the way they are modelled as directed graphs
with nodes and edges, as can be seen in Figure 6. More complex workflows can be found in
Appendix G. The vertices represent computational steps such as data entities, local services, web
services, scripts, and sub-workflows. These components are linked to one another according to the
data flow and organized in layers.
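To make the graph model concrete, the following Ruby sketch represents a small workflow as a directed graph and assigns each node to a layer according to the data flow. The node names and the `layers_for` helper are hypothetical and are not taken from Taverna. The layering rule used here (a node sits one layer below its deepest predecessor) is one common convention, assumed purely for illustration.

```ruby
# Hypothetical workflow: each edge is [source, target], following the data flow.
edges = [
  ["input", "fetch_sequence"],
  ["fetch_sequence", "align"],
  ["input", "align"],
  ["align", "output"]
]

# Assign every node to a layer: nodes without predecessors are in layer 0,
# and any other node is one layer below its deepest predecessor.
# Assumes the graph is acyclic.
def layers_for(edges)
  preds = Hash.new { |hash, key| hash[key] = [] }
  edges.each { |source, target| preds[target] << source }

  layer = {}
  depth = lambda do |node|
    layer[node] ||=
      if preds[node].empty?
        0
      else
        preds[node].map { |pred| depth.call(pred) }.max + 1
      end
  end
  edges.flatten.uniq.each { |node| depth.call(node) }
  layer
end

layers_for(edges).each { |node, level| puts "#{node}: layer #{level}" }
```

In this sketch "align" lands two layers below "input" because its deepest predecessor, "fetch_sequence", is already in layer 1.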

Various tools exist that enable the user to design, create,

maintain and execute scientific workflows. The most

popular are Apache Taverna, Kepler, VisTrails, Triana,

Pegasus, and Galaxy.

The workflow management system that this project focuses on is Taverna. This tool was chosen
because several of the Taverna developers are part of a team led by my supervisor, and they carry
out the development on the University of Manchester campus. This made for faster learning and a
familiar environment.

2.4 Apache Taverna

As mentioned before, Taverna is an open-source and platform-independent workflow management
system written in Java. The myGrid team at the University of Manchester created it with the aim
of delivering a tool to design and run scientific workflows. The software is currently incubating
as a project of the Apache Software Foundation.

Source: (Nenadic, 2014b)

Figure 6. Hello World Workflow


The software is composed of three major components:

- Taverna Engine which is responsible for all the computational work which includes running

scripts and converting data from one format to another format;

- Taverna Server which enables the ability to execute workflows remotely;

- Taverna Workbench which is the desktop application responsible for designing the workflow

and outputting the results of the run.

The latter is the component most used by scientists because it has a suitable graphical user
interface. The Taverna Workbench’s window frame is divided into three perspectives as seen in
Figure 7. In the top-left part, there is the Services Panel that lists all known services. In the
bottom-left side, the Workflow Explorer tab presents the workflow in a tree-like form, while the
Details tab allows the attributes of a workflow node to be viewed or modified. Lastly, the workflow
can be sketched on the right side of the screen, in the Workflow Diagram panel.

Source: (Williams, 2014)

Figure 7. Taverna Workbench 2.5.0 – Perspectives

2.5 Apache Taverna Plug-ins

Due to the many different areas in which Taverna can be used, this workflow management system
has been designed as an extensible tool. Thus, its functionality can be replaced with modules that
fit the needs of a specific scientist or extended through various services and plug-ins. For
example, there is the Taverna-PROV plug-in, which records the provenance of workflow runs along
with their inputs, outputs, and the executed processes.


This plug-in allows the provenance to be stored in an internal database. As mentioned in Chapter 1,

this information can be visualised in Taverna as well through the “Previous runs and Intermediate”

tabs in the Results section. Another functionality of this plug-in is the ability to export the workflow

run with its provenance. The exported file is called a databundle and a more detailed analysis of it is

given in the next subsection.

2.6 Databundles

A databundle is a zip file produced by Taverna Workbench. It can be generated once the researcher
has run the workflow and has decided to save the output of the experiment. The zip is a collection
of files that represent the data that contributed to the experiment. This bundle contains a
description of the workflow, the provenance trace and the input, intermediate and output values.
(Soiland-Reyes, 2013) states that a databundle follows the structure presented in Table 2.

Source: (Soiland-Reyes, 2013)

Table 2. Databundle Structure

| File path | Description |
|---|---|
| ./inputs/, ./intermediates/, ./outputs/ | These three folders contain sub-folders and files. The files contain data that describe the input, the modifications it underwent throughout the whole execution of the workflow, and the output. |
| ./mimetype | A Multipurpose Internet Mail Extensions (MIME) type file whose purpose is to provide a way of identifying the nature of the databundle. The content of this file is usually “application/vnd.taverna.scufl2.workflow-bundle” or “application/vnd.wf4ever.robundle+zip”. |
| ./workflow.wfbundle | A description of the workflow written in the SCUFL2 format. |
| ./workflow.prov.ttl | This file represents the provenance of the workflow run. It is an RDF graph in the Turtle format that acts as a log file. It contains every step taken during the workflow execution and links each step with its respective values from the inputs, intermediates and outputs folders. A sample of what this file looks like is provided in Appendix A. |


2.7 Provenance

With the provenance recorded inside the databundle, it is time to discuss provenance in more
detail, along with the approaches found to visualise it. Firstly, a variety of diagrams and charts
are briefly described. Next, the discussion concentrates on the basis on which the information is
extracted.

Having established earlier that the standard graph with nodes and edges is not among the answers to
the provenance problem, other techniques for visualization had to be found. Thus, an analysis of
relevant literature was conducted. Among all the research, the work of (Dang et al., 2015) proved
to be quite useful in determining two solutions for this project. Figure 8 and Figure 9 present the
two approaches proposed in that paper. Besides these, other diagrams were considered, such as the
Chord Diagram or a variation of the Scatter Plot. However, following discussions with the
supervisor and members of the myGrid team, these were rejected on the basis that the data extracted
about the workflow run could not be represented through these types.

As explained in Appendix A, Taverna saves the provenance of a workflow run as a file in Turtle
format, drawing on many ontologies. The most essential ones for this project are the “wfprov”,
“prov” and “rdf” ontologies. The last one makes it possible to distinguish between the types of
nodes that the provenance creates inside the RDF graph. The types generated are Artifacts, Process
Runs, Workflow Runs and Workflow Engines. In Figure 10, (Soiland-Reyes et al., 2013) define the
relationships between these nodes. One thing that is not illustrated is that Artifacts can be of
two types: simple Artifacts and Dictionaries (also known as Lists or Collections). The “prov”
ontology makes it possible to observe this distinction between the Artifacts.

Source: (Dang et al., 2015)

Figure 9. An example of the

Provenance Matrix

Source: (Dang et al., 2015)

Figure 8. An example of the icicle

tree representation and coloured

bands


Source: (Soiland-Reyes et al., 2013)

Figure 10. Extraction of provenance data based on wfprov ontology

2.8 myExperiment

Once a scientist has built a properly working workflow, it can be published and shared with others
for reproducibility, collaboration or feedback-gathering purposes. This is possible through the
social website myExperiment, launched in November 2007 and developed by a team formed of members
associated with the universities of Oxford, Manchester, and Southampton. This team was managed and
guided by Carole Goble and David De Roure.

Since 2007, according to the statistics published in (The University of Manchester and University
of Southampton, n.d.), this community has attracted more than 10200 users. Focusing on the needs of
scientists, myExperiment acts as a scientific workflow repository regardless of the tool that has
been used to create the workflows. Thus, myExperiment allows uploading files and grouping them by
the workflow management system utilized. Taverna, RapidMiner, Galaxy, Kepler and Bio Extract are
examples of systems whose exports can definitely be uploaded. According to the same statistics,
myExperiment is currently considered to be the largest public repository, containing over 3700
workflows, of which approximately 2100 were designed for Taverna (~1550 for Taverna 2 and ~550 for
Taverna 1).


This website is of high importance to this project since the project can be tested with workflows
created by researchers. Most of the workflows tested are the work of Alan Williams, Alex Nenadic
and Stian Soiland-Reyes.


Chapter 3. Planning

3.1 Overview

The purpose of this chapter is to design the architecture of the system. To be understood by both
specialized and non-specialized people, the architecture has been split into multiple levels as
suggested in (Brown, 2016).

The first step is to define the context, immediately followed by a presentation of the platform Apache

Taverna Databundle Viewer before jumping into any implementation. This chapter covers both the

functionalities and the technologies used to develop it.

Afterwards, the results of the research on provenance and the types of diagrams found to be most
suitable for displaying it are presented. At the same time, the approaches and technologies
considered to be the best choice are discussed in comparison with their alternatives. This part
also explains what data should be extracted from the provenance file inside any databundle and how
to link it with the methods used for visualization.

Finally, this chapter will be concluded with a complete navigation of the system.

3.2 Context

This section outlines everything that was introduced in Chapter 2 and has been used as a starting

point for designing the software system. Therefore, the intent of the project along with its

surroundings are reflected in Figure 11. It also describes the nature of the users who are going to use

this system. In addition, it provides information about the origins of the artefact used as input.

3.2.1 Apache Taverna Databundle Viewer

This project is based on the ATDV platform, which was developed by Denis Karyakin, a former
student of the University of Manchester. This tool displays a scientific workflow within a
databundle generated by Taverna as a force-directed graph drawing. However, as can be seen in
Figure 4, this is untidy. Therefore, an out-of-scope task was to modify the tool to visualize the
workflow as a vertical icicle tree.


Figure 11. Context Diagram of ATDV

3.2.2 Project Scope and Functionalities

This software represents the platform on which the functionality to visualize the provenance of
workflow runs will be implemented. All functionalities of ATDV are presented in Figure 12:

- Red, yellow and cyan nodes represent the functionalities that are already implemented:
  - Red implies that the functionality is going to be removed;
  - Yellow implies that the functionality is going to be modified and improved;
- Green nodes constitute the functionalities that are going to be added.

Figure 12. Diagram with the functionalities


3.2.3 High-level Tasks

As a characteristic of the agile methodology, the development of this project has been split into three

iterations based on the information introduced earlier. The high-level tasks are:

(1). Read, parse and query the provenance file – Once the databundle has been uploaded to the

server, perform the computations on the provenance file;

(2). Diagram Implementation – Integrate a diagram that is going to be used to visualize mock

data with various interactive features;

(3). Couple the data resulting from (1) with the diagram implemented in (2) – Combine the two
previous tasks and analyse whether this diagram is suitable to display the provenance.

3.3 Container

Having established how the system fits into its environment by means of Figure 11 and Figure 12,
the next move was to settle upon the technologies used for achieving the overall objective of this
project. This section can be thought of as a recipe in which the ingredients are the technologies.

Therefore, the next subsections will introduce some of the most important technologies along with
alternatives. The criteria considered in the technical selection process depend on:

- the ability to fulfil the tasks mentioned earlier;
- the developer’s experience with the technology or with similar technologies;
- the difficulty level for learning and using the respective technology;
- the cost of the technology (budget, performance, size, compatibility).

3.3.1 Technologies

3.3.1.1 Ruby on Rails

The most important technology that needs to be discussed is the web application framework.
Because this project extends another one, this decision had already been made: the selected
technology is the web framework Ruby on Rails.

The framework introduced various difficulties as I had no previous experience with it. Therefore,
I considered whether the project should continue with Ruby on Rails or whether the entire website
should be redeveloped in a more familiar technology such as PHP or ASP.NET with C#. All of them are
suitable for delivering a successful application. After a bit of research and completing several
tutorials, Ruby proved to be easily readable and mostly self-describing.


This fact influenced the final decision according to which Ruby on Rails will be used for the entire

project. There are several other things that influenced this decision.

First, Ruby on Rails is a modern web application development framework based on the
Model-View-Controller (MVC) architectural pattern. This pattern was traditionally used mostly for
developing desktop applications with a graphical user interface, but it has since been adapted for
web development. More details about this approach are found in Appendix C.

Another reason is that Rails is considered to be suitable for agile development, since its
flexibility allows functionalities to be delivered in a short time-frame. Therefore, constant
feedback on the progress of the project can be provided to the stakeholders. This is one of the
reasons why Ruby on Rails is considered to be a rapid application development framework.

The decision was also influenced by the fact that the framework is supported by an active
community that provides many ways of learning. Moreover, this community is known for sharing
solutions to various common tasks. These solutions are offered in the form of “gems”. Thus, the
wheel does not have to be reinvented and the number of errors resulting from the human factor is
decreased. For example, consider the login system, which is one of the most common tasks that a web
developer is required to implement. The Rails community has provided gems which can be used to
create complex and yet easy-to-use systems for authentication and authorization.

Ruby on Rails also offers a simple way of opening a Turtle file and reading from it. Two
alternatives were considered, namely the ActiveRDF and rdflib-turtle gems. Both fit the
requirements, but rdflib-turtle has better documentation. For this reason, the project uses
rdflib-turtle, which allows the provenance file to be read directly as an RDF graph and stored in
local memory. Afterwards, the graph can be queried. The data can be accessed in two ways: either by
querying the graph directly or by using another gem, namely the rdf-SPARQL gem. Since SPARQL is
more flexible and its syntax is quite similar to SQL, the rdf-SPARQL gem was selected as the better
option. In addition, it makes the project easier to maintain and test.
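As a rough illustration of what querying an in-memory RDF graph involves, the sketch below mimics a single SPARQL triple pattern in plain Ruby. It deliberately does not use the gems named above, so it says nothing about their APIs. The triples, the prefixed names and the `match_triples` helper are assumptions made for the example.

```ruby
# A tiny in-memory "graph": each triple is [subject, predicate, object].
# The identifiers below are made up for illustration.
triples = [
  ["run:1",  "rdf:type", "wfprov:WorkflowRun"],
  ["proc:1", "rdf:type", "wfprov:ProcessRun"],
  ["proc:1", "wfprov:wasPartOfWorkflowRun", "run:1"],
  ["data:1", "rdf:type", "wfprov:Artifact"]
]

# Return all triples matching a pattern; nil acts like a SPARQL variable.
def match_triples(triples, subject: nil, predicate: nil, object: nil)
  triples.select do |s, p, o|
    (subject.nil? || s == subject) &&
      (predicate.nil? || p == predicate) &&
      (object.nil? || o == object)
  end
end

# Roughly equivalent to: SELECT ?s WHERE { ?s rdf:type wfprov:ProcessRun }
process_runs = match_triples(triples, predicate: "rdf:type", object: "wfprov:ProcessRun")
puts process_runs.map(&:first).inspect
```

A real SPARQL engine additionally joins several such patterns on shared variables, which is what makes it more flexible than matching triples one at a time.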

Other examples of gems are “devise” and “omniauth”. The latter can be used for logging in through
other social websites such as Facebook, Twitter and Google.


3.3.1.2 Node.js

Node.js is an open-source, server-side JavaScript runtime environment. As its developers mention
in (Node.js Foundation, 2016), its purpose is to offer an easy way to "build scalable network
applications". In other words, it allows many connections to be handled concurrently.

3.3.1.3 D3.js

D3.js is a JavaScript library which can be used to build various types of diagrams and charts to
visualize data. ATDV already uses D3.js to display the workflow diagrams. The library uses scalable
vector graphics (SVG), through which animation and interactivity can be added to 2D graphics, and
it therefore fits task (2) mentioned in 3.2.3. Some examples are the movement of nodes or the
ability to highlight specific paths. One important aspect is that it accepts all kinds of data as
long as the data has a consistent structure.

Alternatives were considered as well. One of them is the GraphViz software, which is efficient in
determining the optimal layout with consideration to the paths. However, compared with D3, its
output is static and visually plain.

In addition, similar solutions to those proposed in (Dang et al., 2015) can be found in the online

gallery of the D3 tool. There is the Sankey Diagram, which seems to resemble the icicle tree

representation with coloured bands (Figure 8). This will be implemented as a Horizontal Sankey

Diagram similarly to the Prov-O-Viz tool.

Meanwhile, the provenance matrix (Figure 9) could be implemented using either the Adjacency
Matrix or the Clustergrammer D3 diagram. Clustergrammer seems to fulfil more requirements than the
Adjacency Matrix diagram. However, because Clustergrammer is still under development, it was
assumed that this tool might be unstable and more likely to crash.

Thus, this project aims to implement the Horizontal Sankey Diagram and the Adjacency Matrix
Diagram as ways of visualizing provenance. Sample images of these diagrams are available in the
next chapter.

3.3.1.4 JSON

As mentioned earlier, ATDV already uses D3 to display the workflow diagram, which is built from
data serialized to JSON. At the end of the project, as an out-of-scope objective, the structure of
this JSON was adapted to the one used for the provenance. The new structure is the same for all
types of diagrams that have been implemented and is presented in comparison with the old one in
Figure 13 and Figure 14.

Figure 13. Original JSON structure

Figure 14. New JSON structure

3.3.1.5 Bower

Bower is a package manager for external components. (Bower, 2012) mentions that this tool can be
used for adding and managing project components such as "frameworks, libraries, assets and other
utilities." For the ATDV project, it is used to install the AdminLTE theme for the website. The
theme is built with Bootstrap by Almsaeed Studio and can be previewed online at
(Almsaeed Studio, 2016).

3.3.1.6 Slim

Slim is a template language whose purpose is to replace HTML by reducing the standard HTML syntax
to be as concise as possible. (Mendler, Stone and Wu, 2016) state that it started as an exercise
and that its functionality expanded once its popularity increased. In addition, benchmarks by a
software developer named Klaus Zanders (Zanders, 2012) found it to be the fastest template
language for Ruby on Rails. The language also has a logical structure. Its simplicity may be
noticed from the following rules:

- Indentation is used, as in Python, to represent the parent-child relationship of the elements;
- The ID attribute is denoted by the hash symbol and the class attribute by the dot symbol, as in
  CSS;
- A new line that starts with a hash or a dot represents the “div” HTML element with the
  respective attributes.

3.3.1.7 CoffeeScript

Similarly to the previous tool, CoffeeScript is a language that compiles to JavaScript and whose
purpose is to provide a readable and clean structure. It shares some rules with Slim, such as the
use of indentation to mark nesting. On its official website (CoffeeScript, 2015), it is claimed
that the compiled output "tends to run as fast or faster than the equivalent handwritten
JavaScript". In addition to this claimed performance, the language has been used to ensure that
consistency is maintained throughout the whole implementation of the D3 diagrams included in the
project.

3.4 Container Diagram

Having introduced most of the technologies that were used to implement this project, Appendix B
presents them by means of diagrams. These diagrams are helpful during the implementation stage as
a reminder of which technologies are used and how they communicate within the system.


Chapter 4. Implementation and Results

4.1 Overview

This chapter presents the development stage of the project. It first looks at the methodology
considered for development and its environment. Based on the material presented in the previous
chapter, the discussion then focuses on the implementation of the main tasks. The chapter also goes
into the details of certain features and provides code samples.

In addition, the chapter covers the difficulties encountered during this process. One of the most
difficult parts was the research conducted to establish the appropriate design. From then onwards,
the problems were mostly technical rather than methodological.

4.2 Development Environment

Scientific Linux v7.1 was used as the operating system on which this website was developed, since
the Ruby on Rails framework was created on Linux and is not very compatible with Windows.
Installing Ruby on Rails on Windows is quite complex because there are many missing dependencies
that must be resolved. For example, Windows does not ship with the libxml2 [1] or iconv [2]
libraries, which are prerequisites for Rails.

The coding was done in a text editor called Sublime Text 2. This software is well suited to web
development on Linux since it offers auto-completion, highlighting of reserved words, a project
explorer and even a small compiler. However, this tool does not have a debugger plug-in. Google
Chrome's Developer Tools come in handy here, as they allow client-side code to be debugged. For
server-side errors, the Ruby on Rails compiler provides accurate descriptions of compile-time
errors, such as wrong indentation in CoffeeScript files or out-of-bounds errors.

4.3 Development

Having set the design and the environment, the coding part of the implementation started. This
section therefore presents the most important parts of this project with sample code, illustrations
and results.

[1] An XML parser library written in C
[2] An API used to convert between different character encodings


4.3.1 Reading, parsing and querying

One of the most important objectives was the ability to read, parse and extract the provenance
information. These three tasks are explained together since the rdflib-turtle library is used for
all of them. Reading and parsing are common tasks: all that is needed is to initialise an empty RDF
graph and provide the path where the provenance file is saved, together with its format.

However, since only the relevant information had to be chosen from the file, the querying part
presented a few complications and required more thought. The challenge was that the types of
artifacts could not be distinguished easily. As can be seen in Figure 15, the tricky part was to
realise that the difference between artifacts is recognizable through the absence of one of the
properties.

Figure 15. Example of SPARQL query - Extraction of Workflows Runs
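The idea of recognising a kind of node by a missing property can be sketched in plain Ruby. The property name `ex:hasValue`, the sample nodes and the `split_by_property` helper are hypothetical, since the text does not state which property is absent. In SPARQL the same effect is typically obtained with `FILTER NOT EXISTS`, or with `OPTIONAL` combined with `!bound(...)`.

```ruby
# Hypothetical artifact descriptions: node URI => { property => value }.
artifacts = {
  "data:1" => { "wfprov:wasOutputFrom" => "proc:1", "ex:hasValue" => "World" },
  "data:2" => { "wfprov:wasOutputFrom" => "proc:2", "ex:hasValue" => "Hello, World" },
  "list:1" => { "wfprov:wasOutputFrom" => "proc:3" } # no ex:hasValue
}

# Split the nodes into those that carry the property and those that lack it.
def split_by_property(artifacts, property)
  with, without = artifacts.partition { |_uri, props| props.key?(property) }
  { with: with.map(&:first), without: without.map(&:first) }
end

result = split_by_property(artifacts, "ex:hasValue")
puts "have the property: #{result[:with].inspect}"
puts "lack the property: #{result[:without].inspect}"
```

Under the assumption made here, the nodes that lack the property would be treated as dictionaries and the rest as simple artifacts.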

Once all queries [3] had been executed, the next step was to insert the resulting data into a Ruby
object and append it to the existing JSON as shown in Figure 14. This approach minimises the number
of requests a client makes to the server, while making all required data available for computation.
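A minimal sketch of this step, using Ruby's standard `json` library, is given below. The key names (`workflow`, `provenance`, `workflow_runs` and so on) and the `with_provenance` helper are assumptions for illustration and do not reproduce the actual structure shown in Figure 14.

```ruby
require "json"

# Existing workflow JSON (structure assumed for illustration).
existing = JSON.parse('{"workflow": {"nodes": [], "links": []}}')

# Hypothetical query results, one array per kind of provenance node.
workflow_runs = [{ "uri" => "run:1", "label" => "Hello World run" }]
process_runs  = [{ "uri" => "proc:1", "label" => "concatenate" }]
artifacts     = [{ "uri" => "data:1", "label" => "greeting" }]

# Return a copy of the document with the provenance added under one key,
# so that a single response carries both the workflow and its provenance.
def with_provenance(document, workflow_runs, process_runs, artifacts)
  document.merge(
    "provenance" => {
      "workflow_runs" => workflow_runs,
      "process_runs"  => process_runs,
      "artifacts"     => artifacts
    }
  )
end

puts JSON.pretty_generate(with_provenance(existing, workflow_runs, process_runs, artifacts))
```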

[3] There is a query for each type of provenance node: workflow runs, process runs and artifacts.


4.3.2 Diagrams

This project was developed as a visualisation tool to which other types of graphics can easily be
added. In order to keep the project tidy and easily maintainable, certain decisions were made in
relation to its organisation. All the diagram types, written in CoffeeScript, are placed under the
javascript folder found in the assets folder of the Ruby on Rails project.

However, an unexpected problem arose with D3: the JSON was requested every time a graph needed to
be (re)drawn. For example, if a page had 5 graphs, then the JSON response would be requested 5
times. As a solution, the file is requested only once and saved into a temporary variable.
Afterwards, the methods for drawing the diagrams are called.
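The project's client-side code is written in CoffeeScript. The same fetch-once idea is sketched here in Ruby only to keep the examples in one language. The `fetch_once` helper and the payload are hypothetical stand-ins for the real JSON request.

```ruby
# Wrap an expensive fetch so that it runs at most once; later calls reuse
# the value saved by the first call.
def fetch_once(&fetcher)
  cached = nil
  fetched = false
  lambda do
    unless fetched
      cached = fetcher.call
      fetched = true
    end
    cached
  end
end

requests = 0
provenance = fetch_once do
  requests += 1 # stands in for the real request to the server
  { "nodes" => %w[a b], "links" => [] }
end

# Five "diagrams" are drawn, but the data is only requested the first time.
5.times { |i| puts "diagram #{i + 1}: #{provenance.call['nodes'].size} nodes" }
puts "requests made: #{requests}"
```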

4.3.3 Sankey Diagram

One diagram that proved to be useful for visualizing provenance is the Sankey Diagram. It was
implemented using tutorials from the D3 website. However, these tutorials are not well documented,
and the way the D3 examples build the Sankey Diagram does not suit the purpose of this project. For
example, the unmodified version would have positioned all the nodes wrongly, based on a base-10
logarithm of the number of consecutively linked nodes. Therefore, adaptations had to be made to fit
this project.

Another important aspect of the Sankey Diagram is the way its layout is computed. D3 uses Tarjan's
Algorithm to determine the order, position and height of each node in the page. In other words, all
nodes are initially placed in the same column on the right part of the browser. Once an edge has
been drawn between two nodes, the source node moves one position to the left. Furthermore, the
height of the node is multiplied by the larger of its number of outgoing edges and its number of
incoming edges.
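The placement rule described above can be sketched as follows. This is not the D3 implementation and it does not implement Tarjan's algorithm; it only mirrors the described outcome for an acyclic graph. The link data, the helper names and the base height of 20 are assumptions.

```ruby
# Hypothetical provenance links as [source, target] pairs.
links = [
  ["input", "process"],
  ["process", "list"],
  ["process", "output"],
  ["list", "output"]
]

# Number of columns a node sits to the left of the rightmost column:
# 0 for nodes without outgoing links, otherwise one more than the largest
# value among its targets. Assumes the graph has no cycles.
def columns_from_right(links)
  targets = Hash.new { |hash, key| hash[key] = [] }
  links.each { |source, target| targets[source] << target }

  offset = {}
  visit = lambda do |node|
    offset[node] ||=
      if targets[node].empty?
        0
      else
        targets[node].map { |target| visit.call(target) }.max + 1
      end
  end
  links.flatten.uniq.each { |node| visit.call(node) }
  offset
end

# Node height: a base height scaled by the larger of the in- and out-degree.
def node_heights(links, base_height = 20)
  outgoing = Hash.new(0)
  incoming = Hash.new(0)
  links.each do |source, target|
    outgoing[source] += 1
    incoming[target] += 1
  end
  links.flatten.uniq.map { |node| [node, base_height * [outgoing[node], incoming[node]].max] }.to_h
end

puts columns_from_right(links).inspect
puts node_heights(links).inspect
```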

As an out-of-scope requirement, this type of diagram was also used to replace the existing
visualisation of the workflow. However, the implementation differs between the workflow and the
provenance visualisations. The workflow diagram is laid out vertically, while the provenance is
displayed horizontally, with the flow being read from the left to the right of the screen.

Once the implementation of the Sankey Diagram was finished, the next part was to improve the user
experience by making changes to enhance its usability and interactivity. Some of these changes are
detailed in the remainder of this section.


4.3.3.1 Coloured Nodes

Every node in the Sankey Diagram is coloured based on the node type. A legend is located above the

diagram. As in Figure 16, the legend is formed of boxes coloured differently based on the referenced

object such as: “Flamenco” → simple artifacts, “Spring Green” → workflow runs, “Curious Blue”

→ process runs and “Electric Violet” → dictionaries. A complete list of colours used across the

system can be found in Appendix D.

Figure 16. Sankey Diagram – Legend

4.3.3.2 Draggable Nodes

The next addition is the ability to move a node anywhere on the page below the tabs. This turns out
to be useful for a scientist who wants to separate specific nodes. Figure 17 shows an example in
which the user has moved the irrelevant nodes to the bottom-right corner to allow a cleaner and
more focused visualisation of the part of the workflow that is of specific interest.

Figure 17. Sankey Diagram – Draggable Nodes. Initially, the provenance had a lineage aspect.

However, the nodes can be moved to increase readability or to separate the relevant information.


4.3.3.3 Text labels for Nodes

As mentioned earlier, the Turtle file is composed of triples that consist of a subject, a predicate
and an object. Every subject represents a provenance node and is uniquely identifiable due to its
naming convention. Additionally, labels are assigned to some of the nodes through the “rdf”
ontology. It is important to mention that a node can have more than one label. Therefore, the
labels can be thought of as all the nicknames that a node receives.

Due to the method of labelling, there are three cases that need to be dealt with: nodes with no
label, nodes with exactly one label and nodes with multiple labels. The first and second cases are
straightforward. The nodes with no label are always dictionaries, so one solution is to label them
“Lists”. A node with exactly one label is the ideal case, since the label can be displayed as it
is.

However, the real difficulty is related to the nodes with multiple labels. Because D3 uses SVG,
the standard newline characters such as “\n” or “\newline” and the HTML break element “<br/>” do
not work. Therefore, it was not possible to display the labels of a single node directly on
multiple lines. To overcome this, the solution was to create a string in which every line is
wrapped in a <span> element. The result is illustrated in the above figure.
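The three labelling cases can be summarised in a short Ruby sketch. The `label_markup` helper is hypothetical, and the project's actual implementation lives in its CoffeeScript diagram code. For brevity the sketch does not escape special characters in the labels.

```ruby
# Build the text shown for a node from its list of labels:
# - no label: fall back to "Lists" (unlabelled nodes are dictionaries);
# - one label: use it directly;
# - several labels: wrap each one in its own <span> so that each can be
#   placed on a separate line.
def label_markup(labels)
  case labels.size
  when 0 then "Lists"
  when 1 then labels.first
  else labels.map { |label| "<span>#{label}</span>" }.join
  end
end

puts label_markup([])
puts label_markup(["greeting"])
puts label_markup(%w[greeting out])
```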

4.3.3.4 Node information displayed on mouse-over

As the label was extracted and displayed, the next thing

was to show more details about the respective node.

For a clean and convenient approach, a tool-tip is used.

In the tool-tip body, the user can find the URI of the

node and other characteristics that differ for each type

of provenance node. An artifact would contain the

URI, the labels and the value inside that node. On the

other hand, a process would contain information about

the time when that process was executed and how long

it took to complete it. Figure 18 demonstrates one of

the cases mentioned previously.

Figure 18. Sankey Diagram - Information on

hovering nodes


4.3.3.5 Coloured links between nodes

The Sankey Diagram colours the path between a source node and a target node with a variant of the
colour of the source node. In the version from the Sankey Diagram documentation, if two paths
intersect then one of the paths covers the other, making it hard to visualise these connections. As
a solution, the paths were made semi-transparent, which enabled the visualisation of both paths.
The colour of the intersection is a combination of both colours.

4.3.3.6 Path information on mouse-over

The same technique was also applied to distinguish the path hovered over by the mouse from the
others, by changing their opacities. In addition, extra information related to the path, such as
the source and target URIs and types, is displayed on hover. This feature is visible in Figure 19.

Figure 19. Sankey Diagram – Information on hovering the coloured paths

4.3.3.7 Single Click on node

What distinguishes this tool from other provenance tools is the ability to highlight all the paths
formed by the nodes that contributed to the creation of the clicked node and by the nodes that
depend on the clicked node.

Figure 20. Sankey Diagram – Highlight the path of the orange node in the middle. This enables a

researcher to follow easier the flow of a specific node.
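One way to determine which links to highlight is to collect everything upstream and downstream of the clicked node with a breadth-first search. The following Python sketch illustrates the idea under the assumption that the graph is given as a list of (source, target) pairs; it is not the implementation used in the project.

```python
from collections import defaultdict, deque


def reachable(start, neighbours):
    """Return all nodes that can be reached from start via neighbours."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def highlighted_links(links, clicked):
    """Links on paths leading to, or starting from, the clicked node."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for source, target in links:
        outgoing[source].append(target)
        incoming[target].append(source)
    upstream = reachable(clicked, incoming) | {clicked}    # contributed to it
    downstream = reachable(clicked, outgoing) | {clicked}  # depend on it
    return [
        (s, t)
        for s, t in links
        if (s in upstream and t in upstream) or (s in downstream and t in downstream)
    ]


if __name__ == "__main__":
    links = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "c"), ("y", "z")]
    print(highlighted_links(links, "c"))
```

All other links can then be dimmed, for example by lowering their opacity.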


4.3.3.8 Double click on node hides links

Another feature implemented in this tool is the ability to hide all the outgoing edges of a single node by double clicking on it. This allows the user to hide paths and concentrate on the parts that are more important.

Figure 21. Sankey Diagram – Hide outgoing links on double clicking the violet node to reduce noise

4.3.3.9 Zooming

Finally, the last interactive functionality of this diagram is the ability to zoom in or out with the mouse scroll wheel. However, to use this feature the user needs to enable it by clicking on a button positioned below the graph. Zooming on scroll is disabled by default so that the user does not accidentally scale the graph when the intention is to scroll through the page.

4.3.4 Adjacency Matrix

After the Sankey Diagram was successfully implemented, and since there was still time left until the deadline, a new type of diagram was implemented: the Adjacency Matrix. As the name suggests, it represents the data in an N*N matrix, where N is the number of nodes. Each row and column is labelled, and if the intersection of a row with a column is coloured, this means that there is a relationship between those nodes. This type of graph was created using the D3 library as well, but it is not as sophisticated as the previous one.
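The underlying data structure can be sketched in a few lines of Python. The sketch assumes that rows stand for source nodes and columns for target nodes, and that a coloured cell corresponds to the value 1; both conventions are assumptions made for illustration.

```python
def adjacency_matrix(nodes, links):
    """Build an N*N matrix in which cell [row][column] is 1 when a link
    goes from the row node (source) to the column node (target)."""
    index = {name: position for position, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in links:
        matrix[index[source]][index[target]] = 1
    return matrix


if __name__ == "__main__":
    nodes = ["input", "process", "output"]  # made-up node labels
    links = [("input", "process"), ("process", "output")]
    for label, row in zip(nodes, adjacency_matrix(nodes, links)):
        print(f"{label:>8}", row)
```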

4.3.4.1 Sorting

This diagram has a very different feature set from the one highlighted in the previous section. The Adjacency Matrix can sort the nodes based on criteria such as label, frequency and type, as shown in Figure 22.


Figure 22. Adjacency Matrix Diagram – The provenance is represented as a coloured table in which every type of cell represents a specific relationship between the types of provenance nodes. The table can be sorted alphabetically by label (left), by type (middle) and by frequency (right).
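The three orderings can be sketched as follows in Python. The node fields (`id`, `label`, `type`) are hypothetical, and "frequency" is assumed here to mean the number of links that touch a node; the project may define it differently.

```python
from collections import Counter


def sort_orders(nodes, links):
    """Return the node ids ordered by label, by type and by frequency."""
    degree = Counter()
    for source, target in links:
        degree[source] += 1
        degree[target] += 1

    def label_key(node):
        return node["label"].lower()

    by_label = sorted(nodes, key=label_key)
    by_type = sorted(nodes, key=lambda n: (n["type"], label_key(n)))
    # Most frequently linked nodes first; ties are broken by label.
    by_frequency = sorted(nodes, key=lambda n: (-degree[n["id"]], label_key(n)))
    return {
        "label": [n["id"] for n in by_label],
        "type": [n["id"] for n in by_type],
        "frequency": [n["id"] for n in by_frequency],
    }


if __name__ == "__main__":
    nodes = [
        {"id": "p", "label": "process", "type": "Process Run"},
        {"id": "a", "label": "Alpha", "type": "Artifact"},
        {"id": "b", "label": "beta", "type": "Artifact"},
    ]
    links = [("a", "p"), ("p", "b"), ("b", "p")]
    print(sort_orders(nodes, links))
```

Re-ordering the rows and columns in the same way keeps related nodes close together, which makes clusters of relationships easier to spot.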

4.3.4.2 Coloured cells for paths

As in the previous diagram, the relationships between nodes are coloured. However, this diagram has a wider range of colours, because it uses a different colour for each possible combination of provenance node types. Similarly, a detailed legend is provided.

4.3.4.3 Saving the graph as picture

All the diagrams can be saved and downloaded as pictures. The picture will have a unique name

based on the type of diagram the user is downloading and when. For example, if you download a

provenance diagram on 1st of February 2016 at 10:00:01am, then the file will be named

"provenance_01_02_2016_10_00_01.jpg".
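The naming scheme from the example above can be reproduced with a short Python sketch. The function name `export_filename` is hypothetical; only the pattern of the resulting name is taken from the text.

```python
from datetime import datetime


def export_filename(diagram_type: str, when: datetime) -> str:
    """Build a file name from the diagram type and the download time,
    using the day_month_year_hour_minute_second pattern."""
    timestamp = when.strftime("%d_%m_%Y_%H_%M_%S")
    return f"{diagram_type}_{timestamp}.jpg"


if __name__ == "__main__":
    # 1st of February 2016 at 10:00:01
    print(export_filename("provenance", datetime(2016, 2, 1, 10, 0, 1)))
```

Because the time is included down to the second, two downloads of the same diagram type receive different names unless they happen within the same second.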

This feature has some limitations. The first is that only the nodes and links that are visible on the screen are saved in the picture. The second is that the saved picture is static, so the additional information shown when the mouse hovers over a node is no longer available.

In spite of these limitations, it is useful for scientists to save a picture of their research to include in reports, presentations or other materials.


Chapter 5. Testing and Evaluation

5.1 Overview

This chapter presents how the software was tested throughout the development stage to determine the correctness of the functionalities that had been implemented. Afterwards, the methods used to evaluate how successful these functionalities are are described.

5.2 Testing

Since this project is mostly about the visual presentation of information, standard test-driven techniques could not be applied. Therefore, a new strategy, mostly based on manual testing, had to be developed in order to test the graphical user interface.

Because the agile methodology was used, testing was carried out immediately after each objective was implemented. This enabled early detection of errors and bugs. The project consists of three important iterations that needed to be tested: the extraction of the correct data from the databundle, the correct implementation of every type of diagram, and the coupling of the data with the diagram.

5.2.1 Functional Testing

Functional testing is performed once the project has been completed and covers all the functionalities of the project. It therefore includes all the testing that was carried out after each main task was finished. The test was based on a more detailed version of the diagram presented in Figure 12, so even the functionalities of the original website were considered.

5.2.1.1 Data extraction testing

For the data extraction, the reading and parsing were done through the rdflib-turtle gem, which saves the entire provenance graph in local memory. This step was tested by directly comparing the output with the original file.

The data querying, on the other hand, was tested using third-party software named Apache Jena Fuseki. This is a Java framework that can be used as a SPARQL end-point. It runs as a web application and offers the ability to test and validate SPARQL queries. A considerable advantage is that it can directly load files in Turtle format (like workflow.prov.ttl). Once the graph is loaded, it can be queried and the queries can be tested for correctness. The results can be examined in the table where they are output. In addition, this tool also validates whether a query conforms to the standards recommended by the World Wide Web Consortium (W3C).

5.2.1.2 Diagrams Testing

The next iteration required black-box testing to determine the quality of each diagram. A test workbench was designed to simulate the cases mentioned in Appendix E. Those tests were performed for different values of N.

5.2.1.3 Coupling of the data with the diagram

The task that follows the implementation and testing of the diagrams is the coupling of the provenance data with the diagram. This step was tested entirely manually, by following the relationships presented in Table 3.

Node Type      Relationship (wfprov)    Node Type      Direction
Artifact       wasOutputFrom            Workflow Run   ←
Artifact       wasOutputFrom            Process Run    ←
Dictionary     hadMember                Artifact       →
Dictionary     hadMember                Dictionary     →
Process Run    wasPartOfWorkflowRun     Workflow Run   ←
Process Run    usedInput                Artifact       →
Process Run    usedInput                Artifact       →
Workflow Run   wasPartOfWorkflowRun     Workflow Run   ←

Table 3. Directions of relationships
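The directions in Table 3 could be encoded as a lookup table that turns each provenance statement into a (source, target) link for the diagram. The Python sketch below assumes that "→" means the link is drawn from the left-hand node to the right-hand node and "←" the reverse; this reading of the arrows, like the function name `to_link`, is an assumption.

```python
# Direction per wfprov relationship, as listed in Table 3.
# "->" : link drawn from the left-hand node to the right-hand node (assumed)
# "<-" : link drawn from the right-hand node to the left-hand node (assumed)
DIRECTIONS = {
    "wasOutputFrom": "<-",
    "hadMember": "->",
    "wasPartOfWorkflowRun": "<-",
    "usedInput": "->",
}


def to_link(left_node, relationship, right_node):
    """Turn one provenance statement into a (source, target) link."""
    if DIRECTIONS[relationship] == "->":
        return (left_node, right_node)
    return (right_node, left_node)


if __name__ == "__main__":
    print(to_link("artifact-1", "wasOutputFrom", "process-1"))
    print(to_link("dictionary-1", "hadMember", "artifact-1"))
```

Manual testing then amounts to checking, for each relationship in the databundle, that the link in the diagram runs in the direction given by the table.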

5.2.1.4 Observations and remarks

By testing the diagrams, various irregularities were found. One irregularity was the logarithmic problem: sequentially connected nodes were initially layered according to values of log(N) instead of being placed on the same row (see Appendix E). Another irregularity was noticed when the graph was wider than the browser window: there was no way for the user to see the part of the diagram that fell outside the window.


5.2.2 Cross-Browser Testing

Another type of testing was cross-browser testing. The project was tested on many different modern browsers that support web sockets. The outcome of this test is that the project works with the following browsers: Internet Explorer 10+, Safari 6.0+, Google Chrome 16.0+ and Mozilla Firefox 6.0+.

5.3 Evaluation

Another important phase of the development project is the evaluation, which is used to determine the correctness of the whole product. Several methods were used for the evaluation and are described in Table 4.

Type of Evaluation:   Formative | Summative
Indications:          Proactive | Clarificative | Interactive | Monitoring | Outcome
Time Period:          Research Stage | Planning | Implementation Stage | Testing

Purpose of Evaluation:
Proactive: Define and understand the requirements.
Clarificative: Gain knowledge on the prerequisites and design a plan.
Interactive: Code the functionalities following the plan.
Monitoring: Ensure that the project is delivered upon deadline.
Outcome: Evaluate whether the project has met its objectives.

Source: (Owen and Rogers, 1999) mentioned in (Pritchard, 2014)

Table 4. Evaluation Techniques

Another kind of evaluation was provided either through feedback received during the meetings with the supervisor and the myGrid team, or through the assessed activities related to this project, such as the Seminar and the Presentation of Results.

Through all the activities mentioned above, sufficient feedback was obtained. All of it was taken into consideration, and almost all the suggestions that could be implemented in the allocated time have been included in my project. Those that have not been implemented were added as future work, which can be found in the last chapter.


Chapter 6. Conclusion

6.1 Overview

This chapter presents the achievements of the project. It also provides an evaluation of the decisions made during the period of time allocated to deliver this project and of how they affected the initial timing and milestones. Afterwards, it reflects on how this project contributed to the development of new skills and the improvement of existing ones, and it concludes with possible future improvements.

6.2 Achievements

Overall, the third year project presented in this report has successfully fulfilled its main aim, which was to extend and improve the ATDV website, which at the moment allows scientists to visualise workflows along with the provenance resulting from running them. Its success can be judged from the demonstration given to my supervisor, my second evaluator and several members of the myGrid team.

In completing this project, two methods to visualise the provenance were presented, namely the Sankey Diagram and the Adjacency Matrix. The software has been designed so that additional types of diagrams can be added, by enforcing a fixed format for the JSON file. Everything was possible due to the research and design work, which was carried out using a combination of the waterfall and agile methodologies.

6.3 Variation from the initial plan

The first project design was produced when I had little knowledge about the topic. During the time allocated for the project, it was redone several times due to the new information acquired. This required modifications to functionalities that had already been added. As time passed, I noticed that the dates of the project milestones needed to be updated, because the progress was either ahead of the plan or behind it.

The progress was behind the plan in the periods before and after the winter holiday. The end of the first semester was the moment when most of the courses required submissions and evaluations of assignments. Similarly, the period after the winter holiday was mostly reserved for the exams.


6.4 Experience gained

The development of this project represents a great improvement for me. Through the completion of this project, I have developed new skills and improved existing ones, which will be useful in the future. I believe that the greatest development can be seen in my technical and research skills, followed by the management ones.

Technical Skills

The reason why I believe that my technical skills improved is that my previous experience with most of the technologies used in this project was almost non-existent. This meant that I had to learn everything from zero on my own, by creating simple applications while completing the available online tutorials. Through practice and exercise, I started accumulating the experience that allowed me to complete my project using these technologies.

Thus, I am satisfied that in such a short period of time I was able to gain knowledge and practical skills in web development with the Ruby on Rails framework and the other technologies mentioned in chapter 3. These can be considered a valuable set of skills that can be added to my Curriculum Vitae.

Research Skills

Another area in which I have improved is carrying out research, as a result of the time spent on finding the best tools able to meet the objectives of the project. Through proper research, I was able to find similar solutions that already existed. Therefore, there was no need for me to reinvent the wheel; I only needed to add several improvements or modify the existing solutions so that they fit my requirements.

Management Skills

As with all projects, several essential skills are needed in order to complete the project on time. On the one hand, there is the time management skill. This required an organized schedule that included both academic and personal deadlines, prioritized by their importance. Usually, the academic ones had to be completed first because of the penalties imposed for not submitting the deliverables on time.

Another skill that I had the chance to improve is communication. It was improved not only by the meetings and emails exchanged with the supervisor, but also by working with the myGrid team during the first semester. This opportunity allowed me to get to know people who have more experience than me and to ask them for direction when I was stuck or had no ideas. However, in the second semester, due to a change in the timetable, I started working from home, and communication with the researchers was done via Skype, emails or short meetings.


6.5 Future Works

The current project represents the first release, which provides only the basic functionality of visualizing the provenance. In order to make this application ready for public release to a wider user community, I have identified several pieces of future work:

The ability to group nodes as in the project created by the Polish student;

Install Taverna Server and execute the workflow directly on the server of this website. The user would then no longer need to upload the entire databundle, only the workflow;

Add the ability to compare provenance graphs of the same workflow run with different set-ups, by pointing out the differences;

Upgrade the Adjacency Matrix to the Clustergrammer when it is released;

Save the JSON file next to the databundle or workflow so that the reading, parsing and querying are done only once;

A loading pop-up window that lets the user know when the diagram has finished loading;

A summative user experience evaluation, following the instructions in Appendix F, through which the usability of this project is determined.


References

Almsaeed Studio, (2016). AdminLTE Preview - Almsaeed Studio. [online] Almsaeedstudio.com.

Available at: https://almsaeedstudio.com/preview [Accessed 26 November 2015].

Apache Taverna Databundle Viewer, (2015). DatabundleViewer. [online]

Databundle.herokuapp.com. Available at: http://databundle.herokuapp.com/ [Accessed 14

October 2015].

Autoimmunity Research Foundation. (2012). Differences between in vitro, in vivo, and in silico

studies. [online] http://mpkb.org. Available at:

http://mpkb.org/home/patients/assessing_literature/in_vitro_studies [Accessed 20 October

2015].

Bower, (2012). Bower. [online] Bower.io. Available at: http://bower.io/docs/about/ [Accessed 2

November 2015].

Bowers, S. (2012). Scientific Workflow, Provenance, and Data Modeling Challenges and

Approaches. Journal on Data Semantics, 1(1), pp.19-30.

Brown, S. (2016). The Art of Visualising Software Architecture. [online] Leanpub. Available at:

https://leanpub.com/visualising-software-architecture/read [Accessed 20 October 2015].

CoffeeScript, (2015). CoffeeScript Documentation. [online] Coffeescript.org. Available at:

http://coffeescript.org/ [Accessed 2 November 2015].

Dalling, T. (2014). Model View Controller Explained. [online] Tomdalling.com. Available at:

http://www.tomdalling.com/blog/software-design/model-view-controller-explained/ [Accessed

17 October 2015].

Dang, T., Franz, N., Ludascher, B. and Forbes, A. (2015). Provenance Matrix. International

Workshop on Visualizations and User Interfaces for Ontologies and Linked Data, VOILA 2015

- co-located with 14th International Semantic Web Conference, ISWC 2015. [online] Available

at: https://asu.pure.elsevier.com/en/publications/provenancematrix-a-visualization-tool-for-

multi-taxonomy-alignmen [Accessed 20 October 2015].

Gol, M. (n.d.). Provenance Visualisation Test. [online] Provenance.curo.ch. Available at:

http://provenance.curo.ch/graph [Accessed 2 October 2015].

Hoekstra, R. (2013). Data2Semantics | PROV Provenance Visualizer. [online] Provoviz.org.

Available at: http://provoviz.org/ [Accessed 2 October 2015].

Mendler, D., Stone, A. and Wu, F. (2016). Documentation for slim (3.0.6). [online] Rubydoc.info.

Available at: http://www.rubydoc.info/gems/slim [Accessed 24 October 2015].


Nenadic, A. (2014a). myExperiment - Workflows - Example workflow for REST and XPath

activities (Alex Nenadic) [Taverna 2 Workflow]. [online] Myexperiment.org. Available at:

http://www.myexperiment.org/workflows/4206.html [Accessed 10 November 2015].

Nenadic, A. (2014b). myExperiment - Workflows - Hello Anyone [Taverna 2 Workflow]. [online]

Myexperiment.org. Available at: http://www.myexperiment.org/workflows/4210.html

[Accessed 10 November 2015].

Node.js Foundation, (2016). About | Node.js. [online] Nodejs.org. Available at:

https://nodejs.org/en/about/ [Accessed 26 October 2015].

Owen, J. and Rogers, P. (1999). Program evaluation. London: Sage Publications.

Oxford Dictionaries, (2016). workflow – definition of workflow in English from the Oxford

dictionary. [online] Oxforddictionaries.com. Available at:

http://www.oxforddictionaries.com/definition/english/workflow [Accessed 26 March 2016].

Pritchard, M. (2014). Types of evaluation. [online] Evaluationtoolbox.net.au. Available at:

http://evaluationtoolbox.net.au/index.php?option=com_content&view=article&id=15&Itemid=

19 [Accessed 26 February 2016].

Soiland-Reyes, S., Bechhofer, S., Klyne, G., Palma, R., Belhajjame, K., García Cuesta, E., Garijo, D.

and Corcho, O. (2013). Wf4Ever Research Object Model 1.0 (2013-11-30). [online] Zenodo.

Available at: http://dx.doi.org/10.5281/zenodo.12744 [Accessed 18 October 2015].

Soiland-Reyes, S., (2013). Data Bundle - Taverna 3 dev - myGrid developer wiki. [online]

Dev.mygrid.org.uk. Available at:

http://dev.mygrid.org.uk/wiki/display/TAVOSGI/Data+Bundle [Accessed 18 October 2015].

Taverna.org.uk. (2009a). What is in silico experimentation? | Taverna. [online] Available at:

http://www.taverna.org.uk/introduction/what-is-in-silico-experimentation/ [Accessed 1

November 2015].

Taverna.org.uk, (2009b). What is a Workflow Management System? | Taverna. [online]

Taverna.org.uk. Available at: http://www.taverna.org.uk/introduction/what-is-a-workflow-

management-system/ [Accessed 2 October 2015].

The University of Manchester, and University of Southampton, (n.d.). About myExperiment. [online]

Myexperiment.org. Available at: http://www.myexperiment.org/about [Accessed 28 March

2016].

Williams, A. (2014). Design Perspective - Taverna 2.5 - myGrid developer wiki. [online]

Dev.mygrid.org.uk. Available at:

http://dev.mygrid.org.uk/wiki/display/tav250/Design+Perspective [Accessed 31 March 2016].

Workflow Management Coalition, (1996). Workflow Management Coalition - Glossary and

Terminology. [online] Aiai.ed.ac.uk. Available at:


http://www.aiai.ed.ac.uk/project/wfmc/ARCHIVE/DOCS/glossary/glossary.html [Accessed 26

March 2016].

Zanders, K. (2012). HamlErbSlim. [online] GitHub. Available at:

https://github.com/scalp42/hamlerbslim [Accessed 24 October 2015].


Appendix A: workflow.prov.ttl inside Hello Anyone databundle


Appendix B: Container Diagram

Figure 23. Container Diagram


Appendix C: Model-View-Controller Design

Source: (Dalling, 2014)

Figure 24. Model-View-Controller Architecture

The Model-View-Controller design is explained briefly here. As can be seen in the above picture, the flow of a standard Ruby on Rails web application consists of 3 elements:

Controller:

o a list of all the commands that a user can request;

Model:

o the part whose only responsibility is to manage data;

View:

o an interface in which data is viewed and can be modified;
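The division of responsibilities can be illustrated with a small sketch. It is written in plain Python rather than Ruby on Rails, and all class and method names are hypothetical; it only mirrors the three roles listed above.

```python
class Model:
    """Manages the data and nothing else."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def all(self):
        return list(self._items)


class View:
    """Presents the data to the user."""

    def render(self, items):
        return "\n".join(f"- {item}" for item in items)


class Controller:
    """Exposes the commands that a user can request."""

    def __init__(self, model, view):
        self.model = model
        self.view = view

    def create(self, item):
        # Update the data through the model, then show the new state.
        self.model.add(item)
        return self.index()

    def index(self):
        return self.view.render(self.model.all())


if __name__ == "__main__":
    controller = Controller(Model(), View())
    controller.create("Hello Anyone run 1")  # made-up item names
    print(controller.create("Hello Anyone run 2"))
```

The controller never stores data itself and the view never changes it, which is what keeps the three parts independent of each other.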


Appendix D: Colours used in diagrams

Colour Name          Hex       RGB
Cardin Green         #003318   0, 51, 24
Chathams Blue        #12476D   18, 71, 109
Curious Blue         #258FDA   37, 143, 218
Electric Violet      #7F0EFF   127, 14, 255
Flamenco             #FF7F0E   255, 127, 14
Green Haze           #009947   0, 153, 71
Heliotrope           #BB80FF   187, 128, 255
Kaitoke Green        #004D24   0, 77, 36
Nutmeg Wood Finish   #663000   102, 48, 0
Peach Orange         #FFC999   255, 201, 153
Pear                 #C3E221   195, 226, 33
Pigment Indigo       #3C0080   60, 0, 128
Red Berry            #990000   153, 0, 0
Seagull              #7CBCE9   124, 188, 233
Spring Green         #0EFF7F   14, 255, 127

Table 5. Colours used for the Sankey Diagram and Adjacency Matrix


Appendix E: Testing

Test No. 1

Case: There are N nodes with no links.

Expected: All nodes will be on the same column.

Sankey Diagram: All nodes are positioned on the right side of the screen, on the same column.

Adjacency Matrix: The matrix should be empty (all cells should be white; labels visible on the top and left sides).

Test No. 2

Case: There are N nodes which are connected sequentially.

Expected: All nodes will be on the same row.

Sankey Diagram: Initially, the nodes were layered by the different values of log(N). However, in the end this problem was fixed and all the nodes are on the same row.

Adjacency Matrix: There should be a line parallel to the primary diagonal. It should be above the primary diagonal.

Test No. 3

Case: There are N nodes that are connected randomly.

Expected: The nodes should be positioned according to the flow in the input: the nodes with no incoming edges in the first column, the nodes with no outgoing edges in the last column, and in column X the nodes that are connected with the nodes in column (X+1) or column (X-1).

Sankey Diagram: Besides the above problem, a new one was noticed. If the graph was wider than the browser window, the part outside the window was not accessible to the user. By zooming out, it could be seen that the nodes were positioned and interconnected as intended.

Adjacency Matrix: The table should have coloured cells for all source-target links. The rest should be empty.

Table 6. Testing cases for the diagrams
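The three cases in Table 6 can be generated for any value of N with a small helper script. The Python sketch below is an illustration and not the test workbench used in the project; the function names are hypothetical, nodes are simply numbered from 0 to N-1, and random links are oriented from the lower to the higher number so that the graph contains no cycles.

```python
import random


def no_links(n):
    """Case 1: N nodes with no links."""
    return []


def sequential_links(n):
    """Case 2: N nodes connected sequentially (0 -> 1 -> ... -> N-1)."""
    return [(i, i + 1) for i in range(n - 1)]


def random_links(n, count, seed=0):
    """Case 3: N nodes connected randomly, without cycles."""
    rng = random.Random(seed)
    count = min(count, n * (n - 1) // 2)  # cannot exceed the possible pairs
    links = set()
    while len(links) < count:
        source, target = sorted(rng.sample(range(n), 2))
        links.add((source, target))
    return sorted(links)


def adjacency_matrix(n, links):
    """Rows are sources, columns are targets (assumed convention)."""
    matrix = [[0] * n for _ in range(n)]
    for source, target in links:
        matrix[source][target] = 1
    return matrix


if __name__ == "__main__":
    n = 5
    # Case 1: the matrix should be empty.
    assert sum(map(sum, adjacency_matrix(n, no_links(n)))) == 0
    # Case 2: one line of cells just above the primary diagonal.
    sequential = adjacency_matrix(n, sequential_links(n))
    assert all(sequential[i][i + 1] == 1 for i in range(n - 1))
    # Case 3: exactly one coloured cell per source-target link.
    links = random_links(n, 6)
    assert sum(map(sum, adjacency_matrix(n, links))) == len(links)
    print("all cases behave as expected for N =", n)
```

The expected outcome for the Sankey Diagram still has to be checked visually, but the expected Adjacency Matrix for each case can be verified automatically in this way.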


Appendix F: Usability evaluation methods

No. 1

Method: Usability Testing

Description: This approach involves real-world users who are asked to complete a set of tasks and fill in a questionnaire at the end.

Approach: For the Provenance Viewer, the instructions may look something like this:

Design a workflow in Taverna, run it and save it as a databundle;

Register on Provenance Viewer and log in;

Look at the workflow diagram and answer part A of the questionnaire;

Look at the Sankey Diagram and fill in the next part of the questionnaire;

Analyse the Provenance Matrix and complete the questionnaire;

In the “Additional feedback” section of the questionnaire, the user should write useful comments that can be used to improve the experience of this tool.

Meanwhile, the questionnaire should contain the following:

Part A: Workflow Diagram;

Part B: Provenance – Sankey Diagram;

Part C: Provenance – Provenance Matrix;

Part D: Additional Feedback

No. 2

Method: Think Aloud Testing

Description: A method through which feedback can be gained by asking users to think out loud as they test the software.

Approach: Every comment provided by the testers about what they are thinking and feeling with regard to the task performed should be recorded. For example, users can provide information that may be used to determine an easier-to-use design for the tool.

Table 7. Usability Evaluation Methods


Appendix G: More workflows

Source: (The University of Manchester and University of Southampton, n.d.)

Figure 25. Weather forecast workflow

Figure 26. Explicit looping workflow
