PROVENANCE VIEWER FOR APACHE TAVERNA
Visualization tool for scientific workflows and provenance
Student: Claudiu Stefan Padurariu
Supervisor: Carole Goble
Degree: (BSc) Computer Science with
Business and Management
THE UNIVERSITY OF MANCHESTER
SCHOOL OF COMPUTER SCIENCE
THIRD YEAR PROJECT REPORT
May 2016
Abstract
Over the last two decades, scientific workflows have become more and more popular in conducting
research. One of the most widespread workflow management systems is Taverna Workbench.
Through the utilisation of this software, a researcher can design and run scientific workflows.
Depending on the complexity of the workflow and experiment, the execution trace can be
substantially large. The analysis of large volumes of data poses challenging problems for researchers.
One of them is the visualisation of the provenance of scientific workflow runs.
Therefore, this project aims to present solutions through which a scientist can visualise and
explore the provenance resulting from running a scientific workflow with Taverna Workbench 2.5.0.
It also includes an application whose main functionality is to display the exported result data in an
interactive way.
The software solution for this project was successfully designed and implemented by following
a well thought out software development plan. Since scientific workflows encourage
the sharing and publication of research results, it was developed as a rich internet
application. Possible improvements and future work were identified from the feedback
gathered in testing and evaluating the final product.
Keywords: scientific workflows, provenance, visualisation, Taverna, rich internet application, D3
Acknowledgements
I am grateful to my supervisor, Professor Carole Goble, for the patient guidance, generous advice
and life lessons provided throughout the entire duration of the project. I would also like to thank her
for allowing me to be a temporary member of the myGrid team, which helped me to grow as a research
scientist.
I will take this chance to express my gratitude to Dr. Milan Mihajlovic as well, for his constructive
feedback during the assessment meetings, which taught me valuable lessons about software
engineering, particularly with regard to the planning stage.
I am also indebted to the members of myGrid Team, especially Alan Williams, Stian Soiland-Reyes,
Donal Fellows, Stuart Owen, Aleksandra Nenadic and Niall Beard. They have welcomed me to their
group and helped me with developing new skills.
Finally, I am thankful to my family for their love, encouragement and support that they have offered
me throughout my entire life, without whom I would never be the person who I am today.
I declare that this report is my own work and the work of others has been properly acknowledged
and referenced in accordance with University policies.
Table of Contents
Abstract
Acknowledgements
Table of Contents
Table of Tables
Table of Figures
Chapter 1. Introduction
1.1 Project Overview
1.2 Area of Interest
1.3 Motivation
1.4 Aims and Objectives
1.5 Methodology
1.6 Report Structure
Chapter 2. Background
2.1 Overview
2.2 In Silico Experimentation
2.3 Scientific Workflows
2.4 Apache Taverna
2.5 Apache Taverna Plug-ins
2.6 Databundles
2.7 Provenance
2.8 myExperiment
Chapter 3. Planning
3.1 Overview
3.2 Context
3.2.1 Apache Taverna Databundle Viewer
3.2.2 Project Scope and Functionalities
3.2.3 High-level Tasks
3.3 Container
3.3.1 Technologies
3.3.1.1 Ruby on Rails
3.3.1.2 Node.js
3.3.1.3 D3.js
3.3.1.4 JSON
3.3.1.5 Bower
3.3.1.6 Slim
3.3.1.7 CoffeeScript
3.4 Container Diagram
Chapter 4. Implementation and Results
4.1 Overview
4.2 Development Environment
4.3 Development
4.3.1 Reading, parsing and querying
4.3.2 Diagrams
4.3.3 Sankey Diagram
4.3.3.1 Coloured Nodes
4.3.3.2 Draggable Nodes
4.3.3.3 Text labels for Nodes
4.3.3.4 Node information displayed on mouse-over
4.3.3.5 Coloured links between nodes
4.3.3.6 Path information on mouse-over
4.3.3.7 Single Click on node
4.3.3.8 Double click on node hides links
4.3.3.9 Zooming
4.3.4 Adjacency Matrix
4.3.4.1 Sorting
4.3.4.2 Coloured cells for paths
4.3.4.3 Saving the graph as picture
Chapter 5. Testing and Evaluation
5.1 Overview
5.2 Testing
5.2.1 Functional Testing
5.2.1.1 Data extraction testing
5.2.1.2 Diagrams Testing
5.2.1.3 Coupling of the data with the diagram
5.2.1.4 Observations and remarks
5.2.2 Cross-Browser Testing
5.3 Evaluation
Chapter 6. Conclusion
6.1 Overview
6.2 Achievements
6.3 Variation from the initial plan
6.4 Experience gained
6.5 Future Works
References
Appendix A: workflow.prov.ttl inside Hello Anyone databundle
Appendix B: Container Diagram
Appendix C: Model-View-Controller Design
Appendix D: Colours used in diagrams
Appendix E: Testing
Appendix F: Usability evaluation methods
Appendix G: More workflows
Table of Tables
Table 1. Objectives
Table 2. Databundle Structure
Table 3. Directions of relationships
Table 4. Evaluation Techniques
Table 5. Colours used for the Sankey Diagram and Adjacency Matrix
Table 6. Testing cases for the diagrams
Table 7. Usability Evaluation Methods
Table of Figures
Figure 1. Provenance visualisation of "weather forecast" workflow run within Taverna
Figure 2. Provenance visualisation of Hello Anyone workflow with Prov-O-Viz
Figure 3. The example of provenance visualisation with the Polish software
Figure 4. The "weather forecast" workflow opened with ATDV
Figure 5. Agile Methodology
Figure 6. Hello World Workflow
Figure 7. Taverna Workbench 2.5.0 - Perspectives
Figure 8. An example of the icicle tree representation and coloured bands
Figure 9. An example of the Provenance Matrix
Figure 10. Extraction of provenance data based on wfprov ontology
Figure 11. Context Diagram of ATDV
Figure 12. Diagram with the functionalities
Figure 13. Original JSON structure
Figure 14. New JSON structure
Figure 15. Example of SPARQL query - Extraction of Workflows Runs
Figure 16. Sankey Diagram - Legend
Figure 17. Sankey Diagram - Draggable Node
Figure 18. Sankey Diagram - Information on hovering nodes
Figure 19. Sankey Diagram - Information on hovering the coloured paths
Figure 20. Sankey Diagram - Highlight the path
Figure 21. Sankey Diagram - Hide outgoing links on double clicking
Figure 22. Adjacency Matrix Diagram
Figure 23. Container Diagram
Figure 24. Model-View-Controller Architecture
Figure 25. Weather forecast workflow
Figure 26. Explicit looping workflow
Chapter 1. Introduction
1.1 Project Overview
This project aims to deliver an application that speeds up the work of scientists who
use the workflow management system Apache Taverna Workbench in their research. The solution
is a web platform, contributed to the Apache project, through which an end user
can visualize the information inside a bundle exported by Taverna. This
information describes a scientific workflow and the provenance that resulted from running the
workflow.
To meet the needs of as many scientists as possible, a prerequisite is researching the most
approachable methods for visualizing data. Once this requirement is fulfilled and one or more
solutions have been identified, the next step is the development of the software.
1.2 Area of Interest
Science is a practical discipline whose purpose is to attain systematic knowledge through progressive
steps of observations, hypothesis creation, experiment proposal, execution, and completion.
In the beginning, scientific research was conducted through manual and ad hoc approaches. However,
new approaches to scientific research have increased the amount of data to be analysed.
(Bowers, 2012) states that the traditional approaches are considered
controversial in practice for large-scale experiments.
Due to technological innovations and advances in computer science, many researchers started to
convert to in silico experimentation. Automating the process of gathering and generating data
has improved the speed and productivity with which scientists perform common activities. The benefit
is that researchers can now focus on important tasks by letting computers take care of common
tasks, while also minimizing human error.
When designing an application, complex computational components such as input resources,
specialized libraries, web services and many other processes had to be grouped together. This was
tackled by introducing the concept of workflow within the in silico experimentation. Workflows help
in designing an experiment as a multi-step process that provides an easy-to-use way of defining the
tasks that need to be included and executed for the completion of the research.
A workflow management system represents an environment in which a computational experiment
can be described and executed. Restating what (Taverna.org.uk, 2009a) asserts, a workflow
management system provides the infrastructure to design, run and monitor scientific workflows. For
this purpose, numerous systems have been devised to offer explicit support for managing workflows.
According to (Bowers, 2012), the most popular are Taverna, Kepler, VisTrails, Triana, Pegasus,
KNIME and Galaxy.
This project focuses entirely on the Apache Taverna workflow management system. This software is
open-source and is used in many domains including arts, astronomy, biodiversity, heliophysics,
chemistry, databases, document and image processing and many others. It has a graphical user
interface that allows users to design workflows as directed graphs in which the nodes represent
data and processes while the edges define the relationships between them.
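As an illustration of this directed-graph view, the following minimal Python sketch models a tiny workflow. The `Workflow` class and the node names (`name`, `hello`, `greeting`) are hypothetical and are not taken from Taverna; the sketch only shows how data and process nodes can be linked by directed edges and then traversed.

```python
from collections import defaultdict


class Workflow:
    """A toy directed graph: nodes are data items or processes, edges are data flow."""

    def __init__(self):
        self.kinds = {}                       # node name -> "data" or "process"
        self.successors = defaultdict(list)   # node name -> nodes it feeds into

    def add_node(self, name, kind):
        if kind not in ("data", "process"):
            raise ValueError("kind must be 'data' or 'process'")
        self.kinds[name] = kind

    def add_edge(self, source, target):
        # Both endpoints must already have been declared with add_node.
        if source not in self.kinds or target not in self.kinds:
            raise KeyError("unknown node")
        self.successors[source].append(target)

    def downstream(self, start):
        """Return every node reachable from start, in depth-first discovery order."""
        seen, stack, order = set(), [start], []
        while stack:
            node = stack.pop()
            for nxt in self.successors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    stack.append(nxt)
        return order


def build_example():
    """Build a hypothetical three-node workflow: name -> hello -> greeting."""
    wf = Workflow()
    wf.add_node("name", "data")
    wf.add_node("hello", "process")
    wf.add_node("greeting", "data")
    wf.add_edge("name", "hello")
    wf.add_edge("hello", "greeting")
    return wf
```

Calling `build_example().downstream("name")` lists everything that depends on the input `name`, which is the kind of question a provenance viewer has to answer.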
Another notable feature of Taverna is its ability to capture information about the workflow
run. In other words, when a researcher executes a workflow, Taverna generates a bundle. As
mentioned in the overview, this bundle contains a log file that records the history of the steps
involved in the production of a piece of work. This log file contains information about the processes
executed, the input data, intermediate values, and output. The information inside that log file is called
provenance, and it is used to form opinions on the quality, trustworthiness, and reliability of the
research. These characteristics can be determined through the answers to several questions such as:
what did the experiment achieve; how did it achieve it; why did it execute the way it did; where did
the data come from; are the sources trustworthy?
1.3 Motivation
Currently, provenance can be visualized with Taverna in two different ways. The first is to read the
content of the provenance file, which proves to be a large Resource Description Framework (RDF) file.
As can be seen in Appendix A, trying to visualize such files as graphs is not a good idea because the
result becomes unintelligible or convoluted, especially when the workflow involves many
iterations.
The second method by which a Taverna user can visualize the provenance of workflow runs is through
the Taverna tool itself. It uses the workflow diagram as a starting point. However, as can be seen
in Figure 1, this method is not very approachable since the details of workflow elements are
displayed separately.
Source: (Nenadic, 2014a)
Figure 1. Provenance visualisation of “weather forecast” workflow run within Taverna
Third-party applications have also tried to provide a way to visualize provenance. For example,
Data2Semantics created the project Prov-O-Viz. It lets the user visualize any provenance graph
that uses the PROV-O vocabulary as a Sankey Diagram. However, Prov-O-Viz has not been
completed. Also, as can be seen in Figure 2, it does not display sufficient detail and the nodes are
identified by their unique resource identifiers rather than by meaningful names.
Source: (Hoekstra, 2013)
Figure 2. Provenance visualisation of Hello Anyone workflow with Prov-O-Viz
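To make the idea of drawing a provenance graph as a Sankey Diagram more concrete, here is a minimal Python sketch that turns a list of weighted edges into a nodes/links structure. The nodes/links layout follows the input shape commonly used with D3's Sankey plugin, but that correspondence, the function name `to_sankey` and the example labels are assumptions for illustration; this is not Prov-O-Viz code.

```python
import json


def to_sankey(edges):
    """Convert (source, target, value) triples into a nodes/links dictionary.

    Each distinct name becomes one node; links refer to nodes by list index.
    """
    index = {}
    nodes = []
    links = []
    for source, target, value in edges:
        for name in (source, target):
            if name not in index:
                index[name] = len(nodes)
                nodes.append({"name": name})
        links.append({"source": index[source], "target": index[target], "value": value})
    return {"nodes": nodes, "links": links}


# Hypothetical provenance edges for a small "hello" workflow run.
example = to_sankey([
    ("input:name", "process:hello", 1),
    ("process:hello", "output:greeting", 1),
])
print(json.dumps(example, indent=2))
```

Using readable labels as node names, rather than raw resource identifiers, addresses the readability problem noted for Figure 2.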
Another tool, suggested by Alan R. Williams, was developed by Maciej Gol, a Polish student. This
thoughtfully designed tool also starts from the workflow diagram. In addition, it has the
ability to sub-graph elements. For example, if the user double clicks on a node, the tool redraws the
entire page and includes only the input and output of the respective node. An example of this
tool can be seen in Figure 3.
Source: (Gol, n.d.)
Figure 3. The example of provenance visualisation with the Polish software
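The sub-graphing behaviour described above can be sketched as follows. This is a minimal Python illustration under assumptions, not the code of the tool itself: the function name `neighbourhood` and the example edge list are hypothetical.

```python
def neighbourhood(edges, node):
    """Keep only the edges entering or leaving `node`.

    edges: iterable of (source, target) pairs of a directed graph.
    Returns (inputs, outputs, kept_edges), with inputs and outputs sorted by name.
    """
    kept = [(s, t) for s, t in edges if s == node or t == node]
    inputs = sorted({s for s, t in kept if t == node and s != node})
    outputs = sorted({t for s, t in kept if s == node and t != node})
    return inputs, outputs, kept


# Hypothetical workflow edges: name -> hello -> greeting -> report.
EXAMPLE_EDGES = [
    ("name", "hello"),
    ("hello", "greeting"),
    ("greeting", "report"),
]
```

For instance, `neighbourhood(EXAMPLE_EDGES, "hello")` keeps the edges from `name` and to `greeting` and drops the edge to `report`, which is the reduced view that would be redrawn after a double click.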
1.4 Aims and Objectives
Therefore, the single main aim of this project is the development of a website that
allows scientists to visualize the provenance of a workflow run and to
interact with the diagrams. My supervisor suggested extending the Apache Taverna
Databundle Viewer (ATDV) platform, whose core functionality is the ability to draw the workflow
diagram. Figure 4 illustrates how ATDV displays information about the Hello Anyone workflow.
Source: (Apache Taverna Databundle Viewer, 2015)
Figure 4. The “weather forecast” workflow opened with ATDV
In order to add the functionality to visualize provenance in an efficient way, the aim of this project
can be split into a list of objectives. As mentioned in the overview, the first objective is to research
methods for visualising provenance, followed by a comprehensive analysis of the Apache Taverna
Databundle Viewer project. These two represent the most demanding tasks. Other objectives that
represent the development of the functionality are included in Table 1.
No. Description
1 Research visualization methods for provenance
2 Understand functionality offered by Apache Taverna Workbench Plug-in
3 Understand the ATDV platform
4 Extract data from provenance file
5 Implement and add an interactive diagram
6 Link the extracted data with the implemented diagram
7 Add other diagram(s)
Table 1. Objectives.
1.5 Methodology
It is crucial to decide on the methodology that the project will follow before
coding starts. As this report is part of a third-year project, the student has a set deadline for its completion.
Hard deadlines apply to businesses as well. Therefore, a good strategy is required that
allows the developer to fulfil the goals by the target date so that the clients are satisfied.
To meet this standard, this project considered the two most popular software
development life cycle methodologies. The first is the traditional Waterfall methodology,
which implies that all decisions need to be made before starting the implementation and that feedback
from users is provided at the end of the project. This makes the Waterfall methodology inappropriate
for this project, which relies on feedback throughout development. The second model is the Agile
methodology, which focuses on an iterative and incremental approach.
The latter is used to split the whole development process into smaller tasks, as is done in Section
3.2.3. This allows access to constant feedback from the supervisor, researchers and users during the
implementation phase. In addition, testing can be done continuously, which leads to early discovery
of errors and more stable releases. As a result, the project can adapt easily to new
requirements. The adopted methodology workflow is shown in Figure 5.
Figure 5. Agile Methodology
1.6 Report Structure
This report reflects the life cycle of the project from the moment in which the concept of visualising
provenance was only a research idea to the stage in which it took the form of a web
application. The choices made are justified by examining alternative approaches and technologies
throughout the entire process. The report is structured as
follows:
Chapter 1 ("Introduction") introduces science as the area of interest. In addition, it motivates
the work by presenting similar tools in the same area, such as Taverna
Workbench, Prov-O-Viz and a Polish project. Additionally, it sets the main aim of delivering a tool for
the visualization of provenance.
Chapter 2 ("Background") defines the prerequisites of the project such as in silico
experimentation and the workflow management system Taverna. These prerequisites help with
understanding the project better and are used in determining the requirements of the project. Also, it
offers two methods through which provenance can be visualized.
Chapter 3 ("Planning") presents an analysis of the ATDV platform and how it is extended to add
provenance visualisation. This analysis is made along with a discussion of the technologies and
approaches used. The last part of the chapter includes a complete architecture of the
system.
Chapter 4 ("Implementation") reflects the development of the project. It discusses the environment,
which covers the operating system, computer programs, editors and debuggers used for software
development. Following the Agile methodology, this chapter then presents the implementation
of each of the main tasks of this project with sample code.
Chapter 5 ("Testing and Evaluation") describes the testing and evaluation methods used throughout
the development stage of this project for the purpose of determining the quality and correctness of
the artefact. A result subsection follows and demonstrates the whole functionality of this project
through examples.
Chapter 6 ("Conclusion") reflects on the extent to which the objectives of this project were met, suggests
a list of possible future improvements, and concludes with a short statement of the knowledge
gained during the process of software engineering.
Chapter 2. Background
2.1 Overview
The purpose of this chapter is to provide essential background information about the environment in
which this project will be used. It starts by introducing more details about in silico experimentation,
followed by an overview of scientific workflows as a solution to this type of scientific research. After
that, the discussion proceeds with a presentation of Taverna (a scientific workflow management
system) and myExperiment (a social website that contains workflows available to everyone). Next,
it introduces the platform on which the project is going to be built. Finally, the chapter concludes
with an analysis of provenance.
2.2 In Silico Experimentation
"In silico experimentation" is an expression which emerged in 1989 during the "Cellular Automata:
Theory and Applications" workshop held in Los Alamos, New Mexico. According to (Autoimmunity
Research Foundation, 2012), this expression can be interpreted as "performed on computer or via
computer simulation". This allowed scientists to conduct research and experiments on computers
using complex data that models and reflects the real world.
This new method of performing scientific experiments has numerous advantages and is possible due
to continuous developments in computer science. (Taverna.org.uk, 2009b) lists the following benefits
that are observable during in silico experimentation: "higher precision and better quality of
experimental data; better support for data-intensive research and access to vast sets of experimental
data generated by scientific communities; more accurate simulations through more sophisticated
models; faster individual experiments; higher work productivity".
However, there are also disadvantages to in silico experimentation. Firstly, scientists need
a computing background in order to design, develop and maintain an in silico experiment.
Therefore, in order to reproduce an experiment, a scientist requires
technical skills that are not easy to acquire. Thus, the majority of researchers would not be able to
use this approach.
Despite these disadvantages, a solution is available. A researcher who does not have a computing
background can perform the in silico experiment by utilizing scientific workflows.
2.3 Scientific Workflows
(Workflow Management Coalition, 1996) introduced the idea of workflow as "The automation of a
business process, in whole or part, during which documents, information or tasks are passed from
one participant to another for action, according to a set of procedural rules". A more
general definition of a workflow is given by (Oxford Dictionaries, 2016), which defines it as a
sequence of steps undertaken by an activity from beginning to completion.
Workflows were first used mostly within the business domain. However, due to the new
nature of computationally intensive experimentation, another use for them has emerged in the
scientific environment. This new type of workflow, referred to as a scientific workflow, can be thought
of as an elaborate description of what an in silico experiment is aiming to accomplish.
The most significant aspect of scientific workflows is the way they are modelled as directed graphs
with nodes and edges, as can be seen in Figure 6. More complex workflows can be found in
Appendix G. The vertices represent computational steps such as data entities, local services, web
services, scripts, and sub-workflows. These components are linked to one another according to the
data flow and organised in layers.
Various tools exist that enable the user to design, create,
maintain and execute scientific workflows. The most
popular are Apache Taverna, Kepler, VisTrails, Triana,
Pegasus, and Galaxy.
The workflow management system that this project focuses on is Taverna. This tool was chosen
because several Taverna developers are part of a team led by my supervisor and carry out the
development on the University of Manchester campus. This made it possible to learn the tool much
faster and to work in a familiar environment.
2.4 Apache Taverna
As mentioned before, Taverna is an open-source and platform-independent workflow management
system written in Java. The myGrid team at the University of Manchester created it with the aim
of delivering a tool to design and run scientific workflows. This software has already started
incubating as a project of the Apache Software Foundation.
Source: (Nenadic, 2014b)
Figure 6. Hello World Workflow
The software is composed of three major components:
- Taverna Engine which is responsible for all the computational work which includes running
scripts and converting data from one format to another format;
- Taverna Server which enables the ability to execute workflows remotely;
- Taverna Workbench which is the desktop application responsible for designing the workflow
and outputting the results of the run.
The latter is the most useful product for scientists because it provides a graphical user
interface. The Taverna Workbench window is divided into three perspectives, as seen in
Figure 7. The top-left part holds the Services Panel, which lists all known services. In the
bottom-left part, the Workflow Explorer tab presents the workflow in a tree-like form, while the
Details tab allows the attributes of a workflow node to be viewed or modified. Lastly, the workflow
can be sketched on the right side of the screen, in the Workflow Diagram panel.
Source: (Williams, 2014)
Figure 7. Taverna Workbench 2.5.0 – Perspectives
2.5 Apache Taverna Plug-ins
Due to the many different areas in which Taverna can be used, this workflow management system
has been designed as an extensible tool. Thus, its functionality can be replaced with modules that
fit the needs of a specific scientist or extended through various services and plug-ins. For
example, the Taverna-PROV plug-in records the provenance of workflow runs along with their inputs,
outputs, and the executed processes.
This plug-in allows the provenance to be stored in an internal database. As mentioned in Chapter 1,
this information can be visualised in Taverna as well through the “Previous runs and Intermediate”
tabs in the Results section. Another functionality of this plug-in is the ability to export the workflow
run with its provenance. The exported file is called a databundle and a more detailed analysis of it is
given in the next subsection.
2.6 Databundles
A databundle is a zip file produced by Taverna Workbench. It can be generated once the researcher
has run the workflow and has decided to save the output of the experiment. The zip is a collection
of files that represent the data that contributed to the experiment. This bundle contains a
description of the workflow, the provenance trace and the input, intermediate and output values.
(Soiland-Reyes, 2013) states that a databundle has the structure presented in Table 2.
Source: (Soiland-Reyes, 2013)
Table 2. Databundle Structure
File path | Description
./inputs/, ./intermediates/, ./outputs/ | These three folders contain sub-folders and files. The files contain data that describe the input, the modifications it underwent throughout the whole execution of the workflow and the output.
./mimetype | A Multipurpose Internet Mail Extensions (MIME) type file whose purpose is to provide a way of identifying the nature of the databundle. The content of this file is usually “application/vnd.taverna.scufl2.workflow-bundle” or “application/vnd.wf4ever.robundle+zip”.
./workflow.wfbundle | A description of the workflow written in the SCUFL2 format.
./workflow.prov.ttl | This file represents the provenance of the workflow run. It is an RDF graph in the Turtle format that acts as a log file. It contains every step taken during the workflow execution and links each step with its respective values from the inputs, intermediates and outputs folders. A sample of this file is provided in Appendix A.
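The structure in Table 2 can be checked programmatically. The following is a minimal sketch in Python (not the project's own Ruby code) that reports which of the Table 2 entries are present in a zipped databundle. The sample bundle is built in memory purely for illustration, and the function names and the file names inside the three folders are hypothetical.

```python
import io
import zipfile

# Entries named in Table 2.
EXPECTED_FILES = ("mimetype", "workflow.wfbundle", "workflow.prov.ttl")
EXPECTED_FOLDERS = ("inputs/", "intermediates/", "outputs/")


def summarise_databundle(data: bytes) -> dict:
    """Report which of the Table 2 entries are present in a zipped databundle."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        names = bundle.namelist()
        mimetype = None
        if "mimetype" in names:
            mimetype = bundle.read("mimetype").decode("ascii").strip()
    return {
        "mimetype": mimetype,
        "files": {name: name in names for name in EXPECTED_FILES},
        "folders": {
            folder: any(name.startswith(folder) for name in names)
            for folder in EXPECTED_FOLDERS
        },
    }


def make_sample_bundle() -> bytes:
    """Build a tiny, hypothetical databundle so the sketch runs without Taverna."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("mimetype", "application/vnd.wf4ever.robundle+zip")
        bundle.writestr("workflow.wfbundle", b"")
        bundle.writestr("workflow.prov.ttl", "# provenance trace would go here\n")
        bundle.writestr("inputs/name.txt", "World")
        bundle.writestr("outputs/greeting.txt", "Hello, World")
    return buffer.getvalue()


summary = summarise_databundle(make_sample_bundle())
```

In this sample the intermediates folder is deliberately left out, so the summary flags it as missing while the other entries are reported as present.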
2.7 Provenance
With the provenance recorded inside the databundle, it is time to discuss provenance in more detail,
together with the approaches found to visualise it. Firstly, a variety of diagrams and charts will
be briefly described. Next, the discussion will concentrate on presenting the basis on which the
information is extracted.
Having established earlier that the standard graph with nodes and edges is not among the answers to
the provenance problem, other techniques for visualization had to be found. Thus, an analysis of
relevant literature was conducted. Among all the research, (Dang et al., 2015) proved to be quite
useful in determining two solutions for this project. Figure 8 and Figure 9 present the two
approaches that this paper proposed. Besides these, other diagrams were considered, such as the
Chord Diagram or a variation of the Scatter Plot. However, through discussions with the supervisor
and members of the myGrid team, these were rejected on the basis that the data extracted about the
workflow run could not be represented through these types.
As explained in Appendix A, Taverna saves the provenance of a workflow run in a provenance file in
Turtle format, using many ontologies. The most essential ones for this project are the “wfprov”,
“prov” and “rdf” ontologies. The last one makes it possible to distinguish between the types of
nodes that the provenance creates inside the RDF graph. The types generated are Artifacts, Process
Runs, Workflow Runs and Workflow Engines. In Figure 10, (Soiland-Reyes et al., 2013) defines the
relationships between these nodes. One thing that is not illustrated is that Artifacts can be of
two types: simple Artifacts and Dictionaries (also known as Lists or Collections). The “prov”
ontology makes it possible to observe this distinction between the Artifacts.
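To make the role of the type statements concrete, the sketch below groups provenance nodes by their rdf:type. It is written in Python over plain tuples rather than with the project's Ruby gems. The class names wfprov:WorkflowRun, wfprov:ProcessRun, wfprov:Artifact and wfprov:WorkflowEngine are assumed to be the ones used in the provenance file, and the run identifiers are made up for illustration. The further split of Artifacts into simple Artifacts and Dictionaries is not covered here.

```python
from collections import defaultdict

# Triples as (subject, predicate, object) with prefixed names.
# The subjects below are invented identifiers, not taken from a real run.
TRIPLES = [
    ("run:workflow", "rdf:type", "wfprov:WorkflowRun"),
    ("run:workflow", "rdf:type", "wfprov:ProcessRun"),
    ("run:engine", "rdf:type", "wfprov:WorkflowEngine"),
    ("run:process1", "rdf:type", "wfprov:ProcessRun"),
    ("run:value1", "rdf:type", "wfprov:Artifact"),
    ("run:value1", "rdfs:label", "greeting"),
]

# Checked in order, so a node typed as both WorkflowRun and ProcessRun
# is reported as a workflow run.
NODE_KINDS = [
    ("wfprov:WorkflowRun", "workflow run"),
    ("wfprov:WorkflowEngine", "workflow engine"),
    ("wfprov:ProcessRun", "process run"),
    ("wfprov:Artifact", "artifact"),
]


def classify(triples):
    """Map every typed subject to one of the four provenance node kinds."""
    types = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            types[subject].add(obj)
    kinds = {}
    for subject, subject_types in types.items():
        for type_name, kind in NODE_KINDS:
            if type_name in subject_types:
                kinds[subject] = kind
                break
    return kinds


kinds = classify(TRIPLES)
```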
Source: (Dang et al., 2015)
Figure 8. An example of the icicle tree representation and coloured bands
Source: (Dang et al., 2015)
Figure 9. An example of the Provenance Matrix
Source: (Soiland-Reyes et al., 2013)
Figure 10. Extraction of provenance data based on wfprov ontology
2.8 myExperiment
Once a scientist has built a properly working workflow, it can be published and shared with others
for reproducibility, collaboration or feedback purposes. This is possible through the social
website myExperiment, launched in November 2007 and developed by a team formed by members
associated with the universities of Oxford, Manchester, and Southampton. This team was managed and
guided by Carole Goble and David De Roure.
Since 2007, according to the statistics published in (The University of Manchester and University of
Southampton, n.d.), this community has attracted more than 10200 users. Focusing on the needs of
the scientists, myExperiment acts as a scientific workflow repository regardless of the tool that has
been used to create the workflows. Thus, myExperiment allows uploading files and grouping them
by the workflow management system utilized. Taverna, RapidMiner, Galaxy, Kepler and Bio Extract
are examples of systems whose exports can be uploaded. According to the same statistics,
myExperiment is considered to be the largest public repository, containing over 3700 workflows, of
which approximately 2100 were designed for Taverna (~1550 for Taverna 2 and ~550 for Taverna 1).
This website is of high importance to this project since the testing of the project can be done with
workflows created by researchers. Most of the workflows tested are the work of Alan Williams, Alex
Nenadic and Stian Soiland-Reyes.
Chapter 3. Planning
3.1 Overview
The purpose of this chapter is to design the architecture of the system. To be understood by
both specialized and non-specialized people, the architecture has been split into multiple levels,
as suggested in (Brown, 2016).
The first step is to define the context, immediately followed by a presentation of the Apache
Taverna Databundle Viewer platform, before moving on to any implementation. This chapter covers
both the functionalities and the technologies used to develop it.
Afterwards, the results of the research into provenance and the types of diagrams found to be most
suitable for displaying it are presented. At the same time, the approaches and technologies
considered to be the best choice are discussed in comparison with their alternatives. This part
also explains what data should be extracted from the provenance file inside any databundle and how
to link it with the methods used for visualization.
Finally, the chapter concludes with a complete navigation of the system.
3.2 Context
This section outlines everything that was introduced in Chapter 2 and has been used as a starting
point for designing the software system. Therefore, the intent of the project along with its
surroundings are reflected in Figure 11. It also describes the nature of the users who are going to use
this system. In addition, it provides information about the origins of the artefact used as input.
3.2.1 Apache Taverna Databundle Viewer
This project is based on the ATDV platform, which was developed by Denis Karyakin, a former
student of the University of Manchester. This tool displays a scientific workflow within a
databundle generated by Taverna as a force-directed graph drawing. However, as can be seen in
Figure 4, the result is untidy. Therefore, as an out-of-scope task, the workflow visualisation was
changed to a vertical icicle tree.
Figure 11. Context Diagram of ATDV
3.2.2 Project Scope and Functionalities
This software represents the platform on which the functionality to visualize the provenance of
workflow runs will be implemented. All functionalities of ATDV are presented in Figure 12:
- Red, yellow and cyan nodes represent the functionalities that are already implemented;
  o Red indicates a functionality that is going to be removed;
  o Yellow indicates a functionality that is going to be modified and improved;
- Green nodes represent the functionalities that are going to be added.
Figure 12. Diagram with the functionalities
3.2.3 High-level Tasks
As a characteristic of the agile methodology, the development of this project has been split into three
iterations based on the information introduced earlier. The high-level tasks are:
(1). Read, parse and query the provenance file – Once the databundle has been uploaded to the
server, perform the computations on the provenance file;
(2). Diagram Implementation – Integrate a diagram that is going to be used to visualize mock
data with various interactive features;
(3). Couple the data resulting from (1) with the diagram implemented in (2) – Combine the two
previous tasks and analyse whether this diagram is suitable for displaying the provenance.
3.3 Container
Having understood how the system fits into its environment by means of Figure 11 and Figure 12,
the next step was to settle upon the technologies used for achieving the overall objective of this
project. This section can be thought of as a recipe in which the ingredients are technologies.
Therefore, the next subsections introduce some of the most important technologies along with
their alternatives. The criteria considered in the technical selection process were:
- the ability to fulfil the tasks mentioned earlier;
- the developer’s experience with the technology or with similar technologies;
- the difficulty level for learning and using the respective technology;
- the cost of the technology (budget, performance, size, compatibility).
3.3.1 Technologies
3.3.1.1 Ruby on Rails
The most important technology that needs to be discussed is the web application framework.
Because this project extends another one, this decision was already made. The selected technology
is the web framework Ruby on Rails.
The framework introduced various difficulties as I had no previous experience with it. Therefore,
I considered whether the project should continue with Ruby on Rails or whether the entire website
should be developed again with a more familiar technology such as PHP or ASP.NET with C#. All of
them are suitable for delivering a successful application. After a bit of research
and completing several tutorials, Ruby has proven to be easily readable and mostly self-describing.
This fact influenced the final decision according to which Ruby on Rails will be used for the entire
project. There are several other things that influenced this decision.
First, Ruby on Rails is a web application development framework based on the Model-View-Controller
(MVC) architectural pattern. This approach was traditionally used mostly for developing desktop
applications with a graphical user interface, but it has been adapted for web development. More
details about this approach are found in Appendix C.
Another reason is that Rails is considered to be suitable for agile development due to its
flexibility, through which functionalities can be delivered in a short time-frame. Therefore,
constant feedback on the progress of the project can be provided to the stakeholders. This is one
of the reasons why Ruby on Rails is considered to be a rapid application development framework.
The decision was also influenced by the fact that the framework is supported by an active community
that provides many ways of learning. Moreover, this community is known for sharing solutions to
various common tasks. These solutions are offered in the form of “gems”. Thus, the wheel does not
have to be reinvented and the risk of human error is decreased. For example, consider the login
system, which is one of the most common tasks that a web developer is required to implement. The
Rails community has provided gems which can be used to create complex yet easy-to-use systems for
authentication and authorization.
For the purpose of opening a Turtle file and reading from it, Ruby gems provide a simple solution.
There are two alternatives, namely the ActiveRDF and rdflib-turtle gems. Both fit the
requirements, but rdflib-turtle has better documentation. For this reason, the project uses
rdflib-turtle, which allows the provenance file to be read directly as an RDF graph and stored in
local memory. Afterwards, the graph can be queried. The rdflib-turtle gem also provides two options
through which data can be accessed: the data can be queried either directly or by using another
gem, namely the rdf-SPARQL gem. Since SPARQL is more flexible and its syntax is quite similar to
SQL, the rdf-SPARQL gem has been selected as the better option. In addition, it also makes the
project easier to maintain and test.
Other examples of gems are “devise” and “omniauth”. The latter can be used for logging in through
other social websites such as Facebook, Twitter and Google.
3.3.1.2 Node.js
Node.js is an open-source, server-side JavaScript runtime environment. Its developers state in
(Node.js Foundation, 2016) that its purpose is to offer an easy way to "build scalable
network applications". In other words, it allows many connections to be handled concurrently.
3.3.1.3 D3.js
D3.js is a JavaScript library which can be used to build various types of diagrams and charts to
visualize data. ATDV already uses D3.js to display the workflow diagrams. This tool uses scalable
vector graphics (SVG), through which animation and interactivity can be added to 2D graphics, so it
also fits task (2) mentioned in 3.2.3. Some examples are the movement of nodes or the ability to
highlight specific paths. One important aspect is that it accepts all kinds of data as long as the
data has a consistent structure.
There were alternatives considered as well. One of them is the GraphViz software. This software is
efficient in determining the optimal layout with consideration to the paths. However, compared with
D3, its output is static and far less interactive.
In addition, similar solutions to those proposed in (Dang et al., 2015) can be found in the online
gallery of the D3 tool. There is the Sankey Diagram, which seems to resemble the icicle tree
representation with coloured bands (Figure 8). This will be implemented as a Horizontal Sankey
Diagram similarly to the Prov-O-Viz tool.
Meanwhile, the provenance matrix (Figure 9) could be implemented using either the Adjacency
Matrix or the Clustergrammer D3 diagram. The Clustergrammer graphic seems to fulfil more
requirements than the Adjacency Matrix diagram. However, because Clustergrammer is still in
development, it was assumed that this tool might be unstable and more likely to crash.
Thus, this project aims to implement the Horizontal Sankey Diagram and the Adjacency Matrix Diagram
as ways of visualizing provenance. Sample images of these diagrams are available in the next
chapter.
3.3.1.4 JSON
As mentioned earlier, ATDV already uses D3 to display the workflow diagram, which is built from
data serialized to JSON. At the end of the project, as an out-of-scope objective, the structure of
this JSON was adapted to the one used for the provenance. The new structure is the same for
all types of diagrams that have been implemented and is presented in comparison with the old one in
Figure 13 and Figure 14.
Figure 13. Original JSON structure
Figure 14. New JSON structure
3.3.1.5 Bower
Bower is a package manager for external components. (Bower, 2012) mentions that this tool can
be used for adding and managing project dependencies such as "frameworks, libraries, assets
and other utilities." For the ATDV project, it is used to install the AdminLTE theme for the
website. The theme is built with Bootstrap by Almsaeed Studio and can be previewed online at
(Almsaeed Studio, 2016).
3.3.1.6 Slim
Slim is a template language whose purpose is to replace HTML by reducing the standard HTML
syntax to its essential parts. (Mendler, Stone and Wu, 2016) state that it started as an
exercise and that its functionality expanded once its popularity increased. In addition, it was shown
to be the fastest template language for Ruby on Rails by a software developer named Klaus Zanders
(Zanders, 2012). Also, this language has a logical structure. Its simplicity may be noticed from the
following rules:
- indentation is used, as in Python, to represent the parent-child relationship between elements;
- the ID attribute is denoted by the hash symbol and the class attribute by the dot symbol, as in
CSS;
- a new line that starts with a hash or a dot represents the “div” HTML element with the respective
attributes.
3.3.1.7 CoffeeScript
Similarly to the previous tool, CoffeeScript is a language that compiles to JavaScript and whose
purpose is to provide a readable and clean syntax. It follows similar rules to Slim, such as the
fact that indentation marks nesting. On its official website (CoffeeScript, 2015), it is claimed
that this tool "tends to run as fast or faster than the equivalent handwritten JavaScript". In
addition to this claimed performance, the language has been used to ensure that consistency is
maintained throughout the whole implementation of the D3 diagrams included in the project.
3.4 Container Diagram
Most of the technologies that were used to implement this project have now been introduced. They
are presented in Appendix B by means of diagrams, which are helpful during the implementation
stage as a reminder of which technologies are used and how they communicate within the system.
Chapter 4. Implementation and Results
4.1 Overview
This chapter presents the development stage of the project. It first looks at the methodology used
for development and its environment. Based on the material presented in the previous chapter, the
discussion then focuses on the implementation of the main tasks. The chapter also goes into the
details of certain features developed and provides code samples.
In addition, the chapter covers the difficulties encountered during this process. One of the most
difficult parts was the research conducted to establish the appropriate design. From then on, the
problems were mostly technical rather than methodological.
4.2 Development Environment
Scientific Linux v7.1 was used as the operating system on which this website has been developed,
since the Ruby on Rails framework is better supported on Linux than on Windows. The installation of
Ruby on Rails on Windows is quite complex because there are many missing dependencies that must be
resolved. For example, Windows doesn’t have the libxml2¹ or the iconv² libraries, which are
prerequisites for Rails.
The coding was done in a text editor called Sublime Text 2. This software is well suited to web
development on Linux since it has many features such as auto-completion, syntax highlighting, a
project explorer and even a small built-in build system. However, this tool does not have a
debugger plug-in. Google Chrome comes in handy here, as its built-in Developer Tools allow
debugging of client-side code. For server-side errors, Ruby on Rails provides accurate
descriptions of errors such as wrong indentation in CoffeeScript files or out-of-bounds errors.
4.3 Development
Having set the design and the environment, the coding part of the implementation started. This
section presents the most important parts of this project with sample code, illustrations and
results.
¹ an XML parser library written in C
² an API used to convert between different character encodings
4.3.1 Reading, parsing and querying
One of the most important objectives was the ability to read, parse and extract the provenance
information. These three tasks are explained together since the rdflib-turtle library is used
for all of them. Reading and parsing are common tasks. All that is needed is to initialise an empty
RDF graph and to provide the path where the provenance file is saved and its format.
However, since only the relevant information had to be selected from the file, the querying part
presented a few complications and required more thought. The challenge was that the types of
artifacts couldn’t be distinguished easily. As can be seen in Figure 15, the tricky part was to
realise that the difference between artifacts is recognizable through the absence of one of the
properties.
Figure 15. Example of SPARQL query - Extraction of Workflows Runs
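The idea of telling artifacts apart by the absence of a property can be sketched as follows. This is a Python illustration over plain tuples, not the SPARQL query from Figure 15. The text does not name the property involved, so prov:hadMember is used here purely as a hypothetical stand-in, and the subjects are invented identifiers.

```python
# Triples as (subject, predicate, object); all identifiers are illustrative.
TRIPLES = [
    ("run:value1", "rdf:type", "wfprov:Artifact"),
    ("run:value2", "rdf:type", "wfprov:Artifact"),
    ("run:list1", "rdf:type", "wfprov:Artifact"),
    ("run:list1", "prov:hadMember", "run:value1"),
    ("run:list1", "prov:hadMember", "run:value2"),
    ("run:process1", "rdf:type", "wfprov:ProcessRun"),
]


def split_artifacts(triples, distinguishing_property="prov:hadMember"):
    """Split artifacts by the presence or absence of one property.

    The default property name is an assumption made for this sketch.
    """
    artifacts = {
        s for s, p, o in triples if p == "rdf:type" and o == "wfprov:Artifact"
    }
    with_property = {s for s, p, o in triples if p == distinguishing_property}
    # Artifacts lacking the property are treated as simple artifacts,
    # mirroring a "filter where the property does not exist" query.
    simple = sorted(artifacts - with_property)
    dictionaries = sorted(artifacts & with_property)
    return simple, dictionaries


simple_artifacts, dictionaries = split_artifacts(TRIPLES)
```

The set difference plays the role that a negated pattern would play in a query language: nothing positive identifies a simple artifact, only the fact that the property is missing.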
Once all³ queries have been executed, the next step was to insert the resulting data into a Ruby
object and append it to the existing JSON, as shown in Figure 14. This approach minimises the
number of requests a client makes to the server, while making all required data available for
computation.
³ There is a query for each type of provenance node: workflow runs, process runs and artifacts.
4.3.2 Diagrams
This project was developed as a visualisation tool to which other types of graphics can easily be
added. In order to keep the project tidy and easily maintainable, certain decisions were made in
relation to organisation. All the types of diagrams that need to be added using CoffeeScript are
placed under the javascript folder found in the assets folder of the Ruby on Rails project.
However, an unexpected problem with D3 arose. It turned out that the JSON was requested every
time a new graph needed to be (re)drawn. For example, if a website has 5 graphs, then the JSON
response would be requested 5 times. As a solution, it was decided that the file should
be requested only once and saved into a temporary variable. Afterwards, the methods for drawing
the diagrams are called.
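The "request once, draw many times" approach can be illustrated with a small sketch. It is written in Python rather than the project's CoffeeScript, and the class, the stand-in fetch function, the drawing functions and the JSON keys are all hypothetical.

```python
import json


class ProvenanceSource:
    """Fetch the JSON once and hand the cached copy to every diagram."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable that performs the actual request
        self._cached = None
        self.requests_made = 0

    def data(self):
        if self._cached is None:
            self.requests_made += 1
            self._cached = json.loads(self._fetch())
        return self._cached


def fake_fetch():
    """Stand-in for the request to the server (illustrative payload)."""
    return '{"nodes": [{"id": "a"}, {"id": "b"}], "links": [{"source": "a", "target": "b"}]}'


def draw_sankey(data):
    return f"sankey with {len(data['nodes'])} nodes"


def draw_matrix(data):
    size = len(data["nodes"])
    return f"{size}x{size} matrix"


source = ProvenanceSource(fake_fetch)
drawn = [draw(source.data()) for draw in (draw_sankey, draw_matrix)]
```

Both drawing functions receive the same cached object, so the number of requests stays at one regardless of how many diagrams are added.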
4.3.3 Sankey Diagram
One diagram that proved to be useful for visualizing provenance is the Sankey Diagram. It was
implemented using tutorials from the D3 website. However, these tutorials are not well documented,
and the way the D3 example builds the Sankey Diagram does not suit the purpose of this project. For
example, their final product would have positioned all the nodes wrongly, based on a base-10
logarithm of the number of consecutively linked nodes. Therefore, adaptations had to be made to fit
this project.
Another important aspect of the Sankey Diagram worth mentioning is the way it was developed in D3.
D3 uses Tarjan's Algorithm to determine the order, position and height of each node in
the page. In other words, all nodes are initially placed in the same column on the right part of
the browser. Once an edge has been drawn between two nodes, the source node moves one position
to the left. Furthermore, the height of the node is multiplied by the larger of its number of
outgoing and incoming edges.
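The placement rule described above can be sketched as follows. This is a simplified Python illustration of the described behaviour, not D3's actual implementation. It assumes the graph has no cycles, and the minimum height of 1 for unconnected nodes is an assumption of the sketch.

```python
def sankey_layout(nodes, links):
    """Return (column, height) per node; column 0 is the rightmost column."""
    column = {node: 0 for node in nodes}       # every node starts on the right
    # Push each source at least one column to the left of its target and
    # repeat until nothing moves (at most len(nodes) passes for a DAG).
    for _ in range(len(nodes)):
        moved = False
        for source, target in links:
            if column[source] <= column[target]:
                column[source] = column[target] + 1
                moved = True
        if not moved:
            break

    outgoing = {node: 0 for node in nodes}
    incoming = {node: 0 for node in nodes}
    for source, target in links:
        outgoing[source] += 1
        incoming[target] += 1
    # Height factor: the larger of the outgoing and incoming edge counts.
    height = {node: max(1, outgoing[node], incoming[node]) for node in nodes}
    return column, height


NODES = ["input", "process", "out1", "out2"]
LINKS = [("input", "process"), ("process", "out1"), ("process", "out2")]
column, height = sankey_layout(NODES, LINKS)
```

In this example the two outputs stay in the rightmost column, the process sits one column to their left with a height factor of 2, and the input ends up two columns to the left.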
As an out-of-scope requirement, this type of diagram was also used to replace the current
visualisation of the workflow. However, the implementation differs between the workflow and the
provenance visualisations. The workflow diagram is laid out vertically, while the provenance is
displayed horizontally, with the flow being read from the left to the right of the screen.
Once the implementation of the Sankey Diagram was finished, the next part was to improve the user
experience by making changes to enhance its usability and interactivity. Some of these changes are
detailed in the rest of this section.
4.3.3.1 Coloured Nodes
Every node in the Sankey Diagram is coloured based on the node type. A legend is located above the
diagram. As in Figure 16, the legend is formed of boxes coloured differently based on the referenced
object such as: “Flamenco” → simple artifacts, “Spring Green” → workflow runs, “Curious Blue”
→ process runs and “Electric Violet” → dictionaries. A complete list of colours used across the
system can be found in Appendix D.
Figure 16. Sankey Diagram – Legend
4.3.3.2 Draggable Nodes
The next addition is the ability to move a node anywhere in the page area under the tabs. This
turns out to be useful for a scientist who wants to separate specific nodes. Figure 17 shows an
example in which the user has moved the irrelevant nodes to the bottom-right corner to allow
a cleaner and more focused visualisation of the part of the workflow that is of specific interest.
Figure 17. Sankey Diagram – Draggable Nodes. Initially, the provenance had a lineage aspect.
However, the nodes can be moved to increase readability or to separate the relevant information.
4.3.3.3 Text labels for Nodes
As mentioned earlier, the Turtle file is composed of triples that consist of a subject, a predicate
and an object. Every subject represents a provenance node and is uniquely identifiable due to its
naming convention. Additionally, labels are assigned to some of the nodes through the “rdf”
ontology. It is important to mention that a node can have more than one label. Therefore, the
labels can be thought of as all the nicknames that a node receives.
Due to the method of labelling, there are three cases that need to be dealt with: nodes with no
label, nodes with exactly one label and nodes with multiple labels. The first and second cases are
quite straightforward. The nodes with no label are always dictionaries, so one solution is to label
them “Lists”. The second case is the ideal one.
However, the real difficulty is related to the nodes with multiple labels. Because D3 uses SVG, the
standard newline characters such as “\n” or “\newline” and the HTML break element “<br/>” do not
work. Therefore, it was not possible to display the labels of a single node directly on multiple
lines. In order to overcome this, the solution was to create a string in which every line is
wrapped in <span> elements. The result is illustrated in the above figure.
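The three labelling cases can be summarised in a short sketch. It is written in Python rather than the project's CoffeeScript, the function names are hypothetical, and the exact markup used in the project may differ from the plain <span> wrapping assumed here.

```python
from html import escape


def label_lines(labels, fallback="Lists"):
    """Return the lines to display for a node.

    No label       -> the fallback text (such nodes are dictionaries).
    One label      -> that label on a single line.
    Several labels -> one line per label.
    """
    if not labels:
        return [fallback]
    return list(labels)


def label_markup(labels):
    """Wrap every line in its own <span> element, escaping special characters."""
    return "".join(f"<span>{escape(line)}</span>" for line in label_lines(labels))


no_label = label_markup([])
one_label = label_markup(["greeting"])
many_labels = label_markup(["name", "in<1>"])
```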
4.3.3.4 Node information displayed on mouse-over
As the label was extracted and displayed, the next thing was to show more details about the
respective node. For a clean and convenient approach, a tool-tip is used. In the tool-tip body, the
user can find the URI of the node and other characteristics that differ for each type of provenance
node. For an artifact, the tool-tip contains the URI, the labels and the value inside that node.
For a process, on the other hand, it contains information about the time when that process was
executed and how long it took to complete. Figure 18 demonstrates one of the cases mentioned
previously.
Figure 18. Sankey Diagram - Information on hovering nodes
4.3.3.5 Coloured links between nodes
The Sankey Diagram colours the path between a source node and a target node with a variant of the
colour of the source node. In the documented version of the Sankey Diagram, if two paths intersect
then one of the paths covers the other, making it hard to visualise these connections. As a
solution to this, making the paths semi-transparent enabled the visualisation of both paths. The
colour of the intersection is a combination of both colours.
4.3.3.6 Path information on mouse-over
The previous solution was also applied to distinguish between the path hovered over by the mouse
and the others, by changing their opacities. In addition, extra information related to the path,
such as the source and target URIs and types, is displayed on hover. This feature is visible in
Figure 19.
Figure 19. Sankey Diagram – Information on hovering the coloured paths
4.3.3.7 Single Click on node
What differentiates this tool from other provenance tools is the ability to highlight all the paths
that consist of the nodes that contributed to the creation of the clicked node and the nodes that
depend on the clicked node.
Figure 20. Sankey Diagram – Highlight the path of the orange node in the middle. This enables a
researcher to follow the flow of a specific node more easily.
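One way to compute which links to highlight on a single click is a graph traversal in both directions from the clicked node. The sketch below is a Python illustration of that idea, not the project's CoffeeScript code. The node names are made up and the function names are hypothetical.

```python
from collections import deque


def reachable(start, links, forward=True):
    """Nodes reachable from start, following links forwards or backwards."""
    neighbours = {}
    for source, target in links:
        a, b = (source, target) if forward else (target, source)
        neighbours.setdefault(a, []).append(b)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def highlighted_links(clicked, links):
    """Links lying on a path into or out of the clicked node."""
    upstream = reachable(clicked, links, forward=False) | {clicked}
    downstream = reachable(clicked, links, forward=True) | {clicked}
    return [
        (s, t) for s, t in links
        if (s in upstream and t in upstream) or (s in downstream and t in downstream)
    ]


LINKS = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "c"), ("b", "y"), ("z", "y")]
selected = highlighted_links("c", LINKS)
```

Clicking "c" keeps the links that feed into it and the link that leaves it, while the side branch through "y" stays unhighlighted.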
4.3.3.8 Double click on node hides links
Another feature of this tool is the ability to hide all the outgoing edges of a single
node by double clicking on it. This allows the user to hide paths and concentrate on the parts that
are more important.
Figure 21. Sankey Diagram – Hide outgoing links on double clicking the violet node to reduce
noise
4.3.3.9 Zooming
Finally, the last interactive functionality of this diagram is the ability to zoom in or zoom out with
the mouse scroll. However, to use this feature the user needs to enable it by clicking on a button
positioned below the graph. Zooming on scroll is disabled by default to avoid the issue of the user
mistakenly scaling the graph when the intention is to navigate the page using the scroll button.
4.3.4 Adjacency Matrix
After the Sankey Diagram was successfully implemented, and since there was still time left until the
deadline, a new type of diagram was implemented: the Adjacency Matrix. As the name suggests, it
represents the data in an N*N matrix, where N is the number of nodes. Each row and column is
labelled, and if the intersection of a row with a column is coloured, this means that there is a
relationship between those nodes. This type of graph was created using the D3 library as well, but
it is not as sophisticated as the previous one.
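The underlying data structure can be sketched as follows. This is a minimal illustration, assuming links are given as (source, target) pairs, that rows stand for sources and columns for targets, and that a cell only records whether a relationship exists; the function name is hypothetical.

```python
def adjacency_matrix(nodes, links):
    """Build an N*N matrix where cell [row][column] is 1 if row -> column is a link."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in links:
        matrix[index[source]][index[target]] = 1
    return matrix


# Three nodes connected sequentially: a -> b -> c.
for row in adjacency_matrix(["a", "b", "c"], [("a", "b"), ("b", "c")]):
    print(row)
```

For sequentially connected nodes the filled cells form a line just above the primary diagonal, which is the behaviour expected in test case 2 of Appendix E.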
4.3.4.1 Sorting
This diagram has a very different feature set compared to the one highlighted in the previous
section. The Adjacency Matrix can sort the nodes based on criteria such as label, frequency and
type, as shown in Figure 22.
Figure 22. Adjacency Matrix Diagram – The provenance is represented as a coloured table in which
every type of cell represents a specific relationship between the types of provenance nodes. The
table can be sorted alphabetically by labels (left), by type (middle) and by frequency (right).
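The three orderings can be sketched as sort keys over the node list. The report does not define "frequency"; the sketch below assumes it means the number of links a node takes part in. The node labels, the dictionary layout and the `order_nodes` function are hypothetical.

```python
from collections import Counter


def order_nodes(nodes, links, criterion):
    """Return node labels in the row/column order for the chosen sorting criterion."""
    if criterion == "label":
        key = lambda n: n["label"]
    elif criterion == "type":
        # Group by node type, then alphabetically inside each group.
        key = lambda n: (n["type"], n["label"])
    elif criterion == "frequency":
        # Assumption: frequency = number of links touching the node, highest first.
        degree = Counter()
        for source, target in links:
            degree[source] += 1
            degree[target] += 1
        key = lambda n: (-degree[n["label"]], n["label"])
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return [n["label"] for n in sorted(nodes, key=key)]


# Hypothetical sample data.
nodes = [
    {"label": "hello", "type": "Process Run"},
    {"label": "greeting", "type": "Artifact"},
    {"label": "name", "type": "Artifact"},
]
links = [("name", "hello"), ("hello", "greeting")]

for criterion in ("label", "type", "frequency"):
    print(criterion, order_nodes(nodes, links, criterion))
```

Only the order of the rows and columns changes between the three views; the set of coloured cells stays the same.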
4.3.4.2 Coloured cells for paths
As in the previous diagram, the relationships between nodes are coloured. However, this diagram has
a bigger range of colours because it uses a different colour for each possible combination of
provenance node types. Similarly, a detailed legend is provided.
4.3.4.3 Saving the graph as picture
All the diagrams can be saved and downloaded as pictures. The picture will have a unique name
based on the type of diagram the user is downloading and when. For example, if you download a
provenance diagram on 1st of February 2016 at 10:00:01am, then the file will be named
"provenance_01_02_2016_10_00_01.jpg".
This feature has some limitations. The first one is that only the nodes and links that are visible
on the screen are saved in the picture. Another one is that the saved picture is static, so the
additional information shown when the mouse hovers over a node is no longer available.
In spite of these limitations, it is useful for scientists to save a picture of their research to include it
in reports, presentations or other materials.
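The naming scheme described in this section can be sketched with a date format string. The function name is hypothetical; only the resulting pattern comes from the example given earlier in this section.

```python
from datetime import datetime


def picture_name(diagram_type, when):
    """Build a file name from the diagram type and the download time."""
    # Day, month, year, then hours, minutes and seconds, all joined by underscores.
    return f"{diagram_type}_{when.strftime('%d_%m_%Y_%H_%M_%S')}.jpg"


print(picture_name("provenance", datetime(2016, 2, 1, 10, 0, 1)))
```

Because the time is included down to the second, two downloads of the same diagram type only share a name if they happen within the same second.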
Chapter 5. Testing and Evaluation
5.1 Overview
This chapter presents how the software was tested throughout the development stage to determine
the correctness of the functionalities that had been implemented. Afterwards, the methods used to
measure the success of those functionalities are described.
5.2 Testing
Since this project is mostly about the visual presentation of information, standard test-driven
techniques could not be applied. Therefore, a new strategy, which was mostly based on manual
testing, had to be developed in order to test the graphical user interface.
Due to the decision to use the agile methodology, testing was executed immediately after each
objective was implemented. This enabled early detection of errors and bugs. The project consists of
three important iterations that needed to be tested: the extraction of the correct data from the
databundle, the correct implementation of every type of diagram, and the coupling of the data with
the diagram.
5.2.1 Functional Testing
Functional testing is performed once the project has been completed. It tests all the functionalities of
the project. Therefore, it includes all the testing that was made after each main task was finished. The
test was performed based on a more detailed version of the diagram presented in Figure 12 and as
such even the functionalities of the original website were considered.
5.2.1.1 Data extraction testing
For the data extraction, the reading and the parsing were done through the rdflib-turtle gem, which
saves the entire provenance graph in local memory. The test was made by directly comparing the
output with the original file.
On the other hand, the data querying was tested using third-party software named Apache Jena
Fuseki. This is a Java framework that can be used as a SPARQL endpoint. It runs as a web
application and offers the ability to test and validate SPARQL queries. A considerable advantage is
that it can directly load files in Turtle format (like workflow.prov.ttl). Once the graph is loaded,
it is available for querying and for testing queries for correctness. The results can be examined in
the table where they are output. In addition, this tool also validates whether a query conforms to
the standards recommended by the World Wide Web Consortium (W3C).
5.2.1.2 Diagrams Testing
The next iteration required black-box testing to determine the quality of each diagram. A test
workbench was designed to simulate the cases mentioned in Appendix E. Those tests were performed
for different values of N.
5.2.1.3 Coupling of the data with the diagram
The next task, which follows the implementation and testing of the diagrams, is the coupling of the
provenance data with the diagram. For this step, only manual testing was used, following the
relationships presented in Table 3.
Node Type      Relationship (wfprov)   Node Type      Direction
Artifact       wasOutputFrom           Workflow Run   ←
Artifact       wasOutputFrom           Process Run    ←
Dictionary     hadMember               Artifact       →
Dictionary     hadMember               Dictionary     →
Process Run    wasPartOfWorkflowRun    Workflow Run   ←
Process Run    usedInput               Artifact       →
Process Run    usedInput               Artifact       →
Workflow Run   wasPartOfWorkflowRun    Workflow Run   ←
Table 3. Directions of relationships
5.2.1.4 Observations and remarks
By testing the diagrams, various irregularities were found. One irregularity was the logarithmic
problem: as recorded in Appendix E, sequentially connected nodes were initially layered by different
values of log(N) instead of being placed on the same row. Another irregularity was noticed when the
graph was wider than the browser window: there was no way for the user to view the part of the
diagram that fell outside it.
5.2.2 Cross-Browser Testing
Another type of testing was the cross-browser testing. Therefore, the project has been tested on many
different and modern browsers that support web sockets. The outcome of this test has determined
that the project works with the following browsers: Internet Explorer 10+, Safari 6.0+, Google
Chrome 16.0+ and Mozilla Firefox 6.0+.
5.3 Evaluation
Another important cycle in the development project is the evaluation. This is used to determine the
correctness of the whole product. Several methods were used for the evaluation and are described in
Table 4.
Form            Type of Evaluation   Time Period            Purpose of Evaluation
Proactive       Formative            Research Stage         Define and understand the requirements.
Clarificative   Formative            Planning               Gain knowledge on the prerequisites and design a plan.
Interactive     Formative            Implementation Stage   Code the functionalities following the plan.
Monitoring      Formative            Implementation Stage   Ensure that the project is delivered upon deadline.
Outcome         Summative            Testing                Evaluate whether the project has met its objectives.
Source: (Owen and Rogers, 1999) mentioned in (Pritchard, 2014)
Table 4. Evaluation Techniques
Another kind of evaluation was provided either through feedback received during the meetings with
the supervisor and the myGrid team, or through the assessed deliverables related to this project,
such as the Seminar and the Presentation of Results.
Therefore, through all the activities mentioned above, access to sufficient feedback was provided.
All of it has been taken into consideration, and almost all the suggestions that could be
implemented in the allocated time have been included in my project. Those that have not been
implemented were added as future work, which can be found in the last chapter.
Chapter 6. Conclusion
6.1 Overview
This chapter presents the achievements of the project. It also provides an evaluation of the
decisions made during the period of time allocated to deliver this project and how they affected
the initial timing and milestones. Afterwards, it reflects on how this project contributed to the
development of new skills or the improvement of existing ones, and it concludes with possible
future improvements.
6.2 Achievements
Overall, the third year project presented in this report has successfully fulfilled its main aim,
which was to extend and improve the ATDV website, which at the moment allows scientists to
visualise workflows along with the provenance resulting from running them. The success can be
measured by the demonstration given to my supervisor, my second evaluator and several members of
the myGrid team.
In the course of completing this project, two methods to visualise the provenance were presented,
namely the Sankey Diagram and the Adjacency Matrix. The software has been designed so that
additional types of diagrams can be added, by enforcing a fixed format for the JSON file.
Everything was possible due to the research and design that were carried out using a combination of
the waterfall and agile methodologies.
6.3 Variation from the initial plan
The first project plan was drawn up when I had little knowledge about the topic. During the time
allocated for the project, it was redone several times due to the new information acquired. This
required modifications to functionalities that had been added. As time passed, I noticed that the
dates of the milestones of the project needed to be updated because the progress was either ahead
of the plan or behind it.
The progress was behind the plan in the periods before and after the winter holiday. The end of the
first semester was the moment when most of the courses required submissions and evaluations of
assignments. Similarly, the period after the winter holiday was mostly reserved for the exams.
6.4 Experience gained
The development of this project represents a great improvement for me. Through the completion of
this project, I have developed new skills and improved existing ones, which will be useful in the
future. I believe that the greatest development can be noticed in technical and research skills,
followed by management skills.
Technical Skills
The reason why I believe that my technical skills improved is that my previous experience with most
of the technologies used in this project was almost non-existent. This meant that I had to learn
everything from scratch on my own, by creating simple applications while completing the available
online tutorials. Through practice and exercise, I started accumulating experience, which allowed
me to complete my project using these technologies.
Thus, I am satisfied that in such a short period of time I was able to gain knowledge and practical
skills in web development with the Ruby on Rails framework and the other technologies mentioned in
Chapter 3. These can be considered a valuable set of skills that can be added to my Curriculum Vitae.
Research Skills
Another area in which I have improved is research, as a result of the time spent finding the best
tools to meet the objectives of the project. Through proper research, I was able to find similar
solutions that already existed. Therefore, there was no need for me to reinvent the wheel. I only
needed to add several improvements or modify them so that they fit my requirements.
Management Skills
As with all projects, several essential skills are needed in order to complete the project on time.
One of them is time management. This required an organized schedule that included both academic
and personal deadlines, prioritized by their importance. Usually, the academic ones had to be
completed first because of the penalties imposed for not submitting the deliverables on time.
Another skill that I had the chance to improve is communication. This skill was improved not only
by the meetings and emails exchanged with the supervisor, but also by the fact that during the
first semester I worked with the myGrid team. This opportunity allowed me to get to know people who
have more experience than me and to ask them for direction when I was stuck or had no ideas.
However, in the second semester, due to the change of timetable, I started working from home, and
communication with the researchers was done via Skype, emails or short meetings.
6.5 Future Works
The current project represents the first release, which provides only the basic functionality of
visualizing the provenance. In order to make this application ready for public release to a wider
user community, I have identified several pieces of future work:
• The ability to group nodes, as in the project created by the Polish student;
• Install Taverna Server and execute the workflow directly on the server of this website, so that
the user no longer needs to upload the entire databundle, only the workflow;
• Add the ability to compare the provenance graphs of the same workflow run with different
set-ups, pointing out the differences;
• Upgrade the Adjacency Matrix to the Clustergrammer when it is released;
• Save the JSON file next to the databundle or workflow so that the reading, parsing and querying
are done only once;
• A loading pop-up window that lets the user know when the diagram has finished loading;
• A summative user experience evaluation, following the instructions in Appendix F, through which
the usability of this project can be determined.
References
Almsaeed Studio, (2016). AdminLTE Preview - Almsaeed Studio. [online] Almsaeedstudio.com.
Available at: https://almsaeedstudio.com/preview [Accessed 26 November 2015].
Apache Taverna Databundle Viewer, (2015). DatabundleViewer. [online]
Databundle.herokuapp.com. Available at: http://databundle.herokuapp.com/ [Accessed 14
October 2015].
Autoimmunity Research Foundation. (2012). Differences between in vitro, in vivo, and in silico
studies. [online] http://mpkb.org. Available at:
http://mpkb.org/home/patients/assessing_literature/in_vitro_studies [Accessed 20 October
2015].
Bower, (2012). Bower. [online] Bower.io. Available at: http://bower.io/docs/about/ [Accessed 2
November 2015].
Bowers, S. (2012). Scientific Workflow, Provenance, and Data Modeling Challenges and
Approaches. Journal on Data Semantics, 1(1), pp.19-30.
Brown, S. (2016). The Art of Visualising Software Architecture. [online] Leanpub. Available at:
https://leanpub.com/visualising-software-architecture/read [Accessed 20 October 2015].
CoffeeScript, (2015). CoffeeScript Documentation. [online] Coffeescript.org. Available at:
http://coffeescript.org/ [Accessed 2 November 2015].
Dalling, T. (2014). Model View Controller Explained. [online] Tomdalling.com. Available at:
http://www.tomdalling.com/blog/software-design/model-view-controller-explained/ [Accessed
17 October 2015].
Dang, T., Franz, N., Ludascher, B. and Forbes, A. (2015). Provenance Matrix. International
Workshop on Visualizations and User Interfaces for Ontologies and Linked Data, VOILA 2015
- co-located with 14th International Semantic Web Conference, ISWC 2015. [online] Available
at: https://asu.pure.elsevier.com/en/publications/provenancematrix-a-visualization-tool-for-
multi-taxonomy-alignmen [Accessed 20 October 2015].
Gol, M. (n.d.). Provenance Visualisation Test. [online] Provenance.curo.ch. Available at:
http://provenance.curo.ch/graph [Accessed 2 October 2015].
Hoekstra, R. (2013). Data2Semantics | PROV Provenance Visualizer. [online] Provoviz.org.
Available at: http://provoviz.org/ [Accessed 2 October 2015].
Mendler, D., Stone, A. and Wu, F. (2016). Documentation for slim (3.0.6). [online] Rubydoc.info.
Available at: http://www.rubydoc.info/gems/slim [Accessed 24 October 2015].
Nenadic, A. (2014a). myExperiment - Workflows - Example workflow for REST and XPath
activities (Alex Nenadic) [Taverna 2 Workflow]. [online] Myexperiment.org. Available at:
http://www.myexperiment.org/workflows/4206.html [Accessed 10 November 2015].
Nenadic, A. (2014b). myExperiment - Workflows - Hello Anyone [Taverna 2 Workflow]. [online]
Myexperiment.org. Available at: http://www.myexperiment.org/workflows/4210.html
[Accessed 10 November 2015].
Node.js Foundation, (2016). About | Node.js. [online] Nodejs.org. Available at:
https://nodejs.org/en/about/ [Accessed 26 October 2015].
Owen, J. and Rogers, P. (1999). Program evaluation. London: Sage Publications.
Oxford Dictionaries, (2016). workflow – definition of workflow in English from the Oxford
dictionary. [online] Oxforddictionaries.com. Available at:
http://www.oxforddictionaries.com/definition/english/workflow [Accessed 26 March 2016].
Pritchard, M. (2014). Types of evaluation. [online] Evaluationtoolbox.net.au. Available at:
http://evaluationtoolbox.net.au/index.php?option=com_content&view=article&id=15&Itemid=
19 [Accessed 26 February 2016].
Soiland-Reyes, S., Bechhofer, S., Klyne, G., Palma, R., Belhajjame, K., García Cuesta, E., Garijo, D.
and Corcho, O. (2013). Wf4Ever Research Object Model 1.0 (2013-11-30). [online] Zenodo.
Available at: http://dx.doi.org/10.5281/zenodo.12744 [Accessed 18 October 2015].
Soiland-Reyes, S. (2013). Data Bundle - Taverna 3 dev - myGrid developer wiki. [online]
Dev.mygrid.org.uk. Available at:
http://dev.mygrid.org.uk/wiki/display/TAVOSGI/Data+Bundle [Accessed 18 October 2015].
Taverna.org.uk, (2009a). What is in silico experimentation? | Taverna. [online] Available at:
http://www.taverna.org.uk/introduction/what-is-in-silico-experimentation/ [Accessed 1
November 2015].
Taverna.org.uk, (2009b). What is a Workflow Management System? | Taverna. [online]
Taverna.org.uk. Available at: http://www.taverna.org.uk/introduction/what-is-a-workflow-
management-system/ [Accessed 2 October 2015].
The University of Manchester, and University of Southampton, (n.d.). About myExperiment. [online]
Myexperiment.org. Available at: http://www.myexperiment.org/about [Accessed 28 March
2016].
Williams, A. (2014). Design Perspective - Taverna 2.5 - myGrid developer wiki. [online]
Dev.mygrid.org.uk. Available at:
http://dev.mygrid.org.uk/wiki/display/tav250/Design+Perspective [Accessed 31 March 2016].
Workflow Management Coalition, (1996). Workflow Management Coalition - Glossary and
Terminology. [online] Aiai.ed.ac.uk. Available at:
http://www.aiai.ed.ac.uk/project/wfmc/ARCHIVE/DOCS/glossary/glossary.html [Accessed 26
March 2016].
Zanders, K. (2012). HamlErbSlim. [online] GitHub. Available at:
https://github.com/scalp42/hamlerbslim [Accessed 24 October 2015].
Appendix A: workflow.prov.ttl inside Hello Anyone databundle
Appendix B: Container Diagram
Figure 23. Container Diagram
Appendix C: Model-View-Controller Design
Source: (Dalling, 2014)
Figure 24. Model-View-Controller Architecture
The Model-View-Controller design is explained briefly here. As can be seen in the above picture, the
flow of a standard Ruby on Rails web application consists of 3 elements:
• Controller:
o a list of all the commands that a user can request;
• Model:
o the part whose only responsibility is to manage data;
• View:
o an interface in which data is viewed and can be modified.
Appendix D: Colours used in diagrams
Colour Name          Hex       RGB
Cardin Green         #003318   0, 51, 24
Chathams Blue        #12476D   18, 71, 109
Curious Blue         #258FDA   37, 143, 218
Electric Violet      #7F0EFF   127, 14, 255
Flamenco             #FF7F0E   255, 127, 14
Green Haze           #009947   0, 153, 71
Heliotrope           #BB80FF   187, 128, 255
Kaitoke Green        #004D24   0, 77, 36
Nutmeg Wood Finish   #663000   102, 48, 0
Peach Orange         #FFC999   255, 201, 153
Pear                 #C3E221   195, 226, 33
Pigment Indigo       #3C0080   60, 0, 128
Red Berry            #990000   153, 0, 0
Seagull              #7CBCE9   124, 188, 233
Spring Green         #0EFF7F   14, 255, 127
Table 5. Colours used for the Sankey Diagram and Adjacency Matrix
Appendix E: Testing
Test No. 1
Case: There are N nodes with no links.
Expected outcome: All nodes will be on the same column.
Sankey Diagram: All nodes are positioned on the right side of the screen on the same column.
Adjacency Matrix: The matrix should be empty (all cells should be white; labels visible on top and
left sides).

Test No. 2
Case: There are N nodes which are connected sequentially.
Expected outcome: All nodes will be on the same row.
Sankey Diagram: Initially, the nodes were layered by the different values of log(N). However, in the
end this problem was fixed and all the nodes are on the same row.
Adjacency Matrix: There should be a line parallel to the primary diagonal. It should be above the
primary diagonal.

Test No. 3
Case: There are N nodes that are connected randomly.
Expected outcome: The nodes should be positioned according to the flow in the input. On the first
column, the nodes with no incoming edges. On the last column, the nodes with no outgoing edges. On
column X, the nodes that are connected with the nodes on column (X+1) or the nodes on column (X-1).
Sankey Diagram: Besides the above problem, a new one was noticed. If the width of the graph was
bigger than the width of the browser, the part of the graph outside the browser window was not
accessible to the user. By zooming out, it was seen that the nodes were positioned and
interconnected as intended.
Adjacency Matrix: The table should have coloured cells for all source-target links. The rest should
be empty.
Table 6. Testing cases for the diagrams
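The expected column positions in the test cases above can be checked against a simple layering rule: a node's column is the length of the longest path leading into it. This is a sketch under that assumption for acyclic graphs; it is not necessarily the layout rule used by the Sankey Diagram library, and it does not move nodes with no outgoing edges to the last column. The function name is hypothetical.

```python
from collections import defaultdict


def assign_columns(nodes, links):
    """Map each node to a column: the length of the longest path leading into it."""
    outgoing = defaultdict(list)
    indegree = {node: 0 for node in nodes}
    for source, target in links:
        outgoing[source].append(target)
        indegree[target] += 1

    column = {node: 0 for node in nodes}
    # Start from the nodes with no incoming edges (first column).
    ready = [node for node in nodes if indegree[node] == 0]
    while ready:
        node = ready.pop()
        for nxt in outgoing[node]:
            column[nxt] = max(column[nxt], column[node] + 1)
            indegree[nxt] -= 1
            # A node is final once all of its predecessors have been processed.
            if indegree[nxt] == 0:
                ready.append(nxt)
    return column


nodes = ["a", "b", "c", "d"]
print(assign_columns(nodes, []))                                    # test case 1: no links
print(assign_columns(nodes, [("a", "b"), ("b", "c"), ("c", "d")]))  # test case 2: a chain
```

With no links every node lands in the same column, and a sequential chain of N nodes spreads over N consecutive columns, which corresponds to all nodes sitting on one row.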
Appendix F: Usability evaluation methods
No. 1: Usability Testing
Description: This approach involves real-world users who are asked to complete a set of tasks and
fill in a questionnaire at the end.
Approach: For the Provenance Viewer, the instructions may look something like this:
• Design a workflow in Taverna, run it and save it as a databundle;
• Register on Provenance Viewer and log in;
• Look at the workflow diagram and answer part A of the questionnaire;
• Look at the Sankey Diagram and fill in the next part of the questionnaire;
• Analyse the Provenance Matrix and complete the questionnaire;
• In the “Additional feedback” section of the questionnaire, the user should write useful comments
that can be used to improve the experience of this tool.
Meanwhile, the questionnaire should contain the following:
• Part A: Workflow Diagram;
• Part B: Provenance – Sankey Diagram;
• Part C: Provenance – Provenance Matrix;
• Part D: Additional Feedback.

No. 2: Think Aloud Testing
Description: A method through which feedback can be gained by asking testers to think out loud as
they test the software.
Approach: Every comment provided by the tester about what they are thinking and feeling with regard
to the task performed should be recorded. For example, users can provide information that may be
used to determine an easier-to-use design for the tool.
Table 7. Usability Evaluation Methods
Appendix G: More workflows
Source: (The University of Manchester and University of Southampton, n.d.)
Figure 25. Weather forecast workflow
Figure 26. Explicit looping workflow