Semantic Web 0 (0) 1
IOS Press

Findable and Reusable Workflow Data Products: A genomic Workflow Case Study

Alban Gaignard a,*, Hala Skaf-Molli b and Khalid Belhajjame c
a Institut du Thorax, Inserm UMR 1087, CNRS UMR 6291, Université de Nantes, France. E-mail: [email protected]
b LS2N, University of Nantes, France. E-mail: [email protected]
c LAMSADE, University of Paris-Dauphine, France. E-mail: [email protected]

Abstract. While workflow systems have improved the repeatability of scientific experiments, the value of the processed (intermediate) data has been overlooked so far. In this paper, we argue that the intermediate data products of workflow executions should be seen as first-class objects that need to be curated and published. Not only can this be exploited to save the time and resources needed to re-execute workflows, but more importantly, it improves the reuse of data products by the same or peer scientists in the context of new hypotheses and experiments. To assist curators in annotating (intermediate) workflow data, we exploit in this work multiple sources of information, namely: i) the provenance information captured by the workflow system, and ii) domain annotations provided by tool registries, such as Bio.Tools. Furthermore, we show, on a concrete bioinformatics scenario, how summarisation techniques can be used to reduce the machine-generated provenance information of such data products into concise human- and machine-readable annotations.

Keywords: FAIR, Linked Data, Scientific Workflows, Provenance, Bioinformatics, Data Summaries
1. Introduction

We have witnessed in the last decade a paradigm shift in the way scientists conduct their experiments, which are increasingly data-driven. Given a hypothesis that the scientist seeks to test, verify or confirm, s/he processes given input datasets using an experiment modelled as a series of scripts written in languages such as R, Python and Perl, or as pipelines composed of connected modules (also known as workflows [1, 2]). For example, recent progress in sequencing technologies, combined with the reduction of their cost, has led to a massive production of genomic data, with growth rates that exceed major manufacturers' expectations [3]. A single research lab using a last-generation sequencer can currently generate in one year (1) the equivalent of the world-wide collaborative sequencing capacity in 2012 [4].

The datasets obtained as a result of the experiment are analyzed by the scientist, who then reports on the findings s/he obtained by analyzing the results [5]. In response to the reproducibility movement [6], which has gained great momentum recently, scientists have been encouraged to not only report on their findings, but also to document the experiment (method) they used, the datasets they used as inputs and, eventually, the datasets obtained as a result. To assist scientists in the task of reporting, a number of methods and tools have been proposed (see e.g., [7-9]). In [10], Gil et al. propose data narratives to automatically generate text describing computational analyses, which can be presented to users and ultimately included in papers or reports.

* Corresponding author. E-mail: [email protected].
(1) Theoretically around 2,500 whole genomes per year with an Illumina NovaSeq technology.

1570-0844/0-19.00/$35.00 © 0 – IOS Press and the authors. All rights reserved.




While we recognize that such proposals are of great help to scientists, and can to a certain extent be instrumental in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products generated by the experiments. Indeed, the focus in the reports generated by the scientist is put on the scientific findings, documenting the hypothesis and experiment used and, in certain cases, the datasets obtained as a result of the experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if reported at all. The availability of such intermediate datasets can be of value to third-party scientists running their own experiments. This does not only save those scientists time, in that they can use readily available datasets, but also saves resources, since some intermediate datasets are generated by large-scale, resource- and compute-intensive scripts or modules.

We argue that intermediate datasets generated by the steps of an experiment should be promoted as first-class objects in their own right, to be findable, accessible and ultimately reusable by the members of the scientific community. We focus in this paper on datasets that are generated by experiments specified and enacted using workflows. There have recently been initiatives, notably FAIR [11], which specify the guidelines and criteria that need to be met when sharing data in general. Meeting such criteria remains challenging, however.

In this paper, we show how we can combine provenance metadata with external knowledge associated with workflows and tools to promote processed-data sharing and reuse. More specifically, we present FRESH, an approach to associate the intermediate as well as the final datasets generated by workflows with annotations specifying their retrospective provenance and their prospective provenance (i.e., the part of the workflow that was enacted for their generation). Both prospective and retrospective provenance can be overwhelming for a user to understand. Because of this, we associate datasets with a summary of their prospective provenance. Moreover, we annotate the datasets with information about the experiment they were used in, e.g., hypothesis and contributors, as well as with semantic domain annotations that we automatically harvest from third-party resources, in particular Bio.Tools2 [12]. Our ultimate objective is to promote processed-data reuse in order to limit the duplication of the computing and storage efforts associated with workflow re-execution.

The contributions of this paper are the following:

– A definition of workflow data-product reuse in the bioinformatics domain.
– A knowledge-graph-based approach for annotating raw processed data with domain-specific concepts, while limiting the burden on domain experts at the time of sharing their data.
– An experiment based on a real-life bioinformatics workflow, which can be reproduced through an interactive notebook.

This paper is organised as follows. Section 2 presents the motivation and defines the problem statement. Section 3 details the proposed FRESH approach. Section 4 presents our experimental results. Section 5 summarises related work. Finally, conclusions and future work are outlined in Section 6.

2. Motivations and Problem Statement

We motivate our proposal through an exome-sequencing bioinformatics workflow. This workflow aims at (1) aligning sample exome data (the protein-coding regions of genes) to a reference genome, and (2) identifying genetic mutations for each of the biological samples. Figure 1 sketches the bioinformatics analysis tasks required to identify and annotate genetic variants from exome-sequencing data. For the sake of clarity, we hide in this scenario some of the minor processing steps, such as the sorting of DNA bases, but they are still required in practice. The real workflow is described in detail in the experimental results section.

This workflow consumes as inputs two sequenced biological samples, sample_1 and sample_2. For each sample, sequencers produce multiple files that need to be merged later on (Sequence merging step). The first processing step consists in aligning [13] (Mapping to reference genome) the short sequence reads to the human reference genome (GRCh37). Then, for each sample, data are merged and post-processed [14, 15], resulting in binary (BAM) files representing the aligned sequences with their

2 https://bio.tools


[Figure 1 here. Data files shown: sample_1a.R1/R2, sample_1b.R1/R2, sample_2b.R1/R2, GRCh37, sample_1 BAM file, sample_2 BAM file, VCF file. Processing steps shown: Mapping to reference genome, Sequence merging, Variant calling, Annotation from public knowledge bases.]

Fig. 1. A typical bioinformatics workflow aimed at identifying and annotating genomic variations from a reference genome. Green waved boxes represent data files and blue rounded boxes represent processing steps.

quality metrics. Finally, from these aligned sequences, the genetic variants are identified [16] and enriched with annotations [17] gathered from public knowledge bases such as dbSNP [18] or gnomAD3. This last processing step results in a VCF file listing, for all processed sample sequences, all known genomic variations compared to the GRCh37 reference genome.

Performing these analyses in real-life conditions is computation-intensive: they require a lot of CPU time and storage capacity. As an example, similar workflows are run in production at the CNRGH, the French national sequencing facility. For a typical exome-sequencing sample (9.7GB compressed), it has been measured that 18.6GB was necessary to store the input and output compressed data. In addition, 2 hours and 27 minutes were necessary to produce an annotated VCF variant file, taking advantage of parallelism in a dedicated high-performance computing infrastructure (7 nodes with 28 Intel Broadwell CPU cores each), which corresponds to 158 cumulative hours for a single sample, i.e., 6 days of computation on a single CPU.
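As a back-of-the-envelope sketch of these figures (the numbers are taken from the paragraph above; the gap between the theoretical core-hour ceiling and the measured cumulative time is our own reading, reflecting that not every core is busy at every step):

```python
# Figures quoted above for one exome-sequencing sample.
wall_clock_hours = 2 + 27 / 60   # 2 h 27 min of wall-clock time
cores = 7 * 28                   # 7 nodes x 28 Intel Broadwell cores

# Theoretical ceiling if every core were busy for the whole run.
ceiling_hours = wall_clock_hours * cores   # ~480 cumulative core-hours

# Measured cumulative CPU time is much lower: cores idle between steps.
cumulative_hours = 158
days_on_one_cpu = cumulative_hours / 24    # ~6.6 days on a single CPU

print(f"{ceiling_hours:.0f} h ceiling, {days_on_one_cpu:.1f} days on one CPU")
```

This is why re-using an already computed VCF or BAM file, rather than re-running the workflow, saves days of single-CPU-equivalent computation per sample.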

Considering the computational cost of these analyses, we claim that the secondary use of data is critical to speed up research addressing similar or related topics. In this workflow, all processing steps produce data, but they do not provide the same level of reusability. We tagged reusable data with a white star in Figure 1. More precisely, GRCh37 is by nature highly reusable, since it is a reference "atlas" for genomic human sequences and results from state-of-the-art scientific knowledge at a given time. BAM files can also be considered more reusable than the raw input data, since they have been aligned to this atlas and thus benefit from consensual knowledge on this genome. For example, they provide the relationship between sequences and known genes, they can be visualized in a genome viewer, and they can be used to re-generate raw unmapped sequences.

3 https://gnomad.broadinstitute.org

From the scientist's perspective, answering questions such as "can or should I reuse these files in the context of my research study?" is still challenging. To reuse the final VCF variant file, it is of major importance to know the version of the reference genome, as well as to clearly understand the scientific context of the study: the phenotypes associated with the samples, as well as the possible relations between samples. Finally, having precise information on the variant-calling algorithm is also critical, due to application-specific detection thresholds [19]. More generally, not only is fine-grained provenance information regarding data and tool lineage required, but also domain-specific annotations based on community-agreed vocabularies (Issue 1). These vocabularies exist, but annotating processed data with domain-specific concepts requires a lot of time and expertise (Issue 2).

In this work, we show how we can improve the findability and reusability of workflow (intermediate) data by leveraging (1) community efforts aimed at semantically cataloguing bioinformatics processing tools, to reduce the solicitation of domain experts, and (2) the automation and provenance capabilities of workflow management systems, to automate the annotation of processed data towards more reusable workflow results.

3. FRESH Approach

FRESH is an approach to improve the Findability and the Reusability of genomic workflow data. FAIR [11, 20] and Linked Data [21, 22] principles constitute the conceptual and technological backbones in this direction.

FRESH partially tackles the FAIR requirements for better findability and reusability. We address findability by relying on Linked Data best practices, namely associating a URI with each dataset and linking these datasets in the form of RDF knowledge graphs, with controlled vocabularies for the naming of concepts and relations.

Being tightly coupled to the scientific context, reusability is more challenging to achieve. Guidelines have been proposed for the FAIR sharing of genomic data [23]; however, proposing and evaluating reusability is still challenging and a work in progress [24]. In this work, we consider data reusable when it is annotated with sufficiently complete information to allow, without the need for external resources, traceability, interpretability, understandability and usage by humans or machines.

To be traceable, provenance traces are mandatory for tracking the data-generation process.

To be interpretable, contextual data [11] are mandatory; this includes i) the scientific context, in terms of claims, research lab, experimental conditions and previous evidence (academic papers), and ii) the technical context, in terms of material and methods: data sources used, software (algorithms, queries) and hardware.

To be understandable by itself, data must be annotated with domain-specific vocabularies. For instance, to capture knowledge associated with the data-processing steps, we can rely on EDAM4, which is actively developed and used in the context of the Bio.Tools registry, and which organizes common terms used in the field of bioinformatics. However, these annotations on processing tools do not capture the scientific context in which a workflow takes place. To address this issue, we rely on the Micropublications [25] ontology, which has been proposed to formally represent scientific approaches, hypotheses, claims and pieces of evidence, in the direction of machine-tractable academic papers.

4 http://edamontology.org

Figure 2 illustrates our approach to provide more reusable data. The first step consists in capturing provenance for all workflow runs. PROV5 is the de facto standard for describing and exchanging provenance graphs. Although capturing provenance can be easily managed in workflow engines, there is no systematic way to link a PROV Activity (the actual execution of a tool) to the relevant SoftwareAgent (i.e., the software responsible for the actual data processing). To address this issue, we propose to provide, at workflow-design time, the tool's identifier in the tool catalog. This allows the generation of a provenance trace which associates (prov:wasAssociatedWith) each execution, and thus each consumed and produced datum, with the software identifier.
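The kind of PROV-O trace described above can be sketched as follows. This is an illustration, not the paper's actual capture module: the `ex:` names are made up, and `https://bio.tools/bwa` stands in for whatever registry identifier the workflow designer supplies. The Turtle is assembled as a plain string so the sketch needs no RDF library.

```python
# Sketch: emit a PROV-O fragment in Turtle, linking one tool execution
# (a prov:Activity) to its registry identifier via prov:wasAssociatedWith.
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex:   <http://example.org/run/> .\n"
)

def activity_trace(activity, inputs, outputs, tool_iri):
    """Render one tool execution as PROV-O Turtle."""
    lines = [f"ex:{activity} a prov:Activity ;",
             f"    prov:wasAssociatedWith <{tool_iri}> ;"]
    lines += [f"    prov:used ex:{i} ;" for i in inputs]
    lines[-1] = lines[-1].rstrip(" ;") + " ."
    for o in outputs:
        lines.append(f"ex:{o} a prov:Entity ; prov:wasGeneratedBy ex:{activity} .")
    return "\n".join(lines)

trace = PREFIXES + activity_trace(
    "align_sample1",
    inputs=["sample_1_R1", "GRCh37"],
    outputs=["sample_1_bam"],
    tool_iri="https://bio.tools/bwa",   # hypothetical bio.tools entry
)
print(trace)
```

Because every consumed and produced entity hangs off the same activity node, a later query can walk from any file to the registry identifier of the software that produced it.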

Then, we assemble a bioinformatics knowledge graph which links together (1) the tool annotations gathered from the Bio.Tools6 registry, providing information on what the tools do (bioinformatics EDAM operations) and which kinds of data they consume and produce, (2) the complete EDAM ontology, to gather for instance the community-agreed definitions and synonyms for bioinformatics concepts, (3) the PROV graph resulting from a workflow execution, which provides fine-grained, technical and domain-agnostic provenance metadata, and (4) the experimental context, using Micropublications for the scientific claims and hypotheses associated with the experiment.

Finally, based on domain-specific provenance queries, the last step consists in extracting a few meaningful pieces of data from the knowledge graph, to provide scientists with more reusable intermediate or final results, and to provide machines with findable and queryable data stories.

In the remainder of this section, we rely on the SPARQL query language to interact with the knowledge graph, in terms of knowledge extraction and knowledge enrichment.

Query 1 aims at extracting and linking together data artefacts with the definition of the bioinformatics process they result from.

In this SPARQL query, we first identify data (prov:Entity), the tool execution they result from (prov:wasGeneratedBy) and the associated software (prov:wasAssociatedWith). Then, we retrieve from the Bio.Tools

5 https://www.w3.org/TR/prov-o/
6 https://bio.tools


[Figure 2 here. Components: workflow specification, workflow enactment, raw data, experimental context, provenance capture and linking, PROV-O graph, Bio.Tools graph, EDAM ontology, knowledge graph, provenance mining queries, human-oriented data summaries, machine-oriented data summaries.]

Fig. 2. Knowledge graph based on workflow provenance and tool annotations to automate the production of human- and machine-oriented data summaries.

SELECT ?d_label ?title ?f_def ?st WHERE {
  ?d rdf:type prov:Entity ;
     prov:wasGeneratedBy ?x ;
     rdfs:label ?d_label .
  ?x prov:wasAssociatedWith ?tool .
  ?tool dc:title ?title ;
        biotools:has_function ?f .
  ?f rdfs:label ?f_label ;
     oboInOwl:hasDefinition ?f_def .
  ?c rdf:type mp:Claim ;
     mp:statement ?st .
}

Query 1. SPARQL query aimed at linking processed data to the processing tool and the definition of what is done on the data.

sub-graph the EDAM annotation which specifies the function of the tool (biotools:has_function). The definition of the tool's function is retrieved from the EDAM ontology (oboInOwl:hasDefinition). Finally, we retrieve the scientific context of the experiment by matching statements expressed in natural language (mp:Claim, mp:statement).
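The join performed by Query 1 can be illustrated without a SPARQL engine, over triples stored as plain tuples. This is a toy sketch with made-up IRIs, not the paper's knowledge graph; it only makes the chain of joins (entity, activity, tool, EDAM operation, definition) concrete.

```python
# Toy in-memory triple store; all IRIs are illustrative.
triples = {
    ("ex:bam1", "rdf:type", "prov:Entity"),
    ("ex:bam1", "rdfs:label", "Sample1.final.bam"),
    ("ex:bam1", "prov:wasGeneratedBy", "ex:run42"),
    ("ex:run42", "prov:wasAssociatedWith", "ex:toolA"),
    ("ex:toolA", "dc:title", "gatk2_print_reads"),
    ("ex:toolA", "biotools:has_function", "edam:operation_3197"),
    ("edam:operation_3197", "oboInOwl:hasDefinition", "Read summarisation."),
}

def objects(s, p):
    """All objects o such that (s, p, o) is in the store."""
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

def query1():
    """Mimic Query 1's basic graph pattern as nested joins."""
    rows = []
    for (d, _, _) in [t for t in triples if t[1:] == ("rdf:type", "prov:Entity")]:
        for label in objects(d, "rdfs:label"):
            for x in objects(d, "prov:wasGeneratedBy"):
                for tool in objects(x, "prov:wasAssociatedWith"):
                    for title in objects(tool, "dc:title"):
                        for f in objects(tool, "biotools:has_function"):
                            for f_def in objects(f, "oboInOwl:hasDefinition"):
                                rows.append((label, title, f_def))
    return rows

print(query1())
# [('Sample1.final.bam', 'gatk2_print_reads', 'Read summarisation.')]
```

Each nesting level corresponds to one triple pattern of the query: a file is only reported when the full chain down to an EDAM definition can be matched.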

Query 2 shows how a specific provenance pattern can be matched and reshaped to provide a summary of the main processing steps in terms of domain-specific concepts.

The idea consists in identifying all data-derivation links (prov:wasDerivedFrom). From the identified data, the tool executions are then matched, as well as the corresponding software agents. As in the previous query, the last piece of information to be identified is the functionality of the tools. This is done by

CONSTRUCT {
  ?x2 p-plan:wasPreceededBy ?x1 .
  ?x2 prov:wasAssociatedWith ?t2 .
  ?x1 prov:wasAssociatedWith ?t1 .
  ?t1 biotools:has_function ?f1 . ?f1 rdfs:label ?f1_label .
  ?t2 biotools:has_function ?f2 . ?f2 rdfs:label ?f2_label .
}
WHERE {
  ?d2 prov:wasDerivedFrom ?d1 .
  ?d2 prov:wasGeneratedBy ?x2 ;
      rdfs:label ?d2_label .
  ?x2 prov:wasAssociatedWith ?t2 .
  ?d1 prov:wasGeneratedBy ?x1 ;
      rdfs:label ?d1_label .
  ?x1 prov:wasAssociatedWith ?t1 .
  ?t1 biotools:has_function ?f1 . ?f1 rdfs:label ?f1_label .
  ?t2 biotools:has_function ?f2 . ?f2 rdfs:label ?f2_label .
}

Query 2. SPARQL query aimed at assembling an abstract workflow based on what happened (provenance) and how data were processed (domain-specific EDAM annotations).

exploiting the biotools:has_function predicate. Once this graph pattern is matched, a new graph is created (CONSTRUCT query clause) to represent an ordered chain of processing steps (p-plan:wasPreceededBy).
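The core of this CONSTRUCT step, deriving an activity-ordering edge from each data-derivation link, can be sketched in a few lines of plain Python. All identifiers below are illustrative, and the dictionary stands in for the prov:wasGeneratedBy lookups of the real query:

```python
# Sketch of Query 2's CONSTRUCT logic: for each pair
# (d2 prov:wasDerivedFrom d1), emit an ordering edge between the
# activities that generated d2 and d1.
generated_by = {           # data entity -> activity that generated it
    "ex:bam": "ex:align",
    "ex:vcf": "ex:call",
}
derived_from = [("ex:vcf", "ex:bam")]   # (d2, d1) derivation pairs

constructed = [
    (generated_by[d2], "p-plan:wasPreceededBy", generated_by[d1])
    for (d2, d1) in derived_from
    if d2 in generated_by and d1 in generated_by
]
print(constructed)
# [('ex:call', 'p-plan:wasPreceededBy', 'ex:align')]
```

The output triples form exactly the abstract, ordered workflow skeleton that the CONSTRUCT clause materialises; the real query additionally decorates each activity with its tool and EDAM operation label.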


4. Experimental results and Discussion

We experimented with our approach on a production-level exome-sequencing workflow7, designed and operated by the GenoBird genomic and bioinformatics core facility. It implements the motivating scenario introduced in Section 2. We assume that, based on the approach presented beforehand, the workflow has been run, the associated provenance has been captured, and the knowledge graph has been assembled.

The resulting provenance graph consists of an RDF graph with 555 triples, leveraging the PROV-O ontology. The following two tables show the distribution of PROV classes and properties.

Classes                  Number of instances
prov:Entity              40
prov:Activity            26
prov:Bundle              1
prov:Agent               1
prov:Person              1

Properties               Number of predicates
prov:wasDerivedFrom      167
prov:used                100
rdf:type                 69
prov:wasAssociatedWith   65
prov:wasGeneratedBy      39
rdfs:label               39
prov:endedAtTime         26
prov:startedAtTime       26
rdfs:comment             22
prov:wasAttributedTo     1
prov:generatedAtTime     1

Interpreting this provenance graph is challenging from a human perspective, due to the number of nodes and edges, and more importantly due to the lack of domain-specific terms.

4.1. Human-oriented data summaries

Based on Query 1 and a textual template, we show in Figure 3 sentences which have been automatically generated from the knowledge graph. They intend to provide scientists with self-explanatory information on how data were produced, leveraging domain-specific terms, and in which scientific context.

[...]
The file <Samples/Sample1/BAM/Sample1.final.bam> results from tool <gatk2_print_reads-IP> which <Counting and summarising the number of short sequence reads that map to genomic features>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]
The file <VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz> results from tool <gatk2_variant_annotator-IP> which <Predict the effect or function of an individual single nucleotide polymorphism (SNP)>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]

Fig. 3. Sentence-based data summaries providing, for a given file, information on the tool the data originates from, and the definition of what the tool does, based on the EDAM ontology.
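A textual template of this kind is straightforward to fill from a query result row. The sketch below is our own illustration (the field names and exact wording are assumptions, not the paper's code), showing how one row yields one Fig. 3 sentence:

```python
# Hedged sketch: render one (illustrative) Query 1 result row into a
# Fig. 3-style sentence via a Python string template.
TEMPLATE = ("The file <{file}> results from tool <{tool}> which "
            "<{definition}>. It was produced in the context of <{claim}>.")

row = {  # one illustrative result row of Query 1
    "file": "Samples/Sample1/BAM/Sample1.final.bam",
    "tool": "gatk2_print_reads-IP",
    "definition": ("Counting and summarising the number of short sequence "
                   "reads that map to genomic features"),
    "claim": ("Rare Coding Variants in ANGPTL6 Are Associated with "
              "Familial Forms of Intracranial Aneurysm"),
}

sentence = TEMPLATE.format(**row)
print(sentence)
```

One sentence is emitted per (file, tool, definition, claim) binding, so the summary grows linearly with the number of annotated files rather than with the size of the provenance graph.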

Complex data-analysis procedures would require a long text and many logical articulations to be understandable. Visual diagrams provide a compact representation of complex data processing and thus constitute an interesting means of assembling human-oriented data summaries.

Figure 4 shows a summary diagram automatically compiled from the bioinformatics knowledge graph previously described in Section 3. Black arrows represent the logical flow of data processing, and black ellipses represent the nature of the data processing in terms of EDAM operations. The diagram highlights in blue the Sample1.final.bam file: it results from a Read Summarisation step and is followed by a Variant Calling step.

Another example of a summary diagram is provided in Figure 5, which highlights the final VCF file and its binary index. The diagram shows that these files result from a processing step performing an SNP annotation, as defined in the EDAM ontology.

These visualisations provide scientists with means to situate an intermediate result, genomic sequences aligned to a reference genome (BAM file) or genomic variants (VCF file), in the context of a complex data-analysis process. While an expert bioinformatician won't need these summaries, we consider that making them explicit and visualizing them is of major interest to better reuse/repurpose scientific data, or even to provide a first level of explanation in terms of domain-specific concepts.

7 https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline


[Figure 4 here. EDAM operations shown: Filtering, Sequence merging, Variant calling, Indel detection, SNP annotation, Local alignment, Sequence analysis, Genetic variation analysis, Sequence feature detection, Read summarisation, Sequence alignment, Formatting, Generation, Genome indexing, Read mapping; the Samples/Sample1/BAM/Sample1.final.bam file is highlighted.]

Fig. 4. The Sample1.final.bam file results from a Read Summarisation step and is followed by a Variant Calling step.

[Figure 5 here. The VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz file and its .tbi binary index are highlighted among the EDAM operations of the workflow.]

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of the FAIR principles.

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance-capture

[...]
:head {
  _:np1 a np:Nanopublication .
  _:np1 np:hasAssertion :assertion .
  _:np1 np:hasProvenance :provenance .
  _:np1 np:hasPublicationInfo :pubInfo .
}
:assertion {
  <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
      rdfs:seeAlso <http://edamontology.org/operation_3197> .
  <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
      rdfs:seeAlso <http://edamontology.org/operation_3695> .
}
[...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance and publication information.


What             How                                  Results
Traceable        Provenance                           PROV traces
Interpretable    Scientific and technical context     Micropublications vocabulary
Understandable   Domain-specific context ontologies   EDAM terms
For humans       Human-oriented data summaries        Text and diagrams
For machines     Machine-oriented data summaries      NanoPublications

Table 1. Enhancing reuse of processed data with FRESH.

RDF graph         Load time   Text-based   NanoPub   Graph-based
218,906 triples   22.7 s      1.2 s        61 ms     1.5 s

Table 2. Time for producing data summaries.

module⁸. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when locally running a workflow, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature, since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible, without paying for the generally long, CPU-intensive tasks. This extension is under revision for integration into the main development branch of the Snakemake workflow engine.

Knowledge graph assembly. We developed a Python crawler⁹ that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository¹⁰.

Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFlib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.
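The crawler's JSON-to-RDF step can be sketched as follows. The Bio.Tools field names ("biotoolsID", "function", "operation") and the EDAM operation URI are assumptions for illustration, and the real crawler processes the full registry dump rather than an inline sample entry:

```python
import json

# Sketch of the json2rdf idea: turn a Bio.Tools-style JSON entry into RDF
# triples linking a tool to its EDAM operations. Field names are assumed
# from the Bio.Tools schema; the sample entry is hypothetical.
sample_entry = json.loads("""
{
  "biotoolsID": "bwa",
  "function": [
    {"operation": [{"uri": "http://edamontology.org/operation_3198"}]}
  ]
}
""")

def entry_to_ntriples(entry: dict) -> list:
    """Emit one N-Triples line per EDAM operation attached to the tool."""
    tool = f"https://bio.tools/{entry['biotoolsID']}"
    triples = []
    for function in entry.get("function", []):
        for op in function.get("operation", []):
            triples.append(
                f"<{tool}> <http://www.w3.org/2000/01/rdf-schema#seeAlso> "
                f"<{op['uri']}> ."
            )
    return triples

for t in entry_to_ntriples(sample_entry):
    print(t)
```

Emitting plain N-Triples keeps the dump trivially concatenable and loadable by any RDF store, which suits a nightly-built artifact.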

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us to avoid being impacted by the actual computing cost of performing the raw data

8. https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9. https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10. https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost, measured on a 16GB, 2.9GHz Core i5 MacBook Pro desktop computer. We measured 227s to load in memory the full knowledge graph (218,906 triples) covering the workflow claims and its provenance graph, the Bio.Tools RDF dump, and the EDAM ontology. The sentence-based data summaries were obtained in 12s, the machine-oriented NanoPublication was generated in 61ms, and 15s were finally needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

To reproduce the human- and machine-oriented data summaries, this Jupyter Notebook is available through a source code repository¹¹.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set by the FAIR community, as demonstrated in Table 1. In particular, FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate both human-oriented and machine-oriented data summaries.

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as input the annotated genomic variants (in blue in Figure 5) to conduct a rare-variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail to provide information on the nature of the data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (indel) detection step.

11. https://github.com/albangaignard/fresh-toolbox
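The reasoning above can be mimicked with a small traversal over FRESH-style annotations. The file names, step labels, and the derived_from relation below are hypothetical stand-ins for the rdfs:seeAlso statements and PROV lineage carried by the actual nanopublication:

```python
# Illustrative reconstruction of a processing chain from provenance edges.
# 'derived_from' maps each file to its predecessor; 'annotation' carries the
# domain-specific label attached by FRESH. All names are hypothetical.
derived_from = {
    "annotated.vcf": "called.vcf",
    "called.vcf": "indel_realigned.bam",
    "indel_realigned.bam": None,
}
annotation = {
    "annotated.vcf": "SNP annotation",
    "called.vcf": "variant calling",
    "indel_realigned.bam": "indel detection",
}

def processing_chain(target: str) -> list:
    """Walk the provenance backwards and collect the step labels."""
    steps, current = [], target
    while current is not None:
        steps.append(annotation[current])
        current = derived_from[current]
    return steps

print(" <- ".join(processing_chain("annotated.vcf")))
# SNP annotation <- variant calling <- indel detection
```

In practice the same walk would be expressed as a SPARQL property path over the nanopublication rather than an in-memory dictionary.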

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms and tools. As soon as semantic tool catalogs become available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent, similar efforts address the bioimaging community through the setup of the BISE¹² bioimaging search engine (NEUBIAS EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have recently seen the emergence of a number of solutions that assist scientists in the task of packaging the resources necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions of an investigation into three levels: Investigation, for describing the overall goals and means

12. http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and the treatments it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis and other domain aspects. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package the information they have about an experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information recorded by workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists, and can be instrumental, to a certain extent, in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products generated by those experiments. Indeed, the focus in the reports generated by the scientist is put on their scientific findings, documenting the hypothesis and experiment they used and, in certain cases, the datasets obtained as a result of their experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if not reported at all. The availability of such intermediate datasets can be of value to third-party scientists who wish to run their own experiments. This not only saves time for those scientists, in that they can use readily available datasets, but also saves resources, since some intermediate datasets are generated using large-scale, resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the Semantic Web community for capturing provenance, notably the W3C PROV-O recommendation¹³ and its workflow-oriented extensions, e.g.,

13. https://www.w3.org/TR/prov-o


ProvONE¹⁴, OPMW¹⁵, wfprov¹⁶ and P-Plan [30]. The availability of provenance provides the means for the scientist to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as: Is this data helpful for my computational experiment? or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging, since the capture of the related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM to workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the reusability of workflow data products, as opposed to the workflow itself. As data reusability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours, in the sense that it focuses on data (as opposed to the experiment as a whole) and generates human-readable textual narratives from provenance information. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different, in that we do not aim to generate narratives; instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14. vcvcomputing.com/provone/provone.html
15. www.opmw.org
16. http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil, to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare¹⁷ and Dataverse¹⁸, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that they are better suited to repositories that publish knowledge graphs, e.g., the one created by the Whyis project¹⁹. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries, which are generated based on provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives, which we intend to pursue in our future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists to (1) provide tools with high-quality domain-specific

17. https://figshare.com
18. https://dataverse.org
19. http://tetherless-world.github.io/whyis


annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility for providing support and computing resources.

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S.C. Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Comp. Syst. 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic Versus Manual Provenance Abstractions: Mind the Gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8–9, 2016, USENIX, 2016. https://www.usenix.org/conference/tapp16/workshop-program/presentation/alper.

[10] Y. Gil and D. Garijo, Towards Automating Data Narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vareková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.


[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data - The Story So Far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR Metrics and results of the Metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 - Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and Bridging Cross-Workflow Provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing Cross-workflow Provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific Workflow Reuse Through Conceptual Workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data - Design Issues, https://www.w3.org/DesignIssues/LinkedData.html (2019).

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] P. Amstutz, M.R. Crusoe, N. Tijanić, B. Chapman, J. Chilton, M. Heuer, A. Kartashov, J. Kern, D. Leehr, H. Ménager, M. Nedeljkovich, M. Scales, S. Soiland-Reyes and L. Stojanovic, Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in Databases: Why, How, and Where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.




We argue that intermediate datasets generated by the steps of an experiment should be promoted as first-class objects in their own right, to be findable, accessible, and ultimately reusable by the members of the scientific community. We focus in this paper on datasets that are generated by experiments that are specified and enacted using workflows. There have recently been initiatives, notably FAIR [11], which specify the guidelines and criteria that need to be met when sharing data in general. Meeting such criteria remains challenging, however.

In this paper, we show how we can combine provenance metadata with external knowledge associated with workflows and tools to promote processed data sharing and reuse. More specifically, we present FRESH, an approach to associate the intermediate as well as the final datasets generated by workflows with annotations specifying their retrospective provenance and their prospective provenance (i.e., the part of the workflow that was enacted for their generation). Both prospective and retrospective provenance can be overwhelming for a user to understand. Because of this, we associate datasets with a summary of their prospective provenance. Moreover, we annotate the datasets with information about the experiment that they were used in, e.g., hypothesis and contributors, as well as with semantic domain annotations that we automatically harvest from third-party resources, in particular Bio.Tools² [12]. Our ultimate objective is to promote processed data reuse, in order to limit the duplication of the computing and storage efforts associated with workflow re-execution.

The contributions of this paper are the following:

– A definition of workflow data products reuse in the bioinformatics domain.

– A knowledge-graph-based approach aimed at annotating raw processed data with domain-specific concepts, while limiting the burden on domain experts at the time of sharing their data.

– An experiment based on a real-life bioinformatics workflow, which can be reproduced through an interactive notebook.

This paper is organised as follows. Section 2 presents the motivation and defines the problem statement. Section 3 details the proposed FRESH approach. Section 4 presents our experimental results. Section 5 summarises related work. Finally, conclusions and future work are outlined in Section 6.

2. Motivations and Problem Statement

We motivate our proposal through an exome-sequencing bioinformatics workflow. This workflow aims at (1) aligning sample exome data (the protein-coding regions of genes) to a reference genome, and (2) identifying genetic mutations for each of the biological samples. Figure 1 drafts a summary of the bioinformatics analysis tasks required to identify and annotate genetic variants from exome-sequencing data. For a matter of clarity, we hide in this scenario some of the minor processing steps, such as the sorting of DNA bases, but they are still required in practice. The real workflow is described in detail in the experimental results section.

This workflow consumes as inputs two sequenced biological samples, sample_1 and sample_2. For each sample, sequencers produce multiple files that need to be merged later on (Sequence merging step). The first processing step consists in aligning [13] (Mapping to reference genome) the short sequence reads to the human reference genome (GRCh37). Then, for each sample, data are merged and post-processed [14, 15], resulting in binary (BAM) files representing the aligned sequences with their

2. https://bio.tools


Fig. 1. A typical bioinformatics workflow aimed at identifying and annotating genomic variations from a reference genome. Green waved boxes represent data files and blue rounded boxes represent processing steps.

quality metrics. Finally, from these aligned sequences, the genetic variants are identified [16] and enriched with annotations [17] gathered from public knowledge bases such as dbSNP [18] or gnomAD³. This last processing step results in a VCF file listing, for all processed sample sequences, all known genomic variations compared to the GRCh37 reference genome.

Performing these analyses in real-life conditions is computation-intensive: they require a lot of CPU time and storage capacity. As an example, similar workflows are run in production at the CNRGH French national sequencing facility. For a typical exome-sequencing sample (97GB compressed), it has been measured that 186GB was necessary to store the input and output compressed data. In addition, 2 hours and 27 minutes were necessary to produce an annotated VCF variant file, taking advantage of parallelism in a dedicated high-performance computing infrastructure (7 nodes with 28 Intel Broadwell CPU cores each), which corresponds to 158 cumulative hours for a single sample, i.e., 6 days of computation on a single CPU.
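As a quick sanity check, the cost figures quoted above can be recomputed directly (all values are taken from the text; the speed-up ratio is derived, not reported by the paper):

```python
# Cross-check of the reported computational-cost figures.
wall_clock_hours = 2 + 27 / 60          # 2 h 27 min on the HPC cluster
cumulative_cpu_hours = 158              # summed over all cores used
nodes, cores_per_node = 7, 28           # cluster configuration

print(f"{cumulative_cpu_hours / 24:.1f} days on a single CPU")
print(f"speed-up: {cumulative_cpu_hours / wall_clock_hours:.0f}x "
      f"on up to {nodes * cores_per_node} cores")
```

The 158 cumulative hours indeed amount to roughly six and a half days of single-CPU time, consistent with the "6 days" figure above.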

Considering the computational cost of these analyses, we claim that the secondary use of data is critical to speed up research addressing similar or related topics. In this workflow, all processing steps produce data, but they do not provide the same level of reusability. We tagged reusable data with a white star in Figure 1. More precisely, the reference genome (GRCh37) is by nature highly reusable, since it is a reference "atlas" for genomic human sequences and results from state-of-the-art scientific knowledge at a given time. BAM files can also be considered as more reusable than the raw input data, since they have been aligned to this atlas and thus benefit from consensual knowledge on this genome. As an example, they provide the relationship between sequences and known genes, they can be visualized in a genome viewer, and they can be used to re-generate raw unmapped sequences.

3 https://gnomad.broadinstitute.org

From the scientist's perspective, answering questions such as "can or should I reuse these files in the context of my research study?" is still challenging. To reuse the final VCF variant file, it is of major importance to know the version of the reference genome, as well as to clearly understand the scientific context of the study: the phenotypes associated with the samples, as well as the possible relations between samples. Finally, having precise information on the variant calling algorithm is also critical, due to application-specific detection thresholds [19]. More generally, not only is fine-grained provenance information regarding data and tool lineage required, but also domain-specific annotations based on community-agreed vocabularies (Issue 1). These vocabularies exist, but annotating processed data with domain-specific concepts requires a lot of time and expertise (Issue 2).

In this work, we show how we can improve the findability and reusability of workflow (intermediate) data by leveraging (1) community efforts aimed at semantically cataloging bioinformatics processing tools, to reduce the solicitation of domain experts, and (2) the automation and provenance capabilities of workflow management systems, to automate the annotation of processed data towards more reusable workflow results.

3. FRESH Approach

FRESH is an approach to improve the Findability and the Reusability of genomic workflow data. FAIR [11, 20] and Linked Data [21, 22] principles constitute the conceptual and technological backbone in this direction.

FRESH partially tackles FAIR requirements for better findability and reusability. We address findability by relying on Linked Data best practices, namely associating a URI with each dataset, and linking these datasets in the form of RDF knowledge graphs with controlled vocabularies for the naming of concepts and relations.

Being tightly coupled to scientific context, reusability is more challenging to achieve. Guidelines have been proposed for FAIR sharing of genomic data [23]; however, proposing and evaluating reusability is still challenging and a work in progress [24]. In this work, we consider data reusable when it is annotated with sufficiently complete information to allow, without the need for external resources, traceability, interpretability, understandability, and usage by humans or machines.

To be traceable, provenance traces are mandatory for tracking the data generation process.

To be interpretable, contextual data [11] are mandatory. This includes i) the scientific context, in terms of claims, research lab, experimental conditions, and previous evidence (academic papers); and ii) the technical context, in terms of material and methods: data sources used, software (algorithms, queries) and hardware.

To be understandable by itself, data must be annotated with domain-specific vocabularies. For instance, to capture knowledge associated with the data processing steps, we can rely on EDAM⁴, which is actively developed and used in the context of the Bio.Tools registry, and which organizes common terms used in the field of bioinformatics. However, these annotations on processing tools do not capture the scientific context in which a workflow takes place. To address this issue, we rely on the Micropublications [25] ontology, which

4 http://edamontology.org

has been proposed to formally represent scientific approaches, hypotheses, claims and pieces of evidence, in the direction of machine-tractable academic papers.

Figure 2 illustrates our approach to provide more reusable data. The first step consists in capturing provenance for all workflow runs. PROV⁵ is the de facto standard for describing and exchanging provenance graphs. Although capturing provenance can be easily managed in workflow engines, there is no systematic way to link a PROV Activity (the actual execution of a tool) to the relevant software Agent (i.e., the software responsible for the actual data processing). To address this issue, we propose to provide, at workflow design time, the tool's identifier in the tool catalog. This allows generating a provenance trace which associates (prov:wasAssociatedWith) each execution, and thus each consumed and produced data item, with the software identifier.
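The execution-to-tool linking described above can be sketched in a few lines of Python. All identifiers below are hypothetical, and triples are modelled as plain tuples rather than a full RDF store:

```python
# Sketch of the provenance-linking step: tie one tool execution (PROV
# Activity) to its tool-catalog identifier and to its outputs. Identifiers
# are hypothetical; triples are plain (s, p, o) tuples for illustration.
PROV = "http://www.w3.org/ns/prov#"

def link_execution(run_iri, tool_iri, outputs):
    """Return (s, p, o) triples binding a run, its software id and outputs."""
    triples = [(run_iri, PROV + "wasAssociatedWith", tool_iri)]
    for out in outputs:
        triples.append((out, PROV + "wasGeneratedBy", run_iri))
    return triples

triples = link_execution(
    "urn:run:mapping-42",        # one workflow step execution (hypothetical)
    "https://bio.tools/bwa",     # tool identifier given at workflow design time
    ["urn:data:sample_1.bam"],   # data produced by this execution
)
for t in triples:
    print(t)
```

Each produced data item thus reaches, through its generating execution, the catalog entry of the software that produced it.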

Then, we assemble a bioinformatics knowledge graph which links together (1) the tool annotations gathered from the Bio.Tools⁶ registry, providing information on what the tools do (bioinformatics EDAM operations) and which kinds of data they consume and produce; (2) the complete EDAM ontology, to gather for instance the community-agreed definitions and synonyms for bioinformatics concepts; (3) the PROV graph resulting from a workflow execution, which provides fine-grained technical and domain-agnostic provenance metadata; and (4) the experimental context, using Micropublications for the scientific claims and hypotheses associated with the experiment.

Finally, based on domain-specific provenance queries, the last step consists in extracting few but meaningful data from the knowledge graph, to provide scientists with more reusable intermediate or final results, and to provide machines with findable and queryable data stories.

In the remainder of this section, we rely on the SPARQL query language to interact with the knowledge graph, in terms of knowledge extraction and knowledge enrichment.

Query 1 aims at extracting and linking together data artefacts with the definition of the bioinformatics process they result from.

In this SPARQL query, we first identify data (prov:Entity), the tool execution they result from (prov:wasGeneratedBy), and the used software (prov:wasAssociatedWith). Then, we retrieve from the Bio.Tools

5 https://www.w3.org/TR/prov-o/
6 https://bio.tools

[Figure 2 omitted: overview of FRESH. A workflow specification is enacted on raw data; provenance is captured and linked; the resulting PROV-O graph, the Bio.Tools graph, the EDAM ontology and the experimental context are assembled into a knowledge graph, from which provenance mining queries produce human-oriented and machine-oriented data summaries.]

Fig. 2. Knowledge graph based on workflow provenance and tool annotations to automate the production of human- and machine-oriented data summaries.

SELECT ?d_label ?title ?f_def ?st WHERE {
    ?d rdf:type prov:Entity ;
       prov:wasGeneratedBy ?x ;
       rdfs:label ?d_label .
    ?x prov:wasAssociatedWith ?tool .
    ?tool dc:title ?title ;
          biotools:has_function ?f .
    ?f rdfs:label ?f_label ;
       oboInOwl:hasDefinition ?f_def .
    ?c rdf:type mp:Claim ;
       mp:statement ?st .
}

Query 1. SPARQL query aimed at linking processed data to the processing tool and the definition of what is done on the data.

sub-graph the EDAM annotation which specifies the function of the tool (biotools:has_function). The definition of the function of the tool is retrieved from the EDAM ontology (oboInOwl:hasDefinition). Finally, we retrieve the scientific context of the experiment by matching statements expressed in natural language (mp:Claim, mp:statement).

Query 2 shows how a specific provenance pattern can be matched and reshaped to provide a summary of the main processing steps in terms of domain-specific concepts.

The idea consists in identifying all data derivation links (prov:wasDerivedFrom). From the identified data, the tool executions are then matched, as well as the corresponding software agents. As in the previous query, the last piece of information to be identified is the functionality of the tools. This is done by

CONSTRUCT {
    ?x2 p-plan:wasPreceededBy ?x1 .
    ?x2 prov:wasAssociatedWith ?t2 .
    ?x1 prov:wasAssociatedWith ?t1 .
    ?t1 biotools:has_function ?f1 .
    ?f1 rdfs:label ?f1_label .
    ?t2 biotools:has_function ?f2 .
    ?f2 rdfs:label ?f2_label .
} WHERE {
    ?d2 prov:wasDerivedFrom ?d1 .
    ?d2 prov:wasGeneratedBy ?x2 ;
        rdfs:label ?d2_label .
    ?x2 prov:wasAssociatedWith ?t2 .
    ?d1 prov:wasGeneratedBy ?x1 ;
        rdfs:label ?d1_label .
    ?x1 prov:wasAssociatedWith ?t1 .
    ?t1 biotools:has_function ?f1 .
    ?f1 rdfs:label ?f1_label .
    ?t2 biotools:has_function ?f2 .
    ?f2 rdfs:label ?f2_label .
}

Query 2. SPARQL query aimed at assembling an abstract workflow based on what happened (provenance) and how data were processed (domain-specific EDAM annotations).

exploiting the biotools:has_function predicate. Once this graph pattern is matched, a new graph is created (CONSTRUCT query clause) to represent an ordered chain of processing steps (p-plan:wasPreceededBy).

4. Experimental Results and Discussion

We experimented with our approach on a production-level exome-sequencing workflow⁷, designed and operated by the GenoBird genomic and bioinformatics core facility. It implements the motivating scenario we introduced in Section 2. We assume that, following the approach presented beforehand, the workflow has been run, the associated provenance has been captured, and the knowledge graph has been assembled.

The resulting provenance graph consists in an RDF graph with 555 triples leveraging the PROV-O ontology. The following two tables show the distribution of PROV classes and properties:

Classes                   Number of instances
prov:Entity               40
prov:Activity             26
prov:Bundle               1
prov:Agent                1
prov:Person               1

Properties                Number of predicates
prov:wasDerivedFrom       167
prov:used                 100
rdf:type                  69
prov:wasAssociatedWith    65
prov:wasGeneratedBy       39
rdfs:label                39
prov:endedAtTime          26
prov:startedAtTime        26
rdfs:comment              22
prov:wasAttributedTo      1
prov:generatedAtTime      1
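Distributions like the two tables above can be computed with a simple counting pass over the graph; the sketch below assumes triples modelled as plain (subject, predicate, object) tuples with illustrative values:

```python
# Sketch: computing class/property distributions of a provenance graph,
# modelled here as a small list of (s, p, o) tuples (illustrative data).
from collections import Counter

triples = [
    ("ex:bam",  "rdf:type",            "prov:Entity"),
    ("ex:vcf",  "rdf:type",            "prov:Entity"),
    ("ex:run1", "rdf:type",            "prov:Activity"),
    ("ex:vcf",  "prov:wasDerivedFrom", "ex:bam"),
    ("ex:vcf",  "prov:wasGeneratedBy", "ex:run1"),
]

predicates = Counter(p for _, p, _ in triples)                  # property usage
classes = Counter(o for _, p, o in triples if p == "rdf:type")  # class instances

for cls, n in classes.most_common():
    print(f"{cls}: {n}")
for pred, n in predicates.most_common():
    print(f"{pred}: {n}")
```

On the real 555-triple graph, the same pass yields the distributions reported above.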

Interpreting this provenance graph is challenging from a human perspective, due to the number of nodes and edges, and more importantly due to the lack of domain-specific terms.

4.1. Human-oriented data summaries

Based on Query 1 and a textual template, we show in Figure 3 sentences which have been automatically generated from the knowledge graph. They intend to provide scientists with self-explanatory information on how data were produced, leveraging domain-specific terms, and in which scientific context.

[...]
The file <Samples/Sample1/BAM/Sample1.final.bam> results from tool <gatk2_print_reads-IP> which <Counting and summarising the number of short sequence reads that map to genomic features>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]
The file <VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz> results from tool <gatk2_variant_annotator-IP> which <Predict the effect or function of an individual single nucleotide polymorphism (SNP)>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]

Fig. 3. Sentence-based data summaries providing, for a given file, information on the tool the data originates from and the definition of what the tool does, based on the EDAM ontology.
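As described in the implementation section, Python string templates can turn one Query 1 binding into a Figure-3-style sentence; the template name is ours, and the binding values are copied from the figure for illustration:

```python
# Sketch: rendering one SPARQL result row into a Figure-3-style sentence
# with a Python string template (template and row names are illustrative).
from string import Template

SENTENCE = Template(
    "The file <$file> results from tool <$tool> which <$definition>. "
    "It was produced in the context of <$claim>."
)

row = {  # one binding of Query 1, values taken from Figure 3
    "file": "Sample1.final.bam",
    "tool": "gatk2_print_reads-IP",
    "definition": "Counting and summarising the number of short sequence "
                  "reads that map to genomic features",
    "claim": "Rare Coding Variants in ANGPTL6 Are Associated with Familial "
             "Forms of Intracranial Aneurysm",
}

print(SENTENCE.substitute(row))
```

Iterating this substitution over all Query 1 bindings produces the full set of sentences of Figure 3.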

Complex data analysis procedures would require a long text and many logical articulations to be understandable. Visual diagrams provide a compact representation of complex data processing, and thus constitute an interesting means to assemble human-oriented data summaries.

Figure 4 shows a summary diagram automatically compiled from the bioinformatics knowledge graph previously described in Section 3. Black arrows represent the logical flow of data processing, and black ellipses represent the nature of data processing in terms of EDAM operations. The diagram highlights in blue the Sample1.final.bam file. It shows that this file results from a Read summarisation step and is followed by a Variant calling step.

Another example of a summary diagram is provided in Figure 5, which highlights the final VCF file and its binary index. The diagram shows that these files result from a processing step performing an SNP annotation, as defined in the EDAM ontology.

These visualisations provide scientists with means to situate an intermediate result, genomic sequences aligned to a reference genome (BAM file) or genomic variants (VCF file), in the context of a complex data analysis process. While an expert bioinformatician will not need these summaries, we consider that making them explicit and visualizing them is of major interest to better reuse/repurpose scientific data, or even to provide a first level of explanation in terms of domain-specific concepts.

7 https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline

[Figure 4 omitted: summary diagram of EDAM operations (genome indexing, read mapping, local alignment, sequence merging, filtering, read summarisation, variant calling, indel detection, SNP annotation, formatting, generation), highlighting the Samples/Sample1/BAM/Sample1.final.bam file.]

Fig. 4. The Sample1.final.bam file results from a Read summarisation step and is followed by a Variant calling step.

[Figure 5 omitted: summary diagram highlighting the final VCF file (VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz) and its binary index (.tbi), produced by an SNP annotation step among the chain of EDAM operations.]

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of the FAIR principles.

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance capture

[...]
:head {
    _:np1 a np:Nanopublication .
    _:np1 np:hasAssertion :assertion .
    _:np1 np:hasProvenance :provenance .
    _:np1 np:hasPublicationInfo :pubInfo .
}
:assertion {
    <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
        rdfs:seeAlso <http://edamontology.org/operation_3197> .
    <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
        rdfs:seeAlso <http://edamontology.org/operation_3695> .
}
[...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance and publication information.
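A NanoPublication extract like Figure 6 can likewise be assembled from a Python string template, as the implementation section describes; the template, graph names and example artefact IRI below are illustrative:

```python
# Sketch: assembling a minimal NanoPublication (TriG) from a string template.
# Template and substituted IRIs are illustrative, not the actual pipeline's.
from string import Template

NANOPUB = Template("""\
@prefix np:   <http://www.nanopub.org/nschema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/np1#> .

:head {
    :np1 a np:Nanopublication ;
         np:hasAssertion :assertion .
}
:assertion {
    <$data> rdfs:seeAlso <$edam_op> .
}
""")

trig = NANOPUB.substitute(
    data="http://example.org/Sample1.merged.bai",      # hypothetical artefact
    edam_op="http://edamontology.org/operation_3197",  # EDAM operation IRI (as in Fig. 6)
)
print(trig)
```

The resulting TriG string can then be published and indexed like any other Linked Data resource.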

What             How                                Results
Traceable        Provenance                         PROV traces
Interpretable    Scientific and technical context   Micropublications vocabulary
Understandable   Domain-specific context            Ontologies, EDAM terms
For humans       Human-oriented data summaries      Text and diagrams
For machines     Machine-oriented data summaries    NanoPublications

Table 1. Enhancing reuse of processed data with FRESH

RDF graph          Load time   Text-based   NanoPub   Graph-based
218 906 triples    227s        12s          61ms      15s

Table 2. Time for producing data summaries

module⁸. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when locally running a workflow, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature, since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible, without paying for the generally long, CPU-intensive tasks. This extension is under revision for integration into the main development branch of the SnakeMake workflow engine.

Knowledge graph assembly. We developed a Python crawler⁹ that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump, focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository¹⁰.

Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFlib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us not to be impacted by the actual computing cost of performing raw data

8 https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9 https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10 https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost, using a 16GB 2.9GHz Core i5 MacBook Pro desktop computer. We measured 227s to load in memory the full knowledge graph (218 906 triples) covering the workflow claims and its provenance graph, the Bio.Tools RDF dump, and the EDAM ontology. The sentence-based data summaries were obtained in 12s, the machine-oriented NanoPublication was generated in 61ms, and finally 15s were needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.
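The per-stage timings of Table 2 can be gathered with a minimal harness around each production stage; the stage bodies below are stubbed placeholders, and only the time.perf_counter pattern is the point:

```python
# Minimal timing harness in the spirit of Table 2: wrap each summary-production
# stage in time.perf_counter. Stage bodies are stubs, not the real code.
import time

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result

# Stand-ins for: RDF graph loading, Query 1 + template, NanoPub assembly.
graph = timed("graph loading", lambda: list(range(1000)))
text = timed("text summary", lambda: "sentence-based summary")
nanopub = timed("nanopub assembly", lambda: "@prefix np: <...> .")
```

Replacing the stubs with the actual RDFlib loading and query calls yields the figures reported in Table 2.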

To reproduce the human- and machine-oriented data summaries, the Jupyter Notebook is available through a source code repository¹¹.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with a particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which lists the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the Reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented or machine-oriented data summaries.

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as input the annotated genomic variants (in blue in Figure 5) to conduct a rare variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail to provide information on the nature of the data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (Indel) detection step.

11 https://github.com/albangaignard/fresh-toolbox

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms/tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent similar efforts address the bioimaging community through the setup of the BISE¹² bioimaging search engine (Neubias EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have recently seen the emergence of a number of solutions that assist scientists in the tasks of packaging the resources that are necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize the experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions of an investigation into three levels: Investigation, for describing the overall goals and means

12 http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and the treatments that it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies that can be used for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis and other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package information they have about the experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information provided by workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists, and can be instrumental, to a certain extent, in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products generated by experiments. Indeed, the focus in the reports generated by scientists is put on their scientific findings, documenting the hypothesis and experiment they used and, in certain cases, the datasets obtained as a result of their experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if not reported at all. The availability of such intermediate datasets can be of value to third-party scientists to run their own experiments. This does not only save time for those scientists, in that they can use readily available datasets, but also saves time and resources, since some intermediate datasets are generated using large-scale resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the semantic web community for capturing provenance, notably the W3C PROV-O recommendation¹³ and its workflow-oriented extensions, e.g.,

13 https://www.w3.org/TR/prov-o/

ProvONE¹⁴, OPMW¹⁵, wfprov¹⁶ and P-Plan [30]. The availability of provenance provides the means for scientists to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as "Is this data helpful for my computational experiment?" or, if potentially useful, "does this data have enough quality?". Answering such questions is particularly challenging, since the capture of the related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8] we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the re-usability of workflow data products, as opposed to the workflow itself. As data re-usability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours, in the sense that it focuses on data (as opposed to the experiment as a whole) and generates human-readable textual narratives from provenance information. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different from that of Garijo and Gil in that we do not aim to generate narratives. Instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14 vcvcomputing.com/provone/provone.html
15 www.opmw.org
16 http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil, to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare¹⁷ and Dataverse¹⁸, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that the summaries we produce are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project¹⁹. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries generated from provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives which we intend to pursue in our future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists to (1) provide tools with high-quality domain-specific

17 https://figshare.com
18 https://dataverse.org
19 http://tetherless-world.github.io/whyis

annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility for providing support and computing resources.

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic versus manual provenance abstractions: Mind the gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8-9, 2016, USENIX, 2016.

[10] Y. Gil and D. Garijo, Towards automating data narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura et al., Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.


[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data - The story so far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR metrics and results of the metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with plans in P-PLAN: Scientific processes as linked data, in: Proceedings of the Second International Workshop on Linked Science 2012 - Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and bridging cross-workflow provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing cross-workflow provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific workflow reuse through conceptual workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data - Design issues, retrieved January 29, 2019. https://www.w3.org/DesignIssues/LinkedData.html.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] P. Amstutz, M.R. Crusoe, N. Tijanić et al., Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in databases: Why, how, and where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.



[Figure 1: diagram. Paired-end read files (sample_1a.R1/R2, sample_1b.R1/R2, sample_2b.R1/R2) each feed a "Mapping to reference genome" step (against GRCh37), followed by "Sequence merging" into per-sample BAM files, then "Variant calling" and "Annotation from public knowledge bases", producing a VCF file.]

Fig. 1. A typical bioinformatics workflow aimed at identifying and annotating genomic variations from a reference genome. Green waved boxes represent data files and blue rounded boxes represent processing steps.

quality metrics. Finally, from these aligned sequences, the genetic variants are identified [16] and enriched with annotations [17] gathered from public knowledge bases such as dbSNP [18] or gnomAD3. This last processing step results in a VCF file listing, for all processed sample sequences, all known genomic variations compared to the GRCh37 reference genome.

Performing these analyses in real-life conditions is computation-intensive: they require a lot of CPU time and storage capacity. As an example, similar workflows are run in production at the CNRGH French national sequencing facility. For a typical exome-sequencing sample (9.7GB compressed), it has been measured that 186GB was necessary to store the input and output compressed data. In addition, 2 hours and 27 minutes were necessary to produce an annotated VCF variant file, taking advantage of parallelism in a dedicated high-performance computing infrastructure (7 nodes with 28 Intel Broadwell CPU cores each), which corresponds to 158 cumulative hours for a single sample, i.e., more than 6 days of computation on a single CPU.

Considering the computational cost of these analyses, we claim that the secondary use of data is critical to speed up research addressing similar or related topics. In this workflow, all processing steps produce data, but they do not provide the same level of reusability. We tagged reusable data with a white star in Figure 1. More precisely, the reference genome (GRCh37) is by nature highly reusable, since it is a reference "atlas" for genomic human sequences and results from state-of-the-art scientific knowledge at a given time. Then, BAM files can also be considered more reusable than the raw input data, since they have been aligned to this atlas and thus benefit from consensual knowledge on this genome. As an example, they provide the relationship between sequences and known genes; they can be visualized in a genome viewer; they can also be used to re-generate raw unmapped sequences.

3 https://gnomad.broadinstitute.org

From the scientist's perspective, answering questions such as "can or should I reuse these files in the context of my research study?" is still challenging. To reuse the final VCF variant file, it is of major importance to know the version of the reference genome, as well as to clearly understand the scientific context of the study: the phenotypes associated to the samples, as well as the possible relations between samples. Finally, having precise information on the variant calling algorithm is also critical, due to application-specific detection thresholds [19]. More generally, not only is fine-grained provenance information regarding data and tool lineage required, but also domain-specific annotations based on community-agreed vocabularies (Issue 1). These vocabularies exist, but annotating processed data with domain-specific concepts requires a lot of time and expertise (Issue 2).

In this work, we show how we can improve the findability and reusability of workflow (intermediate) data by leveraging (1) community efforts aimed at semantically cataloging bioinformatics processing tools, to reduce the solicitation of domain experts, and (2) the automation and provenance capabilities of workflow management systems, to automate the annotation of processed data towards more reusable workflow results.

3. FRESH Approach

FRESH is an approach to improve the Findability and the Reusability of genomic workflow data. FAIR [11, 20] and Linked Data [21, 22] principles constitute the conceptual and technological backbones in this direction.

FRESH partially tackles FAIR requirements for better findability and reusability. We address findability by relying on Linked Data best practices, namely associating a URI with each dataset and linking these datasets in the form of RDF knowledge graphs, with controlled vocabularies for the naming of concepts and relations.

Being tightly coupled to scientific context, reusability is more challenging to achieve. Guidelines have been proposed for FAIR sharing of genomic data [23]; however, proposing and evaluating reusability is still challenging and a work in progress [24]. In this work, we consider as reusable data that is annotated with sufficiently complete information to allow, without the need for external resources, traceability, interpretability, understandability, and usage by humans or machines.

To be traceable, provenance traces are mandatory for tracking the data generation process.

To be interpretable, contextual data [11] are mandatory; this includes i) the scientific context in terms of claims, research lab, experimental conditions, and previous evidence (academic papers); ii) the technical context in terms of material and methods: data sources, used software (algorithms, queries) and hardware.

To be understandable by itself, data must be annotated with domain-specific vocabularies. For instance, to capture knowledge associated with the data processing steps, we can rely on EDAM4, which is actively developed and used in the context of the Bio.Tools registry, and which organizes common terms used in the field of bioinformatics. However, these annotations on processing tools do not capture the scientific context in which a workflow takes place. To address this issue, we rely on the Micropublications [25] ontology, which has been proposed to formally represent scientific approaches, hypotheses, claims or pieces of evidence, in the direction of machine-tractable academic papers.

4 http://edamontology.org

Figure 2 illustrates our approach to provide more reusable data. The first step consists in capturing provenance for all workflow runs. PROV5 is the de facto standard for describing and exchanging provenance graphs. Although capturing provenance can be easily managed in workflow engines, there is no systematic way to link a PROV Activity (the actual execution of a tool) to the relevant software Agent (i.e., the software responsible for the actual data processing). To address this issue, we propose to provide, at workflow design time, the tool's identifier in the tool catalog. This allows the generation of a provenance trace which associates (prov:wasAssociatedWith) each execution, and thus each consumed and produced data item, with the software identifier.

Then, we assemble a bioinformatics knowledge graph which links together (1) the tool annotations gathered from the Bio.Tools6 registry, providing information on what the tools do (bioinformatics EDAM operations) and which kinds of data they consume and produce; (2) the complete EDAM ontology, to gather for instance the community-agreed definitions and synonyms for bioinformatics concepts; (3) the PROV graph resulting from a workflow execution, which provides fine-grained, technical and domain-agnostic provenance metadata; and (4) the experimental context, using Micropublications for the scientific claims and hypotheses associated with the experiment.

Finally, based on domain-specific provenance queries, the last step consists in extracting a small amount of meaningful data from the knowledge graph, to provide scientists with more reusable intermediate or final results, and to provide machines with findable and queryable data stories.

In the remainder of this section, we rely on the SPARQL query language to interact with the knowledge graph, in terms of knowledge extraction and knowledge enrichment.

Query 1 aims at extracting and linking together data artefacts with the definition of the bioinformatics process they result from.

In this SPARQL query, we first identify data (prov:Entity), the tool execution they result from (prov:wasGeneratedBy) and the used software (prov:wasAssociatedWith). Then, we retrieve from the Bio.Tools

5 https://www.w3.org/TR/prov-o/
6 https://bio.tools


[Figure 2: diagram. A workflow specification is enacted on raw data within an experimental context; provenance is captured and linked, then combined with the Bio.Tools graph and the EDAM ontology into a knowledge graph, over which provenance mining queries produce human-oriented and machine-oriented data summaries.]

Fig. 2. Knowledge graph based on workflow provenance and tool annotations to automate the production of human- and machine-oriented data summaries.

SELECT ?d_label ?title ?f_def ?st WHERE {
  ?d rdf:type prov:Entity ;
     prov:wasGeneratedBy ?x ;
     rdfs:label ?d_label .
  ?x prov:wasAssociatedWith ?tool .
  ?tool dc:title ?title ;
        biotools:has_function ?f .
  ?f rdfs:label ?f_label ;
     oboInOwl:hasDefinition ?f_def .
  ?c rdf:type mp:Claim ;
     mp:statement ?st .
}

Query 1. SPARQL query aimed at linking processed data to the processing tool and the definition of what is done on data.

sub-graph the EDAM annotation which specifies the function of the tool (biotools:has_function). The definition of the function of the tool is retrieved from the EDAM ontology (oboInOwl:hasDefinition). Finally, we retrieve the scientific context of the experiment by matching statements expressed in natural language (mp:Claim, mp:statement).

Query 2 shows how a specific provenance pattern can be matched and reshaped to provide a summary of the main processing steps in terms of domain-specific concepts.

The idea consists in identifying all data derivation links (prov:wasDerivedFrom). From the identified data, the tool executions are then matched, as well as the corresponding software agents. As in the previous query, the last piece of information to be identified is the functionality of the tools. This is done by

CONSTRUCT {
  ?x2 p-plan:wasPreceededBy ?x1 .
  ?x2 prov:wasAssociatedWith ?t2 .
  ?x1 prov:wasAssociatedWith ?t1 .
  ?t1 biotools:has_function ?f1 .
  ?f1 rdfs:label ?f1_label .
  ?t2 biotools:has_function ?f2 .
  ?f2 rdfs:label ?f2_label .
}
WHERE {
  ?d2 prov:wasDerivedFrom ?d1 .
  ?d2 prov:wasGeneratedBy ?x2 ;
      rdfs:label ?d2_label .
  ?x2 prov:wasAssociatedWith ?t2 .
  ?d1 prov:wasGeneratedBy ?x1 ;
      rdfs:label ?d1_label .
  ?x1 prov:wasAssociatedWith ?t1 .
  ?t1 biotools:has_function ?f1 .
  ?f1 rdfs:label ?f1_label .
  ?t2 biotools:has_function ?f2 .
  ?f2 rdfs:label ?f2_label .
}

Query 2. SPARQL query aimed at assembling an abstract workflow based on what happened (provenance) and how data were processed (domain-specific EDAM annotations).

exploiting the biotools:has_function predicate. Once this graph pattern is matched, a new graph is created (CONSTRUCT query clause) to represent an ordered chain of processing steps (p-plan:wasPreceededBy).


4. Experimental results and Discussion

We experimented with our approach on a production-level exome-sequencing workflow7 designed and operated by the GenoBird genomic and bioinformatics core facility. It implements the motivating scenario we introduced in Section 2. We assume that, based on the approach presented beforehand, the workflow has been run, the associated provenance has been captured, and the knowledge graph has been assembled.

The resulting provenance graph consists in an RDF graph with 555 triples leveraging the PROV-O ontology. The following two tables show the distribution of PROV classes and properties.

Classes                  Number of instances
prov:Entity              40
prov:Activity            26
prov:Bundle              1
prov:Agent               1
prov:Person              1

Properties               Number of predicates
prov:wasDerivedFrom      167
prov:used                100
rdf:type                 69
prov:wasAssociatedWith   65
prov:wasGeneratedBy      39
rdfs:label               39
prov:endedAtTime         26
prov:startedAtTime       26
rdfs:comment             22
prov:wasAttributedTo     1
prov:generatedAtTime     1

Interpreting this provenance graph is challenging from a human perspective, due to the number of nodes and edges and, more importantly, due to the lack of domain-specific terms.

4.1. Human-oriented data summaries

Based on Query 1 and a textual template, we show in Figure 3 sentences which have been automatically generated from the knowledge graph. They intend to provide scientists with self-explanatory information on how data were produced, leveraging domain-specific terms, and in which scientific context.

[...]
The file <Samples/Sample1/BAM/Sample1.final.bam> results from tool <gatk2_print_reads-IP>, which <Counting and summarising the number of short sequence reads that map to genomic features>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]
The file <VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz> results from tool <gatk2_variant_annotator-IP>, which <Predict the effect or function of an individual single nucleotide polymorphism (SNP)>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]

Fig. 3. Sentence-based data summaries providing, for a given file, information on the tool the data originates from and the definition of what the tool does, based on the EDAM ontology.
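The implementation section mentions that plain Python string templates are used for this step. A sketch of how one Fig. 3 sentence can be filled from a query-result row (the field names and example values here are illustrative, taken from the figure):

```python
# Sketch: generating a Fig. 3-style sentence with string.Template.
from string import Template

tmpl = Template(
    "The file <$file> results from tool <$tool>, which <$definition>. "
    "It was produced in the context of <$claim>."
)

# one (illustrative) row, as would be returned by Query 1
row = {
    "file": "Samples/Sample1/BAM/Sample1.final.bam",
    "tool": "gatk2_print_reads-IP",
    "definition": ("Counting and summarising the number of short "
                   "sequence reads that map to genomic features"),
    "claim": ("Rare Coding Variants in ANGPTL6 Are Associated with "
              "Familial Forms of Intracranial Aneurysm"),
}
sentence = tmpl.substitute(row)
print(sentence)
```

In practice one such sentence would be emitted per row of Query 1's result set.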

Complex data analysis procedures would require a long text and many logical articulations to be understandable. Visual diagrams provide a compact representation of complex data processing and thus constitute an interesting means of assembling human-oriented data summaries.

Figure 4 shows a summary diagram automatically compiled from the bioinformatics knowledge graph previously described in Section 3. Black arrows represent the logical flow of data processing and black ellipses represent the nature of data processing in terms of EDAM operations. The diagram highlights in blue the Sample1.final.bam file. It shows that this file results from a Read summarisation step and is followed by a Variant calling step.

Another example of a summary diagram is provided in Figure 5, which highlights the final VCF file and its binary index. The diagram shows that these files result from a processing step performing an SNP annotation, as defined in the EDAM ontology.

These visualisations provide scientists with means to situate an intermediate result, genomic sequences aligned to a reference genome (BAM file) or genomic variants (VCF file), in the context of a complex data analysis process. While an expert bioinformatician won't need these summaries, we consider that making them explicit and visualizing them is of major interest to better reuse/repurpose scientific data, or even to provide a first level of explanation in terms of domain-specific concepts.

7 https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline


[Figure 4: summary diagram highlighting Samples/Sample1/BAM/Sample1.final.bam, with EDAM operation ellipses such as Genome indexing, Read mapping, Local alignment, Sequence merging, Filtering, Read summarisation, Variant calling, Indel detection and SNP annotation.]

Fig. 4. The Sample1.final.bam file results from a Read summarisation step and is followed by a Variant calling step.

[Figure 5: summary diagram highlighting VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz and its .tbi binary index, which result from an SNP annotation step.]

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of the FAIR principles.

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance capture

[...]
:head {
  _:np1 a np:Nanopublication ;
        np:hasAssertion :assertion ;
        np:hasProvenance :provenance ;
        np:hasPublicationInfo :pubInfo .
}
:assertion {
  <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
    rdfs:seeAlso <http://edamontology.org/operation_3197> .
  <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
    rdfs:seeAlso <http://edamontology.org/operation_3695> .
}
[...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance and publication information.


What             How                                  Results
Traceable        Provenance                           PROV traces
Interpretable    Scientific and technical context     Micropublications vocabulary
Understandable   Domain-specific context ontologies   EDAM terms
For humans       Human-oriented data summaries        Text and diagrams
For machines     Machine-oriented data summaries      NanoPublications

Table 1. Enhancing reuse of processed data with FRESH

RDF graph         Load time   Text-based   NanoPub   Graph-based
218,906 triples   22.7s       1.2s         61ms      1.5s

Table 2. Time for producing data summaries

module8. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when locally running a workflow, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature, since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible, without paying for the generally required long CPU-intensive tasks. This extension is under revision for integration into the main development branch of the Snakemake workflow engine.
Knowledge graph assembly. We developed a Python crawler9 that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository10.
Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFlib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us not to be impacted by the actual computing cost of performing raw data

8 https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9 https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10 https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost using a 16GB 2.9GHz Core i5 MacBook Pro desktop computer. We measured 22.7s to load in memory the full knowledge graph (218,906 triples), covering the workflow claims and its provenance graph, the Bio.Tools RDF dump and the EDAM ontology. The sentence-based data summaries were obtained in 1.2s, the machine-oriented NanoPublication was generated in 61ms, and finally 1.5s was needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

To reproduce the human- and machine-oriented data summaries, the Jupyter Notebook is available through a source code repository11.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focused on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which points out the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented data summaries or machine-oriented data summaries.

11 https://github.com/albangaignard/fresh-toolbox

Gaignard et al. / Findable and Reusable Workflow Data Products: A genomic Workflow Case Study

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as input the annotated genomic variants (in blue in Figure 5) to conduct a rare variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail to provide information on the nature of data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (Indel) detection step.

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms and tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent, similar efforts address the bioimaging community through the setup of the BISE12 bioimaging search engine (Neubias EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have seen recently the emergence of a number of solutions that assist scientists in the tasks of packaging the resources that are necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions about an investigation into three levels: Investigation, for describing the overall goals and means

12 http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and treatments that it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies that can be used for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis and other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package information they have about the experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information provided by the workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists and can be instrumental, to a certain extent, in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products that are generated by experiments. Indeed, the focus in the reports generated by the scientist is put on the scientific findings, documenting the hypothesis and experiment used and, in certain cases, the datasets obtained as a result. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if reported at all. The availability of such intermediate datasets can be of value to third-party scientists running their own experiments. This does not only save time for those scientists, in that they can use readily available datasets, but also saves resources, since some intermediate datasets are generated using large-scale, resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the semantic web community for capturing provenance, notably the W3C PROV-O recommendation13 and its workflow-oriented extensions, e.g.,

13 https://www.w3.org/TR/prov-o

ProvONE14, OPMW15, wfprov16, and P-Plan [30]. The availability of provenance provides the means for the scientist to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as: Is this data helpful for my computational experiment? Or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging since the capture of related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the reusability of workflow data products, as opposed to the workflow itself. As data reusability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools registry (https://bio.tools) to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours in the sense that it focuses on data (as opposed to the experiment as a whole) and generates human-readable textual narratives from provenance information. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different from that of Garijo and Gil in that we do not aim to generate narratives; instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14 vcvcomputing.com/provone/provone.html
15 www.opmw.org
16 http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare17 and Dataverse18, which help academic institutions store, share, and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that our summaries are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project19. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries generated from provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented or machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives which we intend to pursue in future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists to (1) provide tools with high-quality domain-specific

17 https://figshare.com
18 https://dataverse.org
19 http://tetherless-world.github.io/whyis

annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility for providing support and computing resources.

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S.C. Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Comp. Syst. 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic Versus Manual Provenance Abstractions: Mind the Gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8–9, 2016, USENIX, 2016.

[10] Y. Gil and D. Garijo, Towards Automating Data Narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vařeková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.

[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data – The Story So Far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR Metrics and results of the Metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake: a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 – Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and Bridging Cross-Workflow Provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing Cross-workflow Provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific Workflow Reuse Through Conceptual Workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data – Design Issues, https://www.w3.org/DesignIssues/LinkedData.html (2019).

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] P. Amstutz, M.R. Crusoe, N. Tijanić et al., Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in Databases: Why, How, and Where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

perts, and (2) the automation and provenance capabilities of workflow management systems, to automate the annotation of processed data towards more reusable workflow results.

3. FRESH Approach

FRESH is an approach to improve the Findability and the Reusability of genomic workflow data. FAIR [11, 20] and Linked Data [21, 22] principles constitute the conceptual and technological backbones in this direction.

FRESH partially tackles FAIR requirements for better findability and reusability. We address findability by relying on Linked Data best practices, namely associating a URI to each dataset and linking these datasets in the form of RDF knowledge graphs, with controlled vocabularies for the naming of concepts and relations.
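The Linked Data practice described above can be sketched minimally as follows: mint a stable URI for each data product and describe it with RDF statements. The namespace, hashing scheme, and labels below are hypothetical illustrations, not the scheme actually used by FRESH.

```python
import hashlib

# Hypothetical base namespace for minted dataset URIs.
BASE = "http://example.org/dataset/"

def mint_uri(file_path: str) -> str:
    """Derive a stable URI from a file path; hashing file contents
    would work as well for immutable data products."""
    digest = hashlib.sha256(file_path.encode()).hexdigest()[:12]
    return BASE + digest

def describe(file_path: str, label: str) -> list:
    """Emit N-Triples statements typing the dataset and labelling it."""
    uri = mint_uri(file_path)
    return [
        f"<{uri}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
        f"<http://www.w3.org/ns/prov#Entity> .",
        f'<{uri}> <http://www.w3.org/2000/01/rdf-schema#label> "{label}" .',
    ]
```

The same path always yields the same URI, so re-running a workflow does not multiply identifiers for unchanged outputs.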

Being tightly coupled to the scientific context, reusability is more challenging to achieve. Guidelines have been proposed for FAIR sharing of genomic data [23]; however, proposing and evaluating reusability is still challenging and a work in progress [24]. In this work, we consider data as reusable when annotated with sufficiently complete information allowing, without the need for external resources, traceability, interpretability, understandability, and usage by humans or machines.

To be traceable, provenance traces are mandatory for tracking the data generation process.

To be interpretable, contextual data [11] are mandatory; this includes i) the scientific context, in terms of claims, research lab, experimental conditions, and previous evidence (academic papers); and ii) the technical context, in terms of material and methods: data sources used, software (algorithms, queries), and hardware.

To be understandable by itself, data must be annotated with domain-specific vocabularies. For instance, to capture knowledge associated with the data processing steps, we can rely on EDAM4, which is actively developed and used in the context of the Bio.Tools registry, and which organizes common terms used in the field of bioinformatics. However, these annotations on processing tools do not capture the scientific context in which a workflow takes place. To address this issue, we rely on the Micropublications [25] ontology, which

4 http://edamontology.org

has been proposed to formally represent scientific approaches, hypotheses, claims, or pieces of evidence, in the direction of machine-tractable academic papers.

Figure 2 illustrates our approach to provide more reusable data. The first step consists in capturing provenance for all workflow runs. PROV5 is the de facto standard for describing and exchanging provenance graphs. Although capturing provenance can be easily managed in workflow engines, there is no systematic way to link a PROV Activity (the actual execution of a tool) to the relevant software Agent (i.e., the software responsible for the actual data processing). To address this issue, we propose to provide, at workflow design time, the tool's identifier in the tool catalog. This allows generating a provenance trace which associates (prov:wasAssociatedWith) each execution, and thus each consumed and produced data item, with the software identifier.
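A minimal sketch of this capture step, assuming the workflow author has declared a Bio.Tools identifier for each rule: every execution is serialized as a prov:Activity linked to its software agent through prov:wasAssociatedWith. The activity URI and identifier values are illustrative, not the ones emitted by the SnakeMake extension.

```python
PROV = "http://www.w3.org/ns/prov#"

def capture_step(activity_uri: str, biotools_id: str,
                 used: list, generated: list) -> list:
    """Return one N-Triples statement per captured provenance relation:
    the activity-to-tool association, its inputs, and its outputs."""
    tool_uri = f"https://bio.tools/{biotools_id}"
    triples = [f"<{activity_uri}> <{PROV}wasAssociatedWith> <{tool_uri}> ."]
    triples += [f"<{activity_uri}> <{PROV}used> <{u}> ." for u in used]
    triples += [f"<{g}> <{PROV}wasGeneratedBy> <{activity_uri}> ."
                for g in generated]
    return triples
```

Because every consumed and produced file is tied to the same activity, and the activity to the tool IRI, downstream queries can reach domain annotations from any data node.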

Then, we assemble a bioinformatics knowledge graph which links together: (1) the tool annotations gathered from the Bio.Tools6 registry, providing information on what the tools do (bioinformatics EDAM operations) and which kinds of data they consume and produce; (2) the complete EDAM ontology, to gather for instance the community-agreed definitions and synonyms for bioinformatics concepts; (3) the PROV graph resulting from a workflow execution, which provides fine-grained, technical, and domain-agnostic provenance metadata; and (4) the experimental context, using Micropublications for the scientific claims and hypotheses associated to the experiment.
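Because RDF merging is simply a union of statements, the assembly of these four sources can be sketched without any RDF library: in N-Triples, each line is one statement, so a set union of lines merges graphs and deduplicates shared triples. The actual implementation relies on RDFLib, and the example documents below are hypothetical.

```python
def merge_ntriples(*sources: str) -> set:
    """Naively merge several N-Triples documents: union their
    statement lines, dropping blanks and exact duplicates."""
    graph = set()
    for doc in sources:
        graph.update(line.strip() for line in doc.splitlines()
                     if line.strip())
    return graph

# Tiny stand-ins for the PROV trace and the Bio.Tools dump.
prov_doc = "<urn:d1> <urn:p> <urn:d0> ."
tools_doc = "<urn:t1> <urn:has_function> <urn:op> ."
merged = merge_ntriples(prov_doc, tools_doc, prov_doc)  # duplicate ignored
```

With RDFLib, the equivalent operation is parsing each source into the same `Graph`, which likewise keeps one copy of each statement.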

Finally, based on domain-specific provenance queries, the last step consists in extracting few but meaningful data from the knowledge graph, to provide scientists with more reusable intermediate or final results, and to provide machines with findable and queryable data stories.

In the remainder of this section, we rely on the SPARQL query language to interact with the knowledge graph, in terms of knowledge extraction and knowledge enrichment.

Query 1 aims at extracting and linking together data artefacts with the definition of the bioinformatics process they result from.

In this SPARQL query, we first identify data (prov:Entity), the tool execution they result from (prov:wasGeneratedBy), and the used software (prov:wasAssociatedWith). Then, we retrieve from the Bio.Tools

5 https://www.w3.org/TR/prov-o
6 https://bio.tools

[Figure 2 diagram: the workflow specification, raw data, and experimental context feed workflow enactment; provenance capture and linking combine the resulting PROV-O graph with the Bio.Tools graph and the EDAM ontology into a knowledge graph; provenance mining queries then produce human-oriented and machine-oriented data summaries.]

Fig. 2. Knowledge graph based on workflow provenance and tool annotations to automate the production of human- and machine-oriented data summaries.

SELECT ?d_label ?title ?f_def ?st WHERE {
  ?d rdf:type prov:Entity ;
     prov:wasGeneratedBy [ prov:wasAssociatedWith ?tool ] ;
     rdfs:label ?d_label .
  ?tool dc:title ?title ;
        biotools:has_function ?f .
  ?f rdfs:label ?f_label ;
     oboInOwl:hasDefinition ?f_def .
  ?c rdf:type mp:Claim ;
     mp:statement ?st .
}

Query 1. SPARQL query aimed at linking processed data to the processing tool and the definition of what is done on the data.

sub-graph the EDAM annotation which specifies the function of the tool (biotools:has_function). The definition of the tool's function is retrieved from the EDAM ontology (oboInOwl:hasDefinition). Finally, we retrieve the scientific context of the experiment by matching statements expressed in natural language (mp:Claim, mp:statement).

Query 2 shows how a specific provenance pattern can be matched and reshaped to provide a summary of the main processing steps in terms of domain-specific concepts.

The idea consists in identifying all data derivation links (prov:wasDerivedFrom). From the identified data, the tool executions are then matched, as well as the corresponding software agents. As in the previous query, the last piece of information to be identified is the functionality of the tools. This is done by

CONSTRUCT {
  ?x2 p-plan:wasPreceededBy ?x1 .
  ?x2 prov:wasAssociatedWith ?t2 .
  ?x1 prov:wasAssociatedWith ?t1 .
  ?t1 biotools:has_function ?f1 .
  ?f1 rdfs:label ?f1_label .
  ?t2 biotools:has_function ?f2 .
  ?f2 rdfs:label ?f2_label .
} WHERE {
  ?d2 prov:wasDerivedFrom ?d1 .
  ?d2 prov:wasGeneratedBy ?x2 ;
      rdfs:label ?d2_label .
  ?x2 prov:wasAssociatedWith ?t2 .
  ?d1 prov:wasGeneratedBy ?x1 ;
      rdfs:label ?d1_label .
  ?x1 prov:wasAssociatedWith ?t1 .
  ?t1 biotools:has_function ?f1 .
  ?f1 rdfs:label ?f1_label .
  ?t2 biotools:has_function ?f2 .
  ?f2 rdfs:label ?f2_label .
}

Query 2. SPARQL query aimed at assembling an abstract workflow based on what happened (provenance) and how data were processed (domain-specific EDAM annotations).

exploiting the biotools:has_function predicate. Once this graph pattern is matched, a new graph is created (CONSTRUCT query clause) to represent an ordered chain of processing steps (p-plan:wasPreceededBy).

4. Experimental results and Discussion

We experimented our approach on a production-level exome-sequencing workflow7, designed and operated by the GenoBird genomic and bioinformatics core facility. It implements the motivating scenario introduced in Section 2. We assume that, based on the approach presented beforehand, the workflow has been run, the associated provenance has been captured, and the knowledge graph has been assembled.

The resulting provenance graph consists in an RDF graph with 555 triples leveraging the PROV-O ontology. The following two tables show the distribution of PROV classes and properties.

Classes                  Number of instances
prov:Entity              40
prov:Activity            26
prov:Bundle              1
prov:Agent               1
prov:Person              1

Properties               Number of predicates
prov:wasDerivedFrom      167
prov:used                100
rdf:type                 69
prov:wasAssociatedWith   65
prov:wasGeneratedBy      39
rdfs:label               39
prov:endedAtTime         26
prov:startedAtTime       26
rdfs:comment             22
prov:wasAttributedTo     1
prov:generatedAtTime     1

Interpreting this provenance graph is challenging from a human perspective, due to the number of nodes and edges, and more importantly due to the lack of domain-specific terms.

4.1. Human-oriented data summaries

Based on Query 1 and a textual template, we show in Figure 3 sentences which have been automatically generated from the knowledge graph. They intend to provide scientists with self-explainable information on how data were produced, leveraging domain-specific terms, and in which scientific context.
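This template filling can be sketched with Python's `string.Template`. The placeholder names mirror the SELECT variables of Query 1 (`d_label`, `title`, `f_def`, `st`); the template wording and the result row below are illustrative assumptions, not the paper's exact template:

```python
from string import Template

# Sentence template; placeholders mirror Query 1's SELECT variables.
sentence = Template(
    "The file <$d_label> results from tool <$title>, which <$f_def>. "
    "It was produced in the context of <$st>."
)

# One hypothetical result row of Query 1.
row = {
    "d_label": "Samples/Sample1/BAM/Sample1.final.bam",
    "title": "gatk2_print_reads-IP",
    "f_def": "Counting and summarising the number of short sequence reads "
             "that map to genomic features",
    "st": "Rare Coding Variants in ANGPTL6 Are Associated with Familial "
          "Forms of Intracranial Aneurysm",
}

print(sentence.substitute(row))
```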

[...]
The file <Samples/Sample1/BAM/Sample1.final.bam> results from tool <gatk2_print_reads-IP>, which <Counting and summarising the number of short sequence reads that map to genomic features>.
It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]
The file <VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz> results from tool <gatk2_variant_annotator-IP>, which <Predict the effect or function of an individual single nucleotide polymorphism (SNP)>.
It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]

Fig. 3. Sentence-based data summaries providing, for a given file, information on the tool the data originates from, and the definition of what the tool does, based on the EDAM ontology.

Complex data analysis procedures would require a long text and many logical articulations to be understandable. Visual diagrams provide a compact representation for complex data processing and thus constitute an interesting means to assemble human-oriented data summaries.

Figure 4 shows a summary diagram automatically compiled from the bioinformatics knowledge graph previously described in Section 3. Black arrows represent the logical flow of data processing, and black ellipses represent the nature of data processing in terms of EDAM operations. The diagram highlights in blue the Sample1.final.bam file. It shows that this file results from a Read Summarisation step and is followed by a Variant Calling step.

Another example of a summary diagram is provided in Figure 5, which highlights the final VCF file and its binary index. The diagram shows that these files result from a processing step performing an SNP annotation, as defined in the EDAM ontology.

These visualisations provide scientists with means to situate an intermediate result, genomic sequences aligned to a reference genome (BAM file) or genomic variants (VCF file), in the context of a complex data analysis process. While an expert bioinformatician won't need these summaries, we consider that making them explicit and visualizing them is of major interest to better reuse and repurpose scientific data, or even to provide a first level of explanation in terms of domain-specific concepts.

7https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline


[Fig. 4 diagram: EDAM operation nodes (Genome indexing, Read mapping, Sequence alignment, Local alignment, Sequence merging, Formatting, Filtering, Read summarisation, Variant calling, Indel detection, SNP annotation, Sequence analysis, Genetic variation analysis, Sequence feature detection, Generation) linked along the data-processing flow, with the file Samples/Sample1/BAM/Sample1.final.bam highlighted.]

Fig. 4. The Sample1.final.bam file results from a Read Summarisation step and is followed by a Variant Calling step.

[Fig. 5 diagram: the same chain of EDAM operations, with the final VCF file VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz and its binary index (.tbi) highlighted.]

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of the FAIR principles.
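As a rough sketch of how such a NanoPublication can be assembled with a Python string template (the implementation approach the paper reports), with illustrative graph names, prefixes, and URIs:

```python
from string import Template

# TriG skeleton for a minimal nanopublication; graph names (:head,
# :assertion, ...) and the default prefix are illustrative assumptions.
NANOPUB = Template("""@prefix np: <http://www.nanopub.org/nschema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:head {
  _:np1 a np:Nanopublication ;
        np:hasAssertion :assertion ;
        np:hasProvenance :provenance ;
        np:hasPublicationInfo :pubInfo .
}
:assertion {
  <$data_uri> rdfs:seeAlso <$edam_operation> .
}
""")

# Annotate a workflow data product with an EDAM operation.
trig = NANOPUB.substitute(
    data_uri="http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai",
    edam_operation="http://edamontology.org/operation_3197",
)
print(trig)
```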

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance capture

[...]
:head {
  _:np1 a np:Nanopublication .
  _:np1 np:hasAssertion :assertion .
  _:np1 np:hasProvenance :provenance .
  _:np1 np:hasPublicationInfo :pubInfo .
}

:assertion {
  <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
      rdfs:seeAlso <http://edamontology.org/operation_3197> .
  <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
      rdfs:seeAlso <http://edamontology.org/operation_3695> .
}
[...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance and publication information.


What            How                                    Results
Traceable       Provenance                             PROV traces
Interpretable   Scientific and technical context       Micropublications vocabulary
Understandable  Domain-specific context ontologies     EDAM terms
For human       Human-oriented data summaries          Text and diagrams
For machine     Machine-oriented data summaries        NanoPublications

Table 1. Enhancing reuse of processed data with FRESH

RDF Graph         Load time   Text-based   NanoPub   Graph-based
218,906 triples   22.7 s      1.2 s        61 ms     1.5 s

Table 2. Time for producing data summaries

module8. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when locally running a workflow, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible without paying for the generally required long CPU-intensive tasks. This extension is under revision for being integrated in the main development branch of the Snakemake workflow engine.

Knowledge graph assembly. We developed a Python crawler9 that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository10.

Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFlib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.
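The capture itself can be pictured as emitting PROV-O statements for every job the engine schedules, whether executed or only simulated. A hypothetical, stdlib-only sketch (function name, example URIs, and job record shape are all illustrative; this is not the actual Snakemake module):

```python
PROV = "http://www.w3.org/ns/prov#"

def job_to_prov(job_id, tool, inputs, outputs):
    """Emit N-Triples PROV-O statements for one workflow step.

    Illustrative helper showing the kind of triples a provenance-capture
    executor can record for every executed *or simulated* job.
    """
    act = f"http://example.org/activity/{job_id}"
    lines = [f"<{act}> <{PROV}wasAssociatedWith> <http://example.org/tool/{tool}> ."]
    for i in inputs:
        lines.append(f"<{act}> <{PROV}used> <http://example.org/data/{i}> .")
    for o in outputs:
        data = f"http://example.org/data/{o}"
        lines.append(f"<{data}> <{PROV}wasGeneratedBy> <{act}> .")
        for i in inputs:
            lines.append(f"<{data}> <{PROV}wasDerivedFrom> <http://example.org/data/{i}> .")
    return lines

triples = job_to_prov("align-001", "bwa", ["sample1.fastq"], ["sample1.bam"])
print(len(triples))  # 4
```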

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us not to be impacted by the actual computing cost of performing raw data

8https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost measured on a 16 GB, 2.9 GHz Core i5 MacBook Pro desktop computer. We measured 22.7 s to load in memory the full knowledge graph (218,906 triples), covering the workflow claims and its provenance graph, the Bio.Tools RDF dump, and the EDAM ontology. The sentence-based data summaries have been obtained in 1.2 s, the machine-oriented NanoPublication has been generated in 61 ms, and finally 1.5 s were needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

To reproduce the human- and machine-oriented data summaries, the Jupyter Notebook is available through a source code repository11.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which points out the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented data summaries or machine-oriented data summaries.

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as

11https://github.com/albangaignard/fresh-toolbox


input the annotated genomic variants (in blue in Figure 5) to conduct a rare variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail in providing information on the nature of data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (Indel) detection step.

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms and tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent similar efforts address the bioimaging community through the setup of the BISE12 bioimaging search engine (Neubias EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have recently seen the emergence of a number of solutions that assist scientists in the tasks of packaging the resources that are necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions about an investigation into three levels: Investigation, for describing the overall goals and means

12http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and the treatments that it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies that can be used for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis and other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package information they have about the experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information provided by the workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists and can be instrumental, to a certain extent, in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products that are generated by their experiments. Indeed, the focus in the reports generated by the scientists is put on their scientific findings, documenting the hypothesis and experiment they used, and in certain cases the datasets obtained as a result of their experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if not reported at all. The availability of such intermediate datasets can be of value to third-party scientists running their own experiments. This does not only save time for those scientists, in that they can use readily available datasets, but also saves time and resources, since some intermediate datasets are generated using large-scale resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the Semantic Web community for capturing provenance, notably the W3C PROV-O recommendation13 and its workflow-oriented extensions, e.g.,

13https://www.w3.org/TR/prov-o


ProvONE14, OPMW15, wfprov16 and P-Plan [30]. The availability of provenance provides the means for the scientist to issue queries on why and how data were produced. However, it does not necessarily allow the scientists to examine questions such as: is this data helpful for my computational experiment? Or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging since the capture of the related meta-data is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the re-usability of workflow data products, as opposed to the workflow itself. As data re-usability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours in the sense that it focuses on data (as opposed to the experiment as a whole) and generates human-readable textual narratives from provenance information. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different from that of Garijo and Gil in that we do not aim to generate narratives. Instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14http://vcvcomputing.com/provone/provone.html
15http://www.opmw.org
16http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare17 and Dataverse18, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that the summaries we produce are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project19. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries generated based on provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives which we intend to pursue in our future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists to (1) provide tools with high-quality domain-specific

17https://figshare.com
18https://dataverse.org
19http://tetherless-world.github.io/whyis


annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility forproviding support and computing resources

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015) 457–493.

[2] S.C. Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Comp. Syst. 75 (2017) 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015) 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012) 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic versus manual provenance abstractions: Mind the gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8–9, 2016, USENIX, 2016.

[10] Y. Gil and D. Garijo, Towards automating data narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vařeková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.


[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data - The story so far, Int. J. Semantic Web Inf. Syst. 5(3) (2009) 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR metrics and results of the metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake – a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012) 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010) 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015) 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with plans in P-PLAN: Scientific processes as linked data, in: Proceedings of the Second International Workshop on Linked Science 2012 - Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and bridging cross-workflow provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing cross-workflow provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics (SeWeBMeDA), co-located with the 14th Extended Semantic Web Conference, Portoroz, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific workflow reuse through conceptual workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data - Design Issues, retrieved January 29, 2019. https://www.w3.org/DesignIssues/LinkedData.html.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007) 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] P. Amstutz, M.R. Crusoe, N. Tijanić et al., Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in databases: Why, how, and where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017) 284–298. doi:10.1016/j.future.2017.01.012.



[Fig. 2 diagram: the workflow specification, raw data and experimental context feed the workflow enactment; provenance capture and linking combine the PROV-O graph, the Bio.Tools graph and the EDAM ontology into a Knowledge Graph, from which provenance mining queries produce human-oriented and machine-oriented data summaries.]

Fig. 2. Knowledge graph based on workflow provenance and tool annotations to automate the production of human- and machine-oriented data summaries.

    SELECT ?d_label ?title ?f_def ?st WHERE {
      ?d rdf:type prov:Entity ;
         prov:wasGeneratedBy ?x ;
         rdfs:label ?d_label .
      ?x prov:wasAssociatedWith ?tool .
      ?tool dc:title ?title ;
            biotools:has_function ?f .
      ?f rdfs:label ?f_label ;
         oboInOwl:hasDefinition ?f_def .
      ?c rdf:type mp:Claim ;
         mp:statement ?st .
    }

Query 1. SPARQL query aimed at linking processed data to the processing tool and the definition of what is done on data.

sub-graph, the EDAM annotation which specifies the function of the tool (biotools:has_function). The definition of the function of the tool is retrieved from the EDAM ontology (oboInOwl:hasDefinition). Finally, we retrieve the scientific context of the experiment by matching statements expressed in natural language (mp:Claim, mp:statement).

Query 2 shows how a specific provenance pattern can be matched and reshaped to provide a summary of the main processing steps in terms of domain-specific concepts.

The idea consists in identifying all data derivation links (prov:wasDerivedFrom). From the identified data, the tool executions are then matched, as well as the corresponding software agents. As in the previous query, the last piece of information to be identified is the functionality of the tools. This is done by

    CONSTRUCT {
      ?x2 p-plan:wasPreceededBy ?x1 .
      ?x2 prov:wasAssociatedWith ?t2 .
      ?x1 prov:wasAssociatedWith ?t1 .
      ?t1 biotools:has_function ?f1 . ?f1 rdfs:label ?f1_label .
      ?t2 biotools:has_function ?f2 . ?f2 rdfs:label ?f2_label .
    }
    WHERE {
      ?d2 prov:wasDerivedFrom ?d1 .
      ?d2 prov:wasGeneratedBy ?x2 ;
          rdfs:label ?d2_label .
      ?x2 prov:wasAssociatedWith ?t2 .
      ?d1 prov:wasGeneratedBy ?x1 ;
          rdfs:label ?d1_label .
      ?x1 prov:wasAssociatedWith ?t1 .
      ?t1 biotools:has_function ?f1 . ?f1 rdfs:label ?f1_label .
      ?t2 biotools:has_function ?f2 . ?f2 rdfs:label ?f2_label .
    }

Query 2. SPARQL query aimed at assembling an abstract workflow based on what happened (provenance) and how data were processed (domain-specific EDAM annotations).

exploiting the biotools:has_function predicate. Once this graph pattern is matched, a new graph is created (CONSTRUCT query clause) to represent an ordered chain of processing steps (p-plan:wasPreceededBy).



4. Experimental results and Discussion

We experimented our approach on a production-level exome-sequencing workflow7, designed and operated by the GenoBird genomic and bioinformatics core facility. It implements the motivating scenario we introduced in Section 2. We assume that, following the approach presented above, the workflow has been run, the associated provenance has been captured, and the knowledge graph has been assembled.

The resulting provenance graph is an RDF graph with 555 triples leveraging the PROV-O ontology. The following two tables show the distribution of PROV classes and properties.

Classes                  Number of instances
prov:Entity              40
prov:Activity            26
prov:Bundle              1
prov:Agent               1
prov:Person              1

Properties               Number of predicates
prov:wasDerivedFrom      167
prov:used                100
rdf:type                 69
prov:wasAssociatedWith   65
prov:wasGeneratedBy      39
rdfs:label               39
prov:endedAtTime         26
prov:startedAtTime       26
rdfs:comment             22
prov:wasAttributedTo     1
prov:generatedAtTime     1

Interpreting this provenance graph is challenging from a human perspective, due to the number of nodes and edges and, more importantly, due to the lack of domain-specific terms.

4.1. Human-oriented data summaries

Based on Query 1 and a textual template, we show in Figure 3 sentences which have been automatically generated from the knowledge graph. They intend to provide scientists with self-explanatory information on how data were produced, leveraging domain-specific terms, and in which scientific context.

[...]
The file <Samples/Sample1/BAM/Sample1.final.bam> results from tool <gatk2_print_reads-IP>, which <Counting and summarising the number of short sequence reads that map to genomic features>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]
The file <VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz> results from tool <gatk2_variant_annotator-IP>, which <Predict the effect or function of an individual single nucleotide polymorphism (SNP)>. It was produced in the context of <Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm>.
[...]

Fig. 3. Sentence-based data summaries providing, for a given file, information on the tool the data originates from and the definition of what the tool does, based on the EDAM ontology.
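The sentence generation itself needs no more than a textual template filled with one row of Query 1's results. A minimal sketch with `string.Template` follows; the binding values are copied from Figure 3, while the template wording is illustrative rather than the paper's exact template:

```python
from string import Template

# Illustrative template mirroring the sentence pattern of Figure 3.
SENTENCE = Template(
    "The file <$label> results from tool <$tool>, which <$definition>. "
    "It was produced in the context of <$claim>."
)

row = {  # one Query-1 binding, hard-coded here for the sketch
    "label": "Samples/Sample1/BAM/Sample1.final.bam",
    "tool": "gatk2_print_reads-IP",
    "definition": ("Counting and summarising the number of short "
                   "sequence reads that map to genomic features"),
    "claim": ("Rare Coding Variants in ANGPTL6 Are Associated with "
              "Familial Forms of Intracranial Aneurysm"),
}
print(SENTENCE.substitute(row))
```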

Complex data analysis procedures would require a long text and many logical articulations to be understandable. Visual diagrams provide a compact representation of complex data processing and thus constitute an interesting means to assemble human-oriented data summaries.

Figure 4 shows a summary diagram automatically compiled from the bioinformatics knowledge graph previously described in Section 3. Black arrows represent the logical flow of data processing, and black ellipses represent the nature of data processing in terms of EDAM operations. The diagram highlights in blue the Sample1.final.bam file. It shows that this file results from a Read summarisation step and is followed by a Variant calling step.

Another example of a summary diagram is provided in Figure 5, which highlights the final VCF file and its binary index. The diagram shows that these files result from a processing step performing an SNP annotation, as defined in the EDAM ontology.

These visualisations provide scientists with means to situate an intermediate result, genomic sequences aligned to a reference genome (BAM file) or genomic variants (VCF file), in the context of a complex data analysis process. While an expert bioinformatician won't need these summaries, we consider that making these summaries explicit and visual is of major interest to better reuse/repurpose scientific data, or even to provide a first level of explanation in terms of domain-specific concepts.

7 https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline



[Figure 4: a processing-chain diagram whose nodes are EDAM operations (Genome indexing, Read mapping, Local alignment, Sequence alignment, Sequence merging, Filtering, Formatting, Generation, Read summarisation, Sequence analysis, Sequence feature detection, Indel detection, Variant calling, SNP annotation, Genetic variation analysis), with the file Samples/Sample1/BAM/Sample1.final.bam highlighted.]

Fig. 4. The Sample1.final.bam file results from a Read summarisation step and is followed by a Variant calling step.

[Figure 5: the same processing-chain diagram, highlighting the final VCF file VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz and its binary index (.tbi), both resulting from an SNP annotation step.]

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of the FAIR principles.

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance capture

    [...]
    :head {
        _:np1 a np:Nanopublication ;
            np:hasAssertion :assertion ;
            np:hasProvenance :provenance ;
            np:hasPublicationInfo :pubInfo .
    }
    :assertion {
        <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
            rdfs:seeAlso <http://edamontology.org/operation_3197> .
        <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
            rdfs:seeAlso <http://edamontology.org/operation_3695> .
    }
    [...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance, and publication information.



What             How                                   Results
Traceable        Provenance                            PROV traces
Interpretable    Scientific and technical context      Micropublications vocabulary
Understandable   Domain-specific context ontologies    EDAM terms
For humans       Human-oriented data summaries         Text and diagrams
For machines     Machine-oriented data summaries       NanoPublications

Table 1. Enhancing reuse of processed data with FRESH.

RDF graph          Load time   Text-based   NanoPub   Graph-based
218 906 triples    22.7 s      1.2 s        61 ms     1.5 s

Table 2. Time for producing data summaries.

module8. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when locally running a workflow, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible without paying for the generally long, CPU-intensive tasks. This extension is under revision for integration in the main development branch of the Snakemake workflow engine.

Knowledge graph assembly. We developed a Python crawler9 that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository10.

Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFLib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us not to be impacted by the actual computing cost of performing raw data

8 https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9 https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10 https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost using a 16 GB, 2.9 GHz Core i5 MacBook Pro desktop computer. We measured 22.7 s to load in memory the full knowledge graph (218 906 triples), covering the workflow claims and its provenance graph, the Bio.Tools RDF dump, and the EDAM ontology. The sentence-based data summaries were obtained in 1.2 s, the machine-oriented NanoPublication was generated in 61 ms, and finally 1.5 s were needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

The Jupyter Notebook to reproduce the human- and machine-oriented data summaries is available through a source code repository11.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which points out the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the Reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented data summaries or machine-oriented data summaries.

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as in-

11 https://github.com/albangaignard/fresh-toolbox



put the annotated genomic variants (in blue in Figure 5) to conduct a rare-variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail to provide information on the nature of data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (Indel) detection step.

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms/tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent similar efforts address the bioimaging community through the setup of the BISE12 bioimaging search engine (NEUBIAS EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have seen recently the emergence of a number of solutions that assist scientists in the tasks of packaging resources that are necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation/Study/Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions about an investigation into three levels: Investigation, for describing the overall goals and means

12 http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and treatments that it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies that can be used for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis and other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package information they have about the experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information provided by the workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists, and can be instrumental to a certain extent in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products that are generated by experiments. Indeed, the focus in the reports generated by the scientist is put on their scientific findings, documenting the hypothesis and experiment used and, in certain cases, the datasets obtained as a result of the experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if not reported at all. The availability of such intermediate datasets can be of value to third-party scientists to run their own experiments. This does not only save time for those scientists, in that they can use readily available datasets, but also saves time and resources, since some intermediate datasets are generated using large-scale resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the Semantic Web community for capturing provenance, notably the W3C PROV-O recommendation13 and its workflow-oriented extensions, e.g.,

13 https://www.w3.org/TR/prov-o



ProvONE14, OPMW15, wfprov16 and P-Plan [30]. The availability of provenance provides the means for scientists to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as: Is this data helpful for my computational experiment? or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging since the capture of related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the reusability of workflow data products, as opposed to the workflow itself. As data reusability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours, in the sense that it focuses on data (as opposed to the experiment as a whole) and generates textual narratives from provenance information that are human-readable. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different from that of Garijo and Gil in that we do not aim to generate narratives. Instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14 vcvcomputing.com/provone/provone.html
15 www.opmw.org
16 http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare17 and Dataverse18, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that the summaries we produce are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project19. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries which are generated based on provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives which we intend to pursue in our future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH when it comes to workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists to (1) provide tools with high-quality domain-specific

17 https://figshare.com
18 https://dataverse.org
19 http://tetherless-world.github.io/whyis



annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from the "Institut du Thorax" and CNRGH who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility forproviding support and computing resources

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457-493.

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Comp. Syst. 75 (2017), 284-298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1-11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56-65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.686781.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085-2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic Versus Manual Provenance Abstractions: Mind the Gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8-9, 2016, USENIX, 2016. https://www.usenix.org/conference/tapp16/workshop-program/presentation/alper.

[10] Y. Gil and D. Garijo, Towards Automating Data Narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565-576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vařeková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.



[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data - The Story So Far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1-22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR Metrics and results of the Metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake - a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520-2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354-2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015), 16-42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 - Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and Bridging Cross-Workflow Provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219-234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing Cross-workflow Provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portorož, Slovenia, 2017, pp. 50-64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific Workflow Reuse Through Conceptual Workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1-10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data - Design Issues, https://www.w3.org/DesignIssues/LinkedData.html, retrieved January 29, 2019.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126-138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517-1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] A Peter RC Michae T NebojAringaa C Brad C JohnH Michael K Andrey K John L Dan M HervAtildelrsquoN Maya S Matt S-R Stian and S Luka Com-mon Workflow Language v10 Figshare 2016doi106084m9figshare3115156v2 httpdxdoiorg106084m9figshare3115156v2

[40] J Cheney L Chiticariu and W-C Tan Provenance inDatabases Why How and Where Foundations and Trendsin Databases (2007) ISSN 1931-7883 ISBN 1931-7883doi1015611900000006

[41] M Atkinson S Gesing J Montagnat and I Taylor Scien-tific workflows Past present and future Future GenerationComputer Systems (2017) ISSN 0167739X ISBN 0167-739Xdoi101016jfuture201705041

[42] S Cohen-Boulakia K Belhajjame O Collin J ChopardC Froidevaux A Gaignard K Hinsen P LarmandeYL Bras F Lemoine F Mareuil H Meacutenager C Pradaland C Blanchet Scientific workflows for computationalreproducibility in the life sciences Status challengesand opportunities Future Generation Computer Systems75 (2017) 284ndash298 ISSN 0167739X ISBN 0167-739Xdoi101016jfuture201701012

Gaignard et al. / Findable and Reusable Workflow Data Products: A genomic Workflow Case Study

4. Experimental results and Discussion

We evaluated our approach on a production-level exome-sequencing workflow7 designed and operated by the GenoBird genomic and bioinformatics core facility. It implements the motivating scenario introduced in Section 2. We assume that, following the approach presented above, the workflow has been run, the associated provenance has been captured, and the knowledge graph has been assembled.

The resulting provenance graph is an RDF graph of 555 triples leveraging the PROV-O ontology. The following two tables show the distribution of PROV classes and properties.

Classes                  Number of instances
prov:Entity              40
prov:Activity            26
prov:Bundle              1
prov:Agent               1
prov:Person              1

Properties               Number of predicates
prov:wasDerivedFrom      167
prov:used                100
rdf:type                 69
prov:wasAssociatedWith   65
prov:wasGeneratedBy      39
rdfs:label               39
prov:endedAtTime         26
prov:startedAtTime       26
rdfs:comment             22
prov:wasAttributedTo     1
prov:generatedAtTime     1

Interpreting this provenance graph is challenging from a human perspective, due to the number of nodes and edges and, more importantly, due to the lack of domain-specific terms.

4.1. Human-oriented data summaries

Based on Query 1 and a textual template, we show in Figure 3 sentences that have been automatically generated from the knowledge graph. They are intended to provide scientists with self-explanatory information on how data were produced, leveraging domain-specific terms, and in which scientific context.

[...]
The file <Samples/Sample1/BAM/Sample1.final.bam> results from tool
<gatk2_print_reads-IP>, which <Counting and summarising the number of short
sequence reads that map to genomic features>.
It was produced in the context of <Rare Coding Variants in ANGPTL6 Are
Associated with Familial Forms of Intracranial Aneurysm>.
[...]
The file <VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz> results from tool
<gatk2_variant_annotator-IP>, which <Predict the effect or function of an
individual single nucleotide polymorphism (SNP)>.
It was produced in the context of <Rare Coding Variants in ANGPTL6 Are
Associated with Familial Forms of Intracranial Aneurysm>.
[...]

Fig. 3. Sentence-based data summaries providing, for a given file, information on the tool the data originates from and the definition of what the tool does, based on the EDAM ontology.

Complex data analysis procedures would require long text and many logical articulations to be understandable. Visual diagrams provide a compact representation of complex data processing and thus constitute an interesting means of assembling human-oriented data summaries.

Figure 4 shows a summary diagram automatically compiled from the bioinformatics knowledge graph described in Section 3. Black arrows represent the logical flow of data processing, and black ellipses represent the nature of data processing in terms of EDAM operations. The diagram highlights the file Sample1.final.bam in blue: this file results from a Read summarisation step and is followed by a Variant calling step.

Another example of a summary diagram is provided in Figure 5, which highlights the final VCF file and its binary index. The diagram shows that these files result from a processing step performing an SNP annotation, as defined in the EDAM ontology.

These visualisations provide scientists with means to situate an intermediate result, such as genomic sequences aligned to a reference genome (BAM file) or genomic variants (VCF file), in the context of a complex data analysis process. While an expert bioinformatician won't need these summaries, we consider that making them explicit and visualising them is of major interest to better reuse and repurpose scientific data, or even to provide a first level of explanation in terms of domain-specific concepts.
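Such a summary diagram can be sketched with NetworkX, the library the implementation section reports using for basic graph visualisations. Nodes stand for EDAM operation labels and edges follow the logical flow of data processing; the three-step chain below is an illustrative extract, not the full pipeline.

```python
import networkx as nx

# Summary DAG: nodes are EDAM operation labels, edges the processing flow.
summary = nx.DiGraph()
summary.add_edge("Read mapping", "Read summarisation")
summary.add_edge("Read summarisation", "Variant calling")
summary.add_edge("Variant calling", "SNP annotation")

# A topological sort recovers the processing order shown in the diagram.
order = list(nx.topological_sort(summary))
print(order)
```

In the actual implementation the DAG would be rendered with PyDot/GraphViz rather than printed.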

7 https://gitlab.univ-nantes.fr/bird_pipeline_registry/exome-pipeline

[Summary diagram: EDAM operation nodes (Genome indexing, Read mapping, Local alignment, Sequence merging, Formatting, Filtering, Indel detection, Variant calling, SNP annotation, Read summarisation, Sequence alignment, Sequence analysis, Genetic variation analysis, Sequence feature detection) linked by the data-processing flow, with Samples/Sample1/BAM/Sample1.final.bam highlighted.]

Fig. 4. The Sample1.final.bam file results from a Read summarisation step and is followed by a Variant calling step.

[Summary diagram highlighting VCF/hapcaller.recal.combined.annot.gnomad.vcf.gz and its binary index (.tbi), produced by an SNP annotation step within the same EDAM operation flow.]

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of the FAIR principles.

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance capture

[...]
:head {
  _:np1 a np:Nanopublication ;
        np:hasAssertion :assertion ;
        np:hasProvenance :provenance ;
        np:hasPublicationInfo :pubInfo .
}
:assertion {
  <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
      rdfs:seeAlso <http://edamontology.org/operation_3197> .
  <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
      rdfs:seeAlso <http://edamontology.org/operation_3695> .
}
[...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance, and publication information.

What            How                                    Results
Traceable       Provenance                             PROV traces
Interpretable   Scientific and technical context       Micropublications vocabulary
Understandable  Domain-specific context (ontologies)   EDAM terms
For humans      Human-oriented data summaries          Text and diagrams
For machines    Machine-oriented data summaries        NanoPublications

Table 1. Enhancing reuse of processed data with FRESH

RDF graph         Load time   Text-based   NanoPub   Graph-based
218,906 triples   22.7s       1.2s         61ms      1.5s

Table 2. Time for producing data summaries

module8. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when running a workflow locally, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature since all data processing steps are generated by the workflow engine but not concretely executed; the capture of simulated provenance information is thus possible without paying for the generally long, CPU-intensive tasks. This extension is under review for integration into the main development branch of the Snakemake workflow engine.
Knowledge graph assembly. We developed a Python crawler9 that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository10.
Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFLib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us not to be impacted by the actual computing cost of performing raw data

8 https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9 https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10 https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost, measured on a MacBook Pro desktop computer with 16GB of RAM and a 2.9GHz Core i5 processor. We measured 22.7s to load in memory the full knowledge graph (218,906 triples), covering the workflow claims and its provenance graph, the Bio.Tools RDF dump, and the EDAM ontology. The sentence-based data summaries were obtained in 1.2s, the machine-oriented NanoPublication was generated in 61ms, and finally 1.5s was needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

To reproduce the human- and machine-oriented data summaries, this Jupyter Notebook is available through a source code repository11.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which points out the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the Reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented or machine-oriented data summaries.

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as input the annotated genomic variants (in blue in Figure 5) to conduct a rare-variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail to provide information on the nature of data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (indel) detection step.

11 https://github.com/albangaignard/fresh-toolbox

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms and tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent similar efforts address the bioimaging community through the setup of the BISE12 bioimaging search engine (NEUBIAS EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have recently seen the emergence of a number of solutions that assist scientists in the task of packaging the resources necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions of an investigation into three levels: Investigation, for describing the overall goals and means

12 http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and the treatments it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis, as well as other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages including all the dependencies required to reproduce a workflow, for experiments executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package information about the experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information recorded by the workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists, and can be instrumental, to a certain extent, in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products generated by experiments. Indeed, the focus of the reports generated by scientists is put on their scientific findings, documenting the hypothesis and experiment they used and, in certain cases, the datasets obtained as a result. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if reported at all. The availability of such intermediate datasets can be of value to third-party scientists running their own experiments. This not only saves time for those scientists, in that they can use readily available datasets, but also saves resources, since some intermediate datasets are generated using large-scale, resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the Semantic Web community for capturing provenance, notably the W3C PROV-O recommendation13 and its workflow-oriented extensions, e.g.,

13 https://www.w3.org/TR/prov-o/

ProvONE14, OPMW15, wfprov16 and P-Plan [30]. The availability of provenance provides the means for scientists to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as: is this data helpful for my computational experiment? Or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging, since the capture of the related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly due to a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the reusability of workflow data products, as opposed to the workflow itself. As data reusability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools registry (https://bio.tools) to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours, in the sense that it focuses on data (as opposed to the experiment as a whole) and generates human-readable textual narratives from provenance information. The key idea of data narratives is to keep detailed provenance records of how an analysis was done and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective we set out in this paper differs in that we do not aim to generate narratives. Instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14 http://vcvcomputing.com/provone/provone.html
15 http://www.opmw.org
16 http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare17 and Dataverse18, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that they are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project19. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries generated from provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives, which we intend to pursue in future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists

17 https://figshare.com
18 https://dataverse.org
19 http://tetherless-world.github.io/whyis

to (1) provide tools with high-quality domain-specific annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility forproviding support and computing resources

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic versus manual provenance abstractions: Mind the gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, USENIX, 2016.

[10] Y. Gil and D. Garijo, Towards automating data narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura et al., Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.


[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data - The Story So Far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR metrics and results of the metrics evaluation questionnaire, Zenodo, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake - a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with plans in P-Plan: Scientific processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 - Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and bridging cross-workflow provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing cross-workflow provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific workflow reuse through conceptual workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data - Design issues, retrieved January 29, 2019. https://www.w3.org/DesignIssues/LinkedData.html.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] A. Peter, R.C. Michael, T. Nebojša, C. Brad, C. John, H. Michael, K. Andrey, K. John, L. Dan, M. Hervé, N. Maya, S. Matt, S.-R. Stian and S. Luka, Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in databases: Why, how, and where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[37] S Ajila and D Wu Empirical study of the effects ofopen source adoption on software development economicsJournal of Systems and Software 80(9) (2007) 1517ndash1529 doi101016jjss200701011 httpsdoiorg101016jjss200701011

[38] D Adams The Hitchhikerrsquos Guide to the Galaxy San Val1995 ISBN 9781417642595 httpbooksgooglecombooksid=W-xMPgAACAAJ

[39] A Peter RC Michae T NebojAringaa C Brad C JohnH Michael K Andrey K John L Dan M HervAtildelrsquoN Maya S Matt S-R Stian and S Luka Com-mon Workflow Language v10 Figshare 2016doi106084m9figshare3115156v2 httpdxdoiorg106084m9figshare3115156v2

[40] J Cheney L Chiticariu and W-C Tan Provenance inDatabases Why How and Where Foundations and Trendsin Databases (2007) ISSN 1931-7883 ISBN 1931-7883doi1015611900000006

[41] M Atkinson S Gesing J Montagnat and I Taylor Scien-tific workflows Past present and future Future GenerationComputer Systems (2017) ISSN 0167739X ISBN 0167-739Xdoi101016jfuture201705041

[42] S Cohen-Boulakia K Belhajjame O Collin J ChopardC Froidevaux A Gaignard K Hinsen P LarmandeYL Bras F Lemoine F Mareuil H Meacutenager C Pradaland C Blanchet Scientific workflows for computationalreproducibility in the life sciences Status challengesand opportunities Future Generation Computer Systems75 (2017) 284ndash298 ISSN 0167739X ISBN 0167-739Xdoi101016jfuture201701012

  • Introduction
  • Motivations and Problem Statement
  • FRESH Approach
  • Experimental results and Discussion
    • Human-oriented data summaries
    • Machine-oriented data summaries
    • Implementation
    • Discussion
      • Related Work
      • Conclusion and Future Work
      • Acknowledgements
      • References
Page 7: Semantic Web 0 (0) 1 1 IOS Press Findable and Reusable … · how summarisation techniques can be used to reduce the machine-generated provenance information of such data products


[Workflow diagrams; step labels include Genome indexing, Read mapping, Local alignment, Sequence merging, Filtering, Read summarisation, Variant calling, Indel detection and SNP annotation.]

Fig. 4. The Samples/Sample1/BAM/Sample1.final.bam file results from a Read summarisation step and is followed by a Variant calling step.

Fig. 5. Human-oriented diagram automatically compiled from the provenance and domain-specific knowledge graph.

4.2. Machine-oriented data summaries

Linked Data principles advocate the use of controlled vocabularies and ontologies to provide both human- and machine-readable knowledge. We show in Figure 6 how domain-specific statements on data, typically their annotation with EDAM bioinformatics concepts, can be aggregated and shared between machines by leveraging the NanoPublication vocabulary. Published as Linked Data, these data summaries can be semantically indexed and searched, in line with the Findability of FAIR principles.

4.3. Implementation

Provenance capture. We slightly extended the Snakemake [26] workflow engine with a provenance capture

[...]

:head {
    _:np1 a np:Nanopublication .
    _:np1 np:hasAssertion :assertion .
    _:np1 np:hasProvenance :provenance .
    _:np1 np:hasPublicationInfo :pubInfo .
}

:assertion {
    <http://snakemake-provenance/Samples/Sample1/BAM/Sample1.merged.bai>
        rdfs:seeAlso <http://edamontology.org/operation_3197> .
    <http://snakemake-provenance/VCF/hapcaller.indel.recal.filter.vcf.gz>
        rdfs:seeAlso <http://edamontology.org/operation_3695> .
}

[...]

Fig. 6. An extract of a machine-oriented NanoPublication aggregating domain-specific assertions, provenance and publication information.
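Section 4.3 mentions that the NanoPublication is assembled with Python string templates. A minimal, illustrative sketch of that idea (the template body and URIs are hypothetical, merely patterned after Fig. 6; this is not FRESH's actual code):

```python
from string import Template

# Illustrative only: a TriG-like nanopublication skeleton assembled
# with a stdlib string template, in the spirit of Fig. 6.
NANOPUB = Template("""\
:head {
    _:np1 a np:Nanopublication .
    _:np1 np:hasAssertion :assertion .
}
:assertion {
    <$data_uri> rdfs:seeAlso <$edam_uri> .
}
""")

trig = NANOPUB.substitute(
    data_uri="http://snakemake-provenance/Samples/Sample1.merged.bai",
    edam_uri="http://edamontology.org/operation_3197",
)
print(trig)
```

Generating the summary then reduces to filling the placeholders with file URIs and EDAM operation URIs extracted from the provenance graph.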


Table 1. Enhancing reuse of processed data with FRESH.

  What            How                                  Results
  Traceable       Provenance                           PROV traces
  Interpretable   Scientific and technical context     Micropublications vocabulary
  Understandable  Domain-specific context ontologies   EDAM terms
  For human       Human-oriented data summaries        Text and diagrams
  For machine     Machine-oriented data summaries      NanoPublications

Table 2. Time for producing data summaries.

  RDF graph        Load time   Text-based   NanoPub   Graph-based
  218,906 triples  22.7 s      1.2 s        61 ms     1.5 s

module⁸. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when locally running a workflow, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature, since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible, without paying for the generally long, CPU-intensive tasks. This extension is under revision for integration into the main development branch of the Snakemake workflow engine.

Knowledge graph assembly. We developed a Python crawler⁹ that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository¹⁰.

Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFlib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us to avoid being impacted by the actual computing cost of performing the raw data analysis. Table 2 shows the cost using a 16 GB, 2.9 GHz Core i5 MacBook Pro desktop computer. We measured 22.7 s to load in memory the full knowledge graph (218,906 triples) covering the workflow claims and its provenance graph, the Bio.Tools RDF dump and the EDAM ontology. The sentence-based data summaries were obtained in 1.2 s, the machine-oriented NanoPublication was generated in 61 ms, and finally 1.5 s were needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

⁸ https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

⁹ https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf

¹⁰ https://github.com/bio-tools/biotoolsRdf

To reproduce the human- and machine-oriented data summaries, this Jupyter Notebook is available through a source code repository¹¹.

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with a particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which points to the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented data summaries or machine-oriented data summaries.

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as in-

¹¹ https://github.com/albangaignard/fresh-toolbox


put the annotated genomic variants (in blue in Figure 5) to conduct a rare variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail in providing information on the nature of data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (indel) detection step.
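To illustrate this kind of machine-driven interpretation, here is a sketch of recovering a processing chain by walking derivation links. The step names mirror Figure 5, but the data structure is a hypothetical stand-in for the nanopublication's provenance graph, not FRESH's actual model:

```python
# Hypothetical, minimal provenance model: each output maps to the
# operation that produced it and the input it was derived from.
derived_from = {
    "annotated.vcf": ("SNP annotation", "called.vcf"),
    "called.vcf": ("Variant calling", "realigned.bam"),
    "realigned.bam": ("Indel detection", "merged.bam"),
}

def processing_chain(artifact):
    """Walk derivation links and list the operations, most recent first."""
    steps = []
    while artifact in derived_from:
        operation, source = derived_from[artifact]
        steps.append(operation)
        artifact = source
    return steps

print(processing_chain("annotated.vcf"))
# → ['SNP annotation', 'Variant calling', 'Indel detection']
```

With FRESH, the same walk would be expressed as a query over PROV derivation triples rather than a Python dictionary.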

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms and tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent, similar efforts address the bioimaging community through the setup of the BISE¹² bioimaging search engine (NeuBIAS EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have seen recently the emergence of a number of solutions that assist scientists in the tasks of packaging resources that are necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions about an investigation into three levels: Investigation, for describing the overall goals and means

¹² http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and treatments that it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies that can be used for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis and other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package information they have about the experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information provided by the workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to the scientists, and can be instrumental to a certain extent to check the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products that are generated by their experiments. Indeed, the focus in the reports generated by the scientist is put on their scientific findings, documenting the hypothesis and experiment they used, and in certain cases the datasets obtained as a result of their experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if not reported at all. The availability of such intermediate datasets can be of value to third-party scientists to run their own experiments. This does not only save time for those scientists, in that they can use readily available datasets, but also saves time and resources, since some intermediate datasets are generated using large-scale, resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the semantic web community for capturing provenance, notably the W3C PROV-O recommendation¹³ and its workflow-oriented extensions, e.g.,

¹³ https://www.w3.org/TR/prov-o


ProvONE¹⁴, OPMW¹⁵, wfprov¹⁶ and P-Plan [30]. The availability of provenance provides the means for the scientist to issue queries on Why and How data were produced. However, it does not necessarily allow the scientists to examine questions such as: Is this data helpful for my computational experiment? Or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging, since the capture of related meta-data is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the reusability of workflow data products, as opposed to the workflow itself. As data reusability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours, in the sense that it focuses on data (as opposed to the experiment as a whole) and generates textual narratives from provenance information that are human-readable. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different from that by Garijo and Gil, in that we do not aim to generate narratives. Instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

¹⁴ vcvcomputing.com/provone/provone.html

¹⁵ www.opmw.org

¹⁶ http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare¹⁷ and Dataverse¹⁸, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that the summaries we produce are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project¹⁹. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries, which are generated based on provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives, which we intend to pursue in our future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH when it comes to workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists

¹⁷ https://figshare.com

¹⁸ https://dataverse.org

¹⁹ http://tetherless-world.github.io/whyis


annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from l'Institut du Thorax and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility forproviding support and computing resources

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S.C. Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Comp. Syst. 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic Versus Manual Provenance Abstractions: Mind the Gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8–9, 2016, USENIX, 2016.

[10] Y. Gil and D. Garijo, Towards Automating Data Narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vařeková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.


[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data – The Story So Far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR Metrics and results of the Metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake – a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, J. Web Semant. 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 – Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and Bridging Cross-Workflow Provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing Cross-workflow Provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portoroz, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific Workflow Reuse Through Conceptual Workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data – Design Issues. Retrieved January 29, 2019. https://www.w3.org/DesignIssues/LinkedData.html.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] A. Peter, R.C. Michael, T. Nebojša, C. Brad, C. John, H. Michael, K. Andrey, K. John, L. Dan, M. Hervé, N. Maya, S. Matt, S.-R. Stian and S. Luka, Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in Databases: Why, How, and Where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y.L. Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

8 Gaignard et al Findable and Reusable Workflow Data Products A genomic Workflow Case Study

Table 1. Enhancing reuse of processed data with FRESH.

| What           | How                                  | Results                      |
|----------------|--------------------------------------|------------------------------|
| Traceable      | Provenance                           | PROV traces                  |
| Interpretable  | Scientific and technical context     | Micropublications vocabulary |
| Understandable | Domain-specific context (ontologies) | EDAM terms                   |
| For humans     | Human-oriented data summaries        | Text and diagrams            |
| For machines   | Machine-oriented data summaries      | NanoPublications             |

Table 2. Time for producing data summaries.

| RDF graph       | Load time | Text-based | NanoPub | Graph-based |
|-----------------|-----------|------------|---------|-------------|
| 218,906 triples | 2.27 s    | 1.2 s      | 61 ms   | 1.5 s       |

module8. This module, written in Python, is a wrapper for the AbstractExecutor class. The same source code is used to produce PROV RDF metadata when running a workflow locally, when exploiting parallelism in an HPC environment, or when simulating a workflow. Simulating a workflow is an interesting feature, since all data processing steps are generated by the workflow engine but not concretely executed. Nevertheless, the capture of simulated provenance information is still possible without paying for the generally long, CPU-intensive tasks. This extension is under revision for integration into the main development branch of the Snakemake workflow engine.

Knowledge graph assembly. We developed a Python crawler9 that consumes the JSON representation of the Bio.Tools bioinformatics registry and produces an RDF data dump focusing on domain annotations (EDAM ontology) and links to the reference papers. RDF dumps are nightly built and pushed to a dedicated source code repository10.

Experimental setup. The results shown in Section 4 were obtained by running a Jupyter Notebook. RDF data loading and SPARQL query execution were achieved through the Python RDFlib library. Python string templates were used to assemble the NanoPublication, while NetworkX, PyDot and GraphViz were used for basic graph visualisations.

We simulated the production-level exome-sequencing workflow to evaluate the computational cost of producing data summaries from an RDF knowledge graph. The simulation mode allowed us not to be impacted by the actual computing cost of performing raw data

8 https://bitbucket.org/agaignar/snakemake-provenance/src/provenance-capture

9 https://github.com/bio-tools/biotoolsShim/tree/master/json2rdf
10 https://github.com/bio-tools/biotoolsRdf

analysis. Table 2 shows the cost using a 16GB, 2.9GHz Core i5 MacBook Pro desktop computer. We measured 2.27s to load in memory the full knowledge graph (218,906 triples) covering the workflow claims and its provenance graph, the Bio.Tools RDF dump, and the EDAM ontology. The sentence-based data summaries were obtained in 1.2s, the machine-oriented NanoPublication was generated in 61ms, and finally 1.5s were needed to reshape and display the graph-based data summary. This overhead can be considered negligible compared to the computing resources required to analyse exome-sequencing data, as shown in Section 2.

To reproduce the human- and machine-oriented data summaries, the corresponding Jupyter Notebook is available through a source code repository11.
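The machine-oriented summaries are assembled with plain Python string templates, as stated above. A minimal sketch of this step follows; the graph names, prefixes and the EDAM class URI are illustrative placeholders, not the exact template used in the notebook.

```python
from string import Template

# Hypothetical nanopublication skeleton: head, assertion, provenance and
# publication-info named graphs, to be filled in from the knowledge graph.
NANOPUB = Template("""
@prefix np:   <http://www.nanopub.org/nschema#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/nanopub/> .

:head {
    :pub a np:Nanopublication ;
        np:hasAssertion :assertion ;
        np:hasProvenance :provenance ;
        np:hasPublicationInfo :pubinfo .
}
:assertion  { <$data> a <$edam_class> . }
:provenance { :assertion prov:wasDerivedFrom <$workflow_run> . }
:pubinfo    { :pub prov:generatedAtTime "$timestamp" . }
""")

nanopub = NANOPUB.substitute(
    data="http://example.org/data/annotated_variants.vcf",
    edam_class="http://edamontology.org/data_3498",  # assumed EDAM class, illustrative
    workflow_run="http://example.org/runs/exome-seq-1",  # hypothetical run URI
    timestamp="2019-01-29T10:00:00Z",
)
print(nanopub)
```

Since the template is filled from values already present in the RDF graph, generating such a summary is a single string substitution, which is consistent with the 61ms figure reported in Table 2.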

4.4. Discussion

The validation exercise we reported has shown that it is possible to generate data summaries that provide valuable information about workflow data. In doing so, we focus on domain-specific annotations to promote the findability and reuse of data processed by scientific workflows, with particular attention to genomics workflows. This is justified by the fact that FRESH meets the reusability criteria set up by the FAIR community, as demonstrated by Table 1, which points out the reusability aspects set by the FAIR community that are satisfied by FRESH. We also note that FRESH is aligned with R1.2 ((meta)data are associated with detailed provenance) and R1.3 ((meta)data meet domain-relevant community standards) of the reusable principle of FAIR. As illustrated in the previous sections, FRESH can be used to generate human-oriented or machine-oriented data summaries.

11 https://github.com/albangaignard/fresh-toolbox

Still in the context of genomic data analysis, a typical reuse scenario would consist in exploiting as input the annotated genomic variants (in blue in Figure 5) to conduct a rare-variant statistical analysis. If we consider that no semantics is attached to the names of files or tools, domain-agnostic provenance would fail to provide information on the nature of the data processing. By looking at the human-oriented diagram, or by letting an algorithm query the machine-oriented nanopublication produced by FRESH, scientists would be able to understand that the file results from an annotation of single nucleotide polymorphisms (SNPs), which was preceded by a variant calling step, itself preceded by an insertion/deletion (indel) detection step.

We focused in this work on the bioinformatics domain and leveraged Bio.Tools, a large-scale community effort aimed at semantically cataloguing available algorithms and tools. As soon as semantic tool catalogs are available for other domains, FRESH can be applied to enhance the findability and reusability of processed data. More recent, similar efforts already address the bioimaging community through the setup of the BISE12 bioimaging search engine (NEUBIAS EU COST Action). Annotated with a bioimaging-specific EDAM extension, this tool registry could be queried to annotate bioimaging data following the same approach.

5. Related Work

Our work is related to proposals that seek to enable and facilitate the reproducibility and reuse of scientific artifacts and findings. We have recently seen the emergence of a number of solutions that assist scientists in the task of packaging the resources necessary for preserving and reproducing their experiments. For example, OBI (Ontology for Biomedical Investigations) [27] and the ISA (Investigation, Study, Assay) model [28] are two widely used community models from the life-science domain for describing experiments and investigations. OBI provides common terms, like investigations or experiments, to describe investigations in the biomedical domain. It also allows the use of domain-specific vocabularies or ontologies to characterize experiment factors involved in the investigation. ISA, on the other hand, structures the descriptions of an investigation into three levels: Investigation, for describing the overall goals and means

12 http://www.biii.eu

used in the experiment; Study, for documenting information about the subject under study and the treatments it may have undergone; and Assay, for representing the measurements performed on the subjects. Research Objects [29] is a workflow-friendly solution that provides a suite of ontologies for aggregating a workflow specification together with its executions and annotations informing on the scientific hypothesis, as well as other domain annotations. ReproZip [7] is another solution that helps users create relatively lightweight packages that include all the dependencies required to reproduce a workflow, for experiments that are executed using scripts, in particular Python scripts.

The above solutions are useful in that they help scientists package the information they have about an experiment into a single container. However, they do not help scientists in actually annotating or reporting the findings of their experiments. In this respect, Alper et al. [9] and Gaignard et al. [8] developed solutions that provide users with the means for deriving annotations for workflow results and for summarizing the provenance information recorded by workflow systems. Such summaries are utilized for reporting purposes.

While we recognize that such proposals are of great help to scientists, and can be instrumental, to a certain extent, in checking the repeatability of experiments, they miss opportunities when it comes to the reuse of the intermediate data products generated by those experiments. Indeed, the focus of the reports generated by the scientist is on their scientific findings, documenting the hypothesis and experiment they used and, in certain cases, the datasets obtained as a result of the experiment. The intermediate datasets, which are by-products of the internal steps of the experiment, are in most cases buried in the provenance of the experiment, if reported at all. The availability of such intermediate datasets can be of value to third-party scientists running their own experiments. This does not only save time for those scientists, in that they can use readily available datasets, but also saves resources, since some intermediate datasets are generated using large-scale, resource- and compute-intensive scripts or modules.

Of particular interest to our work are the standards developed by the semantic web community for capturing provenance, notably the W3C PROV-O recommendation13 and its workflow-oriented extensions, e.g.,

13 https://www.w3.org/TR/prov-o

ProvONE14, OPMW15, wfprov16, and P-Plan [30]. The availability of provenance provides the means for the scientist to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as: Is this data helpful for my computational experiment? or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging, since the capture of the related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM to workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the reusability of workflow data products, as opposed to the workflow itself. As data reusability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.Tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.

The proposal by Garijo and Gil [10] is perhaps the closest to ours, in the sense that it focuses on data (as opposed to the experiment as a whole) and generates human-readable textual narratives from provenance information. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper differs from that of Garijo and Gil in that we do not aim to generate narratives. Instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see, e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work by Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14 http://vcvcomputing.com/provone/provone.html
15 http://www.opmw.org
16 http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare17 and Dataverse18, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that they are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project19. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6. Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries generated from provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented or machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives, which we intend to pursue in future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for incentivizing scientists to (1) provide tools with high-quality domain-specific

17 https://figshare.com
18 https://dataverse.org
19 http://tetherless-world.github.io/whyis

annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from l'Institut du Thorax and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility forproviding support and computing resources

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic versus manual provenance abstractions: Mind the gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8–9, 2016, USENIX, 2016.

[10] Y. Gil and D. Garijo, Towards automating data narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza et al., Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.

[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data – The story so far, International Journal on Semantic Web and Information Systems 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR metrics and results of the metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake — a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, Journal of Web Semantics 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with plans in P-PLAN: Scientific processes as linked data, in: Proceedings of the Second International Workshop on Linked Science 2012 – Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and bridging cross-workflow provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing cross-workflow provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics (SeWeBMeDA), co-located with ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific workflow reuse through conceptual workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data – Design issues, https://www.w3.org/DesignIssues/LinkedData.html, retrieved January 29, 2019.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995.

[39] A. Peter, R.C. Michael, T. Nebojša, C. Brad, C. John, H. Michael, K. Andrey, K. John, L. Dan, M. Hervé, N. Maya, S. Matt, S.-R. Stian and S. Luka, Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in databases: Why, how, and where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

  • Introduction
  • Motivations and Problem Statement
  • FRESH Approach
  • Experimental results and Discussion
    • Human-oriented data summaries
    • Machine-oriented data summaries
    • Implementation
    • Discussion
      • Related Work
      • Conclusion and Future Work
      • Acknowledgements
      • References
Page 9: Semantic Web 0 (0) 1 1 IOS Press Findable and Reusable … · how summarisation techniques can be used to reduce the machine-generated provenance information of such data products

Gaignard et al Findable and Reusable Workflow Data Products A genomic Workflow Case Study 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

put the annotated genomic variants (in blue in Fig-ure 5) to conduct a rare variant statistical analysisIf we consider that no semantics is attached to thenames of files or tools domain-agnostic provenancewould fail in providing information on the nature ofdata processing By looking on the human-oriented di-agram or by letting an algorithm query the machine-oriented nanopublication produced by FRESH sci-entists would be able to understand that the file re-sults from an annotation of single nucleotide polymor-phisms (SNPs) which was preceded by a variant call-ing step itself preceded by an insertiondeletion (Indel)detection step

We focused in this work on the bioinformatics do-main and leveraged BioTools a large-scale commu-nity effort aimed at semantically cataloguing availablealgorithmstools As soon as semantic tools catalogsare available for other domains FRESH can be ap-plied to enhance the findability and reusability of pro-cessed data Even if more recent similar efforts ad-dress the bioimaging community through the setup ofthe BISE12 bioimaging search engine (Neubias EUCOST Action) Annotated with a bioimaging-specificEDAM extension this tool registry could be queriedto annotate bioimaging data following the same ap-proach

5 Related Work

Our work is related to proposals that seek to enableand facilitate the reproducibility and reuse of scien-tific artifacts and findings We have seen recently theemergence of a number of solutions that assist scien-tists in the tasks of packaging resources that are neces-sary for preserving and reproducing their experimentsFor example OBI (Ontology for Biomedical Investi-gations) [27] and the ISA (Investigation Study As-say) model [28] are two widely used community mod-els from the life science domain for describing ex-periments and investigations OBI provides commonterms like investigations or experiments to describeinvestigations in the biomedical domain It also allowsthe use of domain-specific vocabularies or ontologiesto characterize experiment factors involved in the in-vestigation ISA on the other hand structures the de-scriptions about an investigation into three levels In-vestigation for describing the overall goals and means

12httpwwwbiiieu

used in the experiment Study for documenting infor-mation about the subject under study and treatmentsthat it may have undergone and Assay for representingthe measurements performed on the subjects ResearchObjects [29] is a workflow-friendly solution that pro-vides a suite of ontologies that can be used for aggre-gating workflow specification together with its execu-tions and annotations informing on the scientific hy-pothesis and other domain annotations ReproZip [7]is another solution that helps users create relativelylightweight packages that include all the dependenciesrequired to reproduce a workflow for experiments thatare executed using scripts in particular Python scripts

The above solutions are useful in that they help sci-entists package information they have about the ex-periment into a single container However they do nothelp scientists in actually annotating or reporting thefindings of their experiments In this respect Alper etal [9] and Gaignard et al [8] developed solutions thatprovide the users by the means for deriving annotationsfor workflow results and for summarizing the prove-nance information provided by the workflow systemsSuch summaries are utilized for reporting purposes

While we recognize that such proposals are of greathelp to the scientists and can be instrumental to a cer-tain extent to check the repeatability of experimentsthey are missing opportunities when it comes to thereuse of the intermediate data products that are gener-ated by their experiments Indeed the focus in the re-ports generated by the scientist is put on their scientificfindings documenting the hypothesis and experimentthey used and in certain cases the datasets obtained asa result of their experiment The intermediate datasetswhich are by-products of the internal steps of the ex-periment are in most cases buried in the provenanceof the experiment if not reported at all The availabilityof such intermediate datasets can be of value to third-party scientists to run their own experiment This doesnot only save time for those scientists in that they canuse readily available datasets but also save time andresources since some intermediate datasets are gener-ated using large-scale resource- and compute-intensivescripts or modules

Of particular interest to our work are the standardsdeveloped by the semantic web community for captur-ing provenance notably the W3C PROV-O recommen-dation13 and its workflow-oriented extensions eg

13httpswwww3orgTRprov-o

10 Gaignard et al Findable and Reusable Workflow Data Products A genomic Workflow Case Study

ProvONE14, OPMW15, wfprov16 and P-Plan [30]. The availability of provenance provides the means for scientists to issue queries on why and how data were produced. However, it does not necessarily allow scientists to examine questions such as: Is this data helpful for my computational experiment? or, if potentially useful, does this data have enough quality? Answering such questions is particularly challenging, since the capture of related metadata is in general domain-dependent and should be automated. This is partly due to the fact that provenance information can be overwhelming (large graphs), and partly because of a lack of domain annotations. In previous work [8], we proposed PoeM, an approach to generate human-readable experiment reports for scientific workflows based on provenance and user annotations. SHARP [31, 32] extends PoeM for workflows running in different systems and producing heterogeneous PROV traces. In this work, we capitalize on our previous work to annotate and summarize provenance information. In doing so, we focus on the re-usability of workflow data products, as opposed to the workflow itself. As data re-usability requires meeting domain-relevant community standards (R1.3 of the FAIR principles), we rely on the Bio.tools (https://bio.tools) registry to discover tool descriptions and automatically generate domain-specific data annotations.
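The kind of "how was this data produced?" question that PROV provenance supports can be sketched with a few triples and a lookup. The graph below is a hypothetical PROV-style trace (entity and activity names are illustrative, not from the paper); the helper mimics what a SPARQL query over `prov:wasGeneratedBy` and `prov:used` would retrieve.

```python
# A tiny PROV-style provenance trace encoded as (subject, predicate, object)
# triples. Entity/activity names are illustrative assumptions.
PROV = "http://www.w3.org/ns/prov#"
triples = [
    ("ex:aligned_bam", PROV + "wasGeneratedBy", "ex:alignment_step"),
    ("ex:alignment_step", PROV + "used", "ex:raw_fastq"),
    ("ex:aligned_bam", PROV + "wasDerivedFrom", "ex:raw_fastq"),
]

def how_produced(entity, triples):
    """Answer 'How was this data produced?': find the generating activity
    of an entity and the inputs that activity used."""
    activities = [o for s, p, o in triples
                  if s == entity and p == PROV + "wasGeneratedBy"]
    inputs = [o for s, p, o in triples
              if s in activities and p == PROV + "used"]
    return activities, inputs

acts, ins = how_produced("ex:aligned_bam", triples)
print(acts, ins)  # ['ex:alignment_step'] ['ex:raw_fastq']
```

In practice such traces are stored as RDF and queried with SPARQL; the point here is only the shape of the question that provenance alone can answer, as opposed to the quality- and reuse-oriented questions discussed above.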

The proposal by Garijo and Gil [10] is perhaps the closest to ours in the sense that it focuses on data (as opposed to the experiment as a whole) and generates, from provenance information, textual narratives that are human-readable. The key idea of data narratives is to keep detailed provenance records of how an analysis was done, and to automatically generate human-readable descriptions of those records that can be presented to users and ultimately included in papers or reports. The objective that we set out in this paper is different from that of Garijo and Gil in that we do not aim to generate narratives; instead, we focus on annotating intermediate workflow data. The scientific communities have already investigated solutions for summarizing and reusing workflows (see e.g., [33, 34]). In the same lines, we aim to facilitate the reuse of workflow data by providing summaries of their provenance together with domain annotations. In this respect, our work is complementary and compatible with the work of Garijo and Gil. In particular, the annotations and provenance summaries generated by the solution we

14 http://vcvcomputing.com/provone/provone.html
15 http://www.opmw.org
16 http://purl.org/wf4ever/wfprov

propose can be used to feed the system developed by Garijo and Gil to generate more concise and informative narratives.
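The earlier reliance on the Bio.tools registry to derive domain-specific annotations can be illustrated with a short sketch. The record below is a mock shaped like a bio.tools entry (the `function`/`operation`/`output` field names follow the bio.tools schema as an assumption; the values are illustrative), and the helper tags each tool output with the EDAM concepts of the operation that generated it.

```python
import json

# Mock registry record shaped like a Bio.tools entry (field names assumed
# from the bio.tools schema; values are illustrative, not authoritative).
biotools_record = json.loads("""
{
  "biotoolsID": "bwa",
  "function": [{
    "operation": [{"uri": "http://edamontology.org/operation_3198",
                   "term": "Read mapping"}],
    "output": [{"data": {"uri": "http://edamontology.org/data_1383",
                         "term": "Nucleic acid sequence alignment"}}]
  }]
}
""")

def annotations_for_outputs(record):
    """Derive domain-specific annotations for the data produced by a tool:
    each declared output is tagged with the EDAM operation that generated
    it and with its own EDAM data type."""
    annotations = []
    for function in record.get("function", []):
        operations = [op["uri"] for op in function.get("operation", [])]
        for output in function.get("output", []):
            annotations.append({
                "tool": record["biotoolsID"],
                "generated_by_operation": operations,
                "data_type": output["data"]["uri"],
            })
    return annotations

anns = annotations_for_outputs(biotools_record)
print(anns)
```

Joining such registry-derived annotations with the provenance trace is what turns a raw intermediate file into a findable, domain-annotated data product.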

Our work is also related to the efforts of the scientific community to create open repositories for the publication of scientific data, for example Figshare17 and Dataverse18, which help academic institutions store, share and manage all of their research outputs. The data summaries that we produce can be published in such repositories. However, we believe that the summaries we produce are better suited for repositories that publish knowledge graphs, e.g., the one created by the Whyis project19. This project proposes a nano-scale knowledge graph infrastructure to support domain-aware management and curation of knowledge from different sources.

6 Conclusion and Future Work

In this paper, we proposed FRESH, an approach for making scientific workflow data reusable, with a focus on genomic workflows. To do so, we utilized data summaries, which are generated based on provenance and domain-specific ontologies. FRESH comes in two flavors, providing concise human-oriented and machine-oriented data summaries. Experimentation with a production-level exome-sequencing workflow shows the effectiveness of FRESH in terms of time: the overhead of producing human-oriented and machine-oriented data summaries is negligible compared to the computing resources required to analyze exome-sequencing data. FRESH opens several perspectives which we intend to pursue in our future work. So far, we have focused in FRESH on the findability and reuse of workflow data products. We intend to extend FRESH to cater for the two remaining FAIR criteria. To do so, we will need to rethink and redefine interoperability and accessibility when dealing with workflow data products, before proposing solutions to cater for them. We also intend to apply and assess the effectiveness of FRESH on workflows from domains other than genomics. From the tooling point of view, we intend to investigate the publication of data summaries within public catalogues. We also intend to identify means for the incentivization of scientists to (1) provide tools with high-quality domain-specific

17 https://figshare.com
18 https://dataverse.org
19 http://tetherless-world.github.io/whyis

annotations, and (2) generate and use domain-specific data summaries to promote reuse.

7 Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH, who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility for providing support and computing resources.

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015) 457–493.

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017) 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015) 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012) 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic Versus Manual Provenance Abstractions: Mind the Gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8–9, 2016, USENIX, 2016. https://www.usenix.org/conference/tapp16/workshop-program/presentation/alper.

[10] Y. Gil and D. Garijo, Towards Automating Data Narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vařeková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows–Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.

[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data – The Story So Far, International Journal on Semantic Web and Information Systems 5(3) (2009) 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR Metrics and results of the Metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake — a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012) 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010) 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, Journal of Web Semantics 32 (2015) 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 – Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and Bridging Cross-Workflow Provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing Cross-workflow Provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, SeWeBMeDA@ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific Workflow Reuse Through Conceptual Workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data – Design Issues, Retrieved January 29, 2019. https://www.w3.org/DesignIssues/LinkedData.html.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007) 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] P. Amstutz, M.R. Crusoe, N. Tijanić, B. Chapman, J. Chilton, M. Heuer, A. Kartashov, D. Leehr, H. Ménager, M. Nedeljkovich, M. Scales, S. Soiland-Reyes and L. Stojanovic, Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in Databases: Why, How, and Where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017) 284–298. doi:10.1016/j.future.2017.01.012.

Page 10: Semantic Web 0 (0) 1 1 IOS Press Findable and Reusable … · how summarisation techniques can be used to reduce the machine-generated provenance information of such data products

10 Gaignard et al Findable and Reusable Workflow Data Products A genomic Workflow Case Study

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ProvONE14 OPMW15 wfprov16 and P-Plan [30] Theavailability of provenance provides the means for thescientist to issues queries on Why and How data wereproduced However it does not necessarily allow thescientists to examine questions such as Is this datahelpful for my computational experiment or ifpotentially useful does this data has enough quality is particularly challenging since the capture of re-lated meta-data is in general domain-dependent andshould be automated This is partly due to the fact thatprovenance information can be overwhelming (largegraphs) and partly because of a lack of domain annota-tions In previous work [8] we proposed PoeM an ap-proach to generate human-readable experiment reportsfor scientific workflows based on provenance and usersannotations SHARP [31 32] extends PoeM for work-flow running in different systems and producing het-erogeneous PROV traces In this work we capitalize inour previous work to annotate and summarize prove-nance information In doing so we focus on Workflowdata products re-usability as opposed to the workflowitself As data re-usability require to meet domain-relevant community standards (R13 of FAIR princi-ples) We rely on Biotools (httpsbiotools) registryto discover tools descriptions and automatically gener-ate domain-specific data annotations

The proposal by Garijo and Gil [10] is perhaps theclosest to ours in the sense that it focuses on data (asopposed to the experiment as a whole) and gener-ate textual narratives from provenance information thatis human-readable The key idea of data narratives isto keep detailed provenance records of how an anal-ysis was done and to automatically generate human-readable description of those records that can be pre-sented to users and ultimately included in papers or re-ports The objective that we set out in this paper is dif-ferent from that by Garijo and Gil in that we do notaim to generate narratives Instead we focus on anno-tating intermediate workflow data The scientific com-munities have already investigated solutions for sum-marizing and reusing workflows (see eg [33 34]) Inthe same lines we aim to facilitate the reuse of work-flow data by providing summaries of their provenancetogether with domain annotations In this respect ourwork is complementary and compatible with the workby Garijo and Gil In particular the annotations andprovenance summaries generated by the solution we

14vcvcomputingcomprovoneprovonehtml15wwwopmworg16httppurlorgwf4everwfprov

propose can be used to feed the system developed byGarijo and Gil to generate more concise and informa-tive narratives

Our work is also related to the efforts of the scien-tific community to create open repositories for the pub-lication of scientific data For example Figshare17 andDataverse18 which help academic institutions storeshare and manage all of their research outputs Thedata summaries that we produce can be published insuch repositories However we believe that the sum-maries that we produce are better suited for reposito-ries that publish knowledge graphs eg the one cre-ated by The whyis project19 This project proposes anano-scale knowledge graph infrastructure to supportdomain-aware management and curation of knowledgefrom different sources

6 Conclusion and Future Work

In this paper we proposed FRESH an approachfor making scientific workflow data reusable with afocus on genomic workflows To do so we utilizeddata-summaries which are generated based on prove-nance and domain-specific ontologies FRESH comesinto flavors by providing concise human-oriented andmachine-oriented data summaries Experimentationwith production-level exome-sequencing workflowshows the effectiveness of FRESH in term of time theoverhead of producing human-oriented and machine-oriented data summaries are negligible compared tothe computing resources required to analyze exome-sequencing data FRESH open several perspectiveswhich we intend to pursue in our future work So farwe have focused in FRESH on the findability andreuse of workflow data products We intend to extendFRESH to cater for the two remaining FAIR criteriaTo do so we will need to rethink and redefine interop-erability and accessibility when dealing with workflowdata products before proposing solutions to cater forthem We also intend to apply and assess the effective-ness of FRESH when it comes to workflows from do-mains other than genomics From the tooling point ofview we intend to investigate the publication of datasummaries within public catalogues We also intendto identify means for the incentivization of scientiststo (1) provide tools with high quality domain-specific

17httpsfigsharecom18httpsdataverseorg19httptetherless-worldgithubiowhyis

Gaignard et al Findable and Reusable Workflow Data Products A genomic Workflow Case Study 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

annotations (2) generate and use domain-specific datasummaries to promote reuse

7 Acknowledgements

We thank our colleagues from ldquolrsquoInstitut du Thoraxrdquoand CNRGH who provided their insights and exper-tise regarding the computational costs of large-scaleexome-sequencing data analysis

We are grateful to the BiRD bioinformatics facility forproviding support and computing resources

References

[1] J Liu E Pacitti P Valduriez and M Mattoso A surveyof data-intensive scientific workflow management Journal ofGrid Computing 13(4) (2015) 457ndash493

[2] SC Boulakia K Belhajjame O Collin J ChopardC Froidevaux A Gaignard K Hinsen P LarmandeYL Bras F Lemoine F Mareuil H Meacutenager C Pradaland C Blanchet Scientific workflows for computational re-producibility in the life sciences Status challenges andopportunities Future Generation Comp Syst 75 (2017)284ndash298 doi101016jfuture201701012 httpsdoiorg101016jfuture201701012

[3] ZD Stephens SY Lee F Faghri RH Campbell C ZhaiMJ Efron R Iyer MC Schatz S Sinha and GE Robin-son Big data Astronomical or genomical PLoS Biol-ogy 13(7) (2015) 1ndash11 ISSN 15457885 ISBN 1545-7885doi101371journalpbio1002195

[4] GR Abecasis A Auton LD Brooks MA DePristoR Durbin RE Handsaker HM Kang GT Marth andGA McVean An integrated map of genetic variation from1092 human genomes Nature 491(7422) (2012) 56ndash65ISSN 1476-4687 ISBN 1476-4687 (Electronic)r0028-0836(Linking) doi101038nature11632An httpdxdoiorg101038nature11632

[5] P Alper Towards harnessing computational workflow prove-nance for experiment reporting PhD thesis University ofManchester UK 2016 httpethosblukOrderDetailsdouin=ukblethos686781

[6] V Stodden The scientific method in practice Reproducibilityin the computational sciences (2010)

[7] F Chirigati R Rampin D Shasha and J Freire ReprozipComputational reproducibility with ease in Proceedings ofthe 2016 International Conference on Management of DataACM 2016 pp 2085ndash2088

[8] A Gaignard H Skaf-Molli and A Bihoueacutee From scientificworkflow patterns to 5-star linked open data in 8th USENIXWorkshop on the Theory and Practice of Provenance 2016

[9] P Alper K Belhajjame and CA Goble Automatic Ver-sus Manual Provenance Abstractions Mind the Gap in8th USENIX Workshop on the Theory and Practice ofProvenance TaPP 2016 Washington DC USA June 8-9 2016 USENIX 2016 httpswwwusenixorgconferencetapp16workshop-programpresentationalper

[10] Y Gil and D Garijo Towards Automating Data Narra-tives in Proceedings of the 22Nd International Confer-ence on Intelligent User Interfaces IUI rsquo17 ACM NewYork NY USA 2017 pp 565ndash576 ISBN 978-1-4503-4348-0 doi10114530251713025193 httpdoiacmorg10114530251713025193

[11] W et al The FAIR Guiding Principles for scientific data man-agement and stewardship Scientific Data 3 (2016)

[12] J Ison K Rapacki H Meacutenager M Kalaš E RydzaP Chmura C Anthon N Beard K Berka D BolserT Booth A Bretaudeau J Brezovsky R Casadio G Ce-sareni F Coppens M Cornell G Cuccuru K DavidsenG Della Vedova T Dogan O Doppelt-Azeroual L EmeryE Gasteiger T Gatter T Goldberg M Grosjean B GruuumlingM Helmer-Citterich H Ienasescu V Ioannidis MC Jes-persen R Jimenez N Juty P Juvan M Koch C LaibeJW Li L Licata F Mareuil I Micetic RM FriborgS Moretti C Morris S Moumlller A Nenadic H Peter-son G Profiti P Rice P Romano P Roncaglia R SaidiA Schafferhans V Schwaumlmmle C Smith MM SperottoH Stockinger RS Varekovaacute SCE Tosatto V De La TorreP Uva A Via G Yachdav F Zambelli G Vriend B RostH Parkinson P Loslashngreen and S Brunak Tools and data ser-vices registry A community effort to document bioinformat-ics resources Nucleic Acids Research (2016) ISSN 13624962ISBN 13624962 (Electronic) doi101093nargkv1116

[13] H Li and R Durbin Fast and accurate long-read align-ment with Burrows-Wheeler transform Bioinformatics (2010)ISSN 13674803 ISBN 1367-4811 (Electronic)r1367-4803(Linking) doi101093bioinformaticsbtp698

[14] H Li B Handsaker A Wysoker T Fennell J RuanN Homer G Marth G Abecasis and R Durbin The Se-quence AlignmentMap format and SAMtools Bioinformat-ics (2009) ISSN 13674803 ISBN 1367-4803r1460-2059doi101093bioinformaticsbtp352

[15] Broad Institute Picard tools 2016 ISSN 1949-2553 ISBN8438484395 doi101007s00586-004-0822-1

[16] C Alkan BP Coe and EE Eichler GATK toolkit Naturereviews Genetics (2011) ISSN 1471-0064 ISBN 1471-0064(Electronic)n1471-0056 (Linking) doi101038nrg2958

[17] W McLaren L Gil SE Hunt HS Riat GRS RitchieA Thormann P Flicek and F Cunningham The Ensembl Vari-ant Effect Predictor Genome Biology (2016) ISSN 1474760XISBN 1474760X (Electronic) doi101186s13059-016-0974-4

[18] ST Sherry dbSNP the NCBI database of genetic vari-ation Nucleic Acids Research (2001) ISSN 13624962ISBN 1362-4962 (Electronic)r0305-1048 (Linking)doi101093nar291308

[19] ND Olson SP Lund RE Colman JT Foster JW SahlJM Schupp P Keim JB Morrow ML Salit andJM Zook Best practices for evaluating single nucleotidevariant calling methods for microbial genomics 2015 ISSN16648021 ISBN 1664-8021 (Electronic)r1664-8021 (Link-ing) doi103389fgene201500235

[20] MDea Wilkinson A design framework and exemplar met-rics for FAIRness Scientific Data 5 (2018)

[21] C Bizer M-E Vidal and H Skaf-Molli Linked OpenData in Encyclopedia of Database Systems L Liu andMT AtildeUzsu eds Springer 2017 httpslinkspringercomreferenceworkentry101007978-1-4899-7993-3_80603-2

12 Gaignard et al Findable and Reusable Workflow Data Products A genomic Workflow Case Study

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[22] C Bizer T Heath and T Berners-Lee Linked Data - TheStory So Far Int J Semantic Web Inf Syst 5(3) (2009)1ndash22 doi104018jswis2009081901 httpsdoiorg104018jswis2009081901

[23] M Corpas N Kovalevskaya A McMurray and F Nielsen AFAIR guide for data providers to maximise sharing of humangenomic data PLoS Computational Biology 14(3)

[24] MDea Wilkinson FAIRMetricsMetrics Proposed FAIRMetrics and results of the Metrics evaluation questionnaire2018 doiZenodo httpsdoiorg105281zenodo1065973

[25] T Clark PN Ciccarese and CA Goble MicropublicationsA semantic model for claims evidence arguments and anno-tations in biomedical communications Journal of BiomedicalSemantics (2014) ISSN 20411480 ISBN 2041-1480 (Link-ing) doi1011862041-1480-5-28

[26] J Koumlster and S Rahmann SnakemakeacircATa scalable bioinfor-matics workflow engine Bioinformatics 28(19) (2012) 2520ndash2522

[27] RR Brinkman M Courtot D Derom JM Fostel Y HeP Lord J Malone H Parkinson B Peters P Rocca-Serraet al Modeling biomedical experimental processes with OBIin Journal of biomedical semantics Vol 1 BioMed Central2010 p 7

[28] P Rocca-Serra M Brandizi E Maguire N Sklyar C Tay-lor K Begley D Field S Harris W Hide O Hofmann etal ISA software suite supporting standards-compliant exper-imental annotation and enabling curation at the communitylevel Bioinformatics 26(18) (2010) 2354ndash2356

[29] K Belhajjame J Zhao D Garijo M Gamble KM HettneR Palma E Mina Oacute Corcho JM Goacutemez-Peacuterez S Bech-hofer G Klyne and CA Goble Using a suite of ontologiesfor preserving workflow-centric research objects J Web Se-mant 32 (2015) 16ndash42 doi101016jwebsem201501003httpsdoiorg101016jwebsem201501003

[30] D Garijo and Y Gil Augmenting PROV with Plans in P-PLAN Scientific Processes as Linked Data in Proceedingsof the Second International Workshop on Linked Science 2012- Tackling Big Data Boston MA USA November 12 2012CEUR-WSorg 2012 httpceur-wsorgVol-951paper6pdf



annotations, (2) generate and use domain-specific data summaries to promote reuse.

7. Acknowledgements

We thank our colleagues from "l'Institut du Thorax" and CNRGH who provided their insights and expertise regarding the computational costs of large-scale exome-sequencing data analysis.

We are grateful to the BiRD bioinformatics facility for providing support and computing resources.

References

[1] J. Liu, E. Pacitti, P. Valduriez and M. Mattoso, A survey of data-intensive scientific workflow management, Journal of Grid Computing 13(4) (2015), 457–493.

[2] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.

[3] Z.D. Stephens, S.Y. Lee, F. Faghri, R.H. Campbell, C. Zhai, M.J. Efron, R. Iyer, M.C. Schatz, S. Sinha and G.E. Robinson, Big data: Astronomical or genomical?, PLoS Biology 13(7) (2015), 1–11. doi:10.1371/journal.pbio.1002195.

[4] G.R. Abecasis, A. Auton, L.D. Brooks, M.A. DePristo, R. Durbin, R.E. Handsaker, H.M. Kang, G.T. Marth and G.A. McVean, An integrated map of genetic variation from 1,092 human genomes, Nature 491(7422) (2012), 56–65. doi:10.1038/nature11632.

[5] P. Alper, Towards harnessing computational workflow provenance for experiment reporting, PhD thesis, University of Manchester, UK, 2016.

[6] V. Stodden, The scientific method in practice: Reproducibility in the computational sciences (2010).

[7] F. Chirigati, R. Rampin, D. Shasha and J. Freire, ReproZip: Computational reproducibility with ease, in: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016, pp. 2085–2088.

[8] A. Gaignard, H. Skaf-Molli and A. Bihouée, From scientific workflow patterns to 5-star linked open data, in: 8th USENIX Workshop on the Theory and Practice of Provenance, 2016.

[9] P. Alper, K. Belhajjame and C.A. Goble, Automatic Versus Manual Provenance Abstractions: Mind the Gap, in: 8th USENIX Workshop on the Theory and Practice of Provenance, TaPP 2016, Washington, DC, USA, June 8-9, 2016, USENIX, 2016. https://www.usenix.org/conference/tapp16/workshop-program/presentation/alper.

[10] Y. Gil and D. Garijo, Towards Automating Data Narratives, in: Proceedings of the 22nd International Conference on Intelligent User Interfaces, IUI '17, ACM, New York, NY, USA, 2017, pp. 565–576. doi:10.1145/3025171.3025193.

[11] M.D. Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016).

[12] J. Ison, K. Rapacki, H. Ménager, M. Kalaš, E. Rydza, P. Chmura, C. Anthon, N. Beard, K. Berka, D. Bolser, T. Booth, A. Bretaudeau, J. Brezovsky, R. Casadio, G. Cesareni, F. Coppens, M. Cornell, G. Cuccuru, K. Davidsen, G. Della Vedova, T. Dogan, O. Doppelt-Azeroual, L. Emery, E. Gasteiger, T. Gatter, T. Goldberg, M. Grosjean, B. Grüning, M. Helmer-Citterich, H. Ienasescu, V. Ioannidis, M.C. Jespersen, R. Jimenez, N. Juty, P. Juvan, M. Koch, C. Laibe, J.W. Li, L. Licata, F. Mareuil, I. Micetic, R.M. Friborg, S. Moretti, C. Morris, S. Möller, A. Nenadic, H. Peterson, G. Profiti, P. Rice, P. Romano, P. Roncaglia, R. Saidi, A. Schafferhans, V. Schwämmle, C. Smith, M.M. Sperotto, H. Stockinger, R.S. Vareková, S.C.E. Tosatto, V. De La Torre, P. Uva, A. Via, G. Yachdav, F. Zambelli, G. Vriend, B. Rost, H. Parkinson, P. Løngreen and S. Brunak, Tools and data services registry: A community effort to document bioinformatics resources, Nucleic Acids Research (2016). doi:10.1093/nar/gkv1116.

[13] H. Li and R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform, Bioinformatics (2010). doi:10.1093/bioinformatics/btp698.

[14] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and R. Durbin, The Sequence Alignment/Map format and SAMtools, Bioinformatics (2009). doi:10.1093/bioinformatics/btp352.

[15] Broad Institute, Picard tools, 2016.

[16] C. Alkan, B.P. Coe and E.E. Eichler, GATK toolkit, Nature Reviews Genetics (2011). doi:10.1038/nrg2958.

[17] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek and F. Cunningham, The Ensembl Variant Effect Predictor, Genome Biology (2016). doi:10.1186/s13059-016-0974-4.

[18] S.T. Sherry, dbSNP: the NCBI database of genetic variation, Nucleic Acids Research (2001). doi:10.1093/nar/29.1.308.

[19] N.D. Olson, S.P. Lund, R.E. Colman, J.T. Foster, J.W. Sahl, J.M. Schupp, P. Keim, J.B. Morrow, M.L. Salit and J.M. Zook, Best practices for evaluating single nucleotide variant calling methods for microbial genomics, 2015. doi:10.3389/fgene.2015.00235.

[20] M.D. Wilkinson et al., A design framework and exemplar metrics for FAIRness, Scientific Data 5 (2018).

[21] C. Bizer, M.-E. Vidal and H. Skaf-Molli, Linked Open Data, in: Encyclopedia of Database Systems, L. Liu and M.T. Özsu, eds, Springer, 2017.


[22] C. Bizer, T. Heath and T. Berners-Lee, Linked Data - The Story So Far, Int. J. Semantic Web Inf. Syst. 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.

[23] M. Corpas, N. Kovalevskaya, A. McMurray and F. Nielsen, A FAIR guide for data providers to maximise sharing of human genomic data, PLoS Computational Biology 14(3).

[24] M.D. Wilkinson et al., FAIRMetrics/Metrics: Proposed FAIR Metrics and results of the Metrics evaluation questionnaire, 2018. doi:10.5281/zenodo.1065973.

[25] T. Clark, P.N. Ciccarese and C.A. Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2014). doi:10.1186/2041-1480-5-28.

[26] J. Köster and S. Rahmann, Snakemake - a scalable bioinformatics workflow engine, Bioinformatics 28(19) (2012), 2520–2522.

[27] R.R. Brinkman, M. Courtot, D. Derom, J.M. Fostel, Y. He, P. Lord, J. Malone, H. Parkinson, B. Peters, P. Rocca-Serra et al., Modeling biomedical experimental processes with OBI, in: Journal of Biomedical Semantics, Vol. 1, BioMed Central, 2010, p. 7.

[28] P. Rocca-Serra, M. Brandizi, E. Maguire, N. Sklyar, C. Taylor, K. Begley, D. Field, S. Harris, W. Hide, O. Hofmann et al., ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level, Bioinformatics 26(18) (2010), 2354–2356.

[29] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K.M. Hettne, R. Palma, E. Mina, Ó. Corcho, J.M. Gómez-Pérez, S. Bechhofer, G. Klyne and C.A. Goble, Using a suite of ontologies for preserving workflow-centric research objects, Journal of Web Semantics 32 (2015), 16–42. doi:10.1016/j.websem.2015.01.003.

[30] D. Garijo and Y. Gil, Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data, in: Proceedings of the Second International Workshop on Linked Science 2012 - Tackling Big Data, Boston, MA, USA, November 12, 2012, CEUR-WS.org, 2012. http://ceur-ws.org/Vol-951/paper6.pdf.

[31] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing and Bridging Cross-Workflow Provenance, in: The Semantic Web: ESWC 2017 Satellite Events, Portorož, Slovenia, Revised Selected Papers, 2017, pp. 219–234.

[32] A. Gaignard, K. Belhajjame and H. Skaf-Molli, SHARP: Harmonizing Cross-workflow Provenance, in: Proceedings of the Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics (SeWeBMeDA), co-located with the 14th Extended Semantic Web Conference, ESWC 2017, Portorož, Slovenia, 2017, pp. 50–64. http://ceur-ws.org/Vol-1948/paper5.pdf.

[33] J. Starlinger, S. Cohen-Boulakia and U. Leser, (Re)use in public scientific workflow repositories, in: Lecture Notes in Computer Science, 2012. doi:10.1007/978-3-642-31235-9_24.

[34] N. Cerezo and J. Montagnat, Scientific Workflow Reuse Through Conceptual Workflows on the Virtual Imaging Platform, in: Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, ACM, New York, NY, USA, 2011, pp. 1–10. doi:10.1145/2110497.2110499.

[35] T. Berners-Lee, Linked Data - Design Issues, retrieved January 29, 2019. https://www.w3.org/DesignIssues/LinkedData.html.

[36] J.S. Poulin, Measuring software reusability, in: Proceedings of the 3rd International Conference on Software Reuse, ICSR 1994, 1994, pp. 126–138. doi:10.1109/ICSR.1994.365803.

[37] S. Ajila and D. Wu, Empirical study of the effects of open source adoption on software development economics, Journal of Systems and Software 80(9) (2007), 1517–1529. doi:10.1016/j.jss.2007.01.011.

[38] D. Adams, The Hitchhiker's Guide to the Galaxy, San Val, 1995. ISBN 9781417642595.

[39] A. Peter, R.C. Michael, T. Nebojša, C. Brad, C. John, H. Michael, K. Andrey, K. John, L. Dan, M. Hervé, N. Maya, S. Matt, S.-R. Stian and S. Luka, Common Workflow Language, v1.0, Figshare, 2016. doi:10.6084/m9.figshare.3115156.v2.

[40] J. Cheney, L. Chiticariu and W.-C. Tan, Provenance in Databases: Why, How, and Where, Foundations and Trends in Databases (2007). doi:10.1561/1900000006.

[41] M. Atkinson, S. Gesing, J. Montagnat and I. Taylor, Scientific workflows: Past, present and future, Future Generation Computer Systems (2017). doi:10.1016/j.future.2017.05.041.

[42] S. Cohen-Boulakia, K. Belhajjame, O. Collin, J. Chopard, C. Froidevaux, A. Gaignard, K. Hinsen, P. Larmande, Y. Le Bras, F. Lemoine, F. Mareuil, H. Ménager, C. Pradal and C. Blanchet, Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities, Future Generation Computer Systems 75 (2017), 284–298. doi:10.1016/j.future.2017.01.012.


