
Reproducible Bioinformatics Project: A community for reproducible bioinformatics analysis pipelines

Neha Kulkarni 1, Luca Alessandrì 1, Riccardo Panero 1, Maddalena Arigoni 1, Martina Olivero 2, Francesca Cordero 3$, Marco Beccuti 3 and Raffaele A Calogero 1$

1 Dept. of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy

2 Dept. of Oncology, University of Torino, Candiolo, Italy

3 Dept. of Computer Sciences, University of Torino, Torino, Italy

Neha Kulkarni [email protected]

Luca Alessandrì [email protected]

Riccardo Panero [email protected]

Maddalena Arigoni [email protected]

Martina Olivero [email protected]

Francesca Cordero [email protected]

Marco Beccuti [email protected]

Raffaele A Calogero [email protected]

$ Corresponding author

bioRxiv preprint first posted online Dec. 26, 2017; doi: http://dx.doi.org/10.1101/239947. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

Abstract

Background: Reproducibility of research is a key element of modern science and is mandatory for any industrial application. It represents the ability to replicate an experiment independently of the location and the operator. Therefore, a study can be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, to reproduce a complex bioinformatics analysis today, the raw data and a list of the tools used in the workflow may not be enough to guarantee the reproducibility of the results obtained. Indeed, different releases of the same tools and/or of the system libraries (exploited by such tools) might lead to subtle reproducibility issues.

Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit and open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and an R package, to deliver reproducible results in bioinformatics. One or more Docker images are defined for a workflow (typically one for each task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. Thus, a bioinformatician participating in the project first has to integrate her/his workflow modules into Docker image(s), exploiting an Ubuntu Docker image developed ad hoc by RBP to make this task easier. Second, the workflow implementation must be realized in R according to an R skeleton function made available by RBP to guarantee homogeneity and reusability among different RBP functions. Moreover, she/he has to provide an R vignette explaining the package functionality, together with an example dataset that can be used to improve user confidence in the use of the workflow.

Conclusions: The Reproducible Bioinformatics Project provides a general schema and an infrastructure to distribute robust and reproducible workflows. Thus, it guarantees to final users the ability to repeat any analysis consistently, independently of the UNIX-like architecture used.


Keywords

Reproducible research, docker, whole transcriptome sequencing, miRNA sequencing, ChIP sequencing, community, SNV.

Background

Recently, Baker and Lithgow [1, 2] highlighted the problem of reproducibility in research. Reproducibility issues affect, to different extents, a large portion of scientific fields [1]. Since bioinformatics nowadays plays an important role in many biological and medical studies [3], a great effort must be made to make such computational analyses reproducible [4, 5]. Reproducibility issues in bioinformatics might be due to the short half-life of bioinformatics software, the complexity of the pipelines, the uncontrolled effects induced by changes in the system libraries, the incompleteness or imprecision of workflow descriptions, etc. To deal with reproducibility issues in bioinformatics, Sandve [5] suggested ten good-practice rules for the development of a computational workflow (Table 1). A community that fulfills some of the rules suggested by Sandve is the Bioconductor [6] project, which provides version control for a large number of genomics/bioinformatics packages. In this way, old releases of any Bioconductor package are kept available to users. However, Bioconductor does not cover all the steps of every possible bioinformatics workflow; e.g. in an RNAseq workflow, the fastq trimming and alignment steps are generally done using tools not implemented in Bioconductor. BaseSpace [7, 8] and Galaxy [9] are examples of commercial and open-source solutions, respectively, which partially fulfill Sandve's rules. Furthermore, the workflows implemented in such environments cannot be heavily customized, e.g. BaseSpace has strict rules for application submission. Moreover, cloud applications such as BaseSpace have to cope with legal and ethical issues [10]. On the other hand, Galaxy does not provide standardized metadata to annotate workflows.


Recently, container technology, a lightweight OS-level virtualization, was explored in the area of bioinformatics to make the distribution, use and maintenance of bioinformatics software easier [11-13]. Indeed, since applications and their dependencies are packaged together in the container image, users do not have to download and install all the dependencies required by an application, thus avoiding all the cases where the dependencies are poorly documented or not available at all. Moreover, problems related to version conflicts or updates of the system libraries do not occur, because containers are isolated from the rest of the operating system.

Among the available container platforms, Docker (http://www.docker.com) is becoming the de facto standard environment to quickly compose, create, deploy, scale and oversee containerized applications under Linux. Its strengths are a high degree of portability, which allows users to register and share containers over various hosts in private and public repositories, more effective resource use, and faster deployment compared with other software.

Although Menegidio [13], da Veiga [11] and Kim [12] provided large collections of bioinformatics instruments based on Docker technology, we still lack a community delivering to bioinformaticians a controlled but flexible framework to distribute Docker-based workflows under the umbrella of a reproducibility framework. Here, we describe the implementation of the Reproducible Bioinformatics Project (RBP, http://reproducible-bioinformatics.org/), which aims to distribute Docker-based applications to the bioinformatics community under the reproducibility framework proposed by Sandve [5]. RBP accepts simple Docker implementations of bioinformatics software (e.g. a Docker image embedding the bwa aligner), implementations of complex pipelines involving the use of multiple Docker images (e.g. an RNAseq workflow providing all the steps for an analysis, from the quality control of the fastq files to differential expression), as well as demonstrative workflows (i.e. Docker images embedding the full bioinformatics workflow used in a publication) intended to provide the ability to reproduce published data.

Implementation

The Reproducible Bioinformatics Project (RBP) reference web page is reproducible-bioinformatics.org. The project is based on three modules (Figure 1): (i) the docker4seq R package (https://github.com/kendomaniac/docker4seq), (ii) Docker images (https://hub.docker.com/u/repbioinfo/), and (iii) 4SeqGUI (https://github.com/mbeccuti/4SeqGUI).

The docker4seq package provides the connection between users and Docker containers. The package is organized in two branches: stable and development. The transition from the development to the stable branch is made when a module (R function(s)/Docker container(s)) fulfills the 10 rules suggested by Sandve [5] for good bioinformatics practice (Table 1).

The function skeleton.R in docker4seq provides a prototype for building a Docker controlling function. Acknowledgment of the developer's work is provided within the structure of skeleton.R, which has a field indicating the developer's affiliation and contact email. The Docker image repository docker.io/repbioinfo offers an Ubuntu image as a prototype for the creation of a Docker image compliant with the RBP specifications. Developers are free to use this prototype or to adapt a different Linux Docker distribution for their application. Docker images designed by the core developers of RBP are located in docker.io/repbioinfo (docker.com), whereas images developed by third parties can be placed in any public-access Docker repository. RBP requires that any operation involving the use of R/Bioconductor packages or external software be implemented in a Docker container. Only reformatting actions, e.g. table assembly, data reordering, etc., can be handled outside a Docker image.
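To illustrate what a Docker controlling function has to do, the following Python sketch assembles a `docker run` command line that pins an image to an explicit tag and mounts a host data folder into the container. It is not the docker4seq skeleton.R, which is written in R and is not reproduced here; the function name, the image name `repbioinfo/example`, the tag, the script path and the `/data` mount point are all hypothetical. The command is only built, not executed.

```python
import shlex


def build_docker_command(image, tag, data_dir, tool_args, mount_point="/data"):
    """Assemble (but do not run) a `docker run` command for one workflow step.

    Every name used here is an illustrative assumption, not docker4seq's API.
    """
    if not tag or tag == "latest":
        # A floating tag could silently resolve to a different image later on.
        raise ValueError("pin an explicit image tag for reproducibility")
    return [
        "docker", "run",
        "--rm",                             # remove the container when it exits
        "-v", f"{data_dir}:{mount_point}",  # expose the host data folder
        f"{image}:{tag}",
        *tool_args,
    ]


cmd = build_docker_command(
    image="repbioinfo/example",    # hypothetical image name
    tag="2017.01",                 # hypothetical tag
    data_dir="/home/user/fastq",   # hypothetical host folder
    tool_args=["sh", "/bin/run_step.sh", "/data/sample.fastq.gz"],
)
print(shlex.join(cmd))
```

Refusing a missing or `latest` tag is a design choice of this sketch: it keeps the image version explicit in every recorded command.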


Any new RBP module (R function(s)/Docker image(s)) must be associated with an explanatory vignette, accessible online as an HTML document, and with a set of test data, also accessible online. Thus, all the instruments needed to gain confidence in the module's functionality are provided to the final user.

Docker images are labelled with the extension YYYY.NN, where YYYY is the year of insertion in the stable version and NN is a progressive number. YYYY changes only if the program(s) implemented in the Docker image are updated, because any such update will affect the reproducibility of the workflow. Previous version(s) will remain available in the repository. NN refers to changes in the Docker image that do not affect the reproducibility of the workflow.
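The YYYY.NN labelling rule can be expressed as a small check. The Python sketch below parses two labels and reports whether, under that rule, they are expected to give the same results (same YYYY, any NN). The function and class names are hypothetical, and the assumption that NN is exactly two digits is ours; the text only says it is a progressive number.

```python
import re
from typing import NamedTuple

# Assumption: four-digit year, a dot, then a two-digit progressive number.
TAG_PATTERN = re.compile(r"^(\d{4})\.(\d{2})$")


class ImageTag(NamedTuple):
    year: int      # YYYY: changes when an embedded program is updated
    revision: int  # NN: changes that do not affect reproducibility


def parse_tag(tag: str) -> ImageTag:
    """Split a YYYY.NN label into its two numeric parts."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a YYYY.NN label: {tag!r}")
    return ImageTag(int(match.group(1)), int(match.group(2)))


def same_results_expected(tag_a: str, tag_b: str) -> bool:
    """True when both labels share YYYY, i.e. the embedded programs are unchanged."""
    return parse_tag(tag_a).year == parse_tag(tag_b).year


print(same_results_expected("2017.01", "2017.03"))  # True
print(same_results_expected("2017.03", "2018.01"))  # False
```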

A new module can be submitted to info@reproducible-bioinformatics.org, and the RBP core team will verify its compliance with Sandve's [5] rules. Once validated, the R functions controlling the new module are inserted into the docker4seq stable release. Partially validated modules will be placed in the development branch and moved to the stable one when compliance with Sandve's rules is fulfilled.
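The placement rule above amounts to a checklist over the ten rules of Table 1. The following Python sketch is only an illustration of that decision; the function name and the idea of passing the satisfied rule numbers as a collection are assumptions, not a description of how the RBP core team actually records its review.

```python
# The ten rules of Table 1 are referred to by number (1..10).
RULE_NUMBERS = tuple(range(1, 11))


def assign_branch(satisfied_rules):
    """Return the target branch and the list of rules still to be fulfilled.

    A module goes to "stable" only when all ten rules are satisfied;
    otherwise it stays in "development".
    """
    satisfied = set(satisfied_rules)
    missing = [rule for rule in RULE_NUMBERS if rule not in satisfied]
    branch = "stable" if not missing else "development"
    return branch, missing


print(assign_branch(range(1, 11)))                    # ('stable', [])
print(assign_branch([1, 2, 3, 4, 5, 7, 8, 9, 10]))    # ('development', [6])
```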

4SeqGUI is a Java-based graphical interface to docker4seq functions. It is designed to provide a GUI to users with limited knowledge of R scripting. Currently, the GUI embeds only general-purpose workflows, such as the RNAseq, miRNAseq and ChIP-seq workflows.

Results

The stable branch of the docker4seq R package contains all the R functions required to handle all the steps of the RNAseq workflow (Fig. 2A), ChIPseq workflow (Fig. 2B), and miRNAseq workflow (Fig. 2C). The package also provides a wrapper function for the Illumina bcl2fastq tool to convert the Illumina sequencer output into demultiplexed fastq files (Fig. 2). The fastq files can then be handled with any of the three workflows. The counts table produced by the RNAseq or miRNAseq workflows can be used for data visualization (pca, principal component analysis function), to evaluate the statistical power of the experiment (experimentPower function), to define the optimal sample size of the experiment for the detection of differentially expressed genes (sampleSize function), and to detect differentially expressed genes/transcripts (wrapperDeseq2 function). Sample size/statistical power estimation and differential expression are calculated via the RnaSeqSampleSize [14] and DESeq2 [15] Bioconductor packages, respectively.

In the development branch, the main effort of the core developers is focused on providing workflows for DNA and RNA somatic variant calling. The DNA variant calling workflow embeds the pre-processing procedure suggested by the GATK best practices (Fig. 3A). RNAseq data preparation for variant calling (Fig. 3C) requires the STAR 2-step procedure [16], which provides significantly increased sensitivity to novel splice junctions. Then, after sorting and duplicate marking, OPOSSUM [17] is used to remove intronic regions and to merge overlapping reads. We have also implemented a specific procedure (Fig. 3B), based on the xenome software [18], to discriminate between human reads and mouse host reads in the sequences produced by the analysis of patient-derived xenografts (PDX, [19]). As part of the somatic variant calling workflow, we are implementing MUTECT 1 and 2 [20] (Fig. 4A) to call somatic variants, as well as PLATYPUS [21] for extracting information on joined-sample SNVs (Fig. 4B).

We are also expanding the RNAseq module by adding the reference-free Salmon aligner [22], which uses less memory for the alignment task than STAR while providing similar results [23].

Finally, the HashClone framework (accepted for publication in BMC Bioinformatics), a new suite of bioinformatics tools providing B-cell clonality assessment and minimal residual disease (MRD) monitoring over time from deep sequencing data, was integrated into the docker4seq package. In particular, a parallel version of the standard HashClone workflow (Fig. 5) was developed exploiting the Docker architecture.


All the modules described above are implemented in 18 Docker images deposited in Docker Hub (https://hub.docker.com/u/repbioinfo/).

As part of the RBP, we have also developed a GUI, 4SeqGUI (https://github.com/mbeccuti/4SeqGUI). The GUI is implemented in Java and can be used to run the whole transcriptome sequencing workflow (Fig. 2A), ChIP sequencing workflow (Fig. 2B), and miRNA sequencing workflow (Fig. 2C).

Discussion

Bioinformatics workflows are becoming an essential part of many research papers. However, the absence of clear and well-defined rules on code distribution makes the results of most published research unreproducible [24]. Recently, Almugbel and coworkers [25] described an interesting infrastructure to embed Bioconductor-based packages. However, Bioconductor does not cover all the steps of every possible bioinformatics workflow, thus providing a limited framework for developing complex pipelines. In contrast, RBP represents a new instrument that expands the idea of Almugbel [25], providing a more flexible infrastructure that allows the bioinformatics community to share their work under the guidance of rules that guarantee inter-laboratory reproducibility and do not limit Docker implementations to Bioconductor packages. RBP core developers created frameworks for RNA/miRNA quantification and analysis. A ChIPseq workflow was also developed, and variant calling workflows for DNA and RNA are under active development. A peculiar feature of RBP is the acceptance of demonstrative workflows, i.e. bioinformatics procedures described in a biological/medical paper. A demonstrative workflow is wrapped in a Docker image and is supported by a tutorial, which describes step by step how the analysis is done to guarantee the reproducibility of published data.


Availability and requirements

Project name: Reproducible Bioinformatics Project

Project home page: http://reproducible-bioinformatics.org

Operating system: UNIX-like

Programming language: R

Other requirements: docker version 17.05.0-ce or higher

License: GPL.
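The Docker requirement listed above can be checked programmatically. The Python sketch below compares a version string, in the form printed by `docker --version`, against the 17.05.0 minimum. The function names are hypothetical, the build identifiers in the examples are placeholders, and obtaining the string from a real Docker installation is deliberately left out.

```python
import re

MINIMUM_VERSION = (17, 5, 0)  # from the requirement "17.05.0-ce or higher"


def parse_docker_version(text):
    """Extract (major, minor, patch) from text such as 'Docker version 17.05.0-ce, build ...'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_requirement(text, minimum=MINIMUM_VERSION):
    """True when the reported version is at least the required minimum."""
    return parse_docker_version(text) >= minimum


# Placeholder strings; the build identifiers are not real.
print(meets_requirement("Docker version 17.05.0-ce, build abc1234"))  # True
print(meets_requirement("Docker version 1.13.1, build abc1234"))      # False
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "1.13.1" as newer than "17.05.0".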

Declarations

Competing interests

None

Funding

This work has been supported by the EPIGEN FLAG PROJECT

Authors' contributions

NK and LA equally contributed to the development of the miRNA workflow and all the other tools. RP and FC developed the RNAseq workflow and refined the ChIPseq workflow. MA and MO performed application testing. MB and RAC developed the rules to submit tools and workflows to the Reproducible Bioinformatics community. RAC and MB equally supervised the overall work.


Figure captions

Figure 1: Reproducible Bioinformatics Project structure.

Figure 2: Workflows available in the stable branch of docker4seq. A) Whole transcriptome sequencing workflow, B) ChIP sequencing workflow, and C) miRNA sequencing workflow. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements in common among more than one workflow.

Figure 3: Variant calling workflows under refinement in the development branch of docker4seq. A) SNV calling in the DNA workflow. The function snvPreprocessing requires that users provide their own copy of the GATK software, because of Broad Institute license restrictions. This function returns a sorted bam file, with duplicates marked, after GATK indel realignment and quality recalibration. B) Data preprocessing for samples derived from patient-derived xenografts (PDX). The xenome function discriminates between the mouse host reads and the human tumor reads; DNA or RNA SNV calling workflows can then be applied. C) SNV calling in the RNA workflow. The function star2steps generates a sorted bam, where duplicates are marked, which is processed by opossum for removal of intronic regions and merging of overlapping reads. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements in common between more than one workflow.

Figure 4: Variant calling workflows under development in the development branch of docker4seq. A) Somatic SNV detection using GATK MUTECT 1 or 2. B) Platypus-based joint mutation caller. Dashed blocks are not implemented yet.


Figure 5: HashClone pipeline. The HashClone strategy is organized in three steps. The first step (red box) is used to detect k-mers in all patients' samples. The second step (green box) focuses on the generation of sequence signatures leading to the identification of the set of putative clones present in each patient's sample. The third step (blue box) is used for the characterization and evaluation of the cancer clones.

233

References

1. Baker M: 1,500 scientists lift the lid on reproducibility. Nature 2016, 533(7604):452-454.

2. Lithgow GJ, Driscoll M, Phillips P: A long journey to reproducible results. Nature 2017, 548(7668):387-388.

3. Searls DB: The roots of bioinformatics. PLoS computational biology 2010, 6(6):e1000809.

4. Kanwal S, Khan FZ, Lonie A, Sinnott RO: Investigating reproducibility and tracking provenance - A genomic workflow case study. BMC bioinformatics 2017, 18(1):337.

5. Sandve GK, Nekrutenko A, Taylor J, Hovig E: Ten simple rules for reproducible computational research. PLoS computational biology 2013, 9(10):e1003285.

6. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al: Bioconductor: open software development for computational biology and bioinformatics. Genome biology 2004, 5(10):R80.

7. Colombo AR, J. Triche TJ, Ramsingh G: Arkas: Rapid reproducible RNAseq analysis. F1000Res 2017, 6:586.

8. Van Neste C, Gansemans Y, De Coninck D, Van Hoofstat D, Van Criekinge W, Deforce D, Van Nieuwerburgh F: Forensic massively parallel sequencing data analysis tool: Implementation of MyFLq as a standalone web- and Illumina BaseSpace((R))-application. Forensic Sci Int Genet 2015, 15:2-7.


9. Digan W, Countouris H, Barritault M, Baudoin D, Laurent-Puig P, Blons H, Burgun A, Rance B: An Architecture for Genomics Analysis in a Clinical Setting Using Galaxy and Docker. Gigascience 2017.

10. Dove ES, Joly Y, Tasse AM, Public Population Project in G, Society International Steering C, International Cancer Genome Consortium E, Policy C, Knoppers BM: Genomic cloud computing: legal and ethical points to consider. European journal of human genetics: EJHG 2015, 23(10):1271-1278.

11. da Veiga Leprevost F, Gruning BA, Alves Aflitos S, Rost HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J et al: BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 2017, 33(16):2580-2582.

12. Kim B, Ali T, Lijeron C, Afgan E, Krampis K: Bio-Docklets: virtualization containers for single-step execution of NGS pipelines. Gigascience 2017, 6(8):1-7.

13. Menegidio FB, Jabes DL, Costa de Oliveira R, Nunes LR: Dugong: a Docker image, based on Ubuntu Linux, focused on reproducibility and replicability for bioinformatics analyses. Bioinformatics 2017.

14. Ching T, Huang S, Garmire LX: Power analysis and sample size estimation for RNA-Seq differential expression. RNA 2014, 20(11):1684-1696.

15. Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 2014, 15(12):550.

16. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15-21.

17. Oikkonen L, Lise S: Making the most of RNA-seq: Pre-processing sequencing data with Opossum for reliable SNP variant detection. Wellcome Open Res 2017, 2:6.


18. Conway T, Wazny J, Bromage A, Tymms M, Sooraj D, Williams ED, Beresford-Smith B: Xenome--a tool for classifying reads from xenograft samples. Bioinformatics 2012, 28(12):i172-178.

19. Siolas D, Hannon GJ: Patient-derived tumor xenografts: transforming clinical samples into mouse models. Cancer research 2013, 73(17):5315-5319.

20. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G: Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology 2013, 31(3):213-219.

21. Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, Consortium WGS, Wilkie AOM, McVean G, Lunter G: Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature genetics 2014, 46(8):912-918.

22. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C: Salmon provides fast and bias-aware quantification of transcript expression. Nature methods 2017, 14(4):417-419.

23. Zhang C, Zhang B, Lin LL, Zhao S: Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC genomics 2017, 18(1):583.

24. Hothorn T, Leisch F: Case studies in reproducibility. Briefings in bioinformatics 2011, 12(3):288-300.

25. Almugbel R, Hung LH, Hu J, Almutairy A, Ortogero N, Tamta Y, Yeung KY: Reproducible Bioconductor workflows using browser-based interactive notebooks and containers. J Am Med Inform Assoc 2017.

Tables

Table 1: Good practice bioinformatics rules, derived from Sandve et al. [5]

1 For Every Result, Keep Track of How It Was Produced

2 Avoid Manual Data Manipulation Steps

3 Archive the Exact Versions of All External Programs Used

4 Version Control All Custom Scripts

5 Record All Intermediate Results, When Possible in Standardized Formats

6 For Analyses That Include Randomness, Note Underlying Random Seeds

7 Always Store Raw Data behind Plots

8 Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

9 Connect Textual Statements to Underlying Results

10 Provide Public Access to Scripts, Runs, and Results
