building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/paul... · simple...

34
Building bioinformatic pipelines 6/20/2019 P. Zumbo

Upload: others

Post on 29-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Building bioinformatic pipelines

6/20/2019

P.Zumbo

Page 2: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

What is a pipeline?

Apipelineorworkflowreferstoaseriesofprocessingstepssuchthatoutputofeachprocessistheinputofthenext,typicallydonetotransformrawdataintosomethingmoreinterpretable.

Page 3: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Why bother building pipelines?

1. Reproducibility2. Dataprovenance3. Automation4. Transparency

Page 4: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Pipelines aid in reproducibility

Reproducibility=obtainingthesameresult*usingthesamecodeanddata*withinreason(e.g.,somealignersassignmulti-mappingreadstoarandomlocation)

Page 5: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Data provenance contextualizes results

Provenancereferstothedescriptionoftheoriginofapieceofdata• Thestepstakentoarriveatapieceofdata• Thesoftwareused• Theversionofthesoftwareused• Theargumentssuppliedtothesoftwareused

Page 6: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Automation: the amount of data keeps increasing

StephensZD,LeeSY,FaghriF,CampbellRH,ZhaiC,etal.(2015)BigData:AstronomicalorGenomical?.PLOSBiology13(7):e1002195.https://doi.org/10.1371/journal.pbio.1002195http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

Page 7: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Automation: some pipelines complex

Page 8: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

From:AreviewofbioinformaticpipelineframeworksBriefBioinform.2016;18(3):530-536.doi:10.1093/bib/bbw020

Page 9: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Simple alignment pipeline with bowtie2

#alignreadswithbowtie2bowtie2-xref.fa–Ushort_read.fq>aln-se.sam#convertfromsamtobamsamtoolsview-bSaln-se.sam>aln-se.bam#sortbamfilesamtoolssortaln-se.bam>aln-se.sorted.bam

Page 10: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Simple sample script

#!/usr/bin/envbash##toolsBOWTIE=/usr/local/bin/bowtie2#v2.3.5.1SAMTOOLS=/usr/local/bin/samtools#v1.9##referencegenomeREFERENCE=/usr/local/ref/e_coli.fa$BOWTIE-x$REFERENCE-UA.fastq.gz>A.sam$SAMTOOLSview-bSA.sam>A.bam$SAMTOOLSsortA.bam>A.sorted.bam

Page 11: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

For loops #!/usr/bin/envbashBOWTIE=/usr/local/bin/bowtie2SAMTOOLS=/usr/local/bin/samtoolsREFERENCE=/usr/local/ref/e_coli.faforreadin$(ls*fastq.gz);do

$BOWTIE-x$REFERENCE-U$read>${read/.fastq.gz/.sam}$SAMTOOLSview-bS${read/.fastq.gz/.sam}>${read/.fastq.gz/.bam}$SAMTOOLSsort${read/.fastq.gz/.bam}>${read/.fastq.gz/.sorted.bam}

done

Page 12: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

GNU parallel

toolforprocessingrepetitivecommandsparallel[options][command[arguments]]:::<files>• :::<files>orfind<files>|• Thefilename:{}• Thefilenamewiththeextensionremoved:{.}e.g.test.fawouldbecometest• --jobs,-jn

Page 13: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

GNU parallel pipeline

THREADS=2parallel--jobs$THREADSgunzip{}:::*fastq.gzparallel--jobs$THREADS$BOWTIE-x$REFERENCE-U{}">"{.}.sam:::*fastqparallel--jobs$THREADS$SAMTOOLSview-bS{.}.sam">"{.}.bam:::*samparallel--jobs$THREADS$SAMTOOLSsort{.}.sam">"{.}.sorted.bam:::*bam

Page 14: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

A brief history of make

• firstintroducedbyStuartFeldmanin1977atBellLabs• buildautomationtool• usedtobuildexecutableprogramsandlibrariesfromsourcecode• however,makeisnotlimitedtobuildingbinariesandlibraries

Page 15: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Key features of make

• Dependencyanalysis• Re-entrancy• Parallelization• Patternrules/abstraction• Audittrail

Page 16: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

what is make?

makeisaprogramthatreadsamakefileandthatbuildsoneormorefilesfromzeroormoreotherfilesthattheydependon.

Page 17: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

how does make do what it does?

makeparsesthemakefile,buildsadependencytree(bydeterminingtherelationshipsbetweentheinputsandoutputs),andthentraverseseachbranchofthetree,executingcommandsalongtheway.

Page 18: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

what is a makefile?

amakefileisatextfilewhichcontainsrulesforhowtocreateasetoftargetfiles.

Page 19: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

what is a rule?

aruletellsmakewhichseriesofcommandstoexecuteandwhatfilesmustexistbeforehandinordertocreateasetoftargetsfromsomeinput.

Page 20: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

the general form of a rule is:

target … : dependency … command … …

Page 21: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

a practical example: alignment

BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1

SAMTOOLS=/usr/local/bin/samtools #v1.9

REFERENCE=/usr/local/ref/e_coli.fa

all: A.sam

A.sam: A.fastq.gz

$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam

Page 22: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

adding another step:

BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1

SAMTOOLS=/usr/local/bin/samtools #v1.9

REFERENCE=/usr/local/ref/e_coli.fa

all: A.bam

A.sam: A.fastq.gz

$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam

A.bam: A.sam

$(SAMTOOLS) view –bS A.sam > A.bam

Page 23: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

automatic variables

BOWTIE=/usr/local/bin/bowtie2 #v2.3.5.1

SAMTOOLS=/usr/local/bin/samtools #v1.9

REFERENCE=/usr/local/ref/e_coli.fa

all: A.bam

A.sam: A.fastq.gz

$(BOWTIE) -x $(REFERENCE) –U $< > $@

A.bam: A.sam

$(SAMTOOLS) view –bS $< > $@

Page 24: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

using pattern rules: the percent sign

%:roughlyequivalentto*inaUnixshell-representsanynumberofanycharacters-canbeplacedanywherewithinpattern-canonlyoccuronce

somevaliduses:%.vs%.owrapper_%-charactersotherthan%matchliterallywithinafilename

Page 25: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

revisiting alignment…

FASTQFILES := $(wildcard *.fastq.gz)

all: $(FASTQFILES:.fastq.gz=.sorted.bam)

%.sam: %.fastq.gz

$(BOWTIE) -x $(REFERENCE) -U A.fastq.gz > A.sam

%.bam: %.sam

$(SAMTOOLS) view -bS $< > $@

%.sorted.bam: %. bam

$(SAMTOOLS) sort $< > $@

Page 26: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

visualizing the dependency tree

default

Sample1.bam Sample2.bam

Sample2.fastq.gzSample1.fastq.gz

makefile

Sample1.sam Sample2.sam

Page 27: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

the -j switch

-j[jobs],--jobs[=jobs]specifiesthenumberofjobs(commands)torunsimultaneously.

Page 28: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

why make? the limits of a script:

1.  linearexecution•  make-j

2.  truncatedfiles•  .DELETE_ON_ERROR:

3.  unabletoresume•  make

4.  pooraudittrail•  make-nB>make.log

Page 29: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Limitations of make

• Wasn’tdesignedforbioinformaticanalyses• Syntaxrequiresunderstandingrulestructure• Lackssupportformultipleoutputsfromsinglecommand• Nosupportformultiplewildcardspername• Nobuilt-insupportfordistributedcomputing

Page 30: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Ways to parallelize

ImageFrom:http://cloudcomputingnet.com/category/clouldcomputing/grid-computing/

Singlecomputer,singlecore

Singlecomputer,multiplecores Multiplecomputers,

multiplecores

Page 31: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Future trends

ImageFrom:https://www.hpcwire.com/2017/05/04/singularity-hpc-container-technology-moves-lab/#foobox-3/0/Singularity-architecture_G-Kurtzer-e1477021972985.jpgSingularitycontainers

Page 32: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Many contemporary alternatives to make

https://github.com/pditommaso/awesome-pipeline

Page 33: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

CWL

From:https://www.commonwl.org/user_guide/02-1st-example/

Page 34: Building bioinformatic pipelinesbigdata.citytech.cuny.edu/bd_media/2019/07/Paul... · Simple alignment pipeline with bowtie2 # align reads with bowtie2 bowtie2 -x ref.fa –U short_read.fq

Pipelines tip of iceberg concerning reproducibility

From:ExperimentingwithreproducibilityinbioinformaticsYang-MinKim,Jean-BaptistePoline,GuillaumeDumasbioRxiv143503;doi:https://doi.org/10.1101/143503