Fully reproducible data analysis with Snakemake and Bioconda · 2021. 3. 16.



Post on 21-Aug-2021



Fully reproducible data analysis with Snakemake and Bioconda

Johannes Köster

Centrum Wiskunde & Informatica, Amsterdam

Department of Medical Oncology, Dana-Farber Cancer Institute and Harvard Medical School, Boston

Data analysis

The needs of data analysis

[Figure: several datasets are analyzed into results]

scalability

• Handle tens to thousands of samples via parallelization.

• Avoid redundant computations when changing datasets or parameters.

reproducibility

• Document parameters, tools, versions.

• Execute and deploy without manual intervention.
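
To make "avoid redundant computations" concrete, here is a minimal Python sketch of one common strategy: rerun a step only when its output is missing or older than one of its inputs. Treating file modification times as the criterion is an assumption made for illustration, and the helper name `needs_rerun` is hypothetical; the poster does not say how Snakemake decides this internally.

```python
import os


def needs_rerun(inputs, output):
    """Return True if `output` is missing or older than any file in `inputs`.

    Hypothetical helper for illustration; not part of any library.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    # Any input modified after the output makes the output stale.
    return any(os.path.getmtime(path) > out_mtime for path in inputs)
```

With such a check, changing a single dataset only invalidates the results that were derived from it; everything else is left untouched.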

Snakemake

Snakemake formalizes, documents and executes data analyses

Used by various high-impact studies.

Define workflows via generic rules

    rule mytask:
        input:
            "reference.fasta",
            "reads/{dataset}.fastq"
        output:
            "mapped/{dataset}.bam"
        environment:
            "software.yaml"
        resources:
            mem_gb=4
        shell:
            "bwa mem {input} | "
            "samtools view -b > {output}"
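
The rule above is generic because `{dataset}` is a wildcard. As a rough Python sketch of the idea (not Snakemake's actual implementation), a requested target such as `mapped/A.bam` can be matched against the output pattern to recover the wildcard value, which is then substituted into the input patterns. The function names `match_wildcards` and `resolve_inputs` are hypothetical, and the sketch assumes each wildcard occurs only once per pattern.

```python
import re


def match_wildcards(pattern, target):
    """Match `target` against a pattern such as 'mapped/{dataset}.bam'.

    Returns a dict of wildcard values, or None if the target does not match.
    """
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text
        else:
            regex += f"(?P<{part}>.+)"     # wildcard becomes a named group
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def resolve_inputs(input_patterns, wildcards):
    """Fill wildcard values into each input pattern."""
    return [p.format(**wildcards) for p in input_patterns]


wildcards = match_wildcards("mapped/{dataset}.bam", "mapped/A.bam")
inputs = resolve_inputs(["reference.fasta", "reads/{dataset}.fastq"], wildcards)
```

Here `inputs` names the concrete files that the job for `mapped/A.bam` would need, which is the information required to chain rules together.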

Use shell commands, scripts (R, Python), and tool wrappers.

Dependencies between rules are determined automatically.

Implicit parallelization to compute servers and clusters.
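
Once dependencies between jobs are known, parallelization follows from the dependency graph: jobs that do not depend on each other can run at the same time. The Python sketch below illustrates this with the standard-library `graphlib` module (Python 3.9 or later). The job names `map_A`, `map_B` and `merge` are hypothetical, and this is an illustration of the principle rather than a description of Snakemake's scheduler.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies):
    """Group jobs into batches; all jobs within one batch are independent.

    `dependencies` maps each job to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Jobs whose dependencies are all finished can run concurrently.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical example: two mapping jobs feed one merge job.
batches = parallel_batches({"merge": {"map_A", "map_B"}})
```

In this example the two mapping jobs land in the first batch and could be dispatched to separate cores or cluster nodes, while `merge` waits for both.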

Define isolated software environments per rule

    channels:
      - bioconda
      - r
    dependencies:
      - bwa ==0.7.4
      - samtools ==1.1

• Isolation allows conflicting versions on the same system.

• Exact versions ensure full reproducibility.

Automatic installation via the versatile Conda package manager.
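
Since reproducibility here rests on exact versions, it can be useful to check that every dependency in an environment definition is pinned with `==`. The Python sketch below does this for a plain list of dependency strings like the ones shown above. The regular expression is a deliberately simplified assumption and does not cover Conda's full package specification syntax; the name `unpinned` is hypothetical.

```python
import re

# Simplified form of an exact pin: package name, "==", version.
# This is an assumption for illustration, not Conda's complete grammar.
EXACT_PIN = re.compile(r"^\s*[\w.-]+\s*==\s*[\w.]+\s*$")


def unpinned(dependencies):
    """Return the dependency strings that lack an exact '==' version pin."""
    return [dep for dep in dependencies if not EXACT_PIN.match(dep)]


# The two dependencies from the environment definition above.
loose = unpinned(["bwa ==0.7.4", "samtools ==1.1"])
```

An empty result means every listed package is pinned; entries such as `bwa` or `samtools >=1.1` would be reported because they leave the installed version open.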

Bioconda

Bioinformatics software installation is heterogeneous

Bioconda normalizes software installation via easy-to-create package recipes

• over 1500 packages

• over 100 maintainers

Works for any language (R, Python, C/C++, Rust, Perl, ...).

By combining Snakemake and Bioconda, data analyses become reproducible with minimal effort

    # clone workflow repository
    $ git clone https://github.com/user/workflow

    # install Snakemake
    $ conda install snakemake

    # execute workflow
    # (software dependencies are handled automatically)
    $ snakemake -s Snakefile
