Reproducible Bioinformatics Pipelines with Docker & Anduril

Christian Frech, PhD
Bioinformatician at Children’s Cancer Research Institute, Vienna

CeMM Special Seminar
September 25th, 2015



Why care about reproducible pipelines in bioinformatics?

For your (future) self
- Quickly re-run analysis with different parameters/tools
- Best documentation of how results have been produced

For others
- Allow others to easily reproduce your findings (“reproducibility crisis”)*
- Code re-use between projects and colleagues

*) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998

Obstacles to computational reproducibility

- Software/script not available (even upon request)
- Black box: code (or even virtual machine) available, but no documentation on how to run it
- Dependency hell: software and documentation available, but (too) difficult to get it running
- Code rot: code breaks over time due to software updates
- 404 Not Found: unstable URLs, e.g. links to lab homepages

Go figure…

Computational pipelines to the rescue

- In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs
  - Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script
- Simple automation via (bash/R/Python/Perl) scripting has its limitations
  - No error checking
  - No partial execution
  - No parallelization
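To make the first two limitations concrete, here is a minimal Python sketch of what a pipeline framework adds on top of a plain script: each step fails loudly on a non-zero exit status and is skipped when its outputs are already newer than its inputs. The function names `is_up_to_date` and `run_step` are made up for illustration and are not part of any framework mentioned in this talk.

```python
import subprocess
import sys
from pathlib import Path


def is_up_to_date(inputs, outputs):
    """Return True if all outputs exist and none is older than the newest input."""
    outs = [Path(o) for o in outputs]
    if not all(o.exists() for o in outs):
        return False
    newest_input = max((Path(i).stat().st_mtime for i in inputs), default=0.0)
    oldest_output = min((o.stat().st_mtime for o in outs), default=0.0)
    return oldest_output >= newest_input


def run_step(name, command, inputs, outputs):
    """Run one step unless its outputs are up to date (partial execution).

    Returns "skipped" or "ran"; raises subprocess.CalledProcessError when the
    command exits with a non-zero status (error checking).
    """
    if is_up_to_date(inputs, outputs):
        return "skipped"
    print(f"[{name}] running", file=sys.stderr)
    subprocess.run(command, check=True)
    return "ran"
```

Calling `run_step` once per stage (alignment, variant calling, annotation, ...) re-runs only the stages whose inputs changed. Parallelization is deliberately left out of this sketch.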


No shortage of pipeline frameworks

- Script-based
  - GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
- GUI-based
  - Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
  - Various commercial solutions for more standardized workflows (e.g. RNA-seq)
  - Geared toward biologists without programming skills (“point-and-click”)

See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/

Personal wish list for pipeline framework

- Script-based (maximum flexibility, minimum overhead)
- Powerful scripting language
- Cluster integration (preferably via slurm)
- Modular (allow code re-use between projects and colleagues)
- Component library for frequent tasks (e.g. join two CSV files)
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment

What’s wrong with good ol’ GNU make?

PRO
- Available on all Linux platforms
- Stood the test of time (developed in the 1970s)
- Rapid development (Bash scripting + target rules)
- Parallel execution of independent targets (-j parameter)

CON
- No cluster support
- Arcane syntax, cryptic pattern rules
- Half-baked multi-output rules
- No type checking (everything is a generic file)
- Difficult to modularize (code re-use)
- Rebuild not triggered by recipe change
- No reporting
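The point about rebuilds not being triggered by recipe changes can be illustrated with a small Python sketch. One possible approach, assumed here and not necessarily how Anduril or any other framework does it, is to store a hash of the command next to the target and rebuild whenever that hash changes. The `.recipe` sidecar file and all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def recipe_digest(recipe):
    """Hash the recipe text so that any edit to the command changes the digest."""
    return hashlib.sha256(recipe.encode("utf-8")).hexdigest()


def stamp_path(target):
    """Hypothetical sidecar file that remembers which recipe built the target."""
    target = Path(target)
    return target.with_name(target.name + ".recipe")


def needs_rebuild(target, recipe):
    """Return True if the target is missing or was built with a different recipe."""
    stamp = stamp_path(target)
    if not Path(target).exists() or not stamp.exists():
        return True
    return stamp.read_text() != recipe_digest(recipe)


def record_build(target, recipe):
    """Store the digest of the recipe that produced the target."""
    stamp_path(target).write_text(recipe_digest(recipe))
```

Plain make compares only file timestamps, so changing a flag in a recipe leaves stale outputs in place. This sketch covers only the recipe check and would be combined with the usual timestamp comparison.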


Anduril

http://www.anduril.org

Anduril

- Developed since 2008 at Biomedicum Systems Biology Laboratory, Helsinki, Finland
  - http://research.med.helsinki.fi/gsb/hautaniemi/
- Built for scientific data analysis with focus on bioinformatics
- Proprietary workflow scripting language “Anduril script”
  - Possibility to embed native code (Bash/R/Python/Perl)
  - Version 2 will switch to Scala
- Open source & free
- Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
- No widespread adoption (yet?)

Anduril features

- Script-based (maximum flexibility, less overhead)
- Expressive scripting language
- Cluster integration (preferably via slurm)
- Modular to allow code re-use (between projects and colleagues)
- Ready-made component library for frequent analysis steps
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment: X

Example workflow: RNA-seq alignment with GSNAP

Anduril script:

    inputBamDir = INPUT(path="/data/bam", recursive=false)
    inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")

    alignedBams = record()
    for bam : std.iterArray(inputBamFiles) {
        gsnap = GSNAP (
            reads = INPUT(path=bam.file),
            options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
            @cpu = 10, @memory = 40000,
            @name = "gsnap_" + bam.key
        )
        alignedBams[bam.key] = gsnap.alignment
    }

Execute with (distributed execution on cluster):

    $ anduril run workflow.and --exec-mode slurm

Embedding native R code in Anduril script

Convert UCSC to Ensembl chromosome names in a CSV file containing column ‘chrom’:

    ensembl = REvaluate(
        table1 = ucsc,
        script = StringInput(content=
            '''
            table.out <- table1
            table.out$chrom <- gsub("^chr", "", table.out$chrom)
            '''
        )
    )

Also supports inlining of Bash, Python, Java, and Perl scripts
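For readers who do not use R, the same chromosome-name conversion can be sketched in plain Python, outside of Anduril. The sketch assumes a comma-separated table passed as text, and the function name `strip_chr_prefix` is invented for this example. Like the R snippet, it only strips the leading "chr" and does not handle special cases such as the mitochondrial chromosome (chrM in UCSC, MT in Ensembl).

```python
import csv
import io
import re


def strip_chr_prefix(csv_text):
    """Remove a leading "chr" from the 'chrom' column, like gsub("^chr", "", ...)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the start of the value is touched; other columns pass through unchanged
        row["chrom"] = re.sub(r"^chr", "", row["chrom"])
        writer.writerow(row)
    return out.getvalue()


ucsc = "chrom,start\nchr1,100\nchrX,200\n"
print(strip_chr_prefix(ucsc))
```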


Anduril features

- Script-based (maximum flexibility, less overhead)
- Expressive scripting language
- Cluster integration (preferably via slurm)
- Modular to allow code re-use (between projects and colleagues)
- Ready-made component library for frequent analysis steps
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment: ?

Docker

- “Lightweight” virtualization technology for Linux-based systems
- Processes run in isolated namespaces (“containers”), but share the same kernel
- Like VMs: containers are portable between systems -> reproducibility!
- Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization

[Figure: VM vs. container]

How to bundle workflow with execution environment?

Solution 1: a single container holds Anduril, the workflow, and Components 1, 2, and 3
- Pro: Single container, easy to maintain
- Con: VM-like approach; huge, monolithic container, difficult to share (against Docker philosophy)

Solution 2: Anduril and the workflow run outside; each component has its own container (Container A: Component 1, Container B: Component 2, Container C: Component 3)
- Pro: Completely modularized, easy to re-use/share workflow components
- Con: “container hell”?

Hybrid solution

[Diagram: master container with Anduril and the workflow; Container A; Components 1, 2, and 3]

- Master container: project- and user-specific components installed in master container
- Common container: shared components installed in common container (e.g. container “RNA-seq”)
- “Docker inside docker”
- Pro: Workflow completely containerized (= portable); only shared components in common containers
- Con: Still (but greatly reduced) overhead for container maintenance

Dockerized GSNAP in Anduril

    inputBamDir = INPUT(path="/data/bam", recursive=false)
    inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")

    alignedBams = record()
    for bam : std.iterArray(inputBamFiles) {
        gsnap = GSNAP (
            reads = INPUT(path=bam.file),
            options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
            docker = "cfrech/anduril-gsnap-2015-09-21",
            @cpu = 10, @memory = 40000,
            @name = "gsnap_" + bam.key
        )
        alignedBams[bam.key] = gsnap.alignment
    }
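The slides do not show how the `docker` parameter is turned into an actual container invocation. As a rough illustration only, a component launcher could wrap the tool's command line in `docker run`, bind-mounting the data directories so that input and output paths stay valid inside the container. The helper `docker_command` and the in-container command are assumptions made for this sketch; only the image name, the data path, and the GSNAP options come from the workflow above.

```python
def docker_command(image, command, mounts):
    """Build the argument list for running `command` inside `image`.

    `mounts` maps host directories to the paths they get inside the container.
    """
    argv = ["docker", "run", "--rm"]  # --rm removes the container after it exits
    for host_dir, container_dir in mounts.items():
        argv += ["-v", f"{host_dir}:{container_dir}"]  # bind-mount a data directory
    argv.append(image)
    argv += list(command)
    return argv


# Hypothetical in-container command; the options are taken from the workflow above
argv = docker_command(
    "cfrech/anduril-gsnap-2015-09-21",
    ["gsnap", "--npaths=1", "--max-mismatches=1", "--novelsplicing=0"],
    {"/data/bam": "/data/bam"},
)
print(" ".join(argv))
```

Building the argument list separately from executing it keeps the sketch testable without a Docker installation; the list could then be passed to `subprocess.run`.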


So, Anduril is great… but

- Proprietary scripting language
  - Biggest hurdle for widespread adoption IMO
  - Will likely improve with version 2 (which uses Scala)
- Documentation opaque for beginners
  - WANTED: Simple step-by-step guide to build your first Anduril workflow
- High upfront investment to get going (because of the above)
- In-lining Bash/R/Perl/Python should be simpler
  - Currently too much clutter when using “BashEvaluate” and the like
- Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
  - Will improve with fluency in workflow scripting language

Anduril RNA-seq case study


RNA-seq case study
Step 1: Configure Anduril workflow

    title = "My project long title"
    shortName = "My project short title"
    authors = "Christian Frech"

    // analyses to run
    runNetworkAnalysis = true
    runMutationAnalysis = true
    runGSEA = true

    // constants
    PROJECT_BASE = "/mnt/projects/myproject"
    gtf = INPUT(path=PROJECT_BASE+"/data/Homo_sapiens.GRCh37.75.etv6runx1.gtf.gz")
    referenceGenomeFasta = INPUT(path="/data/reference/human_g1k_v37.fasta")

    ...

+ description of samples, sample groups, and group comparisons in external CSV file

RNA-seq case study
Step 2: Run Anduril workflow on cluster

    $ anduril run main.and --exec-mode slurm

RNA-seq case study
Step 3: Go for lunch

RNA-seq case study
Step 4: Study PDF report

What follows are screenshots from this PDF report


QC: Read counts


QC: Gene body coverage


QC: Distribution of expression values per sample


QC: Sample PCA & heatmap


Volcano plot for each comparison

Table report of DEGs for each comparison


Expression values of top differentially expressed genes per comparison

GO term enrichment for each comparison


Interaction network of DEGs for each comparison


Chromosomal distribution of DEGs


GSEA heat map summarizing all comparisons

- Rows = enriched gene sets
- Columns = comparisons
- Value = normalized enrichment score (NES)
- Red = enriched for up-regulated genes
- Blue = enriched for down-regulated genes
- * = significant (FDR < 0.05)
- ** = highly significant (FDR < 0.01)
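The significance markers in this legend amount to a simple threshold rule, sketched below in Python. The function name `significance_marker` is made up for this example; only the two FDR cut-offs come from the legend.

```python
def significance_marker(fdr):
    """Map an FDR value to the marker used in the heat map legend."""
    if fdr < 0.01:
        return "**"  # highly significant
    if fdr < 0.05:
        return "*"   # significant
    return ""        # not significant, no marker


for fdr in (0.005, 0.03, 0.2):
    print(fdr, repr(significance_marker(fdr)))
```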


Future developments

- Push new Anduril components to public repository (needs some refactoring, documentation, test cases)
- Help on Anduril2 manuscript
- Port custom Makefiles to Anduril (ongoing)
- Cloud deployment of dockerized workflow
  - Couple slurm to AWS EC2
  - Automatic spin-up of docker-enabled AMIs serving as computing nodes

In the (not so) distant future …

$ docker pull cfrech/frech2015_et_al

$ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output

$ evince output/figure1.pdf


Further reading

Discussion thread on Docker & Anduril:
https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw/24i4M1pDIfcJ

Acknowledgement

- Marko Laakso (Significo)
- Sirku Kaarinen (Significo)
- Kristian Ovaska (Valuemotive)
- Pekka Lehti (Valuemotive)
- Ville Rantanen (University of Helsinki, Hautaniemi lab)
- Nuno Andrade (CCRI)
- Andreas Heitger (CCRI)