advanced tutorial - research.jetbrains.org › files › material › 5ec245b4e907c.pdf · demo...

Snakemake Advanced Tutorial

Roman Cherniatchik, 2020 May 6, St. Petersburg


Snakemake vs Bash Scripts

| | Snakemake pipeline | Bash scripts |
|---|---|---|
| “Entrance” level | Hard | Easy |
| Programming language | Python + Bash | Bash - convenient for simple scripts only |
| Calculations automation | + | + |
| Results consistency / reproducibility | + | depends on you |
| Computation environment reproducibility | Docker, Conda, Singularity integration | depends on you |
| Multiple platforms: write once, launch everywhere | PC, computational clusters (HPC, LSF, ...), cloud computing | New scripts for each platform, hard to make a universal solution |

Bash scripts are simple and easier to start with, but they can become a nightmare for large, complicated pipelines and for effective cloud computing.


Snakemake Dependency Graph (DAG)

It is crucial to understand the DAG concept to write Snakemake pipelines.

The dependency graph is used to decide:

• Which rules to execute?

• Which input files to use?

Dependency graph building: from output files back to input files

Rules execution order: from input files to output files


Pipeline Execution Order

Example: reads/ A.fq.gz; B.fq.gz; C.fq.gz

snakemake --cores 1 --dag | dot -Tsvg > dag.svg


Pipeline Dependencies Lookup

Rules (RULE INPUT -> RULE OUTPUT):

• rule align: reads/{sample}.fq.gz -> bams/{sample}.bam

• rule call_peaks: bams/{sample}.bam -> peaks/{sample}.bed

• rule plot: peaks/A.bed, peaks/B.bed, peaks/C.bed -> plot.svg

• rule all: plot.svg

START HERE: the lookup begins at the final target, plot.svg, and walks back through peaks/ and bams/. At each step A, B, C -> {sample}.

The End: reads/ A.fq.gz; B.fq.gz; C.fq.gz
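The backward lookup can be mimicked in a few lines of plain Python. This is only a sketch of the idea, not Snakemake's implementation: the `RULES` table, `pattern_to_regex` and `resolve` are hypothetical names, and wildcards are assumed to match any non-empty string.

```python
import re

# Hypothetical rule table mirroring the slide: rule -> (output pattern, input patterns).
RULES = {
    "align": ("bams/{sample}.bam", ["reads/{sample}.fq.gz"]),
    "call_peaks": ("peaks/{sample}.bed", ["bams/{sample}.bam"]),
    "plot": ("plot.svg", ["peaks/A.bed", "peaks/B.bed", "peaks/C.bed"]),
}


def pattern_to_regex(pattern):
    """Turn 'bams/{sample}.bam' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # odd positions hold wildcard names
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex + "$")


def resolve(target, jobs=None):
    """Walk from a target file back to its sources; return jobs in execution order."""
    jobs = [] if jobs is None else jobs
    for name, (output, inputs) in RULES.items():
        match = pattern_to_regex(output).match(target)
        if match:
            # Dependencies are resolved first, so they land earlier in the list.
            for inp in inputs:
                resolve(inp.format(**match.groupdict()), jobs)
            jobs.append((name, target))
            return jobs
    return jobs  # no rule produces the target: assume it is an existing source file


for rule, target in resolve("plot.svg"):
    print(rule, target)
```

The lookup starts at `plot.svg`, yet the resulting list begins with the `align` jobs and ends with `plot`: building goes from outputs to inputs, execution goes from inputs to outputs.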


Demo time: example_01

Topic: Snakemake Dependency Graph

• Q1: Why doesn’t it work?


Demo time: example_01

• call_peaks looks for “bams/A.bam”

• candidate found: align “bams/{sample}” => {sample} in align is “A.bam” => input should be “reads/A.bam.fq.gz”

• call_peaks {sample} (“A”) != align {sample} (“A.bam”)

• A wildcard variable makes sense only inside one rule
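The mismatch can be reproduced with an ordinary regular expression. This is a minimal sketch that assumes a wildcard matches any non-empty string (Snakemake's default) and uses a hypothetical helper, `wildcard_values`, which is not part of Snakemake.

```python
import re


def wildcard_values(pattern, path):
    """Match `path` against an output pattern; each wildcard behaves like '.+'."""
    pieces = re.split(r"\{(\w+)\}", pattern)  # odd positions hold wildcard names
    regex = "".join(
        f"(?P<{p}>.+)" if i % 2 else re.escape(p) for i, p in enumerate(pieces)
    )
    found = re.fullmatch(regex, path)
    return found.groupdict() if found else None


# Buggy align rule from the demo: the output pattern lacks the ".bam" suffix.
buggy = wildcard_values("bams/{sample}", "bams/A.bam")
# Fixed align rule: the suffix is part of the pattern.
fixed = wildcard_values("bams/{sample}.bam", "bams/A.bam")

print(buggy)  # {'sample': 'A.bam'}
print(fixed)  # {'sample': 'A'}
print("reads/{sample}.fq.gz".format(**buggy))  # reads/A.bam.fq.gz
```

With the suffix inside the pattern, {sample} in align resolves to “A”, the same value call_peaks uses, and the input becomes “reads/A.fq.gz”.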


Demo time: example_01

DAG: Jobs Graph vs Rules Graph

Practical advice:

Use the rules graph for your pipelines

• Real use cases: 10..1000 input files

• Full graph with all jobs will be extremely large.

• Rules Graph is compact

snakemake --cores 1 --rulegraph | dot -Tsvg > rulegraph.svg

Figures: “DAG: Rules only” and “DAG: All jobs”


Demo time: example_01

How to quickly find things in the snakemake command line help

• snakemake --help | less: in less, press ‘/’ and type ‘rulegraph’ (search in ‘less’ works the same as in Vim)

Useful snakemake options for debugging:

• --dry-run : Builds the DAG, checks the pipeline & exits. Doesn’t execute rules

• --debug-dag : Prints wildcard details etc. while inferring the DAG; doesn’t stop rule execution

• --rulegraph : Prints a compact DAG (only rules) & exits

Some useful snakemake methods:

• touch: Creates an empty file; can be used in the output: section to mock shell/run sections

• directory: Marks an output: section argument as a directory, not a file

• protected: Marks an output: section argument to create ‘read-only’ files for important pipeline results


Snakemake Editing Tools Matter

Why am I using PyCharm + SnakeCharm?

Let’s compare:

• cat

• vim + snakemake syntax highlighting

• Atom (recommended by Snakemake)

• PyCharm + SnakeCharm Plugin


Example2: cat

Let’s use an editor w/o syntax highlighting, e.g. cat (or less / vim / nano / …)

Pros

• Installed on almost any Linux machine

• Works in an SSH session

• Fast & light

Cons

• Easy to make an error

• Hard to read pipeline code


Example2: vim + snakemake bundle

Let’s use an editor with a syntax highlighting plugin, e.g. vim (or nano / …)

Pros

• Installed on almost any Linux machine

• Works in an SSH session

• Fast & light

• Code looks better, some errors are easier to notice

Cons

• Requires installing the snakemake bundle

• Still easy to make an error

Supplementary: How do I enable syntax highlighting in Vim for Snakefiles?


Example2: Atom

Let’s use the IDE suggested by the official Snakemake tutorial: Atom. Seems they should know better which tool to use, right?

Pros

• Fast & light

• Code looks readable, some errors are easier to notice

• Reasonable default choice

Cons

• Requires installation

• Unlikely to work via SSH

• Still easy to make an error (see later)


Is this code OK?

14 mistakes in 24 lines


Example_02: PyCharm + SnakeCharm

SnakeCharm is developed by the JetBrains Biolabs Team

Pros

• Code analysis, lots of errors highlighted

• Smart code completion

• PyCharm is also good for:
  • Python, Markdown, R, ..
  • Git
  • ….

Cons

• Requires installation

• Could show false positives

• Can’t be used via SSH directly (put the code in git; sshfs or other tricks should be used)


Text Editors Takeaways

My own preferences:

• Local machine
  • PyCharm IDE
  • SnakeCharm Plugin: for Snakemake
  • IdeaVim Plugin: `Vim style` emulation

• Remote machine (computation clusters, docker machines, …)
  • Vim / Nano + Snakemake syntax bundle

Keep your pipeline in Git:

• Use it to sync the pipeline with the remote machine

• Convenient for pipeline development


“Snakenstein”

Snakemake file = mixture of:

• Python code

• Some additional syntax
  • rules declarations, rules sections, etc
  • special syntax in strings: “path/{sample}.bam”

snakemake tool:

• reads the Snakefile

• generates valid python code

• executes the python code in some special python environment

“Frankenstein” in terms of programming language


Example_03: Snakefile - Not Python

snakemake --print-compilation > Snakefile.py

Snakemake generates a Python file and executes it!


Section arguments types

Most sections support:

• positional arguments:
  • string arguments
  • lists of strings
  • other python expressions

• named arguments: key1 = value

The input section TEXT is inserted into a python function call, workflow.input(…) (visible with --print-compilation)

=> same syntax as for python method call arguments, e.g. like in

print(“fooo”, “boo”, file=…, end=..)
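Because section text ends up as function call arguments, Python's own argument rules apply. A small sketch, where `section` is a hypothetical stand-in for the generated call (not Snakemake's real function) and the file names are made up:

```python
def section(*positional, **named):
    """Hypothetical stand-in for a generated section call; not Snakemake's real API."""
    return {"positional": list(positional), "named": named}


# Valid: positional arguments first, named arguments after them.
parsed = section("reads/A.fq.gz", ["ref.fa", "ref.fa.fai"], index="ref.idx")

# Invalid in Python, hence invalid in a section: a positional argument after a named one.
bad_arguments = 'section(index="ref.idx", "reads/A.fq.gz")'
try:
    compile(bad_arguments, "<section>", "eval")
    rejected = False
except SyntaxError:
    rejected = True

print(parsed)
print(rejected)  # True
```

The same ordering rule explains why a section that mixes named and unnamed entries must list the unnamed ones first.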


Lambda / Input Functions

Lambda functions

• Access to: wildcards, threads, input, output, resources

• Different sections - different set of arguments for input functions

Input function:

• Similar to lambda functions, but for larger pieces of code

• Only for input: sections

• Could be used to handle dynamic dependencies (see checkpoints)
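An input function is an ordinary Python function that receives the job's wildcards and returns input paths. The sketch below runs outside Snakemake: the `SAMPLES` table and its paths are invented for illustration, and `SimpleNamespace` merely imitates the wildcards object.

```python
from types import SimpleNamespace

# Hypothetical sample sheet: sample name -> paths of its read files.
SAMPLES = {
    "A": ["reads/A_run1.fq.gz", "reads/A_run2.fq.gz"],
    "B": ["reads/B.fq.gz"],
}


def reads_for_sample(wildcards):
    """Input function: receives the job's wildcards, returns its input paths."""
    return SAMPLES[wildcards.sample]


# The lambda form of the same lookup, convenient for one-liners.
reads_lambda = lambda wildcards: SAMPLES[wildcards.sample]

fake_wildcards = SimpleNamespace(sample="A")  # stand-in for Snakemake's wildcards object
print(reads_for_sample(fake_wildcards))
print(reads_lambda(fake_wildcards))
```

In a Snakefile the function would be passed by name, e.g. `input: reads_for_sample`, so that it is called during DAG computation rather than when the file is loaded.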


Sections Syntax is not Equal

• output, log, benchmark:
  • lambdas/input functions cannot be used, only expressions

• input:
  • expressions, lambdas, also “input functions”

• threads: lambdas/functions or expressions which return integer or float values

• shell:, wrapper: only one positional argument, an expression returning a python string

• run: python code block

• ….

Check Snakemake docs / Trust SnakeCharm !


Wildcards Syntax

3 types:

1. Sections that introduce wildcards:

• output, log, benchmark. E.g.: “peaks/{sample}.bed”

• => lambdas cannot be used here

• => the wildcards set should be the same in these sections

• everything in `{..}` is a wildcard name, e.g `{config[reads]}`

2. Sections which use wildcards w/o `wildcards.` prefix:

• input, params, … E.g.: “peaks/{sample}.bed”

• everything in `{..}` is a wildcard name, e.g `{SOME_VARIABLE}`

3. Sections which require `wildcards.` prefix:

• message, shell, run, … E.g.: “peaks/{wildcards.sample}.bed”

• w/o wildcards prefix - just python e.g. `{config[reads]}`, `{SOME_VARIABLE}`

Constraining wildcards example: “sorted_reads/{sample,[A-Za-z0-9]+}.bam”
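Both the constraint and the `wildcards.` prefix can be illustrated with the standard library alone. This is a sketch of the behaviour, not Snakemake code: the regexes are hand-written equivalents of the patterns, and the `config` value is hypothetical.

```python
import re
from types import SimpleNamespace

# Hand-written equivalent of "sorted_reads/{sample,[A-Za-z0-9]+}.bam".
constrained = re.compile(r"sorted_reads/(?P<sample>[A-Za-z0-9]+)\.bam")
# Without a constraint a wildcard matches any non-empty string ('.+').
unconstrained = re.compile(r"sorted_reads/(?P<sample>.+)\.bam")

path = "sorted_reads/A.sorted.bam"
print(bool(unconstrained.fullmatch(path)))  # True: sample would be "A.sorted"
print(bool(constrained.fullmatch(path)))    # False: "." is not allowed in sample
print(constrained.fullmatch("sorted_reads/A1.bam").group("sample"))  # A1

# Type 3 sections are formatted with several objects in scope, so a wildcard
# is reached through the `wildcards.` prefix while `config` is plain Python.
wildcards = SimpleNamespace(sample="A1")  # stand-in for the wildcards object
config = {"reads": "reads_dir"}           # hypothetical config value
message = "peaks/{wildcards.sample}.bed from {config[reads]}".format(
    wildcards=wildcards, config=config
)
print(message)  # peaks/A1.bed from reads_dir
```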


Three Execution Phases

1. Python module loading

• File top level

• Section arguments

2. DAG computation

• input and lambda functions

3. Rule running

• run, script, shell, wrapper sections

• bonus: after DAG computation, but before/after all rules execution

See Example_03.6


Rules References

rules.NAME.output.key

• Performance improvement for large pipeline DAG computations

• Reduces code duplication - fewer ERRORs!

• Helps in finding usages of rule in code

Documentation

See Example_04


Dynamic Dependencies

Sometimes not all intermediate file names are known before pipeline execution

Examples:

• Download some files (e.g. fastq) from a database by id

• Align samples, perform QC, use only the samples that passed QC downstream

Use:

• checkpoint rules (see data-dependent-conditional-execution)

• the dynamic flag for output is deprecated and will be removed


Checkpoints

Special rule sub-type:

• declaration ~ rule syntax

• usage: input function + checkpoint ref: checkpoints.NAME.get(**wildcards).output.key

DAG evaluation:

• Evaluate DAG except checkpoint ‘using’ rules (syntax above), e.g w/o expression rule

• Run pipeline

• Re-calc DAG after checkpoint finished (separately for every wildcard, if used)

• Run pipeline

Figure labels: “declaration”, “usage syntax”, “Do download”

See Example_05


Shell Commands

• Different Python string syntaxes are available

• Use the most convenient one for the situation

See Example_06.1
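As an illustration of the choice, the three variables below hold the same command text written with different Python string syntaxes. The command itself is an arbitrary placeholder, and the `{input}` / `{output}` placeholders are left unformatted here, as they would be in a shell: section.

```python
# One-line string literal.
single_line = "zcat {input} | sort | uniq -c > {output}"

# Adjacent literals are concatenated by Python: mind the trailing space.
concatenated = (
    "zcat {input} | sort "
    "| uniq -c > {output}"
)

# Triple-quoted string; a backslash before the newline joins the lines.
triple_quoted = """\
zcat {input} | sort \
| uniq -c > {output}"""

print(single_line == concatenated == triple_quoted)  # True
```

A typical pitfall with the concatenated form is a missing space at the end of a line, which silently glues two arguments together.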


Wrappers, Scripts, ...

shell is enough, but alternatives are:

• wrapper: See docs. Recommended way for standard tools.
  • Automatically installs the tool via conda
  • Keeps each tool in a separate conda env
  • Collects shell args for you
  • Wrappers repo: https://snakemake-wrappers.readthedocs.io

• script: Syntax sugar to pass arguments into Python, R scripts.

• notebook: Way to launch a notebook (R or Python) and use its output, see docs

See Example_06.2

Wrapper example: (figure not preserved in this transcript)


Project Layout

• Recommended Structure

• Keep pipeline settings in config.yaml

• Pass input files information as TSV / CSV tables with columns like sample name, reads path, etc. E.g.:
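A sample table of this kind might be loaded as sketched below. The column names (`sample`, `reads`), the paths and the `load_samples` helper are assumptions for illustration; the slide's own table is not preserved in this transcript.

```python
import csv
import io

# Hypothetical sample sheet; in a real pipeline this would be a .tsv file on disk.
SAMPLES_TSV = "sample\treads\nA\treads/A.fq.gz\nB\treads/B.fq.gz\n"


def load_samples(handle):
    """Map sample name -> reads path from a tab-separated table with a header row."""
    return {row["sample"]: row["reads"] for row in csv.DictReader(handle, delimiter="\t")}


samples = load_samples(io.StringIO(SAMPLES_TSV))
print(sorted(samples))  # ['A', 'B']
print(samples["A"])     # reads/A.fq.gz
```

Loading the table at the top level of the Snakefile makes the sample names available both for the final target list and for input functions.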


Results Consistency / Reproducibility

Snakemake is smart enough, but only with your assistance!

• Rule output exists:
  • Recalculated only if input files changed (not always for checkpoints)
  • If an input is marked with `ancient`, the file modification date is not checked

• Rule fails:
  • Only files mentioned in output: of the failed rule are deleted
  • => Mention all tool output files in output: or use shadow: and mention only required files

• shadow:
  • Use if the tool outputs too many files
  • Runs the tool in a temp directory: .snakemake/shadow/tmpxxx
  • Copies only the files requested in the output: section
  • Uses symlinks to make everything work
  • Shadow levels: minimal, shallow, full
  • `minimal` - most cases

See Example_07


Bad Pipeline - Inconsistent Results

Snakemake - not a silver bullet

You can always break all Snakemake conventions and write an inconsistent, non-reproducible pipeline. Think about what you are doing and why.

Example of broken conventions: (code figure not preserved in this transcript)

Bugs in the example:

• ‘my_file.csv’ will be incorrect if several jobs work in parallel

• if one of the samples fails - ‘my_file.csv’ will contain inconsistent results

• also ‘my_file.csv’ won’t be deleted

• also ‘my_file.csv’ won’t be recalculated automatically on the next pipeline launch

• if an input file changed - ‘my_file.csv’ won’t be recalculated automatically
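The slide's code is not preserved here. Judging only from the bug list, the jobs presumably all wrote to one shared ‘my_file.csv’ that was not declared as a rule output; that reading is an assumption. Under it, one conventional remedy is to let every job write its own declared file and to build the summary in a separate aggregating step. The plain-Python sketch below shows the shape of that fix; the function names and the data are hypothetical.

```python
import tempfile
from pathlib import Path


def count_peaks(sample, peaks, out_dir):
    """Per-sample job: writes only its own output file, never a shared one."""
    out = out_dir / f"{sample}.count.csv"
    out.write_text(f"{sample},{len(peaks)}\n")
    return out


def merge_counts(parts, merged):
    """Aggregating job: rebuilds the summary from scratch from its listed inputs."""
    merged.write_text("".join(p.read_text() for p in sorted(parts)))
    return merged


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Job order does not matter: B finishes before A here.
    parts = [count_peaks(s, peaks, tmp) for s, peaks in {"B": [1, 2], "A": [3]}.items()]
    summary = merge_counts(parts, tmp / "my_file.csv").read_text()

print(summary)
```

Expressed as rules, the per-sample files would be the output: of one rule and the input: of the merging rule, so a failed sample, a changed input or a deleted summary is handled by the usual bookkeeping.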


Computation Environment Reproducibility

See “distribution and reproducibility” in the snakemake docs

conda

• Used to automatically install tools with proper versions

• --use-conda snakemake option + conda: section in the pipeline

wrappers

• Shell commands + conda environment. Required: --use-conda option

docker

• You can run the whole pipeline, or each job, in a single docker container

• See the --use-singularity snakemake option and the container: section

• up to 100% reproducibility


Computational clusters

Different types: HPC, LSF, Slurm, …

Idea:

• Only Linux terminal (SSH) access to the cluster entry point (`login` node machine)

• Each rule - single job submission

Snakemake:

• Does all the complicated work for you ~ feels like a local machine

• localrules: force a job to be launched without job submission

• Required options: --profile XXX, --jobscript XXX.sh, --restart-times NN

My latest data processing was:

• 120 WGBS samples

• 10 TB reads, 150 machines, 2 weeks, 100k+ jobs

• Each rule - own Docker container


Snakemake Pipeline Examples

• Community created workflows - https://github.com/snakemake-workflows
  Curated, but not all are good

• Our team workflows:
  • Chip-Seq: https://github.com/JetBrains-Research/chipseq-smk-pipeline
  • SC ATAC-Seq: https://github.com/JetBrains-Research/scasat-smk-pipeline
  • WGBS Methylation: <not published yet>


Resources

• Snakemake Documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

• Snakemake Wrappers: https://snakemake.readthedocs.io/en/stable/

• SnakeCharm Plugin: https://jetbrains-research.github.io/snakecharm/
