Population-scale high-throughput sequencing data analysis

Denis C. Bauer | Bioinformatics | @allPowerde | 08 July 2014 | CSIRO COMPUTATIONAL INFORMATICS

Upload: denis-bauer

Post on 08-May-2015


DESCRIPTION

Unprecedented computational capabilities and high-throughput data collection methods promise a new era of personalised, evidence-based healthcare, utilising individual genomic profiles to tailor health management, as demonstrated by recent successes in rare genetic disorders and stratified cancer treatments. However, processing genomic information at a scale relevant for the health system remains challenging due to high demands on data reproducibility and data provenance. Furthermore, meeting the computational requirements demands a large investment in compute hardware and IT personnel, which is a barrier to entry for small laboratories and difficult to maintain at peak times for larger institutes. This hampers the creation of time-reliable production informatics environments for clinical genomics. Commercial cloud computing frameworks, like Amazon Web Services (AWS), provide an economical alternative to in-house compute clusters, as they allow outsourcing of computation to third-party providers while retaining software and compute flexibility. To cater for this resource-hungry, fast-paced yet sensitive environment of personalised medicine, we developed NGSANE, a Linux-based, HPC-enabled framework that minimises the overhead of setting up and processing new projects yet maintains the full flexibility of custom scripting and data provenance when processing raw sequencing data, either on a local cluster or on Amazon’s Elastic Compute Cloud (EC2).

TRANSCRIPT

Page 1: Population-scale high-throughput sequencing data analysis

Denis C. Bauer | Bioinformatics | @allPowerde | 08 July 2014

CSIRO COMPUTATIONAL INFORMATICS

By Melody

Page 2: Population-scale high-throughput sequencing data analysis

Talk Overview


• Background: CSIRO/Omics Project

• Methods: NGS Data Processing on HPC/Cloud

• Research Outcome: Cancer and Microbes in Colorectal Cancer

Denis Bauer | @allPowerde

Page 3: Population-scale high-throughput sequencing data analysis

CSIRO: Who we are

The Commonwealth Scientific and Industrial Research Organisation

• People: 6500
• Divisions: 13
• Locations: 58
• Flagships: 11
• Budget: $1B+

• 62% of our people hold university degrees (2000 doctorates, 500 masters)
• With our university partners, we develop 650 postgraduate research students
• Top 1% of global research institutions in 14 of 22 research fields
• Top 0.1% in 4 research fields

[Map of sites: Darwin; Alice Springs; Geraldton (2 sites); Atherton; Townsville (2 sites); Rockhampton; Toowoomba; Gatton; Myall Vale; Narrabri; Mopra; Parkes; Griffith; Belmont; Geelong; Hobart; Sandy Bay; Wodonga; Newcastle; Armidale (2 sites); Perth (3 sites); Adelaide (2 sites); Sydney (5 sites); Canberra (7 sites); Murchison; Cairns; Irymple; Melbourne (5 sites); Werribee (2 sites); Brisbane (6 sites); Bribie Island]

Page 4: Population-scale high-throughput sequencing data analysis

Our business units

12 Research Divisions

11 National Research Flagships

+ National Research Facilities and Collections

+ Transformational Capability Platforms

• FOOD, HEALTH & LIFE SCIENCE INDUSTRIES
• ENVIRONMENT
• MANUFACTURING, MATERIALS & MINERALS
• ENERGY
• INFORMATION & COMMUNICATIONS

Page 5: Population-scale high-throughput sequencing data analysis

Our track record: top inventions

1. FAST WLAN (Wireless Local Area Network)
2. POLYMER BANKNOTES
3. RELENZA FLU TREATMENT
4. EXTENDED WEAR CONTACTS
5. AEROGARD
6. TOTAL WELLBEING DIET
7. RAFT POLYMERISATION
8. BARLEYMAX
9. SELF TWISTING YARN
10. SOFTLY WASHING LIQUID

Page 6: Population-scale high-throughput sequencing data analysis

Part 1: The ‘omics project

The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome

Page 7: Population-scale high-throughput sequencing data analysis

Data from Pilot Study

Full Cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith) organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle)


Page 8: Population-scale high-throughput sequencing data analysis

Considerations before sequencing: Undersampling

• Objective: capture genomic variants reliably in tumour, normal and adipose tissue.

• Sequencing effort:
  • 12 tumour -> 6 lanes (2-plex)
  • 12 normal -> 3 lanes (4-plex)
  • 12 adipose -> 3 lanes (4-plex)

More depth is needed due to potentially low cellularity in the tumour sample.

[Figure: tumour sample vs normal sample, with the additional depth marked]
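The lane counts on this slide follow from dividing the number of samples by the multiplexing level. A minimal shell sketch of that arithmetic is given here; the function name `lanes_needed` is illustrative and not part of any tool mentioned in the talk.

```shell
#!/usr/bin/env bash
# Lanes needed to sequence a number of samples at a given multiplexing
# level (samples per lane), rounded up to whole lanes.
lanes_needed() {
    local samples=$1 plex=$2
    echo $(( (samples + plex - 1) / plex ))
}

tumour=$(lanes_needed 12 2)    # 2-plex: each sample gets half a lane
normal=$(lanes_needed 12 4)    # 4-plex: each sample gets a quarter of a lane
adipose=$(lanes_needed 12 4)

echo "tumour=${tumour} normal=${normal} adipose=${adipose} total=$(( tumour + normal + adipose ))"
```

At 2-plex a tumour sample receives roughly twice the share of a lane that a 4-plex normal or adipose sample does, which is where the additional depth comes from.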

Page 9: Population-scale high-throughput sequencing data analysis

Considerations before sequencing: Flowcell design

• Objective: process samples avoiding confounding factors

[Figure: lean (L1, L2) and obese (O1, O2) samples from normal, adipose and tumour tissue, each pooled 4-plex. Panel labels: "Sequence on one lane each" and "Sequenced over 3 lanes".]

Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode)

Page 10: Population-scale high-throughput sequencing data analysis

Considerations for Omics Proj.: Flowcell design

• Population-scale sequencing with more samples than Illumina barcodes: imbalanced flowcell design will split samples and pair the halves with different partners (e.g. Lean Subject 1.1 + Obese Subject 1.1; Lean Subject 1.2 + Obese Subject 3.2)

[Figure: flowcell layout with rows Normal (4-plex), Adipose (4-plex) and Tumour (2-plex); 12 lanes in total]

Lane 1: L1.1, O1.1
Lane 2: L2.1, O2.1
Lane 3: L3.1, O3.1
Lane 4: L4.1, O4.1
Lane 5: L1.2, O3.2
Lane 6: L2.2, O4.2
Lane 7: L3.2, O1.2
Lane 8: L4.2, O2.2

L = Lean, O = Obese

L1.1 = Lean individual 1, part 1 (of 2) ...

Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
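The split-and-re-pair layout on this slide can be generated programmatically. The sketch below is one way to reproduce the eight pairings for four lean and four obese subjects. The rotation offset of 2 is read off the slide's example (L1.2 shares a lane with O3.2), and the function name `split_pairs` is hypothetical.

```shell
#!/usr/bin/env bash
# Print lane pairings for n lean (L) and n obese (O) subjects whose
# libraries are each split into two parts. Part 1 pairs Li with Oi;
# part 2 pairs Li with a rotated obese partner, so the two halves of
# a subject meet different partners.
split_pairs() {
    local n=$1 offset=$2 i partner
    for (( i = 1; i <= n; i++ )); do
        echo "Lane${i}: L${i}.1 O${i}.1"
    done
    for (( i = 1; i <= n; i++ )); do
        partner=$(( (i - 1 + offset) % n + 1 ))
        echo "Lane$(( n + i )): L${i}.2 O${partner}.2"
    done
}

split_pairs 4 2
```

Changing the offset gives other pairings. Any offset that is not a multiple of `n` keeps the two halves of a subject with different partners.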

Page 11: Population-scale high-throughput sequencing data analysis

Blue Monster says

Design your experiment with project-specific pitfalls in mind

Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781


Page 12: Population-scale high-throughput sequencing data analysis

Part 2: NGS Data Processing

Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance-compute clusters/cloud

Page 13: Population-scale high-throughput sequencing data analysis

Resource consumption for Variant Calling

High-Performance-Compute

Script: #PBS -l nodes=2:ppn=8

Submission: qsub -t 1-36 task.qsub

Scheduler

36 samples (2.7 TB of data) require on average 128 hours of CPU time (ste = 15) and 77 GB of RAM (ste = 0.34).

[Figure: CPU (hours), real time (hours) and memory (GB)]
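The submission on this slide is a scheduler array job: `qsub -t 1-36` starts 36 copies of `task.qsub`, one per sample. Below is a minimal sketch of what such a script could look like under Torque/PBS, where each copy reads its index from `PBS_ARRAYID`. The file name `samples.txt` and the helper `sample_for_task` are assumptions for illustration and are not taken from the talk.

```shell
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=8
# Resource request copied from the slide: 2 nodes with 8 cores each.

# Print the entry on line <index> of a sample list (one fastq path per line).
sample_for_task() {
    local index=$1 list=$2
    sed -n "${index}p" "$list"
}

# Torque sets PBS_ARRAYID for each array member; default to 1 for local tests.
task_id=${PBS_ARRAYID:-1}

# samples.txt is a hypothetical list of the 36 input files.
if [[ -f samples.txt ]]; then
    sample=$(sample_for_task "$task_id" samples.txt)
    echo "task ${task_id} would process ${sample}"
fi
```

Keeping the sample list in a file means the same script serves any cohort size; only the range given to `qsub -t` changes.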

Page 14: Population-scale high-throughput sequencing data analysis

Tailored processing for different sequencing applications

Wet-lab Protocols | Production Informatics

• Variant Calling
• Methylation Sites
• Gene Expression

Despite different approaches we want to use the same processing framework!

doi:10.1038/nbt.2421

Page 15: Population-scale high-throughput sequencing data analysis

Wish list for a framework

• reusability
• cutting edge
• data security
• HPC environment
• reproducibility
• robustness
• adaptability
• knowledge transfer (publication)
• efficient

Page 16: Population-scale high-throughput sequencing data analysis


Page 17: Population-scale high-throughput sequencing data analysis


Page 18: Population-scale high-throughput sequencing data analysis

Assess experimental success quickly

Page 19: Population-scale high-throughput sequencing data analysis

DEMO - files

Project X
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq

We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2)

Page 20: Population-scale high-throughput sequencing data analysis

DEMO – setting up config file

#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )

#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2

#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa

We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use iGenomes).
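Because the config file is plain bash, a driver script can source it and loop over the listed folders. The sketch below is not NGSANE's actual implementation. It is a hypothetical illustration of how a dry run could enumerate the inputs for an enabled task, assuming the `fastq/<folder>/` layout from the previous slide and a made-up function `list_todo`.

```shell
#!/usr/bin/env bash
# Hypothetical dry run: source a bash config that defines DIR (array of
# folders) and RUNMAPPINGBOWTIE2, then list the fastq files that would
# be processed. Nothing is submitted to a scheduler.
list_todo() {
    local config=$1 root=$2 d f
    # shellcheck disable=SC1090
    source "$config"
    if [[ "${RUNMAPPINGBOWTIE2:-}" == "1" ]]; then
        echo "[Task] bowtie2"
        for d in "${DIR[@]}"; do
            for f in "$root"/fastq/"$d"/*.fastq; do
                [[ -e "$f" ]] || continue    # skip folders without fastq files
                echo "[TODO] ${d}/$(basename "$f")"
            done
        done
    fi
}

# Run against a config in the current directory, if one exists.
if [[ -f config.txt ]]; then
    list_todo config.txt .
fi
```

Listing the work before submitting anything makes it cheap to catch a wrong folder name or a missing input file.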

Page 21: Population-scale high-throughput sequencing data analysis

DEMO – dry run

bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup enviroment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run3_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1

We run NGSANE in dry-run mode to test which jobs it would submit.

Page 22: Population-scale high-throughput sequencing data analysis

DEMO – submit

bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901

We submit HPC jobs. Check out the returned qsub identifiers.

Page 23: Population-scale high-throughput sequencing data analysis

DEMO – scheduler

bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c

burnet-srv.idpx.hpsc.csiro.au:
                                                                               Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9085     1      2     -- 00:05 R 00:00
2424900.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9178     1      2     -- 00:05 R 00:00
2424901.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9353     1      2     -- 00:05 R 00:00

Three HPC jobs run in parallel because there were three fastq files. But there is no limit to the number of files to process in parallel: easy scale-up to populations.

Page 24: Population-scale high-throughput sequencing data analysis

DEMO – report

bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R
--R version 3.0.0 (2013-04-03) -- "Masked Marvel"
--Python
--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014

More report examples

Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics).

Page 25: Population-scale high-throughput sequencing data analysis

DEMO - files

Project X
  Summary HTML
  Exp1
    Bowtie
      Run1.bam
      Run2.bam
  Exp2
    Bowtie
      Run3.bam
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq

The resulting file structure: every experiment has a folder with the tasks as subfolders and, in them, the results (here: bam files)

Page 26: Population-scale high-throughput sequencing data analysis

NGSANE currently supports

• Transfer data (smbclient)
• Quality Control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trimgalore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, Tophat)
• Transcript Quantification (cufflinks, htseq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (annovar)
• 3D Genome structure (Hicup, fit-hi-c, Hiclib, Homer)

Page 27: Population-scale high-throughput sequencing data analysis

For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine


Page 28: Population-scale high-throughput sequencing data analysis

Blue Monster says

Analyze your data to be reproducible and well documented with tools that scale well to larger datasets

Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576

Page 29: Population-scale high-throughput sequencing data analysis

Part 3: Combining Omics Data

Seeing the full picture requires taking all information into account


Page 30: Population-scale high-throughput sequencing data analysis

Result overview: traditional differential analysis

1. 722 genes differentially expressed (DE) between tumour and normal
   • QC: We have good concordance with genes known to be up/down regulated in CRC

2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
   • QC: good concordance with previously reported gut methylation profile

[Figure legend: Known DE gene; Known DM locations. Sources: Fernandez et al. Genome Res. 2012; CSIRO in-house]

Page 31: Population-scale high-throughput sequencing data analysis

Microbial Population: traditional population survey

Paul Greenfield

Page 32: Population-scale high-throughput sequencing data analysis

Data integration

(image credit: Francis Tabary)


Page 33: Population-scale high-throughput sequencing data analysis

DNA methylation: Blood signatures in Adipose and Gut samples

Tim Peters

Some gut/adipose samples have blood-like signatures.

Page 34: Population-scale high-throughput sequencing data analysis

Exonseq: blood-signatures stem from a blood-plasma protein

Contamination by ADM2, a gene expressed in blood plasma

[Figure: contamination (%) and expression per individual]

Plasma protein ADM2 makes up most of the human material in the digesta (number of reads mapping to human genome)

Page 35: Population-scale high-throughput sequencing data analysis

Medical History: Blood potentially resulting from medication

• CARTIA: 14, 50, 57
• WARFARIN: 40
• ASPIRIN: 59, 7
• COPLAVIX: 12
• No anti-clotting drug: 2, 62, 4
• No medication: 19, 20

Wilcoxon rank sum test p-value = 0.02

Anti-thrombosis drugs significantly enriched in individuals with human material in digesta.

Page 36: Population-scale high-throughput sequencing data analysis

Microbial data: Blood “liking” opportunistic bacteria are enriched in contaminated samples

E. coli, Salmonella, etc.

Opportunistic pathogens. Respond to inflammation and bleeding.

Bacterial marker for low-level chronic gut bleeding?

Page 37: Population-scale high-throughput sequencing data analysis

Blue Monster says

Integrating different ‘omics data is still a challenge.


Page 38: Population-scale high-throughput sequencing data analysis

Three things to remember

• Good experimental design is necessary (even) in sequencing experiments

• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon’s elastic cloud)

• Promising research opportunities are in the integration of multiple high-throughput data sources


Page 39: Population-scale high-throughput sequencing data analysis

COMPUTATIONAL INFORMATICS

Thank you

Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e [email protected]
www.csiro.au/bioinformatics

Buske et al., Bioinformatics, Jan 2014

More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde

Garvan Institute of Medical Research, Sydney, Australia: Fabian A. Buske, Susan Clark, Hugh French, Martin Smith

Computational Informatics, CSIRO, Australia: Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz

Animal Food and Health Science, CSIRO, Australia: Garry Hannan

University of Newcastle, Australia: Rodney Scott

Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund

http://www.genome-engineering.com.au/