Understanding the Computational Challenges in Large-Scale Genomic Analysis


Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center

Acknowledgments

STSI
Kristopher Standish*
Tristan M. Carland
Nicholas J. Schork*

SDSC
Wayne Pfeiffer
Mahidhar Tatineni
Rick Wagner
Christopher Irving

Janssen R&D
Chris Huang
Sarah Lamberth
Zhenya Cherkas
Carrie Brodmerkel
Ed Jaeger
Martin Dellwo
Lance Smith
Mark Curran
Sandor Szalma
Guna Rajagopal

* Now at the J. Craig Venter Institute

Outline
- What's the problem?
- The 438-genome project with Janssen
  - the scientific premise
  - 4-step computational procedure
  - gained insight
- Costs of population-scale sequencing
- Designing systems for genomics
- Final remarks

Computational Demands

Cost of Sequencing

[Figure: cost per genome over time. Source: NHGRI (genome.gov)]

The cost of sequencing a human genome has dropped far enough that studies involving hundreds or thousands of whole human genomes are now viable. While the benefits to medicine and healthcare are undeniable, sequencing costs have been falling faster than Moore's law for half a decade. As a result, the cutting edge of genomics research is no longer limited by the number of genomes that can be sequenced, but by the computational power available to process the raw data coming off the sequencers.

Sequencing Cost vs. Transistor Cost

[Figure: sequencing cost vs. transistor cost over time. Source: Intel (ark.intel.com)]

Sequencing is only the Beginning

Although most viewers may already be aware, physically sequencing a patient's genome with an NGS instrument is just the beginning. The raw reads coming off the sequencer must be reassembled computationally into a meaningful description of the sample's whole genome.

Scaling up to Populations

Petco Park = 40,000 people
Gordon = 16,384 cores / 65,536 GB memory / 307,200 GB solid-state disk

As we move into population-scale sequencing, the effect of Moore's law not keeping pace with sequencing technology begins to manifest. We are rapidly approaching a point where supercomputers will have to become an integral part of cutting-edge genomic analysis.

Core Computational Demands

Read mapping and variant calling:
- required by almost all (human) genomic studies
- multi-step pipeline to refine the mapped genome
- pipelines are similar in principle, different in detail

Statistical analysis (the science):
- population-scale solutions don't yet exist
- difficult to define computational requirements

Storage:
- short-term capacity for the study
- long-term archiving
- cost: store vs. re-sequence (see the sketch after this list)

Given these challenges, how do we even start to solve them?
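The "store vs. re-sequence" bullet hides a simple break-even calculation. A minimal sketch follows; the per-genome size is derived from figures later in this talk (50 TB of compressed reads for 438 genomes), but the storage and sequencing prices are hypothetical placeholders, not numbers from the presentation.

```python
# Back-of-envelope "store vs. re-sequence" comparison.
# NOTE: both prices below are HYPOTHETICAL placeholders for illustration.
GENOME_TB = 50.0 / 438            # ~0.115 TB of compressed raw reads per genome
USD_PER_TB_YEAR = 500.0           # assumed cost to keep 1 TB online for a year
USD_PER_GENOME = 1500.0           # assumed cost to re-sequence one sample

def storage_cost(years: float) -> float:
    """Cost of keeping one genome's raw reads online for `years` years."""
    return GENOME_TB * USD_PER_TB_YEAR * years

for years in (1, 5, 10, 30):
    cheaper = "store" if storage_cost(years) < USD_PER_GENOME else "re-sequence"
    print(f"{years:>2} yr: store ${storage_cost(years):7,.0f} "
          f"vs re-sequence ${USD_PER_GENOME:,.0f} -> {cheaper}")
```

At these assumed prices storing wins for decades before re-sequencing becomes cheaper; the real decision also hinges on sample availability and consent, which no price model captures.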

438 Patients: A Case Study

Scientific Problem

Rheumatoid arthritis (RA):
- autoimmune disorder
- permanent joint damage if left untreated
- patients respond to treatments differently

Janssen developed a new treatment (the TNF inhibitor golimumab):
- more effective in some patients...
- ...but patients respond to treatments differently

[X-ray showing permanent joint damage resulting from RA. Source: Bernd Brügelmann]

Scientific Goals

Can we predict whether a patient will respond to the new treatment?

1. Sequence the whole genomes of patients undergoing treatment in clinical trials
2. Correlate variants (known and novel) with patients' response or non-response to treatment (sketched below)
3. Develop a predictive model based on called variants and patient response

Predicting patient response means effective treatment starts sooner, which shrinks the window in which irreversible joint damage can occur.
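As a conceptual illustration of step 2 only (the slides don't specify the study's actual statistical method), each called variant could be tested for association with response using a per-variant contingency test. The function name and toy data here are hypothetical.

```python
# Conceptual per-variant association test: does carrying a variant
# correlate with responding to the treatment?
from scipy.stats import fisher_exact

def variant_response_test(genotypes, responded):
    """genotypes: 0/1 per patient (variant absent/present);
    responded: 0/1 per patient (non-responder/responder).
    Returns the Fisher's exact test p-value for the 2x2 table."""
    table = [[0, 0], [0, 0]]
    for g, r in zip(genotypes, responded):
        table[g][r] += 1
    _, p_value = fisher_exact(table)
    return p_value

# Toy example: a variant carried mostly by non-responders
genos = [1, 1, 1, 0, 0, 0, 1, 0]
resp  = [0, 0, 0, 1, 1, 1, 0, 1]
print(variant_response_test(genos, resp))   # small p-value -> candidate association
```

Run across millions of variants and 438 patients, even this simplest test makes the multiple-testing burden and the compute demands of step 2 obvious.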

Sequencing the whole genome rather than only known regions means an unbiased examination:
- target genes, gene networks, and systems involved in the TNF-inhibitor pathway...
- ...and potentially discover new associations

Computational Goals

Problem: the original mapped reads and individually called variants provided no insight
- BWA aln pipeline
- SOAPsnp and SAMtools pileup to call SNPs and indels

Solution: re-align all reads with newer algorithms and employ group variant calling
- BWA mem pipeline
- GATK HaplotypeCaller to produce high-quality called variants

(Speaker note: the newest sequencers deliver VCFs.)
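A hedged sketch of this re-alignment and group-calling flow, assuming GATK 3-era command-line syntax and placeholder file names; the study's actual 9-step mapping and 5-step calling pipelines are not enumerated in these slides.

```python
# Minimal per-sample re-alignment plus cohort-level group calling.
import subprocess

REF = "hs37d5.fa"  # placeholder reference FASTA

def map_reads(sample, fq1, fq2, threads=16):
    """Re-align one sample's raw reads with bwa mem, then sort and index."""
    bwa = subprocess.Popen(
        ["bwa", "mem", "-t", str(threads), REF, fq1, fq2],
        stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", f"{sample}.bam", "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    bwa.wait()
    subprocess.run(["samtools", "index", f"{sample}.bam"], check=True)

def call_sample_gvcf(sample):
    """Per-sample HaplotypeCaller in GVCF mode, for later joint genotyping."""
    subprocess.run(
        ["java", "-jar", "GenomeAnalysisTK.jar",
         "-T", "HaplotypeCaller", "-R", REF,
         "-I", f"{sample}.bam",
         "--emitRefConfidence", "GVCF",
         "-o", f"{sample}.g.vcf"], check=True)

def joint_genotype(samples):
    """Group variant calling across all samples' GVCFs."""
    cmd = ["java", "-jar", "GenomeAnalysisTK.jar",
           "-T", "GenotypeGVCFs", "-R", REF, "-o", "cohort.vcf"]
    for s in samples:
        cmd += ["--variant", f"{s}.g.vcf"]
    subprocess.run(cmd, check=True)
```

The appeal of this split is that mapping and per-sample calling stay embarrassingly parallel across all 438 genomes; only the final joint-genotyping step needs to see the whole cohort.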

Computational Approach
1. Understand computational requirements
2. Develop workflow
3. Load dataset into HPC resource
4. Perform full-scale computations

Step #1: Data Requirements

Input data:
- raw reads from 438 full human genomes (fastq.gz)
- 50 TB of compressed data from Janssen R&D

Output data:
- +50 TB of high-quality mapped reads
- + a small amount (< 1 TB) of called variants

Intermediate (scratch) data:
- +250 TB (more = better)

Performance:
- data must be stored online
- high bandwidth to storage (> 10 GB/s)

Step #1: Compute & Other Requirements

Processing requirements:
- perform read mapping on all genomes: a 9-step pipeline to achieve high-quality read mapping
- perform variant calling on groups of genomes: a 5-step pipeline for group variant calling

Engineering requirements:
- FAST turnaround (all 438 genomes done in < 2 months) requires a high-capacity supercomputer (many CPUs)
- EFFICIENT execution (minimum core-hours used) requires a data-oriented architecture (RAM, SSDs, I/O)

Can we even satisfy these requirements?

Data Requirements: SDSC Data Oasis
- 1,400 terabytes of project data storage
- 1,600 terabytes of fast scratch storage
- data available to every compute node

Processing/Eng'g Requirements: SDSC Gordon
- 1,024 compute nodes
- 64 GB of RAM each
- 300 GB local SSD each
- access to Data Oasis at up to 100 GB/s
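A quick sanity check of the Step #1 figures, using only numbers quoted on the slides above:

```python
# Back-of-envelope check of the stated data and turnaround requirements.
genomes = 438
input_tb = 50.0                         # compressed fastq.gz from Janssen R&D

per_genome_gb = input_tb * 1000 / genomes
print(f"~{per_genome_gb:.0f} GB of compressed raw reads per genome")   # ~114 GB

# Peak footprint: input + mapped output + intermediate scratch
total_tb = input_tb + 50 + 250
print(f"peak footprint ~{total_tb:.0f} TB vs 1,600 TB of fast scratch")

# Turnaround: 438 genomes in under 2 months
per_day = genomes / 60
print(f"must finish ~{per_day:.1f} genomes/day end-to-end")

# Time to stage 50 TB at the quoted 10 GB/s storage bandwidth
hours = input_tb * 1000 / 10 / 3600
print(f"staging 50 TB at 10 GB/s takes ~{hours:.1f} h")
```

The arithmetic shows why the requirements fit Gordon and Data Oasis: the ~350 TB peak footprint sits comfortably inside the scratch file system, and even the full input set can be staged in under two hours at the quoted bandwidth.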

Step #2: The Workflow

[Diagram: a workflow engine maps applications and input data (raw binary reads; VCF records beginning "##fileformat=VCFv4.1") onto compute hardware and storage, subject to resource policies]
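The workflow engine appears only as a diagram above. Below is a minimal, hypothetical sketch of the idea it conveys: declare each pipeline step by the files it consumes and produces, and let a scheduler run whatever is ready, so a 9-step mapping pipeline plus a 5-step calling pipeline become data dependencies rather than a fixed script. This is not the engine SDSC actually used.

```python
# Toy data-dependency scheduler illustrating the "workflow engine" concept.
import os
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    inputs: List[str]            # files this step consumes
    outputs: List[str]           # files this step produces
    run: Callable[[], None]      # action, e.g. a subprocess call

def execute(steps: List[Step]) -> None:
    """Naive scheduler: run any step whose inputs exist until all are done."""
    pending = [s for s in steps
               if not all(os.path.exists(f) for f in s.outputs)]
    while pending:
        ready = [s for s in pending
                 if all(os.path.exists(f) for f in s.inputs)]
        if not ready:
            raise RuntimeError("stalled: unmet inputs for " +
                               ", ".join(s.name for s in pending))
        for s in ready:
            print(f"running {s.name}")
            s.run()
        pending = [s for s in pending if s not in ready]

# e.g. one sample's two coarse stages (hypothetical actions elided):
# steps = [Step("map",  ["s1.fastq.gz"], ["s1.bam"],   run=map_s1),
#          Step("call", ["s1.bam"],      ["s1.g.vcf"], run=call_s1)]
# execute(steps)
```

Because steps are declared by their files, the same description can be replayed after a node failure: completed outputs are detected on disk and only the missing work reruns, which matters when 438 genomes share one machine.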