
The St. Jude Children’s Research Hospital/Washington University Pediatric Cancer Genome Project: A CIO’s Perspective

Clayton W. Naeve, Ph.D.

Endowed Chair in Bioinformatics

SVP & CIO

St. Jude Children’s Research Hospital

Page 2: The Data Deluge

[Chart: St. Jude Data: The First 50 Years. Y-axis: terabytes (0 to 2,000). Series: Admin/Clinical, Research.]

• 48 years (800 TB)
• 2 1/2 years (1,000 TB)
• PCGP data: 917 TB, 148 million files
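The two chart annotations can be turned into average growth rates. The sketch below is only a rough reading of the chart: it assumes the two annotations describe separate periods and that growth within each period was roughly even, neither of which the slide states.

```python
# Average data growth rates implied by the chart annotations.
# All four inputs are taken from the slide; treating the periods as
# non-overlapping with even growth is an ASSUMPTION.
FIRST_PERIOD_YEARS = 48
FIRST_PERIOD_TB = 800
RECENT_PERIOD_YEARS = 2.5
RECENT_PERIOD_TB = 1000

first_rate = FIRST_PERIOD_TB / FIRST_PERIOD_YEARS      # TB per year
recent_rate = RECENT_PERIOD_TB / RECENT_PERIOD_YEARS   # TB per year
speedup = recent_rate / first_rate

print(f"first 48 years:   {first_rate:.1f} TB/year")
print(f"latest 2.5 years: {recent_rate:.1f} TB/year")
print(f"ratio:            {speedup:.0f}x")
```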

Page 3: The PCGP Project

St. Jude/WashU Pediatric Cancer Genome Project

• Launched Feb. 2010
• St. Jude/WashU collaboration
• WGS on 600 patients (leukemia, brain tumors, solid tumors)
• Matched germline and tumor samples
• 1,200 genomes (~90 billion bp/genome) in 36 months
• ~2 petabytes of data
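The project targets imply some simple arithmetic, sketched below. The haploid human genome size of about 3.1 billion bp is an assumption that does not appear on the slide, and decimal units (1 PB = 1,000 TB) are assumed; the other inputs come from the bullets.

```python
# Back-of-the-envelope arithmetic for the PCGP sequencing targets.
PATIENTS = 600               # from the slide
GENOMES_PER_PATIENT = 2      # matched germline + tumor (slide)
BP_PER_GENOME = 90e9         # ~90 billion bp sequenced per genome (slide)
TOTAL_DATA_PB = 2.0          # ~2 petabytes (slide)
HUMAN_GENOME_BP = 3.1e9      # ASSUMPTION: approximate haploid genome size

genomes = PATIENTS * GENOMES_PER_PATIENT
coverage = BP_PER_GENOME / HUMAN_GENOME_BP          # average depth
total_bp = genomes * BP_PER_GENOME                  # bases across the project
tb_per_genome = TOTAL_DATA_PB * 1000 / genomes      # decimal TB per genome

print(f"genomes:          {genomes}")
print(f"approx. coverage: {coverage:.0f}x")
print(f"total bases:      {total_bp:.2e}")
print(f"data per genome:  {tb_per_genome:.2f} TB")
```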

Page 4: PCGP Challenges

Challenges to Information Sciences

• Moving data
• Data workflow
• Data analysis
• Computational horsepower
• Data storage
• Data sharing

Page 5: Moving Data

• Multi-terabyte data transit across networks is not trivial
• DNA sequence raw data reads, contig assembly, alignment to reference, variants, etc. shipped to SJCRH as binary BAM files: ~100 GB
• 24 hrs to infinity to send via commodity internet
• Internet2 connectivity (10 Gbps via MRC) to transfer files from WashU to SJCRH
• Evaluated 5 different fast data transfer algorithms; selected FDT (developed at Caltech to transfer LHC data at CERN)
• Developed a pipeline to facilitate transfer
• Today: ~5 hour transit time/file
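The transfer figures on this slide can be related with a little arithmetic. The sketch below assumes decimal units (1 GB = 8 gigabits) and takes the ~100 GB file size, the 10 Gbps link, and the 5 hour and 24 hour transit times from the bullets. The function names are illustrative, and the numbers say nothing about where the time between line rate and observed transit is spent.

```python
def transfer_hours(size_gb: float, rate_gbps: float) -> float:
    """Hours needed to move size_gb gigabytes at rate_gbps gigabits/second."""
    seconds = size_gb * 8 / rate_gbps
    return seconds / 3600


def effective_rate_mbps(size_gb: float, hours: float) -> float:
    """Average throughput in megabits/second for a transfer of size_gb in hours."""
    return size_gb * 8 * 1000 / (hours * 3600)


BAM_GB = 100  # approximate BAM file size from the slide

ideal_hours = transfer_hours(BAM_GB, 10)          # 10 Gbps link at full line rate
today_mbps = effective_rate_mbps(BAM_GB, 5)       # "~5 hour transit time/file"
commodity_mbps = effective_rate_mbps(BAM_GB, 24)  # best case "24 hrs" on commodity internet

print(f"ideal at 10 Gbps:      {ideal_hours * 3600:.0f} s")
print(f"5 h transit implies:   {today_mbps:.1f} Mbps")
print(f"24 h transit implies:  {commodity_mbps:.1f} Mbps")
```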

Page 6: Moving Data

[Diagram: HPCF 505 network and storage layout. Components shown: IBM SoNAS (734 TB usable); SGI Altix UV 1000 (640 cores/5 TB RAM); IBM BladeCenter Cluster (810 cores/3 TB RAM); IBM iDataPlex (1,008 cores/4 TB RAM); SGI IS5500 (60 TB usable); PVFS Servers; ESX Cluster (29 servers); Data Transfer Node (dtn01); Internal Data Transfer Node (datamover); two Mellanox IS5200 Chassis Switches; two Mellanox Grid Director 4036 switches; two Mellanox BridgeX BX5020 gateways; two 10 GE Campus Network links.]

Page 7: Moving Data

Page 8: Data Workflow

• Began work on PCGP 9 months prior to launch
• Developed a LIMS for the Validation Lab
• Developed a PCGP SharePoint site to facilitate collaboration internally
• Developed a bioinformatics workflow engine: PALLAS
  • Security management
  • Data provenance management
  • Intermediate and final result tracking
  • Flexible workflow design
  • Rapid configuration of new analytical algorithms/tools
  • Web-based LSF job submission and monitoring
  • Supports a range of protocols to connect to other web application systems, databases, file systems, etc.
  • Integrated with applications such as SRM, Genome Browser, etc.
  • Data integration with tissue sample, clinical, and research data
  • Vision: parse each algorithm to the appropriate computing environment
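The PALLAS "vision" bullet and the LSF submission bullet can be illustrated with a small routing sketch. This is not PALLAS code: the `Job` class, the queue names, the memory threshold, and the example commands are all hypothetical. Only the `bsub` options used (`-J`, `-q`, `-n`, `-R`, `-o`) are standard LSF options, and the sketch only builds the command line without submitting anything.

```python
from dataclasses import dataclass
from typing import List

# ASSUMPTION: queue names and the large-memory threshold are invented for
# illustration; a real site would use its own LSF queue configuration.
QUEUES = {"cluster": "normal", "gpu": "gpu", "large_memory": "bigmem"}
LARGE_MEMORY_THRESHOLD_GB = 256


@dataclass
class Job:
    """Hypothetical description of one analysis step."""
    name: str
    command: str
    cores: int = 1
    mem_gb: int = 4
    needs_gpu: bool = False


def choose_environment(job: Job) -> str:
    """Pick a computing environment from the job's stated requirements."""
    if job.needs_gpu:
        return "gpu"
    if job.mem_gb >= LARGE_MEMORY_THRESHOLD_GB:
        return "large_memory"
    return "cluster"


def bsub_args(job: Job) -> List[str]:
    """Build (but do not run) an LSF bsub command line for the job."""
    env = choose_environment(job)
    return [
        "bsub",
        "-J", job.name,                             # job name
        "-q", QUEUES[env],                          # queue for the chosen environment
        "-n", str(job.cores),                       # number of job slots
        "-R", f"rusage[mem={job.mem_gb * 1024}]",   # memory units are site-configured; MB assumed
        "-o", f"{job.name}.%J.out",                 # %J expands to the LSF job ID
        job.command,
    ]


# Hypothetical example jobs; the command strings are placeholders.
jobs = [
    Job("snv_call", "run_snv_caller sample.bam", cores=8, mem_gb=16),
    Job("contig_assembly", "run_assembler sample.fastq", cores=32, mem_gb=512),
    Job("gpu_align", "run_gpu_aligner sample.fastq", needs_gpu=True),
]
for job in jobs:
    print(choose_environment(job), " ".join(bsub_args(job)))
```

Keeping the routing rule in one function means a new environment only needs a new branch and a new queue entry, which is one way to read the "flexible workflow design" bullet.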

Page 9: Data Analyses

Jinghui Zhang and CompBio Team

• BAM Quality Assurance:
  • Tumor Purity Algorithm (SJCRH)
  • Not Disease/Genomic Swap (SNP checks)
  • Xenograft Filter (remove contaminating mouse reads)
  • Gene Exon and Genome Coverage algorithms (Gang Wu)
• BAM file work:
  • BAM file extraction and visualization
  • SAMtools and C++/BioPerl APIs
  • Bambino
  • IGV
• Single Nucleotide Variation:
  • FreeBayes
  • In-house PCGP
• Copy Number Variation:
  • Stan’s Copy Number Algorithm
  • Regression Tree Algorithm
• Structural Variation:
  • One End Anchored Inference:
    • CREST
    • ViralTopology
• Fusion Detection:
  • In-house (Michael Rusch)
• RNAseq:
  • RNAseq MySQL/Cufflinks
• ChIP-seq:
  • ChIP-seq MySQL/in-house (John Obenauer)
• viralScan:
  • In-house (McGoldrick)
• Integration:
  • GFF intersect
  • Gff2fasta
  • gffBuilders
  • Cancer warehouse
• Visualization:
  • Circos maker
  • BED GFF Tracks maker
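As an illustration of what a "GFF intersect" integration step does, here is a minimal sketch that reports overlaps between two sets of GFF features. It is not the in-house tool named on the slide: the function names and the sample records are invented, and the naive pairwise comparison would need an index or a sort-and-sweep approach at genome scale. It relies only on the standard GFF layout of nine tab-separated columns with 1-based, inclusive start and end coordinates.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Feature(NamedTuple):
    seqid: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    attributes: str


def parse_gff(lines: Iterable[str]) -> List[Feature]:
    """Parse GFF lines, skipping blank lines and '#' comment/directive lines."""
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        features.append(Feature(cols[0], int(cols[3]), int(cols[4]), cols[8]))
    return features


def intersect(a: List[Feature], b: List[Feature]) -> List[Tuple[Feature, Feature, int]]:
    """Return (feature_a, feature_b, overlap_in_bp) for every overlapping pair."""
    hits = []
    for fa in a:
        for fb in b:
            if fa.seqid == fb.seqid and fa.start <= fb.end and fb.start <= fa.end:
                overlap = min(fa.end, fb.end) - max(fa.start, fb.start) + 1
                hits.append((fa, fb, overlap))
    return hits


# Made-up records for illustration only.
exons = parse_gff([
    "##gff-version 3",
    "chr1\tdemo\texon\t100\t200\t.\t+\t.\tID=exon1",
    "chr1\tdemo\texon\t300\t400\t.\t+\t.\tID=exon2",
])
calls = parse_gff([
    "chr1\tdemo\tSNV\t150\t150\t.\t.\t.\tID=snv1",
    "chr1\tdemo\tdeletion\t390\t450\t.\t.\t.\tID=del1",
    "chr2\tdemo\tSNV\t150\t150\t.\t.\t.\tID=snv2",
])

hits = intersect(exons, calls)
for exon, call, overlap in hits:
    print(exon.attributes, call.attributes, overlap)
```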

Page 10: Computational Horsepower (HPCF)

• IBM BladeCenter (810 cores/3 TB RAM)
• IBM iDataPlex (1,008 cores/4 TB RAM) – April 2010
• SGI Altix UV 1000 (640 cores/5 TB RAM/60 TB storage using Lustre v2.2) – December 2011
• IBM SoNAS (780 TB) – March 2011
• Data Transfer Node (10 Gbps I2 connection) – April 2011
• Internal Data Transfer Node (10 Gbps x2) – June 2011
• QDR InfiniBand (40 Gbps for all HPC equipment) – January 2012
• Software (Platform LSF, Intel Parallel Studio)
• Total: 2,366 cores, 13 TB RAM (estimated 11.6 Tflops)

• 2010: 365,000 CPU hours
• 2011: 712,000 CPU hours

Page 11: Data Storage

• IBM SoNAS (780 TB) – March 2011
• Scales to 21 PB; 1 billion files/filesystem; 7,200 drives
• Current total on campus: 3.8 petabytes (3,800,000 GB)
• PCGP uses 917 TB (plus 500 TB on tape), 148 million data files

• IBM TSM systems for backup/archive (tiered)
• 240 SAS (15k) drives
• 480 SAS-NL (7.2k) drives
• Current 7,900 tape capacity, up to 1.6 TB/tape; 12.6+ PB total
• 734 TB usable under one file system
• High speed/low latency backend interconnect (QDR InfiniBand, 20 Gb per port and 100 ns latency)
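The storage figures can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and that "1.6 TB/tape" is the capacity of every tape; all input numbers come from the slide.

```python
# Cross-checks on the storage figures (decimal units assumed).
TAPES = 7_900
TB_PER_TAPE = 1.6          # "up to" value on the slide, used here as a ceiling
CAMPUS_PB = 3.8
PCGP_TB = 917
PCGP_FILES = 148_000_000

tape_capacity_pb = TAPES * TB_PER_TAPE / 1000
campus_gb = CAMPUS_PB * 1_000_000
avg_file_mb = PCGP_TB * 1_000_000 / PCGP_FILES   # TB -> MB, then per file

print(f"maximum tape capacity: {tape_capacity_pb:.2f} PB")
print(f"campus total:          {campus_gb:,.0f} GB")
print(f"average PCGP file:     {avg_file_mb:.1f} MB")
```

The tape result is in line with the "12.6+ PB total" bullet, and the average file size of roughly 6 MB suggests that the 148 million files are dominated by small files rather than by the ~100 GB BAM files.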

Page 12: Progress

>356 Patients/712 Complete Genomes

Gene sequencing project identifies potential drug targets in common childhood brain tumor
Nature, June 20, 2012
Researchers studying the genetic roots of the most common malignant childhood brain tumor have discovered missteps in three of the four subtypes of the cancer that involve genes already targeted for drug development. The most significant gene alterations are linked to subtypes of medulloblastoma that currently have the best and worst prognosis. They were among 41 genes associated for the first time with medulloblastoma by the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project.

World's largest release of comprehensive human cancer genome data helps researchers everywhere speed discoveries
Nature Genetics, May 29, 2012
To speed progress against cancer and other diseases, the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project today announced the largest-ever release of comprehensive human cancer genome data for free access by the global scientific community. The amount of information released more than doubles the volume of high-coverage, whole genome data currently available from all human genome sources combined. This information is valuable not just to cancer researchers, but also to scientists studying almost any disease.

Genome sequencing initiative links altered gene to age-related neuroblastoma risk
Journal of the American Medical Association, March 13, 2012
St. Jude Children’s Research Hospital – Washington University Pediatric Cancer Genome Project and Memorial Sloan-Kettering Cancer Center discover the first gene alteration associated with patient age and neuroblastoma outcome. Researchers have identified the first gene mutation associated with a chronic and often fatal form of neuroblastoma that typically strikes adolescents and young adults. The finding provides the first clue about the genetic basis of the long-recognized but poorly understood link between treatment outcome and age at diagnosis.

Cancer sequencing initiative discovers mutations tied to aggressive childhood brain tumors
Nature Genetics, January 29, 2012
Findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) offer important insight into a poorly understood tumor that kills more than 90 percent of patients within two years. The tumor, diffuse intrinsic pontine glioma (DIPG), is found almost exclusively in children and accounts for 10 to 15 percent of pediatric tumors of the brain and central nervous system.

Cancer sequencing project identifies potential approaches to combat aggressive leukemia
Nature, January 11, 2012
Researchers with the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) have discovered that a subtype of leukemia characterized by a poor prognosis is fueled by mutations in pathways distinctly different from a seemingly similar leukemia associated with a much better outcome. The work provides the first details of the genetic alterations fueling a subtype of acute lymphoblastic leukemia (ALL) known as early T-cell precursor ALL (ETP-ALL). The results suggest ETP-ALL has more in common with acute myeloid leukemia (AML) than with other subtypes of ALL.

Gene identified as a new target for treatment of aggressive childhood eye tumor
Nature, January 11, 2012
New findings from the St. Jude Children's Research Hospital – Washington University Pediatric Cancer Genome Project (PCGP) have helped identify the mechanism that makes the childhood eye tumor retinoblastoma so aggressive. The discovery explains why the tumor develops so rapidly while other cancers can take years or even decades to form. The finding also led investigators to a new treatment target and possible therapy for the rare childhood tumor of the retina, the light-sensing tissue at the back of the eye.

Page 13: Data Sharing

http://www.pediatriccancergenomeproject.org

http://explore.pediatriccancergenomeproject.org

Page 14: Data Sharing

• Data integration is critical: platform data (expression, WGS, methylation, etc.) and processed data (“genomics” data) with phenotype data (clinical care, clinical research)

Page 15: Key: Staff

Total: >150 FTEs with “research informatics” skills

[Organization chart. Text recovered, in source order: 19 Academic Departments; 2 PhD, 2 Support; Information Sciences; PCGP: 5 PhD, 1 Dev.; 8-10 Faculty, 50-60 Support Staff; 10 PhD Bioinformatics, 2 developers; Enterprise Informatics; Clinical Informatics; 127 FTEs; 81 FTEs; Research Informatics; 56 FTEs; Offshore Developers; 15 FTEs; HPC; Shared Resources; Computational Biology.]

Page 16: $UMMARY

• Project total cost: $65M (11 Illuminas @ WashU and 4 @ SJCRH, sequencing costs, staffing, IT, etc.)
• New “IT” staff @ SJCRH: 10 FTEs in CompBiol, 0 FTEs in IS
• Capital IT investment: ~$7.2M at SJCRH, $9M at WashU
• IT is ~25% of overall project costs (doesn’t include costs of other participating SJ FTEs)
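The "~25%" figure follows from the two capital IT investments and the project total. The short check below uses only numbers from the slide and assumes, as the slide appears to, that the IT share is the two capital investments combined divided by the $65M total.

```python
# IT share of total project cost, in millions of US dollars (slide figures).
PROJECT_TOTAL_M = 65.0
IT_SJCRH_M = 7.2
IT_WASHU_M = 9.0

it_total_m = IT_SJCRH_M + IT_WASHU_M
it_share = it_total_m / PROJECT_TOTAL_M

print(f"capital IT investment: ${it_total_m:.1f}M")
print(f"share of project cost: {it_share:.1%}")
```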

Page 17: Information Sciences PCGP Team

Key: Staff

• Ashish Pagare
• David Zhao
• Dan Alford
• Stephen Espy
• Kiran Chand Bobba
• Scott Malone
• Dr. Antonio Ferreira
• Bill Pappas
• James McMurry
• Dr. Jianmin Wang
• Dr. John Obenauer
• Jared Becksfort
• Pankaj Gupta
• Dr. Suraj Mukatira
• Simon Hagstrom
• Sundeep Shakya
• Asmita Vaidya
• Swetha Mandava
• Bhagavathy Krishna
• Manohar Gorthi
• Sandhya Rani Kolli
• Sivaram Chintalapudi
• Roshan Shrestha
• Irina McGuire
• PJ Stevens
• Thanh Le
• John Penrod
• Pat Eddy
• Dr. Dan McGoldrick

Page 18: Questions?

Page 19: Data Workflow

[Diagram: PALLAS, with analyses (contig assembly, SV, CNV, indels, SNV, Circos) and computing environments (cluster, GPU, large memory).]