“A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research”
Invited Presentation
Symposium on Computational Biology and Bioinformatics:
Remembering John Wooley
National Institutes of Health
Bethesda, MD
July 29, 2016
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
John Wooley Drove Supercomputing for Biological Sciences
John Wooley was a Scientific Founder of Calit2
www.calit2.net
220 UCSD & UCI Faculty Working in Multidisciplinary Teams
With Students, Industry, and the Community
The State Provides $100 M For New Buildings and Equipment
LS Slide 2001
John Wooley was the UCSD Layer Leader for DeGeM
NSF’s OptIPuter Project: Using Supernetworks to Meet the Needs of Data-Intensive Researchers
OptIPortal: Termination Device for the OptIPuter Global Backplane
Calit2 (UCSD, UCI), SDSC, and UIC Leads; Larry Smarr PI. Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST
Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent
2003-2009 $13,500,000
Biomedical Big Data as Application Driver: Mark Ellisman, co-PI
The OptIPuter LambdaGrid is Rapidly Expanding
[Network map: 1 GE and 10 GE lambdas linking UCSD, SDSU, UCI, ISI, NASA JPL, NASA Ames, NASA Goddard, StarLight Chicago, UIC EVL, NU, PNWGP Seattle, NetherLight Amsterdam, U Amsterdam, and CICESE (via CUDI), over the CENIC San Diego and Los Angeles GigaPOPs, CalREN-XD, NLR, CAVEwave/NLR, and the CENIC/Abilene shared network]
Source: Greg Hidley, Aaron Chin, Calit2
LS Slide 2005
PI Larry Smarr
Paul Gilna Ex. Dir.
Announced January 17, 2006; $24.5M Over Seven Years
John Wooley was a CAMERA co-PI &
Chief Science Officer
Calit2 Microbial Metagenomics Cluster: Next Generation Optically Linked Science Data Server
512 Processors, ~5 Teraflops
~200 TB Sun X4500 Storage
1 GbE and 10 GbE Switched/Routed Core
Source: Phil Papadopoulos, SDSC, Calit2
The CAMERA Project Established a Global Marine Microbial Metagenomics Cyber-Community
Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis
http://camera.calit2.net/
4000 Registered Users From Over 80 Countries
Determining the Protein Structures of the Thermophilic Thermotoga maritima Genome: Life at 80°C!
Extremely Thermostable: Useful for Many Industrial Processes (e.g., Chemical and Food)
173 Structures (122 from JCSG)
• 122 T.M. Structures Solved by JCSG (75 Unique in the PDB)
• Direct Structural Coverage of 25% of the Expressed Soluble Proteins
• Probably Represents the Highest Structural Coverage of Any Organism
Source: John Wooley, JCSG Bioinformatics Core Project Director, UCSD
LS Slide 2005
John Wooley Organized a Series of International Workshops on Metagenomics and Thermotoga at Calit2
Academic Research OptIPlanet Collaboratory: A 10Gbps “End-to-End” Lightpath Cloud
[Diagram: National LambdaRail; Campus Optical Switch; Data Repositories & Clusters; HPC; HD/4K Video Repositories; End User OptIPortal; 10G Lightpaths; HD/4K Live Video; Local or Remote Instruments]
LS 2009 Slide
So Why Don’t We Have a National Big Data Cyberinfrastructure?
Research is being stalled by “information overload,” Mr. Bement said, because data from digital instruments are piling up far faster than researchers can study. In particular, he said, campus networks need to be improved. High-speed data lines crossing the nation are the equivalent of six-lane superhighways, he said. But networks at colleges and universities are not so capable. “Those massive conduits are reduced to two-lane roads at most college and university campuses,” he said. Improving cyberinfrastructure, he said, “will transform the capabilities of campus-based scientists.”
-- Arden Bement, Director of the National Science Foundation, May 2005
DOE ESnet’s Science DMZ: A Scalable Network Design Model for Optimizing Science Data Transfers
• A Science DMZ integrates 4 key concepts into a unified whole:
– A network architecture designed for high-performance applications, with the science network distinct from the general-purpose network
– The use of dedicated systems for data transfer
– Performance measurement and network testing systems that are regularly used to characterize and troubleshoot the network
– Security policies and enforcement mechanisms that are tailored for high-performance science environments
http://fasterdata.es.net/science-dmz/
“Science DMZ” Coined in 2010
The DOE ESnet Science DMZ and the NSF “Campus Bridging” Taskforce Report Formed the Basis for the NSF Campus Cyberinfrastructure Network Infrastructure and Engineering (CC-NIE) Program
Based on Community Input and on ESnet’s Science DMZ Concept,NSF Has Funded Over 100 Campuses to Build Local Big Data Freeways
Red: 2012 CC-NIE Awardees
Yellow: 2013 CC-NIE Awardees
Green: 2014 CC*IIE Awardees
Blue: 2015 CC*DNI Awardees
Purple: Multiple-Time Awardees
Source: NSF
Creating a “Big Data” Freeway on Campus: NSF-Funded Prism@UCSD and CHERuB Campus CC-NIE Grants
Prism@UCSD, PI Phil Papadopoulos, SDSC, Calit2 (2013-15)
CHERuB, PI Mike Norman, SDSC
NCMIR Brain Images in Calit2 VROOM: Allows for Interactive Zooming from Cerebellum to Individual Neurons
NCMIR Connected Over Prism to Calit2/SDSC at 80 Gbps
Calit2 3D Immersive StarCAVE OptIPortal: Enables Interactive Exploration of Protein Data Bank
Cluster with 30 Nvidia 5600 Cards; 60 GB Texture Memory
Source: Tom DeFanti, Greg Dawe, Calit2
Connected at 50 Gb/s to Quartzite
30 HD Projectors!
15 Meyer Sound Speakers + Subwoofer
Passive Polarization: Optimized the Polarization Separation and Minimized Attenuation
The Pacific Wave Platform Creates a Regional Science-Driven “Big Data Freeway System”
Source: John Hess, CENIC
Funded by NSF $5M Oct 2015-2020
Flash Disk to Flash Disk File Transfer Rate
PI: Larry Smarr, UC San Diego Calit2
Co-PIs:
• Camille Crittenden, UC Berkeley CITRIS
• Tom DeFanti, UC San Diego Calit2
• Philip Papadopoulos, UC San Diego SDSC
• Frank Wuerthwein, UC San Diego Physics and SDSC
Pacific Research Platform Regional Collaboration: Multi-Campus Science Driver Teams
• Jupyter Hub
• Biomedical
– Cancer Genomics Hub/Browser
– Microbiome and Integrative ‘Omics
– Integrative Structural Biology
• Earth Sciences
– Data Analysis and Simulation for Earthquakes and Natural Disasters
– Climate Modeling: NCAR/UCAR
– California/Nevada Regional Climate Data Analysis
– CO2 Subsurface Modeling
• Particle Physics
• Astronomy and Astrophysics
– Telescope Surveys
– Galaxy Evolution
– Gravitational Wave Astronomy
• Scalable Visualization, Virtual Reality, and Ultra-Resolution Video
PRP Transforms Big Data Microbiome and Integrated ‘Omics Science
[Diagram: Knight 1024 cluster in the SDSC co-lo (12 cores/GPU, 128 GB RAM, 3.5 TB SSD, 48 TB disk, 10 Gbps NICs), connected at 10 Gbps to the Knight Lab, via Prism@UCSD (120 Gbps) to Gordon, Data Oasis (7.5 PB, 200 GB/s), Emperor and other vis tools, and a 64-Mpixel data analysis wall, and via CHERuB (100 Gbps) to PNNL (1.3 Tbps), UC Davis, LBNL, and Caltech (40 Gbps)]
To Expand IBD Project the Knight/Smarr Labs Were Awarded ~ 1 Million Core-Hours on SDSC’s Comet Supercomputer
• 8x Compute Resources Over Prior Study
• Smarr Gut Microbiome Time Series
– From 7 Samples Over 1.5 Years
– To 50 Samples Over 4 Years
• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100 Patients
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD Patients
• New Software Suite from Knight Lab
– Re-annotation of Reference Genomes, Functional/Taxonomic Variations
– Novel Compute-Intensive Assembly Algorithms from Pavel Pevzner
We Used SDSC’s Comet to Uniformly Compute Protein-Coding Genes, RNAs, & CRISPR Annotations
• We Downloaded from NCBI Over 60,000 Bacterial and Archaeal Genomes
– Required 5 Core-Hours Per Genome
– 300,000 Core-Hours to Complete
– Ran 24 Cores in Parallel
– Over 400 Days Wall-Clock Time
• Requires a Variety of Software Programs
– Prodigal for Gene Prediction
– Diamond for Protein Homolog Search Against UniRef db
– Infernal for ncRNA Prediction
– RNAmmer for rRNA Prediction
– Aragorn for tRNA Prediction
• Will Make These Results a New Community Database
– Knight Lab, Calit2, SDSC
Source: Zhenjiang (Zech) Xu, Knight Lab, UCSD
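As a back-of-envelope check, the stated annotation figures are self-consistent (an illustrative sketch, not from the slides; assumes all 24 cores ran continuously for the whole job):

```python
# Sanity-check the genome-annotation throughput figures from the slide.
genomes = 60_000              # bacterial and archaeal genomes downloaded from NCBI
core_hours_per_genome = 5     # stated annotation cost per genome
parallel_cores = 24           # cores run in parallel

total_core_hours = genomes * core_hours_per_genome        # 300,000 core-hours
wall_clock_days = total_core_hours / parallel_cores / 24  # core-hours -> days

# ~521 days, consistent with the slide's "over 400 days wall-clock time"
print(f"{total_core_hours:,} core-hours, ~{wall_clock_days:.0f} days wall-clock")
```

The same arithmetic shows why a large core-hour allocation matters: at higher parallelism on Comet, the identical 300,000 core-hour workload completes in weeks rather than years.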
Cancer Genomics Hub (UCSC) is Housed in SDSC: Large Data Flows to End Users at UCSC, UCB, UCSF, …
[Chart: CGHub data flows grew from 1G to 8G to 15G by Jan 2016; 30,000 TB per year]
Data Source: David Haussler, Brad Smith, UCSC
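The 30,000 TB/year figure implies a substantial sustained rate (a sketch; assumes the flow is spread evenly over the year and uses decimal terabytes):

```python
# Average sustained bandwidth implied by 30,000 TB/year of CGHub data flows.
tb_per_year = 30_000
bits_per_year = tb_per_year * 8 * 10**12         # TB -> bits (decimal)
seconds_per_year = 365 * 24 * 3600

gbps = bits_per_year / seconds_per_year / 10**9  # ~7.6 Gbps
print(f"~{gbps:.1f} Gbps sustained average")
```

A ~7.6 Gbps average, with bursty peaks well above it, is why multi-10G connectivity to end users is needed rather than a shared 1G campus link.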
Creating a Distributed Cluster for Integrated Modeling of Large Macromolecular Machines
• UCSF 10-100 Gbps Science DMZ
– QB3@UCSF (~5000 cores)
– Institute for Human Genetics (~1200 cores)
– Cancer Center (~800 cores)
– Molecular Structure Group (~1000 cores)
• Coupled Via PRP to:
– LBNL NERSC
– SDSC
• Bring Huge Datasets from Supercomputer Centers Back to UCSF Clusters for Analysis
Requires CPU-months per computation
Lead: Andrej Sali, UCSF
Driving Improvements in Scientific Data Transfer
NCMIR X-ray Microscope (XRM): Zeiss Versa 510
MicroCT reconstructions of chiton radula. Chiton radulae have evolved to incorporate an iron oxide mineral, magnetite, making them extremely hard and magnetic. Images courtesy of Steven Herrera, Ph.D., Kisailus Biomimetics and Nanostructured Materials Laboratory, UC Riverside
UCSD/NCMIR Fiona/Data Transfer Node (DTN)
PRP Facilitated Collaborative Data Transfer at 10-100 Gbps
XRM Data Sets are 100+ GBs
UCR researchers are modeling the teeth (radula) of the marine snail Cryptochiton stelleri to engineer new biomimetic abrasion-resistant composites
UC RiversideFiona/Data Transfer Node (DTN)
3D Reconstructions from NCMIR X-ray Microscopic Computed Tomography Facilitate Development of Bioinspired “Tough” Materials
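The value of PRP-class links for these 100+ GB XRM datasets is easy to quantify (an illustrative sketch; assumes full line rate with no protocol or disk overhead, so real transfers take longer):

```python
# Ideal transfer times for a 100 GB XRM dataset at different link speeds.
dataset_bits = 100 * 8 * 10**9   # 100 GB (decimal) expressed in bits

for gbps in (1, 10, 100):        # commodity campus link vs. PRP-class links
    seconds = dataset_bits / (gbps * 10**9)
    print(f"{gbps:>3} Gbps: {seconds:5.0f} s")
```

At 1 Gbps a single dataset ties up the link for over 13 minutes; at the PRP's 10-100 Gbps it moves in seconds, which is what makes the NCMIR-to-UCR DTN-to-DTN workflow practical.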
Next Step: Global Research PlatformBuilding on CENIC/Pacific Wave and GLIF
Current InternationalGRP Partners
Mirror Cell Image Library Infrastructure and Data Management Workflows at Singapore’s NSCC
The Cell Image Library, Designed for “Big Data,” Leverages High-Bandwidth-Connected High-Performance Storage and Computing Resources
Source: Mark Ellisman & Steve Peltier, NCMIR, UCSD