using supercomputers and supernetworks to explore the ocean of life
TRANSCRIPT
Using Supercomputers and Supernetworks to Explore the Ocean of Life
Director's Colloquium
Los Alamos National Laboratory
Los Alamos, New Mexico
June 7, 2007
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
Abstract
Calit2, in partnership with J. Craig Venter Institute in Rockville, MD, and UCSD's SDSC and Scripps Institution of Oceanography, is creating a Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), funded by the Gordon and Betty Moore Foundation. CAMERA collaborates closely with DoE's Joint Genome Institute. The CAMERA computational and storage cluster containing the metagenomic data can be accessed via the web over novel dedicated 10 Gb/s light pipes (termed "lambdas") through the National LambdaRail, providing direct connection to the scalable Linux clusters in individual user laboratories. These clusters are reconfigured as "OptIPortals," providing the end user with local scalable visualization, computing, and storage. Scientists will use CAMERA for metagenomics research -- analyzing microbial genomic sequence data in the context of other microbial species, as well as in relation to the chemical and physical conditions in which microbes are sampled. The CAMERA project contains the results of the Venter Institute's Sorcerer II Expedition, which carried out the first large-scale genomic survey of microbial life in the world's oceans to produce the largest gene catalogue ever assembled, doubling the number of protein sequences currently available in GenBank. In addition to Sorcerer II's ecological genomic data, the CAMERA database will be augmented by the full genomes of more than 150 critical marine microbes enabling new comparative genomics studies. Currently over 1000 users are registered from over 40 countries.
Calit2--A Systems Approach to the Future of the Internet and its Transformation of Our Society
www.calit2.net
Calit2 Has Assembled a Complex Social Network of Over 350 UC San Diego & UC Irvine Faculty
Working in Multidisciplinary TeamsWith Staff, Students, Industry, and the Community
Over 130 Companies and 300 Federal Grants in Collaboration with Calit2
UC San Diego
Two New Calit2 Buildings Provide New Laboratories for “Living in the Future”
• “Convergence” Laboratory Facilities– Nanotech, BioMEMS, Chips, Radio, Photonics– Virtual Reality, Digital Cinema, HDTV, Gaming
• Over 1000 Researchers in Two Buildings– Linked via Dedicated Optical Networks
UC Irvinewww.calit2.net
Preparing for a World in Which Distance is Eliminated…
Calit2 Brings Computer Scientists and Engineers Together with Biomedical Researchers
• Some Areas of Concentration:– Algorithmic and System Biology
– Bioinformatics
– Metagenomics
– Cancer Genomics
– Human Genomic Variation and Disease
– Proteomics
– Mitochondrial Evolution
– Computational Biology
– Multi-Scale Cellular Imaging
– Information Theory and Biological Systems
– Telemedicine
UC Irvine
UC Irvine
Southern California Telemedicine Learning Center (TLC)
National Biomedical Computation Resource an NIH supported resource center
Calit2 Facilitated Formation of the Center for Algorithmic and
Systems Biology
http://casb.ucsd.edu/
Most of Evolutionary Time Was in the Microbial World
You Are
Here
Source: Carl Woese, et al
Tree of Life Derived from 16S rRNA Sequences
Joint Genome Institute is a Leading Microbial Genomic Source
Moore Microbial Genome Sequencing ProjectSelected Microbes Throughout the World’s Oceans
www.moore.org/microgenome/worldmap.asp
Microbes Nominated by Leading Ocean Microbial
Biologists
Moore Foundation Funded the Venter Institute to Provide the Full Genome Sequence of 155 Marine Microbes
Phylogenetic Trees Created by Uli Stingl, Oregon State
Blue Means Contains One of the Moore 155 Genomes
www.moore.org/microgenome/trees.aspx
Moore 155 Marine Microbial Genomes Gives Broad Coverage of Microbial “Tree of Life”
www.moore.org/microgenome/alpha-proteobacteria.aspx
Phylogenetic Trees Created by Uli Stingl, Oregon State
Full Genome Sequencing is Exploding:Most Sequenced Genomes are Bacterial
www.genomesonline.org
90Metagenomes
First Genome 1995 6 Genomes/ Year 2000
Metagenomic Data SetsAre Rapidly Being Accumulated
• “A majority of the bacterial sequences corresponded to uncultivated species and novel microorganisms.”
• “We discovered significant inter-subject variability.” • “Characterization of this immensely diverse ecosystem is the first step in
elucidating its role in health and disease.”
“Diversity of the Human Intestinal Microbial Flora” Paul B. Eckburg, et al Science (10 June 2005)
395 Phylotypes
Microbial Metagenomics is a Rapidly Emerging Field of Research
“Despite their ubiquity, relatively little is known about the majority of environmental microorganisms, largely because of their resistance to culture under standard laboratory conditions.”
“The application of high-throughput shotgun sequencing environmental samples has recently provided global views of those communities not obtainable from 16S rRNA or BAC clone–sequencing surveys .”
Comparative Metagenomics of Microbial Communities
Susannah Green Tringe, Christian von Mering, Arthur Kobayashi, Asaf A. Salamov, Kevin Chen, Hwai W. Chang, Mircea Podar, Jay M. Short, Eric J. Mathur, John C. Detter, Peer Bork, Philip Hugenholtz, Edward M. Rubin
Science 22 April 2005
Enormous Increase in Scale of Known Genes Over Last Decade
1995First Microbe Genome
2001Human Genome
2007Ocean Metagenomics
3 Billion Bases 30,000 Genes
6.3 Billion Bases 5.6 Million Genes
1.8 Million Bases 1749 Genes
~3300x
Microbial Genomics Allow Us to Look Back Nearly 4 Billion Years In the Evolution of Life
Falkowski and Vargas Science 304 (5667) 2004
The Sargasso Sea Experiment The Power of Environmental Metagenomics
• Yielded a Total of Over 1 Billion Base Pairs of Non-Redundant Sequence
• Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms
• Sequences from at Least 1800 Genomic Species, including 148 Previously Unknown
• Identified over 1.2 Million Unknown Genes
MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from
22 February 2003
J. Craig Venter, et al.
Science 2 April 2004:
Vol. 304. pp. 66 - 74
Marine Genome Sequencing Project – Measuring the Genetic Diversity of Ocean Microbes
Sorcerer II Data Will Double Number of Proteins in GenBank!
Specify Ocean Data
Each Sample ~2000
Microbial Species
Environmental Metadata: Beyond Data Collected at Sampling Site
Sea Surface Temp
Chlorophyll
NASA AQUA-MODIS Images covering
GOS sites #8 – 12, mid
November, 2003
GOS Predicted Proteins are Largely Bacterial
Source: Shibu Yooseph, et al. (PLOS Biology March 2007)
~3 Million Previously Known
Proteins
~5.1 Million GOS Predicted Proteins
NCBI-nr, PG,
TGI-EST, ENS
Current Universe of Medium/ Large Protein Families
Source: Shibu Yooseph, et al. (PLOS Biology March 2007)
Protein Families Conserved Across
Tree of Life
Protein Families Unique to GOS
17,067 Protein Family Clusters
1 Million CPU-Hour Computation !
GOS Analysis -- Protein Families in Nature Have Been Poorly Explored Thus Far
• Novel Sequence Similarity Clustering Process Predicts Proteins and Groups Related Sequences Into Clusters (Families)
• GOS Proteins Increase Size / Diversity of Many Protein Families• 1,700 Novel GOS-Only Clusters Identified (>20 per Cluster)
– 10% of 17,000 Clusters
NCBI_nr
GOS + NCBI_nr + Ensembl + TIGR Gene Indices + Prokaryotic Genomes
Source: Shibu Yooseph, et al. (PLOS Biology March 2007)
Enormous Biodiversity:Very little of GOS Metagenomic Data Assembles Well
• Use Reference Genomes to Recruit Fragments– Compared 334 Finished and 250 Draft Microbial Genomes
• Only 5 Microbial Genera Yielded substantial and Uniform Recruitment – Prochlorococcus, Synechococcus, Pelagibacter, Shewanella, and Burkholderia
Source: Douglas Rusch, et al. (PLOS Biology March 2007)
Self Organizing Maps Identifies SpeciesUsing Japanese Earth Simulator
Human
Fugu
Arabidopsis
RiceC. Elegans
Drosophilia
www.es.jamstec.go.jp/publication/journal/jes_vol.6/pdf/JES6_22-Abe.pdf
T. Abe, H. Sugawara, S. Kanaya, T. IkemuraJournal of the Earth Simulator, Volume 6, October 2006, 17–23
SOM Created from an
Unsupervised Neural Network
Algorithm to Analyze
Tetranucleotide Frequencies in a Wide Range of
Genomes 10kb Moving Window
Using SOM, Sargasso Sea Metagenomic Data Yields 92 Microbial Genera !
Eukaryotes
Prokaryotes
Viruses
Mitochondria
Chloroplasts
Input Genomes:
1500 Microbes 40 Eukaryotes 1065 Viruses 642 Mitochondria 42 Chloroplasts
5kb Window
T. Abe, H. Sugawara, S. Kanaya, T. IkemuraJournal of the Earth Simulator, Volume 6, October 2006, 17–23
Crystal Structures
The Human Kinome:A Protein Family Implicated In Many Human Diseases
YEAST
Mouse
C.elegans
Drosoph
Arabid.
Sea Urchin
Dicty.
Tetrahy.
EPKs
Manning, et al (2002) Science 298:1912
Over 500 Protein Kinases2% of the Human Genome
Many splice variants
Source: Susan Taylor, SOM,
UCSD
From Microbial Genomes To Human Disease
• Microbes Have a Much Simpler Genome Than Humans
• However, Microbes Share Many of the Core Components of the Molecular Signaling Machinery Used by Humans
• Understand Both the Evolution and Regulation of Signaling Systems, First in Microbes and Then in Humans
• This is a Rich Source for Mapping the Origins of EPKs
Source: Susan Taylor, SOM, UCSD
>24,000 KinasesIncluding 16,000
New KinasesIn Venter Global
Ocean Sampling Data!
The OptIPuter Project: Creating High Resolution Portals Over Dedicated Optical Channels to Global Science Data
Picture Source:
Mark Ellisman,
David Lee, Jason Leigh
Calit2 (UCSD, UCI) and UIC Lead Campuses—Larry Smarr PIUniv. Partners: SDSC, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST
Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent
$13.5M Over Five
Years
Now In the Fifth
Year
fc *
Dedicated Optical Channels Makes High Performance Cyberinfrastructure Possible
(WDM)
Source: Steve Wallach, Chiaro Networks
“Lambdas”Parallel Lambdas are Driving Optical Networking
The Way Parallel Processors Drove 1990s Computing
10 Gbps per User ~ 200x Shared Internet Throughput
San Francisco Pittsburgh
Cleveland
National Lambda Rail (NLR) and TeraGrid Provides Cyberinfrastructure Backbone for U.S. Researchers
San Diego
Los Angeles
Portland
Seattle
Pensacola
Baton Rouge
HoustonSan Antonio
Las Cruces /El Paso
Phoenix
New York City
Washington, DC
Raleigh
Jacksonville
Dallas
Tulsa
Atlanta
Kansas City
Denver
Ogden/Salt Lake City
Boise
Albuquerque
UC-TeraGridUIC/NW-Starlight
Chicago
International Collaborators
NLR 4 x 10Gb Lambdas Initially Capable of 40 x 10Gb wavelengths at Buildout
NSF’s TeraGrid Has 4 x 10Gb Lambda Backbone
Links Two Dozen State and Regional Optical
Networks
DOE, NSF, & NASA
Using NLR
OptIPortal–Termination Device for the Dedicated Gigabit/sec Lightpaths
Photo Source: David Lee, Mark Ellisman NCMIR, UCSD
Collaborative Analysis of Large Scale Images of
Cancer Cells
Integration of High
Definition Video
Streamswith Large
Scale Image Display Walls
My OptIPortalTM – AffordableTermination Device for the OptIPuter Global Backplane
• 20 Dual CPU Nodes, 20 24” Monitors, ~$50,000• 1/4 Teraflop, 5 Terabyte Storage, 45 Mega Pixels--Nice PC!• Scalable Adaptive Graphics Environment ( SAGE) Jason Leigh, EVL-UIC
Source: Phil Papadopoulos SDSC, Calit2
PI Larry Smarr
Paul Gilna Ex. Dir.
Announced January 17, 2006$24.5M Over Seven Years
The Calit2 CAMERA Microbial Metagenomics Server is Open to the Community
PLOS Biology March 2007
CAMERA Builds on Cyberinfrastructure Grid, Workflow, and Portal Projects in a Service Oriented Architecture
Cyberinfrastructure: Raw Resources, Middleware & Execution Environment
NBCR Rocks Clusters
Virtual Organizations Web Services
KEPLER
Workflow Management
Vision
Telescience Portal
National Biomedical Computation Resource an NIH supported resource center
Located in Calit2@UCSD Building
Flat FileServerFarm
W E
B P
OR
TA
L
TraditionalUser
Response
Request
DedicatedCompute Farm
(1000s of CPUs)
TeraGrid: Cyberinfrastructure Backplane(scheduled activities, e.g. all by all comparison)
(10,000s of CPUs)
Web(other service)
Local Cluster
LocalEnvironment
DirectAccess LambdaCnxns
Data-BaseFarm
10 GigE Fabric
Calit2’s Direct Access Core Architecture Will Create Next Generation Metagenomics Server
Source: Phil Papadopoulos, SDSC, Calit2+
We
b S
erv
ice
s
Sargasso Sea Data
Sorcerer II Expedition (GOS)
JGI Community Sequencing Project
Moore Marine Microbial Project
NASA and NOAA Satellite Data
Community Microbial Metagenomics Data
Calit2 CAMERA ProductionCompute and Storage Complex
512 Processors ~5 Teraflops
~ 200 Terabytes Storage
The Calit2 CAMERA Metagenomics Site is Now Active
http://camera.calit2.net/
CAMERA Research Tools and Data
Distribution of CAMERA User Registrations
Nearly 1000 Registered Users From 45 Countries
USA 583United Kingdom 46Canada 35France 35Germany 32
Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome
Acidobacteria bacterium Ellin345 Soil Bacterium 5.6 Mb
Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome
Source: Raj Singh, UCSD
Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome
Source: Raj Singh, UCSD
Interactive Exploration of Marine Genomes Using 100 Million Pixels
Ginger Armburst (UW), Terry Gaasterland (UCSD SIO)
NW!
CICESE
UW
JCVI
MIT
SIO UCSD
SDSU
UIC EVL
UCI
OptIPortals
OptIPortal
Calit2 is Now OptIPuter Connecting Remote OptIPortals for Moore-Funded Microbial Researchers via NLR
CAMERAServers
www.glif.is
Created in Reykjavik, Iceland 2003
Countries are Aggressively Creating Gigabit Services:Interactive Access to CAMERA Data System
Visualization courtesy of Bob Patterson, NCSA.