Using Supercomputers and Supernetworks to Explore the Ocean of Life Director's Colloquium Los Alamos National Laboratory Los Alamos, New Mexico June 7, 2007 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD


Page 1: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Using Supercomputers and Supernetworks to Explore the Ocean of Life

Director's Colloquium

Los Alamos National Laboratory

Los Alamos, New Mexico

June 7, 2007

Dr. Larry Smarr

Director, California Institute for Telecommunications and Information Technology

Harry E. Gruber Professor,

Dept. of Computer Science and Engineering

Jacobs School of Engineering, UCSD

Page 2: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Abstract

Calit2, in partnership with the J. Craig Venter Institute in Rockville, MD, and UCSD's SDSC and Scripps Institution of Oceanography, is creating a Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), funded by the Gordon and Betty Moore Foundation. CAMERA collaborates closely with DoE's Joint Genome Institute. The CAMERA computational and storage cluster containing the metagenomic data can be accessed via the web over novel dedicated 10 Gb/s light pipes (termed "lambdas") through the National LambdaRail, providing direct connection to the scalable Linux clusters in individual user laboratories. These clusters are reconfigured as "OptIPortals," providing the end user with local scalable visualization, computing, and storage. Scientists will use CAMERA for metagenomics research: analyzing microbial genomic sequence data in the context of other microbial species, as well as in relation to the chemical and physical conditions in which microbes are sampled. The CAMERA project contains the results of the Venter Institute's Sorcerer II Expedition, which carried out the first large-scale genomic survey of microbial life in the world's oceans to produce the largest gene catalogue ever assembled, doubling the number of protein sequences currently available in GenBank. In addition to Sorcerer II's ecological genomic data, the CAMERA database will be augmented by the full genomes of more than 150 critical marine microbes, enabling new comparative genomics studies. Currently over 1000 users are registered from over 40 countries.

Page 3: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Calit2--A Systems Approach to the Future of the Internet and its Transformation of Our Society

www.calit2.net

Calit2 Has Assembled a Complex Social Network of Over 350 UC San Diego & UC Irvine Faculty

Working in Multidisciplinary Teams With Staff, Students, Industry, and the Community

Over 130 Companies and 300 Federal Grants in Collaboration with Calit2

Page 4: Using Supercomputers and Supernetworks to Explore the Ocean of Life

UC San Diego

Two New Calit2 Buildings Provide New Laboratories for “Living in the Future”

• “Convergence” Laboratory Facilities

– Nanotech, BioMEMS, Chips, Radio, Photonics

– Virtual Reality, Digital Cinema, HDTV, Gaming

• Over 1000 Researchers in Two Buildings

– Linked via Dedicated Optical Networks

UC Irvine

www.calit2.net

Preparing for a World in Which Distance is Eliminated…

Page 5: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Calit2 Brings Computer Scientists and Engineers Together with Biomedical Researchers

• Some Areas of Concentration:

– Algorithmic and Systems Biology

– Bioinformatics

– Metagenomics

– Cancer Genomics

– Human Genomic Variation and Disease

– Proteomics

– Mitochondrial Evolution

– Computational Biology

– Multi-Scale Cellular Imaging

– Information Theory and Biological Systems

– Telemedicine

UC Irvine

UC Irvine

Southern California Telemedicine Learning Center (TLC)

National Biomedical Computation Resource, an NIH-supported resource center

Page 6: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Calit2 Facilitated Formation of the Center for Algorithmic and Systems Biology

http://casb.ucsd.edu/

Page 7: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Most of Evolutionary Time Was in the Microbial World

You Are Here

Source: Carl Woese, et al.

Tree of Life Derived from 16S rRNA Sequences

Page 8: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Joint Genome Institute is a Leading Microbial Genomic Source

Page 9: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Moore Microbial Genome Sequencing Project: Selected Microbes Throughout the World’s Oceans

www.moore.org/microgenome/worldmap.asp

Microbes Nominated by Leading Ocean Microbial Biologists

Page 10: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Moore Foundation Funded the Venter Institute to Provide the Full Genome Sequence of 155 Marine Microbes

Phylogenetic Trees Created by Uli Stingl, Oregon State

Blue Means Contains One of the Moore 155 Genomes

www.moore.org/microgenome/trees.aspx

Page 11: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Moore 155 Marine Microbial Genomes Give Broad Coverage of Microbial “Tree of Life”

www.moore.org/microgenome/alpha-proteobacteria.aspx

Phylogenetic Trees Created by Uli Stingl, Oregon State

Page 12: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Full Genome Sequencing is Exploding: Most Sequenced Genomes are Bacterial

www.genomesonline.org

90 Metagenomes

First Genome 1995 6 Genomes/Year 2000

Page 13: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Metagenomic Data Sets Are Rapidly Being Accumulated

• “A majority of the bacterial sequences corresponded to uncultivated species and novel microorganisms.”

• “We discovered significant inter-subject variability.”

• “Characterization of this immensely diverse ecosystem is the first step in elucidating its role in health and disease.”

“Diversity of the Human Intestinal Microbial Flora,” Paul B. Eckburg, et al., Science (10 June 2005)

395 Phylotypes

Page 14: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Microbial Metagenomics is a Rapidly Emerging Field of Research

“Despite their ubiquity, relatively little is known about the majority of environmental microorganisms, largely because of their resistance to culture under standard laboratory conditions.”

“The application of high-throughput shotgun sequencing environmental samples has recently provided global views of those communities not obtainable from 16S rRNA or BAC clone–sequencing surveys.”

Comparative Metagenomics of Microbial Communities

Susannah Green Tringe, Christian von Mering, Arthur Kobayashi, Asaf A. Salamov, Kevin Chen, Hwai W. Chang, Mircea Podar, Jay M. Short, Eric J. Mathur, John C. Detter, Peer Bork, Philip Hugenholtz, Edward M. Rubin

Science 22 April 2005

Page 15: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Enormous Increase in Scale of Known Genes Over Last Decade

1995: First Microbe Genome, 1.8 Million Bases, 1749 Genes

2001: Human Genome, 3 Billion Bases, 30,000 Genes

2007: Ocean Metagenomics, 6.3 Billion Bases, 5.6 Million Genes

~3300x
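The slide does not say whether its "~3300x" refers to bases or to genes. As a rough check, the sketch below computes both fold increases from the slide's own figures (1.8 million bases and 1749 genes in 1995; 6.3 billion bases and 5.6 million genes in 2007). The dictionary keys and the function name are illustrative choices, not anything from the talk.

```python
# Figures copied from the slide; the key names are illustrative only.
milestones = {
    "microbe_1995": {"bases": 1.8e6, "genes": 1749},
    "human_2001": {"bases": 3.0e9, "genes": 30_000},
    "ocean_2007": {"bases": 6.3e9, "genes": 5.6e6},
}


def fold_increase(start, end, field):
    """Return how many times larger `field` is at `end` than at `start`."""
    return milestones[end][field] / milestones[start][field]


gene_ratio = fold_increase("microbe_1995", "ocean_2007", "genes")
base_ratio = fold_increase("microbe_1995", "ocean_2007", "bases")

print(f"genes: ~{gene_ratio:.0f}x, bases: ~{base_ratio:.0f}x")
```

Both ratios come out in the low thousands, so the slide's rounded figure is plausible under either reading.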

Page 16: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Microbial Genomics Allow Us to Look Back Nearly 4 Billion Years In the Evolution of Life

Falkowski and Vargas Science 304 (5667) 2004

Page 17: Using Supercomputers and Supernetworks to Explore the Ocean of Life

The Sargasso Sea Experiment: The Power of Environmental Metagenomics

• Yielded a Total of Over 1 Billion Base Pairs of Non-Redundant Sequence

• Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms

• Sequences from at Least 1800 Genomic Species, including 148 Previously Unknown

• Identified over 1.2 Million Unknown Genes

MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003

J. Craig Venter, et al., Science 2 April 2004: Vol. 304, pp. 66-74

Page 18: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Marine Genome Sequencing Project – Measuring the Genetic Diversity of Ocean Microbes

Sorcerer II Data Will Double Number of Proteins in GenBank!

Specify Ocean Data

Each Sample ~2000 Microbial Species

Page 19: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Environmental Metadata: Beyond Data Collected at Sampling Site

Sea Surface Temp

Chlorophyll

NASA AQUA-MODIS Images covering GOS sites #8–12, mid-November 2003

Page 20: Using Supercomputers and Supernetworks to Explore the Ocean of Life

GOS Predicted Proteins are Largely Bacterial

Source: Shibu Yooseph, et al. (PLOS Biology March 2007)

~3 Million Previously Known Proteins

~5.1 Million GOS Predicted Proteins

NCBI-nr, PG, TGI-EST, ENS

Page 21: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Current Universe of Medium/Large Protein Families

Source: Shibu Yooseph, et al. (PLOS Biology March 2007)

Protein Families Conserved Across Tree of Life

Protein Families Unique to GOS

17,067 Protein Family Clusters

1 Million CPU-Hour Computation!

Page 22: Using Supercomputers and Supernetworks to Explore the Ocean of Life

GOS Analysis -- Protein Families in Nature Have Been Poorly Explored Thus Far

• Novel Sequence Similarity Clustering Process Predicts Proteins and Groups Related Sequences Into Clusters (Families)

• GOS Proteins Increase Size / Diversity of Many Protein Families

• 1,700 Novel GOS-Only Clusters Identified (>20 per Cluster)

– 10% of 17,000 Clusters

NCBI_nr

GOS + NCBI_nr + Ensembl + TIGR Gene Indices + Prokaryotic Genomes

Source: Shibu Yooseph, et al. (PLOS Biology March 2007)

Page 23: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Enormous Biodiversity: Very Little of GOS Metagenomic Data Assembles Well

• Use Reference Genomes to Recruit Fragments

– Compared 334 Finished and 250 Draft Microbial Genomes

• Only 5 Microbial Genera Yielded Substantial and Uniform Recruitment

– Prochlorococcus, Synechococcus, Pelagibacter, Shewanella, and Burkholderia

Source: Douglas Rusch, et al. (PLOS Biology March 2007)
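The slide does not describe how fragments were recruited to reference genomes. To illustrate the general idea only, the toy sketch below assigns each read to the reference with which it shares the largest fraction of exact k-mers. Exact k-mer sharing is a stand-in for real sequence alignment, and every name, sequence, and threshold here (`recruit`, `k=8`, `min_shared=0.5`, the toy references) is hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit(reads, references, k=8, min_shared=0.5):
    """Assign each read to the reference sharing the most of its k-mers.

    A read is recruited only if at least `min_shared` of its k-mers occur
    in the best-matching reference; otherwise it stays unassigned.
    """
    ref_index = {name: kmers(seq, k) for name, seq in references.items()}
    recruited = {name: [] for name in references}
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        best_name, best_frac = None, 0.0
        for name, ref_kmers in ref_index.items():
            frac = len(read_kmers & ref_kmers) / len(read_kmers)
            if frac > best_frac:
                best_name, best_frac = name, frac
        if best_name is not None and best_frac >= min_shared:
            recruited[best_name].append(read)
    return recruited


# Toy data: two made-up "reference genomes" and three made-up reads.
toy_references = {
    "ref_a": "ACGTTGCAAGGCTTAACCGGTTAACG",
    "ref_b": "TTTTGGGGCCCCAAAATTTTGGGGCC",
}
toy_reads = ["TTGCAAGGCTTAAC", "GGGGCCCCAAAATT", "ACACACACACACAC"]
toy_result = recruit(toy_reads, toy_references)
print(toy_result)
```

In this picture, a genus with "substantial and uniform recruitment" is one whose reference collects many reads spread evenly along its length, and reads that match no reference stay unassigned.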

Page 24: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Self-Organizing Maps Identify Species Using Japanese Earth Simulator

Human

Fugu

Arabidopsis

Rice

C. elegans

Drosophila

www.es.jamstec.go.jp/publication/journal/jes_vol.6/pdf/JES6_22-Abe.pdf

T. Abe, H. Sugawara, S. Kanaya, T. Ikemura, Journal of the Earth Simulator, Volume 6, October 2006, 17–23

SOM Created from an Unsupervised Neural Network Algorithm to Analyze Tetranucleotide Frequencies in a Wide Range of Genomes

10kb Moving Window
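The input to such a map is a vector of tetranucleotide frequencies for each window of sequence. The sketch below computes those 256-dimensional vectors for a moving window and is only a minimal illustration. The step size, the handling of ambiguous bases, and all function names are assumptions, and the SOM training itself is not shown.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return the 256 tetranucleotide frequencies of seq (summing to 1).

    Tetramers containing anything other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def windowed_frequencies(seq, window=10_000, step=10_000):
    """Return one frequency vector per window; `step` is an assumption."""
    return [
        tetranucleotide_frequencies(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy input: a 20 kb repetitive sequence, windows overlapping by half.
toy_genome = "ACGT" * 5_000
vectors = windowed_frequencies(toy_genome, window=10_000, step=5_000)
print(len(vectors), "windows,", len(vectors[0]), "features each")
```

Each resulting vector would be one training sample for the map. Windows from the same genome tend to have similar vectors, which is what lets a map of this kind separate species.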

Page 25: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Using SOM, Sargasso Sea Metagenomic Data Yields 92 Microbial Genera!

Eukaryotes

Prokaryotes

Viruses

Mitochondria

Chloroplasts

Input Genomes: 1500 Microbes, 40 Eukaryotes, 1065 Viruses, 642 Mitochondria, 42 Chloroplasts

5kb Window

T. Abe, H. Sugawara, S. Kanaya, T. Ikemura, Journal of the Earth Simulator, Volume 6, October 2006, 17–23

Page 26: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Crystal Structures

The Human Kinome: A Protein Family Implicated In Many Human Diseases

Yeast

Mouse

C. elegans

Drosophila

Arabidopsis

Sea Urchin

Dictyostelium

Tetrahymena

EPKs

Manning, et al. (2002) Science 298:1912

Over 500 Protein Kinases

2% of the Human Genome

Many splice variants

Source: Susan Taylor, SOM, UCSD

Page 27: Using Supercomputers and Supernetworks to Explore the Ocean of Life

From Microbial Genomes To Human Disease

• Microbes Have a Much Simpler Genome Than Humans

• However, Microbes Share Many of the Core Components of the Molecular Signaling Machinery Used by Humans

• Understand Both the Evolution and Regulation of Signaling Systems, First in Microbes and Then in Humans

• This is a Rich Source for Mapping the Origins of EPKs

Source: Susan Taylor, SOM, UCSD

>24,000 Kinases, Including 16,000 New Kinases, In Venter Global Ocean Sampling Data!

Page 28: Using Supercomputers and Supernetworks to Explore the Ocean of Life

The OptIPuter Project: Creating High Resolution Portals Over Dedicated Optical Channels to Global Science Data

Picture Source: Mark Ellisman, David Lee, Jason Leigh

Calit2 (UCSD, UCI) and UIC Lead Campuses; Larry Smarr, PI

Univ. Partners: SDSC, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST

Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent

$13.5M Over Five Years

Now In the Fifth Year

Page 29: Using Supercomputers and Supernetworks to Explore the Ocean of Life


Dedicated Optical Channels Make High Performance Cyberinfrastructure Possible

(WDM)

Source: Steve Wallach, Chiaro Networks

“Lambdas”

Parallel Lambdas are Driving Optical Networking the Way Parallel Processors Drove 1990s Computing

10 Gbps per User ~ 200x Shared Internet Throughput
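To make the "~200x" comparison concrete, the sketch below derives the shared-Internet throughput that the slide's ratio implies, then compares idealized transfer times for a dataset over each path. The 1 TB dataset size is an assumption chosen for illustration, and the calculation ignores protocol overhead, disk speed, and congestion.

```python
# Figures from the slide.
DEDICATED_GBPS = 10.0  # one dedicated lambda per user
SPEEDUP = 200.0        # "~200x shared Internet throughput"

# Shared throughput implied by the slide's ratio, in Mbps.
shared_mbps = DEDICATED_GBPS * 1000.0 / SPEEDUP


def transfer_seconds(size_bytes, rate_bits_per_second):
    """Idealized transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / rate_bits_per_second


# Assumed dataset size (not from the talk): 1 terabyte.
dataset_bytes = 1e12

t_dedicated = transfer_seconds(dataset_bytes, DEDICATED_GBPS * 1e9)
t_shared = transfer_seconds(dataset_bytes, shared_mbps * 1e6)

print(f"implied shared throughput: {shared_mbps:.0f} Mbps")
print(f"1 TB over a 10 Gbps lambda: {t_dedicated / 60:.1f} minutes")
print(f"1 TB over shared Internet:  {t_shared / 3600:.1f} hours")
```

At these rates a terabyte takes minutes over a lambda and the better part of two days over the shared path, which is the practical difference between exploring a dataset interactively and shipping it.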

Page 30: Using Supercomputers and Supernetworks to Explore the Ocean of Life

National Lambda Rail (NLR) and TeraGrid Provide Cyberinfrastructure Backbone for U.S. Researchers

Map labels: Seattle; Portland; Boise; San Francisco; Los Angeles; San Diego; Phoenix; Ogden/Salt Lake City; Denver; Albuquerque; Las Cruces/El Paso; San Antonio; Houston; Dallas; Tulsa; Kansas City; Chicago; Cleveland; Pittsburgh; New York City; Washington, DC; Raleigh; Atlanta; Jacksonville; Pensacola; Baton Rouge; UC-TeraGrid; UIC/NW-Starlight; International Collaborators

NLR 4 x 10Gb Lambdas Initially, Capable of 40 x 10Gb Wavelengths at Buildout

NSF’s TeraGrid Has 4 x 10Gb Lambda Backbone

Links Two Dozen State and Regional Optical Networks

DOE, NSF, & NASA Using NLR

Page 31: Using Supercomputers and Supernetworks to Explore the Ocean of Life

OptIPortal – Termination Device for the Dedicated Gigabit/sec Lightpaths

Photo Source: David Lee, Mark Ellisman NCMIR, UCSD

Collaborative Analysis of Large Scale Images of Cancer Cells

Integration of High Definition Video Streams with Large Scale Image Display Walls

Page 32: Using Supercomputers and Supernetworks to Explore the Ocean of Life

My OptIPortal™ – Affordable Termination Device for the OptIPuter Global Backplane

• 20 Dual CPU Nodes, 20 24” Monitors, ~$50,000

• 1/4 Teraflop, 5 Terabyte Storage, 45 Mega Pixels--Nice PC!

• Scalable Adaptive Graphics Environment (SAGE), Jason Leigh, EVL-UIC

Source: Phil Papadopoulos, SDSC, Calit2
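The OptIPortal figures on this slide can be broken down per node and per monitor. The sketch below does that arithmetic and checks the 45 megapixel total against a 1920 x 1200 panel. That resolution is an assumption, since the slide gives only the monitor size and the total pixel count.

```python
# Figures from the slide.
NODES = 20
MONITORS = 20
COST_USD = 50_000
TERAFLOPS = 0.25
STORAGE_TB = 5
TOTAL_MEGAPIXELS = 45

# Per-unit breakdown.
megapixels_per_monitor = TOTAL_MEGAPIXELS / MONITORS
gflops_per_node = TERAFLOPS * 1000 / NODES
storage_tb_per_node = STORAGE_TB / NODES
cost_per_megapixel = COST_USD / TOTAL_MEGAPIXELS

# Assumed panel resolution (not stated on the slide): 1920 x 1200.
ASSUMED_WIDTH, ASSUMED_HEIGHT = 1920, 1200
assumed_total_megapixels = ASSUMED_WIDTH * ASSUMED_HEIGHT * MONITORS / 1e6

print(f"{megapixels_per_monitor:.2f} MP per monitor")
print(f"{gflops_per_node:.1f} GFLOPS and {storage_tb_per_node:.2f} TB per node")
print(f"${cost_per_megapixel:.0f} per megapixel")
print(f"{assumed_total_megapixels:.2f} MP total at the assumed resolution")
```

Under that assumption the wall would have about 46 megapixels, close to the slide's rounded 45.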

Page 33: Using Supercomputers and Supernetworks to Explore the Ocean of Life

PI Larry Smarr

Paul Gilna Ex. Dir.

Announced January 17, 2006

$24.5M Over Seven Years

Page 34: Using Supercomputers and Supernetworks to Explore the Ocean of Life
Page 35: Using Supercomputers and Supernetworks to Explore the Ocean of Life

The Calit2 CAMERA Microbial Metagenomics Server is Open to the Community

PLOS Biology March 2007

Page 36: Using Supercomputers and Supernetworks to Explore the Ocean of Life

CAMERA Builds on Cyberinfrastructure Grid, Workflow, and Portal Projects in a Service Oriented Architecture

Cyberinfrastructure: Raw Resources, Middleware & Execution Environment

NBCR Rocks Clusters

Virtual Organizations Web Services

KEPLER

Workflow Management

Vision

Telescience Portal

National Biomedical Computation Resource, an NIH-supported resource center

Located in Calit2@UCSD Building

Page 37: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Calit2’s Direct Access Core Architecture Will Create Next Generation Metagenomics Server

Source: Phil Papadopoulos, SDSC, Calit2

Diagram labels:

– Traditional User (Request, Response)

– Web Portal

– Web Services

– Flat File Server Farm

– Database Farm

– Dedicated Compute Farm (1000s of CPUs)

– 10 GigE Fabric

– TeraGrid: Cyberinfrastructure Backplane (scheduled activities, e.g. all by all comparison) (10,000s of CPUs)

– Web (other service)

– Local Cluster

– Local Environment

– Direct Access Lambda Cnxns

– Sargasso Sea Data

– Sorcerer II Expedition (GOS)

– JGI Community Sequencing Project

– Moore Marine Microbial Project

– NASA and NOAA Satellite Data

– Community Microbial Metagenomics Data

Page 38: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Calit2 CAMERA Production Compute and Storage Complex

512 Processors ~5 Teraflops

~ 200 Terabytes Storage
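For a sense of scale, the sketch below works out the peak rate per processor implied by these figures, and how long the 1 million CPU-hour protein clustering computation cited earlier in the talk would occupy a complex of this size. The talk does not say that computation ran on this cluster, so the second figure is purely hypothetical and also assumes perfect parallel efficiency.

```python
# Figures from this slide.
PROCESSORS = 512
PEAK_TERAFLOPS = 5.0

# Figure from the earlier protein-family slide.
CLUSTERING_CPU_HOURS = 1_000_000

# Peak rate per processor implied by the slide's totals.
gflops_per_processor = PEAK_TERAFLOPS * 1000 / PROCESSORS

# Hypothetical: wall-clock time if that job used all 512 processors
# with perfect efficiency (an idealization, not a reported result).
wall_clock_hours = CLUSTERING_CPU_HOURS / PROCESSORS
wall_clock_days = wall_clock_hours / 24

print(f"~{gflops_per_processor:.1f} GFLOPS peak per processor")
print(f"~{wall_clock_hours:.0f} hours (~{wall_clock_days:.0f} days) of wall-clock time")
```

The result is on the order of months, which suggests why the architecture slide also shows TeraGrid, with its 10,000s of CPUs, for scheduled all-by-all comparisons.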

Page 39: Using Supercomputers and Supernetworks to Explore the Ocean of Life

The Calit2 CAMERA Metagenomics Site is Now Active

http://camera.calit2.net/

Page 40: Using Supercomputers and Supernetworks to Explore the Ocean of Life

CAMERA Research Tools and Data

Page 41: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Distribution of CAMERA User Registrations

Nearly 1000 Registered Users From 45 Countries

USA 583
United Kingdom 46
Canada 35
France 35
Germany 32

Page 42: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome

Acidobacteria bacterium Ellin345, Soil Bacterium, 5.6 Mb

Page 43: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome

Source: Raj Singh, UCSD

Page 44: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome

Source: Raj Singh, UCSD

Page 45: Using Supercomputers and Supernetworks to Explore the Ocean of Life

Interactive Exploration of Marine Genomes Using 100 Million Pixels

Ginger Armbrust (UW), Terry Gaasterland (UCSD SIO)

Page 46: Using Supercomputers and Supernetworks to Explore the Ocean of Life

NW!

CICESE

UW

JCVI

MIT

SIO UCSD

SDSU

UIC EVL

UCI

OptIPortals

OptIPortal

Calit2 is Now OptIPuter Connecting Remote OptIPortals for Moore-Funded Microbial Researchers via NLR

CAMERA Servers

Page 47: Using Supercomputers and Supernetworks to Explore the Ocean of Life

www.glif.is

Created in Reykjavik, Iceland 2003

Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA Data System

Visualization courtesy of Bob Patterson, NCSA.