GTL Facilities Computing Infrastructure for 21st Century Systems Biology

Ed Uberbacher (ORNL) & Mike Colvin (LLNL)


Post on 28-Dec-2015



Experimental:
• Complete datasets
• Quantitative measurements
• Comprehensive physical characterization: protein expression and interactions, spatial distributions, process kinetics

Computational:
• Automated data analysis and validation
• Automated integration of diverse data sets
• Human- and computer-accessible databases
• Molecular, pathway, and cell-level simulations

The goals require a new synergy between computing and biology.

Ultimate Goal is to Provide Predictive Models of Microbes

This goal drives data collection and computing strategy.

GTL Biology Paradigm: Integrated Large-Scale Experiment-Computing Cycles

[Figure: cycle diagram with the stages Experiment, Large-Scale Data Sets, Real-Time Analysis, Design or Revise Models, and Simulate and Generate Hypotheses]

Facility I: Production and Characterization of Proteins

Estimating Microbial Genome Capability

Computational Analysis
• Genome analysis of genes, proteins, and operons
• Metabolic pathways analysis from reference data
• Protein machines estimate from PM reference data

Knowledge Captured
• Initial annotation of genome
• Initial perceptions of pathways and processes
• Recognized machines, function, and homology
• Novel proteins/machines (including prioritization)
• Production conditions and experience
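The "genome analysis of genes, proteins, and operons" step can be illustrated with a small sketch. It uses one common heuristic (adjacent genes on the same strand separated by a short intergenic gap are grouped as a candidate operon), which is an assumption for illustration and not necessarily the method planned for the facility. The `Gene` record, the `group_operons` function, the 50-base gap threshold, and the gene coordinates are all hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record: name, coordinates, strand."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def group_operons(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group neighboring same-strand genes whose intergenic gap is <= max_gap.

    max_gap is an assumed threshold, chosen only for this example.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    operons: List[List[str]] = []
    prev = None
    for gene in ordered:
        same_unit = (
            prev is not None
            and gene.strand == prev.strand
            and gene.start - prev.end <= max_gap
        )
        if same_unit:
            operons[-1].append(gene.name)   # extend the current candidate operon
        else:
            operons.append([gene.name])     # start a new candidate operon
        prev = gene
    return operons


# Made-up coordinates for illustration only.
genes = [
    Gene("geneA", 100, 900, "+"),
    Gene("geneB", 920, 1500, "+"),
    Gene("geneC", 1530, 2100, "+"),
    Gene("geneD", 2600, 3200, "+"),
    Gene("geneE", 3220, 3900, "-"),
]
candidate_operons = group_operons(genes)
print(candidate_operons)
```

Real operon prediction would also draw on regulatory elements, conservation across genomes, and expression data, which is why this analysis feeds the later facilities.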

Facility II: Whole Proteome Analysis

Modeling Proteome Expression, Regulation, and Pathways

Analysis and Modeling
• Mass spectrometry expression analysis
• Metabolic and regulatory pathway/network analysis and modeling

Knowledge Captured
• Expression data and conditions
• Novel pathways and processes
• Functional inferences about novel proteins/machines
• Genome super annotation: regulation, function, and processes (deep knowledge about cellular subsystems)

[Figure: Regulatory Gene Network Model for Endomesoderm Specification, with a skeletogenic region labeled (Eric Davidson)]

Facility III: Characterization and Imaging of Molecular Machines

Exploring Molecular Machine Geometry and Dynamics

Computational Analysis, Modeling and Simulation
• Image analysis/cryoelectron microscopy
• Protein interaction analysis/mass spec
• Machine geometry and docking modeling
• Machine biophysical dynamic simulation

Knowledge Captured
• Machine composition, organization, geometry, assembly and disassembly
• Component docking and dynamic simulations of machines

Example of Combined Experiment and Modeling to Understand a Multiprotein Complex: DNA Clamps & Clamp-Loading Mechanisms

• Classical molecular dynamics: Jeruzalmi et al. Cell 106:417 (2001)
• Mechanistic model based on physical and biochemical data: Jeruzalmi et al. Cell 106:429 (2001)
• Electron microscopy: Mayanagi et al. J. Struct. Biol. 134:35 (2001)
• Homology modeling: Venclovas et al. Prot. Sci. 11:2403 (2002)
• Atomic force microscopy: Shiomi et al. PNAS 97:14127 (2000)

Facility IV: Analysis and Modeling of Cellular Systems

Simulating Cell and Community Dynamics

Analysis, Modeling and Simulation
• Couple knowledge of pathways, networks, and machines to generate an understanding of cellular and multi-cellular systems
• Metabolism, regulation, and machine simulation
• Cell and multicell modeling and flux visualization

Knowledge Captured
• Cell and community measurement data sets
• Protein machine assembly time-course data sets
• Dynamic models and simulations of cell processes

Centrally Planned Analysis and Modeling Tools Libraries

Facility 1: genome annotation; regulatory element and operon identification; metabolic pathway analysis

Facility 2: mass spec data analysis; expression analysis and clustering; metabolic and regulatory network modeling

Facility 3: image analysis; mass spec analysis; protein/machine modeling; docking and molecular dynamics

Facility 4: metabolic simulation; regulatory simulation; cell modeling and simulations

Collect and manage software:
• Maintain current versions
• Ensure hardware compatibility
• User interfaces
• Documentation

GTL Facilities Will Require High-Performance Computing for Both Capacity and Capability

Capacity: e.g., high-throughput protein structure predictions

[Figure: genome sequence (ATCGTAGCAATCGACCGT...) is threaded onto templates and the best match is selected]

Capability: e.g., large-scale biophysical simulations
• Large size and timescale classical simulations
• Highly accurate quantum mechanical simulations
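The capacity case (thread each sequence onto a set of templates and keep the best match) can be sketched as many independent scoring jobs. In the sketch below, a simple position-by-position identity score stands in for a real threading energy function; that substitution, the function names `identity_score` and `best_match`, and the template and query sequences are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def identity_score(query: str, template: str) -> float:
    """Fraction of identical positions over the shorter sequence.

    A toy stand-in for a threading score; real threading evaluates
    sequence-structure fitness, not plain identity.
    """
    n = min(len(query), len(template))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(query, template) if a == b)
    return matches / n


def best_match(query: str, templates: Dict[str, str]) -> Tuple[str, float]:
    """Score the query against every template and return the top one."""
    scores = {name: identity_score(query, seq) for name, seq in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical template library and query proteins.
templates = {
    "fold_1": "MKTAYIAKQR",
    "fold_2": "MKTWYIPKQL",
    "fold_3": "GGGGGGGGGG",
}
queries = {
    "protein_1": "MKTAYIAKQL",
    "protein_2": "GGGAGGGGGG",
}

# Each query is scored independently of the others, so the work
# can be spread across many processors (capacity computing).
assignments = {name: best_match(seq, templates) for name, seq in queries.items()}
print(assignments)
```

Because no query depends on another, throughput scales with the number of processors. The capability case differs: a single large simulation must be spread across a tightly coupled machine.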

GTL High-Performance Computing Roadmap

[Figure: roadmap chart with a Biological Complexity axis and a computing scale of 1 TF*, 10 TF, 100 TF, and 1000 TF, with current U.S. computing marked. Applications shown: comparative genomics; genome-scale protein threading; constrained rigid docking; constraint-based flexible docking; protein machine interactions; molecular machine classical simulation; cell, pathway, and network simulation; community metabolic, regulatory, and signaling simulations; molecule-based cell simulation]

*Teraflops

Swimming in Data: Exploding Need to Capture and Manipulate Data

● From Acquisition, Refinement, Reduction and Deposition

● Across Scales of Space and Time - Petabytes

Data Repositories
• Genomes, annotation, and community ‘genomes’
• Expression data and proteome composition
• Metabolite and flux data
• Metabolic pathways and kinetic parameters
• Protein interactions
• Protein machines repository: machine composition, function, homology, models
• Image data repository
• Regulatory network data and models
• Cell models repository

Integrated or integrable

Requires development of a cross-facilities approach
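What "integrated or integrable" could mean in practice is sketched below: records from separate repositories become joinable once they share an identifier. The shared gene identifier scheme, the `integrate` function, and all sample values are hypothetical and are not taken from any GTL database design.

```python
from collections import defaultdict
from typing import Any, Dict


def integrate(**sources: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Merge per-repository records into one view keyed by a shared gene ID.

    Assumes every repository uses the same identifier scheme, which is
    the hard part a cross-facilities approach would have to settle.
    """
    merged: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for source_name, records in sources.items():
        for gene_id, value in records.items():
            merged[gene_id][source_name] = value
    return dict(merged)


# Made-up records from three hypothetical repositories.
annotation = {"gene_0001": "hypothetical protein", "gene_0002": "ABC transporter"}
expression = {"gene_0001": 12.5, "gene_0003": 3.1}
interactions = {"gene_0002": ["gene_0001"]}

knowledge = integrate(
    annotation=annotation,
    expression=expression,
    interactions=interactions,
)
print(knowledge["gene_0001"])
```

Genes missing from a repository simply lack that field in the merged view, which makes gaps in coverage easy to spot.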

Central Database Planning

[Figure: data types to be linked: phylogeny, microbial genomes, protein domains, pathways, regulatory elements, community genomes, literature, metabolic models, expression, proteomics, protein machines, regulatory networks, protein structure]

The GTL Knowledge Base: Integration of Large Datasets Is a Precursor to Predictive Modeling

Simulation of even a “simple” metabolic pathway depends on a large volume of data.

[Figure labels: Annotated data sets; Raw data sets]
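To make the data dependence concrete, here is a minimal sketch of a two-step pathway (substrate S to intermediate I to product P) simulated with Michaelis-Menten rate laws and forward Euler integration. The pathway, the rate law choice, the parameter values, and the names `michaelis_menten` and `simulate` are all assumptions for illustration. Even this toy needs four measured kinetic parameters plus an initial concentration.

```python
from typing import Dict, Tuple


def michaelis_menten(vmax: float, km: float, conc: float) -> float:
    """Reaction rate v = vmax * [X] / (km + [X])."""
    return vmax * conc / (km + conc)


def simulate(s0: float, steps: int, dt: float,
             params: Dict[str, float]) -> Tuple[float, float, float]:
    """Integrate S -> I -> P with forward Euler; returns final (S, I, P)."""
    s, i, p = s0, 0.0, 0.0
    for _ in range(steps):
        v1 = michaelis_menten(params["vmax1"], params["km1"], s)  # S -> I
        v2 = michaelis_menten(params["vmax2"], params["km2"], i)  # I -> P
        s -= v1 * dt
        i += (v1 - v2) * dt
        p += v2 * dt
    return s, i, p


# Hypothetical kinetic parameters (arbitrary units); in practice each
# value would come from experimental measurement.
params = {"vmax1": 1.0, "km1": 0.5, "vmax2": 0.8, "km2": 0.3}
s_final, i_final, p_final = simulate(s0=2.0, steps=1000, dt=0.01, params=params)
print(s_final, i_final, p_final)
```

A realistic pathway multiplies this requirement by the number of enzymes, and adds regulation and expression levels, which is where the annotated and raw data sets come in.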

• GTL knowledge base will change how information about microbes reaches the community
• Models and simulations will be online
• We will know more and more about systems in each consecutive microbe