Post on 20-Dec-2015
Pacific Northwest National Laboratory
U.S. Department of Energy
DOE Data Workshop
View from Information-intensive Applications
H. Steven Wiley
Biomolecular Systems Initiative
Pacific Northwest National Laboratory (www.sysbio.org)
2
Information Intensive Science
Goals of IIS
  Understanding systems versus individual phenomena
  Strengthening/automating links between different types of data from different scales
Examples
  Biology: Cell Signaling
  Biology: BIRN
  Chemistry: CMCS
  Homeland Defense
  Complexity of systems is becoming pervasive
Challenges
  Efficient federation, graph-based queries
  Continuous data correlation
  Managing complex experiments, data provenance using multiple independent data and analysis resources
Priorities
  High-performance federation, data mining, semantic query capabilities (software, hardware architecture)
  Knowledge environments (lightweight, evolvable, powerful, …)
  Organization and visualization of large-scale, complex information
3
Combustion is a Multi-scale Chemical Science Challenge
A systems-science approach to address complex problems
New knowledge is assimilated from different data, tools, and disciplines at each scale
  Real-time bi-directional information flow
  Deep analysis across scales
  Multiple applications for the same information
Challenges
  Data, provenance, annotation publication
  Syntactic and semantic federation
  Standardization versus innovation
Examples:
  IUPAC – update of radical thermochemistry reference values by global expert group
  PrIMe – community developed optimized reaction mechanisms guiding experimental plans across scales, providing community resources for applied research
4
Homeland Security: Pulling insight out of information overload
  Volume of data, orders of magnitude larger and at different levels of abstraction
  Complexity of information spaces into very high dimensions, 200 the norm
  Information often out of context, incomplete, fuzzy
  Deception
  Information in all media types: text, imagery, video, voice, web, sensor data
  Time and temporal dynamics fundamentally change the approach
  Spatial, yet non-spatial abstract data
  Multiple ontologies, languages, cultures
  Privacy issues
[Figure labels: Immigration, Financial, Sensors, Shipping, Communications; "Is there a domestic terrorist plot?"]
Can we detect and prevent a terrorist attack BEFORE it happens?
For homeland security and science we now turn to data-intensive visual analytics
5
6
Systems Biology of Cells
Molecular parameters: protein levels / states / locations / interactions / activities
Cell function: death, proliferation, differentiation, migration, ...
Ultimate aim: Understanding and prediction of effects of component properties
7
8
9
To successfully model a complex biological system, one must minimally know the following information:
What, Where, Quantity, Quality?
  What parts are being made? (identity)
  How is the regulatory network structured? (interactions)
  Where are the proteins located in the cell? (location)
  What are their levels? (quantity)
  How do they interact with their partners? (activity)
    As a function of covalent modification
    Contribution of steric restrictions
    Forward and reverse rate constants
10
Cells as Input-Output Systems
  Biologists look at their experiments as input-output systems
  We start with a “defined” system to which we apply a stimulus (input: independent variable)
  We then look for a specific response (output: dependent variable)
  The relationship between the input and output provides insight into the workings of the system
[Figure labels: Input, System, Output, Unknown context]
So unless we control the experimental context, we cannot interpret our experiments
11
The Two Greatest Challenges of Systems Biology
1. Working with indeterminate systems
2. Understanding context - what it is and how to control and capture it
12
Defining the composition of living systems is driving analytical technologies
  Genomics
  Proteomics
  Metabonomics
  Expression profiling
  Imaging
  Etc.
All of these technologies seek to rigorously define the composition of living systems
13
Global simultaneous quantitative proteome measurements
Proteins identified and quantified using accurate mass and time (AMT) tags
[Figure: Capillary LC-FTICR 2-D display of detected peptides from a yeast soluble protein digest; >160,000 isotopic distributions corresponding to >100,000 polypeptides detected. Dimension one: separation time (LC elution time, min). Dimension two: accurate mass (m/z; MW).]
14
High Throughput Proteomics
9.4 Tesla High Throughput Mass Spectrometer
  1 experiment per hour
  5000 spectra per experiment
  4 MByte per spectrum
Per instrument:
  20 GBytes per hour
  480 GBytes per day
These are based on today's technologies.
  Time to analyze offsite: 1 week
  Time to analyze onsite: 48 hours
  Time to analyze onsite with smart storage: 2 hours
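The per-instrument data rates on this slide follow from simple multiplication. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB); the constant and function names are illustrative and not part of the original slides:

```python
# Figures taken from the slide; decimal units (1 GB = 1000 MB) are an assumption.
EXPERIMENTS_PER_HOUR = 1
SPECTRA_PER_EXPERIMENT = 5000
MB_PER_SPECTRUM = 4


def gb_per_hour() -> float:
    """Data produced by one instrument in one hour, in GB."""
    mb = EXPERIMENTS_PER_HOUR * SPECTRA_PER_EXPERIMENT * MB_PER_SPECTRUM
    return mb / 1000


def gb_per_day() -> float:
    """Data produced by one instrument running around the clock, in GB."""
    return gb_per_hour() * 24


print(gb_per_hour(), gb_per_day())
```

Both results match the 20 GBytes per hour and 480 GBytes per day quoted for a single instrument.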
15
Integrated, High-throughput Experiments will Generate Enormous Amounts of Data
Experiment templates for a single microbe

Class of experiment              Time    Treatments  Conditions  Genetic   Biological   Total biological  Proteomics  Metabolite  Transcription
                                 points                          variants  replication  samples           (TB)        (TB)        (TB)
simple (scratching the surface)  10      1           3           1         3            90                1.8         1.4         0.009
moderate                         25      3           5           1         3            1125              22.5        16.9        0.1125
upper mid                        50      3           5           5         3            11250             225.0       168.8       1.125
complex                          20      5           5           20        3            30000             600.0       450.0       3
real interesting                 20      5           5           50        3            75000             1500.0      1125.0      7.5

Profiling method
  Proteomics: looking at a possible 6000 proteins per microbe, assuming ~20 GB per sample
  Metabolites: looking at a panel of 500-1000 different molecules, assuming ~15 GB per sample
  Transcription: 6000 genes and 2 arrays per sample, ~100 MB
Typically a single significant scientific question takes the multidimensional analysis of at least 1000 biological samples
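The experiment-template figures can be reproduced by multiplying the design factors together and scaling by the per-sample sizes given under the profiling methods. A minimal sketch, assuming decimal units (1 TB = 1000 GB); the dictionary keys and function names are illustrative and not from the original slides:

```python
from math import prod

# Per-sample sizes in GB, as stated on the slide (~20 GB, ~15 GB, ~100 MB).
GB_PER_SAMPLE = {"proteomics": 20.0, "metabolites": 15.0, "transcription": 0.1}

# (time points, treatments, conditions, genetic variants, biological replication)
TEMPLATES = {
    "simple": (10, 1, 3, 1, 3),
    "moderate": (25, 3, 5, 1, 3),
    "upper mid": (50, 3, 5, 5, 3),
    "complex": (20, 5, 5, 20, 3),
    "real interesting": (20, 5, 5, 50, 3),
}


def total_samples(factors: tuple) -> int:
    """Total biological samples is the product of all design factors."""
    return prod(factors)


def volumes_tb(factors: tuple) -> dict:
    """Data volume per profiling method in TB (assumes 1 TB = 1000 GB)."""
    n = total_samples(factors)
    return {method: n * gb / 1000 for method, gb in GB_PER_SAMPLE.items()}


for name, factors in TEMPLATES.items():
    print(name, total_samples(factors), volumes_tb(factors))
```

The slide rounds some entries to one decimal place (for example 1.35 TB of metabolite data appears as 1.4), so small differences from the printed values are expected.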
16
17
Trey Ideker
The Molecular Interaction Scaffold is Huge
18
Cell Imaging
New multispectral, multidimensional imaging techniques can generate enormous amounts of data
19
Cell Imaging Workflow
Complex set of metadata
collected here
20
How Much Data From Imaging?
  Currently, a high-quality image of a single cell field is 4 MB per image, obtained at 4 fps (16 MB/s)
  Following a cell through one cell cycle takes 24 h, or approximately 1.4 TB
  New hyperspectral microscopes analyzing only 10 wavelengths would generate 7 TB/day
  Characterizing the dynamics of the most abundant set of genes (4000) would require 5.5 PB
  This is for a single instrument and a single experiment using today’s technology
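Most of the imaging estimates can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units, continuous acquisition over the full 24 h cell cycle, and one such 24 h run per gene for the 4000-gene estimate; those assumptions and all names are illustrative rather than stated in the slides. The hyperspectral figure is not recomputed because the slide does not give its per-wavelength image size or frame rate.

```python
# Values from the slide.
MB_PER_IMAGE = 4
FRAMES_PER_SECOND = 4
CELL_CYCLE_HOURS = 24
GENES = 4000


def rate_mb_per_s() -> int:
    """Sustained acquisition rate in MB/s."""
    return MB_PER_IMAGE * FRAMES_PER_SECOND


def cell_cycle_tb() -> float:
    """Data from imaging one cell cycle, in TB (assumes 1 TB = 1e6 MB)."""
    seconds = CELL_CYCLE_HOURS * 3600
    return rate_mb_per_s() * seconds / 1e6


def gene_set_pb() -> float:
    """Data for one cell-cycle run per gene, in PB (assumed experimental design)."""
    return GENES * cell_cycle_tb() / 1000


print(rate_mb_per_s(), round(cell_cycle_tb(), 2), round(gene_set_pb(), 2))
```

Under these assumptions the results come to 16 MB/s, about 1.38 TB per cell cycle, and about 5.53 PB for 4000 genes, consistent with the rounded values on the slide.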
21
Understanding the influence of cell context is driving experimental and computational biology
  Cell signaling
  Developmental biology
  Cancer and growth control
  Host-pathogen interactions
  Dynamics of microbial communities
  Cellular responses to stress
22
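Computational Modeling Approaches -- Diverse Spectrum
[Figure: modeling approaches arranged along a spectrum from SPECIFIED to ABSTRACTED. Approaches shown: differential equations, Markov chains, Boolean models, Bayesian networks, statistical mining. What each captures: mechanisms, influences, relationships (*including structure).]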
23
Computer Models Allow Reconstruction of Processes Across Different Scales
[Figure: MODEL DATABASE schema. Levels: Organ, Tissue, Cell, Species (each numbered 1 … N). Model record: Unique ID, Model Name, Model Descr., Default Par., Default Comp., Timestamp, Security. Linked tables: Solution Par. (Input_par ID, React. Rates, Chemical Par., Concen. Val.), Geometric Par. (Input_par ID, Value_par), Equation Docs. (Input_par ID, Symbolic, Source), Compute Par. (Input_par ID, Value_par), Initial Conditions (Input_fld ID, Value_par), Parameter Docs. (Input_par ID, References, Limits).]
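To make the model database figure more concrete, here is a minimal sketch of how such nested model records might be represented. Only the field labels (unique ID, model name, description, parameter ID, value, references) and the scale names come from the slide; the class names, types, nesting order, and example values are hypothetical, and this is not the schema actually used by the authors.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Parameter:
    """One parameter row; field names loosely follow the slide (hypothetical)."""
    input_par_id: str
    value: float
    references: List[str] = field(default_factory=list)


@dataclass
class Model:
    """One model record at a given scale, with child models at finer scales."""
    unique_id: str
    name: str
    scale: str  # e.g. "organ", "tissue", "cell"
    description: str = ""
    parameters: Dict[str, Parameter] = field(default_factory=dict)
    children: List["Model"] = field(default_factory=list)

    def add_parameter(self, par: Parameter) -> None:
        """Register a parameter under its ID."""
        self.parameters[par.input_par_id] = par

    def walk(self) -> Iterator["Model"]:
        """Yield this model and all nested models, coarse scale first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical example: an organ model containing a tissue model containing a cell model.
cell = Model("m3", "Cell model", "cell")
cell.add_parameter(Parameter("k_on", 1.0e6, ["hypothetical reference"]))
tissue = Model("m2", "Tissue model", "tissue", children=[cell])
organ = Model("m1", "Organ model", "organ", children=[tissue])

print([m.scale for m in organ.walk()])
```

Nesting models by scale keeps each level's parameters separate while still allowing a traversal from organ down to cell, which is one way to read the cross-scale reconstruction the slide title describes.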
24
25
26
27
Obstacles preventing scientists from utilizing available data
  Data is distributed across many repositories with various ontologies and data formats
  Analysis tools do not address integration of heterogeneous data sets
  Few informatics-based analysis tools support a systems biology approach
  Collaboration capabilities are too primitive to support shared knowledge among researchers
28
The Challenge for Data Handling is Two-fold
1. Managing the massive amounts of compositional data necessary to define all of the relevant experimental systems
2. Capturing all of the data on the relationships between context, composition and response
Integration of the analytical and experimental methodologies into a single system is necessary to link all of the data in a useful way
29
END
30
Understanding Living Cells
Cell responses are multiphasic
Different classes of stimulants (information) are processed at characteristic time scales
Processing nodes within cells are spatially segregated
Each cell responds independently depending on its specific context
A response generally induces a reprogramming of the cell machinery
To create cell simulations, we must “abstract” this information to create a reference model which can then be modified