Big Data in Post Genome Era
Mehdi Sadeghi
National Institute of Genetic Engineering and Biotechnology
School of Biological Sciences, Institute for Research in Fundamental Sciences
The Problem of Big Data
Volume
Velocity of process
Variability
Motivation
• Recent developments in biotechnology have enabled high-throughput data generation from biological samples
• We have lots and lots of data about all aspects of biology (although still mostly about humans)
• How can we make sense of all this data?
  – Analyse the data to extract new knowledge about the biology: Data Mining
1973: Sharp, Sambrook, Sugden
Gel Electrophoresis Chamber, $250
1958: Matt Meselson &
Ultracentrifuge, $500,000
The Problem of Big Data in Biology
hopefully comfortable enough to minimize the technology and focus on the biology.
2003: ABI 3730 Sequencer
Human Genome: $2.7 Billion, 11 Years
2012: Oxford Nanopore MinION
Human Genome: $900, 6 Hours
The Problem of Big Data in Biology: A decade’s progress
Sequencing the Human Genome (figure: Log10(price) against Year, 2000 to 2010)
• 2001: Human Genome Project, 2.7G$, 11 years
• 2001: Celera, 100M$, 3 years
• 2007: 454, 1M$, 3 months
• 2008: ABI SOLiD, 60K$, 2 weeks
• 2009: Illumina, Helicos, 40-50K$
• 2010: 5K$, a few days
• 2012: <1000$, <24 hrs
The Problem of Big Data in Biology
A Super-Moore’s Law
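To make the "Super-Moore" comparison concrete, the short sketch below estimates how often the cost of a human genome halved between two price points quoted on the sequencing slide (2.7G$ in 2001, under 1000$ in 2012). The 2-year doubling period used for Moore's law and the reading of "<1000$" as exactly 1000 are assumptions made for illustration, not figures from the talk.

```python
import math

# Price points read off the slide: (year, approximate cost in USD).
# "<1000$" for 2012 is treated as exactly 1000, which is an assumption.
HGP_2001 = (2001, 2.7e9)
CHEAP_2012 = (2012, 1.0e3)

# Assumption: Moore's law taken as one doubling roughly every 2 years.
MOORE_DOUBLING_YEARS = 2.0


def halving_time_years(start, end):
    """Average time for the cost to halve between two (year, cost) points."""
    (y0, c0), (y1, c1) = start, end
    halvings = math.log2(c0 / c1)  # number of times the cost was cut in half
    return (y1 - y0) / halvings


sequencing_halving = halving_time_years(HGP_2001, CHEAP_2012)
speedup = MOORE_DOUBLING_YEARS / sequencing_halving

print(f"cost halved about every {sequencing_halving * 12:.1f} months")
print(f"about {speedup:.1f}x faster than the assumed Moore's-law pace")
```

Under these assumptions the halving time comes out at well under one year, which is the sense in which sequencing cost has outpaced Moore's law.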
So what data can we generate?
• Biological data can be generated at many different levels
  – Genomics (DNA)
  – Transcriptomics (RNA)
  – Proteomics (proteins)
  – Metabolomics (small compounds)
  – Lipidomics (lipids)
• Hundreds of –omics have been catalogued
The Problem of Big Data in Biology
High Throughput Phenotyping
The large amount of sequence-based data needs balancing with equally powerful phenotypic data.
Phytomorph Project (Univ. Wisconsin)
• $70K for 30 cameras
• 200 movies of root growth
• 4GB/day of images for processing
Data to Networks to Biology
Protein Interaction Network
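As a rough illustration of the step from data to networks, the sketch below turns a list of pairwise interactions into an undirected graph and picks out highly connected proteins. The protein names and interaction pairs are hypothetical placeholders, and the degree threshold for calling something a hub is an arbitrary choice for this example.

```python
from collections import defaultdict

# Hypothetical interaction pairs; the names are placeholders, not real data.
interactions = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]


def build_network(pairs):
    """Build an undirected interaction network as an adjacency map."""
    network = defaultdict(set)
    for u, v in pairs:
        network[u].add(v)
        network[v].add(u)
    return network


def hubs(network, min_degree):
    """Return proteins with at least min_degree interaction partners."""
    return sorted(p for p, nbrs in network.items() if len(nbrs) >= min_degree)


network = build_network(interactions)
degrees = {protein: len(nbrs) for protein, nbrs in network.items()}

print(degrees)
print(hubs(network, min_degree=3))
```

An adjacency map of sets keeps lookups cheap and ignores duplicate pairs, which is convenient when the same interaction is reported by more than one experiment.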
Aims
• First: organize the data so that researchers can access existing information and submit new entries
• Second: develop tools and resources that aid in the analysis of data
• Third: interpret the results in a biologically meaningful manner
Bioinformatics: an interdisciplinary field
• Biology: Molecular Biology, Biochemistry, Biophysics
• Computer Science: Theoretical CS, Machine Learning, Data Mining, Information Management
• Applied Mathematics & Statistics
General Types of “Informatics” Techniques
• Databases
  – Building, Querying
  – Object DB
• Text String Comparison
  – Text Search
  – 1D Alignment
  – Significance Statistics
• Finding Patterns
  – AI / Machine Learning
  – Clustering
  – Datamining
• Geometry
  – Robotics
  – Graphics (Surfaces, Volumes)
  – Comparison and 3D Matching (Vision, recognition)
• Physical Simulation
  – Newtonian Mechanics
  – Electrostatics
  – Numerical Algorithms
  – Simulation
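The "1D Alignment" item in the list above can be illustrated with a minimal sketch of global sequence alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions chosen for illustration; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    Only the score is computed, so a single previous row is kept in memory.
    """
    # Row 0: aligning the empty prefix of a against each prefix of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair   # align a[i-1] with b[j-1]
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GATACA"))
```

Judging whether such a score is meaningful is the job of the "Significance Statistics" item: the score is compared with what unrelated sequences of similar length would produce.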
Algorithmic vs. Statistical Perspectives
Computer Scientists
• Data: a record of everything that happened.
• Goal: process the data by positing a model to find interesting patterns and associations.
• Methodology: develop approximation algorithms under different models of data access, since the goal is typically computationally hard.
Statisticians (and Natural Scientists)
• Data: a particular random instantiation of an underlying process describing unobserved patterns in the world.
• Goal: extract information about the world from noisy data.
• Methodology: make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic or stochastic model.
Major Application: Finding Homologs
Major Application: Designing Drugs
• Understanding How Structures Bind Other Molecules (Function)
• Designing Inhibitors
• Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).
Pharmacogenomics
Everybody is different
The Right Drug
To The Right Patient
For The Right Disease
At The Right Time
Big changes in the past ... and future
Consider the creation of:
• Modern Physics, Management Science
• Computer Science, Transistors and Microelectronics
• Molecular Biology, Biotechnology
These were driven by new measurement techniques and technological advances, but they led to:
• big new (academic and applied) questions
• new perspectives on the world
• lots of downstream applications
We are in the middle of a similarly big shift!