cloud computing and innovations for optimizing life sciences research
TRANSCRIPT
Cloud Computing and Innovations for Optimizing Life Sciences Research
Krittika Sasmal @ InterpretOmics
19th February 2014
Acknowledgement
Organizers of this event
Scientists and Researchers who work for others and make their research findings OPEN
Entrepreneurs who translate this knowledge
GNU, Open Software, NIH and other funding bodies that keep Biology & Medical information OPEN
19.02.14
4 Grand Social Challenges
Food Security
Health Security
Energy Security
Environmental Security
Common Thread is Biology
The 21st Century Biology: The Quantitative Biology
Will create a discovery engine able to tackle extremely complex biological and societal problems
Ref: A New Biology for the 21st Century, The National Academies
Decoding the Book of Life: milestone for Quantitative Biology
A Milestone for Humanity – the Human genome
Human Genome Completed, June 26th, 2000
The Journey: From Reduction to Integration
Many diseases were researched and understood through the process of reduction. However, as the understanding of diseases matures and the need for proactive medicine increases, researchers find that many diseases, including cancer, are due to somatic mutations that cannot be understood in the reduced space.
Understanding such diseases and discovering drugs for them will need the reverse operation: integration and systems biology.
Ref: Hiroaki Kitano, Systems Biology: A Brief Overview, Science 295, 1662 (March 2002)
Translational Medicine: Genomics + Clinical + Non-clinical Data to Discovery of Novel Therapeutics
[Figure: iOmics flow from Data to Information to Knowledge. Labels: Literature/Molecular Data; Clinical/Bedside Data; Medical Knowledge; Disease/Drug Data; Target Data; Preprocessed Data; Transformed Data; Patterns]
Next Generation Sequence Data
• FASTQ (Illumina)
• SFF (454)
• CCS (PacBio)
• ...
• Microarray
[Figure: types of sequence reads. Labels: DNA/RNA/miRNA; cDNA/mRNA; Single End; Paired End or Mate-paired (Insert Size, Library Size); Overlapped reads; Random Order & Orientation; Long reads; Short reads; Fixed length reads; Variable length reads; Circular Consensus reads; Hundreds to Billions Bases]
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
1. Maximize insight into a data set
2. Uncover underlying structure
3. Extract important variables
4. Detect outliers and anomalies
5. Test underlying assumptions
6. Develop parsimonious models
7. Determine optimal factor settings
Each of these is addressed through algorithms to solve a computational problem, typically based upon optimizing some mathematical criterion.
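As one illustration of an EDA goal (detecting outliers and anomalies), the sketch below applies the common interquartile-range (IQR) rule. It is a minimal example under stated assumptions, not the method used in the talk: the function name `iqr_outliers`, the multiplier `k = 1.5`, and the sample values are all made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule is only one of many outlier criteria;
    k = 1.5 is the conventional default, not a value from the slides.
    """
    q1, _median, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical per-sample read depths (made-up numbers).
depths = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(depths))  # prints [95]
```

In practice the same check is usually paired with a box plot, which draws these fences graphically.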
Data Driven Healthcare
[Figure: Data Driven Healthcare chart. Labels: Health Care Today; Translational Medicine; Personalized Health Care; Non-specific (Treat Symptoms); Organized (Error Reduction); Personalized (Disease Prevention); Evolutionary Practices; Revolutionary Technology; Automated Systems; Information Correlation; Data and Systems Integration; Distributed High-Throughput Analytics; Digital Imaging; Episodic Treatment; Electronic Health Records; Artificial Expert Systems; Clinical Genomics; Genetic Predisposition Testing; Molecular Medicine; CA Diagnosis; Pre-symptomatic Treatment; Lifetime Treatment]
P6 Medicine: Preventive, Predictive, Participatory, Personalized, Precision, and Pervasive
Population Data
[Figure: sources of population data. Labels: Registry; Claims Data; Clinical Trial; Drug reaction; Literature; Genomic Data; Survivability; Public Health Epidemiology]
Biomedical Data
[Figure: biomedical data sources. Labels: Clinical Repositories; Online Mendelian Inheritance in Man; Medical Subject Headings; Genome/Gene Annotations; University of Washington Digital Anatomist (UWDA); NCBI Taxonomies; Human Metabolome; RxNorm; Drug; ICD10; Logical Observation Identifiers Names and Codes; UMLS (Unified Medical Language System)]
7V's in Healthcare Big-data
Vexing. Algorithms must be designed so that data processing time scales linearly with data size. Many genomic analysis problems are NP-hard, so proper parallel algorithms need to be designed to access data in near real time.
Volume. The physical volume of data that needs to be online, including structured and unstructured data. Storage is available; the challenge is to determine relevance within large data volumes and to use analytics to create value from the relevant data.
Velocity. Data must be retrieved in a timely manner. In healthcare, many data sources are outside the enterprise. Reacting quickly enough to deal with data velocity is critical for most healthcare applications.
Variety. Data today come in all types of formats: structured numeric data and unstructured data such as CT scans, MRI, ultrasound, and X-rays, each in different forms. Also, most healthcare data are categorical, with or without any order. Managing, merging, and governing different varieties of data is a challenge.
Variability. Healthcare data are highly inconsistent, with periodic peaks. Cancer, for example, has four different types of variability: intratumoral, intermetastatic, intrametastatic, and interpatient. Discovering the independent variables and the causal attributes is critical.
Veracity. Quality, relevance, repeatability, quantification, meaningfulness, predictive value, and reduction of error.
Value. The final result, quantified as ROI, reduced readmissions, or reduced morbidity, is what will finally be measured.
The question is NOT whether we can do it, but HOW QUICKLY and HOW ACCURATELY we can do it. Speed, Repeatability, Reliability, Predictability, and Precision matter!
The Cloud
You don't buy a COW when you need Milk!
Likewise, a Biologist, a Breeder, or a Clinician need not Worry about Data Analysis, Algorithms, Supercomputers, Pipelines, or even the Analytics and the Biomedical Informatics!
Cloud Computing Defined
Cloud computing is an emerging computing paradigm in which data and applications reside in cyberspace; it allows users to access their data and information through any web-connected device, be it fixed or mobile.
Source: John B. Horrigan, Use of Cloud Computing Applications & Services, Data Memo, Pew Internet & American Life Project, September 2008
Characteristics of Cloud Computing
Virtual – Physical location and underlying infrastructure details are transparent to users
Scalable – Able to break complex workloads into pieces to be served across an incrementally expandable infrastructure
Efficient – Service-Oriented Architecture for dynamic provisioning of shared compute resources
Flexible – Can serve a variety of workload types – both consumer and commercial
Benefits of Cloud
Unlimited Resource
Unlimited Computing power
Unlimited storage (Filestore & online memory)
Users can use resources without owning anything – converting Capex to Opex
Helping Green computing by lending out idle resources through Cycle Scavenging
Pay as you go
Cloud Computing Stack
[Figure: cloud computing stack. Service models: Infrastructure as a Service; Platform as a Service; Software as a Service. Layers: Facilities; Hardware; Connectivity & delivery; API; Integration & Middleware; Data, Metadata, Content; Application; API; Presentation Modality, Presentation Platform. Other labels: QOE & QOS; Security; Middleware; User/Customer/Device; Original Cloud Provider; Cloud Vendor; Cloud User; Next Generation Network/Intranet]
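Divide and Conquer: MapReduce
[Figure: MapReduce data flow coordinated by a Master. Input Files (Split 1, Split 2, Split 3, Split 4) are read by Map workers, which write Intermediate files; these are sorted and grouped (Sort/Group), and Reduce workers write the Output files (Output 1, Output 2, Output 3).]
Map output: k1:v1 k3:v2 | k1:v3 k2:v4 | k2:v5 k4:v6
After Sort/Group: k1:v1,v3 | k2:v4,v5 | k3:v2 | k4:v6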
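The Map, Sort/Group, and Reduce stages on this slide can be sketched in a few lines of single-process Python. This is an illustration only, not Hadoop or Twister code: the function names (`map_phase`, `group_by_key`, `reduce_phase`, `map_reduce`) and the base-counting task are assumptions made for the example, and a real framework would run the map and reduce calls on separate workers under a master.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    """Map: emit one (key, value) pair per base in a single input split."""
    return [(base, 1) for base in split]

def group_by_key(pairs):
    """Sort/Group: collect all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into one result."""
    return key, sum(values)

def map_reduce(splits):
    """Run the three stages in sequence (no parallelism in this sketch)."""
    intermediate = chain.from_iterable(map_phase(s) for s in splits)
    grouped = group_by_key(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input: three splits of a made-up DNA sequence.
splits = ["ACGT", "GGCA", "TTAC"]
print(map_reduce(splits))  # prints {'A': 3, 'C': 3, 'G': 3, 'T': 3}

# The grouping step reproduces the key/value example from the slide.
slide_pairs = [("k1", "v1"), ("k3", "v2"), ("k1", "v3"),
               ("k2", "v4"), ("k2", "v5"), ("k4", "v6")]
print(group_by_key(slide_pairs))
```

Because each `map_phase` call depends only on its own split, the calls can be distributed across workers without coordination, which is the point of the divide-and-conquer framing.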
Open Source MapReduce
Hadoop: Implemented in Java; enabled on Amazon
Twister: Lightweight new arrival in town
Security in the Cloud
Security in the cloud needs to answer a few specific questions, such as:
1. How much do you trust the virtualized environment or the hypervisors in the cloud, as against your own physical hardware?
2. How much do you trust the cloud vendor versus your own infrastructure?
3. How do you address regulatory and compliance requirements when your application might be running on infrastructure in a foreign country?
New generation software in bioinformatics needs to:
Be fast (very fast), with low memory consumption
Be able to handle and analyze TB of data
Store data efficiently for querying
Distribute computation, not data
Be focused and useful in analysis
Analysis Pipelines in iOMICS
QA/QC of raw reads
SNP/InDel/CNV analysis
miRNA discovery
Exome sequencing
ChIP sequencing
... (New additions)
Meta-analysis
Visualization of results
Omics data
Systems Biology