Cloud Computing and Innovations for Optimizing Life Sciences Research

Krittika Sasmal @ InterpretOmics

19th February 2014


Acknowledgement

Organizers of this event

Scientists and Researchers who work for others and make their research findings OPEN

Entrepreneurs who translate this knowledge

GNU, Open Software, NIH and other funding bodies that keep Biology & Medical information OPEN

19.02.14


4 Grand Social Challenges

Food Security

Health Security

Energy Security

Environmental Security

Common Thread is Biology


The 21st Century Biology – The Quantitative Biology

Ref: A New Biology for the 21st Century, The National Academies

Will create a discovery engine able to tackle extremely complex biological and societal problems


Decoding the Book of Life – Milestone for Quantitative Biology

A Milestone for Humanity – the Human Genome

Human Genome working draft announced, June 26th, 2000


The Journey: From Reduction to Integration

Many diseases were researched and understood through the process of reduction.

However, as the understanding of diseases matures and the need for proactive medicine increases, researchers find that many diseases, including cancer, are due to somatic mutations that cannot be understood in the reduced space.

Understanding such diseases and discovering drugs for them will need the reverse operation: integration and systems biology.

Ref: Hiroaki Kitano, Systems Biology: A Brief Overview, Science 295, 1662 (March 2002)


Translational Medicine – Genomics + Clinical + Non-clinical Data to Discovery of Novel Therapeutics

Figure: the path from Data to Information to Knowledge in iOmics. Inputs: Literature/Molecular Data; Clinical/Bedside Data; Medical Knowledge; Disease/Drug Data. Stages: Target Data; Preprocessed Data; Transformed Data; Patterns.


Next Generation Sequence Data

• FASTQ (Illumina)
• SFF (454)
• CCS (PacBio)
• ...
• Microarray

Figure: read types and library layouts. Labels: DNA/RNA/miRNA; cDNA/mRNA; Single End sequences; Paired End or Mate-paired sequences (Insert Size, Library Size); Overlapped reads; Random Order & Orientation; Long reads; Short reads; Fixed length reads; Variable length reads; Circular Consensus reads; Hundreds to Billions Bases; Billions to Hundreds Bases.
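To make the FASTQ entry above concrete, here is a minimal Python sketch of reading such a file and computing a per-read mean quality, the kind of check a raw-read QC step might start from. It assumes the common four-line record layout (header starting with `@`, sequence, `+` separator, quality string) and Phred+33 quality encoding; the names `FastqRecord`, `read_fastq`, and `mean_phred`, and the two example reads, are hypothetical and not taken from the talk.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line also stops the reader)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(record: FastqRecord, offset: int = 33) -> float:
    """Mean base quality, assuming Phred+33 encoding (an assumption)."""
    scores = [ord(char) - offset for char in record.quality]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical two-read example: 'I' encodes Phred 40, '!' encodes Phred 0.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGTA\n+\n!!!!!\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
means = [mean_phred(record) for record in records]

for record, mean in zip(records, means):
    print(record.name, len(record.sequence), mean)
```

Reading records lazily with a generator keeps memory use flat regardless of file size, which matters once files reach the volumes discussed in this talk.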


Data domains and Challenges

Source: Clevergene Biocorp


Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

1. Maximize insight into a data set

2. Uncover underlying structure

3. Extract important variables

4. Detect outliers and anomalies

5. Test underlying assumptions

6. Develop parsimonious models

7. Determine optimal factor settings


Each of these is addressed through algorithms to solve a computational problem, typically based upon optimizing some mathematical criterion
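As one small illustration of item 4 (detect outliers and anomalies), the sketch below flags values outside Tukey's interquartile-range fences. This is a common rule of thumb chosen here for illustration, not necessarily the method the talk has in mind; the function name `iqr_outliers` and the expression values are hypothetical.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical expression measurements for one gene across ten samples.
expression = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
outliers = iqr_outliers(expression)
print(outliers)
```

The multiplier `k` controls how aggressive the rule is; 1.5 is the conventional default, and larger values flag only more extreme points.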


Trends in Genomic Medicine

J. J. McCarthy et al., Sci Transl Med 2013;5:189sr4


Data Driven Healthcare

Figure: roadmap from Health Care Today through Translational Medicine to Personalized Health Care. Labels: Digital Imaging; Episodic Treatment; Electronic Health Records; Artificial Expert Systems; Clinical Genomics; Genetic Predisposition Testing; Molecular Medicine; CA Diagnosis; Pre-symptomatic Treatment; Lifetime Treatment. Care levels: Non-specific (Treat Symptoms); Organized (Error Reduction); Personalized (Disease Prevention). Enablers: Automated Systems; Information Correlation; Data and Systems Integration; Distributed High-Throughput Analytics. Axes: Evolutionary Practices; Revolutionary Technology.


P6 Medicine – Preventive, Predictive, Participatory, Personalized, Precision, and Pervasive


Personal Data


Population Data

Figure: sources of population data. Labels: Registry; Claims Data; Clinical Trial; Drug Reaction; Literature; Genomic Data; Survivability; Public Health Epidemiology.


Biomedical Data

Clinical Repositories

Online Mendelian Inheritance in Man (OMIM)

Medical Subject Headings (MeSH)

Genome/Gene Annotations

University of Washington Digital Anatomist (UWDA)

NCBI Taxonomies

Human Metabolome

RxNorm (Drug)

ICD10

Logical Observation Identifiers Names and Codes (LOINC)

UMLS (Unified Medical Language System)


7 V's in Healthcare Big Data

Vexing. Algorithms must be designed so that data processing time stays linear in the size of the data. Many genomic analysis problems are NP-hard, so parallel algorithms are needed to access data in a near real-time manner.

Volume. The physical volume of data that needs to be online, both structured and unstructured. Storage is available; the challenge is to determine relevance within large data volumes and to use analytics to create value from the relevant data.

Velocity. Data must be retrieved in a timely manner. In healthcare, many data sources are outside the enterprise. Reacting quickly enough to deal with data velocity is critical for most healthcare applications.


Variety. Data today comes in many formats: structured and numeric data, and unstructured data such as CT scans, MRI, ultrasound, and X-rays. Most healthcare data is also categorical, with or without an order. Managing, merging, and governing these varieties of data is a challenge.

Variability. Healthcare data are highly inconsistent, with periodic peaks. Cancer, for example, has four types of variability: intratumoral, intermetastatic, intrametastatic, and interpatient. Discovering the independent variables and the causal attributes is critical.

Veracity. Quality, relevance, repeatability, quantification, meaningfulness, predictive value, and reduction of error.

Value. The final result, quantified as ROI, reduced readmissions, or reduced morbidity, is what will ultimately be measured.


Healthcare Analytics & Decision Support System

Analytics of 7Vs


Big Data in Life Sciences


The question is NOT whether we can do it, but HOW QUICKLY and HOW ACCURATELY we can do it; Speed, Repeatability, Reliability, Predictability, and Precision matter!


The Cloud

You don't buy a COW when you need Milk!

Likewise, a Biologist, a Breeder, or a Clinician need not Worry about Data Analysis, Algorithms, Supercomputers, Pipelines, or even the Analytics and the Biomedical Informatics!


Cloud Computing Defined

Cloud computing is an emerging computing paradigm where data and applications reside in cyberspace, allowing users to access their data and information through any web-connected device, fixed or mobile.

Source: John B. Horrigan, Use of Cloud Computing Applications & Services, Data memo, PEW Internet & American Life project, September 2008


Cloud Computing User – I (Amir)


Cloud Computing User – II (Fakir)


Characteristics of Cloud Computing

Virtual – Physical location and underlying infrastructure details are transparent to users

Scalable – Able to break complex workloads into pieces to be served across an incrementally expandable infrastructure

Efficient – Service-Oriented Architecture for dynamic provisioning of shared compute resources

Flexible – Can serve a variety of workload types – both consumer and commercial


Benefits of Cloud

Unlimited Resource

Unlimited computing power

Unlimited storage (Filestore & online memory)

Users can use resources without owning anything – converting Capex to Opex

Helping Green computing by lending out idle resources through Cycle Scavenging

Pay as you go


Cloud Computing Stack

Figure: layered cloud stack. Layer labels: Facilities; Hardware; Connectivity & delivery; API; Integration & Middleware; Data, Metadata, Content; Application; Presentation Modality; Presentation Platform. Service models: Infrastructure as a Service; Platform as a Service; Software as a Service. Cross-cutting concerns: QOE & QOS; Security; Middleware. Actors: Original Cloud Provider; Cloud Vendor; Cloud User; User/Customer/Device. Network: Next Generation Network / Intranet.


Divide and Conquer: MapReduce

Figure: MapReduce data flow coordinated by a Master. Input files are divided into Split 1 to Split 4. Map workers write intermediate files of key:value pairs (k1:v1, k3:v2; k1:v3, k2:v4; k2:v5, k4:v6). Sort/Group collects values by key (k1:v1,v3; k2:v4,v5; k3:v2; k4:v6). Reduce workers write the output files Output 1 to Output 3.
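The Map, Sort/Group, and Reduce stages in this slide can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not how Hadoop or any other framework is implemented: a real system runs the map and reduce workers in parallel on different machines. The k-mer counting job, and the names `group_by_key`, `map_reduce`, `kmer_mapper`, and `count_reducer`, are hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, object]


def group_by_key(pairs: Iterable[Pair]) -> Dict[str, list]:
    """Sort/Group step: collect every value emitted under the same key."""
    groups: Dict[str, list] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def map_reduce(
    splits: Iterable[Iterable[str]],
    mapper: Callable[[str], Iterable[Pair]],
    reducer: Callable[[str, list], object],
) -> Dict[str, object]:
    """Run map over every record of every split, group by key, then reduce."""
    intermediate: List[Pair] = []
    for split in splits:  # each split would go to a separate map worker
        for record in split:
            intermediate.extend(mapper(record))
    grouped = group_by_key(intermediate)
    return {key: reducer(key, values) for key, values in grouped.items()}


def kmer_mapper(read: str, k: int = 3) -> Iterator[Pair]:
    """Hypothetical map function: emit (k-mer, 1) for each k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def count_reducer(key: str, values: List[int]) -> int:
    """Hypothetical reduce function: total the counts for one k-mer."""
    return sum(values)


# The intermediate pairs shown in the slide, grouped by key.
diagram_pairs = [("k1", "v1"), ("k3", "v2"), ("k1", "v3"),
                 ("k2", "v4"), ("k2", "v5"), ("k4", "v6")]
grouped = group_by_key(diagram_pairs)

# A toy job: count 3-mers across two splits of short reads.
splits = [["ACGT", "CGTA"], ["ACGA"]]
kmer_counts = map_reduce(splits, kmer_mapper, count_reducer)

print(grouped)
print(kmer_counts)
```

Grouping the slide's pairs reproduces the groups in the diagram (k1 with v1 and v3, k2 with v4 and v5, k3 with v2, k4 with v6). Because the mapper sees one record at a time and the reducer sees one key at a time, both stages can be spread across workers without changing the result.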


Open Source MapReduce

Hadoop: implemented in Java; enabled on Amazon

Twister: lightweight new arrival in town


Security in the Cloud

Security in the cloud needs to answer a few specific questions, such as:

1. How much do you trust the virtualized environment or the hypervisors in the cloud compared with your own physical hardware?

2. How much do you trust the cloud vendor compared with your own infrastructure?

3. How do you address regulatory and compliance requirements when your application might be running on infrastructure in a foreign country?


New generation software in bioinformatics needs to be:

Fast or very fast, with low memory consumption

Able to handle and analyze terabytes of data

Able to store data efficiently for querying

Designed to distribute computation, not data

Focused and useful in analysis


www.iomics.in

The Omics Lab on the Internet


http://www.iomics.in/

Crop to Cancer


Analysis Pipelines in iOMICS

QA/QC of raw reads

SNP/InDel/CNV analysis

miRNA discovery

Exome sequencing

ChIP sequencing

... (New additions)

Meta-analysis

Visualization of results

Omics data

Systems Biology


SNP & CNV Analysis


Hierarchical Clustering


Gene Interaction/Enrichment


Cloud Computing Holds the Potential to Address the Challenges and Transform Biology and Healthcare

Krittika Sasmal

Email: “krittika” dot “sasmal” (at) “interpretomics” dot “co”
