Big Data, Bioscience and the Cloud
Dan Sullivan
June 25, 2015
BioCatalyst: Cloud Computing in Bioscience
Oregon Bioscience Association



Overview
• Background
• Varieties of Big Data in Bioscience
• Continuous Learning about Big Data & Cloud
• Making Connections

My Background
• Data Architect / Engineer
  – NoSQL and relational data modeler
  – Big data
  – Analytics, machine learning and text mining
  – Cloud computing
• Computational Biologist
• Author
  – NoSQL for Mere Mortals
  – Contributor to TechTarget


Big Data Challenges in Bioscience

Volume

Velocity

Variety

Integration

Varieties of Big Data in Bioscience

Subcellular – Genetics and Proteomics

Cellular – Metabolic and Signaling Pathways

Organism – Disease, Medicine, Insurance

Populations – Epidemiology, Social Networks

Genetics and Proteomics
• Genetic Sequencing
  – Order of nucleotides in DNA
  – Most DNA is common across species
  – Many genes code for proteins
  – Some variants are associated with disease
  – Which ones?
• Proteomics
  – Structure and function of proteins
  – Variation in protein sequence and structure is associated with disease
  – Which ones? In what context?

Images: http://www.masimo.it/hemoglobin/anemia.htm, https://en.wikipedia.org/wiki/DNA
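The link between nucleotide order, protein sequence, and disease-associated variants can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the function name `translate` is hypothetical, and the two DNA fragments are short illustrative strings modeled on the start of the human beta-globin coding sequence, with and without the single-base change (GAG to GTG) associated with sickle cell anemia.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Reads codons from the first base and stops at the first stop codon
    or when fewer than three bases remain.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative fragments (assumed for this sketch): one base differs.
reference = "ATGGTGCATCTGACTCCTGAGGAGAAG"
variant = "ATGGTGCATCTGACTCCTGTGGAGAAG"

print(translate(reference))
print(translate(variant))
```

Comparing the two printed strings shows how a one-nucleotide variant changes a single amino acid. Deciding which such variants matter, and in what context, is the data-intensive part.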

Pathways
• Metabolic Pathways
  – Series of chemical reactions
  – Coordinated to produce reactants
  – Choreography of molecules
• Signaling Pathways
  – Molecules on cell surface detect changes in environment
  – Cascade of reactions to change state of cell
  – Choreography of molecules
  – How do they interact?

Disease - Atherosclerosis
• Early 1950s: Korean War autopsies
• 1985-1998: Pathology studies, Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
• 2012-2016: Genomic and proteomic studies

Healthcare
• Genetics and Disease
• Post-Approval Drug Efficacy
• Discovering and Retrieving Medical Information
• Comparative Quality

Populations
• Infectious Disease Spread
  – How fast will disease spread?
  – What countermeasures are effective?
  – What is the morbidity and mortality?
• Simulation
  – Synthetic population
  – Model interactions
  – Probabilistic
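The three simulation ingredients above (a synthetic population, modeled interactions, and probabilistic outcomes) can be sketched as a toy susceptible/infectious/recovered model. Everything here is an assumption made for illustration: the function name `simulate_outbreak`, the parameter values, and the uniformly random contact pattern. Real epidemiological simulations use far richer synthetic populations and contact networks.

```python
import random


def simulate_outbreak(population_size=1000, initial_infected=5,
                      contacts_per_day=8, p_transmit=0.05,
                      days_infectious=5, days=60, seed=42):
    """Return a list of daily (susceptible, infectious, recovered) counts."""
    rng = random.Random(seed)

    # Synthetic population: one state per person, "S", "I" or "R".
    state = ["S"] * population_size
    days_left = [0] * population_size
    for person in rng.sample(range(population_size), initial_infected):
        state[person] = "I"
        days_left[person] = days_infectious

    history = []
    for _day in range(days):
        infectious = [p for p in range(population_size) if state[p] == "I"]

        # Model interactions: each infectious person meets random others,
        # and each contact with a susceptible person transmits with
        # probability p_transmit.
        newly_infected = set()
        for person in infectious:
            for _contact in range(contacts_per_day):
                other = rng.randrange(population_size)
                if state[other] == "S" and rng.random() < p_transmit:
                    newly_infected.add(other)

        # People infectious today recover after days_infectious days.
        for person in infectious:
            days_left[person] -= 1
            if days_left[person] == 0:
                state[person] = "R"

        for person in newly_infected:
            state[person] = "I"
            days_left[person] = days_infectious

        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history


history = simulate_outbreak()
peak_day, peak = max(enumerate(history), key=lambda item: item[1][1])
print("Peak infectious count", peak[1], "on day", peak_day + 1)
```

Because the model is probabilistic, a single run says little. Questions such as how fast a disease spreads or which countermeasures work are answered by running many seeds and parameter settings, which is where elastic cloud capacity helps.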

Why Cloud for Big Data in Bioscience?
• Scalability
  – Access to compute- and memory-optimized virtual machines
  – Virtually unlimited storage
• Speed
  – Many bioscience computations are highly parallel
  – Minimize time to analyze, lower IT overhead
• Cost
  – AWS Spot Instances
  – Google Preemptible VMs
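"Highly parallel" here means that many analyses split into independent pieces, such as one computation per sequence, that can be spread over as many workers as are available. The sketch below shows that pattern with Python's standard `concurrent.futures` module. The function names and the sample reads are hypothetical, and the local thread pool is only a stand-in for a pool of cloud virtual machines. For CPU-bound Python work, a process pool or a cluster framework would replace it.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of bases in a DNA sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


def gc_content_parallel(sequences, max_workers=4):
    """Apply gc_content to every sequence independently.

    Each sequence is a separate task, so tasks can be handed to any
    number of workers; results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(gc_content, sequences))


# Illustrative reads (assumed for this sketch).
reads = ["ATGC", "GGCC", "ATAT", "GCGCAT"]
print(gc_content_parallel(reads))
```

Because no task depends on another, adding workers shortens the wall-clock time without changing the code that does the analysis.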


Continuous Learning
• Coursera
  – Cloud Computing Concepts
  – Bioinformatics: Life Sciences on Your Computer
• edX
  – Introduction to Statistics
  – Introduction to Biology
  – Principles of Biochemistry
• Rackspace CloudU
• YouTube
• Big Data Vendors
  – MapR
  – Cloudera
  – Hortonworks
  – DataStax
  – Databricks
• Trade Publications
  – TechTarget: SearchAWS, SearchCloudComputing, SearchCloudSecurity
  – Health Data Management
  – Harvard Business Review


LinkedIn Groups

Final Thoughts
• Great time to get into Biosciences and Big Data
• Don’t be intimidated if it’s been a while since you’ve studied biology; we are all constantly learning in this field
• Network online and in person
• Take advantage of free resources
  – Courses
  – Cloud
  – AWS Free Tier
  – MapR Hadoop On Demand Training
• Connect with me on LinkedIn
  – https://www.linkedin.com/in/dansullivanpdx
• Join me at a Meetup
• [email protected]