Big Data, Bioscience and the Cloud
Dan Sullivan
June 25, 2015
BioCatalyst: Cloud Computing in Bioscience
Oregon Bioscience Association
Overview
• Background
• Varieties of Big Data in Bioscience
• Continuous learning about Big Data & Cloud
• Making Connections
My Background
• Data Architect / Engineer
  • NoSQL and relational data modeler
  • Big data
  • Analytics, machine learning and text mining
  • Cloud computing
• Computational Biologist
• Author
  • NoSQL for Mere Mortals
  • Contributor to TechTarget
Varieties of Big Data in Bioscience
Subcellular – Genetics and Proteomics
Cellular – Metabolic and Signaling Pathways
Organism – Disease, Medicine, Insurance
Populations – Epidemiology, Social Networks
Genetics and Proteomics
• Genetic Sequencing
  • Order of nucleotides in DNA
  • Most DNA is common across species
  • Many genes code for proteins
  • Some variants are associated with disease
  • Which ones?
• Proteomics
  • Structure and function of proteins
  • Variation in protein sequence and structure is associated with disease
  • Which ones? In what context?
Images: http://www.masimo.it/hemoglobin/anemia.htm, https://en.wikipedia.org/wiki/DNA
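The question "which variants?" starts with locating where a sample sequence differs from a reference. A minimal sketch of that first step, assuming two already-aligned, equal-length sequences; the function name `find_variants` and both toy sequences are invented for illustration and do not come from the talk:

```python
def find_variants(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        # Real pipelines align sequences first; this sketch skips that step.
        raise ValueError("this sketch only compares equal-length sequences")
    pairs = zip(reference.upper(), sample.upper())
    return [
        (index + 1, ref_base, alt_base)
        for index, (ref_base, alt_base) in enumerate(pairs)
        if ref_base != alt_base
    ]


# Toy sequences, invented for illustration only.
reference = "ATGGCCATTGTA"
sample = "ATGGCCGTTGTA"
print(find_variants(reference, sample))
```

Finding the differences is the easy part; deciding which of them are associated with disease requires many samples and statistical analysis, which is where the data volume comes from.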
Pathways
• Metabolic Pathways
  • Series of chemical reactions
  • Coordinated to produce reactants
  • Choreography of molecules
• Signaling Pathways
  • Molecules on cell surface detect changes in environment
  • Cascade of reactions to change state of cell
  • Choreography of molecules
  • How do they interact?
Disease - Atherosclerosis
• Early 1950s Korean War autopsies
• 1985-1998 Pathology Studies - Pathobiological Determinants of Atherosclerosis in Youth (PDAY) study
• 2012-2016 Genomic and Proteomic Studies
Healthcare
• Genetics and Disease
• Post-Approval Drug Efficacy
• Discovering and Retrieving Medical Information
• Comparative Quality
Populations
• Infectious Disease Spread
  • How fast will disease spread?
  • What countermeasures are effective?
  • What is the morbidity and mortality?
• Simulation
  – Synthetic population
  – Model interactions
  – Probabilistic
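The three simulation ingredients (a synthetic population, modeled interactions, probabilistic outcomes) can be illustrated with a very small agent-based susceptible/infectious/recovered model. This is a sketch under stated assumptions, not the model used in any study mentioned in the talk: the function name and every parameter value (contacts per day, transmission probability, recovery time) are hypothetical.

```python
import random


def simulate_outbreak(population_size=200, contacts_per_day=4,
                      transmission_prob=0.05, recovery_days=7,
                      days=60, seed=42):
    """Agent-based SIR sketch: returns one (S, I, R) count tuple per day."""
    rng = random.Random(seed)
    # Synthetic population: one state per person, plus days left infectious.
    state = ["S"] * population_size
    days_left = [0] * population_size
    state[0] = "I"  # a single initial case
    days_left[0] = recovery_days
    history = []
    for _day in range(days):
        infectious = [p for p in range(population_size) if state[p] == "I"]
        newly_infected = set()
        # Model interactions: each infectious person meets random others.
        for _person in infectious:
            for _contact in range(contacts_per_day):
                other = rng.randrange(population_size)
                # Probabilistic: a contact transmits only some of the time.
                if state[other] == "S" and rng.random() < transmission_prob:
                    newly_infected.add(other)
        # People who were infectious today move one day closer to recovery.
        for person in infectious:
            days_left[person] -= 1
            if days_left[person] == 0:
                state[person] = "R"
        for person in newly_infected:
            state[person] = "I"
            days_left[person] = recovery_days
        history.append((state.count("S"), state.count("I"), state.count("R")))
    return history


history = simulate_outbreak()
print("final (S, I, R):", history[-1])
```

Running the function with a lower `contacts_per_day` is a crude stand-in for a countermeasure such as reduced social contact; comparing many seeded runs gives a distribution of outcomes rather than a single answer.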
Why Cloud for Big Data in Bioscience?
• Scalability
  • Access to compute- and memory-optimized virtual machines
  • Virtually unlimited storage
• Speed
  • Many bioscience computations are highly parallel
  • Minimize time to analyze, lower IT overhead
• Cost
  • AWS Spot Instances
  • Google Preemptible VMs
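"Highly parallel" here means that each unit of work (a read, a gene, a sample) can be processed without reference to the others, so adding cores or machines shortens the run almost proportionally. A minimal single-machine sketch using Python's standard `concurrent.futures`; the GC-content calculation, the function names, and the toy reads are illustrative assumptions rather than anything from the talk:

```python
from concurrent.futures import ProcessPoolExecutor


def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_content_parallel(sequences, workers=4):
    """Score each sequence independently across worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # No sequence depends on another, so the work splits cleanly.
        return list(pool.map(gc_content, sequences))


if __name__ == "__main__":
    reads = ["ATGC", "GGCC", "ATAT", "GATTACA"]  # toy reads, invented
    print(gc_content_parallel(reads))
```

The same independence is what makes interruptible, lower-cost instances a reasonable fit: if a worker is reclaimed, only its unfinished items need to be rerun.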
Continuous Learning
• Coursera
  • Cloud Computing Concepts
  • Bioinformatics: Life Sciences on Your Computer
• edX
  • Introduction to Statistics
  • Introduction to Biology
  • Principles of Biochemistry
• Rackspace CloudU
• YouTube
• Big Data Vendors
  • MapR
  • Cloudera
  • Hortonworks
  • DataStax
  • Databricks
• Trade Publications
  – TechTarget
    • SearchAWS
    • SearchCloudComputing
    • SearchCloudSecurity
  – Health Data Management
  – Harvard Business Review
Final Thoughts
• Great time to get into Biosciences and Big Data
• Don’t be intimidated if it’s been a while since you’ve studied biology; we are all constantly learning in this field
• Network online and in person
• Take advantage of free resources
  • Courses
  • Cloud
    • AWS Free Tier
    • MapR Hadoop On Demand Training
• Connect with me on LinkedIn
  • https://www.linkedin.com/in/dansullivanpdx
• Join me at a Meetup
  • [email protected]