genome-scale big data pipelines

Dr. Denis Bauer & Lynn Langit

Genomic-scale Data Pipelines

Denis Bauer, PhD

Oscar Luo, PhD

Rob Dunne, PhD

Piotr Szul

Aidan O’Brien

Laurence Wilson, PhD

Adrian White

Andy Hindmarch

Collaborators

David Levy

Software

Dan Andrews

Kaitao Lai, PhD

Arash Bayat

John Hildebrandt

Mia Chapman

Ian Blair

Kelly Williams

Jules Damji

Gaetan Burgio

Lynn Langit

Natalie Twine, PhD

Prabha Pillay

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Transformational Bioinformatics Team

Big Data in 2025…Petabytes?

[Bar chart: Astronomy, Twitter, YouTube; axis from 0 to 2500]

GENOMIC Big Data in 2025 - Exabytes

[Bar chart: Astronomy, Twitter, YouTube, Genomic; axis from 0 to 25]

Genome holds Blueprint for Every Cell

Affects Looks, Disease Risk, and Behavior

VCF Data

Genomic Research Workflow

https://www.projectmine.com/about/

BigData Focus

Finding the Disease Gene(s)

Spot the letter that is…

• common amongst all affected

• absent in all unaffected*

* oversimplified
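The "spot the letter" rule above can be sketched on a toy example. This is a minimal illustration only: the sample names, sequences, and the `candidate_positions` helper are made up, and real association analysis is far less clear-cut, as the slide's footnote says.

```python
# Toy genotype data: one string of "letters" per sample, one column per position.
# All sample names and sequences are hypothetical.
affected = {
    "case_1": "ACGTTA",
    "case_2": "ACGTCA",
    "case_3": "TCGTGA",
}
unaffected = {
    "control_1": "ACATTA",
    "control_2": "TCCTCA",
}


def candidate_positions(affected, unaffected):
    """Return (position, letter) pairs shared by every affected sample
    and seen in no unaffected sample at that position."""
    cases = list(affected.values())
    controls = list(unaffected.values())
    hits = []
    for pos in range(len(cases[0])):
        letters = {seq[pos] for seq in cases}
        if len(letters) != 1:
            continue  # not common amongst all affected
        letter = next(iter(letters))
        if any(seq[pos] == letter for seq in controls):
            continue  # present in at least one unaffected sample
        hits.append((pos, letter))
    return hits


print(candidate_positions(affected, unaffected))  # [(2, 'G')]
```

Only position 2 survives both filters: every affected sample carries G there and neither control does.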

[Figure labels: controls; Gene1, Gene2]

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Why Apache Spark?

Performance – Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome

[Chart axis: Accuracy, low to high]

Cloud Data Pipeline Pattern

Business Problem

Data Quality

Candidate Technologies

Build/Test MVPs

Assemble Pipeline

Building a Cloud Data Pipeline

Candidate Technologies

• Ingest/Clean

• Analyze/Predict

• Visualize

Build MVPs

• Test

• Iterate

• Learn

Assemble Pipeline

• Combine pieces

• Validate sections

• Test at scale

Building a Cloud Data Pipeline

• IaaS, PaaS, SaaS Vendors

• AWS, Azure, GCP…

Visualizing Machine Learning Results

Solving Important Questions…Cancer genomics?

DEMO: Who is a Bondi Hipster?

Supervised ML: Wide Random Forests

Scaling to 50 M variables and 10 K samples

100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00

• Yarn Cluster

• 12 workers

• 16 x Intel CPUs

• Xeon E5-2660@2.20GHz

• 128 GB RAM

• Spark 1.6.1

• 128 executors

• 6GB / executor (0.75TB total)

• Synthetic dataset

Whole Genome Range

GWAS Range
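The figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the slide does not say which method each runtime/cost pair belongs to, so they are labeled neutrally as the faster and slower configuration.

```python
# Figures copied from the slide; everything else is derived arithmetic.
executors = 128
gb_per_executor = 6
total_memory_tb = executors * gb_per_executor / 1024  # 768 GB expressed in TB

fast_hours = (5, 50)        # 100K trees, faster configuration (low, high)
slow_hours = (200, 2000)    # 100K trees, slower configuration (low, high)
fast_cost_usd = 215.50
slow_cost_usd = 8620.00

runtime_ratio = slow_hours[0] / fast_hours[0]
cost_ratio = slow_cost_usd / fast_cost_usd

print(f"Total executor memory: {total_memory_tb:.2f} TB")
print(f"Runtime ratio: {runtime_ratio:.0f}x, cost ratio: {cost_ratio:.0f}x")
```

The 0.75TB figure is simply 128 executors at 6GB each, and the quoted AWS costs differ by the same factor of 40 as the runtimes.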

Future Directions for VariantSpark RF

Mixed feature types

Unordered Categorical

Continuous

Build Community

Python API

Non-Genomic Demos

Implementation by

Try it out: VariantSpark Notebook

https://docs.databricks.com/spark/latest/training/variant-spark.html

Genome Editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy

“Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.”

Aim: Develop a computational guidance framework to enable edits that work the first time, every time

Ma et al. Nature 2017 *

* Controversy around the paper – stay tuned

Make Process Parallel and Scalable

• Each search can be broken down into parallel tasks - each takes seconds

• Researchers might want to search the target for one gene or 100,000
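The fan-out idea above can be sketched locally. This is only an illustration under stated assumptions: a thread pool stands in for parallel serverless invocations, and `score_target` is a hypothetical placeholder (GC fraction), not GT-Scan2's actual scoring.

```python
from concurrent.futures import ThreadPoolExecutor


def score_target(site):
    """Hypothetical per-site task: GC fraction as a stand-in score."""
    gc = sum(base in "GC" for base in site)
    return site, gc / len(site)


def search_gene(sites, max_workers=8):
    """Fan out one short task per candidate site and collect the results.

    pool.map returns results in the same order as the input sites.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_target, sites))


sites = ["GACGT", "ATATA", "GGCCG"]
for site, score in search_gene(sites):
    print(site, score)
```

Because each task is independent, the same function shape works whether a researcher submits one gene's worth of sites or many thousands; only the number of concurrent workers changes.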

Scalability + Agility =

One of the first Serverless Applications in Research

Featured in

X-Ray Tracing Demo of GT-Scan2

• Find performance bottlenecks

• Fix and test

[Diagram: Webapp; Resources (S3, DynamoDB); Lambda functions]

GT-Scan2 X-Ray Analysis

Results – 4x Faster (80% improvement)

30 sec

Considering Services for GT-Scan2

• Use AWS Step Functions

• Simplify workflow

• Simplify task timeouts

• Simplify task failures

• Must evaluate costs

• SNS vs. Step Functions
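To make the timeout and failure points concrete, here is a minimal sketch of a Step Functions state machine definition (Amazon States Language) built as a Python dictionary. The state names, the Lambda ARN, and the timeout and retry values are all placeholders chosen for illustration, not GT-Scan2's actual configuration.

```python
import json

# Hypothetical Lambda ARN; replace with a real function ARN before use.
SCORE_FN_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:score-targets"

state_machine = {
    "Comment": "Sketch of a search workflow with a timeout, retry and failure path",
    "StartAt": "ScoreTargets",
    "States": {
        "ScoreTargets": {
            "Type": "Task",
            "Resource": SCORE_FN_ARN,
            "TimeoutSeconds": 30,  # illustrative value: task timeout declared in the workflow
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                # Any error left after retries is routed to an explicit failure state.
                {"ErrorEquals": ["States.ALL"], "Next": "ReportFailure"}
            ],
            "End": True,
        },
        "ReportFailure": {
            "Type": "Fail",
            "Error": "SearchFailed",
            "Cause": "Scoring task failed after retries",
        },
    },
}

print(json.dumps(state_machine, indent=2))
```

Timeouts, retries, and failure routing are declared in the workflow definition rather than coded into each function, which is the simplification the bullets refer to. Whether that is worth the cost compared with SNS-based chaining still has to be evaluated.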

Problem: Search (GT-Scan2)

Data: fastq, bed -> S3, NoSQL

Technologies: Ingest, ETL, Analyze, Viz

MVPs: S3, Lambda, Lambda/API Gateway

Pipeline: Serverless

Serverless Pipeline Pattern

Lambda function

buckets with objects

DynamoDB

API Gateway

Users

Step Functions
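One piece of the pattern above, a Lambda function behind API Gateway that records a job, might look like the following. This is a local sketch under assumptions: the `JOBS` dictionary stands in for a DynamoDB table so the code runs without AWS, and the request fields `job_id` and `gene` are hypothetical.

```python
import json

# In-memory stand-in for a DynamoDB table so the sketch runs locally.
JOBS = {}


def handler(event, context):
    """Lambda-style handler for an API Gateway proxy request.

    Expects a JSON body such as {"job_id": "...", "gene": "..."} (hypothetical fields).
    """
    body = json.loads(event.get("body") or "{}")
    job_id = body.get("job_id")
    if not job_id:
        return {"statusCode": 400, "body": json.dumps({"error": "job_id required"})}
    JOBS[job_id] = {"gene": body.get("gene"), "status": "queued"}
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id, "status": "queued"})}


response = handler({"body": json.dumps({"job_id": "demo-1", "gene": "GENE1"})}, None)
print(response["statusCode"], response["body"])
```

In a deployed version the dictionary write would become a DynamoDB put, and downstream functions or a Step Functions workflow would pick the job up from there.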

Problem: Analyze (GWAS)

Data: vcf -> S3/Spark

Technologies: Ingest, ETL, Analyze, Viz

MVPs: S3 -> Databricks DBFS, Apache Spark, VariantSpark ML, Notebook, SQL, R, Python

Pipeline: Spark Server Cluster

Spark Server Cluster Pipeline Pattern

Jupyter Notebook

Cloud Genomic-Scale Data Pipelines

• Problem #1 – ML on Large Data

• Solution: Spark-server cluster + custom machine learning

• Problem #2 – Burstable Search

• Solution: Serverless pipeline

Genomic-scale Data Pipelines

Dr. Denis Bauer & Lynn Langit
