spark summit eu talk by erwin datema and roeland van ham

28
High-throughput Genomics at Your Fingertips with Apache Spark Erwin Datema Roeland van Ham Spark Summit EU 2016, October 26, Brussels

Upload: spark-summit

Post on 06-Jan-2017

166 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Spark Summit EU talk by Erwin Datema and Roeland van Ham

High-throughput Genomicsat Your Fingertips with Apache Spark

Erwin DatemaRoeland van Ham

Spark Summit EU 2016, October 26, Brussels

Page 2: Spark Summit EU talk by Erwin Datema and Roeland van Ham

OverviewHigh-throughput Genomics at Your Fingertips with Apache Spark

• Disclaimer : I am a scientist in computational biology (bioinformatics)• I am not a computer scientist• I am not a data scientist

• Scope• KeyGene’s journey into Spark to analyze genomics data• Goal: enable interactive genomics data processing and querying• Told from a user’s perspective

• Contents• Introduction to KeyGene• Crash Course Genomics• Big Data Challenges

The crop innovation company 2

Page 3: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Global trends

The crop innovation company 33

World population grows from 7 to 9 billion people in 2050

Climate change

Limited/bad land, water and fossil fuels

More obese people

Malnourished people

Page 4: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Agricultural challenges

The crop innovation company 4

YIELD: Producing more food on less land

Population3.0 billion

Population4.4 billion

Population6.0 billion

Population7.5 billion

4.3 hectaresarable land per person

3.o hectaresarable land per person

2.2 hectaresarable land per person

1.8 hectaresarable land per person

Page 5: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genetic improvement of crops

The crop innovation company 5

Molecular breedingMolecular mutagenesis

Our strategy:

Use of natural genetic variation in crops

Not GM:At this moment too costly (20-100 mil €)

Regulatory, societal and technical hurdles

Page 6: Spark Summit EU talk by Erwin Datema and Roeland van Ham

About KeyGene

The crop innovation company 6

working for the future of global agriculture

Curr

ent

shar

ehol

ders

Founded in 1989

>

Go-to Ag Biotech company for higher crop yield & quality

Page 7: Spark Summit EU talk by Erwin Datema and Roeland van Ham

R&D strategy

The crop innovation company 7

Fundamental research

Developingtechnologies& traits

Applying molecular breeding ofcrops

Breeding Seed products Market

UniversitiesKeyGene

Partners breeding industry

Page 8: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Big Data in GenomicsRelevance

The crop innovation company 8

image source: https://www.genome.gov/images/content/costpermb2015_4.jpg data source: https://www.ncbi.nlm.nih.gov/genbank/statistics/

• Genomic data is being produced on an unprecedented scale• The cost to sequence a genome is now a few thousand dollars• We can routinely sequence tens to hundreds of individual plants

Total data volume (# bases) in NCBI GenBank

Page 9: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Plant GenomicsDifferent from human genomics!

The crop innovation company 9

one genome many genomes

genomics as a diagnostic tool genomics as a tool to direct breeding

“simple” genetics complex genetics

high quality, shared resources variable quality, fragmented resources

understand and cure diseases improve crop yield and quality

Page 10: Spark Summit EU talk by Erwin Datema and Roeland van Ham

~16 Gb~5.5 Gb~3.5 Gb~850 Mb~450 Mb

Crash Course GenomicsPlant genome sizes

The crop innovation company 10

OnionWheatPepperPotatoMelon

diploidhexaploiddiploidtetraploiddiploid

Mb = megabasesGb = gigabases

diploid = two sets of chromosomestetraploid = four sets of chromosomes

Page 11: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Crash Course GenomicsDNA, chromosomes, nucleotides

• DNA consists of four different elements (nucleotides or bases)• We represent DNA as strings of characters from the alphabet A C G T

The crop innovation company 11

• DNA is organized into chromosomes

• Each chromosome contains millions of nucleotides (characters)

• We can only ‘read’ short pieces of DNA (hundreds to thousands of nucleotides)

Page 12: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Crash Course GenomicsGenes and traits

The crop innovation company 12

drought tolerance fruit size

genome

genes

• Genes are (some of) the functional elements in the genome• We represent genes as an interval on a chromosome

Page 13: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Crash Course GenomicsPolymorphisms and their impact

The crop innovation company 13

large fruits

small fruits

no impact

Page 14: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Crash Course GenomicsRead alignment and variant calling

The crop innovation company 14

• The ‘reference genome’ represents the known sequence of a species

• Reads are aligned against the reference genome (string similarity search)• Complex: sequence variation, repetitive regions and data errors

• Variants are called from differences observed in ‘pile-ups’ of reads

Page 15: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Crash Course GenomicsPopulation-scale genome sequencing

The crop innovation company 15

genomegenes

reads

(individual 1)reads

(individual 2)

reads(individual n)

5 Gigabases

1 billion reads

1 billion reads

1 billion reads

• Align a billion reads x 1,000 individuals to a 5 Gb genome

• Call hundreds of millions (up to potentially billions) of sequence variants

Page 16: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Crash Course GenomicsRecap

• Genome sequences are represented as strings of A C G T

• Variation between genome sequences underlies differences in traits

• We can “read” the genome in little pieces• High throughput, massively parallel sequencing technologies• Up to thousands of individual plants from a given species

• Genomics data analysis is challenging• Rapid increase in data generation• Rapid turnover of sequencing technologies and their outputs• Scientific software is (often) bad (Nature News, Oct 13 2010)

The crop innovation company 16

Page 17: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genomics data analysisHigh Performance Computing

• Computational challenges• Align billions of reads to the reference genome• Call millions or billions of sequence variants• Determine the small number of variants that impact a given trait

• HPC infrastructure (e.g. SGE clusters) are the de facto standard• Manually split large datasets (to accommodate the job scheduler)• Manually deal with failures: check logs, resubmit jobs…

• Many software tools are in fact large, monolithic “pipelines”• No fine-grained control over resource usage• A single error often implies a complete re-run of the analysis

The crop innovation company 17

Page 18: Spark Summit EU talk by Erwin Datema and Roeland van Ham

High Performance ComputingBig Data technologies

The crop innovation company 18

compute nodecompute nodecompute nodecompute nodecompute node

submit host

compute nodecompute node

storagestoragestoragestoragestorage

storage

storagestorage

stor

age

serv

er

com

pute

clu

ster

Conventional Compute Cluster

Expensive, proprietary storageExpensive network connections

Expensive, high-reliability hardware

Spark Cluster

Spar

k c

lust

er

compute + storagecompute + storagecompute + storagecompute + storagecompute + storage

admin / name node

compute + storagecompute + storage

Commodity hardwareLinear scalabilityFault tolerance

Page 19: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genomics application landscapeOpportunities for Big Data technologies

19

?

ConventionalComputeCluster

Compute tool

High Memory

Hardware Accelerated(GPU / FPGA)

Spark(and Hadoop)

Page 20: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Big Data in GenomicsChallenges

The crop innovation company 20

FASTQ SAM BAM VCF

DNA sequencer reads alignments variants

• Genomics is a dynamic, rapidly changing field• Data generators and analysis algorithms are in constant flux• Tools are generally built around flat, text-based file formats• Workflow is file-centric (POSIX file system; no streaming…)

Page 21: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Big Data in GenomicsOur ‘Sparkified’ pipeline

The crop innovation company 21

Scala

BWA

GATK4

Mark Duplicates

ADAM

GuacamoleFASTQ

HDFS

SAM

HDFS

BAM

HDFS

VCF

local

read alignment processing variant calling

Page 22: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genomics on SparkSolutions to legacy designs

The crop innovation company 22

FASTQ

DNA sequencer reads

FASTQ

reads

@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

identifiersequenceseparator

quality

RDD[String]

f1.zip(f2).map(e => e._1 + e._2)

‘interleaved’ reads

class FastQRecordReader extends RecordReader

Page 23: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genomics on SparkRead alignment

The crop innovation company 23

FASTQ

FASTQ

FASTQ

FASTQ

FASTQ RDD

Scala

partition 1

partition 2

partition n

BWA

partition 1

partition 2

partition n

memory

RDD SAM

Scala

HDFSHDFS

perl wrapper for BWA called by rdd.pipe()

class FastQRecordReaderextends RecordReader

sort outputand add header

Page 24: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genomics on SparkProcessing and variant calling

The crop innovation company 24

Broad Institute, Cambridge

Broad Institute, Cambridge

AMPlab, University of California

• Alignment processing

• Variant calling (in development)

• Data schemas + APIs

• Variant calling (Guacamole)

• Variant analysis

• You’ve all attended the Keynote talk…

Page 25: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Genomics on SparkVariant selection and analysis

The crop innovation company 25

VCFScala

parquet

HDFSlocal

RDD tempTableSQL

selectionScala

VCF

localmemory

Variant selection

phenotypes GWAS

• Interactive, “real-time” selection of variant data with simple SQL queries

• GWAS analysis on Spark (e.g. hail) or conventional infra (e.g. PLINK)

Page 26: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Big Data in GenomicsKeyGene’s ambition

The crop innovation company 26

Page 27: Spark Summit EU talk by Erwin Datema and Roeland van Ham

Wrap-upConclusions and lessons learnt

• Initial success in applying Spark to plant genomics• Proof-of-concept for enabling interactive GWAS analysis on Spark

• Spark appears to be a good fit for (some of our current) Genomics problems• Developer community needed to translate core Genomics applications!• Paradigm shift required to move away from flat POSIX files…• Opportunities for streaming data analysis

The crop innovation company 27

Page 28: Spark Summit EU talk by Erwin Datema and Roeland van Ham

The EndHigh-throughput Genomics at Your Fingertips with Apache Spark

The crop innovation company 28

• Erwin Datema [email protected]

• Roeland van Ham [email protected]