Spark Summit EU talk by Erwin Datema and Roeland van Ham
TRANSCRIPT
High-throughput Genomics at Your Fingertips with Apache Spark
Erwin Datema, Roeland van Ham
Spark Summit EU 2016, October 26, Brussels
Overview: High-throughput Genomics at Your Fingertips with Apache Spark
• Disclaimer: I am a scientist in computational biology (bioinformatics)
  • I am not a computer scientist
  • I am not a data scientist
• Scope
  • KeyGene’s journey into Spark to analyze genomics data
  • Goal: enable interactive genomics data processing and querying
  • Told from a user’s perspective
• Contents
  • Introduction to KeyGene
  • Crash Course Genomics
  • Big Data Challenges
The crop innovation company 2
Global trends
World population grows from 7 billion to 9 billion people by 2050
Climate change
Limited/bad land, water and fossil fuels
More obese people
Malnourished people
Agricultural challenges
YIELD: Producing more food on less land
Population | Arable land per person
3.0 billion | 4.3 hectares
4.4 billion | 3.0 hectares
6.0 billion | 2.2 hectares
7.5 billion | 1.8 hectares
Genetic improvement of crops
Molecular breeding
Molecular mutagenesis
Our strategy:
Use of natural genetic variation in crops
Not GM: at this moment too costly (€20-100 million)
Regulatory, societal and technical hurdles
About KeyGene
working for the future of global agriculture
Current shareholders
Founded in 1989
Go-to Ag Biotech company for higher crop yield & quality
R&D strategy
Fundamental research
Developing technologies & traits
Applying molecular breeding of crops
Breeding Seed products Market
Universities
KeyGene
Partners breeding industry
Big Data in Genomics: Relevance
image source: https://www.genome.gov/images/content/costpermb2015_4.jpg
data source: https://www.ncbi.nlm.nih.gov/genbank/statistics/
• Genomic data is being produced on an unprecedented scale
  • The cost to sequence a genome is now a few thousand dollars
  • We can routinely sequence tens to hundreds of individual plants
Total data volume (# bases) in NCBI GenBank
Plant Genomics: Different from human genomics!
Human genomics | Plant genomics
one genome | many genomes
genomics as a diagnostic tool | genomics as a tool to direct breeding
“simple” genetics | complex genetics
high quality, shared resources | variable quality, fragmented resources
understand and cure diseases | improve crop yield and quality
Crash Course Genomics: Plant genome sizes
Crop | Genome size | Ploidy
Onion | ~16 Gb | diploid
Wheat | ~5.5 Gb | hexaploid
Pepper | ~3.5 Gb | diploid
Potato | ~850 Mb | tetraploid
Melon | ~450 Mb | diploid
Mb = megabases
Gb = gigabases
diploid = two sets of chromosomes
tetraploid = four sets of chromosomes
Crash Course Genomics: DNA, chromosomes, nucleotides
• DNA consists of four different elements (nucleotides or bases)
  • We represent DNA as strings of characters from the alphabet A C G T
• DNA is organized into chromosomes
• Each chromosome contains millions of nucleotides (characters)
• We can only ‘read’ short pieces of DNA (hundreds to thousands of nucleotides)
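The string representation described above can be illustrated with a minimal sketch. The 24-nucleotide "chromosome", the read length and the step size are made up for illustration; real chromosomes hold millions of characters and real reads are hundreds to thousands of nucleotides long.

```python
# Toy illustration: a chromosome as a string over the alphabet A, C, G, T,
# and short "reads" as substrings of it. All sizes here are hypothetical.
DNA_ALPHABET = set("ACGT")


def is_valid_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T."""
    return bool(sequence) and set(sequence) <= DNA_ALPHABET


def extract_reads(chromosome, read_length, step):
    """Return (start, read) pairs taken every `step` positions (0-based)."""
    last_start = len(chromosome) - read_length
    return [
        (start, chromosome[start:start + read_length])
        for start in range(0, last_start + 1, step)
    ]


chromosome = "GATTTGGGGTTCAAAGCAGTATCG"  # hypothetical 24-nucleotide chromosome
reads = extract_reads(chromosome, read_length=8, step=4)

for start, read in reads:
    print(start, read)
```

Each read covers only a small window of the chromosome, which is why the reads later have to be placed back onto a reference genome.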
Crash Course Genomics: Genes and traits
drought tolerance fruit size
genome
genes
• Genes are (some of) the functional elements in the genome
  • We represent genes as an interval on a chromosome
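A gene as an interval on a chromosome could be modelled roughly as follows. This is a sketch only: the gene names, chromosome names and coordinates are hypothetical, and the 1-based inclusive coordinate convention is an assumption, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chromosome: str
    start: int  # first position of the gene (assumed 1-based, inclusive)
    end: int    # last position of the gene (inclusive)

    def contains(self, chromosome, position):
        """Return True if the given position falls inside this gene."""
        return chromosome == self.chromosome and self.start <= position <= self.end


# Hypothetical genes, loosely named after the traits on the slide.
genes = [
    Gene("fruit_size_gene", "chr1", 1_000, 2_500),
    Gene("drought_tolerance_gene", "chr2", 400, 900),
]


def genes_at(chromosome, position):
    """Return the names of all genes overlapping one genomic position."""
    return [gene.name for gene in genes if gene.contains(chromosome, position)]


print(genes_at("chr1", 1_200))
```

With this representation, asking whether a sequence variant lies inside a gene becomes a simple interval lookup.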
Crash Course Genomics: Polymorphisms and their impact
large fruits
small fruits
no impact
Crash Course Genomics: Read alignment and variant calling
• The ‘reference genome’ represents the known sequence of a species
• Reads are aligned against the reference genome (string similarity search)
  • Complex: sequence variation, repetitive regions and data errors
• Variants are called from differences observed in ‘pile-ups’ of reads
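The two steps above can be sketched on toy data. This is a deliberately naive illustration, not how production aligners or variant callers work: alignment here is an exhaustive ungapped search minimising mismatches, and a variant is reported when the most common base in the pile-up differs from the reference. The reference, the reads and the `min_depth` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def align(read, reference):
    """Return (offset, mismatches) for the best ungapped placement of a read."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def call_variants(reads, reference, min_depth=2):
    """Pile up aligned reads and report positions where they disagree with the reference."""
    pileup = defaultdict(Counter)  # position -> counts of observed bases
    for read in reads:
        offset, _ = align(read, reference)
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1

    variants = []
    for position in sorted(pileup):
        base, support = pileup[position].most_common(1)[0]
        if base != reference[position] and support >= min_depth:
            variants.append((position, reference[position], base))
    return variants


reference = "GATTTGGGGTTCAAAGCAGTATCG"  # hypothetical reference sequence
# Three reads carry an A instead of the reference T at 0-based position 10.
reads = ["TTGGGGTACA", "GGGTACAAAG", "GTACAAAGCA", "GATTTGGGGT"]

variants = call_variants(reads, reference)
print(variants)
```

The complications named on the slide (repetitive regions, sequencing errors) are exactly what this toy ignores: a repeat would give several equally good offsets, and a single erroneous read is only filtered here by the crude depth threshold.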
Crash Course Genomics: Population-scale genome sequencing
genome (5 gigabases), genes
reads (individual 1): 1 billion reads
reads (individual 2): 1 billion reads
reads (individual n): 1 billion reads
• Align a billion reads x 1,000 individuals to a 5 Gb genome
• Call hundreds of millions (up to potentially billions) of sequence variants
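A back-of-the-envelope calculation makes the scale of these numbers concrete. The reads per individual, number of individuals and genome size come from the slide; the read length of 150 nucleotides is an assumption chosen for illustration only.

```python
# Figures from the slide.
reads_per_individual = 1_000_000_000   # 1 billion reads
individuals = 1_000
genome_size_bases = 5_000_000_000      # 5 Gb genome

# ASSUMPTION: the slides give no read length; 150 nucleotides is illustrative.
assumed_read_length = 150

total_reads = reads_per_individual * individuals
bases_per_individual = reads_per_individual * assumed_read_length
coverage_per_individual = bases_per_individual / genome_size_bases
total_bases = total_reads * assumed_read_length

print(f"reads to align in total: {total_reads:.1e}")
print(f"coverage per individual: {coverage_per_individual:.0f}x")
print(f"bases to process in total: {total_bases:.1e}")
```

Under that assumption, every position of the genome is covered about 30 times per individual, and the full experiment amounts to a trillion read alignments.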
Crash Course Genomics: Recap
• Genome sequences are represented as strings of A C G T
• Variation between genome sequences underlies differences in traits
• We can “read” the genome in little pieces
  • High throughput, massively parallel sequencing technologies
  • Up to thousands of individual plants from a given species
• Genomics data analysis is challenging
  • Rapid increase in data generation
  • Rapid turnover of sequencing technologies and their outputs
  • Scientific software is (often) bad (Nature News, Oct 13 2010)
Genomics data analysis: High Performance Computing
• Computational challenges
  • Align billions of reads to the reference genome
  • Call millions or billions of sequence variants
  • Determine the small number of variants that impact a given trait
• HPC infrastructure (e.g. SGE clusters) is the de facto standard
  • Manually split large datasets (to accommodate the job scheduler)
  • Manually deal with failures: check logs, resubmit jobs…
• Many software tools are in fact large, monolithic “pipelines”
  • No fine-grained control over resource usage
  • A single error often implies a complete re-run of the analysis
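The manual bookkeeping described above (split the data, watch for failures, resubmit) can be sketched in a few lines. Everything here is hypothetical: the chunk size, the record names and the simulated failure stand in for real scheduler jobs, and no actual job scheduler API is used.

```python
def split_into_chunks(records, chunk_size):
    """Manually split a dataset into fixed-size chunks, one per scheduler job."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_with_resubmission(chunks, job, max_attempts=3):
    """Run `job` on every chunk and re-run only the chunks that failed."""
    results = {}
    pending = list(range(len(chunks)))
    for _attempt in range(max_attempts):
        failed = []
        for index in pending:
            try:
                results[index] = job(chunks[index])
            except RuntimeError:
                failed.append(index)  # on a real cluster: check logs, resubmit
        pending = failed
        if not pending:
            break
    if pending:
        raise RuntimeError(f"chunks still failing after retries: {pending}")
    return [results[index] for index in range(len(chunks))]


already_failed = set()


def flaky_count(chunk):
    """Hypothetical job that fails once on one chunk, then succeeds."""
    key = tuple(chunk)
    if key == ("r3", "r4") and key not in already_failed:
        already_failed.add(key)
        raise RuntimeError("simulated node failure")
    return len(chunk)


chunks = split_into_chunks(["r1", "r2", "r3", "r4", "r5"], chunk_size=2)
counts = run_with_resubmission(chunks, flaky_count)
print(counts)
```

Partitioning and re-execution of failed tasks are exactly the parts that a framework like Spark takes over from the user, which is the motivation for the comparison that follows.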
High Performance Computing: Big Data technologies
Conventional Compute Cluster
[Diagram: a submit host, a compute cluster of compute nodes, and a separate storage server]
Expensive, proprietary storage
Expensive network connections
Expensive, high-reliability hardware
Spark Cluster
[Diagram: an admin / name node and a Spark cluster of combined compute + storage nodes]
Commodity hardware
Linear scalability
Fault tolerance
Genomics application landscape: Opportunities for Big Data technologies
?
Conventional Compute Cluster
Compute tool
High Memory
Hardware Accelerated (GPU / FPGA)
Spark (and Hadoop)
Big Data in Genomics: Challenges
DNA sequencer → reads (FASTQ) → alignments (SAM, BAM) → variants (VCF)
• Genomics is a dynamic, rapidly changing field
  • Data generators and analysis algorithms are in constant flux
  • Tools are generally built around flat, text-based file formats
  • Workflow is file-centric (POSIX file system; no streaming…)
Big Data in Genomics: Our ‘Sparkified’ pipeline
[Pipeline diagram]
Stages: read alignment → processing → variant calling
Tools: Scala, BWA, GATK4, Mark Duplicates, ADAM, Guacamole
Data: FASTQ (HDFS) → SAM (HDFS) → BAM (HDFS) → VCF (local)
Genomics on Spark: Solutions to legacy designs
DNA sequencer → reads (FASTQ)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
FASTQ record lines, in order: identifier, sequence, separator, quality
RDD[String]
f1.zip(f2).map(e => e._1 + e._2)
‘interleaved’ reads
class FastQRecordReader extends RecordReader
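The idea behind the zip expression above can be mimicked without Spark. The sketch below is a plain-Python stand-in, not the authors' Scala code: it groups FASTQ text into four-line records (the role a record reader would play; the actual `FastQRecordReader` implementation is not shown in the source) and then concatenates the forward and reverse record of each pair to obtain ‘interleaved’ reads. Read names and sequences are made up.

```python
def read_records(fastq_text):
    """Split FASTQ text into records of four lines each, kept as single strings."""
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        identifier, sequence, separator, quality = lines[i:i + 4]
        if not identifier.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {identifier}")
        records.append("\n".join((identifier, sequence, separator, quality)) + "\n")
    return records


def interleave(forward_records, reverse_records):
    """Plain-Python analogue of f1.zip(f2).map(e => e._1 + e._2)."""
    if len(forward_records) != len(reverse_records):
        raise ValueError("paired files must hold the same number of records")
    return [fwd + rev for fwd, rev in zip(forward_records, reverse_records)]


# Hypothetical paired-end input: two files with two reads each.
f1_text = "@pair1/1\nGATTACA\n+\nIIIIIII\n@pair2/1\nCCGGTTA\n+\nIIIIIII\n"
f2_text = "@pair1/2\nTGTAATC\n+\nIIIIIII\n@pair2/2\nTAACCGG\n+\nIIIIIII\n"

interleaved = interleave(read_records(f1_text), read_records(f2_text))
print("".join(interleaved), end="")
```

Treating a whole four-line record as one element is the design point: a line-oriented split of the file could otherwise separate a read's sequence from its quality string.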
Genomics on Spark: Read alignment
[Diagram: FASTQ files (HDFS) → Scala → FASTQ RDD (partitions 1 … n) → BWA per partition, in memory → SAM RDD (partitions 1 … n) → Scala → HDFS]
class FastQRecordReader extends RecordReader
perl wrapper for BWA called by rdd.pipe()
sort output and add header
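The piping pattern on this slide can be imitated in plain Python to show the data flow. This is a sketch under several assumptions: it does not use Spark or BWA, the external "aligner" is replaced by a child Python process that merely upper-cases its input, and the header line is a placeholder rather than a real SAM header. What it does share with the pipe approach is the mechanism: each partition's elements are written line by line to an external process, and that process's output lines become the new partition.

```python
import subprocess
import sys


def pipe_partition(partition, command):
    """Write one element per line to the command's stdin; return its stdout lines."""
    completed = subprocess.run(
        command,
        input="".join(element + "\n" for element in partition),
        capture_output=True,
        text=True,
        check=True,  # raise if the external tool exits with an error
    )
    return completed.stdout.splitlines()


# Stand-in for the wrapped aligner: a child process that upper-cases each line.
STAND_IN_ALIGNER = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

partitions = [["gatt", "acca"], ["ttga"]]  # hypothetical partitions of reads
piped = [pipe_partition(partition, STAND_IN_ALIGNER) for partition in partitions]

# Final step named on the slide: sort the combined output and add a header.
header = "#placeholder-header"  # hypothetical, not a real SAM header
merged = [header] + sorted(line for part in piped for line in part)
print(merged)
```

Because every partition is handled by its own process, the external tool never needs to know that the input was split, which is what makes wrapping an existing aligner feasible.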
Genomics on Spark: Processing and variant calling
Broad Institute, Cambridge
Broad Institute, Cambridge
AMPlab, University of California
• Alignment processing
• Variant calling (in development)
• Data schemas + APIs
• Variant calling (Guacamole)
• Variant analysis
• You’ve all attended the Keynote talk…
Genomics on Spark: Variant selection and analysis
[Diagram labels: VCF, Scala, parquet, HDFS, local, RDD, tempTable, SQL selection, Scala, VCF, local, memory]
Variant selection
phenotypes, GWAS
• Interactive, “real-time” selection of variant data with simple SQL queries
• GWAS analysis on Spark (e.g. hail) or conventional infra (e.g. PLINK)
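The kind of SQL selection meant in the first bullet can be illustrated with a small stand-in. Python's built-in `sqlite3` module is swapped in here for Spark SQL, so this shows the shape of the query only, not the authors' setup; the table layout, column names and the three variants are hypothetical.

```python
import sqlite3

# Hypothetical variant records: chromosome, position, reference allele,
# alternative allele, and allele frequency in the sequenced population.
variants = [
    ("chr1", 1200, "T", "A", 0.42),
    ("chr1", 5000, "G", "C", 0.03),
    ("chr2", 650, "C", "T", 0.25),
]

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE variants "
    "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, allele_freq REAL)"
)
connection.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", variants)

# Select reasonably common variants inside a region of interest,
# for example the interval of a candidate gene.
selected = connection.execute(
    "SELECT chrom, pos, ref, alt FROM variants "
    "WHERE chrom = ? AND pos BETWEEN ? AND ? AND allele_freq >= ? "
    "ORDER BY pos",
    ("chr1", 1000, 2500, 0.05),
).fetchall()

print(selected)
connection.close()
```

Once variants are queryable as a table, narrowing hundreds of millions of calls down to a candidate region becomes a one-line filter instead of a custom file-parsing script.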
Big Data in Genomics: KeyGene’s ambition
Wrap-up: Conclusions and lessons learnt
• Initial success in applying Spark to plant genomics
  • Proof-of-concept for enabling interactive GWAS analysis on Spark
• Spark appears to be a good fit for (some of our current) Genomics problems
  • Developer community needed to translate core Genomics applications!
  • Paradigm shift required to move away from flat POSIX files…
  • Opportunities for streaming data analysis
The End: High-throughput Genomics at Your Fingertips with Apache Spark
• Erwin Datema [email protected]
• Roeland van Ham [email protected]