Accurate, Scalable and Easy to Use Genomic Data Analysis
Gianfranco de Feo, Ph.D.
www.binatechnologies.com
The Opportunity and Challenge
Clinical Research Transformation:
• Ability to sequence entire genomes at accessible costs and timeframes
• Robustness of sequencing technologies: reagents and instruments
• Deluge of data linking clinical phenotypes to genomic aberrations
The Challenge:
• Huge amounts of data (both in terms of sequence and in giga-/petabytes)
• Large investments in bioinformatics/IT/engineering to handle data
• Analytical workflows are immature and prone to errors
• Large efforts leading to public-domain tools that are ‘best-in-class’ but difficult to use
• Software tools and algorithms are being constantly updated
• Much more work required to link genomic aberrations to clinical actionability
Statistics Data Analytics Bioinformatics
Genomics
Big Data Technologies Compute and Data Science
NGS Workflow: Bina Analytics solutions
Sequencing: Raw Reads
2° Analysis: Variant Calling (SVs and SNVs), RNAseq analysis
3° Analysis: Annotation (public and private DBs)
Interpretation: Scientific and medical interpretation
Integrated Workflows for: Whole Genome, Whole Exome, RNAseq, and targeted panels
ACCURACY SCALABILITY SIMPLICITY.
ACCURACY TopHat Cufflinks
BWA Pindel Picard BreakDancer
GATK BreakSeq CNVNator Samtools Bowtie
Alignment Engine
‘Under the hood’: DNA pipeline
in: FASTQ
out: Sorted BAM
out: VCF
out: Custom, VCF
In-Memory Sorter
Single/ Multi-Node Bina Platform
Parallelized GATK Pipeline (Best Practices)
Structural and Copy Number Variants
Parallelized BWA 0.7 Bina Aligner
GATK 1.x, 2.x
1. RealignerTargetCreator
2. IndelRealigner
3. BaseRecalibrator
4. PrintReads
5. UnifiedGenotyper
6. VQSR
SV tools: BreakDancer, CNVNator, BreakSeq, Pindel, SVMerge
in: Sorted BAM
in: Re-calibrated BAM
out: Realigned BAM
out: Recalibrated BAM
Tools Formats
Genome-Aware Load-Balancing
In-Memory Bina Proprietary Sorting, Duplicate Marking, and QC Calculation
Whole Genome, Whole Exome and Targeted Panel Datasets
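The slide names "Genome-Aware Load-Balancing" without describing how it works. As a rough illustration only, one common way to spread per-chromosome work across nodes is a greedy longest-first assignment. The sketch below is an assumption, not Bina's implementation: the function name `balance`, the node count, and the approximate chromosome sizes (in megabases) are all hypothetical.

```python
import heapq

# Approximate chromosome lengths in megabases (illustrative values only).
CHROM_MB = {
    "chr1": 249, "chr2": 243, "chr3": 198, "chr4": 191,
    "chr5": 181, "chr6": 171, "chr7": 159, "chr8": 146,
}


def balance(chrom_sizes, n_nodes):
    """Greedy longest-first assignment of chromosomes to compute nodes.

    Hypothetical helper: always hands the next-largest chromosome to the
    node that currently has the smallest total load.
    """
    # Heap entries are (current_load, node_id, assigned_chromosomes).
    heap = [(0, node, []) for node in range(n_nodes)]
    heapq.heapify(heap)
    for chrom, size in sorted(chrom_sizes.items(), key=lambda kv: -kv[1]):
        load, node, assigned = heapq.heappop(heap)
        assigned.append(chrom)
        heapq.heappush(heap, (load + size, node, assigned))
    return {node: (load, assigned) for load, node, assigned in heap}


plan = balance(CHROM_MB, n_nodes=4)
for node, (load, chroms) in sorted(plan.items()):
    print(node, load, chroms)
```

A real scheduler would also have to account for coverage depth and region complexity, which this sketch ignores.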
‘Under the hood’: RNA pipeline
Single/Multi-Node Bina Platform
Input format: FASTQ
Alignment
  Tool set: TopHat2, SpliceMap
  Formats: Sorted BAM, BED
Assembly
  Tool set: Cufflinks, LSC (long read error correction)
  Formats: .FPKM_tracking, GTF
Expression
  Tool set: Cuffmerge/Cuffdiff, IDP (Isoform Detection and Prediction)
  Formats: GTF, DIFF, .*_tracking
Per-Sample Load-Balancing
In-Memory Sorting, QC Calculation
www.binatechnologies.com Bina Technologies Confidential and Proprietary
Accuracy Validation
[Figure: validation strategy for the Bina Genome Analysis Platform. Computational and experimental validation of SNP, indel, and SV/CNV calls using trios (M, F, C), replicates (R1, R2), a synthetic diploid genome (simulation), gold call sets (NIST, 1KG, CG, EXP/ASM), and QC statistics (# SNPs, Ti/Tv, Het/Hom).]
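The Ti/Tv and Het/Hom ratios used as QC statistics in this deck are simple counts over a call set. A minimal sketch of how they are usually computed follows; the function names and the input representation (SNVs as `(ref, alt)` pairs, genotypes as pairs of allele indices with 0 meaning reference) are assumptions made for illustration, not the platform's API.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv(snvs):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def het_hom(genotypes):
    """Ratio of heterozygous to homozygous non-reference genotypes.

    Each genotype is a pair of allele indices; 0 is the reference allele.
    """
    het = sum(1 for a, b in genotypes if a != b)
    hom_alt = sum(1 for a, b in genotypes if a == b and a != 0)
    return het / hom_alt if hom_alt else float("inf")


# Toy inputs, made up for the example.
example_snvs = [("A", "G"), ("C", "T"), ("A", "C")]
example_gts = [(0, 1), (1, 1), (0, 1), (0, 0)]
print(ti_tv(example_snvs), het_hom(example_gts))
```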
Validation by Gold Set
• Highly confident variants for NA12878
• Currently integrates 13 datasets from 5 platforms, with sophisticated filtering
• Includes SNPs and indels with genotypes
• 99.69% of SNPs and 89.57% of indels are known polymorphic variants
SNPs: 2.8M
Indels: 360K
SVs: Coming soon
http://genomeinabottle.org/
Benchmarking: Alignment effect on indel calling accuracy
              Bina Aligner / GATK   Shared    Accelerated BWA / GATK
              (unique calls)        calls     (unique calls)
Insertions    65,555                267,144   18,996
Deletions     38,165                274,874   21,094
Known         70.7%                 85.2%     59.7%
Het / Hom     1.4                   1.26      1.6
Overlap (shared indel calls vs. total indel calls): 79.9%
11% more known indels
Benchmarking: Alignment and SNV Accuracy
• Single nucleotide variants
              Bina Aligner / GATK   Shared SNV   Accelerated BWA / GATK
              (unique calls)        calls        (unique calls)
All SNVs      98,870                3,097,329    84,481
Known         95.5%                 99.2%        94.1%
Ti / Tv       1.8                   2.13         2.09
Het / Hom     2.3                   1.45         3.4
Overlap (shared SNV calls vs. total SNV calls): 94.4%
More accurate SNV calls
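The SNV overlap figure on this slide can be checked from the three counts shown, assuming "total SNV calls" means the union of both pipelines' call sets (shared calls plus each pipeline's unique calls). That reading is an assumption, but it reproduces the slide's number:

```python
# Counts taken from the SNV benchmarking slide.
shared_snvs = 3_097_329
bina_unique = 98_870
bwa_unique = 84_481

# Assumed definition: overlap = shared calls / union of both call sets.
total_snvs = shared_snvs + bina_unique + bwa_unique
overlap = shared_snvs / total_snvs

print(f"{overlap:.1%}")  # prints 94.4%
```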
Alignment Benchmarking
Aligner      Reads/s   Mapping    Uniq. mapping   Uniq. mismatch   Uniq. gap
                       rate (%)   rate (%)        rate (%)         rate (%)
Bina         69K       94.6       88.1            0.56             0.03
BWA accel.   35K       92.4       86.8            0.28             0.013
BWA mem      50K       95.7       88.5            0.88             0.022
Novoalign    9K        86.6       86.6            0.34             0.018
Isaac        72K       88.9       82.9            0.17             0.0108
Our overall performance
BWA+GATK2
Platform                      Illumina
Mode                          WGS
Type                          Paired End
Sample                        NA12878
Read Length                   2x100bp
Reads                         1.2G
Coverage                      37.8X
SNP
  % Known (dbSNP)             98.62%
  Ti/Tv                       2.1
  Het/Hom                     1.53
  Sensitivity (Gold Set)      98.55%
  GT Concordance (Gold Set)   99.98%
Indel
  % Known (dbSNP)             89.29%
  Sensitivity (Gold Set)      84.82%
  GT Concordance (Gold Set)   97.88%
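The deck does not define its two gold-set metrics. A minimal sketch under the usual definitions follows: sensitivity is the fraction of gold-set variants that the pipeline also called, and genotype (GT) concordance is the fraction of those shared sites where the called genotype matches the gold genotype. The function name, the site-key layout `(chrom, pos, ref, alt)`, and the toy data are assumptions for illustration.

```python
def sensitivity_and_concordance(calls, gold):
    """Compare a call set against a gold set.

    Both arguments map a site key (chrom, pos, ref, alt) to a genotype
    string such as "0/1". Genotypes are assumed to be normalized already.
    """
    shared = calls.keys() & gold.keys()
    sensitivity = len(shared) / len(gold) if gold else 0.0
    concordant = sum(1 for site in shared if calls[site] == gold[site])
    concordance = concordant / len(shared) if shared else 0.0
    return sensitivity, concordance


# Toy data, made up for the example.
gold_set = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "1/1",
    ("chr2", 50, "G", "A"): "0/1",
    ("chr2", 75, "T", "C"): "0/1",
}
call_set = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "0/1",   # genotype disagrees with gold set
    ("chr2", 50, "G", "A"): "0/1",
    ("chr3", 10, "A", "T"): "1/1",    # not in the gold set
}
sens, conc = sensitivity_and_concordance(call_set, gold_set)
print(f"sensitivity={sens:.2%} GT concordance={conc:.2%}")
```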
SNP Accuracy
Pipeline              SNPs (HC),   SNPs (LC),   Run time   Sensitivity   GT concordance
                      millions     millions     (hr)
BWA+GATK 1.6          3.65         0.30         5.7        98.47%        99.98%
BWA+GATK 2.3-9 Lite   3.66         0.43         7.1        98.55%        99.98%
BWA+GATK 2.5-2        3.67         0.24         6.2        98.56%        99.98%
BWA+GATK 2.6-5        3.70         0.21         6.2        98.57%        99.98%
BWA+GATK 2.7-2        3.75         0.21         6.1        98.60%        99.98%
Indel Accuracy
Pipeline              Indels (HC),   Indels (LC),   Sensitivity   GT concordance
                      thousands      thousands
BWA+GATK 1.6          590.0          4.6            84.88%        98.01%
BWA+GATK 2.3-9 Lite   589.4          5.5            84.82%        97.88%
BWA+GATK 2.5-2        590.7          4.0            84.92%        97.99%
BWA+GATK 2.6-5        590.6          4.1            84.87%        97.99%
BWA+GATK 2.7-2        590.2          4.4            84.91%        97.97%
HaplotypeCaller Excels in Indels
Sensitivity (%)     SNPs    Indels
HaplotypeCaller     99.62   96.35
UnifiedGenotyper    98.6    84.91
SV Callers Analysis
Unpublished results
TopHat2 is More Reliable
Reads (millions)          TopHat1   TopHat2
Total number of reads     44        44
Aligned reads             33        29
Alignments                56        62

Junctions (thousands)     TopHat1   TopHat2
Junction calls            165       184
Novel                     43        42

                          TopHat1   TopHat2
Validation rate           19.1%     25.9%

                                TopHat1   TopHat2   RefSeq
Total transcripts (thousands)   84        52        44
Integrated, Easy-to-Use RNASeq Workflows
Long Reads (Raw) Short Reads
Long Read Correction (LSC)
Long Reads (Corrected)
Alignment
Au et al. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS ONE.
Au et al. Characterization of the human ESC transcriptome by hybrid sequencing. PNAS.
Isoform Detection & Prediction (IDP)
Science innovation
• Validation
  • SV gold set
• Accuracy
  • Filtering, feature enhancements, replacing GATK tools in some areas (SNP variant calling)
  • Improving SV tools
  • Incorporate more tools (BreakSeq, Pindel)
• Cancer pipeline:
  • Tool selection
  • Workflow definition (many use cases)
  • Annotation
• Contribute back to open source genomic tools
  • Bug fixes
  • Validation tools, data quality framework
  • Functionality
  • New tools
SCALABILITY Flexible Deployment
High Performance Computing Best Practices
Bina Box Bina Lite Bina Cloud
Bina Box Genome Analysis Pipeline Performance
                        Whole Genome   Whole Exome   RNA
                        Sequencing     Sequencing    Sequencing
Turnaround Time         ~4h            ~45m          2.5 h
1 Bina Box Throughput   6/day          ~50/day       ~38/day (152/day)
Data:
• WGS: Three lanes of paired-end HiSeq data from the NA12878 cell line (37X)
• WES: NA12878 Whole Exome Dataset, 100X
• RNA: Human Body Map 2.0, Skeletal Muscle, 82M reads, 75bp
Pipeline:
• WGS: Parallelized BWA, GATK + VQSR
• WES: Parallelized BWA, GATK
• RNA: TopHat 2.0, Bowtie2 2.1.0, Cufflinks 2.1.1
Hardware Configuration, WGS & WES: 4-node, 64-core Bina Box appliance
Hardware Configuration, RNA: 1-node, 64-core Bina Box appliance
Integrated Management for Scalable Backend
SIMPLICITY
Best Practices Workflows
Hardware/Software
On-premises Integration
Integration
SIMPLICITY Intuitive User Interface
Monitoring & Management
Quality Control
Visualization
Intuitive QC Metrics Summaries
Filtering the annotated VCF
Miley Trio
NA 12878
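Filtering an annotated VCF is done through the user interface on this slide, and the deck does not describe the mechanics. As a rough, tool-independent illustration, the sketch below filters VCF data lines in plain Python using the standard fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function names, the thresholds, and the sample records are hypothetical.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DP=35;DB' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value if value else True  # flags have no value
    return fields


def filter_vcf(lines, min_qual=30.0, require_pass=True):
    """Yield (chrom, pos, ref, alt, info) for records passing the filters."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual, flt, info = cols[:8]
        if require_pass and flt != "PASS":
            continue
        if qual == "." or float(qual) < min_qual:
            continue
        yield chrom, int(pos), ref, alt, parse_info(info)


# Made-up records for the example.
SAMPLE = [
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "1000", ".", "A", "G", "99.5", "PASS", "DP=35;DB"]),
    "\t".join(["chr1", "2000", ".", "C", "T", "12.0", "PASS", "DP=8"]),
    "\t".join(["chr2", "3000", ".", "G", "A", "80.0", "LowQual", "DP=40"]),
]
kept = list(filter_vcf(SAMPLE))
print(kept)
```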
Annotation Platform – high level architecture
External Genomic Data Sources
Small Variations
dbSNP
Structural Variations
dbVar, DGVa
Genome
RefSeq, Ensembl
Impact prediction
SnpEff, SIFT, PolyPhen-2
Genotype – Phenotype
ClinVar
VCF
Bina Box Data
Versioning
Standardization
Compression
User-Defined Data Sources
Query Engine
Serving Layer
Batch Layer
• Compact Representation of Knowledgebase
• Smart Integration of User Data with Knowledgebase
• Unbounded Data Storage
• Real-time Filtering
• Cached Queries
• Annotation
• Custom Queries
• Saved filters
Visualization
Academic Core Labs: Bina-on-Demand
• High capacity compute available, when needed, with no overhead costs
• Similar to a shared network printer: pay only for actual use
In Dr. Snyder’s group, secondary analysis time was reduced from 10 days per whole genome (using a 1200-core shared cluster) to 6h on one Bina Box. A second Bina-on-Demand platform accelerates NGS research across the Stanford Campus.
Stanford Center for Genomics and Personalized Medicine
Clinical/Translational Centers: Bina Subscription
• On-premises solution for maximum privacy and security
• Fastest time to results – and therefore fastest time to decision and reporting
• Roadmap to clinical-grade software (HIPAA, robustness, training, and support)
Elizabeth Worthey’s team focuses on high-volume clinical sequencing of distressed newborns in a neonatal intensive care unit (NICU). Bina reduces bioinformatics analysis time from 17h to 3.5h while cutting costs by half. Plans to sequence all newborns at MCW by 2015.
Bina Custom: Large scale Translational Research
• High-throughput platform for large cohort NGS studies (WGS, WES, RNA-Seq)
• Capable of processing at least 100 WGS/month or more than 600 WES/month
• Additional Bina Boxes increase throughput; straightforward subscription model
• Secondary analysis today; data management, aggregation, and storage in Q4
• 400 WGS samples related to cardiovascular disease
• Alignment, variant calling and SV / CNV results
• Joint sample variant calling; Aggregation of results
• Dramatically reduced analysis time from years to months
• Results being prepared for publication in a leading journal
U.S. Department of Veterans Affairs
Next Steps
Want to know more?
Try it with your own datasets on the cloud (no cost): www.binatechnologies.com/
Contact your Bina representative: Take Ogawa, Director of Sales, [email protected]
Contact me! Gianfranco de Feo, VP Marketing, [email protected]
ACCURACY SCALABILITY SIMPLICITY
Low to very high throughput solutions
Full incorporation of best-in-class tools (benchmarking)
RNAseq Whole Genome Whole Exome
Easy-to-use interfaces