Marc Sultan, September 24th, 2015
Biomarker Development, Translational Medicine, Novartis
On behalf of the BMD WGS pilot team:
Robert Bruccoleri, Stine Buechmann-Moller, Nicole Cheung, Anita Fernandez, Nicole Hartmann, Yunsheng
He, Xiaoyu Jiang, Li Lei, Bolan Linghu, Thomas Morgan, Nirmala Nanguneri, Thomas Schlitt, Kevin Sloan, Jill
Somers, Marc Sultan, Frank Staedtler, Joseph Szustakowski, Marie Waldvogel, Daniela Wieser, Fan Yang,
Xiaojun Zhao
Whole genome sequencing in drug discovery research: a one fits all solution?
Outline
| MipTec | Marc Sultan | September 24th 2015 | WGS Pilot | Business Use Only 2
o Introduction
o WGS Pilot study results
o Summary & Challenges
Introduction
| MipTec | Marc Sultan | September, 10th 2015 | WGS Pilot | Business Use Only
Why Whole Genome Sequencing?
Illustration by Pete Ellis/www.drawgood.com
WGS Principle
Finding The Differences
On market since January 2014
Designed & marketed for population-scale sequencing projects
HiSeq X Ten: 160 genomes per system in 3 days
Provided by Illumina
Patterned Flowcell
Nanowell substrate: billions of ordered wells
• Defined feature size
• Optimal cluster spacing
Exclusion amplification
• Delivers single template per well
• Simultaneous seeding and amplification
WGS, WES, OmniExome chip
WES ($1000)
WGS ($1500-2500)
Omni chip ($500)
[Figure: schematic of a gene with exon1, exon2 and exon3, comparing the positions assayed by WES, WGS and the Omni chip]
WGS pilot goals: assessing the utility of WGS
Can WGS replace other profiling platforms?
Can small and large structural variants be accurately called?
Can variants be accurately called in HLA and ADME genes?
WGS Pilot
Study | Questions | N
NIST “GiB” | Basic proficiency; comparison to highly curated variant calls | 3
Clinical Study | Comparison to candidate genotyping; comparison to SNP chips (Omni Bead Chip); comparison to exome sequencing; can we detect large structural variants in a particular gene? | 77
Academic collaboration | Confirmation of a particular deletion? | 2
Clinical Study | Comparison to ADME chip data (DMET plus chip) | 13
ADME reference | Evaluation of challenging ADME genes | 8
HLA reference panel | Evaluate the quality of HLA calls from X10 data? | 10
Questions to be answered
113 DNA samples from 6 sub-projects
30x genome-wide coverage with ~1 billion short DNA reads per sample
WGS Pilot Workflow
[Workflow diagram: Sample logistics, WGS, Variant calling, followed by analysis]
• Candidate genes/pathways analysis (2~3 weeks)
• Rare disease pedigree (1-2 months)
• Exploratory genome analysis (> 2 months)
Sites: @ Novartis; @ Broad Institute (12-16 weeks)
Statistics
Key statistics for raw data
# samples | 113
Raw data size from Broad | 23 TB
Paired-end read length | 151 bases
Ave. total reads per sample | 1.06 billion
% of reads after filtering | 60%
Ave. coverage per sample | 32x
Coverage on Agilent exome targeted region | 38x (99.5% > 10x, 82% > 30x)
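As a rough consistency check on these statistics, mean coverage can be estimated as reads × read length × fraction retained ÷ genome size. The sketch below is only illustrative: it assumes a haploid genome size of about 3.1 Gb (not stated on the slide), that "total reads" counts individual 151-base reads, and that the 60% filter applies before coverage is computed. The function and parameter names are made up for this example.

```python
# Back-of-the-envelope coverage estimate from the key statistics.
# GENOME_SIZE_BP is an assumption (approximate human haploid genome size);
# it is not given on the slide.
GENOME_SIZE_BP = 3.1e9


def estimate_mean_coverage(total_reads, read_length_bp, fraction_retained,
                           genome_size_bp=GENOME_SIZE_BP):
    """Return the expected mean depth: usable sequenced bases / genome size."""
    usable_bases = total_reads * read_length_bp * fraction_retained
    return usable_bases / genome_size_bp


coverage = estimate_mean_coverage(
    total_reads=1.06e9,      # ave. total reads per sample
    read_length_bp=151,      # paired-end read length
    fraction_retained=0.60,  # % of reads after filtering
)
print(f"Estimated mean coverage: {coverage:.0f}x")
```

Under these assumptions the estimate lands near 31x, in line with the reported average of 32x.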
WGS Pilot Results
NIST “Genome in a Bottle”
| IDD | Marc Sultan | June 10th 2015 | WGS Pilot | Business Use Only 15
99.9% concordance with GiB in high confidence regions
WGS data quality of Broad is higher than Macrogen
0.1% error rate in variant calls as estimated by Mendelian
inheritance errors
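One way such an error rate could be estimated is sketched below, assuming genotype calls are available for a parent-child trio at diploid autosomal sites. A call counts as a Mendelian inheritance error when the child's genotype cannot be assembled from one maternal and one paternal allele. The function names and example calls are hypothetical and not taken from the pilot pipeline.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one allele of each parent.

    Genotypes are unordered pairs of alleles, e.g. ("A", "G").
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


def mendelian_error_rate(trio_calls):
    """Fraction of sites where the child's call violates Mendelian inheritance."""
    errors = sum(
        not is_mendelian_consistent(child, mother, father)
        for child, mother, father in trio_calls
    )
    return errors / len(trio_calls)


# Hypothetical calls at four sites: (child, mother, father)
example_calls = [
    (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    (("A", "A"), ("A", "G"), ("A", "G")),  # consistent
    (("C", "T"), ("C", "C"), ("C", "C")),  # error: T absent from both parents
    (("G", "G"), ("A", "G"), ("G", "G")),  # consistent
]
print(mendelian_error_rate(example_calls))  # 1 error in 4 sites
```

Genuine de novo mutations would also be flagged by this check, but they are rare enough that the rate is dominated by calling errors.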
Comparison of WGS with OmniExome5 Chip
99.7% concordance for common genotypes
~97% of variants called by the Omni chip are found in WGS
[Venn diagram of variant calls: ~12,000,000 in WGS only, ~3,000,000 shared, ~96,000 in Omni chip only]
Comparison of WGS with Exome-seq (WES)
99.6% concordance for common genotypes
~94% of variants found in WES are found in WGS
[Venn diagram of variant calls: ~13,000,000 in WGS only, ~2,000,000 shared, ~140,000 in WES only]
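The "found in WGS" percentages on the two comparison slides can be related to the Venn diagram counts with a short calculation. This sketch assumes that, in each diagram, the smallest number is the count seen only on the other platform (~96,000 for the Omni chip, ~140,000 for WES) and the mid-sized number is the shared count (~3,000,000 and ~2,000,000). The function name is illustrative.

```python
def fraction_recovered(shared, other_only):
    """Fraction of the other platform's variants that are also called by WGS."""
    return shared / (shared + other_only)


# Approximate counts read from the two Venn diagrams (assumed assignment)
omni = fraction_recovered(shared=3_000_000, other_only=96_000)
wes = fraction_recovered(shared=2_000_000, other_only=140_000)
print(f"Omni chip variants found in WGS: {omni:.1%}")
print(f"WES variants found in WGS: {wes:.1%}")
```

With the rounded counts this gives about 96.9% and 93.5%, close to the ~97% and ~94% quoted on the slides.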
WGS versus DMET Chip
Drug Metabolizing Enzymes and Transporters
99.5% concordance for ~12,000 (959*13) common genotypes
WGS was in some cases more accurate or specific (probe design)
Main Advantage of DMET platform: software for star allele prediction and phenotype prediction (high complexity)
Main Advantage WGS: additional sites easily accessible
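A concordance figure of this kind can be computed by comparing genotype calls pairwise at every (sample, marker) typed on both platforms. The sketch below is a minimal illustration, not the pilot's actual method: the sample and marker identifiers are hypothetical apart from the probe ID AM_10799, and reading 959*13 as markers × samples is an assumption.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of (sample, marker) keys called on both platforms with matching genotypes.

    Genotypes are strings such as "A/G"; allele order is ignored.
    """
    shared_keys = calls_a.keys() & calls_b.keys()
    if not shared_keys:
        raise ValueError("no genotypes in common")
    matches = sum(
        sorted(calls_a[key].split("/")) == sorted(calls_b[key].split("/"))
        for key in shared_keys
    )
    return matches / len(shared_keys)


# Hypothetical calls keyed by (sample, marker)
dmet = {("S1", "AM_10799"): "A/A", ("S1", "M2"): "C/T", ("S2", "M2"): "T/T"}
wgs = {("S1", "AM_10799"): "A/G", ("S1", "M2"): "T/C", ("S2", "M2"): "T/T"}
print(genotype_concordance(dmet, wgs))  # 2 of 3 shared genotypes agree

print(959 * 13)  # assumed markers x samples: 12467 comparisons
```

Ignoring allele order avoids counting "C/T" versus "T/C" as a mismatch, which matters when the two platforms report alleles in different orders.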
Example Mismatch DMET and WGS
Probe AM_10799 designed to detect A or T allele
AM_10799 (CYP1A2): designed to detect *4 allele: A/T (I386F)
DMET.genotype | WGS.genotype
A/A | A/G (I386V)
No star allele definition for A/G genotype
HLA alleles called with WGS data
• WGS can be used to genotype classical HLA genes with accuracy >95% compared to results generated from conventional methods.
• Accuracy of typing can be influenced by the type of software used.
• Caveat: the sample size in this evaluation is small, considering the large number of polymorphisms in HLA genes.
HLA gene | OptiType: fraction of correct allele calls | OptiType: accuracy (%) | Omixon: fraction of correct allele calls | Omixon: accuracy (%) | Athlates: fraction of correct allele calls | Athlates: accuracy (%)
HLA-A | 20/20 | 100% | 19/20 | 95% | 20/20 | 100%
HLA-B | 20/20 | 100% | 20/20 | 100% | 19/20 | 95%
HLA-C | 20/20 | 100% | 18/20 | 90% | 19/20 | 95%
HLA-DQA1 | NA | NA | 20/20 | 100% | NA | NA
HLA-DQB1 | NA | NA | 20/20 | 100% | 19/20 | 95%
HLA-DPB1 | NA | NA | 20/20 | 100% | NA | NA
HLA-DRB1 | NA | NA | 20/20 | 100% | 19/20 | 90%
OptiType: open source; Omixon: currently have license; Athlates: evaluation license
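To relate the per-gene fractions to the ">95%" headline, the correct and total allele calls can be pooled across genes for each tool. Pooling is a choice made for this sketch, not something stated on the slide. NA entries are left out, and the counts are the fractions as printed in the table.

```python
# Correct/total allele calls per gene, as listed in the table (NA omitted)
correct_calls = {
    "OptiType": [(20, 20), (20, 20), (20, 20)],
    "Omixon": [(19, 20), (20, 20), (18, 20), (20, 20), (20, 20), (20, 20), (20, 20)],
    "Athlates": [(20, 20), (19, 20), (19, 20), (19, 20), (19, 20)],
}


def pooled_accuracy(pairs):
    """Pool correct and total allele calls across genes and return a percentage."""
    correct = sum(c for c, _ in pairs)
    total = sum(t for _, t in pairs)
    return 100 * correct / total


for tool, pairs in correct_calls.items():
    print(f"{tool}: {pooled_accuracy(pairs):.1f}%")
```

Pooled this way the three tools come out at 100.0%, 97.9% and 96.0%, each above 95%.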
Confirmation of known large structural variants
ARMS2 3’UTR deletion in clinical study
Deletion of CYP2D6 gene in 3 ADME reference samples
1364-bp deletion in KRT77 gene in collaboration study
240-bp tandem duplication in KIAA1109 gene in collaboration study
Duplication and deletion of CYP2D6
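• Samples 1-3 have 2 copies (depth: 86x, 56x, 47x), samples 4 and 5 have > 2 copies (depth: 67x, 82x), samples 6-8 have one copy (depth: 32x, 29x, 28x)
[Figure: read-depth tracks across CYP2D6 for the 8 samples, labelled by copy number (2X, >2X, 1X), with the CYP2D6 deletion marked]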
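The raw depths alone do not separate the groups (86x for a 2-copy sample against 67x for a >2-copy sample), so the calls presumably rest on depth relative to a per-sample baseline. A minimal sketch of that idea follows. The baseline depths are hypothetical, the tolerance is arbitrary, and the function names are illustrative rather than the pilot's method.

```python
def estimate_copy_number(gene_depth, baseline_depth, ploidy=2):
    """Scale the depth ratio by the expected ploidy to get a copy-number estimate."""
    return ploidy * gene_depth / baseline_depth


def classify(copy_number, tolerance=0.3):
    """Map an estimate onto the labels used on the slide (1X, 2X, >2X)."""
    if abs(copy_number - 1) <= tolerance:
        return "1X"
    if abs(copy_number - 2) <= tolerance:
        return "2X"
    if copy_number > 2 + tolerance:
        return ">2X"
    return "ambiguous"


# Gene depths from three pilot samples paired with HYPOTHETICAL baseline depths
examples = [(56, 55), (82, 54), (29, 57)]
for gene_depth, baseline_depth in examples:
    cn = estimate_copy_number(gene_depth, baseline_depth)
    print(f"depth {gene_depth}x vs baseline {baseline_depth}x -> {cn:.2f} ({classify(cn)})")
```

In practice the baseline might be the sample's genome-wide mean depth or the depth of flanking regions. Which one was used here is not stated.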
Summary & Challenges
Summary of results
Can WGS replace other profiling platforms?
• WES, Omni, targeted genotyping: yes
• DMET chips: WGS needs better software support
Can small and large structural variants be accurately called?
• Small variants: yes
• Large variants: challenging but promising
Can variants be accurately called in HLA and ADME genes?
• HLA: yes
• ADME: yes for small variants, challenging for large structural variants
Summary
WGS is technically successful
• High cross-platform concordance of WGS data (>99%)
• Discovers more variants than other technologies (coding/non-coding variants, structural variants)
• Analysis algorithms are rapidly improving
WGS is valuable for generating and testing new hypotheses in clinical studies
• Standardized experimental procedure that enables retrospective analyses (dictionary approach)
• The key to interpreting ‘‘big WGS’’ data is to filter and integrate across diverse sources
WGS opportunities in clinical studies
Familial genetic studies
High-priority or competitive programs requiring quick interrogation of genetic data in response to new discoveries
High-priority studies with a priori genetic hypotheses - candidate genes and pathways
Strategic disease indications where heritability is moderate-to-high (Asthma, COPD) as part of Pan-Omic strategy
Interpretation of results – too much data?
Interpreting numerous mutations in small samples is challenging
Pinpointing a small subset of disease causal variants is a non-trivial task.
The large number of variants per individual makes association tests impossible for «typical» sample sizes: limited scope for «hypothesis free» approaches
Huge data: an efficient strategy is required to store, organize, and query the data.
[Figure: > 4 million variants per patient, narrowed down to a small subset of candidate disease causal mutations]
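The difficulty with hypothesis-free association testing can be illustrated with a simple multiple-testing correction. The sketch below applies a Bonferroni correction, chosen here only for illustration. It assumes a family-wise error rate of 0.05 and one test per variant, and it takes the > 4 million variants per patient as a lower bound on the number of tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Assumptions: alpha = 0.05 and one test per variant, using the ~4 million
# variants per patient from the slide as a lower bound on the number of tests.
threshold = bonferroni_threshold(alpha=0.05, n_tests=4_000_000)
print(f"Per-variant p-value threshold: {threshold:.2e}")
```

A per-variant threshold near 1.25e-08 is hard to reach with the tens of samples typical of a clinical study, which is why candidate-gene and pathway filters come first.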
Challenges
Data volume: large amounts of data are generated (WGS pilot: 23TB raw data plus data generated during analysis)
Long term storage costs
File transfer times are considerable
Analysis is not yet standardized; best practices are rapidly changing
Data generation and analysis take longer
Ethics/legal concerns: incidental findings, consent, cloud based storage?
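The data-volume and transfer points can be put into numbers with a quick estimate. The per-sample figure follows directly from the pilot statistics. The transfer time depends on a network throughput that is purely an assumption here (1 Gbit/s sustained), and decimal units are used throughout.

```python
RAW_DATA_TB = 23      # raw data size from the pilot
N_SAMPLES = 113       # DNA samples in the pilot
LINK_GBIT_PER_S = 1   # ASSUMED sustained network throughput, not from the slides

per_sample_gb = RAW_DATA_TB * 1000 / N_SAMPLES  # decimal units: 1 TB = 1000 GB
transfer_seconds = RAW_DATA_TB * 1e12 * 8 / (LINK_GBIT_PER_S * 1e9)
transfer_hours = transfer_seconds / 3600

print(f"Raw data per sample: {per_sample_gb:.0f} GB")
print(f"Transfer time at {LINK_GBIT_PER_S} Gbit/s: {transfer_hours:.0f} hours")
```

That is roughly 204 GB of raw data per sample and about 51 hours to move the full pilot data set at the assumed rate, before any intermediate analysis files are counted.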
WGS pilot team
Robert Bruccoleri
Stine Buechmann-Moller
Nicole Cheung
Anita Fernandez
Nicole Hartmann
Yunsheng He
Xiaoyu Jiang
Li Lei
Bolan Linghu
Thomas Morgan
Nirmala Nanguneri
Thomas Schlitt
Kevin Sloan
Jill Somers
Marc Sultan
Frank Staedtler
Joseph Szustakowski
Marie Waldvogel
Daniela Wieser
Fan Yang
Xiaojun Zhao