The Final Frontier
Jean Jasinski, Ph.D., Field Application Scientist
Sept. 27, 2017
Data Analysis
For Research Use Only. Not for use in
diagnostic procedures. PR7000_09531
Final Frontier: Data Analysis
Agenda
• Introduction
• SureDesign (eArray)
• Microarray Data Analysis
• NGS Data Analysis
• Cartagenia (Alissa)
Standard Disclaimer
Except for GenetiSureDX and Cartagenia Bench, all other products are Research Use Only (RUO)
Precision Medicine Needs Precision Genomics
High resolution, accuracy and sensitivity
Key Technologies
▪ Next Generation Sequencing
▪ Microarrays
▪ Digital PCR
▪ qPCR
▪ Oligonucleotide FISH
Puzzled by Options?
SureDesign eArray
• Gene Expression microarrays
• miRNA microarrays
• RNA-Seq targeted capture
• Mutagenesis (QuikChangeHT)
SureDesign and eArray
Create and View Custom and Catalog Designs
SureDesign and eArray
• Web-based tools
• Same login for both tools
• Must use institutional email for account
• Create custom designs
• Customize catalog designs
• Download designs
• Order designs (trigger quote)
• Free to use
Microarray Data Analysis Tools
• Agilent CGH and CGH+SNP arrays (human and nonhuman)
• True two-color analysis
• Copy Number
• LOH and UPD (CGH+SNP)
• Suppress, classify, edit, annotate aberrations
• Report generation
• Free
CytogenomicsDX
• GenetisureDX array analysis
• FDA-cleared
• Free
• Gene Expression arrays
• miRNA arrays
• Exon and Exon Splicing Arrays
• Copy Number
• Clustering, GEO, GO, GSA
• Pathway Analysis
• Multiple vendor arrays
• License fee
MPP (Mass Profiler Pro)
• Metabolomics and proteomics from Mass Spec data
• License fee
OneSight
Seeing is Knowing
The OneSight cfDNA solution allows labs to study the (aneu)ploidy status of the DNA found in the cell-free fraction of a biopsy sample from low-pass whole genome sequencing data.
Key features:
• Visualization tools: detailed views of the (aneu)ploidy status of each chromosome
▪ All chromosomes
• Automation tools: define classification rules for marking loci for review
• Reference sets: define normal samples in the study population
• Excluded regions: remove recurrent technical noise and biologically irrelevant loci in the data
OneSight
Turnkey solution
• Compatible with any common NGS library prep kit
• Compatible with the most common NGS sequencing platforms
• Upload raw NGS data
• Select analysis pipeline and reference set
• Visually inspect chromosome plots
OneSight
Visual plots
[Figure: chromosome plot panels: Normal; Segmental aberration; Trisomy; Complex aberrations]
• Developed by Cartagenia, a part of Agilent Technologies, leveraging the company’s expertise with software solutions for genetics labs
• Proven SaaS approach and technology platform
• Workflow efficiency, traceability & versioning, and automation
• Setup fee and per sample analysis
SureCall Alignment to Mutation
• NGS data analysis tool for biologists
• Accepts fastq or bam files
• Generates vcf (4.2) and pdf or text mutation reports
• Human (hg19) DNA analysis only
• Free to Agilent Target Enrichment customers (HaloPlex, SureSelect, OneSeq)
• Runs on local computer
SureCall 4.0 New Features
Support SureSelect XTHS Data Analysis
• Adds Molecular Barcode (MBC) analysis for SureSelect XTHS
• Improves MBC analysis flexibility
• Index hopping control, including optical duplicate removal and an ‘estimated index hopping frequency’ parameter
• Additional QC metrics and plots for HS analysis
Introducing Translocation Detection
• New algorithm module
• New visualization
Overall Software Improvement
• Checks for an internet connection when a job is submitted. If no connection is available, a pop-up message warns the user that annotation results will be affected.
• Now allows re-annotation, either to update an analyzed sample or to finish a job that failed due to network issues
• Provides a link out to ExAC (Exome Aggregation Consortium, hosted by the Broad Institute) while in Triage View.
• Improved login dialog
• Better installer that checks system/hardware compatibility first
• Supports VCF v4.2 format, which includes all variant types (SNPs, indels, CNVs, translocations, etc.) from a sample
• QC report improvements (e.g. the report now includes SureCall version, Design ID, and Genome Build).
Choose one of the four analysis types available in SureCall
Analysis type | Description | Result
Single Sample Analysis | For individual samples | SNPs, indels, translocations
Pair Analysis | To determine copy number changes (use a normal reference); to determine somatic mutations in tumor-normal samples | SNPs and indels; CNVs; somatic mutations
Trio Analysis | For trios, typically mother, father and child | SNPs and indels; de novo mutations
OneSeq Analysis | For simultaneous detection of genome-wide copy number changes, cnLOH, SNP and indel mutations | CNVs, cnLOH, SNPs and indels
SureCall – Support of HaloPlexHS and XTHS molecular barcodes
What are Molecular Barcodes (MBC)?
• Also known as Unique Molecular Identifiers (UMI) or Random Molecular Tags (RMT)
• The goal is for each original DNA fragment, within the same sample, to be attached to a unique sequence barcode
• Although similarly named, these are not the same as a sample barcode/index, which allows multiple samples to be run on a single sequencing run
• Molecular barcodes are a string of totally random nucleotides (such as NNNNNNN), partially degenerate nucleotides (such as NNNRNYN), or defined nucleotides (when template molecules are limited)
• Agilent uses a 10-base MBC
[Figure labels: DNA; Adaptor; Sample Index; Molecular Barcode]
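To make the barcode designs on this slide concrete, here is a minimal Python sketch that counts how many distinct sequences a degenerate pattern can produce. The function name `barcode_space` is hypothetical, and treating the 10-base MBC as fully random is an assumption for illustration only; the slide does not state its composition.

```python
# IUPAC degenerate base codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any base
}


def barcode_space(pattern: str) -> int:
    """Number of distinct sequences a degenerate barcode pattern can produce."""
    total = 1
    for code in pattern:
        total *= len(IUPAC[code])
    return total


print(barcode_space("NNNNNNN"))  # fully random 7-mer: 4**7 = 16384
print(barcode_space("NNNRNYN"))  # partially degenerate: 4**5 * 2 * 2 = 4096
print(barcode_space("N" * 10))   # assumed fully random 10-mer: 4**10 = 1048576
```

The larger the barcode space relative to the number of input molecules, the less likely two different original fragments receive the same barcode by chance.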
Why are Molecular Barcodes Useful?
In Capture based technology (SureSelectHS):
• Able to identify original DNA fragments with bias from fragmentation methods
• With deep sequencing, able to use duplicate reads for error correction
In Amplicon Based technology (HaloPlexHS):
• De-duplication – ability to determine original DNA fragments and PCR duplicates
In Both:
• Accurate low allele frequency variant calling
• Calling of copy number changes
• Correction of errors introduced by PCR and sequencing
De-duplication – Capture without MBC
When you de-duplicate reads that have the same start and stop point, all will be removed (discarded) except for one read.
[Figure labels: Reference Genome; Exon of interest]
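The position-only rule described on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the code of any Agilent tool; the `Read` record and `dedup_by_position` are hypothetical names, and keeping the first read seen is an arbitrary choice.

```python
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    start: int
    stop: int


def dedup_by_position(reads):
    """Keep the first read seen for each (chrom, start, stop); drop the rest."""
    kept = {}
    for read in reads:
        key = (read.chrom, read.start, read.stop)
        kept.setdefault(key, read)  # later reads with the same key are discarded
    return list(kept.values())


reads = [
    Read("r1", "chr1", 100, 250),
    Read("r2", "chr1", 100, 250),  # same coordinates as r1: treated as a duplicate
    Read("r3", "chr1", 120, 270),
]
print([r.name for r in dedup_by_position(reads)])  # ['r1', 'r3']
```

Note that r2 is discarded even if it came from a different original molecule, which is the limitation molecular barcodes address.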
De-duplication – Capture with Molecular Barcodes
[Figure labels: Reference Genome; Exon of interest]
When you ‘de-duplicate’ using molecular barcodes, the reads that have the same start and stop point are not removed but are merged together to create consensus reads. This way, errors introduced by PCR or sequencing are removed.
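One simple way to merge reads into a consensus is a per-position majority vote, sketched below. This is an assumption made for illustration: the slide does not say how the consensus is computed, and production tools typically also weigh base qualities. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(sequences):
    """Majority-vote consensus of equal-length reads from one MBC family."""
    if not sequences:
        raise ValueError("need at least one read")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("reads must have equal length")
    # For each column, pick the most frequent base.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


# Four reads sharing coordinates and MBC; the third carries a single error.
family = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "ACGTACGT"]
print(consensus(family))  # ACGTACGT
```

The isolated error in the third read is outvoted, which is how duplicate reads become useful for error correction rather than being thrown away.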
De-duplication – Amplicon without MBC
[Figure labels: Reference Genome; Exon of interest]
With amplicon technology, de-duplication really isn’t possible: because all reads from an amplicon share the same start and stop points, the majority of the sequencing data would be lost.
De-duplication – Amplicon with MBC
[Figure labels: Reference Genome; Exon of interest]
When using amplicon technology with molecular barcodes, it becomes possible to ‘de-duplicate’ and identify the unique molecules of DNA. The reads that have the same molecular barcode can then be used to create consensus reads and remove errors created by the library prep or sequencing processes.
Low Allele Frequency Variants (<3%)
• Low allele frequency variants are difficult to detect by conventional NGS methods
• Relatively high error rate of sequencers
Sequencer | Error rate | Error type
Illumina MiniSeq & NextSeq | <1% | Substitutions
Illumina MiSeq & HiSeq | 0.1% | Substitutions
Ion Torrent PGM, Proton & S5 | 1% | Indels & homopolymers
PacBio | 13% single pass; ≤1% circular consensus read | Indels
Oxford Nanopore MinION | 12% | Indels
Adapted from Goodwin et al (2016) Nature Reviews Genetics 17:333-351
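A rough back-of-the-envelope sketch shows why these error rates matter for low-frequency variants. It assumes, purely for illustration, that the per-base error rate applies uniformly at the position of interest; the depth and allele fraction are hypothetical values, and `expected_counts` is a hypothetical helper.

```python
def expected_counts(depth: int, allele_fraction: float, error_rate: float):
    """Expected variant-supporting reads and erroneous reads at one position."""
    variant_reads = depth * allele_fraction
    error_reads = depth * error_rate
    return variant_reads, error_reads


depth = 1000            # hypothetical coverage
allele_fraction = 0.01  # hypothetical 1% variant

# Error rates taken from the table above.
for sequencer, error_rate in [("Illumina MiSeq & HiSeq", 0.001),
                              ("Ion Torrent PGM, Proton & S5", 0.01)]:
    variant_reads, error_reads = expected_counts(depth, allele_fraction, error_rate)
    print(f"{sequencer}: ~{variant_reads:.0f} variant reads "
          f"vs ~{error_reads:.0f} erroneous reads")
```

When the expected number of erroneous reads approaches the expected number of variant reads, raw read counts alone cannot separate signal from noise.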
Detecting low allele frequency variants and DNA Inputs
Perfect world (0.1% allele frequency)
• 4 reads to create a consensus, therefore:
• 4000x coverage would be sufficient = 4000 original copies of the genome (2000 cells)
• 12 ng of DNA input required
In reality, library prep is inherently inefficient
[Figure values by library prep step: Input 4000; End repair & A tail 3900; Ligation 2500; Hybridisation 1750; Capture 1250; Clean up 1000; Library 900]
Conclusion: To detect low allele frequency variants, higher DNA inputs are required
Adapted from: https://cofactorgenomics.com/heterogenous-dna-sequencing-lower-limits-minor-allele-frequency-sensitivity/
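The arithmetic on this slide can be reproduced with a short Python sketch. The 3 pg per genome copy is the value implied by the slide (12 ng for 4000 copies), and scaling the input by the inverse of the overall library yield (900 of 4000) is an assumption added here for illustration, not a figure from the slide. The function names are hypothetical.

```python
PG_PER_GENOME_COPY = 3.0  # implied by the slide: 12 ng / 4000 copies


def ideal_copies(min_variant_reads: int, allele_frequency: float) -> float:
    """Genome copies needed so the variant is expected in min_variant_reads of them."""
    return min_variant_reads / allele_frequency


def input_ng(copies: float, pg_per_copy: float = PG_PER_GENOME_COPY) -> float:
    """Convert genome copies to nanograms of DNA."""
    return copies * pg_per_copy / 1000.0


copies = ideal_copies(4, 0.001)       # 4 reads at 0.1% allele frequency
ideal_ng = input_ng(copies)
efficiency = 900 / 4000               # library yield from the funnel figure
realistic_ng = ideal_ng / efficiency  # assumption: scale input by 1/efficiency

print(f"ideal copies: {copies:.0f}")                   # ideal copies: 4000
print(f"ideal input: {ideal_ng:.1f} ng")               # ideal input: 12.0 ng
print(f"library efficiency: {efficiency:.1%}")         # library efficiency: 22.5%
print(f"input adjusted for losses: {realistic_ng:.1f} ng")  # 53.3 ng
```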
Analysis Pipelines other than SureCall
• For customers with established bioinformatics pipelines, Agilent provides two separate Java programs in AGeNT (Agilent Genomics NextGen Toolkit) that can be integrated into your pipelines: SurecallTrimmer and LocatIt
• SurecallTrimmer is called before alignment; it trims adapters (on both ends), trims low quality bases, and masks enzyme footprints. SurecallTrimmer is important for HaloPlex and HaloPlexHS data not processed in SureCall
• MBC reads are found in a third fastq file
• Generation of consensus reads occurs after alignment by examining all reads that align to the same location (chr, start, stop) and share the same molecular barcode
• LocatIt handles MBC after alignment: consensus reads, filtering based on MBC, optical deduplication, etc. It requires a bam file and the MBC fastq
• Tools for bioinformaticians capable of developing and debugging pipelines
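The grouping rule in the bullets above (same chr, start, stop, and molecular barcode) can be sketched as follows. This is not LocatIt's implementation; `group_into_families`, the tuple layout, and the read-name-to-MBC dictionary are hypothetical stand-ins for data that a real pipeline would read from the bam file and the MBC fastq.

```python
from collections import defaultdict


def group_into_families(alignments, mbc_by_read):
    """Group aligned reads by (chrom, start, stop, MBC).

    alignments: iterable of (read_name, chrom, start, stop)
    mbc_by_read: dict mapping read_name to its molecular barcode
    """
    families = defaultdict(list)
    for name, chrom, start, stop in alignments:
        key = (chrom, start, stop, mbc_by_read[name])
        families[key].append(name)
    return dict(families)


alignments = [
    ("r1", "chr7", 100, 250),
    ("r2", "chr7", 100, 250),
    ("r3", "chr7", 100, 250),  # same location, different MBC: separate molecule
    ("r4", "chr7", 300, 450),
]
mbc_by_read = {
    "r1": "ACGTACGTAC",
    "r2": "ACGTACGTAC",
    "r3": "TTGCATGCAA",
    "r4": "ACGTACGTAC",
}

for key, names in group_into_families(alignments, mbc_by_read).items():
    print(key, names)
```

Each resulting family is a candidate for consensus building; r3 stays separate from r1 and r2 even though all three share coordinates.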
Other Types of NGS Analyses: Non-human or Other Type of Sequencing (RNA-, small RNA-, Methyl-, meDIP-, or ChIP-Seq)
• SureCall performs DNA analysis for human (hg19) data only
• StrandNGS can align DNA, RNA, and small RNA using its own aligner or accept BAM or SAM inputs
• Workflows for DNA-Seq, RNA-Seq, small RNA-Seq, Methyl-Seq, MeDIP-Seq, and ChIP-Seq using algorithms specific to experiment type
• Powerful QC tools
• Extensive filtering options
• Pathway, GO analysis, clustering
• StrandNGS pipelines now available
• License fee
Enabling clinical analysis of genomic data
▪ Enables the interpretation, reporting, and sharing of genomic variants
▪ Manage increasing volumes of data and reduce turnaround time
▪ Draft clinical grade lab reports (FDA Class 1 Medical Device)
▪ Analyzed CGH and NGS data accepted as input
▪ Rebranded as Alissa Interpret
How Cartagenia Works
Software as a Service
• Scalable
• Secure
• Cost effective
Content is key! Knowledge Integration:
• Over 100 public and private data sources
• Institution specific repositories
• Sharing across private and public consortia
• Partnerships (Alamut, HGMD, OncoMD, CollabRx, N-of-1…)
Setting and Adopting Standards
• Adapting to diagnostic standards
• ISO9001 and ISO13485 certified
• Registered as Medical Device in US, Canada and Europe
Support
• A fully-serviced solution
• Adapted to your needs, specialization and deadlines
Benefits of Cartagenia Bench
Efficient
Productivity through Automation
Standardization
Knowledge Integration
Easy to use
Co-designed with you and your peers
Integrated with lab and hospital IT
Robust
Validation
Versioning
Security & control
High quality support
Clinical grade
ISO Certification
Class I medical device
Agilent Alissa Vision – from raw data to report
Make your work flow with Agilent Alissa Clinical Informatics for NGS
✓ One single platform from raw reads to lab reports
✓ Comprehensive QC metrics at your fingertips
✓ Alissa Interpret is Class I medical device (CGH and NGS)
✓ Alissa Align & Call (RUO future release)
Bonus Content
Index Hopping
Index Hopping (Illumina Sequencers)
• Incorrect assignment of reads to a different sample
• Occurs in multiplexed samples
• Frequency is higher on patterned flow cells (ExAmp chemistry) but still occurs in bridge amplification chemistry
• Multiple causes (index contamination, sample contamination, postcapture PCR mispriming, excess adapters, overclustering)
• Detection best done during demultiplexing when data from all samples is available
Illumina’s recommendations: https://www.illumina.com/science/education/minimizing-index-hopping.html
Observed index hopping rate using XTHS: HiSeq 4000 vs. HiSeq 2500
▪ We see an average hop rate** of 2.9% with HiSeq 4000 (newer patterned flow cell).
▪ On HiSeq 2500 (older non-patterned flow cell), we see an average hop rate of 0.1-0.2%.
**: Hop rate = hopped reads / total reads
Libraries are prepared and enriched individually, so the hopping observed has occurred at the sequencing level.
[Figure: bar chart of index hopping rate (axis 0.0% to 3.5%) for HiSeq4000 and HiSeq2500]
[Figure: library molecule P5, MBC, Insert1, Index1, P7 shown next to the same insert carrying Index2 (P5, MBC, Insert1, Index2, P7)]
What does this mean for your application*?
▪ For pooling of samples from the same germline application using XT low input or SureSelect XT, XT2
  ▪ Assuming alleles at <5% frequency are not called
  ▪ Customers should not be concerned about index hopping
▪ For pooling of samples from the same somatic application using XTHS
  ▪ Variant calls with >5% alleles are likely not due to index hopping
  ▪ Variant calls with <5% alleles might be impacted by index hopping
▪ For heterogeneous pooling across applications, or of samples across species, single cell, microbiome, viral, RNA expression, etc.
  ▪ Variant calls are possibly impacted by index hopping
  ▪ Consider index hopping risks when determining what samples to pool for sequencing
*: Index misassignment discussed here is limited to hopping at the sequencing level; HiSeq 2500 data suggest that other sources of misassignment, such as index purity, are insignificant by comparison
Index Hopping Physical Corrections
• Use one sample (exome) per lane
• Do not use precapture pooling as PCR of multiplexed samples may misprime and cause index hopping
• Pool libraries right before sequencing and sequence pooled library as soon as possible
• Freeze pooled libraries at -20°C
• Remove as many free adapters and PCR primers as possible; perform a second bead cleanup if you see a small MW blip
• If the sample barcode consists of dual indexes, do not use all possible combinations of indices, so that illegal combinations can be detected and removed
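The last point can be illustrated with a small Python sketch that flags reads whose dual-index pair was never assigned to a sample and reports the resulting hop rate (hopped reads / total reads). The index names and read counts are made up, and `find_hopped` is a hypothetical helper, not part of any demultiplexing tool.

```python
from collections import Counter


def find_hopped(observed_pairs, expected_pairs):
    """Count reads whose (index1, index2) pair mixes known indexes illegally.

    Returns (hopped_read_count, total_read_count, hop_rate).
    """
    known_i1 = {i1 for i1, _ in expected_pairs}
    known_i2 = {i2 for _, i2 in expected_pairs}
    counts = Counter(observed_pairs)
    total = sum(counts.values())
    hopped = sum(
        n for (i1, i2), n in counts.items()
        if i1 in known_i1 and i2 in known_i2 and (i1, i2) not in expected_pairs
    )
    return hopped, total, (hopped / total if total else 0.0)


# Two samples, each with its own unique pair of indexes (hypothetical names).
expected_pairs = {("A1", "B1"), ("A2", "B2")}
observed_pairs = (
    [("A1", "B1")] * 50 + [("A2", "B2")] * 47
    + [("A1", "B2")] * 2 + [("A2", "B1")] * 1  # illegal combinations
)

hopped, total, rate = find_hopped(observed_pairs, expected_pairs)
print(f"{hopped} of {total} reads hopped ({rate:.1%})")  # 3 of 100 reads hopped (3.0%)
```

If every index combination were in use, the three hopped reads would be indistinguishable from legitimate reads of another sample.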
XTHS molecular barcode thresholding: Bioinformatically remove hopped reads
• The vast majority of hopped reads have just 1 read, regardless of sequencing depth
• One way to minimize the impact of hopped reads is to remove all single reads (MBC thresholding). No error correction utilizing MBC is possible with these reads anyway.
• This will work well for low allele frequency applications where error correction with MBC is needed.
[Figure: fragments with multiple reads (same MBC) versus fragments with a single read; all colors denote molecular barcodes; good reads and hopped reads are marked]
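MBC thresholding as described here amounts to dropping every barcode family supported by a single read. A minimal sketch, assuming reads have already been grouped into families keyed by location and MBC; the dictionary layout and the name `mbc_threshold` are hypothetical.

```python
def mbc_threshold(families, min_reads=2):
    """Keep only MBC families supported by at least min_reads reads."""
    return {
        key: reads for key, reads in families.items() if len(reads) >= min_reads
    }


families = {
    ("chr7", 100, 250, "ACGTACGTAC"): ["r1", "r2", "r3"],  # multi-read family
    ("chr7", 100, 250, "TTGCATGCAA"): ["r4"],              # singleton, possibly hopped
    ("chr7", 300, 450, "GGATCCTTAA"): ["r5", "r6"],
}

kept = mbc_threshold(families)          # MBC2+: singletons removed
print(len(families), "->", len(kept))   # 3 -> 2
```

Setting `min_reads=1` corresponds to no thresholding (MBC1+).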
How effective is MBC thresholding on HiSeq4000?
▪ MBC2+ means all single MBC reads are filtered out, i.e. MBC thresholding
▪ MBC thresholding results in a 10x reduction in hop rate, from an average of 2.9% to 0.3%, close to the observed hopping level on HiSeq 2500
Each data point is the average hop rate of 2-3 HiSeq 4000 runs for a given sample. Data include 3-plex, 4-plex and 8-plex lane runs.
[Figure: with vs. without MBC threshold, MBC (1+) vs. MBC (2+); y-axis: % hop rate (hopped reads/total reads), 0 to 5]
Impact of MBC Thresholding on XTHS Sensitivity
Metric | Expected | HiSeq2500 MBC1+ | HiSeq4000 MBC1+ | HiSeq2500 MBC2+ | HiSeq4000 MBC2+
>2% known variants | 59 | 59 | 59 | 59 | 59
<=2% known variants* | 21 | 12 | 12 | 11 | 12
False positive (or unknown variants)** | - | 57 | 105 | 24 | 24
Total sensitivity | - | 88.75% | 88.75% | 87.50% | 88.75%
Specificity | - | 99.93% | 99.86% | 99.97% | 99.97%
Precision (PPV) | - | 55.47% | 40.34% | 74.47% | 74.74%
77 kb panel, 10 ng input, 10,000X sequencing depth
*: 2 of 21 have an expected frequency of 1-2%; both are detected. The remaining 19 are <=1%.
**: True variant calls are based on Genome in a Bottle. The false positive count could include unknown true variants.
• Without thresholding, false positive rates are significantly higher with HiSeq 4000 (low specificity)
• HiSeq 4000 with MBC thresholding: comparable sensitivity and specificity to HiSeq 2500***
MBC thresholding on HiSeq 4000, while removing a significant amount of sequencing data, shows little to no negative impact on assay sensitivity.
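The sensitivity and precision figures in the table follow directly from the variant counts, as this short sketch shows. Specificity is not recomputed because the number of true negatives is not given on the slide. The function and variable names are hypothetical.

```python
def sensitivity(true_positives: int, expected: int) -> float:
    """Fraction of expected (known) variants that were detected."""
    return true_positives / expected


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of calls that are known variants (PPV)."""
    return true_positives / (true_positives + false_positives)


EXPECTED = 59 + 21  # known variants above and at or below 2%

# (detected known variants, false positive or unknown calls) per run, from the table
runs = {
    "HiSeq2500 MBC1+": (59 + 12, 57),
    "HiSeq4000 MBC1+": (59 + 12, 105),
    "HiSeq2500 MBC2+": (59 + 11, 24),
    "HiSeq4000 MBC2+": (59 + 12, 24),
}

for name, (tp, fp) in runs.items():
    print(f"{name}: sensitivity {sensitivity(tp, EXPECTED):.2%}, "
          f"precision {precision(tp, fp):.2%}")
```

Running it reproduces the table: for example, HiSeq4000 MBC1+ gives 71/80 = 88.75% sensitivity and 71/176 = 40.34% precision, while MBC2+ raises precision to 71/95 = 74.74%.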
MBC Thresholding for Hopped Reads in SureCall
MBC2+ is set by default
SureCall Estimated Hopping Frequency
• New parameter reduces noise generated by sample index cross-contamination
• Default setting is 0.005 (0.5%)
• Range is 0 to 0.1 (0 to 10%)
• How SureCall uses this parameter:
1) Calculate the number of variant reads that could be caused by index hopping = average coverage of each region × estimated index hopping frequency
2) Based on the read number from step 1, SureCall calculates the probability that a given variant call is real rather than noise caused by index hopping, and filters out calls attributed to hopping
• Estimate the value by comparing SureCall allele frequencies with known allele frequencies, from data for the particular sequencer used (higher in patterned flow cells), or from past experience
[Figure callout: number of variants that might be due to index hopping]
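Step 1 is simple arithmetic and can be sketched directly. Step 2 is only approximated here: the slide says SureCall computes a probability but does not give the model, so the plain cutoff in `may_be_hopping_noise` is a stand-in assumption and not SureCall's method. Both function names are hypothetical.

```python
def hopped_read_estimate(average_coverage: float,
                         hopping_frequency: float = 0.005) -> float:
    """Step 1: reads per region that index hopping could contribute."""
    if not 0.0 <= hopping_frequency <= 0.1:
        raise ValueError("hopping frequency must be between 0 and 0.1")
    return average_coverage * hopping_frequency


def may_be_hopping_noise(alt_reads: int, average_coverage: float,
                         hopping_frequency: float = 0.005) -> bool:
    """Simplified stand-in for step 2: flag calls not exceeding the estimate."""
    return alt_reads <= hopped_read_estimate(average_coverage, hopping_frequency)


estimate = hopped_read_estimate(10_000)  # default 0.5% at 10,000x coverage
print(f"{estimate:.0f} reads")           # 50 reads
print(may_be_hopping_noise(30, 10_000))  # True: within the hopping estimate
print(may_be_hopping_noise(400, 10_000)) # False: well above the estimate
```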
Optical Duplicates
These are only a problem for HiSeq 2500/MiSeq/NextSeq data. They come from large clusters being called incorrectly as two separate clusters by Illumina’s RTA software.
On a 2500: Some clusters are either too big or their shape does not conform to the model, and they get counted as 2+ clusters.
On a 4000: During amplification on the flow cell, one of the local duplicates that is part of a growing cluster breaks free, goes on to seed a new nanowell, and starts a cluster of its own nearby. After analysis, these two nanowells show the very same data: sequence and MBC. Their geographical coordinates are close to each other.
SureCall uses geographical location (tile) for optical deduplication before MBC deduplication.
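A location-based check of this kind can be sketched from the read names alone, since Illumina read identifiers carry lane, tile, and x/y coordinates (instrument:run:flowcell:lane:tile:x:y). This is an illustrative sketch, not SureCall's algorithm; the `max_distance` cutoff and the function names are assumptions.

```python
from itertools import combinations


def parse_location(read_name: str):
    """Return (lane, tile, x, y) from an Illumina-style read name."""
    fields = read_name.split()[0].lstrip("@").split(":")
    lane, tile, x, y = (int(value) for value in fields[3:7])
    return lane, tile, x, y


def optical_duplicate_pairs(reads, max_distance=100):
    """Find read pairs with identical sequence and MBC on the same tile, close together.

    reads: list of (read_name, sequence, mbc)
    max_distance: hypothetical cutoff in flow cell coordinate units
    """
    pairs = []
    for (n1, s1, m1), (n2, s2, m2) in combinations(reads, 2):
        if (s1, m1) != (s2, m2):
            continue
        lane1, tile1, x1, y1 = parse_location(n1)
        lane2, tile2, x2, y2 = parse_location(n2)
        same_tile = (lane1, tile1) == (lane2, tile2)
        close = abs(x1 - x2) <= max_distance and abs(y1 - y2) <= max_distance
        if same_tile and close:
            pairs.append((n1, n2))
    return pairs


reads = [
    ("M1:7:FC1:1:1101:1000:2000", "ACGTTGCA", "ACGTACGTAC"),
    ("M1:7:FC1:1:1101:1040:2010", "ACGTTGCA", "ACGTACGTAC"),  # same tile, nearby
    ("M1:7:FC1:1:2205:1000:2000", "ACGTTGCA", "ACGTACGTAC"),  # different tile
]
print(optical_duplicate_pairs(reads))
```

Only the first two reads are reported as an optical duplicate pair; the third shares sequence and MBC but sits on another tile, so it would be left for MBC deduplication.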