a quick guide to the analytics behind genomic testing · understanding bioinformatics requires ......

27
A Quick Guide to the Analytics Behind Genomic Testing 1 Elaine Gee, PhD Director, Bioinformatics ARUP Laboratories

Upload: hoangngoc

Post on 28-Jul-2018

225 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

A Quick Guide to the

Analytics Behind Genomic Testing

1

Elaine Gee, PhD Director, Bioinformatics ARUP Laboratories

Page 2: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Learning Objectives

Catalogue various types of bioinformatics analyses that support clinical genomic testing

Enumerate types of variant classes

Describe algorithmic methods for variant detection by NGS

Compare and contrast germline and somatic clinical bioinformatics pipeline methodologies

Discuss the infrastructure complexity required to support analytics for NGS testing at scale in the cloud

Explain validation strategies for bringing best-in-class pipelines into clinical production

Page 3: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

The Human Reference Genome

~3B base pairs structured into

23 chromosome pairs

3,098,825,702 base pairs

20,805 coding genes

14,181 pseudogenes

196,501 gene transcripts

Page 4: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic
Page 5: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Why Genomic Testing?

KRAS G12D

1 in 4 cancer deaths are from lung cancer.

~222,500 new cases of lung cancer in the U.S. in 2017.

Page 6: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Short-Read Sequencers Illumina Ion-Torrent

Long-Read Sequencers PacBio NanoPore 10X Nanostring

Genomic Testing

Page 7: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

7

Types of NGS Testing—Somatic & Germline

Page 8: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Types of NGS Testing—cfDNA and ctDNA

Non-Invasive Prenatal Testing (NIPT) Liquid Biopsy

Trisomy 21 (Down Syndrome) Non-small cell lung cancer

EGFR

Page 9: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Types of NGS Testing—Infectious Disease

Page 10: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Types of NGS Testing—RNA-Seq

• Alternate transcripts

• Novel gene isoforms

• Gene fusions

Page 11: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Role of Clinical Bioinformatics

Build pipelines Provide supplemental information for clinical interpretation and quality control

Other computationally heavy analytics are involved in evaluating:

Design of new panels

Identification of genetic patterns in patient cohorts

Discovery of gene pathways

Page 12: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Understanding bioinformatics requires understanding the laboratory process.

Page 13: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Steps in a bioinformatics pipeline:

1. Sample demultiplexing

2. Read alignment

3. BAM polishing steps

4. Variant calling

5. Variant annotations

6. QC calculations

Variant Calling Pipeline

Page 14: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Step 1: Sample Demultiplexing

Page 15: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Step 2: Read Alignment

Read Alignment

SAM Format

Page 16: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

A T C C T G

A T C C C T G

A T C C T G

A T C C T G

A T C C T G

A T C C T G

PCR Duplicate Removal Base Quality Score Recalibration

Step 3: BAM Polishing Steps

homopolymer

+1%

Q30 Phred base quality score → 99.9% → 1/1000

reference

read

C

C

C

C

C

Page 17: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Step 4: Variant Calling by Class

Page 18: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

SNV/Insertion/Deletion • Position based callers

(GATK Unified Genotyper, LoFreq)

• Local de-novo assembly of haplotypes (GATK Haplotype Caller)

• Graph based variant callers (Graph Genome)

• Neural networks (Deep Variant)

Duplications/Structural variants • Pattern growth approach (Pindel)

• Split reads, discordant paired-end reads (Manta, DELLY, CREST)

• kmer + de-novo assembly (BreaKmer)

• Unmapped or partially mapped reads (ITD Assembler)

• Depth of coverage + background error correction + principal component analysis (XHMM)

• Tumor/normal

• B allele frequency

Example Variant Calling Algorithms

Page 19: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Example KRAS G12D Variant Cell

Page 20: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

The annotated variant includes: • Gene • Gene Transcript • Nucleotide change (cdot) • Protein change (pdot) • Variant Type

– Polymorphism – Synonymous – Non-synonymous

• Nonsense • Missense

– Frame shift

The VCF variant includes: • chromosome • position • ID • reference base • alternate base • variant quality • meta-information

– information and individual format fields

– filter flags

Step 5: Variant Annotations

VCF variant

Annotated variant

Page 21: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Sample Report QC metrics

Step 6: QC Calculations

Sequencing

Run Level Cluster density Base call quality score Fragment size

Sample Level Depth coverage Uniformity Mapping quality Duplication rate

Variant Level Novel variants Known variants Transition-to-transversion ratio

Page 22: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Off-Target Read D

epth

On-Target

Gene Exon

Intronic regions

Sample-Level QC Metrics for Targeted Capture

Minimum Depth of Coverage Uniformity

Mapping Quality

Duplication Rate

Page 23: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

How does a bioinformatics job get executed in clinical production?

Job 1

Job 3

Job 2

Job 4

Job 5

Compute Infrastructure for Data Processing

Page 24: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Data Storage Infrastructure

BCL (500–550 GB)

FASTQs (12–15 GB)

Database

Object Storage “hot”

Archive “cold”

99.9% availability 99.999999999% durability

Raw output for a single run

Exome ~150x

In-house

Cloud based

FASTQs, BAMs, VCFs

Bioinformatics HiSeq 4000

Page 25: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

• Recommendations from CAP/AMP – 17 recommendation statements

– 59 variants tested in each variant class

• Example Statistics – Positive percentage agreement (PPA)

– Positive predictive value (PPV)

– Reproducibility

– Allelic fraction lower limit of detection

• Validation required prior to use in clinical production

Bioinformatics Pipeline Validation

Page 26: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Summary

Catalogue various types of bioinformatics analyses that support clinical genomic testing

Enumerate types of variant classes

Describe algorithmic methods for variant detection by NGS

Compare and contrast germline and somatic clinical bioinformatics pipeline methodologies

Discuss the infrastructure complexity required to support analytics for NGS testing at scale in the cloud

Explain validation strategies for bringing best-in-class pipelines into clinical production

Page 27: A Quick Guide to the Analytics Behind Genomic Testing · Understanding bioinformatics requires ... Catalogue various types of bioinformatics analyses that support clinical genomic

Questions? Elaine Gee, PhD Director of Bioinformatics ARUP Laboratories [email protected]