
Next-generation sequencing: from basics to future diagnostics
PART II: NGS analysis to find variants

Sangwoo Kim, Ph.D.
Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine

Overview

• PART I: NGS technologies and standard workflow
  – Next generation sequencing
    • History and technology
  – Data and its meaning; process workflow
  – Discussion
• PART II: NGS Analysis to find variants
  – NGS analysis to find variants
    • Single nucleotide variants (SNVs)
    • Copy number variations (CNVs)
    • Structural variations (SVs)
• PART III: NGS application to diagnostics
  – NGS in genomic medicine
  – Potential application to forensic science

FROM PREVIOUS SESSION

Conventional variant calling
Variant calling in minor subgroups

Next-generation sequencing

Metzker, Nat Rev Genet, 2010

Massively Parallel Sequencing (a.k.a. Next-generation sequencing)

Platforms: Illumina HiSeq2500, 5500 SOLiD system, Ion Torrent PGM

Sequencing via spatially separated, clonally amplified DNA templates or single DNA molecules

The human genome project

• Began in 1990
• Consortium of the U.S., U.K., France, Australia, Japan, etc.
• "Rough draft" in 2000
• "Complete genome" published in 2003
• 13 years, $3 billion

The Human Genome Project (1990~2003)

FASTQ format (NGS raw data)

Each record holds one read: its sequence and its per-base quality.

A format for NGS reads (FASTA + quality)
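As a rough illustration of the "sequence + quality" idea, the sketch below parses one four-line FASTQ record and converts the quality string into Phred scores and error probabilities. The function name `parse_fastq_record`, the example read, and the Phred+33 offset are assumptions for illustration; the slides do not state which quality encoding is used.

```python
def parse_fastq_record(lines, phred_offset=33):
    """Parse one 4-line FASTQ record (header, sequence, '+', quality).

    phred_offset=33 is an assumed encoding (Phred+33), not stated in the slides.
    """
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes a Phred score Q; error probability = 10^(-Q/10).
    phred = [ord(ch) - phred_offset for ch in quality]
    error_prob = [10 ** (-q / 10) for q in phred]
    return {
        "name": header[1:],
        "sequence": sequence,
        "phred": phred,
        "error_prob": error_prob,
    }


# Hypothetical read: four high-quality bases followed by four lower-quality ones.
record = parse_fastq_record(["@read1", "GATTCAAA", "+", "IIII5555"])
print(record["sequence"], record["phred"])
```

Under this encoding a Phred score of 20 corresponds to a 1% chance that the base call is wrong.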

Kim S and Paik S, in preparation

[Workflow figure; samples: disease and control]

A. Data Generation
   sequencing -> raw reads (FASTQ files) -> quality control -> short read alignment (BAM files)

B. Variant Finding
   germ-line mutation, somatic mutation, copy number variation (CNV), structural variation (SV), xenogeneic sequence

C. Variant Analysis
   recurrence analysis, variant impact prediction, mutation filtration/selection, tumor heterogeneity inference

Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows an inspection of all genomic areas and is applicable to CNV and SV analysis. Whole exome sequencing (WES) interrogates only coding regions (1~2% of the genome), at lower cost and throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only a limited set of regions is to be tested (as in a diagnostic kit), a set of targeted genes is amplified and sequenced (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.

D. Validation and functional assessment
   variant confirmation, pathway analysis, functional study

DATA PREPROCESSING
Short Read Alignment

Mapping back to genome

TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCTTAAGAGCTGTGAGA

Where is this sequence in the human genome?

Do this as fast as possible!

brute force way

Find "GATTCAAA" in the human genome

T G A C G T G T G A T T C A A A A A A G C    The reference genome (chr1, start)
G A T T C A A A                              Your query
  G A T T C A A A
    G A T T C A A A
      G A T T C A A A

The query is compared at every position, shifting by one base each time.

This is very long (3 billion)
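The brute-force scan can be sketched in a few lines. This is only an illustration of the idea on the toy reference from the slide; `naive_find` is a hypothetical helper name, not a tool mentioned in the lecture.

```python
def naive_find(reference, query):
    """Slide the query along the reference one base at a time."""
    hits = []
    for start in range(len(reference) - len(query) + 1):
        # Compare the query against the window that begins at 'start'.
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits


reference = "TGACGTGTGATTCAAAAAAGC"  # toy reference from the slide
positions = naive_find(reference, "GATTCAAA")
print(positions)  # 0-based start positions of exact matches
```

Every read costs on the order of (reference length x read length) comparisons, which is why this approach does not scale to a 3-billion-base genome.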

How fast should it be?

method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days

based on 200bp read length, 80x single-end WGS

Searching with index
• Assume you're searching for "genome" in an English dictionary
  – You don't search every line on every page
  – You first find the page range of "g" in the dictionary
  – in the above range (of 'g'), you find the page range of "ge" in the dictionary
  – in the above range (of 'ge'), you find the page range of "gen" in the dictionary
  – ...
  – until you find "genome"
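The dictionary analogy can be sketched with binary search over a sorted word list: each extra letter narrows the previous range instead of restarting the search. The word list and the helper name `narrow_by_prefix` are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def narrow_by_prefix(sorted_words, query):
    """Return the (prefix, lo, hi) range after each additional letter of the query."""
    lo, hi = 0, len(sorted_words)
    ranges = []
    for k in range(1, len(query) + 1):
        prefix = query[:k]
        # Search only inside the range found for the shorter prefix.
        lo = bisect_left(sorted_words, prefix, lo, hi)
        # "\uffff" sorts after any ordinary letter, so this bounds all words with the prefix.
        hi = bisect_right(sorted_words, prefix + "\uffff", lo, hi)
        ranges.append((prefix, lo, hi))
    return ranges


words = sorted(["apple", "gene", "genome", "genus", "giraffe", "zebra"])
steps = narrow_by_prefix(words, "genome")
for prefix, lo, hi in steps:
    print(prefix, words[lo:hi])
```

The range shrinks from all "g" words to the single entry "genome" without ever scanning the whole list.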


How can we build an index for the genome?

Burrows-Wheeler Transform


Burrows-Wheeler Transformation

Input string: BANANA

1. Append the end marker '$' (lexicographically smallest): BANANA$

2. List all rotations:

0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA

3. Sort the rotations (columns: sorted row, original rotation number, rotation):

0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA

4. Take the last column: ANNB$AA

BWT("BANANA$") = "ANNB$AA"
1. BWT just changes the order of the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string
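The rotate, sort, and take-last-column procedure can be written directly. The sketch below is a teaching version that materializes every rotation, so it is only suitable for short strings; real aligners build the transform far more economically. The function names are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform by sorting all rotations (text must end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Recover the original string by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The original string is the rotation that ends with the end marker.
    return next(row for row in table if row.endswith("$"))


transformed = bwt("BANANA$")
restored = inverse_bwt(transformed)
print(transformed, restored)
```

The round trip illustrates the third property: the transformed string alone is enough to rebuild the original.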

LF Search

Question: Find "NAN" in BANANA

The query is matched backward, one character at a time ("N", then "AN", then "NAN"), using the sorted rotations and their last column:

row  rotation  last column
0    $BANANA   A
1    A$BANAN   N
2    ANA$BAN   N
3    ANANA$B   B
4    BANANA$   $
5    NA$BANA   A
6    NANA$BA   A

Step 1. The range of strings that start with "N" can be calculated from:
• the number of symbols that are lexicographically less than 'N', to determine the start point: = 5
• the number of 'N', to determine the end point: = 2
Range: rows 5 to 6.

Step 2. The range of strings that start with "AN":
• Using only the number of symbols that are lexicographically less than 'A' (= 1) and the number of 'A' (= 3) gives rows 1 to 3. This is a range for 'A', not 'AN'!!
• Instead, count the 'A's in the last column relative to the previous range:
  – start point = the number of symbols that are lexicographically less than 'A' + the number of 'A' before the start point = 1 + 1 = 2
  – count of 'A' before start point = 1: that "Ax" is not "AN" and is less than "AN", so it is skipped
  – end point = the number of symbols less than 'A' + the number of 'A' before the end point = 1 + 3 = 4 (exclusive)
Range: rows 2 to 3.

Step 3. The range of strings that start with "NAN":
• start point = the number of symbols that are lexicographically less than 'N' + the number of 'N' before the start point = 5 + 1 = 6
• end point = the number of symbols less than 'N' + the number of 'N' before the end point = 5 + 2 = 7 (exclusive)
Range: row 6 only.

Row 6 was the 2nd row in the original (unsorted) permutation
= the number of rotations of the original string
= "NAN" exists at the 3rd position of "BANANA"
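The range updates in the LF search can be sketched as a short backward-search routine. This is a minimal teaching version: it recounts characters with `count` at every step, whereas practical FM-index implementations precompute these counts. The function names are illustrative, not taken from any aligner.

```python
def build_index(text):
    """Return (text with '$', suffix array, BWT) for a short string."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The last column holds the character just before each sorted suffix.
    last_column = "".join(text[i - 1] for i in suffix_array)
    return text, suffix_array, last_column


def backward_search(reference, pattern):
    """Find 0-based positions of exact matches, reading the pattern right to left."""
    text, suffix_array, last_column = build_index(reference)
    lo, hi = 0, len(text)  # current row range [lo, hi)
    for ch in reversed(pattern):
        smaller = sum(c < ch for c in text)  # symbols lexicographically less than ch
        lo = smaller + last_column[:lo].count(ch)  # ch occurrences before the start point
        hi = smaller + last_column[:hi].count(ch)  # ch occurrences before the end point
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


hits = backward_search("BANANA", "NAN")
print(hits)  # 0-based, so position 2 is the 3rd character
```

The intermediate ranges for "NAN" are rows 5 to 6, then 2 to 3, then row 6, matching the walk-through on the slides.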

Genome Informatics I (2015 Spring)

Genome query

imported from Mike Schatz's slides: http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf


Inexact matching

T G A C G T G T G A T T C A A A A A A G C    (reference)
G A T T G A A A                              (query)

When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• if another mismatch occurs, branch out again
• so the allowed edit distance is critical to alignment speed
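The branching idea can be sketched by extending backward search with a mismatch budget: at each query position every base is tried, and a non-matching base costs one mismatch. This is a simplified sketch that handles substitutions only (no insertions or deletions) and is not the algorithm of any specific aligner; all names are illustrative.

```python
def inexact_search(reference, pattern, max_mismatches):
    """Backward search that branches over substitutions, up to max_mismatches."""
    text = reference + "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(reference))
    smaller = {c: sum(ch < c for ch in text) for c in alphabet}
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:
            return  # no rotation matches this branch
        if i < 0:
            hits.update(suffix_array[lo:hi])  # whole pattern consumed
            return
        for c in alphabet:
            cost = 0 if c == pattern[i] else 1  # trying another base costs a mismatch
            if cost <= budget:
                new_lo = smaller[c] + last_column[:lo].count(c)
                new_hi = smaller[c] + last_column[:hi].count(c)
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(pattern) - 1, 0, len(text), max_mismatches)
    return sorted(hits)


reference = "TGACGTGTGATTCAAAAAAGC"
print(inexact_search(reference, "GATTGAAA", 0))
print(inexact_search(reference, "GATTGAAA", 1))
```

Each extra allowed mismatch multiplies the number of branches explored, which is why the permitted edit distance dominates running time.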


Goal achieved

method               time per 1 read (sec)   time per 80x WGS (sec)   is equal to
eyeballing           3x10^9                  3.6x10^18                1x10^11 yrs
naïve matching       2400                    1.2x10^9                 7,608 yrs
improved algorithm   3                       3.6x10^8                 10 yrs
minimum required     0.01                    1.2x10^7                 11.5 days
desired              0.001                   1.2x10^6                 1.2 days

VARIANT CALLING – SNV CALLING
SNV calling

Detailed View

one read = one DNA fragment aligned to a specific genomic region
= observation of our sample in this region (1 time)

A genome region

Detailed View

A—AAAACAAAAC

A certain genomic position (in bp)

reference allele

observation of our sample at this position from read 1



Why multiple observations?
• Observations contain errors
  – errors from machine
    • basecall error
  – errors from mapping
    • mapping error
  – errors from others
    • library prep error
• With accuracy of 99%...
  – 1% error from whole region
  – leads to
    • ~30 million false SNPs for whole genome
    • ~500k false SNPs for whole exome
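The two figures follow from multiplying the number of inspected bases by the error rate. The sketch below reproduces the arithmetic; the 3-billion-base genome size comes from the earlier slide, while the 50 Mb exome size is an assumption chosen to be consistent with the ~500k figure.

```python
def expected_false_calls(num_bases, accuracy):
    """Expected number of wrong bases if each base is observed once."""
    return num_bases * (1.0 - accuracy)


GENOME_BASES = 3_000_000_000  # "3 billion", as on the earlier slide
EXOME_BASES = 50_000_000      # assumed exome size (~1-2% of the genome)

genome_errors = expected_false_calls(GENOME_BASES, 0.99)
exome_errors = expected_false_calls(EXOME_BASES, 0.99)
print(round(genome_errors), round(exome_errors))
```

With a single observation per position, an error cannot be told apart from a real variant, which motivates sequencing each position many times.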

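Human diploid genome

[Figure: read pileups at a site with alleles G and A, one per genotype.
 Homozygous reference (G/G): reads show G, with an occasional A from sequencing error / mapping error.
 Heterozygous alternative (G/A): reads show a mix of G and A.
 Homozygous alternative (A/A): reads show A.]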

somatic mutations


Allele fraction distribution (binomial)

$$\Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.9973$$

$$\Pr(35 \le x \le 65) \approx 0.9973$$

Normal approximation of B(100,0.5)
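For a heterozygous site covered by 100 reads, the number of reads carrying one allele can be modeled as B(100, 0.5), so the mean is 50 and the standard deviation is 5, giving the 35 to 65 interval. The sketch below derives those numbers and also sums the exact binomial probability over the same interval for comparison with the normal-approximation value; the helper name is illustrative.

```python
from math import comb, sqrt


def binomial_interval_probability(n, p, lo, hi):
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(lo, hi + 1))


n, p = 100, 0.5
mu = n * p                     # expected number of reads carrying the allele
sigma = sqrt(n * p * (1 - p))  # standard deviation of that count
lower, upper = int(mu - 3 * sigma), int(mu + 3 * sigma)
exact = binomial_interval_probability(n, p, lower, upper)
print(mu, sigma, (lower, upper), exact)
```

The exact sum comes out slightly above the normal-approximation figure because the binomial is discrete and the interval includes both endpoints.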


Inferring mutations

Observation of donor genome: GAGAGGGGGAAAGAGA
reference allele: G

Probability of observing "G" at the site of "G"
(genotype notation: A = reference allele, B = alternative allele)

• True genotype = "AA" and no sequencing error
• True genotype = "AB" and
  – Read was generated from 'A' allele and no sequencing error
  – Read was generated from 'B' allele and sequencing error and 'A' was generated by chance
• True genotype = "BB" and sequencing error

Inferring mutations

Observation of donor genome: GAGAGGGGGAAAGAGA
reference allele: G

Probability of observing "A" at the site of "G"

• True genotype = "AA" and sequencing error, P(e)
• True genotype = "AB" and
  – Read was generated from 'A' allele and sequencing error and 'B' was generated by chance
  – Read was generated from 'B' allele and no sequencing error
• True genotype = "BB" and no sequencing error

Genotype determination

• L(g=AA|D): likelihood that the genotype is wild-type given the observation!
• L(g=AB|D), L(g=BB|D): likelihood that the genotype is mutant given the observation!
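The three likelihoods can be sketched by multiplying per-read probabilities built from the cases listed on the "Inferring mutations" slides. This is a simplified model under stated assumptions: a single fixed error rate of 1% for every read, an error producing each of the three other bases with equal chance, and no mapping errors. Production callers use per-base qualities and priors; the function name is illustrative.

```python
from math import log


def genotype_log_likelihoods(bases, ref, alt, error=0.01):
    """Log-likelihood of AA / AB / BB (A = ref allele, B = alt allele) for a pileup."""
    wrong = error / 3  # an error yields one specific other base with probability e/3
    p_ref = {"AA": 1 - error, "AB": 0.5 * (1 - error) + 0.5 * wrong, "BB": wrong}
    p_alt = {"AA": wrong, "AB": 0.5 * wrong + 0.5 * (1 - error), "BB": 1 - error}
    totals = {genotype: 0.0 for genotype in p_ref}
    for base in bases:
        for genotype in totals:
            if base == ref:
                totals[genotype] += log(p_ref[genotype])
            elif base == alt:
                totals[genotype] += log(p_alt[genotype])
            else:
                totals[genotype] += log(wrong)  # a third base can only be an error
    return totals


pileup = "GAGAGGGGGAAAGAGA"  # observation of the donor genome from the slide
scores = genotype_log_likelihoods(pileup, ref="G", alt="A")
best = max(scores, key=scores.get)
print(best, scores)
```

For this pileup (9 G and 7 A) the heterozygous genotype has by far the highest likelihood, because explaining seven A reads as errors under AA is extremely improbable.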


Tools

SOMATIC MUTATIONS


Germline vs. Somatic mutation

sample from non-disease site

sample from disease site

reference sequence (e.g. hg19)

• UnifiedGenotyper
• VarScan2
• SomaticSniper
• …

Easy way to find somatic mutations

sample from non-disease site: GN=AA

sample from disease site: GT=AB

Joint Probabilities
• P(GT=AB|GN=AA) ≠ P(GT=AB|GN=AB) ≠ P(GT=AB|GN=BB)
• Tumor genotype is dependent on normal genotype!!!

G: Joint Genotype Matrix

WHEN SAMPLE IS NOT PURE


Heterogeneous Sample

[Figure: a sample that mixes normal cells (genotype G/G) and tumor cells (genotype G/A). Reads drawn from the mixture are mostly G, with a small number of A.]

Causes of low-frequency variants
• Sample contamination (e.g. stromal cells)
• Tumor heterogeneity
• Extreme environments
• Somatic mosaicism


[Figure: read pileup at one site, 13 G and 2 A out of 15 reads]

"2/15: No mutation. Two 'A's are from sequencing errors..."

VS

"2/15: Heterozygous somatic mutation!! The sample is certainly heterogeneous!"

"How do we know this?"
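One way to weigh the two readings is to ask how likely two or more 'A' reads out of 15 would be from errors alone, and what tumor-cell fraction the alternative reading would imply. The sketch below assumes a 1% per-read error rate (the accuracy figure used earlier in the lecture), treats every error as producing the alternative base, and assumes a heterozygous mutation in diploid tumor cells with normal cells contributing only reference reads. These are illustrative simplifications, not the Virmid model.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


depth, alt_reads, error_rate = 15, 2, 0.01

# Reading 1: no mutation, both 'A' reads are errors.
p_errors_only = prob_at_least(alt_reads, depth, error_rate)

# Reading 2: heterozygous somatic mutation; half of the tumor-derived reads carry 'A',
# so the implied tumor-cell fraction is twice the observed allele fraction.
implied_tumor_fraction = 2 * alt_reads / depth

print(p_errors_only, implied_tumor_fraction)
```

Under these assumptions errors alone explain 2/15 only about 1% of the time, but a single site at depth 15 is weak evidence either way, which is why a global estimate of sample purity is needed.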


Estimating Cellularity
• It is "easy" only if we already know where to look (sites where the disease genotype is AB or BB)

But how do we know the genotype? (even without knowing α?)

1. Use SNP array - ONCOSNP (Yau et al, Genome Biol, 2009), ABSOLUTE (Carter et al, Nature Biotech, 2012)

2. SNP Calling - Snyder et al, PNAS, 2010, PurityEst (Su et al, Bioinformatics, 2012)

Accurate inference in Virmid

Estimate global within-individual contamination for accurate detection of somatic mutations

Bias 1 - Loss of Reads (Virmid)

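[Figure: reads r1 and r2 aligned to the reference (ref) at a site with alleles A and B; g1 and g2 mark the two cases]

$$x_a = p(\text{a read that passes } g_1 \text{ being unmapped}) = p(r_1 \text{ has } d+1 \text{ or more variants in the remaining sites})$$

$$x_b = p(\text{a read that passes } g_2 \text{ being unmapped}) = p(r_2 \text{ has } d \text{ or more variants in the remaining sites})$$

$$x_a = 1 - \sum_{i=0}^{d} \binom{l-1}{i} p^i (1-p)^{l-1-i}$$

$$x_b = 1 - \sum_{i=0}^{d-1} \binom{l-1}{i} p^i (1-p)^{l-1-i}$$

where $d$ = maximum edit distance, $l$ = read length, and $p$ = frequency of variation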
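The two probabilities can be evaluated numerically to see the size of the bias. The sketch below computes x_a and x_b as binomial tail probabilities over the remaining l-1 sites; the parameter values (read length 100, variant frequency 0.001, maximum edit distance 2) are illustrative assumptions, not values from the Virmid paper.

```python
from math import comb


def prob_unmapped(read_length, variant_freq, max_edit_distance, carries_variant):
    """Probability that a read exceeds the aligner's edit-distance limit.

    A read that already carries the variant at the site of interest can tolerate
    one fewer variant among the remaining read_length - 1 sites.
    """
    sites = read_length - 1
    allowed = max_edit_distance - 1 if carries_variant else max_edit_distance
    mapped = sum(
        comb(sites, i) * variant_freq**i * (1 - variant_freq) ** (sites - i)
        for i in range(allowed + 1)
    )
    return 1.0 - mapped


# Illustrative parameters (assumptions, not values from the slides).
x_a = prob_unmapped(100, 0.001, 2, carries_variant=False)
x_b = prob_unmapped(100, 0.001, 2, carries_variant=True)
print(x_a, x_b)
```

Because x_b is larger than x_a, reads carrying the variant are lost more often than reference reads, which is the source of the bias described on this slide.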


Bias 2 - Loss of variants (Virmid)

[Figure: a mixture of reads from normal and reads from disease, with the B-allele reads highlighted; labels α and 1-α]

overestimate BAF

underestimate α

Estimated α

underestimated α

overestimated α


Calling low-fraction somatic mutations in Virmid

Kim S et al, Genome Biology 2013


Low-frequency mutations in disease

Identification of de novo somatic mutations in AKT-MTOR-PIK3CA in hemimegalencephaly

Lee J et al, Nature Genetics, 2012

Low-frequency mutations in disease

Identification of MTOR driver mutations in focal cortical dysplasia

Lim J et al, Nature Medicine 2015

COPY NUMBER VARIATION (CNV)


Copy Number Variation
Changes in copy number of a large DNA segment
- usually in terms of genes
- e.g. HER2 amplification

Types of CNVs
- Copy number gain (CN > 2):
  - Increase of copy number due to genomic rearrangement such as insertion/duplication
- Copy number loss (CN < 2):
  - Decrease of copy number due to genomic rearrangement such as deletion

Copy number aberration (CNA)
- refers to CNV particularly when the events are associated with disease phenotype

Comparative Genome Hybridization (CGH)

500kb-1500kb fragment for optimal hybridization

Array CGH

Resolution

Benefits of NGS-based CNV detection

• High resolution (< 50 bp) in size
• Data reuse (multi-purpose)
  – One NGS (whole-genome) sequencing run can be used for SNV, CNV, and SV detection
• Can be improved with additional NGS information
  – Discordant reads in paired-end sequencing

Inferring CNVs from NGS

• Principle:
  – Samples with copy number gain (or loss) will generate more (or fewer) reads in the region

[Figure: a gene present at 3 copies (gain), 2 copies (normal), and 1 copy (loss)]

The signal

[Figure: reads from the 3-copy (gain), 2-copy (normal), and 1-copy (loss) samples mapped to reference]

catching these needs a systematic approach!

Catching the signal

• Problems
  – Read depth is not uniform even without copy number changes
    • GC bias
    • Mapping bias in repeat regions
    • Natural variance (Poisson distribution)

Poisson distribution:
- Describes the probability of a given number of events occurring in a fixed interval of time and/or space.

Example:
- You have 120 phone calls a day. What is the best way to describe the number of phone calls in an hour?
- Similarly, you generated 100,000,000 NGS reads from the whole genome. What is the number of reads generated within chr1:12781718-12782228?
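Both questions reduce to finding the Poisson rate for the interval of interest. The sketch below works through the two examples under simplifying assumptions: calls arrive evenly over 24 hours, reads are counted by start position, read starts are uniform over a 3-billion-base genome, and mappability is ignored.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate)."""
    return rate**k * exp(-rate) / factorial(k)


# Phone calls: 120 per day spread over 24 hours.
calls_per_hour = 120 / 24

# Reads: 100,000,000 read starts spread over an assumed 3,000,000,000-base genome.
region_length = 12782228 - 12781718 + 1
expected_reads = 100_000_000 * region_length / 3_000_000_000

print(calls_per_hour, poisson_pmf(5, calls_per_hour))
print(region_length, expected_reads)
```

So the hourly call count is modeled as Poisson with rate 5, and the read count in the region as Poisson with a rate of about 17; the observed count will scatter around that value even when the copy number is unchanged.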

Significantly deviated read-depth

• Null hypothesis (H0):
  – copy number of a given region is unchanged
  – we assume the read-depth follows a Poisson distribution
• Alternative hypothesis (Ha):
  – copy number of a given region is changed
• If H0 is right:
  – The read-depth (calculated from the number of reads) within a specific genomic region does not deviate significantly from the Poisson distribution
• If the read-depth deviates too much to be explained by natural variance (Poisson distribution)
  – Copy number has been changed
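The test can be sketched as a two-sided Poisson tail probability: given the read count expected under H0, how surprising is the observed count? This is a bare-bones illustration that ignores GC bias, mappability, and multiple testing; the function names and the doubling of the smaller tail are illustrative choices, not the procedure of a specific CNV caller.

```python
from math import exp


def poisson_cdf(k, rate):
    """P(X <= k) for X ~ Poisson(rate), with k >= 0."""
    term = exp(-rate)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= rate / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def depth_p_value(observed, expected):
    """Two-sided p-value for an observed read count under H0 (copy number unchanged)."""
    lower_tail = poisson_cdf(observed, expected)  # P(X <= observed)
    upper_tail = 1.0 if observed == 0 else 1.0 - poisson_cdf(observed - 1, expected)
    return min(1.0, 2 * min(lower_tail, upper_tail))


expected = 17  # hypothetical expected count for a region under H0
for observed in (17, 40, 3):
    print(observed, depth_p_value(observed, expected))
```

A count near the expectation gives a large p-value (H0 is kept), while a count far above or below gives a very small one, suggesting a gain or a loss.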

Practically, we should consider

• Bias correction from sequence context (GC-bias, etc.)
• Event detection method
  – Whether the significant rise (or drop) of read-depth looks like an event
    • mean-shift technique (CNVnator, Abyzov et al 2011)
    • event-wise testing (Yoon et al, 2009)
    • paired-end signal (CNVer, Medvedev et al 2010)

CNVnator

STRUCTURAL VARIATION (SV)

Beyond the SNVs


TFE3-KHSRP Translocation in Renal Cell Carcinoma

Structural Variations (SVs)

• Genomic rearrangements that affect >50bp of sequence

Alkan et al, Nat. Rev. Genetics 12, 363-376, 2011

List of structural variations


Paired-end sequencing


Paired end reads for SV finding

Donor

Reference

Donor

Reference

Methods for SV detection

• Read depth
  o Assume a random distribution in mapping depth
  o Significantly higher depth for duplicated regions
  o Significantly reduced depth for deleted regions
• Read pair
  o Assess the span and orientation of paired end reads
• Split Read
  o Define breakpoints of SVs using split-sequence-read signature (broken alignment)
• Assembly
  o Assemble and reconstruct the whole genome of sample DNA
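The read-pair method can be sketched as a rule that compares each pair's mapped span and orientation with what the library is expected to produce. The sketch below is a deliberately simplified classifier: the expected insert size (300 bp), its standard deviation (30 bp), the 3-SD cutoff, and the category labels are all assumptions for illustration, and real SV callers require clusters of supporting pairs rather than single pairs.

```python
def classify_pair(insert_size, same_chromosome=True, expected_orientation=True,
                  mean_insert=300.0, sd_insert=30.0, z_cutoff=3.0):
    """Label one read pair by its span and orientation on the reference."""
    if not same_chromosome:
        return "translocation candidate"
    if not expected_orientation:
        return "inversion candidate"
    if insert_size > mean_insert + z_cutoff * sd_insert:
        # Mates map farther apart than expected: the donor lacks sequence present in the reference.
        return "deletion candidate"
    if insert_size < mean_insert - z_cutoff * sd_insert:
        # Mates map closer than expected: the donor carries extra sequence between them.
        return "insertion candidate"
    return "concordant"


examples = [classify_pair(310), classify_pair(1500), classify_pair(120),
            classify_pair(300, expected_orientation=False),
            classify_pair(300, same_chromosome=False)]
print(examples)
```

The choice of cutoff is exactly the "judgment of discordance" problem: a loose cutoff admits noise from the natural spread of insert sizes, while a strict one misses small events.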

Methods for Deletion Detection

Problem 1. Judgment of discordance

Problem 2. Size of insertion

Problem 2. Large indels
- Novel sequence insertion
- Existing sequence insertion

Problem 3. Nonspecific mappings

DISCUSSION


THANK YOU
