next-generation sequencing: from basics to future diagnostics part ii: ngs analysis to find variant...
TRANSCRIPT
Next-generation sequencing: from basics to future diagnostics
PART II: NGS analysis to find variants
Sangwoo Kim, Ph.D.
Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
Overview
• PART I: NGS technologies and standard workflow
– Next generation sequencing
• History and technology
– Data and its meaning; process workflow
– Discussion
• PART II: NGS Analysis to find variants
– NGS analysis to find variants
• Single nucleotide variants (SNVs)
• Copy number variations (CNVs)
• Structural variations (SVs)
• PART III: NGS application to diagnostics
– NGS in genomic medicine
– Potential application to forensic science
FROM PREVIOUS SESSION
Conventional variant calling
Variant calling in minor subgroups
Next-generation sequencing
Metzker et al, Nat Rev Genet, 2010
Massively Parallel Sequencing (a.k.a. Next-generation sequencing)
Illumina HiSeq2500
5500 SOLiD system
Ion Torrent PGM
via spatially separated, clonally amplified DNA templates or single DNA molecules
The human genome project
Began in 1990. The consortium comprised the U.S., U.K., France, Australia, Japan, etc.
“Rough draft” in 2000
“Complete genome” published in 2003
13 years, $3 billion.
The Human Genome Project (1990~2003)
FASTQ format (NGS raw data)
one read
sequence
quality
A format for NGS reads (FASTA + quality)
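A minimal sketch of reading such records, assuming the common four-line FASTQ layout and Phred+33 quality encoding (neither is stated on the slide). The function name `parse_fastq` and the example read are hypothetical.

```python
from io import StringIO

def parse_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()                      # '+' separator line
        quality = handle.readline().rstrip()
        # Assumed Phred+33 encoding: score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical single-read example
example = StringIO("@read1\nGATTCAAA\n+\nIIII5555\n")
records = list(parse_fastq(example))
print(records)
```

Each base gets one quality character, so the sequence and quality lines always have the same length.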
Kim S and Paik S, in preparation
[Figure: NGS analysis workflow for control and disease samples. A. Data generation: sequencing, raw reads (FASTQ files), quality control, short read alignment (BAM files). B. Variant finding: germline mutation, somatic mutation, copy number variation (CNV), structural variation (SV), xenogeneic sequence. C. Variant analysis: recurrence analysis, variant impact prediction, mutation filtration/selection, tumor heterogeneity inference.]
Box 1. Sequencing types and platforms. Depending on the sequencing purpose, various platforms can be considered for optimization. Whole genome sequencing (WGS) allows an inspection of all genomic areas and is applicable to CNV and SV analysis. Whole exome sequencing (WES) interrogates only coding regions (1~2% of the genome) at lower cost and throughput. WGS and WES are frequently used for novel causative variant discovery, and control sample sequencing is generally mandatory. When only limited regions are to be tested (as in a diagnostic kit), a set of targeted genes is amplified and sequenced (targeted/panel sequencing). In this case, the control is usually omitted when the target sites (hotspots) are clear.
D. Validation and functional assessment
variant confirmation
pathway analysis
functional study
DATA PREPROCESSING
Short Read Alignment
Mapping back to genome
TAACACCTGGGAAATTCATCACAAAAAGATCTTAGCCTAGGCACATTGTCATTAGGTTATCCAAAGTTAAGACAAAGGAAAGAATCT-TAAGAGCTGTGAGA
Where is this sequence in human genome?
Do this as fast as possible!
brute force way
The reference genome (chr1, start): T G A C G T G T G A T T C A A A A A A G C
Your query: G A T T C A A A
Slide the query along the reference one position at a time and compare at each offset.
Find “GATTCAAA” in human genome
The reference is very long (3 billion bases)
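The brute-force scan can be sketched as follows. This is only an illustration of the idea on the toy reference from the slide; `naive_find` is a hypothetical name.

```python
def naive_find(reference, query):
    """Return every 0-based offset where query matches reference exactly."""
    hits = []
    for start in range(len(reference) - len(query) + 1):
        # Compare the query against the window starting at this offset
        if reference[start:start + len(query)] == query:
            hits.append(start)
    return hits

reference = "TGACGTGTGATTCAAAAAAGC"
print(naive_find(reference, "GATTCAAA"))  # [8]
```

The work grows with reference length times query length, which is why this approach does not scale to a 3-billion-base genome.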
How fast should it be?
method | time per 1 read (sec) | time per 80x WGS (sec) | is equal to
eyeballing | 3x10^9 | 3.6x10^18 | 1x10^11 yrs
naïve matching | 2400 | 1.2x10^9 | 7,608 yrs
improved algorithm | 3 | 3.6x10^8 | 10 yrs
minimum required | 0.01 | 1.2x10^7 | 11.5 days
desired | 0.001 | 1.2x10^6 | 1.2 days
based on 200 bp read length, 80x single-end WGS
Searching with index
• Assume you’re searching for “genome” in an English dictionary
– You don’t search every line in every page
– You first find the page range of “g” in the dictionary
– in the above range (of ‘g’), you find the page range of “ge” in the dictionary
– in the above range (of ‘ge’), you find the page range of “gen” in the dictionary
– ...
– until you find “genome”
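The dictionary lookup above can be sketched with binary search over a sorted word list. The word list and the function name `narrow_by_prefix` are made up for illustration; the point is only that each extra letter shrinks the current range rather than restarting the search.

```python
from bisect import bisect_left

def narrow_by_prefix(sorted_words, target):
    """Narrow the [lo, hi) range of sorted_words one prefix letter at a time."""
    lo, hi = 0, len(sorted_words)
    ranges = []
    for k in range(1, len(target) + 1):
        prefix = target[:k]
        # Search only inside the range found for the previous, shorter prefix
        lo = bisect_left(sorted_words, prefix, lo, hi)
        hi = bisect_left(sorted_words, prefix + "\uffff", lo, hi)
        ranges.append((prefix, lo, hi))
    return ranges

words = sorted(["apple", "gate", "gene", "genetic", "genome", "genre", "giant", "zebra"])
ranges = narrow_by_prefix(words, "genome")
for prefix, lo, hi in ranges:
    print(prefix, words[lo:hi])
```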
How can we build an index for the genome?
Burrows-Wheeler Transform
Burrows-Wheeler Transformation
BANANA
Burrows-Wheeler Transformation
1. Append “$”, the lexicographically smallest symbol: BANANA$
2. List all rotations of the string (rotation index, rotation):
0 BANANA$
1 ANANA$B
2 NANA$BA
3 ANA$BAN
4 NA$BANA
5 A$BANAN
6 $BANANA
3. Sort the rotations (sorted rank, original rotation index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
4. Take the last column: ANNB$AA
BWT(“BANANA$”) = “ANNB$AA”
1. BWT only changes the order of the characters in the string
2. BWT tends to collect similar characters together
3. With only the transformed string, we can easily get the original string
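The construction above, and the claim that the original string can be recovered from the transformed one, can be checked with a small sketch. It sorts all rotations explicitly, which is fine for a toy string but is not how genome-scale indexes are built; `bwt` and `inverse_bwt` are hypothetical names.

```python
def bwt(text):
    """Burrows-Wheeler transform of a '$'-terminated string via sorted rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column):
    """Rebuild the original string by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The original string is the row that ends with the terminator
    return next(row for row in table if row.endswith("$"))

print(bwt("BANANA$"))          # ANNB$AA
print(inverse_bwt("ANNB$AA"))  # BANANA$
```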
LF Search
Sorted rotations of BANANA$ (sorted rank, original rotation index, rotation):
0 6 $BANANA
1 5 A$BANAN
2 3 ANA$BAN
3 1 ANANA$B
4 0 BANANA$
5 4 NA$BANA
6 2 NANA$BA
Question: Find “NAN” in BANANA
The pattern is matched backward, from its last character to its first: N, then AN, then NAN.
LF Search, step 1: N
The range of rows that start with “N” can be calculated from:
• the number of symbols that are lexicographically less than ‘N’, to determine the start point: = 5
• the number of ‘N’, to determine the end point: = 2, so the range is rows 5 to 6
LF Search, step 2: AN
The range of rows that start with “AN” cannot be taken from the first column alone:
• the number of symbols that are lexicographically less than ‘A’ (= 1) and the number of ‘A’ (= 3) give rows 1 to 3
• this is the range for ‘A’, not for ‘AN’!
Instead, count ‘A’ in the last column relative to the current range for “N” (rows 5 to 6):
• start point = the number of symbols that are lexicographically less than ‘A’ + the number of ‘A’ in the last column before the start point = 1 + 1 = 2
• end point: the number of ‘A’ in the last column before the end point = 3, so the range is rows 2 to 3
• the one ‘A’ counted before the start point belongs to a row “Ax” that is not “AN” and sorts before “AN”
LF Search, step 3: NAN
The range of rows that start with “NAN” can be calculated from:
• start point = the number of symbols that are lexicographically less than ‘N’ + the number of ‘N’ in the last column before the start point = 5 + 1 = 6
• end point: the number of ‘N’ in the last column before the end point = 2, so the range is the single row 6
Result: row 6 (NANA$BA) is rotation 2 of the original string, so “NAN” occurs at the 3rd position of “BANANA”.
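The backward search walked through above can be sketched as follows. This is a toy version under stated assumptions: the index is built by sorting suffixes directly and occurrences are counted by scanning the last column, whereas real aligners precompute compact tables. The names `build_fm_index` and `backward_search` are hypothetical.

```python
def build_fm_index(text):
    """Return (suffix array, last column, first-row-of-each-symbol) for text + '$'."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = "".join(text[i - 1] for i in suffix_array)
    first_row = {}
    for rank, i in enumerate(suffix_array):
        # Rank of the first row starting with this symbol
        # (= number of symbols that are lexicographically smaller)
        first_row.setdefault(text[i], rank)
    return suffix_array, last_column, first_row

def backward_search(text, pattern):
    """Return sorted 0-based offsets of exact matches, matching the pattern right to left."""
    suffix_array, last_column, first_row = build_fm_index(text)
    lo, hi = 0, len(last_column)            # half-open row range [lo, hi)
    for symbol in reversed(pattern):
        if symbol not in first_row:
            return []
        # Count of this symbol in the last column before each boundary
        lo = first_row[symbol] + last_column[:lo].count(symbol)
        hi = first_row[symbol] + last_column[:hi].count(symbol)
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])

print(backward_search("BANANA", "NAN"))  # [2], i.e. the 3rd position
print(backward_search("BANANA", "ANA"))  # [1, 3]
```

The number of loop iterations depends on the pattern length only, not on the reference length.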
Genome Informatics I (2015 Spring)
Genome query
imported from Mike Schatz’s slides: http://schatzlab.cshl.edu/teaching/2010/Lecture%202%20-%20Sequence%20Alignment.pdf
Inexact matching
Reference: T G A C G T G T G A T T C A A A A A A G C
Query: G A T T G A A A
When an exact match does not exist:
• continue with the other possible candidates (G -> A, C, T) and increase the mismatch count
• if another mismatch occurs, branch out again
• so the allowed edit distance is critical to alignment speed
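A minimal sketch of this branching, assuming substitutions only (no insertions or deletions) and a toy index built by sorting suffixes. The names `build_fm_index` and `inexact_search` are hypothetical and this is not the implementation of any particular aligner.

```python
def build_fm_index(text):
    """Return (suffix array, last column as a list, first row of each symbol)."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last_column = [text[i - 1] for i in suffix_array]
    first_row = {}
    for rank, i in enumerate(suffix_array):
        first_row.setdefault(text[i], rank)
    return suffix_array, last_column, first_row

def inexact_search(text, pattern, max_mismatches):
    """Return {offset: mismatches used} for hits within the mismatch budget."""
    suffix_array, last_column, first_row = build_fm_index(text)
    alphabet = [symbol for symbol in first_row if symbol != "$"]
    hits = {}

    def extend(i, lo, hi, budget):
        if i < 0:                                  # whole pattern consumed
            used = max_mismatches - budget
            for offset in suffix_array[lo:hi]:
                hits[offset] = min(used, hits.get(offset, used))
            return
        for symbol in alphabet:                    # branch over every candidate base
            cost = 0 if symbol == pattern[i] else 1
            if cost > budget:
                continue
            new_lo = first_row[symbol] + last_column[:lo].count(symbol)
            new_hi = first_row[symbol] + last_column[:hi].count(symbol)
            if new_lo < new_hi:
                extend(i - 1, new_lo, new_hi, budget - cost)

    extend(len(pattern) - 1, 0, len(last_column), max_mismatches)
    return hits

reference = "TGACGTGTGATTCAAAAAAGC"
print(inexact_search(reference, "GATTGAAA", 0))  # {}
print(inexact_search(reference, "GATTGAAA", 1))  # {8: 1}
```

Every extra allowed mismatch multiplies the number of branches explored, which is the speed cost the slide refers to.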
Goal achieved
method | time per 1 read (sec) | time per 80x WGS (sec) | is equal to
eyeballing | 3x10^9 | 3.6x10^18 | 1x10^11 yrs
naïve matching | 2400 | 1.2x10^9 | 7,608 yrs
improved algorithm | 3 | 3.6x10^8 | 10 yrs
minimum required | 0.01 | 1.2x10^7 | 11.5 days
desired | 0.001 | 1.2x10^6 | 1.2 days
VARIANT CALLING – SNV CALLING
SNV calling
Detailed View
one read = one DNA fragment aligned to a specific genomic region
= observation of our sample in this region (1 time)
A genome region
Detailed View
A—AAAACAAAAC
A certain genomic position (in bp)
reference allele
observation of our sample at this position from read 1
observation of our sample at this position from read 2
observation of our sample at this position from read 10
Why multiple observations?
• Observations contain errors
– errors from machine
• basecall error
– errors from mapping
• mapping error
– errors from others
• library prep error
• With accuracy of 99%...
– 1% error from whole region
– leads to
• ~30 million false SNPs for whole genome
• ~500k false SNPs for whole exome
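The two figures above follow from simple arithmetic. The sketch below assumes a 3-billion-base genome (as stated earlier in the slides) and an exome of about 50 Mb, which is an assumption chosen to be consistent with the 1~2% figure and is not given in the source.

```python
GENOME_BP = 3_000_000_000   # genome size used earlier in the slides
EXOME_BP = 50_000_000       # assumed exome size (about 1.7% of the genome)
ERROR_RATE = 0.01           # 99% accuracy means 1% of single observations are wrong

# Expected number of positions miscalled if each position is observed only once
false_snps_genome = round(GENOME_BP * ERROR_RATE)
false_snps_exome = round(EXOME_BP * ERROR_RATE)
print(false_snps_genome, false_snps_exome)
```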
Human diploid genome
[Figure: read pileups at a site with reference allele G for three genotypes: homozygous reference (G/G), heterozygous alternative (G/A), and homozygous alternative (A/A). A single A among G reads is labeled as a sequencing error / mapping error; a small fraction of A reads is labeled as somatic mutations.]
Allele fraction distribution (binomial)
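Normal approximation of B(100, 0.5), with \mu = np = 50 and \sigma = \sqrt{np(1-p)} = 5:
\[ \Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.9973 \]
\[ \Pr(35 \le x \le 65) \approx 0.9973 \]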
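As a rough check of the three-sigma interval, the exact binomial probability can be summed directly. This is only a sketch; `binom_pmf` is a hypothetical helper, and the exact sum is expected to be close to, but not identical with, the normal approximation.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.5
mu = n * p
sigma = (n * p * (1 - p)) ** 0.5
lo, hi = int(mu - 3 * sigma), int(mu + 3 * sigma)

# Exact probability that the alternative-allele read count falls in [lo, hi]
exact = sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))
print(lo, hi, exact)
```

With 100 reads at a heterozygous site, an alternative-allele count outside 35 to 65 is therefore very unlikely to arise from sampling alone.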
Inferring mutations
[Figure: observations of the donor genome at one site (GAGAGGGGGAAAGAGA) compared with the reference allele]
Probability of observing “G” at the site of “G” (A = reference allele, B = alternative allele):
• True genotype = “AA” and no sequencing error
• True genotype = “AB” and
– Read was generated from ‘A’ allele and no sequencing error
– Read was generated from ‘B’ allele and sequencing error and ‘A’ was generated by chance
• True genotype = “BB” and sequencing error
Inferring mutations
[Figure: observations of the donor genome at one site (GAGAGGGGGAAAGAGA) compared with the reference allele]
Probability of observing “A” at the site of “G” (A = reference allele, B = alternative allele):
• True genotype = “AA” and sequencing error, P(e)
• True genotype = “AB” and
– Read was generated from ‘A’ allele and sequencing error and ‘B’ was generated by chance
– Read was generated from ‘B’ allele and no sequencing error
• True genotype = “BB” and no sequencing error
Genotype determination
• L(g=AA|D)
• L(g=AB|D)
• L(g=BB|D)
Likelihood that the genotype is wild-type given the observation!
Likelihood that the genotype is mutant given the observation!
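The three likelihoods can be sketched under a simple error model. The assumptions here are not from the slides: a fixed per-base error rate of 1%, errors producing each of the three other bases with equal chance, and independent reads. The function name `genotype_log_likelihoods` is hypothetical.

```python
from math import log

def genotype_log_likelihoods(observations, ref, alt, error=0.01):
    """Log-likelihood of genotypes AA, AB, BB given the observed bases at one site."""
    het = 0.5 * (1 - error) + 0.5 * error / 3      # either allele, read correctly or by chance
    p_ref = {"AA": 1 - error, "AB": het, "BB": error / 3}
    p_alt = {"AA": error / 3, "AB": het, "BB": 1 - error}
    result = {}
    for genotype in ("AA", "AB", "BB"):
        total = 0.0
        for base in observations:
            if base == ref:
                total += log(p_ref[genotype])
            elif base == alt:
                total += log(p_alt[genotype])
            else:
                total += log(error / 3)            # a third base can only be an error
        result[genotype] = total
    return result

# Observations taken from the slide: 9 G and 7 A at a site whose reference allele is G
likelihoods = genotype_log_likelihoods("GAGAGGGGGAAAGAGA", ref="G", alt="A")
best = max(likelihoods, key=likelihoods.get)
print(likelihoods, best)
```

The genotype with the highest likelihood is reported; real callers also weigh in prior genotype frequencies and per-base qualities.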
Tools
SOMATIC MUTATIONS
Germline vs. Somatic mutation
sample from non-disease site
sample from disease site
reference sequence (e.g. hg19)
• UnifiedGenotyper
• VarScan2
• SomaticSniper
• …
An easy way to find somatic mutations
sample from non-disease site
sample from disease site
GN=AA
GT=AB
Joint Probabilities
• P(GT=AB|GN=AA) ≠ P(GT=AB|GN=AB) ≠ P(GT=AB|GN=BB)
Tumor genotype is dependent on normal genotype!
G: Joint Genotype Matrix
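One way to picture the joint genotype matrix is a 3x3 table of posterior probabilities over (GN, GT). The sketch below is illustrative only: the germline prior, the somatic change rate, and the 1% error rate are all made-up values, and `joint_posterior` is a hypothetical name rather than the method of any tool named in these slides.

```python
from math import exp, log

GENOTYPES = ("AA", "AB", "BB")

def log_likelihood(n_ref, n_alt, genotype, error=0.01):
    """Log-likelihood of the read counts under one genotype (simple error model)."""
    het = 0.5 * (1 - error) + 0.5 * error / 3
    p_ref = {"AA": 1 - error, "AB": het, "BB": error / 3}[genotype]
    p_alt = {"AA": error / 3, "AB": het, "BB": 1 - error}[genotype]
    return n_ref * log(p_ref) + n_alt * log(p_alt)

def joint_posterior(normal_counts, tumor_counts, somatic_rate=1e-3):
    """Posterior over (GN, GT); the tumor genotype prior depends on the normal genotype."""
    germline_prior = {"AA": 0.998, "AB": 0.001, "BB": 0.001}   # assumed values
    log_joint = {}
    for gn in GENOTYPES:
        for gt in GENOTYPES:
            # P(GT | GN): the tumor usually keeps the normal genotype
            transition = 1 - somatic_rate if gt == gn else somatic_rate / 2
            log_joint[(gn, gt)] = (log(germline_prior[gn])
                                   + log_likelihood(*normal_counts, gn)
                                   + log(transition)
                                   + log_likelihood(*tumor_counts, gt))
    top = max(log_joint.values())
    weights = {pair: exp(value - top) for pair, value in log_joint.items()}
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}

# Counts are (reference reads, alternative reads)
posterior = joint_posterior(normal_counts=(20, 0), tumor_counts=(10, 10))
best_pair = max(posterior, key=posterior.get)
print(best_pair)
```

Summing the cells where GT differs from GN gives a single somatic probability for the site.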
WHEN SAMPLE IS NOT PURE
Heterogeneous Sample
[Figure: a sample that mixes normal cells (genotype G/G) and tumor cells (genotype G/A); the pooled reads at the site are mostly G with a few A]
Causes of low-frequency mutations
• Sample contamination (e.g. stromal cells)
• Tumor heterogeneity
• Extreme environments
• Somatic mosaicism
Heterogeneous Sample
[Figure: 15 reads at one site, 13 G and 2 A]
“2/15: No mutation. Two ‘A’s are from sequencing errors...”
VS
“2/15: Heterozygous somatic mutation!! The sample is certainly heterogeneous!”
“How do we know this?”
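The two readings of 2/15 can be compared numerically. The sketch assumes a 1% error rate and, for the mutation case, that a heterozygous mutation in a tumor fraction f yields an expected alternative-allele fraction of f/2 (errors ignored); both are simplifying assumptions, and the helper names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k alternative reads out of n."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_reads, n_alt = 15, 2
error = 0.01                                   # assumed per-base error rate

# Reading 1: no mutation, both A reads are sequencing errors
like_error_only = binom_pmf(n_alt, n_reads, error)

# Reading 2: heterozygous somatic mutation present in a fraction of the cells
def like_het_somatic(tumor_fraction):
    return binom_pmf(n_alt, n_reads, tumor_fraction / 2)

best_fraction = max((f / 100 for f in range(1, 101)), key=like_het_somatic)
print(like_error_only, like_het_somatic(best_fraction), best_fraction)
```

The comparison only becomes meaningful once the tumor fraction is known or estimated, which motivates the cellularity estimation that follows.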
Estimating Cellularity
• It is “easy” only if we already know where to look (sites where the disease genotype is AB or BB)
But how do we know the genotype, even without knowing α?
1. Use SNP array: ONCOSNP (Yau et al, Genome Biol, 2009), Absolute (Carter et al, Nature Biotech, 2012)
2. SNP calling: Snyder et al, PNAS, 2010; PurityEst (Su et al, Bioinformatics, 2012)
Accurate inference in Virmid
Estimate global within-individual contamination for accurate detection of somatic mutations
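Bias 1 - Loss of Reads (Virmid)
[Figure: reads r1 and r2 aligned to the reference (ref) across sites g1 and g2, carrying the A and B alleles]
\[ x_a = p(\text{a read that passes } g_1 \text{ being unmapped}) = p(r_1 \text{ has } d+1 \text{ or more variants in the remaining sites}) \]
\[ x_b = p(\text{a read that passes } g_2 \text{ being unmapped}) = p(r_2 \text{ has } d \text{ or more variants in the remaining sites}) \]
\[ x_a = 1 - \sum_{i=0}^{d} \binom{l-1}{i} p^i (1-p)^{l-1-i} \]
\[ x_b = 1 - \sum_{i=0}^{d-1} \binom{l-1}{i} p^i (1-p)^{l-1-i} \]
where d = maximum edit distance, l = read length, and p = frequency of variation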
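The two sums can be evaluated directly to see how much more often a variant-carrying read is lost. This is a sketch of the formulas only; the parameter values (read length 100, maximum edit distance 2, variation frequency 0.001) are arbitrary examples and `p_unmapped` is a hypothetical name.

```python
from math import comb

def p_unmapped(read_length, max_edit, p_var, carries_variant):
    """Probability that a read exceeds the edit-distance limit and goes unmapped."""
    # A read carrying the B allele has already spent one edit at the site itself
    allowed = max_edit - 1 if carries_variant else max_edit
    n = read_length - 1                         # the remaining sites
    p_mapped = sum(comb(n, i) * p_var**i * (1 - p_var)**(n - i)
                   for i in range(allowed + 1))
    return 1 - p_mapped

x_a = p_unmapped(100, 2, 0.001, carries_variant=False)
x_b = p_unmapped(100, 2, 0.001, carries_variant=True)
print(x_a, x_b)
```

Because x_b is larger than x_a, reads supporting the B allele are lost preferentially.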
Bias 2 - Loss of variants (Virmid)
[Figure: reads from normal and reads from disease, with the B-allele and the fractions α and 1-α marked; labels: overestimate BAF, underestimate α]
Estimated α
[Figure: estimated α, with regions labeled underestimated α and overestimated α]
Calling low-fraction somatic mutations in Virmid
Kim S et al, Genome Biology 2013
Low-frequency mutations in disease
Identification of de novo somatic mutations in AKT-MTOR-PIK3CA in hemimegalencephaly
Lee J et al, Nature Genetics, 2012
Low-frequency mutations in disease
Lim J et al, Nature Medicine 2015
Identification of MTOR driver mutations in focal cortical dysplasia
COPY NUMBER VARIATION (CNV)
Copy Number Variation
Changes in copy number of a large DNA segment
- usually in terms of genes
- e.g. HER2 amplification
Types of CNVs
- Copy number gain (CN > 2):
- Increase of copy number due to genomic rearrangements such as insertion/duplication
- Copy number loss (CN < 2):
- Decrease of copy number due to genomic rearrangements such as deletion
Copy number aberration (CNA)
- refers to CNV particularly when the events are associated with a disease phenotype
Comparative Genome Hybridization (CGH)
500kb-1500kb fragment for optimal hybridization
Array CGH
Resolution
Benefits of NGS-based CNV detection
• High resolution (< 50 bp) in size
• Data reuse (multi-purpose)
– One NGS (whole-genome) sequencing run can be used for SNV, CNV, and SV detection
• Can be improved with additional NGS information
– Discordant reads in paired-end sequencing
Inferring CNVs from NGS
• Principle:
– Samples with copy number gain (or loss) will generate more (or fewer) reads in the region
[Figure: a gene at 3 copies (gain), 2 copies (normal), and 1 copy (loss)]
The signal
[Figure: reads from 3 copies (gain), 2 copies (normal), and 1 copy (loss), mapped to the reference]
Catching these differences needs a systematic approach!
Catching the signal
• Problems
– Read depth is not uniform even without copy number changes
• GC bias
• Mapping bias in repeat region
• Natural variance (Poisson distribution)
Poisson distribution:
- The probability of a given number of events occurring in a fixed interval of time and/or space.
Example:
- You have 120 phone calls a day. What is the best way to describe the number of phone calls in an hour?
- Similarly, you generated 100,000,000 NGS reads from the whole genome. What is the number of reads generated within chr1:12781718-12782228?
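Both examples reduce to a Poisson rate. The sketch below assumes calls are spread evenly over 24 hours, a 3-billion-base genome, and that each read is counted once at a uniformly random position; these are simplifications for illustration and `poisson_pmf` is a hypothetical helper.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return exp(-lam) * lam**k / factorial(k)

# Phone calls: 120 per day spread over 24 hours
calls_per_hour = 120 / 24

# Reads: 100,000,000 reads over a 3,000,000,000 bp genome, in one small region
region_length = 12782228 - 12781718 + 1
expected_reads = 100_000_000 * region_length / 3_000_000_000

print(calls_per_hour, region_length, expected_reads)
print(poisson_pmf(5, calls_per_hour))   # chance of exactly 5 calls in a given hour
```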
Significantly deviated read-depth
• Null hypothesis (H0):
– copy number of a given region is unchanged
– we assume the read-depth follows a Poisson distribution
• Alternative hypothesis (Ha):
– copy number of a given region is changed
• If H0 is right:
– The read-depth (calculated from the number of reads) within a specific genomic region does not deviate significantly from the Poisson distribution
• If the read-depth deviates too much to be explained by natural variance (Poisson distribution):
– Copy number has been changed
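The test above can be sketched as a pair of one-sided Poisson tail probabilities. The significance threshold of 0.001 and the function names are assumptions for illustration; real tools first correct for GC and mapping bias, which this sketch ignores.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with mean lam."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

def call_copy_number(observed, expected, alpha=0.001):
    """Reject H0 (unchanged copy number) when the read count is too extreme."""
    p_gain = 1 - poisson_cdf(observed - 1, expected)   # P(X >= observed)
    p_loss = poisson_cdf(observed, expected)           # P(X <= observed)
    if p_gain < alpha:
        return "gain"
    if p_loss < alpha:
        return "loss"
    return "unchanged"

# Expected read count of 17 is an example value for a small region
for count in (3, 17, 40):
    print(count, call_copy_number(count, expected=17))
```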
Practically, we should consider
• Bias correction from sequence context (GC-bias, etc.)
• Event detection method
– If the significant rise (or drop) of read-depth looks like an event
• mean-shift technique (CNVnator, Abyzov et al 2011)
• event-wise testing (Yoon et al, 2009)
• paired-end signal (CNVer, Medvedev et al 2010)
CNVnator
STRUCTURAL VARIATION (SV)
Beyond the SNVs
TFE3-KHSRP Translocation in Renal Cell Carcinoma
Structural Variations (SVs)
• Genomic rearrangements that affect >50bp of sequence
Alkan et al, Nat. Rev. Genetics 12, 363-376, 2011
List of structural variations
Paired-end sequencing
Bix Seminar UCSD
Paired end reads for SV finding
Donor
Reference
Donor
Reference
Methods for SV detection
• Read depth
o Assume a random distribution in mapping depth
o Significantly higher depth for duplicated regions
o Significantly reduced depth for deleted regions
• Read pair
o Assess the span and orientation of paired end reads
• Split Read
o Define breakpoints of SVs using split-sequence-read signature (broken alignment)
• Assembly
o Assemble and reconstruct the whole genome of sample DNA
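The read-pair method can be sketched as a simple rule on span and orientation. Everything here is an illustrative assumption: the "FR" label for the expected forward/reverse orientation, the three-standard-deviation cutoff, and the function name `classify_pair` are not taken from any specific SV caller.

```python
def classify_pair(span, orientation, mean_insert, sd_insert, n_sd=3):
    """Label one mapped read pair as concordant or as a candidate SV signature."""
    if orientation != "FR":
        return "orientation"      # unexpected orientation: inversion or other rearrangement
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"         # mates map too far apart on the reference
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"        # mates map too close together on the reference
    return "concordant"

# Hypothetical library: mean insert size 400 bp, standard deviation 30 bp
print(classify_pair(410, "FR", 400, 30))
print(classify_pair(2500, "FR", 400, 30))
print(classify_pair(150, "FR", 400, 30))
print(classify_pair(400, "FF", 400, 30))
```

Where the cutoff is placed decides which pairs count as discordant, which is the judgment problem raised later in the slides.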
Methods for Deletion Detection
Problem 1. Judgment of discordance
Problem 2. Size of insertion
Problem 2. Large indels: novel sequence insertion
Problem 2. Large indels: existing sequence insertion
Problem 3. Nonspecific Mappings
DISCUSSION
THANK YOU