a robust framework for detecting structural variations
1
A Robust Framework for Detecting Structural Variations
February 6, 2008
Seunghak Lee1, Elango Cheran1, and Michael Brudno1
1University of Toronto, Canada
2
What are structural variations? (1)
10^3 – 10^6 basepair variations in the genome
Insertion: a large consecutive fragment of DNA is inserted
Deletion: a large consecutive fragment of DNA is deleted
Inversion: a large consecutive fragment of DNA is inverted
Translocation: a large consecutive fragment of DNA is moved from one chromosome to another.
Copy number variations
3
What are structural variations? (2)
Various examples of structural variations
4
Outline
Introduction
• Type of Structural Variations
• Sequencing Approaches to Detect Structural Variations
• Motivation & Research Objectives
Probabilistic Framework for Detecting Structural Variations
• Probabilistic Framework
• Flow of our Framework
• Hierarchical Clustering of Matepairs (2nd phase)
• Choosing a Unique Mapped Location for Each Matepair (3rd phase)
Experiments
• Comparison with Three Previous Studies
• DMBT1 Gene for Deletion
• Centromere and Translocations
Conclusions
5
Type of Structural Variations (1)
Insertion
[Figure: insertion, with tracks labeled A and REF]
6
Type of Structural Variations (2)
Deletion
[Figure: deletion, with tracks labeled A and REF]
7
Type of Structural Variations (3)
Inversion
[Figure: inversion, with tracks labeled A and REF and 5' to 3' orientation marks]
8
Type of Structural Variations (4)
Translocation
[Figure: translocation between chr1 and chr2]
9
Sequencing Approaches
1. “Fine-scale structural variation of the human genome” [Tuzun et al., 2005]
• Mapping matepairs onto the reference genome
• Insertion and deletion: inconsistent mapped distance
• Inversion: the same orientation of both reads
2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al., 2007]
• Proposed a high-throughput, massive paired-end mapping technique
• Detailed types of structural variations
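The mapping signatures listed above can be sketched as a simple rule-based classifier. This is only an illustration, not the method of either cited paper: the `MappedMatepair` record, the `classify` function, and the cutoff of two standard deviations around the mean insert size are all assumptions made for the example, as is the cross-chromosome rule for translocations.

```python
from dataclasses import dataclass


@dataclass
class MappedMatepair:
    """Hypothetical record: one mapped location per read of a matepair."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def classify(mp, mean_insert, sd_insert, n_sd=2.0):
    """Label a matepair by its mapping signature (assumed cutoff: n_sd * sd)."""
    if mp.chrom1 != mp.chrom2:
        return "translocation"   # reads land on different chromosomes
    if mp.strand1 == mp.strand2:
        return "inversion"       # both reads have the same orientation
    distance = abs(mp.pos2 - mp.pos1)
    if distance > mean_insert + n_sd * sd_insert:
        return "deletion"        # reads map too far apart on the reference
    if distance < mean_insert - n_sd * sd_insert:
        return "insertion"       # reads map too close together on the reference
    return "concordant"


example = MappedMatepair("chr1", 1000, "+", "chr1", 61000, "-")
print(classify(example, mean_insert=40000, sd_insert=2000))
```

A mapped distance that is too large suggests sequence missing from the donor (deletion), while one that is too small suggests extra donor sequence (insertion).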
10
Motivation & Research Objectives (1)
Tuzun et al. used scores that combine several factors (e.g. length, identity, and quality of the sequences).
How can we map reads onto the reference genome?
11
Motivation & Research Objectives (2)
Sequencing methods are effective for detecting structural variants, as shown by Tuzun et al. and Korbel et al.
However, each read can have multiple mappings, and previous research used a priori mapped locations.
Why not develop a probabilistic model without such assumptions? Ideally, it could also be applied to short reads from NGS machines.
12
Probabilistic Framework (1)
p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes
We use p(Y) as the basis of our probabilistic framework
13
Probabilistic Framework (2)
Insertion
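$X_1, X_2$ = matepairs 1, 2; $Y$ = random variable for mapped distances of “uniquely mapped” matepairs
$P(X_i, X_j \mid \mathrm{ins}=r) = P(X_i \mid \mathrm{ins}=r)\,P(X_j \mid \mathrm{ins}=r)$
$P(X_i \mid \mathrm{ins}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s+r)|$ and $s$ = mapped distance
[Figure: density $p(Y)$ with $\mu_Y - \delta$ marked and the case $\mu_Y = s + r$ labeled]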
14
Probabilistic Framework (3)
Deletion
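$P(X_i, X_j \mid \mathrm{del}=r) = P(X_i \mid \mathrm{del}=r)\,P(X_j \mid \mathrm{del}=r)$
$P(X_i \mid \mathrm{del}=r) = 1 - P(\mu_Y - \delta \le Y \le \mu_Y + \delta)$, where $\delta = |\mu_Y - (s-r)|$ and $s$ = mapped distance
[Figure: density $p(Y)$ with $\mu_Y - \delta$ marked and the case $\mu_Y = s - r$ labeled]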
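The insertion and deletion probabilities can be illustrated with a short sketch. It assumes that p(Y) is a normal distribution with mean `mu_y` and standard deviation `sigma_y`; the slides only say that p(Y) comes from uniquely mapped matepairs, so the normal shape, the function names, and the numbers in the usage line are all assumptions for illustration.

```python
from statistics import NormalDist


def indel_probability(s, r, mu_y, sigma_y, kind="ins"):
    """P(X | ins=r) or P(X | del=r) for a matepair with mapped distance s.

    Assumption: p(Y) is Normal(mu_y, sigma_y).
    """
    implied_size = s + r if kind == "ins" else s - r
    delta = abs(mu_y - implied_size)
    p_y = NormalDist(mu_y, sigma_y)
    # Mass of p(Y) inside [mu_y - delta, mu_y + delta]
    central_mass = p_y.cdf(mu_y + delta) - p_y.cdf(mu_y - delta)
    return 1.0 - central_mass


def pair_probability(s_i, s_j, r, mu_y, sigma_y, kind="ins"):
    """P(X_i, X_j | event=r), using the factorization from the slides."""
    return (indel_probability(s_i, r, mu_y, sigma_y, kind)
            * indel_probability(s_j, r, mu_y, sigma_y, kind))


# Two matepairs mapped 30 kb and 31 kb apart, tested against a 10 kb insertion
print(pair_probability(30000, 31000, 10000, mu_y=40000, sigma_y=2000))
```

The probability is 1 when the implied insert size equals the mean exactly (δ = 0) and falls toward 0 as the implied size moves into the tails of p(Y).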
15
Probabilistic Framework (4)
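Inversion
$c - d = s(X_1) - s(X_2)$, where $s(X_i)$ = insert size of $X_i$
$P(X_i, X_j \mid \mathrm{inv}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1-Y_2|} - (c - d)|$
[Figure: density $p(|Y_1 - Y_2|)$ with $\mu_{|Y_1-Y_2|} - \delta$ marked]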
16
Probabilistic Framework (5)
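Translocation
$(c - a) - (d - b) = s(X_1) - s(X_2)$, where $s(X_i)$ = insert size of $X_i$
$P(X_i, X_j \mid \mathrm{trans}) = 1 - P(\mu_{|Y_1-Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1-Y_2|} + \delta)$, where $\delta = |\mu_{|Y_1-Y_2|} - ((c - a) - (d - b))|$
[Figure: density $p(|Y_1 - Y_2|)$ with $\mu_{|Y_1-Y_2|} - \delta$ marked]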
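The inversion and translocation probabilities share one form, which the sketch below evaluates. It assumes that Y1 and Y2 are independent draws from a normal p(Y) with standard deviation `sigma_y`, so that |Y1 - Y2| is half-normal with CDF erf(x / (2·sigma_y)) and mean 2·sigma_y / sqrt(pi). That distributional choice and the function names are assumptions for illustration; the slides do not state how p(|Y1 - Y2|) is obtained.

```python
import math


def abs_diff_cdf(x, sigma_y):
    """CDF of |Y1 - Y2| when Y1, Y2 are independent Normal(mu, sigma_y)."""
    if x <= 0:
        return 0.0
    return math.erf(x / (2.0 * sigma_y))


def rearrangement_probability(size_diff, sigma_y):
    """P(X_i, X_j | inv) or P(X_i, X_j | trans).

    size_diff is (c - d) for an inversion or (c - a) - (d - b) for a
    translocation, i.e. the implied difference s(X_1) - s(X_2).
    """
    mu_abs = 2.0 * sigma_y / math.sqrt(math.pi)   # mean of |Y1 - Y2|
    delta = abs(mu_abs - size_diff)
    central_mass = (abs_diff_cdf(mu_abs + delta, sigma_y)
                    - abs_diff_cdf(mu_abs - delta, sigma_y))
    return 1.0 - central_mass


print(rearrangement_probability(3000, sigma_y=2000))    # plausible pairing
print(rearrangement_probability(50000, sigma_y=2000))   # implausible pairing
```

Two matepairs that explain the same event should imply nearly equal insert sizes, so a large implied difference drives the probability toward 0.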
17
Flow of our Framework (1)
1. Preprocessing step
• Get top K mappings
• Remove short mappings
• Make all possible combinations of mappings
• Discard matepairs consistent with insert size
• Remove invalid strands (-,+)
• Remove very similar mappings
• Mask repeats
18
Flow of our Framework (2)
2. Clustering
• Do hierarchical clustering for each structural variation (Insertion, Deletion, Inversion, Translocation)
3. Finding structural variations
• Find initial configuration in greedy manner
• Parameter learning for the objective function
• Find a local optimum configuration
19
Hierarchical Clustering (1)
(ex) Insertion
[Figure: matepairs X1 and X2 shown on tracks A and REF, forming cluster C = {X1, X2}]
• Cluster, C, is a set of matepairs explaining the same structural variation
• Linkage distance: D(X1, X2) = - ln P(X1, X2|C)
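As a small worked example of the linkage distance, the sketch below computes D(X1, X2) = -ln P(X1, X2|C) from two per-matepair probabilities. It assumes the joint probability factorizes into the product of the two, as in the insertion and deletion slides; the function name and the input values are hypothetical.

```python
import math


def linkage_distance(p_i, p_j):
    """D(X_i, X_j) = -ln P(X_i, X_j | C), assuming P factorizes as p_i * p_j."""
    if p_i <= 0.0 or p_j <= 0.0:
        return math.inf   # the pair cannot be explained by the same event
    return -(math.log(p_i) + math.log(p_j))


# Matepairs that the cluster hypothesis explains well are close together
print(linkage_distance(0.9, 0.8))
# A poorly explained matepair pushes the pair far apart
print(linkage_distance(0.9, 0.001))
```

Because the distance is a negative log probability, merging the closest pair first corresponds to merging the pair most likely to come from the same structural variation.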
20
Hierarchical Clustering (2)
Generally, linkage distance is given by,
We do hierarchical clustering for each structural variation.
21
Choosing a Unique Mapped Location (1)
Each matepair must be assigned to a unique pair of BLAT hits and a unique cluster.
[Figure: reads R1 and R2 with BLAT hits 1 to 5, candidate mappings M1,4, M2,4, and M3,5, and clusters C1 and C2]
22
Choosing a Unique Mapped Location (2)
We define an objective function J(ω)
ƒ1 corresponds to BLAT hit scores
ƒ2 corresponds to the probability
ƒ3 corresponds to the size of clusters
23
Choosing a Unique Mapped Location (3)
Find the initial configuration greedily
Learn parameters for the objective function J(ω). We used hill climbing search to maximize the log likelihood of P(ω|λi)
Finally, find a configuration that locally maximizes J(ω) using hill climbing search
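The greedy start and the final hill climbing search can be sketched as follows. The exact form of J(ω) is not given in this transcript, so the sketch assumes a weighted sum of a BLAT hit score, a log probability, and the squared cluster sizes. The candidate tuples, the weights, and all function names are hypothetical, and parameter learning is skipped by treating the weights as given.

```python
def objective(config, candidates, weights):
    """Assumed J: w1*hit score + w2*log probability + w3*sum(cluster size^2)."""
    w1, w2, w3 = weights
    total, sizes = 0.0, {}
    for mp, idx in config.items():
        cluster, hit_score, log_prob = candidates[mp][idx]
        total += w1 * hit_score + w2 * log_prob
        sizes[cluster] = sizes.get(cluster, 0) + 1
    return total + w3 * sum(n * n for n in sizes.values())


def greedy_start(candidates, weights):
    """Pick each matepair's best candidate on its own, ignoring cluster sizes."""
    w1, w2, _ = weights
    return {mp: max(range(len(opts)),
                    key=lambda i: w1 * opts[i][1] + w2 * opts[i][2])
            for mp, opts in candidates.items()}


def hill_climb(candidates, weights):
    """Move one matepair at a time while J strictly improves."""
    config = greedy_start(candidates, weights)
    best = objective(config, candidates, weights)
    improved = True
    while improved:
        improved = False
        for mp, opts in candidates.items():
            for idx in range(len(opts)):
                if idx == config[mp]:
                    continue
                trial = dict(config)
                trial[mp] = idx
                score = objective(trial, candidates, weights)
                if score > best:
                    config, best, improved = trial, score, True
    return config, best


# Candidate = (cluster id, BLAT hit score, log probability); values are made up
candidates = {
    "X1": [("C1", 1.0, -1.0), ("C2", 0.9, -1.0)],
    "X2": [("C2", 1.0, -1.0)],
    "X3": [("C2", 1.0, -1.0)],
}
config, score = hill_climb(candidates, weights=(1.0, 1.0, 1.0))
print(config, score)
```

In this toy input the greedy start places X1 in C1 because that hit scores slightly higher, and the hill climbing step then moves X1 into C2 because the larger cluster raises the assumed objective.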
24
P-values
We assign p-values to give confidence to our clusters.
The p-value is the probability that the cluster is generated by the reference genome rather than by a structural variant:
$\mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}})$
where $E$ = expected number of matepairs mapped to the location of the cluster
P-values depend on the length of the cluster, the number of matepairs involved, and their probabilities.
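The p-value formula is easy to evaluate directly, as in this sketch. It assumes E has already been rounded to an integer so that the binomial coefficient is defined; the function name and the example numbers are made up for illustration.

```python
import math


def cluster_p_value(expected_matepairs, null_probs):
    """Pval(C_k) = (E choose |C_k|) * product of P(X_i | C_null).

    Assumption: expected_matepairs (E) is an integer.
    """
    cluster_size = len(null_probs)
    return math.comb(expected_matepairs, cluster_size) * math.prod(null_probs)


# A cluster of 3 matepairs where 10 matepairs are expected at this location
print(cluster_p_value(10, [0.01, 0.02, 0.05]))
```

Adding matepairs that are unlikely under the null hypothesis shrinks the product faster than the binomial coefficient grows, which is why larger clusters of discordant matepairs receive smaller p-values.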
25
Clustering Results
We started with ~360,000 matepairs
• ~90% were uniquely mapped
• ~90% had a concordant position (mapped at ± 2)
Through the clustering procedure above (FDR 0.2) we found
• 82 insertion clusters (53 had a uniquely mapped read)
• 175 deletion clusters (135)
• 103 inversion clusters (24)
• 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)
26
Example Deletion
27
Agreement with Previous Results
We compared our clusters with previous datasets (counts in parentheses are clusters with a uniquely mapped read):
Type        Total      Tuzun         Levy         Korbel       DGV-All
Insertion   82(53)     12(7)/139     6(5)/319     0(0)/34      24(13)/2216
Deletion    175(135)   21(17)/102    25(23)/344   45(36)/742   82(63)/4697
Inversion   103(24)    34(12)/56     N/A          42(8)/105    60(15)/164
All of the correlations (except the zero) are significant (p-values < 0.001, estimated via Monte Carlo simulations)
The DMBT1 deletion was also found in the Tuzun et al. dataset (but not in the Levy dataset).
28
Translocations
A large fraction (69%) of the translocations were close to the centromeres
She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart
These could also be mis-assemblies.
Distance to centromere
                     <10^6    (10^6, 4.5*10^6]    >4.5*10^6
<10^6                22       6                   10
(10^6, 4.5*10^6]              0                   3
>4.5*10^6                                         14
29
Conclusions
Introduced a novel framework for finding structural variants that does not rely on ab initio mapping of matepairs to genomic positions.
Introduced a probabilistic model for structural variants
Isolated 82 insertions, 175 deletions, and 103 inversions between the reference public human genome and the JCVI donor.
These results show statistically significant correlation with previous variation studies
Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (of these, 121 have support from a uniquely mapped matepair)