a robust framework for detecting structural variations

1

A Robust Framework for Detecting Structural Variations

February 6, 2008

Seunghak Lee, Elango Cheran, and Michael Brudno

University of Toronto, Canada

2

What are structural variations? (1)

10^3 – 10^6 basepair variations in the genome

Insertion: a large consecutive fragment of DNA is inserted

Deletion: a large consecutive fragment of DNA is deleted

Inversion: a large consecutive fragment of DNA is inverted

Translocation: a large consecutive fragment of DNA is moved from one chromosome to another.

Copy number variations

3

What are structural variations? (2)

Various examples of structural variations

4

Outline

Introduction
• Type of Structural Variations
• Sequencing Approaches to Detect Structural Variations
• Motivation & Research Objectives

Probabilistic Framework for Detecting Structural Variations
• Probabilistic Framework
• Flow of our Framework
• Hierarchical Clustering of Matepairs (2nd phase)
• Choosing a Unique Mapped Location for Each Matepair (3rd phase)

Experiments
• Comparison with Three Previous Studies
• DMBT1 Gene for Deletion
• Centromere and Translocations

Conclusions

5

Type of Structural Variations (1)

Insertion

[Figure: sequence A compared with the reference (REF)]

6


Deletion

[Figure: sequence A compared with the reference (REF)]

7


Inversion

[Figure: sequence A compared with the reference (REF), with 5’ and 3’ ends marked]

8


Translocation

[Figure: chromosomes chr1 and chr2]

9

Sequencing Approaches

1. “Fine-scale structural variation of the human genome” [Tuzun et al, 2005]

• Mapping matepairs onto the reference genome
• Insertion and deletion: mapped distance inconsistent with the insert size
• Inversion: both reads map with the same orientation

2. “Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome” [Korbel et al, 2007]

• Proposed a high-throughput, massive paired-end mapping technique
• Described the types of structural variations in detail
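The matepair signatures listed above can be sketched in a few lines of Python. This is only an illustration, not the procedure of either cited paper: the function name, the `n_sd` cutoff, and the use of a mean and standard deviation for the insert size are all assumptions.

```python
def classify_matepair(mapped_distance, strand1, strand2,
                      mean_insert, sd_insert, n_sd=3.0):
    """Label one matepair mapping by its distance and orientation signature.

    n_sd (the number of standard deviations tolerated) is an assumed cutoff.
    """
    # Both reads mapping with the same orientation is the inversion signature.
    if strand1 == strand2:
        return "inversion"
    low = mean_insert - n_sd * sd_insert
    high = mean_insert + n_sd * sd_insert
    if mapped_distance < low:
        # The sample carries extra sequence, so the reads land too close together.
        return "insertion"
    if mapped_distance > high:
        # The sample lacks sequence, so the reads land too far apart.
        return "deletion"
    return "concordant"
```

For example, with an assumed 40 kb library (standard deviation 2.8 kb), a pair mapped 60 kb apart on opposite strands would be labelled a deletion candidate.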

10

Motivation & Research Objectives (1)

Tuzun et al. scored mappings by combining several factors (e.g. length, identity, and quality of the sequences).

How can we map reads onto the reference genome?

11

Motivation & Research Objectives (2)

Sequencing-based methods are effective for detecting structural variants, as shown by Tuzun et al. and Korbel et al.

However, each read can have multiple mappings, and previous research relied on a priori mapped locations.

Why not develop a probabilistic model without such assumptions? Hopefully, it can also be applied to short reads from NGS machines.

12

Probabilistic Framework (1)

p(Y): distribution of mapped distances of “uniquely mapped” matepairs of various sizes

We use p(Y) as the building block of our probabilistic framework
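As a rough illustration of how p(Y) could be estimated from uniquely mapped matepairs, the sketch below summarizes a list of mapped distances and evaluates an empirical interval probability. The function names and the choice of a plain empirical distribution are assumptions; the slides do not say how p(Y) is represented.

```python
import statistics

def summarize_mapped_distances(distances):
    """Return the sample mean and standard deviation of mapped distances Y."""
    return statistics.mean(distances), statistics.stdev(distances)

def empirical_interval_probability(samples, low, high):
    """Estimate P(low <= Y <= high) as the fraction of samples in the interval."""
    inside = sum(1 for y in samples if low <= y <= high)
    return inside / len(samples)
```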

15


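Inversion

\[ c - d = s(X_1) - s(X_2) \]

\[ P(X_i, X_j \mid \mathrm{inv}) = 1 - P\left(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta\right) \]

where \( \delta = \left|\mu_{|Y_1 - Y_2|} - (c - d)\right| \) and \( s(X_i) \) is the insert size of \( X_i \).

[Figure: density p(|Y1 - Y2|), with the point μ|Y1-Y2| - δ marked]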

16


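Translocation

\[ (c - a) - (d - b) = s(X_1) - s(X_2) \]

\[ P(X_i, X_j \mid \mathrm{trans}) = 1 - P\left(\mu_{|Y_1 - Y_2|} - \delta \le |Y_1 - Y_2| \le \mu_{|Y_1 - Y_2|} + \delta\right) \]

where \( \delta = \left|\mu_{|Y_1 - Y_2|} - \left[(c - a) - (d - b)\right]\right| \) and \( s(X_i) \) is the insert size of \( X_i \).

[Figure: density p(|Y1 - Y2|), with the point μ|Y1-Y2| - δ marked]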
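The inversion and translocation probabilities share one form: one minus the mass of p(|Y1 - Y2|) that lies within δ of its mean. A minimal sketch follows, assuming p(|Y1 - Y2|) is approximated by the empirical differences between uniquely mapped distances. All function names are hypothetical, and a, b, c, d are taken to be the mapped coordinates shown in the slide figures.

```python
from itertools import combinations
from statistics import mean

def abs_difference_samples(mapped_distances):
    """Empirical samples of |Y1 - Y2| from all pairs of mapped distances."""
    return [abs(y1 - y2) for y1, y2 in combinations(mapped_distances, 2)]

def pair_probability(observed, abs_diff_samples):
    """1 - P(mu - delta <= |Y1 - Y2| <= mu + delta), with delta = |mu - observed|."""
    mu = mean(abs_diff_samples)
    delta = abs(mu - observed)
    inside = sum(1 for d in abs_diff_samples if mu - delta <= d <= mu + delta)
    return 1.0 - inside / len(abs_diff_samples)

def inversion_observed(c, d):
    """Observed value of s(X1) - s(X2) for an inversion: c - d."""
    return c - d

def translocation_observed(a, b, c, d):
    """Observed value of s(X1) - s(X2) for a translocation: (c - a) - (d - b)."""
    return (c - a) - (d - b)
```

An observed value far from the mean makes δ large, so the interval captures most of the distribution and the probability that the two matepairs explain the same event drops towards zero.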

17

Flow of our Framework (1)

1. Preprocessing step

[Flow diagram; boxes as recovered from the slide]

• Get top K mappings
• Remove short mappings
• Make all possible combinations of mappings
• Discard matepairs consistent with insert size
• Remove invalid strands (-,+)
• Remove very similar mappings
• Mask repeats
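Part of the preprocessing flow can be sketched as follows. This is a simplified illustration, not the authors' implementation: the hit fields (`score`, `length`, `chrom`, `pos`, `strand`), the defaults for `k`, `min_length`, and `n_sd`, and the rule that a single concordant combination discards the whole matepair are assumptions. Strand filtering, near-duplicate removal, and repeat masking are omitted.

```python
from itertools import product

def top_k_long_hits(hits, k, min_length):
    """Keep the k best-scoring hits, then drop those shorter than min_length."""
    best = sorted(hits, key=lambda h: h["score"], reverse=True)[:k]
    return [h for h in best if h["length"] >= min_length]

def is_concordant(h1, h2, mean_insert, sd_insert, n_sd=2.0):
    """True when the two hits are consistent with the library insert size."""
    if h1["chrom"] != h2["chrom"] or h1["strand"] == h2["strand"]:
        return False
    distance = abs(h2["pos"] - h1["pos"])
    return abs(distance - mean_insert) <= n_sd * sd_insert

def discordant_candidates(hits1, hits2, mean_insert, sd_insert,
                          k=10, min_length=100):
    """All hit combinations for one matepair, or [] if any is concordant."""
    pairs = list(product(top_k_long_hits(hits1, k, min_length),
                         top_k_long_hits(hits2, k, min_length)))
    if any(is_concordant(a, b, mean_insert, sd_insert) for a, b in pairs):
        return []  # the matepair is explained by the reference as-is
    return pairs
```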

18

Flow of our Framework (2)

2. Clustering

• Do hierarchical clustering for each structural variation (Insertion, Deletion, Inversion, Translocation)

3. Finding structural variations

• Find an initial configuration in a greedy manner
• Learn the parameters of the objective function
• Find a locally optimal configuration

19

Hierarchical Clustering (1)

(ex) Insertion

• A cluster, C, is a set of matepairs explaining the same structural variation
• Linkage distance: D(X1, X2) = - ln P(X1, X2|C)

[Figure: matepairs X1 and X2 mapped between A and REF, forming the cluster C = {X1, X2}]

20

Hierarchical Clustering (2)

Generally, linkage distance is given by,

We do hierarchical clustering for each structural variation.
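A minimal agglomerative clustering sketch using the pairwise linkage distance D = -ln P is shown below. The general linkage formula is not reproduced in this transcript, so average linkage between clusters is used here purely as an assumption. The `pair_prob` callback and the `max_distance` stopping threshold are likewise hypothetical.

```python
import math
from itertools import combinations

def linkage_distance(p):
    """D = -ln P; a probability of zero maps to an infinite distance."""
    return math.inf if p <= 0.0 else -math.log(p)

def agglomerate(items, pair_prob, max_distance):
    """Merge the closest clusters until no pair is within max_distance."""
    clusters = [[x] for x in items]

    def dist(a, b):
        # Average linkage (an assumption, not the authors' formula).
        ds = [linkage_distance(pair_prob(x, y)) for x in a for y in b]
        return sum(ds) / len(ds)

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

In the framework described here, this would be run once per variation type, with `pair_prob` set to the corresponding P(Xi, Xj | type).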

21

Choosing a Unique Mapped Location (1)

Each matepair must be assigned to a unique pair of BLAT hits and to a unique cluster.

[Figure: reads R1 and R2 with BLAT hits 1 to 5; candidate mappings M1,4, M2,4, and M3,5; clusters C1 and C2]

22


We define an objective function J(ω)

ƒ1 corresponds to BLAT hit scores

ƒ2 corresponds to the probability

ƒ3 corresponds to the size of clusters

23


Find the initial configuration greedily

Learn parameters for the objective function J(ω). We used hill climbing search to maximize the log likelihood of P(ω|λi)

Finally, find a configuration that locally maximizes J(ω) using hill climbing search
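The greedy start and the final hill-climbing search can be sketched as below. The exact form of J(ω) is not given on the slides, so this example assumes a weighted sum of a hit score, a log-probability, and the size of the chosen cluster, with fixed weights standing in for the learned parameters. Each option is a hypothetical `(cluster, hit_score, log_prob)` tuple.

```python
def greedy_start(options):
    """For every matepair, pick the option with the best hit score."""
    return {m: max(opts, key=lambda o: o[1]) for m, opts in options.items()}

def objective(config, weights=(1.0, 1.0, 1.0)):
    """Assumed J: weighted hit score + log-probability + cluster size."""
    w_hit, w_prob, w_size = weights
    sizes = {}
    for cluster, _, _ in config.values():
        sizes[cluster] = sizes.get(cluster, 0) + 1
    return sum(w_hit * hit + w_prob * logp + w_size * sizes[cluster]
               for cluster, hit, logp in config.values())

def hill_climb(options, score=objective):
    """Change one matepair's assignment at a time while the score improves."""
    current = greedy_start(options)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for m, opts in options.items():
            for opt in opts:
                trial = dict(current)
                trial[m] = opt
                s = score(trial)
                if s > best:
                    current, best, improved = trial, s, True
    return current, best
```

Because only strict improvements are accepted and the number of configurations is finite, the search stops at a local optimum, which need not be the global one.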

24

P-values

We assign p-values to give confidence to our clusters.

The p-value is the probability that the cluster is generated by the reference genome rather than by a structural variant:

\[ \mathrm{Pval}(C_k) = \binom{E}{|C_k|} \prod_{X_i \in C_k} P(X_i \mid C_{\mathrm{null}}) \]

where E is the expected number of matepairs mapped to the location of the cluster.

P-values depend on the length of the cluster, the number of matepairs involved, and their probabilities.
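The p-value formula can be evaluated directly, as in the sketch below. It assumes E has already been rounded to an integer and that the null probabilities P(Xi|Cnull) are supplied by the caller. The function and argument names are illustrative.

```python
import math

def cluster_p_value(expected_matepairs, null_probs):
    """Pval(Ck) = C(E, |Ck|) * product of P(Xi | Cnull) over the cluster."""
    cluster_size = len(null_probs)
    return math.comb(expected_matepairs, cluster_size) * math.prod(null_probs)
```

For instance, with E = 10 and two matepairs whose null probabilities are 0.01 and 0.02, the value is 45 × 0.0002 = 0.009.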

25

Clustering Results

We started with ~360,000 matepairs

• ~90% were uniquely mapped
• ~90% had a concordant position (mapped at ± 2)

Through the clustering procedure above (FDR 0.2) we found

• 82 insertion clusters (53 had a uniquely mapped read)
• 175 deletion clusters (135)
• 103 inversion clusters (24)
• 55 translocation (cross-chromosome) clusters (all were required to have a uniquely mapped read)

26

Example Deletion

27

Agreement with Previous Results

Type        Total      Tuzun        Levy         Korbel       DGV-All
Insertion   82(53)     12(7)/139    6(5)/319     0(0)/34      24(13)/2216
Deletion    175(135)   21(17)/102   25(23)/344   45(36)/742   82(63)/4697
Inversion   103(24)    34(12)/56    N/A          42(8)/105    60(15)/164

We have compared

All of the correlations (besides the zero) are significant (p-values < 0.001) via Monte Carlo simulations

The DMBT1 deletion was also found in the Tuzun et al dataset (but not the Levy dataset).

28

Translocations

A large fraction (69%) of the translocations were close to the centromeres

She et al. predicted up to 200 interchromosomal rearrangement events near centromeres per million years. The two donors are ~0.2 million years apart.

These could also be mis-assemblies.

Distance to centromere

                     <10^6    (10^6, 4.5*10^6]    >4.5*10^6
<10^6                22       6                   10
(10^6, 4.5*10^6]     -        0                   3
>4.5*10^6            -        -                   14
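The figures on this slide can be cross-checked with a short calculation. Reading the 69% as the first row of the table (translocations with an end within 10^6 of a centromere) is an assumption; it is consistent with the counts. The bin labels used as dictionary keys are illustrative.

```python
# Counts from the distance-to-centromere table (upper triangle only).
table = {
    ("<1e6", "<1e6"): 22, ("<1e6", "1e6-4.5e6"): 6, ("<1e6", ">4.5e6"): 10,
    ("1e6-4.5e6", "1e6-4.5e6"): 0, ("1e6-4.5e6", ">4.5e6"): 3,
    (">4.5e6", ">4.5e6"): 14,
}

total = sum(table.values())                      # all translocation clusters
near = sum(n for (row, _), n in table.items() if row == "<1e6")
fraction_near = near / total                     # share close to a centromere

# Upper bound from She et al.: 200 events per million years over ~0.2 million years.
expected_max_events = 200 * 0.2
```

Under this reading, 38 of the 55 translocation clusters (69%) fall near a centromere, against an upper bound of about 40 expected events.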

29

Conclusions

Introduced a novel framework for finding structural variants that does not rely on a priori mapping of matepairs to genomic positions.

Introduced a probabilistic model for structural variants

Isolated 82 insertions, 175 deletions, and 103 inversions between the reference public human genome and the JCVI donor.

These results show statistically significant correlation with previous variation studies.

Isolated 194 novel structural variants that do not overlap any event from the Database of Genomic Variants (DGV); of these, 121 have support from a uniquely mapped matepair.
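The novel-variant counts can be reproduced from the agreement table. The sketch assumes each table cell reads as clusters overlapping the dataset, with the uniquely supported subset in parentheses; this reading is not stated on the slides but matches the totals. Variable names are illustrative.

```python
# (all clusters, clusters with a uniquely mapped matepair)
found = {"insertion": (82, 53), "deletion": (175, 135), "inversion": (103, 24)}
in_dgv = {"insertion": (24, 13), "deletion": (82, 63), "inversion": (60, 15)}

# Clusters that overlap no DGV event, overall and with unique support.
novel = sum(found[t][0] - in_dgv[t][0] for t in found)
novel_unique = sum(found[t][1] - in_dgv[t][1] for t in found)
```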
