MAVID: Constrained Ancestral Alignment of Multiple Sequences

Authors: Nicholas Bray and Lior Pachter

Outline

• AVID
• MAVID
  – Progressive alignment
  – Constraints
  – Tree Building
  – Experimental Results

AVID: A Global Alignment Program

• Fast
• Memory efficient
• Practical for alignments of large genomic regions
• Sensitive in finding homologous regions
• Specific: avoids false-positive problems

Algorithm

• Repeat Masking (Optional)
• Finding Matches Using Suffix Trees
• Anchor Selection
• Recursion

[Flowchart: Repeat masking → Match finding → Anchor selection → Enough anchors? → either split sequences using anchors and recurse, or base pair alignment]

Repeat Masking (Optional)

• RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html)

• Repeat matches
• Clean matches

[Figure: repeat matches and clean matches]

Finding Matches Using Suffix Trees

• Maximal repeated substring (match)
  – A repeated substring such that every longer substring containing it is not repeated in the string
• Maximal matches between two sequences
  – Pairs of matching substrings whose flanking bases are mismatches
• Transform
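
The definition of a maximal match between two sequences can be made concrete with a short sketch. The code below is a brute-force scan, not the suffix-tree method named on the slide; it runs in quadratic time or worse and only illustrates the definition. The function name `maximal_matches` and the `min_len` parameter are assumptions made for this example.

```python
def maximal_matches(a, b, min_len=1):
    """Return (start_a, start_b, length) for every maximal exact match.

    A match is maximal when it cannot be extended to the left or right:
    the flanking bases differ or a sequence end is reached.
    Brute-force scan for illustration, NOT the suffix-tree method.
    """
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip positions that are not the left end of a match.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or a sequence end.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches
```

For example, `maximal_matches("ACGTT", "TACGA", min_len=3)` reports the shared `ACG` as a single match starting at position 0 in the first sequence and position 1 in the second.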

[Figure: maximal repeated substring; maximal matches between two sequences; transform]

Anchor Selection

• Eliminate noisy matches (those less than half the length of the longest match)
• The remaining matches are ordered by
  – Long clean -> short clean -> long repeat -> short repeat

Anchor Selection

• A variant of the Smith-Waterman algorithm (anchors may not overlap)
• Gap score: 0
• Mismatch penalty: ∞
• Match score: 10 bp
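
One way to read this scoring scheme is as a chaining problem: pick a set of matches that are in the same order in both sequences, do not overlap, and have the largest total score. The sketch below uses a simple quadratic dynamic program and assumes that a match is scored by its length in base pairs, which is only one reading of the "10 bp" example on the slide. The name `select_anchors` is hypothetical, and this is not AVID's actual implementation.

```python
def select_anchors(matches):
    """Pick a collinear, non-overlapping chain of matches with maximal total length.

    Each match is (start_a, start_b, length). Gaps between anchors cost
    nothing, and the score of a match is ASSUMED to be its length.
    """
    ms = sorted(matches)
    n = len(ms)
    if n == 0:
        return []
    best = [m[2] for m in ms]      # best chain score ending at match i
    prev = [-1] * n                # back-pointer for the traceback
    for i in range(n):
        ai, bi, li = ms[i]
        for j in range(i):
            aj, bj, lj = ms[j]
            # Match j must end before match i starts in both sequences.
            if aj + lj <= ai and bj + lj <= bi and best[j] + li > best[i]:
                best[i] = best[j] + li
                prev[i] = j
    # Trace back from the highest-scoring chain end.
    i = max(range(n), key=lambda t: best[t])
    chain = []
    while i != -1:
        chain.append(ms[i])
        i = prev[i]
    return chain[::-1]
```

Because gaps are free and crossing or overlapping matches are never chained, the result is the heaviest consistent anchor set under the stated assumption.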

Recursion

Condition

• There are still significant matches
  – The anchor set is >50% of the length of the sequence
    • Recursion
  – Otherwise
    • Needleman-Wunsch algorithm
• No significant matches
  – Short sequence (<4 kb)
    • Needleman-Wunsch algorithm
  – Long sequence
    • Trivial alignment (gap)
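
The decision rule on this slide can be written as a small function. This is a minimal sketch: the function name, the argument names, and the string labels are hypothetical, and 4 kb is taken to mean 4000 bases.

```python
def choose_step(has_significant_matches, anchor_len, seq_len, short_limit=4000):
    """Return the action the Condition slide prescribes for one region.

    anchor_len: total length covered by the selected anchors.
    seq_len: length of the sequence being aligned (which of the two
    sequences is measured is an assumption left to the caller).
    """
    if has_significant_matches:
        # Anchors covering more than half of the sequence: recurse.
        if anchor_len > 0.5 * seq_len:
            return "recurse"
        return "needleman-wunsch"
    # No significant matches: only short regions get a full alignment.
    if seq_len < short_limit:
        return "needleman-wunsch"
    return "trivial-gap-alignment"
```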

MAVID

• Rapidly aligning multiple large genomic regions

• Incorporating biologically meaningful heuristics

• Sound alignment strategies

Method

• Core: progressive ancestral alignment, which incorporates preprocessed constraints
• Terminology
  – Match
    • A similar (not necessarily identical) region between two sequences
  – Constraint
    • An ordering requirement on positions in the alignment

Standard progressive alignment

• Compute the distance matrix by aligning all pairs of sequences
• Build a phylogenetic tree (guide tree) from the distance matrix
  – Cluster
  – Midpoint method
• Progressively align the sequences according to the branching order in the guide tree
  – Aligning two alignments
  – An alignment is viewed as a sequence
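
The branching order in the last step amounts to a post-order walk of the guide tree: the two children of a node are aligned before the node itself. The sketch below only lists the merges in that order. Representing the guide tree as nested 2-tuples of sequence names is an assumption made for illustration, as is the name `merge_order`.

```python
def merge_order(tree):
    """List the merges of a progressive alignment in post-order.

    tree: a sequence name (str) or a 2-tuple of subtrees.
    Returns a list of (left_group, right_group) pairs, where each group
    is a tuple of the sequence names already aligned together.
    """
    order = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left = visit(node[0])
        right = visit(node[1])
        order.append((left, right))   # align the two child alignments
        return left + right

    visit(tree)
    return order
```

For the tree `(("human", "chimp"), ("mouse", "rat"))`, the two inner pairs are merged first and the two resulting alignments are merged last.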

Key difference

• Instead of aligning alignments, we first infer ancestral sequences of alignments using maximum-likelihood estimation within a probabilistic evolutionary model

• Maximum-likelihood estimation
  – A popular statistical method used to make inferences about the parameters of the underlying probability distribution of a given data set

Key difference

• The ancestral sequences are then aligned with AVID

• The scores of the Smith-Waterman step are assigned according to the branch length of the two alignments

• The alignment of the ancestral sequences is then used to glue the two alignments together. Gaps in the ancestral sequences lead to gaps in the multiple alignment
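
The gluing step can be sketched as follows, assuming each ancestral sequence has exactly one residue per column of the alignment it summarizes. Given the gapped pairwise alignment of the two ancestors, every ancestral residue pulls in the corresponding column and every ancestral gap becomes a gap in all rows of that alignment. The function name `glue` and the data layout (lists of equal-length strings) are hypothetical.

```python
def glue(aln_a, aln_b, anc_a, anc_b):
    """Merge two alignments using a pairwise alignment of their ancestors.

    aln_a, aln_b: lists of equal-length rows (strings).
    anc_a, anc_b: the aligned ancestral sequences, '-' marks a gap.
    ASSUMES one ancestral residue per column of each input alignment.
    """
    out_a = [[] for _ in aln_a]
    out_b = [[] for _ in aln_b]
    col_a = col_b = 0
    for x, y in zip(anc_a, anc_b):
        # A gap in an ancestor becomes a gap in every row below it.
        for out, row in zip(out_a, aln_a):
            out.append("-" if x == "-" else row[col_a])
        for out, row in zip(out_b, aln_b):
            out.append("-" if y == "-" else row[col_b])
        if x != "-":
            col_a += 1
        if y != "-":
            col_b += 1
    return ["".join(r) for r in out_a + out_b]
```

With `aln_a = ["ACG", "A-G"]`, `aln_b = ["AG"]`, and ancestors aligned as `"ACG"` over `"A-G"`, the single row of the second alignment receives a gap in the middle column.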

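[Figure: Alignment A and Alignment B are each summarized by an ancestral sequence (Ancestral A, Ancestral B); the two ancestral sequences are aligned with AVID]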

AVID with preprocessed data

• Gene predictions using GENSCAN
• Protein alignments using BLAT
• Finding exon matches without using suffix trees
• In addition, the exon matches can be used to shape the final multiple alignment

MAVID (Constraints, Tree Building, and Experimental Results)

Speaker: 羅正偉 (2005/12/07)

Constraints (1/3)

Notation: a_i ≤ b_j

This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.

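Constraints (2/3)

[Figure: three sequences a, c, and b, with position a_i on a, positions c_x and c_y on c, and position b_j on b]

If x ≤ y, then a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j by transitivity.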
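
The transitivity argument can be sketched in a few lines: whenever a constraint from a into c ends at or before the position where a constraint from c into b starts, a direct constraint between a and b follows. The pair-based encoding of constraints and the name `implied_constraints` are assumptions made for this example.

```python
def implied_constraints(a_before_c, c_before_b):
    """Derive a_i <= b_j from a_i <= c_x and c_y <= b_j whenever x <= y.

    a_before_c: iterable of (i, x) pairs meaning a_i <= c_x.
    c_before_b: iterable of (y, j) pairs meaning c_y <= b_j.
    Returns the sorted list of (i, j) pairs meaning a_i <= b_j.
    """
    return sorted({(i, j)
                   for i, x in a_before_c
                   for y, j in c_before_b
                   if x <= y})
```

For instance, a_5 ≤ c_10 and c_20 ≤ b_7 together give a_5 ≤ b_7, while a_40 ≤ c_60 combined with c_20 ≤ b_7 gives nothing because 60 > 20.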

Constraints (3/3)

• The above information can be used in the alignment of the ancestral sequences by requiring potential anchors between the sequences to satisfy the constraints.

Prime Constraints (1/4)

• Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x.
• Every triplet can provide potential constraints for the alignment.
• If there are n sequences, there are O(n^3) such triplets.

[Figure: tree node x with children u and v]

Too many constraints!

Prime Constraints (2/4)

• Actually, we don’t need to find all possible constraints, many of which will be redundant.

• Instead, we wish to find a set of prime constraints

• In this set, no constraint is implied by the others.

• Such a set can be inferred from the homology map.

Illustration

Prime Constraints (3/4)

• If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints.

• The set of all prime constraints can be found in O(mk^2) time, where k is the number of leaves below x.

Prime Constraints (4/4)

• Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in time O(N log m), where N is the total number of matches.

• For typical values of m and k, the time taken computing and utilizing the constraints is negligible.

Tree Building (1/3)

• Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree. (Quadratic number of sequence alignments)

• We utilize an iterative method to obtain a guide tree using only a linear number of alignments.

Tree Building (2/3)

• The initial guide tree is selected randomly from the set of complete binary trees.

• The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment.

• The above process is iterated until the alignment and tree are satisfactory.
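
The iteration can be sketched as a loop around two steps that are passed in as functions. Everything named here is hypothetical: `align` stands for the progressive alignment along a tree, `infer_tree` for the phylogenetic inference from an alignment, and "satisfactory" is approximated by "the tree stopped changing or `max_iters` was reached", which is an assumption and not a rule stated on the slide. The random starting tree is built by joining random subtrees, which does not sample uniformly from the set of complete binary trees.

```python
import random

def random_binary_tree(names, rng=random):
    """Build a random binary guide tree (nested 2-tuples) over the names."""
    nodes = list(names)
    while len(nodes) > 1:
        # Join two randomly chosen subtrees until one tree remains.
        left = nodes.pop(rng.randrange(len(nodes)))
        right = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((left, right))
    return nodes[0]

def iterate_tree(names, align, infer_tree, max_iters=5, rng=random):
    """Alternate alignment and tree inference, starting from a random tree."""
    tree = random_binary_tree(names, rng)
    alignment = align(tree)
    for _ in range(max_iters):
        new_tree = infer_tree(alignment)
        if new_tree == tree:          # assumed stopping rule
            break
        tree = new_tree
        alignment = align(tree)       # realign along the improved tree
    return tree, alignment
```

Each pass through the loop costs one progressive alignment, so the total work grows with the number of iterations and not with the number of sequence pairs.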

Tree Building (3/3)

• Instead of computing all pairwise alignments, only O(nk) alignments are necessary to perform n iterations with k sequences.
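
The saving can be checked with a little arithmetic. A binary guide tree with k leaves has k - 1 internal nodes, so one progressive pass needs k - 1 alignments, while the all-pairs approach needs k(k - 1)/2. The function name below and the choice of three iterations in the example are hypothetical.

```python
def alignment_counts(k, n):
    """Alignments needed for n progressive passes vs. all pairs of k sequences."""
    iterative = n * (k - 1)        # k - 1 internal nodes per binary guide tree
    all_pairs = k * (k - 1) // 2   # one alignment per unordered pair
    return iterative, all_pairs
```

For 21 sequences and an assumed 3 iterations, this gives 60 alignments against 210 for the all-pairs approach.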

• We found that for typical alignment problems, only a small number of iterations were necessary.

Experimental Results 1

• A human, mouse, and rat whole-genome multiple alignment.
  – A homology map for the genomes was built by C. Dewey, and was used to generate gene anchors and constraints.
  – Human chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2.

Experimental Results 1 (cont.)

Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.

Experimental Results 2

• Alignment of 21 organisms
  – We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence.
  – Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.

Experimental Results 2 (cont.)

• The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003).

• MLAGAN is the only other program we know of that is able to align the 21 sequences in a reasonable period of time.

Experimental Results 2 (cont.)

• MAVID and MLAGAN both aligned sequences correctly.

• MAVID took 40 min, while MLAGAN took roughly 6 h.