Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)


Page 1

Multiple Sequence Alignment

CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

Page 2

Why do we care about sequence alignment?

It can tell us something about the evolution of organisms.

We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function.

Homologous genes (genes that share an evolutionary origin) tend to have similar sequences.

Orthologs are genes that are evolutionarily related and have a similar function, but now appear in different species.

Paralogs are evolutionarily related (share an origin) but no longer have the same function.

You can uncover either orthologs or paralogs through sequence alignment.

Page 3

Multiple Sequence Alignment

Often applied to proteins.

Proteins that are similar in sequence are often similar in structure and function.

Sequence changes more rapidly in evolution than do structure and function.

Page 4

Overview of Methods

Dynamic programming – too computationally expensive to do a complete search; uses heuristics

Progressive – starts with pair-wise alignment of most similar sequences; adds to that

Iterative – makes an initial alignment of groups of sequences, then repeatedly refines it (e.g., genetic algorithms)

Locally conserved patterns

Statistical and probabilistic methods

Page 5

Dynamic Programming

Computational complexity – even worse than for pair-wise alignment because we’re searching for the best path through an n-dimensional hyperspace, one dimension per sequence. (We can picture this in 2 or 3 dimensions.)

Can align about 7 relatively short (200-300 residues) protein sequences in a reasonable amount of time; not much beyond that
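
To get a feel for why the search blows up, here is a rough back-of-the-envelope sketch. It assumes all n sequences have the same length and that the full lattice is filled in with no pruning; the function names are made up for illustration.

```python
def lattice_cells(num_sequences: int, length: int) -> int:
    """Cells in the full DP lattice when every sequence has the same length."""
    # One axis per sequence, each axis running over positions 0..length.
    return (length + 1) ** num_sequences


def predecessors_per_cell(num_sequences: int) -> int:
    """Neighbouring cells an interior cell can be reached from."""
    # Each sequence either advances or takes a gap; the all-gap move is excluded.
    return 2 ** num_sequences - 1


for n in (2, 3, 7):
    print(n, lattice_cells(n, 200), predecessors_per_cell(n))
```

Going from 2 to 3 sequences of length 200 takes the lattice from 40,401 cells to 8,120,601 cells, and each extra sequence multiplies it by 201 again.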

Page 6

A Heuristic for Reducing the Search Space in Dynamic Programming

Let’s picture this in 3 dimensions (pp. 146-157 in the book). It generalizes to n.

Consider the pair-wise alignments of each pair of sequences.

Create a phylogenetic tree from these scores.

Consider a multiple sequence alignment built from the phylogenetic tree.

These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

Page 7

Phylogenetic Tree

Dynamic programming uses a phylogenetic tree to build a “first-cut” MSA.

The tree shows how proteins could have evolved from shared origins over evolutionary time.

See page 143 in Bioinformatics by Mount. Chapter 6 goes into detail on this.

Page 8

Dynamic Programming -- MSA

Create a phylogenetic tree based on pair-wise alignments. (Pairs of sequences that have the best scores are paired first in the tree.)

Do a “first-cut” MSA by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the tree. Most alike sequences are aligned first.

Use the pair-wise alignments and the “first-cut” MSA to circumscribe a space within which to do a full MSA that searches through this solution space.

The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight ε indicating how far the pair-wise score differs from the first-cut MSA score.
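
The weighted sum-of-pairs idea can be sketched as follows. This is only an illustration: the match/mismatch/gap values are a toy scoring scheme rather than a real substitution matrix, and the per-pair weights are supplied by hand here instead of being derived from the first-cut MSA.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score two aligned rows column by column (toy scoring scheme)."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows contributes nothing
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def weighted_sum_of_pairs(rows, weights=None) -> float:
    """Sum the pair-wise scores of all row pairs, each multiplied by its weight."""
    total = 0.0
    for i, j in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[(i, j)]  # hypothetical weights
        total += w * pair_score(rows[i], rows[j])
    return total


msa = ["ACG-T", "AC-GT", "ACGGT"]
print(weighted_sum_of_pairs(msa))
print(weighted_sum_of_pairs(msa, {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 1.0}))
```

With equal weights the three pairs score -1, 2, and 2, for a total of 3; down-weighting the first pair to 0.5 gives 3.5.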

Page 9

Heuristic Dynamic Programming Method for MSA

Does not guarantee an optimal alignment of all the sequences in the group.

Does get an optimal alignment within the space chosen.

Page 10

Progressive Methods

Similar to the dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree).

Differs from the dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA: even though we’ve cut down the search space, it’s still big when we have many sequences to align.)

Page 11

Progressive Method

Generally proceeds as follows:

Choose a starting pair of sequences and align them

Align each next sequence to those already aligned, one at a time

Heuristic method – doesn’t guarantee an optimal alignment

Details vary in implementation:

How to choose the first sequence to align?

Align all subsequent sequences cumulatively or in subfamilies?

How to score?
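
The first step, aligning the starting pair, is ordinary pair-wise dynamic programming. Below is a minimal global-alignment sketch in the Needleman-Wunsch style. The scoring values are placeholders chosen for illustration (a real run would use a substitution matrix and tuned gap penalties), and extending the result to a third sequence is left out.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pair-wise alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])  # residue of a against a gap in b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")       # gap in a against a residue of b
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "ACT"))
```

For the two short sequences above this yields a score of 1 with ACGT aligned against AC-T.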

Page 12

ClustalW

Based on phylogenetic analysis

A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm

The most closely-related pairs of sequences are aligned using dynamic programming

Each of the alignments is analyzed and a profile of it is created

Alignment profiles are aligned progressively for a total alignment

W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the phylogenetic tree (See p. 154 of Bioinformatics by Mount.)
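
To show how a distance matrix can dictate the order in which sequences are joined, here is a small sketch. Note the swap: ClustalW builds its guide tree with neighbor-joining, whereas this example uses simpler average-linkage clustering as a stand-in. The distance values and sequence names are made up.

```python
def guide_order(names, dist):
    """Return the cluster merges, closest first, using average linkage."""
    clusters = [(n,) for n in names]  # each cluster is a tuple of member names

    def avg_distance(c1, c2):
        # Mean distance over all cross-cluster pairs of members
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters
        _, i, j = min(
            (avg_distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical pairwise distances between four sequences
dist = {
    frozenset(("A", "B")): 0.1, frozenset(("A", "C")): 0.4,
    frozenset(("A", "D")): 0.5, frozenset(("B", "C")): 0.3,
    frozenset(("B", "D")): 0.6, frozenset(("C", "D")): 0.2,
}
for left, right in guide_order(["A", "B", "C", "D"], dist):
    print(left, "+", right)
```

Here A and B are joined first, then C and D, and finally the two pairs are joined, which is the order a progressive aligner would follow.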

Page 13

Problems with Progressive Method

Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off.

It’s not trivial to come up with a suitable scoring matrix or gap penalties.

Page 14

Iterative Methods for Multiple Sequence Alignment

Get an alignment.

Refine it.

Repeat until one MSA doesn’t change significantly from the next.

An example is the genetic algorithm approach.

Page 15

Genetic Algorithms

A general problem solving method modeled on evolutionary change.

Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations.

Use survival of the fittest, mutation, and crossover to guide evolution.

Page 16

Evolutionary Change in Genetic Algorithms

survival of the fittest – the best solutions survive and reproduce to the next generation

mutation – some solutions mutate in random ways (but they must always remain viable solutions)

crossover – solutions “exchange parts”

Page 17

Laying Out the Problem

What would a candidate solution look like in a multiple sequence alignment program? (an msa of ~20 proteins)

How many candidate solutions should there be? (~100)

Page 18

Evolving to a Next Generation

Which candidate solutions should survive to the next generation?

First, take the top half based on best sum-of-pairs scores

Then randomly select the second half, giving each MSA a chance of being selected in proportion to how good its score is
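
One way this selection rule might look in code is sketched below. Two choices here are assumptions, not something the slides specify: the second half is drawn with replacement from the whole population, and scores are shifted so that every selection weight is positive (sum-of-pairs scores can be negative).

```python
import random


def select_next_generation(population, scores, rng=random):
    """Keep the top half by score, then fill the rest by score-proportional sampling."""
    ranked = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    half = len(population) // 2
    survivors = [population[i] for i in ranked[:half]]

    # Shift scores so the worst candidate still has a small positive weight
    lowest = min(scores)
    weights = [s - lowest + 1e-9 for s in scores]

    # Better-scoring candidates are proportionally more likely to be drawn
    survivors += rng.choices(population, weights=weights, k=len(population) - half)
    return survivors


population = ["msa_a", "msa_b", "msa_c", "msa_d"]  # placeholders for real alignments
scores = [10, 40, 30, 20]
print(select_next_generation(population, scores, random.Random(0)))
```

The first two survivors are always the two best-scoring candidates; the remaining two vary with the random draw.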

Page 19

How would mutation work?

Can’t change a sequence in the MSA. Otherwise you would be creating a solution that isn’t really a solution.

You can only insert or rearrange gaps.
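
A minimal sketch of a gap-only mutation follows. It moves a single gap within one randomly chosen row, so the residues and the row length are untouched. The operator and its name are illustrative; a real program would likely use several different gap operators.

```python
import random


def mutate_gap(msa, rng=random):
    """Move one gap within one randomly chosen row; residues are never altered."""
    rows = list(msa)
    r = rng.randrange(len(rows))
    row = list(rows[r])
    gap_positions = [i for i, ch in enumerate(row) if ch == "-"]
    if not gap_positions:
        return rows  # this row has no gap to move, so leave the MSA as it is
    row.pop(rng.choice(gap_positions))            # take one gap out...
    row.insert(rng.randrange(len(row) + 1), "-")  # ...and put it back somewhere
    rows[r] = "".join(row)
    return rows


msa = ["AC-GT", "A-CGT", "ACGT-"]
print(mutate_gap(msa, random.Random(1)))
```

Stripping the gaps from any mutated row gives back the original sequence, which is what keeps the candidate a viable solution.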

Page 20

How would crossover work?

See page 160 in Bioinformatics by Mount.

Page 21

Profiles and Motifs

A sequence motif is a relatively short pattern that appears consistently within a family of proteins. (Motifs can also appear in families of DNA or RNA molecules.)

Frequently, motif-based analysis is used to detect patterns of amino acids in proteins that correspond to structural or functional features.

Motifs are identified during multiple sequence alignment. They can be displayed as patterns of amino acids, as sequence logos, or as profile scoring matrices.
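
As a rough illustration of how an alignment turns into a profile, the sketch below computes per-column symbol frequencies and a simple consensus string. It is a simplification: gaps are counted like any other symbol, and real profile scoring matrices are usually log-odds scores rather than raw frequencies. The example alignment is made up.

```python
from collections import Counter


def column_profile(msa):
    """Per-column symbol frequencies for an alignment with equal-length rows."""
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        total = len(column)
        profile.append({symbol: n / total for symbol, n in counts.items()})
    return profile


def consensus(msa):
    """Most common symbol in each column (ties go to the symbol seen first)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*msa))


msa = ["ACGT", "ACGA", "TCGT"]
print(consensus(msa))
for position, freqs in enumerate(column_profile(msa)):
    print(position, freqs)
```

Columns 1 and 2 are fully conserved (C and G), which is the kind of pattern a motif search would pick up.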