Introduction to Sequence Alignment


Outline


Introduction-Definitions

The need for sequence alignment

Classification of sequence alignments

The alignment problem-Complexity of alignment

Sequence Alignment

- Probably the most common “experiment” done in biology today
- Formally considered an experiment because you don’t know what you’ll get until you perform the operation
- As an experiment, it is based on a hypothesis, it uses a reproducible technique, and it generates results that lead to conclusions or more experiments

Fact: Sequence comparisons lie at the heart of all bioinformatics

Sequence Alignment

Sequence alignment is the assignment of residue-residue correspondences. It involves:

- precise operators for alignment: matching, gaps
- a quantitative scoring system for matches and gaps
- a systematic search among possible alignments
- the use of alignment algorithms to find the optimal alignment
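The first two ingredients (operators and a scoring system) can be sketched in a few lines of Python. The function name `score_alignment` and the numeric values for match, mismatch and gap are illustrative assumptions, not values given in these slides.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two equal-length gapped rows.

    The scoring values are assumed for illustration only.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total += gap        # gap operator
        elif a == b:
            total += match      # matching residues
        else:
            total += mismatch   # substituted residues
    return total


# 6 matching columns and 1 gap column: 6 * 1 + 1 * (-2) = 4
print(score_alignment("ATTGA-G", "ATTGATG"))
```

The remaining two ingredients, the search and the algorithm, are what dynamic-programming methods such as Needleman-Wunsch and Smith-Waterman provide.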

Algorithms

- An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
- First you must identify exactly what the problem is!
- A problem describes a class of computational tasks. A problem instance is one particular input from that class

Similarity versus Homology*

- Similarity refers to the likeness or % identity between 2 sequences
- Similarity means sharing a statistically significant number of bases or amino acids
- Similarity does not imply homology
- Homology refers to shared ancestry
- Two sequences are homologous if they are derived from a common ancestral sequence
- Homology usually implies similarity

Similarity versus Homology*

- Similarity can be quantified
- It is correct to say that two sequences are X% identical
- It is correct to say that two sequences have a similarity score of Z
- It is generally incorrect to say that two sequences are X% similar
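As a rough illustration of "X% identical", the sketch below counts identical columns in an existing alignment. Dividing by the full alignment length (gap columns included) is an assumption made here; other denominators are also in use, so the exact percentage depends on the convention chosen.

```python
def percent_identity(row1, row2):
    """Percentage of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the alignment length, gap columns included.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    if not row1:
        raise ValueError("alignment is empty")
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * identical / len(row1)


# 6 identical columns out of 7
print(round(percent_identity("ATTGA-G", "ATTGATG"), 1))
```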

Homologues & All That*

Homologue (or Homolog)
- Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term)
- Homology: genes that derive from a common ancestor; these genes are called homologs

Paralogue (or Paralog)
- A homologue which arose through gene duplication in the same species/chromosome
- Paralogous genes are homologous genes in one organism that derive from gene duplication
- Gene duplication: one gene is duplicated in multiple copies that are therefore free to evolve and assume new functions

Orthologue (or Ortholog)
- A homologue which arose through speciation (found in different species)
- Orthologous genes are homologous genes in different organisms

Mutations

Causes for sequence (dis)similarity:

- mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
- insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)
- deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
- indel: an insertion or a deletion

Importance: Alignments tell us about...*

- Function or activity of a new gene/protein
- Structure or shape of a new protein
- Location or preferred location of a protein
- Stability of a gene or protein
- Origin of a gene or protein
- Origin or phylogeny of an organelle
- Origin or phylogeny of an organism

Sequence Complexity*

MCDEFGHIKLAN…. High Complexity

ACTGTCACTGAT…. Mid Complexity

NNNNTTTTTNNN…. Low Complexity
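The slide does not say how complexity is measured. One common choice, assumed here purely for illustration, is the Shannon entropy of the residue composition: the more evenly the symbols are used, the higher the value. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the residue composition of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# The three example fragments from the slide (without the trailing ellipsis)
for fragment in ("MCDEFGHIKLAN", "ACTGTCACTGAT", "NNNNTTTTTNNN"):
    print(fragment, round(shannon_entropy(fragment), 2))
```

On these three fragments the entropy decreases from the first to the last, which matches the high / mid / low ordering on the slide.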

Assessing Sequence Similarity

Rbn KETAAAKFERQHMD

Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT

Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA

Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN

Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY

Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR

Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV

Lsz NRCKGTDVQA WIRGCRL

Is this alignment significant?

Is This Alignment Significant?

Gelsolin   89 L G N E L S Q D E S G A A A I F T V Q L 108
Annexin    82 L P S A L K S A L S G H L E T V I L G L 101
          154 L E K D I I S D T S G D F R K L M V A L 173
          240 L E - S I K K E V K G D L E N A F L N L 258
          314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333
Consensus     L x P x x x P D x S G x h x x h x V L L

Some Simple Rules**

- If two sequences are > 100 residues long and > 25% identical, they are likely related
- If two sequences are 15-25% identical, they may be related, but more tests are needed
- If two sequences are < 15% identical, they are probably not related
- If you need more than 1 gap for every 20 residues, the alignment is suspicious
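These rules of thumb can be written down as a small helper. The function name `rule_of_thumb` is hypothetical, and the verdict for sequences of 100 residues or fewer that exceed 25% identity is an assumption, since the rules do not cover that case.

```python
def rule_of_thumb(length, pct_identity, n_gaps):
    """Apply the simple rules to one pairwise alignment.

    length       -- number of aligned residues
    pct_identity -- percent identity (0-100)
    n_gaps       -- number of gaps in the alignment
    Returns (verdict, suspicious_gapping).
    """
    if pct_identity > 25:
        if length > 100:
            verdict = "likely related"
        else:
            # Assumption: the rules are silent on short alignments.
            verdict = "inconclusive (100 residues or fewer)"
    elif pct_identity >= 15:
        verdict = "may be related, more tests needed"
    else:
        verdict = "probably not related"
    # More than 1 gap per 20 residues makes the alignment suspicious.
    suspicious = n_gaps * 20 > length
    return verdict, suspicious


print(rule_of_thumb(200, 30.0, 5))
print(rule_of_thumb(200, 10.0, 15))
```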

Classifications of sequence alignments

a) Global/local sequence alignment
b) Pairwise/multiple sequence alignment

Global/local sequence alignment

1. Global alignment

- Input: treat the two sequences as potentially equivalent
- Goal: identify conserved regions and differences
- Algorithm: Needleman-Wunsch dynamic programming
- Applications:
  - Comparing two genes with same function (in human vs. mouse)
  - Comparing two proteins with similar function

Q: How similar are two sequences S1 and S2?

Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length (S’1, S’2 are S1, S2 with possibly additional gaps)

Example:
S1= GCGCATGGATTGAGCGA
S2= TGCGCCATTGATGACC

A possible alignment:
S’1= -GCGC-ATGGATTGAGCGA
S’2= TGCGCCATTGAT-GACC--
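A minimal sketch of Needleman-Wunsch global alignment follows. The scoring values (match +1, mismatch -1, gap -1) and the tie-breaking order in the traceback are assumptions, so the alignment it returns for the example sequences need not be identical to the one shown above.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.

    Returns (aligned_s1, aligned_s2, score). Scoring values are assumed.
    """
    n, m = len(s1), len(s2)
    # score[i][j] = best score for aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in s2
                              score[i][j - 1] + gap)      # gap in s1
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), score[n][m]


row1, row2, total = needleman_wunsch("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC")
print(row1)
print(row2)
print(total)
```

Both returned rows always have the same length, and removing the gap characters gives back the two input sequences.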


2. Local alignment

- Input: the two sequences may or may not be related
- Goal: see whether a substring in one sequence aligns well with a substring in the other
- Algorithm: Smith-Waterman dynamic programming
- Note: for local matching, overhangs at the ends are not treated as gaps
- Applications:
  - Searching for local similarities in large sequences (e.g., newly sequenced genomes)
  - Looking for conserved domains or motifs in two proteins

Q: Find the pair of substrings in two input sequences which have the highest similarity

Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length (S’1, S’2 are substrings of S1, S2 with possibly additional gaps)

Example:
S1= GCGCATGGATTGAGCGA
S2= TGCGCCATTGATGACC

A possible alignment:
S’1= ATTGA-G
S’2= ATTGATG
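A minimal sketch of Smith-Waterman local alignment, again with assumed scoring values (match +1, mismatch -1, gap -1). It differs from the global method in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(s1, s2, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming.

    Returns (aligned_sub1, aligned_sub2, score). Scoring values are assumed.
    """
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), best


print(smith_waterman("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC"))
```

Which local alignment is reported for the example sequences depends on the scoring values, so the result may differ from the one shown on the slide.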

Global vs. Local Alignments

- Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
- Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.


3. Semi-global alignment

- Input: two sequences, one short and one long
- Goal: is the short one a part of the long one?
- Algorithm: modification of Smith-Waterman
- Applications:
  - Given a DNA fragment (with possible error), look for it in the genome
  - Look for a well-known domain in a newly-sequenced protein
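One way to realise semi-global alignment is sketched below. It assumes that "modification" means letting the short sequence start and end anywhere in the long one at no cost, while gaps inside the short sequence are still penalised. The slides do not spell out the exact variant, so this formulation, the name `fit_score`, and the scoring values are all assumptions.

```python
def fit_score(short, long_seq, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `short` to some stretch of `long_seq`.

    Returns (score, end), where `end` is the position in long_seq just after
    the matched stretch. Scoring values are assumed.
    """
    n = len(long_seq)
    prev = [0] * (n + 1)  # row 0: skipping a prefix of long_seq is free
    for i in range(1, len(short) + 1):
        cur = [i * gap] + [0] * n  # unmatched start of `short` is penalised
        for j in range(1, n + 1):
            sub = match if short[i - 1] == long_seq[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align two residues
                         prev[j] + gap,      # gap in long_seq
                         cur[j - 1] + gap)   # gap in short
        prev = cur
    # Skipping a suffix of long_seq is free: take the best cell of the last row.
    end = max(range(n + 1), key=lambda j: prev[j])
    return prev[end], end


print(fit_score("ACGT", "TTTACGTTTT"))
```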

4. Suffix-prefix alignment

- Input: two sequences (usually DNA)
- Goal: is the prefix of one the suffix of the other?
- Algorithm: modification of Smith-Waterman
- Applications:
  - DNA fragment assembly

5. Heuristic alignment

- Input: two sequences
- Goal: see if two sequences are "similar" or candidates for alignment
- Algorithms: BLAST, FASTA (and others)
- Applications:
  - Search in large databases

Database search methods: Sequence Alignment

The most widely used local similarity algorithms are:

- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)


Which algorithm to use for database similarity search?

Speed: BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and uses a LOT OF COMPUTER POWER)

- FASTA is more sensitive than BLAST and misses fewer homologues
- Smith-Waterman is even more sensitive
- BLAST calculates probabilities
- FASTA is more accurate than BLAST for DNA-DNA searches

Pairwise/multiple sequence alignment

Multiple sequence alignment (MSA)

Multiple sequence alignment (MSA) can be seen as a generalization of pairwise sequence alignment: instead of aligning two sequences, n sequences are aligned simultaneously, where n > 2

Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps (“-”) into sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns where each column represents a homologous position

Note: MSA applies both to nucleotide and amino acid sequences

To construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment. Multiple alignments therefore typically contain more gaps than any given pair of aligned sequences
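The matrix view of a multiple alignment (n rows, L columns) can be sketched as follows. The helper name `msa_columns` is hypothetical; it only checks the shape required by the definition and returns the columns, and it says nothing about whether the columns are truly homologous.

```python
def msa_columns(rows):
    """Check that rows form an n x L matrix with n > 2 and return its columns."""
    if len(rows) < 3:
        raise ValueError("a multiple alignment needs more than two sequences")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all gapped sequences must have the same length L")
    # zip(*rows) transposes the matrix: one string per alignment column
    return ["".join(column) for column in zip(*rows)]


# Three gapped rows of length 4
print(msa_columns(["AGAC", "--AC", "AG--"]))
```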

Pairwise sequence alignment

A pairwise sequence alignment is an alignment of 2 sequences obtained by inserting gaps (“-”) such that the resulting sequences have the same length and where each pair of residues represents a homologous position

Keyword search vs. alignment

Keyword search
- keyword search is exact matching
- can be done quickly (straightforward scan)
- used in Entrez (for example)

Alignment
- non-exact, scored matching
- takes much more time
- used in tools like BLAST2, CLUSTALW

Why do we need (multiple) sequence alignment?

- Multiple sequence alignment can help to develop a sequence “finger print” which allows the identification of members of a distantly related protein family (motifs)
- Formulate & test hypotheses about protein 3-D structure
- MSA can help us to reveal biological facts about proteins (e.g. how protein function has changed, or the evolutionary pressure acting on a gene)
- Crucial for genome sequencing:
  - Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program
  - Sequence may be from one strand of DNA or the other, so complements of each sequence must also be compared
  - Sequence fragments will usually overlap, but by an unknown amount, and in some cases one sequence may be included within another
  - All of the overlapping pairs of sequence fragments must be assembled into a large composite genome sequence
- To establish homology for phylogenetic analyses
- Identify primers and probes to search for homologous sequences in other organisms

The alignment problem

It is not self-evident how these sequences are to be aligned together. Here are some possibilities:

Taxon A AGAC
Taxon B --AC
Taxon C AG--

Taxon A AGAC
Taxon C AG--
Taxon B --AC

Taxon B AC--
Taxon C AG--
Taxon A AGAC

Taxon B --AC
Taxon C --AG
Taxon A AGAC

How do we generate a multiple alignment? Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. Does it work?

Example:

Taxon A AGAC
Taxon B --AC

Taxon A AGAC
Taxon C AG--

Taxon B AC
Taxon C AG

It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment

The alignment problem

What happens when a sequence alignment is wrong?

[Figure: tree diagrams with leaf labels A B C, A C B and B C A]

A: AGT
B: AT
C: ATC

A: AGT
B: A-T
C: ATC

A: AGT
B: AT-
C: ATC

A: AGT-
B: A-T-
C: A-TC

From pairwise to multiple alignments

In pairwise alignments, one has a two-dimensional matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences

A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamic-programming algorithm in N dimensions. Algorithmically, this is not difficult to do

But what about execution time?

Algorithm Complexity: ‘The big-O notation’

One of the most important properties of an algorithm is how its execution time increases as the problem is made larger (e.g. more sequences to align). This is the so-called algorithmic (or computational) complexity of the algorithm

There is a notation to describe the algorithmic complexity, called the big-O notation. If we have a problem size (number of input data points) n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2)

It is important to realize that an algorithm that is quick on small problems may be totally useless on large problems if it has a bad O() behavior. As a rule of thumb one can use the following characterizations, where n is the size of the problem, and c is a constant:

The big-O notation

- To compute an N-wise alignment, the algorithmic complexity is something like O(c^(2n)), where c is a constant and n is the number of sequences

Example: if a pairwise alignment of two sequences [O(c^(2×2))] takes 1 second, then four sequences [O(c^(2×4))] would take 10^4 seconds (2.8 hours), five sequences [O(c^(2×5))] 10^6 seconds (11.6 days), six sequences [O(c^(2×6))] 10^8 seconds (3.2 years), seven sequences [O(c^(2×7))] 10^10 seconds (317 years), and so on
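The arithmetic behind this example can be checked with a short script. The figures quoted are consistent with c = 10, which is inferred from the numbers rather than stated on the slide; the function name `estimated_seconds` is hypothetical.

```python
def estimated_seconds(n_sequences, c=10, pairwise_seconds=1.0):
    """Running time under O(c^(2n)), scaled so that n = 2 takes pairwise_seconds.

    c = 10 is inferred from the slide's figures, not stated there.
    """
    return pairwise_seconds * c ** (2 * n_sequences) / c ** (2 * 2)


for n in range(2, 8):
    seconds = estimated_seconds(n)
    print(f"{n} sequences: {seconds:.0e} s"
          f" = {seconds / 3600:.1f} hours"
          f" = {seconds / 86400:.1f} days"
          f" = {seconds / (365 * 86400):.1f} years")
```

Each additional sequence multiplies the running time by c^2, which is a factor of 100 under this assumption.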

This is disastrous!

How to optimize alignment algorithms?

Use structural information:
- reading frame
- protein structure

Sequence elements are not truly independent but related by phylogeny

[Figure: phylogenetic tree of Human, Chimp, Gorilla and Orangutan; internal nodes labelled NK/-YLS, NFL/-S and NK/-Y/FL/-S]

Raw
Human      NYLS
Chimp      NKYLS
Gorilla    NFS
Orangutan  NFLS

Alignment
Human      N-YLS
Chimp      NKYLS
Gorilla    N-F-S
Orangutan  N-FLS

How to optimize alignment algorithms?

Sequences often contain highly conserved regions

These regions can be used for an initial alignment

By analyzing a number of small, independent fragments, the algorithmic complexity can be drastically reduced!

“Optimal” vs. “correct” alignment

For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations

This is partly due to:
- the complexity of the problem
- limitations of the scoring systems used
- our limited understanding of life and evolution

Success of the alignment will depend on the similarity of the sequences. If sequence variation is great it will be very difficult to find an optimal alignment

Sequence alignment and gaps

Gaps can occur:

- Before the first character of a string
- Inside a string
- After the last character of a string

CTGCGGG---GGTAAT
  ||||    || ||
--GCGG-AGAGG-AA-

Note: In protein-coding nucleotide sequences most gaps have a length that is a multiple of three (3N)

Gap Penalties

In the MSA scoring scheme, a penalty is subtracted for each gap introduced into an alignment because the gap introduces uncertainty into the alignment

The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment

Biologically, it should in general be easier for a sequence to accept a different residue in a position, rather than having parts of the sequence chopped away or inserted. Gaps/insertions should therefore be more rare than point mutations (substitutions)

In general, the lower the gapping penalties, the more gaps and more identities are detected but this should be considered in relation to biological significance

Most MSA programs allow for an adjustment of gap penalties
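The effect of the gap penalty can be illustrated by scoring two candidate alignments of the same pair of sequences under different penalties. The sequences, the scoring values and the names `column_score` and `preferred` are made up for illustration; the penalty is expressed here as a negative score added per gap column.

```python
def column_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Score two equal-length gapped rows column by column."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Two candidate alignments of ACGTACGT and CGTACGTA (hypothetical example)
UNGAPPED = ("ACGTACGT", "CGTACGTA")   # 8 mismatches, no gaps
GAPPED = ("ACGTACGT-", "-CGTACGTA")   # 7 matches, 2 gaps


def preferred(gap):
    """Name the candidate that scores higher under the given gap score."""
    gapped = column_score(*GAPPED, gap=gap)
    ungapped = column_score(*UNGAPPED, gap=gap)
    return "gapped" if gapped > ungapped else "ungapped"


for gap in (-1, -5, -10):
    print(gap, preferred(gap))
```

With a mild penalty the gapped alignment, which recovers seven identities, scores higher; with a severe penalty the ungapped alignment is preferred even though it contains no identities.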
