Pairwise and multiple sequence alignments
Alain Schenkel
Tuomas Hätinen
Bioinformatics group, Institute of Biotechnology, University of Helsinki
Protein Analysis Workshop 2006
Overview
Motivation – Why alignments?
Sequence comparison
  Dotplot
  The alignment problem
Pairwise alignment algorithms
  Exact algorithms
  Heuristic algorithms
  Database searches
Multiple sequence alignments
Web tools:
  Build alignments using SRS or EBI server,
  Blast at NCBI, EBI,
  PairsDB, …
Motivation
Proteins perform most of the functions required in biological systems:
Signaling (kinases, ...)
Enzymes (proteases, …)
Structural (collagen, elastin, …)
Immune system (antibodies, ...)
Storage and transport (hemoglobin, …)
…
Large amount of information available in current databanks.
Goal: Want to extrapolate information about the function of a newly discovered sequence by comparing it to annotated sequences.
Does it make sense?
All functional information is ultimately contained within the sequence.
Proteins are evolutionarily related:
Selective pressure is on function, and thus on residues with functional role
(eg: active site or structural key residues are conserved).
Modular nature of proteins.
Two sequences have the same structure if corresponding residues are
similar enough on physico-chemical level.
Application of sequence alignments
Determining function of newly discovered genetic or protein sequences.
Identification of functional patterns/domains.
Predicting structure of proteins.
Determining evolutionary relationships among genes, proteins, and entire species.
Aligning and comparing sequences, and searching databases for similar sequences – a cornerstone of bioinformatics!!
Sequence Comparison
• Alignment
• Dotplots
• The pairwise alignment problem
Pairwise alignment
Pairwise alignment = identification of residue-residue correspondence.
?????     101 AGVIGTILLISYGIRRLIKKSPSDVKP 115
              ||:||.|||::|..|||.|:.|:||.|
GLP_HORSE  60 AGIIGIILLLAYVSRRLRKRPPADVPP  86
What criteria should we use to obtain biologically meaningful alignments?
For the alignment to be meaningful, the correspondence should reflect the functional, or evolutionary, …, relationship (if any).
Some terminology
Identity:
percentage of pairs of identical residues between two aligned sequences.
Similarity:
percentage of pairs of similar residues between two aligned sequences.
one must define what similar means. Eg:
- as observed in well studied evolutionary
related protein families,
- physico-chemical amino acid
properties: hydropathy, size, …
Homology:
two sequences are homologous if and only if they have a common ancestor.
it's either yes or no.
not to be confused with similarity!
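As a minimal sketch of the two percentages defined above, the Python function below counts identical and similar pairs in an already aligned pair of sequences. The function name and the SIMILAR_GROUPS grouping are illustrative assumptions (a crude physico-chemical grouping, not a published scheme), and the percentages are taken over gap-free columns, which is only one of several conventions in use.

```python
# Hypothetical grouping of residues by rough physico-chemical class (an assumption).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "STNQ", "AG", "C", "P"]

def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    group_of = {res: idx for idx, grp in enumerate(SIMILAR_GROUPS) for res in grp}
    columns = identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns are skipped
        columns += 1
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif x in group_of and group_of[x] == group_of.get(y):
            similar += 1
    if columns == 0:
        return 0.0, 0.0
    return 100.0 * identical / columns, 100.0 * similar / columns

# First ten columns of the example alignment shown earlier.
ident, simil = identity_and_similarity("AGVIGTILLI", "AGIIGIILLL")
print(f"identity {ident:.0f}%, similarity {simil:.0f}%")
```

Homology cannot be computed this way: it is a yes/no statement about ancestry, for which these percentages are only evidence.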
Dotplots
The simplest way of comparing two sequences:
A dot is placed where both sequence elements are identical.
Gives an overview of all possible alignments.
Each diagonal indicates a possible (ungapped) alignment.
Sequence 1 (horizontal), Sequence 2 (vertical):
  A T C T T C G A T
T   ●   ● ●       ●
A ●             ●
C     ●     ●
G             ●
A ●             ●
T   ●   ● ●       ●
One possible alignment:
ATCTTCGAT
   | ||||
---TACGAT
Dots may be scored according to a sliding window and a similarity
cutoff to reduce noise:
The smaller the window, the more noise. With large windows, the sensitivity for short sequences is reduced.
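The sliding-window scoring described above can be sketched in a few lines of Python. This is a sketch under simple assumptions, not the method used by any particular server: a window counts as a dot when at least `cutoff` of its `window` positions are identical (a real tool would usually sum substitution-matrix scores instead), and the function names are hypothetical.

```python
def dotplot(seq1, seq2, window=1, cutoff=1):
    """Return the set of (i, j) window start positions where at least `cutoff`
    of the `window` residue pairs seq1[i+k], seq2[j+k] are identical."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= cutoff:
                dots.add((i, j))
    return dots

def render(seq1, seq2, dots):
    """Draw the dotplot as text, seq1 horizontal and seq2 vertical."""
    lines = ["  " + " ".join(seq1)]
    for j, res in enumerate(seq2):
        row = ["*" if (i, j) in dots else " " for i in range(len(seq1))]
        lines.append((res + " " + " ".join(row)).rstrip())
    return "\n".join(lines)

plain = dotplot("ATCTTCGAT", "TACGAT")                         # every identical pair
filtered = dotplot("ATCTTCGAT", "TACGAT", window=3, cutoff=3)  # only exact 3-letter windows
print(render("ATCTTCGAT", "TACGAT", plain))
print(sorted(filtered))
```

With window=3 and cutoff=3 only the two windows on the main matching diagonal survive, which illustrates how the window and cutoff trade noise against sensitivity.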
Filtering Out the Noise in Dotplots
LETVHKKLYAGQYQNAGQFCDDIWLMLDNA
| |   ||  ||||   |  || |||  |
LSTIKRKLDTGQYQEPWQYVDDVWLMFNN
Window size = 5, Similarity cutoff = 3
Using Dotmatcher from SRS
SRS at EBI: http://srs.ebi.ac.uk/
SRS at EMBnet Austria: http://emb2.bcc.univie.ac.at:8080/srs/
... or any servers listed at http://downloads.lionbio.co.uk/publicsrs.html
Check out the SRS version (bottom of page): different versions index different databases, so the search results might be different depending on
the version.
DotmatcherP (for proteins)
Enter sequences in FASTA format!
Advanced options: Change default window size, threshold score and scoring matrix
DotmatcherP
Comparing a protein with itself: identification of repeated protein domains.
Eg: Drosophila melanogaster SLIT
Identification of conserved protein domains. Using the default parameters window size = 10 and
threshold = 23:
DotmatcherP
Comparing two different sequences:
DotmatcherP
If we lower the window size and the threshold, we
observe lots of noise. Eg, with window size = 5, threshold = 10:
Another Dotplot server: Dotlet
Has more options and provides more flexibility than Dotmatcher. Some very useful features:
If only one sequence is entered, dotlet automatically compares it
against itself (finding repeats, low complexity regions, etc.).
Same application for both nucleic acid and protein sequences.
When comparing nucleic acid to nucleic acid, dotlet will reverse
complement one of the sequences and perform a second
comparison. This makes it possible, eg, to see structures like stem-loops.
Possible to compare a protein to a nucleic acid sequence. The
nucleic acid sequence is translated in the three forward frames and
pixels are set to the highest of the scores. This makes it possible, eg, to detect
introns/exons, frameshifts, etc.
Dotlet
At http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
Let's find repeated domains
in the following sequence :> SLIT_DROME (P24014):MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCTGLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVITTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSWLSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTLPDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLLLNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCESPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGRISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFEHLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCTCTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYNKLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQMKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNATCTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAKCMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHECKHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAVELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDPAQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLENKCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGNQCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY
1. Enter sequence
2. Enter name for sequence (optional)
3. Repeat for second sequence (optional)
4. Select scoring matrix, window size and zoom
5. Click "compute"!
Each pixel corresponds to a residue in the horizontal sequence and to a residue in the vertical sequence
The pixel's colour depends on how similar the two sequences are around these two positions
Possible to scroll the dotplot here
Possible to scroll the alignment here
Residues that match well in the alignment are coloured blue
Tuning of grayscale in order to make background noise disappear
Dotlet reverse complements one of the sequences, so stem-loops can be detected
Dotplot - Summary
Comparing a sequence with itself, can be used to
identify:
Repeated domains,
Regions of low complexity (eg, …GYCAAAAAAAAALK…).
Comparing two protein sequences, can be used to
identify: Local regions of similarity,
Conserved protein domains.
Dotplot - Summary
Good: visual detection of feature/similarity,
exploring the sequence organisation.
Bad: resolving regions of low similarity,
does not provide an alignment (no insertions/deletions).
To obtain an alignment, we need a method for lining up the diagonals in a dotplot.
  G A T C T A
G 1
A   2
T     3
C       4
A           5

GATCTA
GATC_A
The Pairwise Alignment Problem
Line up diagonals by edit operations:
substitution (mutation)
gap or indel (insertion/deletion)
seq1 IGTILLISYGIRRLIKKSPSDVKP----LPSPDTDVP
     || |||  |  ||| |  | || |     || |   |
seq2 IGIILLLAYVSRRLRKRPPADVPPPASTVPSADAPPP
[Figure: alignment path through the dotplot of sequence 1 against sequence 2, with a substitution, a deletion, an insertion and a gap labelled]
But there are many ways to align 2 sequences, so we need to score alignments to decide which is the best.
Scoring the Edit Operations
For example:
identical: +10 (it's good)
substitution: +2 for S-A, -1 for K-P, …
gap: -3
PSDVKP--P
| || |  |
PADVPPPAP
Score: +50+2-1+2*(-3) = 45
Choosing an appropriate scoring scheme is where biological information is introduced (eg, reward the evolutionarily most likely alignment).
Standard notation:
| for identical
: for very similar (eg, size and hydropathy)
. for somewhat similar (eg, size or hydropathy)
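The arithmetic of the scoring example can be checked with a short Python sketch. The scores are the ones quoted in the example (+10 identity, +2 for S-A, -1 for K-P, -3 per gap position); the names are hypothetical, and giving 0 to substitution pairs that the example does not list is an assumption made only to keep the sketch self-contained.

```python
IDENTITY_SCORE = 10
GAP_SCORE = -3
# Only the two substitution pairs quoted in the example; any other pair scores 0 (assumption).
SUBSTITUTION_SCORES = {frozenset("SA"): 2, frozenset("KP"): -1}

def score_alignment(aligned_a, aligned_b):
    """Sum the column scores of a pairwise alignment written with '-' for gaps."""
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += GAP_SCORE
        elif x == y:
            total += IDENTITY_SCORE
        else:
            total += SUBSTITUTION_SCORES.get(frozenset((x, y)), 0)
    return total

print(score_alignment("PSDVKP--P", "PADVPPPAP"))  # 5 identities, S-A, K-P, 2 gap positions
```

A full scoring scheme would replace the small dictionary with a complete substitution matrix such as the ones discussed in the following slides.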
Gap penalty
Few long gaps
are better than
many small gaps
Different scores for:
gap opening, eg: -5
gap extension, eg: L(-1) with L = length of extension
gap opening penalty > gap extension penalty
One long gap (gap opening, then gap extension):
TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK
Many small gaps:
IG-TI--LYDL-SYYAG---IR
IGKIIPRL--LVAY--VLIGSR
gap score = -5 -6
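To see why separate opening and extension scores favour a few long gaps, the following Python sketch sums the gap penalties of the two example alignments. It assumes one common convention, namely that the first position of each gap costs the opening score and every further position costs the extension score; other programs count this differently, so the totals are illustrative and need not match the figures on the slide. The function name is hypothetical.

```python
import re

def affine_gap_penalty(aligned_seq, gap_open=-5, gap_extend=-1):
    """Sum the gap penalties over every run of '-' in one aligned sequence.
    Assumed convention: first gap position costs gap_open, each further one gap_extend."""
    total = 0
    for run in re.finditer(r"-+", aligned_seq):
        length = len(run.group())
        total += gap_open + gap_extend * (length - 1)
    return total

one_long_gap = affine_gap_penalty("TIL--------LISYGIRRLIK")
many_small_gaps = (affine_gap_penalty("IG-TI--LYDL-SYYAG---IR")
                   + affine_gap_penalty("IGKIIPRL--LVAY--VLIGSR"))
print(one_long_gap, many_small_gaps)
```

Under this convention the single gap of 8 positions is penalised far less than the six short gaps covering 11 positions, because the opening score is paid only once.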
Gap penalty
Can also consider special penalty for gaps at end/beginning of
alignment (eg, zero penalty).
Need to be careful in adjusting the gap score to the substitution
score: too strong a penalty gives no gaps,
too weak a penalty gives too many gaps.
Insertions and deletions have been found to occur in nature at
significantly lower frequency than mutations.
Residue Substitution
A substitution score for each aa pair gives
a substitution matrix.
Most used: based on evolutionary relationship.
Two types: PAM series,
BLOSUM series.
PAM (Percent Accepted Mutation)
PAM1: observed mutations in
carefully selected sets of closely
related proteins (1572 sequences
from 71 families). (1978)
Idea: observed substitutions are the
result of 1 mutation (not many).
PAMn: iterate PAM1 n times to
obtain substitution rate between
more divergent sequences.
PAM:         0    30   80   110  200  250
% identity:  100  75   60   50   25   20
[Figure: PAM250 substitution matrix]
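The idea of iterating PAM1 n times amounts to raising a one-step mutation probability matrix to the n-th power. The Python sketch below shows this on a made-up 3-letter alphabet; the numbers in TOY_PAM1 are invented for illustration and are not real PAM1 data, and the published PAM scoring matrices are additionally converted from such probabilities into log-odds scores, a step this sketch leaves out.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(m, n):
    """Multiply the matrix by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result

# Toy one-step mutation probabilities (rows sum to 1); NOT real PAM1 values.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

for n in (1, 30, 250):
    m = mat_power(TOY_PAM1, n)
    unchanged = sum(m[i][i] for i in range(3)) / 3
    print(f"n={n:3d}: average probability that a residue is unchanged = {unchanged:.2f}")
```

The diagonal shrinks as n grows, which mirrors the table above: higher PAM numbers correspond to lower expected identity.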
BLOSUM (BLOck Substitution Matrix)
Based on a larger set than PAM is.
More recent than PAM. (1992)
Different approach than PAM:
not based on an explicit evolutionary
model,
observed aa substitutions in a set of
conserved aa patterns called blocks.
BLOSUMn: from blocks which are n%
identical.
BLOSUM62: empirically shown to be among
the best at detecting weak similarity.
BLOSUM62
Tips for using substitution matrices
Generally, BLOSUM matrices perform better than PAM for local similarity searches.
For database searches, the most commonly used matrix is BLOSUM62.
When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.
Caution: substitution matrices are statistical in nature. In a given
alignment, a substitution may or may not correspond to an actual
mutation.
Less divergent                          More divergent
BLOSUM 80        BLOSUM 62        BLOSUM 45
PAM 1            PAM 120          PAM 250
Pairwise alignment algorithms
• Exact algorithms
• Heuristic algorithms
• Database scanning
Pairwise Alignment Algorithms
Given a scoring scheme, an alignment algorithm tries to find the best
alignment between 2 sequences according to that scheme.
Exact algorithms: guaranteed to return an alignment with the best possible score.
Heuristic algorithms: not guaranteed to return best alignments,
but they are quicker (and hopefully still return good alignments).
Two types of alignment:
Global: forced over the entire length of 2 sequences.
Local: between substrings of 2 sequences.
Global vs Local Alignment
Global alignments: are sensitive to gap penalties,
do not take into account the modular nature of
proteins,
can be used to compare 2 proteins with same
function (in, eg, human/mouse).
Local alignments: are sensitive to modular nature
of proteins. They can be used to: look for conserved domains or motifs in 2 proteins,
search for local similarities in large sequences,
database searches,
scanning an entire genome with a short sequence.
Exact Algorithms: Dynamic Programming
Exhaustive search among all possible
alignments is not possible (eg, for 2 sequences of
100 and 95 residues: 55 million alignments with 5
gaps).
Problem solved by dynamic programming:
1. initialize top row and left column,
2. compute best local scores iteratively,
3. keep track of where best local score comes from,
4. traceback to obtain the best alignments.
Several best solutions may exist: an alignment
reported to you may be one among a number of
possibilities.
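The four dynamic programming steps listed above can be sketched as a small global alignment in Python, in the style of Needleman-Wunsch. This is a teaching sketch under stated assumptions: a simple match/mismatch score and a single linear gap score stand in for a substitution matrix and opening/extension penalties, the parameter values are arbitrary, and the traceback re-derives each move from the score table instead of storing pointers. When several best solutions exist, it returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best global score, aligned a, aligned b) for one optimal alignment."""
    n, m = len(a), len(b)
    # Step 1: initialise the top row and left column with cumulative gap scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Step 2: each cell takes the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Steps 3-4: trace back from the bottom-right corner, checking which move
    # reproduces the stored score (ties are resolved in favour of the diagonal).
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best, top, bottom = needleman_wunsch("GATCTA", "GATCA")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.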
How can we find the best alignment between 2 sequences?
best global score
Example of 2 best solutions:
ATTCTCTGA-TAC--TGA
ATTCTCTGA-TA--CTGA
The example is from www.pasteur.fr
Global Alignment Servers (Exact Algorithm)
Use the Needleman-Wunsch algorithm (1970).
Server at SRS: NeedleP (http://srs.ebi.ac.uk/, under Tools).
Server at EBI: EMBOSS-Align.
Let's submit to http://www.ebi.ac.uk/emboss/align/index.html the sequences:
>uniprot|P35858|ALS_HUMAN Insulin-like growth factor-binding protein complexMALRKGGLALALLLLSWVALGPRSLEGADPGTPGEAEGPACPAACVCSYDDDADELSVFCSSRNLTRLPDGVPGGTQALWLDGNNLSSVPPAAFQNLSSLGFLNLQGGQLGSLEPQALLGLENLCHLHLERNQLRSLALGTFAHTPALASLGLSNNRLSRLEDGLFEGLGSLWDLNLGWNSLAVLPDAAFRGLGSLRELVLAGNRLAYLQPALFSGLAELRELDLSRNALRAIKANVFVQLPRLQKLYLDRNLIAAVAPGAFLGLKALRWLDLSHNRVAGLLEDTFPGLLGLRVLRLSHNAIASLRPRTFKDLHFLEELQLGHNRIRQLAERSFEGLGQLEVLTLDHNQLQEVKAGAFLGLTNVAVMNLSGNCLRNLPEQVFRGLGKLHSLHLEGSCLGRIRPHTFTGLSGLRRLFLKDNGLVGIEEQSLWGLAELLELDLTSNQLTHLPHRLFQGLGKLEYLLLSRNRLAELPADALGPLQRAFWLDVSHNRLEALPNSLLAPLGRLRYLSLRNNSLRTFTPQPPGLERLWLEGNPWDCGCPLKALRDFALQNPSAVPRFVQAICEGDDCQPPAYTYNNITCASPPEVVGLDLRDLSEAHFAPC
>uniprot|O08770|GPV_RAT Platelet glycoprotein V precursor (GPV) (CD42D).MLRSVLLSAVLSLVGAQPFPCPKTCKCVVRDAVQCSGGSVAHIAELGLPTNLTHILLFRMDRGVLQSHSFSGMTVLQRLMLSDSHISAIDPGTFNDLVKLKTLRLTRNKISHLPRAILDKMVLLEQLFLDHNALRDLDQNLFQKLLNLRDLCLNQNQLSFLPANLFSSLGKLKVLDLSRNNLTHLPQGLLGAQIKLEKLLLYSNRLMSLDSGLLANLGALTELRLERNHLRSIAPGAFDSLGNLSTLTLSGNLLESLPPALFLHVSWLTRLTLFENPLEELPEVLFGEMAGLRELWLNGTHLRTLPAAAFRNLSGLQTLGLTRNPLLSALPPGMFHGLTELRVLAVHTNALEELPEDALRGLGRLRQVSLRHNRLRALPRTLFRNLSSLVTVQLEHNQLKTLPGDVFAALPQLTRVLLGHNPWLCDCGLWPFLQWLRHHLELLGRDEPPQCNGPESRASLTFWELLQGDQWCPSSRGLPPDPPTENALKAPDPTQRPNSSQSWAWVQLVARGESPDNRFYWNLYILLLIAQATIAGFIVFAMIKIGQLFRTLIREELLFEAMGKSSN
choose scoring matrix
gap penalties
NeedleP at SRS
options for gap penalties
choose scoring matrix (optional)
Local Alignment Servers (Exact Algorithm)
Server at EMBnet: LALIGN, uses SIM algorithm (1991) http://www.ch.embnet.org/software/LALIGN_form.html
Server at SRS: http://srs.ebi.ac.uk/ Tools.
WaterP. Uses the Smith-Waterman algorithm (1981)
MatcherP. Can be used to find various local alignments
between 2 sequences. Slower than WaterP.
Server at EBI (Smith-Waterman algorithm). http://www.ebi.ac.uk/emboss/align/index.html
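For comparison with the global case, here is a minimal Python sketch of local alignment in the style of Smith-Waterman. It is not the implementation behind WaterP, MatcherP or LALIGN; the match/mismatch/gap values are arbitrary assumptions, a linear gap score is used, and only one best local alignment is returned.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

print(smith_waterman("ATCTTCGAT", "TACGAT"))
```

The only changes from the global sketch are the floor at 0, starting the traceback at the best cell, and stopping it at 0, which is what lets the alignment cover substrings only.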
Heuristic Algorithms
Motivations:
Exact algorithms are exhaustive but computationally expensive.
Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning),
and so database scanning requires faster alignment algorithms (at the cost of optimality).
Heuristic Algorithms
Probing a database with a query is similar to aligning a query with
a very long sequence.
Main idea: Use dynamic programming, but limited to (sub-)sequences which are
likely to produce interesting alignments with the query.
Heuristic part of the algorithm: eliminate from search uninteresting
sequences (need to make a guess).
Algorithms:
FASTA: Lipman-Pearson (1985).
BLAST (Basic Local Alignment Search Tool): Altschul et al. (1990).
Need fast local alignment methods.
BLAST Overview
Many versions for different query-database cases:
blastp: protein query - protein database
blastn: nucleotide query - nucleotide database
blastx: nucleotide query (translated to protein) - protein database
tblastn: protein query - nucleotide database (translated to protein)
tblastx: nucleotide query (translated to protein) - nucleotide database (translated to protein)
Comes in many flavours. Fast and reliable. Easy to use.
BLAST Overview
BLAST computes “an alignment”, not necessarily the exact optimal alignment.
Given the query and the database (long sequence):
Find all words of length k (typical: k=4) that match the query with a score high enough.
Look for subsequences in the database that contain these words.
Extend subsequences to see if match score can be increased.
Compute total score when no more extensions are possible.
Rank the alignments.
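The word-seeding and extension steps above can be illustrated with a heavily simplified Python sketch. It is not the BLAST implementation: it only looks for exact word matches (BLAST also accepts similar words scoring above a threshold), it extends a hit only while residues are identical (BLAST scores the extension with a substitution matrix), and the function names are hypothetical.

```python
def find_seeds(query, subject, k=4):
    """Index every k-letter word of the query, then report exact word hits
    in the subject as (query_pos, subject_pos) pairs."""
    words = {}
    for q in range(len(query) - k + 1):
        words.setdefault(query[q:q + k], []).append(q)
    seeds = []
    for s in range(len(subject) - k + 1):
        for q in words.get(subject[s:s + k], []):
            seeds.append((q, s))
    return seeds

def extend_seed(query, subject, q, s, k=4):
    """Extend an exact seed to the left and right while residues stay identical
    (a simplification of the scored, ungapped extension used in practice)."""
    left = 0
    while q - left > 0 and s - left > 0 and query[q - left - 1] == subject[s - left - 1]:
        left += 1
    right = k
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    return query[q - left:q + right]

query = "LETVHKKLYAGQYQNAGQFCDDIWLMLDNA"
subject = "LSTIKRKLDTGQYQEPWQYVDDVWLMFNN"
seeds = find_seeds(query, subject)
print(seeds)
print([extend_seed(query, subject, q, s) for q, s in seeds])
```

Because only the neighbourhood of seed hits is examined, most of the database is never aligned at all, which is where the speed (and the loss of guaranteed optimality) comes from.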
How should the different matched (sub-)sequences be ranked?
Significance of Alignments
Raw scores cannot be used to rank alignments:
a bad but long alignment may have a higher score than a good but short alignment.
We need a normalized scoring scheme that would allow us to compare alignments and evaluate their biological significance.
Idea:
Probe the database with random sequences.
This gives a distribution of scores (it follows the extreme-value distribution).
Establish a threshold for significance.
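The idea of probing with random sequences can be sketched in Python by shuffling one sequence many times and counting how often the shuffled score reaches the observed one. This is an empirical sketch under stated assumptions, not the statistics used by BLAST or PRSS: it uses a simple match/mismatch/gap local score, a fixed random seed for repeatability, and an add-one correction; it does not fit an extreme-value distribution. All names are hypothetical.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, keeping two rows in memory)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def empirical_p_value(query, subject, shuffles=100, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    residues = list(subject)
    as_good = 0
    for _ in range(shuffles):
        rng.shuffle(residues)
        if local_score(query, "".join(residues)) >= observed:
            as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (as_good + 1) / (shuffles + 1)

seq = "LETVHKKLYAGQYQNAGQFCDDIWLMLDNA"
print(empirical_p_value(seq, seq))
```

Shuffling keeps the length and composition of the sequence, so the random scores reflect what those two properties alone can produce.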
Extreme-Value Distribution
[Figure: score distribution for random sequences (x-axis: score), with the score of our query marked. P-value: probability that the score of our query is no better than random.]
Difficulty: finding a significance threshold.
Quantifying the Significance of Alignments
For an alignment with raw score S:
P-value: The probability of an alignment occurring with score S or better if the aligned-against sequence is random.
The lower the P-value, the more significant the alignment.
E-value: Expected number of alignments with scores equivalent to or better than S to occur by chance only.
The lower the E-value, the more significant the alignment.
E-value = P-value * size of database.
Rough Guide for P-values and E-values
P-value (reported by many programs): 0 ≤ P-val ≤ 1
E-value (reported by some programs, eg PSI-Blast): 0 ≤ E-val ≤ size of database
P ≤ 10^-100              Exact match
10^-100 < P < 10^-50     Sequences very nearly identical, e.g.: alleles or SNPs
10^-50 < P < 10^-10      Closely related sequences, homology certain
10^-5 < P < 10^-1        Usually distant relatives
P > 10^-1                Match probably insignificant
E ≤ 0.02                 Sequences probably homologous
0.02 ≤ E ≤ 1             Homology can't be ruled out
E > 1                    This match would be obtained by chance
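As a small worked example, the Python sketch below applies the relation E-value = P-value * size of database quoted earlier and then reads the result against the rough E-value guide. The function names are hypothetical, the thresholds are the rule-of-thumb values from the guide rather than strict statistical boundaries, and the database size in the example is an arbitrary assumption.

```python
def e_value(p_value, database_size):
    """E-value = P-value * size of database (the relation quoted in the slides)."""
    return p_value * database_size

def interpret_e_value(e):
    """Rule-of-thumb reading of an E-value, following the rough guide."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e <= 1:
        return "homology can't be ruled out"
    return "this match would be obtained by chance"

# Hypothetical example: the same P-value searched against databases of different size.
for size in (1_000, 1_000_000, 1_000_000_000):
    e = e_value(1e-7, size)
    print(f"database size {size:>13,}: E = {e:g} -> {interpret_e_value(e)}")
```

The example shows why an alignment that looks convincing in a small database can become unremarkable in a very large one.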
Heuristic Algorithms Servers
Pairwise alignment:
BLAST: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
Database screening:
FASTA: http://www.ebi.ac.uk/fasta33/ , SRS, …
BLAST:
- SRS (at EBI or ...)
- http://www.ncbi.nlm.nih.gov/BLAST/
- http://www.ebi.ac.uk/blast/index.html
- http://www.ch.embnet.org/software/bBLAST.html
- http://www.ch.embnet.org/software/aBLAST.html
Evaluating the significance of an alignment:
PRSS: http://www.ch.embnet.org/software/PRSS_form.html
BLAST Servers
Blast has many options : choice of database, substitution matrix, …
basic or advanced section.
BLAST interfaces are different: NCBI: excellent help pages and tutorial
SRS: easy multiple alignment access
EMBnet: simple text + graphical output.
Remark: there is a server with a powerful implementation of Smith-Waterman for database screening: http://www.ebi.ac.uk/MPsrch/. Runs about 50 times slower, but is more sensitive and returns fewer false positives than Blast.
BLAST at NCBI
>1IGR:A INSULIN-LIKE GROWTH FACTOR RECEPTOR EICGPGIDIRNDYQQLKRLENCTVIEGYLHILLISKAEDYRSYRFPKLTVITEYSLGDLFPNLTVIRGWKLFYNYALVIFEMTNLKDIGLYNLRNITRGAIRIEKNADLCYLSTVDWSLILDAVSNNYIVGNKPPKECGDLCPGTMEEKPMCEKTTINNEYNYRCWTTNRCQKMCPSTCGKRACTENNECCHPECLGSCSAPDNDTACVACRHYYYAGVCVPACPPNTYRFEGWRCVDRDFCANILSAESSDSEGFVIHDGECMQECPSGFIRNGSQSMYCIPCEGPCPKVCEEEKKTKTIDSVTSAQMLQGCTIFKGNLLINIRRGNNIASELENFMGLIEVVTGYVKIRHSHALVSLSFLKNLRLILGEEQLEGNYSFYVLDNQNLQQLWDWDHRNLTIKAGKMYFAFNPKLCVSEIYRMEEVTGTKGRQSKGDINTRNNGERASCESDVDDDDKEQKLISEEDLN
Let's submit the query sequence
at http://www.ncbi.nlm.nih.gov/BLAST/
We paste our sequence here and launch the search
substitution matrix
Conserved domains
Graphical overview of hits – coloured according to similarity
Hits
Alignment for each of the hits
E value: Expectation value.
Expected # of alignments with scores equivalent to or better than S to occur by chance. The lower the E value, the more significant the score.
Bit score: S’
The value S’ is derived from the raw alignment score S, but statistical properties of the scoring system have been taken into account. Because bit scores are normalised w.r.t. scoring system, they can be used to compare alignment scores from different searches.
NCBI Blast output help: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Blast_output.html
BLAST at SRS EBI
SRS EBI: View results using BlastAlignment
Alignments are displayed
BLAST at EMBnet
Graphic output on/off
BLAST Variants
PHI-Blast: Pattern-Hit Initiated Blast:
Searches for proteins that contain a specified pattern AND are similar to the query sequence in the neighborhood of the pattern.
Patterns must follow the syntax of PROSITE.
PSI-Blast: Position-Specific Iterated Blast:
More sensitive, ie better at detecting distant relationships, than BLAST.
Computes position-specific scoring matrices (PSSMs) to score matches between query and database sequences. (Blast uses precomputed substitution matrices, eg BLOSUM62.)
PSI-BLAST
Repeatedly searches the target databases.
At each round: compute a multiple alignment of high scoring sequences to
generate a new PSSM for next round of searching.
Iterates until no new sequences are found (or until a maximal
number of iterations is reached).
Rules of thumb for pairwise alignment
Use server defaults in the absence of any other information.
Adjust the substitution matrix to the expected divergence of the 2 sequences. Use BLOSUM62 if no a priori information.
For distantly related sequences, use PSI-Blast rather than BLAST.
Many ways of aligning 2 sequences.
A returned alignment is not the absolute truth.
Inspect the alignment from the biologist's perspective.
PairsDB
A database of pre-computed Blast and Psi-Blast alignments.
Continually updated.
Source databases: Uniprot, PDB, EMBL, Worm database, ENSEMBL, NCBI genomes, RefSeq.
PairsDB thus provides a quick and easy way to explore protein sequences and their relationships.
PairsDB
NRDB90: non-redundant database at 90%, etc.
Seq databases: - Uniprot - PDB - ...
remove redundancy at 90%
NRDB90
NRDB80
NRDB70
NRDB40
NRDB30
...
BlastP all-on-all
Psi-Blast all-on-all
A set of alignments
A set of alignments
PairsDB: http://www.csc.fi/cgi-bin/pairsdb/pairsdb.cgi
PairsDB
Multiple sequence alignment
• Motivation
• Algorithms overview
• Clustalw
• Clustal-X
Multiple Sequence Alignment
Given a set of N ≥ 3 sequences, we want to find the best way of aligning these sequences simultaneously.
A multiple alignment does not reflect the level of pairwise similarity between pairs of sequences.
-----------------NC------------------------------- 142
-----------------ACF------------------------------ 141
---------------IRGCRL----------------------------- 147
---------------MAECWSHGSNSVFPF-------------------- 158
VTPSVKPSHASQEVKLHDSTSYAQNPFLSLLGKPIVPAQAPIKPQSKPPS 792
------------------CEAQ---------------------------- 142
----------------VACNLRSLSPVRSPRGFLTG-------------- 179
Motivations
Pairwise sequence alignment is easy with sufficiently
closely related sequences.
Below a certain level of identity sequence alignment may
become uncertain : twilight zone for aa sequences ~ 30%.
In or below the twilight zone it is good to make use of
additional information, eg, from evolution.
Motivations
A multiple alignment of diverse sequences is more informative than a pairwise alignment:
residues conserved over a longer period of time are under stronger evolutionary constraints.
Reasons for aligning sets of sequences:
organize data to reflect sequence homology,
estimate evolutionary distance,
infer phylogenetic trees from homologous sites,
highlight variable and conserved sites/regions,
determine substitution frequencies,
pattern/domains identification,
helpful for protein structure prediction.
An alignment of 8 fragments of immunoglobulin:
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--
Alignment highlights:
Conserved residues: One of the cysteines forming the disulphide bridges, and the tryptophan.
Conserved regions (e.g. Q.PG).
Patterns (e.g.: dominance of hydrophobic residues at positions 1 and 3). The alternating hydrophobicity pattern is typical for surface beta-strand at the beginning of each fragment.
Consensus Sequence
Simplest Form:
A single sequence which represents the most common amino
acid/base in that position
Y D D G A V - E A L
Y D G G - - - E A L
F E G G I L V E A L
F D - G I L V Q A V
Y E G G A V V Q A L
-----------------------
Y D G G A/I V/L V E A L
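The simplest form of consensus described above can be computed column by column. The Python sketch below takes the most common residue in each column of the example alignment, ignoring gaps, and reports ties as alternatives separated by "/". The function name and the tie convention (alternatives listed in order of first appearance) are assumptions made for this illustration.

```python
from collections import Counter

def consensus(alignment):
    """Most common residue per column; gaps are ignored and ties are joined with '/'."""
    columns = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        if not counts:
            columns.append("-")  # column made of gaps only
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so tied residues appear in that order.
        columns.append("/".join(res for res, n in counts.items() if n == top))
    return " ".join(columns)

alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(consensus(alignment))
```

A consensus of this kind discards how strongly each position is conserved, which is one reason profiles and position-specific matrices are preferred for searching.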
Multiple Sequence Alignments Algorithms
Multiple sequence alignment uses heuristic methods only:
With dynamic programming, computational time quickly explodes as the number of sequences increases.
Different methods/algorithms:
Segment-based (DiAlign, T-Coffee…).
Iterative (HMMs, SAGA, DiAlign, PRRP, …).
Progressive (Clustalw, T-Coffee, PileUp, …).
ClustalW:
First described by D.G. Higgins and P.M. Sharp (1988).
Can be used for nucleotide or amino acid sequences.
Clustalw Algorithm
Step1: Calculate all pairwise alignments and calculate distances for all pairs of sequences.
Step 2: Construct guide tree joining the most similar sequences using Neighbour Joining.
   A  B  C  D  E
B  2
C  4  4
D  6  6  6
E  6  6  6  4
F  8  8  8  8  8
[Figure: Step 1, the pairwise distance matrix above; Step 2, the resulting guide tree]
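To make the step from a distance matrix to a guide tree concrete, the Python sketch below clusters the example distances. Note the substitution: it uses UPGMA (average-linkage clustering), which is much shorter to write, and not the Neighbour Joining method named in Step 2, so it only illustrates the principle of joining the closest sequences first. The function name and the nested-tuple tree representation are assumptions.

```python
from itertools import combinations

def upgma(names, distances):
    """Build a guide tree (nested tuples) by repeatedly joining the two closest clusters.
    `distances` maps frozenset({a, b}) to the distance between sequences a and b."""
    # Key: the tree built so far; value: the list of sequences it contains.
    clusters = {name: [name] for name in names}

    def average_distance(members_x, members_y):
        total = sum(distances[frozenset((a, b))] for a in members_x for b in members_y)
        return total / (len(members_x) * len(members_y))

    while len(clusters) > 1:
        # Closest pair of clusters; on ties the first pair encountered is used.
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]]))
        merged = clusters.pop(x) + clusters.pop(y)
        clusters[(x, y)] = merged
    return next(iter(clusters))

names = ["A", "B", "C", "D", "E", "F"]
# Lower triangle of the example distance matrix, row by row.
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
distances = {frozenset((row, names[k])): value
             for row, values in rows.items() for k, value in enumerate(values)}

print(upgma(names, distances))
```

On this example A and B are joined first, C joins them, D and E form a second group, and F is attached last; the progressive alignment would then follow the tree from the leaves inwards.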
Clustalw Algorithm
Step 3: From the tree assign weights for each sequence:
We want to down-weight nearly identical sequences and up-weight the most divergent ones.
Step 4: Align sequences, starting at the leaves of the guide tree:
Pairwise comparisons as well as comparison of single sequence with a group of sequences (Profile)
Clustalw Algorithm
Some features:
Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.
Reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
Insertions and deletions are more common in loop regions than in the core of the protein!
Clustalw
Clustalw is not optimal.
There are known areas in which Clustalw performs badly, for example:
errors introduced early cannot be corrected by subsequent information,
alignments of sequences of differing lengths cause strange guide trees and unpredictable effects.
Also use other tools, slower but better depending on the situation:
T-Coffee: http://www.ch.embnet.org/software/TCoffee.html
DiAlign: http://dialign.gobics.de/
POA: http://www.bioinformatics.ucla.edu/poa/
SAGA
... and more at http://helix.nih.gov/apps/bioinfo/msa.html.
ClustalW Servers
Servers:
EBI: http://www.ebi.ac.uk/clustalw/
SRS: eg, http://srs.ebi.ac.uk/ (under tools, then multiple alignments)
EMBnet: http://www.ch.embnet.org/software/ClustalW.html
Let’s build a multiple alignment for the following sequences :
>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVWVHTELGYFSGEAVPSNGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRVTDVIITHAHADRIGGMKTLKERGIKAHSTALTAELAKKNGYEEPLGDLQSVTNLKFGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNVADAYVNEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK>gi|2984094 MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAYFYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFREVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQVGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWLVCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVECVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE>gi|115023|sp|P10425|MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVWVHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIENMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK>gi|115030|sp|P25910|MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGMIVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQRKGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIVVWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGHGDYGGTELIEHTKQIVNQYIESTSKP>gi|282554|pir||S25844 MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWVDKLAAGPGRTVVNTHFHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWPRVDWGEIELRPPNVTFRDRLTLHVGERQVELICVGPAHTDHDVVVWLPEERVLFAGDVVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGP EVIDANRDYLRWVQRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAMEIFAELVAYNGGQLPTCLA
ClustalW at EBI
Many options: CPU mode,
full/fast alignment,
window length in fast mode,
…
gap penalties.
ClustalW at EBI
Automatic display of:
Score table
Alignment (optional colouring)
Guide tree
Link to Jalview alignment editor! (More on Jalview at end of week.)
Running Clustalw from SRS (Columbia University)
Running Clustalw from SRS
View results using: *complete entries*
View results using: ClustalwAli
Clustal-X
Windows or Linux interface for the ClustalW multiple sequence alignment program.
Integrated environment for performing multiple sequence and profile alignments and analyzing the results.
A versatile coloring scheme:
allows highlighting of conserved features in the alignment,
fully customizable.
Does not have as versatile gap penalty options as the servers.
Start with sequences in FASTA format (or an existing alignment in Clustal format).
[Do Alignment] on the alignment menu.
Using Clustal-X
Clustal X input: can read FASTA format (and 6 others)
Output: alignment (coloured) and consensus sequence:
* indicates a single, fully conserved residue
: indicates that one of the following ‘strong’ groups is fully conserved:
STA, NEQK, NHQK, NDEQ, QHRK, MILV, MILF, HY, FYW
. indicates that one of the following ‘weaker’ groups is conserved:
CSA, ATV, SAG, STNK, STPA, SGND, SNDEQK, NDEQHK, NEQHRK, FVLIM, HFY
Residues are coloured by type by default, but colouring scheme is customizable.
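The consensus symbols listed above can be reproduced with a short Python sketch that tests each alignment column against the ‘strong’ and ‘weaker’ groups. The groups are copied from the list above; the function names are hypothetical, and leaving columns that contain a gap unmarked is an assumption of this sketch rather than something stated in the slides.

```python
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (a string of residues)."""
    residues = set(column)
    if "-" in residues:
        return " "  # assumption: columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"  # a single, fully conserved residue
    if any(residues <= set(group) for group in STRONG):
        return ":"  # all residues fall within one 'strong' group
    if any(residues <= set(group) for group in WEAK):
        return "."  # all residues fall within one 'weaker' group
    return " "

def conservation_line(alignment):
    """Build the symbol line printed under an alignment."""
    return "".join(conservation_symbol(col) for col in zip(*alignment))

fragment = ["VTISC", "LRLSC", "PEVTC"]
print("\n".join(fragment))
print(conservation_line(fragment))
```

The strong groups are checked before the weaker ones, so a column that fits both is reported with the stronger symbol.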
Source: ClustalX help search on google: => http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
Using Clustal-X with JalView
Proteins: 1MBD (myoglobin), 4HHB-B (hemoglobin), 1ECD (hemoglobin)
• Feed sequences to Clustal-X to compute alignments, trees, ...
• Feed an alignment to JalView to edit the alignment.
The most hydrophobic residues according to this table are coloured red and the most hydrophilic ones are coloured blue. The colours of the in between residues are varying shades of purple according to whereabouts they are on the scale.
A note on the example
It is atypical: It uses only three sequences. One should use more in order to extract reliable information.
It illustrates a common mistake: It uses too closely related sequences. One should use sequences that are as divergent and diverse as
possible in order to extract relevant information.
References
Tutorials:
Blast: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
Clustal-X: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
Sequence analysis:
D.W. Mount: Bioinformatics, Sequence Analysis and Genome Analysis. Cold Spring Harbor Laboratory Press, 2004 (2nd edition)
…