Tools & techniques for finding and aligning homologous sequences
Sequence Searching and Alignments
Andrew Cowley, Web Production Team
[email protected]@ebi.ac.uk
Materials
http://www.ebi.ac.uk/~apc/Courses/Brazil
About me
• MA Biochemistry, Cambridge University
• MRes Bioinformatics, University of York
• PhD CASE Studentship, Structural Bioinformatics, University of York
• Training bioinformatics since 2005
• Joined EMBL-EBI in 2010
From Derbyshire, UK
Many hobbies!
Derbyshire
Beautiful countryside
1871: South Derbyshire Football Association
1884: Derby County F.C.
“Hold the record for the lowest ever points finish in the Premier League”
Contents
• Sequence databases
• Text searching
• Sequence similarity searching
• Alignment basics
• Similarity searching tools
• Improving algorithms
• Guidelines
• Problem sequences
• Multiple sequence alignments
Sequence Databases
Primary vs Secondary
• Primary data comes from experiments/submitters
• Derived (or secondary) data is generated with additional work (by curators etc.) from the primary data
Nucleotide primary data
Individual scientists
Large-scale sequencing projects
Primary sequence data
Primary sequence database
• Original sequence data
• Experimental data
• Patent data
• Submitter-defined
Patent Offices
ACTGCTGCTAGCTAGCTGATCTATGCTAGCTGTAGCTGAG
GenBank (U.S.A.)
DDBJ (Japan)
ENA (Europe)
INSDC: • International Nucleotide Sequence Database Collaboration
• Daily exchange of data
Submission can be made to any INSDC database
Nucleotide primary data
Assembled sequences
Raw data
Annotated sequence
Large-scale sequencing
projects
Individual scientists
Patent Offices
ENA
EMBL-Bank (ENA Annotation)
Sequence Read Archive (SRA)
EMBL-Coding etc.
Nucleotide primary data
Ensembl/genomes
IMGT/HLA
Nucleotide sequence resources at EMBL-EBI
• European Nucleotide Archive (ENA)
• ENA sequence – Annotated sequence entries
• Sequence Read Archive (SRA) – sequence read data
• Sequence Version Archive (SVA) – historical entry version
• ENA Coding/Non-coding
• Ensembl
• Assembled genomes and annotations for Vertebrates
• Ensembl Genomes
• Extending Ensembl to other species
When is the data updated?
Data is updated every night, but main releases are quarterly
• Quarterly release of all EMBL-Bank, e.g. Rel. 116
Normal Release
• All updates since last normal release
• Rolled into quarterly release
Updates
ENA updates
Protein sequence data
Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD
UniProtKB
• UniProtKB/Swiss-Prot: manually annotated (reviewed); non-redundant, high-quality manual annotation; 1 entry per protein
• UniProtKB/TrEMBL: computationally annotated (unreviewed); redundant, automatically annotated; 1 entry per nucleotide submission
Data sources of UniProtKB
[Figure labels: UniProt/TrEMBL; VEGA (Sanger); WormBase; FlyBase; Sub/Peptide Data; PDB; Patent Data; Ensembl; ENA (EMBL) DNA database; mRNA Data]
UniProtKB employs two prediction programs which are referred to as UniRule and SAAS.
UniRule maintains a set of manually established and maintained annotation rules.
SAAS, Statistical Automatic Annotation System, generates a new set of decision-trees with every UniProtKB release using data-mining.
InterProSwiss-Prot
Automatic annotation
Curation of a UniProtKB/Swiss-Prot entry
Sequence variants
Nomenclature
Sequence features
UniProtKB/TrEMBL
UniProtKB/SwissProt
Ontologies
Literature Annotations
References
UniProtKB
UniProt databases
• UniProtKB/Swiss-Prot
• Manually curated
• UniProtKB/TrEMBL
• Automatically curated
• UniRef
• Sequences clustered by %identity
• UniParc
• Sequence archive – keeps track of historical sequences & identifiers
• Proteomes
Data
• Simplistically, much of the data held at EMBL-EBI can be thought of as a container
• Part of it is the raw data itself (eg. Protein sequence)
• Another part being meta-information or annotation about this data
Example
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, Brazil.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
Formats
• Different databases store this data in different formats
EMBL
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
...
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
GENBANK
LOCUS       AJ131285     919 bp    mRNA    linear   INV 20-JUL-2001
DEFINITION  Sabella spallanzanii mRNA for globin 3.
ACCESSION   AJ131285
VERSION     AJ131285.1  GI:13810248
KEYWORDS    globin; globin 3; globin gene.
SOURCE      Sabella spallanzanii
  ORGANISM  Sabella spallanzanii
            Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Palpata;
            Canalipalpata; Sabellida; Sabellidae; Sabella.
REFERENCE   1
...
ORIGIN
        1 caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct
       61 cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc
      121 tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct
      181 gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag
      241 tctgtggaca agttcttcaa gcgtgtcaat ggcaaggaca tcagctcccc agccttccag
      301 gctcacatcc agcgtgtgtt cggtggcttt gacatgtgca tctccatgct tgatgacagt
      361 gatgtgctcg cctctcagct ggctcacctc cacgcccagc acgtcgagag aggaatctct
//
Formats
FASTA format
>gi|13810248|emb|AJ131285.1| Sabella spallanzanii mRNA for globin 3
CAAACAGTCARTTAATTCACAGAGCCCTGAGGTCTCTCGCTCCTTTCTGCGTCACTCTCTCTTACCGTCATCATGTACAAGTGGTTGCTTTGCCTGGCTCTGATTGGCTGCGTCAGCGGCTGCAACATCCTCCAGAGGCTGAAGGTCAAGAACCAGTGGCAGGAGGCTTTCGGCTATGCTGACGACAGGACATCCCYCGGTACCGCATTGTGGAGATCCATCATCATGCAGAAGCCCGAGTCTGTGGACAAGTTCTTCAAGCGTGTCAATGGCAAGGACATCAGCTCCCCAGCCTTCCAGGCTCACATCCAGCGTGTGTTCGGTGGCTTTGACATGTGCATCTCCATGCTTGATGACAGTGATGTGCTCGCCTCTCAGCTGGCTCACCTCCACGCCCAGCACGTCGAGAGAGGAATCTCT
Format conversion tools
http://www.ebi.ac.uk/Tools/sfc/
Meta-information
• Contains information important for:
• Identifying/referencing a piece of data or entry
• Classifying an entry
• Determining the source of the data
• And can also contain annotation that adds value:
• Identification of sequence features
• Keywords, GO terms etc.
• Cross-references to other entries that share some property
• Etc.
Searching using the meta-data
• When looking for a sequence we can perform text searches against the meta-data
• Accession look-up
• Keyword search eg: function, species
• Protein family classification
• Accession changes?
• Cross reference services
• PICR
• UniProt ID Mapping
Text search tools
• Each database has its own search engine
• Interface tailored to their specific data use
• There are also EBI-wide search tools
• EBI Search
EBI Search
• First approach/entry point to data resources at EBI
EBI Search
• Just type, with auto-complete
EBI Search
• One stop search across many resources, grouped into categories
Categories
Domains
Multi-domain facet
Domain-specific facet
EBI Search
• First approach/entry point to data resources at EBI
• One stop search across many resources
• Non-expert friendly summaries
• Advanced search available (via direct URL)
• http://www.ebi.ac.uk/ebisearch/advancedsearch.ebi
• Allows domain/field specification
• Boolean etc.
Sequence searching
Sequence searching tools
• Central to modern techniques
• Genome annotation
• Characterising protein families
• Exploring evolutionary relationships
How?
• Search by comparing sequence data rather than meta-data
• Find sequences/entries when missing or inaccurate meta-data
• More than just an exact look-up
• Allow for sequence variability – look for ‘similar’ sequences
• Sequence variation is important information for bioinformaticians
• Infer homology (shared ancestry)
• IF homologous, then can transfer information
Homology vs. Similarity
Homology (inferred)
• Presence of similar features because of common descent
• Cannot be observed since the ancestors no longer exist
• Is inferred as a conclusion based on ‘similarity’
• Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)
Similarity (measurable)
• Quantifies a ‘likeness’
• Uses statistics to determine ‘significance’ of a similarity
• Statistically significant similar sequences are considered ‘homologous’
Sequence alignment
Query:      ACATAGGT
Sequence 1: TCATAGAT
Sequence 2: AAATTCTG
Identity score: Sequence 1 = 6/8, Sequence 2 = 3/8
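The identity scores for the short example can be checked with a few lines of Python. This is a minimal sketch that assumes the sequences are already the same length and ungapped; the function name `identity` is made up for illustration.

```python
def identity(a: str, b: str) -> tuple[int, int]:
    """Count identical positions between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches, len(a)


query = "ACATAGGT"
for name, seq in [("Sequence 1", "TCATAGAT"), ("Sequence 2", "AAATTCTG")]:
    matches, length = identity(query, seq)
    print(f"{name}: {matches}/{length}")
```

Position-by-position counting only works here because the sequences line up without gaps; the longer sequences on the next slide need a real alignment method.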
Sequence alignment
atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt
atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc
cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca
Query:
1
2
Dot plot
• Maybe a dot plot will help
Query
Sequence 1
A C A T A G
GATACT
Dot plot
Query vs Sequence 1 Query vs Sequence 2
Query Query
1 2
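A dot plot is simple to produce as text. The sketch below is an illustration only (the function name `dot_plot` and the `*`/`.` symbols are assumptions, not part of any EBI tool): it marks every cell where the query residue equals the subject residue, so similar regions show up as diagonal runs.

```python
def dot_plot(query: str, subject: str) -> list[str]:
    """Return one text row per subject residue; '*' marks an identical pair."""
    return ["".join("*" if q == s else "." for q in query) for s in subject]


query, subject = "ACATAGGT", "TCATAGAT"
print("  " + query)
for residue, row in zip(subject, dot_plot(query, subject)):
    print(residue + " " + row)
```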
Algorithms
• To get a computer to solve a problem, the first step is to create a way for the computer to know what is relatively ‘good’ and what is relatively ‘bad’
• I.e. a score.
• Computer can then assess solutions and choose best.
• Simple algorithm – penalise movement away from diagonal – gap penalty
[Figure: scoring grid in which steps along the diagonal score 0 and steps away from the diagonal score -10]
Why gap open and extension?
• Adjacent gap positions are likely to have been created by the same in/del event, rather than multiple independent events
• Use a smaller gap extension compared to opening penalty to account for this
G---ATTA G-A-T-TA
• To encourage this we apply a high penalty for opening a gap and a low penalty for each position that extends it.
Gap extend
Gap open = 10, Gap extend = 0.5
[Figure: scoring grid in which the first gap position scores -10.5 (open plus extend) and each further gap position another -0.5, e.g. -11 for a gap of two]
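The effect of separate open and extend penalties can be shown with a small calculation. This sketch assumes, as the figure values of -10.5 and -11 suggest, that a gap of length n costs gap open plus n times gap extend; other tools define the first position differently, so treat the convention as an assumption.

```python
def gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Score contribution of a single gap covering `length` positions."""
    if length <= 0:
        return 0.0
    return -(gap_open + gap_extend * length)


one_long_gap = gap_penalty(3)            # G---ATTA: one gap of three positions
three_short_gaps = 3 * gap_penalty(1)    # G-A-T-TA: three separate gaps
print(one_long_gap, three_short_gaps)
```

With these values the single three-position gap is penalised far less than three independent gaps, which is the behaviour the slide argues for.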
Match/mismatch
• Of course, we need to tell the algorithm that matching letters are better than mismatches too
• This is done via a scoring matrix
     A    C    G    T
A    5   -4   -4   -4
C   -4    5   -4   -4
G   -4   -4    5   -4
T   -4   -4   -4    5
• Putting the two together gives us a scoring mechanism
[Figure: dynamic programming score matrix filled in using the match/mismatch scores and gap penalties]
• To pick the optimal alignment, start at the end and trace back the highest scoring route.
[Figure: the same score matrix with the highest scoring route traced back from the end]
Needleman-Wunsch
• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!
• An example of dynamic programming
• Comparing the full length of both sequences is called a global-global or just global alignment
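The fill-and-trace-back procedure can be sketched in Python. To keep it short this version assumes a single linear gap penalty of -10 per gap position rather than separate open and extend penalties, and it uses the +5/-4 match/mismatch scores from the matrix slide; the function name is illustrative.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gaps in b
        score[i][0] = i * gap
    for j in range(1, cols):          # leading gaps in a
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACATAGGT", "TCATAGAT"))
```

Every cell of the matrix is filled, which is why the method is guaranteed to find the best score but costs time and memory proportional to the product of the two sequence lengths.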
Global vs Local
• But global-global might not be suitable for sequences that are very different lengths
• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.
• Sets negative scores in matrix to 0, and allows trace back to end and restart
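A local alignment version needs only two changes to the global procedure: cell scores are never allowed to drop below zero, and the trace back starts at the highest-scoring cell and stops at the first zero. The sketch below again assumes a linear gap penalty and the +5/-4 scores, purely for illustration.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Local alignment with a linear gap penalty. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # restart instead of going negative
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ACATAGGT", "TCATAGAT"))
```

For the example pair the mismatching first position is left out of the local alignment, because including it would lower the score.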
Global vs Local
A T G T A T A C G C
A G T A T A - G C
A - T G T A T A C G C
A G T A T A - - - G C
Scoring
• Parameters so far:
• Match/mismatch
• Gap opening
• Gap extending
• Can we improve it?
Substitutions
• Some substitutions are more likely than others
Protein substitution matrices
• Can look at closely related proteins to determine substitution rates
• Two most commonly used models:
• PAM
• BLOSUM
PAM
• Point Accepted Mutation
• Observed mutations in a set of closely related proteins
• Markov chain model created to describe substitutions
• Normalised so that PAM1 = 1 mutation per 100 amino acids
• Extrapolate matrices from model
• Higher PAM number = less closely related
PAM 250
BLOSUM
• Blocks of Amino Acid Substitution Matrix
• Align conserved regions of evolutionary divergent sequences clustered at a given % identity
• Count relative frequencies of amino acids and substitution probability
• Turn that into a matrix where the more positive a substitution is, the more likely is it to be found, and the more negative, the less likely.
• Higher BLOSUM number = more closely related
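The step from counted frequencies to matrix scores is a log-odds calculation: the observed frequency of a pair is divided by the frequency expected if the two residues occurred independently, and the logarithm is scaled and rounded. The sketch below is a simplified illustration with made-up frequencies, not the full BLOSUM procedure; the scale factor of 2 (half-bit units) is an assumption.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float, scale: float = 2.0) -> int:
    """Rounded, scaled log2 of observed over expected frequency for an aligned pair."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies, chosen only to show the sign of the result.
print(log_odds_score(0.02, 0.05, 0.05))      # seen more often than chance: positive
print(log_odds_score(0.00125, 0.05, 0.05))   # seen less often than chance: negative
```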
More divergent
BLOSUM 45   PAM 250
BLOSUM 62   PAM 160
BLOSUM 90   PAM 100
Less divergent
Scoring
• Parameters:
• Match/mismatch
• Gap opening
• Gap extending
• Substitution matrix
Dynamic programming alignments at the EBI
• EMBOSS Pairwise Alignment algorithms
• European Molecular Biology Open Software Suite
• Suite of useful tools for molecular biology
• Command line based
• Designed to be used as part of scripts/chained programs
• We implement selected tools to provide web and Web Services access
• Database alignments via FASTA suite of programs
Where to find at the EBI?
http://www.ebi.ac.uk/Tools/psa/
Pairwise alignment tools
• Global alignment: Needle, Stretcher (big sequences)
• Local alignment: Water, Matcher (big sequences), LALIGN
• Genomic DNA alignment: WISE tools
Change to nucleotide
Sequence input
Parameters
Submit!
Key
- Gap
: Positive match
. Negative match
| Identity
Example sequences
www.ebi.ac.uk/~apc/Courses/Brazil
Pairwise_align1.fsa
Pairwise_align2.fsa
Searching a database
• Multiple pairwise alignments between query sequence and database sequence
Dynamic programming sequence search methods at the EBI
• Global alignment: GGSEARCH
• Local alignment: SSEARCH
• Global query vs local database: GLSEARCH
• Profile-iterative search: PSI-SEARCH
Where to find at the EBI?
http://www.ebi.ac.uk/Tools/sss/
Database selection
Sequence input
Parameters
Submit!
• Dynamic programming methods are rigorous and guarantee an optimal result
• But take up a lot of memory
• And evaluate each position of the matrix
• Predictably, this makes them slow and demanding when you are aligning large sequences
Heuristics
• Therefore we need methods of estimating alignments
• Estimation methods are called heuristics
• Try and take short cuts in an intelligent manner
• Speed up the search
• At the possible expense of accuracy
• Accuracy in sequence searches is important for:
• Aligning the right bits
• Scoring the alignment correctly
• Identifying similar sequences - sensitivity
• Going back to our dot plot
• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.
• Of course, we have to identify likely regions – not all alignments will be as nice as that one!
• This is the method used by FASTA
• W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
FASTA – step 1
• Identify runs of identical sequence and pick regions with highest density of runs
Ktup parameter: How small are ‘words’ considered before they are ignored
Increase Ktup = faster, but less sensitive
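The first FASTA step can be approximated with a word lookup table. The sketch below is a simplified illustration, not the FASTA source: it indexes every word of length `ktup` in the query, scans the subject for the same words, and counts hits per diagonal (query position minus subject position). Diagonals with many hits are the regions worth examining further.

```python
from collections import Counter, defaultdict


def diagonal_hits(query: str, subject: str, ktup: int = 2) -> Counter:
    """Count identical ktup-length words shared by query and subject, per diagonal."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[i - j] += 1          # same diagonal means same offset
    return hits


print(diagonal_hits("ACATAGGT", "TCATAGAT", ktup=2).most_common())
```

A larger `ktup` produces fewer, more specific hits, which is why increasing it is faster but less sensitive.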
FASTA – step 2
• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores
Parameter: Substitution matrix
FASTA – step 3
• Discard regions too far from the highest scoring region
Joining threshold: Internally determined
FASTA – step 4
• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions
Parameters: Gap open, Gap extend, Substitution matrix
FASTA
• Repeat against all sequences in the database
FASTA – programs available at EBI
• FASTA: ”a fast approximation to Smith & Waterman”
• FASTA – scan a protein or DNA sequence library for similar sequences.
• FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.
• TFASTX/Y – compare a protein sequence to a translated DNA data bank.
• FASTF – compares ordered peptides (Edman degradation) to a protein databank.
• FASTS – compares unordered peptides (Mass Spec.) to a protein databank.
Where to find at the EBI?
http://www.ebi.ac.uk/Tools/sss/
Database selection
Sequence input
Parameters
Submit!
FASTA - results
Key
- Gap
: Identity
. Similarity
X Filtered
Example sequence
www.ebi.ac.uk/~apc/Courses/Brazil
test_prot.fasta
BLAST – Basic Local Alignment Search Tool
• Instead of narrowing the dynamic programming search space, BLAST works in a slightly different way
• Firstly, it creates a word list both of the exact sequence and high scoring substitutions
Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
BLAST – step 1
• w=3
SEWRFKHIYRGQPRRHLLTTGWSTFVT
SEW  EWR  WRF
Parameter: Word length (w)
Increase = faster, but less sensitive
BLAST – step 1 (cont'd)
• w=3
• T=13
SEWRFKHIYRGQPRRHLLTTGWSTFVT
Word   Score
GQP    18
GEP    15
GRP    14
GKP    14
GNP    13
GDP    13
Below threshold:
AQP    12
NQP    12
Parameters: Neighbourhood threshold (T), Substitution matrix
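The neighbourhood word list for GQP can be regenerated by scoring every possible three-letter word against the query word and keeping those at or above T. This sketch assumes the scores on the slide come from BLOSUM62 (they are consistent with it) and embeds only the three matrix rows needed for G, Q and P; check the values against a reference copy of the matrix before reusing them.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Rows for G, Q and P, in the column order of AMINO_ACIDS (assumed BLOSUM62).
ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
}
SCORES = {q: dict(zip(AMINO_ACIDS, row)) for q, row in ROWS.items()}


def neighbourhood(word: str, threshold: int) -> dict[str, int]:
    """All words of the same length scoring >= threshold against `word`."""
    found = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(SCORES[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            found["".join(candidate)] = score
    return found


for w, s in sorted(neighbourhood("GQP", 13).items(), key=lambda kv: -kv[1]):
    print(w, s)
```

The sketch enumerates every candidate word, so it can list more words at the threshold than a slide has room for. Lowering T lengthens the list and makes the search more sensitive but slower.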
BLAST – step 2
• Then it scans database sequences for exact matches with these words
BLAST – step 3
• If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount
• This results in a High-scoring Segment Pair (HSP)
Parameters: Drop off, Substitution matrix
BLAST – step 4
• If the total HSP score is above another threshold then a gapped extension is initiated
Parameters: Extension threshold (Sg), Substitution matrix
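The ungapped extension with a drop-off can be sketched as follows. This is a simplified illustration rather than the BLAST implementation: it extends only to the right of a hit, uses +5/-4 nucleotide scores instead of a substitution matrix, and the parameter name `x_drop` is an assumption. Extension stops once the running score falls more than `x_drop` below the best score seen.

```python
def extend_right(query, subject, q_start, s_start, x_drop=10, match=5, mismatch=-4):
    """Ungapped rightward extension. Returns (best score, length giving that score)."""
    score = best = 0
    best_len = 0
    i, j = q_start, s_start
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_len = score, i - q_start
        elif best - score > x_drop:
            break                      # score has dropped too far; give up
    return best, best_len


print(extend_right("ACATAGGT", "TCATAGAT", 1, 1))
print(extend_right("ACATAGGT", "TCATAGAT", 1, 1, x_drop=3))
```

A generous drop-off lets the extension cross the single mismatch and pick up the final match; a strict one stops at the mismatch and reports the shorter segment.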
BLAST
• The steps rule out many database sequences early on
• Large increase in speed
BLAST – programs available at the EBI
• Basic Local Alignment Search Tool
• NCBI BLAST programs:
• BLASTP – protein sequence vs. protein sequence library
• BLASTN – nucleotide query vs. nucleotide database
• BLASTX – translated DNA vs. protein sequence library
Key
- Gap
[residue] Identity
+ Similarity
X Filtered
Example sequence
www.ebi.ac.uk/~apc/Courses/Brazil
test_prot.fasta
When to use what?
[Figure: GGSEARCH, FASTA, BLAST, PSI-SEARCH and SSEARCH arranged by database size and query length]
When to use what?
[Figure: time to search with GGSEARCH, FASTA, BLAST, PSI-SEARCH and SSEARCH against PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]
Homology and Similarity
Similarity
Homology
• So far, we’ve talked about scoring alignments
• Direct function of the algorithm
• But what we want is to assign some kind of quality to that score
Score vs significance
A A A
A A A
A C A T A A G G C T
A T A C A A G C C T
High score Higher significance?
“Lies, damn lies, and statistics”
• Not just interested in score...
• ...But how likely we are to get that alignment by chance alone
• It is from this ‘non-random’ alignment that we infer homology
• Statistics are used to estimate this chance
E-value
• ‘Expect’ value (really ‘expectation’)
• Expected number of alignments scoring at least this well by chance in the given database, or “how many times you might be wrong”
• Best measure of how biologically significant an alignment is
• Used for ranking results by default
• Most people use 10^-3: “Happy to be wrong one time in a thousand”
• Calculated in slightly different ways for BLAST and FASTA
• Short alignments are more likely to be found by chance so have higher E-values
• Affected by database size
• BLAST and FASTA both optimised for distant relationships
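The dependence of the E-value on score and database size can be illustrated with the Karlin-Altschul form used in BLAST-style statistics, E = K·m·n·e^(-λS). This is a sketch only: the λ and K values below are illustrative placeholders, and FASTA derives its statistics differently.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to a normalised bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: float, lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


LAMBDA, K = 0.267, 0.041   # illustrative values only
for db_len in (1e6, 1e8):
    print(db_len, e_value(100, 300, db_len, LAMBDA, K))
```

The same raw score gives an E-value one hundred times larger in a database one hundred times bigger, which is one reason to search the smallest database likely to contain the sequence of interest.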
FASTA statistics
• Compares query sequence with every sequence in database
• As most of these sequences are unrelated it is possible to use the distribution of scores (sampled) to assign statistical significance
• As distribution is taken from a random sample, exact E-Value can vary slightly from search to search
FASTA - histogram
Predicted distribution of scores
Observed distribution of scores
Key
*
=
High scoring region
BLAST statistics
• Main reason for speed is that it doesn’t compare query with lots of other sequences
• Therefore it pre-estimates statistical values using a random sequence model
“Appears to yield fairly accurate results”
Improving algorithms
Sensitivity, Selectivity & Speed
• Sensitivity is how distantly you can determine a homologous sequence (avoid false negative)
• Selectivity is how accurately you can determine whether a sequence is homologous or not (avoid false positive)
• Speed is obviously how long it takes!
• In general, the more information we can add to an alignment, the better the result
Conserved regions Structural information Motifs
[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
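A motif written this way maps directly onto a regular expression. The sketch below converts the bracketed description into a Python regex and searches a made-up sequence; the function name and the test sequence are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Turn '[R, T or D]-[D, A or Q]-...-H' into a regex such as '[RTD][DAQ]...H'."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        if element.startswith("["):
            letters = re.findall(r"[A-Z]", element)   # keep only the residue letters
            parts.append("[" + "".join(letters) + "]")
        else:
            parts.append(element)
    return "".join(parts)


regex = pattern_to_regex("[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H")
print(regex)
match = re.search(regex, "MKTAFATHLL")   # hypothetical sequence
print(match.start() if match else None)
```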
Conserved regions
• We can add a new ‘position’ parameter to the substitution matrix
We can even modify a normal search to generate a position specific scoring matrix, or PSSM
PSI-BLAST
Position Specific Iterative – BLAST:
1. Takes the results of a normal BLAST
2. Aligns them and generates a profile of conserved positions
3. Uses this to weight scoring on the next iteration
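The profile-building step can be sketched as a position specific scoring matrix built from aligned hits. This is a toy version under stated assumptions: a DNA alphabet, a uniform background, a simple pseudocount, ungapped hits of equal length, and no sequence weighting, all of which differ from what PSI-BLAST actually does.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """One dict of log2-odds scores per alignment column."""
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum the column-specific scores for an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


hits = ["ACGT", "ACGA", "ACGT", "TCGT"]   # hypothetical aligned hits
pssm = build_pssm(hits)
print(round(score(pssm, "ACGT"), 2), round(score(pssm, "TTTT"), 2))
```

Conserved columns give large positive scores to the conserved residue and negative scores to others, so the next search iteration rewards matches where they matter most.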
PSI-BLAST
Example sequence
www.ebi.ac.uk/~apc/Courses/Brazil
test_prot.fasta
PHI-BLAST
• Pattern Hit Initiated-BLAST
• User provides a pattern alongside a protein
• Database hits have to contain this pattern, and similarity to rest of sequence
• Results can initiate a PSI-BLAST search as well
PSI-BLAST
• By adding importance to conserved residues we might be able to find more distant sequences
• But iterate too far and we might be assigning importance where there is none
• Problem of Homologous Over-Extension (HOE)
More sensitive
Less selective
Homologous Over-Extension (HOE)
Alignment region
Extends over subsequent iterations
2nd, 3rd
Homologous Over-Extension (HOE)
Contaminated PSSM
Homologous Over-Extension (HOE)
Which can cause (significant) alignment with unrelated protein
Homologous Over-Extension (HOE)
Expect score: 9.0x10^-5
PSI-BLAST initial search
Homologous Over-Extension (HOE)
PSI-BLAST 2nd Iteration
Homologous Over-Extension (HOE)
PSI-BLAST 3rd Iteration
Expect score: 1.0x10^-4
Homologous Over-Extension (HOE)
PSI-BLAST 5th Iteration
Expect score: 7.0x10^-4
Expect score: 1.0x10^-19
Reducing HOE
• Look for domains in results and manually select sequences that form part of PSSM
• Mask boundaries according to initial alignment
• Results in fewer false positives (improved selectivity)
PSI-SEARCH
• Smith-Waterman implementation (SSEARCH)
• With iterative position specific scoring
• Optional boundary masking to reduce HOE
Reducing HOE errors
• Sequence boundary masking procedure
• First time a significant alignment occurs for a library sequence, store co-ordinates
Reducing HOE errors
• Mask regions outside so can’t contribute to PSSM
Reducing HOE errors
PSI-Search 2nd Iteration
PSI-Search 5th Iteration
PSI-Search
So what does that do to sensitivity/selectivity?
[Figure: plot of sensitivity against selectivity]
PSI-Search = Very sensitive + Much more selective
Coming soon
• PSI-Search 2!
• Use domain annotations/predictions to inform alignment
Low complexity regions
• Biologically irrelevant, but likely to skew alignment scoring
• E.g. CA repeats, poly-A tails and Proline rich regions
Good Statistics:
The inset shows good correlation between the observed and expected numbers of scores.
This is the region of the histogram to look out for first when evaluating results.
Bad Statistics:
The inset shows bad correlation between the observed and expected scores in this search.
The spaces between the = and * symbols indicate this poor correlation.
One reason for this can be low complexity regions.
Low complexity regions
• Compensate by filtering/masking sequence so these regions don’t contribute to scoring
• Filters: seg, xnu, dust, CENSOR
• But check what you are filtering!
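The idea of masking can be shown with a sliding-window entropy filter. This is not the seg, xnu or dust algorithm, only a simplified stand-in: windows whose Shannon entropy falls below a threshold are replaced with X. The window size, threshold and test sequence are arbitrary choices made for the illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2) -> str:
    """Replace every window with entropy below min_entropy by 'X' characters."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = "X" * window
    return "".join(masked)


seq = "MKTAYIAKQRQISFVK" + "P" * 16 + "SHFSRQLEERLGLIEV"   # hypothetical protein
print(mask_low_complexity(seq))
```

Printing the masked sequence before searching is an easy way to follow the advice to check what is being filtered.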
Filtered:
Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.
Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.
Example sequence
www.ebi.ac.uk/~apc/Courses/Brazil
Filtertest_seq.fsa
Database composition
• Statistics rely on database containing wide coverage
• Assumption: the query is not homologous to most of the data
• Specialist databases might cause problems
• E.g. immuno- databases, made up of relatively few genes
• A lot of the database IS homologous
• Skews statistics
Database composition
• Can’t make same assumptions about coverage
• So don’t use BLAST
• FASTA based tools sample the score so provide accurate statistics
• Use the histogram to check
• Use shuffled versions of database to create additional coverage
Search Guidelines
Search guidelines 1
• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)
• Then with translated DNA query sequences (fastx, blastx)
• Search with DNA vs. DNA as the next resort
• And then against translated DNA database sequences (tfastx, tblastx) as the VERY LAST RESORT!
Search guidelines 2
• Search the smallest database that is likely to contain the sequence(s) of interest
• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology
Search guidelines 3
• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence
• Examine the histograms
• Use programs such as prss3 to confirm the expectation values.
• Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() ~1.0
• Perform reverse search
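The shuffling check can be illustrated without any alignment software. This sketch is not prss3: it shuffles the subject sequence many times, scores each shuffle with a deliberately crude scoring function (best count of identical positions over all ungapped offsets), and reports how often a shuffled score reaches the real one. All function names are made up for the example.

```python
import random


def best_diagonal_matches(a: str, b: str) -> int:
    """Highest number of identical positions over all ungapped offsets."""
    best = 0
    for offset in range(-len(b) + 1, len(a)):
        matches = sum(1 for i, x in enumerate(a)
                      if 0 <= i - offset < len(b) and x == b[i - offset])
        best = max(best, matches)
    return best


def shuffled_scores(query, subject, score_fn, trials=200, seed=0):
    """Scores of the query against shuffled copies of the subject."""
    rng = random.Random(seed)
    residues = list(subject)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)          # keeps composition, destroys order
        scores.append(score_fn(query, "".join(residues)))
    return scores


def empirical_p(real_score, null_scores):
    """Fraction of shuffled scores at least as high as the real one (+1 smoothing)."""
    return (1 + sum(s >= real_score for s in null_scores)) / (1 + len(null_scores))


query, subject = "ACATAGGTACATAGGT", "TCATAGATTCATAGAT"
real = best_diagonal_matches(query, subject)
null = shuffled_scores(query, subject, best_diagonal_matches)
print(real, empirical_p(real, null))
```

Shuffling preserves composition, so a real score that is not clearly better than the shuffled scores is probably explained by composition alone rather than by homology.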
Search guidelines 4
• Default parameters are set up for most common queries
• Consider searches with different gap penalties and other scoring matrices, especially for short queries/domains
• Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences
• Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)
• Remember to change the gap penalty defaults (if the tool doesn’t change them for you)
MATRIX     open   ext.
BLOSUM50   -10    -2
BLOSUM62   -11    -1
BLOSUM80   -16    -4
PAM250     -10    -2
PAM120     -16    -4
Search guidelines 5
• Homology can be reliably inferred from statistically significant similarity
• But remember:
• Orthologous sequences have similar functions
• Paralogous sequences can acquire very different functional roles
• So further work might be needed to tease out details
Search guidelines 6
• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues
• However, motif identity in the absence of significant sequence similarity usually occurs by chance alone
• Try to produce multiple sequence alignments in order to examine the relatedness of your sequence data
• Clustal Omega
• MUSCLE
• T-Coffee
• Kalign
• MAFFT
Problem Sequences
Short sequences
• What about short sequences?
• Depends on their nature:
• Protein
• Use shallow matrices
• Reduce word length and/or increase the E() value cut off
• DNA
• Reduce the word length
• Ignore gap penalties (force local alignments only)
• Use rigorous methods
• But ask what you are trying to do!
Vector contamination
• You think you know what your sequence is..
• .. But the results are really confusing!
• Maybe you have vector contamination
• Search against known vectors to check
Vector contamination
Example sequences
www.ebi.ac.uk/~apc/Courses/Brazil
vectortest_seq1.fsa
vectortest_seq2.fsa
Multiple Sequence Alignments
Uses of Multiple Sequence Alignment (MSA)
• Alignment of three or more sequences
• Functional prediction
• Structural prediction
• Conservation analysis
• Classification
• Phylogeny
• To help distinguish between orthology and paralogy
We have a (computational) problem…
• Pairwise alignments are simple enough to find the optimal (highest scoring) solution in a reasonable timeframe
• Multiple sequence alignment is in a class of problems that is ‘NP-hard’
NP-easy
• Problems that are solvable in polynomial time
• E.g. operations to solve = n^2
NP-hard
• Problems that are hard to solve
• E.g. operations to solve = 2^n
n^2 vs 2^n
• Imagine a computer running 10^9 operations a second
        n = 10           n = 30           n = 50             n = 70
n^2     100, < 1 sec     900, < 1 sec     2500, < 1 sec      4900, < 1 sec
2^n     1024, < 1 sec    10^9, 1 sec      10^15, 13 days     10^21, 37 thousand years
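The running times follow directly from dividing the operation count by the machine speed. A short sketch, assuming the 10^9 operations per second from the slide and a 365.25-day year:

```python
OPS_PER_SECOND = 10**9
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(operations: int) -> float:
    """Wall-clock seconds to run `operations` at OPS_PER_SECOND."""
    return operations / OPS_PER_SECOND


for n in (10, 30, 50, 70):
    poly, expo = seconds_needed(n**2), seconds_needed(2**n)
    print(f"n={n}: n^2 takes {poly:.2e} s, 2^n takes {expo:.2e} s "
          f"({expo / SECONDS_PER_DAY:.1f} days, {expo / SECONDS_PER_YEAR:.1f} years)")
```

The polynomial column stays under a second throughout, while each step of 20 in n multiplies the exponential column by about a million.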
What to do about NP-hard problems?
• Give up (do you really need MSA?)
• Use approximations and heuristics
Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
 *: : : * . : .: * : * : .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
 . .:: *. : . : *. * . : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
 : : .: . .. . :
Weighted Sums of Pairs: WSP
$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$
Time O(L^N)
Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years
Progressive Alignment:
• Barton and Sternberg, 1987
• Florence Corpet, 1988
• Feng and Doolittle, 1987
• Jotun Hein, 1989
• Higgins and Sharp, 1988
• Hogeweg and Hesper, 1984
• Willie Taylor, 1987, 1988
Guide Tree
(Figure: guide tree with leaves Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin)
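A guide tree fixes the order in which a progressive aligner merges sequences: closest pairs first, then clusters. The sketch below builds such an order by repeatedly joining the two closest clusters using average linkage. The distance values are invented for illustration, and the function is a simplified stand-in rather than the procedure of any specific program.

```python
def guide_tree(names, dist):
    """Repeatedly merge the two closest clusters (average linkage).
    Returns nested tuples describing the order in which to align."""
    clusters = {i: (names[i], [i]) for i in range(len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                members_a, members_b = clusters[a][1], clusters[b][1]
                # Average distance between all members of the two clusters.
                d = sum(dist[p][q] for p in members_a for q in members_b)
                d /= len(members_a) * len(members_b)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        tree = (clusters[a][0], clusters[b][0])
        members = clusters[a][1] + clusters[b][1]
        del clusters[a], clusters[b]
        clusters[next_id] = (tree, members)
        next_id += 1
    return next(iter(clusters.values()))[0]

names = ["Human beta", "Horse beta", "Human alpha", "Horse alpha"]
# Made-up pairwise distances, for illustration only.
dist = [
    [0.00, 0.17, 0.59, 0.59],
    [0.17, 0.00, 0.60, 0.59],
    [0.59, 0.60, 0.00, 0.13],
    [0.59, 0.59, 0.13, 0.00],
]
tree = guide_tree(names, dist)
print(tree)
```

With these values the two alpha chains are joined first, then the two beta chains, and the two resulting groups are aligned to each other last.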
Clustal
• >85,000 citations
• Clustal1-Clustal4
• 1988, Paul Sharp, Dublin
• Clustal V 1992
• EMBL Heidelberg
• Rainer Fuchs
• Alan Bleasby
• Clustal W, Clustal X 1994-2007
• Toby Gibson, EMBL, Heidelberg
• Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 2007
• University College Dublin
www.clustal.org
ClustalW2 at the EBI
www.ebi.ac.uk/Tools/msa/clustalw2/
www.ebi.ac.uk/Tools/msa/
ClustalW2
Sequence input
Parameters
Submit!
Jalview
ClustalW2
Advantages
• Quite fast for low numbers of sequences
• Not too demanding
• Widely used
Disadvantages
• Fixing of early alignments
• Propagates errors
• Doesn’t search far
• Local minima
• Compresses gaps
Example sequences
www.ebi.ac.uk/~apc/Courses/Brazil
Prot_MSA.fsa
Other progressive aligners
• MUSCLE
• Optimised progressive aligner
• Good alternative to ClustalW
BaliBase     % correct   time (s)
Clustal W    37.4        766
Muscle       47.5        789
Other progressive aligners
• KAlign
• Local regions progressive aligner
• Extremely fast!
• Good for large alignments/input
BaliBase     % correct   time (s)
Clustal W    37.4        766
Kalign       50.1        21
Consistency based alignment
• Maximise similarity to a library of residue pairs
COFFEE
• Consistency based Objective Function For alignmEnt Evaluation
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
COFFEE
• Library of reference pairwise alignments
• For your given set of sequences
• Objective Function
• Evaluates consistency between multiple alignment and the library of pairwise alignments
• Use SAGA to optimise this function
• Weigh depending on quality of alignment
SAGA is another alignment method, using genetic algorithms
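The idea of evaluating consistency between a multiple alignment and a library of pairwise alignments can be sketched as follows. This is a simplified illustration, not the published COFFEE function: it reports the weighted fraction of residue pairs in the multiple alignment that also occur in the library, and all names and example sequences are assumptions.

```python
from itertools import combinations

def residue_pairs(row_a, row_b):
    """Set of (index_in_a, index_in_b) residue pairs aligned in two gapped rows."""
    pairs, ia, ib = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != "-" and y != "-":
            pairs.add((ia, ib))
        if x != "-":
            ia += 1
        if y != "-":
            ib += 1
    return pairs

def consistency_score(msa, library, weights):
    """Weighted fraction of MSA residue pairs that also occur in the library.

    msa:     {name: gapped row}
    library: {(name_a, name_b): (gapped row a, gapped row b)}
    weights: {(name_a, name_b): weight}, e.g. reflecting pairwise alignment quality
    """
    shared = total = 0.0
    for a, b in combinations(sorted(msa), 2):
        in_msa = residue_pairs(msa[a], msa[b])
        in_lib = residue_pairs(*library[(a, b)])
        shared += weights[(a, b)] * len(in_msa & in_lib)
        total += weights[(a, b)] * len(in_msa)
    return shared / total if total else 0.0

msa = {"s1": "ACD-", "s2": "A-DE", "s3": "ACDE"}
library = {
    ("s1", "s2"): ("ACD-", "A-DE"),
    ("s1", "s3"): ("ACD-", "ACDE"),
    ("s2", "s3"): ("ADE-", "ACDE"),  # deliberately disagrees with the MSA
}
weights = {pair: 1.0 for pair in library}
score = consistency_score(msa, library, weights)
print(score)
```

An optimiser would then search for the multiple alignment that maximises this kind of score, which is the step that makes the approach slow.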
COFFEE
• More accurate than ClustalW
• Much less prone to problems in early alignment stages
• VERY slow!
T-Coffee
• Tree-based COFFEE
• Heuristic approach to COFFEE
• Gets rid of genetic algorithm portion
• Uses progressive alignments
• Changes algorithm based on number of sequences
T-Coffee
• Much faster than COFFEE
• Avoids some of ClustalW’s pitfalls
• Can take information from several data sources
• Still not that fast
• Can be very demanding of memory etc.
Other Tools
• MAFFT
• Iterative based Fast Fourier Transform
• Different modes – can operate in both progressive and consistency type alignments
NEW!: Clustal Omega
• Completely different way of doing things from ClustalW
• Two major areas of improvement:
• 1) Guide tree generation
• 2) Profile-profile alignments
Clustal Omega – Guide Tree improvements
• Guide tree generation is one of the slowest steps
• Especially with large numbers of sequences
• Clustal Omega uses the mBed method to sample a range of sequences and represent every sequence as a vector of distances to these samples
• Results in better scaling with more sequences
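The embedding idea can be illustrated with a toy example: instead of comparing every sequence with every other, each sequence is compared only with a small set of reference sequences. The k-mer distance and the way references are picked below are assumptions made for this sketch and are not the actual choices made by Clustal Omega.

```python
def kmer_distance(a, b, k=2):
    """Hypothetical fast distance: 1 minus the shared fraction of k-mers."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1.0 - len(ka & kb) / max(1, min(len(ka), len(kb)))

def embed(sequences, n_refs):
    """Represent each sequence as a vector of distances to a few references."""
    # Assumption: pick evenly spaced sequences from the length-sorted list.
    ordered = sorted(sequences, key=len)
    step = max(1, len(ordered) // n_refs)
    refs = ordered[::step][:n_refs]
    return [[kmer_distance(s, r) for r in refs] for s in sequences]

seqs = ["VHLTPEEK", "VQLSGEEK", "VLSPADKTNV", "VLSAADKTNV"]
vectors = embed(seqs, n_refs=2)
for s, v in zip(seqs, vectors):
    print(s, v)
```

For N sequences and a fixed number of references this needs N times n_refs distance calculations rather than N(N-1)/2, and the resulting vectors can be clustered to build the guide tree.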
Clustal Omega – Profile-profile alignments
• Like sequence searching, profiles can be used to increase sensitivity
• HMMs are a form of profile
• Clustal Omega aligns HMMs to HMMs
Clustal Omega
• Better scaling for many sequences
• Speed
• Accuracy
• Better scaling for many computers
• More accurate alignments
• Nucleic Acid alignments still work in progress
Which tool should I use?
Input data                                   Recommendation
2-100 sequences of typical protein length    MUSCLE, T-Coffee, MAFFT, ClustalW2/Omega
100-500 sequences                            Clustal Omega, MUSCLE, MAFFT
>500 sequences                               Clustal Omega, KALIGN
Small number of unusually long sequences     ClustalW, KALIGN
How to evaluate?
• Use a benchmark
• BaliBASE
BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics
• ICGEB Strasbourg
• 141 manual alignments using structures
• 5 sections
• core alignment regions marked
1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)
BaliBase                    % correct   time (s)
Clustal Omega               55.4        539
Clustal W                   37.4        766
Mafft (default)             45.8        68
Muscle                      47.5        789
Kalign                      50.1        21
T-Coffee                    55.1        81041
Probcons                    55.8        13086
Mafft (auto/consistency)    58.8        1475
MsaProbs                    60.7        12382
Benchmark pitfalls
• Benchmark dataset may not be representative
• Danger of over-training towards benchmark
• Goldman: Most MSAs have unrealistic gaps
• Tend towards multiple, independent deletions
• Insertions are rare
• Sequences shrink in length over evolution
• No supporting evidence that this is the case
Solutions
• Use phylogenetic data to guide alignment
• Keep track of changes to ancestor sequences
• Don’t change them again so easily in descendants
Phylogeny
• Multiple Sequence Alignment tries to find best alignment of three or more sequences
• Used to identify groups of similar sequences
• Conserved regions etc.
• But if we want to examine evolutionary relationships we need more than just current sequence similarity
• Phylogeny is an estimate of evolutionary history between sequences
• Model substitutions from theoretical ancestor sequences
Neighbour Joining
• Simple phylogenetic tree method
• Bottom up (starts from alignment of current day sequences)
• Iterate to form a tree with nodes forming minimum distances between paired taxa
• Fast
• Dependent on accuracy of input
• Can sometimes get negative branch lengths
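The iteration can be sketched in a few lines of Python. This is a minimal textbook-style version of neighbour joining written for illustration, not the ClustalW2 code; the node names and the small distance matrix are assumptions.

```python
def neighbour_joining(names, dist):
    """Return a list of (node, neighbour, branch_length) edges.

    names: taxon labels; dist: symmetric matrix (list of lists) of distances.
    """
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r(a): total distance from a to every other current node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        new = f"node{counter}"
        counter += 1
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, new, len_a), (b, new, len_b)]
        # Distances from the new internal node to the remaining nodes.
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges

names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
edges = neighbour_joining(names, dist)
for edge in edges:
    print(edge)
```

The branch-length formula is a difference of averaged distances, so when the input distances do not fit a tree well it can come out below zero, which is where the negative branch lengths come from.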
ClustalW2 - Phylogeny
• Neighbour joining (and UPGMA) phylogenetic tree algorithm from the ClustalW2 package
http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/
ClustalW2 - Phylogeny
ALIGNED sequence input
ClustalW2 - Phylogeny
PRANK
• Probabilistic Alignment Kit
• webPRANK
• Better suited for closely related sequences
• When solutions are tied, one is chosen at random
• Avoids incorrect confidence in result
• Means alignments might not be reproducible
• Alignments look quite different
• Might look worse!
• But gap patterns make sense
• Gaps are good!
http://www.ebi.ac.uk/goldman-srv/webprank/
Common problems with MSA
• Input format
• Try using FASTA format
• Unique sequence identifiers
• Include sequence!
• Usually a limit of 500 sequences/1 MB
• Job can’t be found/other error
• Results deleted after 7 days
• Some sequence/program combinations run out of memory
• Use a different program
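Several of these input problems can be caught before submitting a job. The checker below is a sketch only: the function name and messages are made up, and the limits are taken from the "usually" figures above, with 1 MB assumed to mean 1,000,000 bytes.

```python
def check_fasta(text, max_sequences=500, max_bytes=1_000_000):
    """Return a list of problems found in FASTA-formatted text."""
    problems = []
    if len(text.encode("utf-8")) > max_bytes:
        problems.append("input larger than size limit")
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            if not name:
                problems.append("header without identifier")
            elif name in records:
                problems.append(f"duplicate identifier: {name}")
            records.setdefault(name, "")
        elif name is None:
            problems.append("sequence data before first '>' header")
        else:
            records[name] += line
    for ident, seq in records.items():
        if not seq:
            problems.append(f"no sequence for: {ident}")
    if len(records) > max_sequences:
        problems.append("too many sequences")
    return problems

good = ">a\nACD\n>b\nACE\n"
bad = ">a\nACD\n>a\nACE\n>c\n"
print(check_fasta(good))
print(check_fasta(bad))
```

An empty list means none of the checked problems were found; it does not guarantee that a given service will accept the file.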
Example sequences
www.ebi.ac.uk/~apc/Courses/Brazil
Problem_MSA1.fsa
Problem_MSA2.fsa
Problem_MSA3.fsa
Problem_MSA4.fsa
Common mis-uses of MSA
• Performing a sequence assembly
• Specialist type of MSA
• Use other tools (Staden etc.)
• Aligning ESTs to a reference genome
• Use EST2Genome
• Designing primers
• Use primer tools (primer3 etc.)
• Aligning two sequences
• Use a pairwise alignment tool!
Putting it all together
• EBI Search
• Sequence retrieval
• Sequence search
• Sequences retrieval
• Multiple sequence alignment
• Phylogeny
• Analysis
Final remarks
• Don’t assume a single tool will cater for all your needs
• Change the parameters of the tools
• Remember where the tool excels and what its limitations are
• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)
• Crazy input will always give crazy results!
Getting Help
• Database documentation
• EBI Support: http://www.ebi.ac.uk/support/
• EBI training programme: http://www.ebi.ac.uk/training
• EBI online training: http://www.ebi.ac.uk/training/online
• IMGT/HLA
Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI
With thanks to our funders
• EMBL-EBI is primarily funded by EMBL member states
• Other major funders:
• European Commission
• National Institutes of Health
• Research Councils UK
• Wellcome Trust
• Industry Programme
Appendix
How is the data organised?
Data in EMBL-Bank is divided in 2 ways:
1) Data classes
• Type of data or
• Methodology used to obtain data
• Each entry belongs to one data class
2) Taxonomic Divisions
• Each entry belongs to one taxonomic division
ENA database structure
1) Data Classes
CON   Constructed from sequence assemblies
EST   Expressed Sequence Tag (cDNA)
GSS   Genome Survey Sequence (high-throughput short sequence)
HTC   High-Throughput cDNA (unfinished)
HTG   High-Throughput Genome sequencing (unfinished)
MGA   Mass Genome Annotation
PAT   Patent sequences
STS   Sequence Tagged Site (short unique genomic sequences)
TPA   Third Party Annotation (re-annotated and re-assembled)
TSA   Transcriptome Shotgun Assembly (computational assembly)
WGS   Whole Genome Shotgun
STD   Standard (high quality annotated sequence)
SRA   Sequence Read Archive (both databank and data class)
• Single pass reads variable quality
• SRA is a separate databank from EMBL-Bank
• SRA can also be searched as a data class within EMBL-Bank
• Bulk of entries
• Highest level of tracked information
2) Taxonomy
Which taxonomy database does ENA use?
All INSDC databases use NCBI Taxonomy
Divisions:
HUM   Human
MUS   Mouse
MAM   Mammal
VRT   Vertebrate
ROD   Rodent
FUN   Fungi
INV   Invertebrate
PLN   Plant
PHG   Phage
PRO   Prokaryote
VIR   Viral
Other:
ENV   Environmental
SYN   Synthetic
TGN   Transgenic
UNC   Unclassified
2) Taxonomy: exclusion
Some species EXCLUDED from certain taxonomic ranges
VRT Vertebrate excludes human, mouse, rodent, mammal
MAM Mammal excludes human, mouse, rodent
ROD Rodent excludes mouse
Applies to:
• ftp files and
• Sequence search tools
But not:
• ENA Browser
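One way to picture the exclusion rule is as a "most specific division wins" lookup. The sketch below is illustrative only: the group labels and the function are hypothetical, and it covers just the five divisions named in the exclusion list.

```python
# Most specific division first, so that e.g. mouse never falls through
# to ROD, MAM or VRT.
DIVISION_ORDER = [
    ("HUM", "human"),
    ("MUS", "mouse"),
    ("ROD", "rodent"),
    ("MAM", "mammal"),
    ("VRT", "vertebrate"),
]

def assign_division(groups):
    """groups: set of hypothetical group labels that apply to an organism.
    Returns the division code, or None if this sketch does not cover it."""
    for code, label in DIVISION_ORDER:
        if label in groups:
            return code
    return None

print(assign_division({"vertebrate", "mammal", "rodent", "mouse"}))
print(assign_division({"vertebrate", "mammal", "rodent"}))
print(assign_division({"vertebrate", "mammal"}))
```

A practical consequence is that a search restricted to the MAM division will not return human, mouse or other rodent entries; those divisions have to be searched as well.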
How does data organization differ from GenBank?
EMBL-Bank
Data classes and Taxonomic Divisions
• Data split into intersecting slices
• Reduces search set
• Ensures complete result set
Data classes: con, est, gss, htc, htg, pat, sts, std ...
Taxonomic divisions: hum, mus, rod, mam, vrt, fun ...
GenBank
Divisions
• Data split into parallel slices
• Large search sets
• Classes incomplete for taxonomy
• Taxonomy incomplete for classes
Divisions (data classes and taxonomy side by side): con, est, gss, htc, htg, pat, sts, std ... hum, mus, rod, mam, vrt, fun, inv, pln ...
Database structure
‘Mouse’ + ‘EST’ intersection
• small data set
• ensured complete set of mouse ESTs
‘Mouse’ set
• large data set
• includes all mouse entries
‘EST’ set
• large data set
• includes all EST entries
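The intersection can be shown with plain Python sets. The entries below are hypothetical (identifiers and assignments are invented for illustration); the point is only that each entry carries both a taxonomic division and a data class, so the two slices can be intersected.

```python
# Hypothetical entries; identifiers and assignments are invented for illustration.
entries = [
    {"id": "X1", "division": "MUS", "data_class": "EST"},
    {"id": "X2", "division": "MUS", "data_class": "STD"},
    {"id": "X3", "division": "HUM", "data_class": "EST"},
    {"id": "X4", "division": "HUM", "data_class": "STD"},
]

def ids_where(key, value):
    """Identifiers of all entries whose field `key` equals `value`."""
    return {e["id"] for e in entries if e[key] == value}

mouse = ids_where("division", "MUS")    # all mouse entries
est = ids_where("data_class", "EST")    # all EST entries
mouse_est = mouse & est                 # small, complete set of mouse ESTs
print(sorted(mouse_est))
```

Because every entry has exactly one division and one data class, the intersection is both smaller than either slice and guaranteed to contain every mouse EST.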