EBI is an Outstation of the European Molecular Biology Laboratory.
EBI Roadshow
James Watson, PhD
Senior Scientific Training Officer
EBI-EMBL
watson@ebi.ac.uk
Andrew Cowley
External Services, EMBL-EBI
Sequence Searching and Alignments
External Services
Sequence searching and alignments - Andrew Cowley, 21/04/23
Andrew Cowley, Bioinformatics Trainer
Hamish McWilliam, Software engineer
Rodrigo Lopez, Head of External Services
+ many others!
Contents
• Sequence databases
  • Database browsing tools
• Similarity searching and alignments
  • Alignment basics
  • Similarity searching tools
  • More advanced tools
  • Alignment tools
  • Guidelines
• (slightly) More advanced tools
  • Problem sequences
Materials
Presentations and tutorials can be found on the roadshow course page at the EBI
Data files for exercises can be found at:
www.ebi.ac.uk/~watson/africa
Data
• Simplistically, much of the data at the EBI can be thought of as a container with two parts
• One part is the raw data (e.g. sequence)
• The other part is annotation on this data
Example
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
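The "container" idea can be sketched in code. In the example entry, annotation lines start with a two-letter line code (ID, AC, DE, FT, ...), while the raw sequence follows the SQ line and ends at `//`. The following is a minimal Python sketch assuming only that layout; the function name `split_embl_record` and the shortened sample record are illustrative and are not an EBI tool.

```python
def split_embl_record(text):
    """Split an EMBL-style flat-file record into annotation and raw sequence.

    Assumes annotation lines start with a two-letter line code and that
    sequence lines follow the SQ line until the '//' terminator.
    """
    annotation = {}       # line code -> list of line contents
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if in_sequence:
            # Sequence lines hold blocks of bases plus a trailing position count.
            sequence_parts.extend(tok for tok in line.split() if not tok.isdigit())
            continue
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():
            continue      # spacer or empty line, carries no data
        annotation.setdefault(code, []).append(content)
        if code == "SQ":
            in_sequence = True
    return annotation, "".join(sequence_parts)


# Shortened excerpt of the example entry (illustrative only).
sample = """ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
//
"""

annotation, sequence = split_embl_record(sample)
print(annotation["DE"][0])   # the description line (annotation part)
print(len(sequence))         # number of bases read (raw data part)
```

For real work an established parser is the safer choice; the sketch only shows how the two parts of the container can be separated.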
Data - Nucleotide
• ENA/EMBL-Bank:
  • Release and updates
  • Divided into classes and divisions
  • Supplementary sets: EMBL-CDS, EMBL-MGA
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.
  • Alternative splicing: ASD, ASTD, etc.
  • Completed genomes: Ensembl, Integr8, etc.
  • Variation: HGVBase, dbSNP, etc.
Individual sequencing
Individual scientists
ACTGCTGCTAGCTGGCTGACTATTCTAGCTTTAGCTGAGTGACTATTATCAGCTATTACAGCATCCG
Sequence individual gene
What sequence data is submitted?
ACTGCTGCTAGCTAG
submission
add annotation
submission
High throughput sequencing
chromosome
fragment
ACTGCTGCTAGCTAG
cyp30 cyp309 insvcg343
annotation
sequence reads
sequencing library
assemble sequence
High throughput sequencing
Large-scale sequencing projects
submission, e.g. whole genome shotgun
What are primary sequence databases?
Individual scientists
Large-scale sequencing projects
Patent Offices
Primary sequence data
Primary sequence database
• Original sequence data
• Experimental data
• Patent data
• Submitter-defined
How do primary and derived databases differ?
Primary sequence data (from individual scientists, large-scale sequencing projects and patent offices)
Primary sequence database
Derived data, e.g. protein sequence
Derived database
Primary v. derived data
DNA sequence (submitted): ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT
transcribe
Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
translate
Derived protein sequence: MRSDECCCAMSC
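The transcribe and translate steps can be sketched as follows. This is a minimal Python illustration using the standard genetic code and assuming the DNA passed to `transcribe` is the coding strand; the function names are illustrative. Note that under the standard code GAU encodes aspartate (D), so the mRNA shown on the slide translates to MRSDECCCAMSC.

```python
BASES = "UCAG"
# Amino acids for the 64 codons in UCAG order (standard genetic code).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Derive mRNA from a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"   # mRNA from the slide
print(translate(mrna))
```

Because the protein can always be recomputed from the nucleotide sequence in this way, it is derived data rather than primary data.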
How do primary and derived databases differ?
Primary sequence database: redundant
• If anything in a submission varies (e.g. source / submitter / sequence), a new entry is generated
Derived database: may be non-redundant
Derived data, e.g. protein sequence
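The difference in redundancy can be illustrated with a toy example. This is a minimal sketch, assuming a derived database collapses entries that share an identical sequence; the accessions, submitters and the helper `collapse_identical` are hypothetical.

```python
from collections import defaultdict

# Hypothetical primary entries: (accession, submitter, sequence).
primary_entries = [
    ("X0001", "lab A", "ACTGCTGCTAGCTAG"),
    ("X0002", "lab B", "ACTGCTGCTAGCTAG"),   # same sequence, different submitter
    ("X0003", "lab A", "ACTGCTGCTAGCTGG"),
]


def collapse_identical(entries):
    """Group primary accessions by identical sequence (one derived record each)."""
    by_sequence = defaultdict(list)
    for accession, _submitter, sequence in entries:
        by_sequence[sequence].append(accession)
    return dict(by_sequence)


derived = collapse_identical(primary_entries)
print(len(primary_entries), "primary entries ->", len(derived), "derived records")
```

The primary database keeps all three entries because the submitter differs, while the derived view keeps one record per distinct sequence and remembers which primary entries support it.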
How do primary and derived databases differ?
Primary sequence database
Derived database
• data lost
• regenerate data
Primary nucleotide sequence databases
INSDC: International Nucleotide Sequence Database Collaboration
• GenBank (U.S.A.)
• DDBJ (Japan)
• ENA (Europe)
• Daily exchange of data
• Submission can be made to any INSDC database
Sequence information
How is sequence data processed?
Reads
• Sequence machine output (reads)
• Quality scores
Assembly
• Fragmented sequence reads assembled into contigs mapped onto chromosomes
Annotation
• Functional information assigned to assembled regions
Sequence information
What type of sequence data is submitted?
Raw data (Reads)
• Input information: sample, set-up, machine configuration
• Output machine data: sequence traces, reads, quality scores
• Metagenomic data: where originated
Assembled sequences (Assembly) and annotated sequence (Annotation)
• Interpreted information: assembly, mapping, functional annotation, sample information
How does ENA store the data?
ENA: European Nucleotide Archive
Submitters: large-scale sequencing projects, individual scientists, patent offices
• Annotated sequence and assembled sequences: ENA-Annotation (formerly EMBL-Bank)
• Raw data: Sequence Read Archive (SRA) and Trace Archive
How does ENA store the data?
Trace Archive
• Trace sequence reads
• Capillary sequencing instruments
Sequence Read Archive (SRA)
• Intensity reads
• Next-generation sequencing instruments
INSDC Sequencing Projects
Can data be traced to an Institute?
• Institute / Consortium
• Database records: genomic, ESTs, shotgun, ...
• Assembly & annotation
• Complete genome / metagenome (single organism / metagenomic study)
Pulls information together:
• Track projects
• Comparative analysis
Nucleotides: European Nucleotide Archive (ENA)
Figure adapted from: Cochrane, G. et al. Public Data Resources as
the Foundation for a Worldwide Metagenomics Data Infrastructure.
In: Metagenomics: Theory, Methods and Applications (Chapter 5),
Caister Academic Press, Universidad Nacional de Cordoba,
Argentina. Ed. D. Marco (2010).
The ENA has a three-tiered data architecture.
It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).
Data Quality
Is the data cleaned up?
Validation of submitted data:
• Automatic quality checks
• Some manual inspection and curation
Errors can still exist in sequence and annotation
Database Structure
How is the data organized?
Data in ENA Annotation is divided in 2 ways:
1) Data classes
• Type of data or
• Methodology used to obtain data
• Each entry belongs to one data class
2) Taxonomic Divisions
• Each entry belongs to one taxonomic division
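Both groupings appear on the ID line of a flat-file entry: in the AJ131285 example earlier, STD is the data class and INV the taxonomic division. The following is a minimal Python sketch, assuming the seven semicolon-separated fields seen in that example; `parse_id_line` is a hypothetical helper, not part of any EBI library.

```python
def parse_id_line(line):
    """Extract accession, data class and taxonomic division from an ID line.

    Assumes the seven semicolon-separated fields seen in the example entry:
    accession; version; topology; molecule type; class; division; length.
    """
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 7:
        raise ValueError(f"unexpected ID line layout: {line!r}")
    accession, _version, _topology, _mol_type, data_class, division, _length = fields
    return {"accession": accession, "data_class": data_class, "division": division}


info = parse_id_line("ID AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.")
print(info)
```

Each entry yields exactly one class and one division, which is what allows the database to be sliced along both axes.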
Data Classes
CON: Constructed from sequence assemblies
EST: Expressed Sequence Tag (cDNA)
GSS: Genome Survey Sequence (high-throughput short sequence)
HTC: High-Throughput cDNA (unfinished)
HTG: High-Throughput Genome sequencing (unfinished)
MGA: Mass Genome Annotation
PAT: Patent sequences
STS: Sequence Tagged Site (short unique genomic sequences)
TPA: Third Party Annotation (re-annotated and re-assembled)
TSA: Transcriptome Shotgun Assembly (computational assembly)
WGS: Whole Genome Shotgun
STD: Standard (high quality annotated sequence)
Data Classes: EST
• Single pass reads, variable quality
• Need to search both EST and RNA data
Data Classes: PAT
• Often copies of existing entries
• Records not clean, even for taxonomy
Data Classes: STD
• Bulk of entries
• Highest level of tracked information
Data Classes: TPA
• Derived data entries
  • e.g. patch genomic and RNA data to construct complete coverage
• Must have publication
• Must show which entries data is derived from
Data Classes: TSA
• Also derived data entries
• ESTs assembled to construct RNA
• Must show which EST/HTC entries data is derived from
Data Classes: WGS
• Entries change over time (completely replaced)
• Raw WGS entries assembled into contigs (CON entries)
Data Classes
How stable is the data?
Data is always changing:
• Assembly of sequences into larger fragments
• Deletion of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc…
Data Classes
How does assembly affect entries?
Example:
WGS (Shotgun)
• Fragments in separate entry
CON (Constructed)
• Join to make new CON entries
• Old WGS entries archived
STD (Standard)
• Join into large STD entry (e.g. completed genome)
• Add annotation
• Old CON entries archived
Taxonomy
HUM: Human
MUS: Mouse
ROD: Rodent
MAM: Mammal
VRT: Vertebrate
FUN: Fungi
INV: Invertebrate
PLN: Plant
PHG: Phage
PRO: Prokaryote
VIR: Viral
Other:
ENV: Environmental
SYN: Synthetic
TGN: Transgenic
UNC: Unclassified
Taxonomy: ENV (Environmental)
• CAUTION: organism never isolated
• May BLAST sequence to assign putative organism
Taxonomy: SYN (Synthetic) and TGN (Transgenic)
• CAUTION: not consistently handled, variable quality
• Transgenics may be from multiple organisms
Taxonomy: UNC (Unclassified)
• Division primarily used by GenBank for PAT (patent) sequences
Taxonomy exclusion
Some species excluded from certain taxonomic ranges
VRT (Vertebrate) excludes human, mouse, rodent, mammal
MAM (Mammal) excludes human, mouse, rodent
ROD (Rodent) excludes mouse
Applies to:
• ftp files and
• sequence search tools
But not:
• ENA Browser
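The exclusion rule can be read as "each entry goes into the most specific division that matches". The following is a minimal Python sketch under that assumption; the rule table and the simplified lineage lists are hypothetical inputs, not real NCBI Taxonomy lineages or ENA code.

```python
# Divisions ordered from most to least specific, with the taxon that
# triggers each one (simplified, hypothetical matching rule).
DIVISION_RULES = [
    ("HUM", "Homo sapiens"),
    ("MUS", "Mus musculus"),
    ("ROD", "Rodentia"),
    ("MAM", "Mammalia"),
    ("VRT", "Vertebrata"),
]


def assign_division(lineage):
    """Return the first (most specific) division whose taxon is in the lineage."""
    for division, taxon in DIVISION_RULES:
        if taxon in lineage:
            return division
    return None   # outside the vertebrate divisions covered by this sketch


mouse = ["Vertebrata", "Mammalia", "Rodentia", "Mus musculus"]
rat = ["Vertebrata", "Mammalia", "Rodentia", "Rattus norvegicus"]
dog = ["Vertebrata", "Mammalia", "Canis lupus familiaris"]
zebrafish = ["Vertebrata", "Danio rerio"]

for name, lineage in [("mouse", mouse), ("rat", rat), ("dog", dog), ("zebrafish", zebrafish)]:
    print(name, assign_division(lineage))
```

In practice this means a search restricted to VRT in the ftp files or sequence search tools will not return mouse, rodent, human or other mammal entries; those divisions have to be searched as well.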
Taxonomy Database
Which taxonomy database does ENA use?
All INSDC databases use the NCBI Taxonomy Browser
Only organisms with sequence are represented
EBI Taxonomy Portal
• EBI-wide service maps resources into taxonomy service
• Culture collection – physical data, e.g. sample or stored version
• Biomaterial
• Specimen voucher
representation, e.g. picture
Database Structure
How does data organization differ from GenBank?
ENA-Annotation: Data classes and Taxonomic Divisions
• Data split into intersecting slices
• Reduces search set
• Ensures complete result set
• Data classes: con, est, gss, htc, htg, pat, sts, std ...
• Taxonomic divisions: hum, mus, rod, mam, vrt, fun ...
GenBank: Taxonomic Divisions and Data classes
• Data split into parallel slices
• Large search sets
• Classes incomplete for taxonomy
• Taxonomy incomplete for classes
• Slices: con, est, gss, htc, htg, pat, sts, std ... hum, mus, rod, mam, vrt, fun, inv, pln ...
Database Structure
How does data organization differ from GenBank?
‘Mouse’ set
• large data set
• includes all mouse entries
‘EST’ set
• large data set
• includes all EST entries
‘Mouse’ + ‘EST’ intersection
• small data set
• ensured complete set of mouse ESTs
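The intersecting-slices idea can be sketched with sets. This is a minimal Python illustration, assuming every entry carries exactly one taxonomic division and one data class; the accessions and the `select` helper are hypothetical.

```python
# Hypothetical entries: accession -> (taxonomic division, data class).
entries = {
    "A1": ("MUS", "EST"),
    "A2": ("MUS", "STD"),
    "A3": ("HUM", "EST"),
    "A4": ("MUS", "EST"),
    "A5": ("HUM", "STD"),
}


def select(entries, division=None, data_class=None):
    """Return accessions matching the given division and/or data class."""
    return {
        acc
        for acc, (div, cls) in entries.items()
        if (division is None or div == division)
        and (data_class is None or cls == data_class)
    }


mouse_set = select(entries, division="MUS")
est_set = select(entries, data_class="EST")
mouse_ests = mouse_set & est_set          # the 'Mouse' + 'EST' intersection

print(sorted(mouse_set), sorted(est_set), sorted(mouse_ests))
```

Because every entry has both labels, the intersection is smaller than either set yet still contains every mouse EST, which is the "small data set, complete result" property described above.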
Data – Protein Sequence
• UniProt databases:
  • UniProtKB: human curated and automatic translation sections
  • UniRef: non-redundant sequence clusters
  • UniParc: non-identical sequence archive
• Sequence from structures:
  • PDB
  • SGT
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA
  • Alternative splicing: ASD, ASTD
  • Completed proteomes: Ensembl, Integr8
  • Protein Interactions: IntAct
  • Patent Proteins: EPO, JPO, KIPO and USPTO
![Page 44: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/44.jpg)
Sequence Databases

| GenBank | RefSeq | UniProt |
|---|---|---|
| Nucleotide | Nucleotide/Protein | Protein |
| Not curated | Curated | Automatically annotated/curated |
| Author submits | NCBI creates from existing data | Entries created from large scale genomics, small scale cloning and peptide submissions |
| Only author can revise | NCBI revises as new data emerge | UniProt regularly revises entries |
| Multiple records for same loci common | Single records for each molecule of major organisms | An entry per protein including any isoforms (when curated) |
| Records can contradict each other | | ? |
| No limit to species included | Limited to model organisms | All species |
| Data exchanged among INSDC members | Exclusive NCBI database | International consortium, including leading figures in bioinformatics |
| Akin to primary literature | Akin to review articles | Akin to review articles, plus extensive cross-references to other databases; almost a portal |
| Proteins identified and linked | Proteins and transcripts identified and linked | Links to supporting nucleotide data added to entries |
| Access via NCBI Nucleotide databases | Access via Nucleotide & Protein databases | UniProt.org: modern, multi-functional website with incorporated bioinformatics tools |
![Page 45: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/45.jpg)
Protein sequence: UniProt
UniProt
• Manual curation
  • Literature-based annotation
  • Sequence analysis
• Automated annotation
  • Transmembrane prediction
  • InterPro classification
  • Signal prediction
  • Other predictions
  • Protein classification
Some data sources for annotation:
• PRIDE: protein identification data
• GO: functional info
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
![Page 46: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/46.jpg)
UniProt data sources and data flow
Database sources: INSDC (incl. WGS, Env.), RefSeq, Ensembl, VEGA, PDB, Sub/Peptide Data, FlyBase, WormBase, Patent Data, IPI, Proteome Sets
• UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities. It provides a richly curated protein database.
• UniRef: pre-computed clusters of similar proteins (UniRef 100, UniRef 90, UniRef 50)
• UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)
• UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences.
• UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)
![Page 47: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/47.jpg)
The Two Sides of UniProtKB
• UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)
• UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)
![Page 48: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/48.jpg)
Databases
• Many databases and they are getting bigger
• Efficient searching involves knowledge of what is stored in these
• Don’t assume that everything in the databases is correct
• Nothing is constant except change
  • Deletions, sequence modifications
  • Daily updates, identifier changes, etc.
![Page 49: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/49.jpg)
Searching databases
![Page 50: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/50.jpg)
? What methods of searching databases do you know of?
? What is the difference between a primary and secondary database?
? What is the best protein sequence database to search (specific part)?
![Page 51: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/51.jpg)
Searching
• Many ways of searching databases
• Annotation/title
  • Know something about your sequence
  • Gene name
  • Function
  • Accession
![Page 52: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/52.jpg)
New search service
Data organised according to:
• gene
• expression
• protein
• structure
• literature
Species selector allows for easy comparison
Explore data, return easily to your results
Access from the EBI’s homepage
![Page 53: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/53.jpg)
Database webpages
![Page 54: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/54.jpg)
Database searching
![Page 55: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/55.jpg)
Searching
• Many ways of searching databases
• Annotation/title
  • Know something about your sequence
  • Gene name
  • Function
  • Accession
• Raw data
  • Don’t know!
  • Or want to check...
• Infer extra information
  • Homology?
  • Annotation?
  • Function?
![Page 56: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/56.jpg)
Sequence alignment
• Relatively easy if we have an exact match
• ...but sequence is variable
  • Between individuals, species, location etc.
• That variability is useful data too!
• Need a search method that allows for some variability
• And even better, one that helps us assess that variability
![Page 57: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/57.jpg)
Sequence alignment
Query: ACATAGGT
Sequence 1: TCATAGAT
Sequence 2: AAATTCTG
![Page 58: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/58.jpg)
Sequence alignment
Query: ACATAGGT
Sequence 1: TCATAGAT (query ACATAGGT placed beneath)
Sequence 2: AAATTCTG (query ACATAGGT placed beneath)
![Page 59: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/59.jpg)
Sequence alignment
Query: ACATAGGT
Sequence 1: TCATAGAT, score 6/8
Sequence 2: AAATTCTG, score 3/8
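The 6/8 and 3/8 scores on this slide can be reproduced by counting identical positions. A minimal sketch in Python, assuming ungapped sequences of equal length (the function name `identity_score` is ours, not from the slides):

```python
def identity_score(query, subject):
    """Count positions where two equal-length sequences carry the same letter."""
    if len(query) != len(subject):
        raise ValueError("this simple score needs sequences of equal length")
    matches = sum(q == s for q, s in zip(query, subject))
    return matches, len(query)


# The query and the two database sequences from the slide
print(identity_score("ACATAGGT", "TCATAGAT"))
print(identity_score("ACATAGGT", "AAATTCTG"))
```

This only works when no insertions or deletions are involved, which is why the later slides introduce gap penalties.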
![Page 60: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/60.jpg)
Sequence alignment
atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt
atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc
cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca
Query:
1
2
![Page 61: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/61.jpg)
Dot plot
• Maybe a dot plot will help
[Figure: dot plot of Query (A C A T A G) against Sequence 1 (axis letters G A T A C T)]
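A dot plot is simple to build: put one sequence on each axis and mark every cell where the two letters are identical. A minimal text-only sketch in Python (the function name `dot_plot` and the `*` / `.` symbols are our choices for illustration):

```python
def dot_plot(query, subject):
    """Return one text row per subject letter; '*' marks an identical letter."""
    rows = []
    for s in subject:
        rows.append("".join("*" if q == s else "." for q in query))
    return rows


# First six letters of the query and of sequence 1
query, subject = "ACATAG", "TCATAG"
print("  " + query)
for letter, row in zip(subject, dot_plot(query, subject)):
    print(letter + " " + row)
```

Similar sequences show up as a run of marks along a diagonal; unrelated sequences give scattered marks only.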
![Page 62: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/62.jpg)
Dot plot
[Figure: dot plots of Query vs Sequence 1 and Query vs Sequence 2]
![Page 63: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/63.jpg)
• We can see the difference, but how to turn that into something a computer can evaluate?
• Computers rely on algorithms which give them a score
• They can then compare scores
![Page 64: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/64.jpg)
• Simple algorithm – penalise movement away from diagonal – gap penalty
[Figure: matrix with moves along the diagonal scored 0 and moves away from the diagonal scored -10]
![Page 65: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/65.jpg)
• Having opened a gap, we should assign a lesser penalty to extending it
[Figure: matrix with diagonal moves scored 0, gap opening scored -10 and gap extension scored -0.5; combined values of -10.5 are also shown]
Actual implementation is usually to apply gap extension penalty to every gap
![Page 66: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/66.jpg)
Why a lesser gap extend penalty?
• Single block of insertions/deletions is more likely than multiple in/del events
NVELKAET NVDEATNFELKAET
NV-ELKAET
NVDE--A-TNFELKAET
NV------ELKAET
NVDEATNFELKAET
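The effect of a lesser extension penalty can be checked with a little arithmetic. A minimal sketch in Python, assuming the gap open (-10) and gap extend (-0.5) values from the earlier slide, with the extension penalty applied to every gap position; the three-block layout is a hypothetical comparison, not taken from the slide:

```python
def gap_cost(gap_lengths, gap_open=-10.0, gap_extend=-0.5):
    """Total penalty for a set of gap blocks.

    Each block pays the opening penalty once, plus the extension
    penalty for every gap position in the block.
    """
    return sum(gap_open + gap_extend * length for length in gap_lengths)


# One block of six gaps, as in NV------ELKAET
print(gap_cost([6]))
# The same six gap positions split into three separate blocks (hypothetical)
print(gap_cost([2, 2, 2]))
```

The single block is penalised far less than the scattered gaps, so the algorithm prefers the alignment that implies one insertion/deletion event.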
![Page 67: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/67.jpg)
Match/mismatch
• Of course, we need to tell the algorithm that matching letters are better than mismatches too
• This is done via a scoring matrix
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5
![Page 68: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/68.jpg)
• Putting the two together gives us a scoring mechanism
[Figure: worked scoring example for short sequences (axis letters T, A, C, A and C, A), combining the substitution matrix and the gap penalties; a gap move scores -10 and a mismatch scores -4; cell values shown include 6, 1, -4, -13, -13.5, -18 and -22.5]
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5
Gap penalties: open -10, extend -0.5
![Page 69: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/69.jpg)
• To pick the optimal alignment, start at the end and trace back the highest scoring route.
[Figure: the same scoring matrix, with the highest-scoring route traced back from the end]
![Page 70: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/70.jpg)
Needleman-Wunsch
• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!
  • An example of dynamic programming
• Comparing the full length of both sequences is called a global-global or just global alignment
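The fill-and-trace-back procedure from the previous slides can be sketched in a few lines of Python. This is a simplified illustration, not the EMBOSS implementation: it assumes the match/mismatch scores from the DNA matrix (5 / -4) and a single linear gap penalty of -10 instead of separate open and extend penalties.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty (simplified sketch)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs gaps only
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Fill every cell from its diagonal, upper and left neighbours
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the end along a highest-scoring route
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACATAGGT", "TCATAGAT"))
print(needleman_wunsch("ACGT", "AGT"))
```

Note that the whole matrix is held in memory and every cell is evaluated, which matters later when speed is discussed.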
![Page 71: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/71.jpg)
Global vs Local
• But global-global might not be suitable for sequences that are very different lengths
• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm
  • Sets negative scores in matrix to 0, and allows trace back to end and restart
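The change from global to local alignment is small in code. A minimal Python sketch, again assuming a linear gap penalty (-10) and match/mismatch scores of 5 / -4 for simplicity; it reports only the single best local alignment:

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Best local alignment with a linear gap penalty (simplified sketch)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Negative scores are reset to 0, so an alignment can start anywhere
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest cell and stop at the first 0
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical sequences sharing only the core ACGT
print(smith_waterman("TTTACGTTT", "GGGACGTGGG"))
```

Only the shared region is reported; the unrelated flanks are left out rather than forced into the alignment.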
![Page 72: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/72.jpg)
QUESTION: Global vs Local - which is which?
A T G T A T A C G C
- A G T A T A - G C
A - T G T A T A C G C
A G T A T A - - - G C
LOCAL GLOBAL
![Page 73: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/73.jpg)
Scoring
• Parameters so far:
  • Match/mismatch
  • Gap opening
  • Gap extending
• Can we improve it?
![Page 74: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/74.jpg)
Substitutions
• Some substitutions are more likely than others
• DNA:
  • Purines (A, G): dual ring
  • Pyrimidines (C, T): single ring
• Substitutions within the same type (purine to purine, or pyrimidine to pyrimidine) are called transitions, whereas exchanging one type for the other is called a transversion
• Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
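A scoring matrix that distinguishes transitions from transversions can be built directly from the purine/pyrimidine classes. A minimal Python sketch; the score values (match 5, transition -1, transversion -4) are hypothetical and chosen only to illustrate the idea:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of DNA letters as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(match=5, transition=-1, transversion=-4):
    """Scoring matrix as a dict keyed by letter pairs (hypothetical scores)."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return {(x, y): scores[substitution_type(x, y)] for x in "ACGT" for y in "ACGT"}


matrix = build_matrix()
print(substitution_type("A", "G"), matrix[("A", "G")])
print(substitution_type("A", "C"), matrix[("A", "C")])
```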
![Page 75: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/75.jpg)
![Page 76: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/76.jpg)
Proteins
• What about proteins?
![Page 77: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/77.jpg)
Protein substitution matrices
• Can look at closely related proteins to determine substitution rates
• Two most commonly used models:
  • BLOSUM
  • PAM
![Page 78: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/78.jpg)
BLOSUM
• Blocks of Amino Acid Substitution Matrix
• Align conserved regions of evolutionarily divergent sequences clustered at a given % identity
• Count relative frequencies of amino acids and substitution probability
• Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely
• Higher BLOSUM number = more closely related
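The step from counted frequencies to matrix entries is a log-odds calculation: how often a pair is observed, relative to how often it would be expected by chance. A minimal Python sketch, assuming scores in half-bit units (scale factor 2); the frequencies below are hypothetical numbers, not real BLOSUM data:

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score: positive if a pair is seen more often than chance."""
    return round(scale * math.log2(observed / expected))


# Pair seen twice as often as expected by chance (hypothetical frequencies)
print(log_odds_score(observed=0.02, expected=0.01))
# Pair seen half as often as expected by chance (hypothetical frequencies)
print(log_odds_score(observed=0.005, expected=0.01))
```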
![Page 79: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/79.jpg)
PAM
• Point Accepted Mutation
• Observed mutations in a set of closely related proteins
• Markov chain model created to describe substitutions
• Normalised so that PAM1 = 1 mutation per 100 amino acids
• Extrapolate matrices from model
• Higher PAM number = less closely related
[Figure: PAM 250 matrix]
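Extrapolating from the model means multiplying the PAM1 mutation probability matrix by itself: PAM250 is PAM1 raised to the 250th power. A minimal Python sketch using a toy two-letter alphabet with a 1% change per step (the real PAM1 matrix is 20 x 20, and the resulting probabilities are then converted to log-odds scores):

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a whole-number power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM1" for a two-letter alphabet: 99% unchanged, 1% mutated (hypothetical)
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250)
```

After 250 steps the chance of seeing the original letter is close to the background frequency, which is why high PAM numbers suit distantly related sequences.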
![Page 80: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/80.jpg)
Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence
[Figure: six panels, for PAM 10, 100, 200, 300, 400 and 500]
![Page 81: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/81.jpg)
Roughly equivalent matrices, from more divergent to less divergent:
• BLOSUM 45 / PAM 250
• BLOSUM 62 / PAM 160
• BLOSUM 90 / PAM 100
![Page 82: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/82.jpg)
Scoring
• Parameters:
  • Match/mismatch
  • Gap opening
  • Gap extending
  • Substitution matrix
![Page 83: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/83.jpg)
Dynamic programming alignments at the EBI
• EMBOSS Pairwise Alignment Algorithms
• European Molecular Biology Open Software Suite
  • Suite of useful tools for molecular biology
  • Command line based
  • Designed to be used as part of scripts/chained programs
• We implement selected tools to provide web-based access
![Page 84: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/84.jpg)
Where to find at the EBI?
http://www.ebi.ac.uk/Tools/sequence.html
Or...
![Page 85: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/85.jpg)
Where to find at the EBI?
![Page 86: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/86.jpg)
EMBOSS align tools
• Global alignment: Needle
• Local alignment: Water
![Page 87: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/87.jpg)
[Screenshot: tool submission form with program selection, parameters, sequence input and Submit! button]
![Page 88: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/88.jpg)
![Page 89: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/89.jpg)
Key
- Gap
: Positive match
. Negative match
| Identity
![Page 90: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/90.jpg)
Pairwise Alignments - Example sequences
Pairwise_align1.fsa
Pairwise_align2.fsa
Pages 25-30 in full booklet: Questions 7-10
www.ebi.ac.uk/~watson/africa
![Page 91: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/91.jpg)
Dynamic programming sequence search methods at the EBI
• Global alignment: GGSEARCH
• Local alignment: SSEARCH
• Global query vs local database: GLSEARCH
![Page 92: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/92.jpg)
Where to find at the EBI?
www.ebi.ac.uk/Tools/sss/
Or...
![Page 93: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/93.jpg)
Similarity search
[Screenshot: similarity search form with database selection, sequence input, parameters and Submit! button]
![Page 94: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/94.jpg)
• Dynamic programming methods are rigorous and guarantee an optimal result
• But have to store the matrix of both sequences in memory
• And evaluate each position of the matrix
• Predictably, this makes them slow and demanding when you are aligning large sequences
![Page 95: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/95.jpg)
Heuristics
• Therefore we need methods of estimating alignments
• Estimation methods are called heuristics
  • Try and take short cuts in an intelligent manner
  • Speed up the search
  • At the possible expense of accuracy
• Accuracy in sequence searches is important for:
  • Aligning the right bits
  • Scoring the alignment correctly
  • Identifying similar sequences (sensitivity)
![Page 96: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/96.jpg)
• Going back to our dot plot
![Page 97: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/97.jpg)
• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.
![Page 98: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/98.jpg)
• Of course, we have to identify likely regions; not all alignments will be as nice as that one!
• This is the method used by FASTA
  • W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
![Page 99: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/99.jpg)
FASTA – step 1
• Identify runs of identical sequence and pick regions with highest density of runs
Ktup parameter: how many consecutive identities before considered a ‘run’. Also called ‘word size’.
Increase Ktup = faster, but less sensitive
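Step 1 can be illustrated by indexing the words of the query and counting word hits per diagonal of the dot plot. This is a simplified Python sketch of the idea, not the FASTA source code; the function name `diagonal_hits` is ours:

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup):
    """Count identical words of length ktup on each diagonal (subject pos - query pos)."""
    # Index every word of the query by its start position
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    # Look up every word of the subject and record which diagonal it falls on
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


# Query and sequence 1 from the earlier slides
for ktup in (2, 3, 6):
    print(ktup, dict(diagonal_hits("ACATAGGT", "TCATAGAT", ktup)))
```

With a small Ktup the main diagonal collects several hits (plus an occasional chance hit elsewhere); with Ktup = 6 there are no hits at all for this pair, which is the loss of sensitivity the slide warns about.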
![Page 100: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/100.jpg)
FASTA – step 2
• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores
Parameter: substitution matrix
![Page 101: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/101.jpg)
FASTA – step 3
• Discard regions too far from the highest scoring region
Joining threshold: internally determined
![Page 102: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/102.jpg)
FASTA – step 4
• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions
Parameters: Gap open, Gap extend, Substitution matrix
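A minimal sketch of dynamic programming restricted to a band may help here. It is a simplification under stated assumptions, not FASTA's code: it uses one linear gap penalty instead of separate gap open and gap extend values, a match/mismatch score instead of a substitution matrix, and the name `banded_local_score` is hypothetical.

```python
def banded_local_score(a, b, diagonal=0, width=3, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(i - j) - diagonal| <= width."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column; cells outside the band count as 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - diagonal - width)
        hi = min(len(b), i - diagonal + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                        # start a new local alignment
                prev.get(j - 1, 0) + s,   # extend along the diagonal
                prev.get(j, 0) + gap,     # gap in b
                cur.get(j - 1, 0) + gap,  # gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best

a = "ACGTTACGT"  # one extra T compared with b
b = "ACGTACGT"
score_band = banded_local_score(a, b, diagonal=0, width=1)
score_diag = banded_local_score(a, b, diagonal=0, width=0)
print(score_band, score_diag)
```

A band of width 1 is enough to step across the single insertion; with width 0 the search is confined to one diagonal and the gapped alignment cannot be found.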
![Page 103: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/103.jpg)
FASTA
• Repeat against all sequences in the database
![Page 104: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/104.jpg)
FASTA – programs available at EBI
• FASTA: "a fast approximation to Smith & Waterman"
• FASTA – scans a protein or DNA sequence library for similar sequences.
• FASTX/Y – compares a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.
• TFASTX/Y – compares a protein sequence to a translated DNA databank.
• FASTF – compares ordered peptides (Edman degradation) to a protein databank.
• FASTS – compares unordered peptides (Mass Spec.) to a protein databank.
• SSEARCH – rigorous scan of a protein or DNA sequence library (S&W algorithm).
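The translated searches (FASTX/Y, TFASTX/Y) rely on reading DNA in forward and reverse frames. As a rough sketch of that idea only, the following translates a DNA string in all six frames using the standard genetic code. The helper names `translate` and `six_frames` are hypothetical, and the real programs do considerably more (for example, handling frameshifts).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate from the first base; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

frames = six_frames("ATGGCC")
for name, protein in frames.items():
    print(name, protein)
```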
![Page 105: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/105.jpg)
Where to find at the EBI?
www.ebi.ac.uk/Tools/sss/
Or...
![Page 106: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/106.jpg)
Similarity search
Database selection
Sequence input
Parameters
Submit!
![Page 107: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/107.jpg)
FASTA - results
![Page 108: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/108.jpg)
FASTA - results
![Page 109: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/109.jpg)
FASTA - results
![Page 110: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/110.jpg)
FASTA - results
Key
- Gap
: Identity
. Similarity
X Filtered
![Page 111: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/111.jpg)
Using FASTA - Example sequence
www.ebi.ac.uk/~watson/africa
test_prot.fasta
Pages 37-46 in full booklet: Questions 11-14
![Page 112: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/112.jpg)
BLAST – Basic Local Alignment Search Tool
• Instead of narrowing the dynamic programming search space, BLAST works a different way
• First, it creates a word list containing both the exact words from the query sequence and high-scoring substitutions of those words
Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
![Page 113: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/113.jpg)
BLAST – step 1
• w=3
SEWRFKHIYRGQPRRHLLTTGWSTFVT
SEW
EWR
WRF
Parameter: Word length (w)
Increase = faster, but less sensitive
![Page 114: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/114.jpg)
BLAST – step 1(cont.d)
• w=3
• T=13
SEWRFKHIYRGQPRRHLLTTGWSTFVT
GQP 18
GEP 15
GRP 14
GKP 14
GNP 13
GDP 13
(below T=13: AQP 12, NQP 12)
Parameters: Neighbourhood threshold (T), Substitution matrix
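The neighbourhood list for one word can be reproduced with a short sketch. It assumes BLOSUM62 scoring, which matches the scores shown on the slide; only the three matrix rows needed for the word GQP are included, copied by hand, so check them against a reference copy before reusing them. The function name `neighbourhood` is hypothetical, and the output may contain words the slide does not list.

```python
from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"
# Three BLOSUM62 rows (G, Q, P), in the column order of AA. Hand-copied for illustration.
BLOSUM62_ROWS = {
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
}

def neighbourhood(word, threshold):
    """Return every word of the same length scoring >= threshold against `word`."""
    scores = {}
    for candidate in product(AA, repeat=len(word)):
        total = sum(BLOSUM62_ROWS[q][AA.index(c)] for q, c in zip(word, candidate))
        if total >= threshold:
            scores["".join(candidate)] = total
    return scores

words = neighbourhood("GQP", threshold=13)
for word, score in sorted(words.items(), key=lambda kv: -kv[1]):
    print(word, score)
```

Lowering T admits more words (AQP and NQP appear at T=12), which increases sensitivity but also the number of database hits to follow up.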
![Page 115: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/115.jpg)
BLAST – step 2
• Then it scans database sequences for exact matches with these words
![Page 116: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/116.jpg)
BLAST – step 3
• If two hits are found on the same diagonal, the alignment is extended until the score drops by a certain amount
• This results in a High-scoring Segment Pair (HSP)
Parameters: Drop off, Substitution matrix
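The drop-off rule can be illustrated with a small ungapped extension routine. This is a sketch under assumptions rather than BLAST's code: it extends a single word hit, scores with a simple match/mismatch scheme instead of a substitution matrix, and the names `extend_hit` and `x_drop` are made up for the example.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len, x_drop=5, match=2, mismatch=-3):
    """Extend a word hit without gaps; return (score, q_start, q_end) with q_end exclusive."""
    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(word_len))

    def extend(step, q, s):
        # Walk outward, remembering the best running total and how far it reached.
        running = best = 0
        length = best_len = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += score(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score has dropped too far below the best seen
            q += step
            s += step
        return best, best_len

    right, right_len = extend(+1, q_pos + word_len, s_pos + word_len)
    left, left_len = extend(-1, q_pos - 1, s_pos - 1)
    return seed + left + right, q_pos - left_len, q_pos + word_len + right_len

query = "CCCACGTACGTGGG"
subject = "TTTACGTACGTAAA"
hsp = extend_hit(query, subject, q_pos=5, s_pos=5, word_len=3)
print(hsp, query[hsp[1]:hsp[2]])
```

A larger drop-off lets the extension cross longer stretches of mismatches before giving up, at the cost of more work per hit.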
![Page 117: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/117.jpg)
BLAST – step 4
• If the total HSP score is above another threshold, a gapped extension is initiated
Parameters: Extension threshold (Sg), Substitution matrix
![Page 118: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/118.jpg)
BLAST
• The steps rule out many database sequences early on
• Large increase in speed
![Page 119: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/119.jpg)
BLAST – programs available at the EBI
• Basic Local Alignment Search Tool
• NCBI-BLAST programs:
  • BLASTP – protein sequence vs. protein sequence library
  • BLASTN – nucleotide query vs. nucleotide database
  • BLASTX – translated DNA vs. protein sequence library
• WU-BLAST programs:
  • BLASTP – protein query vs. protein database
  • BLASTN – nucleotide query vs. nucleotide database
  • BLASTX – translated nucleotide query vs. protein database
  • TBLASTN – protein query vs. translated nucleotide database
  • TBLASTX – translated nucleotide query vs. translated nucleotide database
Combines several parameters into ‘sensitivity’ option
![Page 120: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/120.jpg)
![Page 121: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/121.jpg)
Using BLAST - Example sequence
test_prot.fasta
Pages 47-50 in full booklet: Questions 15-17
www.ebi.ac.uk/~watson/africa
![Page 122: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/122.jpg)
Key
- Gap
[residue] Identity
+ Similarity
X Filtered
![Page 123: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/123.jpg)
Differences between BLAST and FASTA
• BLAST
  • Fast
  • Good with proteins
  • Produces good local alignments + short global alignments
  • Produces HSP (reports internal matches in long sequences)
  • Might miss a potential alignment due to ruling out sequences early on in the process
  • Good at finding siblings
• FASTA
  • Not as fast as BLAST
  • Much better with DNA than BLASTN
  • Produces S&W alignments
  • Checks each possible alignment with database sequences
  • Good at finding cousins
![Page 124: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/124.jpg)
When to use what?
[Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH plotted against database size and query length]
![Page 125: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/125.jpg)
When to use what?
[Figure: time to search with FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH against PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]
![Page 126: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/126.jpg)
Homology and Similarity
![Page 127: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/127.jpg)
Similarity
![Page 128: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/128.jpg)
Homology
![Page 129: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/129.jpg)
Unrelated!
![Page 130: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/130.jpg)
Homology vs. Similarity
Homology:
• Presence of similar features because of common descent
• Cannot be observed, since the ancestors no longer exist
• Is inferred as a conclusion based on ‘similarity’
• Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)
Similarity:
• Quantifies a ‘likeness’
• Uses statistics to determine the ‘significance’ of a similarity
• Statistically significant similar sequences are considered ‘homologous’
![Page 131: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/131.jpg)
• So far, we’ve talked about scoring alignments
  • Direct function of the algorithm
• But what we want is to assign some kind of quality to that score
![Page 132: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/132.jpg)
Score vs significance
A A A
A A A
High score

A C A T A A G G C T
A T A C A A G C C T
High significance
![Page 133: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/133.jpg)
“Lies, damn lies, and statistics”
![Page 134: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/134.jpg)
“Lies, damn lies, and statistics”
• Not just interested in score...
• ...But how likely we are to get that alignment by chance alone
• It is this ‘non-random’ alignment that allows homology to be inferred
• Statistics are used to estimate this chance
![Page 135: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/135.jpg)
E-value
• ‘Expect’ value
• Expected number of alignments scoring at least this well by chance alone (so lower is better)
• Best measure of how good an alignment is
• Often used for ranking results by default
![Page 136: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/136.jpg)
• Calculated in different ways for BLAST and FASTA
• Short query sequences can only produce short, low-scoring alignments, which are more likely to be found by chance, so they have higher E-values
• Affected by parameter values like gap penalties and substitution matrices
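For BLAST-style statistics, the usual Karlin-Altschul form is E = K·m·n·e^(−λS), where m is the query length, n the database length and S the raw score. The sketch below simply evaluates that formula; the λ and K values are illustrative placeholders (in practice they depend on the matrix and gap penalties), and the function names are hypothetical. It does not describe how FASTA derives its estimates.

```python
import math

def evalue(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments with at least this raw score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, k):
    """Normalised score that no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def evalue_from_bits(bits, query_len, db_len):
    return query_len * db_len * 2.0 ** (-bits)

def pvalue(e):
    """Probability of seeing at least one such alignment by chance."""
    return -math.expm1(-e)

LAMBDA, K = 0.267, 0.041  # assumed example values, not taken from the slides
e_small = evalue(80, 250, 1_000_000, LAMBDA, K)
e_large = evalue(80, 250, 10_000_000, LAMBDA, K)
print(e_small, e_large, pvalue(e_small))
```

The same raw score gives a tenfold higher E-value in a database ten times larger, which is one reason to search the smallest database likely to contain the sequences of interest.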
![Page 137: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/137.jpg)
FASTA statistics
• Compares query sequence with every sequence in database
• As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance
![Page 138: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/138.jpg)
FASTA - histogram
Key
* Predicted distribution of scores
= Observed distribution of scores
High scoring region
![Page 139: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/139.jpg)
BLAST statistics
• Main reason for speed is that it doesn’t compare query with lots of other sequences
• Therefore it pre-estimates statistical values using a random sequence model
“Appears to yield fairly accurate results”
![Page 140: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/140.jpg)
Search Guidelines
![Page 141: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/141.jpg)
Search guidelines 1
• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)
• Then with translated DNA sequences (fastx, blastx)
• Search with DNA vs. DNA as the next resort
• And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!
![Page 142: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/142.jpg)
Search guidelines 2
• Search the smallest database that is likely to contain the sequence(s) of interest
• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology
![Page 143: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/143.jpg)
Search guidelines 3
• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence
  • Examine the histograms
  • Use programs such as prss3 to confirm the expectation values.
  • Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() ~1.0
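The reasoning behind shuffling can be sketched as follows: a shuffled query keeps its length and composition but loses its order, so scores against it show what chance alone produces. This toy version is not prss3 or the FASTA shuffle option. It uses a made-up ungapped scoring function (`best_ungapped_score`) as a stand-in for a real aligner, and it reports an empirical fraction rather than an E() value.

```python
import random

def best_ungapped_score(a, b, match=2, mismatch=-1):
    """Best local score over all diagonals, without gaps (stand-in for a real aligner)."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        running = 0
        for i in range(max(0, offset), min(len(a), len(b) + offset)):
            running = max(0, running + (match if a[i] == b[i - offset] else mismatch))
            best = max(best, running)
    return best

def shuffle_test(query, subject, trials=200, seed=0):
    """Compare the real score with scores obtained from shuffled copies of the query."""
    rng = random.Random(seed)
    real = best_ungapped_score(query, subject)
    residues = list(query)
    shuffled_scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        shuffled_scores.append(best_ungapped_score("".join(residues), subject))
    fraction = sum(s >= real for s in shuffled_scores) / trials
    return real, shuffled_scores, fraction

query = "KHIYRGQPRRHLLTTGWSTF"
subject = "MAAS" + query + "DDEE"  # hypothetical database entry containing the query
real, shuffled_scores, fraction = shuffle_test(query, subject)
print(real, max(shuffled_scores), fraction)
```

If shuffled queries regularly score as well as the real one, the reported significance of the real hit should not be trusted.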
![Page 144: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/144.jpg)
![Page 145: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/145.jpg)
Search guidelines 4
• Consider searches with different gap penalties and other scoring matrices
  • Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences
  • Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)
  • Remember to change the gap penalty defaults!

MATRIX    open  ext.
BLOSUM50  -10   -2
BLOSUM62   -7   -1
BLOSUM80  -16   -4
PAM250    -10   -2
PAM120    -16   -4
![Page 146: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/146.jpg)
Search guidelines 5
• Homology can be reliably inferred from statistically significant similarity
• But remember:
  • Orthologous sequences have similar functions
  • Paralogous sequences can acquire very different functional roles
• So further work might be needed to tease out details
![Page 147: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/147.jpg)
![Page 148: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/148.jpg)
Search guidelines 6
• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues
  • However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!
• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data
  • ClustalW
  • MUSCLE
  • T-Coffee
  • Kalign
  • MAFFT
  • Mview (available from EBI FASTA & BLAST services)
  • DBCLUSTAL (available from EBI BLAST services)
![Page 149: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/149.jpg)
Advanced
![Page 150: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/150.jpg)
• In general, the more information we can add to an alignment, the better the result
Conserved regions
Structural information
Motifs
[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
![Page 151: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/151.jpg)
Conserved regions
• We can add a new ‘position’ parameter to the substitution matrix
We can even modify a normal search to generate a position specific scoring matrix, or PSSM
![Page 152: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/152.jpg)
PSI-BLAST
Position Specific Iterative – BLAST:
1. Takes the result of a normal BLAST
2. Aligns them and generates profile of conserved positions
3. Uses this to weight scoring on next iteration
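The profile in step 2 can be pictured as one score column per alignment position. The sketch below builds log-odds scores from a gap-free toy alignment. It is heavily simplified compared with PSI-BLAST: the uniform background frequencies, the fixed pseudocount, the lack of sequence weighting, and the names `build_pssm` and `score_with_pssm` are all assumptions made for illustration.

```python
import math
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score per column from gap-free aligned sequences of equal length."""
    background = 1.0 / len(AA)  # assumption: every residue equally likely by chance
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AA)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AA
        })
    return pssm

def score_with_pssm(pssm, segment):
    """Score a segment position by position instead of with one fixed matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

hits = ["RDFATH", "TAEATH", "DQAATH", "RDEATH"]  # toy alignment of earlier hits
pssm = build_pssm(hits)
print(score_with_pssm(pssm, "RDEATH"), score_with_pssm(pssm, "WWWWWW"))
```

Fully conserved columns (the A-T-H positions here) end up with the largest scores, so matching them counts for more on the next iteration.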
![Page 153: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/153.jpg)
PSI-BLAST
• By adding importance to conserved residues we might be able to find more distant sequences
• But iterate too far and we might be assigning importance where there is none
More sensitive
![Page 154: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/154.jpg)
PSI-BLAST
![Page 155: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/155.jpg)
PSI-BLAST
![Page 156: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/156.jpg)
PSI-BLAST
![Page 157: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/157.jpg)
PHI-BLAST
• Pattern Hit Initiated-BLAST
• User provides a pattern alongside a protein
• Database hits have to contain this pattern and also show similarity to the rest of the sequence
• Results can initiate a PSI-BLAST search as well
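The pattern requirement on its own is easy to picture with a regular expression. The sketch below converts the motif notation used earlier in these slides, [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H, into a Python regex and checks whether a sequence contains it. The helper name `pattern_to_regex` and the test sequences are made up, and the sketch covers only the pattern filter, not the similarity search that PHI-BLAST runs around it.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'X-[A, B or C]-Y' style motif text into a compiled regular expression."""
    parts = []
    for element in re.findall(r"\[[^\]]*\]|[A-Z]", pattern):
        if element.startswith("["):
            # Keep only the single-letter residue codes inside the brackets.
            letters = "".join(re.findall(r"\b[A-Z]\b", element))
            parts.append(f"[{letters}]")
        else:
            parts.append(element)
    return re.compile("".join(parts))

motif = pattern_to_regex("[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H")
print(motif.pattern)

for seq in ("MKTDEATHLL", "MKWWWATHLL"):  # hypothetical database sequences
    match = motif.search(seq)
    print(seq, "contains the pattern" if match else "does not contain the pattern")
```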
![Page 158: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/158.jpg)
PSI-SEARCH
• Smith-Waterman implementation (SSEARCH)
• But with iterative position specific scoring
![Page 159: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/159.jpg)
Using PSI-BLAST - Example sequence
test_prot.fasta
Pages 52-55 in full booklet: Questions 18-20
www.ebi.ac.uk/~watson/africa
![Page 160: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/160.jpg)
Problem Sequences
![Page 161: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/161.jpg)
Short sequences
• What about short sequences?
• Depends on their nature:
  • Protein
    • Reduce word length and/or increase the E() value cut off
    • Use shallow matrices
  • DNA
    • Reduce the word length
    • Ignore gap penalties (force local alignments only)
• Use rigorous methods
• But ask what you are trying to do!
![Page 162: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/162.jpg)
Low complexity regions
• Sometimes biologically relevant, but always likely to skew alignment scoring
• E.g. CA repeats, poly-A tails and Proline rich regions
![Page 163: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/163.jpg)
Good Statistics:
The inset shows good correlation between the observed and expected numbers of scores.
This is the region of the histogram to look out for first when evaluating results.
![Page 164: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/164.jpg)
Bad Statistics:
The inset shows bad correlation between the observed and expected scores in this search.
The spaces between the = and * symbols indicate this poor correlation.
One reason for this can be low complexity regions.
![Page 165: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk](https://reader036.vdocument.in/reader036/viewer/2022081516/56649e375503460f94b270af/html5/thumbnails/165.jpg)
Low complexity regions
• Sometimes biologically relevant, but always likely to skew alignment scoring
• E.g. CA repeats, poly-A tails and Proline rich regions
• Compensate by filtering sequence so these regions don’t contribute to scoring
• Filters: seg, xnu, dust, CENSOR
• But check what you are filtering!
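The idea of filtering can be illustrated with a small sketch. This is not the algorithm used by seg, xnu, dust or CENSOR; it is a simplified stand-in that masks any window whose Shannon entropy falls below a cut-off. The function names, the window size and the threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace residues in low-entropy windows with mask_char.

    window and threshold are illustrative values, not those of any real filter.
    """
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            # Mask every position covered by the low-complexity window.
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # A proline-rich tail is masked; the varied region before it is mostly kept.
    print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "P" * 20))
```

Printing the masked sequence next to the original is one way to follow the advice above and check what is actually being filtered.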
Filtered:
Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.
Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.
Using Filters - Example sequence
Pages 56-57 in full booklet: Questions 21-22
Filtertest_seq.fsa
www.ebi.ac.uk/~watson/africa
Vector contamination
• You think you know what your sequence is...
• ...but the results are really confusing!
• Maybe you have vector contamination
• Search against known vectors to check
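As a rough illustration of what such a check does, the sketch below looks for exact k-letter words shared between a query and a single vector sequence and merges them into spans. In practice you would run a proper similarity search against a vector database; the function names and the word size `k` here are assumptions for illustration only.

```python
def shared_kmer_positions(query: str, vector: str, k: int = 12) -> list:
    """Start positions in query of k-mers that also occur in the vector."""
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    return [i for i in range(len(query) - k + 1) if query[i:i + k] in vector_kmers]


def contaminated_spans(query: str, vector: str, k: int = 12) -> list:
    """Merge overlapping or adjacent k-mer hits into (start, end) spans.

    Spans are half-open: query[start:end] is the suspected vector-derived region.
    """
    spans = []
    for pos in shared_kmer_positions(query.upper(), vector.upper(), k):
        end = pos + k
        if spans and pos <= spans[-1][1]:
            # Hit overlaps or touches the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((pos, end))
    return spans


if __name__ == "__main__":
    vector = "ACGTTGCAAGGCTTAACCGGATCG"            # toy vector sequence
    query = vector[4:20] + "TTTTTTTTTTCCCCCCCCCC"  # vector fragment + insert
    print(contaminated_spans(query, vector, k=8))
```

A span at one end of the query, as in the toy example, is the typical signature of vector sequence left attached to an insert.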
Vector Contamination - Example sequences
vectortest_seq1.fsa
vectortest_seq2.fsa
Page 57 in full booklet: Question 23
www.ebi.ac.uk/~watson/africa
Multiple Sequence Alignments
Uses of MSA
• Functional prediction
• Phylogeny
• Structural prediction
• Homology detection
• Protein analysis
• To distinguish between orthology and paralogy
• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)
• But this is too computationally intensive
Human beta      --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta      --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha     ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha     ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin  PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin    --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta      PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta      PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha     ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha     ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin  ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin    VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta      LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta      LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha     LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha     LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin  LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin    VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
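Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

| Sequences | Time |
|---|---|
| 2 | 1 second |
| 3 | 150 seconds |
| 4 | 6.25 hours |
| 5 | 39 days |
| 6 | 16 years |
| 7 | 2404 years |

Time: $O(L^N)$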
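A minimal sketch of the weighted sum of pairs, assuming that W is a per-pair weight and D is the distance of the pairwise alignment induced by two rows of the multiple alignment. The mismatch and gap costs are made-up values, and the length L = 150 in the timing function is inferred from the ratio between rows of the table (150 seconds versus 1 second); none of these are stated on the slide.

```python
def pair_distance(row_a: str, row_b: str, mismatch: float = 1.0, gap: float = 2.0) -> float:
    """Distance between two aligned rows (illustrative costs, not a real matrix)."""
    distance = 0.0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue              # column is empty for this pair
        if a == "-" or b == "-":
            distance += gap
        elif a != b:
            distance += mismatch
    return distance


def weighted_sum_of_pairs(rows: list, weights: list = None) -> float:
    """Sum W[i][j] * D(i, j) over all pairs with j < i, as in the formula above."""
    total = 0.0
    for i in range(1, len(rows)):
        for j in range(i):
            w = 1.0 if weights is None else weights[i][j]
            total += w * pair_distance(rows[i], rows[j])
    return total


def exhaustive_time_seconds(n_sequences: int, length: int = 150) -> float:
    """O(L^N) growth, scaled so that two sequences take one second (assumed L = 150)."""
    return float(length) ** (n_sequences - 2)


if __name__ == "__main__":
    print(weighted_sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
    print(exhaustive_time_seconds(4) / 3600, "hours for 4 sequences")
```

Under these assumptions each additional sequence multiplies the running time by L, which is why the exact approach stops being practical after a handful of sequences.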
• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)
• But this is too computationally intensive
• Therefore we have to use heuristics and progressive alignment methods
Clustal
• >60,000 citations
• Clustal1-Clustal4
• 1988, Paul Sharp, Dublin
• Clustal V 1992
• EMBL Heidelberg
• Rainer Fuchs
• Alan Bleasby
• Clustal W, Clustal X 1994-2005
• Toby Gibson, EMBL, Heidelberg
• Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 2006
• University College Dublin
www.clustal.org
CLUSTAL
• Quick, pairwise alignment of all sequences
• Line up pairs, with the most similar first
CLUSTAL
• Fix the alignment between pairs and treat as one sequence
CLUSTAL
• Align your fixed pairs with each other
• Note, this is not a phylogram!
• Only a guide tree for the alignment
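The order in which a progressive method joins sequences can be sketched as follows. This is not Clustal's implementation: it scores pairs by the fraction of shared k-letter words (a crude stand-in for the quick pairwise alignments) and joins the most similar clusters first using average linkage. The function names and the word size are assumptions for illustration.

```python
from itertools import combinations


def kmer_set(seq: str, k: int = 2) -> set:
    """Set of overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 2) -> float:
    """Jaccard similarity of the two k-mer sets (0.0 = nothing shared)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def guide_tree(seqs: dict, k: int = 2):
    """Return a nested tuple giving the join order, most similar pairs first."""
    if not seqs:
        raise ValueError("at least one sequence is required")
    clusters = {name: [name] for name in seqs}  # tree node -> member names

    def linkage(pair):
        # Average similarity between all members of the two clusters.
        members_a, members_b = clusters[pair[0]], clusters[pair[1]]
        scores = [similarity(seqs[x], seqs[y], k) for x in members_a for y in members_b]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=linkage)
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


if __name__ == "__main__":
    toy = {"a": "AAAACCCC", "b": "AAAACCCG", "c": "GGGGTTTT", "d": "GGGGTTTA"}
    print(guide_tree(toy))
```

The nested tuple only records which groups are aligned to each other and in what order, which is why such a tree should not be read as a phylogeny.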
ClustalW at the EBI
ClustalW
[Screenshot: ClustalW submission form, with callouts labelled Parameters, Sequence input, Submit! and Help!]
ClustalW
Interactive – results in browser, deleted after 24 hours
Email – receive URL to results page, deleted after 7 days
Jalview
ClustalW
Advantages
• Fast
• Not too demanding
• Widely used
• Fine for most uses
Disadvantages
• Fixing of early alignments
• Propagate errors
• Doesn’t search far
• Local minima
• Compresses gaps
Use of Clustal/JalView - Example sequences
prot_MSA.fasta
Problem_MSA1.fsa
Problem_MSA2.fsa
Problem_MSA3.fsa
Problem_MSA4.fsa
Pages 59-66 in full booklet: Questions 24-28
www.ebi.ac.uk/~watson/africa
Other Tools
COFFEE
• Consistency based Objective Function For alignmEnt Evaluation
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
COFFEE
• Library of reference pairwise alignments
• For your given set of sequences
• Objective Function
• Evaluates consistency between multiple alignment and the library of pairwise alignments
• Use SAGA to optimise this function
• Weigh depending on quality of alignment
SAGA is another alignment method, using genetic algorithms
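The consistency idea can be sketched in a few lines. This is a simplified illustration, not the published COFFEE function: every library pair has unit weight, and the score is just the fraction of residue pairs from the pairwise library that are also aligned in the multiple alignment. The function names and the dictionary-based input format are assumptions.

```python
def aligned_pairs(rows: dict) -> set:
    """All residue pairs placed in the same column of an alignment.

    rows maps sequence name -> gapped row; each residue is identified by
    (sequence name, index in the ungapped sequence).
    """
    names = list(rows)
    next_index = {name: 0 for name in names}
    length = len(rows[names[0]])
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if rows[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        for x in range(len(present)):
            for y in range(x + 1, len(present)):
                pairs.add(frozenset((present[x], present[y])))
    return pairs


def consistency_score(msa: dict, library: list) -> float:
    """Fraction of library residue pairs that the multiple alignment reproduces."""
    msa_pairs = aligned_pairs(msa)
    library_pairs = set()
    for pairwise_alignment in library:
        library_pairs |= aligned_pairs(pairwise_alignment)
    if not library_pairs:
        return 0.0
    return len(library_pairs & msa_pairs) / len(library_pairs)


if __name__ == "__main__":
    msa = {"s1": "AC-T", "s2": "ACGT", "s3": "A-GT"}
    library = [
        {"s1": "AC-T", "s2": "ACGT"},
        {"s1": "ACT", "s3": "AGT"},
    ]
    print(consistency_score(msa, library))
```

An optimiser such as SAGA would then search for the multiple alignment that maximises this kind of score.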
COFFEE
• More accurate than ClustalW
• Much less prone to problems in early alignment stages
• VERY slow!
T-Coffee
• Tree-based COFFEE
• Heuristic approach to COFFEE
• Gets rid of genetic algorithm portion
• Uses progressive alignments
• Changes algorithm based on number of sequences
T-Coffee
• Much faster than COFFEE
• Avoids some of ClustalW’s pitfalls
• Can take information from several data sources
• Still not that fast
• Can be very demanding of memory etc.
Others
• MUSCLE – Bob Edgar
• Iterative/progressive alignment
• Fast
• Good for big alignments, proteins
• MAFFT
• Iterative based Fast Fourier Transform
• Fast and accurate
• Good for huge alignments
• Kalign
• Very fast, local-regions aligning
• Good for very large numbers of alignments!
Which tool should I use?
| Input data | Recommendation |
|---|---|
| 2-100 sequences of typical protein length | MUSCLE, T-Coffee, MAFFT, ClustalW |
| 100-500 sequences | MUSCLE, MAFFT |
| >500 sequences | MUSCLE, Kalign |
| Small number of unusually long sequences | ClustalW |
How to evaluate?
• Use a benchmark
• BaliBASE
BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics
• ICGEB Strasbourg
• 141 manual alignments using structures
• 5 sections
• core alignment regions marked
1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)
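One way a test alignment can be compared with a reference alignment is a column score: the fraction of reference columns that appear unchanged in the test alignment. The sketch below is an illustration of that idea under simple assumptions (every reference column counts, not only marked core regions); it is not the benchmark's own scoring program, and the function names are hypothetical.

```python
def columns(rows: dict) -> list:
    """Describe each alignment column as a tuple of residue indices (None = gap).

    Sequences are taken in sorted name order so that two alignments of the
    same sequences produce comparable tuples.
    """
    names = sorted(rows)
    next_index = {name: 0 for name in names}
    length = len(rows[names[0]])
    result = []
    for col in range(length):
        column = []
        for name in names:
            if rows[name][col] == "-":
                column.append(None)
            else:
                column.append(next_index[name])
                next_index[name] += 1
        result.append(tuple(column))
    return result


def column_score(test: dict, reference: dict) -> float:
    """Fraction of non-empty reference columns reproduced exactly in test."""
    test_columns = set(columns(test))
    reference_columns = [c for c in columns(reference)
                         if any(index is not None for index in c)]
    if not reference_columns:
        return 0.0
    matched = sum(1 for c in reference_columns if c in test_columns)
    return matched / len(reference_columns)


if __name__ == "__main__":
    reference = {"s1": "AC-T", "s2": "ACGT", "s3": "A-GT"}
    test = {"s1": "A-CT", "s2": "ACGT", "s3": "A-GT"}
    print(column_score(test, reference))
```

Restricting `reference_columns` to the marked core regions would bring the sketch closer to how a structure-based benchmark is normally used.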
Benchmark pitfalls
• Benchmark dataset may not be representative
• Danger of over-training towards benchmark
• Goldman: Most MSAs have unrealistic gaps
• Tend towards multiple, independent deletions
• Insertions are rare
• Sequences shrink in length over evolution
• No supporting evidence that this is the case
Solutions
• Use phylogenetic data to guide alignment
• Keep track of changes to ancestor sequences
• Don’t change them again so easily in descendants
PRANK
• Probabilistic Alignment Kit
• webPRANK
• Better suited for closely related sequences
• Tied solutions are chosen at random
• Avoids incorrect confidence in result
• Means alignments might not be reproducible
• Alignments look quite different
• Might look worse!
• But gap patterns make sense
• Gaps are good!
Comparing Alignments - Example sequences
prot_MSA.fasta
Pages 67-74 in full booklet: Questions 29-30
www.ebi.ac.uk/~watson/africa
Common problems with MSA
• Input format
• FASTA format
• Unique sequence identifiers
• Include sequence!
• Job can’t be found
• Interactive results deleted after 24hrs
• Use email
• Consider other tool
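The three input checks above can be run locally before submitting a job. The following is a minimal sketch with hypothetical function names; it treats the first word after `>` as the identifier, which is a common convention but an assumption here.

```python
def parse_fasta(text: str) -> list:
    """Return (identifier, sequence) tuples; identifier = first word of the header."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_msa_input(text: str) -> list:
    """List the problems found: not FASTA, duplicate identifiers, empty sequences."""
    problems = []
    if not text.lstrip().startswith(">"):
        problems.append("input does not start with a FASTA header line ('>')")
    seen = set()
    for name, seq in parse_fasta(text):
        if name in seen:
            problems.append(f"duplicate identifier: {name}")
        seen.add(name)
        if not seq:
            problems.append(f"no sequence for: {name}")
    return problems


if __name__ == "__main__":
    print(check_msa_input(">a\nACGT\n>a\n\n>c\n"))
```

An empty list means none of these three problems was detected; it does not guarantee that a given tool will accept the file.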
Common mis-uses of MSA
• Performing a sequence assembly
• Specialist type of MSA
• Use other tools (Staden etc.)
• Aligning ESTs to a reference genome
• Use EST2Genome
• Designing primers
• Use primer tools (primer3 etc.)
• Aligning two sequences
• Use a pairwise alignment tool!
Putting it all together
• EB-Eye search
• Sequence retrieval
• Sequence search
• Sequences retrieval
• Multiple sequence alignment
• Analysis
Final remarks
• Don’t assume a single tool will cater for all your needs
• DO change the parameters of the tools
• Remember where the tool excels and what its limitations are
• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)
• Crazy input will always give crazy results!
Getting Help
• Database documentation
• Frequently Asked Questions
• http://www.ebi.ac.uk/help/faq.html
• 2can Support Portal
• http://www.ebi.ac.uk/2can/
• EBI Support
• http://www.ebi.ac.uk/support/
• Hands-on training programme
• http://www.ebi.ac.uk/training/handson/
Thanks!
• Hamish McWilliam and Andrew Cowley
• Vicky Schneider
• Rodrigo Lopez
• EMBL-EBI
• SLING
• You!