Post on 20-Jan-2016
EBI is an Outstation of the European Molecular Biology Laboratory.
EBI Roadshow
James Watson, PhD
Senior Scientific Training Officer
EBI-EMBL
watson@ebi.ac.uk
Andrew Cowley
External Services, EMBL-EBI
Sequence Searching and Alignments
External Services
Sequence searching and alignments - Andrew Cowley21/04/233
Andrew Cowley, Bioinformatics Trainer
Hamish McWilliam, Software engineer
Rodrigo Lopez, Head of External Services
+ many others!
Contents
• Sequence databases
  • Database browsing tools
• Similarity searching and alignments
  • Alignment basics
  • Similarity searching tools
  • More advanced tools
  • Alignment tools
  • Guidelines
• (slightly) More advanced tools
  • Problem sequences
Materials
Presentations and tutorials can be found on the roadshow course page at the EBI
Data files for exercises can be found at:
www.ebi.ac.uk/~watson/africa
Data
• Simplistically, much of the data at the EBI can be thought of as a container
• One part being the raw data (e.g. sequence)
• Another part being annotation on this data
Example
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
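The two-letter line codes in the example entry (ID, AC, DE, OS, OC, ...) are what separate the raw data from the annotation. As a minimal sketch, assuming the entry is available as plain text with one record line per text line, the lines can be grouped by their code. The function name `group_by_line_code` and the shortened `SAMPLE` text are illustrative only, not part of any EBI tool.

```python
from collections import defaultdict

# A few lines copied from the AJ131285 entry (shortened for illustration).
SAMPLE = """\
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
//
"""


def group_by_line_code(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code in ("XX", "//"):  # spacer and end-of-entry lines carry no data
            continue
        fields[code].append(content)
    return dict(fields)


entry = group_by_line_code(SAMPLE)
# The ID line is a semicolon-separated summary of the entry.
id_parts = [part.strip() for part in entry["ID"][0].rstrip(".").split(";")]
print("data class:", id_parts[4], "| taxonomic division:", id_parts[5])
```

In this entry the fifth and sixth ID fields are the data class (STD) and the taxonomic division (INV).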
Data - Nucleotide
• ENA/EMBL-Bank:
  • Release and updates
  • Divided into classes and divisions
  • Supplementary sets: EMBL-CDS, EMBL-MGA
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.
  • Alternative splicing: ASD, ASTD, etc.
  • Completed genomes: Ensembl, Integr8, etc.
  • Variation: HGVBase, dbSNP, etc.
What sequence data is submitted?
• Individual scientists (individual sequencing): sequence individual gene → add annotation → submission
• Large-scale sequencing projects (high throughput sequencing, e.g. whole genome shotgun): chromosome → fragment → sequencing library → sequence reads → assemble sequence → annotation (e.g. cyp30, cyp309, insvcg343) → submission
What are primary sequence databases?
• Submitters: individual scientists, large-scale sequencing projects, patent offices
• Primary sequence data goes into the primary sequence database:
  • Original sequence data
  • Experimental data
  • Patent data
  • Submitter-defined
How do primary and derived databases differ?
• Primary sequence data (from individual scientists, large-scale sequencing projects and patent offices) goes into the primary sequence database
• Derived data, e.g. protein sequence, goes into the derived database
Primary v. derived data
• DNA sequence (submitted): ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT
• transcribe → derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
• translate → derived protein sequence: MRSDECCCAMSC
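The transcribe and translate steps that turn primary data into derived data can be sketched in a few lines. This is an illustration only, assuming the standard genetic code and a sequence already in reading frame; the function names are made up for this example. Note that under the standard code the fourth codon of the slide's mRNA, GAU, encodes aspartate (D).

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an in-frame mRNA, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(mrna))
```

Because the protein can always be regenerated from the nucleotide sequence in this way, it counts as derived data.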
How do primary and derived databases differ?
• Primary sequence database: redundant
  • If anything in a submission varies (e.g. source / submitter / sequence), it generates a new entry
• Derived database (derived data, e.g. protein sequence): may be non-redundant
How do primary and derived databases differ?
[Diagram: primary sequence database and derived database (derived data, e.g. protein sequence), with the labels "data lost" and "regenerate data"]
Primary nucleotide sequence databases
• GenBank (U.S.A.), DDBJ (Japan), ENA (Europe)
• INSDC: International Nucleotide Sequence Database Collaboration
• Daily exchange of data
• Submission can be made to any INSDC database
How is sequence data processed?
• Reads: sequence machine output (reads), quality scores
• Assembly: fragmented sequence reads assembled into contigs mapped onto chromosomes
• Annotation: functional information assigned to assembled regions

What type of sequence data is submitted?
• Raw data (reads)
  • Input information: sample, set-up, machine configuration
  • Output machine data: sequence traces, reads, quality scores
• Assembled sequences and annotated sequence
  • Interpreted information: assembly, mapping, functional annotation, sample information
• Metagenomic data: where originated
How does ENA store the data?
ENA (European Nucleotide Archive) takes assembled sequences, raw data and annotated sequence from large-scale sequencing projects, individual scientists and patent offices, and holds them in three parts:
• ENA-Annotation (formerly EMBL-Bank)
• Sequence Read Archive (SRA)
  • Intensity reads
  • Next-generation sequencing instruments
• Trace Archive
  • Trace sequence reads
  • Capillary sequencing instruments
INSDC Sequencing Projects
Can data be traced to an Institute?
• Institute or consortium → database records (genomic, ESTs, shotgun, ...) → assembly & annotation → complete genome / metagenome (single organism / metagenomic study)
• A project pulls information together
• Track projects
• Comparative analysis
Nucleotides: European Nucleotide Archive (ENA)
Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).
The ENA has a three-tiered data architecture.
It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).
Data Quality
Is the data cleaned up?
Validation of submitted data:
• Automatic quality checks
• Some manual inspection and curation
Errors can still exist in sequence and annotation
Database Structure
How is the data organized?
Data in ENA Annotation is divided in 2 ways:
1) Data classes
• Type of data or
• Methodology used to obtain data
• Each entry belongs to one data class
2) Taxonomic Divisions
• Each entry belongs to one taxonomic division
Data Classes
CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
STS  Sequence Tagged Site (short unique genomic sequences)
TPA  Third Party Annotation (re-annotated and re-assembled)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun
STD  Standard (high quality annotated sequence)
Data Classes: EST (Expressed Sequence Tag)
• Single-pass reads, variable quality
• Need to search both EST and RNA data
Data Classes: PAT (Patent sequences)
• Often copies of existing entries
• Records not clean, even for taxonomy
Data Classes: STD (Standard)
• Bulk of entries
• Highest level of tracked information
Data Classes: TPA (Third Party Annotation)
• Derived data entries
  • e.g. patch genomic and RNA data to construct complete coverage
• Must have publication
• Must show which entries data is derived from
Data Classes: TSA (Transcriptome Shotgun Assembly)
• Also derived data entries
• ESTs assembled to construct RNA
• Must show which EST/HTC entries data is derived from
Data Classes: WGS (Whole Genome Shotgun)
• Entries change over time (completely replaced)
• Raw WGS entries assembled into contigs: CON entries
Data Classes
How stable is the data?
Data is always changing:
• Assembly of sequences into larger fragments
• Deletion of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc…
Data Classes
How does assembly affect entries?
Example:
• WGS (Shotgun)
  • Fragments in separate entry
• CON (Constructed)
  • Join to make new CON entries
  • Old WGS entries archived
• STD (Standard)
  • Join into large STD entry (e.g. completed genome)
  • Add annotation
  • Old CON entries archived
Taxonomy
HUM  Human
MUS  Mouse
ROD  Rodent
MAM  Mammal
VRT  Vertebrate
FUN  Fungi
INV  Invertebrate
PLN  Plant
PHG  Phage
PRO  Prokaryote
VIR  Viral
Other:
ENV  Environmental
SYN  Synthetic
TGN  Transgenic
UNC  Unclassified
Taxonomy: ENV (Environmental)
• CAUTION: organism never isolated
• May BLAST sequence to assign putative organism
Taxonomy: SYN (Synthetic) and TGN (Transgenic)
• CAUTION: not consistently handled, variable quality
• Transgenics may be from multiple organisms
Taxonomy: UNC (Unclassified)
• Division primarily used by GenBank for PAT (patent) sequences
Taxonomy exclusion
Some species excluded from certain taxonomic ranges:
• VRT (Vertebrate) excludes human, mouse, rodent, mammal
• MAM (Mammal) excludes human, mouse, rodent
• ROD (Rodent) excludes mouse
Applies to:
• ftp files and
• sequence search tools
But not:
• ENA Browser
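The exclusion rules amount to assigning each entry to the most specific division that fits. A minimal sketch, assuming a hypothetical input format in which an organism is described by the set of groups it belongs to (the real databases work from the NCBI taxonomy, which is not modelled here):

```python
# Most specific division first; an entry gets the first division that matches.
DIVISION_PRIORITY = [
    ("human", "HUM"),
    ("mouse", "MUS"),
    ("rodent", "ROD"),
    ("mammal", "MAM"),
    ("vertebrate", "VRT"),
]


def taxonomic_division(groups):
    """Return the division code for a set of group names (hypothetical format)."""
    for group, division in DIVISION_PRIORITY:
        if group in groups:
            return division
    return None  # not a vertebrate: handled by other divisions, not modelled here


# A mouse is a rodent, a mammal and a vertebrate, but is filed under MUS only.
print(taxonomic_division({"mouse", "rodent", "mammal", "vertebrate"}))
```

This is why a search restricted to the VRT division will not return human or mouse entries.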
Taxonomy Database
Which taxonomy database does ENA use?
• All INSDC databases use the NCBI Taxonomy Browser
• Only organisms with sequence are represented
EBI Taxonomy Portal
• EBI-wide service maps resources into taxonomy service
• Culture collection – physical data, e.g. sample or stored version
• Biomaterial
• Specimen voucher
• Representation, e.g. picture
Database Structure
How does data organization differ from GenBank?
ENA-Annotation: data classes (con, est, gss, htc, htg, pat, sts, std, ...) crossed with taxonomic divisions (hum, mus, rod, mam, vrt, fun, ...)
• Data split into intersecting slices
• Reduces search set
• Ensures complete result set
GenBank: taxonomic divisions (hum, mus, rod, mam, vrt, fun, inv, pln, ...) alongside data classes (con, est, gss, htc, htg, pat, sts, std, ...)
• Data split into parallel slices
• Large search sets
• Classes incomplete for taxonomy
• Taxonomy incomplete for classes
Database Structure: example
• ‘Mouse’ + ‘EST’ intersection
  • small data set
  • ensured complete set of mouse ESTs
• ‘Mouse’ set
  • large data set
  • includes all mouse entries
• ‘EST’ set
  • large data set
  • includes all EST entries
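The difference between intersecting and whole slices can be illustrated with a toy data set. The accessions and the `select` helper below are invented for this sketch and do not correspond to real entries or to any EBI interface.

```python
# Hypothetical entries, each tagged with one taxonomic division and one data class.
ENTRIES = [
    {"acc": "M1", "division": "MUS", "data_class": "EST"},
    {"acc": "M2", "division": "MUS", "data_class": "STD"},
    {"acc": "M3", "division": "MUS", "data_class": "EST"},
    {"acc": "H1", "division": "HUM", "data_class": "EST"},
    {"acc": "H2", "division": "HUM", "data_class": "STD"},
]


def select(entries, division=None, data_class=None):
    """Return accessions matching the given division and/or data class."""
    return [
        e["acc"]
        for e in entries
        if (division is None or e["division"] == division)
        and (data_class is None or e["data_class"] == data_class)
    ]


print("Mouse set:      ", select(ENTRIES, division="MUS"))
print("EST set:        ", select(ENTRIES, data_class="EST"))
print("Mouse + EST set:", select(ENTRIES, division="MUS", data_class="EST"))
```

The intersection is smaller than either whole set yet still contains every mouse EST, which is the point made on the slide.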
Data – Protein Sequence
• UniProt databases:
  • UniProtKB: human curated and automatic translation sections
  • UniRef: non-redundant sequence clusters
  • UniParc: non-identical sequence archive
• Sequence from structures:
  • PDB
  • SGT
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA
  • Alternative splicing: ASD, ASTD
  • Completed proteomes: Ensembl, Integr8
  • Protein Interactions: IntAct
  • Patent Proteins: EPO, JPO, KIPO and USPTO
Sequence Databases
GenBank | RefSeq | UniProt
Nucleotide | Nucleotide/Protein | Protein
Not curated | Curated | Automatically annotated/curated
Author submits | NCBI creates from existing data | Entries created from large scale genomics, small scale cloning and peptide submissions
Only author can revise | NCBI revises as new data emerge | UniProt regularly revises entries
Multiple records for same loci common | Single records for each molecule of major organisms | An entry per protein including any isoforms (when curated)
Records can contradict each other | | ?
No limit to species included | Limited to model organisms | All species
Data exchanged among INSDC members | Exclusive NCBI database | International consortium, inc. leading figures in bioinformatics
Akin to primary literature | Akin to review articles | Akin to review articles, plus extensive cross-references to other DBs – almost a portal
Proteins identified and linked | Proteins and transcripts identified and linked | Links to supporting nucleotide data added to entries
Access via NCBI Nucleotide databases | Access via Nucleotide & Protein databases | UniProt.org – modern, multi-functional website with incorporated bioinformatics tools
Protein sequence: UniProt
UniProt
• Manual curation
• Literature-based annotation
• Sequence analysis
• Automated annotation
  • Transmembrane prediction
  • InterPro classification
  • Signal prediction
  • Other predictions
Some data sources for annotation:
• PRIDE: protein identification data
• GO: functional info
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
Protein classification
UniProt data sources and data flow
Database sources: patent data, WormBase, FlyBase, Sub/Peptide data, PDB, VEGA, Ensembl, RefSeq, INSDC (incl. WGS, Env.), proteome sets, IPI
• UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities. It provides a richly curated protein database.
• UniRef (UniRef 100, UniRef 90, UniRef 50): pre-computed clusters of similar proteins
• UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences
• UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)
• UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)
The Two Sides of UniProtKB
• UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation - reviewed
• UniProtKB/TrEMBL: redundant, automatically annotated - unreviewed
Databases
• Many databases and they are getting bigger
• Efficient searching involves knowledge of what is stored in these
• Don’t assume that everything in the databases is correct
• Nothing is constant; the data changes:
  • Deletions, sequence modifications
  • Daily updates, identifier changes, etc.
Searching databases
? What methods of searching databases do you know of?
? What is the difference between a primary and secondary database?
? What is the best protein sequence database to search (specific part)?
Searching
• Many ways of searching databases
• Annotation/title
  • Know something about your sequence
    • Gene name
    • Function
    • Accession
New search service
• Data organised according to: gene, expression, protein, structure, literature
• Species selector allows for easy comparison
• Explore data, return easily to your results
• Access from the EBI’s homepage
Database webpages
Database searching
Searching
• Many ways of searching databases
• Annotation/title
  • Know something about your sequence
    • Gene name
    • Function
    • Accession
• Raw data
  • Don’t know!
  • Or want to check...
  • Infer extra information
    • Homology?
    • Annotation?
    • Function?
Sequence alignment
• Relatively easy if we have an exact match
• ...but sequence is variable
  • Between individuals, species, location etc.
• That variability is useful data too!
• Need a search method that allows for some variability
• And even better – helps us assess that variability
Sequence alignment
Query: ACATAGGT
Sequence 1: TCATAGAT
Sequence 2: AAATTCTG
Lining the query up against each sequence and counting identical positions gives:
• Query vs sequence 1: score 6/8
• Query vs sequence 2: score 3/8
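The 6/8 and 3/8 scores are simply counts of identical positions in an ungapped, position-by-position comparison. A minimal sketch (the function name is chosen for this example):

```python
def identity_score(query, subject):
    """Count identical positions between two equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("this simple comparison needs equal-length sequences")
    matches = sum(q == s for q, s in zip(query, subject))
    return matches, len(query)


query = "ACATAGGT"
for name, subject in [("sequence 1", "TCATAGAT"), ("sequence 2", "AAATTCTG")]:
    matches, length = identity_score(query, subject)
    print(f"Query vs {name}: {matches}/{length}")
```

This only works when the sequences are the same length and need no gaps, which is why the later slides move on to proper alignment algorithms.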
Sequence alignment
atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt
atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc
cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca
Query:
1
2
Dot plot
• Maybe a dot plot will help
[Figure: dot plot grid with the query (A C A T A G ...) on one axis and sequence 1 (T C A T A G ...) on the other]
[Figure: dot plots of query vs sequence 1 and query vs sequence 2]
• We can see the difference, but how to turn that into something a computer can evaluate?
• Computers rely on algorithms which give them a score
• They can then compare scores
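A dot plot is easy to draw in text form: put one sequence along the top, the other down the side, and mark every cell where the two letters agree. This sketch is only an illustration of the idea; the function name and the `*` / `.` symbols are arbitrary choices.

```python
def dot_plot(query, subject):
    """Return a text dot plot: '*' where letters match, '.' elsewhere."""
    lines = ["  " + " ".join(query)]  # header row: the query along the top
    for s in subject:
        cells = " ".join("*" if s == q else "." for q in query)
        lines.append(s + " " + cells)
    return "\n".join(lines)


print(dot_plot("ACATAGGT", "TCATAGAT"))
print()
print(dot_plot("ACATAGGT", "AAATTCTG"))
```

Similar sequences show up as a long run of marks along the main diagonal, while unrelated sequences give only scattered marks.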
• Simple algorithm – penalise movement away from diagonal – gap penalty
[Figure: scoring grid in which moves along the diagonal score 0 and moves off the diagonal score -10]
Gap extend
• Having opened a gap, we should assign a lesser penalty to extending it
[Figure: scoring grid showing gap opening (-10) and gap extension (-0.5) penalties, with cells scoring -10 and -10.5]
• Actual implementation is usually to apply gap extension penalty to every gap
Why a lesser gap extend penalty?
• Single block of insertions/deletions is more likely than multiple in/del events
Sequences: NVELKAET and NVDEATNFELKAET
Scattered gaps:
NV-ELKAET
NVDE--A-TNFELKAET
Single gap block:
NV------ELKAET
NVDEATNFELKAET
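The effect of a separate gap-open and gap-extend penalty can be checked with a little arithmetic. The sketch below assumes a gap of length n costs open + n × extend (one reading of "apply gap extension penalty to every gap", and consistent with the -10.5 shown for a single gap position); other programs may count the first position differently. The scattered-gap string is a made-up example with the same total number of gap characters as the single block.

```python
from itertools import groupby


def gap_lengths(aligned):
    """Lengths of each run of '-' in one row of an alignment."""
    return [len(list(run)) for char, run in groupby(aligned) if char == "-"]


def total_gap_penalty(aligned, gap_open=10.0, gap_extend=0.5):
    """Assumed model: every gap costs gap_open plus gap_extend per position."""
    return sum(gap_open + gap_extend * n for n in gap_lengths(aligned))


single_block = "NV------ELKAET"   # one gap of length 6 (from the slide)
scattered = "NV-EL--KA---ET"      # hypothetical: three gaps of length 1, 2, 3

print(total_gap_penalty(single_block))  # 10 + 6 * 0.5
print(total_gap_penalty(scattered))     # 3 * 10 + 6 * 0.5
```

With these penalties the single block is penalised far less than the scattered gaps, so the algorithm prefers the biologically more plausible single insertion/deletion event.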
Match/mismatch
• Of course, we need to tell the algorithm that matching letters are better than mismatches too
• This is done via a scoring matrix
     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5
• Putting the two together gives us a scoring mechanism
[Figure: worked scoring matrix for the sequences T A C A and C A, combining the match/mismatch matrix (5 / -4) with the gap penalties (-10 open, -0.5 extend)]
• To pick the optimal alignment, start at the end and trace back the highest scoring route.
[Figure: the same scoring matrix for T A C A and C A, with the highest scoring route traced back from the end]
Needleman-Wunsch
• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!
  • An example of dynamic programming
• Comparing the full length of both sequences is called a global-global or just global alignment
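The steps built up over the previous slides (fill a score matrix, then trace back from the end) can be sketched as follows. To keep the example short it uses a single linear gap penalty of -10 per gap position rather than the separate open/extend penalties from the slides, together with the 5 / -4 match/mismatch scores; it is a teaching sketch, not the EMBOSS implementation.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty (simplified sketch)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a aligned to gaps only
        score[i][0] = i * gap
    for j in range(1, m + 1):          # first row: b aligned to gaps only
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal: align two letters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell along a highest scoring route.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACATAGGT", "TCATAGAT"))
print(needleman_wunsch("TACA", "CA"))
```

Every cell of the matrix is filled exactly once, and each cell reuses the scores of its three neighbours; that reuse is what makes this dynamic programming.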
Global vs Local
• But global-global might not be suitable for sequences that are very different lengths
• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.
  • Sets negative scores in matrix to 0, and allows trace back to end and restart
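The change that turns the global algorithm into a local one is small: no cell is allowed to go below zero, and the best score is taken from anywhere in the matrix rather than from the final cell. A minimal sketch that returns only the best local score, again assuming a simple linear gap penalty and the 5 / -4 scores (the example sequences are invented):

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-10):
    """Best local alignment score with a linear gap penalty (sketch)."""
    best = 0
    previous = [0] * (len(b) + 1)      # row above the current one
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                      # negative scores are reset to 0
                previous[j - 1] + s,    # extend the local alignment
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            current.append(cell)
            best = max(best, cell)      # the best cell can be anywhere
        previous = current
    return best


# The shared region CATAG is found even though the flanks do not match.
print(smith_waterman_score("TTTTCATAGTTTT", "GGCATAGGG"))
```

A full implementation would also keep the whole matrix and trace back from the best cell until it reaches a zero.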
QUESTION: Global vs Local - which is which?
A T G T A T A C G C
- A G T A T A - G C
A - T G T A T A C G C
A G T A T A - - - G C
LOCAL GLOBAL
Scoring
• Parameters so far:
  • Match/mismatch
  • Gap opening
  • Gap extending
• Can we improve it?
Substitutions
• Some substitutions are more likely than others
• DNA:
  • Purines (A, G) – dual ring
  • Pyrimidines (C, T) – single ring
• Substitutions of the same type are called transitions, whereas exchanging one type for the other is called a transversion
• Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
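As an illustration of how this could feed into a scoring matrix, the sketch below classifies each substitution and builds a DNA matrix in which transitions are penalised less than transversions. The score values (5, -1, -4) are hypothetical and chosen only to show the ordering; they are not taken from any particular program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of DNA bases as match, transition or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def build_matrix(match=5, transition=-1, transversion=-4):
    """Hypothetical scores: transitions are penalised less than transversions."""
    scores = {"match": match, "transition": transition, "transversion": transversion}
    return {(x, y): scores[substitution_type(x, y)] for x in "ACGT" for y in "ACGT"}


matrix = build_matrix()
print("     " + "   ".join("ACGT"))
for x in "ACGT":
    print(x, " ".join(f"{matrix[(x, y)]:3d}" for y in "ACGT"))
```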
Proteins
• What about proteins?
Protein substitution matrices
• Can look at closely related proteins to determine substitution rates
• Two most commonly used models:
  • BLOSUM
  • PAM
BLOSUM
• Blocks of Amino Acid Substitution Matrix
• Align conserved regions of evolutionary divergent sequences clustered at a given % identity
• Count relative frequencies of amino acids and substitution probability
• Turn that into a matrix where the more positive a substitution score is, the more likely that substitution is to be found, and the more negative, the less likely.
• Higher BLOSUM number = more closely related
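The step from counted frequencies to matrix entries is usually described as a log-odds score: compare how often a pair is observed in the aligned blocks with how often it would be expected by chance. The sketch below assumes the commonly quoted form, 2 × log2(observed / expected) rounded to an integer, and uses invented frequencies purely for illustration.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, same_residue):
    """Log-odds score in half-bit units (assumed form; frequencies are inputs)."""
    # Expected frequency of the pair if residues paired up at random.
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return round(2 * math.log2(observed_pair_freq / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.04, 0.1, 0.1, same_residue=True))    # seen more than chance
print(log_odds_score(0.02, 0.1, 0.1, same_residue=False))   # seen as often as chance
print(log_odds_score(0.005, 0.1, 0.1, same_residue=False))  # seen less than chance
```

Pairs observed more often than chance get positive scores, pairs observed less often get negative scores, which is the behaviour described in the bullet points.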
PAM
• Point Accepted Mutation
• Observed mutations in a set of closely related proteins
• Markov chain model created to describe substitutions
• Normalised so that PAM1 = 1 mutation per 100 amino acids
• Extrapolate matrices from model
• Higher PAM number = less closely related
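Extrapolating from the Markov model means multiplying the PAM1 transition matrix by itself: n steps of the chain correspond to PAM n. The sketch below uses a made-up two-letter alphabet so the matrix fits on one line; a real PAM1 matrix is 20 × 20 and its values come from the observed mutation data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_pow(matrix, n):
    """Raise a transition matrix to the n-th power by repeated multiplication."""
    result = matrix
    for _ in range(n - 1):
        result = mat_mul(result, matrix)
    return result


# Toy PAM1: each residue has a 1% chance of changing per step (1 per 100).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

for n in (1, 100, 250):
    stay = mat_pow(toy_pam1, n)[0][0]
    print(f"toy PAM{n}: probability a residue is unchanged = {stay:.3f}")
```

The probability of a residue staying the same falls as n grows, which is why higher PAM numbers suit less closely related sequences.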
PAM 250
Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence
[Figure panels: PAM 10, 100, 200, 300, 400 and 500]
Comparable matrices, from more divergent to less divergent:
• BLOSUM 45 / PAM 250 (more divergent)
• BLOSUM 62 / PAM 160
• BLOSUM 90 / PAM 100 (less divergent)
Scoring
• Parameters:
  • Match/mismatch
  • Gap opening
  • Gap extending
  • Substitution matrix
Dynamic programming alignments at the EBI
• EMBOSS Pairwise Alignment Algorithms
  • European Molecular Biology Open Software Suite
  • Suite of useful tools for molecular biology
  • Command line based
  • Designed to be used as part of scripts/chained programs
• We implement selected tools to provide web-based access
Where to find at the EBI?
http://www.ebi.ac.uk/Tools/sequence.html
Or...
Where to find at the EBI?
EMBOSS align tools
• Global alignment: Needle
• Local alignment: Water
[Screenshot: EMBOSS alignment form with callouts for program selection, parameters, sequence input and Submit!]
Key
- Gap
: Positive match
. Negative match
| Identity
Pairwise Alignments - Example sequences
Pairwise_align1.fsa
Pairwise_align2.fsa
Pages 25-30 in full booklet: Questions 7-10
www.ebi.ac.uk/~watson/africa
Dynamic programming sequence search methods at the EBI
• Global alignment: GGSEARCH
• Local alignment: SSEARCH
• Global query vs local database: GLSEARCH
Where to find at the EBI?
www.ebi.ac.uk/Tools/sss/
Or...
Similarity search
[Screenshot: similarity search form with callouts for database selection, sequence input, parameters and Submit!]
• Dynamic programming methods are rigorous and guarantee an optimal result
• But have to store the matrix of both sequences in memory
• And evaluate each position of the matrix
• Predictably, this makes them slow and demanding when you are aligning large sequences
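The cost described above is easy to see in code. The sketch below fills the full dynamic programming matrix for a global alignment score in the Needleman-Wunsch style; it assumes a flat match/mismatch score and a single linear gap penalty (illustrative values), and it returns only the score, not the alignment itself.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t with a linear gap penalty."""
    rows, cols = len(s) + 1, len(t) + 1
    # The whole (len(s)+1) x (len(t)+1) matrix is held in memory.
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for j in range(1, cols):
        matrix[0][j] = j * gap
    # Every cell is evaluated once, so the work grows as len(s) * len(t).
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = matrix[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = matrix[i - 1][j] + gap      # gap in t
            left = matrix[i][j - 1] + gap    # gap in s
            matrix[i][j] = max(diagonal, up, left)
    return matrix[-1][-1]
```

Two sequences of 10,000 residues each would need a matrix of about 100 million cells, which is the slowness and memory demand the bullets refer to.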
Heuristics
• Therefore we need methods of estimating alignments
  • Estimation methods are called heuristics
• Try and take short cuts in an intelligent manner
  • Speed up the search
  • At the possible expense of accuracy
• Accuracy in sequence searches is important for:
  • Aligning the right bits
  • Scoring the alignment correctly
  • Identifying similar sequences - sensitivity
• Going back to our dot plot
• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.
• Of course, we have to identify likely regions – not all alignments will be as nice as that one!
• This is the method used by FASTA
  • W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
FASTA – step 1
• Identify runs of identical sequence and pick regions with highest density of runs
Ktup parameter: how many consecutive identities before considered a ‘run’. Also called ‘word size’.
Increase Ktup = faster, but less sensitive
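A rough sketch of this first step: index every ktup-long word of the query, look up the words of the database sequence, and count hits per diagonal. The function name `diagonal_hits` is hypothetical and this is not the actual FASTA implementation, only an illustration of how runs of identities cluster on diagonals of the dot plot.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count exact ktup-word matches on each diagonal (query pos - subject pos)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        # .get avoids creating empty entries for words absent from the query.
        for i in positions.get(subject[j:j + ktup], ()):
            hits[i - j] += 1
    return hits
```

The diagonals with the most hits are the "regions with highest density of runs"; a larger `ktup` produces fewer hits to process, hence faster but less sensitive.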
FASTA – step 2
• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores
Parameter: Substitution matrix
FASTA – step 3
• Discard regions too far from the highest scoring region
Joining threshold: Internally determined
FASTA – step 4
• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions
Parameters: Gap open, Gap extend, Substitution matrix
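Restricting dynamic programming to a narrow band can be sketched as follows. This is a simplified illustration: the band is centred on the main diagonal (FASTA centres it on the best-scoring region), a flat match/mismatch score replaces the substitution matrix, the gap penalty is linear, and `banded_global_score` is a hypothetical name.

```python
def banded_global_score(s, t, band=2, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells within `band` of the main diagonal."""
    neg_inf = float("-inf")
    rows, cols = len(s) + 1, len(t) + 1
    matrix = [[neg_inf] * cols for _ in range(rows)]
    matrix[0][0] = 0
    for i in range(rows):
        # Only cells with |i - j| <= band are evaluated; the rest stay at -inf.
        for j in range(max(0, i - band), min(cols, i + band + 1)):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                step = match if s[i - 1] == t[j - 1] else mismatch
                best = max(best, matrix[i - 1][j - 1] + step)
            if i > 0:
                best = max(best, matrix[i - 1][j] + gap)
            if j > 0:
                best = max(best, matrix[i][j - 1] + gap)
            matrix[i][j] = best
    return matrix[-1][-1]
```

The number of evaluated cells grows with the sequence length times the band width instead of the product of both lengths. If the true alignment leaves the band, the result is no longer optimal, which is the accuracy trade-off of the heuristic.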
FASTA
• Repeat against all sequences in the database
FASTA – programs available at EBI
• FASTA: “a fast approximation to Smith & Waterman”
  • FASTA – scan a protein or DNA sequence library for similar sequences.
  • FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.
  • TFASTX/Y – compare a protein sequence to a translated DNA data bank.
  • FASTF – compares ordered peptides (Edman degradation) to a protein databank.
  • FASTS – compares unordered peptides (Mass Spec.) to a protein databank.
  • SSEARCH – Rigorous scan of protein or DNA sequence library (S&W Algorithm).
Where to find at the EBI?
www.ebi.ac.uk/Tools/sss/
Or...
Similarity search
[Screenshot: similarity search form with callouts for database selection, sequence input, parameters and Submit!]
FASTA - results
Key
- Gap
: Identity
. Similarity
X Filtered
Using FASTA - Example sequence
www.ebi.ac.uk/~watson/africa
test_prot.fasta
Pages 37-46 in full booklet: Questions 11-14
BLAST – Basic Local Alignment Search Tool
• Instead of narrowing the dynamic programming search space, BLAST works a different way
• First, it creates a list of words made up of both the exact query sequence and high-scoring substitutions
Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
BLAST – step 1
• w=3
SEWRFKHIYRGQPRRHLLTTGWSTFVT
SEW
EWR
WRF
Parameter: Word length (w)
Increase = faster, but less sensitive
BLAST – step 1(cont.d)
• w=3
• T=13
SEWRFKHIYRGQPRRHLLTTGWSTFVT
GQP 18
GEP 15
GRP 14
GKP 14
GNP 13
GDP 13
AQP 12
NQP 12
Parameters: Neighbourhood threshold (T), Substitution matrix
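The word list and neighbourhood threshold can be sketched in a few lines. This is an illustration under assumptions: `PAIR_SCORES` holds only the handful of residue-pair scores needed to reproduce the numbers on the slide (they are assumed to come from BLOSUM62), the candidate list is passed in rather than enumerated over all possible words, and the function names are hypothetical.

```python
def words(seq, w=3):
    """All overlapping words of length w in the query."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


# (query residue, candidate residue) -> score; a partial table for this example only.
PAIR_SCORES = {
    ("G", "G"): 6, ("Q", "Q"): 5, ("P", "P"): 7,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}


def word_score(query_word, candidate):
    """Sum of per-position substitution scores."""
    return sum(PAIR_SCORES[(a, b)] for a, b in zip(query_word, candidate))


def neighbourhood(query_word, candidates, threshold=13):
    """Keep candidate words scoring at least T against the query word."""
    return [c for c in candidates if word_score(query_word, c) >= threshold]
```

With T=13, the query word GQP keeps GQP, GEP, GRP, GKP, GNP and GDP, while AQP and NQP (both scoring 12) are dropped.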
BLAST – step 2
• Then it scans database sequences for exact matches with these words
BLAST – step 3
• If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount
• This results in a High-scoring Segment Pair (HSP)
Parameters: Drop off, Substitution matrix
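The "extend until the score drops by a certain amount" rule can be sketched as an ungapped extension to the right of a hit. This is a simplification: a flat match/mismatch score stands in for the substitution matrix, only one direction is extended, and `extend_right` and its default values are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, drop_off=5, match=1, mismatch=-1):
    """Extend an ungapped alignment rightwards from a hit.

    Returns (best score, length of the extension that achieved it).
    Extension stops once the running score falls `drop_off` below the best seen.
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score >= drop_off:
            break
        i += 1
    return best, best_len
```

The segment up to `best_len` is what would be reported as the high-scoring segment pair in this toy version.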
BLAST – step 4
• If the total HSP score is above another threshold then a gapped extension is initiated
Parameters: Extension threshold (Sg), Substitution matrix
BLAST
• The steps rule out many database sequences early on
• Large increase in speed
BLAST – programs available at the EBI
• Basic Local Alignment Search Tool
• NCBI-BLAST programs:
  • BLASTP – protein sequence vs. protein sequence library
  • BLASTN – nucleotide query vs. nucleotide database
  • BLASTX – translated DNA vs. protein sequence library
• WU-BLAST programs:
  • BLASTP – protein query vs. protein database
  • BLASTN – nucleotide query vs. nucleotide database
  • BLASTX – translated nucleotide query vs. protein database
  • TBLASTN – protein query vs. translated nucleotide database
  • TBLASTX – translated nucleotide query vs. translated nucleotide database
Combines several parameters into ‘sensitivity’ option
Using BLAST - Example sequence
test_prot.fasta
Pages 47-50 in full booklet: Questions 15-17
www.ebi.ac.uk/~watson/africa
Key
- Gap
[residue] Identity
+ Similarity
X Filtered
Differences between BLAST and FASTA
• BLAST
  • Fast
  • Good with proteins
  • Produces good local alignments + short global alignments
  • Produces HSP (reports internal matches in long sequences)
  • Might miss a potential alignment due to ruling out sequences early on in the process
  • Good at finding siblings
• FASTA
  • Not as fast as BLAST
  • Much better with DNA than BLASTN
  • Produces S&W alignments
  • Checks each possible alignment with database sequences
  • Good at finding cousins
When to use what?
[Figure: FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH placed by database size against query length]
When to use what?
[Figure: time to search for FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH against PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]
Homology and Similarity
Similarity
Homology
Unrelated!
Homology vs. Similarity
Homology:
• Presence of similar features because of common descent
• Cannot be observed directly, since the ancestors no longer exist
• Is inferred as a conclusion based on ‘similarity’
• Homology is like pregnancy: either one is or one isn’t! (Gribskov – 1999)
Similarity:
• Quantifies a ‘likeness’
• Uses statistics to determine the ‘significance’ of a similarity
• Statistically significant similar sequences are considered ‘homologous’
• So far, we’ve talked about scoring alignments
  • Direct function of the algorithm
• But what we want is to assign some kind of quality to that score
Score vs significance
A A A
A A A

A C A T A A G G C T
A T A C A A G C C T

High score    High significance
“Lies, damn lies, and statistics”
• Not just interested in score...
• ...But how likely we are to get that alignment by chance alone
• It is this ‘non-random’ alignment that implies homology
• Statistics are used to estimate this chance
E-value
• ‘Expect’ value
• Expected number of alignments scoring at least this well that would arise by chance (a count, not strictly a probability)
• Best measure of how good an alignment is
• Often used for ranking results by default
• Calculated in different ways for BLAST and FASTA
• Short query sequences are more likely to be found by chance so have higher E-values
• Affected by parameter values like gap penalties and substitution matrices
FASTA statistics
• Compares query sequence with every sequence in database
• As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance
FASTA - histogram
Key
* Predicted distribution of scores
= Observed distribution of scores
[Figure: FASTA histogram with the high scoring region highlighted]
BLAST statistics
• Main reason for speed is that it doesn’t compare query with lots of other sequences
• Therefore it pre-estimates statistical values using a random sequence model
“Appears to yield fairly accurate results”
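A pre-estimated random model of this kind is usually written in the Karlin-Altschul form, where the expected number of chance hits is E = K * m * n * exp(-lambda * S). The sketch below assumes that form; it is not taken from the slides. `lam` and `k` depend on the scoring matrix and gap penalties, and the values in the usage note are placeholders, not real BLAST parameters.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalised score that no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

With placeholder values `lam=0.3` and `k=0.1`, doubling the database length doubles the E-value for the same score, which is one reason to search the smallest suitable database.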
Search Guidelines
Search guidelines 1
• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)
• Then with translated DNA sequences (fastx, blastx)
• Search with DNA vs. DNA as the next resort
• And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!
Search guidelines 2
• Search the smallest database that is likely to contain the sequence(s) of interest
• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology
Search guidelines 3
• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence
  • Examine the histograms
  • Use programs such as prss3 to confirm the expectation values
  • Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() ~1.0
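The idea behind a shuffled search can be sketched with a helper that permutes the query while keeping its composition. `shuffled` is a hypothetical helper for preparing such a control by hand; it is not the shuffling code used inside FASTA or prss3.

```python
import random


def shuffled(seq, seed=None):
    """Return seq with its residues in random order (same length and composition)."""
    rng = random.Random(seed)  # a seed makes the control reproducible
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)
```

A shuffled query has no evolutionary relationship to anything in the database, so its best hit shows what a chance match looks like for a sequence of that length and composition.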
Search guidelines 4
• Consider searches with different gap penalties and other scoring matrices
  • Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences
  • Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)
  • Remember to change the gap penalty defaults!

MATRIX      open   ext.
BLOSUM50    -10    -2
BLOSUM62    -7     -1
BLOSUM80    -16    -4
PAM250      -10    -2
PAM120      -16    -4
Search guidelines 5
• Homology can be reliably inferred from statistically significant similarity
• But remember:
  • Orthologous sequences have similar functions
  • Paralogous sequences can acquire very different functional roles
• So further work might be needed to tease out details
Search guidelines 6
• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues
  • However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!
• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data
  • ClustalW
  • MUSCLE
  • T-Coffee
  • Kalign
  • MAFFT
  • Mview (available from EBI FASTA & BLAST services)
  • DBCLUSTAL (available from EBI BLAST services)
Advanced
• In general, the more information we can add to an alignment, the better the result
Conserved regions, Structural information, Motifs
[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
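A motif written this way maps directly onto a regular expression with one character class per variable position. The sketch below uses Python's standard `re` module; `find_motif` is a hypothetical helper and the test sequences are invented.

```python
import re

# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H as a regular expression.
MOTIF = re.compile(r"[RTD][DAQ][FEA]ATH")


def find_motif(protein):
    """Start positions (0-based) of every non-overlapping motif match."""
    return [m.start() for m in MOTIF.finditer(protein)]
```

For example, `find_motif("MKRDFATHG")` reports a match starting at position 2, while a sequence ending in ...ATG instead of ...ATH does not match.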
Conserved regions
• We can add a new ‘position’ parameter to the substitution matrix
We can even modify a normal search to generate a position specific scoring matrix, or PSSM
PSI-BLAST
Position Specific Iterative – BLAST:
1. Takes the result of a normal BLAST
2. Aligns them and generates profile of conserved positions
3. Uses this to weight scoring on next iteration
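Step 2 can be sketched as building a position specific scoring matrix from the aligned hits. This is an illustration under assumptions: the aligned sequences are ungapped and of equal length, the background frequency is uniform over 20 amino acids, a simple pseudocount avoids taking the log of zero, and `build_pssm` and `score_with_pssm` are hypothetical names. Real PSI-BLAST uses sequence weighting and more careful pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """One dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: an assumption
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the profile, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Residues that are conserved in a column get a high score only at that position, so the next iteration rewards matches where conservation matters most.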
PSI-BLAST
• By adding importance to conserved residues we might be able to find more distant sequences
• But iterate too far and we might be assigning importance where there is none
More sensitive
PHI-BLAST
• Pattern Hit Initiated-BLAST
• User provides a pattern alongside a protein
• Database hits must contain this pattern and show similarity to the rest of the sequence
• Results can initiate a PSI-BLAST search as well
PSI-SEARCH
• Smith-Waterman implementation (SSEARCH)
• But with iterative position specific scoring
Using PSI-BLAST - Example sequence
test_prot.fasta
Pages 52-55 in full booklet: Questions 18-20
www.ebi.ac.uk/~watson/africa
Problem Sequences
Short sequences
• What about short sequences?
• Depends on their nature:
  • Protein
    • Reduce word length and/or increase the E() value cut off
    • Use shallow matrices
  • DNA
    • Reduce the word length
    • Ignore gap penalties (force local alignments only)
• Use rigorous methods
• But ask what you are trying to do!
Low complexity regions
• Sometimes biologically relevant, but always likely to skew alignment scoring
• E.g. CA repeats, poly-A tails and Proline rich regions
Good Statistics:
The inset shows good correlation between the observed and expected numbers of scores.
This is the region of the histogram to look out for first when evaluating results.
Bad Statistics:
The inset shows bad correlation between the observed and expected scores in this search.
The spaces between the = and * symbols indicate this poor correlation.
One reason for this can be low complexity regions.
Low complexity regions
• Compensate by filtering sequence so these regions don’t contribute to scoring
  • Filters: seg, xnu, dust, CENSOR
• But check what you are filtering!
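The principle behind such filters can be sketched with a sliding window and Shannon entropy: windows with too little residue variety are replaced by X. This is not the seg, xnu or dust algorithm; the window size, entropy threshold and function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window whose entropy is below min_entropy with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)
```

Because windows overlap, a few residues next to a repeat can be masked as well, which is one concrete reason to check what you are filtering.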
Filtered:
Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.
Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.
Using Filters - Example sequence
Pages 56-57 in full booklet: Questions 21-22
Filtertest_seq.fsa
www.ebi.ac.uk/~watson/africa
Vector contamination
• You think you know what your sequence is...
• ...but the results are really confusing!
• Maybe you have vector contamination
• Search against known vectors to check
Vector Contamination - Example sequences
vectortest_seq1.fsa
vectortest_seq2.fsa
Page 57 in full booklet: Question 23
www.ebi.ac.uk/~watson/africa
Multiple Sequence Alignments
Uses of MSA
• Functional prediction
• Phylogeny
• Structural prediction
• Homology detection
• Protein analysis
• To distinguish between orthology and paralogy
• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)
• But this is too computationally intensive
Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                 *: : : * . : .: * : * : .

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                 . .:: *. : . : *. * . : .

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                 : : .: . .. . :
Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years

Time O(L^N)
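Both the weighted sum of pairs and the growth in running time can be sketched briefly. The functions below are illustrations: `weighted_sum_of_pairs` sums W[i][j] * D[i][j] over every pair once, and `exhaustive_time_seconds` assumes, from the figures in the table, that each added sequence multiplies the time by 150 starting from 1 second for two sequences. The names and the example matrices are hypothetical.

```python
def weighted_sum_of_pairs(distances, weights):
    """Sum of weights[i][j] * distances[i][j] over all pairs with j < i."""
    n = len(distances)
    return sum(weights[i][j] * distances[i][j]
               for i in range(1, n)
               for j in range(i))


def exhaustive_time_seconds(n_sequences, length=150, base_seconds=1.0):
    """Time for an exhaustive alignment if each extra sequence costs a factor of `length`."""
    return base_seconds * length ** (n_sequences - 2)
```

Four sequences already take 22,500 seconds (6.25 hours) under this assumption, which is why exhaustive multiple alignment is replaced by heuristics.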
• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)
• But this is too computationally intensive
• Therefore we have to use heuristics and progressive alignment methods
Clustal
• >60,000 citations
• Clustal1-Clustal4
  • 1988, Paul Sharp, Dublin
• Clustal V 1992
  • EMBL Heidelberg
  • Rainer Fuchs
  • Alan Bleasby
• Clustal W, Clustal X 1994-2005
  • Toby Gibson, EMBL, Heidelberg
  • Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 2006
  • University College Dublin
www.clustal.org
CLUSTAL
• Quick, pairwise alignment of all sequences
• Line up pairs, with the most similar first
• Fix the alignment between pairs and treat as one sequence
• Align your fixed pairs with each other
• Note, this is not a phylogram!
  • Only a guide tree for the alignment
ClustalW at the EBI
[Screenshot: ClustalW submission form with callouts for parameters, sequence input, Submit! and Help!]
Interactive – results in browser, deleted after 24 hours
Email – receive URL to results page, deleted after 7 days
Jalview
ClustalW
Advantages
• Fast
• Not too demanding
• Widely used
• Fine for most uses
Disadvantages
• Fixing of early alignments
  • Propagate errors
• Doesn’t search far
  • Local minima
• Compresses gaps
Use of Clustal/JalView - Example sequences
prot_MSA.fasta
Problem_MSA1.fsa
Problem_MSA2.fsa
Problem_MSA3.fsa
Problem_MSA4.fsa
Pages 59-66 in full booklet: Questions 24-28
www.ebi.ac.uk/~watson/africa
Other Tools
COFFEE
• Consistency based Objective Function For alignmEnt Evaluation
• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
COFFEE
• Library of reference pairwise alignments
  • For your given set of sequences
• Objective Function
  • Evaluates consistency between multiple alignment and the library of pairwise alignments
  • Use SAGA to optimise this function
• Weigh depending on quality of alignment
SAGA is another alignment method, using genetic algorithms
COFFEE
• More accurate than ClustalW
  • Much less prone to problems in early alignment stages
• VERY slow!
T-Coffee
• Tree-based COFFEE
• Heuristic approach to COFFEE
  • Gets rid of genetic algorithm portion
  • Uses progressive alignments
  • Changes algorithm based on number of sequences
T-Coffee
• Much faster than COFFEE
• Avoids some of ClustalW’s pitfalls
• Can take information from several data sources
• Still not that fast
• Can be very demanding of memory etc.
Others
• MUSCLE – Bob Edgar
  • Iterative/progressive alignment
  • Fast
  • Good for big alignments, proteins
• MAFFT
  • Iterative, based on Fast Fourier Transform
  • Fast and accurate
  • Good for huge alignments
• Kalign
  • Very fast, local-regions aligning
  • Good for very large numbers of alignments!
Which tool should I use?
Input data: Recommendation
• 2-100 sequences of typical protein length: MUSCLE, T-Coffee, MAFFT, ClustalW
• 100-500 sequences: MUSCLE, MAFFT
• >500 sequences: MUSCLE, KALIGN
• Small number of unusually long sequences: ClustalW
How to evaluate?
• Use a benchmark
  • BaliBASE
BaliBASE: Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics
• ICGEB Strasbourg
• 141 manual alignments using structures
• 5 sections
• core alignment regions marked
1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)
Benchmark pitfalls
• Benchmark dataset may not be representative
• Danger of over-training towards benchmark
• Goldman: Most MSAs have unrealistic gaps
  • Tend towards multiple, independent deletions
  • Insertions are rare
• Sequences shrink in length over evolution
  • No supporting evidence that this is the case
Solutions
• Use phylogenetic data to guide alignment
• Keep track of changes to ancestor sequences
• Don’t change them again so easily in descendants
PRANK
• Probabilistic Alignment Kit
• webPRANK
• Better suited for closely related sequences
• Ties between equally good solutions are broken at random
  • Avoids incorrect confidence in result
  • Means alignments might not be reproducible
• Alignments look quite different
  • Might look worse!
  • But gap patterns make sense
  • Gaps are good!
Comparing Alignments - Example sequences
prot_MSA.fasta
Pages 67-74 in full booklet: Questions 29-30
www.ebi.ac.uk/~watson/africa
Common problems with MSA
• Input format
  • FASTA format
  • Unique sequence identifiers
  • Include sequence!
• Job can’t be found
  • Interactive results deleted after 24hrs
  • Use email
• Consider other tool
Common mis-uses of MSA
• Performing a sequence assembly
  • Specialist type of MSA
  • Use other tools (Staden etc.)
• Aligning ESTs to a reference genome
  • Use EST2Genome
• Designing primers
  • Use primer tools (primer3 etc.)
• Aligning two sequences
  • Use a pairwise alignment tool!
Putting it all together
• EB-Eye search
• Sequence retrieval
• Sequence search
• Sequence retrieval
• Multiple sequence alignment
• Analysis
Final remarks
• Don’t assume a single tool will cater for all your needs
• DO change the parameters of the tools
• Remember where the tool excels and what its limitations are
• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)
• Crazy input will always give crazy results!
Getting Help
Getting Help
• Database documentation
• Frequently Asked Questions
  • http://www.ebi.ac.uk/help/faq.html
• 2can Support Portal
  • http://www.ebi.ac.uk/2can/
• EBI Support
  • http://www.ebi.ac.uk/support/
• Hands-on training programme
  • http://www.ebi.ac.uk/training/handson/
Thanks!
• Hamish McWilliam and Andrew Cowley
• Vicky Schneider
• Rodrigo Lopez
• EMBL-EBI
• SLING
• You!