EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Upload: damian-weaver

Post on 20-Jan-2016


TRANSCRIPT

Page 1: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

EBI is an Outstation of the European Molecular Biology Laboratory.

EBI Roadshow

James Watson, PhD

Senior Scientific Training Officer

EBI-EMBL

watson@ebi.ac.uk

Page 2:

Andrew Cowley

External Services, EMBL-EBI

Sequence Searching and Alignments

Page 3:

External Services

Sequence searching and alignments - Andrew Cowley 21/04/23 3

Andrew Cowley, Bioinformatics Trainer

Hamish McWilliam, Software engineer

Rodrigo Lopez, Head of External Services

+ many others!

Page 4:

Contents

• Sequence databases
• Database browsing tools
• Similarity searching and alignments
• Alignment basics
• Similarity searching tools
• More advanced tools
• Alignment tools
• Guidelines
• (slightly) More advanced tools
• Problem sequences

Page 5:

Materials

Presentations and tutorials can be found on the roadshow course page at the EBI

Data files for exercises can be found at:

www.ebi.ac.uk/~watson/africa

Page 6:

Data

• Simplistically, much of the data at the EBI can be thought of as a container with two parts:
  • One part is the raw data (e.g. sequence)
  • The other part is annotation on this data

Page 7:

Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
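The record above shows the "container" idea in practice: each annotation line starts with a two-letter line code (ID, AC, DE, OS, OC, ...), and XX lines are spacers. The following is a minimal sketch, not an official ENA parser, of grouping those lines by code in Python. The function name `group_by_line_code` and the shortened `RECORD` string are illustrative assumptions, and the sketch ignores sequence lines.

```python
from collections import defaultdict

# A shortened copy of the AJ131285 example record (annotation lines only).
RECORD = """\
ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
//
"""


def group_by_line_code(record: str) -> dict[str, list[str]]:
    """Group the lines of one flat-file record by their two-letter line code."""
    fields = defaultdict(list)
    for line in record.splitlines():
        code, text = line[:2], line[2:].strip()
        if code in ("XX", "//") or not code.strip():
            continue  # skip spacer lines, the terminator, and sequence lines
        fields[code].append(text)
    return dict(fields)


fields = group_by_line_code(RECORD)
accession = fields["AC"][0].rstrip(";")
description = " ".join(fields["DE"])
# The OC (classification) lines continue across several lines; join, then split on ';'.
lineage = [t.strip() for t in " ".join(fields["OC"]).rstrip(".").split(";") if t.strip()]

print(accession, "|", description)
print(" > ".join(lineage))
```

For real work an established parser is the safer choice, since feature-table (FT) lines with multi-line qualifiers need more care than this sketch gives them.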

Page 8:

Data - Nucleotide

• ENA/EMBL-Bank:
  • Release and updates
  • Divided into classes and divisions
  • Supplementary sets: EMBL-CDS, EMBL-MGA
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.
  • Alternative splicing: ASD, ASTD, etc.
  • Completed genomes: Ensembl, Integr8, etc.
  • Variation: HGVBase, dbSNP, etc.

Page 9:

What sequence data is submitted?

Individual sequencing

Diagram: individual scientists sequence an individual gene, add annotation, and make a submission.

Page 10:

High throughput sequencing

Diagram: chromosome, fragment, sequencing library, sequence reads, assemble sequence, annotation (example gene labels: cyp30, cyp309, insvcg343).

Page 11:

High throughput sequencing

Large-scale sequencing projects

Diagram: chromosome, fragment, sequencing library, sequence reads, assemble sequence, annotation (example gene labels: cyp30, cyp309, insvcg343), with submissions at several stages (e.g. whole genome shotgun).

Page 12:

What are primary sequence databases?

Diagram: individual scientists, large-scale sequencing projects and patent offices supply primary sequence data to a primary sequence database.

Primary sequence database:

• Original sequence data
• Experimental data
• Patent data
• Submitter-defined

Page 13:

How do primary and derived databases differ?

Diagram: individual scientists, large-scale sequencing projects and patent offices supply primary sequence data to a primary sequence database. Derived data (e.g. protein sequence) goes into a derived database.

Page 14:

Primary v. derived data

DNA sequence (submit): ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT

transcribe

Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

translate

Derived protein sequence: MRSDECCCAMSC
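The transcribe and translate steps on this slide can be sketched in a few lines of Python. This is an illustrative sketch under two assumptions: the input DNA is the coding strand, so transcription is a simple T to U substitution, and translation uses the standard genetic code. The names `transcribe`, `translate` and `CODON_TABLE` are not from the slides. Note that the fourth codon of the slide's mRNA, GAU, encodes aspartate (D), so the mRNA translates to MRSDECCCAMSC.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Derive the mRNA from a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


slide_mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(slide_mrna))
```

Because the protein is fully determined by the nucleotide sequence, it counts as derived data: it can be recomputed at any time from the primary entry.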

Page 15:

How do primary and derived databases differ?

Primary sequence database: redundant. If anything in a submission varies (e.g. source / submitter / sequence), it generates a new entry.

Derived database (derived data, e.g. protein sequence): may be non-redundant.

Page 16:

How do primary and derived databases differ?

Diagram labels: "data lost" and "regenerate data", shown with the primary sequence database and the derived database (derived data, e.g. protein sequence).

Page 17:

Primary nucleotide sequence databases

• GenBank (U.S.A.)
• DDBJ (Japan)
• ENA (Europe)

INSDC:

• International Nucleotide Sequence Database Collaboration
• Daily exchange of data

Submission can be made to any INSDC database

Page 18:

How is sequence data processed?

Sequence information:

• Reads
  • Sequence machine output (reads)
  • Quality scores
• Assembly
  • Fragmented sequence reads assembled into contigs mapped onto chromosomes
• Annotation
  • Functional information assigned to assembled regions

Page 19:

What type of sequence data is submitted?

Sequence information: Reads, Assembly, Annotation

Raw data

Input information:
• Sample
• Set-up
• Machine configuration

Output machine data:
• Sequence traces
• Reads
• Quality scores

Assembled sequences / Annotated sequence

Interpreted information:
• Assembly
• Mapping
• Functional annotation
• Sample information

Metagenomic data:
• Where originated

Page 20:

How does ENA store the data?

ENA: European Nucleotide Archive

Data submitted by large-scale sequencing projects, individual scientists and patent offices:

• Raw data
• Assembled sequences
• Annotated sequence

ENA components:

• ENA-Annotation (formerly EMBL-Bank)
• Sequence Read Archive (SRA)
• Trace Archive

Page 21:

How does ENA store the data?

Trace Archive:

• Trace sequence reads
• Capillary sequencing instruments

Sequence Read Archive (SRA):

• Intensity reads
• Next-generation sequencing instruments

Page 22:

INSDC Sequencing Projects

Can data be traced to an Institute?

Diagram: an institute or consortium produces database records (genomic, ESTs, shotgun, ...), which go through assembly & annotation towards a complete genome / metagenome (single organism / metagenomic study).

• Pulls information together
• Track projects
• Comparative analysis

Page 23:

Nucleotides: European Nucleotide Archive (ENA)

Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).

The ENA has a three-tiered data architecture.

It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).

Page 24:

Data Quality

Is the data cleaned up?

Validation of submitted data:

• Automatic quality checks
• Some manual inspection and curation

Errors can still exist in sequence and annotation

Page 25:

Database Structure

How is the data organized?

Data in ENA-Annotation is divided in 2 ways:

1) Data classes

• Type of data or
• Methodology used to obtain data
• Each entry belongs to one data class

2) Taxonomic Divisions

• Each entry belongs to one taxonomic division

Page 26:

Data Classes

CON: Constructed from sequence assemblies
EST: Expressed Sequence Tag (cDNA)
GSS: Genome Survey Sequence (high-throughput short sequence)
HTC: High-Throughput cDNA (unfinished)
HTG: High-Throughput Genome sequencing (unfinished)
MGA: Mass Genome Annotation
PAT: Patent sequences
STS: Sequence Tagged Site (short unique genomic sequences)
TPA: Third Party Annotation (re-annotated and re-assembled)
TSA: Transcriptome Shotgun Assembly (computational assembly)
WGS: Whole Genome Shotgun
STD: Standard (high quality annotated sequence)
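Both the data class and the taxonomic division of an entry are visible on its ID line, for example `STD; INV` in the AJ131285 record shown earlier in the deck. The sketch below splits that ID line into named fields and looks the class up in a table copied from this slide. The field order is read off that single example record and should be treated as an assumption; `parse_id_line` and the dictionary keys are illustrative names.

```python
# Data class codes and descriptions, as listed on the slide.
DATA_CLASSES = {
    "CON": "Constructed from sequence assemblies",
    "EST": "Expressed Sequence Tag (cDNA)",
    "GSS": "Genome Survey Sequence (high-throughput short sequence)",
    "HTC": "High-Throughput cDNA (unfinished)",
    "HTG": "High-Throughput Genome sequencing (unfinished)",
    "MGA": "Mass Genome Annotation",
    "PAT": "Patent sequences",
    "STS": "Sequence Tagged Site (short unique genomic sequences)",
    "TPA": "Third Party Annotation (re-annotated and re-assembled)",
    "TSA": "Transcriptome Shotgun Assembly (computational assembly)",
    "WGS": "Whole Genome Shotgun",
    "STD": "Standard (high quality annotated sequence)",
}


def parse_id_line(line: str) -> dict:
    """Split an ID line into fields (field order assumed from the example record)."""
    body = line[2:].strip().rstrip(".")
    accession, version, topology, mol_type, data_class, division, length = (
        part.strip() for part in body.split(";")
    )
    return {
        "accession": accession,
        "sequence_version": int(version.split()[1]),  # "SV 1" -> 1
        "topology": topology,
        "molecule_type": mol_type,
        "data_class": data_class,
        "division": division,
        "length_bp": int(length.split()[0]),  # "919 BP" -> 919
    }


entry = parse_id_line("ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.")
print(entry["data_class"], "=", DATA_CLASSES[entry["data_class"]])
print("division:", entry["division"])
```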

Page 27:

Data Classes: EST

• Single pass reads, variable quality
• Need to search both EST and RNA data

Page 28:

Data Classes

• Often copies of existing entries
• Records not clean, even for taxonomy

Page 29:

Data Classes: STD

• Bulk of entries
• Highest level of tracked information

Page 30:

Data Classes: TPA

• Derived data entries
  • e.g. patch genomic and RNA data to construct complete coverage
• Must have publication
• Must show which entries data is derived from

Page 31:

Data Classes: TSA

• Also derived data entries
• ESTs assembled to construct RNA
• Must show which EST/HTC entries data is derived from

Page 32:

Data Classes: WGS

• Entries change over time (completely replaced)
• Raw WGS entries are assembled into contigs (CON entries)

Page 33:

Data Classes

How stable is the data?

Data is always changing:

• Assembly of sequences into larger fragments
• Deletion of obsolete entries (i.e. once assembled)
• Sequence modifications
• Daily updates
• Identifier changes
• Corrections (databases can contain errors)
• etc.

Page 34:

Data Classes

How does assembly affect entries?

Example:

WGS (Shotgun)
• Fragments in separate entry

CON (Constructed)
• Join to make new CON entries
• Old WGS entries archived

STD (Standard)
• Join into large STD entry (e.g. completed genome)
• Add annotation
• Old CON entries archived

Page 35:

Taxonomy

HUM: Human
MUS: Mouse
MAM: Mammal
VRT: Vertebrate
FUN: Fungi
INV: Invertebrate
PLN: Plant
PHG: Phage
PRO: Prokaryote
ROD: Rodent
VIR: Viral

Other:

ENV: Environmental
SYN: Synthetic
TGN: Transgenic
UNC: Unclassified

Page 36:

Taxonomy: ENV (Environmental)

• CAUTION: organism never isolated
• May blast sequence to assign putative organism

Page 37:

Taxonomy: SYN (Synthetic) and TGN (Transgenic)

• CAUTION: not consistently handled, variable quality
• Transgenics may be from multiple organisms

Page 38:

Taxonomy: UNC (Unclassified)

• Division primarily used by GenBank for PAT (patent) sequences

Page 39:

Taxonomy exclusion

Some species are excluded from certain taxonomic ranges:

VRT (Vertebrate) excludes human, mouse, rodent, mammal
MAM (Mammal) excludes human, mouse, rodent
ROD (Rodent) excludes mouse

Applies to:
• ftp files and
• sequence search tools

But not:
• ENA Browser
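One way to read the exclusion rules is as a precedence order: an organism is filed under the most specific division it matches, so a mouse entry goes to MUS and is therefore absent from ROD, MAM and VRT. The sketch below illustrates that reading only; it is not how ENA actually assigns divisions. The group names and the function `assign_division` are hypothetical.

```python
from typing import Optional

# Most specific division first; an entry is filed under the first one that matches.
PRECEDENCE = [
    ("human", "HUM"),
    ("mouse", "MUS"),
    ("rodent", "ROD"),
    ("mammal", "MAM"),
    ("vertebrate", "VRT"),
]


def assign_division(groups: set[str]) -> Optional[str]:
    """Return the single division for an organism, given the groups it belongs to."""
    for group, division in PRECEDENCE:
        if group in groups:
            return division
    return None  # not covered by this sketch (e.g. fungi, plants)


mouse = {"vertebrate", "mammal", "rodent", "mouse"}
rat = {"vertebrate", "mammal", "rodent"}
dog = {"vertebrate", "mammal"}
zebrafish = {"vertebrate"}

for name, groups in [("mouse", mouse), ("rat", rat), ("dog", dog), ("zebrafish", zebrafish)]:
    print(name, "->", assign_division(groups))
```

The practical consequence for searching: to cover all mammals in the ftp files or the sequence search tools, the HUM, MUS, ROD and MAM divisions all have to be selected.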

Page 40:

Taxonomy Database

Which taxonomy database does ENA use?

All INSDC databases use the NCBI Taxonomy Browser

Only organisms with sequence are represented

EBI Taxonomy Portal

• EBI-wide service maps resources into taxonomy service
• Culture collection: physical data, e.g. sample or stored version
• Biomaterial
• Specimen voucher: representation, e.g. picture

Page 41:

Database Structure

How does data organization differ from GenBank?

ENA-Annotation:

• Data split into intersecting slices
• Reduces search set
• Ensures complete result set

GenBank:

• Data split into parallel slices
• Large search sets
• Classes incomplete for taxonomy
• Taxonomy incomplete for classes

Data classes: con, est, gss, htc, htg, pat, sts, std ...

Taxonomic divisions: hum, mus, rod, mam, vrt, fun, inv, pln ...

Page 42:

Database Structure

How does data organization differ from GenBank?

ENA-Annotation: 'Mouse' + 'EST' intersection

• small data set
• ensured complete set of mouse ESTs

GenBank: 'Mouse' set

• large data set
• includes all mouse entries

GenBank: 'EST' set

• large data set
• includes all EST entries
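The 'Mouse' + 'EST' example can be sketched with Python sets: when every entry carries both a data class and a taxonomic division, the intersection of the two slices is small and still complete. The entries and accessions below are made up for illustration, and `select` is a hypothetical helper, not an ENA tool.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    accession: str   # made-up identifiers, for illustration only
    data_class: str
    division: str


ENTRIES = [
    Entry("ENTRY1", "EST", "MUS"),
    Entry("ENTRY2", "EST", "HUM"),
    Entry("ENTRY3", "STD", "MUS"),
    Entry("ENTRY4", "EST", "MUS"),
    Entry("ENTRY5", "WGS", "PLN"),
]


def select(entries, data_class: Optional[str] = None, division: Optional[str] = None) -> set:
    """Return the accessions matching the given class and/or division."""
    return {
        e.accession
        for e in entries
        if (data_class is None or e.data_class == data_class)
        and (division is None or e.division == division)
    }


all_mouse = select(ENTRIES, division="MUS")   # the larger 'Mouse' set
all_est = select(ENTRIES, data_class="EST")   # the larger 'EST' set
mouse_est = select(ENTRIES, data_class="EST", division="MUS")

# The intersecting slice is exactly the overlap of the two larger sets.
assert mouse_est == all_mouse & all_est
print(sorted(mouse_est))
```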

Page 43:

Data - Protein Sequence

• UniProt databases:
  • UniProtKB: human curated and automatic translation sections
  • UniRef: non-redundant sequence clusters
  • UniParc: non-identical sequence archive
• Sequence from structures:
  • PDB
  • SGT
• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA
  • Alternative splicing: ASD, ASTD
  • Completed proteomes: Ensembl, Integr8
  • Protein Interactions: IntAct
  • Patent Proteins: EPO, JPO, KIPO and USPTO

Page 44:

Sequence Databases

GenBank (Nucleotide) | Nucleotide/Protein | Protein
Not curated | Curated | Automatically annotated/curated
Author submits | NCBI creates from existing data | Entries created from large scale genomics, small scale cloning and peptide submissions
Only author can revise | NCBI revises as new data emerge | UniProt regularly revises entries
Multiple records for same loci common | Single records for each molecule of major organisms | An entry per protein including any isoforms (when curated)
Records can contradict each other | | ?
No limit to species included | Limited to model organisms | All species
Data exchanged among INSDC members | Exclusive NCBI database | International consortium, incl. leading figures in bioinformatics
Akin to primary literature | Akin to review articles | Akin to review articles, plus extensive cross-references to other DBs; almost a portal
Proteins identified and linked | Proteins and transcripts identified and linked | Links to supporting nucleotide data added to entries
Access via NCBI Nucleotide databases | Access via Nucleotide & Protein databases | UniProt.org: modern, multi-functional website with incorporated bioinformatics tools

Page 45:

Protein sequence: UniProt

UniProt

• Manual curation
• Literature-based annotation
• Sequence analysis
• Automated annotation

Some data sources for annotation:

• PRIDE: Protein identification data
• GO: Functional info
• InterPro: Protein families and domains
• IntAct: Molecular interactions
• IntEnz: Enzymes
• HAMAP: Microbial protein families
• RESID: Post-translational modifications

Predictions and classification:

• Transmembrane prediction
• InterPro classification
• Signal prediction
• Other predictions
• Protein classification

Page 46:

UniProt data sources and data flow

Database sources: INSDC (incl. WGS, Env.), RefSeq, Ensembl, VEGA, PDB, Sub/Peptide Data, FlyBase, WormBase, Patent Data

Other diagram labels: Proteome Sets, IPI

UniRef (UniRef 100, UniRef 90, UniRef 50): Pre-computed clusters of similar proteins

UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)

UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences

UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)

UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities. It provides a richly curated protein database.

Page 47: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

The Two Sides of UniProtKB

UniProtKB/Swiss-Prot: Non-redundant, high-quality manual annotation (reviewed)

UniProtKB/TrEMBL: Redundant, automatically annotated (unreviewed)

Page 48: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Databases

• Many databases and they are getting bigger
• Efficient searching involves knowledge of what is stored in these
• Don’t assume that everything in the databases is correct
• Nothing is constant, but changes...
  • Deletions, sequence modifications
  • Daily updates, identifier changes, etc.

Sequence searching and alignments - Andrew Cowley21/04/2348

Page 49: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Searching databases

Sequence searching and alignments - Andrew Cowley21/04/2349

Page 50: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

?What methods of searching databases do you know of?

Sequence searching and alignments - Andrew Cowley21/04/2350

?What is the difference between a primary and secondary database?

?What is the best protein sequence database to search(specific part)?

Page 51: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Searching

• Many ways of searching databases
• Annotation/title
  • Know something about your sequence
    • Gene name
    • Function
    • Accession

Sequence searching and alignments - Andrew Cowley21/04/2351

Page 52: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

New search service

Data organised according to:
• gene
• expression
• protein
• structure
• literature

Species selector allows for easy comparison

Explore data, return easily to your results

Access from the EBI’s homepage

Page 53: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Database webpages

Sequence searching and alignments - Andrew Cowley21/04/2353

Page 54: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Database searching

Sequence searching and alignments - Andrew Cowley21/04/2354

Page 55: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Searching

• Many ways of searching databases
• Annotation/title
  • Know something about your sequence
    • Gene name
    • Function
    • Accession
• Raw data
  • Don’t know!
  • Or want to check...
  • Infer extra information
    • Homology?
    • Annotation?
    • Function?

Sequence searching and alignments - Andrew Cowley21/04/2355

Page 56: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence alignment

• Relatively easy if we have an exact match
• ...But sequence is variable
  • Between individuals, species, location etc.

• That variability is useful data too!
• Need a search method that allows for some variability
• And even better: one that helps us assess that variability

Sequence searching and alignments - Andrew Cowley21/04/2356

Page 59: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence alignment

Sequence searching and alignments - Andrew Cowley21/04/2359

Query: ACATAGGT

Sequence 1: TCATAGAT (score: 6/8)
Sequence 2: AAATTCTG (score: 3/8)
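
The 6/8 and 3/8 scores on this slide can be reproduced by counting identical positions between the query and each sequence. A minimal sketch, assuming both sequences have the same length and no gaps are allowed (the function name `identity_score` is illustrative, not part of any EBI tool):

```python
def identity_score(query, subject):
    """Count positions where two equal-length sequences carry the same letter."""
    if len(query) != len(subject):
        raise ValueError("this simple score needs sequences of equal length")
    return sum(1 for q, s in zip(query, subject) if q == s)


query = "ACATAGGT"
for name, seq in [("Sequence 1", "TCATAGAT"), ("Sequence 2", "AAATTCTG")]:
    print(f"{name}: {identity_score(query, seq)}/{len(query)}")
```

This only works for short, ungapped examples; the following slides build up a scoring scheme that copes with insertions and deletions.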

Page 60: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence alignment

Sequence searching and alignments - Andrew Cowley21/04/2360

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca

Query:

1

2

Page 61: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Dot plot

• Maybe a dot plot will help

Sequence searching and alignments - Andrew Cowley21/04/2361

Query

Sequence 1

A C A T A G

GATACT
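
A dot plot can be sketched in a few lines: put the query along one axis, the other sequence along the second axis, and mark every cell where the two letters match. The function below is an illustrative assumption (plain text output, no window or filtering), not the plotting tool used for the slides:

```python
def dot_plot(query, subject):
    """Return one text row per subject letter: '*' for a match, '.' otherwise."""
    return ["".join("*" if q == s else "." for q in query) for s in subject]


query, subject = "ACATAGGT", "TCATAGAT"
print("  " + query)
for letter, row in zip(subject, dot_plot(query, subject)):
    print(letter, row)
```

Related sequences show up as a run of marks along a diagonal; unrelated sequences give only scattered marks.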

Page 62: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Dot plot

Sequence searching and alignments - Andrew Cowley21/04/2362

Query vs Sequence 1 Query vs Sequence 2

Query Query

1 2

Page 63: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• We can see the difference, but how to turn that into something a computer can evaluate?

• Computers rely on algorithms which give them a score

• They can then compare scores

Sequence searching and alignments - Andrew Cowley21/04/2363

Page 64: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Simple algorithm – penalise movement away from diagonal – gap penalty

Sequence searching and alignments - Andrew Cowley21/04/2364

[Figure: dot-plot matrix. Moves along the diagonal score 0; moves away from the diagonal score -10 (gap penalty).]

Page 65: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Having opened a gap, we should assign a lesser penalty to extending it

[Figure: the same matrix with a gap open penalty of -10 and a gap extend penalty of -0.5; cells show the values 0, -10, -0.5 and -10.5.]

Actual implementation is usually to apply gap extension penalty to every gap

Page 66: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Why a lesser gap extend penalty?

• Single block of insertions/deletions is more likely than multiple in/del events

Sequence searching and alignments - Andrew Cowley21/04/2366

NVELKAET NVDEATNFELKAET

NV-ELKAET

NVDE--A-TNFELKAET

NV------ELKAET

NVDEATNFELKAET
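
The effect of a lesser gap extend penalty can be checked with a little arithmetic. The sketch below assumes the penalties used on the earlier slides (gap open 10, gap extend 0.5) and assumes the extension penalty is charged for every gap position, including the first. Aligning NVELKAET to NVDEATNFELKAET needs six gap positions; the split into three separate gaps of 2, 1 and 3 positions is a hypothetical example.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty for one contiguous gap of `length` positions.

    Assumption: the extension penalty applies to every gap position,
    so a gap of length 1 costs gap_open + gap_extend.
    """
    if length <= 0:
        return 0.0
    return -(gap_open + gap_extend * length)


one_block = gap_penalty(6)                                    # one gap of 6
three_events = gap_penalty(2) + gap_penalty(1) + gap_penalty(3)
print(one_block, three_events)
```

The single block is penalised far less than the three separate gaps, which matches the idea that one insertion/deletion event is more likely than several.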

Page 67: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Match/mismatch

• Of course, we need to tell the algorithm that matching letters are better than mismatches too

• This is done via a scoring matrix

Sequence searching and alignments - Andrew Cowley21/04/2367

     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5

Page 68: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Putting the two together gives us a scoring mechanism

Sequence searching and alignments - Andrew Cowley21/04/2368

[Figure: dynamic programming matrix for the letters T, A, C, A against C, A. Each cell combines the scoring matrix (match 5, mismatch -4) with the gap penalties (gap open -10, gap extend -0.5).]

     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5

Page 69: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• To pick the optimal alignment, start at the end and trace back the highest scoring route.

Sequence searching and alignments - Andrew Cowley21/04/2369

[Figure: the same dynamic programming matrix (T, A, C, A against C, A), with the highest scoring route traced back from the end.]

Page 70: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Needleman-Wunsch

• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!
  • An example of dynamic programming

• Comparing the full length of both sequences is called a global-global or just global alignment
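
A minimal sketch of global alignment by dynamic programming is given below. It is a simplification, not the EMBOSS implementation: it assumes a single linear gap penalty of -10 per gap position (no separate gap extend), and match/mismatch scores of 5 and -4 as in the DNA matrix shown earlier. All names are illustrative.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty; returns (score, row_a, row_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):            # first column: a aligned only to gaps
        score[i][0] = i * gap
    for j in range(1, cols):            # first row: b aligned only to gaps
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the end, following a route that produced each score.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Every cell of the matrix is filled and the traceback always runs from the last cell to the first, which is what makes the alignment global.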

Sequence searching and alignments - Andrew Cowley21/04/2370

Page 71: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Global vs Local

• But global-global might not be suitable for sequences that are very different lengths

• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.
  • Sets negative scores in matrix to 0, and allows trace back to end and restart
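
The two changes that turn the global method into a local one can be sketched as follows: scores are never allowed to fall below 0, and the traceback starts at the highest scoring cell and stops when it reaches a 0. As before this is a simplified illustration with a linear gap penalty and assumed scores (match 5, mismatch -4, gap -10), not the code behind any EBI service.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=-10):
    """Local alignment with a linear gap penalty; returns (score, part_a, part_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                                 # never negative
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

Only the best matching region is reported, so unrelated flanking sequence does not drag the score down.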

Sequence searching and alignments - Andrew Cowley21/04/2371

Page 72: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

QUESTION: Global vs Local - which is which?

Sequence searching and alignments - Andrew Cowley21/04/2372

A T G T A T A C G C

- A G T A T A - G C

A - T G T A T A C G C

A G T A T A - - - G C

LOCAL GLOBAL

Page 73: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Scoring

• Parameters so far:
  • Match/mismatch
  • Gap opening
  • Gap extending

• Can we improve it?

Sequence searching and alignments - Andrew Cowley21/04/2373

Page 74: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Substitutions

• Some substitutions are more likely than others
• DNA:
  • Purines (A, G): dual ring
  • Pyrimidines (C, T): single ring

• Substitutions within the same type (purine to purine, or pyrimidine to pyrimidine) are called transitions, whereas exchanging a purine for a pyrimidine (or vice versa) is called a transversion

• Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
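
A scoring matrix that separates the two kinds of substitution could be built as in the sketch below. The score values (match 5, transition -1, transversion -4) are assumptions chosen for illustration; they are not taken from the slides or from a published matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=5, transition=-1, transversion=-4):
    """Score one DNA letter pair; transitions are penalised less than transversions."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


matrix = {x: {y: substitution_score(x, y) for y in "ACGT"} for x in "ACGT"}

print("    " + "   ".join("ACGT"))
for x in "ACGT":
    print(x, " ".join(f"{matrix[x][y]:3d}" for y in "ACGT"))
```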

Sequence searching and alignments - Andrew Cowley21/04/2374

Page 75: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/2375

Page 76: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Proteins

• What about proteins?

Sequence searching and alignments - Andrew Cowley21/04/2376

Page 77: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Protein substitution matrices

• Can look at closely related proteins to determine substitution rates

• Two most commonly used models:
  • BLOSUM
  • PAM

Sequence searching and alignments - Andrew Cowley21/04/2377

Page 78: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLOSUM

• Blocks of Amino Acid Substitution Matrix

• Align conserved regions of evolutionarily divergent sequences clustered at a given % identity

• Count relative frequencies of amino acids and substitution probability

• Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely.

• Higher BLOSUM number = more closely related

Sequence searching and alignments - Andrew Cowley21/04/2378

Page 79: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PAM

• Point Accepted Mutation
• Observed mutations in a set of closely related proteins
• Markov chain model created to describe substitutions
• Normalised so that PAM1 = 1 mutation per 100 amino acids
• Extrapolate matrices from model
• Higher PAM number = less closely related
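
The extrapolation step can be illustrated with a toy Markov chain: applying the PAM1 mutation probabilities n times corresponds to raising the PAM1 matrix to the power n. The two-letter alphabet and the 1% mutation probability below are hypothetical stand-ins for the real 20 x 20 amino acid matrix, and the sketch stops at probabilities (real PAM matrices are then converted into log-odds scores).

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1" for a two-letter alphabet: 1% chance of change per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam250)
```

After 250 steps the chance of seeing the original letter is much closer to the chance of seeing the other one, which is why a higher PAM number suits less closely related sequences.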

Sequence searching and alignments - Andrew Cowley21/04/2379

PAM 250

Page 80: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence

Sequence searching and alignments - Andrew Cowley21/04/2380

10 100 200

400 500300

Page 81: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/2381

[Figure: roughly equivalent matrices, from more divergent to less divergent]

• BLOSUM 45 / PAM 250 (more divergent)
• BLOSUM 62 / PAM 160
• BLOSUM 90 / PAM 100 (less divergent)

Page 82: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Scoring

• Parameters:
  • Match/mismatch
  • Gap opening
  • Gap extending
  • Substitution matrix

Sequence searching and alignments - Andrew Cowley21/04/2382

Page 83: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Dynamic programming alignments at the EBI

• EMBOSS Pairwise Alignment Algorithms
  • European Molecular Biology Open Software Suite
    • Suite of useful tools for molecular biology
    • Command line based
    • Designed to be used as part of scripts/chained programs

• We implement selected tools to provide web-based access

Sequence searching and alignments - Andrew Cowley21/04/2383

Page 84: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Where to find at the EBI?

Sequence searching and alignments - Andrew Cowley21/04/2384

http://www.ebi.ac.uk/Tools/sequence.html

Or...

Page 85: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Where to find at the EBI?

Sequence searching and alignments - Andrew Cowley21/04/2385

Page 86: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

EMBOSS align tools

• Global alignment: Needle

• Local alignment: Water

Page 87: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/2387

[Screenshot labels: Program selection, Sequence input, Parameters, Submit!]

Page 88: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/2388

Page 89: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/2389

Key

- Gap
: Positive match
. Negative match
| Identity

Page 90: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Pairwise Alignments - Example sequences

Sequence searching and alignments - Andrew Cowley21/04/2390

Pairwise_align1.fsa
Pairwise_align2.fsa

Pages 25-30 in full booklet: Questions 7-10

www.ebi.ac.uk/~watson/africa

Page 91: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Dynamic programming sequence search methods at the EBI

• Global alignment: GGSEARCH

• Local alignment: SSEARCH

• Global query vs local database: GLSEARCH

Page 92: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Where to find at the EBI?

Sequence searching and alignments - Andrew Cowley21/04/2392

www.ebi.ac.uk/Tools/sss/

Or...

Page 93: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Similarity search

Sequence searching and alignments - Andrew Cowley21/04/2393

[Screenshot labels: Database selection, Sequence input, Parameters, Submit!]

Page 94: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Dynamic programming methods are rigorous and guarantee an optimal result

• But have to store the matrix of both sequences in memory

• And evaluate each position of the matrix

• Predictably, this makes them slow and demanding when you are aligning large sequences

Sequence searching and alignments - Andrew Cowley21/04/2394

Page 95: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Heuristics

• Therefore we need methods of estimating alignments
  • Estimation methods are called heuristics

• Try and take short cuts in an intelligent manner
  • Speed up the search
  • At the possible expense of accuracy

• Accuracy in sequence searches is important for:
  • Aligning the right bits
  • Scoring the alignment correctly
  • Identifying similar sequences (sensitivity)

Sequence searching and alignments - Andrew Cowley21/04/2395

Page 96: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Going back to our dot plot

Sequence searching and alignments - Andrew Cowley21/04/2396

Page 97: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.

Sequence searching and alignments - Andrew Cowley21/04/2397

Page 98: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Of course, we have to identify likely regions – not all alignments will be as nice as that one!

• This is the method used by FASTA
  • W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

Sequence searching and alignments - Andrew Cowley21/04/2398

Page 99: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA – step 1

• Identify runs of identical sequence and pick regions with highest density of runs

Sequence searching and alignments - Andrew Cowley21/04/2399

Ktup parameter: How many consecutive identities before considered a ‘run’. Also called ‘word size’.

Increase Ktup = faster, but less sensitive
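
The idea behind this first step can be sketched by indexing every word of length Ktup in the query, looking each word of the database sequence up in that index, and counting the hits per diagonal. This is an illustrative simplification (function and variable names are assumptions), not the FASTA source code:

```python
from collections import defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count shared words of length ktup on each diagonal.

    The diagonal is identified by (subject offset - query offset).
    """
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return dict(diagonals)


print(ktup_diagonals("ACATAGGT", "TCATAGAT", ktup=2))
print(ktup_diagonals("ACATAGGT", "TCATAGAT", ktup=4))
```

With the larger Ktup fewer words have to match exactly, so there are fewer hits to process (faster), but weak similarities that never share a long enough run are missed (less sensitive).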

Page 100: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA – step 2

• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores

Sequence searching and alignments - Andrew Cowley21/04/23100

Parameter: Substitution matrix

Page 101: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA – step 3

• Discard regions too far from the highest scoring region

Sequence searching and alignments - Andrew Cowley21/04/23101

Joining threshold: Internally determined

Page 102: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA – step 4

• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions

Sequence searching and alignments - Andrew Cowley21/04/23102

Parameters: Gap open, Gap extend, Substitution matrix

Page 103: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA

• Repeat against all sequences in the database

Sequence searching and alignments - Andrew Cowley21/04/23103

Page 104: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA – programs available at EBI

• FASTA: “a fast approximation to Smith & Waterman”
  • FASTA: scans a protein or DNA sequence library for similar sequences.
  • FASTX/Y: compares a DNA sequence to a protein sequence database, translating the DNA sequence in forward or reverse frames.
  • TFASTX/Y: compares a protein sequence to a translated DNA data bank.
  • FASTF: compares ordered peptides (Edman degradation) to a protein databank.
  • FASTS: compares unordered peptides (Mass Spec.) to a protein databank.
  • SSEARCH: rigorous scan of protein or DNA sequence library (S&W algorithm).

Sequence searching and alignments - Andrew Cowley21/04/23104

Page 105: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Where to find at the EBI?

Sequence searching and alignments - Andrew Cowley21/04/23105

www.ebi.ac.uk/Tools/sss/

Or...

Page 106: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Similarity search

Sequence searching and alignments - Andrew Cowley21/04/23106

[Screenshot labels: Database selection, Sequence input, Parameters, Submit!]

Page 107: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA - results

Sequence searching and alignments - Andrew Cowley21/04/23107

Page 108: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA - results

Sequence searching and alignments - Andrew Cowley21/04/23108

Page 109: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA - results

Sequence searching and alignments - Andrew Cowley21/04/23109

Page 110: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA - results

Sequence searching and alignments - Andrew Cowley21/04/23110

Key

- Gap
: Identity
. Similarity
X Filtered

Page 111: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Using FASTA - Example sequence

Sequence searching and alignments - Andrew Cowley21/04/23111

www.ebi.ac.uk/~watson/africa

test_prot.fasta

Page 37-46 in full booklet: Questions 11-14

Page 112: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – Basic Local Alignment Search Tool

• Instead of narrowing the dynamic programming search space, BLAST works a different way

• Firstly, it creates a word list both of the exact sequence and high scoring substitutions

Sequence searching and alignments - Andrew Cowley21/04/23112

Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Page 113: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – step 1

• w=3

Sequence searching and alignments - Andrew Cowley21/04/23113

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Words: SEW, EWR, WRF, ...

Parameter: Word length (w)

Increase = faster, but less sensitive
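
Splitting the query into overlapping words of length w is a simple sliding window. A minimal sketch (the function name `words` is illustrative):

```python
def words(sequence, w=3):
    """Return every overlapping word of length w, in order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


query = "SEWRFKHIYRGQPRRHLLTTGWSTFVT"
word_list = words(query, w=3)
print(word_list[:3], len(word_list))
```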

Page 114: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – step 1(cont.d)

• w=3• T=13

Sequence searching and alignments - Andrew Cowley21/04/23114

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Neighbourhood words for GQP, with scores:

GQP 18
GEP 15
GRP 14
GKP 14
GNP 13
GDP 13

Below the threshold (T = 13):

AQP 12
NQP 12

Parameters: Neighbourhood threshold (T), Substitution matrix
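
The scores on this slide are consistent with BLOSUM62, so the sketch below assumes that matrix and includes only the handful of entries needed for this example (the full matrix has an entry for every amino acid pair). Each candidate word is scored against the query word GQP position by position, and only words scoring at least T are kept.

```python
# Excerpt of BLOSUM62 (assumed matrix), keyed as (query letter, candidate letter).
BLOSUM62_EXCERPT = {
    ("G", "G"): 6, ("Q", "Q"): 5, ("P", "P"): 7,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}


def word_score(query_word, candidate):
    """Sum the substitution scores of aligned letters (no gaps)."""
    return sum(BLOSUM62_EXCERPT[(q, c)] for q, c in zip(query_word, candidate))


def neighbourhood(query_word, candidates, threshold):
    """Keep the candidate words whose score reaches the threshold T."""
    return [c for c in candidates if word_score(query_word, c) >= threshold]


candidates = ["GQP", "GEP", "GRP", "GKP", "GNP", "GDP", "AQP", "NQP"]
for c in candidates:
    print(c, word_score("GQP", c))
print(neighbourhood("GQP", candidates, threshold=13))
```

A real implementation would generate all possible words rather than start from a fixed candidate list; the list here is simply the one shown on the slide.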

Page 115: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – step 2

• Then it scans database sequences for exact matches with these words

Sequence searching and alignments - Andrew Cowley21/04/23115

Page 116: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – step 3

• If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

• This results in a High-scoring Segment Pair (HSP)

Parameters: Drop off, Substitution matrix
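
The drop-off rule can be sketched as an ungapped extension that remembers the best score seen so far and stops once the running score has fallen a set amount below it. For brevity this sketch extends in one direction only and uses simple match/mismatch scores (5 and -4) in place of a substitution matrix; both are assumptions, and the names are illustrative.

```python
def extend_right(query, subject, q_start, s_start,
                 drop_off=10, match=5, mismatch=-4):
    """Extend an ungapped alignment to the right from a seed position.

    Returns (best score, length of the segment that achieved it).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= drop_off:
            break                      # fallen too far below the best score
    return best, best_len


print(extend_right("ACATAGGT", "TCATAGAT", 1, 1))
print(extend_right("AAAATTTT", "AAAACCCC", 0, 0))
```

In the first call a single mismatch is tolerated because the score recovers; in the second the extension is abandoned after three mismatches and the segment is trimmed back to its best scoring length.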

Page 117: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – step 4

• If the total HSP score is above another threshold then a gapped extension is initiated

Parameters: Extension threshold (Sg), Substitution matrix

Page 118: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST

• The steps rule out many database sequences early on
  • Large increase in speed

Sequence searching and alignments - Andrew Cowley21/04/23118

Page 119: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST – programs available at the EBI

• Basic Local Alignment Search Tool
• NCBI-BLAST programs:
  • BLASTP: protein sequence vs. protein sequence library
  • BLASTN: nucleotide query vs. nucleotide database
  • BLASTX: translated DNA vs. protein sequence library

• WU-BLAST programs:
  • BLASTP: protein query vs. protein database
  • BLASTN: nucleotide query vs. nucleotide database
  • BLASTX: translated nucleotide query vs. protein database
  • TBLASTN: protein query vs. translated nucleotide database
  • TBLASTX: translated nucleotide query vs. translated nucleotide database

Combines several parameters into ‘sensitivity’ option

Page 120: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/23120

Page 121: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Using BLAST - Example sequence

Sequence searching and alignments - Andrew Cowley21/04/23121

test_prot.fasta

Pages 47-50 in full booklet: Questions 15-17

www.ebi.ac.uk/~watson/africa

Page 122: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/23122

Key

- Gap
[residue] Identity
+ Similarity
X Filtered

Page 123: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Differences between BLAST and FASTA

• BLAST
  • Fast
  • Good with proteins
  • Produces good local alignments + short global alignments
  • Produces HSP (reports internal matches in long sequences)
  • Might miss a potential alignment due to ruling out sequences early on in the process
  • Good at finding siblings

• FASTA
  • Not as fast as BLAST
  • Much better with DNA than BLASTN
  • Produces S&W alignments
  • Checks each possible alignment with database sequences
  • Good at finding cousins

Sequence searching and alignments - Andrew Cowley21/04/23123

Page 124: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

When to use what?

Sequence searching and alignments - Andrew Cowley21/04/23124

[Figure: which tool to use, plotted by query length against database size. Tools shown: FASTA, WU-BLAST, NCBI BLAST, PSI-SEARCH]

Page 125: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

When to use what?

[Figure: time to search with FASTA, WU-BLAST, NCBI BLAST and PSI-SEARCH across the databases PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc]

Page 126: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Homology and Similarity


Page 127: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Similarity


Page 128: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Homology


Page 129: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Unrelated!


Page 130: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Homology vs. Similarity

Homology
• Presence of similar features because of common descent
• Cannot be observed directly, since the ancestors no longer exist
• Is inferred as a conclusion based on ‘similarity’
• Homology is like pregnancy: Either one is or one isn’t! (Gribskov, 1999)

Similarity
• Quantifies a ‘likeness’
• Uses statistics to determine the ‘significance’ of a similarity
• Statistically significant similar sequences are considered ‘homologous’

Page 131: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• So far, we’ve talked about scoring alignments
  • Direct function of the algorithm

• But what we want is to assign some kind of quality to that score

Page 132: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Score vs significance

A A A
A A A

A C A T A A G G C T
A T A C A A G C C T

High score
High significance

Page 133: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

“Lies, damn lies, and statistics”


Page 134: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

“Lies, damn lies, and statistics”

• Not just interested in score...

• ...But how likely we are to get that alignment by chance alone

• It is from this ‘non-random’ alignment that homology is inferred

• Statistics are used to estimate this chance


Page 135: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

E-value

• ‘Expect’ value

• Expected number of alignments scoring at least this well that would be found by chance (not strictly a probability: it can exceed 1)

• Best measure of how good an alignment is

• Often used for ranking results by default
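A minimal sketch of how an E-value relates to a raw score, assuming the Karlin-Altschul form used for BLAST-style statistics, E = K * m * n * exp(-lambda * S). The function names and the parameter values for lambda and K are illustrative assumptions, not values taken from these slides.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw alignment score into a normalised bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Illustrative parameters only (assumed for this example).
LAM, K = 0.267, 0.041

small_db = e_value(100, 250, 1_000_000, LAM, K)
large_db = e_value(100, 250, 2_000_000, LAM, K)
print(small_db, large_db)  # the same score is less significant in the larger database
```

The same raw score gives twice the E-value when the database doubles in size, which is one reason to search the smallest database likely to contain the sequence of interest.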


Page 136: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Calculated in different ways for BLAST and FASTA

• Short query sequences are more likely to be found by chance so have higher E-values

• Affected by parameter values like gap penalties and substitution matrices


Page 137: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA statistics

• Compares query sequence with every sequence in database

• As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance
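The idea of using the score distribution of unrelated sequences can be sketched as below. This is a simplification under stated assumptions: scores are turned into a z-score against the observed background and then into an expectation with an extreme value (Gumbel) model. FASTA's own fit is more involved, and the function name `evd_e_value` is hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649

def evd_e_value(score, background_scores):
    """Estimate how many background-like sequences would reach `score` by chance."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    z = (score - mean) / sd
    # Probability that a single unrelated sequence reaches this z-score
    # under a Gumbel distribution with the same mean and standard deviation.
    p = -math.expm1(-math.exp(-math.pi / math.sqrt(6) * z - EULER_GAMMA))
    return len(background_scores) * p

# Toy background: scores of the query against 41 (assumed unrelated) sequences.
background = list(range(20, 61))
print(evd_e_value(45, background))   # typical score, large expectation
print(evd_e_value(200, background))  # far above the background, tiny expectation
```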


Page 138: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

FASTA - histogram

[Figure: FASTA score histogram with the high scoring region marked]

Key
* Predicted distribution of scores
= Observed distribution of scores

Page 139: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BLAST statistics

• Main reason for speed is that it doesn’t compare query with lots of other sequences

• Therefore it pre-estimates statistical values using a random sequence model

“Appears to yield fairly accurate results”

Page 140: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Search Guidelines

Page 141: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Search guidelines 1

• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)

• Then with translated DNA sequences (fastx, blastx)

• Search with DNA vs. DNA as the next resort

• And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!


Page 142: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Search guidelines 2

• Search the smallest database that is likely to contain the sequence(s) of interest

• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology


Page 143: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Search guidelines 3

• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence
  • Examine the histograms
  • Use programs such as prss3 to confirm the expectation values
  • Searching with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() ~1.0
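The reasoning behind shuffle-based checks can be sketched as follows: shuffling keeps a sequence's composition but destroys its order, so scores against shuffled copies show what chance alone produces. This is not the prss3 implementation; the scoring values and the names `sw_score` and `shuffle_test` are assumptions chosen for illustration.

```python
import random

def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

def shuffle_test(query, subject, n_shuffles=200, seed=0):
    """Return the real score and the fraction of shuffles scoring at least as well."""
    rng = random.Random(seed)
    real = sw_score(query, subject)
    residues = list(subject)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if sw_score(query, "".join(residues)) >= real:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return real, (hits + 1) / (n_shuffles + 1)

score, fraction = shuffle_test("VHLTPEEKSAVTALWGKV", "VQLSGEEKAAVLALWDKV")
print(score, fraction)
```

A real match should score well above nearly all of its shuffled copies; if shuffled sequences routinely score as high, the reported significance deserves suspicion.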


Page 144: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Page 145: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Search guidelines 4

• Consider searches with different gap penalties and other scoring matrices
  • Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences
  • Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)
  • Remember to change the gap penalty defaults!

MATRIX     open   ext.
BLOSUM50   -10    -2
BLOSUM62   -7     -1
BLOSUM80   -16    -4
PAM250     -10    -2
PAM120     -16    -4
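One way to avoid forgetting the gap penalties when switching matrix is to keep the pairs together, for example when scripting searches. The values below are copied from the table on this slide; the lookup function `gap_penalties` is a hypothetical helper, not part of any search tool.

```python
# Matrix name -> (gap open, gap extend), as listed on the slide.
GAP_DEFAULTS = {
    "BLOSUM50": (-10, -2),
    "BLOSUM62": (-7, -1),
    "BLOSUM80": (-16, -4),
    "PAM250": (-10, -2),
    "PAM120": (-16, -4),
}

def gap_penalties(matrix):
    """Look up the gap penalties that go with a scoring matrix."""
    try:
        return GAP_DEFAULTS[matrix.upper()]
    except KeyError:
        raise ValueError(f"no gap penalties listed for {matrix!r}") from None

gap_open, gap_extend = gap_penalties("blosum62")
print(gap_open, gap_extend)
```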

Page 146: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Search guidelines 5

• Homology can be reliably inferred from statistically significant similarity

• But remember:
  • Orthologous sequences have similar functions
  • Paralogous sequences can acquire very different functional roles

• So further work might be needed to tease out details

Page 147: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Page 148: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Search guidelines 6

• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues
  • However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!

• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data
  • ClustalW
  • MUSCLE
  • T-Coffee
  • Kalign
  • MAFFT
  • Mview (available from EBI FASTA & BLAST services)
  • DBCLUSTAL (available from EBI BLAST services)

Page 149: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Advanced

Page 150: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• In general, the more information we can add to an alignment, the better the result


Conserved regions Structural information Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
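The motif above maps directly onto a regular expression, which is a quick way to scan a protein sequence for it. A minimal sketch; the function name and the test sequence are made up for illustration.

```python
import re

# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H written as a regular expression.
MOTIF = re.compile(r"[RTD][DAQ][FEA]ATH")

def find_motif(sequence):
    """Return the 0-based start positions of every motif occurrence."""
    return [m.start() for m in MOTIF.finditer(sequence.upper())]

print(find_motif("MKRDFATHQ"))  # [2]
```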

Page 151: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Conserved regions

• We can add a new ‘position’ parameter to the substitution matrix


We can even modify a normal search to generate a position specific scoring matrix, or PSSM

Page 152: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PSI-BLAST

Position Specific Iterative – BLAST:

1. Takes the result of a normal BLAST

2. Aligns them and generates profile of conserved positions

3. Uses this to weight scoring on next iteration
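Step 2 can be sketched as building a position specific scoring matrix from the aligned hits. This is a simplified illustration, assuming ungapped hits of equal length, a uniform background and a fixed pseudocount; PSI-BLAST itself weights sequences and uses real background frequencies. The names `build_pssm` and `score_window` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score for every amino acid at every alignment column."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_window(pssm, window):
    """Score a sequence window of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, window))

hits = ["ACDH", "ACEH", "ACDH", "SCDH"]
profile = build_pssm(hits)
print(score_window(profile, "ACDH"))  # conserved residues score highly
print(score_window(profile, "WWWW"))  # unseen residues score poorly
```

Because conserved columns dominate the score, a distant relative that keeps only those residues can still be found, which is also why a contaminated profile can drift after too many iterations.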


Page 153: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PSI-BLAST

• By adding importance to conserved residues we might be able to find more distant sequences

• But iterate too far and we might be assigning importance where there is none

More sensitive

Page 154: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PSI-BLAST


Page 155: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PSI-BLAST


Page 156: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PSI-BLAST


Page 157: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PHI-BLAST

• Pattern Hit Initiated-BLAST

• User provides a pattern alongside a protein

• Database hits have to contain this pattern and show similarity to the rest of the sequence

• Results can initiate a PSI-BLAST search as well


Page 158: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PSI-SEARCH

• Smith-Waterman implementation (SSEARCH)

• But with iterative position specific scoring


Page 159: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Using PSI-BLAST - Example sequence


test_prot.fasta

Pages 52-55 in full booklet: Questions 18-20

www.ebi.ac.uk/~watson/africa

Page 160: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Problem Sequences

Page 161: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Short sequences

• What about short sequences?
  • Depends on their nature:
    • Protein
      • Reduce word length and/or increase the E() value cut off
      • Use shallow matrices
    • DNA
      • Reduce the word length
      • Ignore gap penalties (force local alignments only)
  • Use rigorous methods
  • But ask what you are trying to do!

Page 162: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Low complexity regions

• Sometimes biologically relevant, but always likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions


Page 163: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Good Statistics:

The inset shows good correlation between the observed and expected numbers of scores.

This is the region of the histogram to look out for first when evaluating results.

Page 164: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Bad Statistics:

The inset shows bad correlation between the observed and expected scores in this search.

The spaces between the = and * symbols indicate this poor correlation.

One reason for this can be low complexity regions.

Page 165: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Low complexity regions

• Sometimes biologically relevant, but always likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions

• Compensate by filtering sequence so these regions don’t contribute to scoring
  • Filters: seg, xnu, dust, CENSOR

• But check what you are filtering!
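What a low complexity filter does can be sketched with a sliding window and Shannon entropy: windows dominated by one or two residues are replaced by X so they cannot contribute to the score. This is a simplified stand-in, not the seg or dust algorithm; the window size, the threshold and the function names are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that falls in a low-entropy window with X."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "P" * 15))
```

Printing the masked sequence next to the original is an easy way to follow the advice above and check what is being filtered.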


Page 166: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.

Note that there is now good agreement between the observed and expected high score in the search and that the distance between = and * has been significantly reduced.

Page 167: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Using Filters - Example sequence


Pages 56-57 in full booklet: Questions 21-22

Filtertest_seq.fsa

www.ebi.ac.uk/~watson/africa

Page 168: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Vector contamination

• You think you know what your sequence is...
  • ...But the results are really confusing!

• Maybe you have vector contamination

• Search against known vectors to check

Page 169: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Vector contamination


Page 170: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Vector Contamination - Example sequences


vectortest_seq1.fsa

vectortest_seq2.fsa

Page 57 in full booklet: Question 23

www.ebi.ac.uk/~watson/africa

Page 171: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Multiple Sequence Alignments

Page 172: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Uses of MSA

• Functional prediction
• Phylogeny
• Structural prediction
• Homology detection
• Protein analysis

• To distinguish between orthology and paralogy

Page 173: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)

• But this is too computationally intensive


Page 174: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Human beta      --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta      --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha     ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha     ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin  PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin    --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
                *: : : * . : .: * : * : .

Human beta      PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta      PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha     ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha     ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin  ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin    VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
                . .:: *. : . : *. * . : .

Human beta      LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta      LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha     LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha     LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin  LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin    VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
                : : .: . .. . :

Weighted Sums of Pairs: WSP

$$\mathrm{WSP} = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} D_{ij}$$

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years

Time O(L^N)
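The timing table follows from the O(L^N) growth: each extra sequence multiplies the work by the sequence length. A short sketch that approximately reproduces the table, assuming L = 150 (inferred from the step from 1 to 150 seconds) and 1 second for two sequences; the function names are hypothetical.

```python
def wsp_time_seconds(n_sequences, length=150, base_seconds=1.0):
    """Estimated run time if two sequences take base_seconds and cost grows as length**N."""
    return base_seconds * length ** (n_sequences - 2)

def humanise(seconds):
    """Express a duration in the largest convenient unit."""
    for unit, size in (("years", 365 * 86400), ("days", 86400), ("hours", 3600)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds:.0f} seconds"

for n in range(2, 8):
    print(n, humanise(wsp_time_seconds(n)))
```

The last rows differ slightly from the slide depending on rounding and the length of year used, but the conclusion is the same: exact optimisation is out of reach beyond a handful of sequences.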

Page 175: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)

• But this is too computationally intensive
  • Therefore we have to use heuristics and progressive alignment methods

Page 176: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Clustal

• >60,000 citations

• Clustal1-Clustal4
  • 1988, Paul Sharp, Dublin

• Clustal V 1992
  • EMBL Heidelberg
  • Rainer Fuchs
  • Alan Bleasby

• Clustal W, Clustal X 1994-2005
  • Toby Gibson, EMBL, Heidelberg
  • Julie Thompson, IGBMC, Strasbourg

• Clustal W and Clustal X 2.0 2006
  • University College Dublin

www.clustal.org

Page 177: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

CLUSTAL

• Quick, pairwise alignment of all sequences
  • Line up pairs, with the most similar first

Page 178: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

CLUSTAL

• Fix the alignment between pairs and treat as one sequence


Page 179: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

CLUSTAL

• Align your fixed pairs with each other
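The first step of the progressive approach, comparing all pairs and ranking them with the most similar first, can be sketched as below. The similarity measure here is Python's `difflib.SequenceMatcher` ratio, used only as a quick stand-in for Clustal's pairwise alignment scores; the function name is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def pairs_most_similar_first(sequences):
    """Rank every pair of named sequences by a quick similarity estimate."""
    scored = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(sorted(sequences.items()), 2):
        similarity = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
        scored.append((similarity, name_a, name_b))
    scored.sort(key=lambda item: -item[0])
    return [(a, b, round(s, 3)) for s, a, b in scored]

globins = {
    "human": "VHLTPEEKSAV",
    "horse": "VQLSGEEKAAV",
    "lupin": "GALTESQAALV",
}
for name_a, name_b, similarity in pairs_most_similar_first(globins):
    print(name_a, name_b, similarity)
```

The pair at the top of the list is aligned and fixed first, and more distant sequences are added afterwards, which is why an early mistake stays in the final alignment.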


Page 180: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

• Note, this is not a phylogram!
  • Only a guide tree for the alignment

Page 181: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

ClustalW at the EBI


Page 182: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

ClustalW

Parameters

Sequence input

Submit!

Help!

Page 183: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

ClustalW


Interactive – results in browser, deleted after 24 hours

Email – receive URL to results page, deleted after 7 days

Page 184: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

ClustalW


Page 185: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

ClustalW


Page 186: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Jalview


Page 187: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

ClustalW

Advantages
• Fast
• Not too demanding
• Widely used
• Fine for most uses

Disadvantages
• Fixing of early alignments
  • Propagate errors
• Doesn’t search far
  • Local minima
• Compresses gaps

Page 188: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Use of Clustal/JalView - Example sequences

prot_MSA.fasta
Problem_MSA1.fsa
Problem_MSA2.fsa
Problem_MSA3.fsa
Problem_MSA4.fsa

Pages 59-66 in full booklet: Questions 24-28

www.ebi.ac.uk/~watson/africa

Page 189: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Other Tools


Page 190: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

COFFEE

• Consistency based Objective Function For alignmEnt Evaluation

• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.

Page 191: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

COFFEE

• Library of reference pairwise alignments
  • For your given set of sequences

• Objective Function
  • Evaluates consistency between multiple alignment and the library of pairwise alignments
  • Use SAGA to optimise this function

• Weigh depending on quality of alignment

SAGA is another alignment method, using genetic algorithms
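The consistency idea can be sketched as follows: collect the residue pairs aligned in each library pairwise alignment, then ask what weighted fraction of them is also aligned in the multiple alignment. This is a simplified illustration of a COFFEE-like score, not the published implementation; the data layout, the names and the weights are assumptions.

```python
from itertools import combinations

def aligned_pairs(name_a, row_a, name_b, row_b):
    """Residue pairs placed in the same column by two gapped rows."""
    pairs = set()
    i = j = 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add(((name_a, i), (name_b, j)))
        if ca != "-":
            i += 1
        if cb != "-":
            j += 1
    return pairs

def coffee_like_score(msa, library):
    """Weighted fraction of library residue pairs that the MSA reproduces."""
    msa_pairs = set()
    for name_a, name_b in combinations(sorted(msa), 2):
        msa_pairs |= aligned_pairs(name_a, msa[name_a], name_b, msa[name_b])
    total = sum(library.values())
    found = sum(weight for pair, weight in library.items() if pair in msa_pairs)
    return found / total if total else 0.0

# Library from one pairwise alignment, every pair given weight 1.0 (assumed).
library = {pair: 1.0 for pair in aligned_pairs("a", "AC-G", "b", "ACTG")}

print(coffee_like_score({"a": "AC-G", "b": "ACTG"}, library))  # fully consistent
print(coffee_like_score({"a": "ACG-", "b": "ACTG"}, library))  # one pair lost
```

Sequence names must be used in the same sorted order when building the library and when scoring, since pairs are compared as ordered tuples.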

Page 192: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

COFFEE

• More accurate than ClustalW
  • Much less prone to problems in early alignment stages

• VERY slow!

Page 193: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

T-Coffee

• Tree-based COFFEE
  • Heuristic approach to COFFEE

• Gets rid of genetic algorithm portion
• Uses progressive alignments
• Changes algorithm based on number of sequences

Page 194: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

T-Coffee

• Much faster than COFFEE
• Avoids some of ClustalW’s pitfalls
• Can take information from several data sources

• Still not that fast
• Can be very demanding of memory etc.

Page 195: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Others

• MUSCLE (Bob Edgar)
  • Iterative/progressive alignment
  • Fast
  • Good for big alignments, proteins

• MAFFT
  • Iterative based Fast Fourier Transform
  • Fast and accurate
  • Good for huge alignments

• Kalign
  • Very fast, local-regions aligning
  • Good for very large numbers of alignments!

Page 196: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Which tool should I use?

Input data                                    Recommendation
2-100 sequences of typical protein length     MUSCLE, T-Coffee, MAFFT, ClustalW
100-500 sequences                             MUSCLE, MAFFT
>500 sequences                                MUSCLE, KALIGN
Small number of unusually long sequences      ClustalW
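The recommendations above can be written as a small lookup, for instance to pick a tool in a script. A sketch only: the function name and the `unusually_long` flag are hypothetical, and treating exactly 100 sequences as the first category is an assumption because the slide's ranges overlap there.

```python
def recommend_msa_tools(n_sequences, unusually_long=False):
    """Return the tools suggested on the slide for a given input size."""
    if n_sequences < 2:
        raise ValueError("a multiple alignment needs at least two sequences")
    if unusually_long:
        # Small number of unusually long sequences.
        return ["ClustalW"]
    if n_sequences <= 100:
        return ["MUSCLE", "T-Coffee", "MAFFT", "ClustalW"]
    if n_sequences <= 500:
        return ["MUSCLE", "MAFFT"]
    return ["MUSCLE", "KALIGN"]

print(recommend_msa_tools(40))
print(recommend_msa_tools(2000))
```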


Page 197: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

How to evaluate?

• Use a benchmark
  • BaliBASE

Page 198: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

BaliBASE
Thompson, JD, Plewniak, F. and Poch, O. (1999) NAR and Bioinformatics

• IGBMC Strasbourg
• 141 manual alignments using structures
• 5 sections
• core alignment regions marked

1. Equidistant (82)
2. Orphan (23)
3. Two groups (12)
4. Long internal gaps (13)
5. Long terminal gaps (11)

Page 199: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Benchmark pitfalls

• Benchmark dataset may not be representative
  • Danger of over-training towards benchmark

• Goldman: Most MSAs have unrealistic gaps
  • Tend towards multiple, independent deletions
  • Insertions are rare

• Sequences shrink in length over evolution
  • No supporting evidence that this is the case

Page 200: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Solutions

• Use phylogenetic data to guide alignment
  • Keep track of changes to ancestor sequences
  • Don’t change them again so easily in descendants

Page 201: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

PRANK

• Probabilistic Alignment Kit
• webPRANK
• Better suited for closely related sequences
• When solutions tie, one is chosen at random
  • Avoids incorrect confidence in result
  • Means alignments might not be reproducible

• Alignments look quite different
  • Might look worse!
  • But gap patterns make sense
  • Gaps are good!

Page 202: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk


Page 203: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Sequence searching and alignments - Andrew Cowley21/04/23203

Page 204: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Comparing Alignments - Example sequences

prot_MSA.fasta

Pages 67-74 in full booklet: Questions 29-30

www.ebi.ac.uk/~watson/africa

Page 205: EBI is an Outstation of the European Molecular Biology Laboratory. EBI Roadshow James Watson, PhD Senior Scientific Training Officer EBI-EMBL watson@ebi.ac.uk

Common problems with MSA

• Input format
  • FASTA format
  • Unique sequence identifiers
  • Include sequence!

• Job can’t be found
  • Interactive results deleted after 24hrs
  • Use email
  • Consider other tool
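
The input-format checks above can be sketched as a small pre-submission test. This is a minimal example, not part of any EBI tool: the function name `check_fasta` and the wording of the messages are assumptions, and it treats the first word after `>` as the sequence identifier.

```python
def check_fasta(text):
    """Return a list of problems found in FASTA-formatted text.

    Hypothetical helper: checks only the three points from the slide
    (FASTA headers, unique identifiers, sequence present).
    """
    problems = []
    records = []  # one [identifier, sequence_length] entry per header line

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the identifier is the first word of the header.
            records.append([fields[0] if fields else "", 0])
        elif not records:
            problems.append("sequence data before the first '>' header")
        else:
            records[-1][1] += len(line)

    seen = set()
    for seq_id, length in records:
        if not seq_id:
            problems.append("header with no identifier")
        elif seq_id in seen:
            problems.append(f"duplicate identifier '{seq_id}'")
        seen.add(seq_id)
        if length == 0:
            problems.append(f"record '{seq_id}' has no sequence")

    if not records:
        problems.append("no FASTA records found")
    return problems


good = ">seq1 human\nMKVLAA\n>seq2 mouse\nMKVLSA\n"
bad = ">seq1\nMKVLAA\n>seq1\n"
print(check_fasta(good))
print(check_fasta(bad))
```

An empty list means none of these three problems was detected; it does not guarantee that a given alignment service will accept the input.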

Page 206

Common mis-uses of MSA

• Performing a sequence assembly
  • Specialist type of MSA
  • Use other tools (Staden etc.)

• Aligning ESTs to a reference genome
  • Use EST2Genome

• Designing primers
  • Use primer tools (primer3 etc.)

• Aligning two sequences
  • Use a pairwise alignment tool!
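
For the last point, a pairwise comparison needs nothing more than a two-sequence dynamic programme. The sketch below computes a global (Needleman-Wunsch style) alignment score with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; real pairwise tools use substitution matrices and separate gap open and extend penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty).

    Illustrative scoring scheme only; the parameter values are assumptions.
    """
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Only two rows of the score matrix are kept, which is enough for the score; recovering the alignment itself would need the full matrix and a traceback.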

Page 207

Putting it all together

• EB-Eye search
• Sequence retrieval
• Sequence search
• Sequences retrieval
• Multiple sequence alignment
• Analysis

Page 208

Final remarks

• Don’t assume a single tool will cater for all your needs

• DO change the parameters of the tools

• Remember where the tool excels and what its limitations are

• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

• Crazy input will always give crazy results!

Page 209

Getting Help

Page 210

Getting Help

• Database documentation
• Frequently Asked Questions
  • http://www.ebi.ac.uk/help/faq.html

• 2can Support Portal
  • http://www.ebi.ac.uk/2can/

• EBI Support
  • http://www.ebi.ac.uk/support/

• Hands-on training programme
  • http://www.ebi.ac.uk/training/handson/

Page 211

Thanks!

• Hamish McWilliam and Andrew Cowley
• Vicky Schneider
• Rodrigo Lopez

• EMBL-EBI
• SLING
• You!
