
EBI is an Outstation of the European Molecular Biology Laboratory.

EBI Roadshow

James Watson, PhD

Senior Scientific Training Officer

EBI-EMBL

watson@ebi.ac.uk

Andrew Cowley

External Services, EMBL-EBI

Sequence Searching and Alignments

External Services

Sequence searching and alignments - Andrew Cowley, 21/04/23

Andrew Cowley, Bioinformatics Trainer

Hamish McWilliam, Software engineer

Rodrigo Lopez, Head of External Services

+ many others!

Contents

• Sequence databases
  • Database browsing tools

• Similarity searching and alignments
  • Alignment basics
  • Similarity searching tools
  • More advanced tools
  • Alignment tools
  • Guidelines

• (slightly) More advanced tools
  • Problem sequences

Materials

Presentations and tutorials can be found on the roadshow course page at the EBI

Data files for exercises can be found at:

www.ebi.ac.uk/~watson/africa

• Simplistically, much of the data at the EBI can be thought of as a container

• One part being the raw data (eg. Sequence)

• Another part being annotation on this data

Example

ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.
XX
AC   AJ131285;
XX
DT   24-APR-2001 (Rel. 67, Created)
DT   20-JUL-2001 (Rel. 68, Last updated, Version 4)
XX
DE   Sabella spallanzanii mRNA for globin 3
XX
KW   globin; globin 3; globin gene.
XX
OS   Sabella spallanzanii
OC   Eukaryota; Metazoa; Annelida; Polychaeta; Palpata; Canalipalpata;
OC   Sabellida; Sabellidae; Sabella.
XX
RN   [1]
RP   1-919
RA   Negrisolo E.M.;
RT   ;
RL   Submitted (11-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   Negrisolo E.M., Biologia, Universita degli Studi di Padova, via U. Bassi
RL   58/B, Padova,35131, ITALY.
FH   Key             Location/Qualifiers
FH
FT   source          1..919
FT                   /organism="Sabella spallanzanii"
FT                   /mol_type="mRNA"
FT                   /db_xref="taxon:85702"
FT   CDS             73..552
FT                   /gene="globin"
FT                   /product="globin 3"
FT                   /function="respiratory pigment"
FT                   /db_xref="GOA:Q9BHK1"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR014610"
FT                   /db_xref="UniProtKB/TrEMBL:Q9BHK1"
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
FT                   /protein_id="CAC37412.1"
FT                   /translation="MYKWLLCLALIGCVSGCNILQRLKVKNQWQEAFGYADDRTSXGTA
FT                   LWRSIIMQKPESVDKFFKRVNGKDISSPAFQAHIQRVFGGFDMCISMLDDSDVLASQLA
FT                   HLHAQHVERGISAEYFDVFAESLMLAVESTIESCFDKDAWSQCTKVISSGIGSGV"
XX
SQ   Sequence 919 BP; 244 A; 246 C; 199 G; 225 T; 5 other;
     caaacagtca rttaattcac agagccctga ggtctctcgc tcctttctgc gtcactctct        60
     cttaccgtca tcatgtacaa gtggttgctt tgcctggctc tgattggctg cgtcagcggc       120
     tgcaacatcc tccagaggct gaaggtcaag aaccagtggc aggaggcttt cggctatgct       180
     gacgacagga catcccycgg taccgcattg tggagatcca tcatcatgca gaagcccgag       240
//
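The ID line of the example entry already carries both ways the data is organised: the data class (STD) and the taxonomic division (INV). As a minimal sketch, assuming the seven-field, semicolon-separated ID line layout shown in the example, the fields could be pulled out like this (the function name and dictionary keys are illustrative, not part of any EBI tool):

```python
def parse_embl_id_line(line):
    """Split an EMBL-style ID line into its seven fields (illustrative helper)."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the 'ID' tag and the trailing full stop, then split on ';'
    body = line[2:].strip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 7:
        raise ValueError("expected 7 fields, found %d" % len(fields))
    accession, version, topology, mol_type, data_class, division, length = fields
    return {
        "accession": accession,
        "sequence_version": int(version.split()[1]),  # 'SV 1' -> 1
        "topology": topology,
        "molecule_type": mol_type,
        "data_class": data_class,           # e.g. STD, EST, WGS
        "taxonomic_division": division,     # e.g. INV, HUM, MUS
        "length_bp": int(length.split()[0]),  # '919 BP' -> 919
    }


record = parse_embl_id_line("ID   AJ131285; SV 1; linear; mRNA; STD; INV; 919 BP.")
print(record["data_class"], record["taxonomic_division"], record["length_bp"])
```

For real work a maintained flat-file parser is a safer choice; this only illustrates where the class and division sit in the entry.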

Data - Nucleotide

• ENA/EMBL-Bank:
  • Release and updates
  • Divided into classes and divisions
  • Supplementary sets: EMBL-CDS, EMBL-MGA

• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA, IMGT/LIGM, etc.
  • Alternative splicing: ASD, ASTD, etc.
  • Completed genomes: Ensembl, Integr8, etc.
  • Variation: HGVBase, dbSNP, etc.

What sequence data is submitted?

[Figure: two routes to submission. Individual sequencing: individual scientists sequence an individual gene, add annotation and submit. High throughput sequencing: large-scale sequencing projects (e.g. whole genome shotgun) go from chromosome to fragment, sequencing library, sequence reads, assembled sequence and annotation, then submit.]

What are primary sequence databases?

[Figure: individual scientists, large-scale sequencing projects and patent offices submit primary sequence data to a primary sequence database.]

Primary sequence database:

• Original sequence data

• Experimental data

• Patent data

• Submitter-defined

How do primary and derived databases differ?

[Figure: sequencing projects and patent offices submit primary sequence to the primary database; derived data, e.g. protein sequence, goes into a derived database.]

Primary v. derived data

DNA sequence (submitted): ACGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACAT

transcribe

Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC

translate

Derived protein sequence: MRSDECCCAMSC

• Primary database: redundant. If anything in a submission varies (e.g. source / submitter / sequence), it generates a new entry

• Derived database: may be non-redundant

• Derived database: if data is lost, it can be regenerated
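The transcribe and translate steps of the primary v. derived example can be sketched in a few lines. This assumes the standard genetic code and reading frame 1; the function names are illustrative. The mRNA used is the one shown on the slide.

```python
# Standard genetic code, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Derive mRNA from the coding (sense) DNA strand: T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Derive protein from mRNA in frame 1, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(mrna))
```

Because the protein is computed from the nucleotide data, a derived record can always be rebuilt from the primary one, but not the other way round.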

Primary nucleotide sequence databases

INSDC: International Nucleotide Sequence Database Collaboration

• ENA (Europe), GenBank (U.S.A.), DDBJ (Japan)

• Daily exchange of data

• Submission can be made to any INSDC database

How is sequence data processed?

Sequence information:

• Reads
  • Sequence machine output (reads)
  • Quality scores

• Assembly
  • Fragmented sequence reads assembled into contigs mapped onto chromosomes

• Annotation
  • Functional information assigned to assembled regions

What type of sequence data is submitted?

[Figure: reads, assembly and annotation correspond to raw data, assembled sequences and annotated sequence.]

• Input information: sample, set-up, machine configuration

• Output machine data: sequence traces, reads, quality scores

• Interpreted information: assembly, mapping, functional annotation, sample information

• Metagenomic data: where originated

How does ENA store the data?

European Nucleotide Archive:

• ENA-Annotation (formerly EMBL-Bank): assembled sequences and annotated sequence

• Sequence Read Archive (SRA): raw data
  • Intensity reads
  • Next-generation sequencing instruments

• Trace Archive: raw data
  • Trace sequence reads
  • Capillary sequencing instruments

INSDC Sequencing Projects

Can data be traced to an Institute?

[Figure: database records (genomic, ESTs, shotgun ...) from an institute or consortium are linked to a project, e.g. a complete genome or metagenome (single organism / metagenomic study), together with its assembly & annotation.]

• Track projects

• Comparative analysis

• Pulls information together

Nucleotides: European Nucleotide Archive (ENA)


Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).

The ENA has a three-tiered data architecture.

It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).


Data Quality

Is the data cleaned up?

Validation of submitted data:

• Automatic quality checks

• Some manual inspection and curation

Errors can still exist in sequence and annotation

Database Structure

How is the data organized?

Data in ENA Annotation is divided in 2 ways:

1) Data classes

• Type of data or

• Methodology used to obtain data

• Each entry belongs to one data class

2) Taxonomic Divisions

• Each entry belongs to one taxonomic division

Data Classes

CON  Constructed from sequence assemblies
EST  Expressed Sequence Tag (cDNA)
GSS  Genome Survey Sequence (high-throughput short sequence)
HTC  High-Throughput cDNA (unfinished)
HTG  High-Throughput Genome sequencing (unfinished)
MGA  Mass Genome Annotation
PAT  Patent sequences
STS  Sequence Tagged Site (short unique genomic sequences)
TPA  Third Party Annotation (re-annotated and re-assembled)
TSA  Transcriptome Shotgun Assembly (computational assembly)
WGS  Whole Genome Shotgun
STD  Standard (high quality annotated sequence)

Data Classes: EST

• Single-pass reads, variable quality

• Need to search both EST and RNA data

Data Classes

• Often copies of existing entries

• Records not clean, even for taxonomy

Data Classes

• Bulk of entries

• Highest level of tracked information

Data Classes: TPA

• Derived data entries
  • e.g. patch genomic and RNA data to construct complete coverage

• Must have publication

• Must show which entries data is derived from

Data Classes: TSA

• Also derived data entries

• ESTs assembled to construct RNA

• Must show which EST/HTC entries data is derived from

Data Classes: WGS

• Entries change over time (completely replaced)

• Raw WGS entries are assembled into contigs (CON entries)


How stable is the data?

Data is always changing:

• Assembly of sequences into larger fragments

• Deletion of obsolete entries (i.e. once assembled)

• Sequence modifications

• Daily updates

• Identifier changes

• Corrections (databases can contain errors)

• etc…


How does assembly affect entries?

Example:

WGS (Shotgun)
• Fragments in separate entry

CON (Constructed)
• Join to make new CON entries
• Old WGS entries archived

STD (Standard)
• Join into large STD entry (e.g. completed genome)
• Add annotation
• Old CON entries archived

Taxonomy

ENV  Environmental
HUM  Human
MUS  Mouse
MAM  Mammal
VRT  Vertebrate
FUN  Fungi
INV  Invertebrate
PLN  Plant
PHG  Phage
PRO  Prokaryote
ROD  Rodent
VIR  Viral

Other:

SYN  Synthetic
TGN  Transgenic
UNC  Unclassified

Taxonomy: ENV (Environmental)

• CAUTION: organism never isolated

• May BLAST the sequence to assign a putative organism

Taxonomy: Other

• CAUTION: not consistently handled, variable quality

• Transgenics may be from multiple organisms

Taxonomy: Other

• Division primarily used by GenBank for PAT (patent) sequences

Taxonomy exclusion

Some species excluded from certain taxonomic ranges:

VRT  Vertebrate excludes human, mouse, rodent, mammal
MAM  Mammal excludes human, mouse, rodent
ROD  Rodent excludes mouse

Applies to:
• ftp files and
• sequence search tools

But not:
• ENA Browser
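A practical consequence of the exclusions is that a search of, say, "all mammals" through the ftp files or sequence search tools has to name several divisions. A minimal sketch, using only the three exclusion rules listed on the slide (the function name is illustrative):

```python
# Divisions that are NOT included in the broader division, per the slide.
EXCLUDED_SUBDIVISIONS = {
    "VRT": ["HUM", "MUS", "ROD", "MAM"],
    "MAM": ["HUM", "MUS", "ROD"],
    "ROD": ["MUS"],
}


def divisions_for_full_coverage(division):
    """Return every division that must be searched to cover the whole taxonomic range."""
    return [division] + EXCLUDED_SUBDIVISIONS.get(division, [])


print(divisions_for_full_coverage("MAM"))
print(divisions_for_full_coverage("FUN"))
```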

Taxonomy Database

Which taxonomy database does ENA use?

All INSDC databases use the NCBI Taxonomy Browser

Only organisms with sequence are represented

EBI Taxonomy Portal

• EBI-wide service maps resources into taxonomy service

• Culture collection – physical data, e.g. sample or stored version

• Biomaterial

• Specimen voucher – representation, e.g. picture

Database Structure

How does data organization differ from GenBank?

ENA-Annotation: Data classes (std ...) and Taxonomic Divisions (hum, mus, rod, mam, vrt, fun ...)

• Data split into intersecting slices

• Reduces search set

• Ensures complete result set

GenBank: Taxonomic Divisions (std ..., mus, rod, mam, vrt, pln ...)

• Data split into parallel slices

• Large search sets

• Classes incomplete for taxonomy

• Taxonomy incomplete for classes

Example:

‘Mouse’ set
• large data set
• includes all mouse entries

‘EST’ set
• large data set
• includes all EST entries

‘Mouse’ + ‘EST’ intersection
• small data set
• ensured complete set of mouse ESTs
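The idea of intersecting slices can be illustrated with plain sets. The entries below are made up for the example (hypothetical accessions, not real records); each one has exactly one data class and one taxonomic division, as described for ENA-Annotation.

```python
# (accession, data class, taxonomic division): hypothetical entries
ENTRIES = [
    ("X0001", "EST", "MUS"),
    ("X0002", "EST", "HUM"),
    ("X0003", "STD", "MUS"),
    ("X0004", "EST", "MUS"),
    ("X0005", "WGS", "PLN"),
]


def select(entries, data_class=None, division=None):
    """Return the accessions matching a data class and/or a taxonomic division."""
    return {
        accession
        for accession, entry_class, entry_division in entries
        if (data_class is None or entry_class == data_class)
        and (division is None or entry_division == division)
    }


mouse_set = select(ENTRIES, division="MUS")   # all mouse entries
est_set = select(ENTRIES, data_class="EST")   # all EST entries
mouse_ests = mouse_set & est_set              # small, but complete, set of mouse ESTs
print(sorted(mouse_ests))
```

Because every entry sits in exactly one class and one division, the intersection cannot miss a mouse EST, which is the point of the "ensures complete result set" bullet.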

Data – Protein Sequence

• UniProt databases:
  • UniProtKB: human curated and automatic translation sections
  • UniRef: non-redundant sequence clusters
  • UniParc: non-identical sequence archive

• Sequence from structures:
  • PDB
  • SGT

• Specialist data sets, e.g.:
  • Immunoglobulins: IMGT/HLA
  • Alternative splicing: ASD, ASTD
  • Completed proteomes: Ensembl, Integr8
  • Protein Interactions: IntAct
  • Patent Proteins: EPO, JPO, KIPO and USPTO

Sequence Databases

GenBank (nucleotide):
• Not curated
• Author submits
• Only author can revise
• Multiple records for same loci common
• Records can contradict each other
• No limit to species included
• Data exchanged among INSDC members
• Akin to primary literature
• Proteins identified and linked
• Access via NCBI Nucleotide databases

RefSeq (nucleotide/protein):
• Curated
• NCBI creates from existing data
• NCBI revises as new data emerge
• Single records for each molecule of major organisms
• Limited to model organisms
• Exclusive NCBI database
• Akin to review articles
• Proteins and transcripts identified and linked
• Access via Nucleotide & Protein databases

UniProt (protein):
• Automatically annotated/curated
• Entries created from large scale genomics, small scale cloning and peptide submissions
• UniProt regularly revises entries
• An entry per protein including any isoforms (when curated)
• All species
• International consortium, inc. leading figures in bioinformatics
• Akin to review articles, plus extensive cross-references to other DBs (almost a portal)
• Links to supporting nucleotide data added to entries
• UniProt.org: modern, multi-functional website with incorporated bioinformatics tools


Protein sequence: UniProt

UniProt

• Manual curation

• Literature-based annotation

• Sequence analysis

• Automated annotation

[Figure: UniProt shown alongside InterPro, IntAct and IntEnz, covering functional info, protein identification data, protein families and domains, molecular interactions, enzymes, microbial protein families, post-translational modifications, transmembrane prediction, InterPro classification, signal prediction, other predictions and protein classification.]


UniProt data sources and data flow

Database sources: INSDC (incl. WGS, Env.), Ensembl, VEGA, RefSeq, PDB, WormBase, FlyBase, Patent Data, Sub/Peptide Data, Proteome Sets

• UniProtKB: UniProt Knowledgebase, the centrepiece of the UniProt Consortium’s activities. It provides a richly curated protein database.

• UniRef (UniRef 100, UniRef 90, UniRef 50): pre-computed clusters of similar proteins

• UniParc: UniProt Sequence Archive. Contains all current and obsolete UniProtKB sequences

• UniMes: UniProt Metagenomic and Environmental Sequences (available by FTP only)

• UniSave: UniProt protein entry archive. Contains all versions of each protein entry. (Accessed via www.uniprot.org and www.ebi.ac.uk/unisave)

The Two Sides of UniProtKB

• UniProtKB/Swiss-Prot: non-redundant, high-quality manual annotation (reviewed)

• UniProtKB/TrEMBL: redundant, automatically annotated (unreviewed)

Databases

• Many databases and they are getting bigger
• Efficient searching involves knowledge of what is stored in these
• Don’t assume that everything in the databases is correct
• Nothing is constant, but changes...
  • Deletions, sequence modifications
  • Daily updates, identifier changes, etc.

Searching databases

? What methods of searching databases do you know of?

? What is the difference between a primary and secondary database?

? What is the best protein sequence database to search (specific part)?

Searching

• Many ways of searching databases
  • Annotation/title
    • Know something about your sequence
      • Gene name
      • Function
      • Accession

New search service

• Data organised according to: gene, expression, protein, structure, literature

• Species selector allows for easy comparison

• Explore data, return easily to your results

Access from the EBI’s homepage

Database webpages

Database searching

Searching

• Many ways of searching databases
  • Annotation/title
    • Know something about your sequence
      • Gene name
      • Function
      • Accession
  • Raw data
    • Don’t know!
    • Or want to check...
    • Infer extra information
      • Homology?
      • Annotation?
      • Function?

Sequence alignment

• Relatively easy if we have an exact match
• .. But sequence is variable
  • Between individuals, species, location etc.
• That variability is useful data too!
• Need a search method that allows for some variability
• And even better – helps us assess that variability

Sequence alignment

Query: ACATAGGT

Sequence 1: TCATAGAT (score: 6/8)

Sequence 2: AAATTCTG (score: 3/8)
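The 6/8 and 3/8 scores are simply counts of identical positions when the query is laid directly over each sequence. A minimal sketch (the function name is illustrative, and it assumes equal-length sequences with no gaps):

```python
def identity_score(query, subject):
    """Count positions at which two equal-length sequences carry the same letter."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length for ungapped comparison")
    return sum(q == s for q, s in zip(query, subject))


query = "ACATAGGT"
for name, subject in [("Sequence 1", "TCATAGAT"), ("Sequence 2", "AAATTCTG")]:
    print(name, "%d/%d" % (identity_score(query, subject), len(query)))
```

This works for short, equal-length sequences; it says nothing about insertions or deletions, which is where the following slides pick up.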

Sequence alignment

atttcacagaggaggacaaggctactatcacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtctacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccattaaagcacctgggatgatctcaagggcacctttgcccagcttgagt

atggtgctctctgcagctgacaaaaccaacatcaagaactgctgggggaagattggtggccatggtggtgaatatggcgaggaggccctacagaggatgttcgctgccttccccaccaccaagacctacttctctcacattgatgtaagccccggctctgcccaggtcaaggctcacggcaagaaggttgctgatgccctggccaaagctgcagaccacgtcgaagacctgcctggtgccctgtccactctgagcgacctgc

cacaagcctgtggggcaaggtgaatgtggaagatgctggaggagaaaccctgggaaggctcctggttgtntacccatggacccagaggttctttgacagctttggcaacctgtcctctgcctctgccatcatgggcaaccccaaagtcaaggcacatggcaagaaggtgctgacttccttgggagatgccataaagcacctggatgatctcaagggca

Query:

Dot plot

• Maybe a dot plot will help

[Figure: dot plot grid for Sequence 1, with A C A T A G along one axis and G A T A C T along the other.]

Dot plot

[Figure: dot plots of Query vs Sequence 1 and Query vs Sequence 2.]
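A dot plot is easy to draw as text: one sequence along the top, the other down the side, and a mark wherever the two letters match. The sketch below is illustrative only (the function name and the use of `*` and `.` are choices made here, not from the slides).

```python
def dot_plot(seq_x, seq_y):
    """Return a text dot plot: '*' where the letters match, '.' elsewhere."""
    lines = ["  " + " ".join(seq_x)]            # header row: seq_x along the top
    for y in seq_y:
        row = ["*" if x == y else "." for x in seq_x]
        lines.append(y + " " + " ".join(row))   # one row per letter of seq_y
    return "\n".join(lines)


print(dot_plot("ACATAGGT", "TCATAGAT"))
```

Runs of matches show up as diagonal lines of `*`; a shift from one diagonal to another corresponds to a gap in the alignment.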

• We can see the difference, but how to turn that into something a computer can evaluate?

• Computers rely on algorithms which give them a score

• They can then compare scores

• Simple algorithm – penalise movement away from diagonal – gap penalty

• Having opened a gap, we should assign a lesser penalty to extending it

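[Figure: gap penalties on the dot plot. Two separately opened gaps: -10 and -10. One gap opened and then extended: -10.5 and -0.5.]

Actual implementation is usually to apply the gap extension penalty to every gap position, including the first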

Why a lesser gap extend penalty?

• Single block of insertions/deletions is more likely than multiple in/del events

NVELKAET NVDEATNFELKAET

NV-ELKAET

NVDE--A-TNFELKAET

NV------ELKAET

NVDEATNFELKAET
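The effect of a cheaper extension penalty can be checked with a little arithmetic. This sketch assumes the values shown on the slides (gap open -10, gap extend -0.5, with the extension applied to every gap position); the function names are illustrative.

```python
def gap_cost(length, gap_open=-10.0, gap_extend=-0.5):
    """Affine cost of one gap: pay the opening once, the extension per position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def total_gap_cost(gap_lengths, gap_open=-10.0, gap_extend=-0.5):
    """Total cost of several independent gaps in one alignment."""
    return sum(gap_cost(n, gap_open, gap_extend) for n in gap_lengths)


# NVELKAET vs NVDEATNFELKAET needs six gap positions in total.
print(total_gap_cost([6]))        # one block of six
print(total_gap_cost([2, 2, 2]))  # the same six positions as three separate gaps
```

The single block is penalised far less than three separate gaps, which matches the biological argument that one insertion/deletion event is more likely than several.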

Match/mismatch

• Of course, we need to tell the algorithm that matching letters are better than mismatches too

• This is done via a scoring matrix

     A   C   G   T
A    5  -4  -4  -4
C   -4   5  -4  -4
G   -4  -4   5  -4
T   -4  -4  -4   5

• Putting the two together gives us a scoring mechanism

[Figure: the match/mismatch matrix (match 5, mismatch -4) combined with the gap penalties; two candidate alignment paths score -13.5 and -13.]

• To pick the optimal alignment, start at the end and trace back the highest scoring route.


Needleman-Wunsch

• Congratulations! You’ve just reconstructed the Needleman-Wunsch algorithm!
  • An example of dynamic programming

• Comparing the full length of both sequences is called a global-global or just global alignment
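The fill-then-trace-back procedure can be sketched as follows. This is a simplified illustration, not the EMBOSS implementation: it uses the match 5 / mismatch -4 scores from the slides but a single linear gap penalty of -10 per position instead of separate open and extend penalties.

```python
def needleman_wunsch(a, b, match=5, mismatch=-4, gap=-10):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, m + 1):          # first row: b aligned against gaps only
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # diagonal: align two letters
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the end, following whichever move produced each score.
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "ACT"))
```

Both time and memory grow with the product of the two sequence lengths, since every cell of the matrix is stored and evaluated.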

Global vs Local

• But global-global might not be suitable for sequences that are very different lengths

• A modified form of this algorithm for local alignment is called the Smith-Waterman algorithm.
  • Sets negative scores in matrix to 0, and allows trace back to end and restart
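The change from global to local alignment is small in code: no cell is allowed to fall below zero, and the answer is the best cell anywhere in the matrix rather than the bottom-right corner. A minimal score-only sketch, again assuming match 5, mismatch -4 and a simplified linear gap penalty of -10:

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-10):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(0,                     # negative scores are reset to 0
                             previous[j - 1] + s,   # align two letters
                             previous[j] + gap,     # gap in b
                             current[j - 1] + gap)  # gap in a
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACATAG", "TTTACATAGCCC"))
```

Only two rows are kept here because the sketch reports the score alone; recovering the aligned region would need the full matrix and a trace back from the best cell until a zero is reached.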

QUESTION: Global vs Local - which is which?

A T G T A T A C G C

- A G T A T A - G C

A - T G T A T A C G C

A G T A T A - - - G C

LOCAL GLOBAL

Scoring

• Parameters so far:
  • Match/mismatch
  • Gap opening
  • Gap extending

• Can we improve it?

Substitutions

• Some substitutions are more likely than others
• DNA:
  • Purines (A, G) – dual ring
  • Pyrimidines (C, T) – single ring

• Substitutions within the same type (purine to purine, or pyrimidine to pyrimidine) are called transitions, whereas exchanging one type for the other is called a transversion

• Transitions occur more frequently than transversions, so we can score them higher in the scoring matrix
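A scoring function that distinguishes transitions from transversions might look like the sketch below. The match (5) and transversion (-4) values come from the earlier matrix; the transition score of -1 is an assumed value chosen only to illustrate "score them higher".

```python
PURINES = {"A", "G"}      # dual ring
PYRIMIDINES = {"C", "T"}  # single ring


def classify_substitution(a, b):
    """Label a pair of DNA letters as 'match', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    if a not in "ACGT" or b not in "ACGT" or len(a) != 1 or len(b) != 1:
        raise ValueError("expected single letters from A, C, G, T")
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def substitution_score(a, b, match=5, transition=-1, transversion=-4):
    """Score a pair; the transition value is an illustrative assumption."""
    return {"match": match, "transition": transition,
            "transversion": transversion}[classify_substitution(a, b)]


print(classify_substitution("A", "G"), substitution_score("A", "G"))
print(classify_substitution("A", "C"), substitution_score("A", "C"))
```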

Proteins

• What about proteins?

Protein substitution matrices

• Can look at closely related proteins to determine substitution rates

• Two most commonly used models:
  • BLOSUM
  • PAM

BLOSUM

• BLOcks SUbstitution Matrix

• Align conserved regions of evolutionarily divergent sequences clustered at a given % identity

• Count relative frequencies of amino acids and substitution probability

• Turn that into a matrix where the more positive a substitution score is, the more likely it is to be found, and the more negative, the less likely.

• Higher BLOSUM number = more closely related

PAM

• Point Accepted Mutation
• Observed mutations in a set of closely related proteins
• Markov chain model created to describe substitutions
• Normalised so that PAM1 = 1 mutation per 100 amino acids
• Extrapolate matrices from model
• Higher PAM number = less closely related
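The extrapolation step amounts to applying the one-step Markov model repeatedly: the PAM-n mutation probabilities are the PAM1 probabilities raised to the n-th power. The toy example below uses a made-up two-letter alphabet with a 1% change per step purely to show the mechanics; real PAM matrices cover 20 amino acids and are then converted to log-odds scores.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, n):
    """Apply the one-step mutation matrix n times (matrix to the power n)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy 'PAM1': each of two letters has a 1% chance of changing per step (assumed values).
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
print(round(toy_pam250[0][0], 3), round(toy_pam250[0][1], 3))
```

After 250 steps the chance of seeing the original letter has dropped close to one half in this toy model, which is why high-numbered PAM matrices suit distantly related sequences.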

[Figure: effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence; panels for PAM 10, 100, 200, 300, 400 and 500.]

More divergent -> Less divergent

BLOSUM 45 / PAM 250
BLOSUM 62 / PAM 160
BLOSUM 90 / PAM 100

Scoring

• Parameters:
  • Match/mismatch
  • Gap opening
  • Gap extending
  • Substitution matrix

Dynamic programming alignments at the EBI

• EMBOSS Pairwise Alignment Algorithms
• European Molecular Biology Open Software Suite
  • Suite of useful tools for molecular biology
  • Command line based
  • Designed to be used as part of scripts/chained programs

• We implement selected tools to provide web-based access

Where to find at the EBI?

http://www.ebi.ac.uk/Tools/sequence.html

EMBOSS align tools

• Global alignment: Needle

• Local alignment: Water

[Figure: web form with program selection, parameters, sequence input and Submit! button.]

Alignment markup in the results:

| Identity

: Positive match

. Negative match

Pairwise Alignments - Example sequences

Pairwise_align1.fsa
Pairwise_align2.fsa

Pages 25-30 in full booklet: Questions 7-10

Dynamic programming sequence search methods at the EBI

• Global alignment: GGSEARCH

• Local alignment: SSEARCH

• Global query vs local database: GLSEARCH

www.ebi.ac.uk/Tools/sss/

Similarity search

[Figure: web form with database selection, sequence input, parameters and Submit! button.]

• Dynamic programming methods are rigorous and guarantee an optimal result

• But have to store the matrix of both sequences in memory

• And evaluate each position of the matrix

• Predictably, this makes them slow and demanding when you are aligning large sequences

Heuristics

• Therefore we need methods of estimating alignments
• Estimation methods are called heuristics
  • Try and take short cuts in an intelligent manner
  • Speed up the search
  • At the possible expense of accuracy

• Accuracy in sequence searches is important for:
  • Aligning the right bits
  • Scoring the alignment correctly
  • Identifying similar sequences - sensitivity

• Going back to our dot plot

• Instead of searching the whole matrix, if we narrow the search space down to a likely region we will improve the speed.

• Of course, we have to identify likely regions – not all alignments will be as nice as that one!

• This is the method used by FASTA
  • W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

FASTA – step 1

• Identify runs of identical sequence and pick regions with highest density of runs

Ktup parameter:
• How many consecutive identities before considered a ‘run’
• Also called ‘word size’

Increase Ktup = faster, but less sensitive
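Step 1 can be sketched by indexing the query's words of length ktup and counting, for every database sequence, how many identical words fall on each diagonal. This is an illustration of the idea only, not the FASTA source code; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count identical ktup-length words per diagonal (subject offset minus query offset)."""
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)      # where each word occurs in the query
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1                   # same diagonal = same offset difference
    return diagonals


hits = ktup_diagonals("ACATAGGT", "TCATAGAT", ktup=2)
print(hits.most_common())
```

The diagonal with the most hits marks the region worth examining further. With a larger ktup there are fewer words to look up and fewer chance hits, which is why the search gets faster but less sensitive.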

FASTA – step 2

• Weight scoring of runs using matrix, trim back regions to those contributing to highest scores

Parameter: Substitution matrix

FASTA – step 3

• Discard regions too far from the highest scoring region

Joining threshold: Internally determined

FASTA – step 4

• Use dynamic programming to optimise alignment in a narrow band encompassing the top scoring regions

Parameters: Gap open, Gap extend, Substitution matrix

• Repeat against all sequences in the database

FASTA – programs available at EBI

• FASTA: “a fast approximation to Smith & Waterman”
  • FASTA – scan a protein or DNA sequence library for similar sequences.
  • FASTX/Y – compare a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward or reverse translation frames.
  • TFASTX/Y – compare a protein sequence to a translated DNA data bank.
  • FASTF – compares ordered peptides (Edman degradation) to a protein databank.
  • FASTS – compares unordered peptides (Mass Spec.) to a protein databank.

• SSEARCH – Rigorous scan of protein or DNA sequence library (S&W Algorithm).

www.ebi.ac.uk/Tools/sss/

Similarity search

[Figure: web form with database selection, sequence input and Submit! button.]

FASTA - results

: Identity

. Similarity

X Filtered

Using FASTA - Example sequence

test_prot.fasta

Page 46 in full booklet: Questions 11-14

BLAST – Basic Local Alignment Search Tool

• Instead of narrowing the dynamic programming search space, BLAST works a different way

• Firstly, it creates a word list both of the exact sequence and high scoring substitutions

Altschul et al (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

BLAST – step 1

• w=3

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Words: SEW, EWR, WRF, ...

Parameter: Word length (w)

Increase = faster, but less sensitive

BLAST – step 1(cont.d)

• w=3
• T=13

SEWRFKHIYRGQPRRHLLTTGWSTFVT

Neighbourhood of the word GQP, with scores:

At or above T: GQP 18, GEP 15, GRP 14, GKP 14, GNP 13, GDP 13

Below T: AQP 12, NQP 12

Parameters: Neighbourhood threshold (T), Substitution matrix
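The word list and the neighbourhood scores on the slide can be reproduced with a short sketch. Only the handful of BLOSUM62 entries needed for the GQP example are included here, so this is not a usable scoring matrix, and the function names are illustrative.

```python
# Subset of BLOSUM62 covering only the pairs needed for the GQP example.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("Q", "Q"): 5, ("P", "P"): 7,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0, ("G", "A"): 0, ("G", "N"): 0,
}


def words(sequence, w=3):
    """All overlapping words of length w in the query."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def pair_score(a, b):
    """Look up a residue pair in either order; raises KeyError outside the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word, candidate):
    """Sum of substitution scores over aligned positions of two words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


def neighbourhood(word, candidates, threshold=13):
    """Keep candidate words scoring at least the threshold T against the query word."""
    scored = [(c, word_score(word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


query = "SEWRFKHIYRGQPRRHLLTTGWSTFVT"
print(words(query)[:3])
print(neighbourhood("GQP", ["GQP", "GEP", "GRP", "GKP", "GNP", "GDP", "AQP", "NQP"]))
```

A lower T admits more neighbourhood words, which increases sensitivity at the cost of more database hits to evaluate.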

BLAST – step 2

• Then it scans database sequences for exact matches with these words

BLAST – step 3

• If two hits are found on the same diagonal the alignment is extended until the score drops by a certain amount

• This results in a High-scoring Segment Pair (HSP)

Parameters: Drop off, Substitution matrix

BLAST – step 4

• If the total HSP score is above another threshold then a gapped extension is initiated

Parameters: Extension threshold (Sg), Substitution matrix

• The steps rule out many database sequences early on
  • Large increase in speed
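The "extend until the score drops by a certain amount" rule can be sketched for a single direction. This is a simplified illustration: it extends to the right only, uses nucleotide match/mismatch scores of 5 and -4 instead of a substitution matrix, and the drop-off value of 10 is an assumption.

```python
def extend_right(query, subject, q_start, s_start, match=5, mismatch=-4, drop_off=10):
    """Ungapped extension along one diagonal; returns (best score, length at best score)."""
    score = best = 0
    length = best_length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_length = score, length   # new high-water mark
        elif best - score > drop_off:
            break                               # fallen too far below the best: stop
    return best, best_length


print(extend_right("ACATAGGT", "TCATAGAT", 1, 1))
print(extend_right("AAAACCCCCCCC", "AAAAGGGGGGGG", 0, 0))
```

The extension tolerates an isolated mismatch but gives up once the running score has fallen more than the drop-off below its best value, so unpromising hits are abandoned early.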

BLAST – programs available at the EBI

• Basic Local Alignment Search Tool
• NCBI-BLAST programs:
  • BLASTP – protein sequence vs. protein sequence library
  • BLASTN – nucleotide query vs. nucleotide database
  • BLASTX – translated DNA vs. protein sequence library

• WU-BLAST programs:
  • BLASTP – protein query vs. protein database
  • BLASTN – nucleotide query vs. nucleotide database
  • BLASTX – translated nucleotide query vs. protein database
  • TBLASTN – protein query vs. translated nucleotide database
  • TBLASTX – translated nucleotide query vs. translated nucleotide database

Combines several parameters into ‘sensitivity’ option

Using BLAST - Example sequence

test_prot.fasta

[residue] Identity

+ Similarity

X Filtered

Differences between BLAST and FASTA

• BLAST
  • Fast
  • Good with proteins
  • Produces good local alignments + short global alignments
  • Produces HSP (reports internal matches in long sequences)
  • Might miss a potential alignment due to ruling out sequences early on in the process
  • Good at finding siblings

• FASTA
  • Not as fast as BLAST
  • Much better with DNA than BLASTN
  • Produces S&W alignments
  • Checks each possible alignment with database sequences
  • Good at finding cousins

When to use what?

[Figure: WU-BLAST, NCBI BLAST and PSI-SEARCH positioned by database size and length.]

When to use what?

[Figure: time to search with WU-BLAST, NCBI BLAST and PSI-SEARCH against PDB, Swiss-Prot, UniRef50, UniRef90, UniRef100, UniProtKB and UniParc.]

Homology and Similarity

Similarity

Homology

Unrelated!

Homology vs. Similarity

Homology:

• Presence of similar features because of common descent

• Cannot be observed since the ancestors no longer exist

• Is inferred as a conclusion based on ‘similarity’

• Homology is like pregnancy: Either one is or one isn’t! (Gribskov – 1999)

Similarity:

• Quantifies a ‘likeness’

• Uses statistics to determine ‘significance’ of a similarity

• Statistically significant similar sequences are considered ‘homologous’

• So far, we’ve talked about scoring alignments
  • Direct function of the algorithm

• But what we want is to assign some kind of quality to that score

Score vs significance

[Figure: two alignments, A A A against A A A and A C A T A A G G C T against A T A C A A G C C T, labelled by high score and high significance.]

“Lies, damn lies, and statistics”

• Not just interested in score...

• ...But how likely we are to get that alignment by chance alone

• It is from this ‘non-random’ alignment that homology is inferred

• Statistics are used to estimate this chance

E-value

• ‘Expect’ value

• Number of alignments with this score or better that we expect to obtain by chance in a search of this size

• Best measure of how good an alignment is

• Often used for ranking results by default

• Calculated in different ways for BLAST and FASTA

• Short query sequences are more likely to be found by chance so have higher E-values

• Affected by parameter values like gap penalties and substitution matrices
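For BLAST-style statistics, the expected number of chance hits is commonly written as E = K·m·n·e^(-λS), where S is the raw score, m the query length and n the database length. The sketch below assumes that formula; the default λ and K are values often quoted for BLOSUM62 with gap penalties 11/1 and are used here purely as illustrative assumptions. FASTA estimates its statistics differently, from the observed score distribution.

```python
import math


def e_value(raw_score, query_length, database_length, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.267, k=0.041):
    """Raw score normalised so it can be compared across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


for score in (40, 60, 80):
    print(score, e_value(score, query_length=100, database_length=1_000_000))
```

The E-value falls exponentially with the score and rises linearly with database size, which is one reason for searching the smallest database likely to contain the sequence of interest. Short queries can only reach low scores, so their best hits tend to have high E-values.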

FASTA statistics

• Compares query sequence with every sequence in database

• As most of these sequences are unrelated it is possible to use the distribution of scores to assign statistical significance

FASTA - histogram

[Figure: FASTA histogram showing the predicted distribution of scores, the observed distribution of scores and the high scoring region.]

BLAST statistics

• Main reason for speed is that it doesn’t compare query with lots of other sequences

• Therefore it pre-estimates statistical values using a random sequence model

“Appears to yield fairly accurate results”

Search Guidelines

Search guidelines 1

• Whenever possible, compare at the amino acid level rather than at the nucleotide level (fasta, blastp, etc…)

• Then with translated DNA sequences (fastx, blastx)

• Search with DNA vs. DNA as the next resort

• And then with translated DNA vs. translated DNA (tfastx, tblastx) as the VERY LAST RESORT!

Search guidelines 2

• Search the smallest database that is likely to contain the sequence(s) of interest

• Use sequence statistics (E()-values) rather than % identity or % similarity, as your primary criterion for sequence homology

Search guidelines 3

• Check that the statistics are likely to be accurate by looking for the highest scoring unrelated sequence
  • Examine the histograms
  • Use programs such as prss3 to confirm the expectation values
  • Search with shuffled sequences (use MLE/Shuffle in fasta), which should have an E() ~1.0
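The shuffling check can be sketched in a few lines. This is a toy in the spirit of prss3, not its implementation: it uses a simple match/mismatch local alignment score with a linear gap penalty (all values assumed) and estimates how often a shuffled copy of the subject scores as well as the real one.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def shuffle_test(query, subject, trials=100, seed=0):
    """Return the real score and an empirical p-value from shuffled subjects."""
    rng = random.Random(seed)
    real = local_score(query, subject)
    letters = list(subject)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys order
        if local_score(query, "".join(letters)) >= real:
            as_good += 1
    return real, (as_good + 1) / (trials + 1)

human = "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFF"
horse = "VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFF"
print(shuffle_test(human, horse))
```

If shuffled sequences regularly score as well as the real hit, the reported significance should not be trusted.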

Search guidelines 4

• Consider searches with different gap penalties and other scoring matrices
  • Use shallower matrices and/or more stringent gaps in order to uncover or force out relationships in partial sequences
  • Use BLOSUM62 instead of BLOSUM50 (or PAM100 instead of PAM250)
  • Remember to change the gap penalty defaults!

MATRIX      open   ext.
BLOSUM50    -10    -2
BLOSUM62    -7     -1
BLOSUM80    -16    -4
PAM250      -10    -2
PAM120      -16    -4

Search guidelines 5

• Homology can be reliably inferred from statistically significant similarity

• But remember:
  • Orthologous sequences have similar functions
  • Paralogous sequences can acquire very different functional roles

• So further work might be needed to tease out details

Search guidelines 6

• Consult motif or fingerprint databases in order to uncover evidence for conservation-critical or functional residues
  • However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology!

• Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data
  • ClustalW
  • MUSCLE
  • T-Coffee
  • Kalign
  • MAFFT
  • Mview (available from EBI FASTA & BLAST services)
  • DBCLUSTAL (available from EBI BLAST services)

Advanced

• In general, the more information we can add to an alignment, the better the result

Conserved regions Structural information Motifs

[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H
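A motif written this way maps directly onto a regular expression. The sketch below is one possible translation using Python's standard `re` module; the function name and the test sequence are made up for illustration.

```python
import re

# [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H written as a regular expression.
MOTIF = re.compile(r"[RTD][DAQ][FEA]ATH")

def find_motif(sequence):
    """Return (start, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(sequence.upper())]

print(find_motif("MKRDFATHLLGSTQEATHPP"))
```

A match only says that the pattern is present; as noted in the search guidelines, a motif hit without overall sequence similarity is not reliable evidence of homology.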

Conserved regions

• We can add a new ‘position’ parameter to the substitution matrix

We can even modify a normal search to generate a position specific scoring matrix, or PSSM
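A minimal sketch of a position specific scoring matrix, assuming an ungapped block of aligned sequences, a uniform background frequency and a simple pseudocount (real tools use observed background frequencies and sequence weighting). The aligned block and all names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score for every residue at every column of an ungapped block."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, window):
    """Sum the position specific scores for one window of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, window))

def best_window(pssm, sequence):
    """Slide the PSSM along a sequence; return (best score, start position)."""
    width = len(pssm)
    return max((score(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

block = ["RDFATH", "TAEATH", "DQAATH", "RDEATH"]
pssm = build_pssm(block)
print(best_window(pssm, "MKTAEATHGG"))
```

Unlike a normal substitution matrix, the same residue scores differently depending on the column: here A is rewarded at the fully conserved fourth position and penalised elsewhere.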

PSI-BLAST

Position Specific Iterative – BLAST:

1. Takes the result of a normal BLAST

2. Aligns them and generates profile of conserved positions

3. Uses this to weight scoring on next iteration

PSI-BLAST

• By adding importance to conserved residues we might be able to find more distant sequences

• But iterate too far and we might be assigning importance where there is none
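The iteration can be illustrated with a deliberately tiny toy. It is not PSI-BLAST: sequences are equal-length and ungapped, the profile is a plain frequency table, and the inclusion threshold of 0.6 is an arbitrary assumption. It only shows how hits from one round change the profile used in the next.

```python
from collections import Counter

def build_profile(members):
    """Per-column residue frequencies for equal-length, ungapped sequences."""
    profile = []
    for column in zip(*members):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile

def profile_score(profile, seq):
    """Average per-column frequency of the residues in seq (0 to 1)."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, seq)) / len(profile)

def iterative_search(query, database, threshold=0.6, max_rounds=5):
    """Rebuild the profile from the query plus current hits until hits settle."""
    hits = []
    for round_no in range(1, max_rounds + 1):
        profile = build_profile([query] + hits)
        new_hits = [s for s in database if profile_score(profile, s) >= threshold]
        print(f"round {round_no}: {new_hits}")
        if new_hits == hits:
            break
        hits = new_hits
    return hits

database = ["AAAAAACC", "AAAACCCC", "CCCCCCCC"]
found = iterative_search("AAAAAAAA", database)
```

The second database sequence is missed in round 1 and only found once the first hit has been folded into the profile. The same mechanism is why iterating too far can pull in sequences unrelated to the original query.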

More sensitive

PSI-BLAST

PHI-BLAST

• Pattern Hit Initiated-BLAST

• User provides a pattern alongside a protein

• Database hits have to contain this pattern, and similarity to rest of sequence

• Results can initiate a PSI-BLAST search as well

PSI-SEARCH

• Smith-Waterman implementation (SSEARCH)

• But with iterative position specific scoring

Using PSI-BLAST - Example sequence

test_prot.fasta

Problem Sequences

Short sequences

• What about short sequences?
  • Depends on their nature:

• Protein
  • Reduce word length and/or increase the E() value cut off
  • Use shallow matrices

• DNA
  • Reduce the word length
  • Ignore gap penalties (force local alignments only)

• Use rigorous methods
  • But ask what you are trying to do!
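Why word length matters for short queries can be seen by counting the exact words (k-mers) a query shares with a subject. This is only a sketch of the seeding idea behind heuristic searches; the sequences and function names are made up.

```python
def words(sequence, k):
    """All overlapping words of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def shared_words(query, subject, k):
    """Words of length k present in both sequences (candidate seeds)."""
    return words(query, k) & words(subject, k)

query = "MKVLA"
subject = "GGMKVAAVLAGG"
for k in (4, 3, 2):
    print(k, sorted(shared_words(query, subject, k)))
```

With a word length of 4 the short query finds no seed at all, while shorter words recover the similarity, at the cost of more chance matches.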

Low complexity regions

• Sometimes biologically relevant, but always likely to skew alignment scoring

• E.g. CA repeats, poly-A tails and Proline rich regions

Good Statistics:

The inset shows good correlation between the observed and expected numbers of scores.

This is the region of the histogram to look out for first when evaluating results.

Bad Statistics:

The inset shows bad correlation between the observed and expected scores in this search.

The spaces between the = and * symbols indicate this poor correlation.

One reason for this can be low complexity regions.

Low complexity regions

• Compensate by filtering the sequence so these regions don’t contribute to scoring
  • Filters: seg, xnu, dust, CENSOR

• But check what you are filtering!
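The principle behind such filters can be sketched with a sliding-window Shannon entropy mask. This is a simplified stand-in, not the seg or dust algorithm; the window size and threshold are arbitrary assumptions.

```python
import math
from collections import Counter

def entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mask_low_complexity(sequence, window=12, threshold=2.2, mask="X"):
    """Replace every window whose entropy is below the threshold with X."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if entropy(sequence[start:start + window]) < threshold:
            masked[start:start + window] = mask * window
    return "".join(masked)

seq = "MKTAYIAKQR" + "P" * 14 + "QISFVKSHFS"
print(seq)
print(mask_low_complexity(seq))
```

Note that residues flanking the proline run are masked too, because they share windows with it. This is the kind of side effect meant by checking what you are filtering.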

Filtered:

Inset showing the effect of using a low complexity filter (seg) and searching the database using the segment with highest complexity.

Note that there is now good agreement between the observed and expected high scores in the search and that the distance between = and * has been significantly reduced.

Using Filters - Example sequence

Filtertest_seq.fsa

Vector contamination

• You think you know what your sequence is...
  • ...but the results are really confusing!

• Maybe you have vector contamination

• Search against known vectors to check
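In practice this check is a similarity search against a vector database. The toy below only illustrates the idea by looking for a long exact stretch shared with a known vector, using `difflib` from the Python standard library. The vector fragment, the insert and the 15-base cut-off are all made up for this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical vector fragment, invented for this example.
TOY_VECTORS = {"toy_vector": "GATCCGGCTGCTAACAAAGCCCGAAAGGAAGCTGAGTTGG"}

def vector_hits(query, vectors=TOY_VECTORS, min_len=15):
    """Report (vector name, start in query, length) for long exact matches."""
    hits = []
    query = query.upper()
    for name, vector in vectors.items():
        matcher = SequenceMatcher(None, query, vector.upper(), autojunk=False)
        match = matcher.find_longest_match(0, len(query), 0, len(vector))
        if match.size >= min_len:
            hits.append((name, match.a, match.size))
    return hits

insert = "CTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
contaminated = TOY_VECTORS["toy_vector"][:25] + insert

print(vector_hits(contaminated))
print(vector_hits(insert))
```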

Vector contamination

Vector Contamination - Example sequences

vectortest_seq1.fsa

vectortest_seq2.fsa

in full booklet: Question 23

Multiple Sequence Alignments

Uses of MSA

• Functional prediction
• Phylogeny
• Structural prediction
• Homology detection
• Protein analysis

• To distinguish between orthology and paralogy

• Ideally, you would build up multiple alignments through weighted sum of pairs (pairwise scores)

• But this is too computationally intensive

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

Weighted Sums of Pairs: WSP

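$$\mathrm{WSP} = \sum_{i} \sum_{j<i} W_{ij} D_{ij}$$

where $D_{ij}$ is the score of the pairwise alignment of sequences $i$ and $j$ and $W_{ij}$ is the weight given to that pair.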
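A weighted sum of pairs can be computed for an existing alignment by scoring every pair of rows and adding up the weighted results. In the sketch below the match, mismatch and gap values are arbitrary assumptions, the pairwise term is a score (higher is better) rather than a distance, and pairs are visited with i < j, which gives the same sum as j < i.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two rows of a multiple alignment, column by column."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column that is gapped in both rows carries no signal
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def weighted_sum_of_pairs(alignment, weights=None):
    """Sum of W_ij * D_ij over all pairs of rows (weights default to 1)."""
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * pair_score(alignment[i], alignment[j])
    return total

alignment = ["AC-T", "ACGT", "A-GT"]
print(weighted_sum_of_pairs(alignment))
print(weighted_sum_of_pairs(alignment, {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 1.0}))
```

Scoring a given alignment is cheap. The expensive part is finding the alignment that maximises this sum over all possible alignments.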

Sequences   Time
2           1 second
3           150 seconds
4           6.25 hours
5           39 days
6           16 years
7           2404 years

Time O(L^N)
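The times above grow by a factor of about 150 for each extra sequence, which is what O(L^N) predicts for sequences of roughly 150 residues. The sketch below reproduces the growth under that assumption; the sequence length and the 1 second baseline are inferred from the table, not stated in it.

```python
def exact_alignment_seconds(n_sequences, length=150, base_seconds=1.0):
    """Time for exact N-way alignment if two sequences take base_seconds."""
    return base_seconds * length ** (n_sequences - 2)

for n in range(2, 8):
    seconds = exact_alignment_seconds(n)
    print(n, f"{seconds:.0f} s", f"{seconds / 86400:.2f} days")
```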


• Therefore we have to use heuristics and progressive alignment methods

Clustal

• >60,000 citations

• Clustal1-Clustal4
  • 1988, Paul Sharp, Dublin

• Clustal V 1992
  • EMBL Heidelberg
  • Rainer Fuchs
  • Alan Bleasby

• Clustal W, Clustal X 1994-2005
  • Toby Gibson, EMBL, Heidelberg
  • Julie Thompson, ICGEB, Strasbourg

• Clustal W and Clustal X 2.0 2006
  • University College Dublin

www.clustal.org

CLUSTAL

• Quick, pairwise alignment of all sequences
  • Line up pairs, with the most similar first

CLUSTAL

• Fix the alignment between pairs and treat as one sequence

CLUSTAL

• Align your fixed pairs with each other

• Note, this is not a phylogram!
  • Only a guide tree for the alignment
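The order in which a progressive method joins sequences can be sketched as a simple clustering of pairwise similarities. This is not Clustal's implementation: the similarity measure (`difflib` ratio on unaligned strings) and the average-linkage joining are stand-ins chosen to keep the example short, and the sequences are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Quick stand-in for a pairwise alignment score (0 to 1)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def guide_order(seqs):
    """Join the most similar clusters first; return the list of joins."""
    clusters = {name: [name] for name in seqs}
    joins = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
            avg = sum(similarity(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)
            if best is None or avg > best[0]:
                best = (avg, c1, c2)
        avg, c1, c2 = best
        # Fix the pair and treat it as a single unit from now on.
        clusters[f"({c1},{c2})"] = clusters.pop(c1) + clusters.pop(c2)
        joins.append((c1, c2, round(avg, 3)))
    return joins

seqs = {"a": "ACGTACGTAC", "b": "ACGTACGTAA",
        "c": "TTTTGGGGCC", "d": "TTTTGGGACA"}
for join in guide_order(seqs):
    print(join)
```

The resulting join order plays the role of the guide tree: it decides which alignments are fixed first, not how the sequences evolved.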

ClustalW at the EBI

ClustalW

Sequence input

Submit!

Help!

ClustalW

Interactive – results in browser, deleted after 24 hours

Email – receive URL to results page, deleted after 7 days

ClustalW

Jalview

ClustalW

Advantages
• Fast
• Not too demanding
• Widely used
• Fine for most uses

Disadvantages
• Fixing of early alignments
  • Propagate errors
• Doesn’t search far
  • Local minima
• Compresses gaps

Use of Clustal/JalView - Example sequences

prot_MSA.fasta
Problem_MSA1.fsa
Problem_MSA2.fsa
Problem_MSA3.fsa
Problem_MSA4.fsa

Other Tools

COFFEE

• Consistency based Objective Function For alignmEnt Evaluation

• Maximum Weight Trace (John Kececioglu)
• Maximise similarity to a LIBRARY of residue pairs
• Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: An objective function for multiple sequence alignments. Bioinformatics 14: 407-422.

COFFEE

• Library of reference pairwise alignments
  • For your given set of sequences

• Objective Function
  • Evaluates consistency between multiple alignment and the library of pairwise alignments
  • Use SAGA to optimise this function

• Weigh depending on quality of alignment

SAGA is another alignment method, using genetic algorithms

COFFEE

• More accurate than ClustalW
  • Much less prone to problems in early alignment stages

• VERY slow!

T-Coffee

• Tree-based COFFEE
  • Heuristic approach to COFFEE

• Gets rid of genetic algorithm portion
• Uses progressive alignments
• Changes algorithm based on number of sequences

T-Coffee

• Much faster than COFFEE
• Avoids some of ClustalW’s pitfalls
• Can take information from several data sources

• Still not that fast
• Can be very demanding of memory etc.

Others

• MUSCLE (Bob Edgar)
  • Iterative/progressive alignment
  • Fast
  • Good for big alignments, proteins

• MAFFT
  • Iterative based Fast Fourier Transform
  • Fast and accurate
  • Good for huge alignments

• Kalign
  • Very fast, local-regions aligning
  • Good for very large numbers of alignments!

Which tool should I use?

Input data                                   Recommendation
2-100 sequences of typical protein length    MUSCLE, T-Coffee, MAFFT, ClustalW
100-500 sequences                            MUSCLE, MAFFT
>500 sequences                               MUSCLE, KALIGN
Small number of unusually long sequences     ClustalW

How to evaluate?

• Use a benchmark
  • BaliBASE

BaliBASE: Thompson, J.D., Plewniak, F. and Poch, O. (1999), NAR and Bioinformatics

• ICGEB Strasbourg

• 141 manual alignments using structures
• 5 sections
• Core alignment regions marked

1. Equidistant (82)

2. Orphan (23)

3. Two groups (12)

4. Long internal gaps (13)

5. Long terminal gaps (11)

Benchmark pitfalls

• Benchmark dataset may not be representative
  • Danger of over-training towards benchmark

• Goldman: Most MSAs have unrealistic gaps
  • Tend towards multiple, independent deletions
  • Insertions are rare

• Sequences shrink in length over evolution
  • No supporting evidence that this is the case

Solutions

• Use phylogenetic data to guide alignment
  • Keep track of changes to ancestor sequences
  • Don’t change them again so easily in descendants

• Probabilistic Alignment Kit (PRANK)
  • webPRANK
  • Better suited for closely related sequences
  • Tied solutions are resolved by choosing one at random
    • Avoids incorrect confidence in result
    • Means alignments might not be reproducible

• Alignments look quite different
  • Might look worse!
  • But gap patterns make sense
  • Gaps are good!

Comparing Alignments - Example sequences

prot_MSA.fasta

Common problems with MSA

• Input format
  • FASTA format
  • Unique sequence identifiers
  • Include sequence!

• Job can’t be found
  • Interactive results deleted after 24hrs
  • Use email
  • Consider other tool
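The input problems listed above are easy to check before submitting a job. The sketch below is one way to do it with plain Python; the function names and messages are made up for illustration and are not part of any EBI tool.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def check_msa_input(text):
    """List the common input problems: missing IDs, duplicates, empty records."""
    problems, seen = [], set()
    for name, seq in parse_fasta(text):
        if not name:
            problems.append("header without identifier")
        elif name in seen:
            problems.append(f"duplicate identifier: {name}")
        seen.add(name)
        if not seq:
            problems.append(f"no sequence for: {name}")
    return problems

good = ">seq1\nMKV\n>seq2\nMKL\n"
bad = ">seq1\nMKV\n>seq1\nMKL\n>seq3\n"
print(check_msa_input(good))
print(check_msa_input(bad))
```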

Common mis-uses of MSA

• Performing a sequence assembly
  • Specialist type of MSA
  • Use other tools (Staden etc.)

• Aligning ESTs to a reference genome
  • Use EST2Genome

• Designing primers
  • Use primer tools (primer3 etc.)

• Aligning two sequences
  • Use a pairwise alignment tool!

Putting it all together

• EB-Eye search
• Sequence retrieval
• Sequence search
• Sequences retrieval
• Multiple sequence alignment
• Analysis

Final remarks

• Don’t assume a single tool will cater for all your needs

• DO change the parameters of the tools

• Remember where the tool excels and what its limitations are

• A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!)

• Crazy input will always give crazy results!

Getting Help

• Database documentation

• Frequently Asked Questions
  • http://www.ebi.ac.uk/help/faq.html

• 2can Support Portal
  • http://www.ebi.ac.uk/2can/

• EBI Support
  • http://www.ebi.ac.uk/support/

• Hands-on training programme
  • http://www.ebi.ac.uk/training/handson/

Thanks!

• Hamish McWilliam and Andrew Cowley
• Vicky Schneider
• Rodrigo Lopez

• EMBL-EBI
• SLING
• You!
