Page 1: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Protein Analysis Course

Day 1: Databases, dotplots and pairwise alignment

Page 2: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Today's timetable

Databases and file formats Exercises

Dotplot and pairwise alignment Exercises

Coffee breaks during the exercises

Page 3: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Databases and file formats

Sequence file format FASTA

UniProt (Universal protein resource) Primary structure

PDB (Protein Data Bank) Tertiary structure

Page 4: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Sequence file format

FASTA (a.k.a. Pearson format)

Most commonly used.

Can easily be constructed by hand if needed.

Straightforward way to store multiple sequences: just concatenate multiple FASTA files.

Content:

The first line (header line) always starts with the symbol ">" followed by identifiers and descriptions.

The header line is ALWAYS exactly one line, placed before the sequence.

The sequence starts after the header line (from the second line), written in single-letter codes.

The sequence is normally divided into multiple lines (often required).

Recommended maximum line length: 80 characters (header line included).
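The rules above can be sketched in a few lines of Python. This is a minimal illustration, not a full FASTA library: the function names `parse_fasta` and `format_fasta` and the two example records are made up for this sketch, and the default wrap width of 60 is simply one common choice below the 80-character recommendation.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are simply concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at `width` characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Made-up example records (not real database entries).
example = (
    ">seq1 hypothetical protein A\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">seq2 hypothetical protein B\n"
    "MSLLTEVETY\n"
)
records = parse_fasta(example)
```

Because a multi-sequence file is just a concatenation of records, the parser only needs to watch for the next ">" line to know where one sequence ends.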

Page 5: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment



Page 6: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Databases: UniProt

UniProt is the universal protein resource, a central repository of protein data created by combining Swiss-Prot, TrEMBL and PIR. This makes it the world's most comprehensive resource on protein information [wikipedia]

UniProt provides three core databases:

The UniProt Archive (UniParc) provides a stable, comprehensive sequence collection without redundant sequences by storing the complete body of publicly available protein sequence data.

The UniProt Reference Clusters (UniRef) databases provide non-redundant reference data collections based on the UniProt Knowledgebase in order to obtain complete coverage of sequence space at several resolutions.

The UniProt Knowledgebase (UniProtKB) is the central database of protein sequences with accurate, consistent, and rich sequence and functional annotation.

Page 7: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

UniProt Archive (UniParc)

Comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world

Currently UniParc contains protein sequences from the following publicly available databases:

EMBL/DDBJ/GenBank nucleotide sequence databases; Ensembl; European Patent Office (EPO); FlyBase; H-Invitational Database (H-Inv); International Protein Index (IPI); Japan Patent Office (JPO); PIR-PSD; Protein Data Bank (PDB); Protein Research Foundation (PRF); RefSeq; Saccharomyces Genome Database (SGD); The Arabidopsis Information Resource (TAIR); TROME; USA Patent Office (USPTO); UniProtKB/Swiss-Prot, UniProtKB/Swiss-Prot protein isoforms, UniProtKB/TrEMBL; Vertebrate Genome Annotation database (VEGA); WormBase

Page 8: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

UniProt Reference Clusters (UniRef)

Sequence clusters, used to speed up similarity searches

UniRef100: each cluster is composed of sequences that are identical.

UniRef90: each cluster is composed of sequences that have at least 90% sequence identity.

UniRef50: each cluster is composed of sequences that have at least 50% sequence identity.

Page 9: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Protein knowledgebase (UniProtKB)

Is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation

Consists of two sections:

Swiss-Prot, which is manually annotated and reviewed by curators.

TrEMBL, which is automatically annotated and is not reviewed.

Page 10: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

UniProt entry

Every line in an entry begins with a two-letter identifier (line code).

UniProt format closely resembles EMBL format except that considerably more information about physical and biochemical properties is provided
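Since every line starts with a two-letter code, an entry can be split into fields with very little code. The sketch below assumes the usual flat-file layout (code in columns 1-2, content from column 6, sequence lines indented, entry terminated by "//"). The entry itself is invented for illustration and its SQ line is simplified; `group_by_line_code` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # "//" terminates the entry
        code, content = line[:2], line[5:].strip()
        if code.strip():
            fields[code].append(content)
        elif content:
            # Lines without a code hold the sequence itself (after the SQ line).
            fields["sequence"].append(content.replace(" ", ""))
    return dict(fields)


# Invented entry for illustration only; not a real UniProt record.
EXAMPLE_ENTRY = "\n".join([
    "ID   EXAMPLE_HUMAN           Reviewed;          10 AA.",
    "AC   X00000;",
    "DE   RecName: Full=Hypothetical example protein;",
    "OS   Homo sapiens (Human).",
    "SQ   SEQUENCE   10 AA;",
    "     MKTAYIAKQR",
    "//",
])
fields = group_by_line_code(EXAMPLE_ENTRY)
```

Here `fields["AC"]` holds the accession line and `fields["sequence"]` the sequence chunks, which can be joined into one string.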

More information here

Page 11: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Databases: PDB

Founded in 1971 by Brookhaven National Laboratory, New York.

Transferred to the Research Collaboratory for Structural Bioinformatics (RCSB) in 1998.

Currently it holds more than 55,000 released structures.

Page 12: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Methods used to solve 3D structures:

X-ray: 86%

NMR: 13%

Electron microscopy: 0.7%

Other: 0.3%

Page 13: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

PDB file format

Text file – you can edit with a text editor e.g. WordPad

Atomic co-ordinates

Rich annotation: citation, experimental method, biological source, etc.
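Because a PDB file is plain text with fixed-width columns, the atomic coordinates can be read by slicing each ATOM line. The following is a minimal sketch assuming the standard column layout of ATOM records (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54, element in 77-78). The sample line is built in code so the columns are guaranteed to line up; its values are made up.

```python
# Fixed-width layout of an ATOM record, written as a format string.
ATOM_FORMAT = ("ATOM  %5d %-4s%1s%3s %1s%4d%1s   "
               "%8.3f%8.3f%8.3f%6.2f%6.2f" + " " * 10 + "%2s")

# Made-up sample atom: backbone nitrogen of residue MET 1 in chain A.
SAMPLE_LINE = ATOM_FORMAT % (1, " N", "", "MET", "A", 1, "",
                             27.340, 24.430, 2.614, 1.00, 9.67, "N")


def parse_atom_line(line):
    """Extract a few fields from one ATOM/HETATM line by column position."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


def parse_atoms(pdb_text):
    """Parse every coordinate line in a PDB file, ignoring annotation records."""
    return [parse_atom_line(line) for line in pdb_text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


atom = parse_atom_line(SAMPLE_LINE)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields may touch when values are wide.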

Page 14: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

FYI: Errors in databases

Be aware of errors in the databases:

Sequence errors:

The error rate of genome projects is 1 per 10,000 nucleotides; the error rate of ESTs is 1 per 100 nucleotides.

Annotation errors:

Automated computer programs do not always give correct annotations.

Swiss-Prot is a protein database curated and annotated manually by biologists. It is the most reliable database, but it is not always up to date.

Page 15: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Go to the course web page and start with exercises given in file: database_exercises.doc

Page 16: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Pairwise sequence alignments

Motivation – Why alignments? Sequence comparison

Dotplot The alignment problem

Pairwise alignment algorithms Exact algorithms Heuristic algorithms Database searches

Web tools: Build alignments using EBI server, Blast at NCBI, EBI, PairsDB, …

Page 17: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Proteins perform most of the functions required in biological systems:

Signaling (kinases, ...)

Enzymes (proteases, …)

Structural (collagen, elastin, …)

Immune system (antibodies, ...)

Storage and transport (hemoglobin, …)

…

Large amount of information available in current databanks.

Goal: Want to extrapolate information about the function of a newly discovered sequence by comparing it to annotated sequences.

Page 18: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Does it make sense?

All functional information is ultimately contained within the sequence.

Proteins are evolutionarily related: selective pressure is on function, and thus on residues with a functional role (e.g. active-site or key structural residues are conserved).

Modular nature of proteins.

Two sequences have the same structure if corresponding residues are similar enough at the physico-chemical level.

Page 19: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Application of sequence alignment

Determining function of newly discovered genetic or protein sequences.

Identification of functional patterns/domains.

Predicting structure of proteins.

Determining evolutionary relationships among genes, proteins, and entire species.

Aligning and comparing sequences, and searching databases for similar sequences: a cornerstone of bioinformatics!

Page 20: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Pairwise alignment

Pairwise alignment = identification of residue-residue correspondence.

For the alignment to be meaningful, the correspondence should reflect the functional or evolutionary relationship

What criteria should we use to obtain biologically meaningful alignments?


Page 21: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Identity: percentage of pairs of identical residues between two aligned sequences.

Similarity: percentage of pairs of similar residues between two aligned sequences. One must define what "similar" means, e.g. by substitutions observed in well-studied, evolutionarily related protein families, or by physico-chemical amino acid properties (hydropathy, size, …).

Homology: two sequences are homologous if and only if they have a common ancestor. It is either yes or no. Two types: orthology and paralogy. Not to be confused with similarity, and not to be mixed up with analogy!
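Identity and similarity are numbers computed from an alignment, whereas homology is a yes/no conclusion. A small sketch of the two percentages follows. Two choices here are assumptions of the sketch, not definitions from the course: gap columns are excluded from the denominator (programs differ on this), and "similar" is defined by the illustrative residue groups in `SIMILAR_GROUPS`.

```python
# Illustrative physico-chemical groups; other groupings are equally valid.
SIMILAR_GROUPS = ["AVLIM", "FWY", "ST", "KRH", "DE", "NQ"]


def _aligned_pairs(a, b):
    """Residue pairs of two aligned sequences, skipping columns with a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]


def _similar(x, y):
    """Identical residues, or residues from the same group, count as similar."""
    return x == y or any(x in group and y in group for group in SIMILAR_GROUPS)


def percent_identity(a, b):
    pairs = _aligned_pairs(a, b)
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def percent_similarity(a, b):
    pairs = _aligned_pairs(a, b)
    if not pairs:
        return 0.0
    return 100.0 * sum(_similar(x, y) for x, y in pairs) / len(pairs)
```

For example, the made-up pair "KLD" / "RIE" has 0% identity but 100% similarity under these groups, which shows why the definition of "similar" must always be stated.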

Page 22: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


The simplest way of comparing two sequences: a dot is placed where both sequence elements are identical.

Gives an overview of all possible alignments.

Each diagonal indicates a possible (ungapped) alignment
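A dotplot is simply a matrix of identity tests, which the following sketch makes explicit. The function names and the text rendering with "*" and "." are choices made for this illustration.

```python
def dotplot(seq_x, seq_y):
    """Matrix with one row per residue of seq_y and one column per residue of seq_x.

    A cell is True where the two residues are identical.
    """
    return [[x == y for x in seq_x] for y in seq_y]


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


plot = dotplot("ABA", "AB")
picture = render(plot)
```

Runs of "*" along a diagonal of the rendered picture correspond to ungapped stretches where the two sequences agree.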

Page 23: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Filtering Out the Noise in Dotplots

Dots may be scored according to a sliding window and a similarity cutoff to reduce noise:

The smaller the window, the more noise.

With large windows, the sensitivity for short sequences is reduced.


Window size = 5, Similarity cutoff = 3
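A windowed dotplot with these settings can be sketched as follows. The assumptions of this sketch: the window runs along the diagonal starting at each cell, and "similarity" is counted as the number of identical residues in the window (a substitution matrix could be used instead).

```python
def windowed_dotplot(seq_x, seq_y, window=5, cutoff=3):
    """Place a dot at (row j, column i) when at least `cutoff` of the `window`
    residue pairs along the diagonal starting there are identical."""
    rows = max(len(seq_y) - window + 1, 0)
    cols = max(len(seq_x) - window + 1, 0)
    plot = []
    for j in range(rows):
        row = []
        for i in range(cols):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append(matches >= cutoff)
        plot.append(row)
    return plot


# A sequence compared with itself: only the main diagonal survives filtering.
self_plot = windowed_dotplot("ABCDEFG", "ABCDEFG")
```

With window = 1 and cutoff = 1 the function reduces to the unfiltered dotplot, which is where isolated chance matches produce the most noise.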




Page 24: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment



Let's find repeated domains in the following sequence:

> SLIT_DROME (P24014):


Page 25: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

DotPlot summary

Comparing a sequence with itself, can be used to identify: Repeated domains, Regions of low complexity (eg, …GYCAAAAAAAAALK…).

Comparing two protein sequences, can be used to identify: Local regions of similarity, Conserved protein domains.

Page 26: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

The Pairwise Alignment Problem

Line up the diagonals using edit operations: substitution (mutation), gap or indel (insertion/deletion).

[Figure: two aligned sequences with a substitution and a deletion marked]

But there are many ways to align 2 sequences, so we need to score alignments to decide which is the best.

Page 27: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Scoring the Edit Operations

For example: identical: +10 (it's good); substitution: +2 for S-A, -1 for K-P, …; gap: -3


Score: +50+2-1+2*(-3) = 45
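The arithmetic above can be reproduced with a short scoring function. The scheme is the one from the example (identity +10, S-A +2, K-P -1, gap -3). The aligned pair of sequences is made up so that it contains five identities, one S-A pair, one K-P pair and two gap positions.

```python
IDENTITY = 10
GAP = -3
# Only the two substitution scores given in the example are known here.
SUBSTITUTION = {frozenset("SA"): 2, frozenset("KP"): -1}


def score_alignment(a, b):
    """Sum the column scores of two aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += IDENTITY
        else:
            # Raises KeyError for pairs the example does not define.
            total += SUBSTITUTION[frozenset((x, y))]
    return total


# Hypothetical alignment: 5 identities, S-A, K-P, and 2 gap positions.
score = score_alignment("ACSDK-FGH",
                        "ACADPEF-H")
```

Using `frozenset` as the key makes the lookup symmetric, so S-A and A-S receive the same score.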

Choosing an appropriate scoring scheme: this is where biological information is introduced (e.g. reward the evolutionarily most likely alignment).

Standard notation:

| for identical

: for very similar (e.g. size and hydropathy)

. for somewhat similar (e.g. size or hydropathy)

Page 28: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Gap penalty

Few long gaps

is better than

many small gaps

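Different scores for:

gap opening, e.g. -5

gap extension, e.g. L*(-1), with L = length of the extension

The gap opening penalty is larger than the gap extension penalty.

[Figure: an alignment with the gap opening and gap extension positions marked; gap scores -5 and -6]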
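This opening-plus-extension scheme is usually called an affine gap penalty. A minimal sketch, assuming that the first gap position pays the opening score and each additional position pays the extension score (so the extension length is the gap length minus one; some programs charge opening and extension on the first position instead):

```python
def affine_gap_score(gap_length, gap_open=-5, gap_extend=-1):
    """Score of one gap: the opening score plus one extension score per
    additional gap position."""
    if gap_length <= 0:
        return 0
    extension_length = gap_length - 1
    return gap_open + extension_length * gap_extend


one_long_gap = affine_gap_score(4)          # one gap of 4 positions
four_short_gaps = 4 * affine_gap_score(1)   # four separate gaps of 1 position
```

Under these example values a single gap of four positions scores better than four separate one-position gaps, which is exactly the preference for few long gaps over many small ones.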

Page 29: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Gap penalty

Can also consider special penalty for gaps at end/beginning of alignment (eg, zero penalty).

Need to be careful in adjusting the gap score to the substitution score: too strong a penalty gives no gaps, too weak a penalty gives too many gaps.

Insertions and deletions have been found to occur in nature at significantly lower frequency than mutations.

Page 30: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Residue Substitution

A substitution score for each amino acid pair gives a substitution matrix.

Most used: based on evolutionary relationship.

Two types: PAM series, BLOSUM series.

Page 31: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

PAM (Percent Accepted Mutation)

PAM1: derived from mutations observed in carefully selected sets of closely related proteins (1572 observed changes in 71 families) (1978).

Idea: observed substitutions are the result of 1 mutation (not many).

PAMn: iterate PAM1 n times to obtain substitution rates between more divergent sequences.

PAM:          0     30    80    110   200   250

% identity:   100   75    60    50    25    20
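"Iterating PAM1 n times" amounts to multiplying the PAM1 mutation probability matrix by itself n times. The sketch below shows this on a toy 3-letter alphabet. The numbers in `TOY_PAM1` are invented (the real matrix is 20 x 20 with Dayhoff's values), and the published PAM scoring matrices are log-odds scores derived from such probabilities, a step not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """Multiply `matrix` by itself n times, starting from the identity matrix."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def fraction_unchanged(matrix):
    """Average diagonal entry: expected fraction of unchanged residues,
    assuming all letters are equally frequent."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Toy "PAM1": each letter stays the same with probability 0.99 (made-up values).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
toy_pam250 = mat_power(TOY_PAM1, 250)
```

The diagonal of `toy_pam250` is much smaller than that of `TOY_PAM1`, mirroring the drop in % identity with increasing PAM distance in the table above.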



Page 32: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

BLOSUM (BLOck Substitution Matrix)

Based on a larger data set than PAM, and more recent (1992).

Different approach from PAM:

not based on an explicit evolutionary model,

based on amino acid substitutions observed in a set of conserved amino acid patterns called blocks.

BLOSUMn: derived from blocks in which sequences more than n% identical are clustered together, so a lower n targets more divergent sequences.

BLOSUM62: empirically shown to be among the best at detecting weak similarity.


Page 33: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Tips for using substitution matrices

Generally, BLOSUM matrices perform better than PAM for local similarity searches.

For database searches, the most commonly used matrix is BLOSUM62.

When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.

Caution: substitution matrices are statistical in nature. In a given alignment, a substitution may or may not correspond to an actual mutation.

From less divergent to more divergent: PAM 1, PAM 120, PAM 250

Page 34: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Pairwise Alignment Algorithms

Given a scoring scheme, an alignment algorithm tries to find the best alignment between 2 sequences according to that scheme.

Exact algorithms: guaranteed to return an alignment with the best possible score.

Heuristic algorithms: not guaranteed to return the best alignments, but they are quicker (and hopefully still return good alignments).

Two types of alignment: Global: forced over the entire length of 2 sequences. Local: between substrings of 2 sequences.

Page 35: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Global vs Local Alignment

Global alignments:

are sensitive to gap penalties,

assume homology,

output everything, either as matches or as gaps,

can be used to compare 2 proteins with the same function (in, e.g., human/mouse).

Local alignments:

can be used to look for conserved domains or motifs in 2 proteins,

to search for local similarities in large sequences,

for database searches,

for scanning an entire genome with a short sequence,

do not output everything, only the best hits.

Page 36: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Exact Algorithms: Dynamic Programming

Exhaustive search among all possible alignments is not possible (e.g. for 2 sequences of 100 and 95 residues: 55 million possible alignments with 5 gaps).

Problem solved by dynamic programming:

1. initialize top row and left column,

2. compute best local scores iteratively,

3. keep track of where best local score comes from,

4. trace back to obtain the best alignments.

Several best solutions may exist: an alignment reported to you may be one among a number of possibilities.
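The four steps can be sketched as a small global alignment (Needleman-Wunsch style) routine. The scoring values here (match +1, mismatch -1, linear gap -1) are placeholders chosen for the sketch, not the course's scheme. Instead of storing pointers in step 3, the traceback recomputes which neighbour each score came from, and when several best alignments exist it returns only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 1: initialize the first column and first row with gap scores.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Step 2: fill each cell from its diagonal, upper and left neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Steps 3-4: trace back from the bottom-right cell, preferring the diagonal.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
```

The table has (n+1) x (m+1) cells and each is filled in constant time, which is why dynamic programming is feasible where exhaustive enumeration is not.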

How can we find the best alignment between 2 sequences?

best global score



The example is from

Example of 2 best solutions:

Page 37: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Local and global Alignment Servers (Exact Algorithm)

Server at EBI: EMBOSS-Align Let´s submit to the

sequence :

Use the Needleman-Wunsch algorithm (1970) and the Smith-Waterman algorithm (1981).



Page 38: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Heuristic Algorithms

Motivations:

Exact algorithms are exhaustive but computationally expensive.

Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning),

and so database scanning requires faster alignment algorithms (at the cost of optimality).

Page 39: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Heuristic Algorithms

Probing a database with a query is similar to aligning a query with a very long sequence.

Main idea: use dynamic programming, but limited to (sub-)sequences which are likely to produce interesting alignments with the query.

Heuristic part of the algorithm: eliminate uninteresting sequences from the search (need to make a guess).

Algorithms:

FASTA: Lipman-Pearson (1985).

BLAST (Basic Local Alignment Search Tool): Altschul et al.

We need fast local alignment methods.

Page 40: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

BLAST Overview

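Many versions for different query-database cases:

blastp: protein query against a protein database

blastn: nucleotide query against a nucleotide database

blastx: nucleotide query, translated to protein, against a protein database

tblastn: protein query against a nucleotide database translated to protein

tblastx: nucleotide query, translated to protein, against a nucleotide database translated to protein

Comes in many flavours. Fast and reliable. Easy to use.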

Page 41: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

BLAST Overview

BLAST computes "an alignment", not necessarily the exact optimal alignment.

Given the query and the database (long sequence):

Find all words of length k (default: k=3 for amino acids and k=11 for DNA) that match the query with a high enough score.

Look for subsequences in the database that contain these words.

Extend the subsequences to see if the match score can be increased.

Compute the total score when no more extensions are possible.

Rank the alignments.
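The seed-and-extend idea can be illustrated with a deliberately simplified sketch. It is not the BLAST implementation: it seeds only on exactly matching words (BLAST also accepts similar words that score above a threshold with a substitution matrix) and it extends only while residues are identical (BLAST extends with scores and a drop-off rule). The function names and the two sequences are made up.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exact shared word of length k."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)  # word -> positions in the subject
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_seed(query, subject, i, j, k=3):
    """Extend a seed without gaps in both directions while residues are identical.

    Returns (query_start, subject_start, length) of the extended match.
    """
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + k, j + k
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


query, subject = "MKTAYIAK", "GGKTAYIAGG"
seeds = find_seeds(query, subject)
hit = extend_seed(query, subject, *seeds[0])
```

Indexing the words first is what makes the search fast: only database positions that share a word with the query are ever examined further.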

Page 42: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment













Let´s submit the query sequence


Page 43: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

E value: Expectation value.

Expected # of alignments with scores equivalent to or better than S to occur by chance. The lower the E value, the more significant the score.

Bit score: S’

The value S’ is derived from the raw alignment score S, but statistical properties of the scoring system have been taken into account. Because bit scores are normalised w.r.t. scoring system, they can be used to compare alignment scores from different searches.

NCBI Blast output help:

Page 44: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

BLAST servers

Pairwise alignment: BLAST: Database screening:


Remark: there is a server with a powerful implementation of Smith-Waterman for database screening. It runs about 50 times slower, but is more sensitive and returns fewer false positives than BLAST.

Page 45: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Position-Specific Iterated Blast:

More sensitive than BLAST, i.e. better at detecting distant relationships.

Computes position-specific substitution matrices (PSSMs) to score matches between query and database sequences.

(Blast uses precomputed substitution matrices, eg


Page 46: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Repeatedly searches the target databases.

At each round: compute a multiple alignment of high-scoring sequences to generate a new PSSM for the next round of searching.

Iterates until no new sequences are found (or until a maximal number of iterations is reached).

Page 47: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Significance of Alignments

Raw scores cannot be used to rank alignments: a bad but long alignment may have a higher score than a good but short alignment.

We need a normalized scoring scheme that allows us to compare alignments and evaluate their biological significance.

Idea:

Probe the database with random sequences.

This gives a distribution of scores (it follows the extreme-value distribution).

Establish a threshold for significance.

Page 48: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Extreme-Value Distribution


Score distribution for random sequences

score of our query

probability that the score of our query is no better than random: P-value

Difficulty: finding a significance threshold.

Page 49: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Quantifying the Significance of Alignments

P-value: the probability of an alignment occurring with score S or better if the aligned-against sequence is random. The lower the P-value, the more significant the alignment.

E-value: the expected number of alignments with scores equivalent to or better than S occurring by chance only. The lower the E-value, the more significant the alignment. E-value = P-value * size of database.

For an alignment with raw score S:
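As a hedged illustration, the sketch below assumes the Karlin-Altschul statistics that BLAST reports: E = K * m * n * exp(-lambda * S) and bit score S' = (lambda * S - ln K) / ln 2, where m is the query length and n the total database length. The values of lambda and K depend on the scoring system; the defaults below (0.267 and 0.041) are illustrative figures often quoted for BLOSUM62 with gaps and should be treated as assumptions, as should the example sizes.

```python
import math


def bit_score(raw_score, lam=0.267, k=0.041):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    """Same E-value expressed with the bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical search: a 300-residue query against 10 million database residues.
e_weak = e_value(40, 300, 10_000_000)
e_strong = e_value(100, 300, 10_000_000)
```

Because lambda and K are absorbed into the bit score, two searches run with different scoring systems can be compared through S', and a larger database raises the E-value of the same alignment.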

Page 50: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Rough Guide for P-values and E-values

P-value (reported by many programs): 0 ≤ P ≤ 1

E-value (reported by some programs, e.g. PSI-BLAST): 0 ≤ E ≤ size of database

P ≤ 10^-100               Exact match

10^-100 < P < 10^-50      Sequences very nearly identical, e.g. alleles or SNPs

10^-50 < P < 10^-10       Closely related sequences, homology certain

10^-5 < P < 10^-1         Usually distant relatives

P > 10^-1                 Match probably insignificant

E ≤ 0.02                  Sequences probably homologous

0.02 ≤ E ≤ 1              Homology can't be ruled out

E > 1                     This match would be obtained by chance

Page 51: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment

Rules of thumb for pairwise alignment

Use server defaults in the absence of any other information.

Adjust the substitution matrix to the expected divergence of the 2 sequences. Use BLOSUM62 if there is no a priori information.

For distantly related sequences, use PSI-BLAST rather than BLAST. If PSI-BLAST doesn't give you anything, use GTG.

There are many ways of aligning 2 sequences. A returned alignment is not the absolute truth.

Inspect the alignment from the biologist's perspective.

Page 52: Protein Analysis Course Day 1: Databases, dotplots and pairwise alignment


Go to the course web page and start with exercises given in file: p_alignment_exercises.doc
