NCBI FieldGuide

Genome Resources and Sequence Similarity

MapViewer

LocusLink

UniGene

Homologene

Basic Local Alignment Search Tool

Gene database

Topics

• Why use sequence similarity?

• BLAST algorithm

– blastn, blastp, megablast

• BLAST statistics

• BLAST output

• Examples

Why Do We Need Sequence Similarity Searching?

• To identify and annotate sequences

• To evaluate evolutionary relationships

• Other:

– model genomic structure (e.g., Spidey)

– check primer specificity in silico

: NCBI’s tool

Global vs Local Alignment

[Figure: Seq 1 and Seq 2 compared by global alignment and by local alignment]

Global vs Local Alignment

Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)

Global

Seq1:  1   W--HEREISWALTERNOW 16
           W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE 21

Local

Seq1:  1 W--HERE  5
         W  HERE
Seq2:  3 WASHERE  9

Seq1:  1 W--HERE  5
         W  HERE
Seq2: 15 WISHERE 21

Global programming algorithm

Global Dynamic Programming

• Full sequence must be aligned

• Gaps at ends are penalized as much as internal ones

• F(n,m) is the best score for alignment

• Traceback can give >1 correct alignment

• Used to examine closely related sequences

• http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
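The global recurrence can be sketched in a few lines. This is only an illustration, assuming a linear gap penalty and example scores (match +1, mismatch -1, gap -2); the function name and the scores are not taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Fill the global dynamic-programming matrix F and return F(n, m)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # End gaps are penalized like internal ones, so the first row and
    # column grow by one gap penalty per position.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]


print(global_alignment_score("ACGT", "ACT"))
```

Because the whole of both sequences must be aligned, the answer is always read from the bottom-right cell F(n, m), even when that score is negative.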

Local Alignment – Smith-Waterman

Local alignments - How

• Notice the top row and left column are now filled with 0 (if the best alignment has a negative score, it’s better to start a new one)

• The alignment can end anywhere in the matrix

• Instead of starting at F (n, m), start traceback at highest value of F (i, j); the traceback ends when you hit a 0
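The three points above can be sketched as follows. This is a minimal illustration, assuming a linear gap penalty and example scores (match +1, mismatch -1, gap -2) that are not taken from the slides; the function name is hypothetical.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman sketch: return (best score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    # Top row and left column stay 0: an alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option restarts the alignment instead of going negative.
            F[i][j] = max(0, F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback starts at the highest cell and ends when a 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_alignment("WHEREISWALTERNOW", "HEWASHEREBUTNOWISHERE"))
```

With these assumed scores the best local alignment of the two example sequences is the exact match HERE; a scoring scheme with cheaper gaps would be needed to recover the longer W--HERE alignment shown on the earlier slide.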

Heuristic alignment algorithms

• Shortcuts are important

– Searching a sequence of length 1000 against a database with 10^8 residues requires approximately 10^11 matrix cells. At ten million (10^7) matrix cells a second, that is 10^4 seconds, or about 3 hours.

• BLAST – the heuristic is based on the observation that true match alignments are very likely to contain somewhere within them a short stretch of identities. Look for short stretches to serve as seeds to extend.

Seeding

• BLAST takes your query and breaks it down into words of fixed length (3 for protein, 11 for nucleotide)

• It scans through a database looking for a word from the query set with some minimum score T. When it finds one, it begins a “hit” extension to extend the possible match in both directions, stopping at the maximum scoring extension.

Extension

• The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming.

• These locally optimal alignments are called high-scoring segment pairs, or HSPs.

• Sometimes only a portion of a sequence is returned – this is the reason you need to look carefully at your BLAST alignments.

Alignment example

• The quick brown fox jumps over the lazy dog.

• The quiet brown cat purrs when she sees him.

• Matches = +1; mismatches = -1; ignore spaces and do not allow gaps.

• Assume the seed is the capital T, and extend the alignment.

• You’ll hit a mismatch at c/e: should you continue, and how far?

• Generate a variable X to measure how far the score drops off.

• Set X = 5 and try the alignment…

• Set X = 2 and try again…

• A large X value lets extensions run further past mismatches, which costs time; in practice, speed is more often modulated by word size and other parameters…
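The exercise above can be tried in code. This is a sketch under one assumption that the slide does not spell out: extension stops as soon as the running score has fallen X or more below the best score seen so far (implementations differ on whether the test is "at least X" or "more than X"). The function name is hypothetical.

```python
def extend_right(a, b, x, match=1, mismatch=-1):
    """Ungapped extension from position 0 with an X drop-off rule."""
    score = best = 0
    best_len = 0
    for k, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, k      # new best-scoring end point
        elif best - score >= x:            # assumed stopping rule
            break
    # Trim back to the best-scoring prefix.
    return best, a[:best_len], b[:best_len]


s1 = "The quick brown fox jumps over the lazy dog.".replace(" ", "")
s2 = "The quiet brown cat purrs when she sees him.".replace(" ", "")

print(extend_right(s1, s2, x=5))
print(extend_right(s1, s2, x=2))
```

With X = 5 the extension survives the c/e and k/t mismatches and reaches the end of "brown"; with X = 2 it gives up after "Thequi".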

Gapped BLAST – a time saver

• Extension is costly, so gapped BLAST uses a two-hit method that requires two hits within a distance (A)

• A gapped extension takes much longer to execute than an ungapped one, but overall fewer extensions are run – a time saver

• Gapped BLAST requires two non-overlapping hits of at least score (T) within distance A of one another before ungapped extension of the second hit

• T is adjustable; the higher the T, the smaller the search space

Evaluation

• Once seeds are extended to generate alignments, these alignments are tested for statistical significance.

• Can establish thresholds for reporting

The Flavors of BLAST

• Standard BLAST

– traditional “contiguous” word hit

– position independent scoring

– nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)

• Megablast

– optimized for large batch searches

– can use discontiguous words

• PSI-BLAST

– constructs PSSMs automatically; uses as query

– very sensitive protein search

• RPS BLAST

– searches a database of PSSMs

– tool for conserved domain searches

BLASTN variations

• BLASTN seeds are always identical words; T is never used

• To make BLASTN faster, increase word size; to make it more sensitive, decrease word size

• MegaBLAST increases word size to 28

• The minimum word size is 7

• http://monod.uwaterloo.ca/papers/02ph.pdf

BLASTP implementation

• To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default)

• Affine gaps

– Decreased penalty for gap extension relative to gap introduction

Also, FASTA

• Similar to Gapped BLAST – except bigger neighborhood

• Generates a lookup table to locate all identically matching words of length ktup (protein: 1-2; DNA: 4-6)

• Once identified, looks for diagonals with many mutually supporting word matches

• Extensions similar to BLAST

Scoring Matrices

• A scoring matrix specifies a score, s_ij, for aligning residue i with residue j.

• Choice of matrix depends on the divergence level of desired/expected hits.

• Examples: PAM, BLOSUM

• Both can be modified for different divergence levels (e.g., BLOSUM40, BLOSUM62)

• Advice: try several matrices when possible.

Dayhoff Family of Matrices

• Dayhoff model measures sequence evolution in units of “PAMs”

– One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed.

• Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change)

– Mutability of alanine was defined to be 100.

Dayhoff Family of Matrices

Problems with the original Dayhoff scheme

• It does not consider the genetic code.

– Not all amino acid substitutions can occur by a single nucleotide substitution event.

• Parameters were estimated from a small sample of closely related proteins.

• Evolution at the “average site” of the “average protein” is being modeled.

BLOSUM Scoring Matrices

Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)

Basic Local Alignment Search Tool

• Widely used similarity search tool

• Heuristic approach based on Smith Waterman algorithm

• Finds best local alignments

• Provides statistical significance

• All combinations (DNA/Protein) query and database.

– DNA vs DNA

– DNA translation vs Protein

– Protein vs Protein

– Protein vs DNA translation

– DNA translation vs DNA translation

• www, standalone, and network clients

How BLAST Works

• Make lookup table of “words” for query

• Scan database for hits

• Ungapped extensions of hits (initial HSPs): X dropoff (X1)

• Gapped extensions (no traceback): X dropoff (X2)

• Gapped extensions (traceback; alignment details): X dropoff (X3)

Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAA

Make a lookup table of words (11-mer):

GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
. . .

WORD SIZE    default   minimum
blastn       11        7
megablast    28        8
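The lookup-table step can be sketched as below. The helper names are hypothetical, and the scan is a plain dictionary lookup rather than the optimized structure BLAST actually uses.

```python
from collections import defaultdict


def word_table(query, w=11):
    """Map every w-mer in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def find_hits(table, subject, w=11):
    """Scan a subject sequence; return (query_pos, subject_pos) exact word hits."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


query = "GTACTGGACATGGACCCTACAGGAA"
table = word_table(query)
for word, positions in table.items():
    print(word, positions)
```

A 25-base query yields 25 - 11 + 1 = 15 overlapping words; each hit found by the scan is a seed for extension.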

Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Make a lookup table of words

Word size = 3 (default); word size can only be 2 or 3

GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...

Neighborhood Words: LTV, MTV, ISV, LSV, etc.

[ -f 11 = blastp default ]
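Neighborhood words can be enumerated by scoring every possible word against a query word and keeping those at or above the threshold T. The sketch below assumes BLOSUM62 (lower triangle typed in for the 20 standard amino acids) and T = 11; the function names are hypothetical, and BLAST itself builds this set far more efficiently.

```python
from itertools import product

# Lower triangle of BLOSUM62; row i lists scores against letters 0..i.
BLOSUM62_ROWS = """\
A 4
R -1 5
N -2 0 6
D -2 -2 1 6
C 0 -3 -3 -3 9
Q -1 1 0 0 -3 5
E -1 0 0 2 -4 2 5
G 0 -2 0 -1 -3 -2 -2 6
H -2 0 1 -1 -3 0 0 -2 8
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""


def load_matrix(rows):
    """Expand the lower triangle into a symmetric {(a, b): score} dict."""
    lines = rows.strip().splitlines()
    letters = [line.split()[0] for line in lines]
    scores = {}
    for i, line in enumerate(lines):
        for j, value in enumerate(line.split()[1:]):
            scores[letters[i], letters[j]] = int(value)
            scores[letters[j], letters[i]] = int(value)
    return letters, scores


LETTERS, S = load_matrix(BLOSUM62_ROWS)


def word_score(w1, w2):
    """Sum of pairwise substitution scores for two equal-length words."""
    return sum(S[a, b] for a, b in zip(w1, w2))


def neighborhood(word, threshold=11):
    """All words scoring at least `threshold` against `word`."""
    found = {}
    for letters in product(LETTERS, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:
            found[candidate] = score
    return found


for w, sc in sorted(neighborhood("YLS").items(), key=lambda kv: -kv[1]):
    print(w, sc)
```

Raising the threshold shrinks the neighborhood (faster, less sensitive); lowering it has the opposite effect.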

Minimum Requirements for a Hit

• Nucleotide BLAST requires one exact match

ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT            one exact match

• Protein BLAST requires two neighboring matches within 40 aa [ -A 40 = blastp default ]

GTQITVEDLFYNI
neighborhood words: SEI, YYN (two matches)

BLASTP Summary

Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…

Example query words: YLS, HFL

Neighborhood words (neighborhood score threshold T (-f) = 11)

YLS 15, YLT 12, YVS 12, YIT 10, etc. …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc. …

High-scoring pair (HSP)

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333

Gapped extension with trace back (final HSP)

Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337

Scoring Systems - Nucleotides

Identity matrix [ -r 1 -q -3 ]

     A    G    C    T
A   +1   -3   -3   -3
G   -3   +1   -3   -3
C   -3   -3   +1   -3
T   -3   -3   -3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||     raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
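The raw score on this slide can be reproduced with a short function. One assumption is labeled loudly here: the gap column is charged -3, the same as a mismatch, because that is what the slide's arithmetic (19 - 9) implies; real BLAST searches use separate gap opening and extension costs. The function name is hypothetical.

```python
def raw_score(query, subject, reward=1, penalty=-3, gap=-3):
    """Score two pre-aligned strings column by column ('-' marks a gap)."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score += gap        # assumed: gap column costs the same as a mismatch
        elif q == s:
            score += reward     # identity: +1 (-r 1)
        else:
            score += penalty    # mismatch: -3 (-q -3)
    return score


print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))
```

The alignment has 19 identical columns, two mismatches and one gap column, so the score is 19 - 3 × 3 = 10.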

Scoring Systems - Proteins

Position Independent Matrices

PAM Matrices (Percent Accepted Mutation)

• Derived from observation; small dataset of alignments

• Implicit model of evolution

• All calculated from PAM1

• PAM250 widely used

BLOSUM Matrices (BLOck SUbstitution Matrices)

• Derived from observation; large dataset of highly conserved blocks

• Each matrix derived separately from blocks with a defined percent identity cutoff

• BLOSUM62 - default matrix for BLAST

Position Specific Score Matrices (PSSMs)

• PSI- and RPS-BLAST

BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions (e.g., D/F)

Positive for more likely substitutions (e.g., Y/F)

Position-Specific Score Matrix

DAF-1

Serine/Threonine protein kinases catalytic loop

[Figure: catalytic loop residues annotated with PSSM scores]

Position-Specific Score Matrix

catalytic loop

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
435 K -1  0  0 -1 -2  3  0  3  0 -2 -2  1 -1 -1 -1 -1 -1 -1 -1 -2
436 E  0  1  0  2 -1  0  2 -1  0 -1 -1  0  0  0 -1  0  0 -1 -1 -1
437 S  0  0 -1  0  1  1  0  1  1  0 -1  0  0  0  2  0 -1 -1  0 -1
438 N -1  0 -1 -1  1  0 -1  3  3 -1 -1  1 -1  0  0 -1 -1  1  1 -1
439 K -2  1  1 -1 -2  0 -1 -2 -2 -1 -2  5  1 -2 -2 -1 -1 -2 -2 -1
440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1  0 -3  7 -1 -2 -3 -1 -1
441 A  3 -2  1 -2  0 -1  0  1 -2 -2 -2  0 -1 -2  3  1  0 -3 -3  0
442 M -3 -4 -4 -4 -3 -4 -4 -5 -4  7  0 -4  1  0 -4 -4 -2 -4 -1  2
443 A  4 -4 -4 -4  0 -4 -4 -3 -4  4 -1 -4 -2 -3 -4 -1 -2 -4 -3  4
444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5  0 -5
445 R -4  8 -3 -4  0 -1 -2 -3 -2 -5 -4  0 -3 -2 -4 -3 -3  0 -4 -5
446 D -4 -4 -1  8 -6 -2  0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5
447 I -4 -5 -6 -6 -3 -4 -5 -6 -5  3  5 -5  1  1 -5 -5 -3 -4 -3  1
448 K  0  0  1 -3 -5 -1 -1 -3 -3 -5 -5  7 -4 -5 -3 -1 -2 -5 -4 -4
449 S  0 -3 -2 -3  0 -2 -2 -3 -3 -4 -4 -2 -4 -5  2  6  2 -5 -4 -4
450 K  0  3  0  1 -5  0  0 -4 -1 -4 -3  4 -3 -2  2  1 -1 -5 -4 -4
451 N -4 -3  8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5
452 I -3 -5 -5 -6  0 -5 -5 -6 -5  6  2 -5  2 -2 -5 -4 -3 -5 -3  3
453 M -4 -4 -6 -6 -3 -4 -5 -6 -5  0  6 -5  1  0 -5 -4 -3 -4 -3  0
454 V -3 -3 -5 -6 -3 -4 -5 -6 -5  3  3 -4  2 -2 -5 -4 -3 -5 -3  5
455 K -2  1  1  4 -5  0 -1 -2  1 -4 -2  4 -3 -2 -3  0 -1 -5 -2 -3
456 N  1  1  3  0 -4 -1  1  0 -3 -4 -4  3 -2 -5 -2  2 -2 -5 -4 -4
457 D -3 -2  5  5 -1 -1  1 -1  0 -5 -4  0 -2 -5 -1  0 -2 -6 -4 -5
458 L -3 -1  0 -3  0 -3 -2  3 -4 -2  3  0  1  1 -2 -2 -3  5 -1 -3

[ >./blastpgp -i NP_499868.2 -d nr -j 3 -Q NP_499868.pssm ]

Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution

[Figure: number of alignments plotted against score (S), marking your score and the expected number of random hits]

(applies to ungapped alignments)

$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$

$K$ = scale for search space

$\lambda$ = scale for scoring system

$S'$ = bit score = $(\lambda S - \ln K)/\ln 2$

Expect Value

E = number of database hits you expect to find by chance, ≥ S

More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
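The two forms of the E-value equation can be checked against each other numerically. In this sketch m and n are taken to be the query and database lengths, and the values λ = 0.267, K = 0.041 are assumed as typical for gapped BLOSUM62 searches; neither the parameter values nor the lengths below come from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041      # assumed scoring-system parameters
m, n = 130, 500_000_000       # hypothetical query and database lengths

raw = 103
bits = bit_score(raw, LAMBDA, K)
print(round(bits, 1))
print(expect_from_raw(raw, m, n, LAMBDA, K))
print(expect_from_bits(bits, m, n))
```

The last two lines print the same number up to rounding, which is the point of the bit score: once S' is known, E depends only on the search space m × n and not on the scoring system.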

Gapped Alignments

Gapping provides more biologically realistic alignments

Gapped BLAST parameters are simulated for each scoring matrix

Affine gap costs = -(a+bk)

a = gap open penalty

b = gap extend penalty

k = gap length

A gap of length 1 receives the score -(a+b)

An Alignment BLAST Cannot Make

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason: no contiguous exact match of 7 bp.
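The stated reason can be checked directly on the alignment shown. The sketch below measures the longest run of identical columns between the two sequences as aligned (same offset, no gaps); it does not search other offsets, and the function name is hypothetical.

```python
from itertools import groupby

seq1 = ("GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
        "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
        "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC")
seq2 = ("GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
        "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
        "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC")


def longest_exact_run(a, b):
    """Length of the longest stretch of consecutive identical columns."""
    matches = (x == y for x, y in zip(a, b))
    runs = [sum(1 for _ in group) for same, group in groupby(matches) if same]
    return max(runs, default=0)


print(longest_exact_run(seq1, seq2))
```

The longest run is shorter than the minimum blastn word size of 7, so no seed exists on this diagonal and the alignment is never started.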

An Alignment BLAST Can Make

Solution: compare protein sequences; BLASTX

BLAST 2 Sequences (blastx) output:

Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3

Other BLAST Algorithms

• Megablast

• Discontiguous Megablast

• PSI-BLAST

Megablast: NCBI’s Genome Annotator

• Long alignments of similar DNA sequences

• Greedy algorithm

• Concatenation of query sequences

• Faster than blastn; less sensitive

Discontiguous Megablast

• Uses discontiguous word matches

• Better for cross-species comparisons

Discontiguous (Cross-species) MegaBLAST

Discontiguous Word Options

Templates for Discontiguous Words

W = word size; # matches in template

t = template length

W    t    type         template
11   16   coding       1101101101101101
11   16   non-coding   1110010110110111
12   16   coding       1111101101101101
12   16   non-coding   1110110110110111
11   18   coding       101101100101101101
11   18   non-coding   111010010110010111
12   18   coding       101101101101101101
12   18   non-coding   111010110010110111
11   21   coding       100101100101100101101
11   21   non-coding   111010010100010010111
12   21   coding       100101101101100101101
12   21   non-coding   111010010110010010111

Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
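A template can be read as a mask: positions marked 1 must match, positions marked 0 are ignored. The sketch below applies the W = 11, t = 16 coding template listed above; the function names are hypothetical and this is not how MegaBLAST stores its words internally.

```python
TEMPLATE = "1101101101101101"   # W = 11, t = 16, coding


def masked_word(window, template=TEMPLATE):
    """Keep only the bases under the '1' positions of the template."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, keep in zip(window, template) if keep == "1")


def discontiguous_words(seq, template=TEMPLATE):
    """Masked word for every window of template length along the sequence."""
    t = len(template)
    return [masked_word(seq[i:i + t], template) for i in range(len(seq) - t + 1)]


a = "ACGTACGTACGTACGT"
b = "ACTTACGTACGTACGT"   # differs from a only at index 2, a '0' position
print(masked_word(a) == masked_word(b))
```

In the coding templates every third position is a 0, so two sequences that differ only at those positions (for example at codon third positions) still produce the same word and can seed an alignment.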

BLAST Databases: Nucleic Acid

nr (nt)

• traditional GenBank divisions

• NM_ and XM_ RefSeqs

dbest

• EST division

htgs

• HTG division

gss

• GSS division

chromosome

• NC_ RefSeqs

env_nr

• environmental sample[filter]

• e.g., 16S rRNA

BLAST Databases: Protein

nr (non-redundant protein sequences)

• GenBank CDS translations

• NP_ RefSeqs

• Outside databases: PIR, Swiss-Prot, PRF

• PDB (sequences from structures)

env_nr (environmental sample[filter])

Web BLAST: BLASTP

1. Paste in the query sequence

>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS

2. Select the appropriate db

3. BLAST

Format Options

BLAST Formatting Page

102347584-927-19372.BLASTQ3

RPS-BLAST (CD search) Results Summary

partial sequence

partial domain

RPS-BLAST Results (CDD)

DNA_mis_repair

complete sequence

BLAST Output: Graphic Overview

Sort results by taxonomy

same database sequence

BLAST Output: Descriptions

sorted by e values

8 × 10^-58

Bacterial mismatch repair proteins

Linkouts

E value cutoff

GEO

UniGene

Structure

BLAST Output: Alignments

>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615

Score = 44.3 bits (103), Expect = 5e-05
Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)

Query:   9 LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59
           L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L  +QQ +E+ L
Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338

Annotations: positive (conservative) substitution; negative substitution; gap

BLAST Output: Alignments & Filter

low complexity sequence filtered

Advanced Options

Limit to Organism

Example Entrez Queries

proteins all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]

Nucleotide only:

biomol mrna[Properties]
biomol genomic[Properties]

Other Advanced

-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments

Filter options

PSI-BLAST

Position-specific Iterated BLAST

Example: Confirming relationships of purine nucleotide metabolism proteins

PSI-BLAST

>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK

0.005 E value cutoff for PSSM

RESULTS: Initial BLASTP

Same results as protein-protein BLAST; different format

Results of First PSSM Search

Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

Tenth PSSM Search: Convergence

Just below threshold, another nucleotide metabolism enzyme

Check to add to PSSM

Reverse PSI-BLAST (RPS-BLAST)

Adenosine/AMP Deaminase Domain

AMP Deaminases

. . .

PHI-BLAST

>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK

[GA]xxxxGK[ST]
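The pattern reads: G or A, any four residues, then G, K, and S or T. A quick way to see where it occurs in a sequence is to translate it to a regular expression. The converter below is a hypothetical helper that handles only the two elements used on this slide (bracketed residue sets and lowercase x); the full PHI-BLAST pattern syntax has more features.

```python
import re


def phi_pattern_to_regex(pattern):
    """Translate a slide-style pattern into a compiled regex ('x' = any residue)."""
    return re.compile(pattern.replace("x", "."))


motif = phi_pattern_to_regex("[GA]xxxxGK[ST]")

# Fragment of the CED4 sequence above, around the motif.
fragment = "FLFLHGRAGSGKSVIASQ"
match = motif.search(fragment)
print(match.start(), match.group())
```

PHI-BLAST uses such a match as the anchor for the search, so only database sequences that contain the pattern and also align well around it are reported.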

Genome BLAST

What is an HMM?

• Hidden Markov Model

• Important to know: it's a generalization of the profile in terms of statistical weights, rather than scores.

• At each position, the profile HMM gives the probability of finding a particular amino acid, an insertion, or a deletion

• HMMs are very popular in molecular data analysis but are not specific to this field

A Characterization Example

How could we characterize this (hypothetical) family of nucleotide sequences?

– Keep the Multiple Alignment

– Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC]

• But what about?

– T G C T - - A G G vs.

– A C A C - - A T C

– Try a consensus sequence: A C A - - - A T C

• Depends on distance measure

Example borrowed from Salzberg, 1998

HMMs to the rescue!

Transition probabilities

Emission probabilities

Insert (Loop) States

Scoring our simple HMM

• #1 - “T G C T - - A G G” vs. #2 - “A C A C - - A T C”

– Regular Expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]):

• #1 = Member; #2 = Member

– HMM:

• #1 = Score of 0.0023%; #2 = Score of 4.7% (Probability)

• #1 = Score of -0.97; #2 = Score of 6.7 (Log odds)
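The regular-expression half of this comparison is easy to reproduce. The sketch below only tests membership; it does not implement the HMM, whose transition and emission values appear on the slides as a figure. The function name is hypothetical.

```python
import re

FAMILY = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def is_member(aligned):
    """True if the sequence (gaps and spaces removed) fits the expression."""
    residues = aligned.replace("-", "").replace(" ", "")
    return FAMILY.fullmatch(residues) is not None


print(is_member("T G C T - - A G G"))
print(is_member("A C A C - - A T C"))
```

Both sequences come back as members, which is the limitation the slide points out: the expression gives a yes/no answer and cannot say that #2 is a far more typical family member than #1, whereas the HMM scores separate them.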

Standard Profile HMM Architecture

• Three types of states:

– Match

– Insert

– Delete

• One delete and one match per position in model

• One insert per transition in model

• Start and end “dummy” states

Example borrowed from Cline, 1999

Aligning and Training HMMs

• Training from a Multiple Alignment

• Aligning a sequence to a model

– Can be used to create an alignment

– Can be used to score a sequence

– Can be used to interpret a sequence

• Training from unaligned sequences (not included in current HMMer package)

Training from an existing alignment

• This process is what we’ve been seeing up to this point.

– Start with a predetermined number of states in your HMM.

– For each position in the model, assign a column in the multiple alignment that is relatively conserved.

– Emission probabilities are set according to amino acid counts in columns.

– Transition probabilities are set according to how many sequences make use of a given delete or insert state.

Remember the simple example

• Chose six positions in model.

• Highlighted area was selected to be modeled by an insert due to variability.

Aligning sequences to a model

• Now that we have a profile model, let’s use it!

• Try every possible path through the model that would produce the target sequence – Keep the best one and its probability.

Profile HMMs

A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A

Probability

A  0.8  0.2  0.0  0.0  0.0  0.0  0.0  0.0  0.8
T  0.2  0.2  0.2  1.0  0.0  0.2  0.2  0.2  0.0
G  0.0  0.6  0.0  0.0  0.0  0.0  0.0  0.8  0.2
C  0.0  0.0  0.8  0.0  0.8  0.0  0.8  0.0  0.0
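The probability table is simply the per-column base frequencies of the five aligned sequences. The sketch below reproduces it, assuming (as the table does) that gaps count in the denominator, so a column with four C's and one gap gives C = 0.8; a full profile HMM would treat gaps through insert and delete states instead. The function name is hypothetical.

```python
from collections import Counter

ALIGNMENT = [
    "ATCTC-CGA",
    "AGCT--TGG",
    "TGTTCTCTA",
    "AACTC-CGA",
    "AGCTC-CGA",
]


def column_frequencies(rows, alphabet="ATGC"):
    """Per-column frequency of each base; gaps count toward the total."""
    n = len(rows)
    table = []
    for column in zip(*rows):
        counts = Counter(column)
        table.append({base: counts[base] / n for base in alphabet})
    return table


freqs = column_frequencies(ALIGNMENT)
for base in "ATGC":
    print(base, "  ".join(f"{col[base]:.1f}" for col in freqs))
```

Printing the rows in the order A, T, G, C gives the same four lines as the table above.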

[Figure: profile HMM built from the alignment. Emission probabilities shown for the states:

A .8, C 0, G 0, T .2
A .2, C 0, G .6, T .2
A 0, C .8, G 0, T .2
A 0, C 0, G 0, T 1
A 0, C .8, G 0, T .2
A 0, C 0, G .8, T .2
A .8, C 0, G .2, T 0
A 0, C .8, G 0, T .2

Transition probabilities shown: 1.0, 0.8 and 0.2]

Example sequence: T T T T - T T T G

Score = 8.2 × 10^-6

Consensus score = 0.1

Scores generally calculated with base e logarithms

The HMM must first be “trained” using a database of known signals.

Consensus sequences for all signals are needed.

Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors.

Transition probabilities between all connected states must be estimated.

Pseudocounts prevent the “regular expression” problem of non-matching or zero probability of a given amino acid…

Gene Finding Software

• GENSCAN (HMMs)

• HMMGENE (HMMs)

• GENMARK (HMMs)

• GRAIL (Neural Net)

HMM resources

• UC Santa Cruz (David Haussler group)

– SAM-02 server. Returns alignments, secondary structure predictions, HMM parameters, etc.

– SAM HMM building program (requires free academic license)

• Washington U. St. Louis (Sean Eddy group)

– Pfam. Large database of precomputed HMM-based alignments of proteins

– HMMer, program for building HMMs

• Gene finders and other HMMs (more later)

http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html

http://hmmer.janelia.org/

http://pfam.janelia.org/
