blosum substitution matrix pab
TRANSCRIPT
July 23, 2003 Patricia Babbitt, PhD, Univ. of Calif., San Francisco
Introduction to Bioinformatics: Protein Informatics
7/23/03 NHLBI Symposium: From Genome to Disease
Patricia C. Babbitt, University of California, San Francisco
“–matics, –omes & –omics” (courtesy of Cambridge Healthtech Institute: 50 & counting...)
• Biome • Celluome • Chronome • Clinome • Complexome • Crystallome • Cytome • Diagnome • Enzymome • Epigenome • Fluxome • Foldome • Functome • Genome • Glycome • Infectuome
• Immunome • Interactome • Localizome • Metabolome • Methylome • Microbiome • Morphome • Operome • ORFeome • Pathogenome • Peptidome • Pharmacogenomics • Phenome • Phylogenome • Physiome
• Promoterome • Proteome • Pseudogenome • Regulome • Resistome • Ribonome • Secretome • Signalome • Somatonome • Toxicome • Transcriptome • Translatome • Unknome • Vaccinomics • Variome
Applications of Protein Informatics
• deduction of function
• tracing ancestral connections
• understanding enzyme mechanisms
• structural analysis of receptors, molecules involved in cell signaling
• identification of molecular surfaces in protein-protein, protein-DNA interactions
• protein engineering
• clustering of families, superfamilies
• metabolic computing/comparative genome analysis
Tools/Approaches for Protein Informatics
• database searching/pairwise alignments
• pattern searching and motif analysis
• multiple alignments
• phylogenetic tree construction
• sequence and structure comparison
• comparative genomics
• “metabolic computing”
• transmembrane/2° structure prediction
• 3D structure prediction/modeling
• visualization
• composition/pI/mass analysis
Protein vs. nucleic acid sequence analysis?
• Protein sequence analysis is more specific and less noisy than nucleic acid analysis due to the inherent differences in the message content of nucleic acid and amino acid codes
• 20-letter code vs 4-letter code, degeneracy of codon messaging
• But searches for many functional genomics experiments must be done at nucleotide level...
Outline: Performing your own Analyses in Protein Informatics
• Ins and Outs of database searching
  – underlying assumptions
  – scoring, optimization, statistical significance, caveats
• Fasta, Blast & PsiBlast
• Pattern searching & motif analysis
• Pre-computed analyses for protein families using sequence and structure information, motif databases
Database searching
• The first and most common operation in protein informatics...and the only way to access the information in large databases
• Primary tool for inference of homologous structure and function
• Improved algorithms to handle large databases quickly
• Provides an estimate of statistical significance
• Generates alignments
• Definitions of similarity can be tuned using different scoring matrices and algorithm-specific parameters
The underlying assumption used in functional inference...
…requires comparison of sequences
Sequence Conservation
Structure Conservation
Function Conservation
Formalizing the Problem
• Given: two sequences that you want to align
• Goal: find the best alignment that can be obtained by sliding one sequence along the other
• Requirements:
  – a scheme for evaluating matches/mismatches between any two characters
  – a score for insertions/deletions
  – a method for optimization of the total score
  – a method for evaluating the significance of the alignment
Scoring Systems
• The degree of match between two letters can be represented in a matrix
• Changing the matrix can change the alignment
  – Simplest: Identity (unitary) matrix
  – Better: Definitions of similarity based on inferences about chemical or biological properties
  – Examples: PAM, Blosum, Gonnet matrices
• The score should have the form $p_{ab} / (q_a q_b)$, where $p_{ab}$ is the probability that residue a is substituted by residue b, and $q_a$ and $q_b$ are the background probabilities for residues a and b, respectively.
• Handling gaps remains an incompletely solved problem...
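In practice the ratio on this slide is usually turned into an additive score by taking its logarithm, so that the scores of successive aligned positions can simply be summed. A sketch of that log-odds form, using the slide's symbols; the scale factor $\lambda$ is an added symbol that does not appear on the slide:

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{p_{ab}}{q_a\,q_b}
```

The score is positive when the pair a, b is seen more often than expected by chance ($p_{ab} > q_a q_b$) and negative when it is seen less often.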
BLOSUM (BLOcks SUbstitution) Matrices
• Derived from the BLOCKS database, which, in turn, is derived from the PROSITE library (see http://blocks.fhcrc.org/blocks/; http://www.expasy.ch/prosite/)
• BLOCKS generated from multiply aligned sequence segments without gaps clustered at various similarity thresholds and corrected to avoid sampling bias
• Derived from data representing highly conserved sequence segments from divergent proteins rather than data based on very similar sequences (as with PAM matrices)
Derivation of BLOSUM matrices*
• Many sequences from aligned families are used to generate the matrices
• Sequences identical at >X% are eliminated to avoid bias from proteins over-represented in the database
• Specific matrices refer to these clustering cut-offs, i.e., BLOSUM62 reflects observed substitutions between segments <62% identical
• These matrices have become the default scoring schemes used at most primary internet search sites
• Different matrices can make a difference to your results!
*adapted from Ewens & Grant, Statistical Methods in Bioinformatics
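The counting behind such a matrix can be sketched on a single toy block. This is a simplified illustration only: it skips the >X% identity clustering step described above, uses one block rather than the whole BLOCKS database, and the function name `blosum_like_scores` and the half-bit rounding are choices made for this sketch.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Rounded log-odds scores (half-bit units) from one ungapped block.

    `block` is a list of equal-length sequence segments. Only residue pairs
    actually observed in the block receive a score.
    """
    pair_counts = Counter()
    for column in zip(*block):                  # walk the block column by column
        for a, b in combinations(column, 2):    # every pair of segments in a column
            pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Observed frequency of each unordered pair (the p_ab term)
    pair_freq = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by those pairs (the q_a term)
    background = Counter()
    for (a, b), freq in pair_freq.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in pair_freq.items():
        # Expected frequency of the unordered pair if residues paired at random
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores
```

Calling `blosum_like_scores(["GDSGG", "GDSGG", "GDSAG", "GESGG"])` returns a dictionary keyed by sorted residue pairs; with so little data the values are not meaningful, which is one reason the real matrices pool many blocks.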
• scoring matrices are tailored to degree of divergence and may require a specific query length for optimal performance*

Query Length    Substitution Matrix
<35             PAM-30
35-50           PAM-70
50-85           BLOSUM-80
>85             BLOSUM-62

*adapted from information available at the NCBI Blast web site
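The table can be written as a small lookup. The table's ranges overlap at 50 and 85, so this sketch assumes a length of exactly 50 maps to PAM-70 and exactly 85 to BLOSUM-80; the function name `suggest_matrix` is hypothetical.

```python
def suggest_matrix(query_length: int) -> str:
    """Return the matrix the table suggests for a query of this length."""
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:      # assumption: boundary value 50 goes to PAM-70
        return "PAM-70"
    if query_length <= 85:      # assumption: boundary value 85 goes to BLOSUM-80
        return "BLOSUM-80"
    return "BLOSUM-62"
```

For example, a 255-residue query would fall in the last row of the table.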
Scoring and optimization
• Dot matrix plots: a simple description of alignment operations illustrating types of relationships between a sequence pair
[Figure: dot matrix plot of SEQUENCEHOMOLOG (horizontal) against SEQUENCEANALOG (vertical), with a dot at each position where the two residues are identical]
• The signal-to-noise ratio can be improved using filtering techniques designed to minimize the composition-dependent background
• Example of common filters: overlapping, fixed-length "windows" for sequence comparison
• To be counted, a comparison must achieve a minimum threshold score summed over the window, derived empirically or from a statistical or evolutionary model of sequence similarity
• The window size and minimum threshold score (often termed "stringency") at which the score is counted can be user-defined
Seq1 = SEQUENCEHOMOLOG
Seq2 = SEQUENCEANALOG
Window = 7, Stringency = 42% (3/7 matches)

SEQUENC
SEQUENCEANALOG (7/7 matches)

 SEQUENC
SEQUENCEANALOG (0/7 matches)

...

      CEHOMOL
SEQUENCEANALOG (2/7 matches)

...

       HOMOLOG
SEQUENCEANALOG (3/7 matches)
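The window comparisons above can be reproduced with a few lines of code. This is a minimal sketch that counts identities only; the names `window_matches` and `dot_plot` are made up for the illustration, and positions are 0-based window starts.

```python
SEQ1 = "SEQUENCEHOMOLOG"
SEQ2 = "SEQUENCEANALOG"


def window_matches(seq1, seq2, i, j, window):
    """Count identical residues between seq1[i:i+window] and seq2[j:j+window]."""
    return sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))


def dot_plot(seq1, seq2, window=7, stringency=3):
    """Return (i, j) window starts whose match count reaches the stringency."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            if window_matches(seq1, seq2, i, j, window) >= stringency:
                hits.append((i, j))
    return hits
```

With these inputs, `window_matches(SEQ1, SEQ2, 0, 0, 7)` counts the SEQUENC/SEQUENC window and `window_matches(SEQ1, SEQ2, 8, 7, 7)` counts HOMOLOG against EANALOG. Raising `stringency` removes the weaker windows from the plot.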
Window = 30; Stringency = 2
Window = 30; Stringency = 11
• To measure the local similarity between 2 sequences, scores can be used in the matrix instead of dots for a sliding window comparison
  – Summing the identities/similarities at each position
  – For a window of 5 residues and storing the score in the position corresponding to the center of the window:

1 P R I M E 5
  1-1-2+0+4 = +2
1 S E Q U E N C E A N A L Y S I S P R I M E R 21

. . .

                                1 P R I M E 5
                                  6+6+5+6+4 = +27
1 S E Q U E N C E A N A L Y S I S P R I M E R 21
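The two worked windows can be checked in code. The pair scores below are copied from the slide's arithmetic; every other pair defaults to 0, which is an assumption made only to keep the sketch short (a real search would use a full substitution matrix). The names `SLIDE_SCORES`, `pair_score`, and `window_scores` are hypothetical.

```python
# Pair scores taken from the two worked windows on the slide.
SLIDE_SCORES = {
    ("P", "S"): 1, ("R", "E"): -1, ("I", "Q"): -2, ("M", "U"): 0, ("E", "E"): 4,
    ("P", "P"): 6, ("R", "R"): 6, ("I", "I"): 5, ("M", "M"): 6,
}


def pair_score(a, b, scores=SLIDE_SCORES):
    """Look up a symmetric pair score; unknown pairs score 0 (assumption)."""
    return scores.get((a, b), scores.get((b, a), 0))


def window_scores(short, long_seq, scores=SLIDE_SCORES):
    """Sum pair scores for `short` placed at every start position of `long_seq`."""
    w = len(short)
    return [
        sum(pair_score(a, b, scores) for a, b in zip(short, long_seq[j:j + w]))
        for j in range(len(long_seq) - w + 1)
    ]


scores = window_scores("PRIME", "SEQUENCEANALYSISPRIMER")
```

The list is indexed by window start; to store each value at the window center as the slide describes, add `len(short) // 2` to the index.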
Statistical Significance
• A good way to determine if the alignment score has statistical meaning is to compare it with the score generated from the alignment of two random sequences
• A model of ‘random’ sequences is needed. The simplest model chooses the amino acid residues in a sequence independently, with background probabilities
• For an ungapped alignment, the score of a match to a random sequence is the sum of many similar random variables, so the sum can be approximated by a normal distribution.
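One common way to approximate the comparison against random sequences is to shuffle one sequence many times and rescore. The sketch below makes several assumptions not on the slide: it shuffles rather than sampling residues from background probabilities, it scores by counting identities only, and the function names are made up for the illustration.

```python
import random
from statistics import mean, stdev


def best_identity_count(seq1, seq2):
    """Largest number of identical positions over all ungapped offsets."""
    best = 0
    for offset in range(-(len(seq2) - 1), len(seq1)):
        matches = 0
        for j, residue in enumerate(seq2):
            i = offset + j
            if 0 <= i < len(seq1) and seq1[i] == residue:
                matches += 1
        best = max(best, matches)
    return best


def shuffle_z_score(seq1, seq2, trials=200, seed=0):
    """Z-score of the observed score against scores for shuffled copies of seq2."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    observed = best_identity_count(seq1, seq2)
    residues = list(seq2)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(residues)            # keeps composition, destroys order
        null_scores.append(best_identity_count(seq1, "".join(residues)))
    return (observed - mean(null_scores)) / stdev(null_scores)
```

A z-score treats the null scores as roughly normal, which is the approximation this slide describes for a single ungapped alignment; it should be read as a rough guide rather than a calibrated significance value.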
– Comparing a query sequence to a set of random sequences of uniform length results in scores that obey an extreme value distribution rather than a normal distribution, i.e., assuming a normal distribution can lead to overestimation of an alignment’s significance (see Altschul et al, 1994)
Caveats
• For database searches, the ONLY criterion available to judge the likelihood of a structural or evolutionary relationship between 2 sequences is an estimate of statistical significance
• Statistical significance and biological significance are NOT necessarily the same
Query= /phosphonatase/phosSt.gcg (255 letters) (10/20/99/pcb)
Database: /mol/seq/blast/db/swissprot 78,725 sequences; 28,368,147 total letters!

Sequences producing significant alignments: Score (bits), E Value

sp|O06995|PGMB_BACSU Begin: 93 End: 204 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM) 38 0.020
sp|P31467|YIEH_ECOLI Begin: 1 End: 180 HYPOTHETICAL 24.7 KD PROTEIN IN TNAB-BGLB I... 36 0.10
sp|O14165|YDX1_SCHPO Begin: 34 End: 201 HYPOTHETICAL 27.1 KD PROTEIN C4C5.01 IN CHR... 31 2.6
sp|P41277|GPP1_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 1 30 4.4
sp|Q39565|DYHB_CHLRE Begin: 3911 End: 4032 DYNEIN BETA CHAIN, FLAGELLAR OUTER ARM 29 7.6
sp|P77625|YFBT_ECOLI Begin: 143 End: 187 HYPOTHETICAL 23.7 KD PROTEIN IN LRHA-ACKA I... 29 10.0
sp|Q40297|FCPA_MACPY Begin: 146 End: 176 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN... 29 13
sp|P40853|GPHP_ALCEU Begin: 94 End: 188 PHOSPHOGLYCOLATE PHOSPHATASE, PLASMID (PGP) 29 13
sp|Q40296|FCPB_MACPY Begin: 146 End: 176 FUCOXANTHIN-CHLOROPHYLL A-C BINDING PROTEIN... 29 13
sp|P52183|ANNU_SCHAM Begin: 119 End: 168 ANNULIN (PROTEIN-GLUTAMINE GAMMA-GLUTAMYLTR... 29 13
sp|P40106|GPP2_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 2 28 17
sp|P37934|MAY3_SCHCO Begin: 435 End: 552 MATING-TYPE PROTEIN A-ALPHA Y3 27 29
sp|O06219|MURE_MYCTU Begin: 255 End: 371 UDP-N-ACETYLMURAMOYLALANYL-D-GLUTAMATE--2,6... 27 29
sp|P08419|EL2_PIG Begin: 182 End: 245 ELASTASE 2 PRECURSO 27 38
sp|Q11034|Y07S_MYCTU Begin: 163 End: 218 HYPOTHETICAL 69.5 KD PROTEIN CY02B10.28C 27 38
sp|P00577|RPOC_ECOLI Begin: 1290 End: 1401 DNA-DIRECTED RNA POLYMERASE BETA' CHAIN (T 27 38
sp|P32662|GPH_ECOLI Begin: 20 End: 49 PHOSPHOGLYCOLATE PHOSPHATASE (PGP) 27 38
sp|P32662|GPH_ECOLI Begin: 116 End: 224 PHOSPHOGLYCOLATE PHOSPHATASE (PGP) 27 28
sp|P32282|RIR1_BPT4 Begin: 239 End: 266 RIBONUCLEOSIDE-DIPHOSPHATE REDUCTASE ALPHA C... 27 50
sp|P17346|LEC2_MEGRO Begin: 36 End: 121 LECTIN BRA-2 27 50
sp|P54947|YXEH_BACSU Begin: 24 End: 51 HYPOTHETICAL 30.2 KD PROTEIN IN IDH-DEOR IN... 27 50
sp|P77366|PGMB_ECOLI Begin: 95 End: 190 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM) 27 50
sp|P30139|THIG_ECOLI Begin: 43 End: 79 THIG PROTEIN 27 50
sp|P95649|CBBY_RHOSH Begin: 96 End: 189 CBBY PROTEIN 27 50
sp|Q43154|GSHC_SPIOL Begin: 228 End: 327 GLUTATHIONE REDUCTASE, CHLOROPLAST PRECURSO... 26 66
sp|P34132|NT6A_HUMAN Begin: 191 End: 215 NEUROTROPHIN-6 ALPHA (NT-6 ALPHA) 26 66
sp|P34134|NT6G_HUMAN Begin: 115 End: 144 NEUROTROPHIN-6 GAMMA (NT-6 GAMMA) 26 66
sp|P95650|GPH_RHOSH Begin: 48 End: 114 PHOSPHOGLYCOLATE PHOSPHATASE (PGP) 26 66
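To see how a fixed E-value cutoff treats a hit list like this one, the lines can be parsed and filtered. The sketch assumes one hit per line in the `Begin:`/`End:` layout shown on this slide, which is specific to this output and not a general BLAST format; the names `HIT_LINE`, `parse_hits`, and `hits_below` are hypothetical.

```python
import re

# One hit per line: id, Begin/End coordinates, description, bit score, E value.
HIT_LINE = re.compile(
    r"^(?P<id>sp\|\S+)\s+Begin:\s*(?P<begin>\d+)\s+End:\s*(?P<end>\d+)\s+"
    r"(?P<desc>.*?)\s+(?P<bits>\d+)\s+(?P<evalue>[\d.]+)$"
)

SAMPLE = """\
sp|O06995|PGMB_BACSU Begin: 93 End: 204 PUTATIVE BETA-PHOSPHOGLUCOMUTASE (BETA-PGM) 38 0.020
sp|P31467|YIEH_ECOLI Begin: 1 End: 180 HYPOTHETICAL 24.7 KD PROTEIN IN TNAB-BGLB I... 36 0.10
sp|P41277|GPP1_YEAST Begin: 133 End: 200 (DL)-GLYCEROL-3-PHOSPHATASE 1 30 4.4
"""


def parse_hits(text):
    """Parse hit lines into dictionaries; lines that do not match are skipped."""
    hits = []
    for line in text.splitlines():
        m = HIT_LINE.match(line.strip())
        if m:
            hits.append({
                "id": m["id"],
                "begin": int(m["begin"]),
                "end": int(m["end"]),
                "description": m["desc"],
                "bits": int(m["bits"]),
                "evalue": float(m["evalue"]),
            })
    return hits


def hits_below(hits, max_evalue):
    """Keep only hits whose E value is at or below the cutoff."""
    return [h for h in hits if h["evalue"] <= max_evalue]
```

With a cutoff of 0.01, `hits_below(parse_hits(SAMPLE), 0.01)` keeps none of the three sample hits, so the choice of cutoff decides which hits are ever looked at, and that bears directly on the caveat that statistical and biological significance are not the same.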
• Different proteins evolve at different rates
[Figure: changes per 100 amino acids (y-axis, 0 to 200) plotted against millions of years since divergence (x-axis, 0 to 1000) for hemoglobin, fibrinopeptides, and cytochrome c]
• Different domains within a single protein evolve at different rates
[Figure: proinsulin (B-chain, C-peptide, A-chain) processed to mature insulin (A-chain, B-chain) with the C-peptide released; rates shown: r = 0.13 x 10^-9/site/year and r = 0.97 x 10^-9/site/year]
FASTA
• "Fast" search algorithm generates local alignments, allows gaps (see http://www.ebi.ac.uk/fasta33/)
• Extensively updated since first release
  – added statistical analysis
  – multiple variants available
  – FASTA3 is the current implementation
FASTA flavors (see http://fasta.bioch.virginia.edu/)
• FASTA: Compares protein vs protein or DNA vs DNA
• FASTX/FASTY: Compares DNA query to protein sequence db, DNA translated in 3 forward (or reverse) frames; allows frameshifts
• TFASTX: Compares protein query vs DNA sequence or db, translated in all 6 reading frames; no accommodation for introns
• FASTS: Compares a set of short peptide fragments derived from mass spectrometric proteomic analysis vs protein or DNA db
BLAST
• Original "fast" search algorithm generates local alignments without gaps (Blast 1.4)
• Newer versions (Blast 2.0x) accommodate gaps
• Access at NCBI and other sites: http://www.ncbi.nlm.nih.gov/BLAST/
• Documentation
  – Manual: http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html
  – FAQs: http://www.ncbi.nlm.nih.gov/BLAST/blast_FAQs.html
  – Tutorial: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST flavors
• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
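The five flavors reduce to a lookup on query type and database type. A small sketch of that mapping; the function name `blast_program` and its `translate_both` flag are made up for the illustration and are not part of any BLAST interface.

```python
def blast_program(query: str, database: str, translate_both: bool = False) -> str:
    """Pick a BLAST flavor from the query and database types.

    `query` and `database` are "protein" or "nucleotide". `translate_both`
    only matters for nucleotide vs nucleotide, where it selects tblastx.
    """
    if query == "protein" and database == "protein":
        return "blastp"
    if query == "nucleotide" and database == "protein":
        return "blastx"
    if query == "protein" and database == "nucleotide":
        return "tblastn"
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translate_both else "blastn"
    raise ValueError("types must be 'protein' or 'nucleotide'")
```

For example, `blast_program("protein", "nucleotide")` selects the flavor that translates the database in all six frames.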
Some Generalities about Fasta, Blast
• These methods are so widely used because they really are that good...
• BUT, there are some disadvantages:
  – Loss of sub-optimal alignments
  – Pairwise comparisons limit information content
  – Many biologically significant relationships may be lost in the "noise," i.e., hits that are not statistically significant
• BLAST is not “better” than FASTA
Psi-Blast: Extending our reach...
• Generalizes BLAST algorithm to use a position-specific score matrix in place of a query sequence and associated substitution matrix for searching the databases
• Position-specific score matrix generated from the output of a gapped Blast search, i.e., uses a profile or motif defined in the initial Blast search in place of a single query sequence and matrix for subsequent searches of the database
• Results in a database search “tuned” to the specific sequence characteristics of interest
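What a position-specific score matrix looks like can be sketched from a handful of aligned segments. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped alignment, a uniform background, and a flat pseudocount, and the names `build_pssm` and `score_segment` are hypothetical.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)      # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column-specific scores for a segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["GDSGG", "GDSGG", "GDSAG", "GESGG"])
```

Unlike a single substitution matrix, the score for a residue here depends on the column: in the toy alignment, D scores higher than E in the second column because D was seen there more often.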
Steps in a Psi-Blast search*
1. Constructs a multiple alignment from a Gapped Blast search and generates a profile from any significant local alignments found
2. The profile is compared to the protein database and PSI-BLAST estimates the statistical significance of the local alignments found, using "significant" hits to extend the profile for the next round
3. PSI-BLAST iterates step 2 an arbitrary number of times or until convergence
*Adapted from the PSI-BLAST tutorial at NCBI
PSI-BLAST information on the web
• Access at http://www.ncbi.nlm.nih.gov/BLAST/
• Tutorial at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-2.html
• A short explanation of PSI-BLAST statistics at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-3.html
• See also: Park J et al, “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods,” JMB 284:1201-10, 1998
Other alternatives
• Many, many other DB searching algorithms are available
  – Smith-Waterman
  – Methods based on probabilistic models/profiles, e.g., Hidden Markov models
  – Motif searching
• Or, you can use (or start with) pre-computed analyses of protein families
Why do motif analysis?
• Identification of very distant homologs
• May point to important functional units in a protein
• Can be used to "anchor" a multiple alignment
• Databases of motifs can be used to develop other informatics applications
  Example: BLOCKS → Blosum matrices
Motif analysis
• Focuses on conserved patterns among two or more sequences to determine relationships
• Many variants of motif searching available
  – Consensus-based, e.g., Prosite http://expasy.nhri.org.tw/prosite/
  – Manually annotated motifs, distant relationships, e.g., PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
  – Statistical, e.g., MEME (Multiple EM for Motif Elicitation) http://meme.sdsc.edu/meme/website/
  – Database searching, e.g., PHI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/
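As an illustration of the consensus-based approach, a Prosite-style pattern can be translated into a regular expression and run against a sequence. The sketch covers only the basic pattern elements (`x`, `[..]`, `{..}`, repeat counts, and the `<`/`>` anchors) and the function name `prosite_to_regex` is hypothetical; it is not a Prosite tool.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic Prosite-style pattern into a Python regex string."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split e.g. "x(2,4)" into the residue part and its repeat counts
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        regex.append(prefix + core + suffix)
    return "".join(regex)
```

For instance, `re.search(prosite_to_regex("[AC]-x-V-x(4)-{ED}"), sequence)` looks for A or C, any residue, V, any four residues, then anything but E or D.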
Meme & Mast
• Meme: motif discovery tool http://meme.sdsc.edu/meme/website/intro.html
  – motifs represented as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern
  – output can be converted to BLOCKS which can then be converted to PSSMs (position-specific scoring matrices)
• Mast: database searching tool using one or more motifs as queries
  – provides a match score, represented as p-values, for each sequence in the database compared with each of the motifs in the group of motifs provided
  – provides probable order and spacing of occurrences of the motifs in the sequence hits
Some pre-calculated motif/family compilations
• Prosite: Protein families/domains showing biologically important patterns (1637 different patterns, rules and profiles/matrices as of 6/03) http://us.expasy.org/prosite/
• Pfam: Multiple sequence alignments and HMMs for many protein domains (5724 families as of 5/03) http://pfam.wustl.edu/
• Prints: Conserved motifs characterizing protein families (1800 entries, encoding 10,931 individual motifs as of 4/03) http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• Compilation of specific protein family websites at the MRC http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-family.html
Laboratory Exercises & Resources from Baygenomics
http://baygenomics.ucsf.edu/PGAConference2003/
• Using the LDL receptor as an example
  – DB searching
  – TMD prediction
  – Prosite, Pfam, Prints, Motif analysis
  – Multiple alignment generation and interpretation
  – Tree building/visualization
  – 2° structure/TMD prediction
  – 3D structure visualization
• Part of a 2-day hands-on workshop (& an online version)
  – extensive help files
  – detailed answer keys