Bioinformatics – Sequence analysis
Magnus Alm Rosenblad Cell and Molecular Biology/GU
Free electronic book at UB WWW: “Bioinformatics and Functional Genomics”, author: Jonathan Pevsner
Computational biology
Databases Sequence analysis
Structural bioinformatics
Microarray analysis
Systems biology
Bioinformatics
Systems biology tries to create mathematical models of biological systems and processes. It uses data from
bioinformatics and functional genomics.
Databases?
• A database is a structured collection of records (data) that is stored in a computer system. (Could be a simple text file.)
• Records have one or more identifiers (“keys”) plus several columns (“fields”) with data
• Each database has its own set of identifiers but may also contain identifiers used in other databases (to make links between databases)
• Primary databases: original data Ex. Genbank/EMBL/DDBJ, SWISSPROT
• Secondary databases: data from other databases + “analyses” Ex. Pfam, PROSITE
International Sequence Database Collaboration
• GenBank (NCBI, USA): searched with Entrez
• EMBL (EBI, Europe): searched with SRS
• DDBJ (CIB, Japan): searched with getentry
• Each database receives submissions and updates, and the three databases exchange data with each other.
NCBI Databases: identifiers
• Pubmed: scientific papers (identifier = “PMID”)
• Taxonomy: all organisms and groups (TAXID)
• Nucleotide: nucleotide sequences (GI + acc)
• Genome: complete genome sequences (use nucleotide ID)
• Protein: protein sequences (GI + acc)
Examples:
• “Signal sequences get active.” PMID: 19219017
• gi|219681188|dbj|CJ999201.1 -- from DDBJ/Japan
• gi|66547259|ref|XP_396445.2 -- Refseq, protein (306 aa)
• gi|48119433|ref|XP_396445.1 -- Refseq (old, 413 aa)
“GI” = Genbank identifier
“acc” = accession number, an identifier type with versions: xxx.1, xxx.2 etc.
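The identifier examples above can be taken apart with a few lines of code. This is a minimal sketch, assuming the old-style layout gi|number|database|accession.version shown on the slide; the function name parse_ncbi_id is made up for illustration.

```python
def parse_ncbi_id(identifier):
    """Split an identifier such as gi|66547259|ref|XP_396445.2 into its parts.

    Assumes the four-field layout used in the examples above.
    """
    fields = identifier.strip().rstrip("|").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("expected gi|<number>|<db>|<accession.version>")
    accession, _, version = fields[3].partition(".")
    return {
        "gi": int(fields[1]),          # GenBank identifier, changes with every version
        "db": fields[2],               # source database tag, e.g. ref, dbj, emb
        "accession": accession,        # stable across versions
        "version": int(version) if version else None,
    }


old = parse_ncbi_id("gi|48119433|ref|XP_396445.1")
new = parse_ncbi_id("gi|66547259|ref|XP_396445.2")
# Same accession, different version and different GI number.
print(old["accession"] == new["accession"], old["version"], new["version"])
```

The two Refseq examples share the accession XP_396445 but differ in version and GI, which is why the GI alone cannot tell you that two records describe the same protein.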
SEQUENCE ANALYSIS
Where and why ?
• Sequencing projects, assembly of sequence data
• Gene prediction
• Identification of functional elements in sequences
• Sequence comparison
• Classification of proteins
• Comparative genomics
• RNA structure prediction
• Protein structure prediction
• Evolutionary history
Sequence analysis
• Biological sequences.* Central dogma.
• Analysis of primary, secondary, not tertiary ... structures
• Similarities (orthologs, paralogs)
• Methods, algorithms (alignments, models)
• Databases (primary, secondary)
* Because DNA, RNA and protein molecules are polymers, we can treat them as strings of characters and use methods from computer science!
Sequences: DNA, RNA, protein ...
Genome: DNA
  -> transcription
Primary transcript: pre-mRNA, pre-ncRNA
  -> processing (splicing*, cleavage)
Processed transcript: mRNA, ncRNA (tRNA, rRNA ...)
  -> translation, modification
[a] Translated sequence: protein (amino acids) -> protein cleavage ... -> mature protein
[b] Mature ncRNA
[ ESTs are nucleotide sequences, might be unspliced, spliced ...]
* Splicing only occurs in Eukaryotes (almost true).
Genome sizes, overview
Group           Smallest   Largest     Note
Viruses         5 Kb       1.2 Mb      HIV-1: 10 Kb
Mitochondria*   6 Kb       2.5 Mb      some have no genome
Plastids*       35 Kb      230 Kb      most are photosynthetic
Bacteria**      160 Kb     10 Mb       E. coli: 4.6-5.5 Mb
Fungi           3 Mb       120 Mb      S. cerevisiae: 12 Mb
Plants          12 Mb      100 Gb !
Mammals***      ~3 Gb                  ~1000 times E. coli
Protein size    20 aa      35,000 aa !
* Mitochondria and plastids are bacterial endosymbionts once engulfed by a eukaryote.
** Bacteria have 1-3 chromosomes. Bacteria also have plasmids. Carsonella ruddii = 160 Kb
*** Human non-coding DNA: ~98.5% of the human genome is non-protein-coding, as opposed to 11% of the genome of the bacterium E. coli.
Sequencing, assembly, dbs
Step (at a sequencing center such as JGI, Broad ...) -> where the data can be found
1. Sequencing: sequence reads -> NCBI TraceDB (not always)
2. Assembly: reads -> contigs -> sequencing center www (not always)
3. Assembly: contigs -> scaffolds/supercontigs -> sequencing center www (not always)
4. + preliminary gene/protein predictions -> sequencing center www (not always)
5. Draft assembly: chromosomes -> NCBI Nucleotide?, NCBI Protein?
6. Finished: chromosomes, proteins -> NCBI Nucleotide, Genome?, Protein
FlyBase, gene example (1/2)
Fields shown: Gene symbol, Gene full name, Type, Chromosome, Map, Species IDs, Sequence location
FlyBase, gene example (2/2)
Tracks shown: Gene, mRNA, CDS, mRNA info
How many transcripts? How are they different? Exons? How many proteins? 5’? UTR? Introns?
NCBI Genbank entry for contig (Oamb)
AE003731, 230001 bp DNA
Drosophila melanogaster chromosome 3R, section 69 of 118

gene    complement(165253..190776)
        /gene="Oamb"
        /locus_tag="Dmel_CG3856"
        /db_xref="FLYBASE:FBgn0024944"
mRNA    complement(join(165253..167342,174570..175092,
        176330..176566,176829..177120,189554..189985,
        190715..190776))
        /product="CG3856-RC, transcript variant C"
mRNA    complement(join(165253..167342,174570..175092,
        176330..176566,176829..177120,182196..182621,
        189554..189985,190715..190776))
        /product="CG3856-RA, transcript variant A"
CDS     complement(join(166417..167342,174570..175092,
        176330..176566,176829..177080))
        /note="CG3856 gene product from transcript CG3856-RA"
        /product="CG3856-PA, isoform A"
        /protein_id="AAF55798.2"
CDS     complement(join(166417..167342,174570..175092,
        176330..176566,176829..177080))
        /note="CG3856 gene product from transcript CG3856-RC"
        /product="CG3856-PC, isoform C"
        /protein_id="AAF55796.1"
mRNA    complement(join(168640..170083,174570..175092,
        176330..176566,176829..177120,182196..182621,
        189554..189985,190715..190776))
        /product="CG3856-RB, transcript variant B"
CDS     complement(join(169182..170083,174570..175092,
        176330..176566,176829..177080))
        /note="CG3856 gene product from transcript CG3856-RB"
        /product="CG3856-PB, isoform B"
        /protein_id="AAF55797.1"

Compare mRNA & CDS. Two or three 5’ UTR exons? All have 4 coding exons ...
Transcript B, 5’ UTR: 177081-177120 + ...; 3’ UTR: 168640-169181
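Comparing an mRNA feature with its CDS feature, as asked above, amounts to subtracting intervals. The sketch below does this for transcript B, with the coordinates copied from the entry; the function name flanking_segments is made up for illustration, and the code assumes that every CDS exon lies inside an mRNA exon.

```python
# Exon coordinates (1-based, inclusive) of transcript B and its CDS,
# copied from the GenBank entry above. Both are on the complement strand.
MRNA_B = [(168640, 170083), (174570, 175092), (176330, 176566),
          (176829, 177120), (182196, 182621), (189554, 189985),
          (190715, 190776)]
CDS_B = [(169182, 170083), (174570, 175092), (176330, 176566),
         (176829, 177080)]


def flanking_segments(exons, cds):
    """Return the exon parts below and above the coding region."""
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    low_side, high_side = [], []
    for start, end in exons:
        if start < cds_start:                      # part before the CDS
            low_side.append((start, min(end, cds_start - 1)))
        if end > cds_end:                          # part after the CDS
            high_side.append((max(start, cds_end + 1), end))
    return low_side, high_side


# On the complement strand the low coordinates are the 3' end of the
# transcript, so the low side is the 3' UTR and the high side the 5' UTR.
three_prime_utr, five_prime_utr = flanking_segments(MRNA_B, CDS_B)
print("3' UTR:", three_prime_utr)
print("5' UTR:", five_prime_utr)
```

For a gene on the forward strand the two results would swap roles: the low side would be the 5' UTR.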
UCSC Genome Browser
Genes
EST
Location How is the direction shown?
Exons, introns?
UTRs? CDS?
Transcripts?
mRNA
Similarity -- Function
• Molecules that have the same function often have similar sequences
• Molecules that have the same or similar sequence often have the same function
Sequence analysis can give a lot of information about the function
Biological problem, sequence analysis Common biological problem:
We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?
* Sequence homology: BLAST, FASTA, SSEARCH
  Simple example: a human protein is highly similar to a protein with known function from another organism => the human protein has a related function (it’s a homolog: ortholog or paralog)
* Pattern/profile search: PROSITE, Pfam (known motifs?)
** Secondary structure prediction (proteins, ncRNA)
** Prediction of transmembrane domains (~25% of all proteins are membrane bound!)
Example of similarity: 2 proteins
QUERY: Mus musculus Signal recognition particle receptor subunit beta (SR-beta)
SUBJECT: unknown human protein
Identities = 245/271 amino acids (90%)
Gaps = 2/271

Query   1  MASANTRRVGDG--AGGAFQPYLDSLRQELQQRDPTLLSVAVALLAVLLTLVFWKFIWSR  58
           MASA++RRV DG  AGG FQPYLD+LRQELQQ DPTLLSV VA+LAVLLTLVFWK I SR
Sbjct   1  MASADSRRVADGGGAGGTFQPYLDTLRQELQQTDPTLLSVVVAVLAVLLTLVFWKLIRSR  60

Query  59  KSSQRAVLFVGLCDSGKTLLFVRLLTGQYRDTQTSITDSSAIYKVNNNRGNSLTLIDLPG  118
           +SSQRAVL VGLCDSGKTLLFVRLLTG YRDTQTSITDS A+Y+VNNNRGNSLTLIDLPG
Sbjct  61  RSSQRAVLLVGLCDSGKTLLFVRLLTGLYRDTQTSITDSCAVYRVNNNRGNSLTLIDLPG  120

Query 119  HESLRFQLLDRFKSSARAVVFVVDSAAFQREVKDVAEFLYQVLIDSMALKNSPSLLIACN  178
           HESLR Q L+RFKSSARA+VFVVDSAAFQREVKDVAEFLYQVLIDSM LKN+PS LIACN
Sbjct 121  HESLRLQFLERFKSSARAIVFVVDSAAFQREVKDVAEFLYQVLIDSMGLKNTPSFLIACN  180

Query 179  KQDIAMAKSAKLIQQQLEKELNTLRVTRSAAPSTLDSSSTAPAQLGKKGKEFEFSQLPLK  238
           KQDIAMAKSAKLIQQQLEKELNTLRVTRSAAPSTLDSSSTAPAQLGKKGKEFEFSQLPLK
Sbjct 181  KQDIAMAKSAKLIQQQLEKELNTLRVTRSAAPSTLDSSSTAPAQLGKKGKEFEFSQLPLK  240

Query 239  VEFLECSAKGGRGDTGSADIQDLEKWLAKIA  269
           VEFLECSAKGGRGD GSADIQDLEKWLAKIA
Sbjct 241  VEFLECSAKGGRGDVGSADIQDLEKWLAKIA  271
Unknown = SR-beta?
Similarity => Homology?
Comparing non-identical sequences
Protein sequence comparison - basic concepts
When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are two kinds of biological relationships:
Orthologs: proteins that carry out the same function in different species
Paralogs: proteins that perform different but related functions within one organism
Proteins are homologous if they are related by divergence from a common ancestor.
Homology: orthologs & paralogs
Orthology describes genes in different species that derive from a common ancestor (= MouseA, ChickA, FrogA, which all come from the alpha-chain gene in the common ancestor). Speciation!
Paralogy describes homologous genes within a single species that diverged by gene duplication (= MouseA and MouseB). (Example globin duplication: basal vertebrate)
Example: Homology, domain architecture
Figure: domain architectures of Sr54_arcfu and Ftsy_aquae.
• Shared domain(s) with same function
• RNA binding domain
• Common ancestry, different function
Orthologs/paralogs?
Protein similarity: yeasts, chordates
Mouse and human proteins are very similar.
Candida glabrata is “closest” to S.cerevisiae but not as similar as we may think.
Even in closely related organisms some orthologs are quite different. For more distant species these may be hard to identify.
Proteins evolve at different rates!
Dujon B. Trends Genet. 2006 Jul;22(7):375-87.
Orthologous sequences compared.
“Evolution”: yeasts, chordates
Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Dujon B. Trends Genet. 2006 Jul;22(7):375-87.
Average protein identity between orthologs
Methods in sequence analysis
• Simple transformation/extraction
  a) Translation: RNA > protein
  b) Reverse translation: protein > RNA
  c) Splicing (removing introns in pre-mRNA, pre-rRNA ...)
• Comparison of primary sequences
  a) Identity: finding sites, pattern matches
  b) Alignments: non-identical seqs (pair/multiple/phylogeny)
• Analyzing for other properties
  a) statistical composition (GC%, CpG islands, ...)
  b) profile analysis (PSI-Blast, Pfam HMMs)
  c) predicting transmembrane domains (TMHMM)
  d) higher order structure (secondary structure in RNA/prot)
Translation of sequences: mRNA > protein
• Different nucleotide sequences may translate into identical amino acid sequences.
• One nucleotide sequence may yield different amino acid seqs. (6 reading frames)
• Reverse translation does not give a unique nucleotide sequence.
• Different splicing of pre-mRNA: 1 gene, several proteins! (Eukaryotes only)
The (degenerate) Genetic code
UUU Phe F   UCU Ser S   UAU Tyr Y   UGU Cys C
UUC Phe F   UCC Ser S   UAC Tyr Y   UGC Cys C
UUA Leu L   UCA Ser S   UAA Stop*   UGA Stop*
UUG Leu L   UCG Ser S   UAG Stop*   UGG Trp W

CUU Leu L   CCU Pro P   CAU His H   CGU Arg R
CUC Leu L   CCC Pro P   CAC His H   CGC Arg R
CUA Leu L   CCA Pro P   CAA Gln Q   CGA Arg R
CUG Leu L   CCG Pro P   CAG Gln Q   CGG Arg R

AUU Ile I   ACU Thr T   AAU Asn N   AGU Ser S
AUC Ile I   ACC Thr T   AAC Asn N   AGC Ser S
AUA Ile I   ACA Thr T   AAA Lys K   AGA Arg R
AUG Met M   ACG Thr T   AAG Lys K   AGG Arg R

GUU Val V   GCU Ala A   GAU Asp D   GGU Gly G
GUC Val V   GCC Ala A   GAC Asp D   GGC Gly G
GUA Val V   GCA Ala A   GAA Glu E   GGA Gly G
GUG Val V   GCG Ala A   GAG Glu E   GGG Gly G

Translation:
AUGUUGGGUUGA = MLG*
||| | || | |
AUGCUAGGAUAA = MLG*

Reverse translation:
MLG* = AUG UUA GGU UAA    1
       AUG UUA GGU UAG    2
       AUG UUA GGU UGA    3
       ...
       AUG CUG GGG UGA   72   (1 x 6 x 4 x 3 possible seqs)

3rd position is less important!
71 different substitutions give MLG*-sequence
The nucleotide sequences for a protein in different organisms may be very different!
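The translation and reverse-translation examples above can be reproduced with a short script. This is a minimal sketch using the standard genetic code from the table; the names CODON_TABLE, translate and count_reverse_translations are chosen here for illustration.

```python
from itertools import product

# Standard genetic code. Codons are enumerated in the order UUU, UUC, UUA,
# UUG, UCU, ... and the string gives the amino acid for each ("*" = Stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def count_reverse_translations(protein):
    """Number of different RNA sequences that encode the given protein string."""
    total = 1
    for aa in protein:
        total *= AMINO.count(aa)   # number of codons for this amino acid
    return total


print(translate("AUGUUGGGUUGA"))            # MLG*
print(translate("AUGCUAGGAUAA"))            # MLG*
print(count_reverse_translations("MLG*"))   # 72 = 1 x 6 x 4 x 3
```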
Changes that affect translation
AUGUUGGGUUGA = MLG*
AUGUUAGGUUGA = MLG*
AUGUUCGGUUGA = MFG*
AUGUGAGGUUGA = M*G* (= M*!)
AUG-UGGGUUGA = MWV (+GA.)   Frameshift => new aa seq. Last example: no Stop!
• Substitution: a single aa is changed (maybe), unless the new codon is a STOP
• Insertion/deletion (“indel”): the rest of the aa sequence is changed
Open Reading Frame (ORF)
Forward reading frames (frames 1-3):
AUGUUGGGUUGA = MLG*
.UGUUGGGUUGA = CWV
..GUUGGGUUGA = VGL
...UUGGGUUGA = LG*
Backward reading frames (frames 4-6, on the reverse (minus) strand):
AUGUUGGGUUGA   original
AGUUGGGUUGUA   reversed
UCAACCCAACAU   reversed + complemented = STQH, QPN, ...
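The six reading frames can be generated by translating the sequence and its reverse complement at offsets 0, 1 and 2. The sketch below is one way to do this; the function names and the frame numbering 1-6 as dictionary keys are choices made here for illustration.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def reverse_complement(rna):
    """Complement every base (A<->U, G<->C) and reverse the string."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def six_frames(rna):
    """Return {frame number: protein}; frames 1-3 forward, 4-6 on the minus strand."""
    frames = {}
    for first_frame, seq in ((1, rna), (4, reverse_complement(rna))):
        for offset in range(3):
            frames[first_frame + offset] = translate(seq[offset:])
    return frames


for number, protein in six_frames("AUGUUGGGUUGA").items():
    print(number, protein)
```

A real ORF finder would additionally look for a start codon and report the stretch up to the first Stop in each frame.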
Example unknown RNA:
 1 AUGUUCCGUCUCACGCUCACCAAACGGCUAGCCCGCGCUUCUGCACACGUCACUCCGUCG 60
   ------------------------------------------------------------
   UACAAGGCAGAGUGCGAGUGGUUUGCCGAUCGGGCGCGAAGACGUGUGCAGUGAGGCAGC

Frames 1-3:
M F R L T L T K R L A R A S A H V T P S
C S V S R S P N G * P A L L H T S L R R
V P S H A H Q T A S P R F C T R H S V A
   ------------------------------------------------------------
Frames 4-6:
H E T E R E G F P * G A S R C V D S R R
T G D * A * W V A L G R K Q V R * E T A
N R R V S V L R S A R A E A C T V G D G
Pairwise alignments:

Global alignment (Needleman-Wunsch, ClustalW)
Considers similarity across the full extent of the sequences.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
|| | ||||||| | | |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment (most common; Smith-Waterman, BLAST)
Considers regions of similarity in parts of the sequences only.
xxxxxxx
|||||||   region of similarity
xxxxxxx
Problem with finding best alignment
• Idea: (a) construct a system to score similarity*, (b) calculate the score for every possible alignment, (c) highest score = best alignment.
• Number of alignments for two sequences of length N: ~ 2^(2N) / sqrt(6N). Example: N = 300 gives ~10^179 alignments!
• (a) and (c) are good parts, but we have to find another way of finding the highest scoring alignment!
* Match is good, mismatch not so good, gap is bad!
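To see how quickly the number of alignments grows, the estimate quoted above can be evaluated in log space (the number itself is far too large for a float). This sketch simply evaluates the slide's expression 2^(2N) / sqrt(6N); the function name is made up.

```python
import math


def log10_alignment_count(n):
    """log10 of the estimate 2^(2N) / sqrt(6N) for two sequences of length N."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(6 * n)


for length in (10, 100, 300):
    print(length, f"~10^{log10_alignment_count(length):.0f} alignments")
```

Even at one billion alignments per second, 10^179 alignments could not be enumerated, which is why brute force is not an option.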
New idea
• Strategy: find a way to solve the smallest problems and then use these to get the rest
• Recursive strategy OK, but too slow
• Idea for algorithm: save the results for the previous problems so that we don’t have to calculate them again.
• Strategy = “Dynamic programming”; biological algorithm = “Needleman-Wunsch”
Needleman-Wunsch algorithm (1)
Scores used: match = +2; mismatch = -1; gap = -2
The sequences to be aligned have lengths m and n.
1. The sequences are written outside a matrix of size (m+1) x (n+1), one residue per column/row. Fill column/row 0.
2. Start at (1,1). For each cell, 3 scores for alignment of two residues/gap are calculated (white squares). The best score in each cell is saved (grey squares)! Which move gives the best score?
3. Save the direction to the best score in a “trace matrix”.
Note: The last residues may not align.
Needleman-Wunsch algorithm (2)
Here we show best scores and trace direction for all cells. Traceback may be calculated from scoring matrix if no trace matrix. Start at (m,n), how did we get that score?
This is a “dynamic programming” algorithm. Easily modified to get local alignment.
Running time is proportional to the lengths multiplied (m x n). Very slow algorithm, cannot be used in big database searches!
Scores used: match +5, mismatch -2, gap -6
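The steps of the algorithm can be written down compactly. The following is a minimal sketch with a linear gap cost and the scores from the first slide (match +2, mismatch -1, gap -2) as defaults; when several moves tie, it keeps the diagonal one, which is an arbitrary choice.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    # Row/column 0: aligning a prefix against nothing costs only gaps.
    for i in range(1, m + 1):
        score[i][0], trace[i][0] = i * gap, "up"
    for j in range(1, n + 1):
        score[0][j], trace[0][j] = j * gap, "left"
    # Fill the matrix: each cell keeps the best of three possible moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "diag"),  # align two residues
                (score[i - 1][j] + gap, "up"),         # gap in b
                (score[i][j - 1] + gap, "left"),       # gap in a
            ]
            score[i][j], trace[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from (m, n) to (0, 0) using the trace matrix.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = trace[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[m][n], "".join(reversed(top)), "".join(reversed(bottom))


best, top_row, bottom_row = needleman_wunsch("ACGT", "AGT")
print(best)
print(top_row)
print(bottom_row)
```

The two nested loops visit every cell once, which is where the running time proportional to m x n comes from.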
Smith-Waterman finds better local alignment
In the example the global alignment misses the exact match.
Needleman-Wunsch may be easily modified to get the best local alignment: Smith-Waterman algorithm.
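The modification is small: no cell may drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below uses the same illustrative scores as before (match +2, mismatch -1, gap -2) and recomputes the moves from the score matrix instead of storing a trace matrix.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start fresh anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the best cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
```

In this made-up example the flanking residues never match, so the local alignment reports only the shared core.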
Problems with Needleman-Wunsch and Smith-Waterman algorithms
• The running time is proportional to m x n: for sequences of length 10m and 10n, time = 100 x mn. What if we want to compare 1,000,000 sequences ...
• We always calculate the whole matrix and do the traceback even if the sequences have no sequence similarity whatsoever.
• Homologous sequences often share regions with high similarity ... can we use this?
BLAST lists all matching “words”*
Query
Subject
For each short match, the program tries to extend in both directions. This way, we only align regions that have some sequence identity!
* A word is 7-11 nucleotides or 2-3 amino acids
ATCGGAT
ATCGGAT
CTCAGAG
CTCAGAG
CCCGGCC
BLAST and FastA and BLAT
Searching databases with BLAST:
• Initial search is for short words. Word hits are then extended in either direction.
• We only extend words that are in both sequences => fast, but the gap between two close words can’t be long.
Searching databases with FastA (FastP for proteins):
• Initial search for short words. Words are extended, but also linked if they are close!
• => slower, but longer alignments (good for nucleotide seqs)
Using BLAT to search genomes (UCSC Genome browser):
• Tables with all “words” are already calculated for genomes!
• => very fast, but precalculation must have been done
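The "extend a word hit in both directions" step can be illustrated with a simplified, ungapped extension. This is only a sketch: the scores (+1/-3), the drop-off threshold x_drop and the two example sequences are assumptions made here, and real BLAST is considerably more elaborate.

```python
def extend_seed(query, subject, q_start, s_start, k,
                match=1, mismatch=-3, x_drop=5):
    """Extend a word hit of length k without gaps (0-based start positions).

    Extension in each direction stops when the running score falls more than
    x_drop below the best score seen so far.
    Returns (score, q_begin, q_end) with q_end exclusive.
    """
    def walk(qi, si, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(q_start + k, s_start + k, 1)
    left_score, left_len = walk(q_start - 1, s_start - 1, -1)
    total = k * match + left_score + right_score
    return total, q_start - left_len, q_start + k + right_len


query = "TTACGTACGTAA"      # hypothetical sequences for illustration
subject = "GGACGTACGTCC"
score, begin, end = extend_seed(query, subject, 2, 2, k=4)
print(score, query[begin:end])
```

Because only seeded regions are extended, sequences that share no word are never aligned at all, which is where the speed comes from.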
BLAST “word filtering” example
1. ATGGCGATGT
   Words: ATG: pos 1,7; TGG: pos 2; GGC: pos 3; GCG: pos 4; CGA: pos 5; GAT: pos 6; TGT: pos 8
2. ATCGCAATAC
3. ATCGCGCAT
“Nucleotide BLAST”. If word size=3, will BLAST align seq 1 with seq 2 or 3 ?
Seq 1 vs seq 2: pretty similar, 6/10 matches, but word hit??
1. ATGGCGATGT
   || || ||
2. ATCGCAATAC
Seq 1 vs seq 3: less similar but ...
1. ATGGCGATGT
   || |||
3. ATCGCGCAT
BLAST creates a lookup table for “words” of size 3. Is any word found in seq 2 and 3?
Note: BLAST only reports alignments > 18 nts Note: NCBI BLAST word size is min 7.
1. ATGGCG-ATgt
   || ||| ||
3. ATCGCGCAT
Word hit is extended: 7 matches
Which will BLAST try to align ... ?
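The lookup table for the three example sequences can be built directly. A minimal sketch with word size 3 and 1-based positions as on the slide; the function names are made up.

```python
from collections import defaultdict


def word_positions(seq, k=3):
    """Map every word of length k to the list of its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return dict(table)


def shared_words(query, subject, k=3):
    """Words present in both sequences, with their positions in each."""
    q_table = word_positions(query, k)
    s_table = word_positions(subject, k)
    return {w: (q_table[w], s_table[w]) for w in q_table if w in s_table}


SEQ1, SEQ2, SEQ3 = "ATGGCGATGT", "ATCGCAATAC", "ATCGCGCAT"
print(word_positions(SEQ1))
print("seq1 vs seq2:", shared_words(SEQ1, SEQ2))
print("seq1 vs seq3:", shared_words(SEQ1, SEQ3))
```

Only seq 3 shares a word (GCG) with seq 1, so only that pair gets a seed to extend, even though seq 2 has more identical positions.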
A nucleotide alignment that NCBI BLAST can’t find!
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
There are no matching regions > 6 nts. Smaller word size must be used.
RNA: Conserved secondary structure
AU, GC base pairing creates “hairpins”

CAGGAAACUG seq1
...|.||...
GCUGCAAAGC seq2
|||||||...
GCUGCAACUG seq3

Figure: seq1 folds into a hairpin with stem C-G, A-U, G-C and loop GAAA; seq2 folds into a hairpin with stem G-C, C-G, U-A and loop GCAA; seq3 forms no hairpin.
Seq1 and seq2 are not similar, but they both have a hairpin structure, which is not shared by seq3!
The alignment of the primary sequences (structure) doesn’t give us any information.
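Whether the ends of a short RNA can pair into a hairpin stem is easy to test. The sketch below counts how many consecutive Watson-Crick pairs (AU, GC) form from the two ends inward; the minimum loop size of 3 and the function name are assumptions, and real structure prediction considers far more than this.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def stem_length(rna, min_loop=3):
    """Number of base pairs formed by pairing the two ends inward."""
    i, j = 0, len(rna) - 1
    pairs = 0
    # Stop when the bases do not pair or the loop would become too small.
    while j - i - 1 >= min_loop and (rna[i], rna[j]) in PAIRS:
        pairs += 1
        i += 1
        j -= 1
    return pairs


for name, seq in (("seq1", "CAGGAAACUG"),
                  ("seq2", "GCUGCAAAGC"),
                  ("seq3", "GCUGCAACUG")):
    print(name, seq, "stem pairs:", stem_length(seq))
```

seq1 and seq2 both form a 3-pair stem although only 3 of their 10 positions are identical, while seq3 shares 7 positions with seq2 and forms no stem.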
ncRNAs are often not annotated
NC_006270.1 -TTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACTCGCAATCCGCTCGAGCGAGGC X06802|BAC.SUB. NTTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACCTGCAATCCGCTCTAGCAGGGC ************************************* *********** *** ***
NC_006270.1 CGAATCCCTTTCTCGAGGTTCGTTTACTTTAAGGTCTGCCTTAAGCAAGTGGTGTTGACG X06802|BAC.SUB. CGAATCCCTT-CTCGAGGTTCGTTTACTTTAAGGCCTGCCTTAAGTAAGTGGTGTTGACG ********** *********************** ********** **************
NC_006270.1 CTTGGGTCCTGCGCAATGGGAATCCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA X06802|BAC.SUB. TTTGGGTCCTGCGCAATGGGAATTCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA ********************** ************************************
NC_006270.1 AGTGGAACCTTCCATGTGCCGCAGGGTTGCCTGGGCTGAGCTAACTGCTTAAGTAACGCT X06802|BAC.SUB. AGTGAAACCTCTCATGTGCCGCAGGGTTGCCTGGGCCGAGCTAACTGCTTAAGTAACGCT **** ***** ************************ ***********************
NC_006270.1 TAGGGTAGCGAATCGACAGAAGGTGCACGGTA X06802|BAC.SUB. TAGGGTAGCGAATCGACAGAAGGTGCACGGTA ********************************
Sequence alignment of annotated SRP RNA from Bacillus subtilis and identified SRP RNA from the newly sequenced and “fully” annotated Bacillus licheniformis. Sequence identity = 94%! Still no SRP RNA is annotated. SRPDB is needed.
ncRNAs in the 3 Kingdoms of Life
Rfam: annotating non-coding RNAs in complete genomes. Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman. Nucleic Acids Res. 2005 33:D121-D124.
Examples BLAST can’t find
• Proteins: frequent substitutions (1 aa / 3 aa) => use word size 2 (BLAST default is 3); make sure DB has close relatives (or use profile methods)
• mRNA: many synonymous substitutions (1/7) => search on protein level
• ncRNA genes with “compensatory base changes” that preserve secondary structure => use methods that allow small “word size”, close relatives (OR search with secondary structure motif)
How to score aligned nucleotides
Substitution matrices

Unitary matrix        BLAST matrix
   A  T  C  G            A  T  C  G
A  1  0  0  0         A  1 -3 -3 -3
T  0  1  0  0         T -3  1 -3 -3
C  0  0  1  0         C -3 -3  1 -3
G  0  0  0  1         G -3 -3 -3  1
All nucleotides considered equally probable in both. BLAST matrix: more matches needed to allow mismatches.
(Gaps should score worse than mismatch score.)
Properties of Amino acids While nucleic acids are quite similar, amino acids have different chemical properties. They may be hydrophilic (polar, or even charged), hydrophobic or neutral etc.
These properties are important for the structure and function of the protein.
If an amino acid is changed, is it probable that the “new” aa has a different type of property?
Should a D -> E or S -> T change get zero score?
Are there better/worse substitutions?
• From comparisons of known proteins, it is known that some changes/mutations are more frequent than others.
• Also, not all amino acids* are equally common ... If a rare amino acid is matched, it is more significant than if a common amino acid is matched.
• How can we give a score to a mismatch/match that is biologically significant? => substitution matrices for proteins
* There are 20 amino acids, but only 4 nucleotides!
Scores for aligned amino acids (1/3)
• We want to have a score for a match/mismatch of two aligned amino acids (a and b) that is based on the probability of finding these aligned, as different pairs are more or less likely: Prob(a,b). (We assume for the moment we can get that value somehow.)
• But that is not enough if we want to know whether it is biologically significant. We need to compare Prob(a,b) to the probability that a and b are uncorrelated (occurring independently). So we take the probabilities of observing a and b on average in a protein (frequencies) and multiply them to get the probability of an alignment by chance: f(a) * f(b).
• We now have a formula for comparing the probabilities (“odds-ratio”):
    Prob(a,b) / (f(a) * f(b))
  If we expect that the amino acids are aligned more often in homologous sequences, then Prob(a,b) > f(a) * f(b), and the odds-ratio is > 1.
• In statistics usually the logarithm of this is used, to get a “log-odds” score**:
    log( Prob(a,b) / (f(a) * f(b)) )
  If Prob(a,b) > f(a) * f(b), this expression is positive (since log 1 = 0), otherwise negative.
• To get a “nice” score we multiply this log-odds value with a constant (C) and round it off. Ex. If C is 3 and the log-odds is 1.371, we get 3 * 1.371 = 4.113, so the score is +4.
• Finished formula for calculating the score for all amino acid pairs:
    score(a,b) = C * log( Prob(a,b) / (f(a) * f(b)) )
** By using log we may also add all scores instead of multiplying the odds ratios.
Scores for aligned amino acids (2/3)
• Wonderful, we now have a score formula: score(a,b) = C * log( Prob(a,b) / (f(a) * f(b)) )
• But what exactly are Prob(a,b) and f(a), f(b) ... ?
• Frequencies are easy to calculate for each aa: f(X) = occurrences of X / number of all amino acids.
  Ex. Leucine is common: f(L) = 0.099; Tryptophan is rare: f(W) = 0.013, etc.
  Aligning L and L by chance: 0.099 * 0.099 = 0.01 (1%); L and W: 0.099 * 0.013 = 0.0013, etc.
• Now we only need Prob(a,b) for all amino acid pairs!
• We want the Prob(a,b) values to be as “biologically significant” as possible. Idea:
  – Get a lot of sequences that we know are homologous (same “ancestor”).
  – Make alignments and then count how often all the different amino acid pairs occur!
  – For “BLOSUM” matrices, the BLOCKS database was used to get the alignments.
  – To get different sets of scores depending on whether we are comparing very similar sequences or less similar sequences, only use alignments with a minimum identity.
  – BLOSUM matrices have variants: BLOSUM50, BLOSUM62 ... where the number is the min id.
• Now we can calculate scores for all possible aligned amino acids, and put the values in a 20x20 substitution matrix to be used in alignment programs.
* The Pevsner book has older values.
Scores for aligned amino acids (3/3)
Ex 1. What is the score for aligning the rare W with W (Tryptophan)?
• Values: Prob(W,W) = 0.0065; f(W) = 0.013; C = 1 / 0.347 (0.347 is the “λ value”)
• log( Prob(W,W) / (f(W) * f(W)) ) = log( 0.0065 / (0.013 * 0.013) ) = 3.65
• score(W,W) = (1 / 0.347) * 3.65 = 10.52 => +11

Ex 2. Score for aligning the common L with L (Leucine)?
• Values: Prob(L,L) = 0.0371; f(L) = 0.099; C = 1 / 0.347
• log( Prob(L,L) / (f(L) * f(L)) ) = log( 0.0371 / 0.099^2 ) = 1.33
• score(L,L) = (1 / 0.347) * 1.33 = 3.84 => +4

Ex 3. Score for aligning A with L? Prob(A,L) = 0.0044; f(A) = 0.074.
• score(A,L) = (1 / 0.347) * log( 0.0044 / (0.074 * 0.099) ) = -0.51 / 0.347 = -1.47 => -1

Note: log = natural logarithm, ln.
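The three worked examples can be checked numerically. A minimal sketch of the log-odds formula, using the pair probabilities and frequencies quoted in the examples and λ = 0.347; the function name is made up.

```python
import math

LAMBDA = 0.347   # the "λ value" from the examples, so C = 1 / LAMBDA


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Unrounded score: (1 / λ) * ln( Prob(a,b) / (f(a) * f(b)) )."""
    return math.log(p_ab / (f_a * f_b)) / lam


examples = {
    "W,W": (0.0065, 0.013, 0.013),
    "L,L": (0.0371, 0.099, 0.099),
    "A,L": (0.0044, 0.074, 0.099),
}
for pair, (p_ab, f_a, f_b) in examples.items():
    raw = log_odds_score(p_ab, f_a, f_b)
    print(pair, f"{raw:.2f}", "->", round(raw))
```

The rounded values agree with the corresponding BLOSUM62 entries (W/W = 11, L/L = 4, A/L = -1).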
BLOSUM 62 scores
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
PAM matrices: “accepted point mutation”
• What point mutations are accepted by evolution?
• Dayhoff et al. examined 71 groups of proteins and counted all mutations = “accepted mutations”
• Calculated mutation probabilities for all amino acids.
• One “PAM” = 1% of the amino acid positions mutated
• PAM1 matrix for closely related proteins/organisms
• PAM100, PAM250* for less similar proteins.
• PAMs use a log-odds score similar to that of the BLOSUM matrices to construct the substitution matrix.
• PAM matrices are less used than BLOSUM, but all are distributed with BLAST.
• Read more in the book (page 50-) if you are interested.
* 250 mutations / 100 positions is possible because each position may be mutated several times.
Substitution matrices (summary)
Unitary matrices (nucleotide, protein)
  All matches get ’10’, all mismatches ’0’ (BLAST UNIT), or +1, -3 (BLASTN). Used for nucleotide seqs. Bad protein hits due to identities by chance.
Point Accepted Mutation, PAM (proteins)
  PAM30, PAM70 ... matrices. Based on evolutionary distance: 1 PAM = 1 point mutation / 100 residues. Can’t handle distant relationships well.
Blocks Substitution Matrix, BLOSUM (prots)
  BLOSUM50, BLOSUM62 ... matrices. Based on alignments in the BLOCKS db. Sequence segments of a certain identity are clustered. The most used matrices. BLOSUM62 default in BLAST (>62% identity).
Remember: Any substitution matrix is making a statement about the probability of observing a pair of aligned residues in real alignments!
BLAST: Raw score, bit score, E-value
• For each alignment a “raw score” is calculated based on the chosen substitution matrix and gap costs (gap open, gap extend).
• The raw score has several limitations:
  – Shorter alignments with high identity get a lower score than longer ones with less identity, but it is the high identity alignments that are biologically more significant.
  – The raw score does not tell us anything about the probability of finding an alignment by chance.
• Solution:
  – Calculate a normalized score:
      bit score = (λ * raw score – ln K) / ln 2   (λ and K are the “Karlin-Altschul” parameters)
  – Calculate a value for how many alignments with this score or better we would find by chance in a database of this size:
      Expect-value = search space * 2^(- bit score); search space = query length * db size
• Relation between score and E-value:
  – High score means a low E-value (very few expected hits by chance)
  – If database size gets smaller, E-value gets lower (better), score is the same.
• E-value and P-value
  – P-value is the probability of a chance alignment (E-value is the number of alignments)
  – E-value is similar to P-value for E < 0.1. (P-values are not reported by BLAST.)
  – Equation: P = 1 – e^(-E)
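The three formulas above translate directly into code. In the sketch below, the example values λ = 0.267 and K = 0.041 are an assumption: they are the values commonly reported by BLAST for gapped BLOSUM62 with gap costs 11/1 and are not given on the slide. The function names are made up.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score: (λ * raw score - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(query_length, db_size, bits):
    """E = search space * 2^(-bit score), search space = query length * db size."""
    return query_length * db_size * 2 ** (-bits)


def p_value(e_value):
    """Probability of at least one chance alignment: P = 1 - e^(-E)."""
    return 1 - math.exp(-e_value)


# Assumed Karlin-Altschul parameters (gapped BLOSUM62, gap costs 11/1).
print(f"raw score 114 -> {bit_score(114, 0.267, 0.041):.1f} bits")
for e in (0.001, 0.05, 0.1, 1.0, 3.6):
    print(f"E = {e:<6} P = {p_value(e):.4f}")
```

The loop shows why E and P are interchangeable only for small values: at E = 0.05 they nearly coincide, at E = 3.6 the P-value is already close to 1.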
BLOSUM62 scoring example
# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
Query 1  MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS 39
         MTPSL SF  SL+LG +IVV+P+T+AL+ +SQ D+  R+
         MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN

How many identities? How many gaps? Scores for mismatches: Positive/negative?

Scores for mismatches:
SLNIAVVILLFVKRTIS
   + +++++ + +  +
IFYLTILLIILITKLKN
-2 0 -2 2 0 3 1 2 2 2 0 3 -1 2 -1 -3 1   Sum 9

Scores for matches:
MTPSLSFSLLGIVVPTALSQDR
5+5+7+3*4+6+3*4+6+3*4+7+5+3*4+5+6+5   Sum 105
Matrix is symmetrical!
BLOSUM62 scoring example BLAST result
>Cyanidium_caldarium_chl_NC_001840 Length = 164921
Score = 48.5 bits (114), Expect = 9e-08
Identities = 22/39 (56%), Positives = 31/39 (79%)
Frame = +1

Query: 1       MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS 39
               MTPSL SF  SL+LG +IVV+P+T+AL+ +SQ D+  R+
Sbjct: 122065  MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN 122181

Scores for 17 mismatches:
SLNIAVVILLFVKRTIS
   + +++++ + +  +
IFYLTILLIILITKLKN
-2 0 -2 2 0 3 1 2 2 2 0 3 -1 2 -1 -3 1   Sum 9

Scores for 22 matches:
MTPSLSFSLLGIVVPTALSQDR
5+5+7+3*4+6+3*4+6+3*4+7+5+3*4+5+6+5   Sum 105
Positive scores = more probable than by chance ...
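The raw score of this ungapped alignment can be recomputed from the matrix. The sketch below stores the lower triangle of BLOSUM62 for the 20 amino acids and fills in the symmetric half while loading; the function names are made up for illustration.

```python
# Lower triangle of BLOSUM62 for the 20 standard amino acids.
LOWER_TRIANGLE = """
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
"""


def load_matrix(text):
    """Build a symmetric {(a, b): score} dictionary from the triangle."""
    order, scores = [], {}
    for line in text.strip().splitlines():
        aa, *values = line.split()
        order.append(aa)
        for other, value in zip(order, values):
            scores[aa, other] = scores[other, aa] = int(value)
    return scores


def alignment_stats(query, subject, matrix):
    """Raw score, identities and positives of an ungapped alignment."""
    pair_scores = [matrix[a, b] for a, b in zip(query, subject)]
    identities = sum(a == b for a, b in zip(query, subject))
    positives = sum(s > 0 for s in pair_scores)
    return sum(pair_scores), identities, positives


BLOSUM62 = load_matrix(LOWER_TRIANGLE)
QUERY = "MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS"
SBJCT = "MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN"
print(alignment_stats(QUERY, SBJCT, BLOSUM62))
```

The result matches the BLAST report: raw score 114 (105 from matches plus 9 from mismatches), 22 identities and 31 positives.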
E-value dependent on db size:
Database: chloroplast_genomes_29.fas
29 sequences; 3,993,082 total letters
Expect = 9e-08

Database: nt
6,385,943 sequences; 22,801,566,233 total letters
Expect = 5e-04

4 Mb vs. ~22,800 Mb: ~5,700 x difference in size. 5,700 * 9e-08 = 51e-05 = 5e-04
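Because the E-value is proportional to the size of the search space, the same hit can be rescaled from one database to another. A minimal sketch using the two database sizes quoted above; the function name is made up, and the calculation ignores the edge corrections that BLAST applies to sequence lengths.

```python
def rescale_evalue(e_value, old_db_letters, new_db_letters):
    """Scale an E-value linearly with database size (same query, same score)."""
    return e_value * new_db_letters / old_db_letters


CHLOROPLAST_DB = 3_993_082        # total letters, chloroplast_genomes_29.fas
NT_DB = 22_801_566_233            # total letters, nt

print(f"size ratio: {NT_DB / CHLOROPLAST_DB:.0f}")
print(f"E in nt:    {rescale_evalue(9e-08, CHLOROPLAST_DB, NT_DB):.0e}")
```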
Output from Blast
BLASTP 2.0.11 [Jan-20-2000]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= ramp4.seq (75 letters)
Database: nr 457,798 sequences; 140,871,481 total letters
Searching..................................................done
                                                                    Score   E
Sequences producing significant alignments:                         (bits)  Value
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126   2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126   2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74   1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46   3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36   0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29   3.6
E-value: the number of hits with a score at least this high that are expected by chance in a database of this size.
>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
Length = 65
Score = 74.1 bits (179), Expect = 1e-13
Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64
In protein alignments some mismatches are marked “similar” (+).
The alignment is called a “High-scoring Segment Pair” (HSP). There may be several HSPs for each sequence.
BLASTP: info at end of report
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
    (Substitution matrix used; gap costs.)

Number of Hits to DB: 212,176,541
Number of Sequences: 5470121
Number of extensions: 6292746
Number of successful extensions: 36982
    (Number of “word” hits in DB; number of sequences in DB; number of extensions of word.)

Number of sequences better than 10.0: 9
Number of HSP's better than 10.0 without gapping: 6
Number of HSP's successfully gapped in prelim test: 3
Number of HSP's that attempted gapping in prelim test: 36976
Number of HSP's gapped (non-prelim): 9
    (Number of sequences with E < 10; HSP gap info; number of HSPs, here = 1 / seq.)

length of query: 70
length of database: 1,894,087,724
effective HSP length: 42
effective length of query: 28
effective length of database: 1,664,342,642
effective search space: 46601593976
effective search space used: 46601593976
    (Query length; calculated search space = effective length of query * effective length of db, used in the E-value calculation.)
BLASTN: info at end of report
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
    (Scoring: identity +1, mismatch -3; gap costs.)

Number of Hits to DB: 95,310
Number of Sequences: 29
Number of extensions: 95310
Number of successful extensions: 67
    (Number of “word” hits in DB; number of sequences in DB; number of extensions of word.)

Number of sequences better than 10.0: 26
Number of HSP's better than 10.0 without gapping: 26
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 67
    (Number of sequences with E < 10; HSP gap info; number of HSPs, here > 1 / seq.)

length of query: 108
length of database: 3,993,082
effective HSP length: 14
effective length of query: 94
effective length of database: 3,992,676
effective search space: 375311544
effective search space used: 375311544
    (Query length; calculated search space = effective length of query * effective length of db, used in the E-value calculation.)
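The numbers in both report footers are consistent with a simple calculation, sketched below. The effective lengths are obtained by subtracting the effective HSP length from the query and once per sequence from the database. The E-value relation E = search space * 2^(-bit score) is the standard bit-score form of the Karlin-Altschul statistics. Function names are illustrative, not BLAST's API.

```python
def effective_lengths(query_len, db_len, n_seqs, eff_hsp_len):
    """Subtract the effective HSP length from the query and from every DB sequence."""
    return query_len - eff_hsp_len, db_len - n_seqs * eff_hsp_len


def effective_search_space(eff_query_len, eff_db_len):
    """Search space used in the E-value calculation."""
    return eff_query_len * eff_db_len


def evalue_from_bits(bit_score, search_space):
    """E = search_space * 2**(-bit_score)."""
    return search_space * 2.0 ** (-bit_score)


# BLASTP footer: query 70, database 1,894,087,724 letters in 5,470,121 sequences
blastp_q, blastp_db = effective_lengths(70, 1_894_087_724, 5_470_121, 42)
print(blastp_q, blastp_db, effective_search_space(blastp_q, blastp_db))

# BLASTN footer: query 108, database 3,993,082 letters in 29 sequences
blastn_q, blastn_db = effective_lengths(108, 3_993_082, 29, 14)
print(blastn_q, blastn_db, effective_search_space(blastn_q, blastn_db))
```

Because the search space enters the E-value as a plain factor, the same bit score gives a larger E-value in a larger database.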
BLAST search variants

Program    Query        Output/Align   Database
blastn     DNA          dna/dna        DNA
blastp     Protein      prot/prot      Protein
tblastn*   Protein      prot/prot      DNA (6 frames)
blastx*    DNA (6 fr)   prot/prot      Protein
tblastx*   DNA (6 fr)   prot/prot      DNA (6 frames)
* = ”translated BLAST”
Example 1: Searching a new genome assembly for a protein homolog. Input: protein. Database: DNA (genome sequences). Program: tblastn
Example 2: We have DNA sequences and want to find out if they code for a similar protein. Program: ... ??
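The table of variants can be read as a lookup from (query type, database type) to program. The sketch below encodes exactly that table; `choose_blast_programs` is a hypothetical helper, not part of any BLAST distribution.

```python
# (query type, database type) -> programs, as in the table of BLAST variants
BLAST_VARIANTS = {
    ("dna", "dna"): ["blastn", "tblastx"],   # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("protein", "dna"): ["tblastn"],         # database translated in 6 frames
    ("dna", "protein"): ["blastx"],          # query translated in 6 frames
}


def choose_blast_programs(query_type, database_type):
    """Return the BLAST programs that accept this query/database combination."""
    return BLAST_VARIANTS[(query_type.lower(), database_type.lower())]


print(choose_blast_programs("protein", "DNA"))   # Example 1
print(choose_blast_programs("DNA", "protein"))   # Example 2
```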
Conserved synteny ... missing gene?
apcD accD psbV petJ tRNA psbX
apcD accD psbV petJ tRNA ?
apcD accD psbV petJ tRNA psbX
C.merolae
The annotations of three chloroplast genomes from red algae were compared. In a conserved gene cluster, one species lacked a gene.
C.caldarium
P.purpurea
How do we check if the gene is really missing using BLAST?
Query? Searched database? Program? Alternatives?
apcD tRNA psbX accD psbV petJ
Search results psbX (TBLASTN)
gene            121234..121725
                /gene="apcD"
                /locus_tag="CycaCp138"
                /db_xref="GeneID:800290"
CDS             121234..121725
                /gene="apcD"
                /locus_tag="CycaCp138"
                /note="allphycocyanin gamma chain"
                /codon_start=1
                /transl_table=11
                /product="allophycocyanin gamma subunit"
                /protein_id="NP_045154.1"
gene            complement(121897..121979)
                /locus_tag="CycaCt226"
                /db_xref="GeneID:1457232"
tRNA            complement(121897..121979)
                /locus_tag="CycaCt226"
                /product="tRNA-Leu"
                /db_xref="GeneID:1457232"
gene            122225..123028
                /gene="accD"
                /locus_tag="CycaCp139"
                /db_xref="GeneID:800114"
CDS             122225..123028
                /gene="accD"
                /locus_tag="CycaCp139"
                /codon_start=1
                /transl_table=11
                /product="acetyl-CoA carboxylase beta subunit"
                /protein_id="NP_045155.1"
C.caldarium chloroplast: “psbX region”
C.caldarium matching region:
>122065-122187 C. caldarium chloroplast
ATGACACCAAGTTTGATTTCTTTTTTCTATAGTTTACTTTTA
GGAACTATCATTGTCGTTTTACCATTAACAATAGCGCTTATA
TTAATTAGCCAAACTGATAAGCTAAAAAGAAAT
TTTTAG
BLAST result:
Query= gi|122194748|sp|Q1XDU0.1|PSBX_PORYE
Length=39

>ref|NC_001840.1| Cyanidium caldarium chloroplast, complete genome
Length=164921

Score = 47.0 bits (110), Expect = 1e-08
Identities = 22/39 (56%), Positives = 31/39 (79%), Gaps = 0/39 (0%)
Frame = +1

Query 1      MTPSLSSFLNSLILGAVIVVVPITLALLFVSQKDRTIRS 39
             MTPSL SF  SL+LG +IVV+P+T+AL+ +SQ D+  R+
Sbjct 122065 MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRN 122181

Score = 20.0 bits (40), Expect = 1.4
Identities = 7/20 (35%), Positives = 15/20 (75%), Gaps = 0/20 (0%)
Frame = +3

Query 4     SLSSFLNSLILGAVIVVVPI 23
            +LS FLN ++   ++++VP+
Sbjct 89931 TLSCFLNEMLESLILLLVPL 89990
Full protein sequence?: TTT = F, TAG = ?
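To get the full protein, the matching region 122065-122187 can be translated codon by codon. The sketch below builds the standard codon table from the usual TCAG-ordered string; the feature table above gives /transl_table=11, which uses the same codon-to-amino-acid assignments. `translate` is an illustrative helper, and `*` marks a stop codon.

```python
BASES = "TCAG"
# Amino acids for codons in TTT, TTC, TTA, TTG, TCT, ... GGG order (standard code)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1; incomplete trailing codons are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# C. caldarium chloroplast, positions 122065-122187 (from the slide)
REGION = (
    "ATGACACCAAGTTTGATTTCTTTTTTCTATAGTTTACTTTTA"
    "GGAACTATCATTGTCGTTTTACCATTAACAATAGCGCTTATA"
    "TTAATTAGCCAAACTGATAAGCTAAAAAGAAAT"
    "TTTTAG"
)
print(translate(REGION))  # MTPSLISFFYSLLLGTIIVVLPLTIALILISQTDKLKRNF*
```

The last two codons are TTT (F) and TAG (stop), so the open reading frame extends one residue beyond the aligned region and then terminates.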
BLAST output, with many HSPs
gb|CM000011.1| Canis familiaris chromosome 11, whole genome shot...   86   9e-15

>gb|CM000011.1| Canis familiaris chromosome 11, whole genome shotgun sequence
Length = 75769841

Score = 85.7 bits (43), Expect = 9e-15
Identities = 89/102 (87%), Gaps = 3/102 (2%)
Strand = Plus / Minus

Query: 4        cgtgctgaaggcctgtatcctaggctacacactgaggactctgttcctcccctttccgcc 63
                |||||||||||||||| |||||||||||| || ||||||| |||||||    ||| ||||
Sbjct: 53542401 cgtgctgaaggcctgtttcctaggctacagacggaggact-tgttcctta--tttgcgcc 53542345

Query: 64       taggggaaagtccccggacctcgggcagagagtgccacgtgc 105
                ||||||||||||||||||||   ||||||||||||| |||||
Sbjct: 53542344 taggggaaagtccccggacccttggcagagagtgccgcgtgc 53542303

Score = 75.8 bits (38), Expect = 9e-12
Identities = 75/86 (87%), Gaps = 1/86 (1%)
Strand = Plus / Minus

Query: 181      ggggcgtcatccgtcagctccctctagttacgcaggcagtgcgtgtcc-gcgcaccaacc 239
                |||||||| ||||||| |||  ||||||||||||||||| |||  |   |||| ||||||
Sbjct: 53542216 ggggcgtcgtccgtcaactctatctagttacgcaggcagcgcgcctggtgcgcgccaacc 53542157

Query: 240      acacggggctcattctcagcgcggct 265
                ||||||||||||||||||||||||||
Sbjct: 53542156 acacggggctcattctcagcgcggct 53542131

Score = 36.2 bits (18), Expect = 7.7
Identities = 18/18 (100%)
Strand = Plus / Minus

Query: 25       aggctacacactgaggac 42
                ||||||||||||||||||
Sbjct: 42727936 aggctacacactgaggac 42727919
Note: Only the best HSP is shown in the list before the alignments. Check the positions to understand in which order the HSPs match. The strand must be the same!
Aligning two sequences - Gap extension penalty. Alignment of genomic sequence with mRNA (Global alignment!)
Alignment of the following two sequences: V00594 (Human mRNA for metallothionein) and J00271 (corresponding genomic sequence).
Default setting
Extend gap = 3
In a global alignment all residues are matched.
?
!
New settings
Extend gap = 0
Exon 1
Exon 2
Exon 3
Rules of database searches (like BLAST)
• Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level *
• Use of smallest possible database (not too small though, ... homologs?)
• Sequence statistics should be used rather than percent identity/similarity as criterion for homology. E-values < 1e-03. (But distant homologs ...)
• Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
 F  R  F  S  T  R  S
2) For nucleotide—nucleotide searches, it is often good to set the word size low (-W 7)
1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60
61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120
121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180
181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240
241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300
301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360
361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ
Low complexity sequence tends to (1) increase the number of non-specific hits to database sequences (2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)
Therefore, low complexity parts are filtered out by default in BLAST searches. (Don’t use filtering if you want exact matches.)
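A rough way to see why such regions are flagged is to measure the Shannon entropy of a sliding window: a window made of one or two residue types has very low entropy. This is only a simplified stand-in for the SEG filter that BLAST applies to protein queries. The window size and threshold below are illustrative assumptions, not BLAST defaults.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=2.2):
    """Start positions (0-based) of windows whose entropy is at or below the threshold."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) <= threshold
    ]


# Residues 121-180 of the example protein above (contains a poly-Q stretch)
SLIDE_SEGMENT = "NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK"
starts = low_complexity_windows(SLIDE_SEGMENT)
print([start + 121 for start in starts])  # protein coordinates of flagged windows
```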
Databases at NCBI available for BLAST searches
Protein sequence databases
nr          All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF
swissprot   The last major release of SWISS-PROT
uniprot     swissprot + TrEMBL (translated EMBL DNA sequences)
DNA sequence Databases
nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
You may also blast against single genomes ...
How can the sequences of protein homologs be so different?
ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G
60% nucleotide identity
ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA
M V R L E K I N Q A G L L V A G
69% amino acid identity
M V R I Q K I N E K G A L L A G 38%
Q V R I Q K I Y E K G A L L A A 19% (‘twilight zone’)
Q V R I Q K I Y E K T A L L F A
6% (‘midnight zone’)
Evolution of protein genes: secondary and tertiary structure conserved
BLAST at NCBI What kind of BLAST will you perform?
BLASTP search page 1. INPUT: Sequence or accession number
2. DATABASE: Choose non-redundant nr, SwissProt ...
3. RESTRICT SEARCH? Input organism or organism group to be searched, other sequences are neglected
Sequence
Database Organism
BLAST output at NCBI
1 perfect hit, some hits with parts of sequence matched
Alignments below
“HSP” (high-scoring segment pair): there may be several!
Best hit
Next best hit
Sequence analysis II
• Finding distant homologues with BLink
• Multiple sequence alignments
• PSSM, HMM, profiles
• Domain databases
• Secondary structure
• Transmembrane prediction
• Signal peptide prediction
Eukaryote phylogeny
Baldauf et al. Science 2000
Based on 4 concatenated protein sequences.
Maximum parsimony.
Humans are more like fungi than plants!
Protein similarity: yeasts, chordates
Mouse and human proteins are very similar.
Candida glabrata is “closest” to S.cerevisiae, but not as similar as one may think.
Orthologous sequences.
“Evolution”: yeasts, chordates
Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Dujon B. Trends Genet. 2006 Jul;22(7):375-87.
Average protein identity between orthologs
NCBI protein entry, BLink
Name, length, IDs
Sources Keywords Organism and taxonomy References
The NCBI protein database entry for SRP21 from yeast which we searched for.
As many researchers want to make BLAST searches, a precomputed BLAST page is accessible: BLink (“BLAST Link”)
SRP21 YEAST
Finding homologs by iterative BLAST
Fitzpatrick et al., BMC Evolutionary Biology (2006) 6:99
Shown is a phylogenetic tree of 42 fungal species. At the bottom are the Saccharomycotina (2 major groups, red), above them the Pezizomycotina (green), and the last group contains only S.pombe (orange).
Starting with a Saccharomyces cerevisiae protein, how many homologs can we find within (Ascomycota) fungi?
Is the S.cerevisiae protein similar enough for BLAST to find the homologs ...
Saccharomyces SRP21 BLAST result
Chosen protein: SRP21
The SRP21 protein from Saccharomyces cerevisiae is used in a BLAST search to find homologs in other organism groups. But the complete BLAST (BLink) list only contains SRP21 from the two Saccharomycotina groups.
Idea: Pick a distantly related SRP21 and use it as new query ...
BLink is the precomputed BLAST search result available at NCBI.
Debaryomyces SRP 21 BLAST result (BLink)
Aspergillus SRP21 and other
Pezizomycotina
Saccharomycotina SRP21
Debaryomyces belongs to Saccharomycotina, so other SRP21 from the 2 groups are easily found.
Now we also find Pezizomycotina homologs!
But no S.pombe ...
Use Aspergillus SRP21!
Aspergillus SRP21 BLAST
S.pombe SRP21 !
Saccharomycotina SRP21
Pezizomycotina SRP21 and other sequences
Aspergillus belongs to the Pezizomycotina group, so those homologs are easily found. The result also includes some Saccharomycotina.
Now we have finally found the S.pombe homolog.
Finding homologs by iterative BLAST
Fitzpatrick et al., BMC Evolutionary Biology (2006) 6:99
S.cerevisiae SRP21 (Saccharomycotina - A)
Debaryomyces SRP21 (Saccharomycotina - B)
Aspergillus SRP21 (Pezizomycotina)
S.pombe SRP21 !
Accumulated mutations during evolution made the proteins too different. By “jumping” from group to group we bridged the gap.
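The “jumping” strategy amounts to a breadth-first search in which each newly found homolog is used as the next query. The sketch below shows that logic only: `find_hits` stands in for a real BLAST or BLink lookup, and `TOY_HITS` is a hand-made table that mirrors the hits described on these slides, not actual search output. In practice each new query should be checked before it is reused, since a false positive would pull in unrelated sequences.

```python
from collections import deque


def iterative_homolog_search(start, find_hits):
    """Breadth-first search: every newly found homolog becomes a new query."""
    found = {start}
    queue = deque([start])
    while queue:
        query = queue.popleft()
        for hit in find_hits(query):
            if hit not in found:
                found.add(hit)
                queue.append(hit)
    return found


# Toy stand-in for search results (assumed for illustration only)
TOY_HITS = {
    "S.cerevisiae SRP21": ["Debaryomyces SRP21"],
    "Debaryomyces SRP21": ["S.cerevisiae SRP21", "Aspergillus SRP21"],
    "Aspergillus SRP21": ["Debaryomyces SRP21", "S.pombe SRP21"],
    "S.pombe SRP21": ["Aspergillus SRP21"],
}

homologs = iterative_homolog_search("S.cerevisiae SRP21", lambda q: TOY_HITS.get(q, []))
print(sorted(homologs))
```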
Check that S.pombe SRP21 finds ...
Debaryomyces SRP21 (Saccharomycotina)
Other Pezizomycotina SRP21
Aspergillus SRP21 (Pezizomycotina)
If protein is not annotated – check if reciprocal search is successful
SRP21 aligned to SRP9 &14
Unaligned box 21
9
14
Secondary structure prediction by PSI-Pred also showed the conserved αβββα structure.
SRP9/14 αβββα secondary structure (Birse et al.) shown as cylinders (alpha helices) and arrows (beta strands).
The most conserved residues are in secondary structure elements. SRP9, SRP21 more similar.
Residues marked according to similarity in sequence and chemical properties.
21
9
14
Sr54_arcfu
Ftsy_aquae
Shared domain(s)
Example: Homology, domain architecture
Common ancestry, different function
Orthologs/paralogs?
RNA binding domain
N-terminal
C-terminal
Two different proteins (4+4 sequences ) are aligned. They share a domain. They are paralogs.
Pfam: TreeFam (homologs) FA9_HUMAN example (cont.):
In the displayed tree, gene duplications (red dots at nodes; leading to paralogs) and speciation events (blue dots; leading to orthologs) are shown.
FA9 and FA10 arose by a gene duplication. Some nodes are hard to decide upon (duplications?).
FA10
FA9
Multiple alignments - applications
Identify conserved motifs - patterns (PROSITE) Profiles (Pfam, PROSITE) Phylogenetic studies Prediction of protein secondary structure Experimental : design of probes
Multiple alignment software
Pileup (GCG)
Clustalw / Clustalx
T-coffee
Muscle/MAFFT
Multiple alignment editors/viewers
SeqLab (GCG) Jalview CINEMA Genedoc Bioedit Boxshade
How to find homologs with low sequence identity
• Similarity gets low if evolutionary distance gets big.
• Many amino acid positions change.
• An amino acid may be substituted differently in different species.
• If we have many known homologs, we can use all of them as queries, but the unknown sequence may have yet another set of substitutions compared to the known homologs. So: align the known sequences and make a “profile”.
• The profile has a different substitution matrix for each position in the alignment ...
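How a profile ends up with position-specific scores can be sketched in a few lines. The example assumes an ungapped alignment, a uniform background frequency of 1/20 and a pseudocount of 1. These are simplifications chosen for illustration; real tools such as PSI-BLAST and HMMER use more elaborate weighting. The four aligned sequences are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(alignment, pseudocount=1.0, background=1 / 20):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    profile = []
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores of a segment aligned to the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


# Hypothetical ungapped alignment of four known homologs
ALIGNMENT = ["GDSGG", "GDSGG", "GDSGA", "GDAGG"]
profile = build_profile(ALIGNMENT)
print(round(score_segment(profile, "GDSGG"), 2), round(score_segment(profile, "GDTGG"), 2))
```

The same residue can thus score differently depending on the column, which is the point of the last bullet above.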
Multiple alignment: env, ClustalX
frequency plot (no gaps!)
information (bits) plot
The probability of an amino acid in a sequence depends on the position!
Position Specific Substitution Rates
Active site serine Typical serine
Position Specific Score Matrix (PSSM)
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
Example sequence. How does Serine score in positions 211 and 216?
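The question can be answered by reading the S column of the two rows. A small sketch, with the two rows copied from the PSSM above and `pssm_score` as an illustrative helper name:

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows 211 and 216 of the PSSM shown above
PSSM_ROWS = {
    211: "4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
}


def pssm_score(position, residue):
    """Look up the score of a residue at a given PSSM position."""
    values = [int(v) for v in PSSM_ROWS[position].split()]
    return values[COLUMNS.index(residue)]


for position in (211, 216):
    print(position, {aa: pssm_score(position, aa) for aa in "SAT"})
# 211 {'S': 4, 'A': 4, 'T': 3}
# 216 {'S': 7, 'A': -2, 'T': -2}
```

At position 211 serine scores 4, and alanine and threonine score almost as well. At position 216 serine scores 7 and every other residue is penalized.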
Amino acids
Pfam uses HMMs
An HMM (hidden Markov model) is a statistical model for identifying features in a primary sequence. It is built from a multiple sequence alignment, from which the probabilities are calculated. HMMs are also used in gene prediction (GenScan) ...
Splice site “toy” example. We are looking for the start of the intron.
Pfam-Trypsin: “Summary” The summary for the domain contains background, literature references and links to other databases plus much more ....
(... next slide ...)
... Pfam-Trypsin: “Summary” ...
The summary also groups the domain into a “clan” which has several other members, in this case the clan is called Peptidase_PA *.
Gene Ontology terms associated with the domain are listed: proteolysis.
Similarity to other domains by using PRC, a domain-domain search program. Most are in the same clan.
Links
* This clan contains a diverse set of peptidases with the trypsin fold.
Pfam-Trypsin: “Interactions” The “Interaction” section gives a list of the domains that have been shown to interact with the domain.
For instance, coagulase domain is present in a bacterial protein that binds to the trypsin domain of human prothrombin.
Pfam-Trypsin : “Domain organisation” For each domain, Pfam lists all types of domain organisations that contain the domain.
Shown are the first 5; the FA10-9 organisation, which is the same for both proteins, is at the bottom with its member proteins displayed.
Gla-EGF-EGF-Trypsin
List: FA9 etc.
Trypsin-PDZ
Trypsin-PDZ-PDZ
Trypsin-Trypsin
Pfam-Trypsin: “Curation”, seed Information about how the domain was created.
Where the initial (seed) alignment came from, in this case SCOP and PROSITE, and how many sequences it contained (71).
The “full” alignment is the seed plus all found proteins (6237!) that score above a certain threshold. Seed and Full alignments are found under “Alignments”.
Pfam – protein domains DB
• From multiple alignments of many related proteins, profiles (profile HMMs) are made
• Curated and highly trusted Pfam-A (red, green, yellow) and automatic temporary Pfam-B (striped).
• Input a sequence, match to all families/HMMs.
• UniProt sequences are in Pfam database. (shown is FA9_HUMAN)
Pfam DB: Karolinska Inst., Sanger (UK), S:t Louis (USA), Pasteur (F)
http://pfam.sanger.ac.uk/
Pfam-A
Pfam: “Features”, FA9_HUMAN
Disulphide bonds
InterPro domains
Pfam superfamilies
Signal peptide prediction
In “Features” for a sequence, Pfam incorporates other info than Pfam domains, some imported from other databases. + Prediction of signal peptides.
Holding the cursor over part gives info.
FA9 search at PROSITE
Symbols for active sites (red) and disulphide bridges (grey lines), also in Pfam.
Hold pointer over domain and the sequence will be highlighted.
Almost the same result as Pfam. Second EGF domain in Pfam is not in PROSITE.
PROSITE also contains matches to patterns!
PROSITE does not have an equivalent to Pfam-B models.
http://www.expasy.org/
NCBI Conserved Domains
In CDD also Pfam domains are listed.
The search is conducted in a similar but not identical way.
SRS: InterPro – search all domain DB
InterPro domains are also listed in Pfam for each protein
PROSITE Pfam PRODOM PRINTS SMART ... ...
Seq input
PSI-BLAST creates profiles automatically
When no more new sequences are found, search terminates.
Problem: If bad sequences enter the profile, it finds only trash!
PSI-BLAST example (POP)
• The particle RNaseP exists both in yeast and metazoans, but their protein components differ
• However, several of them have been shown to be homologs:
Metazoa   Yeast
Rpp25     POP6
Rpp38     POP3
• Two of them (Rpp14, POP8) do not have a homolog in the other organism. Or?
PSI-BLAST example (POP8)
QUERY: protein from Neurospora crassa, a fungus; a probable POP8 protein.
QUESTION: Can we see a distant relationship to metazoan proteins? (We are interested in Rpp14.)
BLAST did not give us any new information, only other fungal homologs. A more sensitive method is needed.
We try with PSI-BLAST instead, and look at the results after round 1-5: _________________________________________________________________________________
Results from round 1
                                                                     Score   E
Sequences producing significant alignments:                          (bits)  Value
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   221   5e-57
gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   126   2e-28
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...    96   4e-19
ref|XP_659856.1| hypothetical protein AN2252.2 [Aspergillus nidu...    62   4e-09
emb|CAD99064.1| SPBC1709.20 [Schizosaccharomyces pombe] >gi|6801...    39   0.069
PGUG_01796.1 | Candida guilliermondii predicted protein (transla...    34   1.8
/.../ Comment: Only fungal proteins in the first round as in BLAST. No profile yet.
PSI-BLAST example (2)
Results from round 2
                                                                     Score   E
Sequences producing significant alignments:                          (bits)  Value

Sequences used in model and found again:
gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   218   5e-56
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...   205   3e-52
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   203   1e-51
ref|XP_659856.1| hypothetical protein AN2252.2 [Aspergillus nidu...   144   7e-34

Sequences not found previously or not previously below threshold:
emb|CAD99064.1| SPBC1709.20 [Schizosaccharomyces pombe] >gi|6801...    50   2e-05
gb|AAS53388.1| AFR017Cp [Ashbya gossypii ATCC 10895] >gi|4519853...    46   5e-04
ref|ZP_00744890.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    42   0.006
ref|ZP_00755396.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    41   0.009
gb|AAF93762.1| glutamyl-tRNA synthetase-related protein [Vibrio ...    41   0.009
ref|ZP_00748964.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    41   0.009
ref|ZP_00751512.1| COG0008: Glutamyl- and glutaminyl-tRNA synthe...    41   0.009
ref|ZP_00434095.1| COG0625: Glutathione S-transferase [Burkholde...    41   0.012
ref|XP_454424.1| unnamed protein product [Kluyveromyces lactis] ...    40   0.034
713_pichia_stipitis_FM1.aa.fasta                                       38   0.083
emb|CAE28370.1| conserved hypothetical protein [Rhodopseudomonas...    37   0.18
ref|XP_341296.2| PREDICTED: similar to ribonuclease P 14kDa subu...    37   0.19
/../ Comment: The constructed profile found more sequences, even a Rpp14.
PSI-BLAST example (3)
Results from round 4
                                                                     Score   E
Sequences producing significant alignments:                          (bits)  Value

Sequences used in model and found again:
gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   179   3e-44
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...   175   3e-43
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   173   1e-42
ref|XP_454424.1| unnamed protein product [Kluyveromyces lactis] ...   143   2e-33
713_pichia_stipitis_FM1.aa.fasta                                      140   2e-32
gb|AAS53388.1| AFR017Cp [Ashbya gossypii ATCC 10895] >gi|4519853...   133   2e-30
emb|CAA84837.1| unnamed protein product [Saccharomyces cerevisia...   127   1e-28
emb|CAG88296.1| unnamed protein product [Debaryomyces hansenii C...   123   2e-27
emb|CAD99064.1| SPBC1709.20 [Schizosaccharomyces pombe] >gi|6801...   122   3e-27
ref|XP_659856.1| hypothetical protein AN2252.2 [Aspergillus nidu...   119   3e-26
gb|AAH95792.1| Hypothetical protein LOC553721 [Danio rerio] >gi|...   119   3e-26
ref|XP_447894.1| unnamed protein product [Candida glabrata] >gi|...   107   8e-23

Sequences not found previously or not previously below threshold:
ref|XP_593187.1| PREDICTED: similar to ribonuclease P 14kDa subu...    85   7e-16
ref|NP_080214.1| ribonuclease P 14kDa subunit [Mus musculus] >gi...    82   6e-15
/.../ Comment: Rpp14 sequences now have good E-values.
PSI-BLAST example (end)
Results from round 5
                                                                     Score   E
Sequences producing significant alignments:                          (bits)  Value

Sequences used in model and found again:
ref|XP_370445.1| hypothetical protein MG06942.4 [Magnaporthe gri...   165   4e-40
gb|EAA77253.1| hypothetical protein FG07394.1 [Gibberella zeae P...   162   3e-39
ref|NP_008973.1| ribonuclease P 14kDa subunit [Homo sapiens] >gi...   156   2e-37
gb|AAX36190.1| ribonuclease P 14kDa subunit [synthetic construct]     156   2e-37
emb|CAD70969.1| hypothetical protein [Neurospora crassa] >gi|324...   155   4e-37
ref|XP_849188.1| PREDICTED: similar to ribonuclease P 14kDa subu...   152   3e-36
ref|NP_080214.1| ribonuclease P 14kDa subunit [Mus musculus] >gi...   152   4e-36
ref|XP_593187.1| PREDICTED: similar to ribonuclease P 14kDa subu...   151   5e-36
ref|XP_341296.2| PREDICTED: similar to ribonuclease P 14kDa subu...   150   2e-35
gb|AAS53388.1| AFR017Cp [Ashbya gossypii ATCC 10895] >gi|4519853...   144   6e-34
gb|AAH95792.1| Hypothetical protein LOC553721 [Danio rerio] >gi|...   140   1e-32
ref|NP_001017048.1| ribonuclease P 14kDa subunit [Xenopus tropic...   135   4e-31
ref|XP_526214.1| PREDICTED: similar to ribonuclease P 14kDa subu...   133   2e-30
ref|XP_454424.1| unnamed protein product [Kluyveromyces lactis] ...   133   2e-30
emb|CAG88296.1| unnamed protein product [Debaryomyces hansenii C...   133   3e-30
/.../
In the last round we have a mix of fungal (POP8) sequences and Rpp14! This is a good indication that these proteins are related (homologs).
Protein secondary structure elements
• Alpha helix • Beta strand • Coil
• Connected beta strands: beta sheet
• Beta sheet forming a closed structure that may span a cell membrane (porin): beta-barrel
PSIPRED prediction compared to 3D structure
Structure 2W9J fragment pos 1-91 of SRP14 S.pombe
Not in structure
N-term
Integral membrane proteins
The transmembrane regions are mostly α-helices (Beta-barrels also exist ...)
Length of TM domain approx 20 aa.
GPCRs
Transmembrane prediction
• 25% of all proteins are membrane bound
• By comparing known transmembrane proteins, programs like TMHMM make predictions about where the transmembrane regions are. “inside”, “TM” and “outside” = HMMs. Probabilities for these 3 are compared.
• Similar approaches are used for signal peptide and transit peptide predictions, and even for gene predictions ...
TMHMM output (GPCR input)
Seven clearly defined TM domains.
C-terminal is inside cell.
TMHMM output
RF47_[Guillardia  len=68  ExpAA=37.41  First60=32.65  PredHel=2  Topology=i2-19o47-64i
ORF74_[Odontella  len=74  ExpAA=39.05  First60=32.92  PredHel=2  Topology=i2-24o48-65i
ORF71_[Porphyra   len=71  ExpAA=36.0   First60=26.14  PredHel=2  Topology=i7-24o53-70i
ORF70_[Chlorella  len=70  ExpAA=38.67  First60=32.40  PredHel=2  Topology=i2-21o45-67i
-------------------------------------------------------------------------
PredHel=2 (= 2 TM domains)
Topology=i2-21o45-67i: inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside
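The topology string can be decoded mechanically. The sketch below assumes the short-format notation shown in the output: `i` for inside, `o` for outside, and `start-end` for each predicted helix. The function names are illustrative and not part of TMHMM.

```python
import re


def parse_topology(topology):
    """Return the (start, end) residue ranges of the predicted TM helices."""
    return [(int(start), int(end)) for start, end in re.findall(r"(\d+)-(\d+)", topology)]


def describe_topology(topology):
    """Spell out the string as inside/TRANSMEMBRANE/outside segments."""
    names = {"i": "inside", "o": "outside"}
    parts = re.findall(r"[io]|\d+-\d+", topology)
    return "-".join(names.get(part, "TRANSMEMBRANE") for part in parts)


print(parse_topology("i2-21o45-67i"))     # [(2, 21), (45, 67)]
print(describe_topology("i2-21o45-67i"))  # inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside
```

The number of ranges returned by `parse_topology` equals PredHel.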
Example in which scores for first TM domain are too low :
SignalP, signal peptide data
                         Eukaryotes
Total length (average)   22.6 aa
n-regions                only slightly Arg-rich       (pos. charged)
h-regions                short, very hydrophobic      (hydrophobic)
c-regions                short, no pattern            (neutral, polar)
-3,-1 positions          small and neutral residues
+1 to +5 region          no pattern
Blue:Positively charged residues Red:Negatively charged residues Green:Neutral polar residues Black:Hydrophobic residues
Signal peptide/anchor prediction by SignalP
>TXN4_HUMAN
Prediction: Signal peptide
Signal peptide probability: 0.984
Signal anchor probability: 0.015
Max cleavage site probability: 0.962 between pos. 29 and 30
Scores for the n-region, h-region, and c-region of the signal peptide plus cleavage prediction
SignalP: No signal peptide, but anchor
>sp_Q93127_GPR18_BALAM
Prediction: Signal anchor
Signal peptide probability: 0.000
Signal anchor probability: 0.969
No h- and c-regions no cleaved peptide
TM domain = anchor?
SignalP: Non-secretory protein
>BM2K_HUMAN
Prediction: Non-secretory protein
Signal peptide probability: 0.157
Signal anchor probability: 0.023
Max cleavage site probability: 0.027 between pos. 28 and 29
TargetP: transit-peptide, signalpep
http://www.cbs.dtu.dk/services/TargetP/
Nuclear-encoded proteins destined for a mitochondrion or plastid have a “transit peptide” that directs the protein to the organelle.
TargetP: transit-peptide prediction mito, chloro ...?