sequence analysis sequences: dna, rna ,...


Upload: ngohuong

Post on 23-Feb-2018


Page 1: Sequence analysis Sequences: DNA, RNA , proteinbio.lundberg.gu.se/courses/vt06/seq_anal_1-4_phd_march06.pdf · Methods in sequence analysis ... into identical amino acid sequences


Sequence analysis

• Analysis of primary, secondary, not tertiary ... structures

• Biological sequences. Central dogma.
• Similarities (orthologs, paralogs)
• Methods, algorithms (alignments, models)
• Databases (primary, secondary)

Sequences: DNA, RNA , protein ...

Genome: DNA
↓ transcription
Primary transcript: pre-mRNA, pre-ncRNA
↓ processing (splicing*, cleavage)
Processed transcript: mRNA, ncRNA (tRNA, rRNA ...)
↓ translation, modification
[a] Translated sequence: protein (amino acids). [b] Mature ncRNA
↓ protein cleavage ...
Mature protein.

[ ESTs are nucleotide sequences, might be unspliced, spliced ...]

* Spliceosomal splicing of pre-mRNA only occurs in Eukaryotes (self-splicing introns also exist in some bacteria and archaea).

SEQUENCE ANALYSIS

Where and why ?

Sequencing projects, assembly of sequence data
Identification of functional elements in sequences
Sequence comparison
Classification of proteins
Comparative genomics
RNA structure prediction
Protein structure prediction
Evolutionary history

Alignments and database searches (Summary)

Common biological problem: We have a novel protein sequence. What can we infer from this sequence about the biological function of the protein?

* Sequence homology - BLAST, FASTA, SSEARCH
Simple example: an unknown human protein is highly similar to a protein with known function from another organism
=> The human protein probably has the same or a related function (it’s a homolog: ortholog or paralog)
* Pattern/profile search – PROSITE, Profile search - Pfam
** Secondary structure prediction
** Prediction of transmembrane domains (~ 25 % of all proteins are membrane bound!)

Comparing non-identical sequences
Protein sequence comparison - basic concepts

When two protein sequences are compared and the similarity is considered statistically significant, it is highly likely that the two proteins are evolutionarily related. There are two kinds of biological relationships:

Orthologs: proteins that carry out the same function in different species

Paralogs: proteins that perform different but related functions within one organism

Proteins are homologous if they are related by divergence from a common ancestor.

Homology: orthologs & paralogs

Orthology describes genes in different species that derive from a common ancestor (= MouseA, ChickA, FrogA, which all come from the alpha-chain gene in the common ancestor).

Paralogy describes homologous genes within a single species that diverged by gene duplication (= MouseA and MouseB).


Methods in sequence analysis

• Simple transformation/extraction
a) Translation: RNA > protein
b) Reverse translation: protein > RNA
c) Splicing (removing introns in pre-mRNA, pre-rRNA ...)

• Comparison of primary sequences
a) Identity: finding sites, pattern matches
b) Alignments: non-identical seqs (pair/multiple/phylogeny)

• Analyzing for other properties
a) statistical composition
b) profile analysis (PSI-BLAST)
c) HMMs (probabilities of aa in position, Pfam)
d) higher order structure (secondary structure in RNA/prot)

Translation of sequences

• Different nucleotide sequences may translate into identical amino acid sequences.

• Nucleotide sequence may yield different amino acid seqs. (6 reading frames)

• Reverse translation does not give unique nucleotide sequence.

• Alternative splicing of pre-mRNA: 1 gene – several proteins!

The (degenerate) Genetic code

UUU Phe F   UCU Ser S   UAU Tyr Y   UGU Cys C
UUC Phe F   UCC Ser S   UAC Tyr Y   UGC Cys C
UUA Leu L   UCA Ser S   UAA Stop*   UGA Stop*
UUG Leu L   UCG Ser S   UAG Stop*   UGG Trp W

CUU Leu L   CCU Pro P   CAU His H   CGU Arg R
CUC Leu L   CCC Pro P   CAC His H   CGC Arg R
CUA Leu L   CCA Pro P   CAA Gln Q   CGA Arg R
CUG Leu L   CCG Pro P   CAG Gln Q   CGG Arg R

AUU Ile I   ACU Thr T   AAU Asn N   AGU Ser S
AUC Ile I   ACC Thr T   AAC Asn N   AGC Ser S
AUA Ile I   ACA Thr T   AAA Lys K   AGA Arg R
AUG Met M   ACG Thr T   AAG Lys K   AGG Arg R

GUU Val V   GCU Ala A   GAU Asp D   GGU Gly G
GUC Val V   GCC Ala A   GAC Asp D   GGC Gly G
GUA Val V   GCA Ala A   GAA Glu E   GGA Gly G
GUG Val V   GCG Ala A   GAG Glu E   GGG Gly G

Translation:

AUGUUGGGUUGA = MLG*
||| | || | |
AUGCUAGGAUAA = MLG*

Reverse translation:

MLG* =
AUG UUA GGU UAA    1
AUG UUA GGU UAG    2
AUG UUA GGU UGA    3
...
AUG CUG GGG UGA   72
(1 x 6 x 4 x 3 = 72 possible seqs)
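The count of 72 can be checked with a few lines of Python. This is a minimal sketch, assuming the standard genetic code shown above; the helper names (`codons_for`, `count_reverse_translations`, `reverse_translations`) are made up for this illustration.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in the order U, C, A, G
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codons_for(aa):
    """All codons that encode one amino acid (or '*' for stop)."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]

def count_reverse_translations(protein):
    """Number of RNA sequences that translate into the given protein."""
    return prod(len(codons_for(aa)) for aa in protein)

def reverse_translations(protein):
    """Generate every RNA sequence that encodes the protein."""
    for combo in product(*(codons_for(aa) for aa in protein)):
        yield "".join(combo)

print(count_reverse_translations("MLG*"))  # 1 x 6 x 4 x 3 = 72
```

The number grows multiplicatively with protein length, which is why reverse translation never gives a unique nucleotide sequence.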

3rd position is not so important!


Changes that affect translation

AUGUUGGGUUGA = MLG*
AUGUUAGGUUGA = MLG*
AUGUUCGGUUGA = MFG*
AUGUGAGGUUGA = M*G* (= M*!)
AUG-UGGGUUGA = MWV (+GA.)   Frameshift => new AA seq

Last example: no Stop!

Open Reading Frame (ORF)

Forward reading frames (frames 1-3):
AUGUUGGGUUGA = MLG*
.UGUUGGGUUGA = CWV
..GUUGGGUUGA = VGL
...UUGGGUUGA = LG*

Backward reading frames (frames 4-6, on the reverse (minus) strand):
AUGUUGGGUUGA   original
AGUUGGGUUGUA   reversed
UCAACCCAACAU   reversed + complemented = STQH, QPN, ...
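The six reading frames of the example can be reproduced with a short script. This is a sketch under the assumption that the standard genetic code applies; `translate`, `reverse_complement` and `six_frames` are illustrative names, not functions from any particular package.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def translate(rna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

def reverse_complement(rna):
    return rna.translate(COMPLEMENT)[::-1]

def six_frames(rna):
    """Frames 1-3 on the given strand, frames 4-6 on the reverse complement."""
    minus = reverse_complement(rna)
    return [translate(strand[offset:]) for strand in (rna, minus) for offset in range(3)]

for number, protein in enumerate(six_frames("AUGUUGGGUUGA"), start=1):
    print(number, protein)
# Frame 2 reads UGU UGG GUU, and UGG encodes Trp (W), so it prints CWV.
```

An ORF finder would then look for the longest stretch between a start codon and a stop ("*") in each of the six strings.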

Example, unknown RNA:

 1 AUGUUCCGUCUCACGCUCACCAAACGGCUAGCCCGCGCUUCUGCACACGUCACUCCGUCG 60
   ------------------------------------------------------------
   UACAAGGCAGAGUGCGAGUGGUUUGCCGAUCGGGCGCGAAGACGUGUGCAGUGAGGCAGC

Frames 1-3:
M F R L T L T K R L A R A S A H V T P S
C S V S R S P N G * P A L L H T S L R R
V P S H A H Q T A S P R F C T R H S V A

Frames 4-6:
H E T E R E G F P * G A S R C V D S R R
T G D * A * W V A L G R K Q V R * E T
A N R R V S V L R S A R A E A C T V G D G

Translation tables

• The coding for amino acids depends on species and/or nuclear/mitochondrial DNA.

• At least 17 translation tables exist:
* The Standard Code
* The Vertebrate Mitochondrial Code
* The Yeast Mitochondrial Code
* The Mold, Protozoan, and Coelenterate Mitochondrial Code and ...
* The Invertebrate Mitochondrial Code
* The Ciliate, Dasycladacean and Hexamita Nuclear Code
* The Echinoderm and Flatworm Mitochondrial Code
* The Euplotid Nuclear Code
* ...

Tables with comments may be found at NCBI: http://www.ncbi.nlm.nih.gov/Taxonomy/


Translation tables (cont), examples

Example:

The Vertebrate Mitochondrial Code (transl_table=2)

Differences from the Standard Code:

Codon   Code 2   Standard
AGA     Ter *    Arg R
AGG     Ter *    Arg R
AUA     Met M    Ile I
UGA     Trp W    Ter *

Example:

The Yeast Mitochondrial Code (transl_table=3)

Differences from the Standard Code:

Codon   Code 3   Standard
AUA     Met M    Ile I
CUU     Thr T    Leu L
CUC     Thr T    Leu L
CUA     Thr T    Leu L
CUG     Thr T    Leu L
UGA     Trp W    Ter *
CGA     absent   Arg R
CGC     absent   Arg R

Alternative Initiation Codon:

Bos: AUA
Homo: AUA, AUU
Mus: AUA, AUU, AUC
Coturnix, Gallus: also GUG.

Big differences if start (initiation) and stop (termination) codes differ!
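The effect of switching translation table can be seen on a tiny example. The sketch below builds the standard table and then applies only the four differences listed above for transl_table=2; the variable names and the test sequence are invented for the illustration, and alternative initiation codons are not handled.

```python
from itertools import product

BASES = "UCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}

# Vertebrate mitochondrial code: the standard code with four reassigned codons
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "AUA": "M", "UGA": "W"}

def translate(rna, table=STANDARD):
    """Translate complete codons with the chosen table ('*' marks a stop)."""
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

rna = "AUGAUAUGAAGA"  # made-up sequence: AUG AUA UGA AGA
print(translate(rna))                   # standard code
print(translate(rna, VERTEBRATE_MITO))  # vertebrate mitochondrial code
```

With the standard table the stop appears in the middle of the sequence; with table 2 it appears at the end, so the predicted protein differs completely.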

Ambiguous sequence notation

Nucleotide examples:
A or C, [AC]: symbol M
A or G, [AG]: symbol R
A or T, [AT]: symbol W
A or C or G, [ACG]: symbol V
... etc.

G A A A A C
G A G A T C
G C A A C C
G C G A G C
-----------------
G [AC] [AG] A [ATCG] C

The 4 sequence example may be written as a sequence: GMRANC, or as a pattern: G-[AC]-[AG]-A-x(1)-C

Wildcard: x(N) represents N arbitrary symbols.
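The ambiguity notation maps directly onto regular expressions. The following sketch uses the standard IUPAC nucleotide symbols; the function names `consensus` and `to_regex` are assumptions made for this example.

```python
import re

# IUPAC nucleotide symbols and the bases they stand for
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
         "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}

def consensus(seqs):
    """Collapse equal-length sequences, column by column, into IUPAC symbols."""
    return "".join(SYMBOL_FOR[frozenset(column)] for column in zip(*seqs))

def to_regex(iupac_seq):
    """Turn an ambiguous sequence into a regular expression."""
    classes = (IUPAC[symbol] for symbol in iupac_seq)
    return "".join(b if len(b) == 1 else "[" + b + "]" for b in classes)

seqs = ["GAAAAC", "GAGATC", "GCAACC", "GCGAGC"]
print(consensus(seqs))            # GMRANC
print(to_regex(consensus(seqs)))  # G[AC][AG]A[ACGT]C
print(all(re.fullmatch(to_regex("GMRANC"), s) for s in seqs))  # True
```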

Identity (pattern matching)

• Finding short exact matches
GAATTC – recognition site for enzyme EcoRI
GDSGGP – typical of serine proteases (e.g. G-[DE]-S-G-[GS]-[SAPHV])

• Patterns for multiple matches
GA-[AG]-L-[ST] : GA + A or G + L + S or T
GAALS, GAGLS, GAALT, GAGLT match
GA-x-G-[STLAG] : GA + any 1 aa + G + S or T or L or A or G
100 different sequences match
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
pattern for zinc finger proteins (millions of possible sequences)

Programs that use these kinds of patterns:
”Findpatterns” searches a sequence (or set of sequences) for a pattern.
”Motifs” searches a sequence for motifs present in the PROSITE database.
PROSITE has patterns for >1000 protein families.
Important: Match or no match – just true or false, no score!
(”Profiles” have probabilities for different amino acids in certain positions.)
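Patterns of this kind translate almost one-to-one into regular expressions. The converter below is a minimal sketch that handles only the elements used on this slide (literal residues, [..] sets, {..} exclusions, x and repeat counts); it is not the code used by Findpatterns or Motifs, and `prosite_to_regex` is a hypothetical name.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as C-x(2,4)-C into a regex string."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        # split an element into its core and an optional (n) or (n,m) repeat
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":                  # wildcard: any residue
            core = "."
        elif core.startswith("{"):       # {ABC}: any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

print(prosite_to_regex("GA-[AG]-L-[ST]"))
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"))

# Count how many 5-residue sequences match GA-x-G-[STLAG]
AMINO = "ACDEFGHIKLMNPQRSTVWY"
regex = re.compile(prosite_to_regex("GA-x-G-[STLAG]"))
matches = sum(1 for a in AMINO for b in AMINO if regex.fullmatch("GA" + a + "G" + b))
print(matches)  # 20 x 5 = 100
```

As the slide stresses, the result is only match or no match; no score is attached to a hit.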

Pairwise alignments:

Global alignment
Considers similarity across the full extent of the sequences

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
| | ||||||| | |
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Local alignment (most common)
Considers regions of similarity in parts of the sequences only.

xxxxxxx
|||||||   region of similarity
xxxxxxx

Comparing 2 sequences - Dotplot analysis

[Figure: dotplot of MAKLQGALGKRY against MAKIQGALAKRY; identical residues are marked with *]

Sequence alignment (2 mismatches):

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y

Comparing 2 sequences - Gaps

[Figure: dotplot of MAKLQLGKRY against MAKLQGALGKRY; the diagonal is shifted at the gap]

Sequence alignment (gap):

M A K L Q - - L G K R Y
* * * * *     * * * * *
M A K L Q G A L G K R Y
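A text dotplot is easy to produce: put one sequence along each axis and mark every cell where the two residues are identical. The sketch below is a bare-bones version without the window or stringency filtering that real dotplot programs offer; `dotplot` is a name chosen for this example.

```python
def dotplot(seq_x, seq_y):
    """Return a text dotplot: seq_x along the top, seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for residue_y in seq_y:
        cells = ("*" if residue_x == residue_y else " " for residue_x in seq_x)
        lines.append(residue_y + " " + " ".join(cells))
    return "\n".join(lines)

print(dotplot("MAKLQGALGKRY", "MAKIQGALAKRY"))
print()
print(dotplot("MAKLQLGKRY", "MAKLQGALGKRY"))
```

In the first plot the main diagonal is broken at the two mismatches; in the second the diagonal jumps sideways where the gap is.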


Comparing 2 sequences: What are gaps?

Gaps are results of mutations (changes in DNA) that occur during evolution

For instance consider this deletion mutation:

DNA:      AACTTGACG GACTGGGCGTAT CGGGCACCG
          TTGAACTGC CTGACCCGCATA GCCCGTGGC
protein:  N  L  T   D  W  A  Y   R  A  P

After deletion of the middle segment:

DNA:      AACTTGACG CGGGCACCG
          TTGAACTGC GCCCGTGGC
protein:  N  L  T   R  A  P

Alignment report example

[Figure: alignment report. Red lines = matches full sequence (high identity). Purple lines = matches contain a gap (good identity).]

Best alignment = highest score!

Give scores for match, mismatch and gap (and gap extension).

What is better: mismatch or gap?

Calculate best score for each position, “trace back” to find best alignment.

“Dynamic programming” algorithms.

Slow algorithm (the work grows with the product of the two sequence lengths), so it is impractical for routine searches of large databases!
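The idea of scoring every position and tracing back can be sketched as a global (Needleman-Wunsch style) alignment. The scores used here (match +1, mismatch -1, gap -2, no separate gap extension) are arbitrary values chosen for the illustration, not the defaults of any program.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # trace back from the bottom-right corner to recover one best alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best, row_a, row_b = global_align("MAKLQLGKRY", "MAKLQGALGKRY")
print(best)
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells, which is where the cost proportional to the product of the sequence lengths comes from.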

BLAST and FastA

Improvement of speed as compared to local alignment algorithm:

BLAST lists all matching “words”* between Query and Subject.

For each short match, the program tries to extend in both directions.

* A word is 7-11 nucleotides or 3-.. aa

Searching databases with BLAST

Initial search is for short words.
Word hits are then extended in either direction.
→ we only extend words that are in both sequences
→ fast, but gap can’t be long between two close words

Searching databases with FastA

Initial search for short words.
Words are extended, but also linked if they are close!
→ slower, but longer alignments
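The word-then-extend strategy can be illustrated with a toy script. This is a heavily simplified sketch: it only finds exact shared words and extends them while residues stay identical, whereas real BLAST scores neighbouring words with a substitution matrix and stops extension with a score drop-off. The names `word_hits` and `extend` are invented here.

```python
from collections import defaultdict

def word_hits(query, subject, w=3):
    """Positions (i, j) where the same word of length w occurs in both sequences."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    return [(i, j) for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]

def extend(query, subject, i, j, w=3):
    """Extend a word hit in both directions without gaps, while residues are identical."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + w, j + w
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i  # query start, subject start, length

query, subject = "MAKLQLGKRY", "MAKLQGALGKRY"
segments = {extend(query, subject, i, j) for i, j in word_hits(query, subject)}
print(sorted(segments))
```

The two segments found lie on different diagonals; linking such nearby segments across a gap is the extra step that FastA performs.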

An alignment that BLAST can’t find!

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC


Aligning two sequences - Gap extension penalty. Alignment of genomic sequence with mRNA (Global alignment!)

Alignment of the following two sequences: V00594 (Human mRNA for metallothionein) and J00271 (corresponding genomic sequence).

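Default setting: Extend gap = 3

In a global alignment all residues are matched.

[Figure: alignment with the default setting]

New settings: Extend gap = 0

[Figure: alignment with the new setting; Exon 1, Exon 2 and Exon 3 are marked]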
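Why the gap extension penalty matters for an mRNA versus genomic alignment can be shown with a little arithmetic. The sketch assumes the common affine form (one opening cost plus an extension cost per gapped position); the opening cost of 50 and the intron length of 500 are hypothetical numbers for illustration only.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap penalty: one opening cost plus a cost per gapped position."""
    return gap_open + gap_extend * length

intron_length = 500  # hypothetical intron that the mRNA must skip
print(gap_cost(intron_length, gap_open=50, gap_extend=3))  # long gaps are very expensive
print(gap_cost(intron_length, gap_open=50, gap_extend=0))  # long gaps cost no more than short ones
```

With the extension penalty set to 0, one long gap per intron is cheap, so the exons can align cleanly.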

Output from Blast

BLASTP 2.0.11 [Jan-20-2000]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= ramp4.seq
         (75 letters)

Database: nr
           457,798 sequences; 140,871,481 total letters

Searching..................................................done

                                                                   Score    E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6

E-value: the number of hits with a score at least this high that you expect to find by chance in a database of this size (an expected count, not a probability).

E-value, as important as score!

Expect Value
E = number of database hits you expect to find by chance

[Figure: the expected number of random hits depends on your score and on the size of the database]

Small database = few random hits. Big database = many random hits! For the same score, a small database therefore gives a lower E-value than a big one.
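The dependence on database size follows from the usual relation between bit score and expect value, E = m x n x 2^(-S), with m the query length, n the database length and S the bit score. The sketch below uses the raw lengths from the report above; BLAST itself uses corrected "effective" lengths, so its reported E-values are somewhat lower than this estimate.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S)."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Query of 75 letters against nr with 140,871,481 letters (values from the report above)
for bits in (126, 74.1, 36, 29):
    print(bits, expect_value(bits, 75, 140_871_481))

# The same score in a database ten times smaller gives a ten times lower E-value
print(expect_value(36, 75, 14_087_148))
```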

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene; cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64

In protein alignments some mismatches are marked “similar” (+).

Substitution matrices are used to score matches/mismatches!

Are there better/worse substitutions?

• From comparisons of known proteins, it is known that some changes/mutations are more frequent than others.

• Also, not all amino acids* are equally common ...
If a rare amino acid is matched, it is more significant than a match of a common amino acid.

• How can we give a score to a mismatch/match that is biologically significant?
→ substitution matrices

* There are 20 amino acids, but only 4 nucleotides!


BLOSUM 62 scores

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Substitution matrices

Unitary matrices (nucleotide, protein)
All matches get ’10’, all mismatches ’0’. Used for nucleotide seqs. Bad protein hits due to identities by chance.

Point Accepted Mutation, PAM (proteins)
PAM30, PAM70 ... matrices. Based on evolutionary distance: 1 PAM = 1 accepted point mutation / 100 residues. Can’t handle distant relationships well.

Blocks Substitution Matrix, BLOSUM (prots)
BLOSUM50, BLOSUM62 ... matrices. Based on alignments in the BLOCKS db. Sequence segments above a certain identity are clustered. The most used matrices. BLOSUM62 (segments with >62% identity clustered) is the default in BLAST.

Remember: Any substitution matrix is making a statement about the probability of observing a pair of aligned residues in real alignments!
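That statement can be made concrete: a substitution score is a log-odds ratio between how often a residue pair is observed in real alignments and how often it would be expected by chance. The sketch below uses the half-bit scaling (2 x log2) in which BLOSUM62 is commonly expressed; the frequencies passed in are invented numbers, not values taken from the BLOCKS database.

```python
from math import log2

def log_odds_score(pair_freq, freq_a, freq_b, scale=2):
    """Score = scale * log2(observed pair frequency / frequency expected by chance)."""
    return round(scale * log2(pair_freq / (freq_a * freq_b)))

# Hypothetical frequencies for illustration only
print(log_odds_score(0.04, 0.1, 0.1))    # pair seen 4x more often than chance -> positive
print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen 4x less often than chance -> negative
print(log_odds_score(0.01, 0.1, 0.1))    # pair seen as often as chance -> 0
```

This is also why a match between two rare amino acids scores higher than a match between two common ones: the chance expectation in the denominator is smaller.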

ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
M A K L E K L N Q A G L M V A G

60% nucleotide identity

ATGGCTAGGTTGGAGAAGATAAACCAAGCTGGGATAATAGTTGCAGGA
M V R L E K I N Q A G L L V A G
69% amino acid identity

M V R I Q K I N E K G A L L A G
38%

Q V R I Q K I Y E K G A L L A A
19% (‘twilight zone’)

Q V R I Q K I Y E K T A L L F A
6% (‘midnight zone’)

Evolution of protein genes: secondary and tertiary structure conserved

Blast report

Sequences producing significant alignments:                        (bits)  Value

pir||F69494 (R)-hydroxyglutaryl-CoA dehydratase activator (hgdC)...  462  e-129
gb|AAD31675.1| (AF123384) (R)-2-hydroxyglutaryl-CoA dehydratase ...  233  1e-060
sp|P39383|YJIL_ECOLI HYPOTHETICAL 27.4 KD PROTEIN IN IADA-MCRD I...  184  9e-046
emb|CAA67409.1| (X98916) orf6 [Methanopyrus kandleri]  170  1e-041
gb|AAF13150.1|AF156260_1 (AF156260) unknown [Methanosarcina bark...  143  2e-033
pir||A69117 activator of (R)-2-hydroxyglutaryl-CoA - Methanobact...  132  4e-030
pir||A72369 (R)-2-hydroxyglutaryl-CoA dehydratase activator-rela...  129  4e-029
gb|AAC23928.1| (U75363) benzoyl-CoA reductase subunit [Rhodopseu...  117  1e-025
pir||S04476 hypothetical protein (hdgA 5' region) - Acidaminococ...  104  1e-021
sp|P27542|DNAK_CHLPN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  42  0.005
gb|AAC15473.1| (AF016711) heat shock protein 70 [Burkholderia ps...  39  0.036
pir||F75029 o-sialoglycoprotein endopeptidase (gcp) PAB1159 - Py...  38  0.082
pir||F72514 probable glucokinase APE2091 - Aeropyrum pernix (str...  37  0.18
sp|P42373|DNAK_BURCE DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  37  0.18
emb|CAA10035.1| (AJ012470) mitochondrial-type hsp70 [Encephalito...  36  0.31
sp|P56836|DNAK_CHLMU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.41
gb|AAF39496.1| (AE002336) dnaK protein [Chlamydia muridarum]  36  0.41
pir||B70189 rod shape-determining protein (mreB-1) homolog - Lym...  36  0.41
sp|O57716|GCP_PYRHO PUTATIVE O-SIALOGLYCOPROTEIN ENDOPEPTIDASE (...  36  0.54
sp|O33522|DNAK_ALCEU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  36  0.54
ref|NP_012874.1| Ykl050cp >gi|549677|sp|P35736|YKF0_YEAST HYPOTH...  36  0.54
emb|CAA53420.1| (X75781) D513 [Saccharomyces cerevisiae] >gi|158...  36  0.54
sp|P30722|DNAK_PAVLU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) >gi|99...  36  0.54
pir||A40158 dnaK-type molecular chaperone - Chlamydia trachomati...  34  1.2
gb|AAF07742.1|AE001584_39 (AE001584) hypothetical protein [Borre...  34  1.6
gb|AAF07521.1|AE001577_35 (AE001577) hypothetical protein [Borre...  34  1.6
gb|AAF38963.1| (AE002276) cell shape-determining protein MreB [C...  34  2.1
gb|AAG08147.1|AE004889_10 (AE004889) DnaK protein [Pseudomonas a...  33  2.7
dbj|BAB03215.1| (AB017035) dnaK [Bacillus thermoglucosidasius]  33  2.7
sp|P43736|DNAK_HAEIN DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|P45554|DNAK_STAAU DNAK PROTEIN (HEAT SHOCK PROTEIN 70) (HSP70...  33  2.7
sp|Q58303|FLA3_METJA FLAGELLIN B3 PRECURSOR  32  4.7
gb|AAG08239.1|AE004898_10 (AE004898) phosphoribosylaminoimidazol...  32  6.1

Bad scores/E-values might sometimes not matter.

1 MSAAPVQDKDTLSNAERAKNVNGLLQVLMDINTLNGGSSDTADKIRIHAKNFEAALFAKS 60

61 SSKKEYMDSMNEKVAVMRNTYNTRKNAVTAAAANNNIKPVEQHHINNLKNSGNSANNMNV 120

121 NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK 180

181 QLLQRIPNIPPNINTWQQVTALAQQKLLTPQDMEAAKEVYKIHQQLLFKARLQQQQAQAQ 240

241 AQANNNNNGLPQNGNINNNINIPQQQQMQPPNSSANNNPLQQQSSQNTVPNVLNQINQIF 300

301 SPEEQRSLLQEAIETCKNFEKTQLGSTMTEPVKQSFIRKYINQKALRKIQALRDVKNNNN 360

361 ANNNGSNLQRAQNVPMNIIQQQQQQNTNNNDTIATSATPNAAAFSQQQNASSKLYQ

Low complexity sequence tends to
(1) increase the number of non-specific hits to database sequences
(2) correspond to regions in proteins not associated with a known biological function (typically unstructured parts of the protein)

Therefore, low complexity parts are filtered out by default in BLAST searches. (Don’t use filtering if you want exact matches.)
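One simple way to spot low complexity regions is to measure the Shannon entropy of the residue composition in a sliding window. The sketch below only illustrates that idea and is not the filter that BLAST applies; the window size of 12 and the threshold of 2.2 bits are arbitrary choices.

```python
from collections import Counter
from math import log2

def shannon_entropy(segment):
    """Entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum(c / total * log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue lying in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)

# Residues 121-180 of the example protein above
segment = "NMNLNPQMFLNQQAQARQQVAQQLRNQQQQQQQQQQQQRRQLTPQQQQLVNQMKVAPIPK"
print(mask_low_complexity(segment))
```

The poly-Q stretch is masked, while compositionally diverse regions are left untouched.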

Blast variants:

Program   Query              Database
blastp    Protein            Protein
blastn    DNA                DNA
tblastn   Protein            DNA (translated)
blastx    DNA (translated)   Protein
tblastx   DNA (translated)   DNA (translated)

Example: Searching a new genome assembly for a protein homolog.

Input: protein.
Database: DNA (genome sequences)

→ tblastn
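The choice of program follows mechanically from the types of query and database, which a small lookup makes explicit. The function `choose_blast` and its `translate_both` flag are hypothetical helpers written for this note; they do not call BLAST.

```python
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("dna", "dna"): "blastn",
    ("protein", "dna"): "tblastn",   # database is translated in six frames
    ("dna", "protein"): "blastx",    # query is translated in six frames
}

def choose_blast(query_type, db_type, translate_both=False):
    """Pick the BLAST variant for a query/database combination."""
    if translate_both and (query_type, db_type) == ("dna", "dna"):
        return "tblastx"             # both sides compared at the protein level
    return BLAST_PROGRAMS[(query_type, db_type)]

# Protein query against a new genome assembly
print(choose_blast("protein", "dna"))
```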


Rules of database searches (like BLAST)

• Database sequence searches involving proteins should be carried out at the protein level and not at the DNA level *
• Use of smallest possible database (not too small though)
• Sequence statistics should be used rather than percent identity/similarity as criterion for homology
• Consider different scoring matrices and gap penalties

* 1) DNA sequences encoding the same protein sequence can be very different, due to the degeneracy of the genetic code.

TTTCGATTCTCAACAAGAAGC
**  * **    **  *   *
TTCAGGTTTAGCACGCGGTCC
F  R  F  S  T  R  S

2) For nucleotide-nucleotide searches, it is often good to set the word size low (-W 7)

BLAST at NCBI

tblastn

BLAST output at NCBI

1 perfect hit, some hits with parts of sequence matched

Alignments below

“HSP” = high scoring pair – there may be several!

Best hit

Next best hit

BLAST output, with many HSPs

gb|CM000011.1| Canis familiaris chromosome 11, whole genome shot...   86  9e-15

>gb|CM000011.1| Canis familiaris chromosome 11, whole genome shotgun sequence
          Length = 75769841

 Score = 85.7 bits (43), Expect = 9e-15
 Identities = 89/102 (87%), Gaps = 3/102 (2%)
 Strand = Plus / Minus

Query: 4        cgtgctgaaggcctgtatcctaggctacacactgaggactctgttcctcccctttccgcc 63
                |||||||||||||||| |||||||||||| || ||||||| |||||||    ||| ||||
Sbjct: 53542401 cgtgctgaaggcctgtttcctaggctacagacggaggact-tgttcctta--tttgcgcc 53542345

Query: 64       taggggaaagtccccggacctcgggcagagagtgccacgtgc 105
                ||||||||||||||||||||   ||||||||||||| |||||
Sbjct: 53542344 taggggaaagtccccggacccttggcagagagtgccgcgtgc 53542303

 Score = 75.8 bits (38), Expect = 9e-12
 Identities = 75/86 (87%), Gaps = 1/86 (1%)
 Strand = Plus / Minus

Query: 181      ggggcgtcatccgtcagctccctctagttacgcaggcagtgcgtgtcc-gcgcaccaacc 239
                |||||||| ||||||| |||  ||||||||||||||||| |||  |   |||| ||||||
Sbjct: 53542216 ggggcgtcgtccgtcaactctatctagttacgcaggcagcgcgcctggtgcgcgccaacc 53542157

Query: 240      acacggggctcattctcagcgcggct 265
                ||||||||||||||||||||||||||
Sbjct: 53542156 acacggggctcattctcagcgcggct 53542131

 Score = 36.2 bits (18), Expect = 7.7
 Identities = 18/18 (100%)
 Strand = Plus / Minus

Query: 25       aggctacacactgaggac 42
                ||||||||||||||||||
Sbjct: 42727936 aggctacacactgaggac 42727919

Note: Only the best HSP is shown in the list before the alignments. Check the positions to understand in which order the HSPs match. The strand must be the same!

Databases at NCBI available for BLAST searches

Protein sequence databases

nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF

swissprot the last major release of SWISS-PROT

DNA sequence Databases

nr All Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)

dbest Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions

You may also blast against single genomes ...


Multiple alignments - applications

Identify conserved motifs - patterns (PROSITE)
Profiles (Pfam)
Phylogenetic studies
Prediction of protein secondary structure
Experimental: design of probes

Multiple sequence alignment programs (CLUSTALW, PileUp, T-coffee ...)

PILEUP

PileUp does a series of progressive, pairwise alignments between sequences and clusters of sequences to generate the final multiple alignment. A cluster consists of two or more already-aligned sequences.

PileUp begins by doing pairwise alignments that score the similarity between every possible pair of sequences. These similarity scores are used to create a clustering order that can be represented as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal, R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA).
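The clustering order produced by UPGMA can be sketched in a few lines. This is not PileUp's implementation: it works on distances rather than similarity scores, and the four sequence names and their pairwise distances are invented for the illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """Return the merge order; dist maps frozenset({a, b}) to a leaf-to-leaf distance."""
    clusters = [(name,) for name in names]  # every cluster is a tuple of leaf names
    merges = []

    def average(c1, c2):
        # unweighted arithmetic average over all leaf pairs between two clusters
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical distances between four sequences
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): d for pair, d in pairs.items()}
for left, right, d in upgma(["A", "B", "C", "D"], dist):
    print(left, "+", right, "at distance", d)
```

A progressive aligner would align the sequences in exactly this order: the closest pair first, then each further sequence or cluster onto the growing alignment.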

The dendrogram shows the order of the pairwise alignments of sequences and clusters of sequences that together generate the final alignment. For example:

Trees from MSA

[Figure: tree built from the multiple alignment, with 3 large groups: SRP54, SRPR and FtsY]

Multiple alignment software

Pileup (GCG)

Clustalw / Clustalx

MSA (program that in principle finds the true optimal multiple alignment by the dynamic programming method)

T-coffee

Multiple alignment editors/viewers

SeqLab (GCG)
MACAW (search for motifs, blocks)
Jalview
CINEMA
Genedoc
Bioedit
Boxshade

Clustalx

njplot

Colours of amino acids according to type: charged, hydrophobic ...

Makes it easier to see matches.


How to find homologs with low sequence identity

• Sequence identity high if evolutionary distance is small, but low if the distance is big.

• Many amino acid positions change.
• An amino acid may be substituted differently in different species.
• If we have many known homologs, we can search with “all of them” as queries, but the unknown sequence may have yet another set of substitutions compared to the known homologs.
→ align known sequences and make a “profile”

Position Specific Substitution Rates

Position Specific Score Matrix (PSSM)

Columns: amino acids. Rows: example sequence (position and residue).

        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3

Figure labels: active site serine (active site nucleophile); typical serine.

Serine is scored differently in these two positions. How does Serine score in positions 211 and 216?
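The question can be answered by reading the S column of the two rows, which the snippet below does programmatically. Only rows 211 and 216 of the matrix above are copied in; `pssm_score` is an illustrative helper name.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM above

# Two rows copied from the matrix above
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}

def pssm_score(position, residue):
    """Score for placing a residue at a given position of the profile."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]

for position in (211, 216):
    print(position,
          "S:", pssm_score(position, "S"),
          "A:", pssm_score(position, "A"),
          "T:", pssm_score(position, "T"))
```

At position 211 serine scores 4 and alanine or threonine are tolerated almost as well, while at position 216 serine scores 7 and every substitution is penalised. A general matrix such as BLOSUM62 gives S-S the same score of 4 everywhere.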

PSIBLAST – a more sensitive BLAST!

PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.

(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.

Advantage : Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.

PSI-BLAST creates profiles automatically

[Figure: 1st BLAST round, hits above the threshold form a profile; 2nd BLAST round with that profile gives a new profile; 3rd BLAST round]

When no more new sequences are found, the search terminates.

Problem: If bad sequences enter the profile, it finds only trash!

Example of homology: SRP9/14/21

• SRP9 & SRP14 are related (common ancestor)
• SRP9 is not found in Fungi (but SRP21 is)
• But weak SRP9 hit in the fungus S. pombe (YE07)
• Weak similarity SRP9 S. pombe & SRP21
• Make a profile of known SRP21 sequences and search a database of all known proteins!

Can we detect any similarity SRP9/21?


Profile search - based on Saccharomyces SRP21 sequences

 #  Sequence    ZScore  Orig    Length  Comment
 1  S_BAY       84.40   349.21  146     SRP21 S. bayanus
 2  S_PAR       83.75   351.99  169     Paradoxus
 3  S_KUD       83.71   346.53  146     Kudria
 4  SR21_YEAST  82.41   346.02  166     P32342 saccharomyces cerevisiae
 5  S_MIK       82.06   339.92  145     Mikatae
 6  S_KLU       75.51   314.59  145     Kluyveri
 7  S_CAS       74.91   308.02  125     Castellii
 8  C_ALB       21.67   107.92  168     Candida
 9  N_CRA       12.74    74.20  197     Neurospora
10  YE07_SCHPO   9.61    58.63  120     O13804 schizosaccharomyces pombe
11  CD3D_RAT     8.74    57.34  173     P19377 rattus norvegicus (rat).
12  ARP2_PLAFA   8.52    65.04  451     P13824 plasmodium falciparum.
13  Q23147       8.50    60.12  284     Q23147 caenorhabditis elegans.
14  SR09_ARATH   8.45    53.56  103     Q9smu7 arabidopsis thaliana (mouse-ear
15  Q8K2G5       8.45    60.59  306     Q8k2g5 mus musculus (mouse). riken cdna
16  Q8BFQ4       8.40    60.59  313     Q8bfq4 mus musculus (mouse).
17  Q8I562       8.39    64.65  459     Q8i562 plasmodium falciparum (isolate 3d7).
18  AAH44174     8.28    60.11  313     Aah44174 brachydanio rerio (zebrafish)
19  SR09_MAIZE   8.17    52.51  103     O04438 zea mays (maize). signal recognition
20  CD3D_MOUSE   8.11    54.85  173     P04235 mus musculus (mouse). t-cell surface
21  SR09_CAEEL   7.90    50.47   76     P34642 caenorhabditis elegans. signal

Green box = sequences in profile (should be first!)
Yellow box = unknown SRP21 (incl. YE07 from S. pombe)
Red box = SRP9 sequences (Best hits in db of >1 million proteins!)

SRP21 aligned to SRP9 & 14

[Figure: multiple alignment of SRP21, SRP9 and SRP14 (rows labelled 21, 9 and 14), with an unaligned box. Residues are marked according to similarity in sequence and chemical properties.]

Secondary structure prediction by PSI-Pred also showed the conserved αβββα structure.

SRP9/14 αβββα secondary structure (Birse et al.) shown as cylinders (alpha helices) and arrows (beta strands).

The most conserved residues are in secondary structure elements. SRP9 and SRP21 are more similar to each other.

Proteins share domains

• In primary sequence searches the found proteins are aligned because they share domains

• If the sequences are very different outside the shared domain, they may be paralogs.

• The next example shows an MSA in which the middle part is a GTPase domain. The first or last part is missing ...

[Figure: MSA running from the N-terminal to the C-terminal end. Two different proteins (4+4 sequences) are aligned. They share a domain.]

Pfam – protein domains DB

• From multiple alignments of many related proteins, profiles (HMMs) are made

• Input a sequence, match to all families/HMMs.

• Known sequences are in Pfam database.

Pfam DB: Karolinska Inst., Sanger (UK), St. Louis (USA), Pasteur (F)

[Figure: structure logo for the Pfam motif trypsin (only part of the model shown). The horizontal axis gives the positions in the model.]

The size of the letters = probability of finding that amino acid in the position. In some positions, some amino acids are much more common than others.

Pfam model amino acid probability plot in the “structure logo” style.
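The letter sizes in such a plot come from per-position amino acid probabilities. A minimal sketch of that calculation, assuming a small gapped alignment as input (the sequences below are made up, and column_probabilities is a hypothetical helper, not a Pfam or HMMER function):

```python
from collections import Counter


def column_probabilities(alignment):
    """Observed residue frequencies per alignment column (gaps ignored)."""
    probabilities = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        counts = Counter(residues)
        total = len(residues)
        probabilities.append(
            {aa: n / total for aa, n in counts.items()} if total else {}
        )
    return probabilities


# Made-up alignment: position 1 is invariant, positions 2 and 3 vary.
alignment = ["GDSGG", "GDSGG", "GDAGG", "GNSGG"]
for position, probs in enumerate(column_probabilities(alignment), start=1):
    print(position, probs)
```

A real profile HMM also adds pseudocounts and models insertions and deletions. This sketch only shows why a conserved position gives one large letter and a variable position gives several small ones.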


Search a sequence for matches to Pfam models

HMM file:       /dbs/pfam/Pfam_ls
Sequence file:  pop3_spombe
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Query sequence: gi|3560259|emb|CAA20744.1|
Accession:      [none]
Description:    SPCC16C4.05 [Schizosaccharomyces pombe]

Scores for sequence family classification (score includes all domains):
Model         Description           Score  E-value   N
--------      -----------           -----  -------  ---
RNase_P_pop3  RNase P subunit Pop3  332.1  8.6e-97    1

Parsed for domains:
Model         Domain  seq-f seq-t     hmm-f hmm-t     score  E-value
--------      ------- ----- -----     ----- -----     -----  -------
RNase_P_pop3    1/1       7   165 ..      1   175 []  332.1  8.6e-97

Alignments of top-scoring domains:
RNase_P_pop3: domain 1 of 1, from 7 to 165: score 332.1, E = 8.6e-97

                *->KrkQvyKPVLeNPytNEAkLWPhVtdqklvlELLqekvlkklvhalk
                   K+kQ++K+VL+NP++++   WP+V+++      +qek++++lv++l+
  gi|3560259     7 KVKQTVKLVLRNPLSIS---WPIVDAN------TQEKLAQTLVQWLP 44

                   ashKgneesevtvGfNeivelLsraCsesddvTQPAVvlfvcnkDgtPsv
                   ashK++++s++tvG+N+++elL+r+C++++dvTQPAVv++++++D   s+
  gi|3560259    45 ASHKDILDSKLTVGLNSVNELLERCCQNAKDVTQPAVVFILHDQD---SM 91

                   LlsQlPLLvavanltGSSKVkLVqLpksaqakfdehlGlskavHDGmlLv
                   L++++P+Lva+an++GSSK++LV+L++saqa+++++lGls+a   G+++v
  gi|3560259    92 LVTHMPQLVANANFYGSSKCRLVPLGFSAQALIAKKLGLSRA---GAIAV 138

                   rkdasldksfadlvdskvEepqiPWLep<-*
                   ++d++l+k+++dlv++ +Eepq++WL++
  gi|3560259   139 QDDSPLWKYLKDLVMN-IEEPQARWLSE 165

The HMMER package is used for searching sequences against one or all Pfam models (or a model that you have made yourself).

As in BLAST searches, you get a score, an E-value and an alignment.

Searches may also be done on the Pfam WWW site.

Search at Pfam (Sanger)

For a known protein one may use the UNIPROT accession to get a precomputed alignment.

If the protein is not in the database, just input the sequence ...

[Figure: WWW results, showing a good match (RNaseP_pop3) and matches below threshold.]

Pfam notes ...

• Even though the Pfam alignments are curated, they may contain sequences that are very different from your sequence => bad score and E-value!

• If your sequence gets a very high score and a good E-value, it probably is in the alignment that was used to create the model.

• Pfam B models are made “automatically” and not curated (use with caution)

• Some Pfam models are domains, others are almost complete proteins ...

SRS: InterPro – search all domain DB

InterPro had a “bad” reputation some years ago, but it is a good idea!

[Figure: a sequence input is searched against PROSITE, Pfam, PRODOM, PRINTS, SMART, ...]

Transmembrane prediction

• About 25% of all proteins are membrane bound

• By comparing known transmembrane proteins, programs like TMHMM make predictions. Some use neural networks that are trained on known TM proteins.

• Other methods can be combined with TMHMM (or other programs) to get a higher specificity of the predictions (all methods have a flaw somewhere)


TMHMM output

RF47_[Guillardia   len=68  ExpAA=37.41  First60=32.65  PredHel=2  Topology=i2-19o47-64i
ORF74_[Odontella   len=74  ExpAA=39.05  First60=32.92  PredHel=2  Topology=i2-24o48-65i
ORF71_[Porphyra    len=71  ExpAA=36.0   First60=26.14  PredHel=2  Topology=i7-24o53-70i
ORF70_[Chlorella   len=70  ExpAA=38.67  First60=32.40  PredHel=2  Topology=i2-21o45-67i

-------------------------------------------------------------------------

PredHel=2 (= 2 TM domains)
Topology=i2-21o45-67i
inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside
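The topology string can be unpacked mechanically. The sketch below assumes the short-format convention visible in the output above ("i" = inside, "o" = outside, "start-end" = a predicted helix); topology_regions is a hypothetical helper, not part of TMHMM.

```python
import re


def topology_regions(topology, length):
    """Expand a topology string such as 'i2-21o45-67i' into labelled regions."""
    sides = ["inside" if s == "i" else "outside"
             for s in re.findall(r"[io]", topology)]
    helices = [(int(a), int(b))
               for a, b in re.findall(r"(\d+)-(\d+)", topology)]
    regions = []
    position = 1
    for side, (start, end) in zip(sides, helices):
        if start > position:  # loop before this helix
            regions.append((side, position, start - 1))
        regions.append(("transmembrane", start, end))
        position = end + 1
    if position <= length:  # tail after the last helix
        regions.append((sides[-1], position, length))
    return regions


# The Chlorella ORF from the output above (len=70).
for region in topology_regions("i2-21o45-67i", 70):
    print(region)
```

For this example the result reads inside 1-1, transmembrane 2-21, outside 22-44, transmembrane 45-67, inside 68-70, which is the inside-TRANSMEMBRANE-outside-TRANSMEMBRANE-inside layout described above.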

Example in which scores for the first TM domain are too low.

PSI-pred: secondary structure

Confidence in prediction of this residue

Output sent by mail:

EEE = beta strand
HHH = alpha helix
CCC = coil (“normal”)
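A per-residue prediction string in this three-letter code is easier to read as a list of segments. A small sketch, assuming the prediction is available as a plain string of E, H and C characters (the example string is made up, and segments is a hypothetical helper, not part of PSIPRED):

```python
from itertools import groupby

STATE_NAMES = {"H": "helix", "E": "strand", "C": "coil"}


def segments(prediction):
    """Collapse a per-residue E/H/C string into (state, start, end) runs."""
    result = []
    position = 1
    for state, run in groupby(prediction):
        run_length = len(list(run))
        result.append((STATE_NAMES[state], position, position + run_length - 1))
        position += run_length
    return result


print(segments("CCHHHHCEEEC"))
```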

Link to image files.

http://www.psipred.net

Looking for short sequences

• Sometimes you want to find out if there are short sequences (often called words) that are in a set of sequences. They may, for instance, be transcription factor binding sites ...

• Alignment programs won't find these ...

• MEME is a program that finds “words” of a specified length in a set of sequences.

• MAST may be used to search for known words
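To make the idea of a shared "word" concrete, the sketch below lists the exact words of length k that occur in every sequence of a set. This is a much simpler stand-in than MEME, which fits a probabilistic motif model and tolerates mismatches; shared_words and the example sequences are hypothetical.

```python
def shared_words(sequences, k):
    """Return the set of exact k-letter words present in every sequence."""
    common = None
    for seq in sequences:
        words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = words if common is None else common & words
    return common or set()


# Made-up upstream regions that all contain the word TATAAT.
upstream = ["GGTATAATCC", "ACTATAATGG", "TTTATAATAC"]
print(shared_words(upstream, 6))
```

Exact matching breaks down as soon as a binding site varies between sequences, which is why dedicated motif finders are used in practice.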

But what about RNA genes?

• RNA genes are genes that do not code for protein (they are not translated). They are usually called “noncoding RNAs”

• There are structural, catalytic and regulatory ncRNAs; few are conserved in all organisms

• Many ncRNAs are part of ribonucleoprotein complexes (RNPs)

• Some commonly known ncRNAs are: ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), signal recognition particle RNA (SRP RNA), ribonuclease P RNA (RNaseP RNA)

ncRNAs are often not annotated

NC_006270.1     -TTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACTCGCAATCCGCTCGAGCGAGGC
X06802|BAC.SUB. NTTGCCGTGCTAAGCGGGGAGGTAGCGGTGCCCTGTACCTGCAATCCGCTCTAGCAGGGC
                 *************************************  *********** ***  ***

NC_006270.1     CGAATCCCTTTCTCGAGGTTCGTTTACTTTAAGGTCTGCCTTAAGCAAGTGGTGTTGACG
X06802|BAC.SUB. CGAATCCCTT-CTCGAGGTTCGTTTACTTTAAGGCCTGCCTTAAGTAAGTGGTGTTGACG
                ********** *********************** ********** **************

NC_006270.1     CTTGGGTCCTGCGCAATGGGAATCCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
X06802|BAC.SUB. TTTGGGTCCTGCGCAATGGGAATTCATGAACCATGTCAGGTCCGGAAGGAAGCAGCATTA
                 ********************** ************************************

NC_006270.1     AGTGGAACCTTCCATGTGCCGCAGGGTTGCCTGGGCTGAGCTAACTGCTTAAGTAACGCT
X06802|BAC.SUB. AGTGAAACCTCTCATGTGCCGCAGGGTTGCCTGGGCCGAGCTAACTGCTTAAGTAACGCT
                **** *****  ************************ ***********************

NC_006270.1     TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
X06802|BAC.SUB. TAGGGTAGCGAATCGACAGAAGGTGCACGGTA
                ********************************

Sequence alignment of the annotated SRP RNA from Bacillus subtilis and the identified SRP RNA from the newly sequenced and “fully” annotated Bacillus licheniformis. Sequence identity = 94%! Still no SRP RNA is annotated. SRPDB is needed.
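A percent identity figure of this kind can be computed directly from two aligned rows. The sketch below uses one common convention as an assumption (columns where either row has a gap are skipped, and identity is matches divided by compared columns); other definitions use a different denominator and give slightly different numbers. The function name percent_identity is hypothetical.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over the ungapped aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # gap columns are not counted in this convention
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Two short aligned fragments that differ at a single position.
print(percent_identity("TGCCTTAAGC", "TGCCTTAAGT"))
```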

ncRNAs in the 3 Kingdoms of Life

Rfam: annotating non-coding RNAs in complete genomes. Sam Griffiths-Jones, Simon Moxon, Mhairi Marshall, Ajay Khanna, Sean R. Eddy and Alex Bateman. Nucleic Acids Res. 2005 33:D121-D124.


RNA structure

5’-GGGGAUGUAGCUUAGUGGUAGAGCAUUGGAGUUAUAAUCCGGAGGCGCGGGUUCGAAUCCCGUUAUCCCC -3’

[Figure: the primary, secondary and tertiary structure of this RNA.]

ncRNAs basepair (G-C, A-U, G-U), creating secondary structure

Mutations may maintain secondary structure (G-C => G-U => A-U)

ncRNAs first fold into a secondary structure before adding tertiary interactions => The secondary structure must not change!

RNA: Conserved secondary structure

AU, GC base pairing create “hairpins”

CAGGAAACUG seq1
...|.||...
GCUGCAAAGC seq2
|||||||...
GCUGCAACUG seq3

 A A      C A      C A
G   A    G   A    G   A
 G-C      U-A      U C
 A-U      C-G      C U
 C-G      G-C      G G

 seq1     seq2     seq3

Seq1 and seq2 are not similar, but they both have a hairpin structure, which is not shared by seq3!

The alignment of the primary sequences (structure) doesn’t give us any information.

Secondary structure pattern

Compensatory base changes maintain secondary structure. We need a way to specify the base pairing!

Secondary structure pattern

Pattern:
h1 s1 h1’
h1 NNN:NNN
s1 GMAA

Note:
M = [AC]
N = [AUGC]

“h” stands for helix
“s” stands for strand

 A A      C A
G   A    G   A
 G-C      U-A
 A-U      C-G
 C-G      G-C

 seq1     seq2

Programs that search for secondary structure: Patscan, RNAbob.
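The pattern above (a 3 bp helix enclosing a GMAA loop) can be mimicked with a few lines of code. This is only a sketch of the idea, not the Patscan or RNAbob descriptor syntax. It assumes M is the IUPAC code for A or C, and that G-U wobble pairs are allowed in the helix alongside G-C and A-U; find_stem_loops is a hypothetical name.

```python
import re

# Base pairs accepted in the helix (Watson-Crick plus G-U wobble).
PAIRS = {"GC", "CG", "AU", "UA", "GU", "UG"}


def find_stem_loops(seq, stem_len=3, loop_pattern="G[AC]AA"):
    """Find h1 s1 h1' matches: stem, loop matching loop_pattern, paired stem."""
    loop_re = re.compile(loop_pattern)
    hits = []
    for start in range(len(seq)):
        left = seq[start:start + stem_len]
        loop = loop_re.match(seq, start + stem_len)
        if loop is None:
            continue
        right = seq[loop.end():loop.end() + stem_len]
        if len(left) < stem_len or len(right) < stem_len:
            continue
        # The first base of the left strand pairs with the last of the right.
        if all(a + b in PAIRS for a, b in zip(left, reversed(right))):
            hits.append((start, seq[start:loop.end() + stem_len]))
    return hits


for name, seq in [("seq1", "CAGGAAACUG"),
                  ("seq2", "GCUGCAAAGC"),
                  ("seq3", "GCUGCAACUG")]:
    print(name, find_stem_loops(seq))
```

The answer is a plain yes/no per position: seq1 and seq2 match although their primary sequences differ, while seq3 does not match although it is nearly identical to seq2.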

Creating probabilistic covariance models from alignments (Rfam)

[Figure: tetraloop with stem.]

Both primary sequence and secondary structure conservation are captured in a probabilistic model.

COVE (Eddy 1994) used for creating models, searching.

Covariance models are equivalent to stochastic context-free grammars.

Patterns: Hard to make large patterns and patterns that find new structures. Yes/No match, no scoring. Fast!

Models: a covariance model of which bases appear together is created automatically from an alignment. Time complexity: O(n^3). Slow!

Idea: Use a smaller pattern to filter, then use the covariance model on the filtered sequences => Fast and sensitive!
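The notion of "which bases appear together" can be illustrated with the mutual information between two alignment columns, a classic measure of covariation. This is an illustrative sketch only, not how COVE builds or scores its models; the alignment is made up and mutual_information is a hypothetical helper.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    freq_i = Counter(seq[i] for seq in alignment)
    freq_j = Counter(seq[j] for seq in alignment)
    freq_ij = Counter((seq[i], seq[j]) for seq in alignment)
    total = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return total


# Made-up hairpins: the first and last columns always form a base pair,
# while the loop columns are invariant.
alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(mutual_information(alignment, 0, 4))  # paired columns covary
print(mutual_information(alignment, 1, 2))  # invariant columns carry no signal
```

Columns that change together to keep a base pair score high even though neither column is conserved on its own, which is exactly the signal that a primary sequence search misses.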

Multiple Sequence Alignments for ncRNAs must specify basepairings

The Rfam database is the “Pfam for ncRNAs”


ncRNA structure evolution

1. Mutations in ncRNAs maintain the secondary structure => primary sequence is poorly conserved => hard to detect similarities by primary sequence searches

2. Structure evolves by losing / adding helices => big gaps in alignments even when primary sequence is conserved

An example ...

SRP RNA variants

Helices H3 and H4 missing in yeast!

[Figure: SRP RNA secondary structures in Bacteria, Archaea and Eukarya.]

Comment: t1 and t2 depict tertiary interactions

Helix 8 is the only part found in all SRP RNA!

Fungal SRP RNAs lack helices 3 and 4

[Figure: SRP RNA secondary structure drawings for Yarrowia lipolytica, Neurospora crassa, Candida albicans and S. pombe.]

These RNAs are < 300 nts.

BUT ...

The S. cerevisiae SRP RNA (length 519 nts) could not be fitted to this type of SRP RNA.

How do we decide on the structure of this gene?

Note that Yarrowia (bottom) has an extra helix.

Comparative analysis of SRP RNA in Saccharomyces species

Using the known SRP sequences from Saccharomyces cerevisiae as queries, regions of the genomes of S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii and S. kluyveri were retrieved from Washington Univ., St. Louis.

By comparative analysis, SRP RNA sequences (453-547 nts) and structures were identified*.

The results showed that all species had large inserts in the helix 5 region, especially close to the small Alu domain, and that helix 7 also was variable.

* The secondary structures were predicted with MFOLD.

[Figure: SRP RNA secondary structure drawings for Yarrowia lipolytica and S. bayanus.]

Saccharomyces have a unique inserted part in helix 5 close to the Alu domain.

This was found in all Saccharomyces species.

Saccharomyces helix insertions

Helix 7: this structure is not found in C. albicans or S. pombe.

MicroRNAs – regulatory ncRNAs

The red part is the mature miRNA; its sequence is complementary to the mRNA!


RNAi pathway

miRNA and siRNA work in a similar fashion (Cell. 2004 Apr 2;117(1):1-3).

Post-genomic Bioinformatics/Genomics

Cross-species genome comparisons

Cross-species genomic sequence conservation can be used
1. for discovery of new regions with regulatory functions
2. to enhance gene predictions, and
3. alternative splicing predictions (1 gene => >1 mRNA => >1 protein)
4. to reveal transcription factor binding sites

Cross-species gene location conservation can be used for
1. identification of unknown ORFs (predicted proteins)
2. adding evidence for discovered new genes

Cross-species gene prevalence can be used for prediction of
1. the probability for the existence of a gene in a species (Keep looking!)
2. the function of a certain gene/protein/RNA (Is the product essential?)

And much more ... (We will show some examples later ...)

SRP component searches

This is part of the secretory pathway.

The SRP pathway is conserved in all domains of life: Eukarya, Bacteria, Archaea.

All organisms have an SRP, but it looks different.

Mitochondria and chloroplasts are endosymbionts

Origin of photosynthetic organisms (have chloroplasts with own genome!)

Primary endosymbiosis: Cyanobacteria + Eukaryote => algae

Secondary endosymbiosis: algae + Eukaryote

[Figure: genome map of the P. purpurea chloroplast at NCBI.]

We downloaded 26 chloroplast genomes and searched with pattern and model for bacterial SRP RNA.


Red algal group

Odontella and Guillardia have chloroplasts of secondary endosymbiosis origin

Green plant group

Found SRP RNA candidates (low scores) in 8 chloroplast genomes

Genome position for SRP RNA gene candidates in “green plant” group

Conserved clusters: the candidates in phylogenetically linked organisms are all found in this position.

No overlap with known genes!

(Conserved gene clusters are marked with ‘3’, ‘4’ ...)

rpoC-rpoB-trnC-RNA

Red algae (incl. secondary endosymbionts)
Porphyra purpurea      psaJ-apcD-RNA-fabH-(tRnaLeu)-psbX-accD-psbV
Cyanidioschyzon mer.   psaJ-apcD-RNA------(tRnaLeu)-psbX-accD-psbV
Cyanidium caldarium    psaJ-apcD-RNA------(tRnaLeu)------accD-psbV
Guillardia theta       psaJ------RNA----------------psbX------psbV
Odontella sinensis     (tRnaPhe)-RNA----------------psbX-p120-psbV

Green algae + ancestral streptophyta
Mesostigma viride      (ycf6)-RNA------(trnC)-rpoB-rpoC1-rpoC2
Nephroselmis olivacea  ndbH---RNA------(trnC)-rpoB-rpoC1-rpoC2
Chlorella vulgaris     p133---RNA-p134-(trnC)-rpoB-rpoC1-rpoC2

Some of these also contain rnpB (gene for RNase P RNA)

• 2 clear groups: Red algae and Green algae

Genome locations of SRP RNA candidates in chloroplasts

The predicted SRP RNAs have conserved promoters (as in cyanobacteria)

Cyanobacteria

The distances between the –10 TATA box and the sequence (5-8 nts), and the promoter sequences, are consistent with experimentally verified promoters in Prochlorococcus (Vogel et al. 2003).