Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis. Mikhail Gelfand, Institute for Information Transmission Problems (The Kharkevich Institute), RAS. Workshop at the Landau Institute of Theoretical Physics, RAS, September 27-28, 2007, Moscow


Page 1: Comparative genomics:  functional characterization  of new genes and regulatory interactions using computer analysis

Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis

Mikhail Gelfand

Institute for Information Transmission Problems (The Kharkevich Institute), RAS

Workshop at the Landau Institute of Theoretical Physics, RAS

September 27-28, 2007, Moscow

Page 2

The genome is deciphered!

Page 3

Is it?

To intercept a message does not mean to understand it

Page 4

Fragment of a genome (0.1% of E. coli)

A typical bacterial genome: several million nucleotides, ~600 through ~9,000 genes (~90% of the genome encodes proteins)

Page 5

Propaganda

[Figure: sequences in GenBank (~genes) versus articles in PubMed (~experiments) by year, 1982-2000; logarithmic vertical axis from 100 to 10,000,000]

Page 6

More propaganda

Most genes will never be studied in experiment

Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized)

• “Universally missing genes” – not a single known gene even for ~10% of reactions of the central metabolism. No genes for >40% of reactions overall.

• “Conserved hypothetical genes” (5-15% of any bacterial genome) – essential, but unknown function.

Page 7

The local goal: to characterize the genes

• What?
  – function (rather, role)
• When?
  – regulation (conditions)
    • gene expression
    • lifetime (mRNA, protein)
• Where?
  – localization
    • cellular/membrane/secreted
• How?
  – mechanism of action
    • specificity, regulation (biochemistry)

Page 8

Propaganda-2: complete genomes

2007: > 1200 bacterial genomes

[Figure: bar chart of complete genomes by year, 1995-2002; vertical axis from 0 to 90]

Page 9

The global goal: to predict the organism’s properties given its genome

(plus some additional information, e.g. the initial state after cell division)

and “to understand” the evolution of genomes/organisms

Page 10

Haemophilus influenzae, 1995

Page 11

Vibrio cholerae, 2000

Page 12

The metabolic map, the bird’s view

Page 13

Metabolic pathways, the eagle’s view

Page 14

A submap (metabolism of arginine and proline)

Page 15

Approaches

• Similarity => homology (common origin)
• Homology => common function
• “The Pearson Principle” (after Karl Pearson): important features are conserved
  – functional sites in proteins
  – regulatory (protein-binding) sites in DNA
  – not necessarily sequences:
    • structure of protein and RNA
    • gene localization on chromosomes
    • co-expression of genes
• Allows one to annotate 50-75% of genes in a bacterial genome
• Necessary first step, may be automated (to some extent)

Page 16

… but not so simple

• Similarity ≠ homology
  – Low-complexity regions, unstructured domains, transmembrane segments and other regions with non-standard amino acid composition
    • The need for correct similarity measures
  – Does homology always follow from structural similarity?
    • What is structural similarity? How can it be measured?
    • Convergent evolution of structures? Independent emergence of folds?
• Homology ≠ same function
  – What is «the same function»?
    • Biochemical details and cellular role
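One common way to flag compositionally biased regions before trusting a similarity hit is a sliding-window entropy filter. The sketch below only illustrates that idea and is not the method used in the talk: the function names, the 12-residue window and the 2.2-bit threshold are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Start positions of windows whose entropy falls below the threshold.

    The window size and threshold are illustrative assumptions.
    """
    return [
        start
        for start in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[start:start + window]) < threshold
    ]


# Hypothetical sequence: a poly-Q run followed by a diverse segment.
toy_protein = "Q" * 12 + "ACDEFGHIKLMN"
flagged = low_complexity_windows(toy_protein)
print(flagged)
```

Windows that overlap the poly-Q run are flagged, while the fully diverse window at the end is not; hits that rest only on flagged windows deserve extra scepticism.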

Page 17

“The Fermi principle” (after Enrico Fermi)

Purely homology-based annotation: boring (nothing radically new)

It turns out, one can predict something completely new

Comparative genomics

Page 18

Positional clustering

• Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
  – caused by operon structure, but not only
    • horizontal transfer of loci containing several functionally linked operons
    • compartmentalisation of products in the cytoplasm
  – very weak evidence
    • stronger if observed in many unrelated genomes
• May be measured
  – e.g. the STRING database/server (P. Bork, EMBL)
  – and other sources
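The point that proximity is weak evidence in one genome but stronger across many genomes can be sketched as a simple count. This is a minimal illustration, assuming each genome is given as an ordered list of gene (ortholog-group) identifiers; the function name `neighbor_support`, the `max_gap` parameter and the toy gene orders are hypothetical, and this is not the STRING scoring scheme.

```python
def neighbor_support(genomes, gene_a, gene_b, max_gap=1):
    """List the genomes in which gene_a and gene_b lie within max_gap positions.

    genomes maps a genome name to the gene order along the chromosome.
    Strand and chromosome circularity are ignored in this toy model.
    """
    supporting = []
    for name, order in genomes.items():
        positions_a = [i for i, gene in enumerate(order) if gene == gene_a]
        positions_b = [i for i, gene in enumerate(order) if gene == gene_b]
        if any(abs(i - j) <= max_gap for i in positions_a for j in positions_b):
            supporting.append(name)
    return supporting


# Hypothetical gene orders, not real data.
toy_genomes = {
    "genome1": ["trpA", "trpB", "x1", "x2"],
    "genome2": ["x3", "trpB", "trpA", "x4"],
    "genome3": ["trpA", "x5", "x6", "trpB"],
}
support = neighbor_support(toy_genomes, "trpA", "trpB")
print(len(support), "of", len(toy_genomes), "genomes:", support)
```

A real measure would also weight genomes by their phylogenetic distance, since adjacency shared by close relatives is expected by descent alone.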

Page 19

STRING: trpB – positional clusters

Page 20

Functionally dependent genes tend to cluster on chromosomes in many different organisms

Vertical axis: number of gene pairs with association score exceeding a threshold.

Control: same graph, random re-labeling of vertices

Page 21

More genomes (stronger links) => highly significant clustering

Page 22

Fusions

• If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
• Very useful for the analysis of eukaryotes
• Sometimes useful for the analysis of prokaryotes

Page 23

STRING: trpB – fusions

Page 24

Phyletic patterns

• Functionally linked genes tend to occur together
• Enzymes with the same function (isozymes) have complementary phyletic profiles
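Both statements can be sketched with set operations, assuming each gene family is represented by the set of genomes in which it occurs. The Jaccard index used as the co-occurrence score, the function names and the toy genome labels are assumptions for illustration, not the scoring used by STRING or in the talk.

```python
def cooccurrence(profile_a, profile_b):
    """Jaccard index of two phyletic profiles (sets of genome labels)."""
    union = profile_a | profile_b
    return len(profile_a & profile_b) / len(union) if union else 0.0


def are_complementary(profile_a, profile_b, required_genomes):
    """True if the profiles never overlap and together cover required_genomes."""
    no_overlap = not (profile_a & profile_b)
    return no_overlap and (profile_a | profile_b) == required_genomes


# Hypothetical profiles over genomes labelled a-d.
pathway_genomes = {"a", "b", "c", "d"}
linked_gene_1 = {"a", "b", "c", "d"}
linked_gene_2 = {"a", "b", "c", "d"}
isozyme_1 = {"a", "b"}
isozyme_2 = {"c", "d"}

print(cooccurrence(linked_gene_1, linked_gene_2))
print(cooccurrence(isozyme_1, isozyme_2))
print(are_complementary(isozyme_1, isozyme_2, pathway_genomes))
```

Functionally linked genes score near 1, whereas two alternative forms of one enzyme score near 0 yet jointly cover every genome that has the pathway.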

Page 25

STRING: trpB – co-occurrence (phyletic patterns)

Page 26

Phyletic patterns in the Phe/Tyr pathway

shikimate kinase

Page 27

Archaeal shikimate kinase

Chorismate biosynthesis pathway (E. coli)

Page 28

Arithmetics of phyletic patterns

3-dehydroquinate dehydratase (EC 4.2.1.10):
  Class I (AroD)      COG0710  aompkzyq---lb-e----n---i--
  Class II (AroQ)     COG0757  ------y-vdr-bcefghs-uj----
  Two forms combined           aompkzyqvdrlbcefghsnuj-i--

5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19):
  AroA                COG0128  aompkzyqvdrlbcefghsnuj-i--

Shikimate dehydrogenase (EC 1.1.1.25):
  AroE                COG0169  aompkzyqvdrlbcefghsnuj-i--

Shikimate kinase (EC 2.7.1.71):
  Typical (AroK)      COG0703  ------yqvdrlbcefghsnuj-i--
  Archaeal-type       COG1685  aompkz--------------------
  Two forms combined           aompkzyqvdrlbcefghsnuj-i--

Chorismate synthase (EC 4.2.3.5):
  AroC                COG0082  aompkzyqvdrlbcefghsnuj-i--
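The "arithmetic" on this slide is a position-by-position union of the pattern strings. A minimal sketch, assuming each position stands for one genome, a letter means presence and "-" means absence; the function name `combine` is hypothetical, while the pattern strings are the ones listed on the slide.

```python
def combine(*patterns):
    """Position-wise union of phyletic pattern strings of equal length."""
    length = len(patterns[0])
    if any(len(pattern) != length for pattern in patterns):
        raise ValueError("patterns must have the same length")
    combined = []
    for column in zip(*patterns):
        letters = {char for char in column if char != "-"}
        if len(letters) > 1:
            raise ValueError("conflicting genome letters in one column")
        combined.append(letters.pop() if letters else "-")
    return "".join(combined)


aro_d = "aompkzyq---lb-e----n---i--"       # COG0710, class I
aro_q = "------y-vdr-bcefghs-uj----"       # COG0757, class II
aro_k = "------yqvdrlbcefghsnuj-i--"       # COG0703, typical kinase
aro_k_archaeal = "aompkz" + "-" * 20       # COG1685, archaeal-type kinase
aro_a = "aompkzyqvdrlbcefghsnuj-i--"       # COG0128, single-form enzyme

print(combine(aro_d, aro_q) == aro_a)
print(combine(aro_k, aro_k_archaeal) == aro_a)
```

Each pair of alternative forms sums to the same pattern as the single-form enzymes of the pathway, which is the argument for assigning COG1685 the missing shikimate kinase role in archaea.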

Page 29

Distribution of association scores: monotonic for subunits, bimodal for isozymes

Page 30

Comparative analysis of regulation

• Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions

• Consistency filtering: regulons (sets of co-regulated genes) are conserved =>
  – true sites occur upstream of orthologous genes
  – false sites are scattered at random
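Consistency filtering can be sketched as a vote across genomes. The example assumes candidate sites have already been found by some pattern search and mapped to ortholog groups; the data layout, the function name `consistent_groups` and the threshold `min_genomes` are hypothetical.

```python
from collections import Counter


def consistent_groups(candidate_sites, min_genomes=2):
    """Ortholog groups with a candidate site upstream in at least min_genomes genomes.

    candidate_sites maps a genome name to the set of ortholog groups that
    have a candidate regulatory site in their upstream region.
    """
    counts = Counter()
    for groups in candidate_sites.values():
        counts.update(set(groups))
    return {group for group, n in counts.items() if n >= min_genomes}


# Hypothetical search results: x17, x42 and x99 stand for chance hits.
toy_sites = {
    "genome1": {"ribD", "ypaA", "x17"},
    "genome2": {"ribD", "ypaA", "x42"},
    "genome3": {"ribD", "x99"},
}
kept = consistent_groups(toy_sites)
print(sorted(kept))
```

Hits that recur upstream of orthologs survive the filter, while hits seen in a single genome are discarded as likely false positives.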

Page 31

Riboflavin (vitamin B2) biosynthesis pathway

[Figure: pathway diagram.
Genes: ribA, ribB, ribD, ribG, ribH, ribE, ypaA.
Enzymes: GTP cyclohydrolase II; pyrimidine deaminase; pyrimidine reductase; 3,4-DHBP synthase; riboflavin synthase chains.
Inputs: GTP (purine biosynthesis pathway); ribulose-5-phosphate (pentose-phosphate pathway).
Intermediates: 2,5-diamino-6-hydroxy-4-(5'-phosphoribosylamino)pyrimidine; 5-amino-6-(5'-phosphoribosylamino)uracil; 5-amino-6-(5'-phosphoribitylamino)uracil; 3,4-dihydroxy-2-butanone-4-phosphate; 6,7-dimethyl-8-ribityllumazine.
Product: riboflavin.]

Page 32

5' UTR regions of riboflavin genes from bacteria

     1 2 2' 3 Add. 3' Variable 4 4' 5 5' 1'
     =========> ==> <== ===> -><- <=== -> <- ====> <==== ==> <== <=========
BS   TTGTATCTTCGGGG-CAGGGTGGAAATCCCGACCGGCGGT 21 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTTTAA-GCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAT
BQ   AGCATCCTTCGGGG-TCGGGTGAAATTCCCAACCGGCGGT 19 AGTCCGTGAC-- 8 5 8 -----TGGATCTAGTGAAACTCTAGGGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATATG
BE   TGCATCCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGATTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGGATGCC
HD   TTTATCCTTCGGGG-CTGGGTGGAAATCCCGACCGGCGGT 19 AGTCCGTGAC-- 10 4 10 -----TGGACCTGGTGAAAATCCGGGACCGACAGTGAA-AGTCTGGAT-GGGAGAAGGAAACG
Bam  TGTATCCTTCGGGG-CTGGGTGAAAATCCCGACCGGCGGT 23 AGCCCGTGAC-- 8 4 8 -----TGGATTCAGTGAAAAGCTGAAGCCGACAGTGAA-AGTCTGGAT-GGGAGAAGGATGAG
CA   GATGTTCTTCAGGG-ATGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCAA--- 3 4 3 ------AGATCCGGTTAAACTCCGGGGCCGACAGTTAA-AGTCTGGAT-GAAAGAAGAAATAG
DF   CTTAATCTTCGGGG-TAGGGTGAAATTCCCAATCGGCGGT 2 AGCCCGCG---- 7 6 7 --------ATTTGGTTAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GGAAGAAGATATTT
SA   TAATTCTTTCGGGG-CAGGGTGAAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGTTAA-AGTCTGGAT-GGGAGAAAGAATGT
LLX  ATAAATCTTCAGGG-CAGGGTGTAATTCCCTACCGGCGGT 2 AGCCCGCGA--- 4 4 4 -----ATGATTCGGTGAAACTCCGAGGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAATA
PN   AACTATCTTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGA--- 3 4 3 -----ATGATTTGGTGAAATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAAGATAAAA
TM   AAACGCTCTCGGGG-CAGGGTGGAATTCCCGACCGGCGGT 3 AGCCCGCGAG-- 5 4 5 -----TTGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAGAGCGTGA
DR   GACCTCTTTCGGGG-CGGGGCGAAATTCCCCACCGGCGGT 15 AGCCCGCGAA-- 8 12 9 -----CCGATGCCGCGCAACTCGGCAGCCGACGGTCAC-AGTCCGGAC-GAAAGAAGGAGGAG
TQ   CACCTCCTTCGGGG-CGGGGTGGAAGTCCCCACCGGCGGT 3 AGCCCGCGAA-- 5 4 5 -----CCGACCCGGTGGAATTCCGGGGCCGACGGTGAA-AGTCCGGAT-GGGAGAAGGAGGGC
AO   AATAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGCGGT 2 AGTCCGCGA--- 7 7 7 -----AGGAACCGGTGAGATTCCGGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGATGAAA
DU   TTTAATCTTCAGGG-CAGGGTGAAATTCCCGATCGGTGGT 2 AGTCCGCGA--- 13 4 12 -----AGGAACTAGTGAAATTCTAGTACCGACAGT-AT-AGTCTGGAT-GGAAGAAGAGCAGA
CAU  GAAGACCTTCGGGG-CAAGGTGAAATTCCTGATCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGACCCGGTGTGATTCCGGGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTCGGC
FN   TAAAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 5 4 5 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GGGAGAAGAATTAG
TFU  ACGCGTGCTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TGGAACCGGTGAAACTCCGGTACCGACGGTGAA-AGTCCGGAT-GGGAGGTAGTACGTG
SX   -AGCGCACTCCGGG-GTCGGTGAAAGTCCGAACCGGCGGT 3 AGTCCGCGAC-- 8 5 8 -----TTGACCAGGTGAAATTCCTGGACCGACGGTTAA-AGTCCGGAT-GGGAGGCAGTGCGCG
BU   GTGCGTCTTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 30 AGCCCGCGAGCG 137 GTCAGCAGATCTGGTGAGAAGCCAGAGCCGACGGTTAG-AGTCCGGAT-GGAAGAAGATGTGC
BPS  GTGCGTCTTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCTGGTCCGATGCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGATGTGC
REU  TTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 31 AGCCCGCGAGCG 7 5 7 GTCAGCAGATCTGGTGAGAGGCCAGGGCCGACGGTTAA-AGTCCGGAT-GAAAGAAGATGGGC
RSO  GTACGTCTTCAGGG-CGGGGTGGAATTCCCCACCGGCGGT 21 AGCCCGCGAGCG 11 3 11 GTCAGCAGATCCGGTGAGATGCCGGGGCCGACGGTCAG-AGTCCGGAT-GGAAGAAGATGTGC
EC   GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 8 4 8 GACAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAG-AGTCCGGAT-GGGAGAGAGTAACG
TY   GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 67 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGGGTAACG
KP   GCTTATTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 20 AGCCCGCGAGCG 8 4 8 GTCAGCAGATCCGGTGTAATTCCGGGGCCGACGGTTAA-AGTCCGGAT-GGGAGAGAGTAACG
HI   TCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 2 AGCCCACGAGCG 26 9 30 GTCAGCAGATTTGGTGAAATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAAAGAGAATAAAA
VK   GCGCATTCTCAGGG-CAGGGTGAAATTCCCTACCGGTGGT 14 AGCCCACGAGCG 11 9 11 GTCAGCAGATTTGGTGAGAATCCAAAGCCGACAGT-AT-AGTCTGGAT-GAAAGAGAATAAGC
VC   CAATATTCTCAGGG-CGGGGCGAAATTCCCCACCGGTGGT 13 AGCCCACGAGCG 5 4 5 GTCAGCAGATCTGGTGAGAAGCCAGGGCCGACGGTTAC-AGTCCGGAT-GAGAGAGAATGACA
YP   GCTTATTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 40 AGCCCGCGAGCG 16 6 16 GTCAGCAGACCCGGTGTAATTCCGGGGCCGACGGTTAT-AGTCCGGAT-GGGAGAGAGTAACG
AB   GCGCATTCTCAGGG-CAGGGTGAAAGTCCCTACCGGTGGT 25 AGCCCACGAGCG 16 4 27 GTCAGCAGATTTGGTGCGAATCCAAAGCCGACAGTGAC-AGTCTGGAT-GAAAGAGAATAAAA
BP   GTACGTCTTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 18 AGCCCGCGAGCG 10 4 10 GTCAGCAGACCTGGTGAGATGCCAGGGCCGACGGTCAT-AGTCCGGAT-GAGAGAAGATGTGC
AC   ACATCGCTTCAGGG-CGGGGCGTAATTCCCCACCGGCGGT 16 AGCCCGCGAGCA 10 3 11 ---CGCAGATCTGGTGTAAATCCAGAGCCGACGGT-AT-AGTCCGGAT-GAAAGAAGACGACG
Spu  AACAATTCTCAGGG-CGGGGTGAAACTCCCCACCGGCGGT 34 AGCCCGCGAGCG 6 6 6 GTCAGCAGATCTGGTG 52 TCCAGAGCCGACGGT 31 AGTCCGGAT-GGAAGAGAATGTAA
PP   GTCGGTCTTCAGGG-CGGGGTGTAAGTCCCCACCGGCGGT 13 AGCCCGCGAGCG 7 3 7 GTCAGCAGATCTGGTGCAACTCCAGAGCCGACGGTCAT-AGTCCGGAT-GAAAGAAGGCGTCA
AU   GGTTGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 17 AGCCCGCGAGCG 7 9 7 GTCAGCAGATCCGGTGAGAGGCCGGAGCCGACGGT-AT-AGTCCGGAT-GGAAGAGGACAAGG
PU   AAACGTTCTCAGGG-CGGGGTGCAATTCCCCACCGGCGGT 19 AGCCCGCGAGCG 19 4 18 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAC-AGTCCGGATGAAGAGAGAACGGGA
PY   TAACGTTCTCAGGG-CGGGGTGCAACTCCCCACCGGCGGT 19 AGCCCGCGAGCG 15 4 16 GTCAGCAGACCCGGTGTGATTCCGGGGCCGACGGTCAT-AGTCCGGATGAAGAGAGAGCGGGA
PA   TAACGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 19 AGCCCGCGAGCG 14 4 13 GTCAGCAGACCCGGTGCGATTCCGGGGCCGACGGTCAT-AGTCCGGATAAAGAGAGAACGGGA
MLO  TAAAGTTCTCAGGG-CGGGGTGAAAGTCCCCACCGGCGGT 16 AGCCCGCGAGCG 8 5 8 GTCAGCAGATCCGGTGTGATTCCGGAGCCGACGGTTAG-AGTCCGGAT-GAAAGAGGACGAAA
SM   AAGCGTTCTCAGGG-CGGGGTGAAATTCCCCACCGGCGGT 34 AGCCCGCGAGCG 8 3 8 GTCAGCAGATCCGGTCGAATTCCGGAGCCGACGGTTAT-AGTCCGGAT-GGAAGAGAGCAAGC
BME  GCTTGTTCTCGGGG-CGGGGTGAAACTCCCCACCGGCGGT 17 AGCCCGCGAGCG 10 15 10 GTCAGCAGATCCGGTGAGATGCCGGAGCCGACGGTTAA-AGTCCGGAT-GGAAGAGAGCGAAT
BS   ATCAATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 18 AGCCCGCGA--- 5 4 5 -----AGGATTCGGTGAGATTCCGGAGCCGACAGT-AC-AGTCTGGAT-GGGAGAAGATGGAG
BQ   GTCTATCTTCGGGG-CAGGGTGAAAATCCCGACCGGCGGT 27 AGCCCGCGA--- 3 5 3 -----AGGATTTGGTGTGATTCCAAAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
BE   ATTCATCTTCGGGG-CAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCGA--- 3 4 3 -----AGGATCCGGTGCGAGTCCGGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGAAG
CA   AATGATCTTCAGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCGAG-- 3 4 3 ----TATGATCCGGTTTGATTCCGGAGCCGACAGT-AA-AGTCTGGAT-GAAAGAAGATATAT
DF   GAAGATCTTCGGGG-CAGGGTGAAATTCCCTACCGGCGGT 2 AGCCCGCG---- 6 4 6 -------GATTTGGTGAGATTCCAAAGCCGACAGT-AA-AGTCTGGAT-GAGAGAAGATATTT
EF   GTTCGTCTTCAGGGGCAGGGTGTAATTCCCGACCGGTGGT 3 AGTCCACGAC-- 5 3 5 ----ATTGAATTGGTGTAATTCCAATACCGACAGT-AT-AGTCTGGAT--AAAGAAGATAGGG
LLX  AAATATCTTCAGGG-CACCGTGTAATTCGGGACCGGCGGT 21 ACTCCGCGAT-- 4 4 4 -----TTGAAGCAGTGAGAATCTGCTAGCGACAGT-AA-AGTCTGGAT-GGAAGAAGATGAAC
LO   GTTCATCTTCGGGG-CAGGGTGCAATTCCCGACCGGTGGT 3 AGTCCACGAT-- 3 10 3 ----TTGACTCTGGTGTAATTCCAGGACCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGTTG
PN   AAGAGTCTTCAGGG-CAGGGTGAAATTCCCGACCGGCGGT 125 AGTCCGTG---- 3 4 3 -------GATGTGGTGAGATTCCACAACCGACAGT-AT-AGTCTGGAT-GGGAGAAGACGAAA
ST   AAGTGTCTTCAGGG-CAGGGTGTGATTCCCGACCGGCGGT 14 AGTCCGCG---- 3 4 3 -------GATGTGGTGTAACTCCACAACCGACAGT-AT-AGTCTGGAT-GAGAGAAGACCGGG
MN   AAGTGTCTTCAGGG-CAGGGTGAGATTCCCGACCGGCGGT 104 AGTCCGCG---- 3 4 3 -------GATGTGGTGAAATTCCACAACCGACAGT-AA-AGTCTGGAT-GGGAGAAGACTGAG
SA   ATTCATCTTCGGGG-TCGGGTGTAATTCCCAACCGGCAGT 6 AGCCTGCGAC-- 11 3 11 -----CTGATCTAGTGAGATTCTAGAGCCGACAGT-AT-AGTCTGGAT-GGGAGAAGATGGAG
AMI  TCACAGTTTCAGGG-CGGGGTGCAATTCCCCACTGGCGGT 14 AGCCCGCGC--- 5 5 5 ------TGATCTGGTGCAAATCCAGAGCCAACGGT-AT-AGTCCGGAT-GGAAGAAACGGAGC
DHA  ACGAACCTTCGAGG-TAGGGTGAAATTCCCGACCGGCGGT 20 AGCCCGCAAC-- 11 4 11 --CGACTGACTTGGTGAGACTCCAAGGCCGACGGT-AT-AGTCCGGAT-GGGAGAAGGTACAA
FN   AATAATCTTCGGGG-CAGGGTGAAATTCCCGACCGGTGGT 2 AGTCCACG---- 4 6 4 -------GATTTGGTGAAATTCCAAAACCGACAGT-AG-AGTCTGGAT-GAGAGAAGAAAAGA
GLU  ---TGTTCTCAGGG-CGGGGCGAAATTCCCCACCGGCGGT 28 AGCCCGCGAGCG 10 4 10 GTCAGCAGATCCGGTTAAATTCCGGAGCCGACGGTCAT-AGTCCGGAT-GCAAGAGAACC---

Page 33

Conserved secondary structure of the RFN-element

[Figure: consensus secondary structure of the RFN-element with the 5' and 3' ends, stems 1-5, a variable stem-loop and an additional stem-loop]

Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide or deletion.
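The degenerate-position codes in this legend map directly onto a regular expression for scanning RNA sequences. A minimal sketch, assuming the consensus is written as a flat string (base pairing is ignored) and that lower-case, strongly conserved positions are treated as strictly as capitals; the table name `CODES`, the function name and the example consensus are hypothetical.

```python
import re

# Legend codes translated to RNA character classes.
CODES = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]",     # A or G
    "Y": "[CU]",     # C or U
    "K": "[GU]",     # G or U
    "B": "[CGU]",    # not A
    "V": "[ACG]",    # not U
    "N": "[ACGU]",   # any nucleotide
    "X": "[ACGU]?",  # any nucleotide or deletion
}


def consensus_to_regex(consensus):
    """Compile a degenerate consensus string into a regular expression."""
    return re.compile("".join(CODES[symbol] for symbol in consensus.upper()))


# Hypothetical consensus fragment, not taken from the RFN-element.
pattern = consensus_to_regex("GRYxA")
for candidate in ("GACUA", "GGUA", "GCCUA"):
    print(candidate, bool(pattern.fullmatch(candidate)))
```

A real search for a structured element would also have to check that the paired positions can form the conserved stems, which a flat regular expression cannot express.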

Page 34

RFN: the mechanism of regulation

• Transcription attenuation

• Translation attenuation

Page 35

Early observation: an uncharacterized gene (ypaA) with an upstream RFN element

Page 36

Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)

[Figure: phylogenetic tree; branch annotations: duplications; no riboflavin biosynthesis (marked twice)]

Page 37

YpaA a.k.a. RibU: riboflavin transporter in Gram-positive bacteria

• 5 predicted transmembrane segments => a transporter

• Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor

• S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin

Prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)

Validation:

• YpaA transports flavins (riboflavin, FMN, FAD): by genetic analysis (Kreneva et al., 2000), by direct measurement (Burgess et al., 2006; Vogl et al., 2007)

• ypaA is regulated by riboflavin: by microarray expression study (Lee et al., 2001)

• … via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)

Page 38

Conserved structures of riboswitches (circled: X-ray)

[Figure: consensus secondary structures of the RFN-element, THI-element, B12-element (with the B12 box), S-box, G-box and LYS-element; labels: 5' and 3' ends, base stems, helices P1-P7, variable (Var) and additional (Add, Add I-III) stem-loops]

Page 39

Mechanisms

[Figure: schemes A-C of riboswitch action with and without the effector. Labels: RNA-element; regulatory hairpin (terminator of transcription and/or RBS-sequestor); antiterminator/antisequestor; downstream genes; cases of regulation of transcription and regulation of translation]

glmS: ribozyme, cleaves its mRNA (the Breaker group)

THI-box in plants: inhibition of splicing (the Breaker and Hanamoto groups)

Page 40

Characterized riboswitches (more are predicted)

RFN
  – Regulated function: riboflavin biosynthesis and transport
  – Ligand: FMN (flavin mononucleotide)
  – Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, other bacteria

THI
  – Regulated function: biosynthesis and transport of thiamin and related compounds
  – Ligand: TPP (thiamin pyrophosphate)
  – Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, other bacteria, archaea (thermoplasmas), plants, fungi

B12
  – Regulated function: biosynthesis of cobalamin, transport of cobalt, cobalamin-dependent enzymes
  – Ligand: coenzyme B12 (adenosylcobalamin)
  – Distribution: Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, spirochaetes, other bacteria

S-box, SAM-II, SAM-III
  – Regulated function: metabolism of methionine and cysteine
  – Ligand: SAM (S-adenosylmethionine)
  – Distribution: Bacillus/Clostridium group and some other bacteria; SAM-II (alpha), SAM-III (Streptococci)

LYS
  – Regulated function: lysine metabolism
  – Ligand: lysine
  – Distribution: Bacillus/Clostridium group, enterobacteria, other bacteria

G-box
  – Regulated function: metabolism of purines
  – Ligand: purines
  – Distribution: Bacillus/Clostridium group and some other bacteria

glmS (ribozyme)
  – Regulated function: synthesis of glucosamine-6-phosphate
  – Ligand: glucosamine-6-phosphate
  – Distribution: Bacillus/Clostridium group

gcvT (tandem)
  – Regulated function: catabolism of glycine
  – Ligand: glycine
  – Distribution: Bacillus/Clostridium group

Page 41

Properties of riboswitches

• Direct binding of ligands
• High conservation
  – Including “unpaired” regions: tertiary interactions, ligand binding
• Same structure, different mechanisms: transcription, translation, splicing, (RNA cleavage)
• Distribution in all taxonomic groups
  – diverse bacteria
  – archaea: thermoplasmas
  – eukaryotes: plants and fungi
• Correlation of the mechanism and taxonomy:
  – attenuation of transcription (anti-anti-terminator): Bacillus/Clostridium group
  – attenuation of translation (anti-anti-sequestor of translation initiation): proteobacteria
  – attenuation of translation (direct sequestor of translation initiation): actinobacteria
• Evolution: horizontal transfer, duplications, lineage-specific loss
• Sometimes very narrow distribution: evolution from scratch?

Page 42

Conserved signal upstream of nrd genes

Page 43

Identification of the candidate regulator by the analysis of phyletic patterns

COG1327: the only COG with exactly the same phylogenetic pattern as the signal
  – “large scale” on the level of major taxa
  – “small scale” within major taxa:
    • absent in small parasites among alpha- and gamma-proteobacteria
    • absent in Desulfovibrio spp. among delta-proteobacteria
    • absent in Nostoc sp. among cyanobacteria
    • absent in Oenococcus and Leuconostoc among Firmicutes
    • present only in Treponema denticola among four spirochetes

Page 44

COG1327 “Predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains”: regulator of the riboflavin pathway (RibX)?

Page 45

Additional evidence: co-localization

nrdR is sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA

Page 46

Additional evidence: co-regulated genes

In some genomes, candidate NrdR-binding sites are found upstream of other replication-related genes
  – dNTP salvage
  – topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II

Page 47

Multiple sites (nrd genes): FNR, DnaA, NrdR

Page 48

Mode of regulation

• Repressor (overlaps with promoters)
• Co-operative binding:
  – most sites occur in tandem (> 90% cases)
  – the distance between the copies (centers of palindromes) equals an integer number of DNA turns:
    • mainly (94%) 30-33 bp, in 84% 31-32 bp: 3 turns
    • 21 bp (2 turns) in Vibrio spp.
    • 41-42 bp (4 turns) in some Firmicutes
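The link between site spacing and DNA turns is simple arithmetic. The sketch below assumes a helical repeat of about 10.5 bp per turn for B-form DNA, a standard approximation that is not stated on the slide; the function name `helical_phase` is hypothetical.

```python
BP_PER_TURN = 10.5  # assumed helical repeat of B-form DNA


def helical_phase(distance_bp, bp_per_turn=BP_PER_TURN):
    """Return (nearest whole number of turns, offset in bp from that number)."""
    turns = round(distance_bp / bp_per_turn)
    offset = distance_bp - turns * bp_per_turn
    return turns, offset


# Center-to-center distances quoted on the slide.
for distance in (21, 31, 32, 41, 42):
    turns, offset = helical_phase(distance)
    print(f"{distance} bp: {turns} turns, offset {offset:+.1f} bp")
```

All quoted distances fall within about 1 bp of a whole number of turns, so the two sites face the same side of the helix, as expected for co-operative binding.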

Page 49

Experimental validations

Page 50

Acknowledgements

• Dmitry Rodionov (comparative genomics)
• Andrei Mironov (software)
• Alexei Vitreschak (riboswitches)

• Funding:
  – Howard Hughes Medical Institute
  – Russian Foundation of Basic Research
  – RAS, program “Molecular and Cellular Biology”
  – INTAS