application of genomics in biochemistry · hana Šišková: application of genomics in biochemistry...
TRANSCRIPT
COMENIUS UNIVERSITY IN BRATISLAVA
Faculty of Natural Sciences
Department of Biochemistry
Application of genomics in biochemistry
Bachelor thesis
Hana ŠIŠKOVÁ
Study programme - Biochemistry
Supervisor of thesis: Mgr. Silvia Poláková, PhD.
BRATISLAVA 2010
1
2
Declaration
I hereby declare that this thesis is the result of my own original work and I have done
it with the use of literature listed in the references.
In Bratislava, April 15th 2010 ..........................................
Author signature
3
Acknowledgement
I am very thankful to my supervisor, Mgr. Silvia Poláková, PhD., whose
encouragement, guidance and support from the initial to the final level enabled me to
develop an understanding of the subject.
4
ABSTRAKT
Hana Šišková: Využitie genomiky v biochémii
Univerzita Komenského v Bratislave, Prírodovedecká fakulta, Katedra biochémie
Bakalárska práca, 38 strán, 13 grafických príloh, 2010
Genomika je mladá vedná disciplína, ktorej rozvoj značne napomáha štúdiu
metabolických dráh v organizmoch. Táto práca mala za cieľ predstaviť základné oblasti
genomiky a postgenomiky, vysvetliť ich princípy a prínos pre biochémiu za účelom
pochopenia metabolizmu organického celku. Práca sa štruktúrou snažila chronologicky
objasniť vývoj metód pre štúdium bunkového metabolizmu; od pre-genomických metód,
využívaných pred objavom metód sekvenovania genómu, cez metódy genomiky, ktoré
pomáhajú odhaľovať vzťah medzi genómovou štruktúrou a funkciou, až po postgenomické
metódy zamerané na objasnenie fyziológie eukaryotickej a prokaryotickej bunky.
Kľúčové slová: genomika, funkčná genomika, komparatívna genomika, transkriptomika,
proteomika, metabolomika, metagenomika
ABSTRACT
Hana Šišková: Application of genomics in biochemistry
Comenius University in Bratislava, Faculty of Natural Sciences, Department of Biochemistry
Bachelor thesis, 38 pages, 13 supplements, 2010
Genomics is a young scientific discipline, the development of which has considerably
facilitated the study of metabolic pathways in living organisms. The aim of this work was to
present the fields of genomics and post-genomics, and to explain their principles and use in
biochemistry in order to understand the metabolism of an organic entity. This work attempted
to present chronologically the development of methods used to study the cell metabolism:
from pre-genomic methods, used before the discovery of genome sequencing methods,
through the methods of genomics that help to reveal the relationship between genome
structure and function, and to post-genomics methods aimed at the clarification of the
physiology of the eukaryotic and prokaryotic cell.
Key words: genomics, functional genomics, comparative genomics, transcriptomics,
proteomics, metabolomics, metagenomics
5
TABLE OF CONTENTS
Prologue ..................................................................................................................................... 6
1. Methods to reveal metabolic pathways in the pre-genomic world .................................. 7
2. How can genomics help in the study of metabolic pathways.......................................... 10
2.1 Definition of genomics ................................................................................................... 10
2.2 History of genomics ....................................................................................................... 11
2.3 Comparative genomics ................................................................................................... 13
3. Postgenomics ....................................................................................................................... 19
3.1 Use of functional genomics to determinate gene and protein function .......................... 19
3.1.1 ORF clones .............................................................................................................. 21
3.1.2 Knock-out mutants ................................................................................................... 22
3.1.3 Transcriptome analysis ........................................................................................... 23
3.1.4 Proteome analysis ................................................................................................... 26
3.1.5 Metabolome analysis ............................................................................................... 27
3.2 Metagenomics - approach for studying complex communities in natural environment 29
Summary ................................................................................................................................. 31
Zhrnutie ................................................................................................................................... 32
Used abbreviations ................................................................................................................. 33
References ............................................................................................................................... 34
6
Prologue
The discovery of the DNA structure by Watson and Crick in 1953 and the use of
genome sequencing methods by Fred Sanger‘s group in the 1970s have given rise to the new
field of genomics. Many scientific groups have started to sequence the genomes of some
simpler organisms and with the use of bioinformatics the genome databases accessible
worldwide have started to appear. These databases enabled further development of
comparative genomics, as an approach to determine unknown genes based on the comparison
with the sequences of known genes. Once the genes were aligned, scientists found a new
challenge for themselves: to determine the molecular, cellular and physiological functions for
the numerous proteins encoded by the sequenced genomes. This task has required the
integrated application of chemical and biological techniques and has promoted the further
development of the fields such as functional genomics, transcriptomics, proteomics and
metabolomics. Some scientists went even further by switching from the analysis of an isolated
organism to the research of complex communities in their natural environment, as presented
by the field of metagenomics.
7
1. Methods to reveal metabolic pathways in the pre-genomic world
Today in biochemistry we have many specialised methods to help us analyse the
metabolic pathways in various organisms. But before these methods appeared, scientists often
had to improvise in their research in order to get some notable success. The work of Hans
Krebs and Melvin Calvin is a typical case, and worth recounting.
When Hans Krebs wanted to measure the amounts of ammonia taken up and urea
formed in the liver, he had first to design a salt solution that would be similar enough to the
saline composition of blood. In his experiments, he showed that the liver made some urea
from added ammonium salts and that the energy for this synthesis was supplied by the
breakdown of stored carbohydrates. The production of urea was stimulated to various degrees
by the addition of different amino acids, but one amino acid — ornithine — stimulated it to a
much greater extent than any other tested.
Since 1904 it had been known that the amino acid arginine could be hydrolysed to
ornithine and urea by the enzyme arginase. In addition, it was known that the liver is rich in
arginase (Kossel and Dakin, 1904). Therefore, the question was whether the ornithine could
not be contaminated with arginine. Krebs, in his further research, not only gave a negative
answer to this question but also revealed that in the presence of ammonia, the addition of one
molecule of ornithine brought the extra formation of more than 20 molecules of urea. Krebs
assumed that ornithine acted as a catalyst in the reaction. Although arginine seemed to be the
intermediate regenerating the ornithine, it looked unlikely that arginine would be formed from
ornithine and ammonia in one step. Krebs postulated that citrulline might be the missing
intermediate of the cycle. He also proved his hypothesis when he obtained a few milligrams
of citrulline from two scientists who by a happy coincidence had just isolated this substance
(Wada, 1930; Ackermann, 1931). And so the ornithine cycle was discovered (Krebs and
Henseleit, 1932) (Fig.1).
Figure 1. a) The ornithine cycle formulated by Krebs and his assistant Henseleit in 1932. b) Modern
version of the ornithine cycle (Kornberg, 2000).
8
Hans Krebs has also helped to discover two other very important cycles in living
organisms. First, he participated in the discovery of the citric acid cycle. Together with
William A. Johnson studied how small organic acids are oxidized to CO2 and H2O to yield
energy for the body. He based his research on the previous studies of the Hungarian
biochemist Albert Szent–Györgyi, who used the breast muscles of pigeons to study this
process. Albert Szent–Györgyi chose this tissue as it was well adapted to extract the energy
from the food in order to power the bird flight. He showed that the rate of oxygen uptake was
greatly increased when any of three four-carbon dicarboxylic acids - fumarate, succinate or
malate - were added. He concluded that these substances were limiting in the cell and
stimulated oxidation of endogenous glucose, but the catalytic effect of these compounds
remained a puzzle for him. Krebs and Johnson argued that this could not proceed in one step.
They showed that succinate could be synthesised if pyruvate was present and that citrate was
formed from oxaloacetate if pyruvate was also added (Krebs and Johnson, 1937a). But the
most important observation was that citrate was not only oxidised, but that it was
continuously re-formed. They saw that they had a cycle, not a simple pathway, and that
addition of any of the intermediates could generate all of the others (Krebs and Johnson,
1937b). The existence of a cycle, together with the entry of pyruvate into the cycle in the
synthesis of citrate, gave a clear explanation for the catalytic effect of succinate, fumarate, and
malate. The citric acid cycle, formulated later by Kornberg and Krebs, is shown in Figure 2a.
Additionally, Krebs participated in the discovery of the glyoxylate cycle – the ‗bypass‘
of the citric acid cycle, when he wanted to explain how some organisms could grow on
acetate as the only carbon source (Kornberg and Krebs, 1957) (Fig.2b).
Figure 2. Main stages of the citric acid cycle (a) and the glyoxylate cycle (b), formulated by Kornberg and
Krebs,1957 (adapted from Kornberg, 2000).
9
Another well-known scientist, Melvin Calvin, used the method of C14
labelling in his
experimental observation of the carbon cycle in photosynthesis. The C14
isotope was cheaply
available in large amounts after 1945 thanks to the number of nuclear reactors that had been
constructed. Calvin studied the carbon reduction sequence in photosynthesis on the unicellular
green alga Chlorella, which turned out to be a very good biological material. In his
experiment, Calvin injected some C14
labelled CO2 into the stream of nonradioactive CO2,
both of which entered the plant materials. At the end of the different time periods, the plants
were killed and the extraction of material for analysis was initiated. The isolation and
identification procedures were then carried out in ion-exchange columns. Later, when Martin
and Synge had developed the method of partition chromatography, Calvin used this method as
his principal analytical tool. He spread the plant material on a sheet of filter paper by two-
dimensional chromatography and he placed the paper in contact with photographic film, thus
exposing the film at those points of the paper where the compounds he was interested in were
located. Thanks to this method, together with his co-workers A.A. Benson and J.A. Bassham,
he was able to identify the first product of photosynthesis – the phosphoglycerate, as well as
other intermediates of the photosynthetic carbon cycle, known also as Calvin-Benson-
Bassham cycle (Calvin, 1962) (Fig.3).
Figure 3. The photosynthetic carbon cycle (Calvin, 1962).
10
2. How can genomics help in the study of metabolic pathways
2.1 Definition of genomics
A new age of study of metabolic pathways came with the emergence of genomics as
a new field in the omics studies. ‗Omics‘ is a general term for a broad discipline of science
and engineering for analysing the interactions of biological information objects in
various omes (ome is the Greek suffix for ‗whole‘, so it means the organic totality of
something). Since the mid-'90s researchers are rapidly taking up omes and omics, as shown
by the explosion of the use of these terms.1 These include studies such as genomics,
proteomics, metabolomics, expressomics, materiomics and interactomics. When we talk about
genes that are differentially expressed and proteins produced by the cell, the terms like
transcriptome and proteome are used; and since the living cell is a complex structure, there is
a complexome; all of the parts are connected ultimately with the metabolome, and together
these omes house the phenome, which interacts with an environome (Fig.4).
Figure 4. The diagram suggests other omes that constitute a living multicellular organism (Scriver, 2001).
The main focus in the omics study is on:
1) mapping information objects such as genes and proteins,
2) finding interaction relationships among the objects,
3) engineering the networks and objects to understand and manipulate the regulatory
mechanisms.
So genomics, as the omics study, is a field studying the complexity of genes in
individual organisms, populations, and species.
1 The internet site omics.org provides a list of more than 400 categories of different omes which gives an
evidence of rising tendency in use of this term, however not necessarily just in the field of molecular biology.
11
2.2 History of genomics
Much in the way that the biochemical studies based on radioactively-labelled
biological molecules paved the way for the discovery of the basic mechanisms involved in the
regulation of metabolic pathways, genomic DNA sequences are today paving the way for the
elucidation of global mechanisms for genetic regulation.
The work, in which radioisotopes were used for the elucidation of metabolic pathways,
resulted in a monograph titled Studies of Biosynthesis in Escherichia coli that guided research
in biochemistry for the next 20 years and helped to establish this bacterium as a model
organism for biological research (Roberts et al., 1955). In the following years, most of the
metabolic pathways required for the biosynthesis of intermediary metabolites were revealed
and new methods were developed to identify and characterise the enzymes involved in these
pathways. These studies defined the biosynthetic pathways for the building blocks of
macromolecules such as proteins and nucleic acids, and enabled the discovery of the
mechanisms for metabolic regulation such as end product inhibition, allostery, and
modulation of enzyme activity by protein modification. A bigger breakthrough in the
biosynthesis of macromolecules presented a discovery of the structure of the DNA helix
(Watson and Crick, 1953), which provided a basis for the new field of molecular biology.
This field gained even more importance with further discoveries of restriction endonucleases
and DNA ligases. These new enzymes, together with a discovery of a method of polymerase
chain reaction by Kary Mullis in 1983, enabled construction of recombinant DNA molecules
composed of DNA sequences from different organisms.
As for genomics itself, it was established by Fred Sanger group in 1970s when they
developed a gene sequencing technique and completed the first genomes. In 1972, Walter
Fiers and his team at the Laboratory of Molecular Biology of the University of Ghent were
the first to determine the sequence of a gene – it was the gene for Bacteriophage MS2 coat
protein (Fiers et al., 1972). In 1976, the team determined the complete nucleotide-sequence of
bacteriophage MS2-RNA (Fiers et al., 1976). In 1977, Frederick Sanger sequenced the first
DNA-based genome in its entirety - the bacteriophage Φ-X174 (5,368 bp) (Sanger et al.,
1977).
The new era of genomics was initiated in 1986 at an international conference in Santa
Fe, New Mexico, sponsored by the US Department of Energy. At this meeting, the leading
scientists from around the world endorsed unanimously the desirability and feasibility of
implementing a human genome program. This meeting resulted in a 1988 study by the
12
National Research Council titled ―Mapping and Sequencing the Human Genome‖ that
recommended that the USA support a human genome program and presented an outline for a
multiphase plan. In 1990, an international scientific research project was launched at the US
National Institutes of Health (NIH) in order to determine the complete human genome – The
Human Genome Project.
In 1991 Craig Venter at the NIH developed a way of finding human genes without
having to sequence the entire human genome. He estimated that only about 3% of the genome
is composed of genes that express mRNA and suggested that the most efficient way to find
genes would be to use the processing machinery of the cell. At any time, only a part of cell‘s
DNA is transcriptionally active. These ―expressed‖ segments of DNA are transcribed to
mRNA and then with the help of the enzyme reverse transcriptase can be transcribed into
complementary DNA (cDNA). These stable cDNA fragments are called expressed sequence
tags (EST). These cDNA sequences are then assembled into longer sequences representing
large parts of many human genes. This is done with computer programs that match
overlapping ends of EST‘s. In 1992 Venter left NIH and later became one of the founders of
Celera Genomics Corporation that in 1999 announced its parallel research project of The
Human Genome project.
Meanwhile, the sequencing proceeded very fast. In 1995, the first free-living organism
was sequenced in the Institute for Genomic Research – the Haemophilus influenzae (1.8 Mb)
(Fleischmann et al, 1995). Then, the complete genomes of many bacteria, viruses and
eukaryote organisms (mostly fungi) were determined. For scientific research, it was an
important milestone when the genomes of the most studied organisms, Escherichia coli and
budding yeast (Saccharomyces cerevisiae) were sequenced (Goffeau et al., 1996). Other
important sequences of model organisms were: Caenorhabditis elegans (The C. elegans
Sequencing Consortium, 1998), Arabidopsis – the first sequenced plant (The Arabidopsis
Genome Initiative, 2000), Drosophila melanogaster (Adams et al., 2000) and the mouse Mus
musculus as the first sequenced mammal (Mouse Genome Sequencing Consortium, 2002).
In 1998 the Human Genome Program announced a plan to complete the human
genome sequence by 2003, the 50th
anniversary of the discovery of the DNA structure. The
goals of this plan were to:
Achieve coverage of at least 90% of the genome in a working draft based on mapped
clones by the end of 2001.
Finish one-third of the human DNA sequence by the end of 2001.
Finish the complete human genome sequence by the end of 2003.
13
Make the sequence totally and freely accessible.
In 2000, both the director of the Human Genome project and Celera Genomics
announced that they completed the draft versions of the human genome two years ahead of
schedule. These drafts were published in special issues of Science and Nature in 2001 and the
sequence is now online at the National Center for Biotechnology Information (NCBI).
Genome sequencing enabled the identification of many genes that had been previously
unknown. Many projects, such as The Human Genome Project, are still focusing on the
genome sequencing, but the knowledge of full genomes has created the possibility for the
other field such as comparative and functional genomics, for example.
2.3 Comparative genomics
Comparative genomics is a young field of genomics that studies the relationship of
genome structure and function across different biological species or strains. It uses the
similarities and differences of genes in different organisms in order to analyse their function
and to gain a better understanding of how species have evolved. A very effective approach to
find the most likely function of an unidentified gene is based on projection of experimentally
established functions of proteins from one species to another on the basis of homology,
determined by sequence similarity. This projection is supported by a set of powerful tools and
public archives (such as GenBank and Swiss-Prot), as well as by a significant body of
literature. Researchers have learned a lot about the function of human genes by examining
their counterparts in simpler model organisms. Many different features can be considered by
genome comparison: sequence similarity, gene location, the length and number of coding
regions within genes (exons), the amount of noncoding DNA in each genome, and highly
conserved regions maintained in organisms as simple as bacteria and as complex as humans.
The comparative approach is an ideal way to solve the missing genes problem. A
metabolic reconstruction attempts to develop a detailed overview of an organism‘s
metabolism from an analysis of its genomic sequence. As the reconstruction technology is
focused on identifying present components, it also gives a notion of components that should
be present but cannot be identified. In this case, there is an attempt to connect functional roles
to genes that have not yet been characterized. Two categories of missing genes can be
distinguished: globally missing (for functions without any representative sequenced genes
from any organism) and locally missing (for functions previously connected to one sequenced
14
form of a gene in one group of species, but expected to exist in an alternative form in another
group of species)2.
A typical missing gene study is split in three phases: revealing missing genes,
identification and ranking of candidate genes and experimental verification. In the first phase
of the missing gene search, there has to be defined a ‗functional context‘, which usually
includes the other enzymes that participate in the same pathway or variants of the pathway.
When the closely related functions have been determined, a table showing which of these
functions are present or absent in model organisms can be built. This table shows, which
organisms have variants of the pathway, which do not, and where the situation is ambiguous
(Table 1). Once the gene inventory has been completed, the missing genes are identified
according to the presence of pathway‗s variants in model organisms – a process of metabolic
reconstruction.
Table 1. Gene inventory showing which of closely related functions is present or absent within a diverse set of
model organisms. The table contains a row for each of the enzymatic functions and a column for each organism
(adapted from Osterman, 2003).
2 The term missing genes problem, as well as its division into these two categories, has been used by Osterman,
2003.
15
In the second phase, various techniques of genome context analysis are used to
produce an initial list of candidate genes for a sought functional role. Let‘s examine the major
techniques closely:
Clustering on the chromosome – this technique uses the fact that on prokaryotic
chromosomes the genes from the same pathway tend to cluster. They are often
organized into operons, defined as a set of adjacent genes all under the transcriptional
control of the same operator. This enables to infer ‗functional coupling‘ between
genes (Overbeek et al., 1999). Using the genome-scanning tools, one can find cases
where multiple genes orthologous to members of the gene inventory occur in close
proximity.
Protein fusion events – this technique uses searching for a pair of genes from one
genome that appear to be fused into a single gene in another genome, which provides
further evidence of potential functional coupling (Enright et al., 1999).
Occurrence profiles – this technique, also called ‗phylogenetic profiling‘, is based on
assumption that two proteins from the same cellular pathway are expected to either
both occur or both not occur in any specific organism (Pellegrini et al., 1999). The
higher version of this technique generates instances of potential functional coupling
for a pair of proteins on the basis of their occurrence profiles.
Shared regulatory sites – This technique is based on identification of so-called
regulons (ensemble of genes which have coordinated expression). Co-regulation of a
pair of genes provides evidence that these genes may be functionally coupled
(McGuire, 2000).
These techniques produce candidate genes that can be further prioritised based on
strength and consistency of evidence. For gene discovery, as well as for additional candidate
ranking, there are methods revealing and analysing putative folds (Pawlowski et al., 2001),
long-range sequence similarities (Bateman et al., 2002) and conserved motifs (Falquet, 2002).
In most cases, the number of highly ranked gene candidates is very limited and they can be
quickly subject to experimental verification by traditional experimental techniques.
Among all of the contemporary techniques of genome context analysis, gene
clustering on the chromosome presents the most critical contribution to missing gene
discovery in spite of fact that this technique is almost exclusively applicable for the
comparative analysis of prokaryotic genomes. Large-scale sequencing and comparative
16
analysis of many prokaryotic genomes provide growing evidence that for an overwhelming
majority of eukaryotic metabolic enzymes, it is possible to find functional counterparts in one
or more prokaryotes (Osterman, 2003).
The comparative approach can also help to find new genes based on the comparison of
closely related organisms coming from the same family. For instance, this method was used to
identify a large amount of new genes in one of the most favourite model organisms –
Caenorhabditis elegans. Stein et al. (2003) used this approach to compare this soil nematode
with Caenorhabditis briggsae - both species diverged from a common ancestor roughly 100
million years ago and yet are morphologically almost indistinguishable. They have the same
chromosome number and genome sizes. After the C. briggsae genome had been sequenced to
a high-quality draft stage and compared to the finished C. elegans sequence, the researchers
predicted approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the
same as in C. elegans. 12,200 of these have clear C. elegans orthologs (two genes are
orthologous if they diverged after a speciation event, when a new species forms from an
existing one), other 6,500 have one or more clearly detectable C. elegans homologs, and
approximately 800 C. briggsae genes showed no equivalent in C. elegans. Based on
comparison to C. briggsae, the authors found strong evidence for 1,300 new C. elegans genes.
Apparently, comparisons of the two genomes can help us to understand the evolutionary
forces that modified nematode genomes. Surprisingly, C. briggsae and C. elegans, despite
their abundant differences at the genomic level, remain morphologically almost
indistinguishable, whereas mouse and human, for example, are more similar genetically, but
show dramatic anatomic and behavioural differences.
The comparative approach can be effectively used also for studying such a complex
organism as the human being. In order to explore the function of different human genes, the
comparative study of the laboratory mouse Mus musculus and Homo sapiens turned out to be
a very valuable approach. In the roughly 75 million years since the divergence of the human
and mouse lineages, evolution has changed their genome sequences and caused them to
diverge by nearly one substitution for every two nucleotides, as well as by deletion and
insertion (Mouse Genome Sequencing Consortium, 2002). The divergence is low enough that
orthologous sequences can still be aligned but high enough so that many functionally
important elements can be recognised by their greater degree of conservation (Fig.5). An
example of some similarities and differences revealed by the comparative analysis of mouse
and human genomes:
17
The mouse genome is about 14% smaller than the human genome, which probably
reflects a higher rate of deletion in the mouse lineage.
At the nucleotide level, approximately 40% of the human genome can be aligned to
the mouse genome. These sequences seem to represent most of the orthologous
sequences that remain in both lineages from the common ancestor.
The mouse and human genomes each seem to contain about 30,000 protein-coding
genes. The proportion of mouse genes with a single identifiable orthologue in the
human genome seems to be approximately 80%. The proportion of mouse genes
without any homologue currently detectable in the human genome (and vice versa)
seems to be less than 1% (Mouse Genome Sequencing Consortium, 2002).
Figure 5. The genetic similarity (or homology) of superficially dissimilar species – mice and humans. The full
complement of human chromosomes can be cut, schematically at least, into about 150 pieces (only about 100 are
large enough to appear in this illustration), then reassembled into a reasonable approximation of the mouse
genome. The colors of the mouse chromosomes and the numbers alongside indicate the human chromosomes
containing homologous segments. This piecewise similarity between the mouse and human genomes means that
insights into mouse genetics are likely to illuminate human genetics as well (adapted from Human Genome
Program, U.S. Department of Energy, To Know Ourselves, 1996) .
Comparing of human and mouse genomes has also allowed the identification of novel
genes. An interesting example is the discovery of a gene called APOAV that encodes a
previously unknown member of the apolipoprotein gene family. After the identification of
the gene APOAIV, which was considered to be the last catalogued from this gene family
(Boguski, 1984), it was quite surprising to find out that there‘s a gene APOAV and that the
18
field of lipid metabolism and cardiovascular disease had not been so much explored as it was
thought. The comparative analysis helped to identify a region of conserved sequence,
approximately 25 kilobases from APOAIV, which proved to contain the APOAV gene
(Pennacchio et al., 2001). As the AIV/CIII/AI gene locus influence plasma lipid levels in
humans, Pennacchio et al. studied lipid levels in knockout and transgenic mice and found that
the apoAV protein has a strong inverse correlation with plasma triglyceride levels, which
presents a risk factor for coronary artery disease. Further studies showed that genetic variation
in the APOAV locus influences plasma triglyceride levels in humans (Talmud et al., 2002).
19
3. Postgenomics
Genome sequencing projects have provided researchers with a large quantity of
molecular information, which presented a solid basis for new techniques that enabled the
exploration of new genes. However, in the postgenomic era, other scientific fields have
emerged, taking these techniques even further by studying how genes are transcribed into
mRNA (transcriptomics), how mRNA is translated into proteins (proteomics), and what
chemical fingerprints cellular processes leave behind (metabolomics). All these methods are
„postgenomic― in the sense that they require complete genome sequences for optimal
performance.
3.1 Use of functional genomics to determinate gene and protein function
Once the genomes of a large number of organisms have been sequenced, scientists
found a new challenge for themselves: to determine the molecular, cellular and physiological
functions for the numerous proteins encoded by eukaryotic and prokaryotic genomes. This
task requires the integrated application of chemical and biological techniques, as several
factors complicate the research:
The realisation that numerous protein products can be derived from a single gene as a
result of multiple mechanisms, including alternative splicing, proteolytic processing
and post-translational modifications (PTM).
The fact that proteins with highly related sequences can perform different functions in
vivo, and conversely, proteins lacking structural similarity show similar activities.
The differences in protein function in vitro (purified material) and in vivo (part of
complex metabolic and signalling networks), where they can be regulated by covalent
modification and protein-protein interactions (Fig.6).
An example of a clear case where gene or protein sequence alone cannot be used to
predict function is provided by the family of (β/α)8 - barrel enzymes. Two proteins of this
family, mandelate racemase and muconate lactonizing enzyme, have been shown to share
conserved structures and active site functional groups, even though they catalyze different
reactions (Neidhart, 1990). Since then, other members of this family have also been shown to
perform a wide range of mechanistically diverse chemical transformations.
20
Figure 6. Comparison of the endogenous (in vivo) and exogenous (in vitro) environments typically experienced
by proteins, highlighting several natural modes for their post-translational regulation (adapted from Saghatelian
and Cravatt, 2005).
An opposite case is an example of sequence-unrelated enzymes that perform similar
reactions. By comparing the 4-CBA dehalogenation pathway operons from different bacteria,
Zhuang et al. identified conserved sequence homologs responsible for the 4-CBA-CoA ligase
and 4-CBA-CoA dehalogenase activities in this pathway (Zhuang, 2003). However, the 4-
hydroxybenzoate-CoA thioesterase activity was found to be catalysed by distinct sets of
sequence-unrelated enzymes (Fig.7). This discovery leads to the speculation that natural
sources of 4-CBA may exist and that the divergence in 4-CBA pathways in different bacteria
did not occur as a recent adaptation to exposure to industrially produced 4-CBA.
Figure 7. The 4-hydroxybenzoate-CoA (HBA) thioesterase enzymes in the 4-chlorobenzoate dehalogenation
pathways of Arthrobacter and Pseudomonas share no sequence homology but perform the same reaction
(adapted from Saghatelian and Cravatt, 2005).
21
The new approaches, that enable to overpass the obstacles and characterise the
activities of proteins, are such techniques as production of ORF clones, targeted gene
disruption (knockout mutants) and RNA interference (RNAi).
Most of these techniques focus more on determining the function of enzymes than of
other proteins. This is caused by simpler determination of enzyme‘s function because it
generally equates to catalytic activity. Enzymes also have well-defined active sites, which
possess conserved catalytic residues that can be exploited by many approaches in
experimental analysis. However, many of the principles established for the postgenomic
examination of enzymes could be applied to other types of proteins as well, such as receptors
and ion channels, that also share common molecular and structural features.
3.1.1 ORF clones
Although more than 400 bacterial genomes have been sequenced, the function of all
the gene products has not yet been assigned for any organism. E. coli K12, the best studied
organism, is estimated to contain about 4400 genes, of which about 2000 have not been
characterized experimentally, and of these 20% remain difficult even to predict their function.
The latest estimates reveal that about 700 to 800 of total ORFs have no attributable function.
An ORF is a portion of an organism's genome, which contains a sequence of bases that could
potentially encode a protein. Therefore, new technologies are being developed to identify
these functions. A basic genetic tool for studying gene function is provided by ORF clones,
since they present a template for PCR amplification and for preparing purified gene products.
In E. coli, the whole set of PCR amplified ORF fragments were cloned into a special
plasmid vector, and these clones represent the only comprehensive collection of E. coli ORFs
that is available to the public (Mori et al., 2004).
Many groups have started to collect ORFs in a genome-wide manner, so-called
‗ORFeome cloning‘ (Nagase et al., 2008). ORFeome collections can be used for experiments
on single ORFs (for example to study protein localisation) or to study entire ORFeomes (for
structural experiments). They can be also used in module-scale experiments, where a
particular pathway or biological function can be characterised. But the greatest value can be
extracted from ORFeome collections in large-scale experiments. Until recently, these studies
were impossible to be practised because of low numbers of cloned ORFs, and because the
ORFs that were available were not in the same vector or were not expression ORFs. With the
22
availability of ORFeome collections, three important approaches could have been further
developed: structural genomics, proteome-wide mapping of PPI (notably using the Y2H
system) and cell-based assays (Temple et al., 2006). In the emerging field of structural
genomics, there have been several large-scale initiatives3 that aim to generate protein
structures based on available ORFs. Some success with this approach has been shown with
ORFeome collection for Caenorhabditis elegans (Luan et al., 2004). High-throughput Y2H
approaches of PPI detection generally consist of testing all available combinations of proteins
as DNA-binding domain and activation domain fusion proteins (Fields and Song, 1989). In
cell-based assays, expression ORFs are transfected4 into mammalian cells. Changes in cell-
shape or protein localisation can be detected and analysed automatically thanks to the
technology allowing ‗high-content screening‘ of cells. This method was used in a live cell
assay to identify proteins that increase proliferation when over-expressed (Harada et al.,
2005).
3.1.2 Knock-out mutants
A tool that could provide basic information and insight into the function of genes is a
systematic mutational analysis. Creation of knock-out mutants is a technique used in the
reverse genetics approach, which is based on discovering the function of a gene by analysing
its phenotypic effects. However, some complications arise (such as incomplete disruption of
the targeted gene or polar effects on the downstream genes) because of the nature of
transposon mutagenesis. This problem can be avoided only by the set of in-frame deletion
mutants. Most bacteria are not transformable with linear DNA because of the presence of
intracellular exonucleases that degrade transformed linear DNA (Mori, 2004). It was shown
that many bacteriophages encode their own homologous recombination systems (Smith,
1988) and that the λ Red (g, b, exo) function promotes a greatly enhanced rate of
recombination over that exhibited by the recBC sbcB or recD mutants when using linear
DNA. So the λ Red system has been used as a basis for development of a convenient
procedure that provides an efficient way to isolate replacement mutants using PCR fragments
encoding an antibiotic resistance gene and having only 40 to 50 nucleotides of flanking
regions (Datsenko and Wanner, 2000) (Fig. 8).
3 For example the Protein Structure Initiative (http://www.structuralgenomics.org/)
4 Transfection is the process of introducing nucleic acids into cells by non-viral methods
(http://www.promega.com/paguide/chap12.htm)
23
Fig. 8. Construction of deletion mutants (Mori, 2004).
Some organisms have developed the ability to protect against gene deletions. This
ability is called genetic robustness and it has been demonstrated in some model oganisms,
such as S. cerevisiae. Its genome contains duplicates that often share similar functions, and
the loss of one paralog may be buffered by others (Hsiao and Vitkup, 2008). In spite of this
robustness, the genome of S. cerevisiae can still be used for functional characterization by the
technique of gene deletion. Winzeler et al. (1999) constructed a library of 6925 S. cerevisiae
strains, each with a precise deletion of one of 2026 ORFs (more than one-third of the ORFs in
the genome). 17 percent were essential for viability in rich medium. The phenotypes of more
than 500 deletion strains were assayed in parallel, 40 percent of them
showed quantitative
growth defects in either rich or minimal medium.
3.1.3 Transcriptome analysis
The transcriptome is the set of all RNA molecules (including mRNA, rRNA, tRNA,
and non-coding RNA) produced in the cell. Because it includes all mRNA transcripts in the
cell, the transcriptome reflects the genes that are being actively expressed at any given time.
For the analysis of global gene expression, many novel techniques have been developed,
including DNA microarray or DNA chip technology (Lockhart et al., 1996). A DNA
microarray is an orderly arrangement of tens to hundreds of thousands of unique DNA
molecules of known sequence, usually on a glass slide. DNA molecules can be individually
synthesised on a rigid silicon plate (DNA chips) or prepared from pre-synthesised DNA
24
(synthetic oligonucleotides or PCR products) that are spotted and immobilised on a slide
glass. DNA microarrays were first used to study E. coli gene regulation (Blattner et al., 1997)
but this technology has rapidly expanded and been applied to study various aspects of
transcriptional regulation (Tao et al., 1999). Microarrays present the accumulation of large
amounts of functional genomic information. The experimental design enables to show
respective internal and external contributions to the physiological state of a biological system
(Fig. 9).
Figure 9. Microarrays in systems analysis. Intrinsic and extrinsic forces affect biological processes such as gene
expression, which can be examined with microarrays (Schena, 1998).
This method also appeared to be helpful by revealing novel biochemical pathways.
When the genomes of the most studied organisms, E.coli and yeast, had been sequenced, it
was thought that the primary metabolic pathways, reactants, intermediates, products and
enzymes in these organisms were elucidated and fully understood. But with the help of
comparative genomics, Loh et al. showed that there is still an undiscovered central
biochemical pathway in E. coli K12 (Loh et al., 2006). The use of microarray technology
showed that the b1012 operon is highly expressed under the control of the transcriptional
activator nitrogen regulatory protein C (NtrC) and probably involved in catabolism of
alternative nitrogen sources (Zimmer, 2000). When Loh et al. tested strains carrying a mini
Tn5 insertion in several of the seven genes in the operon, he found that these mutants could
not grow on uracil and uridine as sole nitrogen source at room temperature, but the control
wild-type strain could grow. This experiment showed two surprising features that have been
overlooked so far: E. coli K12 can grow on uracil as the sole nitrogen source at room
temperature but not at 37.8 C and this phenotype is directly connected to the b1012 operon.
25
This operon has been then renamed to rut (pyrimidine utilization). Loh et al. then used the
databases to compare the seven deduced protein sequences with other sequences in the
database. This approach revealed that these seven genes are present in several proteobacteria,
such as Shigella, Yersinia and Agrobacterium, and that the Rut proteins are homologous to
enzymes characterised in other organisms. It was then possible to clearly deduce a probable
function for the final protein. The chemical analysis then completed the bioinformatics
approach. When E. coli K12 was grown on radioactively labelled substrates, a three-carbon
product was detected. It was obvious that the carbon atoms corresponding to the uracil
positions 4, 5 and 6, were predominantly secreted as 3-hydroxypropionic acid but the precise
enzymatic activities and intermediates in the reaction remained unknown (Loh et al., 2006).
However, the Rut homologous enzymes in other organisms enable the use of comparative
genomics to make speculations about the reaction mechanism that is alternative to the
reductive and oxidative pathway of degradation of pyrimidines (Fig.10).
Figure 10. Pyrimidine catabolic pathways. Known reductive (1) and oxidative (2) pathways for catabolism of
pyrimidine rings (A, upper and lower, respectively) and the novel pathway discovered by Loh et al. (2006),
adapted from Loh et al., 2006.
26
3.1.4 Proteome analysis
In prokaryotic organisms, in which related genes are often linked within an operon, the
use of the comparative genomics method for the assignment of protein function is a very
powerful tool. However, for the global characterisation of proteins encoded by an eukaryotic
genome, some additional technologies are needed. These are described by the field of
proteomics (Patterson et al., 2003). Proteomic methods quantify the abundance levels of
proteins. Nonetheless, many proteins (especially enzymes) are regulated by PTM mechanisms
in vivo (Kobe and Kemp, 1999), so it can happen that their translation levels do not correlate
with their activities. To deal with this problem, a chemical proteomic strategy has been
developed, referred to as activity-based protein profiling (ABPP) (Liu, 1999). This method
uses active site–directed probes to read out the functional state of many enzymes directly in
whole proteomes (Fig.11). ABPP probes selectively label active enzymes but not their
inactive (as zymogen or inhibitor-bound) forms (Kidd et al., 2001). This facilitates the
characterisation of changes in enzyme activity that can occur in the absence of alterations in
protein or RNA levels (Jessani and Cravatt, 2004). ABPP probes label enzymes on the basis
of shared catalytic properties rather than mere translation levels and so they provide data sets
enriched in low-abundance proteins that can be read out in various formats, including gels,
microarrays, liquid chromatography and capillary electrophoresis (Saghatelian and Cravatt,
2005).
Figure 11. General strategy for ABPP. Proteomes are treated with chemical probes that label active enzymes but
not enzymes inhibited by intra- or intermolecular regulators or those lacking complementary binding sites. RG,
reactive group; BG, binding group; TAG, biotin and/or fluorophore. Probe-labeled proteomes can be analyzed
through several platforms, including gel or capillary electrophoresis and LC-MS (adapted from Saghatelian and
Cravatt, 2005).
27
3.1.5 Metabolome analysis
The metabolome can be described as the total set of metabolites in a cell (Tweeddale
et al., 1998). The field of metabolomics has emerged with the objective to advance new
technologies that can profile the molecular composition of cells and tissues in order to
characterise the endogenous substrates of enzymes. The challenges that arise in this field are
caused by the fact that metabolites are not directly linked to the genetic code, but are instead
the products of complex enzymatic networks. Also, they are not linear polymers composed of
a defined set of monomers, but rather a diverse collection of structures with different chemical
and physical properties. These features have inspired the development of several approaches
for metabolome analysis, such as flux analysis using isotopic tracer, pathway reconstruction
and in particular, metabolite profiling. The method of metabolite profiling (Fig.12) can be
either targeted (to quantify known metabolites in complex biological samples) or untargeted
(to report on the full spectrum of substrates used by a given enzyme in vivo, potentially
including the discovery of new metabolites) (Saghatelian and Cravatt, 2005).
The approach of global metabolite profiling has been used for functional genomics
studies in plants (Fiehn et al., 2000) and yeast (Raamsdonk et al., 2001). In E. coli,
metabolites were labeled with C-14 glucose and identified by 2D thin-layer chromatography
after extraction by cold methanol (Maharjan and Ferenci, 2003). Recently, a powerful
analytical method using capillary electrophoresis-electrospray ionization mass spectrometry
(CE-ESI-MS) has been developed. This dramatically increases the number of metabolites that
can be measured simultaneously (Soga et al., 2003).
28
Figure 12. Comparison of targeted and untargeted LC-MS methods for comparative metabolite analysis. (a)
General scheme for targeted LC-MS analysis, in which metabolites are detected by SIM (shown for a metabolite
with a mass of 347.5) and their levels are quantified by comparing mass signals to those of isotopically distinct
internal standards. (b) General scheme for DMP, an untargeted global LC-MS approach, in which metabolites
are detected in the broad mass scanning mode (for example, 200–1200 mass units) and their levels quantified by
measurement of direct mass ion intensities (that is, without the inclusion of internal standards). Enzyme-
regulated metabolites are identified by comparison of mass ion intensity ratios between wild-type and knockout
samples (adapted from Saghatelian and Cravatt, 2005).
29
3.2 Metagenomics - approach for studying complex communities in natural environment
Historically, the study of organisms (notably microbes) has focused on single species
in pure culture. Therefore, understanding of complex communities of these organisms has
been perceived only through understanding of their individual members. However, their
functions are often conducted within complex communities - intricate, balanced, and
integrated entities that adapt swiftly and flexibly to environmental change. Metagenomics,
still a very new science, has emerged to overcome these obstacles of unculturability and
genomic diversity of most microbes, the understanding of wich is essential for advancement
in clinical and environmental microbiology. In Greek, meta means ‗transcendent‘. This shows
that metagenomics focuses on transcending the individual organism to pinpoint the genes in
the community and the influence of each other‘s activities in serving collective functions.
In a metagenomics study, DNA is extracted directly from the natural environment. The
mixed sample can be then analysed directly or cloned into a form maintainable in laboratory
bacteria. This enables the creation of a library that contains the genomes of all the microbes
found in the environment (Fig. 13). This library is not organised into neat volumes, each
containing the genome of one community member. Instead, it is composed of millions of
clones, each holding a random fragment of DNA. These DNA fragments are translated into
proteins by bacteria growing in the laboratory. Clones producing ―foreign‖ proteins are then
tested for various capabilities, such as vitamin production or antibiotic resistance. This
enables researchers to discover new antibiotics and resistance mechanisms without knowing
anything about the underlying gene sequence, the structure of the desired protein, or the
microbe of origin (National Academy of Sciences, 2007).
An example of an application of metagenomic approach is a study of human
microbiome. The term ‗microbiome‗ refers to the fact that the human body provides an
environment for our 100 trillion microbial partners. Considering that we contain perhaps 10
times more microbial than human cells and at least 100 times more microbial than human
genes, it is obvious that human metabolome is actually a mixture of human and microbial
parts (National Academy of Sciences, 2007). The first truly metagenomic study of the human
microbiome consisted of sequencing the microbial communities taken from the colons of two
healthy adults. Further analysis of 78 million base pairs of unique DNA sequence showed
that, in comparison with previously sequenced human and microbial genomes, the gut
metagenome contains genes responsible for the breakdown of otherwise indigestible plant-
derived polysaccharides that form a substantial part of modern diets, as well as for the
30
detoxification of consumed xenobiotics and the synthesis of essential amino acids and
vitamins (Gill et al., 2006).
Figure 13. Construction of a metagenomic library: 1 - prokaryotic cells are extracted from the environmental
sample (e.g. soil, sediment, seawater), 2 - the total DNA content (the metagenome) is extracted and purified, 3 -
DNA fragments are cloned into a cloning vector and transformed into a host cell, 4 - the result is thousands of
cells, each carrying a DNA fragment from the metagenome (adapted from
http://wiki.biomine.skelleftea.se/wiki/index.php/Metagenomics).
The metagenomic approach gives an idea of the future, in which it may be possible to
optimise the nutritional requirements of overfed or underfed people on the basis of their gut
microbial ecology, or to forecast the risk of particular types of cancer for individuals or
populations. Researchers in this field are led by programs, such as Human Gut Microbiome
Initiative (HGMI) or Human Microbiome Project (HMP), that are trying to develop
a reference set of microbial genome sequences and preliminary characterisation of human
microbiome (National Institutes of Health, USA, 2008). Greater knowledge of the microbial
communities in other parts of human body, such as the oral cavity, skin or female
reproductive tract will also improve our ability to prevent, diagnose, and treat diseases at
those sites.
31
Summary
It has been only a few decades since the field of genomics appeared, yet it has become
a very popular scientific study among molecular biologists as well as biochemists. In these
few years, the field of genomics has been subjected to turbulent evolution and has given rise
also to many other fields, which are presented in this work.
At the beginning of the genomic era, the main goal of scientists was to sequence the
genomes of numerous organisms. But once the sequencing has been done, further exploration
of the sequenced genomes has become a challenge. The field of comparative genomics has
evolved in order to identify the missing genes and to make the complete image of the
genomes. The field of transcriptomics has started to examine the levels of gene expression in
a given cell population. And the field of proteomics has focused on the structures and
functions of the protein complement of a cell.
However, the biggest challenge in biology has remained: to determine gene function.
To achieve this goal today, scientists can take advantage of the vast data produced by
genomic and proteomic projects. In addition, the complete set of ORF clones as well as the
libraries of knock-out mutants showed to be very helpful. The resulting organisms (either with
an extra gene or with a deleted/disrupted gene) can be screened for phenotypes that provide
clues to the function of the studied gene.
Metabolomics presents the next step of the systematic study as it aims to analyse the
complete set of metabolites. Thus, while transcriptomic and proteomic analyses do not tell the
whole story of what might be happening in a cell, metabolic profiling can give an
instantaneous snapshot of the physiology of the cell.
The metagenomic approach could be placed on the top of this pyramid as it goals to
study all the aforementioned points within complex communities and in their natural
environment.
In biochemical research, we have witnessed the progressive transition from the
application of pre-genomic methods, such as radioisotope labeling, to genome sequencing and
advanced methods used for further genome, proteome and metabolome analysis. Using these
methods, we can get a better understanding of the physiology in any organism. In addition,
the obtained complexe knowledge of the cell physiology will improve our ability to prevent,
diagnose, and treat various diseases.
32
Zhrnutie
Je to len pár desaťročí, čo vznikol nový študijný odbor genomika, avšak za ten čas sa
tento odbor stihol stať veľmi obľúbeným medzi molekulárnymi biológmi ako aj medzi
biochemikmi. Za tých pár rokov prešiel biochemický výskum turbulentnou evolúciou
a zapríčinil vznik mnohých ďalších odborov, ktoré sú prezentované v tejto práci.
Na počiatku genomickej éry bolo hlavným cieľom vedcov sekvenovať genómy
mnohých organizmov. Ale potom ako bolo sekvenovanie spravené, novou výzvou sa stal
hlbší výskum týchto genómov. Vznikol nový odbor, komparatívna genomika, za účelom
identifikovania chýbajúcich génov a vytvorenia kompletného obrazu genómu. Odbor
transkriptomiky začal skúmať úrovne génovej expresie v daných bunkových populáciách a
odbor proteomiky sa zase sústredil na štruktúry a funkcie proteínových súčastí bunky.
Avšak najväčšia výzva v molekulárnej biológii ostala: určiť funkciu génu.
V súčastnosti na dosiahnutie tohto cieľa vedci môžu využívať rozsiahle databázy, ktoré sú
výsledkom mnohých genomických a proteomických projektov. Taktiež ako veľmi nápomocné
sa ukázali kompletné sety ORF klonov ako aj genomické knižnice knock-out mutantov.
Výsledné organizmy (buď s génom navyše alebo s odstráneným génom) môžu byť testované
na fenotypy, ktoré nám poskytujú kľúč k funkcii študovaného génu.
Metabolomika predstavuje ďalší krok v tomto systematickom výskume, nakoľko sa
zameriava na analýzu kompletného setu metabolitov. Vďaka tomu, zatiaľ čo transkriptomické
a proteomické analýzy nám nepovedia všetko, čo sa deje v bunke, tak metabolické
profilovanie nám poskytuje okamžitú ukážku fyziológie bunky.
Na vrchol tejto genomickej pyramídy by mohol byť umiestnený metagenomický
prístup, nakoľko jeho cieľom je výskum všetkých vyššie uvedených bodov v rámci
komplexných komunít v ich prirodzenom prostredí.
Boli sme svedkom postupného vývoja biochemického výskumu, od využívania pre-
genomických metód, ako je napr. rádioizotopické značenie, až po sekvenovanie genómu
a používanie pokročilejších metód pre hlbšiu analýzu genómu, proteómu a metabolómu.
S použitím týchto metód budeme môcť získať lepšie porozumenie o fyziológii v každom
študovanom organizme. Vďaka tomu, tieto získané komplexné vedomosti o fyziológii bunky
tiež zlepšia naše schopnosti predchádzať rôznym chorobám, ako aj rozšíria možnosti ich
včasnej diagnostiky a liečenia.
33
Used abbreviations
ABPP - activity-based protein profiling
CBA - chlorobenzoate
DMP - discovery metabolite profiling
EST – expressed sequence tags
LC-MS - liquid chromatography – mass spectrometry
NIH - National Institutes of Health
ORF – open reading frames
PCR – polymerase chain reaction
PPI – protein-protein interactions
PTM – post-translational modifications
SIM - selected ion monitoring
Y2H – yeast-2-hybrid
34
References
Ackermann, D. (1931). Über den biologischen Abbau des Arginins zu Citrullin. Biochem.
Z., 203, 66–69
Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G.,
Scherer, S. E., Li, P. W., Hoskins, R. A., & Galle, R. F. (2000). The genome sequence of
Drosophila melanogaster. Science, 287(5461), 2185-2195.
Arabidopsis Genome Initiative. (2000). Analysis of the genome sequence of the flowering
plant Arabidopsis thaliana. Nature, 408(6814), 796-815.
Bateman, A., & Haft, D. H. (2002). HMM-based databases in InterPro. Brief Bioinform, 3,
236-245.
Blattner, F. R., Plunkett, G., 3rd, Bloch, C. A., Perna, N. T., Burland, V., Riley, M.,
Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F. et al. (1997). The complete
genome sequence of Escherichia coli K-12. Science, 277, 1453-1474.
Boguski, M. S., Elshourbagy, N., Taylor, J. M., & Gordon, J. I. (1984). Rat apolipoprotein
A-IV contains 13 tandem repetitions of a 22-amino acid segment with amphipathic helical
potential. Proc Natl Acad Sci USA, 81(16), 5021-5025.
C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans:
a platform for investigating biology. Science, 282(5396), 2012-2018.
Calvin, M. (1962). The Path of Carbon in Photosynthesis. Science, 135(3507), 879-889
Datsenko, K. A. & Wanner, B. L. (2000). One-step inactivation of chromosomal genes in
Escherichia coli K-12 using PCR products. Proc. Natl. Acad. Sci. USA, 97, 6640-6645.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., & Ouzounis, C. A. (1999). Protein interaction
maps for complete genomes based on gene fusion events. Nature, 402, 86-90.
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J., Hofmann, K., & Bairoch, A.
(2002). The PROSITE database, its status in 2002. Nucleic Acids Res, 30, 235-238.
Fiehn, O., Kopka, J., Dormann, P., Altmann, T., Trethewey, R. N. & Willmitzer, L. (2000).
Metabolite profiling for plant functional genomics. Nat. Biotechnol., 18, 1157-1161.
Fields, S., & Song, O. (1989). A novel genetic system to detect protein-protein interactions.
Nature, 340(6230), 245-246.
Fiers, W., Contreras, R., Duerinck, F., Haegeman, G., Iserentant, D., Merregaert, J., Min
Jou, W., Molemans, F., Raeymaekers, A., Van den Berghe, A., Volckaert, G., & Ysebaert,
M. (1976). Complete nucleotide sequence of bacteriophage MS2 RNA: primary and
secondary structure of the replicase gene. Nature, 260 (5551), 500–507
35
Fleischmann, R., Adams, M., White, O., Clayton, R., Kirkness, E., Kerlavage, A., Bult, C.,
Tomb, J. F., Dougherty, B. A., Merrick, J. M., et al. (1995). Whole-genome random
sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496-512.
Gill, S. R., Pop, M., Deboy, R. T., Eckburg, P. B., Turnbaugh, P. J., Samuel, B. S., Gordon,
J. I., Relman, D. A., Fraser-Liggett, C. M., & Nelson, K. E. (2006). Metagenomic analysis
of the human distal gut microbiome. Science, 312 (5778), 1355-1369.
Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F.,
Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y.,
Philippsen, P., Tettelin, H., & Oliver, S. G. (1996). Life with 6000 genes. Science,
274(5287), 563-567.
Harada, J.N., Bower, K.E., Orth, A.P., Callaway, S., Nelson, C.G., Laris, C., Hogenesch,
J.B., Vogt, P.K., & Chanda, S.K. (2005). Identification of novel mammalian growth
regulatory factors by genome-scale quantitative image analysis. Genome Res, 15, 1136-
1144.
Hsiao, T-L., & Vitkup, D. (2008). Role of Duplicate Genes in Robustness against
Deleterious Human Mutations. PLoS Genet, 4(3), e1000014
Jessani, N. & Cravatt, B.F. (2004). The development and application of methods for activity
based protein profiling. Curr. Opin. Chem., Biol. 8, 54–59.
Kidd, D., Liu, Y. & Cravatt, B.F. (2001). Profiling serine hydrolase activities in complex
proteomes. Biochemistry, 40, 4005–4015.
Kobe, B. & Kemp, B.E. (1999). Active site-directed protein regulation. Nature, 402, 373–
376.
Kornberg, H. (2000). Krebs and his trinity of cycles. Nat Rev Mol Cell Biol., 1(3), 225-228.
Kornberg, H. L., & Krebs, H. A. (1957). Synthesis of cell constituents from C2-units by a
modified tricarboxylic acid cycle. Nature, 179, 988–991
Kossel, A., & Dakin, H. D. (1904). Über die Arginase. Z. Physiol. Chem., 41, 321–331
Krebs, H. A., & Henseleit, K. (1932). Untersuchungen über die Harnstoffbildung im
Tierkorper. Z. Physiol. Chem., 210, 33–66
Krebs, H. A., & Johnson, W. A. (1937a). Metabolism of ketonic acids in animal tissues.
Biochem. J., 31, 645–660
Krebs, H. A., & Johnson, W. A. (1937b). The role of citric acid in intermediate metabolism
in animal tissues. Enzymologia 4, 148–156
Liu, Y., Patricelli, M.P. & Cravatt, B.F. (1999). Activity-based protein profiling: the serine
hydrolases. Proc. Natl. Acad. Sci. USA, 96, 14694–14699.
36
Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S.,
Mittmann, M., Wang, C., Kobayashi, M., Horton, H. & Brown, E. L. (1996). Expression
monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol., 14,
1675-1680.
Loh, K. D., Gyaneshwar, P., Markenscoff Papadimitriou, E., Fong, R., Kim, K., Parales, R.,
Zhou, Z., Inwood, W., & Kustu, S. (2006). A previously undescribed pathway for
pyrimidine catabolism. PNAS, 103(13), 5114-5119.
Luan, C.H., Qiu, S., Finley, J.B., Carson, M., Gray, R.J., Huang, W., Johnson, D., Tsao, J.,
Reboul, J., Vaglio, P., et al. (2004). High-throughput expression of C. elegans proteins.
Genome Res., 14(10B), 2102-2110.
Maharjan, R. P. & Ferenci, T. (2003). Global metabolite analysis: the influence of
extraction methodology on metabolome profiles of Escherichia coli. Anal. Biochem., 313,
145-154.
McGuire, A. M., Hughes, J. D., & Church, G. M. (2000). Conservation of DNA regulatory
motifs and discovery of new motifs in microbial genomes. Genome Res, 10, 744-757.
Min Jou, W., Haegeman, G., Ysebaert, M., & Fiers, W. (1972). Nucleotide sequence of the
gene coding for the bacteriophage MS2 coat protein. Nature, 237(5350), 82–88.
Mori, H. (2004). From the sequence to cell modeling: comprehensive functional genomics
in Escherichia coli. J Biochem Mol Biol., 37(1), 83-92.
Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative
analysis of the mouse genome. Nature, 420(6915), 520-562.
Nagase, T., Yamakawa, H., Tadokoro, S., Nakajima, D., Inoue, S., Yamaguchi, K., Itokawa,
Y., Kikuno, R.F., Koga, H., & Ohara, O. (2008). Exploration of human ORFeome: high-
throughput preparation of ORF clones and efficient characterization of their protein
products. DNA Res., 15(3), 137-49.
Neidhart, D.J., Kenyon, G.L., Gerlt, J.A. & Petsko, G.A. (1990). Mandelate racemase and
muconate lactonizing enzyme are mechanistically distinct and structurally homologous.
Nature, 347, 692–694.
Osterman, A., & Overbeek, R. (2003). Missing genes in metabolic pathways: a comparative
genomics approach. Curr. Opin. Chem, Biol. 7, 238–251.
Overbeek, R., Fonstein, M., D‘Souza, M., Pusch, G. D., & Maltsev, N. (1999). The use of
gene clusters to infer functional coupling. Proc Natl Acad Sci USA, 96, 2896-2901.
Patterson, S.D. & Aebersold, R. (2003). Proteomics: the first decade and beyond. Nat.
Genet., 33, 311–323.
Pawlowski, K., Rychlewski, L., Zhang, B., & Godzik, A. (2001). Fold predictions for
bacterial genomes. J Struct Biol, 134, 219-231.
37
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., & Yeates, T. O. (1999).
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.
Proc Natl Acad Sci USA, 96, 4285-4288.
Pennacchio, L. A., Olivier, M., Hubacek, J. A., Cohen, J. C., Cox, D. R., Fruchart, J. C.,
Krauss, R. M., & Rubin, E. M. (2001). An apolipoprotein influencing triglycerides in
humans and mice revealed by comparative sequencing. Science, 294(5540), 169-173.
Raamsdonk, L. M., Teusink, B., Broadhurst, D., Zhang, N., Hayes, A., Walsh, M. C.,
Berden, J. A., Brindle, K. M., Kell, D. B., Rowland, J. J., et al. (2001). A functional
genomics strategy that uses metabolome data to reveal the phenotype of silent mutations.
Nat. Biotechnol., 19, 45-50.
Roberts, R. B., Abelson, P. H., Cowie, D. B., Bolton, E. B., & Britten, J. R. (1955). Studies
of Biosynthesis in Escherichia coli. Carnegie Institution of Washington, Publication 607
Saghatelian, A., & Cravatt, B. F. (2005). Assignment of protein function in the postgenomic
era. Nat Chem Biol., 1(3), 130-142.
Sanger, F., Air, G. M., Barrell, B. G., Brown, N. L., Coulson, A. R., Fiddes, C. A.,
Hutchison, C. A., Slocombe, P. M., & Smith, M. (1977). Nucleotide sequence of
bacteriophage phi X174 DNA. Nature, 265(5596), 687–695
Scriver, C.R. (2001). Garrod's foresight; our hindsight. J. Inherit. Metab. Dis., 24, 93-116
Schena, M., Heller, R.A., Theriault, T.P., Konrad, K., Lachenmeier, E., & Davis, R.W.
(1998). Microarrays: biotechnology's discovery platform for functional genomics. Trends
Biotechnol., 16(7), 301-306.
Smith, G. R. (1988). Homologous recombination in procaryotes. Microbiol. Rev., 52, 1-28.
Soga, T., Ohashi, Y., Ueno, Y., Naraoka, H., Tomita, M. & Nishioka, T. (2003).
Quantitative metabolome analysis using capillary electrophoresis mass spectrometry. J.
Proteome Res., 2, 488-494.
Stein, L. D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M. R., Chen, N., Chinwalla, A.,
Clarke, L., Clee, C., Coghlan, A., et al. (2003). The genome sequence of Caenorhabditis
briggsae: a platform for comparative genomics. PLoS Biol, 1(2), e45.
Talmud, P. J., Hawe, E., Martin, S., Olivier, M., Miller, G. J., Rubin, E. M., Pennacchio,
L.A., & Humphries, S. E. (2002). Relative contribution of variation within the
APOC3/A4/A5 gene cluster in determining plasma triglycerides. Hum Mol Genet., 11(24),
3039-3046.
Tao, H., Bausch, C., Richmond, C., Blattner, F. R. & Conway, T. (1999). Functional
genomics: expression analysis of Escherichia coli growing on minimal and rich media. J.
Bacteriol., 181, 6425-6440.
38
Temple, G., Lamesch, P., Milstein, S., Hill, D.E., Wagner, L., Moore, T., & Vidal, M.
(2006). From genome to proteome: developing expression clone resources for the human
genome, Hum. Mol. Genet.,15, R31–R43.
The National Academy of Sciences. (2007). The New Science of Metagenomics: Revealing
the Secrets of Our Microbial Planet. The National Academies Press, ISBN: 0-309-10677-X
Tweeddale, H., Notley-McRobb, L. & Ferenci, T. (1998). Effect of slow growth on
metabolism of Escherichia coli, as revealed by global metabolite pool (―metabolome‖)
analysis. J. Bacteriol., 180, 5109-5116.
Wada, M. (1930). Über Citrullin, eine neue Aminosaure im presssaft der Wassermelone,
Citrullis vulgaris schrad. Biochem. Z., 224, 420–429
Watson, J. D., & Crick, F. H. C. (1953). A structure for deoxyribose nucleic acid. Nature,
171-173.
Winzeler, E.A., Shoemaker, D.D., Astromoff, A., Liang, H., Anderson, K., Andre, B.,
Bangham, R., Benito, R., Boeke, J.D., & Bussey, H. et al., (1999). Functional
characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science,
285(5429), 901-906.
Zhuang, Z., Gartemann, K.H., Eichenlaub, R. & Dunaway-Mariano, D. (2003).
Characterization of the 4-hydroxybenzoyl-coenzyme A thioesterase from Arthrobacter sp.
strain SU. Appl. Environ. Microbiol., 69, 2707–2711
Zimmer, D. P., Soupene, E., Lee, H. L., Wendisch, V. F., Khodursky, A. B., Peter, B. J.,
Bender, R. A. & Kustu, S. (2000). Nitrogen regulatory protein C-controlled genes of
Escherichia coli: scavenging as a defense against nitrogen limitation. Proc. Natl. Acad. Sci.
USA, 97, 14674–14679.
Internet sources:
Human Genome Program, U.S. Department of Energy, To Know Ourselves. [online] 1996.
[cit.18.12.2009]. Available from
<http://www.ornl.gov/sci/techresources/Human_Genome/publicat/tko/>
Metagenomic library. [online] [cit.10.2.2010]. Available from
<http://wiki.biomine.skelleftea.se/wiki/index.php/Image:Metagenomic_library.jpg>
National Institutes of Health, USA. [online] Dec. 2008. [cit. 28.2.2010]. Available from
<http://nihroadmap.nih.gov/hmp/initiatives.asp>
Omes and omics in Biotechnology and Bioscience. [online] [cit.29.1.2010]. Available from
< http://omics.org>
Protein Structure Initiative. [online] [cit.12.2.2010]. Available from
<http://www.structuralgenomics.org/>