preliminary projects that establish and
automate needed technologies. The pilots
aim at (1) moving to global strategies,
(2) exploring basic properties and
(3) accelerating drug discovery. Underlying
the discussion is a commitment to fulfilling
the original promise of the genome project,
obtaining a deep understanding of
biological systems that will benefit society
through improved health care.
Structure in the driver’s seat
Structural genomics, largely driven by the
National Institute of General Medical
Sciences’ Protein Structure Initiative,
seeks to fulfill a central biophysical
assertion: while form follows function,
a given structure can carry out only a
specific function, and so studying structure
will give us insight into mechanism
(biochemical function) and ultimately into
role (biological function). Computational
tools are very effective in predicting the
structures of proteins with high sequence
similarity to those of known structure but
there has been little success in predicting
the structure of a novel sequence. A high-
throughput structure determination
pipeline includes target selection, over-
expression, purification, crystal growth,
structure determination and refinement.
Structural genomics aims at automating
each step while driving down costs.
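As a toy illustration of the homology-based prediction described above, the sketch below (pure Python; the function names and the 30% identity cutoff are illustrative assumptions, not any group's actual pipeline) selects a modeling template by percent identity over precomputed alignments, and returns nothing when no solved structure is close enough — the 'novel sequence' case where prediction still fails.

```python
# Toy sketch, not a real pipeline: pick a homology-modeling template
# by percent identity over precomputed pairwise alignments (gaps = '-').

def percent_identity(aligned_query, aligned_template):
    """Identity over columns where neither sequence has a gap."""
    pairs = [(q, t) for q, t in zip(aligned_query, aligned_template)
             if q != '-' and t != '-']
    if not pairs:
        return 0.0
    return 100.0 * sum(q == t for q, t in pairs) / len(pairs)

def pick_template(alignments, threshold=30.0):
    """`alignments` maps a PDB id to a (query, template) alignment pair.
    Returns the closest template above the (illustrative) threshold,
    or None when no known structure is similar enough."""
    best_id, best_pid = None, threshold
    for pdb_id, (q_aln, t_aln) in alignments.items():
        pid = percent_identity(q_aln, t_aln)
        if pid >= best_pid:
            best_id, best_pid = pdb_id, pid
    return best_id
```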
Overviews of efforts to accelerate the
pipeline in industry, universities and
US National Laboratories were presented.
The Joint Center for Structural Genomics
(University of California at San Diego, The
Scripps Research Institute and Stanford
University, CA, USA) uses a robotic facility,
developed by the Genomics Institute of the
Novartis Research Foundation and Syrrx
(San Diego, CA, USA), capable of 100 000
crystallizations per day in 100-nanoliter
droplets. Automated expression and
purification systems are also being
optimized. Escherichia coli strains are still
the only practical vehicle for large-scale
expression, which is the bottleneck.
Alex Raeber (Cytos Proteome Therapeutics,
Konstanz, Germany) presented an
alternative – alphavirus expression in
mammalian cells to create a library of
potential therapeutic targets. This system
provides safe handling, a broad host range,
the capacity to represent the proteome
faithfully (in contrast to bacterial
expression), rapid gene recovery and the
necessary production level for screening.
Functional genomics with drug discovery
in mind
Moving further into drug discovery
approaches, mass spectrometry in a
proteomics context has been used to identify
potential drug targets, characterize protein
complexes, analyze the mechanism of drug
action, and monitor drug efficacy
using biomarkers. Mass spectrometry
approaches require managing the wide
distribution of abundance of individual
protein species and the concomitant large
dynamic range within a single cell type.
Stanley Hefta (Bristol-Myers Squibb,
Plainsboro, NJ, USA) described how the
company is looking at osteoporosis through
analysis of signal transduction in osteoclasts
and has already validated targets
experimentally to the point of involving
animal models. Robert Hollingsworth
(GlaxoSmithKline, NC, USA) described
how high throughput yeast two-hybrid
screening provides another method for
drug target discovery. Nicholas Dean (Isis
Pharmaceuticals, Carlsbad, CA, USA)
discussed how antisense probes, coupled
with microarray data, can be used to probe
cells directly to identify genes. This antisense
technology should allow an efficient
transition to human studies and provide
predictable pharmacokinetics and toxicology.
Isis has characterized the pharmacokinetics
of its oligonucleotide leads, which support
the promise of this approach in molecular
medicine. John Kozarich (ActivX, La Jolla,
CA, USA) explained how designing novel
chemicals to interrogate the activities
of proteins complements analytical
approaches and James LeBlanc (Ciphergen
Biosystems, Fremont, CA, USA) described
how scanning fluids directly for biomarkers
via protein arrays and laser ionization can
achieve femtomolar sensitivity.
The European Bioinformatics
Institute’s proteomics project (described by
Claire O’Donovan, EBI, Cambridge, UK)
uses automated annotation to ascertain
functional information. This information is
then compared with the ~30 000 protein-
coding genes already identified and, based
on a five- to tenfold increase from post-
translational modifications and a 30 to 50%
increase from alternative splicing, suggests
that more than one million proteins are
encoded by the human genome.
Prototype proteome and conclusion
What the world needs now are models for
proteomics: Soumitra Ghosh (MitoKor,
San Diego, CA, USA) urged us to consider a
powerful but simple system, the mitochondrion,
two to three orders of magnitude simpler
than the complete eukaryotic cell. Whereas the mitochondrial
genome encodes only 13 proteins, the
details of the experimental observations
are daunting and provide a glimpse of the
challenges of moving to nuclear, let alone
cellular, levels for ‘complete’ analyses.
Characterizing the mitochondrial proteome
fully could yield a plethora of novel drugs
because many oncogenic processes involve
mitochondrial membrane-bound proteins.
In sum, the complexities of proteomics will
entertain us for decades to come.
John C. Wooley
Dept of Pharmacology and Chemistry–Biochemistry, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0043, USA. e-mail: [email protected]
Current trends in bioinformatics
Eric Jain
The fourth annual ‘Integrated
bioinformatics’ conference was organized
by the Cambridge Healthtech Institute (CHI)
and held in Zürich 16–18 January 2002.
Published online: 31 May 2002
More than two thirds of the presentations
at this conference were related to
microarrays, mirroring a current trend in
bioinformatics. One of the purposes of
microarray experiments is to find genes
that are either up- or downregulated
under specific circumstances. The usual
method for detecting such variations
within microarray data is to look for genes
that have expression levels more than a
certain number of standard deviations
from the mean or show a several-fold
difference between experiment and
control measurements. It was pointed out
that this strategy works well when changes
are large enough, but it can be problematic:
many small changes, each falling below the
cutoff, can add up to a significant effect.
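As a minimal sketch of the cutoff strategies just described (the log-ratio formulation and the default thresholds are illustrative assumptions, not any speaker's exact recipe):

```python
import numpy as np

def flag_regulated_genes(expt, ctrl, k=2.0, min_fold=2.0):
    """Flag genes as up-/downregulated using the two common cutoffs
    described in the text: a z-score on the log-ratio, or a raw
    fold change. `expt` and `ctrl` are 1-D arrays of positive
    intensities, one entry per gene; thresholds are illustrative."""
    expt, ctrl = np.asarray(expt, float), np.asarray(ctrl, float)
    log_ratio = np.log2(expt / ctrl)
    z = (log_ratio - log_ratio.mean()) / log_ratio.std(ddof=1)
    fold = np.maximum(expt / ctrl, ctrl / expt)
    # A gene passes if it is extreme by either criterion; note that
    # this per-gene view misses the coordinated small changes the
    # speakers warned about.
    return (np.abs(z) > k) | (fold > min_fold)
```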
One of the challenges is to separate
relevant signals from the background
noise without obtaining too many false
positives. Besides the fact that biological
systems are inherently noisy, further
errors are introduced by various technical
factors. After obtaining a list of up- and
downregulated genes the next step
usually involves trying to extract some
biological meaning from this data, for
example by looking for similar genes in
public databases to determine properties
such as chromosomal location and
functional categories. A tool that
automates this procedure was presented
by Sorin Draghici (Wayne State
University, Detroit, MI, USA).
Microarray data processing involves
several steps ranging from image
acquisition, image processing, filtering
and normalization of raw data to tasks
such as data analysis and visualization.
Many techniques for normalization are in use,
the most common being multiplication
by a normalization factor, intensity
averaging, ratio averaging, LOWESS
normalization and the use of control
genes. Usually, better results can be
achieved by working with subarrays
rather than full arrays. Also, applying any
kind of normalization is better than none
at all, according to Jason Gonçalves
(Iobion, Toronto, Canada).
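A minimal sketch of the simplest of these techniques, a per-subarray scaling factor — here taken as the block median, an assumption; LOWESS or control genes would replace the per-block statistic:

```python
import numpy as np

def normalize_by_subarray(intensities, block_shape):
    """Divide each subarray (pin-group block) by its own median so
    that spatial biases are removed block by block -- the 'subarrays
    rather than full arrays' point made at the meeting.
    `intensities` is a 2-D array whose dimensions are exact
    multiples of `block_shape`."""
    out = intensities.astype(float).copy()
    br, bc = block_shape
    for i in range(0, out.shape[0], br):
        for j in range(0, out.shape[1], bc):
            block = out[i:i + br, j:j + bc]
            m = np.median(block)
            if m:                 # skip empty/zero blocks
                block /= m        # in-place scale factor per subarray
    return out
```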
Microarray databases should be able to
store original images from scanners, raw
quantification data, processed data and,
often overlooked, experiment annotation.
Keeping track of experimental details
ensures repeatability and there is now a
standard for this type of data – MIAME
(see Microarray Gene Expression Database
Group; http://www.mged.org). The
importance of well-defined and controlled
vocabularies was restated several times.
The Gene Ontology Consortium, for
example, is working on a gene annotation
standard (see Gene Ontology Consortium:
http://www.geneontology.org). Regardless
of the algorithms used for data cleaning,
it was recommended that microarray
images always be inspected visually,
because this remains the most reliable
way to catch certain errors, such as
obvious outliers.
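To make the experiment-annotation point concrete, here is a hedged sketch of a record carrying a few MIAME-style fields; the field names are illustrative and not the normative MIAME checklist:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentAnnotation:
    """Illustrative subset of MIAME-style annotation; the actual
    MIAME checklist (http://www.mged.org) defines the normative
    categories. Storing this alongside images and raw data is
    what makes an experiment repeatable."""
    experiment_design: str          # e.g. "time course, 2 conditions"
    array_design: str               # platform / layout identifier
    sample_source: str              # organism, tissue, treatment
    hybridization_protocol: str
    measurement_protocol: str       # scanner, image-processing settings
    normalization: str              # algorithm applied to raw data
    controlled_terms: dict = field(default_factory=dict)  # e.g. GO ids
```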
Of course, data normalization should
not be overdone, because every processing
step can be a step away from the
underlying biological data. And, depending
on the method, certain patterns are more
or less likely to be amplified or
suppressed. According to James Lyons-
Weiler (University of Massachusetts,
Amherst, MA, USA) more reliable results
are achieved with hierarchical clustering if
the process is repeated several times with
different starting points and then the most
common solution is chosen. Various
machine learning approaches, such as
neural networks and support vector
machines, are also being adopted for
classification of microarray data.
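Hierarchical clustering is itself deterministic, so one reading of the repeated-runs advice is consensus clustering over start-dependent runs. The sketch below applies that reading using k-means, which genuinely depends on its starting point; this is an interpretation, not necessarily the speaker's exact procedure, and it assumes scikit-learn is available.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def consensus_kmeans(data, k, runs=20, seed0=0):
    """Repeat clustering from different random starting points and
    keep the most common partition. k-means stands in here for any
    start-dependent clustering method."""
    partitions = Counter()
    for r in range(runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=seed0 + r).fit_predict(data)
        # Canonicalize label names so identical partitions that merely
        # permute cluster ids compare equal.
        relabel, canonical = {}, []
        for lab in labels:
            relabel.setdefault(lab, len(relabel))
            canonical.append(relabel[lab])
        partitions[tuple(canonical)] += 1
    best, _ = partitions.most_common(1)[0]
    return np.array(best)
```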
Algorithms and computational challenges
Clustering by choosing representative
sequences is an efficient but perhaps
simplistic strategy to eliminate redundancy
and speed up homology searching within
large protein databases, according to
Weizhong Li (Burnham Institute,
La Jolla, CA, USA). Deepak Thakkar
(Silicon Genetics, Redwood City, CA, USA)
pointed out that analysis software that
transparently applies certain algorithms
should be able to attach corresponding
references to the processed data. Although
this would in principle be simple to
implement, it is not yet widely done.
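The representative-sequence strategy Li described can be sketched as a single greedy pass — the general idea behind tools such as CD-HIT; the k-mer overlap score below is a crude stand-in for the alignment-based identity a real tool would use.

```python
def greedy_representatives(sequences, threshold=0.9):
    """Greedy redundancy removal: sort sequences longest-first and
    keep one representative per cluster; every later sequence that
    is similar enough to an existing representative is absorbed."""
    def kmer_set(seq, k=4):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def similarity(a, b):
        # Fraction of shared k-mers -- a rough proxy for identity.
        ka, kb = kmer_set(a), kmer_set(b)
        return len(ka & kb) / max(1, min(len(ka), len(kb)))

    reps = []
    for seq in sorted(sequences, key=len, reverse=True):
        if not any(similarity(seq, rep) >= threshold for rep in reps):
            reps.append(seq)
    return reps
```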
Two theoretical approaches for
reverse engineering gene networks
were independently presented by
Alberto de la Fuente (Virginia Polytechnic
Institute, Blacksburg, VA, USA) and
Mattias Wahde (Chalmers University
of Technology, Göteborg, Sweden).
Experiments comprising many time series
with fewer points each are usually more
reliable than those consisting of a single
series with many redundant data points.
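One common formulation of the reverse-engineering problem — assumed here for illustration, and not necessarily either speaker's model — is a linear system dx/dt ≈ Ax whose gene-gene connection matrix A is fitted to the pooled time series by least squares:

```python
import numpy as np

def fit_linear_network(series_list, dt):
    """Estimate a gene-gene influence matrix A under the common
    linear model dx/dt ~= A x, pooling several short time series
    (the design favored above over one long series). Each series
    is a (timepoints x genes) array sampled at step `dt`."""
    xs, dxs = [], []
    for s in series_list:
        s = np.asarray(s, float)
        dxs.append((s[1:] - s[:-1]) / dt)  # finite-difference derivative
        xs.append(s[:-1])
    X, dX = np.vstack(xs), np.vstack(dxs)
    # Least-squares solve dX = X A^T, so A^T = lstsq(X, dX).
    A_T, *_ = np.linalg.lstsq(X, dX, rcond=None)
    return A_T.T
```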
Protein interactions and pathways
Experimental protein interaction data is
obtained on a large scale from yeast two-
hybrid experiments. Such experiments can
vary not only in which proteins they test for
interactions but also in whether detailed
information, such as the place and time of
an interaction, is recorded, and in whether
interactions are classified in a binary way or
with probabilities. If performing an all-against-
all comparison is not feasible, proteins are
screened for interactions with fragment
libraries. Although this approach does yield
domain–domain interaction information,
data on time and location of the
interactions is lost, according to Vincent
Schächter (Hybrigenics, Paris, France).
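A hedged sketch of how these design choices might surface in an interaction record; the field names are illustrative, not any database's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    """Illustrative record reflecting the choices discussed: a call
    can be binary or probabilistic, and place/time context may or
    may not have been recorded (it is lost, for instance, when
    screening against fragment libraries)."""
    protein_a: str
    protein_b: str
    confidence: float = 1.0            # 1.0/0.0 for binary calls
    compartment: Optional[str] = None  # e.g. "nucleus"; None if unknown
    condition: Optional[str] = None    # e.g. "heat shock, 30 min"
    method: str = "yeast two-hybrid"
```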
Promoter analysis is another way to
detect protein interactions. This technique
is based on the fact that coregulated genes
usually produce gene products that lie
within the same pathway. Moreover,
coexpression implies a unique shared
promoter or shared enhancers. Similarity-
and structure-based methods, however,
are likely to miss proteins that are
involved in the same pathway as a given
protein but do not interact with it directly,
according to Thomas Werner (Genomatix,
München, Germany).
Protein interaction maps can be used to
establish potential pathways, which are
essential for the understanding of gene
functions. The function of a gene can often
be predicted more reliably from its functional
context than from its sequence. A gene’s functional
context includes not only the properties
of the gene product but also interaction
partners and which other genes are in the
same pathway. The major drawback at the
moment is that protein interaction and
pathway predicting methods are difficult
to evaluate because there are no complete
reference data sets available.
Protein function
Structural similarities between proteins are
considered to be more relevant than pattern
or sequence similarity because structures
are known to be more conserved than the
underlying sequences. Therefore protein
structures rather than sequences should be
used for predicting functions, according to
Ben Hitz (ProCeryon Biosciences, New York,
USA). Margaret Biswas (European
Bioinformatics Institute, Cambridge, UK)
presented a database that integrates several
existing motif and pattern databases (see
Interpro; http://www.ebi.ac.uk/interpro).
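Several of the member databases InterPro integrates are pattern-based; as a minimal illustration (not InterPro's actual matching machinery), a PROSITE-style pattern can be translated into a regular expression and scanned against a sequence:

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex.
    Handles the common constructs only: x = any residue,
    [..] = allowed set, {..} = forbidden set, (n) = repetition."""
    regex = pattern.rstrip('.')
    regex = regex.replace('-', '')                     # element separator
    regex = regex.replace('x', '.')                    # any residue
    regex = re.sub(r'\{([A-Z]+)\}', r'[^\1]', regex)   # exclusion set
    regex = re.sub(r'\((\d+(?:,\d+)?)\)', r'{\1}', regex)  # repeats
    return regex

# Example: the classic P-loop motif [AG]-x(4)-G-K-[ST]
motif = prosite_to_regex('[AG]-x(4)-G-K-[ST].')
print(re.findall(motif, 'MAHAGQSTGKTLLIGAGKSA'))  # -> ['AGQSTGKT']
```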
Clinical and research applications
Insight into how microarrays are used in
research was provided by Heiko Müller
(Dept of Pharmacology, Pharmacia
Corporation, Nerviano, Italy), who talked
about the pRB tumor suppressor pathway,
and Frank Pugh (Pennsylvania State
University, University Park, PA, USA),
who revealed the mechanism by which
repressors control gene expression.
Cancer research is an ideal application
for microarrays. Gene expression data can
be linked to clinical data such as survival
rates and used to determine high-risk
patient groups that should receive more
therapy. According to Peter Lichter (German
Cancer Research Center, Heidelberg,
Germany), a major step in accelerating
research will be the availability of disease-
specific microarray chips. But whether
chemotherapy will work for a patient
cannot always be determined from proteins
alone, because they might not be specific
to a cancer type, cautioned Walter Battistutti
(University Hospital, Vienna, Austria).
Roland Eils (Phase-It, Heidelberg,
Germany) pointed out that it is important
to be aware that microarrays might only
be detecting secondary effects, which are,
of course, poor pharmaceutical targets.
According to Andreas Windemuth
(Genaissance Pharmaceuticals, New
Haven, CT, USA), now that the human
genome has been sequenced, for drug
discovery purposes it should be seen as
consisting not only of the basic sequence
but also of individual variations.
Concluding remarks
The conference provided a well-balanced
mixture of theoretical presentations
on mathematical and statistical issues
and practical presentations on the
applications of the discussed technologies.
Somewhat missing from the program were
software-engineering topics, an omission
that might simply reflect the targeted audience.
Along with the rapid growth of data
there has been a strong increase in the
number of algorithms used for analyzing
this data. Most algorithms are not new but
are being rediscovered and adopted from
other scientific fields. There is much
uncertainty over which algorithms to
choose, even for basic tasks such as
clustering and filtering data. There is a
growing need for better benchmarks as
well as guidelines or expert systems for
choosing appropriate algorithms.
Another recurring topic is the
importance of standards for data exchange
among tools and databases. Although the
progress of software standards in other
industries does not necessarily provide
much hope, there are now a few promising
standardization projects underway
(see I3C: http://www.i3c.org).
Eric Jain
Jain PharmaBiotech, Blaesiring 7, 4057 Basel, Switzerland. e-mail: [email protected]
Using proteomics to identify targets and validate leads
Kewal K. Jain
The International Business
Communications (IBC) ‘Proteomics and
the proteome: role in target identification
and lead validation’ conference was held
18–20 February 2002 in Geneva, Switzerland.
Published online: 31 May 2002
This conference consisted of presentations
on proteomic technologies relevant to
drug discovery. Further details of other
proteomic technologies have been described
elsewhere (for example, see [1]). Important
technologies that were discussed included
protein microarrays, protein–protein
interactions and cell proteomics.
2D gel electrophoresis
Two-dimensional gel electrophoresis
(2D GE) remains the method with the
highest resolving power for the separation
of proteins and is followed by mass
spectrometry (MS) for rapid identification.
Substantial fractionation has to be
performed in addition to 2D GE when
aiming to identify 2000 proteins in different
tissues and body fluids. Hanno Langen
(Roche, Basel, Switzerland) explained how
Roche uses subcellular fractionation and
removal of abundant proteins before
separation techniques, increasing the
sensitivity by a factor of ten. Attempts are
being made to reduce the biological
variation associated with traditional
2D GE to maximize the quantitative aspect
of differential protein analysis.
Protein biochips and microarrays
Protein biochips and microarrays are
increasingly used in proteomics and offer
the prospect of rapid analysis of the
entire proteome. Microarrays
can also be used to screen protein–drug
interactions and to detect posttranslational
modifications. Thus, the concept of
comparing proteomic maps of healthy and
diseased cells could help us to understand
cell signalling and metabolic pathways
and will help pharmaceutical companies to
speed up the development of therapeutics.
Arrays of functional proteins can help
at various stages in the drug discovery
process. Roland Kozlowski (Sense
Proteomics, Cambridge, UK) described
COVET (cloned open reading frames for
the validation of experimental targets)
technology for parallel cloning, gene
expression and arraying of a large number
of functional proteins and protein
families. COVET allows the specific
modification of every member of a cDNA
library in a manner that does not rely
on any knowledge of the sequence of
individual genes. Modified cDNAs are
expressed in a suitable host to produce
tagged functional proteins, which can
be microarrayed without the risk of
impairing the function of the protein.
Michael Pawlak (Zeptosens, Witterswil,
Switzerland) described the use of
ZeptoMark protein microarrays for
quantification of low abundance proteins
using ultrasensitive fluorescence detection
on planar waveguides [2]. This enables
real-time measurement in light-scattering
media such as blood. Data about protein
expression and interactions from using this
protein biochip can be applied at various
stages of drug development including
target identification, target validation and
quantitative marker monitoring.
Analysis of multiprotein complexes in
Saccharomyces cerevisiae
Bernhard Kuster (Cellzome, Heidelberg,
Germany) described the systematic
analysis of multiprotein complexes in
Saccharomyces cerevisiae using tandem
affinity purification (TAP). TAP enables
assembly and selective purification of
cellular protein complexes and is followed
by MS. The TAP tag has been fused to
many genes for isolation of protein
complexes from different cell types,
providing an outline of the eukaryotic
proteome as a network of protein complexes
at a level of organization beyond binary
interactions [3]. Protein and interaction
pathway maps are being developed to
provide the basis for target identification
and lead validation. This approach will
improve our understanding of the mode of