
preliminary projects that establish and automate needed technologies. The pilots aim at (1) moving to global strategies, (2) exploring basic properties and (3) accelerating drug discovery. Underlying the discussion is a commitment to fulfilling the original promise of the genome project, obtaining a deep understanding of biological systems that will benefit society through improved health care.

Structure in the driver’s seat

Structural genomics, largely driven by the National Institute of General Medical Sciences' Protein Structure Initiative, seeks to fulfill a central biophysical assertion: while form follows function, a given structure can carry out only a specific function, and so studying structure will give us insight into mechanism (biochemical function) and ultimately into role (biological function). Computational tools are very effective in predicting the structures of proteins with high sequence similarity to those of known structure, but there has been little success in predicting the structure of a novel sequence. A high-throughput structure determination pipeline includes target selection, overexpression, purification, crystal growth, structure determination and refinement. Structural genomics aims at automating each step while driving down costs.

Overviews of efforts to accelerate the pipeline in industry, universities and US National Laboratories were presented. The Joint Center for Structural Genomics (University of California at San Diego, The Scripps Research Institute and Stanford University, CA, USA) uses a robotic facility, developed by the Genomics Institute of the Novartis Research Foundation and Syrrx (San Diego, CA, USA), that performs 100 000 crystallizations per day in 100-nanoliter droplets. Automated expression and purification systems are also being optimized. Escherichia coli strains are still the only practical vehicle for large-scale expression, which is the bottleneck. Alex Raeber (Cytos Proteome Therapeutics, Konstanz, Germany) presented an alternative: alphavirus expression in mammalian cells to create a library of potential therapeutic targets. This system provides safe handling, a broad host range, the capacity to represent the proteome faithfully (in contrast to bacterial expression), rapid gene recovery and the production level necessary for screening.

Functional genomics with drug discovery in mind

Moving further into drug discovery approaches, mass spectrometry in a proteomics context has been used to identify potential drug targets, characterize protein complexes, analyze the mechanism of drug action and monitor drug efficacy using biomarkers. Mass spectrometry approaches require managing the wide distribution of abundance of individual protein species and the concomitant large dynamic range within a single cell type. Stanley Hefta (Bristol-Myers Squibb, Plainsboro, NJ, USA) described how the company is looking at osteoporosis through analysis of signal transduction in osteoclasts and has already validated targets experimentally to the point of involving animal models. Robert Hollingsworth (GlaxoSmithKline, NC, USA) described how high-throughput yeast two-hybrid screening provides another method for drug target discovery. Nicholas Dean (Isis Pharmaceuticals, Carlsbad, CA, USA) discussed how antisense probes, coupled with microarray data, can be used to probe cells directly to identify genes. This antisense technology should allow an efficient transition to human studies and provide predictable pharmacokinetics and toxicology. Isis has characterized the pharmacokinetics of its oligonucleotide leads, which support the promise of this approach in molecular medicine. John Kozarich (ActivX, La Jolla, CA, USA) explained how designing novel chemicals to interrogate the activities of proteins complements analytical approaches, and James LeBlanc (Ciphergen Biosystems, Fremont, CA, USA) described how scanning fluids directly for biomarkers via protein arrays and laser ionization can achieve femtomolar sensitivity.

The European Bioinformatics Institute's proteomics project (described by Claire O'Donovan, EBI, Cambridge, UK) uses automated annotation to ascertain functional information. This information is then compared with the ~30 000 protein-coding genes already identified; based on a five- to ten-fold increase from post-translational modifications and a 30 to 50% increase from alternative splicing, the comparison suggests that more than one million proteins are encoded by the human genome.

Prototype proteome and conclusion

What the world needs now are models for proteomics: Soumitra Ghosh (MitoKor, San Diego, CA, USA) urged us to consider a powerful but simple system, mitochondria, two to three orders of magnitude simpler than eukaryotes. Although the mitochondrial genome encodes only 13 proteins, the details of the experimental observations are daunting and provide a glimpse of the challenges of moving to nuclear, let alone cellular, levels for 'complete' analyses. Characterizing the mitochondrial proteome fully could yield a plethora of novel drugs because many oncogenic processes involve mitochondrial membrane-bound proteins. In sum, the complexities of proteomics will entertain us for decades to come.

John C. Wooley

Dept of Pharmacology and Chemistry–Biochemistry, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0043, USA. e-mail: [email protected]


Current trends in bioinformatics

Eric Jain

The fourth annual 'Integrated bioinformatics' conference was organized by the Cambridge Healthtech Institute (CHI) and held in Zürich, 16–18 January 2002.

Published online: 31 May 2002

More than two-thirds of the presentations at this conference were related to microarrays, mirroring a current trend in bioinformatics. One of the purposes of microarray experiments is to find genes that are either up- or downregulated under specific circumstances. The usual method for detecting such variations within microarray data is to look for genes whose expression levels lie more than a certain number of standard deviations from the mean or show a several-fold difference between experiment and control measurements. It was pointed out that this strategy works well if the changes are large enough, but it can be problematic because it misses cases in which many small changes add up to a significant effect.
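As a minimal sketch of the thresholding approach just described, assuming a two-channel experiment/control layout, a z-score computed on the log-ratio and illustrative cut-off values (none of which come from the talks):

```python
import numpy as np

def flag_genes(expr, control, z_cutoff=2.0, fold_cutoff=2.0):
    """Flag genes whose log-ratio lies more than z_cutoff standard deviations
    from the mean, or whose experiment/control ratio exceeds fold_cutoff
    in either direction."""
    log_ratio = np.log2(expr / control)
    z = (log_ratio - log_ratio.mean()) / log_ratio.std()
    by_z = np.abs(z) > z_cutoff
    by_fold = (expr / control > fold_cutoff) | (control / expr > fold_cutoff)
    return by_z | by_fold

# Hypothetical intensities for five genes (experiment vs control)
expr = np.array([120.0, 30.0, 500.0, 45.0, 80.0])
control = np.array([100.0, 95.0, 60.0, 50.0, 78.0])
print(flag_genes(expr, control))  # boolean mask of putatively up-/downregulated genes
```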

One of the challenges is to separate relevant signals from the background noise without obtaining too many false positives. Besides the fact that biological systems are inherently noisy, further errors are introduced by various technical factors. After obtaining a list of up- and downregulated genes, the next step usually involves trying to extract some biological meaning from this data, for example by looking for similar genes in public databases to determine properties such as chromosomal location and functional categories. A tool that automates this procedure was presented by Sorin Draghici (Wayne State University, Detroit, MI, USA).
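The look-up step can be pictured with a toy annotation table; the gene identifiers, chromosomes and categories below are purely hypothetical, and in practice these properties would be retrieved from public databases rather than hard-coded:

```python
from collections import defaultdict

# Hypothetical annotation table (gene id -> chromosomal location, functional category)
annotation = {
    "GENE1": ("chr3", "signal transduction"),
    "GENE2": ("chr7", "cell cycle"),
    "GENE3": ("chr3", "signal transduction"),
}

def summarize(regulated_genes):
    """Group a list of up-/downregulated genes by functional category,
    keeping the chromosomal location of each gene."""
    by_category = defaultdict(list)
    for gene in regulated_genes:
        chromosome, category = annotation.get(gene, ("unknown", "unannotated"))
        by_category[category].append((gene, chromosome))
    return dict(by_category)

print(summarize(["GENE1", "GENE3", "GENE2"]))
```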

Microarray data processing involves several steps, ranging from image acquisition, image processing, filtering and normalization of raw data to tasks such as data analysis and visualization. Many techniques for filtering and normalization are in use, the most common being multiplication with a normalization factor, intensity averaging, ratio averaging, LOWESS normalization and the use of control genes. Usually, better results can be achieved by working with subarrays rather than full arrays. Also, applying any kind of normalization is better than none at all, according to Jason Gonçalves (Iobion, Toronto, Canada).
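A sketch of the simplest of these ideas, a per-subarray normalization factor, is shown below; the block layout and the choice of scaling each block's median to a reference channel are assumptions made for illustration, not a method presented at the meeting:

```python
import numpy as np

def normalize_subarrays(signal, reference, blocks):
    """Scale each subarray (block) of `signal` so that its median matches the
    median of the corresponding block in `reference`. `blocks` assigns every
    spot to a subarray."""
    normalized = signal.astype(float).copy()
    for b in np.unique(blocks):
        mask = blocks == b
        factor = np.median(reference[mask]) / np.median(signal[mask])
        normalized[mask] *= factor  # multiplication with a per-block normalization factor
    return normalized

# Hypothetical two-block array with six spots
signal = np.array([10.0, 12.0, 9.0, 40.0, 44.0, 38.0])
reference = np.array([20.0, 22.0, 21.0, 41.0, 43.0, 39.0])
blocks = np.array([0, 0, 0, 1, 1, 1])
print(normalize_subarrays(signal, reference, blocks))
```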

Microarray databases should be able to store the original images from scanners, raw quantification data, processed data and, often overlooked, experiment annotation. Keeping track of experimental details ensures repeatability, and there is now a standard for this type of data, MIAME (see Microarray Gene Expression Database Group; http://www.mged.org). The importance of well-defined and controlled vocabularies was restated several times. The Gene Ontology Consortium, for example, is working on a gene annotation standard (see Gene Ontology Consortium: http://www.geneontology.org). Regardless of the algorithms used for data cleaning, it was recommended that microarray images always be inspected visually, because this still provides the most reliable method for correcting certain errors, such as obvious outliers.
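A toy record covering the four kinds of data listed above might look as follows; the field names and example values are illustrative only and are not the MIAME schema itself:

```python
from dataclasses import dataclass, field

@dataclass
class MicroarrayExperiment:
    """Minimal container for the four kinds of data a microarray database
    should keep: scanner images, raw quantification data, processed data
    and the experiment annotation that makes a result repeatable."""
    scanner_images: list[str]              # paths to the original scanner images
    raw_quantification: dict[str, float]   # spot id -> raw intensity
    processed_data: dict[str, float]       # spot id -> normalized expression value
    annotation: dict[str, str] = field(default_factory=dict)  # experimental details

exp = MicroarrayExperiment(
    scanner_images=["slide1_ch1.tif", "slide1_ch2.tif"],
    raw_quantification={"YFG1": 1520.0},
    processed_data={"YFG1": 1.8},
    annotation={"organism": "S. cerevisiae", "treatment": "heat shock, 30 min"},
)
print(exp.annotation["treatment"])
```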

Of course, data normalization should not be overdone, because every processing step can be a step away from the underlying biological data, and, depending on the method, certain patterns are more or less likely to be amplified or suppressed. According to James Lyons-Weiler (University of Massachusetts, Amherst, MA, USA), more reliable results are achieved with hierarchical clustering if the process is repeated several times with different starting points and the most common solution is then chosen. Various machine learning approaches, such as neural networks and support vector machines, are also being adopted for the classification of microarray data.
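The repeated-runs idea can be illustrated with a clustering method that visibly depends on its starting point; the tiny one-dimensional k-means below is a stand-in chosen for brevity, not the hierarchical procedure discussed at the meeting, and the data are invented:

```python
import random
from collections import Counter

def kmeans_labels(points, k, seed):
    """One run of a tiny one-dimensional k-means from a random starting point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(20):  # a few refinement iterations
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def canonical(labels):
    """Relabel clusters by order of first appearance so different runs are comparable."""
    mapping, out = {}, []
    for l in labels:
        mapping.setdefault(l, len(mapping))
        out.append(mapping[l])
    return tuple(out)

points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
runs = Counter(canonical(kmeans_labels(points, k=2, seed=s)) for s in range(10))
print(runs.most_common(1))  # the partition found most often across restarts
```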

Algorithms and computational challenges

Clustering by choosing representative sequences is an efficient but perhaps simplistic strategy to eliminate redundancy and speed up homology searching within large protein databases, according to Weizhong Li (Burnham Institute, La Jolla, CA, USA); a rough sketch of the idea is given below. Deepak Thakkar (Silicon Genetics, Redwood City, CA, USA) pointed out that analysis software that transparently applies certain algorithms should be able to attach corresponding references to the processed data. Although this would in principle be simple to implement, it is not yet widely done.
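The sketch shows a greedy version of representative-sequence clustering; the crude identity measure, the 90% threshold and the toy sequences are illustrative assumptions, not the algorithm actually presented:

```python
def identity(a, b):
    """Crude per-position identity between two sequences (illustrative only)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def pick_representatives(sequences, threshold=0.9):
    """Greedy redundancy removal: a sequence becomes a new representative only
    if it is below `threshold` identity to every representative chosen so far."""
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):  # longest sequences first
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives

seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MSDNELNQLL"]
print(pick_representatives(seqs))  # the near-duplicate sequence is dropped
```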

Two theoretical approaches for reverse engineering gene networks were independently presented by Alberto de la Fuente (Virginia Polytechnic Institute, Blacksburg, VA, USA) and Mattias Wahde (Chalmers University of Technology, Göteborg, Sweden). Experiments comprising many time series with fewer points are usually more reliable than those consisting of a single series with many redundant data points.

Protein interactions and pathways

Experimental protein interaction data is obtained on a large scale from yeast two-hybrid experiments. Experiments might vary not only in which proteins they test for interactions but also in how much detail, such as the place and time of an interaction, is recorded, and in whether interactions are classified in a binary way or with probabilities. If performing an all-against-all comparison is not feasible, proteins are screened for interactions with fragment libraries. Although this approach does yield domain–domain interaction information, data on the time and location of the interactions are lost, according to Vincent Schächter (Hybrigenics, Paris, France).

Promoter analysis is another way to detect protein interactions. This technique is based on the fact that coregulated genes usually produce gene products that lie within the same pathway. Moreover, coexpression implies a unique shared promoter or shared enhancers. Similarity- and structure-based methods, however, are likely to miss proteins that are involved in the same pathway as a given protein but do not interact with it directly, according to Thomas Werner (Genomatix, München, Germany).

Protein interaction maps can be used to establish potential pathways, which are essential for the understanding of gene functions. The function of a gene can often be predicted more reliably from its functional context than from its sequence. A gene's functional context includes not only the properties of the gene product but also its interaction partners and which other genes lie in the same pathway. The major drawback at the moment is that protein interaction and pathway prediction methods are difficult to evaluate because no complete reference data sets are available.

Protein function

Structural similarities between proteins are considered to be more relevant than pattern or sequence similarity because structures are known to be more conserved than the underlying sequences. Therefore, protein structures rather than sequences should be used for predicting functions, according to Ben Hitz (ProCeryon Biosciences, New York, USA). Margaret Biswas (European Bioinformatics Institute, Cambridge, UK) presented a database that integrates several existing motif and pattern databases (see InterPro; http://www.ebi.ac.uk/interpro).

Clinical and research applications

Insight into how microarrays are used in research was provided by Heiko Müller (Dept of Pharmacology, Pharmacia Corporation, Nerviano, Italy), who talked about the pRB tumor suppressor pathway, and Frank Pugh (Pennsylvania State University, University Park, PA, USA), who revealed the mechanism by which repressors control gene expression.

Cancer research is an ideal application for microarrays. Gene expression data can be linked to clinical data, such as survival rates, and used to determine high-risk patient groups that should receive more therapy. According to Peter Lichter (German Cancer Research Center, Heidelberg, Germany), a major step in accelerating research will be the availability of disease-specific microarray chips. However, whether chemotherapy will work for a patient cannot always be determined from proteins alone, because these might not be specific to a cancer type, cautioned Walter Battistutti (University Hospital, Vienna, Austria). Roland Eils (Phase-It, Heidelberg, Germany) pointed out that it is important to be aware that microarrays might only be detecting secondary effects, which are, of course, poor pharmaceutical targets. According to Andreas Windemuth (Genaissance Pharmaceuticals, New Haven, CT, USA), now that the human genome has been sequenced, for drug discovery purposes the genome should be seen to consist not only of the basic sequence but also of individual variations.

Concluding remarks

The conference provided a well-balanced mixture of theoretical presentations on mathematical and statistical issues and practical presentations on the applications of the discussed technologies. Somewhat missing from the program were software engineering topics, which might simply reflect the targeted audience.

Along with the rapid growth of data there has been a strong increase in the number of algorithms used for analyzing this data. Most algorithms are not new but are being rediscovered and adopted from other scientific fields. There is a lot of uncertainty over which algorithms to choose, even for basic tasks such as clustering and filtering data. There is a growing need for better benchmarks as well as guidelines or expert systems for choosing appropriate algorithms.

Another recurring topic is the importance of standards for data exchange among tools and databases. Although the progress of software standards in other industries doesn't necessarily provide much hope, there are now a few promising standardization projects underway (see I3C: http://www.i3c.org).

Eric Jain

Jain PharmaBiotech, Blaesiring 7, 4057 Basel, Switzerland. e-mail: [email protected]


Using proteomics to identify targets and validate leads

Kewal K. Jain

The International Business Communications (IBC) 'Proteomics and the proteome: role in target identification and lead validation' conference was held 18–20 February 2002 in Geneva, Switzerland.

Published online: 31 May 2002

This conference consisted of presentations on proteomic technologies relevant to drug discovery. Further details of other proteomic technologies have been described elsewhere (for example, see [1]). Important technologies that were discussed included protein microarrays, protein–protein interactions and cell proteomics.

2D gel electrophoresis

Two-dimensional gel electrophoresis (2D GE) remains the method with the highest resolving power for the separation of proteins and is followed by mass spectrometry (MS) for rapid identification. Substantial fractionation has to be performed in addition to 2D GE when aiming to identify 2000 proteins in different tissues and body fluids. Hanno Langen (Roche, Basel, Switzerland) explained how Roche uses subcellular fractionation and removal of abundant proteins before separation, increasing the sensitivity by a factor of ten. Attempts are being made to reduce the biological variation associated with traditional 2D GE to maximize the quantitative aspect of differential protein analysis.

Protein biochips and microarrays

Protein biochips and microarrays are increasingly used in proteomics and offer the distinct possibility of enabling a rapid analysis of the entire proteome. Microarrays can also be used to screen protein–drug interactions and to detect posttranslational modifications. Thus, the concept of comparing proteomic maps of healthy and diseased cells could help us to understand cell signalling and metabolic pathways and will help pharmaceutical companies to speed up the development of therapeutics.

Arrays of functional proteins can help at various stages in the drug discovery process. Roland Kozlowski (Sense Proteomics, Cambridge, UK) described COVET (cloned open reading frames for the validation of experimental targets) technology for the parallel cloning, gene expression and arraying of a large number of functional proteins and protein families. COVET allows the specific modification of every member of a cDNA library in a manner that does not rely on any knowledge of the sequence of individual genes. Modified cDNAs are expressed in a suitable host to produce tagged functional proteins, which can be microarrayed without the risk of impairing the function of the protein.

Michael Pawlak (Zeptosens, Witterswil, Switzerland) described the use of ZeptoMark protein microarrays for the quantification of low-abundance proteins using ultrasensitive fluorescence detection on planar waveguides [2]. This enables real-time measurement in light-scattering media such as blood. Data about protein expression and interactions obtained with this protein biochip can be applied at various stages of drug development, including target identification, target validation and quantitative marker monitoring.

Analysis of multiprotein complexes in Saccharomyces cerevisiae

Bernhard Kuster (Cellzome, Heidelberg, Germany) described the systematic analysis of multiprotein complexes in Saccharomyces cerevisiae using tandem affinity purification (TAP). TAP enables assembly and selective purification of cellular protein complexes and is followed by MS. The TAP tag has been fused to many genes for isolation of protein complexes from different cell types, providing an outline of the eukaryotic proteome as a network of protein complexes at a level of organization beyond binary interactions [3]. Protein and interaction pathway maps are being developed to provide the basis for target identification and lead validation. This approach will improve our understanding of the mode of