bio-broker: a tool for integration of biological data ... · this biological mediator (bio-broker)...
TRANSCRIPT
SOFTWARE—PRACTICE AND EXPERIENCESoftw. Pract. Exper. 2006; 36:1585–1604Published online 13 July 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe.733
Bio-Broker: a tool forintegration of biological datasources and data analysis tools
J. F. Aldana, M. Roldan-Castro, I. Navas, M. M. Roldan-Garcıa, M. Hidalgo-Condeand O. Trelles∗,†
E.T.S.I. Informatica, Campus de Teatinos, 29071, Malaga, Spain
SUMMARY
In this work we present an architecture for XML-based mediator systems and a framework for helpingsystems developers in the construction of mediator-services for the integration of heterogeneous datasources. A unique feature of our architecture is its capability to manage (proprietary) user’s softwaretools and algorithms, modelled as Extended Value Added Services (EVASs), and integrated in the dataflow. The mediator offers a view of the system as a single data source where EVASs are readily availablefor enhancing query processing. A Web-based graphic interface has been developed to allow dynamic andflexible EVASs inter-connection, thus creating complex distributed bioinformatics machines. The feasibilityand usefulness of our ideas has been validated by the development of a mediator system (Bio-Broker) andby a diverse set of applications aimed at combining gene expression data with genomic, sequence-based andstructural information, so as to provide a general, transparent and powerful solution that integrates dataanalysis tools and algorithms. Copyright c© 2006 John Wiley & Sons, Ltd.
Received 23 August 2004; Revised 15 November 2005; Accepted 16 November 2005
KEY WORDS: biological mediation; biological dataflow; data analysis tools; XML
1. INTRODUCTION
Technological breakthroughs such as high-throughput sequencing and gene-expression monitoringtechnology have nurtured the ‘omics’ revolution, enabling the massive production of data.Unfortunately, this valuable information is often dumped in proprietary data models and specific
∗Correspondence to: Oswaldo Trelles, Departamento de Arquitectura de Computadores, Universidad de Malaga, E.T.S.I.Informatica, Campus de Teatinos, 29071, Malaga, Spain.†E-mail: [email protected]
Contract/grant sponsor: Spanish Ministerio de Ciencia y Tecnologıa; contract/grant numbers: QLK3-2001-01473 andTIC2002-04586-C04-04
Copyright c© 2006 John Wiley & Sons, Ltd.
1586 J. F. ALDANA ET AL.
services are developed for data access and analysis, without forethought to the potential externalexploitation and integration of such data.
However, dealing with the exponential growing rates of biological data is a simple problem whencompared with the problem posed by diversity, heterogeneity and dispersion of data [1]. Nowadays,the accumulated biological knowledge needed to produce a more complete view of any biologicalprocess is disseminated around the world in the form of biology sequences and structure databases,frequently as flat files, as well as image/scheme-based libraries, Web-based information with particularand specific query systems, etc. Under these conditions, this plethora of interrelated informationbecomes difficult to use, pointing to the integration of these information sources for unified access,independently of possible internal changes, as a clear and important technological priority.
The database community has been engaged for many years with the problem of integratingheterogeneous data sources [2]. A first approach was based on wrappers, which act as interfaces forthe data sources, translating their structure into a common model [3,4]. Later, the focus moved tothe use of mediators for encapsulating the knowledge needed for evaluating a query over multiplewrappers. The wrapper–mediator approach sets up an interface for a group of (semi-structured) datasources, amalgamating their local schemas into a global schema and integrating the information of thelocal sources. As a result, the views of the data that mediators offer are coherent, producing semanticreconciliation of the common data model representations carried out by the wrappers. Some examplesof the wrapper–mediator systems are Amos [5], Tsimmis [6], Disco [7] and Garlic [8]. In the specificfield of biological data there are the following examples: TAMBIS [9], BioDataServer [10], KIND [11],BioZoom [12], BioKleisli [13] and DiscoveryLink [14]. With the advent of XML, some of the mediatorsystems have evolved toward the XML standard (Amos and Tsimmis), whilst other projects wereinitially XML-based, such as MIX [15] and MOCHA [16].
We have designed an architecture called Bio-Broker for XML-based mediator systems anddeveloped a framework for assisting systems developers in the construction of mediator services thatintegrate heterogeneous data sources. Our architecture allows the easy development of Extended ValueAdded Services‡ (EVASs). This is especially suitable for environments where users have to share toolswith which they are unfamiliar. The final user will have a view of the system as a single data sourceand furthermore, with the capability of easily selecting, adding and modifying EVASs for enhancingof query processing.
In the bioinformatics context, researchers often use a variety of tools for data analysis. Each of thesetools have their own interfaces, data formats and semantics, and probably need to be interconnectedin a workflow in order to define a processing pipeline, thus becoming complex virtual bioinformaticmachines [17]. Today, users build these workflows by hand, which may be a source of errors andinvolves a time cost. It is important to automate this process to free biologists of this repetitive task.We propose the use of EVASs for making the workflow construction easier. Each EVAS encapsulatesa software tool as a black box that may be used as a building block, which can be connected to anotherEVAS, building the desired workflow in an almost automatic way. EVAS may be interconnectedwithout requiring the knowledge of the internal details related to the encapsulated tools. Furthermore,it allows private software tools to be shared in a ‘pay per use’ way of operation.
‡An EVAS is an external software tool integrated in the mediator (i.e. a data mining algorithm).
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1587
The flexible use of EVAS is an outstanding characteristic of our system, which is not provided bytraditional or biological mediators such as TAMBIS, BioDataServer, KIND, BioZoom or BioKleisly.
Among closely related developments to Bio-Broker, we find BioMOBY, which provides a wayto create Web services as well as a central registry that has information on all the services thatare available (BioMOBY). In fact, the main difference between BioMOBY and Bio-Broker lies inthat Bio-Broker is a mediator system, i.e. an intermediary between data sources, services and usersthat offers a view of the whole system as a single resource, where EVASs are readily available forenhancing query processing (see complementary material at http://uranos.khaos.uma.es/mediator fordetails). Another related project is myGRID, which also provides a prototype system named Tavernafor building workflows (Taverna).
However, Taverna only offers a way to build and execute workflows, but it does not includeintegration capabilities. Thus, BioBroker combines traditional data integration through an integrationschema with workflow definition and execution (see complementary material for details).
As an example of such integration procedures we have developed a biological instance of theframework in the context of gene-expression data. In this context, the query space is, in practical terms,unbounded and dynamic. So the problem is not only to integrate diverse and scattered data sources butalso to find a way to analyse the queries, to identify with which service or database should be solved,as well as to integrate dispersed services and apply them to the selected set of data.
This biological mediator (Bio-Broker) integrates genetic information and various processingmechanisms developed by different users and groups through a graphical interface that allows fastand intuitive ‘wiring’ of EVASs components, expanding the functionality and enabling the easyincorporation of new procedures to customize the system for specific concerns. Bio-Broker allowsuniform access to data stored in different databases: presently these include SWISS-PROT [18],EMBL [19], MICADO [20], PDB [21], DIP [22] and BIND [23] as well as to several processingservices such as BLAST [24], AnaGram [25], Frags [26] and AsocRules [26]. The end result of thisservice is the ability to combine gene-expression data with genomic, sequence-based and structuralinformation, so as to supply a general, transparent and powerful solution. As a proof of concept,Bio-Broker has been also used to extend the capacity of engene, a versatile, Web-based and platform-independent exploratory data analysis tool for gene expression data (see [27] for more details).
Here we describe both the architecture design and the development mechanisms needed to build upthe mediator. We also define the problems selected to illustrate the system. We then describe a mediatorsystem, Bio-Broker, which offers enhanced query capabilities beyond gene-expression clustering.
2. SYSTEM AND METHODS
2.1. An XML based architecture for mediators design
The mediator system architecture [28,29] is based on the use of XML standards. The key characteristicsof the architecture and its main components are as follows (see Figure 1).
(1) The architecture is XML-based. XML [30] is itself a standard format for the representation,interchange and validation of data and, together with the XQuery query language [31] and theXML Schema [32], forms a complete data model for dealing with information on the Web.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1588 J. F. ALDANA ET AL.
Figure 1. General mediator architecture. This picture shows the classical architecture of a mediatorsystem extended with an EVAS module. The latter is a main feature of our architecture together withthe use of metadata at the client side. EVASs are, in a way, modelling recurrent processes (workflows)
that can be shared by several users.
XML Schema [32] and RDF Schema [33] are used for describing all the metadata necessary forquery processing in this architecture as well as for expressing the data resulting from the queries.
(2) Xquery is the query language used by the mediators. All the clients connected to a mediator mustsend a valid XQuery query to the mediator and will receive an XML document. The client canbe any sophisticated query generation mechanism and the only requirement is that it must senda valid XQuery query independent of how this query is generated. A generic tool is available forspeeding up the development of specific graphical query interfaces.
(3) The query processor is the component that receives a query and generates a sub-query for each ofthe data sources involved in the query. It analyses the data requirements of the query, identifyingthe data sources and generating a global plan for evaluating the query. The global plan contains
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1589
information on the various alternatives for solving the same data requirement. Later, the globalplan is fragmented into one sub-query for each of the selected data sources.
(4) Metadata are used to express all information needed for the above process. This includesinformation both on the logical schemas produced by the wrappers that encapsulate thedata sources, and on the semantic equivalence of the terms produced by the data sources(the integration schema). We also store information about the query processing capabilities ofthese wrappers and about the location and availability of the data sources for optimizing accessto the data. Specific user information such as privileges, EVASs usage and preferences is alsostored.
(5) EVASs are all those processes that allow the user to further filter or post-process the queries.In a sense EVASs are modelling arbitrarily complex repetitive processes appearing on thedata obtained from the data sources and shared by several users. Obviously these processes(e.g. BLAST, AnaGram, Frags) being application specific are not efficiently incorporated intothe query language. On the other hand, being repetitive tasks performed by several users, theirautomation and optimization should be assisted by the mediator.
(6) Wrappers are the components that translate the various data sources being integrated into thecommon data model (XML/Xquery). Wrappers hide the internal complexity of the data sourcesand offer, in a controlled manner, their query capabilities. There is one wrapper for eachdata source integrated into the mediator. Wrappers receive XQuery queries and produce XMLdocuments that conform to the XML schema (integration schema).
2.2. From architecture to mediators
A framework has been developed for simplifying the mediator construction task. The mediatorinstantiation process involves the construction of a particular application of the framework by thedevelopment of wrappers and the use of metadata for the configuration of the different mediatorcomponents. Wrappers are the means of translation between the reference model of the architecture(XML) and the data source model. Once the mediator has been constructed, the mediator user candynamically manage the EVASs adding or removing them as necessary.
The main tasks involved in mediator configuration are as follows.
Wrapper construction. Specific wrappers are necessary for each data source integrated into themediator. To simplify this task, our system offers ‘wrapper construction schemes’ for threeof the more frequent types of data sources (relational databases, XML documents and HTMLdocuments).Mediator configuration through metadata. Metadata will be used for adapting the mediatorcomponents. Specifically they will be used to configure both the location and integrationschemes.
• The location scheme links the wrappers to the mediator. It is used for identifying the datasource of a query and selecting the appropriate wrapper.
• The integration scheme specifies the data translations between the data model used by theclient and that offered by the wrappers. So, for each database field, it specifies by Xpath itslocation into the XML document given by the wrapper. Furthermore, it gives informationon the data source where such is available.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1590 J. F. ALDANA ET AL.
Figure 2. Developing and publishing of E-BlastP, an EVAS which performs a BlastP-search using a queryprotein sequence and some (user selected) parameters. E-BlastP receives the input as an XML document.Therefore, before using BlastP, E-BlastP parses and translates this input. BlastP results are then translated intoan XML output document. The right-hand side of the figure provides a snapshot of the graphical interface for
building workflows by pipelining EVAS.
2.3. Dynamic integration of EVASs
An EVAS is a software tool adapted for being connected to the mediator. EVASs are not in the mediatorcore. It is not possible to know in advance how to build the EVASs structure. EVAS are linkeddynamically during the use of the mediator system. Therefore, although the EVASs configurationis made using metadata, the framework offers to the mediator-users a management tool to connect,arrange and remove EVASs without any information on the mediator internal architecture, thus,enabling a wide range of processing alternatives to be specified for relating—in different ways—diversesources of biological data.
The integration of EVAS with the mediator consists of two main phases.
(1) Creation of the EVAS. The tool (program) to be integrated into the mediator has to be adapted forobtaining a new program (EVAS) with three parameters, an input XML document, an XQueryquery and a list of zero or more parameters (those that correspond to the tool). The resultingEVAS will use these three parameters for calling the original tool, thus obtaining an output XMLdocument. The input and output XML documents must conform to the schemas, which determineits connection in the workflow (see Figure 2).
(2) Connection of EVAS to the mediator. Once an EVAS has been created, it has to be linked to themediator. This implies the following tasks.
(a) Publication of the EVAS as a Web service. It will be made as a Web method namely Query,or as a server side program (CGI, ASP, JSP, etc.)
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1591
(b) Registration of the Web service in the system, sending information about this service tothe mediator administrator.
(c) Obtaining a DLL library, which performs a call to the Web service. This library allowsthe user to add services dynamically to the mediator without changes in its sourcecode. The library will be automatically generated by the mediator administrator, usingpreviously implemented templates.
(d) Registration of the library in the system. For this, some metadata, including name of thelibrary, name of the service, and input and output XML schema will be added.
When a set of EVAS has been developed and connected (see Figure 2) to the system, users can createworkflows using a graphical tool provided by the system. This tool has been developed using C#(the Microsoft programming language) and makes use of an internal representation for workflows.This tool allows users to connect databases, EVAS and the data integrator. Once the mediator has beendeveloped and configured, it is ready for use by the final clients. The clients can use the mediator eitherthrough a programming interface or through a graphical interface (for details about this tool see thesupplementary material).
This architecture does not impose any restriction on the definition of EVASs or on the way in whichthey are connected in workflows. Generic EVASs can be defined and they can also be replicated and/orspecialized to facilitate availability and/or communication efficiency. For example, running a BLASTquery on a custom set of genes would require transferring the entire set of genes for comparison againstthe BLAST server (typically many gigabytes). In this case, it would be much more efficient to link thisEVAS to a database that already has the necessary genes. In this case, a simple service can recover thegenes from the database avoiding communication overheads.
2.4. Biological database integration: Bio-Broker
We present a sample use of the mediator system for the integration of gene expression applications.In general, clustering is one of the most common forms of analysis for such applications, and oncethe data has been clustered, additional post processing analysis has to be performed to extract relevantknowledge.
In a sense, clustering analysis is incomplete without integrating it with functional informationcontained in nucleotide and protein databases (both at sequence and structure level) as well as withinformation contained in emerging pathways and assembly databases. A common option is the staticintegration of these very diverse data sources (equivalent to a pre-annotation process). However, whatis really necessary and much more interesting is the dynamic integration of services onto this data soas to obtain valuable information from the diverse sources.
At present, Bio-Broker integrates many different types of data sources: nucleotide (EMBL) andprotein (SWISS-PROT) sequence information [18]; the worldwide repository of three-dimensionalstructures of biological macromolecules—the Protein Data Bank, (PDB) [34]; the MICrobial AdvancedDatabase Organization (MICADO)—a relational database dedicated to microbial genomes [20] andfunctional analysis of Bacillus subtilis [35]; the Database of Interacting Proteins (DIPTM); and theBiomolecular Interaction Network Database (BIND) designed to store full descriptions of interactions,molecular complexes and pathways.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1592 J. F. ALDANA ET AL.
These databases have been selected on the basis of differences in content, format, access mechanismand geographical location. Our intention is to show how these very diverse data sources can be easilyintegrated using the proposed architecture and how the possibility of easily adding EVASs allows theuser to mix different tools for obtaining enhanced data processing capabilities.
Services requested of the mediator help to recover information associated with a given set ofclustered genes, independent of the way in which these clusters were obtained, on the basis of theirexpression levels by association rule discovering or by whatever other appropriated method.
3. RESULTS
3.1. Bio-Broker architecture
Figure 3 shows the complete architecture of Bio-Broker. A user interface (optionally a graphical one)is used for accessing the services. The queries, expressed in terms of the integration schema, are sent inXQuery to the server, which divides it into different subqueries. The subqueries are sent to the differentdatabases. Each database has its own query and XML result document construction mechanism(i.e. database-specific wrappers). The service receives the sub-query results. At this point, it is possibleto apply an EVAS to process the sub-query results. For example, a new EVAS can be applied to thefirst EVAS process result, and so on.
The results of each sub-query (with or without applied EVAS processing) are integrated, removingduplicates and inconsistencies. Duplications can be removed by taking advantage of the mappingsbetween the integration schema and the resource schemas. Data that the mediator receives is expressedin terms of the mediator’s integration schema and the mappings identify any duplicated items andremove one of them from the integrated data. Note that EVAS processing does not include duplicateremoving, this step is performed in the integration step.
Next, the integrated solution can be post-processed by a set of EVAS. The final result is then sent tothe user interface. It is worth observing that where and how these EVASs are implemented is completelyirrelevant: they are just Web services integrated into the mediator system. In the architecture thesemodules are used as a ‘black box’. They can even be developed by other people. To include a newEVAS one only needs to know its address (IP) and the service must conform to the DTD/schema.
The integration schema of Bio-Broker stores both the data provenance (the database(s) from whichthe data can be obtained) and the query capabilities (the elements of the resource scheme by whichusers can make queries and the expressiveness of the query language). Data provenance is an importantissue due to the necessity of assessing the quality of data in the biological context. The quality of dataobtained from proprietary Web services, not being up-to-date or well-ordered would not be as goodas that obtained from well known, widely used biological Web services, such as EMBL, PDB, etc.Therefore, data provenance is not only stored in the integration schema but also annotated in the answerso that the source can be identified, and the users can infer its quality according to their confidence inthe data source. Thus, we can for example annotate the GeneName field with metadata showing that itcan be obtain from EMBL, GenBank or MICADO databases (see the supplementary material).
Please note that we have not defined an architecture with multiple middle-layers; we have ratherdefined an architecture with three layers: user interfaces, the mediator and wrappers. It is also importantto note that although EVASs are fully integrated in the architecture they are not mandatory and so do
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1593
Figure 3. Bio-Broker architecture: data access, pre-processing, integration, post-processing and user interfaces.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1594 J. F. ALDANA ET AL.
Figure 4. Instance of our integration architecture with three EVASs.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1595
not represent another architectural layer; therefore, it is not necessary to include a set of EVASs foreach query processing task. Figure 4 shows an instance of our architecture for three EVAS, appliedafter the integration step. Addition of EVASs in the integration process does not impose a performancepenalty on the query processing step. Performance is simply the sum of the time of making use of thesetools and the time required to retrieve the data. Furthermore, the end-user cost of using these tools isalways less than using them manually, since the results are integrated. That is, the repetitive manualapplication of filters and transformation tools to data retrieved from data sources has a high cost;this is decreased by the mediator, which applies them automatically. Besides, execution of workflowsis performed automatically, taking advantage of metadata, so the cost of maintenance is limited todefinition of workflow according to user needs.
3.2. Bio-Broker services
As a demonstrable ‘proof of concept’ that can be evaluated and tested in the field we presentthe integration of gene expression data with the biological databases described in Section 2.The data set described by [36] will be used. This data set is composed of the cDNA expressionlevels of 4005 Bacillus subtilis two-component regulatory systems in which the overproduction ofa response regulator of the two-component systems, coinciding with a deficiency of its cognatesensor kinase, affects the regulation of genes, including its target genes. The genome-wide effecton gene expression caused by the overproduction was analysed on 24 two-component systems(http://www.genome.ad.jp/kegg/expression).
The engene platform (at http://chirimoyo.ac.uma.es/bitlab) was used to cluster this subset of geneexpression data. The mediator system is hosted at IP address http://uranos.khaos.uma.es and is availableby using the Web services (see http://uranos.khaos.uma.es/mediator). The Web method ‘connect’supplies an identifier to dialog with the system. The WSDL description of the method for queryinghas two available methods ‘ConsultCompress’ and ‘Consult’. The latter produces a ‘string’ XMLdocument, while the former compresses the data, and the client must use Xmill to decompress.Parameters for both methods are an array with the query (queries), the number of queries, the useridobtained when connected to the system, a Boolean array that specifies the services to use, and theparameters needed for each service (see the complementary material).
Table I describes four samples of implemented queries. These examples describe different levels ofcomplexity, and the different databases and services that can be requested and combined in a simplequery. They have been designed and published as services in http://uranos.khaos.uma.es/mediator, thuswe provide biologists with an easy Web interface to solve common queries that they often need tosolve.
The first service (Service A in Table I) sends a request to the mediator system for informationabout the ‘genomic context’ of clustered genes, whose more straightforward definition is based on thephysical proximity of genes in the chromosome [37,38]. It is well accepted that neighbouring genesin bacteria are often functionally related, so drawing the gene/cluster distribution provides additionalinformation for interpreting expression data. This service will work on MICADO, which containsdetailed information on Bacillus subtilis. This is a very simple request and in a way is equivalent to apre-annotation process that can be used to include information for analysis from very different sourcesin the data collection (see Figure 5). Additional information about the query formulation can be foundin the complementary material.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1596 J. F. ALDANA ET AL.
Tabl
eI.
Req
uest
edse
rvic
es.F
our
exam
ples
are
show
nto
desc
ribe
diff
eren
tlev
els
ofco
mpl
exit
y,an
dth
edi
ffer
entd
atab
ases
and
serv
ices
that
can
bere
ques
ted
and
com
bine
din
asi
mpl
equ
ery.
The
foll
owin
gla
bels
have
been
used
for
met
adat
ain
form
atio
n:<
S>
Spe
cies
,<
G>
Gen
ecl
uste
r,<
P>
Pre
fix
leng
th,(
BS
=B
acil
lus
subt
ilis
).
Ser
vice
[A]
Sin
gle
one-
step
requ
est
[B]
Mul
tipl
e-st
epre
ques
t/sa
me
pipe
[C]
Com
plex
3-st
eps/
3-pi
pes
requ
est
[D]
Str
uctu
ralI
nfor
mat
ion
Des
crip
tion
Gen
e/C
lust
erdi
stri
buti
onth
roug
hout
the
geno
me.
Clu
ster
sw
ere
obta
ined
from
gene
-exp
r.da
ta.
Chr
omos
ome
zone
sin
volv
edin
spec
ific
biol
ogic
alpr
oces
ses
wil
lbe
high
ligh
ted
Giv
ena
seto
fD
NA
sequ
ence
s(s
ame
orga
nism
),re
late
dby
thei
rex
pres
sion
data
,to
end
upw
ith
anot
her
seto
fov
erre
pres
ente
dw
ords
iden
tifi
edat
the
up-s
trea
mpo
siti
ons
ofth
ese
quen
ces.
Giv
ena
seto
fre
late
dD
NA
sequ
ence
s(f
rom
serv
ice
[B])
toen
dup
wit
han
othe
rse
tof
rela
ted
sequ
ence
san
dit
sas
soci
ated
KW
from
whi
chto
iden
tify
asso
ciat
ion
rule
s.
Giv
ena
coll
ecti
onof
cons
ensu
spa
ttern
sfr
oma
seto
fpr
otei
nse
quen
ces
(wit
hth
eir
PD
Bco
des)
toen
dup
wit
han
stru
ctur
alm
otif
.
Inpu
tS
,GS
,G,P
S,G
G,P
DB
ld,s
tart
,end
(fro
mse
rvic
e[C
])
Exa
mpl
e<
S>
BS
<G
>at
pAat
pB<
/G>
</S
><
S>
BS<
G>
atpA
atpB
</G
><
/S>
<P>
500<
/P>
<S>
BS
<G
>at
pAat
pB<
/G>
</S
><
PD
BId
=‘1c
om’
star
t=46
end=
56/>
Dat
abas
eM
ICA
DO
EM
BL
EM
BL
,SW
ISS
PR
OT
PD
B
Pro
cedu
reS
end
aqu
ery
toM
ICA
DO
data
base
Obt
ain
entr
ies
from
EM
BL
,th
enap
ply
aE
VA
Sto
obta
inup
-str
eam
sequ
ence
.The
nap
ply
Pre
fix
and
Ana
gram
.
Sen
da
call
toE
MB
L,
then
appl
yB
last
Pto
each
entr
yto
find
rela
ted
sequ
ence
s.R
etur
nth
atse
quen
ces
anit
sas
soci
ated
keyw
ords
and
PD
Bco
des-
whe
nav
aila
ble
Sen
da
call
toP
DB
toob
tain
the
PD
Ben
try.
The
nap
ply
apr
oced
ure
toob
tain
posi
tion
san
dat
oms.
Met
hod
OD
BC
call
CG
Ica
llC
GI
call
CG
Ica
ll
EV
AS
Chr
omoD
ist
Pre
fix,
Ana
gram
Bla
stP,
Fra
gs,A
sRul
esC
ut3D
,3D
alig
n
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1597
Tabl
eI.
Con
tinu
ed.
Ser
vice
[A]
Sin
gle
one-
step
requ
est
[B]
Mul
tipl
e-st
epre
ques
t/sa
me
pipe
[C]
Com
plex
3-st
eps/
3-pi
pes
requ
est
[D]
Str
uctu
ralI
nfor
mat
ion
EV
AS
Des
crip
tion
Dis
play
sa
char
twit
hth
ege
nedi
stri
buti
onon
the
chro
mos
ome
Pre
fix:
give
nan
EM
BL
entr
yin
XM
L,t
oen
dup
wit
ha
the
up-s
trea
mse
quen
ce.
Bla
stP
:giv
enan
EM
BL
entr
y,se
arch
bysi
mil
arpr
otei
nse
quen
ces
inS
WIS
S-P
RO
T,R
etur
nth
ese
quen
cean
dit
sas
soci
ated
keyw
ords
Cut
3D:G
iven
aP
DB
entr
yan
da
rang
eof
resi
due
posi
tion
s,re
turn
the
atom
(or
amin
oaci
d)3D
coor
dina
tes
corr
espo
ndin
gto
thes
ere
sidu
esA
nagr
am;C
ompu
teth
edi
stri
buti
onof
over
-rep
rese
nted
wor
dsof
leng
thK
.
Fra
gs;i
sa
stan
dalo
nepr
ogra
mw
hose
outp
utco
nsis
tof
stat
isti
call
ysi
gnifi
cant
frag
men
tsfr
omth
ein
put
sequ
ence
(rou
ghly
nam
edco
nsen
sus
mat
ches
)
3Dal
ign:
Ali
gntw
o3D
stru
ctur
esan
dco
mpu
teth
erm
sddi
stan
cebe
twee
nth
em
Ass
Rul
es:A
ssoc
iatio
nru
les
disc
over
ing
from
ase
tof
patt
erns
and
KW
sid
enti
fied
from
agi
ven
seto
fse
quen
ces
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1598 J. F. ALDANA ET AL.
Figure 5. Graphical representation of output from Service A. Red and green colours have been used torepresent up- and down-regulated genes, and forward and backward gene direction is represented as out-and in-side the chromosome. Position is counted clockwise. A clear proximity relatedness can be observed
with potentially useful biological information.
The objective of the Service B is for given a set of related DNA sequences, clustered by their geneexpression data, to find another set of over-represented K-fragments identified from the up-streampositions that could represent putative promoters or activation signals of those genes. In this case, thereare two very simple EVASs added to the pipe: Prefix and Repeats, the former to obtain the up-streamfragments from the DNA sequences, and the latter, a tailored algorithm, to identify over-representedstrings of length K from those up-stream sequences.
Service C is more complex, oriented to obtaining additional information not only about the originalsequences, but about sequences related to the original sequences. In this case the input is also formedby a collection of DNA sequence identifiers and the final output is a collection of Keywords tobe used in an association-rules-discovering procedure. The necessary steps for this exercise are asfollows. First, from EMBL retrieve the accession numbers of the protein sequences that correspondto the collection of genes (e.g. /db xref=”SWISS-PROT:P37800). Second, obtain the sequences andkeywords corresponding to those proteins. Third, BlastP is launched to complete a database search forsimilar sequences using each of the original proteins as query sequences. Finally, obtain sequences,keywords and identifiers of the similar sequences. These are used as input to the Frags procedure,which is able to detect statistically significant patterns from the collection of sequences. Sequences,patterns and keywords are used as input to ASSRUL, a procedure to detect association rules (availablein the Engene platform) that correlate the presence of some patterns with functional keywords. This isa complete exercise that demonstrates the ability to use EVASs in the pipeline. This allows the user todynamically incorporate additional information into the gene expression data.
Finally, in Service D, we explore the possibility of combining gene-expression data with structuralinformation. In this case we use the fragments obtained as partial output in Service C, and seek to verify
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1599
Figure 6. Creation of the workflow for the Service C.
if these fragments—with strong association at the sequence level—are also associated at structurallevel. To this end, the service receives as input a collection of PDB codes together with the start andend positions of the fragments and ends up with three-dimensional (3D) alignments of such fragments.
As test case for Service C we have used the YesM-YesN two-component regulatory system,composed of 21 up-regulated genes and three down-regulated genes as reported by [36].
We will now describe with some more detail Service C, which has a similar procedure to Service B.The objective is to obtain additional functional information both from the original sequences and fromthose sequences related to the original sequences. In Table II we present the query and results from themediator. First, the mediator requests the protein sequences and its associated keywords correspondingto a given collection of genes. Results from this request are used to launch a Blast database search(see Supplementary material for detailed parameters).
The Blast procedure retrieves a set of related sequences, and the mediator is used to retrieve thesequence and its associated keywords (see Table III, and additional information in the complementarymaterial). Using these sequences, a data mining procedure (Frags) is used to detect statisticallysignificant patterns from the collection of sequences Finally, the ASSRUL procedure is used to detectassociation rules (available in the engene platform).
From the implementation point of view, the Service C construction requires the creation ofa wrapper for EMBL, and the creation and configuration of three EVAS (E-BlastP, E-Fragsand E-ASSRUL) around the three mentioned tools: BlastP, Frags and ASSRUL (see Figure 6).These EVAS are published as Web services (WSE-Blastp, WSE-Frags and WSE-ASSRULL), finallythey are interconnected in Bio-Broker using a graphical tool provided by the framework, and the desiredworkflow is obtained.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1600 J. F. ALDANA ET AL.
Table II. XML string output for Service C. This service request for the protein sequences and associated functionalkeywords of the genes belonging to a collection of clusters (non-informative keywords such as ‘Hypotheticalprotein’ and ‘Complete proteome’ have been removed). For some cases Blast output clearly shows functionalrelationships between the query sequence (in this case the protein coded by yesS gene) and a set of by similarityrelated sequences. In other cases, the keywords are less informative (e.g. compete proteome) and no-functionalinformation is obtained by similarity. In these cases, the sequences are used again in a pipeline workflow as input
to the Frags procedure in which short stretches of sequences resembling putative patterns are detected.
<?xml version="1.0" encoding="UTF-8"?><result><query>FOR $Iteration IN INTEGRAClON WHERE
($Iteration/Organism/data()=’Bacillus subtilis’ AND($Iteration/GeneName/data()=’lplB’ OR $Iteration/GeneName/data()=’licC’ OR$Iteration/GeneName/data()=’opuBD’))
</query>
<data><CLUSTERS>
<CLUSTER>
<Sequence id="BSUB0004" Specie="Bacillus subtilis">
<description>Bacillus subtilis complete genome (600701, 813890)</description>
<KW>Transmembrane; Transport; Complete proteome;</KW>
<Features><gene name="lplB" AccNumber="P39128" PID="CAB12530"/><SeqPro>METVPKKRDAPV ... FGANYIAKKFDQEGLF</SeqPro>
</Features></Sequence>. . . . . .
<Sequence id="AF008930" Specie="Bacillas subtilis">
<description>Bacillus subtilis choline transport system including ATPase (opuBA), transmembrane protein (opuBB),choline binding protein precursor (opuBC) and transmembrane protein (opuBD) genes, complete cds; and unknown gene.</description>
<KW>Transport; Amino-acid transport; Transmembrane;</KW>
<Features><gene name="opuBD" AccNumber="P39775" PID="AACl4359"/><SeqPro>MNVLEQLMTYYAQNGS .... VMAVGADLLMAWLERALSPVKKKRTGAKHVQSAA</SeqPro>
</Features></Sequence>
</CLUSTER>
</CLUSTERS>
</data><Statistics StartTime="02/09/2002 9:32:15"
EndTime"02/09/2002 10:02:30">00:30:15.5649630</Statistics></result>
Blast results for translated protein yesS geneYesS E-value Swissprot KeywordsO31522 0. DNA-binding: Transcription regulation;Q9KFJ6 2.6e-38 DNA-binding; Transcription regulationQ32071 8.1e-37 DNA-binding; Transcription regulationQ30502 1.6e-20 DNA-binding: Transcription regulation;Q9KB68 3.2e-16 DNA-binding; Transcription regulationQ8RCW7 1.0e-15Q9KBL6 1.5e-14 DNA-binding; Phosphorylation; Sensory transduction: Transcription regulation;Q9K7C1 2.5e.14 DNA-binding; Transcription regulation;Q9K690 2.1e-13 DNA-binding; Phosphorylation; Sensory transduction; Transcription regulation;O52846 5.8e-13 DNA-binding; Transcription regulation;
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1601
Table III. The rules’ significance can be established upon a variety of parameters calculated by the system:confidence (probability of rule satisfaction), support (examples in data that are covered by the rule), coverage(examples in data that are covered by the antecedent of the rule), improvement (how much more frequent is theoccurrence of the rule than normal) and leverage (additional examples covered by the rule above those expected).
CONF SUPPORT COVERAGE IMPROV LEVERAGE ANTECEDENT -> CONSEQUENT
100 1.8018 1.8018 14.8 1.6801 [+]PAT64 G-PROTEIN COUPLED RECEPTOR,GLYCOPROTEIN,MULTIGENE FAMILY,TRANSMEMBRANE
100 1.3514 1.3514 6.2535 1.1353 [+]PAT40 GLYCOPROTEIN, HYDROLASE,MEMBRANE, NERVE,NEUROTRANSMITTER DEGRADATION,SERINE ESTERASE, SIGNAL,SYNAPSE
100 1.3514 1.3514 24.667 1.2966 [+]PATS 4FE-4S, IRON-SULFUR, LYASE,MITOCHONDRION,TRANSIT PEPTIDE,TRICARBOXYLIC ACID CYCLE
100 1.5766 1.5766 6.6269 1.3387 [+]PAT43 CALCIUM,CARBOHYDRATE METABOLISM,GLYCOSIDASE, HYDROLASE,SIGNAL
The possibility of easily connecting EVASs allows one to solve these examples without requiringknowledge of the internal details of the tools connected to the service as EVASs. This characteristic isespecially suitable for allowing different users to share tools with which they are unfamiliar.
Finally, as a test case for Service D we have structurally compared two protein structures: 1dhrand 4tgf (the system has no more limitation than computational resources to deal with larger sets ofstructures). As commented previously, the set of PDB-codes can be provided directly by the user or bea partial output from Service C. As can be observed, in Figure 7 the core of the fragment resides in thetwo anti-parallel beta-strands, but additional residues from very different (and partial) two-dimensionalstructures are also aligned. In this case it represents a random geometrical relationship.
4. DISCUSSION
In this work we have presented an architectural design that allows easy and transparent integration ofseveral heterogeneous sources of information. This architecture is especially suitable for collaborativeenvironments where users wish to include their own specific software tools (EVASs) with minimalintervention and expertise required to manage the system. The inclusion of EVASs in the data flow isone of the features of the architecture and allows filtering, restructuring and processing data withoutrequiring any knowledge of the EVASs implementation.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1602 J. F. ALDANA ET AL.
Figure 7. A 28 residues length 3D-local alignment between 1DHR, pdb code for DihydropteridineReductase structure (236 residues) and 4 Transforming Grow Factor (4tgf) (50 residues) obtained usingour 3D align tool (a geometrical hashing based tool for aligning 3D structures). The root mean squared ofthe distance between alpha-carbon positions is 1.58. On the left, there is a ribbons model for the whole
unrelated structure, and on the right the local fragment.
Our framework provides a base set of templates that can be readily adapted and refined to meet userneeds, incorporating new databases and algorithms for (bioinformatics) integration.
The framework has a modular organization with flexibility for selecting and combining specificEVASs for specific needs. These EVASs designs are reusable for repeated tasks and can be shared withother users.
In this architecture, wrappers are created manually and added to the mediator modifying its sourcecode. We are addressing our efforts to a dynamic mediator through semantic mediation. It allowsone to use semantic information about data sources, such as query capabilities, data provenance,data’s schema, etc. The main goal is to provide a method for adding wrappers without source codemodifications. A secondary goal tries to provide a tool for automatic wrapper generation. There are alot of works that treat this problems, but we think that they are, in the most cases, incomplete tools orthey produce monolithic wrappers that can not be reused (either as a complete or a partial interface) inother mediators.
Different use cases have also been developed to demonstrate how this architecture and its associatedframework allow the rapid development of domain specific applications. The development of Bio-Broker is a proof of concept of the suitability of this kind of architecture for the bioinformatics domain.In particular, the usefulness of the mediator system is demonstrated by a diverse set of applicationsaimed at combining expression data with genomic, sequence-based and structural information, so as toprovide a general, transparent and powerful solution that goes beyond traditional gene expression dataclustering.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
BIO-BROKER 1603
We are currently working on the integration of gene expression data with pathways information,which would be a significant development. There are several metabolic databases available that containseveral hundred manually drawn pathway maps, but in some cases their static and separate diagramsdo not provide sufficient flexibility to integrate the information.
ACKNOWLEDGEMENTS
This work has been partially supported by grant QLK3-2001-01473 under the EU sub-programme area‘Quality of Life and Management of Living Resources’—Key Action ‘The Cell factory’ and the grant TIC2002-04586-C04-04 of the Spanish Ministerio de Ciencia y Tecnologıa. The authors would like to thank Dr. AndresRodriguez and Dr. Antonio Perez for carefully reading and recommendations.
We thank the reviewers for their comments and help towards the improvement of the paper.
REFERENCES
1. Rechenmann F. From data to knowledge. Bioinformatics 2000; 16:411.2. Abiteboul S, Buneman P, Suciu D. Data on the Web from Relations to Semistructured Data and XML. Morgan Kaufmann:
San Francisco, CA, 2000.3. Roth MT, Schwarz PM. Don’t scrap it, wrap it! A wrapper architecture for legacy data sources. Proceedings of the 23rd
International Conference on Very Large Data Bases, Athens, Greece, 25–29 August 1997, Jarke M, Carey MJ, Dittrich KR,Lochovsky FH, Loucopoulos P, Jeusfeld MA (eds.). Morgan Kaufmann: San Francisco, CA, 1997; 266–275.
4. Sahuguet A, Azavant F. Building light-weight wrappers for legacy Web data-sources using W4F. Proceedingsof the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, 7–10 September 1999,Atkinson MP, Orlowska ME, Valduriez P, Zdonik SB, Brodie ML (eds.). Morgan Kaufmann: San Francisco, CA, 1999;738–741.
5. Fahl G, Risch T, Skold M. AMOS—an architecture for active mediators. Proceedings of the 1993 Workshop on NextGeneration Information Technologies and Systems (NGITS’93). The Technian: Haifa, Israel, 1993; 47–53.
6. Garcia-Molina H, Papakonstantinou Y, Quass D, Rajararnan A, Sagiv Y, Ullman J, Vassalos V, Widom J. The TSIMMISapproach to mediation: Data models and languages. Journal of Intelligent Information Systems 1997; 2:117–132.
7. Tomasic A, Raschid L, Valduriez P. Scaling heterogeneous databases and the design of disco. Proceedings of theInternational IEEE Conference on Parallel and Distributed Information Systems, Hong Kong, 1996. IEEE ComputerSociety Press: Los Alamitos, CA, 1996; 449–457.
8. Haas L, Kossmann D, Wimmers EL, Yang J. Optimizing queries across diverse data sources. Proceedings of the 23thInternational Conference on Very Large Databases (VLDB’97). Morgan Kaufmann: San Francisco, CA, 1997; 276–285.
9. Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A. TAMBIS: Transparent access tomultiple bioinformatics information sources. Bioinformatics 2000; 16:184–186.
10. Lange M, Freier A, Scholz U, Stephanik A. A computational support for access to integrated molecular biology data.German Conference on Bioinformatics, 2001. Available at: http://www.bioinfo.de/isb/gcb01/poster/lange.html#img-1.
11. Gupta A, Ludascher B, Martone ME. Knowledge-based integration of neuroscience data sources. Proceedings of the 12thInternational Conference on Scientific and Statistical Database Management (SSDBM), Berlin, Germany, July 2000. IEEEComputer Society: Los Alamitos, CA, 2000.
12. Liu L, Buttler D, Critchlow T, Han W, Paques H, Pu C, Rocco D. BioZoom: Exploiting source-capability information forintegrated access to multiple bioinformatics data sources. Proceedings of the 3rd IEEE Symposium on BioInformatics andBioEngineering (BIBE 2003), Washington, DC, 10–12 March 2003. IEEE Computer Society Press: Los Alamitos, CA,2003.
13. Davidson S, Overton C, Tannen V, Wong L. BioKleisli: A digital library for biomedical researchers. International Journalof Digital Libraries 1997; 1:36–53.
14. IBM Corp. ‘DiscoveryLink’. http://www.ibm.com/discoverylink [23 August 2004].15. Baru C, Gupta A, Ludascher B, Marciano R, Papakonstantinou Y, Velikhov P, Chu V. XML-based information mediation
with MIX. Proceedings of the Special Interest Group on Management of Data (SIGMOD’98). ACM Press: New York,1998; 597–599.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe
1604 J. F. ALDANA ET AL.
16. Rodgriguez-Martinez M, Roussopoulos N. MOCHA: A self-extensible database middleware system for distributed datasources. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, May2000. ACM Press: New York, 2000; 213–224.
17. Buttler D, Coleman M, Critchlow T, Fileto R, Han W, Pu C, Rocco D, Xiong L. Querying multiple bioinformaticsinformation sources: Can semantic Web research help? SIGMOD Record 2002; 31(4):59–64.
18. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic AcidsResearch 2000; 28(1):45–48.
19. Wang L, Riethoven JJM, Robinson AJ. XEMBL—distributing EMBL data in XML format. Bioinformatics 2002;18(8):1147–1148.
20. Biaudet V, Samson F, Bessieres P. Micado—a network oriented database for microbial genomes. Computer Applicationsin the BioSciences 1997; 13:431–438.
21. The PDB Team (2003). Details the history, function, development, and future goals of the PDB resource. StructuralBioinformatics, Bourne PE, Weissig H (eds.). Wiley: New York, 2003; 181–198.
22. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S, Eisenberg D. DIP: The Database of Interacting Proteins. A researchtool for studying cellular networks of protein interactions. Nucleic Acids Research 2000; 30(1):303–305.
23. Bader GD, Betel D, Hogue CW. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Research 2003;31(1):248–250.
24. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Gapped BLAST and PSI-PLAST: A new generation of proteinDB search programs. Nucleic Acids Research 1997; 25(17):3389–3402.
25. Perez AJ, Rodrıguez A, Trelles O, Thode G. A computational strategy for protein structure-function assignment whichapproaches the multidomain problem. Comparative and Functional Genomics 2002; 3(5):423–440.
26. Rodrıguez A, Carazo JM, Trelles O. Mining association rules from biological databases. Journal of American SocietyInformation Systems and Technologies 2005; 56(5):493–504.
27. Garcia de la Nava J, Santaella DF, Cuenca Alba J, Maria Carazo J, Trelles O, Pascual-Montano A. Engene: The processingand exploratory analysis of gene expression data. Bioinformatics 2003; 19:657–658.
28. Aldana JF, Gomez AC, Roldan MM, Moreno N, Nebro A. Metadata functionality for semantic Web integration.Proceedings of the 7th International Conference of the International Society of Knowledge Organization (ISKO’02). Ergon:Wurzburg, Germany, 2002.
29. Aldana JF, Roldan MM, Gomez AC, Yague ME. Un framework para la consulta e integracion de fuentes de datosheterogeneas y distribuidas sobre la Web. Memorias de las Jornadas Iberoamericanas de Ingenierıa de Requisitos deAmbientes y de Software (IDEAS’2001). Instituto Technologico de Costa Rica, Centro de Informacion Tecnologica (CIT):Cartago, Costa Rica, 2001; 65–76.
30. XML. Extensible Markup Language (XML) 1.1. W3C Candidate Recommendation.http://www.w3.org/TR/2002/CR-xml11-20021015 [15 October 2003].
31. XQuery 1.0. W3C Working Draft, 16 August 2002. http://www.w3.org/TR/xquery [23 August 2004].32. XML Schema. W3C Recommendation. http://www.w3.org/TR/xmlschema-{0, 1, 2} [2 May 2001].33. RDF Schema. Resource Description Framework Schema Specification 1.0. W3C Candidate Recommendation.
http://www.w3.org/TR/2000/CR-rdf-schema-20000327 [March 2000].34. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank.
Nucleic Acids Research 2000; 28(1):235–242.35. Samson F, Biaudet-Brunaud V, Gas S, Dervyn E, Gallezot G, Duchet S, Batto JM, Ehrlich D, Bessieres P. Micado,
an Integrative Database Dedicated to the Functional Analysis of Bacillus Subtilis and Microbial Genomics. FunctionalAnalysis of Bacterial Genes, Schumann W, Ehrlich SO, Ogasawara N (eds.). Wiley: New York, 2000.
36. Kobayashi K, Ogura M, Yamaguchi H, Yoshida K, Ogasawara N, Tanaka T, Fujita Y. Comprehensive DNA microarrayanalysis of Bacillus subtilis two-component regulatory systems. Journal of Bacteriology 2001; 183(24):7365–7370.
37. Tamames J, Casari G, Ouzounis C, Valencia A. Conserved clusters of functionally related genes in two bacterial genomes.Journal of Mollecular Evolution 1997; 44:66–73.
38. Dandekar T et al. Systematic genomic screening and analysis of mRNA in untranslated regions and mRNA precursors:Combining experimental and computational approaches. Bioinformatics 1998; 14:271–278.
Copyright c© 2006 John Wiley & Sons, Ltd. Softw. Pract. Exper. 2006; 36:1585–1604DOI: 10.1002/spe