
SOFTWARE—PRACTICE AND EXPERIENCE
Softw. Pract. Exper. 2006; 36:1585–1604
Published online 13 July 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/spe.733

Bio-Broker: a tool for integration of biological data sources and data analysis tools

J. F. Aldana, M. Roldan-Castro, I. Navas, M. M. Roldan-García, M. Hidalgo-Conde and O. Trelles∗,†

E.T.S.I. Informatica, Campus de Teatinos, 29071, Malaga, Spain

SUMMARY

In this work we present an architecture for XML-based mediator systems and a framework for helping systems developers in the construction of mediator services for the integration of heterogeneous data sources. A unique feature of our architecture is its capability to manage (proprietary) user’s software tools and algorithms, modelled as Extended Value Added Services (EVASs), and integrated in the dataflow. The mediator offers a view of the system as a single data source where EVASs are readily available for enhancing query processing. A Web-based graphic interface has been developed to allow dynamic and flexible EVAS inter-connection, thus creating complex distributed bioinformatics machines. The feasibility and usefulness of our ideas have been validated by the development of a mediator system (Bio-Broker) and by a diverse set of applications aimed at combining gene expression data with genomic, sequence-based and structural information, so as to provide a general, transparent and powerful solution that integrates data analysis tools and algorithms. Copyright © 2006 John Wiley & Sons, Ltd.

Received 23 August 2004; Revised 15 November 2005; Accepted 16 November 2005

KEY WORDS: biological mediation; biological dataflow; data analysis tools; XML

1. INTRODUCTION

Technological breakthroughs such as high-throughput sequencing and gene-expression monitoring technology have nurtured the ‘omics’ revolution, enabling the massive production of data. Unfortunately, this valuable information is often dumped in proprietary data models and specific services are developed for data access and analysis, without forethought to the potential external exploitation and integration of such data.

∗Correspondence to: Oswaldo Trelles, Departamento de Arquitectura de Computadores, Universidad de Malaga, E.T.S.I. Informatica, Campus de Teatinos, 29071, Malaga, Spain.
†E-mail: [email protected]

Contract/grant sponsor: Spanish Ministerio de Ciencia y Tecnología; contract/grant numbers: QLK3-2001-01473 and TIC2002-04586-C04-04

However, dealing with the exponential growth rates of biological data is a simple problem when compared with the problem posed by the diversity, heterogeneity and dispersion of data [1]. Nowadays, the accumulated biological knowledge needed to produce a more complete view of any biological process is disseminated around the world in the form of biological sequence and structure databases, frequently as flat files, as well as image/scheme-based libraries, Web-based information with particular and specific query systems, etc. Under these conditions, this plethora of interrelated information becomes difficult to use, pointing to the integration of these information sources for unified access, independently of possible internal changes, as a clear and important technological priority.

The database community has been engaged for many years with the problem of integrating heterogeneous data sources [2]. A first approach was based on wrappers, which act as interfaces for the data sources, translating their structure into a common model [3,4]. Later, the focus moved to the use of mediators for encapsulating the knowledge needed for evaluating a query over multiple wrappers. The wrapper–mediator approach sets up an interface for a group of (semi-structured) data sources, amalgamating their local schemas into a global schema and integrating the information of the local sources. As a result, the views of the data that mediators offer are coherent, producing semantic reconciliation of the common data model representations carried out by the wrappers. Some examples of the wrapper–mediator systems are Amos [5], Tsimmis [6], Disco [7] and Garlic [8]. In the specific field of biological data there are the following examples: TAMBIS [9], BioDataServer [10], KIND [11], BioZoom [12], BioKleisli [13] and DiscoveryLink [14]. With the advent of XML, some of the mediator systems have evolved toward the XML standard (Amos and Tsimmis), whilst other projects were initially XML-based, such as MIX [15] and MOCHA [16].

We have designed an architecture called Bio-Broker for XML-based mediator systems and developed a framework for assisting systems developers in the construction of mediator services that integrate heterogeneous data sources. Our architecture allows the easy development of Extended Value Added Services‡ (EVASs). This is especially suitable for environments where users have to share tools with which they are unfamiliar. The final user will have a view of the system as a single data source and, furthermore, the capability of easily selecting, adding and modifying EVASs for enhancing query processing.

In the bioinformatics context, researchers often use a variety of tools for data analysis. Each of these tools has its own interfaces, data formats and semantics, and probably needs to be interconnected with others in a workflow in order to define a processing pipeline, thus becoming complex virtual bioinformatic machines [17]. Today, users build these workflows by hand, which may be a source of errors and involves a time cost. It is important to automate this process to free biologists of this repetitive task. We propose the use of EVASs for making the workflow construction easier. Each EVAS encapsulates a software tool as a black box that may be used as a building block, which can be connected to another EVAS, building the desired workflow in an almost automatic way. EVASs may be interconnected without requiring knowledge of the internal details of the encapsulated tools. Furthermore, this allows private software tools to be shared in a ‘pay per use’ way of operation.

‡An EVAS is an external software tool integrated in the mediator (e.g. a data mining algorithm).


The flexible use of EVASs is an outstanding characteristic of our system, which is not provided by traditional or biological mediators such as TAMBIS, BioDataServer, KIND, BioZoom or BioKleisli.

Among the developments closely related to Bio-Broker is BioMOBY, which provides a way to create Web services as well as a central registry that has information on all the services that are available (BioMOBY). The main difference between BioMOBY and Bio-Broker is that Bio-Broker is a mediator system, i.e. an intermediary between data sources, services and users that offers a view of the whole system as a single resource, where EVASs are readily available for enhancing query processing (see complementary material at http://uranos.khaos.uma.es/mediator for details). Another related project is myGRID, which also provides a prototype system named Taverna for building workflows (Taverna).

However, Taverna only offers a way to build and execute workflows; it does not include integration capabilities. Bio-Broker, in contrast, combines traditional data integration through an integration schema with workflow definition and execution (see complementary material for details).

As an example of such integration procedures we have developed a biological instance of the framework in the context of gene-expression data. In this context, the query space is, in practical terms, unbounded and dynamic. So the problem is not only to integrate diverse and scattered data sources but also to find a way to analyse the queries, to identify which service or database should resolve each of them, and to integrate dispersed services and apply them to the selected set of data.

This biological mediator (Bio-Broker) integrates genetic information and various processing mechanisms developed by different users and groups through a graphical interface that allows fast and intuitive ‘wiring’ of EVAS components, expanding the functionality and enabling the easy incorporation of new procedures to customize the system for specific concerns. Bio-Broker allows uniform access to data stored in different databases, presently SWISS-PROT [18], EMBL [19], MICADO [20], PDB [21], DIP [22] and BIND [23], as well as to several processing services such as BLAST [24], AnaGram [25], Frags [26] and AsocRules [26]. The end result of this service is the ability to combine gene-expression data with genomic, sequence-based and structural information, so as to supply a general, transparent and powerful solution. As a proof of concept, Bio-Broker has also been used to extend the capacity of engene, a versatile, Web-based and platform-independent exploratory data analysis tool for gene expression data (see [27] for more details).

Here we describe both the architecture design and the development mechanisms needed to build up the mediator. We also define the problems selected to illustrate the system. We then describe a mediator system, Bio-Broker, which offers enhanced query capabilities beyond gene-expression clustering.

2. SYSTEM AND METHODS

2.1. An XML-based architecture for mediator design

The mediator system architecture [28,29] is based on the use of XML standards. The key characteristics of the architecture and its main components are as follows (see Figure 1).

Figure 1. General mediator architecture. This picture shows the classical architecture of a mediator system extended with an EVAS module. The latter is a main feature of our architecture together with the use of metadata at the client side. EVASs are, in a way, modelling recurrent processes (workflows) that can be shared by several users.

(1) The architecture is XML-based. XML [30] is itself a standard format for the representation, interchange and validation of data and, together with the XQuery query language [31] and the XML Schema [32], forms a complete data model for dealing with information on the Web. XML Schema [32] and RDF Schema [33] are used for describing all the metadata necessary for query processing in this architecture as well as for expressing the data resulting from the queries.

(2) XQuery is the query language used by the mediators. All the clients connected to a mediator must send a valid XQuery query to the mediator and will receive an XML document. The client can be any sophisticated query generation mechanism and the only requirement is that it must send a valid XQuery query independent of how this query is generated. A generic tool is available for speeding up the development of specific graphical query interfaces.

(3) The query processor is the component that receives a query and generates a sub-query for each of the data sources involved in the query. It analyses the data requirements of the query, identifying the data sources and generating a global plan for evaluating the query. The global plan contains information on the various alternatives for solving the same data requirement. Later, the global plan is fragmented into one sub-query for each of the selected data sources.

(4) Metadata are used to express all information needed for the above process. This includes information both on the logical schemas produced by the wrappers that encapsulate the data sources, and on the semantic equivalence of the terms produced by the data sources (the integration schema). We also store information about the query processing capabilities of these wrappers and about the location and availability of the data sources for optimizing access to the data. Specific user information such as privileges, EVAS usage and preferences is also stored.

(5) EVASs are all those processes that allow the user to further filter or post-process the queries. In a sense, EVASs model arbitrarily complex repetitive processes that are applied to the data obtained from the data sources and are shared by several users. Being application specific, these processes (e.g. BLAST, AnaGram, Frags) cannot be efficiently incorporated into the query language. On the other hand, being repetitive tasks performed by several users, their automation and optimization should be assisted by the mediator.

(6) Wrappers are the components that translate the various data sources being integrated into the common data model (XML/XQuery). Wrappers hide the internal complexity of the data sources and offer, in a controlled manner, their query capabilities. There is one wrapper for each data source integrated into the mediator. Wrappers receive XQuery queries and produce XML documents that conform to the XML schema (integration schema).
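The two stages of the query processor described in item (3) can be illustrated with a minimal Python sketch. It is not the Bio-Broker implementation: the field names, the `LOCATION_SCHEME` table and the rule of picking the first available alternative are assumptions made only for illustration.

```python
from dataclasses import dataclass

# Hypothetical metadata: which wrapped sources can answer each
# integration-schema field (names chosen for illustration only).
LOCATION_SCHEME = {
    "GeneName": ["EMBL", "MICADO"],
    "ProteinSequence": ["SWISS-PROT"],
    "Structure": ["PDB"],
}


@dataclass(frozen=True)
class SubQuery:
    source: str
    fields: tuple


def global_plan(requested_fields, scheme=LOCATION_SCHEME):
    """List, for every requested field, all sources able to provide it."""
    plan = {}
    for field in requested_fields:
        if field not in scheme:
            raise KeyError(f"no data source offers field {field!r}")
        plan[field] = list(scheme[field])
    return plan


def fragment(plan, available):
    """Choose one available source per field and emit one sub-query per source."""
    by_source = {}
    for field, alternatives in plan.items():
        chosen = next((s for s in alternatives if s in available), None)
        if chosen is None:
            raise RuntimeError(f"no available source for {field!r}")
        by_source.setdefault(chosen, []).append(field)
    return [SubQuery(src, tuple(fields)) for src, fields in by_source.items()]
```

Keeping the alternatives in the global plan is what lets the fragmentation step fall back to another source (here MICADO instead of EMBL) when one is unavailable.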

2.2. From architecture to mediators

A framework has been developed for simplifying the mediator construction task. The mediator instantiation process involves the construction of a particular application of the framework by the development of wrappers and the use of metadata for the configuration of the different mediator components. Wrappers are the means of translation between the reference model of the architecture (XML) and the data source model. Once the mediator has been constructed, the mediator user can dynamically manage the EVASs, adding or removing them as necessary.

The main tasks involved in mediator configuration are as follows.

Wrapper construction. Specific wrappers are necessary for each data source integrated into the mediator. To simplify this task, our system offers ‘wrapper construction schemes’ for three of the more frequent types of data sources (relational databases, XML documents and HTML documents).

Mediator configuration through metadata. Metadata will be used for adapting the mediator components. Specifically they will be used to configure both the location and integration schemes.

• The location scheme links the wrappers to the mediator. It is used for identifying the data source of a query and selecting the appropriate wrapper.

• The integration scheme specifies the data translations between the data model used by the client and that offered by the wrappers. For each database field, it specifies by XPath its location in the XML document given by the wrapper. Furthermore, it gives information on the data source where such data are available.
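As a rough illustration of the integration scheme, the sketch below maps one integration-schema field to a path expression per data source and uses it to pull values out of a wrapper result. The field name, the paths and the XML layouts are invented for the example, and the limited XPath subset of Python’s `xml.etree.ElementTree` stands in for full XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical integration scheme: field -> {source: path into that
# wrapper's XML output}. Paths and element names are assumptions.
INTEGRATION_SCHEME = {
    "GeneName": {
        "EMBL": "./entry/feature/gene",
        "MICADO": "./gene/name",
    },
}


def extract(field, source, xml_text, scheme=INTEGRATION_SCHEME):
    """Return the values of `field` found in one wrapper's XML document."""
    path = scheme[field][source]
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]
```

The same client-side field name resolves to different locations depending on the wrapper, which is the translation role the integration scheme plays.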



Figure 2. Developing and publishing of E-BlastP, an EVAS which performs a BlastP search using a query protein sequence and some (user selected) parameters. E-BlastP receives the input as an XML document. Therefore, before using BlastP, E-BlastP parses and translates this input. BlastP results are then translated into an XML output document. The right-hand side of the figure provides a snapshot of the graphical interface for building workflows by pipelining EVASs.

2.3. Dynamic integration of EVASs

An EVAS is a software tool adapted for connection to the mediator. EVASs are not part of the mediator core, and it is not possible to know in advance how the EVAS structure will be built: EVASs are linked dynamically while the mediator system is in use. Therefore, although the EVAS configuration is made using metadata, the framework offers mediator users a management tool to connect, arrange and remove EVASs without any knowledge of the mediator’s internal architecture, thus enabling a wide range of processing alternatives to be specified for relating, in different ways, diverse sources of biological data.

The integration of EVAS with the mediator consists of two main phases.

(1) Creation of the EVAS. The tool (program) to be integrated into the mediator has to be adapted to obtain a new program (EVAS) with three parameters: an input XML document, an XQuery query and a list of zero or more parameters (those that correspond to the tool). The resulting EVAS will use these three parameters for calling the original tool, thus obtaining an output XML document. The input and output XML documents must conform to the schemas that determine its connection in the workflow (see Figure 2).

(2) Connection of the EVAS to the mediator. Once an EVAS has been created, it has to be linked to the mediator. This implies the following tasks.

(a) Publication of the EVAS as a Web service. It will be made available as a Web method named Query, or as a server-side program (CGI, ASP, JSP, etc.).

(b) Registration of the Web service in the system, sending information about this service to the mediator administrator.

(c) Obtaining a DLL library, which performs a call to the Web service. This library allows the user to add services dynamically to the mediator without changes in its source code. The library will be automatically generated by the mediator administrator, using previously implemented templates.

(d) Registration of the library in the system. For this, some metadata, including the name of the library, the name of the service, and the input and output XML schemas, will be added.
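The EVAS creation phase (an adapter taking an input XML document, a query and a parameter list, and returning an output XML document) can be sketched as follows. This is only an assumed shape, not the authors’ code: the standard library has no XQuery engine, so the query argument is simplified to an `ElementTree` path expression, and `reverse_complement` is a toy stand-in for the encapsulated tool.

```python
import xml.etree.ElementTree as ET


def reverse_complement(seq):
    """Toy stand-in for the original tool wrapped by the EVAS."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def evas(input_xml, query, params, tool=reverse_complement):
    """Adapter with the three EVAS parameters described in the text.

    input_xml: XML document with the input data.
    query:     selects the nodes to process (a path expression here,
               in place of the XQuery query used by the real system).
    params:    zero or more extra arguments forwarded to the tool.
    """
    root = ET.fromstring(input_xml)
    output = ET.Element("result")
    for node in root.findall(query):
        item = ET.SubElement(output, "item")
        item.text = tool(node.text or "", *params)
    return ET.tostring(output, encoding="unicode")
```

Because both input and output are XML, the output of one such adapter can be passed directly as the input of the next, which is what makes EVASs usable as workflow building blocks.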

When a set of EVASs has been developed and connected (see Figure 2) to the system, users can create workflows using a graphical tool provided by the system. This tool has been developed using C# (the Microsoft programming language) and makes use of an internal representation for workflows. This tool allows users to connect databases, EVASs and the data integrator. Once the mediator has been developed and configured, it is ready for use by the final clients. The clients can use the mediator either through a programming interface or through a graphical interface (for details about this tool see the supplementary material).

This architecture does not impose any restriction on the definition of EVASs or on the way in which they are connected in workflows. Generic EVASs can be defined and they can also be replicated and/or specialized to facilitate availability and/or communication efficiency. For example, running a BLAST query on a custom set of genes would require transferring the entire set of genes for comparison against the BLAST server (typically many gigabytes). It would be much more efficient to link this EVAS to a database that already has the necessary genes, so that a simple service can recover the genes from the database, avoiding communication overheads.

2.4. Biological database integration: Bio-Broker

We present a sample use of the mediator system for the integration of gene expression applications. In general, clustering is one of the most common forms of analysis for such applications, and once the data has been clustered, additional post-processing analysis has to be performed to extract relevant knowledge.

In a sense, clustering analysis is incomplete without integrating it with functional information contained in nucleotide and protein databases (both at sequence and structure level) as well as with information contained in emerging pathways and assembly databases. A common option is the static integration of these very diverse data sources (equivalent to a pre-annotation process). However, what is really necessary and much more interesting is the dynamic integration of services onto this data so as to obtain valuable information from the diverse sources.

At present, Bio-Broker integrates many different types of data sources: nucleotide (EMBL) and protein (SWISS-PROT) sequence information [18]; the worldwide repository of three-dimensional structures of biological macromolecules, the Protein Data Bank (PDB) [34]; the MICrobial Advanced Database Organization (MICADO), a relational database dedicated to microbial genomes [20] and functional analysis of Bacillus subtilis [35]; the Database of Interacting Proteins (DIP); and the Biomolecular Interaction Network Database (BIND), designed to store full descriptions of interactions, molecular complexes and pathways.



These databases have been selected on the basis of differences in content, format, access mechanism and geographical location. Our intention is to show how these very diverse data sources can be easily integrated using the proposed architecture and how the possibility of easily adding EVASs allows the user to mix different tools for obtaining enhanced data processing capabilities.

Services requested of the mediator help to recover information associated with a given set of clustered genes, independently of the way in which these clusters were obtained: on the basis of their expression levels, by association rule discovery or by any other appropriate method.

3. RESULTS

3.1. Bio-Broker architecture

Figure 3 shows the complete architecture of Bio-Broker. A user interface (optionally a graphical one) is used for accessing the services. The queries, expressed in terms of the integration schema, are sent in XQuery to the server, which divides them into different sub-queries. The sub-queries are sent to the different databases. Each database has its own query and XML result document construction mechanism (i.e. database-specific wrappers). The service receives the sub-query results. At this point, it is possible to apply an EVAS to process the sub-query results. For example, a new EVAS can be applied to the first EVAS process result, and so on.

The results of each sub-query (with or without applied EVAS processing) are integrated, removing duplicates and inconsistencies. Duplications can be removed by taking advantage of the mappings between the integration schema and the resource schemas. Data that the mediator receives is expressed in terms of the mediator’s integration schema, and the mappings identify any duplicated items and remove one of them from the integrated data. Note that EVAS processing does not include duplicate removal; this is performed in the integration step.
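A minimal sketch of the duplicate-removal idea follows, assuming that sub-query results have already been translated into records keyed by integration-schema field names. The choice of `GeneName` as the identifying field and the first-record-wins merge rule are assumptions for illustration; the paper does not specify them.

```python
def integrate(sub_results, key_fields=("GeneName",)):
    """Merge records from several sources, keeping one record per key.

    sub_results: list of (source_name, records) pairs, where each record
                 is a dict expressed in integration-schema field names.
    """
    merged = {}
    for _source, records in sub_results:
        for record in records:
            key = tuple(record.get(field) for field in key_fields)
            if key not in merged:
                merged[key] = dict(record)
            else:
                # Duplicate item: keep the existing values and only
                # fill in fields that the first record was missing.
                for name, value in record.items():
                    merged[key].setdefault(name, value)
    return list(merged.values())
```

Since every source already speaks the integration schema, comparing records reduces to comparing the same field names across sources.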

Next, the integrated solution can be post-processed by a set of EVASs. The final result is then sent to the user interface. It is worth observing that where and how these EVASs are implemented is completely irrelevant: they are just Web services integrated into the mediator system. In the architecture these modules are used as a ‘black box’. They can even be developed by other people. To include a new EVAS one only needs to know its address (IP) and the service must conform to the DTD/schema.

The integration schema of Bio-Broker stores both the data provenance (the database(s) from which the data can be obtained) and the query capabilities (the elements of the resource scheme by which users can make queries and the expressiveness of the query language). Data provenance is an important issue due to the necessity of assessing the quality of data in the biological context. The quality of data obtained from proprietary Web services that are not up to date or well ordered would not be as good as that obtained from well-known, widely used biological Web services, such as EMBL, PDB, etc. Therefore, data provenance is not only stored in the integration schema but also annotated in the answer so that the source can be identified, and the users can infer its quality according to their confidence in the data source. Thus, we can for example annotate the GeneName field with metadata showing that it can be obtained from the EMBL, GenBank or MICADO databases (see the supplementary material).
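One way to picture provenance annotated in the answer is an attribute on each returned element. The attribute name `provenance` and the comma-separated encoding below are assumptions for this sketch; the actual annotation format is described in the supplementary material, not here.

```python
import xml.etree.ElementTree as ET


def annotate(field, value, sources):
    """Build one answer element carrying its data provenance."""
    element = ET.Element(field, {"provenance": ",".join(sources)})
    element.text = value
    return ET.tostring(element, encoding="unicode")


def sources_of(xml_text):
    """Read the provenance annotation back from an answer element."""
    return ET.fromstring(xml_text).get("provenance", "").split(",")
```

A client could then compare `sources_of(...)` against its own list of trusted databases when judging the quality of a value.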

Figure 3. Bio-Broker architecture: data access, pre-processing, integration, post-processing and user interfaces.

Figure 4. Instance of our integration architecture with three EVASs.

Please note that we have not defined an architecture with multiple middle layers; rather, we have defined an architecture with three layers: user interfaces, the mediator and wrappers. It is also important to note that although EVASs are fully integrated in the architecture they are not mandatory and so do not represent another architectural layer; therefore, it is not necessary to include a set of EVASs for each query processing task. Figure 4 shows an instance of our architecture for three EVASs, applied after the integration step. Addition of EVASs in the integration process does not impose a performance penalty on the query processing step: the total time is simply the sum of the time spent running these tools and the time required to retrieve the data. Furthermore, the end-user cost of using these tools is always less than that of using them manually, since the results are integrated. That is, the repetitive manual application of filters and transformation tools to data retrieved from data sources has a high cost; this is decreased by the mediator, which applies them automatically. Besides, workflows are executed automatically, taking advantage of metadata, so the cost of maintenance is limited to the definition of the workflow according to user needs.
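The cost argument above (total time as the sum of retrieval time and the time of each tool) can be made concrete with a small sketch that chains processing steps and records how long each takes. The step names and the use of plain Python callables in place of remote EVAS calls are assumptions for illustration.

```python
import time


def run_workflow(data, steps):
    """Apply each (name, callable) step to the output of the previous one.

    Returns the final result and a list of (name, seconds) timings, so
    the total cost is the sum of the individual step times.
    """
    timings = []
    for name, step in steps:
        start = time.perf_counter()
        data = step(data)
        timings.append((name, time.perf_counter() - start))
    return data, timings
```

In a real deployment the first step would be the data retrieval through the wrappers and each later step a Web service call, but the additive structure is the same.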

3.2. Bio-Broker services

As a demonstrable ‘proof of concept’ that can be evaluated and tested in the field we present the integration of gene expression data with the biological databases described in Section 2. The data set described by [36] will be used. This data set is composed of the cDNA expression levels of 4005 Bacillus subtilis two-component regulatory systems in which the overproduction of a response regulator of the two-component systems, coinciding with a deficiency of its cognate sensor kinase, affects the regulation of genes, including its target genes. The genome-wide effect on gene expression caused by the overproduction was analysed on 24 two-component systems (http://www.genome.ad.jp/kegg/expression).

The engene platform (at http://chirimoyo.ac.uma.es/bitlab) was used to cluster this subset of gene expression data. The mediator system is hosted at http://uranos.khaos.uma.es and is available through Web services (see http://uranos.khaos.uma.es/mediator). The Web method ‘connect’ supplies an identifier for communicating with the system. The WSDL description exposes two methods for querying, ‘ConsultCompress’ and ‘Consult’. The latter produces a ‘string’ XML document, while the former compresses the data, and the client must use Xmill to decompress it. Parameters for both methods are an array with the query (queries), the number of queries, the userid obtained when connected to the system, a Boolean array that specifies the services to use, and the parameters needed for each service (see the complementary material).
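The parameter list of the querying methods can be pictured with the sketch below, which only assembles the five arguments in the order given in the text and performs no network call. The class and attribute names are hypothetical, and the assumption that the Boolean array and the per-service parameters are aligned position by position is ours; the real signature is given in the WSDL and the complementary material.

```python
from dataclasses import dataclass


@dataclass
class ConsultRequest:
    """Hypothetical container for the arguments of 'Consult'."""
    queries: list          # XQuery strings
    user_id: str           # identifier returned by 'connect'
    use_services: list     # one Boolean per registered service (assumed)
    service_params: list   # one parameter list per registered service (assumed)

    def as_arguments(self):
        """Return the arguments in the order described in the text."""
        if len(self.use_services) != len(self.service_params):
            raise ValueError("one parameter list is expected per service flag")
        return (
            self.queries,
            len(self.queries),   # the number of queries
            self.user_id,
            self.use_services,
            self.service_params,
        )
```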

Table I describes four samples of implemented queries. These examples describe different levels of complexity, and the different databases and services that can be requested and combined in a simple query. They have been designed and published as services at http://uranos.khaos.uma.es/mediator, thus providing biologists with an easy Web interface for the common queries that they often need to solve.

The first service (Service A in Table I) sends a request to the mediator system for information about the ‘genomic context’ of clustered genes, whose most straightforward definition is based on the physical proximity of genes in the chromosome [37,38]. It is well accepted that neighbouring genes in bacteria are often functionally related, so drawing the gene/cluster distribution provides additional information for interpreting expression data. This service works on MICADO, which contains detailed information on Bacillus subtilis. This is a very simple request and in a way is equivalent to a pre-annotation process that can be used to include information for analysis from very different sources in the data collection (see Figure 5). Additional information about the query formulation can be found in the complementary material.



Table I. Requested services. Four examples are shown to describe different levels of complexity, and the different databases and services that can be requested and combined in a simple query. The following labels have been used for metadata information: <S> Species, <G> Gene cluster, <P> Prefix length (BS = Bacillus subtilis).

| Service | [A] Single one-step request | [B] Multiple-step request/same pipe | [C] Complex 3-steps/3-pipes request | [D] Structural information |
| --- | --- | --- | --- | --- |
| Description | Gene/cluster distribution throughout the genome. Clusters were obtained from gene-expression data. Chromosome zones involved in specific biological processes will be highlighted. | Given a set of DNA sequences (same organism), related by their expression data, to end up with another set of over-represented words identified at the up-stream positions of the sequences. | Given a set of related DNA sequences (from service [B]), to end up with another set of related sequences and their associated KW from which to identify association rules. | Given a collection of consensus patterns from a set of protein sequences (with their PDB codes), to end up with a structural motif. |
| Input | S, G | S, G, P | S, G | G, PDB Id, start, end (from service [C]) |
| Example | <S>BS<G>atpA atpB</G></S> | <S>BS<G>atpA atpB</G></S><P>500</P> | <S>BS<G>atpA atpB</G></S> | <PDB Id=‘1com’ start=46 end=56/> |
| Database | MICADO | EMBL | EMBL, SWISS-PROT | PDB |
| Procedure | Send a query to the MICADO database. | Obtain entries from EMBL, then apply an EVAS to obtain the up-stream sequence. Then apply Prefix and Anagram. | Send a call to EMBL, then apply BlastP to each entry to find related sequences. Return those sequences and their associated keywords and PDB codes, when available. | Send a call to PDB to obtain the PDB entry. Then apply a procedure to obtain positions and atoms. |
| Method | ODBC call | CGI call | CGI call | CGI call |
| EVAS | ChromoDist | Prefix, Anagram | BlastP, Frags, AsRules | Cut3D, 3Dalign |
| EVAS description | Displays a chart with the gene distribution on the chromosome. | Prefix: given an EMBL entry in XML, to end up with the up-stream sequence. Anagram: compute the distribution of over-represented words of length K. | BlastP: given an EMBL entry, search for similar protein sequences in SWISS-PROT; return the sequence and its associated keywords. Frags: a standalone program whose output consists of statistically significant fragments from the input sequence (roughly named consensus matches). AssRules: association rule discovery from a set of patterns and KWs identified from a given set of sequences. | Cut3D: given a PDB entry and a range of residue positions, return the atom (or amino acid) 3D coordinates corresponding to these residues. 3Dalign: align two 3D structures and compute the rmsd distance between them. |


Figure 5. Graphical representation of output from Service A. Red and green colours have been used to represent up- and down-regulated genes, and forward and backward gene direction is represented as out- and in-side the chromosome. Position is counted clockwise. A clear proximity relatedness can be observed with potentially useful biological information.

The objective of Service B is, given a set of related DNA sequences clustered by their gene expression data, to find a set of over-represented K-fragments identified from the up-stream positions that could represent putative promoters or activation signals of those genes. In this case, two very simple EVASs are added to the pipe: Prefix and Repeats. The former obtains the up-stream fragments from the DNA sequences, and the latter, a tailored algorithm, identifies over-represented strings of length K in those up-stream sequences.
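The two steps of this pipe can be approximated by the sketch below. It is not the authors’ tailored algorithm: over-representation is reduced here to a simple threshold on the number of up-stream sequences containing a word, and the function names, the 0-based coordinates and the `min_count` parameter are assumptions.

```python
from collections import Counter


def prefix(sequence, gene_start, length):
    """Return up to `length` bases immediately up-stream of `gene_start`.

    `gene_start` is a 0-based index into `sequence`.
    """
    return sequence[max(0, gene_start - length):gene_start]


def overrepresented(upstreams, k, min_count):
    """Return the words of length k present in at least `min_count` sequences."""
    counts = Counter()
    for seq in upstreams:
        # Count each word once per sequence, so that a single repetitive
        # sequence cannot make a word look over-represented on its own.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return sorted(word for word, count in counts.items() if count >= min_count)
```

A statistically grounded version would compare the observed counts with those expected from a background model of the genome, rather than use a fixed threshold.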

Service C is more complex and is oriented to obtaining additional information not only about the original sequences, but also about sequences related to them. In this case the input is again a collection of DNA sequence identifiers, and the final output is a collection of keywords to be used in an association-rule discovery procedure. The necessary steps are as follows. First, retrieve from EMBL the accession numbers of the protein sequences that correspond to the collection of genes (e.g. /db_xref="SWISS-PROT:P37800"). Second, obtain the sequences and keywords corresponding to those proteins. Third, launch BlastP to search the database for similar sequences, using each of the original proteins as a query sequence. Finally, obtain the sequences, keywords and identifiers of the similar sequences. These are used as input to the Frags procedure, which detects statistically significant patterns in the collection of sequences. Sequences, patterns and keywords are then used as input to ASSRUL, a procedure (available in the Engene platform) that detects association rules correlating the presence of some patterns with functional keywords. This complete exercise demonstrates the ability to use EVASs in the pipeline, which allows the user to dynamically incorporate additional information into the gene expression data.
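The first step, recovering protein accession numbers from the cross-reference qualifiers of an EMBL entry, might look like the sketch below. It assumes the entry is available as flat-file text containing qualifiers of the form shown in the example above; the function name `swissprot_accessions` and the sample lines are hypothetical, and the mediator itself may well work on the XML form of the entry instead.

```python
import re

# Matches qualifiers such as /db_xref="SWISS-PROT:P37800" and captures the accession.
DB_XREF = re.compile(r'/db_xref="SWISS-PROT:([A-Z0-9]+)"')


def swissprot_accessions(embl_entry_text: str) -> list:
    """Return SWISS-PROT accessions in order of first appearance, without duplicates."""
    found = DB_XREF.findall(embl_entry_text)
    # dict.fromkeys preserves insertion order while dropping repeats.
    return list(dict.fromkeys(found))


# Invented feature-table lines, for illustration only.
sample_entry = (
    'FT   /db_xref="SWISS-PROT:P37800"\n'
    'FT   /db_xref="GOA:P37800"\n'
    'FT   /db_xref="SWISS-PROT:P37800"\n'
)
accessions = swissprot_accessions(sample_entry)
```

The resulting list of accessions is what the later steps (sequence and keyword retrieval, then BlastP) would take as input.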

Finally, in Service D, we explore the possibility of combining gene-expression data with structural information. In this case we use the fragments obtained as partial output in Service C, and seek to verify whether these fragments, which are strongly associated at the sequence level, are also associated at the structural level. To this end, the service receives as input a collection of PDB codes together with the start and end positions of the fragments, and produces three-dimensional (3D) alignments of those fragments.

Figure 6. Creation of the workflow for Service C.

As a test case for Service C we have used the YesM-YesN two-component regulatory system, composed of 21 up-regulated genes and three down-regulated genes, as reported in [36].

We now describe Service C in more detail; its procedure is similar to that of Service B. The objective is to obtain additional functional information both from the original sequences and from sequences related to them. Table II presents the query and the results from the mediator. First, the mediator requests the protein sequences, and their associated keywords, corresponding to a given collection of genes. The results of this request are used to launch a Blast database search (see Supplementary material for detailed parameters).

The Blast procedure retrieves a set of related sequences, and the mediator is used to retrieve each sequence and its associated keywords (see Table III, and additional information in the supplementary material). Using these sequences, a data mining procedure (Frags) detects statistically significant patterns in the collection of sequences. Finally, the ASSRUL procedure (available in the Engene platform) is used to detect association rules.

From the implementation point of view, constructing Service C requires the creation of a wrapper for EMBL, and the creation and configuration of three EVASs (E-BlastP, E-Frags and E-ASSRUL) around the three tools mentioned: BlastP, Frags and ASSRUL (see Figure 6). These EVASs are published as Web services (WSE-BlastP, WSE-Frags and WSE-ASSRUL). Finally, they are interconnected in Bio-Broker using a graphical tool provided by the framework, and the desired workflow is obtained.
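The idea of interconnecting EVASs so that the output of one feeds the next can be illustrated with plain function composition. This is a conceptual sketch only: `connect` is a hypothetical helper, and the three step functions are placeholders that merely relabel their input; they do not call BlastP, Frags or ASSRUL, and Bio-Broker performs this wiring through Web services and a graphical tool rather than in code.

```python
from functools import reduce


def connect(*steps):
    """Chain steps so that the output of each one becomes the input of the next."""
    def workflow(data):
        return reduce(lambda acc, step: step(acc), steps, data)
    return workflow


# Placeholder stand-ins for the wrapped tools (hypothetical behaviour).
def e_blastp(protein_ids):
    # Pretend every query protein has exactly one similar sequence.
    return protein_ids + ["similar_to_" + p for p in protein_ids]


def e_frags(sequence_ids):
    # Pretend one pattern is detected per sequence.
    return [(s, "pattern_in_" + s) for s in sequence_ids]


def e_assrul(sequence_pattern_pairs):
    # Pretend each pattern yields one rule linking it to a keyword.
    return [f"{pattern} -> keyword_of_{seq}" for seq, pattern in sequence_pattern_pairs]


service_c = connect(e_blastp, e_frags, e_assrul)
rules = service_c(["P39128"])
```

The point of the design is that each step only needs to agree with its neighbours on the shape of the data exchanged, so a user can rewire the chain without knowing how any individual tool works internally.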



Table II. XML string output for Service C. This service requests the protein sequences and associated functional keywords of the genes belonging to a collection of clusters (non-informative keywords such as ‘Hypothetical protein’ and ‘Complete proteome’ have been removed). In some cases the Blast output clearly shows functional relationships between the query sequence (in this case the protein coded by the yesS gene) and a set of sequences related to it by similarity. In other cases, the keywords are less informative (e.g. complete proteome) and no functional information is obtained by similarity. In these cases, the sequences are used again in a pipeline workflow as input to the Frags procedure, in which short stretches of sequence resembling putative patterns are detected.

<?xml version="1.0" encoding="UTF-8"?>
<result>
  <query>FOR $Iteration IN INTEGRACION WHERE ($Iteration/Organism/data()='Bacillus subtilis' AND ($Iteration/GeneName/data()='lplB' OR $Iteration/GeneName/data()='licC' OR $Iteration/GeneName/data()='opuBD'))</query>
  <data>
    <CLUSTERS>
      <CLUSTER>
        <Sequence id="BSUB0004" Specie="Bacillus subtilis">
          <description>Bacillus subtilis complete genome (600701, 813890)</description>
          <KW>Transmembrane; Transport; Complete proteome;</KW>
          <Features>
            <gene name="lplB" AccNumber="P39128" PID="CAB12530"/>
            <SeqPro>METVPKKRDAPV ... FGANYIAKKFDQEGLF</SeqPro>
          </Features>
        </Sequence>
        . . . . . .
        <Sequence id="AF008930" Specie="Bacillus subtilis">
          <description>Bacillus subtilis choline transport system including ATPase (opuBA), transmembrane protein (opuBB), choline binding protein precursor (opuBC) and transmembrane protein (opuBD) genes, complete cds; and unknown gene.</description>
          <KW>Transport; Amino-acid transport; Transmembrane;</KW>
          <Features>
            <gene name="opuBD" AccNumber="P39775" PID="AAC14359"/>
            <SeqPro>MNVLEQLMTYYAQNGS .... VMAVGADLLMAWLERALSPVKKKRTGAKHVQSAA</SeqPro>
          </Features>
        </Sequence>
      </CLUSTER>
    </CLUSTERS>
  </data>
  <Statistics StartTime="02/09/2002 9:32:15" EndTime="02/09/2002 10:02:30">00:30:15.5649630</Statistics>
</result>

Blast results for translated protein yesS gene

YesS     E-value   Swissprot Keywords
O31522   0.0       DNA-binding; Transcription regulation
Q9KFJ6   2.6e-38   DNA-binding; Transcription regulation
Q32071   8.1e-37   DNA-binding; Transcription regulation
Q30502   1.6e-20   DNA-binding; Transcription regulation
Q9KB68   3.2e-16   DNA-binding; Transcription regulation
Q8RCW7   1.0e-15
Q9KBL6   1.5e-14   DNA-binding; Phosphorylation; Sensory transduction; Transcription regulation
Q9K7C1   2.5e-14   DNA-binding; Transcription regulation
Q9K690   2.1e-13   DNA-binding; Phosphorylation; Sensory transduction; Transcription regulation
O52846   5.8e-13   DNA-binding; Transcription regulation
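A result document of the shape shown in Table II can be consumed with any standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed sample that keeps only the element and attribute names visible in the table; the function name `keywords_by_gene` and the choice to filter the non-informative keywords on the client side are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Keywords named as non-informative in the caption of Table II.
NON_INFORMATIVE = {"Hypothetical protein", "Complete proteome"}

# Trimmed sample following the structure of Table II (sequences omitted).
SAMPLE = """<result><data><CLUSTERS><CLUSTER>
  <Sequence id="BSUB0004" Specie="Bacillus subtilis">
    <KW>Transmembrane; Transport; Complete proteome;</KW>
    <Features><gene name="lplB" AccNumber="P39128" PID="CAB12530"/></Features>
  </Sequence>
</CLUSTER></CLUSTERS></data></result>"""


def keywords_by_gene(xml_text: str) -> dict:
    """Map each gene name to the informative keywords of its Sequence element."""
    root = ET.fromstring(xml_text)
    result = {}
    for sequence in root.iter("Sequence"):
        kw_text = sequence.findtext("KW", default="")
        # Keywords are separated by semicolons, with a trailing one.
        keywords = [k.strip() for k in kw_text.split(";") if k.strip()]
        informative = [k for k in keywords if k not in NON_INFORMATIVE]
        for gene in sequence.iter("gene"):
            result[gene.get("name")] = informative
    return result


gene_keywords = keywords_by_gene(SAMPLE)
```

For the sample above this yields the keywords `Transmembrane` and `Transport` for gene `lplB`, which is the kind of gene-to-keyword mapping the later association-rule step relies on.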



Table III. The rules’ significance can be established upon a variety of parameters calculated by the system: confidence (probability of rule satisfaction), support (examples in data that are covered by the rule), coverage (examples in data that are covered by the antecedent of the rule), improvement (how much more frequent is the occurrence of the rule than normal) and leverage (additional examples covered by the rule above those expected).

CONF  SUPPORT  COVERAGE  IMPROV  LEVERAGE  ANTECEDENT -> CONSEQUENT
100   1.8018   1.8018    14.8    1.6801    [+]PAT64 -> G-PROTEIN COUPLED RECEPTOR, GLYCOPROTEIN, MULTIGENE FAMILY, TRANSMEMBRANE
100   1.3514   1.3514    6.2535  1.1353    [+]PAT40 -> GLYCOPROTEIN, HYDROLASE, MEMBRANE, NERVE, NEUROTRANSMITTER DEGRADATION, SERINE ESTERASE, SIGNAL, SYNAPSE
100   1.3514   1.3514    24.667  1.2966    [+]PAT5 -> 4FE-4S, IRON-SULFUR, LYASE, MITOCHONDRION, TRANSIT PEPTIDE, TRICARBOXYLIC ACID CYCLE
100   1.5766   1.5766    6.6269  1.3387    [+]PAT43 -> CALCIUM, CARBOHYDRATE METABOLISM, GLYCOSIDASE, HYDROLASE, SIGNAL
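The five parameters can be computed from four counts using the usual textbook definitions, as in the sketch below. That the system uses exactly these formulas is an assumption, and the counts in the example are hypothetical (the underlying dataset size is not reported); they were picked because they give values that agree with the first row of Table III when confidence, support, coverage and leverage are expressed as percentages.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Compute rule quality measures from example counts.

    n_total:      number of examples in the data
    n_antecedent: examples matching the antecedent
    n_consequent: examples matching the consequent
    n_both:       examples matching antecedent and consequent (the rule)
    """
    support = n_both / n_total
    coverage = n_antecedent / n_total
    confidence = n_both / n_antecedent
    p_consequent = n_consequent / n_total
    # Improvement (lift): how much more often the consequent holds under the rule.
    improvement = confidence / p_consequent
    # Leverage: covered examples beyond those expected under independence.
    leverage = support - coverage * p_consequent
    return {
        "confidence": 100 * confidence,
        "support": 100 * support,
        "coverage": 100 * coverage,
        "improvement": improvement,
        "leverage": 100 * leverage,
    }


# Hypothetical counts: 444 sequences, 8 with the pattern, 30 with the keywords, 8 with both.
metrics = rule_metrics(n_total=444, n_antecedent=8, n_consequent=30, n_both=8)
```

Because confidence is 100 in every row of the table, support and coverage coincide there: every sequence containing the pattern also carries the keywords.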

The possibility of easily connecting EVASs allows one to solve these examples without requiring knowledge of the internal details of the tools connected to the service as EVASs. This characteristic is especially suitable for allowing different users to share tools with which they are unfamiliar.

Finally, as a test case for Service D we have structurally compared two protein structures: 1dhr and 4tgf (the system has no limitation other than computational resources for dealing with larger sets of structures). As commented previously, the set of PDB codes can be provided directly by the user or be a partial output from Service C. As can be observed in Figure 7, the core of the fragment resides in the two anti-parallel beta-strands, but additional residues from very different (and partial) two-dimensional structures are also aligned. In this case it represents a random geometrical relationship.
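The quality of such a structural alignment is summarised by the root mean square distance between paired alpha-carbon positions. The sketch below computes that figure for two coordinate lists that are assumed to be already paired and superposed; it does not perform the alignment itself (the geometric hashing search and the superposition are outside its scope), and the function name `rmsd` and the toy coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between paired 3D points.

    Both arguments are sequences of (x, y, z) tuples of equal, non-zero length,
    already expressed in the same reference frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example: two points, each displaced by one unit along z.
fragment_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
fragment_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = rmsd(fragment_a, fragment_b)
```

In practice the coordinates would come from the fragments cut out of each PDB entry, and the superposition step would be applied before this measure is taken.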

4. DISCUSSION

In this work we have presented an architectural design that allows easy and transparent integration of several heterogeneous sources of information. This architecture is especially suitable for collaborative environments where users wish to include their own specific software tools (EVASs) with minimal intervention and expertise required to manage the system. The inclusion of EVASs in the data flow is one of the features of the architecture and allows filtering, restructuring and processing data without requiring any knowledge of the EVASs' implementation.



Figure 7. A 3D local alignment, 28 residues in length, between 1DHR, the PDB code for the Dihydropteridine Reductase structure (236 residues), and Transforming Growth Factor (4tgf) (50 residues), obtained using our 3D align tool (a geometric-hashing-based tool for aligning 3D structures). The root mean square distance between alpha-carbon positions is 1.58. On the left, there is a ribbon model of the whole unrelated structure, and on the right the local fragment.

Our framework provides a base set of templates that can be readily adapted and refined to meet user needs, incorporating new databases and algorithms for (bioinformatics) integration.

The framework has a modular organization with flexibility for selecting and combining specific EVASs for specific needs. These EVAS designs are reusable for repeated tasks and can be shared with other users.

In this architecture, wrappers are created manually and added to the mediator by modifying its source code. We are directing our efforts towards a dynamic mediator based on semantic mediation, which allows one to use semantic information about data sources, such as query capabilities, data provenance and data schemas. The main goal is to provide a method for adding wrappers without source code modifications. A secondary goal is to provide a tool for automatic wrapper generation. Many works address these problems, but we think that in most cases they are incomplete tools or they produce monolithic wrappers that cannot be reused (either as a complete or a partial interface) in other mediators.

Different use cases have also been developed to demonstrate how this architecture and its associated framework allow the rapid development of domain-specific applications. The development of Bio-Broker is a proof of concept of the suitability of this kind of architecture for the bioinformatics domain. In particular, the usefulness of the mediator system is demonstrated by a diverse set of applications aimed at combining expression data with genomic, sequence-based and structural information, so as to provide a general, transparent and powerful solution that goes beyond traditional gene expression data clustering.



We are currently working on the integration of gene expression data with pathway information, which would be a significant development. There are several metabolic databases available that contain several hundred manually drawn pathway maps, but in some cases their static and separate diagrams do not provide sufficient flexibility to integrate the information.

ACKNOWLEDGEMENTS

This work has been partially supported by grant QLK3-2001-01473 under the EU sub-programme area ‘Quality of Life and Management of Living Resources’, Key Action ‘The Cell factory’, and by grant TIC2002-04586-C04-04 of the Spanish Ministerio de Ciencia y Tecnología. The authors would like to thank Dr. Andres Rodriguez and Dr. Antonio Perez for their careful reading and recommendations.

We thank the reviewers for their comments and help towards the improvement of the paper.

REFERENCES

1. Rechenmann F. From data to knowledge. Bioinformatics 2000; 16:411.
2. Abiteboul S, Buneman P, Suciu D. Data on the Web from Relations to Semistructured Data and XML. Morgan Kaufmann: San Francisco, CA, 2000.
3. Roth MT, Schwarz PM. Don’t scrap it, wrap it! A wrapper architecture for legacy data sources. Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 25–29 August 1997, Jarke M, Carey MJ, Dittrich KR, Lochovsky FH, Loucopoulos P, Jeusfeld MA (eds.). Morgan Kaufmann: San Francisco, CA, 1997; 266–275.
4. Sahuguet A, Azavant F. Building light-weight wrappers for legacy Web data-sources using W4F. Proceedings of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland, 7–10 September 1999, Atkinson MP, Orlowska ME, Valduriez P, Zdonik SB, Brodie ML (eds.). Morgan Kaufmann: San Francisco, CA, 1999; 738–741.
5. Fahl G, Risch T, Skold M. AMOS - an architecture for active mediators. Proceedings of the 1993 Workshop on Next Generation Information Technologies and Systems (NGITS’93). The Technion: Haifa, Israel, 1993; 47–53.
6. Garcia-Molina H, Papakonstantinou Y, Quass D, Rajaraman A, Sagiv Y, Ullman J, Vassalos V, Widom J. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems 1997; 2:117–132.
7. Tomasic A, Raschid L, Valduriez P. Scaling heterogeneous databases and the design of disco. Proceedings of the International IEEE Conference on Parallel and Distributed Information Systems, Hong Kong, 1996. IEEE Computer Society Press: Los Alamitos, CA, 1996; 449–457.
8. Haas L, Kossmann D, Wimmers EL, Yang J. Optimizing queries across diverse data sources. Proceedings of the 23rd International Conference on Very Large Databases (VLDB’97). Morgan Kaufmann: San Francisco, CA, 1997; 276–285.
9. Stevens R, Baker P, Bechhofer S, Ng G, Jacoby A, Paton NW, Goble CA, Brass A. TAMBIS: Transparent access to multiple bioinformatics information sources. Bioinformatics 2000; 16:184–186.
10. Lange M, Freier A, Scholz U, Stephanik A. A computational support for access to integrated molecular biology data. German Conference on Bioinformatics, 2001. Available at: http://www.bioinfo.de/isb/gcb01/poster/lange.html#img-1.
11. Gupta A, Ludascher B, Martone ME. Knowledge-based integration of neuroscience data sources. Proceedings of the 12th International Conference on Scientific and Statistical Database Management (SSDBM), Berlin, Germany, July 2000. IEEE Computer Society: Los Alamitos, CA, 2000.
12. Liu L, Buttler D, Critchlow T, Han W, Paques H, Pu C, Rocco D. BioZoom: Exploiting source-capability information for integrated access to multiple bioinformatics data sources. Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering (BIBE 2003), Washington, DC, 10–12 March 2003. IEEE Computer Society Press: Los Alamitos, CA, 2003.
13. Davidson S, Overton C, Tannen V, Wong L. BioKleisli: A digital library for biomedical researchers. International Journal of Digital Libraries 1997; 1:36–53.
14. IBM Corp. ‘DiscoveryLink’. http://www.ibm.com/discoverylink [23 August 2004].
15. Baru C, Gupta A, Ludascher B, Marciano R, Papakonstantinou Y, Velikhov P, Chu V. XML-based information mediation with MIX. Proceedings of the Special Interest Group on Management of Data (SIGMOD’98). ACM Press: New York, 1998; 597–599.
16. Rodriguez-Martinez M, Roussopoulos N. MOCHA: A self-extensible database middleware system for distributed data sources. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, May 2000. ACM Press: New York, 2000; 213–224.
17. Buttler D, Coleman M, Critchlow T, Fileto R, Han W, Pu C, Rocco D, Xiong L. Querying multiple bioinformatics information sources: Can semantic Web research help? SIGMOD Record 2002; 31(4):59–64.
18. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research 2000; 28(1):45–48.
19. Wang L, Riethoven JJM, Robinson AJ. XEMBL - distributing EMBL data in XML format. Bioinformatics 2002; 18(8):1147–1148.
20. Biaudet V, Samson F, Bessieres P. Micado - a network oriented database for microbial genomes. Computer Applications in the BioSciences 1997; 13:431–438.
21. The PDB Team (2003). Details the history, function, development, and future goals of the PDB resource. Structural Bioinformatics, Bourne PE, Weissig H (eds.). Wiley: New York, 2003; 181–198.
22. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim S, Eisenberg D. DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions. Nucleic Acids Research 2000; 30(1):303–305.
23. Bader GD, Betel D, Hogue CW. BIND: The Biomolecular Interaction Network Database. Nucleic Acids Research 2003; 31(1):248–250.
24. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein DB search programs. Nucleic Acids Research 1997; 25(17):3389–3402.
25. Perez AJ, Rodríguez A, Trelles O, Thode G. A computational strategy for protein structure-function assignment which approaches the multidomain problem. Comparative and Functional Genomics 2002; 3(5):423–440.
26. Rodríguez A, Carazo JM, Trelles O. Mining association rules from biological databases. Journal of American Society Information Systems and Technologies 2005; 56(5):493–504.
27. Garcia de la Nava J, Santaella DF, Cuenca Alba J, Maria Carazo J, Trelles O, Pascual-Montano A. Engene: The processing and exploratory analysis of gene expression data. Bioinformatics 2003; 19:657–658.
28. Aldana JF, Gomez AC, Roldan MM, Moreno N, Nebro A. Metadata functionality for semantic Web integration. Proceedings of the 7th International Conference of the International Society of Knowledge Organization (ISKO’02). Ergon: Wurzburg, Germany, 2002.
29. Aldana JF, Roldan MM, Gomez AC, Yague ME. Un framework para la consulta e integracion de fuentes de datos heterogeneas y distribuidas sobre la Web. Memorias de las Jornadas Iberoamericanas de Ingeniería de Requisitos de Ambientes y de Software (IDEAS’2001). Instituto Tecnologico de Costa Rica, Centro de Informacion Tecnologica (CIT): Cartago, Costa Rica, 2001; 65–76.
30. XML. Extensible Markup Language (XML) 1.1. W3C Candidate Recommendation. http://www.w3.org/TR/2002/CR-xml11-20021015 [15 October 2003].
31. XQuery 1.0. W3C Working Draft, 16 August 2002. http://www.w3.org/TR/xquery [23 August 2004].
32. XML Schema. W3C Recommendation. http://www.w3.org/TR/xmlschema-{0, 1, 2} [2 May 2001].
33. RDF Schema. Resource Description Framework Schema Specification 1.0. W3C Candidate Recommendation. http://www.w3.org/TR/2000/CR-rdf-schema-20000327 [March 2000].
34. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Research 2000; 28(1):235–242.
35. Samson F, Biaudet-Brunaud V, Gas S, Dervyn E, Gallezot G, Duchet S, Batto JM, Ehrlich D, Bessieres P. Micado, an Integrative Database Dedicated to the Functional Analysis of Bacillus Subtilis and Microbial Genomics. Functional Analysis of Bacterial Genes, Schumann W, Ehrlich SO, Ogasawara N (eds.). Wiley: New York, 2000.
36. Kobayashi K, Ogura M, Yamaguchi H, Yoshida K, Ogasawara N, Tanaka T, Fujita Y. Comprehensive DNA microarray analysis of Bacillus subtilis two-component regulatory systems. Journal of Bacteriology 2001; 183(24):7365–7370.
37. Tamames J, Casari G, Ouzounis C, Valencia A. Conserved clusters of functionally related genes in two bacterial genomes. Journal of Molecular Evolution 1997; 44:66–73.
38. Dandekar T et al. Systematic genomic screening and analysis of mRNA in untranslated regions and mRNA precursors: Combining experimental and computational approaches. Bioinformatics 1998; 14:271–278.

