![Page 1: Data Integration in Bioinformatics Using OGSA-DAI The BioDA Project Shirley Crompton, Brian Matthews (CCLRC) Alex Gray, Andrew Jones, Richard White (Cardiff](https://reader036.vdocument.in/reader036/viewer/2022062423/5697bfef1a28abf838cb9f3d/html5/thumbnails/1.jpg)
Data Integration in Bioinformatics Using OGSA-DAI
The BioDA Project
Shirley Crompton, Brian Matthews (CCLRC)
Alex Gray, Andrew Jones,
Richard White (Cardiff University)
Overview
• Bioinformatics Data Access and Integration Requirements
  – Generic
    • BioDA Workshop and Questionnaire
  – BDWorld-specific
• OGSA-DAI exemplar
The BioDA Project
• Independent Evaluation of OGSA-DAI
  – the suitability of the software in its present form
  – how to leverage OGSA-DAI in the bioinformatics GRID
• OGSA-DAI Product Improvement
  – Feedback to the DAIT Team
• Knowledge Dissemination
  – Evaluation Report
  – Publications/Presentations
  – Workshop on OGSA-DAI for the bioinformatics eResearch community
Bioinformatics
The application and development of computing and mathematics to the
management, analysis and understanding of data to solve biological questions.
Attwood, TK and Parry-Smith, DJ 1999
Data Management
Data Analysis
Grid Computing
“... flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources…”
Foster, Kesselman and Tuecke, 2001
1st BioDA Workshop
• Objectives
  – to examine the bioinformatics community’s needs for data access and integration (DAI) on the grid, and
  – to explore the application of OGSA-DAI, middleware developed expressly to address the DAI requirements of eScience projects
The BioDA Survey
[Chart: Mean Scores by Requirement Categories (adjusted by the number of questions within each category). Horizontal axis: Requirement Category. Vertical axis: Mean Score, 0 to 5.]
The Results
17 key requirements were identified; those at the top of the list include:
• schema integration
• schema mapping
• mixed language query
• complex join across databases
• provenance data
• flexible resource discovery
• RDF database access
The BioDA Exemplar
The BioDiversity World
• To create a GRID-based problem solving environment.
• Enable collaborative exploration and analysis of global biodiversity patterns using workflow and rich data sources from around the world
• Example applications would be modeling species distributions against climate change, conservation prioritization and linking evolutionary changes to past climates.
BDWorld (Source: BDWorld)
[Architecture diagram. Labelled components:]
• User; Local tools; Problem Solving Environment user interface
• Problem Solving Environment: Broker agents, Facilitator agents, Presentation agents
• BDGrid
• Ontology; Metadata; Intelligent links; Resource & analytic tool descriptions; Maintenance tools
• Taxonomic index (Species 2000 & ITIS Catalogue of Life)
• Thematic data source; Abiotic data source
• Analytic tool (×2)
• Proxy (×6)
• GSD (×4)
BDWorld Data Resources: Key Issues
• geographically distributed and autonomous
  – heterogeneous in structure and data standards
  – mainly read via HTTP/XML protocols using custom wrappers
    • SQL queries are limited to the EBI EMBL store and BDWorld cache databases
• potentially resource-intensive to harvest
  – a single taxon name may resolve into a large number of ‘accepted’ taxon names
  – same query repeated on different data collections
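The harvesting cost in the last bullet can be illustrated with a minimal sketch. It assumes a hypothetical name table and hypothetical collection names (placeholders, not BDWorld data); the only point is that the number of remote requests grows as accepted names × data collections.

```python
from itertools import product

# Hypothetical lookup: one submitted name -> several 'accepted' taxon names.
# These are placeholders, not entries from any real checklist.
ACCEPTED_NAMES = {
    "name-x": ["accepted-1", "accepted-2", "accepted-3"],
}

# Hypothetical data collections that must each receive the same query.
COLLECTIONS = ["collection-a", "collection-b"]


def plan_requests(submitted_name, accepted=ACCEPTED_NAMES, collections=COLLECTIONS):
    """Return every (accepted name, collection) pair that has to be queried."""
    # A name with no known resolution is queried as submitted.
    names = accepted.get(submitted_name, [submitted_name])
    return list(product(names, collections))


requests = plan_requests("name-x")
print(len(requests), "remote requests for one submitted name")
```

In this toy case one submitted name becomes three accepted names, each sent to two collections, which is why caching and batching matter for such resources.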
Resource Wrapping (Source: BDWorld)
[Diagram. Labelled components: User; Workflow enactment engine; BDWorld-GRID Interface (BGI) with BGI API (shown twice); The GRID; Wrapper; Remote Resource]
Implications for BioDA
• abstraction layer (BGI) with a proprietary invocation mechanism
  – InvokeOperation(ResourceHandler, Operation, XmlDataCollection)
• prepared search statements defined in the individual data resource wrappers
• BGI protocols: BDW communication objects. Search parameters and results are passed as XmlDataCollection
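The invocation pattern described on this slide can be sketched roughly as follows. This is not the BGI API: only the three argument roles (resource handler, operation, XML data collection) come from the slide, and every class, field and function body below is a hypothetical stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class XmlDataCollection:
    """Stand-in for the BDW communication object (its structure is assumed)."""
    items: dict = field(default_factory=dict)


class ResourceHandler:
    """Stand-in for a wrapped resource holding its prepared search statements."""

    def __init__(self, name, prepared_searches):
        self.name = name
        # operation name -> callable taking and returning an XmlDataCollection
        self.prepared_searches = prepared_searches


def invoke_operation(handler, operation, data):
    """Dispatch a named operation to the wrapper behind `handler`."""
    try:
        search = handler.prepared_searches[operation]
    except KeyError:
        raise ValueError(f"{handler.name} does not support {operation!r}") from None
    return search(data)


def search_by_name(params):
    """Hypothetical prepared search defined inside one wrapper."""
    return XmlDataCollection({"query": params.items["name"], "hits": []})


handler = ResourceHandler("example-resource", {"searchByName": search_by_name})
result = invoke_operation(handler, "searchByName", XmlDataCollection({"name": "name-x"}))
print(result.items)
```

Because callers see only operation names and data collections, an arbitrary query cannot be passed through without changing the protocol, which is the constraint the following slides work around.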
BioDA Exemplar
• Two main possibilities within BDW:
1. Augment BGI so that queries can be included in workflows and sent directly to OGSA-DAI enabled databases.
  • Distributed query processing facilities could assist in planning the execution and distribution of data-orientated parts of a workflow. (For the current status of OGSA-DQP see Section 4.)
  – Very major revision to BDW protocols; also,
  – many resources of interest are simply not exposed as databases.
2. Provide facilities within individual wrappers that benefit from OGSA-DAI.
OGSA-DAI Prototype (What we’d have liked)
[Diagram. Labelled components: OGSA-DAI Client; OGSA-DAI R5 GDS; BDWQueryActivity; Wrapper Module with three Wrappers; Web DBs; deliverFromURL(xsl); deliverFromURL(url); XSLTransform; deliverToURL/GFTP]
1. BGI InvokeOperation()
2. Create GDS and query
3. Invoke wrapper
4. Query
5. Download URL
6. Download url
7. url
8. XSL transform to BDW format
9. To WF unit
Key Issues encountered
• Complex client-side coding to orchestrate the application flow
  – requires several GDS perform requests…
• Difficult to synchronise
  – remote web databases have different response times (or do not respond at all!)
• Different data transformation series apply to different data resources
• BDW protocols specify that data is returned as a BDW XmlDataCollection object
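The synchronisation problem can be sketched with a generic fan-out under a deadline. This uses Python's standard `concurrent.futures` rather than anything from OGSA-DAI, and `query_resource` is a hypothetical stand-in whose `delay` argument simulates a remote response time.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_resource(name, delay):
    """Hypothetical remote query; `delay` simulates the response time."""
    time.sleep(delay)
    return f"{name}: ok"


def query_all(resources, timeout):
    """Query all resources in parallel and stop waiting after `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(resources)))
    futures = {pool.submit(query_resource, n, d): n for n, d in resources.items()}
    done, not_done = wait(futures, timeout=timeout)
    # Do not block on stragglers; cancel any call that has not started yet.
    pool.shutdown(wait=False, cancel_futures=True)
    # result() would re-raise an exception from a failed query.
    answered = {futures[f]: f.result() for f in done}
    timed_out = sorted(futures[f] for f in not_done)
    return answered, timed_out


answered, timed_out = query_all({"fast": 0.0, "slow": 1.0}, timeout=0.3)
print("answered:", answered)
print("timed out:", timed_out)
```

The caller still has to decide what a partial result means, for example whether to return what arrived or to retry the slow resources later.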
OGSA-DAI Prototype (What we ended up doing)
[Diagram. Labelled components: OGSA-DAI Client; OGSA-DAI R5 GDS; BDWQueryActivity; Wrapper Module with three Wrappers; Web DBs; Cache File]
1. BGI InvokeOperation()
2. Create GDS and query
3. Invoke wrapper/s
4. Query, transform
5. Write cache file
6. return XmlRemoteData
7. return XmlDataCollection
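Steps 3 to 6 of this flow can be sketched as below, under loud assumptions: the wrappers are hypothetical callables that return already-transformed records, the XML layout is invented for illustration, and a plain file URI stands in for the XmlRemoteData reference. None of this is OGSA-DAI or BDWorld code.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def run_wrappers(wrappers, query):
    """Query each wrapper and merge its already-transformed records."""
    root = ET.Element("results", {"query": query})
    for name, wrapper in wrappers.items():
        for record in wrapper(query):
            ET.SubElement(root, "record", {"source": name}).text = record
    return root


def write_cache(root, cache_dir):
    """Write merged results to a cache file and return a reference to it."""
    path = Path(cache_dir).resolve() / "result.xml"
    ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
    return path.as_uri()  # plays the role of the XmlRemoteData reference


# Hypothetical wrappers standing in for the remote web databases.
wrappers = {
    "source-a": lambda q: [f"{q} / a1", f"{q} / a2"],
    "source-b": lambda q: [f"{q} / b1"],
}
cache_dir = tempfile.mkdtemp()
reference = write_cache(run_wrappers(wrappers, "name-x"), cache_dir)
print(reference)
```

Returning a reference instead of the data keeps the response small; the caller can then load the cache file and build the final XmlDataCollection in one place.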
Conclusion
• Highlighted key bioinformatics eScience project requirements for OGSA-DAI
  – support for metadata-driven, two-step access to data and data integration…
• Reviewed BDWorld DAI requirements
  – uniform access to disparate, heterogeneous data resources
    • including anonymous access to web information systems
• Reviewed the BDWorld OGSA-DAI exemplar and issues encountered