
ibm.com/redbooks

IBM Life Sciences Solutions: Turning Data into Discovery with DiscoveryLink

Authored by the Life Sciences Solution Team

Introduction to IBM Life Sciences

Overview of DB2 Life Sciences Data Connect and Consulting Services

DiscoveryLink Demonstration Overview


International Technical Support Organization

IBM Life Sciences Solutions: Turning Data into Discovery with DiscoveryLink

March 2002

SG24-6290-00

© Copyright International Business Machines Corporation 2002. All rights reserved.

Note to U.S. Government Users - Documentation related to restricted rights - Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

First Edition (March 2002)

This edition applies to the Data Management Series — DiscoveryLink.

Comments may be addressed to:
IBM Corporation, International Technical Support Organization
Dept. 5KNA Building 80-E2
650 Harry Road
San Jose, California 95120-6099

When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

Take Note! Before using this information and the product it supports, be sure to read the general information in “Special notices” on page 179.

Contents

Preface
    The team that wrote this redbook
    Special notice
    IBM trademarks
    Comments welcome

Part 1. DiscoveryLink Overview

Chapter 1. IBM Life Sciences Solutions: Advancing Research and Discovery through Information Technology
    1.1 Discovering new possibilities
    1.2 Flexible infrastructure solutions designed for future growth
    1.3 Leveraging the value of data: high-performance computing
    1.4 DiscoveryLink: solving the data integration dilemma
    1.5 Knowledge management: turn research data into insight
        1.5.1 Storage: share centralized data
        1.5.2 e-business: manage and share data in a secure, stable environment
        1.5.3 Services and consulting: optimize your solution
    1.6 IBM’s commitment to life sciences
    1.7 Why IBM?

Chapter 2. IBM Life Sciences Solutions: Turning Data into Discovery with DiscoveryLink
    2.1 Traditional approaches to data integration
    2.2 Case study: a protein is born
    2.3 DiscoveryLink: transform data into insight - fast
        2.3.1 Case study: unlocking mysteries of the brain stem
        2.3.2 Case study: bridging oceans of data when companies merge
    2.4 Industrial-strength performance and ease of use
    2.5 Start simply - grow fast
    2.6 Accelerate scientific discovery and productivity

Chapter 3. DiscoveryLink: A Data Integration Solution for Life Sciences
    3.1 DiscoveryLink Presentation

Part 2. Detailed DiscoveryLink Information

Chapter 4. DiscoveryLink: A System for Integrated Access to Life Sciences Data Sources
    4.1 Motivation
        4.1.1 Three data sources
        4.1.2 Scenario 1: a new protein
        4.1.3 Scenario 2: a merger
        4.1.4 Scenario 3: serotonin research
    4.2 A wrapper architecture
    4.3 Query processing
        4.3.1 Optimizing the query
        4.3.2 Executing the query
        4.3.3 Future enhancements
    4.4 Field experience
    4.5 Discussion
    4.6 Related work
    4.7 Status and future work
    4.8 Cited references and notes

Chapter 5. DB2 Life Sciences Data Connect
    5.1 IBM DB2 Life Sciences Data Connect
    5.2 Querying life sciences data

Chapter 6. IBM Life Sciences Global Consulting and Solutions Practice
    6.1 Data integration services
    6.2 Knowledge management services
    6.3 e-Business hosting and services
    6.4 Comprehensive IT service offerings

Chapter 7. IBM Life Sciences Global Consulting and Solutions Practice DiscoveryLink Transition Offering
    7.1 Life Sciences Solution Practice Presentation

Chapter 8. DiscoveryLink: A Data Integration Solution for Life Sciences (For IT Professionals)
    8.1 IT Professional Presentation

Part 3. DiscoveryLink Demonstration

Chapter 9. DiscoveryLink Demonstration
    9.1 DiscoveryLink technology
    9.2 About the DiscoveryLink demonstrations
    9.3 Example queries
    9.4 Query 1
        9.4.1 Query 1 details
        9.4.2 Query 1 architecture
        9.4.3 SQL queries issued to DiscoveryLink for Query 1
        9.4.4 Query 1 results
        9.4.5 Query 1 results for NCI experiment ID number 9423
        9.4.6 Query 1 results for NCI experiment ID number 11872
        9.4.7 Query 1 results for NCI experiment ID number 12253
    9.5 Query 2
        9.5.1 Query 2 details
        9.5.2 Query 2 results
    9.6 Query 3
        9.6.1 Query 3 details
        9.6.2 Query 3 results
        9.6.3 Query 3 results for NSC ID 171
        9.6.4 Query 3 results for NSC ID 291
        9.6.5 Query 3 results for NSC ID 295
        9.6.6 Query 3 results for NSC ID 473
        9.6.7 Query 3 results for NSC ID 473

Special notices


Preface

This IBM Redbook provides useful information on DiscoveryLink: IBM's solution for data integration and data management for the Life Sciences.

Dramatic advances occurring in the life sciences industry are changing the way we live. These advances fuel the rapid scientific discoveries in genomics, pharmaceutical research, proteomics, and molecular biology that serve as the basis for medical breakthroughs and the development of new drugs and treatments. One of the life sciences industry’s most difficult challenges is transforming massive quantities of highly complex, constantly changing data—from a variety of data sources—into knowledge.

The key to increasing R&D effectiveness and remaining competitive in today’s fast-paced scientific community is data integration. The ability to tap into multiple, heterogeneous data sources once and quickly retrieve clear, consistent information is critical to uncovering correlations and insights that lead to the discovery of new drugs and medical products.

To meet the challenges of integrating and analyzing diverse scientific data from the variety of domains within life sciences, IBM has developed a versatile platform solution—IBM DiscoveryLink™. With single query data access, the IBM DiscoveryLink™ software allows researchers to work with distributed data sources and diverse data formats. IBM DB2® Universal Database™, the industry’s first multimedia, Web-ready, federated database, provides the industry-leading performance and scalability required to drive the most demanding life sciences applications.

To ensure robust performance and fast response time, DiscoveryLink includes query optimization technology that automatically searches for the most efficient means of executing the query and assembling the results. With a single Structured Query Language (SQL) command, researchers can access and integrate information from multiple data sources.

DiscoveryLink technology can complement and extend the capabilities of existing data warehouses and object frameworks, enabling drug discovery, biotechnology, and pharmaceutical research companies to boost productivity while protecting their IT investments. Depending on workload and system requirements, DiscoveryLink middleware can run on a single or multiple servers and supports a wide variety of popular operating systems. DiscoveryLink allows integration of multiple diverse sources, including text search engines and industry-specific repositories to make data management and application development easy. DiscoveryLink builds on IBM’s knowledge and understanding of databases and data management—while adding the scientific expertise needed to customize solutions unique to the life sciences industry.

Part 1 of this book gives you an overview of DiscoveryLink and provides several generalized case studies.

Part 2 presents detailed information on DiscoveryLink and DB2 Life Sciences Data Connect, and an overview of the IBM Life Sciences Global Consulting and Solutions Practice.

Part 3 provides you with a static demonstration for DiscoveryLink, including the results of three different queries.


The team that wrote this redbook

This redbook was produced by an IBM Life Sciences Solutions Team of specialists from around the world working in conjunction with the International Technical Support Organization, San Jose Center.

The IBM Life Sciences Solutions Team provides the IT infrastructure that researchers in biotechnology, pharmaceutical research, genomics, proteomics, and healthcare need to turn data into scientific discovery and new treatments for disease. Backed by a world-class research team, the team offers integrated solutions for the laboratory in the areas of:

• Scalable, high-performance computing

• Flexible, fast-recovery storage

• Data integration tools that permit single-query access across multiple data sources

• Knowledge management products and services

• Secure, open-architecture, scalable e-business infrastructure.

Thanks to the following people for their contributions to this project:

Inna Kuznetsova, IBM Life Sciences, Somers, New York

Jason Alter, IBM Life Sciences, Somers, New York

Joe DeCarlo, IBM International Technical Support Organization, San Jose Center

Special notice

This publication is intended to help researchers in biotechnology, pharmaceutical research, genomics, proteomics, and healthcare turn data into scientific discovery and new treatments for disease. The information in this publication is not intended as the specification of any programming interfaces that are provided by the DiscoveryLink solution. See the PUBLICATIONS section of the IBM Programming Announcement for the DiscoveryLink solution for more information about what publications are considered to be product documentation.

IBM trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:

e (logo)®
Redbooks (logo)™
AIX®
Approach®
Blue Gene™
DataJoiner®
DB2®
DB2 Universal Database™
DiscoveryLink®
IBM®
Informix™
Lotus®
Mulliken®
Notes®
Perform™
POWERparallel®
RS/6000®
Scalable POWERparallel Systems®
SP™
WebSphere®


Tivoli, Manage. Anything. Anywhere., The Power To Manage., Anything. Anywhere., TME, NetView, Cross-Site, Tivoli Ready, Tivoli Certified, Planet Tivoli, and Tivoli Enterprise are trademarks or registered trademarks of Tivoli Systems Inc., an IBM company, in the United States, other countries, or both. In Denmark, Tivoli is a trademark licensed from Kjøbenhavns Sommer - Tivoli A/S.

Comments welcome

Your comments are important to us!

We want our IBM Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:

• Use the online Contact us review redbook form found at:

ibm.com/redbooks

• Send your comments in an Internet note to:

[email protected]

• Mail your comments to the address on page ii.


Part 1 DiscoveryLink Overview

One of the life sciences industry’s most difficult challenges is transforming massive quantities of highly complex, constantly changing data from a variety of data sources into knowledge. This data is dispersed throughout the R&D enterprise, scattered across disparate hardware platforms, and compiled within specialized applications and databases, including sequence databases, chemical structure databases and relational databases. The challenge of turning this data into life sciences insight is compounded daily by the exponential growth of data in every domain of the life sciences industry. Somewhere within the mountains of information are answers to questions that may help prevent and cure disease. What proteins are encoded by the more than 30,000 human genes? What biological pathways do they participate in? Which proteins are appropriate targets for the development of new therapeutics? What molecules can be identified and optimized to act as therapeutics against these target proteins? These are just some of the key questions facing researchers in life sciences today.

To meet the challenge of integrating and analyzing large quantities of diverse scientific data from a variety of life sciences domains, IBM has developed a versatile solution, IBM DiscoveryLink™, that can help dramatically increase R&D productivity with single-query access to existing databases, applications and search engines. The DiscoveryLink solution includes the combined resources of DiscoveryLink middleware and IBM Life Sciences services. Using this versatile software, IBM Life Sciences services can create new components that allow specialized databases (for proteomics, genomics, combinatorial chemistry, or high-throughput screening) to be accessed and integrated quickly and easily.


Chapter 1. IBM Life Sciences Solutions: Advancing Research and Discovery through Information Technology

Dramatic advances occurring in the life sciences industry are changing the way we live. These advances fuel the rapid scientific discoveries in genomics, proteomics and molecular biology that serve as the basis for medical breakthroughs, the advent of personalized medicine and the development of new drugs and treatments. The sequenced human genome has already increased the number of biological drug “targets” that can be explored from about 500 to over 30,000. Soon, the typical life sciences company will need to access and analyze “petabytes” (10^15 bytes) of data to further its research efforts. In addition to the sheer volume of data, there are challenges related to querying non-standard data formats, accessing data assets across global networks and securing data outside firewalls. The competitive advantage belongs to companies that can best use information technology (IT) solutions to capitalize on the opportunities presented by this transformation.



1.1 Discovering new possibilities

In response to these challenges, life sciences companies are redefining their research methodologies and retooling their IT infrastructures to position themselves for success in this new environment. The traditional trial-and-error approach is rapidly giving way to a more predictive science based on sophisticated laboratory automation and computer simulation. The technology used in newly emerging life sciences discovery models is critical to laboratory productivity and time to market. Key issues include:

• Sharing and pooling information across global resources while maintaining security

• Retrieving and integrating diverse data across a variety of scientific domains

• Adding new data sources without new software development or complete redeployment of the solution

• Acquiring experimental data from industrial-style laboratory activities 24 hours a day, 7 days a week

• Enabling continuous real-time access to data without building and managing database warehouses

• Developing new ways to collaborate among research teams using shared research to focus efforts.

Add to these issues the need to work within existing laboratory and business computing environments, and the challenges facing today’s life sciences industry are almost overwhelming.


1.2 Flexible infrastructure solutions designed for future growth

End-to-end infrastructure solutions from IBM provide the scalable tools and systems to help life sciences companies access, manage and develop content. The IBM Life Sciences organization offers a comprehensive set of innovative IT infrastructure products and services designed to create value for drug discovery, pharmaceutical and biotech companies, healthcare organizations and academic research institutions. IBM has award-winning Web application servers, advanced computer cluster offerings and powerful database technologies that can help reduce the time and effort required to capture, compile and analyze research data for drug discovery, product development and Web-based clinical trials. In addition, IBM has the experience and the storage, security and systems management products needed to run mission-critical scientific research projects.

Trained professionals from the IBM Life Sciences Global Consulting and Solutions Practice provide a wide range of consulting, implementation, outsourcing and hosting services to help improve the efficiency of R&D cycles, enhance collaboration within research communities and ensure the successful implementation of life sciences solutions. IBM is also forming strategic business partnerships with leading bioinformatics companies to build specialized end-to-end solutions for the life sciences industry, such as specialized laboratory applications.

IBM solutions for knowledge management, data integration, high-performance computing, storage, e-business and consulting services deliver the powerful key capabilities needed in life sciences laboratories:

• Knowledge management tools for transforming life sciences data into knowledge

• Data integration for extracting information and identifying patterns from multiple data sources and across diverse data domains

• High-performance computing for computational modeling, simulation and visualization

• Industry-leading, supercomputing performance for scientific workloads, including genome sequencing, protein structure sequencing and drug target identification

• Storage and retrieval technologies and tools for managing data easily

• Secure e-business technologies and services for accessing, sharing and processing information across the Internet

• Consulting services for assessing requirements and supporting solution design and implementation

• Security and data management to help protect the privacy of research data.

The life sciences industry needs flexible, scalable, reliable systems that can easily adapt as needs change, and IBM has an extensive portfolio of proven IT solutions that meet those needs. IBM offers a full range of infrastructure components, including system management and security software, storage technologies, and an impressive selection of high-performance servers, such as IBM models.

“...infrastructure solutions from IBM provide scalable tools and systems to help life sciences companies access, manage, and develop content.”


1.3 Leveraging the value of data: high-performance computing

An important, fast-growing business within the life sciences industry is focused on the compilation of genomic information into databases and the sale of that information through subscriptions to drug companies and biotech research institutions. To help identify and analyze patterns within genetic data for viability as diagnostics and pharmaceutical products, drug discovery companies need powerful, high-performance computing solutions.

High-speed, high-performance computing power and industrial-strength databases support a wide range of data-intensive computing functions: mapping genetic and proteomic information, data mining to identify patterns and similarities, and text mining across huge libraries of information. These activities require high-speed computer infrastructures with well-integrated storage systems.

IBM delivers complete solutions for searching vast quantities of genomic data from a large number of sources and executing thousands of jobs simultaneously. A wide range of server solutions enable drug discovery companies and biotechnical researchers to leverage the value of the data while maintaining control over the analysis phase of the process.

Using Linux® technology, researchers can formulate complex queries, search multiple genomic databases, create, submit and execute computational jobs and manage the computing server resources. Because the Linux operating system is not tied to any single architecture or proprietary set of tools, it is singularly well-suited for the development and deployment of applications in a heterogeneous data environment. IBM offers the broadest selection of Linux-enabled software and services available to ensure secure Linux application development environments for the life sciences.

“IBM high-performance computing incorporates the processing power and deep computing technology necessary to help solve the most complex challenges of science.”


Figure 1-1 IBM servers for Life Sciences Solutions

IBM AIX®, IBM e-server and IBM RS/6000 Scalable POWERparallel Systems (SP) solutions offer the flexibility and reliability required to handle mission-critical and data-intensive applications as well as the scalability needed to handle large workload demands today and tomorrow.


1.4 DiscoveryLink: solving the data integration dilemma

Today’s life sciences businesses require solutions with the flexibility to adapt and extend mission-critical applications to meet customer demands as well as the stability to smoothly absorb these changes. IBM’s data integration strategy provides supercomputers, software and services to enable successful research and development in life sciences laboratories. The DiscoveryLink™ solution from IBM Life Sciences includes the combined resources of DiscoveryLink middleware and IBM Life Sciences Services. Using this versatile new software, IBM Life Sciences Services can create new components that allow specialized databases (for proteomics, genomics, combinatorial chemistry or high-throughput screening) to be accessed and integrated quickly and easily.

With single-query data access, the IBM DiscoveryLink solution allows researchers to extract information from large volumes of heterogeneous life sciences research and clinical data sources, including important legacy data. The DiscoveryLink solution includes DB2® Life Sciences Data Connect and DB2 Relational Connect, software tailored specifically to life sciences research and development requirements for integrating data from multiple sources. DB2 Life Sciences Data Connect provides the interfaces (wrappers) through which the database communicates with the federated data sources. DB2 Relational Connect provides the robust, scalable communication structure for connecting to relational data sources, such as Oracle, and providing secure access to legacy data.
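As a rough illustration of the wrapper idea described above, the following Python sketch shows how a non-relational source, here a delimited text file, might be presented to a query layer as rows. It is not DiscoveryLink's actual wrapper interface, which this overview does not specify. The class name `FlatFileWrapper`, its `scan` method, and the sample protein records are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional

# Hypothetical flat-file content standing in for a non-relational data source.
SAMPLE_FLAT_FILE = """id|name|organism
P1|kinase A|human
P2|receptor B|mouse
P3|kinase C|human
"""


class FlatFileWrapper:
    """Illustrative wrapper: exposes a delimited text source as rows (dicts)."""

    def __init__(self, text: str, delimiter: str = "|") -> None:
        self._text = text
        self._delimiter = delimiter

    def scan(self, where: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
        """Yield rows, optionally filtered by simple equality predicates."""
        reader = csv.DictReader(io.StringIO(self._text), delimiter=self._delimiter)
        for row in reader:
            if where is None or all(row.get(k) == v for k, v in where.items()):
                yield dict(row)


def human_protein_names(wrapper: FlatFileWrapper) -> List[str]:
    """Ask the wrapper for rows matching a predicate, as a query layer might."""
    return [row["name"] for row in wrapper.scan({"organism": "human"})]
```

The design point the sketch tries to capture is that the caller only sees rows and predicates, while the details of the underlying format stay inside the wrapper.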

“The DiscoveryLink solution combines innovative middleware and integration services to extract information from large volumes of heterogeneous life sciences and clinical data sources.”


Figure 1-2 IBM Systems and Systemware for Life Sciences Solutions

DiscoveryLink is unique among existing systems because it enables easy creation of wrappers for nonrelational sources and provides the capability to add new sources dynamically. It also includes query optimization technology that automatically searches for the most efficient means of executing a query and assembling the results. IBM DB2® Universal Database™, the industry’s first multimedia, Web-ready, federated database, provides the industry-leading performance and scalability required to drive the most demanding life sciences applications. Although it is built on DB2 Universal Database technology, DiscoveryLink is designed to be used along with most other databases. Functioning as the critical bridge between R&D applications and diverse data sources, DiscoveryLink delivers a single-table virtual database federation of multiple heterogeneous data sources. While these data sources appear as a single format to the applications and end users, the functionality, integrity, location and form of the original data remain unchanged.
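The idea of one query spanning several sources while the sources themselves stay where they are can be sketched with a small stand-in. The Python example below uses SQLite's `ATTACH DATABASE` to make two independent in-memory databases visible through one connection and joins them in a single SQL statement. This is only an analogy, not DiscoveryLink or DB2 federated syntax. The database alias `assays`, the table and column names, and the sample values are invented for illustration.

```python
import sqlite3
from typing import List, Tuple


def build_federation() -> sqlite3.Connection:
    """Create one connection that can see two separate in-memory databases."""
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database under a hypothetical alias.
    conn.execute("ATTACH DATABASE ':memory:' AS assays")
    conn.execute(
        "CREATE TABLE main.compounds (compound_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE assays.results (compound_id INTEGER, target TEXT, ic50_nm REAL)"
    )
    # Made-up sample rows; they do not come from any real experiment.
    conn.executemany(
        "INSERT INTO main.compounds VALUES (?, ?)",
        [(1, "cmpd-A"), (2, "cmpd-B"), (3, "cmpd-C")],
    )
    conn.executemany(
        "INSERT INTO assays.results VALUES (?, ?, ?)",
        [(1, "5-HT2A", 12.0), (2, "5-HT2A", 850.0), (3, "D2", 40.0)],
    )
    conn.commit()
    return conn


def potent_compounds(
    conn: sqlite3.Connection, target: str, max_ic50_nm: float
) -> List[Tuple[str, float]]:
    """Run a single query that joins tables living in two different databases."""
    query = """
        SELECT c.name, r.ic50_nm
        FROM main.compounds AS c
        JOIN assays.results AS r ON r.compound_id = c.compound_id
        WHERE r.target = ? AND r.ic50_nm <= ?
        ORDER BY r.ic50_nm
    """
    return conn.execute(query, (target, max_ic50_nm)).fetchall()
```

Calling `potent_compounds(build_federation(), "5-HT2A", 100.0)` reads from both databases in one statement and leaves the rows in each source untouched, which is the property the paragraph above attributes to the federated approach.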


With its innovative database technology and multiplatform hardware support, the DB2 DiscoveryLink technology can complement and extend the capabilities of existing data warehouses and object frameworks. This enables drug discovery, biotechnology and pharmaceutical research companies to boost productivity while protecting their IT investments. DiscoveryLink allows integration of multiple diverse sources, including text search engines and industry-specific repositories to make data management and application development easy.


1.5 Knowledge management: turn research data into insight

Many drug discovery processes, including clinical trials, require maximum efficiency for data sharing and knowledge management functions like patient record mining across companies. IBM has the end-to-end open source infrastructure solutions to help optimize data sharing and information management.

Figure 1-3 IBM Knowledge Management Tools for Life Sciences Solutions

With IBM knowledge management software and services, laboratory research teams can capture, manage and share information and create new information relationships. Life sciences teams can build virtual research communities using secure Web portals for online research collaboration in real time. They can also meet the security requirements of the entire R&D process, including privacy issues concerning the genetic information of donors in research studies, patient records in clinical trials and confidential business information. IBM provides a full range of knowledge management infrastructure solutions and development tools to enhance the overall effectiveness of the research process.

1.5.1 Storage: share centralized data

A key component of a high-quality life sciences solution is reliable, disaster-proof storage. IBM Web planning, designing and hosting services can help life sciences research teams develop custom solutions for managing clinical trials, personalized medicine and FDA approvals in a secure and stable environment. IBM storage hardware, software and services can help maximize laboratory productivity and minimize operating costs. Researchers can store results obtained from collaborative research and data mining in “pools” of commonly shared knowledge that are administered from a centralized point. Laboratories can increase capacity without interruptions using these scalable storage systems and reduce backup time because only modified data needs to be transferred. IBM consultants can also design a flexible disaster-tolerance strategy, with no single point of failure, to accommodate laboratory data needs as they evolve.


1.5.2 e-business: manage and share data in a secure, stable environment

R&D organizations can use IBM data integration and management tools to create easy-to-use Web portals to leverage the Internet for collaborative global life sciences research. Award-winning Web application servers, advanced computer cluster systems and powerful database technologies can help reduce the time and effort required to capture, compile and analyze research data for drug discovery, product development and Web-based clinical trials.

Figure 1-4 IBM Storage Solutions for Life Sciences

“Reliable, disaster-proof IBM storage hardware, software and services help maximize laboratory productivity.”


IBM WebSphere Application Server is an e-business application deployment environment built on open, standards-based technology. The WebSphere software platform provides the foundation for building and expanding a life sciences e-business infrastructure on the Web. With a stable Web foundation life sciences researchers can access multiple online data sources, personalize data queries and ensure the secure management of research and medical information.

Figure 1-5 IBM Web Tools for Life Sciences Solutions

1.5.3 Services and consulting: optimize your solution

IBM has the skills, resources and infrastructure management support to meet the laboratory and business needs of the life sciences industry. IBM specialists can provide consulting, systems integration, strategic outsourcing services and hosting services, along with data management, knowledge management and e-business solutions that work with existing systems and software. As technology needs change and grow, IBM consultants can help optimize systems to accommodate every level of expansion.

IBM is working directly and in concert with IBM Business Partners to provide highly effective approaches to IT management and strategy that are specifically tailored for life sciences companies, including data integration, query management, data mining and data interpretation, using advanced algorithms and deep computing technologies.

“Create easy-to-use Web portals to harness the power of the Internet for collaborative, global research.”


1.6 IBM’s commitment to life sciences

The IBM commitment to the life sciences industry is defined by the establishment of the specialized IBM Life Sciences Solutions Group dedicated to rapidly bringing leading-edge technology out of the laboratory and into the marketplace for customers and Business Partners in the fields of pharmaceutical research, biotechnology, genomics, proteomics, health and other life sciences. In addition, a dedicated IBM Global Life Sciences Consulting and Solutions practice has been established to focus IBM service capabilities and expertise, as well as intellectual capital, on helping customers and Business Partners to migrate their R&D units into even more efficient and competitive operations.

IBM Research continues to pursue strategic exploration of technologies applicable to life sciences research and product development, including pattern discovery and matching, functional and structural genomics and proteomics, visualization, technical knowledge management and nanotechnology. The IBM Computational Biology Center and the IBM Deep Computing Institute are two key research centers housing teams of scientists working on projects involving computational biology, chemistry and material science. The long-term projects at the IBM Computational Biology Center foster IBM collaboration with life sciences companies to bring scientific expertise directly into the development of life sciences solutions. Detailed studies are in progress aimed at providing new clues for medical diagnostics, the synthesis and design of novel materials and the analysis of genes and their relationships.

The IBM Deep Computing Institute provides business decision-making capabilities to analyze and develop solutions for complex and difficult problems. Combining these capabilities with advances in algorithms, analytic methods, modeling and simulation, data management and software infrastructures enables valuable scientific, engineering and business solutions. Joining life sciences research with information technology, IBM is uniquely positioned to help the industry solve the challenges of R&D information management.

IBM is actively pursuing strategic business partnerships with life sciences companies whose complementary skills, knowledge and resources can help build value-rich solutions for the industry.

There are extraordinary challenges and opportunities ahead for the companies engaged in the life sciences industry. The scientific challenges in this emerging industry are matched equally by the challenges associated with managing data integration and developing the computing technology and tools needed to provide solutions for the laboratory.

Combined with IBM’s core strength in providing robust technology and global services, the IBM Life Sciences business unit delivers innovative, scalable, infrastructure solutions that are unique within the industry.

For more information and to learn more about IBM Life Sciences solutions, visit our Web site at ibm.com/solutions/lifesciences. Or contact an IBM Life Sciences specialist at [email protected].


1.7 Why IBM?

Today, IBM systems include 215 of the world’s 500 most powerful high-performance computers; the world’s fastest supercomputer (ASCI White, 12.3 trillion calculations/sec); the world’s largest established database (116 terabits); advanced storage management; and a world-renowned computational biology center. IBM announced a $100 million initiative to build what promises to be the world’s most powerful supercomputer, “Blue Gene,” which will be capable of more than one quadrillion operations per second (one “petaflop”).

The Blue Gene project is an IBM Research project with three goals:

• Explore the challenges associated with cellular hardware architectures for massively parallel computers.

• Conduct research into software application development for a computer system with the interprocessor communication capabilities necessary for optimized performance.

• Exploit the Blue Gene supercomputer to address the important problems of protein science, in particular the phenomenon of protein folding, and ultimately specific issues such as drug-protein interaction.

IBM is committed to bringing powerful products and expertise to help researchers and scientists speed their processes and optimize their results.

“Joining life sciences research with information technology, IBM is uniquely positioned to help solve the challenges of R&D information management.”


Chapter 2. IBM Life Sciences Solutions: Turning Data into Discovery with DiscoveryLink

One of the life sciences industry’s most difficult challenges is transforming massive quantities of highly complex, constantly changing data—from a variety of data sources—into knowledge. Data that is dispersed throughout the R&D enterprise. Scattered across disparate hardware platforms. Compiled within specialized applications and databases, including sequence databases, chemical structure databases and relational databases. The challenge of harnessing this substantial data into life sciences insights—transforming information into knowledge—is increased daily by the exponential explosion of data created in every domain of the life sciences industry. Somewhere within the mountains of information are answers to questions that may prevent and cure disease. Questions about what proteins are encoded by the over 30,000 human genes. What biological pathways do they participate in? Which proteins are appropriate targets for the development of new therapeutics? What molecules can be identified and optimized to act as therapeutics against these target proteins? These are just some of the key questions facing researchers in life sciences today.

To meet the challenge of integrating and analyzing large quantities of diverse scientific data from a variety of life sciences domains, IBM has developed a versatile solution—IBM DiscoveryLink™—that can help dramatically increase R&D productivity with single-query access to existing databases, applications and search engines. The DiscoveryLink solution includes the combined resources of DiscoveryLink middleware and IBM Life Sciences services. Using this versatile software, IBM Life Sciences services can create new components that allow specialized databases—for proteomics, genomics, combinatorial chemistry, or high-throughput screening—to be accessed and integrated quickly and easily.


“Solving the data integration dilemma for the life sciences.”


2.1 Traditional approaches to data integration

The key to increasing R&D effectiveness and remaining competitive in today’s fast-paced scientific community is data integration. The ability to tap into multiple heterogeneous data sources and quickly retrieve clear, consistent information is critical to uncovering correlations and insights that lead to the discovery of new drugs and agricultural products.

Traditional approaches, such as data warehousing and point-to-point connections between specific applications and databases, provide only part of the answer. Data warehousing (placing data into a centralized repository) works well in situations where information is relatively static and data types are not too diverse. But building and maintaining enterprise-wide warehouses that contain hundreds of data sources can be costly and risky.

Similarly, the technical effort and costs associated with writing customized point-to-point connections to multiple data sources and applications can result in unwieldy, time-consuming processes for companies with limited IT resources. Factor in the additional time and expense required to execute queries across multiple data sources and it is clear that companies adopting these traditional approaches to data generation and management run the risk of decreasing the overall efficiency and effectiveness of their R&D operations.

2.2 Case study: a protein is born

After extensive experimentation, Dave, a biologist at a pharmaceutical company, needs to determine if his protein sequence is already known and, if not, whether there are any other sequences homologous to the new sequence. The pharmaceutical company has its own curated copy of the protein database in house. But Dave needs to check the publicly available version to see if additional data exists that has not yet made its way into the in-house version.

Using traditional methods, he would run a BLASTP search using the new sequence against the in-house version of the database and do a second BLASTP search against the public version. After obtaining the results from the two searches, Dave would then combine the results, eliminating the sequences common to both to get a unique list. Because the application for accessing the in-house version is different from the Web interface used to access the public version, Dave would have to cut and paste the two sets of results or write a new application to combine the data.

The DiscoveryLink solution will enable Dave to request information from both data sources and compile the results with a single query. It’s that simple.
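The idea of replacing two searches plus a manual merge with one statement can be sketched in a few lines. This is only an illustration under stated assumptions, not DiscoveryLink code: two SQLite in-memory databases stand in for the in-house and public protein databases, and the table name `blast_hits`, the schema name `public_db`, the function `unique_hits` and the sample accessions are all hypothetical.

```python
import sqlite3


def unique_hits(in_house, public):
    """Return one row per accession across both hit lists, best E-value first.

    Each argument is a list of (accession, evalue) tuples. Two SQLite
    in-memory databases stand in for the in-house and public sources.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical second source, attached so one statement can see both.
        conn.execute("ATTACH DATABASE ':memory:' AS public_db")
        conn.execute("CREATE TABLE main.blast_hits (accession TEXT, evalue REAL)")
        conn.execute("CREATE TABLE public_db.blast_hits (accession TEXT, evalue REAL)")
        conn.executemany("INSERT INTO main.blast_hits VALUES (?, ?)", in_house)
        conn.executemany("INSERT INTO public_db.blast_hits VALUES (?, ?)", public)
        # One query combines both sources and keeps each accession once.
        query = """
            SELECT accession, MIN(evalue) AS best_evalue
            FROM (
                SELECT accession, evalue FROM main.blast_hits
                UNION ALL
                SELECT accession, evalue FROM public_db.blast_hits
            )
            GROUP BY accession
            ORDER BY best_evalue, accession
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Calling `unique_hits` with one hit list per source returns a single de-duplicated list, which is the step Dave would otherwise perform by cutting and pasting two result sets.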

“DiscoveryLink provides a total solution to address the challenges associated with traditional methods of data integration in the life sciences industry.”


2.3 DiscoveryLink: transform data into insight-fast

Combining IBM software and services, DiscoveryLink provides a total solution to address the challenges associated with traditional methods of data integration in the life sciences industry. Functioning as the critical bridge between R&D applications and diverse data sources, DiscoveryLink provides single-query cross-source access to databases and delivers a single-format virtual database view of multiple heterogeneous data sources. While these data sources appear as a single format to applications and end users, the integrity and form of the original data remain unchanged.

2.3.1 Case study: unlocking mysteries of the brain stem

In the brain stem, the most primitive part of the brain, lie clusters of serotonin neurons. The deficiency of available serotonin, or inefficient serotonin receptors, is implicated in a broad range of disorders including depression, schizophrenia and Parkinson’s disease. Agents that modulate the processing of serotonin, or 5-HT, by inhibiting or stimulating its release, can be useful for treating such diseases. Analysts project a greater than $10 billion market for serotonin-related drugs within the next decade.

Jane, a chemist, wants to see the structures of compounds that are active against a family of serotonin receptors to better understand the mechanism of action and gain insights into the structure activity relationships. Using traditional methods, this query would require a three-way join of information from at least three different data sources—a protein sequence database, chemical structure database and an assay database. Jane would have to initiate at least three separate requests: to find compounds that scored low IC50 in an assay; to eliminate the assays that were not members of the serotonin family; and to then retrieve the structures of compounds tested in the remaining assays. To further complicate the process, she must also determine the best way to coordinate and execute these queries.

If Jane proceeds in the wrong order, she runs the risk of complicating the search, which can dramatically prolong an already tedious task and escalate costs. With DiscoveryLink, the software views the entire request at once and automatically parses, optimizes and executes the query, ensuring that it is executed efficiently.

After reviewing the results of her initial query, Jane recognizes ketanserin, a compound that is highly selective against the HTR2A class of serotonin receptors. Now she wants to find compounds that are active against any members of the serotonin receptor family and have other drug-like characteristics (such as specific values for clogP and molecular weight). Although this query also requires information from all three data sources, it exploits the chemical structure database’s ability to perform similarity searches. The DiscoveryLink optimizer efficiently executes the three-way join. By exploiting the power of Structured Query Language (SQL) for enhanced performance and optimized query capabilities, the DiscoveryLink solution makes it easy to integrate data from multiple sources, allowing researchers to obtain real value from data.
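Jane's first request can be written as a single three-way SQL join. The sketch below is a hedged illustration only: it uses three local SQLite tables as stand-ins for the protein sequence, assay and chemical structure sources, whereas in DiscoveryLink each would be a separate federated source. The table and column names, the helper functions, the IC50 threshold and every data value are hypothetical.

```python
import sqlite3

# Hypothetical stand-ins for the three data sources in the case study.
SCHEMA = """
CREATE TABLE targets   (target_id TEXT PRIMARY KEY, family TEXT);
CREATE TABLE assays    (assay_id TEXT, target_id TEXT, compound_id TEXT, ic50_nm REAL);
CREATE TABLE compounds (compound_id TEXT PRIMARY KEY, structure TEXT);
"""

# One statement expresses the whole request; the engine chooses the join order.
THREE_WAY_JOIN = """
SELECT DISTINCT c.compound_id, c.structure, t.target_id, a.ic50_nm
FROM assays AS a
JOIN targets AS t ON t.target_id = a.target_id
JOIN compounds AS c ON c.compound_id = a.compound_id
WHERE t.family = ? AND a.ic50_nm <= ?
ORDER BY a.ic50_nm, c.compound_id
"""


def build_demo_db():
    """Create an in-memory database populated with made-up demonstration rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO targets VALUES (?, ?)",
        [("HTR2A", "serotonin receptor"), ("TARGET_X", "other family")],
    )
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?)",
        [("CPD-1", "structure-1"), ("CPD-2", "structure-2"), ("CPD-3", "structure-3")],
    )
    conn.executemany(
        "INSERT INTO assays VALUES (?, ?, ?, ?)",
        [
            ("A1", "HTR2A", "CPD-1", 5.0),
            ("A2", "HTR2A", "CPD-2", 900.0),
            ("A3", "TARGET_X", "CPD-3", 2.0),
        ],
    )
    return conn


def active_compounds(conn, family, max_ic50_nm):
    """Return structures of compounds with a low IC50 against one target family."""
    return conn.execute(THREE_WAY_JOIN, (family, max_ic50_nm)).fetchall()
```

Because the request is stated once and declaratively, the question of which of the three lookups to run first is left to the query optimizer rather than to the chemist.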

“DiscoveryLink is unique among existing systems because it enables easy creation of wrappers for nonrelational sources and provides the capability to add new sources dynamically.”


The DiscoveryLink solution includes the IBM DB2® Life Sciences DataConnect and DB2 RelationalConnect set of software tailored specifically to life sciences research and development requirements for integrating data from multiple sources. DB2 Life Sciences DataConnect provides the interfaces (wrappers) through which the database communicates with the federated data sources. DB2 RelationalConnect provides the robust, scalable communication structure for connecting relational data sources, such as Oracle, and provides secure access to legacy data. Using DiscoveryLink, IBM Life Sciences services can create the interfaces to translate the information needed to execute queries from target data sources.

DiscoveryLink is unique among existing systems because it enables easy creation of wrappers for nonrelational sources and provides the capability to add new sources dynamically. Its sophisticated query-processing engine uses federated database technology to extend the power of a relational database search engine to distributed data sources in relational and nonrelational formats. To ensure robust performance and fast response time, DiscoveryLink includes query optimization technology that automatically searches for the most efficient means of executing the query and assembling the results. With a single SQL command, researchers can access and integrate information from multiple data sources. While these data sources appear as a single format to the applications and end users, the functionality, integrity, location and form of the original data remain unchanged.

Using the software wrapper to facilitate communication, the DiscoveryLink solution can receive complex queries from applications, identify the individual query elements and execute each one separately with the appropriate data source—all without copying or altering the original data sources.
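The division of labor described above (split a request into per-source pieces, send each piece to its own wrapper, then assemble the answers) can be sketched as follows. This is a simplified illustration and not the DiscoveryLink wrapper interface: the class `ListWrapper`, its `scan` method and the function `federated_join` are hypothetical names, and in-memory lists stand in for real data sources.

```python
class ListWrapper:
    """Hypothetical wrapper: exposes an in-memory list of dict rows as a source."""

    def __init__(self, rows):
        self._rows = [dict(row) for row in rows]

    def scan(self, predicate=None):
        """Return copies of the rows that satisfy the optional predicate."""
        return [dict(row) for row in self._rows
                if predicate is None or predicate(row)]


def federated_join(left, right, key, left_filter=None, right_filter=None):
    """Send each sub-request to its own wrapper, then join the answers on `key`.

    The join runs in the middle tier, so neither source needs to support
    joins and neither source is copied or modified.
    """
    # Index the right-hand answers by join key for fast matching.
    index = {}
    for row in right.scan(right_filter):
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left.scan(left_filter):
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined
```

A production federated engine would also decide which filters to push down to each source and in what order to contact them; this sketch fixes those choices to keep the example short.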

2.3.2 Case study: bridging oceans of data when companies merge

After the merger between the two pharmaceutical giants, AlphaPharm and BetaPharm, the newly formed informatics team for Alpha-BetaPharm needed to easily access each company’s compound databases. Developing new leads for therapeutic research, the newly formed Alpha-BetaPharm research team, located at multiple geographic locations worldwide, needed to query the chemical structure and assay databases of the two companies to identify the various targets that compounds similar to fluoxetine were tested against.

Typically, each research team would write an application to access the two chemical-structure databases for compounds with fluoxetine’s structure and combine the results of both searches into a single representation. Using the combined results of the first query, the researchers for each company would then run a second pair of queries against the two assay databases to identify the targets that fluoxetine has been screened against.

Using the DiscoveryLink solution, a single query can retrieve the similar structures and the matching assays from both company databases and present the combined results as a single data source with a common representation. DiscoveryLink ensures that their query is optimized and executed efficiently, significantly reducing the time, effort and frustration associated with searching and comparing data from a variety of data sources across two separate company systems.

Before proposing the synthesis and testing of a promising compound, the Alpha-BetaPharm researchers may also need to gather more information about the compound and related compounds from the proprietary toxicity database and the metabolic pathway database. Without the DiscoveryLink solution, searching structures and names from these databases would require another series of potentially tricky queries. With the solution, IBM Life Sciences services can use the DiscoveryLink middleware to develop the interfaces needed to access and integrate the additional information with a single query.


2.4 Industrial-strength performance and ease of use

With its innovative database technology and multiplatform hardware support, DiscoveryLink can complement and extend the capabilities of existing data warehouses and object frameworks, enabling drug discovery, biotech and research companies to boost productivity while protecting their IT investments. The DiscoveryLink solution allows the integration of multiple diverse sources to make management and application development easy.

Because DiscoveryLink queries the original, distributed data sources without modifying or copying the data, it can eliminate many database currency and synchronization issues. Applications can take advantage of the native search engines and processing capabilities that come with the original databases. But it does not stop there; it can also automatically initiate and perform additional operations as part of a query.

Depending on workload and system requirements, DiscoveryLink middleware can run on one or more servers. DiscoveryLink solution capabilities include support for a variety of popular operating systems, including IBM AIX®, Microsoft® Windows® 2000 and Windows NT®, Linux® and Sun Solaris®.

2.5 Start simply, grow fast

Today, more than 225,000 companies and more than 40 million users worldwide rely on IBM DB2 solutions. Built on IBM DB2® Universal Database™ technology, DiscoveryLink integration middleware may be used with most existing database infrastructures. DiscoveryLink builds on IBM’s knowledge and understanding of databases and data management while adding the scientific expertise needed to customize solutions unique to the life sciences industry.

IBM offers several options for the initial installation and post-installation maintenance and support of the DiscoveryLink solution. As part of the initial installation, IBM service professionals work closely with companies to define their data integration requirements and solve data integration problems across the entire life sciences R&D process. Companies can contact IBM service representatives about building a comprehensive plan to provide an integration framework for their total R&D enterprise, according to the specific needs of their organization.

Others may choose the modular services approach, which enables companies to start small and grow as quickly as their business demands. Engagements can begin with the integration of a few key sources and grow as data integration needs change. This approach minimizes the up-front investment while helping to deliver early results and reduce overall project risk.

“The DiscoveryLink solution combines innovative middleware and integration services to extract information from large volumes of heterogeneous life sciences and clinical data sources.”


IBM draws on the extensive resources of IBM Healthcare Consulting and over 120,000 IBM Global Services personnel worldwide to assist life sciences companies of all sizes with building a solution based on their particular data integration requirements. In addition, IBM is continually working with leading industry software and database providers to offer life sciences companies new and innovative integrated solutions.

2.6 Accelerate scientific discovery and productivity

DiscoveryLink allows integration of discovery, clinical trial, regulatory and even marketing data throughout the product development, approval and deployment cycle. Research organizations can increase the number of qualified discovery projects, identify promising targets and leads more quickly and develop those leads more effectively while reducing the burden of managing the IT infrastructure. This complete, integrated solution from IBM enables fast, sophisticated data analysis to help researchers turn data into scientific insight.

To learn more about how the DiscoveryLink solution can help dramatically improve the R&D effectiveness of your organization, visit our Web site at ibm.com/solutions/lifesciences. Or contact an IBM Life Sciences solutions specialist at [email protected]

“...DiscoveryLink can complement and extend the capabilities of existing data warehouses and object frameworks, enabling drug discovery, biotechnology, and research companies to boost productivity...”


Chapter 3. DiscoveryLink: A Data Integration Solution for Life Sciences

The DiscoveryLink solution from IBM allows the integration of discovery, clinical trial, regulatory and even marketing data throughout the product development, approval and deployment cycle.

Research organizations can increase the number of qualified discovery projects, identify promising targets and leads faster and develop them more effectively while reducing the burden of managing IT infrastructure. The DiscoveryLink solution from IBM includes the combined resources of DiscoveryLink middleware and IBM Life Sciences Services.

Using this versatile software, IBM Life Sciences Services can create new components that allow specialized databases for proteomics, genomics, combinatorial chemistry and high-throughput screening to be accessed and integrated quickly and easily. DiscoveryLink is unique among existing systems because it enables easy creation of wrappers for nonrelational sources and provides the capability to add new sources dynamically.

It also includes query optimization technology that automatically searches for the most efficient means of executing a query and assembling the results.



3.1 DiscoveryLink Presentation

DiscoveryLink is built on IBM DB2® Universal Database and includes the DB2® Life Sciences DataConnect and DB2 RelationalConnect set of software tailored specifically to the life sciences research and development requirements for integrating data from multiple sources.

DB2 Life Sciences DataConnect provides the interfaces (wrappers) through which the database communicates with the federated data sources. DB2 Relational Connect provides the robust, scalable communication structure to connect to relational data sources, such as Oracle, and provide more secure access to legacy data. Built on DB2 technology, DiscoveryLink is designed to be used along with many other databases.

Although DiscoveryLink delivers a single-table virtual database federation of multiple heterogeneous data sources, the functionality, integrity, location and form of the original data remain unchanged.


Figure 3-1 DiscoveryLink: A Data Integration Solution for Life Sciences

DiscoveryLink — A Data Integration Solution for Life Sciences

A new understanding of the workings of life at the genetic and molecular levels, combined with laboratory automation, promises to make finding new therapeutic agents radically faster, cheaper, and more effective. New data are pouring out of innovative technologies, such as genomics, at an unprecedented and rapidly increasing rate.

DiscoveryLink offers a unique data integration and knowledge management capability that addresses the extremely demanding needs presented by the ever increasing amounts and types of data required for research in the life sciences, particularly in informatics and drug discovery. It is a way of turning life science data into insight.


Figure 3-2 Increasing Data Requirements

Dramatic advances occurring in the life sciences industry are changing the way we live. These advances fuel the rapid scientific discoveries in genomics, pharmaceutical research, proteomics, and molecular biology that serve as the basis for medical breakthroughs and the development of new drugs and treatments. As imperatives to unravel the mysteries of DNA and bring healing medicines to the market on time—better, faster and cheaper—grow more urgent, the pressures to improve the productivity of the research and development (R&D) processes intensify.

One of the life sciences industry's most difficult challenges is transforming massive quantities of highly complex, constantly changing data—from a variety of data sources—into knowledge.

The sequenced human genome has already increased the number of biological drug “targets” that can be explored from about 500 to over 30,000. Soon, the typical life sciences company will need to access and analyze “petabytes” (10^15 bytes) of data to further their research efforts. In addition to the enormity of the data, there are challenges related to querying non-standard data formats, accessing data assets across global networks, and securing data outside firewalls.

[Figure 3-2 slide, “Increasing Data Requirements”. Axis label: Petabytes of Data; time axis: 1990, 2000, 2010. Labeled data sources: ESTs, Human Genome, SNPs, Pharmacogenomics, Proteins, Metabolic Pathways, Combinatorial Chemistry, HTS. Labeled drivers: Medical Data Growth, External Research Partnerships, Growth in Clinical Trials, The Internet.]

• Life sciences data is increasing at a tremendous rate

• Petabytes (10^15 bytes) of data are projected

• Data integration and data management are key to successfully deciphering meaning


Figure 3-3 Convergence of Life Sciences and IT Creating New Discovery

It is an exciting time for the life sciences. Information technology and scientific inquiry are converging and the resulting relationship is creating new models for discovery.

[Figure 3-3 slide, “The Convergence of Life Sciences and IT will Create New Discovery Models”. Labels: Information Technology; Life Sciences; Scientific discovery; New drugs and treatments; Revolution in healthcare.]


Figure 3-4 IBM Life Sciences

The goal of the IBM Life Sciences Business Unit is to rapidly bring information technology to customers and partners in the fields of pharmaceutical research, biotechnology, genomics, health, and other life sciences industries. IBM is a proven leader in data integration, supercomputing, high-performance storage, e-business, and information technology services.



Figure 3-5 Two Approaches to Data Storage

There are two approaches to data storage: data warehouses and federated data sources.



Figure 3-6 What is a Data Warehouse?

A data warehouse is designed to support a set of specific activities. Data is replicated from multiple sources within a company into one DBMS. Data can be processed (cleansed, filtered or transformed) during replication, and it is refreshed at specified times.

An Enterprise Data Warehouse (EDW) is a BIG warehouse, big enough to contain a large fraction of a company's data.

A data mart is a “small” warehouse designed to support a specific activity.

A data warehouse extracts data from data sources across an entire enterprise and acts as a centralized repository of information.



Figure 3-7 Data Warehouse

Data is brought from separate data sources into a centralized repository (warehouse). The data can be “cleansed” as it is brought into the warehouse. The data is sent from the warehouse directly to end user applications or to data marts prior to the actual application.

[Figure 3-7 diagram: operational data sources supply cleansed/filtered data to the data warehouse, which sends data to data marts and end-user applications.]

• Data is brought into the warehouse from a separate source

• Data is sent from the warehouse to end-user applications


Figure 3-8 Data Warehouse Limitations

Data warehouses have limitations relating to the physical storage of data in one location. Keeping data “fresh” requires frequent updates (or data replication) to ensure that data is current: some data sources may not change often but other sources change frequently. Because data is stored in multiple locations there are associated hardware and data management costs.

• Data freshness depends on the frequency of data replication (updating data to ensure that additions, deletions, and modifications are reflected).

• Data is replicated and stored in multiple locations, resulting in increased storage hardware and data management costs.

• Specialized searches need to be re-implemented in the warehouse model.

• Changes instituted in remote sources are not directly reflected in the warehouse without intervention.


Figure 3-9 What is a Federated Database?

With a federated approach, the data remains in the original sources and is accessed as required to support specific applications. The data sources are not modified in any way, and the data is always as current as the original data sources.

The use of a federated approach:

• Allows data and applications to be integrated in real time

• Is adaptable to constantly changing technologies

• Obviates problems of data currency and synchronization

• Can be complementary to data warehouse technology and can enhance it

Data remains in the original separate sources

All operational data sources accessible with a single query

Query optimization on all data sources
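As a rough illustration of "one query, several sources", the following sketch uses Python's built-in `sqlite3` module. It is only an analogy: SQLite's `ATTACH DATABASE` stands in for the federated server, and the file names, table names, and values are all made up for the example. It is not DiscoveryLink code.

```python
import os
import sqlite3
import tempfile


def build_sources(directory):
    """Create two independent SQLite files standing in for separate sources."""
    compounds_path = os.path.join(directory, "compounds.db")
    assays_path = os.path.join(directory, "assays.db")

    con = sqlite3.connect(compounds_path)
    con.execute("CREATE TABLE compounds (compound_id INTEGER PRIMARY KEY, name TEXT)")
    con.executemany("INSERT INTO compounds VALUES (?, ?)",
                    [(1, "compound-A"), (2, "compound-B"), (3, "compound-C")])
    con.commit()
    con.close()

    con = sqlite3.connect(assays_path)
    con.execute("CREATE TABLE results (compound_id INTEGER, activity REAL)")
    con.executemany("INSERT INTO results VALUES (?, ?)",
                    [(1, 0.5), (2, 12.0), (3, 0.02)])
    con.commit()
    con.close()
    return compounds_path, assays_path


def federated_query(compounds_path, assays_path, max_activity):
    """Answer one SQL query that spans both files; the data stays where it is."""
    hub = sqlite3.connect(":memory:")  # plays the role of the federated server
    hub.execute("ATTACH DATABASE ? AS chem", (compounds_path,))
    hub.execute("ATTACH DATABASE ? AS assay", (assays_path,))
    rows = hub.execute(
        """
        SELECT c.name, r.activity
        FROM chem.compounds AS c
        JOIN assay.results AS r ON r.compound_id = c.compound_id
        WHERE r.activity < ?
        ORDER BY r.activity
        """,
        (max_activity,),
    ).fetchall()
    hub.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    paths = build_sources(tmp)
    hits = federated_query(*paths, max_activity=1.0)
    print(hits)
```

The application sees a single connection and a single query, while each table continues to live in its own file.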

[Slide diagram: an application sends a single query to the federated server, which accesses the operational data sources.]


Figure 3-10 Federated Database Technology

To process queries such as the one coming from the computer on the left, DiscoveryLink needs a full database engine. The engine not only compiles and optimizes the query to get the best possible plan, but also allows DiscoveryLink to compensate for functionality that is missing in less sophisticated data sources (for example, a source that cannot do joins or cannot evaluate particular kinds of predicates).
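The idea of compensation can be sketched in a few lines of Python. This is a minimal toy, not the DiscoveryLink engine: the `SimpleSource` class, its `supported_ops` set, and the sample rows are all hypothetical. Predicates that the source can evaluate are pushed down to it; the rest are applied by the engine as residual predicates on the rows that come back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Predicate = Tuple[str, str, object]  # (column, operator, value)

OPERATORS: Dict[str, Callable[[object, object], bool]] = {
    "=": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}


def matches(row: Row, predicates: List[Predicate]) -> bool:
    """True if the row satisfies every predicate in the list."""
    return all(OPERATORS[op](row[col], val) for col, op, val in predicates)


@dataclass
class SimpleSource:
    """Hypothetical data source that can evaluate only some operators."""
    rows: List[Row]
    supported_ops: frozenset

    def scan(self, predicates: List[Predicate]) -> List[Row]:
        for _, op, _ in predicates:
            if op not in self.supported_ops:
                raise ValueError(f"source cannot evaluate {op!r}")
        return [r for r in self.rows if matches(r, predicates)]


def run_query(source: SimpleSource, predicates: List[Predicate]):
    """Push down what the source supports; evaluate the rest in the engine."""
    pushed = [p for p in predicates if p[1] in source.supported_ops]
    residual = [p for p in predicates if p[1] not in source.supported_ops]
    fetched = source.scan(pushed)
    result = [r for r in fetched if matches(r, residual)]
    return result, pushed, residual


# A flat-file-like source that only understands equality tests.
flat_file = SimpleSource(
    rows=[
        {"id": 1, "family": "serotonin", "weight": 350.0},
        {"id": 2, "family": "serotonin", "weight": 520.0},
        {"id": 3, "family": "dopamine", "weight": 300.0},
    ],
    supported_ops=frozenset({"="}),
)

result, pushed, residual = run_query(
    flat_file, [("family", "=", "serotonin"), ("weight", "<", 400.0)]
)
print([r["id"] for r in result])
```

Here the equality test runs at the source, and the range test, which the source cannot evaluate, is applied afterwards by the engine.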

[Slide diagram: a federated DB engine sits between the client and the underlying databases. Components shown:

• Query compiler: parser, semantic processor, optimizer

• Execution engine: sort engine, residual predicate, functions

• Catalog, data manager, locking, logging, buffer manager, client access, transaction coordinator, query gateway

• Interface to sources

Slide title: Federated Database Technology is the Foundation of DiscoveryLink]


Figure 3-11 Key Requirement in the Life Sciences

DiscoveryLink extends the warehouse capability in DB2® by providing the capability to support federated data sources, which fulfills a key requirement in the life sciences.


Figure 3-12 The DiscoveryLink Approach

Data integration/management is a common problem. Here is an example from the pharmaceutical domain. To find new drugs, scientists use a wide range of data from several sources. They need to be able to interrelate data from different sources, for example, to combine structural information in a chemical structures database with assay results from a relational database.

[Slide diagram: DiscoveryLink provides integrated data management across toxicology, proteomic, compound, genomic, textual, clinical, and gene expression data, as well as other data sources.]

Link multiple heterogeneous data sources together

One query spans multiple data sources



Figure 3-13 DiscoveryLink is Built on Proven Technology

DiscoveryLink technology is solid and stable because it is built on top of our award-winning, industry-leading, relational database technology and expertise.

DiscoveryLink comes from the integration of IBM's DataJoiner technology into DB2 Universal Database, with the addition of relational and non-relational wrapper technology from our Relational Connect and Life Sciences Data Connect products, respectively.

1995: DataJoiner®/AIX® Version 1 is released

1997: DataJoiner/AIX, NT, Solaris Version 2 is released

2000: DB2 UDB™ Version 7 Enterprise Edition and Extended Enterprise Edition are released; DataJoiner technology is integrated with DB2 Universal Database; Relational Connect; DiscoveryLink (the base technology is DB2 UDB V7 Enterprise Edition)

2001: Life Sciences Data Connect; DB2 7.2


Figure 3-14 DiscoveryLink: A Unique Combination of Features

DiscoveryLink is a unique solution to the substantial problems of accessing the disparate and heterogeneous sources of data found in the life sciences. Its strength lies in the following properties:

• Transparency: Provides a single “virtual database” to applications that appears to be just one data source, and it supports a high-level query language. Both of these factors are essential to rapid and efficient use by the researcher, and economical administration by a DBA.

• Heterogeneity: Integrates data from different data sources, covering both different types of data and different kinds of sources.

• High Function: Includes all the capabilities of SQL3 to search for and to manipulate data. No new programming is required for a complex or novel search! In addition, there is no loss of functionality in the data sources: any functionality present in the various remote databases is preserved, and functionality can be added where it is missing.

• Autonomy: Causes no perturbation of existing data sources. Data does not need to be moved or reformatted. Complex informatics databases do not need the expensive and time-consuming process of reconstruction or reformatting, and their integrity is protected.

• Performance: Optimizes queries for good performance. In complex life science and drug design queries involving massive sizes and numbers of databases, the speed with which results are returned can be a decisive factor.

[Slide content: Transparency; Heterogeneity; Functions; Cost-based optimization; IBM Global Services]


DiscoveryLink combines the five key features just mentioned, particularly the seamless data integration of a transparent virtual database, and the high performance of efficiently optimized queries. Its usefulness is enhanced by:

• Open standards and industry standard SQL, which facilitate its use with GUIs and other ISV products.

• Wrapper technology, which provides a simple way of connecting the search engine to different data repositories.

• Exploitation of the extensive and sophisticated suite of IBM database products and middleware integration with IBM's line of storage systems and devices.


Figure 3-15 Transparency

What is Transparency?

Transparency is the extent to which the system shields the user from the differences, idiosyncrasies, and implementation of the underlying data sources. Another way of looking at transparency is as the degree of physical and logical data independence. A fully federated database is highly distributed and has a high degree of transparency.

DiscoveryLink masks the differences, idiosyncrasies, and implementation of the underlying data source from the user

DiscoveryLink provides for a "virtual" data source linking multiple heterogeneous data sources

All data appears to come from one data source



Figure 3-16 DiscoveryLink Handles Heterogeneity

What is Heterogeneity?

Heterogeneity and transparency are two of the five key features mentioned at the beginning as essential characteristics of a life science research database system.

Access to multiple data sources represents a huge challenge for scientists. These sources often reside on both sides of a firewall and are often linked across a wide area network.

Heterogeneity in general refers to the disparate nature of the data sources invariably found in life science research. It can be described in terms of specific features: differences in where data is stored, how it is stored, and the server characteristics listed below.

Heterogeneity is the differentiation in existing data sources:

• Hardware platform

• Network protocol

• Operating system

• Data management software

• Data model

• Query language

• Application interface

• Query capabilities

• Error handling

• Transaction protocol

DiscoveryLink is designed to overcome such differentiation and seamlessly integrate multiple, heterogeneous sources.


Figure 3-17 Functions

DiscoveryLink can use functions of the existing sources. For example, suppose a chemical structure store has an algorithm for comparing chemical structures. That algorithm can be invoked through DiscoveryLink, by including an appropriate function call in the query DiscoveryLink receives.

DiscoveryLink can also compensate for a function that a data source does not have. As an example, suppose the remote database, which contains employee name and salary information, does not support sorting. Even though the sort function is not present in the remote database, a query through DiscoveryLink may include it exactly as if it were present.
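The sort example can be pictured with a small Python sketch. The names `remote_fetch` and `query_with_order` and the employee rows are invented for illustration; the point is only that the ordering is applied by the middle layer after the rows arrive, so the caller never needs to know that the source cannot sort.

```python
def remote_fetch():
    """Stand-in for a remote source that returns rows in storage order and cannot sort."""
    return [("Lee", 72000), ("Adams", 65000), ("Moreno", 81000)]


def query_with_order(fetch, key_index, descending=False):
    """Fetch rows from the source, then apply the ordering the source lacks."""
    rows = fetch()
    return sorted(rows, key=lambda row: row[key_index], reverse=descending)


by_salary = query_with_order(remote_fetch, key_index=1, descending=True)
print(by_salary)
```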

DiscoveryLink utilizes the functions of existing sources and SQL language

One query from DiscoveryLink can combine data from multiple sources

Source retains functionality



Figure 3-18 Cost-Based Optimization Issues

The DiscoveryLink approach involves using federated queries to connect heterogeneous sources of data. DiscoveryLink uses a cost-based optimizer, which means that the execution plan chosen for a query is the one estimated to require the least time, given the specific system characteristics. These characteristics include:

• The cost of evaluating the operation

• The relative CPU speeds of the DiscoveryLink and data source machines

• The relative I/O speeds of both machines

• Where the data is located

• The network speed between both machines

• The details of any query optimization in foreign data sources

The query may be arbitrarily complex. The optimized query plan is then used to drive the wrappers, accessing the various sources of data in the sequences and extents needed for maximum efficiency. The query engine can compensate for missing functionality, for example by imposing categories or rankings on flat database files.

The data is integrated behind the scenes, along with the query optimization. This is essential for efficient research queries.
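To make the trade-off concrete, here is a deliberately simplified cost model in Python. It is an assumption-laden toy, not the DiscoveryLink optimizer: it compares only two plans for a single filter, and the `SystemProfile` fields and the per-row timings are hypothetical numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical per-row timings, in seconds."""
    source_cpu_s_per_row: float   # time for the source to test one row
    local_cpu_s_per_row: float    # time for the federated server to test one row
    network_s_per_row: float      # time to ship one row from source to server


def plan_costs(total_rows, selectivity, profile):
    """Estimate the elapsed time of two alternative plans for one filter."""
    matching = total_rows * selectivity
    push_down = (total_rows * profile.source_cpu_s_per_row
                 + matching * profile.network_s_per_row)
    filter_locally = (total_rows * profile.network_s_per_row
                      + total_rows * profile.local_cpu_s_per_row)
    return {"push_down_filter": push_down, "filter_locally": filter_locally}


def choose_plan(total_rows, selectivity, profile):
    """Pick the plan with the lowest estimated cost."""
    costs = plan_costs(total_rows, selectivity, profile)
    best = min(costs, key=costs.get)
    return best, costs


# Slow source, unselective filter: shipping everything and filtering locally wins.
slow_source = SystemProfile(1e-3, 1e-6, 1e-5)
plan_a, _ = choose_plan(1_000_000, 0.9, slow_source)

# Fast source, slow network, selective filter: pushing the filter down wins.
slow_network = SystemProfile(1e-6, 1e-6, 1e-4)
plan_b, _ = choose_plan(1_000_000, 0.01, slow_network)

print(plan_a, plan_b)
```

The same query gets a different plan depending on the relative CPU and network speeds, which is the behavior the list of characteristics above describes.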

DiscoveryLink's cost-based optimizer is designed to manage these issues:

• How is the system configured?

• What is the optimization level?

• How is the data configured?

• How is the data distributed?

• What operations can be pushed down?

• How is each operation evaluated?

• What is the cost to evaluate an operation?

• Where is an operation evaluated?



The DiscoveryLink solution consists of a number of components, including wrappers, a query processing engine, and IBM Global Services.


Figure 3-20 The DiscoveryLink Approach: Wrappers

Wrappers are C++ programs which are packaged as a shared library and which are dynamically loaded as needed by DiscoveryLink. Typically a single wrapper is capable of accessing several data sources, as long as they share a common or similar API. (This is because the wrapper does not encode information on the schema of data in the source.)

For example, the Oracle wrapper provided with DiscoveryLink can be used to access any number of Oracle databases, and even several different Oracle releases.

Adding a new source requires only supplying a new wrapper for that source.
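The relationship between wrappers and sources can be sketched as follows. The real wrappers are C++ shared libraries; this Python version is only a conceptual illustration, and the `Wrapper`, `DictWrapper`, and `FederatedServer` classes and their methods are hypothetical names, not part of any IBM API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Row = Tuple[object, ...]


class Wrapper(ABC):
    """Hypothetical wrapper interface: one implementation per kind of source API."""

    @abstractmethod
    def fetch(self, source_name: str, table: str) -> List[Row]:
        """Return all rows of `table` from the named source."""


class DictWrapper(Wrapper):
    """Toy wrapper for any source exposed as a dict of table name -> rows."""

    def __init__(self) -> None:
        self._sources: Dict[str, Dict[str, List[Row]]] = {}

    def register_source(self, source_name: str, tables: Dict[str, List[Row]]) -> None:
        # The tables are supplied at registration time; nothing about a
        # particular schema is coded into the wrapper itself.
        self._sources[source_name] = tables

    def fetch(self, source_name: str, table: str) -> List[Row]:
        return list(self._sources[source_name][table])


class FederatedServer:
    """Routes each scan to the wrapper registered for that source."""

    def __init__(self) -> None:
        self._wrapper_for_source: Dict[str, Wrapper] = {}

    def add_source(self, source_name: str, wrapper: Wrapper) -> None:
        self._wrapper_for_source[source_name] = wrapper

    def scan(self, source_name: str, table: str) -> List[Row]:
        return self._wrapper_for_source[source_name].fetch(source_name, table)


# One wrapper instance serves two separate sources that share the same API.
wrapper = DictWrapper()
wrapper.register_source("lab_usa", {"compounds": [(1, "compound-A")]})
wrapper.register_source("lab_italy", {"compounds": [(7, "compound-X"), (8, "compound-Y")]})

server = FederatedServer()
server.add_source("lab_usa", wrapper)
server.add_source("lab_italy", wrapper)

usa_rows = server.scan("lab_usa", "compounds")
italy_rows = server.scan("lab_italy", "compounds")
print(usa_rows, italy_rows)
```

Supporting a new kind of source in this sketch would mean writing one more `Wrapper` subclass; the server and the existing wrappers stay unchanged.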

Some wrappers are currently available through Relational Connect. These cover the relational databases Oracle, Sybase, Informix, and MS SQL Server. In addition, flat file, Documentum, and Excel spreadsheet wrappers are included in Life Sciences Data Connect.

IBM internal tests have shown that queries of relational databases such as Oracle are generally executed faster when run through DiscoveryLink than when run directly through their native client, due primarily to DiscoveryLink query optimization.

Wrappers are small programs written for each type of data source

Wrappers translate a researcher's request into directions that each data source will understand

Wrappers can be written for many data sources (e.g. Oracle, DB2, SQL Server, flat files, etc.)



Figure 3-21 The DiscoveryLink Approach: Query Processing Engine

The DiscoveryLink approach involves using federated queries to connect heterogeneous sources of data. It uses a cost-based optimizer (in other words an optimizer that minimizes the use of system resources, reflected mainly by the time it takes to execute a query) to generate a query plan. The query may be arbitrarily complex. The query engine then follows the optimized query plan driving the wrappers, to access the various sources of data in the sequences and extents needed for maximum efficiency. The query engine can compensate for missing functionality, e.g. imposing categories or rankings to flat database files.

DiscoveryLink utilizes a powerful query processing engine in a federated server which:

• Increases performance via query decomposition and distribution, and cost-based optimization

• Drives wrappers and combines results

• Can compensate for missing functions in some data sources

Figure 3-22 IBM Global Services (IGS) as a Resource

The services provided by IBM Global Services demonstrate an enormous range of capabilities. Please note the reference to IBM Research. In the Life Sciences marketplace, the expertise represented by IBM Research is highly relevant in areas such as Supercomputing, Bioinformatics, Knowledge Management, Text and Data Mining, and all aspects of e-business.

IBM Global Services (IGS) is a tremendous resource to customers:

• IGS provides IT services expertise

• IGS provides consulting for drug discovery and healthcare

• IGS understands business requirements

• 140,000+ professionals in 160 countries

• IGS utilizes the intellectual capital prevalent throughout IBM

• 3000+ scientists and engineers at 8 labs in 6 countries


Figure 3-23 Data Management and Integration

IBM Global Services provides a wide variety of offerings, ranging from knowledge management and data integration to web hosting. Each of these activities has well-defined and proven customer value. Data management and integration is only one of many offerings from IBM Global Services.

Services offerings:

• Data Management and Integration

• Clinical Trial e-Enablement

• Knowledge Management

• Technology Enablement and Hosting

Customer value:

• Increased revenue and market share

• Enablement of personalized medicine

• Decreased capital expenditures

• Increased productivity and innovation


Figure 3-24 IBM Offerings and Life Sciences Requirements

IGS (IBM Global Services) is organized along the headings displayed:

What should I do? IBM Consulting will review the situation and advise on the appropriate course of action.

Help me do it: Business Innovation Services professionals are able to implement a wide range of activities ranging from knowledge management to e-Enablement.

Do it for me: Strategic Outsourcing / Hosting are major activities IGS is engaged in.

[Slide diagram: IBM offerings address the full spectrum of life sciences requirements, spanning business and IT, under the headings “What should I do?”, “Help me do it”, and “Do it for me”. Offerings shown: Life Sciences Management Consulting Vision & Strategy; Life Sciences Applications Selection & I/T Infrastructure Design; DiscoveryLink Transition Services; Clinical Trial e-Enablement; Knowledge Management; Applications Integration; Applications Hosting; Life Sciences Portal Hosting; I/T Infrastructure Outsourcing.]


Figure 3-25 New Queries for Life Sciences Powered by DiscoveryLink

Dramatic advances occurring in the life sciences industry are changing the way we live. These advances fuel the rapid scientific discoveries in genomics, pharmaceutical research, proteomics, and molecular biology that serve as the basis for medical breakthroughs and the development of new drugs and treatments. One of the life sciences industry’s most difficult challenges is transforming massive quantities of highly complex, constantly changing data, from a variety of data sources, into knowledge. The sequenced human genome has already increased the number of biological drug “targets” that can be explored from about 500 to over 30,000. Soon, many life sciences companies will need to access and analyze “petabytes” (10^15 bytes) of data to further their research efforts.


Figure 3-26 Query Scenario 1

Integrated Data: the First Step in Extracting Knowledge

As a further example of the way that DiscoveryLink can be used, a researcher might want to see all compounds similar to ketanserin that have been tested against members of the serotonin receptor family and which have characteristics of a promising drug. Answering this query requires information about compounds, proteins and assays. The request will likely be made using a form-based interface that is simple for the scientist, and which generates a SQL query (in this case, a rather complex one!). The query is submitted to DiscoveryLink, which recognizes which data are available in each of the different databases, forms an optimized plan and executes it, returning a list of candidate compounds.

Notice that the databases can be widely separated, even in different countries!

Show me all the compounds similar to ketanserin that have been tested against members of the serotonin family and have the characteristics of a good drug

[Slide diagram, Scenario 1: the query goes to DiscoveryLink, which uses a DB2 wrapper, an Oracle wrapper, an Activity DB wrapper, and a flat file wrapper to reach a DB2 compound database (USA), an Oracle compound database (Italy), an activity database, and a flat file, and then returns the results.]


Figure 3-27 Query Scenario 2

Consider a simple query—suppose a researcher wants to determine if a given amino acid sequence of a protein is known. (S)he would enter the commands to search all available protein databases for the sequence. The details of how to search the various databases are all hidden by DiscoveryLink. In this example the user simply enters the type of data being examined (amino acid sequence), the operator (sequence homology), and finally the new sequence itself.

MDVLSPGQGN NTTSPPAPFE TGGNTTGISD VTVSYQVITS LLLGTLIFCA VLGNACVVAAIALERSLQNV ANYLIGSLAV TDLMVSVLVL PMAALYQVLN KWTLGQVTCD LFIALDVLCCTSSILHLCAI ALDRYWAITD PIDYVNKRTP RRAAALISLT WLIGFLISIP PMLGWRTPEDRSDPDACTIS KDHGYTIYST FGAFYIPLLL MLVLYGRIFR AARFRIRKTV KKVEKTGADTRHGASPAPQP KKSVNGESGS RNWRLGVESK AGGALCANGA VRQGDDGAAL EVIEVHRVGNSKEHLPLPSE AGPTPCAPAS FERKNERNAE AKRKMALARE RKTVKTLGII MGTFILCWLPFFIVALVLPF CESSCHMPTL LGAIINWLGY SNSLLNPVIY AYFNKDFQNA FKKIIKCKFC

What other proteins share this specific peptide sequence? Check my in-house proprietary data source as well as external sources.

Database          Term       Operator     Value
All protein dbs   Sequence   Homologous   :This_seq


Figure 3-28 DiscoveryLink Information

To learn more about how DiscoveryLink can help dramatically improve the R&D effectiveness of your organization, visit our web site at www.ibm.com/discoverylink, or contact a Life Sciences solutions specialist at [email protected].

IBM DiscoveryLink: the data integration solution


Part 2 Detailed DiscoveryLink Information

DiscoveryLink is an IBM offering that uses database middleware technology to provide integrated access to data sources used in the life sciences industry. DiscoveryLink provides users with a virtual database to which they can pose arbitrarily complex queries in the high-level, nonprocedural query language SQL (Structured Query Language). DiscoveryLink efficiently answers these queries, even though the necessary data may be scattered across several different sources, and none of those sources, by itself, is capable of answering the query. In other words, DiscoveryLink can optimize queries and compensate for SQL function that may be lacking in a data source. Additionally, queries can exploit the specialized functions of a data source, so that no functionality is lost in accessing the source through DiscoveryLink.

This part provides a detailed look into DiscoveryLink and the IBM Global Services offerings associated with DiscoveryLink.



Chapter 4. DiscoveryLink: A System for Integrated Access to Life Sciences Data Sources

By:

L. M. Haas

P. M. Schwarz

P. Kodali

E. Kotlar

J. E. Rice

W. C. Swope

Vast amounts of life sciences data reside today in specialized data sources, with specialized query processing capabilities. Data from one source often must be combined with data from other sources to give users the information they desire. There are database middleware systems that extract data from multiple sources in response to a single query. IBM’s DiscoveryLink is one such system, targeted to applications from the life sciences industry. DiscoveryLink provides users with a virtual database to which they can pose arbitrarily complex queries, even though the actual data needed to answer the query may originate from several different sources, and none of those sources, by itself, is capable of answering the query. We describe the DiscoveryLink offering, focusing on two key elements, the wrapper architecture and the query optimizer, and illustrate how it can be used to integrate the access to life sciences data from heterogeneous data sources.



The human genome has been sequenced, but an even greater challenge remains: to use the information created through this and other processes to prevent and cure disease. Knowledge about genes will help us understand how genetics influences disease development, aid researchers looking for genes associated with particular diseases, and contribute to the discovery of new treatments. To progress in this quest, we must start to answer questions such as: What proteins are encoded by the 35,000 human genes? (It is estimated that there may be as many as one million proteins present in the body.) Under what conditions (which cells/when) are they manufactured? What biological pathways do they participate in? Which of these proteins are appropriate targets against which to develop new therapeutics? Finally, what molecules can be identified and optimized to act as therapeutics against these targets? As we start to answer these questions, we may be able to find effective drugs more quickly, to design drugs that are more selective and have fewer side effects, and even to produce drugs that may be tailored to a particular individual’s genes (pharmacogenomics). As one indication of the possibilities, some analysts predict [1] that the market for personalized medicine could become as large as $800 million by 2005.

A myriad of different data sources in differing formats have been set up to support different aspects of genomics, proteomics, and the drug design process. Some of these data sources are huge—and growing rapidly. Celera Genomics estimates that it has already generated 50 terabytes of genomic data. With the automated high throughput experimental technologies that have been developed in recent years, it is possible to sequence 20 million DNA (deoxyribonucleic acid) base pairs a day. Technologies for testing chemical compounds have improved as well, making it possible to run high throughput assays at the rate of thousands a day, leading to an explosion in the size of test databases. With the promise of high throughput techniques for protein identification, the volume of data that must be analyzed to find good candidates for drugs is only going to increase.

Not only are there vast quantities of data, but much of the data reside in specialized data sources, with specialized query processing capabilities. Sequence data are often stored in flat files or in databases and then converted to specialized formats (e.g., FASTA [2]) to run particular homology search algorithms (e.g., BLAST [3]). Proprietary chemical structure data sources used for drug design support substructure and similarity search. Reference data are often found in on-line databases such as MedLine [4]. Assay data are frequently stored in relational format (e.g., ActivityBase [5]), and different companies get information on patents or reports from a variety of text retrieval systems supporting content search of differing degrees of sophistication. These various technologies provide efficient means of finding particular pieces of data of a specific type.

But extracting the data from these specialized stores solves only part of the problem. To obtain real value from these data, they must be combined with data from other sources to give researchers the information they desire. Only by integrating the data from many sources will scientists be able to identify correlations across the spectrum from genomics to proteomics to drug design. The variety of different formats and search algorithms, while making it possible to optimize the access to a particular kind of data, unfortunately makes it difficult to integrate data of different types, or even to integrate data from different providers of information.

Many different approaches to integrating access to these data sources are possible. Often, integration is provided by applications that can talk to one of several data sources, depending on the user’s request. In these systems, the data sources are typically “hardwired”; replacing one data source with another means rewriting a portion of the application. In addition, data from different sources cannot be compared in response to a single request unless the comparison is likewise wired into the application. Moving all relevant data to a warehouse allows greater flexibility in retrieving and comparing data, but at the cost of reimplementing or losing the specialized functions of the original source, as well as the cost of maintenance. A third approach is to create a homogeneous object layer to encapsulate diverse sources. This encapsulation makes applications easier to write, and more extensible, but does not solve the problem of comparing data from multiple sources.

Database middleware systems offer users the ability to combine data from multiple sources in a single query, without creating a physical warehouse. By “wrapping” the actual sources, they provide extensibility and encapsulation as well. Several research projects [6-9] have focused on middleware to bridge sources of “nonstandard data types” (that is, types other than the simple strings and numbers stored by most relational database management systems). DiscoveryLink [10,11] is an IBM offering that uses database middleware technology to provide integrated access to data sources used in the life sciences industry. DiscoveryLink provides users with a virtual database to which they can pose arbitrarily complex queries in the high-level, nonprocedural query language SQL (Structured Query Language). DiscoveryLink efficiently answers these queries, even though the necessary data may be scattered across several different sources, and none of those sources, by itself, is capable of answering the query. In other words, DiscoveryLink can optimize queries and compensate for SQL function that may be lacking in a data source. Additionally, queries can exploit the specialized functions of a data source, so that no functionality is lost in accessing the source through DiscoveryLink.

In this paper, we present an overview of DiscoveryLink and show how it can be used to integrate the access to life sciences data from heterogeneous data sources. As motivation, the next section sketches several common research scenarios that substantiate the need for cross-source queries and query optimization, and which we will use to illuminate our discussion. In the section on a wrapper architecture, we describe the DiscoveryLink offering as it exists today. Then in the section on query processing, we walk through the optimization and execution of a few queries, pointing out the benefits of the database middleware approach and highlighting areas for improvement. The next version of DiscoveryLink will be enhanced by changes to query optimization following the Garlic [9] approach. We describe these changes in the section on future enhancements and illustrate their effect on the processing of one of our earlier queries. The section on field experience recounts our experiences with DiscoveryLink to date, describing briefly some ongoing functional and performance studies. In the section on discussion, we reflect on DiscoveryLink’s overall role in the process of data integration. We then discuss related work, and conclude with a report on current and future work.


4.1 Motivation

A key feature of DiscoveryLink is that it enables users to ask queries that span multiple sources of information. Such queries arise frequently for researchers in the life sciences. Today, a query that spans multiple sources must be mapped to a series of requests to two or more separate sources. Either the end user must figure out the best sequence of requests, submit them, and then manually intersect the results, or, if the particular type of request is fairly common, an application might be written to hide the sequence of requests. This, however, could require a long and complex program, while providing only limited flexibility.

In this section, we describe several scenarios in which scientists must use multiple data sources in order to get the information they need. We show how the information could be obtained using DiscoveryLink, and contrast that with the way it would be obtained today without the benefit of DiscoveryLink. We refer again to these scenarios (particularly the last one) in future sections. We start with a description of three data sources.

4.1.1 Three data sources

To understand the biological mechanisms of disease and to discover new therapies, researchers need to have access to data from heterogeneous databases. These databases may include DNA databases such as GenBank [12], protein databases such as SWISS-PROT [13], proprietary databases for storing structural information about compounds, databases for storing physical properties and activities of chemical entities, and reference databases such as MedLine [4]. A researcher might wish to access and integrate information from some or all of the above-mentioned databases. For our scenarios, we assume that the researcher has available a protein sequence database, a chemical structure database, and a relational database holding assay results.

Each entry in the protein sequence database is indexed by a sequence identifier (protein_id) and contains sequence data, citation information, and taxonomic data as well as annotation data that describe the function of the protein, information about protein structure, similarities to other proteins, associated diseases, sequence conflicts, and cross-references to other data. We assume that the data also contain information on the family to which the protein belongs. All of these data are in text format.

The chemical structure database maintains collections of molecules with information about their 2-D chemical structure as well as their physical properties, including molecular weight and logP values (logP, the log of the partition coefficient, is an indication of how well the body can use the compound). This database is indexed on a molecule identifier (compound_id). This data source can also handle a similarity query. Given a sample molecule (represented in a standard format such as a MOLFILE 14 or as a SMILES 15 string), a similarity query computes a score in the range [0, 1] for every molecule meeting certain criteria, measuring how similar each is to the sample molecule; the query returns all relevant molecules of a collection ordered by this score. This kind of query ranks the molecules of a collection in the same way that a World Wide Web (WWW) search engine ranks Web pages or an image processing system ranks images.
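
To make the idea of a ranked similarity query concrete, here is a minimal Python sketch. It assumes, purely for illustration, that each molecule is represented by a set of integer fingerprint bits and that the score is the Tanimoto coefficient; the chemical structure database described above may use a different representation and a different measure. The names tanimoto and similarity_search and the sample data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprint bit sets; always in [0, 1]."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def similarity_search(sample, collection):
    """Score every molecule against the sample and rank by descending score."""
    scored = [(name, tanimoto(sample, fp)) for name, fp in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints (sets of "on" bits), not real chemistry.
collection = {"m1": {1, 2, 3, 4}, "m2": {1, 2}, "m3": {7, 8}}
ranking = similarity_search({1, 2, 3, 4}, collection)
```

The result is an ordered list of (molecule, score) pairs, which is the shape of answer the text attributes to the similarity query.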


The third database, a relational database such as Oracle, contains information about the assay results. An example of such a database schema would be ActivityBase. The main information in this data source is stored in the “results” table, which details the molecules that have been tested against a given receptor (a type of protein) and lists their IC50 values, which are a measure of the binding affinity of the molecule to the receptor site. Auxiliary tables then list further details of the experimental conditions. Each entry in the results table is indexed by a compound key comprised of the molecule and receptor identifiers.

4.1.2 Scenario 1: a new protein

In this first simple scenario, a biologist at a pharmaceutical company has a new protein sequence. The biologist wants to find out if this sequence is already known and, if not known, find any sequences that are homologous (i.e., similar) to the new sequence. The pharmaceutical company has its own curated copy of the protein database in-house. However, our biologist wants to check the publicly available version to see if there are any additional data that have not yet made their way into the in-house version.

To accomplish this, our biologist would run a BLASTP search using the new sequence against the in-house version of the database and then do the same with the public version. After obtaining the result sets from both versions, the biologist would have to combine the results, eliminating the sequences present in both (using the protein_id numbers) so as to get a unique list. However, because the application for accessing the in-house version might be different from the Web interface used to access the public version, our biologist needs to combine the results by cutting and pasting, or write an application or script to perform that task. With DiscoveryLink, the entire task can be carried out as a simple query, relying on the SQL “union” operator to spawn the two BLASTP searches and to eliminate duplicates in the result. In this simple scenario, the difference between writing, say, a Perl script and writing an SQL query may not seem too great. However, as more sites are involved, with more interfaces and more choices in how to actually retrieve the result, the difference between hand-coding (and hand-optimizing) the script and writing a nonprocedural statement that is automatically optimized will become increasingly pronounced.
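
The duplicate-eliminating behavior of the SQL union can be sketched with Python's standard sqlite3 module. This is only an illustration under stated assumptions: the two tables stand in for the hit lists that the in-house and public BLASTP searches would return, and the table names, column names, and rows are hypothetical. In DiscoveryLink the searches themselves would be carried out by the data sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the two BLASTP result sets.
    CREATE TABLE inhouse_hits (protein_id TEXT, name TEXT);
    CREATE TABLE public_hits  (protein_id TEXT, name TEXT);
    INSERT INTO inhouse_hits VALUES ('P001', 'receptor A'), ('P002', 'receptor B');
    INSERT INTO public_hits  VALUES ('P002', 'receptor B'), ('P003', 'receptor C');
""")

# UNION (as opposed to UNION ALL) removes rows that appear in both inputs.
unique_hits = conn.execute("""
    SELECT protein_id, name FROM inhouse_hits
    UNION
    SELECT protein_id, name FROM public_hits
    ORDER BY protein_id
""").fetchall()
```

The protein present in both hit lists (P002) appears once in unique_hits, which is the list the biologist would otherwise assemble by hand.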

Of course, these results will no doubt lead to other queries, across other databases, as the biologist checks what assays have been done against these homologous proteins, what compounds were tested, and so on.

4.1.3 Scenario 2: a merger

After the merger of two pharmaceutical companies, the discovery informatics group is tasked to provide easy access to both companies’ chemical structure and assay databases, located at two different locations. The databases contain information about chemical compounds that have been tested, over the years, against various targets. Merging the databases, which results in a greater number of active compounds, increases the likelihood of developing new leads for a particular therapeutic area. Suppose the researchers are interested in compounds similar to fluoxetine, also known as Prozac. Though the compound_ids in the two databases are likely to be different, the similarity search function can be used to query the databases and extract the required information. For simplicity, we assume that the structures are stored in the same database format, e.g., either SMILES 15 strings or MOLFILEs 14. If not, a function to convert between representations would be needed.



Today this problem can be addressed by writing an application that accesses both chemical structure databases individually (using a similarity search on each of the two databases for fluoxetine’s structure) and puts the two result sets into a common representation. Then a second pair of queries would be done against the assay databases to find what targets these compounds have been screened against. Perhaps the compounds from company A have been tested against serotonin receptors, while those from company B were tested against dopamine receptors. Depending on the activities of the set of compounds, various scenarios emerge: if the activities of compounds from company A are high (low IC50 values) and activities for the compounds from company B are low (high IC50 values), it means that this group of compounds might be selective against the serotonin class of receptors. If the activities of compounds from both databases are high, it means that this group of compounds is not selective toward one of these types of receptors.

In the case of DiscoveryLink, a single query can retrieve the similar structures and their matching assays from both companies’ databases. Views could be defined to create a canonical representation of the data. Furthermore, the query will be optimized and executed efficiently. DiscoveryLink gives the end user the perspective of a single data source, saving effort and frustration.

Again, the story is not likely to end here. Before proposing the synthesis and testing of a newly found compound, the researcher needs to know the toxicity profile of the compound and related compounds and also the pathways in which the compound or related compounds might be involved. This would require gathering information from a (proprietary) toxicity database and a database with information on metabolic pathways, such as KEGG 16, using the structures and names of the compounds to look up the data—a potentially difficult set of queries without the benefit of an engine such as DiscoveryLink.

4.1.4 Scenario 3: serotonin research

In the brain stem, the most primitive part of the brain, lie clusters of serotonin neurons. The nerve fiber terminals of the serotonin neurons extend throughout the central nervous system from the cerebral cortex to the spinal cord. This neurotransmitter is responsible for numerous fundamental physiological aspects of the body, including control of appetite, sleep, memory and learning, temperature regulation, mood, behavior (including sexual and hallucinogenic behavior), cardiovascular function, muscle contraction, endocrine regulation, and depression. Serotonin (5-HT, or 5-Hydroxytryptamine) is implicated in a broad range of disorders like depression, schizophrenia, and Parkinson’s disease. Major depression results from a deficiency of available serotonin, or inefficient serotonin receptors. Agents that modulate the processing of 5-HT by, for example, inhibiting or stimulating its release, can be useful for treating such diseases. Prozac, for example, is an agent that inhibits the uptake of 5-HT back into the nerve terminal. Analysts project a greater than $10 billion market for serotonin-related drugs in the next decade.

Suppose our scientist, a chemist by background, wants to see what compounds are active against the family of serotonin receptors. To do so, the scientist could ask DiscoveryLink to display the structures of compounds that scored low in an assay in which the receptor screened was a member of the serotonin family. This simple query would in fact require a three-way “join” of information from all three data sources. Without DiscoveryLink, the scientist would need to make (at least) three separate requests: to the assay database to find the assays with low IC50s; to the protein family/sequence database to eliminate those assays where the receptor was not a member of the family of serotonin receptors; and to the structure database to retrieve the structures of the compounds tested in the remaining assays. Note that the second and third steps might, in fact, require multiple requests, one for each assay returned, unless the protein and chemical structure sources can both accept a list of elements to check. In any case, making the individual requests and assembling the results would be a tedious process for the scientist.

Furthermore, there are many possible ways to process this query. Instead of starting with the assay database, our scientist might start by finding out what proteins are in the family of serotonin receptors, and then determine for which of these there were assays with the right activity. If there are only a few serotonin receptors, and many assays, this would probably be the best way to go, because it would be quicker to look up each of the receptors in the family to find its assays than to look up, for each assay, whether its receptor was in the correct family. However, if not aware of these considerations, the scientist could easily make a mistake, increasing the tediousness of the task dramatically. By contrast, since DiscoveryLink processes the entire request at once, it can optimize the query, ensuring that the query is executed efficiently.
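
The effect of the processing order on the number of requests can be illustrated with a small Python sketch. The data are hypothetical (50 receptors, 3 of them in the serotonin family, and 1000 low-IC50 assays), and the sketch assumes that neither source accepts a list of elements, so every lookup costs one request.

```python
# Hypothetical toy data: receptor -> family, and (compound_id, receptor) assay rows.
FAMILY = {f"R{i}": ("serotonin" if i < 3 else "other") for i in range(50)}
ASSAYS = [(f"C{j}", f"R{j % 50}") for j in range(1000)]


def assay_first():
    """Fetch the assays, then check each assay's receptor family one at a time."""
    requests = 1  # one request for all low-IC50 assays
    result = []
    for compound, receptor in ASSAYS:
        requests += 1  # one family lookup per assay
        if FAMILY[receptor] == "serotonin":
            result.append((compound, receptor))
    return sorted(result), requests


def receptor_first():
    """Fetch the family's receptors, then look up the assays of each receptor."""
    requests = 1  # one request for the receptors in the family
    receptors = [r for r, fam in FAMILY.items() if fam == "serotonin"]
    result = []
    for receptor in receptors:
        requests += 1  # one assay lookup per receptor
        result.extend((c, r) for c, r in ASSAYS if r == receptor)
    return sorted(result), requests


rows_a, requests_a = assay_first()
rows_b, requests_b = receptor_first()
```

Both orders return the same 60 rows, but under these assumptions the assay-first order issues 1001 requests and the receptor-first order issues 4. A scientist working by hand would have to anticipate this difference; the optimizer weighs it automatically.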

Browsing through the results of this query, our scientist recognizes ketanserin, a compound that is highly selective against the HTR2A class of serotonin receptors. Our chemist would likely investigate compounds similar to ketanserin to find out whether they are selective against one particular class of receptor, in which case they might be good drug candidates, or whether they are active against all classes of the family of serotonin receptors, in which case they would need to be modified in order to be more selective. The scientist might ask a query such as: “Show me compounds with structures similar to ketanserin that are active against any members of the family of serotonin receptors and that have other drug-like characteristics.” This query again requires information from all three data sources, and this time exploits the ability of the chemical structure store to search by similarity. It would be even harder for the scientist to determine the best way to perform this query: whether to look first for compounds like ketanserin, for assays against the family of serotonin receptors, or for compounds with drug-like characteristics (appropriate molecular weight, logP, etc.).

In the sections on query processing and future enhancements, we return to this scenario and describe how these two queries would be processed by DiscoveryLink.

4.2 A wrapper architecture

DiscoveryLink is a fusion of Garlic 9, a federated database management system prototype developed by IBM Research to integrate heterogeneous data, and DataJoiner*, an IBM federated database management product for relational data sources based on DATABASE 2* Universal Database (DB2 UDB*) 17. From the DataJoiner side, DiscoveryLink inherits proven technology for federating relational data sources, as well as DB2’s powerful query optimizer and complete query execution engine. From the Garlic side, DiscoveryLink inherits a modular architecture that facilitates integration of new data sources, especially data sources that store nontraditional datatypes and embody specialized search algorithms. In the next two sections, we discuss how this heritage is embodied in the current version of DiscoveryLink. This section is devoted to the DiscoveryLink architecture, and in particular to wrappers, software modules that act as intermediaries between data sources and the DiscoveryLink server. The next section describes how the DiscoveryLink server uses information supplied by wrappers to develop execution plans for application queries. For illustration we make use of the three data sources and the query scenario described in the section on motivation (Scenario 3).

The overall architecture of DiscoveryLink, shown in Figure 4-1, is common to many heterogeneous database systems, including TSIMMIS 8, DISCO 18, Pegasus 6, DIOM 7, HERMES 19, and Garlic 9. Applications connect to the DiscoveryLink server using any of a variety of standard database client interfaces, such as Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC**), and submit queries to DiscoveryLink in standard SQL 20. (The current offering does not support the INSERT, UPDATE, or DELETE SQL statements.) The information required to answer the query comes from one or more data sources, which have been identified to DiscoveryLink through a process called registration. Data sources of interest to the life sciences range from simple data files to complex domain-specific systems that not only store data but also incorporate specialized algorithms for searching or manipulating data. The ability to use these specialized capabilities must not be lost when the data are accessed through DiscoveryLink.

Figure 4-1 DiscoveryLink Architecture

The DiscoveryLink server communicates with a data source by means of a wrapper 21, a software module tailored to a particular family of data sources. The wrapper for a data source is responsible for four tasks:

1. Mapping the information stored by the data source into DiscoveryLink’s relational data model

2. Informing DiscoveryLink about the data source’s query processing capabilities

3. Mapping the query fragments submitted to the wrapper into requests that can be processed using the native query language or programming interface of the data source

4. Issuing such requests and, following their execution, returning results
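
The division of labor among these four tasks can be pictured with a short Python sketch. It is not the DiscoveryLink wrapper interface (actual wrappers are C++ shared libraries); the class names, method names, and the in-memory "data source" are all hypothetical and serve only to show one responsibility per method.

```python
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hypothetical interface mirroring the four wrapper tasks listed above."""

    @abstractmethod
    def schema(self):
        """Task 1: describe the source's data as relational columns and types."""

    @abstractmethod
    def capabilities(self):
        """Task 2: report what query processing the source can do itself."""

    @abstractmethod
    def translate(self, columns):
        """Task 3: map a query fragment to a native request."""

    @abstractmethod
    def execute(self, request):
        """Task 4: issue the native request and return result rows."""


class ListWrapper(Wrapper):
    """Toy wrapper over a list of dicts that can only return unfiltered rows."""

    def __init__(self, records):
        self.records = records

    def schema(self):
        return {"compound_id": "VARCHAR", "logP": "DOUBLE"}

    def capabilities(self):
        return {"PUSHDOWN": "N"}  # no predicates or joins at the source

    def translate(self, columns):
        return tuple(columns)  # the "native request" is just a tuple of keys

    def execute(self, request):
        for record in self.records:
            yield tuple(record[key] for key in request)


wrapper = ListWrapper([{"compound_id": "C1", "logP": 4.5},
                       {"compound_id": "C2", "logP": 2.0}])
rows = list(wrapper.execute(wrapper.translate(["compound_id", "logP"])))
```

Even this minimal wrapper makes the data usable, because filtering, joining, and sorting of the returned rows can be done by the server.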

Because wrappers are the key to extensibility in DiscoveryLink, one of our primary goals for the wrapper architecture was to enable the implementation of wrappers for the widest possible variety of data sources with a minimum of effort. Our experience with the Garlic prototype has shown that this is feasible. To make the range of data sources that can be accessed using DiscoveryLink as broad as possible, we require only that a data (or application) source have some form of programmatic interface that can respond to queries and, at a minimum, be able to return unfiltered data modeled as rows of a table. The author of a wrapper need not implement a standard query interface that may be too high-level or too low-level for the underlying data source. Instead, a wrapper provides information about a data source’s query processing capabilities and specialized search facilities to the DiscoveryLink server, which dynamically determines how much of a given query the data source is capable of handling. This approach allows wrappers for simple data sources to be built quickly, while retaining the ability to exploit the unique query processing capabilities of nontraditional data sources such as search engines for chemical structures or images. Using the Garlic prototype, we validated this design by wrapping a diverse set of data sources including flat files, relational databases, Web sites, and specialized search engines for images and text.


To make wrapper authoring as simple as possible, we require only a small set of key services from a wrapper, and ensure that a wrapper can be written with very little knowledge of DiscoveryLink’s internal structure. As a result, the cost of writing a basic wrapper is small. In our experience, a wrapper that just makes the data at a new source available to DiscoveryLink, without attempting to exploit much of the data source’s native query processing capability, can be written in a matter of days. Because the DiscoveryLink server can compensate for missing functionality at the data sources, even this sort of simple wrapper allows applications to apply the full power of SQL to retrieve the new data and integrate the data with information from other sources, albeit with perhaps less than optimal performance. Once a basic wrapper is written, it can be incrementally improved to exploit more of the data source’s query processing capability, leading to better performance and increased functionality as specialized search algorithms or other novel query processing facilities of the data source are exposed.

A DiscoveryLink wrapper is a C++ program, packaged as a shared library that can be loaded dynamically by the DiscoveryLink server when needed. Typically, a single wrapper is capable of accessing several data sources, as long as they share a common or similar application programming interface (API). This is because the wrapper does not encode information on the schema used in the data source. Thus, schemas can evolve without requiring any change in the wrapper, as long as the source’s API remains unchanged. For example, the Oracle wrapper provided with DiscoveryLink can be used to access any number of Oracle databases, each having a different schema. In fact, the same wrapper supports several Oracle release levels as well.

The process of using a wrapper to access a data source begins with registration, the means by which a wrapper is defined to DiscoveryLink and configured to provide access to selected collections of data. Registration consists of several steps, each taking the form of an SQL Data Definition Language (DDL) statement. Several new DDL statements have been defined for DiscoveryLink, and some existing DDL statements have been extended. Each registration statement stores configuration meta-data in system catalogs maintained by the DiscoveryLink server.

The first step in registration is to define the wrapper itself and identify the shared library that must be loaded before the wrapper can be used. A new CREATE WRAPPER statement has been defined for this purpose. The wrapper for chemical structures databases such as the one described in the section on three data sources might be registered as follows:

CREATE WRAPPER ChemWrapper LIBRARY 'libchemdb.a'

Similar statements would define the wrappers for the other two data sources.

Note that we have not yet identified particular data sources, only the software required to access any data source of these three kinds. The next step of the registration process is to define specific data sources, using the CREATE SERVER statement. If several sources of the same type are to be used, only one CREATE WRAPPER statement is needed, but a separate CREATE SERVER would be needed for each source. For the chemical structures database in our examples, the statement might be as follows:

CREATE SERVER Chem-HTS WRAPPER ChemWrapper
OPTIONS(NODE 'hts1.bigpharma.com', PORT '2003', VERSION '3.2b')

This statement registers a data source that will be known to DiscoveryLink as “Chem-HTS,” and indicates that it is to be accessed using the previously registered wrapper “ChemWrapper.” The additional information specified in the OPTIONS clause is a set of (option name, option value) pairs that are stored in the DiscoveryLink catalogs but meaningful only to the relevant wrapper. In this case, they indicate to the wrapper that the “Chem-HTS” data source can be contacted via a particular IP address and port number, and that it is using version 3.2b of the chemical database software. In general, the set of valid option names and option values will vary from wrapper to wrapper, since different data sources require different configuration information. Options can be specified on each of the registration DDL statements, and provide a simple but powerful form of extensible meta-data. Because options are understood only by wrappers, only the appropriate wrapper can validate that the option names and values specified on a registration statement are meaningful and mutually compatible. As a result, wrappers participate in each step of the registration process, and may reject, alter, or augment the option information provided in the registration DDL statement.
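
How a wrapper might reject or augment option information can be sketched as follows. The validation rules here (NODE and PORT required, PORT numeric, VERSION defaulted) are hypothetical and chosen only to match the example statement; the text does not specify which rules the chemical structure wrapper applies.

```python
REQUIRED = {"NODE", "PORT"}      # hypothetical: options this wrapper insists on
DEFAULTS = {"VERSION": "3.2b"}   # hypothetical: value supplied when omitted


def validate_server_options(options):
    """Reject meaningless options and augment the rest with defaults."""
    unknown = set(options) - REQUIRED - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    missing = REQUIRED - set(options)
    if missing:
        raise ValueError(f"missing option(s): {sorted(missing)}")
    if not options["PORT"].isdigit():
        raise ValueError("PORT must be numeric")
    return {**DEFAULTS, **options}  # explicit values override defaults


stored = validate_server_options({"NODE": "hts1.bigpharma.com", "PORT": "2003"})
```

Here the wrapper accepts the two supplied options and augments them with a VERSION value before the result is stored in the catalogs.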

The third registration step is to identify, for each data source, particular collections of data that will be exposed to DiscoveryLink applications as tables. This is done using the CREATE NICKNAME statement. Collectively, these statements define the schema of each data source and form the basis of the integrated schema seen by applications.

In our example, we need three sets of CREATE NICKNAME statements, one set for each of the three previously defined data sources. Based on our previous description of these sources, Figure 4-2 shows representative CREATE NICKNAME statements that define partial schemas for each source. (The syntax shown is simplified for purposes of illustration.) The Protein_Sequence source exports a single relation, Proteins, with columns representing the unique identifier for a protein, the common (print) name, the protein family, and a list of diseases with which the protein has been associated. In real life, a DBA (database administrator) would likely declare a fuller set of columns, representing more of the information contained in the source; we simplify the schema in the interest of space only. Similarly, the DBA makes visible a single table, Assays, from the Oracle source, for which we show only three columns: the id of the compound being tested, the screen name identifying the protein (receptor) involved, and an IC50 value for the test. The IC50 value represents the concentration of compound required to produce a 50 percent inhibition of enzyme (protein) activity. Finally, the chemical structures database exports a table of compounds along with several important fields, including the structure, molecular weight, and logP. Note that the nickname definitions give the types of attributes in terms of standard SQL datatypes. This represents a commitment on the part of the wrapper to translate types used by the data source to these types as necessary.

Figure 4-2 Wrapper schemas

Any specialized search capabilities of a data source are modeled as user-defined functions, and identifying these functions by means of CREATE FUNCTION MAPPING statements is the fourth step in registration. Thus the definition of the chemical structures data source in Figure 4-2 also includes a CREATE FUNCTION MAPPING statement, registering that source’s function similarity(A, B). The mapping identifies this function to the query processor and declares its signature and return value (in this case, the similarity score) in terms of standard SQL datatypes. As with nicknames, the wrapper must convert values of these types to and from the corresponding types used by the data source.

Once registration is completed, the newly defined nicknames and functions can be used in queries. When an application issues a query, the DiscoveryLink server uses the meta-data in the catalogs to determine which data sources hold the requested information. To break the query into fragments and develop an optimized execution plan, the DiscoveryLink server must take into account the query processing power of each data source. This information is obtained by requesting a server attributes table (SAT) from the data source’s wrapper. The SAT contains a long list of parameters that are set to appropriate values by the wrapper. For example, if the parameter PUSHDOWN is set to “N,” DiscoveryLink will not request that the data source perform query fragments more complex than:

SELECT <column_list> FROM <nickname>

If PUSHDOWN is set to “Y,” more complex requests may be generated, depending on the nature of the query and the values of other SAT parameters. For example, if the wrapper sets the BASIC_PRED parameter to “Y,” requests may include predicates like:

. . . WHERE logP > 4

The parameter MAX_TABS is used to indicate a data source’s ability to perform joins. If it is set to “1,” no joins are supported. Otherwise MAX_TABS indicates the maximum number of nicknames that can appear in the FROM clause of the query fragment to be sent to the data source.
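
The way these three parameters shape the request sent to a data source can be sketched in Python. This is a simplified illustration, not the DiscoveryLink algorithm: the real SAT contains many more parameters, and the helper names plan_fragment and can_join_at_source are hypothetical.

```python
def plan_fragment(sat, nickname, columns, predicate=None):
    """Return (fragment to send to the source, predicate left for the server).

    The caller is assumed to include in `columns` any column the server
    needs in order to evaluate a predicate that is not pushed down.
    """
    fragment = f"SELECT {', '.join(columns)} FROM {nickname}"
    can_push = sat.get("PUSHDOWN") == "Y" and sat.get("BASIC_PRED") == "Y"
    if predicate is not None and can_push:
        return f"{fragment} WHERE {predicate}", None
    return fragment, predicate


def can_join_at_source(sat, n_nicknames):
    """A join of n nicknames can be delegated only if MAX_TABS allows it."""
    return sat.get("PUSHDOWN") == "Y" and n_nicknames <= int(sat.get("MAX_TABS", "1"))


simple_sat = {"PUSHDOWN": "N"}
capable_sat = {"PUSHDOWN": "Y", "BASIC_PRED": "Y", "MAX_TABS": "1"}

simple = plan_fragment(simple_sat, "Compounds", ["compound_id", "logP"], "logP > 4")
capable = plan_fragment(capable_sat, "Compounds", ["compound_id", "logP"], "logP > 4")
```

With PUSHDOWN set to “N” the predicate stays with the server, which compensates by filtering the returned rows; with PUSHDOWN and BASIC_PRED set to “Y” the predicate travels with the fragment.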

Information about the cost of query processing by a data source is supplied to the DiscoveryLink optimizer in a similar way, using a fixed set of parameters such as CPU_RATIO, the speed of the data source’s processor relative to the one hosting the DiscoveryLink server. Additional parameters like average number of instructions per invocation and average number of I/O operations per invocation can be provided for data source functions defined to DiscoveryLink with function mappings, as can statistics about tables defined as nicknames. Once defined, these parameters and statistics can be easily updated whenever necessary.

This approach is easy for wrapper writers, and has proven satisfactory for describing the query processing capabilities and costs of simple data sources, and of the relational database engines supported by the DataJoiner product. However, it is difficult to extend this approach to more idiosyncratic data sources. Web servers, for example, may be able to supply many pieces of information about some entity, but frequently will only allow certain attributes to be used as search criteria. This sort of restriction is difficult to express using a fixed set of parameters. Similarly, the cost of executing a query fragment at a data source may not be easily expressed in terms of fixed parameters if, for example, the cost depends on the value of an argument to a function. In the section on future enhancements, we describe a more flexible approach, pioneered by Garlic, that will be included in the next release of DiscoveryLink.

Once the optimizer has chosen a plan for a query, query fragments are distributed to the data sources for execution. Each wrapper maps the query fragment it receives into a sequence of operations that make use of its data source’s native programming interface and/or query language. Once the plan has been translated, it can be executed immediately or saved for later execution. The DiscoveryLink server’s execution engine is pipelined and employs a fixed set of functions (Open/Fetch/Close) that each wrapper must implement to control the execution of a query fragment. When accepting parameters from the server or returning results, the wrapper is responsible for converting values from the data source type system to DiscoveryLink’s SQL-based type system.
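
The Open/Fetch/Close protocol and the type conversion duty can be sketched as follows. The class and function names are hypothetical, the native source is assumed to return every value as a string, and the sketch is in Python rather than the C++ used for real wrappers.

```python
class FragmentCursor:
    """Hypothetical wrapper-side cursor over native rows of string values."""

    def __init__(self, native_rows, converters):
        self._native_rows = native_rows
        self._converters = converters  # one conversion callable per column
        self._iter = None

    def open(self):
        self._iter = iter(self._native_rows)

    def fetch(self):
        """Return the next converted row, or None at end of stream."""
        native = next(self._iter, None)
        if native is None:
            return None
        return tuple(conv(value) for conv, value in zip(self._converters, native))

    def close(self):
        self._iter = None


def run_fragment(cursor):
    """Server-side loop: rows are pulled one at a time, so execution is pipelined."""
    cursor.open()
    try:
        while (row := cursor.fetch()) is not None:
            yield row
    finally:
        cursor.close()


cursor = FragmentCursor([("C1", "4.5"), ("C2", "2.0")], (str, float))
rows = list(run_fragment(cursor))
```

Because run_fragment is a generator, a consuming operator can start work on the first row before the source has produced the last one.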


4.3 Query processing

In this section, we show how the DiscoveryLink server creates an optimized execution plan for a query, drawing on information obtained from wrappers about the query processing capabilities of data sources and the location and schema of the data themselves. DiscoveryLink follows a traditional, dynamic programming approach to optimization 22. Plans are tree structures with Plan Operators, or POPs, as nodes. Each POP is characterized by a fixed set of plan properties. These properties include Cost, Tables, Columns, and Predicates, where the latter three keep track of the relations and attributes accessed and the predicates applied by the plan, respectively. Each POP works on one or more inputs, and produces some output (usually a stream of tuples). The input to a POP may include one or more streams of tuples produced by other POPs. DiscoveryLink’s POPs include operators for join, sort, filter (to apply predicates), temp (to make a temporary collection), and scan (to retrieve locally stored data). DiscoveryLink also provides a generic POP, called Remote Query, which encapsulates work to be done at a data source.

A plan enumerator is a component of the optimizer that builds plans for the query bottom-up in three phases, applying pruning to eliminate inefficient plans at every step. In the first phase, it creates plans to access individual relations used in the query. In the second phase, it iteratively combines these single-relation plans to create join plans. Finally, the enumerator adds any POPs necessary to get complete query plans. The winning plan is chosen on the basis of cost. The overall cost is computed by the optimizer using parameter values and statistics supplied by the wrappers during registration, taking into account local processing costs, communication costs, and the costs to initiate a subquery to a data source, as well as the costs of any expensive functions or predicates 23,24.
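
The three phases can be sketched with a small dynamic programming enumerator in Python. The sketch makes strong simplifying assumptions that are not taken from DiscoveryLink: it considers only left-deep plans with a single join method, it skips cross products, and its cost model (access costs plus one unit per pair of rows compared) and all row counts and selectivities are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics after local predicates: name -> (rows, access cost).
TABLES = {"Assays": (1000, 100_000),
          "Proteins": (10, 10_000),
          "Compounds": (100_000, 1_000_000)}
# Hypothetical join selectivities for the two join predicates of the query.
JOIN_SEL = {frozenset(("Assays", "Proteins")): Fraction(1, 1000),
            frozenset(("Assays", "Compounds")): Fraction(1, 100_000)}


def enumerate_plans(tables, join_sel):
    # Phase 1: one access plan per table, stored as (cost, rows, plan text).
    best = {frozenset([t]): (cost, rows, t) for t, (rows, cost) in tables.items()}
    names = sorted(tables)
    # Phase 2: build join plans bottom-up, keeping the cheapest per table set.
    for size in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, size)):
            for t in sorted(subset):
                rest = subset - {t}
                if rest not in best:
                    continue
                sels = [join_sel[frozenset((t, o))] for o in rest
                        if frozenset((t, o)) in join_sel]
                if not sels:
                    continue  # no join predicate: skip the cross product
                r_cost, r_rows, r_plan = best[rest]
                t_rows, t_cost = tables[t]
                rows = r_rows * t_rows
                for sel in sels:
                    rows *= sel
                cost = r_cost + t_cost + r_rows * t_rows
                if subset not in best or cost < best[subset][0]:  # pruning
                    best[subset] = (cost, rows, f"JOIN({r_plan}, {t})")
    # Phase 3: add the operator that completes the query plan.
    cost, _, plan = best[frozenset(names)]
    return f"RETURN({plan})", cost


plan, cost = enumerate_plans(TABLES, JOIN_SEL)
```

With these made-up statistics the cheapest plan starts from the small Proteins result, joins Assays, and fetches Compounds last. Keeping only the cheapest plan per set of tables is the pruning step that keeps the search tractable.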

Consider the following example, based on Scenario 3 of the section on motivation. Recall that the first step in our chemist’s investigation was to look for compounds that were active against the family of serotonin receptors, to find out whether they were selective against one particular receptor or class of receptors (in which case they might be good drug candidates) or whether they were active against all members of the family of serotonin receptors (in which case they would need to be modified so as to be more selective). Seeing the results of the following query in a structure-activity relationship (SAR) table would aid in this analysis:

Show me all the compounds that have been tested against members of the family of serotonin receptors and have IC50 values in the nanomolar/ml range.

Assuming the scientist wishes to see the structures of the compounds as well as their identifiers, this query involves information from all three data sources described above. Using DiscoveryLink, a single query can access these multiple databases and combine the resulting information. In SQL, using the wrapper schemas of Figure 4-2, the above query can be written as:

SELECT a.compound_id, a.IC50, p.name, c.structure
FROM Assays a, Proteins p, Compounds c
WHERE a.screen_name = p.protein_id
AND a.compound_id = c.compound_id
AND p.family LIKE '%serotonin%'
AND a.IC50 < 1E-8

Of course, our scientist is unlikely to write such a query! Instead, the scientist will probably just fill in some values for the predicates (maybe by selecting them from a list of possible values) in a nice GUI (graphical user interface). Under the covers, the application would generate this query and pass it to DiscoveryLink, which can then parse, optimize, and execute it.
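
To see what the three-way join returns, the query can be run against toy data with Python's standard sqlite3 module. All three tables live in one local database here, whereas DiscoveryLink would draw them from three separate sources through nicknames. The rows are hypothetical, only the columns used by the query are declared, and the structure values are placeholders rather than real SMILES strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Proteins  (protein_id TEXT, name TEXT, family TEXT);
    CREATE TABLE Assays    (compound_id TEXT, screen_name TEXT, IC50 REAL);
    CREATE TABLE Compounds (compound_id TEXT, structure TEXT);

    INSERT INTO Proteins VALUES
        ('HTR2A', '5-HT2A receptor', 'serotonin receptor'),
        ('DRD2',  'D2 receptor',     'dopamine receptor');
    INSERT INTO Assays VALUES
        ('C1', 'HTR2A', 2e-9),   -- active against a serotonin receptor
        ('C2', 'HTR2A', 5e-6),   -- tested, but IC50 too high
        ('C3', 'DRD2',  1e-9);   -- active, but wrong receptor family
    INSERT INTO Compounds VALUES
        ('C1', 'structure-1'), ('C2', 'structure-2'), ('C3', 'structure-3');
""")

rows = conn.execute("""
    SELECT a.compound_id, a.IC50, p.name, c.structure
    FROM Assays a, Proteins p, Compounds c
    WHERE a.screen_name = p.protein_id
      AND a.compound_id = c.compound_id
      AND p.family LIKE '%serotonin%'
      AND a.IC50 < 1E-8
""").fetchall()
```

Only compound C1 satisfies both predicates, so the result contains a single row carrying its IC50 value, the receptor name, and the placeholder structure.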


4.3.1 Optimizing the query

As mentioned above, the optimizer examines the query bottom-up, first finding plans for accessing each of the individual tables, then finding plans for joining pairs of tables, and, finally, finding plans for the three-way join. The optimizer uses information from the wrappers about the speed of the various sources, their network connections, and the size and distribution of their data to predict the costs of the various plans. Using information about their query capabilities, it ensures that it does not ask the sources to do anything they cannot do, and adds any operators it needs to compensate for functionality missing in the sources. It may also be able to rewrite the query in ways that will make query processing more efficient.

Figure 4-3 shows the plans created in the first phase of optimization for each of the tables. Each plan consists of a single operator, RemoteQuery, but each has a different set of properties. For example, the first plan accesses the Assays table, applying the predicate on IC50, and returning the columns needed for the select list (compound_id, IC50) and to join the Assays table to the Proteins table (screen_name). The second plan accesses the Compounds table, returning the structure for the select list as well as compound_id, to join the Compounds table with the Assays table. The third plan accesses Proteins, applying the LIKE predicate at the data source and returning protein_id to join this table to the Assays table. In each case, the plan chosen reflects information about the data source’s query capabilities that was supplied to the optimizer by the source’s wrapper. By setting parameter values in the server attributes table appropriately, the wrapper for the assay database indicated that the underlying data source could apply basic predicates. As a result, the optimizer could safely delegate evaluation of the predicate “IC50 < 1E-8” to the data source. Similarly, the wrapper for the text data source indicated to the optimizer that the source could apply LIKE predicates, allowing the optimizer to include the predicate “p.family LIKE ‘%serotonin%’” in the access plan for this source.

Figure 4-3 Single table access plans, first phase of optimization

In the second phase, the optimizer will look at all pairs of tables and construct multiple plans for joining each pair(25). There will be a plan for each feasible join method (way of executing the join) and for each possible join order (order in which the tables are accessed). For simplicity, we assume there are only two join methods. In the first method, the data resulting from the plan for the inner table of the join (the second table accessed) is brought to DiscoveryLink and stored temporarily, so that the join predicate is evaluated in DiscoveryLink. Alternatively, each join value from the outer table can be sent to the data source, and both the join and the local predicates can be evaluated at the source, once for each outer table value. (This latter join method has been called a bind join(21).) Under these assumptions this phase would produce eight plans: two for joining Assays and Compounds in that order, two for joining them in the reverse order, two for joining Assays and Proteins in that order, and two for joining them in the reverse order. The DiscoveryLink optimizer actually has several more join methods to choose from, and some, such as hash join(26), might well lead to better plans than the ones described here.
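The count of eight plans can be reproduced with a short enumeration. The following Python sketch is only an illustration, not DiscoveryLink code: the method names are invented, and it assumes (as the count of eight implies) that pairs without a join predicate, here Compounds and Proteins, are not considered.

```python
from itertools import permutations

TABLES = ("Assays", "Compounds", "Proteins")

# Join predicates in the example query: Assays joins to each of the other
# tables; Compounds and Proteins share no join predicate (assumption: such
# pairs are skipped rather than joined as a cross product).
JOINABLE = {
    frozenset({"Assays", "Compounds"}),
    frozenset({"Assays", "Proteins"}),
}

# The two join methods assumed in the text; the names are illustrative.
JOIN_METHODS = ("temp_table_join", "bind_join")


def enumerate_two_way_plans():
    """Return (outer, inner, method) for every feasible join order and method."""
    plans = []
    for outer, inner in permutations(TABLES, 2):
        if frozenset({outer, inner}) not in JOINABLE:
            continue  # no join predicate between these two tables
        for method in JOIN_METHODS:
            plans.append((outer, inner, method))
    return plans


two_way_plans = enumerate_two_way_plans()
for plan in two_way_plans:
    print(plan)
print(len(two_way_plans))  # 4 join orders x 2 join methods
```

Adding a third join method to `JOIN_METHODS` raises the count to twelve, which hints at how quickly the search space grows once more methods and more tables are involved.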

Once the two-way joins are built, the optimizer looks at alternative ways of joining these with the single table plans for the remaining table. This query requested no additional work (sorting, for example), so to complete the plans all that is needed is a final Return operator that eliminates any extra columns, returning only those needed. Figure 4-4 shows three of the many plans the optimizer would create for this query. In general, the number of plans examined is exponential in the number of tables being joined. The first plan starts by finding the structure of every compound, then sees which of them received a low IC50 score in an assay, and, finally, looks up the proteins they bound to in those assays to see if they are in the serotonin receptor family. The second plan finds the assays with low IC50 scores, then finds the structure of the compound tested in each of those assays, and finally determines whether the proteins that these compounds bound to are members of the serotonin receptor family. The third plan starts by finding the proteins that are members of the serotonin receptor family, finds assays in which some compound bound tightly to them, and finally retrieves the structures of just those compounds.

Figure 4-4 Three plans for the full query

The first two plans make a temporary table of the results of the remote query on Proteins so that they only access that table once. The first plan probes the Assays table in Oracle once for each compound. Likewise, the third plan asks the chemical structures source to return the appropriate compound structure once for each compound_id that generated a low IC50 when screened with a protein in the serotonin receptor family.

Which of these plans is best depends on many factors. Since the first plan begins by retrieving the structure for every compound in the chemical structures database, it is unlikely to be good unless there are very few compounds. The second plan only fetches structures for those compounds that turn up with a low IC50 score in one or more assays, which should be an improvement in most circumstances. Since it accesses the Protein data source only once, creating a temporary table at the server, this plan may perform well if relatively few proteins are in the serotonin receptor family, the DiscoveryLink server is fast, and accessing the text data source is slow. The third plan defers access to the Compounds table until the end, which ensures that only the structures of compounds that qualify for the final result will be retrieved, i.e., those that had low IC50 scores in assays against relevant proteins. In other respects, this plan is similar to the second plan and similar arguments apply.

As this example shows, there are many different plans possible for even relatively simple queries. Depending on the data, the selectivity of the predicates, the complexity of the operations, and the machine and network speeds, plan costs may vary by orders of magnitude. A cost-based optimizer is essential to be able to execute cross-source queries with reasonable performance.

4.3.2 Executing the query

DiscoveryLink coordinates execution of the chosen plan, requesting data from the wrappers as the plan dictates. To illustrate, we assume that the optimizer has chosen the second plan of Figure 4-4 as the best way to execute the query. Plan 2 starts by accessing the Assays table exported by our Oracle database, applying the predicate on IC50. To start this process, DiscoveryLink tells the Oracle wrapper to begin retrieving data for the RemoteQuery operator. The wrapper creates a connection to the Oracle server and requests the data it needs. Since it is talking to a relational engine, this request is expressed as an SQL query, namely:

SELECT a.compound_id, a.IC50, a.screen_name
FROM ASSAYS a
WHERE a.IC50 < 0.00000001

Those assays that survive the IC50 test are returned to DiscoveryLink. When DiscoveryLink receives the first result row, it asks the wrapper for the chemical structures database to retrieve the structure of the compound tested. In turn, the wrapper makes a request to the chemical structures database itself. This request will likely consist of a call to one of the interface routines supplied by the source, passing in the compound identifier obtained from the assay data. The structure data returned by this call is passed back to DiscoveryLink, which attaches it to the assay data. This process is repeated for each qualifying assay, completing the join between Assays and Compounds. Note that assays for which the tested compound’s structure is not available in the chemical database will be dropped from the result. If this is not desired, an outer join could be used to preserve the presence of these assays in the result set.

As soon as the first assay-structure pair is produced by the first join, DiscoveryLink requests that the Protein_Sequence wrapper execute its piece of the plan. As above, the Protein_Sequence wrapper in turn requests data from its source. If the scientist is using Protein_Sequence over the Web, this request looks like a query URL (uniform resource locator), and returns an HTML (HyperText Markup Language) page (or pages) with the result. The wrapper then parses each HTML page to retrieve the next set of results. These results are stored by DiscoveryLink in a local table, and processing of the final join begins. For each combined assay-structure record, DiscoveryLink might scan the local table of protein results, looking for any whose protein_id matches the screen_name from the assay. Any matches found meet all the criteria of the query and hence are returned to the user.
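The flow of data through Plan 2 can be mimicked in a few lines of Python. This is a sketch under heavy assumptions: the three sources are replaced by in-memory stand-ins, every row is invented, and the function names are hypothetical rather than part of any wrapper interface.

```python
# In-memory stand-ins for the three sources; all rows are invented.
ASSAYS = [
    {"compound_id": 1, "IC50": 5e-9, "screen_name": "P1"},
    {"compound_id": 2, "IC50": 2e-6, "screen_name": "P1"},  # fails the IC50 test
    {"compound_id": 3, "IC50": 1e-9, "screen_name": "P2"},  # protein not in family
    {"compound_id": 4, "IC50": 3e-9, "screen_name": "P1"},  # no structure on file
]
STRUCTURES = {1: "struct-1", 2: "struct-2", 3: "struct-3"}
PROTEINS = [
    {"protein_id": "P1", "family": "serotonin receptor"},
    {"protein_id": "P2", "family": "kinase"},
]


def remote_query_assays(max_ic50):
    """Stand-in for the relational wrapper; the predicate is applied at the source."""
    return (row for row in ASSAYS if row["IC50"] < max_ic50)


def lookup_structure(compound_id):
    """Stand-in for one call to the chemical structures source."""
    return STRUCTURES.get(compound_id)


def remote_query_proteins(pattern):
    """Stand-in for the text source applying the LIKE predicate."""
    return [p for p in PROTEINS if pattern in p["family"]]


def execute_plan_2():
    protein_temp = None  # local table at the server, filled on first use
    for assay in remote_query_assays(1e-8):
        structure = lookup_structure(assay["compound_id"])  # bind join probe
        if structure is None:
            continue  # inner join: assays without a structure are dropped
        if protein_temp is None:
            protein_temp = remote_query_proteins("serotonin")  # fetched only once
        for protein in protein_temp:
            if protein["protein_id"] == assay["screen_name"]:
                yield assay["compound_id"], assay["IC50"], structure


results = list(execute_plan_2())
print(results)
```

Only the assay for compound 1 survives all three steps in this toy data set; replacing the `continue` with a null structure would give the outer join behavior mentioned earlier.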

In this section, we saw how the optimization capabilities of DiscoveryLink work. In fact, for relational sources, and many simpler sources, the Server Attribute Table plus cost parameters approach provides excellent results. For other sources, however, which cannot be neatly characterized by the parameters in the Server Attribute Table, this approach can lead to suboptimal results. For example, a source that could answer some, but not all basic predicates, might be forced to declare that it could not handle basic predicates, leading to inefficient plans if all data must be shipped back to DiscoveryLink before predicates are applied. In the next section, we consider the second half of Scenario 3, in which our chemist focuses on compounds structurally similar to ketanserin. We show how optimizing this query can exploit the more advanced query planning technology that will be included in future versions of DiscoveryLink.

4.3.3 Future enhancements

DiscoveryLink does not yet fully exploit the technology pioneered by its forebears, DataJoiner and Garlic. The process begun with the current version of DiscoveryLink will be completed with the next version, due to be generally available in 2002. Features to be incorporated from DataJoiner include support for “long” datatypes (BLOB, CLOB, etc.), the ability to update information at data sources via SQL statements submitted to DiscoveryLink (including full transaction management for those data sources that support external coordination of transactions), the ability to invoke stored procedures that run at data sources, and the ability to use DiscoveryLink DDL statements to create new data collections at data sources. Other forthcoming features stem from advanced technology that is being added to the database engine at the heart of DiscoveryLink. This engine is more sophisticated than that of either DataJoiner or Garlic. Improvements to the database engine will allow certain queries to be answered using prematerialized automatic summary tables stored by DiscoveryLink, with little or no access to the data sources themselves. Another new feature will allow DiscoveryLink servers with multiple processors to access several data sources in parallel within a single unit of work.

The improvements listed above are important, but the subject of this section is a more fundamental change in the way DiscoveryLink develops optimized execution plans for queries. To demonstrate the need for this change, and how query planning will work in future versions of DiscoveryLink, we return to Scenario 3. After browsing the results of the first query, the chemist decides to investigate the drug potential of compounds similar to ketanserin. The chemist would like to see an SAR table containing the following information:

Show me all the compounds that have been tested against members of the serotonin family of receptors, have IC50 values in the nanomolar/ml range, a molecular weight between 375 and 425, and a logP between 4 and 5. Order the results by how similar the compound tested is to ketanserin.

Like the chemist’s earlier query, this request can be expressed as a single SQL statement that combines data from all three data sources(27):

SELECT a.compound_id, a.IC50, p.name,
       c.mol_wt, c.logP, c.structure,
       similarity(c.structure, :KETANSERIN_MOL) AS rank
FROM Assays a, Proteins p, Compounds c
WHERE a.screen_name = p.protein_id
  AND a.compound_id = c.compound_id
  AND p.family LIKE '%serotonin%'
  AND a.IC50 < 1E-8
  AND c.mol_wt BETWEEN 375 AND 425
  AND c.logP BETWEEN 4 AND 5
ORDER BY rank

However, accurately determining the cost of the various possible plans for this query is more difficult. In the earlier query, assuming the parameters are correctly set and the statistics characterizing the size and distribution of the data are up-to-date, estimating plan costs and result cardinalities is relatively straightforward. This query introduces two new problems. The first is estimating the cost of evaluating the similarity function. The costing parameters maintained in the current version of DiscoveryLink for a function implemented by a data source include a cost for the initial invocation and a “per-row” cost for each additional invocation. However, the only way to take the value of a function argument into account is through a cost adjustment based on the size of the argument value, in bytes. This is unlikely to give very accurate results. For example, if different similarity calculation algorithms can be used for different classes of pattern molecules, the cost parameters must be set to reflect some amalgamation of all the algorithms. As another example, a BLAST function asked to do a BLASTP comparison against a moderate amount of data will return in seconds, whereas if asked to do a BLASTP comparison against a large data set it may need hours. A simple case statement, easily written by the wrapper provider, could model the differences and allow more sensible choices of plans. While the costs of such powerful functions can in other cases be hard to predict, many vendors do, in fact, know quite a bit about the costs of their functions, because they often model costs themselves to improve their systems’ performance.
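The contrast between the fixed-parameter model and a wrapper-supplied case statement can be sketched as follows. Everything here is hypothetical: the function names, the thresholds, the cost values, and the way the byte-size adjustment is applied are invented for illustration and are not taken from DiscoveryLink or from any BLAST implementation.

```python
def parametric_cost_seconds(invocations, arg_bytes,
                            first_call=0.5, per_row=0.01, per_byte=1e-6):
    """Fixed-parameter model: an initial cost, a per-row cost for each further
    invocation, and an adjustment for argument size in bytes (values invented)."""
    if invocations <= 0:
        return 0.0
    return first_call + per_row * (invocations - 1) + per_byte * arg_bytes


def blastp_cost_seconds(target_residues):
    """Hypothetical wrapper-written 'case statement' keyed on target data size."""
    if target_residues < 1_000_000:
        return 5.0            # small target: seconds
    elif target_residues < 1_000_000_000:
        return 120.0          # moderate target: minutes
    else:
        return 3 * 3600.0     # large target: hours


query_bytes = 400  # size of the query sequence, all the fixed model can see
for target in (200_000, 50_000_000, 20_000_000_000):
    print(target,
          parametric_cost_seconds(1, query_bytes),
          blastp_cost_seconds(target))
```

The fixed model returns the same estimate for all three targets because the argument size does not change, while the case statement separates them by orders of magnitude.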

The second problem is estimating the cost of ordering the compounds returned by similarity. To DiscoveryLink, the evaluation of the similarity function and ordering the result set by the rank value returned are separate operations. The optimizer first estimates the cost of executing the similarity function the required number of times (itself an estimate based on the selectivity of the other predicates in the query) and then adds on the estimated cost of a SORT operator (for both the case where the SORT is performed by DiscoveryLink and the case where it is performed by the data source). In reality, it is quite possible that the chemical structures data source can order the result by compound_id “for free” as a byproduct of evaluating the similarity function. However, the cost for ordering by another attribute, e.g., molecular weight, might be quite different, or, the data source might not be able to order results by that attribute at all.

The solution to these and many similar problems is not to define a richer set of parameters for more precisely modeling data sources’ query processing capabilities and their costs. Experience with DataJoiner has shown that even for a modest set of data sources, all sharing a common relational data model and query language, the number of parameters required to capture their idiosyncrasies soon becomes untenable. The situation will only be exacerbated by the greater number and kind of data sources anticipated for DiscoveryLink.

Instead, the solution, validated in the Garlic prototype, is to involve the wrappers directly in planning of individual queries. Instead of attempting to model the behavior of a data source using a fixed set of parameters with statically determined values, the DiscoveryLink server will request information from the wrapper about a data source’s ability to process a specific query fragment. In return, the server will receive one or more wrapper plans, each describing a specific portion of the fragment that can be processed, along with an estimate for the cost of computing the result and its estimated size.

Consider the query introduced above. During the first phase of optimization, when single-table access plans are being considered, the chemical structures database will receive the following fragment for consideration(28):

SELECT c.mol_wt, c.logP, c.structure,
       similarity(c.structure, :KETANSERIN_MOL) AS rank
FROM Compounds c
WHERE c.mol_wt BETWEEN 375 AND 425
  AND c.logP BETWEEN 4 AND 5
ORDER BY rank

Let us assume that, in a single operation, the chemical structures database can either apply the predicates on molecular weight and logP, or compute the similarity and order the results by rank, but not both. The wrapper might return two wrapper plans for this fragment. The first would indicate that the data source could perform the following portion of the fragment:


SELECT c.mol_wt, c.logP, c.structure
FROM Compounds c
WHERE c.mol_wt BETWEEN 375 AND 425
  AND c.logP BETWEEN 4 AND 5

with an estimated execution cost of 3.2 seconds and an estimated result size of 500 compounds. To estimate the total cost of the query fragment using this wrapper plan, the DiscoveryLink optimizer would add to the cost for the wrapper plan the cost of invoking the similarity function on each of the 500 compounds returned and sorting the resulting records by rank.

The second wrapper plan would indicate that the data source could perform the following portion of the fragment:

SELECT c.mol_wt, c.logP, c.structure,
       similarity(c.structure, :KETANSERIN_MOL) AS rank
FROM Compounds c
ORDER BY rank

with an estimated execution cost of 6.4 seconds and an estimated result size of 300000 compounds (i.e., all the compounds in the database, sorted by similarity to ketanserin). To compute the total cost in this case, the optimizer would augment the cost for the wrapper plan with the cost of using the DiscoveryLink engine to apply the predicates on molecular weight and logP to each of the 300000 compounds returned from the data source. Note that when asked to produce this plan, the wrapper has the pattern structure (:KETANSERIN_MOL) available, and can take its properties into account to obtain the best possible estimate of how expensive the similarity computation will be. Furthermore, if the result from the data source is naturally ordered by rank, the wrapper’s estimate need not include any additional cost for sorting.
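The arithmetic the optimizer performs on these two wrapper plans can be sketched in Python. The wrapper estimates (3.2 seconds for 500 compounds, 6.4 seconds for 300000 compounds) come from the example above; the per-row costs for the work left to DiscoveryLink are invented placeholders, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class WrapperPlan:
    description: str
    cost_seconds: float        # wrapper's estimate for the portion it handles
    result_rows: int           # wrapper's estimated result size
    applies_predicates: bool   # mol_wt and logP predicates evaluated at source
    returns_ranked: bool       # similarity computed and ordered at source


# Hypothetical costs of compensating work done by the server itself.
SIMILARITY_CALL_SECONDS = 0.01
PREDICATE_EVAL_SECONDS = 0.00001
SORT_SECONDS_PER_COMPARISON = 0.000001


def total_cost(plan):
    """Wrapper plan cost plus whatever the server must still do."""
    cost = plan.cost_seconds
    if not plan.returns_ranked:
        cost += plan.result_rows * SIMILARITY_CALL_SECONDS
        if plan.result_rows > 1:  # n log n comparisons for the final sort
            cost += (plan.result_rows * math.log2(plan.result_rows)
                     * SORT_SECONDS_PER_COMPARISON)
    if not plan.applies_predicates:
        # Filtering a ranked stream keeps its order, so no extra sort is needed.
        cost += plan.result_rows * PREDICATE_EVAL_SECONDS
    return cost


plans = [
    WrapperPlan("predicates at source", 3.2, 500,
                applies_predicates=True, returns_ranked=False),
    WrapperPlan("similarity and ordering at source", 6.4, 300000,
                applies_predicates=False, returns_ranked=True),
]
best = min(plans, key=total_cost)
for plan in plans:
    print(plan.description, round(total_cost(plan), 3))
print("chosen:", best.description)
```

With these placeholder values the first plan is cheaper, but a more expensive similarity call at the server would reverse the choice, which is why the per-query estimates supplied by the wrapper matter.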

Wrappers participate in query planning in the same way during the join enumeration portion of optimization. In our example, the wrapper might be asked to consider the following “bind join” query fragment:

SELECT c.mol_wt, c.logP, c.structure,
       similarity(c.structure, :KETANSERIN_MOL) AS rank
FROM Compounds c
WHERE c.mol_wt BETWEEN 375 AND 425
  AND c.logP BETWEEN 4 AND 5
  AND c.compound_id = :H0
ORDER BY rank

This is similar to the single-table access, but in this case the chemical structures database is being asked to supply the inner stream for a bind join. For each compound_id produced by the rest of the query (and represented above by the host variable :H0), the chemical structures database is asked to find the chemical properties of the corresponding compound and its similarity with respect to ketanserin, and return them if the properties satisfy the predicates on molecular weight and logP. If the data source cannot do lookups by compound_id, the wrapper would return no wrapper plans at all for this request. If such lookups are supported, the wrapper would return one or more plans, as above, and indicate in each one whether the similarity computation or any of the additional predicates would also be evaluated.
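A wrapper's response to the bind join request might be modeled as below. This is a minimal sketch, not the DiscoveryLink wrapper interface: the class, method, and attribute names are hypothetical, and the capabilities attributed to the source (whether it can also filter after a lookup) are assumptions made for the example.

```python
from dataclasses import dataclass

BIND_PREDICATE = "c.compound_id = :H0"
RANGE_PREDICATES = frozenset({
    "c.mol_wt BETWEEN 375 AND 425",
    "c.logP BETWEEN 4 AND 5",
})


@dataclass(frozen=True)
class WrapperPlan:
    predicates_at_source: frozenset  # predicates the source will evaluate
    similarity_at_source: bool       # whether the source also computes rank


class ChemicalStructuresWrapper:
    """Hypothetical wrapper for the chemical structures source."""

    def __init__(self, supports_id_lookup, filters_after_lookup):
        self.supports_id_lookup = supports_id_lookup
        self.filters_after_lookup = filters_after_lookup

    def plan_bind_join(self):
        """Return zero or more plans for the bind join fragment."""
        if not self.supports_id_lookup:
            return []  # the source cannot act as the inner of a bind join
        lookup_only = frozenset({BIND_PREDICATE})
        plans = [WrapperPlan(lookup_only, similarity_at_source=False)]
        if self.filters_after_lookup:
            plans.append(WrapperPlan(lookup_only | RANGE_PREDICATES,
                                     similarity_at_source=False))
        return plans


no_lookup_plans = ChemicalStructuresWrapper(False, True).plan_bind_join()
lookup_plans = ChemicalStructuresWrapper(True, True).plan_bind_join()
print(len(no_lookup_plans), len(lookup_plans))
```

Each returned plan states which predicates the source will evaluate, so the server knows exactly which remaining operators it has to add itself.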


Since a wrapper may be asked to consider many query fragments during the planning of a single query, it is important that communication with the wrapper be efficient. This is achieved easily in DiscoveryLink, since the shared library that contains a wrapper’s query planning code is loaded on demand into the address space of the DiscoveryLink server process handling the query. The overhead for communicating with a wrapper is therefore merely the cost of a local procedure call.

The improved approach to query planning described in this section will have many advantages over DiscoveryLink’s current methodology. It is both simple and extremely flexible. Instead of using an ever-expanding set of parameters to invest the DiscoveryLink server with detailed knowledge of each data source’s capabilities, we let this knowledge reside where it falls more naturally, in the wrapper for the source in question, and ask only that the wrapper respond to specific requests in the context of a specific query. As the examples above have shown, sources that only support searches on the values of certain fields or combinations of fields are easily accommodated, as are sources that can only sort results under certain circumstances or can only perform certain computations in combination with others. Since a wrapper need only respond to a request with a single plan, or in some cases no plans at all, the new approach does not sacrifice the current system’s ability to start with a simple wrapper that evolves to reflect more of the underlying data source’s query processing power.

This approach to query planning need not place too much of a burden on the wrapper writer, either. In Reference 29, we showed that it is possible to provide a simple default cost model and costing functions, along with a utility to gather and update all necessary cost parameters. The default model proved to do an excellent job of modeling simple data sources, and did a good job of predicting costs even for sources that could apply quite complex predicates. Reference 29 further showed that even an approximate cost model dramatically improved the choice of plans over no information or fixed default values. We therefore believe that this method of query planning is not only viable, but necessary. With this advanced system for optimization, DiscoveryLink will have the ease of extension, flexibility, and performance required to meet the needs of life sciences applications.

4.4 Field experience

DiscoveryLink is a new offering, and as a result, we are only beginning to understand how it will be used in practice. Today, two customer pilots are underway. The first focuses on linking chemical information with biological information by bringing together data about the structure of compounds with information on assays that have been done using these compounds. The second pilot is linking chemical, biological, and bioinformatic data, stored in a combination of (different) relational databases and flat files. (These pilots and an earlier study with Garlic were the inspiration for our examples.) In both pilots, the information is geographically distributed, spanning the United States in one case and both shores of the Atlantic Ocean in the other. The schemas used to represent the information in both cases are quite complex, involving 30 or more nicknames, and requiring complex joins and unions both within and across sources to assemble information required for the respective applications. Hence the query processing capability of DiscoveryLink is being well tested by these projects.

Additionally, several vendors of life sciences data sources are considering offerings which would couple DiscoveryLink with their sources and with an application or an object framework to build a platform for data integration. These vendors see that the ease of linking their data to data from other sources will help to distinguish their offerings from those of others in the field. Further, an object layer on top of DiscoveryLink would make it more attractive to a broader, not necessarily SQL-savvy audience.


Performance is a key issue for any data management and retrieval system, and a number of questions arise for a middleware system such as DiscoveryLink. One relatively simple question is, what effect will going through the DiscoveryLink middleware have on the performance of queries against a single data source? In other words, if a user were to issue the same query both through DiscoveryLink and directly to the data source, what would be the difference in the execution times?

We have done an initial study on this issue with one customer, Aventis. In this experiment, we ran a set of their existing queries against both their production database (PrDB) and against a DiscoveryLink installation configured to access (via the relational wrapper) the same database. Queries were submitted via their existing Web-based query application, which was modified to submit queries against either DiscoveryLink wrapping PrDB, or directly against PrDB. The application, Web server, PrDB, and DiscoveryLink all ran on separate machines: the application on a Compaq running Windows NT 4.0, the Web server on a second Compaq running Windows NT 4.0, IIS (Internet Information Server) 4.0, and IE (Internet Explorer) 5.0, PrDB on an Alpha 2100 running Windows NT 4.0, and DiscoveryLink on an RS/6000 H70 running AIX. Pushdown was enabled, so DiscoveryLink could choose to use as much or as little of PrDB’s processing power as it saw fit. Two experiments were done, a functional test and a load test.

In the functional test, virtual users (simulated via Web-based testing software) ran scripts consisting of a sequence of steps. In each script, the virtual user would log on to a Web-based application, run a sequence of two to four queries, and then log off. Each script was run 20 times before proceeding to the next, and all tests against PrDB were completed before testing against DiscoveryLink began. (Hence both systems had the opportunity to benefit from any buffering possible.) Tests were run during quiet hours, but the network was not isolated during testing. Total transaction time was measured for each run of each script, and averaged over the 20 runs. In addition, the query results were tested to verify that correct answers were being returned. No errors were found. In all, nine different scripts were run. Queries ranged from selections against a single table to four-way joins, usually including a mixture of inner and outer joins. Many had subqueries, some of which were unions of simpler queries. The number of fields selected and the number of predicates both varied greatly, and the queries often involved complex functions. The amount of data returned also varied from query to query, though none retrieved huge numbers of results. Perhaps most important, each script was representative of the way scientists at Aventis typically use the system to search for studies, protocols, compounds, and/or libraries.

Results of the functional test are shown in Figure 4-5. All times are in seconds. In general, transactions against DiscoveryLink performed comparably to transactions directly against PrDB. In some cases, DiscoveryLink was, on average, a few seconds slower, in others a few seconds faster. In one case, for script number four, the transactions through DiscoveryLink were substantially faster than those directly against PrDB(30). In all, we concluded that at least for Aventis’s standard sorts of transactions there was no performance penalty for using the DiscoveryLink middleware.


Figure 4-5 Results of the functional tests on DiscoveryLink

The load test evaluated the robustness of DiscoveryLink as the number of simultaneous users was increased. Scalability is essential for any database system, but especially so for database middleware, because requests that might originally have been submitted to multiple independent systems may all be routed through the middleware instead. (For example, an application that previously had to submit separate requests to the ADME [absorption, distribution, metabolism, and excretion] and high throughput screening databases can now send both to DiscoveryLink, so DiscoveryLink will see a higher number of requests than any one underlying data source.) The load tests used scripts similar to those used to run the functional test, and were driven by the same testing software. Several different scenarios were run. In one, all virtual users ran the same script, while in another, half of the virtual users ran one script and half another. In the final scenario, the virtual users were divided into five groups, each of which ran a different script. Each scenario was run for 20 minutes starting with one virtual user and quickly building to 20. Experiments with greater maximum loads (40 and 60 virtual users) were also run, but high standard deviations and large numbers of errors from other components of the system rendered the measurements less reliable.

The results of the load test can be found in Figure 4-6. Again, we measured the total transaction times from start to end of script, and took the average over all executions for all virtual users. Times are again shown in seconds. In general, results were not significantly different between the two application configurations (direct against PrDB and direct against DiscoveryLink). The DiscoveryLink configuration performed better on both scripts in the two-script scenario, and worse for the five-script scenario, though the variability of the results for this latter case makes conclusions hard to draw. What is clear is that at 20 users, there was no significant difference between the configurations (again, DiscoveryLink is not adding overhead), and response times for both configurations are comparable to those when only a single user is running (i.e., both configurations scaled well).

So far, we have only discussed queries against a single data source. What about the cross-source queries for which DiscoveryLink is intended? We are working with Aventis to develop a benchmark for these as well, in which we will compare the performance of cross-source queries against DiscoveryLink with that of an application asking multiple queries of distinct sources and then assembling the results. In the meantime, we rely on studies with Garlic and DB2 DataJoiner, the two key components of DiscoveryLink. In Reference 31, Daimler-Benz compared the performance of three state-of-the-art middleware systems to determine the best platform for a new application that needed to combine data from multiple database systems. Their benchmark covered a broad range of workloads, including single-user and multiuser tests with queries ranging from simple selections and projections to complex joins and aggregates. DataJoiner performed well in virtually all tests.


Figure 4-6 Results of the load tests on DiscoveryLink

The authors draw particular attention to the join tests, in which DataJoiner’s performance was up to 60000 percent better than the competition’s, concluding that “since the integration of heterogeneous schemas is mainly done by means of join operations, a well-designed query optimizer plays a kernel role in the solution to the heterogeneity problem because it greatly influences the performance.”

Experiments using the Garlic research prototype indicate that query optimization is important for cross-source queries even when the sources are nonrelational and highly heterogeneous. The Garlic-style optimizer provides the flexibility needed to choose good quality plans under these circumstances(32). A follow-up study(29) showed that an accurate cost model is essential, hence the need to adopt the new query planning interface outlined in the section on query processing.

To summarize, today we can state with a fair amount of confidence that the use of DiscoveryLink will not introduce significant overhead for queries accessing a single data source, and that DiscoveryLink will perform well even under significant loads. Further, we have reason to believe, from both the DataJoiner and the Garlic studies, that performance on cross-source queries will be good: as long as good plans exist, DiscoveryLink should find them. We expect to have further confirmation of this from our current pilot projects, which are using DiscoveryLink in a variety of interesting ways as infrastructure for scientific research in the life sciences.

4.5 Discussion

From the preceding pages, we hope it is clear that DiscoveryLink can play a useful role in integrating access to life science data. Yet DiscoveryLink is not magic; a completely integrated information space requires significant additional work. In particular, DiscoveryLink does not solve the problems of semantic data integration. In many, if not most, research labs, similar or related information is often modeled differently in different data sources. The discrepancies may range from simple formatting differences (one data source uses uppercase, another lower), to differences in vocabulary (one source refers to Tylenol, another to Acetaminophen). Common keys may not exist between sources because objects were identified differently by different data providers.

While DiscoveryLink does not eliminate the problems caused by semantic conflicts, it does offer some facilities that can be used to hide conflicts or translate between representations. By writing queries, for example, that explicitly call translation functions, or that join in a translation table or data dictionary, many conflicts can be resolved. In the examples above, an uppercase function could be used to allow the formatting difference to be bridged, and a join to a lexicon would eliminate the terminology problem. A DBA might have to build a translation table to map between different keys in different sources; DiscoveryLink offers a place to store the table and the ability to use it in queries across these sources. Such approaches “solve” semantic problems at the expense of query processing time, but do not require converting and rebuilding entire databases. The task of reconciling the differences by writing appropriate queries and translation tables or functions is, however, left to the DBA or application programmers. DiscoveryLink merely provides the capability.
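The uppercase function and the lexicon join can be tried out on any SQL engine. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for DiscoveryLink; the table names, column names, and rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two 'sources' that name the same drug differently (invented data).
    CREATE TABLE source_a (drug_name TEXT, dose_mg INTEGER);
    CREATE TABLE source_b (compound TEXT, supplier TEXT);
    -- Translation table mapping each synonym to one canonical term.
    CREATE TABLE lexicon (synonym TEXT, canonical TEXT);

    INSERT INTO source_a VALUES ('tylenol', 500);
    INSERT INTO source_b VALUES ('ACETAMINOPHEN', 'Acme');
    INSERT INTO lexicon VALUES ('TYLENOL', 'ACETAMINOPHEN');
    INSERT INTO lexicon VALUES ('ACETAMINOPHEN', 'ACETAMINOPHEN');
""")

rows = conn.execute("""
    SELECT a.drug_name, b.compound, a.dose_mg, b.supplier
    FROM source_a AS a
    JOIN lexicon AS la ON la.synonym = UPPER(a.drug_name)   -- bridge the case difference
    JOIN lexicon AS lb ON lb.canonical = la.canonical       -- bridge the vocabulary difference
    JOIN source_b AS b ON UPPER(b.compound) = lb.synonym
""").fetchall()
print(rows)
conn.close()
```

The extra joins and function calls are the query processing cost mentioned above; the source tables themselves are left untouched.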

Another characteristic of life sciences data and research environments is frequent change. Data are being constantly accumulated, with volumes increasing rapidly. As more data of a particular type are acquired, and better understood, schemas change to reflect the new knowledge. Further, new sources of information are always appearing as new technologies and informatics companies evolve. In such an environment, flexibility is essential.

DiscoveryLink has been designed with that goal in mind. The powerful query processor and nonprocedural SQL interface protect applications (to the extent possible) from changes in the underlying data source, due to the principle of logical data independence. Often a new source of information can be added simply by registering it and adjusting a view definition to include it. Changes in interfaces can often be hidden from the application by modifying the translation portion of the wrapper, or installing a new wrapper with the new version of the source. The query processing technology is built to handle complex queries, and to scale to terabytes of data. Hence the database middleware concept itself contributes to dealing well with change.
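
As a rough illustration of adding a source by adjusting a view definition, the sketch below again uses SQLite as a stand-in for the federated engine. The table names, the view name, and the rows are assumptions made for the example; in a real installation the new source would first be registered through its wrapper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE swissprot_seqs (accession TEXT, organism TEXT);
    CREATE TABLE genbank_seqs   (accession TEXT, organism TEXT);
    INSERT INTO swissprot_seqs VALUES ('P00001', 'Homo sapiens');
    INSERT INTO genbank_seqs   VALUES ('G00001', 'Mus musculus');

    CREATE VIEW all_sequences AS
        SELECT accession, organism, 'swissprot' AS source FROM swissprot_seqs
        UNION ALL
        SELECT accession, organism, 'genbank' FROM genbank_seqs;
""")

APP_QUERY = "SELECT COUNT(*) FROM all_sequences"  # the application's query never changes
before = conn.execute(APP_QUERY).fetchone()[0]

# A new source appears: make it available, then adjust only the view definition.
conn.executescript("""
    CREATE TABLE inhouse_seqs (accession TEXT, organism TEXT);
    INSERT INTO inhouse_seqs VALUES ('X00001', 'Homo sapiens'), ('X00002', 'Danio rerio');

    DROP VIEW all_sequences;
    CREATE VIEW all_sequences AS
        SELECT accession, organism, 'swissprot' AS source FROM swissprot_seqs
        UNION ALL
        SELECT accession, organism, 'genbank' FROM genbank_seqs
        UNION ALL
        SELECT accession, organism, 'inhouse' FROM inhouse_seqs;
""")
after = conn.execute(APP_QUERY).fetchone()[0]
```

The application text in APP_QUERY is identical before and after the change; only the view, which the administrator owns, was edited.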

Further, the wrapper architecture has been designed for extensibility. Only a small number of functions need to be written to create a working wrapper. Simple sources can be wrapped quickly, in a week or two; more complex sources may require from a few weeks to a few months to completely model, but even for these a working wrapper with perhaps limited functionality can be completed quickly. Templates are provided for each function today, and default cost modeling code will be provided for the next version. Wrappers are built so as to enable as much sharing of code as possible, so that one wrapper can be written to handle multiple versions of a data source, and so that wrappers for similar sources can build on existing wrappers. The ability to separate schema information from wrapper code means that changes in the schema of a data source need not require code changes in the wrappers. The addition of a new data source requires no change to any existing wrappers. Thus the wrappers also help the system adapt to the many changes possible in the environment.
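
The separation of schema information from wrapper code can be pictured with a small sketch. The class and method names below (SourceSchema, FlatFileWrapper, describe, scan) are hypothetical and are not the DiscoveryLink wrapper interface; the point is only that the schema is supplied as data, so a new column in the source changes the schema object and leaves the wrapper code untouched.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class SourceSchema:
    """Schema kept as data, separate from wrapper code (hypothetical structure)."""
    columns: Sequence[str]
    delimiter: str = ","


class FlatFileWrapper:
    """Hypothetical minimal wrapper: describe the source, then scan it."""

    def __init__(self, schema: SourceSchema, text: str) -> None:
        self.schema = schema
        self._text = text

    def describe(self) -> Sequence[str]:
        # Tell the engine which columns this source exposes.
        return self.schema.columns

    def scan(self, wanted: Sequence[str]) -> Iterator[tuple]:
        # Return only the requested columns; any further work is left to the engine.
        reader = csv.reader(io.StringIO(self._text), delimiter=self.schema.delimiter)
        positions = [self.schema.columns.index(name) for name in wanted]
        for record in reader:
            yield tuple(record[p] for p in positions)


old = FlatFileWrapper(SourceSchema(["id", "name"]), "1,aspirin\n2,ketanserin\n")
# The source gains a column: only the schema object changes, not the wrapper class.
new = FlatFileWrapper(SourceSchema(["id", "name", "mw"]),
                      "1,aspirin,180.2\n2,ketanserin,395.4\n")

old_names = list(old.scan(["name"]))
new_rows = list(new.scan(["name", "mw"]))
```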

While not a complete solution to all heterogeneous data source woes, DiscoveryLink is well-suited to the life sciences environment. It serves as a platform for data integration, allowing complex cross-source queries and optimizing them for high performance. In addition, several of its features can help in the resolution of semantic discrepancies, providing mechanisms DBAs can use to bridge the gaps between data representations. Finally, the high-level SQL interface and the flexibility and careful design of the wrapper architecture make it easy to accommodate the many types of change prevalent in this environment.

4.6 Related work

Most data retrieval systems in the life science industry today are point solutions, “solving” the problem of searching or managing one particular type of data. Each domain in the life science industry has its own complicated data types and database formats. For example, in the cheminformatics domain, there are approximately 30 different formats for storing structural information for molecules. The problem is made even more complex by the diversity of database schemas and sources for chemical inventory, compound registry, compound properties, assay protocols, and synthesis protocols. Point solutions in the cheminformatics domain include algorithms for searching the databases for structures (e.g., MDL [33] and Daylight [34]), solutions for calculating compound properties [35,36], and applications to study interactions of small molecules with macromolecules such as proteins [37,38]. Similarly in bioinformatics/genomics, the number of data types and data sources is very broad [39]. While these solutions enable many applications that would otherwise not be possible, they also create islands of data that the end user is forced to address. By allowing integration of these heterogeneous solutions, DiscoveryLink provides a means of bridging the data islands they create.

Other vendors are trying to integrate data from a specific domain, a huge problem in and of itself. Many of these vendors have well-established products in a particular domain. For example, MSI [40] and Oxford Molecular [41] provide products that integrate several related data sources. Genomica Corporation’s [42] tools combine clinical, epidemiology, genetic, molecular biology, and biochemistry applications into a single software environment that spans a number of domains, enabling scientists to accelerate genetic discoveries and pharmacogenomics. The Genomica Reference Database (RDB) centralizes public domain mapping data from worldwide genome centers.

All of these systems integrate specific data sources rather than providing a general framework for data integration as DiscoveryLink does.

More general work on integrating heterogeneous data sources for the life sciences domain includes Kleisli [43], OPM [44], TAMBIS [45], and SRS [46]. Kleisli’s CPL language allows the expression of complicated transformations across heterogeneous data sources, but its procedural nature makes optimization difficult. CPL is geared toward biomedical sources, while SQL (used by DiscoveryLink) is more general purpose. OPM has a more flexible object model than DiscoveryLink, but its multidatabase query processor has a less powerful optimization capability. TAMBIS has concentrated more on the benefits of providing a source-independent ontology of bioinformatics concepts and less on the details of efficient cross-source query processing.

SRS [47] (Sequence Retrieval System) is an indexed flatfile system, built on the model of a document retrieval system. The data files contain structured text, labeled with identifiable field names, e.g., author, keyword, organism, etc. Fields are parsed and an index is built for each field. The user can query the data set using the parsed terms (keywords, author name, etc.) in Boolean combination. There are in excess of 500–600 independent sequence-related data sets available in the public domain, each in a slightly different format, that research scientists would like to access. SRS has created a parser that, with a modest amount of work, can be configured to parse a new data set and develop queriable indexes to it, and has systematically indexed a large number of these resources. Furthermore, SRS combines the indexes in a system that allows cross-database queries, simply executing the same query against all of the indexed data sets, sequentially, and reporting all of the results. This simple model is reasonably effective because there is a strong overlap in the field names and content of the various data sets. However, this system does not extend readily to data types other than sequences, and, even for sequence data, provides neither the rich query capability of SQL nor the optimization capability of DiscoveryLink. DiscoveryLink could be used by SRS as a richer means of integrating the various sources, or DiscoveryLink could wrap SRS as a single source of sequence data.
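
The indexing model described above can be sketched in a few lines. This is an illustration of the general idea only, not SRS code: the flat-file layout, the field names, and the entries are invented, and only Boolean AND is implemented.

```python
from collections import defaultdict

# Two tiny "flat files" with labeled fields; the format and contents are illustrative.
DATASETS = {
    "setA": "ID: A1\nORGANISM: human\nKEYWORD: kinase\n//\n"
            "ID: A2\nORGANISM: mouse\nKEYWORD: kinase\n//\n",
    "setB": "ID: B1\nORGANISM: human\nKEYWORD: receptor\n//\n"
            "ID: B2\nORGANISM: human\nKEYWORD: kinase\n//\n",
}


def build_index(text):
    """Parse labeled fields and build one inverted index per field."""
    index = defaultdict(lambda: defaultdict(set))  # field -> term -> entry ids
    for entry in text.split("//"):
        fields = dict(line.split(": ", 1) for line in entry.strip().splitlines())
        if not fields:
            continue  # skip the empty chunk after the last terminator
        for field, term in fields.items():
            index[field][term.lower()].add(fields["ID"])
    return index


INDEXES = {name: build_index(text) for name, text in DATASETS.items()}


def query_all(**terms):
    """AND the field=term conditions, running the same query on each data set in turn."""
    hits = []
    for name in sorted(INDEXES):
        matches = [INDEXES[name][f.upper()][t.lower()] for f, t in terms.items()]
        hits.extend(sorted(set.intersection(*matches)))
    return hits


result = query_all(organism="human", keyword="kinase")
```

The cross-data-set query works here only because both files happen to use the same field names, which mirrors the overlap the text identifies as the reason the simple model is effective. There is no join between data sets and no optimization step.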

Solutions such as SYNERGY** [48] and Tripos** [49] provide useful access to diverse life sciences data sources and analysis applications through a domain neutral object framework. SYNERGY has been constructed as a network of object-based components built on Java** and CORBA** (Common Object Request Broker Architecture**) technologies, while Tripos relies on CORBA for its distributed framework, or MetaLayer. As with DiscoveryLink, both SYNERGY and Tripos can integrate heterogeneous data sources and programs, and have no built-in data types or analyses. Instead, the kinds of data upon which the framework can operate and the analyses available for these data types are discovered by the program at run time. However, these systems’ focus is on building applications from objects rather than on queries and query optimization. As a result, this type of object layer is complementary to the DiscoveryLink technology, and when used in conjunction with it can provide a powerful solution.

Other solutions, including SeqStore** [50], Gene Expression Datamart [48], and those provided by Incyte [51], have taken a data warehousing or data mart approach to provide fast access to preintegrated data (a data mart is a “small” warehouse designed to support a specific activity). From a performance perspective, we believe the optimization technology for federated data sources described here makes the replication of data and consequent maintenance unnecessary for most applications. Nevertheless, there are situations in which, because of semantic incompatibilities or slow networks, it is preferable to warehouse some of the data and then join this warehouse with other sources using a federated system such as DiscoveryLink.

Compared to other database middleware systems such as TSIMMIS [8], DISCO [18], Pegasus [6], DIOM [7], and HERMES [19], DiscoveryLink is unique in supporting the full SQL3 language across diverse sources. Because these systems are all research prototypes, they have not yet focused on the features needed to make a system industrial strength. Nimble Technology’s Nimble Integration Suite [52] is an XML (Extensible Markup Language)-based integration product that uses XML-QL [53] as the integration language. Although also based on advanced database research (from the University of Washington), this technology is relatively new and unproven compared to relational query processing. Other commercial database middleware systems provide query across multiple relational sources (for example, DataJoiner [54] from IBM and similar products from Oracle [55] and Sybase [56]). DiscoveryLink is unique among these systems in its support for writing new wrappers, its capability to create wrappers for nonrelational sources, its capability to add new sources dynamically, and, with the exception of DataJoiner, in its optimization capabilities.

4.7 Status and future work

In this paper we have described IBM’s DiscoveryLink offering. DiscoveryLink allows users to query data that may be physically stored in many disparate, specialized data stores as if all those data were collocated in a single virtual database. Queries against these data may exploit all of the power of SQL, regardless of how much or how little SQL function the various data sources provide. In addition, queries may employ any additional functionality provided by individual data stores, allowing users the best of both the SQL and the specialized data source worlds. A sophisticated query optimization facility ensures that the query is executed as efficiently as possible. This optimizer will become even more discerning in the next version of DiscoveryLink. We have offered evidence that often DiscoveryLink does not add significant overhead to single-source queries, and we have summarized work showing that the optimizer technologies of both the current and the future versions are necessary and are capable of choosing good query execution plans.

DiscoveryLink is a new offering, but it is based on a fusion of well-tested technologies such as DB2 UDB, DB2 DataJoiner, and the Garlic research project. Both DB2 UDB (originally DB2 C/S) and DB2 DataJoiner have been available as products since the early 1990s, and have been used by thousands of customers over the past decade. The Garlic project began in 1994, and much of its technology was developed as the result of joint studies with customers, including an early study with Merck Pharmaceuticals. DiscoveryLink’s extensible wrapper architecture and the forthcoming version of the optimizer derive from Garlic. As part of Garlic, we successfully built and queried wrappers for a diverse set of data sources, including two relational database systems (DB2 and Oracle), a patent server stored in Lotus Notes**, searchable sites on the World Wide Web (including a database of business listings and a hotel guide), and specialized search engines for collections of images, chemical structures, and text.

Currently, we are working on building up a portfolio of wrappers specific to the life sciences industry. In addition to key relational data sources such as Oracle and Microsoft’s SQL Server** [57], we are writing wrappers for common genomic sources such as SWISS-PROT [13] and GenBank [12], chemical structure sources such as Daylight [34], and general sources of interest to the industry such as Lotus Notes, Microsoft Excel**, flat files, and text management systems. We are also working with key industry vendors to wrap the data sources they supply. While we will continue to create wrappers as quickly as possible, we anticipate that most installations will require one or more new wrappers to be created, due to the sheer number of data sources that exist, and the fact that many potential users have their own proprietary sources as well. Hence we are training a staff of wrapper writers who will be able to build new wrappers as part of the DiscoveryLink software and services offering.

Of course, there are plenty of areas in which further research is needed. For the query engine, key topics are the exploitation of parallelism to enhance performance, and richer support for modeling of object features in foreign data sources. There is also a need for additional tools and facilities that enhance the basic DiscoveryLink offering. We have done some preliminary work on a system for data annotation that provides a rich model of annotations, while exploiting the DiscoveryLink engine to allow querying of annotations and data both separately and together. We are also building a tool to help users create mappings between source data and a target, integrated schema [58,59] to ease the burden of view definition and reconciliation of schemas and data that plagues today’s system administrators. We hope that as DiscoveryLink matures it will serve as a basis for more advanced solutions that will use its ability to integrate access to data from multiple sources to pull real information out of the oceans of data in which life sciences researchers are currently drowning.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of Oracle Corporation, Eli Lilly and Co., Sun Microsystems, Inc., Microsoft Corporation, McNeil Consumer Healthcare, Netgenics, Inc., Tripos Associates, Inc., Object Management Group, Genetics Computer Group, Inc., or Lotus Development Corporation.

4.8 Cited references and notes

1. K. Howard, “The Bioinformatics Gold Rush,” Scientific American, July 2000.

2. See http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.

3. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic Local Alignment Search Tool,” Journal of Molecular Biology 215, No. 3, 403–410 (1990).

4. See http://www.nlm.nih.gov/medlineplus/medline.html.

5. See http://www.idbs.co.uk/.

6. M.-C. Shan, R. Ahmed, J. Davis, W. Du, and W. Kent, “Pegasus: A Heterogeneous Information Management System,” W. Kim, Editor, Modern Database Systems, Chapter 32, ACM Press (Addison-Wesley Publishing Co.), Reading, MA (1994).


7. L. Liu and C. Pu, “The Distributed Interoperable Object Model and Its Application to Large-Scale Interoperable Database Systems,” Proceedings of the Fourth International Conference on Information and Knowledge Management, ACM, New York (1995).

8. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, “Object Exchange Across Heterogeneous Information Sources,” Proceedings of the IEEE Conference on Data Engineering, Taipei, Taiwan, IEEE, New York (1995), pp. 251–260.

9. M. Carey et al., “Towards Heterogeneous Multimedia Information Systems,” Proceedings of the Fifth International Workshop on Research Issues in Data Engineering, Taipei, Taiwan, March 1995, IEEE, New York (1995).

10. L. M. Haas, P. Kodali, J. E. Rice, P. M. Schwarz, and W. C. Swope, “Integrating Life Sciences Data—with a Little Garlic,” Proceedings of the IEEE International Symposium on BioInformatics and Biomedical Engineering, IEEE, New York (2000).

11. T. Studt, “Next Generation Database Management Tools,” R&D Magazine, Drug Discovery & Development, January 2000, http://www.dddmag.com/feats/0001net.htm.

12. See chapter 2 of Reference 39.

13. A. Bairoch and R. Apweiler, “The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000,” Nucleic Acids Research 28, No. 1, 45–48 (2000).

14. A. Dalby, J. Nourse, W. D. Hounshell, A. Gushurst, D. Grier, B. Leland, and J. Laufer, “Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited,” Journal of Chemical Information and Computer Sciences 32, No. 3, 244–255 (1992).

15. D. Weininger, “SMILES,” Journal of Chemical Information and Computer Sciences 28, No. 1, 31–36 (1988).

16. See http://www.genome.ad.jp/kegg/.

17. D. Chamberlin, A Complete Guide to DB2 Universal Database, Morgan Kaufmann Publishers, San Francisco, CA (1998).

18. A. Tomasic, L. Raschid, and P. Valduriez, “Scaling Heterogeneous Databases and the Design of DISCO,” Proceedings of the 16th International Conference on Distributed Computer Systems, Hong Kong, 1996, IEEE, New York (1996).

19. S. Adali, K. Candan, Y. Papakonstantinou, and V. S. Subrahmanian, “Query Caching and Optimization in Distributed Mediator Systems,” Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, Canada, June 1996, ACM, New York (1996), pp. 137–148.

20. K. Kulkarni, “Object-Oriented Extensions in SQL3: A Status Report,” Proceedings of the ACM SIGMOD Conference on Management of Data, Minneapolis, May 1994, ACM, New York (1994).

21. M. Tork Roth and P. Schwarz, “Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources,” Proceedings of the Conference on Very Large Data Bases (VLDB), Athens, Greece, August 1997, ACM, New York (1997).

22. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, “Access Path Selection in a Relational Database Management System,” Proceedings of the ACM SIGMOD Conference on Management of Data, Boston, MA, May 1979, ACM, New York (1979), pp. 23–34.


23. J. Hellerstein and M. Stonebraker, “Predicate Migration: Optimizing Queries with Expensive Predicates,” Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, May 1993, ACM, New York (1993), pp. 267–276.

24. S. Chaudhuri and L. Gravano, “Optimizing Queries over Multimedia Repositories,” Proceedings of the ACM SIGMOD Conference on Management of Data, Montreal, Canada, June 1996, ACM, New York (1996), pp. 91–102.

25. Actually, the optimizer normally ignores pairs when there is no predicate connecting them (e.g., Compounds and Proteins in this query), because typically these “cross-products” do not make good plans.

26. L. Shapiro, “Join Processing in Database Systems with Large Main Memories,” ACM Transactions on Database Systems 11, No. 3, 239–264 (1986).

27. The host variable :KETANSERIN_MOL is presumed to contain an appropriate representation of the ketanserin structure, perhaps as generated by a sketching tool.

28. In this paper, we represent query fragments in SQL; the actual wrapper interface will use an equivalent data structure that does not require parsing by the wrapper.

29. M. Tork Roth, F. Ozcan, and L. Haas, “Cost Models DO Matter: Providing Cost Information for Diverse Data Sources in a Federated System,” Proceedings of the Conference on Very Large Data Bases (VLDB), Edinburgh, Scotland, September 1999, ACM, New York (1999).

30. We did not analyze why this occurred, because it was not unexpected: our experience with DataJoiner over the years is that it usually introduces little overhead, and occasionally can run a transaction faster than the native data source. There are several possible reasons why this script might run faster through DiscoveryLink, among them, DiscoveryLink’s superior optimizer and the fact that it ran on a separate machine, hence could apply more hardware to the problem. In this instance, the result is probably due to the DiscoveryLink engine exploiting the resources of its separate machine, because the four queries in script four are fairly simple, and with one exception leave little room for optimization.

31. F. Rezende and K. Hergula, “The Heterogeneity Problem and Middleware Technology: Experiences with and Performance of Database Gateways,” Proceedings of the Conference on Very Large Data Bases (VLDB), New York, August 1998, ACM, New York (1998).

32. L. Haas, D. Kossmann, E. Wimmers, and J. Yang, “Optimizing Queries Across Diverse Data Sources,” Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), Athens, Greece, August 1997, Morgan Kaufmann Publishers, San Francisco, CA (1997).

33. See http://www.mdli.com.

34. See http://www.daylight.com.

35. Y. Martin, “Comparison of Programs That Calculate Octanol-Water logp Using Starlist,” Proceedings of the 12th Annual Daylight User Group Meeting, Daylight Chemical Information Systems (1997).

36. G. Klopman and H. S. Rosenkranz, “Toxicity Estimation by Chemical Substructure Analysis: The Tox ii Program,” Toxicology Letters 79, 145–155 (1995).

37. R. C. Glen and A. W. R. Payne, “A Genetic Algorithm for the Automated Generation of Molecules Within Constraints,” Journal of Computer-Aided Molecular Design 9, No. 2, 181–202 (1995).

38. I. D. Kuntz, “Structure-Based Strategies for Drug Design and Discovery,” Science 257, 1078–1082 (1992).


39. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, A. D. Baxevanis and B. F. F. Ouellette, Editors, Wiley-Liss, New York (1998).

40. See http://www.msi.com.

41. See http://www.oxfordmolecular.com.

42. See http://www.genomica.com.

43. S. Davidson, C. Overton, V. Tannen, and L. Wong, “Bio-Kleisli: A Digital Library for Biomedical Researchers,” International Journal of Digital Libraries 1, No. 1, 36–53 (1997).

44. I-M. A. Chen, A. S. Kosky, V. M. Markowitz, and E. Szeto, “Constructing and Maintaining Scientific Database Views in the Framework of the Object-Protocol Model,” Proceedings of the Ninth International Conference on Scientific and Statistical Database Management, IEEE, New York (1997), pp. 237–248.

45. N. W. Paton, R. Stevens, P. Baker, C. A. Goble, S. Bechhofer, and A. Brass, “Query Processing in the TAMBIS Bioinformatics Source Integration System,” Proceedings of the 11th International Conference on Scientific and Statistical Database Management, IEEE, New York (1999), pp. 138–147.

46. P. Carter, T. Coupaye, D. Kreil, and T. Etzold, “SRS: Analyzing and Using Data from Heterogeneous Textual Databanks,” S. Letovsky, Editor, Bioinformatics: Databases and Systems, Chapter 18, Kluwer Academic Press (1998).

47. T. Etzold and P. Argos, “SRS: An Indexing and Retrieval Tool for Flat File Data Libraries,” Computer Applications in the Biosciences 9, 49–57 (1993).

48. See http://www.netgenics.com/.

49. See http://www.tripos.com.

50. See http://www.gcg.com/.

51. See http://www.incyte.com/.

52. See http://www.nimble.com/.

53. See http://www.w3.org/TR/NOTE-xml-ql.

54. See http://www.software.ibm.com/data/datajoiner/.

55. See http://www.oracle.com/.

56. See http://www.sybase.com/.

57. See http://www.microsoft.com.

58. L. M. Haas, R. J. Miller, B. Niswonger, M. Tork Roth, P. M. Schwarz, and E. L. Wimmers, “Transforming Heterogeneous Data with Database Middleware: Beyond Integration,” IEEE Data Engineering Bulletin 22, No. 1, 31–36 (1999).

59. R. J. Miller, L. M. Haas, and M. A. Hernandez, “Schema Mapping as Query Discovery,” Proceedings of the Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000, ACM, New York (2000).

Accepted for publication November 17, 2000.


Laura M. Haas IBM Software Group, Silicon Valley Laboratory, 555 Bailey Road, San Jose, California 95141 (electronic mail: [email protected]). Dr. Haas is manager of DB2 Query Compiler and Life Sciences Development for IBM. She was formerly the manager of Data Integration Research at IBM’s Almaden Research Center. She received her Ph.D. degree in 1981 from the University of Texas at Austin. Since joining IBM, she has worked on distributed relational database (R*), extensible query processing (Starburst), and the integration of heterogeneous data (Garlic and Clio). Technology from these projects forms the basis of the DB2 UDB query processor and enables access to heterogeneous data sources in the latest releases of DB2. Dr. Haas was vice-chair of ACM SIGMOD from 1989 to 1997. She has served as an associate editor of the ACM journal Transactions on Database Systems, as program chair of the 1998 ACM SIGMOD conference, and was recently elected to the VLDB Board of Trustees. She has received IBM awards for Outstanding Technical Achievement and Outstanding Contributions, and a YWCA Tribute to Women in Industry (TWIN) award. Her research interests include schema mapping, data integration, and query processing.

Peter M. Schwarz IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Dr. Schwarz is a research staff member in the Middleware Systems and Technology Department of IBM’s Almaden Research Center. He received his Ph.D. degree from Carnegie-Mellon University in 1984, working with Alfred Spector on concurrency control and recovery for typed objects. At IBM, Dr. Schwarz has worked on algorithms for log-based recovery in database systems and middleware for integrating heterogeneous data sources. His interests also include object-oriented programming languages and type systems.

Prasad Kodali 3rd Millennium Inc., 125 Cambridge Park Drive, Cambridge, Massachusetts 02140 (electronic mail: pkodali@3rdmill.com). Dr. Kodali is Informatics Project Lead at 3rd Millennium, where he is involved in developing advanced informatics solutions for pharmaceutical and biotechnology companies. He was previously the product manager of data integration products at NetGenics, Inc. He received his Ph.D. degree in computational chemistry from Pennsylvania State University. His research interests include data integration in drug discovery, computational algorithms, computer-assisted drug design, and life science informatics.

Elon Kotlar Aventis Pharmaceuticals, Bridgewater, New Jersey 08807 (electronic mail: [email protected]). Mr. Kotlar is a global project leader in the Drug Innovation and Approval Information Solutions organization at Aventis Pharmaceuticals. He received his B.A. in the biological basis of behavior from the University of Pennsylvania in 1996 and then worked in diagnostic radiology research at the Hospital of the University of Pennsylvania. At Aventis Pharmaceuticals he has worked to provide scientists with solutions to integrate data across the drug discovery process.

Julia E. Rice IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Dr. Rice is a research staff member and manager in IBM Research at the Almaden Research Center. She joined the computational chemistry team at IBM in 1988 and worked on understanding and predicting the nonlinear optical properties of organic molecules. Following that, she led the teams that developed the quantum chemistry and architecture components of the computational chemistry software package, Mulliken. More recently, her interests have expanded to include database and in particular informatics issues in life sciences. Research in Dr. Rice’s group currently includes 3-D molecular similarity matching of databases of flexible molecules, as well as the use of annotations in life sciences. Her group has played a key role in bridging the gap between scientists and the use of database technology in the DiscoveryLink project. Dr. Rice received her Ph.D. in theoretical chemistry from the University of Cambridge, England. She spent a postdoctoral year at the University of California, Berkeley and then held a research fellowship at Newnham College, Cambridge before joining IBM. Dr. Rice was named as one of the 750 most highly cited chemists worldwide for the period 1981–1997 (ISI survey). She is currently a member of the Executive Committee of the Physical Chemistry section of the American Chemical Society. Dr. Rice was awarded the YWCA Tribute to Women in Industry Award (TWIN) in 1999.

William C. Swope IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (electronic mail: [email protected]). Dr. Swope is a research staff member currently helping with the Blue Gene Protein Science project. He started his career in IBM at IBM Instruments, Inc., an IBM subsidiary that developed scientific instrumentation, where he worked in an advanced processor design group. He also worked for six years at the IBM Scientific Center in Palo Alto, California, where he helped IBM customers develop software for numerically intensive scientific applications. In 1992 Dr. Swope joined the IBM Research Division at Almaden, where he has been involved in software development for computational chemistry applications and in technical data management for petroleum and life sciences applications. He obtained his undergraduate degree in chemistry and physics from Harvard University and his Ph.D. degree in quantum chemistry from the University of California at Berkeley. He then performed postdoctoral research on the statistical mechanics of condensed phases in the chemistry department at Stanford University. He maintains a number of scientific relationships and collaborations with academic and commercial scientists involved in the life sciences and, in particular, drug development.

Copyright 2001 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.


Chapter 5. DB2 Life Sciences Data Connect

The DiscoveryLink offering is a set of middleware software and services tailored specifically to life sciences research and development requirements for integrating data from multiple heterogeneous data sources as shown in Figure 5-1.

Figure 5-1 IBM Life Sciences DiscoveryLink

For example, with DiscoveryLink, you can use a single SQL statement to integrate protein sequence data from an Oracle database in Switzerland, chemical structure data from a Sybase database in Japan, and spectroscopic data stored in table-structured flat files on your local area network. The data appears as if it is in one virtual database.
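
The following sketch gives a feel for what a single SQL statement over three sources looks like. It is not DiscoveryLink or DB2: SQLite's ATTACH DATABASE is used only as a local stand-in for separately located sources, and every schema name, table name, column name, and data value is an assumption made for the example.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
# Each attached database stands in for one remote source (names are hypothetical).
conn.execute("ATTACH DATABASE ':memory:' AS protein_src")  # e.g. the Oracle database
conn.execute("ATTACH DATABASE ':memory:' AS chem_src")     # e.g. the Sybase database
conn.execute("ATTACH DATABASE ':memory:' AS file_src")     # e.g. table-structured flat files

conn.execute("CREATE TABLE protein_src.proteins (protein_id TEXT, name TEXT)")
conn.execute("CREATE TABLE chem_src.compounds (compound_id TEXT, target_protein TEXT)")
conn.execute("CREATE TABLE file_src.spectra (compound_id TEXT, peak_nm REAL)")

conn.executemany("INSERT INTO protein_src.proteins VALUES (?, ?)",
                 [("P1", "protein one"), ("P2", "protein two")])
conn.executemany("INSERT INTO chem_src.compounds VALUES (?, ?)",
                 [("C1", "P1"), ("C2", "P2")])

# The flat file is comma-delimited text; load its rows into the stand-in table.
flat_file = "C1,312.0\nC2,254.5\n"
conn.executemany("INSERT INTO file_src.spectra VALUES (?, ?)",
                 [(cid, float(peak)) for cid, peak in csv.reader(io.StringIO(flat_file))])

# One SQL statement joins all three sources as if they were a single database.
rows = conn.execute("""
    SELECT p.name, c.compound_id, s.peak_nm
    FROM protein_src.proteins AS p
    JOIN chem_src.compounds   AS c ON c.target_protein = p.protein_id
    JOIN file_src.spectra     AS s ON s.compound_id = c.compound_id
    ORDER BY c.compound_id
""").fetchall()
```

In a federated system the query would look much the same to the user, with each source reached through its wrapper rather than through an attached local database.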

[Figure 5-1 labels. Software: DB2 Life Sciences Data Connect; DB2 Relational Connect; DB2 Connect; DB2 Universal Database; MS SQL Server; Oracle; Sybase; DB2 UDB for OS/390 or z/OS; DB2 UDB for VSE and VM; DB2 UDB for iSeries 400. Services. IBM Life Sciences DiscoveryLink.]


5.1 IBM DB2 Life Sciences Data Connect

IBM DB2 Life Sciences Data Connect enables a DB2 federated system to integrate genetic, chemical, biological, and other research data from distributed sources. A DB2 federated system is a distributed computing system that consists of a DB2 Universal Database (UDB) server and multiple data sources from which the DB2 UDB server retrieves data.

With a federated system, you or an application can use SQL statements to query, retrieve, and join data located in several heterogeneous data sources, such as relational databases from IBM, Oracle, Sybase, and Microsoft, and non-relational data sources, such as table-structured files. Figure 5-2 illustrates a federated system using DB2 Life Sciences Data Connect to access multiple sources of research data.

Figure 5-2 Accessing life sciences data with DB2 Life Sciences Data Connect

5.2 Querying life sciences data

To query and retrieve data located in life sciences data sources (non-relational data sources), you must first install DB2 Life Sciences Data Connect.

After you install DB2 Life Sciences Data Connect, configure the appropriate wrapper for the data source. This process is known as registering the wrapper.

Figure 5-2 content:
DB2 clients
Federated middleware system: DB2 Life Sciences Data Connect and the DB2 Universal Database federated database, on UNIX-based and Windows-based platforms
Wrappers
Life sciences data sources: medical data, genetic data, chemical data, biological data


Chapter 6. IBM Life Sciences Global Consulting and Solutions Practice

Dramatic advances occurring in the Life Sciences industry are fueling the rapid scientific discoveries in genomics, proteomics, and molecular biology that serve as the basis for medical breakthroughs, the advent of personalized medicine, and the development of new drugs and treatments. In response to these challenges, life sciences companies are redefining their research methodologies and retooling their IT infrastructures to position themselves for success in this new environment. The traditional trial and error approach is rapidly giving way to a more predictive science based on sophisticated laboratory automation and computer simulation.

The IBM Life Sciences organization offers a comprehensive set of innovative IT infrastructure products and services designed to create value for drug discovery, pharmaceutical and biotechnology companies, healthcare organizations, and academic research institutions. A dedicated IBM Global Life Sciences Consulting and Solutions practice has been established to focus IBM service capabilities and expertise, as well as intellectual capital, on helping customers and Business Partners to migrate their R&D units into even more efficient and competitive operations. Trained professionals from IBM Global Life Sciences Consulting and Solutions can help provide a wide range of consulting, implementation, outsourcing, and hosting services to help improve the efficiency of R&D cycles, enhance collaboration within research communities, and ensure the successful implementation of life sciences solutions.

The IBM Global Life Sciences Consulting and Solutions (LSCS) unit of the IBM Global Services organization is composed of a core team of dedicated professionals with extensive experience in the life sciences and health care industries. IBM LSCS professionals have deep resources, including access to application partners, non-IBM hardware providers, IBM Research Division, and service providers with specialized skills, which permit them to reach out to other leaders in life sciences systems worldwide to match internal capabilities to the needs of life sciences companies.


In addition, IBM LSCS works directly and in concert with leading bioinformatics business partners to:

• Develop end-to-end specialized solutions, such as specialized laboratory applications, for the life sciences industry

• Integrate the best-of-breed software to address a variety of life sciences industry business needs

• Provide highly effective approaches to IT management and strategy that are specifically tailored for life sciences companies (including data integration, query management, data mining, and data interpretation), using advanced algorithms and deep computing technologies.

Key areas for services in the life sciences industry include data management, e-business collaboration, and knowledge management. Strategic planning and implementation for laboratory needs, current systems and software, and future business plans are all critical to the economic planning and business success of life sciences companies. IBM LSCS can help by strategically integrating hardware, software, knowledge management tools, e-business infrastructure, and data integration to deliver seamless solutions quickly and economically.


6.1 Data integration services

IBM's data integration strategy provides supercomputers, software, and services to enable successful research and development in life sciences laboratories. The DiscoveryLink solution from IBM Life Sciences includes the combined resources of DiscoveryLink software and IBM LSCS. Using this versatile software, IBM LSCS can create new components that allow specialized databases (for proteomics, genomics, combinatorial chemistry, and high-throughput screening) to be accessed and integrated quickly and easily. With single-query data access, the DiscoveryLink solution allows researchers to extract information from large volumes of heterogeneous research and clinical data sources, including important legacy data. IBM LSCS will work with companies transitioning to DiscoveryLink to ensure that the solution fully addresses current laboratory needs and can be readily expanded as those needs develop.

This modular approach to data integration minimizes up-front investment, delivers results quickly, and reduces overall project risk. It also permits life sciences companies to begin with the integration of key sources and grow as data integration needs change. In addition to data integration, LSCS can draw on the full IBM enterprise data management portfolio to enable data analysis through business intelligence, collaboration across groups, knowledge transfer, knowledge mapping, and fast location of expert information.


6.2 Knowledge management services

Many drug discovery processes (including clinical trials) require maximum efficiency for data sharing and knowledge management functions like patient record mining across companies. With IBM knowledge management software and services, laboratory research teams can capture, manage, and share information and create new information relationships. This includes the determination of:

• What information exists

• Where information resides

• Who the experts are

• How to access experts when needed

• How to capture and disseminate expert knowledge electronically

• How to keep knowledge current

• How to manage data daily

• How to collaborate with colleagues in a real-time, online, innovative environment.


6.3 e-Business hosting and services

As life sciences information technology needs spread beyond the traditional business boundaries to extended, virtual enterprises, companies will require e-business infrastructures that enable them to share information and processes, work as teams across geographies, manage growing volumes of network transactions, and connect with both the wired and wireless worlds. R&D organizations can use IBM data integration and management tools to create easy-to-use Web portals to leverage the Internet for collaborative global life sciences research. Life sciences services can help company teams build virtual research communities using secure Web portals for online research collaboration in real-time while meeting the security requirements of the entire R&D process. This includes privacy issues concerning genetic information from donors in research studies, patient records in clinical trials, and confidential business information.

Managed monitoring, systems administration and server management, professional services, and security are just a few of the e-business services available. IBM e-business hosting is unique due to IBM's capability and capacity to fully integrate every hosting activity in the Web hosting spectrum. IBM consultants can provide the services needed to implement e-business solutions on time, with full capability, and in the most economical manner. With a stable Web foundation, life sciences researchers can access multiple online data sources, personalize data queries, and ensure the secure management of research and medical information.


6.4 Comprehensive IT service offerings

IBM has the skills, resources, and infrastructure management support to meet the laboratory and business needs of the life sciences industry. IBM service specialists can provide consulting, systems integration, strategic outsourcing services, and hosting services, along with data management, knowledge management, and e-business solutions that work with existing systems and software. As technology needs change and grow, IBM consultants can help optimize systems to accommodate every level of expansion.

Business services from IBM include:

• Business recovery services/e-business infrastructure

• Information technology consulting: infrastructure systems and management

• IT consolidation services/IT product training

• Midrange express services/networking and connectivity services

• Technical support services/total systems management services.

Health care services from IBM include:

• Enterprise Resource Planning/Knowledge and content management

• Business intelligence services/systems integration services

• Customer relationship management services/digital branding and marketing

• E-business strategy and design/e-commerce services

• Merger and acquisition services/procurement services

• Security and privacy services/skills development for e-business

• Supply chain management services/Web application development.


Chapter 7. IBM Life Sciences Global Consulting and Solutions Practice DiscoveryLink Transition Offering

The IBM Life Sciences organization offers a comprehensive set of innovative IT infrastructure products and services designed to create value for drug discovery, pharmaceutical and biotechnology companies, healthcare organizations, and academic research institutions.


7.1 Life Sciences Solution Practice Presentation

A dedicated IBM Global Life Sciences Consulting and Solutions practice has been established to focus IBM service capabilities and expertise, as well as intellectual capital, on helping customers and Business Partners to migrate their R&D units into even more efficient and competitive operations. Trained professionals from IBM Global Life Sciences Consulting and Solutions can help provide a wide range of consulting, implementation, outsourcing, and hosting services to help improve the efficiency of R&D cycles, enhance collaboration within research communities, and ensure the successful implementation of life sciences solutions.

Figure 7-1 Global Consulting and Solutions Practice

IBM has the skills, resources and infrastructure management support to meet the laboratory and business needs of the life sciences industry. A dedicated IBM Global Life Sciences Consulting and Solutions practice has been established to focus IBM service capabilities, expertise and intellectual capital on helping customers and IBM Business Partners to migrate their R&D units into even more efficient and competitive operations.

IBM specialists can provide consulting, systems integration, strategic outsourcing, and hosting services, along with data management, knowledge management, and e-business solutions that work with existing systems and software. As technology needs change and grow, IBM consultants can help optimize systems to accommodate every level of expansion.

Slide title: IBM Life Sciences Global Consulting and Solutions Practice, DiscoveryLink(TM) Transition Offering


Figure 7-2 IBM Global Services as a Resource

The services provided by IBM Global Services demonstrate an enormous range of capabilities. Please note the reference to IBM Research. In the Life Sciences marketplace, the expertise represented by IBM Research is highly relevant in areas such as Supercomputing, Bioinformatics, Knowledge Management, Text and Data Mining, and all aspects of e-business.

Slide text: IBM Global Services (IGS) is a Tremendous Resource to Customers

• Intellectual Capital: Life Sciences specific assets, e.g. DiscoveryLink solution; Healthcare specific methodologies; IBM Research, technology lab, development capabilities
• 3000+ Scientists and Engineers at 8 Labs in 6 Countries
• 140,000+ Professionals in 160 Countries
• $33.2B Services Revenues in 2000
• 25,000+ e-Business Engagements
• IT Services Expertise: Data integration expertise; Complex infrastructure management expertise; Global reach for fast and consistent implementation
• Healthcare Consulting: Understanding of healthcare strategic issues; Management consulting across the drug discovery, development, and healthcare value chain; Thought leader in e-Business transformation; Leader in privacy and security services
• Business Requirements: Reduce cycle time of drug discovery; Streamline key business processes; Global collaboration


Figure 7-3 IBM Offerings Addressing Full Spectrum of Life Sciences Requirements

IGS (IBM Global Services) is organized along the headings displayed:

What should I do? IBM Consulting will review the situation and advise on the appropriate course of action.

Help me do it: Business Innovation Services professionals are able to implement a wide range of activities ranging from knowledge management to e-Enablement.

Do it for me: Strategic Outsourcing / Hosting are major activities IGS is engaged in.

Slide text: IBM Offerings Address the Full Spectrum of Life Sciences Requirements

Column headings: WHAT SHOULD I DO?, HELP ME DO IT, DO IT FOR ME (arranged between BUSINESS and IT)

Offerings shown: Life Sciences Mgmt Consulting Vision & Strategy; Life Sciences Applications Selection & I/T Infrastructure Design; DiscoveryLink Transition Services; Clinical Trial e-Enablement; Knowledge Management; Applications Integration; Applications Hosting; Life Sciences Portal Hosting; I/T Infrastructure Outsourcing


Figure 7-4 DiscoveryLink as Part of Overall Enterprise Data Management

DiscoveryLink is one piece of a larger enterprise data management approach.

IGS Business Innovation Solutions cover Business Intelligence and many aspects of Knowledge Management, from the support of Collaboration / Communities of Practice, to Search/Navigation of document collections and Location of Expertise, wherever it resides in a global organization.

Slide text: The DiscoveryLink Solution is Part of an Overall Enterprise Data Management Approach

• Business Intelligence
• Collaboration
• Knowledge Transfer
• Knowledge Discovery & Mapping (DiscoveryLink)
• Expertise


Figure 7-5 IGS Services as Part of the DiscoveryLink Solution

The DiscoveryLink solution is based on IBM's flagship DB2 product (DB2 UDB and Relational Connect), on specific Life Sciences extensions (Life Sciences Data Connect), and on IBM Global Services to customize, install, and integrate DiscoveryLink in the existing IT infrastructure of the client organization.

Slide text: IGS Services is an Integral Piece of the DiscoveryLink Solution

DiscoveryLink Solution: Services, Wrappers, DB2 Universal Database(TM)


Figure 7-6 IGS and DiscoveryLink: A Phased, Customized Approach —Proposal/SOW

In IBM's phased approach, a typical DiscoveryLink engagement will start with determination of client requirements and definition of the project scope.

Slide text: IGS and DiscoveryLink: A Phased, Customized Approach

Proposal/SOW:
• Determine client requirements
• Define project scope


Figure 7-7 Phase 0—Assessment

In Phase 0 of the Services engagement, the existing IT infrastructure is assessed and data sources are identified. In addition, Phase 0 may include an assessment of business processes and organizational impact.

Slide text: Phase 0, Assessment

• Understand existing IT infrastructure, applications and data sources
• Assess business process and organization impact


Figure 7-8 Phase 1—Technical Validation

Phase 1 includes the installation and configuration of DiscoveryLink, the configuration of Wrappers for given data sources, development of test queries for technical feasibility and performance, and the development of a plan for deployment, based on requirements agreed with the client.

Slide text: Phase 1, Technical Validation

• Install and configure initial DiscoveryLink solution
• Integrate data from identified sources
• Demonstrate technical feasibility and performance
• Determine requirements for adaptation and deployment


Figure 7-9 Phase 1-A—Adaptation

Alternatively, Phase 1 could also include an adaptation step that involves tasks such as:

• Identification of impact on IT architecture

• Identification of need for additional Wrappers, requiring development effort

• Determination of user interface requirements, and integration with applications and tools

• Identification of (user/administrator) training needs.

Slide text: Phase 1-A, Adaptation

• Identify impact on IT architecture and operations
• Identify additional data sources and need for additional wrappers
• Determine requirements for user interfaces, applications, and tools
• Outline training needs


Figure 7-10 Phase 2-n—Deployment

Based on the plan developed during Phase 1, Phase 2 involves activities pertaining to deployment. DiscoveryLink is put in “production” for a large number of users, and for a large number of data sources. This step may include deployment of processes for organizational changes, additional development of user interfaces, and extensive training.

Slide text: Phase 2-n, Deployment

• Implement follow-on DiscoveryLink sites as requested
• Integrate data from additional sources
• Deploy process and organization changes, user interfaces, and training


Figure 7-11 Contact IBM Global Services

To learn more about how DiscoveryLink can help dramatically improve the R&D effectiveness of your organization, visit our web site at www.ibm.com/discoverylink, or contact a Life Sciences solutions specialist at [email protected].

For more information on IBM Global Services:
Go to our web site at: ibm.com/services
Contact us at: [email protected]

IBM DiscoveryLink - the data integration solution


Chapter 8. DiscoveryLink: A Data Integration Solution for Life Sciences (For IT Professionals)

The DiscoveryLink solution from IBM allows the integration of discovery, clinical trial, regulatory and even marketing data throughout the product development, approval and deployment cycle.


8.1 IT Professional Presentation

Research organizations can increase the number of qualified discovery projects, and can identify promising targets and leads faster and develop them more effectively while reducing the burden of managing IT infrastructure. The DiscoveryLink solution from IBM includes the combined resources of DiscoveryLink middleware and IBM Life Sciences Services.

Using this versatile software, IBM Life Sciences Services can create new components that allow specialized databases for proteomics, genomics, combinatorial chemistry and high-throughput screening to be accessed and integrated quickly and easily. DiscoveryLink is unique among existing systems because it enables easy creation of wrappers for nonrelational sources and provides the capability to add new sources dynamically.

Figure 8-1 IT Professionals Overview of DiscoveryLink

DiscoveryLink—A Data Integration Solution for Life Sciences

A new understanding of the workings of life at the genetic and molecular levels, combined with laboratory automation, promises to make finding new therapeutic agents radically faster, cheaper, and more effective. New data are pouring out of innovative technologies, such as genomics, at an unprecedented and rapidly increasing rate.

DiscoveryLink offers a unique data integration and knowledge management capability that addresses the extremely demanding needs presented by the ever increasing amounts and types of data required for research in the life sciences, particularly in informatics and drug discovery. It is a way of turning life science data into insight.

Slide title: DiscoveryLink(TM): A Data Integration Solution for Life Sciences (For IT Professionals)


Figure 8-2 Increasing Data Requirements

Dramatic advances occurring in the life sciences industry are changing the way we live. These advances fuel the rapid scientific discoveries in genomics, pharmaceutical research, proteomics, and molecular biology that serve as the basis for medical breakthroughs and the development of new drugs and treatments. As imperatives to unravel the mysteries of DNA and bring healing medicines to the market on time—better, faster and cheaper—grow more urgent, the pressures to improve the productivity of the research and development (R&D) processes intensify.

One of the life sciences industry's most difficult challenges is transforming massive quantities of highly complex, constantly changing data—from a variety of data sources—into knowledge.

The sequenced human genome has already increased the number of biological drug “targets” that can be explored from about 500 to over 30,000. Soon, many life sciences companies will need to access and analyze “petabytes” (10^15 bytes) of data to further their research efforts. In addition to the enormity of the data, there are challenges related to querying non-standard data formats, accessing data assets across global networks, and securing data outside firewalls.

Slide text: Increasing Data Requirements

Chart: petabytes of data over time (1990, 2000, 2010)
Data types shown: HTS, Combinatorial Chemistry, Human Genome, SNPs, Pharmacogenomics, Proteins, Metabolic Pathways, ESTs
Growth drivers shown: Medical Data Growth, External Research Partnerships, Growth in Clinical Trials, The Internet

• Life Sciences data is increasing at a tremendous rate
• Petabytes (10^15) of data are projected
• Data integration and data management are key to successfully deciphering meaning



Data integration/management is a common problem. Here is an example from the pharmaceutical domain. To find new drugs, scientists use a wide range of data from several sources. They need to be able to interrelate data from different sources, for example, to combine structural information in a chemical structures database with assay results from a relational database.

Slide text: Integrated Data Management

Data sources shown around DiscoveryLink: Toxicology Data, Proteomic Data, Compound Data, Genomic Data, Textual Data, Clinical Data, Gene Expression Data, Other Data Sources

• Link multiple heterogeneous data sources together
• One query spans multiple data sources



Figure 8-4 DiscoveryLink Architecture

At the far right are the data sources. To these sources, DiscoveryLink looks like an application; they are not changed or modified in any way. DiscoveryLink talks to the sources using wrappers, which use each data source's own client-server mechanism to interact with the source in its native dialect.

DiscoveryLink has a local catalog in which it stores information (metadata) about the accessible data (both local data, if any, and data at the back-end data sources).

Applications of DiscoveryLink manipulate data using any supported SQL API: ODBC and JDBC are supported, as well as embedded SQL. Thus a DiscoveryLink application looks like any normal database application.

Slide text: DiscoveryLink Architecture

Diagram elements: Life Sciences Application (client), SQL API (JDBC/ODBC), DiscoveryLink with its Catalog and (optional) Data, Wrappers, Back-end Data Sources with their Data

DiscoveryLink is comprised of:
• Wrappers (software)
• IBM Global Services
• DiscoveryLink (DB2) Federated Database Engine

DB2 drives DiscoveryLink but it does not replace existing client databases!


Figure 8-5 Federated Database Technology

To process queries such as the one coming from the computer on the left above, DiscoveryLink needs a full database engine. The engine not only compiles and optimizes the query to get the best possible plan, but allows DiscoveryLink to compensate for functionality that is missing in less sophisticated data sources (for example, if a source cannot do joins, or particular kinds of predicates, and so forth).
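To illustrate compensation, the following sketch assumes two hypothetical nicknames, assay_file and plate_file, each registered over a table-structured flat file. The nickname and column names are invented for this example. Because a flat file offers no join or pattern-matching capability, the federated engine would be expected to retrieve the rows through the wrapper and evaluate the join and the LIKE predicate itself.

```sql
-- assay_file and plate_file are hypothetical nicknames over flat files.
-- Neither source can join or evaluate LIKE, so the federated engine
-- is expected to perform both operations after fetching the rows.
SELECT a.compound_id,
       a.ic50_nm,
       p.plate_barcode
FROM   assay_file AS a
JOIN   plate_file AS p
       ON p.plate_id = a.plate_id
WHERE  p.plate_barcode LIKE 'HTS-2001-%';
```

The application writes the same SQL it would write against local tables; which parts run at the source and which run in the engine is decided by the optimizer from what each wrapper reports about its source.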

Slide text: Federated Database Technology is the Foundation of DiscoveryLink

Federated DB components:
• Query compiler: Parser, Semantic processor, Optimizer
• Execution engine: Sort engine, Residual predicate, Functions
• Catalog, Data manager, Locking, Logging, Buffer manager, Client access, Transaction Coordinator, Query gateway
• Interface to sources

The diagram shows a client query that combines data from several databases.


Figure 8-6 What are Wrappers?

DiscoveryLink technology is designed to make new wrappers easy to write. A wrapper for a simple data source (or even a simple wrapper for a more complex source) can be written in a short time frame. Complete wrappers for complex sources may take more time. Wrappers are kept “thin” in the sense that they need not implement any functionality not provided by the data source itself.

Slide text: What are Wrappers?

• Wrappers are C++ software interfaces to client data sources
• Example: the sqlnet wrapper interfaces to Oracle SQL*Net
• Wrappers can be written for many data sources (e.g. Oracle, DB2, SQL Server, flat files, etc.)


Figure 8-7 Wrapper Responsibility

Wrappers are the translators between data sources and your federated system.

Slide text: Wrappers are responsible for:

• Mapping data source information into DiscoveryLink's model
• Informing DiscoveryLink about the data source's query capabilities
• Translating between query fragments and data source API
• Executing requests and returning results


Figure 8-8 DiscoveryLink Federated Server

DiscoveryLink connects data from:

• Other DB2 Family data sources

• Non-DB2 relational databases

• Specialized Life Sciences data sources

Wrapper libraries, files containing the wrapper code, are available for each category of data source listed in the chart.

Slide text: DiscoveryLink Federated Server

• For DB2 Federated Server capability with all DB2 Family data sources, use the drda.dll (NT, W2K) or libdrda.a (UNIX) wrapper library
  - Comes with DB2 UDB(TM) EE/EEE
  - Usable with DB2 UDB, DB2/400, DB2/OS/390, and DataJoiner(R) data sources
  - drda.dll / libdrda.a uses either the db2ra or drda protocol, whichever is appropriate
• For DB2 Federated Server capability with Oracle, Sybase, and MS SQL Server, 'DB2 Relational Connect' is needed
  - Provides the sqlnet, net8, ctlib, dblib, and mssqlodbc wrapper libraries
  - Wrappers are available for specific platforms: Oracle on AIX(R), NT/2000, Solaris, Linux; Sybase on AIX, NT/2000, Solaris; MS SQL on AIX, NT/2000
  - Requires installation of the network client for the data source: Oracle SQL*Net or net8, Sybase Open Client, Microsoft SQL Server ODBC Client
• Life sciences data sources
  - Engage IBM Global Services to create wrappers
  - Obtain wrapper from 3rd party (example: from the DBMS vendor)


Figure 8-9 DiscoveryLink Components

DiscoveryLink is made up of three software components and one services component.

DB2 Life Sciences Data Connect software connects data sources associated with the life sciences industry to a federated database system. Data sources of this type are non-relational. For example: genomic or proteomic data in specialized data banks like Genbank or SWISS-PROT, or scientific data in flat files and spreadsheets.
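As a hedged sketch of how a table-structured file might be registered, the statements below define a wrapper, a server, and a nickname over a comma-delimited file. The wrapper library name, the file path, and all object and column names are placeholders. The nickname option names (FILE_PATH, COLUMN_DELIMITER, SORTED, KEY_COLUMN) follow the table-structured file wrapper documentation for later DB2 releases and should be verified against your installed level.

```sql
-- Library name is a placeholder: use the table-structured file wrapper
-- library shipped with your DB2 Life Sciences Data Connect level.
CREATE WRAPPER flat_files_wrapper LIBRARY 'libfile.a';

-- A server definition groups the files that belong together.
CREATE SERVER biochem_lab WRAPPER flat_files_wrapper;

-- The nickname supplies the column layout that the file itself lacks.
-- Path, columns, and option values are assumptions for illustration.
CREATE NICKNAME drugdata1
       (dcode        INTEGER NOT NULL,
        drug         CHAR(20),
        manufacturer CHAR(20))
   FOR SERVER biochem_lab
   OPTIONS (FILE_PATH '/usr/pat/drugdata1.txt',
            COLUMN_DELIMITER ',',
            SORTED 'Y',
            KEY_COLUMN 'DCODE');
```

Once the nickname exists, the file can be referenced in a SELECT like any table, for example in a join against relational nicknames.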

DB2 Relational Connect software connects non-IBM relational data sources to a federated database system. For example: Oracle, Sybase, and Microsoft databases.
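For a relational source, registration might look like the following sketch for an Oracle database reached through the net8 wrapper. The server name, Oracle node alias, credentials, and remote schema and table names are all hypothetical, and the exact VERSION value and options should be checked against the DB2 documentation for your release.

```sql
-- Register the Net8 wrapper under its default name.
CREATE WRAPPER NET8;

-- NODE is assumed to be the alias defined in the Oracle client
-- configuration (tnsnames.ora); 'ora_tns_alias' is a placeholder.
CREATE SERVER ora_proteins
       TYPE ORACLE VERSION 8
       WRAPPER NET8
       OPTIONS (NODE 'ora_tns_alias');

-- Map the current DB2 user to a remote Oracle account (placeholders).
CREATE USER MAPPING FOR USER
       SERVER ora_proteins
       OPTIONS (REMOTE_AUTHID 'lab_user', REMOTE_PASSWORD 'lab_password');

-- Expose one remote table under a local name.
CREATE NICKNAME protein_seq FOR ora_proteins.LAB.PROTEIN_SEQUENCES;
```

Unlike a flat-file nickname, a relational nickname needs no column list, because the column definitions can be read from the remote catalog.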

Slide text: DiscoveryLink Components

• DB2 Life Sciences Data Connect: wrappers to access life sciences data; the first data source supported is table-structured files on AIX
• DB2 Relational Connect: wrappers to access relational databases from Oracle, Sybase and Microsoft
• DiscoveryLink (DB2) Universal Database
• IBM Global Services


Figure 8-10 DiscoveryLink is Built on Proven Technology

DiscoveryLink technology is solid and stable because it is built on top of our award-winning, relational database technology and expertise.

DiscoveryLink comes from the integration of IBM's DataJoiner technology into DB2 Universal Database with the addition of relational and non-relational wrapper technology from our Relational Connect and Life Sciences Data Connect products, respectively.

Slide text: DiscoveryLink is Built on Proven Technology

• 1995: DataJoiner(R)/AIX Version 1 was released; the base technology was DB2/6000 V1
• 1997: DataJoiner/AIX, NT, Solaris Version 2 was released; the base technology was DB2 Common Server V2
• 2000: DB2 UDB(TM) Version 7 Enterprise Edition and Extended Enterprise Edition were released; DataJoiner(R) technology integrated with DB2 Universal Database; Relational Connect; DiscoveryLink: the base technology is DB2 UDB V7 Enterprise Edition
• 2001: Life Sciences Data Connect; DB2 7.2


Figure 8-11 DiscoveryLink Accesses Multiple, Varied Data Sources

With DiscoveryLink, you can create a federated system that links together all your data from different data sources.

The data stays in the data source, unchanged, while you use standard SQL statements to query it as if the whole federation were one large relational database—a virtual database.

The various data sources and wrappers for these data sources are listed in the chart. For example, Oracle data on AIX or Solaris can be accessed using the SQL*Net or Net8 wrappers.

Slide text: DiscoveryLink Accesses Multiple, Varied Data Sources

DiscoveryLink (DB2 V7) uses each data source's normal network client:

• Oracle, through DB2 Relational Connect (net8 and sql*net wrappers) and the Oracle SQL*Net or Oracle Net8 client over TCP/IP. Oracle V7 7.0.13 or later, or Oracle V8. On AIX or Solaris: SQL*Net V1, V2 or Net8. On NT/2000: SQL*Net V7.3 or Net8.
• Sybase, through DB2 Relational Connect (ctlib and dblib wrappers) and Sybase Open Client. Sybase on AIX, Solaris, or Windows NT/2000.
• MS SQL Server, through DB2 Relational Connect (mssqlodbc wrapper) and the MS SQL Server ODBC driver. MS SQL Server on Windows NT/2000.
• Flat-file sources, through Life Sciences Data Connect.
• Other data sources ("data source X") on AIX, Solaris, or Windows NT/2000, through the network client of the data source and a wrapper ("X wrapper") from a 3rd party or the customer.
• DB2 family sources (DB2 390, DB2 400, DB2 NT, DB2 UNIX), through the DRDA wrapper. Drivers shown: DRDA driver, DB2 LAN driver. Protocols shown: TCP/IP, APPC, NetBIOS. DB2 on MVS: V2.3 or later (DB2 Connect included in DB2 EE/EEE). DB2/400 (DB2 Connect included in DB2 EE/EEE). DB2 on NT, AIX, Solaris, HP-UX, etc.


Figure 8-12 DiscoveryLink has Object-Relational Extensibility

Because DiscoveryLink is based on DB2 UDB, it provides all the features of a leading-edge object-relational database system. This means enhanced modeling power, and a simpler mapping between applications and your data.

Slide text: DiscoveryLink has Object-Relational Extensibility

• Structured types: user-defined, complex data types; can be used as column types and/or table types; inheritance
• Column types: text, image, audio, video, time series, point, line, OLE...; for modeling new kinds of facts about enterprise entities; enhanced infrastructure for Extenders, Blades, Cartridges
• Row types: types and functions for rows of tables; for modeling enterprise entities with relationships & behavior; native business object infrastructure
• Object views


Figure 8-13 The Four Steps of Registration: Wrapper Configuration

The wrapper first needs to be registered with the CREATE WRAPPER statement to associate the wrapper name with the appropriate wrapper library.

Subsequently, the server needs to be registered using the CREATE SERVER statement to define a data source to a federated database.

The CREATE NICKNAME statement is used to create a nickname for the data source or view—essentially creating a virtual relational table understood by the federation. The data source itself can be of any type—relational or non-relational.

The CREATE FUNCTION... AS TEMPLATE and the CREATE FUNCTION MAPPING statements create a function template which maps to an existing data source function. The data source's function can then be used in federated queries.

The Four Steps of Registration: Wrapper Configuration

CREATE WRAPPER Day_wrapper
LIBRARY "/lmh/wrappers/libday.a"

CREATE SERVER molecule_server
WRAPPER Day_wrapper
OPTIONS (NODE 'styx', DIRECTORY '/lmh/day/top')

CREATE NICKNAME Cmpnds
(compound_id string, molwt float, compound_struct string, logP float)
SERVER molecule_server OPTIONS (FILE "abcde.txt")

CREATE FUNCTION SimilarTo (varchar(20), varchar(20))
returns float AS TEMPLATE

CREATE FUNCTION MAPPING mapping1 SimilarTo(varchar(20), varchar(20))
SERVER molecule_server;
WRAPPER Day_wrapper


Figure 8-14 Registration: Create Wrapper

In this example, the wrapper “Day_wrapper” is registered with the wrapper library “libday.a”.

Registration: Create Wrapper

CREATE WRAPPER Day_wrapper
LIBRARY "/lmh/wrappers/libday.a"

• Define the wrapper
• Identify the shared library which must be loaded


Figure 8-15 Registration: Create Server

Here, the server “molecule_server” is registered with the wrapper “DAY_wrapper”.

Registration: Create Server

CREATE SERVER molecule_server
WRAPPER DAY_wrapper
OPTIONS (NODE 'styx', DIRECTORY '/lmh/day/top')

• Define specific data sources
• CREATE SERVER is necessary because a single wrapper can be used for multiple sources of the same type
• A separate CREATE SERVER is required for each data source


Figure 8-16 Registration: Create Nickname

In this example, the nickname “Cmpnds” is registered. It has four columns, the first being compound_id with a data type of string, and so on. The “Cmpnds” nickname is associated with the “molecule_server” server created with the CREATE SERVER statement. The data source is a flat file called “abcde.txt”.

Data collections are sets of data in any format—relational or non-relational. They can be subsets of larger data collections. For example, a molecular database can have 30 fields per record, but for a particular installation, only 10 fields are required. These 10 fields are mapped via the CREATE NICKNAME statement and exposed to the users.

Registration: Create Nickname

CREATE NICKNAME Cmpnds
(compound_id string, molwt float, compound_struct string, logP float)
SERVER molecule_server OPTIONS (FILE "abcde.txt")

• Identify data collections that will be exposed to DiscoveryLink applications as tables
• Data collections must be identified for each data source


Figure 8-17 Registration: Create Function Mapping

The CREATE FUNCTION statement above defines a function “SimilarTo” that has two arguments, each of type varchar(20). It returns a float data type.

The CREATE FUNCTION MAPPING statement maps the “SimilarTo” function defined above to a function of the remote data source. The function “SimilarTo” is now available to DiscoveryLink queries.

Function mapping allows DiscoveryLink to use data source functions in its queries. For example, a mapped function could expose the data source's own search capabilities, or perform data conversions.

Registration: Create Function Mapping

CREATE FUNCTION SimilarTo (varchar(20), varchar(20))
returns float AS TEMPLATE;

CREATE FUNCTION MAPPING mapping1 SimilarTo(varchar(20), varchar(20))
SERVER molecule_server;
WRAPPER Day_wrapper

CREATE FUNCTION MAPPING identifies any function of the source. Examples:
• Search capabilities
• Conversion functions


Figure 8-18 DiscoveryLink: Query Optimization

Optimizing queries produces more efficient code for execution against the data source.

Query Rewrite: Transforms SQL statements into forms that can be optimized more easily.

Pushdown Analysis: Tells the DB2 optimizer if an operation can be performed at a remote data source. Operations that can be pushed-down can significantly improve query performance.

Cost-Based Optimization: Optimization based on modeling the costs of the possible execution strategies and choosing the lowest cost strategy.

DiscoveryLink: Query Optimization

Query optimization consists of:
• Query Rewrite: Transform a user query based on heuristics and server knowledge
• Pushdown Analysis: Analyze how to decompose a user query
• Cost-Based Optimization: Generate an optimal query execution plan using cost estimates

Produce efficient DBMS specific SQL for SQL-speaking sources


Figure 8-19 Why is Query Rewrite Necessary

Query rewrite is necessary because there are (1) many flavors of SQL, (2) poorly written SQL statements, usually produced by query generators, and (3) queries so complex that they are full of inefficient structures.

Why is Query Rewrite Necessary?

• Multiple specifications are allowed in SQL
• Commercial query generators produce queries that don't perform; the best query specification is often DBMS dependent
• Complex queries result in redundancy, especially with views


Figure 8-20 Typical Query Rewrite Examples

This chart lists some characteristics of SQL statements that are typically rewritten by the optimizer before being executed.

Typical Query Rewrite Examples

• Subqueries transformed into joins
• Set operations such as INTERSECT converted to joins
• Predicate transitive closure computed
• Redundant predicates eliminated
• Columns not used are projected out
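As a small illustration of the first rewrite in the list (a subquery transformed into a join), the following sketch uses Python's built-in sqlite3 module rather than DiscoveryLink itself. The table layout and rows are hypothetical, loosely modeled on the assays and proteins tables used later in this chapter. It only shows that the two forms return the same rows; it does not reproduce how the DB2 optimizer performs the rewrite.

```python
import sqlite3

# Hypothetical tables and rows, invented for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE proteins (protein_id INTEGER PRIMARY KEY, family TEXT);
    CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, protein_id INTEGER, ic50 REAL);
    INSERT INTO proteins VALUES (1, 'serotonin'), (2, 'dopamine'), (3, 'serotonin');
    INSERT INTO assays VALUES (10, 1, 5e-10), (11, 2, 2e-9), (12, 3, 8e-10), (13, 1, 3e-6);
""")

# Form a user or query generator might write: a subquery in the WHERE clause.
subquery_form = """
    SELECT a.assay_id FROM assays a
    WHERE a.ic50 < 1e-9
      AND a.protein_id IN (SELECT p.protein_id FROM proteins p
                           WHERE p.family = 'serotonin')
    ORDER BY a.assay_id
"""

# Equivalent join form, the kind of shape a rewrite step can produce.
# The join is on a primary key, so it cannot introduce duplicate rows.
join_form = """
    SELECT a.assay_id FROM assays a
    JOIN proteins p ON p.protein_id = a.protein_id
    WHERE a.ic50 < 1e-9 AND p.family = 'serotonin'
    ORDER BY a.assay_id
"""

subquery_rows = conn.execute(subquery_form).fetchall()
join_rows = conn.execute(join_form).fetchall()
print(subquery_rows == join_rows)
```

The join form gives an optimizer more freedom, because it can then choose the join order and join method instead of being tied to the nesting the user wrote.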


Figure 8-21 What is Pushdown Analysis

Pushdown analysis (PDA) tells the DB2 optimizer whether an operation can be performed at a remote data source; that is, it determines which parts of a query can be executed at the data source. Pushing operations down can significantly improve query performance.

Note that PDA provides its analysis to the optimizer. The optimizer, taking several factors into consideration to determine the cost of a particular plan, then decides which portions of a query to push down.

What is Pushdown Analysis?

• PDA determines what portions of a query can be executed
• PDA provides guidance to the cost-based optimizer
• PDA rules out pushdowns that would cause bad results but does not determine if a specific piece will be pushed down; this is decided by the optimizer based upon cost


Figure 8-22 Typical Factors that can Affect Pushdown Results

PDA takes the listed factors into consideration when making its analysis of the query.

Typical Factors That Can Affect Pushdown Results

• What can the server support?
• Is there a server specific restriction?
• Is there a server specific limit?
• Will data be ordered similarly?
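The first factor, what the server can support, can be pictured with a toy check like the one below. This is only a sketch under stated assumptions: the capability sets, the source names, and the `Operation` class are hypothetical and do not reflect how DB2 represents server capabilities internally. The function only identifies pushdown candidates; as described above, the cost-based optimizer makes the final decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str         # operator or function name, e.g. "<" or "LIKE"
    description: str  # the query fragment it stands for


# Hypothetical capability lists; invented for this illustration.
SERVER_CAPABILITIES = {
    "oracle_source": {"<", "=", "LIKE", "JOIN", "ORDER BY"},
    "flatfile_source": {"=", "<"},
}


def pushdown_candidates(server, operations):
    """Split operations into those the source could evaluate and those
    that would have to be evaluated locally by the federated engine."""
    supported = SERVER_CAPABILITIES[server]
    remote = [op for op in operations if op.name in supported]
    local = [op for op in operations if op.name not in supported]
    return remote, local


ops = [
    Operation("<", "IC50 < 1E-9"),
    Operation("LIKE", "family LIKE '%serotonin'"),
]
remote_ops, local_ops = pushdown_candidates("flatfile_source", ops)
print([op.description for op in remote_ops])
print([op.description for op in local_ops])
```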


Figure 8-23 Cost-Based Optimization: Factors to Consider

This chart lists the factors used in developing a cost for a given plan of execution. The plan with the lowest cost is the one used.

Cost-Based Optimization: Factors to Consider

• Do statistics indicate a table is large or small?
• How is the system configured?
• What is the optimization level?
• How is the data distributed?
• What operations can be pushed down?
• How to evaluate each operation?
• What is the cost to evaluate an operation?
• Where to evaluate an operation?
• Care for only the first n rows?


Figure 8-24 One Scenario: Optimizing a Query

This scenario demonstrates how a query is optimized by DiscoveryLink's engine, DB2 Universal Database.

In the example, a question is posed by a scientist.

Three relevant data sources connected to the DiscoveryLink federated system are listed.

One Scenario: Optimizing a Query

Chemist asks the question: Show me all compounds that have been tested against the serotonin family of receptors and have IC50 values in the nanomolar/ml range.

Three data sources (tables) involved:
• assays
• compounds
• proteins


Figure 8-25 Overview of Optimization

A search strategy called “dynamic programming join enumeration” is used to develop the list of possible plans for executing the query.

Plans consist of operators. Operators include nested-loop join (NLJ), sort, scan, filter, and pushdown. Operators have properties, including tables, columns, predicates, order, cost, and cardinality.

A special operator is used to cover the work done by wrappers.

To build a plan, the optimizer first looks at all ways of accessing each of the individual tables (maybe there is a choice of whether to scan the whole table or to use an index, and so forth). Then for each pair of tables, it considers each possible way of combining them. For example, should it access assays first and use a hash join to combine with proteins, or access proteins first and then use a nested loop join to integrate it with assays? It then looks at combining those plans with the next table to make 3-way combinations and so on, until all the tables to be accessed are accounted for.

At each phase, the calculated cost of execution is the basis for removing plans from the list.

Overview of Optimization

• Dynamic programming used to enumerate plans
• Plans consist of operators
  Operators have properties, including cost
  Special operator to encapsulate wrapper work
• Enumeration done bottom-up in three phases
  Single collection accesses, joins, finishing touches
• Costs used to prune plans during each phase
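The bottom-up enumeration described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not DB2's optimizer: the cardinalities, the join selectivities, and the single cost formula (cost of both inputs plus the product of their cardinalities) are all hypothetical, and only one join method and one access path per table are considered.

```python
from itertools import combinations

# Hypothetical cardinalities and join selectivities for the three tables.
CARD = {"assays": 10000, "compounds": 5000, "proteins": 15}
SELECTIVITY = {
    frozenset(["assays", "compounds"]): 1 / 5000,
    frozenset(["assays", "proteins"]): 1 / 1000,
}


def join_selectivity(left, right):
    """Combined selectivity of all join predicates linking two table sets."""
    sel = 1.0
    for a in left:
        for b in right:
            sel *= SELECTIVITY.get(frozenset([a, b]), 1.0)
    return sel


def enumerate_plans(tables):
    # Phase 1: single-table accesses. best[subset] = (cost, cardinality, plan)
    best = {frozenset([t]): (CARD[t], CARD[t], t) for t in tables}
    # Phase 2: build plans for larger and larger sets of tables.
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    lcost, lcard, lplan = best[left]
                    rcost, rcard, rplan = best[right]
                    cost = lcost + rcost + lcard * rcard
                    card = lcard * rcard * join_selectivity(left, right)
                    # Pruning: keep only the cheapest plan per set of tables.
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, card, f"({lplan} JOIN {rplan})")
    return best[frozenset(tables)]


cost, card, plan = enumerate_plans(["assays", "compounds", "proteins"])
print(plan, cost)
```

With these made-up numbers the cheapest plan joins assays with proteins first, because the small proteins table keeps the intermediate result small. Different statistics would lead to a different choice, which is the point of cost-based optimization.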


Figure 8-26 Building Plans using Properties

For example, this diagram shows a possible partial plan for the query introduced above. In this plan, the Proteins “table” at the remote source is accessed first, and the proteins retrieved are filtered so that only those described as serotonins are passed through (the actual predicate would be a bit more complex... this is simplified to make it easier to explain). Then, for each of the approximately 15 serotonins retrieved, the Assays “table” at the second remote source is probed, looking for matches which also meet the restriction on the ic50.

This would not be a good plan: it would move far too much data, and would likely be pruned (thrown out of consideration) early.

Building Plans using Properties

Select d.name from emp e, dept d
where e.age < 30
and e.dno = d.dno

Pushdown  T:emp  P:none  C:age,dno  $:5  O:None  #:1500
Filter  T:emp  P:age<30  C:age,dno  $:8  O:None  #:10
Pushdown  T:dept  P:none  C:dno,name  $:2  O:None  #:100
NLJ  Tabs:Emp, Dept  Preds:e.dno=d.dno, e.age<30  Cols:age,dno,dno,name  Cost:25  Order:none  Card:100

Some operators: NLJ, Sort, Scan, Filter, Pushdown,...

Some properties: Tables, Columns, Predicates, Order, Cost, Cardinality

• Optimizer requests work in terms of properties
• Wrapper reply uses props
• Optimizer adds operators to compensate


Figure 8-27 Optimizing a Query: First Phase

In the first phase of query optimization, plans for each data table are created. Each plan's properties (tables, columns, and predicates) are different.

Optimizing a Query: First Phase

• First phase of optimization plans for each data table
• Each plan has a different set of properties

Remote Query
Tables: Assays
Columns: compound_id, IC50, screen_name
Predicates: IC50 < 1E-9

Tables: Compounds
Columns: compound_id, structure
Predicates: (None)

Tables: Proteins
Columns: protein_id, name
Predicates: family LIKE '%serotonin'


Figure 8-28 Optimizing a Query: Second Phase

In the second phase of query optimization, pairs of tables are examined. Several plans for joining each pair are created.

One plan is created for (1) every possible way the join could be executed, and (2) every order in which the data tables could be accessed.

Optimizing a Query: Second Phase

In the second phase the optimizer examines all pairs of tables and constructs multiple plans for joining each pair.

A plan will be generated for:
• Each manner of executing a join
• The order in which the data tables are accessed


Figure 8-29 Optimizing a Query: Query Plans

The optimizer will generate many plans for the query. The number of plans is exponential in the number of tables being joined.



Figure 8-30 Optimizing a Query: Three Query Plans

Three plans are shown. Each shows a different sequence of executing the query to yield the requested results.

Optimizing a Query: Three Query Plans

(1)
• Plan finds the structure of each compound
• Determines which had low IC50 scores in assays
• Looks up bound proteins in assays to see if they are serotonin receptors

(2)
• Finds assays with low IC50 scores
• Finds structures of those compounds
• Determines if bound proteins are serotonin receptors

(3)
• Finds proteins that are members of the serotonin family
• Finds assays in which some compound tightly bound serotonin members
• Retrieves the structures of compounds that bind


Figure 8-31 Optimizing a Query: Which Plan is Best?

Picking the best plan depends on several factors; refer to the diagram describing cost-based optimization (Figure 8-23).

In the three examples, selection depends on the number of items in the data tables with the listed characteristics. For example, if the number of compounds in the data source were low, the first plan, which goes after each compound first, might be best. However, if the number of compounds is high, it might be better to consider another plan.

Optimizing a Query: Which Plan is Best?

Depends upon many factors

In the three example plans, it would depend upon:
• The number of compounds
• Number of proteins in the serotonin family
• Number of compounds with low IC50 scores

DiscoveryLink's cost-based optimizer is essential to executing cross-source queries with good performance


Figure 8-32 Information on DiscoveryLink

To learn more about how DiscoveryLink can help dramatically improve the R&D effectiveness of your organization, visit our web site at www.ibm.com/discoverylink, or contact a Life Sciences solutions specialist at [email protected].

For more information on IBM DiscoveryLink
Go to our web site at: ibm.com/DiscoveryLink
Contact us at: [email protected]

IBM DiscoveryLink - the data integration solution


Part 3 DiscoveryLink Demonstration

DiscoveryLink is the new middleware solution from IBM Life Sciences. This demonstration overview was developed to demonstrate one manner in which DiscoveryLink can be used in a life sciences environment. This demonstration is not meant to describe any particular “look” to DiscoveryLink. It is specifically for demonstration purposes.

The example queries enable you to follow real queries processed through DiscoveryLink using actual data.



Chapter 9. DiscoveryLink Demonstration

DiscoveryLink is a unique data integration solution from IBM Life Sciences.

This Web-based demonstration was developed to demonstrate one manner in which DiscoveryLink can be used in a life sciences environment. (This demonstration uses Java™ technology. To experience the full capabilities of this demonstration, enable Java on your browser.)

This demonstration is not meant to describe any particular “look” to DiscoveryLink. It is specifically for demonstration purposes.



9.1 DiscoveryLink technology
DiscoveryLink technology enables and enhances data integration by providing the critical interface between front-end applications and data sources.

The software components of DiscoveryLink technology can enable rapid enterprise-class application development or integration. It works transparently with a multitude of databases, applications and client tools.

Utilizing DiscoveryLink as your data integration layer ensures a flexible, and at the same time, structured architecture that will grow and adapt to your business needs. DiscoveryLink is not a front-end application and therefore does not have its own specific front-end “look.”

DISCLAIMER: All data was obtained from the National Cancer Institute (NCI) Web site: http://dtp.nci.nih.gov and IBM is not responsible for the accuracy of the data contained in this demonstration.

9.2 About the DiscoveryLink demonstrations
This demonstration provides several examples of how DiscoveryLink technology can be used in a life sciences environment. This demonstration does not reflect the actual appearance of DiscoveryLink or its query input. It is designed specifically for informational purposes.

Each of the three example queries displays a capability of DiscoveryLink. For the purposes of the demonstration, the results are static.

The queries display the following information:

• Query name: Provides the name of the query being executed.

• DiscoveryLink function: Explains the function of a specific query, which can include the JOIN and UNION functions, or functions of the remote data sources.

• Details of query: Links to detailed information for the query, including specific scientific background and DiscoveryLink information for the query.


9.3 Example queries
These queries, as outlined in Figure 9-1, were specifically written to demonstrate the data federation capabilities of DiscoveryLink.

Figure 9-1 Example Queries

9.4 Query 1
This query executes a search for all experiments for a specific molecular target. The DiscoveryLink feature that this query demonstrates is the UNION of data stored in DB2, Oracle, and SQL Server. You enter a full or partial target name. Examples might be:

• CD81

• Tyrosine kinase

• ATPase

• Topoisomerase

9.4.1 Query 1 details
This section outlines the detailed background, information, and features demonstrated by DiscoveryLink for Query 1.

Scientific background
Thousands of molecular targets have been measured in the NCI panel of 60 human tumor cell lines. Measurements include protein levels, RNA measurements, mutation status and enzyme activity levels. You can search for a target of interest by full or partial target name. The experiments were performed in many different laboratories and the data are stored in several different databases.

DiscoveryLink information
With DiscoveryLink, you are able to retrieve the 60 cell line data from all databases in a single query. The data for this query is stored in DB2, Oracle and SQL Server. DiscoveryLink accesses all datasources and materializes the view of the final result set.

DiscoveryLink features demonstrated
Query 1 demonstrates:

• UNION of data from the 3 heterogeneous datasources

• Functions of remote data source utilized through DiscoveryLink

• Nested sub-query accessing 2 datasources


9.4.2 Query 1 architecture
The following steps outline the Query 1 architecture:

1. Query issued to DiscoveryLink

2. Query parsed and optimized by DiscoveryLink

3. New queries generated by DiscoveryLink for each respective data source

4. DiscoveryLink issues queries to each data source

5. Results from each data source returned to DiscoveryLink

6. DiscoveryLink processes returned result sets into one final result set

7. DiscoveryLink returns final result set to user application
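The seven steps above can be mimicked with a small sketch in which three in-memory SQLite databases stand in for the DB2, Oracle, and SQL Server sources. This is an illustration under stated assumptions only: the rows are hypothetical, the column types are guesses, and the `federated_union` helper is invented here; DiscoveryLink performs these steps inside the DB2 engine through its wrappers, not in application code.

```python
import sqlite3


def make_source(table, rows):
    """Create an in-memory database standing in for one remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {table} (moltnbr INTEGER, description TEXT, moltid TEXT)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    return conn


# Hypothetical rows; only the table names follow the Query 1 SQL.
sources = [
    ("DB2", "web_hooks_prim", make_source("web_hooks_prim", [(1, "CD81 protein", "MT1")])),
    ("Oracle", "web_hooks_gc", make_source("web_hooks_gc", [(2, "CD81 mRNA", "GC2")])),
    ("SQL Server", "millennium", make_source("millennium", [(3, "ATPase", "ML3")])),
]


def federated_union(selected):
    placeholders = ", ".join("?" for _ in selected)
    merged = set()  # UNION semantics: duplicate rows are removed
    for dbsrc, table, conn in sources:
        # Steps 3 to 5: one generated query per source, results returned.
        sql = (
            f"SELECT moltnbr, description, moltid FROM {table} "
            f"WHERE moltnbr IN ({placeholders})"
        )
        for row in conn.execute(sql, selected):
            merged.add(row + (dbsrc,))
    # Step 6: combine into one final result set, ordered as in Part I
    # of the Query 1 SQL (description, then moltnbr).
    return sorted(merged, key=lambda r: (r[1], r[0]))


result = federated_union([1, 2])
print(result)
```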

Figure 9-2 illustrates the Query 1 architecture.

Figure 9-2 Query 1 Architecture


9.4.3 SQL queries issued to DiscoveryLink for Query 1
The SQL queries listed below are issued to DiscoveryLink during navigation of the Query 1 path. [user selected] is replaced with the values selected by the user during navigation.

======================================================
**** Part I ***************************************
======================================================
SELECT moltnbr, description, moltid, 'DB2' dbsrc,
'WEB_HOOKS_PRIM (NCI)' dsrc
FROM web_hooks_prim
WHERE moltnbr IN ([user selected])

UNION

SELECT moltnbr, description, moltid, 'Oracle' dbsrc,
'WEB_HOOKS_GC (Weinstein (NCI) and Brown & Botstein (Stanford))' dsrc
FROM web_hooks_gc
WHERE moltnbr IN ([user selected])

UNION

SELECT moltnbr, description, moltid, 'SQL Server' dbsrc,
'MILLENNIUM (Millennium Pharmaceuticals)' dsrc
FROM millennium
WHERE moltnbr IN ([user selected])

ORDER BY description, moltnbr

======================================================
***** Part II **************************************
======================================================
SELECT panel_name, cell_name, value
FROM web_hooks_prim
WHERE moltnbr = [user selected]

UNION

SELECT panel_name, cell_name, value
FROM web_hooks_gc
WHERE moltnbr = [user selected]

UNION

SELECT panel_name, cell_name, value
FROM millennium
WHERE moltnbr = [user selected]

ORDER BY Panel_name, cell_name

======================================================
***** Part III *************************************
======================================================
SELECT a.moltid, a.moltnbr, a.description, avg(a.value) mean, b.method, b.units
FROM web_hooks_prim a, primary_units_pmid b
WHERE a.moltnbr IN ([user selected])
AND a.moltnbr = b.moltnbr
GROUP BY moltid, a.moltnbr, description, method, units

UNION

SELECT moltid, moltnbr, description, avg(value) mean, 'Microarray' method,
'log(mRNA levels in cell line/mRNA levels in reference pool)' units
FROM web_hooks_gc
WHERE moltnbr IN ([user selected])
GROUP BY moltid, moltnbr, description

UNION

SELECT moltid, moltnbr, description, avg(value) mean, 'Microarray' method,
'RNA level (signal from perfect match MINUS signal from mismatch)' units
FROM millennium
WHERE moltnbr IN ([user selected])
GROUP BY moltid, moltnbr, description

======================================================
***** Part IV **************************************
======================================================
SELECT panel_name, cell_name, rel, toprank
FROM (SELECT panel_name, cell_name, rel,
             rank() over (order by rel desc) as toprank
      FROM (SELECT panel_name, cell_name, value/mean as rel
            from web_hooks_prim WHERE moltnbr = [user selected]
            UNION
            SELECT panel_name, cell_name, value/mean as rel
            from web_hooks_gc WHERE moltnbr = [user selected]
            UNION
            SELECT panel_name, cell_name, value/mean as rel
            from millennium WHERE moltnbr = [user selected]
           ) AS sub1
     ) AS sub2
WHERE toprank < 6


9.4.4 Query 1 results
Figure 9-3 shows the query results for a target name like ‘cd81’.

Figure 9-3 Query 1 Results for Target Name ‘cd81’

If the Compare button is pressed, the results in Figure 9-4 show the comparison between the three NCI Experiment IDs from Figure 9-3.

Figure 9-4 Query 1 Comparison Results


9.4.5 Query 1 results for NCI experiment ID number 9423
The detailed results of NCI Experiment ID number 9423 are shown in Figures 9-5, 9-6, 9-7, and 9-8.

Figure 9-5 Query 1 Results for NCI Experiment ID Number 9423 — Part 1


9.4.6 Query 1 results for NCI experiment ID number 11872
The detailed results of NCI Experiment ID number 11872 are shown in Figures 9-9, 9-10, 9-11, and 9-12.



9.4.7 Query 1 results for NCI experiment ID number 12253
The detailed results of NCI Experiment ID number 12253 are shown in Figures 9-13, 9-14, 9-15, and 9-16.



Figure 9-16 Query 1 results for NCI experiment ID number 12253 — Part 4

9.5 Query 2
This query shows the assay results for a specific compound. The DiscoveryLink feature that this query demonstrates is accessing data from Excel and two Oracle databases. You enter the NSC number of the target compound, for example ‘8423’.


Scientific background
The DTP Human Tumor Cell Line Screen has evaluated large numbers of compounds for evidence of the ability to inhibit the growth of human tumor cell lines. This demo utilizes screening results, chemical data, and property data on compounds that are not covered by a confidentiality agreement. The compounds submitted to the cancer screen are generally tested at five different concentrations for the ability to inhibit sixty different human tumor cell lines. The dose response data is used to calculate three concentration parameters: GI50, TGI and LC50.

Using the seven absorbance measurements [time zero (Tz), control growth (C), and test growth in the presence of drug at the five concentration levels (Ti)], percentage growth inhibition is calculated as:

[(Ti-Tz)/(C-Tz)] x 100 for concentrations for which Ti >= Tz

[(Ti-Tz)/Tz] x 100 for concentrations for which Ti < Tz.

Three dose response parameters are calculated for each experimental agent: GI50, TGI and LC50.


Values are calculated for each of these three parameters if the level of activity is reached; however, if the effect is not reached or is exceeded, the value for that parameter is expressed as greater or less than the maximum or minimum concentration tested.

GI50: Growth inhibition of 50% (GI50) is the drug concentration resulting in a 50% reduction in the net protein increase in control cells during the drug incubation. It is calculated from [(Ti-Tz)/(C-Tz)] x 100 = 50.

TGI: The drug concentration resulting in total growth inhibition (TGI) is calculated from Ti = Tz.

LC50: The LC50 (the concentration of drug resulting in a 50% reduction in the measured protein at the end of the drug treatment as compared to that at the beginning), indicating a net loss of cells following treatment, is calculated from [(Ti-Tz)/Tz] x 100 = -50.
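The two-case percentage growth formula and the three reference levels (50, 0, and -50) can be checked with a short Python sketch. The function name `percent_growth` and the absorbance values used below are hypothetical; they were chosen only so that the arithmetic is easy to follow, and they are not NCI data.

```python
def percent_growth(ti, tz, c):
    """Percentage growth at one concentration.

    ti: test growth absorbance in the presence of drug
    tz: absorbance at time zero
    c:  control growth absorbance
    """
    if ti >= tz:
        # Net growth or no change: scale against control growth.
        return (ti - tz) / (c - tz) * 100
    # Net loss of cells: scale against the starting amount.
    return (ti - tz) / tz * 100


# Hypothetical absorbances: time zero 0.5, control growth 1.5.
tz, c = 0.5, 1.5
at_gi50 = percent_growth(1.0, tz, c)   # half of the control's net increase
at_tgi = percent_growth(0.5, tz, c)    # Ti = Tz, total growth inhibition
at_lc50 = percent_growth(0.25, tz, c)  # half of the starting protein lost
print(at_gi50, at_tgi, at_lc50)
```

GI50, TGI, and LC50 are then the drug concentrations at which the measured percentage growth reaches 50, 0, and -50 respectively; finding those concentrations from five tested doses requires interpolation, which this sketch does not attempt.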

DiscoveryLink information
The data for this query is located in 2 distinct Oracle instances and an Excel spreadsheet. DiscoveryLink accesses all datasources and materializes the views of the final result sets.

DiscoveryLink features demonstrated

• OUTER-JOIN of data from 2 distinct datasources

• Various JOINs across datasources

• Utilization of a non-relational wrapper for Excel

• Utilization of SQL on a non-SQL-capable datasource

Query 2 architecture
The following steps outline the Query 2 architecture:

1. Query issued to DiscoveryLink

2. Query parsed and optimized by DiscoveryLink

3. New queries generated by DiscoveryLink for each respective datasource

4. DiscoveryLink issues queries to each datasource

5. Results from each datasource returned to DiscoveryLink

6. DiscoveryLink processes returned result sets into one final result set

7. DiscoveryLink returns final result set to user application




The SQL queries listed below are issued to DiscoveryLink during navigation of the Query 2 path. [user selected] is replaced with the values selected by the user during navigation.

======================================================
***** Part I *************************************************
======================================================
SELECT a.mol_wt, a.mol_fmla, b.molfile_text1, b.molfile_text2,
b.molfile_text3, b.molfile_text4, b.molfile_text5, b.molfile_text6,
b.molfile_text7, b.molfile_text8, b.molfile_text9, b.molfile_text10,
b.molfile_text11
FROM nci_smiles a LEFT OUTER JOIN nci_molfiles b
ON a.nsc = b.nsc
WHERE a.nsc = [user selected]

======================================================
***** Part II *************************************************
======================================================
SELECT a.log_hi_conc, a.conc_unit,
a.log_result_gi50, a.log_result_tgi, a.log_result_lc50,
b.panel_name, b.cell_name
FROM nci_results a, nci_cancers b
WHERE a.nsc = [user selected]
AND a.panel_number = b.panel_number
AND a.cell_number = b.cell_number
AND a.panel_number = [user selected]
AND a.cell_number = [user selected]
ORDER BY panel_name, cell_name

======================================================
***** Part III ************************************************
======================================================
SELECT * FROM nci_props_xl (Excel spreadsheet)
WHERE nsc = [user selected]
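The effect of the LEFT OUTER JOIN in Part I can be seen in the following sketch, which again uses Python's sqlite3 module in place of DiscoveryLink. The two tables live in a single in-memory database here, whereas in the demonstration they are nicknames over distinct datasources. The rows, the NSC numbers 1 and 2, and the reduced column list are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the two tables joined in Part I.
    CREATE TABLE nci_smiles (nsc INTEGER, mol_wt REAL, mol_fmla TEXT);
    CREATE TABLE nci_molfiles (nsc INTEGER, molfile_text1 TEXT);
    INSERT INTO nci_smiles VALUES (1, 46.07, 'C2H6O'), (2, 18.02, 'H2O');
    INSERT INTO nci_molfiles VALUES (1, 'placeholder molfile text');
""")

query = """
    SELECT a.nsc, a.mol_fmla, b.molfile_text1
    FROM nci_smiles a LEFT OUTER JOIN nci_molfiles b
    ON a.nsc = b.nsc
    WHERE a.nsc = ?
"""

with_molfile = conn.execute(query, (1,)).fetchall()
without_molfile = conn.execute(query, (2,)).fetchall()
print(with_molfile)
print(without_molfile)
```

A compound with no matching molfile row is still returned, with the molfile column empty (None in Python, NULL in SQL). An inner join would have dropped that compound from the result.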

9.5.2 Query 2 results
Figure 9-18 shows the query results for an NSC number such as ‘8423’.

Figure 9-18 Query 2 Results for NSC Number Target Compound ‘8423’


Additional Query 2 results for NSC number target compound ‘8423’ are shown in Figures 9-19, 9-20, and 9-21.

Figure 9-19 Query 2 Results for NSC Number Target Compound ‘8423’ — Part 1


If additional compound properties are selected, the data shown in Figure 9-22 is presented.

Figure 9-22 Properties for NSC Compound ‘8423’

9.6 Query 3
This query lists all compounds tested against a specific cancer cell. The DiscoveryLink feature that this query demonstrates is the JOIN of data from two Oracle databases. You select a cancer panel:cell, for example, Breast:MDA-MB-231/ATCC.


Scientific background
The DTP Human Tumor Cell Line Screen has evaluated large numbers of compounds for evidence of the ability to inhibit the growth of human tumor cell lines. This demo utilizes screening results, chemical data, and property data on compounds that are not covered by a confidentiality agreement. The compounds submitted to the cancer screen are generally tested at five different concentrations for the ability to inhibit sixty different human tumor cell lines. The dose response data is used to calculate three concentration parameters: GI50, TGI and LC50.


Using the seven absorbance measurements [time zero (Tz), control growth (C), and test growth in the presence of drug at the five concentration levels (Ti)], percentage growth inhibition is calculated as:

[(Ti-Tz)/(C-Tz)] x 100 for concentrations for which Ti >= Tz

[(Ti-Tz)/Tz] x 100 for concentrations for which Ti < Tz.

Three dose response parameters are calculated for each experimental agent: GI50, TGI and LC50.

Values are calculated for each of these three parameters if the level of activity is reached; however, if the effect is not reached or is exceeded, the value for that parameter is expressed as greater or less than the maximum or minimum concentration tested.

GI50: Growth inhibition of 50% (GI50) is the drug concentration resulting in a 50% reduction in the net protein increase in control cells during the drug incubation. It is calculated from [(Ti-Tz)/(C-Tz)] x 100 = 50.

TGI: The drug concentration resulting in total growth inhibition (TGI) is calculated from Ti = Tz.

LC50: The LC50 (the concentration of drug resulting in a 50% reduction in the measured protein at the end of the drug treatment as compared to that at the beginning), indicating a net loss of cells following treatment, is calculated from [(Ti-Tz)/Tz] x 100 = -50.

DiscoveryLink information
The data for this query is located in 2 distinct Oracle instances. DiscoveryLink accesses both datasources and materializes the view of the final result set.

DiscoveryLink features demonstrated

• JOIN of data from the 2 distinct datasources

• Demonstration of DiscoveryLink efficiency joining large datasets

Query 3 architecture
The following steps outline the Query 3 architecture:

1. Query issued to DiscoveryLink

2. Query parsed and optimized by DiscoveryLink

3. New queries generated by DiscoveryLink for each respective datasource

4. DiscoveryLink issues queries to each datasource

5. Results from each datasource returned to DiscoveryLink

6. DiscoveryLink processes returned result sets into one final result set

7. DiscoveryLink returns final result set to user application




The SQL queries listed below are issued to DiscoveryLink during navigation of the Query 3 path. [user selected] is replaced with the values selected by the user during navigation of Query 3.

==================================================================
*****SQL**********************************************************
==================================================================
SELECT a.nsc, b.compound_name, a.log_hi_conc, a.conc_unit,
a.log_result_gi50, a.log_result_tgi, a.log_result_lc50
FROM nci_results a, nci_names b
WHERE panel_number = [user selected]
AND cell_number = [user selected]
AND a.nsc = b.nsc


9.6.2 Query 3 results

Figure 9-24 shows the results for the query on cancer panel:cell, Breast:MDA-MB-231/ATCC.

Figure 9-24 Query 3 Results


9.6.3 Query 3 results for NSC ID 171

Figure 9-25 shows the 3D structure of the results for Query 3 for NSC ID 171.

Figure 9-25 Query 3 Results for NSC ID 171 — 3D Structure


Special notices

References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.

Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact IBM Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites.

The following terms are trademarks of other companies:

Tivoli, Manage. Anything. Anywhere., The Power To Manage., Anything. Anywhere., TME, NetView, Cross-Site, Tivoli Ready, Tivoli Certified, Planet Tivoli, and Tivoli Enterprise are trademarks or registered trademarks of Tivoli Systems Inc., an IBM company, in the United States, other countries, or both. In Denmark, Tivoli is a trademark licensed from Kjøbenhavns Sommer - Tivoli A/S.

C-bus is a trademark of Corollary, Inc. in the United States and/or other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and/or other countries.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States and/or other countries.

PC Direct is a trademark of Ziff Communications Company in the United States and/or other countries and is used by IBM Corporation under license.

ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States and/or other countries.

UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group.

SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC.

Other company, product, and service names may be trademarks or service marks of others.





SG24-6290-00 ISBN 0738423254

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information: ibm.com/redbooks


The key to increasing R&D effectiveness and remaining competitive in today’s fast-paced scientific community is data integration. The ability to tap into multiple, heterogeneous data sources once and quickly retrieve clear, consistent information is critical to uncovering correlations and insights that lead to the discovery of new drugs and medical products.

To meet the challenges of integrating and analyzing diverse scientific data from the variety of domains within life sciences, IBM has developed a versatile platform solution—IBM DiscoveryLink™. With single query data access, the IBM DiscoveryLink™ software allows researchers to work with distributed data sources and diverse data formats. IBM DB2® Universal Database™, the industry’s first multimedia, Web-ready, federated database, provides the industry-leading performance and scalability required to drive the most demanding life sciences applications.

To ensure robust performance and fast response time, DiscoveryLink includes query optimization technology that automatically searches for the most efficient means of executing the query and assembling the results. With a single Structured Query Language (SQL) command, researchers can access and integrate information from multiple data sources.

Back cover
