integrated querying across disparate data sources josé luis ambite & gully apc burns...

Post on 30-Dec-2015

217 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Integrated Querying Across Disparate Data Sources

José Luis Ambite & Gully APC Burns

Information Sciences InstituteUniversity of Southern California

2

Team

Information Integration Infrastructure: Jose Luis Ambite, Craig Knoblock, Maria Muslea, Gowri Kumaraguruparan, Kristina Lerman (USC/ISI)

Domain Collaborators:FBIRN: Naveen Ashish (UCI), Jessica Turner (UCI), Karl Helmer (MGH),

Tim Olsen (WUSTL), Dingying Wei (UCI)

NHPRC: John Nylander, Dave Brink, Liz Moran (NHPRC)

CVRG: Naveen Ashish (UCI), Steve Granite (JHU)

Security: Rachana Ananthakrishnan (UC), Laura Pearlman (USC/ISI)

Data Management: Robert Schuler, Ann Chervenak (USC/ISI)

Knowledge Engineering: Gully Burns, Tom Russ (USC/ISI), Naveen Ashish, Jessica Turner (UCI)

User Interfaces: Naveen Ashish (UCI), Jose Luis Ambite, Pedro Szekely, Craig Rogers, Gowri Kumaraguruparan, Maria Muslea (USC/ISI)

3

Information Integration

Mediator: uniform structured query access to heterogeneous sources

Challenges:Syntactic (Access/Format) heterogeneity: Wrappers

Structured Sources: DBMS, XML/XQuery DBsSemi-structured Sources: HTML, text, pdfWeb services XML, SOAP, WSDL

Semantic heterogeneity MediatorSchema Source modelingData Record Linkage

Scalability:MediationSecuritySource AdditionRecord LinkageEfficient Query Execution

Decision Support

Application Programs

Mediator

KnowledgeBases

Databases Computer Programs

The Web

4

Information Mediator• Virtual Integration Architecture:

– Virtual organization: community of data providers and consumers that want to share data for specific purpose

– Autonomous sources: data, control remains at sources; no change to access methods, schemas; data accessed real-time in response to user queries

– Mediator: integrator defines domain schema and describes source contents• Domain schema: agreed upon view of the domain preferred by the virtual

organization• Source descriptions: logical formulas relating source and domain schemas

• Easy to add new sources• Query Answering

– User writes query in domain schema– Mediator:

• Determines sources relevant to user query• Rewrites query in sources schemas• Breaks query into sub-queries for sources• Optimizes query evaluation plan• Combines answers from sources

– Efficient query evaluation• Streaming dataflow

Mediator

DomainSchema

User queries

Reformulation

Optimizer

Execution Engine

DataSource

Data Source Data

Source

Wrapper WrapperSources schemas

Logical SourceDescriptions

5

BIRN Grid-based Virtual Data Integration Architecture

Relational DBEx: HID

XML DBEx: eXist-db

Web Portal

BIRN Gateway

BIRN Gateway BIRN Gateway BIRN Gateway

OGSA-DQP/DAI

Internet

Internal to Organization

Internet

Internal to Organization

Grid Security Infrastructure(TLS + PKI)

OGSA-DAI OGSA-DAI OGSA-DAI

ISI-Mediator

Web ServiceEx: XNAT

Logical SourceDescriptions

Reconcile Semantics/

Query RewritingClient Program

QueryOptimization/

Execution

SourceWrappers

Security: Encryption,Authentication, Authorization,

HID@MRN

6

FBIRN Data Integration Use Case:

HID and XNAT

HID@UCI

Human Imaging Database(s)Oracle DB

XNAT

EXtensible Neuroimaging Archive Toolkit Web service API

BIRN MediatorSQL query XML

query

User query: find all male

patients over 50 with t1 scans

Results integratedfrom XNAT and HID

HIDresults

XNAT results(XML)

Domain query Integrated resultsLogical Source

descriptions

HID@MRN

7

FBIRN Data Integration Use Case:

HID and XNAT

HID@UCI

Human Imaging Database(s)Oracle DB

XNAT

EXtensible Neuroimaging Archive Toolkit Web service API

BIRN MediatorSQL query XML

query

User query: find all male

patients over 50 with t1 scans

Results integratedfrom XNAT and HID

HIDresults

XNAT results(XML)

Domain query Integrated resultsLogical Source

descriptions

ECG_Mesa(MySQL DB)

CardioVascular Research Grid

BIRN Mediator

Integrated results

Logical Sourcedescriptions

ChesnokovAnalysis(eXistDB

XML DBMS)

Image Metadatadcm4che PACS

(MySQL DB)

WaveformDB(eXistDB

XML DBMS)

DICOM Image Files(file system)

Waveform Files(file system)

Domain query

Same BIRN mediatorJust plug in CVRG sourcedescriptions

and additional wrapper for eXistDB (XML/XQuery database)

ECG_Mesa(MySQL DB)

CardioVascular Research Grid

BIRN Mediator

Integrated results

Logical Sourcedescriptions

ChesnokovAnalysis(eXistDB

XML DBMS)

Image Metadatadcm4che PACS

(MySQL DB)

WaveformDB(eXistDB

XML DBMS)

DICOM Image Files(file system)

Waveform Files(file system)

Domain query

Use mediator to identify subjectsand files of interest

10

BIRN NHPRC Data Integration Use Case

• Provide data integration infrastructure for NHPRC:– Colony management, genetics, pathology, …

• BIRN NHPRC Activities: – BIRN/ISI demonstrated Colony Management integration

prototype– BIRN/ISI released data integration system to NHPRC team – NHPRC team developing DNA Banking application using

mediator– Deployment to NPRC Centers in progress

BIRN and Biomedical Ontologies

Ontology development challenges: • Modeling complex domains is challenging and requires

specialized expertise • The community of ontology development efforts is

large and somewhat daunting to navigate

Our goal: to provide an ontology development process that

• leverages existing ontology development• creates effective dialog between domain users and

biomedical ontologists • informs and documents the design of domain models

for integration • publishes curated domain ontologies to the community

Domain/Ontology Engineering Strategy

Variables and Data Values are the basis of scientific assertions

e.g., ‘CDNF protects nigral dopaminergic neurons in-vivo’

This statistically-significant effect is the experimental basis for the findings of this study.

Our ontology engineering approach is based on experimental variables

from Lindholm, P. et al. (2007), Nature, 448(7149): p. 73-7

Ontology of Experimental Variables and Values (‘OoEVV’)• Capture semantics of experimental variables

– Based on Measurement Scales (nominal, ordinal, interval, ratio, etc.1)

• Publish to standardized ontology repositories (NCBO, OBO Foundry etc.) – Expressed in RDF / OWL.

• Serve as a focal point of interaction with other ontology curation activities– i.e., Ontology of Biomedical Investigations

(OBI, http://obi-ontology.org/)

1 Stevens, S. S. (1946). "On the theory of scales of measurement." Science 103(2684): 677-680.

OoEVV Basic Design

Characteristic

MeasurementScale

Measurement Value

measuresvalue

usesMeasurementScale

onMeasurementScale

Variable

External Ontologies

Example: Simple (‘nominal’) Handedness

Handedness

Characteristic

BirnLexOntology: birnlex_2178

Variable

Handedness Variable

Nominal Handedness

Variable

MeasurementValue

Nominal Measurement

Value

Nominal Handedness

Value

lefthanded

ambidextrous

righthanded

Measurement Scale

Nominal Measurement

Scale

Nominal Handedness

Scale

Example: Edinburgh Handedness

Inventory

Handedness

Characteristic

BirnLexOntology: birnlex_2178

Variable

Handedness Variable

Edinburgh Handedness

Variable

MeasurementValue

Ordinal Measurement

Value

Edinburgh Handedness

Value

float values[-100.0, +100.0]

Measurement Scale

Ordinal Measurement

Scale

Edinburgh Handedness

Inventory

OoEVV elements for FBIRN

• FBIRN (HID/XNAT) domain model – 193 attributes 83 OoEVV variables – Mainly clinical assessments – e.g., Structured Clinical Interview for

DSM Disorders ‘SCID’, Mini-Mental State Examination, Clinical Dementia Rating, etc.)

Integrated Querying Across Disparate Data Sources

• General information integration infrastructure– Mediators

• bridge semantics across data sources• provide integrated data for analysis and visualization

– Domain model development and curation process• Balance bottom-up/top-down domain/ontology

development and reuse– Security and user data access control built-in

• Approach– Engage research communities – User-lead integration: NHPRC, FBIRN– Build applications incrementally– Enhance capabilities while providing useful tools

top related