Integrated support for data integration and science portals

Amarnath Gupta

University of California San Diego

2 ISSGC06 – Ischia, Italy

Overview

• We will first
  – Discuss what “cyberinfrastructure” for science means
  – Situate the business of “data integration” within the cyberinfrastructure setting
• Then we will briefly describe a few cyberinfrastructure projects in different science disciplines
  – Biomedical sciences, geosciences, environmental sciences, marine biology, physical oceanography …
• We will examine some dimensions of the data integration problem
  – Discuss how they are approached in different projects from a CS/data management perspective
• Discuss common and complementary themes across these approaches


Cyberinfrastructure

National Science Foundation’s Cyberinfrastructure

• Cyberinfrastructure is the organized aggregate of technologies enabling access and coordination of information technology resources to facilitate science, engineering, and societal goals.
  – Data access from distributed systems
  – Data inter-operability and assimilation
  – Computation: grid based and workflows
  – Visualization
  – Tools
  – Information Integration: highlighted today

NSF Blue Ribbon Panel (Atkins) Report provided a compelling and comprehensive vision of an integrated Cyberinfrastructure

Modified from Berman, SDSC, 2005

CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005

Source: Mark Ellisman


Source: Mark Ellisman


We are here:
(a) Making more general-purpose data integration infrastructure over distributed resources
(b) Extending to accommodate various scientific applications with stored and streaming data

Source: Mark Ellisman


GEONgrid Software Layers

[Layer diagram]
• Portal (login, myGEON): Registration, GEONsearch, GEONworkbench, Modeling Environment
• GEON Space
• Services: Registration Services, Data Integration Services, Indexing Services, Workflow Services, Visualization & Mapping Services
• Core Grid Services: GT3, OGSA-DAI, GSI, CAS, gridFTP, SRB, PostGIS, mySQL, DB2
• Physical Grid: RedHat Linux, ROCKS, Internet, I2, OptIPuter (planned)


BIRN: Major System Components

Identity/Login Management

Authorization and Role Definition

Computation/Analysis Facilities

Distributed Data File Management

Distributed Data Collections Mgmnt

Domain Application Tools

Data Integration Mechanisms

Complete Workflows

Collaborating Groups of Biomedical Researchers

Application Portal

Command/Batch Access

Integrated SW Distribution

Overall Operations

Registered BIRN Data


BIRN: Specific Implementations

GSI-Based. GAMA + MyProxy

SRB for Access Control to Data

e.g., AFNI, Air, 3DSlicer, LONI, ..

BIRN Data Integration Suite

Condor, Globus: Local clusters + Teragrid

AFS (file system)

Storage Resource Broker (SRB)

Pegasus, Kepler, Loni Pipeline, etc.

Mouse, Function, Morphometry (+ New Areas and Users )

BIRN Portal

Command/Batch Access

Semi-Annual BIRN SW Distribution

BIRN-CC

Registered BIRN Data


The OntoGrid View

[Architecture diagram; layers labeled Applications, Core Services and External Services. Components as labeled in the diagram:]
• Third-party tools: Utopia, Haystack, LSID Launchpad
• Web portals
• Taverna e-Science workbench
• myGrid information model
• Service & workflow discovery: Feta semantic discovery, GRIMOIRES federated UDDI+ registry
• Workflow enactment: Freefluo workflow engine
• Metadata Management: KAVE metadata store, KAVE provenance capture, myGrid ontology
• Data Management: mIR myGrid information repository, LSID support
• e-Science coordination: e-Science mediator, e-Science process patterns, e-Science events
• Notification service
• Pedro semantic publication
• Web Service (Grid Service) communication fabric
• Soaplab, Gowlab, AMBIT text extraction service
• OGSA-DAI DQP service, OGSA-DAI databases
• Legacy applications, Web Services, Web Sites, Java applications, Executable codes with an IDL

Courtesy: Carole Goble


A Word about Data in Science

Excerpts from a Report by NSF’s Office of Cyberinfrastructure

• Data. … data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.

• Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, inter-relationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.

• Ontology. An ontology is the systematic description of a given phenomenon, often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse.


What is data integration?

• For applications where there are a number of data sources (recall previous slide)
  – Geographically distributed
  – Having data on different platforms
  – (Maybe) on systems with different query capabilities (e.g., different DBMSs, files, spreadsheets)
    • Perhaps even having different data models
  – Having different schemas
  – BUT about one common, general theme
• One may want to construct
  – A general-purpose information system such that
    • All these data sources can be co-accessed as if they belong to a single data source
    • It can produce “combined information objects” on demand for ad hoc queries to facilitate problem-specific analyses performed through other software products (workflows, atlases, statistical packages …)
• Data integration refers to a body of techniques to produce such an information system


Data Integration vis-à-vis Data Grid

• A different aspect of data management

Storage Resource Transparency

Storage Location Transparency

E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where...

Data Identifier Transparency

image_0.jpg … image_100.jpg

Data Replica Transparency

image.sql, image.cgi, image.wsdl

Virtual Data Transparency

Semantic Data Organization (with behavior): patientRecordsCollection, myActiveNeuroCollection

Inter-organizational Information

Storage Management

Courtesy: Reagan Moore and Arun Jagatheesan


Data Integration in Science Starts with Science Questions

• GeoScience (GEON)
  – What is the geologic and geophysical record of Super-Continent assembly and dispersal?
  – What are the architectures of terrain boundaries at depth?
  – How do composition, temperature and strain fabrics vary within the lithosphere and asthenosphere? Are lithospheric and asthenospheric strain coupled?
• Neuroscience (BIRN)
  – Find volumetric data/metadata from MRIs of humans with specific diagnosis(es)
    • Which structures are decreased/increased in size relative to normal controls
    • Which structures show structural differences across a variety of diagnoses
  – Given a structure which shows structural differences
    • Which other structures are associated with it
    • Do any of these associated structures show structural differences
    • Do these other changed structures have commonalities (i.e. cell types, neurotransmitters, other afferent/efferent connections)
• Environmental Science (PAKT, CAMERA)
  – Explain biodiversity by correlating distribution of a taxonomic group with spatial (temporal) distribution of temperature, dissolved oxygen, salinity.
  – What accounts for large-scale genetic variation in microbial genomes that share a very recent common ancestry among coral reef habitats?

DATA NEEDED TO ADDRESS THESE QUESTIONS ARE DISTRIBUTED ACROSS THE WORLD


A Science Question can be Complex

Adapted from D.Seber, SDSC

Q1. What is the geologic and geophysical record of Super-Continent assembly and dispersal?

Needs complex integration of geophysical data with those associated with sub-crustal lithosphere ages, its composition and physical properties (seismic, thermal, etc.), surface geology and associated events chronology

A.K.Sinha, Virginia Tech, 2005


Converting Questions to Queries

CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005


(Some) Dimensions of Information Integration in Cyberinfrastructure Projects

• Source Information Model
• Integration Engine’s Information Model
  – Specification of semantic correspondences across sources
  – The 3-party power play among “global schema”, “local schema”, “ontology”
• Query paradigms over integrated data
• The mechanics of
  – query planning
  – query execution


About Semantic Correspondences

• The general problem
  – For any data integration across multiple sources there needs to be a way to
    • Specify how two objects from different data sources may correspond
    • Specify how the “joining” of these two objects would create a composite data object
• What’s the big deal?
  – Identical objects versus equivalent objects
  – Complete objects versus partial objects
  – Multi-scale representations of the same object
  – Handling definitional differences
  – Taking into account natural variability
  – Contextual correspondence

Are these always specifiable through ontological standards like OWL?
Do we need to have “correspondence checking” services?
Listen to Oscar and Carol’s session tomorrow for a different angle


About the 3-party Power Play

• While we want to create a single (cyber-)infrastructure with a data integration component, different applications have different integration scenarios
  – Is there a single global schema?
  – Do new applications (and hence global schemata) get added all the time over existing sources and ontologies?
  – Are the sources fixed? Do new sources get added all the time? Do sources come and go?
    • Are sources added dynamically as “data sets” that users want to integrate “on the fly”?
  – Do local schemata come with their own ontologies? Is there a global ontology that all local ontologies must map to?
  – How does the global schema (if one exists) relate to the global and local ontologies?
  – Do new (or modified) ontologies get added all the time?
  – Do the local schemata evolve all the time?

Is there a general way to manage this?

Do we need to architect any cyberinfrastructure components differently?


Source Information Models

• BIRN
  – Data Sources
    • Relational DBMS
      – Standard data types
      – Semantic data types (attribute-domain references to ontologies)
    • Some data and computation sources expose a set of functions
    • Key constraints
  – Ontology Sources
    • Simplifying assumptions
      – Ontologies can be approximated by edge-labeled directed graphs stored in relational systems
      – Graph traversal functions can be mimicked as database functions
    • BONFIRE
      – Glue ontology for simple inter-ontology mappings and extensions
  – Image and Spatial Data Sources
    • Discussed later
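The simplifying assumption about ontology sources can be sketched as follows. This is only an illustration, not BIRN's actual schema: the `edge(src, label, dst)` table, the anatomy terms and the `descendants` helper are all hypothetical, and SQLite stands in for the relational system. The graph traversal is expressed as a recursive SQL query, which is one way a traversal function could be mimicked as a database function.

```python
import sqlite3


def build_demo():
    """Store a tiny edge-labeled directed graph in one relational table (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
    conn.executemany(
        "INSERT INTO edge VALUES (?, ?, ?)",
        [
            ("hippocampus", "part_of", "limbic_system"),
            ("CA1", "part_of", "hippocampus"),
            ("dentate_gyrus", "part_of", "hippocampus"),
            ("limbic_system", "part_of", "brain"),
            ("pyramidal_cell", "located_in", "CA1"),
        ],
    )
    return conn


def descendants(conn, root, label):
    """Return every term that reaches `root` through one or more `label` edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(term) AS (
            SELECT src FROM edge WHERE dst = ? AND label = ?
            UNION
            SELECT e.src FROM edge e JOIN reach r ON e.dst = r.term
            WHERE e.label = ?
        )
        SELECT term FROM reach ORDER BY term
        """,
        (root, label, label),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `descendants(build_demo(), "limbic_system", "part_of")` follows only `part_of` edges, so `pyramidal_cell` (linked by `located_in`) is not returned.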


Source Information Models

• GEON
  – Data Sources
    • Assumption: all data are in GEONSpace
    • Items and Item details
    • Any relational JDBC data source (e.g., Excel files) is admitted
    • Standard relational data types, shapefiles for spatial data
    • Semantic data types by connecting to ontology
  – Ontology Sources
    • Any OWL-specified ontology
  – Registration in GEON
    • Level 1: Federation Based Integration
      » Users should know the component database schemata
    • Level 2: View Based Integration
      » Same as in BIRN
    • Level 3: Ontology Based Integration
      » Preferred Method


Source Information Models

• PAKT (marine biogeography)
  – Data Sources
    • Relational
    • Spatial (vectors) supported by GIS and Spatial DBMS
    • Spatial (raster – continuously partitionable arrays)
      – ArcGIS (map algebra)
      – Nested, non-aligned, multiple resolution
    • Spatially-indexed time series
    • Function-exposing sources (WSDL)
      – Parameter and result data types are interpretable or BLOBs
  – Ontology Sources
    • Any ontology specified in a subset of OWL
    • Any DAG-structured data source


Source Information Models

• CAMERA
  – PAKT ++
  – Data sources that export annotated sequences as a base data type
  – Phylogenetic trees
  – XML repositories with XPath/XQuery Processor
  – RDBMS with XML processing capabilities
  – Graphs such as molecular interaction networks (e.g., biological pathways), chemical reaction networks …


Integration Engine’s Information Model

• BIRN
  – Sources from the mediator’s view
    • Base relations may have binding patterns
    • Distinction between data and metadata is not strictly observed
      – SRB metadata catalog is treated as a relational source with some special functions
    • Files are accessed by reference to data-grid URIs (SRB ids)
  – Integration Model
    • Essentially Global-as-view (GAV) mediation
    • “Semantic” aspect of the mediation executed through opaque functions over ontology sources
    • Key constraints not used during standard query processing but are used for keyword queries
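A minimal sketch of what global-as-view (GAV) mediation means in general: each relation of the global schema is defined as a view over the source relations, and a query on the global relation is answered by unfolding that definition. The source names, attributes and rows below are invented for illustration and do not come from BIRN.

```python
# Hypothetical source relations; names and rows are illustrative only.
SOURCES = {
    "siteA.subject": [{"id": "a1", "dx": "AD"}, {"id": "a2", "dx": "control"}],
    "siteB.patient": [{"pid": "b7", "diagnosis": "AD"}],
}


def global_subject(sources):
    """GAV definition: the global relation Subject is a view over both sources."""
    for row in sources["siteA.subject"]:
        yield {"subject_id": row["id"], "diagnosis": row["dx"], "site": "A"}
    for row in sources["siteB.patient"]:
        yield {"subject_id": row["pid"], "diagnosis": row["diagnosis"], "site": "B"}


# The global schema maps each global relation name to its view definition.
GLOBAL_SCHEMA = {"Subject": global_subject}


def query(relation, predicate, sources=SOURCES):
    """Answer a selection over a global relation by unfolding its view definition."""
    view = GLOBAL_SCHEMA[relation]
    return [row for row in view(sources) if predicate(row)]
```

For example, `query("Subject", lambda r: r["diagnosis"] == "AD")` returns one row from each site without the caller knowing either source schema. A real mediator would push the selection down to the sources instead of filtering after the fact.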


Integration Engine’s Information Model

• BIRN (contd.)
  – The 3-party power-play
    • Many integrated views used by several global schemata on a relatively fixed set of sources
    • Ontologies are used in two ways
      – A global view may be defined using ontology functions
      – Keyword queries use simple ontological relationships
    • Some terms in the global schema are mapped to ontologies through semantic typing
      – Otherwise the global schema and integrated views are independent of the ontology
    • Some data are warped to a common atlas coordinate system to enable atlas queries
      – Atlas mapping ≡ spatial annotation


Integration Engine’s Information Model

• BIRN integration architecture

[Architecture diagram components: Query Client, Atlas Client, Onto Client; Mediator; Registry; Spatial Registry; Ontological Query Processor; Atlas Query Processor; OTIS; Data Grid Access; Wrapper Access]

  – Gateway
    • Has XML API for source registration, source schema update
    • Has XML API for queries
    • Can be accessed as web service
  – Registry
    • API-based access to schema elements and view definitions
    • Implemented over MySQL for portability
    • Spatial registry for image data
  – Planner and Executor
    • Described later
  – Wrappers
    • Local and remote
  – OTIS
    • Inverted index for ontological terms


BIRN Tool: Source Registration


Integration Engine’s Information Model

• GEON
  – Sources from the Integration Engine’s Viewpoint
    • Metadata (Item-level information) maintained in a GEON standard called ADN (Alexandria-DLESE-NASA)
    • Item-detail level information is either any relationalizable data or shapefiles
    • Any WMS, WFS service is a valid source for map information management
    • Does not permit an external ontology source; all ontologies have to be defined in the GEON framework
  – Integration Model
    • Every source schema is registered to an ontology


Integration Engine’s Information Model

• 3-party power play
  – Several global schemata can be defined
  – A global schema IS the OWL-DL compliant ontology
  – A couple of consequences
    • All transitive closure information is pre-computed after registration
    • If a concept class has key constraints, subsumption is NEXPTIME-hard, and undecidable if the key constraint has a complex domain
      – Does not matter much in practice because subsumption is hardly ever computed
• Pragmatics
  – As new sources join, or new applications are attempted, the ontology needs to evolve


GEON Data Registration

• Click on Submission to register a dataset
• Input a data set name
• Select a zipped shapefile
• Choose an ontology class

CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005


SiO2 is an instance of class AnalyticalOxideConcentration and has all information about the element Si

Planetary Material Ontology

CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005

Registration of Item Detail


ODAL (Ontological Database Annotation Language)

<odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>

[Diagram: the GUI generates ODAL, which is passed to the ODAL processor]

The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample

• Create a partial model of ontologies from database
• Independent of any GUI
• Independent of any concrete implementation
• Reusable


ODAL: Import Ontologies

The ontologies used for annotating a database can be imported as follows:

<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:odal="http://www.sdsc.edu/odal#">
  <odal:Ontology>
    <odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
    <odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
  </odal:Ontology>
  ……
</odal:ODAL>


ODAL: Database Connection Declaration

The target database for the annotation is declared as follows:

<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:odal="http://www.sdsc.edu/odal#">
  ……
  <odal:Database odal:id="PublicationDatabase">
    <odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
    <odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
    <odal:Host>oracle.sdsc.edu</odal:Host>
    <odal:Port>3456</odal:Port>
    <odal:DatabaseName>Publications</odal:DatabaseName>
  </odal:Database>
  ……
</odal:ODAL>


ODAL: Simple Named Individuals

<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>

Suppose the book ontology contains a class Book and the schema Collections contains a table book-price with a column ISBN.

odal:id gives a name to the declaration, and represents the set of the individuals generated by the statement.

The statement says that each value in the column ISBN represents a book individual.


ODAL: The Names of Individuals

<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>

Column ISBN, value 0817313478

Individual name: (BookInTableBookPrice, PublicationDatabase.Collections.book-price.ISBN:0817313478)
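The naming scheme shown for the ISBN example can be reproduced with a short sketch. It assumes the name is the pair (declaration id, Database.Schema.Table.Column:value), as in that one example; the `individual_names` helper is hypothetical and is not part of any ODAL processor. The namespace URI is the one declared on the ODAL slides.

```python
import xml.etree.ElementTree as ET

ODAL_NS = "http://www.sdsc.edu/odal#"  # namespace URI as declared on the ODAL slides

SNIPPET = f"""
<odal:NamedIndividuals xmlns:odal="{ODAL_NS}"
    odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
"""


def qname(tag):
    """Expand a local name into ElementTree's {namespace}tag form."""
    return f"{{{ODAL_NS}}}{tag}"


def individual_names(xml_text, column_values):
    """Pair the declaration id with a fully qualified column path for each value."""
    root = ET.fromstring(xml_text)
    declaration_id = root.get(qname("id"))
    path = ".".join(
        [
            root.get(qname("database")),
            root.findtext(qname("Schema")),
            root.findtext(qname("Table")),
            root.findtext(qname("Column")),
        ]
    )
    return [(declaration_id, f"{path}:{value}") for value in column_values]
```

Passing the single value `"0817313478"` yields the same pair as the slide. Declarations with several tables or columns would need a richer naming rule than this sketch covers.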


ODAL: Named Individuals from Multiple Columns

<odal:NamedIndividuals odal:id="LocationInTableRockSample">
  <odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
  <odal:Schema>California</odal:Schema>
  <odal:Table>Rock-Sample</odal:Table>
  <odal:Column>Latitude</odal:Column>
  <odal:Column>Longitude</odal:Column>
</odal:NamedIndividuals>

Suppose an ontology contains a class Location and a database contains a table Rock-Sample with two columns, Latitude and Longitude.

The statement says that each (latitude, longitude) pair gives a location.


ODAL: Named Individuals with Conditions

<odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>

<odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#FemaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
</odal:NamedIndividuals>

A condition in an odal:Condition element should be a Boolean expression that is valid in the WHERE clause of an SQL query
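One way to read the condition rule is that the annotation's table, column and condition can be turned directly into a SELECT statement. The sketch below assumes that reading; `individuals_sql` and `classify_employees` are hypothetical helpers, the three sample rows are invented, and SQLite stands in for the annotated database.

```python
import sqlite3


def individuals_sql(table, column, condition=None):
    """Build the SELECT that enumerates the individuals of one declaration.

    The condition is pasted in verbatim, so it must come from a trusted annotation file.
    """
    sql = f'SELECT "{column}" FROM "{table}"'
    if condition:
        sql += f" WHERE {condition}"
    return sql + f' ORDER BY "{column}"'


def classify_employees():
    """Apply the two conditions from the slide to a small in-memory employee table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (EmployeeId INTEGER, Gender TEXT)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)", [(1, "M"), (2, "F"), (3, "M")]
    )
    declarations = {"MaleEmployee": "Gender='M'", "FemaleEmployee": "Gender='F'"}
    return {
        cls: [row[0] for row in conn.execute(individuals_sql("employee", "EmployeeId", cond))]
        for cls, cond in declarations.items()
    }
```

Each declaration selects a disjoint subset of the same column, so one table populates two ontology classes.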


ODAL: Data Type Property Declaration

<odal:NamedIndividuals odal:id="PersonInTablePerson">
  <odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
  <odal:Table>Person</odal:Table>
  <odal:Column>ssn</odal:Column>
</odal:NamedIndividuals>

<odal:OntologyProperty>
  <odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
  <odal:Table>person</odal:Table>
  <odal:Domain odal:resource="PersonInTablePerson" />
  <odal:Range odal:resource="age" />
</odal:OntologyProperty>

[Diagram: table Person with columns … | age | SSN | … and sample row … | 8 | 123-56-7890 | …; in the ontology, Person –hasAge→ posInt]


Conditions for Joining Individuals from Different Resources

• Usually we do not join individuals across different resources
• A set of datatype properties can be declared as a key for a class in the ontology. We do join across multiple resources based on keys.
  – e.g. {hasLatitude, hasLongitude} can be declared as a key of Location
  – Two locations from different resources are the same if they have the same latitude and longitude

[Example: the value 10001 appears under RockSampleID in one resource and under RockID in another (table Rock)]

We don’t know whether 10001 represents the same rock in the two resources. By default, we assume it does not.
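The key-based join rule can be sketched as follows, assuming individuals are plain dictionaries of property values. The function name and sample data are illustrative, not the GEON mediator's API. With no declared key the function returns no matches, which mirrors the default assumption that equal identifiers in different resources are unrelated.

```python
def join_on_key(left, right, key):
    """Pair individuals from two resources that agree on every property in `key`.

    With an empty key nothing is joined: equal-looking identifiers are not trusted.
    """
    if not key:
        return []
    index = {}
    for r in right:
        if all(p in r for p in key):
            index.setdefault(tuple(r[p] for p in key), []).append(r)
    pairs = []
    for l in left:
        if all(p in l for p in key):
            for r in index.get(tuple(l[p] for p in key), []):
                pairs.append((l, r))
    return pairs
```

With `key=("hasLatitude", "hasLongitude")`, two Location individuals match only when both coordinates are equal. Two rocks that merely share the value 10001 under differently named identifier columns are never paired.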


The Architecture of GEON Semantic Mediator

[Architecture diagram components]
• Portal or Application
• GUI
• Mediator JDBC Driver
• SOQL: SOQL Parser, Semantic Query Rewriter, Ontology Reasoner, SOQL Processor
• OWL, ODAL, ODAL Processor
• Spatial SQL against federated schemas: SQL Parser, Query Planning, Query Optimization, Query Execution
• Internal Database
• Sources: Oracle, DB2, MySQL, SQL Server, PostgreSQL, PostGIS


The Map Integration Architecture


Map Integration

Snapshot after querying “Paleozoic”


Integration Engine’s Information Model

• PAKT (briefly)
  – Type extensibility of the mediator
    • Nested relational query language extended by tree and a restricted set of graph pattern operations
    • Construction operations important
    • Passive extensibility
      – Source more powerful than the mediator
      – Source exports a set of type-based optimization rules to the mediator
    • Active extensibility
      – Mediator extends its set of interpreted types
  – Ontology management
    • Ontological queries processed by a separate co-processor that interoperates with mediator
    • Query planner partitions the query into ontological and mediated query processors


Query Paradigms

• What are the different kinds of queries scientists and applications pose to an integrated system?
  – Metadata-based file access
    • 21,038 raw image files per subject
    • 2.4 GB of raw image data per subject
    • 25 GB to 40 GB of processed image data per subject
    • 10 million slices of functional imaging data in Phase II
    • 7 Terabytes of image data for all of the Phase II analyses (conservative estimate of 25 GB/subject)
  – Ontologically supported mediated queries
    • “Find most recent FMRI data of all patients with low scores in working memory tasks having volumetric changes of hippocampus over 10% in 2 years”
  – Keyword queries
    • FMRI “working memory task” hippocampus
  – Ontologically supported keyword queries
  – Associative searches

[Chart: BIRN Data Grid Usage, Oct-02 through Jun-06. Series: Total Number of Files (in thousands) and Total Size of Storage (in Gigabytes). Callouts: 16 million files, 16+ Terabytes]


GEON: SOQL (Simple Ontology Query Language)

Query single or integrated resources
• via ontologies (i.e., high level logical views)
• independent of any physical representation (i.e. schemas)

[Ontology fragment: RockSample –location→ Location (lat, long); RockSample –hasSiO2→ ValueWithUnit (value, unit); literal types float and string]

SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60 AND X.location.long > 100
  AND X.hasSiO2.value < 30 AND X.hasSiO2.unit = 'weightPercentage'

[Diagram: the GUI generates SOQL, which is passed to the SOQL processor]


Question: Find all seismic stations within 1 mile of railroads

SOQL query (from the GEON SOQL GUI):
SELECT X.code, X.location.* FROM SeismicStation X, Railroad Y WHERE distance(X.location, Y.geometry) < 1

SQL produced by the SOQL Processor:
SELECT X2.stationcode, X2.lat, X2.lon FROM railroads_of_the_united_states X1, stationdatatable X2 WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1

Sub-queries issued by the Schema Mediator:
• Railroad shapefile: SELECT X1.the_geom FROM railroads X1
• Seismic Stations: SELECT X2.stationcode, X2.lat, X2.lon FROM stationdatatable X2 WHERE bounding box condition
• Join condition: distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
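The role of the bounding box condition can be sketched in a few lines. The code assumes planar coordinates already expressed in miles and represents the railroad as a polyline. Both are simplifications, and none of the function names come from GEON. The box test is a cheap filter that can run at the station source; the exact distance test is still needed afterwards.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def stations_near(stations, railroad, limit):
    """Return codes of stations closer than `limit` to the railroad polyline."""
    xs = [x for x, _ in railroad]
    ys = [y for _, y in railroad]
    # Bounding box of the railroad, grown by `limit`: a necessary condition only.
    min_x, max_x = min(xs) - limit, max(xs) + limit
    min_y, max_y = min(ys) - limit, max(ys) + limit
    candidates = [
        (code, p) for code, p in stations
        if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y
    ]
    segments = list(zip(railroad, railroad[1:]))
    return [
        code for code, p in candidates
        if any(point_segment_distance(p, a, b) < limit for a, b in segments)
    ]
```

A station near a corner of the box can pass the box test and still fail the distance test, which is why the mediator keeps the `distance(...) < 1` predicate after the sources return their rows.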


BIRN: A Functional View of the Mediation Process

[Pipeline diagram; stages listed in the order extracted]

Planner:
• Query Expression (UCQ+ + Nesting + Grouping & Aggregate)
• View Unfolding
• Flattening of Nested Queries
• Normalization to DNF
• Predicate Reordering (binding patterns + maximal chunk)
• Maximal Feasible Plan
• Algebraic Plan
• Cost/Selectivity-based Optimization
• Pre-Executable Plan

Execution Engine:
• Pre-Executable Plan
• Executable Plan
• Execution Control
• Result Building
• Post-processing + aggregate
• Result Reporting


View Definition and Query Language

• Union of conjunctive queries
• May contain function terms
• Expressed in XML Datalog with aggregate functions
• Query q(X,F(Y)) :- r1(X,Z), r2(Z,Y), where F(Y) is an aggregate function applied to the set of Y values and X is the group-by variable
• Planner and Executor translate this to:
  – q’(X,Y) :- r1(X,Z), r2(Z,Y)
  – q(X,W) :- F(gb(q’(X,Y)))
  – where the group-by function “gb” with aggregate function F is pushed to the data source whenever possible, or otherwise evaluated at the mediator
• Query language allows nested queries: inner queries are assigned to intermediate variables that are used by the main query
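The two-step translation of an aggregate query can be mirrored in plain Python, assuming relations are lists of tuples. The function names follow the Datalog rule heads, the sample relations are invented, and the sketch evaluates everything at the mediator (no pushdown to a source).

```python
from collections import defaultdict


def q_prime(r1, r2):
    """q'(X,Y) :- r1(X,Z), r2(Z,Y): join the two relations on the shared variable Z."""
    ys_by_z = defaultdict(list)
    for z, y in r2:
        ys_by_z[z].append(y)
    return [(x, y) for x, z in r1 for y in ys_by_z.get(z, [])]


def q(r1, r2, aggregate):
    """q(X,W) :- F(gb(q'(X,Y))): group q' by X, then apply the aggregate F per group."""
    groups = defaultdict(list)
    for x, y in q_prime(r1, r2):
        groups[x].append(y)
    return {x: aggregate(ys) for x, ys in groups.items()}
```

For instance, with `r1` pairing subjects with visits and `r2` pairing visits with scores, `q(r1, r2, max)` gives the best score per subject. Passing `len` instead counts the joined rows per subject.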


BIRN: Mapping Relations

• Ontology-Map – maps data values from a source to a term of a known ontology (UMLS)

• Joinable – pairs attributes from different relations that can be joined

• Value-Map – maps a mediator-supported data value to the source-supported value (for example, gender stored as 0/1 at some source is male/female for the mediator)
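A value map can be sketched as a lookup table keyed by source and attribute. The source name `siteA` and the direction of the coding (0 for male, 1 for female) are assumptions made for this illustration; the slide does not say which code means which.

```python
# Hypothetical value map; which code means which is an assumption.
VALUE_MAPS = {
    ("siteA", "gender"): {"0": "male", "1": "female"},
}


def to_mediator(source, attribute, source_value):
    """Translate a value as stored at the source into the mediator's vocabulary."""
    mapping = VALUE_MAPS.get((source, attribute), {})
    return mapping.get(str(source_value), source_value)


def to_source(source, attribute, mediator_value):
    """Translate a mediator-level constant back into the source's encoding."""
    mapping = VALUE_MAPS.get((source, attribute), {})
    inverse = {mediator: stored for stored, mediator in mapping.items()}
    return inverse.get(mediator_value, mediator_value)
```

The inverse direction matters when a selection such as gender = female has to be pushed down to a source that stores 0/1. Values without a registered map pass through unchanged.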


Processing Ontological Queries

Courtesy: Vadim Astakhov


PAKT: Spatial and Taxonomic Queries


Example Queries

[Diagram: queries Q1–Q10 drawn over source groups labeled Geo-Spatial, Biological, Physiochemical and Habitat, with sources OBIS, WOA and Benth_Hab; one group is marked “extended”. In the original slide, italics mark inputs and underlines mark outputs; that formatting is lost here.]

Q1: Where is species X found? OBIS(scientific_name,lat,long)
Q2: For a given polygon, what species are found? OBIS(scientific_name,m_lat,m_long,m_lat,m_long)
Q3: Where is species X found given a certain physical parameter? OBIS(scientific_name,lat,long) WOA(physio,lat,long)
Q4: What are the aggregated physical properties of species X? OBIS(scientific_name,lat,long) WOA(physio,lat,long)
Q5: Where is habitat X found?
Q6: For a given polygon A, what habitats are found?
Q7: Where is habitat X found given a certain physical parameter?
Q8: What are the aggregated physical properties of habitat X?
Q9: What species can be found at habitat X?
Q10: In what habitats is species X found?

Relations shown for the habitat queries (Q5–Q10): CMECS(habitat,physio), BH(habitat_grp,shape), WOA(physio,lat,long), OBIS(scientific_name,lat,long), PolygonA


Frequent Query Patterns

• Example queries are joins of
  – Left query patterns: habitat-spatial, and
  – Right query patterns: spatial-environmental/species distribution

[Diagram, repeated patterns collapsed]
• Onto-module’s queries: CMECS(habitat,physio) BH(habitat_grp,shape)
• Mediator’s queries: BH(..,shape) WOA(physio,lat,long); BH(..,shape) OBIS(scientific_name,lat,long)
• Other labels: PolygonA( ), API


The Resource Management Aspect of Query Evaluation

• Primarily done by the Manchester group (Watson et al)

• Polar*
  – Based on OQL (internally monoid comprehension)
  – Multi-node planning
    • Plan partitioning
    • Exchange operator
      – Attribute sensitivity
      – Data & index repartitioning
    • Plan scheduling
  – Query execution

[Diagram: node 1: DBMS data → OGSA-DAI → DQP scan (A); node 2: DBMS data → OGSA-DAI → DQP scan (B); node 3: DQP join (A1,B1); node 4: DQP join (A2,B2); node 5: DQP reduce]

From Amy Krause
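The scan, exchange, join and reduce stages of such a partitioned plan can be imitated in one process. This is a sketch of hash-partitioned parallel join in general, under the assumption that rows are dictionaries and that each list of rows stands in for a node. It is not OGSA-DQP code, and the function names are made up.

```python
def exchange(rows, key, n_nodes):
    """Repartition rows by hash of the join attribute so matching rows meet on one node."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[hash(row[key]) % n_nodes].append(row)
    return parts


def hash_join(a_rows, b_rows, key):
    """Local equi-join executed by one join node."""
    index = {}
    for b in b_rows:
        index.setdefault(b[key], []).append(b)
    return [{**a, **b} for a in a_rows for b in index.get(a[key], [])]


def distributed_join(a_rows, b_rows, key, n_nodes=2):
    """scan(A), scan(B) -> exchange -> join(Ai, Bi) on each node -> reduce."""
    a_parts = exchange(a_rows, key, n_nodes)
    b_parts = exchange(b_rows, key, n_nodes)
    partial = [hash_join(a_parts[i], b_parts[i], key) for i in range(n_nodes)]
    # Reduce step: concatenate the partial results from all join nodes.
    return [row for part in partial for row in part]
```

Because both inputs are partitioned with the same hash function on the join attribute, no matching pair is split across nodes, so the union of the partial joins equals the full join.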


The Adaptivity Issue in DQP on a Grid

• Monitoring-Assessment-Response framework of adaptive query processing in a grid (by Gounaris)
  – Monitoring
    • A separate module that keeps track of information like
      – Has a resource (e.g., memory availability) changed more than 10%?
      – Has the data volume changed recently?
    • Occurs between operators or within an operator’s execution process
    • Other modules subscribe to this notification
  – Assessment
    • Diagnosis is carried out for suboptimal execution, resource shortage, resource idleness, unmet performance requirements, unmet user needs
  – Response
    • Operator replacement or rescheduling, machine rescheduling, plan re-optimization…
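The monitoring step alone can be sketched as a small publish/subscribe object. The class below is a hypothetical illustration of the 10% rule, not the framework's actual interface. It compares each reading with the last reported baseline and notifies subscribers only when the relative change exceeds the threshold.

```python
class Monitor:
    """Tracks named resource readings and notifies subscribers on large changes."""

    def __init__(self, threshold=0.10):
        self.threshold = threshold
        self.baseline = {}
        self.subscribers = []

    def subscribe(self, callback):
        """Register a callback taking (name, old_value, new_value)."""
        self.subscribers.append(callback)

    def observe(self, name, value):
        """Record a reading; return True if subscribers were notified."""
        old = self.baseline.get(name)
        if old is None:
            self.baseline[name] = value  # first reading becomes the baseline
            return False
        if old == 0:
            changed = value != 0
        else:
            changed = abs(value - old) / abs(old) > self.threshold
        if not changed:
            return False
        self.baseline[name] = value
        for callback in self.subscribers:
            callback(name, old, value)
        return True
```

An assessment module would subscribe a callback here and decide whether the change warrants a response such as rescheduling. Keeping the baseline fixed until a notification fires means slow drift is eventually reported too.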


Commonalities and Complementarities

• Common themes
  – Overall architectural similarity of cyberinfrastructure projects
    • Service orientation
  – The data integration task is part of a larger scientific computing, exploration and analysis process
    • Has impact on integration setting, design decisions and performance expectations
  – Mediation with semantic mapping and reasoning seems to be winning
• Complementary approaches
  – Details of the architecture
    • Relationship with workflows
  – Styles of mediation
  – Extensibility of mediator
  – Adaptivity of query planning and evaluation

Thank you!

Questions? Comments? Integrated Queries?