Integrated support for data integration and science portals
Amarnath Gupta
University of California San Diego
2 ISSGC06 – Ischia, Italy
Overview
• We will first
  – Discuss what “cyberinfrastructure” for science means
  – Situate the business of “data integration” within the cyberinfrastructure setting
• Then we will briefly describe a few cyberinfrastructure projects in different science disciplines
  – Biomedical sciences, geosciences, environmental sciences, marine biology, physical oceanography …
• We will examine some dimensions of the data integration problem
  – Discuss how they are approached in different projects from a CS/data management perspective
• Discuss common and complementary themes across these approaches
Cyberinfrastructure
National Science Foundation’s Cyberinfrastructure
• Cyberinfrastructure is the organized aggregate of technologies enabling access and coordination of information technology resources to facilitate science, engineering, and societal goals.
  – Data access from distributed systems
  – Data interoperability and assimilation
  – Computation: grid-based and workflows
  – Visualization
  – Tools
  – Information integration: highlighted today
NSF Blue Ribbon Panel (Atkins) Report provided a compelling and comprehensive vision of an integrated Cyberinfrastructure
Modified from Berman, SDSC, 2005
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005
We are here:
(a) Making more general-purpose data integration infrastructure over distributed resources
(b) Extending to accommodate various scientific applications with stored and streaming data
Source: Mark Ellisman
GEONgrid Software Layers
• Portal (login, myGEON): Registration, GEONsearch, GEONworkbench, Modeling Environment, GEON Space
• Registration Services, Data Integration Services, Indexing Services, Workflow Services, Visualization & Mapping Services
• Core Grid Services: GT3, OGSA-DAI, GSI, CAS, gridFTP, SRB, PostGIS, mySQL, DB2
• Physical Grid: RedHat Linux, ROCKS, Internet, I2, OptIPuter (planned)
BIRN: Major System Components
Identity/Login Management
Authorization and Role Definition
Computation/Analysis Facilities
Distributed Data File Management
Distributed Data Collections Mgmnt
Domain Application Tools
Data Integration Mechanisms
Complete Workflows
Collaborating Groups of Biomedical Researchers
Application Portal
Command/Batch Access
Integrated SW Distribution
Overall Operations
Registered BIRN Data
BIRN: Specific Implementations
GSI-Based. GAMA + MyProxy
SRB for Access Control to Data
e.g., AFNI, Air, 3DSlicer, LONI, ..
BIRN Data Integration Suite
Condor, Globus: Local clusters + Teragrid
AFS (file system)
Storage Resource Broker (SRB)
Pegasus, Kepler, Loni Pipeline, etc.
Mouse, Function, Morphometry (+ New Areas and Users )
BIRN Portal
Command/Batch Access
Semi-Annual BIRN SW Distribution
BIRN-CC
Registered BIRN Data
The OntoGrid View
Courtesy: Carole Goble
[Figure: myGrid architecture diagram. Recoverable labels follow.]
• Layers: Applications; Core Services; External Services
• Core service groups: Service & workflow discovery; Workflow enactment; Metadata management; Data management; e-Science coordination
• Components: third-party tools (Utopia, Haystack, LSID Launchpad); Web portals; Taverna e-Science workbench; Feta semantic discovery; GRIMOIRES federated UDDI+ registry; Freefluo workflow engine; KAVE metadata store; KAVE provenance capture; myGrid ontology; myGrid information model; mIR (myGrid information repository); LSID support; OGSA-DAI DQP service; e-Science mediator; e-Science process patterns; e-Science events; Notification service; Pedro semantic publication; Soaplab; Gowlab; AMBIT text extraction service
• External resources: legacy applications; Web Services; OGSA-DAI databases; Web sites; Java applications; executable codes with an IDL
• Web Service (Grid Service) communication fabric
A Word about Data in Science
Excerpts from a report by NSF’s Office of Cyberinfrastructure
• Data. … data are any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
• Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, inter-relationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.
• Ontology. An ontology is the systematic description of a given phenomenon, often includes a controlled vocabulary and relationships, captures nuances in meaning and enables knowledge sharing and reuse.
What is data integration?
• For applications where there are a number of data sources (recall previous slide)
  – Geographically distributed
  – Having data on different platforms
  – (Maybe) on systems with different query capabilities (e.g., different DBMSs, files, spreadsheets)
    • Perhaps even having different data models
  – Having different schemas
  – BUT about one common, general theme
• One may want to construct
  – A general-purpose information system such that
    • All these data sources can be co-accessed as if they belong to a single data source
    • It can produce “combined information objects” on demand for ad hoc queries to facilitate problem-specific analyses performed through other software products (workflows, atlases, statistical packages …)
• Data integration refers to a body of techniques to produce such an information system
Data Integration vis-à-vis Data Grid
• A different aspect of data management
[Figure: data grid transparency layers. Recoverable labels and examples follow.]
• Layer labels: Storage Resource Transparency; Storage Location Transparency; Data Identifier Transparency; Data Replica Transparency; Virtual Data Transparency; Semantic Data Organization (with behavior); Inter-organizational Information Storage Management
• Examples shown alongside the layers:
  – E:\srbVault\image.jpg, /users/srbVault/image.jpg, Select … from srb.mdas.td where...
  – image_0.jpg … image_100.jpg
  – image.sql, image.cgi, image.wsdl
  – patientRecordsCollection, myActiveNeuroCollection
Courtesy: Reagan Moore and Arun Jagatheesan
Data Integration in Science Starts with Science Questions
• GeoScience (GEON)
  – What is the geologic and geophysical record of Super-Continent assembly and dispersal?
  – What are the architectures of terrain boundaries at depth?
  – How do composition, temperature and strain fabrics vary within the lithosphere and asthenosphere? Are lithospheric and asthenospheric strain coupled?
• Neuroscience (BIRN)
  – Find volumetric data/metadata from MRIs of humans with specific diagnosis(es)
    • Which structures are decreased/increased in size relative to normal controls
    • Which structures show structural differences across a variety of diagnoses
  – Given a structure which shows structural differences
    • Which other structures are associated with it
    • Do any of these associated structures show structural differences
    • Do these other changed structures have commonalities (i.e., cell types, neurotransmitters, other afferent/efferent connections)
• Environmental Science (PAKT, CAMERA)
  – Explain biodiversity by correlating distribution of a taxonomic group with spatial (temporal) distribution of temperature, dissolved oxygen, salinity.
  – What accounts for large-scale genetic variation in microbial genomes that share a very recent common ancestry among coral reef habitats?
DATA NEEDED TO ADDRESS THESE QUESTIONS ARE DISTRIBUTED ACROSS THE WORLD
A Science Question can be Complex
Adapted from D.Seber, SDSC
Q1. What is the geologic and geophysical record of Super-Continent assembly and dispersal?
Needs complex integration of geophysical data with data on sub-crustal lithosphere ages, its composition and physical properties (seismic, thermal, etc.), surface geology, and the chronology of associated events
A.K.Sinha, Virginia Tech, 2005
Converting Questions to Queries
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005
(Some) Dimensions of Information Integration in Cyberinfrastructure Projects
• Source Information Model
• Integration Engine’s Information Model
  – Specification of semantic correspondences across sources
  – The 3-party power play among “global schema”, “local schema”, “ontology”
• Query paradigms over integrated data
• The mechanics of
  – query planning
  – query execution
About Semantic Correspondences
• The general problem
  – For any data integration across multiple sources there needs to be a way to
    • Specify how two objects from different data sources may correspond
    • Specify how the “joining” of these two objects would create a composite data object
• What’s the big deal?
  – Identical objects versus equivalent objects
  – Complete objects versus partial objects
  – Multi-scale representations of the same object
  – Handling definitional differences
  – Taking into account natural variability
  – Contextual correspondence
Are these always specifiable through ontological standards like OWL?
Do we need to have “correspondence checking” services?
Listen to Oscar and Carol’s session tomorrow for a different angle
About the 3-party Power Play
• While we want to create a single (cyber-)infrastructure with a data integration component, different applications have different integration scenarios
  – Is there a single global schema?
  – Do new applications (and hence global schemata) get added all the time over existing sources and ontologies?
  – Are the sources fixed? Do new sources get added all the time? Do sources come and go?
    • Are sources added dynamically as “data sets” that users want to integrate “on the fly”?
  – Do local schemata come with their own ontologies? Is there a global ontology that all local ontologies must map to?
  – How does the global schema (if one exists) relate to the global and local ontologies?
  – Do new (or modified) ontologies get added all the time?
  – Do the local schemata evolve all the time?
Is there a general way to manage this?
Do we need to architect any cyberinfrastructure components differently?
Source Information Models
• BIRN
  – Data Sources
    • Relational DBMS
      – Standard data types
      – Semantic data types (attribute-domain references to ontologies)
    • Some data and computation sources expose a set of functions
    • Key constraints
  – Ontology Sources
    • Simplifying assumptions
      – Ontologies can be approximated by edge-labeled directed graphs stored in relational systems
      – Graph traversal functions can be mimicked as database functions
    • BONFIRE
      – Glue ontology for simple inter-ontology mappings and extensions
  – Image and Spatial Data Sources
    • Discussed later
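A minimal sketch of the two simplifying assumptions above, using Python's built-in sqlite3 module: the ontology is stored as an edge table, and a graph traversal is expressed as a recursive SQL query wrapped in a function. The table name `edge`, the function `reachable_from`, and the anatomy terms are hypothetical illustrations, not BIRN's actual schema or functions.

```python
import sqlite3

def reachable_from(conn, root, label):
    """Return `root` plus every term that reaches it by following `label` edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(term) AS (
            SELECT ?
            UNION
            SELECT e.src FROM edge e JOIN sub s ON e.dst = s.term
            WHERE e.label = ?
        )
        SELECT term FROM sub
        """,
        (root, label),
    ).fetchall()
    return sorted(r[0] for r in rows)

# Ontology approximated as an edge-labeled directed graph in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, label TEXT, dst TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    ("hippocampus", "is_a", "brain_region"),   # hypothetical terms
    ("cerebellum", "is_a", "brain_region"),
    ("CA1", "part_of", "hippocampus"),
    ("dentate_gyrus", "part_of", "hippocampus"),
])

parts = reachable_from(conn, "hippocampus", "part_of")
```

A mediator could call such a function to expand a query term into its sub-terms before contacting the data sources.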
Source Information Models
• GEON
  – Data Sources
    • Assumption: all data are in GEONSpace
    • Items and item details
    • Any relational JDBC data source (e.g., Excel files) is admitted
    • Standard relational data types, shapefiles for spatial data
    • Semantic data types by connecting to ontology
  – Ontology Sources
    • Any OWL-specified ontology
  – Registration in GEON
    • Level 1: Federation-Based Integration
      » Users should know the component database schemata
    • Level 2: View-Based Integration
      » Same as in BIRN
    • Level 3: Ontology-Based Integration
      » Preferred method
Source Information Models
• PAKT (marine biogeography)
  – Data Sources
    • Relational
    • Spatial (vectors) supported by GIS and spatial DBMS
    • Spatial (raster – continuously partitionable arrays)
      – ArcGIS (map algebra)
      – Nested, non-aligned, multiple resolution
    • Spatially-indexed time series
    • Function-exposing sources (WSDL)
      – Parameter and result data types are interpretable or BLOBs
  – Ontology Sources
    • Any ontology specified in a subset of OWL
    • Any DAG-structured data source
Source Information Models
• CAMERA
  – PAKT ++
  – Data sources that export annotated sequences as a base data type
  – Phylogenetic trees
  – XML repositories with XPath/XQuery processor
  – RDBMS with XML processing capabilities
  – Graphs such as molecular interaction networks (e.g., biological pathways), chemical reaction networks …
Integration Engine’s Information Model
• BIRN
  – Sources from the mediator’s view
    • Base relations may have binding patterns
    • Distinction between data and metadata is not strictly observed
      – SRB metadata catalog is treated as a relational source with some special functions
    • Files are accessed by reference to data-grid URIs (SRB ids)
  – Integration Model
    • Essentially global-as-view (GAV) mediation
    • “Semantic” aspect of the mediation executed through opaque functions over ontology sources
    • Key constraints not used during standard query processing but are used for keyword queries
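In global-as-view mediation, each relation of the global schema is defined as a query over the source relations, so a query against the global schema is answered by unfolding that definition. A minimal Python sketch of the idea follows; the source names, attribute names and rows are hypothetical and do not come from BIRN.

```python
# Hypothetical source relations, as the mediator would see them through wrappers.
SOURCES = {
    "site_a.subjects": [{"id": "a1", "dx": "AD"}, {"id": "a2", "dx": "control"}],
    "site_b.patients": [{"pid": "b7", "diagnosis": "AD"}],
}

def subject_view(sources):
    """GAV definition: global relation Subject as a union of queries over sources."""
    for r in sources["site_a.subjects"]:
        yield {"subject": r["id"], "diagnosis": r["dx"], "site": "A"}
    for r in sources["site_b.patients"]:
        yield {"subject": r["pid"], "diagnosis": r["diagnosis"], "site": "B"}

# The global schema maps each global relation name to its view definition.
GLOBAL_SCHEMA = {"Subject": subject_view}

def query(relation, predicate):
    """Answer a selection over a global relation by unfolding its view definition."""
    return [t for t in GLOBAL_SCHEMA[relation](SOURCES) if predicate(t)]

ad_subjects = query("Subject", lambda t: t["diagnosis"] == "AD")
```

Here the selection is applied at the mediator after unfolding; a real planner would try to push it into the source queries.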
Integration Engine’s Information Model
• BIRN (contd.)
  – The 3-party power play
    • Many integrated views used by several global schemata on a relatively fixed set of sources
    • Ontologies are used in two ways
      – A global view may be defined using ontology functions
      – Keyword queries use simple ontological relationships
    • Some terms in the global schema are mapped to ontologies through semantic typing
      – Otherwise the global schema and integrated views are independent of the ontology
    • Some data are warped to a common atlas coordinate system to enable atlas queries
      – Atlas mapping ≡ spatial annotation
Integration Engine’s Information Model
• BIRN integration architecture
[Figure: architecture diagram. Recoverable component labels: Query Client, Atlas Client, Onto Client; Mediator; Registry; Spatial Registry; OTIS; Ontological Query Processor; Atlas Query Processor; Data Grid Access; Wrapper Access.]
  – Gateway
    • Has XML API for source registration, source schema update
    • Has XML API for queries
    • Can be accessed as a web service
  – Registry
    • API-based access to schema elements and view definitions
    • Implemented over MySQL for portability
    • Spatial registry for image data
  – Planner and Executor
    • Described later
  – Wrappers
    • Local and remote
  – OTIS
    • Inverted index for ontological terms
Integration Engine’s Information Model
• GEON
  – Sources from the Integration Engine’s Viewpoint
    • Metadata (item-level information) maintained in a GEON standard called ADN (ADEPT/DLESE/NASA)
    • Item-detail level information is either any relationalizable data or shapefiles
    • Any WMS or WFS service is a valid source for map information management
    • Does not permit an external ontology source; all ontologies have to be defined in the GEON framework
  – Integration Model
    • Every source schema is registered to an ontology
Integration Engine’s Information Model
• 3-party power play
  – Several global schemata can be defined
  – A global schema IS the OWL-DL compliant ontology
  – A couple of consequences
    • All transitive closure information is pre-computed after registration
    • If a concept class has key constraints, subsumption is NEXPTIME-hard, and undecidable if the key constraint has a complex domain
      – Does not matter much in practice because subsumption is rarely computed
• Pragmatics
  – As new sources join, or new applications are attempted, the ontology needs to evolve
GEON Data Registration
• Click on Submission to register a dataset
• Input a data set name
• Select a zipped shapefile
• Choose an ontology class
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005
SiO2 is an instance of class AnalyticalOxideConcentration and has all
information about the element Si
Planetary Material Ontology
CYBERINFRASTRUCTURE FOR THE GEOSCIENCES A.K.Sinha, Virginia Tech, 2005
Registration of Item Detail
ODAL (Ontological Database Annotation Language)
<odal:NamedIndividuals odal:id="RockSample" odal:database="VTDatabase">
  <odal:Class odal:resource="http://geon.vt.edu#RockSample" />
  <odal:Table>Samples</odal:Table>
  <odal:Table>RockTexture</odal:Table>
  <odal:Table>RockGeoChemistry</odal:Table>
  <odal:Table>ModalData</odal:Table>
  <odal:Table>MineralChemistry</odal:Table>
  <odal:Table>Images</odal:Table>
  <odal:Column>ssID</odal:Column>
</odal:NamedIndividuals>
[Figure: the GUI generates ODAL, which is passed to the ODAL processor]
The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry and Images represent instances of RockSample
• Creates a partial model of ontologies from a database
• Independent of any GUI
• Independent of any concrete implementation
• Reusable
ODAL: Import Ontologies
The ontologies used for annotating a database can be imported as follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:odal="http://www.sdsc.edu/odal#">
  <odal:Ontology>
    <odal:Imports rdf:resource="http://www.library.org/Book.owl"/>
    <odal:Imports rdf:resource="http://www.writer.org/Writer.owl"/>
  </odal:Ontology>
  ……
</odal:ODAL>
ODAL: Database Connection Declaration
The target database for the annotations is declared as follows:
<?xml version="1.0"?>
<odal:ODAL xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:odal="http://www.sdsc.edu/odal#">
  ……
  <odal:Database odal:id="PublicationDatabase">
    <odal:DatabaseProductName>Oracle</odal:DatabaseProductName>
    <odal:DatabaseProductVersion>9.1.21</odal:DatabaseProductVersion>
    <odal:Host>oracle.sdsc.edu</odal:Host>
    <odal:Port>3456</odal:Port>
    <odal:DatabaseName>Publications</odal:DatabaseName>
  </odal:Database>
  ……
</odal:ODAL>
ODAL: Simple Named Individuals
<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
Suppose the book ontology contains a class Book and the schema Collections contains a table book-price with a column ISBN.
odal:id gives a name to the declaration, and represents the set of individuals generated by the statement.
The statement says that each value in the column ISBN represents a Book individual.
ODAL: The Names of Individuals
<odal:NamedIndividuals odal:id="BookInTableBookPrice" odal:database="PublicationDatabase">
  <odal:Class odal:resource="http://www.amazon.com/Book.owl#Book"/>
  <odal:Schema>Collections</odal:Schema>
  <odal:Table>book-price</odal:Table>
  <odal:Column>ISBN</odal:Column>
</odal:NamedIndividuals>
Column ISBN of table book-price:
  0817313478
  …
Individual name: (BookInTableBookPrice, PublicationDatabase.Collections.book-price.ISBN:0817313478)
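The naming pattern in the example above can be sketched in a few lines of Python. The function `individual_names` is a hypothetical helper written for illustration; only the resulting name format is taken from the slide.

```python
def individual_names(decl_id, database, schema, table, column, values):
    """Build (declaration id, qualified value) pairs in the format shown on the slide."""
    prefix = f"{database}.{schema}.{table}.{column}"
    return [(decl_id, f"{prefix}:{v}") for v in values]

names = individual_names(
    "BookInTableBookPrice",   # odal:id of the declaration
    "PublicationDatabase",    # odal:database
    "Collections",            # odal:Schema
    "book-price",             # odal:Table
    "ISBN",                   # odal:Column
    ["0817313478"],           # values read from the column
)
```

Qualifying each value with database, schema, table and column keeps individuals from different resources distinct even when the raw values coincide.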
ODAL: Named Individuals from Multiple Columns
<odal:NamedIndividuals odal:id="LocationInTableRockSample">
  <odal:Class odal:resource="http://www.usgs.org/Space.owl#Location"/>
  <odal:Schema>California</odal:Schema>
  <odal:Table>Rock-Sample</odal:Table>
  <odal:Column>Latitude</odal:Column>
  <odal:Column>Longitude</odal:Column>
</odal:NamedIndividuals>
Suppose an ontology contains a class Location and a database contains a table Rock-Sample with two columns, Latitude and Longitude.
The statement says that a pair of latitude and longitude values gives a location.
ODAL: Named Individuals with Conditions
<odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
</odal:NamedIndividuals>
<odal:NamedIndividuals odal:id="FemaleEmployeeInTableEmployee">
  <odal:Class odal:resource="http://www.abc.com/Employee.owl#FemaleEmployee"/>
  <odal:Table>employee</odal:Table>
  <odal:Column>EmployeeId</odal:Column>
  <odal:Condition><![CDATA[ Gender='F' ]]></odal:Condition>
</odal:NamedIndividuals>
A condition in an odal:Condition element should be a Boolean expression that is valid in the WHERE clause of an SQL query.
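Because the condition is a plain SQL Boolean expression, a processor could splice it directly into a WHERE clause. The sketch below, using Python's standard xml.etree.ElementTree, shows one way this might look; the function `to_sql` and the generated SELECT shape are assumptions for illustration, not the actual behavior of the ODAL processor.

```python
import xml.etree.ElementTree as ET

ODAL_NS = "http://www.sdsc.edu/odal#"
NS = {"odal": ODAL_NS}

DOC = """<odal:ODAL xmlns:odal="http://www.sdsc.edu/odal#">
  <odal:NamedIndividuals odal:id="MaleEmployeeInTableEmployee">
    <odal:Class odal:resource="http://www.abc.com/Employee.owl#MaleEmployee"/>
    <odal:Table>employee</odal:Table>
    <odal:Column>EmployeeId</odal:Column>
    <odal:Condition><![CDATA[ Gender='M' ]]></odal:Condition>
  </odal:NamedIndividuals>
</odal:ODAL>"""

def to_sql(decl):
    """Hypothetical translation of one NamedIndividuals declaration to SQL."""
    table = decl.findtext("odal:Table", namespaces=NS).strip()
    cols = [c.text.strip() for c in decl.findall("odal:Column", NS)]
    cond = decl.findtext("odal:Condition", default="", namespaces=NS).strip()
    sql = f"SELECT DISTINCT {', '.join(cols)} FROM {table}"
    # The condition is used verbatim as the WHERE clause.
    return f"{sql} WHERE {cond}" if cond else sql

root = ET.fromstring(DOC)
queries = {
    d.get(f"{{{ODAL_NS}}}id"): to_sql(d)
    for d in root.findall("odal:NamedIndividuals", NS)
}
```

Splicing the condition verbatim is only safe when the annotation file is trusted, since nothing here validates the expression.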
ODAL: Data Type Property Declaration
<odal:NamedIndividuals odal:id="PersonInTablePerson">
  <odal:Class odal:resource="http://www.foo.org/Person.owl#Person"/>
  <odal:Table>Person</odal:Table>
  <odal:Column>ssn</odal:Column>
</odal:NamedIndividuals>
<odal:OntologyProperty>
  <odal:DatatypeProperty odal:resource="http://www.foo.org/Person.owl#hasAge"/>
  <odal:Table>Person</odal:Table>
  <odal:Domain odal:resource="PersonInTablePerson" />
  <odal:Range odal:resource="age" />
</odal:OntologyProperty>
[Figure: table Person with columns SSN and age (example row: 123-56-7890, 8); the property hasAge relates a Person individual to a posInt value]
Conditions for Joining Individuals from Different Resources
• Usually we do not join individuals across different resources
• A set of datatype properties can be declared as a key for a class in the ontology. We do join across multiple resources based on keys.
  – E.g., {hasLatitude, hasLongitude} can be declared as a key of Location. Two locations from different resources are the same if they have the same latitude and longitude.
[Figure: a column RockSampleID in one resource and a column RockID in another both contain the value 10001]
We don’t know whether 10001 represents the same rock in the two resources. By default, we assume it does not.
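A minimal Python sketch of a key-based join across two resources, assuming individuals are represented as dictionaries of datatype property values. The function `join_on_key` and the sample individuals are hypothetical; only the key {hasLatitude, hasLongitude} comes from the slide.

```python
def join_on_key(left, right, key):
    """Pair individuals from two resources whose key properties all agree."""
    index = {}
    for ind in right:
        index.setdefault(tuple(ind[p] for p in key), []).append(ind)
    return [(l, r) for l in left for r in index.get(tuple(l[p] for p in key), [])]

# Key declared for the ontology class Location.
LOCATION_KEY = ("hasLatitude", "hasLongitude")

resource_a = [{"name": "A:loc1", "hasLatitude": 37.2, "hasLongitude": -80.4}]
resource_b = [
    {"name": "B:site9", "hasLatitude": 37.2, "hasLongitude": -80.4},
    {"name": "B:site3", "hasLatitude": 40.0, "hasLongitude": -75.0},
]

same_location = join_on_key(resource_a, resource_b, LOCATION_KEY)
```

Classes without a declared key are simply never passed to such a join, which matches the default assumption that equal identifiers in different resources denote different individuals. The sketch compares coordinates for exact equality; real data would likely need a tolerance.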
The Architecture of GEON Semantic Mediator
[Figure: architecture diagram. Recoverable labels follow.]
• Clients: Portal or Application; GUI; Mediator JDBC Driver
• SOQL processing: SOQL Parser; Semantic Query Rewriter; Ontology Reasoner; SOQL Processor; ODAL Processor; OWL and ODAL inputs
• SQL processing: SQL Parser; Query Planning; Query Optimization; Query Execution; Internal Database
• Spatial SQL against federated schemas
• Back-end databases: Oracle, DB2, MySQL, SQL Server, PostgreSQL, PostGIS
Integration Engine’s Information Model
• PAKT (briefly)
  – Type extensibility of the mediator
    • Nested relational query language extended by tree operations and a restricted set of graph pattern operations
    • Construction operations important
    • Passive extensibility
      – Source more powerful than the mediator
      – Source exports a set of type-based optimization rules to the mediator
    • Active extensibility
      – Mediator extends its set of interpreted types
  – Ontology management
    • Ontological queries processed by a separate co-processor that interoperates with the mediator
    • Query planner partitions the query between the ontological and mediated query processors
Query Paradigms
• What are the different kinds of queries scientists and applications pose to an integrated system?
  – Metadata-based file access
    • 21,038 raw image files per subject
    • 2.4 GB of raw image data per subject
    • 25 GB to 40 GB of processed image data per subject
    • 10 million slices of functional imaging data in Phase II
    • 7 terabytes of image data for all of the Phase II analyses (conservative estimate of 25 GB/subject)
  – Ontologically supported mediated queries
    • “Find most recent FMRI data of all patients with low scores in working memory tasks having volumetric changes of hippocampus over 10% in 2 years”
  – Keyword queries
    • FMRI “working memory task” hippocampus
  – Ontologically supported keyword queries
  – Associative searches
[Figure: BIRN Data Grid Usage, Oct-02 to Jun-06. Series: total number of files (in thousands) and total size of storage (in gigabytes). Callouts: 16 million files, 16+ terabytes.]
GEON: SOQL (Simple Ontology Query Language)
Query single or integrated resources
• via ontologies (i.e., high-level logical views)
• independent of any physical representation (i.e., schemas)
[Figure: ontology fragment. RockSample has a property location, whose value is a Location with lat and long of type float, and a property hasSiO2, whose value is a ValueWithUnit with value of type float and unit of type string]
SELECT X.location.*
FROM RockSample X
WHERE X.location.lat > 60 AND X.location.long > 100
  AND X.hasSiO2.value < 30 AND X.hasSiO2.unit = 'weightPercentage'
[Figure: the GUI generates SOQL, which is passed to the SOQL processor]
Question: find all seismic stations within 1 mile of railroads
• SOQL query (from the GEON SOQL GUI)
  SELECT X.code, X.location.* FROM SeismicStation X, Railroad Y WHERE distance(X.location, Y.geometry) < 1
• SQL produced by the SOQL Processor
  SELECT X2.stationcode, X2.lat, X2.lon FROM railroads_of_the_united_states X1, stationdatatable X2 WHERE distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
• Schema Mediator
  – Sub-query to the Railroad shapefile: SELECT X1.the_geom FROM railroads X1
  – Sub-query to Seismic Stations: SELECT X2.stationcode, X2.lat, X2.lon FROM stationdatatable X2 WHERE bounding box condition
  – Remaining condition: distance(X1.the_geom, MakePoint(X2.lat, X2.lon)) < 1
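The split between a cheap bounding-box condition and the exact distance predicate can be sketched as follows. This is an illustration only: it assumes planar coordinates and in-memory geometries, whereas the slide's system relies on spatial database functions; all function names and sample data are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance(p, line):
    """Distance from a point to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))

def stations_near(stations, railroads, radius):
    xs = [x for line in railroads for x, _ in line]
    ys = [y for line in railroads for _, y in line]
    lo_x, hi_x = min(xs) - radius, max(xs) + radius
    lo_y, hi_y = min(ys) - radius, max(ys) + radius
    # Step 1: bounding-box pre-filter, analogous to the condition sent to the source.
    candidates = [(code, p) for code, p in stations
                  if lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y]
    # Step 2: exact distance predicate on the surviving candidates.
    return [code for code, p in candidates
            if any(distance(p, line) < radius for line in railroads)]

railroads = [[(0, 0), (10, 0), (10, 10)]]
stations = [("S1", (5, 0.5)), ("S2", (5, 5)), ("S3", (50, 50))]
near = stations_near(stations, railroads, 1)
```

In this toy data, S3 is discarded by the bounding box alone, while S2 passes the box but fails the exact distance test.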
BIRN: A Functional View of the Mediation Process
• Planner
  – Query Expression (UCQ+ + Nesting + Grouping & Aggregate)
  – View Unfolding
  – Flattening of Nested Queries
  – Normalization to DNF
  – Predicate Reordering (binding patterns + maximal chunk)
  – Maximal Feasible Plan
  – Algebraic Plan
  – Cost/Selectivity-based Optimization
  – Pre-Executable Plan
• Execution Engine
  – Pre-Executable Plan
  – Executable Plan
  – Execution Control
  – Result Building
  – Post-processing + aggregate
  – Result Reporting
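The predicate-reordering step can be illustrated with a greedy Python sketch: a subgoal may be scheduled only once every argument its binding pattern marks as bound ('b') already has a value. Only the binding-pattern check is modelled here; the "maximal chunk" grouping and the cost-based optimization named above are not. Relation names and patterns are hypothetical.

```python
def order_subgoals(subgoals, bound=()):
    """Order subgoals so each one's 'b' arguments are bound when it runs.

    subgoals: list of (name, args, pattern); pattern[i] == 'b' means args[i]
    must be bound before the source can be called, 'f' means it may be free.
    """
    bound = set(bound)
    remaining = list(subgoals)
    plan = []
    while remaining:
        for goal in remaining:
            name, args, pattern = goal
            if all(a in bound for a, mode in zip(args, pattern) if mode == "b"):
                plan.append(name)
                bound.update(args)   # calling the source binds all its variables
                remaining.remove(goal)
                break
        else:
            raise ValueError("no feasible ordering for the given binding patterns")
    return plan

goals = [
    ("images_by_subject", ("S", "I"), "bf"),     # needs subject S bound
    ("subjects_by_diagnosis", ("D", "S"), "bf"), # needs diagnosis D bound
]
plan = order_subgoals(goals, bound={"D"})
```

With the diagnosis supplied by the query, the diagnosis lookup runs first and its output binds the subject variable needed by the image lookup.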
View Definition and Query Language
• Union of conjunctive queries
• May contain function terms
• Expressed in XML Datalog with aggregate functions
• Query q(X,F(Y)) :- r1(X,Z), r2(Z,Y)
  – where F(Y) is an aggregate function applied to the set of Y values, and X are the group-by variables
• Planner and Executor translate this to:
  – q’(X,Y) :- r1(X,Z), r2(Z,Y)
  – q(X,W) :- F(gb(q’(X,Y)))
  – where the group-by function “gb” with aggregate function F is pushed to the data source whenever possible, or otherwise evaluated at the Mediator
• The query language allows nested queries: inner queries are assigned to intermediate variables that are used by the main query
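The two-step translation can be mimicked in Python for the case where grouping and aggregation are evaluated at the mediator. Relations are modelled as lists of tuples; the data and the choice of `max` as F are hypothetical.

```python
from collections import defaultdict

def q_prime(r1, r2):
    """q'(X,Y) :- r1(X,Z), r2(Z,Y)  (join on the shared variable Z)."""
    by_z = defaultdict(list)
    for z, y in r2:
        by_z[z].append(y)
    return [(x, y) for x, z in r1 for y in by_z.get(z, [])]

def q(r1, r2, F):
    """q(X,W) :- F(gb(q'(X,Y)))  (group by X, then aggregate the Y values)."""
    groups = defaultdict(list)
    for x, y in q_prime(r1, r2):
        groups[x].append(y)
    return {x: F(ys) for x, ys in groups.items()}

r1 = [("s1", "scan1"), ("s1", "scan2"), ("s2", "scan3")]   # r1(X, Z)
r2 = [("scan1", 10), ("scan2", 30), ("scan3", 5)]          # r2(Z, Y)
result = q(r1, r2, max)
```

When both relations live in one source that supports GROUP BY, the same computation could instead be shipped to that source as a single query, which is the push-down case mentioned above.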
BIRN: Mapping Relations
• Ontology Mapping – maps data values from a source to a term of a known ontology (UMLS)
• Joinable – a relation that pairs attributes from different relations
• Value-Map – maps a mediator-supported data value to the source-supported value (for example, gender coded 0/1 at some source is male/female for the mediator)
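A value map is essentially a per-source lookup table used in both directions: to rewrite constants in a query before it is sent to a source, and to translate results on the way back. A small Python sketch follows; the source name `site_x` and the assignment of 0 to male and 1 to female are assumptions, since the slide does not say which code is which.

```python
# Hypothetical value map: mediator value -> source value, per source and attribute.
VALUE_MAP = {
    ("site_x", "gender"): {"male": 0, "female": 1},
}

def to_source(source, attribute, mediator_value):
    """Translate a mediator-level constant into the source's encoding."""
    return VALUE_MAP[(source, attribute)][mediator_value]

def to_mediator(source, attribute, source_value):
    """Translate a value returned by the source back to the mediator's vocabulary."""
    inverse = {v: k for k, v in VALUE_MAP[(source, attribute)].items()}
    return inverse[source_value]

pushed = to_source("site_x", "gender", "female")
returned = to_mediator("site_x", "gender", 0)
```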
Example Queries
[Figure: query diagrams over the categories Geo-Spatial, Biological, Physiochemical and Habitat (extended), with sources OBIS, WOA and Benth_Hab. Legend: italics = input, underline = output.]
• Q1: where is species X found? OBIS(scientific_name,lat,long)
• Q2: for a given polygon, what species are found? OBIS(scientific_name,m_lat,m_long,m_lat,m_long)
• Q3: where is species X found given a certain physical parameter? OBIS(scientific_name,lat,long) WOA(physio,lat,long)
• Q4: what are the aggregated physical properties of species X? OBIS(scientific_name,lat,long) WOA(physio,lat,long)
• Q5: where is habitat X found?
• Q6: for a given polygon A, what habitats are found?
• Q7: where is habitat X found given a certain physical parameter?
• Q8: what are the aggregated physical properties of habitat X?
• Q9: what species can be found at habitat X?
• Q10: at what habitats is a species X found?
• Relation combinations shown for the habitat queries (Q5–Q10):
  – CMECS(habitat,physio) BH(habitat_grp,shape)
  – CMECS(habitat,physio) BH(habitat_grp,shape) PolygonA
  – CMECS(habitat,physio) BH(habitat_grp,shape) WOA(physio,lat,long)
  – CMECS(habitat,physio) BH(habitat_grp,shape) OBIS(scientific_name,lat,long)
Frequent Query Patterns
• Example queries are joins of
  – Left query patterns: habitat-spatial, and
  – Right query patterns: spatial-environmental/species distribution
• Patterns shown
  – CMECS(habitat,physio) BH(habitat_grp,shape)
  – PolygonA( )
  – BH(..,shape) WOA(physio,lat,long)
  – BH(..,shape) OBIS(scientific_name,lat,long)
• Figure labels: Onto-module’s queries; Mediator’s queries; API
The Resource Management Aspect of Query Evaluation
• Primarily done by the Manchester group (Watson et al.)
• Polar*
  – Based on OQL (internally monoid comprehension)
  – Multi-node planning
    • Plan partitioning
    • Exchange operator
      – Attribute sensitivity
      – Data & index repartitioning
    • Plan scheduling
  – Query execution
[Figure, from Amy Krause: distributed query plan over five nodes. Nodes 1 and 2 each hold a DBMS with data behind OGSA-DAI, with DQP running scan (A) and scan (B); nodes 3 and 4 run DQP join (A1,B1) and join (A2,B2); node 5 runs DQP reduce.]
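The plan in the figure (two scans, two partitioned joins, one reduce) can be imitated with a small Python sketch in which an exchange step hash-partitions both inputs on the join attribute. This is a sequential simulation written for illustration, not Polar* or OGSA-DAI DQP code; all names and rows are hypothetical.

```python
def exchange(rows, key, n):
    """Repartition rows by hashing the join attribute, as an exchange operator would."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def join(a_part, b_part, key):
    """Hash join of one partition of A with the matching partition of B."""
    index = {}
    for b in b_part:
        index.setdefault(b[key], []).append(b)
    return [{**a, **b} for a in a_part for b in index.get(a[key], [])]

def partitioned_join(a_rows, b_rows, key, n=2):
    a_parts = exchange(a_rows, key, n)      # output of scan (A), split into A1..An
    b_parts = exchange(b_rows, key, n)      # output of scan (B), split into B1..Bn
    partial = [join(a, b, key) for a, b in zip(a_parts, b_parts)]  # one join per node
    return [row for part in partial for row in part]               # reduce: collect results

A = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}, {"id": 3, "x": "c"}]
B = [{"id": 2, "y": "p"}, {"id": 3, "y": "q"}, {"id": 4, "y": "r"}]
out = sorted(partitioned_join(A, B, "id"), key=lambda r: r["id"])
```

Because matching rows always hash to the same partition, each join can run on its own node without seeing the other partitions.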
The Adaptivity Issue in DQP on a Grid
• Monitoring-Assessment-Response framework of adaptive query processing in a grid (by Gounaris)
  – Monitoring
    • A separate module that keeps track of information like
      – Has a resource (e.g., memory availability) changed more than 10%?
      – Has the data volume changed recently?
    • Occurs between operators or within an operator’s execution process
    • Other modules subscribe to this notification
  – Assessment
    • Diagnosis is carried out for suboptimal execution, resource shortage, resource idleness, unmet performance requirements, unmet user needs
  – Response
    • Operator replacement or rescheduling, machine rescheduling, plan re-optimization …
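The monitoring part of this framework can be sketched as a small publish/subscribe module in Python: it remembers the last reported value of each resource and notifies subscribers when a new reading differs by more than a threshold (10% here, as in the example above). The class `Monitor`, its methods and the resource name are hypothetical.

```python
class Monitor:
    """Tracks resource readings and notifies subscribers of significant changes."""

    def __init__(self, threshold=0.10):
        self.threshold = threshold
        self.last = {}          # resource -> last value that was reported
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def observe(self, resource, value):
        prev = self.last.get(resource)
        if prev is None:
            self.last[resource] = value   # first reading becomes the baseline
            return False
        if prev != 0 and abs(value - prev) / abs(prev) > self.threshold:
            self.last[resource] = value
            for callback in self.subscribers:
                callback(resource, prev, value)
            return True
        return False

events = []
monitor = Monitor()
# An assessment module would subscribe here; this one just records the event.
monitor.subscribe(lambda res, old, new: events.append((res, old, new)))
monitor.observe("node3.free_memory_mb", 1000)
monitor.observe("node3.free_memory_mb", 950)   # 5% change, below threshold
monitor.observe("node3.free_memory_mb", 800)   # 20% change relative to 1000
```

An assessment module receiving the event would then decide whether the change warrants a response such as rescheduling or re-optimization.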
Commonalities and Complementarities
• Common themes
  – Overall architectural similarity of cyberinfrastructure projects
    • Service orientation
  – The data integration task is part of a larger scientific computing, exploration and analysis process
    • Has impact on integration setting, design decisions and performance expectations
  – Mediation with semantic mapping and reasoning seems to be winning
• Complementary approaches
  – Details of the architecture
    • Relationship with workflows
  – Styles of mediation
  – Extensibility of mediator
  – Adaptivity of query planning and evaluation