OGSA-DQP: Service-Based Distributed Query Processing on the Grid
M. Nedim Alpdemir
Department of Computer Science
University of Manchester
People, Places, Funding
Manchester
  M Nedim Alpdemir
  Anastasios Gounaris
  Norman W Paton
  Alvaro A A Fernandes
  Rizos Sakellariou
Newcastle upon Tyne
  Arijit Mukherjee
  Jim Smith
  Paul Watson
Outline
  Requirements, Motivation
  Background
  OGSA-DQP Overview
  OGSA-DQP Details: How does it work
  OGSA-DQP Details: A query example
  Summary
Story: Ad-hoc data integration
[Figure: the Client Application itself alternates query and process steps against Location 1, Location 2 and Location 3.]
Story: Middleware to the rescue
[Figure: the Client Application submits a single query to a middleware stack. Layers as listed in the figure:]
  Grid Infrastructure (Web Services, OGSA, Unicore, Condor)
  Data Access and Integration Frameworks (e.g. OGSA-DAI)
  High-level Data Integration Systems (e.g. OGSA-DQP)
  Semantic Search & Discovery / Schema Integration / Ontologies
  Data source wrapper (one per data source; three shown)
Motivation
Pull by applications:
  overwhelming amounts of semantically complex data in very diverse, structurally dissimilar, autonomous and geographically dispersed data sources, requiring computationally demanding analysis.
Push from technologies and infrastructure:
  Web service impetus
  Grid services and protocols that enable:
    dynamic resource discovery
    dynamic resource allocation and use
  Data access standards
Mutual Benefits
The Grid needs DQP:
  Declarative, high-level resource integration with implicit parallelism.
  DQP-based solutions should in principle run faster than those manually coded.
DQP needs the Grid:
  Systematic access to remote data and computational resources.
  Dynamic resource discovery and allocation.
Background: mediator/wrapper middleware
A well-known approach to data integration is to use mediator/wrapper middleware.
Wrappers reconcile differences between the sources and impose a global schema.
Often there is only one mediator in one fixed location, which is limiting.
[Figure: a mediator receives a Query and returns Results; below it, two wrappers each sit over a DBMS and its data.]
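The mediator/wrapper arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory sources, their column names and the `Wrapper` and `Mediator` classes are hypothetical and are not part of any OGSA software.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Wrapper:
    """Hypothetical wrapper: maps one source's native rows onto the global schema."""
    rows: List[Row]                  # stand-in for a remote DBMS
    to_global: Callable[[Row], Row]  # reconciles naming and format differences

    def scan(self) -> Iterable[Row]:
        for row in self.rows:
            yield self.to_global(row)


class Mediator:
    """Hypothetical mediator: one fixed place that queries every wrapper."""

    def __init__(self, wrappers: List[Wrapper]) -> None:
        self.wrappers = list(wrappers)

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Every row arrives in the global schema, so one predicate fits all sources.
        return [row for w in self.wrappers for row in w.scan() if predicate(row)]


# Two sources that disagree on column names and on the type of the length field.
source_a = Wrapper(
    rows=[{"prot_id": "P1", "len": 120}, {"prot_id": "P2", "len": 80}],
    to_global=lambda r: {"proteinId": r["prot_id"], "length": r["len"]},
)
source_b = Wrapper(
    rows=[{"ID": "P3", "LENGTH": "200"}],
    to_global=lambda r: {"proteinId": r["ID"], "length": int(r["LENGTH"])},
)

mediator = Mediator([source_a, source_b])
long_proteins = mediator.query(lambda row: row["length"] >= 100)
```

The single `Mediator` object in one place is exactly the limitation the slide points out.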
OGSA-DQP adopts mediator/wrapper
OGSA-DQP uses a middleware approach.
It can be seen as a mediator over OGSA-DAI wrappers.
It promises bottom-line benefits in:
efficiency: “leave it to OGSA-DQP to schedule in parallel”;
effectiveness: “leave it to OGSA-DQP to orchestrate your services”;
usability: “use it as a Grid data service”.
[Figure: OGSA-DQP takes the mediator role, receiving a Query and returning Results; OGSA-DAI takes the wrapper role over each DBMS and its data.]
Background: Service-based approaches
[Figure: Service Registry, Service Requester and Service Provider, linked by Find, Publish and Bind.]
Infrastructure: credentials, roles; QoS negotiation tokens; monitoring, accounting, notification.
Virtualisation of Resources
A convenient cooperation model for Distributed Systems (e.g. Grid)
[Arrow labels in the figure: facilitate; leads to.]
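The publish/find/bind interaction named in the figure can be illustrated with a toy in-process registry. This is a sketch under assumptions: `ServiceRegistry`, `entropy_provider` and the `"analysis"` service type are hypothetical names, and a real Grid registry would add the credentials, QoS negotiation and monitoring listed above.

```python
from typing import Callable, Dict, List

Provider = Callable[[str], str]


class ServiceRegistry:
    """Hypothetical registry: providers publish, requesters find."""

    def __init__(self) -> None:
        self._providers: Dict[str, List[Provider]] = {}

    def publish(self, service_type: str, provider: Provider) -> None:
        # A provider advertises itself under a service type.
        self._providers.setdefault(service_type, []).append(provider)

    def find(self, service_type: str) -> List[Provider]:
        # A requester looks up every provider of that type.
        return list(self._providers.get(service_type, []))


def entropy_provider(sequence: str) -> str:
    """Hypothetical analysis service, standing in for a remote operation."""
    return f"analysed {len(sequence)} symbols"


registry = ServiceRegistry()
registry.publish("analysis", entropy_provider)  # Publish
candidates = registry.find("analysis")          # Find
reply = candidates[0]("MKTAY")                  # Bind: call the chosen provider
```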
Background: Grid Database Service (GDS)
Databases are made available on the Grid through integration with other Grid services, and provision of standard interfaces
Build upon OGSA to deliver high-level data management functionality for the Grid.
Seek to provide two classes of components:
  Data access components
  Data integration components
Innovations
OGSA-DQP dynamically allocates evaluators to do work on behalf of the mediator.
This allows for runtime circumstances to be taken into account when the optimiser decides how to partition and schedule.
OGSA-DQP uses a parallel physical algebra: most mediator-based query processors do not.
OGSA-DQP Overview: A Service-based framework
Service-based in two orthogonal senses:
  Supports querying over data storage and analysis resources made available as services
  Construction of distributed query plans and their execution over the Grid are factored out as services
Uses the emerging standard for GDSs to provide consistent access to database metadata and to interact with databases on the Grid.
A query may refer to database (GDS) and computational services.
OGSA-DQP Overview: Two Separate Grid Services
Grid Distributed Query Services (GDQSs) that:
  interact with clients;
  find and retrieve service descriptions;
  parse, compile, partition and schedule the query execution over a union of distributed data sources.
The query plan is an orchestration of GQESs
Grid Query Evaluation Services (GQESs) that:
implement the physical query algebra;
implement the query execution model and semantics;
run a partition of a query execution plan generated by a GDQS;
interact with other GQESs/GDSs/WSs but not with clients.
OGSA-DQP Overview: An illustration
[Figure: OGSA-DQP architecture. Labels recovered from the diagram:]
  Client: resource list, OQL Query
  GDQS (Polar* Query Optimiser Engine): OQL Parser, Logical Optimiser, Physical Optimiser, Partitioner, Scheduler
  Metadata inputs: DB Schema (from GDS1 and GDS2), WSDL (from Web Services)
  Example plan: print, exchange, hash join, exchange, scan, scan, split into partitions P1, P2 and P3
  Distributed Query Execution Engine: GQES1, GQES2, GQES3
  Arrow labels: GDS Query Request Doc., sub-plan, data blocks, sub-query, operation call
<?xml version="1.0" encoding="UTF-8"?>
<GDQDataSourceList xmlns="http://dqp.ogsadai.org.uk/schema/gdqs" >
  <importedDataSource>
    <GDSFactoryHandle>http://phoebus.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://rpc676.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://mygrid.ncl.cs.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
  </importedDataSource>
  <importedService>
    <wsdlURL>http://phoebus.cs.man.ac.uk:9090/axis/services/EntropyAnalyserService?WSDL</wsdlURL>
  </importedService>
</GDQDataSourceList>
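A resource list like the one above can be read with Python's standard `xml.etree.ElementTree`. The sketch below is an illustration, not OGSA-DQP code: `read_resource_list` is a hypothetical helper, and the embedded document is a copy of the slide's example with the XML declaration omitted.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the slide's example document.
NS = {"g": "http://dqp.ogsadai.org.uk/schema/gdqs"}

RESOURCE_LIST = """<GDQDataSourceList xmlns="http://dqp.ogsadai.org.uk/schema/gdqs">
  <importedDataSource>
    <GDSFactoryHandle>http://phoebus.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://rpc676.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
    <GDSFactoryHandle>http://mygrid.ncl.cs.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>
  </importedDataSource>
  <importedService>
    <wsdlURL>http://phoebus.cs.man.ac.uk:9090/axis/services/EntropyAnalyserService?WSDL</wsdlURL>
  </importedService>
</GDQDataSourceList>"""


def read_resource_list(xml_text):
    """Hypothetical helper: return (GDS factory handles, WSDL URLs)."""
    root = ET.fromstring(xml_text)
    factories = [
        e.text.strip()
        for e in root.findall("g:importedDataSource/g:GDSFactoryHandle", NS)
    ]
    wsdls = [
        e.text.strip()
        for e in root.findall("g:importedService/g:wsdlURL", NS)
    ]
    return factories, wsdls


factories, wsdls = read_resource_list(RESOURCE_LIST)
```

The two lists correspond to the two kinds of resource a query may refer to: database services and analysis services.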
<?xml version="1.0" encoding="UTF-8"?>
<databaseSchema xmlns="">
  <logicalSchema>
    <table name="goterm">
      <column fullName="goterm_id" length="32" name="id">
        <sqlTypeName>varchar</sqlTypeName>
        <sqlJavaTypeID>12</sqlJavaTypeID>
      </column>
      <column fullName="goterm_type" length="55" name="type">
        <sqlTypeName>varchar</sqlTypeName>
        <sqlJavaTypeID>12</sqlJavaTypeID>
      </column>
      <column fullName="goterm_name" length="255" name="name">
        <sqlTypeName>varchar</sqlTypeName>
        <sqlJavaTypeID>12</sqlJavaTypeID>
      </column>
      <primaryKey>
        <columnFullName>id</columnFullName>
      </primaryKey>
    </table>
  </logicalSchema>
  <physicalSchema>
    <hostMachine>130.88.192.230</hostMachine>
    <database join_buffer_size="131072" max_join_size="4294967295">
      <physTable avgRowLength="67" dataLength="766784" indexLength="126976" name="goterm" rowFormat="Dynamic" rows="11369"/>
    </database>
  </physicalSchema>
  <GDSFHandle>http://phoebus.cs.man.ac.uk:9090/ogsa/services/ogsadai/GridDataServiceFactory</GDSFHandle>
</databaseSchema>
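The schema document above mixes logical metadata (tables and columns) with physical metadata (row counts and sizes) that an optimiser could use for cost estimation. A minimal sketch of pulling both out with Python's standard library follows; `summarise_schema` is a hypothetical helper and the embedded XML is an abridged copy of the slide's example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's schema document (XML declaration omitted).
SCHEMA = """<databaseSchema xmlns="">
  <logicalSchema>
    <table name="goterm">
      <column fullName="goterm_id" length="32" name="id">
        <sqlTypeName>varchar</sqlTypeName>
      </column>
      <column fullName="goterm_type" length="55" name="type">
        <sqlTypeName>varchar</sqlTypeName>
      </column>
      <column fullName="goterm_name" length="255" name="name">
        <sqlTypeName>varchar</sqlTypeName>
      </column>
    </table>
  </logicalSchema>
  <physicalSchema>
    <database join_buffer_size="131072" max_join_size="4294967295">
      <physTable avgRowLength="67" dataLength="766784" indexLength="126976"
                 name="goterm" rowFormat="Dynamic" rows="11369"/>
    </database>
  </physicalSchema>
</databaseSchema>"""


def summarise_schema(xml_text):
    """Hypothetical helper: map table name -> column types and row count."""
    root = ET.fromstring(xml_text)
    tables = {}
    for table in root.findall("logicalSchema/table"):
        columns = {
            col.get("name"): col.findtext("sqlTypeName")
            for col in table.findall("column")
        }
        tables[table.get("name")] = {"columns": columns, "rows": None}
    # Attach the physical statistics to the matching logical table.
    for phys in root.findall("physicalSchema/database/physTable"):
        if phys.get("name") in tables:
            tables[phys.get("name")]["rows"] = int(phys.get("rows"))
    return tables


summary = summarise_schema(SCHEMA)
```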
<?xml version="1.0" encoding="UTF-8"?>
<Partitions>
  <Partition>
    <evaluatorURI>http://130.88.198.195:9090/ogsa/services/ogsadai/dqp/GridQueryEvaluationFactory/hash-11025450-1076603541049</evaluatorURI>
    <Operator operatorID="0" operatorType="TABLE_SCAN">
      <tupleType>
        <type>goterm</type>
        <name>goterm.OID</name>
        <type>string</type>
        <name>goterm.id</name>
        <type>string</type>
        <name>goterm.type</name>
        <type>string</type>
        <name>goterm.name</name>
      </tupleType>
      <TABLE_SCAN>
        <dataResourceName> goterms </dataResourceName>
        <GDSHandle> http://130.88.192.230:9090/ogsa/services/ogsadai/GridDataServiceFactory/hash-31056514-1076603576481</GDSHandle>
        <tableName> goterms </tableName>
        <predicateExpr>
          <predicate>
            <comparativeOperator>LIKE</comparativeOperator>
            <leftOperand name=" goterm.id" type="13"/>
            <rightOperand name=" GO:0000%" type="16"/>
          </predicate>
        </predicateExpr>
      </TABLE_SCAN>
    </Operator> . . .
  </Partition> . . .
</Partitions>
Details: Setting up GDQS
The setup phase involves:
  Importing schemas of participating data sources
    Results in a global schema (currently no schema integration support)
  Importing WSDL documents of participating analysis services
    Service metadata obtained (e.g. port types, operations, types)
  Collecting computational resource metadata (implicit)
The set-up strategy depends on the lifetime model of the GDQS and GDSs:
  A GDQS instance is created per client
  But it can serve multiple queries
  This model avoids the complexity of multi-user interactions while ensuring that the set-up cost is not high
Details: A query Example
Given two DBMSs and one analysis tool (e.g., a WS):
  proteinTerm, from a GO Gene Ontology database running as a remote MySQL DB,
  protein, from a GIMS Genome Warehouse running as a remote ODMG-compliant DB,
  Blast (sequence alignment scoring);
We can obtain alignment scores for a sequence against proteins of a certain kind:
  select p.proteinId, Blast(p.sequence)
  from protein p, proteinTerm t
  where t.termId = 'GO:0005942' and
        p.proteinId = t.proteinId
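What the query above computes can be shown on toy data in plain Python. This is only a sketch of the query's meaning, not of how OGSA-DQP evaluates it: the rows are invented, and `blast` is a stand-in that returns the sequence length rather than a real alignment score.

```python
from collections import defaultdict


def blast(sequence):
    """Stand-in for the remote Blast service (assumption: NOT a real score)."""
    return len(sequence)


# Invented rows for the two collections named in the query.
protein = [
    {"proteinId": "P1", "sequence": "MKT"},
    {"proteinId": "P2", "sequence": "MKTAY"},
    {"proteinId": "P3", "sequence": "MK"},
]
protein_term = [
    {"proteinId": "P1", "termId": "GO:0005942"},
    {"proteinId": "P2", "termId": "GO:0000001"},
    {"proteinId": "P3", "termId": "GO:0005942"},
]


def run_query(protein, protein_term, term_id):
    # where t.termId = term_id: select, then build a hash table on proteinId.
    build = defaultdict(list)
    for t in protein_term:
        if t["termId"] == term_id:
            build[t["proteinId"]].append(t)
    # p.proteinId = t.proteinId: probe with protein rows, one output per match.
    result = []
    for p in protein:
        for _match in build.get(p["proteinId"], []):
            # select p.proteinId, Blast(p.sequence)
            result.append((p["proteinId"], blast(p["sequence"])))
    return result


scores = run_query(protein, protein_term, "GO:0005942")
```

Only P1 and P3 carry the requested term, so only their sequences are passed to the analysis function.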
Then, OGSA-DQP acts as an enactor of a declarative orchestration of services on the Grid:
[Figure: query plan. Operators shown: index_scan[termId=GO:0005942](proteinTerm), table_scan(protein), reduce (three times), hash_join(proteinId), op_call(Blast), exchange (three times). Partition labels on the plan: 1, 2, 3,4 and 5.]
Details: what parallelisms do I benefit from?
[Figure: the query plan from the previous slide, here with table_scan[termId=GO:0005942](proteinTerm) in place of the index scan, annotated with partition labels 1, 2, 3,4 and 5. Partitions 3 and 4 are marked as partitioned parallelism; pipelined parallelism is also marked.]
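The two kinds of parallelism named on this slide can be illustrated with a small sketch. It rests on assumptions: `exchange`, `op_call` and `fake_blast` are hypothetical stand-ins written for this example, threads stand in for separate evaluator services, and the real exchange operator ships tuples between machines rather than between lists.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def exchange(rows, key, n_partitions):
    """Partitioned parallelism: route each row by a stable hash of its key."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n_partitions
        parts[index].append(row)
    return parts


def fake_blast(sequence):
    """Stand-in for the Blast operation call (assumption: not a real score)."""
    return len(sequence)


def op_call(rows, fn):
    """Pipelined operator: emits each output as soon as its input arrives."""
    for row in rows:
        yield {"proteinId": row["proteinId"], "score": fn(row["sequence"])}


def evaluate_partition(rows):
    # One evaluator runs the same sub-plan over its own share of the data.
    return list(op_call(iter(rows), fake_blast))


# Invented output of the join, about to be split across two evaluators.
joined = [
    {"proteinId": "P1", "sequence": "MKT"},
    {"proteinId": "P3", "sequence": "MK"},
    {"proteinId": "P4", "sequence": "MKTAYI"},
]

parts = exchange(joined, "proteinId", 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    results = [row for part in pool.map(evaluate_partition, parts) for row in part]
```

A stable hash (`zlib.crc32`) is used instead of Python's built-in `hash`, which is randomised per process for strings and would make the routing differ between runs.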
Details: What goes on behind the scenes?
[Figure: execution walkthrough. Labels recovered from the diagram:]
  Nodes N0 to N4; Client; GDQS; GDS instances; a GQES Factory (GQESF) on each of four nodes; port labels GDS, GDQ and GDT
  Step 1: perform(Query), sent by the client to the GDQS
  Step 2: createService, issued three times to create GQES1, GQES2 and GQES3
  Step 3: perform(QuerySubplan), issued to each GQES
  Step 4: results
  Sub-plans shown:
    sequential_scan, then reduce(proteinID, sequence)
    sequential_scan(term = 8372), then reduce(proteinID)
    hash_join(p.proteinID = t.proteinID)
    operation_call blast(p.sequence), then reduce(p.proteinID, blast), shown twice, calling Web Services (BLAST)
Summary
1. High-level data access and integration services are needed if applications that have data with complex structure and complex semantics are to benefit from the Grid.
2. Standards for data access are emerging, and middleware products that are reference implementations of such standards are already available.
3. Distributed query processing technology is one approach to delivering (1.) given the availability of (2.).
4. OGSA-DQP is a service-based distributed query processor for the Grid that:
   is exposed as a service;
   is implemented as an orchestration of services;
   improves on Grid portals when only retrieval and analysis are involved.
Where to find out more
OGSA-DQP: Grid middleware to query distributed data sources
  www.ogsadai.org.uk/dqp
OGSA-DAI: Grid middleware to interface with data(bases)
  www.ogsadai.org.uk/