ogsa-dqp:service-based distributed query processing on the grid m.nedim alpdemir department of...

21
OGSA-DQP:Service-Based OGSA-DQP:Service-Based Distributed Query Distributed Query Processing on the Grid Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Upload: erika-fields

Post on 13-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

OGSA-DQP:Service-Based OGSA-DQP:Service-Based Distributed Query Processing on Distributed Query Processing on

the Gridthe Grid

M.Nedim AlpdemirDepartment of Computer

ScienceUniversity of Manchester

Page 2: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

People, Places, FundingManchester

M Nedim AlpdemirAnastasios

GounarisNorman W PatonAlvaro A A FernandesRizos Sakellariou

Newcastle upon Tyne

Arijit MukherjeeJim Smith

Paul Watson

Page 3: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Outline Requirements, Motivation Background OGSA-DQP Overview OGSA-DQP Details – How does it work OGSA-DQP Details – A query Example Summary

Page 4: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Story: Ad-hoc data integration

p r o c es s q u er yq u er yp r o c es sq u er y

C lien t Ap p lic a tio n

L o c atio n 1 L o c atio n 2 L o c atio n 3

Page 5: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Story: Middleware to the rescue

q u er y

C lien t Ap p lic a tio n

G r id I n f r as tr u c tu r e ( W eb S er v ic es , O G S A, Un ic o r e , C o n d o r )

D ata Ac c es s an d I n teg r a tio n F r am ew o r k s ( e .g . O G S A- D AI )

Hig h - lev e l D ata I n teg r a tio n S y s tem s ( e .g . O G S A- D Q P )

S em an tic S ear c h & D is c o v er y / S c h em a I n teg r a tio n / O n to lo g ies

D ata s o u r c e w r ap p er D ata s o u r c e w r ap p er D ata s o u r c e w r ap p er

Page 6: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Motivation Pull by applications:

overwhelming amounts of semantically complex data in

very diverse, structurally dissimilar, and autonomous, geographically dispersed data sources

requiring computationally demanding analysis.

Push from technologies and infrastructure:

Web service impetus Grid services and

protocols that enable: dynamic resource

discovery dynamic resource

allocation and use Data access

standards

Page 7: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

The Grid needs DQP: Declarative, high-

level resource integration with implicit parallelism.

DQP-based solutions should in principle run faster than those manually coded.

DQP needs the Grid: Systematic access

to remote data and computational resources.

Dynamic resource discovery and allocation.

Mutual Benefits

Page 8: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Background: mediator/wrapper middleware

A well-known approach to data integration is to use mediator/wrapper middleware.

wrappers reconcile differences and impose a global schema.

Often there is only one mediator in one fixed location, which is limiting.

DBMS

data

mediator

DBMS

data

Query Results

wrapper wrapper

Page 9: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

OGSA-DQP adopts mediator/wrapper

OGSA-DQP uses a middleware approach.

It can be seen as a mediator over OGSA-DAI wrappers.

It promises bottom-lines regarding:

efficiency: “leave to it to schedule in parallel”;

effectiveness: “leave to it to orchestrate your services”;

usability: “use it as a Grid data service”.

DBMS

data

OGSA-DQP

DBMS

data

Query Results

OGSA-DAI

OGSA-DAI

Page 10: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Background: Service-based approaches

S e rv iceR e g is try

S e rv iceR e qu e s te r

S e rv icePro v ide r

F in d P u b lish

B in d

C re de n ti a l s , ro l e s

Q oS n e goti a ti on tok e n s

M on i tori n g , accou n ti n g ,n oti fi ca ti on

I n fra s tru ctu re

Virtualisation of ResourcesVirtualisation of Resources

A convenient cooperation model for Distributed Systems (e.g. Grid)A convenient cooperation model for Distributed Systems (e.g. Grid)

facilitate

Leads to

Page 11: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Background: Grid Database Service (GDS)

Databases are made available on the Grid through integration with other Grid services, and provision of standard interfaces

Build upon OGSA to deliver high-level data management functionality for the Grid.

Seek to provide two classes of components: Data access components Data integration components

Page 12: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Innovations OGSA-DQP dynamically allocates

evaluators to do work on behalf of the mediator.

This allows for runtime circumstances to be taken into account when the optimiser decides how to partition and schedule.

OGSA-DQP uses a parallel physical algebra: most mediator-based query processors do not.

Page 13: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

OGSA-DQP overview: A Service-based framework Service-based in two orthogonal sense:

Supports querying over data storagedata storage and analysis analysis resources made available as servicesservices

ConstructionConstruction of distributed query plans and their executionexecution over the grid are factored out as services

Uses the emerging standard for GDSs to provide consistent access to database metadata and to interact with databases on the Grid.

A query may refer to database (GDS) and computational services.

Page 14: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

OGSA-DQP Overview: Two Separate Grid Services

Grid Distributed Query Services (GDQSs) that:

interact with clients; find and retrieve service

descriptions; parse, compile, partition

and schedule the query execution over a union of distributed data sources.

The query plan is an orchestration of GQESs

Grid Query Evaluation Services (GQESs) that:

implement the physical query algebra;

implement the query execution model and semantics;

run a partition of a query execution plan generated by a GDQS;

interact with other GQESs/GDSs/WSs but not with clients.

Page 15: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

OGSA-DQP Overview: An illustration

G D Q S

GD S 1

GD S 2

W e bS e rv ice s

C lie n t

re s o u rce lis t

W S D L

D B S ch e m a

D B S ch e m a

G L o g ica lO pt im is e r G

Ph y s ica lO pt im is e r

G Pa rt it io n e r GS ch e du le r

G

OQ

L P

arse

r

Po la r* Q u e ry O pt im is e r En g in e

GD S Q u e r yR e qu e s t D oc .

O Q LQ u e r y

pr in t

e xc h an g e

h as h join

s c ane x ch a n g e

s c an

P1

P2

P3

GGQ ES 3

GGQ ES 2

GGQ ES 1

Distributed QueryExecution Engine

sub- pla n

sub- pla n

da ta b lock s

da ta b lock s

s u b-qu e ry

s u b-qu e ry

o pe ra t io n ca ll

<?xml version="1.0" encoding="UTF-8"?>

<GDQDataSourceList xmlns="http://dqp.ogsadai.org.uk/schema/gdqs" >

<importedDataSource>

<GDSFactoryHandle>http://phoebus.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>

<GDSFactoryHandle>http://rpc676.cs.man.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>

<GDSFactoryHandle>http://mygrid.ncl.cs.ac.uk:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle>

</importedDataSource>

<importedService>

<wsdlURL>http://phoebus.cs.man.ac.uk:9090/axis/services/EntropyAnalyserService?WSDL</wsdlURL>

</importedService>

</GDQDataSourceList>

<?xml version="1.0" encoding="UTF-8"?>

<databaseSchema xmlns="">

<logicalSchema>

<table name="goterm">

<column fullName="goterm_id" length="32" name="id">

<sqlTypeName>varchar</sqlTypeName>

<sqlJavaTypeID>12</sqlJavaTypeID>

</column>

<column fullName="goterm_type" length="55" name="type">

<sqlTypeName>varchar</sqlTypeName>

<sqlJavaTypeID>12</sqlJavaTypeID>

</column>

<column fullName="goterm_name" length="255" name="name">

<sqlTypeName>varchar</sqlTypeName>

<sqlJavaTypeID>12</sqlJavaTypeID>

</column>

<primaryKey>

<columnFullName>id</columnFullName>

</primaryKey>

</table>

</logicalSchema>

<physicalSchema>

<hostMachine>130.88.192.230</hostMachine>

<database join_buffer_size="131072" max_join_size="4294967295">

<physTable avgRowLength="67" dataLength="766784" indexLength="126976" name="goterm" rowFormat="Dynamic" rows="11369"/>

</database>

</physicalSchema>

<GDSFHandle>http://phoebus.cs.man.ac.uk:9090/ogsa/services/ogsadai/GridDataServiceFactory</GDSFHandle>

</databaseSchema>

<?xml version="1.0" encoding="UTF-8"?>

<Partitions>

<Partition>

<evaluatorURI>http://130.88.198.195:9090/ogsa/services/ogsadai/dqp/GridQueryEvaluationFactory/hash-11025450-1076603541049</evaluatorURI>

<Operator operatorID="0" operatorType="TABLE_SCAN">

<tupleType>

<type>goterm</type>

<name>goterm.OID</name>

<type>string</type>

<name>goterm.id</name>

<type>string</type>

<name>goterm.type</name>

<type>string</type>

<name>goterm.name</name>

</tupleType>

<TABLE_SCAN>

<dataResourceName> goterms </dataResourceName>

<GDSHandle> http://130.88.192.230:9090/ogsa/services/ogsadai/GridDataServiceFactory/hash-31056514-1076603576481</GDSHandle>

<tableName> goterms </tableName>

<predicateExpr>

<predicate>

<comparativeOperator>LIKE</comparativeOperator>

<leftOperand name=" goterm.id" type="13"/>

<rightOperand name=" GO:0000%" type="16"/>

</predicate>

</predicateExpr>

</TABLE_SCAN>

</Operator> . . .

</Partition> . . .

</Partitions>

Page 16: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Details: Setting up GDQS

Setup phase involves: Importing schemas of participating data sources

Results in a global schema (currently no schema integration support)

Importing WSDL documents of participating analysis services

Service metadata obtained (e.g. Port types, operations, types etc.)

Collecting computational resource metadata (implicit) Set-up strategy depends on the life-time model of

GDQS and GDSs GDQS instance is created per-client But it can serve multiple-queries This model avoids complexity of multi-user interactions

while ensures that the set-up cost is not high

Page 17: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Details: A query Example

Given two DBMSs and one analysis tool (e.g., a WS):

proteinTerm to a GO Gene Ontology running as a remote mySQL DB,

protein to a GIMS Genome Warehouse running as a remote ODMG-compliant DB,

Blast (sequence alignment scoring);

We can obtain alignment scores for a sequence against proteins of a certain kind:

select p.proteinId, Blast(p.sequence)from protein p, proteinTerm twhere t.termId = ‘GO:0005942’ and

p.proteinId = t.proteinId

• Then, OGSA-DQP acts as an enactor of a declarative orchestration of services on the Grid:

index_scantermId=GO:0005942(proteinTerm)

table_scan(protein)

reduce reduce

hash_join(proteinId)

op_call(Blast)reduce

exchange exchange

exchange

3,4

2

1

5

Page 18: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Details: what parallelisms do I benefit from?

3 4partitionedparallelism

pipelinedparallelism

table_scantermId=GO:0005942(proteinTerm)

table_scan(protein)

reduce reduce

hash_join(proteinId)

op_call(Blast)

reduce

exchange exchange

exchange

3,4

2

1 5

Page 19: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

GFactory G Q ES F

GFactory G Q ES F

GFactory G Q ES F

N 2

N 1

N3

GC lie n tGG D S

GG D S

G D Q

G D T

G D Q S

N 0G D S

GFactory G Q ES F

N4

p erform (Q u ery)1

cre a te S e rv ice

cre a te S e rv ice2

cre ate S e rvi ce

2

2

GG D S G Q ES 2

G D T

GG D S G Q ES 3

G D T

GG D S G Q ES 1

G D T

GG D S G Q ES 1

G D T

p erform (Q u ery S u b p la n )

p erform (Q u ery S u b p la n )

perform(Q

uer ySu bpl an)

3

s eq u en t ial_ s can

red u ce (p r o tein ID ,s eq u en ce )

s eq u en t ial_ s can ( ter m = 8 3 7 2 )

red u ce (p r o tein ID )

h as h _ jo in(p .p r o tein ID = t.p r o tein ID )

3

o p erat io n _ callb la s t(p .s eq u en ce)

red u ce (p .p r o tein ID , b la s t)

o p erat io n _ callb la s t(p .s eq u en ce)

red u ce (p .p r o tein ID , b la s t)

3

W e b S e rvi ce s (B L A S T)

resu lts

resu lts

resu lts

4

1144

Details: What goes on behind the scene?

Page 20: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

summary1. High-level data access and integration services are

needed if applications that have data with complex structure and complex semantics are to benefit from the Grid.

2. Standards for data access are emerging, and middleware products that are reference implementations of such standards are already available.

3. Distributed query processing technology is one approach to delivering (1.) given the availability of (2.).

4. OGSA-DQP is a service-based distributed query processor for the Grid that is:

Exposed as a service; Implemented as an orchestration of services. Improves on Grid portals when only retrieval and analysis is

involved.

Page 21: OGSA-DQP:Service-Based Distributed Query Processing on the Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

where to find out moreOGSA-DQPGrid middleware to query distributed data

sources

www.ogsadai.org.uk/dqp OGSA-DAI

Grid middleware to interface with data(bases)

www.ogsadai.org.uk/