service-based distributed query processing on the grid

30
Service-Based Distributed Service-Based Distributed Query Processing on the Query Processing on the Grid Grid M.Nedim Alpdemir Department of Computer Science University of Manchester

Upload: moanna

Post on 30-Jan-2016

40 views

Category:

Documents


0 download

DESCRIPTION

Service-Based Distributed Query Processing on the Grid. M.Nedim Alpdemir Department of Computer Science University of Manchester. Service-based approaches. facilitate. Virtualisation of Resources. Leads to. A convenient cooperation model for Distributed Systems (e.g. Grid). Context. Data - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Service-Based Distributed Query Processing on the Grid

Service-Based Distributed Query Service-Based Distributed Query Processing on the GridProcessing on the Grid

M.Nedim AlpdemirDepartment of Computer

ScienceUniversity of Manchester

Page 2: Service-Based Distributed Query Processing on the Grid

Service-based approaches

S e rv iceR e g is try

S e rv iceR e qu e s te r

S e rv icePro v ide r

F in d P u b lish

B in d

C re de n ti a l s , ro l e s

Q oS n e goti a ti on tok e n s

M on i tori n g , accou n ti n g ,n oti fi ca ti on

I n fra s tru ctu re

Virtualisation of ResourcesVirtualisation of Resources

A convenient cooperation model for Distributed Systems (e.g. Grid)A convenient cooperation model for Distributed Systems (e.g. Grid)

facilitate

Leads to

Page 3: Service-Based Distributed Query Processing on the Grid

Context

Computational Complexity

DataComplexity

Page 4: Service-Based Distributed Query Processing on the Grid

Web Services Are Not Enough Lack facilities for:

Computational resource description. Computational resource discovery. Application staging.

Grid Services combine: Web Services for service description and

invocation. Grid middleware for computational resource

description and utilisation.

Page 5: Service-Based Distributed Query Processing on the Grid

Open Grid Services Architecture (OGSA) OGSA services are

described using WSDL.

OGSA service instances are:

Created dynamically by factories.

Identified through Grid Service Handles.

Self describing through Service Data Elements.

Stateful, with soft state lifetime management.

Current status: Globus 3 beta

release in June 2003: www.globus.org.

Supports service instances and access to other Globus services.

Core database services from OGSA-DAI project tracking Globus releases.

Page 6: Service-Based Distributed Query Processing on the Grid

Grid Database Service (GDS)

Databases are made available on the Grid through integration with other Grid services, and provision of standard interfaces

Build upon OGSA to deliver high-level data management functionality for the Grid.

Seek to provide two classes of components: Data access components Data integration components

Page 7: Service-Based Distributed Query Processing on the Grid

GFactory G D S F

G

Re gi s try

G SG D S R

GC lie n t

G

G D SG D SI n s ta n ce

G D T D B

R e g is te rS e rv ice

f in dS e rv ice D a ta

C re a te S e rv ice

pe rfo rm (Q u e ry )

GG D T

C on s u m e r

1

4

2

113

5

GDS interactions

Page 8: Service-Based Distributed Query Processing on the Grid

Distributed Query Processing DQP involves a single query referencing

data stored at multiple sites. The locations of the data may be

transparent to the author of the query.

select p.proteinId, Blast(p.sequence)from protein p, proteinTerm twhere t.termId = ‘GO:0005942’ and p.proteinId = t.proteinId

J. Smith, A. Gounaris, P. Watson, N. Paton, A. Fernandes, R. Sakellariou, Distributed Query Processing on the Grid, 3rd Int. Workshop on Grid Computing, Springer-Verlag, 279-290, 2002.

Page 9: Service-Based Distributed Query Processing on the Grid

Mutual Benefit The Grid needs

DQP: Declarative, high-

level resource integration with implicit parallelism.

DQP-based solutions should in principle run faster than those manually coded.

DQP needs the Grid: Systematic access

to remote data and computational resources.

Dynamic resource discovery and allocation.

Page 10: Service-Based Distributed Query Processing on the Grid

A Service-Based DQP Architecture

O G S A -D A I

O G S A

D is tribu te d Q u e ry Pro ce s s o r

A pplica t io n /Pre s e n ta t io n L a y e r

C on fi gu rati on

Logg i n g

Au di ti n g

Pol i cy

S e cu ri ty

Ve rs i on i n g

Accou n ti n g

Page 11: Service-Based Distributed Query Processing on the Grid

Service-based DQP framework Service-based in two orthogonal sense:

Supports querying over data storagedata storage and analysis analysis resources made available as servicesservices

ConstructionConstruction of distributed query plans and their executionexecution over the grid are factored out as services

Uses the emerging standard for GDSs to provide consistent access to database metadata and to interact with databases on the Grid.

A query may refer to database (GDS) and computational services.

Page 12: Service-Based Distributed Query Processing on the Grid

Extends OGSA & OGSA-DAI …

By adding a new port type and two new services ( and their corresponding factories) :

Grid Distributed Query (GDQ) Port Type importSchema operation

GDQSR importSchema(GDQDataSourceList GDSL)

GDSL : A document containing: the list of Data Sources. The items on this list should contain the

handles of the GDS Factories, along with an instance creation document for each factory.

And/or a set of WSDL URLs for the analysis services to be used

Page 13: Service-Based Distributed Query Processing on the Grid

Continued …

Grid Distributed Query Service (GDQS) Wraps an existing query compiler/optimiser system

which compile, optimise, partition and schedule distributed query execution plans

Obtains and maintains metadata and computational resource information required for above

Grid Query Evaluator Service (GQES) Each GQES instance is an execution node and is

dynamically created by the GDQS on the node it is scheduled to run

A GQES is in charge of a partition of the query execution plan assigned to it by the GDQS and is responsible for dispatching the partial results to other GQESs.

Page 14: Service-Based Distributed Query Processing on the Grid

Setting up a GDQS Set-up strategy depends on the life-time

model of GDQS and GDSs GDQS instance is created per-client But it can serve multiple-queries This model avoids complexity of multi-user

interactions while ensures that the set-up cost is not high

Setup phase involves: Importing schemas of participating data sources Importing WSDL documents of participating

analysis services Collecting computational resource metadata (implicit)

Page 15: Service-Based Distributed Query Processing on the Grid

Issues in Initialisation Q: When is a GDQS

bound to a particular GDS?

A: When the schema of the GDS is imported.

Q: What is the lifespan of a GDS used by a GDQS?

A: The GDS is kept alive until the GDQS expires.

Q: Are GDSs shared by multiple GDQSs?

A: No.

Q: When is a GQES created?

A: When a query is about to be evaluated that needs it.

Q: What is the lifespan of a GQES?

A: It lasts only as long as a single query.

Q: Is a GQES shared among several queries or GDQSs?

A: No.

Page 16: Service-Based Distributed Query Processing on the Grid

G S H :G D Q S F

fi n dS e rvi ce D ata

cre a te

im p o rtS ch em a (G S H :G D S F , C o n fD o c)

fi n dS e rvi ce D ata

G S H :G D S F

N 2

N 3

C re ate (C on fD oc)

fi n dS e rvi ce D ataD B Sch em aG SH :G D S1

G S H :G D Q S 1

7

N 1113

2

1

5

6

f in d S er v iceD a ta C o nfig D o cs

4

G

Factory

G D S F

G

Re gi s try

G SG D S R GFactory G D Q S F

re g i s te r

GG D Q G D Q S 1

G Re gi s try G S

G D S R

GC lie n t

G S

re g i s te r

GG D SG D S 1

G S

1

8

Importing Schemas

Page 17: Service-Based Distributed Query Processing on the Grid

<GDQDataSourceList ><importedDataSource> <GDSFactoryHandle> http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactory</GDSFactoryHandle> <GDSCreateDocument>

<gridDataServiceFactoryCreate > <dataResourceName>

myDataResource </dataResourceName></gridDataServiceFactoryCreate>

</GDSCreateDocument></importedDataSource><importedService> <wsdlURL>

http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL

</wsdlURL></importedService>

</GDQDataSourceList>

An example of data source import list

Page 18: Service-Based Distributed Query Processing on the Grid

<request name = “myRequest”> <oqlQueryStatement name=“myStat"> <dataResource=“myGenomeDB"> <expression> 

select p.proteinId, Blast(p.sequence)from proteins p, proteinTerms twhere t.termId = ‘GO:0005942 ’ and p.proteinId = t.proteinId

</expression> </oqlQueryStatement > <deliverToGDT name="delivery">    <fromLocal name=“myStat"> <toGDT streamId="otherrequestasynch/d1" mode=“full"> http://ogsadai.org.uk/GDTService/my/GDT/GSH </toGDT> </deliverToGDT>  </request>

An example of a Query Document

Page 19: Service-Based Distributed Query Processing on the Grid

LogicalOptimiser

PhysicalOptimiser

Partitioner Scheduler

Evaluator

OQLParser

Single-nodeOptimiser

Multi-nodeoptimiser

Query Compilation

Page 20: Service-Based Distributed Query Processing on the Grid

Plan is expressed using a logical algebra.

Heuristic-based application of equivalence laws.

Multiple equivalent plans generated. scan

(protein)scantermID=…(proteinTerm)

reduce reduce

join(proteinId)

op_call(Blast)

reduce

Logical Optimisation

Page 21: Service-Based Distributed Query Processing on the Grid

Plan is expressed using a physical algebra.

Logical operators replaced with physical operators.

Cost-based ranking of plans. table_scan

(protein)index_scantermID=…(proteinTerm)

reduce reduce

hash_join(proteinId)

op_call(Blast)

reduce

Physical Optimisation

Page 22: Service-Based Distributed Query Processing on the Grid

Partitioning Plan is expressed

in a parallel algebra.

Parallel algebra = physical algebra + exchange.

Exchange operators are placed where data movement may be required.

table_scan(protein)

index_scantermID=…(proteinTerm)

reduce reduce

hash_join(proteinId)

op_call(Blast)

reduce

exchange exchange

exchange

Page 23: Service-Based Distributed Query Processing on the Grid

Scheduling Partitions are

allocated to Grid nodes; partitions may be merged during scheduling.

Expressed by decorating parallel algebra expression.

Heuristic algorithm considers memory use, network costs.

table_scan(protein)

table_scantermID=S92(proteinTerm)

reduce reduce

hash_join(proteinId)

op_call(Blast)

reduce

exchange exchange

exchange

3,4

1

1 2

Page 24: Service-Based Distributed Query Processing on the Grid

Query installation: GQESs created for partitions as

required. Partitions sent to GQESs.

Query evaluation: Partitions evaluated using iterator

model. Pipelined and partitioned parallelism. Results conveyed to client.

Query Evaluation

Page 25: Service-Based Distributed Query Processing on the Grid

<Partition isRoot="0"><evaluatorURI> http://mach1.cs.man.ac.uk:8080/ogsa/services/ogsadai/GQESFactory/GQES1</evaluatorURI> ... <Operator operatorID="2" operatorType="SEQ_SCAN"> <SEQ_SCAN>

<tupleType> <type> string </type> <name> proteinTerms.GOproteinID </name> <type> string </type> <name> proteinTerms.term </name></tupleType><inputOperator> <OperatorID> </OperatorID></inputOperator><DataResourceName>proteinTermsDataResource</DataResourceName><GDSHandle> http://mach1.cs.man.ac.uk:8080/…/GridDataServiceFactoryP2R1/GDS1 </GDSHandle><predicateExpr> <predicate>

<comparativeOperator>EQ</comparativeOperator><leftOperand name="proteinTerms.term" type="tuplefield"/><rightOperand name="GO:0008372" type="string"/>

</predicate></predicateExpr>

</SEQ_SCAN> </Operator>

... </Partition>

An example of a query sub-plan passed to a GQES

Page 26: Service-Based Distributed Query Processing on the Grid

GG D S G Q ES n

G D T

111

G D Q S

G D S

G D T

G D Q

G

Re gi s try

G SG D S R

C lie n t

GG D S G D S

In s tan ce s

GG D S G Q ES 1

G D T

. .

.

GFactory G D Q S F

114

113

112

8

5

114 .1

7

8

re gis te rSe rv ice

fi ndSe rv ice Da ta

cre a te Se rv ice

im portSche m a

fi ndSe rv ice Da ta ( DBSche m a )

G D T

G S

pe

rform

(qu

ery

Su

bP

lan

)pe r fo rm ( que ry )

pe rfo rm ( gqes_ query )

6

6

Interactions of SB-DQP components

Page 27: Service-Based Distributed Query Processing on the Grid

GFactory G Q ES F

GFactory G Q ES F

GFactory G Q ES F

N 2

N 1

N3

GC lie n tGG D S

GG D S

G D Q

G D T

G D Q S

N 0G D S

GFactory G Q ES F

N4

p erform (Q u ery)1

cre a te S e rv ice

cre a te S e rv ice2

cre ate S e rvi ce

2

2

GG D S G Q ES 2

G D T

GG D S G Q ES 3

G D T

GG D S G Q ES 1

G D T

GG D S G Q ES 1

G D T

p erform (Q u ery S u b p la n )

p erform (Q u ery S u b p la n )

perform(Q

uer ySu bpl an)

3

s eq u en t ial_ s can

red u ce (p r o tein ID ,s eq u en ce )

s eq u en t ial_ s can ( ter m = 8 3 7 2 )

red u ce (p r o tein ID )

h as h _ jo in(p .p r o tein ID = t.p r o tein ID )

3

o p erat io n _ callb la s t(p .s eq u en ce)

red u ce (p .p r o tein ID , b la s t)

o p erat io n _ callb la s t(p .s eq u en ce)

red u ce (p .p r o tein ID , b la s t)

3

W e b S e rvi ce s (B L A S T)

resu lts

resu lts

resu lts

4

1144

Page 28: Service-Based Distributed Query Processing on the Grid

Summary DQP on the Grid provides:

The normal benefits of DQP. Some added benefits from a Grid

setting. The Grid specifically enables:

Runtime computational resource discovery.

Dynamic creation of remote evaluators.

Authentication/Transport services. Access to non-database services.

Page 29: Service-Based Distributed Query Processing on the Grid

Features of Our GDQS Low cost of entry:

Imports source descriptions through GDSs. Imports service descriptions as WSDL.

Throw-away GDQS: Import sources on a task-specific basis. Discard GDQS when task completed.

Builds on parallel database technology: Implicit parallelism. Pipelined + partitioned parallel evaluation.

Public release in July 2003.

Page 30: Service-Based Distributed Query Processing on the Grid

The SB-DQP Team Manchester:

Nedim Alpdemir Anastasios

Gounaris Alvaro Fernandes Norman Paton Rizos Sakellariou

Newcastle: Arijit Mukherjee Jim Smith Paul Watson