Distributed Query Processing and Catalogs for Peer-to-Peer Systems



Page 1: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Professor: Iluju Kiringa
Students: Fan Yang, Libin Cai

Page 2: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Agenda

• About P2P
• Mutant Query Plan
• Distributed Catalog
• Intentional Statements
• Security and Privacy
• Conclusions

Page 3: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

About P2P

• Advantages:
– Ease of deployment
– Ease of use
– Fault tolerance
– Scalability

• Limitations:
– Weak query capabilities
– No infrastructure for distributed queries
– Limitations in index scalability and result quality

Page 4: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

A query example

User Bob wants to see a movie tonight. He visits his favorite portal, BobsPortal.com, and uses a GUI front-end to compose an XML query over three XML documents: film reviews, preferences, and film showings. [2]

    FOR $r in document("film_reviews")//review,
        $g in document("preferences")//genre,
        $s in document("film_showings")/showing[date = "15 March 2002"]
    WHERE $r/genre = $g AND $r/title = $s/title
    RETURN <film> { $r/title } { $r/rating } { $s/theater } </film>

Page 5: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

A query example (cont'd)

The logical query plan has three kinds of elements:
– Regular query operators: select, join
– Pseudo-operators: document, display
– References to XML fragments

[Figure: query processing flow, with the labels "logical query plan", "physical query plan", "query processing algorithm" and "executed"]

[2]

Page 6: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Advent of Mutant Query Plan

• Why MQP?
– Copes with incomplete metadata
– Decentralizes query optimization and execution
– Respects the autonomy and the local policies of sites
– Adapts to server and network conditions even while being evaluated

• What is an MQP?
– An algebraic query plan graph, encoded in XML, that may include:
  • References to resource locations (URLs)
  • References to abstract resource names (URNs)
  • Verbatim XML fragments
– Each MQP is tagged with a target: the network address to which the result is sent once the MQP is fully evaluated.
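To make the idea of an XML-encoded plan graph concrete, here is a minimal Python sketch. The slides do not give a schema, so the element names (mqp, select, join, urn, data), the attribute names and the target address are all assumptions made for illustration, and the display pseudo-operator is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of a mutant query plan. Element and attribute names
# are assumptions for illustration; the slides do not specify a schema.
PLAN_XML = """
<mqp target="http://client.example/results">
  <join on="title">
    <select predicate="price &lt;= 10">
      <urn name="urn:ForSale:Portland-CDs"/>
    </select>
    <data>
      <cd><title>Kind of Blue</title></cd>
    </data>
  </join>
</mqp>
""".strip()


def unresolved_urns(plan):
    """Abstract resource names that still have to be resolved to URLs."""
    return [leaf.get("name") for leaf in plan.iter("urn")]


def is_fully_evaluated(plan):
    """True when the plan has been reduced to a single verbatim XML fragment."""
    children = list(plan)
    return len(children) == 1 and children[0].tag == "data"


plan = ET.fromstring(PLAN_XML)
print(plan.get("target"))        # where the final result should be sent
print(unresolved_urns(plan))     # names a catalog server must still resolve
print(is_fully_evaluated(plan))  # False: operators and a URN remain
```

A server receiving such a plan could resolve the URNs it knows about, evaluate the sub-plans it can, and forward the rewritten plan; once only a data fragment remains, the result goes to the target.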

Page 7: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Mutant Query Processing

[1]

Page 8: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Mutant Query Plan Example

Garage Sale example:

Query: CDs for $10 or less in the Portland area.

The MQP contains:
– Regular query operators: select, join
– Pseudo-operator: display
– A constant piece of XML
– URNs

[1]

Page 9: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Mutant Query Plan Example (cont'd)

[Figure: (a) resolution and rewriting; (b) reduction]

[1]

Page 10: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Comparison between Pipelined Plan and Mutant Plan

[Figure: (a) pipelined plan; (b) mutant plan]

[2]

Page 11: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Distributed Catalogs

• Question: how do peers find out which resources are available at other peers?
– Answer: build distributed catalogs to route queries efficiently.

• Procedures:
– Peers use multi-hierarchic namespaces to categorize data;
– Data providers use multi-hierarchic namespaces to describe the data they serve;
– Data consumers use them to formulate queries.

Page 12: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Multi-hierarchic Namespaces

Multi-hierarchic namespace: the set of categorization hierarchies relevant to an application domain. [1]

Interest area: one category chosen from each hierarchy. For example, second-hand armchairs in the Portland area:

[USA/OR/Portland, Furniture/Chairs]

[Figure: a multi-hierarchic namespace with two categorization dimensions and two highlighted interest areas: (a) Vancouver-Portland furniture, (b) items in Portland]

[1]
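The notion of one interest area covering or overlapping another can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an interest area is modeled as a tuple with one slash-separated category path per hierarchy, "*" stands for the root of a hierarchy, and the function names are hypothetical.

```python
def covers_category(general, specific):
    """True if `general` equals `specific` or is one of its ancestors.

    Categories are slash-separated paths; "*" denotes the hierarchy root.
    """
    if general == "*":
        return True
    if specific == "*":
        return False
    g, s = general.split("/"), specific.split("/")
    return s[:len(g)] == g


def covers(area_a, area_b):
    """area_a covers area_b if it covers it in every dimension."""
    return all(covers_category(a, b) for a, b in zip(area_a, area_b))


def overlaps(area_a, area_b):
    """Two areas overlap if, in every dimension, one category contains the other."""
    return all(
        covers_category(a, b) or covers_category(b, a)
        for a, b in zip(area_a, area_b)
    )


portland_chairs = ("USA/OR/Portland", "Furniture/Chairs")
portland_items = ("USA/OR/Portland", "*")

print(covers(portland_items, portland_chairs))     # items in Portland include Portland chairs
print(covers(portland_chairs, portland_items))     # but not the other way around
print(overlaps(("USA/WA", "*"), portland_chairs))  # Washington items do not overlap
```

A server would only need the overlap test to decide whether a query's interest area is relevant to the data it indexes.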

Page 13: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Peer Roles

Page 14: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Resource Resolution

• Authoritative server
– Strives to know about all base servers within its interest area.
– Through an authoritative index or meta-index server, the known base servers in a particular interest area can be found.

• Resource resolution
1. Seek an authoritative index or meta-index server
2. Recursively follow the index references
3. Find all the relevant base servers and data items
4. Resolve the URN
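The four resolution steps above can be sketched as a recursive walk over a catalog. Everything in this Python example is hypothetical: the server names, the example.org URLs, and the in-memory CATALOG dictionary standing in for what would really be network calls between peers.

```python
def is_ancestor_or_same(general, specific):
    """Slash-separated category paths; "*" is the hierarchy root."""
    if general == "*":
        return True
    if specific == "*":
        return False
    g, s = general.split("/"), specific.split("/")
    return s[:len(g)] == g


def overlaps(area_a, area_b):
    return all(
        is_ancestor_or_same(a, b) or is_ancestor_or_same(b, a)
        for a, b in zip(area_a, area_b)
    )


# Hypothetical catalog: index and meta-index servers hold references ("refs"),
# base servers hold data reachable at a URL.
CATALOG = {
    "meta": {"area": ("USA", "*"), "refs": ["index-music", "index-furniture"]},
    "index-music": {"area": ("USA/OR", "Music"), "refs": ["base-1", "base-2"]},
    "index-furniture": {"area": ("USA/OR", "Furniture"), "refs": ["base-3"]},
    "base-1": {"area": ("USA/OR/Portland", "Music/CDs"), "url": "http://base1.example.org/"},
    "base-2": {"area": ("USA/OR/Eugene", "Music/CDs"), "url": "http://base2.example.org/"},
    "base-3": {"area": ("USA/OR/Portland", "Furniture/Chairs"), "url": "http://base3.example.org/"},
}


def resolve(server, query_area, catalog, seen=None):
    """Follow index references recursively and collect relevant base-server URLs."""
    seen = set() if seen is None else seen
    if server in seen:  # guard against cyclic references
        return []
    seen.add(server)
    entry = catalog[server]
    if not overlaps(entry["area"], query_area):
        return []  # this server's interest area is irrelevant to the query
    if "url" in entry:  # base server reached
        return [entry["url"]]
    urls = []
    for ref in entry.get("refs", []):
        urls.extend(resolve(ref, query_area, catalog, seen))
    return urls


# Step 1 starts at the authoritative meta-index server; step 4 would replace
# the URN in the plan with the URLs returned here.
print(resolve("meta", ("USA/OR/Portland", "Music/CDs"), CATALOG))
```

Pruning by interest-area overlap is what keeps the walk from visiting every server the meta-index knows about.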

Page 15: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Example of Resource Resolution

• URN: urn:ForSale:Portland-CDs
• URLs: http://10.1.2.3:9020/, http://10.2.3.4:9020/
• Interest area: [USA/OR/Portland, Music/CDs]
• Authoritative meta-index server A: [USA, *]
• Index server B: [USA, Music]
• Index server C: [USA/OR, Music]
• Index server G: replaces the URN with the URLs

Query plan → A → B → C → … → G → http://10.1.2.3:9020/, http://10.2.3.4:9020/

Page 16: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Intentional Statements

• Purposes:
– How can index and meta-index servers convey the relationships between the data they cover?
– How can mutant queries use this information to make intelligent choices about completeness, currency and latency tradeoffs?

• Intentional statements:
– Describe relationships between index and meta-index servers; they can be expressed using coordination formulas.

• Examples:
– Server R replicates everything from server S for the Portland category of the Location hierarchy:
  base[Portland, *]@R = base[Portland, *]@S
– The only Oregon sporting goods information that R holds is the Portland and Eugene golf clubs data from S:
  base[Oregon, Sporting Goods]@R = base[Portland, Golf Clubs]@S ∪ base[Eugene, Golf Clubs]@S
– R indexes several base servers (S, T and U):
  index[Oregon, Golf Clubs]@R = base[Oregon, Golf Clubs]@S ∪ base[Oregon, Golf Clubs]@T ∪ base[Oregon, Golf Clubs]@U

Page 17: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Utilizing Intentional Statements (cont’)

• Processes:
– Whenever a server registers an interest area with the meta-index server, it provides intentional statements.
– Servers can then use this information in binding and routing MQPs.

• Assumptions:
– Meta-index server M knows about servers R and S.
– Interest areas: R [Portland, Recreation], S [Oregon, Sporting Goods].
– M receives an MQP that contains the resource name [Portland, Golf Clubs].

• Without further information, the name is bound to the union of both servers:
  base[Portland, Golf Clubs]@R ∪ base[Portland, Golf Clubs]@S

• If M knows the intentional statement
  base[Portland, Sporting Goods]@R = base[Portland, Sporting Goods]@S
  then it can bind the name to either server:
  base[Portland, Golf Clubs]@R | base[Portland, Golf Clubs]@S

• Conclusion: the MQP could be routed to either R or S, but it need not go to both.
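The binding decision above can be sketched in Python as grouping candidate servers that are known to hold the same data. This is a simplified illustration, not the authors' algorithm: the function names are hypothetical, and each equivalence is assumed to have already been checked to apply to the resource being bound (here, [Portland, Golf Clubs] falls under [Portland, Sporting Goods]).

```python
def bind(candidates, equivalences):
    """Group candidate servers into sets of interchangeable alternatives.

    `equivalences` is a set of frozenset pairs, one per intentional statement
    saying two servers hold the same data for the resource. The MQP must
    visit one server from every group.
    """
    groups = []
    for server in candidates:
        for group in groups:
            if any(frozenset((server, other)) in equivalences for other in group):
                group.append(server)
                break
        else:
            groups.append([server])  # no known equivalent: a new group
    return groups


def render(resource, groups):
    """Write the binding using '|' for alternatives and '∪' for unions."""
    area = ", ".join(resource)
    parts = [" | ".join(f"base[{area}]@{s}" for s in group) for group in groups]
    if len(parts) == 1:
        return parts[0]
    return " ∪ ".join(f"({p})" if " | " in p else p for p in parts)


resource = ("Portland", "Golf Clubs")

# Without intentional statements, both servers must be consulted.
print(render(resource, bind(["R", "S"], set())))

# With the statement that R and S are equivalent, either one suffices.
print(render(resource, bind(["R", "S"], {frozenset(("R", "S"))})))
```

Each group with more than one member is a routing choice the server can make based on load or network conditions.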

Page 18: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Utilizing Intentional Statements (cont’)

For queries that do not run instantly:

• Suppose:
– Server R replicates everything for Portland at S, and possibly keeps additional data about Portland.
– R polls S every 30 minutes to update the data it replicates, so R's copy can be up to 30 minutes out of date.

• Intentional statement:
  base[Portland, *]@R ⊇ base[Portland, *]@S{30}

• A binding for resource [Portland, CDs] might then be:
  base[Portland, CDs]@R{30} | (base[Portland, CDs]@R ∪ base[Portland, CDs]@S){0}

• Explanations:
– Routing the MQP to R alone gives a quick answer, but that answer could be up to 30 minutes out of date.
– Routing the MQP to both R and S gives a complete and current answer.

• Conclusions:
– It is impossible to guarantee that queries run instantly.
– Compromises must be made among latency, completeness and currency.
– Replication cannot be both scalable and instantaneous.
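The latency versus currency tradeoff can be illustrated with a small Python sketch that picks among the alternatives of a binding. The Alternative class and the choose function are hypothetical, and the number of servers visited is used as a crude stand-in for latency, which is an assumption rather than anything stated in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    servers: tuple          # servers the MQP must visit for this alternative
    staleness_minutes: int  # worst-case age of the answer, e.g. {30} or {0}


# The binding for [Portland, CDs]: R alone (up to 30 minutes stale),
# or R together with S (current).
BINDING = [
    Alternative(("R",), 30),
    Alternative(("R", "S"), 0),
]


def choose(binding, max_staleness_minutes):
    """Pick the alternative visiting the fewest servers that is fresh enough.

    Returns None if no alternative meets the staleness bound.
    """
    acceptable = [a for a in binding if a.staleness_minutes <= max_staleness_minutes]
    if not acceptable:
        return None
    return min(acceptable, key=lambda a: (len(a.servers), a.staleness_minutes))


print(choose(BINDING, 30).servers)  # a tolerant query can go to R alone
print(choose(BINDING, 5).servers)   # a currency-sensitive query needs R and S
```

A query that tolerates half an hour of staleness is routed to one server, while one that needs current data pays for the extra hop.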

Page 19: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

What else could be in MQPs

• Accumulating catalog and statistics information

• Maintaining provenance
– Rewards system
– Meta-index updating
– Detection of spoofing

Page 20: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Security and Privacy

• Issues:
– With MQPs, partial results may be divulged to undesirable servers.

• Solutions:
– MQPs need to incorporate ordering and transfer policies.
– Data or individual data elements can be encrypted with the public key.
– MQPs can obtain answers while complying with given server security policies.

Page 21: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Conclusions

• Mutant query plans and distributed catalogs enable peers to independently optimize and partially evaluate queries without global knowledge, and with a minimum of coordination overhead.

Page 22: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

References

• [1] V. Papadimos, D. Maier and K. Tufte. Distributed Query Processing and Catalogs for Peer-to-Peer Systems. OGI School of Science & Engineering, Oregon Health & Science University.

• [2] V. Papadimos and D. Maier. Distributed Queries without Distributed State. In Proc. of WebDB 2002, pages 95-100.

Page 23: Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Thanks!

Questions?...