Distributed Query Processing and Catalogs for Peer-to-Peer Systems

Professor: Iluju Kiringa
Students: Fan Yang, Libin Cai

Agenda

• About P2P
• Mutant Query Plan
• Distributed Catalog
• Intentional Statements
• Security and Privacy
• Conclusions

About P2P

• Advantages:
– Ease of deployment
– Ease of use
– Fault tolerance
– Scalability

• Limitations:
– Weak query capabilities
– No infrastructure for distributed queries
– Limitations in index scalability and result quality

A query example

User Bob wants to see a movie tonight.

Bob visits his favorite portal, BobsPortal.com.

Bob uses a GUI front end to compose an XML query over three XML documents: film reviews, preferences, and film showings.

FOR $r in document("film_reviews")//review,
    $g in document("preferences")//genre,
    $s in document("film_showings")/showing[date = "15 March 2002"]
WHERE $r/genre = $g AND $r/title = $s/title
RETURN <film> { $r/title } { $r/rating } { $s/theater } </film>

[2]
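To make the semantics of Bob's query concrete, here is a minimal Python sketch of the same select-and-join logic. It is only an illustration: the three in-memory documents, their root element names, and all titles, ratings, and theaters are invented sample data, and the function name films_for is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the element names (review, genre, showing, ...)
# follow the query in the slides.
REVIEWS = ET.fromstring(
    "<reviews>"
    "<review><title>Alpha</title><genre>comedy</genre><rating>4</rating></review>"
    "<review><title>Beta</title><genre>horror</genre><rating>5</rating></review>"
    "</reviews>"
)
PREFERENCES = ET.fromstring("<preferences><genre>comedy</genre></preferences>")
SHOWINGS = ET.fromstring(
    "<showings>"
    "<showing><title>Alpha</title><date>15 March 2002</date><theater>Roxy</theater></showing>"
    "<showing><title>Beta</title><date>15 March 2002</date><theater>Odeon</theater></showing>"
    "<showing><title>Alpha</title><date>16 March 2002</date><theater>Plaza</theater></showing>"
    "</showings>"
)


def films_for(reviews, preferences, showings, date):
    """Select showings on `date`, then join reviews with preferences (on genre)
    and with the selected showings (on title)."""
    liked = {g.text for g in preferences.iter("genre")}
    tonight = [s for s in showings.iter("showing") if s.findtext("date") == date]
    results = []
    for r in reviews.iter("review"):
        if r.findtext("genre") not in liked:
            continue
        for s in tonight:
            if s.findtext("title") == r.findtext("title"):
                results.append({
                    "title": r.findtext("title"),
                    "rating": r.findtext("rating"),
                    "theater": s.findtext("theater"),
                })
    return results


print(films_for(REVIEWS, PREFERENCES, SHOWINGS, "15 March 2002"))
```

In a P2P setting the three documents would live on different peers, which is what motivates the plan-shipping approach in the following slides.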

A query example (cont’)

The logical query plan has three kinds of elements:
• Regular query operators: select, join
• Pseudo-operators: document, display
• References to XML fragments

Query processing: logical query plan → physical query plan (query processing algorithm) → executed

[2]

Advent of Mutant Query Plan

• Why MQP?
– Copes with incomplete metadata
– Decentralizes query optimization and execution
– Respects the autonomy and the local policies of sites
– Adapts to server and network conditions even while being evaluated

• What is MQP?
– An algebraic query plan graph, encoded in XML, that may contain:
• References to resource locations (URLs)
• References to abstract resource names (URNs)
• Verbatim XML fragments
– Each MQP is tagged with a target, the address to which the result is sent once the MQP is fully evaluated.
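As a rough illustration of "a query plan graph encoded in XML", the sketch below builds a toy plan with Python's standard xml.etree.ElementTree. The element and attribute names (mqp, target, select, join, urn, data) are assumptions made up for this example and are not the encoding used in [1]; the URN follows the resource resolution example later in the deck, and the target address is a placeholder.

```python
import xml.etree.ElementTree as ET


def build_mqp(target):
    """Build a toy plan: display(select(join(URN, verbatim XML))).

    All tag and attribute names here are hypothetical.
    """
    mqp = ET.Element("mqp", {"target": target})
    display = ET.SubElement(mqp, "display")
    select = ET.SubElement(display, "select", {"predicate": "price <= 10"})
    join = ET.SubElement(select, "join", {"on": "title"})
    # Abstract resource name (URN) that some server still has to resolve.
    ET.SubElement(join, "urn").text = "urn:ForSale:Portland-CDs"
    # Verbatim XML fragment carried inside the plan itself.
    data = ET.SubElement(join, "data")
    cd = ET.SubElement(data, "cd")
    ET.SubElement(cd, "title").text = "Example Album"
    return mqp


def unresolved_urns(mqp):
    """List the URNs that remain in the plan."""
    return [u.text for u in mqp.iter("urn")]


plan = build_mqp("http://client.example/results")
print(ET.tostring(plan, encoding="unicode"))
print(unresolved_urns(plan))
```

A plan with no remaining URNs or URLs consists only of data and can be sent to its target.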

Mutant Query Processing

[1]

Mutant Query Plan Example

Garage Sale example:

Query: CDs for $10 or less in the Portland area.

MQP:

Regular query operators: select, join

Pseudo-operator: display

Constant piece of XML

URNs

[1]

Mutant Query Plan Example (cont’)

(a) Resolution and rewriting (b) reduction

[1]
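The resolution, rewriting, and reduction steps can be sketched as a single function that each server applies to a plan before forwarding it. This is a simplified model under stated assumptions, not the algorithm from [1]: plans are nested Python tuples rather than XML, only select and union operators are supported, and the catalog contents, seller URLs, and CD records are invented.

```python
def mutate(plan, catalog, local_data):
    """One server's pass over a plan.

    catalog:    URN -> list of URLs this server can resolve (hypothetical)
    local_data: URL -> records this server hosts (hypothetical)
    """
    kind = plan[0]
    if kind == "urn":
        urls = catalog.get(plan[1])
        if urls is None:
            return plan  # unknown here; leave it for another server
        # Resolution and rewriting: replace the URN by a union of URLs.
        return mutate(("union", [("url", u) for u in urls]), catalog, local_data)
    if kind == "url":
        if plan[1] in local_data:
            return ("data", list(local_data[plan[1]]))  # reduction to data
        return plan
    if kind == "data":
        return plan
    if kind == "union":
        children = [mutate(c, catalog, local_data) for c in plan[1]]
        if all(c[0] == "data" for c in children):
            return ("data", [row for c in children for row in c[1]])
        return ("union", children)
    if kind == "select":
        pred, child = plan[1], mutate(plan[2], catalog, local_data)
        if child[0] == "data":
            return ("data", [row for row in child[1] if pred(row)])
        return ("select", pred, child)
    raise ValueError(f"unknown node kind: {kind}")


# CDs for $10 or less, starting from an unresolved URN.
start = ("select", lambda cd: cd["price"] <= 10, ("urn", "urn:ForSale:Portland-CDs"))
index_catalog = {"urn:ForSale:Portland-CDs": ["http://seller1.example/", "http://seller2.example/"]}
seller1 = {"http://seller1.example/": [{"title": "A", "price": 8}, {"title": "B", "price": 15}]}
seller2 = {"http://seller2.example/": [{"title": "C", "price": 10}]}

resolved = mutate(start, index_catalog, {})  # index server resolves the URN
partly = mutate(resolved, {}, seller1)       # first seller reduces its own URL
final = mutate(partly, {}, seller2)          # second seller completes the plan
print(final)
```

Each call returns a new, smaller or more concrete plan; once the root is a data node, the plan is fully evaluated.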

Comparisons between Pipelined plan and Mutant plan

(a) Pipelined plan (b) mutant plan

[2]

Distributed Catalogs

• Question: how do peers find out about resources available at other peers?
• Answer: build distributed catalogs to route queries efficiently.
• Procedures:

– Peers use multi-hierarchic namespaces to categorize data;

– Data providers use multi-hierarchic namespaces to describe data they serve;

– Data consumers use them to formulate queries.

Multi-hierarchic Namespaces

Multi-hierarchic namespace: the set of categorization hierarchies relevant to an application domain. [1]

Interest area:

Second-hand armchairs in the Portland area:

[USA/OR/Portland, Furniture/Chairs]

A multi-hierarchic namespace with two categorization dimensions and two highlighted interest areas: (a) Vancouver-Portland furniture, (b) items in Portland

[1]
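One way to work with interest areas in code is to treat each dimension as a path in its hierarchy and an interest area as one path per dimension. The sketch below is an assumption-laden illustration: the function names parse_area, covers, and overlaps are hypothetical, "*" is read as the root of a hierarchy, and both areas are assumed to list the same dimensions in the same order.

```python
def parse_area(text):
    """Parse '[USA/OR/Portland, Furniture/Chairs]' into a tuple of category paths.

    '*' is treated as the root of its hierarchy, i.e. the empty path.
    """
    dims = text.strip()[1:-1].split(",")
    return tuple(() if d.strip() == "*" else tuple(d.strip().split("/")) for d in dims)


def covers(outer, inner):
    """True if `outer` contains `inner` in every dimension (prefix test on paths)."""
    return all(i[:len(o)] == o for o, i in zip(outer, inner))


def overlaps(a, b):
    """True if, in every dimension, one path is a prefix of the other."""
    return all(x[:len(y)] == y or y[:len(x)] == x for x, y in zip(a, b))


armchairs = parse_area("[USA/OR/Portland, Furniture/Chairs]")
portland_items = parse_area("[USA/OR/Portland, *]")
seattle_music = parse_area("[USA/WA/Seattle, Music]")

print(covers(portland_items, armchairs))   # items in Portland include Portland armchairs
print(covers(armchairs, portland_items))
print(overlaps(armchairs, seattle_music))
```

A provider would describe its data with such an area, and a consumer's query area can then be tested against it with overlaps.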

Peer Roles

Resource Resolution

• Authoritative server
– Strives to know about all base servers within its interest area.
– The known base servers in a particular interest area can be found through an authoritative index or meta-index server.

• Resource resolution
1. Seek an authoritative index or meta-index server.
2. Recursively follow the index references.
3. Find all the relevant base servers and data items.
4. Resolve the URN.

Example of Resource Resolution

• URN: urn:ForSale:Portland-CDs
• URLs: http://10.1.2.3:9020/, http://10.2.3.4:9020/
• Interest area: [USA/OR/Portland, Music/CDs]
• Authoritative meta-index server A: [USA, *]
• Index server B: [USA, Music]
• Index server C: [USA/OR, Music]
• Index server G: replaces the URN with the URLs

Route: query plan → A → B → C → … → G → http://10.1.2.3:9020/, http://10.2.3.4:9020/
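The recursive walk over index references can be sketched as follows. This is a toy model, not the protocol from [1]: the server table, the refs and urls fields, and the example.com-style seller URLs are invented, and the hops that the slide elides between C and G are simply left out, so C refers directly to G here.

```python
def is_prefix(p, q):
    return q[:len(p)] == p


def overlaps(a, b):
    """Two interest areas overlap if, per dimension, one path prefixes the other."""
    return all(is_prefix(x, y) or is_prefix(y, x) for x, y in zip(a, b))


USA, OR, PDX = ("USA",), ("USA", "OR"), ("USA", "OR", "Portland")
MUSIC, CDS, ANY = ("Music",), ("Music", "CDs"), ()

# Hypothetical catalog: interest area, referenced index servers, known base URLs.
SERVERS = {
    "A": {"area": (USA, ANY), "refs": ["B"], "urls": []},
    "B": {"area": (USA, MUSIC), "refs": ["C"], "urls": []},
    "C": {"area": (OR, MUSIC), "refs": ["G"], "urls": []},
    "G": {"area": (PDX, CDS), "refs": [],
          "urls": ["http://seller1.example/", "http://seller2.example/"]},
}


def resolve(area, name, servers, seen=None):
    """Follow index references from server `name`, collecting base URLs for `area`."""
    seen = set() if seen is None else seen
    if name in seen:
        return []  # guard against reference cycles
    seen.add(name)
    server = servers[name]
    if not overlaps(server["area"], area):
        return []  # this server cannot help with the requested area
    urls = list(server["urls"])
    for ref in server["refs"]:
        urls.extend(resolve(area, ref, servers, seen))
    return urls


print(resolve((PDX, CDS), "A", SERVERS))
print(resolve((("USA", "WA"), CDS), "A", SERVERS))
```

The first call walks A, B, C, and G and returns both seller URLs; the second stops at C because Oregon and Washington do not overlap.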

Intentional Statements

• Purposes:
– How can index and meta-index servers convey the relationships between the data they cover?
– How can mutant queries use this information to make intelligent choices about completeness, currency, and latency tradeoffs?

• Intentional statements:
– Describe relationships between index and meta-index servers; they can be expressed using coordination formulas.

Server R replicates everything from server S for the Portland category of the Location hierarchy:
base[Portland, *]@R = base[Portland, *]@S

The only Oregon sporting goods information that R holds is for Portland and Eugene golf clubs at S:
base[Oregon, Sporting Goods]@R = base[Portland, Golf Clubs]@S ∪ base[Eugene, Golf Clubs]@S

R indexes several base servers (S, T, and U):
index[Oregon, Golf Clubs]@R = base[Oregon, Golf Clubs]@S ∪ base[Oregon, Golf Clubs]@T ∪ base[Oregon, Golf Clubs]@U

Utilizing Intentional Statements (cont’)

• Processes:
– Whenever a server registers an interest area with the meta-index server, it provides intentional statements.
– Servers can then use such information in binding and routing MQPs.

Assumptions:

Meta-index server M knows about servers R and S

Interest areas: R: [Portland, Recreation]; S: [Oregon, Sporting Goods]

M receives an MQP that contains the resource name [Portland, Golf Clubs]

Then the name could be bound to: base[Portland, Golf Clubs]@R ∪ base[Portland, Golf Clubs]@S

If M knows the intentional statement base[Portland, Sporting Goods]@R = base[Portland, Sporting Goods]@S,

then it could bind to: base[Portland, Golf Clubs]@R | base[Portland, Golf Clubs]@S

Conclusion: the MQP could be routed to either R or S, but it need not go to both.
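The binding decision above can be sketched in a few lines. This is an illustrative model only: the hierarchy placement (Portland under Oregon, Golf Clubs under Sporting Goods under Recreation) is an assumption needed to make the example work, the function names are hypothetical, and equality statements are represented as plain (server, server, area) triples rather than coordination formulas.

```python
def covers(outer, inner):
    """True if `outer` contains `inner` in every dimension (prefix test on paths)."""
    return all(i[:len(o)] == o for o, i in zip(outer, inner))


# Assumed hierarchy paths.
OREGON, PORTLAND = ("Oregon",), ("Oregon", "Portland")
RECREATION = ("Recreation",)
SPORTING = RECREATION + ("Sporting Goods",)
GOLF = SPORTING + ("Golf Clubs",)


def interchangeable(s, t, resource, equalities):
    """True if some equality statement between s and t covers the resource."""
    return any({a, b} == {s, t} and covers(area, resource) for a, b, area in equalities)


def bind(resource, interest_areas, equalities):
    """Group relevant servers: servers in one group are alternatives ('|'),
    and the groups together must all be visited ('∪')."""
    relevant = [s for s, area in interest_areas.items() if covers(area, resource)]
    groups = []
    for s in relevant:
        for g in groups:
            if any(interchangeable(s, t, resource, equalities) for t in g):
                g.append(s)
                break
        else:
            groups.append([s])
    return groups


def render(groups, label):
    return " ∪ ".join(" | ".join(f"base{label}@{s}" for s in g) for g in groups)


areas = {"R": (PORTLAND, RECREATION), "S": (OREGON, SPORTING)}
resource = (PORTLAND, GOLF)
statement = [("R", "S", (PORTLAND, SPORTING))]

print(render(bind(resource, areas, []), "[Portland, Golf Clubs]"))
print(render(bind(resource, areas, statement), "[Portland, Golf Clubs]"))
```

Without the statement the two servers land in separate groups and the MQP must visit both; with it they share a group, so either one suffices.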

Utilizing Intentional Statements (cont’)

When queries do not run instantly:

Suppose server R replicates everything for Portland at S and possibly keeps additional data about Portland. R polls S every 30 minutes to update the data it replicates, so its copy can be up to 30 minutes out of date.

Intentional statement: base[Portland, *]@R ≥ base[Portland, *]@S{30}

A binding for resource [Portland, CDs] might then be: base[Portland, CDs]@R{30} | (base[Portland, CDs]@R ∪ base[Portland, CDs]@S){0}

Explanations:
• One can get an answer quickly by just routing the MQP to R, but that answer could be up to 30 minutes out of date.
• By routing the MQP to both R and S, one can have a complete and current answer.

Conclusions:
– It is impossible to guarantee that queries run instantly.
– Queries must compromise among latency, completeness, and currency.
– Replication cannot be both scalable and instantaneous.
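The latency versus currency tradeoff in this binding can be expressed as a small selection rule. The sketch below assumes a hypothetical Alternative record (a set of servers plus a staleness bound in minutes) and a hypothetical policy of visiting as few servers as possible among the alternatives that are fresh enough; neither is prescribed by [1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    servers: frozenset       # servers the MQP must visit for this alternative
    staleness_minutes: int   # the {n} annotation on the binding


# base[Portland, CDs]@R{30} | (base[Portland, CDs]@R ∪ base[Portland, CDs]@S){0}
BINDING = [
    Alternative(frozenset({"R"}), 30),
    Alternative(frozenset({"R", "S"}), 0),
]


def choose(alternatives, max_staleness_minutes):
    """Pick the alternative with the fewest servers that is fresh enough."""
    fresh = [a for a in alternatives if a.staleness_minutes <= max_staleness_minutes]
    if not fresh:
        raise ValueError("no alternative satisfies the currency requirement")
    return min(fresh, key=lambda a: (len(a.servers), a.staleness_minutes))


print(sorted(choose(BINDING, 60).servers))  # stale data tolerated: R alone
print(sorted(choose(BINDING, 0).servers))   # current data required: R and S
```

A query that tolerates 30-minute-old data is routed to R only, while a query that needs current data is routed to both R and S.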

What else could be in MQPs

• Accumulating catalog and statistics information

• Maintaining provenance
– Rewards system
– Meta-index updating
– Detection of spoofing

Security and Privacy

• Issues:
– With MQPs, partial results may be divulged to undesirable servers.

• Solutions:
– MQPs need to incorporate ordering and transfer policies.
– Encrypt data or data elements with a public key.
– MQPs can obtain answers while respecting the given server security policies.

Conclusions

• MQPs enable peers to independently optimize and partially evaluate queries without global knowledge and with a minimum of coordination overhead.

References

• [1] V. Papadimos, D. Maier, and K. Tufte. Distributed Query Processing and Catalogs for Peer-to-Peer Systems. OGI School of Science & Engineering, Oregon Health & Science University.

• [2] V. Papadimos and D. Maier. Distributed Queries without Distributed State. In Proc. of WebDB 2002, pages 95-100.

Thanks!

Questions?...
