data management in p2p systems: challenges and research issues · timos sellis-ntua data management...

Data Management in P2P Systems: Challenges and Research Issues

Timos SellisNational Technical Univ. of Athens

Timos Sellis-NTUA Data Management in P2P Systems 2

What is P2P?

Client Server Architecture

Peer-to-Peer Architecture


P2P: an old concept

P2P is not a new technologyOct. 29 1969: first transmission UCLA Stanford Research Inst. (SRI) [UCSB, U. Of Utah]; Peer computing status among independent computing sites

1970s-1980s: ARPANET completed (1978); Usenet and DNS1990s: NSFnet and Internet explosion1996: ICQ (bypassing DNS by offering own addressing scheme and allowing end points to be directly addressable with each other)


NAPSTER boosts P2P technology

1999: Napster is born

2002: Napster is gone…April 30, 2002:A study from WebSense indicates that the number of file-

swapping, and peer-to-peer websites, has grown by 535 percent in the past year, despite legal efforts to have them shut down.

According to the study findings, the number of p2p websites totals nearly 38,000.

universities notice firstmid-2001: shut down service; other apps emerge since 2000


The Evolution of P2P systems

First generation – centralized P2P systemsE.g. Napster, SETI@home

Second generation –decentralized & unstructured P2P systems

E.g. GnutellaThird generation—structured P2P systems

DHT systems (CAN/Chord/Pastry/Tapestry)Skip-list based systems

….


P2P Perspective: Self-Organized Systems

P2P Paradigm is characterized by:no centralized controladaptable behaviorself-organization

distribution of control ⇒ decentralizationlocal interactions, information and decision ⇒ autonomy

emergence of global structuresfailure resilience and scalability


P2P Paradigm Dimensions

Structure – FrameworkServers and clients

OrganizationStructured vs. unstructured overlays

Managed and stored dataFiles vs. (semi)-structured data

Functionality-ApplicationsSharing, resource management, web services, collaboration etc


Structure- Frameworks

Main differences of different structures: how information is shared and how much information is sharedModels:

CentralizedDecentralizedControlled-decentralized | Hybrids

Components: - client, server, servents


Centralized: pros and cons

Centralized servers keep a catalog of available files ⇒ Napster

ProsMore effective, comprehensive searches Access is controlled

ConsSystem has single points of entry; one fail could bring the whole system downBroken links, out of date information.


Decentralized Framework

Every user acts as a client, a server or both (servent). User connects to framework and becomes a member of the community, allowing others to connect through him/herUsers “talk” directly to other users with no intermediate or central authorityNo one entity controls the information that passes through the community

⇒Gnutella, Freenet


Decentralized: Pros and ConsPros

Peers “talk” directly with no central authority. Nobody owns the Gnutella Network and nobody can shut it down.Isolated node failure can quickly and automatically be worked around. Free loading Scalability

ConsSearches are less effective and can be slow.


Hybrids: Controlled decentralization

Characteristics of both centralized and decentralized frameworks, i.e. systems with super-peersUser’s computer may act as client, a serveror a servent; there are server operators which may control which clients and/or servents are allowed to access a particular server.Examples: Morpheus, Groove


Organization: Overlay Networks

Structured vs. unstructured overlaysStructured:

data is indexed and stored at peers with pre-specified characteristics, e.g. peer idPeers are connected to specific peers

Unstructured:No pre-specified connection of peers and storage/indexing of data


Managed and Stored Data

Simple files: search is performed with keywords for file names and file extensions

Structured or semi-structured data ⇒P2P Databases:

XML dataRelational dataRDF data


Key Application/Function Areas

File sharing / Content distribution:Not just mp3s/media sharing Distributed searching: Used to easily lookup and share files and offer content management

Instant messaging:Exchanging messages and chatting ⇒ Jabber, IM

Distributed computation:Use under utilized Internet and/or network resources for improving computation and data analysis ⇒ Grid Computing

P2P groupware / Collaboration:Cooperative publishing, messaging, group project management. Secure environments are offered in some products


What Has to Be Done

Issues to be worked out:Reliability

Is the provided content or the provider itself reliable? How can we assure these?

Resource managementHow can we achieve efficient collaboration of autonomous resources?

Security & PrivacyHow can we guarantee security together with privacy without rescinding peer autonomy?

Intellectual RightsLegal notions of copyright and intellectual property need to adapt to the sharing model of the P2P paradigm


Research Areas

Peer discovery and group management

Data location and placement

Reliable and efficient file exchange

Security/privacy/anonymity/trust


Relational data sharing in Unstructured P2P vs. Distributed DB

Can actually retrieve the complete set of answers.

Answers to queries are typically incomplete (by “completeness” we mean all answers that satisfy a query)

Exact location to direct the query is typically known.

Content location is typically by “word-of-mouth” e.g., node routes query to its neighbors and so on…

Have some knowledge of a shared schema. Queries: SQL

Usually no predetermined (global) schema among nodes. Queries: Keywords

Nodes are added/removed from the network in a controlled manner.

Nodes can join and leave the network anytime.

Distributed Database SystemsP2P


P2P & DB Systems

Lightweight

Fault Tolerance

Powerful query facilities

Transactions & Concurrency Control

Strong Semantics

Decentralized

Flexibility

P2P

DB


P2P + DB = ?

P2P Database? No!ACID transactional guarantees do not scale, nor does the everyday user want ACID semanticsMuch too heavyweight of a solution for the everyday user

Query Processing on P2P!Both P2P and DBs do data location and movementCan be naturally unified (lessons in both directions)P2P brings scalability & flexibilityDB brings relational model & query facilities

Taken from Hellerstein’s group ppt


Many New Challenges

Relative to other parallel/distributed systemsPartial failureChurnFew guarantees on transport, storage, etc.Huge optimization spaceNetwork bottlenecks & other resource constraintsNo administrative organizationsTrust issues: security, privacy, incentives

Relative to IP networkingMuch higher function, more flexibleMuch less controllable/predictable


Data management in P2P systems

Current research focuses onDecentralized schema mappings

PeerDB: unstruct. network, keyword search only

Extending DHT for complex queryingPIER : exact-match and join queries

Query reformulationEdutella: super-peer, RDF-based schemasPiazza: graph of pair-wise schema mappings

Replicationgenerally limited to static read-only filesP-Grid addresses updates in structured networks


Our model of P2P databases

All peers hold relational databasesThere is no mediation or “global schema”Peers only have a few acquaintees (neighbors), i.e. other peers they “know” and can “talk to”Peers receive queries, answer them and at the same time propagate them to their neighborsPeers use mappings to map their own schema to their neighbor’s schemas in order to correctly translate a query and propagate it (Query rewriting)There is ad-hoc and intermittent connectivity


Query Answering: Motivating Application

SELECT age,sex,symptoms,treatrec,drugnameFROM Patients,TreatmentsWHERE Patients.pid = Treatments.pidand Treatments.diseasedescr = "X"

Qorig

SELECT age,sexFROM Patients,ResultsWHERE Patients.pid = Results.pid andResults.diagnosis = "X"

Qorig_LDB

SELECT age,sexFROM Patients,TreatmentsWHERE Patients.pid = Treatments.pid andTreatments.diseasedescr = "X"

QLDB_StuartDB


Traditional Query Propagation (Contd.)

Each peer that receives a query QRewrites it according to its local schema (Q’)Answers Q’Forwards Q’ to one or more of its neighbors

Peers with dissimilar schemas hide nodes rich in informationBad rewritings have the same effectLocal-only knowledge in the overlay puts constraints on the propagation of a query

Bandwidth limitationsCannot keep track of all nodes!


People have a similar problemEN-GR

Dictionary

EN

GR

GR-FR Dictionary

FR

EN

IT

FR-IT DictionaryIT-EN

Dictionary

+ EN

+ EN

+ EN

+ EN


Overview of Our Framework - GrouPeer

Forward both the original and the rewrittenversion of QAnswer the one most similar to the local schema

Similarity functionat the same time …..

Overlay restructuring according to evaluation of the results (learning new neighbors)

but do this in an…..Intelligent, bandwidth-efficient query forwarding with informed walks


Answering the Original QueryAnswering the Original Query

SELECT age,sex,symptoms,treatrec,drugnameFROM Patients,TreatmentsWHERE Patients.pid = Treatments.pidand Treatments.diseasedescr = "X"

Qorig


QLDB_StuartDBSELECT age,sex,symptoms,treatrec,drugnameFROM Patients,TreatmentsWHERE Patients.pid = Treatments.pid andTreatments.diseasedescr = "X"

Qorig_StuartDB


Similarity Measure

Simple structural similarity between 2 versions of Q:Msim(Qv1, Qv2) correlates simple conceptual similarities of structural query elements (i.e. ‘select’attributes and ‘where’ conditions )

Different at each peerSchema matching methodologyIndividual preferences


Confidence Parameter θ

Express each peer’s individualitySchema matching abilityQuery structureθ = θp(Q)

Each peer decides which query to answer (original, rewritten) according to:Covp(Q2, Q1) = Msim p(Q1, Q2)· θp(Q1)


Example Query Answering


CovStuartDB(QQLDB_Stuart, Qorig) = 0.4 · 1 =0.4

SELECT age,sex,symptoms,treatrec,drugnameFROM Patients,TreatmentsWHERE Patients.pid = Treatments.pid andTreatments.diseasedescr = "X"

CovStuartDB(QQorig_Stuart, Qorig) = 1.0 · 0.5 =0.5


Learning Process

• Data providers return to the query originator:– Which version they answered– Successively rewritten query– Locally rewritten query– Answer tuples

• The originators evaluate the answers:

and


Semantic Grouping of Peers

The learning process leads to semantic clustering of the P2P networkEach semantic cluster can create an explicit interest group:

Group Initiator: Peer that initiates the group creationGroup Schema: An abstract schema with most common schema elements of the cluster


Routing and Restructuring in the Overlay

Informed walks instead of blind forwardingK walkers, TTL steps eachChoose neighbors according to schema similarity

Adding an acquaintance p:When θp = θlocal

More than T replies from pImpose an maximum number of neighbors

Dropping an acquaintanceLow(est) similarity neighborsOnly if they don’t get disconnected


Simulation Methodology

Simulator in C Pure P2P model, 1-10K graphs (power-law and random)100 attributes, random assignment (10 avg.)Queries on local schemas (3 attributes avg.)100 requestersCompared methods

Naïve rewriting (naïve)Flooding with each peer answering Qorig (Brute) – and some variations


Results

100 200 300 400 500Queries per Requester

50

60

70

80

90

100

Sim

ilari

ty (

%)

GrouPeer totalGrouPeer original QuGrouPeer rewritten QuNaive

Similarity of results to the original query over variable number of queries per requester


Results

Ratio of the average similarity GrouPeer’s clustering achieves versus the optimal, given an equal number of aquaintees

100 200 300 400 500Queries per Requester

40

50

60

70

80

90

100

% o

f op

timal

clu

ster

ing

500 requ100 requ10 requ


Related Work

PeerDBQuery Reformulation (Halevy et al)EdutellaChatty Web


Handling Spatial Data

Problem DefinitionAssumptions:

Dissemination of spatial data in autonomous nodesStructured flat P2P overlay

Problem:Necessity for indexing and routing techniques for such an environment


Technique Requirements

We need a technique that:Guarantees the retrieval of any existing spatial information in the systemAchieves satisfying index size for any nodeAchieves a satisfying length for any search pathProvides easier access to popular informationPreserves proximity of areas


Related Work

VBI-tree (ICDE ‘05)P2PR-tree (Workshop of EDBT ‘04)Quad-tree approach (Poster in ICDE’05)kd-tree approach (WebDB’04)


Specification of Spatial Areas

Partitioning Space: We assume a basic fine-grained n×n gridSpecifying Spatial Areas:We consider any i×i area on the basic grid, 1≤i≤n⇒ Each i×i area belongs to an i×i gridThe id of an area A is: A ≡ azaxayaz: is the granularity level, i.e. the size in terms of basic cells.ax: x coordinator of the center basic cell.ay: y coordinator of the center basic cell.


Grid & Area Examples

Basic 5x5 grid

1 2 3 4 5

5

4

3

2

1

324

The four 2nd level grids (z=2)


Distributed Indexing for Spatial Areas

Using Chord as the basic DHT we index areas as follows:

Distributed Indexing AlgorithmStep 1: Indexing the grids of each level

Phase A: Indexing areas of the same gridPhase B: Indexing areas of other grids of the same level

Step 2: Indexing the grids of other levels


Indexing Examples

Basic 5x5 grid

Area stored in theindexing peer

Indexing non-overlapping areasof the same size

Example of indexing on thebasic grid with Chord of base 2

Example of indexing ona 2nd level grid with

Chord of base 2

Indexing overlapping areas ofthe same sizeIndexing areas of different size


Direct Neighbors & Data Hashing

Peers have neighbors on each dimension:Horizontal VerticalPerpendicular

Neighbors are the closest peers on each direction. Areas are stored to the closest neighbor following a

priorities of dimensions


Direct Neighbors: Examples

horizontal successor

verticalsuccessor

horizontalpredecessor

verticalpredecessor

1st quadrant2nd quadrant

3rd quadrant 4th quadrant


Routing of Spatial Queries

Basic Routing AlgorithmStep 1: Find a peer corresponding to the same size as the sought area.

Step 2: Find a peer hosting the sought area or an overlapping area of the latter

Step 3: Find the peer hosting the sought area


Complexity Characteristics

Complexity for both pure Chord and our method is O(3log2n).

But with our method:Most Important: relatively close areas are stored in close peersThe search path grows with the relative distancePeers with big areas are favored with small indexesShorter search paths for bigger areas


Summary

P2P systems offer effective use of distributed resourcesStandards for P2P infrastructure are emerging

Sun (JXTA) and Microsoft (.NET)Content choice and control

The user becomes not only a consumer but a content provider; P2P allows the end user to participate in the Internet, as the latter was originally envisioned

AN EXCITING NEW AREA FOR DATABASE MANAGEMENT

data management in p2p systems: challenges and research issues · timos sellis-ntua data management...

Documents