Data Management in P2P Systems: Challenges and Research Issues
Timos Sellis
National Technical Univ. of Athens
Timos Sellis-NTUA Data Management in P2P Systems 2
What is P2P?
Client Server Architecture
Peer-to-Peer Architecture
P2P: an old concept
P2P is not a new technology
  Oct. 29, 1969: first transmission, UCLA to Stanford Research Inst. (SRI) [UCSB, U. of Utah]; peer computing status among independent computing sites
  1970s-1980s: ARPANET completed (1978); Usenet and DNS
  1990s: NSFnet and Internet explosion
  1996: ICQ (bypassing DNS by offering its own addressing scheme and allowing end points to be directly addressable with each other)
NAPSTER boosts P2P technology
1999: Napster is born
2002: Napster is gone…
April 30, 2002: A study from WebSense indicates that the number of file-swapping and peer-to-peer websites has grown by 535 percent in the past year, despite legal efforts to have them shut down. According to the study findings, the number of P2P websites totals nearly 38,000.
Universities notice first
mid-2001: shut down service; other apps emerge since 2000
The Evolution of P2P systems
First generation – centralized P2P systems
  E.g. Napster, SETI@home
Second generation – decentralized & unstructured P2P systems
  E.g. Gnutella
Third generation – structured P2P systems
  DHT systems (CAN/Chord/Pastry/Tapestry)
  Skip-list based systems
  …
P2P Perspective: Self-Organized Systems
The P2P paradigm is characterized by:
  no centralized control
  adaptable behavior
  self-organization
distribution of control ⇒ decentralization
local interactions, information and decisions ⇒ autonomy
emergence of global structures
failure resilience and scalability
P2P Paradigm Dimensions
Structure – Framework
  Servers and clients
Organization
  Structured vs. unstructured overlays
Managed and stored data
  Files vs. (semi-)structured data
Functionality – Applications
  Sharing, resource management, web services, collaboration, etc.
Structure – Frameworks
The structures differ mainly in how information is shared and how much of it is shared
Models:
  Centralized
  Decentralized
  Controlled-decentralized | Hybrids
Components: client, server, servents
Centralized: pros and cons
Centralized servers keep a catalog of available files ⇒ Napster
Pros
  More effective, comprehensive searches
  Access is controlled
Cons
  System has single points of entry; one failure could bring the whole system down
  Broken links, out-of-date information
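A minimal sketch of the centralized model may help here: one catalog maps file names to the peers that hold them. The class name `Catalog` and its methods are hypothetical and only illustrate the idea; they are not Napster's actual protocol.

```python
from collections import defaultdict


class Catalog:
    """Hypothetical central catalog: file name -> set of peers offering it."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, peer, files):
        # A peer announces the files it shares when it connects.
        for name in files:
            self._index[name].add(peer)

    def unregister(self, peer):
        # If a peer leaves without this call, its entries become broken links.
        for holders in self._index.values():
            holders.discard(peer)

    def search(self, keyword):
        # Keyword match on file names; one lookup sees every registered peer.
        kw = keyword.lower()
        return {
            name: sorted(holders)
            for name, holders in self._index.items()
            if kw in name.lower() and holders
        }


catalog = Catalog()
catalog.register("alice", ["song.mp3", "notes.txt"])
catalog.register("bob", ["song.mp3"])
before = catalog.search("song")
catalog.unregister("bob")
after = catalog.search("song")
```

The search is comprehensive because every peer registers in one place, and for the same reason the catalog is the single point of failure listed under the cons.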
Decentralized Framework
Every user acts as a client, a server or both (servent)
User connects to the framework and becomes a member of the community, allowing others to connect through him/her
Users “talk” directly to other users with no intermediate or central authority
No one entity controls the information that passes through the community
⇒ Gnutella, Freenet
Decentralized: Pros and Cons
Pros
  Peers “talk” directly with no central authority. Nobody owns the Gnutella Network and nobody can shut it down.
  Isolated node failure can quickly and automatically be worked around.
  Free loading
  Scalability
Cons
  Searches are less effective and can be slow.
Hybrids: Controlled decentralization
Characteristics of both centralized and decentralized frameworks, i.e. systems with super-peers
User’s computer may act as a client, a server or a servent; there are server operators which may control which clients and/or servents are allowed to access a particular server.
Examples: Morpheus, Groove
Organization: Overlay Networks
Structured vs. unstructured overlays
Structured:
  Data is indexed and stored at peers with pre-specified characteristics, e.g. peer id
  Peers are connected to specific peers
Unstructured:
  No pre-specified connection of peers and storage/indexing of data
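As an illustration of "stored at peers with pre-specified characteristics", the sketch below places a key on the peer whose id follows the key's hash on a ring, in the spirit of DHT systems. The function names, the SHA-1 hash and the 16-bit ring are assumptions made for the example, not details from the slides.

```python
import hashlib
from bisect import bisect_left


def ring_id(name: str, bits: int = 16) -> int:
    """Hash a peer name or data key onto a ring of size 2**bits (assumed size)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def responsible_peer(key: str, peers: list[str], bits: int = 16) -> str:
    """Return the peer whose ring id is the first one at or after the key's id."""
    ring = sorted((ring_id(p, bits), p) for p in peers)
    ids = [pid for pid, _ in ring]
    # Wrap around to the first peer when the key hashes past the last id.
    idx = bisect_left(ids, ring_id(key, bits)) % len(ring)
    return ring[idx][1]


peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
owner = responsible_peer("song.mp3", peers)
```

Any peer that knows the ids can compute the same owner, which is what lets a structured overlay route a lookup to a specific peer instead of flooding neighbors.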
Managed and Stored Data
Simple files: search is performed with keywords for file names and file extensions
Structured or semi-structured data ⇒ P2P Databases:
  XML data
  Relational data
  RDF data
Key Application/Function Areas
File sharing / Content distribution: Not just mp3s/media sharing
Distributed searching: Used to easily look up and share files and offer content management
Instant messaging: Exchanging messages and chatting ⇒ Jabber, IM
Distributed computation: Use underutilized Internet and/or network resources for improving computation and data analysis ⇒ Grid Computing
P2P groupware / Collaboration: Cooperative publishing, messaging, group project management. Secure environments are offered in some products
What Has to Be Done
Issues to be worked out:
Reliability
  Is the provided content or the provider itself reliable? How can we ensure this?
Resource management
  How can we achieve efficient collaboration of autonomous resources?
Security & Privacy
  How can we guarantee security together with privacy without rescinding peer autonomy?
Intellectual Rights
  Legal notions of copyright and intellectual property need to adapt to the sharing model of the P2P paradigm
Research Areas
Peer discovery and group management
Data location and placement
Reliable and efficient file exchange
Security/privacy/anonymity/trust
Relational data sharing in Unstructured P2P vs. Distributed DB
Distributed Database Systems | P2P
Nodes are added/removed from the network in a controlled manner. | Nodes can join and leave the network anytime.
Have some knowledge of a shared schema. Queries: SQL | Usually no predetermined (global) schema among nodes. Queries: keywords
Exact location to direct the query is typically known. | Content location is typically by “word-of-mouth”, e.g., a node routes the query to its neighbors and so on…
Can actually retrieve the complete set of answers. | Answers to queries are typically incomplete (by “completeness” we mean all answers that satisfy a query)
P2P & DB Systems
P2P: Lightweight; Fault Tolerance; Decentralized; Flexibility
DB: Powerful query facilities; Transactions & Concurrency Control; Strong Semantics
P2P + DB = ?
P2P Database? No!
  ACID transactional guarantees do not scale, nor does the everyday user want ACID semantics
  Much too heavyweight a solution for the everyday user
Query Processing on P2P!
  Both P2P and DBs do data location and movement
  Can be naturally unified (lessons in both directions)
  P2P brings scalability & flexibility
  DB brings relational model & query facilities
Taken from Hellerstein’s group ppt
Many New Challenges
Relative to other parallel/distributed systems
  Partial failure
  Churn
  Few guarantees on transport, storage, etc.
  Huge optimization space
  Network bottlenecks & other resource constraints
  No administrative organizations
  Trust issues: security, privacy, incentives
Relative to IP networking
  Much higher function, more flexible
  Much less controllable/predictable
Data management in P2P systems
Current research focuses on:
Decentralized schema mappings
  PeerDB: unstructured network, keyword search only
Extending DHTs for complex querying
  PIER: exact-match and join queries
Query reformulation
  Edutella: super-peer, RDF-based schemas
  Piazza: graph of pair-wise schema mappings
Replication
  Generally limited to static read-only files
  P-Grid addresses updates in structured networks
Our model of P2P databases
All peers hold relational databases
There is no mediation or “global schema”
Peers only have a few acquaintees (neighbors), i.e. other peers they “know” and can “talk to”
Peers receive queries, answer them and at the same time propagate them to their neighbors
Peers use mappings to map their own schema to their neighbors’ schemas in order to correctly translate a query and propagate it (query rewriting)
There is ad-hoc and intermittent connectivity
Query Answering: Motivating Application
Qorig:
SELECT age, sex, symptoms, treatrec, drugname
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"

Qorig_LDB:
SELECT age, sex
FROM Patients, Results
WHERE Patients.pid = Results.pid
  AND Results.diagnosis = "X"

QLDB_StuartDB:
SELECT age, sex
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"
Traditional Query Propagation (Contd.)
Each peer that receives a query Q:
  Rewrites it according to its local schema (Q’)
  Answers Q’
  Forwards Q’ to one or more of its neighbors
Peers with dissimilar schemas hide nodes rich in information
Bad rewritings have the same effect
Local-only knowledge in the overlay puts constraints on the propagation of a query
  Bandwidth limitations
  Cannot keep track of all nodes!
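The information loss described above can be sketched with a deliberately simplified model: a query is just a set of requested attributes, and rewriting keeps only the attributes the local schema knows. The attribute names come from the motivating application; the two schemas, the set-intersection rewriting and the function names are assumptions made for illustration, not the actual mapping machinery.

```python
def rewrite(query: frozenset, schema: frozenset) -> frozenset:
    """Simplified rewriting: keep only attributes present in the local schema."""
    return query & schema


def propagate(query: frozenset, path: list) -> list:
    """Each peer rewrites the version it received, answers it, and forwards it."""
    answered = []
    current = query
    for name, schema in path:
        current = rewrite(current, schema)
        answered.append((name, sorted(current)))
    return answered


q_orig = frozenset({"age", "sex", "symptoms", "treatrec", "drugname"})
ldb_schema = frozenset({"age", "sex", "diagnosis"})  # assumed schema
stuart_schema = frozenset(
    {"age", "sex", "symptoms", "treatrec", "drugname", "diseasedescr"}
)  # assumed schema

# Chain: originator -> LDB -> StuartDB, forwarding only the rewritten version.
results = propagate(q_orig, [("LDB", ldb_schema), ("StuartDB", stuart_schema)])

# What StuartDB could have answered had it seen the original query.
direct = sorted(rewrite(q_orig, stuart_schema))
```

In this sketch StuartDB answers only `age` and `sex`, although it holds all five requested attributes: the dissimilar intermediate peer hides an information-rich node.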
People have a similar problem
[Figure: a translation chain EN → GR → FR → IT → EN through EN-GR, GR-FR, FR-IT and IT-EN dictionaries; each step is also labeled “+ EN”]
Overview of Our Framework - GrouPeer
Forward both the original and the rewritten version of Q
Answer the one most similar to the local schema
  Similarity function
at the same time…
Overlay restructuring according to evaluation of the results (learning new neighbors)
but do this in an…
Intelligent, bandwidth-efficient query forwarding with informed walks
Answering the Original Query

Qorig:
SELECT age, sex, symptoms, treatrec, drugname
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"

QLDB_StuartDB:
SELECT age, sex
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"

Qorig_StuartDB:
SELECT age, sex, symptoms, treatrec, drugname
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"
Similarity Measure
Simple structural similarity between 2 versions of Q:
  $M_{sim}(Q_{v1}, Q_{v2})$ correlates simple conceptual similarities of structural query elements (i.e. ‘select’ attributes and ‘where’ conditions)
Different at each peer
  Schema matching methodology
  Individual preferences
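One way to picture a structural similarity over ‘select’ attributes and ‘where’ conditions is the sketch below. It is a hypothetical stand-in, not the measure used by the authors: it multiplies the fraction of the original's select attributes that survive by the fraction of its where conditions that survive. The function name and the dictionary layout of a query are assumptions.

```python
def structural_similarity(original: dict, candidate: dict) -> float:
    """Fraction of SELECT attributes kept times fraction of WHERE conditions kept."""

    def kept(orig_items, cand_items):
        orig_set = set(orig_items)
        if not orig_set:
            return 1.0  # nothing to lose on this element
        return len(orig_set & set(cand_items)) / len(orig_set)

    return kept(original["select"], candidate["select"]) * kept(
        original["where"], candidate["where"]
    )


conditions = ["Patients.pid = Treatments.pid", 'Treatments.diseasedescr = "X"']
q_orig = {
    "select": ["age", "sex", "symptoms", "treatrec", "drugname"],
    "where": conditions,
}
q_rewritten = {"select": ["age", "sex"], "where": conditions}

similarity = structural_similarity(q_orig, q_rewritten)
```

With this stand-in, a rewriting that keeps two of five select attributes and both conditions scores 2/5. Each peer could plug in its own matching methodology, which is why the measure differs from peer to peer.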
Confidence Parameter θ
Express each peer’s individuality
  Schema matching ability
  Query structure
  $\theta = \theta_p(Q)$
Each peer decides which query to answer (original, rewritten) according to:
  $Cov_p(Q_2, Q_1) = M^{p}_{sim}(Q_1, Q_2) \cdot \theta_p(Q_1)$
Example Query Answering
SELECT age, sex
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"
$Cov_{StuartDB}(Q_{LDB\_Stuart}, Q_{orig}) = 0.4 \cdot 1 = 0.4$

SELECT age, sex, symptoms, treatrec, drugname
FROM Patients, Treatments
WHERE Patients.pid = Treatments.pid
  AND Treatments.diseasedescr = "X"
$Cov_{StuartDB}(Q_{orig\_Stuart}, Q_{orig}) = 1.0 \cdot 0.5 = 0.5$
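The arithmetic of this example can be replayed in a few lines. The sketch assumes that the peer answers whichever version has the larger product of similarity and confidence; the function name `coverage` and the dictionary keys are hypothetical, while the four numbers are the ones on the slide.

```python
def coverage(msim: float, theta: float) -> float:
    """Product of the similarity measure and the confidence parameter."""
    return msim * theta


candidates = {
    "rewritten via LDB": coverage(0.4, 1.0),  # 0.4 * 1
    "original": coverage(1.0, 0.5),           # 1.0 * 0.5
}

# Assumed decision rule: answer the version with the highest coverage.
chosen = max(candidates, key=candidates.get)
```

Here StuartDB would answer its own rewriting of the original query (0.5) rather than the version that passed through LDB (0.4), even though its confidence in rewriting the original is lower.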
Learning Process
• Data providers return to the query originator:
  – Which version they answered
  – Successively rewritten query
  – Locally rewritten query
  – Answer tuples
• The originators evaluate the answers:
and
Semantic Grouping of Peers
The learning process leads to semantic clustering of the P2P network
Each semantic cluster can create an explicit interest group:
  Group Initiator: Peer that initiates the group creation
  Group Schema: An abstract schema with the most common schema elements of the cluster
Routing and Restructuring in the Overlay
Informed walks instead of blind forwarding
  K walkers, TTL steps each
  Choose neighbors according to schema similarity
Adding an acquaintance p:
  When $\theta_p = \theta_{local}$
  More than T replies from p
  Impose a maximum number of neighbors
Dropping an acquaintance
  Low(est) similarity neighbors
  Only if they don’t get disconnected
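A rough sketch of informed walks follows, under stated assumptions: the overlay is an adjacency dictionary, each peer has one precomputed similarity score to the query originator's schema, walkers start at the K most similar neighbors, and each walker greedily moves to its most similar unvisited neighbor. All names and the greedy rule are hypothetical; the slides only state K walkers, TTL steps, and similarity-based neighbor choice.

```python
def informed_walks(graph: dict, similarity: dict, source: str, k: int, ttl: int) -> set:
    """Return the peers reached by k greedy walkers that visit at most ttl peers each."""
    visited = {source}
    # Start one walker at each of the k most similar neighbors of the source.
    starts = sorted(graph[source], key=lambda p: similarity[p], reverse=True)[:k]
    for start in starts:
        current = start
        for _ in range(ttl):
            visited.add(current)
            candidates = [p for p in graph[current] if p not in visited]
            if not candidates:
                break  # dead end: every neighbor was already visited
            current = max(candidates, key=lambda p: similarity[p])
    return visited - {source}


graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "E"],
    "C": ["A", "F"],
    "D": ["A"],
    "E": ["B"],
    "F": ["C"],
}
similarity = {"A": 1.0, "B": 0.9, "C": 0.8, "D": 0.1, "E": 0.7, "F": 0.6}

reached = informed_walks(graph, similarity, "A", k=2, ttl=2)
```

With two walkers the dissimilar neighbor `D` is never contacted, which is the bandwidth saving over flooding; the cost is that peers behind a low-similarity neighbor can be missed.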
Simulation Methodology
Simulator in C
Pure P2P model, 1-10K graphs (power-law and random)
100 attributes, random assignment (10 avg.)
Queries on local schemas (3 attributes avg.)
100 requesters
Compared methods
  Naïve rewriting (naïve)
  Flooding with each peer answering Qorig (Brute) – and some variations
Results
[Figure: Similarity of results to the original query over variable number of queries per requester. x-axis: Queries per Requester (100 to 500); y-axis: Similarity (%) (50 to 100); series: GrouPeer total, GrouPeer original Qu, GrouPeer rewritten Qu, Naive]
Results
[Figure: Ratio of the average similarity GrouPeer’s clustering achieves versus the optimal, given an equal number of acquaintees. x-axis: Queries per Requester (100 to 500); y-axis: % of optimal clustering (40 to 100); series: 500 requ, 100 requ, 10 requ]
Related Work
PeerDB
Query Reformulation (Halevy et al.)
Edutella
Chatty Web
Handling Spatial Data
Problem Definition
Assumptions:
  Dissemination of spatial data in autonomous nodes
  Structured flat P2P overlay
Problem:
  Necessity for indexing and routing techniques for such an environment
Technique Requirements
We need a technique that:
  Guarantees the retrieval of any existing spatial information in the system
  Achieves a satisfactory index size for any node
  Achieves a satisfactory length for any search path
  Provides easier access to popular information
  Preserves proximity of areas
Related Work
VBI-tree (ICDE ’05)
P2PR-tree (Workshop of EDBT ’04)
Quad-tree approach (Poster in ICDE ’05)
kd-tree approach (WebDB ’04)
Specification of Spatial Areas
Partitioning Space:
  We assume a basic fine-grained n×n grid
Specifying Spatial Areas:
  We consider any i×i area on the basic grid, 1≤i≤n
  ⇒ Each i×i area belongs to an i×i grid
The id of an area A is: $A \equiv a_z a_x a_y$
  $a_z$: the granularity level, i.e. the size in terms of basic cells
  $a_x$: x coordinate of the center basic cell
  $a_y$: y coordinate of the center basic cell
Grid & Area Examples
[Figure: basic 5x5 grid with both axes numbered 1 to 5 and the example area 324 marked; the four 2nd level grids (z=2)]
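The area id can be made concrete with a small sketch. It assumes that the example id 324 reads as granularity 3 centered on cell (2, 4), that coordinates are single digits so the three parts can simply be concatenated, and that only odd sizes have a unique center cell. The function names are hypothetical.

```python
def area_id(z: int, x: int, y: int) -> str:
    """Concatenate granularity level and center cell (single-digit values assumed)."""
    return f"{z}{x}{y}"


def covered_cells(z: int, x: int, y: int, n: int) -> list:
    """Basic cells covered by a z-by-z area centered on cell (x, y) of an n-by-n grid."""
    if z % 2 == 0:
        raise ValueError("this sketch handles odd sizes only (unique center cell)")
    half = z // 2
    cells = [
        (cx, cy)
        for cx in range(x - half, x + half + 1)
        for cy in range(y - half, y + half + 1)
    ]
    if any(not (1 <= cx <= n and 1 <= cy <= n) for cx, cy in cells):
        raise ValueError("area does not fit inside the basic grid")
    return cells


identifier = area_id(3, 2, 4)
cells = covered_cells(3, 2, 4, n=5)
```

Under these assumptions, area 324 on the basic 5x5 grid is the 3x3 block spanning columns 1 to 3 and rows 3 to 5.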
Distributed Indexing for Spatial Areas
Using Chord as the basic DHT, we index areas as follows:
Distributed Indexing Algorithm
  Step 1: Indexing the grids of each level
    Phase A: Indexing areas of the same grid
    Phase B: Indexing areas of other grids of the same level
  Step 2: Indexing the grids of other levels
Indexing Examples
[Figure panels: basic 5x5 grid; area stored in the indexing peer; indexing non-overlapping areas of the same size; example of indexing on the basic grid with Chord of base 2; example of indexing on a 2nd level grid with Chord of base 2; indexing overlapping areas of the same size; indexing areas of different size]
Direct Neighbors & Data Hashing
Peers have neighbors on each dimension:
  Horizontal
  Vertical
  Perpendicular
Neighbors are the closest peers in each direction.
Areas are stored at the closest neighbor, following a prioritization of the dimensions
Direct Neighbors: Examples
[Figure labels: horizontal successor, vertical successor, horizontal predecessor, vertical predecessor; 1st quadrant, 2nd quadrant, 3rd quadrant, 4th quadrant]
Routing of Spatial Queries
Basic Routing Algorithm
  Step 1: Find a peer corresponding to the same size as the sought area.
  Step 2: Find a peer hosting the sought area or an area overlapping it.
  Step 3: Find the peer hosting the sought area.
Complexity Characteristics
Complexity for both pure Chord and our method is $O(3 \log_2 n)$.
But with our method:
  Most important: relatively close areas are stored in close peers
  The search path grows with the relative distance
  Peers with big areas are favored with small indexes
  Shorter search paths for bigger areas
Summary
P2P systems offer effective use of distributed resources
Standards for P2P infrastructure are emerging
  Sun (JXTA) and Microsoft (.NET)
Content choice and control
  The user becomes not only a consumer but also a content provider; P2P allows the end user to participate in the Internet, as the latter was originally envisioned
AN EXCITING NEW AREA FOR DATABASE MANAGEMENT