Post on 19-Dec-2015
1
One Torus to Rule Them All: Multi-dimensional Queries in P2P Systems
Prasanna Ganesan
Beverly Yang
Hector Garcia-Molina
Stanford University
2
Motivation
P2P Systems
– Dynamic set of nodes
– Dynamic data distributed over nodes
– No centralization
– Traditionally: Simple point queries over data
New P2P applications desire multi-dimensional queries
– Photo Sharing: Find all labels for photos in a geographical area
– Multi-player games: Find all objects in an area
3
Problem
Devise P2P system to store relation R with:
1. Efficient tuple insertion/deletion
2. Efficient node join/leave
– Minimize #messages
3. Efficient multi-dimensional range queries
– Minimize #nodes processing query
4. Load balance across nodes
A parallel DB on steroids
4
Challenge 1: Partitioning Problem
Partition data with
1. Locality: Keep nearby tuples on same node
2. Load balance: Equal #tuples on all nodes
Complications
– Dynamic data
– Dynamic nodes
5
Challenge 2: Routing Problem
Route query/insert/delete to relevant nodes
– No centralization!
– Replicated directory too expensive!
– Trade-off between cost of query and cost of maintaining routing structure
6
Roadmap
Two Different Approaches
– SCRAP: Space-filling curves with Range Partitions
– MURK: Multi-dimensional Rectangulation with kd-trees
Comparing the two approaches
7
SCRAP Partitioning
Two-Step Process
1. Map data to 1-d with space-filling curve
– E.g., <110011, 010101> becomes 101100011011
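The example pair on this slide is consistent with bit interleaving (a Z-order curve), although the slide does not name the curve it uses. A minimal sketch under that assumption follows; the function name `interleave` is hypothetical and not taken from the talk.

```python
def interleave(coords):
    """Interleave the bits of equal-length binary strings, first coordinate first."""
    width = len(coords[0])
    if any(len(c) != width for c in coords):
        raise ValueError("all coordinates must have the same bit width")
    # Take bit i of every coordinate before moving on to bit i + 1.
    return "".join(c[i] for i in range(width) for c in coords)


# The two-attribute tuple from the slide, mapped to a single 1-d key.
key = interleave(["110011", "010101"])
print(key)  # 101100011011
```

Tuples that agree on the high-order bits of every attribute share a key prefix, which is what lets a later range partition on the key keep nearby tuples together.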
8
SCRAP Partitioning (2)
2. Range partition 1-d data
– Preserves locality!
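Range partitioning the 1-d keys can be pictured as a sorted list of split points, one fewer than the number of nodes. The sketch below is illustrative only: the split points and the helper name `owner` are assumptions, and a real system would store each boundary at the node that owns it rather than in one list.

```python
from bisect import bisect_right


def owner(boundaries, key):
    """Return the index of the node whose range contains key.

    boundaries[i] is the smallest key owned by node i + 1, so n nodes
    need n - 1 sorted boundaries.
    """
    return bisect_right(boundaries, key)


boundaries = [1024, 2048, 3072]  # hypothetical split points for 4 nodes
keys = [int("101100011011", 2), 100, 2047, 2048]
print([owner(boundaries, k) for k in keys])  # [2, 0, 1, 2]
```

Consecutive keys land on the same node or on adjacent nodes, which is the sense in which the partitioning preserves locality.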
9
Load Balancing with SCRAP
Adjust partitions when unbalanced
– Adjust boundary with neighbor
– Migrate to new area
– Guarantees: All loads within factor 4.24. Constant tuple movements per insert/delete [GBGM04]
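To make the "adjust boundary with neighbor" step concrete, here is a rough sketch that evens out two adjacent partitions by moving their shared boundary. It is not the [GBGM04] algorithm: that work also decides when to adjust and when to migrate, and it is what provides the stated guarantees. The function name and the equal-split policy are assumptions made for illustration.

```python
def adjust_boundary(left, right):
    """Even out two adjacent range partitions.

    left and right are sorted key lists, with every key in left smaller
    than every key in right. Returns (new_left, new_right, new_boundary).
    """
    merged = left + right  # still sorted, because the ranges are adjacent
    half = len(merged) // 2
    new_left, new_right = merged[:half], merged[half:]
    # The new boundary is the smallest key the right-hand node now owns.
    boundary = new_right[0] if new_right else None
    return new_left, new_right, boundary


left = [3, 8, 15, 21, 34, 40]  # overloaded node
right = [55, 90]               # lightly loaded neighbor
print(adjust_boundary(left, right))
# ([3, 8, 15, 21], [34, 40, 55, 90], 34)
```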
10
Query Routing
Map multi-dim query to set of 1-d ranges
Send each 1-d range query to relevant node
Use a linked list to interconnect nodes
– Add “skip” pointers for fast routing
– O(log n) messages for routing/node joins/leaves
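The first step, turning a rectangular query into 1-d ranges, can be sketched by brute force for small domains. The code assumes two attributes and a Z-order (bit-interleaved) key; `z_key` and `query_to_ranges` are hypothetical names, and a practical system would decompose the rectangle recursively instead of enumerating every cell.

```python
def z_key(x, y, bits):
    """Interleave the bits of x and y (x first), most significant bit first."""
    key = 0
    for i in reversed(range(bits)):
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def query_to_ranges(x_lo, x_hi, y_lo, y_hi, bits):
    """Return the sorted, maximal 1-d key ranges covering an inclusive rectangle."""
    keys = sorted(z_key(x, y, bits)
                  for x in range(x_lo, x_hi + 1)
                  for y in range(y_lo, y_hi + 1))
    ranges = []
    for k in keys:
        if ranges and k == ranges[-1][1] + 1:
            ranges[-1][1] = k      # extend the current run of consecutive keys
        else:
            ranges.append([k, k])  # start a new run
    return [tuple(r) for r in ranges]


print(query_to_ranges(0, 1, 0, 1, bits=2))  # [(0, 3)]
print(query_to_ranges(1, 2, 1, 2, bits=2))  # [(3, 3), (6, 6), (9, 9), (12, 12)]
```

Both queries cover four cells, but the second straddles the curve's quadrant boundaries and breaks into four separate ranges, each of which may have to be routed to a different node.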
11
Roadmap
Two Different Approaches
– SCRAP: Space-filling curves with Range Partitions
– MURK: Multi-dimensional Rectangulation with kd-trees
Comparing the two approaches
12
MURK
Intuition: Partition data in native space into “Rectangles”
– a la kd-trees
13
Kd-tree Interpretation
Nodes form leaves of kd-tree
Node Join: Split existing leaf
Node leave
– Sibling takes over
– If no sibling, find someone in sibling sub-tree
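The join and simple leave cases can be sketched with a tiny kd-tree whose leaves are nodes. Several details here are assumptions: the slide does not say where a rectangle is split (the sketch uses the midpoint for simplicity), the class and function names are hypothetical, and the case where the sibling is itself a sub-tree is left out rather than guessed at.

```python
class Leaf:
    """A node in the P2P system and the rectangle [lo, hi] it manages."""
    def __init__(self, node, lo, hi):
        self.node, self.lo, self.hi = node, list(lo), list(hi)


class Split:
    """An internal kd-tree vertex: a cut at position `at` along dimension `dim`."""
    def __init__(self, dim, at, left, right):
        self.dim, self.at, self.left, self.right = dim, at, left, right


def join(leaf, new_node, dim):
    """Split leaf's rectangle along dim; new_node takes the upper half."""
    at = (leaf.lo[dim] + leaf.hi[dim]) / 2  # assumption: midpoint split
    lower_hi = list(leaf.hi)
    lower_hi[dim] = at
    upper_lo = list(leaf.lo)
    upper_lo[dim] = at
    return Split(dim, at,
                 Leaf(leaf.node, leaf.lo, lower_hi),
                 Leaf(new_node, upper_lo, leaf.hi))


def leave(split, departing):
    """Simple case only: the departing node's sibling is a leaf and takes over."""
    if not (isinstance(split.left, Leaf) and isinstance(split.right, Leaf)):
        raise NotImplementedError("sibling is a sub-tree; not covered by this sketch")
    stay = split.right if split.left.node == departing else split.left
    lo = [min(a, b) for a, b in zip(split.left.lo, split.right.lo)]
    hi = [max(a, b) for a, b in zip(split.left.hi, split.right.hi)]
    return Leaf(stay.node, lo, hi)


root = join(Leaf("A", [0, 0], [8, 8]), "B", dim=0)
print(root.left.node, root.left.lo, root.left.hi)     # A [0, 0] [4.0, 8]
print(root.right.node, root.right.lo, root.right.hi)  # B [4.0, 0] [8, 8]
merged = leave(root, "B")
print(merged.node, merged.lo, merged.hi)              # A [0, 0] [8, 8]
```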
14
MURK Properties
Locality: Rectangulation better than SCRAP
Load Balance
– Ok if data distribution is static
– ??? If data distribution is dynamic
15
Routing Queries
Build a grid of nodes
– Adjacent nodes link to each other
– Analogous to linked list in higher dimensions
Problems
– Node managing large space has many neighbors!
– Routing on grid is too slow. Need skip pointers
– Not easy to add skip pointers (see paper)
16
Evaluation
Datasets
– Uniform: 32-bit ints drawn at random
– Skewed: Photo Co-ords from real collection
Nodes join one at a time to build network
Evaluate
– Locality: #nodes that process a query
– Routing: #messages transmitted per query
17
Dimensionality vs. Locality
[Figure: locality plotted against dimensionality. #nodes = 8192; ideal locality = 1]
20
Conclusions
SCRAP
– Simple partitioning and routing
– Excellent load balance
– Issue: Space-filling curve offers poor locality
MURK
– Much better locality than SCRAP
– Routing still ok
– Load balance is more complex and heuristic
21
More Information
Load Balancing, Range Queries and P2P
– “Online Balancing of Range-Partitioned Data with Applications to P2P Systems”, VLDB 2004
– “Distributed Balanced Tables: Not Making a Hash of it All”, Stanford Tech Report
– Google: “Prasanna Ganesan”
More work on P2P
– Google: “Stanford Peers”