The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree
Lars Arge (1), Mark de Berg (2), Herman Haverkort (3) and Ke Yi (1)
(1) Department of Computer Science, Duke University
(2) Department of Computer Science, TU Eindhoven
(3) Institute of Information and Computing Sciences, Utrecht University
Problem Definition
• Input:
– N rectangles in the plane
– Window query Q
• Output:
– All rectangles intersecting Q
• Applications
– Spatial databases
– GIS
– CAD
– Computer vision
– Robotics
– …
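As a point of reference for the problem statement, here is a minimal sketch of the naive answer: scan every rectangle and test it against Q. The tuple layout `(xmin, ymin, xmax, ymax)` and the function names are assumptions made for this illustration, not something taken from the slides.

```python
from typing import List, Tuple

# A rectangle is (xmin, ymin, xmax, ymax); this layout is an assumption of the sketch.
Rect = Tuple[float, float, float, float]


def intersects(r: Rect, q: Rect) -> bool:
    """True if the closed rectangles r and q share at least one point."""
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


def window_query_scan(rects: List[Rect], q: Rect) -> List[Rect]:
    """Baseline: test every rectangle against the window q."""
    return [r for r in rects if intersects(r, q)]


rects = [(0, 0, 2, 2), (3, 3, 5, 5), (1, 1, 4, 4)]
print(window_query_scan(rects, (2.5, 2.5, 3.5, 3.5)))
# prints [(3, 3, 5, 5), (1, 1, 4, 4)]
```

A scan like this touches all of the input regardless of the output size, which is the cost an R-tree is meant to avoid.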
R-Tree
• Definition [Guttman84]:
[Figure: an R-tree over the rectangles A to I, which are stored in the leaves]
– Fanout: Θ(B)
– B: disk block size
• Advantages:
– Little redundancy
– Multi-purpose
– Easy to update
How to Build an R-Tree
• Repeated insertions
– [Guttman84]
– R+-tree [Sellis et al. 87]
– R*-tree [Beckmann et al. 90]
• Bulkloading
– Hilbert R-Tree [Kamel and Faloutsos 94]
– Top-down Greedy Split [Garcia et al. 98]
– Advantages:
* Much faster than repeated insertions
* Better space utilization
* Usually produces R-trees of higher quality
R-Tree Variant: Hilbert R-Tree
• To build a Hilbert R-Tree (cost: $O(\frac{N}{B} \log_{M/B} N)$ I/Os)
– Sort the rectangles by the Hilbert values of their centers
– Build a B-tree on top
• 4D Hilbert R-tree
[Figure: Hilbert curve]
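The two build steps for the 2D Hilbert R-tree can be sketched in memory as follows. This is only an illustration under stated assumptions: coordinates lie in `[0, extent]`, centers are snapped to a `grid` x `grid` lattice before the curve position is computed, and the names `hilbert_index`, `hilbert_pack`, `extent` and `grid` are made up for this sketch. The curve computation is the standard bitwise conversion from cell coordinates to curve position; the I/O-efficient sort from the slide is replaced by Python's `sorted`.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along the Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def mbr(rects: List[Rect]) -> Rect:
    """Minimal bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def hilbert_pack(rects: List[Rect], B: int, extent: float, grid: int = 1 << 16):
    """Return the tree levels: level 0 is the sorted input, each higher level holds
    the bounding boxes of groups of B consecutive entries of the level below (B >= 2)."""
    def key(r: Rect) -> int:
        gx = min(grid - 1, int((r[0] + r[2]) / 2 / extent * grid))
        gy = min(grid - 1, int((r[1] + r[3]) / 2 / extent * grid))
        return hilbert_index(grid, gx, gy)

    levels = [sorted(rects, key=key)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([mbr(prev[i:i + B]) for i in range(0, len(prev), B)])
    return levels


levels = hilbert_pack([(i, i, i + 1, i + 1) for i in range(10)], B=4, extent=16)
print([len(level) for level in levels])  # prints [10, 3, 1]
```

Packing consecutive runs of B entries is what "build a B-tree on top" amounts to once the rectangles are in curve order.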
R-Tree Variant: TGS R-Tree (Top-down Greedy Split)
• To build a TGS R-tree
– Start from the root and build the tree top-down
– To build one node, use binary cuts until the desired fan-out is reached
* To make a binary cut, consider 4 orderings of the rectangles: xmin, ymin, xmax, ymax
* In each ordering, consider the B cutting positions
* Choose the one that minimizes the sum of the areas of the two resulting bounding boxes
• Typical bulk-load cost: $O(\frac{N}{B} \log_2 N)$ I/Os
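The greedy binary cut can be sketched as below. This is a simplified in-memory illustration, not the published TGS implementation: `step` (the spacing between candidate cut positions), `best_binary_cut` and `split_node` are hypothetical names, and the bounding boxes are recomputed from scratch for every candidate instead of being maintained incrementally.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def mbr(rects: List[Rect]) -> Rect:
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(b: Rect) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])


def best_binary_cut(rects: List[Rect], step: int) -> Optional[Tuple[float, List[Rect], List[Rect]]]:
    """Try a cut after every `step` rectangles in each of the 4 orderings and
    return (cost, left, right) for the cut with the smallest total bounding-box area."""
    best = None
    for dim in range(4):  # orderings by xmin, ymin, xmax, ymax
        order = sorted(rects, key=lambda r: r[dim])
        for i in range(step, len(order), step):
            left, right = order[:i], order[i:]
            cost = area(mbr(left)) + area(mbr(right))
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best


def split_node(rects: List[Rect], step: int) -> List[List[Rect]]:
    """Apply binary cuts repeatedly until every group has at most `step` rectangles."""
    if len(rects) <= step:
        return [rects]
    _, left, right = best_binary_cut(rects, step)
    return split_node(left, step) + split_node(right, step)


rects = [(0, 0, 1, 1), (1, 0, 2, 1), (10, 0, 11, 1), (11, 0, 12, 1)]
print(split_node(rects, 2))
# prints [[(0, 0, 1, 1), (1, 0, 2, 1)], [(10, 0, 11, 1), (11, 0, 12, 1)]]
```

In a full top-down build, each group returned by `split_node` would become one child of the current node and be split recursively in the same way.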
Our Results
• No existing R-tree variant has a worst-case query performance guarantee!
– In the worst case, a query can visit all nodes in the tree even when the output size is zero
• Priority R-Tree
– The first R-tree variant that answers a query by visiting $O(\sqrt{N/B} + T/B)$ nodes in the worst case
* T: Output size
– It is optimal!
* There exists a dataset such that for any R-tree, there is an empty query that visits $\Omega(\sqrt{N/B})$ nodes. [Kanth and Singh 99, Agarwal et al. 02]
Roadmap
• Pseudo-PR-Tree
– Has the desired $O(\sqrt{N/B} + T/B)$ worst-case guarantee
– Not a real R-tree
• Transform a pseudo-PR-Tree into a PR-tree
– A real R-tree
– Maintains the worst-case guarantee
• Experiments
– PR-tree
– Hilbert R-tree (2D and 4D)
– TGS R-tree
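Building a Pseudo-PR-Tree
Step 1: Take out the B most extreme rectangles in each of the four directions (smallest xmin, smallest ymin, largest xmax, largest ymax) and put them into priority leaves.
[Figure: the root with its four priority leaves]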
Building a Pseudo-PR-Tree
Step 2: Divide the remaining rectangles by their xmin coordinates and build subtrees recursively.
Division is performed using xmin, ymin, xmax, ymax in a round-robin fashion, like a 4D kd-tree.
[Figure: the root with its priority leaves and two recursively built subtrees]
Analysis sketch:
# nodes with at least one priority leaf completely reported: $O(T/B)$
# nodes with no priority leaf completely reported: $O(\sqrt{N/B})$
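The two construction steps can be put together in a short in-memory sketch. It is an illustration under assumptions, not the authors' code: rectangles are `(xmin, ymin, xmax, ymax)` tuples, `Node`, `build` and `query` are names invented here, the remaining rectangles are split at the median, and a visited node simply scans all of its priority leaves instead of first checking each leaf's bounding box.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def intersects(r: Rect, q: Rect) -> bool:
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


def mbr(rects: List[Rect]) -> Rect:
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


@dataclass
class Node:
    box: Rect                                               # bounding box of everything below
    leaves: List[List[Rect]] = field(default_factory=list)  # up to 4 priority leaves
    children: List["Node"] = field(default_factory=list)    # up to 2 subtrees


# (coordinate index, take the largest values?) for the four priority leaves
DIRECTIONS = ((0, False), (1, False), (2, True), (3, True))


def build(rects: List[Rect], B: int, depth: int = 0) -> Node:
    """Build a pseudo-PR-tree on a non-empty list of rectangles."""
    node = Node(box=mbr(rects))
    rest = list(rects)
    if len(rest) <= B:
        node.leaves.append(rest)
        return node
    # Step 1: peel off the B most extreme rectangles in each direction.
    for dim, largest in DIRECTIONS:
        if not rest:
            break
        rest.sort(key=lambda r: r[dim], reverse=largest)
        node.leaves.append(rest[:B])
        rest = rest[B:]
    # Step 2: split the remainder kd-tree style, cycling through the 4 coordinates.
    if rest:
        rest.sort(key=lambda r: r[depth % 4])
        mid = len(rest) // 2
        for half in (rest[:mid], rest[mid:]):
            if half:
                node.children.append(build(half, B, depth + 1))
    return node


def query(node: Node, q: Rect, out: List[Rect]) -> int:
    """Append all rectangles intersecting q to out; return the number of nodes visited."""
    visited = 1
    for leaf in node.leaves:
        out.extend(r for r in leaf if intersects(r, q))
    for child in node.children:
        if intersects(child.box, q):
            visited += query(child, q, out)
    return visited


random.seed(1)
data = []
for _ in range(500):
    x, y = random.random(), random.random()
    data.append((x, y, x + random.random() * 0.1, y + random.random() * 0.1))
root = build(data, B=8)
found: List[Rect] = []
visited = query(root, (0.4, 0.4, 0.6, 0.6), found)
print(len(found), "rectangles reported,", visited, "nodes visited")
```

Comparing `found` with a plain scan of `data` is a quick way to check that no rectangle is lost when it is moved into a priority leaf.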
Pseudo-PR-Tree to a Real R-tree
Query Complexity Remains Unchanged
# nodes visited on leaf level: $\sqrt{N/B} + T/B$
Next level: $\sqrt{N/B^2} + \sqrt{N/B}/B + T/B^2$
Level after that: $\sqrt{N/B^3} + \sqrt{N/B^2}/B + \sqrt{N/B}/B^2 + T/B^3$
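Why the per-level counts add up to the leaf-level bound can be seen from a short calculation. The following is a sketch with all hidden constants set to 1; $V_i$ (the bound on the number of nodes visited at level $i$, counting the leaf level as level 1) and $S$ are notation introduced here, not taken from the slides.

```latex
\begin{align*}
V_1 &= \sqrt{N/B} + T/B, \qquad V_i = \sqrt{N/B^i} + V_{i-1}/B \quad (i \ge 2),\\
S &= \sum_{i \ge 1} V_i \;\le\; \sum_{i \ge 1} \sqrt{N/B^i} + \frac{T}{B} + \frac{S}{B},\\
\sum_{i \ge 1} \sqrt{N/B^i} &= \sqrt{N/B} \sum_{k \ge 0} B^{-k/2} = \frac{\sqrt{N/B}}{1 - B^{-1/2}},\\
S &\le \frac{B}{B-1}\left(\frac{\sqrt{N/B}}{1 - B^{-1/2}} + \frac{T}{B}\right)
   = O\!\left(\sqrt{N/B} + T/B\right) \quad (B \ge 2).
\end{align*}
```

Both sums are geometric, so the leaf level dominates and the higher levels only change the constant.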
PR-Tree: Bulkload & Updates
• Bulkload
– $O(\frac{N}{B} \log_2 N)$ I/Os → $O(\frac{N}{B} \log_{M/B} N)$ I/Os, using “grid method” [Agarwal et al. 01]
– The same as Hilbert R-tree, but with a larger constant
• Updates
– Can use any previous heuristic to update in $O(\log_B N)$ I/Os
* Without worst-case query guarantee
– Use logarithmic method
* Insert: $O(\log_B N + \frac{1}{B} \log_{M/B} N \log_2(N/M))$ I/Os
* Delete: $O(\log_B N)$ I/Os
• Extending to d-dimensions
– Query bound: $O((N/B)^{1-1/d} + T/B)$, still optimal
– Bulkload & update bounds remain the same
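The logarithmic method mentioned under Updates is a general way to make a static, bulk-loaded structure support insertions. Below is a minimal in-memory sketch of the idea only: level i is either empty or holds a structure on exactly 2^i rectangles, and an insertion merges full levels upward like a carry in a binary counter. The class name `LogMethodIndex` is hypothetical, a plain list stands in for the bulk-loaded tree, deletions are not shown, and the external-memory variant behind the bounds above differs in its details.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def intersects(r: Rect, q: Rect) -> bool:
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


class LogMethodIndex:
    """Level i is either None or a static structure on exactly 2**i rectangles."""

    def __init__(self) -> None:
        self.levels: List[Optional[List[Rect]]] = []

    @staticmethod
    def _bulk_load(rects: List[Rect]) -> List[Rect]:
        # Stand-in for bulk-loading a static structure (e.g. a PR-tree); here just a copy.
        return list(rects)

    def insert(self, r: Rect) -> None:
        carry = [r]
        i = 0
        # Like adding 1 to a binary counter: empty every full level into the carry.
        while i < len(self.levels) and self.levels[i] is not None:
            carry += self.levels[i]
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = self._bulk_load(carry)  # rebuild once, at the first empty level

    def query(self, q: Rect) -> List[Rect]:
        out: List[Rect] = []
        for level in self.levels:  # a query has to visit every non-empty level
            if level is not None:
                out.extend(r for r in level if intersects(r, q))
        return out


idx = LogMethodIndex()
for k in range(5):
    idx.insert((k, k, k + 1, k + 1))
print([len(level) if level is not None else 0 for level in idx.levels])  # prints [1, 0, 4]
```

Each rectangle is rebuilt into a larger level only a logarithmic number of times, which is where the extra logarithmic factor in the insertion bound comes from.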
Experiments
• Implemented with TPIE
– Priority R-tree
– Hilbert R-tree
– 4D Hilbert R-tree
– TGS R-tree
• Real-life data
– TIGER datasets
– 16 million rectangles
• Synthetic data
– Varying from normal to extreme data
– 10 million rectangles
Experiments with Real-Life Data
Query performance on the TIGER datasets
Shown: # I/Os spent in answering a query, divided by T/B
Experiments with Synthetic Data: SIZE
Each side of a rectangle is uniformly distributed in [0, max_side]
Queries are squares with area 1%
Experiments with Synthetic Data: ASPECT
Fix the area, vary aspect ratio
Experiments with Synthetic Data: SKEWED
Randomly place points, then apply $y' = y^c$ to the y-coordinates
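Generators for two of the synthetic datasets might look like the sketch below. Several details are assumptions made for the illustration and are not stated on the slides: the data lives in the unit square, SIZE rectangles are placed by choosing their centers uniformly at random, and the function names `size_dataset` and `skewed_dataset` are invented here.

```python
import random
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def size_dataset(n: int, max_side: float, rng: random.Random) -> List[Rect]:
    """SIZE-style data: each side length is uniform in [0, max_side]."""
    out = []
    for _ in range(n):
        cx, cy = rng.random(), rng.random()  # assumed: centers uniform in the unit square
        w, h = rng.uniform(0, max_side), rng.uniform(0, max_side)
        out.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return out


def skewed_dataset(n: int, c: float, rng: random.Random) -> List[Rect]:
    """SKEWED-style data: uniform points whose y-coordinate is replaced by y**c."""
    out = []
    for _ in range(n):
        x, y = rng.random(), rng.random() ** c
        out.append((x, y, x, y))  # a point is stored as a degenerate rectangle
    return out


rng = random.Random(0)
size_rects = size_dataset(1000, 0.05, rng)
skewed_points = skewed_dataset(1000, 5, rng)
print(sum(p[1] for p in skewed_points) / len(skewed_points))  # mean y, pulled toward 0
```

Larger values of `c` squeeze more of the points toward y = 0, which is the kind of extreme distribution the SKEWED experiment varies.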
Experiments with Synthetic Data: CLUSTER
Conclusions
• In theory
– The PR-tree is the first R-tree variant that answers a window query in $O(\sqrt{N/B} + T/B)$ I/Os in the worst case, which is optimal
• In practice
– Roughly the same as previous best R-trees on real-life and relatively nicely distributed data
– Outperforms them significantly on more extreme data
• Future work
– How previous heuristics may affect the performance of the PR-tree in the dynamic case
Lower Bound Construction
• Each bounding box intersects at least $\sqrt{B}$ queries
• N/B bounding boxes
• $\sqrt{N}$ queries
• There exists a query that intersects at least $\frac{(N/B) \cdot \sqrt{B}}{\sqrt{N}} = \sqrt{N/B}$ bounding boxes
Pseudo-PR-Tree: Query Complexity
• Nodes v visited where all rectangles in at least one of the priority leaves of v's parent are reported: $O(T/B)$
• Let v be a visited node such that none of the priority leaves at its parent is reported completely, and consider v's parent u
[Figure: the query Q in 2D and the corresponding picture in 4D, with the hyperplanes xmax = xmin(Q) and ymin = ymax(Q)]
Pseudo-PR-Tree: Query Complexity
• The cell in the 4D kd-tree of u is intersected by two different 3-dimensional hyperplanes
• The intersection of each pair of such 3-dimensional hyperplanes is a 2-dimensional hyperplane
• Lemma: # of cells in a d-dimensional kd-tree that intersect an axis-parallel f-dimensional hyperplane is $O((N/B)^{f/d})$
• So, # such cells in a 4D kd-tree (f = 2, d = 4): $O(\sqrt{N/B})$
• Total # nodes visited: $O(\sqrt{N/B} + T/B)$
Experiments with Real-Life Data
• Datasets: TIGER/Line data
• Bulk-loading: