The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree

Lars Arge¹, Mark de Berg², Herman Haverkort³ and Ke Yi¹

¹ Department of Computer Science, Duke University

² Department of Computer Science, TU Eindhoven

³ Institute of Information and Computing Sciences, Utrecht University

2

Problem Definition

• Input:
  – N rectangles in the plane
  – Window query Q
• Output:
  – All rectangles intersecting Q
• Applications
  – Spatial databases
  – GIS
  – CAD
  – Computer vision
  – Robotics
  – …
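The problem statement above can be made concrete with a brute-force baseline. This is only a minimal sketch: the tuple layout `(xmin, ymin, xmax, ymax)` and the names `intersects` and `window_query` are assumptions chosen for illustration, and rectangles that merely touch the query boundary are counted as intersecting.

```python
from typing import List, Tuple

# Assumed representation: an axis-parallel rectangle as (xmin, ymin, xmax, ymax).
Rect = Tuple[float, float, float, float]


def intersects(r: Rect, q: Rect) -> bool:
    """True if rectangles r and q share at least one point (closed rectangles)."""
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


def window_query(rects: List[Rect], q: Rect) -> List[Rect]:
    """Scan all N rectangles and report those intersecting the window q."""
    return [r for r in rects if intersects(r, q)]


if __name__ == "__main__":
    data = [(0, 0, 1, 1), (2, 2, 3, 3), (0.5, 0.5, 2.5, 2.5)]
    print(window_query(data, (0, 0, 1, 1)))
```

A scan like this touches every rectangle regardless of the output size, which is the cost an R-tree is meant to avoid.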

3

R-Tree

• Definition [Guttman84]:
• Advantages:
  – Little redundancy
  – Multi-purpose
  – Easy to update

[Figure: example R-tree on the rectangles A, B, C, D, E, F, G, H, I]

Fanout: Θ(B)

B: disk block size

4

How to Build an R-Tree

• Repeated insertions
  – [Guttman84]
  – R+-tree [Sellis et al. 87]
  – R*-tree [Beckmann et al. 90]
• Bulkloading
  – Hilbert R-Tree [Kamel and Faloutsos 94]
  – Top-down Greedy Split [Garcia et al. 98]
  – Advantages:
    * Much faster than repeated insertions
    * Better space utilization
    * Usually produces R-trees of higher quality

5

R-Tree Variant: Hilbert R-Tree

• To build a Hilbert R-Tree (cost: $O((N/B) \log_{M/B} N)$ I/Os)

– Sort the rectangles by the Hilbert values of their centers

– Build a B-tree on top

• 4D Hilbert R-tree

Hilbert Curve
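The two build steps listed on this slide (sort by the Hilbert value of the centers, then build the tree bottom-up) can be sketched in memory as follows. This is an illustration under stated assumptions, not the external-memory implementation from the talk: `hilbert_index` is the standard bit-manipulation conversion from grid coordinates to a position on the curve, centers are snapped to an assumed 2^16 x 2^16 grid, and `hilbert_pack`, `bbox`, and the dictionary node layout are hypothetical names. It assumes B ≥ 2.

```python
def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def bbox(rects):
    """Minimal bounding box of a non-empty list of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def hilbert_pack(rects, B, order=16):
    """Sort rectangles by the Hilbert value of their centers, then pack B per node."""
    world = bbox(rects)
    side = (1 << order) - 1
    width = max(world[2] - world[0], 1e-12)
    height = max(world[3] - world[1], 1e-12)

    def key(r):
        gx = min(side, int(((r[0] + r[2]) / 2 - world[0]) / width * side))
        gy = min(side, int(((r[1] + r[3]) / 2 - world[1]) / height * side))
        return hilbert_index(order, gx, gy)

    ordered = sorted(rects, key=key)
    # Leaf level: consecutive runs of B rectangles.
    level = [{"bbox": bbox(ordered[i:i + B]), "entries": ordered[i:i + B]}
             for i in range(0, len(ordered), B)]
    # Upper levels: consecutive runs of B child nodes, until one root remains.
    while len(level) > 1:
        level = [{"bbox": bbox([c["bbox"] for c in level[i:i + B]]),
                  "entries": level[i:i + B]}
                 for i in range(0, len(level), B)]
    return level[0]
```

In this sketch the sort dominates the cost, which matches the sorting-style bulk-load bound quoted on the slide.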

6

R-Tree Variant: TGS R-Tree (Top-down Greedy Split)

• To build a TGS R-tree
  – Start from the root and build the tree top-down
  – To build one node, use binary cuts until the desired fan-out is reached
    * To make a binary cut, consider 4 orderings of the rectangles: xmin, ymin, xmax, ymax
    * In each ordering, consider the B cutting positions
    * Choose the one that minimizes the sum of the areas of the two resulting bounding boxes
• Typical bulk-load cost: $O((N/B) \log_2 N)$ I/Os
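A single binary cut, as described in the bullets above, might look like the following in-memory sketch. The function name `best_binary_cut` and the `chunk` parameter (the spacing between candidate cut positions) are assumptions made for illustration; the slide only states that a fixed set of cutting positions is examined in each of the four orderings.

```python
def bbox(rects):
    """Minimal bounding box of a non-empty list of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def best_binary_cut(rects, chunk):
    """Split rects in two, minimizing the summed area of the two bounding boxes.

    Candidate cuts are taken every `chunk` rectangles in each of the four
    orderings (xmin, ymin, xmax, ymax). Requires len(rects) > chunk >= 1.
    """
    best = None
    for dim in range(4):  # 0: xmin, 1: ymin, 2: xmax, 3: ymax
        ordered = sorted(rects, key=lambda r: r[dim])
        for cut in range(chunk, len(ordered), chunk):
            left, right = ordered[:cut], ordered[cut:]
            cost = area(bbox(left)) + area(bbox(right))
            if best is None or cost < best[0]:
                best = (cost, left, right)
    return best[1], best[2]
```

Applying such a cut repeatedly to the larger pieces until the desired fan-out is reached, and then recursing into each piece, gives the top-down construction; every cut re-examines all four orderings, which is one way to see why the bulk-load cost is higher than a single sort.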

7

Our Results

• None of the existing R-tree variants has a worst-case query performance guarantee!
  – In the worst case, a query can visit all nodes in the tree even when the output size is zero
• Priority R-Tree
  – The first R-tree variant that answers a query by visiting $O(\sqrt{N/B} + T/B)$ nodes in the worst case
    * T: Output size
  – It is optimal!
    * There exists a dataset such that for any R-tree, there is an empty query that visits $\Omega(\sqrt{N/B})$ nodes. [Kanth and Singh 99, Agarwal et al. 02]

8

Roadmap

• Pseudo-PR-Tree
  – Has the desired worst-case guarantee: $O(\sqrt{N/B} + T/B)$
  – Not a real R-tree
• Transform a pseudo-PR-Tree into a PR-tree
  – A real R-tree
  – Maintains the worst-case guarantee
• Experiments
  – PR-tree
  – Hilbert R-tree (2D and 4D)
  – TGS-R-tree

9

Building a Pseudo-PR-Tree

root

Step 1: Take out B extreme rectangles from each direction and put them into priority leaves.

priority leaves

10

Building a Pseudo-PR-Tree

root

Step 2: Divide by the xmin coordinates and build subtrees recursively.

Division is performed using xmin, ymin, xmax, ymax in a round-robin fashion, like a 4D kd-tree.

Analysis sketch:

# nodes with at least one priority leaf completely reported: O(T/B)

# nodes with no priority leaf completely reported: $O(\sqrt{N/B})$
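Steps 1 and 2 can be sketched in memory as below. This is a rough illustration, not the I/O-efficient construction: it assumes that "extreme" means smallest xmin, smallest ymin, largest xmax and largest ymax, splits the remaining rectangles at the median, and uses hypothetical names (`build_pseudo_pr_tree`, `query`) and a dictionary node layout.

```python
def intersects(r, q):
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


def bbox(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


# (coordinate index, take largest first?) for xmin, ymin, xmax, ymax.
EXTREMES = [(0, False), (1, False), (2, True), (3, True)]


def build_pseudo_pr_tree(rects, B, depth=0):
    """Build a node with up to four priority leaves and two kd-style subtrees."""
    node = {"bbox": bbox(rects), "leaves": [], "children": []}
    rest = list(rects)
    if len(rest) <= B:
        node["leaves"].append(rest)
        return node
    # Step 1: peel off the B most extreme rectangles in each direction.
    for dim, largest_first in EXTREMES:
        if not rest:
            break
        rest.sort(key=lambda r: r[dim], reverse=largest_first)
        node["leaves"].append(rest[:B])
        rest = rest[B:]
    # Step 2: split the remainder by xmin, ymin, xmax, ymax in round-robin order.
    if rest:
        rest.sort(key=lambda r: r[depth % 4])
        mid = len(rest) // 2
        for half in (rest[:mid], rest[mid:]):
            if half:
                node["children"].append(build_pseudo_pr_tree(half, B, depth + 1))
    return node


def query(node, q, out):
    """Append to out every stored rectangle that intersects the window q."""
    if not intersects(node["bbox"], q):
        return
    for leaf in node["leaves"]:
        out.extend(r for r in leaf if intersects(r, q))
    for child in node["children"]:
        query(child, q, out)
```

For simplicity this version prunes whole subtrees by their bounding box and then scans every priority leaf of a visited node; a closer rendering would also test each priority leaf's own bounding box before reading it.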

11

Pseudo-PR-Tree to a Real R-tree

12

Query Complexity Remains Unchanged

# nodes visited on leaf level: $\sqrt{N/B} + T/B$

Next level: $\sqrt{N/B^2} + \sqrt{N/B}/B + T/B^2$

Level after that: $\sqrt{N/B^3} + \sqrt{N/B^2}/B + \sqrt{N/B}/B^2 + T/B^3$
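To see why the total stays within the leaf-level bound, the per-level counts can be summed as geometric series. The derivation below is a sketch that ignores constant factors, as the slide does; it assumes the pattern continues upward, with level i = 0 being the leaf level (visiting $\sqrt{N/B} + T/B$ nodes) and $V_i$ a symbol introduced here for the number of nodes visited on level i.

```latex
\begin{align*}
V_i &\le \sum_{j=0}^{i} \frac{1}{B^{j}} \sqrt{\frac{N}{B^{\,i+1-j}}} + \frac{T}{B^{\,i+1}}
      = \sqrt{\frac{N}{B}} \; B^{-i/2} \sum_{j=0}^{i} B^{-j/2} + \frac{T}{B^{\,i+1}}, \\[4pt]
\sum_{i \ge 0} V_i &\le \sqrt{\frac{N}{B}} \left( \sum_{i \ge 0} B^{-i/2} \right)^{2}
      + \frac{T}{B} \sum_{i \ge 0} B^{-i}
      = O\!\left( \sqrt{N/B} + T/B \right) \qquad (B \ge 2).
\end{align*}
```

Both series converge to constants for any block size B ≥ 2, so the upper levels add only a constant factor to the leaf-level count.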

13

PR-Tree: Bulkload & Updates

• Bulkload
  – $O((N/B) \log_2 N)$ I/Os → $O((N/B) \log_{M/B} N)$ I/Os, using the “grid method” [Agarwal et al. 01]
  – The same as Hilbert R-tree, but with a larger constant
• Updates
  – Can use any previous heuristic to update in $O(\log_B N)$ I/Os
    * Without worst-case query guarantee
  – Use logarithmic method
    * Insert: $O(\log_B N + \frac{1}{B} \log_{M/B} N \cdot \log_2(N/M))$ I/Os
    * Delete: $O(\log_B N)$ I/Os
• Extending to d-dimensions
  – Query bound: $O((N/B)^{1-1/d} + T/B)$, still optimal
  – Bulkload & update bounds remain the same
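The logarithmic method mentioned under Updates can be illustrated with a generic in-memory sketch. The idea is to keep a sequence of static structures whose sizes are distinct powers of two and to rebuild (bulk-load) only when structures are merged. Here a plain list stands in for a bulk-loaded PR-tree, and the class name `LogarithmicIndex` is hypothetical; the external-memory version would keep the smallest structure in main memory, and deletions are not shown.

```python
def intersects(r, q):
    return r[0] <= q[2] and q[0] <= r[2] and r[1] <= q[3] and q[1] <= r[3]


class LogarithmicIndex:
    """Insert-only index built from static structures of sizes 1, 2, 4, ..."""

    def __init__(self):
        # levels[i] is None or a static structure holding exactly 2**i rectangles.
        self.levels = []

    def insert(self, rect):
        carry = [rect]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                # Stand-in for bulk-loading a static tree on `carry`.
                self.levels[i] = carry
                return
            # Level i is full: merge it into the carry and move one level up.
            carry = carry + self.levels[i]
            self.levels[i] = None
            i += 1

    def query(self, q):
        # A query has to consult every non-empty static structure.
        return [r for level in self.levels if level
                for r in level if intersects(r, q)]
```

Each rectangle takes part in at most a logarithmic number of rebuilds, which is where the extra logarithmic factor in the insertion bound comes from.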

14

Experiments

• Implemented with TPIE

– Priority R-tree

– Hilbert R-tree

– 4D Hilbert R-tree

– TGS R-tree

• Real-life data

– TIGER datasets

– 16 million rectangles

• Synthetic data

– Varying from normal to extreme data

– 10 million rectangles

15

Experiments with Real-Life Data

Query performance on the TIGER datasets

Shown: # I/Os spent in answering a query

T/B

16

Experiments with Synthetic Data: SIZE

Each side of a rectangle is uniformly distributed in [0, max_side]

Queries are squares with area 1%

17

Experiments with Synthetic Data: ASPECT

Fix the area, vary aspect ratio

18

Experiments with Synthetic Data: SKEWED

Randomly place points, then apply $y' = y^c$ to the y-coordinates
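A generator for this kind of dataset could look like the sketch below. It assumes the points are drawn uniformly from the unit square before the y-coordinates are raised to the power c; the function name `skewed_points` and the `seed` parameter are illustrative only.

```python
import random


def skewed_points(n, c, seed=0):
    """Return n points (x, y**c), with x and y drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random() ** c) for _ in range(n)]
```

With c = 1 the points stay uniform; larger values of c push them toward y = 0, which produces the increasingly skewed inputs this experiment varies.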

19

Experiments with Synthetic Data: CLUSTER

20

Conclusions

• In theory
  – The PR-tree is the first R-tree variant that answers a window query in $O(\sqrt{N/B} + T/B)$ I/Os worst-case, which is optimal
• In practice
  – Roughly the same as previous best R-trees on real-life and relatively nicely distributed data
  – Outperforms them significantly on more extreme data
• Future work
  – How previous heuristics may affect the performance of the PR-tree in the dynamic case

21

Lower Bound Construction

• Each bounding box intersects at least $\sqrt{B}$ queries
• N/B bounding boxes
• $\sqrt{N}$ queries
• There exists a query that intersects at least $\frac{(N/B) \cdot \sqrt{B}}{\sqrt{N}} = \sqrt{N/B}$ bounding boxes

22

Pseudo-PR-Tree: Query Complexity

• Nodes v visited where all rectangles in at least one of the priority leaves of v’s parent are reported: O(T/B)
• Let v be a node that is visited although none of the priority leaves at its parent is reported completely; consider v’s parent u

[Figure: the query Q shown in 2D and in 4D, with the hyper-planes xmax = xmin(Q) and ymin = ymax(Q)]

23

Pseudo-PR-Tree: Query Complexity

• The cell in the 4D kd-tree of u is intersected by two different 3-dimensional hyper-planes
• The intersection of each pair of such 3-dimensional hyper-planes is a 2-dimensional hyper-plane
• Lemma: # of cells in a d-dimensional kd-tree that intersect an axis-parallel f-dimensional hyper-plane is $O((N/B)^{f/d})$
• So, # such cells in a 4D kd-tree (f = 2, d = 4): $O(\sqrt{N/B})$
• Total # nodes visited: $O(\sqrt{N/B} + T/B)$

24

Experiments with Real-Life Data

• Datasets: TIGER/Line data

• Bulk-loading:
