TECHNISCHE UNIVERSITEIT EINDHOVEN
Department of Mathematics and Computer Science

An Experimental Evaluation of the Logarithmic Priority-R tree

by

Ummar Abbas

Advisor
dr. Herman Haverkort

Review Committee
dr. Herman Haverkort
prof. dr. Mark de Berg
dr. ir. Huub van de Wetering

Eindhoven, November 2006


"No amount of experimentation can ever prove me right; a single experiment can prove me wrong."

— Albert Einstein


Abstract

The Logarithmic PR-tree is an R-tree variant, based on the PR-tree, that maintains the worst-case query time while the tree structure is updated. This thesis is dedicated to the experimental study of the LPR-tree using the C++ template-based I/O-efficient library TPIE. It compares the performance of the LPR-tree with the R∗-tree, one of the most popular dynamic R-tree structures.


Acknowledgements

I am very grateful to dr. Herman Haverkort, my advisor, for all the support and help during this thesis. This work would not have been completed in time without the numerous discussions with him. I am also indebted to him for the huge effort he spent reviewing the early versions of this document in a very short time. My sincere thanks to prof. dr. Mark de Berg, head of the Algorithms group, for providing me with an opportunity to work here and for allowing me to work part-time on this thesis alongside my job. Special thanks to Micha Streppel for the help and guidance in using TPIE.

I am thankful to my wife Shabana and son Suhail for giving me all the encouragement I needed and all the time that was necessary to complete this thesis. Finally, I would like to express my deepest gratitude to my parents, who motivated me to take up this course.

Ummar Abbas
October 25, 2006


Contents

1 Introduction
  1.1 Context
  1.2 Aim and Approach
  1.3 Overview

2 R-Trees
  2.1 Original R-tree
  2.2 Dynamic Versions of R-trees
    2.2.1 Guttman's R-tree update algorithms
    2.2.2 The R∗-tree
    2.2.3 The Hilbert R-tree
    2.2.4 R+-tree
    2.2.5 Compact R-tree
    2.2.6 Linear Node Splitting
  2.3 Static Versions of R-trees
    2.3.1 The Hilbert Packed R-tree
    2.3.2 TGS R-tree
    2.3.3 Buffer R-tree

3 PR-tree Family
  3.1 Priority R-tree
    3.1.1 Pseudo-PR-tree
    3.1.2 PR-tree
  3.2 LPR-tree

4 Design and Implementation
  4.1 General Implementation Issues
  4.2 Two Dimensional Rectangle
  4.3 Pseudo-PR-tree
    4.3.1 Data Structures
    4.3.2 Construction Algorithm
    4.3.3 Implementation Issues
  4.4 LPR-tree
    4.4.1 Structure
    4.4.2 Insertion Algorithm
    4.4.3 Deletion Algorithm
    4.4.4 Implementation Issues

5 Experiments
  5.1 Experimental Setup
  5.2 Datasets
    5.2.1 Real life data
    5.2.2 Synthetic data
  5.3 Bulk Load
  5.4 Insertion
  5.5 Deletion
  5.6 Query

6 Conclusions

A Tables of Experimental Results
  A.1 Bulk Load
    A.1.1 LPR-tree
    A.1.2 R∗-tree
  A.2 Insertion
    A.2.1 Insertion Time
    A.2.2 Insertion I/O's
  A.3 Deletion
    A.3.1 Deletion I/O's and time
  A.4 Query
    A.4.1 LPR-tree
    A.4.2 R∗-tree

B Brief Introduction to TPIE


List of Figures

5.1  Bulk Load CPU time - Uniform dataset
5.2  Bulk Load CPU time - Normal dataset
5.3  Bulk Load CPU time - TIGER dataset
5.4  Bulk Load I/O - Uniform dataset
5.5  Bulk Load I/O - Normal dataset
5.6  Bulk Load I/O - TIGER dataset
5.7  Insertion CPU time - LPR-tree
5.8  Insertion CPU time - R∗-tree
5.9  Insertion average CPU time - LPR-tree vs. R∗-tree
5.10 Insertion I/O's - LPR-tree
5.11 Insertion I/O's - R∗-tree
5.12 Insertion average I/O's - LPR-tree vs. R∗-tree
5.13 Deletion CPU time - LPR-tree
5.14 Deletion CPU time - R∗-tree
5.15 Deletion average CPU time - LPR-tree vs. R∗-tree
5.16 Deletion I/O - LPR-tree
5.17 Deletion I/O - R∗-tree
5.18 Deletion average I/O's - LPR-tree vs. R∗-tree
5.19 Query CPU time (in msec) per B rectangles output
5.20 Query I/O's per B rectangles output
5.21 Empirical Analysis - Theoretical vs. Experimental query I/O results for LPR-tree


Chapter 1

Introduction

1.1 Context

Spatial data management is required in several areas such as computer-aided design (CAD), VLSI design, geo-data applications, etc. Data objects in such a spatial database are multi-dimensional (typically 2D or 3D). It is important to be able to search and retrieve objects using their spatial position. Classical indexing mechanisms such as the B-tree and its variants are not suitable for multiple dimensions and for range queries. The R-tree data structure introduced by Guttman[8] is considered one of the most efficient mechanisms for handling multi-dimensional spatial data. It is aimed at handling geometrical data such as points, line segments, surfaces, volumes and hyper-volumes. Since its introduction to the scientific community, various variants of this structure have been proposed.

1.2 Aim and Approach

This thesis is dedicated to the experimental study of the LPR-tree, an R-tree variant that claims to maintain the worst-case-optimal query guarantees made by its static variant, the PR-tree[3]. In particular, the thesis aims to achieve the following:

• To verify experimentally the theoretical worst-case-optimal query guarantees made for the LPR-tree.

• To compare the performance, in terms of I/O's and time, of the LPR-tree against state-of-the-art dynamic update algorithms such as the R∗-tree and the Hilbert R-tree.

• To derive some conclusions on when and under what specific conditions or scenarios the LPR-tree can (or cannot) outperform these R-tree variants.

We assume the following I/O model:


• There is a fast internal memory (main memory) that can hold M two-dimensional rectangles, and a single slow disk which holds the data and the results of computation.

• Data is transferred between the internal memory and the external memory in blocks. A block can hold B two-dimensional rectangles. When a block is read from or written to the external memory, an I/O is said to have been performed.

• Algorithms have full control over which block is evicted from the main memory and written to the disk, and vice versa.

• I/O's are the most important bottleneck while working with very large datasets. Therefore, the number of block reads/writes performed by (external memory) algorithms is considered to be the measure of the efficiency of an algorithm.

To perform the above investigations comprehensively, the LPR-tree is implemented using TPIE, an I/O-efficient C++ library. Experiments are designed to observe the performance of the LPR-tree under the variation of several parameters, such as the distribution of the dataset and its size. The same kind of experiments has to be repeated for other R-tree variants in order to compare and analyze the results. Due to time constraints and the non-availability of implementations of other dynamic R-trees on the TPIE platform, the thesis restricts its comparative study to the R∗-tree. The experiments are designed to cover the following areas:

• Performance (in terms of I/O's and time) of the LPR-tree update algorithms. This includes the cases where the tree gets rebuilt during updates.

• Comparison of the performance of the update algorithms of the LPR-tree and the R∗-tree.

• Comparison of the performance of the LPR-tree query algorithm after bulk load against the performance achieved when queries are interleaved with updates.

• Comparison of the query results of the LPR-tree and the R∗-tree on similar datasets and under similar conditions.

1.3 Overview

The thesis is organized into the following chapters:

• Chapter 2 introduces the original R-tree to the reader (in some detail) to provide a context and to show how various variants have been developed. This chapter also presents a survey of some popular R-tree variations that are based on different heuristics to achieve good query performance while still having good update performance. Because of the inherently heuristic nature of these algorithms, they do not give any asymptotic bounds on the worst-case query performance.


• Chapter 3 describes the two-dimensional pseudo-PR-tree, PR-tree and LPR-tree data structures. It also describes the update algorithms. The PR-tree is the first variant of an R-tree that guarantees optimal worst-case query time and whose construction is not based on heuristics. To achieve the same query performance during updates, the LPR-tree introduces update algorithms for insertion and deletion.

• Chapter 4 describes all the practical implementation issues, together with the pseudo-code of the algorithms and data structures.

• Chapter 5 describes the various experiments carried out on the LPR-tree, gives the results and provides an analysis of those results.

• Chapter 6 gives the conclusions derived from the experimental study that answer the questions raised in section 1.2.

• Appendix A can be used to get the exact figures of the various quantities measured during the experiments.

• Appendix B gives a brief introduction to TPIE, the I/O-efficient library used for the implementation of the LPR-tree.


Chapter 2

R-Trees

2.1 Original R-tree

An R-tree is a hierarchical data structure based on the B+-tree. A B+-tree is a height-balanced dynamic data structure used to index single-dimensional data; it supports efficient insertion and deletion algorithms. All data is stored in leaf nodes, while internal nodes contain keys and pointers to other nodes. While performing queries on this structure, the keys stored in the internal nodes help in traversing the tree from the root down to a leaf node using binary-search-type comparisons. The R-tree is a similar structure, characterized by the parameters k and K, where K is the maximum number of entries that fit in one node and k ≤ K/2 represents the minimum number of such entries.

The R-tree has the following properties:

• Each leaf node contains between k and K records, where each record is of the form (MBR, oid): MBR is the minimum bounding box enclosing the spatial data object identified by oid.

• Each internal node, except for the root, has between k and K children, represented by records of the form (MBR, p), where p is the pointer to a child and MBR is the minimum bounding box that spatially contains the MBRs in that child.
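
To make this layout concrete, a minimal C++ sketch of the two record types is given below; the type and field names (MBR, LeafEntry, InternalEntry, Node) are my own illustration and not taken from the thesis implementation.

#include <cstdint>
#include <vector>

// Axis-parallel minimum bounding rectangle in the plane.
struct MBR {
    double xmin, ymin, xmax, ymax;
};

// Record of a leaf node: bounding box plus the identifier of the data object.
struct LeafEntry {
    MBR     mbr;
    int64_t oid;
};

// Record of an internal node: bounding box plus a pointer to the child node.
struct Node;
struct InternalEntry {
    MBR   mbr;    // tightly encloses all MBRs stored in the child
    Node* child;
};

// A node holds between k and K entries (k <= K/2); leaves hold LeafEntry
// records, internal nodes hold InternalEntry records.
struct Node {
    bool                       isLeaf;
    std::vector<LeafEntry>     leafEntries;
    std::vector<InternalEntry> childEntries;
};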

Based on this definition, an R-tree on N rectangles has a height of at most Θ(log_k N). A query to find a given rectangle in an R-tree is called an exact match query. However, the more common form of query is the window (or range) query. Given a rectangle Q, the following forms a window query: find all the data rectangles that are intersected by Q. To answer such a query, we simply start at the root of the R-tree and recursively visit all the nodes whose minimum bounding boxes intersect Q; when encountering a leaf l we report all the rectangles in l that intersect Q. This results in the following algorithm:

Algorithm WindowQuery(ν, Q)
Input: Node ν in the R-tree and a query rectangle Q.
Output: Set A containing all the rectangles in the tree that intersect Q.


1. if ν is a leaf
2. then Examine each data rectangle r in ν to check if it intersects Q. If it intersects, then A ← A ∪ {r}.
3. else (∗ ν is an internal node. ∗)
4. Examine each entry to find exactly those children whose minimum bounding box intersects Q. Let the set of intersecting children nodes be Sc.
5. for µ ∈ Sc
6. A ← A ∪ WindowQuery(µ, Q).
7. return A.
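
The recursion above can be sketched in C++ as follows, reusing the hypothetical Node types from the previous sketch and ignoring the external-memory aspect (every node is assumed to be in main memory).

#include <cstdint>
#include <vector>

// Returns true if rectangles a and b share at least one point.
static bool intersects(const MBR& a, const MBR& b) {
    return !(b.xmin > a.xmax || b.xmax < a.xmin ||
             b.ymin > a.ymax || b.ymax < a.ymin);
}

// Collects in 'out' the oids of all data rectangles in the subtree rooted at
// 'node' whose MBR intersects the query rectangle q.
void windowQuery(const Node& node, const MBR& q, std::vector<int64_t>& out) {
    if (node.isLeaf) {
        for (const LeafEntry& e : node.leafEntries)
            if (intersects(e.mbr, q)) out.push_back(e.oid);
    } else {
        for (const InternalEntry& e : node.childEntries)
            if (intersects(e.mbr, q))           // only descend into children
                windowQuery(*e.child, q, out);  // whose MBR meets q
    }
}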

Various R-tree variations[12] have been proposed, some of them adapted for specific instances and environments. All the R-tree based data structures proposed in the literature can be classified into one of two categories.

• Dynamic versions of R-trees: R-tree based data structures where the objects are inserted or deleted on a one-by-one basis.

• Static versions of R-trees: R-tree based data structures that are built using bulk-loading algorithms on a-priori known static data.

2.2 Dynamic Versions of R-trees

2.2.1 Guttman’s R-tree update algorithms

Guttman[8] provides insertion and deletion algorithms for the R-tree structure proposed by him. The insertion and deletion algorithms use the bounding boxes of the nodes to ensure that nearby elements are placed in the same leaf node. Inserting a rectangle into the R-tree basically involves adding the rectangle to a suitable leaf node. As rectangles get inserted, at a certain point a leaf node overflows, thus requiring a split. This creates another child pointer in the parent node, which may cause the parent node to split, and so on, eventually, in the worst case, splitting the root node. The insertion algorithm is based on heuristics in the splitting process so that a good query performance can be achieved. The algorithm can be summarized as follows:

Algorithm Insert(r)
Input: A rectangle r to be inserted.
1. Descend through the tree to find a leaf node L whose MBR requires least enlargement to accommodate r.
2. if L does not have enough room to accommodate r
3. then split the node L to obtain an additional leaf node LL containing r. Propagate the split upwards through the tree, if necessary splitting the root node to create a new root.
4. else


5. Place r in L.
6. return.

In step 1, the leaf node where the rectangle has to be placed is chosen. This is done by descending through the tree, starting at the root node, until a leaf is found. At each step, the entry whose MBR requires least enlargement to include r is chosen. When the leaf node is already full, the node is split, resulting in the (K+1) rectangles being redistributed over two leaf nodes according to some splitting policy. If the newly added leaf makes its parent overflow, then the split has to be recursively propagated to the upper levels (step 3).

There are three techniques to split a node in step 3: the linear split, the quadratic split and the exponential split. Their names come from their complexity. These three splitting techniques can be summarized as follows:

1. Linear Split. This algorithm is linear in K and in the number of dimensions. Conceptually, this algorithm chooses the two rectangles that are furthest apart as seeds. The remaining rectangles are considered in a random order and added to the node that requires least enlargement of its MBR.

2. Quadratic Split. The cost is quadratic in K and linear in the number of dimensions. The algorithm picks two of the K + 1 entries to be the first elements of the two new groups, choosing the pair such that the area of a rectangle covering both entries, minus the area of the entries themselves, would be greatest. The remaining entries are then assigned to the groups one at a time. At each step the area expansion required to add each remaining entry to each group is calculated, and the entry assigned is the one showing the greatest difference between the two groups.

3. Exponential Split. All possible groupings are tested and the split that results in the least overlap area of the two MBR's is chosen. However, even for reasonably small values of K, this strategy is very expensive, as the number of possible groupings is approximately 2^(K−1).
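
For illustration, the quadratic seed-picking step could look as follows in C++; this is a sketch of the idea rather than Guttman's code, and it reuses the MBR type from the earlier sketch.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

static double area(const MBR& r) {
    return (r.xmax - r.xmin) * (r.ymax - r.ymin);
}

static MBR combine(const MBR& a, const MBR& b) {
    return { std::min(a.xmin, b.xmin), std::min(a.ymin, b.ymin),
             std::max(a.xmax, b.xmax), std::max(a.ymax, b.ymax) };
}

// Quadratic-split seed selection: examine all pairs of the K+1 entries and
// return the indices of the pair whose covering rectangle wastes the most area.
std::pair<std::size_t, std::size_t> pickSeeds(const std::vector<MBR>& entries) {
    std::pair<std::size_t, std::size_t> best{0, 1};
    double worstWaste = -1.0;
    for (std::size_t i = 0; i < entries.size(); ++i)
        for (std::size_t j = i + 1; j < entries.size(); ++j) {
            double waste = area(combine(entries[i], entries[j]))
                         - area(entries[i]) - area(entries[j]);
            if (waste > worstWaste) { worstWaste = waste; best = {i, j}; }
        }
    return best;
}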

2.2.2 The R∗-tree

Guttman's update algorithms are based completely on minimizing the overlap area of MBR's. Insertions and deletions are intermixed with queries and there is no periodic reorganization. The structure must allow overlapping rectangles, which means it cannot guarantee that there is only one search path for an exact match query. The R∗-tree is a variant of the R-tree proposed by Beckmann et al.[5] that strives to reduce the number of search paths for queries by incorporating a combined optimization of several parameters. The following is a summary of the parameters it tries to optimize:

• The area covered by each MBR should be minimized. Minimizing the dead space (the area covered by the MBR but not by the data rectangles) will improve query performance, as decisions about which paths are to be traversed can be made at higher levels of the tree.


• The overlap between MBR's should be minimized. A larger overlap implies more paths to be searched for a query. Hence this optimization also serves the purpose of reducing search paths.

• The perimeters of MBR's should be minimized. This optimization results in MBR's that are more quadratic (square-like). Query rectangles that are quadratic benefit the most from this optimization. As quadratic rectangles can be packed more easily, the bounding boxes at higher levels in the R-tree are expected to be smaller. In fact, this optimization leads to less variance in the edge lengths of bounding boxes, indirectly achieving area reduction.

• Storage utilization should be optimized. Storage utilization is defined as the ratio of the total number of rectangles in the R-tree to the maximum number of rectangles (the capacity) that can be stored across all the nodes of the R-tree. Low storage utilization would mean searching a larger number of nodes during queries. As a result, query cost is very high with low storage utilization, especially when a large part of the data set satisfies the query.

Optimization of the afore-mentioned parameters is not independent, as they affect each other in a very complex way. For instance, to minimize dead space and overlap, more freedom in choosing the shape is necessary, which would cause rectangles to be less quadratic. Also, minimization of perimeters may lead to reduced storage utilization. Based on the results of several experiments using these optimization criteria, Beckmann et al. propose the following two strategies to obtain a significant gain in query performance:

• A new node splitting algorithm that uses the first three optimization criteria. This algorithm is as follows:

Algorithm Split(ν)
Input: Node ν in the R-tree that contains the maximum number K + 1 of either data rectangles or MBR's.
Output: Nodes ν and µ containing the K + 1 entries.
1. for each axis
2. Sort the entries by min, then by max values on this axis.
3. Determine the K − 2k + 2 distributions of the K + 1 entries into two groups such that each group contains a minimum of k entries.
4. Compute σ, the minimum sum of the perimeters of the two MBR's across all the distributions.
5. Choose the axis with the minimum σ. The split is performed perpendicular to this axis.
6. From the K − 2k + 2 distributions along the chosen axis in line 5, choose the distribution with minimum overlap. Resolve ties by choosing the distribution with minimum dead space. The two groups of entries are collected in the nodes ν and µ.


7. return ν and µ

The split algorithm first determines the axis along which the split will be performed. To do this, it considers, for each axis in step 3, K − 2k + 2 distributions of the K + 1 entries into two groups, where the i-th distribution is determined by having the first group contain the first (k − 1) + i entries sorted along that axis and the second group contain the remaining entries.
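
A possible C++ sketch of this axis selection is shown below; it reuses the MBR type and the combine helper from the earlier sketches, assumes the node holds K + 1 = entries.size() entries with k ≤ K/2, and follows the thesis' description of σ as the minimum perimeter sum over all distributions.

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

static double perimeter(const MBR& r) {
    return 2.0 * ((r.xmax - r.xmin) + (r.ymax - r.ymin));
}

// Bounding box of entries[lo..hi).
static MBR boundOf(const std::vector<MBR>& v, std::size_t lo, std::size_t hi) {
    MBR b = v[lo];
    for (std::size_t i = lo + 1; i < hi; ++i) b = combine(b, v[i]);
    return b;
}

// Returns 0 (x-axis) or 1 (y-axis): the axis whose best distribution of the
// K+1 entries into two groups of at least k entries each has the smallest
// perimeter sum.
int chooseSplitAxis(std::vector<MBR> entries, std::size_t k) {
    int bestAxis = 0;
    double bestSigma = std::numeric_limits<double>::max();
    for (int axis = 0; axis < 2; ++axis) {
        std::sort(entries.begin(), entries.end(),
                  [axis](const MBR& a, const MBR& b) {
            return axis == 0
                ? (a.xmin < b.xmin || (a.xmin == b.xmin && a.xmax < b.xmax))
                : (a.ymin < b.ymin || (a.ymin == b.ymin && a.ymax < b.ymax));
        });
        double sigma = std::numeric_limits<double>::max();
        // the i-th distribution puts the first (k-1)+i entries in group one
        for (std::size_t i = 1; i <= entries.size() - 2 * k + 1; ++i) {
            std::size_t cut = (k - 1) + i;
            double s = perimeter(boundOf(entries, 0, cut))
                     + perimeter(boundOf(entries, cut, entries.size()));
            sigma = std::min(sigma, s);
        }
        if (sigma < bestSigma) { bestSigma = sigma; bestAxis = axis; }
    }
    return bestAxis;
}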

• An insertion algorithm that uses the concept of forced reinsertion to reinsert a fraction of the entries of an overflowing node, re-balancing the tree at certain steps.

Algorithm Insert(r)
1. Invoke ChooseSubTree to determine the leaf node ν where the insertion has to take place.
2. if ν contains less than K data rectangles
3. then Insert r in ν.
4. else (∗ ν has the maximum number of entries. Handle overflow. ∗)
5. if this is the first time an overflow has occurred at this level
6. then Reinsert the 30% of the rectangles of ν whose centroid distances from the node centroid are the largest.
7. else
8. Invoke Split(ν). Propagate the split upwards if necessary.
9. return

Algorithm ChooseSubTree
1. ν ← root of the R∗-tree.
2. while the children of ν are not leaves
3. ν ← the child whose MBR requires least area enlargement to include r. Resolve ties by choosing the entry µ whose MBR has the least area.
4. Choose the leaf node whose MBR requires least overlap enlargement to include r. Resolve ties by choosing nodes that need least area enlargement.

The dynamic reorganization of the R-tree by the reinsertion strategy during insertion achieves a kind of tree re-balancing and significantly improves query performance. As reinsertion is a costly operation, the fraction of rectangles reinserted has been experimentally tuned to 30% to yield the best performance. Also, reinsertion is restricted to be done once for each level of the tree.
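
The selection step of forced reinsertion could be sketched as follows (my illustration; only the 30% fraction is taken from the text above).

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the indices of the 30% of entries whose centers lie farthest from
// the center of the node's bounding box; these are removed and reinserted.
std::vector<std::size_t> entriesToReinsert(const std::vector<MBR>& entries,
                                           const MBR& nodeBox) {
    const double cx = 0.5 * (nodeBox.xmin + nodeBox.xmax);
    const double cy = 0.5 * (nodeBox.ymin + nodeBox.ymax);
    auto dist2 = [&](std::size_t i) {          // squared centroid distance
        double dx = 0.5 * (entries[i].xmin + entries[i].xmax) - cx;
        double dy = 0.5 * (entries[i].ymin + entries[i].ymax) - cy;
        return dx * dx + dy * dy;
    };
    std::vector<std::size_t> idx(entries.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return dist2(a) > dist2(b); });
    idx.resize(static_cast<std::size_t>(0.3 * entries.size()));  // farthest 30%
    return idx;
}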

2.2.3 The Hilbert R-tree

The Hilbert R-tree[10] is an R-tree variant which uses the notion of Hilbert value to define an ordering on the data rectangles. The Hilbert value of an n-dimensional point is calculated using the n-dimensional Hilbert curve. The Hilbert R-tree constructed on a dataset of n-dimensional rectangles uses the centroids of the rectangles to define the


ordering. Such an ordering has been shown[10] to preserve the proximity of such spatial objects quite well. The ordering of the data rectangles based on the Hilbert value also allows the Hilbert R-tree to use a conceptually different splitting technique known as deferred splitting.

In addition to this variation in the splitting algorithm, every internal node ν in the R-tree structure stores, in addition to the usual MBR, the largest Hilbert value (LHV) of the data rectangles that are stored in the subtree rooted at ν.

In the original R-tree, when a node overflows, a split is performed, as a result of which two nodes are created from a single node. This is referred to as a 1-to-2 splitting policy. The Hilbert R-tree implements the concept of deferred splitting by using a 2-to-3 splitting policy. This means that a split is not performed when a node overflows and either that node or one of its siblings, also known as cooperating siblings, can accommodate an additional entry. In general there could be an s-to-(s + 1) splitting policy.

Finally, these concepts of node ordering according to Hilbert value and the deferred splitting approach come together in the following insertion algorithm:

Algorithm Insert(r)
Input: A rectangle r to be inserted.
1. h ← Hilbert value of the centroid of r.
2. Recursively descend through the tree to select the leaf node ν for insertion. At each node, select the child node entry with the minimum LHV value greater than h.
3. if ν has an empty slot
4. then Insert r into ν.
5. else
6. HandleOverflow(ν, r), which will create a new leaf µ if a split was inevitable.
7. Propagate the node split of line 6 upwards, adjusting the MBR's and LHV of the nodes. If an overflow of the root caused a split, create a new root whose children are the previous root and the new node created by the split.
8. return.

Algorithm HandleOverflow(ν, r)
Input: A rectangle r to be inserted.
Output: A new node µ if a split occurred.
1. ε ← all the entries of ν and its s − 1 cooperating siblings.
2. ε ← ε ∪ {r}.
3. if |ε| < s·K (∗ at least one of the s − 1 cooperating siblings is not full. ∗)
4. then
5. Distribute ε among the s nodes respecting the Hilbert ordering.
6. return null.
7. else (∗ all the s − 1 siblings are full. ∗)
8. then
9. Create a new node µ and distribute ε among the s + 1 nodes respecting the Hilbert ordering.


10. return µ.

The Hilbert R-tree acts like a B+-tree for insertions and like an R-tree for queries, thereby achieving its acclaimed performance. However, it is vulnerable, performance-wise, to large objects. Performance is also penalized when the dimensionality of the space is increased. In that case, proximity is not well preserved by the Hilbert curve, leading to increased overlap of MBR's in internal nodes.

2.2.4 R+-tree

The R+-tree was proposed[11] as a variation of the R-tree structure to improve the performance of exact match queries. The original R-tree suffered from the problem that an exact match query may lead to the investigation of several paths from the root to a leaf, especially in cases where the data rectangles are dense/clustered. To obtain better query performance in such cases, the R+-tree introduces a variation in the R-tree structure that does not allow overlapping of MBR's at the same level of the tree. This is achieved by duplicating the stored data rectangles in more than one node. Because of this structural difference, the following changes are made to the update algorithms:

• Query - The query algorithm is similar to the one used for the R-tree, with the only difference being the removal of duplicate results.

• Insertion - The insertion algorithm proceeds in the same way as in the original R-tree to find the node whose MBR overlaps the rectangle r to be inserted. Once such a node is found, r is either inserted there, if there is enough space, or the node is split, resulting in a, sometimes drastic, reorganization of the tree structure which eventually stores r in duplicate. Under certain extreme circumstances, this can even lead to a deadlock[12].

• Deletion - Duplication of stored rectangles means that the deletion algorithm must take care to delete all occurrences of the rectangle to be deleted. Deletion is followed by a phase in which the MBR's have to be adjusted. However, deletion may reduce storage utilization significantly, requiring the tree to be periodically reorganized.

2.2.5 Compact R-tree

The Compact R-tree mechanism focuses on improving storage utilization in order to improve query performance. A very simple heuristic is applied to improve space utilization during insertions. When a node ν overflows, the K rectangles among the K+1 available rectangles are chosen such that their MBR is the minimum possible. These rectangles are kept in ν, and the remaining rectangle is moved to one of its siblings, provided that sibling has space and its MBR requires least enlargement. A split takes place only when all the siblings are completely filled with K rectangles each. This heuristic has experimentally been shown to improve utilization to around 97% to 99%. Insertion performance is improved by the


fact that fewer splits are required. However, the performance of window queries is seen not to differ much from that of Guttman's R-tree.

2.2.6 Linear Node Splitting

This technique, proposed in [1], introduces a new algorithm for linear node splitting. This algorithm can essentially be substituted for the splitting algorithm used in the original R-tree. The technique splits nodes based on the following heuristics, which are applied in the order stated:

• Distribute rectangles as evenly as possible.

• Minimize the overlapping area between the nodes.

• Minimize the total perimeter of the two nodes.

This means that when a node ν overflows, each of the K+1 rectangles is assigned to two of four lists: Lxmin, Lymin, Lxmax and Lymax. More precisely, for each rectangle r it is determined whether this rectangle is closer to the left or to the right edge of the MBR of ν. The rectangle is then assigned to one of the x-dimensional lists, Lxmin or Lxmax. Then, according to the y-coordinates of the rectangle, it is assigned to one of the y-dimensional lists, Lymin or Lymax. The node is split along the x-dimension if MAX(|Lxmin|, |Lxmax|) > MAX(|Lymin|, |Lymax|). If this is not true, the split is performed along the y-dimension, unless the two lists are of the same size. In the latter case, the overlap of these sets is considered. Finally, if this turns out to be equal as well, the total coverage is considered. Experiments have shown that these heuristics result in R-trees that have better characteristics and give better performance for window queries in comparison with the quadratic algorithm proposed by Guttman.
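
A hedged C++ reading of the choice of split dimension is sketched below; it covers only the first comparison (list sizes), with the overlap and coverage tie-breakers left as indicated in the comments.

#include <algorithm>
#include <cstddef>
#include <vector>

// Decide the split dimension for an overflowing node: assign every rectangle
// to the nearer edge of the node's MBR in each dimension and compare the
// sizes of the larger lists. Returns 0 to split along x, 1 to split along y.
int linearSplitDimension(const std::vector<MBR>& rects, const MBR& nodeBox) {
    std::size_t lxmin = 0, lxmax = 0, lymin = 0, lymax = 0;
    for (const MBR& r : rects) {
        // closer to the left or to the right edge of the node's MBR?
        if (r.xmin - nodeBox.xmin < nodeBox.xmax - r.xmax) ++lxmin; else ++lxmax;
        // closer to the bottom or to the top edge?
        if (r.ymin - nodeBox.ymin < nodeBox.ymax - r.ymax) ++lymin; else ++lymax;
    }
    std::size_t x = std::max(lxmin, lxmax);
    std::size_t y = std::max(lymin, lymax);
    if (x > y) return 0;   // split along the x-dimension
    if (y > x) return 1;   // split along the y-dimension
    return 0;              // equal: the heuristic next compares overlap, then coverage
}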

2.3 Static Versions of R-trees

2.3.1 The Hilbert Packed R-tree

The Hilbert Packed R-tree[9] is an R-tree structure designed with the aim of achieving 100% space utilization with good query performance, for applications where the R-tree does not require modifications, or requires them only very infrequently. In order to achieve very good query performance, data rectangles that are in close proximity must be clustered together in the same leaf. Similar to its dynamic counterpart, this structure uses the Hilbert curve and the resulting Hilbert values of the centroids of the rectangles as a heuristic to cluster rectangles. The tree is constructed in a bottom-up manner, starting from the leaf level and finishing at the root. The construction algorithm is outlined below:

Algorithm HilbertPack(S)
Input: Set S of data rectangles to be organized into an R-tree.


Output: R-tree, τ, packed with the data rectangles in the set S.
1. for each data rectangle r in S
2. Calculate the Hilbert value of the centroid of r.
3. Sort the rectangles in S on ascending Hilbert values calculated in line 2.
4. while S ≠ ∅
5. do
6. Create a new leaf node ν.
7. Add B rectangles to ν and remove them from S.
8. while there are > 1 nodes at level l
9. do
10. I ← MBRs of the nodes at level l.
11. while I ≠ ∅
12. do
13. Create a new internal node µ.
14. Add B MBRs with child pointers to µ and remove them from I.
15. l ← l + 1.
16. Set the root node of τ to the one node left.
17. return τ.

Experiments[9] showed that this variant of the R-tree significantly outperforms the original R-tree with quadratic split as well as the R∗-tree.

2.3.2 TGS R-tree

Unlike the Hilbert Packed R-tree, which constructs the R-tree bottom-up, the Top-down Greedy Splitting (TGS) method presented in [7] constructs the tree in a top-down manner, using an aggressive approach that greedily constructs the various subtrees of the R-tree. A top-down approach minimizes the cost of the levels that allow a potentially bigger reduction in the overall cost, i.e., the top levels of the R-tree. Essentially, the algorithm recursively partitions a set of N rectangles into two subsets by a cut orthogonal to an axis. This cut must satisfy the following two conditions:

1. The cost of an objective function f(r1, r2), where r1 and r2 are the MBR's of the resulting two subsets, is minimized.

2. One subset has a cardinality of i·S for some i, where S is fixed per level, so that the resulting subtrees are packed, i.e., the space utilization is 100%.

The algorithm to perform a cut is summarized as follows:

Algorithm TGS(n, f)
Input: n - number of rectangles in the data set.
       f - function f(r1, r2) that measures the cost of a split.
Output: Two subsets that form two subtrees of an R-tree.
1. if n ≤ K return.
2. for each dimension d
3. for each ordering in this dimension
4. for i ← 1 to ⌈n/S⌉ − 1
5. r1 ← MBR of the first i·S rectangles.
6. r2 ← MBR of the other rectangles.
7. Remember i if f(r1, r2) is the best so far.
8. Split the input set at the best position found in line 7.

The orderings in each dimension that are considered in line 3 are based on the min coordinate, the max coordinate, both (min followed by max) and the centroid of the input rectangles.

The cutting process described above is repeated recursively on the resulting subsets until a cut is no longer possible.

This binary split process can easily be extended to a K-ary split, where each internal node has K entries. This means that, to build the root of (a subtree of) an R-tree on a given set of rectangles, the algorithm repeatedly partitions the rectangles into two sets until they are divided into K subsets of equal size. Each subset's bounding box is stored in the root, and the subtrees are constructed recursively on each of the subsets.
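
As an illustration of the cut evaluation, the sketch below scans all candidate positions i for one fixed ordering; the cost function shown (sum of the two MBR areas) is only one possible choice of f, and the helpers area and boundOf are reused from the earlier sketches.

#include <cstddef>
#include <limits>
#include <vector>

// Evaluate all candidate binary cuts of 'rects' (already sorted in one of the
// orderings considered by TGS) into a prefix of i*S rectangles and the rest,
// and return the best i. The cost f(r1, r2) = area(r1) + area(r2) is just one
// possible objective; rects.size() is assumed to be larger than S.
std::size_t bestCut(const std::vector<MBR>& rects, std::size_t S) {
    std::size_t bestI = 1;
    double bestCost = std::numeric_limits<double>::max();
    for (std::size_t i = 1; i * S < rects.size(); ++i) {
        MBR r1 = boundOf(rects, 0, i * S);
        MBR r2 = boundOf(rects, i * S, rects.size());
        double cost = area(r1) + area(r2);
        if (cost < bestCost) { bestCost = cost; bestI = i; }
    }
    return bestI;   // the cut places the first bestI*S rectangles in one subtree
}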

2.3.3 Buffer R-tree

The Buffer R-tree[2] is not really a static R-tree, but provides efficient algorithms for bulk updates. It achieves I/O efficiency by exploiting the available main memory to cache rectangles when a rectangle is inserted. More precisely, it attaches buffers to every node at the (i·⌊log_B(M/(4B))⌋)-th level of the tree, where i = 1, 2, ..., M is the maximum number of rectangles that can fit in main memory, and B is the block size. A node with an attached buffer is called a buffer node.

In contrast with many other R-tree variations, the BR-tree does not split a node immediately when it overflows due to an insertion. Instead, it stores the inserted rectangle in the buffer of the root node. When the number of items in this buffer exceeds M/4, a specialized procedure is executed to free buffer space. Essentially, this procedure moves rectangles from a full buffer to the next appropriate buffer node at lower levels of the tree. Such movements must respect the various branching heuristics. When we reach a leaf, the rectangle is inserted and a split is performed when there is an overflow. Evidently, for some insertions no I/O is incurred. The BR-tree supports bulk insertions, bulk deletions, bulk loading and batch queries. Experimental results show that the BR-tree requires smaller execution times to perform bulk updates and produces a good index for query processing.


Chapter 3

PR-tree Family

3.1 Priority R-tree

The Priority R-tree[3], or PR-tree, is the first R-tree variant that guarantees a worst-case query performance that is provably asymptotically optimal. The name of the tree derives from the use of priority rectangles to bulk load the tree. The bulk-loading algorithm makes use of an intermediate data structure called the pseudo-PR-tree. In the next section this data structure is described together with a construction algorithm. The exact pseudo-code describing the implementation of the algorithm is presented in chapter 4. For simplicity, the description is for the two-dimensional case. The discussion and the results can easily be generalized to higher dimensions.

3.1.1 Pseudo-PR-tree

Definition. Let S = {R1, ..., RN} be a set of N input data rectangles in the plane. The set is mapped to a set of four-dimensional points S∗, where each element of S∗ is obtained from the corresponding rectangle in S using the following relation:

S∗(Ri) ≡ (xmin(Ri), ymin(Ri), xmax(Ri), ymax(Ri))

where:

• xmin(Ri), ymin(Ri) are the coordinates of the bottom-left vertex of Ri.

• xmax(Ri), ymax(Ri) are the coordinates of the top-right vertex of Ri.

A pseudo-PR-tree T(S) is defined recursively as follows:

• The root has at most six children, namely four priority leaves and two so-called kd-nodes. The root also stores the MBR of each of the children.


• The first priority leaf contains the B rectangles in S with the smallest xmin coordinates; the second, the B rectangles among the remaining rectangles with the smallest ymin coordinates; the third, the B rectangles among the remaining rectangles with the largest xmax coordinates; and finally the fourth, the B rectangles among the remaining rectangles with the largest ymax coordinates.

• The set Sr of remaining rectangles is divided into two sets S1 and S2 of approximately the same size. The kd-nodes are the roots of the pseudo-PR-trees constructed on S1 and S2 respectively. The division of the rectangles is performed using the xmin, ymin, xmax, ymax coordinates in a round-robin fashion. This means that the division performed at the root node is based on the xmin-values, the division at the next level of recursion is based on the ymin-values, then on the xmax-values, then the ymax-values, then the xmin-values again, and so on. The split value used is stored in the internal node in addition to the bounding boxes.

Since each leaf or node of the pseudo-PR-tree is stored in O(1) disk blocks, and since at least four of the six leaves contain Θ(B) rectangles, the tree occupies O(N/B) disk blocks.
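
One possible in-memory representation of such a node is sketched below; this is an assumption of mine for illustration and not the TPIE block layout described in chapter 4.

#include <vector>

// A priority leaf: up to B rectangles that are extreme in one direction.
struct PriorityLeaf {
    std::vector<MBR> rects;                    // |rects| <= B
};

// Internal node of a pseudo-PR-tree: four priority leaves, two kd-children,
// the MBR of every child and the split value used for the kd-division.
struct PseudoPRNode {
    PriorityLeaf  pXmin, pYmin, pXmax, pYmax;  // the four priority leaves
    PseudoPRNode* left   = nullptr;            // kd-child for the "low" side
    PseudoPRNode* right  = nullptr;            // kd-child for the "high" side
    MBR           childBox[6];                 // bounding box of each child
    int           splitDim   = 0;              // 0..3: xmin, ymin, xmax, ymax
    double        splitValue = 0.0;            // coordinate used to divide Sr
};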

It has been proved[3] that a window query on a pseudo-PR-tree with N rectangles uses O(√(N/B) + T/B) I/O's in the worst case.

Following this definition directly, constructing the tree on a set of N rectangles takes O((N/B) log N) I/O's. However, bulk loading the tree can be done I/O-efficiently using O((N/B) log_{M/B}(N/B)) I/O's. This is done using a four-dimensional grid that defines a partition of the four-dimensional space and stores the counts of rectangles that lie in each unit of such a partition. The construction algorithm recursively constructs Θ(log M) levels using this grid. Essentially the grid helps in preventing the I/O's that would otherwise have been required to split the input set correctly based on the appropriate dimension. The following algorithm describes the construction of Θ(log M) levels of the tree. The algorithm is then used recursively to construct the complete tree.

Algorithm Construct(S∗, ν, level)
Input: S∗ - set of 4-dimensional points representing the N rectangles in two-dimensional space.
       ν - root of a pseudo-PR (sub)tree.
       level - level of the node ν.
Output: Pseudo-PR-tree rooted at ν containing the points in S∗.
1. Construct four sorted lists Lxmin, Lymin, Lxmax, Lymax containing the points in S∗ sorted by the respective dimensions.
2. z ← α·M^(1/4) (∗ Θ(M^(1/4)); α ≥ 0 has to be chosen by the implementation. ∗)
3. Using the sorted lists created in step 1, create a four-dimensional grid using the (kN/z)-th coordinate along each dimension, where 0 ≤ k ≤ z − 1. Keep the counts of points in each grid cell.


4. Split(S∗, level, ν).
5. Add priority leaves to all the nodes created in the previous step.
6. while S∗ has unprocessed points and the subtree rooted at ν has space
7. do
8. r ← next point in S∗.
9. Add r to the tree, so that all the properties of a pseudo-PR-tree are preserved.
10. return

First the lists are sorted along each of the four dimensions. This helps in initializing the rectangle counts in the grid. The algorithm Split is used in step 4 to create all the internal kd-nodes, by recursively splitting the grid until Θ(log M) levels of the tree have been constructed. A final step in the construction algorithm (step 6) distributes the rectangles over the tree, respecting the properties of a pseudo-PR-tree. More precisely, we fill the priority leaves by scanning S∗ and filtering each point p through the tree, one by one, as follows. We start at the root ν of the tree and check its priority leaves νxmin, νymin, νxmax and νymax, one by one, in that order. If we encounter a non-full leaf we simply place p there; if we encounter a full leaf νdim and p is more extreme than the least extreme point p′ in νdim, we replace p′ with p and continue the filtering process with p′. After we have checked νymax, we continue to check (recursively) one of the kd-nodes of ν. The kd-node to check is chosen using the split value stored in ν.
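
The filtering step for a single priority leaf can be sketched as follows; the comparison assumes an xmin-priority leaf (the other three leaves are symmetric), B is the leaf capacity, and the Point4 type is a stand-in for the four-dimensional points defined in section 3.1.1.

#include <algorithm>
#include <cstddef>
#include <vector>

// Four-dimensional point corresponding to a rectangle (xmin, ymin, xmax, ymax).
struct Point4 { double c[4]; };

// Try to place p in an xmin-priority leaf of capacity B. Returns true if p was
// absorbed; otherwise *carry holds the point that must be filtered further down
// (either p itself or the evicted least extreme point of the leaf).
bool filterThroughXminLeaf(std::vector<Point4>& leaf, std::size_t B,
                           const Point4& p, Point4* carry) {
    if (leaf.size() < B) { leaf.push_back(p); return true; }
    // the least extreme point of an xmin-leaf is the one with the largest xmin
    auto worst = std::max_element(leaf.begin(), leaf.end(),
        [](const Point4& a, const Point4& b) { return a.c[0] < b.c[0]; });
    if (p.c[0] < worst->c[0]) {   // p is more extreme: swap it in
        *carry = *worst;
        *worst = p;
    } else {
        *carry = p;               // leaf unchanged, p continues downwards
    }
    return false;
}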

The algorithm Split, which creates all the internal nodes, is summarized below.

Algorithm Split(S, level, ν)
Input: S - set of 4-dimensional points, level - current level being constructed, ν - node to be split.
(∗ Constructs Θ(log M) levels of the (sub)tree rooted at ν recursively. ∗)
1. if level > β·log M (∗ Θ(log M); β ≥ 0 has to be chosen by the implementation. ∗)
2. then return
3. d ← split dimension of level.
4. Using the grid built in step 3 of Construct, find, using the split dimension d, the exact slice where the set can be divided into two sets S1 and S2 of roughly the same size.
5. Create two nodes ν1 and ν2 whose parent is ν and store the split value used in ν.
6. Split(S1, level + 1, ν1)
7. Split(S2, level + 1, ν2)
8. return

At each recursive step, the appropriate dimension d is used to split the grid (step 4). Using the grid, it is easy to find the approximate slice l where the set of rectangles can be divided into roughly two equal halves. Once this is achieved, the exact split can be found by scanning the sorted stream along dimension d. In order to do this, only O(N/(Bz)) blocks have to be scanned, as we have to scan only the rectangles that lie in slice l. Once this is done, a new slice l′ containing O(z^3) grid cells is added to the grid, basically splitting slice l. The rectangle counts in the grid cells belonging to these slices are computed using the same O(N/(Bz)) blocks.


3.1.2 PR-tree

A two-dimensional PR-tree is an R-tree with a fanout of Θ(B), constructed using a pseudo-PR-tree. It maintains the query performance of O(√(N/B) + T/B) I/O's in the worst case. The following algorithm is used to construct a PR-tree, in a bottom-up manner, on a set S of two-dimensional rectangles:

Algorithm Construct(S)
Input: Set S of N rectangles in two-dimensional space.
Output: PR-tree rooted at node ν.
1. V0 ← leaves built from the set S, with Θ(B) rectangles in each leaf.
2. i ← 0.
3. while the number of MBR's of Vi ≥ B
4. do
5. τVi ← pseudo-PR-tree on the MBR's of Vi.
6. Vi+1 ← leaves of τVi.
7. Match the MBR's of the nodes in Vi+1 with the rectangles in Vi and set the child pointers in Vi+1 to the nodes in Vi.
8. i ← i + 1.
9. Construct the root node ν from the MBR's of Vi and set its children.
10. return the PR-tree rooted at ν.

It can be proved[3] that this algorithm bulk loads the PR-tree in O((N/B) log_{M/B}(N/B)) I/O's. The PR-tree can be updated using standard heuristic-based R-tree update algorithms in O(log_B N) I/O's in the worst case, but without maintaining the query efficiency.

3.2 LPR-tree

The Logarithmic Priority R-tree is an adaptation of the conventional R-tree structure, based on the pseudo-PR-tree structure, aimed at supporting the same worst-case query performance while the tree is updated. The adaptations to the structure are two-fold:

• Internal nodes store additional information besides the MBR.

• Leaf nodes are not at the same level.

The root of an LPR-tree has a number of subtrees of varying capacities. Each of these subtrees, known as Annotated PR-trees (APR-trees), is a normal pseudo-PR-tree in which each internal node ν stores the following information:

• Pointers to each of ν's children, and the MBR of each child.

• The split value that is used to cut ν in the four-dimensional kd-tree.


• For each priority leaf of ν, the least extreme value of the relevant coordinate of any rectangle stored in that leaf.

An LPR-tree has up to ⌈log(N/B)⌉ + 3 subtrees, τ0, τ1, τ2, ..., τ⌈log(N/B)⌉+2. τ0 can store at most B rectangles, and τi for i > 0 has a capacity of at most 2^(i−1)·B rectangles. Since each APR-tree has a different capacity, the leaf nodes are not all at the same level.
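
The capacities follow the usual logarithmic method: the contents of τ0, ..., τ(j−1) together comprise at most 2^(j−1)·B rectangles and therefore fit into the first empty subtree τj. A small C++ sketch of this bookkeeping (my illustration, not thesis code):

#include <cstddef>
#include <vector>

// Capacity of subtree tau_i: B rectangles for i = 0, 2^(i-1)*B for i > 0.
std::size_t capacity(std::size_t i, std::size_t B) {
    return i == 0 ? B : (std::size_t{1} << (i - 1)) * B;
}

// Index j >= 1 of the first empty subtree; the rectangles of tau_0..tau_{j-1}
// (at most B + B + 2B + ... + 2^(j-2)*B = 2^(j-1)*B of them) fit into tau_j.
std::size_t rebuildTarget(const std::vector<bool>& isEmpty) {
    std::size_t j = 1;
    while (j < isEmpty.size() && !isEmpty[j]) ++j;
    return j;   // equals isEmpty.size() if no subtree is empty
}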

In addition to these adaptations, the LPR-tree structure prescribes a disk layout strategy for the nodes of the tree:

• Each internal node of an APR-tree at depth i, such that i ≡ 0 (mod ⌊log B⌋), and its descendant internal nodes down to depth i + ⌊log B⌋ − 1 are stored in the same block.

• The smaller APR-trees τi, for m ≥ i ≥ 0 where m = log(M/B), are stored in main memory.

• Of the larger APR-trees τi, for ⌈log(N/B)⌉ + 2 ≥ i ≥ l where l = log(N/M), the top i − l levels are kept in main memory and the rest of the tree is stored on the disk.

• The remaining APR-trees τi, for l > i > m, are stored completely on the disk.

The LPR-tree is bulk loaded with a set S of N rectangles by building the APR-tree τ⌈log(N/B)⌉+2. The rest of the trees are left empty.

Insertion is done using the following algorithm:

Algorithm Insert(r)
Input: r - rectangle to be inserted.
1. if the number of insertions made so far ≥ the number of rectangles with which the tree was bulk loaded
2. then
3. S ← {rectangles from all the APR-trees} ∪ {r}.
4. Reconstruct the LPR-tree using S.
5. return.
6. if τ0 has less than B rectangles
7. then
8. Insert r in τ0.
9. else
10. j ← 1.
11. for i ← 1 to ⌈log(N/B)⌉ + 2
12. if τi is empty
13. then j ← i and break (∗ continue with step 14 ∗)
14. S ← rectangles from all trees τk where 0 ≤ k ≤ j − 1.
15. Empty all trees τk where 0 ≤ k ≤ j − 1.
16. Build an APR-tree using S and store it as τj.
17. Insert r in τ0.


18. return.

Insertion using this algorithm takes O((1/B)·log_{M/B}(N/B)·log_2(N/M)) I/O's amortized.

Deletion is done using the following algorithm:

Algorithm Delete(r)
Input: r - rectangle to be deleted from the LPR-tree.
1. if the number of deletions made so far ≥ half the number of rectangles with which the tree was bulk loaded
2. then
3. S ← {rectangles from all the APR-trees} \ {r}.
4. Reconstruct the LPR-tree using S.
5. return.
6. Search for r in each subtree.
7. if r is not found
8. then return.
9. L ← priority leaf where r is found.
10. Delete r from L.
11. νp ← parent of L.
12. if L contains more than B/2 rectangles
13. then return.
14. Replenish L with rectangles from the sibling priority nodes and from the children priority nodes of the parent of L.
15. return.

The search algorithm in line 6 is quite trivial. It recursively searches the tree starting at the root. It uses the annotated information stored at each internal node about its priority leaves, and the split value, to determine the child node in which to continue the search. Eventually it either locates the rectangle in a leaf or reports failure when the search rectangle is not found. If deleting a rectangle from a node L causes the node to underflow, then this node has to be replenished with rectangles (step 14). This is done by moving the B/2 most extreme rectangles among the sibling priority nodes that follow L and the children priority nodes of the sibling kd-nodes of L into L. It is possible that one of the priority nodes from which rectangles were moved might underflow in turn. Such priority nodes are also replenished recursively in the same manner as L. Chapter 4 will describe the pseudo-code used for replenishing rectangles in detail.

Taking into account the rebuilding of the tree in step 1 of the Delete algorithm and the replenishing of nodes, it can be shown[3] that deleting a rectangle from an LPR-tree takes O(log_B(N/M)·log_2(N/M)) I/O's amortized.


Chapter 4

Design and Implementation

I implemented the PR-tree family for two-dimensional rectangles. The sections that follow explain the design and implementation of the data structures and the algorithms. Section 4.1 describes some general implementation issues that were encountered during the implementation and how they were handled. Section 4.2 describes fundamental data structures used throughout the implementation. Sections 4.3 and 4.4 describe the data structures of the pseudo-PR-tree and the LPR-tree in terms of TPIE concepts. These sections also explicitly describe implementation issues related to the respective areas.

4.1 General Implementation Issues

Although the PR-tree had been implemented before, that code could not be reused for implementing the update algorithms because the underlying TPIE library had undergone major changes (for example, in the caching mechanisms). Most of the following issues occurred at different places in the implementation. Almost all of them are related to the usage of TPIE (see Appendix B).

• Memory tracking
AMI_block objects represent logical disk blocks in TPIE. These blocks have unique identifiers in an AMI_collection. Very often these block objects are created in a certain place in the code and have to be deleted at a different place (for instance, due to caching). When blocks are not deleted, the available memory runs out and TPIE simply aborts the application. Also, when a deleted block is accessed, exceptions occur at unpredictable places. To solve this range of errors, a memory tracker class was created. Every time a new statement is executed, the memory tracker is invoked to record the block id and the line number where the allocation is made. Similarly, every time a delete statement is executed, the memory tracker is invoked to release the allocation reference (block id). This memory tracker has the following benefits:


– It can be used at different points in the code to check whether allocations and de-allocations match or have occurred correctly.

– It can detect if blocks are re-allocated without getting de-allocated.

– At the end of the program, it can dump the allocations that have not yet been de-allocated, together with the line numbers where they were created, which greatly helps in fixing memory leaks. In fact, this trick was used to fix the R∗-tree implementation that was included in the TPIE distribution.
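The tracker itself can be kept very simple. The following is a minimal C++ sketch of the idea, not the actual thesis code: the class name BlockTracker and the use of a plain std::map are my own illustration, and a plain long stands in for TPIE's block id type.

#include <cstdio>
#include <map>

// Minimal sketch of a block-allocation tracker: it records, for every
// allocated block id, the source line where the allocation happened,
// and forgets the entry again when the block is released.
class BlockTracker {
public:
    void recordNew(long blockId, int line) { live_[blockId] = line; }
    void recordDelete(long blockId)        { live_.erase(blockId); }

    // Dump every block that was allocated but never released.
    void dumpLeaks() const {
        for (std::map<long, int>::const_iterator it = live_.begin();
             it != live_.end(); ++it)
            std::printf("block %ld allocated at line %d was never deleted\n",
                        it->first, it->second);
    }

private:
    std::map<long, int> live_;  // block id -> line number of the allocation
};

Every place that creates or deletes a block would then call recordNew(id, __LINE__) or recordDelete(id), and dumpLeaks() is called at the end of the program.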

• I/O Counts
TPIE provides an interface to obtain the number of block reads and writes for AMI stream and AMI collection objects. However, the interface to obtain these statistics for the AMI stream does not work. Hence, to work around this problem, we take the number of item reads and writes and divide it by the number of such items that fit in a block. For the LPR-tree we know that this gives a good approximation, as most of the block I/O for streams occurs during sorting, when almost all blocks that are read or written are full.
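As a sketch of this work-around (the function and parameter names are illustrative): itemReads and itemWrites stand for the per-item statistics that are available, and itemsPerBlock for the number of stream items that fit in one logical block.

// Approximate the number of block I/O's of a stream from its item I/O's.
// The approximation is good when most blocks that are read or written are
// (nearly) full, as is the case during the sorting phases of the LPR-tree.
long approximateBlockIOs(long itemReads, long itemWrites, long itemsPerBlock) {
    return (itemReads + itemWrites + itemsPerBlock - 1) / itemsPerBlock;
}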

• Miscellaneous issues
There were several minor problems related to using TPIE; the following are some of the most important ones, which took some effort to trace:

– It is not possible to delete an item from an AMI stream. This is a problem for the LPR-tree implementation, as it is required to filter out rectangles already placed in a tree to start the next recursive step in the construction of the tree. To do so, it would be nice to be able to delete the already placed rectangles from the four streams that are sorted along the four respective directions. To work around this problem, we filter out the rectangles to be placed in the tree and sort them again along the four dimensions.

– Sorting an AMI stream that is already sorted returns an error code indicating that the stream is already sorted, as opposed to the normal success scenario where the return value would indicate success. Also, in such a case, the stream that is provided to store the sorted objects is empty, as the user is expected to reuse the original stream. This problem was easily worked around.

– When a block collection (usually representing a tree) is stored to disk, TPIE also stores a stack of free blocks in a separate file. Care should be taken that, when making a copy of a tree stored on disk, the stack (.stk) file is copied as well. Not doing so results in crashes at different places in TPIE that do not easily reveal the actual problem.


4.2 Two Dimensional Rectangle

• Data Structure
The two-dimensional axis-parallel rectangle is represented by the following data structure:

DataStructure TwoDRectangle
Begin
    double min[2]
    double max[2]
    AMI bid id
End

The arrays min and max contain the minimum and maximum values of the coordinates in the x and y directions, respectively. id represents a unique identifier in a stream of rectangles. This data structure will simply be referred to as a rectangle in the subsequent discussion.

A two-dimensional stream of rectangles, TwoDRectangleStream, is an AMI stream of TwoDRectangle objects.
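In C++ this is essentially a plain struct; the sketch below (not the thesis code) uses a 64-bit integer in place of TPIE's AMI bid type.

#include <cstdint>

// Axis-parallel rectangle in the plane: min[0]/max[0] give the x-range,
// min[1]/max[1] the y-range; id identifies the rectangle in a stream.
struct TwoDRectangle {
    double        min[2];
    double        max[2];
    std::uint64_t id;
};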

• Operations
During the various operations on the LPR-tree, the following two operations are frequently performed on a TwoDRectangle:

– Intersection of rectangles
Two rectangles are said to intersect if their edges intersect or if one rectangle is completely contained in the other. It should be noted that intersection of rectangles is commutative. The following pseudo-code is used to determine rectangle intersection:

PseudoCode Intersects(r1, r2)
Input: Two TwoDRectangle objects r1 and r2.
Output: true if r1 and r2 intersect, false otherwise.
1. b1 ← r2.xmin > r1.xmax (∗ r2 lies to the right of r1. ∗)
2. b2 ← r2.xmax < r1.xmin (∗ r2 lies to the left of r1. ∗)
3. b3 ← r2.ymin > r1.ymax (∗ r2 lies above r1. ∗)
4. b4 ← r2.ymax < r1.ymin (∗ r2 lies below r1. ∗)
5. return not (b1 or b2 or b3 or b4).

Basically, the above algorithm checks whether the rectangles do not intersect and returns the negation of that result.
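A direct C++ rendering of this test, assuming the TwoDRectangle struct sketched above (touching rectangles count as intersecting, because all comparisons are strict):

// Two axis-parallel rectangles intersect unless one lies strictly to the
// right of, to the left of, above, or below the other.
bool intersects(const TwoDRectangle& r1, const TwoDRectangle& r2) {
    bool separated = r2.min[0] > r1.max[0]   // r2 lies to the right of r1
                  || r2.max[0] < r1.min[0]   // r2 lies to the left of r1
                  || r2.min[1] > r1.max[1]   // r2 lies above r1
                  || r2.max[1] < r1.min[1];  // r2 lies below r1
    return !separated;
}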

– Computing the minimum bounding box


Given a list of rectangles, the minimum bounding box is computed by linearly traversing the list and keeping track of the most extreme coordinates in the respective directions. The following pseudo-code describes this procedure.

PseudoCode ComputeMinimumBoundingBox(r, n)
Input: List r of n TwoDRectangle objects.
Output: TwoDRectangle that is the minimum bounding box of all the rectangles in the specified list.
1. mbb ← r_0 (∗ mbb is set to the first rectangle in the list ∗)
2. for i ← 1 to n
3. do
4.   if mbb.xmin > r_i.xmin
5.   then mbb.xmin ← r_i.xmin
6.   if mbb.ymin > r_i.ymin
7.   then mbb.ymin ← r_i.ymin
8.   if mbb.xmax < r_i.xmax
9.   then mbb.xmax ← r_i.xmax
10.  if mbb.ymax < r_i.ymax
11.  then mbb.ymax ← r_i.ymax
12. return mbb.

4.3 Pseudo-PR-tree

4.3.1 Data Structures

We first describe the priority node and the internal node (KD-node) structures. The tree itself is represented by its root node, which is an internal node. The complete tree is stored on disk as an AMI collection.

• PriorityNode
A PriorityNode is an AMI block that can hold at most B rectangles. All these rectangles are stored in the el field of the block. In implementation terms, the priority node class is derived from an AMI block. The info field of the block, in this case, contains the number of rectangles currently present in the block. The priority node stores rectangles in sorted order. The sorting order is determined by the dimension the node represents. So, for example, if the node is an xmin or a ymin priority node, the rectangles are sorted in ascending order according to their coordinate value in that dimension. The sorting order is descending when the node is an xmax or a ymax node.

A rectangle can be added to a priority node only when it contains fewer than B rectangles. The stored rectangles are always maintained in sorted order according to the dimension the priority node represents in the pseudo-PR-tree. The insertion


pseudocode is described below. It follows a binary search pattern to locate the position of insertion.

PseudoCode InsertRectangle(r, dim)
Input: TwoDRectangle r to be inserted along the specified dimension dim. (∗ Assumption: the priority node has at least one position free. ∗)
1. insertValue ← r_dim
2. first ← 0 ; last ← (n − 1) ; mid ← 0
3. compare ← <
4. if IsMinDimension(dim)
5. then compare ← >
6. while first ≤ last
7. do
8.   mid ← ⌊(first + last)/2⌋
9.   midRectangle ← el_mid
10.  midValue ← midRectangle_dim
11.  if compare(midValue, insertValue) = true
12.  then last ← (mid − 1)
13.       lastRectangle ← el_last
14.       lastValue ← lastRectangle_dim
15.       if compare(lastValue, insertValue) ≠ true
16.       then break
17.  else
18.       if midValue ≠ insertValue
19.       then first ← (mid + 1)
20.       else break
21. el_mid ← r

When a rectangle gets inserted in step 21, all the rectangles from the position of insertion onwards are moved one position to the right. A similar procedure is followed for deleting a rectangle.

When the priority node is already full, the least extreme rectangle in the node, that is, the rectangle at position (n − 1), is replaced if the rectangle being inserted is more extreme than it.
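The following C++ sketch illustrates the idea only; it is not the thesis code. It keeps a fixed-capacity array sorted from most extreme to least extreme by a comparator extremeBefore (ascending xmin for an xmin node, descending xmax for an xmax node, and so on), uses std::lower_bound instead of the hand-written binary search above, and simply drops a displaced rectangle (in the real structure the displaced rectangle is passed on during distribution, see Section 4.3.2).

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of a priority node holding at most B rectangles, kept sorted from
// most extreme to least extreme along the node's own dimension.
struct PriorityNodeSketch {
    std::size_t B;                                    // capacity of the node
    std::vector<TwoDRectangle> el;                    // sorted contents
    bool (*extremeBefore)(const TwoDRectangle&, const TwoDRectangle&);

    // Insert r; if the node is full, replace the least extreme rectangle
    // (the last one) only when r is more extreme than it.
    void insert(const TwoDRectangle& r) {
        if (el.size() == B) {
            if (!extremeBefore(r, el.back())) return; // r is not more extreme
            el.pop_back();                            // drop the least extreme
        }
        el.insert(std::lower_bound(el.begin(), el.end(), r, extremeBefore), r);
    }
};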

• Internal Node (KD-node)
The following data structure describes the internal node of the pseudo-PR-tree.


DataStructure InternalNode
Begin
    TwoDRectangle minimumBoundingBoxes[6]
    double leastExtremeValues[4]
    double splitValue
    AMI bid priorityNodes[4]
    AMI bid subTreeIds[2]
    int subTreeIndices[2]
End

The internal node holds pointers (block id's) to the children priority nodes and to the roots of the recursive subtrees. The node itself is stored in an AMI block. It also contains the annotated information, namely, the minimum bounding boxes of all the children nodes and the least extreme values of each of the priority nodes. There can be at most 4 priority nodes and 2 internal nodes. To identify the children KD-nodes completely, it is also necessary to store the location in the block where each node can be found, in addition to its block id.

4.3.2 Construction Algorithm

The construction algorithm recursively builds the tree in a top-down fashion. I/O efficiency in the construction is obtained using a grid. The grid is a 4-dimensional structure defined by the coordinate axes xmin, xmax, ymin, ymax. There are z 3-dimensional slices defined orthogonal to each dimension d, using the input stream that has the four-dimensional points sorted on the coordinate values of dimension d. In particular, these slices are defined such that the numbers of rectangles that lie between adjacent slices are equal. The z slices orthogonal to each of the four dimensions divide the four-dimensional space defined by the set S∗ into a grid that contains z^4 grid cells. Each grid cell holds the count of the number of rectangles that lie in that cell. As the grid is kept in main memory, splitting the internal nodes to create sibling KD-nodes can be performed I/O-efficiently, as only a limited number of block I/O's are necessary to perform such a split. Using the grid, each recursive step builds a part of the tree in memory and distributes the rectangles. The distribution ensures that the properties of a pseudo-PR-tree are maintained correctly. We first describe the structure of the grid and some of the important operations on the grid, followed by the pseudo-code for the construction algorithm itself.

• Axis Segments
There are z axis segments orthogonal to each of the four dimensions. As described earlier, the grid will be split recursively during the construction algorithm to create the internal nodes of the pseudo-PR-tree. The grid helps in determining the slice l that requires a split. However, in order to determine the exact position of the split,


the part of the input stream that is sorted along the dimension of the split and that contains the rectangles lying in the slice l needs to be accessed. To achieve this, for each axis segment orthogonal to a certain dimension d, we need to store the offset into the input stream sorted along d. This is achieved by having four hash tables, AxisSegments[4], one for each dimension, whose keys are the coordinate values defining an axis and whose values are the offsets to the correct position in the sorted stream. To use memory efficiently, these hash tables are shared across the sub-grids that are created as a result of splitting the grid. During the process of splitting, new axis segments get added to the hash tables.

• Grid
Given the memory constraints, the size of the grid is tuned to 16. The grid is implemented as a collection of GridCell objects. To be able to efficiently retrieve cells from the grid given their address in terms of the defining coordinate values, the grid cells are assigned id's. This is required for the operations on the grid that will be described later. The grid is then a hash table from these cell id's to the actual grid cells.

DataStructure GridCell
Begin
    double axisBegin[4]
    double axisEnd[4]
    int numberOfRectangles
    int id
End

The id of a grid cell can be calculated from the indices of the four coordinate axes defining the grid cell using the following formula:

id = 16^3 · Index(AxisSegments_xmin, axisBegin_xmin) +
     16^2 · Index(AxisSegments_ymin, axisBegin_ymin) +
     16 · Index(AxisSegments_xmax, axisBegin_xmax) +
     Index(AxisSegments_ymax, axisBegin_ymax)

Index(h, k) is a function that retrieves the sorted position of the key k in the hash table h.
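In code, the id is just a base-16 encoding of the four axis-segment indices; a small sketch (the indices would come from looking up the sorted positions in the four AxisSegments hash tables):

// Combine the indices of the four defining axis segments (each in 0..15)
// into a single grid-cell id, treating them as base-16 digits:
// id = 16^3*ixmin + 16^2*iymin + 16*ixmax + iymax.
int gridCellId(int ixmin, int iymin, int ixmax, int iymax) {
    return ((ixmin * 16 + iymin) * 16 + ixmax) * 16 + iymax;
}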

• Splitting the Grid
At each step of the construction, the grid is required to be split along a specified dimension d, depending on the level at which the tree is being constructed. The grid has to be split in such a way that the rectangles are distributed approximately evenly between the two halves. The split has to be performed


at a coordinate value along dimension d such that rectangles that are greater than or equal to this value are on one side and the rest on the other side. The efficiency of the grid can be seen here, as the grid keeps the counts of all the rectangles. The following pseudo-code describes the splitting of the grid:

PseudoCode SplitGrid(grid, d)
Input: Dimension d along which the grid has to be split.
Output: Two grids, each having approximately half the rectangles.
1. sortedStream ← Sorted stream along dimension d.
2. slices ← GetSortedSlices(grid, d) (∗ slices is a map of the coordinate value to the rectangle count in that slice. ∗)
3. count ← 0
4. Choose the smallest i such that count ← Σ_{j=0..i} slices[j] > StreamLength(sortedStream)/2
5. splitSlice ← Key(slices, i)
6. offset ← AxisSegments_d[splitSlice] (∗ Find precise coordinate ∗)
7. previousCount ← count − slices[i]
8. while ReadItem(sortedStream, r) and Contains(grid, r)
9. do
10.   ++previousCount
11.   offset ← CurrentPosition(sortedStream)
12.   if previousCount ≥ count
13.   then break (∗ goes to step 14 ∗)
14. splitValue ← r_d
15. Insert(AxisSegments, offset, splitValue) (∗ Insert a new axis ∗)
16. grid1 ← nil; grid2 ← nil (∗ Create two new empty grids ∗)
17. for each GridCell in grid (∗ Distribute the cells into two grids ∗)
18.   g ← CurrentItem(grid)
19.   if g.axisBegin_d < splitSlice
20.   then AddGridCell(grid1, g)
21.   if g.axisBegin_d > splitSlice
22.   then AddGridCell(grid2, g)
23.   else (∗ Split the cell into two and add it to both grids. ∗)
24.     g1, g2 ← g
25.     if g1.axisEnd_d ≠ splitValue
26.     then (∗ selectedSlice is not the exact split ∗)
27.       g1.axisEnd_d ← splitValue
28.       AddGridCell(grid1, g1)
29.       g2.axisBegin_d ← splitValue
30.       g2.numberOfRectangles ← 0
31.       AddGridCell(grid2, g2)
32.     else
33.       AddGridCell(grid2, g)
34. Seek(sortedStream, offset)
    (∗ Update the rectangle counts if a slice was split in step 26 ∗)
35. while ReadItem(sortedStream, r) and Contains(slice, r)
36. do
37.   cellId ← GetCellId(r)
38.   g1 ← grid1[cellId]; g2 ← grid2[cellId]
39.   −−g1.numberOfRectangles; ++g2.numberOfRectangles
40. return grid1, grid2

We first obtain the slice l where the split should occur in step 4. We then (in step 8) look up the sorted stream along dimension d by seeking to the correct offset using the hash table AxisSegments[d]. We go through a small number of rectangles in this stream that lie in the selected slice. This gives a more precise split value, which then becomes a new axis along dimension d. This axis and the offset to the axis in the sorted stream are added to the hash table AxisSegments[d] (step 15). Two new grids are then created. Grid cells that lie to the left or to the right of the split value are easily distributed. Care has to be taken when a grid cell lies in the slice that was split (step 23). If the slice selected in step 4 was an exact split, we simply add this cell to the second grid (step 33). However, when a new axis is introduced, we have to update the axes defining the boundaries of the grid cells in step 26. When this happens, the rectangle counts are adjusted by scanning the rectangles that lie in this slice in step 35.
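The core of step 4 is a prefix-sum scan over the per-slice rectangle counts. A minimal C++ sketch, with a plain vector standing in for the slices map (illustrative only):

#include <cstddef>
#include <vector>

// Return the smallest slice index i such that slices 0..i together contain
// more than half of the 'total' rectangles along the split dimension.
std::size_t chooseSplitSlice(const std::vector<long>& sliceCounts, long total) {
    long count = 0;
    for (std::size_t i = 0; i < sliceCounts.size(); ++i) {
        count += sliceCounts[i];
        if (count > total / 2) return i;
    }
    // Fallback: if no prefix exceeds half the total, take the last slice.
    return sliceCounts.empty() ? 0 : sliceCounts.size() - 1;
}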

• Construction using the grid
The construction algorithm is implemented using a set of procedures that share some data by being part of a class. The following class describes the necessary data and operations to perform the construction.

Class PseudoPRTreeData
    TwoDRectangleStream inputStream
    InternalNodeBlock cachedBlocks[]
    PriorityNode cachedPriorityNodes[]
    Queue leavesToConstruct

Operations
    void Distribute(r, currentBlock, index, level, stream)
    void Construct(rootBlock, index)
    void SplitNode(currentBlock, grid, currentCount, level)

The construction algorithm constructs the tree recursively, with each recursive step constructing z nodes of the tree. The following pseudo-code describes the sequence of operations performed in one recursive step of the construction algorithm:


PseudoCode Construct(rootBlock, index)
Input: rootBlock and index identifying the root node of the tree constructed.
Output: pseudo-PR-tree constructed using the input stream.
1. for each dimension in xmin, ymin, xmax, ymax
2.   AMI sort(inputStream, dimension) (∗ Sort in all dimensions ∗)
3. grid ← ConstructGrid(sortedStreams) (∗ Construct the initial grid. ∗)
4. SplitNode(rootBlock, grid, 0, 0)
5. remainingRectangles ← nil (∗ rectangles that could not be distributed are tracked ∗)
6. while ReadItem(inputStream, r) (∗ Distribute the stream ∗)
7. do
8.   Distribute(r, rootBlock, 0, 0, remainingRectangles)
9. substreams[|leavesToConstruct|] ← nil (∗ Filter the streams associated with each leaf ∗)
10. Clear(cachedPriorityNodes) (∗ Write the priority nodes to the disk ∗)
11. Clear(cachedBlocks) (∗ Write the internal nodes to the disk ∗)
12. while ReadItem(remainingRectangles, r)
13. do
14.   for each leaf l in leavesToConstruct
15.     if Contains(l, r)
16.     then AddItem(substreams_l, r)
17. for each leaf l in leavesToConstruct
18.   tree ← CreateTree(substreams_l)
19.   Construct(l.rootBlock, l.index)

The operation Construct takes an internal node and constructs a subtree rooted at that node in memory, containing at most z nodes. Each such construction phase uses the recursive function SplitNode (step 4) to construct all the internal nodes according to the properties of the pseudo-PR-tree. Later, a distribution phase (step 6), performed by the function Distribute, distributes the rectangles across the created internal nodes. The construction phase creates a grid (step 3) using the sorted streams (created in step 1) that guides the splitting process. The last phase of the construction algorithm involves recursively creating subtrees for all the leaves that could not be constructed completely because of memory constraints. The addresses of such leaves are stored in the queue leavesToConstruct. The pseudo-code that constructs the internal nodes of the tree using the grid is described below:

PseudoCode SplitNode(currentBlock, grid, currentCount, level)
Input: – currentBlock - Block to store internal nodes created.
       – grid - Grid object to split the nodes evenly.
       – currentCount - Number of internal nodes created so far.
       – level - Current level being constructed.
1. currentNode ← currentBlock_index
2. if currentCount < z (∗ z = 16, size of the grid ∗)
3. then
4.   if NumberOfRectangles(grid) > 4*B (∗ If sufficient rectangles are available for a split ∗)
5.   then
6.     splitDimension ← level % 4
7.     g1, g2 ← SplitGrid(grid, splitDimension)
8.     for g in g1, g2
9.       if NumberOfRectangles(g) > 0
10.      then if currentBlock is full
11.           then currentBlock ← NewInternalNodeBlock( )
12.           ++currentCount
13.           AddInternalNode(currentBlock)
14.     SplitNode(currentBlock, g1, currentCount, level + 1)
15.     SplitNode(currentBlock, g2, currentCount, level + 1)
16. else (∗ Reached memory limit ∗)
17.   if NumberOfRectangles(grid) > 4*B
18.   then (∗ node needs to be split, add it to leavesToConstruct ∗)
19.     Add(leavesToConstruct, currentBlockId, level)
20.   else
21.     AddInternalNode(currentBlock)

The operation SplitNode uses the grid to split nodes, at each level, in a KD-tree fashion. Once a block is allocated, it is recursively used to store the internal nodes created until it has no space left for more nodes. This disk layout strategy achieves I/O efficiency during queries. When the maximum number z of internal nodes has been created, the leaves that need further splits are stored in the queue leavesToConstruct (step 19). The class PseudoPRTreeData also shows an internal cache of priority nodes and blocks for internal nodes. These caches are filled when the internal nodes are created in step 11 and later used during the distribution phase. The distribution phase is described by the following pseudo-code:

PseudoCode Distribute(r, currentBlock, index, level, remainingRectangles)
Input: – r - TwoDRectangle object to be distributed in the tree.
       – currentBlock - Block to store internal nodes created.
       – index - index of the location of the node in the currentBlock.
       – level - Level at which distribution occurs.
       – remainingRectangles - Stream of rectangles that could not be distributed.
1. currentNode ← currentBlock_index
2. inserted ← false
3. for each dimension in xmin, ymin, xmax, ymax
4.   if priorityNode_dimension in currentNode does not exist
5.   then
6.     Create priorityNode_dimension
7.   if NumberOfRectangles(priorityNode_dimension) < B
8.   then
9.     AddRectangle(priorityNode_dimension, r)
10.    inserted ← true
11.    break
12.  else
13.    rr ← GetRectangle(priorityNode_dimension, (B − 1))
14.    if ReplaceRectangle(priorityNode_dimension, r) (∗ Replace rectangle if r is more extreme than rr along dimension ∗)
15.    then r ← rr
16. if inserted = false
17.   n ← NumberOfSubtrees(currentNode)
18.   if n > 0
19.   then
20.     splitDimension ← level % 4
21.     subtreeBlock, index ← Choose subtree according to split value and splitDimension
22.     Distribute(r, subtreeBlock, index, level+1, remainingRectangles)
      else
23.     AddItem(remainingRectangles, r)

The operation Distribute takes a rectangle r that has to be placed in the subtree rooted at the node currentBlock_index such that the properties of a pseudo-PR-tree are not violated. In order to do this, the algorithm first tries to find a place in the priority nodes xmin, ymin, xmax, ymax, in that order. This means that if a priority node p has fewer than B rectangles, the rectangle r gets added in the correct sorted position according to the dimension of the priority node (step 9). If this is not the case, it is checked whether r is more extreme than the least extreme rectangle rr in p (step 12). If this is indeed the case, we replace rr by r in p and continue the distribution process with rr. If none of the priority nodes could accommodate r (or rr, in case r was placed in step 14), the search for the correct position continues in the appropriate subtree, which is chosen using the dimension in which the tree was split and the split value (step 21). If the tree is completely full, r is added to the stream remainingRectangles. This stream is later filtered for each of the leaves that have to be recursively constructed.
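The priority-node part of this routine boils down to the following cascade, sketched here on top of the PriorityNodeSketch from Section 4.3.1 (illustrative only; the real code works directly on TPIE blocks):

// Try to place r in the four priority nodes of one internal node, in the
// order xmin, ymin, xmax, ymax.  A full node may swap out its least extreme
// rectangle, which then continues the search in place of r.  Returns true if
// the cascade ends inside the priority nodes; if it returns false, r holds
// the (possibly swapped) rectangle that must go into the appropriate subtree.
bool placeInPriorityNodes(PriorityNodeSketch* prio[4], TwoDRectangle& r) {
    for (int d = 0; d < 4; ++d) {
        PriorityNodeSketch& p = *prio[d];
        if (p.el.size() < p.B) {                  // room left: insert and stop
            p.insert(r);
            return true;
        }
        if (p.extremeBefore(r, p.el.back())) {    // r beats the least extreme
            TwoDRectangle evicted = p.el.back();  // rectangle of this node
            p.el.pop_back();
            p.insert(r);
            r = evicted;                          // continue with the evicted one
        }
    }
    return false;                                 // r must go into a subtree
}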

4.3.3 Implementation Issues

• Designing the structure of the grid
The grid is a collection of grid cells. During the construction of the pseudo-PR-tree, it is often desirable to quickly obtain the grid cell that contains a particular


rectangle. This happens, for instance, when the grid cell has to be updated with counts of rectangles. To be able to do this, the grid was implemented as a hash table that maps a unique key (id) identifying a grid cell to the grid cell itself. The formula to obtain the id from the coordinates of a rectangle was already presented. The id of the grid cell should also be derivable from the coordinates of a rectangle that lies in the grid cell. The solution to this problem is to obtain the key from the indices of the coordinates that define the boundary of the grid cell. These indices can easily be obtained using the AxisSegments hash tables.

• Maintaining vs. computing the minimum bounding box
The rectangles in a priority node are kept in sorted order to easily obtain the least extreme value, which is frequently required during the distribution of rectangles. The minimum bounding box to be kept in the parent node could either be computed once the tree has been constructed, or maintained while the tree is being constructed. Maintaining the bounding box was found to be more expensive, because rectangles enter and leave the priority nodes very often during the construction phase. Hence the bounding box is computed after the construction has completed.

• Improving the algorithms to update a PriorityNode
To store rectangles in sorted order along a particular dimension, it is important to insert new rectangles at the correct position. A linear traversal of this list made the construction algorithm very slow. Hence a binary search is performed to locate the position of insertion or deletion before actually inserting or deleting a rectangle.

4.4 LPR-tree

4.4.1 Structure

The LPR-tree contains a sequence of pseudo-PR-trees (with annotated information). The most important operations related to the LPR-tree structure are the Insert, Delete and Query algorithms. The following class describes how these algorithms are implemented.


Class LPRTreeData
    InternalNodeBlock cachedBlocks[]
    PriorityNode cachedPriorityNodes[]
    RootNodeBlock rootBlock
    AMI collection outputStream

Operations
    void Insert(r)
    bool Delete(r)
    void Query(queryWindow, outputSize, nodes, priorityNodes)
    void GetRectangles()
    void CacheTree(treeIndex)

The root node of the LPR-tree contains links to the root node blocks of the children pseudo-PR-trees. Depending on the size of these pseudo-PR-trees, either the full tree or a part of the tree is cached. This gives the benefit of being I/O-efficient while updating the tree. The info field of this root node block contains the following fields:

• numberOfRectangles - Indicates the number of rectangles with which the tree was last bulk loaded.

• numberOfInsertions - Indicates the number of insertions made to the tree since the last time the tree was bulk loaded.

• numberOfDeletions - Indicates the number of deletions made to the tree since the last time the tree was bulk loaded.

During updates, the tree is reconstructed by collecting all the rectangles. This happens in the following two situations (a small sketch of this check follows the list):

• The number of insertions becomes equal to the number of rectangles with which the tree was bulk loaded.

• The number of deletions becomes equal to half the number of rectangles with which the tree was bulk loaded.
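As a sketch (names are illustrative, not the thesis code), the check that triggers a rebuild is nothing more than comparing the two counters against the size of the last bulk load:

// Decide whether the LPR-tree must be collected and bulk loaded again.
bool needsRebuild(long numberOfRectangles,   // size at the last bulk load
                  long numberOfInsertions,   // insertions since then
                  long numberOfDeletions) {  // deletions since then
    return numberOfInsertions >= numberOfRectangles
        || numberOfDeletions  >= numberOfRectangles / 2;
}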

When such a situation occurs, the caches are emptied and the tree is bulk loaded again, resetting these counters to 0 and setting numberOfRectangles to the correct number. The GetRectangles method recursively traverses the entire tree and writes all the rectangles to a stream, subsequently deleting all the blocks and nodes from the tree. Bulk loading an LPR-tree involves constructing the ⌈log(N/B)⌉ + 1 child pseudo-PR-trees. This is followed by a phase in which a part or the whole of the tree is cached. Given the tree index, the number of levels to cache is determined by the following pseudo-code:


PseudoCode CacheTree(treeIndex)
Input: Index of the child pseudo-PR-tree to be cached.
1. block ← GetBlock(rootNodeBlock, treeIndex) (∗ Gets the root block ∗)
2. maxLevels ← −1; cache ← false
3. l ← log(N/M)
4. m ← log(M/B)
5. if NumberOfItems(block) > 0
6. then
7.   if treeIndex ≤ m
8.   then maxLevels ← −1
9.   else if l > (m + 1) and treeIndex < l
10.  then cache ← false
11.  else maxLevels ← (treeIndex − l)
12. if cache = true
13. then Recursively cache all the priority nodes and internal nodes up to a depth of maxLevels.

4.4.2 Insertion Algorithm

The following pseudo-code describes the high-level implementation details of inserting a rectangle in an LPR-tree:

PseudoCode Insert(r)
Input: TwoDRectangle object r to be inserted.
1. if (numberOfInsertions + 1) = numberOfRectangles
2. then
3.   stream ← GetRectangles( )
4.   AddItem(stream, r)
5.   Construct(stream)
6. else
7.   t0Count ← NumberOfRectangles(t0Block)
8.   if t0Count ≥ B
9.   then
10.    stream ← nil
11.    for i ← 0 to n
12.      block ← GetBlock(rootNodeBlock, treeIndex) (∗ Gets the root block ∗)
13.      if NumberOfRectangles(block) > 0
14.      then GetRectangles(i) (∗ Add all the rectangles to stream ∗)
15.      else break (∗ continues with step 16 ∗)
16.    tree ← ConstructPseudoPRTree(stream, outputStream)
17.    CacheTree(tree)
18.  else (∗ τ0 block has space for r ∗)
19.    AddRectangle(t0Block, r)

When the number of rectangles inserted has reached the threshold, the tree is reconstructed by collecting all rectangles in a stream that also contains the rectangle to be inserted. When this is not the case, the rectangle is inserted in the τ0 tree. If τ0 is already full, a search is made for the first empty tree τi in step 12. All the rectangles in the trees preceding this tree are moved to a stream and those trees are discarded. τi is then bulk loaded with the stream of rectangles collected. Having now made sure that τ0 has space, the rectangle is inserted into τ0.

4.4.3 Deletion Algorithm

The deletion algorithm uses the least extreme values of priority nodes and the split values of the nodes to search for the rectangle that has to be deleted. As a consequence of the distribution strategy of rectangles in a pseudo-PR-tree, there is guaranteed to be exactly one path in a pseudo-PR-tree to search for a specific rectangle for deletion. This path may result in the rectangle being found, or it is concluded that this rectangle is not present in the tree. While deleting a rectangle, we take care to update the minimum bounding boxes of the priority nodes. The internal nodes are updated after deletion by keeping track of the path followed for deletion. In this section we describe the pseudo-code for deletion, followed by a description of the algorithm to replenish rectangles in under-full nodes.

PseudoCode Delete(r, block, index, path)
Input: – r - TwoDRectangle object to be deleted.
       – block - InternalNodeBlock representing a node in the tree (initially the root block)
       – index - index of the position in block of the current node.
       – path - path to the deleted rectangle that is initially empty.
Output: true if r is found and deleted, false otherwise.
1. node ← block_index
2. currentValue ← r_dimension
3. for each dimension in xmin, ymin, xmax, ymax
4.   lev ← node.leastExtremeValue_dimension
5.   if lev ≤ currentValue (∗ ≥ for max nodes ∗)
6.   then
7.     p ← node.priorityNodes_dimension
8.     if RemoveRectangle(p, r)
9.     then
10.      Add(path, dimension)
11.      if NumberOfRectangles(p) < B/2
12.        Replenish this node and update bounding boxes
13.      else ComputeMinimumBoundingBox(p, node)
14.      return true
15.    else
16.      if lev = currentValue
17.        goto step 3 with next dimension.
18.      else return false
19. if NumberOfSubtrees(node) > 0
20. then
21.   subtreeBlock, subtreeIndex ← Choose subtree according to split value
22.   Add(path, subtreeIndex)
23.   return Delete(r, subtreeBlock, subtreeIndex, path)
24. else return false

Given a rectangle r, the above pseudo-code recursively goes through the tree. In step 3 it checks each of the priority nodes based on the least extreme values stored in the parent node. If r falls in this range, either it gets deleted or it is concluded that the rectangle is not in the tree, except when the least extreme value is the same as the coordinate value of the search rectangle (step 17). This is because it may be possible that more than one rectangle in the tree has the same coordinate value along a specified dimension, which also happens to be the least extreme value of the priority node. If none of the priority nodes could be checked because the rectangle falls out of range, or when the special case of step 17 occurs, the subtrees are searched recursively in step 20. The path to the deleted rectangle is stored to be able to correct the minimum bounding boxes. This path stores the index of the child from the root to the priority node that contained the deleted rectangle. More precisely, the index of the child is the index of the subtree in case of an internal node (step 22); otherwise it is the dimension of the priority node (step 10). Note that we do not have to store the id's of the blocks, as this information is already present in the internal nodes. As the recursion terminates when a priority node is reached, the last index in this path must be the dimension of a priority node. When deletion happens successfully and the rectangle count in the priority node falls below B/2, the node is replenished (step 12) with rectangles from its priority node siblings or from the priority node children of its sibling KD-nodes. This is described by the following pseudo-code:

PseudoCode ReplenishNodes(p, d, node)
Input: PriorityNode p that is the d-th child node of the InternalNode node and that is under-full.
1. stream ← nil (∗ Stream contains items of rectangles and their priority node address ∗)
2. underFlowStack ← nil
3. priorityNodes ← nil (∗ List of addresses of nodes from which rectangles are collected ∗)
4. for each dimension succeeding d in xmin, ymin, xmax, ymax
5.   q ← node.priorityNodes_dimension
6.   rectangles ← GetRectangles(q)
7.   AddItem(stream, rectangles)
8.   Add(priorityNodes, Address(q))
9. for each sub-tree of node
10.  subtreeNode ← root of the sub-tree.
11.  for each dimension in xmin, ymin, xmax, ymax
12.    q ← subtreeNode.priorityNodes_dimension
13.    rectangles ← GetRectangles(q)
14.    AddItem(stream, rectangles)
15.    Add(priorityNodes, Address(q))
16. n ← MIN(B/2, StreamLength(stream))
17. AMI sort(stream, d)
18. for i ← 0 to n
19.   ReadItem(stream, r, q) (∗ q is the priority node where the rectangle r exists. ∗)
20.   AddRectangle(p, r)
21.   RemoveRectangle(q, r)
22. ComputeMinimumBoundingBox(p)
23. for i ← 0 to Count(priorityNodes)
24.   q ← priorityNodes_i
25.   if NumberOfRectangles(q) < B/2
26.     Push(q, underFlowStack)
27.   else ComputeMinimumBoundingBox(q)
28. while underFlowStack is not empty
29.   q ← Pop(underFlowStack)
30.   ReplenishNodes(q.priorityNode, q.dimension, q.parent)

Rectangles are first collected in an AMI stream from the sibling priority nodes (steps 4–8) and from the children priority nodes of the KD-nodes of the parent of p (steps 9–15). The stream is then sorted (step 17). The number of rectangles to be replenished is the minimum of B/2 and the number of rectangles collected (step 16). This is required to ensure the correctness of the LPR-tree after deletion. During this process of replenishing p, the nodes from which rectangles were borrowed may become under-full. The addresses of these priority nodes are collected in a stack (step 26). The address of a priority node is defined by the id of its AMI block, its dimension and the id of the AMI block of its parent. These underflowing priority nodes are replenished recursively in step 30. It is important to note that these rectangles are replenished in reverse order. This means, for instance, that if while replenishing the xmin priority leaf all its siblings become under-full, then those children must be replenished in the order ymax, xmax and ymin. This ensures that each priority node can be replenished with the correct number of rectangles while still preserving all the properties of a pseudo-PR-tree. Such a reverse order can easily be ensured by storing the addresses of these underflowing nodes in a stack.
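The "address" pushed on the stack needs no more than three fields; a minimal sketch (with a 64-bit integer standing in for TPIE's block id type, and the dimension encoded as a small integer):

#include <cstdint>
#include <stack>

// Address of a priority node that became under-full while lending rectangles:
// enough information to locate it again and replenish it later.  Pushing the
// addresses on a stack makes them come back out in reverse order of discovery.
struct PriorityNodeAddress {
    std::uint64_t blockId;        // block id of the priority node itself
    int           dimension;      // xmin = 0, ymin = 1, xmax = 2, ymax = 3
    std::uint64_t parentBlockId;  // block id of its parent KD-node
};

typedef std::stack<PriorityNodeAddress> UnderflowStack;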


4.4.4 Implementation Issues

• Correctness of LPR-tree structure
Implementing the update algorithms while taking care of every precise detail is very difficult. To easily find and fix implementation bugs, a small procedure was written to check the correctness of the tree. When working with large datasets, this was a great help in diagnosing problems. The correctness is checked by verifying certain trivial facts about the structure of the LPR-tree. The following are some of these rules:

– Any priority node that is under-full (having fewer than B/2 rectangles) is a node that has no succeeding sibling priority nodes, and the parent of such a node does not have any subtrees.

– The most extreme value of a priority node (of a certain dimension d) at level i (i > 0) is less extreme than the least extreme value of the corresponding priority node in the same dimension of its parent at level i − 1.

– All priority nodes of the upper levels of a tree that are children of a node having KD-nodes as children must be full.

• Replenishing nodes
A lot of effort was spent on correctly implementing the replenishing of nodes during deletion. The following were the two main problem areas encountered:

– When a priority node, for instance p_xmin, underflows due to deletion, it gets replenished with rectangles from sibling priority nodes or from the children priority nodes of sibling KD-nodes. It may be possible that some of the nodes which gave their rectangles to p_xmin underflow themselves. In such a situation, these priority nodes also need to be replenished recursively, in reverse order. Reverse order means that first the priority nodes of sibling KD-nodes have to be replenished, then the sibling priority nodes in the order ymax, xmax, ymin, xmin. This order is necessary to maintain the correctness of the tree. It also ensures that any node that needs replenishment can be adequately replenished.

– It is important to replenish an underflowing priority node with at most B/2 rectangles and not more, to preserve the correctness of the tree. Before explaining the problem, the following observations are made.
  ∗ Observation 1: The least extreme rectangle of a priority node, in any dimension, is more extreme than the most extreme value along that dimension across all the priority node children of the sibling KD-nodes. This observation immediately follows from the structure of the LPR-tree.
  ∗ Observation 2: A priority node can underflow to an extent much below B/2, thereby requiring more than B/2 rectangles to be completely packed. Consider the situation that at least two priority nodes (ν and µ) have (B/2) + 1 rectangles. When a rectangle from the first of these priority nodes, ν, gets deleted, ν requires replenishment. It is possible that most of the replenishing rectangles are taken from µ. In this case, µ requires more than B/2 rectangles to be completely packed.
  ∗ Observation 3: For a priority node p_d, in any dimension d greater than xmin, having n rectangles where n < B, the following is true: the n rectangles in p_d together with the rectangles of its priority node siblings may not be the most extreme B rectangles, along dimension d, in the subtree rooted at its parent. Any priority node having KD-siblings is completely full only after a bulk load. After a bulk load, all priority nodes of a node ν collectively contain the B most extreme rectangles along dimension d in the subtree rooted at its parent.

Now assume that we completely pack the priority nodes that are underflowing. Also assume that the ymin, xmax and ymax priority nodes, whose parent is η, have exactly (B/2) + 1 rectangles. Also assume that η has only one sibling KD-node φ, as the other sibling KD-node ψ has been deleted because all the rectangles in its subtree were deleted earlier. Now, deleting one rectangle from ymin may cause xmax to underflow below B/2 (from Observation 2), and also assume that no rectangles are taken from ymax. As a result, xmax requires replenishment but ymax does not, as ymax has enough rectangles to just stay above the threshold. Now, replenishing xmax to the fullest extent may cause ymax to run completely empty, while also borrowing rectangles from the priority nodes of the sibling KD-node φ. This ymax node will now be replenished with rectangles from the children priority nodes of its KD-siblings. The priority nodes of φ provide the most extreme B rectangles they hold in the ymax direction. However, it is possible that these B rectangles are not the most extreme ymax rectangles in the subtree rooted at φ (from Observation 3). Moving these rectangles to the ymax node of η violates Observation 1, which makes the LPR-tree inconsistent.
However, moving a maximum of B/2 rectangles ensures that no priority node can run empty while one of its sibling priority nodes or the priority nodes of the sibling KD-nodes have more than B/2 rectangles. Thereby, the above-mentioned problem will not occur.


Chapter 5

Experiments

5.1 Experimental Setup

The LPR-tree is implemented in C++. The code has been developed using the Microsoft Visual Studio 2003 compiler for the Windows XP platform. TPIE[4] is used as the library to control block allocations and count I/O's (see Appendix B). Each rectangle has a size of 40 bytes. The implementation uses a maximum possible block size of 1638 rectangles. The experiments were run on a Pentium 4 CPU, 2.00 GHz, with 256 MB of RAM. The amount of memory available to TPIE is restricted to 64 MB. This constraint is taken from the experiments on (static) PR-trees performed earlier[3]. The experiments are performed using the LPR-tree implementation presented in this thesis and the R∗-tree implementation included in the distribution of TPIE[4, 6].
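A quick sanity check on the 1638 figure: assuming a logical block size of 64 KB (an assumption on my part, since the text does not state the block size in bytes), ⌊65536 / 40⌋ = 1638 rectangles indeed fit in one block.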

5.2 Datasets

We use both real-life and synthetic datasets.

5.2.1 Real life data

As the real-life data we use the TIGER data of geographical features in the United States of America. As most of our test results are expected to vary with the size of the datasets, we would like to have real-life datasets of varying sizes. The TIGER dataset is distributed over six CD-ROMs. We choose to experiment with the road line segments of two of the CD-ROMs. We collect 17 million rectangles from this dataset and create 7 streams. Five of these streams contain one million rectangles each, and the rest is split into two streams of 4 million and 8 million rectangles, respectively.


5.2.2 Synthetic data

To investigate the query performance of these dynamic R-trees over various extreme parameters and distribution characteristics, we use the following datasets.

1. Uniform Distribution: This dataset is designed to test the performance of R-trees on rectangles whose centers follow a uniform distribution in the unit square. The width and height of the rectangles are also generated uniformly at random as a number between 0 and 0.001. These figures are the same as in the experiments on (static) PR-trees performed earlier[3]. In general, we refer to this distribution as the UNIFORM dataset.

2. Normal Distribution: This dataset has fixed-size squares of size 0.001. The centers of the squares follow a normal distribution with mean 0.50 and standard deviation 0.25. In general, we refer to this distribution as the NORMAL dataset.

The following table describes these datasets.

Dataset Identifier    Rectangle ρ

UNIFORM(n)^a
  – Center(ρ) = (x, y), where x and y are uniformly generated at random such that 0 ≤ x, y ≤ 1.0, with the current system time used as the seed^b for randomization.
  – Width(ρ) = UniformRandom(0, 0.001). Height is also chosen in a similar manner.

NORMAL(n)
  – Center(ρ) = (x, y), where x and y are generated at random from a normal distribution with µ = 0.50, σ = 0.25, with the current system time used as the seed for randomization.
  – Width(ρ) = Height(ρ) = 0.001.

a: n is the number of rectangles to generate.
b: Note that the generator is seeded only once for the generation of n rectangles.

Both these distributions are used to generate 7 streams similar to the TIGER datasets.

For convenience the described datasets are given names as indicated below:


Dataset Identifier    Description
Uni1_1 .. Uni1_5      Uniform distribution of 1 million rectangles.
Uni4                  Uniform distribution of 4 million rectangles.
Uni8                  Uniform distribution of 8 million rectangles.
Uni1.5                Rectangles from Uni1_1 and half the rectangles from Uni1_2.
Nor1_1 .. Nor1_5      Normal distribution of 1 million rectangles.
Nor4                  Normal distribution of 4 million rectangles.
Nor8                  Normal distribution of 8 million rectangles.
Nor1.5                Rectangles from Nor1_1 and half the rectangles from Nor1_2.
Tig1_1 .. Tig1_5      TIGER dataset of 1 million rectangles.
Tig4                  TIGER dataset of 4 million rectangles.
Tig8                  TIGER dataset of 8 million rectangles.
Tig1.5                Rectangles from Tig1_1 and half the rectangles from Tig1_2.

5.3 Bulk Load

• Experiment Description

Hypothesis 5.3.1. The comparative bulk load performance of the R∗-tree and the LPR-tree is consistent with that of their static variants[3]; for instance, the performance of the R∗-tree bulk loaded with the TIGER dataset using the Hilbert construction algorithm and some R∗-tree heuristics is three times better than the performance of the LPR-tree.

Hypothesis 5.3.2. The bulk load performance of the R∗-tree increases linearly with the number of rectangles in the dataset. The LPR-tree shows a similar behavior.

• Procedure
Perform a bulk load with the Uni1, Uni4 and Uni8 datasets. Repeat this with the corresponding datasets in the NORMAL and TIGER sets. It is expected that I/O's increase linearly with dataset size in the case of the eastern data sets, while the data distribution of the synthetic datasets should have little or no effect on the I/O performance of the algorithm.

• Results
The CPU time taken to bulk load the various datasets for the LPR-tree and the R∗-tree is shown in Figures 5.1, 5.2 and 5.3. The CPU time appears to increase in a slightly super-linear fashion with the dataset size for both the LPR-tree and the R∗-tree. For the LPR-tree, this can be explained by the fact that the grid implementation creates two new grids with each split. The LPR-tree also shows negligible time differences between different distributions. In fact, the minor differences could


be attributed more to the sorting time than to the nature of the distribution. In contrast, the R∗-tree is more sensitive to the distribution, performing worst for the normal dataset and best, with about 25% less time, for the uniform dataset.

Figure 5.1: Bulk Load CPU time - Uniform dataset.

Figure 5.2: Bulk Load CPU time - Normal dataset.

The I/O counts shown in Figures 5.4, 5.5 and 5.6 seem to increase almost linearly with dataset size. This is in line with expectations for the LPR-tree. For the R∗-tree, I/O's are expected to scale linearly with dataset size within the same distribution. The results seem to be independent of the type of distribution for the LPR-tree, whereas for R∗-trees there is variation, with almost 13% more I/O's incurred in the worst case compared to the best case for the 8 million datasets. The same figure for the LPR-tree is less than 1.3%.

The most important observation is the large difference in I/O between the LPR-tree and the R∗-tree. A large part of this I/O cost for the LPR-tree can be attributed to the sorting of streams that is done at each recursive step in the implementation.


Figure 5.3: Bulk Load CPU time - TIGER dataset.

Figure 5.4: Bulk Load I/O - Uniform dataset.

Figure 5.5: Bulk Load I/O - Normal dataset.


Figure 5.6: Bulk Load I/O - TIGER dataset.

5.4 Insertion

• Experiment Description

Hypothesis 5.4.1. The average number of I/O's incurred per rectangle inserted remains approximately the same until the LPR-tree is rebuilt. The same is true for the CPU time spent per insertion.

Hypothesis 5.4.2. The average number of I/O's per insertion (amortized) on the LPR-tree is better than the average number of I/O's per insertion (amortized) on the R∗-tree, on similar datasets and under similar conditions.

Hypothesis 5.4.3. The performance results of the insertion algorithm on the LPR-tree show little or no variation (on average) with the nature of the distribution of the data rectangles.

• Procedure
First the LPR-tree that was bulk loaded with Uni4 is loaded. We insert the rectangles in Uni1_i where i ∈ [1, 5]. After the insertion of every Uni1_i, we measure the average number of I/O's required per insertion. We expect a peak in I/O's when inserting Uni1_4. This is due to the cleanup/rebuilding of the tree. We repeat this experiment with the NORMAL and TIGER datasets. Similar experiments are carried out on the R∗-tree.

• Results
Figure 5.7 shows that the insertion CPU time for the LPR-tree is almost the same per rectangle. The graph shows the CPU time for every million rectangles inserted in the tree. We do see a sudden increase when a million rectangles are inserted for the fourth time. This is due to the rebuilding of the tree that happens when the insertion count becomes equal to the number of rectangles. At this


point the tree is doubled in capacity and bulk loaded. The CPU time taken seems to be almost independent of variations in the distribution of the rectangles. It is important to note that when the tree does get rebuilt, its capacity doubles; as a result, the expensive cost of rebuilding has to be incurred again only much later.

Figure 5.7: Insertion CPU time - LPR-tree.

Figure 5.8 shows that the insertion CPU time for the R∗-tree is always above 600 seconds for every million rectangles inserted. The probability that the insertion time with the R∗-tree is going to be greater than with the LPR-tree is quite high. Figure 5.9 shows that the average CPU time for insertion is almost the same, with the R∗-tree having only a slightly better performance.

Figure 5.8: Insertion CPU time - R∗ tree.

Figure 5.10 shows the average I/O counts per 100 rectangles inserted for the LPR-tree. The graph is plotted for every million rectangles inserted. The sudden increase on inserting the fourth million rectangles is because of the rebuilding of the tree. But on average, over the 5 million rectangles inserted, we see in Figure 5.12 that there are around 23 I/O's per 100 rectangles inserted. There is a sharp deviation


Figure 5.9: Insertion average CPU time - LPR tree vs. R∗ tree.

of around +16 I/O's from the best case. However, if we look at the insertion cost per rectangle, this is not much. Once again there is negligible deviation in results across dataset types, with the TIGER dataset performing the worst. This is expected, as the insertion algorithm basically performs bulk loading, which is insensitive to variations in dataset types.

Figure 5.10: Insertion I/O's - LPR tree.

Figure 5.11 shows the average I/O counts per 100 rectangles inserted for the R∗-tree. There is quite a bit of variation with the nature of the distribution, with the TIGER dataset giving the worst performance. The I/O count comparison between the LPR-tree and the R∗-tree shown in Figure 5.12 is interesting. It clearly shows that insertion in the LPR-tree outperforms the R∗-tree by a really large amount. The poor R∗-tree performance could be attributed to the really dense datasets; the cost of re-inserting 30% of the rectangles from an overflowing node seems to be quite high, even though this is done only once per level.


Figure 5.11: Insertion I/O's - R∗ tree.

Figure 5.12: Insertion Average I/O's - LPR tree vs. R∗ tree.


5.5 Deletion

• Experiment Description

Hypothesis 5.5.1. The average number of I/O's incurred per rectangle deleted in the LPR-tree decreases on average as rectangles get deleted. The same is true for the CPU time spent per deletion, except when the tree gets bulk loaded, during which the CPU time taken is significantly higher.

Hypothesis 5.5.2. The average number of I/O's per deletion (amortized) on the LPR-tree is better than the average number of I/O's per deletion (amortized) on the R∗-tree, on similar datasets and under similar conditions.

Hypothesis 5.5.3. The results of the delete algorithm on the LPR-tree show little or no variation (on average) with the nature of the distribution of the data rectangles.

• Procedure
We load the LPR-tree containing 4 million rectangles. We insert the rectangles in Uni1_i where i ∈ [1, 3]. This tree is saved and used as the base of the deletion experiments. We then successively delete the rectangles of Uni1_i where i ∈ [1, 3]. We measure the average number of I/O's after every deletion of Uni1_i. We expect deletions to take more I/O's than insertions on average, because of the reorganization resulting from replenishing under-full nodes. Also, a peak in I/O's is expected when half the rectangles get deleted. This is due to the cleanup/rebuilding of the tree. However, the average number of I/O's per deletion is expected to decrease as rectangles get deleted. This is due to the fact that the tree becomes smaller in size.

• Results
The results of the CPU time measurements for the LPR-tree are depicted in Figure 5.13. In line with the other results, there seems to be very little effect of the variation of the distribution of rectangles in the datasets. The first million rectangles deleted take around 700 seconds. The second million rectangles take more CPU time, as cost is incurred in the reconstruction of the tree. The deletion of the third million rectangles seems to go faster than the first million. This can be explained by the fact that the tree is smaller than the tree from which the first million rectangles were deleted.

Figure 5.14 shows the CPU time measurements for the R∗-tree. The CPU time taken to delete 1 million rectangles is on average 1200 seconds for the NORMAL and TIGER datasets. An equivalent experiment on the uniform datasets failed to complete even after 4.5 hours. Hence the data shown here does not include statistics for this dataset. In comparison to the R∗-tree, the deletion CPU time on the LPR-tree shows less variation over the different types of datasets. It also shows that deletion in the LPR-tree is very


Figure 5.13: Deletion CPU time - LPR-tree.

often expected to be much faster than in the R∗-tree. An overall average comparison of deletion times, shown in Figure 5.15, shows the LPR-tree having an advantage over the R∗-tree.

Figure 5.14: Deletion CPU time - R∗ tree.

The I/O's incurred per deletion for the LPR-tree are shown in Figure 5.16. Interestingly, the average count per deletion goes down with every million rectangles deleted. Also, the cost incurred in restructuring is modest, because the tree is bulk loaded with only 2 million rectangles. It is hard to explain why there is no variation in the I/O counts with the variation in dataset distribution; the best guess one could make is that the rectangles are very dense in every dataset, so every deletion has roughly the same effect with respect to replenishing nodes. The results also make clear that deletion is more expensive than insertion in terms of I/O's.

The I/O's incurred per deletion for the R∗-tree are shown in Figure 5.17. The R∗-tree apparently performs slightly better on the TIGER dataset than on the NORMAL dataset.


Figure 5.15: Deletion average CPU time - LPR tree vs. R∗ tree.

Figure 5.16: Delete I/O - LPR-tree.


In contrast with the LPR-tree, where the cost decreases with every million rectangles deleted (across datasets), the R∗-tree does not show any predictable behavior: its I/O counts stay at almost the same level. The comparison of average I/O's (Figure 5.18) clearly shows that the R∗-tree needs on average twice the number of I/O's of the LPR-tree to perform a deletion.

Figure 5.17: Deletion I/O - R∗ tree.

Figure 5.18: Deletion average I/O's - LPR tree vs. R∗ tree.

5.6 Query

• Experiment Description

Hypothesis 5.6.1. LPR-tree updates (insertions and deletions) should not affect the query performance. In other words, query performance in terms of I/O's and CPU time should be close to the query performance after bulk load.


Hypothesis 5.6.2. The average number of I/O's incurred per output rectangle reported in an LPR-tree is lower than in the R∗-tree, on similar datasets and under similar conditions.

Hypothesis 5.6.3. The R∗-tree performs best on the NORMAL dataset (squares) and slightly worse on the UNIFORM and TIGER datasets.

• Procedure
The LPR-tree is bulk loaded with uni1.5. For each of the UNIFORM, NORMAL and TIGER datasets, 1000 randomly generated window queries are performed on this tree. Each rectangle representing a window query is generated as follows (a code sketch for the UNIFORM and NORMAL schemes is given after the list):

Dataset identifier and window query ρ:

UNIFORM
  – Center(ρ) = (x, y), where x and y are generated uniformly at random such that 0.3 ≤ x, y ≤ 0.7, with the current system time used as the seed for randomization.
  – Width(ρ) = UniformRandom(u, v), where u and v are chosen uniformly at random such that 0.2 ≤ u < v ≤ 0.4. Height(ρ) is chosen in a similar manner.

NORMAL
  – Center(ρ) = (x, y), where x and y are generated at random from a normal distribution with µ = 0.50 and σ = 0.25, with the current system time used as the seed for randomization.
  – Width(ρ) = Height(ρ) = UniformRandom(0.001, 0.1).

TIGER
  – ρ = ComputeMinimumBoundingBox(s), where s is a set of rectangles chosen from the tig1 2 stream such that s_i = tig1 2[k + j_i], with j_i = (∑_{q=0}^{i−1} j_q) + UniformRandom(0, 10); k is chosen at uniformly distributed offsets for each query rectangle.
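To make the generation schemes above concrete, the sketch below shows one way to produce the UNIFORM and NORMAL query windows. It is illustrative only: the Rect struct, function names and the use of C++11 <random> are assumptions, not the code used in the thesis; the generator is seeded from the system time elsewhere, as described above.

    #include <algorithm>
    #include <random>

    struct Rect { double xlo, ylo, xhi, yhi; };

    // UNIFORM scheme: centre uniform in [0.3, 0.7]^2; width and height drawn
    // as UniformRandom(u, v) with 0.2 <= u < v <= 0.4.
    Rect uniform_query(std::mt19937& rng) {
        std::uniform_real_distribution<double> centre(0.3, 0.7);
        std::uniform_real_distribution<double> bound(0.2, 0.4);
        double cx = centre(rng), cy = centre(rng);
        double u = bound(rng), v = bound(rng);
        if (u > v) std::swap(u, v);                       // ensure u < v
        double w = std::uniform_real_distribution<double>(u, v)(rng);
        double h = std::uniform_real_distribution<double>(u, v)(rng);
        return Rect{cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2};
    }

    // NORMAL scheme: centre drawn from N(0.5, 0.25);
    // width = height = UniformRandom(0.001, 0.1).
    Rect normal_query(std::mt19937& rng) {
        std::normal_distribution<double> centre(0.50, 0.25);
        std::uniform_real_distribution<double> side(0.001, 0.1);
        double cx = centre(rng), cy = centre(rng);
        double s = side(rng);
        return Rect{cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2};
    }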

Then uni 3 and uni 4 are inserted, with the 1000 random queries being performed in between these insertions. This is followed by the deletion of uni 3 and uni 4; once again, queries are performed before and after the deletions. Finally, uni 3 and uni 4 are re-inserted, with queries performed before and after the insertions. Similar experiments are carried out with the NORMAL and UNIFORM datasets. The same set of experiments is repeated for the R∗-tree.

• Results
Figure 5.19 shows the query CPU time per B rectangles reported, for the LPR-tree and the R∗-tree. Clearly, the LPR-tree performs much better than the R∗-tree for all types of distributions, with query times that are two to three times better for the LPR-tree. The variation across distributions can be explained by the fact that the queries on the different distributions return very different output sizes. For instance, the uniform dataset shows very good CPU time per output rectangle, because its output size is on average quite large (more than 100000) compared to the normal dataset (around 2000 to 3000). This shows that the query cost gets amortized when queries report large output sizes. The same variation with respect to the datasets can be seen for the R∗-tree. The implementation of deletion in the R∗-tree hangs while deleting the uniform dataset, so statistics for this dataset are missing from that point onwards. A last interesting observation is that the query CPU time remains essentially the same over the interleaved insertions and deletions, which was expected for the LPR-tree. In the case of the R∗-tree, however, query time deteriorates after deletions, especially for the NORMAL dataset. This could be attributed to the corresponding increase in I/O's for this dataset.

Figure 5.20 shows the I/O's per B rectangles reported for the LPR-tree and the R∗-tree. As with the time statistics, the LPR-tree outperforms the R∗-tree by a good margin across all datasets. The updates have very little effect on the LPR-tree's query performance: the graph shows a nearly horizontal line. To answer the same queries, the R∗-tree requires three to six times more I/O's than the LPR-tree. The interleaved updates once again seem to make queries more expensive for the R∗-tree on the NORMAL dataset.

To verify the theoretical worst-case query guarantees of the LPR-tree experimentally is very difficult. However, given all the queries on the LPR-trees, performed across all the datasets, I plot the theoretical value √(N/B) + T/B, using the average output size and the average number of I/O's needed to answer these queries (Figure 5.21). Then I plotted the experimentally measured I/O's on a different scale on the same graph. After adjusting the scale of the experimental results, it was possible to identify the worst-performing set of queries, which is point 11 on this graph. When the scale is adjusted, the experimental curve lies completely below the theoretical curve. As we know that the R∗-tree takes more I/O's to answer this set of queries, the worst-case behavior of the LPR-tree is closer to the worst-case optimal number of I/O's than that of the R∗-tree.
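For reference, the quantity plotted as the theoretical value is derived from the worst-case window-query bound of the PR-tree, which the LPR-tree is designed to maintain under updates. Writing N for the number of rectangles stored, B for the number of rectangles that fit in one disk block, and T for the number of rectangles reported, the bound (restated here for convenience, up to constant factors) is

    \[
      O\!\left(\sqrt{\tfrac{N}{B}} \;+\; \tfrac{T}{B}\right) \text{ I/O's,}
    \]

and the plotted curve uses the average output size of each query set in place of T.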


Figure 5.19: Query CPU time (in msec) per B rectangles output.


Figure 5.20: Query I/O's per B rectangles output.


Figure 5.21: Empirical Analysis - Theoretical vs. Experimental query I/O results for the LPR-tree.


Chapter 6

Conclusions

From the experiments and the results obtained, the following conclusions can be drawn.

• The R∗-tree is more efficient in terms of I/O's and time when constructing static R-tree structures. However, the query performance of the LPR-tree is much better than that of the R∗-tree.

• The update algorithms of the LPR-tree outperform the R∗-tree by a large margin in terms of I/O's. However, insertion and deletion times are quite close on average. Considering that the LPR-tree performs the updates faster most of the time, with the rebuilding cost getting amortized, one would still prefer the LPR-tree over the R∗-tree.

• Query performance of the LPR-tree remains very good under interleaved insertions and deletions, and here the LPR-tree definitely has an edge over the R∗-tree.

• Based on the above observations, the LPR-tree should be preferred over the R∗-tree when indexing data that is frequently updated. For static data, if query performance is more important than the cost of building the index, the LPR-tree is still preferable over the R∗-tree; otherwise the R∗-tree is better. In other words, for static R-tree structures over relatively small datasets (less than 1 million data objects), the R∗-tree is preferable.

• Experimental verification of the query guarantees shows a single set of queries that performs worse than all the remaining queries. As the R∗-tree takes more I/O's to answer this set of queries, the worst-case behavior of the LPR-tree is closer to the worst-case optimal number of I/O's than that of the R∗-tree.


Appendix A

Tables of Experimental Results

A.1 Bulk Load

A.1.1 LPR tree

Dataset   Blk Reads(a)   Blk Writes(b)   Strm Reads(c)   Strm Writes(d)   I/O's     Time (sec)
Uni1 1    1062           1731            20892           10048            33733     143
Uni4      4314           6960            122763          67626            201663    859
Uni8      8678           12984           285043          160675           467380    2247
Nor1 1    1102           1783            20903           10054            33842     138
Nor4      4311           6902            126850          68017            206080    882
Nor8      8589           13776           288579          160596           471540    2223
Tig1 1    1072           1746            21999           10049            34866     133
Tig4      4379           7085            132221          69765            213450    855
Tig8      8642           13879           285393          159672           467586    2047

(a) Number of AMI block objects read from the external memory
(b) Number of AMI block objects written to the external memory
(c) Number of block reads incurred while reading from an AMI stream object
(d) Number of block writes incurred while writing to an AMI stream object


A.1.2 R∗-tree

Dataset   Blk Reads   Blk Writes   Strm Reads   Strm Writes   I/O's    Time (sec)
Uni1 1    1035        2072         3053         1832          7992     65
Uni4      4168        8338         12216        7331          32053    262
Uni8      8361        16724        24431        14663         64179    588
Nor1 1    1352        2706         3054         1833          8945     67
Nor4      5472        10946        12216        7332          35966    302
Nor8      11088       22178        24433        14665         72364    809
Tig1 1    1233        2468         3054         1833          8588     58
Tig4      4837        9676         12216        7332          34061    314
Tig8      9845        19692        24433        14664         68634    694


A.2 Insertion

A.2.1 Insertion Time

Dataset       Dataset inserted (mln)   Time (sec)
lpr uni4(a)   uni1 1                   363
              uni1 2                   529
              uni1 3                   397
              uni1 4                   2839
              uni1 5                   378
lpr nor4      nor1 1                   347
              nor1 2                   505
              nor1 3                   386
              nor1 4                   2795
              nor1 5                   338
lpr tig4      tig1 1                   371
              tig1 2                   531
              tig1 3                   408
              tig1 4                   3136
              tig1 5                   367
r* uni4(b)    uni1 1                   767
              uni1 2                   583
              uni1 3                   840
              uni1 4                   945
              uni1 5                   1010
r* nor4       nor1 1                   593
              nor1 2                   655
              nor1 3                   1090
              nor1 4                   1212.5
              nor1 5                   565
r* tig4       tig1 1                   782.5
              tig1 2                   850
              tig1 3                   787.5
              tig1 4                   787.5
              tig1 5                   767.5

(a) LPR-tree on the Uni4 dataset
(b) R∗-tree on the Uni4 dataset


A.2.2 Insertion I/O’s

Dataset    Dataset inserted   Blk Reads   Blk Writes   Strm Reads   Strm Writes   I/O per 100
lpr uni4   uni1 1             9807        10297        45396        22962         8.8462
           uni1 2             21725       22750        118024       61834         13.5871
           uni1 3             31993       33588        168995       87616         9.7859
           uni1 4             66509       69094        582407       324700        72.0518
           uni1 5             76207       79359        627216       347357        8.7429
lpr nor4   nor1 1             9918        10408        45327        22904         8.8557
           nor1 2             22013       23059        119834       62764         13.9113
           nor1 3             32385       33999        170680       88467         9.7861
           nor1 4             66983       69488        587696       326224        72.486
           nor1 5             76859       79975        632260       348717        8.742
lpr tig4   tig1 1             9884        10370        45360        22925         8.8539
           tig1 2             21995       23052        121660       62376         14.0544
           tig1 3             32355       33976        173206       88046         9.85
           tig1 4             67051       69596        619578       330984        75.9626
           tig1 5             76885       79997        665682       353791        8.9146
r* uni4    uni1 1             5878028     5878420      611          1             1175.706
           uni1 2             6098508     6099128      613          2             1219.8251
           uni1 3             6521546     6522337      612          2             1304.4497
           uni1 4             6828814     6829608      613          2             1365.9037
           uni1 5             6735878     6736691      612          2             1347.3183
r* nor4    nor1 1             5409684     5410228      612          2             1082.0526
           nor1 2             6162222     6163002      612          1             1232.5837
           nor1 3             6723870     6724688      612          2             1344.9172
           nor1 4             6155400     6156271      612          2             1231.2285
           nor1 5             5971599     5972418      613          2             1194.4632
r* tig4    tig1 1             9127032     9127874      613          2             1825.5521
           tig1 2             8311476     8312339      613          3             1662.4431
           tig1 3             8633678     8634539      613          3             1726.8833
           tig1 4             7828134     7828961      613          2             1565.771
           tig1 5             7477655     7478448      613          2             1495.6718


A.3 Deletion

A.3.1 Deletion I/O’s and time

Dataset    Dataset deleted   Blk I/O's   Strm Reads   Strm Writes   Time (sec)   I/O per deletion
lpr uni4   uni1 1            6265945     6437         9359          720          6.281741
           uni1 2            11823870    177900       116444        1680         5.836473
           uni1 3            16244805    177800       116000        515          4.420391
lpr nor4   nor1 1            6452651     5610         9907          744          6.468168
           nor1 2            12065165    177882       115540        1674         5.890419
           nor1 3            16514678    178493       115540        523          4.450124
lpr tig4   tig1 1            6266815     6537         10359         700          6.283711
           tig1 2            11825313    177279       116567        1682         5.835448
           tig1 3            16246248    177890       116567        518          4.421546
r* uni4    uni1 1            (no data: experiment did not complete)
           uni1 2
           uni1 3
r* nor4    nor1 1            12403475    611          0             1272.5       12.403475
           nor1 2            11967682    611          1             1272.5       11.967682
           nor1 3            13588870    610          1             1360         13.58887
r* tig4    tig1 1            8692591     611          1             912.5        8.692591
           tig1 2            9992747     611          1             895          9.992747
           tig1 3            8590855     611          1             912.5        8.590855


A.4 Query

A.4.1 LPR-tree

Dataset   Action    Avg |OP|   I/O's    Time (sec)   I/O per OP(a)
Nor       BL(b)     1711.11    12864    12.3         7.51792696
          Ins1(c)   2852.05    22082    18.84        7.742501008
          Ins2(d)   3995.42    18054    14.078       4.518673882
          Del1(e)   2854.49    18040    13.81        6.319867997
          Del2(f)   1711.11    13278    10.53        7.759875169
          Ins1(g)   2852.05    22550    19.9         7.906593503
          Ins2(h)   3995.42    18414    13.98        4.60877705
Uni       BL        81719.47   143384   11           1.754587983
          Ins1      136112.5   231690   26.563       1.702194876
          Ins2      190476.5   279534   43.5         1.467551115
          Del1      136083.5   279408   37.84        2.053209978
          Del2      81719.47   183310   25.89        2.243161881
          Ins1      136112.5   272290   43.54        2.000477546
          Ins2      190476.5   317856   44.08        1.668741288
Tig       BL        6749.02    17449    14.37        2.585412401
          Ins1      6749.02    27996    14.1         4.148157807
          Ins2      6749.02    23378    12.57        3.463910316
          Del1      6749.02    23158    13.6         3.431312991
          Del2      6749.02    19796    11.12        2.933166593
          Ins1      6749.02    20379    10.31        3.019549505
          Ins2      6749.02    24818    13.85        3.677274627

(a) OP is the number of rectangles returned by the window query.
(b) BL - queries after bulk loading nor1.5
(c) Ins1 - queries after inserting nor1 1
(d) Ins2 - queries after inserting nor1 2
(e) Del1 - queries after deleting nor1 1
(f) Del2 - queries after deleting nor1 2
(g) Ins1 - queries after re-inserting nor1 1
(h) Ins2 - queries after re-inserting nor1 2


A.4.2 R∗-tree

Dataset   Action   Avg |OP|   I/O's    Time (sec)   I/O per OP(a)
Nor       BL       1711.11    68560    16           40.06755849
          Ins1     2852.05    68417    27           23.98870988
          Ins2     3995.42    68573    38           17.16290152
          Del1     2854.49    76195    36           26.69303448
          Del2     1711.11    68158    27           39.83262327
          Ins1     2852.05    73538    27           25.78426044
          Ins2     3995.42    74391    59           18.61906883
Uni       BL       81719.47   300226   304          3.673861321
          Ins1     136112.5   269338   430          1.978789604
          Ins2     190476.5   249673   372          1.31078112
          Del1     136083.5   (no data: experiment did not complete)
          Del2     81719.47
          Ins1     136112.5
          Ins2     190476.5
Tig       BL       6749.02    84970    42.5         12.58997603
          Ins1     6749.02    86201    37.5         12.77237288
          Ins2     6749.02    88053    40           13.04678309
          Del1     6749.02    78794    35           11.6748802
          Del2     6749.02    78794    40           11.6748802
          Ins1     6749.02    77752    37.5         11.52048742
          Ins2     6749.02    77752    35           11.52048742

(a) OP is the number of rectangles returned by the window query.


Appendix B

Brief Introduction to TPIE

TPIE [4, 6] is a software environment (written in C++) that facilitates the implementation of I/O-efficient algorithms. The goal of theoretical work in the area of external-memory algorithms (also called I/O algorithms or out-of-core algorithms) has been to develop algorithms that minimize the I/O (i.e., the transfer of data between main memory and disk) performed when solving problems on very large data sets. The TPIE library consists of a kernel and a set of I/O-efficient algorithms and data structures implemented on top of the kernel. Most of the functionality is provided as templated classes and functions in C++.

The following are some of the important structures of TPIE used in the implementation of the LPR-tree:

• AMI stream
An AMI stream is a templated class that stores a list of user-defined objects in external memory. The stream provides interfaces to read and write items. Dedicated streams such as the TwoDRectangleStream are AMI stream objects parameterized with the object type (such as TwoDRectangle) they hold.

TPIE provides a sorting algorithm, AMI sort, for a stream. Given a comparison function or comparison object, it sorts the stream using the external-memory merge-sort algorithm.
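A minimal sketch of how such a stream might be used is given below. It assumes the AMI interface of the TPIE releases from that period (AMI_STREAM, read_item/write_item, AMI_sort with a comparison object); the exact header names, required BTE configuration macros and signatures differ between TPIE versions, and TwoDRectangle stands in for the rectangle type of the thesis implementation.

    // Sketch only: AMI-style stream usage; configuration macros are omitted and
    // the exact TPIE signatures may differ between releases.
    #include <ami.h>

    struct TwoDRectangle { double xlo, ylo, xhi, yhi; };

    // Comparison object for AMI_sort: orders rectangles by their left x-coordinate.
    class CompareXLo {
    public:
        int compare(const TwoDRectangle& a, const TwoDRectangle& b) {
            if (a.xlo < b.xlo) return -1;
            if (a.xlo > b.xlo) return 1;
            return 0;
        }
    };

    void stream_example(AMI_STREAM<TwoDRectangle>& in,
                        AMI_STREAM<TwoDRectangle>& out) {
        TwoDRectangle r = {0.1, 0.1, 0.2, 0.2};
        in.write_item(r);                    // append an item to the stream

        in.seek(0);                          // rewind and scan the stream
        TwoDRectangle* p;
        while (in.read_item(&p) == AMI_ERROR_NO_ERROR) {
            // process *p
        }

        CompareXLo cmp;                      // external-memory merge sort
        AMI_sort(&in, &out, &cmp);
    }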

• AMI block
An AMI block is a templated class that represents a logical block, which is the unit amount of data transferred between external memory and main memory. A block can store data elements and hold links to other blocks. It also provides an information structure to hold bookkeeping data such as the number of items allocated. Given the type of the data objects to be stored and the number of links, the maximum number of data objects that fit determines the capacity of the block.

Blocks are used in the implementation of the LPR-tree to store priority nodes and the internal KD-node data structures.
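As a small illustration of this parameterization: the type names below (TwoDRectangle, KDNodeInfo, NodeBlock) are assumptions modelled on the description above, not the actual thesis declarations.

    // Sketch only: an AMI block is parameterized with the element type and an
    // info type; all names here are illustrative.
    #include <ami.h>

    struct TwoDRectangle { double xlo, ylo, xhi, yhi; };

    struct KDNodeInfo {            // the per-block 'information structure'
        unsigned short num_items;  // e.g. number of elements currently in use
    };

    // A node block stores TwoDRectangle elements plus one KDNodeInfo record;
    // its capacity is the number of elements that fit in one disk block, given
    // the number of links reserved for child blocks.
    typedef AMI_block<TwoDRectangle, KDNodeInfo> NodeBlock;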


• AMI collection
An AMI collection is a class that represents a collection of AMI block objects. Any block in a collection can be identified and retrieved using the unique block identifier associated with it. Collections give the programmer control over the data-layout strategies required by many I/O-efficient algorithms. The LPR-tree is in fact stored as one such block collection.
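The sketch below shows how blocks and a collection fit together. It assumes the AMI collection/block interface of that era's TPIE (AMI_collection_single, AMI_bid, block construction from a collection pointer, a link count and an optional block id); the exact names and signatures should be checked against the TPIE version actually used.

    // Sketch only: creating a block in a collection and retrieving it later by
    // its block identifier. Exact TPIE names/signatures may differ by release.
    #include <ami.h>

    struct TwoDRectangle { double xlo, ylo, xhi, yhi; };
    struct KDNodeInfo { unsigned short num_items; };
    typedef AMI_block<TwoDRectangle, KDNodeInfo> NodeBlock;

    void collection_example(AMI_collection_single<BTE_COLLECTION>* coll) {
        // Omitting the block id asks the collection for a fresh block.
        NodeBlock* node = new NodeBlock(coll, /*links=*/2);
        AMI_bid id = node->bid();          // unique identifier within the collection
        node->info()->num_items = 0;       // initialize the info structure
        delete node;                       // block contents are written back

        // Any block can later be retrieved through its identifier.
        NodeBlock* same = new NodeBlock(coll, /*links=*/2, id);
        delete same;
    }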


Bibliography

[1] Chuan-Heng Ang, S. T. Tan, and T. C. Tan. Bitmap R-trees. Informatica (Slovenia), 24(2), 2000.

[2] Lars Arge, Klaus H. Hinrichs, Jan Vahrenhold, and Jeffrey Scott Vitter. Efficient bulk operations on dynamic R-trees. Algorithmica, 33, 2002.

[3] Lars Arge, Mark de Berg, Herman J. Haverkort, and Ke Yi. The priority R-tree: A practically efficient and worst-case optimal R-tree. In SIGMOD, pages 347–358, 2004.

[4] Lars Arge, Octavian Procopiuc, and Jeffrey Scott Vitter. Implementing I/O-efficient data structures using TPIE. In ESA, pages 88–100, 2002.

[5] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Hector Garcia-Molina and H. V. Jagadish, editors, SIGMOD, pages 322–331. ACM Press, 1990.

[6] TPIE distribution. http://www.cs.duke.edu/tpie.

[7] Yván J. García, Mario A. Lopez, and Scott T. Leutenegger. A greedy algorithm for bulk loading R-trees. In ACM-GIS, pages 163–164, 1998.

[8] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Beatrice Yormark, editor, SIGMOD, pages 47–57, Boston, Massachusetts, June 1984.

[9] Ibrahim Kamel and Christos Faloutsos. On packing R-trees. In CIKM, pages 490–499, 1993.

[10] Ibrahim Kamel and Christos Faloutsos. Hilbert R-tree: An improved R-tree using fractals. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB, Proceedings of 20th International Conference on Very Large Data Bases, Santiago de Chile, pages 500–509. Morgan Kaufmann, 1994.

[11] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In VLDB, pages 507–518, Brighton, England, 1987.

[12] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis. R-trees: Theory and Applications. Springer, 2006.
