comprehensive spatial query containment framework for minimizing redundancy

The Pennsylvania State University, The Graduate School. COMPREHENSIVE SPATIAL QUERY CONTAINMENT FRAMEWORK FOR MINIMIZING REDUNDANCY. A Thesis in Computer Science and Engineering by Brandon M. Unger. © 2009 Brandon M. Unger. Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science, May 2009.


  • The Pennsylvania State University
    The Graduate School

    COMPREHENSIVE SPATIAL QUERY CONTAINMENT FRAMEWORK FOR MINIMIZING
    REDUNDANCY

    A Thesis in
    Computer Science and Engineering

    by
    Brandon M. Unger

    © 2009 Brandon M. Unger

    Submitted in Partial Fulfillment
    of the Requirements
    for the Degree of

    Master of Science

    May 2009

  • The thesis of Brandon M. Unger was reviewed and approved by the following:

    Wang-Chien Lee
    Associate Professor of Computer Science and Engineering
    Thesis Adviser

    John Hannan
    Associate Professor of Computer Science and Engineering

    Daniel Kifer
    Assistant Professor of Computer Science and Engineering

    Raj Acharya
    Professor of Computer Science and Engineering
    Head of the Department of Computer Science and Engineering

    Signatures are on file in the Graduate School.

  • Abstract

    As storage capacity and computational power continue to increase, society is able to collect considerable amounts of data from heterogeneous sources. The analysis of this information may require programs to perform complex, multidimensional analysis in potentially adverse environments. Example applications include business intelligence operations, geographic information systems, and location-based services. While these tools produce valuable information for users, they frequently must operate on systems with limited processing capability and bandwidth capacity. To minimize unnecessary resource consumption, a primary goal is to avoid the execution of any query that is redundant based on results previously obtained by the client. This work introduces the concept of spatial query containment as a means to identify when a new query can be answered solely using results from an existing query. Spatial query containment has been engineered to support a variety of popular spatial query types, including range, window, k-nearest neighbor, and reverse k-nearest neighbor. Each query Q has an associated containment scope area, and any future query Q′ both semantically contained by Q and issued at a point inside of the containment scope of Q can be answered using only the results from Q. Theoretical and experimental analyses indicate that the containment scope processing framework outperforms existing techniques under a wide variety of datasets, query loads, and computing environments. The substantial reduction in redundant query evaluations provided by the spatial query containment framework supports the deployment of novel, data-rich applications in challenging environments while maintaining sufficient scalability, reliability, and performance.


  • Table of Contents

    List of Figures

    List of Tables

    Acknowledgments

    Chapter 1 Introduction
        1.1 Problem Background
        1.2 Problem Motivation
        1.3 Problem Definition
        1.4 Solution Outline
        1.5 Contribution and Organization

    Chapter 2 Literature Review
        2.1 Essential Concepts
        2.2 Data Organization Techniques
            2.2.1 B-Tree Index
            2.2.2 R-Tree Index
            2.2.3 Other Spatial Indexes
            2.2.4 Voronoi Cells
        2.3 Spatial Query Types
            2.3.1 Region Query
            2.3.2 Nearest Neighbor Query
            2.3.3 Reverse Nearest Neighbor Query
            2.3.4 Location-Dependent Spatial Query
            2.3.5 Time-Parameterized Spatial Query
        2.4 Auxiliary Scope Techniques
            2.4.1 Semantic Scope
            2.4.2 Valid Scope
        2.5 Caching Mechanisms


  • Chapter 3 Containment Scope Framework
        3.1 System Overview
        3.2 System Components
        3.3 Underlying Assumptions
        3.4 Communication Model
        3.5 Spatial Query Definitions
        3.6 Containment Scope Definitions
        3.7 Containment Scope Evaluation and Computation Strategies

    Chapter 4 Region Query Computation Methods
        4.1 Containment Scope Server Processing for Region Queries
        4.2 Containment Scope for Range Query
            4.2.1 Basic Observations
            4.2.2 Algorithm Implementation
            4.2.3 Optimized Computation Strategy
        4.3 Containment Scope for Window Query
            4.3.1 Basic Observations
            4.3.2 Algorithm Implementation
            4.3.3 Optimized Computation Strategy
        4.4 Containment Scope Client Processing for Region Queries

    Chapter 5 Nearest Neighbor Query Computation Methods
        5.1 Containment Scope Server Processing for NN Queries
        5.2 Containment Scope for 1NN Query
            5.2.1 Basic Observations
            5.2.2 Algorithm Implementation
        5.3 Containment Scope for kNN Query
            5.3.1 Basic Observations
            5.3.2 Algorithm Implementation
        5.4 Containment Scope Client Processing for NN Queries

    Chapter 6 Reverse Nearest Neighbor Query Computation Methods
        6.1 Preliminary Notes on Reverse Nearest Neighbor Query Processing
        6.2 Basic RkNN Auxiliary Scope Construction
            6.2.1 Korn Unary Basic Auxiliary Scope Processing
            6.2.2 Basic Auxiliary Scope Client Evaluation
            6.2.3 Basic Auxiliary Scope Processing Variants
        6.3 Dynamic RkNN Auxiliary Scope Construction
            6.3.1 Dynamic RkNN Auxiliary Scope Processing
            6.3.2 Dynamic RkNN Auxiliary Scope Example
            6.3.3 Dynamic RkNN Auxiliary Scope Client Evaluation
        6.4 Optimal RkNN Auxiliary Scope Construction
            6.4.1 Monochromatic Optimal Auxiliary Scope Processing
            6.4.2 Bichromatic Optimal Auxiliary Scope Processing


  • Chapter 7 Theoretical Analysis
        7.1 Introduction
        7.2 Relevant Performance Metrics
        7.3 Cost Model Terminology
        7.4 Region Query Cost Model
            7.4.1 Query Submission Rate
            7.4.2 Auxiliary Scope Size
            7.4.3 Bandwidth Consumption
            7.4.4 I/O Cost
            7.4.5 Execution Time
        7.5 NN Query Cost Model
            7.5.1 Query Submission Rate
            7.5.2 Auxiliary Scope Size
            7.5.3 Bandwidth Consumption
            7.5.4 I/O Cost
            7.5.5 Execution Time
        7.6 Extension to Non-Uniform Datasets

    Chapter 8 Experimental Analysis
        8.1 Introduction
        8.2 Domain of Interest
        8.3 Experiment Setup
        8.4 Exp. I. Impact of Auxiliary Scope Formation
            8.4.1 Uniform Dataset Performance Analysis
                8.4.1.1 Region Query
                8.4.1.2 kNN Query
                8.4.1.3 RkNN Query
            8.4.2 Non-Uniform Dataset Performance Analysis
                8.4.2.1 Region Query
                8.4.2.2 kNN Query
                8.4.2.3 RkNN Query
        8.5 Exp. II. Impact of Client Mobility
            8.5.1 Fixed Query Parameter Performance Analysis
                8.5.1.1 Region Query
                8.5.1.2 kNN Query
            8.5.2 Variable Query Parameter Performance Analysis
                8.5.2.1 Region Query
                8.5.2.2 kNN Query
        8.6 Exp. III. Impact of Object Density
            8.6.1 Region Query
            8.6.2 kNN Query
        8.7 Recommendation


  • Chapter 9 Auxiliary Scope Simulator
        9.1 Simulator Project Overview
        9.2 Simulator Objectives
        9.3 Simulator Components
        9.4 Simulator Development Roadmap
        9.5 Simulator Implementation Observations

    Chapter 10 Conclusion
        10.1 Spatial Query Processing Problem
        10.2 Spatial Query Containment Solution
            10.2.1 Advantages
            10.2.2 Disadvantages
            10.2.3 Applications
        10.3 Final Thoughts

    Bibliography


  • List of Figures

    1.1 Example LBS system model
    1.2 Illustration of overlapped query results
    1.3 Containment scope and containment test

    2.1 Traditional data indexing methods
    2.2 Spatial data indexing methods
    2.3 Spatial query types
    2.4 Basic spatial query attempts to solve RNN query
    2.5 RNN evaluation techniques
    2.6 Semantic scope construction approaches
    2.7 Valid scope formulation (TP-query approach)
    2.8 Valid scope formulation (geometric approach)

    3.1 General spatial query containment system model
    3.2 Example R-tree
    3.3 Algorithm best first search

    4.1 Algorithm region query containment scope
    4.2 Subroutine not needed for cs
    4.3 Range query circle and Minkowski circles of objects
    4.4 Determining the containment scope for a range query result
    4.5 Detection of redundant complementary objects
    4.6 Determining the containment scope for a window query result
    4.7 Removable complementary objects
    4.8 Algorithm client region query eval cs

    5.1 Geometric representation of NN containment scope
    5.2 Algorithm nn query containment scope
    5.3 Determining the containment scope for a NN query result
    5.4 Determining the containment scope for a 2NN query result
    5.5 Algorithm knn query containment scope
    5.6 Algorithm client knn query eval cs

    6.1 Effect of k on RkNN result
    6.2 Algorithm find korn unary as
    6.3 Subroutine not needed for vs
    6.4 Subroutine not needed for cs
    6.5 Sample query auxiliary scope computation


  • 6.6 Algorithm client query eval vs
    6.7 Algorithm client query eval cs
    6.8 Algorithm find dynamic as
    6.9 Dynamic RkNN auxiliary scope set membership flowchart
    6.10 Subroutine finalize kcnt
    6.11 Subroutine finalize kdist
    6.12 Subroutine initialize stats
    6.13 Subroutine update stats
    6.14 Subroutine refine vs comp set
    6.15 Subroutine refine cs comp set
    6.16 Example dynamic auxiliary scope computation
    6.17 Algorithm client query eval vs (Revised)
    6.18 Algorithm client query eval cs (Revised)
    6.19 Outside search space scenarios
    6.20 Subroutine outside search
    6.21 Algorithm find optimal as
    6.22 Subroutine initialize stats (Revised)
    6.23 Algorithm find optimal as (Bichromatic)
    6.24 Subroutine initialize stats (Bichromatic)
    6.25 Subroutine outside search (Bichromatic)
    6.26 Sample bichromatic query auxiliary scope computation

    7.1 Search area cir(q, 3r) and MBRs

    8.1 Server overhead for computing range query auxiliary scope on uniform dataset
    8.2 Server overhead for computing window query auxiliary scope on uniform dataset
    8.3 Server overhead for computing kNN query auxiliary scope on uniform dataset
    8.4 Server overhead for computing RkNN query auxiliary scope on uniform dataset
    8.5 Server overhead for computing range query auxiliary scope on non-uniform dataset
    8.6 Server overhead for computing window query auxiliary scope on non-uniform dataset
    8.7 Server overhead for computing kNN query auxiliary scope on non-uniform dataset
    8.8 Server overhead for computing RkNN query auxiliary scope on uniform dataset
    8.9 Impact of client mobility on the performance of fixed range query (r = 1.5%)
    8.10 Impact of client mobility on the performance of fixed window query (l = 1.5%)
    8.11 Impact of client mobility on the performance of fixed kNN query (k = 4)
    8.12 Impact of client mobility on traditional spatial queries with variable parameters
    8.13 Impact of object density on traditional spatial query performance

    9.1 Auxiliary scope simulator components
    9.2 Auxiliary scope simulator screen captures


  • List of Tables

    6.1 Algorithm assumptions
    6.2 Set definitions
    6.3 Dynamic RkNN auxiliary scope (Stage I)
    6.4 Dynamic RkNN auxiliary scope (Stage II)
    6.5 Dynamic RkNN auxiliary scope (Stage III - VS)
    6.6 Dynamic RkNN auxiliary scope (Stage III - CS)

    7.1 Cost model definitions

    8.1 Experiment parameters

    9.1 Auxiliary scope simulator release schedule


  • Acknowledgments

    I would not have been in a position to perform this work without the steadfast support of many people in my life. In particular, I owe a great debt to my research advisor, Professor Wang-Chien Lee, and fellow Penn State graduate student Ken C.K. Lee. Their continued guidance and high standards contributed in large part to the success of this work. The Pervasive Data Access (PDA) group at Penn State has provided much needed input as I refined my research ideas, and both Ken C.K. Lee and Baihua Zheng went one step further by assisting in draft revisions and experimental analysis. I also would like to acknowledge the useful thesis template files offered by Professors Francesco Costanzo and Gary Gray. Dr. John Hannan has served graciously as my academic advisor despite my continual questions, inquiries, and issues. For this, I offer my admiration and thanks. In addition, I would like to thank Michael Kozeniauskas for his assistance in reviewing drafts of this work and for suggesting much needed improvements. Furthermore, I extend my warmest regards to the entire thesis committee for donating their limited time and considerable talent to review my research. Finally, I would not have maintained sanity throughout the entire thesis composition process without the unyielding and unconditional support of my family and my friends. There are far too many of you to name, and regardless, I would be afraid of omitting a name. However, this thesis would not have been possible without your help.


  • Dedication

    This work is dedicated to my family, my friends, and my teachers (both past and present). You were there to share in my accomplishments and to lift me up after disappointments. This work is as much yours as it is my own.


  • Chapter 1 Introduction

    1.1 Problem Background

    As society moves boldly into a new century, computer scientists are witnessing an exponential explosion in the amount of data being collected by systems and sensors for analysis. Dramatic price drops and performance improvements in both computationally rich mobile devices (e.g., laptops, personal digital assistants, smart phones) and computationally poor sensors offer more ways to collect data than at any other point in history. With this wealth of new knowledge, information workers have developed a need to mine literally exabytes of raw data for useful information. Furthermore, the successful processing of modern data often requires an application to perform complex, multi-dimensional analysis to search for patterns. Alternatively, users may seek information about increasingly popular spatial and temporal datasets. By bringing timely and geographically appropriate information to end users, corporations are beginning to develop an exciting new category of applications that is likely to increase in popularity and importance over the next decade as their user base continues to expand.

    Consider the following situations that often arise in today's computing landscape:

    Geographic Information Systems (GIS): Advances in processing power and storage capacity have made it possible to analyze geographic information on a massive scale. Workstations can now perform useful operations on geographic data in real time and offer users the ability to plan vacations and business trips, to identify the market region for different retail locations, and to track the spread of a disease.

    Business Intelligence (BI): Business intelligence efforts are becoming increasingly important in today's global economy and require the synthesis of massive amounts of operational data. The general goal of such functionality is to identify patterns and relationships that can guide future business decisions. This frequently requires trend analysis and the discovery of complex interactions between multiple variables.

  • Location-Based Services (LBS): As prices fall and performance improves, consumers are moving away from stationary desktop systems and embracing mobile devices such as laptops, PDAs, and smart phones. This growing mobile user base presents new opportunities for location-based services (LBS), which offer information that is specific to a client's location. Examples include locating nearby restaurants and calculating directions to a hotel. LBS applications frequently must issue location-dependent spatial queries (LDSQ), which depend on the current location of the client.

    In all of these cases, there is a need to solve complex problems on multi-dimensional data. Identifying unique and efficient methods for answering these questions in dynamic and diverse environments has become an important area of research in the computer science field.

    1.2 Problem Motivation

    The increased demand for and proliferation of multi-dimensional data has widened the problem space that needs to be addressed by modern information processing systems. Traditionally, basic point and range queries have been sufficient to answer the majority of questions posed by users. However, this is no longer the case. Increasingly diverse and complex questions necessitate a more comprehensive toolset for successful data analysis. In one particular case, we often need to identify those objects that depend on or are near some specified location. Spatial queries that operate on a variety of (potentially multi-dimensional) indexes provide one key mechanism to address the demands of today's changing computing landscape. Basic spatial query types (including region and nearest neighbor queries) as well as advanced spatial query types (including k nearest neighbor and reverse nearest neighbor) provide a language for requesting this information.

    Despite the great opportunity provided by spatial queries, these new tools are frequently deployed in areas that require exceptional scalability, reliability, and performance under highly stressed systems in possibly adverse conditions. There is an urgent need to conserve server and (in some cases) network resources through the reduction of CPU, disk, and bandwidth utilization. In cases where mobile devices are included in the processing scheme, wireless connectivity, client battery life, and mobile computing resources become additional concerns. High client-server communication costs can degrade the user experience through high query latency and slow response times.

    Unfortunately, addressing these concerns is easier said than done. One of the keys to achieving reasonable and economically sustainable performance in these situations is the identification of redundancy in query requests. Notice that all of the above sample applications potentially contain a high level of overlapping data requests.

    GIS: Systems may serve many users or issue queries at many nearby points to perform analysis. In such cases, a subsequent query answer may be partially or even entirely contained in a previous answer obtained by the same client.

    BI: Spatial queries offer a mechanism for identifying the effect that different business factors have on some outcome. For example, a retailer may want to determine how decreasing a sale price affects the quantity of surf boards sold during cool weather. This single logical question may induce many queries executed simultaneously over temperature and sale price dimensions. Substantial overlap in requested temperature and sale price values offers an opportunity to reuse some query results.

    LBS: Requests in LBS applications are based on a user's current location. While this location may change frequently and result in a high rate of client query submissions, the actual range of movement between each update may be very limited in relation to the domain of the entire dataset. Therefore, we can expect high degrees of query overlap in situations where the client is frequently updating information.

    It follows from the above observations that substantial reductions in query submissions, processing overhead, and bandwidth consumption can be achieved if redundant spatial query submissions can be detected and suppressed. That is, the client should only issue a spatial query to the server for processing when that spatial query cannot be answered locally using previous query results. Adopting this strategy would lead directly to improved response time, increased system scalability, and reduced system overhead. However, it is very difficult to detect redundancy in a spatial query request since the result set for such a request is contingent on (1) the query point location, (2) additional query parameters, and (3) the distribution of items within the dataset.

    Extensive work has been done in pursuit of a solution to the redundancy identification problem, with limited success. One such attempt, and the focus of this work, is the construction of an auxiliary scope for each spatial query that is submitted to the server. This auxiliary scope data allows a client to determine if a new spatial query can be answered locally using results from a previous query. Existing research in this category includes the following:

    Semantic Scope: Semantic scope techniques use bounds on the size of the query region to detect when one query is completely contained within another query. An example situation may be a circular range query issued at the same query point with a smaller radius than a previously issued query. Popular approaches include semantic region processing for range and window queries as well as mNN query development for kNN queries.

    Valid Scope: Valid scope techniques construct a valid scope region for each spatial query Q that is issued to the server. Any new spatial query $Q'$ that differs from Q only in its query point and which is issued within the valid scope region of query Q can be answered using only result set data from Q. Thus, the client can answer the new query $Q'$ locally without sending it to the server for processing. Example implementations of this technique include time-parameterized (TP) query valid scope, geometric valid scope, as well as a specialized geometric valid scope framework for Location-Dependent Spatial Queries (LDSQs) issued within a broadcast environment.

    Although both semantic scope and valid scope techniques can substantially reduce the number of redundant queries that are submitted to the server for processing, these existing approaches are not optimal because they fail to identify many situations in which a spatial query may be completely contained within a previous result. To address this concern, this work introduces the notion of containment scope as the newest member within the auxiliary scope family of redundancy solutions. Compared with other auxiliary scope techniques, containment scope applies to future spatial queries that are exceptionally varied and offers a large area over which the results of existing spatial queries can be reused.

    1.3 Problem Definition

    While much work has been done in the diverse area of spatial query processing, the inability to effectively and efficiently detect redundancy has at least partially limited the functionality, performance, and availability of applications that require such detection in a real-world environment. Furthermore, no existing technique obtains optimal redundancy detection over a wide variety of spatial query types with a minimal amount of overhead. Thus, we seek to develop computational methods for a novel auxiliary scope framework, called spatial query containment, that provides the following necessary properties to real-world systems:

    Scalability: We must provide for a massive and growing user base by eliminating as many redundant client spatial query requests as possible.

    Reliability: We must protect mission-critical infrastructure by providing accurate results and by not introducing inconsistent results into the information processing system.

    Performance: We must provide responses quickly, as analyzed data is frequently used by clients to make time-sensitive decisions.

    Flexibility: We must support the processing of highly varied information requests through the deployment of a framework that can process a wide variety of spatial queries.

    Satisfying these requirements is particularly challenging in dynamic environments that feature (1) transient network connections with limited data transmission capabilities, (2) centralized servers under high demands, and (3) a potentially large client base. Thus our fundamental goal is as follows: to develop a containment scope processing framework that accurately solves spatial queries in a way that efficiently utilizes existing data available to the client to conserve server resources and to mitigate bandwidth contention by maximizing client self-reliance.

    Typically, auxiliary scope implementations are deployed using a client-server model in either a stationary or mobile environment using wired or wireless connections, respectively. An example of a mobile deployment of auxiliary scope in support of location-based services (LBSs) is depicted in Figure 1.1. Mobile clients seek spatially nearby objects by transmitting location-dependent spatial queries (LDSQs) over a wireless channel to a base station that then relays queries to an LBS server. The server evaluates these queries and delivers a result set of matching spatial objects from the global dataset for each submitted LDSQ to the client from which the request originated. As will be discussed in more detail later, some auxiliary scope implementations may operate using broadcast communication, using stationary clients, or using logical clients that reside on the same physical machine as the server component. In all cases, the general flow of information remains the same.

    Figure 1.1. Example LBS system model (mobile clients, base station, LBS server)

    In addition to being compatible with a number of communication models, our system framework must support a wide range of spatial query requests. Common types of spatial queries include region (range, window) queries, nearest neighbor (NN, kNN) queries, and reverse nearest neighbor (RNN) queries. Region queries retrieve objects within specified areas (e.g. circles and rectangles) that are geographically centered at the query point. Meanwhile, nearest neighbor queries find an object located closest to a query position. Finally, reverse nearest neighbor queries seek all data objects that are closer to the query point than to any other data point in the dataset. Although this final query type may resemble the nearest neighbor query, it is important to note, and will later be shown, that the reverse nearest neighbor query cannot be solved using nearest neighbor query algorithms. Although each auxiliary scope concept studied in this work supports some subset of these query types, only spatial query containment offers native methods and a common framework for processing all three query groups.

    As previously mentioned in Section 1.2, an appropriate method for obtaining necessary spatial query scalability, reliability, performance, and flexibility in real-world systems should avoid resubmitting spatial queries to the server for processing whenever the result of that query does not require information from the dataset that is locally unavailable. We refer to such a request as a redundant spatial query. It can be observed that two spatial queries, even if issued at different query points or with different parameters, may share the same result. Therefore, the result of a previous spatial query can possibly be reused to answer a new query without sending a new query submission to the server for processing. To enable every client to determine whether the results of a new spatial query are locally available, additional information about previous spatial queries is needed. Meanwhile, the effectiveness of such an approach is dependent on the ability to minimize the overhead required to generate, to transmit, and to utilize the additional data.

    1.4 Solution Outline

    This work introduces a new concept referred to as spatial query containment, which determines whether the result of a spatial query $Q'$ (denoted by $R_{Q'}$) is contained by that of a previous spatial query Q (denoted by $R_Q$). Any contained query can be answered locally and avoid submission to the server for evaluation. Formally, $R_{Q'} \subseteq R_Q$ denotes a containment relationship whereby $R_{Q'}$ is contained by $R_Q$.

    Perhaps most importantly, our solution avoids costly resource consumption for mobile applications (e.g. military troop notification) or for large batches of related operations (e.g. business intelligence scenario analysis). Containment scope effectively reduces I/O accesses, CPU consumption, and network bandwidth utilization within the established system model. In addition, containment scopes serve to highlight areas in which the result set will not be larger than that of the original query with which it is associated. That is, any query $Q'$ that is semantically contained by another query Q and issued inside of the containment scope of Q will have the cardinality of its result set bounded by $|R_Q|$, the cardinality of the result set of Q. This information can be valuable in its own right under certain situations, and we point out such cases at the appropriate places in our discussion.

    Consider how some real world applications can benefit from a containment framework:

    Air force pilots need to remain aware of anti-aircraft batteries that pose a threat through the use of surface-to-air missiles (SAMs). Pilots constantly retrieve information about nearby threats from a centralized database. Accurate information is vital in such situations, yet the military must support a large number of aircraft in combat. Spatial query containment can help fighter jets to realize when new threats are encountered and allow computer systems to trigger appropriate system updates.

    Frequent business travelers need to locate hotel accommodations, dining establishments, and gas stations. Some devices such as cell phones may be unable to contain a complete database of travel information. Such a scenario would require server queries by the client. In addition, a traveler's position may change quickly, so constant updates are often needed. It is essential that redundant queries be eliminated if many users are to take advantage of such a system. Spatial query containment offers a way to provide this functionality.

    A pizza chain wants to determine which customers might frequent a new store added in some geographic region. They may also want to know what other pizza restaurants might lose business. Using containment scope in tandem with an RNN query can identify both pieces of information with minimal redundant processing by the server.

    Consider the conceptual illustration in Figure 1.2(a) in which a circular region query (range query) Q is to be evaluated at a query point q on a set of objects {a, b, c, d, e, f, g, h}. In the context of our previous examples, this range query could be issued by a fighter jet in search of information about nearby threats. We use cir(q, r) to represent the circular search region of Q, where q is the center and r is the search radius. The result set R contains {c, d} and the remaining objects (e.g. g and h) are non-result objects. Logically, for another range query $Q'$ with search area $\mathrm{cir}(q', r')$, it is straightforward to determine $R' \subseteq R$ if $\mathrm{cir}(q', r') \subseteq \mathrm{cir}(q, r)$. However, this is only one of several possibilities. There are many other cases in which $\mathrm{cir}(q', r') \not\subseteq \mathrm{cir}(q, r)$ but $R' \subseteq R$ still holds. Being able to efficiently identify a large number of these scenarios can improve the reusability of R for new queries. We observe and list the five cases in which $R' \subseteq R$ is possible below.

    [Figure 1.2: panel (a) Range query Q, showing search region cir(q, r) with result objects and non-result objects; panel (b) truncated.]

    For the cases in which query points are unchanged (i.e., cases 1-2), we can decide whether $R_{Q'} \subseteq R_Q$ immediately by comparing the radii of their search ranges. In contrast, it is challenging to examine the containment of results for situations where query points and possibly search ranges are different (i.e., cases 3-5). Existing solutions to eliminate redundant spatial queries consider at most two of the previously described cases, whereas our proposed containment scope can identify result set reusability in all but the last case.

    Before explaining how containment scope attains its improved redundancy detection, we briefly examine the example depicted in Figure 1.2(b). From the cases of $R_{Q_2} \subseteq R_Q$ and $R_{Q_3} \subseteq R_Q$, we can see that although the search ranges of both $Q_2$ and $Q_3$ are different from that of Q, $Q_2$ and $Q_3$ cover some result objects in $R_Q$ and do not contain objects outside the result $R_Q$. This corresponds to the general observation about the relationship between $\mathrm{cir}(q', r')$ and cir(q, r) mentioned previously. Any spatial query $Q'$ in which all its result objects are located in the search range covered by Q will possess a result set $R_{Q'}$ that must be totally contained in $R_Q$. In other words, knowledge of the surrounding object distribution is needed to determine containment. Assume that the object distribution is fixed. Then there exists a spatial area for each query Q and corresponding result set $R_Q$ such that any future query $Q'$ (with an equal or smaller search area than that of Q) issued inside of this region has its result set $R_{Q'} \subseteq R_Q$. To capture this spatial area, we propose a notion of containment scope, denoted by $S_Q$, through which we may determine whether a query (say $Q'$) can be answered by a maintained result (say $R_Q$).

    The shaded area shown in Figure 1.3(a) represents a containment scope corresponding to the result of a range query Q over a given dataset. It is guaranteed that for any range query $Q'$ with $r' \le r$ and $q' \in S_Q$, $R_{Q'} \subseteq R_Q$. Thus, with a containment scope associated with $R_Q$, the client can answer $Q'$ locally. This evaluation of whether $Q'$ can be answered based on the result of a query Q is defined as the spatial query containment test (or containment test for short). As shown in Figure 1.3(b), because conditions (1) $r_1 \le r$, (2) $r_2 \le r$, (3) $r_3 \le r$, and (4) $q_1, q_2, q_3 \in S_Q$ hold, the client can completely answer $Q_1$, $Q_2$, and $Q_3$ by retrieving result objects from $R_Q$. In particular, we have $R_{Q_1} = \{d\}$, $R_{Q_2} = \{d\}$, and $R_{Q_3} = \{c\}$.

    Figure 1.3. Containment scope and containment test: (a) containment scope $S_Q$, with complementary objects marked; (b) containment test

    The above discussion is based on the range query, a type of region query. In fact, the concept of spatial query containment is much more general and is applicable to all previously mentioned query types. As the formulations of containment scopes and containment tests are highly related to the type of spatial query, we shall explore them in detail throughout the remainder of this work.

    To exploit spatial query containment, we present a system framework that includes (1) basic spatial query processing, (2) containment scope computation, and (3) spatial query containment testing logic. Since the formation of a containment scope requires knowledge of both result objects and non-result objects, we assign containment scope computation to the server. When a query is submitted and evaluated, the containment scope for that query result is computed. In order to minimize the processing cost of containment scope computation, we devise efficient online algorithms that are integrated with spatial query processing whenever possible to minimize index access. As will be discussed later, our approach can finish the evaluation of a spatial query and then determine the corresponding containment scope with a single index traversal. It can also deliver the query result coupled with the corresponding containment scope back to the client in one message. Issuing a new query $Q'$ causes the client first to perform a containment test for each stored containment scope $S_Q$ and its associated spatial query result $R_Q$. It only submits the new query to the server when the containment test indicates $R_{Q'} \not\subseteq R_Q$ for all stored containment scopes.

    Also, the representation of a containment scope $S_Q$ has a direct impact on (1) the communication cost of delivering $S_Q$ back to the client, (2) the containment test overhead incurred by the client in deciding whether the new query point $q' \in S_Q$, and (3) the local storage cost incurred by the client for maintaining $S_Q$. Hence, great care must be used in selecting a containment scope representation for our framework such that the benefit of spatial query containment is achieved while minimizing overhead. Certainly, a containment scope can be represented as a polygon that consists of edges and vertices. However, this approach will incur a large volume of data and high computation costs in checking if a point is inside a polygon. Furthermore, for some queries (e.g. range queries with circular search areas), a polygon-based representation cannot provide an exact containment scope.

    Instead, we choose to use individual data object locations to represent containment scope data. Recall that a new query $Q'$ whose search area does not touch any non-result object is guaranteed to have its result contained by $R_Q$. However, the number of non-result objects can be very large; thus it is impractical to transmit and to store all objects on the client. Instead, our approach tries to identify only those representative non-result objects that affect the formation of the containment scope, to minimize the communication overhead. We refer to such objects as complementary objects. Referring to Figure 1.3(a), the containment scope is composed of result objects {c, d} and complementary objects {a, b, e, f}. Notice that some non-result objects such as g and h are skipped.
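The client-side idea described above can be illustrated with a minimal sketch for circular range queries. It is not the containment test developed later in this work: the class `CachedResult`, the method `try_answer`, and all coordinates are hypothetical, and the sketch assumes that the stored complementary set includes every non-result object that an admissible new search circle could reach (here, simply all non-result objects).

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def range_query(objects, q, r):
    """Server-side evaluation: names of all objects inside cir(q, r)."""
    return {name for name, loc in objects.items() if dist(loc, q) <= r}

class CachedResult:
    """A previous range query plus hypothetical scope data kept by the client."""

    def __init__(self, q, r, result, complementary):
        self.q = q
        self.r = r
        self.result = result                # name -> location of result objects
        self.complementary = complementary  # name -> location of non-result objects

    def try_answer(self, q_new, r_new):
        """Return the local answer to cir(q_new, r_new), or None if unproven."""
        # The new search radius must not exceed the cached one.
        if r_new > self.r:
            return None
        # If the new circle touches a known non-result object, the new result
        # is not contained in the cached result, so the server is needed.
        if any(dist(loc, q_new) <= r_new for loc in self.complementary.values()):
            return None
        # Otherwise every object in the new circle is a cached result object.
        return {n for n, loc in self.result.items() if dist(loc, q_new) <= r_new}

# Hypothetical coordinates, chosen only for illustration.
objects = {"a": (1, 8), "b": (2, 2), "c": (5, 5), "d": (6, 6), "e": (9, 3), "f": (9, 8)}
q, r = (5.5, 5.5), 1.5
result_names = range_query(objects, q, r)           # {'c', 'd'}
cached = CachedResult(
    q, r,
    result={n: objects[n] for n in result_names},
    complementary={n: p for n, p in objects.items() if n not in result_names},
)
print(cached.try_answer((6, 6.5), 1.0))    # {'d'}: answered locally
print(cached.try_answer((8.5, 7.5), 1.0))  # None: the circle reaches object f
```

A client holding several cached results would call `try_answer` on each in turn and contact the server only if all of them return `None`.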

    1.5 Contribution and Organization

    In the remainder of this paper, we continue to describe the concept of spatial query containment, a new technique to reduce redundant queries by allowing clients to determine whether their maintained spatial query results are sufficient to answer subsequent spatial queries. We propose containment scope, containment testing logic, and a spatial query processing framework to efficiently realize this new concept for a number of different query types and under a wide variety of circumstances that include applications to GIS, BI, and LBS. We further conduct a comprehensive set of experiments to evaluate the effectiveness of spatial query containment in relation to a representative sample of existing techniques. The results consistently indicate the superiority of the spatial query containment approach under a wide variety of scenarios.

    In summary, the primary contributions presented in this work are as follows:

    1. We introduce the concept of spatial query containment, which can eliminate the submission of spatial queries when their results are locally available and thereby reduce redundant server requests, query response time, client energy consumption, and bandwidth contention.

    2. We propose a new notion of containment scope, which represents a spatial area corresponding to a result set $R_Q$ of an LDSQ Q wherein a new LDSQ $Q'$ has a result set $R_{Q'}$ that is fully covered by $R_Q$ so long as the search area of $Q'$ is smaller than (or contained by) that of Q.

    3. We devise efficient online containment scope computation algorithms for region (range, window) queries, nearest neighbor (NN) queries, and reverse nearest neighbor (RNN) queries. Several variants of these basic query types (such as kNN and RkNN) are also considered. Our computation methods integrate containment scope evaluation with spatial query processing whenever possible to minimize incurred processing overhead.

    4. For each query type, we devise a containment test algorithm that uses a previously computed containment scope to determine if a new spatial query result is fully covered by the previous one.

    5. We present a spatial processing framework that incorporates online containment scope computation and containment testing over several existing communication models in support of a wide variety of commercial applications. The effectiveness of spatial query containment over this model is analyzed within the context of the assumptions defined in Chapter 3.

    6. We conduct extensive theoretical analysis and empirical experiments to evaluate system performance in comparison with existing related approaches. In general, the amortized savings by using spatial query containment are shown to outweigh the minimal overhead required during initial query processing. Furthermore, spatial query containment outperforms all existing related works under a wide variety of circumstances.

    7. We implement a working auxiliary scope simulator to test the effectiveness of various techniques under real-world application scenarios. The performance, reliability, and scalability of spatial query containment in the example system is measured in relation to other auxiliary scope techniques as well as a baseline system with no query reduction mechanisms.

    The remainder of the paper is organized as follows. Chapter 2 reviews literature used as a foundation for this work as well as numerous existing methods for the reduction of spatial queries. Distinctions between spatial query containment and these approaches are mentioned when appropriate. Chapter 3 provides an outline of the spatial query containment framework, states basic definitions and assumptions, and discusses the spatial query processing algorithms that form the basis of this work. Chapter 4, Chapter 5, and Chapter 6 discuss spatial query containment for region (range and window) queries, nearest neighbor (NN and kNN) queries, and reverse nearest neighbor (RNN and RkNN) queries as well as our proposed approaches. Chapter 7 analyzes spatial query containment from a theoretical perspective, while Chapter 8 evaluates our proposed framework against related works over various situations. In Chapter 9, the results from the construction of our auxiliary scope simulator and their implications on the effectiveness of spatial query containment are considered. Finally, Chapter 10 concludes this paper and states possible future research directions.

    Chapter 2: Literature Review

    2.1 Essential Concepts

    This chapter reviews a variety of work that is relevant to the issue of spatial query containment as well as to the construction of efficient and effective containment scope processing algorithms. We begin in Section 2.2 by considering various indexing structures that are frequently used to facilitate the efficient insertion, deletion, or updating of spatial data. Next, Section 2.3 examines the spatial queries supported by our containment framework. We offer example usage scenarios for each query type as well as a general overview of existing computational approaches. After considering spatial data indexing and querying, the notion of auxiliary scope support is carefully examined in Section 2.4. Techniques in this section attempt to accomplish similar goals as spatial query containment by forming a region wherein query results can be reused. Current processing methods as well as the relative advantages and disadvantages of each approach are reviewed. All auxiliary scope methods attempt to identify future redundant queries by examining stored data that is associated with a specific previously issued query. We close the chapter in Section 2.5 by examining the important role that various client caching strategies play in the effective use of different auxiliary scopes.

    2.2 Data Organization Techniques

    We begin our comprehensive literature review with a look at how multi-dimensional data is typically organized to provide for efficient access and modification. We start with the classical B-tree index and associated linearization techniques. Next, we turn our attention to custom spatial data indexing methods such as the ubiquitous R-tree index, the quad tree index, and the D-tree index. Finally, we offer a definition for a Voronoi cell and consider its applicability to spatial information processing.

    Figure 2.1. Traditional data indexing methods: (a) sample dataset; (b) B-tree index (x-dimension); (c) B-tree index (y-dimension); (d) B-tree index (z-order curve)

    2.2.1 B-Tree Index

    One principal research issue in the area of spatial information management has been the development of efficient data storage structures that can be used to hold data that is relevant to a given system. Spatial query processing represents a unique technical challenge given that these requests typically restrict the dataset using two separate fields simultaneously (e.g. latitude and longitude). Classical disk-based indexing structures such as the ubiquitous B-tree and its variant the B+-tree can only efficiently index information in a single dimension [1].

    Recall that the B+-tree index sorts data keys based on some relative ordering property. Each leaf node stores the keys for data objects, while each internal node of the tree stores pointers to child nodes. These child nodes are responsible for storing all keys that fall inside of some interval that is specified by the key values stored by the parent node. For example, we may produce an index based on the age of students. A parent node may store pointers to three children as well as the key values 21 and 25. This means that the first child holds all data keys with a value that is less than 21, the second child holds all data keys with a value between 21 (inclusive) and 25 (exclusive), and the third child holds all data keys with a value greater than or equal to 25. Any internal node that has n children will have $n - 1$ key values to facilitate tree traversal. As a result, those keys with similar values tend to be grouped into the same index node. In our example, younger students (age 18-20) would be grouped together into one part of the tree, while older students (age 21-24) would be grouped into a different branch of the tree. Unlike the classical B-tree, the B+-tree always stores data objects at the leaf level of the tree to facilitate sequential scanning of the dataset.
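As a small illustration of the student-age example, the following sketch picks the child to follow inside one internal node. The function name `child_index` is hypothetical, and the sketch assumes the convention in which a key equal to a separator belongs to the child on its right.

```python
from bisect import bisect_right

def child_index(separator_keys, key):
    """Index of the child pointer to follow for `key` in an internal node.

    An internal node with n children stores n - 1 sorted separator keys.
    A key equal to a separator is routed to the child on the right.
    """
    return bisect_right(separator_keys, key)

separators = [21, 25]  # three children: < 21, [21, 25), >= 25
for age in (18, 21, 24, 25, 30):
    print(age, "-> child", child_index(separators, age))
```

Ages 18 through 20 all map to child 0 and ages 21 through 24 all map to child 1, which is the grouping of similar key values described above.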

    The typical B+-tree usage scenario involves accessing items on disk, and we generally size each tree node such that it is equal to a single disk page. Because disk pages are large relative to a typical key size, each tree node can store many keys and each internal node will have many child nodes. The number of children that belong to each node is referred to as the fanout of the tree. Each data object's key is inserted as an entry into a leaf node of the B-tree based on its key value. The leaf node is chosen by starting at the root of the tree and following the branch of the tree that is responsible for storing the key range in which the new data key value lives. All nodes have a specified minimum and maximum capacity, and the B-tree nodes are recursively split or merged as needed to accommodate data updates.

    As previously mentioned, the B+-tree is a popular method for indexing data in a single dimension. However, spatial data often requires that two or more dimensions be considered simultaneously. This is not implicitly supported by the B+-tree structure. Consider the example dataset given in Figure 2.1. Here, we have a set of eight data points (a-h) that need to be indexed based on spatial locality. However, the B+-tree index can only consider a single attribute. In Figure 2.1(b), we consider just the x-dimension of each data object. The root node has three children (N1, N2, and N3), and each child is responsible for storing a certain subset of possible x-coordinate values. N1 stores values in the interval $(-\infty, d_x)$, N2 stores values in the interval $[d_x, f_x)$, and N3 stores values in the interval $[f_x, \infty)$. Here, $a_x$ refers to the value of the x-coordinate of data object a. Examining this grouping, we notice that data objects with similar x-coordinates are grouped together; however, this does not necessarily imply spatial locality. For instance, objects b and g are grouped together but are not actually in close proximity. Figure 2.1(c) illustrates a similar process by which we index our spatial data points by their y-coordinates. The resulting index also fails to truly represent the spatial locality of stored objects, as is evident from the grouping of objects f, g, and h into a single index node.

    A final organizational technique for the B+-tree index utilizes linearization techniques such as z-curve ordering (pictured in Figure 2.1(d)) and Hilbert curve ordering [2, 3]. These processes attempt to merge two different dimensions of information in a way that maintains spatial locality. That is, the resulting linear B+-tree index represents spatial information by projecting multi-dimensional objects onto a one-dimensional space. The example dataset in Figure 2.1(d) shows the effect of z-ordering on the dataset. We assign each data point a location on the curve that minimizes the Euclidean distance between the object's curve location and real location. Next, we assign one end of the z-order curve a small value and then monotonically increase z-coordinate values as the curve is traversed. This approach produces the best spatial locality of all B+-tree indexes but still is limited by the fact that only a single dimension of information can be represented in the final index. This is clearly illustrated by the grouping of objects d and f into a single node despite the relatively large Euclidean distance that separates them. In general, spatial query evaluation performs poorly under this scenario, so researchers have endeavored to create new data organizations that actually maintain full dimensionality in the resulting index. These data indexes primarily involve either grid or tree structures and have experienced varying degrees of success [1, 4, 5].
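One common way to obtain a z-order key is bit interleaving of grid coordinates (a Morton code). The sketch below shows that standard construction for non-negative integer coordinates; it is offered as an illustration and is not necessarily the exact assignment procedure described above. The function name `z_order_key` and the sample cells are hypothetical.

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of x and y into a single one-dimensional key.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1,
    so the key can be stored in an ordinary one-dimensional B+-tree.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Cells in the same 2x2 block receive consecutive keys ...
print([z_order_key(x, y) for (x, y) in [(0, 0), (1, 0), (0, 1), (1, 1)]])
# ... but horizontally adjacent cells (3, 0) and (4, 0) receive distant keys.
print(z_order_key(3, 0), z_order_key(4, 0))
```

The second line of output shows the limitation noted above: two neighboring cells can be far apart on the curve, so a one-dimensional key cannot preserve spatial locality everywhere.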

    2.2.2 R-Tree Index

    Figure 2.2. Spatial data indexing methods: (a) R-tree index (leaf nodes N1 = {a, c}, N2 = {b, d, g}, N3 = {e, f, h}); (b) Voronoi cell

    In 1984, Antonin Guttman introduced the R-tree indexing structure, which offers efficient storage both in memory and on disk, incurs minimal update cost, and indexes information using all spatial attributes. R-trees group nearby data points together into minimum bounding rectangles (MBRs) [4, 5, 6]. A minimum bounding rectangle represents the smallest rectangle that contains all of the data points with which the MBR is associated. As the R-tree becomes full, additional levels are added, and higher-level MBRs are determined by the smallest rectangle that is needed to contain the MBRs of all child nodes. All data objects are always stored at the lowest level of the tree. Much like the B+-tree indexing structure, the R-tree has a minimum and maximum node capacity. Underflow and overflow are handled through recursively merging or splitting nodes (and MBRs) in the tree as needed. However, unlike the B+-tree structure in which sibling nodes store disjoint subsets of the data key range, the MBRs associated with sibling R-tree nodes can overlap. Thus, a query may need to follow multiple paths in the R-tree in order to consider all possible results. Many popular spatial queries are supported well under this structure. For example, window queries over a particular region can be answered quickly by traversing the children of any node whose MBR overlaps the search region.

    Continuing with the same dataset used in our B+-tree example, Figure 2.2(a) illustrates how data objects a-h can be indexed using an R-tree. Once again, we have a root node with three child nodes N1, N2, and N3. Each internal node has an MBR (shown in the figure) that contains all of its child objects. Notice that the R-tree offers a much higher degree of spatial locality than any of the previous B-tree solutions since it incorporates both dimensions into the index structure. The outer rectangle in the figure represents the MBR for the root node and illustrates how child MBRs can be used to create a new MBR for a higher level in the tree.
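The MBR construction and the window query pruning described above can be sketched on a fixed two-level tree. The leaf grouping follows the three leaf nodes N1, N2, and N3 of the example; the coordinates, the `Rect` class, and the function names are hypothetical, and node splitting and merging are deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other):
        """True if the two axis-aligned rectangles share at least one point."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, p):
        return self.xmin <= p[0] <= self.xmax and self.ymin <= p[1] <= self.ymax

def mbr_of_points(points):
    """Smallest rectangle containing all given points."""
    xs, ys = zip(*points)
    return Rect(min(xs), min(ys), max(xs), max(ys))

def mbr_of_rects(rects):
    """Smallest rectangle containing all given child MBRs."""
    rects = list(rects)
    return Rect(min(r.xmin for r in rects), min(r.ymin for r in rects),
                max(r.xmax for r in rects), max(r.ymax for r in rects))

# Hypothetical coordinates for objects a-h, grouped into three leaf nodes.
leaves = {
    "N1": {"a": (2, 9), "c": (4, 8)},
    "N2": {"b": (1, 5), "d": (3, 4), "g": (2, 2)},
    "N3": {"e": (7, 5), "f": (8, 3), "h": (9, 7)},
}
leaf_mbrs = {name: mbr_of_points(pts.values()) for name, pts in leaves.items()}
root_mbr = mbr_of_rects(leaf_mbrs.values())

def window_query(window):
    """Return (sorted result names, leaf nodes actually visited)."""
    hits, visited = [], []
    if not root_mbr.intersects(window):
        return hits, visited            # the whole tree is pruned
    for name, pts in leaves.items():
        if leaf_mbrs[name].intersects(window):   # follow only overlapping MBRs
            visited.append(name)
            hits.extend(n for n, p in pts.items() if window.contains(p))
    return sorted(hits), visited

print(root_mbr)
print(window_query(Rect(2.5, 3.5, 7.5, 8.5)))
```

In this toy dataset the sample window overlaps all three leaf MBRs, so every leaf is visited; a window that misses the root MBR is rejected without visiting any leaf.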

    Later variants of the general R-tree structure resulted in popular optimizations such as the R+-tree and the R*-tree. These refined data structures cemented the R-tree as a ubiquitous choice for indexing spatial information [4, 6]. When grouping objects, R-trees attempt to minimize area using various heuristics that trade computational speed for algorithm effectiveness. The R+-tree attempts to avoid overlapping MBRs among internal index nodes. However, this complicates the grouping and updating logic for the overall index. In contrast, R*-trees consider area, overlap, perimeter, and node fill factor when making decisions about how to group, split, and merge data objects. The choice and relative influence of each of these factors is based largely on empirical results. R*-trees provide very good performance with low-dimensionality datasets and are one of the most commonly used data structures for indexing massive spatial datasets. Consequently, we adopt this variant as the primary spatial indexing method used for this work. Any deviation from this decision will be noted as appropriate.

    When issuing spatial queries against the R*-tree, we adopt the distance browsing technique proposed by Hjaltason et al. [7]. They use an incremental, greedy approach to locate nearby objects. A priority queue stores R*-tree nodes and is pre-populated with the root of the tree. For each iteration of the algorithm, we remove the node that is closest to the query location from the queue. If the removed node is an internal node, we add all of its children to the priority queue for further analysis. Otherwise, we examine records in the leaf node as potential result objects.

    2.2.3 Other Spatial Indexes

    Various other spatial indexes have been proposed for the effective management of complex data. Two interesting approaches that exemplify overall design patterns for spatial indexes are the quad tree and the D-tree. The quad tree recursively divides the data space into quadrants based on the relative location of data objects. When a particular quadrant in the index is filled to capacity, the restructuring logic splits the quadrant once in the x-dimension and once in the y-dimension. Typically, the goal of the splitting routine is to perform more recursive splits in areas of the dataset that are exceptionally dense. Compared with the R-tree approach, quad trees offer a more rigid index structure that provides more predictable behavior at the expense of flexibility.
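
    The splitting behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name QuadTree, the capacity parameter, and the midpoint split rule are choices made for the example, and the sketch assumes distinct points (duplicate points beyond the capacity would split forever).

```python
class QuadTree:
    """Hypothetical point quad tree that splits a full quadrant into four."""

    def __init__(self, x0, y0, x1, y1, capacity=2):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None  # becomes a list of four sub-quadrants after a split

    def insert(self, p):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= p[0] <= x1 and y0 <= p[1] <= y1):
            return False  # point lies outside this quadrant
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append(p)
                return True
            self._split()
        # any() stops at the first child that accepts the point
        return any(child.insert(p) for child in self.children)

    def _split(self):
        # One cut in x and one cut in y, both at the midpoint (an assumption).
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        old, self.points = self.points, []
        for q in old:  # push stored points down into the new quadrants
            any(child.insert(q) for child in self.children)

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(child.depth() for child in self.children)

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(child.count() for child in self.children)
```

    Inserting several points that cluster in one corner produces a deeper subtree there than in sparse regions, which is the density-driven refinement mentioned in the text.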

    The second type of alternative spatial index is the D-tree, which indexes the data space based on regional divisions. D-trees divide the entire data space into non-overlapping polygonal regions. They index this information in a way that allows for quick determination of membership in a particular region. Such a membership determination is referred to as a planar point query. Many spatial queries can be reduced to planar point queries, so such an index can be quite useful. D-trees also provide clients with information about the specific partition, or zone, in which their query was issued.

    2.2.4 Voronoi Cells

    For our final data structure discussion, we consider the important role that Voronoi cells play in classifying locations in the data space. A Voronoi cell for some data object o in the dataset is the convex polygon formed by intersecting the half-planes defined by the perpendicular bisectors of the line segments from o to every other object $o' \neq o$ in the dataset. Let $\perp_{o,o'}$ represent the perpendicular bisector of the line segment between data objects o and o'. It follows that $\perp_{o,o'}$ divides the data space into two disjoint subsets. The region that contains object o represents all points in the data space closer to o than to o'. Similarly, the region that contains object o' represents all points in the data space closer to o' than to o. For convenience, we let $\perp_{o,o'}$ also refer to the subregion of the data space that contains object o. Then the Voronoi cell V(o) for data object o can be represented as $V(o) = \bigcap_{o' \neq o} \perp_{o,o'}$. In addition, no two Voronoi cells overlap (i.e., $V(o) \cap V(o') = \emptyset$ for $o \neq o'$). Finally, another useful property for subsequent discussion is that the union of the Voronoi cells of all data objects covers the entire data space. Extending our observation about the two disjoint regions formed from a perpendicular bisector $\perp_{o,o'}$, we can conclude that any point inside of V(o) is closer to object o than to any other object o' in the dataset.

    Consider the example Voronoi cell depicted in Figure 2.2(b). The shaded region represents the Voronoi cell V(c) for object c. (Note that point q is a query point and is not a part of the dataset.) The sides of the Voronoi cell are formed by perpendicular bisectors $\perp_{c,a}$, $\perp_{c,b}$, $\perp_{c,d}$, and $\perp_{c,e}$. Other perpendicular bisectors such as $\perp_{c,f}$, $\perp_{c,g}$, $\perp_{c,h}$, and $\perp_{c,i}$ do not affect the final Voronoi cell since they are less restrictive than the bisectors that already bound the cell. The notion of Voronoi cells is essential to spatial query containment for nearest neighbor and k nearest neighbor queries. For example, we know that the closest data object to query point q is object c by virtue of the fact that $q \in V(c)$.
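
    The half-plane view of a Voronoi cell translates directly into a membership test. The sketch below is illustrative only: the function names on_side_of and in_voronoi_cell are assumptions, points are plain (x, y) tuples, and points lying exactly on a bisector are treated as belonging to the cell.

```python
def on_side_of(q, o, other):
    """True if q lies in the half-plane of the bisector of (o, other) that contains o."""
    mx, my = (o[0] + other[0]) / 2, (o[1] + other[1]) / 2  # midpoint of the segment
    # Project (q - midpoint) onto the direction o -> other; a non-positive
    # value means q is on o's side of the perpendicular bisector.
    return (q[0] - mx) * (other[0] - o[0]) + (q[1] - my) * (other[1] - o[1]) <= 0


def in_voronoi_cell(q, o, dataset):
    """True if q lies in V(o), the intersection of all half-planes around o."""
    return all(on_side_of(q, o, other) for other in dataset if other != o)
```

    A point that passes this test for object o has o as its nearest neighbor, which is the property the containment framework relies on for NN queries.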

    2.3 Spatial Query Types

    With essential data organization techniques now firmly established, we turn our attention to common questions asked about spatial information. This section considers traditional core spatial queries that include region (range, window) queries and nearest neighbor (NN, kNN) queries. In addition, we review the more recent and complex reverse nearest neighbor (RNN, RkNN) query family. These queries represent a comprehensive set of popular queries that should be supported by any auxiliary scope approach. As such, we explore methods for supporting each of these spatial query types in the containment scope framework throughout the rest of this paper. In addition, our framework is easily extensible to support additional query types as necessary. Each query type is first defined. We then offer real-world situations in which such a query would be useful. Finally, we offer an overview of common query evaluation techniques. Since query evaluation is highly dependent on the type of data index available, we restrict our discussion to algorithms that are appropriate for the R-tree data index or its variants. Recall that our spatial query containment framework uses the R-tree index because of its widespread acceptance and efficient performance.

    Figure 2.3. Spatial query types: (a) Range query, search area cir(q, r); (b) Window query, search area win(q, l, h); (c) NN query, circle cir(q, |q-d|); (d) Reverse NN query, circles cir(b, |q-b|), cir(c, |q-c|), and cir(d, |q-d|).

    2.3.1 Region Query

    The first basic type of spatial query is the region query, which returns all objects in the dataset that lie within a specified area. Typically, we define the region to be searched in terms of a central query point q as well as some set of supplemental query parameters given by E. Different subtypes of region queries exist based on the shape of the specified region. Two common categories that will be considered in this paper include the range query and the window query.

    Range queries attempt to identify all objects inside of a circular region centered at the query point q with a radius r. In this case, r is the sole parameter included in set E. A sample range query is given in Figure 2.3(a). Here, the shaded region given by cir(q, r) represents the search area. It follows that objects c and d are returned as query results since $c, d \in cir(q, r)$. All other objects are outside of the circle and are not returned to the client.

    The second type of region query is the window query, which attempts to identify all objects inside of a rectangular region centered at the query point q with extents given by length l and height h. Here, the total size of the rectangle is $2l \times 2h$, and the passed query parameters represent the horizontal and vertical distances between the query point and the rectangle boundary. A sample window query is given in Figure 2.3(b). Here, the shaded region given by rect(q, l, h) represents the search area. It follows that objects c and d are returned as query results since $c, d \in rect(q, l, h)$. All other objects are outside of the rectangle and are not returned to the client.
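
    As a point of reference, both region query definitions can be written as simple filters over the full dataset. The sketch below is a brute-force illustration of the definitions only, not of the index-based evaluation; the function names range_query and window_query are assumptions, and objects are plain (x, y) tuples.

```python
import math


def range_query(objects, q, r):
    """All objects inside the circle cir(q, r)."""
    return [o for o in objects if math.dist(q, o) <= r]


def window_query(objects, q, l, h):
    """All objects inside the 2l-by-2h rectangle centered at q."""
    return [o for o in objects
            if abs(o[0] - q[0]) <= l and abs(o[1] - q[1]) <= h]
```

    Note that l and h are half-extents, matching the convention in the text that the full rectangle measures 2l by 2h.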

    Consider some examples where a region query could be useful:

    A military base may want to locate all fighters within range of a particular target. A range query can accomplish this task.

    A tourist may want to find points of interest on a certain city block prior to leaving the area. This is precisely the case that is solved by a window query.

    Pennsylvania State University officials may want to locate potential food distributors that are within 10 miles of the University Park campus.

    A retailer may want to identify sales that occurred during a particular timeframe and within a certain price range. These two parameters can be simultaneously restricted using a window query with time as the first dimension and price as the second dimension.

    Several mechanisms exist for processing region query information. Recall that an R-tree consists of both internal and external nodes and that each node has an MBR. The general strategy for answering a region query is to explore the children of all nodes in the R-tree that have an MBR which overlaps the query search region. At the leaf level, we include all objects that are located inside of the query range. Various algorithms process nodes in different orders and have different termination criteria. The first two obvious choices are to traverse the tree using either a breadth first search (BFS) or a depth first search (DFS). We manage such searches using a queue or stack, respectively. We only continue a search path if the current node's MBR overlaps with the query region.

    An alternative search process follows the distance browsing technique [7]. In this case, the algorithm explores nodes based on their minimum distance (or mindist) from the query point q. This increases the speed with which data objects are found, since result objects are likely to be spatially near the query point. The set of items that still need to be explored, and the order in which to explore them, is maintained by a priority queue data structure. The distance browsing method has the beneficial byproduct that the algorithm terminates with all unexplored data nodes already sorted by the mindist metric in the priority queue. We exploit this fact during the construction of several spatial query containment algorithms.
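
    The mindist metric referred to above is commonly computed as the Euclidean distance from the query point to the nearest point of the rectangle. The following sketch assumes an MBR is represented as an (xmin, ymin, xmax, ymax) tuple; that representation and the function name are choices made for this illustration.

```python
import math


def mindist(q, mbr):
    """Smallest Euclidean distance from point q to any point of the rectangle."""
    xmin, ymin, xmax, ymax = mbr
    # Per-axis gap between q and the rectangle; zero when q is within the extent.
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)
```

    Because no object inside a node can be closer to q than the node's mindist, this value is a safe lower bound for ordering entries in the priority queue.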

    2.3.2 Nearest Neighbor Query

    The second basic type of spatial query is the nearest neighbor (NN) query, which returns those objects in the dataset that are closest to a given query point q. Two different subtypes of nearest neighbor queries exist based on the number of objects that are to be returned to the client. The 1-NN query identifies the closest object to a given query point, while the k-NN query returns the k closest objects to the query point.

    First, we consider the 1-NN query. This query only has a single parameter q that denotes the location at which the query is to be issued. The algorithm returns the data object that minimizes the Euclidean distance between the query point and the data object. If we denote the set of all data points as O, then the 1-NN result object o satisfies the property $|q, o| \le |q, o'| \;\forall o' \in O$. Note that the cardinality of the result set is always one by definition.

    Furthermore, we notice that the 1-NN query is more difficult to solve than a standard region query. This follows from the observation that the query answer depends not only on the query location and data object location but also on the relative location of all other objects in the dataset. An example nearest neighbor query is given in Figure 2.3(c). Here, a 1-NN query is issued at point q. The entire result set consists only of object d, as it is the closest to point q. To see this, observe that the circle cir(q, |q, d|) contains no other data object, so no other object in the dataset can possibly be closer to q than object d.

    The second type of nearest neighbor query is the k-NN query. In this case, we return the k objects in the dataset that are closest to the query point q. That is, we return the k data objects that minimize the sum of the Euclidean distances between the query point q and each of the k data objects. The result set R of any k-NN query satisfies the property $|q, o| \le |q, o'| \;\forall o \in R,\ o' \in O \setminus R$. Furthermore, we observe that the cardinality of the result set is always k. Finally, it is worth noting that the 1-NN query is simply a specific type of the k-NN query with k = 1.
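
    The k-NN definition and its defining property can be checked with a short brute-force sketch. This scans the whole dataset and is meant only to illustrate the definition; the function names knn and satisfies_knn_property are assumptions made for the example.

```python
import heapq
import math


def knn(objects, q, k):
    """The k objects with the smallest Euclidean distance to q (brute force)."""
    return heapq.nsmallest(k, objects, key=lambda o: math.dist(q, o))


def satisfies_knn_property(result, objects, q):
    """Check |q, o| <= |q, o'| for every o in the result and o' outside it."""
    rest = [o for o in objects if o not in result]
    return all(math.dist(q, o) <= math.dist(q, other)
               for o in result for other in rest)
```

    Setting k = 1 yields the 1-NN query as a special case, as noted in the text.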

    Consider some examples where a nearest neighbor query could be useful:

    A motorist may want to find the five closest gas stations to his/her current geographic location.

    Air force pilots may need to identify the closest enemy fighter in order to engage in combat effectively.

    Given projected audience age and income demographics, a movie studio may attempt to identify past movies that are similar to a proposed motion picture in an effort to predict sales.

    Several mechanisms exist for processing nearest neighbor query information. Once again we assume that an R-tree index exists on the data to be processed. As such, we know that each node has an associated MBR to indicate the region that is covered by child nodes. The primary method used for answering NN queries is the distance browsing technique [7]. Recall that this algorithm processes nodes in order of their minimum distance from the query point q. During each iteration of the algorithm, we dequeue a node entry from a priority queue and insert its children back into the priority queue for future processing. Furthermore, the algorithm examines data objects precisely in increasing order of their Euclidean distance from the query point. It follows that the first object (excluding internal nodes) examined is precisely the single result set object in the case of a 1-NN query. By extension, the first k objects located by the algorithm are precisely the k objects in the result set of a k-NN query. As in the case of region queries, the distance browsing technique terminates with a priority queue of unexplored nodes that are sorted by the mindist metric. This fact will be useful in constructing a containment scope for nearest neighbor queries.
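
    A minimal sketch of best-first browsing in this style is given below. It is an illustration under stated assumptions rather than the implementation used in this work: nodes are plain dictionaries built by the hypothetical helpers leaf and internal, and data points are themselves pushed onto the priority queue (keyed by their exact distance) so that they are reported in nondecreasing distance order.

```python
import heapq
import itertools
import math


def mindist(q, mbr):
    """Lower bound on the distance from q to anything inside the rectangle."""
    xmin, ymin, xmax, ymax = mbr
    return math.hypot(max(xmin - q[0], 0.0, q[0] - xmax),
                      max(ymin - q[1], 0.0, q[1] - ymax))


def leaf(points):
    xs, ys = zip(*points)
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)), "points": list(points)}


def internal(children):
    boxes = [c["mbr"] for c in children]
    mbr = (min(b[0] for b in boxes), min(b[1] for b in boxes),
           max(b[2] for b in boxes), max(b[3] for b in boxes))
    return {"mbr": mbr, "children": list(children)}


def browse(root, q):
    """Yield data points in nondecreasing Euclidean distance from q."""
    tie = itertools.count()  # tie-breaker so the heap never compares entries
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        _, _, entry = heapq.heappop(heap)
        if isinstance(entry, tuple):      # a data point: next closest object
            yield entry
        elif "points" in entry:           # leaf node: enqueue its records
            for p in entry["points"]:
                heapq.heappush(heap, (math.dist(q, p), next(tie), p))
        else:                             # internal node: enqueue its children
            for child in entry["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child))


def knn(root, q, k):
    """First k objects produced by the browsing order."""
    return list(itertools.islice(browse(root, q), k))
```

    Because browse is a generator, stopping after k results leaves the remaining entries unexplored, mirroring the observation that the algorithm terminates with a sorted queue of unvisited nodes.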

    2.3.3 Reverse Nearest Neighbor Query

    With the two basic spatial query types now defined, we turn our attention to the reverse nearest neighbor (RNN) query. Two different subtypes of reverse nearest neighbor queries exist and are similar to those defined for the nearest neighbor query. The two categories of RNN queries include the R1NN query and the RkNN query. We review the general idea of RNN queries below and then supplement the discussion with details that are specific to each subtype of the RNN query category.

    Recall that a nearest neighbor query identifies the object in a dataset that is closest to a query point q with respect to all other data objects. The term "closest" allows for some ambiguity in this definition, which can be eliminated by introducing a clearly defined distance function. Most commonly, the Euclidean distance metric is used to perform comparisons. The RNN query attempts to identify the same relationship as an NN query but does so in the opposite direction. That is, it identifies all objects in a dataset that would have the query point as one of their closest points if the query point were added to the dataset. Unlike the NN query type, the cardinality of an RNN result set is not fixed, and the result set can potentially be empty.

    Considering the R1NN query type, the result set consists of all data objects o that are closer to the query point q than to all other data objects in the dataset O. Any result object o of an R1NN query satisfies the property $|o, q| \le |o, o'| \;\forall o' \in O,\ o' \neq o$. Figure 2.3(d) shows an example of an R1NN query issued at point q. The result set of this query includes objects b, c, and d. Notice that the corresponding circles cir(b, |b - q|), cir(c, |c - q|), and cir(d, |d - q|) contain no other data objects. It follows that q is the closest point to each of these data objects. On the other hand, objects a, e, f, g, and h are closer to other dataset objects than they are to the query point q.

    Next, we examine the RkNN query type. There is a natural analog between the relationship of NN and kNN queries, and this relationship can be extended to cover the case of reverse NN and reverse kNN (RkNN) queries as well. Specifically, a kNN query looks for the closest k objects to a query point, while an RkNN query searches for any object that has the query point as one of its closest k objects. Any result object o of an RkNN query satisfies the property $|o, q| \le |o, o'| \;\forall o' \in Z$. Here, Z is some set that satisfies (1) $Z \subseteq O$ and (2) $|Z| = |O| - k$. Once again, the R1NN query is simply a specialized case of the RkNN query with k = 1.
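
    The RkNN definition can be restated operationally: an object o is a result if fewer than k other data objects are strictly closer to o than the query point is. The brute-force sketch below illustrates that reading; the function name rknn is an assumption, and the quadratic scan is for clarity only.

```python
import math


def rknn(objects, q, k=1):
    """Objects that would count q among their k nearest neighbors (brute force)."""
    result = []
    for o in objects:
        d_q = math.dist(o, q)
        # Number of other data objects strictly closer to o than q is.
        closer = sum(1 for other in objects
                     if other != o and math.dist(o, other) < d_q)
        if closer < k:
            result.append(o)
    return result
```

    With k = 1 this reduces to the R1NN query. The result size varies with the data and may be empty, unlike the fixed cardinality of a kNN result.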

    Although the RNN query type is analogous to the nearest neighbor query, it cannot be answered by that basic query type because of the inherent asymmetry between the two query definitions. To illustrate that an R1NN query cannot be solved using existing 1NN or range query types, consider Figure 2.4. In Figure 2.4(a), we issue a 1NN query at point q and obtain the result of data object d. However, d is not a member of the R1NN result set, as data point c is closer to d than q is. (That is, L' < L.) In Figure 2.4(b), we attempt to use a range query issued at point q to identify the result of an R1NN query issued at q. Using a radius of r, we ensure that every included data object has q as its closest point but accidentally eliminate the legitimate result object a. If we expand the radius to r' so as to include data point a, we accidentally include non-result objects c and d that are closer to each other than they are to query point q. Thus, we conclude that basic spatial queries cannot easily be used to solve an R1NN (and, by extension, RkNN) query.

    Consider some examples where an RNN query could be useful:

    A pizza chain wants to determine which customers might frequent a new store added in some geographic region. They may also want to know what other pizza restaurants might lose business. An RNN query can identify both pieces of information.

    Figure 2.4. Basic spatial query attempts to solve RNN query: (a) NN query, with distances L and L'; (b) Range query, with radii r and r'.

    If we define the closeness of two objects to be some sort of similarity function, then we can compare the effect of adding different products into a market. For instance, a movie studio could rate the similarity between multiple movies being released to that of a previous blockbuster and select the one that would provide the highest predicted viewership and, by extension, profitability.

    Schools could use RNN queries to identify possible papers that have been victims of plagiarism by some new but dubious work by identifying common phrasing and content. Once again, the distance function of the RNN query can be designed in a way to identify the suspect commonality between papers.

    In the United States military, the Joint Chiefs of Staff may need to conduct simulations to determine where command stations should be constructed to serve as many troops as possible. RkNN queries can help to answer these questions.

    On a related note, military commanders can alert troops of new enemies in a field of combat by identifying those troops that are closer to the enemies in question than to other friendly or hostile forces. RNN queries provide precisely this ability.

    We now turn our attention to various methods for computing RNN queries. There has been substantial work done in the area of RNN query analysis since the query type was formally introduced in 2000 by Korn. This survey considers four popular methods [8, 9, 10, 11] for computing the results of an RNN query and its variants. We focus on algorithms that provide an exact result set in order to satisfy the accuracy requirements of our spatial query containment framework.

    The first approach for computing an RNN query was proposed in a paper by Korn on the topic of influence sets. Korn introduced the concept of reverse nearest neighbor (RNN) queries and several straightforward variants such as the reverse k nearest neighbor (RkNN) query [8]. In addition, he distinguished between two types of RNN queries: monochromatic RNN queries and bichromatic RNN queries. In the former case, each point in the dataset considers all other points as possible nearest neighbor candidates. An example of this scenario might be a commuter looking for nearby people with whom to visit. In contrast, a bichromatic dataset is divided into two distinct categories, and each category only considers candidates from the other category as possible neighbors. We often color the points in our dataset as either red or blue depending upon designated group membership. In such a case, red points identify their nearest neighbor from all existing blue points and vice versa. A real-world example of such a scenario might be a categorization of police officers (colored red) and the citizens that they are charged with protecting (colored blue).

    Figure 2.5. RNN evaluation techniques: (a) Korn RNN processing algorithm; (b) Stanoi RNN processing algorithm, with partitions A through F; (c) Tao RNN processing algorithm; (d) Lee RRNN processing algorithm.

    In this inaugural solution to the RNN query problem, Korn pre-computed the NN query result for each point in the dataset. He then maintained an R-tree that was populated with NN circles instead of data points. An NN circle for a given data object, o, is the circle centered at o whose circumference touches the nearest neighbor o' of o. We denote such a circle as cir(o, |o, o'|). Sample NN circles for our running example dataset are given in Figure 2.5(a). An RNN query then reduces to identifying all circles that contain the query point. This was an effective solution for static datasets in which the expensive pre-computation step was only performed once. However, dynamic datasets exhibited poor performance since the update process was computationally expensive. Upon insertion of a new object, the circles of every object that had the new object as a nearest neighbor had to be updated. The NN result for the new object also had to be computed and inserted into the R-tree. To support these NN computations when new data objects are added, Korn also maintained a second R-tree that contains only the data objects.

    This work also extends the general algorithm to support the RkNN query type by storing the circle for the kth closest object to each object in the dataset. However, the approach assumes that k is both fixed and known in advance. Unfortunately, this generally is not the case.
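
    The pre-computation idea can be sketched without an R-tree by storing one circle radius per object and scanning the circles at query time. This is only an illustration of the reduction described above: the function names are assumptions, the linear scan stands in for the R-tree of circles, and k must be fixed when the circles are built.

```python
import math


def build_nn_circles(objects, k=1):
    """Map each object to the radius of its circle through its kth nearest neighbor."""
    circles = {}
    for o in objects:
        dists = sorted(math.dist(o, other) for other in objects if other != o)
        circles[o] = dists[k - 1]  # assumes the dataset has more than k objects
    return circles


def rnn_from_circles(circles, q):
    """RNN/RkNN result: every object whose pre-computed circle contains q."""
    return [o for o, radius in circles.items() if math.dist(o, q) <= radius]
```

    The sketch also makes the update cost visible: inserting a new object can shrink the radius stored for any number of existing objects, so every affected circle must be recomputed.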

    Shortly after the publication of Korn's technique, Stanoi developed a more efficient method of RNN computation on dynamic datasets [11]. This new method addressed the large update cost associated with the R-tree structure of NN circles in Korn's original design. The new technique for computing RNN queries avoids pre-computing NN circles by observing that there can be at most six RNN query results when monochromatic queries are considered. This last restriction is an important one, as the assumption does not hold for bichromatic cases. Beyond simply identifying that there can only be six monochromatic RNN query results, Stanoi partitioned the data space into six sections in such a way that each section could contain at most one of the RNN result objects. A sample partitioning scheme is provided in Figure 2.5(b). The query and dataset are the same as in the example used to illustrate Korn's approach. We depict the six partitions as A, B, C, D, E, and F. Notice that the partitions are centered around the RNN query point and that each consists of an infinite-length sector with an interior angle of 60 degrees. Finally, notice that at most one result object exists in each partition, as is to be expected.

    To compute an RNN query result, the authors use an R-tree structure that contains only the points of the dataset. (In fact, other spatial indexes can be used. For example, the R*-tree structure can actually yield better performance in most cases.) To begin, the algorithm issues six NN queries from the query point, each restricted to one of the six sectors. Next, the approach issues additional NN queries from each of the previously identified NN points to determine if they are in fact RNN solutions. That is, the distance between a candidate result object and its nearest neighbor in the dataset must be greater than the distance between the candidate result object and the query point for it to be an actual result object for the RNN query. Unlike the solution by Korn, this technique lacks scalable support for RkNN query types and is thus ill-suited for environments where such queries are needed.
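
    The two steps above (one constrained NN query per sector, then one verification NN query per candidate) can be sketched with linear scans standing in for the index-based NN queries. The function names and the choice of sector boundaries at multiples of 60 degrees are assumptions made for this illustration; the sketch also assumes q is not itself in the dataset, that the dataset has at least two objects, and that no exact distance ties occur on sector boundaries.

```python
import math


def sector_of(q, o):
    """Index (0-5) of the 60-degree sector around q that contains o."""
    angle = math.atan2(o[1] - q[1], o[0] - q[0]) % (2 * math.pi)
    return min(int(angle // (math.pi / 3)), 5)  # clamp guards float rounding


def rnn_six_sectors(objects, q):
    """Monochromatic R1NN via six sector-constrained NN candidates."""
    # Step 1: the nearest neighbor of q within each sector is the only
    # possible RNN result in that sector.
    candidates = {}
    for o in objects:
        s = sector_of(q, o)
        if s not in candidates or math.dist(q, o) < math.dist(q, candidates[s]):
            candidates[s] = o
    # Step 2: keep a candidate only if no data object is closer to it than q.
    result = []
    for c in candidates.values():
        nn_dist = min(math.dist(c, other) for other in objects if other != c)
        if math.dist(c, q) <= nn_dist:
            result.append(c)
    return result
```

    At most six candidates are ever verified, regardless of dataset size, which is the source of the method's advantage on dynamic data.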

    As a third substantial work, we consider the contribution of Tao et al. in discovering an efficient method for performing RkNN queries in a variety of datasets [9]. This method leverages ideas from the study of Voronoi cells in order to prune R-tree nodes from the search space. Recall that Voronoi cells are formed using a series of perpendicular bisectors. The algorithm by Tao effectively pretends that the query point q is in the dataset and prunes the space using a series of bisectors between q and other objects in the dataset. Any data object that lies completely on the opposite side (with respect to the query point) of a perpendicular bisector of the line between q and some other data object o' cannot possibly be a part of the RNN query result set. Figure 2.5(c) illustrates the incremental reduction of the query space by the algorithm. Here, we are examining object d and use the perpendicular bisector $\perp_{o,q}$
