when is nearest neighbors indexable? uri shaft (oracle corp.) raghu ramakrishnan (uw-madison)

When Is Nearest Neighbors Indexable?

Uri Shaft (Oracle Corp.)
Raghu Ramakrishnan (UW-Madison)

Motivation - Scalability Experiments

• Dozens of papers describe experiments about index scalability with increased dimensions.
– Constants are:
• Number of data points
• Data and query distribution
• Index structure / search algorithm
– Variable:
• Number of dimensions
– Measurement:
• Performance of index.

Example From PODS 1997

Motivation

• In many cases the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality

• We would like to investigate these claims mathematically – supply a proof of scalability or non-scalability.

Historical Context

• Continues work done in “When Is Nearest Neighbors Meaningful?” (ICDT 1999)

• Previous work about behavior of distance distributions.

• This work about behavior of indexing structures under similar conditions.

Contents

• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
– The performance of CD index does not scale for VV workloads using Euclidean distances.
• Conclusion
• Future Work

Vanishing Variance

• Same definition used in ICDT 99 work (although not named in that work)

• In 1999 we showed that the workloads become meaningless – ratios of distances between query and various data points become arbitrarily close to 1.

• We use the same result here.

Vanishing Variance

• A scalability experiment contains a series of workloads W1, W2, …, Wm, …
– m is the number of dimensions
– each workload Wm has n data points and a query point (same distribution)
– Distance distribution marked as Dm

• Vanishing Variance:

$$\lim_{m \to \infty} \operatorname{var}\left( \frac{D_m}{E[D_m]} \right) = 0$$
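The Vanishing Variance condition can be checked empirically on a synthetic workload. The sketch below is only an illustration: the uniform distribution on the unit cube, the sample size `n`, and the function name `relative_variance` are assumptions, not part of the original experiments.

```python
import numpy as np

def relative_variance(m, n=2000, seed=0):
    """Estimate var(D_m / E[D_m]) for one workload W_m.

    Assumption: the n data points and the query point are drawn IID
    from the uniform distribution on [0, 1]^m.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n, m))        # n data points in m dimensions
    query = rng.random(m)            # query point, same distribution
    dist = np.linalg.norm(data - query, axis=1)  # Euclidean distances D_m
    return np.var(dist / dist.mean())

if __name__ == "__main__":
    # The estimate should shrink toward 0 as m grows.
    for m in (2, 10, 100, 1000):
        print(m, relative_variance(m))
```

For this assumed workload the printed estimate should decrease as m increases, which is the behavior the definition describes.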


Convex Description Index

• Data points distributed to buckets (e.g. disk pages). Access to a bucket is “all or nothing”. We allow redundancy. A bucket contains at least two data points.

• Each bucket associated with a description – a convex region containing all data points in the bucket.

• Search algorithm accesses at least all buckets whose convex region is closer than the nearest neighbor.

• Cost of search is the number of data points retrieved.
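A minimal sketch of this cost model, assuming the convex descriptions are minimum bounding rectangles and each data point lives in exactly one bucket (no redundancy). The names `min_dist_to_box` and `cd_search_cost` are hypothetical and chosen for illustration only.

```python
import numpy as np

def min_dist_to_box(q, lo, hi):
    """Euclidean distance from q to the axis-aligned box [lo, hi] (0 if q is inside)."""
    nearest = np.clip(q, lo, hi)     # closest point of the box to q
    return np.linalg.norm(q - nearest)

def cd_search_cost(data, bucket_ids, q):
    """Lower bound on the number of data points a CD search retrieves.

    Assumptions (not from the slides): descriptions are minimum bounding
    rectangles, and bucket_ids[i] names the single bucket holding data[i].
    """
    data = np.asarray(data, dtype=float)
    bucket_ids = np.asarray(bucket_ids)
    q = np.asarray(q, dtype=float)
    nn_dist = np.min(np.linalg.norm(data - q, axis=1))
    cost = 0
    for b in np.unique(bucket_ids):
        pts = data[bucket_ids == b]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        # "<=" so that the bucket holding the nearest neighbor is counted too.
        if min_dist_to_box(q, lo, hi) <= nn_dist:
            cost += len(pts)         # access is all or nothing
    return cost
```

With two well-separated buckets and a query inside the first one, only the first bucket's points are counted; the far bucket's region is farther away than the nearest neighbor and is skipped.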

Example: R-Tree

• Buckets are disk pages. Under normal construction buckets contain more than two data points each.

• Bucket descriptions are convex and contain all data points (Bounding Rectangles).

• Search algorithm accesses all buckets whose convex region is closer than the nearest neighbor (and probably a few more).

Convex Description Indexes

• All R-Tree variants
• X-Tree
• M-Tree
• kdb-Tree
• SS-Tree and SR-Tree
• Many more

Other indexes (non-CD)

• Probability structures (P-Tree, VLDB 2000)
– Access based on clusters. A near enough bucket may not be accessed
• Projection index (like VA-file)
– Compression structures.
– All data points accessed in pieces, not in buckets.


Indexing Theorem

• If:
– Scalability experiment uses a series of workloads with Vanishing Variance
– The distance metric is Euclidean
– The indexing structure is Convex Description
• Then:
– The expected cost of a query converges to the number of data points
– I.e., a linear scan of the data
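The theorem can be illustrated with a small scalability experiment. This is a sketch under stated assumptions, not a reproduction of any published experiment: the data is uniform on the unit cube, the bucketing (sort on the first coordinate, cut into fixed-size pages) is a hypothetical stand-in for a real index, and descriptions are bounding rectangles.

```python
import numpy as np

def fraction_retrieved(m, n=1024, bucket_size=32, seed=0):
    """Fraction of data points a CD search must retrieve in m dimensions.

    Assumptions: uniform data and query on [0, 1]^m, n divisible by
    bucket_size, and a toy bucketing that sorts on the first coordinate.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((n, m))
    q = rng.random(m)
    # Toy bucketing: sort by first coordinate, cut into equal-size pages.
    order = np.argsort(data[:, 0])
    buckets = data[order].reshape(n // bucket_size, bucket_size, m)
    # Bounding rectangle (convex description) of every bucket.
    lo, hi = buckets.min(axis=1), buckets.max(axis=1)
    nn_dist = np.linalg.norm(data - q, axis=1).min()
    # Distance from q to each rectangle; np.clip broadcasts q over buckets.
    box_dist = np.linalg.norm(q - np.clip(q, lo, hi), axis=1)
    accessed = box_dist <= nn_dist
    return accessed.sum() * bucket_size / n

if __name__ == "__main__":
    for m in (2, 5, 10, 20, 50):
        print(m, fraction_retrieved(m))
```

Under these assumptions the retrieved fraction is small for m = 2 and approaches 1 (a linear scan) as m grows, in line with the theorem's conclusion.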

Sketch of Proof

• Because of Vanishing Variance, the ratio of distances between various query and data points becomes arbitrarily close to 1.

• When using Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket and create a triangle:

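[Figure: a bucket containing data points D1 and D2 and a point Y; the query point Q forms a triangle with D1 and D2.]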

Distances from Q to D1, D2, …, Dn are about the same.

Distance of Q to Y is much smaller

Therefore, distance of Q to data bucket is less than distance to nearest neighbor.
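The triangle step can be made concrete with a short calculation. As an illustrative assumption, take Y to be the midpoint of the segment D1D2 (it lies in the bucket's region by convexity), and assume that the distance between D1 and D2 also concentrates around the same value d as the query distances, which is plausible because query and data points share one distribution. The median-length formula for a triangle then gives:

$$|QY|^2 = \tfrac{1}{2}|QD_1|^2 + \tfrac{1}{2}|QD_2|^2 - \tfrac{1}{4}|D_1D_2|^2 \approx \tfrac{1}{2}d^2 + \tfrac{1}{2}d^2 - \tfrac{1}{4}d^2 = \tfrac{3}{4}d^2$$

$$|QY| \approx \tfrac{\sqrt{3}}{2}\, d \approx 0.866\, d$$

Under these assumptions, once the nearest neighbor is farther than about 0.866 d from Q (which Vanishing Variance forces for large m, since all distances approach d), the bucket's convex region is closer to Q than the nearest neighbor and the bucket must be accessed.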


Conclusion

• Dozens of papers describe experiments about index scalability with increased dimensions.

• We wanted to investigate these claims mathematically – supply a proof of scalability or non-scalability.

• We proved that the index structures in many of these experiments do not scale with dimensionality.

Conclusion

• Use this theorem to channel indexing research into more useful and practical avenues

• Review previous results accordingly.

Future Work

• Remove restriction of at least two data points in bucket.
– Easy exercise, need to take into account the cost of traversing a hierarchical data structure.

• Investigate other Lp metrics
• Investigate projection indexes using Euclidean metric (looks like they do not scale either)

• Find scalable indexing structure for Uniform data and L metric
– Hint: use compression
• Find number of data points needed for R-Tree to be practical on uniform data, L2 metric.
– Approx:

Future Work

mFn 3

Questions
