When Is Nearest Neighbors Indexable? Uri Shaft (Oracle Corp.) Raghu Ramakrishnan (UW- Madison)

When Is Nearest Neighbors Indexable?

Uri Shaft (Oracle Corp.)
Raghu Ramakrishnan (UW-Madison)

Motivation - Scalability Experiments

• Dozens of papers describe experiments about index scalability with increased dimensions.
  – Constants are:
    • Number of data points
    • Data and query distribution
    • Index structure / search algorithm
  – Variable:
    • Number of dimensions
  – Measurement:
    • Performance of index

Example From PODS 1997

Motivation

• In many cases the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality

• We would like to investigate these claims mathematically – supply a proof of scalability or non-scalability.

Historical Context

• Continues work done in “When Is Nearest Neighbors Meaningful?” (ICDT 1999)

• Previous work about behavior of distance distributions.

• This work about behavior of indexing structures under similar conditions.

Contents

• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
  – The performance of CD index does not scale for VV workloads using Euclidean distances.
• Conclusion
• Future Work

Vanishing Variance

• Same definition used in ICDT 99 work (although not named in that work)

• In 1999 we showed that the workloads become meaningless – ratios of distances between the query and various data points become arbitrarily close to 1.

• We use the same result here.

Vanishing Variance

• A scalability experiment contains a series of workloads W1, W2, …, Wm, …
  – m is the number of dimensions
  – each workload Wm has n data points and a query point (same distribution)
  – Distance distribution marked as Dm
• Vanishing Variance:

$$\lim_{m \to \infty} \operatorname{var}\left( \frac{D_m}{E(D_m)} \right) = 0$$
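The Vanishing Variance condition can be observed numerically. The sketch below is only an illustration under an assumed workload: data and query points drawn i.i.d. uniform on the unit cube, with Euclidean distance. The function name `relative_variance` and its parameters are made up for this example and are not from the talk.

```python
import math
import random
import statistics


def relative_variance(m, samples=2000, seed=0):
    """Estimate var(D_m / E(D_m)) for Euclidean distance in m dimensions.

    Assumption (not from the slides): query and data points are
    i.i.d. uniform on [0, 1]^m.
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(samples):
        query = [rng.random() for _ in range(m)]
        point = [rng.random() for _ in range(m)]
        distances.append(math.dist(query, point))
    mean = statistics.fmean(distances)
    # Variance of the distance after normalizing by its expected value.
    return statistics.pvariance([d / mean for d in distances])


if __name__ == "__main__":
    for m in (1, 4, 16, 64, 256):
        print(m, relative_variance(m))
```

For this workload the estimate shrinks as m grows, which is the behavior the definition asks for.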

Convex Description Index

• Data points are distributed to buckets (e.g. disk pages). Access to a bucket is “all or nothing”. We allow redundancy. A bucket contains at least two data points.

• Each bucket is associated with a description – a convex region containing all data points in the bucket.

• Search algorithm accesses at least all buckets whose convex region is closer than the nearest neighbor.

• Cost of search is the number of data points retrieved.
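The cost model above can be sketched in a few lines. This is a minimal illustration, assuming axis-aligned bounding rectangles as the convex descriptions and a search that reads exactly the buckets it is forced to read. The names `bounding_box`, `min_dist` and `search_cost` are hypothetical.

```python
import math


def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bounding_box(points):
    """Smallest axis-aligned rectangle containing all points (one possible convex description)."""
    dims = range(len(points[0]))
    lo = [min(p[i] for p in points) for i in dims]
    hi = [max(p[i] for p in points) for i in dims]
    return lo, hi


def min_dist(query, box):
    """Euclidean distance from the query to the closest point of the rectangle."""
    lo, hi = box
    return math.sqrt(sum(max(l - q, 0.0, q - h) ** 2 for q, l, h in zip(query, lo, hi)))


def search_cost(query, buckets):
    """Number of data points retrieved for one nearest-neighbor query.

    A bucket is read in full ("all or nothing") when its description is at
    least as close as the nearest neighbor. Using <= rather than < keeps the
    bucket that holds the nearest neighbor itself.
    """
    nn_dist = min(euclid(query, p) for bucket in buckets for p in bucket)
    return sum(len(b) for b in buckets if min_dist(query, bounding_box(b)) <= nn_dist)


if __name__ == "__main__":
    near = [(0.0, 0.0), (1.0, 1.0)]
    far = [(10.0, 10.0), (11.0, 11.0), (12.0, 10.0)]
    overlapping = [(0.0, 2.0), (2.0, 0.0)]
    print(search_cost((0.5, 0.4), [near, far]))          # only the near bucket is read
    print(search_cost((0.5, 0.4), [near, overlapping]))  # both descriptions contain the query
```

When two descriptions both contain the query, both buckets are read even though only one holds the nearest neighbor; this is the effect the theorem later exploits.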

Example: R-Tree

• Buckets are disk pages. Under normal construction buckets contain more than two data points each.

• Bucket descriptions are convex and contain all data points (Bounding Rectangles).

• Search algorithm accesses all buckets whose convex region is closer than the nearest neighbor (and probably a few more).

Convex Description Indexes

• All R-Tree variants
• X-Tree
• M-Tree
• kdb-Tree
• SS-Tree and SR-Tree
• Many more

Other indexes (non-CD)

• Probability structures (P-Tree, VLDB 2000)
  – Access based on clusters. A near enough bucket may not be accessed.
• Projection index (like VA-file)
  – Compression structures.
  – All data points accessed in pieces, not in buckets.

Indexing Theorem

• If:
  – Scalability experiment uses a series of workloads with Vanishing Variance
  – The distance metric is Euclidean
  – The indexing structure is Convex Description
• Then:
  – The expected cost of a query converges to the number of data points
  – I.e., a linear scan of the data
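A small experiment can illustrate the statement. It is a sketch under assumptions that are not from the talk: uniform i.i.d. data on the unit cube, bounding rectangles as descriptions, and a hypothetical packing that sorts points on the first coordinate and cuts them into fixed-size buckets. It measures the fraction of data points a minimal CD search must retrieve.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bounding_box(points):
    """Axis-aligned bounding rectangle of a bucket (its convex description)."""
    dims = range(len(points[0]))
    return ([min(p[i] for p in points) for i in dims],
            [max(p[i] for p in points) for i in dims])


def min_dist(query, box):
    """Euclidean distance from the query to the rectangle."""
    lo, hi = box
    return math.sqrt(sum(max(l - q, 0.0, q - h) ** 2 for q, l, h in zip(query, lo, hi)))


def retrieved_fraction(m, n=200, bucket_size=10, queries=20, seed=0):
    """Average fraction of the n data points read per query in m dimensions."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(m)] for _ in range(n)]
    # Hypothetical packing: slabs along the first coordinate.
    data.sort(key=lambda p: p[0])
    buckets = [data[i:i + bucket_size] for i in range(0, n, bucket_size)]
    boxes = [bounding_box(b) for b in buckets]
    retrieved = 0
    for _ in range(queries):
        query = [rng.random() for _ in range(m)]
        nn_dist = min(euclid(query, p) for p in data)
        # Every bucket whose description is within the nearest-neighbor
        # distance must be read in full.
        retrieved += sum(len(b) for b, box in zip(buckets, boxes)
                         if min_dist(query, box) <= nn_dist)
    return retrieved / (queries * n)


if __name__ == "__main__":
    for m in (2, 8, 32, 128):
        print(m, retrieved_fraction(m))
```

Under these assumptions the fraction grows toward 1 with the number of dimensions, i.e. toward a linear scan. Other packings change the low-dimensional numbers, not the trend the theorem predicts.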

Sketch of Proof

• Because of Vanishing Variance, the ratio of distances between various query and data points becomes arbitrarily close to 1.

• When using Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket and create a triangle:

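[Figure: triangle formed by the query point Q and two data points D1, D2 from the same bucket; Y is a point between D1 and D2, inside the bucket's convex region.]

• Distances from Q to D1, D2, …, Dn are about the same.

• Distance from Q to Y is much smaller.

• Therefore, the distance from Q to the data bucket is less than the distance to the nearest neighbor.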
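The gap between the two distances can be made concrete with the median-length formula for a triangle. This is a worked sketch under two assumptions not stated on the slide: Y is taken to be the midpoint of the side D1D2, and the distance between D1 and D2 concentrates around the same value d as the query distances (plausible because the query is drawn from the same distribution as the data).

```latex
% Assumptions: Y is the midpoint of D_1 D_2, and d is the common approximate
% value of |QD_1|, |QD_2| and |D_1 D_2|.
\begin{align*}
|QY|^2 &= \tfrac{1}{2}|QD_1|^2 + \tfrac{1}{2}|QD_2|^2 - \tfrac{1}{4}|D_1D_2|^2 \\
       &\approx \tfrac{1}{2}d^2 + \tfrac{1}{2}d^2 - \tfrac{1}{4}d^2 = \tfrac{3}{4}d^2, \\
|QY|   &\approx \tfrac{\sqrt{3}}{2}\,d \approx 0.87\,d < d .
\end{align*}
```

Since the description is convex and contains D1 and D2, it also contains Y, so the bucket lies within about 0.87 d of Q while the nearest neighbor is at distance about d. The bucket must therefore be accessed.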

Conclusion

• Dozens of papers describe experiments about index scalability with increased dimensions.

• We wanted to investigate these claims mathematically – supply a proof of scalability or non-scalability.

• We proved that many of these experiments do not scale in dimensionality.

Conclusion

• Use this theorem to channel indexing research into more useful and practical avenues.

• Review previous results accordingly.

Future Work

• Remove restriction of at least two data points in bucket.
  – Easy exercise, need to take into account the cost of traversing a hierarchical data structure.
• Investigate other Lp metrics
• Investigate projection indexes using Euclidean metric (looks like they do not scale either)

Future Work

• Find scalable indexing structure for Uniform data and L metric
  – Hint: use compression
• Find number of data points needed for R-Tree to be practical on uniform data, L2 metric.
  – Approx:

mFn 3

Questions