
Fast Nonparametric Machine Learning Algorithms for High-dimensional Massive Data and Applications

Ting Liu
Carnegie Mellon University
February 2005
Ph.D. Thesis Proposal

Ting Liu, CMU 2

Thesis Committee

• Andrew Moore (Chair)

• Martial Hebert

• Jeff Schneider

• Trevor Darrell (MIT)


Thesis Proposal

Goal: to make nonparametric methods tractable for high-dimensional, massive datasets

• Nonparametric methods:
  – K-nearest-neighbor (K-NN)
  – Kernel density estimation
  – SVM evaluation phase
  – and more…

Why K-NN?

• It is simple
  – goes back to as early as [Fix-Hodges 1951]
  – [Cover-Hart 1967] justifies k-NN theoretically
• It is easy to implement
  – sanity check for other (more complicated) algorithms
  – similar insights for other nonparametric algorithms
• It is useful in many applications
  – text categorization
  – drug activity detection
  – multimedia, computer vision
  – and more…


Application: Video Segmentation

Task: Shot transition detection

• Cut

• Gradual transition (fades, dissolves …)


Technically [Qi-Hauptmann-Liu 2003]

Pipeline: video frames → color histogram → pair-wise similarity features → classification (normal: 0, cut: 1, gradual: 2)

Data: 4 hours of MPEG-1 video (420,970 frames)

K-NN: good performance, but very slow. We want a fast k-NN classification method.


Application: Near-duplicate Detection and Sub-image Retrieval

Copyrighted Image Database


Algorithm Overview [Yan-Rahul 2004]

Store (train): 12,100 copyrighted images, 1000 patches each → 12,100,000 patches; transformation: DoG + PCA-SIFT; each patch: 36-dim.

Search (query): each query image yields 1000 patches → 1000 k-NN searches per query.

We want a fast k-NN search method.


K-NN Methods

• Exact K-NN
  – K-NN search (slow): Naïve; spatial trees: SR-tree, Kd-tree, Metric-tree
  – K-NN classification (my work): KNS2 (2-class), KNS3 (2-class), IOC (multi-class)
• Approximate K-NN
  – Random sample, PCA, LSH
  – Spill-tree (my work)


Problems with Exact K-NN Search: Efficiency

• Slow with huge datasets in high dimensions
• Complexity of algorithms
  – Naïve (linear scan): O(dN) per query
  – Advanced: O(d log N) ~ O(dN), using a spatial data structure to avoid searching all points
    • SR-tree [Katayama-Satoh 1997]
    • Kd-tree [Friedman-Bentley-Finkel 1977]
    • Metric-tree (ball-tree) [Uhlmann 1991, Omohundro 1991]
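The naïve O(dN) linear scan is trivial to state; a minimal numpy sketch (function name and toy data are mine, purely for illustration):

```python
import numpy as np

def naive_knn(data, q, k):
    """Linear-scan k-NN: O(dN) distance computations per query."""
    dists = np.linalg.norm(data - q, axis=1)  # all N distances
    return np.argsort(dists)[:k]              # indices of the k nearest

# toy data: three points in R^2
data = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
nearest = naive_knn(data, np.array([0.1, 0.0]), k=2)  # indices 0 and 1
```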


Metric-tree: an Example

(Figure: a set of points in R²)


Build a metric-tree [Uhlmann 1991, Omohundro 1991]

(Figure: points split by a plane L between two pivots p1 and p2)


Metric-tree Data Structure [Uhlmann 1991, Omohundro 1991]

(Figure: a metric-tree and its internal data structure, with pivots p1 and p2)


Metric-tree: the Triangle Inequality

• Let q be any query point
• Let x be a point inside ball B, with center p and radius r

Then ||x − q|| ≥ ||q − p|| − r and ||x − q|| ≤ ||q − p|| + r.
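The triangle-inequality bounds on the distance from a query to any point inside a ball are easy to sanity-check numerically (the ball center, radius, and query below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = np.array([2.0, 1.0]), 1.5          # ball B: center p, radius r
q = np.array([6.0, -1.0])                 # query point

for _ in range(1000):
    # sample a point x inside B (random direction, random distance <= r)
    v = rng.normal(size=2)
    x = p + v / np.linalg.norm(v) * rng.uniform(0.0, r)
    d = np.linalg.norm(q - x)
    # ||q - p|| - r <= ||q - x|| <= ||q - p|| + r
    assert np.linalg.norm(q - p) - r <= d <= np.linalg.norm(q - p) + r
```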


Metric-tree Based K-NN Search

• Depth-first search
• Pruning using the triangle inequality
• Significant speed-up when d is small: O(d log N)
• Little speed-up when d is large: O(dN)
• "Curse of dimensionality"
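The depth-first search with triangle-inequality pruning can be sketched compactly. This is an illustrative toy (leaf size, pivot rule, and all names are my choices, not the thesis implementation):

```python
import numpy as np

class Ball:
    def __init__(self, pts):
        self.pts = pts
        self.center = pts.mean(axis=0)
        self.radius = np.max(np.linalg.norm(pts - self.center, axis=1))
        self.left = self.right = None

def build(pts, leaf_size=2):
    node = Ball(pts)
    if len(pts) <= leaf_size:
        return node
    # pivots: p1 far from an arbitrary point, p2 far from p1
    p1 = pts[np.argmax(np.linalg.norm(pts - pts[0], axis=1))]
    p2 = pts[np.argmax(np.linalg.norm(pts - p1, axis=1))]
    mask = np.linalg.norm(pts - p1, axis=1) <= np.linalg.norm(pts - p2, axis=1)
    node.left, node.right = build(pts[mask], leaf_size), build(pts[~mask], leaf_size)
    return node

def nn_search(node, q, best=None):
    if best is None:
        best = [np.inf, None]            # [best distance so far, best point]
    # pruning: nothing inside this ball can beat the current best
    if np.linalg.norm(q - node.center) - node.radius >= best[0]:
        return best
    if node.left is None:                # leaf: scan its points
        for x in node.pts:
            d = np.linalg.norm(q - x)
            if d < best[0]:
                best[0], best[1] = d, x
        return best
    # depth-first: visit the closer child first, then backtrack into the other
    kids = sorted([node.left, node.right],
                  key=lambda c: np.linalg.norm(q - c.center))
    for c in kids:
        nn_search(c, q, best)
    return best
```

The search remains exact: the pruning test only skips a ball when the triangle inequality guarantees it cannot contain a closer point.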



My Work (part 1): Fast K-NN Classification Based on Metric-tree

Idea: do classification without finding the k-NNs

• KNS2: fast k-NN classification for skewed 2-class data
• KNS3: fast k-NN classification for 2-class data
• IOC: fast k-NN classification for multi-class data


KNS2: Fast K-NN Classification for Skewed 2-class Data

Assumptions:

(1) 2 classes: pos. / neg.

(2) pos. class much less frequent than neg. class

Example: video segmentation

(~10,000 shot transitions, ~400,000 normal frames)

Q: How many of the k-NN are from pos. class?


How Many of the K-NN are From the pos. Class?

• Step 1 --- Find positive: find the k closest pos. points

Example: k = 3. di: distance of the i-th closest pos. point to q.

Fewer pos. points → easy to compute


• Step 2 --- Count negative

ci: number of neg. points within di

Example: k = 3; c1 = 1, c2 = 5, c3 = 8


• Step 2 --- Lowerbound negative

Idea: lowerbound each ci instead of computing it exactly

Example: k = 3. Is c1 ≥ 3? Is c2 ≥ 2? Is c3 ≥ 1?



• Step 2 --- Estimate negative

Visit node A (20 neg. points). A yields no useful bound yet: c1 ≥ 0, c2 ≥ 0, c3 ≥ 0.


Visit nodes B (12 neg. points) and C (8 neg. points). B lies entirely within d3, so: c1 ≥ 0, c2 ≥ 0, c3 ≥ 12.


Visit nodes D (5 neg. points) and E (7 neg. points). D lies within d2, so: c1 ≥ 0, c2 ≥ 5, c3 ≥ 12.


Visit node F (4 neg. points). F lies within d1, so: c1 ≥ 4, c2 ≥ 5, c3 ≥ 12.

Now c1 ≥ 3, c2 ≥ 2, and c3 ≥ 1 all hold. We are done! Return 0.

KNS2: the Algorithm

Build two metric-trees (Pos_tree / Neg_tree)
Search Pos_tree, find the k pos. NNs
Search Neg_tree:
repeat
    pick a node from Neg_tree
    refine C = {c1, c2, …, ck}
    if ci ≥ k − i + 1, remove ci from C
end repeat
Let k' = size(C) after the search
return k'
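The final counting rule is only a few lines. This sketch (function name mine) implements just the test at the end, assuming the lower bounds ci have already been refined by the Neg_tree search:

```python
def kns2_count(c_lower, k):
    """Given lower bounds c_lower[i-1] <= c_i, where c_i is the number of
    neg. points within d_i, return k' = number of indices i with
    c_i < k - i + 1.  An index with c_i >= k - i + 1 is ruled out: the i-th
    closest positive cannot be among the k nearest neighbors."""
    return sum(1 for i, c in enumerate(c_lower, start=1) if c < k - i + 1)

# the slides' running example (k = 3): final bounds c1 >= 4, c2 >= 5, c3 >= 12
answer = kns2_count([4, 5, 12], k=3)  # no positives among the 3-NN
```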


Experimental Results (KNS2)

Dataset     Dimension (d)   Data Size (N)
ds1         10              26,733
Letter      16              20,000
Video       45              420,970
J_Lee       100             181,395
Blanc_Mel   100             186,414
ds2         1.1×10⁶         88,358

CPU Time Speedup Over Naïve K-NN (k = 9)

KNS2: 3x – 60x speed-up over naïve

(Figure: speedups of metric-tree and KNS2 over naïve k-NN on ds1 (d=10), Letter (d=16), Video (d=45), J_Lee (d=100), Blanc_Mel (d=100), and ds2 (d=1.1M))


My Work (Part 2): a New Metric-tree Based Approximate NN Search

--- spill-tree
--- "I'm Feeling Lucky" search

Why is Metric-tree Slow?

Empirically…
• 10% of the time is spent finding the NN
• 90% of the time is spent backtracking

"I'm Feeling Lucky" Search

• Algorithm: simple
  – descends a metric-tree without backtracking
  – returns the first point hit in a leaf node
• Complexity: super fast --- O(log N) per query
• Accuracy: quite low --- liable to make mistakes when q is near the decision boundary
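A defeatist descent of this kind is a handful of lines. The hand-built two-leaf tree below is purely illustrative, and the second query deliberately shows the boundary mistake:

```python
import numpy as np

def lucky_search(node, q):
    """Descend without backtracking; at a leaf, return the closest stored point."""
    while isinstance(node, tuple):    # internal node: (direction u, threshold m, left, right)
        u, m, left, right = node
        node = left if q @ u <= m else right
    pts = np.asarray(node)
    return pts[np.argmin(np.linalg.norm(pts - q, axis=1))]

# tiny hand-made tree splitting R^2 at x = 1
tree = (np.array([1.0, 0.0]), 1.0,
        [[0.0, 0.0], [0.5, 0.0]],     # left leaf
        [[2.0, 0.0], [3.0, 0.0]])     # right leaf

near = lucky_search(tree, np.array([0.4, 0.0]))  # finds the true NN, [0.5, 0.0]
edge = lucky_search(tree, np.array([1.1, 0.0]))  # returns [2.0, 0.0],
                                                 # though the true NN is [0.5, 0.0]
```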


Spill-tree: adding redundancy to help "I'm Feeling Lucky" search

Spill-tree

• A variant of metric-tree

• The children of a node can “spill over” onto each other, and contain shared data-points

Ting Liu, CMU 39

A Spill-tree Data Structure

(Figure: splitting plane L, with planes LL and LR on either side; the region between LL and LR is the overlapping buffer, whose width is the overlapping buffer size)

• Metric-tree: each child only owns points to one side of L
• Spill-tree: both children own the points between LL and LR
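One level of an overlapping split can be sketched as follows (the pivot choice and the buffer parameter tau are illustrative choices of mine):

```python
import numpy as np

def spill_split(points, tau):
    """Split points by a midplane between two far-apart pivots; the children
    share every point within distance tau of the plane (the overlapping buffer)."""
    p1 = points[np.argmax(np.linalg.norm(points - points[0], axis=1))]
    p2 = points[np.argmax(np.linalg.norm(points - p1, axis=1))]
    u = (p2 - p1) / np.linalg.norm(p2 - p1)   # split direction
    proj = (points - (p1 + p2) / 2) @ u       # signed distance to midplane L
    left = points[proj <= tau]                # child 1 owns everything up to LR
    right = points[proj >= -tau]              # child 2 owns everything past LL
    return left, right

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
l, r = spill_split(pts, tau=0.6)   # the middle point lands in both children
```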

Ting Liu, CMU 40

A Spill-tree Data Structure (cont.)

Advantage of spill-tree:
• higher accuracy
• makes a mistake only when the true NN is far away

A Spill-tree Data Structure (cont.)

Problem with spill-tree: uncontrolled depth
• the depth is O(log N) only when the overlapping buffer size is small relative to the expected distance of a point to its NN; empirically, a small buffer suffices

Hybrid Spill-tree Search

• Balance threshold ρ = 70% (empirically): if either child of a node v contains more than ρ of the total points, then split v in the conventional (non-overlapping) way
• Overlapping node --- "I'm Feeling Lucky" search
• Non-overlapping node --- backtracking search

Further Efficiency Improvement by Random Projection

Intuition: random projection approximately preserves distance.
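This is the Johnson-Lindenstrauss effect; a quick numerical illustration (the dimensions, seed, and tolerance below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_new, n = 1000, 200, 50
X = rng.normal(size=(n, d))                       # n points in d dimensions
R = rng.normal(size=(d, d_new)) / np.sqrt(d_new)  # random Gaussian projection
Y = X @ R                                         # projected points, d_new dims

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
# the pairwise distance is preserved up to a small relative error
assert 0.7 < proj / orig < 1.3
```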

Ting Liu, CMU 44

Experiments for Spill-tree

Dataset      Num. Data (N)   Num. Dim (d)
Aerial       275,465         60
Corel_hist   20,000          64
Corel_uci    68,040          64
Disk         40,000          1024
Galaxy       40,000          3838

Comparison Methods

• Naïve k-NN

• Metric-tree

• Locality Sensitive Hashing (LSH)

• Spill-tree

Ting Liu, CMU 46

Spill-tree vs. Metric-tree

Measured by CPU time (s), spill-tree enjoys a 3.3x – 706x speed-up over metric-tree.

Spill-tree vs. LSH

Measured by CPU time (s), spill-tree enjoys a 2.5x – 31x speed-up over LSH.


My Contribution

• T. Liu, A. W. Moore, A. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions. NIPS 2003.
• Y. Qi, A. Hauptmann, T. Liu. Supervised Classification for Video Shot Segmentation. ICME 2003.
• T. Liu, K. Yang, A. W. Moore. The IOC Algorithm: Efficient Many-Class Non-parametric Classification for High-Dimensional Data. KDD 2004.
• T. Liu, A. W. Moore, A. Gray, K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. NIPS 2004.

Related Work

• [Uhlmann 1991, Omohundro 1991] propose the idea of metric-tree (ball-tree)
• [Omachi-Aso 1997] a similar idea to KNS2 for NN classification
• [Gionis-Indyk-Motwani 1999] a practical approximate NN method: LSH
• [Indyk 1998] approximate NN under the L∞ norm
• [Arya-Fu 2003] expected-case complexity of approximate NN searching
• [Yan-Rahul 2004] near-duplicate detection and sub-image retrieval

Future Work

• Improve my previous work
  – self-tuning spill-tree
  – theoretical analysis of spill-tree
• Explore new related areas
  – dual-tree
• Real-world applications

Future Work (1): Self-Tuning Spill-tree

Two key factors of a spill-tree:
• random projection dimension d'
• overlapping buffer size

Benefits of Automatic Parameter Tuning

• Avoid tedious hand-tuning

• Gain more insights into the approx. NN

Ting Liu, CMU 54

Future Work (2): Theoretical Analysis

• Spill-tree + "I'm Feeling Lucky" search
  – good performance in practice
  – no theoretical guarantee

Ting Liu, CMU 55

Idea: when the number of points is large enough, "I'm Feeling Lucky" search finds the true NN w.h.p.

Ting Liu, CMU 56

Idea: with an overlapping buffer, the probability of successfully finding the true NN can be increased.

Ting Liu, CMU 57

Future Work (3): Dual-Tree Search

• N-body problems [Gray-Moore 2001]
  – NN classification
  – kernel density estimation
  – outlier detection
  – two-point correlation
• These require pair-wise comparison of all N points
  – naïve solution: O(N²)
  – advanced solutions based on metric-trees
• Single-tree: only build a tree on the training data
• Dual-tree: build trees on both the training and the query data

Metric-tree: the Triangle Inequality (node-to-node)

• Let q be a point inside query node Q
• Let x be a point inside training node B

(Figure: balls Q and B with points q and x)

Pruning Opportunity [Gray-Moore 2001]

A, B: nodes from the training set; Q: a node from the test set.

Prune A when Dmin(Q, A) > Dmax(Q, B): then no query point in Q can have a nearest neighbor in A. In the illustrated case (ball centers OQ, OA, OB), A can't be pruned. But this is too pessimistic!
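The node-to-node bounds Dmin and Dmax used by this pruning rule follow directly from the ball geometry; a minimal sketch (function names and the toy balls are mine):

```python
import numpy as np

def dmin(cq, rq, cb, rb):
    """Lower bound on d(q, x) over all q in ball(cq, rq), x in ball(cb, rb)."""
    return max(np.linalg.norm(cq - cb) - rq - rb, 0.0)

def dmax(cq, rq, cb, rb):
    """Upper bound on d(q, x) over the same balls."""
    return np.linalg.norm(cq - cb) + rq + rb

# Q at the origin (radius 1); A far away, B nearby: A can be pruned
Q = (np.array([0.0, 0.0]), 1.0)
A = (np.array([10.0, 0.0]), 1.0)
B = (np.array([3.0, 0.0]), 0.5)
assert dmin(*Q, *A) > dmax(*Q, *B)   # Dmin(Q,A) = 8 > Dmax(Q,B) = 4.5 → prune A
```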

Ting Liu, CMU 60

More Pruning Opportunity

Prune A for an individual query q ∈ Q when d(q, OA) − rA > d(q, OB) + rB. The boundary of this region is a hyperbola H determined by OA, OB, and rA + rB; A can be pruned in the illustrated case.

Challenge: to compute this efficiently.

Future Work (4): Applications

• Multimedia --- video segmentation– shot-based segmentation– story-based segmentation

• Image retrieval --- near-duplicate detection

• Computer vision --- object recognition

Ting Liu, CMU 62

Time Line

• Now – Apr. 2005: dual-tree (design and implementation); testing on real-world datasets
• May – Aug. 2005: improving the spill-tree algorithm; theoretical analysis
• Sept. – Dec. 2005: applications of the new k-NN algorithms
• Jan. – Mar. 2006: write up the final thesis

Thank you!

QUESTIONS?