
1

Budgeted Nonparametric Learning from Data Streams

Ryan Gomes and Andreas Krause
California Institute of Technology

Application Examples: Clustering Millions of Internet Images

Torralba et al. 80 Million tiny images. IEEE PAMI Nov. 2008

2

Application Examples: Nonlinear Regression in Embedded Systems

[Figure: actuator state vs. control input]

3

Data Streams

• Can’t access data set all at once
• Can’t control order of data access (random access may be available)

Charikar et al. Better streaming algorithms for clustering problems. STOC 2003

4

Data Streams

maximum wait until an element is revisited

elements available at iteration t
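The access model above can be illustrated with a minimal sketch. It assumes the stream is represented as a list of sets, one per iteration, holding the elements available at that iteration. The helper name `max_wait` and the round-robin stream are hypothetical and not taken from the slides.

```python
def max_wait(stream):
    """Longest gap (in iterations) between consecutive appearances of any element.

    `stream[t]` is the set of elements available at iteration t.
    Only gaps between observed appearances are counted.
    """
    last_seen = {}
    longest = 0
    for t, available in enumerate(stream):
        for element in available:
            if element in last_seen:
                longest = max(longest, t - last_seen[element])
            last_seen[element] = t
    return longest


# Hypothetical round-robin stream: 6 elements, 2 available per iteration.
stream = [{(2 * t) % 6, (2 * t + 1) % 6} for t in range(9)]
print(max_wait(stream))  # 3: each element comes back every third iteration
```

A small maximum wait means any element that was passed over can be reconsidered soon, which is the property the streaming setting relies on.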

5

Nonparametric Methods

• Highly flexible, use training examples to make predictions

• In streaming environment: select budget of K examples to do prediction
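As one illustration of predicting directly from a budget of K stored examples, here is a minimal sketch of kernel-weighted (Nadaraya-Watson) regression. The choice of this estimator, the Gaussian kernel, the `bandwidth` parameter, and the toy data are assumptions made for the example; the slides do not specify a particular predictor at this point.

```python
import math


def kernel_predict(x, examples, bandwidth=1.0):
    """Nadaraya-Watson estimate at x from stored (input, target) pairs."""
    weights = [
        math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2)) for xi, _ in examples
    ]
    total = sum(weights)
    return sum(w * yi for w, (_, yi) in zip(weights, examples)) / total


# A budget of K = 3 stored examples of the hypothetical target y = x^2.
budget = [(-1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
print(kernel_predict(0.0, budget, bandwidth=0.5))
print(kernel_predict(0.9, budget, bandwidth=0.5))
```

The prediction cost grows with the number of stored examples, which is why the choice of which K examples to keep matters.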

6

Problem Statement

• Active set at iteration t: […]
• Monotone utility function: $F(A) \le F(A')$ when $A \subseteq A'$
• Given a sequence of available elements, maintain active sets […], where the final active set satisfies: […]

7

Exemplar Based Clustering

8

Gaussian Process Regression

information gain

M. Seeger et al. Fast forward selection to speed up sparse gaussian process regression. (AISTATS 2003)
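The formula on this slide is not preserved in the text. As a hedged illustration, the sketch below uses the standard information gain of a Gaussian process with Gaussian observation noise, $\frac{1}{2}\log\det(I + \sigma^{-2} K_{AA})$, which may or may not match the slide's exact notation. The RBF kernel, the noise variance, and all function names are assumptions.

```python
import math


def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel on scalars (an assumed kernel choice)."""
    return math.exp(-((x - y) ** 2) / (2 * length_scale ** 2))


def cholesky(m):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def information_gain(points, noise_var=0.1):
    """0.5 * log det(I + K_AA / noise_var) for the selected points."""
    n = len(points)
    m = [
        [rbf(a, b) / noise_var + (1.0 if i == j else 0.0) for j, b in enumerate(points)]
        for i, a in enumerate(points)
    ]
    low = cholesky(m)
    # log det(M) = 2 * sum(log diag(L)), so half of it is the plain sum.
    return sum(math.log(low[i][i]) for i in range(n))


print(information_gain([0.0]))
print(information_gain([0.0, 0.5]))
```

Adding a point near one that is already selected raises the value less than adding the same point to an empty set, which is the diminishing-returns behavior discussed later in the talk.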

9

Gaussian Process Regression

expected variance reduction
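The slide's formula is missing from the text, so the following is only a sketch of one common way to compute a variance-reduction utility: the average drop in Gaussian process predictive variance at a set of target inputs after observing the selected inputs. The RBF kernel, the noise variance, and the names `solve` and `variance_reduction` are assumptions for illustration.

```python
import math


def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel on scalars (an assumed kernel choice)."""
    return math.exp(-((x - y) ** 2) / (2 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def variance_reduction(active, targets, noise_var=0.1):
    """Average drop in predictive variance at `targets` after observing `active`."""
    if not active:
        return 0.0
    gram = [
        [rbf(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(active)]
        for i, a in enumerate(active)
    ]
    total = 0.0
    for x in targets:
        k_x = [rbf(x, a) for a in active]
        z = solve(gram, k_x)
        total += sum(k * w for k, w in zip(k_x, z))  # k_x^T (K + noise I)^-1 k_x
    return total / len(targets)


targets = [0.0, 1.0, 2.0]
print(variance_reduction([0.0], targets))
print(variance_reduction([0.0, 2.0], targets))
```

Because the utility is an average over target inputs, it can be estimated from a subsample of them, which is relevant when the full data set cannot be held in memory.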

10

Submodularity

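If $A \subseteq A'$ and $e \notin A'$, then $F(A \cup \{e\}) - F(A) \ge F(A' \cup \{e\}) - F(A')$

$F_C$, $F_V$, and $F_H$ are all submodular! (“diminishing returns”)

[Figure: adding an element to the smaller set gives a greater change; adding it to the larger set gives a smaller change]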
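Diminishing returns can be checked numerically on a small example. The sketch below uses set coverage, a textbook monotone submodular function, rather than any of the utilities from the talk; the sets and names are made up for illustration.

```python
def coverage(selected, sets):
    """Number of distinct items covered by the chosen sets."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def marginal_gain(f, base, element):
    """Change in f when `element` is added to `base`."""
    return f(base | {element}) - f(base)


sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}


def f(chosen):
    return coverage(chosen, sets)


small = frozenset({"a"})
large = frozenset({"a", "c"})
print(marginal_gain(f, small, "b"))  # 1: "b" adds item 4 to the smaller set
print(marginal_gain(f, large, "b"))  # 0: the larger set already covers 3 and 4
```

The same element is worth more when added to the smaller set, which is exactly the inequality in the definition.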

11

StreamGreedy

Repeat:
1. […]
2. […]
3. […]
Until […] for […] consecutive iterations
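The individual steps of the algorithm are not preserved in this text. The following is therefore only a sketch of a swap-based streaming greedy loop in the same spirit, not a transcription of StreamGreedy itself. It assumes that each iteration considers exchanging one active element for one available element, keeps the exchange only if the utility strictly improves, and stops after `patience` consecutive iterations without improvement. The function name, the parameters, and the coverage utility in the demo are all hypothetical.

```python
def stream_swap_greedy(stream, utility, budget, patience):
    """Keep at most `budget` elements, applying the best improving change per iteration."""
    active = set()
    stale = 0
    for available in stream:
        best_set, best_value = active, utility(active)
        for candidate in available:
            if candidate in active:
                continue
            if len(active) < budget:
                trials = [active | {candidate}]  # room left: just add
            else:
                trials = [(active - {old}) | {candidate} for old in active]
            for trial in trials:
                value = utility(trial)
                if value > best_value:
                    best_set, best_value = trial, value
        if best_set is active:  # nothing improved this iteration
            stale += 1
            if stale >= patience:
                break
        else:
            active, stale = best_set, 0
    return active


# Demo with a coverage utility over made-up sets.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1}}


def utility(chosen):
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


stream = [{"d", "b"}, {"a"}, {"c"}, {"d", "b"}, {"a"}, {"c"}]
result = stream_swap_greedy(stream, utility, budget=2, patience=3)
print(sorted(result))  # ['a', 'c']
```

In the demo the loop first fills the budget with "b" and "a", swaps "b" for "c" when that raises coverage from 4 to 6, and then stops after three iterations with no improving change.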

12

Optimality of StreamGreedy

• Clustering-consistency
• $F_C$, $F_V$, and $F_H$ are clustering-consistent when data consists of very well-separated clusters
• Preferable to select exemplar from new cluster rather than two from same cluster
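The last point can be seen on a toy example. The sketch below assumes an exemplar-based loss equal to the mean squared distance from each point to its nearest exemplar; the exact utility used in the talk is not preserved here, and the one-dimensional data are made up.

```python
def clustering_loss(exemplars, points):
    """Mean squared distance from each point to its nearest exemplar."""
    return sum(min((p - e) ** 2 for e in exemplars) for p in points) / len(points)


# Two hypothetical, well-separated 1-D clusters.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]

same_cluster = clustering_loss([0.0, 0.2], points)      # both exemplars from one cluster
one_per_cluster = clustering_loss([0.1, 10.1], points)  # one exemplar per cluster
print(same_cluster, one_per_cluster)
```

With both exemplars in the same cluster, every point of the other cluster pays a squared distance of roughly 100, so covering a new cluster is far more valuable than refining one that is already covered.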

13

Optimality of StreamGreedy

Theorem: If F is monotonic, submodular, and clustering-consistent, then StreamGreedy finds […] after at most […] iterations.

14

Approximation Guarantee

Theorem: Assume F is monotonic submodular and further assume F is bounded by constant B. Then StreamGreedy finds […]

• Typically, data does not consist of well-separated clusters
• Maximizing F is NP-hard in general

15

Limited Stream Access

• Approximate […] and […] within […] accuracy
• Uniform subsample approximation (“validation set”)
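The idea of a uniform subsample approximation can be sketched as follows. The example assumes the quantity being approximated is an average over the data set (here a clustering loss, mean squared distance to the nearest exemplar), so that a uniformly drawn validation set gives an estimate of it. The synthetic data, the sample size of 500, and the seed are arbitrary choices.

```python
import random


def clustering_loss(exemplars, points):
    """Mean squared distance from each point to its nearest exemplar."""
    return sum(min((p - e) ** 2 for e in exemplars) for p in points) / len(points)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic 1-D data: two Gaussian clusters of 5000 points each.
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
data += [rng.gauss(8.0, 1.0) for _ in range(5000)]

validation = rng.sample(data, 500)  # uniform subsample, the "validation set"
exemplars = [0.0, 8.0]

full = clustering_loss(exemplars, data)
approx = clustering_loss(exemplars, validation)
print(full, approx)
```

The estimate only needs the validation set in memory, and its error shrinks as the validation set grows.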

16

Approximation Guarantee

Theorem: Assume F is monotonic submodular and may be evaluated to ε-precision. Further, assume F is bounded by constant B. Then StreamGreedy finds […]

• May only be able to approximately evaluate F

17

MNIST Convergence

• […] with […] distance
• Convergence rate comparable to online k-means
• Quantization performance difference due to exemplar constraint

18

Validation Set Size

[Figure panels: example-based centers; unconstrained centers]

• Good performance with small validation sets
• Larger validation set needed for larger number of clusters K

19

Tiny Images

[Figure panels: StreamGreedy; Online K-means]

• > 1.5 million 28 x 28 pixel RGB images
• Online K-means finds many singleton or empty clusters

20

Tiny Images

StreamGreedy Exemplars

21

Tiny Images

[Figure panels: Online k-means centers; StreamGreedy cluster examples (nearest to exemplar, randomly chosen)]

22

Run time vs. Accuracy

• Vary […] and […]
• StreamGreedy performance saturates with run time
• Outperforms Online K-means in less time

23

Gaussian Process Regression: Kin-40k dataset

[…] outperforms […] but requires sufficient validation set

24

Conclusions

• Flexible framework
• Theoretical performance guarantees
• Exemplar based clustering with non-metric similarities in streaming environment
• Leads to efficient algorithms
• Excellent empirical performance

StreamGreedy

25
