Sublinear Algorithms
Sloan Digital Sky Survey: 4 petabytes (~1MG); 10 petabytes/yr
Biomedical imaging: 150 petabytes/yr
Massive input → output
Sublinear algorithms: sample a tiny fraction of the input
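The simplest instance of the idea, as a Python sketch (the function name and constants are mine, purely for illustration): estimating a fraction by random sampling reads a number of items that does not depend on the input size.

```python
import random

def sample_fraction(data, predicate, s=1000):
    # Estimate the fraction of items satisfying `predicate` by reading
    # only s random items -- the number of queries is independent of
    # len(data), which is the hallmark of a sublinear algorithm.
    hits = sum(predicate(random.choice(data)) for _ in range(s))
    return hits / s

# Example: what fraction of a huge array is even?
big = list(range(10**6))
print(sample_fraction(big, lambda v: v % 2 == 0))  # ~0.5
```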
Approximate MST [CRT ’01]. Optimal!
Reduces to counting connected components:
E[estimate] = no. of connected components
var[estimate] << (no. of connected components)²
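The reduction in [CRT ’01] is concrete: with integer edge weights in {1, …, w}, the MST weight equals n − w + Σ_{i=1}^{w−1} c_i, where c_i is the number of connected components of the subgraph restricted to edges of weight ≤ i. Below is a minimal Python sketch of a sampling estimator for a component count; the function name, sample size s, and BFS truncation k are illustrative choices, not the paper's parameters.

```python
import random
from collections import deque

def estimate_components(adj, n, s=500, k=100):
    """Estimate the number of connected components of a graph (adjacency
    list `adj` on n vertices) while touching only a small part of it.
    Uses the identity (#components) = n * E_u[1/|component(u)|]."""
    total = 0.0
    for _ in range(s):
        u = random.randrange(n)
        seen = {u}
        queue = deque([u])
        # BFS truncated at k vertices keeps the work sublinear
        while queue and len(seen) < k:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if not queue:                  # component fully explored
            total += 1.0 / len(seen)
    return n * total / s
```

The estimator works because the terms 1/|component(u)| sum to exactly one per component; truncating the BFS at k vertices costs only a small additive bias, since each large component contributes at most 1/k.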
Shortest Paths [CLM ’03]
Ray Shooting, Volume, Intersection, Point Location [CLM ’03]
Low-entropy data
Takens embeddings, Markov models (speech)
Self-Improving Algorithms
Arbitrary, unknown random source
Sorting, Matching, MaxCut, All-pairs shortest paths, Transitive closure, Clustering
1. Run the algorithm for best worst-case behavior, or best under the uniform distribution, or best under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles to a stationary status: optimal expected complexity under the (still unknown) random source (see the sketch below).
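A schematic driver for the three phases, in Python. Everything here is hypothetical scaffolding (class and method names are mine, and the three hook methods are placeholders), meant only to show how the phases fit together, not any published algorithm.

```python
class SelfImprovingSolver:
    """Skeleton of the three-phase self-improving protocol."""

    def __init__(self, learning_rounds=1000):
        self.samples = []
        self.learning_rounds = learning_rounds
        self.tuned = None          # learned summary of the random source

    def solve(self, instance):
        if self.tuned is not None:                 # phase 3: stationary status
            return self._tuned_solve(instance)
        self.samples.append(instance)              # phase 2: learn the source
        if len(self.samples) >= self.learning_rounds:
            self.tuned = self._fit(self.samples)
        return self._worst_case_solve(instance)    # phase 1: safe default

    # Placeholder hooks, to be filled in per problem (sorting, clustering, ...)
    def _worst_case_solve(self, instance): ...
    def _fit(self, samples): ...
    def _tuned_solve(self, instance): ...
```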
Self-Improving AlgorithmsSelf-Improving AlgorithmsSelf-Improving AlgorithmsSelf-Improving Algorithms
E Tk E Tk Optimal Optimal expected time for expected time for random sourcerandom source
E Tk E Tk Optimal Optimal expected time for expected time for random sourcerandom source
time T1time T1time T1time T1
time T2time T2time T2time T2
time T5time T5time T5time T5
time T3time T3time T3time T3
time T4time T4time T4time T4
(x1, x2, … , xn) (x1, x2, … , xn) (x1, x2, … , xn) (x1, x2, … , xn)
Sorting
Each xi drawn independently from Di
H = entropy of the rank distribution
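To make the two phases tangible, here is a simplified Python sketch in the spirit of self-improving sorters, which achieve limiting expected time O(n + H). The function names and the plain binary search are my simplifications: the published algorithm replaces the generic search below with a search structure tuned to each coordinate's distribution Di.

```python
import bisect
import random

def learn_buckets(training_inputs, n):
    # Learning phase: pool elements seen in past inputs and keep a sorted
    # sample of n of them as bucket boundaries (a simplified "V-list").
    pool = [x for inp in training_inputs for x in inp]
    return sorted(random.sample(pool, min(len(pool), n)))

def tuned_sort(x, buckets):
    # Stationary phase: route each x_i to its bucket, sort the buckets,
    # and concatenate.  If the buckets match the source, buckets stay
    # O(1) in expectation and the total time approaches O(n + H).
    bins = [[] for _ in range(len(buckets) + 1)]
    for v in x:
        bins[bisect.bisect_left(buckets, v)].append(v)
    out = []
    for b in bins:
        b.sort()
        out.extend(b)
    return out
```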
Clustering: k-median (k = 2) [KSS]
Minimize the sum of distances over the Hamming cube {0,1}^d
How to achieve linear limiting expected time?
Input space {0,1}^{dn}
Identify the core
Tail: use KSS (prob < O(dn)/KSS)
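A quick sanity check on the arithmetic, writing T_KSS for the cost of one KSS run: the limiting expected time is at most O(dn) + Pr[tail] · T_KSS ≤ O(dn) + (O(dn)/T_KSS) · T_KSS = O(dn), so falling back to the expensive algorithm on the tail does not break linearity.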
How to achieve linear limiting expected time?
Store a sample of precomputed KSS solutions; nearest neighbor; incremental algorithm
NP vs P: input vicinity → algorithmic vicinity
Main difficulty: how to spot the tail?
1. Data is accessible before noise: encode → decode
2. Or it’s not?
Data inaccessible before noise
Assumptions are necessary!
Data inaccessible before noise:
1. Sorted sequence
2. Bipartite graph, expander
3. Solid w/ angular constraints
4. Low-dim attractor set
Data inaccessible before noise
The data must satisfy some property P, but does not quite.
[Diagram: query x → data → f(x); f(x) = ?]
But life being what it is…
[Diagram: query x → data → f(x); f = access function]
Humans
Define a distance from any object to the data class
[Diagram: query x → filter → g(x); the filter queries f at x1, x2, … and receives f(x1), f(x2), …]
g is the access function for: data satisfying the property
Similar to Self-Correction [RS ’96, BLR ’93]
except:
about data, not functions
error-free
allows O(distance to property)
Monotone function: [n] → R^d
Filter requires polylog(n) queries
Offline reconstruction
Online reconstruction
Monotone function
[Figure: a function on points 1–20 with values 0–700, shown before and after reconstruction]
Frequency of a point x:
smallest interval I containing > |I|/2 violations involving f(x)
Frequency of a point
Given x:
1. estimate its frequency
2. if nonzero, find the “smallest” interval around x with both endpoints having zero frequency
3. interpolate between f(endpoints) (sketched below)
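A minimal Python sketch of this three-step filter, assuming f is a callable access function on the domain {0, …, n−1}. The sampling constants, the doubling radii, and the linear walk to the zero-frequency endpoints are my simplifications of the polylog-query procedure, not the actual algorithm.

```python
import random

def violation(f, i, j):
    # Monotonicity requires f(i) <= f(j) whenever i < j.
    return f(i) > f(j)

def frequency_nonzero(f, x, n, samples=64):
    # Step 1 (Monte Carlo): look for an interval around x in which more
    # than half of the sampled partners violate with x.  Radii grow by
    # doubling, which is what keeps the query count polylogarithmic.
    r = 1
    while r < n:
        lo, hi = max(0, x - r), min(n - 1, x + r)
        bad = 0
        for _ in range(samples):
            y = random.randint(lo, hi)
            if y != x and violation(f, min(x, y), max(x, y)):
                bad += 1
        if bad > samples // 2:
            return True
        r *= 2
    return False

def g(f, x, n):
    # The filter: zero-frequency points are returned untouched; otherwise
    # walk out to zero-frequency endpoints (step 2) and interpolate
    # between their values (step 3).  The walk shown here is linear in
    # the bad stretch; the real filter locates the endpoints faster.
    if not frequency_nonzero(f, x, n):
        return f(x)
    lo = x
    while lo > 0 and frequency_nonzero(f, lo, n):
        lo -= 1
    hi = x
    while hi < n - 1 and frequency_nonzero(f, hi, n):
        hi += 1
    t = (x - lo) / max(1, hi - lo)
    return f(lo) + t * (f(hi) - f(lo))
```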
To prove:
1. Frequencies can be estimated in polylog time
2. Function is monotone over the zero-frequency domain
3. The zero-frequency domain occupies a (1 − 2ε) fraction
Bivariate concave function
Filter requires polylog(n) queries
bipartite graph
k-connectivity
expander
denoising low-dim attractor sets
Priced computation & accuracy
spectrometry / cloning / gene chip
PCR / hybridization / chromatography
gel electrophoresis / blotting
[Background: a long binary data string]
Linear programming: computation vs. experimentation
Pricing data
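The slide only names the ingredients; as a purely illustrative toy (every number and variable below is invented, not from the project), one could price computation against experimentation with a small linear program:

```python
# Toy model: buy units of computation and of lab experimentation to reach
# a required accuracy at minimum total price.  All numbers are made up.
from scipy.optimize import linprog

price = [1.0, 20.0]            # cost per unit: computation, experiment
gain = [[-0.01, -0.5]]         # accuracy gained per unit (as a <= row)
need = [-2.0]                  # i.e. 0.01*x + 0.5*y >= 2.0
res = linprog(c=price, A_ub=gain, b_ub=need, bounds=[(0, None)] * 2)
print(res.x)                   # cheapest mix of computation vs. experiment
```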
Ongoing project w/ Nir Ailon
Factoring is easy. Here’s why…
Gaussian mixture sample: 0010010100100110101010…
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu