Sublinear Algorithms
Sloan Digital Sky Survey: 4 petabytes (~1MG); 10 petabytes/yr
Biomedical imaging: 150 petabytes/yr
Massive input → output
Sublinear algorithms: sample a tiny fraction of the input
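The simplest instance of the idea, as a Python sketch (the function name and constants are mine, purely for illustration): estimating a fraction by random sampling reads a number of items that does not depend on the input size.

```python
import random

def sample_fraction(data, predicate, s=1000):
    # Estimate the fraction of items satisfying `predicate` by reading
    # only s random items -- the number of queries is independent of
    # len(data), which is the hallmark of a sublinear algorithm.
    hits = sum(predicate(random.choice(data)) for _ in range(s))
    return hits / s

# Example: what fraction of a huge array is even?
big = list(range(10**6))
print(sample_fraction(big, lambda v: v % 2 == 0))  # ~0.5
```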
Approximate MST [CRT ’01]. Optimal!
Reduces to counting connected components:
E[estimate] = no. of connected components
var[estimate] << (no. of connected components)²
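The reduction in [CRT ’01] is concrete: with integer edge weights in {1, …, w}, the MST weight equals n − w + Σ_{i=1}^{w−1} c_i, where c_i is the number of connected components of the subgraph restricted to edges of weight ≤ i. Below is a minimal Python sketch of a sampling estimator for a component count; the function name, sample size s, and BFS truncation k are illustrative choices, not the paper's parameters.

```python
import random
from collections import deque

def estimate_components(adj, n, s=500, k=100):
    """Estimate the number of connected components of a graph (adjacency
    list `adj` on n vertices) while touching only a small part of it.
    Uses the identity (#components) = n * E_u[1/|component(u)|]."""
    total = 0.0
    for _ in range(s):
        u = random.randrange(n)
        seen = {u}
        queue = deque([u])
        # BFS truncated at k vertices keeps the work sublinear
        while queue and len(seen) < k:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if not queue:                  # component fully explored
            total += 1.0 / len(seen)
    return n * total / s
```

The estimator works because the terms 1/|component(u)| sum to exactly one per component; truncating the BFS at k vertices costs only a small additive bias, since each large component contributes at most 1/k.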
Shortest Paths [CLM ’03]
Ray Shooting, Volume, Intersection, Point Location [CLM ’03]
Low-entropy data
Takens embeddings, Markov models (speech)
Self-Improving Algorithms
Arbitrary, unknown random source
Sorting, Matching, MaxCut, All-pairs shortest paths, Transitive closure, Clustering
1. Run the algorithm for best worst-case behavior, or best under the uniform distribution, or best under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles to a stationary status: optimal expected complexity under the (still unknown) random source (see the sketch below).
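A schematic driver for the three phases, in Python. Everything here is hypothetical scaffolding (class and method names are mine, and the three hook methods are placeholders), meant only to show how the phases fit together, not any published algorithm.

```python
class SelfImprovingSolver:
    """Skeleton of the three-phase self-improving protocol."""

    def __init__(self, learning_rounds=1000):
        self.samples = []
        self.learning_rounds = learning_rounds
        self.tuned = None          # learned summary of the random source

    def solve(self, instance):
        if self.tuned is not None:                 # phase 3: stationary status
            return self._tuned_solve(instance)
        self.samples.append(instance)              # phase 2: learn the source
        if len(self.samples) >= self.learning_rounds:
            self.tuned = self._fit(self.samples)
        return self._worst_case_solve(instance)    # phase 1: safe default

    # Placeholder hooks, to be filled in per problem (sorting, clustering, ...)
    def _worst_case_solve(self, instance): ...
    def _fit(self, samples): ...
    def _tuned_solve(self, instance): ...
```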
Self-Improving AlgorithmsSelf-Improving AlgorithmsSelf-Improving AlgorithmsSelf-Improving Algorithms
E Tk E Tk Optimal Optimal expected time for expected time for random sourcerandom source
E Tk E Tk Optimal Optimal expected time for expected time for random sourcerandom source
time T1time T1time T1time T1
time T2time T2time T2time T2
time T5time T5time T5time T5
time T3time T3time T3time T3
time T4time T4time T4time T4
(x1, x2, … , xn) (x1, x2, … , xn) (x1, x2, … , xn) (x1, x2, … , xn)
Sorting
Each xi drawn independently from Di
H = entropy of the rank distribution
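To make the two phases tangible, here is a simplified Python sketch in the spirit of self-improving sorters, which achieve limiting expected time O(n + H). The function names and the plain binary search are my simplifications: the published algorithm replaces the generic search below with a search structure tuned to each coordinate's distribution Di.

```python
import bisect
import random

def learn_buckets(training_inputs, n):
    # Learning phase: pool elements seen in past inputs and keep a sorted
    # sample of n of them as bucket boundaries (a simplified "V-list").
    pool = [x for inp in training_inputs for x in inp]
    return sorted(random.sample(pool, min(len(pool), n)))

def tuned_sort(x, buckets):
    # Stationary phase: route each x_i to its bucket, sort the buckets,
    # and concatenate.  If the buckets match the source, buckets stay
    # O(1) in expectation and the total time approaches O(n + H).
    bins = [[] for _ in range(len(buckets) + 1)]
    for v in x:
        bins[bisect.bisect_left(buckets, v)].append(v)
    out = []
    for b in bins:
        b.sort()
        out.extend(b)
    return out
```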
Clustering: k-median (k = 2) [KSS]
Minimize the sum of distances over the Hamming cube {0,1}^d
How to achieve linear limiting expected time?
Input space {0,1}^{dn}
Identify the core
Tail: use KSS (prob < O(dn)/KSS)
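A quick sanity check on the arithmetic, writing T_KSS for the cost of one KSS run: the limiting expected time is at most O(dn) + Pr[tail] · T_KSS ≤ O(dn) + (O(dn)/T_KSS) · T_KSS = O(dn), so falling back to the expensive algorithm on the tail does not break linearity.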
How to achieve linear limiting expected time?
Store a sample of precomputed KSS solutions; nearest neighbor; incremental algorithm
NP vs P: input vicinity → algorithmic vicinity
Main difficulty: how to spot the tail?
1. Data is accessible before noise: encode → decode
2. Or it’s not?
Data inaccessible before noise
Assumptions are necessary!
Data inaccessible before noise:
1. Sorted sequence
2. Bipartite graph, expander
3. Solid w/ angular constraints
4. Low-dim attractor set
Data inaccessible before noise
The data must satisfy some property P, but does not quite.
[Diagram: query x → data → f(x); f(x) = ?]
But life being what it is…
[Diagram: query x → data → f(x); f = access function]
Humans
Define a distance from any object to the data class
[Diagram: query x → filter → g(x); the filter queries f at x1, x2, … and receives f(x1), f(x2), …]
g is the access function for: data satisfying the property
Similar to Self-Correction [RS ’96, BLR ’93]
except:
about data, not functions
error-free
allows O(distance to property)
Monotone function: [n] → R^d
Filter requires polylog(n) queries
Offline reconstruction
Online reconstruction
Monotone function
[Figure: a function on points 1–20 with values 0–700, shown before and after reconstruction]
Frequency of a point x:
smallest interval I containing > |I|/2 violations involving f(x)
Frequency of a point
Given x:
1. estimate its frequency
2. if nonzero, find the “smallest” interval around x with both endpoints having zero frequency
3. interpolate between f(endpoints) (sketched below)
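A minimal Python sketch of this three-step filter, assuming f is a callable access function on the domain {0, …, n−1}. The sampling constants, the doubling radii, and the linear walk to the zero-frequency endpoints are my simplifications of the polylog-query procedure, not the actual algorithm.

```python
import random

def violation(f, i, j):
    # Monotonicity requires f(i) <= f(j) whenever i < j.
    return f(i) > f(j)

def frequency_nonzero(f, x, n, samples=64):
    # Step 1 (Monte Carlo): look for an interval around x in which more
    # than half of the sampled partners violate with x.  Radii grow by
    # doubling, which is what keeps the query count polylogarithmic.
    r = 1
    while r < n:
        lo, hi = max(0, x - r), min(n - 1, x + r)
        bad = 0
        for _ in range(samples):
            y = random.randint(lo, hi)
            if y != x and violation(f, min(x, y), max(x, y)):
                bad += 1
        if bad > samples // 2:
            return True
        r *= 2
    return False

def g(f, x, n):
    # The filter: zero-frequency points are returned untouched; otherwise
    # walk out to zero-frequency endpoints (step 2) and interpolate
    # between their values (step 3).  The walk shown here is linear in
    # the bad stretch; the real filter locates the endpoints faster.
    if not frequency_nonzero(f, x, n):
        return f(x)
    lo = x
    while lo > 0 and frequency_nonzero(f, lo, n):
        lo -= 1
    hi = x
    while hi < n - 1 and frequency_nonzero(f, hi, n):
        hi += 1
    t = (x - lo) / max(1, hi - lo)
    return f(lo) + t * (f(hi) - f(lo))
```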
To prove:
1. Frequencies can be estimated in polylog time
2. Function is monotone over the zero-frequency domain
3. The zero-frequency domain occupies a (1 − 2ε) fraction
Bivariate concave function
Filter requires polylog(n) queries
bipartite graph
k-connectivity
expander
denoising low-dim attractor sets
Priced computation & accuracy
spectrometry / cloning / gene chip
PCR / hybridization / chromatography
gel electrophoresis / blotting
[Background: a long binary data string]
Linear programming: computation vs. experimentation
Pricing data
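The slide only names the ingredients; as a purely illustrative toy (every number and variable below is invented, not from the project), one could price computation against experimentation with a small linear program:

```python
# Toy model: buy units of computation and of lab experimentation to reach
# a required accuracy at minimum total price.  All numbers are made up.
from scipy.optimize import linprog

price = [1.0, 20.0]            # cost per unit: computation, experiment
gain = [[-0.01, -0.5]]         # accuracy gained per unit (as a <= row)
need = [-2.0]                  # i.e. 0.01*x + 0.5*y >= 2.0
res = linprog(c=price, A_ub=gain, b_ub=need, bounds=[(0, None)] * 2)
print(res.x)                   # cheapest mix of computation vs. experiment
```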
Ongoing project w/ Nir Ailon
Factoring is easy. Here’s why…
Gaussian mixture sample: 0010010100100110101010…
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu