Sampling: an Algorithmic Perspective
Richard Peng, M.I.T.
TRANSCRIPT
OUTLINE
• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?
RANDOM SAMPLING
• Collection of many objects
• Pick a small subset of them
Goal:
• Estimate quantities
• Small approximations
• Use in algorithms
SAMPLING CAN APPROXIMATE
• Point sets
• Matrices
• Graphs
• Gradients
PRESERVING GRAPH STRUCTURES
Undirected graph, n vertices, m < n^2 edges
Is n^2 edges (dense) sometimes necessary?
For some information, e.g. connectivity: encoded by spanning forest, < n edges
Deterministic, O(m) time algorithm
MORE INTRICATE STRUCTURES
k-connectivity: # of disjoint paths between s and t (Menger’s theorem / maxflow-mincut)
Cut: # of edges leaving a subset of vertices
[Benczur-Karger `96]: for ANY G, can sample to get H with O(n log n) edges s.t. G ≈ H on all cuts
(≈: multiplicative approximation)
Stronger: weights of all 2^n cuts in graphs
MORE GENERAL: ROW SAMPLING
L2 row sampling: given A with m >> n, sample a few rows to form A’ s.t. ║Ax║_2 ≈ ║A’x║_2 ∀x
(figure: tall m×n matrix A reduced to A’ with ≈ n rows; e.g. row 0 -1 0 0 0 1 0 kept rescaled as 0 -5 0 0 0 5 0)
• ║Ax║_p: finite dimensional Banach space
• Sampling: embedding Banach spaces, e.g. [BLM `89], [Talagrand `90]
HOW TO SAMPLE?
Widely used: uniform sampling
Works well when data is uniform, e.g. complete graph
Problem: long path, removing any edge changes connectivity
(can also have both in one graph)
More systematic view of sampling?
SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE
[Spielman-Srivastava `08]: suffices to sample with probabilities at least O(log n) times weight times effective resistance
Effective resistance:
• commute time / m
• statistical leverage score in unweighted graphs
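A minimal NumPy sketch (our own toy example, not from the talk) of the identity above: the effective resistance of an edge equals the L2 leverage score of that edge's row in the incidence matrix.

```python
import numpy as np

# Toy graph (our choice): a triangle 0-1-2 plus a pendant edge 2-3, unweighted.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4

# Edge-vertex incidence matrix B: one row per edge, +1/-1 at the endpoints.
B = np.zeros((len(edges), n))
for i, (u, v) in enumerate(edges):
    B[i, u], B[i, v] = 1.0, -1.0

L = B.T @ B                 # graph Laplacian
Lpinv = np.linalg.pinv(L)   # pseudo-inverse (L is singular: all-ones null space)

# Effective resistance of edge (u,v) = (e_u - e_v)^T L^+ (e_u - e_v),
# which is exactly the L2 leverage score of that edge's row of B.
reff = np.array([B[i] @ Lpinv @ B[i] for i in range(len(edges))])
lev = np.diag(B @ Lpinv @ B.T)   # leverage scores: the same numbers

print(reff)        # triangle edges: 2/3 each; pendant edge: 1
print(reff.sum())  # Foster's theorem: total = n - 1 = 3 for a connected graph
```

The pendant edge has resistance 1 (it must be kept to preserve connectivity), while each triangle edge sees a parallel 2-edge path, giving resistance 2/3.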
L2 MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores, τ_i = b_i^T (B^T B)^{-1} b_i = ║b_i║_{L^{-1}}^2
[Rudelson, Vershynin `07], [Tropp `12]: sampling with p_i ≥ τ_i · O(log n) gives B’ s.t. ║Bx║_2 ≈ ║B’x║_2 ∀x w.h.p.
[Foster `49]: Σ_i τ_i = rank ≤ n, so O(n log n) rows
Near optimal:
• L2-row samples of B
• Graph sparsifiers
• In practice a constant (e.g. 5) in place of O(log n) usually suffices
• Can also improve via derandomization
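A toy NumPy sketch of this sampling scheme (our own illustration, with made-up sizes and oversampling constant, not the algorithms of the papers cited above): keep row i with probability p_i ≥ τ_i · O(log n), rescale by 1/√p_i, and check that ║Ax║_2 is roughly preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 10
A = rng.standard_normal((m, n))
A[0] *= 100.0   # one "important" row that uniform sampling would likely miss

# L2 leverage scores: tau_i = a_i^T (A^T A)^{-1} a_i; they sum to rank(A) = n.
tau = np.einsum('ij,jk,ik->i', A, np.linalg.inv(A.T @ A), A)

# Keep row i with probability p_i = min(1, tau_i * C log n); rescale kept
# rows by 1/sqrt(p_i) so ||A'x||_2^2 is an unbiased estimate of ||Ax||_2^2.
p = np.minimum(1.0, tau * 10 * np.log(n))
keep = rng.random(m) < p
Ap = A[keep] / np.sqrt(p[keep])[:, None]

x = rng.standard_normal(n)
ratio = np.linalg.norm(Ap @ x) / np.linalg.norm(A @ x)
print(Ap.shape[0], ratio)   # far fewer rows; ratio close to 1
```

The scaled row A[0] has leverage score near 1, so it is kept with probability 1, while the bulk of the near-uniform rows are discarded.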
THE `RIGHT’ PROBABILITIES
Column with only one non-zero entry: the row containing it (e.g. 0 0 1 0 0) has τ = 1 and must be kept; the remaining uniform rows get τ ≈ n/m each
Path + clique: path edges have τ ≈ 1, clique edges have τ ≈ 1/n
(τ: L2 statistical leverage scores, τ_i = b_i^T (B^T B)^{-1} b_i)
Any good upper bound on τ_i leads to size reductions
OUTLINE
• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?
ALGORITHMIC TEMPLATES
W-cycle: T(m) = 2T(m/2) + O(m)
Instances: • Sorting • FFT • Voronoi / Delaunay
V-cycle: T(m) = T(m/2) + O(m)
Instances: • Selection • Parallel indep. set • Routing
Difficulty:
• Many non-separable graphs exist
• Easy to compose hard instances
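The two recurrences above can be evaluated directly; a small sketch (with O(m) taken as exactly m and T(1) = 1, an illustrative choice of ours) showing the Θ(m log m) vs Θ(m) separation:

```python
# W-cycle vs V-cycle recurrences from the slide, with cost(m) = m, T(1) = 1.

def w_cycle(m):
    # W-cycle: T(m) = 2 T(m/2) + m  ->  Theta(m log m)
    return 1 if m <= 1 else 2 * w_cycle(m // 2) + m

def v_cycle(m):
    # V-cycle: T(m) = T(m/2) + m  ->  Theta(m), a geometric sum
    return 1 if m <= 1 else v_cycle(m // 2) + m

for m in [2**10, 2**14, 2**18]:
    print(m, w_cycle(m) / m, v_cycle(m) / m)
# w_cycle(m)/m grows like log2(m) + 1; v_cycle(m)/m stays below 2.
```

For powers of two, w_cycle(2^k) = 2^k (k + 1) while v_cycle(2^k) = 2^{k+1} - 1, matching the two templates.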
EFFICIENT GRAPH ALGORITHMS
Partition via separators
SIZE REDUCTION
Ultra-sparsifier: for any k, can find H ≈_k G that’s a tree + O(m log^c n / k) edges
e.g. [Koutis-Miller-P `10]: obtain crude estimates on τ_i via a tree
• H equivalent to a graph of size O(m log^c n / k)
• Picking k > log^c n gives reductions
INSTANCE: Lx = b
Input: graph Laplacian L, vector b
Output: x ≈_ε L^+ b
Runtimes:
• [KMP `10, `11]: O(m log n) work, O(m^{1/3}) depth
• [CKPPR `14, CMPPX `14]: O(m log^{1/2} n) work, O(m^{1/3}) depth
Note:
• L^+: pseudo-inverse
• Approximate solution
• Omitting log(1/ε)
+ recursive Chebyshev iteration: T(m) = k^{1/2} (T(m log^c n / k) + O(m))
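A small NumPy sketch of the problem being solved (a dense pseudo-inverse stand-in for the fast solvers above, on a toy path graph of our choosing):

```python
import numpy as np

# Path graph on 5 vertices: Laplacian L = B^T B for the edge-incidence B.
n = 5
B = np.zeros((n - 1, n))
for i in range(n - 1):
    B[i, i], B[i, i + 1] = 1.0, -1.0
L = B.T @ B

# The demand vector b must be orthogonal to the all-ones null space of L.
b = np.zeros(n)
b[0], b[-1] = 1.0, -1.0   # route one unit of "current" from vertex 0 to 4

x = np.linalg.pinv(L) @ b   # x = L^+ b: dense stand-in for the fast solvers
print(L @ x)                # reproduces b (up to the null-space component)
```

The entries of x are electrical potentials: the potential drop x[0] - x[4] equals the effective resistance of the path, here 4.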
INSTANCE: INPUT-SPARSITY TIME NUMERICAL ALGORITHMS
Similar: Nystrom method (sample, then post-process)
[Li-Miller-P `13]:
• Create smaller approximation
• Recurse on it
• Bring solution back
INSTANCE: APPROX MAXFLOW
[Sherman `13], [KLOS `14]: structure approximators give fast maxflow routines
[Racke-Shah-Taubig `14]: good approximator by solving maxflows
[P `14]: build approximator on the smaller graph
Absorb additional (small) error via more calls to the approximator; recurse on instances with smaller total size. Total cost: O(m log^c n)
OUTLINE
• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?
DENSE OBJECTS
• Matrix inverse
• Schur complement
• k-step random walks
• Cost-prohibitive to store
• Application of separators
Directly access sparse approximates?
TWO STEP RANDOM WALKS
A: step of random walk
A^2: 2-step random walk
Still a graph, can sparsify!
WHAT THIS ENABLED
[P-Spielman `14]: use this to approximate (I − A)^{-1} = (I + A)(I + A^2)(I + A^4)⋯
• Similar to multi-level methods
• Skipping: control / propagation of error
Combining known tools: efficiently sparsify I − A^2 without computing A^2
[Cheng-Cheng-Liu-P-Teng `15]: sparsified Newton’s method for matrix roots and Gaussian sampling
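The product identity above follows from telescoping: (I − A)(I + A) = I − A^2, (I − A^2)(I + A^2) = I − A^4, and so on. A quick numerical check (our own toy walk matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random substochastic walk matrix A with spectral radius < 1.
A = rng.random((6, 6))
A = 0.9 * A / A.sum(axis=1, keepdims=True)   # row sums 0.9 < 1

# (I - A)^{-1} = (I + A)(I + A^2)(I + A^4)... : each factor squares the
# remaining error term, so after d factors the error is ||A||^(2^d).
I = np.eye(6)
prod, P = I.copy(), A.copy()
for _ in range(30):
    prod = prod @ (I + P)
    P = P @ P          # A, A^2, A^4, ... ; underflows to 0 very quickly

print(np.max(np.abs(prod - np.linalg.inv(I - A))))   # tiny
```

The factors all commute (each is a polynomial in A), so the multiplication order does not matter; the doubly-exponential error decay is what makes the O(log) chain length sufficient.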
MATRIX SQUARING

                     Connectivity           More general
Iteration            A_{i+1} ≈ A_i^2        I − A_{i+1} ≈ I − A_i^2
Until                ║A_d║ small            ║A_d║ small
Size reduction       Low degree             Sparse graph
Method               Derandomized           Randomized
Solution transfer    Connectivity           Solution vectors

• NC algorithm for shortest path
• Logspace connectivity: [Reingold `02]
• Deterministic squaring: [Rozenman-Vadhan `05]
LONGER RANDOM WALKS
A: one step of random walk
A^3: 3 steps of random walk
(part of) edge uv in A^3: a length-3 path in A, u-y-z-v
PSEUDOCODE
Repeat O(c m log n ε^{-2}) times:
1. Uniformly randomly pick 1 ≤ k ≤ c and edge e = uv
2. Perform a (k − 1)-step random walk from u
3. Perform a (c − k)-step random walk from v
4. Add a scaled copy of the resulting edge to the sparsifier
Resembles:
• Local clustering
• Approximate triangle counting (c = 3)
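A simplified Python sketch of steps 1-3 of the pseudocode above, on a toy 5-cycle of our choosing. On a d-regular graph, a uniform edge at a uniform position k with uniform extensions makes every length-c walk equally likely, so plain counts already estimate A^3; the resistance-based rescaling of step 4 (needed for general graphs) is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)

n, c = 5, 3                                   # 5-cycle, 2-regular, walks of length 3
nbrs = {u: [(u - 1) % n, (u + 1) % n] for u in range(n)}
edges = [(u, (u + 1) % n) for u in range(n)]

def walk(u, steps):
    for _ in range(steps):
        u = nbrs[u][rng.integers(2)]          # uniform neighbor
    return u

N = 40000
counts = np.zeros((n, n))
for _ in range(N):
    k = rng.integers(1, c + 1)                # 1. pick position k and a uniform edge
    u, v = edges[rng.integers(len(edges))]
    if rng.integers(2):                       #    uniform orientation of the edge
        u, v = v, u
    a = walk(u, k - 1)                        # 2. (k-1)-step walk from u
    b = walk(v, c - k)                        # 3. (c-k)-step walk from v
    counts[a, b] += 1                         # 4. unweighted count (regular graph)

total_walks = n * 2**c                        # each row of A^3 sums to d^c = 8
est = counts / N * total_walks

A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
print(np.round(est, 1))                       # approximates A @ A @ A
```

On the 5-cycle, A^3 has entries 3 at distance 1 and 1 at distance 2, which the sampled counts recover up to statistical noise.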
[Cheng-Cheng-Liu-P-Teng `15]: combine this with repeated squaring to approximate any random walk polynomial in nearly-linear time.
GAUSSIAN ELIMINATION
Partial state of Gaussian elimination: linear system on a subset of variables
Graph theoretic interpretation: equivalent circuit on boundaries, Y-Δ transform
[Lee-P-Spielman, in progress]: approximate such circuits in O(m log^c n) time
WHAT THIS ENABLES
[Lee-P-Spielman, in progress]: O(n) time approximate Cholesky factorization for graph Laplacians
[Lee-Sun `15]: constructible in nearly-linear work
OUTLINE
• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?
MORE GENERAL STRUCTURES
• Non-linear structures
• Directed constraints: Ax ≤ b

OTHER NORMS
Generalization of row sampling: given A, q, find A’ s.t. ║Ax║_q ≈ ║A’x║_q ∀x
q-norm: ║y║_q = (Σ_i |y_i|^q)^{1/q}
(figure: the ║y║_1 and ║y║_2 unit balls)
1-norm: standard for representing cuts, used in sparse recovery / robust regression
Applications (for general A):
• Feature selection
• Low rank approximation / PCA
L1 ROW SAMPLING
L1 Lewis weights ([Lewis `78]): w s.t. w_i^2 = a_i^T (A^T W^{-1} A)^{-1} a_i
Recursive definition!
Sampling with p_i ≥ w_i · O(log n) gives ║Ax║_1 ≈ ║A’x║_1 ∀x
Can check: Σ_i w_i ≤ n, so O(n log n) rows
[Talagrand `90, “Embedding subspaces of L1 into ℓ_1^N”]: can be analyzed as row-sampling / sparsification
[Cohen-P `15]: iterate w’_i ← (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}; converges in log log n steps
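A NumPy sketch of this fixed-point iteration (our own toy matrix and iteration cap; not the analyzed variant of the paper): repeatedly set w_i to the square root of the weighted leverage score until the weights stop moving.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 200, 5
A = rng.standard_normal((m, d))

# Fixed-point iteration for L1 Lewis weights:
# w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}, with W = diag(w).
w = np.ones(m)
for _ in range(100):
    M = np.linalg.inv(A.T @ (A / w[:, None]))          # (A^T W^{-1} A)^{-1}
    w_new = np.sqrt(np.einsum('ij,jk,ik->i', A, M, A))
    done = np.max(np.abs(np.log(w_new / w))) < 1e-12
    w = w_new
    if done:
        break

print(w.sum())   # at the fixed point the L1 Lewis weights sum to exactly d
```

At the fixed point, Σ_i w_i = Σ_i a_i^T M a_i / w_i = trace(M · A^T W^{-1} A) = d, which matches the Σ_i w_i ≤ n bound on the slide (here d plays the role of n, the number of columns).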
WHERE THIS FITS IN

Reference                      #rows, q=2      #rows, q=1         Runtime
Dasgupta et al. `09            —               n^2.5              mn^5
Magdon-Ismail `10              n log^2 n       —                  mn^2
Sohler-Woodruff `11            —               n^3.5              mn^{ω-1+θ}
Drineas et al. `12             n log n         —                  mn log n
Clarkson et al. `12            —               n^4.5 log^1.5 n    mn log n
Clarkson-Woodruff `12          n^2 log n       n^8                nnz
Mahoney-Meng `12               n^2             n^3.5              nnz + n^6
Nelson-Nguyen `12              n^{1+θ}         —                  nnz
Li et al. `13                  n log n         n^3.66             nnz + n^{ω+θ}
Cohen et al. `14, Cohen-P `15  n log n         n log n            nnz + n^{ω+θ}
[Cohen-P `15]: elementary, optimization-motivated proof of w.h.p. concentration for L1
CONNECTION TO LEARNING THEORY
Sparsely-used dictionary learning: given Y, find A, X so that ║Y − AX║ is small and X is sparse
[Spielman-Wang-Wright `12]: L1 regression solves this using about n^2 samples
[Luh-Vu `15]: generic chaining: O(n log^4 n) samples suffice
Proof in [Cohen-P `15] gives O(n log^2 n) samples
Key: if X satisfies the Bernoulli-Subgaussian model, then ║Xy║_1 is close to its expectation for all y
‘Right’ bound should be O(n log n)
UNSPARSIFIABLE INSTANCE
Complete bipartite graph: removing any edge uv makes v unreachable from u
Preserve less structure?
WEAKER REQUIREMENT
Sample only needs to make gains in some directions
(figure: P ≈ Q1 w.p. 1/2, Q2 w.p. 1/2)
[Cohen-Kyng-Pachocki-P-Rao `14]: point-wise convergence without matrix concentration
UNIFORM SAMPLING?
Nystrom method (on matrices):
• Pick random subset of data
• Compute on subset
• Post-process result
Post-processing:
• Theoretical works before us: copy x over
• Practical: projection, least-squares fitting
[CLMMPS `15]: half the rows as A’ gives good sampling probabilities for A that sum to ≤ 2n
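A NumPy sketch of this idea (our own toy setup, not the paper's algorithm): uniformly keep half the rows, then score every row of A against that half. The resulting scores dominate the true leverage scores row-by-row (since A_half^T A_half ⪯ A^T A in the PSD order) and their total stays small.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 1000, 8
A = rng.standard_normal((m, d))

# Uniformly keep half the rows, then score ALL rows of A against the half:
# tilde_tau_i = a_i^T (A_half^T A_half)^+ a_i.
half = rng.random(m) < 0.5
Ah = A[half]
Mh = np.linalg.pinv(Ah.T @ Ah)
tilde_tau = np.einsum('ij,jk,ik->i', A, Mh, A)

# True leverage scores for comparison; tilde_tau upper-bounds them row-wise.
tau = np.einsum('ij,jk,ik->i', A, np.linalg.inv(A.T @ A), A)
print(tau.sum(), tilde_tau.sum())   # d vs roughly 2d
```

With m >> d, the half-sample Gram matrix is close to half of A^T A, so each estimate is about twice the true score, in line with the "sum ≤ 2n" bound above.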
How powerful is (recursive) post-processing?
WHY IS THIS EFFECTIVE?
Needle in a haystack: only d dimensions, can’t have too many, easy to find via post-process
Hay in a haystack: half the data should still contain some info
FUTURE WORK
• What structures can sampling preserve?
• What does sampling need to preserve?
More concretely:
• More sparsification-based algorithms? E.g. multi-grid maxflow?
• Sampling directed graphs
• Hardness results?