A Nearly-Linear Time Framework for Graph-Structured Sparsity
Chinmay Hegde Piotr Indyk Ludwig Schmidt
MIT
6 July 2015
ICML
Authors ordered alphabetically.
Structured sparsity

Sparsity is widely used in signal processing, machine learning, and statistics (compressive sensing, sparse linear regression, etc.).

Examples of sparsity: [Figure: cluster sparsity, tree sparsity, group sparsity.]

In many cases, there is rich structure in addition to sparsity.
→ How can we exploit this prior information?
Our focus: stable sparse recovery

Goal: Estimate an unknown, sparse vector β ∈ R^d from observations of the form

$$y = X\beta + e.$$

- X ∈ R^{n×d} is the design / measurement matrix.
- y ∈ R^n are the observations / measurements.
- e ∈ R^n is an observation noise vector.

We are interested in the regime n ≪ d (i.e., X is a fat matrix).
→ Use structured sparsity to reduce the sample complexity n.
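As a toy illustration of this setup, here is a minimal sketch (assuming numpy; the dimensions, support location, and noise level are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, s = 1000, 100, 10   # ambient dimension, observations, sparsity

# Sparse ground truth with a clustered support (one contiguous block),
# the kind of structure the WGM below is designed to capture.
beta = np.zeros(d)
beta[40:40 + s] = rng.standard_normal(s)

# i.i.d. Gaussian design matrix and noisy observations y = X beta + e.
X = rng.standard_normal((n, d)) / np.sqrt(n)
e = 0.05 * rng.standard_normal(n)
y = X @ beta + e
```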
Utilizing structured sparsity in sparse recovery

Large body of work: [Yuan, Lin, 2006], [Eldar, Mishali, 2009], [Jacob, Obozinski, Vert, 2009], [Baraniuk, Cevher, Duarte, Hegde, 2010], [Kim, Xing, 2010], [Bi, Kwok, 2011], [Huang, Zhang, Metaxas, 2011], [Bach, Jenatton, Mairal, Obozinski, 2012b], [Rao, Recht, Nowak, 2012], [Negahban, Ravikumar, Wainwright, Yu, 2012], [Simon, Friedman, Hastie, Tibshirani, 2013], [El Halabi, Cevher, 2015], etc.
Surveys: [Bach, Jenatton, Mairal, Obozinski, 2012a] and [Wainwright, 2014].

Main goals:
- Generality: What sparsity structures does the approach apply to?
  → Generalize several previously studied sparsity models.
- Statistical efficiency: What is the statistical performance improvement?
  → Asymptotically optimal sample complexity.
- Computational efficiency: How fast are the resulting algorithms?
  → Nearly-linear time algorithms.
Generality
The Weighted Graph Model (WGM)
Structured sparsity models

Modeling approach: restrict the set of allowed supports. [Baraniuk, Cevher, Duarte, Hegde, 2010]

So far: β is a vector. [Figure: coefficients β1, ..., β8 arranged on a line.]

Now: β corresponds to a graph. [Figure: coefficients β1, ..., β8 as nodes of a graph.]

Restrict the size and number of connected components of supports.
Weighted Graph Model (simplified)

Parameters:
- Graph G = ([d], E) defined on the index set [d].
- Sparsity s.
- Number of connected components g.

Examples for s = 3 and g = 2: [Figure: supports whose three nodes form two connected clusters are in the model; supports with the wrong size or too many components are not.]
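Membership in the simplified model reduces to counting the support's nodes and the connected components it induces in G. A minimal sketch, assuming networkx (the graph and supports are illustrative):

```python
import networkx as nx

def in_wgm(G: nx.Graph, support: set, s: int, g: int) -> bool:
    """Membership in the simplified (G, s, g)-model: at most s nodes
    whose induced subgraph has at most g connected components."""
    if len(support) > s:
        return False
    return nx.number_connected_components(G.subgraph(support)) <= g

# Example on a path graph with 8 nodes, s = 3, g = 2:
G = nx.path_graph(8)
print(in_wgm(G, {0, 1, 5}, 3, 2))   # True: two clusters, {0, 1} and {5}
print(in_wgm(G, {0, 3, 6}, 3, 2))   # False: three separate components
```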
Generality

We can encode several sparsity structures via the graph G:
- No edges: standard s-sparsity.
- Tree: hierarchical / tree sparsity.
- (Almost) line graph: block sparsity.
- Grid graph: 2D cluster sparsity.
Weighted Graph Model (full version)

Our structured sparsity model also supports edge weights.
Additional parameter: B, a bound on the sum of edge weights in the support.

E.g., s = 3, g = 2, and B = 5: [Figure: the same weighted graph twice; a support whose two clusters can be spanned with total edge weight at most 5 is in the model, one requiring larger weight is not.]

This allows further generalizations, e.g., encoding the EMD model (a model for correlated supports in adjacent columns).
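With edge weights the support must also be cheap to connect: the natural certificate is a minimum spanning forest of the induced subgraph. A hedged sketch extending `in_wgm` from the previous sketch (assumes weights stored in a `weight` edge attribute; the exact form of the weight constraint is simplified here):

```python
def in_weighted_wgm(G, support, s, g, B):
    """Membership in the (G, s, g, B)-WGM: size and component
    constraints as before, plus the total edge weight needed to
    span the support must be at most B."""
    if not in_wgm(G, support, s, g):
        return False
    # On a disconnected induced subgraph, minimum_spanning_tree
    # returns a minimum spanning forest (one tree per component).
    F = nx.minimum_spanning_tree(G.subgraph(support), weight="weight")
    return sum(w for _, _, w in F.edges(data="weight", default=1)) <= B
```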
Statistical efficiency
Sample complexity of sparse recovery with the WGM
Cardinality of the WGM

Key quantity: |M|, the number of allowed supports in the WGM.
→ Counting argument: how many subgraphs with size s and g connected components does G contain?

|M| depends on the graph G and the parameters s and g.
Useful graph parameter: ρ(G), the maximum degree of a node in G. [Figure: a graph with maximum degree ρ(G) = 4.]
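To make the counting argument concrete, here is a hedged sketch of the standard bound (folklore reasoning consistent with the slide, constants not optimized). In a graph with maximum degree ρ(G), the number of connected subgraphs on at most s nodes containing a fixed root is at most roughly (e·ρ(G))^s, and splitting the s nodes among the g clusters contributes only another 2^s factor, so:

```latex
|\mathbb{M}| \;\lesssim\; \binom{d}{g}\,\bigl(e\,\rho(G)\bigr)^{s}
\;\le\; \Bigl(\tfrac{e\,d}{g}\Bigr)^{g}\bigl(e\,\rho(G)\bigr)^{s}
\quad\Longrightarrow\quad
\log|\mathbb{M}| \;=\; O\Bigl(s\log\rho(G) + g\log\tfrac{d}{g}\Bigr).
```

Combined with the model-based compressive sensing bound n = O(s + log|M|) [Baraniuk, Cevher, Duarte, Hegde, 2010], this gives the unweighted sample complexity on the next slide.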
Sample complexity

Theorem. Let β ∈ R^d be in the (G, s, g, B)-weighted graph model. Then

$$n = O\Big(s\big(\log \rho(G) + \log\tfrac{B}{s}\big) + g \log\tfrac{d}{g}\Big)$$

i.i.d. Gaussian observations suffice to find an estimate β̂ such that $\|\hat{\beta} - \beta\| \le C \|e\|$.

Unweighted case: n = O(s log ρ(G) + g log(d/g)).
"Standard" stable sparse recovery: n = O(s log(d/s)).

Asymptotically optimal sample complexity n = O(s) for:
- Block sparsity.
- Tree sparsity.
- Cluster sparsity in constant-degree graphs (for g = O(s/log d)).
Computational efficiency
Nearly-linear time model projection for the WGM
Model projection

Goal: Given b ∈ R^d and a sparsity model M, find

$$\Omega^* = \arg\min_{\Omega \in M} \|b - b_\Omega\|.$$

For the (G, s, g)-WGM: find the subgraph of G with size s and g connected components that maximizes the sum of node weights. [Figure: a node-weighted graph and the heaviest size-3, two-component subgraph.]

This problem is NP-hard.
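For intuition (and for testing on toy instances), the projection can be written as brute force over all supports of size at most s, keeping the one that retains the most energy; the exponential cost in s is exactly what the NP-hardness forces in general. A sketch reusing `in_wgm` from the model slides:

```python
from itertools import combinations
import numpy as np

def exact_projection(b, G, s, g):
    """Brute-force arg min over Omega in the (G, s, g)-model of
    ||b - b_Omega||, i.e., arg max of sum_{i in Omega} b_i^2.
    Exponential in s: toy instances only."""
    best, best_mass = (), 0.0
    for size in range(1, s + 1):
        for omega in combinations(range(len(b)), size):
            if in_wgm(G, set(omega), s, g):
                mass = float(np.sum(b[list(omega)] ** 2))
                if mass > best_mass:
                    best, best_mass = omega, mass
    return set(best)
```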
Approximation to the rescue!

Approximation-tolerant model-based sparse recovery [HIS'14].
→ Approximate projections suffice, but two types are necessary.

Tail-approximation oracle T(b): find a support Ω ∈ M such that

$$\|b - b_\Omega\| \le c_T \cdot \min_{\Omega' \in M} \|b - b_{\Omega'}\|.$$

Head-approximation oracle H(b): find a support Ω ∈ M such that

$$\|b_\Omega\| \ge c_H \cdot \max_{\Omega' \in M} \|b_{\Omega'}\|.$$

[Figure: b decomposed into head b_Ω and tail b − b_Ω; the tail oracle minimizes the tail, the head oracle maximizes the head.]
The prize-collecting Steiner tree problem (PCST)

A generalization of the classical Steiner tree problem.

Goal: Given a graph with edge costs c and node prizes π, find a subtree T minimizing c(T) + π(T̄), where T̄ denotes the set of nodes not in T. [Figure: a graph with edge costs and node prizes, and a prize-collecting subtree.]

The Goemans-Williamson (GW) scheme produces a tree T with

$$c(T) + 2\,\pi(\bar{T}) \le 2 \min_{T' \text{ is a tree}} \big(c(T') + \pi(\bar{T}')\big)$$

and runs in time O(|V|² log |V|) [Goemans, Williamson, 1995].
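A fast GW variant is available as the `pcst_fast` Python package released with this line of work; the sketch below shows the call pattern on a toy instance (the signature and argument order are stated from memory, so treat them as assumptions and check the code release):

```python
import numpy as np
from pcst_fast import pcst_fast  # assumed package / function name

# Toy instance: a path 0 - 1 - 2 - 3 with unit edge costs and
# prizes concentrated at the two endpoints.
edges = np.array([[0, 1], [1, 2], [2, 3]], dtype=np.int64)
costs = np.array([1.0, 1.0, 1.0])
prizes = np.array([5.0, 0.0, 0.0, 6.0])

# root = -1 requests the unrooted problem, num_clusters = 1 asks for a
# single tree, and "gw" selects Goemans-Williamson-style pruning.
vertices, used_edges = pcst_fast(edges, prizes, costs, -1, 1, "gw", 0)
print(vertices)  # expected: the whole path, since cost 3 beats forfeiting a prize
```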
Our algorithmic contributions

1. Generalize GW to the prize-collecting Steiner forest (PCSF) problem. We find a forest F with g components such that

$$c(F) + 2\,\pi(\bar{F}) \le 2 \min_{F' \text{ has } g \text{ components}} \big(c(F') + \pi(\bar{F}')\big).$$

2. Give a nearly-linear time and practical variant of GW, building on the dynamic edge splitting idea introduced in [Cole, Hariharan, Lewenstein, Porat, 2001]. [Figure: an edge (a, b) being split.]

3. Reduce WGM projection to a sequence of PCSF problems, via Lagrangian relaxation with binary search plus graph post-processing (see the sketch below).
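A simplified sketch of contribution 3 (it compresses the actual algorithm's post-processing and guarantees into a plain bisection; `solve_pcsf` is a hypothetical stand-in returning the node set of a g-component PCSF solution, e.g., the GW variant above):

```python
def wgm_projection(b, edges, costs, s, g, steps=32):
    """Hedged sketch: binary-search a Lagrange multiplier lam so that
    PCSF with prizes lam * b_i^2 returns a g-component forest with
    about s nodes (assumes b has at least s nonzero entries)."""
    prizes = b ** 2
    lam_lo, lam_hi = 0.0, 1.0
    # Grow lam_hi until the returned forest is large enough to bracket s.
    while len(solve_pcsf(edges, lam_hi * prizes, costs, g)) < s:
        lam_hi *= 2.0
    support = None
    for _ in range(steps):
        lam = (lam_lo + lam_hi) / 2.0
        nodes = solve_pcsf(edges, lam * prizes, costs, g)
        if len(nodes) > s:
            lam_hi = lam                  # forest too large: cheapen the prizes
        else:
            support, lam_lo = nodes, lam  # feasible: try to collect more prize
    return support
```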
Running time

Theorem. On a graph with |E| edges and d nodes, GRAPH-COSAMP runs in time

$$O\big((T_X + |E| \log^3 d)\,\log d\big),$$

where T_X is the cost of a matrix-vector multiplication with the design / measurement matrix X.

Model            Reference   Previous time    Our time
1D-cluster       [CIHB09]    O(d log² d)      O(d log⁴ d)
Trees            [HIS14a]    O(d log² d)      O(d log⁴ d)
EMD              [HIS14b]    O(d² log d)      O(d^{3/2} log⁴ d)
Graph clusters   [HZM11]     O(d^c)           O(d log⁴ d)
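The loop achieving this runtime plugs the two oracles into the approximation-tolerant CoSaMP recipe; a minimal sketch (numpy; `head_oracle` and `tail_oracle` return node sets and are left abstract, standing in for the PCSF-based projections):

```python
import numpy as np

def graph_cosamp(y, X, head_oracle, tail_oracle, iters=20):
    """Hedged sketch of the Graph-CoSaMP iteration [HIS'14]: head-
    approximate the gradient proxy, solve least squares on the merged
    support, then tail-approximate back into the model."""
    d = X.shape[1]
    beta = np.zeros(d)
    support = set()
    for _ in range(iters):
        proxy = X.T @ (y - X @ beta)           # proxy for the residual signal
        Gamma = sorted(support | head_oracle(proxy))
        b = np.zeros(d)
        b[Gamma] = np.linalg.lstsq(X[:, Gamma], y, rcond=None)[0]
        support = set(tail_oracle(b))          # prune back into the model
        idx = sorted(support)
        beta = np.zeros(d)
        beta[idx] = b[idx]
    return beta
```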
Experiments
Sparse recovery experiments

[Figure: three panels plotting probability of recovery against the oversampling ratio n/s for Graph-CoSaMP, StructOMP, LaMP, CoSaMP, and Basis Pursuit.]

StructOMP: [HZM11], LaMP: [CDHB09], CoSaMP: [NT09], BP: [CD92].
Running times

Angiogram image, n = 6s observations, subsampled Fourier matrix.

[Figure: two panels plotting recovery time (sec) against problem size d, on linear and logarithmic scales, for Graph-CoSaMP, StructOMP, LaMP, CoSaMP, and Basis Pursuit.]

Graph-CoSaMP is about 20× faster than StructOMP for d = 10⁴ and scales nearly linearly.
Constant factor: each recovery solves more than 20 PCSF instances.
Conclusions

We introduced the Weighted Graph Model:
- Generalizes several structured sparsity models.
- Asymptotically optimal sample complexity in many cases.
- Nearly-linear time approximate model projections.

Further applications, e.g., in seismic image processing. [Figure: noisy input, human labels, automatic result.]

Open problems / future directions:
- Fast measurement matrices for all sparsity levels.
- Recovery guarantees beyond the RIP.
- Learning sparsity models.