
A Nearly-Linear Time Framework for Graph-Structured Sparsity

Chinmay Hegde Piotr Indyk Ludwig Schmidt

MIT

6 July 2015

ICML

Authors ordered alphabetically.


Structured sparsity

Sparsity is widely used in signal processing, machine learning, and statistics (compressive sensing, sparse linear regression, etc.)

Examples of sparsity

[Figure: three panels labeled "Cluster sparsity", "Tree sparsity", and "Group sparsity"]

In many cases, there is rich structure in addition to sparsity.

→ How can we exploit this prior information?

Our focus: stable sparse recovery

Goal: Estimate an unknown, sparse vector $\beta \in \mathbb{R}^d$ from observations of the form

$$y = X\beta + e.$$

$X \in \mathbb{R}^{n \times d}$ is the design / measurement matrix.

$y \in \mathbb{R}^n$ are the observations / measurements.

$e \in \mathbb{R}^n$ is an observation noise vector.

We are interested in the regime $n \ll d$ (i.e., $X$ is a fat matrix).

→ Use structured sparsity to reduce sample complexity $n$.
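The observation model above can be simulated in a few lines. This is a minimal sketch using only the Python standard library; the function name `simulate_observations`, the ±1 signal values, the N(0, 1/n) scaling of X, and the noise level are illustrative assumptions, not taken from the slides.

```python
import random


def simulate_observations(n, d, support, noise_std=0.01, seed=0):
    """Draw y = X beta + e for a sparse beta and an i.i.d. Gaussian X.

    All numeric choices here are assumptions made for illustration.
    """
    rng = random.Random(seed)
    # Sparse signal: nonzero (+1 or -1) only on the given support.
    beta = [0.0] * d
    for i in support:
        beta[i] = rng.choice([-1.0, 1.0])
    # Fat design matrix (n < d) with i.i.d. N(0, 1/n) entries.
    X = [[rng.gauss(0.0, 1.0 / n ** 0.5) for _ in range(d)] for _ in range(n)]
    # Observation noise.
    e = [rng.gauss(0.0, noise_std) for _ in range(n)]
    # Only the support contributes to X beta, so sum over it directly.
    y = [sum(X[r][j] * beta[j] for j in support) + e[r] for r in range(n)]
    return X, y, beta, e


X, y, beta, e = simulate_observations(n=20, d=100, support=[3, 4, 5, 40])
print(len(y), len(X[0]), sum(1 for v in beta if v != 0.0))
```

Here n = 20 observations are taken of a d = 100 dimensional vector with 4 nonzero entries, so X has far fewer rows than columns.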

Utilizing structured sparsity in sparse recovery

Large body of work: [Yuan, Lin, 2006], [Eldar, Mishali, 2009], [Jacob, Obozinski, Vert, 2009], [Baraniuk, Cevher, Duarte, Hegde, 2010], [Kim, Xing, 2010], [Bi, Kwok, 2011], [Huang, Zhang, Metaxas, 2011], [Bach, Jenatton, Mairal, Obozinski, 2012b], [Rao, Recht, Nowak, 2012], [Negahban, Ravikumar, Wainwright, Yu, 2012], [Simon, Friedman, Hastie, Tibshirani, 2013], [El Halabi, Cevher, 2015], etc.

Surveys [Bach, Jenatton, Mairal, Obozinski, 2012a] and [Wainwright, 2014].

Main goals:

Generality: What sparsity structures does the approach apply to?
→ Generalize several previously studied sparsity models.

Statistical efficiency: What is the statistical performance improvement?
→ Asymptotically optimal sample complexity.

Computational efficiency: How fast are the resulting algorithms?
→ Nearly-linear time algorithms.




Generality

The Weighted Graph Model (WGM)


Structured sparsity models

Modeling approach: restrict the set of allowed supports. [Baraniuk, Cevher, Duarte, Hegde, 2010]

So far: β is a vector.

[Figure: the entries β1, ..., β8 arranged in a row]

Now: β corresponds to a graph.

[Figure: the entries β1, ..., β8 as the nodes of a graph]

Restrict size and number of connected components of supports.


Weighted Graph Model (simplified)

Parameters:
- Graph G = ([d], E) defined on the index set [d].
- Sparsity s.
- Number of connected components g.

Examples for s = 3 and g = 2:

[Figure: four example supports, two labeled "In the model" and two labeled "Not in the model"]
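A membership test for the simplified model can be sketched as follows. The sketch assumes that a support is allowed when it has at most s nodes and the subgraph it induces in G has at most g connected components; the slides state the parameters but not this exact reading, and the function names and the path graph are hypothetical.

```python
def connected_components(nodes, edges):
    """Count connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        # Depth-first search marks everything reachable from `start`.
        stack = [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for w in adjacency[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return count


def in_simplified_wgm(support, edges, s, g):
    """Assumed reading: at most s nodes and at most g components."""
    support = set(support)
    return len(support) <= s and connected_components(support, edges) <= g


# Hypothetical example: a path graph on 6 nodes with s = 3 and g = 2.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(in_simplified_wgm({0, 1, 4}, path_edges, s=3, g=2))  # two components
print(in_simplified_wgm({0, 2, 4}, path_edges, s=3, g=2))  # three components
```

The first support is a connected pair plus one isolated node, so it fits g = 2; the second consists of three isolated nodes and does not.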


Generality

We can encode several sparsity structures via the graph G.

No edges: standard s-sparsity

Tree: hierarchical / tree sparsity

(Almost) line graph: block sparsity

Grid graph: 2D cluster sparsity


Weighted Graph Model (full version)

Our structured sparsity model also supports edge weights.

Additional parameter: B, bound on the sum of weights in the support.

E.g., s = 3, g = 2, and B = 5:

[Figure: two supports on an edge-weighted graph, one labeled "In the model" and one labeled "Not in the model"]

Allows further generalizations, e.g., encoding the EMD-model (a model for correlated supports in adjacent columns).
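One way to test the weight budget is sketched below. It assumes that "sum of weights in the support" means the total edge weight of a forest that spans the support with at most g trees, and that edge weights are non-negative. Under those assumptions the lightest such forest is found by running Kruskal's algorithm on the induced subgraph and stopping once g components remain. The function names and the example graph are hypothetical.

```python
def min_forest_weight(support, weighted_edges, g):
    """Weight of the lightest forest spanning `support` with at most g trees.

    Returns None if the induced subgraph has more than g components.
    Assumes non-negative edge weights given as (u, v, weight) triples.
    """
    parent = {v: v for v in support}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components, total = len(parent), 0
    induced = [(u, v, w) for u, v, w in weighted_edges
               if u in parent and v in parent]
    for u, v, w in sorted(induced, key=lambda edge: edge[2]):
        if components <= g:
            break  # enough merging; extra edges would only add weight
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
            total += w
    return total if components <= g else None


def in_weighted_wgm(support, weighted_edges, s, g, B):
    weight = min_forest_weight(support, weighted_edges, g)
    return len(set(support)) <= s and weight is not None and weight <= B


# Hypothetical path graph 0-1-2-3-4 with edge weights 4, 7, 1, 2.
edges = [(0, 1, 4), (1, 2, 7), (2, 3, 1), (3, 4, 2)]
print(in_weighted_wgm({0, 1, 3}, edges, s=3, g=2, B=5))  # uses weight 4
print(in_weighted_wgm({1, 2, 4}, edges, s=3, g=2, B=5))  # needs weight 7
```

Both supports have three nodes and two components, so they differ only in the weight of the edge needed to connect them, which is what the budget B = 5 separates.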

Statistical efficiency

Sample complexity of sparse recovery with the WGM


Cardinality of the WGM

Key quantity: |M|, the number of allowed supports in the WGM.

→ Counting argument: how many subgraphs with size s and g connected components does G contain?

|M| depends on the graph G and the parameters s and g.

Useful graph parameter: ρ(G), the maximum degree of a node in G.

[Figure: example graph with ρ(G) = 4]
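The maximum degree is cheap to compute from an edge list. The sketch below builds a 2D grid graph, which is one graph with ρ(G) = 4, and checks its maximum degree; the helper names and the row-major node numbering are assumptions made for this example.

```python
def grid_graph_edges(rows, cols):
    """Edges of a rows x cols grid; node (r, c) has index r * cols + c."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c
            if c + 1 < cols:
                edges.append((v, v + 1))      # right neighbor
            if r + 1 < rows:
                edges.append((v, v + cols))   # neighbor below
    return edges


def max_degree(num_nodes, edges):
    """rho(G): the largest number of edges incident to any node."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return max(degree, default=0)


grid = grid_graph_edges(3, 3)
print(max_degree(9, grid))  # interior node of the grid has four neighbors
print(max_degree(5, []))    # no edges: every node has degree zero
```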

Sample complexity

Let $\beta \in \mathbb{R}^d$ be in the $(G, s, g, B)$-weighted graph model. Then

$$n = O\left( s \left( \log \rho(G) + \log \frac{B}{s} \right) + g \cdot \log \frac{d}{g} \right)$$

i.i.d. Gaussian observations suffice to find an estimate $\hat{\beta}$ such that

$$\|\beta - \hat{\beta}\| \le C \|e\|.$$

Unweighted case: $n = O\left( s \log \rho(G) + g \cdot \log \frac{d}{g} \right)$

"Standard" stable sparse recovery: $n = O\left( s \cdot \log \frac{d}{s} \right)$.

Asymptotically optimal sample complexity $n = O(s)$ for
- Block sparsity.
- Tree sparsity.
- Cluster sparsity in constant-degree graphs (for $g = O(s / \log d)$).
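To get a feel for the gap between the two rates, the expressions inside the O(·) can be evaluated directly. This is only a sketch: the hidden constants are dropped, so the numbers indicate growth rates rather than actual measurement counts. It assumes ρ(G) ≥ 1 and B ≥ s, uses the familiar s log(d/s) rate for unstructured sparsity, and the function names and parameter values are hypothetical.

```python
import math


def wgm_bound(s, g, d, rho, B=None):
    """s * (log rho + log(B / s)) + g * log(d / g), constants dropped.

    With B=None the weight term is omitted (unweighted case).
    Assumes rho >= 1 and, if given, B >= s.
    """
    weight_term = math.log(B / s) if B is not None else 0.0
    return s * (math.log(rho) + weight_term) + g * math.log(d / g)


def standard_bound(s, d):
    """s * log(d / s), the rate for unstructured s-sparse recovery."""
    return s * math.log(d / s)


# Hypothetical setting: one cluster of 100 nodes in a grid with 10^6 nodes.
s, g, d, rho = 100, 1, 10 ** 6, 4
print(round(wgm_bound(s, g, d, rho), 1))
print(round(standard_bound(s, d), 1))
```

With a constant maximum degree and a single component, the structured expression grows like s plus log d, while the unstructured one grows like s log(d/s).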


Computational efficiency

Nearly-linear time model projection for the WGM


Model projection

Goal: Given $b \in \mathbb{R}^d$ and a sparsity model $M$, find

$$\Omega^* = \arg\min_{\Omega \in M} \|b - b_\Omega\|.$$

For the (G, s, g)-WGM: Find the subgraph of G with size s and g connected components that maximizes the sum of node weights.

[Figure: example graph with node weights 3, 5, 7, 2, 6, 8, 10]

This problem is NP-hard.
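For very small graphs the exact projection can be found by enumeration, which makes the objective concrete and also shows why this approach does not scale. The sketch below is a brute-force illustration, not the method from the talk. It assumes supports of at most s nodes with at most g induced components, uses the fact that minimizing ‖b − b_Ω‖ is the same as maximizing the sum of b_i² over Ω, and places illustrative values on a hypothetical path graph.

```python
from itertools import combinations


def count_components(nodes, edges):
    """Connected components of the subgraph induced by `nodes`."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    count = len(parent)
    for u, v in edges:
        if u in parent and v in parent:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v
                count -= 1
    return count


def exact_projection(b, edges, s, g):
    """Try every support of size <= s; exponential in s, tiny inputs only."""
    best_support, best_energy = (), 0
    for size in range(1, s + 1):
        for support in combinations(range(len(b)), size):
            if count_components(support, edges) > g:
                continue
            energy = sum(b[i] ** 2 for i in support)
            if energy > best_energy:
                best_support, best_energy = support, energy
    return best_support, best_energy


# Illustrative values on a hypothetical path graph 0-1-2-3-4-5-6.
b = [3, 5, 7, 2, 6, 8, 10]
path = [(i, i + 1) for i in range(6)]
print(exact_projection(b, path, s=3, g=1))
print(exact_projection(b, path, s=3, g=2))
```

Allowing a second component lets the projection pick up a large isolated entry instead of a smaller neighbor, so the captured energy can only go up as g grows.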


Approximation to the rescue!

Approximation-tolerant model-based sparse recovery [HIS’14].
→ Approximate projections suffice, but two types are necessary.

Tail-approximation oracle $T(b)$

Find a support $\Omega \in M$ such that

$$\|b - b_\Omega\| \le c_T \cdot \min_{\Omega' \in M} \|b - b_{\Omega'}\|.$$

Head-approximation oracle $H(b)$

Find a support $\Omega \in M$ such that

$$\|b_\Omega\| \ge c_H \cdot \max_{\Omega' \in M} \|b_{\Omega'}\|.$$

[Figure: b split into the head $b_\Omega$ and the tail $b - b_\Omega$; the tail oracle minimizes the tail, the head oracle maximizes the head]
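A small numeric example shows that the two guarantees are genuinely different. The sketch below uses plain s-sparsity as the model (every support of size s is allowed) and two hand-picked vectors; the vectors, the candidate supports, and the function name are illustrative assumptions.

```python
import math


def head_and_tail(b, support):
    """Return (||b_Omega||, ||b - b_Omega||) for the given support."""
    support = set(support)
    head = math.sqrt(sum(x * x for i, x in enumerate(b) if i in support))
    tail = math.sqrt(sum(x * x for i, x in enumerate(b) if i not in support))
    return head, tail


# Case 1 (s = 2): a good head approximation that is a useless tail
# approximation, because the optimal tail is exactly zero.
b1 = [1.0, 1.0, 0.0]
opt_head1, opt_tail1 = head_and_tail(b1, {0, 1})
cand_head1, cand_tail1 = head_and_tail(b1, {0, 2})
print(cand_head1 / opt_head1, cand_tail1, opt_tail1)

# Case 2 (s = 1): a good tail approximation that captures almost none
# of the head.
b2 = [0.1, 1.0, 1.0, 1.0, 1.0]
opt_head2, opt_tail2 = head_and_tail(b2, {1})
cand_head2, cand_tail2 = head_and_tail(b2, {0})
print(cand_tail2 / opt_tail2, cand_head2 / opt_head2)
```

In the first case the candidate keeps about 71% of the optimal head but has tail 1 where the optimum is 0, so no finite $c_T$ covers it. In the second case the candidate's tail is within a factor of about 1.15 of optimal while its head is only a tenth of the best one.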

The prize-collecting Steiner tree problem (PCST)

Generalization of the classical Steiner tree problem.

Goal: Given a graph with edge costs $c$ and node prizes $\pi$, find a subtree $T$ minimizing $c(T) + \pi(\overline{T})$ ($\overline{T}$: nodes not in $T$).

[Figure: example graph with numeric edge costs and node prizes]

The Goemans-Williamson (GW) scheme produces a tree $T$ with

$$c(T) + 2\pi(\overline{T}) \le 2 \min_{T' \text{ is a tree}} \left( c(T') + \pi(\overline{T'}) \right)$$

and runs in time $O(|V|^2 \log |V|)$ [Goemans, Williamson, 1995].
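The PCST objective is easy to evaluate for a given candidate tree, which helps when reading the guarantee above. The sketch below only scores candidates; it is not the GW scheme, it does not check that the candidate really is a tree, and the tiny graph, its costs, and its prizes are made up for illustration.

```python
def pcst_objective(tree_nodes, tree_edges, edge_cost, prize):
    """c(T) + pi(T-bar): cost of the tree's edges plus prizes left outside."""
    cost = sum(edge_cost[edge] for edge in tree_edges)
    missed = sum(p for node, p in prize.items() if node not in tree_nodes)
    return cost + missed


# Hypothetical path graph 0-1-2 with edge costs and node prizes.
edge_cost = {(0, 1): 1, (1, 2): 5}
prize = {0: 3, 1: 2, 2: 1}

candidates = {
    "node 0 only": ({0}, []),
    "nodes 0 and 1": ({0, 1}, [(0, 1)]),
    "all nodes": ({0, 1, 2}, [(0, 1), (1, 2)]),
}
scores = {name: pcst_objective(nodes, edges, edge_cost, prize)
          for name, (nodes, edges) in candidates.items()}
print(scores)
print(min(scores, key=scores.get))
```

Connecting node 2 costs 5 but only saves a prize of 1, so the best of these candidates stops at nodes 0 and 1.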




Our algorithmic contributions

1. Generalize GW to the prize-collecting Steiner forest problem (PCSF).

   We find a forest $F$ with $g$ components such that:

   $$c(F) + 2\pi(\overline{F}) \le 2 \min_{F' \text{ has } g \text{ components}} \left( c(F') + \pi(\overline{F'}) \right)$$

2. Give a nearly-linear time and practical variant of GW.

   Building on the dynamic edge splitting idea introduced in [Cole, Hariharan, Lewenstein, Porat, 2001].

3. Reduce WGM-projection to a sequence of PCSF problems.

   Lagrangian relaxation + binary search and graph post-processing.
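The pattern behind the third contribution, Lagrangian relaxation plus binary search, can be illustrated on a much simpler problem. The sketch below is not the authors' reduction: it assumes the edgeless special case (plain s-sparsity), where moving the sparsity constraint into the objective with a multiplier λ gives a relaxed problem that thresholding solves exactly, so no PCSF solver is needed. All names are hypothetical.

```python
def relaxed_support(b, lam):
    """Solve max over supports S of sum_{i in S} (b_i^2 - lam).

    Each index contributes independently, so keep those with b_i^2 > lam.
    """
    return [i for i, x in enumerate(b) if x * x > lam]


def sparsity_via_binary_search(b, s, iterations=50):
    """Binary search for the smallest multiplier whose solution has <= s nodes."""
    low = 0.0
    high = max((x * x for x in b), default=0.0)  # empty support, always feasible
    for _ in range(iterations):
        mid = (low + high) / 2
        if len(relaxed_support(b, mid)) <= s:
            high = mid  # feasible: try a smaller penalty
        else:
            low = mid   # too many nodes selected: raise the penalty
    return relaxed_support(b, high)


b = [0.5, -3.0, 2.0, 0.1, 1.0]
print(sparsity_via_binary_search(b, s=2))
```

The search keeps `high` feasible throughout, so the returned support never exceeds s entries. When several entries tie in magnitude, thresholding can return fewer than s indices, which is one reason a relaxation of this kind generally needs a post-processing step.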


Running time

Theorem: On a graph with $|E|$ edges and $d$ nodes, GRAPH-COSAMP runs in time

$$O\left( (T_X + |E| \log^3 d) \log d \right),$$

where $T_X$ is the cost of a matrix-vector multiplication with the design / measurement matrix $X$.

| Model          | Reference | Previous time     | Our time                |
|----------------|-----------|-------------------|-------------------------|
| 1D-cluster     | [CIHB09]  | $O(d \log^2 d)$   | $O(d \log^4 d)$         |
| Trees          | [HIS14a]  | $O(d \log^2 d)$   | $O(d \log^4 d)$         |
| EMD            | [HIS14b]  | $O(d^2 \log d)$   | $O(d^{3/2} \log^4 d)$   |
| Graph clusters | [HZM11]   | $O(d^c)$          | $O(d \log^4 d)$         |

Experiments


Sparse recovery experiments

[Figure: three panels plotting probability of recovery (0 to 1) against the oversampling ratio n/s (2 to 7) for Graph-CoSaMP, StructOMP, LaMP, CoSaMP, and Basis Pursuit]

StructOMP: [HZM11], LaMP: [CDHB09], CoSaMP: [NT09], BP: [CD92].

Running times

Angiogram image, n = 6s observations, subsampled Fourier matrix.

[Figure: two panels plotting recovery time (sec) against problem size d (0 to 4·10^4), one with a linear time axis (0 to 100) and one with a logarithmic time axis (10^-2 to 10^2), for Graph-CoSaMP, StructOMP, LaMP, CoSaMP, and Basis Pursuit]

Graph-CoSaMP is about 20× faster than StructOMP for $d = 10^4$ and scales nearly-linearly.

Constant factor: solving more than 20 PCSF instances per recovery.

Conclusions

We introduced the Weighted Graph Model.
- Generalizes several structured sparsity models.
- Asymptotically optimal sample complexity in many cases.
- Nearly-linear time approximate model projections.

Further applications, e.g. in seismic image processing.

[Figure: three panels labeled "Noisy input", "Human labels", and "Automatic"]

Open problems / future directions
- Fast measurement matrix for all sparsity levels.
- Recovery guarantees beyond RIP.
- Learning sparsity models.
