MVPA with SpaceNet: sparse structured priors
https://hal.inria.fr/hal-01147731/document
Elvis DOHMATOB (joint work with: M. EICKENBERG, B. THIRION, & G. VAROQUAUX)

TRANSCRIPT

Page 1: MVPA with SpaceNet: sparse structured priors

MVPA with SpaceNet: sparse structured priors
https://hal.inria.fr/hal-01147731/document

Elvis DOHMATOB (joint work with: M. EICKENBERG, B. THIRION, & G. VAROQUAUX)

[Figure: title-slide illustration: an axial brain slice (y = 20) with a colorbar from -75 to 75, alongside a 2D feature-space sketch with axes x1, x2]

Page 2: MVPA with SpaceNet: sparse structured priors

1 Introducing the model


Page 3: MVPA with SpaceNet: sparse structured priors

1 Brain decoding

We are given:
n = number of scans; p = number of voxels in the mask
design matrix: X ∈ R^(n×p) (brain images)
response vector: y ∈ R^n (external covariates)

We need to predict y on new data.

Linear model assumption: y ≈ Xw. We seek to estimate the weights map w that ensures the best prediction / classification scores.
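
To make the setup concrete, here is a minimal NumPy sketch of the linear model on toy data (the dimensions and noise level are illustrative, not from the talk); the naive least-squares fit at the end is exactly what becomes unstable once p approaches or exceeds n:

```python
import numpy as np

# Toy decoding problem: n scans, p voxels (in real data p >> n)
rng = np.random.RandomState(0)
n, p = 100, 300
X = rng.randn(n, p)                  # design matrix: one brain image per row
w_true = np.zeros(p)
w_true[:10] = 1.0                    # only a few voxels carry signal
y = X @ w_true + 0.1 * rng.randn(n)  # response vector

# Naive least-squares fit of y ~ Xw; ill-posed in high dimensions,
# hence the need for regularization discussed on the next slides
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ w_hat
```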


Page 4: MVPA with SpaceNet: sparse structured priors

1 The need for regularization

Ill-posed problem: high-dimensional (n ≪ p). Typically n ∼ 10–10³ and p ∼ 10⁴–10⁶.

We need regularization to reduce dimensionality and encode the practitioner's priors on the weights w.


Page 5: MVPA with SpaceNet: sparse structured priors

1 Why spatial priors?

3D spatial gradient (a linear operator):
∇ : w ∈ R^p −→ (∇x w, ∇y w, ∇z w) ∈ R^(p×3)

Penalizing the image gradient ∇w ⇒ regions.

Such priors are reasonable since brain activity is spatially correlated; they yield more stable and more predictive maps than unstructured priors (e.g. SVM) [Hebiri 2011, Michel 2011, Baldassare 2012, Grosenick 2013, Gramfort 2013].
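
As a concrete sketch of the gradient operator defined above, here is a hypothetical forward-difference implementation in NumPy (boundary handling is a choice made here; the actual SpaceNet code may differ):

```python
import numpy as np

def spatial_gradient(w_3d):
    """Forward-difference 3D gradient: maps a weights volume to its
    (grad_x, grad_y, grad_z) components, with zeros at the far borders."""
    grad = np.zeros((3,) + w_3d.shape)
    grad[0, :-1] = np.diff(w_3d, axis=0)
    grad[1, :, :-1] = np.diff(w_3d, axis=1)
    grad[2, :, :, :-1] = np.diff(w_3d, axis=2)
    return grad

w_vol = np.random.randn(4, 5, 6)   # toy weights volume
g = spatial_gradient(w_vol)        # shape (3, 4, 5, 6)
```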

Page 6: MVPA with SpaceNet: sparse structured priors

1 SpaceNet

SpaceNet is a family of “structure + sparsity” priors for regularizing models for brain decoding.

SpaceNet generalizes TV [Michel 2011], Smooth-Lasso / GraphNet [Hebiri 2011, Grosenick 2013], and TV-L1 [Baldassare 2012, Gramfort 2013].


Page 7: MVPA with SpaceNet: sparse structured priors

2 Methods


Page 8: MVPA with SpaceNet: sparse structured priors

2 The SpaceNet regularized model

y = Xw + “error”

Optimization problem (regularized model):

minimize ½‖y − Xw‖₂² + penalty(w)

½‖y − Xw‖₂² is the loss term; it will be different for squared loss, logistic loss, ...


Page 9: MVPA with SpaceNet: sparse structured priors

2 The SpaceNet regularized model

penalty(w) = α Ωρ(w), where

Ωρ(w) := ρ‖w‖₁ + (1 − ρ) × { ½‖∇w‖², for GraphNet; ‖w‖TV, for TV-L1; ... }

α (0 < α < +∞) is the total amount of regularization; ρ (0 < ρ ≤ 1) is a mixing constant called the ℓ1-ratio.

ρ = 1 gives the Lasso.

The problem is convex, non-smooth, and heavily ill-conditioned.
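
A minimal sketch of these two penalties in NumPy, reusing the spatial_gradient helper from the earlier sketch (isotropic TV is assumed here; the actual implementation may differ):

```python
import numpy as np

def graphnet_penalty(w_3d, alpha, rho):
    """alpha * Omega_rho(w) for GraphNet:
    rho * ||w||_1 + (1 - rho) * 0.5 * ||grad w||^2."""
    grad = spatial_gradient(w_3d)               # helper sketched earlier
    return alpha * (rho * np.abs(w_3d).sum()
                    + (1 - rho) * 0.5 * (grad ** 2).sum())

def tv_l1_penalty(w_3d, alpha, rho):
    """alpha * Omega_rho(w) for TV-L1:
    rho * ||w||_1 + (1 - rho) * ||w||_TV (isotropic TV)."""
    grad = spatial_gradient(w_3d)
    tv = np.sqrt((grad ** 2).sum(axis=0)).sum() # sum of per-voxel gradient norms
    return alpha * (rho * np.abs(w_3d).sum() + (1 - rho) * tv)
```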



Page 11: MVPA with SpaceNet: sparse structured priors

2 Interlude: zoom on ISTA-based algorithms

Setting: min f + g, where f is smooth with ∇f L-Lipschitz, g is non-smooth, and both f and g are convex.

ISTA: O(L∇f / ε) iterations [Daubechies 2004]
Step 1: gradient descent step on f
Step 2: proximal operator of g

FISTA: O(√(L∇f / ε)) iterations [Beck & Teboulle 2009]
= ISTA with a “Nesterov acceleration” trick!
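
To fix ideas, here is a minimal, self-contained FISTA sketch for the Lasso special case (f = ½‖y − Xw‖², g = α‖w‖₁, whose prox is soft-thresholding); the TV-L1 and GraphNet variants on the next slides swap in different prox steps:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||.||_1: entrywise soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fista_lasso(X, y, alpha, n_iter=200):
    """Minimal FISTA for min 0.5 * ||y - Xw||^2 + alpha * ||w||_1."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of grad f
    w = np.zeros(X.shape[1])
    z, t = w.copy(), 1.0
    for _ in range(n_iter):
        # Step 1: gradient descent on f; Step 2: prox of g
        w_next = soft_threshold(z - X.T @ (X @ z - y) / L, alpha / L)
        # Step 3: Nesterov acceleration
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = w_next + ((t - 1) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w
```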


Page 12: MVPA with SpaceNet: sparse structured priors

2 FISTA: Implementation for TV-L1

[Figure: benchmark of convergence time in seconds (log scale, ~10² to 10³) for FISTA on TV-L1, across α ∈ {1e-3, 1e-2, 1e-1} and ℓ1-ratio ∈ {0.10, 0.25, 0.50, 0.75, 0.90}]

[DOHMATOB 2014 (PRNI)]


Page 13: MVPA with SpaceNet: sparse structured priors

2 FISTA: Implementation for GraphNet

Augment X: X̃ := [X; c_{α,ρ}∇] ∈ R^((n+3p)×p) (X stacked on top of the scaled gradient operator), and zero-pad the response: ỹ := [y; 0] ∈ R^(n+3p)

⇒ X̃z(t) = [Xz(t); c_{α,ρ}∇z(t)]

1. Gradient descent step (data-fit term):
w(t+1) ← z(t) − γ X̃ᵀ(X̃z(t) − ỹ)

2. Prox step (penalty term):
w(t+1) ← soft_{αργ}(w(t+1))

3. Nesterov acceleration:
z(t+1) ← (1 + θ(t)) w(t+1) − θ(t) w(t)

Bottleneck: ∼80% of the runtime is spent computing X̃z(t)! We badly need a speedup!
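
The augmentation trick is easy to demonstrate on a toy 1D "volume", where ∇ is a first-difference matrix D and GraphNet reduces to a plain Lasso on augmented data (the dimensions, c, and D below are illustrative; fista_lasso is the sketch from the interlude above):

```python
import numpy as np

p = 50
D = np.diff(np.eye(p), axis=0)          # first-difference operator, (p-1) x p
rng = np.random.RandomState(0)
X = rng.randn(30, p)
w_true = np.zeros(p)
w_true[20:30] = 1.0                     # a piecewise-constant "blob"
y = X @ w_true + 0.05 * rng.randn(30)

alpha, rho = 0.5, 0.9
c = np.sqrt(alpha * (1 - rho))          # scale on the smoothness rows
X_aug = np.vstack([X, c * D])           # augmented design matrix
y_aug = np.concatenate([y, np.zeros(D.shape[0])])  # zero-padded response

# GraphNet fit = Lasso (l1 weight alpha * rho) on the augmented problem
w_hat = fista_lasso(X_aug, y_aug, alpha * rho)
```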


Page 14: MVPA with SpaceNet: sparse structured priors

2 Speedup via univariate screening

Whereby we detect and remove irrelevant voxels before the optimization problem is even entered!


Page 15: MVPA with SpaceNet: sparse structured priors

2 Xᵀy maps: relevant voxels stick out

[Figure: Xᵀy maps thresholded at 100%, 50%, and 20% of the brain volume (axial slice y = 20; colorbar from -75 to 75)]

The 20% mask retains the 3 bright blobs we would expect to get...

... but contains far fewer voxels ⇒ less run-time.



Page 19: MVPA with SpaceNet: sparse structured priors

2 Our screening heuristic

t_p := p-th percentile of the vector |Xᵀy|. Discard voxel j if |X_jᵀ y| < t_p.

[Figure: screening masks keeping k = 100%, 50%, and 20% of the voxels (axial slice y = 20; colorbar from -75 to 75)]

This is marginal screening [Lee 2014], but without the (invertibility) restriction k ≤ min(n, p).

The regularization will do the rest...
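
A sketch of this heuristic in NumPy (the percentile value and the toy X, y from the earlier sketches are illustrative):

```python
import numpy as np

def screen_voxels(X, y, percentile=80.0):
    """Keep only the voxels j whose |X_j^T y| reaches the given
    percentile of |X^T y|; discard the rest before optimization."""
    relevance = np.abs(X.T @ y)
    threshold = np.percentile(relevance, percentile)
    support = relevance >= threshold     # boolean mask of retained voxels
    return X[:, support], support

# e.g. keep the top 20% of voxels before entering the solver
X_small, support = screen_voxels(X, y, percentile=80.0)
```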

Page 20: MVPA with SpaceNet: sparse structured priors

2 Our screening heuristic

See [DOHMATOB 2015 (PRNI)] for a more detailed exposition of the speedup heuristics developed.


Page 21: MVPA with SpaceNet: sparse structured priors

2 Automatic model selection via cross-validation

Regularization parameters:
0 < αL < ... < α3 < α2 < α1 = αmax

Mixing constants:
0 < ρM < ... < ρ3 < ρ2 < ρ1 ≤ 1

Thus an L × M grid to search over for the best parameters:

(α1, ρ1) (α1, ρ2) (α1, ρ3) ... (α1, ρM)
(α2, ρ1) (α2, ρ2) (α2, ρ3) ... (α2, ρM)
(α3, ρ1) (α3, ρ2) (α3, ρ3) ... (α3, ρM)
  ...      ...      ...    ...    ...
(αL, ρ1) (αL, ρ2) (αL, ρ3) ... (αL, ρM)
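
A hypothetical construction of such a grid (the choice αmax = ‖Xᵀy‖∞ is the usual Lasso-style upper bound and an assumption here, as are the candidate values; X, y are the toy data from the earlier sketches):

```python
import numpy as np

alpha_max = np.abs(X.T @ y).max()          # assumed Lasso-style alpha_max
alphas = np.logspace(np.log10(alpha_max),
                     np.log10(1e-3 * alpha_max), num=10)   # L = 10
l1_ratios = [0.25, 0.5, 0.75, 0.9]                         # M = 4
grid = [(a, r) for a in alphas for r in l1_ratios]         # L x M pairs
```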


Page 22: MVPA with SpaceNet: sparse structured priors

2 Automatic model selection via cross-validation

The final model uses the average of the per-fold best weights maps (bagging).

This bagging strategy yields more stable and robust weights maps.
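
A sketch of the bagging step (a fixed α is used here for brevity; the actual procedure uses each fold's best (α, ρ) from the grid search; fista_lasso and the toy X, y come from the earlier sketches):

```python
import numpy as np
from sklearn.model_selection import KFold

fold_weights = []
for train_idx, _test_idx in KFold(n_splits=5).split(X):
    # fit on this fold's training split (stand-in for the per-fold best model)
    w_fold = fista_lasso(X[train_idx], y[train_idx], alpha=0.5)
    fold_weights.append(w_fold)

w_final = np.mean(fold_weights, axis=0)   # bagged, more stable weights map
```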


Page 23: MVPA with SpaceNet: sparse structured priors

3 Some experimental results


Page 24: MVPA with SpaceNet: sparse structured priors

3 Weights: SpaceNet versus SVM

Faces vs objects classification on [Haxby 2001]

[Figure: Smooth-Lasso weights, TV-L1 weights, and SVM weights]

Page 25: MVPA with SpaceNet: sparse structured priors

3 Classification scores: SpaceNet versus SVM


Page 26: MVPA with SpaceNet: sparse structured priors

3 Concluding remarks

SpaceNet enforces both sparsity and structure, leading to better prediction / classification scores and more interpretable brain maps.

The code runs in ∼15 minutes for “simple” datasets, and ∼30 minutes for very difficult datasets.

In the next release, SpaceNet will feature as part of Nilearn [Abraham et al. 2014]: http://nilearn.github.io
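
For reference, a hedged sketch of what SpaceNet decoding looks like through the Nilearn interface (this roughly follows the nilearn.decoding module on the [Haxby 2001] data; exact names and defaults may evolve across releases, and the snippet downloads data):

```python
import numpy as np
import pandas as pd
from nilearn.datasets import fetch_haxby
from nilearn.image import index_img
from nilearn.decoding import SpaceNetClassifier

haxby = fetch_haxby()
behavior = pd.read_csv(haxby.session_target[0], sep=" ")
condition_mask = behavior["labels"].isin(["face", "house"]).values
func_img = index_img(haxby.func[0], np.where(condition_mask)[0])
y = behavior["labels"].values[condition_mask]

decoder = SpaceNetClassifier(penalty="tv-l1", mask=haxby.mask_vt[0])
decoder.fit(func_img, y)          # grid search over (alpha, rho) + bagging
coef_img = decoder.coef_img_      # weights map as a Nifti image
```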



Page 28: MVPA with SpaceNet: sparse structured priors

3 Why do Xᵀy maps give a good relevance measure?

For an orthogonal design, the least-squares solution is
wLS = (XᵀX)⁻¹Xᵀy = I⁻¹Xᵀy = Xᵀy
⇒ (intuition) Xᵀy bears some information about the optimal solution, even for general X.

Marginal screening: set S = indices of the top k voxels j in terms of |X_jᵀ y| values. In [Lee 2014], k ≤ min(n, p), so that wLS ∼ (X_Sᵀ X_S)⁻¹ X_Sᵀ y.

We don't require the invertibility condition k ≤ min(n, p); our spatial regularization will do the rest!

Lots of screening rules are out there: [El Ghaoui 2010, Liu 2014, Wang 2015, Tibshirani 2010, Fercoq 2015]
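
The orthonormal-design intuition is easy to verify numerically (the toy dimensions below are illustrative):

```python
import numpy as np

# For an orthonormal design (X^T X = I), least squares returns exactly X^T y
rng = np.random.RandomState(0)
X_orth, _ = np.linalg.qr(rng.randn(50, 50))   # orthonormal design matrix
w = rng.randn(50)
y_demo = X_orth @ w
w_ls = np.linalg.lstsq(X_orth, y_demo, rcond=None)[0]
assert np.allclose(w_ls, X_orth.T @ y_demo)   # X^T y recovers w_LS
```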

