Seeking Interpretable Models for High Dimensional Data
Bin Yu
Statistics Department, EECS Department, University of California, Berkeley
http://www.stat.berkeley.edu/~binyu
Basically, I am not interested in doing research and I never have been. I am interested in understanding, which is quite a different thing.
David Blackwell (1919-- )
Characteristics of Modern Data Problems
Goal: efficient use of data for:
– Prediction
– Interpretation (proxy: sparsity)
Larger number of variables:
– Number of variables (p) in data sets is large
– Sample sizes (n) have not increased at same pace
Scientific opportunities:
– New findings in different scientific fields
– Insightful analytical discoveries
Today’s Talk
Understanding early visual cortex area V1 through fMRI
Occam’s Razor
M-estimation with decomposable regularization: L2-error bounds via a unified approach
Learning about V1 through sparse models (pre-selection via corr, sparse linear, sparse nonlinear)
Future work
Understanding visual pathway
Gallant Lab at UCB is a leading vision lab.
Da Vinci (1452-1519) Mapping of different visual cortex areas
Polyak (1957) Small left middle grey area: V1
Understanding visual pathway through fMRI
One goal at Gallant Lab: understand how natural images relate to fMRI signals
Gallant Lab in Nature News
Published online 5 March 2008 | Nature | doi:10.1038/news.2008.650
Mind-reading with a brain scan
Brain activity can be decoded using magnetic resonance imaging.
Kerri Smith
Scientists have developed a way of ‘decoding’ someone’s brain activity to determine what they are looking at.
“The problem is analogous to the classic ‘pick a card, any card’ magic trick,” says Jack Gallant, a neuroscientist at the University of California in Berkeley, who led the study.
Stimuli
Natural image stimuli
Stimulus to fMRI response
Natural image stimuli drawn randomly from a database of 11,499 images
Experiment designed so that responses from different presentations are nearly independent
fMRI response is pre-processed and roughly Gaussian
Gabor Wavelet Pyramid
Features
“Neural” (fMRI) encoding for visual cortex V1
Predictor: p=10,921 features of an image
Response: (preprocessed) fMRI signal at a voxel
n=1750 samples
Goal: understanding human visual system
– interpretable (sparse) model desired
– good prediction is necessary
Minimization of an empirical loss (e.g. L2) leads to
• ill-posed computational problem, and
• bad prediction
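With n = 1750 samples and p = 10,921 features, the quadratic loss is flat along many directions. A minimal numpy sketch (toy sizes standing in for the real dimensions; nothing below comes from the actual data) illustrates the rank deficiency behind the ill-posedness:

```python
import numpy as np

# Toy illustration of why unregularized least squares is ill-posed when n < p:
# the p x p Gram matrix X'X has rank at most n, so the L2 loss has infinitely
# many exact minimizers. The real problem has n = 1750, p = 10,921.
rng = np.random.default_rng(0)
n, p = 50, 200                           # small stand-ins for 1750 and 10921
X = rng.standard_normal((n, p))

print(np.linalg.matrix_rank(X))          # at most n = 50
print(np.linalg.matrix_rank(X.T @ X))    # same rank: the 200 x 200 Gram matrix is singular
```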
Linear Encoding Model by Gallant Lab
Data
– X: p=10921 dimensions (features)
– Y: fMRI signal
– n = 1750 training samples
Separate linear model for each voxel via e-L2boosting (or Lasso)
Fitted model tested on 120 validation samples
– Performance measured by predicted squared correlation
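As a rough sketch of this per-voxel fit-and-validate loop (the Lasso stands in for e-L2boosting, as the slide allows; the function name, the arrays, and the LassoCV settings are illustrative, not the lab's code):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_voxel_encoding_model(X_train, y_train, X_val, y_val):
    """Sparse linear encoding model for one voxel, scored by the predicted
    squared correlation on the 120 held-out validation images."""
    model = LassoCV(cv=5, n_alphas=50, max_iter=5000).fit(X_train, y_train)
    y_hat = model.predict(X_val)
    r = np.corrcoef(y_hat, y_val)[0, 1]   # correlation of predicted vs. observed responses
    return model, r ** 2                  # r^2 is the performance measure on the slide
```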
Modeling “history” at Gallant Lab
Prediction on validation set is the benchmark
Methods tried: neural nets, SVMs, e-L2boosting (Lasso)
Among models with similar predictions, simpler (sparser) models by e-L2boosting are preferred for interpretation
This practice reflects a general trend in statistical machine learning: moving beyond prediction alone to simpler/sparser models for interpretation, faster computation, or data transmission.
Occam’s Razor
14th-century English logician and Franciscan friar, William of Ockham
Principle of Parsimony:
Entities must not be multiplied beyond necessity.
Wikipedia
Occam’s Razor via Model Selection in Linear Regression
• Maximum likelihood (ML) is least squares (LS) under the Gaussian assumption
• There are 2^p submodels (one for each subset of the p predictors)
• ML goes for the largest submodel with all predictors
• Largest model often gives bad prediction for p large
Model Selection Criteria
Akaike's AIC (1973, 1974) and Mallows' Cp use an estimated prediction error to choose a model.
Schwarz's BIC (1978) penalizes model size more heavily, via a Bayesian approximation.
Both are penalized LS with an L0 “norm” penalty (the number of selected predictors).
Rissanen’s (1978) Minimum Description Length (MDL) principle gives rise to many different criteria. The two-part code leads to BIC.
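The criteria themselves were lost in the transcript; in their standard Gaussian linear-model form (S the selected subset, |S| its size, \hat\sigma^2 a noise-variance estimate) they read:

```latex
\mathrm{AIC}/C_p:\quad \min_{S}\ \|y - X_S\hat\beta_S\|_2^2 + 2\,\hat\sigma^2\,|S|,
\qquad
\mathrm{BIC}:\quad \min_{S}\ \|y - X_S\hat\beta_S\|_2^2 + \log(n)\,\hat\sigma^2\,|S|,
```

i.e. penalized least squares with an L0 “norm” penalty |S| = ||β||_0 counting the selected predictors.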
Model Selection for image-fMRI problem
For the linear encoding model, the number of submodels is 2^p = 2^10,921
Combinatorial search: too expensive and often not necessary
A recent alternative: continuous embedding into a convex optimization problem through L1-penalized LS (Lasso) -- a third generation computational method in statistics or machine learning.
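A tiny synthetic illustration of why the combinatorial route does not scale: even scoring all 2^p submodels with BIC is only feasible for very small p (here p = 8, so 256 submodels); none of this comes from the fMRI data.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.5, 0, 0, 1.0, 0, 0, 0])
y = X @ beta + rng.standard_normal(n)

def bic(subset):
    """BIC of the least-squares fit on a subset of columns (no intercept, for simplicity)."""
    if not subset:
        rss = np.sum(y ** 2)
    else:
        Xs = X[:, list(subset)]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ coef) ** 2)
    return n * np.log(rss / n) + np.log(n) * len(subset)

all_subsets = [s for k in range(p + 1) for s in combinations(range(p), k)]
best = min(all_subsets, key=bic)
print(len(all_subsets), "submodels scored; BIC selects features", best)   # 2^8 = 256 submodels
```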
Lasso: L1-norm as a penalty
The L1 penalty on the coefficients is ||β||_1 = Σ_j |β_j|
Used initially with the L2 loss:
– Signal processing: Basis Pursuit (Chen & Donoho, 1994)
– Statistics: Non-Negative Garrote (Breiman, 1995)
– Statistics: LASSO (Tibshirani, 1996)
Properties of the Lasso:
– Sparsity (variable selection) and regularization
– Convexity (convex relaxation of the L0 penalty)
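In symbols, the estimator referred to throughout (a standard statement, supplied here because the slide's own formula did not survive):

```latex
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta \in \mathbb{R}^p}
\ \frac{1}{2n}\,\|y - X\beta\|_2^2 \;+\; \lambda\,\|\beta\|_1 ,
\qquad \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j| .
```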
Lasso: computation and evaluation
The “right” tuning parameter λ is unknown, so a “path” over λ is needed (discretized or continuous)
Initially: a quadratic program (QP) solved separately for each λ on a grid
Later: path-following algorithms for moderate p:
– homotopy by Osborne et al (00) and LARS by Efron et al (04)
– first-order algorithms for really large p (…)
Theoretical studies: much recent work on the Lasso in terms of
– L2 prediction error
– L2 error of the parameter
– model selection consistency
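A minimal sketch of computing the whole Lasso path with LARS (Efron et al., 2004) via scikit-learn's lars_path; the synthetic problem is illustrative only:

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic sparse regression problem (sizes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
n, p, s = 200, 1000, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(n)

# lars_path returns the breakpoints (alphas) of the piecewise-linear Lasso path
# and the coefficients at each breakpoint, so no grid over lambda is needed.
alphas, _, coefs = lars_path(X, y, method="lasso")
print("path knots:", len(alphas))
print("nonzeros at the smallest penalty:", int(np.sum(coefs[:, -1] != 0)))
```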
Regularized M-estimation (including the Lasso)
Estimation: minimize a loss function plus a regularization term
Goal: for some error metric (e.g. L2 error), bound in this metric the difference between the estimate and the true parameter, in a high-dimensional scaling where (n, p) tend to infinity.
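In the notation used for this unified analysis (a standard formulation assumed here, with L_n the empirical loss on data Z_1^n, r the regularizer, and λ_n the tuning parameter):

```latex
\hat{\theta}_{\lambda_n} \;\in\; \arg\min_{\theta \in \mathbb{R}^p}
\Big\{ \mathcal{L}_n(\theta; Z_1^n) \;+\; \lambda_n\, r(\theta) \Big\},
\qquad
\text{goal: bound } \|\hat{\theta}_{\lambda_n} - \theta^{*}\|
\ \text{as } (n,p) \to \infty .
```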
Example 1: Lasso (sparse linear model)
Some past work: Tropp, 2004; Fuchs, 2000; Meinshausen/Buhlmann, 2006; Candes/Tao, 2005; Donoho, 2005; Zhao & Yu, 2006; Zou, 2006; Wainwright, 2009; Koltchinskii, 2007; Tsybakov et al., 2007; van de Geer, 2007; Zhang and Huang, 2008; Meinshausen and Yu, 2009, Bickel et al., 2008 ....
Example 2: Low-rank matrix approximation
Some past work: Frieze et al, 1998; Achlioptas & McSherry, 2001; Fazel & Boyd, 2001; Srebro et al, 2004; Drineas et al, 2005; Rudelson & Vershynin, 2006; Recht et al, 2007; Bach, 2008; Candes and Tao, 2009; Halko et al, 2009; Keshavan et al, 2009; Negahban & Wainwright, 2009
Example 3: Structured (inverse) Cov. Estimation
Some past work: Yuan and Lin, 2006; d’Aspremont et al, 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al, 2007; Zhou et al, 2007; Friedman et al, 2008; Ravikumar et al, 2008; Lam and Fan, 2009; Cai and Zhou, 2009
Unified Analysis
Many high-dimensional models, with associated results derived case by case
Is there a core set of ideas that underlie these analyses?
We discovered that two key conditions are needed for a unified error bound:
– Decomposability of regularizer r (e.g. L1-norm)
– Restricted strong convexity of loss function (e.g. Least Squares loss)
Why can we estimate parameters?
Asymptotic analysis of the maximum likelihood estimator or OLS:
1. Fisher information (Hessian matrix) is non-degenerate, i.e. there is strong convexity in all directions.
2. Sampling noise disappears as the sample size gets large.
In high dimensions, and when r corresponds to structure (e.g. sparsity)
When p >n as in the fMRI problem, LS is flat in many directions or it is impossible to have strong convexity in all directions.
Regularization is needed.
When the regularization parameter λ_n is large enough to overcome the sampling noise, we have a deterministic situation:
• The decomposable regularization norm forces the estimation difference into a constraint or “cone” set.
• This constraint set is small relative to R^p when p is large.
• Strong convexity is needed only over this constraint set.
• When predictors are random and Gaussian (dependence ok), strong convexity does hold for LS over the L1-norm induced constraint set.
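For the exactly sparse case, the two ingredients can be written out as follows (the constant 3 and the choice of λ_n follow the standard Negahban et al. formulation; this is a sketch, not the slide's own statement):

```latex
\text{If } \lambda_n \ge 2\, r^{*}\!\big(\nabla \mathcal{L}_n(\theta^{*})\big)
\ \text{(dual norm of the score), then } \Delta = \hat{\theta}-\theta^{*}
\text{ lies in the cone } \\
\mathbb{C}(S) \;=\; \big\{\, \Delta : \ \|\Delta_{S^{c}}\|_1 \le 3\,\|\Delta_{S}\|_1 \,\big\},\\[4pt]
\text{and restricted strong convexity asks that, for all } \Delta \in \mathbb{C}(S), \\
\mathcal{L}_n(\theta^{*}+\Delta) - \mathcal{L}_n(\theta^{*})
 - \langle \nabla \mathcal{L}_n(\theta^{*}), \Delta \rangle \;\ge\; \kappa\, \|\Delta\|_2^{2}.
```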
Main Result
More work is required for each case: verify restricted strong convexity and find a λ_n large enough to overcome the sampling noise (concentration inequalities)
Summary of unified analysis
Recovered existing results:
– sparse linear regression with Lasso
– multivariate group Lasso
– sparse inverse cov. estimation
Derived new results:
– weak sparse linear model with Lasso
– low rank matrix estimation
– generalized linear models
Back to the image-fMRI problem: linear sparse encoding model (Gallant Lab’s approach)
Separate linear model for each voxel Y = Xb + e
Model fitting via e-L2boosting and stopping by CV
– X: p=10921 dimensions (features)
– n = 1750 training samples; effective sample size n/log(p) = 1750/log(10921) ≈ 188, roughly a 10-fold loss
– Irrepresentable condition hard to check; there is much dependence among features, so the Lasso picks more variables than needed…
Fitted model tested on 120 validation samples (not used in fitting)
Performance measured by sq. correlation
Our story on image-fMRI problem
Episode 1: linear sparsity via pre-selection by correlation - localization
Fitting details:
Regularization parameter selected by 5-fold cross validation
L2 boosting applied to all 10,000+ features -- L2 boosting is the method of choice in Gallant Lab
Other methods applied to 500 features pre-selected by correlation
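A minimal numpy sketch of this correlation pre-selection step (the helper name is illustrative; X is the training feature matrix and y one voxel's response, with k = 500 as on the slide):

```python
import numpy as np

def preselect_by_correlation(X, y, k=500):
    """Return the indices of the k features with the largest absolute
    Pearson correlation with the response, plus those correlations."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    )
    top = np.argsort(-np.abs(corr))[:k]
    return top, corr[top]
```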
Other methods
k = 1: semi OLS (theoretically better than OLS)
k = 0: ridge
k = −1: semi OLS (inverse)
The first and third are semi-supervised because they use the population covariance of the features.
Validation correlation
voxel  OLS  e-L2boost  Semi OLS (k=1)  Ridge (k=0)  Semi OLS inverse (k=−1)
1 0.434 0.738 0.572 0.775 0.773
2 0.228 0.731 0.518 0.725 0.731
3 0.320 0.754 0.607 0.741 0.754
4 0.303 0.764 0.549 0.742 0.737
5 0.357 0.742 0.544 0.763 0.763
6 0.325 0.696 0.520 0.726 0.734
7 0.427 0.739 0.579 0.728 0.736
8 0.512 0.741 0.608 0.747 0.743
9 0.247 0.746 0.571 0.754 0.753
10 0.458 0.751 0.606 0.761 0.758
11 0.288 0.778 0.610 0.801 0.801
Comparison of the feature locations
Semi methods (corr pre-selection) vs. L2boost
Our Story (cont)
Episode 2: Nonlinearity
Indication of nonlinearity via residual plots (1331 voxels)
Our Story (cont)
Episode 2: search for nonlinearity via V-SPAM (non-linear sparsity)
Additive Models (Hastie and Tibshirani, 1990):
Y_i = Σ_{j=1}^{p} f_j(X_{ij}) + ε_i,  i = 1, …, n
Sparse: f_j ≡ 0 for most j
High dimensional: p > n
SPAM (Sparse Additive Models) by Ravikumar, Lafferty, Liu, Wasserman (2007)
Related work: COSSO, Lin and Zhang (2006)
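A compact sketch of the SPAM backfitting idea (smooth each partial residual, then soft-threshold the fitted component's empirical norm so whole features drop out). The simple polynomial smoother stands in for the nonparametric smoother of Ravikumar et al.; all names and defaults are illustrative.

```python
import numpy as np

def smooth_1d(x, r, degree=3):
    """Polynomial-basis smoother, a stand-in for a proper 1-d nonparametric smoother."""
    B = np.vander(x, degree + 1)                  # n x (degree+1) basis
    coef, *_ = np.linalg.lstsq(B, r, rcond=None)
    return B @ coef

def spam_backfit(X, y, lam, n_iter=20):
    """Sketch of SPAM backfitting (Ravikumar et al., 2007): estimate each f_j by
    smoothing its partial residual, then shrink it toward zero by soft-thresholding
    its empirical norm; columns of the returned matrix that stay identically zero
    correspond to features dropped from the additive model."""
    n, p = X.shape
    F = np.zeros((n, p))                          # fitted components f_j(X_ij)
    y_c = y - y.mean()
    for _ in range(n_iter):
        for j in range(p):
            r_j = y_c - F.sum(axis=1) + F[:, j]   # partial residual for feature j
            p_j = smooth_1d(X[:, j], r_j)         # estimate of E[residual | X_j]
            s_j = np.sqrt(np.mean(p_j ** 2))
            shrink = max(0.0, 1.0 - lam / (s_j + 1e-12))
            F[:, j] = shrink * (p_j - p_j.mean()) # soft-threshold and center
    return F
```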
V-SPAM encoding model (1331 voxels from V1) (Ravikumar, Vu, Gallant Lab, BY, NIPS08)
For each voxel,
Start with 10921 features
Pre-selection of 100 features via correlation
Fitting of SPAM to 100 features with AIC stopping
Pooling of SPAM fitted features according to location and frequency to form pooled features
Fitting SPAM to 100 features and pooled features with AIC stopping
Prediction performance (R2)
median improvement 12%.
Nonlinearities
Compressive effect (finite energy supply)
Our recent story:
Episode 3: computation speed-up with cubic smoothing splines
– previously 100 features pre-selected; now 500 features pre-selected for V-SPAM (2010)
Power transformation of features
20% improvement of V-SPAM over sparse linear
V-SPAM is sparser: 20+ features
Power transformation of features
V-SPAM (non-linear) vs Sparse Linear
Spatial tuning
Sparse Linear V-SPAM (non-linear)
Power transformation of features
Decoding accuracy (%) with perturbation noise of various levels
Sparse linear
V-SPAM
Power transformation of features
V-SPAM (non-linear) vs Sparse Linear
Frequency and orientation tuning
Sparse Linear V-SPAM (non-linear)
Summary of fMRI project
Discovered a meaningful non-linear compressive effect in V1 fMRI responses (via EDA and machine learning)
Why is this effect real?
– 20% median prediction improvement over sparse linear
– V-SPAM is sparser (20+ features) than sparse linear (100+)
– Different tuning properties from the sparse linear model
– Improved decoding performance over sparse linear
Future and on-going work
V4 challenge:
How to build models for V4 where memory and attention matter?
Multi-response regression to take into account correlation among voxels
Adaptive filters
Acknowledgements
Co-authors
S. Negahban, P. Ravikumar, M. Wainwright (unified analysis)
V. Vu, P. Ravikumar, K. Kay, T. Naselaris, J. Gallant (fMRI)
Funding
Guggenheim (06-07)
NSF
ARO