
Page 1: Efficient and Numerically Stable Sparse Learning

Efficient and Numerically Stable Sparse Learning

Sihong Xie¹, Wei Fan², Olivier Verscheure², and Jiangtao Ren³

¹ University of Illinois at Chicago, USA
² IBM T.J. Watson Research Center, New York, USA
³ Sun Yat-Sen University, Guangzhou, China

Page 2: Efficient and Numerically Stable Sparse Learning

Applications

• Signal processing (compressive sensing, MRI, coding, etc.)

• Computational biology (DNA array sensing, gene expression pattern annotation)

• Geophysical data analysis

• Machine learning

Page 3: Efficient and Numerically Stable Sparse Learning

Algorithms

• Greedy selection
  – Via L0 regularization
  – Boosting, forward feature selection: not suited to large-scale problems

• Convex optimization
  – Via L1 regularization (e.g., the Lasso; see the objective below)
  – IPM (interior point method): medium-size problems
  – Homotopy methods: full regularization path computation
  – Gradient descent
  – Online algorithms (stochastic gradient descent)
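For reference, the L1-regularized least-squares (Lasso) objective mentioned above can be written in its standard textbook form (λ is the regularization weight; this notation is mine, not the slide's):

\min_{w \in \mathbb{R}^d} \; \tfrac{1}{2} \| y - Xw \|_2^2 + \lambda \| w \|_1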

Page 4: Efficient and Numerically Stable Sparse Learning

Rising awareness of Numerical Problems in ML

• Efficiency
  – SVMs: going beyond black-box optimization solvers
  – Large-scale problems: parallelization
  – Eigenvalue problems: randomization

• Stability
  – Gaussian process computation: solving large systems of linear equations, matrix inversion
  – Convergence of gradient descent: matrix iteration computation

• For more topics in numerical mathematics for ML, see the ICML Workshop on Numerical Methods in Machine Learning, 2009

Page 5: Efficient and Numerically Stable Sparse Learning

Stability in Sparse Learning

• Iterative Hard Thresholding (IHT)
  – Solves the sparsity-constrained optimization problem (reconstructed below)
  – Incorporates gradient descent with hard thresholding
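The standard formulation and update that IHT refers to, as a reconstruction in my own notation (X: data matrix, y: targets, s: sparsity level, η: step size), are:

\min_{w} \; \| y - Xw \|_2^2 \quad \text{subject to} \quad \| w \|_0 \le s

w^{t+1} = H_s\left( w^t + \eta \, X^\top (y - X w^t) \right)

where H_s(·) keeps the s largest-magnitude components and sets the rest to zero.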

Page 6: Efficient and Numerically Stable Sparse Learning

Stability in Sparse Learning

• Iterative Hard Thresholding (IHT)
  – Simple and scalable
  – Under the RIP assumption, previous work [BDIHT09, GK09] shows that iterative hard thresholding converges.
  – Without an assumption on the spectral radius of the iteration matrix, such methods may diverge.

Page 7: Efficient and Numerically Stable Sparse Learning

Stability in Sparse Learning

• Gradient descent as a matrix iteration

• Error vector

• Error vector of IHT (reconstructions of these quantities are given below)
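Rendered in the notation introduced above, and assuming the noiseless case y = Xw* (a standard reconstruction, not necessarily the slide's exact equations):

Gradient descent as a matrix iteration:
w^{t+1} = w^t + \eta X^\top (y - X w^t) = (I - \eta X^\top X) w^t + \eta X^\top y

Error vector e^t = w^t - w^*:
e^{t+1} = (I - \eta X^\top X) e^t

Error vector of IHT: the hard-thresholding step confines the update to a support of size s, so the error evolves through restricted submatrices of the iteration matrix, roughly e^{t+1} \approx (I - \eta X_S^\top X_S) e^t on the current support S; the iteration contracts only when the spectral radius of this matrix is below one.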

Page 8: Efficient and Numerically Stable Sparse Learning

Stability in Sparse Learning

• Mirror Descent Algorithm for Sparse Learning (SMIDAS)
  – Maintains a dual vector and a primal vector, alternating two steps (see the sketch below):

  1. Recover the predictor (primal vector) from the dual vector

  2. Gradient descent on the dual vector, followed by soft-thresholding
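A minimal Python sketch of one such step, using the p-norm link associated with SMIDAS and a logistic loss; the function names, the loss choice, and the exact parameterization are illustrative assumptions rather than the paper's code:

import numpy as np

def link(theta, p):
    # p-norm link: w_j = sign(theta_j) * |theta_j|**(p-1) / ||theta||_p**(p-2)
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def smidas_step(theta, x, y, p, eta, lam):
    # One sketched SMIDAS step on example (x, y) with y in {-1, +1}.
    w = link(theta, p)                                 # 1. recover the primal predictor from the dual vector
    grad = -y * x / (1.0 + np.exp(y * np.dot(w, x)))   # gradient of log(1 + exp(-y * w.x)) w.r.t. w
    theta = theta - eta * grad                         # 2a. gradient step, applied to the dual vector
    theta = np.sign(theta) * np.maximum(np.abs(theta) - eta * lam, 0.0)  # 2b. soft-thresholding
    return theta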

Page 9: Efficient and Numerically Stable Sparse Learning

Stability in Sparse Learning

• Elements of the primal vector are exponentially sensitive to the corresponding elements of the dual vector (d is the dimensionality of the data).

• Due to limited floating-point precision, small components are omitted when the primal vector is computed, even though it is needed in prediction. (A toy illustration follows below.)
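A toy illustration of this effect with my own numbers (not the slide's example): with p = 2 ln(d) ≈ 33, a factor-of-10 gap between two dual components turns into a factor of about 10^32 between the corresponding primal components, far beyond the ~2.2e-16 relative precision of double-precision floats, so the smaller feature effectively disappears from the prediction.

import numpy as np

p = 33                                   # p = 2 ln(d), the setting used on the later experiment slide
theta = np.array([1.0, 0.1])             # two dual components differing by a factor of 10
w = np.abs(theta) ** (p - 1)             # unnormalized primal magnitudes: [1.0, 1e-32]
print(w[1] / w[0])                       # ~1e-32
print(w[0] + w[1] == w[0])               # True: the small component is lost at double precision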

Page 10: Efficient and Numerically Stable Sparse Learning

Stability in Sparse Learning

• Example: suppose the data are …

Page 11: Efficient and Numerically Stable Sparse Learning

Efficiency of Sparse Learning

• Sparse models
  – Lower computational cost
  – Lower generalization bound

• Existing sparse learning algorithms may not trade off well between sparsity and accuracy: overly complicated models are produced, with lower accuracy.

• Can we get accurate models with higher sparsity?

For a theoretical treatment of the accuracy-sparsity trade-off, see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.

Page 12: Efficient and Numerically Stable Sparse Learning

The proposed method

Perceptron + soft-thresholding (a minimal sketch follows below)

• Motivation
  – Soft-thresholding: L1 regularization for a sparse model
  – Perceptron:
    1. Avoids updates when the current features are already able to predict well ("don't complicate the model when it is unnecessary")
    2. Convergence under soft-thresholding and limited precision (Lemma 2 and Theorem 1)
    3. Compression (Theorem 2)
    4. Generalization error bound (Theorem 3)
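A minimal sketch of the perceptron + soft-thresholding idea described above (not the paper's exact algorithm; the learning rate, threshold, and streaming interface are illustrative assumptions):

import numpy as np

def soft_threshold(w, tau):
    # Componentwise soft-thresholding: the L1 shrinkage step.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def sparse_perceptron(stream, d, eta=1.0, tau=0.01):
    # `stream` yields (x, y) pairs with y in {-1, +1}; eta and tau are illustrative values.
    w = np.zeros(d)
    for x, y in stream:
        if y * np.dot(w, x) <= 0:       # perceptron: update only on a mistake,
            w = w + eta * y * x         # so the model is not complicated unnecessarily
            w = soft_threshold(w, tau)  # L1 shrinkage keeps the model sparse
    return w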

Page 13: Efficient and Numerically Stable Sparse Learning

Experiments

• Datasets: Large Scale Contest (http://largescale.first.fraunhofer.de/instructions/)

Page 14: Efficient and Numerically Stable Sparse Learning

Experiments: Divergence of IHT

• For IHT to converge, the spectral radius of the iteration matrix must be below one.

• The iteration matrices found in practice do not meet this condition.

• For IHT (GraDes) with the learning rate set to 1/3 and 1/100, respectively, we found … (an illustrative check of the condition is sketched below)
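One way to check the condition on given data, sketched in Python (random data standing in for the benchmark datasets; the shapes, support size, and column normalization are my assumptions):

import numpy as np

# For the gradient iteration inside IHT to contract on a support S of size s,
# the restricted iteration matrix M_S = I - eta * X_S^T X_S should have spectral norm < 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))
X /= np.linalg.norm(X, axis=0)           # normalize columns
S = rng.choice(X.shape[1], size=50, replace=False)
XS = X[:, S]
for eta in (1/3, 1/100):                 # the two learning rates mentioned above
    M = np.eye(len(S)) - eta * XS.T @ XS
    print(eta, np.linalg.norm(M, 2))     # a value >= 1 means the iteration may fail to contract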

Page 15: Efficient and Numerically Stable Sparse Learning

Experiments: Numerical Problem of MDA

• Train models with 40% density.

• The parameter p is set to 2 ln(d) (p = 33) and 0.5 ln(d), respectively.

• Report the percentage of the model's elements whose exponent lies within [e_m, e_m - 52], where e_m is the largest exponent and 52 matches the mantissa width of double precision; this indicates how many features will be lost during prediction.

• The dynamic range indicates how wildly the elements of the model can vary.

Page 16: Efficient and Numerically Stable Sparse Learning

Experiments: Numerical Problem of MDA

• How the parameter p = O(ln(d)) affects performance
  – Smaller p: the algorithm acts more like ordinary stochastic gradient descent [GL1999]
  – Larger p: causes truncation during prediction
  – When the dimensionality is high, MDA becomes numerically unstable.

[GL1999] Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pages 1–11. ACM Press, New York, NY, 1999.

Page 17: Efficient and Numerically Stable Sparse Learning

Experiments: Overall Comparison

• The proposed algorithm + 3 baseline sparse learning algorithms (all with the logistic loss function)
  – SMIDAS (MDA based [ST2009])
  – TG (Truncated Gradient [LLZ2009])
  – SCD (Stochastic Coordinate Descent [ST2009])

• Parameter tuning: run each algorithm 10 times and report the average accuracy on the validation set.

[ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1-regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929–936, 2009.
[LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

Page 18: Efficient and Numerically Stable Sparse Learning

Experiments: Overall Comparison

• Accuracy under the same model density
  – First 7 datasets: at most 40% of features
  – Webspam: select at most 0.1% of features
  – Stop running the program when the maximum percentage of features has been selected

Page 19: Efficient and Numerically Stable Sparse Learning

Experiments: Overall Comparison

• Accuracy vs. sparsity
  – The proposed algorithm works consistently better than the other baselines.
  – On 3 out of 5 tasks, it stopped updating the model before reaching the maximum density (40% of features).
  – On task 1, it outperforms the others with 10% fewer features.
  – On task 3, it ties with the best baseline while using 20% fewer features.
  – On tasks 1-7, SMIDAS: the smaller p, the better the accuracy, but it is beaten by all other algorithms.

(Slide callouts on the comparison: numerically unstable; sparse; generalizability; convergence)

Page 20: Efficient and Numerically Stable Sparse Learning

Conclusion