Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
3rd Place in the 2010 INFORMS Data Mining Contest

Nan Zhou
Department of Statistics, University of Pittsburgh
November 9, 2010, Austin
Outline

- Overview
- Basic Analysis
  - Pre-Processing
  - Logistic Regression
- Variable Selection Methods
  - Generalized Linear Model
  - Variable Selection: Traditional Methods
  - Variable Selection: L1-Penalized Methods
- Results
  - Using Future Information
  - How It Fails WITHOUT Using Future Information
- Future Work
Overview

First Model: Support Vector Machine (kernel SVM)
  September 22, 2010; Public MCE: 0.771654; Result: 0.822041

Logistic Regression:
  September 23, 2010; Public MCE: 0.951764; Result: 0.966906
  - The models always choose variable 74
  - Multivariate logistic regression is worse

Try LASSO: Logistic Regression + Variable Selection
  September 27, 2010; Public MCE: 0.952942; Result: 0.967112

Try Others: SPCA, AdaBoost, Gradient Boosting, Neural Network, Random Forest, etc.

Finally:
  - More information from data at different lags
  - Variable selection: LASSO, Two-Stage Selection
Basic Analysis
Pre-Processing
- Missing data: variables 167 to 180, variable 157
- Predictors: $X_{it} = \dfrac{S_{i,t+60} - S_{it}}{S_{it}}$
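As a rough illustration, this predictor construction can be sketched in R (a minimal sketch; the price matrix S and the 60-second horizon are stand-ins for the contest's actual data layout):

    # Sketch: X_it = (S_{i,t+60} - S_it) / S_it
    # S: numeric matrix of last prices, rows = time t in seconds, cols = stocks i
    make_predictors <- function(S, horizon = 60) {
      n <- nrow(S)
      S_future <- S[(1 + horizon):n, , drop = FALSE]   # prices 60 seconds ahead
      S_now    <- S[1:(n - horizon), , drop = FALSE]   # current prices
      (S_future - S_now) / S_now                       # relative price change
    }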
Logistic Regression: Single Predictor
5-Fold Cross-Validation Results:
Predictors                    AUC test   sd       AUC train   sd
Variable74LAST PRICE 00 60    0.9545     0.0048   0.9546      0.0011
Variable88LAST PRICE 00 60    0.8594     0.0086   0.8592      0.0020
Variable107LAST PRICE 00 60   0.8585     0.0066   0.8581      0.0016
...                           ...        ...      ...         ...
Variable28LAST PRICE 00 60    0.5901     0.0308   0.5906      0.0076
Variable41LAST PRICE 00 60    0.5897     0.0206   0.5903      0.0051
Variable102LAST PRICE 00 60   0.5390     0.0205   0.5390      0.0051
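A hedged sketch of how one row of this table can be reproduced, using glm for the single-predictor fit and the pROC package for AUC (dat, Y, and x are placeholder names; the talk does not show this code):

    library(pROC)                                     # provides auc()
    set.seed(1)
    folds <- sample(rep(1:5, length.out = nrow(dat)))
    auc_test <- sapply(1:5, function(k) {
      fit  <- glm(Y ~ x, data = dat[folds != k, ], family = binomial)
      prob <- predict(fit, newdata = dat[folds == k, ], type = "response")
      as.numeric(auc(dat$Y[folds == k], prob))        # test AUC for fold k
    })
    c(mean = mean(auc_test), sd = sd(auc_test))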
Figure: Bar Plot of the Single-Predictor Results
Variable Selection Methods
Generalized Linear Model
Model Components
- Response/Dependent Variable: $Y$
- Predictors/Independent Variables: $X_1, \ldots, X_p$
- Observations: $\tilde{y}_{n\times 1} = [y_1, \ldots, y_n]'$, $X_{n\times p} = [\tilde{x}_1, \ldots, \tilde{x}_p]$

Model Setup
- Model: $Y \mid X \sim F_Y$, $E(Y) = \mu = g^{-1}(X\tilde{\beta})$, $\operatorname{Var}(Y) = V(g^{-1}(X\tilde{\beta}))$, where $\tilde{\beta}_{p\times 1}$ is the vector of unknown parameters and $g$ is the link function.

Examples
- Linear Regression: $Y \mid X \sim N(X\beta, \sigma^2)$, or $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $\varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)$
- Logistic Regression: $Y \mid X \sim \operatorname{Bernoulli}(p = g^{-1}(X\beta))$, $g(\mu) = \log\!\big(\tfrac{\mu}{1-\mu}\big)$, or
  $P(y_i = 1) = \dfrac{\exp(\beta_0 + \sum_{j=1}^{p}\beta_j x_{ij})}{1 + \exp(\beta_0 + \sum_{j=1}^{p}\beta_j x_{ij})}$
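Both examples map directly onto glm calls in R (generic usage, assuming a response y and predictors x1, x2 in the workspace; this is not the contest code):

    # Linear regression: identity link, Gaussian errors
    fit_lm  <- glm(y ~ x1 + x2, family = gaussian)
    # Logistic regression: logit link, Bernoulli response
    fit_log <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
    summary(fit_log)   # estimated betas, standard errors, deviance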
Parameter Estimation
- Maximum Likelihood, Maximum Quasi-Likelihood, Minimum Loss Function
- Least Squares, Iteratively Reweighted Least Squares (see the sketch below)
- Bayesian Methods
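For logistic regression, IRLS reduces to a short loop; a didactic sketch (not the talk's code; X is assumed to already contain an intercept column):

    irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
      beta <- rep(0, ncol(X))
      for (iter in 1:max_iter) {
        eta <- X %*% beta
        mu  <- 1 / (1 + exp(-eta))            # inverse logit link
        W   <- as.vector(mu * (1 - mu))       # IRLS weights
        z   <- eta + (y - mu) / W             # working response
        beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
      }
      drop(beta)
    }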
Overfitting motivates the variable selection and shrinkage methods that follow.
Variable Selection: Traditional Methods
Variable Selection
- Forward-Stepwise Selection: a greedy algorithm (see the step() sketch below)
- Backward-Stepwise Selection
- Best Subset Selection, with different criteria: $R^2$, MSE, AIC, SBC

Figure: Examples of Best Subset Selection with Different Criteria
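In R, forward- and backward-stepwise selection under the AIC criterion are available through step() (illustrative only; full_model and null_model are placeholder glm fits):

    # Backward: start from the full model, dropping terms by AIC
    step(full_model, direction = "backward")
    # Forward: start from the intercept-only model, adding terms by AIC
    step(null_model, scope = formula(full_model), direction = "forward")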
To Avoid Overfitting: Traditional Methods
- Principal Components Regression: instead of the original predictors, use uncorrelated components (linear combinations of the original predictors) that minimize the reconstruction error.
- Ridge Regression: shrinkage estimation
Ridge Regression: L2-Penalized Methods

$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $\varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)$, and $L(X, \beta)$ denotes the loss function.

Least Squares Regression:
$\tilde{\beta}^{\,ls} = \arg\min_\beta L(X,\beta) = \arg\min_\beta \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, giving $\tilde{\beta}^{\,ls} = (X'X)^{-1}X'\tilde{Y}$

Ridge Regression:
$\tilde{\beta}^{\,ridge} = \arg\min_\beta \Big\{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}$, giving $\tilde{\beta}^{\,ridge} = (X'X + \lambda I)^{-1}X'\tilde{Y}$
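The ridge closed form translates directly into R (a sketch assuming y is centered and the columns of X standardized, so the intercept can be omitted):

    ridge_beta <- function(X, y, lambda) {
      p <- ncol(X)
      solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # (X'X + lambda*I)^{-1} X'y
    }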
Equivalently, the penalized and constrained forms coincide:

$\tilde{\beta}^{\,ridge} = \arg\min_\beta \{\|\tilde{Y} - X\tilde{\beta}\|^2 + \lambda\|\tilde{\beta}\|_{\ell_2}\}$, where $\|\tilde{\beta}\|_{\ell_2} = \sum_{i=1}^{p}|\beta_i|^2$

$\iff$

$\tilde{\beta}^{\,ridge} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \quad \text{s.t. } \|\tilde{\beta}\|_{\ell_2} \le t$

There is a one-to-one correspondence between $\lambda$ and $t$.
Figure: Ridge Regression (L2-Penalized Methods)
Variable Selection: L1-Penalized Methods

LASSO: L1-Penalized Model
LASSO (Least Absolute Shrinkage and Selection Operator):
$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $\varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)$

$\tilde{\beta}^{\,lasso} = \arg\min_\beta \{\|\tilde{Y} - X\tilde{\beta}\|^2 + \lambda\|\tilde{\beta}\|_{\ell_1}\}$, where $\|\tilde{\beta}\|_{\ell_1} = \sum_{j=1}^{p}|\beta_j|$

$\iff$

$\tilde{\beta}^{\,lasso} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \quad \text{s.t. } \|\tilde{\beta}\|_{\ell_1} \le t$
Figure: LASSO (L1-Penalized Model)
Comparison of L1 and L2 Penalized Models

L2 Penalized Estimation: $\tilde{\beta}^{\,ridge} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \ \text{s.t. } \|\tilde{\beta}\|_{\ell_2} \le t$

L1 Penalized Estimation: $\tilde{\beta}^{\,lasso} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \ \text{s.t. } \|\tilde{\beta}\|_{\ell_1} \le t$
LASSO
References:
- The original paper: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1.
- The least angle regression (LAR) algorithm for solving the LASSO: Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist., Vol. 32, No. 2.
- Details and comparisons: Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning, second edition, Springer.

Computations:
- LASSO in R: glmnet, lasso2, lars
- Relaxed LASSO in R: relaxo
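Typical glmnet usage for the L1-penalized logistic model, with cross-validation over lambda (generic package usage, not the exact contest script; X and Y are placeholders):

    library(glmnet)
    # alpha = 1 is the LASSO penalty (alpha = 0 would give ridge)
    cvfit <- cv.glmnet(X, Y, family = "binomial", alpha = 1, type.measure = "auc")
    plot(cvfit)                     # CV curve along the lambda path
    coef(cvfit, s = "lambda.min")   # coefficients at the best lambda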
Grouped LASSO
Sometimes the predictors are grouped:
- Genes that belong to the same biological pathway;
- Collections of indicator variables representing the levels of a categorical predictor;
- The price changes of one stock at different time lags.
Observations: $\tilde{Y}_{n\times 1} = [Y_1, \ldots, Y_n]'$, $X^1_{n\times p_1}, \ldots, X^G_{n\times p_G}$

$\tilde{\beta}^{\,Glasso} = \arg\min_\beta \Big\{\|\tilde{Y} - X\tilde{\beta}\|^2 + \lambda\sum_{g=1}^{G}\sqrt{p_g}\,\|\tilde{\beta}_g\|_{\ell_2}\Big\}$

(here $\|\tilde{\beta}_g\|_{\ell_2}$ is the unsquared L2 norm of the coefficients of group $g$, as in Yuan and Lin)
This model was proposed by:
- Bakin, S. (1999). Adaptive Regression and Model Selection in Data Mining Problems. Ph.D. thesis.
- Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B, 68(1).
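Although the talk does not use it, the grouped penalty is implemented in R packages such as grpreg; a hedged sketch (X, Y, and the group vector are placeholders):

    library(grpreg)
    # group: vector of length ncol(X) assigning each column to a group
    fit <- grpreg(X, Y, group = group, penalty = "grLasso", family = "binomial")
    plot(fit)   # coefficient paths; whole groups enter or leave together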
TSL1P: Two-Stage L1-Penalized Logistic Regression
Similarity
Our data are grouped by stock or economic indicator: Open, High, Low, Close at different time periods.

Difference
- Only a subset of the variables in each group needs to be selected;
- The within-group correlations are very high, while some between-group correlations are small.
TSL1P
Observations: $\tilde{Y}_{n\times 1} = [Y_1, \ldots, Y_n]'$, $X^1_{n\times p_1}, \ldots, X^G_{n\times p_G}$

For the contest training data I used, $n = 5910$ and $G = 118$, with $p_1 = p_2 = \cdots = p_G$ (4 price fields at 6 or 10 lags per group), for a total of $4 \cdot 118 \cdot 6 = 2832$ or $4 \cdot 118 \cdot 10 = 4720$ predictors.
Stage 1
- Choose one representative for each group: $R_i$, $i = 1, \ldots, G$
- Use L1-penalized logistic regression to select variables $R_i$, $i \in M' \subseteq M = \{1, \ldots, G\}$
- Reduce the predictors from $\{X^i_{n\times p_i}\}_{i\in M}$ to $\{X^i_{n\times p_i}\}_{i\in M'}$

Stage 2
- Use L1-penalized logistic regression on the reduced predictor space
- Use appropriate cross-validation to choose the tuning parameter $\lambda$ (see the sketch below)
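A condensed sketch of the two stages using glmnet (rep_cols, group, and the fixed lambda are placeholders; the talk's actual script is not shown):

    library(glmnet)
    # Stage 1: one representative column per group
    R     <- X[, rep_cols]                               # n x G representatives
    fit1  <- glmnet(R, as.factor(Y), family = "binomial", lambda = 0.008)
    G_sel <- which(abs(fit1$beta) > 0)                   # groups kept by stage 1
    # Stage 2: L1-penalized logistic regression on all columns of kept groups
    keep <- group %in% G_sel                             # column-to-group map
    fit2 <- cv.glmnet(X[, keep], Y, family = "binomial", type.measure = "auc")
    coef(fit2, s = "lambda.min")                         # final sparse model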
Results
Using Future Information

Result: Stage 1
λ        df     AUC test  AUC test sd  AUC train  AUC train sd
0.0010   92.4   0.9513    0.0057       0.9581     0.0012
0.0020   71.0   0.9526    0.0056       0.9578     0.0011
0.0050   28.2   0.9540    0.0054       0.9561     0.0012
0.0060   20.6   0.9544    0.0054       0.9558     0.0012
0.0070   14.8   0.9545    0.0053       0.9555     0.0012
0.0080   11.0   0.9545    0.0052       0.9552     0.0013
0.0090   7.0    0.9545    0.0051       0.9550     0.0013
0.0100   5.4    0.9546    0.0051       0.9549     0.0013
0.0200   1.0    0.9546    0.0051       0.9546     0.0012
0.0500   1.0    0.9546    0.0051       0.9546     0.0012
0.1000   1.0    0.9546    0.0051       0.9546     0.0012
Figure: Variable Selection in Stage 1 of TSL1P
Result: Stage 1

    # L1-penalized logistic regression at a fixed lambda
    res = glmnet(X, as.factor(Y), family = "binomial", lambda = 0.008)
    # indices of predictors with nonzero coefficients
    ind.select = which(abs(res$beta) > 0)

Selected variables: Variable159, Variable9, Variable13, Variable14, Variable71, Variable74, Variable89, Variable112, Variable133

Result: Stage 2
How It Fails WITHOUT Using Future Information
Cross-Validation Results

para.Var2   df      AUC test    AUC sd
1e-04       506.4   0.8010608   0.005661726
3e-04       309.2   0.7998391   0.006935263
5e-04       235.6   0.7873526   0.008464247
1e-03       139.8   0.7498463   0.012675834
Unfortunately, it fails
However, we will continue...
Future Work
Future work:
- Data pre-processing
- Modifications of TSL1P: How to divide the groups? How to choose the representatives?
- L1-penalized methods for time series data
- Continuous values of Y
- Time series models: GARCH and stochastic volatility models

Applications:
- Generalized pairs trading (using future information)
- Trading system design in the equity market (e.g., digital options)
- High-frequency financial data (more information, stronger correlations?)
High-Frequency Data: Tick-by-Tick Trades
Tick-by-tick consolidated trades data for Microsoft (MSFT) at the regular market opening time on July 31, 2009:
    SYMBOL  DATE      TIME     PRICE  SIZE    EX
1   MSFT    20090731  9:30:00  23.77  158     Z
2   MSFT    20090731  9:30:00  23.77  142     Z
3   MSFT    20090731  9:30:00  23.77  258     Z
4   MSFT    20090731  9:30:00  23.77  100     Z
5   MSFT    20090731  9:30:00  23.77  100     Z
6   MSFT    20090731  9:30:00  23.77  132     Q
...
86  MSFT    20090731  9:30:01  23.77  428150  Q
87  MSFT    20090731  9:30:01  23.77  428150  Q
88  MSFT    20090731  9:30:02  23.77  216     D
89  MSFT    20090731  9:30:02  23.77  200     D
The data are from the Trade and Quote (TAQ) database, provided by Wharton Research Data Services (WRDS).
High-Frequency Data: 1-Second Trades
1-second Open/High/Low/Close data for Microsoft (MSFT) on December 31, 2008:
Time      Open   High   Low    Close  Vol
9:30:01   19.31  19.33  19.29  19.32  6252
9:30:02   19.32  19.33  19.32  19.33  3430
9:30:03   19.3   19.31  19.3   19.3   985304
9:30:04   19.28  19.29  19.28  19.29  1100
9:30:05   19.31  19.31  19.29  19.29  3502
9:30:06   19.32  19.33  19.29  19.29  4367
9:30:07   19.29  19.3   19.27  19.27  3106
...
15:59:54  19.44  19.44  19.43  19.43  19200
15:59:55  19.43  19.44  19.43  19.44  22300
15:59:56  19.44  19.45  19.43  19.45  31352
15:59:57  19.45  19.45  19.44  19.44  11898
15:59:58  19.44  19.44  19.44  19.44  7814
15:59:59  19.44  19.44  19.44  19.44  500
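Bars like these can be computed from the tick-by-tick trades; a minimal base-R sketch, assuming a data frame trades with TIME, PRICE, and SIZE columns as in the TAQ sample above (and TIME values that sort correctly, e.g. zero-padded HH:MM:SS):

    # Aggregate tick trades into 1-second OHLC + volume bars
    ohlc_1s <- do.call(rbind, lapply(split(trades, trades$TIME), function(d) {
      data.frame(Time  = d$TIME[1],
                 Open  = d$PRICE[1],         # first trade in the second
                 High  = max(d$PRICE),
                 Low   = min(d$PRICE),
                 Close = d$PRICE[nrow(d)],   # last trade in the second
                 Vol   = sum(d$SIZE))
    }))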
Figure: Open, High, Low, Close Data of MSFT on December 31, 2008
High-Frequency Data: Tick-by-Tick Quotes

Tick-by-tick consolidated quotes data for Microsoft (MSFT) at the regular market opening time on July 31, 2009:
     TIME     BID    OFR    BIDSIZ  OFRSIZ  EX
1    9:30:00  23.77  23.78  7       3       Z
2    9:30:00  23.77  23.78  6       3       Z
3    9:30:00  23.77  23.78  3       3       Z
4    9:30:00  23.77  23.78  3       6       Z
5    9:30:00  23.77  23.78  2       6       Z
6    9:30:00  23.77  23.78  2       6       Z
...
282  9:30:01  23.75  23.76  17      49      T
283  9:30:01  23.74  23.76  1       1       I
284  9:30:02  23.75  23.76  17      50      T
285  9:30:02  23.75  23.76  17      56      T
286  9:30:02  23.75  23.76  3       4       Z
287  9:30:02  23.75  23.76  6       7       Z
288  9:30:02  23.75  23.76  17      59      T
Myself
Education
- 2006 - 2010, Ph.D. student in Statistics, University of Pittsburgh
- 2007 - 2009, Courses and projects in Machine Learning and Quantitative Finance, Carnegie Mellon University
Research Interests
- Statistical tests, estimation, and prediction in continuous-time financial stochastic processes with jumps and stochastic volatility
- Applications of statistical learning and machine learning to trading system design
Contact:
- http://www.linkedin.com/in/nanzhou
Thank You!