Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
3rd Place in the 2010 INFORMS Data Mining Contest

Nan Zhou
Department of Statistics, University of Pittsburgh
November 9, 2010, Austin
Outline

- Overview
- Basic Analysis
  - Pre-Processing
  - Logistic Regression
- Variable Selection Methods
  - Generalized Linear Model
  - Variable Selection: Traditional Methods
  - Variable Selection: L1-Penalized Methods
- Results
  - Using Future Information
  - How It Fails WITHOUT Using Future Information
- Future Work
Overview

First Model: Support Vector Machine (kernel SVM)
  September 22, 2010; Public MCE: 0.771654; Result: 0.822041

Logistic Regression:
  September 23, 2010; Public MCE: 0.951764; Result: 0.966906
  - The models always choose variable 74
  - Multivariate logistic regression is worse

Try LASSO: Logistic Regression + Variable Selection
  September 27, 2010; Public MCE: 0.952942; Result: 0.967112

Try Others: SPCA, AdaBoost, Gradient Boosting, Neural Network, Random Forest, etc.

Finally:
  - More information from data at different lags
  - Variable selection: LASSO, Two-Stage Selection
Basic Analysis
Pre-Processing
- Missing data: variables 167 to 180, variable 157
- Predictors: $X_{it} = \dfrac{S_{i,t+60} - S_{it}}{S_{it}}$
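As a rough illustration, this predictor construction can be sketched in R (a minimal sketch; the price matrix S and the 60-second horizon are stand-ins for the contest's actual data layout):

    # Sketch: X_it = (S_{i,t+60} - S_it) / S_it
    # S: numeric matrix of last prices, rows = time t in seconds, cols = stocks i
    make_predictors <- function(S, horizon = 60) {
      n <- nrow(S)
      S_future <- S[(1 + horizon):n, , drop = FALSE]   # prices 60 seconds ahead
      S_now    <- S[1:(n - horizon), , drop = FALSE]   # current prices
      (S_future - S_now) / S_now                       # relative price change
    }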
Logistic Regression: Single Predictor
5-Fold Cross-Validation Results:
Predictors                    AUC test   sd       AUC train   sd
Variable74LAST PRICE 00 60    0.9545     0.0048   0.9546      0.0011
Variable88LAST PRICE 00 60    0.8594     0.0086   0.8592      0.0020
Variable107LAST PRICE 00 60   0.8585     0.0066   0.8581      0.0016
...                           ...        ...      ...         ...
Variable28LAST PRICE 00 60    0.5901     0.0308   0.5906      0.0076
Variable41LAST PRICE 00 60    0.5897     0.0206   0.5903      0.0051
Variable102LAST PRICE 00 60   0.5390     0.0205   0.5390      0.0051
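A hedged sketch of how one row of this table can be reproduced, using glm for the single-predictor fit and the pROC package for AUC (dat, Y, and x are placeholder names; the talk does not show this code):

    library(pROC)                                     # provides auc()
    set.seed(1)
    folds <- sample(rep(1:5, length.out = nrow(dat)))
    auc_test <- sapply(1:5, function(k) {
      fit  <- glm(Y ~ x, data = dat[folds != k, ], family = binomial)
      prob <- predict(fit, newdata = dat[folds == k, ], type = "response")
      as.numeric(auc(dat$Y[folds == k], prob))        # test AUC for fold k
    })
    c(mean = mean(auc_test), sd = sd(auc_test))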
Figure: Bar Plot of the Single-Predictor Results
Variable Selection Methods
Generalized Linear Model
Model Components
- Response/Dependent Variable: $Y$
- Predictors/Independent Variables: $X_1, \ldots, X_p$
- Observations: $\tilde{y}_{n\times 1} = [y_1, \ldots, y_n]'$, $X_{n\times p} = [\tilde{x}_1, \ldots, \tilde{x}_p]$

Model Setup
- Model: $Y \mid X \sim F_Y$, $E(Y) = \mu = g^{-1}(X\tilde{\beta})$, $\operatorname{Var}(Y) = V(g^{-1}(X\tilde{\beta}))$, where $\tilde{\beta}_{p\times 1}$ is the vector of unknown parameters and $g$ is the link function.

Examples
- Linear Regression: $Y \mid X \sim N(X\beta, \sigma^2)$, or $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $\varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)$
- Logistic Regression: $Y \mid X \sim \operatorname{Bernoulli}(p = g^{-1}(X\beta))$, $g(\mu) = \log\!\big(\tfrac{\mu}{1-\mu}\big)$, or
  $P(y_i = 1) = \dfrac{\exp(\beta_0 + \sum_{j=1}^{p}\beta_j x_{ij})}{1 + \exp(\beta_0 + \sum_{j=1}^{p}\beta_j x_{ij})}$
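Both examples map directly onto glm calls in R (generic usage, assuming a response y and predictors x1, x2 in the workspace; this is not the contest code):

    # Linear regression: identity link, Gaussian errors
    fit_lm  <- glm(y ~ x1 + x2, family = gaussian)
    # Logistic regression: logit link, Bernoulli response
    fit_log <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
    summary(fit_log)   # estimated betas, standard errors, deviance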
Parameter Estimation
- Maximum Likelihood, Maximum Quasi-Likelihood, Minimum Loss Function
- Least Squares, Iteratively Reweighted Least Squares (see the sketch below)
- Bayesian Methods
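For logistic regression, IRLS reduces to a short loop; a didactic sketch (not the talk's code; X is assumed to already contain an intercept column):

    irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
      beta <- rep(0, ncol(X))
      for (iter in 1:max_iter) {
        eta <- X %*% beta
        mu  <- 1 / (1 + exp(-eta))            # inverse logit link
        W   <- as.vector(mu * (1 - mu))       # IRLS weights
        z   <- eta + (y - mu) / W             # working response
        beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
      }
      drop(beta)
    }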
Overfitting motivates the variable selection and shrinkage methods that follow.
Variable Selection: Traditional Methods
Variable Selection
- Forward-Stepwise Selection: a greedy algorithm (see the step() sketch below)
- Backward-Stepwise Selection
- Best Subset Selection, with different criteria: $R^2$, MSE, AIC, SBC

Figure: Examples of Best Subset Selection with Different Criteria
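In R, forward- and backward-stepwise selection under the AIC criterion are available through step() (illustrative only; full_model and null_model are placeholder glm fits):

    # Backward: start from the full model, dropping terms by AIC
    step(full_model, direction = "backward")
    # Forward: start from the intercept-only model, adding terms by AIC
    step(null_model, scope = formula(full_model), direction = "forward")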
To Avoid Overfitting: Traditional Methods
- Principal Components Regression: instead of the original predictors, use uncorrelated components (linear combinations of the original predictors) that minimize the reconstruction error.
- Ridge Regression: shrinkage estimation
Ridge Regression: L2-Penalized Methods

$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $\varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)$, and $L(X, \beta)$ denotes the loss function.

Least Squares Regression:
$\tilde{\beta}^{\,ls} = \arg\min_\beta L(X,\beta) = \arg\min_\beta \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, giving $\tilde{\beta}^{\,ls} = (X'X)^{-1}X'\tilde{Y}$

Ridge Regression:
$\tilde{\beta}^{\,ridge} = \arg\min_\beta \Big\{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}$, giving $\tilde{\beta}^{\,ridge} = (X'X + \lambda I)^{-1}X'\tilde{Y}$
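The ridge closed form translates directly into R (a sketch assuming y is centered and the columns of X standardized, so the intercept can be omitted):

    ridge_beta <- function(X, y, lambda) {
      p <- ncol(X)
      solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # (X'X + lambda*I)^{-1} X'y
    }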
Equivalently, the penalized and constrained forms coincide:

$\tilde{\beta}^{\,ridge} = \arg\min_\beta \{\|\tilde{Y} - X\tilde{\beta}\|^2 + \lambda\|\tilde{\beta}\|_{\ell_2}\}$, where $\|\tilde{\beta}\|_{\ell_2} = \sum_{i=1}^{p}|\beta_i|^2$

$\iff$

$\tilde{\beta}^{\,ridge} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \quad \text{s.t. } \|\tilde{\beta}\|_{\ell_2} \le t$

There is a one-to-one correspondence between $\lambda$ and $t$.
Figure: Ridge Regression (L2-Penalized Methods)
Variable Selection: L1-Penalized Methods

LASSO: L1-Penalized Model
LASSO (Least Absolute Shrinkage and Selection Operator):
$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $\varepsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2)$

$\tilde{\beta}^{\,lasso} = \arg\min_\beta \{\|\tilde{Y} - X\tilde{\beta}\|^2 + \lambda\|\tilde{\beta}\|_{\ell_1}\}$, where $\|\tilde{\beta}\|_{\ell_1} = \sum_{j=1}^{p}|\beta_j|$

$\iff$

$\tilde{\beta}^{\,lasso} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \quad \text{s.t. } \|\tilde{\beta}\|_{\ell_1} \le t$
Figure: LASSO (L1-Penalized Model)
Comparison of L1 and L2 Penalized Models

L2 Penalized Estimation: $\tilde{\beta}^{\,ridge} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \ \text{s.t. } \|\tilde{\beta}\|_{\ell_2} \le t$

L1 Penalized Estimation: $\tilde{\beta}^{\,lasso} = \arg\min_\beta \|\tilde{Y} - X\tilde{\beta}\|^2 \ \text{s.t. } \|\tilde{\beta}\|_{\ell_1} \le t$
LASSO
References:
- The original paper: Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1.
- The least angle regression (LAR) algorithm for solving the LASSO: Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist., Vol. 32, No. 2.
- Details and comparisons: Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning, second edition, Springer.

Computations:
- LASSO in R: glmnet, lasso2, lars
- Relaxed LASSO in R: relaxo
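Typical glmnet usage for the L1-penalized logistic model, with cross-validation over lambda (generic package usage, not the exact contest script; X and Y are placeholders):

    library(glmnet)
    # alpha = 1 is the LASSO penalty (alpha = 0 would give ridge)
    cvfit <- cv.glmnet(X, Y, family = "binomial", alpha = 1, type.measure = "auc")
    plot(cvfit)                     # CV curve along the lambda path
    coef(cvfit, s = "lambda.min")   # coefficients at the best lambda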
Grouped LASSO
Sometimes the predictors are grouped:
- Genes that belong to the same biological pathway;
- Collections of indicator variables representing the levels of a categorical predictor;
- The price changes of one stock at different time lags.
Observations: $\tilde{Y}_{n\times 1} = [Y_1, \ldots, Y_n]'$, $X^1_{n\times p_1}, \ldots, X^G_{n\times p_G}$

$\tilde{\beta}^{\,Glasso} = \arg\min_\beta \Big\{\|\tilde{Y} - X\tilde{\beta}\|^2 + \lambda\sum_{g=1}^{G}\sqrt{p_g}\,\|\tilde{\beta}_g\|_{\ell_2}\Big\}$

(here $\|\tilde{\beta}_g\|_{\ell_2}$ is the unsquared L2 norm of the coefficients of group $g$, as in Yuan and Lin)
This model was proposed by:
- Bakin, S. (1999). Adaptive Regression and Model Selection in Data Mining Problems. Ph.D. thesis.
- Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B, 68(1).
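Although the talk does not use it, the grouped penalty is implemented in R packages such as grpreg; a hedged sketch (X, Y, and the group vector are placeholders):

    library(grpreg)
    # group: vector of length ncol(X) assigning each column to a group
    fit <- grpreg(X, Y, group = group, penalty = "grLasso", family = "binomial")
    plot(fit)   # coefficient paths; whole groups enter or leave together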
TSL1P: Two-Stage L1-Penalized Logistic Regression
Similarity
Our data are grouped by stock or economic indicator: Open, High, Low, Close at different time periods.

Difference
- Only a subset of the variables in each group needs to be selected;
- The within-group correlations are very high, while some between-group correlations are small.
TSL1P
Observations: $\tilde{Y}_{n\times 1} = [Y_1, \ldots, Y_n]'$, $X^1_{n\times p_1}, \ldots, X^G_{n\times p_G}$

For the contest training data I used, $n = 5910$ and $G = 118$, with $p_1 = p_2 = \cdots = p_G$ (4 price fields at 6 or 10 lags per group), for a total of $4 \cdot 118 \cdot 6 = 2832$ or $4 \cdot 118 \cdot 10 = 4720$ predictors.
Stage 1
- Choose one representative for each group: $R_i$, $i = 1, \ldots, G$
- Use L1-penalized logistic regression to select variables $R_i$, $i \in M' \subseteq M = \{1, \ldots, G\}$
- Reduce the predictors from $\{X^i_{n\times p_i}\}_{i\in M}$ to $\{X^i_{n\times p_i}\}_{i\in M'}$

Stage 2
- Use L1-penalized logistic regression on the reduced predictor space
- Use appropriate cross-validation to choose the tuning parameter $\lambda$ (see the sketch below)
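A condensed sketch of the two stages using glmnet (rep_cols, group, and the fixed lambda are placeholders; the talk's actual script is not shown):

    library(glmnet)
    # Stage 1: one representative column per group
    R     <- X[, rep_cols]                               # n x G representatives
    fit1  <- glmnet(R, as.factor(Y), family = "binomial", lambda = 0.008)
    G_sel <- which(abs(fit1$beta) > 0)                   # groups kept by stage 1
    # Stage 2: L1-penalized logistic regression on all columns of kept groups
    keep <- group %in% G_sel                             # column-to-group map
    fit2 <- cv.glmnet(X[, keep], Y, family = "binomial", type.measure = "auc")
    coef(fit2, s = "lambda.min")                         # final sparse model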
Results
Using Future Information

Result: Stage 1
λ        df     AUC test  AUC test sd  AUC train  AUC train sd
0.0010   92.4   0.9513    0.0057       0.9581     0.0012
0.0020   71.0   0.9526    0.0056       0.9578     0.0011
0.0050   28.2   0.9540    0.0054       0.9561     0.0012
0.0060   20.6   0.9544    0.0054       0.9558     0.0012
0.0070   14.8   0.9545    0.0053       0.9555     0.0012
0.0080   11.0   0.9545    0.0052       0.9552     0.0013
0.0090   7.0    0.9545    0.0051       0.9550     0.0013
0.0100   5.4    0.9546    0.0051       0.9549     0.0013
0.0200   1.0    0.9546    0.0051       0.9546     0.0012
0.0500   1.0    0.9546    0.0051       0.9546     0.0012
0.1000   1.0    0.9546    0.0051       0.9546     0.0012
Figure: Variable Selection in Stage 1 of TSL1P
Result: Stage 1

    # L1-penalized logistic regression at a fixed lambda
    res = glmnet(X, as.factor(Y), family = "binomial", lambda = 0.008)
    # indices of predictors with nonzero coefficients
    ind.select = which(abs(res$beta) > 0)

Selected variables: Variable159, Variable9, Variable13, Variable14, Variable71, Variable74, Variable89, Variable112, Variable133

Result: Stage 2
How It Fails WITHOUT Using Future Information
Cross-Validation Results

para.Var2   df      AUC test    AUC sd
1e-04       506.4   0.8010608   0.005661726
3e-04       309.2   0.7998391   0.006935263
5e-04       235.6   0.7873526   0.008464247
1e-03       139.8   0.7498463   0.012675834
Unfortunately, it fails
However, we will continue...
Future Work
Future work:
- Data pre-processing
- Modifications of TSL1P: How to divide the groups? How to choose the representatives?
- L1-penalized methods for time series data
- Continuous values of Y
- Time series models: GARCH and stochastic volatility models

Applications:
- Generalized pairs trading (using future information)
- Trading system design in the equity market (e.g., digital options)
- High-frequency financial data (more information, stronger correlations?)
High-Frequency Data: Tick-by-Tick Trades
Tick-by-tick consolidated trades data for Microsoft (MSFT) at the regular market opening time on July 31, 2009:
    SYMBOL  DATE      TIME     PRICE  SIZE    EX
1   MSFT    20090731  9:30:00  23.77  158     Z
2   MSFT    20090731  9:30:00  23.77  142     Z
3   MSFT    20090731  9:30:00  23.77  258     Z
4   MSFT    20090731  9:30:00  23.77  100     Z
5   MSFT    20090731  9:30:00  23.77  100     Z
6   MSFT    20090731  9:30:00  23.77  132     Q
...
86  MSFT    20090731  9:30:01  23.77  428150  Q
87  MSFT    20090731  9:30:01  23.77  428150  Q
88  MSFT    20090731  9:30:02  23.77  216     D
89  MSFT    20090731  9:30:02  23.77  200     D
The data are from the Trade and Quote (TAQ) database, provided by Wharton Research Data Services (WRDS).
High-Frequency Data: 1-Second Trades
1-second Open/High/Low/Close data for Microsoft (MSFT) on December 31, 2008:
Time      Open   High   Low    Close  Vol
9:30:01   19.31  19.33  19.29  19.32  6252
9:30:02   19.32  19.33  19.32  19.33  3430
9:30:03   19.3   19.31  19.3   19.3   985304
9:30:04   19.28  19.29  19.28  19.29  1100
9:30:05   19.31  19.31  19.29  19.29  3502
9:30:06   19.32  19.33  19.29  19.29  4367
9:30:07   19.29  19.3   19.27  19.27  3106
...
15:59:54  19.44  19.44  19.43  19.43  19200
15:59:55  19.43  19.44  19.43  19.44  22300
15:59:56  19.44  19.45  19.43  19.45  31352
15:59:57  19.45  19.45  19.44  19.44  11898
15:59:58  19.44  19.44  19.44  19.44  7814
15:59:59  19.44  19.44  19.44  19.44  500
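Bars like these can be computed from the tick-by-tick trades; a minimal base-R sketch, assuming a data frame trades with TIME, PRICE, and SIZE columns as in the TAQ sample above (and TIME values that sort correctly, e.g. zero-padded HH:MM:SS):

    # Aggregate tick trades into 1-second OHLC + volume bars
    ohlc_1s <- do.call(rbind, lapply(split(trades, trades$TIME), function(d) {
      data.frame(Time  = d$TIME[1],
                 Open  = d$PRICE[1],         # first trade in the second
                 High  = max(d$PRICE),
                 Low   = min(d$PRICE),
                 Close = d$PRICE[nrow(d)],   # last trade in the second
                 Vol   = sum(d$SIZE))
    }))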
Figure: Open, High, Low, Close Data of MSFT on December 31, 2008
High-Frequency Data: Tick-by-Tick Quotes

Tick-by-tick consolidated quotes data for Microsoft (MSFT) at the regular market opening time on July 31, 2009:
     TIME     BID    OFR    BIDSIZ  OFRSIZ  EX
1    9:30:00  23.77  23.78  7       3       Z
2    9:30:00  23.77  23.78  6       3       Z
3    9:30:00  23.77  23.78  3       3       Z
4    9:30:00  23.77  23.78  3       6       Z
5    9:30:00  23.77  23.78  2       6       Z
6    9:30:00  23.77  23.78  2       6       Z
...
282  9:30:01  23.75  23.76  17      49      T
283  9:30:01  23.74  23.76  1       1       I
284  9:30:02  23.75  23.76  17      50      T
285  9:30:02  23.75  23.76  17      56      T
286  9:30:02  23.75  23.76  3       4       Z
287  9:30:02  23.75  23.76  6       7       Z
288  9:30:02  23.75  23.76  17      59      T
Myself
Education
- 2006 - 2010, Ph.D. student in Statistics, University of Pittsburgh
- 2007 - 2009, Courses and projects in Machine Learning and Quantitative Finance, Carnegie Mellon University
Research Interests
- Statistical tests, estimation, and prediction in continuous-time financial stochastic processes with jumps and stochastic volatility
- Applications of statistical learning and machine learning to trading system design
Contact:
- http://www.linkedin.com/in/nanzhou
Thank You!