
Page 1: © 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 1 E6893 Big Data Analytics: Financial Market Volatility


E6893 Big Data Analytics:

Financial Market Volatility

Final Project Presentation

Jimmy Zhong, Tim Wu, Oliver Zhou, John Terzis

December 22, 2014

Page 2:

Feature Selection/Extraction using Hadoop

• The MapReduce programming model is used to generate a feature matrix from raw price data across hundreds of symbols.

• Raw price data is first merged, on timestamp, with a fixed set of user-determined features for each symbol.

• Feature extraction is performed in the reducer by computing forward- and backward-looking volatility values for each timestamp of each symbol.

• The resulting feature matrix contains over 300 columns, from a starting point of 12.

• The feature matrix can be further transformed by a script that performs time-series clustering on intra-day price activity.
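The Hadoop job itself is not included in the slides. As a rough sketch of the idea only, the map/reduce steps can be mimicked in plain Python; the record format, window size, and the volatility definition (rolling standard deviation of simple returns) are assumptions for illustration, not the project's actual code:

```python
import math
from collections import defaultdict

def mapper(record):
    """Emit (symbol, (timestamp, price)) — hypothetical raw-price record layout."""
    symbol, timestamp, price = record
    yield symbol, (timestamp, price)

def reducer(symbol, values, window=3):
    """Per symbol, compute backward- and forward-looking volatility for each
    timestamp as the std. dev. of simple returns over `window` neighbouring ticks."""
    ticks = sorted(values)                      # order by timestamp
    prices = [p for _, p in ticks]
    returns = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]

    def vol(rs):
        if len(rs) < 2:
            return 0.0
        m = sum(rs) / len(rs)
        return math.sqrt(sum((r - m) ** 2 for r in rs) / (len(rs) - 1))

    for i, (ts, _) in enumerate(ticks):
        back = vol(returns[max(0, i - window):i])   # backward-looking
        fwd = vol(returns[i:i + window])            # forward-looking
        yield symbol, ts, back, fwd

# Tiny in-memory "job" over two symbols, standing in for the Hadoop shuffle.
records = [("AAPL", t, 100 + t) for t in range(6)] + \
          [("AIG", t, 50 + 2 * t) for t in range(6)]
grouped = defaultdict(list)
for rec in records:
    for key, val in mapper(rec):
        grouped[key].append(val)
features = [row for sym, vals in grouped.items() for row in reducer(sym, vals)]
```

In the real job, the shuffle phase performs the grouping by symbol, and each feature row becomes one column block of the final matrix.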

Page 3:

Supervised Learning on Spark using MLlib

• Spark was installed, and PySpark was used to perform cross-validated ridge regression with stochastic gradient descent, with the goal of producing a regressor that can predict volatility over some forward-looking interval (60 minutes, 1 day, 10 days, etc.) for a given symbol.

• A combination of MLlib and scikit-learn was used, since MLlib did not yet have Python bindings for cross-validated splitting of the dataset.

• Spark was run on data held in HDFS.

• Results were tested on a hold-out sample, and R² was calculated to show how much variance the regressor could explain.

Page 4:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

Motivation: Real-world financial time series exhibit a property called volatility clustering: periods of relative calm are interrupted by bursts of volatility. An extreme market movement can represent a significant downside risk to an investor's security portfolio. Using the RHadoop ecosystem to forecast future volatility and calculate Value at Risk (VaR) can help investors prepare for losses arising from natural or man-made catastrophes, even of a magnitude not experienced before.

Algorithm:
1. Used Pig and a Python script to pre-process the raw data (AAPL), then loaded it into RStudio.
2. Applied R code (TimeSeriesAnalysis.R) and calculated the return in percent.
3. Applied GARCH modeling to forecast future volatility and calculate VaR.
4. Applied Extreme Value Theory (EVT) to fit a GPD distribution to the tails.

Results:
5. Forecast the volatility and calculated Value at Risk (VaR) at the 99% confidence level (the loss is expected to be exceeded only 1% of the time). In this example, AAPL (2008–2009), we calculated a 99% probability that the monthly return is above 4%.
6. Used a statistical hypothesis test (Ljung-Box) for autocorrelation in the squared returns (p-value ≈ 0: reject the null hypothesis of no autocorrelation in the squared returns at the 1% significance level). A GARCH model should therefore be employed in modeling the return time series.
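The GARCH modeling lives in TimeSeriesAnalysis.R, which is not included here. As an illustration of the underlying recursion only, a GARCH(1,1) multistep variance forecast and a normal-quantile 99% VaR can be sketched in Python; the returns and parameters below are invented for illustration, not fitted values from the project:

```python
import math

def garch11_forecast(returns, omega, alpha, beta, horizon):
    """Run the GARCH(1,1) conditional-variance recursion
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
    over the sample, then forecast `horizon` steps ahead."""
    var = sum(r * r for r in returns) / len(returns)   # seed with sample variance
    for r in returns:
        var = omega + alpha * r * r + beta * var
    # E[sigma2_{t+h}] decays geometrically toward the long-run variance.
    long_run = omega / (1.0 - alpha - beta)
    forecasts, v = [], var
    for _ in range(horizon):
        forecasts.append(v)
        v = long_run + (alpha + beta) * (v - long_run)
    return forecasts

def var_99(sigma):
    """One-period 99% Value at Risk under a normal assumption:
    the 1% left-tail quantile sits about 2.326 std. devs. below the mean."""
    return 2.326 * sigma

# Made-up daily returns and illustrative parameters (alpha + beta < 1).
rets = [0.01, -0.02, 0.015, -0.03, 0.005, 0.02, -0.01]
f = garch11_forecast(rets, omega=1e-5, alpha=0.1, beta=0.85, horizon=10)
risk = var_99(math.sqrt(f[0]))
```

The EVT/GPD tail fit from step 4 refines this normal-tail VaR; that step is not sketched here.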

Page 5:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

Page 6:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

Page 7:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

(Figures: tail of the AAPL % return data; quantile-quantile plot)

Page 8:

K-Means Clustering

• The goal is to relate different time intervals to stock volatility through clustering.

• Symbols: AIG, AMZN, PEP

• Vector dimensions: normalized volume, symbol volatility +1 day, VIX volatility +1 day, time interval

• Time intervals: period of day, day of week, fiscal quarter, year

• K-means clustering was run in R and on Hadoop with cluster counts of 3–4.

• The Euclidean distance measure was used, since all features are real-valued.
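The R/Hadoop clustering script itself is not in the slides. A small scikit-learn sketch of k-means with Euclidean distance on 4-dimensional vectors of the kind described above might look like this; the data is synthetic, and its blob structure is invented purely so the clustering has something to find:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the slide's 4-d vectors:
# [normalized volume, symbol volatility +1d, VIX volatility +1d, time interval].
centers = np.array([[0.2, 0.1, 0.1, 0.0],
                    [0.8, 0.5, 0.4, 0.5],
                    [0.5, 0.9, 0.8, 1.0]])
X = np.vstack([c + 0.05 * rng.normal(size=(50, 4)) for c in centers])

# KMeans minimizes within-cluster Euclidean distance, matching the
# slide's choice of distance measure; k = 3 as on the slide.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Interpreting the clusters then amounts to inspecting `km.cluster_centers_` to see which feature dimensions (e.g. volume vs. time interval) actually separate the groups.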

Page 9:

Cluster Results

• No strong correlation of time intervals with symbol volatility across all three sectors.

• No strong correlation between VIX volatility and symbol volatility.

• There is a significant relationship between volume and symbol volatility.

Page 10:

Logistic Regression

• The goal is to use a classification model to separate out variables during feature selection and identify which ones yield the best predictive power.

• Stock symbols tested: AIG, AMZN, PEP

• Parameters in the dataset: normalized volume, symbol volatility +1 day, VIX volatility +1 day, time interval

• The target was predicting when symbol VIX volatility would rise above 0.25, which historically is a rough cutoff for regime changes from low- to high-volatility market cycles.
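The classification setup above can be sketched with scikit-learn. The data below is synthetic (the true project data is not reproduced here): volume is given a real signal, the time interval is pure noise, and the binary target is whether next-period volatility clears the 0.25 cutoff:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 1000
# Hypothetical features: volume carries signal, time interval does not.
volume = rng.normal(size=n)
time_interval = rng.integers(0, 4, size=n).astype(float)
# Next-period volatility proxy; label = above the 0.25 regime cutoff.
vol_next = 0.2 + 0.1 * volume + 0.05 * rng.normal(size=n)
y = (vol_next > 0.25).astype(int)
X = np.column_stack([volume, time_interval])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# AUC on held-out data, the evaluation metric used on the next slide.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

With this construction, the informative feature (volume) produces an AUC well above 0.5, while a model given only the time interval would hover near 0.5 — the same qualitative pattern the project reports.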

Page 11:

Logistic Regression Results

• Measured by AUC (area under the ROC curve): an AUC of 1 is a perfect classifier, while 0.5 is completely random.

• Little to no relationship between time intervals and symbol volatility, though that result may be skewed by market crashes.

• VIX volatility and symbol volatility are almost completely randomly related.

• There is a significant relationship between volume and symbol volatility.

Page 12:

Questions?