
Page 1: © 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 4: Big Data Analytics Algorithms 1 E6893 Big Data Analytics: Financial Market Volatility


E6893 Big Data Analytics:

Financial Market Volatility

Final Project Presentation

Jimmy Zhong, Tim Wu, Oliver Zhou, John Terzis

December 22, 2014

Page 2:

Feature Selection/Extraction using Hadoop

• The MapReduce programming model is used to generate a feature matrix from raw price data across hundreds of symbols.

• Raw price data is first merged, on timestamp, with a fixed set of user-determined features for each symbol.

• Feature extraction is performed in the reducer by computing forward- and backward-looking volatility values for each timestamp of each symbol.

• The resulting feature matrix contains over 300 columns, from a starting point of 12.

• The feature matrix can be further transformed by a script that performs time-series clustering on intra-day price activity.
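The Hadoop job itself is not included in the slides. As a rough sketch of the idea only, the map/reduce steps can be mimicked in plain Python; the record format, window size, and the volatility definition (rolling standard deviation of simple returns) are assumptions for illustration, not the project's actual code:

```python
import math
from collections import defaultdict

def mapper(record):
    """Emit (symbol, (timestamp, price)) — hypothetical raw-price record layout."""
    symbol, timestamp, price = record
    yield symbol, (timestamp, price)

def reducer(symbol, values, window=3):
    """Per symbol, compute backward- and forward-looking volatility for each
    timestamp as the std. dev. of simple returns over `window` neighbouring ticks."""
    ticks = sorted(values)                      # order by timestamp
    prices = [p for _, p in ticks]
    returns = [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]

    def vol(rs):
        if len(rs) < 2:
            return 0.0
        m = sum(rs) / len(rs)
        return math.sqrt(sum((r - m) ** 2 for r in rs) / (len(rs) - 1))

    for i, (ts, _) in enumerate(ticks):
        back = vol(returns[max(0, i - window):i])   # backward-looking
        fwd = vol(returns[i:i + window])            # forward-looking
        yield symbol, ts, back, fwd

# Tiny in-memory "job" over two symbols, standing in for the Hadoop shuffle.
records = [("AAPL", t, 100 + t) for t in range(6)] + \
          [("AIG", t, 50 + 2 * t) for t in range(6)]
grouped = defaultdict(list)
for rec in records:
    for key, val in mapper(rec):
        grouped[key].append(val)
features = [row for sym, vals in grouped.items() for row in reducer(sym, vals)]
```

In the real job, the shuffle phase performs the grouping by symbol, and each feature row becomes one column block of the final matrix.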

Page 3:

Supervised Learning on Spark using MLlib

• Spark was installed, and PySpark was used to perform cross-validated ridge regression with stochastic gradient descent, with the goal of producing a regressor that can predict volatility over some forward-looking interval (60 minutes, 1 day, 10 days, etc.) for a given symbol.

• A combination of MLlib and scikit-learn was used, since MLlib did not yet have Python bindings for cross-validated splitting of the dataset.

• Spark was run on data held in HDFS.

• Results were tested on a hold-out sample, and R² was calculated to show how much variance the regressor could explain.

Page 4:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

Motivation: Real-world financial time series exhibit a property called volatility clustering: periods of relative calm are interrupted by bursts of volatility. An extreme market movement can represent a significant downside risk to an investor's security portfolio. Using the RHadoop ecosystem to forecast future volatility and calculate Value at Risk (VaR) can help investors prepare for losses arising from natural or man-made catastrophes, even of a magnitude not experienced before.

Algorithm:
1. Used Pig and a Python script to pre-process the raw data (AAPL), then loaded it into RStudio.
2. Applied R code (TimeSeriesAnalysis.R) and calculated the return in percent.
3. Applied GARCH modeling to forecast future volatility and calculate VaR.
4. Applied Extreme Value Theory (EVT) to fit a GPD distribution to the tails.

Results:
5. Forecast the volatility and calculated Value at Risk (VaR) at the 99% confidence level (the loss is expected to be exceeded only 1% of the time). In this example, AAPL (2008–2009), we calculated a 99% probability that the monthly return is above 4%.
6. Used a statistical hypothesis test (Ljung-Box) for autocorrelation in the squared returns (p-value ≈ 0: reject the null hypothesis of no autocorrelation in the squared returns at the 1% significance level). A GARCH model should therefore be employed in modeling the return time series.
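The GARCH modeling lives in TimeSeriesAnalysis.R, which is not included here. As an illustration of the underlying recursion only, a GARCH(1,1) multistep variance forecast and a normal-quantile 99% VaR can be sketched in Python; the returns and parameters below are invented for illustration, not fitted values from the project:

```python
import math

def garch11_forecast(returns, omega, alpha, beta, horizon):
    """Run the GARCH(1,1) conditional-variance recursion
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
    over the sample, then forecast `horizon` steps ahead."""
    var = sum(r * r for r in returns) / len(returns)   # seed with sample variance
    for r in returns:
        var = omega + alpha * r * r + beta * var
    # E[sigma2_{t+h}] decays geometrically toward the long-run variance.
    long_run = omega / (1.0 - alpha - beta)
    forecasts, v = [], var
    for _ in range(horizon):
        forecasts.append(v)
        v = long_run + (alpha + beta) * (v - long_run)
    return forecasts

def var_99(sigma):
    """One-period 99% Value at Risk under a normal assumption:
    the 1% left-tail quantile sits about 2.326 std. devs. below the mean."""
    return 2.326 * sigma

# Made-up daily returns and illustrative parameters (alpha + beta < 1).
rets = [0.01, -0.02, 0.015, -0.03, 0.005, 0.02, -0.01]
f = garch11_forecast(rets, omega=1e-5, alpha=0.1, beta=0.85, horizon=10)
risk = var_99(math.sqrt(f[0]))
```

The EVT/GPD tail fit from step 4 refines this normal-tail VaR; that step is not sketched here.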

Page 5:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

Page 6:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

Page 7:

Time-Series Analysis: Forecasting multiple steps ahead with a GARCH model and calculating VaR

(Figures: tail of the AAPL % return data; quantile-quantile plot)

Page 8:

K-Means Clustering

• The goal is to relate different time intervals to stock volatility through clustering.

• Symbols: AIG, AMZN, PEP

• Vector dimensions: normalized volume, symbol volatility +1 day, VIX volatility +1 day, time interval

• Time intervals: period of day, day of week, fiscal quarter, year

• K-means clustering was run in R and on Hadoop with cluster counts of 3–4.

• The Euclidean distance measure was used, since all features are real-valued.
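The R/Hadoop clustering script itself is not in the slides. A small scikit-learn sketch of k-means with Euclidean distance on 4-dimensional vectors of the kind described above might look like this; the data is synthetic, and its blob structure is invented purely so the clustering has something to find:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the slide's 4-d vectors:
# [normalized volume, symbol volatility +1d, VIX volatility +1d, time interval].
centers = np.array([[0.2, 0.1, 0.1, 0.0],
                    [0.8, 0.5, 0.4, 0.5],
                    [0.5, 0.9, 0.8, 1.0]])
X = np.vstack([c + 0.05 * rng.normal(size=(50, 4)) for c in centers])

# KMeans minimizes within-cluster Euclidean distance, matching the
# slide's choice of distance measure; k = 3 as on the slide.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Interpreting the clusters then amounts to inspecting `km.cluster_centers_` to see which feature dimensions (e.g. volume vs. time interval) actually separate the groups.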

Page 9:

Cluster Results

• No strong correlation of time intervals with symbol volatility across all three sectors.

• No strong correlation between VIX volatility and symbol volatility.

• There is a significant relationship between volume and symbol volatility.

Page 10:

Logistic Regression

• The goal is to use a classification model to separate out variables during feature selection and identify which ones yield the best predictive power.

• Stock symbols tested: AIG, AMZN, PEP

• Parameters in the dataset: normalized volume, symbol volatility +1 day, VIX volatility +1 day, time interval

• The target was predicting when symbol VIX volatility would rise above 0.25, which historically is a rough cutoff for regime changes from low- to high-volatility market cycles.
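The classification setup above can be sketched with scikit-learn. The data below is synthetic (the true project data is not reproduced here): volume is given a real signal, the time interval is pure noise, and the binary target is whether next-period volatility clears the 0.25 cutoff:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 1000
# Hypothetical features: volume carries signal, time interval does not.
volume = rng.normal(size=n)
time_interval = rng.integers(0, 4, size=n).astype(float)
# Next-period volatility proxy; label = above the 0.25 regime cutoff.
vol_next = 0.2 + 0.1 * volume + 0.05 * rng.normal(size=n)
y = (vol_next > 0.25).astype(int)
X = np.column_stack([volume, time_interval])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# AUC on held-out data, the evaluation metric used on the next slide.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

With this construction, the informative feature (volume) produces an AUC well above 0.5, while a model given only the time interval would hover near 0.5 — the same qualitative pattern the project reports.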

Page 11:

Logistic Regression Results

• Measured by AUC (area under the ROC curve): an AUC of 1 is a perfect classifier, while 0.5 is completely random.

• Little to no relationship between time intervals and symbol volatility, though that result may be skewed by market crashes.

• VIX volatility and symbol volatility are almost completely randomly related.

• There is a significant relationship between volume and symbol volatility.

Page 12:

Questions?