big data's big secret
TRANSCRIPT
![Page 1: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/1.jpg)
BIG DATA’S DIRTY SECRET Harvey J. Stein
Head, Quantitative Risk Analytics Bloomberg
Machine Learning in Finance Workshop April 20, 2018
![Page 2: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/2.jpg)
Bloomberg
OUTLINE
1. Introduction
2. Symptoms
3. Hole filling
4. Bad data detection
5. Summary
6. References
2
![Page 3: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/3.jpg)
Introduction
![Page 4: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/4.jpg)
Go gle "big data" OR "machine l'earning"
All News Images Videos
About 99,900,000 results (0. 75 seconds
Bloomberg
LOTS OF BIG DATA
Big data is big news!
3
![Page 5: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/5.jpg)
Although I guess that’s not so surprising...
Go gle "president trump"
All News Images Videos
Bloomberg
TRUMPS TRUMP
Almost twice as popular as “President Trump”!
4
![Page 6: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/6.jpg)
Go gle "president trump"
All News Images Videos
Bloomberg
TRUMPS TRUMP
Almost twice as popular as “President Trump”!
Although I guess that’s not so surprising...
4
![Page 7: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/7.jpg)
Bloomberg
FAKE NEWS
But big data analysis doesn’t mean better data analysis I More variables I More outliers I More noise I More spurious results
Conclusion? I Data needs to be cleaned
We will discuss data anomalies and methods for cleaning data
5
![Page 8: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/8.jpg)
Bloomberg
ACKNOWLEDGEMENTS
Work on data cleaning with: I Mario Bondioli I Jan Dash I Xipei Yang I Yan Zhang
6
![Page 9: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/9.jpg)
Symptoms
![Page 10: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/10.jpg)
Bloomberg
THE DATA
We worked with credit default swap (CDS) spread data I Spread = cost (in bp) of insuring against default of a given company
for a given time period I Quoted for 6 month, 1 year, 2 year, 3 year, 5 year, 7 year and 10
year horizons I Quoted for 1,000s of different individual companies I Quoted both for senior and subordinated debt I Consider market close data
7
![Page 11: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/11.jpg)
Bloomberg
EXAMPLE
01/08 01/09 01/10 01/11 01/12 01/13 01/14 01/15 01/16-200
0
200
400
600
800
1000
1200
1400AVP Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
8
![Page 12: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/12.jpg)
Bloomberg
DATA ISSUES
General data quality issues I Missing values I Bad values
Clean for a purpose I Relative valuation I Mark to market I Trading strategy development I Risk analysis
Risk I Missing data points
I Problematic return calculations I Problematic covariance calculations
I Bad values I Bad returns I Bad variances
9
![Page 13: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/13.jpg)
Bloomberg
CDS DATA ISSUES
CDS data specific characteristics: I 6 month point missing for first 2.5 years I Often large range of values I High volatility makes detecting bad values difficult I Data used for risk analysis
I Deleting outliers reduces risk measures I Leaving anomalies inflates risk measures
10
![Page 14: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/14.jpg)
Bloomberg
TYPICAL APPROACHES
Hole filling I Regression I Interpolation I Flat filling
Anomaly detection I Comparison to trailing volatility I Cluster analysis I Neural networks I Statistics-sensitive Non-linear Iterative Peak (SNIP) clipping
algorithm
11
![Page 15: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/15.jpg)
Hole filling
![Page 16: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/16.jpg)
Bloomberg
OVERVIEW
Hole filling Overview I Use Multi-channel Singular Spectrum Analysis (MSSA) hole filling
algorithm I Variant of Singular Spectrum Analysis (SSA) used simultaneously on
multiple time series I Decomposes each time series into a sum of components, one for
each eigenvector I Borrowed from geophysical data analysis I Makes use of both space relationships (covariance) and time
relationships (autocovariance and cross-autocovariance)
12
![Page 17: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/17.jpg)
Bloomberg
SSA Uses:
I Inspect eigenvectors and components to extract specific features of data
I Smooth data by throwing away small eigenvalues I Helpful for stabilizing correlation calculations (smooth data then
compute) References:
I A beginner’s guide to SSA, Claessen and Groth, [CG] I Singular spectrum analysis, Wikipedia, [Wik16] I Analysis of Time Series Structure: SSA and Related
Techniques, Golyandina, Nekrutkin, and Zhigljavsky, [GNZ01] I A review on singular spectrum analysis for economic and
financial time series, Hassani and Thomakos, [HT10] I SSA, Random Matrix Theory, and Noise-Reduced Correlations,
Dash et al., [Das+16a] I Stable Reduced-Noise ’Macro’ SSA–Based Correlations for
Long-Term Counterparty Risk Management, Dash et al., [Das+16b]
13
![Page 18: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/18.jpg)
Bloomberg
MSSA
Multi-channel Singular Spectrum Analysis (MSSA): I Applies SSA algorithm to a set of time series simultaneously
Uses: I Same as SSA, but takes relationships between different time series
into account I Used for forecasting
References: I Multivariate singular spectrum analysis for forecasting
revisions to real-time data, Patterson et al., [Pat+11] I Multivariate singular spectrum analysis: A general view and
new vector forecasting approach, Hassani and Mahmoudvand, [HM13]
I Advanced spectral methods for climatic time series, Ghil et al., [Ghi+02]
14
![Page 19: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/19.jpg)
Bloomberg
MSSA BASED HOLE FILLING
MSSA hole filling algorithm: I Nominally fill holes (e.g. via interpolation) I Use level l hole filling algorithm for l = l0:
I Run MSSA algorithm I Replace holes with MSSA reconstruction using l biggest singular
values I Repeat until convergence
I Increment l by one and repeat until adding singular values doesn’t have much impact and used enough singular values
References: I Spatio-temporal filling of missing points in geophysical data
sets, Kondrashov and Ghil, [KG06]
15
![Page 20: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/20.jpg)
Bloomberg
MIXED RESULTS Unfortunately, it doesn’t always work:
03/07 05/07 07/07 09/07 11/07 01/08-10
0
10
20
30
40
50
60
70
80NAB Senior USD: Original MSSA hole filling
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
16
![Page 21: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/21.jpg)
Bloomberg
OBSERVATIONS
Observations: I Sometimes MSSA doesn’t line up with actual data I Sometimes MSSA bottoms out I Using too few singular values will smooth the data
Solutions: I Anchoring – patch in data in a more consistent fashion I Reparameterization – working in log space I Adjusting MSSA parameters I Avoid filling large gaps
17
![Page 22: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/22.jpg)
Bloomberg
ANCHORING
Holes are replaced with MSSA partial reconstruction
I Can yield bias if remaining components shift results
Instead I Patch in differences relative to endpoints I Can be additive or multiplicative I One-sided holes need special treatment
18
![Page 23: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/23.jpg)
Bloomberg
REPARAMETERIZATION
MSSA hole filling is like a fixed point algorithm I Trying to find points which match reconstruction I Similar to constrained optimization
Apply classic optimization techniques I Transform problem to eliminate constraints I Work in log space if values must be positive I Log space also helps to handle changes in magnitude
Fast drop-off of eigenvalues is evidence that working in log space is the right thing
19
![Page 24: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/24.jpg)
Bloomberg
ADJUSTING MSSA PARAMETERS
Many parameters to adjust I Lag I Max/Min number of EVs I Max/Min percentage of sum of EVs I Measure of convergence
Smoothing caused by fast drop-off of EVs I Max/Min percentage ineffective I Can add more EVs, but leads to instability
20
![Page 25: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/25.jpg)
r--Af I 1 . ..,,\
\,
,-,~- 'v\ ,~,-~ < •:, . '-_ ___,
Bloomberg
NEW RESULTS After adjustments NAB:
03/07 05/07 07/07 09/07 11/07 01/08-10
0
10
20
30
40
50
60
70
80
90NAB Senior USD: Angle/distance/MSSA spike detection, all SVs
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
21
![Page 26: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/26.jpg)
Bad data detection
![Page 27: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/27.jpg)
Bloomberg
BAD DATA
How to handle bad data? I Detect it I Remove it I In our case, replace it
22
![Page 28: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/28.jpg)
Bloomberg
BAD DATA DETECTION
Many algorithms I Statistical – compare to statistical properties (like trailing SD) I Data science – clustering I Neural networks
References I Outlier Detection Techniques, Kriegel, Kroger, and Zimek, [KKZ10] I Detecting Local Outliers in Financial Time Series, Verhoeven and
McAleer, [VM] I Outlier Analysis, Aggarwal, [Agg13] I Algorithms for Mining Distance-Based Outliers in Large Datasets,
Knorr and Ng, [KN98] I Data Mining and Knowledge Discovery Handbook: A Complete
Guide for Practitioners and Researchers, Ben-Gal, [BG05] I An online spike detection and spike classification algorithm capable of
instantaneous resolution of overlapping spikes, Franke et al., [Fra+10] I A Survey of Outlier Detection Methodologies, Hodge and Austin,
[HA04]
23
![Page 29: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/29.jpg)
Bloomberg
DIFFICULTIES
Regime changes and changing volatility
01/14 07/14 01/15 07/15 01/16 07/16-200
0
200
400
600
800
1000
1200
1400
1600
1800GNW Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
24
![Page 30: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/30.jpg)
Bloomberg
HYBRID APPROACH
Data science approach – Cluster analysis I Angle-based I Distance-based
Hybrid approach I Run clustering on a windowed basis (in a neighborhood of each
point) I Combine MSSA with clustering I Remove points using analysis, then put them back if MSSA
reconstructs them close enough
Conservative approach I Do both angle and distance-based combined with MSSA I If both algorithms agree, then it’s really an anomaly
25
![Page 31: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/31.jpg)
. · ...
Bloomberg
DISTANCE-BASED EXAMPLE
26
![Page 32: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/32.jpg)
73~-----------------~
72
71
70
6Q
68
67
• •
no outlier 66 +----~~-~-~-~-~-~-~~----,
31 3.2 33 34 35 36 37 38 3ll 40 41
Bloomberg
ANGLE-BASED EXAMPLE
Angle-based, no outlier:
27
![Page 33: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/33.jpg)
73
• n • • 71 • •
• • 70
• 00
68
151 outlier 66
31 32 33 34 35 36 37 38 311 «l 41
Bloomberg
ANGLE-BASED EXAMPLE
Angle-based outlier:
28
![Page 34: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/34.jpg)
Bloomberg
RESULTS Filling of large holes
01/09 03/09 05/09 07/09 09/09 11/09 01/10-20
0
20
40
60
80
100
120
140
160COP Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
29
![Page 35: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/35.jpg)
Bloomberg
RESULTS Ignoring regime changes
09/11 01/12 05/12 09/12 01/13 05/13-200
0
200
400
600
800
1000
1200CHK Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
30
![Page 36: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/36.jpg)
X
Bloomberg
0
0 0
0 0 0
RESULTS Detecting and correcting bad data
01/15 03/15 05/15 07/15 09/15 11/15 01/160
0.5
1
1.5
2
2.5104 JNY Senior USD: Hybrid Angle/Distance
6mo
1yr
2yr
3yr
4yr
5yr
7yr
10yr
31
![Page 37: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/37.jpg)
i
Bloomberg
.. 2012
_ LC240AS
_ LC2SOAS
_ LC280AS
_ LC290AS
LC300AS _ LC320AS
LC330AS
LC340AS
LC350AS
RESULTS
Even works on CMO OASs!
32
![Page 38: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/38.jpg)
Summary
![Page 39: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/39.jpg)
Bloomberg
SUMMARY
Moral of the story 1. Know your data!
I Bad data = bad results I Big data increases need for data cleaning I Look at your data!
2. Know its usage! I Cleaning must respect usage of data
3. Algorithms will often not work as advertised! I Your data can be different I Your data usage can be different
4. Expect substantial work modifying and adjusting algorithms I Tuning I Modifying algorithms I Combining algorithms I Performance must be inspected
33
![Page 40: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/40.jpg)
SANFRANCISCO sAOP>WLO TOKYO .Pfesslhe<HELP> +496912041210 +n221neooo +442073307500 +12123112000 +14151122tl0 +55113(),114500 +0&2121000 +11332011900 ktYIW!uforinscanr -------------------------------- 1,.,.ns,stance.
Bloomberg
Thank you!
Harvey J. Stein
© 2018 Bloomberg Finance L.P. All rights reserved.
34
![Page 41: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/41.jpg)
References
![Page 42: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/42.jpg)
Bloomberg
REFERENCES
[Agg13] C. C. Aggarwal. Outlier Analysis. Springer, 2013.
[BG05] I. Ben-Gal. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.
[CG] David Claessen and Andreas Groth. A beginner’s guide to SSA. CERES-ERTI, Ecole Normale Superieure. url: http://environnement.ens.fr/IMG/file/DavidPDF/ SSA_beginners_guide_v9.pdf.
[Das+16a] Jan W Dash et al. SSA, Random Matrix Theory, and Noise-Reduced Correlations. Tech. rep. Bloomberg LP, Sept. 2016. url: https://ssrn.com/abstract=2808027.
35
![Page 43: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/43.jpg)
Bloomberg
REFERENCES [Das+16b] Jan W Dash et al. Stable Reduced-Noise ’Macro’
SSA–Based Correlations for Long-Term Counterparty Risk Management. Tech. rep. Bloomberg LP, May 2016. url: https://ssrn.com/abstract=2808015.
[Fra+10] F. Franke et al. “An online spike detection and spike classification algorithm capable of instantaneous resolution of overlapping spikes.” In: J Comput Neurosci (2010), pp. 127–148.
[Ghi+02] Michael Ghil et al. “Advanced spectral methods for climatic time series.” In: Reviews of geophysics 40.1 (2002).
[GNZ01] Nina Golyandina, Vladimir Nekrutkin, and Anatoly A Zhigljavsky. Analysis of Time Series Structure: SSA and Related Techniques. CRC Monographs on Statistics & Applied Probability. Chapman & Hall, 2001.
[HA04] V. J. Hodge and J. Austin. A Survey of Outlier Detection Methodologies. Kluwer Academic Publishers., 2004.
36
![Page 44: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/44.jpg)
Bloomberg
REFERENCES
[HM13] Hossein Hassani and Rahim Mahmoudvand. “Multivariate singular spectrum analysis: A general view and new vector forecasting approach.” In: International Journal of Energy and Statistics 1.01 (2013), pp. 55–83.
[HT10] Hossein Hassani and Dimitrios Thomakos. “A review on singular spectrum analysis for economic and financial time series.” In: Statistics and Its Interface 3.3 (2010), pp. 377–397. issn: 1938-7989.
[KG06] Dmitri Kondrashov and Michael Ghil. “Spatio-temporal filling of missing points in geophysical data sets.” In: Nonlinear Processes in Geophysics 13.2 (2006), pp. 151–159.
[KKZ10] H.-P. Kriegel, P. Kroger, and A. Zimek. “Outlier Detection Techniques.” In: The 2010 SIAM International Conference on Data Mining, 2010.
37
![Page 45: Big Data's Big Secret](https://reader030.vdocument.in/reader030/viewer/2022012017/615ba1c04405693cd75df798/html5/thumbnails/45.jpg)
Bloomberg
REFERENCES [KN98] Edwin M. Knorr and Raymond T. Ng. “Algorithms for
Mining Distance-Based Outliers in Large Datasets.” In: Proceedings of the 24th VLDB Conference New York, 1998.
[Pat+11] Kerry Patterson et al. “Multivariate singular spectrum analysis for forecasting revisions to real-time data.” In: Journal of Applied Statistics 38.10 (2011), pp. 2183–2211.
[VM] P. Verhoeven and M. McAleer. Detecting Local Outliers in Financial Time Series. Department of Economics, University of Western Australia.
[Wik16] Wikipedia. Singular spectrum analysis. [Online; accessed 9-December-2016]. 2016. url: https://en.wikipedia.org/w/index.php?title= Singular_spectrum_analysis.
38