Light Up Your Dark Data
![Page 1: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/1.jpg)
QuantCon
“Light Up Your Dark Data”
April 2016
![Page 2: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/2.jpg)
What is dark data?
(Diagram: many scattered data sources labeled SQL, CSV, REST, and JSON)
![Page 3: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/3.jpg)
Example Datasets
Trade History
Signal History
Clearing Data
Log Files
Ref Data
Corp Actions
Market Data
Models
Categories: Firm Generated, Vendor Generated
![Page 4: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/4.jpg)
Compounding Challenges
Accumulates Quickly
Disparate Storage
Different Vendors
Format Changes
Ad-hoc Usage
Urgent!
![Page 5: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/5.jpg)
Workflow
Find Data
Ad-Hoc ETL
Store / Copy
Analysis
Report
![Page 6: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/6.jpg)
Sample Environment
Oracle MySQL MSSQL KDB ZIP CSV
SQL
Python
DSL
R Matlab
C++ Java
Storage
ETL
Analysis
REST
![Page 7: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/7.jpg)
Independent First Class Citizens
Expression
Compute
Data
![Page 8: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/8.jpg)
Datashape
Structured data description language
http://datashape.pydata.org
![Page 9: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/9.jpg)
Datashape Example
daily_bars: var * {
    date: string,
    symbol: string,
    open: float64,
    high: float64,
    low: float64,
    close: float64,
    volume: int64,
}
Language, compute, and storage independent
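To make the idea of a storage-independent record description concrete, here is a minimal sketch in plain Python. It is not the datashape library's API: the `DAILY_BARS` mapping and the `conform` helper are hypothetical names that mirror the `daily_bars` fields above, using Python's built-in types in place of `string`, `float64`, and `int64`.

```python
# Illustrative stand-in for the daily_bars record above.
# Assumption: built-in Python types approximate the datashape types.
DAILY_BARS = {
    "date": str,
    "symbol": str,
    "open": float,
    "high": float,
    "low": float,
    "close": float,
    "volume": int,
}


def conform(row, schema=DAILY_BARS):
    """Coerce a dict of raw strings to the types named in the schema."""
    missing = set(schema) - set(row)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {name: typ(row[name]) for name, typ in schema.items()}


# Raw values as they might arrive from a CSV reader (all strings).
raw = {"date": "20151111", "symbol": "aaap", "open": "18.5", "high": "25.9",
       "low": "18", "close": "24.5", "volume": "1584600"}
bar = conform(raw)
```

The same description could be checked against rows coming from a CSV file, a SQL cursor, or a REST payload, which is the sense in which the schema is independent of where the data lives.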
![Page 10: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/10.jpg)
Blaze
Write expressions independent of storage system
Push computations to the data
Lazy evaluation
Pandas-like API
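The points above can be illustrated with a small standard-library sketch. It does not use Blaze and does not reproduce its API: `Mean` and `compute` are hypothetical names, and the table name `bars` is an assumption. The expression is built once, evaluated only when `compute` is called, and the SQLite backend receives the work as a query rather than having rows pulled out first.

```python
import sqlite3


class Mean:
    """Lazy expression: mean of `column` over rows where `key` == `value`."""

    def __init__(self, column, key, value):
        self.column, self.key, self.value = column, key, value


def compute(expr, data):
    """Evaluate one expression against whichever backend holds the data."""
    if isinstance(data, sqlite3.Connection):
        # Push the computation to the database.
        # Column names are interpolated directly; acceptable only in a sketch.
        sql = f"SELECT AVG({expr.column}) FROM bars WHERE {expr.key} = ?"
        return data.execute(sql, (expr.value,)).fetchone()[0]
    # In-memory backend: a list of dicts.
    vals = [r[expr.column] for r in data if r[expr.key] == expr.value]
    return sum(vals) / len(vals)


rows = [
    {"symbol": "aaap", "close": 24.5},
    {"symbol": "aaap", "close": 25.0},
    {"symbol": "zyne", "close": 9.75},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bars (symbol TEXT, close REAL)")
conn.executemany("INSERT INTO bars VALUES (:symbol, :close)", rows)

expr = Mean("close", "symbol", "aaap")  # nothing is computed yet
in_memory = compute(expr, rows)
in_database = compute(expr, conn)
```

Both calls evaluate the same `expr`; only the backend passed to `compute` differs.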
![Page 12: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/12.jpg)
Blaze Expressions
![Page 13: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/13.jpg)
Flat File Repositories
Many directories and files
Dictated structure
Naming convention part of dataset
Requires one off ad-hoc scripts
![Page 14: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/14.jpg)
Vendor - directory structure
/daily/us/nasdaq stocks/
/daily/us/nasdaq stocks/1/
/daily/us/nasdaq stocks/2/
osn.us.txt
ostk.us.txt
…
zyne.us.txt
/daily/us/nyse etfs/
/daily/us/nyse stocks/1/
/daily/us/nyse stocks/2/
Contains ~8400 individual files
![Page 15: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/15.jpg)
Vendor – file contents
Date,Open,High,Low,Close,Volume,OpenInt
20151111,18.5,25.9,18,24.5,1584600,0
20151112,24.25,27.12,22.5,25,83000,0
20151113,25.47,26.2,24.55,25.26,67300,0
20151116,25.01,26.19,24.13,25.02,16900,0
20151117,24.46,25.51,24.38,24.62,25900,0
20151118,24.62,26.31,24.06,25,111100,0
20151119,24.85,26,24.71,25.9,113100,0
…
Symbol is not contained within the individual data files
/daily/us/nasdaq stocks/1/aaap.us.txt
![Page 16: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/16.jpg)
Lux
source: "lux://global-equities/data/daily/us/nasdaq stocks"
extractor: "{}/{Symbol}.{Region}.txt"
Date,Open,High,Low,Close,Volume,OpenInt,Symbol,Region
20151111,18.5,25.9,18,24.5,1584600,0,aaap,us
20151112,24.25,27.12,22.5,25,83000,0,aaap,us
20151113,25.47,26.2,24.55,25.26,67300,0,aaap,us
…
20160322,11.56,11.98,10.8894,11.09,517604,0,zyne,us
20160323,11.3,11.72,9.5,9.75,489743,0,zyne,us
20160324,9.5,10.24,9.22,9.64,188512,0,zyne,us
One dataset with ~5.5 million rows
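The effect of the extractor pattern can be approximated with the standard library. This is a sketch of the idea, not Lux's implementation. It assumes that `{}` matches any single path segment and that `{Name}` captures a field from the file name; `pattern_to_regex` and `combine` are hypothetical helpers. Files are held in an in-memory dict so the example is self-contained, and the subdirectory shown for `zyne.us.txt` is assumed.

```python
import csv
import io
import re


def pattern_to_regex(pattern):
    """Turn '{}/{Symbol}.{Region}.txt' into a regex with named groups."""
    out = []
    for part in re.split(r"(\{\w*\})", pattern):
        if part == "{}":
            out.append(r"[^/]+")                      # anonymous path segment
        elif part.startswith("{") and part.endswith("}"):
            out.append(f"(?P<{part[1:-1]}>[^/.]+)")   # named field
        else:
            out.append(re.escape(part))               # literal text
    return re.compile("".join(out) + r"\Z")


def combine(files, pattern):
    """Merge {relative_path: csv_text} into rows with path fields appended."""
    regex = pattern_to_regex(pattern)
    merged = []
    for path, text in sorted(files.items()):
        match = regex.match(path)
        if match is None:
            continue                                  # path does not fit the convention
        for row in csv.DictReader(io.StringIO(text)):
            row.update(match.groupdict())
            merged.append(row)
    return merged


HEADER = "Date,Open,High,Low,Close,Volume,OpenInt\n"
files = {
    "1/aaap.us.txt": HEADER + "20151111,18.5,25.9,18,24.5,1584600,0\n",
    "2/zyne.us.txt": HEADER + "20160324,9.5,10.24,9.22,9.64,188512,0\n",  # subdirectory assumed
}
rows = combine(files, "{}/{Symbol}.{Region}.txt")
```

Each merged row carries `Symbol` and `Region` columns taken from its file name, so the many per-symbol files can be read as one dataset.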
![Page 17: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/17.jpg)
Lux Benefits
Combines individual files
No separate ETL or storage
Names become part of data
Optimized compute
![Page 18: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/18.jpg)
Anaconda Mosaic
Interactive exploration
Intuitive interface
Advanced visualizations
Catalog of datasets and expressions
Provenance and Governance
![Page 19: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/19.jpg)
Live Walkthrough
![Page 20: Light Up Your Dark Data](https://reader035.vdocument.in/reader035/viewer/2022062905/586e72e11a28ab99598b5273/html5/thumbnails/20.jpg)
Project References
• Anaconda Mosaic - http://know.continuum.io/Anaconda-Mosaic
• Blaze Ecosystem - http://blaze.pydata.org
• Bokeh - http://bokeh.pydata.org