![Page 1: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/1.jpg)
Python and MongoDB as a Market Data Platform
Scalable storage of time series data
2014
![Page 2: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/2.jpg)
Opinions expressed are those of the author and may not be shared by all personnel of Man Group plc (‘Man’). These opinions are subject to change without notice, and are for information purposes only and do not constitute an offer or invitation to make an investment in any financial instrument or in any product to which any member of Man’s group of companies provides investment advisory or any other services. Any forward-looking statements speak only as of the date on which they are made and are subject to risks and uncertainties that may cause actual results to differ materially from those contained in the statements. Unless stated otherwise this information is communicated by Man Investments Limited and AHL Partners LLP which are both authorised and regulated in the UK by the Financial Conduct Authority.
Legalese…
![Page 3: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/3.jpg)
The Problem
![Page 4: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/4.jpg)
Financial data comes in different sizes…
• ~1MB: 1x a day price data
• ~1GB x 1000s: 9,000 x 9,000 data matrices
• ~40GB: 1-minute data
• ~30TB: tick data
• > even larger data sets (options, …)
… and different shapes
• Time series of prices
• Event data
• News data
• What’s next?
Overview – Data shapes
![Page 5: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/5.jpg)
Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – maximize throughput
• Historical data
• New data
• ... want control of storing their own data
Trading system
• Auditable – SVN for data
• Stable
• Performant
Overview – Data consumers
![Page 6: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/6.jpg)
The Research Problem – Scale
lib.read('Equity Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8631)
Equity Prices: 77M float64s
593MB of data = 4,744Mbits!
600 MB
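The figures on this slide follow from the shape of the frame. A quick check, assuming 8 bytes per float64 and reading "MB" as 2**20 bytes (the variable names are illustrative only):

```python
# Shape of the 'Equity Prices' DataFrame shown above.
rows, cols = 9605, 8103
BYTES_PER_FLOAT64 = 8

n_values = rows * cols                    # number of float64 cells
n_bytes = n_values * BYTES_PER_FLOAT64    # raw in-memory size in bytes
size_mib = n_bytes / 2**20                # "MB" taken as 2**20 bytes
size_mbit = int(size_mib) * 8             # megabits to move over the network

print(f"{n_values / 1e6:.1f}M values, {size_mib:.1f} MB, {size_mbit:,} Mbit")
```

Under those assumptions the arithmetic matches the 77M values, 593MB and 4,744Mbits quoted on the slide.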
![Page 8: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/8.jpg)
Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches
Can we build one system to rule them all?
Overview – Databases
![Page 9: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/9.jpg)
Goals
• 10 years of 1 minute data in <1s
• 200 instruments x all history x once a day data in <1s
• Single data store for all data types
  • From 1x day data to tick data
• Data versioning + Audit
Requirements
• Fast – most data in-memory
• Complete – all data in single location
• Scalable – unbounded in size and number of clients
• Agile – rapid iterative development
Project Goals
![Page 10: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/10.jpg)
Implementation
![Page 11: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/11.jpg)
Impedance mismatch between Python/Pandas/Numpy and Existing Databases
- Machine cluster operating on data blocks
Vs
- Database doing the analytical work
MongoDB:
- Developer productivity
- Documents map naturally to Python dictionaries
- Fast out of the box
- Low latency
- High throughput
- Predictable performance
- Sharding / Replication for growth and scale out
- Free
- Great support
- Most widely used NoSQL DB
Implementation – Choosing MongoDB
![Page 12: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/12.jpg)
Implementation – System Architecture
[Architecture diagram: Python clients (cn…) connect through three mongos routers, backed by three config servers, to five shards, rs0 to rs4, each a mongod with 500GB.]
{'_id': ObjectId('…'),
 'c': 47,
 'columns': {
     'PRICE': {'data': Binary('...', 0),
               'dtype': 'float64',
               'rowmask': Binary('...', 0)},
     'SIZE': {'data': Binary('...', 0),
              'dtype': 'int64'}},
 'endSeq': -1L,
 'index': Binary('...', 0),
 'segment': 1296568173000L,
 'sha': abcd123456,
 'start': 1296568173000L,
 'end': 1298569664000L,
 'symbol': 'AST1209',
 'v': 2}
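The 'segment', 'start' and 'end' values look like Unix timestamps in milliseconds. That reading is an assumption, since the slide does not state the unit. Under it, a short sketch (the helper name `ms_to_utc` is hypothetical) decodes the time range this document would cover:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the example document (assumed to be epoch milliseconds).
start = ms_to_utc(1296568173000)
end = ms_to_utc(1298569664000)

print(start.isoformat(), "to", end.isoformat())
print("span:", end - start)
```

If the assumption holds, this single document spans roughly three weeks of ticks for one symbol in February 2011.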
![Page 13: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/13.jpg)
Data bucketed into named Libraries
• One minute
• Daily
• User-data: jbloggs.EOD
• Metadata Index
Pluggable library types:
• VersionStore
• TickStore
• Metadata store
• … others …
© Man 2013
Implementation – Mongoose
![Page 14: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/14.jpg)
Mongoose key-value store
Implementation - MongooseAPI
from ahl.mongo import Mongoose
m = Mongoose('research')                      # Connect to the data store
m.list_libraries()                            # What data libraries are available
library = m['jbloggs.EOD']                    # Get a Library
library.list_symbols()                        # List symbols
library.write('SYMBOL', <TS or other data>)   # Write
library.read('SYMBOL', version=…)             # Read, with an optional version
library.snapshot('snapshot-name')             # Create a named snapshot of the library
library.list_snapshots()                      # List the snapshots of the library
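To make the semantics of this API concrete, here is a minimal in-memory sketch of a versioned key-value library. It is not the Mongoose implementation: the class `ToyLibrary`, its storage layout, its integer version numbers and the `read_snapshot` helper are all assumptions made for illustration. Only the other method names mirror the slide.

```python
class ToyLibrary:
    """In-memory stand-in for a versioned library (hypothetical, not Mongoose)."""

    def __init__(self):
        self._versions = {}   # symbol -> list of items; index i holds version i + 1
        self._snapshots = {}  # snapshot name -> {symbol: version number}

    def list_symbols(self):
        return sorted(self._versions)

    def write(self, symbol, item):
        """Append a new version and return its number; old versions are kept."""
        history = self._versions.setdefault(symbol, [])
        history.append(item)
        return len(history)

    def read(self, symbol, version=None):
        """Return the latest version, or the one requested."""
        history = self._versions[symbol]
        if version is None:
            version = len(history)
        return history[version - 1]

    def snapshot(self, name):
        """Pin the current version of every symbol under a name."""
        self._snapshots[name] = {s: len(h) for s, h in self._versions.items()}

    def list_snapshots(self):
        return sorted(self._snapshots)

    def read_snapshot(self, symbol, name):
        """Read a symbol as it was when the named snapshot was taken."""
        return self.read(symbol, version=self._snapshots[name][symbol])


library = ToyLibrary()
library.write('SYMBOL', [1.0, 2.0])
library.snapshot('snapshot-name')
library.write('SYMBOL', [1.0, 2.0, 3.0])   # creates version 2; version 1 survives

latest = library.read('SYMBOL')
pinned = library.read_snapshot('SYMBOL', 'snapshot-name')
```

Because writes never overwrite, a snapshot only has to record version numbers, which is what makes an audit trail ("SVN for data") cheap to keep.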
![Page 15: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/15.jpg)
Implementation – Version Store
[Diagram: snapshots Snap A and Snap B shown alongside symbol versions Sym1 v1, Sym2 v3 and Sym2 v4.]
![Page 16: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/16.jpg)
Implementation – VersionStore: A chunk
![Page 17: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/17.jpg)
Implementation – VersionStore: A version
![Page 18: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/18.jpg)
Implementation – VersionStore: Bringing it together
![Page 19: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/19.jpg)
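import cPickle
import lz4
from bson.binary import Binary

_CHUNK_SIZE = 15 * 1024 * 1024  # 15MB, below MongoDB's 16MB document limit

class PickleStore(object):
    def write(self, collection, version, symbol, item):
        # Try to pickle it. This is best effort
        pickled = lz4.compressHC(cPickle.dumps(item))
        # Split the compressed pickle into chunk-sized segments
        for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
            segment = {'data': Binary(pickled[i * _CHUNK_SIZE : (i + 1) * _CHUNK_SIZE])}
            segment['segment'] = i
            # checksum() is assumed to be defined elsewhere
            sha = checksum(symbol, segment)
            # Upsert by content hash: an unchanged chunk is reused and
            # simply gains this version as an additional parent
            collection.update({'symbol': symbol, 'sha': sha},
                              {'$set': segment,
                               '$addToSet': {'parent': version['_id']}},
                              upsert=True)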
Implementation – Arbitrary Data
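The write path above stores each chunk under a content hash and records which versions reference it. The sketch below imitates that behaviour with the standard library only, as an illustration and not the original code: `zlib` is swapped in for lz4, a plain dict stands in for the MongoDB collection, SHA-1 over symbol, segment number and data is an assumed checksum, and the chunk size is shrunk so that a small item spans several chunks.

```python
import hashlib
import pickle
import random
import zlib

CHUNK_SIZE = 1024  # tiny, for demonstration; the slide uses 15MB


def write(store, version_id, symbol, item, chunk_size=CHUNK_SIZE):
    """Chunk a compressed pickle into `store`, a dict standing in for a collection."""
    # Fixed pickle protocol so identical items produce identical bytes.
    blob = zlib.compress(pickle.dumps(item, protocol=4))
    n_chunks = max(1, -(-len(blob) // chunk_size))  # ceiling division
    for i in range(n_chunks):
        data = blob[i * chunk_size:(i + 1) * chunk_size]
        sha = hashlib.sha1(symbol.encode() + str(i).encode() + data).hexdigest()
        # Emulates the upsert: create the chunk if absent, then add this
        # version to its set of parents (the '$addToSet' step).
        doc = store.setdefault((symbol, sha),
                               {'data': data, 'segment': i, 'parent': set()})
        doc['parent'].add(version_id)
    return n_chunks


rng = random.Random(0)
prices = [rng.random() for _ in range(1000)]

store = {}
n1 = write(store, 'v1', 'SYMBOL', prices)
chunks_after_v1 = len(store)
n2 = write(store, 'v2', 'SYMBOL', prices)   # same content, new version
```

Writing the same data under a second version adds no new chunks: each existing chunk simply lists both versions as parents, so unchanged data is stored once however many versions refer to it.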
![Page 22: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/22.jpg)
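import cPickle
import lz4
import pymongo

class PickleStore(object):
    def read(self, collection, version, symbol):
        # Fetch every chunk belonging to this version, in segment order
        chunks = collection.find({'symbol': symbol, 'parent': version['_id']},
                                 sort=[('segment', pymongo.ASCENDING)])
        # Reassemble the compressed pickle, then decompress and unpickle it
        data = ''.join([x['data'] for x in chunks])
        return cPickle.loads(lz4.decompress(data))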
Implementation – Arbitrary Data
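Reading reverses the write: select the chunks that belong to a version, order them by segment number, join, decompress and unpickle. A minimal sketch under the same kind of assumptions as before (`zlib` in place of lz4, a list of dicts in place of a MongoDB collection, hand-built documents) shows why the sort matters:

```python
import pickle
import zlib


def read(docs, version_id, symbol):
    """Rebuild an item from chunk documents (a list of dicts standing in for a collection)."""
    # Plays the role of find({...}, sort=[('segment', ASCENDING)]).
    matching = [d for d in docs
                if d['symbol'] == symbol and version_id in d['parent']]
    matching.sort(key=lambda d: d['segment'])
    blob = b''.join(d['data'] for d in matching)
    return pickle.loads(zlib.decompress(blob))


# Build chunk documents by hand, deliberately out of order.
item = {'name': 'SYMBOL', 'values': list(range(100))}
blob = zlib.compress(pickle.dumps(item))
size = len(blob) // 3 + 1
docs = [{'symbol': 'SYMBOL', 'segment': i, 'parent': ['v1'],
         'data': blob[i * size:(i + 1) * size]} for i in reversed(range(3))]
# A chunk of another symbol must be ignored by the query.
docs.append({'symbol': 'OTHER', 'segment': 0, 'parent': ['v1'], 'data': b'junk'})

restored = read(docs, 'v1', 'SYMBOL')
```

Without the sort, the joined bytes would be in storage order and decompression would fail or return the wrong object.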
![Page 23: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/23.jpg)
Implementation – DataFrames
def do_write(df, version):
    records = df.to_records()
    # Save the record dtype so the raw bytes can be decoded on read
    version['dtype'] = str(records.dtype)
    # Rows per chunk, so that no record is split across two chunks
    chunk_size = _CHUNK_SIZE / records.dtype.itemsize
    ... chunk_and_store ...

def do_read(version):
    ... read_chunks ...
    data = ''.join(chunks)
    dtype = np.dtype(version['dtype'])
    recs = np.fromstring(data, dtype=dtype)
    return DataFrame.from_records(recs)
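Note that `chunk_size` above is a number of rows, not bytes: dividing the byte budget by the record size means chunks always break on a row boundary. The sketch below shows the same idea using only the standard library, with `struct` standing in for a NumPy record dtype. The record layout `'<qd'` (an int64 timestamp and a float64 price), the helper names and the tiny byte budget are assumptions for illustration.

```python
import struct

RECORD = struct.Struct('<qd')   # assumed layout: int64 timestamp, float64 price
CHUNK_BYTES = 100               # tiny byte budget for demonstration


def to_chunks(rows, chunk_bytes=CHUNK_BYTES):
    """Serialise rows into byte chunks that each hold whole records only."""
    rows_per_chunk = max(1, chunk_bytes // RECORD.size)   # like _CHUNK_SIZE / itemsize
    return [b''.join(RECORD.pack(*row) for row in rows[i:i + rows_per_chunk])
            for i in range(0, len(rows), rows_per_chunk)]


def from_chunks(chunks):
    """Join the chunks and decode the fixed-size records back into rows."""
    data = b''.join(chunks)
    return [tuple(rec) for rec in RECORD.iter_unpack(data)]


rows = [(1296568173000 + 60000 * i, 100.0 + 0.25 * i) for i in range(20)]
chunks = to_chunks(rows)
restored = from_chunks(chunks)
```

Because every chunk is a whole number of records, a reader can decode any single chunk without needing its neighbours.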
![Page 24: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/24.jpg)
Results
![Page 25: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/25.jpg)
Flat files on NFS – Random market
Results – Performance Once a Day Data
![Page 26: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/26.jpg)
HDF5 files – Random instrument
Results – Performance One Minute Data
![Page 27: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/27.jpg)
Random E-Mini S&P contract from 2013
Results – TickStore – 8 parallel
![Page 28: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/28.jpg)
Random E-Mini S&P contract from 2013
Results – TickStore
![Page 29: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/29.jpg)
Random E-Mini S&P contract from 2013
Results – TickStore Throughput
![Page 30: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/30.jpg)
Random E-Mini S&P contract from 2013
Results – System Load
[Chart labels: OtherTick, Mongo (x2); N Tasks = 32]
![Page 31: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/31.jpg)
Built a system to store data of any shape and size
- Reduced impedance between Python language and the data store
Low latency:
- 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- OneMinute / Tick data: 1s for 3.5M rows in Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows in Java
Parallel Access:
- Cluster with 256+ concurrent data access
- Consistent throughput – little load on the Mongo server
Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s)
Conclusions
![Page 32: Python and MongoDB as a Market Data Platform by James Blackburn](https://reader034.vdocument.in/reader034/viewer/2022042602/55983b191a28ab2d628b4809/html5/thumbnails/32.jpg)
Questions?