Python and MongoDB as a Market Data Platform by James Blackburn


Python and MongoDB as a Market Data Platform

Scalable storage of time series data

2014

Opinions expressed are those of the author and may not be shared by all personnel of Man Group plc (‘Man’). These opinions are subject to change without notice, and are for information purposes only and do not constitute an offer or invitation to make an investment in any financial instrument or in any product to which any member of Man’s group of companies provides investment advisory or any other services. Any forward-looking statements speak only as of the date on which they are made and are subject to risks and uncertainties that may cause actual results to differ materially from those contained in the statements. Unless stated otherwise this information is communicated by Man Investments Limited and AHL Partners LLP which are both authorised and regulated in the UK by the Financial Conduct Authority.

2

Legalese…

3

The Problem

Financial data comes in different sizes…

• ~1MB: 1x a day price data

• ~1GB x 1000s: 9,000 x 9,000 data matrices

• ~40GB: 1-minute data

• ~30TB: Tick data

• Even larger data sets (options, …)

… and different shapes

• Time series of prices

• Event data

• News data

• What’s next?

4

Overview – Data shapes

Quant researchers

• Interactive work – latency sensitive

• Batch jobs run on a cluster – maximize throughput

• Historical data

• New data

• ... want control of storing their own data

Trading system

• Auditable – SVN for data

• Stable

• Performant

5

Overview – Data consumers

6

The Research Problem – Scale

lib.read('Equity Prices')

Out[4]:

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00

Columns: 8103 entries, AST10000 to AST9997

dtypes: float64(8631)

Equity Prices: 77M float64s

593MB of data = 4,744Mbits!
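As a quick sanity check, the size quoted above follows directly from the row and column counts shown (a back-of-envelope sketch):

# Back-of-envelope check of the DataFrame size quoted above.
n_values = 9605 * 8103                 # rows x columns ~= 77.8M float64s
megabytes = n_values * 8 / 2.0 ** 20   # 8 bytes per float64 ~= 594 MB
megabits = megabytes * 8               # ~= 4,750 Mbits over the network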


Many different existing data stores

• Relational databases

• Tick databases

• Flat files

• HDF5 files

• Caches

7

Overview – Databases


8

Can we build one system to rule them all?

Overview – Databases

Goals

• 10 years of 1-minute data in <1s (rough row count sketched below)

• 200 instruments x all history of once-a-day data in <1s

• Single data store for all data types: 1x day data → Tick Data

• Data versioning + Audit

Requirements

• Fast – most data in-memory

• Complete – all data in single location

• Scalable – unbounded in size and number of clients

• Agile – rapid iterative development
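To put the first goal in perspective, here is the rough row count it implies, assuming a roughly round-the-clock futures session (the exact trading calendar will vary):

# Rough scale of "10 years of 1-minute data" for a single instrument.
rows = 10 * 252 * 24 * 60   # years x trading days x minutes ~= 3.6M rows, to be read in <1s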

9

Project Goals

10

Implementation

Impedance mismatch between Python/Pandas/NumPy and existing databases:

- A cluster of machines operating on data blocks

vs.

- The database doing the analytical work

MongoDB:

- Developer productivity

  - Document ≈ Python dictionary

- Fast out of the box

  - Low latency

  - High throughput

  - Predictable performance

- Sharding / replication for growth and scale-out

- Free

- Great support

- Most widely used NoSQL DB

11

Implementation – Choosing MongoDB
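A minimal sketch of the "document ≈ Python dictionary" point using plain pymongo (the database, collection and field names are purely illustrative, and the pymongo 3+ API is assumed; this is not the deck's own storage code):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
coll = client['research']['example']

# A Python dict maps directly onto a BSON document...
coll.insert_one({'symbol': 'AST1209', 'close': 101.25, 'source': 'EOD'})

# ...and comes back out as a dict.
print(coll.find_one({'symbol': 'AST1209'}))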

12

Implementation – System Architecture

[Architecture diagram: Python clients (cn…) connect via mongos routers to a sharded MongoDB cluster of five replica sets (rs0–rs4), each mongod holding ~500GB, with three config servers coordinating the shards.]

{'_id': ObjectId('…'),
 'c': 47,
 'columns': {
     'PRICE': {'data': Binary('...', 0),
               'dtype': 'float64',
               'rowmask': Binary('...', 0)},
     'SIZE': {'data': Binary('...', 0),
              'dtype': 'int64'}},
 'endSeq': -1L,
 'index': Binary('...', 0),
 'segment': 1296568173000L,
 'sha': 'abcd123456',
 'start': 1296568173000L,
 'end': 1298569664000L,
 'symbol': 'AST1209',
 'v': 2}
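For illustration, a hedged sketch of how one column could be packed into the 'data'/'dtype' fields of a document like the one above (the exact packing and compression settings are assumptions; lz4.compressHC is the older python-lz4 API used later in the deck):

import lz4                      # older python-lz4 API exposing compressHC
import numpy as np
from bson.binary import Binary

prices = np.array([100.5, 101.0, 101.25], dtype='float64')

column = {
    'data': Binary(lz4.compressHC(prices.tostring())),  # compressed raw bytes
    'dtype': str(prices.dtype),                          # 'float64'
}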

Data bucketed into named Libraries

• One minute

• Daily

• User-data: jbloggs.EOD

• Metadata Index

Pluggable library types:

• VersionStore

• TickStore

• Metadata store

• … others …

13

Implementation – Mongoose

Mongoose key-value store

14

Implementation – Mongoose API

from ahl.mongo import Mongoose

m = Mongoose('research')                     # Connect to the data store

m.list_libraries()                           # What data libraries are available

library = m['jbloggs.EOD']                   # Get a Library

library.list_symbols()                       # List symbols in the library

library.write('SYMBOL', <TS or other data>)  # Write

library.read('SYMBOL', version=…)            # Read, with an optional version

library.snapshot('snapshot-name')            # Create a named snapshot of the library

library.list_snapshots()                     # List the library's snapshots
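A hedged end-to-end sketch using only the calls listed above (the library and symbol names are illustrative, and the internal ahl.mongo package is assumed to be importable):

import pandas as pd
from ahl.mongo import Mongoose

m = Mongoose('research')
library = m['jbloggs.EOD']

ts = pd.Series([100.0, 101.5],
               index=pd.to_datetime(['2014-01-01', '2014-01-02']))

library.write('SPX', ts)          # writes a new version of the symbol
library.snapshot('before-fix')    # freeze the current library state under a name

library.write('SPX', ts * 1.01)   # another write -> another version
latest = library.read('SPX')      # the newest version by default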

15

Implementation – Version Store

[Diagram: named snapshots (Snap A, Snap B) referencing specific symbol versions – Sym1 v1, Sym2 v3, Sym2 v4.]

16

Implementation – VersionStore: A chunk

17

Implementation – VersionStore: A version

18

Implementation – VersionStore: Bringing it together

import lz4                      # python-lz4 (older API exposing compressHC)
import cPickle
import pymongo
from bson.binary import Binary

_CHUNK_SIZE = 15 * 1024 * 1024  # 15MB

class PickleStore(object):

    def write(self, collection, version, symbol, item):
        # Try to pickle it. This is best effort.
        pickled = lz4.compressHC(cPickle.dumps(item))
        # Split the compressed blob into 15MB segments and upsert each one,
        # keyed by (symbol, sha) and linked back to the owning version.
        for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
            segment = {'data': Binary(pickled[i * _CHUNK_SIZE: (i + 1) * _CHUNK_SIZE])}
            segment['segment'] = i
            sha = checksum(symbol, segment)   # content-hash helper, defined elsewhere
            collection.update({'symbol': symbol, 'sha': sha},
                              {'$set': segment,
                               '$addToSet': {'parent': version['_id']}},
                              upsert=True)
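The chunking arithmetic in the loop above, on an illustrative payload size:

# Illustrative sizes only: a 40MB compressed pickle splits into three segments.
_CHUNK_SIZE = 15 * 1024 * 1024                # 15MB
payload_len = 40 * 1024 * 1024                # 40MB
n_segments = payload_len // _CHUNK_SIZE + 1   # -> 3 segments (15MB + 15MB + 10MB)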

19

Implementation – Arbitrary Data


class PickleStore(object):

    def read(self, collection, version, symbol):
        # Fetch this symbol's segments for the given version in segment order,
        # re-join the compressed blob, then decompress and unpickle it.
        data = ''.join([x['data'] for x in collection.find({'symbol': symbol,
                                                            'parent': version['_id']},
                                                           sort=[('segment', pymongo.ASCENDING)])])
        return cPickle.loads(lz4.decompress(data))
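A hedged round-trip sketch of the write/read pair above, assuming both methods live on the same PickleStore class and the pymongo 2/3-era API used in the slides; the database and collection names are illustrative, and checksum() is a hypothetical stand-in for the helper the slides assume:

import hashlib
import pymongo
from bson import ObjectId
from bson.binary import Binary

def checksum(symbol, segment):
    # Hypothetical stand-in for the deck's checksum helper.
    h = hashlib.sha1(symbol)
    h.update(segment['data'])
    return Binary(h.digest())

coll = pymongo.MongoClient()['research']['jbloggs.EOD']
version = {'_id': ObjectId()}     # minimal stand-in for a version document

store = PickleStore()
store.write(coll, version, 'SPX', {'px': [100.0, 101.5]})
assert store.read(coll, version, 'SPX') == {'px': [100.0, 101.5]}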

22

Implementation – Arbitrary Data

23

Implementation – DataFrames

def do_write(df, version):
    # Convert the DataFrame to a numpy record array and remember its dtype
    # on the version document so the raw bytes can be reconstructed on read.
    records = df.to_records()
    version['dtype'] = str(records.dtype)
    chunk_size = _CHUNK_SIZE / records.dtype.itemsize
    ... chunk_and_store ...

def do_read(version):
    ... read_chunks ...
    data = ''.join(chunks)
    dtype = np.dtype(version['dtype'])
    recs = np.fromstring(data, dtype=dtype)
    return DataFrame.from_records(recs)
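An in-memory sketch of the records round trip that do_write/do_read rely on, with the Mongo chunking step left out just as it is elided above; the column name is illustrative and the dtype is kept in memory here rather than persisted on the version document:

import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'price': [100.0, 101.5]},
               index=pd.date_range('2014-01-01', periods=2))

records = df.to_records()             # structured array including the index
data = records.tostring()             # raw bytes, ready to chunk into segments

# Bytes back to a structured array (np.frombuffer/tobytes in newer numpy),
# then back to a DataFrame with the index column restored.
recs = np.fromstring(data, dtype=records.dtype)
roundtripped = DataFrame.from_records(recs, index='index')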

24

Results

Flat files on NFS – Random market

25

Results – Performance Once a Day Data

HDF5 files – Random instrument

26

Results – Performance One Minute Data

Random E-Mini S&P contract from 2013

27

Results – TickStore – 8 parallel

Random E-Mini S&P contract from 2013

28

Results – TickStore

Random E-Mini S&P contract from 2013

29

Results – TickStore Throughput

Random E-Mini S&P contract from 2013

30

Results – System Load

[Chart: cluster system load with N Tasks = 32 – OtherTick vs. Mongo (x2)]

Built a system to store data of any shape and size

- Reduced impedance mismatch between the Python language and the data store

Low latency:

- 1x day data: 4ms for 10,000 rows (vs. 2,210ms from SQL)

- One-minute / Tick data: 1s for 3.5M rows in Python (vs. 15s–40s+ from OtherTick)

- 1s for 15M rows in Java

Parallel Access:

- Cluster with 256+ concurrent data accesses

- Consistent throughput – little load on the Mongo server

Efficient:

- 10-15x reduction in network load

- Negligible decompression cost (lz4: 1.8Gb/s)
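A rough way to reproduce the decompression figure above on your own hardware (a sketch assuming the older python-lz4 API used in the slide code; newer releases expose the same calls under lz4.block):

import time
import numpy as np
import lz4                                     # older python-lz4 API

payload = np.tile(np.arange(1024, dtype='float64'), 1024).tostring()   # ~8MB
compressed = lz4.compressHC(payload)

start = time.time()
lz4.decompress(compressed)
elapsed = time.time() - start
print('decompressed %.1f MB in %.4fs (~%.2f GB/s)' % (
    len(payload) / 2.0 ** 20, elapsed, len(payload) / 2.0 ** 30 / elapsed))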

31

Conclusions

32

Questions?
