backtesting with spark · © cloudera, inc. all rights reserved. 1 backtesting with spark patrick...

14
1 © Cloudera, Inc. All rights reserved. Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

Upload: others

Post on 04-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

1© Cloudera, Inc. All rights reserved.

Backtesting with Spark

Patrick Angeles, Cloudera

Sandy Ryza, Cloudera

Rick Carlin, Intel

Sheetal Parade, Intel

Page 2: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

2© Cloudera, Inc. All rights reserved.

Traditional Grid

• Shared storage

• Storage and compute scale independently

• Bottleneck on I/O fabric

• Typically exotic hardware

• Proprietary schedulers

• Homegrown application frameworks

I/O Fabric

Page 3: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

3© Cloudera, Inc. All rights reserved.

Hadoop Cluster

• Shared nothing

• Storage and compute scale together

• Hierarchical architecture minimizes data transfer

• Typically commodity hardware

• Open source scheduler and frameworks

ToR Switch

Core Switch

ToR Switch

ToR Switch ToR Switch

Page 4: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

4© Cloudera, Inc. All rights reserved.

Implementation Choices

• HBase

• HDFS + CSV

• HDFS + Parquet

• HDFS + AvroFile

• SQL

• Native: C / C++

• JVM: Java, Scala

• Python

• Hive / Impala

• MR4C

• MapReduce

• Spark

Storage Engine Processing Language Processing Framework

Page 5: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

5© Cloudera, Inc. All rights reserved.

Spark in 60 Seconds

• Started in 2009 by Matei Zaharia at UC Berkeley AMPLab

• Advancements over MapReduce:

• DAG engine

• Takes advantage of system memory

• Optimistic fault tolerance using lineage

• 10x – 100x faster

• Supports applications written in Scala, Java and Python

• Rich ecosystem: SparkStreaming, SparkR, SparkSQL, MLLib, GraphX

• Strong community: ~40 committers, 100s of contributors, multi-vendor support

Page 6: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

6© Cloudera, Inc. All rights reserved.

BLASH: Algorithm Implementation

• Data layout: one file per symbol for a year.

• Pipelining to avoid re-reading the data.

• Process order book for all symbols.

• Sort and filter results in the end.

• Unit Testing

• Separation of concerns – parallelization from algorithm.

• Automated verification for correctness.

• Optimizations

• Use trending to reduce expensive method invocation.

• Keep memory in check – process an order at a time.

• ~2 weeks effort. Includes coding, data generation and running benchmarks.

Page 7: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

7© Cloudera, Inc. All rights reserved.

Hardware

• 1 master, 1 mgmt node

• 12 workers

• 2 x E5-2695 v2 @ 2.40GHz

• 24 physical cores

• 96GB RAM

• 8 x 1TB SAS drives

• 10Gb Ethernet

Software

• RHEL 6.6

• CDH 5.4.0

• Apache Hadoop 2.6.0

• Apache Spark 1.3

• Apache Parquet 1.5

Setup

Page 8: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

8© Cloudera, Inc. All rights reserved.

Spark Settings

• 4 cores, 4GB per executor

• Given 288 total physical cores, theoretical max of 72 executors for the entire cluster

• Or 144 executors taking into account hyper-threading

• In reality, the effective core count is somewhere in between

Setup

Page 9: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

9© Cloudera, Inc. All rights reserved.

Data set

CSV Parquet + Snappy

Avg File Size (Hi / Low) 7.5 GB550 MB

2.2 GB164 MB

Total Size, 1 year 8.23 TB 2.46 TB

Data Specs Full market symbols251 trading days (1 year)Simulated high and low volume instrumentsSimulated high volume trading days

Page 10: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

10© Cloudera, Inc. All rights reserved.

Vertical Scaling Test

10.00

12.00

14.00

16.00

18.00

20.00

22.00

24.00

26.00

40 60 80 100 120 140 160 180

Page 11: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

11© Cloudera, Inc. All rights reserved.

Parquet vs CSV

0.00

5.00

10.00

15.00

20.00

25.00

30.00

35.00

40.00

45.00

90 72 54

CSV

Parquet

Page 12: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

12© Cloudera, Inc. All rights reserved.

Horizontal Scaling Test

0.00

5.00

10.00

15.00

20.00

25.00

30.00

35.00

40.00

4 5 6 7 8 9 10 11 12 13

Page 13: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

13© Cloudera, Inc. All rights reserved.

Observations

• Lots of room for optimization

• Code refactor (avoiding expensive operations)

• Pre-processing of common data like order books, moving averages, bars, etc.

• No built-in way of dealing with split time-series data

• Processing is local only for the first split

• Workaround: use bigger HDFS block sizes

• Better: API to process file splits sequentially, ability to pass intermediate state to the next task

Page 14: Backtesting with Spark · © Cloudera, Inc. All rights reserved. 1 Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel

14© Cloudera, Inc. All rights reserved.

Thank you!@patrickangeles