benchmarking data flow systems for scalable machine learning … · 2017-07-11 · christoph boden...

30
Benchmarking Data Flow Systems for Scalable Machine Learning Christoph Boden, Andrea Spina, Tilmann Rabl, Volker Markl 1 SPEC Research Big Data Group call, 19.6.2017

Upload: others

Post on 20-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Benchmarking Data Flow Systems

for Scalable Machine Learning Christoph Boden, Andrea Spina, Tilmann Rabl, Volker Markl

1 SPEC Research Big Data Group call, 19.6.2017

Page 2: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Motivation

- Hadoop MapReduce inherently ineffecient at executing iterative

computations

- second generation systems (Spark, Flink, GraphLab …) address this

shortcomming

- distributed data flow systems are poular choices to train machine learning

models at scale

- existing benchmarks use non-representative workloads and fail to

address scalability aspects of machine learning models

2

Page 3: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

3

Page 4: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Problem

existing (big data) benmcharks:

- use non-representative workloads

(word count, sort …)

- fail to address all dimensions of scalability

- use existing libraries for experiments

4

„[…] Spark obtains the best results for K-Means

thanks to the optimized MLlib library, although it is

expected that the support of K-Means in Flink-ML

can bridge this performance gap. […]”

Page 5: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Example: Click-Through Rate Prediction

- Goal: predict whether a user will click an ad

- a crucial building block in the multi-billion dollar online advertising industry

- logistic regression models still a „major workhorse“

- Prediction models are trained on

- >100 TB data

- billions of training samples

- up to 100 billion unique features*

* https://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf

5

Page 6: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Dimensions of Scalability

Problem: existing (big data) bencharks fail to address all dimensions of

scalability

- Scaling the data (number of training samples)

- Scaling the model (dimensions)

- Scaling the number of models (ensembles, hyperparameter tuning, …)

6

Page 7: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Goal

- Introduce a representative workloads and experiments to evaluate the

Performance of distributed data flow systems for machine learning

- Implemented mathematically equivalent workloads on different systems and

assess their scalability w.r.t. Machine Learning

Systems:

Apache Flink Apache Spark Single Thread 7

Page 8: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Scalability you say …

8

Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! but at what cost?.

In Proceedings of the 15th USENIX conference on Hot Topics in Operating Systems (HOTOS'15)

Page 9: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

COST

9

- hardware configuration required before the platform outperforms a

competent single-threaded implementation.

Page 10: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Experiments

Production Scaling: maximum number of nodes, varying data size

Strong Scaling: varying number of nodes, fixed data size

Model Scaling: varying number of nodes and dimensionality

fixed number of data points

COST: varying number of nodes and dimensions compared

against single threaded implementation

10

Page 11: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Background: Spark and Flink

Spark:

- data-parallel transformations on Resilient Distributed Datasets (RDDs)

- can be cached and recomputed in case of node failures

Flink:

- distributed streaming data flow engine supporting batch- and streaming

workloads

- native operator for iterative computations

- jobs are compiled and optimized by a cost-based optimizer

11

Page 12: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Data Sets

Unsupervised Learning: generated 100 dimensional data sampled from k

Gaussians and added uniform random noise (similar to HiBench)

Supervised Learning: used part of the Criteo Click log data set (1 bn data

points) with feature hashing to convert to desired dimensionality for experiments –

(e.g. 530 GB for 1000 dim)

12

Page 13: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Cluster Setup

• Quadcore Intel Xeon CPU E3-1230 V2 3.30GHz CPU (4 cores, 8 hyperthreads)

• 16 GB RAM

• 3x1TB hard disks (linux software RAID0)

• 1 GBit Ethernet NIC

• Flink Version: 1.0.3

• Spark Version: 1.6.2

• LibLinear Version

13

Page 14: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Parameter Tuning

• parallelism

• caching

• buffers

• serialization

14

Page 15: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Workloads

15

Page 16: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Machine Learning Pipelines

16

raw training data

feature extraction

model training

model evaluation

model

Model performance

Feature Selection Feature Engineering

Model Selection

(Hyper-) parameter tuning

Validation Data

Test Data

Training Data

Page 17: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Supervised Learning

Objective:

Batch Gradient Descent:

Different parametrizations of loss and regularization function yield a variety of ML methods

A good workload proxy for more sophisticated solvers that share a similar computational footprint

17

𝑤 = 𝑎𝑟𝑔𝑚𝑖𝑛𝑤

Page 18: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Map-Reduce Implementation

Map1 Map2 MapN compute gradient per data point

Reduce sum up partial gradients

18

Page 19: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Map-Partition Implementation

MapPartition1 compute gradient per data point (per partition)

Reduce

locally sum up partial gradients (in udf)

MapPartitionN … pre-

aggregate pre-

aggregate

aggregate pre-aggregated partial sums 19

Page 20: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Tree-Aggregate (Spark)

Executor Executor Executor Executor

Executor Executor

driver

20

Page 21: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Experimental Results

21

Page 22: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Production Scaling: Implementation Strategies

0

20

40

60

80

100

120

140

160

180

200

0 0,5 1 1,5 2 2,5 3 3,5 4

Ru

nti

me

in M

inu

tes

Data Set Size (linear scaling factor)

Spark MapPartition

Spark MapReduce

Spark TreeAggregate

Flink MapPartition

Flink MapReduce

22

• choice of implementation

stategy matters!

• all implementation scale

gracefully out-of-core

• Spark‘s MapPartition

slightly faster than

TreeAgregate, but not

robust

• unfortunate kryo

serialization bug

penalizing Flink‘s

MapReduce

implementation

5 Iterations of BGD training

Page 23: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Strong Scaling Experiments

K-Means Clustering Batch Gradient Descent

0

5

10

15

20

25

30

0 5 10 15 20 25 30 35

Ru

nti

me

in M

inu

tes

Number of Nodes

Apache Spark

Apache Flink

0

20

40

60

80

100

120

0 5 10 15 20 25

Ru

nti

me

in M

inu

tes

Number of Nodes

Apache Spark

Apache Flink

23

5 Iterations 5 Iterations

Page 24: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Batch Gradient Descent on 4 Nodes

Apache Flink Apache Spark 24

CPU

DSK

NET

MMR

Page 25: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Batch Gradient Descent on 25 Nodes

Apache Flink Apache Spark 25

CPU

DSK

NET

MMR

Page 26: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Dimensionality Scaling (log-log)

26

two data sets:

• 0.2 = size of combined

main memory

• 0.8 = bigger than

combined main memory

• Spark performance

comparable or better than

flink for small dimensions

5 Iterations of BGD training 1

10

100

1 10 100 1000 10000 100000 1000000 10000000

run

tim

e in

min

ute

s

Dimensionality of the Model

Flink small data set (0.2)

Flink large data set (0.8)

Spark small data set (0.2)

Spark large data set (0.8)

Page 27: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Dimensionality Scaling

27

0

10

20

30

40

50

60

70

80

90

100

0 2000000 4000000 6000000 8000000 10000000 12000000

run

tim

e in

min

ute

s

Dimensionality of the Model

Flink small data set (0.2)

Flink large data set (0.8)

Spark small data set (0.2)

Spark large data set (0.8)

• spark fails to tain models

beyond 6m dimensions on

0.8 data set

• spark fails to tain models

beyond 8m dimensions on

0.2 data set

• flink robustly scales to

10m dimensions for both

data sets

• flink fails to train models

greater than 10m

dimensions

Page 28: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

BGD – 0.8 Data Set - 6 Million Dimensions

Apache Flink Apache Spark

28

Page 29: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

COST: vs. Single Threaded Implementation

0

1

2

3

4

5

6

7

8

10 100 1000 10000 100000 1000000 10000000100000000 1E+09

Ru

nti

me

in M

inu

tes

Dimensionality of Model

LibLinear (Single Thread)

Spark 1 Node (4 Cores)

Flink 1 Node (4 Cores)

Spark 2 Nodes (8 Cores)

Flink 2 Nodes (8 Cores)

29

• 4GB subsample of criteo

data set

• 2 machines (8 cores)

sufficient to outperform

single threaded impl.

• both Flink and Spark fail to

train with 100m

dimensions or beyond

10 iterations of BGD training

Page 30: Benchmarking Data Flow Systems for Scalable Machine Learning … · 2017-07-11 · Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big

Christoph Boden :. Benchmarking Data Flow Systems for Scalable Machine Learning; SPEC Research Big Data Group call, June 19 2017

Summary

• Proposed, implemented and evaluate a set of representative workloads and experiments to

evaluate systems for machine learing

• Both systems scale robustly with growing data-set sizes

• Choice of implementation strategy has a noticeable impact on performance

• Spark fails to train high dimensionsal models (beyond 6 million dimensions)

• Both systems did not manage to train a model with 100 million dimensions even on a small data set

• Two nodes (8 cores) are a sufficient hardware configuration to outperform a competent single-

threaded implementation

30

contact: [email protected] [soon] code: https://github.com/bodenc/ml-benchmark