designing hybrid data processing systems for heterogeneous …ey204/teaching/acs/r244_2017... ·...

Designing Hybrid Data ProcessingSystems for Heterogeneous Servers

Peter PietzuchLarge-Scale Distributed Systems (LSDS) Group

Imperial College Londonhttp://lsds.doc.ic.ac.uk<[email protected]>

University of Cambridge – Cambridge, United Kingdom – November 2017 1

Data is the New Oil

• Many new sources of data become available– Most data is produced continuously

• Data powers plethora of new and personalised services…

2Peter Pietzuch - Imperial College London

Mobiledevices Scientific

instruments

CamerasSocial feeds IoTdevices

Internet services,web sites

RFIDtags

Datarepositories

Data-Intensive Systems

• Data analytics over web click streams– How to maximise user experience with

relevant content?– How to analyse “click paths” to trace most

common user routes?

• Machine learning models foronline prediction– E.g. serving adverts on search engines

Peter Pietzuch - Imperial College London 3

Hits

Page Views

Visits

Unique Visitors

Uniquely Identified Visitors

Volume of Available Data

…

f1

fn

y E {−1,1}

predict

update

• Solution: AdPredictor– Bayesian learning algorithm

ranks adverts according toclick probabilities

Throughout and Result Freshness Matter

Facebook Insights: Aggregates 9 GB/s < 10 sec latencyFeedzai: 40K credit card transactions/s < 25 ms latencyGoogle Zeitgeist: 40K user queries/s < 1 ms latencyNovaSparks: 150M trade options/s < 1 ms latency


…

High-throughputprocessing

Low-latencyresults

Data-intensivesystem

Design Space for Data-Intensive Systems

• Tension between performanceand algorithmic complexity


Easy formostalgorithms

Hard formachinelearningalgorithms

Hard forall algorithms

Result latency

Dat

a am

ount

MBs

GBs

TBs

10s 1s 100ms 10ms 1ms

Algorithmic Complexity Increases

…Share state

Aggregate

Iterate…

Pre-process

Parallelize…

Online machinelearning, data

mining

Topic-basedfiltering

Content-basedfiltering

Complexpattern

matchingStreamqueries

highwaysegmentdirectionspeed





T1

T2

T3

T1(a, b, c)

T2(c, d, e)

T3(g, i, h)

Publish/Subscribe Complex EventProcessing (CEP)

Streamprocessing


Scale Out Model in Data Centres


Task Parallelism vs. Data Parallelism


Input data ...

Servers indata centre

Results

select highway, segment, direction, AVG(speed) from Vehicles[range 5 seconds slide 1 second]group by highway, segment, directionhaving avg < 40

Task parallelism:Multiple data processing jobs

Data parallelism:Single data processing job

select distinct W.cidFrom Payments [range 300 seconds] as W,

Payments [partition-by 1 row] as Lwhere W.cid = L.cid and W.region != L.region





select highway, segment, direction, AVG(speed) from Vehicles[range 5 seconds slide 1 second]group by highway, segment, directionhaving avg < 40

Distributed Dataflow Systems

• Idea:Execute data-parallel taskson cluster nodes

• Tasks organised as dataflow graph

• Almost all big data systems do this:• Apache Hadoop, Apache Spark,

Apache Storm, Apache Flink,Google TensorFlow, ...

•Peter Pietzuch - Imperial College London 9

parallelismdegree 3

parallelismdegree 2

Nobody Ever Got Fired For Using Hadoop/Spark

• 2012 study of MapReduce workloads– Microsoft: median job size < 14 GB– Yahoo: median job size < 12.5 GB– Facebook: 90% of jobs < 100 GB

• Many data-intensive jobs easily fit into memory

• One server cheaper/more efficient than compute cluster


(A. Rowstron, D. Narayanan, A. Donnely, G. O’Shea, A. Douglas, HotCDP’12)

Parallelism of Heterogeneous Servers

Servers have many parallel CPU coresHeterogeneous servers with GPUs common

New types of compute accelerators: Xeon Phi, Google's TPUs, FPGAs, ...Peter Pietzuch - Imperial College London 11

L3

C1

C2

C3

C4

C5

C6

C7

C8

L3

C1

C2

C3

C4

C5

C6

C7

C8

L2 Cache

DRAM DRAM

SMX1 ... SMXN

Sock

et 1

Sock

et 2

Command QueuePCIe Bus

DMA

1000s ofGPU cores

10s ofCPU cores

Slide courtesy of Torsten Hoefler (Systems Group, ETH Zürich)

Servers Are Becoming Increasingly Heterogeneous


E How can Data-Intensive Systems Exploit Heterogeneous Hardware?

Roadmap

• SABER: Hybrid stream processing engine for heterogeneous servers• [SIGMOD’16]

• (1) How to parallelise computation on modern hardware?

• (2) How to utilise heterogeneous servers?

• (3) Experimental performance results


Analytics with Window-based Stream Queries

• Real-time analytics over data streams• Windows define finite data amount for processing


window











now

Time-based window with size τ at current time t[t - τ : t] Vehicles[Range τ seconds]

Count-based window with size n:last n tuples Vehicles[Rows n]

Defining Stream Query Semantics

• Windows convert data streams to dynamic relations (database table)


Streams Relations

Window specification

Stream operators: Istream, Dstream, Rstream

Any relational query

(select, project,join, group by, etc)

SQL Stream Queries

SQL provides well-defined declarative semantics for queries– Based on relational algebra (select, project, join, …)

• Example: Identify slow moving traffic on highway– Input stream: Vehicles(highway, segment, direction, speed)– Find highway segments with average speed below 40 km/h


select highway, segment, direction, AVG(speed) as avg

from Vehicles[range 5 sec slide 1 sec]group by highway, segment, directionhaving avg < 40

Input data

Output

Operators

(1) How to Parallelise Computation?

• Perform query evaluation across sliding windows in parallel– Exploit data parallelism across stream


123456

w1

w2

w3

w4

size: 4 secslide: 1 sec

How to use GPUs with Stream Queries?

• Naive strategy parallelises computation along window boundaries


123456 size: 4 secslide: 1 sec

Task T1

Task T2

Combine partial results

E Window-based parallelism results in redundant computation

How to use GPUs with Stream Queries?

• Parallel processing of non-overlapping window data?


w1

w2

w3

w4

123456 size: 4 secslide: 1 sec

T1

T2

T3

T4

T5

Combine partial results

E Slide-based parallelism limits degree of parallelism

Apache Spark: Small Slides à Low Throughput

• Spark relates window slide to micro-batch size used for parallelisation


00.20.40.60.8

11.21.41.61.8

2

0 1 2 3 4 5 6 7 8 9

Thro

ughp

ut

(106

tupl

es/s

)

Window slide (106 tuples)

select AVG(S.1) from S [rows 1024 slide x]

E Avoid coupling system parameters with query definition

SABER: Parallel Window Processing

• Idea: Parallelise using task size that is best for hardware

• Task contains one or more window fragments


10 9 8 7 6 5 4 3 2 115 14 13 12 11 10 9 8 7 615 14 13 12 11

T1T2T3

w1w2

w3w4

w5

w1w2

w3w4

w5

size: 7 tuplesslide: 2 tuples

5 tuples/task

SABER: Window Fragment Processing

• Process window fragments in parallel• Reassemble partial results to obtain overall result


Worker B: T2

w1w2w3

w4w5

Worker A: T1w1

w2w3

w1result

w2result

Result stageSlot 2 Slot 1

EmptyEmpty

Output result circular buffer

Partial result reassembly must also be done in parallel

• Fragment function ff– Processes window fragments

• Assembly function fa– Merges partial window results

• Batch function fb– Composes fragment functions within task– Allows incremental processing

10 9 8 7 6 5 4 3 2 1

T1T2

w1w2

size: 7 tuplesslide: 2 tuples

5 tuples/task

fa fa

ff ff

w2 results

ff ff

w1 results

output

fb

API for Operator Implementation


SABER: Performance of Window-based Queries


0

0.05

0.1

0.15

0.2

0

2

4

6

8

64 256 1024 4096 16384

Late

ncy

(sec

)

Thro

ughp

ut (G

B/s)

Window slide (tuples)

select AVG(S.1) from S [rows 1024 slide x]

E Performance of window-based queries remains predictable

0

2

4

6

8

0.0625 0.25 1 4 16 64 256 1024 4096

Thro

ughp

ut (G

B/s)

Task Size (KB)

CPU-only processing

GPU-only processing

How to Pick the Task Size?

• Problem: Small data transfers over PCIe bus costly– Example: select * from S where p1 [rows 1 slide 1]

Limited by dispatcher and thread contention

Limited by data movement


Roadmap





E Avoid coupling system parameters with processing semantics

(2) How to Utilise Heterogeneous Servers?

• Hard to decide acceleration potential of heterogeneous processors– Depends on operator semantics, window definition,

data distribution, …


E Don’t leave decision about heterogeneous processors to users

0

2

4

6

Thro

ughp

ut (G

B/s)

0

0.1

0.2

0.3 CPU execution

GPU execution

Aggregation Group-by θ-join

GPU is faster CPU is faster

SABER: Hybrid Execution Model

• Idea: Execute tasks on all heterogeneous processors (CPUs, GPUs, ...)

• Fully utilise all hardware parallelism available in dedicated servers


T2 T1T3T4T5T6T7T8T9

QBQAQBQBQBQBQAQBQA

T10

QA

Task queue CPU worker

GPU worker

0 3 6 9 12

CPUGPU T1 T2

T3

T2T1 T4 T5 T6

T7

T8 T9

T10

T4 T5 T6

Static Task Scheduling using Cost Model?


CPU GPUQA 3 ms 2 msQB 3 ms 1 ms

T2 T1T3T4T5T6T7T8T9

QBQAQBQBQBQBQAQBQA

T10

QA

Task queue

T2

T1

CPUGPU

Static: QA on CPU and QB on GPU

T7 T9 T10

T3 T4 T5 T6 T8

0 3 6 9 12

CPU workers

GPU worker

E Static scheduling under-utilises processors

• Profile tasks to obtain cost model• Assign tasks to processor with shortest execution time

First-Come First-Serve Task Scheduling?


E FCFS ignores effectiveness of processors for given task


T2 T1T3T4T5T6T7T8T9

QBQAQBQBQBQBQAQBQA

T10

QA

Task queue CPU workers

GPU worker

0 3 6 9 12

CPUGPU

First-Come First-Served

T1 T4 T8

T2 T3 T5 T6 T7 T9

T10

• Assign tasks to processors first-come, first-serve– CPU/GPU execute both QA and QB tasks

Heterogeneous Lookahead Scheduling (HLS)

Idea: Scheduler assigns tasks to idle processors dynamically– Skips tasks that could be executed faster by another processor



T2 T1T3T4T5T6T7T8T9

QBQAQBQBQBQBQAQBQA

T10

QA

Task queue CPU workers

GPU worker

0 3 6 9 12

CPUGPU

HLS

T1 is slower on CPU: Skip

T1

T2 is slower on CPU: Skip

T2

T3 is slower on CPU but already have 3 ms of work for GPU

T3

T2T1

Skip T4, T5 and T6 and select T7

T4 T5 T6

T7

Finally skip T8 and T9 and select T10

T8 T9

T10

T4 T5 T6

E HLS achieves aggregate throughput of all heterogeneous processors

SABER Hybrid Stream Processing Engine


T1

T2

T2 T1

op

ααop

CPU

GPU

T1 T2

Scheduling & execution stageDequeue tasks based on HLS

Dispatching stageDispatch fixed-size tasks

Merge & forward partial window results

Result stage

Java15K LOC

C & OpenCL4K LOC

Roadmap




• (3) Experimental performance results


E Hybrid execution utilises all heterogeneous processors

E Avoid coupling system parameters with processing semantics

Experimental Evaluation


Ubuntu Linux 14.04 NVIDIA driver 346.47

Intel Xeon 2.6 GHz

NVIDIA Quadro K5200

PCIe 3.0 x16

10 Gbps NIC

16 cores

64 GB RAM

2,304cores

8 GB RAM

Google Cluster Data 144M jobs events from Google infrastructure

SmartGridMeasurements974Mplugmeasurementsfromhouses

LinearRoadBenchmark11Mcarpositionsandspeedonhighway

What is SABER’s Performance?


0

10

20

30

40

50

CM2 SG1 SG2 LRB3 LRB4Thro

ughp

ut (1

06tu

ples

/s)

SABER (CPU contrib.)

SABER (GPU contrib.)

Cluster Mgmt. Smart Grid LRB

aggravg group-byavg select

group-byavg group-bycnt

group-bycntgroup-byavg

select

Intel Xeon 2.6 GHz

NVIDIA Quadro K5200

16 cores

2,304 cores

E SABER exploits both CPUs and GPUs effectively for different queries

Is Hybrid Throughput Additive?


0

2

4

6

Thro

ughp

ut (G

B/s)

0

0.1

0.2

0.3 SABER (CPU only)SABER (GPU only)SABER

Aggregation Group-by θ-join

GPU is faster CPU is faster Not additive due to queue contention

E Aggregate throughput of CPU + GPU always highest

What is the Trade-Off between CPUs and GPUs?

• Hybrid processing model benefits from GPU's ability to process complex predicates fast

0

2

4

6

8

1 4 16 64

Thro

ughp

ut (G

B/s)

# selection predicates

SABER (CPU only) SABER (GPU only) SABER

Dispatch bound

0

0.1

0.2

0.3

0.4

1 4 16 64# join predicates

[rows 1024, slide 1024] [rows 1024, slide 1024]


Does SABER Adapt to Workload Changes?

38

0

0.1

0.2

0 10 20 30 40 50 60

Selec

tivity

0

2

4

6

8

0 10 20 30 40 50 60

Thro

ughp

ut (G

B/s)

Time (seconds)

Series1 Series2SABER SABER (GPU contribution)

HLS periodically uses idle, non-preferred processor to run tasks to update query task throughput matrix

E Higher selectivity à more predicates evaluated à GPU preferredPeter Pietzuch - Imperial College London

Summary

• Heterogeneous servers have huge impact on data-intensive systems– Shift from scale out to scale up model– Need new general-purpose system designs for heterogeneous servers

☛SABER: Hybrid Stream Processing Engine for CPUs & GPUs

• (1) Parallelise computation to fit hardware capabilitiesE Decouple hardware/system parameters from processing semantics

• (2) Fully utilise all heterogeneous processors independently of workloadE Hybrid processing model to achieve aggregate CPU/GPU throughput

39Peter Pietzuch - Imperial College London

Acknowledgement:LSDS Group at Imperial College London


Peter Pietzuchhttp://lsds.doc.ic.ac.uk<[email protected]>Thank you!

Any Questions?

We're Hiring!Post-docs, PhDs

designing hybrid data processing systems for heterogeneous …ey204/teaching/acs/r244_2017... ·...

Documents