TRANSCRIPT
USENIX ATC '20: 2020 USENIX Annual Technical Conference, July 15-17, 2020
FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures
Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
Renmin University of China
Technische Universität Berlin
National University of Singapore
Outline
1. Background
2. Motivation
3. Challenges
4. FineStream
5. Evaluation
6. Conclusion
1. Background
• Bulk-synchronous parallel model: query granularity
  • Saber [SIGMOD'16]: window-based hybrid stream processing for heterogeneous architectures
• Continuous operator model: operator granularity
  • this paper
[Figure: in both models a query's operators (operator 1 … operator n) are spread across the CPU and GPU; only the scheduling granularity differs.]
• CPU and GPU can execute concurrently in either model; only the granularity is different.
1. Background: Integrated Architectures
• Timeline
  • Jan 2011: AMD APU
  • Jan 2012: Intel Ivy Bridge
  • Apr 2014: Nvidia Tegra
• Benefits
  • No PCIe transfer overhead
  • Shared global memory
  • High energy efficiency
1. Background
• Integrated architectures vs. discrete architectures
                    Integrated                 Discrete
Architecture        A10-7850K   Ryzen5 2400G   GTX 1080Ti   V100
#cores (GPU+CPU)    512+4       704+4          3584         5120
TFLOPS              0.9         1.7            11.3         14.1
Bandwidth (GB/s)    25.6        38.4           484.4        900
Price ($)           209         169            1100         8999
TDP (W)             95          65             250          300
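Later slides report performance/USD and performance/Watt. As a rough sanity check, the same style of ratio can be computed directly from the spec table above; this is a minimal sketch in which raw TFLOPS stands in for measured stream throughput (the evaluation later uses real query throughput instead).

```python
# Cost and power ratios for the devices in the spec table above
# (TFLOPS, price in USD, TDP in watts).
specs = {
    "A10-7850K":    {"tflops": 0.9,  "price": 209,  "tdp": 95},
    "Ryzen5 2400G": {"tflops": 1.7,  "price": 169,  "tdp": 65},
    "GTX 1080Ti":   {"tflops": 11.3, "price": 1100, "tdp": 250},
    "V100":         {"tflops": 14.1, "price": 8999, "tdp": 300},
}

def gflops_per_dollar(name):
    s = specs[name]
    return s["tflops"] * 1000 / s["price"]

def gflops_per_watt(name):
    s = specs[name]
    return s["tflops"] * 1000 / s["tdp"]
```

On raw compute per dollar the integrated chips already win by a wide margin (about 10.1 GFLOPS/$ for the Ryzen5 2400G vs. about 1.6 for the V100); the evaluation section measures the same ratios on actual query throughput.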
3. Stream Processing with SQL
• Data stream: an unbounded sequence of tuples
• Window: a portion of the stream defined by a window size and a window slide (e.g. windows w1, w2, …)
• Operator: a relational operator (operator 1 … operator n)
• Query: a dataflow of operators over the stream, producing results
• Batch: the unit in which tuples are processed
[Figure: a data stream of tuples cut into overlapping windows w1, w2 by window size and window slide; a query chains operator 1 … operator n to produce results.]
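The window semantics above (count-based windows of `window size` tuples advancing by `window slide` tuples) can be sketched as a small generator. This is illustrative only, not FineStream's actual API.

```python
# Count-based sliding windows: emit a window of `size` tuples every
# `slide` tuples, as in the window w1/w2 figure.
def sliding_windows(stream, size, slide):
    buf = []
    skip = 0  # tuples still to discard before the next window starts
    for tup in stream:
        if skip:
            skip -= 1
            continue
        buf.append(tup)
        if len(buf) == size:
            yield tuple(buf)
            if slide >= size:
                buf = []
                skip = slide - size  # gap between non-overlapping windows
            else:
                buf = buf[slide:]    # keep the overlap for the next window
```

For example, `size=3, slide=1` over tuples 1..5 yields (1,2,3), (2,3,4), (3,4,5).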
2. Motivation
• Varying Operator-Device Preference
[Figure: CPU and GPU operator queues over time for a two-operator query (operator 1: group-by, operator 2: aggregation). Running the whole query on the CPU takes 18.2 ms, on the GPU 6.7 ms; the per-operator times (5.2 ms, 5.8 ms) differ across devices, so each operator has its own preferred device.]
2. Motivation
• Performance (1E5 tuples/s) of operators on the CPU and the GPU of the integrated architecture:

Operator      CPU only   GPU only   Device choice
Projection    14.2       14.3       GPU
Selection     13.1       14.1       GPU
Aggregation   14.7       13.5       CPU
Group-by      8.1        12.4       GPU
Join          0.7        0.1        CPU

• Fine-Grained Stream Processing
  • A fine-grained stream processing method that considers both integrated-architecture characteristics (e.g. the shared memory bandwidth limit) and operator features (each operator's preferred device) should deliver better performance.
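The device choice in the table above follows directly from the profiled throughput. A minimal sketch of that decision; the dictionary layout and tie-breaking toward the GPU are illustrative conventions, not FineStream's implementation:

```python
# Profiled operator throughput (1E5 tuples/s) from the table above.
throughput = {
    "projection":  {"cpu": 14.2, "gpu": 14.3},
    "selection":   {"cpu": 13.1, "gpu": 14.1},
    "aggregation": {"cpu": 14.7, "gpu": 13.5},
    "group-by":    {"cpu": 8.1,  "gpu": 12.4},
    "join":        {"cpu": 0.7,  "gpu": 0.1},
}

def preferred_device(op):
    """Pick the device with the higher measured throughput (ties -> GPU)."""
    t = throughput[op]
    return "gpu" if t["gpu"] >= t["cpu"] else "cpu"
```

For instance, group-by maps to the GPU while join maps to the CPU, matching the "Device choice" column.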
2. Motivation
• Both the CPU and the GPU can deliver good performance, depending on the operator.
• We should consider the interplay of operator features and architecture differences.
3. Challenges
• Challenge 1: Application topology combined with architectural characteristics
[Figure: an integrated architecture where CPU cores and GPU cores share system DRAM through a shared memory management unit, each side with its own cache; the application's operator topology (OP2, OP3, OP5, OP6, OP7, OP9, OP10, OP11) must be mapped onto these shared resources.]
3. Challenges
• Challenge 2: SQL query plan optimization with shared main memory
[Figure: CPU and GPU queues over time under different plans. CPU only takes 18.2 ms, GPU only 6.7 ms, and a naive CPU-GPU co-run 22.4 ms: with shared main memory, co-running under a bad query plan can be slower than using a single device.]
3. Challenges
• Challenge 3: Adjustment for dynamic workload
[Figure: an operator DAG (OP1 feeding OP2 and OP3) whose workload distribution shifts at runtime, e.g. from 90% of tuples going to OP2 and 10% to OP3, to 10% going to OP2 and 90% to OP3.]
4. FineStream
• Overview
[Figure: FineStream takes a SQL query and an incoming stream split into batches; online profiling feeds a performance model, which produces an operator-to-device mapping; a dispatcher then runs each operator (op1, op2, …) on its mapped device and emits results.]
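The overview's profiling-plus-model loop can be sketched as follows. All class and method names here are illustrative assumptions, not FineStream's actual interfaces; the moving-average update is one simple way to keep the model adaptive.

```python
# Sketch of the control loop in the overview figure: online profiling
# refines a performance model, which in turn drives device mapping.
class PerformanceModel:
    def __init__(self):
        self.throughput = {}  # (operator, device) -> tuples/s estimate

    def update(self, op, device, measured):
        # Exponential moving average of profiled throughput.
        old = self.throughput.get((op, device), measured)
        self.throughput[(op, device)] = 0.5 * old + 0.5 * measured

    def best_device(self, op):
        cpu = self.throughput.get((op, "cpu"), 0.0)
        gpu = self.throughput.get((op, "gpu"), 0.0)
        return "cpu" if cpu >= gpu else "gpu"
```

A dispatcher would call `update` after each profiled batch and `best_device` when mapping the next batch's operators.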
4. FineStream
• Topology
[Figure: a query topology of operators OP1-OP11 organized into three branches (branch1: OP1-OP3, branch2: OP4-OP6, branch3: OP8-OP10) that merge through OP7 into OP11; one chain of operators forms the critical path.]
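The critical path in the topology figure is the most expensive chain of operators through the DAG. A minimal sketch of computing it, with illustrative operator costs (the DAG encoding and cost values are assumptions, not FineStream's data structures):

```python
# Critical path of an operator DAG: the largest total cost along any
# chain from a source operator to a sink.
def critical_path(dag, cost):
    """dag: op -> list of downstream ops; cost: op -> execution time."""
    memo = {}

    def longest(op):
        if op not in memo:
            succs = dag.get(op, [])
            memo[op] = cost[op] + (max(map(longest, succs)) if succs else 0)
        return memo[op]

    # Sources are operators that never appear as someone's successor.
    sinks_of = {d for ds in dag.values() for d in ds}
    sources = set(dag) - sinks_of
    return max(longest(op) for op in sources)
```

For a diamond like OP1 -> OP3 and OP2 -> OP3, the branch with the heavier operator dominates the critical path.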
4. FineStream
• Optimization 1: Branch Co-Running
[Figure: (a) Branch parallelism: branches 1-3 advance stage by stage (t_stage1, t_stage2, t_stage3). (b) Branch scheduling optimization: branch 3 is rescheduled so that stages of different branches overlap in time, shortening the total execution.]
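The idea behind branch co-running is that independent branches need not wait for each other: they can be spread across the CPU and GPU so their executions overlap. A greedy longest-first sketch (the scheduling policy and cost values are illustrative, not FineStream's scheduler):

```python
# Assign independent branches to whichever device (CPU or GPU) becomes
# free first, so branch executions overlap instead of running serially.
def co_run_makespan(branch_costs):
    cpu_busy = gpu_busy = 0.0
    for c in sorted(branch_costs, reverse=True):  # longest branch first
        if cpu_busy <= gpu_busy:
            cpu_busy += c
        else:
            gpu_busy += c
    return max(cpu_busy, gpu_busy)
```

With branch costs of 4, 3, and 2 time units, co-running finishes in 5 units instead of the 9 a serial schedule would take.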
4. FineStream
• Optimization 2: Batch Pipeline
[Figure: (a) Phase partitioning: the operator DAG is split into phase 1 (OP1, OP2, OP4, OP5, OP8, OP9) and phase 2 (OP3, OP6, OP7, OP10, OP11). (b) Batch pipeline: over time, phase 1 of batch 2 (PH1 B2) overlaps with phase 2 of batch 1 (PH2 B1), and so on. (PH i: phase i; B i: batch i.)]
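The batch pipeline behaves like a classic two-stage pipeline: while phase 2 of batch i runs, phase 1 of batch i+1 already starts. A minimal sketch of the resulting timing, with illustrative phase durations:

```python
# Two-stage pipeline timing: after the first batch fills the pipeline,
# each further batch costs only the slower phase, not both phases.
def pipelined_time(n_batches, t_phase1, t_phase2):
    bottleneck = max(t_phase1, t_phase2)
    return t_phase1 + t_phase2 + (n_batches - 1) * bottleneck

def serial_time(n_batches, t_phase1, t_phase2):
    # Without pipelining, each batch pays both phases back to back.
    return n_batches * (t_phase1 + t_phase2)
```

For 3 batches with phase times of 2 and 3 units, pipelining finishes in 11 units instead of 15.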
4. FineStream
• Optimization 3: Handling Dynamic Workload
  • Light-Weight Resource Reallocation
  • Query Plan Adjustment
[Figure: on the integrated architecture (CPU compute units, GPU compute units, shared memory): (a) when 90% of the workload goes to OP2, OP2 is assigned the larger compute-unit share; (b) when the distribution flips and 90% goes to OP3, the CPU/GPU compute-unit assignment is swapped accordingly.]
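Light-weight resource reallocation can be sketched as a swap of device assignments when the monitored tuple ratio between two parallel operators flips, instead of replanning the whole query. The two-operator structure and the convention of giving the heavier operator the GPU's compute units are illustrative assumptions:

```python
# When the dataflow monitor sees the workload ratio flip (e.g. 90/10 to
# 10/90), swap the compute-unit assignment between the two operators.
def reallocate(assignment, ratios):
    """assignment: op -> device; ratios: op -> fraction of tuples (2 ops)."""
    heavy, light = sorted(ratios, key=ratios.get, reverse=True)
    new = dict(assignment)
    new[heavy] = "gpu"  # heavier operator gets the GPU compute units
    new[light] = "cpu"
    return new
```

Only when this cheap adjustment still leaves performance low does FineStream fall back to the heavier query plan adjustment.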
4. FineStream
• Execution flow
[Figure: by default, batches flow through the operator DAG, with each operator OPi split between CPU and GPU threads according to a per-operator CPU%/GPU% mapping table; Branch Co-Running and Batch Pipeline raise parallelism and bandwidth utilization. Dataflow monitoring detects dynamic-workload migration; on detection, FineStream first performs light-weight resource reallocation, and if performance is still low, it applies a query plan adjustment (switching to another DAG).]
5. Evaluation
• Platforms
  • AMD A10-7850K
  • AMD Ryzen 5 2400G
• Datasets
  • Google compute cluster monitoring
  • Anomaly detection in smart grids
  • Linear road benchmark
  • Synthetically generated dataset
• Benchmarks: nine queries (Q1-Q9)

Example: Q1 (Google compute cluster monitoring)

  select timestamp, category, sum(cpu) as totalCPU
  from TaskEvents [range 256 slide 1]
  group by category
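Q1's per-window semantics are a grouped sum: within each window of 256 tuples, sum the `cpu` field per `category`. A minimal sketch of that computation over one window (the tuple layout is illustrative, and the `timestamp` output column is omitted for brevity):

```python
from collections import defaultdict

# What Q1 computes on one [range 256 slide 1] window of TaskEvents:
# group tuples by `category` and sum their `cpu` field.
def q1_window(window):
    totals = defaultdict(float)
    for tup in window:
        totals[tup["category"]] += tup["cpu"]
    return dict(totals)
```

FineStream would evaluate this once per window, with consecutive windows sharing 255 of their 256 tuples.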
5. Evaluation
• Throughput: FineStream achieves the best performance in most cases.
[Figure: throughput (1E5 tuples/s, 0-25) of Single, Saber, and FineStream for Q1-Q9 on the A10-7850K and the Ryzen5 2400G.]
5. Evaluation
• Latency: Low latency in most cases.
[Figure: latency (s, 0-1.5) of Single, Saber, and FineStream for Q1-Q9 on the A10-7850K and the Ryzen5 2400G.]
5. Evaluation
• Throughput vs. latency: queries with high throughput usually have low latency, and vice versa.
[Figure: scatter plot of throughput (1E5 tuples/s, 0-25) vs. latency (s, 0-1.4) for FineStream and Saber on the A10-7850K and the Ryzen5 2400G.]
5. Evaluation
• Utilization: FineStream utilizes the GPU of the integrated architecture better.
[Figure: CPU and GPU utilization (%, 0-100) of Saber and FineStream for Q1-Q9.]
5. Evaluation
• Comparison with Discrete Architectures
  • Throughput: the discrete GPUs achieve 1.8x to 5.7x higher throughput than the integrated architectures, due to the greater computational power of discrete GPUs.
  • Latency:
    • Discrete GPUs: T_total = T_PCIe_transmit + T_compute
    • Integrated GPUs: T_total = T_compute
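The latency formulas above can be made concrete with a small sketch: a discrete GPU pays a PCIe transfer on top of compute time, an integrated GPU does not. The batch size and link bandwidth below are illustrative values, not measurements from the paper.

```python
# Latency model from the slide: discrete GPUs add a PCIe transfer term,
# integrated GPUs pay only the compute time.
def discrete_latency(batch_bytes, pcie_gbytes_per_s, t_compute):
    t_pcie = batch_bytes / (pcie_gbytes_per_s * 1e9)
    return t_pcie + t_compute

def integrated_latency(t_compute):
    return t_compute
```

For example, shipping a 1 GB batch over a 16 GB/s PCIe link adds about 62 ms before the GPU even starts computing, which is why integrated GPUs can win on latency despite lower compute power.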
5. Evaluation
• Comparison with Discrete Architectures: higher price-performance ratio
[Figure: price-performance ratio (performance/USD, 0-14000) for Q1-Q9 on the GTX 1080Ti, V100, A10-7850K, and Ryzen; the integrated architectures achieve the higher ratios.]
5. Evaluation
• Comparison with Discrete Architectures: higher energy efficiency
[Figure: energy efficiency (performance/Watt, 0-35000) for Q1-Q9 on the GTX 1080Ti, V100, A10-7850K, and Ryzen; the integrated architectures are the more energy-efficient.]
6. Conclusion
• FineStream is the first fine-grained window-based relational stream processing engine on CPU-GPU integrated architectures.
• Lightweight query plan adaptations handle dynamic workloads.
• We evaluate FineStream on a set of stream queries on two integrated architectures.
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
Renmin University of China · Technische Universität Berlin · National University of Singapore