finestream: fine-grained window-based stream processing on

32
USENIX ATC’20 2020 USENIX Annual Technical Conference JULY 15–17, 2020 FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du Renmin University of China Technische Universitt Berlin National University of Singapore 1

Upload: others

Post on 08-Jun-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FineStream: Fine-Grained Window-Based Stream Processing on

USENIX ATC’202020 USENIX Annual Technical ConferenceJULY 15–17, 2020

FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures

Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du

Renmin University of China

Technische Universitat Berlin

National University of Singapore

1

Page 2: FineStream: Fine-Grained Window-Based Stream Processing on

Outline

1. Background

2. Motivation

3. Challenges

4. FineStream

5. Evaluation

6. Conclusion

2

Page 3: FineStream: Fine-Grained Window-Based Stream Processing on

1. Background

• Bulk-synchronous parallel model• query granularity

3

• Continuous operator model• operator granularity

operator 1

operator 2

operator n

queryCPU

GPU

operator 1

operator 2

operator n

queryCPU

GPU

CPU and GPU can concurrently execute in both cases — only the granularity is different.

This paper[SIGMOD’16] Saber: Window-based hybrid stream

processing for heterogeneous architectures

Page 4: FineStream: Fine-Grained Window-Based Stream Processing on

2. Integrated Architectures

4

• 2011, Jan

• AMD APU• 2014, Apr

• Nvidia Tegra

Benefits• No PCI-e transfer overhead

• Shared global memory

• High energy efficiency

• 2012, Jan

• Intel Ivy Bridge

Page 5: FineStream: Fine-Grained Window-Based Stream Processing on

1. Background

• Integrated architectures vs. discrete architectures

5

Integrated architectures Discrete architectures

Architecture A10-7850K Ryzen5 2400G GTX 1080Ti V100

#cores 512+4 704+4 3584 5120

TFLOPS 0.9 1.7 11.3 14.1

bandwidth (GB/s) 25.6 38.4 484.4 900

price ($) 209 169 1100 8999

TDP (W) 95 65 250 300

Page 6: FineStream: Fine-Grained Window-Based Stream Processing on

3. Stream Processing with SQL

• Data stream

• Window

• Operator

• Query

* Batch

6

tuple …

window w1

window w2

window size

window slide

data stream

operator 1

operator 2

operator n

… query

results

Page 7: FineStream: Fine-Grained Window-Based Stream Processing on

Outline

1. Background

2. Motivation

3. Challenges

4. FineStream

5. Evaluation

6. Conclusion

7

Page 8: FineStream: Fine-Grained Window-Based Stream Processing on

2. Motivation

• Varying Operator-Device Preference

8

CPU queue:

GPU queue:

time

operator 2aggregation

operator 1group-by

operator 1group-by

operator 2aggregation

query on CPU query on GPU

18.2 ms 5.2 ms

5.8 ms6.7 ms

Page 9: FineStream: Fine-Grained Window-Based Stream Processing on

2. Motivation

• Performance (tuples/s) of operators on the CPU and the GPU of the integrated architecture.

9

Operator CPU only GPU only Device choice

Projection 14.2 14.3 GPU

Selection 13.1 14.1 GPU

Aggregation 14.7 13.5 CPU

Group-by 8.1 12.4 GPU

Join 0.7 0.1 CPU

Page 10: FineStream: Fine-Grained Window-Based Stream Processing on

• Fine-Grained Stream Processing• A fine-grained stream processing method that can consider both integrated

architecture characteristics and operator features shall have better performance. • memory bandwidth limit

• operators - preferred devices

2. Motivation

10

• CPU and GPU have good performance

• consider the interplay of operator features and architecture difference.

Page 11: FineStream: Fine-Grained Window-Based Stream Processing on

Outline

1. Background

2. Motivation

3. Challenges

4. FineStream

5. Evaluation

6. Conclusion

11

Page 12: FineStream: Fine-Grained Window-Based Stream Processing on

3. Challenges

• Challenge 1: Application topology combined with architectural characteristics

12

System DRAM

CPUcore

CPU

CPUcore

CPUcore

GPUcore

GPU

GPUcore

GPUcore

Shared Memory Management Unit

CPU Cache GPU Cache

OP2 OP3

OP5 OP6

OP9 OP10

OP7

OP11

Page 13: FineStream: Fine-Grained Window-Based Stream Processing on

3. Challenges

• Challenge 2: SQL query plan optimization with shared main memory

13

query on CPU

query on GPU

query on CPU

query on GPU

CPU queue:

GPU queue:

time

CPU only18.2 ms

GPU only6.7 ms

CPU-GPU co-run22.4 ms

Page 14: FineStream: Fine-Grained Window-Based Stream Processing on

3. Challenges

• Challenge 3: Adjustment for dynamic workload

14

OP3

OP1

OP290%

10% OP3

OP1

OP210%

90%

Page 15: FineStream: Fine-Grained Window-Based Stream Processing on

Outline

1. Background

2. Motivation

3. Challenges

4. FineStream

5. Evaluation

6. Conclusion

15

Page 16: FineStream: Fine-Grained Window-Based Stream Processing on

4. FineStream

• Overview

16

batch

onlineprofiling

performancemodel

op1 op2 op1

operators

dev dev dev

device mapping

dispatcher

results

streambatch

SQL

FineStream

Page 17: FineStream: Fine-Grained Window-Based Stream Processing on

4. FineStream

• Topology

17

OP1 OP2 OP3

OP4 OP5 OP6

OP8 OP9 OP10

OP7

OP11

branch1

branch2

branch3

pathcritical

Page 18: FineStream: Fine-Grained Window-Based Stream Processing on

4. FineStream

• Optimization 1: Branch Co-Running

18

tstage3

timebranch 3

branch 2

branch 1

tstage

1tstage2 tstage3

timebranch 3

branch 2

branch 1

tstage1 tstage2

branch 3

(a) Branch parallelism. (b) Branch scheduling optimization.

Page 19: FineStream: Fine-Grained Window-Based Stream Processing on

4. FineStream

• Optimization 2: Batch Pipeline

19

OP3

OP6

OP10

OP7

OP11

phase 1 phase 2PH i: phase i B i: batch i

OP1 OP2

OP4 OP5

OP8 OP9PH1 B1 PH1 B2 …

PH2 B1 PH2 B2

time

(a) Phase partitioning. (b) Batch pipeline.

Page 20: FineStream: Fine-Grained Window-Based Stream Processing on

4. FineStream

• Optimization 3: Handling Dynamic Workload• Light-Weight Resource Reallocation

• Query Plan Adjustment

20

OP3

OP1

OP290%

10%

(a) 90% workload goes to OP2. (b) 90% workload goes to OP3.

Shared memory

GPU CUs

CPU CUs

Integrated architectures

OP3

OP1

OP210%

90%

Shared memory

CPU CUs

GPU CUs

Integrated architectures

Page 21: FineStream: Fine-Grained Window-Based Stream Processing on

4. FineStream

21

thread 1

stream 1

CPU

GPU

thread 2operators

OPi

CPU

GPU

bandwidth utilization

performance

OP1…

parallelism utilization Branch

Co-Running

Batch Pipeline

DAG 1OP CPU% GPU%

OP1 … …… … …

OPi … …

batch3

batch4dynamic-workload detection

operator dataflow

monitoring

still low performance?…

time

yes resource reallocation

migration detected

DAG i…………

default

batch1

batch2

…………

query plan adjustment

• Execution flow

Page 22: FineStream: Fine-Grained Window-Based Stream Processing on

Outline

1. Background

2. Motivation

3. Challenges

4. FineStream

5. Evaluation

6. Conclusion

22

Page 23: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Platforms • AMD A10- 7850K

• Ryzen 5 2400G

• Datasets• Google compute cluster monitoring

• Anomaly detection in smart grids

• Linear road benchmark

• Synthetically generated dataset

• Benchmarks• Nine queries

23

Example - Q1

(Google compute cluster monitoring)

select timestamp, category, sum(cpu)

as totalCPU

from TaskEvents [range 256 slide 1]

group by category

Page 24: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Throughput: FineStream achieves the best performance in most cases.

24

0

5

10

15

20

25

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9

A10-7850K Ryzen5 2400G

Thro

ugh

pu

t(1

E5tu

ple

s/s)

Single Saber FineStream

Page 25: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Latency: Low latency in most cases.

25

0

0.5

1

1.5

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9

A10-7850K Ryzen5 2400G

Late

ncy

(s)

Single Saber FineStream

Page 26: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Throughput vs. latency• Queries with high throughput usually have low latency, and vice versa.

26

0

5

10

15

20

25

0 0.2 0.4 0.6 0.8 1 1.2 1.4

Thro

ugh

pu

t(1

E5 t

up

les/

s)

Latency (s)

FineStream(A10-7850K)

Saber(A10-7850K)

FineStream(Ryzen5)

Saber(Ryzen)

Page 27: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Utilization• FineStream utilizes the GPU device better on the integrated architecture.

27

0

20

40

60

80

100

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9

CPU GPU

uti

lizat

ion

(%

)

Saber FineStream

Page 28: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Comparison with Discrete Architectures• Throughput: The discrete GPUs exhibit 1.8x to 5.7x higher throughput than

the integrated architectures, due to the more computational power of discrete GPUs.

• Latency:

• Discrete GPUs: Ttotal = TPCIe_transmit + Tcompute

• Integrated GPUs: Ttotal = Tcompute

28

Page 29: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Comparison with Discrete Architectures• High Price-Throughput Ratio

29

0

2000

4000

6000

8000

10000

12000

14000

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9

Pri

ce-p

erfo

rman

ce r

atio

(p

erfo

rman

ce/U

SD)

1080ti v100 A10-7850K Ryzen

Page 30: FineStream: Fine-Grained Window-Based Stream Processing on

5. Evaluation

• Comparison with Discrete Architectures• High Energy Efficiency

30

0

5000

10000

15000

20000

25000

30000

35000

Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9

Ener

gy e

ffic

ien

cy

(pe

rfo

rman

ce/W

att)

1080ti v100 A10-7850K Ryzen

Page 31: FineStream: Fine-Grained Window-Based Stream Processing on

Outline

1. Background

2. Motivation

3. Challenges

4. FineStream

5. Evaluation

6. Conclusion

31

Page 32: FineStream: Fine-Grained Window-Based Stream Processing on

6. Conclusion

• The first fine-grained window-based relational stream processing.

• Lightweight query plan adaptations handling dynamic workloads.

• FineStream evaluation on a set of stream queries.

32

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong DuRenmin University of China, Technische Universitat Berlin, National University of Singapore

USENIX ATC’202020 USENIX Annual Technical ConferenceJULY 15–17, 2020