TRANSCRIPT
USENIX ATC '20: 2020 USENIX Annual Technical Conference, July 15-17, 2020
FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures
Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
Renmin University of China
Technische Universität Berlin
National University of Singapore
Outline
1. Background
2. Motivation
3. Challenges
4. FineStream
5. Evaluation
6. Conclusion
1. Background
• Bulk-synchronous parallel model: query granularity
  • Saber [SIGMOD'16]: window-based hybrid stream processing for heterogeneous architectures
• Continuous operator model: operator granularity
  • this paper
[Figure: in both models a query's operators (operator 1 … operator n) are spread across the CPU and GPU; only the scheduling granularity differs.]
• CPU and GPU can execute concurrently in either model; only the granularity is different.
1. Background: Integrated Architectures
• Timeline
  • Jan 2011: AMD APU
  • Jan 2012: Intel Ivy Bridge
  • Apr 2014: Nvidia Tegra
• Benefits
  • No PCIe transfer overhead
  • Shared global memory
  • High energy efficiency
1. Background
• Integrated architectures vs. discrete architectures
                    Integrated                 Discrete
Architecture        A10-7850K   Ryzen5 2400G   GTX 1080Ti   V100
#cores (GPU+CPU)    512+4       704+4          3584         5120
TFLOPS              0.9         1.7            11.3         14.1
Bandwidth (GB/s)    25.6        38.4           484.4        900
Price ($)           209         169            1100         8999
TDP (W)             95          65             250          300
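Later slides report performance/USD and performance/Watt. As a rough sanity check, the same style of ratio can be computed directly from the spec table above; this is a minimal sketch in which raw TFLOPS stands in for measured stream throughput (the evaluation later uses real query throughput instead).

```python
# Cost and power ratios for the devices in the spec table above
# (TFLOPS, price in USD, TDP in watts).
specs = {
    "A10-7850K":    {"tflops": 0.9,  "price": 209,  "tdp": 95},
    "Ryzen5 2400G": {"tflops": 1.7,  "price": 169,  "tdp": 65},
    "GTX 1080Ti":   {"tflops": 11.3, "price": 1100, "tdp": 250},
    "V100":         {"tflops": 14.1, "price": 8999, "tdp": 300},
}

def gflops_per_dollar(name):
    s = specs[name]
    return s["tflops"] * 1000 / s["price"]

def gflops_per_watt(name):
    s = specs[name]
    return s["tflops"] * 1000 / s["tdp"]
```

On raw compute per dollar the integrated chips already win by a wide margin (about 10.1 GFLOPS/$ for the Ryzen5 2400G vs. about 1.6 for the V100); the evaluation section measures the same ratios on actual query throughput.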
3. Stream Processing with SQL
• Data stream: an unbounded sequence of tuples
• Window: a portion of the stream defined by a window size and a window slide (e.g. windows w1, w2, …)
• Operator: a relational operator (operator 1 … operator n)
• Query: a dataflow of operators over the stream, producing results
• Batch: the unit in which tuples are processed
[Figure: a data stream of tuples cut into overlapping windows w1, w2 by window size and window slide; a query chains operator 1 … operator n to produce results.]
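The window semantics above (count-based windows of `window size` tuples advancing by `window slide` tuples) can be sketched as a small generator. This is illustrative only, not FineStream's actual API.

```python
# Count-based sliding windows: emit a window of `size` tuples every
# `slide` tuples, as in the window w1/w2 figure.
def sliding_windows(stream, size, slide):
    buf = []
    skip = 0  # tuples still to discard before the next window starts
    for tup in stream:
        if skip:
            skip -= 1
            continue
        buf.append(tup)
        if len(buf) == size:
            yield tuple(buf)
            if slide >= size:
                buf = []
                skip = slide - size  # gap between non-overlapping windows
            else:
                buf = buf[slide:]    # keep the overlap for the next window
```

For example, `size=3, slide=1` over tuples 1..5 yields (1,2,3), (2,3,4), (3,4,5).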
2. Motivation
• Varying Operator-Device Preference
[Figure: CPU and GPU operator queues over time for a two-operator query (operator 1: group-by, operator 2: aggregation). Running the whole query on the CPU takes 18.2 ms, on the GPU 6.7 ms; the per-operator times (5.2 ms, 5.8 ms) differ across devices, so each operator has its own preferred device.]
2. Motivation
• Performance (1E5 tuples/s) of operators on the CPU and the GPU of the integrated architecture:

Operator      CPU only   GPU only   Device choice
Projection    14.2       14.3       GPU
Selection     13.1       14.1       GPU
Aggregation   14.7       13.5       CPU
Group-by      8.1        12.4       GPU
Join          0.7        0.1        CPU

• Fine-Grained Stream Processing
  • A fine-grained stream processing method that considers both integrated-architecture characteristics (e.g. the shared memory bandwidth limit) and operator features (each operator's preferred device) should deliver better performance.
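The device choice in the table above follows directly from the profiled throughput. A minimal sketch of that decision; the dictionary layout and tie-breaking toward the GPU are illustrative conventions, not FineStream's implementation:

```python
# Profiled operator throughput (1E5 tuples/s) from the table above.
throughput = {
    "projection":  {"cpu": 14.2, "gpu": 14.3},
    "selection":   {"cpu": 13.1, "gpu": 14.1},
    "aggregation": {"cpu": 14.7, "gpu": 13.5},
    "group-by":    {"cpu": 8.1,  "gpu": 12.4},
    "join":        {"cpu": 0.7,  "gpu": 0.1},
}

def preferred_device(op):
    """Pick the device with the higher measured throughput (ties -> GPU)."""
    t = throughput[op]
    return "gpu" if t["gpu"] >= t["cpu"] else "cpu"
```

For instance, group-by maps to the GPU while join maps to the CPU, matching the "Device choice" column.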
2. Motivation
• Both the CPU and the GPU can deliver good performance, depending on the operator.
• We should consider the interplay of operator features and architecture differences.
3. Challenges
• Challenge 1: Application topology combined with architectural characteristics
[Figure: an integrated architecture where CPU cores and GPU cores share system DRAM through a shared memory management unit, each side with its own cache; the application's operator topology (OP2, OP3, OP5, OP6, OP7, OP9, OP10, OP11) must be mapped onto these shared resources.]
3. Challenges
• Challenge 2: SQL query plan optimization with shared main memory
[Figure: CPU and GPU queues over time under different plans. CPU only takes 18.2 ms, GPU only 6.7 ms, and a naive CPU-GPU co-run 22.4 ms: with shared main memory, co-running under a bad query plan can be slower than using a single device.]
3. Challenges
• Challenge 3: Adjustment for dynamic workload
[Figure: an operator DAG (OP1 feeding OP2 and OP3) whose workload distribution shifts at runtime, e.g. from 90% of tuples going to OP2 and 10% to OP3, to 10% going to OP2 and 90% to OP3.]
4. FineStream
• Overview
[Figure: FineStream takes a SQL query and an incoming stream split into batches; online profiling feeds a performance model, which produces an operator-to-device mapping; a dispatcher then runs each operator (op1, op2, …) on its mapped device and emits results.]
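The overview's profiling-plus-model loop can be sketched as follows. All class and method names here are illustrative assumptions, not FineStream's actual interfaces; the moving-average update is one simple way to keep the model adaptive.

```python
# Sketch of the control loop in the overview figure: online profiling
# refines a performance model, which in turn drives device mapping.
class PerformanceModel:
    def __init__(self):
        self.throughput = {}  # (operator, device) -> tuples/s estimate

    def update(self, op, device, measured):
        # Exponential moving average of profiled throughput.
        old = self.throughput.get((op, device), measured)
        self.throughput[(op, device)] = 0.5 * old + 0.5 * measured

    def best_device(self, op):
        cpu = self.throughput.get((op, "cpu"), 0.0)
        gpu = self.throughput.get((op, "gpu"), 0.0)
        return "cpu" if cpu >= gpu else "gpu"
```

A dispatcher would call `update` after each profiled batch and `best_device` when mapping the next batch's operators.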
4. FineStream
• Topology
[Figure: a query topology of operators OP1-OP11 organized into three branches (branch1: OP1-OP3, branch2: OP4-OP6, branch3: OP8-OP10) that merge through OP7 into OP11; one chain of operators forms the critical path.]
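The critical path in the topology figure is the most expensive chain of operators through the DAG. A minimal sketch of computing it, with illustrative operator costs (the DAG encoding and cost values are assumptions, not FineStream's data structures):

```python
# Critical path of an operator DAG: the largest total cost along any
# chain from a source operator to a sink.
def critical_path(dag, cost):
    """dag: op -> list of downstream ops; cost: op -> execution time."""
    memo = {}

    def longest(op):
        if op not in memo:
            succs = dag.get(op, [])
            memo[op] = cost[op] + (max(map(longest, succs)) if succs else 0)
        return memo[op]

    # Sources are operators that never appear as someone's successor.
    sinks_of = {d for ds in dag.values() for d in ds}
    sources = set(dag) - sinks_of
    return max(longest(op) for op in sources)
```

For a diamond like OP1 -> OP3 and OP2 -> OP3, the branch with the heavier operator dominates the critical path.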
4. FineStream
• Optimization 1: Branch Co-Running
[Figure: (a) Branch parallelism: branches 1-3 advance stage by stage (t_stage1, t_stage2, t_stage3). (b) Branch scheduling optimization: branch 3 is rescheduled so that stages of different branches overlap in time, shortening the total execution.]
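The idea behind branch co-running is that independent branches need not wait for each other: they can be spread across the CPU and GPU so their executions overlap. A greedy longest-first sketch (the scheduling policy and cost values are illustrative, not FineStream's scheduler):

```python
# Assign independent branches to whichever device (CPU or GPU) becomes
# free first, so branch executions overlap instead of running serially.
def co_run_makespan(branch_costs):
    cpu_busy = gpu_busy = 0.0
    for c in sorted(branch_costs, reverse=True):  # longest branch first
        if cpu_busy <= gpu_busy:
            cpu_busy += c
        else:
            gpu_busy += c
    return max(cpu_busy, gpu_busy)
```

With branch costs of 4, 3, and 2 time units, co-running finishes in 5 units instead of the 9 a serial schedule would take.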
4. FineStream
• Optimization 2: Batch Pipeline
[Figure: (a) Phase partitioning: the operator DAG is split into phase 1 (OP1, OP2, OP4, OP5, OP8, OP9) and phase 2 (OP3, OP6, OP7, OP10, OP11). (b) Batch pipeline: over time, phase 1 of batch 2 (PH1 B2) overlaps with phase 2 of batch 1 (PH2 B1), and so on. (PH i: phase i; B i: batch i.)]
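The batch pipeline behaves like a classic two-stage pipeline: while phase 2 of batch i runs, phase 1 of batch i+1 already starts. A minimal sketch of the resulting timing, with illustrative phase durations:

```python
# Two-stage pipeline timing: after the first batch fills the pipeline,
# each further batch costs only the slower phase, not both phases.
def pipelined_time(n_batches, t_phase1, t_phase2):
    bottleneck = max(t_phase1, t_phase2)
    return t_phase1 + t_phase2 + (n_batches - 1) * bottleneck

def serial_time(n_batches, t_phase1, t_phase2):
    # Without pipelining, each batch pays both phases back to back.
    return n_batches * (t_phase1 + t_phase2)
```

For 3 batches with phase times of 2 and 3 units, pipelining finishes in 11 units instead of 15.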
4. FineStream
• Optimization 3: Handling Dynamic Workload
  • Light-Weight Resource Reallocation
  • Query Plan Adjustment
[Figure: on the integrated architecture (CPU compute units, GPU compute units, shared memory): (a) when 90% of the workload goes to OP2, OP2 is assigned the larger compute-unit share; (b) when the distribution flips and 90% goes to OP3, the CPU/GPU compute-unit assignment is swapped accordingly.]
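Light-weight resource reallocation can be sketched as a swap of device assignments when the monitored tuple ratio between two parallel operators flips, instead of replanning the whole query. The two-operator structure and the convention of giving the heavier operator the GPU's compute units are illustrative assumptions:

```python
# When the dataflow monitor sees the workload ratio flip (e.g. 90/10 to
# 10/90), swap the compute-unit assignment between the two operators.
def reallocate(assignment, ratios):
    """assignment: op -> device; ratios: op -> fraction of tuples (2 ops)."""
    heavy, light = sorted(ratios, key=ratios.get, reverse=True)
    new = dict(assignment)
    new[heavy] = "gpu"  # heavier operator gets the GPU compute units
    new[light] = "cpu"
    return new
```

Only when this cheap adjustment still leaves performance low does FineStream fall back to the heavier query plan adjustment.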
4. FineStream
• Execution flow
[Figure: by default, batches flow through the operator DAG, with each operator OPi split between CPU and GPU threads according to a per-operator CPU%/GPU% mapping table; Branch Co-Running and Batch Pipeline raise parallelism and bandwidth utilization. Dataflow monitoring detects dynamic-workload migration; on detection, FineStream first performs light-weight resource reallocation, and if performance is still low, it applies a query plan adjustment (switching to another DAG).]
5. Evaluation
• Platforms
  • AMD A10-7850K
  • AMD Ryzen 5 2400G
• Datasets
  • Google compute cluster monitoring
  • Anomaly detection in smart grids
  • Linear road benchmark
  • Synthetically generated dataset
• Benchmarks: nine queries (Q1-Q9)

Example: Q1 (Google compute cluster monitoring)

  select timestamp, category, sum(cpu) as totalCPU
  from TaskEvents [range 256 slide 1]
  group by category
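Q1's per-window semantics are a grouped sum: within each window of 256 tuples, sum the `cpu` field per `category`. A minimal sketch of that computation over one window (the tuple layout is illustrative, and the `timestamp` output column is omitted for brevity):

```python
from collections import defaultdict

# What Q1 computes on one [range 256 slide 1] window of TaskEvents:
# group tuples by `category` and sum their `cpu` field.
def q1_window(window):
    totals = defaultdict(float)
    for tup in window:
        totals[tup["category"]] += tup["cpu"]
    return dict(totals)
```

FineStream would evaluate this once per window, with consecutive windows sharing 255 of their 256 tuples.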
5. Evaluation
• Throughput: FineStream achieves the best performance in most cases.
[Figure: throughput (1E5 tuples/s, 0-25) of Single, Saber, and FineStream for Q1-Q9 on the A10-7850K and the Ryzen5 2400G.]
5. Evaluation
• Latency: Low latency in most cases.
[Figure: latency (s, 0-1.5) of Single, Saber, and FineStream for Q1-Q9 on the A10-7850K and the Ryzen5 2400G.]
5. Evaluation
• Throughput vs. latency: queries with high throughput usually have low latency, and vice versa.
[Figure: scatter plot of throughput (1E5 tuples/s, 0-25) vs. latency (s, 0-1.4) for FineStream and Saber on the A10-7850K and the Ryzen5 2400G.]
5. Evaluation
• Utilization: FineStream utilizes the GPU of the integrated architecture better.
[Figure: CPU and GPU utilization (%, 0-100) of Saber and FineStream for Q1-Q9.]
5. Evaluation
• Comparison with Discrete Architectures
  • Throughput: the discrete GPUs achieve 1.8x to 5.7x higher throughput than the integrated architectures, due to the greater computational power of discrete GPUs.
  • Latency:
    • Discrete GPUs: T_total = T_PCIe_transmit + T_compute
    • Integrated GPUs: T_total = T_compute
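The latency formulas above can be made concrete with a small sketch: a discrete GPU pays a PCIe transfer on top of compute time, an integrated GPU does not. The batch size and link bandwidth below are illustrative values, not measurements from the paper.

```python
# Latency model from the slide: discrete GPUs add a PCIe transfer term,
# integrated GPUs pay only the compute time.
def discrete_latency(batch_bytes, pcie_gbytes_per_s, t_compute):
    t_pcie = batch_bytes / (pcie_gbytes_per_s * 1e9)
    return t_pcie + t_compute

def integrated_latency(t_compute):
    return t_compute
```

For example, shipping a 1 GB batch over a 16 GB/s PCIe link adds about 62 ms before the GPU even starts computing, which is why integrated GPUs can win on latency despite lower compute power.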
5. Evaluation
• Comparison with Discrete Architectures: higher price-performance ratio
[Figure: price-performance ratio (performance/USD, 0-14000) for Q1-Q9 on the GTX 1080Ti, V100, A10-7850K, and Ryzen; the integrated architectures achieve the higher ratios.]
5. Evaluation
• Comparison with Discrete Architectures: higher energy efficiency
[Figure: energy efficiency (performance/Watt, 0-35000) for Q1-Q9 on the GTX 1080Ti, V100, A10-7850K, and Ryzen; the integrated architectures are the more energy-efficient.]
6. Conclusion
• FineStream is the first fine-grained window-based relational stream processing engine on CPU-GPU integrated architectures.
• Lightweight query plan adaptations handle dynamic workloads.
• We evaluate FineStream on a set of stream queries on two integrated architectures.
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Feng Zhang, Lin Yang, Shuhao Zhang, Bingsheng He, Wei Lu, Xiaoyong Du
Renmin University of China · Technische Universität Berlin · National University of Singapore