Exploiting auto-tuning to analyze and improve performance portability on many-core architectures

James Price & Simon McIntosh-Smith
University of Bristol - High Performance Computing Group
http://uob-hpc.github.io
Funded in part by Imagination Technologies
Overview
• Performance portability is one of the key concerns for developers targeting many different architectures
• Current work in this area has provided mixed results
• Here we propose a method of rigorously analyzing performance portability by exploiting the black-box nature of auto-tuning
• We also present a simple technique that can improve performance portability when using auto-tuning
Devices

Vendor   Product              Architecture      Compute units
NVIDIA   GeForce GTX 580      Fermi (GF110)     16
NVIDIA   GeForce GTX 680      Kepler (GK104)    8
NVIDIA   GeForce GTX 780 Ti   Kepler (GK110)    15
NVIDIA   GeForce GTX 980 Ti   Maxwell (GM200)   22
NVIDIA   GeForce GTX 1080 Ti  Pascal (GP102)    28
AMD      Radeon HD 7970       Tahiti            32
AMD      Radeon R9 290X       Hawaii            44
AMD      Radeon R9 Fury X     Fiji              64
AMD      Radeon RX 480        Ellesmere         36
Intel    Core i5-3550         Ivy Bridge        4
Intel    Core i5-4590         Haswell           4
Intel    Core i5-6600         Skylake           4
Benchmarks
• Using three benchmarks from different domains
• Implemented with OpenCL (using PyOpenCL)
‣ Need portability but also explicit control over the kernel implementation
• Expose a large set of possible implementation decisions as tuning options
• Dynamically generate kernels to provide greater flexibility (see the sketch below)
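As a concrete illustration, here is a minimal sketch of what dynamic kernel generation with PyOpenCL can look like. The option names (unroll, mad_style) and the toy kernel are illustrative assumptions, not the authors' actual tuner code:

    # Minimal sketch of dynamic kernel generation with PyOpenCL. The two
    # tuning options shown (an unroll factor and the multiply-add flavour)
    # are illustrative stand-ins for the option sets on the next slides.
    import numpy as np
    import pyopencl as cl

    def make_source(unroll, mad_style):
        exprs = {
            "plain": "acc += a[base + {u}] * x[base + {u}];",
            "mad":   "acc = mad(a[base + {u}], x[base + {u}], acc);",
            "fma":   "acc = fma(a[base + {u}], x[base + {u}], acc);",
        }
        # Unroll the inner loop directly in the generated source
        body = "\n        ".join(exprs[mad_style].format(u=u)
                                 for u in range(unroll))
        return f"""
        __kernel void dot_chunk(__global const float *a,
                                __global const float *x,
                                __global float *out)
        {{
            int base = get_global_id(0) * {unroll};
            float acc = 0.0f;
            {body}
            out[get_global_id(0)] = acc;
        }}
        """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, make_source(unroll=4, mad_style="fma")).build()

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    x = np.random.rand(n).astype(np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
    out = np.empty(n // 4, dtype=np.float32)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)
    prg.dot_chunk(queue, (n // 4,), None, a_buf, x_buf, out_buf)
    cl.enqueue_copy(queue, out, out_buf)

Because each option simply changes the generated source string, the tuner can explore kernel variants that would be impossible to express through runtime parameters alone.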
Benchmark #1: Jacobi method

Solve Ax = b. Split matrix A into diagonal D and remainder R, then iterate:

    x_{i+1} = D^{-1}(b - R x_i)

Tuning options (two of these are sketched below):
• Work-group size / parallel decomposition
• Memory layout / memory access pattern
• Loop unrolling
• Separate * and + vs mad vs fma
• Branching vs masks
• Division by diagonal (inline or precompute)
• Address spaces of input vectors
• Embedding values into kernel as constants
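To make two of these options concrete, here is a hedged sketch (assumed identifiers, not the authors' code) of how a generated Jacobi kernel can vary the diagonal division and the address space of the input vector (e.g. x_space="__constant"):

    # Sketch of two Jacobi tuning options: inline division by the diagonal
    # vs multiplication by a precomputed reciprocal, and the address space
    # of the input vector. All names here are hypothetical.
    def jacobi_source(precompute_dinv, x_space="__global"):
        diag_arg = ("__global const float *dinv" if precompute_dinv
                    else "__global const float *D")
        update = "acc * dinv[row]" if precompute_dinv else "acc / D[row]"
        return f"""
        __kernel void jacobi(__global const float *R,  // A minus diagonal
                             {diag_arg},
                             __global const float *b,
                             {x_space} const float *x_old,
                             __global float *x_new,
                             const int N)
        {{
            int row = get_global_id(0);
            float acc = b[row];
            for (int col = 0; col < N; col++)
                acc -= R[row * N + col] * x_old[col];
            x_new[row] = {update};   // x_i+1 = D^-1 (b - R x_i)
        }}
        """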
Benchmark #2: Bilateral filter

Tuning options:
• Work-group size
• Tile size / tile layout
• Prefetch pixels into private/local memory
• Buffers vs images
• Separate * and + vs mad vs fma
• Loop interchange
• Native math functions (sketched after the formula below)
• Embedding values into kernel as constants
$$O(x, y) = \frac{\sum_{i=-r}^{r} \sum_{j=-r}^{r} I(x+i, y+j)\, W(x, y, i, j)}{\sum_{i=-r}^{r} \sum_{j=-r}^{r} W(x, y, i, j)}$$

$$W(x, y, i, j) = \exp\left(-\frac{\sqrt{i^2 + j^2}}{2\sigma_d} - \frac{\lVert I(x, y) - I(x+i, y+j) \rVert^2}{2\sigma_r}\right)$$
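For example, the "native math functions" option can be realised by swapping exp/sqrt for OpenCL's faster, lower-precision native_exp/native_sqrt builtins in the generated weight expression. A sketch under an assumed generator structure:

    # Sketch of the "native math functions" option: the generated weight
    # expression can use OpenCL's native_* builtins, which trade accuracy
    # for speed on some devices. `dist2` stands for the squared colour
    # distance ||I(x,y) - I(x+i,y+j)||^2 computed earlier in the kernel.
    def weight_expr(use_native_math):
        exp_f  = "native_exp"  if use_native_math else "exp"
        sqrt_f = "native_sqrt" if use_native_math else "sqrt"
        return (f"{exp_f}(-({sqrt_f}((float)(i*i + j*j)) / (2.0f * sigma_d)"
                f" + dist2 / (2.0f * sigma_r)))")

    print(weight_expr(use_native_math=True))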
Benchmark #3: BUDE (Bristol University Docking Engine)

Tuning options:
• Parallel decomposition / work-group size
• Data layout / memory access patterns
• Loop interchange
• Address space of molecules / forcefield
• Precomputing forcefield coefficients
• Native math functions
• Embedding values into kernel as constants (sketched below)
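The last option, common to all three benchmarks, can be implemented with OpenCL's build-time preprocessor definitions. A minimal sketch (kernel and names are hypothetical):

    # Sketch of "embedding values into kernel as constants": sizes and
    # parameters are baked in as -D preprocessor definitions at build
    # time, letting the compiler specialise and unroll more aggressively.
    import pyopencl as cl

    src = """
    __kernel void scale(__global float *v)
    {
        int i = get_global_id(0);
        if (i < NATOMS)            // injected at build time
            v[i] *= SCALE_FACTOR;  // likewise a compile-time constant
    }
    """

    ctx = cl.create_some_context()
    prg = cl.Program(ctx, src).build(
        options=["-DNATOMS=4096", "-DSCALE_FACTOR=0.5f"])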
Analysis approach
• Run the auto-tuner to optimize the kernel for each device individually
‣ Genetic algorithm used for search
‣ Tuned for 24 hours on each device (examines ~10K kernels)
• This produces a set of architecture-specific kernels
• We then run each architecture-specific kernel on every other device and measure efficiency
• Efficiency is defined as the fraction of the performance of the best kernel observed on that device (see the sketch below)
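In code terms, the efficiency metric might be computed like this (a sketch; the layout of the runtimes table is an assumption):

    # Sketch of the efficiency metric: for each device, compare every
    # architecture-specific kernel against the best kernel observed on
    # that device. runtimes[dev][tuned_for] holds a runtime in seconds;
    # None marks a kernel that failed to run (the X cells in the
    # heatmaps below).
    def efficiency_matrix(runtimes):
        eff = {}
        for dev, per_kernel in runtimes.items():
            best = min(t for t in per_kernel.values() if t is not None)
            eff[dev] = {k: (best / t if t is not None else None)
                        for k, t in per_kernel.items()}
        return eff

    # e.g. runtimes = {"GTX 580": {"GTX 580": 1.0, "HD 7970": 4.3, ...}, ...}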
Jacobi efficiencies
Rows: running on. Columns: tuned for (same device order as rows). X = kernel failed to run.

running on        580   680  780Ti 980Ti 1080Ti 7970  290X FuryX  480  IvyB  Hswl  Skyl
GTX 580           100%   92%   93%   71%   91%   23%   91%   24%   87%   21%   12%   90%
GTX 680            87%  100%   66%   69%   97%   19%   79%   20%   74%   14%   16%   69%
GTX 780 Ti         74%   71%  100%   51%   58%   20%   84%   20%   78%   18%   16%   93%
GTX 980 Ti         53%   98%   95%  100%   98%   21%   97%   23%   95%   41%   39%   90%
GTX 1080 Ti        56%   98%   90%   92%  100%   33%   97%   37%   93%   40%   43%   90%
HD 7970             X    55%   68%   53%   68%  100%   99%  100%   99%   19%   37%   45%
R9 290X             X    38%   33%   33%   38%  100%  100%   99%   99%   14%   20%   28%
R9 Fury X           X    26%   32%   49%   28%   99%   93%  100%   97%    6%   11%   14%
RX 480              X    57%   62%   36%   62%  100%  100%   99%  100%   13%   17%   21%
Ivy Bridge CPU      1%    1%    5%    5%    1%   73%    2%   62%    2%  100%    2%    2%
Haswell CPU         5%    5%    5%    5%    5%   85%   70%   80%   93%   53%  100%   99%
Skylake CPU         4%    4%    4%    5%    4%   71%   57%   65%   81%   44%   97%  100%
Jacobi - NVIDIA
Rows: running on. Columns: tuned for (same device order as rows).

running on      580   680  780Ti 980Ti 1080Ti
GTX 580        100%   92%   93%   71%   91%
GTX 680         87%  100%   66%   69%   97%
GTX 780 Ti      74%   71%  100%   51%   58%
GTX 980 Ti      53%   98%   95%  100%   98%
GTX 1080 Ti     56%   98%   90%   92%  100%

• Work-group size differs between devices
• Other drops due to address spaces and memory access pattern
Jacobi - AMD
Rows: running on. Columns: tuned for (same device order as rows).

running on    7970  290X FuryX  480
HD 7970       100%   99%  100%   99%
R9 290X       100%  100%   99%   99%
R9 Fury X      99%   93%  100%   97%
RX 480        100%  100%   99%  100%

• Most parameters uniform across all AMD devices
• Fury X suffers slightly from expensive division
Jacobi - Intel
Rows: running on. Columns: tuned for (same device order as rows).

running on       IvyB  Hswl  Skyl
Ivy Bridge CPU   100%    2%    2%
Haswell CPU       53%  100%   99%
Skylake CPU       44%   97%  100%

• Ivy Bridge performs poorly with the fma builtin
• Vectorisation width differs
Worst-case efficiency (WCE) of each tuned Jacobi kernel across all twelve devices ('-' marks a kernel that failed to run on at least one device):

tuned for     580   680  780Ti 980Ti 1080Ti 7970  290X FuryX  480  IvyB  Hswl  Skyl
WCE            -     1%    4%    5%    1%   19%    2%   20%    2%    6%    2%    2%
Bilateral efficiencies
Rows: running on. Columns: tuned for (same device order as rows). X = kernel failed to run. Final row: worst-case efficiency (WCE) of each tuned kernel.

running on        580   680  780Ti 980Ti 1080Ti 290X FuryX  480  IvyB  Hswl  Skyl
GTX 580           100%   96%   95%  100%   93%   70%   70%   70%    8%   19%   19%
GTX 680            98%  100%   96%   97%   90%   68%   71%   70%    6%   13%   13%
GTX 780 Ti         99%   99%  100%   99%   97%   73%   76%   69%    7%   18%   18%
GTX 980 Ti        100%   98%   98%  100%   97%   85%   85%   84%    7%   17%   17%
GTX 1080 Ti        99%   97%   98%   98%  100%   86%   87%   87%    7%   12%   12%
R9 290X            35%   37%   37%   35%    X   100%  100%  100%    6%   11%   11%
R9 Fury X          31%   38%   37%   31%    X   100%  100%  100%    5%    8%    8%
RX 480             29%   45%   45%   37%    X   100%  100%  100%    5%    9%    9%
Ivy Bridge CPU     14%   74%   74%   14%   14%   28%   28%   29%  100%   98%   98%
Haswell CPU        73%   72%   71%   72%   72%   27%   27%   27%   99%  100%  100%
Skylake CPU        74%   74%   73%   71%   73%   25%   26%   26%   99%  100%  100%
WCE                14%   37%   37%   14%    -    25%   26%   26%    5%    8%    8%
BUDE efficiencies
Rows: running on. Columns: tuned for (same device order as rows). X = kernel failed to run. Final row: worst-case efficiency (WCE) of each tuned kernel.

running on        580   680  780Ti 980Ti 1080Ti 7970  290X FuryX  480  IvyB  Hswl  Skyl
GTX 580           100%  100%    X    91%   91%   90%   84%   91%   68%   15%   81%   74%
GTX 680            98%  100%   81%   66%   48%   78%   50%   52%   30%    8%   81%   64%
GTX 780 Ti          X     X   100%   83%   35%   33%   46%   32%   32%   13%   37%   39%
GTX 980 Ti          X     X    89%  100%   77%   66%   89%   79%   90%   21%   90%   85%
GTX 1080 Ti         X     X    83%  100%  100%   74%   97%   86%   96%   29%   93%   87%
HD 7970             X     X     X     X    71%  100%   75%   72%   72%   10%   78%   84%
R9 290X             X     X     X     X    61%   73%  100%   66%   67%   10%   64%   61%
R9 Fury X           X     X     X     X    71%   86%   86%  100%   50%   12%   48%   47%
RX 480              X     X     X     X    75%   95%   98%  100%  100%   12%   92%   92%
Ivy Bridge CPU     57%   57%    X     X    62%   15%   93%   16%   63%  100%   98%   98%
Haswell CPU        51%   51%    X     X    61%   15%   93%   14%   61%   98%  100%   98%
Skylake CPU        43%   43%    X     X    57%   12%   93%   13%   57%   99%   98%  100%
WCE                 -     -     -     -    35%   12%   46%   13%   30%    8%   37%   39%
Multi-objective auto-tuning
• Extend the tuning process to consider multiple devices at once
• Each time a kernel is generated, the auto-tuner evaluates it on every target device
• The performance values are then reduced into a single number representing the overall 'fitness'
• We use worst-case efficiency for this fitness function (sketched below)
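A minimal sketch of such a fitness function, assuming hypothetical benchmark and best_time helpers (the kernel's runtime on a device, and the best runtime seen on that device so far):

    # Sketch of the worst-case-efficiency fitness used to guide the
    # multi-objective genetic search.
    def fitness(kernel_source, devices, benchmark, best_time):
        effs = []
        for dev in devices:
            t = benchmark(kernel_source, dev)
            if t is None:      # failed to build or run on this device
                return 0.0
            effs.append(best_time[dev] / t)
        # Maximising the minimum drives the search towards kernels that
        # are good everywhere, not just on one architecture.
        return min(effs)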
Multi-objective tuning results
Summary
• Over-optimisation hurts performance portability
• Auto-tuning can be a great way to expose these issues
• It can also help generate performance-portable kernels
• Future work will look at tuning across different input/problem configurations