amd gpu - rwth aachen universityhpac.rwth-aachen.de/teaching/sem-hpsc-14/presentations/amd...
AMD GPU
Jasper Manousek
Ying Li
05.02.2015
Seminar | High-Performance and Scientific Computing Prof. Paolo Bientinesi, Ph.D.
Agenda
Architecture
Dwarfs
Sparse Linear Algebra
Dense Linear Algebra
Graph Traversal
MapReduce
Conclusion
2
Architecture
3
Comparison
4 Architecture
NVIDIA GTX640
• 1 controlling unit for every 8 stream processors
• advantage: easier for developers due to simple structure
Radeon HD 6850
• blocks of 6 SPs
• 4 general ones and one overseer
• one SP with FP/Int arithmetic functions
• advantage: more potential if used correctly
• disadvantage: requires the developer to program specifically towards it
Lower power consumption overall
Smaller die size thanks to the structure
Less expensive
Other small differences
Comparison
5 Architecture
Dense Linear Algebra
Classic vector and matrix operations¹
Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm²
Examples³
6 Dense Linear Algebra
1, 2, 3: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
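The contiguous layout described above can be illustrated with a minimal sketch. It assumes a row-major flat list and two hypothetical helpers, `matvec` and `block`, which are not taken from any of the cited libraries; it only shows how rows and blocks are addressed inside a single array.

```python
# Minimal sketch: a dense matrix stored as one contiguous row-major list.
# The helper names are illustrative assumptions, not a library API.

def matvec(a, rows, cols, x):
    """Return y = A x for a row-major flat matrix a of shape rows x cols."""
    y = [0.0] * rows
    for i in range(rows):
        base = i * cols  # index of the first element of row i
        acc = 0.0
        for j in range(cols):
            acc += a[base + j] * x[j]
        y[i] = acc
    return y


def block(a, cols, r0, c0, size):
    """Copy the size x size block whose top-left element is (r0, c0)."""
    return [a[(r0 + i) * cols + c0:(r0 + i) * cols + c0 + size]
            for i in range(size)]


# A 2 x 3 matrix laid out row after row.
A = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
y = matvec(A, 2, 3, [1.0, 0.0, -1.0])
b = block(A, 3, 0, 1, 2)
```

Because each row is a contiguous slice, neighbouring threads on a GPU can read neighbouring addresses, which is one reason this dwarf maps well to such hardware.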
Paper
7
Title: Pannotia: Understanding Irregular GPGPU Graph Applications
Author: Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron
Publication: Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013
Link: http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf
Dense Linear Algebra
Overview of the Paper
8
Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (clMAGMA library)
Efficient implementation on AMD’s Tahiti GPUs with the use of the OpenCL standard and optimized BLAS routines
Observation of a wide applicability and many-fold performance improvement over highly tuned codes constituting state-of-the-art libraries for the current generation of multicore CPUs
Dense Linear Algebra
Performance Study
9
Hardware: AMD’s Radeon HD7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU’s multicore host
Library: MKL 11.1 on the CPU; clMAGMA on the GPU and its CPU host
Results: For dense linear algebra, clMAGMA on the heterogeneous system (multicore CPU with GPU accelerator) achieves higher performance than MKL on the CPU alone
Dense Linear Algebra
Results in Detail (1)
10
1) LU factorization (up to 5.7x speedup vs. the CPU host)
2) Cholesky factorization (up to 5.4x speedup vs. the CPU host)
Dense Linear Algebra
Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
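As a reminder of what one of the benchmarked factorizations computes, here is a minimal sketch of an unblocked Cholesky factorization in plain Python. It is not the hybrid CPU+GPU algorithm used in clMAGMA; the function name `cholesky` and the test matrix are assumptions chosen for illustration only.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    a must be a symmetric positive definite matrix given as a list of rows.
    This is the textbook unblocked algorithm, shown for illustration only.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

Libraries that target GPUs typically work on matrix blocks rather than single entries so that most of the work ends up in large matrix-matrix products.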
Results in Detail (2)
11
3) QR factorization (up to 5.9x speedup vs. the CPU host)
4) Hessenberg factorization (up to 5.5x speedup vs. the CPU host)
Dense Linear Algebra
Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
Results in Detail (3)
12
5) Matrix Inversion (up to 1.2x speedup vs. the CPU host)
Dense Linear Algebra
Figure legend: CPU+GPU with clMAGMA; CPU with MKL 11.1
Source of the figures: (1)
Sparse Linear Algebra
Used when input matrices have a large number of zero entries¹
Compressed data structures, keeping only the non-zero entries and their indices, are the norm here²
13
1, 2: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra
3: http://www.lanl.gov/Caesar/node223.html
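A compressed data structure of the kind mentioned above can be sketched as follows. The example uses the common compressed sparse row (CSR) layout; the function names `to_csr` and `spmv` are illustrative assumptions and do not come from the libraries discussed in this talk.

```python
def to_csr(dense):
    """Convert a list-of-rows matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # keep only the non-zero entry ...
                col_idx.append(j)  # ... and remember its column index
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def spmv(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x for a matrix in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


dense = [[5.0, 0.0, 0.0],
         [0.0, 0.0, 2.0],
         [1.0, 0.0, 3.0]]
values, col_idx, row_ptr = to_csr(dense)
y = spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Only four of the nine entries are stored here; the indirect access `x[col_idx[k]]` is what makes memory access patterns irregular compared to the dense case.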
Sparse Linear Algebra
Paper
14
Title Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries
Author Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling
Publication SIAM Journal on Scientific Computing: Vol. 35, No. 5
Link http://arxiv.org/pdf/1212.6326v2.pdf
Sparse Linear Algebra
Overview of the Paper
15
Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL
One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose implementation relies on a sparse matrix-vector product
In general, all the experiments, including the nonlinear disordered Hamiltonian lattice, show a 10x to 20x acceleration when running on a GPU compared to the CPU path
Sparse Linear Algebra
Performance Study
16
Hardware
− GPUs: AMD Radeon HD 7970 (Tahiti) and NVIDIA Tesla C2070
− CPU: Intel Core i7 930
Implementation
− GPUs: OpenCL implementations from AMD and NVIDIA
− CPU: OpenCL implementations from AMD and Intel
Results
− Distinct acceleration is observed when running the GPU path vs. the CPU path
− Significant acceleration requires problem sizes between 10^3 and 10^5, due to considerable overhead at smaller problem sizes
− The overhead of using high-level libraries is negligible compared to the effort spent in getting familiar with the details of CUDA or OpenCL
Sparse Linear Algebra
Results in Detail (1)
17 Sparse Linear Algebra
Source of the table : (2)
VexCL CPU (Intel)
GPU (AMD)
Results in Detail (2)
18
Performance under largest problem size:

Hamiltonian lattice      Time (sec)   Achieved throughput, GB/sec (percentage of theoretical peak)
GPU: NVIDIA
  Thrust                    319.60    120 (81%)
  CMTL4                     370.31    104 (70%)
  VexCL                     401.39     96 (65%)
  ViennaCL                  433.50     89 (60%)
GPU: Tahiti
  VexCL                     225.41    170 (65%)
  ViennaCL                  214.87    179 (68%)
CPU: Intel Core i7 930
  Thrust                       N/A    N/A
  VexCL (AMD)              2934.99     13 (51%)
  VexCL (Intel)            3171.74     12 (47%)
  ViennaCL (AMD)           2608.80     15 (58%)
  ViennaCL (Intel)         2580.47     15 (58%)

Sparse Linear Algebra
Source of the table: (2)
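The speedup implied by the Hamiltonian lattice timings can be checked with a few lines of arithmetic. The sketch below copies four run times from the table (Tahiti GPU and the AMD OpenCL implementation on the CPU); the dictionary layout and the `speedup` helper are assumptions made for this example.

```python
# Run times in seconds, copied from the Hamiltonian lattice table
# (largest problem size). The data structure itself is illustrative.
times = {
    ("Tahiti", "VexCL"): 225.41,
    ("Tahiti", "ViennaCL"): 214.87,
    ("CPU", "VexCL (AMD)"): 2934.99,
    ("CPU", "ViennaCL (AMD)"): 2608.80,
}


def speedup(slow, fast):
    """Speedup = time of the slower run divided by time of the faster run."""
    return slow / fast


vexcl = speedup(times[("CPU", "VexCL (AMD)")], times[("Tahiti", "VexCL")])
viennacl = speedup(times[("CPU", "ViennaCL (AMD)")],
                   times[("Tahiti", "ViennaCL")])
print(f"VexCL: {vexcl:.1f}x, ViennaCL: {viennacl:.1f}x")
```

Both ratios come out at roughly 12x to 13x, which is consistent with the 10x to 20x range quoted in the overview of the paper.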
Graph Traversal
19 Graph Traversal
http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png
Branch divergence
• Multiple threads run on the same wavefront
• Threads in a wavefront execute in lockstep, so divergent branches are serialized
Memory divergence
• All threads in a wavefront must complete their memory accesses before the next step
• Some threads must go through multiple adjacency lists to find the correct memory location
Load imbalance
• Graphs are unbalanced by nature
• Some threads get a much larger workload than others
Divergence
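The effect of load imbalance under lockstep execution can be made concrete with a small model. The sketch below assumes a wavefront width of 64 and a simplified cost model in which a wavefront runs until its slowest thread finishes; the function `wavefront_utilization` is a hypothetical helper, not a measurement tool.

```python
WAVEFRONT_SIZE = 64  # assumed wavefront width for this model


def wavefront_utilization(work_per_thread, width=WAVEFRONT_SIZE):
    """Fraction of lane-steps doing useful work when lanes run in lockstep.

    work_per_thread[i] is the number of steps thread i needs, for example
    the number of neighbours of the vertex it processes.
    """
    useful = 0
    issued = 0
    for start in range(0, len(work_per_thread), width):
        lanes = work_per_thread[start:start + width]
        steps = max(lanes)        # the wavefront runs as long as its slowest lane
        useful += sum(lanes)
        issued += steps * width   # idle lanes still occupy their slot
    return useful / issued if issued else 1.0


balanced = [4] * 64          # every vertex has 4 neighbours
skewed = [1] * 63 + [64]     # one hub vertex with 64 neighbours

u_balanced = wavefront_utilization(balanced)
u_skewed = wavefront_utilization(skewed)
```

In this model the balanced case keeps every lane busy, while a single high-degree vertex drops utilization to about 3 percent, which is the kind of imbalance irregular graphs produce.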
20 Graph Traversal
All data was gathered using an AMD Radeon HD7000 and an AMD A8-5500 accelerated processing unit
Pannotia was used as the application suite
Speedup
21 Graph Traversal
Dijkstra and Graph Coloring
22 Graph Traversal
http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg
http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg
Speedups ranging from 4 to 8
Speedup tends to be better for larger graphs
Strong parallelization
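To show why graph coloring parallelizes well, here is a minimal sketch of a round-based coloring in the style of Jones and Plassmann. It is an assumption chosen for illustration and not necessarily the scheme implemented in Pannotia; the priorities must be distinct for the loop to terminate.

```python
def color_graph(adj, priority):
    """Round-based graph coloring (illustrative sketch).

    adj: dict vertex -> list of neighbours (undirected graph)
    priority: dict vertex -> distinct number used to break ties
    In each round, every uncolored vertex whose priority beats all of its
    uncolored neighbours picks the smallest colour not used around it.
    """
    color = {v: None for v in adj}
    while any(c is None for c in color.values()):
        # The winners of a round are never adjacent to each other, so a GPU
        # could process all of them with separate threads at the same time.
        winners = [
            v for v in adj
            if color[v] is None
            and all(color[u] is not None or priority[u] < priority[v]
                    for u in adj[v])
        ]
        for v in winners:
            used = {color[u] for u in adj[v] if color[u] is not None}
            c = 0
            while c in used:
                c += 1
            color[v] = c
    return color


# A cycle of four vertices: 0 - 1 - 2 - 3 - 0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
priority = {0: 3, 1: 1, 2: 2, 3: 0}
colors = color_graph(adj, priority)
```

Larger graphs offer more independent vertices per round, which fits the observation that speedup tends to improve with graph size.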
Dijkstra and Graph Coloring
23 Graph Traversal
Dijkstra and Graph Coloring
24 Graph Traversal
Source: (4)
Friend Recommendation and Connected Components Labelling
25 Graph Traversal
http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png
Speedups ranging from 1 to 2
Relatively little speedup due to strong load imbalance
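Connected components labelling is often expressed as iterative label propagation, which the following sketch illustrates. It assumes an undirected graph given as a symmetric adjacency dictionary and is not taken from the Pannotia sources; the function name `connected_components` is hypothetical.

```python
def connected_components(adj):
    """Label every vertex with the smallest vertex id in its component.

    adj: dict vertex -> list of neighbours; edges must be listed in both
    directions. Each pass corresponds to one kernel launch in which every
    vertex reads the labels of its neighbours.
    """
    label = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        new_label = dict(label)
        for v in adj:  # one thread per vertex in a GPU version
            best = min([label[v]] + [label[u] for u in adj[v]])
            if best < label[v]:
                new_label[v] = best
                changed = True
        label = new_label
    return label


adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
labels = connected_components(adj)
```

The work per vertex is proportional to its degree, so a few high-degree vertices can dominate each pass, which matches the imbalance noted above.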
Friend Recommendation and Connected Components Labelling
26 Graph Traversal
Effectiveness depends on the exact problem
Deep understanding of GPU required
Deep understanding of problem required
Summary
27 Graph Traversal
Map Reduce
28 Map Reduce
http://de.wikipedia.org/wiki/Datei:MapReduce2.svg
AMD GPUs have two ways of accessing memory
FastPath / CompletePath
All current GPU MapReduce implementations use global atomic operations
Use of global atomic operations causes AMD GPUs to use the CompletePath
Tests show memory access over the CompletePath to be 32 times slower
Map Reduce
29 Map Reduce
Software-based Atomic add
30 Map Reduce
A MapReduce Framework for Heterogeneous Computing Architectures
Master thread quickly becomes a bottleneck
Instead, group by wavefront
Define the first thread as the dominant thread
Create 4 global arrays with one element per wavefront: WavefrontsAddresses, WavefrontsSum, WavefrontsPrefixSums, Finished
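The idea of aggregating atomic adds per wavefront can be sketched as a sequential simulation. This is a simplified model under stated assumptions, not the kernel from the dissertation: the function `software_atomic_add`, its snake_case array names and the omission of the Finished flags are all choices made for this example.

```python
def software_atomic_add(memory, wavefront_addresses, wavefront_sums):
    """Simulate one round of wavefront-level atomic adds (illustrative only).

    memory: dict address -> current value of the shared counter
    wavefront_addresses[w]: address wavefront w adds to, or None if idle
    wavefront_sums[w]: total increment requested by wavefront w
    Returns a list whose entry w is the value an atomic add would have
    returned to wavefront w (None for idle wavefronts).
    """
    n = len(wavefront_addresses)
    wavefront_prefix_sums = [None] * n
    running = {}  # address -> value after the requests processed so far
    for w in range(n):
        addr = wavefront_addresses[w]
        if addr is None:
            continue
        base = running.get(addr, memory[addr])
        wavefront_prefix_sums[w] = base          # offset handed to wavefront w
        running[addr] = base + wavefront_sums[w]
    # One plain write per address replaces many global atomic operations.
    for addr, value in running.items():
        memory[addr] = value
    return wavefront_prefix_sums


memory = {0: 10, 1: 0}
prefix = software_atomic_add(memory, [0, 1, 0, None], [3, 5, 2, 0])
```

Each wavefront receives a distinct starting value, so its threads can write results to disjoint locations without issuing a hardware atomic operation themselves.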
Map Reduce
31 Map Reduce
Map Reduce
32
Step 1
Threads load address and sums
Sync
Map Reduce
33
Step 2
Decision: is this the only wavefront on the address? (true / false)
WFprefixSum = address
WFincrement = localSum
Local atomic add to generate prefix sum and increment
Sync
Update dominant and set local increment to 0
Map Reduce
34 Map Reduce
Step 3
Sync
Decision: requesting wavefront? (true / false)
Set addresses = 0
Decision: dominant? (true / false)
Update global variable
Reset local data
Evaluation
35 MapReduce
Hardware
− GPU: ATI Radeon HD 5870 (Cypress)
− CPU: 2x Intel Xeon E5405
Key performance measures
Total execution time in nanoseconds
Ratio of FastPath to CompletePath memory transactions
Experiment Micro Benchmarks
36
1) without memory transactions (up to 1.9x vs. system atomic operation)
2) with memory transactions (up to 3x vs. system atomic operation)
MapReduce
Source of the figures: (3)
Experiment MapReduce: Test Applications
37 MapReduce
Matrix Multiplication (MM)
− Matrices X and Y as input
− Outputs matrix Z
− Implementation: only the map phase
− Each map task is responsible for calculating one element of matrix Z
String Match (SM)
− Searches for an input keyword
− Outputs all matching locations
− Implementation: only the map phase
− Each map task reads a chunk of the input document and outputs the found locations
KMeans (KM)
− Iterative clustering algorithm
− Each iteration assigns each input point to the closest cluster and recalculates the clusters
− Implementation: both the map and reduce phases
− Map function assigns points and reduce function recalculates clusters
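The split of KMeans into a map and a reduce phase can be sketched in a few lines. The code below runs one iteration sequentially and is only an illustration of the division of labour; the function names `kmeans_map`, `kmeans_reduce` and `kmeans_iteration` and the sample points are assumptions, not part of the evaluated framework.

```python
from collections import defaultdict


def kmeans_map(point, centers):
    """Map: emit (index of the closest center, point)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists)), point


def kmeans_reduce(points):
    """Reduce: the new center is the mean of the points of one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centers):
    """Run one map phase and one reduce phase and return the new centers."""
    groups = defaultdict(list)
    for point in points:  # each map task handles one input point
        key, value = kmeans_map(point, centers)
        groups[key].append(value)
    # Clusters that received no points keep their old center.
    return [kmeans_reduce(groups[k]) if k in groups else centers[k]
            for k in range(len(centers))]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = kmeans_iteration(points, centers)
```

On a GPU the grouping step is where map tasks write their output concurrently, which is why the cost of atomic operations matters for this application.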
Experiment MapReduce: Result for Matrix Multiplication
38
MapReduce
The speedup of using the software-based atomic add over the system one increases as the input matrices get larger (up to 13.55-fold)
Ratio of FastPath to CompletePath memory accesses: 30:0 for software-based atomic and 3:28 for system-provided atomic implementations
Source of the figures: (3)
Experiment MapReduce: Result for String Match
39
MapReduce
The software atomic approach helps to improve the memory read performance.
In the case of a large number of matches, the overhead incurred by the software atomic approach for writing results offsets the benefit of using FastPath for read accesses.
Ratio of FastPath to CompletePath memory accesses: 12:0 for software-based atomic and 1:19 for system-provided atomic implementations
Source of the figures: (3)
Experiment MapReduce: Result KMeans
40
MapReduce
The speedup of using the software-based atomic add over the system one increases with the number of points (up to 67.3-fold)
Source of the figures: (3)
Conclusion AMD GPU
41
MapReduce
Significant speedup has been observed
Readily available in most computers
Requires a deep understanding of the architecture and the programming language
Compared to NVIDIA, a more complicated implementation is needed to enhance efficiency
Source of the figures: (3)
References
1) Chongxiao Cao, Jack Dongarra, Peng Du, Mark Gates, Piotr Luszczek and Stanimire Tomov (2013): clMAGMA: High Performance Dense Linear Algebra with OpenCL. International Workshop on OpenCL 2013.
2) Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling (2013): Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM Journal on Scientific Computing, Vol. 35, No. 5.
3) Marwa K. Elteir (2012): A MapReduce Framework for Heterogeneous Computing Architectures. Dissertation, Virginia Polytechnic Institute and State University.
4) Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron (2013): Pannotia: Understanding Irregular GPGPU Graph Applications. Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013.
42
Work Distribution
43
Ying Jasper
Architecture p.3-5
Graph Traversal p.19-27
Dense Linear Algebra p.6-12
Sparse Linear Algebra p.13-18
MapReduce p.28-34 p.35-40
Conclusion p.41