amd gpu - rwth aachen universityhpac.rwth-aachen.de/teaching/sem-hpsc-14/presentations/amd...

AMD GPU

Jasper Manousek

Ying Li

05.02.2015

Seminar | High-Performance and Scientific Computing Prof. Paolo Bientinesi, Ph.D.

Agenda

Architecture

Dwarfs

Sparse Linear Algebra

Dense Linear Algebra

Graph Traversal

MapReduce

Conclusion

2

Architecture

3

Comparison

4 Architecture

NVIDIA GTX640

• One control unit for every 8 stream processors

• advantage: easier for developers due to the simple structure

Radeon HD 6850

• blocks of 6 SP

• 4 general ones and one overseer

• one SP with FP/Int arithmetic functions

• advantage: more potential if used correctly

• disadvantage: requires the developer to specifically program towards it

Lower power consumption overall

Smaller die size due to the structure

Less expensive

Other small differences

Comparison

5 Architecture


Classic vector and matrix operations [1]

Data is typically laid out as a contiguous array, and computations on elements, rows, columns, or matrix blocks are the norm [2]

Examples [3]

6 Dense Linear Algebra

[1], [2], [3]: http://view.eecs.berkeley.edu/wiki/Dense_Linear_Algebra
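To make the contiguous layout concrete, here is a minimal sketch that stores square matrices as flat row-major lists and multiplies them block by block. The function name `blocked_matmul` and the default block size are assumptions made for this illustration; they are not taken from the cited wiki page or from any library.

```python
def blocked_matmul(a, b, n, block=2):
    """Multiply two n x n matrices stored as flat row-major lists.

    Element (i, j) of a matrix lives at index i * n + j.
    The three outer loops walk over blocks, the three inner loops
    over the elements inside the current blocks.
    """
    c = [0.0] * (n * n)
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, n)):
                        a_ik = a[i * n + k]
                        for j in range(j0, min(j0 + block, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c


# 2 x 2 example: [[1, 2], [3, 4]] times [[5, 6], [7, 8]]
print(blocked_matmul([1, 2, 3, 4], [5, 6, 7, 8], 2))
```

Working on blocks keeps the touched elements close together in memory, which is the access pattern the slide describes as typical for this dwarf.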

Paper

7

Title Pannotia: Understanding Irregular GPGPU Graph Applications

Author Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron

Publication Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013

Link http://www.cs.virginia.edu/~skadron/Papers/Che-pannotia-iiswc2013.pdf







Overview of the Paper

8

Design of several fundamental dense linear algebra (DLA) algorithms in OpenCL (clMAGMA library)

Efficient implementation on AMD’s Tahiti GPUs with the use of the OpenCL standard and optimized BLAS routines

Observation of a wide applicability and many-fold performance improvement over highly tuned codes constituting state-of-the-art libraries for the current generation of multicore CPUs


Performance Study

9

Hardware: AMD’s Radeon HD 7970 card and a single-socket six-core AMD Phenom II X6 1100T CPU running at 3.71 GHz as the GPU’s multicore host

Library: MKL 11.1 on CPU; clMAGMA on GPU and its CPU host

Results: In dense linear algebra, clMAGMA applied to heterogeneous systems of multicore processors with GPU accelerators and coprocessors achieves higher performance than MKL applied to the CPU


Results in Detail (1)

10

1) LU factorization (up to 5.7x speedup vs. the CPU host)

2) Cholesky factorization (up to 5.4x speedup vs. the CPU host)


CPU+GPU with clMAGMA

CPU with MKL11.1

Source of the figures: (1)


11

3) QR factorization (up to 5.9x speedup vs. the CPU host)

4) Hessenberg factorization (up to 5.5x speedup vs. the CPU host)



CPU with MKL11.1

Source of the figures: (1)


12

5) Matrix Inversion (up to 1.2x speedup vs. the CPU host)




CPU with MKL11.1


Used when input matrices have a large number of zero entries [1]

Compressed data structures, keeping only the non-zero entries and their indices, are the norm here [2]

13

Figure [3]

[1], [2]: http://view.eecs.berkeley.edu/wiki/Sparse_Linear_Algebra

[3]: http://www.lanl.gov/Caesar/node223.html
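One common compressed structure of this kind is compressed sparse row (CSR). The sketch below is only an illustration of the idea: it converts a small dense matrix to CSR and computes a sparse matrix-vector product. The helper names `dense_to_csr` and `csr_matvec` are made up for this example and do not belong to any of the libraries discussed in this talk.

```python
def dense_to_csr(dense):
    """Keep only the non-zero entries, their column indices,
    and the position where each row starts (row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A * x, touching only the stored non-zeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y


a = [[4, 0, 0],
     [0, 0, 2],
     [1, 0, 3]]
values, col_idx, row_ptr = dense_to_csr(a)
print(values, col_idx, row_ptr)
print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3]))
```

Rows with different numbers of non-zeros lead to different amounts of work per row, which is one reason sparse kernels are harder to balance on a GPU than dense ones.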


Paper

14

Title Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries

Author Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling

Publication SIAM Journal on Scientific Computing: Vol. 35, No. 5

Link http://arxiv.org/pdf/1212.6326v2.pdf



Overview of the Paper

15

Comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL

One of the performance and usage studies: a nonlinear disordered Hamiltonian lattice, whose implementation is a sparse matrix-vector product

In general, all the experiments, including the nonlinear disordered Hamiltonian lattice, show a 10x to 20x acceleration when running on a GPU compared to the CPU path


Performance Study

16

Hardware − GPUs: AMD Radeon HD 7970/Tahiti & NVIDIA Tesla C2070

− CPU: Intel Core i7 930

Implementation − GPUs: OpenCL implementations from AMD and NVIDIA

− CPU: OpenCL implementations from AMD and Intel

Results − Distinct acceleration is observed when running a GPU path vs. the CPU path

− Significant acceleration requires problem sizes between 10^3 and 10^5 due to considerable overhead at smaller problem sizes

− Overhead of using high-level libraries negligible compared to the effort spent in getting familiar with the details of CUDA or OpenCL
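The effect of a fixed overhead on small problems can be sketched with a very simple timing model. Everything here is an assumption for illustration: the function `modeled_speedup`, the linear cost model, and the rates and overhead used in the loop are hypothetical values chosen to show the shape of the curve, not measurements from the paper.

```python
def modeled_speedup(n, cpu_rate, gpu_rate, gpu_overhead):
    """Speedup of a GPU path over a CPU path under a linear cost model.

    n            : problem size (number of elements)
    cpu_rate     : elements per second on the CPU (hypothetical)
    gpu_rate     : elements per second on the GPU (hypothetical)
    gpu_overhead : fixed cost per run in seconds, e.g. kernel launch
                   and data transfer (hypothetical)
    """
    cpu_time = n / cpu_rate
    gpu_time = gpu_overhead + n / gpu_rate
    return cpu_time / gpu_time


# Hypothetical numbers: the GPU is 20x faster per element,
# but every run pays a fixed overhead of 0.1 ms.
for n in (10**3, 10**5, 10**7):
    print(n, round(modeled_speedup(n, 1e8, 2e9, 1e-4), 2))
```

In this model the GPU path is slower than the CPU for small inputs and only approaches the per-element ratio once the fixed overhead is amortized over a large problem.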



17 Sparse Linear Algebra

Source of the table : (2)

VexCL CPU (Intel)

GPU (AMD)


18

Performance under largest problem size (Hamiltonian lattice):

Platform                 Library            Time (sec)   Achieved throughput, GB/sec (percentage of theoretical peak)
GPU: NVIDIA              Thrust             319.60       120 (81%)
GPU: NVIDIA              CMTL4              370.31       104 (70%)
GPU: NVIDIA              VexCL              401.39       96 (65%)
GPU: NVIDIA              ViennaCL           433.50       89 (60%)
GPU: Tahiti              VexCL              225.41       170 (65%)
GPU: Tahiti              ViennaCL           214.87       179 (68%)
CPU: Intel Core i7 930   Thrust             N/A          N/A
CPU: Intel Core i7 930   VexCL (AMD)        2934.99      13 (51%)
CPU: Intel Core i7 930   VexCL (Intel)      3171.74      12 (47%)
CPU: Intel Core i7 930   ViennaCL (AMD)     2608.80      15 (58%)
CPU: Intel Core i7 930   ViennaCL (Intel)   2580.47      15 (58%)

Source of the table: (2)

Graph Traversal

19 Graph Traversal

http://de.wikipedia.org/wiki/Graph_%28Graphentheorie%29#mediaviewer/File:U-Bahn_Wien.png

Branch Divergence

Multiple threads run on the same wavefront

Threads in a wavefront execute in lockstep, so divergent branches are serialized

Memory Divergence

All threads on one wavefront must complete their memory access before the next step

Some threads must go through multiple adjacency lists to find the correct memory

Load Imbalance

Graphs are by nature unbalanced

Some threads will get a much larger workload than others
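The cost of load imbalance can be estimated with a small model. The sketch assumes that each thread walks the adjacency list of one vertex and that a wavefront of 64 threads is only finished when its slowest thread is done. The function name `wavefront_utilization` and this one-vertex-per-thread mapping are assumptions for illustration, not the scheme of any particular application.

```python
def wavefront_utilization(degrees, wavefront_size=64):
    """Fraction of issued lane-steps that do useful work.

    degrees[i] is the adjacency-list length handled by thread i.
    Each wavefront runs for as many steps as its largest degree,
    so lanes with shorter lists sit idle for the remaining steps.
    """
    useful = 0
    issued = 0
    for start in range(0, len(degrees), wavefront_size):
        group = degrees[start:start + wavefront_size]
        useful += sum(group)
        issued += max(group) * wavefront_size
    return useful / issued if issued else 1.0


balanced = [4] * 64            # every vertex has 4 neighbours
skewed = [64] + [1] * 63       # one hub vertex, 63 low-degree vertices
print(wavefront_utilization(balanced))
print(wavefront_utilization(skewed))
```

With the skewed degree list, almost all lanes wait for the single hub vertex, which is the situation described under Load Imbalance.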

Divergence

20 Graph Traversal

All data was gathered using an AMD Radeon HD7000 and an AMD A8-5500 accelerated processing unit

Pannotia was used as an application suite

Speedup

21 Graph Traversal

Dijkstra and Graph Coloring

22 Graph Traversal

http://de.wikipedia.org/wiki/Datei:GolombGraphProperties.svg

http://de.wikipedia.org/wiki/Dijkstra-Algorithmus#mediaviewer/File:DijkstraStep09.svg

Speedups ranging from 4 to 8

Speedup tends to be better for larger graphs

Strong parallelization
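For reference, the textbook sequential form of Dijkstra's algorithm is sketched below. This is a plain CPU version using Python's `heapq`, given only to recall what is being computed; it is not the GPU implementation that produced the speedups above, and the example graph is made up.

```python
import heapq


def dijkstra(adj, source):
    """Shortest distances from source.

    adj maps a vertex to a list of (neighbour, weight) pairs
    with non-negative weights.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


graph = {
    "a": [("b", 7), ("c", 9), ("f", 14)],
    "b": [("c", 10), ("d", 15)],
    "c": [("d", 11), ("f", 2)],
    "d": [("e", 6)],
    "f": [("e", 9)],
}
print(dijkstra(graph, "a"))
```

The priority queue makes this version inherently sequential; a GPU version has to relax many edges at once, which pays off more as the graph grows.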


23 Graph Traversal


24 Graph Traversal

Source: (4)

Friend Recommendation and Connected Components Labelling

25 Graph Traversal

http://scipy-lectures.github.io/_images/plot_synthetic_data_1.png

Speedups ranging from 1 to 2

Relatively little speedup due to strong imbalance
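A generic way to label connected components is minimum-label propagation, sketched below. This is an illustrative sequential version under the assumption of an undirected edge list; it is not the implementation from the application suite, and the name `connected_components` is chosen for this example.

```python
def connected_components(num_vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    Labels start as the vertex ids. Each sweep pushes the smaller
    label across every edge until nothing changes any more.
    """
    labels = list(range(num_vertices))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low:
                labels[u] = low
                changed = True
            if labels[v] != low:
                labels[v] = low
                changed = True
    return labels


# Two components {0, 1, 2} and {3, 4}, plus the isolated vertex 5.
print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

When one thread handles one vertex, the work per thread follows the vertex degree, so a few high-degree vertices dominate the run time.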

Friend Recommendation and Connected Components Labelling

26 Graph Traversal

Effectiveness depends on the exact problem

Deep understanding of GPU required

Deep understanding of problem required

Summary

27 Graph Traversal

Map Reduce

28 Map Reduce

http://de.wikipedia.org/wiki/Datei:MapReduce2.svg

AMD GPUs have two ways of accessing memory: FastPath and CompletePath

All current GPU MapReduce implementations use global atomic operations

Use of global atomic operations causes AMD GPUs to use the CompletePath

Tests show 32 times slower memory access over the CompletePath

Map Reduce

29 Map Reduce

Software-based Atomic add

30 Map Reduce

A Map Reduce Framework for Heterogeneous Computing Architectures

Master thread quickly becomes bottleneck

Instead group by wavefront

Define first thread as dominant thread

Create 4 global arrays with one element per wavefront: WavefrontsAddresse, WavefrontsSum, WavefrontsPrefixSums, Finished.
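The core idea of grouping by wavefront can be sketched sequentially: the threads of one wavefront combine their increments, only one thread updates the shared counter, and every thread still receives the value it would have seen from an ordinary atomic add. The sketch below is a strong simplification. It does not model concurrency, the coordination between wavefronts, or the four global arrays, and the name `aggregated_atomic_add` is hypothetical.

```python
def aggregated_atomic_add(counter, increments, wavefront_size=64):
    """Simulate one round of wavefront-aggregated atomic adds.

    increments[i] is the amount thread i wants to add.
    Returns (new counter value, value returned to each thread,
    number of updates made to the shared counter).
    """
    returned = []
    global_updates = 0
    for start in range(0, len(increments), wavefront_size):
        wave = increments[start:start + wavefront_size]
        base = counter        # value read by the dominant thread
        running = 0
        for inc in wave:      # exclusive prefix sum inside the wavefront
            returned.append(base + running)
            running += inc
        counter += running    # single update for the whole wavefront
        global_updates += 1
    return counter, returned, global_updates


# Four threads in wavefronts of two: two shared updates instead of four.
print(aggregated_atomic_add(100, [1, 2, 3, 4], wavefront_size=2))
```

Each thread gets the same result as with one atomic add per thread, while the number of updates to the shared counter drops from one per thread to one per wavefront.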

Map Reduce

31 Map Reduce

Map Reduce

32

Step 1

Threads load address and sums

Sync

Map Reduce

33

Step 2 (flowchart)

Is only wavefront on address? (true/false branch)

WFprefixSum = address

WFincrement = localSum

Local atomic add to generate prefixSum and increment

Sync

Update dominant and set local increment to 0

Map Reduce

34 Map Reduce

Step 3 (flowchart)

Sync

If requesting wavefront (true/false branch)

Set addresses = 0

If dominant (true/false branch): update global variable

Reset local data

Evaluation

35 MapReduce

Hardware

− GPU: ATI Radeon HD 5870 (Cypress)

− CPU: 2x Intel Xeon E5405

Key Performance measures

Total execution time in nanoseconds

Ratio of FastPath to CompletePath memory transactions

Experiment Micro Benchmarks

1) without memory transactions (up to 1.9x vs. system atomic operation)

36

2) with memory transactions (up to 3x vs. system atomic operation)

MapReduce


Experiment MapReduce: Test Applications

37 MapReduce

Matrix Multiplication (MM)

Matrices X and Y as input

Outputs matrix Z

Implementation: only the map phase

Each map task is responsible for calculating one element of matrix Z

String Match (SM)

Searches for an input keyword

Outputs all matching locations

Implementation: only the map phase

Each map task reads a chunk of the input document and outputs the found locations

KMeans (KM)

Iterative clustering algorithm

Each iteration assigns each input point to the closest cluster and recalculates the clusters

Implementation: both the map and reduce phase

Map function assigns points and reduce function recalculates clusters
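The split of KMeans into a map and a reduce function can be sketched as follows. This is a plain sequential illustration of one iteration, assuming points and centroids are tuples of numbers; the function names are invented for this example and do not come from the framework that was evaluated.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: emit (index of the closest centroid, point)."""
    closest = min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )
    return closest, point


def kmeans_reduce(points):
    """Reduce: the new centroid is the mean of the assigned points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key, reduce every group."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid without assigned points is kept unchanged.
    return [
        kmeans_reduce(groups[c]) if groups[c] else centroids[c]
        for c in range(len(centroids))
    ]


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans_iteration(points, [(0, 0), (10, 10)]))
```

Matrix Multiplication and String Match need only the map step, because every map task writes its result directly and nothing has to be combined per key.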

Experiment MapReduce: Result for Matrix Multiplication

38

MapReduce

The speedup of using software-based atomic add over the system one increases as the input matrices get larger (up to 13.55-fold)

Ratio of FastPath to CompletePath memory accesses: 30:0 for software-based atomic and 3:28 for system-provided atomic implementations


Experiment MapReduce: Result for String Match

39

MapReduce

The software atomic approach helps to improve the memory read performance.

In the case of a large number of matches, the overhead incurred by the software atomic approach for writing results offsets the benefit of using FastPath for read accesses.

Ratio of FastPath to CompletePath memory accesses: 12:0 for software-based atomic and 1:19 for system-provided atomic implementations


Experiment MapReduce: Result KMeans

40

MapReduce

The speedup of using software-based atomic add over the system one increases with the number of points (up to 67.3-fold)


Conclusion AMD GPU

41

MapReduce

Significant speedup has been observed

Readily available in most computers

Requires a deep understanding of the architecture and the programming language

Compared to NVIDIA, a more complicated implementation is needed to enhance the efficiency


References

1) Chongxiao Cao, Jack Dongarra, Peng Du, Mark Gates, Piotr Luszczek and Stanimire Tomov (2013): clMAGMA: High Performance Dense Linear Algebra with OpenCL. International Workshop on OpenCL 2013.

2) Denis Demidov, Karsten Ahnert, Karl Rupp and Peter Gottschling (2013): Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM Journal on Scientific Computing, Vol. 35, No. 5.

3) Marwa K. Elteir (2012): A MapReduce Framework for Heterogeneous Computing Architectures. Dissertation, Virginia Polytechnic Institute and State University.

4) Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt and Kevin Skadron (2013): Pannotia: Understanding Irregular GPGPU Graph Applications. Proceedings of 2013 IEEE International Symposium on Workload Characterization (IISWC), Sept 2013.

42

Work Distribution

43

Ying Jasper

Architecture p.3-5

Graph Traversal p.19-27

Dense Linear Algebra p.6-12

Sparse Linear Algebra p.13-18

MapReduce p.28-34 p.35-40

Conclusion p.41
