
Page 1: Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs

Munich, Oct. 12 2017

Best GPU Code Practices Combining OpenACC, CUDA, and OmpSs

Pau Farré

Antonio J. Peña

www.bsc.es

Page 2

PROLOGUE

Page 3

3

Barcelona Supercomputing Center

MareNostrum 4

• 13.7 PetaFlop/s

• General Purpose Computing: 3,400 nodes of Xeon, 11 PF/s

• Emerging Technologies:
– POWER9 + Pascal, 1.5 PF/s
– Knights Landing and Knights Hill, 0.5 PF/s
– 64-bit ARMv8, 0.5 PF/s

Page 4

4

Mission of BSC Scientific Departments

EARTH SCIENCES: To develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications.

LIFE SCIENCES: To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics).

CASE: To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations).

COMPUTER SCIENCES: To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency.

Page 5

5

BSC Training on European Level - PATC

PRACE Advanced Training Centers

PRACE designated six Advanced Training Centers:

• Barcelona Supercomputing Center (Spain)

• CINECA Consorzio Interuniversitario (Italy)

• CSC - IT Center for Science Ltd (Finland)

• EPCC at the University of Edinburgh (UK)

• Gauss Centre for Supercomputing (Germany)

• Maison de la Simulation (France)

Mission of PATCs: to carry out and coordinate training and education activities that foster the efficient usage of the infrastructure available through PRACE.

Page 6

6

BSC & The Global IT Industry 2016

• BSC-Microsoft Research Centre

• IBM-BSC Deep Learning Center

• NVIDIA GPU Center of Excellence

• Intel-BSC Exascale Lab

Page 7

7

Projects with the Energy Industry

Research into advanced technologies for the exploration of hydrocarbons, subterranean and subsea reserve modelling, and fluid flows.

• Repsol-BSC Research Center

• Iberdrola Renovables

Page 8

NVIDIA Award to BSC/UPC (since 2011)

R&D around GPU Computing (currently ~10 core collaborators)
– Architecture, Programming Models, Libraries, Applications, Porting

Education, Training, Dissemination (free registration)
– PUMPS Summer School – mainly advanced CUDA

– PRACE Adv. Training Center courses on Introduction to CUDA & OpenACC

– Severo Ochoa Seminars on Deep Learning & Image/Video Processing

– Always open to research collaborations, internships, advising, hiring

BSC/UPC NVIDIA GPU Center of Excellence

8

[email protected]

Page 9

Pau Farré, Jr. Engineer
– GCoE Core Team
– GPU porting and optimization specialist
– Did most of the hard work for this lab

Antonio J. Peña, Sr. Researcher
– Manager of the GCoE
– Juan de la Cierva Fellow – Prospective Marie Curie Fellow
– Activity Leader, “Accelerators and Communications for HPC”
– The one to blame if anything goes wrong

Introductions

9

[email protected]

[email protected]

Page 10

CUDA (Compute Unified Device Architecture)
– Runtime & Driver APIs (high-level / low-level)
– Specific to NVIDIA GPUs: best performance & control

OpenACC (Open Accelerators)
– Open standard
– Higher-level, pragma-based
– Aiming at portability across heterogeneous hardware
– For NVIDIA GPUs, implemented on top of CUDA

OpenCL (Open Computing Language)
– Open standard
– Low-level, similar to the CUDA Driver API
– Multi-target, portable*

(Intentionally leaving out weird stuff like Cg, OpenGL, …)

Introduction: Programming Models for GPU Computing

10
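To make the difference in abstraction level concrete, here is a minimal sketch (not from the original slides) of the same SAXPY computation written once with an OpenACC directive and once as a CUDA kernel; the function names saxpy_acc and saxpy_cuda are illustrative only.

    // OpenACC: annotate the existing loop; the compiler generates the GPU code
    void saxpy_acc(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // CUDA: write the kernel and choose the launch configuration explicitly
    __global__ void saxpy_cuda(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    // host side (device buffers d_x/d_y already allocated and copied):
    //   saxpy_cuda<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);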

Page 11

Motivation: Coding Productivity & Performance

11

CUDA

OpenACC

OpenACC + CUDA

OmpSs + CUDA

OmpSs + OpenACC
– High-level, task-based, pragma-based, developed @ BSC
– Targets accelerators, combined with CUDA or (recently) OpenACC

Coding Productivity / Performance

Don’t get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret colors as relative to each other.

OpenACC may well deliver more than the performance you *need*. However, it gives the lowest control over performance w.r.t. the discussed alternatives.

Page 12

HANDS-ON

Page 13

LAB CONNECTION INSTRUCTIONS - Part 1

Go to nvlabs.qwiklab.com

Sign in or create an account

Check for Access Codes (each day):

- Click My Account

- Click Credits & Subscriptions

If there are no Access Codes, ask for a paper one from a TA. Please tear it in half once used.

An Access Code is needed to start the lab.

WIFI SSID: GTC_Hands_On  Password: HandsOnGpu

Page 14

LAB CONNECTION INSTRUCTIONS - Part 2

1. Click Qwiklabs in upper-left

2. Select GTC2017 Class

3. Find lab and click on it

4. Click on Select

5. Click Start Lab

WIFI SSID: GTC_Hands_On
Password: HandsOnGpu

Page 15

1. Identify Parallelism

○ Using a CPU profiling tool (example: nvprof --cpu-profiling on)

2. Express Parallelism

○ Declare parallel regions with directives

3. Express Data Locality

○ Help OpenACC figure out how to manage data (see the sketch after this list)

4. Optimize

○ Using nvprof & the NVIDIA Visual Profiler

Steps to Parallelize with OpenACC

15
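As a minimal sketch of steps 2 and 3 (not part of the FWI lab code; the routine compute_step and the stencil body are illustrative placeholders), the directives below declare a parallel region and make the data movement explicit.

    // Step 2: express parallelism; Step 3: express data locality
    void compute_step(int n, const float *restrict in, float *restrict out)
    {
        #pragma acc data copyin(in[0:n]) copyout(out[0:n])   // keep both arrays on the GPU
        {
            #pragma acc parallel loop                        // offload the loop
            for (int i = 1; i < n - 1; ++i)
                out[i] = 0.5f * (in[i - 1] + in[i + 1]);     // simple averaging stencil
        }
    }
    // Steps 1 and 4: profile with "nvprof --cpu-profiling on ./app" before porting
    // and with nvprof / the NVIDIA Visual Profiler afterwards.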

Page 16

● Analyzes the physical properties of the subsoil from seismic measurements

● Elastic wave propagator + linearly elastic stress-strain relationships
– Six different stress components
– Finite differences (FD) method with a Fully Staggered Grid (FSG)

FWI – A Full Waveform Inversion Oil & Gas (mini-)application

16

Base code developed by the BSC Repsol Team

Page 17

FWI Parallelization – OpenACC/CUDA #6: Results

17

• Our optimized CUDA kernels perform better than the OpenACC versions

[Bar chart: FWI speedups, baseline OpenMP, measured on Xeon Platinum 8160 (23c), Tesla K40 (Kepler), Titan X (Maxwell), and Tesla P100 (Pascal); the best version reaches about 19.3x on the Tesla P100]

Page 18

OmpSs + CUDA / OpenACC

Page 19

Sequential control flow
– Defines a single address space
– Executes sequential code that
• Can spawn/instantiate tasks that will be executed sometime in the future
• Can stall/wait for tasks

Tasks annotated with directionality clauses
– in, out, inout
– Used
• To build dependences among tasks
• For main to wait for data to be produced
– Basis for memory-management functionalities (replication, locality, movement, …)
• Copy clauses

Sequential equivalence (~)

OmpSs Main Program

19
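Before the Cholesky walk-through on the following pages, here is a minimal sketch (not from the slides) of how directionality clauses order two tasks; the functions produce/consume and the array size are illustrative, and the code assumes the OmpSs Mercurium/Nanos++ toolchain.

    #include <stdio.h>
    #define N 1024

    #pragma omp task out([N]v)
    void produce(float *v)              // writes v
    {
        for (int i = 0; i < N; ++i) v[i] = i;
    }

    #pragma omp task in([N]v)
    void consume(float *v)              // reads v: the runtime orders it after produce()
    {
        printf("v[0] = %f, v[N-1] = %f\n", v[0], v[N-1]);
    }

    int main(void)
    {
        float v[N];
        produce(v);                     // spawned as a task, runs asynchronously
        consume(v);                     // dependence on v enforces produce -> consume
        #pragma omp taskwait            // main stalls until both tasks have finished
        return 0;
    }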

Page 20

OmpSs: A Sequential Program…

20

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k = 0; k < NT; k++) {
      spotrf (A[k][k]);
      for (i = k+1; i < NT; i++) {
         strsm (A[k][k], A[k][i]);
      }
      for (i = k+1; i < NT; i++) {
         for (j = k+1; j < i; j++) {
            sgemm (A[k][i], A[k][j], A[j][i]);
         }
         ssyrk (A[k][i], A[i][i]);
      }
   }
}

[Figure: the matrix is stored as NT × NT blocks of TS × TS elements]

Page 21

OmpSs: … with Directionality Annotations …

21

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k = 0; k < NT; k++) {
      #pragma omp task inout (A[k][k])
      spotrf (A[k][k]);
      for (i = k+1; i < NT; i++) {
         #pragma omp task in (A[k][k]) inout (A[k][i])
         strsm (A[k][k], A[k][i]);
      }
      for (i = k+1; i < NT; i++) {
         for (j = k+1; j < i; j++) {
            #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
            sgemm (A[k][i], A[k][j], A[j][i]);
         }
         #pragma omp task in (A[k][i]) inout (A[i][i])
         ssyrk (A[k][i], A[i][i]);
      }
   }
}

[Figure: the matrix is stored as NT × NT blocks of TS × TS elements]

Page 22

OmpSs: … that Happens to Execute in Parallel

22

(Same annotated code as on Page 21.)

Decouple how we write/think (sequential) from how it is executed

Page 23

OmpSs + CUDA – Example: AXPY Algorithm

23

Starting point (host-only, device(smp) task):

main.c:

    #include <kernel.h>

    int main(int argc, char *argv[])
    {
        float a = 5, x[N], y[N];

        // Initialize values
        for (int i = 0; i < N; ++i)
            x[i] = y[i] = i;

        // Compute the saxpy algorithm (1 task)
        saxpy(N, a, x, y);
        #pragma omp taskwait

        // Check results
        for (int i = 0; i < N; ++i) {
            if (y[i] != a*i + i) perror("Error\n");
        }
        message("Results are correct\n");
    }

kernel.h:

    #pragma omp target device(smp) copy_deps
    #pragma omp task in([n]x) inout([n]y)
    void saxpy(int n, float a, float* x, float* y);

kernel.c:

    void saxpy(int n, float a, float *X, float *Y)
    {
        for (int i = 0; i < n; ++i)
            Y[i] = X[i] * a + Y[i];
    }

Porting to the GPU:

1. Port the kernel to CUDA (kernel.cu):

    __global__ void saxpy(int n, float a, float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

2. Annotate the declaration with device(cuda) (kernel.cuh):

    #pragma omp target device(cuda) copy_deps ndrange(1,n,128)
    #pragma omp task in([n]x) inout([n]y)
    __global__ void saxpy(int n, float a, float* x, float* y);

3. Keep the device(smp) version as the CPU implementation.

The ndrange(1,n,128) clause tells the runtime to launch a 1-dimensional grid with n work items in blocks of 128.
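For comparison, here is a rough sketch (not from the slides) of the host-side CUDA code that the copy_deps and ndrange(1,n,128) clauses spare you from writing by hand; the actual OmpSs runtime additionally uses streams and overlaps transfers, and error checking is omitted here.

    // Hand-written equivalent of what the device(cuda) task annotations request:
    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);  // in([n]x)
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);  // inout([n]y), copy in
    saxpy<<<(N + 127) / 128, 128>>>(N, a, d_x, d_y);                // ndrange(1, n, 128)
    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);  // inout([n]y), copy out
    cudaFree(d_x);
    cudaFree(d_y);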

Page 24

Taskify your whole application in a data-flow manner
– Process kernels are just a type of task executed on the GPU

The OmpSs runtime automatically manages streams & memory transfers

OpenACC directives are used to generate the GPU kernels, which OmpSs treats as CUDA tasks

Greatest coding productivity for accelerators!
– But OpenACC kernels may perform worse than fine-tuned CUDA

OmpSs + OpenACC: General Idea

24

Page 25

OmpSs + OpenACC: Syntax

25

#pragma omp target device(openacc)
#pragma omp task in(rho, sxptr, syptr, szptr) inout(vptr)
#pragma acc parallel loop deviceptr(rho, sxptr, syptr, szptr, vptr)
for (int y = ny0; y < nyf; y++) {
    for (int x = nx0; x < nxf; x++) {
        for (int z = nz0; z < nzf; z++) {
            …code…
        }
    }
}

Not released yet
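In the snippet above, deviceptr(...) tells the OpenACC compiler that the pointers are already device addresses, because the OmpSs runtime allocates and transfers the data according to the task's in/inout clauses. For contrast, a minimal sketch (not from the slides) of the same loop nest as a standalone OpenACC region, where OpenACC itself manages the transfers; the array extent nelems is an illustrative placeholder.

    // Standalone OpenACC (no OmpSs task): data movement expressed with copy clauses
    #pragma acc parallel loop collapse(3) \
            copyin(rho[0:nelems], sxptr[0:nelems], syptr[0:nelems], szptr[0:nelems]) \
            copy(vptr[0:nelems])
    for (int y = ny0; y < nyf; y++) {
        for (int x = nx0; x < nxf; x++) {
            for (int z = nz0; z < nzf; z++) {
                /* ...code... */
            }
        }
    }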

Page 26

FWI Parallelization – OmpSs/OpenACC - Results

26

• OmpSs/OpenACC performance is similar to plain OpenACC

[Bar chart: FWI speedups, baseline OpenMP, measured on Xeon Platinum 8160 (23c), Tesla K40 (Kepler), Titan X (Maxwell), and Tesla P100 (Pascal); the OmpSs/OpenACC version reaches roughly 3.3x on the K40, 6.6x on the Titan X, and 10.6x on the P100]

Page 27

Your Turn!

27

• Open http://github.com/Hopobcn/FWI

• Follow step-by-step instructions @ GTC2017eu.md

Page 28

Thank you!

For further information please contact

[email protected]

www.bsc.es