
Page 1:

Introduction to OpenACC

John Urbanic
Parallel Computing Scientist

Pittsburgh Supercomputing Center

Copyright 2019

Page 2:

What is OpenACC?

It is a directive-based standard to allow developers to take advantage of accelerators such as GPUs from NVIDIA and AMD, Intel's Xeon Phi, FPGAs, and even DSP chips.

Page 3:

Directives

Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

(Figure: your original Fortran or C code runs on the CPU; the !$acc kernels region is the OpenACC compiler hint that sends the loop nest to the GPU.)

Simple compiler hints from coder.

Compiler generates parallel threaded code.

Ignorant compiler just sees some comments.

Page 4:

Familiar to OpenMP Programmers

OpenMP (CPU):

main() {
   double pi = 0.0; long i;

   #pragma omp parallel for reduction(+:pi)
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }

   printf("pi = %f\n", pi/N);
}

OpenACC (CPU + GPU):

main() {
   double pi = 0.0; long i;

   #pragma acc kernels
   for (i=0; i<N; i++)
   {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
   }

   printf("pi = %f\n", pi/N);
}

More on this later!

Page 5:

How Else Would We Accelerate Applications?

Three ways to accelerate applications:

Libraries: "drop-in" acceleration.

Programming languages (CUDA): maximum flexibility.

OpenACC directives: incrementally accelerate applications.

Page 6:

Key Advantages Of This Approach

High-level. No involvement of OpenCL, CUDA, etc.

Single source. No forking off a separate GPU code. Compile the same program for accelerators or serial; non-GPU programmers can play along.

Efficient. Experience shows very favorable comparison to low-level implementations of the same algorithms.

Performance portable. Supports GPU accelerators and co-processors from multiple vendors, current and future versions.

Incremental. Developers can port and tune parts of their application as resources and profiling dictate. No wholesale rewrite required. And it can be quick.

Page 7:

True Standard

The full OpenACC 1.0, 2.0, and now 2.7 specifications are available online:

http://www.openacc-standard.org

A quick reference card is also available and useful.

Implementations are available now from PGI, Cray, CAPS, and GCC.

GCC support for OpenACC started in 5.x, but use 9.1 or, soon, 10.x.

The best free option is very probably the PGI Community Edition:

http://www.pgroup.com/products/community.htm

Page 8:

Resources: https://www.openacc.org/resources
Success stories: https://www.openacc.org/success-stories
Events: https://www.openacc.org/events
Compilers and tools (including free compilers): https://www.openacc.org/tools

OpenACC resources: Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

Page 9:

Serious Adoption

Growing community:
▪ 6,000+ enabled developers
▪ Hackathons constantly
▪ Diverse online community

Porting success:
▪ Five of 13 CAAR codes using OpenACC
▪ Gaussian ported to Tesla with OpenACC
▪ FLUENT using OpenACC in the R18 production release

New platforms:
▪ Sunway TaihuLight, #1 on the Top 500 in June 2016

Page 10:

A Few Cases

Reading DNA nucleotide sequences (Shanghai JiaoTong University)
Designing circuits for quantum computing (UIST, Macedonia)
Extracting image features in real-time (Aselsan)
HydroC, galaxy formation (PRACE benchmark code, CAPS)
Real-time derivative valuation (Opel Blue, Ltd)
Matrix-matrix multiply (independent research scientist)

Reported efforts and speedups across these cases: 3 directives for 4.1x faster, 4 directives for 6.4x faster, 4 directives for 16x faster, a few hours for 70x faster, 1 week for 40x faster, and 1 week for 3x faster.

Page 11:

A Champion Case

S3D: Fuel Combustion

Design alternative fuels with up to 50% higher efficiency.

Titan: 10 days vs. Jaguar: 42 days. 4x faster, with less than 1% of the lines of code modified.

15 PF! One of the fastest simulations ever!

Page 12:

A Simple Example: SAXPY

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
   real :: x(:), y(:), a
   integer :: n, i
!$acc kernels
   do i=1,n
      y(i) = a*x(i)+y(i)
   enddo
!$acc end kernels
end subroutine saxpy

...
! From main program
! call SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

...
// Somewhere in main
// call SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

Page 13:

kernels: Our first OpenACC Directive

We request that each loop execute as a separate kernel on the GPU. This is an incredibly powerful directive. (Kernel: a parallel routine to run on the GPU.)

!$acc kernels
   do i=1,n                 ! kernel 1
      a(i) = 0.0
      b(i) = 1.0
      c(i) = 2.0
   end do

   do i=1,n                 ! kernel 2
      a(i) = b(i) + c(i)
   end do
!$acc end kernels
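The same idea in C, as a minimal sketch; the function wrapper, array names, and float types are assumptions added here to make the fragment self-contained:

/* Two independent loops inside one kernels region; the compiler is free
   to generate a separate GPU kernel for each loop. */
void init_and_add(int n, float *restrict a, float *restrict b, float *restrict c)
{
    #pragma acc kernels
    {
        for (int i = 0; i < n; i++) {   /* kernel 1 */
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }
        for (int i = 0; i < n; i++) {   /* kernel 2 */
            a[i] = b[i] + c[i];
        }
    }
}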

Page 14:

General Directive Syntax and Scope

Fortran:

!$acc kernels [clause ...]
   structured block
!$acc end kernels

C:

#pragma acc kernels [clause ...]
{
   structured block
}

I may indent the directives at the natural code indentation level for readability. It is a common practice to always start them in the first column (a la #define/#ifdef). Either is fine with C or Fortran 90 compilers.

Page 15:

Complete SAXPY Example Code

#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)   // "I promise y is not aliased by anything else (esp. x)"
{
#pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
}

int main(int argc, char **argv)
{
   int N = 1<<20;   // 1 million floats

   if (argc > 1)
      N = atoi(argv[1]);

   float *x = (float*)malloc(N * sizeof(float));
   float *y = (float*)malloc(N * sizeof(float));

   for (int i = 0; i < N; ++i) {
      x[i] = 2.0f;
      y[i] = 1.0f;
   }

   saxpy(N, 3.0f, x, y);

   return 0;
}

Page 16:

C Detail: the restrict keyword

Standard C (as of C99).

Important for optimization of serial as well as OpenACC and OpenMP code.

A promise given by the programmer to the compiler for a pointer:

float *restrict ptr

Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points."

Limits the effects of pointer aliasing.

OpenACC compilers often require restrict to determine independence; otherwise the compiler can't parallelize loops that access ptr.

Note: if the programmer violates the declaration, behavior is undefined.
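A minimal sketch of why this matters; the function and array names are illustrative, not from the slides. Without restrict, the compiler must assume out and in might overlap, so a store to out[i] could change a later in[i+1] and the loop carries a dependence; with restrict, the iterations are provably independent:

// No restrict: possible aliasing forces conservative, serial code.
// (in must hold at least n+1 elements.)
void shift_add(int n, float *out, const float *in)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] + in[i+1];
}

// With restrict: the programmer promises no overlap, so the loop
// is safe to parallelize.
void shift_add_restrict(int n, float *restrict out, const float *restrict in)
{
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        out[i] = in[i] + in[i+1];
}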

Page 17:

Compile and Run

C:       pgcc  -acc -Minfo=accel saxpy.c
Fortran: pgf90 -acc -Minfo=accel saxpy.f90

Compiler output:

pgcc -acc -Minfo=accel saxpy.c
saxpy:
      8, Generating copyin(x[:n-1])
         Generating copy(y[:n-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

Run: a.out

Page 18:

Compare: Partial CUDA C SAXPY Code (just the subroutine)

__global__ void saxpy_kernel( float a, float* x, float* y, int n ){
    int i;
    i = blockIdx.x*blockDim.x + threadIdx.x;
    if( i < n ) x[i] = a*x[i] + y[i];
}

void saxpy( float a, float* x, float* y, int n ){
    float *xd, *yd;

    cudaMalloc( (void**)&xd, n*sizeof(float) );
    cudaMalloc( (void**)&yd, n*sizeof(float) );
    cudaMemcpy( xd, x, n*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( yd, y, n*sizeof(float), cudaMemcpyHostToDevice );

    saxpy_kernel<<< (n+31)/32, 32 >>>( a, xd, yd, n );

    cudaMemcpy( x, xd, n*sizeof(float), cudaMemcpyDeviceToHost );

    cudaFree( xd ); cudaFree( yd );
}

Page 19:

Compare: Partial CUDA Fortran SAXPY Code (just the subroutine)

module kmod
  use cudafor
contains
  attributes(global) subroutine saxpy_kernel(A,X,Y,N)
    real(4), device :: A, X(N), Y(N)
    integer, value  :: N
    integer         :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if( i <= N ) X(i) = A*X(i) + Y(i)
  end subroutine
end module

subroutine saxpy( A, X, Y, N )
  use kmod
  real(4) :: A, X(N), Y(N)
  integer :: N
  real(4), device, allocatable, dimension(:) :: Xd, Yd

  allocate( Xd(N), Yd(N) )
  Xd = X(1:N)
  Yd = Y(1:N)
  call saxpy_kernel<<<(N+31)/32,32>>>(A, Xd, Yd, N)
  X(1:N) = Xd
  deallocate( Xd, Yd )
end subroutine

Page 20:

Again: Complete SAXPY Example Code

// Entire subroutine
#include <stdlib.h>

void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
   for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
}

// Main code
int main(int argc, char **argv)
{
   int N = 1<<20;   // 1 million floats

   if (argc > 1)
      N = atoi(argv[1]);

   float *x = (float*)malloc(N * sizeof(float));
   float *y = (float*)malloc(N * sizeof(float));

   for (int i = 0; i < N; ++i) {
      x[i] = 2.0f;
      y[i] = 1.0f;
   }

   saxpy(N, 3.0f, x, y);

   return 0;
}

Page 21:

Big Difference!

With CUDA, we changed the structure of the old code. Non-CUDA programmers can't understand the new code. It is not even ANSI-standard code.

We have separate sections for the host code and the GPU code, and a different flow of code. The serial path is now gone forever.

Where did these "32"s and other mystery numbers come from? This is a clue that we have some hardware details to deal with here.

This is exactly the situation assembly used to be in. How much hand-assembled code is still being written in HPC now that compilers have gotten so efficient?

Page 22:

This looks easy! Too easy…

If it is this simple, why don't we just throw kernels in front of every loop? Better yet, why doesn't the compiler do this for me?

The answer is that there are two general issues that prevent the compiler from being able to just automatically parallelize every loop:

Data dependencies in loops

Data movement

The compiler needs your higher-level perspective (in the form of directive hints) to get correct results and reasonable performance.

Page 23:

Data Dependencies

Most directive-based parallelization consists of splitting up big do/for loops into independent chunks that the many processors can work on simultaneously.

Take, for example, a simple for loop like this:

for(index=0; index<1000000; index++)
    Array[index] = 4 * Array[index];

When run on 1000 processors, it will execute something like this...

Page 24:

Processor 0:
for(index=0; index<=999; index++)
    Array[index] = 4*Array[index];

Processor 1:
for(index=1000; index<=1999; index++)
    Array[index] = 4*Array[index];

Processor 2:
for(index=2000; index<=2999; index++)
    Array[index] = 4*Array[index];

Processor 3:
for(index=3000; index<=3999; index++)
    Array[index] = 4*Array[index];

Processor 4:
for(index=4000; index<=4999; index++)
    Array[index] = 4*Array[index];

...

No data dependency: each chunk can run independently.

Page 25:

Data Dependency

But what if the loop iterations are not entirely independent?

Take, for example, a similar loop like this:

for(index=1; index<1000000; index++)
    Array[index] = 4 * Array[index] - Array[index-1];

This is perfectly valid serial code.

Page 26:

Data Dependency

Now Processor 1, in trying to calculate its first iteration,

for(index=1000; index<=1999; index++)
    Array[1000] = 4 * Array[1000] - Array[999];

needs the result of Processor 0's last iteration. If we want the correct ("same as serial") result, we need to wait until Processor 0 finishes. Likewise for Processors 2, 3, ...

Page 27:

Data Dependencies

That is a data dependency. If the compiler even suspects that there is a data dependency, it will, for the sake of correctness, refuse to parallelize that loop with kernels:

11, Loop carried dependence of 'Array' prevents parallelization
    Loop carried backward dependence of 'Array' prevents vectorization

As large, complex loops are quite common in HPC, especially around the most important parts of your code, the compiler will often balk exactly where you most need a kernel to be generated. What can you do?

Page 28:

Data Dependencies

Rearrange your code to make it more obvious to the compiler that there is not really a data dependency.

Eliminate a real dependency by changing your code. There is a common bag of tricks developed for this, as this issue goes back 40 years in HPC. Many are quite trivial to apply. The compilers have gradually been learning these themselves.

Override the compiler's judgment (with the independent clause) at the risk of invalid results. Misuse of restrict has similar consequences. A sketch of the override follows below.
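A minimal sketch of the independent clause, assuming we know from the algorithm (not from the syntax) that the index array never repeats a value; the function and the idx array are hypothetical, not from the exercises:

// The compiler can't prove idx[] has no repeated entries, so it suspects
// a dependence between iterations. loop independent asserts there is none;
// if that assertion is wrong, the results are silently invalid.
void scale_gather(int n, const int *restrict idx,
                  float *restrict out, const float *restrict in)
{
    #pragma acc kernels loop independent
    for (int i = 0; i < n; i++)
        out[idx[i]] = 2.0f * in[i];
}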

Page 29:

Our Foundation Exercise: Laplace Solver

I've been using this for MPI, OpenMP, and now OpenACC. It is a great simulation problem, not rigged for OpenACC.

In this most basic form, it solves the Laplace equation: ∇²f(x,y) = 0

The Laplace equation applies to many physical problems, including:

Electrostatics
Fluid flow
Temperature

For temperature, it is the steady-state heat equation.

(Figure: a metal plate with a heating element, shown at the initial conditions and at the final steady state.)

Page 30:

Exercise Foundation: Jacobi Iteration

The Laplace equation on a grid states that each grid point is the average of its neighbors.

We can iteratively converge to that state by repeatedly computing new values at each point from the average of neighboring points.

We just keep doing this until the difference from one pass to the next is small enough for us to tolerate.

(Stencil: A(i,j) and its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1).)

A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

Page 31:

Serial Code Implementation

C:

for(i = 1; i <= ROWS; i++) {
    for(j = 1; j <= COLUMNS; j++) {
        Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                    Temperature_last[i][j+1] + Temperature_last[i][j-1]);
    }
}

Fortran:

do j=1,columns
   do i=1,rows
      temperature(i,j) = 0.25 * (temperature_last(i+1,j)+temperature_last(i-1,j) + &
                                 temperature_last(i,j+1)+temperature_last(i,j-1) )
   enddo
enddo

Page 32:

Serial C Code (kernel)

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    // calculate new temperatures
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0;

    // update temp array and find max change
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    // output
    if((iteration % 100) == 0) {
        track_progress(iteration);
    }

    // done?
    iteration++;
}

Page 33:

Serial C Code Subroutines

void initialize(){

    int i,j;

    for(i = 0; i <= ROWS+1; i++){
        for (j = 0; j <= COLUMNS+1; j++){
            Temperature_last[i][j] = 0.0;
        }
    }

    // these boundary conditions never change throughout run

    // set left side to 0 and right to a linear increase
    for(i = 0; i <= ROWS+1; i++) {
        Temperature_last[i][0] = 0.0;
        Temperature_last[i][COLUMNS+1] = (100.0/ROWS)*i;
    }

    // set top to 0 and bottom to linear increase
    for(j = 0; j <= COLUMNS+1; j++) {
        Temperature_last[0][j] = 0.0;
        Temperature_last[ROWS+1][j] = (100.0/COLUMNS)*j;
    }
}

void track_progress(int iteration) {

    int i;

    printf("-- Iteration: %d --\n", iteration);
    for(i = ROWS-5; i <= ROWS; i++) {
        printf("[%d,%d]: %5.2f  ", i, i, Temperature[i][i]);
    }
    printf("\n");
}

The BCs could run from 0 to ROWS+1 or from 1 to ROWS. We chose the former.

Page 34:

Whole C Code

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

// size of plate
#define COLUMNS 1000
#define ROWS    1000

// largest permitted change in temp (This value takes about 3400 steps)
#define MAX_TEMP_ERROR 0.01

double Temperature[ROWS+2][COLUMNS+2];      // temperature grid
double Temperature_last[ROWS+2][COLUMNS+2]; // temperature grid from last iteration

// helper routines
void initialize();
void track_progress(int iter);

int main(int argc, char *argv[]) {

    int i, j;                                           // grid indexes
    int max_iterations;                                 // number of iterations
    int iteration=1;                                    // current iteration
    double dt=100;                                      // largest change in t
    struct timeval start_time, stop_time, elapsed_time; // timers

    printf("Maximum iterations [100-4000]?\n");
    scanf("%d", &max_iterations);

    gettimeofday(&start_time,NULL); // Unix timer

    initialize();                   // initialize Temp_last including boundary conditions

    // do until error is minimal or until max steps
    while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

        // main calculation: average my four neighbors
        for(i = 1; i <= ROWS; i++) {
            for(j = 1; j <= COLUMNS; j++) {
                Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                            Temperature_last[i][j+1] + Temperature_last[i][j-1]);
            }
        }

        dt = 0.0; // reset largest temperature change

        // copy grid to old grid for next iteration and find latest dt
        for(i = 1; i <= ROWS; i++){
            for(j = 1; j <= COLUMNS; j++){
                dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
                Temperature_last[i][j] = Temperature[i][j];
            }
        }

        // periodically print test values
        if((iteration % 100) == 0) {
            track_progress(iteration);
        }

        iteration++;
    }

    gettimeofday(&stop_time,NULL);
    timersub(&stop_time, &start_time, &elapsed_time); // Unix time subtract routine

    printf("\nMax error at iteration %d was %f\n", iteration-1, dt);
    printf("Total time was %f seconds.\n", elapsed_time.tv_sec+elapsed_time.tv_usec/1000000.0);
}

// initialize plate and boundary conditions
// Temp_last is used to start the first iteration
void initialize(){

    int i,j;

    for(i = 0; i <= ROWS+1; i++){
        for (j = 0; j <= COLUMNS+1; j++){
            Temperature_last[i][j] = 0.0;
        }
    }

    // these boundary conditions never change throughout run

    // set left side to 0 and right to a linear increase
    for(i = 0; i <= ROWS+1; i++) {
        Temperature_last[i][0] = 0.0;
        Temperature_last[i][COLUMNS+1] = (100.0/ROWS)*i;
    }

    // set top to 0 and bottom to linear increase
    for(j = 0; j <= COLUMNS+1; j++) {
        Temperature_last[0][j] = 0.0;
        Temperature_last[ROWS+1][j] = (100.0/COLUMNS)*j;
    }
}

// print diagonal in bottom right corner where most action is
void track_progress(int iteration) {

    int i;

    printf("---------- Iteration number: %d ------------\n", iteration);
    for(i = ROWS-5; i <= ROWS; i++) {
        printf("[%d,%d]: %5.2f  ", i, i, Temperature[i][i]);
    }
    printf("\n");
}

Page 35:

Serial Fortran Code (kernel)

do while ( dt > max_temp_error .and. iteration <= max_iterations)

   ! calculate new temperatures
   do j=1,columns
      do i=1,rows
         temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                temperature_last(i,j+1)+temperature_last(i,j-1) )
      enddo
   enddo

   dt=0.0

   ! update temp array and find max change
   do j=1,columns
      do i=1,rows
         dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
         temperature_last(i,j) = temperature(i,j)
      enddo
   enddo

   ! output
   if( mod(iteration,100).eq.0 ) then
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo   ! done?

Page 36:

Serial Fortran Code Subroutines

subroutine initialize( temperature_last )
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer            :: i,j

   double precision, dimension(0:rows+1,0:columns+1) :: temperature_last

   temperature_last = 0.0

   !these boundary conditions never change throughout run

   !set left side to 0 and right to linear increase
   do i=0,rows+1
      temperature_last(i,0) = 0.0
      temperature_last(i,columns+1) = (100.0/rows) * i
   enddo

   !set top to 0 and bottom to linear increase
   do j=0,columns+1
      temperature_last(0,j) = 0.0
      temperature_last(rows+1,j) = ((100.0)/columns) * j
   enddo

end subroutine initialize

subroutine track_progress(temperature, iteration)
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer            :: i,iteration

   double precision, dimension(0:rows+1,0:columns+1) :: temperature

   print *, '---------- Iteration number: ', iteration, ' ---------------'
   do i=5,0,-1
      write (*,'("("i4,",",i4,"):",f6.2," ")',advance='no') &
         rows-i,columns-i,temperature(rows-i,columns-i)
   enddo
   print *

end subroutine track_progress

Page 37:

Whole Fortran Code

program serial
   implicit none

   !Size of plate
   integer, parameter          :: columns=1000
   integer, parameter          :: rows=1000
   double precision, parameter :: max_temp_error=0.01

   integer          :: i, j, max_iterations, iteration=1
   double precision :: dt=100.0
   real             :: start_time, stop_time

   double precision, dimension(0:rows+1,0:columns+1) :: temperature, temperature_last

   print*, 'Maximum iterations [100-4000]?'
   read*,  max_iterations

   call cpu_time(start_time)   !Fortran timer

   call initialize(temperature_last)

   !do until error is minimal or until maximum steps
   do while ( dt > max_temp_error .and. iteration <= max_iterations)

      do j=1,columns
         do i=1,rows
            temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                   temperature_last(i,j+1)+temperature_last(i,j-1) )
         enddo
      enddo

      dt=0.0

      !copy grid to old grid for next iteration and find max change
      do j=1,columns
         do i=1,rows
            dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
            temperature_last(i,j) = temperature(i,j)
         enddo
      enddo

      !periodically print test values
      if( mod(iteration,100).eq.0 ) then
         call track_progress(temperature, iteration)
      endif

      iteration = iteration+1

   enddo

   call cpu_time(stop_time)

   print*, 'Max error at iteration ', iteration-1, ' was ',dt
   print*, 'Total time was ',stop_time-start_time, ' seconds.'

end program serial

! initialize plate and boundary conditions
! temp_last is used to start the first iteration
subroutine initialize( temperature_last )
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer            :: i,j

   double precision, dimension(0:rows+1,0:columns+1) :: temperature_last

   temperature_last = 0.0

   !these boundary conditions never change throughout run

   !set left side to 0 and right to linear increase
   do i=0,rows+1
      temperature_last(i,0) = 0.0
      temperature_last(i,columns+1) = (100.0/rows) * i
   enddo

   !set top to 0 and bottom to linear increase
   do j=0,columns+1
      temperature_last(0,j) = 0.0
      temperature_last(rows+1,j) = ((100.0)/columns) * j
   enddo

end subroutine initialize

!print diagonal in bottom corner where most action is
subroutine track_progress(temperature, iteration)
   implicit none

   integer, parameter :: columns=1000
   integer, parameter :: rows=1000
   integer            :: i,iteration

   double precision, dimension(0:rows+1,0:columns+1) :: temperature

   print *, '---------- Iteration number: ', iteration, ' ---------------'
   do i=5,0,-1
      write (*,'("("i4,",",i4,"):",f6.2," ")',advance='no') &
         rows-i,columns-i,temperature(rows-i,columns-i)
   enddo
   print *

end subroutine track_progress

Page 38:

Exercises: General Instructions for Compiling

Exercises are in the "Exercises/OpenACC" directory in your home directory. Solutions are in the "Solutions" subdirectory.

To compile:

pgcc  -acc laplace.c
pgf90 -acc laplace.f90

This will generate the executable a.out.

Page 39:

Exercises: Very useful compiler option

Adding -Minfo=accel to your compile command will give you some very useful information about how well the compiler was able to honor your OpenACC directives.

[urbanic@gpu017 Solutions]$ pgcc -acc -Minfo=accel laplace_acc.c
main:
     59, Generating create(Temperature[:][:]) [if not already present]
         Generating copy(Temperature_last[:][:]) [if not already present]
     64, Loop is parallelizable
     65, Loop is parallelizable
         Generating Tesla code
         64, #pragma acc loop gang, vector(4)  /* blockIdx.y threadIdx.y */
         65, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     75, Loop is parallelizable
     76, Loop is parallelizable
         Generating Tesla code
         75, #pragma acc loop gang, vector(4)  /* blockIdx.y threadIdx.y */
         76, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         77, Generating implicit reduction(max:dt)
     85, Generating update self(Temperature[:][:])

Page 40:

Special Instructions for Running on the GPUs (during this workshop)

As mentioned, on Bridges you generally only have to use the queueing system when you want to. However, as we have hundreds of you wanting quick turnaround, we will have to use it today.

Once you have an a.out that you want to run, use the simple job file that we have already created for you (in Exercises/OpenACC):

fred@br003$ sbatch gpu.job
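For reference, a minimal sketch of what a job file like this might contain; the partition name, GPU request, and time limit here are assumptions for illustration, not the actual contents of gpu.job:

#!/bin/bash
#SBATCH -N 1             # one node
#SBATCH -p GPU           # hypothetical GPU partition name
#SBATCH --gres=gpu:1     # request one GPU
#SBATCH -t 00:10:00      # a few minutes is plenty for these exercises

# answer the "Maximum iterations?" prompt non-interactively
echo 4000 | ./a.out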

Page 41:

Output From Your Batch Job

The machine will tell you it submitted a batch job, and you can await your output, which will come back in a file with the corresponding number as a name:

slurm-138555.out

As everything we are doing this afternoon only requires a few minutes at most (and usually just seconds), you could just sit there and wait for the file to magically appear. At that point you can "more" it or review it with your editor.

Page 42:

Changing Things Up

If you get impatient, or want to see what the machine is up to, you can look at the situation with squeue.

You might wonder what happened to the iteration count that the user is prompted for. I stuck a reasonable default (4000 iterations) into the job file. You can edit it if you want to. The whole job file is just a few lines.

Congratulations, you are now a Batch System veteran. Welcome to supercomputing.

Page 43:

Exercise 1: Using kernels to parallelize the main loops (about 20 minutes)

Q: Can you get a speedup with just the kernels directives?

1. Edit laplace_serial.c/f90
   a. Maybe copy your intended OpenACC version to laplace_acc.c to start
   b. Add directives where it helps
2. Compile with OpenACC parallelization
   a. pgcc -acc -Minfo=accel laplace_acc.c  or  pgf90 -acc -Minfo=accel laplace_acc.f90
   b. Look at your compiler output to make sure you are having an effect
3. Run
   a. sbatch gpu.job (leave it at 4000 iterations if you want a solution that converges to the current tolerance)
   b. Look at the output in the file that returns (something like slurm-138555.out)
   c. Compare the serial version and your OpenACC version for the performance difference

Page 44:

Exercise 1 C Solution

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels   // generate a GPU kernel
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0; // reset largest temperature change

    #pragma acc kernels   // generate a GPU kernel
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    if((iteration % 100) == 0) {
        track_progress(iteration);
    }

    iteration++;
}

Page 45:

Exercise 1 Fortran Solution

do while ( dt > max_temp_error .and. iteration <= max_iterations)

   ! generate a GPU kernel
   !$acc kernels
   do j=1,columns
      do i=1,rows
         temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                temperature_last(i,j+1)+temperature_last(i,j-1) )
      enddo
   enddo
   !$acc end kernels

   dt=0.0

   ! generate a GPU kernel
   !$acc kernels
   do j=1,columns
      do i=1,rows
         dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
         temperature_last(i,j) = temperature(i,j)
      enddo
   enddo
   !$acc end kernels

   if( mod(iteration,100).eq.0 ) then
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo

Page 46:

Exercise 1: Compiler output (C)

[urbanic@gpu017 Solutions]$ pgcc -acc -Minfo=accel laplace_acc.c
main:
     59, Generating create(Temperature[:][:]) [if not already present]
         Generating copy(Temperature_last[:][:]) [if not already present]
     64, Loop is parallelizable          <- compiler was able to parallelize
     65, Loop is parallelizable
         Generating Tesla code
         64, #pragma acc loop gang, vector(4)  /* blockIdx.y threadIdx.y */
         65, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     75, Loop is parallelizable          <- compiler was able to parallelize
     76, Loop is parallelizable
         Generating Tesla code
         75, #pragma acc loop gang, vector(4)  /* blockIdx.y threadIdx.y */
         76, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         77, Generating implicit reduction(max:dt)
     85, Generating update self(Temperature[:][:])

Page 47:

First, about that "reduction"

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            // Exiting this loop, each processor has a different
            // idea of what the max dt is.
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }
    .
    .
    iteration++;
}

With kernels the compiler recognizes this and does a reduction, a very convenient thing. We can get too sophisticated for this autoscoping to happen. This clause explicitly declares the reduction:

loop reduction (max:dt)
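As a minimal sketch, the same update loop with the reduction spelled out rather than inferred:

#pragma acc kernels loop reduction(max:dt)
for(i = 1; i <= ROWS; i++){
    for(j = 1; j <= COLUMNS; j++){
        dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
        Temperature_last[i][j] = Temperature[i][j];
    }
}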

Page 48:

Exercise 1: Performance (3372 steps to convergence, using PGI 19.10 on a P100)

                          Execution Time (s)   Speedup
CPU serial                      20.8             --
CPU 2 OpenMP threads             9.6             2.1
CPU 4 OpenMP threads             4.8             4.3
CPU 8 OpenMP threads             2.7             7.7
CPU 16 OpenMP threads            1.4            14.8
CPU 28 OpenMP threads            0.9            23.1
OpenACC GPU                     31.1             0.6x

Page 49:

What’s with the OpenMP?

We can compare our GPU results to the best the multi-core CPUs can do.

If you are familiar with OpenMP, or even if you are not, you can compile and run the OpenMP-enabled versions in your OpenMP directory as:

pgcc -mp laplace_omp.c    or    pgf90 -mp laplace_omp.f90

Then, to run on 8 threads, do:

export OMP_NUM_THREADS=8
a.out

Note that you probably only have 8 real cores if you are still on a GPU node. Do something like "interact -n28" if you want a full node of cores.

Page 50:

What went wrong?

export PGI_ACC_TIME=1 to activate profiling, and run again:

Accelerator Kernel Timing data
/home/urbanic/laplace_bad_acc.c
  main  NVIDIA  devicenum=0
    time(us): 12,095,531
    62: compute region reached 3372 times
        64: kernel launched 3372 times
            grid: [32x250]  block: [32x4]
            device time(us): total=127,989 max=48 min=37 avg=37
            elapsed time(us): total=241,221 max=1,407 min=61 avg=71
    62: data region reached 6744 times
        62: data copyin transfers: 3372
            device time(us): total=2,446,765 max=972 min=712 avg=725    <- 2.4 seconds
        70: data copyout transfers: 3372
            device time(us): total=2,098,635 max=835 min=616 avg=622    <- 2.0 seconds
    73: compute region reached 3372 times
        73: data copyin transfers: 3372
            device time(us): total=32,465 max=71 min=6 avg=9
        75: kernel launched 3372 times
            grid: [32x250]  block: [32x4]
            device time(us): total=179,342 max=63 min=52 avg=53         <- 0.2 seconds
            elapsed time(us): total=294,686 max=407 min=76 avg=87
        75: reduction kernel launched 3372 times
            grid: [1]  block: [256]
            device time(us): total=50,490 max=23 min=14 avg=14
            elapsed time(us): total=137,910 max=549 min=34 avg=40
        75: data copyout transfers: 3372
            device time(us): total=60,080 max=266 min=13 avg=17         <- 0.1 seconds
    73: data region reached 6744 times
        73: data copyin transfers: 6744
            device time(us): total=5,004,411 max=1,005 min=716 avg=742  <- 5.0 seconds
        82: data copyout transfers: 3372
            device time(us): total=2,095,354 max=854 min=616 avg=621    <- 2.0 seconds

The compute kernels total only a few tenths of a second each (roughly 0.1 to 0.3 seconds); the data transfers total over 11 seconds.

Page 51:

Basic Concept (simplified, but sadly true)

CPU <-> CPU memory:         > 200 GB/s
GPU <-> GPU memory:         < 1 TB/s
CPU memory <-> GPU memory:  16 GB/s (PCI Gen 3), 80 GB/s (NVLink 1), 150 GB/s (NVLink 2)

All bandwidths are one-direction.

Page 52:

Multiple Times Each Iteration

(Figure: every iteration, the stencil arrays are shuttled across the PCI bus between CPU memory and GPU memory, in both directions.)

Page 53:

Excessive Data Transfers

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    // Temperature, Temperature_old resident on host
    #pragma acc kernels
    // Temperature, Temperature_old now resident on device
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_old[i+1][j] + ...
        }
    }
    // Temperature, Temperature_old resident on host again

    dt = 0.0;

    // Temperature, Temperature_old resident on host
    #pragma acc kernels
    // Temperature, Temperature_old now resident on device
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_old[i+1][j] + ...
        }
    }
    // Temperature, Temperature_old resident on host again
}

4 copies happen every iteration of the outer while loop!

Page 54:

Data Management

The First, Most Important, and Possibly Only OpenACC Optimization

Page 55:

Scoped Data Construct Syntax

Fortran:

!$acc data [clause ...]
   structured block
!$acc end data

C:

#pragma acc data [clause ...]
{
   structured block
}

Page 56:

Data Clauses

copy( list )    Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
                Principal use: for many important data structures in your code, this is a logical default to input, modify and return the data.

copyin( list )  Allocates memory on the GPU and copies data from host to GPU when entering the region.
                Principal use: think of this like an array that you would use as just an input to a subroutine.

copyout( list ) Allocates memory on the GPU and copies data to the host when exiting the region.
                Principal use: a result that isn't overwriting the input data structure.

create( list )  Allocates memory on the GPU but does not copy.
                Principal use: temporary arrays.
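A minimal sketch with all four clauses on one data region; the function, array names, and arithmetic are illustrative only:

// in: input only; out: result only; tmp: scratch; state: read and updated
void step(int n, const float *restrict in, float *restrict out,
          float *restrict tmp, float *restrict state)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n]) copy(state[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; i++) {
            tmp[i]   = 2.0f * in[i];
            out[i]   = tmp[i] + state[i];
            state[i] = state[i] + 1.0f;
        }
    }
}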

Page 57:

Array Shaping

Compilers sometimes cannot determine the size of arrays, so we must specify it explicitly using data clauses with an array "shape". The compiler will let you know if you need to do this. Sometimes you will want to do it anyway, for your own efficiency reasons.

C:

#pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])

Fortran:

!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))

Fortran uses start:end and C uses start:length.

Data clauses can be used on data, kernels or parallel constructs.
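To make the start:length versus start:end distinction concrete, here is a sketch in which both lines move the same 500 middle elements of a hypothetical 1000-element array a:

#pragma acc data copyin(a[250:500])   /* C: start 250, LENGTH 500 -> elements 250..749 */

!$acc data copyin(a(251:750))         ! Fortran: start 251, END 750 -> elements 251..750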

Page 58:

Compiler will (increasingly) often make a good guess...

int main(int argc, char *argv[]) {

    int i;
    double A[2000], B[1000], C[1000];

    #pragma acc kernels
    for (i=0; i<1000; i++){
        A[i] = 4 * i;
        B[i] = B[i] + 2;
        C[i] = A[i] + 2 * B[i];
    }
}

pgcc -acc -Minfo=accel loops.c
main:
      6, Generating present_or_copyout(C[:])
         Generating present_or_copy(B[:])
         Generating present_or_copyout(A[:1000])
         Generating NVIDIA code
      7, Loop is parallelizable
         Accelerator kernel generated

Smart: C is written but never read, so it is only copied out.
Smarter: B is both read and written, so it is copied in and out.
Smartest: A is declared with 2000 elements, but only the first 1000 are touched, so only A[:1000] is copied out.

Page 59:

Data Regions Have Real Consequences

Simplest kernel, with no explicit data region:

int main(int argc, char** argv){

    float A[1000];

    // A[] copied to GPU; kernel runs; A[] copied back to host
    #pragma acc kernels
    for( int iter = 1; iter < 1000 ; iter++){
        A[iter] = 1.0;
    }

    A[10] = 2.0;                  // runs on host

    printf("A[10] = %f", A[10]);  // Output: A[10] = 2.0
}

With a global data region:

int main(int argc, char** argv){

    float A[1000];

    #pragma acc data copy(A)
    {
        // A[] copied to GPU on entry to the data region
        #pragma acc kernels
        for( int iter = 1; iter < 1000 ; iter++){
            A[iter] = 1.0;
        }

        A[10] = 2.0;              // still runs on host, touching the host copy

        // A[] copied back to host on exit, overwriting the host assignment
    }

    printf("A[10] = %f", A[10]);  // Output: A[10] = 1.0
}
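If we really wanted the host assignment to survive in the second version, one hedged fix (using the update construct that appears a few slides later) is to push the host change to the device copy before the region ends:

A[10] = 2.0;
#pragma acc update device(A[10:1])   // refresh just that element on the GPU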

Page 60:

Data Regions Are Different Than Compute Regions

int main(int argc, char** argv){

    float A[1000];

    #pragma acc data copy(A)                       // data region
    {
        #pragma acc kernels                        // compute region
        for( int iter = 1; iter < 1000 ; iter++){
            A[iter] = 1.0;
        }

        A[10] = 2.0;
    }

    printf("A[10] = %f", A[10]);                   // Output: A[10] = 1.0
}

Page 61:

Data Movement Decisions

Much like loop data dependencies, sometimes the compiler needs your human intelligence to make high-level decisions about data movement. Otherwise, it must remain conservative, sometimes at great cost.

You must think about when data truly needs to migrate, and see if that is better than the default.

Besides the scope-based data clauses, there are OpenACC options to let us manage data movement more intensely or asynchronously. We could manage the above behavior with the update construct:

Fortran:  !$acc update [host(), device(), ...]
C:        #pragma acc update [host(), device(), ...]

Ex: #pragma acc update host(Temp_array)   // get the host a copy from the device

Page 62:

Exercise 2: Use acc data to minimize transfers (about 40 minutes)

Q: What speedup can you get with data + kernels directives?

• Start with your Exercise 1 solution or grab laplace_bad_acc.c/f90 from the Solutions subdirectory. This is just the solution of the last exercise.

• Add data directives where it helps.

• Think: when should I move data between host and GPU? Think how you would do it by hand, then determine which data clauses will implement that plan.

• Hint: you may find it helpful to ignore the output at first and just concentrate on getting the solution to converge quickly (at 3372 steps). Then worry about updating the printout.

Page 63:

Exercise 2 C Solution

// no data movement inside this block...
#pragma acc data copy(Temperature_last, Temperature)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    // main calculation: average my four neighbors
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0; // reset largest temperature change

    // copy grid to old grid for next iteration and find latest dt
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    // periodically print test values
    if((iteration % 100) == 0) {
        #pragma acc update host(Temperature)   // ...except once in a while here
        track_progress(iteration);
    }

    iteration++;
}

Page 64:

Exercise 2, Slightly better solution

// Temperature is purely temporary, so create it on the GPU instead of copying it
#pragma acc data copy(Temperature_last), create(Temperature)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    // main calculation: average my four neighbors
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0; // reset largest temperature change

    // copy grid to old grid for next iteration and find latest dt
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    // periodically print test values
    if((iteration % 100) == 0) {
        #pragma acc update host(Temperature)
        track_progress(iteration);
    }

    iteration++;
}

Page 65:

Slightly better still solution

#pragma acc data copy(Temperature_last), create(Temperature)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    // main calculation: average my four neighbors
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0; // reset largest temperature change

    // copy grid to old grid for next iteration and find latest dt
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    // periodically print test values
    if((iteration % 100) == 0) {
        #pragma acc update host(Temperature[ROWS-4:5][COLUMNS-4:5])   // only need corner elements
        track_progress(iteration);
    }

    iteration++;
}

Page 66:

Exercise 2 Fortran Solution

! keep these on the GPU for the whole loop
!$acc data copy(temperature_last), create(temperature)
do while ( dt > max_temp_error .and. iteration <= max_iterations)

   !$acc kernels
   do j=1,columns
      do i=1,rows
         temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                temperature_last(i,j+1)+temperature_last(i,j-1) )
      enddo
   enddo
   !$acc end kernels

   dt=0.0

   !copy grid to old grid for next iteration and find max change
   !$acc kernels
   do j=1,columns
      do i=1,rows
         dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
         temperature_last(i,j) = temperature(i,j)
      enddo
   enddo
   !$acc end kernels

   !periodically print test values
   if( mod(iteration,100).eq.0 ) then
      !except bring back a copy here
      !$acc update host(temperature)
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo
!$acc end data

Extra efficient:

!$acc update host(temperature(columns-5:columns,rows-5:rows))

Page 67:

Exercise 2: Performance (3372 steps to convergence)

                          Execution Time (s)   Speedup
CPU serial                      20.8             --
CPU 2 OpenMP threads             9.6             2.1
CPU 4 OpenMP threads             4.8             4.3
CPU 8 OpenMP threads             2.7             7.7
CPU 16 OpenMP threads            1.4            14.8
CPU 28 OpenMP threads            0.9            23.1
OpenACC GPU                      0.64           32.5

Page 68:

OpenACC or OpenMP?

Don't draw any grand conclusions yet. We have gotten impressive speedups from both approaches. But our problem size is pretty small.

Our main data structure is 1000 x 1000 = 1M elements = 8 MB of memory (double precision).

We have two of these (temperature and temperature_last), so we are using roughly 16 MB of memory. Not very large. When divided over cores it gets even smaller and can easily fit into cache.

The algorithm is realistic, but the problem size is tiny, and hence the memory bandwidth stress is very low.

Page 69:

OpenACC or OpenMP on Larger Data?

We can easily scale this problem up, so why don't I? Because it is nice to have exercises that finish in a few minutes or less.

We scale this up to 10K x 10K (a 1.6 GB problem size) for the hybrid challenge. These numbers start to look a little more realistic. But the serial code takes over 30 minutes to finish. That would have gotten us off to a slow start!

10K x 10K problem size:

                          Execution Time (s)   Speedup
CPU serial                     2187              --
CPU 16 OpenMP threads           183              12
CPU 28 OpenMP threads           162              13.5
OpenACC                         103              21

An obvious cusp for core scaling appears.

Page 70:

Latest Happenings In Data Management

Unified Memory: a unified address space allows us to pretend we have shared memory. Skip data management, hope it works, and then optimize if necessary. For dynamically allocated memory, it can eliminate the need for pointer clauses.

NVLink: one route around the PCI bus (with multiple GPUs).
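As a sketch of what "skip data management" looks like in practice, PGI's managed-memory target option lets code with dynamically allocated arrays run without explicit data clauses; treat the exact spelling as an assumption to verify against your compiler release:

pgcc -acc -ta=tesla:managed -Minfo=accel laplace_acc.c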

Page 71:

Further speedups

OpenACC gives us even more detailed control over parallelization, via the gang, worker, and vector clauses (see the sketch below).

By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code.

By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance.

But you have already gained most of any potential speedup, and you did it with a few lines of directives!
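A minimal sketch of those clauses on the main Laplace loop nest; the specific vector length is an illustrative tuning knob, not a recommendation from this talk:

// Map the outer loop across gangs and the inner loop across vector lanes.
#pragma acc kernels loop gang
for(i = 1; i <= ROWS; i++) {
    #pragma acc loop vector(128)
    for(j = 1; j <= COLUMNS; j++) {
        Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                    Temperature_last[i][j+1] + Temperature_last[i][j-1]);
    }
}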

Page 72:

Is OpenACC Living Up To My Claims?

High-level. No involvement of OpenCL, CUDA, etc.

Single source. No forking off a separate GPU code. Compile the same program for accelerators or serial; non-GPU programmers can play along.

Efficient. Experience shows very favorable comparison to low-level implementations of the same algorithms. kernels is magical!

Performance portable. Supports GPU accelerators and co-processors from multiple vendors, current and future versions.

Incremental. Developers can port and tune parts of their application as resources and profiling dictate. No wholesale rewrite required. And it can be quick.

Page 73:

In Conclusion…

OpenMP

OpenACC

MPI