Post on 22-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parallel implementation of TDSE on a Graphics Processing ...damot/talk/Cathal_O_Broin_Dublin_2011.pdf · GPU Evolution Compilers are now available in higher level languages (C and

Parallel implementation of TDSE on a Graphics Processing Unit (GPU) platform

Cathal Ó Broin
Dublin City University

Page 2:

Together with...

Lampros Nikolopoulos, in collaboration with Ken Taylor

Page 3:

GPU Evolution

Compilers for higher-level languages (C and Fortran) are now available for GPUs.

GPUs focus on parallelism.

Compared to CPUs, GPUs have:
● fewer control units
● more processing elements (cores)
● an increased amount of on-chip memory

Page 4:

Current GPU Example

NVIDIA Tesla cards (with Fermi):
● 448 cores
● 6 GB of memory
● 0.5 teraflops peak double-precision performance
● 148 GB/s bandwidth to the GPU

Page 5:

GPU Architecture

● Most graphics cards have a SIMD architecture
● Graphics cards have a large amount of on-board memory
● GPUs aim for high throughput
● Double precision is available

GPUs are used for highly parallel tasks.

Page 6:

What tasks are GPUs suitable for?

GPUs are suitable for tasks where:
● the task can be broken up into groups of units
● the units in a group execute the same instructions on different data

But not for tasks that:
● require high levels of communication within the task
● require high levels of flow control, such as if conditions within the code

Page 7:

The Physical Problem

An atomic or molecular system in an intense laser field obeys the time-dependent Schrödinger equation (TDSE):
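
The equation on this slide did not survive extraction. For reference, the TDSE in its standard textbook form (atomic units) is sketched below. The split of the Hamiltonian into a field-free part $H_0$ and a laser interaction $V(t)$ is the conventional one and is an assumption here, not something recovered from the slide.

```latex
i \frac{\partial}{\partial t} \psi(\mathbf{r}, t) = H(t)\, \psi(\mathbf{r}, t),
\qquad
H(t) = H_0 + V(t)
```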

Page 8:

The basis expansion approach

The problem can be recast in the form:
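
The equation on this slide did not survive extraction. A common way to write a basis expansion is sketched below as an assumed illustration: the basis functions $\phi_n$, the coefficients $C_n(t)$, and the orthonormality of the basis are choices made for this sketch. Under those assumptions the TDSE becomes a set of coupled ordinary differential equations for the coefficient vector.

```latex
\psi(\mathbf{r}, t) = \sum_{n} C_n(t)\, \phi_n(\mathbf{r}),
\qquad
i \frac{d}{dt} C_m(t) = \sum_{n} H_{mn}(t)\, C_n(t),
\qquad
H_{mn}(t) = \langle \phi_m | H(t) | \phi_n \rangle
```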

Page 9:

The Hamiltonian structure

Page 10:

Elements of the solution

The solution is of the form:


Page 11:

The Taylor Expansion Method (TE)
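
The formula on this slide was lost in extraction. In its generic form, a Taylor propagator advances the coefficient vector $\mathbf{C}$ by a truncated series. The symbols ($\Delta t$ for the time step, $p$ for the truncation order) and the treatment of $H$ as constant over one step are assumptions of this sketch.

```latex
\mathbf{C}(t + \Delta t) \approx \sum_{k=0}^{p} \frac{(\Delta t)^k}{k!}\, \mathbf{C}^{(k)}(t),
\qquad
\mathbf{C}^{(0)} = \mathbf{C},
\qquad
\mathbf{C}^{(k)}(t) = -i\, H\, \mathbf{C}^{(k-1)}(t)
```

Each derivative is one matrix-vector product applied to the previous derivative, which is why the work maps naturally onto repeated kernel calls.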


Page 12:

What is OpenCL

OpenCL is a C-like language that abstracts the computational resources of a computer.

OpenCL has:
● portability between all supported architectures
● combined use of CPU and GPU execution
● compilation of code at runtime
● massive hardware vendor support

Page 13:

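// Double precision requires the cl_khr_fp64 extension on OpenCL 1.0/1.1 devices.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// c = a * b for row-major n x n matrices.
// Each work group handles a block of rows; each work item in the group
// handles a block of columns within those rows.
kernel void MatrixMultiplication(const global double * a, const global double * b, global double * c, int n)
{
    int LId     = get_local_id(0);
    int GroupId = get_group_id(0);

    // Number of answers we must get: columns per work item and rows per
    // work group, rounded up so every element is covered when n is not an
    // exact multiple of the sizes.
    int divcol = (n + (int)get_local_size(0) - 1) / (int)get_local_size(0);
    int divrow = (n + (int)get_num_groups(0) - 1) / (int)get_num_groups(0);

    // Memory protection: clamp both ranges so that no index exceeds n.
    int rowEnd = min((GroupId + 1) * divrow, n);
    int colEnd = min((LId + 1) * divcol, n);

    for (int row = GroupId * divrow; row < rowEnd; row++) {
        for (int col = LId * divcol; col < colEnd; col++) {
            double curr = 0;
            for (int i = 0; i < n; i++)
                curr += a[row * n + i] * b[i * n + col];
            c[row * n + col] = curr;
        }
    }
}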
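
The way the kernel divides the matrix (work groups own blocks of rows, work items own blocks of columns) can be checked on a CPU. The following is a minimal sketch in plain C, not the talk's code: `Range`, `owned_range`, and `matmul_emulated` are hypothetical names, and the ceiling-division split with clamping is an assumption chosen so that sizes that do not divide n evenly are still covered.

```c
#include <assert.h>

/* Half-open index range [begin, end) owned by one worker. */
typedef struct { int begin; int end; } Range;

/* Split n items over `workers` workers: ceiling division, clamped to n. */
Range owned_range(int n, int workers, int id)
{
    assert(n >= 0 && workers > 0 && id >= 0 && id < workers);
    int chunk = (n + workers - 1) / workers;
    Range r;
    r.begin = id * chunk < n ? id * chunk : n;
    r.end   = (id + 1) * chunk < n ? (id + 1) * chunk : n;
    return r;
}

/* CPU emulation of the decomposition: the two outer loops play the role
   of work groups (row blocks) and work items (column blocks). */
void matmul_emulated(const double *a, const double *b, double *c,
                     int n, int num_groups, int local_size)
{
    for (int g = 0; g < num_groups; g++) {
        Range rows = owned_range(n, num_groups, g);
        for (int l = 0; l < local_size; l++) {
            Range cols = owned_range(n, local_size, l);
            for (int row = rows.begin; row < rows.end; row++) {
                for (int col = cols.begin; col < cols.end; col++) {
                    double curr = 0.0;
                    for (int i = 0; i < n; i++)
                        curr += a[row * n + i] * b[i * n + col];
                    c[row * n + col] = curr;
                }
            }
        }
    }
}
```

Because every (group, work item) pair writes a disjoint block of `c`, the iterations of the two outer loops are independent of each other, which is what lets a GPU run them concurrently.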

Page 14:
Page 15:

Division of Work

Page 16:

Graphics card used

AMD FirePro 7800
● Cost: approx. 750 euro (pre-installed)
● 1 GB of total global memory
● 32 KB per local memory unit
● 64 KB of total constant memory
● 8 KB of private registers per processing element
● 1440 processing elements
● 64 processing elements per SIMD
● 18 compute units
● 400 gigaflops maximum performance

Page 17:

Existing CPU code in C++

● Thoroughly tested on a number of systems (H, He, Mg, etc.)
● Tested over the last ten years
● Uses a NAG propagator

Page 18:

Results for N = 191

[Figure: time (sec) versus angular momentum for OpenCL with work-group sizes (WGSZ) 64, 128, and 256, and for NAG.]

Page 19:

Results for N = 391

[Figure: time (sec) versus highest angular momentum value for OpenCL and the NAG propagator.]

Page 20:

Further Work

Work will be undertaken to port the implementation to NVIDIA-specific CUDA so that it can run at the Irish Centre for High-End Computing (ICHEC).

Work will be done to implement more sophisticated methods on the GPU.

Page 21:

END

Page 22:

[Figure: OpenCL and NAG results for N = 191, L = 12.]

Page 23:

On OpenCL

● Kernels are functions that are called from regular CPU-based programs (host code).
● Kernels are written in an OpenCL variant of C99.
● Multiple instances of a kernel function are executed by different work items.
● Memory cannot be synchronized globally across all work items except at the start of a new kernel function call.

Page 24:

Work Items

Each work item executes an instance of a kernel.

A work item differs from a thread in that:
● Its instruction stream should be the same as that of the rest of its work group.
● It cannot communicate with work items outside its work group.

Page 25:

Queueing in host code

● A problem can be broken up into tasks divided along synchronization points.
● Each part of a task is then implemented in a kernel function.
● In host code, written in host languages such as C, C++ and Fortran, kernels are queued for execution.
● Other items can also be queued, such as copying buffers or reading and writing buffers to and from host memory.

Page 26:

Synchronization

● Items in a queue execute in order: the next item is guaranteed to run only after the previous one has finished.
● Any changes to memory will be seen by the next item.
● For the Taylor expansion, a synchronization point is required after the calculation of each successive derivative.
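
The need for a synchronization point after each derivative can be illustrated on the CPU in plain C. This is a sketch under stated assumptions, not the talk's implementation: the names `apply_minus_iH` and `taylor_step` are hypothetical, the Hamiltonian is stored as a dense row-major matrix, and it is treated as constant over one time step. Each call to `apply_minus_iH` stands in for one queued kernel, and the point just after it returns corresponds to the synchronization point.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Stand-in for one queued kernel: out = -i * H * in (dense n x n matrix). */
static void apply_minus_iH(size_t n, const double complex *H,
                           const double complex *in, double complex *out)
{
    for (size_t r = 0; r < n; r++) {
        double complex acc = 0.0;
        for (size_t c = 0; c < n; c++)
            acc += H[r * n + c] * in[c];
        out[r] = -I * acc;
    }
}

/* One Taylor step: C <- sum_{k=0..order} dt^k/k! * (-iH)^k C.
   deriv and next are caller-provided scratch buffers of length n. */
void taylor_step(size_t n, const double complex *H, double complex *C,
                 double complex *deriv, double complex *next,
                 double dt, int order)
{
    assert(order >= 0);
    double coeff = 1.0;                     /* holds dt^k / k! */
    for (size_t r = 0; r < n; r++)
        deriv[r] = C[r];                    /* zeroth derivative */

    for (int k = 1; k <= order; k++) {
        apply_minus_iH(n, H, deriv, next);  /* "kernel call" for derivative k */
        /* Synchronization point: every element of next must be complete
           before it is read as the input to derivative k + 1. */
        coeff *= dt / k;
        for (size_t r = 0; r < n; r++) {
            deriv[r] = next[r];
            C[r] += coeff * next[r];
        }
    }
}
```

Derivative k + 1 reads every element of derivative k, so no work item can start the next matrix-vector product until all of them have finished the current one. With an in-order queue, issuing one kernel per derivative provides that guarantee.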

Page 27:

Results

[Figure: results for OpenCL with work-group sizes (WGSZ) 16, 32, 64, 128, and 256, and for NAG.]

Page 28:

GPU Execution Model