Parallel implementation of TDSE on a Graphics Processing Unit (GPU) platform
Cathal Ó Broin
Dublin City University
Together with...
Lampros Nikolopoulos
In collaboration with Ken Taylor
GPU Evolution
Compilers for higher-level languages (C and Fortran) are now available for GPUs.
GPUs focus on parallelism.
Compared to CPUs, GPUs have:
● fewer control units
● more processing elements (cores)
● more on-chip memory
Current GPU Example
NVIDIA Tesla cards (with Fermi):
● 448 cores
● 6 GB of memory
● 0.5 teraflops peak double-precision performance
● 148 GB/s bandwidth to the GPU
GPU Architecture
● Most graphics cards have a SIMD architecture
● Graphics cards have a large amount of on-board memory
● GPUs aim for high throughput
● Double precision is available
GPUs are used for highly parallel tasks.
What tasks are GPUs suitable for?
GPUs are suitable for tasks where:
● the task can be broken up into groups of units
● the units in a group execute the same instructions on different data
But not for tasks that:
● require high levels of communication within the task
● require high levels of flow control, such as if conditions within the code
The Physical Problem
An atomic or molecular system in an intense laser field satisfies the time-dependent Schrödinger equation (TDSE):
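For reference, the TDSE is commonly written in the form below. The notation is an assumption rather than the slide's own: atomic units, with $H_0$ the field-free Hamiltonian and $V(t)$ the interaction with the laser field.

```latex
i \frac{\partial}{\partial t} \psi(\mathbf{r}, t)
  = \left[ H_0 + V(t) \right] \psi(\mathbf{r}, t)
```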
The basis expansion approach
The problem can be changed to the form:
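A sketch of the usual basis expansion, under assumed notation (the orthonormal basis $\phi_n$, the coefficients $C_n(t)$ and atomic units are illustrative choices, not taken from the slides): expanding the wavefunction on eigenstates of the field-free Hamiltonian $H_0$ turns the TDSE into a system of coupled ordinary differential equations for the coefficient vector, with $V(t)$ the laser interaction.

```latex
\psi(\mathbf{r}, t) = \sum_n C_n(t)\, \phi_n(\mathbf{r}),
\qquad
i \frac{d}{dt} \mathbf{C}(t) = \mathbf{H}(t)\, \mathbf{C}(t),
\qquad
H_{mn}(t) = \langle \phi_m | H_0 + V(t) | \phi_n \rangle
```

In this form each time step is dominated by matrix-vector products, which is the kind of work a GPU handles well.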
The Hamiltonian structure
Elements of the solution
The solution is of the form:
The Taylor Expansion Method (TE)
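A minimal sketch of a Taylor-series propagator, assuming the coefficient equation i dC/dt = H C with H held fixed over one time step. The function names (`matvec`, `taylor_step`), the default order of 8 and the pure-Python lists are illustrative choices, not the implementation used in the talk.

```python
def matvec(h, c):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(h_ij * c_j for h_ij, c_j in zip(row, c)) for row in h]


def taylor_step(h, c, dt, order=8):
    """Advance c(t) to c(t + dt) for i dc/dt = H c.

    Assumes H is constant over the step. The k-th Taylor term is built
    from the previous one: term_k = (-i * dt / k) * H * term_{k-1}.
    """
    term = list(c)      # k = 0 term is c(t) itself
    result = list(c)    # running sum of the series
    for k in range(1, order + 1):
        hv = matvec(h, term)
        term = [(-1j * dt / k) * x for x in hv]
        result = [r + t for r, t in zip(result, term)]
    return result
```

Each pass through the loop needs the complete previous term before it can start, so the terms are computed one after another while the rows of each matrix-vector product are independent of each other.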
What is OpenCL?
OpenCL is a C-like language that abstracts the computational resources of a computer.
OpenCL offers:
● portability between all supported architectures
● combined use of CPU and GPU execution
● compilation of code at runtime
● broad hardware vendor support
kernel void MatrixMultiplication(const global double * a, const global double * b, global double * c, int n)
{
    int LId, GroupId;
    int divcol, divrow; // Number of answers we must get
    double curr;

    LId = get_local_id(0);
    GroupId = get_group_id(0);
    divcol = n/get_local_size(0);  // columns of c per work item
    divrow = n/get_num_groups(0);  // rows of c per work group

    // Memory protection: skip any block that would run past the matrix
    if ((GroupId+1)*divrow > n || (LId+1)*divcol > n)
        return;

    for (int k = 0; k < divrow; k++) {
        for (int j = 0; j < divcol; j++) {
            curr = 0;
            for (int i = 0; i < n; i++)
                curr += a[(GroupId*divrow+k)*n + i] * b[i*n + divcol*LId + j];
            c[(GroupId*divrow+k)*n + divcol*LId + j] = curr;
        }
    }
}
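The way the kernel above divides work can be checked on the CPU. The following is a minimal Python sketch, not the author's code: it assumes n is an exact multiple of both the number of work groups and the work-group size, it runs the work items one after another instead of in parallel, and the names `matmul_work_item` and `matmul_emulated` are hypothetical.

```python
def matmul_work_item(a, b, c, n, group_id, num_groups, local_id, local_size):
    """Emulate one work item: a block of divrow rows by divcol columns of c.

    Matrices are flat row-major lists of length n * n, as in the kernel.
    """
    divrow = n // num_groups   # rows handled by each work group
    divcol = n // local_size   # columns handled by each work item
    for k in range(divrow):
        row = group_id * divrow + k
        for j in range(divcol):
            col = divcol * local_id + j
            curr = 0.0
            for i in range(n):
                curr += a[row * n + i] * b[i * n + col]
            c[row * n + col] = curr


def matmul_emulated(a, b, n, num_groups, local_size):
    """Run every (work group, work item) pair in turn and return c."""
    c = [0.0] * (n * n)
    for group_id in range(num_groups):
        for local_id in range(local_size):
            matmul_work_item(a, b, c, n, group_id, num_groups,
                             local_id, local_size)
    return c
```

Each work item writes a disjoint block of c and only reads a and b, so the order in which work items run does not affect the result.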
Division of Work
Graphics card used
AMD FirePro 7800
● Cost approx. 750 Euro (pre-installed)
● 1 GB of total global memory
● 32 KB per local memory unit
● 64 KB of total constant memory
● 8 KB of private registers per processing element
● 1440 processing elements
● 64 processing elements per SIMD
● 18 compute units
● 400 gigaflops maximum performance
Existing CPU code in C++
● Thoroughly tested on a number of systems (H, He, Mg, etc.)
● Tested over the last ten years
● Uses a NAG propagator
Results for N = 191
[Figure: time (sec) against angular momentum; series: OpenCL WGSZ:64, OpenCL WGSZ:128, OpenCL WGSZ:256, NAG]
Results for N = 391
[Figure: time (sec) against highest angular momentum value; series: OpenCL, NAG propagator]
Further Work
Work will be undertaken to port the implementation to the NVIDIA-specific CUDA so that it can run at the Irish Centre for High-End Computing (ICHEC).
Work will be done to implement more sophisticated methods on the GPU.
END
[Figure: OpenCL and NAG results for N = 191, L = 12]
On OpenCL
● Kernels are functions that are called from regular CPU-based programs (host code).
● Kernels are written in an OpenCL variant of C99.
● Multiple instances of a kernel function are executed by different work items.
● Memory cannot be synchronized globally across all work items except at the start of a new kernel function call.
Work Items
Each work item executes an instance of a kernel.
A work item differs from a thread in that:
● Its instructions should be the same as those of the rest of the work group
● There is no communication between work items outside the work group
Queueing in host code
● A problem can be broken up into tasks divided along synchronization points.
● Each part of a task is then implemented in a kernel function.
● In host code, written in host languages such as C, C++ and Fortran, kernels are queued for execution.
● Other items can also be queued, such as copying of buffers, or reading/writing buffers into host memory.
Synchronization
● Items in a queue execute in order: the next item is guaranteed to run only after the previous one has finished.
● Any changes to memory will be seen by the next item.
● For the Taylor expansion, a synchronization point is required after the calculation of each successive derivative.
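The need for a synchronization point after each derivative can be illustrated with a CPU-side emulation. This is a sketch under stated assumptions, not OpenCL host code: `enqueue` is a hypothetical stand-in for an in-order queue, each work item computes one row of the matrix-vector product, and the equation is taken to be i dC/dt = H C with H fixed over the step.

```python
def derivative_kernel(work_item, h, prev, out, scale):
    """One work item: one element of out = scale * (H applied to prev)."""
    row = h[work_item]
    out[work_item] = scale * sum(h_ij * p_j for h_ij, p_j in zip(row, prev))


def enqueue(kernel, global_size, *args):
    """Stand-in for an in-order queue: run every work item, then return.

    Work items run here in reverse order to stress that no ordering
    between them may be assumed. Returning is the synchronization point.
    """
    for work_item in reversed(range(global_size)):
        kernel(work_item, *args)


def taylor_step_queued(h, c, dt, order=8):
    """One Taylor step, issued as one emulated kernel launch per term."""
    n = len(c)
    prev = list(c)
    result = list(c)
    for k in range(1, order + 1):
        out = [0j] * n   # separate output buffer: prev is read by all rows
        enqueue(derivative_kernel, n, h, prev, out, -1j * dt / k)
        # Only after the launch completes is every element of out valid.
        result = [r + t for r, t in zip(result, out)]
        prev = out
    return result
```

Every row of a term reads the whole previous term, so a launch cannot begin until the preceding one has finished, and writing into a separate buffer keeps work items from overwriting values that others still need.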
Results
[Figure: results plot; series: OpenCL WGSZ:16, OpenCL WGSZ:32, OpenCL WGSZ:64, OpenCL WGSZ:128, OpenCL WGSZ:256, NAG]
GPU Execution Model