![Page 1: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/1.jpg)
Parallel programming technologies on hybrid architectures
SCHOOL ON JINR/CERN GRIDAND ADVANCED INFORMATION SYSTEMS
Dubna, Russia 23, October 2014
SCHOOL ON JINR/CERN GRIDAND ADVANCED INFORMATION SYSTEMS
Dubna, Russia 23, October 2014
Streltsova O.I., Podgainy D.V. Laboratory of Information Technologies
Joint Institute for Nuclear Research
Streltsova O.I., Podgainy D.V. Laboratory of Information Technologies
Joint Institute for Nuclear Research
![Page 2: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/2.jpg)
Goal:Goal: Efficient parallelization of complex numerical problems in computational physics
Plan of the talk:I.Efficient parallelization of complex numerical problems in computational physics•Introduction •Hardware and software •Heat transfer problemII. GIMM FPEIP package and MCTDHB packageIII. Summary and conclusion
Plan of the talk:I.Efficient parallelization of complex numerical problems in computational physics•Introduction •Hardware and software •Heat transfer problemII. GIMM FPEIP package and MCTDHB packageIII. Summary and conclusion
![Page 3: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/3.jpg)
…
TOP500 List – June 2014
![Page 4: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/4.jpg)
Source:http://www.top500.org/blog/slides-for-the-43rd-top500-list-now-available/
TOP500 List – June 2014
![Page 5: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/5.jpg)
Source:http://www.top500.org/blog/slides-for-the-43rd-top500-list-now-available/
TOP500 List – June 2014
![Page 6: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/6.jpg)
«Lomonosov» Supercomputer , MSU
>5000 computation nodesIntel Xeon X5670/X5570/E5630, PowerXCell 8i~36 Gb DRAM2 x nVidia Tesla X2070 6 Gb GDDR5 (448 CUDA-cores)InfiniBand QDR
![Page 7: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/7.jpg)
Custom languages such as CUDA and OpenCL Specifications • 2880 CUDA GPU cores• Peak precision floating
point performance 4.29 TFLOPS single-precision 1.43 TFLOPS double-precision• memory 12 GB GDDR5 Memory bandwidth up to 288 GB/s
NVIDIA Tesla K40 “Atlas” GPU AcceleratorNVIDIA Tesla K40 “Atlas” GPU Accelerator
Supports Dynamic Parallelism and HyperQ features
![Page 8: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/8.jpg)
«Tornado SUSU» Supercomputer, South Ural State University, Russia
480 computing units (compact and powerful computing blade-modules) 960 processors Intel Xeon X5680 (Gulftown, 6 cores with frequency 3.33 GHz) 384 coprocessors Intel Xeon Phi SE10X (61 cores with frequency 1.1 GHz)
«Tornado SUSU» supercomputer took the 157 place in 43-th issue of TOP500 rating (June 2014).
![Page 9: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/9.jpg)
At the end of 2012, Intel launched the first generation of the
Intel Xeon Phi product family.
Intel® Xeon Phi™ Coprocessor Intel® Xeon Phi™ Coprocessor
Intel Xeon Phi 7120PClock Speed 1.24 GHzL2 Cache 30.5 MBTDP 300 WCores 61More threads 244
Intel Many Integrated Core Architecture Intel Many Integrated Core Architecture (Intel MICMIC ) is a multiprocessor computer
architecture developed by Intel.
The core is capable of supporting
4 threads in hardware.
The core is capable of supporting
4 threads in hardware.
![Page 10: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/10.jpg)
HybriLIT: heterogeneous computation cluster
Суперкомпьютер «Ломоносов» МГУ
CICC comprises CICC comprises 25822582 Cores Cores Disk storage capacity Disk storage capacity 18001800 TB TB
August, 2014
Site: http:// hybrilit.jinr.ru
![Page 11: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/11.jpg)
2x Intel Xeon CPUE5-2695v2
3x NVIDIATESLA K40S
2x Intel Xeon CPUE5-2695v2
NVIDIA TESLA K20XIntel Xeon Phi Coprocessor 5110P
2x Intel Xeon CPUE5-2695v2
2x Intel Xeon Phi Coprocessor
7120P
1,21,2
33
44
HybriLIT: heterogeneous computation cluster
![Page 12: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/12.jpg)
• Multiple CPU cores with share memory
• Multiple GPU
• Multiple CPU cores with share memory
• Multiple GPU
What we see: modern Supercomputers are hybrid with heterogeneous nodes
• Multiple CPU cores with share memory
• Multiple Coprocessor
• Multiple CPU cores with share memory
• Multiple Coprocessor
• Multiple CPU• GPU• Coprocessor
• Multiple CPU• GPU• Coprocessor
![Page 13: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/13.jpg)
Parallel technologies: levels of parallelism
In the last decade novel computational technologies and facilities becomes available:
MP-CUDA-Accelerators?...
How to control hybrid hardware: MPI – OpenMP – CUDA - OpenCL ...
#node 1#node 1 #node 2#node 2
![Page 14: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/14.jpg)
In the last decade novel computational facilities and technologies has become available:
MPI-OpenMP-CUDA-OpenCL...
It is not easy to follow modern trends.Modification of the existing codes or developments of new ones ?
![Page 15: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/15.jpg)
Problem HCE: Problem HCE: heat conduction equation heat conduction equation
00
, , , , , 0;
, , , ; , , , 0,t Г
uLu f x y t x y D t
t
u u x y x y D u µ x y t t
( , ) : ,L R L RD D Г x y x x x y y y
Initial boundary value problem for the heat conduction equation:Initial boundary value problem for the heat conduction equation:
• D – rectangular domain with boundary Г : • D – rectangular domain with boundary Г :
2
2
1
1 1
2
,
, , ,
, , .
L L L
uL u K x y t
x xu
L u K x y ty y
![Page 16: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/16.jpg)
Problem HCEProblem HCE: computation scheme : computation scheme
1
2
1 1
2 2
, 0, 1 ,
, 0, 1 ,
,
,
0, 1
x y x y
x
y
h h h h j t
h i L x x
h i y yL
t j j N
x x i h i N
y y i h i N
:x yh h
Locally one-dimensional scheme:reduction of a multidimensional problem to a chain of one-dimensional problemsLocally one-dimensional scheme:reduction of a multidimensional problem to a chain of one-dimensional problems
Let:
Difference scheme: Explicit, implicit, … ?Difference scheme: Explicit, implicit, … ?
9
�௫
�௬
9௫ൈ�9௬
9
![Page 17: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/17.jpg)
Step 1: Difference equations (Ny-2)on x direction
21
1 1
11
1 1
2
2
1 11
11 , , , , ,j j
ji
ix x
v vv v a v a K y tx
1
2
12 1
122 2 2
1
2
2 2 2, , , , , j j
ji
iy y
v vv v a v ya K x t
1 2 1 2
1 21 1
2 2
1 1, , , 1, 2; , , ; ,
2.
2j
i i j i x i yi i
v v x y t x y t x x h y y h
12 1
02
( )
1 2
, , , , , 0, 2,
, ,0 , , , ;
( , , ), 1, 2, ( , ;)
x y
j j t
h h
j aj
v x y t v x y t j N
v x y u x y x y
v x y t x y
f
Step 2: Difference equations (Nx-2)on y direction
under the additional conditions of conjugation,boundary conditions andnormalization condition
Problem HCEProblem HCE: computation scheme : computation scheme
![Page 18: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/18.jpg)
Problem HCEProblem HCE: parallelization scheme: parallelization scheme
Parallel
Parallel
![Page 19: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/19.jpg)
Parallel TechnologiesParallel Technologies
![Page 20: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/20.jpg)
OpenMP realization of parallel algorithmOpenMP realization of parallel algorithm
![Page 21: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/21.jpg)
OpenMP (Open specifications for Multi-Processing)
OpenMP (Open specifications for Multi-Processing) is an API that supports multi-platform shared memory multiprocessing programming in Fortran, C, C++.
• Compiler directives
• Compiler directives
• Environment variables• Environment variables
• Library routines• Library routines
export OMP_NUM_THREADS=33
http://openmp.org/wp/
![Page 22: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/22.jpg)
1. #include <stdio.h>
2. #include <omp.h>
3. 4. int main (int argc, char *argv[]) {
5. const int N = 1000;
6. int i, nthreads; 7. double A[N]; 8. nthreads = omp_get_num_threads(); 9. printf("Number of thread = %d \n, nthreads); 10.
11. #pragma omp parallel for 12. for (i = 0; i < N; i++) {
13. A[i] = function(i); 14. } 15. return 0; 16. }
Compiler directive
Library routines
OpenMP (Open specifications for Multi-Processing)
Use flag -openmp to compile using Intel compilers:icc –openmp code.c –o code
![Page 23: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/23.jpg)
OpenMP realization: OpenMP realization: Multiple CPU cores that share memoryMultiple CPU cores that share memory
Number of threads
Time (sec) Acceleration
1 84.64439 12 46.93157 1,803574 23.46677 3,606996 17.19202 4,923478 14.08791 6,0083
10 12.47396 6,78569
Table 2. OpenMP realization problem 1: execution time and acceleration ( CPU Xeon K100 KIAM RAS)
![Page 24: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/24.jpg)
OpenMP realization: OpenMP realization: Intel® Xeon Phi™ Coprocessor Intel® Xeon Phi™ Coprocessor
Compiling:icc -openmp -O3 -vec-report=3 -mmic algLocal_openmp.cc –o alg_openmp_xphi
Table 3. OpenMP realization: Execution time and Acceleration (Intel Xeon Phi, LIT).
![Page 25: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/25.jpg)
OpenMP realization: OpenMP realization: Intel® Xeon Phi™ CoprocessorIntel® Xeon Phi™ Coprocessor
OptimizationsOptimizations
KMP_ AFFI NI TY compact scatter balanced Execution time (sec) 10.85077 13.24356 10.46549
The KMP_AFFINITY Environment Variable: The Intel® OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable.
The KMP_AFFINITY Environment Variable: The Intel® OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units. The interface is controlled using the KMP_AFFINITY environment variable.
Source: https://software.intel.com/
compact scatter
![Page 26: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/26.jpg)
CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) programming model, CUDA Cprogramming model, CUDA C
![Page 27: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/27.jpg)
CUDACUDA (Compute Unified Device Architecture) (Compute Unified Device Architecture) programming model, CUDA Cprogramming model, CUDA C
Source:http://blog.goldenhelix.com/?p=374
Core 1 Core 2
Core 3 Core 4
CPUGPU
Multiprocessor 1
•••
•
•
•
(192 Cores)
Multiprocessor 2
(192 Cores)
Multiprocessor 14
(192 Cores)
Multiprocessor 15
(192 Cores)
•••
CPU / GPU Architecture
2880 CUDA GPU cores
![Page 28: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/28.jpg)
Source: http://www.realworldtech.com/includes/images/articles/g100-2.gif
CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) programming modelprogramming model
![Page 29: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/29.jpg)
Device Memory HierarchyDevice Memory Hierarchy
Registers are fast, off-chiplocal memory has high latency
Tens of kb per block, on-chip,very fast
Size up to 12 Gb, high latencyRandom access very expensive!Coalesced access much moreefficient
CUDA C Programming Guide (February 2014)
![Page 30: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/30.jpg)
Function Type QualifiersFunction Type Qualifiers
__global__
__host__
CPU
GPU
__global__
__device__
__global__ void kernel ( void ){}
int main{ … kernel <<< gridDim, blockDim >>> ( args ); …}
dim3 gridDim – dimension of grid,dim3 blockDim – dimension of blocks
![Page 31: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/31.jpg)
Threads and blocksThreads and blocks
Thread 0 Thread 1Thread 0 Thread 1Thread 0 Thread 1Thread 0 Thread 1
Block 0Block 1Block 2Block 3
int tid = threadIdx.x + blockIdx.x * blockDim.xint tid = threadIdx.x + blockIdx.x * blockDim.x
tid – index of threads
![Page 32: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/32.jpg)
Scheme program on CUDA C/C++ and C/C++Scheme program on CUDA C/C++ and C/C++
CUDA C / C++
1. Memory allocation
cudaMalloc (void ** devPtr, size_t size); void * malloc (size_t size);
2. Copy variables
cudaMemcpy (void * dst, const void * src, size_t count, enum cudaMemcpyKind kind);
void * memcpy (void * destination,const void * source, size_t num);
copy: host → device, device → host,
host ↔ host, device ↔ device---
3. Function call
kernel <<< gridDim, blockDim >>> (args); double * Function (args);
4. Copy results to host
cudaMemcpy (void * dst, const void * src, size_t count, device → host); ---
![Page 33: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/33.jpg)
nvcc -arch=compute_35 test_CUDA_deviceInfo.cu -o test_CUDA –o deviceInfo
CompilationCompilation
Compilation tools are a part of CUDA SDK•NVIDIA CUDA Compiler Driver NVCC
•Full information http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#axzz37LQKVSFi
![Page 34: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/34.jpg)
NVIDIA cuBLAS
NVIDIA cuRAND NVIDIA
cuSPARSE NVIDIA NPP
Vector SignalImage
Processing
GPU Accelerated
Linear Algebra
Matrix Algebra on GPU and Multicore
NVIDIA cuFFT
C++ STL Features for
CUDAIMSL LibraryBuilding-block
Algorithms for CUDA
Sparse Linear
Algebra
Source: https://developer.nvidia.com/cuda-education. (Will Ramey ,NVIDIA Corporation)
Some GPU-accelerated LibrariesSome GPU-accelerated Libraries
![Page 35: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/35.jpg)
Problem HCEProblem HCE: parallelization scheme: parallelization scheme
Parallel
Parallel
![Page 36: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/36.jpg)
ProblemProblem HCEHCE: CUDA realization: CUDA realizationInitialization: parameters of the problem and the computational scheme are copied in constant memory GPU.Initialization of descriptors: cuSPARSE functions
Calculation of array elements lower, upper and main diagonals and right side of SLAEs (1) :
Kernel_Elements_System_1 <<<blocks, threads>>>()
Parallel solution of (Ny-2) SLAEs in the direction x using
cusparseDgtsvStridedBatch()Calculation of array elements lower, upper and main diagonals and right side of SLAEs (1) :
Kernel_Elements_System_2 <<<blocks, threads>>>()
Parallel solution of (Nx-2) SLAEs in the direction x using
cusparseDgtsvStridedBatch()
![Page 37: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/37.jpg)
Table 1. CUDA realization: Execution time and Acceleration
CUDA realization of parallel algorithm:CUDA realization of parallel algorithm:efficiency of parallelizationefficiency of parallelization
![Page 38: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/38.jpg)
Problem HCE Problem HCE : analysis of results: analysis of results
![Page 39: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/39.jpg)
Hybrid Programming: MPI+CUDA: Hybrid Programming: MPI+CUDA: on the Example of GIMM FPEIP Complexon the Example of GIMM FPEIP Complex
GIMM FPEIP : package developed for simulation of thermal processes in materials irradiated by heavy ion beams
Alexandrov E.I., Amirkhanov I.V., Zemlyanaya E.V., Zrelov P.V., Zuev M.I., Ivanov V.V., Podgainy D.V., Sarker N.R., Sarkhadov I.S., Streltsova O.I., Tukhliev Z. K., Sharipov Z.A. (LIT) Principles of Software Construction for Simulation of Physical Processes on Hybrid Computing Systems (on the Example of GIMM_FPEIP Complex) // Bulletin of Peoples' Friendship University of Russia. Series "Mathematics. Information Sciences. Physics". — 2014. — No 2. — Pp. 197-205.
![Page 40: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/40.jpg)
To solve a system of coupled equations of heat conductivity which are a basis of the thermal spike model in cylindrical coordinate system
GIMM FPEIP : package for simulation of thermal processes GIMM FPEIP : package for simulation of thermal processes in materials irradiated by heavy ion beamsin materials irradiated by heavy ion beams
Multi-GPU
![Page 41: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/41.jpg)
GIMM FPEIP: Logical scheme of the complex GIMM FPEIP: Logical scheme of the complex
![Page 42: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/42.jpg)
Using Multi-GPUsUsing Multi-GPUs
![Page 43: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/43.jpg)
MPI, MPI+CUDA ( CICC LIT, MPI, MPI+CUDA ( CICC LIT, К100К100 KIAM KIAM))
![Page 44: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/44.jpg)
Hybrid Programming: Hybrid Programming: MPI+OpenMP, MPI+OpenMP+CUDA MPI+OpenMP, MPI+OpenMP+CUDA
The The MMultiultiCConfigurationalonfigurationalTTtimetimeDDependnetependnetHHartree (for) artree (for) BBosonsosons method: method: PRL PRL 9999, 030402 (2007), PRA , 030402 (2007), PRA 7777, 033613 (2008), 033613 (2008)
It solves TDSE It solves TDSE numerically exactly numerically exactly – see for benchmarking – see for benchmarking PRA PRA 8686, 063606 , 063606 (2012) (2012)
MMultiultiCConfigurational onfigurational TTtime time DDependnet ependnet HHartree (for) artree (for) BBosonsosons
MCTDHB founders:Lorenz S. Cederbaum, Ofir E. Alon, Alexej I. StreltsovAlexej I. Streltsov
MCTDHB founders:Lorenz S. Cederbaum, Ofir E. Alon, Alexej I. StreltsovAlexej I. Streltsov
Since 2013 cooperation with LITLIT: the development of new hybrid implementations package
Ideas, methods, and parallel implementation of the MCTDHB package:Many-body theory of bosons Many-body theory of bosons group in group in Heidelberg, GermanyHeidelberg, Germany
http://MCTDHB.org
Ideas, methods, and parallel implementation of the MCTDHB package:Many-body theory of bosons Many-body theory of bosons group in group in Heidelberg, GermanyHeidelberg, Germany
http://MCTDHB.org
![Page 45: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/45.jpg)
Time-Dependent SchrTime-Dependent Schrödinger equation governs ödinger equation governs the physics of trapped ultra-cold atomic cloudsthe physics of trapped ultra-cold atomic clouds
To solve the Time-Dependent Many-Boson SchrTo solve the Time-Dependent Many-Boson Schröödinger Equation dinger Equation
we apply the we apply the MMultiultiCConfigurationalonfigurationalTTtimetimeDDependnetependnetHHartree (for) artree (for) BBosonsosons method: method:
PRL PRL 9999, 030402 (2007), PRA , 030402 (2007), PRA 7777, 033613 (2008), 033613 (2008)It solves TDSE It solves TDSE numerically exactly numerically exactly – see for benchmarking – see for benchmarking PRA PRA 8686, 063606 (2012) , 063606 (2012)
ˆ( , ) ( , )i t tt
x = H x
One has to specify initial condition
and propagate ((xx,,tt))→ → ((xx,,tt + +tt) )
20
1
1ˆ ( ; ) ( , ; )2 i
N N
r i i ji i j
V r W r rm
t t
H
( , )0 0,r r rt t x
![Page 46: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/46.jpg)
,r t V( )
2 2 2 2 2 21 1 12 2 2, , x y zx y z m x m y m z V( )
All the terms of the Hamiltonian are under All the terms of the Hamiltonian are under experimental experimental control and can be manipulatedcontrol and can be manipulated
1D-2D-3D: Control on dimensionality by changing the aspect ratio of the 1D-2D-3D: Control on dimensionality by changing the aspect ratio of the trap trap
20
1
1ˆ ( ; ) ( , ; )2 i
N N
r i i ji i j
V r W r rm
t t
H
BECs of alkaline, alkaline earth, and lanthanoid atoms (7Li, 23Na, 39K, 41K, 85Rb, 87Rb, 133Cs, 52Cr, 40Ca, 84Sr, 86Sr, 88Sr, 174Yb,164Dy,
and 168Er )
The interatomic interaction can be widely varied with a magnetic Feshbach resonance… (Greiner Lab at Harvard. )
Magneto-optical trap
![Page 47: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/47.jpg)
2 2312 2, 1.5, 0.5x y x y x y V( ) V( )
( , ) 40eff
iir tV
0 0.5 0.1 0 0.5 0.7
0 0.5 0.8 ( , )
eff
RRr tV
( , )eff
LLr tV
2( , )LL r t
2( , )RR r t
Two generic rgimes: (i) non-violent (under-a-barrier) and (ii) Explosive (over-a-barrier)
Two generic regimes: (i) non-violent (under-a-barrier) and (ii) Explosive (over-a-barrier)
Dynamics N=100: sudden displacement of trap and sudden quenches of the repulsion in 2D
arXiv:1312.6174
![Page 48: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/48.jpg)
List of Applications• Modern development of computer technologies Modern development of computer technologies
(multi-core processors, GPU , coprocessors and (multi-core processors, GPU , coprocessors and
other) require the development of new approaches other) require the development of new approaches
and technologies for parallel programming. and technologies for parallel programming.
• Effective use of high performance computing Effective use of high performance computing
systems allow accelerating of researches, systems allow accelerating of researches,
engineering development and creation of a engineering development and creation of a
specific device.specific device.
Conclusion
![Page 49: Parallel programming technologies on hybrid architectures SCHOOL ON JINR/CERN GRID AND ADVANCED INFORMATION SYSTEMS Dubna, Russia 23, October 2014 SCHOOL](https://reader036.vdocument.in/reader036/viewer/2022062409/56649cb15503460f94976158/html5/thumbnails/49.jpg)
Thank you for attention!