Xiaonan (Daniel) Tian March 26, 2018
MANAGING MEMORY OF COMPLEX AGGREGATE DATA STRUCTURES IN OPENACC
OPENACC: A DIRECTIVE-BASED APPROACH
Rich Set of Data Directives
Two Offload Region Constructs: parallel and kernels
Three Levels of Parallelism: gang, worker and vector
Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience
[Figure labels: GPU, CPU]
Program myscience
   ... serial code ...
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
   ...
End Program myscience
[Figure label: OpenACC Compiler Directives]
#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc kernels loop independent
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
WHAT IS DEEP COPY?
type dp1
real, pointer :: x( : )
real, pointer :: y( : )
integer :: len
end type
…
TYPE(dp1) :: b1
…
! b1 is initialized
!$acc data copy(b1)
…
!$acc end data
[Figure: shallow copy of b1 (members x, y, len) from CPU memory (multi-core CPU) to GPU memory (GPU)]
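The difference between a shallow and a deep copy can be illustrated on the host alone. The following is a minimal sketch in plain C, not OpenACC code: it assumes a C struct that mirrors the Fortran type dp1, and the function names dp1_shallow_copy and dp1_deep_copy are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed C counterpart of the Fortran type dp1. */
typedef struct {
    float *x;
    float *y;
    int len;
} dp1;

/* Shallow copy: duplicates the pointer values and len, not the arrays.
   The copy and the original share the same x and y storage. */
dp1 dp1_shallow_copy(const dp1 *src)
{
    assert(src != NULL);
    return *src;
}

/* Deep copy: also duplicates the arrays that x and y point to.
   Returns 0 on success, -1 if an allocation fails. */
int dp1_deep_copy(dp1 *dst, const dp1 *src)
{
    assert(dst != NULL && src != NULL && src->len >= 0);
    size_t bytes = (size_t)src->len * sizeof(float);

    dst->len = src->len;
    dst->x = malloc(bytes);
    dst->y = malloc(bytes);
    if (bytes > 0 && (dst->x == NULL || dst->y == NULL)) {
        free(dst->x);
        free(dst->y);
        dst->x = dst->y = NULL;
        return -1;
    }
    if (bytes > 0) {
        memcpy(dst->x, src->x, bytes);
        memcpy(dst->y, src->y, bytes);
    }
    return 0;
}
```

After a shallow copy, a write through the original's x is visible through the copy; after a deep copy it is not. Across a CPU/GPU boundary the shared pointers of a shallow copy are host addresses, which the device cannot dereference.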
WHAT WILL HAPPEN?
type dp1
   real, pointer :: x( : )
   real, pointer :: y( : )
   integer :: len
end type
…
type(dp1) :: b1
…
! assume b1 is initialized
…
!$acc parallel loop copy(b1)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
[Figure: b1 (x, y, len) in CPU memory and GPU memory. Result: crash.]
copy(b1) is a shallow copy: the device copy of b1 still holds the host addresses of x and y, so dereferencing them on the GPU crashes.
OPENACC 2.5
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
! x1 and y1 are initialized
b1%x => x1
b1%y => y1
!$acc data create(b1)
!$acc parallel loop copyin(b1%x) copyout(b1%y)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
…
[Figure: memory layout of dp1 (x descriptor, y descriptor, len). b1%x and b1%y point to x1 and y1 in CPU memory; b1 exists in both CPU memory and GPU memory. Result: crash.]
MANUAL DEEP COPY IN OPENACC 2.6
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
!$acc enter data copyin(x1) create(y1)
…
b1%x => x1
b1%y => y1
!$acc data create(b1)
!$acc parallel loop attach(b1%x, b1%y)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
!$acc exit data copyout(y1) delete(x1)
…
[Figure: memory layout of dp1 (x descriptor, y descriptor, len). x1 and y1 exist in both CPU memory and GPU memory; on the GPU, b1%x and b1%y are attached to the device copies of x1 and y1. Result: passed.]
MANUAL DEEP COPY IN OPENACC 2.6
Using the attach clause:
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
!$acc enter data create(x1, y1)
…
b1%x => x1
b1%y => y1
!$acc data create(b1)
!$acc parallel loop attach(b1%x, b1%y)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
!$acc exit data copyout(y1) delete(x1)
…
Using copyin and copyout clauses on the members:
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
!$acc enter data copyin(x1) create(y1)
…
b1%x => x1
b1%y => y1
!$acc data create(b1)
!$acc parallel loop copyin(b1%x) copyout(b1%y)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
!$acc exit data copyout(y1) delete(x1)
…
MANUAL DEEP COPY IN OPENACC 2.6
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
b1%x => x1
b1%y => y1
!$acc data create(b1)
!$acc parallel loop copyin(b1%x) copyout(b1%y)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
…
MANUAL DEEP COPY FOR C/C++
typedef struct {
    float *x, *y;
    int len;
} dp1;
…
dp1 b1;
…
#pragma acc data copyin(b1)
{
    #pragma acc parallel loop copyin(b1.x[0:n]) copyout(b1.y[0:n])
    for (i = 0; i < n; i++) {
        b1.y[i] = b1.x[i] + 1.0;
    }
}
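The same manual deep copy can also be written with unstructured data directives, in the spirit of the Fortran enter data / exit data listings. The sketch below is one possible arrangement, assuming OpenACC 2.6 attach semantics (copying a member while its parent struct is already present also attaches the device pointer). The function name dp1_add_one is hypothetical. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float *x, *y;
    int len;
} dp1;

/* Computes y[i] = x[i] + 1 for every element of b1.
   b1 is passed by value; its x and y still refer to the caller's arrays. */
void dp1_add_one(dp1 b1)
{
    assert(b1.x != NULL && b1.y != NULL && b1.len >= 0);
    int n = b1.len;

    /* Step 1: shallow copy of the struct itself (pointers and len). */
    #pragma acc enter data copyin(b1)
    /* Step 2: move the member arrays. Because b1 is already present,
       OpenACC 2.6 also attaches the device pointers inside b1. */
    #pragma acc enter data copyin(b1.x[0:n]) create(b1.y[0:n])

    #pragma acc parallel loop present(b1)
    for (int i = 0; i < n; i++) {
        b1.y[i] = b1.x[i] + 1.0f;
    }

    /* Step 3: undo in reverse order, members first, then the struct. */
    #pragma acc exit data copyout(b1.y[0:n]) delete(b1.x[0:n])
    #pragma acc exit data delete(b1)
}
```

The ordering is the part that matters: the parent goes to the device before its members and leaves after them, so each pointer can be attached and detached against a struct that is still present.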
OPENACC 2.6 MANUAL DEEPCOPY
VASP: managing one aggregate data structure
Derived Type 1 members: 3 dynamic, 1 derived type 2
Derived Type 2 members: 21 dynamic, 1 derived type 3, 1 derived type 4
Derived Type 3 members: only static
Derived Type 4 members: 8 dynamic, 4 derived type 5, 2 derived type 6
Derived Type 5 members: 3 dynamic
Derived Type 6 members: 8 dynamic
!$acc data copyin(array1)
call my_copyin(array1)
-> 48 + 12 + 26 + 8 + 13 lines of code
-> 107 lines of code just for COPYIN
Plus additional lines of code for COPYOUT, CREATE, UPDATE
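To see why the line count grows so quickly, consider what a hand-written copy-in routine has to do for even a small nested aggregate. The sketch below uses two hypothetical C types, inner_t and outer_t, and hypothetical routines inner_copyin and outer_copyin; none of them come from VASP. It assumes OpenACC 2.6 attach semantics, and each routine returns the number of data directives it issued so the cost is visible on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-level aggregate, for illustration only. */
typedef struct {
    int n;
    double *v;        /* dynamic member */
} inner_t;

typedef struct {
    int n;
    double *a;        /* dynamic member */
    double *b;        /* dynamic member */
    inner_t child;    /* nested aggregate with its own dynamic member */
} outer_t;

/* Copy in the dynamic member of an inner_t whose enclosing struct is
   already present on the device. Returns the number of directives issued. */
int inner_copyin(inner_t *c)
{
    assert(c != NULL && c->n >= 0);
    #pragma acc enter data copyin(c->v[0:c->n])
    return 1;
}

/* Copy in an outer_t: the struct first (shallow), then every dynamic
   member, then the nested type. Returns the number of directives issued. */
int outer_copyin(outer_t *p)
{
    assert(p != NULL && p->n >= 0);
    int directives = 0;

    #pragma acc enter data copyin(p[0:1])
    directives++;
    #pragma acc enter data copyin(p->a[0:p->n])
    directives++;
    #pragma acc enter data copyin(p->b[0:p->n])
    directives++;
    directives += inner_copyin(&p->child);
    return directives;
}
```

Every dynamic member costs at least one directive, and every nested type needs its own routine. Matching routines for copyout, create and update repeat the same traversal, which is how a structure with dozens of dynamic members reaches the totals quoted on the slide.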
FULL DEEP COPY
Manual deep copy:
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
…
b1%x => x1
b1%y => y1
…
!$acc data create(b1)
!$acc parallel loop copyin(b1%x) copyout(b1%y)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
…
Full deep copy:
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
…
b1%x => x1
b1%y => y1
…
!$acc data create(b1)
!$acc parallel loop copy(b1)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
…
FULL DEEP COPY IN PGI
pgfortran -ta=tesla:deepcopy a.f90
type dp1
   real, pointer :: x(:)
   real, pointer :: y(:)
   integer :: len
end type
…
real, target, allocatable :: x1(:), y1(:)
type(dp1) :: b1
…
…
b1%x => x1
b1%y => y1
…
!$acc data create(b1)
!$acc parallel loop copy(b1)
do i = 1, n
   b1%y(i) = b1%x(i) + 1.0
end do
!$acc end parallel loop
!$acc end data
…
WHY FULL DEEP COPY IN ICON
Derived Type 1 members: 1 dynamic, 4 members with Derived Type 2, Derived Type 3, Derived Type 4, Derived Type 5
   Derived Type 2 members: 10 dynamic, derived type 6, derived type 7
      Derived Type 6 members: 2 dynamic
      Derived Type 7 members: 1 dynamic
   Derived Type 3 members: 82 dynamic, derived type 6, derived type 8
      Derived Type 6 members: 2 dynamic
      Derived Type 8 members: 3 dynamic
   Derived Type 4 members: 2 dynamic
   Derived Type 3 members: 88 dynamic, derived type 6, derived type 8
      Derived Type 6 members: 2 dynamic
      Derived Type 8 members: 1 dynamic
FULL DEEP COPY FOR C/C++
typedef struct {
    float *x, *y;
    int len;
} dp1;
…
dp1 b1;
…
#pragma acc data copy(b1)
{
    #pragma acc parallel loop
    for (i = 0; i < b1.len; i++) {
        b1.y[i] = b1.x[i] + 1.0;
    }
}
#pragma acc shape(x[0:len], y[0:len])
STUDYING: SHAPE DIRECTIVE
struct/union/class name1{
…
#pragma acc shape(member1 [lbnd : length][…], …)
…
};
FULL DEEP COPY FOR C/C++
typedef struct {
    float *x, *y;
    int len;
    #pragma acc shape(x[0:len], y[0:len])
} dp1;
…
dp1 b1;
…
#pragma acc data copy(b1)
{
    #pragma acc parallel loop
    for (i = 0; i < b1.len; i++) {
        b1.y[i] = b1.x[i] + 1.0;
    }
}
[Figure: b1 (x, y, len) with b1.x and b1.y in both CPU memory and GPU memory. Elapsed-time timeline labels: Copy in y, Copy out x, Computing, Copy out b1.]
STUDYING: POLICY DIRECTIVE
struct/union/class name1{
…
#pragma acc shape(member1 [lbnd : length][…], …)
…
#pragma acc policy(ident) data-clauses
};
POLICY DIRECTIVE
typedef struct {
    float *x, *y;
    int len;
    #pragma acc shape(x[0:len], y[0:len])
    #pragma acc policy(xiyo) copyin(x) copyout(y)
} dp1;
…
dp1 b1;
…
#pragma acc data copyin<xiyo>(b1)
{
    #pragma acc parallel loop
    for (i = 0; i < b1.len; i++) {
        b1.y[i] = b1.x[i] + 1.0;
    }
}
[Figure: execution time of full deep copy compared with the shorter (optimized) execution time of deep copy with policy]
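A rough way to see where the saving comes from is to count the bytes each approach moves. The sketch below is a back-of-the-envelope model, not a measurement and not part of any OpenACC API. It assumes that copy(b1) with full deep copy transfers the struct, x and y in both directions, and that the xiyo policy transfers the struct and x to the device and only y back. The names traffic_t, full_deep_copy_traffic and policy_xiyo_traffic are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float *x, *y;
    int len;
} dp1;

/* Bytes moved in each direction for one data region. */
typedef struct {
    size_t to_device;
    size_t to_host;
} traffic_t;

/* Assumed model of copy(b1) with full deep copy:
   the struct, x and y travel in both directions. */
traffic_t full_deep_copy_traffic(int len)
{
    assert(len >= 0);
    size_t arr = (size_t)len * sizeof(float);
    traffic_t t = { sizeof(dp1) + 2 * arr, sizeof(dp1) + 2 * arr };
    return t;
}

/* Assumed model of the xiyo policy (copyin(x) copyout(y)):
   the struct and x go to the device, only y comes back. */
traffic_t policy_xiyo_traffic(int len)
{
    assert(len >= 0);
    size_t arr = (size_t)len * sizeof(float);
    traffic_t t = { sizeof(dp1) + arr, arr };
    return t;
}
```

Under these assumptions the policy roughly halves the array traffic for this loop, because x is never copied back and y is never copied in. Actual timings depend on the compiler, the interconnect and the array sizes.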
UNIFIED MEMORY
➢ CUDA Unified Memory
➢ HMM (Heterogeneous Memory Management)
Why do we still need Deep Copy?