Xiaonan (Daniel) Tian, March 26, 2018

MANAGING MEMORY OF COMPLEX AGGREGATE DATA STRUCTURES IN OPENACC



OPENACC: A DIRECTIVE-BASED APPROACH

Rich Set of Data Directives

Two Offload Region Constructs: parallel and kernels

Three Levels of Parallelism: gang, worker and vector

Serial program (CPU):

    Program myscience
      ... serial code ...
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
      ...
    End Program myscience

With OpenACC compiler directives (CPU and GPU):

    Program myscience
      ... serial code ...
      !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
      !$acc end kernels
      ...
    End Program myscience

Data directive example in C:

    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
      #pragma acc kernels loop independent
      for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
      }
    }
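The C fragment on this slide can be completed into a small function. This is a minimal sketch: the function name vec_add and the argument check are assumptions added for illustration, not part of the original slide. When the file is built without OpenACC support the pragmas are ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise sum c = a + b (vec_add is a hypothetical name). */
void vec_add(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n >= 0);

    /* Data region: a and b are copied to the device on entry,
       c is copied back to the host on exit. */
    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        /* Offload the loop; "independent" tells the compiler the
           iterations carry no dependences on each other. */
        #pragma acc kernels loop independent
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```

A caller passes three arrays of length n, for example vec_add(a, b, c, n); the result is the same whether the loop runs on the host or on the device.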

WHAT IS DEEP COPY?

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type

    type(dp1) :: b1
    ! b1 is initialized
    !$acc data copy(b1)
    !$acc end data

[Figure: shallow copy of b1 (members x, y, len) from CPU memory (multi-core CPU) to GPU memory (GPU).]
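The difference between a shallow and a deep copy can be shown on the host alone. The sketch below is an analogy in plain C, not OpenACC code: the struct mirrors dp1 from the slides, while shallow_copy, deep_copy and free_dp1 are hypothetical helper names. On a GPU without unified memory the shallow case is worse than shown here, because the copied pointers would hold host addresses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    float *x, *y;
    int len;
} dp1;

/* Shallow copy: duplicates the pointer values only, so the copy
   and the original share the same x and y arrays. */
dp1 shallow_copy(const dp1 *src)
{
    return *src;
}

/* Deep copy: allocates new arrays and copies the pointed-to data.
   Returns 0 on success, -1 if an allocation fails. */
int deep_copy(dp1 *dst, const dp1 *src)
{
    assert(src != NULL && dst != NULL && src->len > 0);
    size_t bytes = (size_t)src->len * sizeof(float);

    dst->len = src->len;
    dst->x = malloc(bytes);
    dst->y = malloc(bytes);
    if (dst->x == NULL || dst->y == NULL) {
        free(dst->x);
        free(dst->y);
        dst->x = dst->y = NULL;
        return -1;
    }
    memcpy(dst->x, src->x, bytes);
    memcpy(dst->y, src->y, bytes);
    return 0;
}

/* Releases the arrays owned by a deep copy. */
void free_dp1(dp1 *b)
{
    free(b->x);
    free(b->y);
    b->x = b->y = NULL;
    b->len = 0;
}
```

After a shallow copy, writing through the copy changes the original arrays; after a deep copy, the two structures are independent.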

WHAT WILL HAPPEN?

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    type(dp1) :: b1
    …
    ! assume b1 is initialized
    …
    !$acc parallel loop copy(b1)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do

[Figure: b1 (members x, y, len) in CPU memory (multi-core CPU) and its copy in GPU memory (GPU). Result: Crash.]

OPENACC 2.5

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    ! x1 and y1 are initialized
    b1%x => x1
    b1%y => y1
    !$acc data create(b1)
    !$acc parallel loop copyin(b1%x) copyout(b1%y)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    …

[Figure: b1 (members x, y, len) in CPU memory and GPU memory. Memory layout of dp1: X descriptor, Y descriptor, len. Arrays x1 and y1, referenced through b1%x and b1%y. Result: Crash.]

MANUAL DEEP COPY IN OPENACC 2.6

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    !$acc enter data copyin(x1) create(y1)
    …
    b1%x => x1
    b1%y => y1
    !$acc data create(b1)
    !$acc parallel loop attach(b1%x, b1%y)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    !$acc exit data copyout(y1) delete(x1)
    …

[Figure: b1 (members x, y, len) in CPU memory and GPU memory. Memory layout of dp1: X descriptor, Y descriptor, len. Arrays x1 and y1 shown on both sides. Result: PASSED.]


MANUAL DEEP COPY IN OPENACC 2.6

Using the attach clause:

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    !$acc enter data create(x1, y1)
    …
    b1%x => x1
    b1%y => y1
    !$acc data create(b1)
    !$acc parallel loop attach(b1%x, b1%y)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    !$acc exit data copyout(y1) delete(x1)
    …

Using copyin and copyout on the members:

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    !$acc enter data copyin(x1) create(y1)
    …
    b1%x => x1
    b1%y => y1
    !$acc data create(b1)
    !$acc parallel loop copyin(b1%x) copyout(b1%y)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    !$acc exit data copyout(y1) delete(x1)
    …

MANUAL DEEP COPY IN OPENACC 2.6

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    b1%x => x1
    b1%y => y1
    !$acc data create(b1)
    !$acc parallel loop copyin(b1%x) copyout(b1%y)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    …

MANUAL DEEP COPY FOR C/C++

    struct dp1 {
      float *x, *y;
      int len;
    };

    dp1 b1;

    #pragma acc data copyin(b1)
    {
      #pragma acc parallel loop copyin(b1.x[0:n]) copyout(b1.y[0:n])
      for (i = 0; i < n; i++) {
        b1.y[i] = b1.x[i] + 1.0;
      }
    }
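The same manual deep copy can also be written with unstructured enter data and exit data directives, which is convenient when allocation and computation live in different routines. The sketch below assumes OpenACC 2.6 attach semantics as described on the earlier slides: copying a member array in while its parent struct is already present attaches the device pointer. The function name add_one is hypothetical. Built without OpenACC support, the pragmas are ignored and the loop runs on the host.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float *x, *y;
    int len;
} dp1;

/* Computes y[i] = x[i] + 1 using manual deep copy
   (add_one is a hypothetical name). */
void add_one(dp1 *b)
{
    assert(b != NULL && b->x != NULL && b->y != NULL && b->len >= 0);
    int n = b->len;

    /* 1. Shallow copy of the struct: the device copy still holds
          host addresses in x and y at this point. */
    #pragma acc enter data copyin(b[0:1])

    /* 2. Move the member arrays. Because b is already present, the
          device pointers b->x and b->y are attached to the device
          copies of the arrays. */
    #pragma acc enter data copyin(b->x[0:n]) create(b->y[0:n])

    #pragma acc parallel loop present(b[0:1])
    for (int i = 0; i < n; i++) {
        b->y[i] = b->x[i] + 1.0f;
    }

    /* 3. Tear down in reverse order: members first, then the struct. */
    #pragma acc exit data copyout(b->y[0:n]) delete(b->x[0:n])
    #pragma acc exit data delete(b[0:1])
}
```

The order matters in both directions: the parent is made present before its members, and the members are released before the parent.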

OPENACC 2.6 MANUAL DEEPCOPY

VASP: managing one aggregate data structure

Derived Type 1 members: 3 dynamic, 1 derived type 2
Derived Type 2 members: 21 dynamic, 1 derived type 3, 1 derived type 4
Derived Type 3 members: only static
Derived Type 4 members: 8 dynamic, 4 derived type 5, 2 derived type 6
Derived Type 5 members: 3 dynamic
Derived Type 6 members: 8 dynamic

    !$acc data copyin(array1)
    call my_copyin(array1)

Copy-in code per derived type: 48, 12, 26, 8 and 13 lines of code

-> 107 lines of code just for COPYIN

Plus additional lines of code for COPYOUT, CREATE, UPDATE
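To see why the line count grows so quickly, consider a much smaller nested structure. The sketch below is illustrative only: inner_t, outer_t and the two copy-in routines are hypothetical and far simpler than the VASP types. Each routine returns the number of dynamic members it moved, purely as bookkeeping for this example. Every derived type needs its own routine, and matching routines would be needed for copyout, create and update.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double *w;       /* dynamic member */
    int nw;
} inner_t;

typedef struct {
    double *v;       /* dynamic member */
    int nv;
    inner_t *parts;  /* dynamic array of a nested type */
    int nparts;
} outer_t;

/* Copy-in routine for the nested type: one directive per dynamic
   member. Assumes the enclosing inner_t is already present. */
int inner_copyin(inner_t *p)
{
    assert(p != NULL && p->nw >= 0);
    #pragma acc enter data copyin(p->w[0:p->nw])
    return 1;
}

/* Copy-in routine for the outer type: parent first, then its
   dynamic members, then each nested element in turn. */
int outer_copyin(outer_t *o)
{
    assert(o != NULL && o->nv >= 0 && o->nparts >= 0);
    int moved = 0;

    #pragma acc enter data copyin(o[0:1])
    #pragma acc enter data copyin(o->v[0:o->nv])
    moved++;
    #pragma acc enter data copyin(o->parts[0:o->nparts])
    moved++;

    for (int i = 0; i < o->nparts; i++) {
        moved += inner_copyin(&o->parts[i]);
    }
    return moved;
}
```

With two nested elements, outer_copyin moves four dynamic members. A structure with dozens of dynamic members per type, as on the slide, needs a directive for each of them.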

FULL DEEP COPY

Manual deep copy:

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    …
    b1%x => x1
    b1%y => y1
    …
    !$acc data create(b1)
    !$acc parallel loop copyin(b1%x) copyout(b1%y)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    …

Full deep copy:

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    …
    b1%x => x1
    b1%y => y1
    …
    !$acc data create(b1)
    !$acc parallel loop copy(b1)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    …

FULL DEEP COPY IN PGI

    pgfortran -ta=tesla:deepcopy a.f90

    type dp1
      real, pointer :: x(:)
      real, pointer :: y(:)
      integer :: len
    end type
    …
    real, target, allocatable :: x1(:), y1(:)
    type(dp1) :: b1
    …
    …
    b1%x => x1
    b1%y => y1
    …
    !$acc data create(b1)
    !$acc parallel loop copy(b1)
    do i = 1, n
      b1%y(i) = b1%x(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
    …

WHY FULL DEEP COPY IN ICON

Derived Type 1 members: 1 dynamic, 4 members with Derived Type 2, Derived Type 3, Derived Type 4, Derived Type 5

Derived Type 2 members: 10 dynamic, derived type 6, derived type 7
    Derived Type 6 members: 2 dynamic
    Derived Type 7 members: 1 dynamic

Derived Type 3 members: 82 dynamic, derived type 6, derived type 8
    Derived Type 6 members: 2 dynamic
    Derived Type 8 members: 3 dynamic

Derived Type 4 members: 2 dynamic

Derived Type 3 members: 88 dynamic, derived type 6, derived type 8
    Derived Type 6 members: 2 dynamic
    Derived Type 8 members: 1 dynamic

FULL DEEP COPY FOR C/C++

    typedef struct {
      float *x, *y;
      int len;
    } dp1;

    dp1 b1;

    #pragma acc data copy(b1)
    {
      #pragma acc parallel loop
      for (i = 0; i < b1.len; i++) {
        b1.y[i] = b1.x[i] + 1.0;
      }
    }

    #pragma acc shape(x[0:len], y[0:len])

STUDYING: SHAPE DIRECTIVE

    struct/union/class name1 {
      #pragma acc shape(member1[lbnd:length][…], …)
    };


FULL DEEP COPY FOR C/C++

    typedef struct {
      float *x, *y;
      int len;
      #pragma acc shape(x[0:len], y[0:len])
    } dp1;
    …
    dp1 b1;
    …
    #pragma acc data copy(b1)
    {
      #pragma acc parallel loop
      for (i = 0; i < b1.len; i++) {
        b1.y[i] = b1.x[i] + 1.0;
      }
    }

[Figure: b1 (members x, y, len) in CPU memory and GPU memory, with the arrays b1.x and b1.y. Elapsed-time bar with the segments: Copy in y, Copy out x, Computing, Copy out b1.]

STUDYING: POLICY DIRECTIVE

    struct/union/class name1 {
      #pragma acc shape(member1[lbnd:length][…], …)
      #pragma acc policy(ident) data-clauses
    };

POLICY DIRECTIVE

    typedef struct {
      float *x, *y;
      int len;
      #pragma acc shape(x[0:len], y[0:len])
      #pragma acc policy(xiyo) copyin(x) copyout(y)
    } dp1;
    …
    dp1 b1;
    …
    #pragma acc data copyin<xiyo>(b1)
    {
      #pragma acc parallel loop
      for (i = 0; i < b1.len; i++) {
        b1.y[i] = b1.x[i] + 1.0;
      }
    }

[Figure: execution time of deep copy with policy (optimized) compared with execution time of full deep copy.]
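A rough transfer-volume model shows where the saving comes from. The sketch below rests on stated assumptions: it counts only the bytes of the x and y arrays, ignores the struct itself, transfer latency and compute time, and the two function names are hypothetical. The slides give no measured numbers, so none are reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved for x and y under copy(b1) with full deep copy:
   both arrays travel to the device and both travel back. */
size_t full_deep_copy_bytes(int len)
{
    assert(len >= 0);
    size_t one_array = (size_t)len * sizeof(float);
    return 2 * one_array    /* x and y, host to device */
         + 2 * one_array;   /* x and y, device to host */
}

/* Bytes moved under the xiyo policy: copyin(x) copyout(y). */
size_t xiyo_policy_bytes(int len)
{
    assert(len >= 0);
    size_t one_array = (size_t)len * sizeof(float);
    return one_array        /* x, host to device */
         + one_array;       /* y, device to host */
}
```

Under this model the policy halves the array traffic, because x is never copied back and y is never copied in.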


UNIFIED MEMORY

➢ CUDA Unified Memory

➢ HMM (Heterogeneous Memory Management)

Why do we still need Deep Copy?


SUMMARY

➢ Manual deep copy support

➢ Full deep copy support in PGI Fortran compiler

➢ New directives for fine tuning deep copy support

• Shape Directive

• Policy Directive

• Data clause extension to use shape and policy

    pgfortran -ta=tesla:deepcopy …