new poster p4225 xiaonan tian: [email protected] openuh an open … · 2014. 3. 4. · xiaonan tian:...

j

i

j

i

j

i

j

i

j

i i (f)

j(a) (b) (c)

(d) (e)

8

2O

22

24

26

28

3O

32

34

36

Tim

e(s

)Ti

me

(s)

More kernels

acc_shutdown() (cleanup the context)

No

Tim

e(s

)

Tim

e(s

)Ti

me

(s)

Tim

e(s

)

Total Time Kernel Time Total TimeDGEMM Laplacian

Fig.7: Performance Comparison between PGI and OpenUH

PGI-OOPGI-O3

OpenUH

PGI-OOPGI-O3

OpenUH

12

14

16

20

18

22

24

26

28

32

30

Jacobi DGEMM Gaussblur

Benchmark

Map-g-vMap-gv-vMap-g-gv

Map-gv-gv

acc_init() (setup the context)

remaining datain data clause

Yes

Is data in

the map

Allocate devicememory for this data,and put it in the map

Copy this data fromhost to device

Move to the nextdata clause

Setup threads topology

Push kernel arguments

Load and launch kernel

Has reduction

Launch reductionalgorithm kernel

Copy result data fromdevice to host

Yes

No

Yes

No

Yes

No

OpenUH Compiler Infrastructure

FRONTENDS (C/C++,F90,OpenMP,OpenACC)

IPA(Inter Procedural Analyzer)

PRELOWER(Preprocess OpenACC)

LNO(Loop Nest Optimizer)

LOWER(Transformation of OpenACC)

WOPT(Global Scalar Optimizer)

WHIRL2C & WHIRL2CUDA(IR-to-source for other targets)

CG(Code for IA-32,IA-64,X86_64)

Source with OpenACC Directives

CPU Code

General CPU Compiler

GPU Code

NVCCCompiler

PTXAssembler

Loaded Dynamically

CPU Binary

Runtime LibraryLinker

Executable

13.4

13.3

13.2

13.1

13

12.9

12.8

12.7

12.6

13.5

13.6

Jacobi

PGI-OOPGI-O3

OpenUH

PGI-OOPGI-O3

OpenUH

Kernel TimeTotal Time

1

10

100PGI-O0 PGI-O3

OpenUH

PGI-O0PGI-O3

OpenUH

Kernel Time

O

1

2

3

4

5

Stencil

PGI-OOPGI-O3

Open

PGI-OOPGI-O3

Open

Kernel TimeTotal Time

0.1

1

10

100

1000

Stencil Laplacian Wave13pt

Benchmark

Map-g-gv-vMap-v-gv-gv Map-v-gv-g

OpenUH – An Open Source OpenACC Compiler

XiaonanTian, Rengan Xu,YonghongYan, ZhifengYun, Sunita Chandrasekaran, Barbara ChapmanDepartment of Computer Science,University of Houston

Email: {xtian2, rxu6, yyan3, zyun, schandrasekaran, bchapman}@uh.edu , http://web.cs.uh.edu/~openuh

Introduction

Loops Transformation

● OpenACC is an emerging directive-based programmingmodel for programming accelerators that typically enablenon-expert programmers to achieve portable andproductive performance of their applications.

● We constructed a prototype open-source OpenACCcompiler OpenUH which is based on a branch of mainstream Open64 compiler. The experiences could beapplicable to other compiler implementation efforts.

● We provide multiple loop mapping strategies in thecompiler on how to efficiently distribute parallel loopsto the GPGPU accelerators. Our findings provideguidance for users to adopt suitable loop mappingsdepending on their application characteristics.

● OpenUH compiler adopts a source-to-source approachand generates readable CUDA source code for GPGPUs.This gives users opportunities to understand how the loopmapping mechanism are applied and to further optimizethe code manually. It also allows us to leverage theadvanced optimization features in the backendcompilation step by the CUDAcompiler.

Results

References

Rengan Xu, Xiaonan Tian, Yonghong Yan, Sunita Chandrasekaran, andBarbara Chapman. Reduction Operations in Parallel Loops for GPGPUs, inPMAM 2014, Feb., 2014, Orlando, Florida, USA

Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, SunitaChandrasekaran, Barbara Chapman. Compiling a High-Level Directive-BasedProgramming Model for GPGPUs , In LCPC2013, Sep. 2013, San Jose, CA,USA

Acknowledgment This research was supported by NVIDIAand Department of Energy underAward Agreement No. DE-FC02-12ER26099. We also thank PGI for providingthe compiler and support for the evaluation

OpenACC Implementation in OpenUH

● An open-source OpenACCcompiler is created using OpenUHcompiler framework

● Loop mapping mechanisms are designed to translate single loop,double loop and triple nested loop

● Competitive performance compared to a commercial OpenACCcompiler

● Explore advanced compiler analysis and transformation techniquesto further improve the performance in the future

Conclusion

Fig.5: Performance of DoubleNested Loop Mapping

Fig.6: Performance of TripleNested Loop Mapping

Fig 1:Triple Nested Loop Iteration Distribution

Fig 2: Double Nested Loop Iteration Distribution

Fig 3: OpenUH Framework for OpenACC

Fig.4: Execution Flow with OpenACCRuntime Library

contact name

Xiaonan Tian: [email protected]

P4225

category: Programming Languages & ComPiLers - PC08

new poster p4225 xiaonan tian: [email protected] openuh an open … · 2014. 3. 4. · xiaonan tian:...

Documents