new poster p4225 xiaonan tian: [email protected] openuh an open … · 2014. 3. 4. · xiaonan tian:...
TRANSCRIPT
j
i
j
i
j
i
j
i
j
i i (f)
j(a) (b) (c)
(d) (e)
8
2O
22
24
26
28
3O
32
34
36
Tim
e(s
)Ti
me
(s)
More kernels
acc_shutdown() (cleanup the context)
No
Tim
e(s
)
Tim
e(s
)Ti
me
(s)
Tim
e(s
)
Total Time Kernel Time Total TimeDGEMM Laplacian
Fig.7: Performance Comparison between PGI and OpenUH
PGI-OOPGI-O3
OpenUH
PGI-OOPGI-O3
OpenUH
12
14
16
20
18
22
24
26
28
32
30
Jacobi DGEMM Gaussblur
Benchmark
Map-g-vMap-gv-vMap-g-gv
Map-gv-gv
acc_init() (setup the context)
remaining datain data clause
Yes
Is data in
the map
Allocate devicememory for this data,and put it in the map
Copy this data fromhost to device
Move to the nextdata clause
Setup threads topology
Push kernel arguments
Load and launch kernel
Has reduction
Launch reductionalgorithm kernel
Copy result data fromdevice to host
Yes
No
Yes
No
Yes
No
OpenUH Compiler Infrastructure
FRONTENDS (C/C++,F90,OpenMP,OpenACC)
IPA(Inter Procedural Analyzer)
PRELOWER(Preprocess OpenACC)
LNO(Loop Nest Optimizer)
LOWER(Transformation of OpenACC)
WOPT(Global Scalar Optimizer)
WHIRL2C & WHIRL2CUDA(IR-to-source for other targets)
CG(Code for IA-32,IA-64,X86_64)
Source with OpenACC Directives
CPU Code
General CPU Compiler
GPU Code
NVCCCompiler
PTXAssembler
Loaded Dynamically
CPU Binary
Runtime LibraryLinker
Executable
13.4
13.3
13.2
13.1
13
12.9
12.8
12.7
12.6
13.5
13.6
Jacobi
PGI-OOPGI-O3
OpenUH
PGI-OOPGI-O3
OpenUH
Kernel TimeTotal Time
1
10
100PGI-O0 PGI-O3
OpenUH
PGI-O0PGI-O3
OpenUH
Kernel Time
O
1
2
3
4
5
Stencil
PGI-OOPGI-O3
Open
PGI-OOPGI-O3
Open
Kernel TimeTotal Time
0.1
1
10
100
1000
Stencil Laplacian Wave13pt
Benchmark
Map-g-gv-vMap-v-gv-gv Map-v-gv-g
OpenUH – An Open Source OpenACC Compiler
XiaonanTian, Rengan Xu,YonghongYan, ZhifengYun, Sunita Chandrasekaran, Barbara ChapmanDepartment of Computer Science,University of Houston
Email: {xtian2, rxu6, yyan3, zyun, schandrasekaran, bchapman}@uh.edu , http://web.cs.uh.edu/~openuh
Introduction
Loops Transformation
● OpenACC is an emerging directive-based programmingmodel for programming accelerators that typically enablenon-expert programmers to achieve portable andproductive performance of their applications.
● We constructed a prototype open-source OpenACCcompiler OpenUH which is based on a branch of mainstream Open64 compiler. The experiences could beapplicable to other compiler implementation efforts.
● We provide multiple loop mapping strategies in thecompiler on how to efficiently distribute parallel loopsto the GPGPU accelerators. Our findings provideguidance for users to adopt suitable loop mappingsdepending on their application characteristics.
● OpenUH compiler adopts a source-to-source approachand generates readable CUDA source code for GPGPUs.This gives users opportunities to understand how the loopmapping mechanism are applied and to further optimizethe code manually. It also allows us to leverage theadvanced optimization features in the backendcompilation step by the CUDAcompiler.
Results
References
Rengan Xu, Xiaonan Tian, Yonghong Yan, Sunita Chandrasekaran, andBarbara Chapman. Reduction Operations in Parallel Loops for GPGPUs, inPMAM 2014, Feb., 2014, Orlando, Florida, USA
Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, SunitaChandrasekaran, Barbara Chapman. Compiling a High-Level Directive-BasedProgramming Model for GPGPUs , In LCPC2013, Sep. 2013, San Jose, CA,USA
Acknowledgment This research was supported by NVIDIAand Department of Energy underAward Agreement No. DE-FC02-12ER26099. We also thank PGI for providingthe compiler and support for the evaluation
OpenACC Implementation in OpenUH
● An open-source OpenACCcompiler is created using OpenUHcompiler framework
● Loop mapping mechanisms are designed to translate single loop,double loop and triple nested loop
● Competitive performance compared to a commercial OpenACCcompiler
● Explore advanced compiler analysis and transformation techniquesto further improve the performance in the future
Conclusion
Fig.5: Performance of DoubleNested Loop Mapping
Fig.6: Performance of TripleNested Loop Mapping
Fig 1:Triple Nested Loop Iteration Distribution
Fig 2: Double Nested Loop Iteration Distribution
Fig 3: OpenUH Framework for OpenACC
Fig.4: Execution Flow with OpenACCRuntime Library
contact name
Xiaonan Tian: [email protected]
P4225
category: Programming Languages & ComPiLers - PC08