. RESEARCH PAPERS .

SCIENCE CHINA Information Sciences

May 2010 Vol. 53 No. 5: 932–944

doi: 10.1007/s11432-010-0074-0

© Science China Press and Springer-Verlag Berlin Heidelberg 2010    info.scichina.com    www.springerlink.com

OpenMP compiler for distributed memory architectures

WANG Jue∗, HU ChangJun, ZHANG JiLin & LI JianJiang

School of Information and Engineering, University of Science and Technology Beijing, Beijing 100083, China

*Corresponding author (email: [email protected])

Received November 11, 2008; accepted June 14, 2009; published online April 14, 2010

Abstract  OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has the advantages of ease of use and incremental programming, message passing is today still the most widely used programming model for distributed memory architectures. How to extend OpenMP effectively to distributed memory architectures has therefore been a topic of active research. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the “partially replicating shared arrays” memory model, we propose an algorithm for shared array recognition based on inter-procedural analysis, optimization techniques based on the producer/consumer relationship, and a communication generation technique for nonlinear references. We evaluate the performance on nine benchmarks covering computational fluid dynamics, integer sorting, molecular dynamics, earthquake simulation, and computational chemistry. The average scalability achieved by the KLCoMP versions is close to that achieved by the MPI versions. We compare the performance of our translated programs with that of versions generated for Omni+SCASH, LLCoMP, and OpenMP (Purdue), and find that parallel applications (especially irregular applications) translated by KLCoMP achieve better performance than the other versions.

Keywords  parallel compiling, high performance computing, distributed memory architecture, OpenMP, irregular application

Citation  Wang J, Hu C J, Zhang J L, et al. OpenMP compiler for distributed memory architectures. Sci China Inf Sci, 2010, 53: 932–944, doi: 10.1007/s11432-010-0074-0

1 Introduction

Parallel compiling plays a significant role in the system software of parallel computers; its goal is to exploit the capabilities of the parallel machine. OpenMP [1] has been widely accepted by both industry and academia since it was proposed as a shared-memory programming standard in 1997. Because of its portability, scalability, and support for incremental programming, many researchers have tried to port or extend OpenMP to distributed memory architectures so as to improve programmer productivity. Related work falls into three categories. The first category [2–4] adopts SDSM (software distributed shared memory) to deploy OpenMP on distributed memory architectures. Although SDSM makes it easier to construct a compiling system than other methods, its performance is close to that of MPI (message passing interface) only for a subset of applications [5, 6]. The second category translates OpenMP, or an extended OpenMP, into message-passing form.


The PARAMOUNT group [7, 8] translated OpenMP to MPI [9], but some of its optimization techniques (such as shared variable recognition) did not break away from earlier SDSM-based OpenMP, especially for irregular applications. A skeleton method [10] is used in LLCoMP to translate extended OpenMP to MPI; because skeletons are difficult to optimize at compile time, this approach does not help with discontinuous data accesses, and the extensions to OpenMP also increase the programming burden to some extent. The third category translates OpenMP to MPI+SDSM or to other parallel languages. Some research works [11, 12] translate OpenMP to MPI+SDSM to reduce the overhead of SDSM; however, SDSM remains the bottleneck of system performance. Chapman et al. [13] translated OpenMP to Global Arrays based on the OpenUH compiler.

KLCoMP consists of a source-to-source compiler and a runtime library. OpenMP directives are translated into corresponding APIs provided by our runtime library, which is written with MPI+Pthreads. Compared with previous OpenMP compilers, KLCoMP employs the MPI+“partially replicating shared arrays” memory model to reduce the volume of data maintained by the compiler, and thus reduces the compile-time and run-time overhead of the program. Under this memory model, effectively recognizing shared variables, reducing or hiding communication, and increasing the efficiency of irregular applications are the keys to improving KLCoMP. This paper proposes an effective algorithm for shared array recognition based on inter-procedural analysis, optimizations based on the producer/consumer relationship, and a communication generation technique for nonlinear references.

We adopt nine applications, covering computational fluid dynamics, integer sorting, Mandelbrot set computation, molecular dynamics, earthquake simulation, and computational chemistry, to evaluate the performance of KLCoMP. These applications come from widely used benchmark suites, including the NAS benchmarks [14], SPEC OMPM2001 [15], the COSMIC software [16], and CHARMM [17]; five of them are typical irregular applications. Taking these benchmarks as input, KLCoMP generates parallel codes whose scalability is close to that of the MPI versions. Especially for irregular applications, the performance of the generated codes is better than that of codes generated by Omni+SCASH [2], LLCoMP [10], and OpenMP (Purdue) [7].

The rest of this paper is organized as follows. Section 2 provides the memory model of KLCoMP; section 3 gives communication optimizations based on this memory model; section 4 gives the communication generation technique for nonlinear references; section 5 evaluates the proposed techniques in KLCoMP; section 6 draws our conclusions.

2 Memory model for KLCoMP

The traditional memory model of OpenMP has the following two characteristics:

(1) Private data and shared data. Every thread may have its own private data, which must not be accessed by other threads, whereas shared data can be accessed by all threads.

(2) Relaxed consistency. OpenMP uses implicit barriers (for example, after a worksharing construct) and explicit flush operations to enforce the coherence of shared data. The order of data accesses between two flush operations is not restricted for any thread.

To exploit these characteristics, we divide the implementations of cluster OpenMP and their memory models into four categories: SDSM+“everything-shared” [2], SDSM+“partially replicating shared data” [4], MPI+“everything-shared” [10], and MPI+“partially replicating shared data” [7]. “Everything-shared” means that the memory of each node holds an entire copy of the data; “partially replicating shared data” means that each node copies only part of the data (shared arrays and scalars). Comparative studies [5, 6] of SDSM and MPI indicate that SDSM introduces many additional data structures to enforce data coherence within each node. For irregular accesses, SDSM incurs redundant communication that significantly degrades the performance of parallel applications. KLCoMP employs the MPI+“partially replicating shared array” model to reduce this redundant communication. Comparisons of KLCoMP with other compilers are given in Table 1.


Table 1 Comparisons of different implementations and memory models

Compiler             Implementation + memory model                 Data to be maintained        Size of additional structures   Redundant communications
                                                                                                (vs. MPI version)               (vs. MPI version)
OMNI+SCASH [2]       SDSM + “everything-shared”                    All data                     High                            High
PCOMP [4]            SDSM + “partially replicating shared data”    Shared arrays and scalars    High                            Medium
OpenMP(Purdue) [7]   MPI + “partially replicating shared data”     Shared arrays and scalars    Medium                          Medium
LLCoMP [10]          MPI + “everything-shared”                     All data                     Medium                          High
KLCoMP               MPI + “partially replicating shared array”    Shared arrays                Low                             Low
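
To make the memory model concrete, the following hand-written sketch (our own illustration: the variable names, the block distribution of iterations, and the use of MPI_Allgatherv at the implicit barrier are assumptions, not actual KLCoMP output) shows the kind of SPMD code that a simple parallel loop over a shared array corresponds to under the “partially replicating shared array” model: every process keeps a full-size copy of the array, computes only its own block of iterations, and all copies are made coherent again only at the synchronization point.

    /* Sketch: replicated shared array with coherence restored at the barrier. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, np;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        double *A = malloc(N * sizeof(double));     /* replicated shared array */

        /* block-distribute the iteration space of the original parallel loop */
        int chunk = (N + np - 1) / np;
        int lb = rank * chunk;
        if (lb > N) lb = N;
        int ub = (lb + chunk < N) ? lb + chunk : N;

        for (int i = lb; i < ub; i++)
            A[i] = (double)i * i;                   /* each node writes its own block only */

        /* implicit barrier of the worksharing construct: make every copy coherent */
        int *counts = malloc(np * sizeof(int));
        int *displs = malloc(np * sizeof(int));
        for (int p = 0; p < np; p++) {
            int plb = p * chunk;
            if (plb > N) plb = N;
            int pub = (plb + chunk < N) ? plb + chunk : N;
            counts[p] = pub - plb;
            displs[p] = plb;
        }
        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       A, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        free(counts); free(displs); free(A);
        MPI_Finalize();
        return 0;
    }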

3 Compiling optimization techniques based on the MPI+“partially replicating shared array” model

To fully utilize OpenMP relaxed consistency to reduce run-time overhead and to optimize communication, KLCoMP proposes the following compiling optimization techniques based on MPI+“partially replicating shared array”.

(1) An algorithm for shared array recognition based on inter-procedural analysis, denoted LSA (list shared arrays). Refs. [4] and [7] targeted SDSM when giving an algorithm to identify shared variables. To generate more flexible MPI code, LSA needs a more comprehensive data-flow analysis. The differences between the proposed LSA and the algorithm presented in [4] and [7] are the following.

• Smaller input set. The algorithm proposed in [4] and [7] is relatively conservative because it must identify shared variables/scalars and feed its results into the SDSM to maintain data coherence; the maintained arrays result in redundant communication overhead. LSA therefore concentrates on identifying shared arrays and updates shared scalars at synchronization points.

• More comprehensive recognition information for shared arrays. The algorithm proposed in [4] and [7] only lists shared arrays and correlative arrays rather than establishing a relationship between each shared array and its correlative arrays. LSA provides support for the construction of a data-flow graph based on the producer/consumer relationship.

(2) Optimization techniques based on the producer/consumer relationship. The producer is a statement where a shared array is defined, while the consumer is a statement where a shared array is used. Under the “partially replicating shared array” memory model, data movement may be needed between producer and consumer. Many parallel compilers use a conservative strategy; for example, LLCoMP performs communication for shared arrays just after each worksharing construct, and without data-flow analysis of the producer/consumer relationship much redundant communication may occur. Some HPF compilers, such as Adaptor HPF [18], perform the actual communication according to the data layout and the (almost) owner-computes rule, which loses the chance to overlap independent communication and computation based on the producer/consumer relationship. The focus of this paper is on optimization techniques based on the producer/consumer data-flow graph: recognition of adaptive communication patterns, elimination of redundant computation, and overlap of communication and computation.

(3) Communication generation technique for nonlinear references. Compiling techniques such as induction variable substitution and scalar expansion are often used to reduce or eliminate dependencies within a loop so that the loop can be parallelized. However, these techniques may generate many nonlinear array references. In Figure 1, induction variable substitution results in the two nonlinear array references i1*(i1-1)/2 + i2 and i2*(i2-1)/2 + i1. Petersen and Padua [19] found 6503 nonlinear subscript references when they analyzed dependencies in the Perfect benchmarks. Current parallel compilers treat nonlinear references as indirect array references and employ the traditional inspector/executor model [20], which wastes a great deal of run-time analysis time. KLCoMP instead makes use of the monotonicity of nonlinear references to generate communication and to reduce run-time overhead.

Figure 1 Induction variable substitution. (a) Before induction variable substitution; (b) after induction variable substitution.
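
Figure 1 itself is an image that is not reproduced in this text version; the following sketch (our own illustrative loop nest, not the exact code of Figure 1) shows how substituting away an induction variable k removes the loop-carried dependence but leaves a subscript of the nonlinear form i1*(i1-1)/2 + i2 quoted above.

    /* Sketch of induction variable substitution producing a nonlinear subscript. */
    #define N 100
    #define F(x) (2.0 * (x))
    static double A[N * (N + 1) / 2 + 1], B[N * (N + 1) / 2 + 1];

    void before_substitution(void)
    {
        int k = 0;                              /* induction variable */
        for (int i1 = 1; i1 <= N; i1++)
            for (int i2 = 1; i2 <= i1; i2++) {
                k = k + 1;                      /* loop-carried dependence on k */
                A[k] = F(B[k]);
            }
    }

    void after_substitution(void)
    {
        /* the dependence on k is gone, so the nest can be parallelized,
           but the subscript is now nonlinear in the loop indices */
        for (int i1 = 1; i1 <= N; i1++)
            for (int i2 = 1; i2 <= i1; i2++)
                A[i1 * (i1 - 1) / 2 + i2] = F(B[i1 * (i1 - 1) / 2 + i2]);
    }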

The following sections will detail the above compiling optimization techniques.

3.1 Shared array recognition based on inter-procedural analysis

Based on the control-flow graph and the call graph, KLCoMP recognizes shared arrays and then uses an inter-procedural analysis to ensure the coherence of each shared array on every node. Figure 2 presents the shared array recognition algorithm, LSA. The algorithm first generates a set SA of shared arrays within each parallel region according to the OpenMP semantics; it then, for each shared array, generates a set Ssa using inter-procedural argument/parameter analysis. The algorithm proposed in [4] and [7] uses inter-procedural analysis to find shared variables and enforces their coherence at run time, whereas LSA not only finds all arrays related to the shared arrays but also classifies them. LSA thus supports the construction of the producer/consumer data-flow graph described in subsection 3.2.

3.2 Optimization techniques based on producer/consumer relationship

We mark the def-use status and the access range (within each basic block) of every element of Ssa, where Ssa is a set of shared arrays derived by LSA. We then traverse the control-flow graph to find and record the producer/consumer relationships of the shared arrays. Using each producer/consumer pair, we generate the actual communication statements: if, on every node, the access range of the producer statement contains that of the consumer statement, no MPI call is inserted; otherwise, an MPI call is inserted between producer and consumer.
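
As a minimal illustration of this rule (our own sketch: the interval representation is an assumption, whereas the real analysis works on the symbolic access ranges recorded per basic block), the decision of whether communication is needed between a producer and a consumer reduces to a containment test on the index ranges of the shared array accessed on each node.

    /* Sketch: decide whether an MPI call must be inserted between producer and consumer. */
    typedef struct { long lo, hi; } range;      /* inclusive index interval on this node */

    /* returns 1 when the consumer's accesses are already covered by the producer's
       local range, so no MPI call is generated; 0 when a call must be inserted */
    static int covered_locally(range producer, range consumer)
    {
        return producer.lo <= consumer.lo && consumer.hi <= producer.hi;
    }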

The algorithm proposed in [21] finds producer/consumer relationships and adjusts the location of MPI calls in the data-flow graph of a sequential program. Our work instead targets the OpenMP relaxed-consistency model to process shared arrays, and based on producer/consumer data-flow analysis of the shared arrays KLCoMP provides the following optimization techniques.

(1) Eliminating redundant communication and computation. Based on the producer/consumer data-flow graph, KLCoMP first implements optimizations based on the recognition of adaptive communication patterns.

• Eliminating redundant communication based on the recognition of adaptive communication patterns. In Figure 3, the indirection array Aperm is updated in different iterations; however, the variation of the indirection array is very small (less than 1%). We therefore communicate only the updated data between producer and consumer.


Algorithm list shared arrays (LSA)
Input: P – a program with OpenMP.
Output: SAL – a set of lists of shared arrays.

List_para_procedure(sa, Ssa)
    do ∀ function call F where sa is a parameter of F
        Let PP be the procedure that defines F
        Let sa_PP be the procedure parameter of PP corresponding to the function parameter sa
        List_para_procedure(sa_PP, Ssa)
        Ssa = Ssa ∪ sa_PP
    enddo

List_para_fun_call(sa, Ssa)
    if sa is a parameter of a procedure PP that defines function F
        Let sa_F be the parameter of F (which is called in P) corresponding to the procedure parameter sa
        List_para_procedure(sa_F, Ssa)
        List_para_fun_call(sa_F, Ssa)
        Ssa = Ssa ∪ sa_F
    else
        Record sa's declaration point and size
    endif

/* main program */
SA = Φ        /* SA is the set of shared arrays within parallel regions */
SAL = Φ
/* Find shared arrays in each parallel region */
do ∀ PR, PR is a parallel region in P
    C = set of all arrays listed in copyin()
    do ∀ WS, WS is a work-sharing construct in parallel region PR
        FP = set of all arrays explicitly declared firstprivate for WS
        LP = set of all arrays explicitly declared lastprivate for WS
        S  = set of all arrays explicitly declared shared for WS
        SA = SA ∪ FP ∪ LP ∪ S
    enddo
    SA = SA ∪ C
enddo
/* Find the arrays corresponding to each shared array */
if SA == Φ, exit, endif
do ∀ sa ∈ SA
    Ssa = {sa}        /* Ssa is the list of arrays corresponding to sa */
    List_para_procedure(sa, Ssa)
    List_para_fun_call(sa, Ssa)
    Put Ssa into SAL
enddo

Figure 2 Inter-procedural analysis for shared array recognition.

• Eliminating redundant computation. We record the updated part of the indirection array Aperm and then modify only the corresponding part of array A, which reduces redundant computation significantly.
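
The sketch below is our own simplified illustration of these two optimizations (the function name exchange_changed, the shadow copy Aperm_prev, and the exchange of (index, value) pairs via MPI_Allgatherv are assumptions, not the KLCoMP runtime interface): each node detects which entries of the indirection array changed since the previous time step, only those entries are communicated, and only the corresponding elements of A need to be recomputed.

    /* Sketch: communicate only the entries of the indirection array that changed. */
    #include <mpi.h>
    #include <stdlib.h>

    void exchange_changed(int *Aperm, int *Aperm_prev, int n, MPI_Comm comm)
    {
        int np;
        MPI_Comm_size(comm, &np);

        /* collect locally changed entries as (index, value) pairs */
        int nchanged = 0;
        int (*pairs)[2] = malloc(sizeof(int[2]) * n);
        for (int j = 0; j < n; j++)
            if (Aperm[j] != Aperm_prev[j]) {
                pairs[nchanged][0] = j;
                pairs[nchanged][1] = Aperm[j];
                nchanged++;
            }

        /* gather how many integers each node contributes, then the pairs themselves */
        int *counts = malloc(np * sizeof(int));
        int *displs = malloc(np * sizeof(int));
        int nints = 2 * nchanged;
        MPI_Allgather(&nints, 1, MPI_INT, counts, 1, MPI_INT, comm);
        int total = 0;
        for (int p = 0; p < np; p++) { displs[p] = total; total += counts[p]; }

        int *all = malloc((total > 0 ? total : 1) * sizeof(int));
        MPI_Allgatherv(&pairs[0][0], nints, MPI_INT, all, counts, displs, MPI_INT, comm);

        /* apply the (few) updates received from every node */
        for (int k = 0; k < total; k += 2) {
            Aperm[all[k]] = all[k + 1];
            Aperm_prev[all[k]] = all[k + 1];
            /* the corresponding elements of A would be recomputed here */
        }
        free(pairs); free(counts); free(displs); free(all);
    }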

Based on producer/consumer data-flow analysis, KLCoMP develops the following techniques to generate MPI codes. Compared with other parallel compilers [2, 4, 7, 10, 18], techniques (2), (3), and (4) are first proposed in KLCoMP.

(2) Replacing MPI_Send/MPI_Recv operations with collective communication (such as MPI_Allgatherv).

(3) Collective communication interfaces for communication/computation overlap based on the producer/consumer data-flow graph. The traditional MPI_Alltoallv implementation [9] performs no such optimization. To improve communication and to ensure computation/communication overlap, we use compile-time and run-time communication optimization to develop non-blocking versions of collective operations such as MPI_Allgatherv and MPI_Alltoallv, which optimize communication and ensure the overlap of communication and computation.


for (i = 0; i < NSTEPS; i++) {
    /* the values in the indirection array change across iterations */
    if (...)
        Aperm[...] = ...;
    for (...)                 /* the producer of array A */
        A[Aperm[j]] = ...;
    ...
    for (...)                 /* the consumer of array A */
        ... = A[j];
}

Figure 3 Adaptive communication pattern.

for (i1 = 1; i1 <= N; i1 += S1)
    for (i2 = L2(i1); i2 <= U2(i1); i2 += S2(i1))
        ......
            for (in = Ln(i1, ..., i(n-1)); in <= Un(i1, ..., i(n-1)); in += Sn(i1, ..., i(n-1)))
                A[f(i1, ..., in)] = F(B[g(i1, ..., in)]);

Figure 4 Perfectly nested loop.

/* the SPMD code on node p (p ∈ {0, ..., np−1}; there are np nodes) */
/* pack and send messages to other nodes */
do i ∈ {0, ..., np−1}
    q = (i + p) mod np and p ≠ q
    calculate Comm_Set(i1, i2, ..., in)[q,p]
    do (i1, i2, ..., in) ∈ Comm_Set(i1, i2, ..., in)[q,p]
        append Comm_Set(B : r−c)[q,p] to Bufsend(q)
    enddo
    send Bufsend(q) to node q if not empty
enddo
/* perform local iterations */
calculate Comm_Set(i1, i2, ..., in)[p,p]
derive Comm_Set(B : r−c)[p,p] according to Comm_Set(A : r−c)[p,p]
do (i1, i2, ..., in) ∈ Comm_Set(i1, i2, ..., in)[p,p]
    Comm_Set(A : r−c)[p,p] = Comm_Set(B : r−c)[p,p]
enddo
/* receive data and perform the remaining iterations */
while expecting messages
    wait for a message from q and store it into Buf[q]
    calculate Comm_Set(i1, i2, ..., in)[p,q]
    do (i1, i2, ..., in) ∈ Comm_Set(i1, i2, ..., in)[p,q]
        Comm_Set(A : r−c)[p,q] = Buf[q]
    enddo
end

Figure 5 SPMD codes.

(4) Packing and unpacking for multidimensional arrays. We use the periodic theory presented in our previous work [22] to simplify the packing/unpacking of multidimensional arrays.

(5) Schedule reuse. We introduce the schedule-reuse technique [18] from HPF into KLCoMP. To reduce the cost of communication generation, we reuse the communication schedule across iterations that have the same communication pattern.
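
As a rough sketch of how such reuse can be organized (our own illustration: the comm_schedule structure, the owner() block mapping, and the version tag are assumptions, not KLCoMP's actual runtime interface), the result of the inspector is cached together with a version number of the indirection array and is rebuilt only when that version changes, so iterations with an unchanged communication pattern pay no inspector cost.

    /* Sketch: cache and reuse the communication schedule built by the inspector. */
    #include <stdlib.h>

    typedef struct {
        int  built;        /* has a schedule been computed yet?            */
        long version;      /* version of the indirection array it matches  */
        int *counts;       /* references that fall into each node's block  */
    } comm_schedule;

    static int owner(int global_index, int n, int np)   /* 1-D block distribution */
    {
        int chunk = (n + np - 1) / np;
        return global_index / chunk;
    }

    void get_schedule(comm_schedule *s, const int *perm, int n, int np, long version)
    {
        if (s->built && s->version == version)
            return;                                     /* reuse the cached schedule */

        /* inspector: count how many referenced elements live on each node */
        if (s->built)
            free(s->counts);
        s->counts = calloc(np, sizeof(int));
        for (int i = 0; i < n; i++)
            s->counts[owner(perm[i], n, np)]++;

        s->version = version;
        s->built = 1;
    }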


4 Communication generation technique for nonlinear references

KLCoMP converts imperfectly nested loops into the perfectly nested loop shown in Figure 4 using nonlinear dependence testing [23], array privatization [24], and basic symbolic analysis [25]. In Figure 4, A and B are one-dimensional arrays, f and g are array access functions, and L, U, and S are the lower bound, upper bound, and stride, respectively, which are made up of loop constants and loop indices. According to the producer/consumer data-flow graph of subsection 3.2, if there is a producer of array B before the loop in Figure 4, KLCoMP generates the communication sets and the corresponding SPMD (single program multiple data) code. In our previous work [26, 27], we integrated an integer-lattice method [28] with an algebraic solution to generate communication sets using compile-time analysis and thus reduce run-time overhead. Figure 5 presents the generated SPMD code, which can effectively overlap communication with computation; details can be found in [26] and [27].

Among the above compiling optimization techniques, shared array recognition and part of the optimizations based on the producer/consumer relationship can be applied to almost all applications. The communication generation technique for nonlinear references applies to loops whose nonlinear references result from induction variable substitution, scalar expansion, and similar transformations.
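
To make the use of monotonicity concrete, here is a small sketch (our own illustration under the assumption that the subscript is nondecreasing in both loop indices, as it is for the example discussed with Figure 1): the set of array elements touched by a block of iterations can be bounded by evaluating the subscript at the corners of the block in O(1), instead of enumerating every iteration in a traditional inspector loop.

    /* Sketch: bound the elements accessed through f(i1,i2) = i1*(i1-1)/2 + i2
       (with 1 <= i2 <= i1) by a block of iterations i1 in [lb, ub]. */
    static long f(long i1, long i2) { return i1 * (i1 - 1) / 2 + i2; }

    void touched_range(long lb, long ub, long *first, long *last)
    {
        *first = f(lb, 1);     /* smallest index accessed by the block */
        *last  = f(ub, ub);    /* largest index accessed by the block  */
    }
    /* The interval [first, last] is then intersected with the blocks owned by the
       other nodes to build the communication sets of Figure 5 directly. */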

5 Evaluation and experimental results

Since the current KLCoMP only supports C+OpenMP, all test cases use the corresponding C versions. To evaluate the performance of KLCoMP comprehensively, this paper employs the NAS benchmarks, SPEC OMPM2001, the COSMIC software, and CHARMM. Because three applications in the NAS benchmarks (SP, MG, and BT) have similar access patterns, we chose SP as the test case. To evaluate the performance for irregular applications, our test cases include EQUAKE, IRREG, and MOLDYN in addition to the NAS benchmarks CG and IS. The benchmarks cover computational fluid dynamics, integer sorting, Mandelbrot set computation, molecular dynamics, earthquake simulation, and computational chemistry.

We evaluate KLCoMP on the USTB Cluster, an Ethernet-switched cluster of 16 Intel Xeon 3.0 GHz nodes with 1024 KB cache. Each node has 2 GB of memory and runs Red Hat Linux FC 3 with kernel 2.6.11; the native compiler is gcc 3.4.2. The nodes are interconnected with Gigabit Ethernet. We use MPICH2 [9] provided by Argonne National Laboratory, and all applications are compiled with the -O3 option.

We compare the following parallelization versions:

(1) Scalability of the MPI versions versus the KLCoMP versions. The MPI versions of CG, IS, EP, FT, LU, and SP are from the NAS benchmarks [14]; the MPI implementations of EQUAKE, IRREG, and MOLDYN were written by hand. Because the MPI versions of NAS CG, IS, EP, FT, LU, and SP are in Fortran, this experiment compares the speedups of the two versions.

(2) Performance of OpenMP on SDSM versus the KLCoMP versions. We deploy the OpenMP applications on the SCASH [29] SDSM, using the Omni OpenMP compiler [2] provided by the Real World Computing Partnership (RWCP). The CG, IS, and EP benchmarks are from the Omni group [2].

(3) Performance of the LLC versions versus the KLCoMP versions. The LLC group has translated extended OpenMP to MPI using the LLCoMP compiler.

(4) Performance of the OpenMP (Purdue) versions versus the KLCoMP versions.

5.1 Scalability comparison with MPI version

EP, FT, LU, and SP are typical regular applications in which memory accesses are contiguous or travel an identical distance. For these regular applications, Figure 6 shows that the KLCoMP versions achieve speedups close to the MPI versions. For LU, the MPI version outperforms the KLCoMP version owing to the different distribution patterns of the input matrices: the MPI version distributes each input matrix using a 2-D block partitioning, whereas the KLCoMP version uses a 1-D block partitioning.


Figure 6 Scalability comparison of regular applications (speedup of the MPI and KLCoMP versions of EP-B, FT-B, LU-B, and SP-B on 2, 4, 8, and 16 nodes).

Figure 7 Scalability comparison of irregular applications (speedup of the MPI and KLCoMP versions of IS-A, IS-B, CG-A, CG-B, MOLDYN, IRREG, and EQUAKE on 2, 4, 8, and 16 nodes).

The 2-D block partitioning of the input matrix has better communication and load balance for pipelining parallelism.

For the irregular applications IS, CG, MOLDYN, IRREG, and EQUAKE, Figure 7 presents the scalability comparison with the MPI versions. For IS, the KLCoMP version achieves super-linear speedup on 2 and 4 nodes, because our compiling techniques recognize the adaptive communication pattern and eliminate redundant computation; they also eliminate redundant communication, which yields acceptable speedups for a larger input on 16 nodes. For CG, the KLCoMP version achieves speedups close to the MPI version. The MPI versions of MOLDYN, IRREG, and EQUAKE were written by hand from the OpenMP applications, so the KLCoMP and MPI versions achieve similar speedups.

5.2 Performance comparison with Omni OpenMP on SCASH

Only EP (Class A), EP (Class B), FT (Class A), LU (Class A), CG (Class A), and IS (Class A) out of the nine benchmarks in our evaluation are available for Omni+SCASH, because we otherwise employ a relatively large input set and the size of shared data on SDSM is limited by page size × page number. The comparative study therefore focuses on these applications with a relatively small input set. EP is a compute-intensive application with a high computation/communication ratio and good load balance.


Figure 8 Performance comparison for EP (Class A). Figure 9 Performance comparison for EP (Class B).

Figure 10 Performance comparison for FT (Class A). Figure 11 Performance comparison for LU (Class A).

Omni+SCASH uses the PM2 message-passing interface, which provides lower latency than a traditional MPI implementation based on TCP/IP. Nevertheless, for the regular application EP, Figures 8 and 9 show that KLCoMP achieves slightly better performance than the Omni+SCASH version. For FT and LU, memory accesses to the multidimensional arrays are separated by an identical gap, which causes many pages to be updated in SCASH; KLCoMP instead adopts an accurate packing/unpacking technique when generating MPI codes. The performance comparisons for FT and LU are shown in Figures 10 and 11, respectively.


For the irregular applications CG and IS, which involve many irregular read/write operations on shared arrays, Omni+SCASH achieves very low performance owing to a large number of page faults. The comparison with Omni+SCASH is shown in Table 2.

These comparisons show that KLCoMP not only handles large input sets (it is not limited by SDSM), but also achieves better performance and stability than the Omni+SCASH version.

5.3 Performance comparison with LLCoMP

In terms of ease of use, the LLC language proposes OpenMP extensions that guide the compiler in keeping data coherent on each node. These extensions increase the programming burden, because the programmer must identify the data (and its type) to be communicated and the array access patterns within loops, whereas KLCoMP asks no additional effort of the programmer.

The LLC group has released translated versions of only the NAS benchmarks FT and CG, so these two applications are used in this comparison. Figures 12 and 13 show that the KLCoMP version achieves better performance than the LLCoMP version for FT (Class B) and CG (Class B), which benefits from our optimization techniques based on the producer/consumer relationship.

5.4 Performance comparison with OpenMP (Purdue)

Some of the optimizations in the Cluster OpenMP system developed by the PARAMOUNT group were applied by hand.


Table 2 Performance comparisons for CG and IS (seconds)

                                2 nodes    4 nodes    8 nodes    16 nodes
CG (Class A)   Omni+SCASH       45.910     74.38      121.34     121.340
               KLCoMP           2.650      2.03       1.88       1.800
IS (Class A)   Omni+SCASH       30.580     84.30      195.61     412.580
               KLCoMP           1.325      2.17       3.04       3.619

Figure 12 Performance comparison for FT (Class B). Figure 13 Performance comparison for CG (Class B).

Figure 14 Performance comparison for EP (Class B). Figure 15 Performance comparison for FT (Class B).

Figure 16 Performance comparison for SP (Class B). Figure 17 Performance comparison for LU (Class B).

The compiler is not available to us as open source, so the translation technique and optimizations of OpenMP (Purdue) described in [7] and [8] were re-implemented by us. For the compute-intensive application EP, Figure 14 shows that OpenMP (Purdue) and KLCoMP achieve similar performance. For FT (Class B) and SP (Class B), Figures 15 and 16 show that the KLCoMP version achieves slightly better performance than OpenMP (Purdue), which benefits from our communication generation technique for nonlinear references and the periodic packing/unpacking technique for multidimensional arrays.


Figure 18 Performance comparison for IS (Class B). Figure 19 Performance comparison for CG (Class B).

Figure 20 Performance comparison for EQUAKE. Figure 21 Performance comparison for IRREG.

Figure 22 Performance comparison for MOLDYN.

Figure 17 gives a performance comparison for the LU application. KLCoMP identifies the pipelining parallelism in the SSOR compute kernel of LU and overlaps communication with computation; thus the KLCoMP version achieves better performance than the OpenMP (Purdue) version.

In Figure 18, for IS, KLCoMP achieves better performance than OpenMP (Purdue), owing to our optimizations including the recognition of the adaptive communication pattern, the elimination of redundant communication and computation, and the replacement of MPI_Send/MPI_Recv operations with MPI_Allgatherv.

The communication traffic of CG is greater than that of IS. OpenMP (Purdue) replaces MPI_Send/MPI_Recv operations with MPI_Alltoallv, whereas KLCoMP casts these operations as MPI_Allgatherv. In our experiments, Figure 19 shows that KLCoMP achieves better scalability than OpenMP (Purdue), whose approach significantly degrades the performance of CG on 8 and 16 nodes. For EQUAKE, SMVP (sparse matrix-vector product) is the most time-consuming subroutine, taking about 70% of the execution time of this application.


Our communication generation for nonlinear references and the corresponding SPMD code generation reduce the run-time overhead; Figure 20 shows that the KLCoMP version achieves better performance than OpenMP (Purdue).

The IRREG application contains producer/consumer relationships and irregular accesses. For this case, the OpenMP (Purdue) version was re-implemented using our runtime library for irregular applications, which eliminates redundant communication, so the two versions achieve similar performance, as shown in Figure 21.

Compared with the other irregular applications, the indirection arrays in MOLDYN are updated only every 10 or 20 time steps. For this case, KLCoMP adopts the schedule-reuse scheme to reduce the inspector overhead at run time. Figure 22 shows that the KLCoMP version achieves better performance than OpenMP (Purdue) for MOLDYN.

From the above comparisons, KLCoMP improves the performance of irregular applications on distributed memory architectures. We summarize the characteristics of KLCoMP as follows:

(1) Feasibility. The benchmarks cover computational fluid dynamics, integer sorting, Mandelbrot set computation, molecular dynamics, earthquake simulation, and computational chemistry.

(2) Scalability. KLCoMP achieves speedups close to MPI.

(3) High performance. KLCoMP achieves better performance for parallel applications, especially irregular applications, than other parallel compilers.

6 Conclusions

KLCoMP provides not only a portable programming environment but also an advanced parallel computation model and optimization techniques with the following characteristics:

(1) The MPI+“partially replicating shared arrays” model reduces the volume of maintained data and the redundant communication.

(2) Our algorithm for shared array recognition is better suited to an MPI-based OpenMP compiler.

(3) The optimization techniques based on the producer/consumer relationship reduce redundant communication and computation in parallel applications.

(4) The communication generation technique for nonlinear references uses monotonicity information obtained at compile time to reduce run-time overhead.

The experimental results show that KLCoMP offers feasibility, scalability, and high performance.

Acknowledgements

This work was supported by the National High-Tech Research & Development Program of China (Grant Nos. 2006AA01Z105, 2008AA01Z109), the National Natural Science Foundation of China (Grant No. 60373008), and the Key Project of the Chinese Ministry of Education (Grant Nos. 106019, 108008). We thank the LLC group from Universidad de La Laguna and the Omni group from the Real World Computing Partnership for providing us with benchmark applications. We especially thank Dr. Ayon Basumallik from Purdue University for discussions of Cluster OpenMP. We thank the members of the HPC lab of the University of Science and Technology Beijing.

References

1 OpenMP Architecture Review Board. OpenMP Application Program Interface, version 2.5, 2005
2 Sato M, Satoh S, Kusano K, et al. Design of OpenMP compiler for an SMP cluster. In: Proc. of the 1st European Workshop on OpenMP. Berlin: Springer, 1999. 32–39
3 Costa J J, Cortes T, Martorell X, et al. Running OpenMP applications efficiently on an everything-shared SDSM. J Parall Distrib Comput, 2006, 66: 647–658
4 Min S J, Eigenmann R. Combined compile-time and runtime-driven, pro-active data movement in software DSM systems. In: Proc. of the Seventh Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, Houston, Texas, 2004. 1–6
5 Lu H H. Quantifying the performance differences between PVM and TreadMarks. J Parall Distrib Comput, 1997, 43: 65–78
6 Basumallik A, Min S, Eigenmann R. Programming distributed memory systems using OpenMP. In: Proc. of the International Parallel and Distributed Processing Symposium. New York: IEEE Press, 2007. 1–8
7 Basumallik A, Eigenmann R. Towards automatic translation of OpenMP to MPI. In: Proc. of the 19th Annual International Conference on Supercomputing. New York: ACM Press, 2005. 189–198
8 Basumallik A, Eigenmann R. Optimizing irregular shared-memory applications for distributed-memory systems. In: Proc. of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York: ACM Press, 2006. 119–128
9 MPICH2 1.0.7, http://www.mcs.anl.gov/research/projects/mpich2/, March 21, 2008
10 Dorta A, Lopez P, Sande F. Basic skeletons in llc. Parall Comput, 2006, 32: 491–506
11 Eigenmann R, Hoeflinger J, Kuhn R H, et al. Is OpenMP for grids? In: Proc. of the International Parallel and Distributed Processing Symposium. New York: IEEE Press, 2002. 171–178
12 Jeun W C, Kee Y S, Ha S. Improving performance of OpenMP for SMP clusters through overlapped page migrations. In: Proc. of the International Workshop on OpenMP, Reims, France, 2006
13 Eachempati D, Huang L, Chapman B M. Strategies and implementation for translating OpenMP code for clusters. In: Proc. of High Performance Computing and Communications. Berlin: Springer, 2007. 420–431
14 Jin H, Frumkin M, Yan J. The OpenMP implementation of NAS parallel benchmarks and its performance. Technical Report NAS-99-011, 1999
15 Aslot V, Domeika M, Eigenmann R. SPEComp: a new benchmark suite for measuring parallel computer performance. In: Proc. of the Workshop on OpenMP Applications and Tools. Berlin: Springer, 2001. 1–10
16 COSMIC group, University of Maryland. COSMIC software for irregular applications. http://www.cs.umd.edu/projects/cosmic/software.html
17 Brooks B R, Bruccoleri R E, Olafson B D, et al. A program for macromolecular energy, minimization, and dynamics calculations. J Comp Chem, 1983, 4: 187–217
18 Brandes T. ADAPTOR Users Guide. Fraunhofer Gesellschaft, Augustin, Germany, 2004
19 Petersen P, Padua D A. Static and dynamic evaluation of data dependence analysis techniques. IEEE Trans Parall Distrib Syst, 1996, 7: 1121–1132
20 Brezany P, Dang M. CHAOS+ Runtime Library. Internal Report, Institute for Software Technology and Parallel Systems, University of Vienna, September 1997
21 Michelle M, Barbara K, Paul D. Data-flow analysis for MPI programs. In: Proc. of the 2006 International Conference on Parallel Processing, Columbus, Ohio, USA, 2006. 175–184
22 Wang J, Hu C J, Zhang J L, et al. An optimized strategy for collective communication in data parallelism (in Chinese). Chinese J Comput, 2008, 2: 318–328
23 Engelen R, Birch J, Shou Y, et al. A unified framework for nonlinear dependence testing and symbolic analysis. In: Proc. of the ACM International Conference on Supercomputing. New York: ACM Press, 2004. 106–115
24 Li Z. Array privatization for parallel execution of loops. In: Proc. of the ACM International Conference on Supercomputing. New York: ACM Press, 1992. 313–322
25 Haghighat M R, Polychronopoulos C D. Symbolic analysis for parallelizing compilers. ACM Trans Program Languag Syst, 1996, 18: 477–518
26 Hu C, Li J, Wang J, et al. Communication generation for irregular parallel applications. In: Proc. of the IEEE International Symposium on Parallel Computing in Electrical Engineering. New York: IEEE Press, 2006. 263–270
27 Wang J, Hu C, Zhang J, et al. OpenMP extensions for irregular parallel applications on clusters. In: Proc. of the International Workshop on OpenMP, Lecture Notes in Computer Science 4935. Berlin: Springer, 2007. 101–111
28 Tseng E, Gaudiot J. Communication generation for aligned and cyclic(k) distributions using integer lattice. IEEE Trans Parallel Distrib Syst, 1999, 10: 136–146
29 Ojima Y, Sato M, Harada H, et al. Performance of cluster-enabled OpenMP for the SCASH software distributed shared memory system. In: Proc. of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), Tokyo, Japan, 2003. 450–456