
Java-based coupling for parallel predictive-adaptive domain decomposition

Cécile Germain-Renaud* and Vincent Néri
LRI CNRS – Université Paris 11, Bat 490, Université Paris-Sud, F-91405 Orsay Cedex, France
Tel.: +33 1 69 15 76 55; Fax: +33 1 69 15 65 86; E-mail: [email protected]
* Corresponding author.

Adaptive domain decomposition exemplifies the problem of integrating heterogeneous software components with intermediate coupling granularity. This paper describes an experiment where a data-parallel (HPF) client interfaces with a sequential computation server through Java. We show that seamless integration of data-parallelism is possible, but requires most of the tools from the Java palette: Java Native Interface (JNI), Remote Method Invocation (RMI), callbacks and threads.

1. Introduction

Presently or in the near future, high performance computing applications are bound to be built as distributed systems based on an object component architecture [4,7,9]. Some of these components may be written in data-parallel languages and run either on traditional Distributed Memory Architectures or on fast Networks of Workstations (NOWs). The question of integrating such data-parallel components has been raised in multiple contexts. Java-based integration has been discussed by the Parallel Compiler Runtime Consortium [14], and implemented in the Converse framework [3]. CORBA-based integration requires describing data distribution at the IDL level and defining efficient communication mechanisms [12].

The PPADD (Parallel Predictive-Adaptive Domain Decomposition) project is a large-scale application that naturally requires a distributed implementation on heterogeneous parallel machines. The general application field is solving PDEs through adaptive domain decomposition methods. The entry point to distributed computing is that the main computation (the solver) can see in advance that it will require some adaptation of its mesh structure, and is able to delegate these costly computations before the new data is actually required. Moreover, the solver naturally fits into the data-parallel model, while the mesh computing service is more efficiently performed on shared memory architectures.

In the development of the PPADD project, we faced the problem of designing both the solver and the mesh service for efficiency and reusability, and of finding a smooth way of connecting the components through a LAN.

High Performance Fortran (HPF) was considered a good efficiency/reusability tradeoff for the solver. The mesh server is written in Fortran 90. Other components, especially a visualization module, will be added later and will use commercial tools.

We chose to implement the coupling through Java, using the language and its various standard APIs. Smooth coupling precluded using low-level message-passing interfaces. Ideal smooth coupling would require being able to see HPF and Fortran 90 code as object components, and to invoke remote methods from the HPF objects on the F90 objects. The Java Native Interface (JNI) API allows wrapping Fortran 90 (within the limits of C–F90 interoperability, which suffers from severe restrictions), and coupling F77 with Java has been extensively studied [5,9]. The Java Remote Method Invocation (RMI) mechanism allows interoperation with remote Java objects. On the other hand, no standard has yet emerged for wrapping data-parallel objects. So we decided to experiment with adapting both the Java JNI and RMI to HPF and to the requirements of our application. This paper presents a preliminary report of this experiment.

The second section describes the application and motivates the choice of a decoupled architecture. The third section gives an overview of the architecture and points out the requirements for efficiency. The following section describes the Java-based implementation of component coupling. Although thoroughly tested for functionality, the architecture has not yet been tested for performance, so the final section concludes the paper.

Scientific Programming 7 (1999) 185–189
ISSN 1058-9244 / $8.00 © 1999, IOS Press. All rights reserved


2. Application overview

The PPADD project is based on a finite element multilevel scheme for the solution of non-linear PDEs. The terms describing the non-linear interactions between the fine and coarse levels are costly to compute, and can be frozen for a time that can be estimated from the behavior of the solution. This multilevel scheme is the mathematical basis for decoupling the solver from the mesh refinement. Moreover, the solution of each approximate non-linear equation uses domain decomposition methods based on the dual Schur complement.

The global architecture is composed of three main modules: the Solver, the Domain Manager and the Visualization Interface. Currently, only the Domain Manager and one version of the Solver are implemented. Besides visualization facilities, future versions will allow the configuration of the Solver for multiple numerical solvers.

2.1. The Solver

At the application level, the Solver is responsible for computing an approximate solution of the initial equation. The solver iterates over time. At each iteration, the behavior of the solution is checked for a too steep variation from the previous iteration (or from an initial guess at cold boot), as described in [10]. When such a variation occurs, the checking procedure also computes a tolerance delay and issues a request for mesh refinement. The tolerance delay is the number of non-linear iterations during which the current mesh can remain in use. Moreover, the checking procedure computes two types of indicators: the first one relates the mesh elements to the quality of the solution, and the second one provides information about computing load on a per-element basis.

When a request for mesh refinement has been issued, the computation continues until the tolerance delay is exhausted. At this point, the new mesh is acquired and the time iterations can go ahead.

Currently, mesh refinements are serialized. For a given iteration, only one domain can request refinement, and the request must complete before a new one can be issued. Parallelizing mesh refinement requests does not make sense, because the remesh logic is inherently serial. Pipelining is an interesting possibility.

2.2. The Domain Manager

The Domain Manager acts as a repository for the current state and the history of the finite element mesh and its associated decompositions. Moreover, it provides a unified mesh refinement/domain decomposition service. Finally, it provides the initial mesh at cold boot or on restart. To save memory space, as the computation is not supposed to be out of core, the Domain Manager only sees a simplified version of the mesh data (the basic domain structure in the following).

On a mesh refinement request, the Domain Manager always performs the required refinement, and can also decide to split the requesting domain in two subdomains or to merge two subdomains. The decision to split or merge depends on two factors: memory space and load balancing. The memory space limitation is easy to compute, because the actual memory size needed by the vectors and sparse matrices defined on a domain can be accurately predicted from the basic domain structure. Load balancing is much trickier, because the convergence time for the non-linear iterations has to be predicted.

Fig. 1. Architecture overview.

2.3. The programming models

The Solver is a data-parallel program, written in High Performance Fortran. It is based on an ongoing effort to develop an HPF library for domain-decomposition based sparse computations, which is in turn an adaptation of a Fortran 90 library developed by the Numerical Analysis Laboratory of Paris 11. A discussion of the techniques for handling unstructured computations in HPF is outside the scope of this paper; a survey can be found in [6] and examples for sparse matrix computations in [8] and [17].

The Domain Manager cannot be easily parallelized on a distributed memory machine, because it spends most of its execution time in searching complex data structures. In any case, its interface to the external world is assumed to be a sequential front-end.

Neither Fortran 90 nor HPF, as a superset of Fortran 90, is an object-oriented language. However, their modularity and polymorphism features [13] make it rather easy to interface them with a truly object-oriented language.

3. Architecture overview

The application naturally presents a client/server structure. The Solver acts as a client for a mesh refinement and domain decomposition service provided by the Domain Manager. Here, the most important point is the fact that the computation of a new mesh and the progress in computing a solution based on the old mesh can overlap, at least for a certain time, which is the tolerance delay. Thus, delegating the computation of the new mesh makes sense. As seen before, it is also natural that the two programs run on different machines, because they require different architectures.

The Solver and the Domain Manager are coordinated through Java-based interfaces. These interfaces have two functions: the native/Java wrapper and the mechanism for accessing remote objects. Fig. 1 describes what would be an ideal architecture. On the left-hand side, Java objects wrap the HPF data and/or code for the Solver; on the right-hand side, Java objects wrap the Fortran 90 data and/or code for the Domain Manager; on both sides, at least some data can be accessed by both Java and the native codes, and each can hold its own data. Finally, method invocation from the Solver to the Domain Manager is asynchronous: invoking the mesh refinement tool should not block the caller.

Actually, only some of the ideal features are directly available. A very important one is memory management. Simply put, complex dynamic Fortran variables can be allocated and operated on (as objects) in a Java environment. Hence wrapping true Fortran 90 code is possible, except for the numerous restrictions of C/F90 interoperability, which apply even to the very simple data structure of an array section.

No JNI exists for HPF. Various proposals are ongoing to define how a data-parallel object should be wrapped, either as a mere Object or with more information about its distribution properties. Such a definition would clearly be mandatory for an authentic HPF JNI. However, we observed that the most important need in the PPADD project was a Java-based mechanism to escape from data-parallelism. This takes place when the Solver requests a new mesh. The semantics of this call is strongly asymmetric: currently one, and later several, domains actively call for remeshing, while other domains may be created as a result of the remeshing process. The semantics of HPF_LOCAL answers this need. The semantics of this call is also strongly asynchronous: first, the Solver computation should overlap with the Domain Manager computation, but also with the communication of the new mesh data. Thus, the request for a new mesh must be a true asynchronous RPC, with return parameters set by the called service. This is definitely not the semantics of any data-parallel construct, which implies a single thread of execution, nor the semantics of the Java RMI, which is synchronous. However, we show below that the functionality can be implemented through the Java multithreading facility.
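As an illustration of this service boundary, the following sketch shows what a Java remote interface for the mesh refinement service could look like. The names (DomainManagerService, RemeshResult, remesh, awaitNewDomain) and the parameter lists are assumptions for this paper, not the actual PPADD code; the asynchrony is obtained on the caller side, as described in Section 4.3.

  import java.rmi.Remote;
  import java.rmi.RemoteException;

  // Hypothetical remote interface exposed by the Domain Manager server.
  // RMI itself is synchronous; the Solver hides the latency by invoking
  // these methods from separate threads (see Section 4.3).
  public interface DomainManagerService extends Remote {
      // Invoked by the processor owning the requesting domain.
      RemeshResult remesh(int domainId,
                          double[] qualityIndicators,
                          double[] loadIndicators) throws RemoteException;

      // Invoked by the other processors; returns when the remeshing is
      // done, possibly carrying a new domain after a split (Section 4.3).
      RemeshResult awaitNewDomain() throws RemoteException;
  }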

4. Implementation

4.1. Memory management

Fortran 90 and, at least in the standard, HPF afford dynamic data structures, product types and the combination of both, which have been efficiently implemented by commercial compilers, each in a proprietary way. A well-designed Fortran 90 code will heavily rely on these complex structures and will define "constructors" for each of them with dynamic memory allocation. The finite element mesh is a typical example of this kind of object, on the Solver and on the Domain Manager as well. The domain-decomposed sparse matrix is another example on the Solver. To be usable by Fortran code, such objects must be constructed by Fortran code, because of the proprietary implementation details, and remain valid until de-allocated. Although the status of the native data is not clearly documented in the API specification, we never encountered memory management conflicts. Conversely, large Java arrays have to be accessed by native methods, especially the RMI parameters. The JNI provides the notion of pinning a Java array against garbage collection, so that the native method can access this array without copying (provided that the layouts are compatible).
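The ownership rule just described (Fortran constructs, Fortran destroys) can be pictured by the following hypothetical Java wrapper; the class and library names are illustrative only. The Java side keeps an opaque handle, so that the compiler-specific layout of the Fortran 90 structure never crosses the JNI boundary.

  // Hypothetical wrapper for a Fortran 90 mesh structure.
  // Allocation and de-allocation stay on the Fortran side; Java only
  // stores an opaque handle returned by the native constructor.
  public class MeshHandle {
      static { System.loadLibrary("meshwrap"); }   // assumed native library name

      private long handle;    // opaque reference to the Fortran structure

      public MeshHandle(int nbElements) {
          handle = nativeCreate(nbElements);
      }

      // Explicit release: the lifetime must not depend on the Java GC.
      public void dispose() {
          if (handle != 0) { nativeDestroy(handle); handle = 0; }
      }

      private native long nativeCreate(int nbElements);
      private native void nativeDestroy(long handle);
  }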

4.2. Java/HPF Interface

When running on a parallel machine, the abstract unique thread of control of data-parallelism is actually implemented through multiple parallel processes running on the parallel processors, in the so-called SPMD mode. Each of these processes must be coupled with a Java VM running on the processor to achieve the functionality of a JNI. A straightforward design would involve one more Java VM as a front-end. However, such a design would either lead to unnecessary communications if the front-end runs on a remote machine, or to load imbalance if the front-end runs on one processor of the parallel machine. Thus no front-end has been implemented; instead, the Java programs that run in parallel have been designed in SPMD mode so as to offer single-thread semantics to the external world.

A true JNI would have the HPF codes launched by the Java VMs. More precisely, on each processor, the Java VM would call the sequential code, compiled by an HPF compiler, as native procedures. Unfortunately, executing a parallel procedure is a less standardized process than in the sequential case: for instance, the set-up of the communication layer must complete before executing the actual user code, and there is no standard for all these parallel initializations. Hence, such a design cannot be compiler independent, and needs to know the internals of the compiler. To ensure portability across compilers and machines, we used the Java Invocation API, which allows a Java VM to be loaded into a native application. Here, a Java VM is launched on each processor from inside the HPF code, through HPF_LOCAL subroutines. After this point, the SPMD programs appear as native methods run inside Java threads. However, the SPMD programs are run as a whole. Thus the burden of synchronizing with the Domain Manager is on the HPF program. In practice, we had to mirror in HPF what should be the Java main program.
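The resulting per-processor organisation can be sketched as follows; SolverNode and its members are hypothetical names for the class loaded by the Java VM that each SPMD process starts through the Invocation API, not the actual PPADD code. The HPF program keeps the main loop, so the class only exposes the entry points that the HPF_LOCAL routines call back into.

  // Hypothetical per-processor Java class, mirrored by the HPF code.
  // One instance lives in the Java VM embedded in each SPMD process.
  public class SolverNode {
      private final int rank;                  // SPMD process number
      private Thread rmiThread;                // thread carrying the remote call
      private volatile RemeshResult result;    // set by the RMI thread

      public SolverNode(int rank) {
          this.rank = rank;
      }

      // Called back from an HPF_LOCAL routine when the checking procedure
      // issues a remesh request; returns immediately (see Section 4.3).
      public void requestRemesh(int domainId,
                                double[] qualityIndicators,
                                double[] loadIndicators) {
          // forks the RMI thread (sketched in Section 4.3)
      }

      // Called back from the HPF_LOCAL finalizer when the tolerance delay
      // is exhausted (see Section 4.3).
      public void waitForNewMesh() {
          // joins the RMI thread (sketched in Section 4.3)
      }
  }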

4.3. Asynchronous RPC

The Solver requests a new mesh through an HPF_LOCAL routine. This routine calls back to the local Java VM. The called-back method forks the execution thread of the requesting processor: one thread continues executing the Solver code, while the other one actually performs a synchronous RMI to the Java server of the Domain Manager.
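A minimal sketch of this fork, continuing the hypothetical SolverNode class of Section 4.2 (all names, including the registry URL, are assumptions): the called-back method starts a plain Java thread whose only job is to perform the synchronous RMI, store the result and terminate.

  // Inside the hypothetical SolverNode class: the processor owning the
  // requesting domain forks a thread that performs the synchronous RMI,
  // so that control returns immediately to the HPF code. The other
  // processors would do the same with awaitNewDomain().
  public void requestRemesh(final int domainId,
                            final double[] qualityIndicators,
                            final double[] loadIndicators) {
      rmiThread = new Thread(new Runnable() {
          public void run() {
              try {
                  DomainManagerService dm = (DomainManagerService)
                      java.rmi.Naming.lookup("//manager-host/DomainManager"); // assumed name
                  result = dm.remesh(domainId, qualityIndicators, loadIndicators);
              } catch (Exception e) {
                  e.printStackTrace();   // error handling omitted in this sketch
              }
          }
      });
      rmiThread.start();   // the Solver thread keeps computing on the old mesh
  }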

More precisely, the processor which owns the requesting domain invokes a method that calls the Fortran 90 remesh library (the active method). The other processors invoke a method whose role is to synchronize with the active method and to get back a new domain if the Domain Manager decides to split the requesting domain. As the RMI API guarantees that methods invoked from different physical machines execute on different threads, this synchronization uses standard Java thread primitives.
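On the server side, this synchronization could look like the following skeleton, reusing the hypothetical interface of Section 3 and assuming, as stated above, that the two kinds of calls are served by distinct threads. The active call performs the native remesh and then wakes up the passive calls waiting on the same monitor; resetting the state between successive requests is omitted.

  // Hypothetical skeleton of the Domain Manager's Java server.
  public class DomainManagerImpl extends java.rmi.server.UnicastRemoteObject
                                 implements DomainManagerService {
      private RemeshResult lastResult;
      private boolean done;

      public DomainManagerImpl() throws java.rmi.RemoteException {}

      // Active method: invoked by the processor owning the requesting domain.
      public RemeshResult remesh(int domainId, double[] quality, double[] load)
              throws java.rmi.RemoteException {
          RemeshResult r = callRemeshLibrary(domainId, quality, load); // F90 library via JNI
          synchronized (this) {
              lastResult = r;
              done = true;
              notifyAll();              // wake up the passive invocations
          }
          return r;
      }

      // Passive method: invoked by the other processors.
      public synchronized RemeshResult awaitNewDomain()
              throws java.rmi.RemoteException {
          while (!done) {
              try { wait(); } catch (InterruptedException e) { /* retry */ }
          }
          return lastResult;
      }

      // Hypothetical native wrapper around the Fortran 90 remesh library.
      private native RemeshResult callRemeshLibrary(int domainId,
                                                    double[] quality, double[] load);
  }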

Currently, the RMIs return a Java Object which encapsulates all the arrays required to describe the modified domain, and the new one if the requesting domain has been split or merged. It has been shown that moving Objects through RMI is much less efficient than moving primitive arrays, because of the cost of marshaling/unmarshaling Objects instead of primitive arrays [2]. Designing a domain linearization layer to optimize this transfer could be realized if necessary.
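The returned object could be as simple as the following hypothetical container (field names are assumptions): since it travels through RMI, it must implement java.io.Serializable, and its fields are the primitive arrays whose marshaling cost is discussed above.

  import java.io.Serializable;

  // Hypothetical result object returned by the remesh service.
  // It only aggregates primitive arrays, so a later optimization could
  // flatten it into direct array transfers if marshaling proves costly.
  public class RemeshResult implements Serializable {
      public int[]    elementConnectivity;    // refined mesh of the modified domain
      public double[] nodeCoordinates;
      public int[]    newDomainConnectivity;  // null unless the domain was split
      public double[] newDomainCoordinates;
      public boolean  merged;                 // true if two subdomains were merged
  }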

Eventually, the thread associated with the Solver code reaches the point where it needs the new mesh. Then, the finalizer of the RPC is called, as another HPF_LOCAL routine. This routine joins with the RMI thread, and hence blocks if the RMI has not completed yet. Then the mesh data are updated with the return parameters. This is the only non-overlapped copy-back overhead, which is unavoidable because all the data related to the old mesh (matrices and solution vectors) are in use up to this point in time.
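Continuing the hypothetical SolverNode sketch, the finalizer side reduces to a join followed by the copy-back; the copyBack routine is again an assumption, standing for a JNI method that writes the returned arrays into the Fortran mesh data.

  // Inside the hypothetical SolverNode class: called back from the
  // HPF_LOCAL finalizer routine once the tolerance delay is exhausted.
  public void waitForNewMesh() {
      try {
          rmiThread.join();     // blocks only if the RMI has not completed yet
      } catch (InterruptedException e) {
          // ignored in this sketch
      }
      // The only non-overlapped cost: copy the returned arrays back into
      // the Fortran mesh data through a native method.
      copyBack(result);
  }

  private native void copyBack(RemeshResult result);   // hypothetical JNI routine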

5. Conclusion

Although we cannot present performance results, we are confident that the architecture will be as efficient as the underlying software allows. Preliminary evaluations of the non-linear solver execution time indicate that it will largely overlap the remesh time. Optimizations are possible for the best exploitation of the RMI mechanism; we checked that on all the machines which run the Solver (IBM SP2 and Linux PCs) and the Domain Manager (Sun workstations), the Java GC is actually able to pin down arrays.

Besides completing this experiment, future work will try to extend it to a more general HPF JNI. We plan to use an open compiler infrastructure such as ADAPTOR [1] to get rid of the compiler-dependent initialization problems. Such an interface will probably have to use some of the Java extensions designed for integrating data and task parallelism [16,15,11] in order to express the semantics of asynchronous RPC as described in this paper.

Acknowledgments

The experiments were performed on the SP-2 from the Computing Resource Center (CRI) of Paris-11 and the National Computing Center (CNUSC).

References

[1] T. Brandes and F. Zimmermann, ADAPTOR: A transformation tool for HPF programs, in: Programming Environments for Massively Distributed Systems, 1994.

[2] F. Berg et al., Java RMI performance and object model interoperability: Experiments with Java/HPC++, in: Java98 ACM Workshop, 1998.

[3] L.V. Kale, M. Bhandarkar and T. Wilmarth, Design and implementation of parallel Java with global object space, in: Proc. Int. Conf. Parallel and Distributed Processing, July 1997.

[4] H. Casanova and J. Dongarra, Netsolve: A network server for solving computational science problems, in: Proc. Supercomputing '96.

[5] H. Casanova, J. Dongarra and D.M. Doolin, Java access to numerical libraries, in: ACM Workshop on Java for Science and Engineering Computations, June 1997; Concurrency: Practice and Experience, 1997.

[6] F. Coelho, C. Germain and J.-L. Pazat, Compiling HPF, in: The Data Parallel Programming Model, Lecture Notes Comput. Sci., Vol. 1132, 1996.

[7] G. Fox and W. Furmanski, Towards Web/Java based high performance distributed computing, in: Proc. 5th IEEE Int. Symp. on High Performance Distributed Computing, 1996.

[8] C. Germain, J. Laminie, M. Pallud and D. Etiemble, An HPF case study of a domain-decomposition based irregular application, in: PaCT '97, Lecture Notes Comput. Sci., Vol. 1277.

[9] V. Getov, S. Flynn-Hummel and S. Mintchev, High-performance parallel programming in Java: Exploiting native libraries, in: Java98 ACM Workshop, 1998, http://www.cs.ucsb.edu/conferences/java98/

[10] C. Calgaro, J. Laminie and M. Temam, Dynamic multilevel schemes for the solution of evolution equations by finite elements discretization, Applied Numerical Maths, 1997.

[11] P. Launay and J.-L. Pazat, A framework for parallel programming in Java, in: HPCN '98, Lecture Notes Comput. Sci., April 1998.

[12] K. Keahey and D. Gannon, PARDIS: A parallel approach to CORBA, in: Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing (best paper award), August 1997.

[13] L. Machiels and M.O. Deville, Fortran 90: An entry to object-oriented programming for the solution of partial differential equations, ACM Trans. Mathematical Software, March 1997.

[14] HPCC and Java – A preliminary report by the Parallel Compiler Runtime Consortium, http://www.npac.syr.edu/users/gcf/hpjava.html

[15] M. Philippsen and M. Zenger, JavaParty – transparent remote objects in Java, in: PPoPP, June 1997.

[16] K. van Reeuwijk, A.J.C. van Gemund and H.J. Sips, Spar: A programming language for semi-automatic compilation of parallel programs, Concurrency: Practice and Experience, Nov. 1997.

[17] E. Sturler and D. Loher, Parallel iterative solvers for irregular sparse matrices in High Performance Fortran, Journal of Future Generation Computer Systems, 1998.
