
Parallel Computing 37 (2011) 302–315


Journal homepage: www.elsevier.com/locate/parco

A common parallel computing framework for modeling hydrological processes of river basins

Hao Wang ⇑, Xudong Fu, Guangqian Wang, Tiejian Li, Jie Gao

State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China

Article info

Article history: Received 27 April 2010; received in revised form 22 March 2011; accepted 9 May 2011; available online 13 May 2011

Keywords: Binary tree; Distributed hydrological model; Drainage network; MPI; Parallel computing; River basin

0167-8191/$ - see front matter © 2011 Elsevier B.V. doi:10.1016/j.parco.2011.05.003

⇑ Corresponding author. Tel.: +86 13811332227. E-mail addresses: [email protected] (H. Wang), [email protected] (X. Fu), [email protected] (G. Wang), litiejian@tsinghua.edu.cn (T. Li), [email protected] (J. Gao).

Abstract

Restricted computing power has become one of the primary factors obstructing advancement in basin simulations for the majority of hydrological models. Parallel computing is one of the most readily available approaches to solve this problem. Using binary-tree theory, we present in this study a common parallel computing framework based on the message passing interface (MPI) protocol for modeling hydrological processes of river basins. A practical and dynamic spatial domain decomposition method, based on the binary-tree structure of the drainage network, is proposed. This framework is computationally efficient and is independent of the type of physical models chosen. The framework is tested in the Chabagou river basin of China, where two years of runoff processes of the entire basin were simulated. Results demonstrate that the system may provide efficient computing performance. However, primarily because of the constraint of the binary-tree structure of the drainage network, this study finds that unlimited enhancement of computing efficiency is impossible to realize.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The hydrologic cycle serves as a significant link between climate and human society, and the study of its laws at the large-basin scale [2,3,28] is of immense importance. Since the mid-1980s, distributed hydrological models (DHMs) [4,20,27,31] have become one of the most important tools used in basin hydrology studies. However, although a DHM has the potential to increase simulation accuracy, it usually requires a significant number of numerical computations. To obtain higher computational performance, hydrologists often have to reduce the accuracy of simulations.

One way to overcome the efficiency limitation is to use parallel algorithms on multi-processor computers or in distributed environments [1]. The nature of parallel computing is to exploit the concurrency of a simulation problem, with the computational domain separated into a number of subunits that can be executed simultaneously. In the fields of hydrology and computational fluid dynamics (CFD), as water movement processes are usually temporally successive, spatial domain decomposition is usually adopted for parallel computing. Based on the spatial partition method, the parallel computing framework is classified into two types in this paper: closely coupled and loosely coupled.

In the closely coupled case, the spatial domain is usually discretized into numerous 2D or 3D grids of various sizes. Physical connections exist between arbitrary adjacent units. This type of partitioning has the potential to perform more intensive and exact simulations. Presently, a large body of work on closely-coupled parallelization can be found in the literature [5,8,13,18,19,21,23]. However, in general, communication overhead in a closely-coupled framework may rise significantly as the number of spatial grids increases. On a parallel architecture this may become a dominant factor inhibiting computing efficiency.

In contrast, when landscape structure is considered, a river basin is often partitioned into a number of loosely-coupled units for DHM applications [4,17,26], such as sub-basins, hillslopes, or hydrological response units (HRUs). These units are connected by the drainage network and are relatively independent of each other. Unlike the closely-coupled case, hydrological processes within a loosely-coupled unit can be assumed closed in many cases [4,26,27]; thus, a loosely-coupled unit only has physical relations with its adjacent upstream and downstream units. Compared to the closely-coupled situation, loosely-coupled domain decomposition may have greater potential to provide efficient parallel performance for large-scale river basin applications.

Using loosely-coupled partitioning, parallel simulations of basin hydrology have been attempted by some studies [1,7,11–13,22]. Although satisfactory efficiency results have been obtained, only a handful of studies present a common and detailed parallel approach for performing large-scale physical simulation of river basins. This paper introduces an approach based on binary-tree theory to depict the drainage network structure. Each binary-tree node corresponds to a unique sub-basin. The binary-tree structure implies that each node is connected only to its parent node (the root node has none) and its two child nodes. This property can weaken the dependency among sub-basins and has the potential to decrease communication between processors. However, if groundwater movement in a river basin is so strong that there is a large amount of water exchange among sub-basins, it may be unsuitable to treat a sub-basin as an independent unit; in this case, the binary-tree structure may not be applicable. A binary-tree-based domain decomposition method is proposed in this study. Our method can effectively perform dynamic task allocation during parallel simulations. We also implemented a parallel computing framework to carry out the entire simulation process of a river basin. This programming framework is independent of the type of hydrological model, and is implemented using MPI.

The parallel computing framework comprises one database and three functional node types: master node, slave node, and transfer node. The database stores the static data for physical models as well as the final simulation results. The master node is responsible for performing the domain decomposition method and allocating computational tasks to slave nodes. Slave nodes are responsible for applying the physical models to all accepted tasks. The transfer node is in charge of the communication processes among all slave nodes. The detailed system mechanism is discussed in the following sections.

The paper is arranged as follows: Section 2 discusses the process of transforming the drainage network into a binary-tree structure. Section 3 illustrates the architecture of the parallel computing system as well as its workflow. The master node workflow is described in Section 3.1, the slave node workflow in Section 3.2, the transfer node workflow in Section 3.3, and the domain decomposition method, together with a formula for the maximum speedup ratio (MSR) of a drainage network, in Section 3.4. Section 4 presents a case study of the Chabagou River Basin in China to highlight the performance of the parallel framework. Finally, Section 5 concludes the paper with a general discussion.

2. Binary-tree structure of drainage network

Fig. 1 presents a schematic drainage network and the corresponding binary-tree structure. Each river reach represents a sub-basin, as illustrated in the enlarged drawings. In addition to one river reach, a sub-basin also contains two or three hillslopes: source sub-basins have three and the others have two.

Fig. 1. A constructed drainage network and its binary-tree structure. The number on each river reach is the binary-tree code defined in Eq. (1). Each river reach corresponds to a single sub-basin. Source sub-basins have three hillslopes (left enlarged drawing), and the others have two (right enlarged drawing).

Fig. 2. A drainage network with two multidirectional points and its binary-tree structure. The drainage network of (a) has two multidirectional points: A and B. A has three upstream sub-basins and B has four. Nodes 8, 6 and 13 are virtual nodes, which transform (a) into a standard binary tree, (b) or (c).

In computer programming, the binary-tree structure of a drainage network is expressed through a coding method. In the field of hydrology, there are already several commonly employed coding methods for drainage networks [29]. However, because of their relatively low query efficiency, they may not be very suitable for parallel computing. In this paper, a binary-tree coding method [15,25] is proposed, as defined in Eq. (1):

$$BC = 2 \times PC, \qquad GC = 2 \times PC + 1 \tag{1}$$

where PC is the parent node code, BC is the left-child code of PC, and GC is the right-child code. The binary-tree code begins with 1 for the root node. By Eq. (1), all sub-basins in the entire river basin can be coded, as shown in Fig. 1. The following formula can also be derived from Eq. (1):

$$PC = \left[ \frac{BC \text{ or } GC}{2} \right] \tag{2}$$

where “[ ]” is the downward-rounding function (i.e., [7/2] = 3). This definition of the binary-tree code allows sub-basins to be located directly. By Eq. (1), the adjacent upstream sub-basins of an arbitrary sub-basin can be identified swiftly. By Eq. (2), the downstream sub-basin can be located immediately. The concrete applications of the binary-tree code in the system are described in Sections 3.1 and 3.3.
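Eqs. (1) and (2) map directly onto integer arithmetic. The following minimal Python sketch illustrates the lookups; the function names are illustrative and do not come from the authors' C++ implementation, and the `layer` helper is an added observation that follows from the coding scheme when the root code is 1.

```python
def children(pc: int) -> tuple[int, int]:
    """Eq. (1): left-child code BC = 2*PC and right-child code GC = 2*PC + 1."""
    return 2 * pc, 2 * pc + 1


def parent(code: int) -> int:
    """Eq. (2): parent code PC = [code / 2], with downward rounding."""
    if code <= 1:
        raise ValueError("the root node (code 1) has no downstream sub-basin")
    return code // 2


def layer(code: int) -> int:
    """Binary-tree layer of a code, with the root (code 1) in Layer 1."""
    return code.bit_length()
```

For the network in Fig. 1, `children(8)` gives the codes 16 and 17 of the two upstream sub-basins of Node 8, and `parent(7)` reproduces the example [7/2] = 3.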

Sometimes a drainage network contains multidirectional points, as shown in Fig. 2. A multidirectional point is defined here as a point with at least three upstream sub-basins, like Points A and B in Fig. 2(a). Point A has three upstream sub-basins and Point B has four. A multidirectional point violates the binary-tree property. To solve this problem, the “Virtual Node” method is proposed. This method converts a drainage network with multidirectional points into a standard binary-tree structure. A virtual node is a virtual sub-basin with “zero” topological features, such as zero area and zero length. Because of these “zero” features, the physical processes of the river basin are not affected by the virtual node. However, a virtual node occupies one real binary-tree code. In Fig. 2(b), three virtual nodes (Nodes 8, 6 and 13) are inserted into the original drainage network, and Fig. 2(c) shows the resulting standard binary-tree structure. If a multidirectional point has N upstream sub-basins, N − 2 virtual nodes are needed.
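The N − 2 rule can be illustrated with a small Python sketch. The chaining scheme below (one real sub-basin as left child, one virtual node as right child, repeated until two sub-basins remain) is only one possible arrangement and is an assumption; the placement of virtual nodes in Fig. 2 may differ. The function name and the `"virtual"` label are hypothetical.

```python
def assign_codes(pc: int, upstream: list[str]) -> dict[int, str]:
    """Assign binary-tree codes to the upstream sub-basins of node `pc`,
    inserting virtual nodes while more than two sub-basins remain."""
    codes = {}
    while len(upstream) > 2:
        codes[2 * pc] = upstream[0]      # left child: a real sub-basin
        codes[2 * pc + 1] = "virtual"    # right child: zero-area, zero-length node
        pc, upstream = 2 * pc + 1, upstream[1:]
    for offset, name in enumerate(upstream):
        codes[2 * pc + offset] = name    # at most two sub-basins are left
    return codes


# A point with N = 4 upstream sub-basins needs N - 2 = 2 virtual nodes.
coded = assign_codes(1, ["a", "b", "c", "d"])
```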

3. Parallel architecture design and implementation

Fig. 3 presents the framework of our parallel computing system for river basin simulations. This system is implemented in C++ and runs in the Windows OS environment. MPICH2 [16] is employed to perform inter-processor communication. The system can operate on a single-core computer, a multi-core computer, or multiple computers connected by a local area network. The system consists of four components: database, master node, slave node and transfer node. These four components collaborate closely to accomplish a unitary simulation process. The database is the data center of the system. On the one hand, it stores the original static data needed by physical models to commence simulation, such as basin topography information, land use and soil type parameters, and parameters of physical models. On the other hand, the database receives the final simulation results exported by all kinds of physical models.

The master node is unique in the parallel system. It mainly performs domain decomposition and computing-task allocation. A task usually contains a small binary tree extracted from the whole drainage network. Theoretically, the number of sub-basins in a computing task can range from 1 to the total number of sub-basins. Any number of slave nodes can be employed, depending on the available hardware; only two are shown in Fig. 3 to illustrate the system framework. The slave node applies physical models to all computing tasks received from the master node. The transfer node handles the communication among all slave nodes. It provides a temporary location to store the information sent by slave nodes. Fig. 4 presents the pseudo-code of the system. From Fig. 4, we can see that the system is independent of the type of basin physical model. Detailed operating mechanisms of the master node, the slave node and the transfer node are described in Sections 3.1–3.3. The domain decomposition method run by the master node is described in Section 3.4. Section 3.4 also states why the master node and the transfer node are included in the system.

Fig. 3. Framework of the parallel computing system. The system is made up of one database, one master node, one transfer node, and any number of slave nodes. The master node is in charge of domain decomposition and task allocation. The slave node runs physical models for the tasks accepted from the master node. The transfer node is responsible for the communication processes among slave nodes.

3.1. Master node workflow

The master node is the “brain” of the parallel system. Fig. 5 illustrates its detailed internal workflow during computation. First, the master node imports the essential information of the river basin from the database into computer memory. According to the binary-tree code of each sub-basin, the binary-tree structure of the drainage network is constructed. The binary-tree coding process for all sub-basins is completed in advance by a separate program. Second, using the domain decomposition method (Section 3.4), the master node finds computing tasks and sends them to each slave node to begin the first computation. Third, the master node enters the message-detecting state. When a slave node completes a task, it sends the binary-tree code of the task root node to the master node. According to this binary-tree code and Eq. (2), the master node finds its downstream sub-basin (the parent node) directly and marks it computable. Then, the master node calls the domain decomposition method again to find a new computing task and sends the task to the slave node. Finally, the master node returns to the message-detecting state and begins a new cycle. Meanwhile, after receiving a new task, the slave node starts a new simulation.
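The bookkeeping in the third step can be sketched in Python without any MPI calls. This is a simplified illustration under two assumptions that are not taken from the paper: every task consists of a single sub-basin, and a parent is marked computable only once all of its existing upstream sub-basins are finished (in line with the second assumption of Section 3.4). The class and method names are hypothetical.

```python
FIG1_CODES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 16, 17, 18, 19]


class MasterState:
    """Tracks which sub-basins are finished and which have become computable."""

    def __init__(self, codes):
        self.codes = set(codes)
        self.done = set()
        # Sub-basins without upstream neighbours can be computed immediately.
        self.computable = {c for c in self.codes if not self._upstream(c)}

    def _upstream(self, code):
        # Eq. (1): child codes that actually exist in this drainage network.
        return [c for c in (2 * code, 2 * code + 1) if c in self.codes]

    def task_finished(self, root_code):
        """Handle the message a slave sends after finishing a task.
        Returns the newly computable downstream code, or None."""
        self.done.add(root_code)
        self.computable.discard(root_code)
        if root_code == 1:
            return None  # the basin outlet has been reached
        downstream = root_code // 2  # Eq. (2)
        if all(c in self.done for c in self._upstream(downstream)):
            self.computable.add(downstream)
            return downstream
        return None
```

With the codes of Fig. 1, reporting Node 16 alone does not release Node 8; reporting Node 17 afterwards does.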

3.2. Slave node workflow

The slave node is in charge of running the basin physical models. Through many tests, we find that the master and transfer nodes in the system generally consume very limited processor resources. Most of the time, they remain in an I/O blocking state. A theoretical analysis is given in Section 3.4. In most cases, the number of slave nodes employed can equal the number of processors. Fig. 6 describes the slave node workflow. Once the slave node receives a computing task from the master node, it enters the simulation procedure at once. During computation, the corresponding physical models of the river basin, such as the hydrological model, are called automatically. Meanwhile, the slave node may extract upstream information from the transfer node. In terms of Fig. 1, if a task containing only Node 8 is to be computed, the slave node first extracts the simulation results of Nodes 16 and 17 from the transfer node. The simulation results of Nodes 16 and 17 serve as the upstream boundary conditions of Node 8. After finishing a task, the slave node carries out the three procedures below (Fig. 6):

First, for subsequent data post-analysis, the slave node exports the simulation results of the sub-basins of interest in the task to the database. The sub-basins of interest are specified by their binary-tree codes prior to the simulation. These binary-tree codes are stored in the database. Second, the slave node sends the simulation results and the binary-tree code of the task root node to the transfer node. This allows the downstream node to extract upstream information later. Third, the slave node notifies the master node in order to get a new computing task and begin a new cycle.

Fig. 4. Pseudo-code of the parallel computing system.

Fig. 5. Master node workflow. The master node is responsible for splitting the river basin into computing tasks and dispatching them.

Fig. 6. Slave node workflow. The slave node is responsible for the actual simulation process of the river basin.

3.3. Transfer node workflow

The transfer node is the “traffic hub” of the parallel computing system. It is responsible for the communication processes among different slave nodes [24]. As introduced in Section 3.2, once a computing task is completed by a slave node, the model outputs and the corresponding binary-tree code are sent to the transfer node; these messages are then organized in a one-way linked list (Fig. 7) and wait to be extracted by other slave nodes later. Every element of the list includes two parts: the binary-tree code of the sub-basin received (labeled “V” in Fig. 7), and the simulation outputs of the physical models (illustrated with a red rectangle in Fig. 7). For the hydrological model of a river basin, the red rectangle usually represents the discharge series at the end of the river reach. However, if more physical processes of the river basin need to be simulated, the element needs to contain more parts, such as sediment series or contamination series. In order to increase message-exchange efficiency, diverse information should be packed together in the slave node and sent to the transfer node at one time. Correspondingly, the transfer node needs to unpack the messages received. The packing and unpacking processes are achieved through the MPI functions “MPI_Pack” and “MPI_Unpack” [10].

When a slave node computes a task, the upstream sub-basins of the task must already have been completed. This is guaranteed by the domain decomposition method (Section 3.4). Then, to extract the simulation results of the upstream sub-basins, the slave node sends a message (message 2 or 3 in Fig. 7) to the transfer node. The message contains the binary-tree code(s) of the required upstream sub-basin(s), calculated by Eq. (1). As each binary-tree node has at most two upstream nodes, messages 2 and 3 in Fig. 7 are sufficient. Using the message, the transfer node locates the right element(s) in the linked list and sends it (them) swiftly to the slave node. Finally, the transfer node deletes the elements from the one-way linked list and begins a new cycle.

Fig. 7. Transfer node workflow. The transfer node is responsible for the communication processes among slave nodes. The red rectangle represents simulation outputs of various physical models. “V” is the binary-tree code of the sub-basin. One red rectangle and its “V” belong to the root node of a finished task. All elements are organized in a one-way linked list. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
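The deposit, lookup and delete cycle of the transfer node can be sketched as follows. This is an illustration only: it replaces the one-way linked list of the paper with a Python dictionary keyed by binary-tree code, omits all MPI messaging and packing, and uses hypothetical class and method names.

```python
class TransferStore:
    """Temporary store for the outputs of finished tasks, keyed by the
    binary-tree code of the task root node."""

    def __init__(self):
        self._results = {}

    def deposit(self, code, series):
        """Store the outputs (e.g. a discharge series) sent by a slave node."""
        self._results[code] = list(series)

    def extract_upstream(self, code):
        """Return the stored outputs of the upstream nodes of `code`,
        located with Eq. (1), and delete them from the store."""
        found = {}
        for child in (2 * code, 2 * code + 1):
            if child in self._results:
                found[child] = self._results.pop(child)
        return found


store = TransferStore()
store.deposit(16, [0.0, 1.2, 0.8])   # made-up discharge values
store.deposit(17, [0.0, 0.5, 0.4])
boundary = store.extract_upstream(8)  # upstream boundary conditions of Node 8
```

A dictionary gives constant-time lookup by code; a linked list, as used in the paper, needs a scan but keeps the elements in arrival order.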

3.4. Domain decomposition method

The domain decomposition method has a great impact on parallel computing efficiency. In order to achieve high computing speed, the method should balance the load of the processors involved as much as possible. However, the computing load of a sub-basin may be influenced by many factors, such as the physical models selected, precipitation, and topographic differences. At present we find it difficult to identify a suitable way to evaluate the computing load of a sub-basin; therefore, we do not adopt the static domain decomposition method, which allocates all computing units to the processors before the simulation starts. Instead, we adopt a dynamic approach, which automatically allocates a computing task to an idle processor during simulation. This helps to balance the processor load. To execute this dynamic process, the master node is included in the system (see Fig. 5). Because of the dynamic nature of the task-allocating process, the computing sequence of the sub-basins cannot be determined before the simulation; thus, the transfer node is designed to dynamically manage all communication processes among slave nodes (see Fig. 7).

Apart from the dynamic allocation introduced above, the domain decomposition method here primarily concerns the computing sequence of sub-basins. In the system, we make two assumptions for the domain decomposition method. First, backwater effects between sub-basins are not considered; that is, water is not allowed to move from a downstream sub-basin to its upstream sub-basin. Within one sub-basin, however, backwater effects are allowed. Second, a downstream sub-basin cannot be processed until its upstream sub-basins (child nodes) are completed. In terms of Fig. 1, Node 8 cannot begin until Nodes 16 and 17 are finished. While this restriction is convenient to program, it may be too strict. Two ways to relax this restriction are proposed in detail in Section 4, which also outlines our future directions. Under these two assumptions, we report the maximum speedup ratio (MSR) for river basin computations based on our binary-tree structure. A domain decomposition method to attain MSR is proposed as well and has already been employed by the system. The derivation of MSR and the method are as follows:

For the drainage network of a river basin, M is defined as the number of sub-basins, $T_r$ is the average computing time of a single sub-basin, and $T_o$ is the average communication time of a single sub-basin. If only one computer processor is used, a time of $M \times T_r$ is required to finish the entire simulation. According to the assumptions mentioned above, nodes on the same line of the binary tree must be calculated one by one from upstream to downstream. For example, in Fig. 1, Nodes 1, 2, 4, 9, and 19 are on one line. Nodes 1, 3, 6, and 12 are on a line as well. We assume that each binary-tree node has the same values of $T_r$ and $T_o$ (load-balanced situation). Then, for parallel computing, the line with the largest number of nodes determines the minimum computing time (MCT), so

$$MCT = L \times (T_r + T_o) \tag{3}$$

where $T_r$ and $T_o$ are the average computing and communication times for a single binary-tree node, and L is the number of nodes on the longest line. Note that L is equal to the number of binary-tree layers. Instead of MCT, we use the speedup ratio, a commonly used dimensionless index, to evaluate parallel-computing performance. The speedup ratio is defined as follows:

$$R = \frac{T_s}{T_p} \tag{4}$$

where R is the speedup ratio, $T_s$ is the serial computing time, and $T_p$ is the parallel computing time. When $T_p = MCT$, R attains MSR:

$$MSR = \frac{T_s}{(T_p)_{\min}} = \frac{T_s}{MCT} = \frac{M \times T_r}{L \times (T_r + T_o)} \approx \frac{M}{L} \quad (\text{if } T_o \ll T_r) \tag{5}$$

where M is the number of sub-basins in the entire river basin, and L is the number of binary-tree layers. According to the assumptions above, a sub-basin only needs to transfer MPI messages twice during the entire simulation. One is the receiving process from upstream at the sub-basin inlet, prior to the start of the simulation. The other is the sending process to downstream at the sub-basin outlet, at the end of the simulation. Therefore, compared to $T_r$, the communication time $T_o$ is quite small. This is also the reason why the master and transfer nodes consume only very limited CPU resources, and most of the CPU time is occupied by the slave nodes. Section 4 presents a simple comparison between $T_r$ and $T_o$. In terms of the binary tree in Fig. 1, there are 15 nodes and 5 layers. By Eq. (5), M = 15 and L = 5; hence, MSR = 15/5 = 3.

Fig. 8. Pseudo-code of the domain decomposition method.
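Eq. (5) can be evaluated directly from the list of binary-tree codes. The sketch below is illustrative; the function name and default arguments are assumptions, and the number of layers L is obtained from the bit length of the largest code, which is valid only when the root code is 1 and every code's parent is present.

```python
def msr(codes, t_r=1.0, t_o=0.0):
    """Maximum speedup ratio of Eq. (5): M*Tr / (L*(Tr + To))."""
    m = len(codes)                                  # M: number of sub-basins
    n_layers = max(c.bit_length() for c in codes)   # L: number of layers
    return (m * t_r) / (n_layers * (t_r + t_o))


FIG1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 16, 17, 18, 19]
ideal = msr(FIG1)                              # To neglected
with_comm = msr(FIG1, t_r=0.97, t_o=0.03)      # made-up timing values
```

With communication neglected this reproduces MSR = 15/5 = 3 for Fig. 1; a nonzero $T_o$ lowers the value in proportion to $T_r/(T_r + T_o)$.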

In order to reach MSR, the domain decomposition method of the system assigns the computing sequence from the bottom binary-tree layer to the top one (the top layer is defined as Layer 1 here). Under load balance, if one layer is processed in each interval $T_r$, MSR can be obtained. Therefore, if the number of processors equals the maximum number of nodes in any layer, this domain decomposition method is capable of achieving MSR. For example, in Fig. 1, Layers 3, 4 and 5 all have four nodes. Thus, four processors can be used by the domain decomposition method. Nodes 16, 17, 18, and 19 are computed first. The second step is Nodes 8, 9, 12, and 13; the third step is Nodes 4, 5, 6, and 7; the fourth step is Nodes 2 and 3; the last step is Node 1. Thus, with each step, one complete binary-tree layer is finished and MSR is obtained. Fig. 8 shows the pseudo-code of the domain decomposition method in the system.
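The bottom-up, layer-by-layer sequence described above can be reproduced with a few lines of Python. This sketch only groups codes by layer and is not the dynamic method of Fig. 8; the function name is hypothetical.

```python
from collections import defaultdict


def layer_schedule(codes):
    """Group binary-tree codes by layer and list the layers bottom-up,
    one layer per computing step."""
    layers = defaultdict(list)
    for code in sorted(codes):
        layers[code.bit_length()].append(code)  # the root (code 1) is Layer 1
    return [layers[k] for k in sorted(layers, reverse=True)]


FIG1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 16, 17, 18, 19]
steps = layer_schedule(FIG1)
processors_needed = max(len(step) for step in steps)  # width of the widest layer
```

For Fig. 1 the five steps match the sequence given in the text, and the widest layer has four nodes.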

4. Application

4.1. Study area and computing environment

The Chabagou basin, located in the Yellow River basin of China, is the study area. The Chabagou basin is situated in the gullied rolling region with a catchment area of 205 km²; its geographical location is between 37°38′ and 37°48′N latitude and 109°47′ and 110°03′E longitude. Fig. 9 shows the 50 m-resolution digital elevation model (DEM) of the basin. From the DEM, 501 river reaches are extracted by the TOPAZ software [9,14].

The parallel code is written in C++ under the Microsoft Visual Studio 2008 platform. MPICH2 [16] is the MPI implementation. In this application, the program is executed on a single computer server with 16 CPUs, 24 GB RAM, and the Windows Server 2008 operating system. All processors are on the same board and the CPU type is Intel(R) Xeon E5520, 2.27 GHz. The Oracle 11g database serves as the I/O system. The Xinanjiang rainfall–runoff model [30] is applied to simulate the hydrological processes of the Chabagou basin. The Muskingum flow routing model [6], commonly used in hydrology, is adopted. In our simulations, the computing time span of each sub-basin is set to 2 years, and the time step is set to 1 min.

While the resolution of the drainage network (see Fig. 9), the type of physical model, and the time-step value may be inappropriate in practice, the primary objective of this paper is to test the computational efficiency of the parallel computing system. We only wish to produce a reasonable output and an acceptable simulation time for a single sub-basin. Therefore, in this study of the Chabagou basin, we have not considered the practical rationality of our choice of parameters.

4.2. Results and discussion

Fig. 10 shows the binary-tree structure of the Chabagou basin. The X-coordinate represents the binary-tree layer, while the Y-coordinate represents the number of nodes at each layer. There are 55 layers and 501 nodes in the binary tree. The root node is at Layer 1. Layer 19 has 20 nodes, which is the maximum value of the Y-coordinate. According to Eq. (5), MSR of the binary tree is equal to 501/55 ≈ 9.10. Therefore, if 20 processors are used, the theoretical speedup ratio obtained by the parallel computing framework can reach 9.10.

Fig. 9. 50 m-resolution DEM and drainage network of the Chabagou basin. DEM data contains 3D coordinates (X, Y, Z) of every geographic point. X is latitude, Y is longitude, and Z is the absolute elevation. The drainage network has 501 river reaches and is extracted from the DEM by the TOPAZ software [9,14].

Fig. 10. The binary-tree structure of the Chabagou-basin drainage network. The X-coordinate represents the binary-tree layer. The Y-coordinate represents the number of binary-tree nodes at each layer. The root node corresponds to Layer 1.

Fig. 11 shows the results of hydrological simulations under the single-processor and 16-processor configurations of the parallel computing system. Three-month results at the basin outlet are compared. From the figure, we can see that the hydrographs are identical. This demonstrates the effectiveness of the parallelization. Since the objective of Fig. 11 is to evaluate the success of the parallelization, model calibration is not carried out here.

Fig. 12 shows the computational performance for the Chabagou basin under different numbers of processors. In actual simulations, we find that the master and transfer nodes hardly consume any processor resources, for the reasons explained in Section 3.4. Thus the number of processors shown in Fig. 12 represents only the number of slave nodes used, not including the master and transfer nodes.

From Fig. 12, we can observe that the computing-time curve is decreasing and convex. The speedup-ratio curve comprises two distinct stages: the first is monotonically increasing, and the second is almost horizontal. The coordinates of the turning point are (10, 7.80). The speedup-ratio curve clearly shows that although parallelization initially yields gains, the effect becomes increasingly limited. As the number of processors increases, the time consumed approaches a constant value. MSR is around 7.80 for this case. Compared to the computing time, the speedup ratio may be a more suitable index of computational efficiency: not only is it a dimensionless value that is independent of the physical models employed, but its turning point is also much easier to observe.

Fig. 11. Comparison of hydrological simulation results at the basin outlet between the single-processor and 16-processor configurations. Three-month results are illustrated. Hydrographs of the two situations are identical. This demonstrates that the parallel computing system is effectively constructed.

Fig. 12. Parallel-computing time and speedup ratio under different numbers of processors. The computing-time curve is decreasing and convex. The first stage of the speedup-ratio curve is monotonically increasing and the second is almost horizontal. The coordinates of the turning point are (10, 7.80). It can be inferred that processors added after the first stage do not participate in the simulation in practice.


The simulated MSR shown in Fig. 12 equals 7.80, smaller than the theoretical MSR of 9.10. By Eq. (5), we know that communication time can decrease MSR. Fig. 13 presents the average computing time for a single sub-basin under different numbers of processors. The computing-time proportion is defined as $T_r/(T_r + T_o)$ (see Eq. (5)). From Fig. 13, it can be observed that this proportion is always greater than 97%; this implies that the influence of communication on MSR is quite limited. The reason for this low communication load has been analyzed in Section 3.4. The decrease of MSR may primarily result from load imbalance among sub-basins. MSR in Eq. (5) is obtained under a balanced-load scenario. In practice, however, load imbalance is inevitable. Thus, the computing time for each binary-tree layer (Fig. 8) is actually set by the node with the lowest running speed. The maximum node running time of layer i is denoted by $T^{i}_{r\max}$ ($1 \le i \le L$), where L is the number of binary-tree layers. According to Eq. (5), and by setting $T_{r\max} = \left(\sum_{i=1}^{L} T^{i}_{r\max}\right)/L$, the following relation can be obtained:

$$MSR = \frac{T_s}{(T_p)_{\min}} = \frac{T_s}{MCT} = \frac{M \times T_r}{L \times T_{r\max}} \le \frac{M \times T_r}{L \times T_r} \le \frac{M}{L} \tag{6}$$
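The effect described by Eq. (6) can be checked numerically. The following Python sketch is an illustration under stated assumptions: communication time is neglected, layers are processed strictly one after another, and the per-node running times are made-up values rather than measurements from the Chabagou simulation. The function name is hypothetical.

```python
def msr_imbalanced(times):
    """Eq. (6) with communication neglected.
    `times` maps each binary-tree code to the running time of that sub-basin."""
    layer_max = {}
    for code, t in times.items():
        k = code.bit_length()                      # layer of this node
        layer_max[k] = max(layer_max.get(k, 0.0), t)
    t_serial = sum(times.values())                 # Ts = M * Tr
    t_parallel = sum(layer_max.values())           # L * Tr_max
    return t_serial / t_parallel


FIG1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 16, 17, 18, 19]
balanced = {code: 1.0 for code in FIG1}
skewed = dict(balanced)
skewed[16] = 2.0   # one slow sub-basin in the bottom layer

ratio_balanced = msr_imbalanced(balanced)
ratio_skewed = msr_imbalanced(skewed)
```

In the balanced case the ratio equals M/L = 3; a single slow node in one layer lengthens that layer's step and pulls the ratio below M/L.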

From Eq. (6), as a result of load imbalance, the practical MSR is generally smaller than the ideal value (M/L) of Eq. (5).

From the analysis of the domain decomposition method (Section 3.4), we have shown that MSR can be attained if the processor number equals the number of nodes in the widest binary-tree layer. By widest layer we mean the layer that has the largest number of nodes. In Fig. 10, the widest layer of the binary tree has 20 nodes; however, in Fig. 12, we can see that when

Fig. 13. Average computing time for a single sub-basin and the computing time proportion. This proportion is defined as $T_r/(T_r + T_o)$ (see Eq. (5)). From the figure, we can observe that computing time is overwhelmingly dominant compared to communication time. This mainly arises from the domain decomposition method used by our system, as explained in Section 3.4.

Fig. 14. A binary-tree example illustrating OPN. OPN is the minimum number of processors that can achieve MSR. For this case, OPN is equal to three, which is smaller than the number of nodes in the widest layer.


the number of processors is equal to 10, the speedup ratio has already reached the MSR value. Fig. 14 gives an explanation for this. In Fig. 14, there are five binary-tree layers, and the widest layer has four nodes. Assuming that every node has the same simulation time (balanced load scenario), at least five steps are needed to reach MSR. Fig. 14 shows that if four processors are used, each layer is fully processed after every step. However, if three processors are used, the MSR can also be obtained. The main reason is that some nodes in the upper layers can be computed during earlier steps. For example, Node 11 in Layer 4 is computed in Step 1. Nodes 6 and 7 in Layer 3 are computed in Step 2. As discussed before, the domain decomposition method in our system is dynamic; thus, it is capable of simulating the computable nodes of upper layers in advance. This explains why, in Fig. 12, MSR is already reached once the processor number reaches 10.

We hypothesize that every binary tree may have an optimal processor number (OPN). OPN here is defined as the minimum number of processors that can reach MSR. In Fig. 14, OPN is equal to three. It is impossible to complete the whole computation in five steps using fewer than three processors. In our parallel system, the number of processors used is determined at the very beginning. Therefore, if a formula estimating OPN could be obtained, the system would use the least hardware resources to realize MSR. We will carry out further research geared towards the calculation of OPN in future work.
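Pending such a formula, OPN can be found for a small tree by direct search. The sketch below is only an illustration under stated assumptions: every node takes one unit-time step (balanced load), a node becomes computable once all its children are done, and ready nodes are dispatched deepest layer first. The tree `TREE`, its heap-style node numbering, and the function names are hypothetical; it is not the tree of Fig. 14, although it also has five layers and a widest layer of four nodes.

```python
from collections import defaultdict


def node_depths(parent):
    """Layer index of every node (root = 1), from a child -> parent map."""
    depth = {}

    def visit(node):
        if node not in depth:
            par = parent[node]
            depth[node] = 1 if par is None else visit(par) + 1
        return depth[node]

    for node in parent:
        visit(node)
    return depth


def steps_needed(parent, processors):
    """Unit-time steps to process the whole tree with a fixed processor count."""
    depth = node_depths(parent)
    children_left = defaultdict(int)
    for node, par in parent.items():
        if par is not None:
            children_left[par] += 1
    ready = [n for n in parent if children_left[n] == 0]  # leaves
    steps = 0
    while ready:
        # Assumed dispatch rule: deepest ready nodes first, ties by node id.
        ready.sort(key=lambda n: (-depth[n], n))
        batch, ready = ready[:processors], ready[processors:]
        for node in batch:
            par = parent[node]
            if par is not None:
                children_left[par] -= 1
                if children_left[par] == 0:
                    ready.append(par)  # computable from the next step on
        steps += 1
    return steps


def optimal_processor_number(parent):
    """Smallest processor count that finishes in as many steps as layers."""
    layers = max(node_depths(parent).values())
    processors = 1
    while steps_needed(parent, processors) > layers:
        processors += 1
    return processors


# Hypothetical tree: child -> parent, root is node 1.
TREE = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3,
        8: 4, 9: 4, 11: 5, 16: 8, 17: 8}

print(optimal_processor_number(TREE))
```

For this tree, three processors already finish in five steps, one per layer, while two processors need more, so the search stops at three even though the widest layer holds four nodes.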

MSR as defined in this paper is based on one fundamental assumption: a binary-tree node cannot begin computing until its children nodes have completed their tasks (Section 3.4). However, this limitation may be too strict. To improve the parallel computing efficiency and surpass this restriction, we are investigating two aspects of our work:

The first aspect is motivated by the fact that, according to the binary-tree scheme, physical processes occurring on hillslopes of each sub-basin are independent of each other. These physical processes can thus be simulated synchronously and should not be restricted by their upstream or downstream position in the network. Therefore, we are considering two levels of parallelization in future developments of our system. The first level involves the hillslope processes and does

Fig. 15. Temporal–spatial dual decomposition method of the binary tree. (a) The simulation time of each sub-basin is divided into three even time slices and (b) the interdependence of all time slices. A node in (b) cannot be computed until its upstream neighbors (connected by solid and dashed lines) are processed. For example, D2 can be computed only after D1, F2, and G2 have completed.


not depend on the computing sequence of the sub-basins, allowing a high parallel efficiency. The second level primarily focuses on the river-reach processes, which are instead restricted by their position in the network. This level has been the main focus of this paper. However, it should be noted that if hillslopes are simulated independently, then the physical influence of the river reaches on their corresponding hillslopes within a sub-basin will be ignored. This is similar to ignoring backwater effects (see Section 3.4).

The second aspect we are currently exploring is to allow downstream sub-basins to begin their simulation while their upstream neighbor is being processed. Fig. 15 presents an example of such a temporal–spatial dual decomposition method. In Fig. 15(a), there are six nodes in the binary tree, from A to G. We assume that each node has identical simulation times which are divided into three even time slices, such as A1, A2 and A3 for Node A. In a real simulation, in order to run A3, the information of A2, B3 and C3 needs to be obtained in advance. Similarly, A2 depends on A1, B2 and C2; A1 depends on B1 and C1, and so on. Fig. 15(b) presents the interdependence of all these time slices. Each node in Fig. 15(b) depends on the nodes connected to it by solid or dashed lines. We can observe that the nodes in the same layer include more sub-basin information, as compared to the method presently used in our system. For example, in Layer 3, a part of Nodes A, B, C, D and E (A1, B2, C2, D3 and E3) could be computed synchronously. Distinct from the MSR of Eq. (5), the MSR of this temporal–spatial decomposition is given below:

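$$
\mathrm{MSR} = \frac{T_s}{(T_p)_{\min}} = \frac{M \times T_r}{(L + N - 1) \times \left(\frac{T_r}{N} + T_o\right)} = \frac{MN}{L + N - 1} \times \frac{1}{1 + \frac{N T_o}{T_r}} \approx \frac{MN}{L + N - 1} \quad (\text{if } N T_o \ll T_r) \tag{7}
$$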

where M is the number of sub-basins in the drainage network; L is the number of binary-tree layers; $T_r$ is the average computing time of a single sub-basin; $T_o$ is the average communication time of a single sub-basin; and N is the number of time slices. From Eq. (7), it can be observed that for a specific drainage network, M and L are fixed, so the larger N is, the larger the MSR. In Fig. 15 we set M = 6, L = 4 and N = 3. Ignoring the communication load, the MSR defined in Eq. (7) is equal to 3, while the MSR defined by Eq. (5) is equal to 1.5. Thus, this temporal–spatial dual decomposition has the potential to be more efficient than the pure spatial decomposition presently used by our system. However, we should also note that the communication load of Eq. (7) is $N T_o$, which is larger than the $T_o$ of Eq. (5). Hence, if the value of N is sufficiently large, then the communication load must be considered.
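The numbers above can be checked with a short sketch. It assumes unlimited processors and the dependency rule described for Fig. 15(b): slice k of a node needs slice k−1 of the same node and slice k of each child. The tree `TREE` is a hypothetical six-node, four-layer example chosen to match M = 6 and L = 4; it is not necessarily the tree of Fig. 15, and the function names are illustrative.

```python
def slice_steps(children, slices):
    """Earliest step at which each (node, slice) pair can be computed."""
    step = {}

    def earliest(node, k):
        if (node, k) not in step:
            # Slice k needs slice k of every child ...
            deps = [earliest(child, k) for child in children[node]]
            # ... and slice k-1 of the node itself.
            if k > 1:
                deps.append(earliest(node, k - 1))
            step[(node, k)] = 1 + max(deps, default=0)
        return step[(node, k)]

    for node in children:
        for k in range(1, slices + 1):
            earliest(node, k)
    return step


def msr_dual(M, L, N, Tr, To):
    """MSR of the temporal-spatial dual decomposition, Eq. (7)."""
    return (M * Tr) / ((L + N - 1) * (Tr / N + To))


# Hypothetical tree: node -> list of children (A is the basin outlet).
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": [],
        "D": ["F"], "E": [], "F": []}

steps = slice_steps(TREE, slices=3)
total_steps = max(steps.values())          # expected to equal L + N - 1
print(total_steps)
print(msr_dual(M=6, L=4, N=3, Tr=1.0, To=0.0))   # communication ignored
print(msr_dual(M=6, L=4, N=3, Tr=1.0, To=0.05))  # with an assumed To
```

The step count reproduces the factor L + N − 1 in the denominator of Eq. (7), and a nonzero communication time lowers the ratio through the $N T_o$ term.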

5. Conclusions

This paper presents a parallel computing system capable of achieving high computing efficiency for river basin simulations. The system comprises four parts: database, master node, slave node, and transfer node. The binary-tree coding method is employed by the system. The river basin is treated as a binary-tree structure, with each node representing a sub-basin. The binary-tree structure implies that one node only has physical relations with its parent and two children nodes. This approximation may effectively decrease the communication load of parallel computing. Nevertheless, if groundwater movement in the river basin is so strong that there is a large amount of water exchange among sub-basins, the binary tree may not be an appropriate choice for depicting the drainage network structure.

The domain decomposition method in the system has a great impact on the parallel computing efficiency. It is difficult to evaluate in advance the computational requirements of a sub-basin. Thus, we have adopted a dynamic approach to obtain load balance. This method is implemented via the master node of the system. Once a particular slave node completes a task, the master node will automatically get another task from the remaining binary tree and send it to the slave node. We have specified two necessary assumptions for our domain decomposition method. First, the backwater effect is not considered. Second, a downstream sub-basin cannot start computing until all its upstream tasks are completed. Based on these two assumptions, a formula approximating MSR (see Eq. (5)) is obtained under our load balancing scenario. MSR means that the binary-tree structure has an upper limit of computing efficiency. However, the second assumption may be too strict. Two ways to surpass this limitation are also discussed. First, physical processes occurring within sub-basin hillslopes can be separated. Since these processes are not restricted by the assumption above, they can be simulated synchronously. Second, considering that physical processes have interdependence between upstream and downstream areas, the temporal and spatial dual decomposition method is proposed. The MSR for this method is derived in Eq. (7) and shown to be more advantageous than the one of Eq. (5).
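The dynamic dispatch summarized above can be mimicked by a small event-driven simulation. This is a sketch under stated assumptions, not the MPI implementation of our system: the tree, the task durations, and the function name `simulate_dynamic_dispatch` are hypothetical, communication time is ignored, and the master simply hands the oldest ready sub-basin to any idle slave.

```python
import heapq


def simulate_dynamic_dispatch(parent, duration, workers):
    """Finish time when ready sub-basins are handed to idle slave nodes."""
    pending_children = {node: 0 for node in parent}
    for node, par in parent.items():
        if par is not None:
            pending_children[par] += 1
    # Sub-basins without upstream neighbors are computable immediately.
    ready = sorted(n for n in parent if pending_children[n] == 0)
    running = []  # min-heap of (finish_time, node)
    clock, idle, done = 0.0, workers, 0
    while done < len(parent):
        # Master: give every idle slave a ready task, if one exists.
        while idle > 0 and ready:
            node = ready.pop(0)
            heapq.heappush(running, (clock + duration[node], node))
            idle -= 1
        # Advance to the next task completion.
        clock, node = heapq.heappop(running)
        idle += 1
        done += 1
        par = parent[node]
        if par is not None:
            pending_children[par] -= 1
            if pending_children[par] == 0:
                ready.append(par)  # all upstream tasks are now complete
    return clock


# Hypothetical five-node tree (child -> parent) with uneven task durations.
PARENT = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}
DURATION = {1: 1.0, 2: 2.0, 3: 4.0, 4: 1.0, 5: 3.0}

for count in (1, 2, 3):
    print(count, simulate_dynamic_dispatch(PARENT, DURATION, count))
```

In this example a third slave still shortens the run, but no further slave could help: the finish time is then set by the longest upstream-to-outlet chain (nodes 5, 2 and 1), which reflects the second assumption above.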

The domain decomposition method that can achieve the MSR of Eq. (5) is currently implemented by our system. The method assigns the computing sequence from the bottom binary-tree layer to the top. The system was tested in the Chabagou river basin of China. A total of 501 sub-basins were extracted from the DEM. Two years of hydrological processes were simulated, with the number of processors increasing from 1 to 16.

The three-month hydrological results of the single-processor and 16-processor runs are compared (Fig. 11). The results are identical and demonstrate the effectiveness of the system parallelization. The results of computing time and speedup ratio are illustrated in Fig. 12. It is observed that the speedup ratio curve contains two distinct stages: the former is monotonically increasing and the latter is almost horizontal. This indicates that while the gains provided by our parallel computing framework are evident initially, improvements become increasingly limited. This phenomenon is in accordance with the analysis of MSR in Eq. (5). However, our simulated value of MSR is 7.80 while the theoretical one is 9.10. This is mainly due to the effect of load imbalance. The analysis is given in Eq. (6). From Fig. 12, we can see that the OPN is equal to 10 processors, while the widest layer of the Chabagou basin has 20 nodes. This is because computable nodes in the upper binary-tree layers can be processed in advance, alongside nodes of the lower layers (Fig. 14). Knowing the OPN allows the parallel system to use the least number of processors to achieve MSR. Further in-depth theoretical research is required to obtain an OPN formula for the binary tree, which will be undertaken in our future work.

Acknowledgements

This paper is supported by the National Natural Science Foundation of China under Grant No. 50823005 and by the Ministry of Science and Technology of China under Grant No. 2011CB409901. The authors are grateful to Dino Bellugi and other anonymous reviewers for their insightful comments and useful advice.

