

Load Balancing in a Distributed Shared Memory System

Alexander Dubrovsky (1), Roy Friedman (2) and Assaf Schuster (3)

(1) IBM Israel, MATAM, Haifa 31905, Israel. [email protected]
(2) Department of Computer Science, Cornell University, Ithaca, NY 14853, USA. [email protected]
(3) Department of Computer Science, The Technion, Haifa 32000, Israel. [email protected]

Abstract: This paper reports on a comparison of six different algorithms for load balancing in distributed environments. These algorithms represent different approaches to load balancing, such as dynamic vs. static, cooperative vs. non-cooperative, centralized vs. distributed, and whether or not threads are allowed to migrate after their initial allocation.

These algorithms are implemented as part of the PARC-MACH environment for executing parallel programs over a cluster of workstations. All experiments reported in this work have been conducted on this system. The results of these experiments indicate that dynamic distributed algorithms are usually better than static centralized algorithms, although in some cases the latter behave better than the former. As a result of this work, we believe that in distributed environments there is a need for load balancing algorithms that take into account the effect of remote memory access on the performance of the system.

1. Introduction

1.1 Parallel computation in distributed systems

Nowadays, most organizations have a cluster of computers connected by a local area network (LAN). However, these computers are usually used as separate computation units without sharing their resources. In particular, the expensive CPU time of many computers is wasted while other computers are heavily overloaded. A distributed run-time system that allows parallel programs to be executed on such clusters of computers, and that can utilize the resources of the entire cluster as a whole in an efficient way, could eliminate this waste of resources and serve as a cost-effective alternative to large and expensive parallel computers.

One of the main problems that needs to be tackled by any system that attempts to provide efficient execution of parallel programs in distributed environments is load balancing. In order to be efficient, the system must distribute the workload among the different computing nodes in a way that guarantees optimal utilization of the available resources, and in particular of the CPU.

On the other hand, unlike in parallel computers, in a distributed environment a good load balancing algorithm is not necessarily an algorithm which distributes the load evenly among all available computing nodes. This is because in a distributed environment, remote memory accesses are much more expensive than local memory accesses. This means that sometimes it may be better to accept an uneven workload in order to avoid remote memory accesses.

In this work we have compared several load balancing algorithms. These algorithms represent various alternatives in deciding on which node a parallel thread should run. Some of these algorithms are cooperative, i.e., nodes share information in making the allocation decision, while others are not. Some of the algorithms make dynamic decisions, while others are static. Some of the algorithms make the allocation decisions in a centralized way, while others are distributed. Finally, some of the protocols allow threads to migrate after their initial allocation, while others do not.

All these protocols are implemented as part of the PARC-MACH distributed shared memory system, which runs on top of the MACH operating system. All our measurements were done on this system with the actual implementation of these protocols. The results we achieved suggest that the optimal load balancing scheme depends on the characteristics of the parallel application being run. In general, the more dynamic and distributed the algorithm is, the better its chances of achieving good performance, although static algorithms can perform better for parallel applications with massive data sharing, in which the memory locality principle is more significant. None of the protocols we have implemented takes into account the effect of remote memory accesses on performance, which we believe was a limiting factor in their performance. We are currently not aware of any such protocols, and their development remains an interesting research problem.

This paper is organized as follows. In the rest of this section we review existing load balancing algorithms. In Section 2, we give a brief description of the PARC-MACH system in which these protocols were implemented and where our experiments were carried out. In Section 3, we describe in detail the load balancing algorithms we have implemented. Section 4 discusses experimental results and compares the load balancing algorithms. We conclude with a discussion in Section 5.

1.2 Related work

Excellent surveys of load balancing are provided in [16] and [3]. Load balancing schemes involve either cooperation among distributed nodes, or independent decision making by individual nodes. Many authors indicate the difficulty of achieving good balancing in a distributed system without any mechanism of cooperation among nodes.

Cooperative load balancing schemes can be divided into the centralized and the distributed approach [11]. In the centralized approach, a special node keeps the system's load state information and makes decisions on selecting a node for remote task allocation. We use a similar scheme in our system.

Variations of centralized algorithms include global queue load balancing schemes. According to the algorithm mentioned in [14], every new task is sent to the central load manager. Each processor gets a task from the central work pile, returns it after some quantum of time, and then gets another task. The described algorithm suffers from a synchronization bottleneck and context switch overhead. Another version of the central queue scheme is implemented in the Makbilian multiprocessor [2]. A special activity runs on each processor, taking one new activity from the global work pile according to schedule. As the special activity competes with other activities for the processor, the rate at which new activities are allocated depends on the processor's load. In [15], load balancing of prioritized tasks is described. The central load manager maintains both the priority queue of new tasks and the information about the load state of managed hosts, which is updated either periodically or by piggybacking when a new task is sent to the host. The load manager distributes tasks between the managed hosts so that the load on every processor is maintained within the allowed range. Another variation of the central queue load balancing algorithm is implemented in our system.

The centralized approach suffers from a load management bottleneck and high information exchange overhead, problems that many of the distributed algorithms proposed in the literature were designed to solve. Some distributed schemes attempt to get information on the load state in the system using minimum communication. The nodes query only a small number of neighbors at periodic intervals, in order to keep the local information up-to-date, while the allocation decision is also made locally by a simple load balancing algorithm [10]. In order to minimize communication, Kremien [11] proposes a load state metric which is not likely to change frequently. The node state is mapped onto one of three possible levels, and remote nodes are notified only when the local load level changes. A similar scheme is implemented in our system.

Hong et al. [9] equalize the load between pairs of processors at fixed, predetermined intervals. Another load balancing scheme [12] initiates load exchange only when a processor becomes idle. In the Ivy system [13], a special null process asks a remote overloaded processor for new tasks when there are no ready tasks available on the local processor. Auxiliary information about the system load state is kept on each processor and helps to minimize the number of migration request rejections. Rudolph et al. [14] propose to equalize the load between random pairs of processors at intervals which are inversely proportional to the current load. The balancing operation is performed with some randomization, and only when the loads on two processors differ by more than some threshold value. The local queue load balancing algorithm implemented in our system is similar to the schemes mentioned in this paragraph.

Several basic strategies of load balancing in highly parallel systems are presented in [17] and [16]. The sender-initiated and receiver-initiated diffusion schemes make use of near-neighbor load information. In the former scheme, load balancing is performed by a processor whenever it receives a load update message from a neighbor indicating that the neighbor's load is less than some threshold value, and only if the load of the processor is greater than the average load in its domain. The latter scheme is the converse of the sender-initiated diffusion: here the balancing process is initiated by any processor whose load drops below a prespecified threshold.

The hierarchical balancing method organizes a multicomputer system into a hierarchy of balancing domains. At each hierarchy level computers are partitioned into clusters, with the load manager processor designated in each cluster to control that cluster’s internal balancing. Load managers receive information from lower level domains and execute regular load balancing among themselves by exchanging information and tasks. Similar models have been proposed in [15] and [11].

The dimension exchange method is a synchronized approach, which performs load balancing iteratively in each of the log N dimensions, where N is the number of processors. All processor pairs in the first dimension balance the load among themselves, then all processor pairs in the second dimension, and so forth, until each processor has balanced its load with each of its neighbors.
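As an illustration, the following small C sketch (our own, not taken from [17] or [16]) simulates one pass of the dimension exchange method on a hypercube of N = 2^d processors: in dimension k every processor averages its load with the node whose identifier differs from its own in bit k. The load values are arbitrary illustrative numbers.

#include <stdio.h>

#define N    8                 /* number of processors, a power of two */
#define DIMS 3                 /* log2(N) dimensions                   */

int main(void) {
    double load[N] = {9, 1, 4, 6, 2, 8, 5, 3};    /* illustrative initial loads */

    for (int k = 0; k < DIMS; k++) {              /* one step per dimension     */
        for (int p = 0; p < N; p++) {
            int partner = p ^ (1 << k);           /* flip bit k of the node id  */
            if (p < partner) {                    /* each pair balances once    */
                double avg = (load[p] + load[partner]) / 2.0;
                load[p] = load[partner] = avg;
            }
        }
    }
    for (int p = 0; p < N; p++)
        printf("processor %d: load %.2f\n", p, load[p]);   /* all end up at 4.75 */
    return 0;
}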

The gradient model uses a gradient proximity map of underloaded processors in the system, in order to guide the migration of tasks from overloaded to underloaded processors. The gradient proximity map is updated dynamically by underloaded nodes while task migration is initiated by overloaded nodes.

2. The PARC-MACH system architecture

A general overview of the PARC-MACH system is presented in Figure 1. Parallel applications are executed in PARC-MACH by user tasks, running on remote hosts participating in the parallel execution. A user task consists of a user defined application, a thread manager and a shared memory allocation unit. The set of remote thread managers constitutes the distributed thread manager which controls creation and termination of user threads, thread migration and load balancing. The shared memory unit supplies the user application with shared memory allocation and freeing; it also coordinates some memory consistency services. The user application includes parallel activities created by the thread manager and executed by user threads.

The shared memory manager provides the distributed user application with a predefined shared memory consistency model. Shared memory management includes page access control (through the kernel interface) and a page migration mechanism.

Figure 1. The PARC-MACH system architecture.

The page migration mechanism supplies pages with the necessary access rights whenever the current page rights are insufficient. This mechanism is executed in the context of the page-fault interrupt handler, which is called by the kernel in the case of a page access rights violation. The distributed shared memory manager consists of memory manager tasks running on each processor. The shared memory manager can also be implemented as a non-distributed manager running on one of the hosts known to all the kernels. In this case, if a page fault happens, the kernel sends an appropriate request to the non-distributed manager.

Interaction between the various parts of the PARC-MACH system is shown by arrows in Figure 1. User threads are created by special system function calls implemented in the thread manager (1). The latter also implements other routines of thread management and thread migration with invocation of low level kernel primitives (2). Shared memory allocation and freeing are executed in the context of user threads by special system function calls (3). The shared memory model implementation is based on a page-fault mechanism. When a page-fault exception occurs, the kernel sends a message to the appropriate local (4) or remote (5) memory manager, which supplies the host with the required page by invoking low level kernel services (6). The shared memory unit supplies the user application with several weak memory consistency services by sending requests to a memory manager (7). The remote tasks of the distributed thread manager and memory manager communicate with each other by messages (8 and 9). More details on the PARC-MACH system can be found in [4] and [7].

3. Load balancing algorithms of PARC-MACH system

The main goal of thread management is to control the distribution of threads among processors, in order to minimize the execution time of parallel applications. A prominent technique used for this purpose is load balancing, which tries to load the processors participating in the parallel execution uniformly.

An objective measure of a processor’s load is its CPU load. In our system we measure processor load as the number of ready user threads. An apparent disadvantage of this approach is that the number of ready user threads does not always properly reflect the CPU load, especially when there are external tasks running on the processor. However, measuring node load according to CPU load has its drawbacks as well: the CPU load may decrease as a result of massive paging caused by too many running tasks, in which case a node can be mistakenly considered underloaded. A considerable advantage of our measure is that it is fully controlled by the PARC-MACH thread manager and independent of operating system services.

A thread which has not terminated is considered ready if it is not suspended “for a long time”. There are a number of reasons for thread suspension:

• waiting for completion of children activities;

• waiting for sync barrier synchronization;


• delay of faa and tas primitives;

• remote memory access;

• suspension by the operating system.

Only the first two reasons cause suspension of a thread for a relatively long time. Remote memory accesses and synchronization primitives are serviced fast enough, and they do not affect the general tendency of load changing. Moreover, remote memory accesses happen frequently, and taking their influence on the load into account causes fluctuations of load level. The same is true of suspension by the operating system. Thus, we define a user thread as ready if it is not waiting for completion of its children activities and is not suspended on a sync barrier synchronization.
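A minimal sketch of this readiness test is given below; the state names and data layout are our own illustration, not the actual PARC-MACH thread manager structures.

#include <stdio.h>

enum suspend_reason {
    NOT_SUSPENDED,
    WAIT_FOR_CHILDREN,       /* long suspension                     */
    WAIT_ON_SYNC_BARRIER,    /* long suspension                     */
    WAIT_FAA_TAS,            /* short: delay of faa/tas primitives  */
    WAIT_REMOTE_MEMORY,      /* short: remote memory access         */
    SUSPENDED_BY_OS          /* short: operating system suspension  */
};

struct user_thread {
    int terminated;
    enum suspend_reason reason;
};

/* A thread counts as ready unless it has terminated or is suspended
 * for one of the two long-term reasons.                              */
int is_ready(const struct user_thread *t) {
    return !t->terminated
        && t->reason != WAIT_FOR_CHILDREN
        && t->reason != WAIT_ON_SYNC_BARRIER;
}

/* The node load is simply the number of ready user threads. */
int node_load(const struct user_thread *threads, int n) {
    int load = 0;
    for (int i = 0; i < n; i++)
        if (is_ready(&threads[i]))
            load++;
    return load;
}

int main(void) {
    struct user_thread t[3] = {
        {0, NOT_SUSPENDED}, {0, WAIT_ON_SYNC_BARRIER}, {0, WAIT_REMOTE_MEMORY}
    };
    printf("node load = %d\n", node_load(t, 3));   /* prints 2 */
    return 0;
}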

In our system we have implemented six load balancing algorithms: random, round-robin, central load manager, thresholds, central queue, local queue.

The first four algorithms reach a decision concerning thread allocation statically, when parallel activities are created, while the last two take into account dynamic changes of the system load. Therefore, the algorithms are divided into two groups - static and dynamic load balancing algorithms. The random and round-robin algorithms allocate threads according to some internal policy without regard for the system load. In the central load manager and thresholds algorithms, the decision is based on the system load at the moment of thread allocation. The central queue algorithm assigns threads to processors dynamically, taking into account changes in processor load during application execution. Although the local queue algorithm allocates new threads statically, the threads migrate among processors as a result of dynamic changes of the system load. Thus, this algorithm is actually the most dynamic, as thread migration accounts for dynamic changes of the load distribution during thread execution. A classification of the above algorithms based on their properties is given in Table 1.

All static load balancing algorithms in our system are implemented in two modes - Single Thread Allocation mode and Multiple Thread Allocation mode. In the former, the load balancing algorithm selects a host for each new thread separately. Usually a parent thread creates several children threads, and the number of children threads is greater than the number of processors. Separate selection of a host for each thread causes allocation of neighbor children(1) on remote hosts. The locality principle suggests that a thread is more likely to share data with a neighbor sibling than with other siblings (apart from the data shared by all siblings). Thus, allocation of neighbors on remote hosts causes massive page migration.

This problem can be solved in the latter mode, by selecting hosts for all the children simultaneously, minimizing the number of separated neighbors while maintaining the number of threads allocated to each processor. This allows for certain optimizations in thread allocation. In particular, neighbor children are allocated by the Multiple Thread Allocation mode on the same host for as long as possible. In Figure 2b only two pairs of neighbors - [22,23] and [24,25] - are separated.

(1) We call two parallel activities created by the same parent one after another neighbor activities. These activities execute consecutive iterations of a parallel loop or adjacent parallel blocks.

Our experience indicated that the Multiple Thread Allocation mode usually performs better, and the discussion and results reported in the rest of this paper assume this mode. A comparison of the Multiple Thread Allocation mode and the Single Thread Allocation mode can be found in [4].

Table 1. Features of load balancing algorithms.

Features                   | Random | Round-Robin | Central Manager | Thresholds | Central Queue | Local Queue
Cooperative                | No     | No          | Yes             | Yes        | Yes           | Yes
Static / Dynamic           | Static | Static      | Static          | Static     | Dynamic       | Dynamic
Centralized / Distributed  | Distr. | Distr.      | Central.        | Distr.     | Central.      | Distr.
Thread migration support   | No     | No          | No              | No         | No            | Yes

3.1 Round-robin and random algorithms

The random and round-robin algorithms are the simplest load balancing algorithms supported by the system. Both are static, selecting a host for a new thread when the thread is being created. The thread runs on this host during its entire execution. In both schemes a host for a new thread is selected regardless of the system load.

In the random algorithm the host is selected at random from the set of processors participating in the application execution.

Figure 2. Round-robin load balancing.


In the round-robin scheme new threads are divided evenly between all processors. The threads are assigned to processors in a “round robin” order, i.e., each new thread is sent to the next processor. The order of thread allocation is maintained on each processor locally, independent of allocations from remote processors. An example of thread allocation by the round-robin algorithm is shown in Figure 2.
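The selection logic of the two schemes amounts to a few lines of C; the sketch below is our own illustration (host numbering and the per-processor counter are assumptions, not the PARC-MACH code).

#include <stdio.h>
#include <stdlib.h>

#define NUM_HOSTS 3

/* Each creating processor keeps its own round-robin position,
 * independently of allocations made by other processors.       */
static int next_host = 0;

int select_host_round_robin(void) {
    int host = next_host;
    next_host = (next_host + 1) % NUM_HOSTS;
    return host;
}

int select_host_random(void) {
    return rand() % NUM_HOSTS;            /* the system load is ignored */
}

int main(void) {
    printf("round-robin allocation of 7 threads:");
    for (int t = 20; t <= 26; t++)
        printf(" %d->host%d", t, select_host_round_robin());
    printf("\n");
    return 0;
}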

The round-robin algorithm is expected to work well when all threads have an equal workload. We also expect both the random and round-robin schemes to perform satisfactorily with a large number of threads (much greater than the number of processors), since in that case the load distribution is likely to be close to uniform and the error negligible.

An advantage of both the random and round-robin algorithms is the absence of the interprocess communication for load balancing that would otherwise increase the message overhead. Hence, both schemes can even attain the best performance among all the load balancing algorithms for particular parallel applications. Nevertheless, the random and round-robin algorithms are not expected to achieve good performance in the general case. We used the round-robin algorithm only as a basis of comparison for the other schemes.

3.2 Central load manager algorithm

The central load manager algorithm is a static load balancing algorithm. A host for allocation of a new thread is selected by the central load manager, according to the overall system load when the thread is created. The thread is allocated to the minimally loaded host.

The central load manager runs on the main host known to all remote load managers. All requests for host selection are sent to the central load manager. If a parent thread runs on the main host, then the central load manager is called directly without sending a message, and the request is supplied in context of the parent thread.

Hosts for new threads are selected by the load manager so that the processor load after thread allocation is as uniform as possible, and the number of separated neighbor threads is minimized (see Figure 3). The central load manager reaches a decision based on the available information on the system load state. This information is updated by remote thread managers, which send a message each time the load on their nodes changes as a result of one of the following events:

• a parent activity waits until execution of its children activities are complete;

• a thread is suspended on a sync barrier synchronization;

• completion of parallel activity execution;

• a parent activity continues its execution after receiving messages from all of its children;


• a thread is resumed after a sync barrier synchronization;

Note that allocation of a new thread is not reported to the central load manager, because it already received this information when a host for that thread was selected.

The message overhead of the central load manager algorithm is one message for each change of load and two messages per thread or sibling allocation from remote hosts.
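A rough sketch of the manager's selection and bookkeeping logic is given below (our own illustration; message handling and the real PARC-MACH data structures are omitted).

#include <stdio.h>

#define NUM_HOSTS 3

static int load[NUM_HOSTS];     /* ready-thread count per host, as reported
                                   by the remote thread managers            */

/* Pick the minimally loaded host and account for the new thread at once,
 * so the allocation itself never has to be reported back.                 */
int select_host_central(void) {
    int best = 0;
    for (int h = 1; h < NUM_HOSTS; h++)
        if (load[h] < load[best])
            best = h;
    load[best]++;
    return best;
}

/* Called when a remote manager reports a load change (suspension on a
 * barrier, completion of an activity, resumption, and so on).            */
void update_load(int host, int delta) {
    load[host] += delta;
}

int main(void) {
    for (int t = 20; t <= 26; t++)
        printf("thread %d -> host %d\n", t, select_host_central());
    return 0;
}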

Figure 3. Central load manager algorithm.

Figure 4. Thresholds load balancing algorithm (single thread allocation mode, Tunder=2, Tupper=4).


An obvious advantage of the central load manager algorithm is in centralization of load state information. The load manager’s decisions are based on the general system load state information, which allows for the best decision at the moment of the thread creation. Among its disadvantages are the high degree of interprocess communication and load management centralization, which could bottleneck the system.

A general disadvantage of all static schemes is that the final selection of a host for thread allocation is made when the thread is created, and cannot be changed during thread execution to accommodate changes in the system load. Nevertheless, the central load manager scheme is expected to perform much better than the simpler schemes for parallel applications, especially when dynamic activities are created by different hosts.

3.3 Thresholds algorithm

Another static load balancing algorithm is the thresholds algorithm. According to this algorithm, the threads are allocated immediately upon creation to hosts selected by the load manager. The load manager is distributed between the processors, and hosts are selected locally without sending remote messages. Each local load manager keeps a private copy of the system’s load state. The load state of a processor is characterized by one of the following three levels: underloaded, medium and overloaded. These levels are defined by two threshold parameters, Tunder and Tupper, which can be defined by the user:

• Underloaded when load < Tunder

• Medium when Tunder ≤ load ≤ Tupper

• Overloaded when load > Tupper

Default values for the algorithm are Tunder = 2 ready threads and Tupper = 4 ready threads. Initially, all the processors are considered to be underloaded. When the load of a processor crosses a load level boundary, the local load manager sends messages regarding the new load state to all remote load managers, constantly updating them as to the actual load state of the entire system.

A host is selected for a new thread according to the following algorithm: if the local state is not overloaded then the thread is allocated locally; otherwise, a remote underloaded host is selected, and if no such host exists, the thread is also allocated locally (see Figure 4).
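A compact sketch of this selection rule with the default thresholds could look as follows (illustrative only; the names and data layout are our assumptions).

#include <stdio.h>

#define NUM_HOSTS 3
#define T_UNDER   2            /* default threshold values */
#define T_UPPER   4

enum level { UNDERLOADED, MEDIUM, OVERLOADED };

static enum level state[NUM_HOSTS];   /* local copy of every host's load level */

enum level classify(int ready_threads) {
    if (ready_threads < T_UNDER)  return UNDERLOADED;
    if (ready_threads > T_UPPER)  return OVERLOADED;
    return MEDIUM;
}

/* Allocate locally unless the local host is overloaded and some remote
 * host is underloaded.                                                   */
int select_host_thresholds(int local_host) {
    if (state[local_host] != OVERLOADED)
        return local_host;
    for (int h = 0; h < NUM_HOSTS; h++)
        if (h != local_host && state[h] == UNDERLOADED)
            return h;
    return local_host;                 /* everyone is loaded: keep it local */
}

int main(void) {
    state[0] = classify(5);            /* overloaded  */
    state[1] = classify(3);            /* medium      */
    state[2] = classify(1);            /* underloaded */
    printf("a new thread on host 0 goes to host %d\n", select_host_thresholds(0));
    return 0;
}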

The message overhead of the algorithm is N-1 messages each time a processor's load crosses a load level boundary, where N is the total number of processors.

Among the advantages of the thresholds algorithm are its relatively low interprocess communication and a large number of local thread allocations. The latter decreases the overhead of remote thread allocations and, more importantly, the overhead of remote memory accesses, thus possibly leading to significant performance improvement. A disadvantage of the algorithm is that all threads are allocated locally when all remote processors are overloaded (their load is more than the constant parameter Tupper). The load on one overloaded processor can be much higher than on the other overloaded processors, causing significant load imbalance and increasing the execution time of an application. Nevertheless, we expect the thresholds algorithm to perform better for certain parallel applications than all the algorithms considered so far.

3.4 Central Queue algorithm

The central queue is a dynamic load balancing algorithm. It differs from static algorithms in that new parallel activities are not allocated immediately after creation. Instead, they are buffered in the central thread-request queue on the main host and allocated dynamically upon requests from remote hosts (see Figure 5).

Figure 5. Central queue load balancing algorithm.

The purpose of the central thread-request queue is to store new activities and unfulfilled requests. It is organized as a cyclic FIFO queue on the main host and is maintained by the central load manager running on that host. Each new activity arriving at the queue manager is inserted into the queue (Figure 6a). Then, whenever a request for an activity is received by the queue manager, it removes the first activity from the queue and sends it to the requester (b). If there are no ready activities in the queue, the request is buffered (c) until a new activity is available. If a new activity arrives at the queue manager while there are unanswered requests in the queue, the first such request is removed from the queue and the new activity is assigned to it (d). Note that at any given moment the central thread-request queue can contain either new activities or unanswered requests, but they cannot be interleaved in the queue.
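The behavior just described can be captured by a small sketch of the thread-request queue (our own illustration, not the PARC-MACH code); the cases in the comments correspond to panels (a)-(d) of Figure 6.

#include <stdio.h>

#define QSIZE 64

/* At any moment the queue holds either pending activities or pending
 * requests, never a mixture of both; which one is recorded in `kind`.   */
static enum { EMPTY, HOLDS_ACTIVITIES, HOLDS_REQUESTS } kind = EMPTY;
static int items[QSIZE];           /* activity ids or requesting host ids */
static int head = 0, count = 0;

static void push(int v) { items[(head + count++) % QSIZE] = v; }
static int  pop(void)   { int v = items[head]; head = (head + 1) % QSIZE;
                          if (--count == 0) kind = EMPTY; return v; }

void new_activity(int activity) {
    if (kind == HOLDS_REQUESTS)                   /* (d): serve the oldest request */
        printf("send activity %d to host %d\n", activity, pop());
    else {                                        /* (a): buffer the activity      */
        kind = HOLDS_ACTIVITIES; push(activity);
    }
}

void new_request(int host) {
    if (kind == HOLDS_ACTIVITIES)                 /* (b): answer immediately       */
        printf("send activity %d to host %d\n", pop(), host);
    else {                                        /* (c): buffer the request       */
        kind = HOLDS_REQUESTS; push(host);
    }
}

int main(void) {
    new_activity(1); new_request(2); new_request(3); new_activity(4);
    return 0;
}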

All new parallel activities are sent to the central load manager running on the main host. When a processor load falls beneath the threshold Tlower, the local load manager sends a request for a new activity to the central load manager. The central load manager answers the request immediately if a ready activity is found in the thread-request queue, or queues the request until a new activity arrives. The parameter Tlower is user-defined as the minimal number of ready threads we would like to have on each processor. Its default value is two ready threads. The central queue algorithm provides at least Tlower ready threads on each processor if a sufficient number of activities have been created.

The message overhead of the central queue algorithm is three messages per parallel activity (one message transfers a new thread to the central load manager, another makes the request and the third is for thread allocation). The load manager running on the main host does not send any messages to the central load manager, but rather requests new activities directly from it, decreasing the overall message overhead of the algorithm.

The most important advantage of the central queue algorithm is dynamic distribution of threads. Unlike static algorithms, dynamic algorithms allocate threads dynamically when one of the processors becomes underloaded. The central queue algorithm takes into account the dynamic change of load state in the system during the application execution.


The algorithm's behavior is also affected by additional external load created by other tasks running on the processors. If a host becomes heavily loaded as a result of external task execution, then parallel activities running on it will be executed more slowly and requests for new activities will be sent from it less frequently. A disadvantage of the algorithm is that it completely ignores the memory locality principle. Another disadvantage is the centralization of load management, which causes a bottleneck in large-scale systems. In general, the performance of the central queue algorithm is expected to be better than that of any static algorithm discussed so far. It should be noted, however, that certain "badly written" parallel applications can "fail" with the central queue load balancing algorithm. An example of such an application is given in Figure 7. All iterations of the parallel loop wait while the last activity assigns 1 to a. If the number of parallel activities is large enough, then the last activity will probably never be allocated: a request for its allocation is not sent to the central load manager because all hosts are "overloaded" with busy waiting. As a result, the application never terminates.

Figure 6. Operation of the central thread-request queue: (a) a new thread arrives when no request is queued; (b) a new request arrives when a thread is queued; (c) a new request arrives when no thread is queued; (d) a new thread arrives when a request is queued.


3.5 Local Queue algorithm

The local queue algorithm is a dynamic load balancing algorithm. Its main feature is dynamic thread migration support. The basic idea of the local queue algorithm is static allocation of all new threads with thread migration initiated by a host when its load falls beneath a threshold Tunder.

Tunder is a user-defined parameter of the algorithm with default value of 2. The parameter defines the minimal number of ready threads the load manager attempts to provide on each processor if at least one host with more than Tunder ready threads exists.

Initially, new threads created on the main host are allocated on all hosts (Tunder threads per host). The number of parallel activities created by the first parallel construct on the main host is usually sufficient for allocation on all remote hosts. From then on, all the threads created on the main host and all other hosts are allocated locally.

When the host load during application execution falls beneath the threshold Tunder, the local load manager attempts to get several threads from remote hosts. It randomly sends synchronous requests, carrying the number of local ready threads, to remote load managers. When a load manager receives such a request, it compares the local number of ready threads with the received number. If the former is greater than the latter, then some of the running threads are transferred to the requester and an affirmative confirmation with the number of threads transferred is returned. A negative reply is sent to the requester if the local number of ready threads is less than the number received.

int a;
a = 0;
pparfor (int i; 1; 100; 1;) {
    if ( i == 100 )
        a = 1;
    while ( a == 0 )
        ;
    .......
}

Figure 7. A counter-example for the central queue algorithm.


If the requester receives a negative reply, or if the number of threads received is not sufficient to reach the Tunder threshold, the load balancing process is continued with another remote processor. If, after trying all remote processors, the Tunder threshold is still not reached, the load balancing is periodically repeated (approximately every 10 seconds) until the threshold is met. All the hosts apart from the main one also initiate periodic load balancing at the beginning of an application execution, until the Tunder threshold is achieved.

The number of threads to be transferred is selected so that the load on the requesting processor and on the local processor equalize. The thread manager receives a list of ready threads from the kernel and selects from it the required number of “transferable” threads, suspends them, and finally transfers them to the requesting node.
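A minimal sketch of the decision made by a load manager that receives such a migration request might look as follows (illustrative; the actual thread suspension and transfer machinery is omitted).

#include <stdio.h>

/* Handle a migration request that carries the requester's ready-thread
 * count.  Returns the number of threads this host agrees to transfer
 * (0 means a negative reply).                                            */
int threads_to_transfer(int local_ready, int requester_ready) {
    if (local_ready <= requester_ready)
        return 0;                                    /* negative reply     */
    return (local_ready - requester_ready) / 2;      /* equalize the loads */
}

int main(void) {
    printf("transfer %d threads\n", threads_to_transfer(8, 1));   /* 3 */
    printf("transfer %d threads\n", threads_to_transfer(2, 3));   /* 0 */
    return 0;
}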

The local queue load balancing algorithm is expected to achieve the best performance, as it is dynamic and can redistribute running threads during application execution. Static allocation of new activities decreases the overhead of remote thread allocations and the overhead of remote memory accesses, thus improving performance significantly. Another advantage of the algorithm is that its message overhead is relatively low; messages are sent only when a host becomes underloaded and thread redistribution is required.

One apparent drawback of the algorithm is that it ignores the locality principle. A thread for transfer is selected randomly regardless of the threads running on the underloaded and local processors. This decreases the performance of parallel applications with massive data exchange between subsequent parallel iterations or blocks.

4. Experimental results

4.1 Testing environment

The PARC-MACH system was tested on a LAN in the Parallel Computation Laboratory of the Computer Science Department at the Technion. Our system consisted of a cluster of i486 33MHz and Pentium™ 90 MHz machines, connected by a 10 Mbps Ethernet. All the computers ran the Mach 3.0 MK78 version of the Mach operating system. The system as well as sequential and parallel applications were compiled using the gcc-6.1 compiler with the highest level of optimization.

Each testing result presented in this section is an average over five executions.


4.2 Benchmark applications

We have designed and tested six parallel applications, namely, parallel search, matrix multiplication, Dijkstra’s algorithm in graphs, the traveling salesperson problem (TSP), solving systems of linear equations and solving partial differential equations. In what follows we present test results for the four applications we feel best characterize the various types of parallel applications. These are matrix multiplication, Dijkstra’s algorithm for graphs, the traveling salesperson problem and solving partial differential equations. The selected benchmark applications differ in type, data sharing and workload distribution. Detailed analysis of each of the applications is given below.

4.2.1 Matrix multiplication

The problem is to multiply two large square matrices A and B of size N x N, and place the result into the third matrix C of the same size. The sequential code of the application can be made parallel in a very natural manner at the level of the external loop (loop on rows of matrix C).
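A sketch of this row-block parallelization is shown below; for brevity the blocks are executed one after another in plain C, whereas in PARC-MACH each block would run as a separate user thread (the matrix size and thread count are illustrative).

#include <stdio.h>

#define N        8      /* matrix dimension (kept tiny for illustration) */
#define NTHREADS 4      /* number of parallel activities                 */

static double A[N][N], B[N][N], C[N][N];   /* shared matrices */

/* Work of one parallel activity: compute a block of consecutive rows of C.
 * It reads rows [first,last) of A, all of B, and writes the same rows of C. */
void multiply_rows(int first, int last) {
    for (int i = first; i < last; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = (i == j); }

    int rows = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++)      /* each block = one user thread */
        multiply_rows(t * rows, (t + 1) * rows);

    printf("C[2][3] = %.1f\n", C[2][3]);    /* B is the identity, so C = A  */
    return 0;
}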

The input matrices A and B and the output matrix C are placed in the shared memory, so that they are accessible from all nodes. The input matrices are initialized on the main host, from which they migrate to remote hosts upon request. At the end of the execution the output matrix C is fetched back to the main host. To understand the shared memory overhead of the parallel application, it is important to analyze the shared memory accesses by user threads. In the proposed implementation, each parallel activity (thread) computes several consecutive rows of the resultant matrix C. Each thread reads consecutive rows of matrix A, reads all elements of matrix B and writes the corresponding consecutive rows of matrix C (Figure 8).

The matrices are stored in the computer memory row-wise. Thus each physical page contains one or more consecutive matrix rows. The number of rows per page depends on the size of the allocated matrix. For example, if the page size is 4 Kbytes and each matrix element occupies 4 bytes, then a page contains exactly one row when matrix size is 1024x1024, two rows when matrix size is 512x512, etc. Each thread accesses several consecutive pages of matrix A, all the pages of matrix B and several consecutive pages of matrix C. Both matrices A and B are only accessed for reading, while matrix C is accessed for writing. Therefore, pages of matrices A and B do not migrate after arriving at the processing host, while those of matrix C may migrate as a result of concurrent writing (false sharing). If the matrices are allocated at the page aligned addresses and the matrix size is selected so that rows accessed by a thread occupy a whole number of pages, then different threads access different pages of matrices A and C. In this case, there is no page sharing by threads and pages of matrix C do not migrate.


Overall, page migration includes supplying all the pages of matrix B and the required pages of matrix A to all nodes. The computed pages of matrix C are transferred back from remote nodes to the main one. The amount of transferred data of all three matrices is equal to the amount of the required (accessible) data. There is no page migration during the execution; hence, from this point of view the suggested parallelization is optimal.

Unaligned allocation of matrices or fractional number of rows per page increase the shared memory overhead. This overhead includes superfluous transfer of the data of matrix A and false sharing of the pages of matrix C. This additional overhead decreases as the number of rows to be calculated by a single thread increases. False sharing may be decreased by writing temporary results to the local memory and only copying fully-computed rows to the shared matrix C.

Matrix multiplication takes O(N³) time and O(N²) space.

4.2.2 Dijkstra’s algorithm

The problem is to find the shortest paths between all pairs of vertices in a connected graph with non-negative weights on directed edges. Dijkstra's algorithm is one of several available for the solution of this problem [5]. Dijkstra's original algorithm finds the shortest paths from a given vertex to all other vertices, and the shortest paths between all the pairs of vertices may be found by its repeated application. The above formulation of the problem induces a simple way of parallelization: each parallel activity (thread) calculates the shortest paths from a distinct set of starting points.
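The following sketch illustrates this partitioning on a toy graph: a plain single-source Dijkstra routine computes one row of the output matrix, and each parallel activity is responsible for a distinct block of starting vertices (here the blocks are simply run one after another; the graph is hypothetical).

#include <stdio.h>
#include <limits.h>

#define V        6       /* number of vertices (tiny for illustration) */
#define NTHREADS 3
#define INF      (INT_MAX / 2)

static int A[V][V];      /* shared input: edge weights, INF = no edge  */
static int B[V][V];      /* shared output: B[i][j] = shortest i -> j   */

/* Plain Dijkstra from one source, filling one row of B. */
static void dijkstra(int src) {
    int done[V] = {0};
    for (int v = 0; v < V; v++) B[src][v] = (v == src) ? 0 : INF;
    for (int iter = 0; iter < V; iter++) {
        int u = -1;
        for (int v = 0; v < V; v++)
            if (!done[v] && (u < 0 || B[src][v] < B[src][u])) u = v;
        done[u] = 1;
        for (int v = 0; v < V; v++)
            if (B[src][u] + A[u][v] < B[src][v])
                B[src][v] = B[src][u] + A[u][v];
    }
}

/* Work of one parallel activity: a distinct block of starting vertices. */
void shortest_paths_from(int first, int last) {
    for (int src = first; src < last; src++)
        dijkstra(src);
}

int main(void) {
    for (int i = 0; i < V; i++)
        for (int j = 0; j < V; j++) A[i][j] = (i == j) ? 0 : INF;
    A[0][1] = 2; A[1][2] = 3; A[2][3] = 1; A[3][4] = 4; A[4][5] = 2; A[0][5] = 20;

    int per = V / NTHREADS;                 /* each block = one user thread */
    for (int t = 0; t < NTHREADS; t++)
        shortest_paths_from(t * per, (t + 1) * per);

    printf("shortest 0 -> 5 = %d\n", B[0][5]);   /* 12, via 0-1-2-3-4-5 */
    return 0;
}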

The input of the problem is an incidence matrix A of the graph. This matrix is of size |V|x|V|, and each of its elements [i,j] contains the weight of the edge i→j. The output of the problem is a matrix B of size |V|x|V|, where element [i,j] contains the weight of a shortest path from vertex i to vertex j. We ran Dijkstra's algorithm on various graph sizes, but in all the tests matrices A and B were of allocated size 512x512.

Figure 8. Shared memory accesses in parallel matrix multiplication.


Each thread accesses the entire matrix A for reading and writes several consecutive rows of the matrix B (Figure 9). This causes shared page migration similar to that in matrix multiplication. The entire matrix A is supplied to all hosts, and each remote node sends the rows of matrix B it has calculated to the main node.

Dijkstra's algorithm takes O(|V|³) time and O(|V|²) space.

Figure 9. Shared memory accesses in parallel execution of Dijkstra's algorithm.

4.2.3 Solving partial differential equations

Here we test a numerical solution of Poisson's equation, which is a second-order linear partial differential equation with two independent variables x and y:

∂²u(x,y)/∂x² + ∂²u(x,y)/∂y² = G(x,y),

where u(x,y) is the unknown function and G(x,y) is a given function.

The partial differential equation is to be solved relative to the function u(x,y) in the unit square space (0<x<1 and 0<y<1). The values of the unknown function u(x,y) are only given at the square boundaries (x=0, or x=1, or y=0, or y=1).


Figure 10. Mesh points for solving partial differential equations.

Figure 11. Parallel solution of partial differential equations.


The partial differential equation is solved numerically on a uniform mesh of n+1 horizontal and n+1 vertical lines. The algorithm finds values of the unknown function at the (n-1)² internal mesh points (Figure 10). An iterative process is used to obtain approximate values of u(x,y) at each of the interior points. Starting with zero initial values of u(x,y) at the internal points, we use the following approximation at the point [i,j] on the k-th iteration [1]:

u_k(x_i, y_j) = u_(k-1)(x_i, y_j) + w [u'_k(x_i, y_j) - u_(k-1)(x_i, y_j)],

where u'_k(x_i, y_j) = [u_k(x_(i-1), y_j) + u_k(x_(i+1), y_j) + u_k(x_i, y_(j-1)) + u_k(x_i, y_(j+1)) - d² G(x_i, y_j)] / 4;
w = 2 / [1 + sin(πd)]; d = 1/n.

The iterative process continues until, after k iterations, the difference u_k(x_i, y_j) - u_(k-1)(x_i, y_j) at all the points becomes less than a predefined precision ε. Since the number of iterations in this algorithm is linear in n and the number of interior points is equal to (n-1)², the entire process takes O(n³) time and O(n²) space.
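For concreteness, a sequential C sketch of this relaxation for the first test input is given below; the mesh size, precision, and the in-place sweep order are our own illustrative choices, not the exact benchmark code.

#include <stdio.h>
#include <math.h>

#define NMESH 32                         /* mesh of (NMESH+1) x (NMESH+1) lines */
#define EPS   1e-4                       /* required precision                  */

static double u[NMESH + 1][NMESH + 1];   /* unknown function, boundaries preset */

static double G(double x, double y) { (void)x; (void)y; return 8.0; }

int main(void) {
    const double pi = 3.14159265358979323846;
    double d = 1.0 / NMESH;
    double w = 2.0 / (1.0 + sin(pi * d));

    /* boundary values of the first input, u(x,y) = 2((x-0.5)^2 + (y-0.5)^2) */
    for (int i = 0; i <= NMESH; i++)
        for (int j = 0; j <= NMESH; j++)
            if (i == 0 || j == 0 || i == NMESH || j == NMESH) {
                double x = i * d, y = j * d;
                u[i][j] = 2.0 * ((x - 0.5) * (x - 0.5) + (y - 0.5) * (y - 0.5));
            }

    double diff;
    int k = 0;
    do {                                 /* one relaxation sweep over the mesh */
        diff = 0.0;
        for (int i = 1; i < NMESH; i++)
            for (int j = 1; j < NMESH; j++) {
                double guess = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] +
                                u[i][j + 1] - d * d * G(i * d, j * d)) / 4.0;
                double delta = w * (guess - u[i][j]);
                if (fabs(delta) > diff) diff = fabs(delta);
                u[i][j] += delta;
            }
        k++;
    } while (diff > EPS);

    printf("converged after %d sweeps, u(0.5, 0.5) = %f\n", k, u[NMESH/2][NMESH/2]);
    return 0;
}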

For parallelization of the sequential algorithm we divide the mesh points into groups of consecutive lines. Having N threads, the first one executes iterations on the first n/N lines, the second executes iterations on the second n/N lines, etc. The proposed division of points among threads is optimal for minimization of false sharing. Elements of the same row of a matrix are located on the same physical page. The difficulty arises when remote threads process elements of the same line: they then write to the same physical page, which causes false sharing of that page.

The parallel execution of the algorithm is visualized in Figure 11. Each thread executes internal iterations on its private points until the required precision ε is achieved at all the points. Next, the thread waits at the synchronization barrier so that other threads can achieve local precision as well; then, all threads besides the first one begin the following round of iterations. The first thread checks if all the threads achieved local precision on the first iteration of the last round. If yes, it executes the pbreak instruction which terminates all the siblings, and the algorithm completes. Otherwise the first thread continues iterations on its private points until the next synchronization barrier.

Each thread updates only its private points, which belong to consecutive lines. If the memory areas being updated by the various threads are multiples of an integer number of pages and are aligned to page boundaries, then there is no page sharing on writing (as shown in Figure 12). But updating an interior point requires reading its four nearest neighbors (Figure 10), which causes interleaving of reading spaces even if the writing spaces lie on different pages. When a thread computes the first and the last of its private rows, it reads the rows written by its neighbor threads. There will probably be interleaving of writing spaces in most cases (as discussed in subsection 4.2.1). All of the above causes massive page migration that can even result in ping-ponging of boundary pages.


We tested solving partial differential equations on two different inputs. The first input initialized the boundary values by the function u1(x,y) = 2((x-0.5)² + (y-0.5)²), with G1(x,y) = 8. The second input initialized the boundary values by the function u2(x,y) = x⁷, with G2(x,y) = 56x⁵. Level diagrams of both functions u1(x,y) and u2(x,y) are shown in Figure 13. It can be readily seen from these diagrams that the initial approximations (zero) in all rows for the first input are far from the result function u1(x,y). In the second input the initial values in two thirds of the rows are quite close to the result values, and only in the last third of the rows is the difference considerable. Thus, the load caused by the first two thirds of the threads is negligible compared to that of the rest. Note that in the first case the load of all threads is approximately equal.

Figure 12. Memory accesses in solving partial differential equations.

Figure 13. Inputs for solving partial differential equations: (a) u(x,y) = 2((x-0.5)² + (y-0.5)²); (b) u(x,y) = x⁷.

We expect even the simplest load balancing algorithms to show good performance on the first input. The second input requires more sophisticated algorithms, specifically dynamic load balancing algorithms. On the other hand, these algorithms can fail on solving partial differential equations, owing to the shared memory overhead discussed above.

4.2.4 Traveling Salesperson Problem.

The Traveling Salesperson Problem (TSP) consists in finding the shortest Hamiltonian path (a closed path visiting all the vertices exactly once) in a graph with positive edge weights. The time complexity of this problem is exponential in the size of the input; thus it is a good subject for parallel testing.

The algorithm receives a directed graph in the form of a weight matrix and produces a minimum weight Hamiltonian path as an array of vertices. A recursive sequential algorithm for the TSP checks all the Hamiltonian paths, recursively beginning and ending all the paths with vertex 0. At each level of recursion all vertices not yet in the current path are added to it in the loop, and the search continues recursively.
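A sequential sketch of this recursive search is given below, on a tiny hypothetical 5-vertex instance; the first level of the recursion (the choice of the second vertex inside search(1, 0)) is exactly what the parallel versions distribute among threads.

#include <stdio.h>

#define V 5                         /* tiny instance for illustration */

static int w[V][V] = {              /* hypothetical weight matrix     */
    {0, 3, 4, 2, 7}, {3, 0, 4, 6, 3}, {4, 4, 0, 5, 8},
    {2, 6, 5, 0, 6}, {7, 3, 8, 6, 0}
};

static int best_len = 1 << 30;
static int best_path[V], path[V], used[V];

/* Extend the current partial path (path[0..depth-1], starting at vertex 0)
 * by every unused vertex; close the cycle back to vertex 0 when complete. */
static void search(int depth, int len) {
    if (depth == V) {
        int total = len + w[path[V - 1]][0];
        if (total < best_len) {
            best_len = total;
            for (int i = 0; i < V; i++) best_path[i] = path[i];
        }
        return;
    }
    for (int v = 1; v < V; v++)
        if (!used[v]) {
            used[v] = 1; path[depth] = v;
            search(depth + 1, len + w[path[depth - 1]][v]);
            used[v] = 0;
        }
}

int main(void) {
    used[0] = 1; path[0] = 0;
    search(1, 0);
    printf("best length %d, path:", best_len);
    for (int i = 0; i < V; i++) printf(" %d", best_path[i]);
    printf(" 0\n");
    return 0;
}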

We tested two parallel versions of the algorithm. The simplest parallelization is performed at the first level of recursion, when the algorithm attempts to add the second vertex to the path after vertex 0. If, for example, we have a graph with 16 vertices and want to create 3 parallel threads, then the first thread will check paths 0→1→...; 0→2→...; ... 0→5→..., the second one will check paths 0→6→...; 0→7→...; ..., etc. (see Figure 14).

A more sophisticated parallelization can be performed on both the first and the second levels of recursion. An example of such parallelization with 3 threads at each level is shown in Figure 15. Three threads are created at the first level of parallelization, each responsible for checking a subset of paths that contain designated second vertices in the path similarly to the previous parallel version. Each thread in its turn creates 3 children threads for each vertex at the second place in the path, so that 9 parallel “grandchild” threads run in the system. These threads are created, terminated and then soon recreated dynamically by “child” activities. The most important feature of this implementation is dynamic creation and termination of threads during application execution. Creation of parallel activities in all the previous algorithms was done statically, i.e., all threads were created at the beginning of the application execution.

All parallel threads access the shared array containing the best path found so far and its shared length. Most of the accesses are performed for reading and are executed concurrently without page migration, after arrival of a page at the host.


Write accesses are performed when the minimum path is to be updated by a parallel activity. In this case, the page containing the minimal path and its length is invalidated on all the other nodes, to be resupplied to all the hosts after updating. Overall, there is insignificant migration of that page during the parallel application execution (although the intensity of migration may vary for different inputs).

We tested the application on four different inputs. The first two inputs (14-1 and 14-2) were graphs with 14 vertices and the other two inputs (15-1 and 15-2) were graphs with 15 vertices.

4.3 Performance measurement

During our tests we measured several parameters of the general performance of our system and of the efficiency of the load balancing algorithms.

The major performance criterion of parallel execution is the speedup:

Figure 14. TSP: one level parallelization.


speedup = Tsequential / Tparallel,

where Tparallel is the time required for the parallel algorithm execution, and Tsequential is the time required for the sequential algorithm execution of the same problem on the same input.

If the time of the sequential algorithm execution differs for the computers participating in the parallel execution (because of differences in performance or external load), then the effective sequential time can be calculated according to the following formula:

Teffect = n / (1/t1 + 1/t2 + ... + 1/ti + ... + 1/tn),

where n is the number of computers, and ti is the time of the sequential execution of the application on computer i.

Figure 15. TSP: two level parallelization.

Teffect should be used instead of Tsequential when speedup is calculated in a heterogeneous system. The effective sequential time is the average time of the sequential execution on a set of computers.

We also introduce normalized speedup, which shows the relation of the achieved speedup to the linear speedup:

SPnorm = speedup / speeduplinear = speedup / n,

where n is the number of computers participating in the parallel execution.
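As a small worked illustration of these metrics (with made-up numbers, not measurements from our tests), the following program computes the effective sequential time, the speedup and the normalized speedup for three hypothetical machines.

#include <stdio.h>

int main(void) {
    double t[] = {120.0, 90.0, 180.0};  /* hypothetical sequential times (sec) */
    int n = 3;

    double inv_sum = 0.0;
    for (int i = 0; i < n; i++)
        inv_sum += 1.0 / t[i];
    double t_effect = n / inv_sum;      /* effective sequential time: 120 s    */

    double t_parallel = 50.0;           /* hypothetical parallel time          */
    double speedup = t_effect / t_parallel;          /* 2.4                     */
    double sp_norm = speedup / n;                    /* 0.8                     */

    printf("Teffect = %.1f s, speedup = %.2f, SPnorm = %.2f\n",
           t_effect, speedup, sp_norm);
    return 0;
}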

In order to compare various load balancing algorithms we measure the active time of application execution. Active time of user task execution on a node is the execution time when at least one ready user thread exists on this node. The relative active time is equal to the ratio of the active time to the entire time of the application execution. It approximates the amount of time the host was not idle. The parallel application execution is characterized by the average active time of all the hosts. This parameter reflects the efficiency of the load balancing algorithm - the more effective the load balancing, the closer to one the relative active time.

We also measure the CPU time consumed by the application on each host and on all the hosts together. This is the best measurement of parallel application resource usage.

4.4 Load balancing parameters

The system designed and the load balancing algorithms implemented therein can be fine-tuned by several parameters (Tunder and Tupper, the number of threads defined in a user application, or the Lpar_Threads_per_Node parameter).

In this section we present experimental results of running parallel applications with various parameter values. Whenever possible, we give recommendations on parameter usage for different types of parallel applications and environments. We also try to determine suitable default values for the load balancing parameters.

4.4.1 Thresholds algorithm parameters

The behavior of the thresholds load balancing algorithm is affected by two parameters - Tunder and Tupper. In this section we analyze the dependence of the thresholds algorithm performance on the values of these parameters.

The results of executions with various parameters are shown in Figure 16. The first three applications are static; all activities are allocated at the beginning of the execution. According to the algorithm, Tunder threads are allocated on each node and the rest are allocated locally on the main node. The Tupper parameter is immaterial in static applications. The first three tests show that the execution time decreases as the value of Tunder increases. This result can be simply explained by the fact that the workloads of the threads in the first three tests are either equal or almost equal to one another. The optimal load balancing in this case uniformly distributes threads among nodes. Higher values of Tunder lead to more optimal load distribution and better execution times.

The fourth graph indicates that Tunder=3 is the best value among those tests. Workloads of threads in this test are very uneven; only the last third of threads are overloaded. In this case the thresholds algorithm distributes threads as follows:

1. Tunder = 1: 18 + 1 + 1

2. Tunder = 3: 14 + 3 + 3

3. Tunder = 5: 10 + 5 + 5

If we take into account that the last seven threads are overloaded, we can see that the second case is actually the best.

In general, the optimal value of Tunder for static parallel applications is equal to the number of threads divided by the number of nodes. In this case the thresholds load balancing algorithm behaves as the round-robin algorithm.
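As a minimal sketch (our reconstruction from the description above, not the system's actual code; all names are hypothetical), the initial placement used by the thresholds scheme for a static application can be written as:

    def thresholds_initial_allocation(num_threads, num_nodes, t_under):
        # Every node receives t_under threads; all remaining threads stay
        # on the main node (node 0), as described in the text above.
        allocation = [t_under] * num_nodes
        allocation[0] = num_threads - t_under * (num_nodes - 1)
        return allocation

    # Reproduces the distributions quoted above for 20 threads on 3 nodes:
    print(thresholds_initial_allocation(20, 3, 1))   # [18, 1, 1]
    print(thresholds_initial_allocation(20, 3, 3))   # [14, 3, 3]
    print(thresholds_initial_allocation(20, 3, 5))   # [10, 5, 5]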

The results of testing the TSP problem with two-level parallelization and allocation of three threads per parent are shown in the last graph. These results indicate that the best values of the Tunder and Tupper parameters among those tested are Tunder=2 and Tupper=3. Note, however, that as mentioned in the previous section, the TSP problem is not a good testing example. On the other hand, this pair of parameters is the best or near-best in all TSP tests. We are aware that our tests are insufficient for the selection of optimal parameter values. Nevertheless, the above pair coincides with our expectation, and we set Tunder=2 and Tupper=3 as the defaults for the thresholds algorithm.


[Figure 16: execution time (sec) vs. input size for matrix multiplication (3 hosts, 9 threads), Dijkstra's algorithm (3 hosts, 15 threads), the partial differential equation 2((x-0.5)^2+(y-0.5)^2) (3 hosts, 12 threads), the partial differential equation x^7 (3 hosts, 20 threads, 1/3 of the threads active), and TSP with two-level parallelization (3 hosts, LPAR=1); each curve corresponds to a different (Tunder, Tupper) setting of the thresholds algorithm.]

Figure 16. Thresholds LB algorithm parameters.


4.4.2 Central queue algorithm parameterization

The behavior of the central queue load balancing algorithm depends on the Tunder parameter. According to the algorithm, the load manager sends a request for a new thread when the local load falls beneath the Tunder value.
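As a rough sketch of this rule (our reconstruction; the names local_ready_threads and request_thread_from_main are placeholders, not the system's real interface):

    T_UNDER = 1  # example threshold; a new thread is requested below this load

    def central_queue_balance_step(node):
        # Central queue scheme: when the local load falls beneath T_UNDER,
        # the node's load manager asks the main node's central queue for
        # another ready thread.
        if len(node.local_ready_threads) < T_UNDER:
            thread = node.request_thread_from_main()  # None if the queue is empty
            if thread is not None:
                node.local_ready_threads.append(thread)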

Experimental results for several applications with various values of Tunder are shown in Figure 17. The first two graphs represent the execution time of matrix multiplication and Dijkstra’s algorithm. The next pair of graphs shows results of a special case of Dijkstra’s algorithm execution, in which one node is also loaded by five external tasks. The last pair of graphs represents execution times of partial differential equation solutions on two inputs.

The first two graphs indicate the results achieved with various Tunder values in matrix multiplication and Dijkstra’s algorithm. The similarity in results can be explained by equal workload of the threads in these applications. The total number of threads executed by each node does not depend on the parameter value.

The first graph also indicates a minor advantage of Tunder=1. This result is explained as follows. On one hand, the value of Tunder defines the granularity of the central queue algorithm. The greater the value of Tunder is, the more threads stay on the last node at the end of the application, causing this node to work longer after all other nodes become idle. On the other hand, we expected that allocating a single thread per node may occasionally cause this node to become idle, for example, when the thread is temporarily suspended for remote memory access or for any other reason. As our tests show, the former factor is more significant than the latter, and higher performance is achieved with Tunder=1.

The next pair of graphs represents results of Dijkstra’s algorithm execution when one of the nodes has additional load of 5 external tasks. In this case the optimal value of Tunder is 3. In brief, the operating system provides equal slots of CPU time to each thread regardless of which task they belong to. Thus, the more threads execute a certain task in parallel with other tasks, the more CPU time is allocated to this task by the operating system at the expense of other tasks. That is why better results were achieved with a larger Tunder value, as opposed to the results presented in the previous paragraph.
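To make this argument concrete (a back-of-the-envelope estimate under the stated assumption of equal CPU slots per thread, not a measured figure): if a node runs $k$ application threads alongside the five single-threaded external tasks, the application receives roughly

$$\frac{k}{k+5}$$

of that node's CPU time - about 1/6 with one thread but 3/8 with three - which is why a larger Tunder pays off on the non-dedicated node.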

A similar situation occurs in solving partial differential equations - the best results are achieved with higher values of the parameter. The higher the value of Tunder is, the more neighbor activities are allocated on the same node and, as a result, page migration becomes less intensive. We thus conclude that it is more important to work on reducing page migration than on improving load balancing.


In general, we see that the optimal value of Tunder depends on the type of parallel application and the system environment. If the processors are busy with external applications, the Tunder parameter should be set to higher values.

[Figure 17: execution time (sec) vs. input size for the central queue algorithm with various Tunder values: matrix multiplication (3 hosts, 9 threads); Dijkstra's algorithm (3 hosts, 15 threads); Dijkstra's algorithm with the third node overloaded by 5 tasks; Dijkstra's algorithm with the first node overloaded by 5 tasks (reverse order of hosts); the partial differential equation 2((x-0.5)^2+(y-0.5)^2) (12 threads); and the partial differential equation x^7 (20 threads, 1/3 of the threads active).]

Figure 17. Central Queue LB algorithm parameter.


A higher value of Tunder should also be set for parallel applications with intensive data exchange among subsequent activities. We set Tunder=2 as the default value for the central queue load balancing algorithm.

4.4.3 Local queue algorithm parameterization

According to the local queue load balancing algorithm, thread migration is initiated when the local load falls beneath the value of Tunder. As a result, this parameter affects the algorithm performance.
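A comparable sketch for the local queue scheme (again our reconstruction; the choice of donor node is an assumption, since the text specifies only when migration is initiated, not from which node the thread is taken):

    T_UNDER = 2  # example threshold; thread migration starts below this load

    def local_queue_balance_step(node, peers):
        # Local queue scheme: an under-loaded node initiates thread migration,
        # pulling one ready thread from a busier peer.
        if len(node.local_ready_threads) < T_UNDER:
            donor = max(peers, key=lambda p: len(p.local_ready_threads), default=None)
            if donor is not None and len(donor.local_ready_threads) > T_UNDER:
                migrated = donor.local_ready_threads.pop()
                node.local_ready_threads.append(migrated)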

Experimental results with various parameter values are shown in Figure 18. In almost all the tests the best performance was obtained with Tunder=2 or 3. In the matrix multiplication problem and Dijkstra's algorithm the performance difference is minor. When Dijkstra's algorithm is executed on a system in which the third node is overloaded, the performance with Tunder=2 is better than that with Tunder=1 by 12%. The fourth graph shows that Tunder=1 attains slightly better results than Tunder=2 for large graphs on a system with an overloaded first node. According to the last pair of graphs, the performance for solving differential equations with Tunder=3 is 15-20% higher than with Tunder=1.

On the whole we can say that the optimal value of the Tunder parameter in the local queue algorithm is 2 or 3, whereas the optimal value for the central queue algorithm is 1. The difference can be explained as follows. The value of 1 was chosen as optimal in the central queue algorithm because of the case in which several threads continue running on the last node after the other nodes have become idle. This situation could never occur in the local queue algorithm, where some of these threads would be transferred to an idle node. A single thread running on a node causes it to become idle when the thread accesses remote data or is suspended for any other reason. Another problem with Tunder=1 is that a node becomes idle when it requests a new thread upon completion of the previous one. If several threads have been allocated on a node, another thread would utilize this time. As a result we set the default value of Tunder to 2.

4.4.4 Number of threads

The number of threads allocated by the system for a parallel application affects overall performance. The number of threads is defined for pparfor/pparblock parallel constructs by the user application, and for lparfor/lparblock constructs by the Lpar_Threads_per_Node parameter and the number of nodes. In this section we analyze how the number of threads influences the performance of various load balancing algorithms with different types of applications and system environments.


Sample execution times and relative active times of three applications are presented in Figure 19. The first pair of graphs represents the results of matrix multiplication with no external load except for the testing application. As the graphs show, there is almost no correlation between the execution time or relative active time and the number of threads in matrix multiplication. The workload of the threads in this application is uniform; thus, to distribute the threads optimally we need to allocate an equal number of threads on each processor. Additional threads contribute nothing to the load balancing; they only cause additional system overhead. Another feature shown in the graphs involves the dynamic load balancing algorithms.

[Figure 18: execution time (sec) vs. input size for the local queue algorithm with various Tunder values: matrix multiplication (3 hosts, 9 threads); Dijkstra's algorithm (3 hosts, 15 threads); Dijkstra's algorithm with the third node overloaded by 5 tasks; Dijkstra's algorithm with the first node overloaded by 5 tasks (reverse order of hosts); the partial differential equation 2((x-0.5)^2+(y-0.5)^2) (12 threads); and the partial differential equation x^7 (20 threads, 1/3 of the threads active).]

Figure 18. Local Queue LB algorithm parameter


For example, the central queue algorithm with Tu=2 and 15 threads is distinguished by a high execution time and a low relative active time, because of the uneven thread distribution created by a small “mistake” on the part of the load manager. In this case the distribution of allocated threads was 5+6+4 instead of 5+5+5, thus causing load imbalance among the processors.

The number of threads influences the performance much more in a heterogeneous system, particularly when one of the nodes is also loaded by external tasks (we call such a node non-dedicated). The next pair of graphs shows execution of Dijkstra’s algorithm in such a system. Here, increasing the number of threads from 6 to 15 causes the execution time to decrease sharply. When several tasks run on a node, the operating system provides equal slots of CPU time to each thread regardless of which task they belong to, a fact also reflected in our experimental results. Figure 20 demonstrates sequential and simulation execution of Dijkstra’s algorithm with various numbers of threads and various numbers of external tasks, when each such task is executed by one thread. As we can see, the more threads execute a parallel application, the less time is required to complete the execution. Additional threads make the task more powerful, so it receives more CPU time. Without the addition of threads, the non-dedicated nodes could cause bottlenecking in parallel execution. But additional threads result in additional CPU time, and execution time decreases instead. Thus the round-robin algorithm achieves higher performance with more threads.

The reason for the decrease in execution time is different for dynamic load balancing algorithms. Load imbalance decreases along with load balancing granularity. The latter is defined by the workload of a thread, and it decreases when the number of threads rises. For example, suppose that one node executes the last thread while all the others are already idle. The less loaded this last thread is, the less time is needed to finish its execution. This explains the indicated dependency between load balancing and the number of threads for dynamic algorithms.
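Under the simplifying assumption of equal thread workloads (our own rough bound, not a measurement), the idle tail during which a single node finishes the last thread is at most the work of one thread; with total work $W$ split into $T$ threads this bound is

$$\frac{W}{T},$$

which shrinks as the number of threads grows.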

The last pair of graphs shows results for solving partial differential equations. Here we can see a strong correlation between execution time and relative active time on the one hand, and the number of threads on the other. The more threads execute an application, the higher the relative active time and the lower the execution time. The reasons for these two dependencies are different and are explained below.


[Figure 19: execution time (sec) and relative active time (%) vs. the total number of threads for matrix multiplication (3 hosts, matrix 400x400), Dijkstra's algorithm with the third node overloaded by 5 tasks (3 hosts, graph of 300 vertices), and the partial differential equation x^7 (3 nodes, size 400x400), comparing round-robin (multiple mode), central queue and local queue with various Tu values.]

Figure 19. Number of threads.


As discussed above, the relative active time is affected by granularity of load balancing. Larger numbers of threads decrease granularity of load balancing. This explains the indicated dependency between relative active time and the number of threads.

The major problem in solving partial differential equations in a distributed system is massive page migration caused by data sharing among consecutive activities. The amount of data shared is proportional to the number of threads; thus, the more threads execute the application, the more intensive the page migration is. Nevertheless, the graph exhibits the opposite result. This apparent contradiction can be explained by another feature of the parallel applications considered: dividing the application into threads decreases the execution time itself (see Figure 21, which gives the running time of sequential execution and of simulation on one processor with various numbers of threads for two inputs). As we can see, the more threads simulate a parallel execution, the less time the execution needs. This dependency is even stronger for x^7. As we know, the workload of two thirds of the threads running on this input is quite low. In a sequential execution we have to process all the mesh points in each iteration, until they all achieve the required precision. In the simulation execution, however, two thirds of the threads achieve the required precision on each iteration very quickly; they are idle most of the time, leaving the CPU time for the rest of the threads. The achieved absolute precision decreases slightly together with the execution time.

To summarize the results, we saw that the performance of load balancing algorithms and of the system as a whole depends on the number of threads. This dependence varies for different applications and their execution conditions. In most cases, increasing the number of threads up to some optimal value leads to improved performance.

[Figure 20: execution time (sec) of Dijkstra's algorithm (graph of 300 vertices) vs. the number of busy external tasks (0-5), comparing sequential execution with simulations using 5 and 15 threads. Figure 21: sequential and simulated execution time (sec) on one processor vs. the number of threads for the differential equation inputs 2((x-0.5)^2+(y-0.5)^2) and x^7.]

Figure 20. Simulation on non-dedicated node.

Figure 21. Simulation of differential equation solving on one processor.


4.4.5 Number of nodes

In this section we compare execution of parallel applications in 2-node and 3-node systems, the reason being that we only had 3 computers available for our experiments. Nevertheless, the results received are quite interesting. The experimental results are presented in Table 2. Execution time and achieved speedup are shown for Dijkstra’s algorithm and for solving partial differential equations with two inputs. The tests were conducted with the round-robin, central queue and local queue load balancing algorithms.

As we can see, the normalized speedup of Dijkstra’s algorithm on two and three nodes is almost the same. The execution on two nodes achieves slightly better normalized results, which is explained by less interprocessor communication and lower relative system overhead.

The difference between speedups achieved by solving differential equations on two and three nodes is much higher. The normalized speedup achieved on two nodes is, on the average, 1.5 times the speedup achieved on three nodes. Moreover, the execution time on two nodes is less than or almost equal to that on three nodes, because the probability of subsequent threads being allocated on the same node increases as the number of nodes decreases. Thus, intensity of page migration in execution on two nodes is reduced.

We also see that additional nodes do not always add computational power to the distributed system. There is an optimal number of nodes for each parallel application. Greater data sharing among parallel activities causes the optimal number of nodes to decrease. One of our future research directions is to compute the optimal number of nodes at run-time, using the load manager and the memory manager together.

                                    Dijkstra,          Diff. equation                Diff. equation
                                    size 300,          2((x-0.5)^2+(y-0.5)^2),       x^7,
                                    15/16 threads      size 300, 12 threads          size 400, 20 threads

Sequential time                     221                237                           564
Number of nodes                     3        2         3        2                    3        2

Round-robin LB:   Parallel time     101      142       166      172                  122      127
                  Speedup           2.18     1.56      1.42     1.38                 4.62     4.44
                  Norm. speedup     0.73     0.78      0.47     0.69                 1.54     2.22

Central queue LB: Parallel time     101      144       261      270                  167      190
                  Speedup           2.18     1.53      0.90     0.87                 3.38     2.97
                  Norm. speedup     0.73     0.77      0.30     0.44                 1.13     1.48

Local queue LB:   Parallel time     103      146       196      196                  143      130
                  Speedup           2.15     1.51      1.21     1.21                 3.94     4.34
                  Norm. speedup     0.72     0.76      0.40     0.61                 1.31     2.17

Table 2. Parallel execution on different numbers of nodes.

4.5 Comparison of load balancing algorithms

In this section we compare the load balancing algorithms on different application types. We try to determine the best load balancing scheme for each type of parallel application, and show the performance our system achieved for each application type. In what follows, we call a node “dedicated” if it has no external load and executes only the testing application itself. Otherwise (if a node also executes other applications not related to the testing) we call it “non-dedicated”.

The results achieved for four parallel applications with all load balancing algorithms are summarized in Figure 22 and Figure 23. All the tests were conducted on a 3-node system. Matrix multiplication was only tested on dedicated nodes. Dijkstra’s algorithm was executed both on dedicated nodes and in systems with one non-dedicated node (overloaded with five external tasks). Solving partial differential equations was run on two inputs - 2((x-0.5)^2+(y-0.5)^2) and x^7. All applications were tested with various input sizes. The fourth application (the TSP problem) was tested in one-level and two-level parallelization versions on several inputs.

Each application was executed with all the relevant load balancing algorithms. The central load manager and thresholds load balancing schemes were not tested for static parallel applications, as they only make sense for dynamic parallel applications such as TSP with two level parallelization.

The first two graphs report almost identical performance achieved by all load balancing schemes for matrix multiplication and Dijkstra’s algorithm on dedicated nodes. The optimal thread distribution was achieved in this case even with simple load balancing, as in round-robin, because the workload of all threads is almost equal and the nodes were dedicated.

Achieved speedups depend on the problem size. Up to a certain problem size, parallel execution time is even greater than sequential because of system overhead caused by thread management, remote thread allocations and page migration. The amount of computation required for small problems is not sufficient to surpass the system overhead.


The maximal speedups achieved were 1.8 and 2.4, respectively. We reckon that the relatively low speedup achieved by matrix multiplication is mainly caused by the overhead imposed by double addressing of shared variables. In order to check this claim we ran the parallel application on a single processor with one thread. The execution time of matrix multiplication was 1.3 - 1.4 times that of the sequential application; the corresponding ratio for Dijkstra’s algorithm was approximately 1.1. We see another reason for the low speedup in the ratio between the time complexity and the space complexity of the matrix multiplication problem. In both matrix multiplication and Dijkstra’s algorithm these complexities are O(n^3) and O(n^2), respectively. But the computation in the case of matrix multiplication is quite simple, although the amount of data processed is larger than in Dijkstra’s algorithm.

The next pair of graphs represents Dijkstra’s algorithm execution when one node is non-dedicated. Observe that these two graphs correspond to two different orderings of hosts. In the first case the hosts are ordered as mach2+mach3+mach5, and in the second case as mach5+mach3+mach2; in both cases, mach5 is non-dedicated. As we can see, the results achieved on these two orderings of hosts are different. This is caused by the asymmetry of our distributed system. The main host (the first host) begins execution of the parallel application and runs the central load balancing manager. The order of hosts also influences the distribution of threads. The last thread executed by a non-dedicated host increases the execution time much more than the last thread executed on a regular host. All this increases the system asymmetry.

As the graphs show, the difference between the two system configurations is minimal for the round-robin policy. In this case the thread distribution is symmetric and the asymmetry of the two systems does not affect the results significantly. The performance of the central queue and local queue algorithms is much more affected by the host order. This is caused by all the factors listed in the previous paragraph.

Nevertheless, the same tendency persists in both cases, i.e., with either the first node or the last node non-dedicated. Dynamic load balancing algorithms achieve higher performance than the static round-robin scheme. This supports the claim that the behavior of dynamic load balancing algorithms adapts to external load, despite the fact that our load measurement does not monitor the external load of the processors. The effective speedup achieved by the central queue and local queue schemes in heterogeneous systems is 2.1 - 2.3. This is only 10% less than the speedup achieved in homogeneous systems, thus proving the effectiveness of our dynamic schemes.

Test results for solving partial differential equations are presented in the next pair of graphs. We ran the dynamic load balancing algorithms with higher parameter values (as recommended in the previous sections). Here, the situation observed is the reverse of our previous results: the performance of the round-robin scheme is higher than that of the dynamic load balancing algorithms. This does not mean that the round-robin scheme is especially good; it merely points to a drawback of our dynamic load balancing schemes. The locality principle, essential for this type of parallel application, is maintained by round-robin but almost completely ignored by our dynamic algorithms. This problem is especially clear when solving partial differential equations, because of the massive data sharing among subsequent threads in this type of application. Improving our dynamic load balancing algorithms to maintain the locality principle is the most important direction of future research. Nevertheless, round-robin and the dynamic schemes achieve almost the same results when running on the input x^7 (unless the mesh size is large). This is because the central queue and local queue algorithms redistribute the unevenly loaded threads in order to balance processor load, so the massive page migration is compensated by a more uniform load distribution.

In spite of the massive page migration, we achieve acceptable speedups in solving partial differential equations as well. On the first input the achieved speedup is 1.9, while on the second we even reach the overlinear speedup of 4.6 (normalized speedup of 1.5). The main reason for such a result (including the overlinear speedup) is the significant decrease in execution time with algorithm parallelization. This was first discussed in subsection 4.4.4 (page 31).

The last application (TSP) was tested in both one and two level parallelization versions. We present both the speedup and relative active time. The speedup is less representative here because of the peculiarities of the TSP parallel application mentioned above. The relative active time reflects the system load balancing, which is an important benchmark of algorithm effectiveness. Two additional load balancing algorithms were tested with the two level parallelization version, namely the thresholds and central load manager algorithms. These two only make sense in dynamic parallel applications. Unfortunately, the two-level TSP version is the only dynamic parallel application tested in this work. We are aware that testing only one application is not sufficient for general conclusions; nevertheless, the tests considered give us some information about these two algorithms.

In the static one level parallelization, the dynamic load balancing algorithms “compete” and alternately achieve better results. For some of the inputs the central queue algorithm excels, while for others the local queue algorithm performs better.

The same picture is observed in the dynamic two level parallelization: the central queue and local queue algorithms are the best. At the same time, the relative active time achieved on the input 15-2 by the central queue load balancing is slightly less than that of the round-robin. We think that this particular result does not violate the general tendency. The worst performance is unexpectedly achieved by the central load manager and thresholds algorithms. This is probably caused by high message overhead. This is noted as a disadvantage of these two algorithms.

The speedup achieved by our system for TSP applications varies between 2.5 and 3.5. Overlinear speedup is achieved due to the peculiarities of the TSP parallel execution, in which execution time depends greatly on the order of thread execution, and the amount of work a thread must perform can change drastically if another thread finds a shorter path.


Summarizing the surveyed results, all the tests demonstrate the high efficiency of our distributed shared memory system. The speedup achieved in the system depends on the specific application type. The highest normalized speedup, 0.8, was achieved on parallel applications with little data sharing. In more problematic parallel applications with a large amount of shared data, the speedup is also reasonable. There are several parallel applications where overlinear speedup was achieved due to special features of these applications.

Our dynamic load balancing algorithms prove to be the most effective schemes for most parallel applications. Nevertheless, selection of the optimal load balancing algorithm depends on the application type. The round-robin load balancing algorithm remains the best selection for parallel applications with intensive data sharing among subsequent threads. This is because dynamic schemes completely ignore the locality principle. The tests conducted also demonstrate acceptable behavior of our dynamic load balancing algorithms in heterogeneous systems. The speedup achieved in heterogeneous systems is almost equal to that achieved in homogeneous systems, thus proving the effectiveness of dynamic schemes.

Based on the tests described, we could reach no unequivocal conclusion with regard to central queue vs. local queue algorithm performance. The disadvantages of the centralized schemes would become apparent in parallel systems with many more nodes than we had. In general, the local queue scheme is more stable and is more promising for future development.


[Figure 22: speedup (effective speedup for the non-dedicated configurations) vs. input size for matrix multiplication (3 hosts, 9 threads), Dijkstra's algorithm (3 hosts, 15 threads), Dijkstra's algorithm with the third or the first node overloaded by 5 tasks, the partial differential equation 2((x-0.5)^2+(y-0.5)^2) (12 threads), and the partial differential equation x^7 (20 threads), comparing round-robin (multiple mode), central queue and local queue with various Tu values.]

Figure 22. Comparison of load balancing algorithms for matrix multiplication, Dijkstra’s algorithm and solving partial differential equations.


Thresholds and central load manager load balancing algorithms are only applicable in dynamic parallel applications with several levels of parallelization and many parallelization points. In this case these algorithms can also achieve acceptable results.

5. Conclusion and further research directions

We have developed and implemented the PARC-MACH distributed shared memory system for parallel execution of applications on top of the Mach operating system. Six static and dynamic load balancing policies, including thread migration, have been implemented in the system, namely the random, round-robin, central load manager, thresholds, central queue and local queue schemes. The load balancing algorithms differ in whether they are cooperative, static or dynamic, centralized or distributed, and whether they use thread migration.

Research and comparison of the implemented load balancing schemes were conducted on four benchmark parallel applications - matrix multiplication, Dijkstra’s algorithm, solving a partial differential equation, and TSP.

[Figure 23: speedup and relative active time (%) for TSP with one-level parallelization (3 hosts; inputs 14-1 and 15-1 with 6 and 15 threads; round-robin multiple, central queue Tu=1, local queue Tu=1) and with two-level parallelization (3 hosts, 6 threads allocated each time; inputs 14-1, 14-2, 15-1, 15-2; round-robin multiple, central load manager, central queue Tu=1, local queue Tu=1, thresholds Tu=2 To=3).]

Figure 23. Comparison of load balancing algorithms for the TSP problem.


These applications differ in the workload of their threads and in the intensity of data sharing. We examined how the performance is affected by various load balancing parameters, such as threshold values and the number of nodes and threads.

The results achieved in this work are as follows:

• Single thread allocation and multiple thread allocation modes of the static load balancing algorithms have been compared. The experimental results indicate similar performance of both modes in matrix multiplication, Dijkstra’s algorithm and TSP. The execution time of solving partial differential equations in the former mode is on average twice that in the latter. These results are explained by the massive data sharing among consecutive threads in this application. The locality principle becomes very important for such types of parallel applications, and neglecting this principle drastically affects performance.

• Selection of optimal threshold values for the dynamic load balancing algorithms depends on the type of parallel application and the conditions in the system environment. Generally, the optimal value of the threshold is 1 for the central queue scheme and 2 for the local queue scheme. In the former algorithm, the threshold value defines the granularity of the load balancing, while in the latter the load balancing granularity is always 1, and allocating more than one thread per node is advantageous because threads are occasionally suspended. Higher values of the threshold parameter achieve better performance in heterogeneous systems, where some of the nodes are overloaded by external tasks. The same tendency is observed in parallel applications with massive data sharing.

• The number of threads executing a parallel application affects the performance as well. There is no strong correlation between the number of threads and the performance in a homogeneous system when the workload of the threads is similar. At the same time, increasing the number of threads up to some optimal value decreases the execution time of applications in heterogeneous systems, or when the workload of the threads is very uneven. This is explained by the lower granularity of load balancing and by certain features of thread execution on nodes overloaded by external tasks. The same tendency is observed in solving partial differential equations, where it is caused by peculiar features of this parallel application.

• The experimental results indicate that additional nodes do not always add computational power. Similar normalized speedups are achieved in two- and three-node systems for parallel applications with a low intensity of data sharing. This result demonstrates good utilization of nodes by this type of parallel application. On the other hand, the execution time of solving partial differential equations is almost equal in two- and three-node systems. This phenomenon is caused by massive page migration, which gains intensity as the number of nodes increases.

• Comparison results for the load balancing policies differ for various types of parallel applications. Similar performance is achieved by all the policies in matrix multiplication and Dijkstra’s algorithm in a homogeneous system. The optimal distribution of load for these applications is an equal number of threads per node. This load distribution is provided by all the policies, thus explaining the results observed. Higher performance is achieved by the dynamic schemes in a heterogeneous system, where one of the nodes is overloaded by external tasks, and in parallel applications with an uneven workload of threads. This result is predictable, because the dynamic schemes redistribute load as a result of unevenly loaded processors or an uneven workload of threads. Our tests did not determine whether the central queue or the local queue algorithm is better. The disadvantages of the centralized schemes become apparent in parallel systems with many more nodes.

• We observed that performance achieved by the thresholds and central load manager algorithms is worse than that of the round-robin scheme. This result is caused by the high message overhead of the first two algorithms, and by the relatively uniform load distribution by the round-robin scheme for dynamic parallel applications with a large number of threads.

• The most interesting effect is observed in solving partial differential equations where data sharing among threads is very massive. Here, the performance of dynamic load balancing schemes is even worse than that of the round-robin algorithm. The locality principle, which is extremely important for this kind of parallel application, is maintained by the round-robin scheme and is almost fully ignored by the implemented dynamic load balancing schemes. Neglecting the locality principle causes massive page migration that affects the performance.

The results reported in this section define future directions of this research. The most important problem we encountered in shared memory distributed systems is, again, massive page migration. We noted that load balancing in the narrow sense, i.e., equalizing the load of nodes, does not always minimize execution time. Dynamic load balancing algorithms, which optimize load balancing and decrease page migration, should be developed. At the first stage, an algorithm must take into account the locality principle when selecting a thread for allocation or transfer from one node to another. The general solution is cooperation between the thread manager and shared memory manager. The decision to involve either object-to-thread or thread-to-object policy should be made for each miss separately, in order to reach optimum load balancing and minimize page migration. The same load balancing policy has to optimize the number of nodes involved in parallel execution of the application.
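A minimal sketch of the kind of locality-aware selection we have in mind (purely illustrative; the page-ownership map and the scoring are our assumptions, not an implemented policy):

    def pick_thread_for_transfer(donor, receiver, pages_of, owner_of):
        # Among the donor's ready threads, prefer the one whose shared pages
        # already reside on the receiving node, so that migrating it causes
        # the least page migration.
        def locality_score(thread):
            pages = pages_of(thread)                  # pages the thread accesses
            local = sum(1 for p in pages if owner_of(p) == receiver)
            return local / max(len(pages), 1)         # fraction already at receiver
        return max(donor.local_ready_threads, key=locality_score, default=None)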

6. Acknowledgements


We wish to thank the anonymous referees for their useful comments. Roy Friedman was supported by DARPA/ONR grant N00014-96-1-1014. Assaf Schuster is supported by Intel Israel and Microsoft Israel.
