

    An improved two-step algorithm for task and data parallel scheduling in distributed memory machines

    Savina Bansal a,*, Padam Kumar b, Kuldip Singh b

    a Department of Electronics Engineering, GZS College of Engineering and Technology, Bathinda, Punjab, India
    b Department of E&CE, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal, India

    Received 7 January 2006; received in revised form 4 July 2006; accepted 29 August 2006
    Available online 19 October 2006

    Abstract

    Scheduling of most parallel scientific applications demands simultaneous exploitation of task and data parallelism for efficient and effective utilization of system and other resources. Traditional optimization techniques, like optimal control-theoretic approaches, convex programming, and bin-packing, have been suggested in the literature for dealing with the most critical processor allocation phase. However, their application to real-world problems is not straightforward, which drives the solutions away from optimality. Heuristic-based approaches, in contrast, work in the integer domain for the number of processors throughout, and perform appreciably well. A two-step Modified Critical Path and Area-based (MCPA) scheduling heuristic is developed that targets the processor allocation phase of the existing Critical Path and Area-based (CPA) scheduling algorithm. The strength of the suggested algorithm lies in bridging the gap between the processor allocation and task assignment phases of scheduling. It makes better processor allocations for data parallel tasks without sacrificing the essential task parallelism available in the application program. Performance of the MCPA algorithm, in terms of normalized schedule length and speedup, is evaluated on random and real application task graph suites. It turns out to be much better than the parent CPA algorithm and comparable to the high-complexity Critical Path Reduction (CPR) algorithm. © 2006 Elsevier B.V. All rights reserved.

    Keywords: Mixed parallelism; Scheduling algorithm; Distributed systems; Data parallelism; Parallelizable tasks

    1. Introduction

    Scheduling is one of the most vital design issues in parallel and distributed computing systems, affecting the overall execution time of an application. Simultaneous exploitation of task (or function) and data parallelism is a relatively new trend steadily taking shape for high performance parallel computing. Merits of this approach have been emphasized for solving many scientific and engineering applications that involve extensive

    0167-8191/$ - see front matter © 2006 Elsevier B.V. All rights reserved.

    doi:10.1016/j.parco.2006.08.004

    * Corresponding author. Tel.: +91 164 2281954.
    E-mail addresses: [email protected] (S. Bansal), [email protected] (P. Kumar), [email protected] (K. Singh).

    Parallel Computing 32 (2006) 759–774

    www.elsevier.com/locate/parco


    exploiting task parallelism gets sidelined at times. As a result, an algorithm may lose its track, in spite of making a perfect start with an optimal processor allocation. We support this conjecture in the following sections; it motivated us to develop the Modified Critical Path and Area-based algorithm (MCPA), which succeeds in bridging this gap to an appreciable extent. In the MCPA algorithm, essential task parallelism is preserved, at the cost of data parallelism sometimes, if it is crucial for the overall performance of the algorithm, and the low-complexity feature of two-step algorithms is also retained. The paper is organized as follows: the following section formulates the problem and describes some of the significant terms and expressions used in the work. Section 3 discusses related work and the motivation behind the proposed algorithm, which is expounded in the section following it. Sections 5 and 6 present the experimental set-up and performance results in comparison to state-of-the-art M-task scheduling algorithms. Section 7 discusses and concludes the work.

    2. Problem formulation

    The basic application or task model is represented by a quadruple Q(T, R, [c_ij], [w(i, n_i)]), where T = {t_i : i = 1, 2, . . . , v} is a set of v tasks/nodes; R represents a relation that defines a partial order on the task set, such that if t_i R t_j then task t_i must finish before t_j can start execution; [c_ij] is a v × v matrix giving the communication cost (depending on network characteristics and the volume of data involved) between tasks t_i and t_j; and [w(i, n_i)], for 1 ≤ n_i ≤ p, is a v × p matrix (p being the total number of processors in the system), which represents the execution cost of task t_i for different numbers of processors n_i allocated to it (Table 1). It may be seen that the speedup improvement, with increasing number of processors allocated to a node, is not linear but convex, due to the presence of various overheads and the sequential part of computation w_i^s. Further, the communication cost is a function not only of the dataflow involved but also of data distribution overheads; the latter is, however, taken as part of the computation cost of the concerned nodes, as elaborated in [19].

    An MDG (Fig. 1) can better represent the application model, with vertices representing coarse grain macro nodes that can be executed in a data parallel manner on disjoint sets of processors, and edges the dataflow paths signifying data/control dependencies among the nodes. Many researchers have used such a task model, and its practicability is substantiated in [19], which deals with automatic extraction of macro dataflow graphs from an extended HPF or MATLAB program. The techniques for estimating computation and communication costs at compile-time are dealt with in [8,9] and have been implemented as part of the PARADIGM compiler. Communication costs are assumed non-negligible, even if two macro nodes are scheduled on the same set of processors, due to the data redistribution costs that might be involved. All the tasks are preceded and succeeded by a source and a sink node, which are responsible for distributing and collecting data at the beginning and at the end of the program, respectively. Communication costs from these nodes are also considered, due to the possible data distribution or redistribution costs involved in the mixed parallel scenario. Communication links are assumed to be contention free.

    Computation costs for the data parallel nodes can be obtained through estimation or profiling. Amdahl's law provides good cost estimations for a node for different numbers of processors allocated to it. In the cost profiling method, costs are obtained by actually measuring them as a function of the number of processors in sample program runs, and then using a linear regression method to fit these values to a function of the form

    Table 1
    Computation cost matrix w(i, n_i) for the MDG shown in Fig. 1

    t_i                              n_i = 1   2     3     4    5    6    7     8
    2 (w_i^s = 20, w_i^p = 240)      260       140   100   80   68   60   54.3  50
    3 (w_i^s = 20, w_i^p = 240)      260       140   100   80   68   60   54.3  50
    4 (w_i^s = 20, w_i^p = 240)      260       140   100   80   68   60   54.3  50
    5 (w_i^s = 20, w_i^p = 240)      260       140   100   80   68   60   54.3  50
    6 (S-task)                       7         7     7     7    7    7    7     7
    7 (S-task)                       7         7     7     7    7    7    7     7
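The values in Table 1 follow the serial-plus-parallel cost form implied above, w(i, n) = w_i^s + w_i^p/n (e.g. 20 + 240/1 = 260, 20 + 240/2 = 140). The profiling approach described in the text can be sketched as below; this is our illustrative Python under that assumed functional form, and the helper names are ours, not the paper's:

```python
def fit_cost_model(samples):
    """Least-squares fit of w(n) = ws + wp / n, which is linear in x = 1/n.

    samples: list of (n_procs, measured_cost) pairs from sample runs.
    Returns the estimated (ws, wp).
    """
    xs = [1.0 / n for n, _ in samples]
    ys = [c for _, c in samples]
    m = len(samples)
    mx, my = sum(xs) / m, sum(ys) / m
    wp = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))  # slope = parallelizable part
    ws = my - wp * mx                        # intercept = serial part
    return ws, wp

def cost(ws, wp, n):
    """Estimated execution cost of an M-task on n processors."""
    return ws + wp / n

# Profiled costs of an M-task like t2 in Table 1 (ws = 20, wp = 240):
samples = [(n, 20 + 240 / n) for n in range(1, 9)]
ws, wp = fit_cost_model(samples)
```

On the exact Table 1 values the fit recovers w_i^s = 20 and w_i^p = 240; on noisy profiled data it returns the least-squares estimate of the two parameters.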


    θ = max{L_cp, A_p}     (6)

    that needs to be minimized. Scheduling algorithms, as suggested from time to time, try to work out an optimal processor allocation so as to minimize θ (Eq. (6)) within these two conflicting constraints. The objective function of M-task scheduling is to allocate and assign processors to the nodes and to suggest a time order of execution on them, so that the overall execution time (ω) of the application, which corresponds to the maximum finish time of a task in the scheduled DAG, is minimized:

    ω = max_{t_i ∈ T} {F_i}     (7)
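Taking L_cp as the length of the longest computation-cost path through the DAG and A_p as the total processor-time area averaged over the p processors (A_p = (1/p) Σ_i n_i · w(i, n_i)), Eq. (6) can be illustrated on a toy graph; this is our own sketch (communication costs ignored), not the paper's example:

```python
def critical_path_length(succ, cost):
    """L_cp: longest path of node computation costs through the DAG.

    succ: dict node -> list of successor nodes; cost: dict node -> w(i, n_i).
    """
    memo = {}
    def bottom_level(v):
        if v not in memo:
            memo[v] = cost[v] + max((bottom_level(s) for s in succ[v]), default=0)
        return memo[v]
    return max(bottom_level(v) for v in cost)

def average_area(cost, alloc, p):
    """A_p = (1/p) * sum_i n_i * w(i, n_i)."""
    return sum(alloc[v] * cost[v] for v in cost) / p

# Toy fork-join DAG: s -> {a, b} -> t, one processor per task
succ = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
cost = {"s": 0, "a": 100, "b": 60, "t": 0}
alloc = {v: 1 for v in cost}
theta = max(critical_path_length(succ, cost), average_area(cost, alloc, p=4))  # Eq. (6)
```

Here L_cp = 100 dominates A_p = 40, so θ = 100: the critical path, not the total work, is the binding constraint and allocating more processors to its nodes is what pays off.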

    3. Related work and motivation

    Scheduling with mixed task and data parallelism is a relatively new trend in scheduling, with most of the work focused on specific task graph topologies like series-parallel, pipelined, divide and conquer, and tree [14,20,23]. In one of the earlier works [2], an approximation algorithm for scheduling parallelizable independent tasks was suggested with performance bound ω ≤ (2 − 1/p)·ω_opt. The work was later extended [1] to schedule dependent tasks using the two-step approach. Another approximation algorithm for scheduling independent M-tasks was developed in [26], which guarantees the makespan to be within twice the optimal. For the special case of tree structured precedence constrained M-tasks, there exists an approximation algorithm [13] with a performance guarantee of (3 + √5)/2. The communication between the M-tasks is, however, neglected, as the granularity in M-task scheduling is generally large. Among the more general approaches that deal with communication costs and arbitrary task graph precedence, a Two Step Allocation and Scheduling (TSAS) algorithm, based upon the two-step approach used in [1,2,26], was suggested in [18]. In the first step, it employs a Convex Programming Allocation Algorithm (CPAA), which minimizes θ (Eq. (6)) to obtain an optimal processor allocation (in the real number domain) for the data parallel tasks. In the second step, a Prioritized Scheduling Algorithm (PSA) is used to list-schedule the nodes on the allocated number of processors that become available at the earliest. However, the real number solution to the processor allocation problem forced the authors [18] to round off the numbers to a near integer value, taking the solutions away from optimality. The time complexity of the TSAS algorithm is O(v^2.5 + vp log p).

    The Critical Path Reduction (CPR) algorithm [16], in contrast, is a one-step greedy iterative algorithm that deals with an integral number of processors from the very beginning. It starts with a single processor allocation to every task (n_i = 1, ∀ 1 ≤ i ≤ v), and computes the makespan using the traditional list-scheduling approach that assigns the highest priority (top_level + bottom_level) ready node (node with satisfied precedence constraints) to the first available processors in the system. Next, it increments the processor allocation for the most crucial critical path node (the one that benefits the most from the processor increment) and computes the resulting improvement in the makespan generated. An increment is accepted if it improves the previously calculated makespan, else it is discarded. The CPR algorithm outperforms many other algorithms [15,18,20] due to its one-step approach, which offers better decision-making opportunity, since exact information about the makespan is at hand while doing processor allocations. However, recomputation of the makespan at every step makes the algorithm quite complex (O(ev²p + v³p log v + v³p² log p)).

    To save on the complexity front, the Critical Path and Area-based (CPA) algorithm [15] (pseudo code given in Fig. 3) employed the two-step approach of the TSAS algorithm. However, instead of using the complex convex programming approach, the greedy processor allocation heuristic of the CPR algorithm is used. In the processor allocation phase, initially all tasks are allocated one processor. After that, the critical path task returning the maximum benefit of data parallelism, as reflected by the higher gain parameter G in Fig. 3, is selected for higher processor allocation within the constraints imposed by the theoretical lower bound θ. The gain parameter G reflects the benefit, in terms of reduction in computational cost, that can be achieved by allocating an extra processor to a data parallel task. By comparing the gain parameters of two different M-tasks, the algorithm tends to find the most appropriate node that should be extended the benefit of more processor allocation. It is worth mentioning that, due to the serial computation cost (w_i^s) present in all data parallel tasks, allocation of additional processors does not return the same benefit each time; rather, the benefit keeps diminishing due to the convex speedup.
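The diminishing returns described above can be checked numerically with the Table 1 cost model (w_i^s = 20, w_i^p = 240) and the gain form G = w(i, n)/n − w(i, n + 1)/(n + 1) as reconstructed from Fig. 3; a small Python sketch:

```python
def w(n, ws=20.0, wp=240.0):
    """Table 1 cost model for the M-tasks t2..t5."""
    return ws + wp / n

def gain(n):
    """Reduction in per-processor cost from granting one more processor."""
    return w(n) / n - w(n + 1) / (n + 1)

gains = [gain(n) for n in range(1, 5)]
# The serial part ws makes each extra processor help less than the last,
# so the gains decrease monotonically (convex speedup).
```

For the first increment G = 260/1 − 140/2 = 190; by the fourth it has fallen below 10, which is exactly why CPA's unconstrained greedy increments yield ever smaller benefit.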


    In the second, task scheduling phase, the highest priority (bottom_level_cost) ready node is scheduled on its allocated number of processors that are first available. The complexity of the algorithm is O(v²p + evp), and performance is shown to be within 50% of the CPR algorithm for some real and synthetic application task graphs. The low-complexity feature of two-step algorithms is noteworthy.

    However, we feel that the first (processor allocation) phase in these algorithms works more or less in isolation from the second (task scheduling) phase. For example, in Steps 3 and 4, the CPA algorithm may allocate all the available processors to a given task for better returns, as it works keeping in view the local objective function, which is to exploit maximum data parallelism. It concentrates so much on optimizing data parallelism that the possibility of exploiting task parallelism in the second phase may get sidelined. As a result, the algorithm may lose its track, in spite of making a perfect start with an accurate processor allocation phase. We defend our point in the following section; this motivated us to modify the processor allocation phase of the CPA algorithm, an algorithm chosen because of its simpler non-convex-programming approach.

    3.1. Motivating example

    The processor allocation phase of the CPA algorithm needs some pondering for further improvements. It exploits the maximum possible data parallelism by incrementing processor allocations for the most critical M-tasks as long as L_cp > A_p (Step 2 to Step 5 in Fig. 3). However, in the process, it overlooks the feasibility of attaining this parallelism in the second phase. We elucidate our point through the example graph (Fig. 1), which corresponds to a typical matrix multiplication problem with the computation cost matrix [w(i, n_i)] (for p = 8) shown in Table 1. The nodes t1 and t8 signify the source and sink nodes; nodes t2, t3, t4, and t5 correspond to the matrix multiplication operations that can be done in a data parallel manner; t6 and t7 correspond to the matrix addition/subtraction, respectively, which is done serially on a single processor.

    CPA ()
    Input: Task Graph model (T), Processor Graph model (P)
    Output: Mapping and assignment of tasks on processors
    Begin
        Obtain processor allocation using Proc_alloc( );  // phase 1
        Schedule the tasks using Task_sch( );             // phase 2
    End
    -------------------------------------------------------------------------------------------------------------------------
    Proc_alloc( )
    Begin
        Step 1: for all t_i ∈ T do n_i ← 1; end forall
        Step 2: while (L_cp > A_p) do
        Step 3:     t_i ← CP task such that n_i < p and the gain parameter
                    G = w(i, n_i)/n_i − w(i, n_i + 1)/(n_i + 1) is maximized;
        Step 4:     n_i ← n_i + 1;
        Step 5:     Recompute top_level_cost and bottom_level_cost;
                endwhile
    End
    -----------------------------------------------------------------------------
    Task_sch( )
    Begin
        Sort tasks in decreasing order of bottom_level_cost priority;
        while (not all tasks are scheduled) do
            Schedule t_i on the first n_i processors becoming free;
        endwhile
    End

    Fig. 3. Pseudo code of CPA algorithm.


    We first take a look at the running trace of the processor allocation phase (Table 2) of the CPA algorithm. In the beginning, the critical path length L_cp and average computational area A_p are 276 and 131, respectively. Since all four M-tasks give the same value for the gain parameter G, the algorithm selects t2 at random and increments the processor allocation for it; this increases the average computational area to 133.5, whereas L_cp remains unchanged, as critical paths other than {t1, t2, t6, t8} are still present. The process continues, and all four M-tasks get allocated two processors each. The critical path length then reduces to 156 and the average computational area jumps to 141. The process repeats further, till all four M-tasks get allocated three processors each. Subsequently, no more processor allocation is permitted, as A_p (=151) exceeds L_cp (=116). The final processor allocation is shown in Table 3.

    It may be seen that the total number of processors allocated at prec_level = 1 (Fig. 1) is 12, which is four more than the available number of processors (p = 8) in the system. As a result, in the second phase of the algorithm (task_sch()), tasks t2 and t3 will consume the three processors each that are first available, and tasks t4 and t5, though they could start concurrently with these tasks (at the same prec_level), shall have to wait for these processors to become free, as shown in the schedule in Fig. 4 (ω = 216). The width of each task block in Fig. 4 represents the number of processors allocated to it, and the height represents the execution time. It shows that the CPA algorithm is unable to exploit enough task parallelism, even though it is available in the graph. Consequently, the benefits of exploiting more data parallelism (by allocating more processors to the tasks) in the first phase get nullified in the second phase due to the non-availability of processors. This stimulates the need for an algorithm that allocates processors keeping a close watch on the crucial task parallelism and the available resources, and still maintains the low complexity feature of two-phase algorithms.

    4. MCPA algorithm

    A two-step Modified Critical Path and Area-based algorithm, MCPA (pseudo code shown in Fig. 5), is developed for scheduling arbitrary task graphs composed of parallelizable M-tasks. It modifies the processor allocation phase of the CPA algorithm and tends to bridge the gap between the two phases for better schedule lengths, especially with increasing numbers of processors available in the system.

    4.1. Processor allocation phase

    This phase starts by allocating a single processor to every task in the DAG and marking them unvisited initially. As long as L_cp > A_p, the computation cost of the critical path nodes influences the makespan. Consequently, the algorithm tries to speed up the execution of these nodes by allocating a higher number of processors to them. However, unlike the CPA algorithm, a task t_i is considered suitable for allocating more processors if and only if it satisfies the following two conditions:

    Table 2
    Running trace of processor allocation phase of CPA algorithm

    Step  L_cp  A_p    n_i allocated (before)    Maximum gain       n_i allocated (after)
                       t1 t2 t3 t4 t5 t6 t7 t8   node selected (t_i) t1 t2 t3 t4 t5 t6 t7 t8
    1     276   131    1  1  1  1  1  1  1  1    t2                 1  2  1  1  1  1  1  1
    2     276   133.5  1  2  1  1  1  1  1  1    t3                 1  2  2  1  1  1  1  1
    3     276   136    1  2  2  1  1  1  1  1    t4                 1  2  2  2  1  1  1  1
    4     276   138.5  1  2  2  2  1  1  1  1    t5                 1  2  2  2  2  1  1  1
    5     156   141    1  2  2  2  2  1  1  1    t2                 1  3  2  2  2  1  1  1
    6     156   143.5  1  3  2  2  2  1  1  1    t3                 1  3  3  2  2  1  1  1
    7     156   146    1  3  3  2  2  1  1  1    t4                 1  3  3  3  2  1  1  1
    8     156   148.5  1  3  3  3  2  1  1  1    t5                 1  3  3  3  3  1  1  1
    9     116   151    Not allowed


    (1) It is a critical path task, i.e. b_i + s_i = L_cp.
    (2) The number of processors already allocated to the critical path tasks (if any) at its precedence level i is less than p, the number of processors available in the system.

    The second condition is required, as there can be more than one critical path node at the same prec_level, as was the case with the problem discussed above. Concurrent execution of these nodes becomes quite essential for the success of an algorithm. These nodes, thus, indicate the crucial task parallelism available in the task graph, which must be retained even at the cost of sacrificing some data parallelism. The CPA algorithm overlooks this observation and keeps on allocating processors only to gain much less (due to the diminishing returns with higher processor allocation). The MCPA algorithm, in contrast, works towards retaining this crucial task parallelism by keeping in sight the number of available processors and those that have already been allocated to the critical path tasks at the particular prec_level (Step 5 in Fig. 5). As the algorithm works with arbitrary precedence and a limited number of processors, our main concern is to provide a fair chance to all the critical tasks for using these limited resources.

    An optimal node, which returns the maximum benefit (as reflected by the gain parameter G) of data parallelism, is next searched for among the suitable nodes. The number of processors allocated to this optimal task is then incremented by one. The top and bottom level costs of the affected tasks are then recomputed, due to the change in the computation cost of this task. The process is repeated as long as (L_cp > A_p). The final processor allocation is shown in Table 3.

    It may be noted that the concept of prec_level (Step 5 in Fig. 5) comes into the picture only after it is found that all the tasks at a given prec_level are equally critical (Step 3 in Fig. 5); and since all the tasks at a given level can be executed in parallel, the algorithm gives all of them an equal opportunity to utilize the available resources. In case all tasks at a given prec_level are not critical (less regular graphs), the processors may not be shared equally; rather, preference shall be given to the other critical nodes. The running trace of this phase for the same problem (Fig. 1) is shown in Table 4. Processor allocation stops at Step 5, the moment all critical path nodes at prec_level (=1) get allocated two processors each (with a total sum of 8, the number of processors available in the system). This is in spite of the fact that there still exists a possibility of further data parallelism exploitation, since (L_cp > A_p). The final processor allocation using the MCPA algorithm is shown in Table 3.

    Table 3
    Processor allocation with p = 8 for Fig. 1

    Algorithm   t1  t2  t3  t4  t5  t6  t7  t8
    CPA          1   3   3   3   3   1   1   1
    MCPA         1   2   2   2   2   1   1   1

    Fig. 4. Schedule generated by CPA algorithm.


    4.2. Task scheduling phase

    In the second phase, tasks are scheduled using a list-scheduling approach, with priorities decided on the basis of bottom_level_cost, so as to have a fair comparison with the similar priority-based CPA algorithm. The highest priority ready node is assigned to its allocated number of processors (calculated in the first phase) that become free first. The start and finish times on the assigned processors are then calculated using the expressions given in Section 2. The schedule generated by the algorithm is shown in Fig. 6. It may be seen that the proposed heuristic succeeds in doing a more balanced processor allocation (maybe at the cost of giving up some

    MCPA ()
    Input: Task Graph (T), Processor Graph (P)
    Output: Mapping and assignment of tasks {T} on processors {P}
    Begin
        Obtain processor allocation using Proc_alloc( );  // phase 1
        Schedule the tasks using Task_sch( );             // phase 2
    End
    ---------------------------------------------------------------------------------------------------------------------------
    Proc_alloc( )
    Begin
        for (all tasks t_i ∈ T)
            Step 1: n_i ← 1;
            Step 2: Mark t_i as unvisited;
        endfor
        while (L_cp > A_p)
            Step 3: Get the set of critical nodes;
            Step 4: Mark all t_i as visited;
            Step 5: for all t_i: N_i ← number of processors allocated at prec_level i for the visited tasks;
            Step 6: t_opt ← optimal task t_i such that (N_i < p) and the gain
                    G = w(i, n_i)/n_i − w(i, n_i + 1)/(n_i + 1) is maximized;
            Step 7: n_opt ← n_opt + 1;
            Step 8: Mark all t_i except t_opt as unvisited;
            Step 9: Modify bottom_level_costs and top_level_costs of affected tasks;
        endwhile
    End
    ---------------------------------------------------------------------------------------------------------------------------
    Task_sch( )
    Begin
        Sort tasks in decreasing order of bottom_level_cost priority;
        while (not all tasks are scheduled) do
            Schedule t_i on the first n_i processors becoming free;
        endwhile
    End

    Fig. 5. Pseudo code for the MCPA algorithm.

    Table 4
    Running trace of processor allocation phase of MCPA algorithm

    Step  L_cp  A_p    Maximum gain  P_level  No. of critical  Processors already  Processor            n_i       n_i
                       node (t_i)    (i)      nodes at (i)     allocated at (i)    availability at (i)  (before)  (after)
    1     276   131    t2            1        4                4                   Yes                  1         2
    2     276   133.5  t3            1        4                5                   Yes                  1         2
    3     276   136    t4            1        4                6                   Yes                  1         2
    4     276   138.5  t5            1        4                7                   Yes                  1         2
    5     156   141    t2            1        4                8                   No                   Not allowed


    data parallelism) in the first phase, which allows all four tasks at prec_level = 1 to execute concurrently in the second phase, resulting in a schedule length (ω = 156) much better than that of the CPA algorithm (ω = 216).
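The allocation phase of Fig. 5 can be rendered as a short Python sketch. This is our own illustrative implementation, not the authors' code: communication costs are omitted from the level computations, and only a caller-supplied set of M-tasks is eligible for extra processors, both simplifications of the paper's model. On a Fig. 1-like graph it reproduces the MCPA row of Table 3 (two processors each for t2–t5):

```python
def mcpa_proc_alloc(succ, pred, level, cost, parallel, p):
    """Sketch of MCPA's processor-allocation phase (Fig. 5).

    succ/pred: DAG adjacency dicts; level: task -> prec_level;
    cost(t, k): execution cost of t on k processors; parallel: set of
    M-tasks eligible for extra processors; p: machine size.
    Communication costs are ignored here (our simplification).
    """
    n = {t: 1 for t in succ}
    while True:
        blev, tlev = {}, {}
        def b(t):   # bottom level: cost of t plus longest path below it
            if t not in blev:
                blev[t] = cost(t, n[t]) + max((b(s) for s in succ[t]), default=0)
            return blev[t]
        def tl(t):  # top level: longest path above t
            if t not in tlev:
                tlev[t] = max((tl(q) + cost(q, n[q]) for q in pred[t]), default=0)
            return tlev[t]
        lcp = max(b(t) for t in succ)
        ap = sum(n[t] * cost(t, n[t]) for t in succ) / p
        if lcp <= ap:                    # Step 2 guard: stop when area dominates
            break
        critical = [t for t in succ if tl(t) + b(t) == lcp]
        # Steps 5-6: keep the per-prec_level allocation within p processors
        suitable = [t for t in critical if t in parallel and n[t] < p
                    and sum(n[u] for u in critical if level[u] == level[t]) < p]
        if not suitable:
            break
        best = max(suitable, key=lambda t: cost(t, n[t]) / n[t]
                   - cost(t, n[t] + 1) / (n[t] + 1))   # gain parameter G
        n[best] += 1
    return n

# Fig. 1-like MDG: source t1, M-tasks t2..t5, S-tasks t6/t7, sink t8
succ = {"t1": ["t2", "t3", "t4", "t5"], "t2": ["t6"], "t3": ["t6"],
        "t4": ["t7"], "t5": ["t7"], "t6": ["t8"], "t7": ["t8"], "t8": []}
pred = {t: [q for q in succ if t in succ[q]] for t in succ}
level = {"t1": 0, "t2": 1, "t3": 1, "t4": 1, "t5": 1, "t6": 2, "t7": 2, "t8": 3}

def cost(t, k):
    if t in ("t2", "t3", "t4", "t5"):
        return 20 + 240 / k            # Table 1 M-task model
    return 7.0 if t in ("t6", "t7") else 0.0

alloc = mcpa_proc_alloc(succ, pred, level, cost,
                        parallel={"t2", "t3", "t4", "t5"}, p=8)
```

The loop stops as soon as the four critical nodes at prec_level = 1 jointly hold all 8 processors, mirroring the behaviour shown in Table 4 (Step 5: not allowed), even though L_cp > A_p still holds.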

    The complexity of the algorithm is calculated as follows: the main time consuming loop in the first phase is the while loop, which may get repeated at most v·p times. The various priority levels can be computed in O(v + e) in a depth-first manner (e being the number of edges in the task graph). Once the priorities are calculated, the critical path task set can be generated in O(v). The inner for loop (Step 5) can be repeated at most v times, with the complexity of the processor calculation at a given precedence level being O(W), where W indicates the width, or the maximum number of nodes at any prec_level, of the DAG. As a result, the worst case complexity of the processor allocation phase is O(vp(vW + e)). For the second phase, the priorities can be calculated in O(v + e) and the sorting may be done in O(v log v). Task scheduling in the while loop takes O(vp); thereby, the overall worst case complexity of the MCPA algorithm comes out to be O(vp(vW + e)), which is marginally higher than that of CPA, O(v²p + evp), owing to the presence of the task graph parameter W. The task graph structure plays a crucial role: for problems possessing higher task parallelism, the MCPA algorithm's improvement shall come at a somewhat higher cost; whereas, for densely connected graphs (with e ≈ v²), the cost shall be the same (O(pv³)) for both the CPA and MCPA algorithms. However, in comparison to the CPR algorithm (O(ev²p + v³p log v + v³p² log p)), the complexity of MCPA is lower by at least an order.

    5. Experimental set-up

    Performance of the MCPA algorithm is evaluated against a set of task graphs taken from [15,16]1 that were used for comparing various M-task scheduling algorithms. The task graph suite consists of two real world applications, i.e. Matrix Multiplication (Matmul) with matrix sizes taken as 32 × 32, 64 × 64, and 128 × 128, and Strassen Matrix Multiplication (Strassen) with matrix sizes taken as 32 × 32, 64 × 64, 128 × 128, and 512 × 512. The Strassen algorithm substitutes multiplications by additions, thereby reducing the number of multiplications to be computed and resulting in a complexity lower than the classical O(M³) for an (M × M) matrix. The computation and communication costs for these application graphs were borrowed from the authors of [15,16], who estimated them by running the applications on an actual cluster of workstations. In addition, a synthetic task graph suite, along with the computation and communication costs, used in the evaluation of the CPA algorithm was also taken from the same authors. It consists of elementary structures like butterfly, tree, and diamond (Fig. 7) and 10 randomly generated task graphs. The number of nodes in these synthetic graphs varies from 9 to 22, with Communication to Computation cost Ratio CCR ≤ 0.2 (CCR less than 1 reflects the coarse grained nature of task graphs, as is generally the case with mixed parallelism scheduling problems), and the serial fraction α (as referred to in Eq. (1)) = 0.2, with α = Σ_{i=0}^{v−1} w_i^s / Σ_{i=0}^{v−1} w(i, 1). The number of processors (p) in the system varies from 2 to 64.

    Performance metrics adopted are Normalized Schedule Length (NSL) and Speedup (the ratio of the makespans generated by the algorithm for a uniprocessor and a multiprocessor system). NSL measures schedule length

    Fig. 6. Schedule generated by MCPA algorithm.

    1 We are thankful to the authors for providing the implementation of their algorithms and task graphs.


    normalized with respect to CPR, which is one of the best available algorithms in terms of makespan. NSL greater/less than one directly reflects the degradation/improvement suffered by an algorithm with respect to the CPR algorithm. An algorithm generating higher Speedup and lower NSL (≤ 1), within reasonable time constraints, is much sought-after for the optimum utilization of resources. Algorithms chosen for comparison are TSAS [18], CPA [15] and CPR [16], as all of these algorithms have been designed for arbitrary task graphs, and further, the TSAS and CPA algorithms are based on a similar two-step approach. A comparison with respect to other algorithms such as TwoL [20], TASK (n_i = 1 ∀ t_i ∈ T), and DATA (n_i = p ∀ t_i ∈ T) has not been undertaken as, on average, the CPR algorithm has been shown to outperform them all by a considerable margin for the tested graph suites [16]. A comparison with the CPR algorithm thus provides an indirect comparison to them as well.
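The two metrics reduce to simple ratios; a trivial sketch (helper names ours), using the ω values of the Fig. 4 and Fig. 6 schedules as sample inputs:

```python
def nsl(makespan, makespan_cpr):
    """Normalized Schedule Length: < 1 means better than CPR, > 1 worse."""
    return makespan / makespan_cpr

def speedup(makespan_uniprocessor, makespan_multiprocessor):
    """Uniprocessor-to-multiprocessor makespan ratio."""
    return makespan_uniprocessor / makespan_multiprocessor

# CPA vs MCPA on the Fig. 1 example: omega = 216 vs 156
improvement = (216 - 156) / 216   # schedule length improvement, about 28%
```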

    A comparison with the CPA algorithm is rather obvious, as it quantifies the benefits of the modified processor allocation scheme on the overall performance, since the two algorithms (CPA and MCPA) differ only in their processor allocation phase. For the CPA and TSAS algorithms, we have used the results supplied by the author [15]. It is to be mentioned that, due to the non-availability of the original code (based on the convex programming approach), the TSAS results were generated by the GENOCOP non-linear solver [11]. The CPR algorithm was implemented at our own end, as results were not available for the complete graph suite. The task and processor graph parameters are fed as inputs to the scheduling algorithm, and the allocation and assignment of tasks on different processors is available as the output.

    6. Performance comparison

    6.1. Matmul application

    Simulated performance results for the Matmul applications are presented in Figs. 8 and 9. These results are very much in favor of the MCPA algorithm for different matrix sizes and numbers of processors in the system. For the Matmul application (32 × 32, 64 × 64 and 128 × 128), in comparison to the parent CPA algorithm, the schedule length improvement of the MCPA algorithm ((ω_CPA − ω_MCPA)/ω_CPA) on an average (over different network sizes) is 30%, 29%, and 20%, respectively. Further, the schedule length degradation suffered by MCPA with respect to the CPR algorithm is zero, as the algorithm succeeds in generating matching schedule lengths for all the network and matrix sizes.

    Diamond Tree Butterfly

    Matmul

    Strassen

    Fig. 7. Some of the benchmark MDGs used for performance comparison.

    S. Bansal et al. / Parallel Computing 32 (2006) 759774 769


    Performance of the MCPA algorithm improves with the increasing number of processors available in the system, and the gains are somewhat marginal for smaller numbers of processors. For example, for the coarse-grain Matmul (128 × 128) application, the performance improvement of MCPA over the CPA algorithm is rather small for p = 4; further, for p = 2 the performance of the CPA and MCPA algorithms is almost comparable. This may be explained as follows.

    With the number of processors limited to just two, the possibility of exploiting data or even task parallelism at any precedence level gets too restricted to appreciate the spirit behind the suggested algorithm. For example, in the Matmul graph (same as in Fig. 1), all four macro nodes (at p_level = 1) are equally critical and hence the MCPA algorithm allocates them an equal share of the available processors (i.e. one each). This forces the algorithm to behave just like a pure task-parallel (TASK) algorithm, unable to exploit any data parallelism, which is essential for the success of a scheduling algorithm in a mixed-parallel environment, especially for coarse-grain graphs. As a result, only marginal or no improvements are observed for coarse-grained graphs with a small number of available processors in the system.
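    The equal-share behavior described above can be sketched as follows. This is an illustrative reconstruction under the stated assumptions, not the authors' implementation; the function name and tie-breaking rule are hypothetical.

```python
def allocate_equally_critical(num_tasks, p):
    """Divide p processors among num_tasks equally critical tasks at one
    precedence level.  Each task is allocated at least one processor; with
    p < num_tasks the tasks end up sharing processors (serialized) in the
    later assignment phase, i.e. pure task-parallel (TASK) behavior."""
    if num_tasks == 0:
        return []
    base, extra = divmod(p, num_tasks)
    # Leftover processors (p mod num_tasks) go to the first `extra` tasks.
    return [max(1, base + (1 if i < extra else 0)) for i in range(num_tasks)]

# Four equally critical macro nodes, as in the Matmul graph at p_level = 1:
print(allocate_equally_critical(4, 2))   # [1, 1, 1, 1]: no data parallelism left
print(allocate_equally_critical(4, 16))  # [4, 4, 4, 4]: 4-way data parallelism each
```

    With p = 2 the allocation degenerates to one processor per node, which is exactly the pure task-parallel behavior discussed above.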

    In Fig. 9, the performance comparison between the CPA and MCPA algorithms is given in terms of speedup, which quantifies the improvement achieved by the modified processor allocation scheme suggested in this work. Performance improvements are consistent beyond p = 4, as the potential of the heuristic gets fully exploited with a higher number of processors in the system. For Matmul (32 × 32, 64 × 64, and 128 × 128), the average speedup improvements (over all network sizes) of the proposed algorithm are about 46%, 44% and 29%, respectively, which is the same as that of the CPR algorithm.

    6.2. Strassen application

    For the Strassen application, performance is shown in Figs. 10 and 11. In comparison to the CPA algorithm, for Strassen (32 × 32, 64 × 64, 128 × 128, and 512 × 512), the average schedule length generated by the MCPA algorithm is better by 12%, 16%, 12%, and 14%, respectively, while the maximum average makespan degradation suffered with respect to the CPR algorithm, for different matrix sizes, is within 4%.

    Fig. 8. Relative performance of MCPA algorithm for Matmul application (NSL vs. number of processors, p = 2–64, for 32 × 32, 64 × 64, and 128 × 128; CPR, MCPA, CPA, and TSAS compared).

    Fig. 9. Speedup comparison for Matmul (speedup vs. number of processors, p = 2–64, for 32 × 32, 64 × 64, and 128 × 128; MCPA vs. CPA).


    The corresponding average speedup improvements over the CPA algorithm (Fig. 11) are 11%, 22%, 15%, and 28%, respectively, and the maximum average degradation with respect to the CPR algorithm, for different matrix sizes, is within 2%. It may be seen that the MCPA algorithm is able to provide these results with much lower time complexity than the CPR algorithm.

    Fig. 10. Relative performance of MCPA algorithm for Strassen application (NSL vs. number of processors, p = 2–64, for 32 × 32, 64 × 64, 128 × 128, and 512 × 512; CPR, MCPA, CPA, and TSAS compared).

    Fig. 11. Speedup comparison for Strassen (speedup vs. number of processors, p = 2–64, for 32 × 32, 64 × 64, 128 × 128, and 512 × 512; MCPA vs. CPA).


    6.3. Synthetic and random task graphs

    In Fig. 12, average performance results for the synthetically generated benchmark application graphs are presented. These results strongly support our previous observation that the MCPA algorithm is better able to allocate processors beyond a certain network size. In Table 5, we have summed up the average makespan and speedup degradations (over all network sizes) suffered by the various algorithms, for different graph types, in comparison to the CPR algorithm. For the average results (average makespan, say), the schedule length generated by an algorithm for a particular network size is divided by the schedule length of the CPR algorithm, and the average of these normalized schedule lengths over all network sizes is reported. For the random graphs, the schedule lengths are first averaged over all 10 graphs for a particular network size and then normalized with respect to the corresponding CPR average makespan; the average of these results over all network sizes is then computed and reported. Performance improvement beyond p = 8 is evident in almost all task graphs.
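    The averaging procedure described above can be sketched as follows; the makespan values below are illustrative placeholders, not the paper's measurements.

```python
def avg_nsl(makespans_by_p, cpr_by_p):
    """Average normalized schedule length (NSL) over network sizes: each
    makespan is divided by the CPR makespan for the same processor count p,
    then the normalized values are averaged."""
    nsls = [makespans_by_p[p] / cpr_by_p[p] for p in cpr_by_p]
    return sum(nsls) / len(nsls)

def avg_nsl_random(graphs_by_p, cpr_graphs_by_p):
    """Random-graph variant: makespans are averaged over the graph set for
    each p before normalizing against the corresponding CPR average."""
    nsls = []
    for p in cpr_graphs_by_p:
        mean = sum(graphs_by_p[p]) / len(graphs_by_p[p])
        cpr_mean = sum(cpr_graphs_by_p[p]) / len(cpr_graphs_by_p[p])
        nsls.append(mean / cpr_mean)
    return sum(nsls) / len(nsls)

# Hypothetical makespans indexed by processor count.
alg = {8: 120.0, 16: 110.0}
cpr = {8: 100.0, 16: 100.0}
print(avg_nsl(alg, cpr))  # about 1.15, i.e. roughly 15% average degradation
```

    An average NSL above 1 thus corresponds directly to the percentage degradations reported in Table 5.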

    For the random task graphs, the average NSL degradation suffered by the TSAS, CPA and MCPA algorithms is about 22%, 17% and 6%, respectively. The maximum makespan degradation suffered by MCPA for the tested graph suites is about 8% and the minimum 0%; for the parent CPA algorithm the corresponding figures are 46% and 11%. The TSAS algorithm also does not benefit much from its optimal processor allocation, which has to be rounded off to a near integer value in later steps. The maximum and minimum performance degradation, in terms of makespan, suffered by the TSAS algorithm is 80% and 8%, respectively. The corresponding maximum and minimum speedup degradations for the TSAS, CPA, and MCPA algorithms are 40% and 6%, 30% and 9%, and 6% and 0%, respectively.

    The performance of the MCPA algorithm is thus quite significant, with minimal degradations. All these observations can be summarized as follows.

    For p = 2, all algorithms generate comparable schedules, as the number of processors is too small to exploit task or even data parallelism effectively. Further, in the worst case, when the task parallelism in the task graph is higher than p and tasks are equally critical at a given prec_level, the MCPA algorithm allocates resources equally among the nodes and thus behaves like a pure S-task scheduling algorithm, unable to exploit data parallelism. That is why the performance of the MCPA algorithm deteriorates for p = 4, especially for coarse-grain

    Fig. 12. Relative performance of MCPA algorithm for synthetic benchmark graphs (NSL vs. number of processors, p = 2–64, for Butterfly, Diamond, Tree, and Random graphs; CPR, MCPA, CPA, and TSAS compared).



    graphs (128 × 128 and 512 × 512). However, beyond p = 8, the other algorithms get trapped into exploiting more data parallelism even when the schedule-length improvement is quite marginal, thereby increasing the average computational area and hence the overall makespan. The MCPA algorithm, in contrast, may get trapped only after taking into consideration the most effective critical task parallelism in the graph. Therefore, beyond p = 8, the MCPA algorithm tends to improve upon the other algorithms, as reflected in Fig. 12 and Table 5.

    7. Conclusions

    A low-complexity two-step M-task scheduling algorithm is proposed for arbitrary task graphs, based on the Modified Critical Path and Area-based processor allocation heuristic, which can also be employed in the first phase of existing two-step M-task scheduling algorithms. In the available two-step algorithms, the processor allocation phase works more or less in isolation, without gauging the effect of its decisions on the second phase. As a result, these algorithms at times get trapped in exploiting too much data parallelism, overlooking the feasibility of exploiting crucial task parallelism in the later steps. The MCPA algorithm, in contrast, succeeds in bridging this gap to an appreciable extent. The strength of the algorithm lies in its judicious allocation of processors, which restricts the exploitation of data parallelism the moment it starts interfering with the crucial task parallelism available in the graph, thereby achieving a more balanced processor allocation, especially in a resource-constrained set-up. It preserves the essential task parallelism, sometimes at the cost of data parallelism, when that is crucial for overall performance, and still retains the low-complexity features of a multi-step algorithm. The performance of the suggested algorithm under the stated assumptions is presented for real and synthetic application graphs, which indicates the MCPA algorithm to be a potential candidate for scheduling arbitrary task graphs exhibiting mixed parallelism.

    References

    [1] K.P. Belkhale, P. Banerjee, A scheduling algorithm for parallelizable dependent tasks, in: Proc. of Int. Parallel Processing Symposium, Anaheim, CA, April 1991, pp. 500–506.

    [2] K.P. Belkhale, P. Banerjee, An approximation algorithm for the partitionable independent task scheduling problem, in: Proc. of 19th Int. Conference on Parallel Processing, St. Charles, IL, August 1990, pp. 72–75.

    [3] S. Chakrabarti, J. Demmel, K. Yelick, Modeling the benefits of mixed data and task parallelism, in: Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, CA, July 1995, pp. 74–83.

    [4] S. Chakrabarti, J. Demmel, K. Yelick, Models and scheduling algorithms for mixed data and task parallel programs, Journal of Parallel and Distributed Computing 47 (9) (1997) 168–184.

    [5] I.T. Foster, K.M. Chandy, Fortran M: a language for modular parallel programming, Journal of Parallel and Distributed Computing 26 (1) (1995) 24–35.

    [6] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Co., 1979.

    Table 5
    Average makespan and speedup degradation over all network sizes w.r.t. the CPR algorithm

                              Avg. NSL degradation           Avg. speedup degradation
    Graph type                TSAS (%)  CPA (%)  MCPA (%)    TSAS (%)  CPA (%)  MCPA (%)
    Matmul (32 × 32)            29        46        0          21        30        0
    Matmul (64 × 64)            10        44        0           6        29        0
    Matmul (128 × 128)          25        29        0          16        20        0
    Strassen (32 × 32)           8        20        4           7        16        2
    Strassen (64 × 64)          15        23        2          11        18        2
    Strassen (128 × 128)        80        18        2          40        14        2
    Strassen (512 × 512)         –        31        2           –        22        2
    Butterfly                   23        11        5          17         9        9
    Diamond                     26        28        7          19        20        6
    Tree                        36        25        8          25        18        6
    Random                      22        17        6          18        12        5

    A minus (−) sign indicates an improvement instead of degradation; TSAS entries for Strassen (512 × 512) are not available.



    [7] T. Gross, D. O'Hallaron, J. Subhlok, Task parallelism in a high performance Fortran framework, IEEE Parallel & Distributed Technology 2 (3) (1994) 6–26.

    [8] M. Gupta, P. Banerjee, Compile time estimation of communication costs on multicomputers, in: Proc. of 6th Int. Parallel Processing Symposium, Beverley Hills, CA, March 1992, pp. 470–475.

    [9] M. Gupta, Automatic data partitioning on distributed memory multicomputers, PhD thesis, Department of Computer Science, University of Illinois, Urbana, IL, September 1992.

    [10] S. Ben Hassen, H.E. Bal, C.J. Jacobs, A task and data parallel programming language based on shared objects, ACM Transactions on Programming Languages and Systems 20 (6) (1998) 1131–1170.

    [11] S. Koziel, Z. Michalewicz, Evolutionary algorithms, homomorphous mappings and constrained parameter optimization, Evolutionary Computation 7 (1) (1999) 19–44.

    [12] Y.K. Kwok, I. Ahmad, Benchmarking and comparison of the task graph scheduling algorithms, Journal of Parallel and Distributed Computing 59 (3) (1999) 381–422.

    [13] R. Lepère, D. Trystram, G.J. Woeginger, Approximation algorithms for scheduling malleable tasks under precedence constraints, International Journal of Foundations of Computer Science 13 (4) (2002) 613–627.

    [14] G. Prasanna, A. Agarwal, B.R. Musicus, Hierarchical compilation of macro dataflow graphs for multiprocessors with local memory, IEEE Transactions on Parallel and Distributed Systems 5 (7) (1994) 720–736.

    [15] A. Radulescu, A.J.C. van Gemund, A low cost approach towards mixed task and data parallel scheduling, in: Proc. of the 15th Int. Conf. on Parallel Processing, Valencia, Spain, September 2001, pp. 69–76.

    [16] A. Radulescu, C. Nicolescu, A.J.C. van Gemund, P.P. Jonker, CPR: mixed task and data parallel scheduling for distributed systems, in: Proc. of the 15th Int. Parallel and Distributed Processing Symposium (IPDPS), San Francisco, April 2001, pp. 39–47.

    [17] S. Ramaswamy, P. Banerjee, Processor allocation and scheduling of macro dataflow graphs on distributed memory multicomputers by the PARADIGM compiler, in: Proc. of the 22nd Int. Conf. on Parallel Processing, St. Charles, IL, August 1993, pp. 134–138.

    [18] S. Ramaswamy, S. Sapatnekar, P. Banerjee, A framework for exploiting task and data parallelism on distributed memory multicomputers, IEEE Transactions on Parallel and Distributed Systems 8 (11) (1997) 1098–1115.

    [19] S. Ramaswamy, Simultaneous exploitation of task and data parallelism in regular scientific applications, PhD thesis, University of Illinois at Urbana-Champaign, 1996.

    [20] T. Rauber, G. Rünger, Compiler support for task scheduling in hierarchical execution models, Journal of Systems Architecture 45 (6–7) (1999) 483–503.

    [21] H. El-Rewini, T.G. Lewis, H.H. Ali, Task Scheduling in Parallel and Distributed Systems, Prentice Hall, NJ, 1994.

    [22] J. Subhlok, B. Yang, A new model for integrated nested task and data parallel programming, in: Proc. of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, June 1997, pp. 1–12.

    [23] J. Subhlok, G. Vondran, Optimal use of mixed task and data parallelism for pipelined computations, Journal of Parallel and Distributed Computing 60 (3) (2000) 297–319.

    [24] J. Subhlok, J. Stichnoth, D. O'Hallaron, T. Gross, Exploiting task and data parallelism on a multicomputer, in: Proc. of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May 1993, pp. 13–22.

    [25] F. Suter, F. Desprez, H. Casanova, From heterogeneous task scheduling to heterogeneous mixed parallel scheduling, in: Proc. Euro-Par 2004, Lecture Notes in Computer Science 3149 (2004) 230–237.

    [26] J. Turek, J.L. Wolf, P.S. Yu, Approximate algorithms for scheduling parallelizable tasks, in: Proc. of 5th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1992, pp. 323–332.
