queueing network models for parallel processing with

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 11, NOVEMBER 1982

[8] P. Denning, "Working sets past and present," IEEE Trans. SoftwareEng., vol. SE-6, Jan. 1980.

[9] M. Dubois and F. A. Briggs, "Efficient interprocessor communicationfor MIMD multiprocessor systems," in Proc. 8th Int. Symp. Comput.Architecture, May 1981.

[10] K. R. Kaplan and R. 0. Winder, "Cache-based computer systems,"IEEE Computer, vol. 6, Mar. 1973.

[11] H. T. Kung, "The structure of parallel algorithms," in Advances inComputers, vol. 19, M. C. Youits, Ed. New York: Academic, 1980,pp.65-112.

[12] J. H. Saltzer, "A simple linear model of demand paging performance,"Commun. Ass. Comput. Mach., vol. 17, Apr. 1974.

[13] M. Satyanarayanan, "Commercial multiprocessing systems," IEEEComputer, May 1980.

[14] J. R. Spirn, Program Behavior: Models and Measurements. New York:Elsevier, 1977.

[15] W. D. Strecker, "Cache memories for PDP-I I family computers," inProc. 3rd Annu. Symp. Comput. Architecture, Jan. 1976.

[16] C. K. Tang, "Cache system design in the tightly coupled multiprocessorsystem," in Proc. AFIPS, 1976.

[17] L. C. Widdoes, "The S-I project: Development of high performancedigital computers," in Dig. Paper, COMPCON '80, IEEE Comput. Soc.,San Francisco, CA, Feb. 1980.

[18] C. Yeh, "Shared cache organization for multiple-stream computersystems," Univ. Illinois, Urbana-Champaign, Tech. Rep. R-904, CSL,Jan. 1981.

[19] W. C. Yen and K. S. Fu, "Memory organization and synchronizationmechanism of multiprocessing computer systems," School Elec. Eng.,Purdue Univ., West Lafayette, IN, Tech. Rep., TR-EE 81-34, Oct.1981.

Michel Dubois (S'79-M'81) was born in Charler-oi, Belgium, in 1953. He received the IngenieurCivil Electricien degree from the Facult6 Poly-technique de Mons, Belgium, the M.S. degreefrom the University of Minnesota, Minneapolis, in1978, and the Ph.D. degree from Purdue Universi-ty, West Lafayette, IN, in 1982, all in electricalengineering.

During his studies in the U.S., he was funded bygrants from the Facult6 Polytechnique de Monsand by research and teaching positions. He is now

with the Central Research Laboratory, Thomson-CSF, Domaine de Corbe-ville, France, where he designs architectures for very large scale integration.His main interests are in computer architecture and digital image pro-cessing.

Faye A. Briggs (M'77), for a photograph and biography, see p. 982 of theOctober 1982 issue of this TRANSACTIONS.

Queueing Network Models for Parallel Processingwith Asynchronous Tasks

PHILIP HEIDELBERGER, MEMBER, IEEE, AND KISHOR S. TRIVEDI

Abstract-Computer performance models of parallel processingsystems in which a job subdivides into two or more tasks at some pointduring its execution are considered. Except for queueing effects, thetasks execute independently of one another and do not require syn-chronization. An approximate solution method is developed and resultsof the approximation are compared to those of simulations. Boundson the performance improvement due to overlap are derived.

Index Terms-Approximate solution, computer systems modeling,multiprogramming, multitasking, parallel processing, performanceevaluation, queueing network models.

I. INTRODUCTION

T HIS PAPER considers computer performance modelsTof certain types of parallel processing systems. In par-

ticular, we consider models in which a job subdivides into two

Manuscript received December 18, 1981; revised June 21, 1982.P. Heidelberger is with the IBM Thomas J. Watson Research Center,

Yorktown Heights, NY 10598.K. S. Trivedi was on sabbatical at the IBM Thomas J. Watson Research

Center Yorktown Heights, NY 10598. He is with the Department of Com-puter Science, Duke University, Durham NC 27706.

or more tasks at some point during its execution. Except forqueueing effects, the tasks execute independently of one an-other and do not require synchronization. The performanceof the system is analyzed using queueing network models.However, because of the parallelsim, the models do not havean analytically tractable solution. An approximate solutionmethod is developed and results of the approximation arecompared to those of simulations. The approximation is foundto be quite accurate unless the system under consideration ishighly imbalanced.Two different applications of the model are described. The

first application is a terminal-oriented system. In this systemthe model may be used to represent transactions that are splitinto two tasks. The terminal user must wait for completion ofthe first task before issuing the next transaction. However, thesecond task can execute concurrently with the first task andwith the user's think time. Furthermore, the terminal user doesnot wait for the second task to complete before issuing the nexttransaction. An alternative interpretation of this model is thata user is allowed to submit a second processing request while

0018-9340/82/1100-1099$00.75 © 1982 IEEE

1099


the first is being processed. Most systems do allow users to

submit multiple requests in this way.

The second application of our model is to CPU-I/O overlapin a batch-oriented multiprogramming system. In this modela job can issue I/O requests which are served either synchro-nously or asynchronously. If the I/O request is asynchronous,then the job can continue CPU processing concurrently withexecution of the I/O. Furthermore, the job never waits for anasynchronous I/O request to complete. I/O requests whichwrite sequentially into write-only files-are examples of I/O'swhich could be done asynchronously.A general model which allows jobs to split into two or more

synchronous tasks, all of which must complete execution beforethe job may resume processing, is described in Heidelbergerand Trivedi [8]. Towsley, Chandy, and Browne [20] havestudied models in which CPU and I/O (or I/O and I/O) ac-

tivity can be overlapped; however, their models require tightsynchronization between the two concurrent tasks in the sense

that both the CPU and I/O (or the two I/O) tasks must

complete before processing can continue. Furthermore, a jobcan have, at most, two outstanding I/O requests at any one

time. Towsley et al. [20] conclude that the performance gaindue to this type of overlap is greatest for balanced systems andrelatively low levels of multiprogramming. Price [ 14] considersperformance models of multiple I/O buffering schemes. InPrice's model there are a fixed number of buffers per file (orper user). A limiting case (as the number of buffers increasesto infinity) of his model is obtained as a special case of our

model by setting the parameters appropriately (f = 1, wherefis defined in Section III). Price shows that performance im-provements become insignificant as the size of the buffer poolincreases beyond 3 or 4 buffers and his results closely agree

with ours. Results obtained by our model also show that theperformance improvement due to overlap is maximized forbalanced systems and low multiprogramming levels. As more

of the work is offloaded to auxilliary processors, however, theperformance gains due to parallel processing will remain sig-nificant and the modeling technique described in this paper

is applicable to such distributed systems. Other studies ofmultitasking include Browne, Chandy, Hogarth, and Lee [3]and Sauer and Chandy [ 16]. Both of these studies model ov-

erlapped CPU processing in multiprocessor systems by ad-justing queue-dependent CPU processing rates.

In Section II the general model, of overlap is described indetail. Section III contains a description of the terminal andbatch models as well as reporting on the accuracy of the ap-

proximation. Section IV derives bounds on the performancegain that can be achieved by overlap and contains plots ofperformance improvements predicted by the model. SectionV summarizes the results and lists areas for further re-

search.

II. MODEL DESCRIPTION

We assume that the system under consideration consists ofm active resources such as processors and channels. The

workload consists of a set of statistically identical jobs whereeach job consists of a primary task (labeled 1) and zero or more

statistically identical secondary tasks (labeled 2). The secon-dary tasks are spawned by the primary task sometime duringits execution and execute concurrently with it, competing forsystem resources. A secondary task is otherwise assumed torun independently of the primary task; in particular, we do notaccount for any synchronization between tasks.We assume that there is a specially designated node, node

0. A secondary task is spawned whenever a primary task entersnode 0. The service time at this node is 0. For i = 1, 2, .. ,ViI denotes the average number of vists to node i per visit tonode 0 by a primary task and let Vi2 denote the average num-ber of visits to node i by a secondary task. We assume thatVi2 < o and that a secondary task leaves the network aftercompleting execution. Similarly, we assume that V1l <co. LetSij denote the average service time requirement of task typej per visit to node i. Each type of task does not require holdingmore than one resource at a time. Concurrency within a jobis allowed only through multitasking while several independentjobs are allowed to execute concurrently and share systemresources. The system is assumed to contain a fixed number,n, of primary tasks at all times.

If the scheduling discipline at a node is FCFS (first-come-first-served), then we require that all tasks have an exponentialservice distribution at that node with a common mean. If thescheduling discipline at a node is PS (processor sharing) orLCFS-PR (last-come-first-served, preemptive resume), or thenode is an IS (infinite server) node, an arbitrary differentiableservice time distribution is allowed and each task can have adistinct service time distribution.

Because of concurrency within a job, the queueing networkmodel of the system under consideration does not belong to theclass of product-form networks (see Baskett, Chandy, Muntz,and Palacios [1] and Chandy, Howard, and Towsley [5]). Wedescribe an iterative technique for solving a sequence ofproduct-form queueing networks so that upon convergence,the solution to the product-form network closely approximatesthe solution to the system under investigation. The assumptionson the service time distributions and the queueing disciplinesguarantee that each network in the sequence has a product-form solution and is thus computationally tractable.

The queueing network model of the approximate system hastwo chains, one closed and the other open. The closed chainmodels the behavior of primary tasks and hence the populationof the closed chain is equal to the number of primary tasks, n.The open chain models the behavior of secondary tasks andhence we set the arrival rate of the open chain equal to theprimary task throughput, at node 0, of the closed chain. Noticethat the approximation assumes that the arrival process ofprimary tasks to node 0 is a Poisson process which is inde-pendent of the state of the network. However, in general, thisprocess is neither Poisson nor independent of the state of thenetwork. -Because of these two points, the method is an ap-proximation and not an exact method of solution. Since thethroughput of the closed chain is itself a nonlinear function ofthe arrival rate of the open chain (recall that the two chainsshare system resources), a closed form solution is not available.A simple algorithm, e.g., regulafalsi (see Hamming [7]), maybe used to iteratively solve this nonlinear equation. We applied

1100

HEIDELBERGER AND TRIVEDI: QUEUEING NETWORK MODELS

CASE 1CONTENTION AT BOTTLENECK

z

OPEN CHAIN THROUGHPUT

_IL.

CASE 3NO CONTENTION AT BOTTLENECK

7 LITTLE CONTENTION AT OTHER DEVICES

CASE 2NO CONTENTION AT BOTTLENECK

MODERATE CONTENTION AT OTHER DEVICES

OPEN CHAIN TRUHU

U U.5 1.U 1.5 2.U 2.5OPEN CHAIN THROUGHPUT

Fig. 1. Three types of behavior of the approximation method. The maximum open-chain throughput is 2.5.

this method to the models described in Section III and foundit to converge very rapidly. On the average, only five iterationswere required to achieve a maximum absolute difference of0.001 between the throughputs of the open and closed chains,and the average model required less than one second ofCPUtime on an IBM 3033.

Let X2 and ,2, respectively, denote the arrival rate and thethroughput (or departure rate) of the open chain. Note thatif any of the queues are saturated, then the throughput /2 willbe less than or equal to the given arrival rate X2. Let XI denotethe throughput of the closed chain at node 0. We require thatA2 = XI and for the stability of the network, we must haveL2 = X2. It is clear therefore that we need to solve the followingnonlinear equation:

X0(A2) = X2 (1)

where XI (X2) is the throughput of primary tasks (at node 0)when the secondary task throughput is X2. Equation (1) is a

nonlinear function of the single variable X2 and for any fixedvalue of X2, X1(X2), may be evaluated by solving a product-form queueing network with one open and one closed chain.Let XA and X* denote the solution to (1), i.e., X* = X*= X1(X*).The approximation uses X* as the secondary task arrival rate.At this point, the rate at which secondary tasks arrive (or are

spawned) equals the throughput of primary tasks at node 0.Any algorithm for solving nonlinear functions of a single

variable may be used to solve (1) for the appropriate arrivalrate X2. The properties of this equation are discussed next.Since the two chains share system resources, an increase in thevalue of X2 cannot imply an increase in X1; we conclude thatXI is a monotone nonincreasing function of X2. Let X2 denote

the maximum throughput of the open chain. The condition ofstability is

X2 <X2. (2)Then the condition for the existence of a stable solution, i.e.,a solution in which no queue is saturated, for the nonlinearequation (1) is given by

liMr X(X2)<X2.X2tA2

(3)

By the monotonicity property it follows that if a stable solutionexists it will be unique. If this condition is not satisfied, thenthe primary tasks can generate secondary tasks at a rate whichexceeds the system capacity.The maximum possible throughput of the open chain is

determined by the node which presents a bottleneck (seeKleinrock [10, p. 218]). The index of this node is

I = argmaxlViV2S12 (4)

where argmax denotes the index of the largest element in a set.Hence,

A2 =V12SI2

(5)

Depending on network parameters, three different types ofbehavior can be identified (see Fig. 1). The first case resultswhen the node labeled I is utilized by the primary task, i.e.,

VIISII > 0, (6)

since in this case when the queue length of node I growswithout bound due to an excessive open-chain arrival rate, the

z

Cu

-j. I Cl 11 1-1 11 .

1 101


3(D

4(D

5(D(a)

O-.-0---'Po I ssonSoursce

40

(b)Fig. 2. (a) The central server model without overlapped 1/0. (b) The

central server model with overlapped 1/0.

closed-chain throughput will approach zero. If condition (6)is not satisfied, but some of the other network nodes are sharedby the two chains, then either case two or case three resultsdepending upon the degree of sharing. Since case two doesyield a unique solution to (1), we conclude that (6) is a suffi-cient but not necessary condition for convergence while con-

dition (3) is both necessary and sufficient. However, the ad-vantage of condition (6) over condition (3) is that the formeris testable prior to the solution of the network while the latteris not.The approximation may be generalized to include cases in

which more than one type of secondary task may be spawned.Suppose that whenever a primary task passes through node 0

a secondary task of type k (k = 2, * , K) is spawned withprobability Pk. The secondary tasks may be modeled by K - 1

Kopen chains, each with arrival rate Xk. Let X = Xk denote

k=2the total arrival rate of secondary tasks. The individual arrivalrates are constrained so that Xk = PkX for k > 2. The total

throughput of secondary tasks is set equal to the throughputof the primary tasks at node 0. This defines a nonlinear equa-tion of the single variable X, similar to (1), which must besolved.

III. ACCURACY OF THE APPROXIMATION

In this section we investigate the accuracy of the approxi-mation by comparing performance measures obtained by theapproximation to those obtained by simulation. Computerperformance models of two general types were considered:

central server models (see Buzen [4]) and models of terminaloriented timesharing computer systems, (see, e.g., Sauer andChandy [17, p. 1 6]). The accuracy of the approximation wasstudied for device utilizations, throughputs, mean queuelengths, and, when appropriate, mean response times. Thecomparison was performed for a wide range of input parametersettings for these two general model types. In all, the com-parison was performed for over 400 individual models. Theaverage relative error over all models and all performancemeasures was 2.5 percent, and 90 percent of the relative errorswere less than or equal to 6.1 percent.The standard central server model, without overlap, is pic-

tured in Fig. 2(a). Server number 1 represents a CPU andservers 2, 3, 4, and 5 represent I/O devices such as disks. In themodel with overlap, there are a fixed number, n, of primarytasks. We assume that a primary task has a random CPUprocessing service time with mean Sp and that whenever aprimary task finishes this processing a secondary task isspawned with probabilityf. If the secondary task is spawned,the primary task returns to the CPU for continued processing(with mean Sp) and the secondary task begins processing atthe CPU for a random period of time with mean SI2. Uponleaving the CPU the secondary task moves to I/O device i withprobability Pi2. The secondary task then returns to the CPUwith probability P2 or exits the system with probability 1 - P2.If the primary task completes CPU processing and does notspawn a secondary task (with probability 1 -J), it begins anoverhead period of CPU processing with mean So. Uponcompletion of this overhead processing the primary task movesto I/O device i with probability Pi, and then returns to theCPU. We assume that the CPU has the PS discipline and thatall I/O devices are FCFS and that the service times at I/Odevice i are exponentially distributed with mean Si.The model has the interpretation that the primary task can

issue I/O requests which are handled by the secondary taskand continue processing without waiting for those I/O requeststo complete. IfSo = S1 2, then using this interpretation So =S12 is the average time to initiate the I/O request. In thenotation of the previous section, S1 1 = Sp + (1 -f)So. Eachsecondary task has an average of 1/(1 - P2) I/O's in it andthus the fraction of all I/O's which are overlapped with theprimary task is

fol = ~~~~~~~~(7)0o( f) + f/(I - P2)Notice that if P2 = 0, thenfol =f The benefit of overlappingthe I/O can be studied by comparing this model to one inwhichf = 0 and there is no overlap.The approximation for the central server model with overlap

is pictured in Fig. 2(b). We insert an extra node, node 0, in themodel. The service time at node 0 is 0. When the primary taskleaves the CPU it is routed to node 0 with probabilityf and toI/O device i with probability (1 - f)p,i. The throughput ofprimary tasks at node 0 is the rate at which secondary tasksare spawned. The secondary tasks are modeled by an openchain with all arrivals routed to the CPU. The Poisson arrivalrate, X2, of the open chain is set so that it equals the throughputof the closed chain of primary tasks at node 0. The routing in

1102


TABLE IPARAMETERS FOR CENTRAL SERVER MODELS

Model S2=S3=54 55 P21=P22 Ps1=P52 P2Set P31 =P32

P41 =P42

1 0.04 0.04 0.25 0.25 0.00

II 0.04 0.04 0.25 0.25 0.50

111 0.04 0.04 0.30 0.10 0.00

IV 0.04 0.10 0.30 0.10 0.00

V 0.04 0.20 0.30 0.10 0.00

VI 0.04 0.40 0.30 0.10 0.00

the open chain of secondary tasks is as described earlier; asecondary task leaves the CPU and enters I/O device i withprobability Pi2 and then returns to the CPU with probabilityP2 or exits the system with probability 1 - P2. Notice that, forany fixed arrival rate X2, this approximate model of overlapsatisfies the conditions for product form and thus has a com-putationally tractable solution. The stability conditions for thismixed open and closed network are that X2Si2Vi2 < 1 for alliwhere V12= 1/( -P2) andV2 =pi/(l -p2)fori> 1.Ifthe X2 for which X2 = X1 does not satisfy these conditions, thenthe primary tasks are able to generate more secondary tasksthan the system can handle and the mean queue length ofsecondary tasks at at least one device will be infinite.We considered six sets of central server models with a total

of 60 models in each set. Each set contained cases with onlymoderate utilizations at the devices as well as heavily CPUand/or I/O bound cases. Each set contained a range of valuesof multiprogramming levels, n, CPU service time, Sp, andoverlap factors, f. In particular, each model within a set wasparameterized by the triplet (n, Sp, f) where n = 1, 3, 5,Sp = 0.002, 0.01, 0.02, 0.10, 0.20, andf = 0.1, 0.25, 0.50, 0.75.The sets differed in the branching probabilities, Pj and P2, andin the I/O service times. In all models the overhead processingtime So = S12 = 0.0008. The pattern of I/O accesses rangedfrom an even distribution to a highly skewed distribution andthe mean service times of the least active I/O device (i.e., thedevice i with the smallest values of pij) varied in an intervalabout the mean service times of the other I/O devices. TableI lists the input parameters for each set of 60 central servermodels.The second class of models tested were those of terminal-

oriented timesharing systems. This model is pictured in Fig.3(a). It is essentially the same as the central server model ex-cept that it has an extra node, node 6, which is an infinite serverqueue representing a finite number of terminals. The meanservice time at this node, S61, is the mean think time. Thereare a fixed number of terminals, n, submitting primary tasksto the computer system. A primary task uses the CPU and thenenters I/O device i with probability Pi,. Upon completion ofan I/O the primary task returns to the CPU with probabilityP I or enters the terminals for the think state with probability1 - p . We assume that a secondary task is spawned sometimeduring the execution of a primary task. The secondary taskexecutes independently of the primary task (except forqueueing effects) and in particular can execute during the

0

6

2(D

3(D

(a)

(b)Fig. 3. (a) Terminal model without overlapped tasks. (b) Terminal

model with overlapped tasks.

think time of its corresponding primary task. The routing andservice times of the secondary task are the same as in thecentral server model. The approximate model of this overlapsituation is shown in Fig. 3(b). Node 0, with 0 service time, isinserted in the model between the I/O devices and the termi-nals. The throughputs at the terminals and at node 0 are thusequal. The secondary tasks are modeled by an open chain witharrivals entering the system at the CPU. The arrival rate of thischain, X2, is set equal to the throughput of the primary tasksat node 0. The stability conditions of this model are, as before,that X2S12/(l - P2) < 1 and X2Si2/( - P2) < 1 for alli>i.The interpretation of this model is that each terminal in-

teraction can be split into two tasks. The terminal user mustwait for the completion of the primary task, but does not waitfor the completion of the secondary task. Certain operatingsystem overhead associated with processing a transaction maybe overlaped and thus modeled this way. We considered 47models of this general type. The parameter settings for thesemodels are described in Table II.

Each of the models was simulated using the IBM ResearchQueueing package RESQ (see Sauer and MacNair [18] andSauer, MacNair, and Salza [19]). The overlap of jobs wasmodeled using "split nodes," a RESQ node type that splits a

1103

/


TABLE I1PARAMETERS FOR TERMINAL MODELS

S61 SI1 =S12 Si Pij Pl=P2 n

10 0.010 0.04 0.25 0.10 1,5,10,20,30,40,50,60,70,80,90,100

10 0.015 0.04 0.25 0.10 1,5,10,20,30,40,50,60,70,80,90,100

15 0.002 0.04 0.25 0.10 25,50,100

15 0.010 0.04 0.25 0.10 25,50,100

15 0.020 0.04 0.25 0.10 25,50,100

15 0.020 0.08 0.25 0.02 1,5,10,20,30,40,50

15 0.030 0.08 0.25 0.02 1,5,10,20,30,40,50

TABLE IIICOMPARISON OF SIMULATION AND ANALYTIC APPROXIMATIONFOR CENTRAL SERVER MODELS WITH BALANCED I/O SUBSYSTEM

Models CPUThrough-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CPU Disk CPU Disk Through-Ut. Ut. Q.L. Q.L. put

Set Absolute ErrorMean 0.003 0.006 0.011 0.044 0.056

0.9 Percentile 0.011 0.024 0.031 0.137 0.233Maximum 0.026 0.071 0.110 0.457 0.710

Relative ErrorMean 1.2 1.5 2.8 3.2 1.3


Set Absolute ErrorII Mean 0.003 0.005 0.009 0.072 0.034




TABLE IVCOMPARISON OF SIMULATION AND ANALYTIC APPROXIMATION

FOR CENTRAL SERVER MODELS WITH IMBALANCED I/OSUBSYSTEM

Models CPU Disk Disk, CPU Disk Disk Through-Ut. 2 Ut. 3-5 Ut. Q.L. 2 Q.L. 3-5 Q.L. put

Set Absolute ErrorI[1 Mean 0.003 0.007 0.003 0.010 0.060 0.004 0.050

0.9 Percentile 0.010 0.024 0.007 0.023 0.205 0.009 0.159Maximum (0.0)25 0.076 0.025 0.101 0.799 0.065 0.631


09 Percentile 3.7 3.9 5.1 13.6 10.6Maximum 14.7 14.4 13.9 27.7 18.3

2.0 1.34.4 3.6

13.1 14.3

Set Absolute ErrorIV Mean 0.003 0.007 0.007 0.010 0.044 0.030 0.055

0.9 Percentile 0.011 0.027 0.021 0.033 0.135 0.097 0.222Maximum 0.029 0.078 0.069 0.095 0.555 0.323 0.636

Relative ErrorMean 1.4 1.6 2.1 2.9 3.1 4.2 1.4


Set Absolute ErrorV Mean 0.003 0.005 0.008 0.009 0.043 0.394 0.039




Set Absolute ErrorVt Mean 0.002 0.002 0.007 0.008 0.041 0.592 0.019



0.9 Percentile 3.0 2.9 3.4 16.1 25.3 39.2 2.7Maximum 7 2 7.2 8.1 29.8 47.9 89.0 7.1

TABLE VCOMPARISON OF SIMULATION AND ANALYTIC APPROXIMATION

FOR TERMINAL MODELS WITH BALANCED I/O SUBSYSTEM

Models

job into two tasks which then proceed independently throughthe network. Confidence intervals for each estimate wereformed using the regenerative method, e.g., Iglehart [9]. Thelength of each simulation was determined by the sequentialstopping rule described in Lavenberg and Sauer [ 12]. Usingthis rule, each model was simulated until a 90 percent confi-dence interval for the mean queue length at each device (andmean response time for the terminal models) had a relativehalf-width of 1 percent or until a maximum CPU time (5 and10 min of IBM 3033 time for each central server and terminalmodel, respectively) had been reached. If the CPU limit wasreached before the 1 percent desired accuracy was achieved,confidence intervals with a relative half-width of generally lessthan 10 percent were obtained. However, for some parametersettings, no regenerative cycles were observed during the courseof the entire simulation in which case no confidence intervalswere produced. This does not necessarily imply that the esti-mates obtained from these simulations were highly inaccurate,but merely that the mean time to return to the fixed regener-ative state was extremely long. We felt that the CPU-limits of5 and 10 min were sufficient to obtain estimates of reasonableaccuracy for comparison with the analytic approximation.

Tables III, IV, and V report the results of the comparison.Both absolute errors (absolute value of simulation estimate

CPU Disk CPU Disk Term. Through- ResponseUt. Ut. Q.L. Q.L. Q.L. put Time

Term- Absolute Errorinal Mean 0.002 0.002 0.429 0.056 0.115 0.010 0.228

Models 0.9 Percentile 0.004 0.004 1.285 0.201 0.302 0.026 0.783Maximum 0.010 0.008 2.261 0.421 0.674 0.036 1.141



0.7 1.71.7 4.12.2 5.4

minus approximation value) and percent relative errors (100percent times absolute value of simulation estimate minusapproximation value divided by simulation estimate) are re-ported. For each set of models, the mean, 90th percentile, andmaximum absolute and relative errors are listed for utiliza-tions, queue lengths, throughputs, and response times (for theterminal models). If the utilizations or queue lengths for dif-ferent devices were known to be identical due to symmetriesin the model, the simulation estimates for these quantities wereaveraged prior to performing the comparison; this reducesstatistical fluctuations.The approximation is particularly accurate for estimating

utilizations and throughputs. For these quantities the meanrelative error is 1.3 percent, 90 percent of the errors are lessthan 3.2 percent, and the maximum relative error is 17.4percent. Estimates of mean queue lengths are somewhat less

1104

Models


accurate (the approximation tends to overestimate mean queuelengths); the mean relative error is 4.2 percent, and 90 percentof the errors are less than 12.0 percent. However, the maxi-mum relative error is as large as 89 percent. These large errorsalways occur in very imbalanced systems, such as central servermodel sets V and VI, with extreme values of the overlap factor(f,ol = 0.5, 0.75). For the better balanced models (central servermodel sets I, II, III, IV, and the terminal models) the ap-proximation is quite accurate; for these cases the mean relativeerror for all performance measures is 1.9 percent, 90 percentof the relative errors are less than 4.0 percent, and the maxi-mum relative error is 29.2 percent.The errors in the terminal models are particularly low. The

parameters for these models were chosen so that all disks wereequally utilized. Furthermore, Lavenberg [ 11 ] has shown thatan IS source in a closed product-form network with largepopulation sizes behaves like a Poisson source. In this case, the -

flow of jobs returning to the terminals in the unoverlappedmodel is approximately Poisson since it corresponds to an exitpoint of a Jackson network, e.g., Beutler and Melamed [2].Thus, in the overlapped model, the times at which secondarytasks are spawned define a point process which should be ap-proximately Poisson. Since the approximation assumes thatthe arrival stream of overlapped tasks is a Poisson process, theapproximation should be quite accurate in this case.

IV. PERFORMANCE IMPROVEMENT

We now consider the performance gain by multitasking.First we consider the benefit of overlapping I/O with com-putation in the central server model. The maximum possiblethroughput is obtained when the bottlenecked server is fullyutilized. Recall that the bottleneck is presented by the serverwith the highest relative utilization. For convenience, we willassume that P2 = 0. Further, since performance comparisonis made with the case with no overlap-the model of Fig. 2(a)in which node 0 is nonexistent-it is more convenient tomeasure throughput at the output ofCPU and reinitialize allvisit counts. In other words, we now define a job that is com-prised of a CPU burst and one I/O service (on exactly one ofthe m -1 I/O devices). Thus,

V l = V12 = 1. (8)

Note that per visit to the CPU made by the primary task, theprobability of spawning a secondary task isf. Then the averageCPU time required for the primary task is SI I = (Sp + (1 -f)So) while that required by the secondary task is So. Thenthe total CPU time per job completion is

B1 = V11SIl +fYV2S12= (Sp + SO). (9)

Similarly, the average time needed per job completion onserver i, i > 1 is

Bi = (1 -f)VilSil +JVyi2S2 = ((1 -J)Pil +fPi2)Si. (10)Given fixed values of Bi's, the maximum possible job through-put is given by

Tmax 1/max 3BA.(1

For comparison purposes, the network of Fig. 2(a) shouldhave branching probabilities adjusted in such a way that Bi's,are kept the same as in the case with overlap. This can be doneby setting the branching probabilities, qi, in the network of Fig.2(a) as

qi = ((1 -f)Pil +fPi2). (12)

The minimum possible throughput is achieved when jobs areprocessed sequentially as follows:

1Tmi.

m

E Bii=l

(13)

This latter case occurs when there is no sharing of resourceseither among distinct jobs or by parts of the same job. Mul-tiprogramming increases throughput by promoting sharingof the first type and overlapped I/O increases throughput bypromoting sharing of the second type. In any case, maximumpossible throughput improvement factor is given by

TmaxTmin

(14)

The advantage of overlapped job execution over multiprog-ramming is the possible saving in main store size since twotasks working on the same job will likely require less main storespace than two independent job running concurrently. Also,note that neither technique will accrue much benefit in casethe bottleneck device, defined by I = argmax 1Bi} presents asevere bottleneck; that is, BI >> Bi for all i # I, since thenTmin _ Tmax. Thus overlapping I/O with computation canprovide an appreciable performance benefit if the utilizationsof the active resources are nearly balanced and the MPL israther low. Note that (14) represents the combined gain fromoverlapped I/O and multiprogramming. Furthermore, ex-pression (14) above does not take into account the limitationin gain imposed by a small value off.

In order to obtain a tighter upper bound on the gain due tooverlapped I/O, we let

Tun(n) = nmE Bi + Quni=l

(15)

denote the throughput at MPL n with all I/O activity unov-erlapped (see Reiser and Lavenberg [ 15] ). The quantity Qundenotes the queueing delay due to other jobs competing for thesame resources. Since the network without any overlapped I/Ois a product-form network, (15) is easily computable.

Similarly, T01(n), the throughput at MPL = n with over-lapping satisfies the inequality

T01(n) < n(16)

mS11 + Z (1 -J)Sipil + Qo,

i=2

where Q0i denotes the queueing delay due to other primarytasks when all secondary tasks have been removed from thenetwork. Note that this bound is easily computable since theresulting network has a product-form solution. Another and

1105

(I 1)

IEEE TRANSACTIONS ON COMPUTERS, VOL. c-31, NO. 11, NOVEMBER 1982

EQUAL BRANCHING PROBABILITIES, EQUAL I/O SERVICE TIMESo N=1 + N=3 x N=5

C.j OVERLAP FACTOR= 0.1 OVERLAP FACTOR= 0.25

I I

oO0o

I _

Z @'_Z _z z

m ,o

0a. _. :

0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0

MILLIONS OF INSTRUCTIONS/START 10 MILUONS OF INSTRUCTIONS/START 10

CN OVERLAP FACTOR= 0.5 4 - OVERLAP FACTOR= 0.75

I I

z z*..,, S

LiL

x~~ ~ ~ ~ ~ ~ ~ ~

0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0

MILLIONS OF INSTRUCTIONS/START 10 MILLIONS OF INSTRUCTIONS/START 10

Fig. 4. Throughput improvement due to overlapped 1/0 in central server model with balanced 1/0 subsystem (parameter set 1).

UNEQUAL BRANCHING PROBABILITIES, S(4)=0.4o N=1 + N=3 x N-5

OVERLAP FACTOR= 0.1 C, OVERLAP FACTOR= 0.25

0

Z I

0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0MILLIONS OF INSTRUCTIONS/START iO MILLIONS OF INSTRUCTIONS/START IO

C - OVERLAP FACTOR= 0.5 N-OVERLA FACTOR= 0.75

a--

0 0 I>

00. .0.

0-

0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0

MILLIONS OF INSTRUCTIONS/START 10 MILLIONS OF INSTRUCTIONS/START 10

Fig. 5. Throughput improvement due to overlapped 1/0 in central server model with imbalanced 1/0 subsystem (parameter set VI).

1106


COMPARISON OF TERMINAL MODELS WITHOVERLAPPED AND UNOVERLAPPED TRANSACTIONS

NUMBER OF TERMINALS

o OVERAPPED TRANSACTIONS* LNOVERLAPPED TRANSACTIONS

/F5

~5

20 40 60

NUMBER OF TERMINALS80 100

o OVERLAPPED TRANSACTIONS* UNOVERLAPPED TRANSACTIONS

20 40 60 100 20 40 60

NUMBER OF TERMINALS NUMBER OF TERMINALS

Fig. 6. Performance improvement due to overlapped tasks in terminal model.

even simpler upper bound on the throughput with overlappingis obtained when one of the servers is fully utilized as fol-lows:

TO/(n) < = Tmax.max$Bi,

Thus, the gain due to overlapping at MPl n is

G(n) = T(n)

T.n(n)m, Bi + Qun

< i=l

max jn/Tmax, (S11 + (1 -) E PiiSi + QolI ~~~~~i=2

thermore, g(n, B*) is maximized by setting all Bi = B* (fori = 1, 2, , m). In this case, Tu,(n) = n/lB*(n + m-1)j, and

the upper bound on G(n) is (n + m 1)/n. Combining the twobounds together, we have

(17)G(n) < min

I,n + m--1 (19)

For the interesting special case of balanced utilization of allI/O devices, i.e., Bi = B for i > 2, a closed-form expression forthe upper bound (18) can be given as follows:

(18)

By using the homogeneity (see Price [13]) and the mono-tonicity property (see Trivedi and Sigmon [21]) of thethroughput function of a closed queueing network, we can showthat if none of the devices is fully utilized, then (18) is boundedabove by 1/(1-).Next assume that some device approaches 100 percent

utilization. Then the denominator of (18) equals n/TTmax =

n X max IBiI. Consider T01(n) as a function of IB' and fix thevalues of lB,, i > 2}. Set B* = max IBi, i > 21 and let g(n, B1 )= Tmax/Tun(n, B1). For B1 < B*, g(n, B1) = II(B*Tun(n,B1)) is monotonically increasing since throughput is a mono-tonically decreasing function of B1. For B1 > B*, g(n, BI) =1/(BI T"n(n, B1)) can be shown to be monotonically decreasingusing [6, eq. 3.1.5.]. Thus g(n, B1) is maximized at B*. Fur-

G(n) <

n-i n

E aixi E bjyii=O j=O

n n-I(1-) E bixi E ajy

i=O j=O

(20)

where

Sp + (1 -J)SoA(

(I -f)BSp + So

Y=

B

=n - k+rm- 3ak=- ,

and

n - k+m - 2)

In this inequality, inequality (16) is used to bound T01(n). Inthis special case, both Tun(n) and the right-hand side of in-

Un

il

If

1107

0

-;r

I

m

°~r


equality (16) may be evaluated explicitly using the expressionfor the normalizing constant, C(n), of a single-chain prod-uct-form queueing network with n jobs and the fact that thethroughput for such a network is proportional to C(n -1)/C(n).

In our example, with m = 5, bound (19) gives improvementfactors 1.1 1, 1.33, 2.0, and 4.0, respectively, for n = 1 andf =0.10, 0.25, 0.5, and 0.75. The corresponding bounds for n =3 are 1.11, 1.33, 2.0, and 2.33, while those for n = 5 are 1.11,1.33, 1.8, and 1.8.

Figs. 4 and 5 show the gain in throughput as a function ofthe mean CPU processing time, Sp, for data sets I and VI,respectively. In these figures the units for the x axis are millionsof instructions between I/O's, and a CPU processing rate of5 million instructions per second is assumed. Each of these

,figures is drawn in four parts:f = 0.1, 0.25, 0.5, and 0.75. Eachpart of the figure shows the effect of the multiprogramminglevel as a parameter. Notice that the improvement factor isgreatest for the balanced set of models (set I in Fig. 4) with lowMPL.

In the terminal-oriented system, the bound (18) is modifiedas follows:

G(n) = T.1 (n)Tu,(n)

mS61 + E Bi + Qun

< =1 * (21)

max jn/Tmax (S61 + Sll + E VilSil + Qoi)

where

Tmax = min, n } (22)Bi. m

S61 + E VilSili=1

and Bi = Vi SiI + Vi2Si2 for all i. Notice that if the think time,S61, is the dominant term in (21), then the gain in throughputwill be close to one. However, if n is large, even a small gainin throughput results in a substantial reduction in responsetime. On the other hand, if S61 is close to zero, then we obtainthe bound in (19) if for all i,f = V,2S12/B1. A bound similarto (20) is obtained if Bi = B for all i. Fig. 6 gives the responsetime, CPU utilization, and the ratios of throughputs, and re-sponse times with and without multitasking. The parametersfor this figure are listed in the first row of Table II.

VI. SUMMARY

Performance models of parallel processing in which jobsdivide into two or more asynchronous tasks have been devel-oped. Because of the parallelism, the resulting queueing net-work does not have a product-form solution. An approximatesolution method is described which iterates through a sequenceof product-form networks. The condition for convergence ofthis method are given. The accuracy of the approximation isstudied through extensive comparisons with simulations. The

approximation is found to be very accurate for systems whichare not highly imbalanced. Bounds are developed on the gainin performance due to overlapped tasks and the performancegains predicted by the model are given for several systems.

Future topics for research include investigating the exis-tence, uniqueness, and accuracy of approximate solutions tomodels which have heterogeneous job types, each of which mayspawn one or more asynchronous tasks.

REFERENCES

[11 F. Baskett, K. M. Chandy, R. R. Muntz, and F. G. Palacios, "Open,closed, and mixed networks of queues with different classes of custom-ers," J. Ass. Comput. Mach., vol. 22, pp. 248-260, 1975.

[21 F. J. Beutler and B. Melamed, "Decomposition and customer streamsof feedback networks of queues in equilibrium," Operations Res., vol.26, pp. 1059-1072, 1978.

[31 J. C. Browne, K. M. Chandy, J. Hogarth, and C. C.-A. Lee, "The effecton throughput of multiprocessing in a multiprogramming environment,"IEEE Trans. Comput., vol. C-22, pp. 728-735, 1973.

[4] J. P. Buzen, "Queueing network models of multiprogramming," Ph.D.dissertation, Div. Eng. Appl. Sci., Harvard Univ., Cambridge, MA,1971.

[5] K. M. Chandy, J. H. Howard Jr., and D. F. Towsley, "Product form andlocal balance in queueing networks," J. Ass. Comput. Mach., vol. 24,pp. 250-263, 1977.

161 K. D. Gordon and L. W. Dowdy, "The impact of certain-parameter es-timation errors in queueing network models," in Proc. Performance '80,1980, pp. 3-9; see also Performance Eval. Rev., vol. 9.

[7] R. W. Hamming, Numerical Methods for Scientists and Engineers,2nd ed. New York: McGraw-Hill, 1973.

[8] P. Heidelberger and K. S. Trivedi, "Analytic queueing models for pro-grams with internal concurrency," Yorktown Heights, NY, IBM ResRep. RC 9194, 1982; also IEEE Trans. Comput., to be published.

[9] D. L. Iglehart, "The regenerative method for simulation analysis," inCurrent Trends in Programming Methodology, Vol. III: SoftwareEngineering, K. M. Chandy and R. T. Yeh, Eds. Englewood Cliffs,NJ: Prentice-Hall, 1978.

[101 L. Kleinrock, QueueingSystems, Volume II ComputerApplications.New York: Wiley, 1976.

[11] S. S. Lavenberg, "Closed multichain product form queueing networkswith large population sizes," Yorktown Heights, NY, IBM Res. Rep.RC 8496, 1980.

[12] S. S. Lavenberg and C. H. Sauer, "Sequential stopping rules for theregenerative method of simulation," IBM J. Res. Develop., vol. 21, pp.545-558, 1977.

[13] T. G. Price, "Probability models of multiprogrammed computer sys-tems," Ph.D. dissertation, Dep. Elec. Eng., Stanford Univ., Palo Alto,CA, 1974.

[14] -, "Models of multiprogrammed computer systems with 1/O buf-fering," in Proc. 4th Texas Conf. Comput. Syst., Univ. Texas, Austin,1975.

[15] M. Reiser and S. S. Lavenberg, "Mean-value analysis of closed multi-chain queueing networks," J. Ass. Comput. Mach., vol. 27, pp. 313-322,1980.

[16] C. H. Sauer and K. M. Chandy, "The impact of distributions and dis-ciplines on multiple processor systems," Commun. Ass. Comput. Mach.,vol. 22, pp. 25-34, 1979.

[17] , Computer Systems Performance Modeling. Englewood Cliffs,NJ: Prentice-Hall, 1981.

[18] C. H. Sauer and E. A. MacNair, "Queueing network software for sys-tems modelling," in Software-Practice and Experience, vol. 9, pp.369-380, 1979.

[19] C. H. Sauer, E. A. MacNair, and S. Salza, "A language for extendedqueueing network models," IBM J. Res. Develop., vol. 24, pp. 747-755,1980.

[20] D. Towsley, K. M. Chandy, and J. C. Browne, "Models for parallelprocessing within programs: Applications to CPU:l/O and 1/0:1/0overlap," Commun. Ass. Comput. Mach., vol. 21, pp. 821-831,1978.

[211 K. S. Trivedi and T. M. Sigmon, "Optimal design of linear storagehierarchies," J. Ass. Comput. Mach., vol. 28, pp. 270-288, 1981.

1108


Philip Heidelberger (M'82) received the B.A. de-gree in mathematics from Oberlin College, Ober-lin, OH, in 1974 and the Ph.D. degree in opera-tions research from Stanford University, Stan-ford, CA, in 1978.He has been a Research Staff member at the

IBM Thomas J. Watson Research Center, York-town Heights, NY, since 1978. His current re-search interests include computer performancemodeling and statistical analysis of simulationoutput.

Dr. Heidelberger is a member of the Operations Research Society ofAmerica and the Association for Computing Machinery.

Kishor S. Trivedi received the B.Tech. degree fromthe Indian Institute of Technology and the M.S.and Ph.D. degrees in computer science from theUniversity of Illinois, Urbana.

Presently, he is an Associate Professor of Com-puter Science and Electrical Engineering at DukeUniversity, Durham, NC. He is the author of thebook Probability and Statistics with Reliability,

4 Queueing, and Computer Science Applications,published by Prentice-Hall. He has served as aPrincipal Investigator on various NSF- and

NASA-funded projects and as a consultant to industry and research labora-tories. He has published in the areas of computer arithmetic, computer ar-chitecture, memory management, and performance evaluation.

Dr. Trivedi is an ACM National Lecturer and has been a DistinguishedVisitor of the IEEE Computer Society.

Correspondence.

Pin Limitations and Partitioning of VLSI InterconnectionNetworks

MARK A. FRANKLIN, DONALD F. WANN, ANDWILLIAM J. THOMAS

Abstract-Multiple processor interconnection networks can be charac-terized as having N' inputs and N' outputs, each being B' bits wide. A majorimplementation constraint of large networks in the VLSI environment is thenumber of pins available on a chip, N,. Construction of large networks requirespartitioning of the N' * N' * B' network into a collection of N * N switchmodules with each input and output port being B (B S B') bits wide. If eachmodule corresponds to a single chip, then a large network can be implementedby interconnecting the chips in a particular manner. This correspondencepresents a methodology for selecting the optimum values ofN and B given valuesof N', B', Np, and the number of control lines per port. Models for both banyanand crossbar networks are developed and arrangements yielding minimum: I)number of chips, 2) average delay through the network, and 3) product ofnumberof chips and delay, are presented.

Index Terms-Banyan, crossbar, interconnection networks, pin limitations,multiprocessors, synchronization.

1. INTRODUCTIONRecently a variety of physically local, closely coupled multiple

processor computer systems have been proposed and, in some cases,built [II- [4]. One key issue in the design of such systems concernsthe communications network used by these multiprocessor systems.Various studies have focused on the functional properties of suchnetworks (i.e., permutations, control algorithms), on their complexity,and on performance issues [5]-[9]. Network complexity has oftenbeen measured by the number of elementary switching componentsneeded, while performance has been determined by the averagenumber of components through which a message must pass (i.e.,

Manuscript received January 11, 1982; revised June 17, 1982. This workwas supported in part by NSF under Grant MCS-78-20731 and ONR underContract N00014-80-0761.The authors are with the Department of Electrical Engineering, Washington

University, St. Louis, MO 63130.

average delay). Complexity and performance questions have beenexamined in the context of VLSI implementation of such intercon-nection networks. Franklin [10] has compared crossbar and banyannetworks operating in circuit-switched mode in terms of their space(area) and time (delay) requirements. Wise [II] presents a three-dimensional VLSI layout arrangement for banyan networks.Thompson [12] and Padmanabhan [13] derive lower bounds on thearea and time complexity for a number of networks.

Closer examination of VLSI network implementation problemshowever show that pin limitations, rather than chip area or logicalcomponent limitations, are a major constraint in designing very largenetworks. Consider a network with N' inputs, M' outputs, and witheach output being B' bits wide (N' * M' * B'). The number of requiredpin connections (ignoring power, ground, and general control) for asingle chip implementation is given by B'(N' + M'). For a squarenetwork of size thirty-two, with B' = 16, the number of pins requiredwould thus be 1024. This is much larger than is commonly availableon commercial integrated circuit carriers and is near the limits ofadvanced ceramic modules where the entire area of one module sur-face is used for pin placement.

This correspondence focuses on two important problems encoun-tered in the design of interconnecting networks. First we examine howpartitioning the network can be used to overcome the pin limitationconstraint. We develop relations for optimum partition configurationsas a function of the major network parameters including total numberof integrated circuit chips and average message delay through thenetwork. Secondly, we identify an unusual problem, called word in-consistency, that is created when local control of the partitionednetwork (which may be highly desirable for its ease of modular ex-pansion/contraction) is used, and propose a control structure andprotocol that overcomes this problem.One approach to partitioning is to implement a large network (N'

* N') requiring many pins as an interconnected set of smaller sub-networks (N * N) where each smaller subnetwork can be containedon a single chip whose packaging design meets the pin constraints.

Another approach is to slice the network so that one creates a setof network planes with each plane handling one or more bits (e.g., Bbits) of the B' wide data path. It is this bit slicing procedure whichcan lead to synchronization problems. Consider the situation wheremessage routing through the network is via local control logic present

0018-9340/82/1100-1109$00.75 © 1982 IEEE

1109

queueing network models for parallel processing with

Documents