Stride Scheduling: Deterministic Proportional-Share Resource Management

Carl A. Waldspurger*

William E. Weihl*

Technical Memorandum MIT/LCS/TM-528

MIT Laboratory for Computer Science

Cambridge, MA 02139

June 22, 1995

Abstract

This paper presents stride scheduling, a deterministic scheduling technique that efficiently supports the same flexible resource management abstractions introduced by lottery scheduling. Compared to lottery scheduling, stride scheduling achieves significantly improved accuracy over relative throughput rates, with significantly lower response time variability. Stride scheduling implements proportional-share control over processor time and other resources by cross-applying elements of rate-based flow control algorithms designed for networks. We introduce new techniques to support dynamic changes and higher-level resource management abstractions. We also introduce a novel hierarchical stride scheduling algorithm that achieves better throughput accuracy and lower response time variability than prior schemes. Stride scheduling is evaluated using both simulations and prototypes implemented for the Linux kernel.

Keywords: dynamic scheduling, proportional-share resource allocation, rate-based service, service rate objectives

1 Introduction

Schedulers for multithreaded systems must multiplex scarce resources in order to service requests of varying importance. Accurate control over relative computation rates is required to achieve service rate objectives for users and applications. Such control is desirable across a broad spectrum of systems, including databases, media-based applications, and networks. Motivating examples include control over frame rates for competing video viewers, query rates for concurrent clients by databases and Web servers, and the consumption of shared resources by long-running computations.

*E-mail: {carl, weihl}@lcs.mit.edu. World Wide Web: http://www.psg.lcs.mit.edu/. Prof. Weihl is currently supported by DEC while on sabbatical at DEC SRC. This research was also supported by ARPA under contract N00014-94-1-0985, by grants from AT&T and IBM, and by an equipment grant from DEC. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government.

Few general-purpose approaches have been proposed to support flexible, responsive control over service rates. We recently introduced lottery scheduling, a randomized resource allocation mechanism that provides efficient, responsive control over relative computation rates [Wal94]. Lottery scheduling implements proportional-share resource management: the resource consumption rates of active clients are proportional to the relative shares that they are allocated. Higher-level abstractions for flexible, modular resource management were also introduced with lottery scheduling, but they do not depend on the randomized implementation of proportional sharing.

In this paper we introduce stride scheduling, a deterministic scheduling technique that efficiently supports the same flexible resource management abstractions introduced by lottery scheduling. One contribution of our work is a cross-application and generalization of rate-based flow control algorithms designed for networks [Dem90, Zha91, ZhK91, Par93] to schedule other resources such as processor time. We present new techniques to support dynamic operations such as the modification of relative allocations and the transfer of resource rights between clients. We also introduce a novel hierarchical stride scheduling algorithm. Hierarchical stride scheduling is a recursive application of the basic technique that achieves better throughput accuracy and lower response time variability than previous schemes.

Simulation results demonstrate that, compared to lottery scheduling, stride scheduling achieves significantly improved accuracy over relative throughput rates, with significantly lower response time variability. In contrast to other deterministic schemes, stride scheduling efficiently supports operations that dynamically modify relative allocations and the number of clients competing for a resource. We have also implemented prototype stride schedulers for the Linux kernel, and found that they provide accurate control over both processor time and the relative network transmission rates of competing sockets.

In the next section, we present the core stride-scheduling mechanism. Section 3 describes extensions that support the resource management abstractions introduced with lottery scheduling. Section 4 introduces hierarchical stride scheduling. Simulation results with quantitative comparisons to lottery scheduling appear in Section 5. A discussion of our Linux prototypes and related implementation issues is presented in Section 6. In Section 7, we examine related work. Finally, we summarize our conclusions in Section 8.

2 Stride Scheduling

Stride scheduling is a deterministic allocation mechanism for time-shared resources. Resources are allocated in discrete time slices; we refer to the duration of a standard time slice as a quantum. Resource rights are represented by tickets: abstract, first-class objects that can be issued in different amounts and passed between clients.¹ Throughput rates for active clients are directly proportional to their ticket allocations. Thus, a client with twice as many tickets as another will receive twice as much of a resource in a given time interval. Client response times are inversely proportional to ticket allocations. Therefore a client with twice as many tickets as another will wait only half as long before acquiring a resource.

The throughput accuracy of a proportional-share scheduler can be characterized by measuring the difference between the specified and actual number of allocations that a client receives during a series of allocations. If a client has $t$ tickets in a system with a total of $T$ tickets, then its specified allocation after $n_a$ consecutive allocations is $n_a t / T$. Due to quantization, it is typically impossible to achieve this ideal exactly. We define a client’s absolute error as the absolute value of the difference between its specified and actual number of allocations. We define the pairwise relative error between clients $c_i$ and $c_j$ as the absolute error for the subsystem containing only $c_i$ and $c_j$, where $T = t_i + t_j$, and $n_a$ is the total number of allocations received by both clients.

¹In this paper we use the same terminology (e.g., tickets and currencies) that we introduced for lottery scheduling [Wal94].
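The two error definitions can be written down directly. The following is a minimal sketch, assuming allocations are counted in whole quanta; the function names absolute_error and pairwise_relative_error are illustrative and do not appear in the paper.

```c
#include <assert.h>

/* Absolute error: |specified - actual|, with specified = n_a * t / T.
 * (Illustrative helper; the name is an assumption.) */
double absolute_error(int t, int T, int n_a, int actual)
{
    double diff;

    assert(T > 0 && t >= 0 && t <= T && n_a >= 0);
    diff = (double)n_a * t / T - actual;
    return diff < 0 ? -diff : diff;
}

/* Pairwise relative error between clients i and j: the absolute error of
 * client i in the subsystem containing only i and j, where T = t_i + t_j
 * and n_a is the number of allocations received by both clients. */
double pairwise_relative_error(int t_i, int actual_i, int t_j, int actual_j)
{
    return absolute_error(t_i, t_i + t_j, actual_i + actual_j, actual_i);
}
```

For instance, a client holding 100 of 200 tickets that has received all of the first 100 allocations has an absolute error of 50.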

While lottery scheduling offers probabilistic guarantees about throughput and response time, stride scheduling provides stronger deterministic guarantees. For lottery scheduling, after a series of $n_a$ allocations, a client’s expected relative error and expected absolute error are both $O(\sqrt{n_a})$. For stride scheduling, the relative error for any pair of clients is never greater than one, independent of $n_a$. However, for skewed ticket distributions it is still possible for a client to have $O(n_c)$ absolute error, where $n_c$ is the number of clients. Nevertheless, stride scheduling is considerably more accurate than lottery scheduling, since its error does not grow with the number of allocations. In Section 4, we introduce a hierarchical variant of stride scheduling that provides a tighter $O(\lg n_c)$ bound on each client’s absolute error.

This section first presents the basic stride-scheduling algorithm, and then introduces extensions that support dynamic client participation, dynamic modifications to ticket allocations, and nonuniform quanta.

2.1 Basic Algorithm

The core stride scheduling idea is to compute a representation of the time interval, or stride, that a client must wait between successive allocations. The client with the smallest stride will be scheduled most frequently. A client with half the stride of another will execute twice as quickly; a client with double the stride of another will execute twice as slowly. Strides are represented in virtual time units called passes, instead of units of real time such as seconds.

Three state variables are associated with each client: tickets, stride, and pass. The tickets field specifies the client’s resource allocation, relative to other clients.


    /* per-client state */
    typedef struct {
        /* ... */
        int tickets, stride, pass;
    } *client_t;

    /* large integer stride constant (e.g. 1M) */
    const int stride1 = (1 << 20);

    /* current resource owner */
    client_t current;

    /* initialize client with specified allocation */
    void client_init(client_t c, queue_t q, int tickets)
    {
        /* stride is inverse of tickets */
        c->tickets = tickets;
        c->stride = stride1 / tickets;
        c->pass = c->stride;

        /* join competition for resource */
        queue_insert(q, c);
    }

    /* proportional-share resource allocation */
    void allocate(queue_t q)
    {
        /* select client with minimum pass value */
        current = queue_remove_min(q);

        /* use resource for quantum */
        use_resource(current);

        /* compute next pass using stride */
        current->pass += current->stride;
        queue_insert(q, current);
    }

Figure 1: Basic Stride Scheduling Algorithm. ANSI C code for scheduling a static set of clients. Queue manipulations can be performed in $O(\lg n_c)$ time by using an appropriate data structure.

The stride field is inversely proportional to tickets, and represents the interval between selections, measured in passes. The pass field represents the virtual time index for the client’s next selection.

Performing a resource allocation is very simple: the client with the minimum pass is selected, and its pass is advanced by its stride. If more than one client has the same minimum pass value, then any of them may be selected. A reasonable deterministic approach is to use a consistent ordering to break ties, such as one defined by unique client identifiers.

Figure 1 lists ANSI C code for the basic stride scheduling algorithm. For simplicity, we assume a static set of clients with fixed ticket assignments. The stride scheduling state for each client must be initialized via client_init() before any allocations are performed by allocate(). These restrictions will be relaxed in subsequent sections to permit more dynamic behavior.

To accurately represent stride as the reciprocal of tickets, a floating-point representation could be used. We present a more efficient alternative that uses a high-precision fixed-point integer representation. This is easily implemented by multiplying the inverted ticket value by a large integer constant. We will refer to this constant as stride1, since it represents the stride corresponding to the minimum ticket allocation of one.²
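To make the fixed-point representation concrete, the sketch below computes strides with stride1 = 2^20 and measures how much is lost to integer truncation. The helper names stride_for and stride_loss are assumptions introduced for illustration.

```c
#include <assert.h>

/* large integer stride constant, as in Figure 1 */
static const int stride1 = (1 << 20);

/* Fixed-point stride for a ticket allocation (hypothetical helper). */
int stride_for(int tickets)
{
    assert(tickets > 0 && tickets <= stride1);
    return stride1 / tickets;   /* integer division truncates */
}

/* Quantization loss: how far tickets * stride falls short of stride1. */
int stride_loss(int tickets)
{
    return stride1 - tickets * stride_for(tickets);
}
```

With these definitions, 3 tickets yield a stride of 349525 with a loss of 1 pass unit per stride1, while 2 tickets divide stride1 exactly. The relative loss is small because stride1 is large compared with typical ticket values.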

The cost of performing an allocation depends on the data structure used to implement the client queue. A priority queue can be used to implement queue_remove_min() and other queue operations in $O(\lg n_c)$ time or better, where $n_c$ is the number of clients [Cor90]. A skip list could also provide expected $O(\lg n_c)$ time queue operations with low constant overhead [Pug90]. For small $n_c$ or heavily skewed ticket distributions, a simple sorted list is likely to be most efficient in practice.

Figure 2 illustrates an example of stride scheduling. Three clients, A, B, and C, are competing for a time-shared resource with a 3 : 2 : 1 ticket ratio. For simplicity, a convenient stride1 = 6 is used instead of a large number, yielding respective strides of 2, 3, and 6. The pass value of each client is plotted as a function of time. For each quantum, the client with the minimum pass value is selected, and its pass is advanced by its stride. Ties are broken using the arbitrary but consistent client ordering A, B, C.

²Appendix A discusses the representation of strides in more detail.

[Figure 2 plot: pass value (0 to 20) versus time in quanta (0 to 10).]

Figure 2: Stride Scheduling Example. Clients A (triangles), B (circles), and C (squares) have a 3 : 2 : 1 ticket ratio. In this example, stride1 = 6, yielding respective strides of 2, 3, and 6. For each quantum, the client with the minimum pass value is selected, and its pass is advanced by its stride.
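The example can be traced with a small simulation. This is a minimal sketch of the basic algorithm, assuming a linear scan in place of the priority queue and tie-breaking by lowest client index; stride_simulate and MAX_CLIENTS are names introduced here.

```c
#include <assert.h>

#define MAX_CLIENTS 16

struct client { int tickets, stride, pass; };

/* Run the basic stride algorithm for nquanta quanta and record the index
 * of the client selected in each quantum. Ties on pass are broken in
 * favor of the lowest index (a consistent client ordering). */
void stride_simulate(const int *tickets, int nclients, int stride1,
                     int nquanta, int *schedule)
{
    struct client c[MAX_CLIENTS];
    int i, q, min;

    assert(nclients > 0 && nclients <= MAX_CLIENTS);
    for (i = 0; i < nclients; i++) {
        assert(tickets[i] > 0);
        c[i].tickets = tickets[i];
        c[i].stride = stride1 / tickets[i];
        c[i].pass = c[i].stride;
    }

    for (q = 0; q < nquanta; q++) {
        /* select client with minimum pass value */
        min = 0;
        for (i = 1; i < nclients; i++)
            if (c[i].pass < c[min].pass)   /* strict '<' keeps lowest index on ties */
                min = i;
        schedule[q] = min;

        /* advance the selected client's pass by its stride */
        c[min].pass += c[min].stride;
    }
}
```

Called with tickets {3, 2, 1} and stride1 = 6, the first six quanta go to clients A, B, A, A, B, C (indices 0, 1, 0, 0, 1, 2), giving the 3 : 2 : 1 split exactly.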

2.2 Dynamic Client Participation

The algorithm presented in Figure 1 does not support dynamic changes in the number of clients competing for a resource. When clients are allowed to join and leave at any time, their state must be appropriately modified. Figure 3 extends the basic algorithm to efficiently handle dynamic changes.

A key extension is the addition of global variables that maintain aggregate information about the set of active clients. The global_tickets variable contains the total ticket sum for all active clients. The global_pass variable maintains the “current” pass for the scheduler. The global_pass advances at the rate of global_stride per quantum, where global_stride = stride1 / global_tickets. Conceptually, the global_pass continuously advances at a smooth rate. This is implemented by invoking the global_pass_update() routine whenever the global_pass value is needed.³

³Due to the use of a fixed-point integer representation for strides, small quantization errors may accumulate slowly, causing global_pass to drift away from client pass values over a long period of time. This is unlikely to be a practical problem, since client pass values are recomputed using global_pass each time they leave and rejoin the system. However, this problem can be avoided by very infrequently resetting global_pass to the minimum pass value for the set of active clients.

A state variable is also associated with each client to store the remaining portion of its stride when a dynamic change occurs. The remain field represents the number of passes that are left before a client’s next selection. When a client leaves the system, remain is computed as the difference between the client’s pass and the global pass. When a client rejoins the system, its pass value is recomputed by adding its remain value to the global pass.

This mechanism handles situations involving either positive or negative error between the specified and actual number of allocations. If remain < stride, then the client is effectively given credit when it rejoins for having previously waited for part of its stride without receiving a quantum. If remain > stride, then the client is effectively penalized when it rejoins for having previously received a quantum without waiting for its entire stride.⁴

⁴Several interesting alternatives could also be implemented. For example, a client could be given credit for some or all of the passes that elapse while it is inactive.

This approach makes an implicit assumption that a partial quantum now is equivalent to a partial quantum later. In general, this is a reasonable assumption, and resembles the treatment of nonuniform quanta that will be presented in Section 2.4. However, it may not be appropriate if the total number of tickets competing for a resource varies significantly between the time that a client leaves and rejoins the system.

The time complexity for both the client_leave() and client_join() operations is $O(\lg n_c)$, where $n_c$ is the number of clients. These operations are efficient because the stride scheduling state associated with distinct clients is completely independent; a change to one client does not require updates to any other clients. The $O(\lg n_c)$ cost results from the need to perform queue manipulations.

2.3 Dynamic Ticket Modifications

Additional support is needed to dynamically modify client ticket allocations. Figure 4 illustrates a dynamic allocation change, and Figure 5 lists ANSI C code for


    /* per-client state */
    typedef struct {
        /* ... */
        int tickets, stride, pass, remain;
    } *client_t;

    /* quantum in real time units (e.g. 1M cycles) */
    const int quantum = (1 << 20);

    /* global aggregate tickets, stride, pass */
    int global_tickets, global_stride, global_pass;

    /* update global pass based on elapsed real time */
    void global_pass_update(void)
    {
        static int last_update = 0;
        int elapsed;

        /* compute elapsed time, advance last update */
        elapsed = time() - last_update;
        last_update += elapsed;

        /* advance global pass by quantum-adjusted stride */
        global_pass += (global_stride * elapsed) / quantum;
    }

    /* update global tickets and stride to reflect change */
    void global_tickets_update(int delta)
    {
        global_tickets += delta;
        global_stride = stride1 / global_tickets;
    }

    /* initialize client with specified allocation */
    void client_init(client_t c, int tickets)
    {
        /* stride is inverse of tickets, whole stride remains */
        c->tickets = tickets;
        c->stride = stride1 / tickets;
        c->remain = c->stride;
    }

    /* join competition for resource */
    void client_join(client_t c, queue_t q)
    {
        /* compute pass for next allocation */
        global_pass_update();
        c->pass = global_pass + c->remain;

        /* add to queue */
        global_tickets_update(c->tickets);
        queue_insert(q, c);
    }

    /* leave competition for resource */
    void client_leave(client_t c, queue_t q)
    {
        /* compute remainder of current stride */
        global_pass_update();
        c->remain = c->pass - global_pass;

        /* remove from queue */
        global_tickets_update(-c->tickets);
        queue_remove(q, c);
    }

    /* proportional-share resource allocation */
    void allocate(queue_t q)
    {
        int elapsed;

        /* select client with minimum pass value */
        current = queue_remove_min(q);

        /* use resource, measuring elapsed real time */
        elapsed = use_resource(current);

        /* compute next pass using quantum-adjusted stride */
        current->pass += (current->stride * elapsed) / quantum;
        queue_insert(q, current);
    }

Figure 3: Dynamic Stride Scheduling Algorithm. ANSI C code for stride scheduling operations, including support for joining, leaving, and nonuniform quanta. Queue manipulations can be performed in $O(\lg n_c)$ time by using an appropriate data structure.
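The effect of the remain field in the join and leave operations can be isolated from the queue and timing machinery. The sketch below is a simplified illustration, assuming the caller supplies an up-to-date global pass; client_leave_at and client_join_at are hypothetical names, not the routines of Figure 3.

```c
#include <assert.h>

struct client { int tickets, stride, pass, remain; };

/* On leaving: save the passes left before the client's next selection. */
void client_leave_at(struct client *c, int global_pass)
{
    assert(c != 0);
    c->remain = c->pass - global_pass;
}

/* On rejoining: restore the client's position relative to the global pass. */
void client_join_at(struct client *c, int global_pass)
{
    assert(c != 0);
    c->pass = global_pass + c->remain;
}
```

A client with pass 9 that leaves when the global pass is 7 keeps remain = 2; if it rejoins when the global pass has reached 40, its pass becomes 42, so it is again two passes away from its next selection.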


[Figure 4 diagram: timelines showing global_pass, pass, and pass′, with the intervals stride, stride′, remain, and remain′.]

Figure 4: Allocation Change. Modifying a client’s allocation from tickets to tickets′ requires only a constant-time recomputation of its stride and pass. The new stride′ is inversely proportional to tickets′. The new pass′ is determined by scaling remain, the remaining portion of the current stride, by stride′ / stride.

    /* dynamically modify client ticket allocation */
    void client_modify(client_t c, queue_t q, int tickets)
    {
        int remain, stride;

        /* leave queue for resource */
        client_leave(c, q);

        /* compute new stride */
        stride = stride1 / tickets;

        /* scale remaining passes to reflect change in stride */
        remain = (c->remain * stride) / c->stride;

        /* update client state */
        c->tickets = tickets;
        c->stride = stride;
        c->remain = remain;

        /* rejoin queue for resource */
        client_join(c, q);
    }

Figure 5: Dynamic Ticket Modification. ANSI C code for dynamic modifications to client ticket allocations. Queue manipulations can be performed in $O(\lg n_c)$ time by using an appropriate data structure.

dynamically changing a client’s ticket allocation. When a client’s allocation is dynamically changed from tickets to tickets′, its stride and pass values must be recomputed. The new stride′ is computed as usual, inversely proportional to tickets′. To compute the new pass′, the remaining portion of the client’s current stride, denoted by remain, is adjusted to reflect the new stride′. This is accomplished by scaling remain by stride′ / stride. In Figure 4, the client’s ticket allocation is increased, so pass is decreased, compressing the time remaining until the client is next selected. If its allocation had decreased, then pass would have increased, expanding the time remaining until the client is next selected.

The client_modify() operation requires $O(\lg n_c)$ time, where $n_c$ is the number of clients. As with dynamic changes to the number of clients, ticket allocation changes are efficient because the stride scheduling state associated with distinct clients is completely independent; the dominant cost is due to queue manipulations.
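The scaling step at the heart of a ticket modification is a single multiply and divide. The following sketch shows it in isolation; scale_remain is a hypothetical helper, and the widening of the intermediate product to 64 bits is a precaution added here (with strides near 2^20, the product can exceed a 32-bit int), not something shown in Figure 5.

```c
#include <assert.h>

/* Scale the passes remaining before a client's next selection when its
 * stride changes from old_stride to new_stride. */
int scale_remain(int remain, int old_stride, int new_stride)
{
    assert(old_stride > 0 && new_stride > 0);
    /* widen to long long so remain * new_stride cannot overflow */
    return (int)(((long long)remain * new_stride) / old_stride);
}
```

For example, doubling a client’s allocation from 2 to 4 tickets with stride1 = 2^20 halves its stride from 524288 to 262144, so a remainder of 100000 passes shrinks to 50000.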

2.4 Nonuniform Quanta

With the basic stride scheduling algorithm presented in Figure 1, a client that does not consume its entire allocated quantum would receive less than its entitled share of a resource. Similarly, it may be possible for a client’s usage to exceed a standard quantum in some situations. For example, under a non-preemptive scheduler, client run lengths can vary considerably.

Fortunately, fractional and variable-size quanta can easily be accommodated. When a client consumes a fraction $f$ of its allocated time quantum, its pass should be advanced by $f \times$ stride instead of stride. If $f < 1$, then the client’s pass will be increased less, and it will be scheduled sooner. If $f > 1$, then the client’s pass will be increased more, and it will be scheduled later. The extended code listed in Figure 3 supports nonuniform quanta by effectively computing $f$ as the elapsed resource usage time divided by a standard quantum in the same time units.
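The quantum-adjusted advance can be checked with a few numbers. This is a minimal sketch under the assumption that elapsed time and the standard quantum are integers in the same units; pass_advance is a name introduced here, and the 64-bit intermediate is an added precaution against overflow.

```c
#include <assert.h>

/* Pass advance for a client that used `elapsed` time units of a slice
 * whose standard length is `quantum`: f * stride, with f = elapsed / quantum. */
int pass_advance(int stride, int elapsed, int quantum)
{
    assert(quantum > 0 && elapsed >= 0 && stride >= 0);
    return (int)(((long long)stride * elapsed) / quantum);
}
```

With a stride of 6 and a quantum of 2^20 time units, a client that uses half a quantum advances by 3 passes, a full quantum by 6, and one and a half quanta by 9.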

Another extension would permit clients to specify the quantum size that they require.⁵ This could be implemented by associating an additional quantum_c field with each client, and scaling each client’s stride field by quantum_c / quantum. Deviations from a client’s specified quantum would still be handled as described above, with $f$ redefined as the elapsed resource usage divided by the client-specific quantum_c.

⁵An alternative would be to allow a client to specify its scheduling period. Since a client’s period and quantum are related by its relative resource share, specifying one quantity yields the other.

3 Flexible Resource Management

Since stride scheduling enables low-overhead dynamic modifications, it can efficiently support the flexible resource management abstractions introduced with lottery scheduling [Wal94]. In this section, we explain how ticket transfers, ticket inflation, and ticket currencies can be implemented on top of a stride-based substrate for proportional sharing.

3.1 Ticket Transfers

A ticket transfer is an explicit transfer of tickets from one client to another. Ticket transfers are particularly useful when one client blocks waiting for another. For example, during a synchronous RPC, a client can loan its resource rights to the server computing on its behalf. A transfer of $t$ tickets between clients A and B essentially consists of two dynamic ticket modifications. Using the code presented in Figure 5, these modifications are implemented by invoking client_modify(A, q, A.tickets - t) and client_modify(B, q, B.tickets + t). When A transfers tickets to B, A’s stride and pass will increase, while B’s stride and pass will decrease.

A slight complication arises in the case of a complete ticket transfer; i.e., when A transfers its entire ticket allocation to B. In this case, A’s adjusted ticket value is zero, leading to an adjusted stride of infinity (division by zero). To circumvent this problem, we record the fraction of A’s stride that is remaining at the time of the transfer, and then adjust that remaining fraction when A once again obtains tickets. This can easily be implemented by computing A’s remain value at the time of the transfer, and deferring the computation of its stride and pass values until A receives a non-zero ticket allocation (perhaps via a return transfer from B).
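One possible way to realize the deferred computation is sketched below. It assumes the client has already left the queue, so that its remain value is current, and it keeps the last non-zero stride while the client holds no tickets; client_set_tickets and this particular bookkeeping are assumptions made for illustration, not the paper’s implementation.

```c
#include <assert.h>

struct client { int tickets, stride, remain; };

/* Change the allocation of a client that is out of the queue
 * (hypothetical helper). c->remain holds the passes left in its stride. */
void client_set_tickets(struct client *c, int tickets, int stride1)
{
    int new_stride;

    assert(c != 0 && tickets >= 0 && c->stride > 0);
    c->tickets = tickets;
    if (tickets == 0)
        return;   /* complete transfer: keep stride and remain, defer rescaling */

    /* rescale remain against the last stride that was actually in effect */
    new_stride = stride1 / tickets;
    c->remain = (int)(((long long)c->remain * new_stride) / c->stride);
    c->stride = new_stride;
}
```

With stride1 = 12, a client holding 4 tickets (stride 3) with 2 passes remaining keeps both values when its allocation drops to zero; when it later receives 2 tickets, its stride becomes 6 and its remainder is rescaled to 4.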

3.2 Ticket Inflation

An alternative to explicit ticket transfers is ticket inflation, in which a client can escalate its resource rights by creating more tickets. Ticket inflation (or deflation) simply consists of a dynamic ticket modification for a client. Ticket inflation causes a client’s stride and pass to decrease; deflation causes its stride and pass to increase.

Ticket inflation is useful among mutually trusting clients, since it permits resource rights to be reallocated without explicitly reshuffling tickets among clients. However, ticket inflation is also dangerous, since any client can monopolize a resource simply by creating a large number of tickets. In order to avoid the dangers of inflation while still exploiting its advantages, we introduced a currency abstraction for lottery scheduling [Wal94] that is loosely borrowed from economics.

3.3 Ticket Currencies

A ticket currency defines a resource management abstraction barrier that contains the effects of ticket inflation in a modular way. Tickets are denominated in currencies, allowing resource rights to be expressed in units that are local to each group of mutually trusting clients. Each currency is backed, or funded, by tickets that are denominated in more primitive currencies. Currency relationships may form an arbitrary acyclic graph, such as a hierarchy of currencies. The effects of inflation are locally contained by effectively maintaining an exchange rate between each local currency and a common base currency that is conserved. The currency abstraction is useful for flexibly naming, sharing, and protecting resource rights.

The currency abstraction introduced for lottery scheduling can also be used with stride scheduling. One implementation technique is to always immediately convert ticket values denominated in arbitrary currencies into units of the common base currency. Any changes to the value of a currency would then require dynamic modifications to all clients holding tickets denominated in that currency, or one derived from it.⁶ Thus, the scope of any changes in currency values is limited to exactly those clients which are affected. Since currencies are used to group and isolate logical sets of clients, the impact of currency fluctuations will typically be very localized.

⁶An important exception is that changes to the number of tickets in the base currency do not require any modifications. This is because all stride scheduling state is computed from ticket values expressed in base units, and the state associated with distinct clients is independent.
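The conversion into base units can be illustrated for the simplest case. The sketch assumes a single-level currency whose exchange rate is its funding in base units divided by the number of tickets it has issued; base_value and its parameters are hypothetical names, and multi-level currency graphs would apply the same step recursively.

```c
#include <assert.h>

/* Value in base units of `amount` tickets denominated in a currency that
 * has `issued` tickets outstanding and is funded by `backing` base units. */
int base_value(int amount, int issued, int backing)
{
    assert(issued > 0 && amount >= 0 && amount <= issued && backing >= 0);
    return (int)(((long long)amount * backing) / issued);
}
```

Under these assumptions, 100 tickets in a currency funded by 1000 base units are worth 500 base units when 200 tickets are issued, and only 250 after the currency inflates to 400 issued tickets. Each holder of that currency would then need a dynamic ticket modification, while clients in other currencies are unaffected.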


4 Hierarchical Stride Scheduling

Stride scheduling guarantees that the relative throughput error for any pair of clients never exceeds a single quantum. However, depending on the distribution of tickets to clients, a large $O(n_c)$ absolute throughput error is still possible, where $n_c$ is the number of clients.

For example, consider a set of 101 clients with a 100 : 1 : ... : 1 ticket allocation. A schedule that minimizes absolute error and response time variability would alternate the 100-ticket client with each of the single-ticket clients. However, the standard stride algorithm schedules the clients in order, with the 100-ticket client receiving 100 quanta before any other client receives a single quantum. Thus, after 100 allocations, the intended allocation for the 100-ticket client is 50, while its actual allocation is 100, yielding a large absolute error of 50. This behavior is also exhibited by similar rate-based flow control algorithms for networks [Dem90, Zha91, ZhK91, Par93].
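The skewed example can be reproduced with the basic algorithm. This sketch assumes stride1 = 2^20, initial pass equal to stride, a linear scan for the minimum, and ties broken by lowest index; max_abs_error_client0 is a name introduced here for illustration.

```c
#include <assert.h>

#define NCLIENTS 101

/* Simulate the basic stride algorithm for one 100-ticket client (index 0)
 * and 100 single-ticket clients, returning the maximum absolute error of
 * client 0 observed over the first nquanta allocations. */
double max_abs_error_client0(int nquanta)
{
    static int tickets[NCLIENTS], stride[NCLIENTS], pass[NCLIENTS];
    const int stride1 = 1 << 20;
    int total = 0, actual0 = 0, i, q, min;
    double err, max_err = 0.0;

    assert(nquanta >= 0);
    for (i = 0; i < NCLIENTS; i++) {
        tickets[i] = (i == 0) ? 100 : 1;
        stride[i] = stride1 / tickets[i];
        pass[i] = stride[i];
        total += tickets[i];
    }

    for (q = 1; q <= nquanta; q++) {
        /* select client with minimum pass, lowest index on ties */
        min = 0;
        for (i = 1; i < NCLIENTS; i++)
            if (pass[i] < pass[min])
                min = i;
        pass[min] += stride[min];
        if (min == 0)
            actual0++;

        /* absolute error: |specified - actual| after q allocations */
        err = (double)q * tickets[0] / total - actual0;
        if (err < 0)
            err = -err;
        if (err > max_err)
            max_err = err;
    }
    return max_err;
}
```

Over 200 allocations the maximum error is 50, reached after the 100th allocation, which matches the figure given above: the 100-ticket client’s pass stays below the single-ticket clients’ initial pass of 2^20 for exactly 100 selections.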

In this section we describe a novel hierarchical variant of stride scheduling that limits the absolute throughput error of any client to $O(\lg n_c)$ quanta. For the 101-client example described above, hierarchical stride scheduler simulations produced a maximum absolute error of only 4.5. Our algorithm also significantly reduces response time variability by aggregating clients to improve interleaving. Since it is common for systems to consist of a small number of high-throughput clients together with a large number of low-throughput clients, hierarchical stride scheduling represents a practical improvement over previous work.

4.1 Basic Algorithm

Hierarchical stride scheduling is essentially a recursive application of the basic stride scheduling algorithm. Individual clients are combined into groups with larger aggregate ticket allocations, and correspondingly smaller strides. An allocation is performed by invoking the normal stride scheduling algorithm first among groups, and then among individual clients within groups.

Although many different groupings are possible, we consider a balanced binary tree of groups. Each leaf node represents an individual client. Each internal node represents the group of clients (leaf nodes) that it covers, and contains their aggregate tickets, stride, and pass values. Thus, for an internal node, tickets is the total ticket sum for all of the clients that it covers, and stride = stride1 / tickets. The pass value for an internal node is updated whenever the pass value for any of the clients that it covers is modified.

    /* binary tree node */
    typedef struct node {
        /* ... */
        struct node *left, *right, *parent;
        int tickets, stride, pass;
    } *node_t;

    /* quantum in real time units (e.g. 1M cycles) */
    const int quantum = (1 << 20);

    /* proportional-share resource allocation */
    void allocate(node_t root)
    {
        int elapsed;
        node_t n;

        /* traverse root-to-leaf path following min pass */
        for (n = root; !node_is_leaf(n); )
            if (n->left == NULL ||
                n->right->pass < n->left->pass)
                n = n->right;
            else
                n = n->left;

        /* use resource, measuring elapsed real time */
        current = n;
        elapsed = use_resource(current);

        /* update pass for each ancestor using its stride */
        for (n = current; n != NULL; n = n->parent)
            n->pass += (n->stride * elapsed) / quantum;
    }

Figure 6: Hierarchical Stride Scheduling Algorithm. ANSI C code for hierarchical stride scheduling with a static set of clients. The main data structure is a binary tree of nodes. Each node represents either a client (leaf) or a group (internal node) that summarizes aggregate information.

Figure 6 presents ANSI C code for the basic hierarchical stride scheduling algorithm. Each node has the normal tickets, stride, and pass scheduling state, as well as the usual tree links to its parent, left child, and right child. An allocation is performed by tracing a path from the root of the tree to a leaf, choosing the child with the smaller pass value at each level. Once the selected client has finished using the resource, its pass value is updated to reflect its usage. The client update is identical to that used in the dynamic stride algorithm that supports nonuniform quanta, listed in Figure 3. However, the hierarchical scheduler requires additional updates to each of the client’s ancestors, following the leaf-to-root path formed by successive parent links.
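The root-to-leaf selection and leaf-to-root update can be traced on a small tree. The sketch below is an illustration under several assumptions that Figure 6 leaves open: a complete binary tree of four leaves stored in an array (children of node i at 2i+1 and 2i+2), every node’s pass initialized to its stride, and every client consuming exactly one full quantum. The names hier_simulate, NLEAVES, and NNODES are introduced here.

```c
#include <assert.h>

#define NLEAVES 4
#define NNODES  (2 * NLEAVES - 1)

struct node { int tickets, stride, pass; };

/* Run hierarchical stride scheduling for nquanta full quanta and record
 * the leaf (client) index selected in each quantum. Node 0 is the root;
 * nodes NLEAVES-1 .. NNODES-1 are the leaves. */
void hier_simulate(const int *tickets, int stride1, int nquanta, int *schedule)
{
    struct node t[NNODES];
    int i, q, n;

    /* leaves hold client tickets; internal nodes hold aggregate tickets */
    for (i = 0; i < NLEAVES; i++) {
        assert(tickets[i] > 0);
        t[NLEAVES - 1 + i].tickets = tickets[i];
    }
    for (i = NLEAVES - 2; i >= 0; i--)
        t[i].tickets = t[2 * i + 1].tickets + t[2 * i + 2].tickets;
    for (i = 0; i < NNODES; i++) {
        t[i].stride = stride1 / t[i].tickets;
        t[i].pass = t[i].stride;          /* assumed initial pass */
    }

    for (q = 0; q < nquanta; q++) {
        /* traverse root-to-leaf path following min pass (left on ties) */
        n = 0;
        while (n < NLEAVES - 1)
            n = (t[2 * n + 2].pass < t[2 * n + 1].pass) ? 2 * n + 2 : 2 * n + 1;
        schedule[q] = n - (NLEAVES - 1);

        /* update pass for the leaf and each ancestor using its stride */
        for (;;) {
            t[n].pass += t[n].stride;
            if (n == 0)
                break;
            n = (n - 1) / 2;
        }
    }
}
```

With tickets {4, 2, 1, 1} and stride1 = 24, the first eight quanta select clients 0, 0, 1, 2, 0, 0, 1, 3, so each client receives exactly its 4 : 2 : 1 : 1 share over one cycle.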

Each client allocation can be viewed as a series of pairwise allocations among groups of clients at each level in the tree. The maximum error for each pairwise allocation is 1, and in the worst case, error can accumulate at each level. Thus, the maximum absolute error for the overall tree-based allocation is the height of the tree, which is $\lg n_c$, where $n_c$ is the number of clients. Since the error for a pairwise A : B ratio is minimized when A = B, absolute error can be further reduced by carefully choosing client leaf positions to better balance the tree based on the number of tickets at each node.

4.2 Dynamic Modifications

Extending the basic hierarchical stride algorithm to support dynamic modifications requires a careful consideration of the effects of changes at each level in the tree. Figure 7 lists ANSI C code for performing a ticket modification that works for both clients and internal nodes. Changes to client ticket allocations essentially follow the same scaling and update rules used for normal stride scheduling, listed in Figure 5. The hierarchical scheduler requires additional updates to each of the client’s ancestors, following the leaf-to-root path formed by successive parent links. Note that the root pass value used in Figure 7 effectively takes the place of the global_pass variable used in Figure 5; both represent the aggregate global scheduler pass.

    /* dynamically modify node allocation by delta tickets */
    void node_modify(node_t n, node_t root, int delta)
    {
        int old_stride, remain;

        /* compute new tickets, stride */
        old_stride = n->stride;
        n->tickets += delta;
        n->stride = stride1 / n->tickets;

        /* done when reach root */
        if (n == root)
            return;

        /* scale remaining passes to reflect change in stride */
        remain = n->pass - root->pass;
        remain = (remain * n->stride) / old_stride;
        n->pass = root->pass + remain;

        /* propagate change to ancestors */
        node_modify(n->parent, root, delta);
    }

Figure 7: Dynamic Ticket Modification. ANSI C code for dynamic modifications to client ticket allocations under hierarchical stride scheduling. A modification requires $O(\lg n_c)$ time to propagate changes.
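To make the scaling rule concrete, the following is a self-contained sketch of the same modification logic. It is a hedged variant of Figure 7, not a replacement for it: `struct node` and `STRIDE1` are names assumed here, and the intermediate product uses a 64-bit integer (C99 `int64_t`) because `remain * stride` can exceed 32 bits when the stride constant is $2^{20}$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIDE1 (1 << 20)   /* large integer stride constant (assumed value) */

struct node {
    int tickets, stride, pass;
    struct node *parent;
};

/* Change a node's allocation by delta tickets and propagate to ancestors. */
void node_modify(struct node *n, struct node *root, int delta)
{
    int old_stride = n->stride;
    int64_t remain;

    /* compute new tickets and stride */
    assert(n->tickets + delta > 0);
    n->tickets += delta;
    n->stride = STRIDE1 / n->tickets;

    /* the root has no remaining pass to rescale */
    if (n == root)
        return;

    /* scale the pass remaining relative to the root by new/old stride */
    remain = (int64_t)n->pass - root->pass;
    remain = (remain * n->stride) / old_stride;
    n->pass = root->pass + (int)remain;

    /* propagate the ticket change toward the root */
    node_modify(n->parent, root, delta);
}
```

For example, a client with 3 tickets that is exactly one stride ahead of the root pass, and whose allocation is doubled to 6 tickets, ends up one new (halved) stride ahead of the root pass, while the root's ticket total grows from 4 to 7.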


Although not presented here, we have also developed operations to support dynamic client participation under hierarchical stride scheduling [Wal95]. As for allocate(), the time complexity for client_join() and client_leave() operations is $O(\lg n_c)$, where $n_c$ is the number of clients.

5 Simulation Results

This section presents the results of several quantitative experiments designed to evaluate the effectiveness of stride scheduling. We examine the behavior of stride scheduling in both static and dynamic environments, and also test hierarchical stride scheduling. When stride scheduling is compared to lottery scheduling, we find that the stride-based approach provides more accurate control over relative throughput rates, with much lower variance in response times.

For example, Figure 8 presents the results of scheduling three clients with a 3 : 2 : 1 ticket ratio for 100 allocations. The dashed lines represent the ideal allocations for each client. It is clear from Figure 8(a) that lottery scheduling exhibits significant variability at this time scale, due to the algorithm's inherent use of randomization. In contrast, Figure 8(b) indicates that the deterministic stride scheduler produces precise periodic behavior.

5.1 Throughput Accuracy

Under randomized lottery scheduling, the expected value for the absolute error between the specified and actual number of allocations for any set of clients is $O(\sqrt{n_a})$, where $n_a$ is the number of allocations. This is because the number of lotteries won by a client has a binomial distribution. The probability $p$ that a client holding $t$ tickets will win a given lottery with a total of $T$ tickets is simply $p = t/T$. After $n_a$ identical lotteries, the expected number of wins $w$ is $E[w] = n_a p$, with variance $\sigma_w^2 = n_a p (1 - p)$.
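As a worked illustration of the binomial expressions (the numbers are computed from the formulas, not taken from the reported measurements), consider a client holding $t = 7$ of $T = 10$ tickets over $n_a = 1000$ lotteries, where $p$ is the per-lottery win probability and $w$ the number of wins:

```latex
\begin{align*}
p          &= \frac{t}{T} = \frac{7}{10} = 0.7 \\
E[w]       &= n_a\,p = 1000 \times 0.7 = 700 \\
\sigma_w^2 &= n_a\,p\,(1 - p) = 1000 \times 0.7 \times 0.3 = 210 \\
\sigma_w   &= \sqrt{210} \approx 14.5
\end{align*}
```

The standard deviation grows as $\sqrt{n_a}$, so the absolute error grows with the number of allocations, while the ratio $\sigma_w / E[w]$ shrinks as $1/\sqrt{n_a}$.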

Under deterministic stride scheduling, the relative error between the specified and actual number of allocations for any pair of clients never exceeds one, independent of $n_a$. This is because the only source of relative error is due to quantization.
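The quantization bound can be checked with a small simulation. The sketch below assumes the basic two-client stride algorithm with a stride constant of $2^{20}$, initial pass equal to stride, and ties broken in favor of the first client; the function name `stride_max_error()` and its interface are hypothetical. To stay in integer arithmetic, the error is reported in units of $1/T$ quantum, where $T$ is the ticket total.

```c
#include <assert.h>
#include <stdlib.h>

#define STRIDE1 (1 << 20)   /* large integer stride constant (assumed value) */

/*
 * Simulate two clients A and B under basic stride scheduling.
 * Returns the maximum over time of |T * allocations_A - tickets_a * n|,
 * i.e. the absolute error scaled by the ticket total T.
 * If final_error is non-NULL, it receives the scaled error after the last quantum.
 */
int stride_max_error(int tickets_a, int tickets_b, int quanta, int *final_error)
{
    int total = tickets_a + tickets_b;
    int stride_a, stride_b, pass_a, pass_b;
    int count_a = 0, err = 0, max_err = 0, n;

    assert(tickets_a > 0 && tickets_b > 0);
    stride_a = STRIDE1 / tickets_a;
    stride_b = STRIDE1 / tickets_b;
    pass_a = stride_a;
    pass_b = stride_b;

    for (n = 1; n <= quanta; n++) {
        /* allocate the quantum to the client with the minimum pass */
        if (pass_a <= pass_b) {
            pass_a += stride_a;
            count_a++;
        } else {
            pass_b += stride_b;
        }
        /* specified allocation for A after n quanta is n * tickets_a / total */
        err = abs(total * count_a - tickets_a * n);
        if (err > max_err)
            max_err = err;
    }
    if (final_error != NULL)
        *final_error = err;
    return max_err;
}
```

A return value below the ticket total means the error stayed under one quantum; for a 7 : 3 allocation over 100 quanta the scaled error stays below 10 and returns to zero at the end of the run.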

[Figure 8 plots: cumulative quanta versus time (quanta) for clients A, B, and C; panels (a) and (b).]

Figure 8: Lottery vs. Stride Scheduling. Simulation results for 100 allocations involving three clients, A, B, and C, with a 3 : 2 : 1 allocation. The dashed lines represent ideal proportional-share behavior. (a) Allocation by randomized lottery scheduler shows significant variability. (b) Allocation by deterministic stride scheduler exhibits precise periodic behavior: A, B, A, A, B, C.

[Figure 9 plots: error (quanta) versus time (quanta); panels (a) Lottery 7:3, (b) Stride 7:3, (c) Lottery 19:1, (d) Stride 19:1.]

Figure 9: Throughput Accuracy. Simulation results for two clients with 7 : 3 (top) and 19 : 1 (bottom) ticket ratios over 1000 allocations. Only the first 100 quanta are shown for the stride scheduler, since its quantization error is deterministic and periodic. (a) Mean lottery scheduler error, averaged over 1000 separate 7 : 3 runs. (b) Stride scheduler error for a single 7 : 3 run. (c) Mean lottery scheduler error, averaged over 1000 separate 19 : 1 runs. (d) Stride scheduler error for a single 19 : 1 run.

Figure 9 plots the absolute error* that results from simulating two clients under both lottery scheduling and stride scheduling. The data depicted is representative of our simulation results over a large range of pairwise ratios. Figure 9(a) shows the mean error averaged over 1000 separate lottery scheduler runs with a 7 : 3 ticket ratio. As expected, the error increases slowly with $n_a$, indicating that accuracy steadily improves when error is measured as a percentage of $n_a$. Figure 9(b) shows the error for a single stride scheduler run with the same 7 : 3 ticket ratio. As expected, the error never exceeds a single quantum, and follows a deterministic pattern with period 10. The error drops to zero at the end of each complete period, corresponding to a precise 7 : 3 allocation. Figures 9(c) and 9(d) present data for similar experiments involving a larger 19 : 1 ticket ratio.

* In this case the relative and absolute errors are identical, since there are only two clients.

5.2 Dynamic Ticket Allocations

Figure 10 plots the absolute error that results from simulating two clients under both lottery scheduling and stride scheduling with rapidly-changing dynamic ticket allocations. This data is representative of simulation results over a large range of pairwise ratios and a variety of dynamic modification techniques. For easy comparison, the average dynamic ticket ratios are identical to the static ticket ratios used in Figure 9.

The notation [A, B] indicates a random ticket allocation that is uniformly distributed from A to B. New, randomly-generated ticket allocations were dynamically assigned every other quantum. The client_modify() operation was executed for each change under stride scheduling; no special actions were necessary under lottery scheduling. To compute error values, specified allocations were determined incrementally. Each client's specified allocation was advanced by $t/T$ on every quantum, where $t$ is the client's current ticket allocation, and $T$ is the current ticket total.

Figure 10(a) shows the mean error averaged over 1000 separate lottery scheduler runs with a [2,12] : 3 ticket ratio. Despite the dynamic changes, the mean error is nearly the same as that measured for the static 7 : 3 ratio depicted in Figure 9(a). Similarly, Figure 10(b) shows the error for a single stride scheduler run with the same dynamic [2,12] : 3 ratio. The error never exceeds a single quantum, although it is much more erratic than the periodic pattern exhibited for the static 7 : 3 ratio in Figure 9(b). Figures 10(c) and 10(d) present data for similar experiments involving a larger dynamic 190 : [5,15] ratio. The results for this allocation are comparable to those measured for the static 19 : 1 ticket ratio depicted in Figures 9(c) and 9(d).

Overall, the error measured under both lottery scheduling and stride scheduling is largely unaffected by dynamic ticket modifications. This suggests that both mechanisms are well-suited to dynamic environments. However, stride scheduling is clearly more accurate in both static and dynamic environments.

5.3 Response Time Variability

Another important performance metric is response time, which we measure as the elapsed time from a client's completion of one quantum up to and including its completion of another. Under randomized lottery scheduling, client response times have a geometric distribution. The expected number of lotteries $n_l$ that a client must wait before its first win is $E[n_l] = 1/p$, with variance $\sigma_{n_l}^2 = (1 - p)/p^2$. Deterministic stride scheduling exhibits dramatically less response-time variability.
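As a worked illustration of the geometric distribution (computed from the standard formulas, not taken from the measurements reported below), consider a lottery client holding 1 of 20 tickets, so that its per-lottery win probability is $p = 1/20$. Writing $n_l$ for the number of lotteries until its first win:

```latex
\begin{align*}
E[n_l]         &= \frac{1}{p} = 20 \\
\sigma_{n_l}^2 &= \frac{1 - p}{p^2} = \frac{0.95}{0.0025} = 380 \\
\sigma_{n_l}   &= \sqrt{380} \approx 19.5
\end{align*}
```

For a low-probability client the standard deviation is thus nearly as large as the mean, which is the source of the long tails under lottery scheduling.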

Figures 11 and 12 present client response time distributions under both lottery scheduling and stride scheduling. Figure 11 shows the response times that result from simulating two clients with a 7 : 3 ticket ratio for one million allocations. The stride scheduler distributions are very tight, while the lottery scheduler distributions are geometric with long tails. For example, the client with the smaller allocation had a maximum response time of 4 quanta under stride scheduling, while the maximum response time under lottery scheduling was 39.

Figure 12 presents similar data for a larger 19 : 1 ticket ratio. Although there is little difference in the response time distributions for the client with the larger allocation, the difference is enormous for the client with the smaller allocation. Under stride scheduling, virtually all of the response times were exactly 20 quanta. The lottery scheduler produced geometrically-distributed response times ranging from 1 to 194 quanta. In this case, the standard deviation of the stride scheduler's distribution is three orders of magnitude smaller than the standard deviation of the lottery scheduler's distribution.

[Figure 10 plots: error (quanta) versus time (quanta); panels (a) Lottery [2,12]:3, (b) Stride [2,12]:3, (c) Lottery 190:[5,15], (d) Stride 190:[5,15].]

Figure 10: Throughput Accuracy – Dynamic Allocations. Simulation results for two clients with [2,12] : 3 (top) and 190 : [5,15] (bottom) ticket ratios over 1000 allocations. The notation [A, B] indicates a random ticket allocation that is uniformly distributed from A to B. Random ticket allocations were dynamically updated every other quantum. (a) Mean lottery scheduler error, averaged over 1000 separate [2,12] : 3 runs. (b) Stride scheduler error for a single [2,12] : 3 run. (c) Mean lottery scheduler error, averaged over 1000 separate 190 : [5,15] runs. (d) Stride scheduler error for a single 190 : [5,15] run.

[Figure 11 plots: frequency (thousands) versus response time (quanta); panels (a) Lottery - 7, (b) Stride - 7, (c) Lottery - 3, (d) Stride - 3.]

Figure 11: Response Time Distribution. Simulation results for two clients with a 7 : 3 ticket ratio over one million allocations. (a) Client with 7 tickets under lottery scheduling. (b) Client with 7 tickets under stride scheduling. (c) Client with 3 tickets under lottery scheduling. (d) Client with 3 tickets under stride scheduling. [Per-panel summary statistics illegible in source.]

[Figure 12 plots: frequency (thousands) versus response time (quanta); panels (a) Lottery - 19, (b) Stride - 19, (c) Lottery - 1, (d) Stride - 1.]

Figure 12: Response Time Distribution. Simulation results for two clients with a 19 : 1 ticket ratio over one million allocations. (a) Client with 19 tickets under lottery scheduling. (b) Client with 19 tickets under stride scheduling. (c) Client with 1 ticket under lottery scheduling. (d) Client with 1 ticket under stride scheduling. [Per-panel summary statistics illegible in source.]

5.4 Hierarchical Stride Scheduling

As discussed in Section 4, stride scheduling can produce an absolute error of $O(n_c)$ for skewed ticket distributions, where $n_c$ is the number of clients. In contrast, hierarchical stride scheduling bounds the absolute error to $O(\lg n_c)$. As a result, response-time variability can be significantly reduced under hierarchical stride scheduling.

Figure 13 presents client response time distributions under both hierarchical stride scheduling and ordinary stride scheduling. Eight clients with a 7 : 1 : ... : 1 ticket ratio were simulated for one million allocations. Excluding the very first allocation, the response time for each of the low-throughput clients was always 14, under both schedulers. Thus we only present response time distributions for the high-throughput client.

The ordinary stride scheduler runs the high-throughput client for 7 consecutive quanta, and then runs each of the low-throughput clients for one quantum. The hierarchical stride scheduler interleaves the clients, resulting in a tighter distribution. In this case, the standard deviation of the ordinary stride scheduler's distribution is more than twice as large as that for the hierarchical stride scheduler. We observed a maximum absolute error of 4 quanta for the high-throughput client under ordinary stride scheduling, and only 1.5 quanta under hierarchical stride scheduling.
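The clustering behavior of the ordinary (flat) stride scheduler can be reproduced with a short simulation. This is a sketch under stated assumptions: a stride constant of $2^{20}$, initial pass equal to stride, a linear scan for the minimum pass with ties going to the lowest client index, and a hypothetical helper name `longest_run()`.

```c
#include <assert.h>

#define STRIDE1 (1 << 20)   /* large integer stride constant (assumed value) */
#define MAX_CLIENTS 16      /* arbitrary limit for this sketch */

/*
 * Run a flat stride scheduler for the given number of quanta and return
 * the longest sequence of consecutive quanta allocated to `client`.
 */
int longest_run(const int *tickets, int nclients, int quanta, int client)
{
    int stride[MAX_CLIENTS], pass[MAX_CLIENTS];
    int i, q, best, run = 0, longest = 0;

    assert(nclients > 0 && nclients <= MAX_CLIENTS);
    assert(client >= 0 && client < nclients);
    for (i = 0; i < nclients; i++) {
        assert(tickets[i] > 0);
        stride[i] = STRIDE1 / tickets[i];
        pass[i] = stride[i];
    }

    for (q = 0; q < quanta; q++) {
        /* linear scan for the client with the minimum pass */
        best = 0;
        for (i = 1; i < nclients; i++)
            if (pass[i] < pass[best])
                best = i;
        pass[best] += stride[best];

        /* track consecutive allocations to the client of interest */
        if (best == client) {
            if (++run > longest)
                longest = run;
        } else {
            run = 0;
        }
    }
    return longest;
}
```

With tickets `{7, 1, 1, 1, 1, 1, 1, 1}` and 28 quanta, the 7-ticket client receives runs of 7 consecutive quanta, while each 1-ticket client never runs twice in a row.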

6 Prototype Implementations

We implemented two prototype stride schedulers by modifying the Linux 1.1.50 kernel on a 25MHz i486-based IBM Thinkpad 350C. The first prototype enables proportional-share control over processor time, and the second enables proportional-share control over network transmission bandwidth.

6.1 Process Scheduler

The goal of our first prototype was to permit proportional-share allocation of processor time to control relative computation rates. We primarily changed the kernel code that handles process scheduling, switching from a conventional priority scheduler to a stride-based algorithm with a scheduling quantum of 100 milliseconds. Ticket allocations can be specified via a new stride_cpu_set_tickets() system call. We did not implement support for higher-level abstractions such as ticket transfers and currencies. Fewer than 300 lines of source code were added or modified to implement our changes.

[Figure 13 plots: frequency (thousands) versus response time (quanta); panels (a) and (b).]

Figure 13: Hierarchical Stride Scheduling. Response time distributions for a simulation of eight clients with a 7 : 1 : ... : 1 ticket ratio over one million allocations. Response times are shown only for the client with 7 tickets. (a) Hierarchical Stride Scheduler. (b) Ordinary Stride Scheduler. [Per-panel summary statistics illegible in source.]

[Figure 14 plot: observed iteration ratio versus allocated ratio.]

Figure 14: CPU Rate Accuracy. For each allocation ratio, the observed iteration ratio is plotted for each of three 30 second runs. The gray line indicates the ideal where the two ratios are identical. The observed ratios are within 1% of the ideal for all data points.

Our first experiment tested the accuracy with which our prototype could control the relative execution rate of computations. Each point plotted in Figure 14 indicates the relative execution rate that was observed for two processes running the compute-bound arith integer arithmetic benchmark [Byt91]. Three thirty-second runs were executed for each integral ratio between one and ten. In all cases, the observed ratios are within 1% of the ideal. We also ran experiments involving higher ratios, and found that the observed ratio for a 20 : 1 allocation ranged from 19.94 to 20.04, and the observed ratio for a 50 : 1 allocation ranged from 49.93 to 50.44.

Our next experiment examined the scheduler's behavior over shorter time intervals. Figure 15 plots average iteration counts over a series of 2-second time windows during a single 60 second execution with a 3 : 1 allocation. The two processes remain close to their allocated ratios throughout the experiment. Note that if we used a 10 millisecond time quantum instead of the scheduler's 100 millisecond quantum, the same degree of fairness would be observed over a series of 200 millisecond time windows.

[Figure 15 plot: average iterations (per sec) versus time (sec).]

Figure 15: CPU Fairness Over Time. Two processes executing the compute-bound arith benchmark with a 3 : 1 ticket allocation. Averaged over the entire run, the two processes executed 2409.18 and 802.89 iterations/sec., for an actual ratio of 3.001:1.

To assess the overhead imposed by our prototype stride scheduler, we ran performance tests consisting of concurrent arith benchmark processes. Overall, we found that the performance of our prototype was comparable to that of the standard Linux process scheduler. Compared to unmodified Linux, groups of 1, 2, 4, and 8 arith processes each completed fewer iterations under stride scheduling, but the difference was always less than 0.2%.

However, neither the standard Linux scheduler nor our prototype stride scheduler is particularly efficient. For example, the Linux scheduler performs a linear scan of all processes to find the one with the highest priority. Our prototype also performs a linear scan to find the process with the minimum pass; an $O(\lg n_c)$ time implementation would have required substantial changes to existing kernel code.


6.2 Network Device Scheduler

The goal of our second prototype was to permit proportional-share control over transmission bandwidth for network devices such as Ethernet and SLIP interfaces. Such control would be particularly useful for applications such as concurrent ftp file transfers, and concurrent http Web server replies. For example, many Web servers have relatively slow connections to the Internet, resulting in substantial delays for transfers of large objects such as graphical images. Given control over relative transmission rates, a Web server could provide different levels of service to concurrent clients. For example, tickets* could be issued by servers based upon the requesting user, machine, or domain. Commercial servers could even sell tickets to clients demanding faster service.

We primarily changed the kernel code that handles generic network device queueing. This involved switching from conventional FIFO queueing to stride-based queueing that respects per-socket ticket allocations. Ticket allocations can be specified via a new SO_TICKETS option to the setsockopt() system call. Although not implemented in our prototype, a more complete system should also consider additional forms of admission control to manage other system resources, such as network buffers. Fewer than 300 lines of source code were added or modified to implement our changes.

Our first experiment tested the prototype's ability to control relative network transmission rates on a local area network. We used the ttcp network test program† [TTC91] to transfer fabricated buffers from an IBM Thinkpad 350C running our modified Linux kernel, to a DECStation 5000/133 running Ultrix. Both machines were on the same physical subnet, connected via a 10Mbps Ethernet that also carried network traffic for other users.

* To be included with http requests, tickets would require an external data representation. If security is a concern, cryptographic techniques could be employed to prevent forgery and theft.

† We made a few minor modifications to the standard ttcp benchmark. Other than extensions to specify ticket allocations and facilitate coordinated timing, we also decreased the value of a hard-coded delay constant. This constant is used to temporarily put a transmitting process to sleep when it is unable to write to a socket due to a lack of buffer space (ENOBUFS). Without this modification, the observed throughput ratios were consistently lower than specified allocations, with significant differences for large ratios. With the larger delay constant, we believe that the low-throughput client is able to continue sending packets while the high-throughput client is sleeping, distorting the intended throughput ratio. Of course, changing the kernel interface to signal a process when more buffer space becomes available would probably be preferable to polling.

[Figure 16 plot: observed throughput ratio versus allocated ratio.]

Figure 16: Ethernet UDP Rate Accuracy. For each allocation ratio, the observed data transmission ratio is plotted for each of three runs. The gray line indicates the ideal where the two ratios are identical. The observed ratios are within 5% of the ideal for all data points.

Each point plotted in Figure 16 indicates the relative UDP data transmission rate that was observed for two processes running the ttcp benchmark. Each experiment started with both processes on the sending machine attempting to transmit 4K buffers, each containing 8Kbytes of data, for a total 32Mbyte transfer. As soon as one process finished sending its data, it terminated the other process via a Unix signal. Metrics were recorded on the receiving machine to capture end-to-end application throughput. The observed ratios are very accurate; all data points are within 5% of the ideal. For larger ticket ratios, the observed throughput ratio is slightly lower than the specified allocation. For example, a 20 : 1 allocation resulted in actual throughput ratios ranging from 18.51 : 1 to 18.77 : 1.

To assess the overhead imposed by our prototype, we ran performance tests consisting of concurrent ttcp benchmark processes. Overall, we found that the performance of our prototype was comparable to that of standard Linux. Although the prototype increases the length of the critical path for sending a network packet, we were unable to observe any significant difference between unmodified Linux and stride scheduling. We believe that the small additional overhead of stride scheduling was masked by the variability of external network traffic from other users; individual differences were in the range of [value illegible in source]%.

7 Related Work

We independently developed stride scheduling as a deterministic alternative to the randomized selection aspect of lottery scheduling [Wal94]. We then discovered that the core allocation algorithm used in stride scheduling is nearly identical to elements of rate-based flow-control algorithms designed for packet-switched networks [Dem90, Zha91, ZhK91, Par93]. Despite the relevance of this networking research, to the best of our knowledge it has not been discussed in the processor scheduling literature. In this section we discuss a variety of related scheduling work, including rate-based network flow control, deterministic proportional-share schedulers, priority schedulers, real-time schedulers, and microeconomic schedulers.

7.1 Rate-Based Network Flow Control

Our basic stride scheduling algorithm is very similar to Zhang's VirtualClock algorithm for packet-switched networks [Zha91]. In this scheme, a network switch orders packets to be forwarded through outgoing links. Every packet belongs to a client data stream, and each stream has an associated bandwidth reservation. A virtual clock is assigned to each stream, and each of its packets is stamped with its current virtual time upon arrival. With each arrival, the virtual clock advances by a virtual tick that is inversely proportional to the stream's reserved data rate. Using our stride-oriented terminology, a virtual tick is analogous to a stride, and a virtual clock is analogous to a pass value.

The VirtualClock algorithm is closely related to the weighted fair queueing (WFQ) algorithm developed by Demers, Keshav, and Shenker [Dem90], and Parekh and Gallager's equivalent packet-by-packet generalized processor sharing (PGPS) algorithm [Par93]. One difference that distinguishes WFQ and PGPS from VirtualClock is that they effectively maintain a global virtual clock. Arriving packets are stamped with their stream's virtual tick plus the maximum of their stream's virtual clock and the global virtual clock. Without this modification, an inactive stream can later monopolize a link as its virtual clock catches up to those of active streams; such behavior is possible under the VirtualClock algorithm [Par93].
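The difference between the two stamping rules can be sketched in a few lines. The code below follows only the simplified description given in the text, not the full published algorithms; `struct stream`, `stamp_virtualclock()`, and `stamp_with_global()` are hypothetical names introduced for illustration.

```c
#include <assert.h>

struct stream {
    long vclock;   /* per-stream virtual clock (analogous to pass) */
    long vtick;    /* virtual tick, inversely proportional to reserved rate */
};

/* Per-stream rule: advance the stream's own clock by one tick. */
long stamp_virtualclock(struct stream *s)
{
    assert(s->vtick > 0);
    s->vclock += s->vtick;
    return s->vclock;
}

/* Global-clock rule: tick plus max(stream clock, global clock). */
long stamp_with_global(struct stream *s, long global_vclock)
{
    long base;

    assert(s->vtick > 0);
    base = (s->vclock > global_vclock) ? s->vclock : global_vclock;
    s->vclock = base + s->vtick;
    return s->vclock;
}
```

For a stream that has been idle with its clock at 0 while the global clock has advanced to 1000, the per-stream rule produces a stamp of 10 for a tick of 10, far behind the active streams, whereas the global-clock rule produces 1010, so the returning stream gains no backlog of credit.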

Our stride scheduler's use of a global pass variable is based on the global virtual clock employed by WFQ/PGPS, which follows an update rule that produces a smoothly varying global virtual time. Before we became aware of the WFQ/PGPS work, we used a simpler global pass update rule: global pass was set to the pass value of the client that currently owns the resource. To see the difference between these approaches, consider the set of minimum pass values over time in Figure 2. Although the average pass value increase per quantum is 1, the actual increases occur in non-uniform steps. We adopted the smoother WFQ/PGPS virtual time rule to improve the accuracy of pass updates associated with dynamic modifications.

To the best of our knowledge, our work on stride scheduling is the first cross-application of rate-based network flow control algorithms to scheduling other resources such as processor time. New techniques were required to support dynamic changes and higher-level abstractions such as ticket transfers and currencies. Our hierarchical stride scheduling algorithm is a novel recursive application of the basic technique that exhibits improved throughput accuracy and reduced response time variability compared to prior schemes.

7.2 Proportional-Share Schedulers

Several other deterministic approaches have recently been proposed for proportional-share processor scheduling [Fon95, Mah95, Sto95]. However, all require expensive operations to transform client state in response to dynamic changes. This makes them less attractive than stride scheduling for supporting dynamic or distributed environments. Moreover, although each algorithm is explicitly compared to lottery scheduling, none provides efficient support for the flexible resource management abstractions introduced with lottery scheduling.

Stoica and Abdel-Wahab have devised an interesting scheduler using a deterministic generator that employs a bit-reversed counter in place of the random number generator used by lottery scheduling [Sto95]. Their algorithm results in an absolute error for throughput that is $O(\lg n_a)$, where $n_a$ is the number of allocations. Allocations can be performed efficiently in $O(\lg n_c)$ time using a tree-based data structure, where $n_c$ is the number of clients. However, dynamic modifications to the set of active clients or their allocations require executing a relatively complex "restart" operation with $O(n_c)$ time complexity. Also, no support is provided for fractional or nonuniform quanta.

Maheshwari has developed a deterministic charge-based proportional-share scheduler [Mah95]. Loosely based on an analogy to digitized line drawing, this scheme has a maximum relative throughput error of one quantum, and also supports fractional quanta. Although efficient in many cases, allocation has a worst-case $O(n_c)$ time complexity, where $n_c$ is the number of clients. Dynamic modifications require executing a "refund" operation with $O(n_c)$ time complexity.

Fong and Squillante have introduced a general scheduling approach called time-function scheduling (TFS) [Fon95]. TFS is intended to provide differential treatment of job classes, where specific throughput ratios are specified across classes, while jobs within each class are scheduled in a FCFS manner. Time functions are used to compute dynamic job priorities as a function of the time each job has spent waiting since it was placed on the run queue. Linear functions result in proportional sharing: a job's value is equal to its waiting time multiplied by its job-class slope, plus a job-class constant. An allocation is performed by selecting the job with the maximum time-function value. A naive implementation would be very expensive, but since jobs are grouped into classes, allocation can be performed in $O(n)$ time, where $n$ is the number of distinct classes. If time-function values are updated infrequently compared to the scheduling quantum, then a priority queue can be used to reduce the allocation cost to $O(\lg n)$, with an $O(n \lg n)$ cost to rebuild the queue after each update.
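The linear time-function rule can be illustrated with a minimal sketch. It reflects only the description above and is the naive selection over individual jobs, not the class-based or priority-queue implementations; `struct job`, `tfs_value()`, and `tfs_select()` are hypothetical names, not part of TFS as published.

```c
#include <assert.h>

struct job {
    double waiting_time;   /* time spent waiting on the run queue */
    double slope;          /* job-class slope */
    double constant;       /* job-class constant */
};

/* Linear time function: waiting time times slope, plus a constant. */
double tfs_value(const struct job *j)
{
    return j->waiting_time * j->slope + j->constant;
}

/* Naive selection: scan all jobs and return the index of the maximum value. */
int tfs_select(const struct job *jobs, int njobs)
{
    int i, best = 0;

    assert(njobs > 0);
    for (i = 1; i < njobs; i++)
        if (tfs_value(&jobs[i]) > tfs_value(&jobs[best]))
            best = i;
    return best;
}
```

For three jobs with (waiting time, slope, constant) of (4, 1, 0), (2, 3, 0), and (1, 2, 1), the values are 4, 6, and 3, so the second job is selected even though it has waited less than the first.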

When Fong and Squillante compared TFS to lottery scheduling, they found that although throughput accuracy was comparable, the waiting time variance of low-throughput tasks was often several orders of magnitude larger under lottery scheduling. This observation is consistent with our simulation results involving response time, presented in Section 5. TFS also offers the potential to specify performance goals that are more general than proportional sharing. However, when proportional sharing is the goal, stride scheduling has advantages in terms of efficiency and accuracy.

7.3 Priority Schedulers

Conventional operating systems typically employ priority schemes for scheduling processes [Dei90, Tan92]. Priority schedulers are not designed to provide proportional-share control over relative computation rates, and are often ad-hoc. Even popular priority-based approaches such as decay-usage scheduling are poorly understood, despite the fact that they are employed by numerous operating systems, including Unix [Hel93].

Fair share schedulers allocate resources so that users get fair machine shares over long periods of time [Hen84, Kay88, Hel93]. These schedulers are layered on top of conventional priority schedulers, and dynamically adjust priorities to push actual usage closer to entitled shares. The algorithms used by these systems are generally complex, requiring periodic usage monitoring, complicated dynamic priority adjustments, and administrative parameter setting to ensure fairness on a time scale of minutes.

7.4 Real-Time Schedulers

Real-time schedulers are designed for time-critical systems [Bur91]. In these systems, which include many aerospace and military applications, timing requirements impose absolute deadlines that must be met to ensure correctness and safety; a missed deadline may have dire consequences. One of the most widely used techniques in real-time systems is rate-monotonic scheduling, in which priorities are statically assigned as a monotonic function of the rate of periodic tasks [Liu73, Sha91]. The importance of a task is not reflected in its priority; tasks with shorter periods are simply assigned higher priorities. Bounds on total processor utilization (ranging from 69% to nearly 100%, depending on various assumptions) ensure that rate monotonic scheduling will meet all task deadlines. Another popular technique is earliest deadline scheduling, which always schedules the task with the closest deadline first. The earliest deadline approach permits high processor utilization, but has increased runtime overhead due to the use of dynamic priorities; the task with the nearest deadline varies over time.

In general, real-time schedulers depend upon very restrictive assumptions, including precise static knowledge of task execution times and prohibitions on task interactions. In addition, limitations are placed on processor utilization, and even transient overloads are disallowed. In contrast, the proportional-share model used by stride scheduling and lottery scheduling is designed for more general-purpose environments. Task allocations degrade gracefully in overload situations, and active tasks proportionally benefit from extra resources when some allocations are not fully utilized. These properties facilitate adaptive applications that can respond to changes in resource availability.

Mercer, Savage, and Tokuda recently introduced a higher-level processor capacity reserve abstraction [Mer94] for measuring and controlling processor usage in a microkernel system with an underlying real-time scheduler. Reserves can be passed across protection boundaries during interprocess communication, with an effect similar to our use of ticket transfers. While this approach works well for many multimedia applications, its reliance on resource reservations and admission control is still more restrictive than the general-purpose model that we advocate.

7.5 Microeconomic Schedulers

Microeconomic schedulers are based on metaphors to resource allocation in real economic systems. Money encapsulates resource rights, and a price mechanism is used to allocate resources. Several microeconomic schedulers [Dre88, Mil88, Fer88, Fer89, Wal89, Wal92, Wel93] use auctions to determine prices and allocate resources among clients that bid monetary funds. Both the escalator algorithm proposed for uniprocessor scheduling [Dre88] and the distributed Spawn system [Wal92] rely upon auctions in which bidders increase their bids linearly over time. Since auction dynamics can be unexpectedly volatile, auction-based approaches sometimes fail to achieve resource allocations that are proportional to client funding. The overhead of bidding also limits the applicability of auctions to relatively coarse-grained tasks. Other market-based approaches that do not rely upon auctions have also been applied to managing processor and memory resources [Ell75, Har92, Che93].

Stride scheduling and lottery scheduling are compatible with a market-based resource management philosophy. Our mechanisms for proportional sharing provide a convenient substrate for pricing individual time-shared resources in a computational economy. For example, tickets are analogous to monetary income streams, and the number of tickets competing for a resource can be viewed as its price. Our currency abstraction for flexible resource management is also loosely borrowed from economics.

8 Conclusions

We have presented stride scheduling, a deterministic technique that provides accurate control over relative computation rates. Stride scheduling also efficiently supports the same flexible, modular resource management abstractions introduced by lottery scheduling. Compared to lottery scheduling, stride scheduling achieves significantly improved accuracy over relative throughput rates, with significantly less response time variability. However, lottery scheduling is conceptually simpler than stride scheduling. For example, stride scheduling requires careful state updates for dynamic changes, while lottery scheduling is effectively stateless.

The core allocation mechanism used by stride scheduling is based on rate-based flow-control algorithms for networks. One contribution of this paper is a cross-application of these algorithms to the domain of processor scheduling. New techniques were developed to support dynamic modifications to client allocations and resource right transfers between clients. We also introduced a new hierarchical stride scheduling algorithm that exhibits improved throughput accuracy and lower response time variability compared to prior schemes.

Acknowledgements

We would like to thank Kavita Bala, Dawson Engler, Paige Parsons, and Lyle Ramshaw for their many helpful comments. Thanks to Tom Rodeheffer for suggesting the connection between our work and rate-based flow-control algorithms in the networking literature. Special thanks to Paige for her help with the visual presentation of stride scheduling.

References

[Bur91] A. Burns. “Scheduling Hard Real-Time Systems: A Review,” Software Engineering Journal, May 1991.

[Byt91] Byte Unix Benchmarks, Version 3, 1991. Available via Usenet and anonymous ftp from many locations, including gatekeeper.dec.com.

[Che93] D. R. Cheriton and K. Harty. “A Market Approach to Operating System Memory Allocation,” Working Paper, Computer Science Department, Stanford University, June 1993.

[Cor90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, MIT Press, 1990.

[Dei90] H. M. Deitel. Operating Systems, Addison-Wesley, 1990.

[Dem90] A. Demers, S. Keshav, and S. Shenker. “Analysis and Simulation of a Fair Queueing Algorithm,” Internetworking: Research and Experience, September 1990.

[Dre88] K. E. Drexler and M. S. Miller. “Incentive Engineering for Computational Resource Management,” in The Ecology of Computation, B. Huberman (ed.), North-Holland, 1988.

[Ell75] C. M. Ellison. “The Utah TENEX Scheduler,” Proceedings of the IEEE, June 1975.

[Fer88] D. Ferguson, Y. Yemini, and C. Nikolaou. “Microeconomic Algorithms for Load-Balancing in Distributed Computer Systems,” International Conference on Distributed Computer Systems, 1988.

[Fer89] D. F. Ferguson. “The Application of Microeconomics to the Design of Resource Allocation and Control Algorithms,” Ph.D. thesis, Columbia University, 1989.

[Fon95] L. L. Fong and M. S. Squillante. “Time-Functions: A General Approach to Controllable Resource Management,” Working Draft, IBM Research Division, T.J. Watson Research Center, Yorktown Heights, NY, March 1995.

[Har92] K. Harty and D. R. Cheriton. “Application-Controlled Physical Memory using External Page-Cache Management,” Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, October 1992.

[Hel93] J. L. Hellerstein. “Achieving Service Rate Objectives with Decay Usage Scheduling,” IEEE Transactions on Software Engineering, August 1993.

[Hen84] G. J. Henry. “The Fair Share Scheduler,” AT&T Bell Laboratories Technical Journal, October 1984.

[Kay88] J. Kay and P. Lauder. “A Fair Share Scheduler,” Communications of the ACM, January 1988.

[Liu73] C. L. Liu and J. W. Layland. “Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment,” Journal of the ACM, January 1973.

[Mah95] U. Maheshwari. “Charge-Based Proportional Scheduling,” Working Draft, MIT Laboratory for Computer Science, Cambridge, MA, February 1995.

[Mer94] C. W. Mercer, S. Savage, and H. Tokuda. “Processor Capacity Reserves: Operating System Support for Multimedia Applications,” Proceedings of the IEEE International Conference on Multimedia Computing and Systems, May 1994.

[Mil88] M. S. Miller and K. E. Drexler. “Markets and Computation: Agoric Open Systems,” in The Ecology of Computation, B. Huberman (ed.), North-Holland, 1988.

[Par93] A. K. Parekh and R. G. Gallager. “A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: The Single-Node Case,” IEEE/ACM Transactions on Networking, June 1993.

[Pug90] W. Pugh. “Skip Lists: A Probabilistic Alternative to Balanced Trees,” Communications of the ACM, June 1990.

[Sha91] L. Sha, M. H. Klein, and J. B. Goodenough. “Rate Monotonic Analysis for Real-Time Systems,” in Foundations of Real-Time Computing: Scheduling and Resource Management, A. M. van Tilborg and G. M. Koob (eds.), Kluwer Academic Publishers, 1991.

[Sto95] I. Stoica and H. Abdel-Wahab. “A New Approach to Implement Proportional Share Resource Allocation,” Technical Report 95-05, Department of Computer Science, Old Dominion University, Norfolk, VA, April 1995.

[Tan92] A. S. Tanenbaum. Modern Operating Systems, Prentice Hall, 1992.

[TTC91] TTCP benchmarking tool. SGI version, 1991. Originally developed at the US Army Ballistics Research Lab (BRL). Available via anonymous ftp from many locations, including ftp.sgi.com.

[Wal89] C. A. Waldspurger. “A Distributed Computational Economy for Utilizing Idle Resources,” Master’s thesis, MIT, May 1989.

[Wal92] C. A. Waldspurger, T. Hogg, B. A. Huberman, J. O. Kephart, and W. S. Stornetta. “Spawn: A Distributed Computational Economy,” IEEE Transactions on Software Engineering, February 1992.

[Wal94] C. A. Waldspurger and W. E. Weihl. “Lottery Scheduling: Flexible Proportional-Share Resource Management,” Proceedings of the First Symposium on Operating Systems Design and Implementation, November 1994.

[Wal95] C. A. Waldspurger. “Lottery and Stride Scheduling: Flexible Proportional-Share Resource Management,” Ph.D. thesis, MIT, 1995 (to appear).

[Wel93] M. P. Wellman. “A Market-Oriented Programming Environment and its Application to Distributed Multicommodity Flow Problems,” Journal of Artificial Intelligence Research, August 1993.

[Zha91] L. Zhang. “Virtual Clock: A New Traffic Control Algorithm for Packet Switching Networks,” ACM Transactions on Computer Systems, May 1991.

[ZhK91] H. Zhang and S. Keshav. “Comparison of Rate-Based Service Disciplines,” Proceedings of SIGCOMM ’91, September 1991.

A Fixed-Point Stride Representation

The precision of relative rates that can be achieved depends on both the value of $\mathit{stride}_1$ and the relative ratios of client ticket allocations. For example, with $\mathit{stride}_1 = 2^{20}$, and a maximum ticket allocation of $2^{10}$ tickets, ratios are represented with 10 bits of precision. Thus, ratios close to unity resulting from allocations that differ by only one part per thousand, such as 1001 : 1000, can be supported.
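The effect of the fixed-point scale on precision can be illustrated with a minimal C sketch. It assumes that a client's stride is the truncating integer quotient of the scale constant and its ticket count, and it fixes the scale at $2^{20}$ as in the example above. The names `STRIDE1` and `stride_for` are chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point scale: 2^20, matching the example in the text. */
#define STRIDE1 (UINT64_C(1) << 20)

/* Integer stride for a client (illustrative helper).
 * A larger ticket count yields a smaller stride. */
uint64_t stride_for(uint64_t tickets)
{
    assert(tickets > 0);        /* a stride is undefined for zero tickets */
    return STRIDE1 / tickets;   /* truncating integer division */
}
```

Under these assumptions, `stride_for(1000)` and `stride_for(1001)` evaluate to 1048 and 1047, so allocations that differ by one part per thousand still map to distinct strides. With the maximum allocation of $2^{10}$ tickets the stride is $2^{10}$, which leaves 10 bits to distinguish ratios.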

Since $\mathit{stride}_1$ is a large integer, stride values will also be large for clients with small allocations. Since pass values are monotonically increasing, they will eventually overflow the machine word size after a large number of allocations. For a machine with 64-bit integers, this is not a practical problem. For example, with $\mathit{stride}_1 = 2^{20}$ and a worst-case client tickets = 1, approximately $2^{44}$ allocations can be performed before overflow occurs. At one allocation per millisecond, centuries of real time would elapse before an overflow.
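The overflow estimate can be checked with a short calculation, sketched here in C. The sketch assumes an unsigned pass value and the worst case in which every allocation advances pass by the full scale constant of $2^{20}$ (a client holding a single ticket). Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate number of allocations before a pass value that is
 * `word_bits` wide overflows, when each allocation advances pass by
 * 2^stride_log2 (worst case: a single ticket). Assumes unsigned pass. */
uint64_t allocations_before_overflow(unsigned word_bits, unsigned stride_log2)
{
    assert(word_bits <= 64 && stride_log2 <= word_bits);
    assert(word_bits - stride_log2 < 64);   /* keep the shift well defined */
    return UINT64_C(1) << (word_bits - stride_log2);
}

/* Whole years of real time covered at one allocation per millisecond,
 * using 365-day years. */
uint64_t years_at_one_per_ms(uint64_t allocations)
{
    const uint64_t ms_per_year = UINT64_C(1000) * 60 * 60 * 24 * 365;
    return allocations / ms_per_year;
}
```

For a 64-bit word, `allocations_before_overflow(64, 20)` gives $2^{44}$, and `years_at_one_per_ms` of that value is 557 years. For a 32-bit word the same calculation gives $2^{12} = 4096$ allocations.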

For a machine with 32-bit integers, the pass values associated with all clients can be adjusted by subtracting the minimum pass value from all clients whenever an overflow is detected. Alternatively, such adjustments can periodically be made after a fixed number of allocations. For example, with $\mathit{stride}_1 = 2^{20}$, a conservative adjustment period would be a few thousand allocations. Perhaps the most straightforward approach is to simply use a 64-bit integer type if one is available. Our prototype implementation makes use of the 64-bit “long long” integer type provided by the GNU C compiler.
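The 32-bit adjustment described above might look like the following C sketch. It is not the prototype's code: the `struct client` layout and the function name `renormalize_passes` are assumptions made for illustration, and only the per-client pass values mentioned in the text are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-client state; only the pass value matters here. */
struct client {
    uint32_t pass;
};

/* Subtract the minimum pass from every client. Relative order and
 * pairwise pass differences are preserved while absolute values shrink. */
void renormalize_passes(struct client *clients, size_t n)
{
    assert(clients != NULL || n == 0);
    if (n == 0)
        return;

    /* First pass over the array: find the minimum pass value. */
    uint32_t min_pass = clients[0].pass;
    for (size_t i = 1; i < n; i++) {
        if (clients[i].pass < min_pass)
            min_pass = clients[i].pass;
    }

    /* Second pass: shift every client down by that minimum. */
    for (size_t i = 0; i < n; i++)
        clients[i].pass -= min_pass;
}
```

Because every client is shifted by the same amount, the next client selected is unchanged. A scheduler could call such a routine either when an overflow is detected or after a fixed number of allocations, as the text suggests.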
