
CRTS 2014

Proceedings of the 7th International Workshop on
Compositional Theory and Technology for Real-Time Embedded Systems

Rome, Italy
December 2, 2014

In conjunction with:
The 35th Real-Time Systems Symposium (RTSS'14), December 3-5, 2014

Edited by Reinder J. Bril and Jinkyu Lee

© Copyright 2014 by the authors


Foreword

Welcome to Rome and the 7th International Workshop on Compositional Theory and Technology for Real-Time Embedded Systems (CRTS 2014). The CRTS workshops provide a forum for researchers and technologists to discuss the state of the art, present their work and contributions, and set future directions in compositional technology for real-time embedded systems.

CRTS 2014 is organized around presentations of papers (regular papers and invited extended abstracts) and a panel discussion focused on the "state-of-the-art and future directions" of CRTS. As usual, the presentations of regular papers address typical topics of CRTS. The invited presentations, on the other hand, particularly aim at open problems ("future directions") and give an indication of the difficulty of solving these problems. These latter presentations may be controversial or thought-provoking, but are also an invitation to join in tackling hard problems. In addition, they are meant to serve the organizing committee with respect to future directions for CRTS.

A total of 7 papers were selected for presentation at the workshop: 2 regular papers and 5 invited extended abstracts. These proceedings are also published as a Computer Science Report of Eindhoven University of Technology (CSR-1407), available at http://library.tue.nl/catalog/CSRPublication.csp?Action=GetByYear.

This year, CRTS is organized in conjunction with the 5th Analytical Virtual Integration of Cyber-Physical Systems Workshop (AVICPS 2014), which shares close theoretical and practical interests. Our joint program contains a keynote by Prof. Dr. Dr. h.c. Manfred Broy from the Technische Universität München.

We would like to thank the Organizing Committee listed below for granting us the honor, privilege, and opportunity to be the co-chairs of CRTS 2014.

Insup Lee, University of Pennsylvania, USA
Thomas Nolte, Mälardalen University, Sweden
Insik Shin, KAIST, Republic of Korea
Oleg Sokolsky, University of Pennsylvania, USA

Moreover, we would like to thank the Technical Program Committee listed below for their work in reviewing the regular papers and extended abstracts, and for helping to make the workshop a success.

Benny Akesson, Czech Technical University in Prague, Czech Republic
Luís Almeida, Universidade do Porto, Portugal
Björn Andersson, Software Engineering Institute at Carnegie Mellon University, USA
Moris Behnam, Mälardalen University, Sweden
Enrico Bini, Scuola Superiore Sant'Anna, Italy
Arvind Easwaran, Nanyang Technological University, Singapore
Martijn M.H.P. van den Heuvel, Eindhoven University of Technology (TU/e), The Netherlands
Hyun-Wook Jin, Konkuk University, Republic of Korea
Julio Luis Medina Pasaje, Universidad de Cantabria, Spain
Jan Reineke, Saarland University, Germany
Luca Santinelli, ONERA, France
Mikael Sjödin, Mälardalen University, Sweden
Linh Thi Xuan Phan, University of Pennsylvania, USA
Lothar Thiele, Swiss Federal Institute of Technology Zurich, Switzerland
Tullio Vardanega, Università di Padova, Italy

Last but not least, special thanks go to the RTSS 2014 Workshops Chair, Program Chair, and General Chair listed below, as well as the AVICPS 2014 co-chairs, for their support and assistance in organizing this joint event.

Rodolfo Pellizzoni, University of Waterloo, Canada (RTSS 2014 Workshops Chair)
Christopher D. Gill, Washington University in St. Louis, USA (RTSS 2014 Program Chair)
Michael González Harbour, Universidad de Cantabria, Spain (RTSS 2014 General Chair)
Sibin Mohan, University of Illinois at Urbana-Champaign, USA (AVICPS 2014 Co-chair)
Jean-Pierre Talpin, INRIA, France (AVICPS 2014 Co-chair)

Jinkyu Lee and Reinder J. Bril
Co-chairs, 7th International Workshop on Compositional Theory and Technology for Real-Time Embedded Systems (CRTS 2014)


Table of Contents

Regular papers

Supporting Fault-Tolerance in a Compositional Real-Time Scheduling Framework
Guy Martin Tchamgoue, Junho Seo, Jongsoo Hyun, Kyong Hoon Kim, and Yong-Kee Jun (p. 1)

Designing a Time-Predictable Memory Hierarchy for Single-Path Code
Bekim Cilku and Peter Puschner (p. 9)

Extended Abstracts

Five problems in compositionality of real-time systems
Björn Andersson (p. 15)

Compositional Mixed-Criticality Scheduling
Arvind Easwaran and Insik Shin (p. 16)

Challenges of Virtualization in Many-Core Real-Time Systems
Matthias Becker, Mohammad Ashjaei, Moris Behnam, and Thomas Nolte (p. 17)

Managing end-to-end resource reservations
Luís Almeida, Moris Behnam, and Paulo Pedreiras (p. 18)

Supporting Single-GPU Abstraction through Transparent Multi-GPU Execution for Real-Time Guarantees
Wookhyun Han, Hoon Sung Chwa, Hwidong Bae, Hyosu Kim, and Insik Shin (p. 19)


Supporting Fault-Tolerance in a Compositional Real-Time Scheduling Framework

Guy Martin Tchamgoue1, Junho Seo1, Jongsoo Hyun2, Kyong Hoon Kim1, and Yong-Kee Jun1

1Department of Informatics, Gyeongsang National University, 660-701, Jinju, South Korea
2Avionics SW Team, Korea Aerospace Industries, Ltd., Sacheon, South Korea

[email protected], [email protected], [email protected], {khkim,jun}@gnu.ac.kr

ABSTRACT
Component-based analysis allows a robust time and space decomposition of a complex real-time system into components, which are then recomposed and hierarchically scheduled under potentially different scheduling policies. This mechanism is of great benefit to many critical systems, as it enables fault isolation. A few works have recently emerged to provide fault-tolerant scheduling in a compositional real-time scheduling framework, but they remain inefficient in providing fault isolation or in terms of resource utilization. In this paper, we introduce a new interface model that takes into account the fault requirements of a component, and a fault-tolerant resource model that helps the component to effectively respond to each of its child components in the presence of a fault. Finally, we analyze the schedulability of the framework under the Rate Monotonic scheduling algorithm.

Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; C.4 [Performance of Systems]: Fault tolerance; D.4 [Operating Systems]: Process Management—Scheduling

General Terms
Theory, Reliability

Keywords
Compositional real-time scheduling, periodic resource model, periodic task model, fault-tolerant scheduling

1. INTRODUCTION
The increasing size and complexity and the requirement of high performance have led to the rapid adoption of component-based analysis in many cyber-physical systems. A compositional real-time scheduling framework allows multiple components, which may have been individually developed and validated, to be hierarchically composed and scheduled together. In this kind of open computing environment, a component or partition receives computational resources from its parent component and shares these resources with its child components through its own local scheduler. This robust space and time partitioning opens the way to achieving rigorous fault containment. Faults can therefore be transparently detected and handled by a fault management policy at each level of the hierarchy: intra-component (or task level), inter-component (or component level), and system level.

In safety-critical real-time systems, such as avionics [1] and automotive [2], where component-based analysis has become a standard, two main conflicting challenges must be addressed: (1) providing efficient resource sharing for economical reasons, and (2) guaranteeing the reliability of the system for validation and certification. Many compositional real-time scheduling frameworks [3, 5, 15, 16, 17] have already been proposed, but with a strong focus on efficient resource abstraction and sharing, schedulability analysis, and abstraction and runtime overheads. Research on a fault-tolerant compositional real-time scheduling framework is thus still to be done. Such a framework should provide an efficient resource model for effective resource sharing even in the presence of faults. Many error recovery strategies, such as redundancy [7, 11, 14], roll-back [6, 13, 19], and roll-forward [12, 18] with check-pointing [4], have already been devised in the long-studied field of fault tolerance in real-time systems, but their direct application to a compositional scheduling framework has not been thoroughly investigated.

Considering the periodic resource model [16], Hyun and Kim [8] proposed a task-level fault-tolerant framework and later extended it with component-level fault containment based on backup partitions [9]. Although it offers task- and component-level fault isolation, the approach remains inefficient, as the highest possible resource supply is always required to guarantee the feasibility of the system even in the absence of faults. Jin [10] extended the periodic resource model to support backup resource requirements, but does not provide fine-grained fault management, as the system permanently switches to a backup partition whenever a fault is detected inside the associated primary partition.

In this paper, we propose a new compositional real-time scheduling framework that uses the time redundancy technique to tolerate faults. Our framework introduces a new interface model that takes into account the real-time fault requirements of a component, and a resource model that helps the component to effectively respond to each of its child components in the presence of a fault. When a fault is detected inside a component, the new resource model guarantees to provide an extra resource to the faulty component only until the fault is handled, and thereafter switches back to the normal supply, as the demand of the component has also decreased. In contrast to previous approaches [8, 9, 10], the new model provides more flexible and efficient resource sharing in the presence of faults. The schedulability of the framework has been analyzed under the Rate Monotonic (RM) scheduling algorithm. Our analysis focuses only on errors that are caused by transient faults, allowing each single task of the system to define its own error recovery strategy.

The remainder of this paper is organized as follows. Section 2 presents our system model with an overview of a compositional real-time scheduling framework, and describes the problems addressed in the paper. Section 3 focuses on the proposed framework itself, introduces the new interface and resource models, and discusses the schedulability analysis with the RM algorithm. Section 4 provides details on how to compute each parameter that makes up the fault-tolerant interface model. Finally, the paper is concluded in Section 5.

2. BACKGROUND
This section presents our system model with an overview of a compositional real-time scheduling framework (CRTS), describes our fault model, and finally defines the problems handled in this paper.

2.1 System Model
In a compositional real-time scheduling framework [16, 17], components are organized in a tree-like hierarchy where an upper-layer component allocates resources to its child components, as shown in Figure 1. Thus, the basic scheduling unit (i.e., component or partition) of the framework is defined as C(W,R,A), where W is the workload, R the resource model supplied by the upper-layer component, and A the scheduling algorithm of the component.

In this paper, we assume that the workload W of each component is composed of a set of periodic real-time tasks running on a single-processor platform. Each task τi is then defined by its period pi and its worst-case execution time ei. We also assume the deadline of each task τi to be equal to its period pi.

A resource model R specifies the exact amount of resource to be allocated by a parent component to its child components. The periodic resource model Γ(Π,Θ) [16], as in Figure 1, guarantees a resource supply of Θ time units in every period of Π time units to a given component. In contrast to the resource model, the interface model abstracts a component together with its collective real-time requirements as a new real-time task. The periodic interface model I(P,E) [16] represents a component task I with execution time E and period P.

As an example, Figure 1 depicts a two-layer compositional real-time scheduling framework comprising three components, C0, C1, and C2. The two tasks of component C1, which are scheduled with EDF (Earliest Deadline First), are abstracted under the interface I1 as a periodic task with a period of 10 and an execution time of 3 time units. Similarly, component C2, which contains two tasks scheduled with RM (Rate Monotonic), is seen by the upper-layer component as a single task represented by I2. Thus, component C0, which is also summarized as interface I0, focuses on scheduling C1 and C2 as simple real-time tasks through their respective interfaces I1 and I2, thereby providing C1 and C2 with resource models R1 and R2, respectively.

Figure 1: An example of a compositional real-time scheduling framework

2.2 Fault Model
In this paper, we consider only errors that are caused by transient faults. We assume that only the single task under execution at the time of a fault occurrence is affected by the fault. Whenever a fault is detected, the state of the affected task is recovered by an appropriate error recovery strategy such as redundancy, rollback, or roll-forward. Therefore, we expect each task τi to define its own recovery strategy and thus maintain its own backup task, referred to as βi. For any task τi, the backup execution time, denoted by bi, is assumed to be no greater than the normal execution time ei (i.e., 0 ≤ bi ≤ ei). The backup execution time is defined according to the recovery strategy as follows:

• bi = ei: when the re-execution strategy is applied;

• 0 < bi < ei: for a forward recovery strategy such as an exception handler;

• bi = 0: when the fault is to be ignored.
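The three cases above can be captured in a small helper; this is an illustrative sketch, and the strategy labels ("reexecute", "forward", "ignore") are our own names, not notation from the paper.

```python
def backup_time(e_i: float, strategy: str, handler_time: float = 0.0) -> float:
    """Backup execution time b_i for a task with WCET e_i.

    Strategy labels are illustrative, not from the paper:
      "reexecute" -> b_i = e_i (re-run the whole task)
      "forward"   -> 0 < b_i < e_i (e.g. a short exception handler)
      "ignore"    -> b_i = 0 (fault is ignored)
    """
    if strategy == "reexecute":
        return e_i
    if strategy == "forward":
        # Forward recovery runs a handler strictly shorter than the task.
        assert 0.0 < handler_time < e_i, "handler must satisfy 0 < b_i < e_i"
        return handler_time
    if strategy == "ignore":
        return 0.0
    raise ValueError(f"unknown strategy: {strategy}")
```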

A fault is assumed to be detected at the end of the execution of each task, as this represents the worst-case scenario. Once a fault is detected on a task τi, its backup task βi is to be released and executed by the task's deadline. Thus, a task τi is supposed to finish at latest by (pi − bi) in order to leave enough slack time for its backup task. However, due to the nature of the resource model, the remaining slack time of bi may still be insufficient to cover the backup task, in which case we assume the recovery to start from the next period of the task. With a periodic resource model Γ(Π,Θ), for example, the system may become unschedulable because the resource supply of ⌊bi/Π⌋Θ cannot satisfy the backup requirement of bi time units for a faulty task τi. We also assume a fault to occur only once in a time interval of TF units, which represents the minimum distance between two consecutive faults in the system. When a fault is detected on a task τi, the faulty component may require an extra computational resource to cover the fault. However, due to the periodicity of the resource supply, and in order to preserve the schedulability of other components in the framework, the extra resource can only be claimed from the next resource period. Thus, each component of the framework is assumed to detect a task fault only at the end of each resource supply. Therefore, the component assumes the fault recovery process to start from the resource period that comes right after the one in which the fault was detected. It is important to emphasize that the backup task does not need to wait until the next resource period to be executed; it executes as soon as it gets ready.

2.3 Problem Statement
In this paper, we present a fault-tolerant compositional scheduling framework assuming the periodic real-time task model. Considering a single-fault model, we propose a task-level fault management scheme while addressing the following problems:

• Interface model: model the workload W of a component C(W,R,A) as a single periodic task that takes the deadline and fault requirements of each task into account. An upper-layer component can then use the interface model to efficiently share its resource with its child components.

• Resource model: guarantee an optimal resource supply to each component in order to satisfy its deadline and fault-tolerance requirements.

• Interface generation: effectively determine each parameter that makes up the interface model for each component of the framework.

• Schedulability analysis: guarantee to each component, especially in the presence of faults, the minimum resource supply that makes it schedulable.

We believe that such a fault-tolerant system will be useful, for example, in the design of a modern avionics mission computer that implements strict time and space partitioning based on the ARINC 653 standard [1]. In such a system, faults need to be handled properly. A single fault may, for example, cause an entire operational flight program to behave incorrectly or to fail, eventually forcing the mission computer itself into a cold or warm restart. A warm restart of the system takes about 5 seconds, which may then force ongoing missions such as target attack and aerial reconnaissance to be aborted [8].

3. FAULT-TOLERANT CRTS
This section describes a new fault-tolerant compositional real-time scheduling framework. We present our new interface and resource models. The schedulability analysis of the framework is provided assuming the Rate Monotonic (RM) scheduling algorithm.

3.1 Interface Model
Each component of the framework contains a Fault Manager (FM) module whose function is to detect and handle faults inside the component. Although this paper considers only the RM algorithm, any other scheduler capable of handling faults, such as EDF, may be used. A new periodic interface model defined by I(P,E,B,M) is introduced to support both the real-time and the fault requirements of each component. In this interface definition, P, E, and B respectively represent the period, the execution time during the normal mode, and the extra execution time to be supplied for backup in case of a fault. When a fault is detected on a task, the component may require more than one resource supply to recover from the fault. The parameter M therefore captures the total number of resource intervals needed by the component to properly respond to faults. Thus, when a fault is signaled inside a component C(W,R,A), the overall resource demand of the component, due to the release of a backup task, increases by approximately M × B time units. In other words, when a fault is detected on a task τi inside a component C(W,R,A), the component-level backup task Ib(P,B) will be released M times in order to request enough resource to cover the backup requirement of the faulty task τi. We also extend the definition of a task τi with a new parameter mi which, like M, represents the number of backup releases of the task. If a fault occurs on a task τi(pi, ei, bi, mi), the additional backup job with execution time bi is released exactly mi times. With this definition, the backup task βi of a task τi can be spread across multiple release periods.

Figure 2: The proposed scheduling framework

An example of the new framework is shown in Figure 2, where component C2 has two periodic tasks τ3(50, 4, 4, 1) and τ4(25, 3, 2, 1) scheduled with a fault-tolerant RM algorithm. The component exposes its interface I2(15, 4, 2, 2) to its parent component C0 to claim a computation time of 4 units every 15 time units under normal execution. However, if a fault is detected, C2 will require an additional 2 time units to be supplied during the next 2 resource periods in order to deal with the fault. In a similar way, component C1 presents its interface I1(10, 3, 2, 3) to C0, which then focuses on scheduling the two components as two normal periodic tasks.

3.2 Resource Model
This paper introduces a new fault-aware periodic resource model, which extends the existing periodic resource model [16] to support faults in a compositional scheduling framework. The fault-tolerant resource model Γ(Π,Θ,Δ) guarantees to supply each component a resource amount of Θ time units whenever the component is running without any fault. However, when a fault is detected on a task τi, the resource demand of the component increases by bi. To support the fault recovery process, the component is supplied an additional computational time of Δ. Thus, the fault-tolerant resource model Γ(Π,Θ,Δ) supplies Θ time units during normal execution and increases the supply to Θ + Δ during the recovery time. In contrast to the previous fault-tolerant model Γ(Π,Θp,Θb) [10], which provides Θp during normal execution and permanently switches to Θb when a fault is detected, our resource model Γ(Π,Θ,Δ) switches back to the normal execution mode when the fault is entirely recovered and thereafter continues to supply only Θ time units. The exact number of times the extra resource is supplied is taken directly from the interface of each component. Figure 3 shows an example of the resource supply of R = Γ(5, 2, 1) to a component C(W,R,A) with the interface model I(5, 2, 1, 3).

Figure 3: An example of resource supply for Γ(5, 2, 1) where M = 3

For schedulability analysis, it is important to accurately evaluate the amount of resource supplied by a resource model to a component. The supply bound function sbfΓ(t) of a resource model Γ computes the minimum resource supply over any time interval of length t. In the normal execution mode, the supply bound function is similar to that of the periodic resource model [16] and is given by the following equation:

\[
sbf_{\Gamma(\Pi,\Theta)}(t) =
\begin{cases}
t - (k+1)(\Pi - \Theta) & \text{if } t \in [(k+1)\Pi - 2\Theta,\; (k+1)\Pi - \Theta], \\
(k-1)\Theta & \text{otherwise},
\end{cases}
\tag{1}
\]

where k = max(⌈(t − (Π − Θ))/Π⌉, 1).

However, during the recovery mode, the resource supply to the faulty component increases by Δ time units. Thus, the supply bound function for the recovery mode, sbf^R_Γ(t), is given by Equation (2):

\[
sbf^{R}_{\Gamma(\Pi,\Theta,\Delta)}(t) = sbf_{\Gamma(\Pi,\Theta+\Delta)}(t - \Delta). \tag{2}
\]
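Equations (1) and (2) translate directly into code. The following is an illustrative Python sketch (function names are ours); the clamp of non-positive t to zero supply is our assumption for robustness.

```python
import math

def sbf(pi: int, theta: int, t: float) -> float:
    """Equation (1): minimum supply of the periodic resource model
    Gamma(pi, theta) over any interval of length t (Shin and Lee [16])."""
    if t <= 0:
        return 0  # assumption: no supply over a non-positive interval
    k = max(math.ceil((t - (pi - theta)) / pi), 1)
    if (k + 1) * pi - 2 * theta <= t <= (k + 1) * pi - theta:
        return t - (k + 1) * (pi - theta)   # linear segment
    return (k - 1) * theta                  # flat segment

def sbf_recovery(pi: int, theta: int, delta: int, t: float) -> float:
    """Equation (2): during recovery the component is served as if by
    Gamma(pi, theta + delta), shifted right by delta."""
    return sbf(pi, theta + delta, t - delta)
```

For Γ(5, 2) the worst-case blackout is 2(Π − Θ) = 6 time units, after which supply grows linearly within each period, matching the piecewise form of Equation (1).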

Given a component C(W,R,A) represented by its interface I(P,E,B,M) and a resource supply model R = Γ(Π,Θ,Δ), if we assume that a fault is detected during the k-th resource supply to C, then the supply bound function is given by Equation (3):

\[
sbf_{\Gamma(\Pi,\Theta,\Delta)}(t, k) =
\begin{cases}
sbf_{\Gamma(\Pi,\Theta)}(t) & \text{if } t \le t_N, \\
sbf^{R}_{\Gamma(\Pi,\Theta,\Delta)}(t) - h_s & \text{if } t_N < t \le t_R, \\
sbf_{\Gamma(\Pi,\Theta)}(t) + v_s & \text{otherwise},
\end{cases}
\tag{3}
\]

where tN = kΠ − Θ, tR = (M + k)Π − Θ, hs = sbf^R_Γ(Π,Θ,Δ)(tN) − sbf_Γ(Π,Θ)(tN), and vs = sbf^R_Γ(Π,Θ,Δ)(tR) − sbf_Γ(Π,Θ)(tR) − hs.
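A possible rendering of Equation (3) in code, re-deriving Equations (1) and (2) so the sketch is self-contained; names are illustrative, and the zero-supply clamp for non-positive t is our assumption.

```python
import math

def sbf(pi, theta, t):
    # Equation (1): minimum supply of Gamma(pi, theta) over length t.
    if t <= 0:
        return 0
    k = max(math.ceil((t - (pi - theta)) / pi), 1)
    if (k + 1) * pi - 2 * theta <= t <= (k + 1) * pi - theta:
        return t - (k + 1) * (pi - theta)
    return (k - 1) * theta

def sbf_fault(pi, theta, delta, m, t, k):
    """Equation (3): supply of Gamma(pi, theta, delta) when a fault is
    detected during the k-th resource period and the extra delta is
    supplied for m periods (m = M, taken from the component interface)."""
    sbf_r = lambda x: sbf(pi, theta + delta, x - delta)   # Equation (2)
    t_n = k * pi - theta
    t_r = (m + k) * pi - theta
    hs = sbf_r(t_n) - sbf(pi, theta, t_n)    # horizontal shift at t_n
    if t <= t_n:
        return sbf(pi, theta, t)             # before the fault: normal supply
    if t <= t_r:
        return sbf_r(t) - hs                 # during recovery
    vs = sbf_r(t_r) - sbf(pi, theta, t_r) - hs  # vertical gain after recovery
    return sbf(pi, theta, t) + vs
```

With Γ(5, 2, 1), M = 2, and k = 1 (as in Figure 4), the curve tracks the normal supply up to tN = 3, follows the shifted recovery supply up to tR = 13, and thereafter runs parallel to the normal supply with a constant gain.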

Example 3.1. Let us consider a component C(W,R,A) where W = {τ1(10, 1, 1, 1), τ2(15, 2, 2, 1)} and A = RM. Let the interface of the component be given by I(5, 2, 1, 2) and its resource supply be modeled by R = Γ(5, 2, 1). Figure 4 compares sbfΓ(Π,Θ,Δ)(t, k) for k = 1, 2, 3, and 4 with the worst-case resource supply of Γ(5, 3) as considered in previous work [8, 10]. The new resource model provides a significant resource gain for the framework. Figure 4 shows that the curves of our fault-tolerant resource model always lie between those of the normal and the worst-case resource supply.

Figure 4: Supply bound function for Γ(5, 2, 1) with M = 2

3.3 Schedulability Analysis
For the schedulability analysis, we focus only on the Rate Monotonic (RM) algorithm, which assigns higher priorities to tasks with shorter periods. Thus, without loss of generality, we assume that the tasks in each component are sorted in ascending order of their periods, that is, pi ≤ pi+1. Also, when released, a backup task βi inherits the priority of its faulty task τi. We define the resource demand of a workload as the amount of resource requested by a component from its parent component. The demand bound function dbfW(A, t) computes the maximum resource demand of the workload W when scheduled with the algorithm A during a time interval t. Since we focus only on the RM algorithm, we omit the scheduling algorithm from subsequent notations of the demand bound function.

For a component C(W,R,RM) under normal execution, the demand bound function dbfW(t, i) of a task τi is given by the following equation:

\[
dbf_W(t, i) = e_i + \sum_{\tau_j \in hp(i)} \left\lceil \frac{t}{p_j} \right\rceil \cdot e_j, \tag{4}
\]

where hp(i) denotes the set of tasks with priority higher than that of τi. However, if a task τi is still recovering from a fault, its demand bound function, accounting for its backup task βi, is given by Equation (5):

\[
dbf^{R}_W(t, i) = e_i + b_i + \sum_{\tau_j \in hp(i)} \left\lceil \frac{t}{p_j} \right\rceil \cdot e_j. \tag{5}
\]
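Equations (4) and (5) can be sketched as follows. In this illustrative encoding (our own, not the paper's), a task is a tuple (p_i, e_i, b_i, m_i) and the task list is sorted by period, so under RM the higher-priority set hp(i) is simply the prefix of the list.

```python
import math

def dbf(tasks, t, i):
    """Equation (4): RM demand of task i (0-based index) plus all
    higher-priority tasks over an interval of length t.
    Each task is (p_i, e_i, b_i, m_i), sorted by ascending period."""
    e_i = tasks[i][1]
    return e_i + sum(math.ceil(t / p_j) * e_j
                     for p_j, e_j, _, _ in tasks[:i])

def dbf_recovery(tasks, t, i):
    """Equation (5): as dbf, but task i also executes its backup b_i."""
    return dbf(tasks, t, i) + tasks[i][2]

# Workload of Examples 3.1/3.2: tau1(10,1,1,1), tau2(15,2,2,1).
W = [(10, 1, 1, 1), (15, 2, 2, 1)]
```

For τ2 at t = 10 the normal demand is 2 + ⌈10/10⌉·1 = 3 time units, rising to 5 when its own backup is included.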

If the fault is detected on a task τj with higher priority than another task τi, then the demand bound function dbf^F_W(t, i) of τi is given by Equation (6). Among all tasks with priority higher than that of τi, the demand bound function of τi assumes the worst-case situation in which the faulty task τj is the one with the maximum backup execution time.

Figure 5: Schedulability analysis of Example 3.2: (a) normal execution, (b) recovery execution, (c) fault analysis

\[
dbf^{F}_W(t, i) = e_i + \sum_{\tau_j \in hp(i)} \left\lceil \frac{t}{p_j} \right\rceil \cdot e_j
+ \max_{\tau_j \in hp(i)} \left( \min\left(\left\lceil \frac{t}{p_j} \right\rceil, m_j\right) \cdot b_j \right) \tag{6}
\]

We can now determine, for a task τi, the demand bound function dbfW(t, i, k), assuming that a fault was detected on another task τj with priority higher than that of τi during the k-th resource supply to the component, as in Equation (7):

\[
dbf_W(t, i, k) = e_i + \sum_{\tau_j \in hp(i)} \left\lceil \frac{t}{p_j} \right\rceil \cdot e_j
+ \max_{\tau_j \in hp(i)} \left( \min\left( \max\left(\left\lceil \frac{t - (k-1)\Pi}{p_j} \right\rceil, 0\right),\; m_j \right) \cdot b_j \right) \tag{7}
\]
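Equations (6) and (7) can be sketched in the same tuple encoding as before (a task is (p_i, e_i, b_i, m_i), sorted by period; the encoding and names are ours).

```python
import math

def dbf_fault(tasks, t, i):
    """Equation (6): worst case where some higher-priority task is faulty
    and its backup b_j (released at most m_j times) interferes with task i."""
    e_i = tasks[i][1]
    demand = e_i + sum(math.ceil(t / p_j) * e_j
                       for p_j, e_j, _, _ in tasks[:i])
    interference = max(
        (min(math.ceil(t / p_j), m_j) * b_j
         for p_j, _, b_j, m_j in tasks[:i]),
        default=0)  # no higher-priority tasks: no backup interference
    return demand + interference

def dbf_fault_k(tasks, res_period, t, i, k):
    """Equation (7): as Equation (6), but the fault is detected during the
    k-th resource period of length res_period (Pi), so backup releases only
    count after time (k - 1) * res_period."""
    e_i = tasks[i][1]
    demand = e_i + sum(math.ceil(t / p_j) * e_j
                       for p_j, e_j, _, _ in tasks[:i])
    interference = max(
        (min(max(math.ceil((t - (k - 1) * res_period) / p_j), 0), m_j) * b_j
         for p_j, _, b_j, m_j in tasks[:i]),
        default=0)
    return demand + interference

W = [(10, 1, 1, 1), (15, 2, 2, 1)]   # workload of Example 3.2
```

For τ2 with a fault on τ1 detected in the k = 2 resource period of Π = 5, no backup of τ1 interferes before t = 5, so the demand at t = 3 stays at its normal value of 3.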

A component C(W,R,RM) is schedulable if the resource demand of its workload W is guaranteed to be satisfied by the resource model R = Γ(Π,Θ,Δ) both during the normal execution mode and in the presence of a fault, as summarized in Theorem 1.

Theorem 1. A given component C(W,R,RM), where W = {τi(pi, ei, bi, mi) | i = 1, ..., n} and whose interface is defined as I(P,E,B,M), is schedulable with a resource model R = Γ(Π,Θ,Δ) if and only if for all τi ∈ W there exists ti ∈ [0, pi] such that the following three conditions are satisfied:

1. dbfW(ti, i) ≤ sbfΓ(Π,Θ)(ti)

2. dbf^R_W(ti, i) ≤ sbf^R_Γ(Π,Θ,Δ)(ti)

3. dbfW(ti, i, k) ≤ sbfΓ(Π,Θ,Δ)(ti, k), ∀k = 1, ..., (⌈pi/Π⌉ − 1)

Proof. The proof of the first condition of Theorem 1 follows from the work by Shin and Lee [16, Theorem 4.2]. A task τi completes its execution requirement at time ti ∈ [0, pi] if and only if all the execution requirements of all jobs of tasks with priority higher than τi, together with ei, the execution requirement of τi, are completed at time ti. The total of such requirements is given by dbfW(ti, i), and they are completed at ti if and only if dbfW(ti, i) = sbfΓ(Π,Θ)(ti) and dbfW(t'i, i) > sbfΓ(Π,Θ)(t'i) for 0 ≤ t'i < ti. It follows that a necessary and sufficient condition for τi to meet its deadline is the existence of a ti ∈ [0, pi] such that dbfW(ti, i) = sbfΓ(Π,Θ)(ti). The entire task set is schedulable if and only if each of the tasks is schedulable, which implies that there exists a ti ∈ [0, pi] such that dbfW(ti, i) = sbfΓ(Π,Θ)(ti) for each task τi ∈ W.

The proofs of the other two conditions can be established by repeating the same reasoning with the appropriate demand and supply bound functions.
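Condition 1 of Theorem 1 can be checked numerically by searching for a suitable t_i. The following self-contained sketch restricts the search to integer t_i, which suffices when all parameters are integers (our assumption); it reproduces the Example 3.2 observation that W is schedulable under Γ(5, 2), and, as our own additional check (not a claim from the paper), that Θ = 1 would not suffice.

```python
import math

def sbf(pi, theta, t):
    # Equation (1): minimum supply of Gamma(pi, theta) over length t.
    if t <= 0:
        return 0
    k = max(math.ceil((t - (pi - theta)) / pi), 1)
    if (k + 1) * pi - 2 * theta <= t <= (k + 1) * pi - theta:
        return t - (k + 1) * (pi - theta)
    return (k - 1) * theta

def dbf(tasks, t, i):
    # Equation (4): RM demand of task i plus all higher-priority tasks.
    return tasks[i][1] + sum(math.ceil(t / p) * e
                             for p, e, _, _ in tasks[:i])

def rm_schedulable(tasks, pi, theta):
    """Condition 1 of Theorem 1: every task i needs some t_i in (0, p_i]
    with dbf(t_i, i) <= sbf(t_i). Integer search points only (sketch)."""
    return all(
        any(dbf(tasks, t, i) <= sbf(pi, theta, t)
            for t in range(1, p_i + 1))
        for i, (p_i, _, _, _) in enumerate(tasks))

W = [(10, 1, 1, 1), (15, 2, 2, 1)]   # workload of Example 3.2
```

For τ2 under Γ(5, 2), the check succeeds at t = 13, where demand and supply both equal 4.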

Example 3.2. Let us consider again our previous component C(W,R,RM) where W = {τ1(10, 1, 1, 1), τ2(15, 2, 2, 1)} and R = Γ(5, 2, 1). The interface of the component is given by I(5, 2, 1, 2). Figure 5(a), which plots the demand bound function of Equation (4), shows that the component is schedulable for the minimum resource supply of R = Γ(5, 2). This satisfies the first condition of Theorem 1. However, as seen in Figure 5(b), if the resource supply remains R = Γ(5, 2) while a fault occurs on task τ1, task τ2 will miss its deadline due to the interference from the backup task of τ1. Likewise, if the fault is instead detected on τ2, the component remains unschedulable with a resource supply of R = Γ(5, 2). However, Figure 5(b) shows that the component becomes schedulable if the resource supply becomes R = Γ(5, 2, 1). Figure 5(c) plots the third condition of Theorem 1 to analyze the impact of a faulty τ1 on task τ2. It follows that, by supplying 1 extra computation time unit during exactly 2 resource periods, the component is always schedulable. Therefore, there is no need to always provide C with a resource of R = Γ(5, 3). Moreover, Figure 5(c) shows that if the fault is detected during the last resource supply before the deadline of τ2 (i.e., [10, 15]), the recovery process will be handled only after the deadline.

4. INTERFACE GENERATION

A component expresses its resource demand to its parent component through its interface, which abstracts the internal real-time requirements of the component without revealing them. The interface of a component C(W,R,A) is defined by I(P,E,B,M). When a fault is detected on a task τi by the local fault manager, the backup task βi is released to execute for bi time units. This release also triggers the release of the component backup task, as the component is now seen as faulty by its parent component. As a result, the faulty component is provided with an extra Δ time units. The component remains in this faulty status for exactly M resource periods. This section focuses on determining the interface parameters that make each component schedulable with a resource model Γ(Π,Θ,Δ).

In this paper, we assume that the period P of the interface is decided by the system designer. However, there is a tradeoff in choosing the right P for a given component C(W,R,A) due to the scheduling overhead. A smaller P increases the scheduling overhead in the upper-layer component due to the increased number of context switches. Conversely, a larger P makes it difficult to find a feasible interface model. Thus, we suggest selecting P as the minimum period among all tasks in W, as a number dividing that minimum period, or as a common divisor of all pi, ∀τi ∈ W.
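As an illustration of the last suggestion, the candidate values of P that divide every task period can be enumerated from the greatest common divisor of the periods. This is a minimal sketch; the function name is ours, and the example periods are those of the workload in Example 4.1:

```python
from functools import reduce
from math import gcd

def candidate_periods(periods):
    """Interface periods P that divide every task period p_i:
    exactly the divisors of gcd(p_1, ..., p_n)."""
    g = reduce(gcd, periods)
    return [d for d in range(1, g + 1) if g % d == 0]

# Task periods of the workload in Example 4.1:
print(candidate_periods([20, 40, 80, 160]))  # [1, 2, 4, 5, 10, 20]
```

Any value in this list (such as P = 10 used later in Example 4.1) divides all pi; the designer then trades off scheduling overhead against interface feasibility among these candidates.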

The parameter E can be easily determined assuming the component is in its normal execution mode, where backup resource supply is not required, as stated by the first condition of Theorem 1. When a resource model Γ(Π,Θ,Δ) is provided to a component with interface I(P,E,B,M), the execution time E can be determined by Proposition 1.

Proposition 1. The schedulability of a given component C(W,R,RM) abstracted by the interface I(P,E,B,M), where W = {τi(pi, ei, bi, mi) | i = 1, . . . , n} and R = Γ(Π,Θ,Δ), is guaranteed if

    E = (P · UN) / log( (2k + 2(1 − UN)) / (k + 2(1 − UN)) )    (8)

where k = max{integer i | (i + 1)P − E < p∗}, UN = Σ_{τi∈W} ei/pi, and p∗ represents the smallest period in W.

Proof. Let us consider a component C(W,R,RM), where W = {τi(pi, ei, bi, mi) | i = 1, . . . , n}, and its periodic interface defined as I(P,E,B,M). Let us assume the component is in a normal, non-faulty execution mode with a resource supply model R = Γ(Π,Θ,Δ). According to the work by Shin and Lee [16], the utilization bound of the component C under normal execution mode is given by

    UBW(RM) = UI · n · [ ( (2k + 2(1 − UI)) / (k + 2(1 − UI)) )^(1/n) − 1 ]

with k defined by k = max{integer i | (i + 1)P − E < p∗} and UI = E/P.

In order to guarantee the schedulability of the component, the interface normal execution time E is the minimum value such that

    UN = Σ_{τi∈W} ei/pi ≤ UI · n · [ ( (2k + 2(1 − UI)) / (k + 2(1 − UI)) )^(1/n) − 1 ]    (9)

When n becomes large, we have

    n · [ ( (2k + 2(1 − UI)) / (k + 2(1 − UI)) )^(1/n) − 1 ] ≈ log( (2k + 2(1 − UI)) / (k + 2(1 − UI)) )    (10)

Therefore, from Equations (9) and (10), it follows that

    UI ≥ UN / log( (2k + 2(1 − UI)) / (k + 2(1 − UI)) )

Since UN ≤ UI, we have

    log( (2k + 2(1 − UN)) / (k + 2(1 − UN)) ) ≤ log( (2k + 2(1 − UI)) / (k + 2(1 − UI)) )    (11)

Equation (11) finally implies that

    UI ≥ UN / log( (2k + 2(1 − UN)) / (k + 2(1 − UN)) )

Therefore, the minimum value for E is given by

    E = (P · UN) / log( (2k + 2(1 − UN)) / (k + 2(1 − UN)) )

However, since UN ≤ 1, it is easy to see that E ≤ P.

When a fault occurs on a task τi, the resource utilization of the component increases by bi/pi and, consequently, the resource supply to the component is also increased by Δ. Thus, the total resource utilization UF of a component in the presence of a fault is given by the following equation:

    UF = Σ_{τi∈W} ei/pi + max_{τk∈W}( (mk · bk) / pk )    (12)

Since we assume a single-fault model, the value of the interface backup execution time B can be obtained by extending the result of Proposition 1, as given in Proposition 2.

Proposition 2. The schedulability of a given component C(W,R,RM) abstracted by the interface I(P,E,B,M), where W = {τi(pi, ei, bi, mi) | i = 1, . . . , n} and R = Γ(Π,Θ,Δ), is guaranteed if

    E + B = (P · UF) / log( (2k + 2(1 − UF)) / (k + 2(1 − UF)) )    (13)

where k = max{integer i | (i + 1)P − (E + B) < p∗} and p∗ represents the smallest period in W.


[Figure 6 appears here: three panels plotting resource supply (sbf) and demand (dbf) over time. (a) Normal execution with sbfΓ(10,3.14); (b) Recovery execution, adding sbfΓ(10,3.14,1.36); (c) Fault analysis with k = 2.]

Figure 6: Schedulability analysis of Example 4.1

Proof. The proof of Proposition 2 follows directly from that of Proposition 1.

Finally, we determine M as the maximum number of times the additional resource B is to be requested by the faulty component from its upper-layer component in case of a fault. However, the value of M should be decided to guarantee that the length of the resource supply cannot violate the deadline requirement of the faulty task, and that the total additional resource supplied for backup is large enough to cover the backup requirement of each task in case of a fault. These two conditions are formalized in Equation (14).

    M × P ≤ mi × pi, ∀τi ∈ W
    M × B ≥ mi × bi, ∀τi ∈ W    (14)

From Equation (14), it follows that

    (mi · bi) / B ≤ M ≤ (mi · pi) / P, ∀τi ∈ W    (15)

We can still preserve the schedulability of the component by choosing M as the maximum value among the lower bounds that satisfy Equation (15), as shown in Equation (16).

    M = max_{τi∈W} ⌈ (mi · bi) / B ⌉    (16)

Once the interface I(P,E,B,M) of a component C(W,R,A) is determined, the resource model R = Γ(Π,Θ,Δ) provided by the upper-layer component to C can directly be derived from the interface by setting Π = P, Θ = E, and Δ = B.

Example 4.1. Let us consider a component C(W,R,RM) with its workload given by W = {τ1(20, 1, 1, 1), τ2(40, 4, 4, 1), τ3(80, 3, 2, 1), τ4(160, 2, 0, 1)}. Let us also assume that TF is equal to 160, the least common multiple of all task periods in W. Now, let us set the period of the interface as P = 10. By choosing k = ⌊p∗/P⌋ in Proposition 1, we can obtain E = 3.14. Similarly, we can obtain from Proposition 2 that B = 1.36. Equation (16) then provides M = 3. The component interface can then be given as I(10, 3.14, 1.36, 3) and the resource model as R = Γ(10, 3.14, 1.36). Thus, when a fault occurs in the component, an additional resource of 1.36 time units will be supplied to the component for 3 periods. Figure 6 shows the schedulability analysis of the component. Figure 6(a) shows that the component is schedulable under normal execution mode with the resource supply Γ(10, 3.14). However, in case of a fault, as seen in Figure 6(b), task τ2 will miss its deadline with the resource supply Γ(10, 3.14), but the workload will preserve its schedulability with the resource supply Γ(10, 3.14, 1.36). We assumed a case where a fault is detected on τ2 during the second resource supply. As seen in Figure 6(c), the schedulability of the workload is guaranteed by the resource model Γ(10, 3.14, 1.36), which provides an extra 1.36 time units during 3 more resource intervals to back up the faulty task τ2.
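The numbers of Example 4.1 can be reproduced directly from Equations (8), (13), and (16). A minimal sketch: the function name is ours, and since the base of the logarithm is not stated explicitly in the text, we use base 2 here, which reproduces the example's values E = 3.14, B = 1.36, and M = 3:

```python
import math

def interface_params(tasks, P, log=math.log2):
    """Interface parameters (E, B, M) per Eqs. (8), (13), and (16).

    tasks: list of (p_i, e_i, b_i, m_i) tuples; P: interface period.
    """
    p_star = min(p for p, _, _, _ in tasks)
    k = p_star // P                      # k = floor(p*/P), as in Example 4.1
    U_N = sum(e / p for p, e, _, _ in tasks)
    U_F = U_N + max(m * b / p for p, _, b, m in tasks)  # Eq. (12)

    def f(U):                            # argument of the logarithm
        return (2 * k + 2 * (1 - U)) / (k + 2 * (1 - U))

    E = P * U_N / log(f(U_N))            # Eq. (8)
    B = P * U_F / log(f(U_F)) - E        # Eq. (13): E + B = P*U_F / log(f(U_F))
    M = max(math.ceil(m * b / B) for _, _, b, m in tasks)  # Eq. (16)
    return E, B, M

W = [(20, 1, 1, 1), (40, 4, 4, 1), (80, 3, 2, 1), (160, 2, 0, 1)]
E, B, M = interface_params(W, P=10)
print(round(E, 2), round(B, 2), M)  # 3.14 1.36 3
```

With a different log base, the derivation in the proof of Proposition 1 still holds but the absolute values of E and B change, so the base-2 choice here is only what matches the published example.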


5. CONCLUSION

This paper presents a new fault-tolerant compositional real-time scheduling framework. In the framework, each component contains a fault manager module which is in charge of detecting faults inside the component and recovering the faulty task by launching an associated backup strategy. The release of a backup task immediately increases the resource demand of a component. Thus, we introduced a fault-aware interface model to expose both the deadline and the fault requirements of each component to its upper-layer component. Furthermore, we provided a new fault-tolerant resource model that guarantees a minimum resource supply to a component in its normal execution mode, and increases the resource supply when the resource demand of the component increases due to a fault. Moreover, the resource also switches back to its minimum supply once the component has entirely recovered from the fault. We analyzed the schedulability of the new framework under the Rate Monotonic scheduling algorithm and showed its efficiency over existing models.

In this paper, we considered only task-level fault management. Our future interest will be in component- and system-level fault management strategies. It will also be interesting to extend the fault model to, for example, a multiple-fault model, since the occurrence of faults can be bursty or memoryless. Finally, we are planning to implement the framework on real hardware to support the design and development of safety-critical avionics mission computers based on the ARINC-653 standard.

6. ACKNOWLEDGMENTS

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. NRF-2012R1A1A1015096), and the BK21 Plus Program (Research Team for Software Platform on Unmanned Aerial Vehicle, 21A20131600012) funded by the Ministry of Education (MOE, Korea) and the National Research Foundation of Korea (NRF).

7. REFERENCES

[1] ARINC. Avionics application software standard interface: Part 1 – required services (ARINC Specification 653-2). Technical report, Aeronautical Radio, Incorporated, March 2006.

[2] M. Asberg, M. Behnam, F. Nemati, and T. Nolte. Towards hierarchical scheduling in AUTOSAR. In Emerging Technologies & Factory Automation, ETFA 2009, pages 1–8, Sept 2009.

[3] S. Chen, L. T. X. Phan, J. Lee, I. Lee, and O. Sokolsky. Removing abstraction overhead in the composition of hierarchical real-time systems. In Proceedings of the 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS '11, pages 81–90. IEEE Computer Society, 2011.

[4] A. Cunei and J. Vitek. A new approach to real-time checkpointing. In Proceedings of the 2nd International Conference on Virtual Execution Environments, VEE '06, pages 68–77. ACM, 2006.

[5] A. Easwaran, I. Lee, O. Sokolsky, and S. Vestal. A compositional scheduling framework for digital avionics systems. In Proceedings of the 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA '09), pages 371–380, August 2009.

[6] P. Eles, V. Izosimov, P. Pop, and Z. Peng. Synthesis of fault-tolerant embedded systems. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '08, pages 1117–1122. ACM, 2008.

[7] M. A. Haque, H. Aydin, and D. Zhu. Real-time scheduling under fault bursts with multiple recovery strategy. In Proceedings of the 20th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS '14. IEEE Computer Society, 2014.

[8] J. Hyun and K. H. Kim. Fault-tolerant scheduling in hierarchical real-time scheduling framework. In Proceedings of the 2012 IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA '12, pages 431–436. IEEE Computer Society, 2012.

[9] J. Hyun, S. Lim, Y. Park, K. S. Yoon, J. H. Park, B. M. Hwang, and K. H. Kim. A fault-tolerant temporal partitioning scheme for safety-critical mission computers. In Proceedings of the 31st IEEE/AIAA Digital Avionics Systems Conference, DASC '12, pages 6C3-1–6C3-8. IEEE Computer Society, Oct 2012.

[10] H.-W. Jin. Fault-tolerant hierarchical real-time scheduling with backup partitions on single processor. SIGBED Rev., 10(4):25–28, Dec. 2013.

[11] F. Many and D. Doose. Scheduling analysis under fault bursts. In Proceedings of the 2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS '11, pages 113–122. IEEE Computer Society, 2011.

[12] V. Mikolasek and H. Kopetz. Roll-forward recovery with state estimation. In Proceedings of the 14th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, ISORC '11, pages 179–186. IEEE Computer Society, March 2011.

[13] D. Nikolov, U. Ingelsson, V. Singh, and E. Larsson. Evaluation of level of confidence and optimization of roll-back recovery with checkpointing for real-time systems. Microelectronics Reliability, 54(5):1022–1049, 2014.

[14] R. M. Pathan and J. Jonsson. Exact fault-tolerant feasibility analysis of fixed-priority real-time tasks. In Proceedings of the 2010 IEEE 16th International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA '10, pages 265–274. IEEE Computer Society, 2010.

[15] L. T. X. Phan, M. Xu, J. Lee, I. Lee, and O. Sokolsky. Overhead-aware compositional analysis of real-time systems. In Proceedings of the 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS), RTAS '13, pages 237–246. IEEE Computer Society, 2013.

[16] I. Shin and I. Lee. Compositional real-time scheduling framework with periodic model. ACM Transactions on Embedded Computing Systems (TECS), 7(3):30:1–30:39, April 2008.

[17] G. M. Tchamgoue, K. H. Kim, Y.-K. Jun, and W. Y. Lee. Compositional real-time scheduling framework for periodic reward-based task model. Journal of Systems and Software, 86(6):1712–1724, 2013.

[18] J. Xu and B. Randell. Roll-forward error recovery in embedded real-time systems. In Proceedings of the International Conference on Parallel and Distributed Systems, pages 414–421. IEEE, June 1996.

[19] Y. Zhang and K. Chakrabarty. Fault recovery based on checkpointing for hard real-time embedded systems. In Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 320–327. IEEE, 2003.


Designing a Time-Predictable Memory Hierarchy for Single-Path Code

Bekim Cilku
Institute of Computer Engineering
Vienna University of Technology
A 1040 Wien
[email protected]

Peter Puschner
Institute of Computer Engineering
Vienna University of Technology
A 1040 Wien
[email protected]

ABSTRACT

Trustable Worst-Case Execution-Time (WCET) bounds are a necessary component for the construction and verification of hard real-time computer systems. Deriving such bounds for contemporary hardware/software systems is a complex task. The single-path conversion overcomes this difficulty by transforming all unpredictable branch alternatives in the code to a sequential code structure with a single execution trace. However, the simpler code structure and analysis of single-path code comes at the cost of a longer execution time. In this paper we address the problem of the execution performance of single-path code. We propose a new instruction-prefetch scheme and cache organization that utilize the "knowledge of the future" properties of single-path code to reduce the main memory access latency and the number of cache misses, thus speeding up the execution of single-path programs.

Keywords

hard real-time systems, time predictability, memory hierarchy, prefetching, cache memories

1. INTRODUCTION

Embedded real-time systems need safe and tight estimations of the Worst Case Execution Time (WCET) of time-critical tasks in order to guarantee that the deadlines imposed by the system requirements are met. Missing a single deadline in such a system can lead to catastrophic consequences.

Unfortunately, the process of calculating the WCET bound for contemporary computer systems is, in general, a complex undertaking. On the one hand, the software is written to execute fast – it is programmed to follow different execution paths for different input data. Those different paths, in general, have different timing, and analyzing them all can lead to cases where the analysis cannot produce results of the desired quality. On the other hand, the inclusion of hardware features (cache, branch prediction, out-of-order execution, and pipelines) extends the analysis with state dependencies and mutual interferences; a high-quality WCET analysis has to consider the interferences of all mentioned hardware features to obtain a tight timing analysis. The state-of-the-art tools for WCET analysis use a highly integrated approach that considers all interferences caused by hardware state interdependencies [4]. Keeping track of all possible interferences and also the hardware state history for the whole code in an integrated analysis can lead to a state-space explosion and make the analysis infeasible. An effective approach that would allow the tool to decompose the timing analysis into compositional components is still lacking [1].

One strategy to avoid the complexity of the WCET analysis is the single-path conversion [12]. The single-path conversion reduces the complexity of timing analysis by converting all input-dependent alternatives of the code into pieces of sequential code. This, in turn, eliminates control-flow induced variations in execution time. The benefit of this conversion is the predictable properties that are gained with the code transformation. The newly generated code has a single execution trace that forces the execution time to become constant. To obtain information about the timing of the code it is sufficient to run the code only once and to identify the stream of the code execution, which is repeated on every other iteration.

Large programs that have been converted into single-path code can be decomposed into smaller segments, where each segment can be easily analyzed for its worst-case timing in separation. This contrasts with the analysis of traditional code, where a decomposition into segments may lead to highly pessimistic timing-analysis results, because important information about possible execution paths, and about how these execution paths within one segment influence the feasible execution paths and timings in subsequent segments, gets lost at segment boundaries. In single-path code, each code segment has a constant trace of execution, and the initial hardware states for each segment can be easily calculated, because there are no different alternatives of the incoming paths that can lead to a loss of information during a (de)compositional analysis. However, the advantage of generating compositional code that allows for a highly accurate, low-complexity analysis comes at the cost of a longer execution time of the code.

The long latency of memory accesses is one of the key performance bottlenecks of contemporary computer systems. While the inclusion of an instruction cache is a crucial first step to bridge the speed gap between CPU and main memory, this is still not a complete solution – cache misses can result in significant performance losses by stalling the CPU until the needed instructions are brought into the cache.

For such a problem, prefetching has been shown to be an effective solution. Prefetching can mask large memory latencies by loading the instructions into the cache before they are actually needed [15]. However, to take advantage of this improvement, the prefetching commands have to be issued at the right times – if they are issued too late, memory latencies are only partially masked; if they are issued too early, there is the risk that the prefetched instruction will


evict other useful instructions from the cache.

Prefetching mechanisms also have to consider accuracy, since speculative prefetching may pollute the cache. Prefetching algorithms can mainly be divided into two categories: correlated and non-correlated prefetching. Correlated prefetching associates each cache miss with some predefined target stored in a dedicated table [6, 16], while non-correlated prefetchers predict the next prefetch line according to some simple predefined algorithms [11, 7, 14].

For all mentioned techniques, the ability to guess the next reference is not fully accurate, and prefetching can result in cache pollution and unnecessary memory traffic. In this paper we propose a new memory hierarchy for single-path code that consists of a cache and a hardware prefetcher. The proposed design is able to prefetch sequential and non-sequential streams of instructions with full accuracy in the value and time domain. This constitutes an effective instruction prefetching scheme that increases the execution performance of single-path code and reduces both cache pollution and useless memory traffic.

The rest of the paper is organized as follows. Section 2 gives a short description of predicated instructions and presents some simple rules used to convert conventional code to single-path code. The newly proposed memory hierarchy is presented in Section 3. Section 4 discusses related work. Finally, we make concluding remarks and present future work in Section 5.

2. GENERATING SINGLE-PATH CODE

The goal of the single-path code-generation strategy is to eliminate the complexity of multi-path code analysis by eliminating branch instructions from the control flow of the code. Different paths of program execution are the result of branch instructions, which force the execution to follow different sequences of instructions. Branch instructions can be unconditional branches, which always result in branching, or conditional branches, where the decision for the execution direction depends on the evaluation of the branching condition.

The single-path conversion transforms conditional branches, i.e., those branches whose outcome is influenced by program inputs [12]. Before the actual single-path code conversion is done, a data-flow analysis [3] is run to identify the input-dependent instructions of the code. Branches which are not influenced by the input values are not affected by the transformation. After the data-flow analysis, the single-path conversion rules are applied and the new single-path code is generated. The only additional requirement for executing single-path converted code is that the hardware must support the execution of predicated instructions.

2.1 Predicated execution

Predicated instructions are instructions whose semantics are controlled by a predicate (or guard), where the predicate can be implemented by a specific predicate flag or register in the processor. Instructions whose predicate evaluates to "true" at runtime are executed normally, while those which evaluate to "false" are nullified to prevent the processor state from being modified.

Predicated execution is used to remove all branching operations by merging all blocks into a single one with straight-line code [10]. For architectures that support predicated (guarded) execution, the compiler converts conditional branches into (a) predicate-defining instructions and (b) sequences of predicated instructions – the instructions along the alternative paths of each branch are converted into sequences of predicated instructions with different predicates.

    C code:       branch code:        predicated code:
    if (a)          beq  a,0,L1       pred_eq p,a
      x = x+1;      add  x,x,1        add x,x,1  (p)
    else            jump L2           add y,y,1  (not p)
      y = y+1;    L1: add  y,y,1
                  L2:

Figure 1: if-conversion

Figure 1 shows an example of an if-then-else structure translated into assembly code with and without predicated instructions. In the first assembly version, depending on the outcome of the branch instruction, only part of the code will be executed; in the second, single-path version, all instructions will be executed, but the state of the processor will be changed only by instructions with a true predicate value.
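The nullification semantics can be mimicked in software. A minimal sketch (function name ours): both additions always "execute", but each result takes effect only when its predicate holds, mirroring the predicated code of Figure 1:

```python
def single_path_if(a, x, y):
    """Single-path version of: if (a) x += 1; else y += 1;"""
    p = int(a != 0)      # pred_eq p, a
    x += 1 * p           # add x,x,1 (p): no effect when p is false
    y += 1 * (1 - p)     # add y,y,1 (not p)
    return x, y

print(single_path_if(1, 0, 0))  # (1, 0)
print(single_path_if(0, 0, 0))  # (0, 1)
```

Both calls execute the same instruction sequence, so the execution trace (and, on predicated hardware, the timing) is independent of the input a.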

2.2 Single-Path Conversion Rules

In the following we describe a set of rules to convert regular code into single-path code [13]. Table 1 shows the single-path transformation rules for sequences, alternatives, and loop structures. In this table we assume that the conditions of alternatives and loops are stored in boolean variables. The precondition for statement execution is represented by σ, while in cases of recursion the δ counter is used to generate unique variable names.

Simple Statement. If the precondition of a simple statement S is always true, then the statement will be executed in every execution. Otherwise the execution of S will depend on the value of the precondition σ, which becomes the execution predicate. The same rule is used for statement sequences, by applying the rule sequentially to each part of the sequence.

Conditional Statement. For input-dependent (ID(cond) is true) branching structures, we serialize the S1 and S2 alternatives, where the precondition parameters of the alternatives S1 and S2 are formed by a conjunction of the old precondition (σ) and the outcome of the branching condition that is stored in guardδ. If branching is not dependent on program inputs, then the if-then-else structure is preserved and the set of rules for single-path conversion is applied individually to S1 and S2.

Loop. Input-data dependent loops are transformed in two steps. First, the original loop is transformed into a for-loop and the number of iterations N is assigned – the iteration count N of the new loop is set to the maximum number of iterations of the original loop code. The termination of the new loop is controlled by a new counter variable (countδ) in order to force the loop to always iterate for the constant number N. Further, a variable endδ is introduced. This variable is used to enforce that the transformed loop has the same semantics as the original one. The endδ flag stored in this variable is initialized to false and assumes the value true as soon as the termination condition of the original loop evaluates to true for the first time. The value of the endδ flag can also be changed to true if a break is embedded in the loop. Thus S is executed under the same condition as in the original loop.
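The loop rule can be illustrated executably. A minimal sketch (names ours), following Table 1: the converted loop always performs N iterations, and the end flag nullifies the body once the original termination condition fires:

```python
def single_path_while(cond, body, N, state):
    """Single-path form of 'while cond do body' with at most N iterations."""
    end = False                      # end_delta := false
    for _ in range(N):               # for count_delta := 1 to N
        if not cond(state):          # if not cond then end_delta := true
            end = True
        if not end:                  # if not end_delta then S
            state = body(state)
    return state

# 'while x < 5: x += 1' now always runs exactly N = 10 loop iterations:
print(single_path_while(lambda x: x < 5, lambda x: x + 1, 10, 0))  # 5
```

Whatever the input state, the loop performs exactly N iterations, which is what makes the execution trace, and hence the timing, input-independent.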


Table 1: Single-Path Transformation Rules

Construct S                  Translated Construct SP⟦S⟧σδ
---------------------------  ---------------------------------------------------
S                            if σ = T:   S
                             otherwise:  (σ) S

S1; S2                       SP⟦S1⟧σδ; SP⟦S2⟧σδ

if cond then S1              if ID(cond):  guardδ := cond;
else S2                                    SP⟦S1⟧⟨σ ∧ guardδ⟩⟨δ + 1⟩;
                                           SP⟦S2⟧⟨σ ∧ ¬guardδ⟩⟨δ + 1⟩
                             otherwise:    if cond then SP⟦S1⟧σδ
                                           else SP⟦S2⟧σδ

while cond                   if ID(cond):  endδ := false;
  max N times                              for countδ := 1 to N do begin
  do S                                         SP⟦if ¬cond then endδ := true⟧σ⟨δ + 1⟩;
                                               SP⟦if ¬endδ then S⟧σ⟨δ + 1⟩
                                           end
                             otherwise:    while cond do SP⟦S⟧σδ

3. MEMORY HIERARCHY FOR SINGLE-PATH CODE

This section presents our novel architecture of the cache memory and the prefetcher used for single-path code.

3.1 Architecture of the Cache Memory

Caches are small and fast memories that are used to improve the performance between processors and main memories, based on the principle of locality. The property of locality can be observed from the aspects of temporal and spatial behavior of the execution. Temporal locality means that the code that is executed at the moment is likely to be referenced again in the near future. This type of behavior is expected from program loops, in which both data and instructions are reused. Spatial locality means that instructions and data whose addresses are close by will tend to be referenced in temporal proximity, because instructions are mostly executed sequentially and related data are usually stored together [15].

As an application is executed over time, the CPU makes references to memory by sending addresses. At each such step, the cache compares the address with its stored tags. References (instructions or data) that are found in the cache are called hits, while those that are not in the cache are called misses. Usually the processor stalls on a cache miss until the instructions/data have been fetched from main memory.

Figure 2 shows an overview of the cache memory augmented with the single-path prefetcher. The cache has two banks, each consisting of tag, data, and valid bit (V) entries. Separating the cache into two banks allows us to overlap the process of fetching (by the CPU) with prefetching (by the prefetch unit), and also costs less than a dual-ported cache of the same size. At any time, one of the banks is used to send instructions to the CPU and the other one to prefetch instructions from main memory. Both the CPU and the prefetcher can issue requests to the cache memory. Whenever a new value in the program counter (PC) is generated, the

[Figure 2 appears here: block diagram of the prefetch-cache architecture. The PC feeds the cache and a prefetch unit containing next-line prefetching logic, a state machine, and an RPT with trigger line, destination line, count, and type fields; the cache consists of two banks (Bank 1, Bank 2), each with tag, data, and valid (V) entries, connected through a MUX to main memory.]

Figure 2: Prefetch-Cache architecture

value is sent to the cache and to the prefetcher. There are three different cases of cache accesses when the CPU issues an instruction request:

• No match within tag columns – the instruction is not in the cache. The cache stalls the processor and forwards the address request to the main memory;

• Tag match, V bit is zero – the instruction is not in the cache, but the prefetcher has already sent the request for that cache line and the fetching is in progress. The cache stalls the processor and waits for the ongoing prefetching operation to finish (i.e., for the V value to switch from zero to one);

CRTS 2014 11 Rome, Italy

Page 18: Proceedings of the 7th International Workshop on Compositional …rbril/CRTS/FinalProcCRTS2014.pdf · C.3[Special-Purpose and Application-Based Systems]: Real-time and embedded systems;

• Tag match, V bit is one – the instruction is already in the cache (cache hit).

The V bit prevents the cache from replicating requests to main memory when the same request has already been issued by the prefetcher.
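The three access cases above, including the role of the V bit, can be summarized as a small dispatch. This is only an illustrative sketch (names ours); the actual logic is implemented in hardware:

```python
def cache_access(entry):
    """Resolve a CPU instruction request against a cache-line lookup.

    entry is None when no tag matched, otherwise a dict {'valid': bool}.
    """
    if entry is None:
        return "miss: stall CPU, forward address to main memory"
    if not entry["valid"]:
        return "in flight: stall CPU until the prefetch sets V to one"
    return "hit: deliver instruction to CPU"

print(cache_access(None))               # no tag match
print(cache_access({"valid": False}))   # tag match, prefetch in progress
print(cache_access({"valid": True}))    # tag match, valid line
```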

Instructions in the cache can be mapped to any location (fully associative), to a dedicated set of cache lines (set-associative), or to only one cache line location (direct-mapped). For single-path code, direct-mapped is the most suitable cache organization. In single-path code, loops have sequential bodies, and if the loop size satisfies cachesize < loopsize < 2 · cachesize, then only a minimal number of instructions will be evicted from the cache. To illustrate this, let us assume that a cache has a size of four cache lines (1, 2, 3, 4) and the loop has a size equal to six cache lines (a, b, c, d, e, f). In each iteration the loop will be mapped into the cache as: (a, e) → 1, (b, f) → 2, (c) → 3, and (d) → 4. In each iteration, lines (c, d) remain in the cache.
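The mapping in this illustration follows directly from the direct-mapped placement rule (line index modulo cache size). A sketch, using 0-based set indices instead of the 1-based numbering above:

```python
def direct_map(loop_lines, num_sets):
    """Place consecutive loop lines into a direct-mapped cache:
    line i goes to set i % num_sets."""
    mapping = {s: [] for s in range(num_sets)}
    for i, line in enumerate(loop_lines):
        mapping[i % num_sets].append(line)
    return mapping

# Six-line loop (a..f) in a four-line cache: only two sets see conflicts.
print(direct_map("abcdef", 4))
# {0: ['a', 'e'], 1: ['b', 'f'], 2: ['c'], 3: ['d']}
```

Lines c and d map to conflict-free sets and therefore stay resident across all iterations, matching the (c, d) observation in the text.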

3.2 Prefetching Algorithm for Single-Path Code

The prefetching algorithm for single-path code considers two forms of prefetching: sequential and non-sequential prefetching. Sequential instruction streams are a trivial pattern to predict, since the address of the next prefetch is an increment of the current address line. A simple next-line prefetcher [14] is a suitable solution for such a pattern of instructions.
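For the sequential case, the prefetch target is simply the line after the one holding the current PC. A sketch assuming a 32-byte cache line (the line size is an assumption of ours, not fixed by the text):

```python
LINE_SIZE = 32  # bytes per cache line; an assumed value

def next_line_addr(pc):
    """Address of the cache line immediately following the one holding pc."""
    return (pc // LINE_SIZE + 1) * LINE_SIZE

print(hex(next_line_addr(0x104)))  # 0x120
```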

In contrast, a non-sequential prefetcher needs input information to determine the target address to be prefetched. Single-path code has a strong advantage in this part of the prefetching, because the outcome of every branch target is statically known. This target information can be given to the prefetcher in the form of instructions (software prefetching), or it can be kept in a local memory (organized as a table) and used by the prefetcher when it is needed (hardware prefetching). For the software solution, special prefetch instructions are needed and the CPU hardware has to be modified.

In order to keep the development of the prefetcher independent from the CPU and the compiler, and also to avoid the overhead of executing prefetch instructions in software, we have decided to use a hardware solution for the single-path prefetcher. Since single-path code consists only of serial segments and loops, the non-sequential algorithm needs to treat only the branch instructions of loop back-edges. Loops larger than the cache are easily handled by the prefetching algorithm: the loop body is prefetched with the sequential algorithm, while the loop header is prefetched with the non-sequential one. If a loop fully fits into the cache, it does not need to be prefetched on each iteration. Such loops are identified, and the prefetcher does not generate any prefetching requests for them until the last iteration, when the execution stream exits the loop.

The granularity of prefetching is defined as one cache line. For larger amounts of prefetched instructions, the probability of overshooting the end of a sequence would increase, resulting in cache pollution from useless prefetching. The granularity also fixes the prefetching distance at one cache line ahead.

3.3 Reference Prediction Table

The Reference Prediction Table (RPT) is the part of the prefetcher that holds information about the instruction stream (Figure 2). Each RPT entry consists of a trigger address, a destination address, a count, and a type field. The trigger address is the program-counter address that triggers the non-sequential algorithm of the prefetcher. The destination address is the target address that is prefetched. Since loops in single-path code have a constant number of iterations, the count field tells the prefetcher how many times the target address should be prefetched. The type field indicates which loops fit into the cache and which loops are bigger than the cache. If the type value is zero, the prefetcher takes no action, since the loop is smaller than the cache and resides completely in it. When the counter of that loop reaches zero, the loop iterations are finished, and the prefetcher triggers the prefetching of the next cache line. A profiling process is needed to identify the loops (the loop header and the back-edge branch of each loop), the number of iterations, and the size of the loops in order to fill the RPT.
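A minimal sketch of one RPT entry as just described; the field names and the example addresses are illustrative, not taken from the paper's hardware design:

```python
from dataclasses import dataclass

@dataclass
class RPTEntry:
    trigger: int       # PC address of the loop back-edge branch
    destination: int   # target (loop-header) address to prefetch
    count: int         # constant number of loop iterations, from profiling
    loop_fits: bool    # type field: True if the whole loop fits in the cache

# Example entry produced by the offline profiling pass: a loop whose
# back-edge sits at 0x2040, header at 0x2000, running 16 iterations.
entry = RPTEntry(trigger=0x2040, destination=0x2000, count=16, loop_fits=True)
```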

3.4 Architecture of the Prefetcher

As shown in Figure 2, the prefetch hardware for single-path code consists of the Reference Prediction Table (RPT), the next-line prefetcher, and the prefetch controller (a state machine). The next-line prefetcher serves the sequential parts of the code, while the state machine, in association with the RPT, is used for prefetching distant targets.

At run time, when a new address is generated, its value is passed to the RPT and the next-line prefetcher. When the PC value matches an entry in the RPT, the prefetch controller reads the type bit and the counter value to check whether the loop is smaller or bigger than the cache and, on each iteration, whether the final iteration has been reached. If there is no match in the RPT, the next-line prefetcher increments the address by one cache line and issues that address to the cache. The RPT output has precedence over next-line prefetching.
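The precedence rule can be sketched behaviorally as follows. This is a software model under stated assumptions (RPT entries as dictionaries, one-cache-line granularity, an illustrative line size), not the actual state machine:

```python
LINE_SIZE = 32  # bytes per cache line (illustrative)

def prefetch_address(pc, rpt):
    """Return the next address to prefetch.

    RPT output takes precedence over next-line prefetching.
    Returns None when no request should be issued.
    """
    entry = rpt.get(pc)
    if entry is None:
        return pc + LINE_SIZE            # sequential: next-line prefetch
    if entry["count"] > 1:
        entry["count"] -= 1
        if entry["loop_fits"]:
            return None                  # loop resident in cache: no request
        return entry["dest"]             # large loop: prefetch header again
    return pc + LINE_SIZE                # final iteration: continue past loop

# A large loop (does not fit in the cache): back-edge at 0x2040,
# header at 0x2000, three iterations.
rpt = {0x2040: {"dest": 0x2000, "count": 3, "loop_fits": False}}
assert prefetch_address(0x1000, rpt) == 0x1000 + LINE_SIZE  # no RPT match
assert prefetch_address(0x2040, rpt) == 0x2000              # iterations left
assert prefetch_address(0x2040, rpt) == 0x2000
assert prefetch_address(0x2040, rpt) == 0x2040 + LINE_SIZE  # final iteration
```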

4. RELATED WORK

Designers have proposed several strategies to increase the performance of cache-memory systems. Some of these approaches use software support to perform prefetching, while others are strictly hardware-based. Software solutions need explicit fetch instructions, issued by the compiler, to perform prefetching. In this section we discuss only related hardware solutions.

The simplest form of prefetching comes passively from the cache itself. On a cache miss, besides the missed instruction, the cache also fetches the other instructions that belong to the same line. An extension of the cache-line size implies a larger granularity: more instructions are fetched on a cache miss. The disadvantages are that longer cache lines take longer to fill, generate useless memory traffic, and contribute to cache pollution [5].

The "one block look ahead" prefetching, later extended to "next-N-line", prefetches the cache lines that follow the one currently being fetched by the CPU [14]. The scheme requires little additional hardware to compute the next sequential addresses. Unfortunately, next-N-line prefetching is unlikely to improve performance when execution proceeds through non-sequential execution paths, where the prefetching guess can be incorrect.

Tagged prefetching associates a tag bit with each cache line [7]. When a line is prefetched, its tag bit is set

CRTS 2014 12 Rome, Italy


to zero. If the line is used, the bit is set to one and the next sequential line is immediately prefetched. The stream buffer is a similar approach, except that the buffer stands between main memory and the cache in order to avoid polluting the cache with data that may never be needed.

The target prefetching scheme addresses the problem of non-sequential code [16]. This approach comprises a next-line prediction table whose entries consist of a current line address and a target line address. When the program counter changes its value, the prefetcher searches the prediction table. If there is a match, the target address is a candidate for prefetching.

A hybrid scheme that combines both next-line and target prefetching offers cumulative accuracy in reducing cache misses. However, this solution has limited effectiveness, since the predicted direction is derived from the previous execution. "Wrong-path" prefetching is a similar approach, except that instead of using a target table, it immediately prefetches the target of a conditional branch once the branch instruction is recognized in the decode stage [11]. This solution can be effective only for non-taken branches. The Markov prefetcher [6] prefetches multiple reference predictions from the memory subsystem by observing the miss-reference stream as a Markov model.

Loop-based instruction prefetching [2] is similar to our solution, except that the loop headers are always prefetched and the prefetching is issued at the end of the loop with no prefetch distance. The cooperative approach [9] also considers sequential and non-sequential prefetching, but uses a software solution for the non-sequential part. The dual-mode instruction prefetch scheme [8] is an alternative that improves worst-case execution time by associating a thread with each instruction block that is part of the WCET path. These threads are generated by the compiler and remain static during task execution.

5. CONCLUSION AND FUTURE WORK

To overcome the problem of long execution times of single-path code, we have proposed a new memory-hierarchy organization that attempts to reduce the memory access time by bringing instructions into the cache before they are required.

The single-path prefetching algorithm combines a sequential and a non-sequential prefetching scheme with full accuracy in the predicted instruction stream, based on the predictable properties of single-path code. Designed as a hardware solution, the prefetcher does not produce additional timing overhead for instruction prefetching. Our solution also keeps the prefetcher functionality independent, without interfering with any CPU stage. The dual-bank cache makes it possible to pipeline the CPU and prefetcher accesses to the cache memory in order to fully utilize the memory bandwidth. By using a prefetch granularity of one cache line, we eliminate the possibility of cache pollution and useless memory traffic.

In our future work we plan to show the feasibility of the memory hierarchy by implementing it on an FPGA platform, and also to extend the prefetcher for input-independent if-else structures that are not converted to sequential code.

Acknowledgments

This work has been supported in part by the European Community's Seventh Framework Programme [FP7] under grant agreement 287702 (MultiPARTES) and the EU COST Action IC1202: Timing Analysis on Code Level (TACLe).

6. REFERENCES

[1] P. Axer, R. Ernst, H. Falk, A. Girault, D. Grund, N. Guan, B. Jonsson, P. Marwedel, J. Reineke, C. Rochange, et al. Building timing predictable embedded systems. ACM Transactions on Embedded Computing Systems (TECS), 13(4):82, 2014.

[2] Y. Ding and W. Zhang. Loop-based instruction prefetching to reduce the worst-case execution time. IEEE Transactions on Computers, 59(6):855–864, 2010.

[3] J. Gustafsson, B. Lisper, R. Kirner, and P. Puschner. Code analysis for temporal predictability. Real-Time Systems, 32(3):253–277, 2006.

[4] S. Hahn, J. Reineke, and R. Wilhelm. Towards compositionality in execution time analysis: definition and challenges. In 6th International Workshop on Compositional Theory and Technology for Real-Time Embedded Systems, 2013.

[5] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2012.

[6] D. Joseph and D. Grunwald. Prefetching using Markov predictors. In ACM SIGARCH Computer Architecture News, volume 25, pages 252–263. ACM, 1997.

[7] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364–373. IEEE, 1990.

[8] M. Lee, S. L. Min, C. Y. Park, Y. H. Bae, H. Shin, and C. S. Kim. A dual-mode instruction prefetch scheme for improved worst case and average case program execution times. In Proceedings of the Real-Time Systems Symposium, pages 98–105. IEEE, 1993.

[9] C.-K. Luk and T. C. Mowry. Architectural and compiler support for effective instruction prefetching: a cooperative approach. ACM Transactions on Computer Systems, (1):71–109, 2001.

[10] J. C. Park and M. Schlansker. On predicated execution. Technical Report HPL-91-58, HP Labs, 1991.

[11] J. Pierce and T. Mudge. Wrong-path instruction prefetching. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-29), pages 165–175. IEEE, 1996.

[12] P. Puschner and A. Burns. Writing temporally predictable code. In Proceedings of the Seventh International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2002), pages 85–91. IEEE, 2002.

[13] P. Puschner, R. Kirner, B. Huber, and D. Prokesch. Compiling for time predictability. In Computer Safety, Reliability, and Security, pages 382–391. Springer, 2012.

[14] A. J. Smith. Sequential program prefetching in memory hierarchies. Computer, 11(12):7–21, 1978.

[15] A. J. Smith. Cache memories. ACM Computing Surveys (CSUR), 14(3):473–530, 1982.


[16] J. E. Smith and W.-C. Hsu. Prefetching in supercomputer instruction caches. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, pages 588–597. IEEE Computer Society Press, 1992.


Five problems in compositionality of real-time systems

Björn Andersson
Carnegie Mellon University

ABSTRACT

This abstract presents five important problems in compositionality of real-time systems.

Statement

The general problem of compositionality of real-time systems is to create a concept that describes the resource consumption of something so that this concept can be taken as input to analysis. In this abstract, I present a list of five problems of compositionality that I believe are important:

1. Create a provably good interface for constrained-deadline sporadic tasks on a single processor. Many previously proposed interfaces are based on bandwidth-like quantities, which makes them unable to achieve provably good performance, as the following example shows. Consider a system with a single processor and two components. Component 1 comprises k tasks (τ1, τ2, . . . , τk) characterized by Ti = ∞, Di = i, Ci = 1. Component 2 comprises one task (τk+1) characterized by Tk+1 = ∞, Dk+1 = k + 1, Ck+1 = 1. For each value of k ≥ 1, if the tasks were scheduled directly on the processor using EDF, the task set would be schedulable. Using bandwidth-like interfaces, however, component 1 requires the bandwidth ∑_{i=1}^{k} (1/i) and component 2 requires the bandwidth 1/(k + 1). If such an interface were used, and if a schedulability test took these bandwidths as input, then for k ≥ 2 the required bandwidth would exceed 100% and the task set would be deemed unschedulable. For k → ∞, the required bandwidth (as stated by the interfaces) would be infinite. Hence, there are task sets that are schedulable directly on the processor, but if a bandwidth-like interface is used, even an infinite speedup of the processor is not enough to make the task set schedulable.
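The numbers in this example are easy to verify; a quick check of the bandwidth demanded by the two interfaces (the harmonic sum for component 1, 1/(k+1) for component 2):

```python
# Bandwidth demanded under a bandwidth-like interface:
# component 1 needs sum_{i=1..k} 1/i, component 2 needs 1/(k+1).
def required_bandwidth(k):
    comp1 = sum(1.0 / i for i in range(1, k + 1))  # harmonic sum H_k
    comp2 = 1.0 / (k + 1)
    return comp1 + comp2

assert required_bandwidth(1) == 1.5   # k = 1: 1 + 1/2
assert required_bandwidth(2) > 1.0    # k = 2: 1 + 1/2 + 1/3 > 100%
assert required_bandwidth(100) > 5.0  # the harmonic sum grows without bound
```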

2. Create a concept that describes the memory accesses of a task. This concept should describe the possible timing and the possible memory addresses, and it should be possible to take it as input to a timing analysis that is aware of contention for resources in the memory system.

3. Create a concept that describes the aggregate resource consumption of a complex interaction in a protocol. This is particularly important for compositional analysis of real-time communication over a shared communication medium for traffic that uses the Transmission Control Protocol (TCP), because in such a setting the acknowledgement packets consume network bandwidth but depend on the data-carrying packets. It is therefore desirable to create a concept that describes the aggregate resource consumption of both the data packets and the ACK packets of a given flow. On a higher level, this problem becomes important when using software frameworks for distributed systems that rely on TCP.

4. Create a concept that describes the aggregate resource consumption of wireless traffic. On a wireless medium, a transmitted packet can get corrupted by noise, requiring a retransmission. A concept that describes the aggregate resource consumption of a given wireless traffic flow therefore needs to be a function of the noise parameters of the channel.

5. Create a concept that describes the aggregate resource consumption of software in an autonomous car. Some tasks in an autonomous car are just like those in other types of embedded systems: they periodically read sensors, perform computations, and actuate commands. This is true for low-level control loops and some sensor-fusion tasks in autonomous cars. But an autonomous car needs to perform other computations as well, related to higher-level cognition, responding to events, and planning future actions, and the resource consumption of such tasks depends on the environment. (For example, driving in a dense urban environment presents more events to deal with than driving on a highway.) Hence, there is a need to find an "eventfulness" metric of the physical world and to describe the resource consumption of the software as a function of this metric.

There should be an analysis that takes the concept as input, and this analysis should have low pessimism; there should also be a metric that characterizes this low pessimism. Finally, the concept should be capable of describing systems in a way that requires few bits, because this indicates good information hiding.

Acknowledgments

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This material has been approved for public release and unlimited distribution. DM-0001733


Compositional Mixed-Criticality Scheduling

Extended Abstract

Arvind Easwaran
Nanyang Technological University, Singapore

[email protected]

Insik Shin
KAIST, Korea

[email protected]

An increasingly important trend in embedded systems is towards open computing environments, where multiple functionalities are developed independently and integrated together on a single computing platform; this trend is evident in industry-driven initiatives such as ARINC653 in avionics and AUTOSAR in automotive. An important notion behind this trend is the safe isolation of separate functionalities, or partitioning, to achieve fault containment and compositional validation/certification. This raises the challenge of how to balance the conflicting requirements of partitioning for safety assurance and resource sharing for economical benefits. The concept of mixed-criticality appears to be important in meeting these two seemingly conflicting goals.

In many safety-critical embedded systems, the correct behavior of some functionality (e.g., flight control) is more important ("critical") to the overall safety of the system than that of another (e.g., in-flight entertainment). In order to certify such systems as being correct, they are conventionally assessed under certain assumptions on the worst-case runtime behavior. For example, the estimation of Worst-Case Execution Times (WCETs) of code for highly critical functionalities typically involves very conservative assumptions that are unlikely to occur in practice [3]. Such assumptions make sure that the resources reserved for critical functionalities are always sufficient. Thus, the system can be designed to be safe from a validation and certification perspective, but the resources are in fact severely under-utilized in practice.

In order to close this gap in resource utilization, Vestal [6] proposed the mixed-criticality (MC) task model, which comprises different WCET values. These different values are determined at different levels of confidence ("criticality"), based on the following principle. A reasonable low-confidence WCET estimate, even if it is based on measurements, may be sufficient for almost all possible execution scenarios in practice. In the highly unlikely event that this estimate is violated, as long as the scheduling mechanism can ensure deadline satisfaction for highly critical applications, the resulting system design may still be considered safe.

To ensure deadline satisfaction of critical applications, existing mixed-criticality studies make pessimistic assumptions when a single high-criticality task executes beyond its expected (low-confidence) WCET. They assume that the system will either immediately ignore all the low-criticality tasks (e.g., [1, 4]) or degrade the service offered to them (e.g., [2, 5]). They further assume that all the high-criticality tasks in the system can thereafter request additional resources, up to their pessimistic (high-confidence) WCET estimates. Although these strategies ensure safe execution of critical applications, they have a serious drawback, as pointed out in a recent article [2]. When a high-criticality task exceeds its expected WCET, the likelihood that all the other high-criticality tasks in the system will also require more resources is very low in practice. Therefore, penalizing all the low-criticality tasks for this unlikely event seems unreasonable.
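A toy model of that pessimistic rule, in the style described for, e.g., [1, 4]; the task records and field names are illustrative, not from any specific implementation:

```python
def switch_to_hi_mode(tasks):
    """Classic mixed-criticality mode switch: triggered when any single
    HI-criticality task overruns its low-confidence WCET C(LO)."""
    for t in tasks:
        if t["crit"] == "LO":
            t["active"] = False       # every LO task is dropped...
        else:
            t["budget"] = t["c_hi"]   # ...and every HI task may use C(HI)
    return tasks

tasks = [
    {"name": "flight_control", "crit": "HI", "c_lo": 4, "c_hi": 10,
     "budget": 4, "active": True},
    {"name": "entertainment", "crit": "LO", "c_lo": 2, "c_hi": 2,
     "budget": 2, "active": True},
]
switch_to_hi_mode(tasks)
# flight_control's budget grows to C(HI) = 10, while entertainment is
# abandoned, even though only one HI task actually overran.
```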

Proposed Research.

We are currently exploring a new research direction that uses a component-based mixed-criticality system model to address the above issues. Designer-tunable component boundaries can be used to isolate low-criticality tasks from unexpected variations in high-criticality WCETs. Thus, it is possible to allow low-criticality tasks in some components to continue their execution uninterrupted, even when some high-criticality tasks in other components have exceeded their expected WCETs. Some of the main challenges in this direction of research are as follows: 1) How can the designer tune the low-criticality isolation capabilities of a component? 2) What does a mixed-criticality component interface look like? 3) How does a component indicate a criticality change to the rest of the system? 4) What impact does a criticality change inside a component have on the rest of the system? 5) Can we maintain compositionality even when the impact of criticality changes is not confined within components?

1. REFERENCES

[1] S. Baruah, V. Bonifaci, G. D'Angelo, H. Li, and A. Marchetti-Spaccamela. The Preemptive Uniprocessor Scheduling of Mixed-Criticality Implicit-Deadline Sporadic Task Systems. In ECRTS, 2012.

[2] A. Burns and S. Baruah. Towards a More Practical Model for Mixed-Criticality Systems. In Workshop on Mixed-Criticality Systems (co-located with RTSS), 2013.

[3] A. Burns and B. Littlewood. Reasoning about the reliability of multi-version, diverse real-time systems. In RTSS, 2010.

[4] A. Easwaran. Demand-based Scheduling of Mixed-Criticality Sporadic Tasks on One Processor. In RTSS, 2013.

[5] P. Huang, G. Giannopoulou, N. Stoimenov, and L. Thiele. Service adaptions for mixed-criticality systems. In ASP-DAC, 2014.

[6] S. Vestal. Preemptive Scheduling of Multi-criticality Systems with Varying Degrees of Execution Time Assurance. In RTSS, 2007.


Challenges of Virtualization in Many-Core Real-Time Systems

Matthias Becker, Mohammad Ashjaei, Moris Behnam, Thomas Nolte
MRTC / Mälardalen University, Västerås, Sweden

{matthias.becker, mohammad.ashjaei, moris.behnam, thomas.nolte}@mdh.se

ABSTRACT

Embedded real-time virtualization is used on single-core and multicore platforms to consolidate multiple systems within a single chip. The number of cores on one processor is steadily increasing, making many-core processors with tens to hundreds of cores available in the near future. This gives rise to a number of new challenges for real-time virtualization on such systems. In this work we identify a number of those challenges. We also describe initial ideas on how to tackle them.

1. INTRODUCTION

Virtualization is a key component in today's industrial systems [4]. Many-core processors with tens to hundreds of cores, connected by a 2D-mesh-based Network-on-Chip (NoC), bring new challenges. The large number of simple cores delivers enough resources to allocate whole cores to virtual partitions. Therefore, the new challenge is how to divide the hardware topology in order to guarantee enough resources, both computational and memory bandwidth, for all the guest systems. A guest system is abstracted by an interface and can hence be seen as a component. Such components describe newly developed parallel applications as well as legacy applications designed for single-core systems.

2. PROBLEM DESCRIPTION

An important challenge is the definition of the interface needed to abstract the resource requirements of guest systems. Given those interfaces, system integration needs to find a suitable composition of guest systems on the many-core. Runtime mechanisms are needed to police resource access depending on the parameters defined during system integration.

2.1 Challenges for the guest interface

Each guest system has to operate under certain constraints. For many-cores, knowing the location of the cores relative to each other is important. This information is needed in order to perform schedulability analysis of the NoC communication [2]. Recent many-core processors (e.g., Tilera's Tile64 or Kalray's MPPA-256 processor [6, 3]) provide different NoCs for core-to-core communication and memory access. Finding the right parameters and abstractions for the interface is an important challenge. Moreover, there may be dependencies among guests, hence a guest communication abstraction is required. We propose to define those constraints by an interface Gn = {β, Nx, Ny, L}, where n is the index of the system and β is the required bandwidth to off-chip memory. The parameters Nx and Ny specify the size and shape of the guest system (i.e., Nx × Ny yields the number of cores). The set L defines the individual communications to other guests. Evaluating the minimal values for the interface parameters is another important challenge.
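The proposed interface Gn = {β, Nx, Ny, L} can be written down directly; the field names, units, and example values below are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class GuestInterface:
    beta: float                                # required off-chip memory bandwidth
    nx: int                                    # guest width, in cores
    ny: int                                    # guest height, in cores
    links: list = field(default_factory=list)  # set L: flows to other guests

    @property
    def cores(self) -> int:
        return self.nx * self.ny               # Nx x Ny yields the number of cores

# A 4x2 guest needing 1.5 GB/s to off-chip memory and one flow to guest G2.
g1 = GuestInterface(beta=1.5, nx=4, ny=2, links=[("G2", 0.2)])
assert g1.cores == 8
```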

2.2 Challenges for system integration

Since we consider real-time systems, we target static constellations. At design time, a set of guest systems, G, needs to be mapped to the many-core hardware. Since this step is performed offline, sophisticated mapping techniques can be applied. The challenges for the mapping are multidimensional: a physical mapping to the cores needs to guarantee the required network bandwidth and the timeliness of the guest communication, and communication latencies depend on the physical location of the cores.

2.3 Challenges for runtime mechanisms

At runtime, the allocation of the resources specified by a guest interface needs to be guaranteed. Computational resources are statically assigned, so they do not need to be monitored. However, access to the off-chip memory is shared among guests; distributing the memory bandwidth at runtime is therefore an important challenge. Lu et al. proposed (σ, ρ)-based flow regulation for NoCs [5], which itself is based on Network Calculus [1]. In order to regulate the outgoing traffic flows on a NoC link, a packet shaper injects messages according to a given profile rather than as fast as possible. This mechanism is successfully implemented in the MPPA-256 processor [3]. We propose to include a software-based packet shaper in the hypervisor. Virtualization of the NoC for guest-to-guest communication may affect the communication inside a third guest, as they use the same network. This raises another challenge: investigating a proper mechanism to handle communication on this network.
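A software packet shaper enforcing a (σ, ρ) profile can be sketched as a token bucket: the flow may inject at most σ + ρ·t units of traffic over any interval of length t. This is a minimal illustration of the regulation idea under assumed units, not the actual MPPA-256 mechanism:

```python
class SigmaRhoShaper:
    """Token-bucket regulator for one outgoing NoC flow."""

    def __init__(self, sigma, rho):
        self.sigma = sigma          # burst allowance (flits)
        self.rho = rho              # sustained rate (flits per cycle)
        self.credit = sigma         # bucket starts full
        self.last_cycle = 0

    def try_inject(self, cycle, size):
        # Replenish credit at rate rho, capped at the burst size sigma.
        elapsed = cycle - self.last_cycle
        self.credit = min(self.sigma, self.credit + self.rho * elapsed)
        self.last_cycle = cycle
        if size <= self.credit:
            self.credit -= size
            return True             # packet conforms to the profile
        return False                # hold the packet until credit builds up

shaper = SigmaRhoShaper(sigma=100, rho=1)
assert shaper.try_inject(0, 100)    # a burst of sigma is allowed at once
assert not shaper.try_inject(0, 1)  # bucket empty: packet is held
assert shaper.try_inject(50, 50)    # 50 cycles later, 50 flits of credit
```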

3. REFERENCES

[1] R. Cruz. A calculus for network delay. I. Network elements in isolation. IEEE Transactions on Information Theory, 37(1):114–131, 1991.

[2] D. Dasari et al. NoC contention analysis using a branch-and-prune algorithm. ACM Trans. Embed. Comput. Syst., 13:113:1–113:26, 2014.

[3] B. D. de Dinechin et al. Time-critical computing on a single-chip massively parallel processor. In DATE '14, pages 97:1–97:6, 2014.

[4] G. Heiser. The role of virtualization in embedded systems. In IIES '08, pages 11–16, 2008.

[5] Z. Lu et al. Flow regulation for on-chip communication. In DATE '09, pages 578–581, 2009.

[6] Tilera. Tile64 processor. http://www.tilera.com/products/processors, retrieved Oct. 30, 2014.


Managing end-to-end resource reservations

Extended Abstract

Luis Almeida 1,2
1 IT - University of Porto, Porto, Portugal
[email protected]

Moris Behnam 2
2 IDT - Mälardalen University, Västerås, Sweden
[email protected]

Paulo Pedreiras 3
3 IT - University of Aveiro, Aveiro, Portugal
[email protected]

Categories and Subject Descriptors

C.2.1 [Computer-Communication Networks]: Network Architecture and Design—Network communications, Distributed networks

Keywords

Computer networks, Resource reservations, Scalability, Adaptability

1. CONTEXT

Currently, there is a strong push for real-time applications distributed over large geographical areas, fuelled by interactive multimedia, remote interactions, cloud-based time-sensitive services, and many cyber-physical systems [3]. These applications rely on many different resources, from the end nodes where they execute to the networks over which they communicate. However, supporting global compositionality in this realm, i.e., guaranteeing application performance a priori, independently of load variations, is a significant challenge. It calls for resource reservations across the system that comply with performance metrics agreed at the time of application deployment, in what is normally called a Service Level Agreement (SLA). This requires well-defined application resource requirements, from the computing end nodes, possibly including multiple internal resources, to the network.

In spite of existing technical solutions for enforcing reservations, from traffic shaping in networks to virtualization in processors and reservations in static distributed systems [2], such solutions typically fall short on at least two aspects: scalability and adaptability. The former is needed to cope with large numbers of reservations to achieve the desired segregation and isolation among many applications, while the latter is needed to mitigate the effects of overprovisioning that micro-reservations can have on the efficiency of the whole system. These two requirements apply particularly to the network, given its central position in supporting the applications referred to above and, in general, emerging paradigms such as the Internet of Things. Nevertheless, adaptability naturally extends from the network to the end nodes, for the sake of efficiency, too.

On the other hand, the scalable solutions that we have today, e.g., those deployed in the current Internet, provide at best class-based Quality-of-Service (QoS) support [1]. This grants protection against interference from traffic of lower, delay-tolerant QoS classes, but not within the same class, which is aggravated when the number of similar applications grows, e.g., VoIP calls.

Finally, supporting scalability and adaptability requires agile resource management. While the former points to a distributed management architecture, the latter is better achieved with a centralized one. This apparent conflict is typically dealt with using clustering: local reservations are managed at the cluster level, while end-to-end reservations require a distributed cluster-level protocol. An example of such an approach is the Stream Reservation Protocol (SRP) in Audio-Video Bridging, but it is still class-oriented.

2. OPEN PROBLEMS

Given the considerations above, we herein formulate what we consider to be the open problems in achieving the desired compositionality of distributed real-time applications in large open systems in a resource-efficient way. Given a distributed real-time application that will execute on N end nodes connected to a large network, how can we:

• Formulate its resource requirements and interfaces?

• Express adaptivity in such requirements/interfaces?

• Support scalable and adaptive network reservations?

• Analyze the requirements feasibility?

• Carry out global admission control and enforce the needed reservations?

• Track and distribute slack?

3. ACKNOWLEDGMENTS

With support from the Portuguese Government through FCT grants CodeStream (PTDC/EEI-TEL/3006/2012) and Serv-CPS (PTDC/EEAAUT/122362/2010).

4. REFERENCES

[1] G. Bertrand, S. Lahoud, M. Molnar, and G. Texier. QoS routing and management in backbone networks. In Intell. QoS Tech. and Network Manag.: Models for Enhancing Comm., pages 138–159. IGI Global, 2010.

[2] N. Serreli, G. Lipari, and E. Bini. Deadline assignment for component-based analysis of real-time transactions. In CRTS 2009 Proceedings, December 2009.

[3] F. Xia, L. Ma, J. Dong, and Y. Sun. Network QoS management in cyber-physical systems. In ICESS 2008 Proceedings, pages 302–307. IEEE, July 2008.


Supporting Single-GPU Abstraction through Transparent Multi-GPU Execution for Real-Time Guarantees∗

[Extended Abstract]

Wookhyun Han, Hoon Sung Chwa, Hwidong Bae, Hyosu Kim and Insik Shin
KAIST, South Korea

[email protected]

Graphics Processing Units (GPUs) are high-performance many-core processors capable of very high computation and data throughput. Recently, multi-GPU platforms have emerged as a promising platform that allows two or more GPUs to work in parallel, either to accelerate computation or to accommodate a demand exceeding a single GPU's capacity. However, little support is provided for GPGPU over multi-GPU. In this paper, we consider supporting real-time applications in multi-GPU systems.

Despite the many benefits of GPGPU, applying GPGPU technology to real-time computing raises several challenges. In current GPGPU programming frameworks, applications (i) copy input data from host memory to device memory, (ii) launch a piece of GPU-accelerated code, called a kernel, on the GPU to perform the computation, and (iii) copy the device outputs back to host memory. Due to the non-preemptive nature of copying data to/from device memory and launching kernels on the GPU, these operations can block data-copy and kernel-launch requests issued by higher-priority applications. To overcome this non-preemptive nature, a couple of studies [1, 3] share the principle of making non-preemptible regions smaller, so that higher-priority applications experience shorter blocking times for kernel launches [1] and for data transfers between host and device memory [3].
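The principle shared by [1, 3], shrinking non-preemptible regions to shorten blocking, can be illustrated with a small back-of-the-envelope model. The bandwidth and transfer sizes below are assumed values for illustration, not measurements from those papers.

```python
# Sketch of the blocking-time argument: if a host-device copy is
# non-preemptible, a higher-priority request can be blocked for the whole
# copy; splitting the copy into chunks bounds blocking by one chunk.
# All numbers are illustrative assumptions.

def worst_case_blocking(transfer_bytes, chunk_bytes, bandwidth_bytes_per_ms):
    """Worst-case blocking a higher-priority request suffers: the time to
    finish the largest non-preemptible region already in progress."""
    largest_region = min(transfer_bytes, chunk_bytes)
    return largest_region / bandwidth_bytes_per_ms

BW = 8_000_000  # assumed host-device bandwidth: ~8 GB/s = 8,000,000 bytes/ms

monolithic = worst_case_blocking(64_000_000, 64_000_000, BW)  # one big copy
chunked = worst_case_blocking(64_000_000, 1_000_000, BW)      # 1 MB chunks

print(f"monolithic copy: {monolithic:.3f} ms blocking")  # 8.000 ms
print(f"1 MB chunks:     {chunked:.3f} ms blocking")     # 0.125 ms
```

The total transfer time is unchanged; only the worst-case blocking imposed on higher-priority work shrinks, which is exactly what a schedulability analysis cares about.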

Supporting real-time computing on multiple resources (i.e., multiprocessors) is generally much more complicated than the single-resource case. Liu and Layland [5] attributed the difficulty of global multi-resource real-time scheduling to "the simple fact that a task can use only one processor even when several processors are free at the same time." This task-level single-resource restriction causes the optimal preemptive uniprocessor scheduling algorithms, RM and EDF, to become subject to a scheduling anomaly, called Dhall's effect [2]: some task sets may be unschedulable on multiple processors even though they have a low utilization close to 1. This shows the importance of relaxing the task-level single-resource restriction whenever possible.
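Dhall's effect can be checked numerically with the classic task set: m light tasks with tiny utilization plus one heavy task whose utilization approaches 1, released together on m processors. The parameters below are chosen for illustration.

```python
# Numeric illustration of Dhall's effect under global EDF, using the classic
# construction: m light tasks (C = 2*eps, T = 1) and one heavy task
# (C = 1, T = 1 + eps) on m processors, all released at time 0.
# Parameters are illustrative, not taken from the paper.

def dhall_example(m, eps):
    # Under global EDF, the m light tasks have the earliest deadlines,
    # so they occupy all m processors during [0, 2*eps).
    heavy_start = 2 * eps
    heavy_finish = heavy_start + 1.0          # the heavy task needs 1 time unit
    heavy_deadline = 1.0 + eps
    # Total utilization approaches 1 as eps -> 0, yet the deadline is missed.
    utilization = m * (2 * eps / 1.0) + 1.0 / (1.0 + eps)
    return utilization, heavy_finish, heavy_deadline

u, finish, deadline = dhall_example(m=4, eps=0.01)
print(f"utilization = {u:.3f}")      # close to 1 despite 4 processors
print(finish > deadline)             # True: the heavy task misses its deadline
```

Shrinking eps drives the utilization arbitrarily close to 1 while the miss persists, which is precisely why relaxing the one-task-one-processor restriction matters.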

Aiming to exploit the massive parallelism of the GPU architecture, each kernel is typically designed to spawn a large number of threads that perform the same computation over different data in parallel. This thread parallelism within a kernel enables the kernel to be decomposed into multiple sub-kernels, each with a smaller number of threads, and allows individual sub-kernels to execute concurrently on different GPUs without interfering with each other [4]. However, such a feature is not yet supported by the current prevailing GPGPU programming models and runtime supports, such as CUDA and OpenCL.
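The decomposition described above can be sketched in plain Python, with a loop standing in for the GPU threads; the function names are illustrative and do not correspond to any CUDA or OpenCL API.

```python
# Sketch of decomposing a data-parallel kernel into sub-kernels, in the
# spirit of [4]: each thread applies the same function to one element, so
# the index range can be split into disjoint sub-ranges that could run on
# different GPUs. Python stands in for the GPGPU runtime here.

def kernel(data, lo, hi):
    """'Sub-kernel': the same per-element computation over a sub-range."""
    for i in range(lo, hi):
        data[i] = data[i] * data[i]  # the per-thread work

def run_split(data, num_gpus):
    """Partition the index range into one sub-kernel per GPU."""
    n = len(data)
    bounds = [n * g // num_gpus for g in range(num_gpus + 1)]
    for g in range(num_gpus):        # conceptually concurrent, one per GPU
        kernel(data, bounds[g], bounds[g + 1])

single = [float(i) for i in range(10)]
kernel(single, 0, len(single))       # single-GPU execution

split = [float(i) for i in range(10)]
run_split(split, num_gpus=3)         # the same work split across 3 'GPUs'

print(single == split)  # True: the decomposition preserves the result
```

Because the sub-ranges are disjoint and the per-element computation is independent, the split execution is semantically identical to the single-kernel one; what a real runtime must add is the data placement and synchronization that this sketch omits.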

Proposed Research.

We are currently exploring the benefits and issues of relaxing the restriction that a single kernel executes only on a single GPU in real-time multi-GPU systems. The composition of multiple GPU resources provides more capacity to be utilized by GPGPU applications, as well as more potential for performance improvement through multi-GPU parallel execution of each application. Multi-GPU execution of a single kernel can reduce GPU computation time by leveraging the concurrent use of multiple GPUs. However, it imposes some overheads, including extra data transfers between host and device memory. Multi-GPU execution might therefore not be beneficial to all GPGPU applications, but only to some, depending on their compute- or memory-intensive behavior. This raises the issue of determining the multi-GPU mode of individual applications, i.e., determining how many GPUs each application uses, and it calls for a good decision-making strategy from the system's perspective to optimize system schedulability.

In addition, designing a good scheduling algorithm is also important. We aim to apply the virtual cluster-based scheduling approach presented in [6] to schedule GPGPU applications with multi-GPU modes. Depending on its GPU execution mode, each application is assigned to a GPU cluster; applications in each cluster are scheduled among themselves, and clusters in turn are scheduled on the multi-GPU system. Many new issues arise when adopting cluster-based scheduling in multi-GPU systems. One of them is how to define a cluster interface that takes multi-GPU execution overhead into account.
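A minimal sketch of the mode-decision problem described above, under an assumed cost model in which compute time scales down with the number of GPUs while per-GPU transfer overhead scales up. The model and all parameter values are ours, for illustration; they are not taken from the abstract.

```python
# Illustrative multi-GPU mode decision: choose how many GPUs k an
# application uses by trading parallel speedup against the extra
# host-device transfer overhead that each participating GPU adds.

def completion_time(compute_ms, transfer_ms_per_gpu, k):
    # Assumed model: compute divides evenly across k GPUs; each GPU
    # incurs its own input/output transfer cost.
    return compute_ms / k + transfer_ms_per_gpu * k

def best_mode(compute_ms, transfer_ms_per_gpu, max_gpus):
    """Pick the GPU count minimizing the modeled completion time."""
    return min(range(1, max_gpus + 1),
               key=lambda k: completion_time(compute_ms, transfer_ms_per_gpu, k))

# A compute-intensive application benefits from more GPUs...
print(best_mode(compute_ms=100.0, transfer_ms_per_gpu=2.0, max_gpus=4))  # 4
# ...while a transfer-dominated one is best left on a single GPU.
print(best_mode(compute_ms=10.0, transfer_ms_per_gpu=8.0, max_gpus=4))   # 1
```

Even this toy model shows why the decision belongs to the system rather than to each application: the per-application optimum ignores contention, whereas the system must pick modes jointly to optimize overall schedulability.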

1. ACKNOWLEDGMENTS

This work was supported in part by BSRP (NRF-2010-0006650, NRF-2012R1A1A1014930), NCRC (2012-0000980), IITP (2011-10041313, 14-824-09-013) and KIAT (M002300089) funded by the Korea Government (MEST/MSIP/MOTIE).

2. REFERENCES

[1] C. Basaran and K.-D. Kang. Supporting preemptive task executions and memory copies in GPGPUs. In ECRTS, 2012.
[2] S. Dhall and C. Liu. On a real-time scheduling problem. Operations Research, 26(1):127–140, 1978.
[3] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, Y. Ishikawa, and R. Rajkumar. RGEM: A responsive GPGPU execution model for runtime engines. In RTSS, 2011.
[4] J. Kim, H. Kim, J. H. Lee, and J. Lee. Achieving a single compute device image in OpenCL for multiple GPUs. In PPoPP, 2011.
[5] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20(1):46–61, 1973.
[6] I. Shin, A. Easwaran, and I. Lee. Hierarchical scheduling framework for virtual clustering of multiprocessors. In ECRTS, 2008.

