

Università di Pisa
Facoltà di Ingegneria

Tesi di Laurea Specialistica in Ingegneria Informatica

Effects of real-time scheduling on cache performance and worst case execution times

Relatore: Prof. Antonio Prete
Relatore: Prof. Giorgio Buttazzo
Relatore: Ing. Marko Bertogna

Candidato: Orges Xhani


Anno Accademico 2008/2009


Abstract

Cache memories in real-time systems can increase performance, but at the cost of unpredictable behaviour and loose bounds in the worst case execution time analysis. Preemptive schedulers, while necessary for overall schedulability, introduce further uncertainty in such systems because, when resuming execution after a preemption, a task can find some of its useful cache lines evicted by the preempting task and suffer further delays (the cache related preemption delay, CRPD) not expected by the static code analysis. These delays are strictly dependent on the execution environment of the task, such as the scheduling algorithm and the other tasks running in the system. Meanwhile schedulability analysis, essential to guarantee the correct behaviour of hard real-time systems such as avionic and automotive controllers, requires worst-case bounds on the execution time of each task. It is easy to see how the CRPD (and therefore the WCET) of the tasks and the scheduling reciprocally depend on each other. Hence providing firm guarantees on the schedulability of these systems is complex.

In more complex systems, the main processor is not the only master of the bus: there are several co-processors and direct memory access devices. In particular, personal computers often contain more than one powerful general purpose processor. Nowadays these solutions are often used even in embedded and safety critical systems because of the lower production costs, shorter time to market and lower power requirements. In these systems the delay introduced by each cache miss strongly depends on the time needed to gain access to the bus, hence the worst case execution time depends on the number and traffic profile of the other DMA devices and on the bus arbiter. In systems with high bus contention (many, memory-intensive devices) the CRPD can severely increase the WCET and jeopardize system schedulability. Therefore accurately bounding the number of cache misses is crucial to ensure the safety of these systems.

In this thesis we observe the cache misses introduced by scheduling and preemption and their effects. Since the events we are interested in, such as cache misses and bus accesses, happen in hardware and at high frequencies, they are difficult to monitor in real systems. For this reason, to observe these phenomena we used a cycle accurate software simulator mimicking


a real hardware architecture and running real software. Furthermore we report a survey of current state of the art approaches to bound the effects of the CRPD. In addition we propose a new technique which can not only bound, but actually reduce the number of cache misses, and thus the WCET of a task, by limiting the preemptions a task can suffer while still maintaining the schedulability of the system. Finally we demonstrate the benefits of our approach by implementing it in the simulation environment.

The remainder of this thesis is composed of the following chapters:

• In “Introduction” we present in more detail the delays introduced by preemption and give a historical perspective of the problem. In addition we quantify the influence of the CRPD on the WCET with references to related publications and our measurements. Furthermore we investigate several parameters that influence these delays, such as cache size, memory access patterns, number of devices linked to the bus and their traffic profile.

• In “Simulated system” we present the simulated environment. First we describe the MPARM simulator, with particular attention to the accuracy of the simulation and the statistics it can provide. Later we describe the Erika real-time operating system, presenting the overall architecture and the typical fields of usage. In addition we describe our efforts to port Erika to the simulated hardware and to improve their integration. These enhancements allowed us to obtain better and easier to interpret simulation results.

• In “State of the art” we present the existing techniques to provide firm bounds on the CRPD. First we present the currently most used techniques, such as cache locking, cache partitioning and the use of scratchpad memories. The majority of these approaches require special hardware (cache partitioning can also be implemented in software) and all of them are ad-hoc solutions requiring manual tweaking while implementing the system. Furthermore, in this chapter we describe the static timing analysis of the code, highlighting the information it can provide and what remains unknown. At last we show how to compute a bound on the number of preemptions required by the most commonly used real-time schedulers (Earliest Deadline First and Rate Monotonic).

• In “Limited preemption” we present our approach to improve the bounds on the delay introduced by preemption on the execution time. First we present the advantages of non preemptive scheduling and the limits of this approach. When the system is not schedulable in a non preemptive way, we propose a limited preemption approach, where preemptions are allowed only at specific points in the task. With this technique


we can compute a bound on the amount of time a task can run without permitting preemption, and use timing analysis results to place preemption points in the code at locations where we can respect this bound and minimize the total delay introduced by preemption.

• In “Simulation results” we implement limited preemption in some benchmark task-sets and measure their behaviour in the simulated environment, comparing the results to other approaches.

• In “Conclusions” we summarize the results of this work and present some possible future improvements to this research.


Contents

1 Introduction
  1.1 Effects of scheduling on WCET
  1.2 Bus contention delay
  1.3 Preemption with bus contention

2 Simulated system
  2.1 MPARM simulator
  2.2 Erika real-time operating system
  2.3 Simulated task sets

3 State of the art
  3.1 Heuristic and hardware driven solutions
  3.2 Timing analysis
  3.3 Useful cache block analysis
  3.4 Schedulability analysis with CRPD

4 Limited preemption
  4.1 Non-preemptive scheduling
  4.2 Maximum allowed non preemptive execution time
  4.3 Timing analysis to verify manually inserted PPs
  4.4 Automatically choosing preemption points

5 Experiments with limited preemption
  5.1 Implementing limited preemption in Erika
  5.2 Simulation results
  5.3 Simulations with bus contention

6 Conclusions

A C++ implementation for insertPP


Chapter 1

Introduction

The use of cache memories has been traditionally avoided in hard real-time systems because of their unpredictable behaviour. Actually, caches drastically reduce average execution time, and for this reason they have been widely adopted in general purpose computation systems since the sixties [Lip68]. On the other hand, in real-time applications engineers need to guarantee the correct execution of all the tasks in any case, even if each task experiences its worst case execution time. Therefore a real-time scheduler must be used and the execution time of each task has to be bounded. The scheduling problem has been thoroughly studied since the early years of computing [Jac55] and safe bounds have been available since those early years, even for schedulers widely used in current production environments [LL73]. However those guarantees require a safe evaluation of the worst case execution time (WCET) of each task in the system, and in those years, if a cache was used, no available tool could statically bound the execution time. Therefore cache memories were avoided, or disabled when built into the processor. This is the case of the U.S. Navy’s AEGIS Combat System deployed in the AN/UYK-43, the standard 32-bit computer of the United States Navy for surface ship and submarine platforms, starting in 1984. The AN/UYK-43 “has a 32K-word cache partitioned for instruction and data. However, due to unpredictable cache performance, all the module (task) utilizations are calculated as if the cache were turned off (cache bypass option). As a result the theoretically overutilized CPU is often underutilized at run-time when the cache is enabled.” [KSS91]

The first academic attempts to explore the use of caches in real time systems were presented only in the late eighties [Kir88] and required heavy manual modifications to the memory layout during the linking process. The first attempts to statically analyze the WCET of a task in presence of a cache were presented only in the nineties [RDAH94]; they were limited to instruction cache analysis and did not consider the effects of the pipeline and prefetching.


Today, almost fifteen years later, methods have been developed to bound the execution time with cache memories that were state-of-the-art more than twenty years ago [Seb01].

Still today, for hard real time systems the use of cache memories requires much special attention, and disabling them is still common practice. For example, MIPS equipped boards targeted at real time operations do not include a cache. Instead “The M4K core has a cacheless architecture, designed to interface directly to SRAM memory. [. . . ] The advanced interrupt handling capabilities and deterministic memory interface in MIPS Technologies’ cores ensure these real-time service requirements are met.” [MIP] Certification authorities for safety critical systems, when reviewing control devices that use cache memories, pay particular attention to the methodology used to bound the unpredictability introduced by the cache. The airborne certification authority still considers not using any cache the best option when implementing airplane control devices, as stated in its most recent position paper on the subject [CAS03]. Nevertheless the use of caches even in those systems is economically attractive, because all commercially available general purpose processors include a cache and cannot even work if that cache is disabled, since many of their architectural features (such as out of order execution, prefetching etc.) are built on the assumption of a cache included in the chip. Using general purpose processors allows for large economies of scale and sensible price cuts. Furthermore, by using market ready devices, hard real time projects can reduce their time to market and gain from future technology improvements more rapidly than having to wait for them to be introduced in the niche market of processors designed for safety critical applications.

1.1 Effects of scheduling on WCET

In the last decade, several commercial tools and academic prototypes have been developed to analyse the worst case execution time of code even on cache equipped processors [PB00, WEE+08]. As an example we can mention AbsInt’s aiT tool, which is in routine use in the aeronautics and automotive industries. In other chapters of this thesis we will use this tool for the timing analysis of our benchmarks. It is currently used in the certification of several time-critical avionic systems from companies such as Airbus [SPH+07]. Obviously there is still intense ongoing research in the field, as new types of caches are considered and the effects of other modern processor features, such as branch prediction modules and out-of-order execution, are analyzed.

Yet all of these tools analyze the cache performance only under the assumption that the task is not preempted. When actually scheduling the tasks, any


present-day scheduler will allow any activating higher priority task to preempt the currently running one. This introduces overheads not accounted for during the static analysis. In particular, for each preemption:

• pipeline contents are lost,

• the context switch and scheduler code has to be executed,

• after the task resumes execution, part of the contents of the cache has changed from what the timing analysis expected.

The time necessary for the context switch and the execution of the scheduler code does not depend on which task is running nor on the time of preemption, therefore we can compute it statically. The pipeline and cache contents depend on when the preemption happens and on the previous history of execution, which are known only at run-time. While losing the contents of the pipeline has negligible effects on the execution time, the loss of the contents of the cache can severely impact the following execution of the code. Indeed, after the preemption, some data is expected to be already loaded in cache by previous code, but has been evicted by the preempting job (or, actually, by the scheduler code). Therefore the following access to these data is expected to be a cache hit during static timing analysis, but is actually a cache miss at run-time. These cache misses introduced by preemption are called extrinsic cache misses [BN94], as opposed to intrinsic cache misses, which are expected to happen even if no preemption is suffered. Intrinsic cache misses are accounted for during static timing analysis and included in the WCET. On the other hand, extrinsic cache misses can cause the execution of the task to last longer than the WCET computed under the non-preemptive assumption. This additional execution time is called Cache Related Preemption Delay (CRPD) and needs to be accounted for during schedulability analysis, otherwise unexpected deadline misses can occur.
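To fix ideas, one classic way the CRPD enters fixed priority schedulability analysis (a standard formulation from the literature, anticipating chapter 3; the notation is ours) is to charge a cost γ_j to every preemption by a higher priority task τ_j in the response time iteration:

    R_i^{(n+1)} = C_i + \sum_{j \in hp(i)} \left\lceil R_i^{(n)} / T_j \right\rceil \, (C_j + \gamma_j)

Here C_i and T_i are the WCET and period of task τ_i, hp(i) is the set of tasks with priority higher than τ_i, and γ_j bounds the CRPD (plus context switch cost) of one preemption. Starting from R_i^{(0)} = C_i, the iteration either converges to the response time R_i, which must not exceed the deadline, or proves the task unschedulable.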

The static analysis tools cannot know where the scheduler will preempt the task, and therefore cannot foresee these effects. Indeed this depends not only on the adopted scheduler and the other tasks in the system (information not available to the timing analysis tool, which only considers tasks individually), but also on run-time information such as which tasks have pending activations. Furthermore, in different iterations during the run-time of the system a task can be preempted at different points in the code, therefore with different effects on the execution time. Moreover, which other task is executed after a preemption is unknown during static analysis. Therefore, which cache lines are evicted and which are still valid when resuming execution depends heavily on run-time information that no static analysis tool will ever be able to compute. For this reason, it is important to analyse the adopted scheduling scheme to bound the delays the task execution can suffer because of the run-time environment, and provide timing guarantees.


The cache related preemption delay is a non-negligible component of the operating system’s overhead even in general purpose systems. This overhead was measured to range from 1% to 7% in a typical Unix workstation in the nineties [MB91]. While in the roughly twenty years since, kernel scheduling algorithms have improved, starting to consider those effects, the gap between memory and CPU performance has increased considerably, resulting in a still high overhead. More recent measurements (Linux on an eSeries machine with a Xeon processor) report the CRPD to range from several microseconds to more than one millisecond [LDS07]. When compared to typical scheduling time-slices of 10 to 20 milliseconds, the delay is still almost the same percentage of overhead on the execution time as in the earlier measurements.

Measurements on benchmark task-sets are more of performance than of hard real time interest. Indeed, an experiment can hardly happen to measure the worst possible combination of so many run-time dependent parameters, while designers need to account for those combinations, even if improbable, to guarantee the safety of the system. Nevertheless, such measures exist for academic and explorative purposes and report the CRPD for each preemption to be almost the same even in typical real time devices. For example, [Seb02] reports that the CRPD in a task-set executing on a Motorola PowerPC 750 (MPC750) board ranges from several microseconds to almost 200 microseconds after each preemption. Furthermore, in real time systems the requirement of low response times does not allow reducing the overhead by simply increasing the scheduling time-slice, as general purpose operating systems usually do. Instead, the number of context switches can be much higher than on general purpose computers and the executed tasks are generally much shorter (typical task-sets are composed of 15 to 20 tasks, with periods ranging from 1 ms to 100 ms [GNL+03]), thus the relative overhead of the CRPD over the “useful” computation can get even higher. “For example, if we consider the PowerPC processor MPC7410 (widely used for embedded systems) which has 2MB two-way associative L2 cache [. . . ], the time taken by the system to reload the whole L2 cache is about 655µs. This reloading time directly affects task execution time. A typical partition size of an avionic system compliant to ARINC-653 can be as small as 2ms. Under this scenario, the execution time increment due to cache interference can be as big as 655µs/2ms ≈ 33%: hence, the multi-task cache interference problem can be severe in embedded systems.” [BCSM08]

We can measure the delay in the execution time introduced by preemption in the simulation environment prepared for this thesis (see later chapters for a more detailed description). In figure 1.1 we can observe how the execution time of a benchmark task is delayed when a higher frequency task preempts it. Obviously the CRPD suffered by the preempted task depends on the number of lines evicted by the higher priority one, as we can see in the graph.


[Figure: execution time (CPU cycles) vs. number of cache lines evicted by the disturbing task; curves: no preemption, with preemption]
Fig. 1.1: CoreMark scenario without preemptions and with a higher priority task preempting periodically and evicting a variable number of cache lines

Notice that even if the higher priority task does not explicitly evict any data cache lines, a non-negligible overhead is still introduced because of instruction cache interference, context switching code, pipeline flushing and some data cache interference introduced by OS code.

The system we are simulating is an example of real-time oriented hardware: it is equipped with a fast SRAM with a latency of a single CPU cycle, and the bus (an AMBA bus [Lim99]) is designed to be integrated in the same die as the processor core, operating at the same clock frequency. Indeed each cache refill (a line of 4 words) costs only 50 ns (10 CPU cycles, i.e. a 200 MHz clock). Furthermore the processor, an ARM7 core, is really slow when compared to current general purpose processors, and even to many CPUs designed for real-time applications. Therefore, the gap between the performance of the processor and the latency of the main memory is less evident. In such an optimized scenario, carefully designed to avoid most timing problems, the measured CRPD is only a small percentage of the execution time. But if we explore architectures with slower access to memory (highly probable with more modern processors than our ARM7), the influence of the CRPD on the overall execution time gets higher, as shown in figure 1.2. As we are interested in general results, not tied to a specific hardware, in the remainder of this dissertation, if not explicitly specified, we will run our simulations in the real-time optimized scenario used to plot the graph in figure 1.1. This means that the values of the CRPD and of its percentage of overhead on the non-preempted WCET may not be relevant in absolute terms, but we can still use the simulations to observe how different parameters influence them, and to compare different implementations and algorithms.


[Figure: execution time (CPU cycles) vs. CPU cycles required to fetch data from the bus; curves: no preemption, with preemption; CRPD overhead annotated from 10.53% up to 18.08%]
Fig. 1.2: CoreMark scenario, CRPD overhead and memory access times


Since the early adoption of cache memories, the scientific community has thoroughly studied the effects of this technology on performance [Lip68]. Many research papers have investigated the trade-offs engineers have to face to obtain the best from caches and the effects of architectural parameters such as size and associativity [PHH88]. In particular, in embedded systems such knowledge can be very helpful because the software running on the platform is known and unlikely to change. Several available tools can help system engineers exploit this knowledge to design the caches of such devices [GPAP97]. Experiments in our simulated environment confirm that, by tweaking the cache configuration, designers can considerably reduce the execution time of bottleneck tasks, as we can observe in figures 1.3 and 1.4.

In particular, when the footprint of the task is comparable to the size of the cache, the cache is used in the most profitable way to reduce the execution time. For the focus of this thesis on the influence of preemption on the CRPD, it is worth observing that the more a task benefits from the use of the cache (as is the case when the hardware is designed to match the requirements of the application), the more cache lines will contain useful data.


[Figure: execution time (CPU cycles) vs. data cache size (log2)]
Fig. 1.3: Tweaking data cache for CoreMark scenario

[Figure: execution time (CPU cycles) vs. instruction cache size (log2)]
Fig. 1.4: Tweaking instruction cache for CoreMark scenario


This means that when the cache is well configured and the code has a high level of locality, the CRPD is even higher. We can observe this by repeating the CRPD measurements with different cache sizes. Comparing figure 1.5 with the previous ones, we can notice that the CRPD is higher with the cache configurations that in the previous figures reduce the non-preemptive execution time.

[Figure: CRPD slowdown factor vs. cache size (log2); curves: DCache, ICache]
Fig. 1.5: Example slowdown factor (1 + CRPD / WCET) in the CoreMark scenario with different cache configurations

It is clear that, if the CRPD is not appropriately considered, all attempts to improve performance by tweaking the cache configuration will be ineffective in systems with frequent preemptions, such as real-time ones. The benefits of a better cache configuration on the execution time of a single task will be partially lost because of the cache misses induced by preemptions.

1.2 Bus contention delay

Obviously, each cache miss causes the processor to access the bus and read data from the main memory. If a dedicated bus is used, the time necessary to fetch a cache line from memory is fixed, therefore a static analysis taking into account the cache behaviour is sufficient to bound the WCET of each task (if not preempted). In more complex systems, there can be other devices capable of using the bus, such as other CPUs, co-processors and smart peripherals with DMA capability. Different bus masters can compete for the bus. Therefore, the time necessary to fetch data from memory and fill a cache line is not


fixed. For this reason, even if the number of cache misses is bounded by static analysis, the delay introduced by each of them depends on the time necessary to acquire control of the bus.
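To make this dependence explicit, the WCET bound of a non-preempted task τ_i can be sketched (our notation, for illustration only) as:

    C_i \le C_i^{exec} + N_i^{miss} \cdot (\Delta_{arb} + \Delta_{refill})

where N_i^{miss} is the bound on the number of cache misses provided by static analysis, Δ_refill is the fixed time needed to refill a line from memory, and Δ_arb is the worst-case time needed to be granted the bus, which grows with the number and the traffic profile of the other bus masters.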

Bus contention delay has been traditionally avoided in safety critical systems by using ad hoc architectures and avoiding DMA-capable devices. As an example of such ad-hoc systems we can mention the SAFEbus backplane bus [HD93], built to comply with the ARINC 659 specification [Inc93] and currently mounted on the Boeing 777 commercial aircraft. Nowadays, even for avionic and automotive manufacturers, it is becoming difficult to rely completely on such specialized components (both software and hardware). Even avionic companies such as Lockheed Martin Aeronautics, Boeing and Airbus build a good part of their systems by assembling Commercial Off-The-Shelf (COTS) components [Bak02]. In addition to rapid time-to-market and incomparably low costs due to large scale economies, the COTS components commonly used in general purpose architectures usually offer higher efficiency, often both in terms of performance and of power requirements. For example the already cited SAFEbus is capable of transferring data at up to 60 Mbps, while a modern PCI Express interconnection is capable of peak transfer rates of 16 Gbyte/s, more than three orders of magnitude higher. Furthermore, by using DMA-capable devices, the CPU load can be drastically reduced, especially with fast I/O interfaces which would otherwise produce millions of interrupts per second. The current trend for electronic devices is to include many processors, often in the same die. To sustain this trend and save energy, multi-core technology is being adopted even in the embedded field. Indeed, the number of CPU cores integrated in the same system is expected to grow up to several hundreds in the next few years [LJ09]. Another common trend in current embedded SoC devices is to include run-time configurable electronic components such as FPGAs. If system designers take advantage of this feature, we can expect a near future of system components written in a high level description language (such as SystemC) that can be implemented both in software and in hardware. More advanced future systems could even decide at run-time to deploy these components in software or to migrate them to a hardware implementation, depending on the current load or status of the system [PC08]. When deployed in hardware, these components will still read information and write results in the main memory. This high number of devices competing for the bus can induce non-negligible delays in the execution time of tasks.

Architectures with more than one core, several dedicated coprocessors and many DMA-capable I/O devices are nowadays being adopted more and more even in safety critical systems. Multiprocessor system-on-chip (MPSoC) devices specifically targeted at embedded systems are commercialized by many industry leaders (a few examples are the Intel IXP2850 Network Processor, the Philips Nexperia Digital Video Platform, the TI OMAP architecture and


the ST Nomadik architecture) [Wol04], and are therefore commonly adopted by manufacturers. This is true even in safety critical sectors such as health-care, avionics and automotive. Examples of existing real-time applications with multiple asymmetric processing units are:

• the Janus microcontroller [GNL+03] (a dual processor platform with several DMA capable peripherals for automotive power-train applications, developed by PARADES, ST Microelectronics and Magneti Marelli),

• an ECG biochip [AKBP+06] (a parallel system with up to 13 masters and 16 slaves, whose processing elements are multi-issue VLIW DSP cores from STMicroelectronics),

• the Common Integrated Processor [Spi01] (developed by Raytheon Systems Company and currently deployed in the F-22 Raptor U.S. air fighter, figure 1.6; composed of several modules with 25-MHz Intel 80960 processors and almost a dozen coprocessors).

Fig. 1.6: Lockheed-Boeing F-22 Raptor U.S. Air Force fighter

Some measures of the delay induced by bus contention in existing hardware have been presented in the last few years. Obviously, the exact delay depends on the bus load, the type of application and the bus. Still, those measures are relevant to observe how bus contention delay can become a problem in many real world applications. Indeed those measures report the overhead caused by bus contention on the execution time of a task to be up to 42% with realistic bus loads and applications [PC09, Sch03]. When repeating those simulations in our environment, once more we have to remember that the architecture we are simulating is particularly optimized to avoid timing issues. In particular the bus that we are considering is an on-die bus, designed to connect CPU cores, coprocessors and peripherals in the same chip (MPSoC systems).


Fig. 1.7: Execution time (CPU cycles) vs. % bus usage of DMA devices, for 0, 2, 4 and 7 contending DMA devices

Still, we can notice severe delays when the bus is contended with other DMA devices, as shown in figure 1.7.

To guarantee the safe execution of code in presence of bus contention, engineers have to account for these delays in the worst case execution time analysis. A proposed approach, inspired by network calculus, is to bound the demand function of the processor and of each DMA-capable peripheral in the system by a simple function and compute safe and tight bounds for the maximum delay the execution time can suffer [PC09]. This delay has to be considered in the WCET to guarantee the schedulability of the system. An alternative approach, currently adopted in many systems, is to use a more predictable bus arbiter. For example, a round robin arbiter can avoid starvation, and the maximum delay induced by bus contention depends on the number of devices connected to the bus and on the maximum burst length of each peripheral. As this may lead to high maximum delay bounds and may not be enough to provide firm real time guarantees for heavily loaded systems, a TDMA arbitration can be adopted instead. With TDMA, bus contention is completely avoided and the maximum delay bound on transactions is controllable by the designer (by manipulating the TDMA time slots). The drawback is the waste of bus bandwidth (and therefore unnecessarily late response times of tasks) when the bus load is low. A solution is a hybrid scheme, re-assigning unused TDMA slots to other bus masters with a round-robin arbitration [BBB+05].


As verified experimentally in our simulated environment (figure 1.8), response times with TDMA are bounded and do not depend on the number of devices contending the bus, while a hybrid arbiter re-assigning unused time slots provides performance similar to round-robin under low bus contention.

Fig. 1.8: Execution time (CPU cycles) vs. % bus usage of DMA devices, with Round-Robin and TDMA bus arbiters
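The following C sketch illustrates the idea of the hybrid arbiter mentioned above (our simplified model, not the actual MPARM or AMBA implementation): each master owns a fixed TDMA slot, and a slot left idle by its owner is re-assigned round-robin among the requesting masters.

    #include <stdio.h>

    #define N_MASTERS 4u  /* number of bus masters (illustrative value) */

    static unsigned rr_next = 0;  /* round-robin pointer for reclaimed slots */

    /* Returns the master granted for the current cycle, or -1 if the bus
     * stays idle. slot_len is the TDMA slot length in bus cycles. */
    int arbitrate(unsigned long cycle, unsigned slot_len,
                  const int requesting[N_MASTERS])
    {
        unsigned owner = (unsigned)(cycle / slot_len) % N_MASTERS;
        if (requesting[owner])
            return (int)owner;            /* TDMA: owner keeps its slot */
        for (unsigned i = 0; i < N_MASTERS; i++) {
            unsigned m = (rr_next + i) % N_MASTERS;
            if (requesting[m]) {          /* idle slot: reclaim it */
                rr_next = (m + 1) % N_MASTERS;
                return (int)m;
            }
        }
        return -1;                        /* nobody is requesting the bus */
    }

    int main(void)
    {
        const int requesting[N_MASTERS] = {0, 1, 0, 1};
        for (unsigned long c = 0; c < 32; c += 8)  /* one grant per slot */
            printf("slot at cycle %lu -> master %d\n",
                   c, arbitrate(c, 8, requesting));
        return 0;
    }

With all masters requesting, this scheme degenerates to plain TDMA; with a lightly loaded bus it behaves like round-robin, which is the behaviour discussed above.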

The high number of devices connected to the bus reported in our experiments is not uncommon even in hard real-time systems. As an example we can report the list of devices connected to the bus of a computation module of the already mentioned Raytheon Common Integrated Processor (CIP) (see figures 1.9 and 1.10) when deployed in the F-22 U.S. Air Force fighters [Spi01]:

• Dual Data Processing Element (two independent 25 MHz Intel 80960 CPUs)

• Dual Signal Processing Element (a generic signal processor for mathematically intensive functions such as Kalman filter propagation and the Fast Fourier Transform algorithms used in radar signal processing)

• DPE/Mil-Std-1553 I/O Port (a Data Processing Element (DPE) on side A and a Mil-Std-1553 I/O interface port on side B)

• Global Bulk Memory (a memory complex available to modules residing on the CIP)


Fig. 1.9: Position of the CIP in the F22’s structure

• Gateway module (provides a bi-directional communications path between Parallel Interface bus segments within a CIP and communications between two CIPs via fiber optics)

• Low Latency Signal Processor (uses a Texas Instruments SMJ320C31 processor to provide the interface between the CNI front end and the CIP and performs low latency signal processing for the CNI system)

• Graphics Processor/Video Interface (features a fiber-optic interface to the cockpit Multi-Function Displays)

• Non-RF Signal Processor (an infra-red signal processor, can support up to three Missile Launch Detectors)

• Data Encryption/Decryption Device (an integrated Communications Security unit)

• Voltage Regulator modules

• User Console Interface (featuring a DPE on side A and UCIF hardware on side B, which supports instrumentation and access to the CIP I/O backplanes during integration and test activities)

• Fiber Optic Transmit/Receive Network Interface module (communications between the chip cluster and the sensors)

With such a complex system, bus contention can clearly have a heavy influence on the execution time of critical tasks.

Finally, the two approaches presented in this section to deal with the bus contention delay problem (statically bounding the bus contention delay and using a predictable bus arbiter) can be combined in real world systems.


Fig. 1.10: Photo of an unmounted CIP (notice the modular structure: in the available slots additional modules like the ones listed in this section can be mounted)

Indeed, some DMA devices and coprocessors with low bus traffic requirements can be statically considered in the WCET analysis, while particularly unpredictable devices can be connected to a “smart bridge”, buffering their traffic and accessing the actual bus connected to the main memory in a predictable way. This combined approach is proposed (and planned to be implemented in the near future) in [PC09].

1.3 Preemption with bus contention

As we showed in the previous section, bus contention has a non-negligible influence on the execution time of a task. Delays induced by this phenomenon affect every access to the main memory. Therefore, in contexts with a high bus load it is even more important to investigate the problem of preemption induced cache misses. Indeed, if by considering the CRPD alone, without bus contention, we observed that we can expect up to 30% of overhead on the execution time of a task, in a heavily loaded bus scenario (which, as demonstrated in the previous section with concrete examples, is not an unusual one) the actual overhead can become even higher.

To the best of our knowledge, there are no previous works measuring the CRPD in presence of a high bus load; therefore we simulated this scenario and traced the results in figures 1.11 and 1.12.

With more bus masters contending access to the bus, the CRPD becomes even more problematic.


Fig. 1.11: Execution time (CPU cycles) vs. % bus usage of DMA devices, with and without preemptions

Fig. 1.12: Execution time (CPU cycles) vs. number of DMA devices, with and without preemptions


It is noteworthy that with bus contention, even in a highly optimized MPSoC such as the architecture we are simulating, the CRPD can become a severe problem and definitely needs to be considered.


Chapter 2

Simulated system

Hard real time systems need firm guarantees on the safe and timely execution of each task. For such absolute guarantees, designers cannot base timing considerations simply on measuring test executions (actually, for soft real-time systems such as multimedia applications, such an approach can be just fine; but for safety critical systems, such as health-care and automotive, simple tests are not enough). The worst case execution time has to be statically determined by analysing the code. Furthermore, firm bounds on bus contention and cache related preemption delays have to be computed and accounted for in the schedulability analysis. Indeed the actual worst case scenario can hardly be measured within the rather limited number of tests engineers can perform on any real world system. Considering all possible inputs to a task and all possible execution paths is impossible. When we add to the complexity of the scenario the asynchronous random accesses to the bus by contending devices and the whole range of possible schedule sequences, measures become totally ineffective for worst-case analysis. Still, by observing the cache misses and execution times in a real system, we can better understand the problem. Furthermore, by observing their performance, both in terms of execution time and cache misses, we can compare different techniques of dealing with this problem. Indeed, by implementing the proposed limited preemption approach in a real RTOS (real-time operating system), we can demonstrate how it can be easily implemented and show the advantages of this technique not only on worst case, but even on average case execution times.

The events we need to measure, such as cache misses, pipeline stalls and bus contention, happen at high frequencies and in hardware components that are not designed to be observable. For this reason we found it more convenient to use a simulated environment in which we could automate most of our tests and more easily observe the cache and timing behaviour of the system.


Furthermore, this way we could more easily automate our measures to enable more complex tests. As a simulation platform we selected MPARM because it is capable of cycle accurate measures and simulates an ARM7 processor, which is well-known and widely adopted in industry. As the chosen simulator permits realistic simulation of every aspect of the real hardware, we were able to use software actually compiled for a physical ARM7 core and tested on a real development board. This allowed us to use a real-time operating system (RTOS) available for that processor, Erika. The adopted RTOS and the simulated processor are both in use in many commercial products, therefore our measures report highly realistic scenarios.

2.1 MPARM simulator

The MPARM simulator was initially developed by the Università di Bologna and is now used and improved by several other organizations:

• Università di Verona

• Università di Urbino

• Università di Roma

• Università di Cagliari

• Stanford University

• Penn State University

• Universidad Complutense de Madrid

• Universidad Politécnica de Madrid

• Denmark Technical University

• IMEC

• STMicroelectronics

• . . .

It is a simulation environment for multiprocessor system-on-chip (MPSoC) architectures. It is specifically designed to reflect the characteristics of embedded asymmetric multiprocessing systems (i.e., general purpose, DSP and VLIW cores can coexist, and multiple on-chip memory modules and I/O units can be included in the same silicon die, using a heterogeneous system interconnection) [BBB+05]. The simulator is intended as a tool to explore the different design choices from early in the design process.


It can be used to explore different component types in the architectural design phase, to prototype and test the hardware design during later phases, and to observe the behaviour of software with different hardware components. For this intended usage, simulation accuracy is of primary importance. Indeed MPARM is a cycle-accurate simulator. It can measure the performance of all the parts of an MPSoC system: bus usage, cache and processor performance, power requirements, etc.

MPARM is based on SystemC. This allows implementing any part of the system in the same language and integrating existing parts written in C/C++ by using simple wrappers capable of synchronizing those modules with the SystemC clock signal. Indeed MPARM supports interchangeable Instruction Set Simulators (ISSs, i.e. processor cores). The ARM7 ISS is based on the existing SWARM project, describing a complete ARM core in C++. This is the most stable ISS supported by MPARM and can execute real code compiled for that processor (THUMB code, a commonly used feature of the widely adopted ARM7TDMI core, is not supported). Supported bus interconnection technologies range from the ARM published AMBA bus to the STMicroelectronics proprietary STbus, experimental NoC (network-on-chip) interconnections, and a TLM generic bus that can be configured to simulate the timing and interconnection of any architecture. Several other devices are supported, such as DMA engines, on-chip fast SRAM connected directly to the SoC bus, an off-chip DRAM memory controller, an FFT engine and frequency scaling controllers. This wide choice allows hardware designers to explore the design space of the system and experiment with new hardware without having to wait for physical prototypes to be manufactured.

Fig. 2.1: The simulated architecture: ARM7 CPU (core, caches and peripherals) and on-chip SRAM connected through the AMBA bus


For our simulation environment we used the ARM7 ISS and the AMBA bus. In some simulations, where we wanted to explore the influence of the bus on the system, we used the TLM interconnection configured to mimic the AMBA bus topology, because the TLM bus is more configurable and allows comparing different delays and bus loads, while the AMBA bus is a cycle-accurate mimic of its real world equivalent. For each ARM core there is an on-chip SRAM connected through AMBA (figure 2.1). The ARM CPU is internally composed of the ARM core, instruction and data caches, and peripherals (UART, timer, interrupt controller). This module was derived from the open source cycle accurate SWARM (software ARM) simulator.

The AMBA bus is a widely used standard defining the communication architecture for high performance embedded systems [Lim99]. Multi-master communication is supported by this back-bone bus, and requests for simultaneous accesses to the shared medium are serialized by means of an arbitration algorithm. The AMBA specification includes an advanced high-performance system bus (AHB) and a peripheral bus (APB), the latter optimized for minimal power consumption and reduced interface complexity to support connection with low-performance peripherals. The MPARM simulator includes a SystemC description only for the former, given the multi-processor target scenario. The implementation supports the distinctive standard-defined features of AHB, namely burst transfers, split transactions and single-cycle bus master handover.

Fig. 2.2: Timing of an AMBA bus transaction (cache line refill of 4 words)

Bus transactions are triggered by asserting a bus request signal. The master then waits until bus ownership is granted by the arbiter: at that time, address and control lines are driven, while data bus


ownership is delayed by one clock cycle, as an effect of the pipelined operation of the AMBA bus. Finally, data sampling at the master side (for read transfers) or slave side (for write transfers) takes place when a ready signal is asserted by the slave, indicating that on the next rising edge of the clock the configuration of the data bus can be considered stable and the transaction can be completed. Besides single transfers, four, eight and sixteen-beat bursts are defined in the AHB protocol, too. Unspecified-length bursts are also supported. An important characteristic of the AMBA bus is that the arbitration algorithm is not specified by the standard, and it represents a degree of freedom for a task-dependent performance optimization of the communication architecture. The resulting architecture is optimized to avoid timing issues: the on-chip AMBA bus and fast SRAM allow for an overhead of only 5 cycles for each cache miss (refilling a line of 4 words, see figure 2.2).

MPARM can measure any aspect of the simulated environment, from bus to processor and cache performance. As a result of the simulations, it can provide cumulative statistical data, timeline traces of every hardware event, and waveforms (in .vcd files) for the signals in the system. For this thesis, we used only the cumulative statistical data (improved to separate the information for different tasks, as we will describe in the next section).

The statistical data for each execution begins with an overview of the simulated environment, listing the number and type of devices.

    Simulation executed with SWARM cores on AMBA AHB (signal model) interconnection
    Simulation executed with 1 buses connected by 0 bridges
    Simulation executed with 1 cores (1 masters including DMAs and smart memories)
    4 slaves: 1 private, 1 shared, 1 semaphores, 1 interrupt,
      0 core-associated, 0 storage, 0 frequency scaling,
      0 smart memories, 0 FFT devices
    (core-associated off, frequency scaling off, smart memories off, DRAM
    Scratchpad memories disabled
    .........

Following are the statistics for the interconnection in use (the AMBA bus in our case), with data about the time to access the bus and to complete transactions for each type of operation.

    .........
    Bus busy              = 0 master system cycles (0.00% of 0)
    Bus transferring data = 0 master system cycles (0.00% of 0, 0.00% of 0)
    Bus Accesses          = 551117 (0 SR, 418794 SW, 132323 BR, 0 BW: 132323
    Time (ns) to bus access (R) = 1323230 over 132323 accesses (max 10, avg 10.00,
    Time (ns) to bus compl. (R) = 6616150 over 132323 accesses (max 50, avg 50.00,
    .........

There is information for each processor in the system. First, bus usage statistics are provided about the accesses to the bus by that master.

    -----------------
    SWARM Processor 0
    -----------------
    Direct Accesses = 0 to DMA
    Bus Accesses    = 551117 (0 SR, 418794 SW, 132323 BR, 0 BW: 132323
    Time (ns) to bus access (R) = 1323230 over 132323 accesses (max 10, avg 10.00,
    .........

Then, the number of accesses to memory areas is listed in detail:


    +==================+=======================+
    |                  |     Current setup     |
    |                  |   Ext Acc   Cycles    |
    +==================+=======================+
    | Private reads    | 132323*   13850338    |
    | Bus+Wrapper waits|            1058584    |
    .........

At last, detailed cache performance statistics are reported.

    ---Cache performance---
    * Read bursts due to 132323 cache misses out of 13056400 cacheable reads. Misses
      also cost 793938 int cycles to refill. All writes were write-through.
    .........
    D-Cache: 1185502 read hits; 131817 read misses (527268 single-word refills)
    D-Cache: 336363 write-through hits; 82431 write-through misses
    D-Cache total: 1736113 tag reads, 131817 tag writes
                   1317319 data reads, 131817 data line writes, 336363 data word writes
    D-Cache Miss Rate: 12.51%
    .........

In more complex simulation scenarios, statistics about other devices, such as DMA engines and frequency scaling controllers, could be generated. Furthermore, if energy consumption statistics were required, an additional section with the power statistics for each device would be added. For the purposes of this thesis, we did not enable those additional features of MPARM.

2.2 Erika real-time operating system

Erika is a real-time operating system (RTOS) currently maintained by Evidence Srl, a spin-off company of the ReTiS Lab of the Scuola Superiore Sant’Anna. Erika is compliant with the OSEK/VDX standard for automotive RTOSes. The kernel supports mono and multi processor systems and is available for several small 8 to 32 bit microcontrollers and platforms, from Altera’s NIOS II softcores to the dsPIC DSC microcontroller family. It supports preemptive and non-preemptive multitasking, fixed priority scheduling and EDF scheduling, and protocols for accessing shared resources (with the Immediate Priority Ceiling protocol to prevent priority inversion). The Erika kernel is particularly optimized for small memory requirements. Minimalistic systems can shrink the operating system memory usage down to less than 128 bytes of RAM and less than 1024 bytes of ROM, while still maintaining a fully preemptive kernel, basic interrupt handling and resource sharing. In particular it has a one-shot task model to reduce stack usage and supports stack sharing techniques. To further reduce the overall footprint of the kernel, Erika provides subsets of the OSEK APIs (conformance classes named BCC1, BCC2, ECC1, ECC2).
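As a small illustration of the one-shot task model, a task under Erika's OSEK API is a C function that runs to completion and explicitly terminates, which is what enables stack sharing (a minimal sketch; the task name and its body are ours, while TASK() and TerminateTask() are standard OSEK primitives and "ee.h" is Erika's kernel header):

    #include "ee.h"   /* Erika Enterprise kernel header */

    /* One-shot task: activated by an alarm or by ActivateTask(), it
     * runs its body once and must end with TerminateTask(). The task
     * object itself is declared statically in the OIL configuration. */
    TASK(Sampler)
    {
        /* read inputs, compute, write outputs ... */
        TerminateTask();
    }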

The OSEK/VDX consortium provides the OIL language (OSEK Implementation Language) as a standard configuration language, used for the static definition of the RTOS objects which are instantiated and used by the application. Erika fully supports the OIL language for the configuration of real-time applications through RT-Druid, a tool suite for the automatic


configuration and deployment of embedded applications (see figure 2.3).

Fig. 2.3: The RT-Druid tool suite for configuration and deployment

The RT-Druid tool is built on the open source Eclipse IDE and provides a user friendly development environment. Furthermore, it processes the OIL configuration and generates the build system for the application and the kernel.

The first step in porting Erika to the simulated platform was to prepare a toolchain. As Erika is compatible with many compilers, we chose devkitARM, a GCC 4.2 derived toolchain actively maintained by the open source community to quickly adopt any improvement in the mainline GCC. It was initially created and widely adopted by the large community of homebrew developers for entertainment consoles. Nowadays it is used intensively to develop homebrew games and applications for the Nintendo DS. This guarantees an intensively tested toolchain and relieved us of the burden of creating and maintaining a toolchain during the porting process.

Erika already includes support for an ARM7 core: it supports the ARM7TDMI processor Samsung KS32C50100 and the Evaluator-7T board (figure 2.4). The ARM7 core included in the MPARM simulator lacks Thumb mode support (a compressed instruction set with only 16 bits per instruction), therefore some light modifications were still needed, but the overall context switching code was ready. After adding some code to support the simulated platform’s interrupt controller, both in the startup of the kernel and in the interrupt handling routines, Erika was up and running on MPARM. A board support package with an API to manage some devices (such as the timer, the interrupt controller and simulation support for printing debug messages, starting measures and stopping the simulation) completed the support of the platform.


Fig. 2.4: The Evaluator-7T development board

Still, with the Evaluator-7T board supported, we could test our code on the physical hardware and accurately measure its behaviour in the simulation environment.

MPARM can provide statistics for many parts of the hardware system, even separately, but it cannot distinguish which part of the code running in the system has triggered the events. Since we wanted to measure interference among tasks, we had to run multiple tasks and evaluate the performance of the hardware while executing each task separately, not the performance over the whole simulation time. Examining the long traces and waveforms MPARM can supply, to guess the context switch points and aggregate statistics off-line after each simulation, would be too complex and would require large amounts of disk space for long simulations (writing traces severely slows down the simulation). Therefore we decided to modify MPARM to add support for the separation of statistics by PID. A new module was added in the simulation support (connected to a certain address on the AMBA bus) which can be used by the OS to specify the currently active PID. The Erika kernel was modified to write the active PID after each context switch to the memory-mapped address of the new simulation support module. In MPARM, statistics relative to the ARM cores (such as the number of memory accesses and the cache behaviour) were maintained separately for each task running on the core. When writing the results of the simulation, a separate statistics file for each PID is produced in addition to the main statistics file.
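On the kernel side, this mechanism reduces to a single store to the address of the new module; a minimal sketch follows, in which the register address and the function name are hypothetical placeholders, not the ones used in our actual patch:

    #include <stdint.h>

    /* Hypothetical bus address of the statistics module; the real
     * address used in our modified MPARM configuration differs. */
    #define SIM_PID_REG ((volatile uint32_t *)0x80200000u)

    /* Called by the modified kernel right after each context switch:
     * the simulator intercepts the write and redirects all following
     * statistics (memory accesses, cache events) to the bucket of pid. */
    static inline void sim_set_active_pid(uint32_t pid)
    {
        *SIM_PID_REG = pid;
    }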


2.3 Simulated task sets

The task-sets of our simulated scenarios are composed of periodic tasks. In particular we schedule a high frequency task, which destroys a part of the cache, and a lower frequency one with a longer execution time (as in figure 2.5). Scheduling is preemptive with fixed priorities, and priorities are assigned in order of frequency of activation (the well-known Rate Monotonic assignment).

Fig. 2.5: Activation pattern of the simulated task-set: frequent activations of the disturbing task, sparse activations of the measured task

The higher priority task is a synthetic task we use to create the desired “cache interference” by evicting a configurable number of cache lines. The lower priority task is the one we are interested in measuring. Exploiting MPARM’s capabilities and the modification supporting statistics separated by PID, we can measure the execution time of the task and the number of cache misses. All the measures are repeated over a significant number of activations of the task. Later in this thesis, when we present our proposed approach to deal with the CRPD problem and measure the performance of this technique, we will use more disturbing tasks to simulate a real-world task-set.
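A minimal sketch of the disturbing task body could look as follows (our illustration; the names, line size and buffer size are assumptions, not the actual benchmark code): touching one byte per cache line forces the refill of `lines` data cache lines, evicting whatever the measured task had loaded there.

    #define CACHE_LINE_BYTES 16u                /* 4 words of 4 bytes */
    #define MAX_LINES        1024u              /* safety upper bound */

    static volatile unsigned char dist_buf[MAX_LINES * CACHE_LINE_BYTES];

    /* Body of the high-priority disturber: lines is the knob plotted
     * on the x axis of figure 1.1. The volatile buffer keeps the
     * compiler from optimizing the dummy reads away. */
    void disturber_body(unsigned lines)
    {
        for (unsigned i = 0; i < lines && i < MAX_LINES; i++)
            (void)dist_buf[i * CACHE_LINE_BYTES];
    }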

We simulated three different tasks, from different domains:

• simple iteration over an array;

• neural network training through backpropagation;

• an iteration of the CoreMark benchmark.

The simple task iterating over an array, while a typical activity in control tasks (such as FIR filters), served principally explorative purposes. Indeed the body of the task is composed of a few ARM instructions, and its timing behaviour can be analysed even by hand, because the array is accessed sequentially and the cache behaviour is trivial. Furthermore, the footprint of the task can be configured by choosing the size of the array, and its duration can be altered by iterating more times over the array.
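A sketch of this first task (our reconstruction under the stated assumptions; ARRAY_WORDS and SWEEPS are our names for the two configuration knobs) is:

    #define ARRAY_WORDS 4096u  /* controls the cache footprint of the task */
    #define SWEEPS      8u     /* controls the duration of the task */

    static volatile unsigned array_buf[ARRAY_WORDS];

    /* Sequential sweeps over the array: at most one miss per cache line
     * on the first sweep, then hits while the footprint fits the cache. */
    unsigned array_task(void)
    {
        unsigned sum = 0;
        for (unsigned s = 0; s < SWEEPS; s++)
            for (unsigned i = 0; i < ARRAY_WORDS; i++)
                sum += array_buf[i];
        return sum;
    }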


The neural network training was used as an example of computation-intensive code; the task is fundamentally a series of floating point computations. As the simulated ARM7 core does not include an FPU, the code is compiled with floating point emulation. For this reason, in this task the computation operations far outnumber the memory accesses. The last type of task, actually used for all the graphics in this thesis, is a single iteration of the well-known CoreMark benchmark.

CoreMark is an open source benchmark offered by the EEMBC (Embedded Microprocessor Benchmark Consortium). EEMBC is a non-profit corporation formed to standardize real-world embedded benchmark software and to help designers select the right embedded processors for their systems. The EEMBC benchmarks are aimed at specific embedded market segments and are very successful at approximating the real-world performance of embedded devices. Many industry-leading companies contribute to ensuring the representativeness of these benchmarks. Some member companies of the consortium are:

• AMD (since 11/01/02)

• ARM (since 05/01/97)

• Freescale Semiconductor (since 05/01/97)

• IBM (since 05/01/97)

• Intel (since 01/01/99)

• Microchip Technology (since 04/01/06)

• MIPS Technologies (since 05/01/97)

• Nokia (since 07/01/04)

• Samsung Electronics (since 07/01/08)

• Sony Computer Entertainment (since 11/01/02)

• STMicroelectronics (since 05/01/97)

• Sun Microsystems (since 05/01/97)

• Texas Instruments (since 05/01/97)

CoreMark is designed to evaluate the performance of embedded processors by modelling real world workloads. It is applicable to a wide range of processors, from 8-bit microcontrollers to high-end 32-bit devices and architectures. CoreMark is comprised of small and easy to understand ANSI C code with a realistic mixture of read/write operations, integer operations, and control operations. CoreMark has a small size, which makes it convenient to run using simulation tools. CoreMark includes four major algorithms:


• list manipulation - pointers and data access through pointers;

• matrix manipulation - serial data access, potentially using instruction level parallelism;

• simple state machine - an FSM that exercises the branch unit in the pipeline;

• cyclic redundancy check (CRC).

Such a collection of highly representative algorithms provides a good, compact example on which to measure the effects of preemption and to experiment with new techniques.


Chapter 3

State of the art

The techniques commonly adopted in real time systems to deal with the unpredictable behaviour of caches can be classified in two main categories. The first approach, the most used in currently deployed systems, is to use cache memories in a restricted or customized manner, so as to adapt them to the needs of real-time systems and schedulability analysis. These techniques exploit hardware features or manually adjust the memory mapping of software components to avoid unpredictable behaviour. In the first section of this chapter we present these techniques and the research efforts to automate them as much as possible. The second approach is to use the cache without any limitation and statically analyse the cache behaviour of the system, providing safe bounds for both the WCET of each single task and the CRPD introduced by the adopted scheduling policy. The second and third sections of this chapter present this approach and an example of a commercial timing analysis tool used to bound the WCET of each task.

3.1 Heuristic and hardware-driven solutions

Many modern caches can be controlled by software. This allows hard real time software to use them in a predictable way [Jac99]. For caches that do not include such advanced features, the memory layout of the program can still be tweaked and modified to use the cache in a more predictable way. Several techniques based on these approaches have been proposed in academic research papers and are used in practice in many real-world systems. All of them require particular hardware features, or heavy modifications to the code and linking scripts that can only be obtained with specially modified compilers. Implementing them requires choices that can only be driven by programmer experience and heuristic guidelines. Nevertheless most current hard real-time systems use one of these techniques to enable the use of caches in safety critical projects.


Cache locking is supported by many commercial processors. We can mention the Motorola ColdFire MCF5249, Motorola PowerPC 603e, IDT RC64575, Intel i960 and ARM 940T. The software running on these processors can explicitly load data and instructions into the cache and instruct it to disable their replacement. The access time to memory locations locked in cache can be precisely determined, and this enables very precise WCET estimations, because locked content can be safely accounted as a cache hit. By controlling in software which memory locations to lock, designers can decide which memory accesses need to be cache hits (for example code in a loop) and which ones the system does not have to care about (such as initialization code). A major problem of this approach is that each processor implements cache locking in a different way. This severely limits portability of code and choice of hardware (for example, it might constrain future projects to a certain processor because they need to reuse some legacy code). Furthermore, selecting which memory addresses to lock in cache is not a trivial task and is often left to the programmer's experience, or to simple heuristic guidelines.
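The following C sketch illustrates the idea of locking a time-critical loop at start-up. The two primitives and the linker symbols are hypothetical placeholders for the processor-specific mechanism; as noted above, each processor exposes a different, non-portable interface.

    /* Hypothetical, processor-specific locking primitives. */
    extern void cache_prefetch_range(const void *addr, unsigned size);
    extern void cache_lock_loaded_lines(void);

    /* Assumed linker-provided symbols delimiting the hot loop. */
    extern char hot_loop_start[], hot_loop_end[];

    void lock_hot_code(void)
    {
        /* Load the time-critical code into the instruction cache, then
         * disable its replacement: every later fetch from this range can
         * safely be accounted as a cache hit by the WCET analysis. */
        cache_prefetch_range(hot_loop_start,
                             (unsigned)(hot_loop_end - hot_loop_start));
        cache_lock_loaded_lines();
    }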

Instructions and data can be locked in cache at startup and never change during the task execution. This approach, called static cache locking, is very attractive because it allows very precise WCET estimation, and loading some content of the memory in cache is done during system startup instead of during the execution of the task, which can yield performance improvements in the average case, too. Still, dynamic reorganization of cache contents is not possible. Therefore, unlike the traditional use of caches, statically locked caches cannot adapt to the dynamic behaviour of the task. Another approach, called dynamic cache locking, is to load and lock memory contents during the task execution. Therefore for different sections of code (for instance in different branches) the task can lock different memory locations in cache, allowing for a better use of the cache and a certain degree of adaptability, at the cost of a slightly more complex WCET analysis and the constraint of loading all memory locations during task execution.

With respect to inter-task interferences, a possible cache locking strategy is to globally select the cache contents to lock for each task. Each task will then own a portion of the cache with locked content. This approach is more suitable for static cache locking, because dynamically changing the contents to lock in cache can lead to conflicts (two tasks needing to lock different memory locations that are mapped to the same cache line). When statically selecting the contents of cache to lock, system designers can resolve conflicts by changing the memory layout or simply not locking one of the two memory blocks. Global static cache locking allows the system startup code to load and lock the memory locations in cache, since the locked cache contents do not need to change during the whole life-time of the system. Another possibility is to assign the whole cache to the running task. This local cache locking approach is suitable for both static and dynamic


selection of locked memory locations. Static local cache locking can be implemented by statically selecting the memory blocks to lock in cache, loading them at task startup and never changing them during task execution, while the dynamic approach allows the task code to load and lock different memory locations during execution. The local cache locking scheme simplifies the selection of memory locations to lock, because conflicts are less likely and the portion of cache available for locking by each task is larger. Unfortunately this technique requires the RTOS code to deal with locked contents when preempting a task, for instance unlocking these cache blocks and re-loading them when the task resumes execution.

Cache locking content selection is traditionally decided by programmers and guided simply by their experience. Given the high interest of the industry and the wide adoption of cache locking techniques in existing systems, several research efforts have been expended in the last decade to develop tools allowing for automatic implementation of cache locking mechanisms. Most of the research papers on the topic concentrate on the instruction cache, because data cache usage depends highly on the application. Deciding the optimal set of memory locations to lock in cache to minimize the WCET can be proven to be NP-hard [TL09]. For this reason many papers propose genetic algorithms to explore the cache locking space [CPIM05]. Algorithms solving the problem in polynomial time usually find the optimal solution only under unrealistic assumptions [TL09] or use greedy and sub-optimal approaches [FPT07, CPIM05, AP06].

Instead of conventional cache memories, many real-time designers favour the use of scratchpad memories (SPM). Also called tightly coupled memories (TCM), these are small static RAM modules included in the same silicon die as the processor, just like cache memories [BSL+02]. Indeed many processors allow the system to reconfigure part of the cache and use it as scratchpad memory. Unlike the cache, TCM content is not controlled by hardware, but is mapped onto the address space of the processor at a predefined address range. Any access to the scratchpad memory is serviced (usually in just one CPU cycle) without accessing the bus, therefore a TCM is inherently predictable.

The task of allocating instructions and data to the scratchpad is under software control. While this enables predictable and controllable behaviour by choosing what to allocate in the TCM, the choice is completely left to the user, requiring significant compiler or programmer support. The problem of choosing the content to be allocated in a scratchpad memory is similar to choosing the contents to load and lock in cache. Significant research effort has been invested in developing efficient allocation techniques for scratchpad memories (for example [FK09]), and most of the research on cache locking policies can be applied to TCMs, too. Regarding implementation, during the design phase it is necessary, in both cases, to choose, for


every task in the task set, which instruction blocks will be either loaded and then locked into the cache, or copied into the scratchpad memory. The number of selected blocks per task must not exceed the capacity of either the locking cache or the scratchpad memory. Once the blocks are chosen, it is possible to know how much time it takes to fetch every instruction in the whole task set; therefore, the access time to the corresponding memory hierarchy is predictable. At compile time, the assignment of memory blocks to either the locking cache or the scratchpad has to be handled by hand or automatically by a compiler and/or a linker. However, since scratchpad memories are mapped in the processor's memory space, explicit modifications to the code of tasks may be required to make control flow and address corrections. Confirming the similarity of the two techniques, [PP07] proposes an algorithm for offline content selection of on-chip memories, supporting both locked caches and scratchpad memories. The article also presents a quantitative comparison of dynamic WCET-oriented cache locking and scratchpad allocation. Experimental results show that the worst-case performance of applications using the two types of memory are very close to each other in most cases.

By using scratchpad memories and additional custom hardware, other predictable memory access techniques can be implemented. For example, in [WA09] the usage of a “Scratchpad Memory Management Unit” (SMMU) is proposed. This custom hardware controller allows the software to load content from external memory locations and lock it in a scratchpad memory. Just like cache locking, loading an external memory content through the SMMU does not change the address of the memory location. Unlike cache locking, memory regions cannot conflict, because the mapping between scratchpad memory and address space is implemented with a translation table, just as in virtual memory management units.

Another approach to deal with extrinsic cache misses¹ and the unpredictability introduced by preemption is cache partitioning: assigning reserved portions of the cache (partitions) to certain tasks, in order to guarantee that their most recently used code or data will remain in the cache while the processor executes other tasks. Additionally, a common partition may be used for data sharing or non critical tasks. This technique is particularly attractive in systems like those defined by the ARINC 653 avionic standard, where different software components are developed independently and then deployed on the same processor in a time-partitioned environment (Integrated Modular Avionics, IMA). In this kind of system the software is

¹We use the term extrinsic for cache misses of a task τ1 caused by a higher priority task preempting it and evicting a cache line that the code of τ1 uses later. The term is used as a counterpart of intrinsic cache misses, which would have been cache misses even if the task was not preempted (capacity, conflict or compulsory misses). For further detail, refer to the “Introduction” chapter.


inherently partitioned, but cache interference among tasks of different partitions cannot be considered until the late integration phase. In cache partitioned systems inter-task interferences are eliminated by isolating the dynamic behaviour of the cache within each partition. The counterpart is that the per-task available amount of cache memory is severely reduced, hence performance can degrade drastically. Moreover, one of the other techniques (such as static cache analysis or cache locking) is still required to deal with intra-task interferences.

Cache partitioning can be implemented with dedicated hardware. While requiring custom hardware (no commercial processors natively support similar devices) and introducing additional latency because of the additional logic, this approach allows precise control over the allocation of cache blocks to tasks [KSS91]. Another common approach is to modify the linking and compiling process to statically allocate code and data of different tasks in memory regions that are mapped to distinct cache sets (or simply lines, in direct mapped caches). This technique is not applicable with fully associative caches and requires extensive linker and compiler support. This scheme also introduces overheads, in this case due to the insertion of branches to interconnect the relocated pieces of code. In addition, the access pattern of data structures must be changed in order to achieve exclusive mappings into the data cache. Another software approach is possible in MMU (memory management unit) equipped systems with a physically-indexed cache. This hardware architecture is common in general purpose systems, and used even in some modern real-time systems. In this case the operating system controls the allocation of physical memory pages to tasks, therefore it can partition the cache using commonly available hardware and without having to modify the task code. The disadvantages of this technique are that it only allows a granularity of MMU pages, and that most currently available hard real-time hardware does not include an MMU and only uses static memory allocation. In the near future this may change, given the current trend of adopting, even in hard real time systems, processors originally designed for general purpose computation, most of which include an MMU. Furthermore the availability of research techniques to support WCET predictable dynamic memory allocation [HRW08] may attract many real time developers.
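The OS-level scheme is usually called page colouring: the "colour" of a physical page is the range of cache sets it maps to, and the allocator hands each task only pages of the colours in its partition, so tasks never share cache sets. A minimal sketch of the colour computation follows; the cache and page parameters are illustrative assumptions.

    #include <stdint.h>

    #define CACHE_SIZE (16 * 1024) /* assumed: 16 KiB physically-indexed cache */
    #define NUM_WAYS   1           /* assumed: direct mapped                   */
    #define PAGE_SIZE  (4 * 1024)  /* assumed: 4 KiB MMU pages                 */

    /* Number of distinct page colours; here (16K / 1) / 4K = 4. */
    #define NUM_COLOURS ((CACHE_SIZE / NUM_WAYS) / PAGE_SIZE)

    /* Two pages with different colours can never conflict in the cache. */
    static inline unsigned page_colour(uintptr_t phys_addr)
    {
        return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLOURS);
    }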

System designers using cache partitioning have to decide which tasks to assign a private partition and which to leave in the shared partition, as well as the size of each cache partition. Usually the choice is simply driven by heuristic considerations and engineering experience. Only recently has a genetic algorithm been proposed, in [BCSM08], to find a near optimal solution. The algorithm allows some tasks to be left in a shared cache partition if assigning them a dedicated one is not convenient. Inter-task interference in this partition can be handled with one of the other techniques presented in this chapter. For example, in [VLX03] the authors explore the


performance of combining cache partitioning and dynamic cache locking, while in [BMSW97] static analysis of inter-task cache interference is used for the tasks sharing the common partition.

3.2 Timing analysis

The goal of WCET analysis is to generate a safe (no underestimation) and tight (small overestimation) estimate of the worst-case execution time of a program. The techniques used to analyse the timing behaviour of a task can be based on measurements in a real or simulated environment (also known as dynamic analysis), or on a static analysis of the code, without actually executing the program. The goal of static code analysis is to evaluate some properties of the code, such as bounds on the value of a variable or the cache state at a certain point. When used for timing analysis, the property we are interested in evaluating is the WCET.

By measuring the execution time of the task on hardware or on a simulator for some set of test inputs, we can produce estimates or distributions, not bounds, for the execution times. There are tools to assist in the choice of test cases, to guarantee that those tests cover all the possible execution paths of the task, and to combine measurements on different code chunks into an overall WCET, like RapiTime by Rapita Systems. These tools are commonly used by many real-time developers today. Still, these methods are rarely guaranteed to give bounds on the execution time: the estimate provided by this technique approaches the true upper bound from below. With simple processors and architectures where the execution time of a block of code does not depend on the previous history of execution, this approach is still feasible even in safety critical projects. For example, it is used when programming by reusing simple and well-known basic blocks (like the SCADE Suite used, among others, by Airbus) with a simple processor without cache and pipeline. As static timing analysis tools become more precise in their estimations, and more powerful processors are required to support more advanced features in the control systems, more and more companies abandon measurement methodologies for WCET estimation. A noteworthy case study is the just mentioned Airbus. Requiring more processing power in recent commercial airplanes, the company has adopted more powerful processors. With these new processors, even when using SCADE and the composition of simple basic blocks, static analysis is required to guarantee the WCET of the tasks. Indeed, the company requires from all its real-time software suppliers a statically computed WCET bound for all the tasks. In particular, the company suggests using the aiT tool from AbsInt [SPH+07].

A modular approach to the static timing analysis problem splits the


overall analysis into a sequence of subtasks. The structure of a tool is usually similar to the one shown in figure 3.1.

Fig. 3.1: Typical modular structure of a static WCET analysis tool.

The Frontend module in figure 3.1 is responsible for reading the input format (such as an ELF executable) and preparing a control-flow graph (CFG). The CFG is a data structure describing all the possible execution paths of the program (more precisely, a superset of the set of possible execution paths). This data structure can be handled and analyzed by the other modules. Most tools analyze the linked executable, since only this contains all the necessary information (analyzing source code can lead to differences induced by compiler optimizations and linking, unless the analysing tool is strictly coupled with an ad-hoc compiler).

The following phase is called Control-Flow Analysis (or High-level Analysis). During this step, the tool determines information about the possible flow of control through the task. This analysis can bound the number of iterations of loops, bound the depth of recursion of functions, exclude infeasible paths (such as two subsequent branches controlled by contradictory conditions), etc. For this kind of analysis it is necessary to know the values of variables (actually registers and memory locations, when analysing compiled machine code) occurring in the conditions tested for branch or loop determination. For this reason a value analysis may be used, which computes ranges for the variables at every program point. Additional information may be supplied by the programmer in the form of source code annotations: comments in a particular form that the WCET tool can interpret, describing logical conditions hard or impossible to obtain through static analysis, such as the range of values expected when reading from an external source, or a bound on the number of iterations of a loop.
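The fragment below sketches what such an annotation might look like; the syntax is purely illustrative, not that of any specific tool (aiT, for instance, keeps its annotations in a separate specification file).

    /* Illustrative annotation syntax, not that of any real tool. */
    int average(const int *samples, int n)
    {
        int acc = 0;
        /* WCET_ANNOTATION: loop bound max 64
         * (the caller never reads more than 64 samples from the ADC,
         * a fact the static analysis cannot derive from this code) */
        for (int i = 0; i < n; i++)
            acc += samples[i];
        return n > 0 ? acc / n : 0;
    }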

The following phase, called Processor-Behavior Analysis (or Low-level Analysis), considers the processor behavior for the given task. This module takes into account an abstract model of the processor, with information


such as memory, caches, pipelines, and branch prediction. It determines upper bounds on the execution times of instructions or basic blocks. For the most powerful microprocessors, the safety of the results of this analysis depends on detecting and accounting for possible timing anomalies. These are counter-intuitive influences of the (local) execution time of one instruction on the (global) execution time of the whole task. For example, a local instruction cache miss might prevent the prefetching unit from loading too many instructions into the cache; if the same cache access is instead a hit, the prefetcher can load further instructions, and the instructions loaded because of the cache hit might evict from the cache some content used a few instructions later, inducing longer global delays in the program execution (see figure 3.2). To guarantee the safety of the results, a conservative assumption has

Fig. 3.2: Example timing anomaly.

to be made, or all possibilities have to be explored, wherever information about the processor's execution state is missing. Advanced techniques used in this module allow modern tools to support different types of cache memories, replacement policies and cache hierarchies. Most currently adopted approaches are based on abstract interpretation. They compute invariants about the processor's execution states at each program point. The invariants express static knowledge about the contents of caches and the state of other processor units, such as the pipeline. Knowledge about cache contents is then used to classify memory accesses as definite cache hits (or definite cache misses) [AFMW96].

Finally, the estimate of the WCET of the whole task is computed (Bound Calculation). Three kinds of approaches are adopted in current commercial and academic tools (see figure 3.3). In the structure-based approach, known patterns are located (such as an if-else branch), an upper bound for the structure is computed using known rules (the maximum of the two paths in the if-else), and an equivalent block is inserted in place of the structure before searching for other patterns. In the path-based approach, bounds are calculated for the different paths in the task, searching for the overall path with the longest execution time. In the Implicit Path Enumeration Technique (IPET), program flow and basic-block execution time bounds are combined


Fig. 3.3: The three bound calculation approaches: structure-based, path-based and IPET.

into sets of arithmetic constraints, and an ILP (Integer Linear Program) is produced with the execution time of the task as the objective function [LM95]. While linear programs can be solved in polynomial time, requiring the solution to be integer makes the problem NP-hard. A suboptimal solution in timing analysis would represent an unsafe estimate for the WCET, so the escape of resorting to heuristics is barred. Indeed, all of the existing static analysis tools that use IPET for bound calculation use exact ILP solvers.
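As a minimal sketch of the kind of ILP produced by IPET (our rendering, not the exact formulation of any particular tool), consider a CFG with basic blocks b, execution count variables x_b, per-block WCET bounds c_b and edge frequency variables f_e:

\[
\text{maximize} \; \sum_{b} c_b \, x_b
\qquad \text{subject to} \qquad
x_{\mathrm{entry}} = 1, \qquad
x_b = \sum_{e \in \mathrm{in}(b)} f_e = \sum_{e \in \mathrm{out}(b)} f_e ,
\]

plus additional constraints encoding the flow facts from control-flow analysis, e.g. x_body ≤ 10 · x_header for a loop bounded to 10 iterations. The optimum of the objective function is the WCET bound.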

An excellent and recent survey of most of the WCET estimation tools currently available in Europe can be found in [WEE+08]. In table 3.1 we report some of the tools and their degree of support for caches. The missing or limited support of data caches in many of the tools highlights the fact that the problem investigated in this thesis is still an open issue. For our experiments, when required, we have used the aiT tool from AbsInt (see figure 3.4 for an example screenshot). The implementation of this tool uses techniques based on abstract interpretation for processor behaviour analysis and IPET for bound calculation. It supports ARM7 processors, therefore we could validate the static analysis in our simulated environment.


Tool      Support for cache memories
aiT       I/D, direct/set associative, LRU, PLRU, pseudo round robin
Bound-T   –
Heptane   I-cache, direct/set associative, LRU, locked caches
SWEET     I-cache, direct/set associative, LRU
Florida   I/D, direct/set associative
Chronos   I-cache, direct, LRU

Table 3.1: Some WCET timing analysis tools and their support for caches, as reported in 2008 in [WEE+08]

Fig. 3.4: Example screenshot of the aiT WCET analysis tool from AbsInt.


3.3 Useful cache block analysis

When performing WCET analysis, it is assumed that the program execution is uninterrupted (no preemptions or interrupts) and that there are no interfering background activities, such as direct memory access (DMA) and refresh of DRAM. In the introduction chapter, we already presented the available methods to deal with asynchronous interference when accessing the bus:

• using predictable bus arbiters,

• considering the maximum time the arbiter can delay the access,

• using smart bridges to mask the unpredictability of some devices,

• bounding the traffic of each peripheral with a network calculus approach.

Still, the CRPD depends on how the task uses the cache. Therefore, tight bounds can only be calculated considering information gathered through static analysis.

To quantify the amount of delay introduced by a preemption at a certain point of the task, we need the notion of “Useful Cache Blocks” (UCB). At each point of the task execution, the UCBs are the cache blocks containing memory data which is referenced later in the execution of the task (before being replaced from the cache). By estimating the number of useful cache blocks at each program point, a timing analysis tool can bound the number of extrinsic cache misses caused by a preemption at one of these points. Suppose that task τ, at a certain point of execution, has 10 useful cache blocks, that the scheduler preempts the execution at that point, and that a higher priority task replaces the content of the whole cache. When task τ resumes execution, the 10 useful blocks of cache are lost. Therefore, in the remainder of the execution it will experience 10 extrinsic cache misses (i.e. cache misses not expected if it were not preempted). By estimating the number of UCBs at each program point, we can bound the number of extrinsic cache misses introduced by a preemption at any point of the task.

In figure 3.5 we present a trivial example of UCB analysis. We assume a simple code of 9 sequential instructions. Each instruction, numbered from 1 to 9 in the first row of the table, accesses a single memory location, named in the second row of the table. There are four memory locations, labeled from a to d. We assume a simple direct mapped cache with two lines. Memory locations a and b are mapped to the first line of the cache, while c and d are mapped to the second line. In the example, the cache is initially assumed to be empty, therefore the first memory access is a cache miss. The


Task instruction   1    2    3    4    5    6    7    8    9
Mem. access        a    c    a    c    d    c    c    c    b
hit/miss           m    m    h    h    m    m    h    h    m
Cache a/b          a*   a*   a    a    a    a    a    a    b
Cache c/d          –    c*   c*   c    d    c*   c*   c    c
# UCB (after)      1    2    1    0    0    1    1    0    –

Fig. 3.5: Example of UCB analysis; useful cache blocks are marked with *.

accessed memory location (a) is loaded in cache. The fourth and fifth rows of the table in figure 3.5 show the content of the cache after executing each instruction. After executing instruction 1, the cache contains only a. This memory location is used later (in instruction 3), therefore it is a useful cache block (marked with * in the cache content rows of the figure). After executing instruction 2, c is loaded in cache. It is accessed later, in instruction 4, therefore c is a UCB, too. In instruction 3, the memory access is a cache hit, since a is already in cache. However a is never used in the remainder of the task code. Therefore a is still in cache, but it is no longer a UCB (no mark in the figure). In instruction 4, the accessed memory location c is in cache, therefore it is a cache hit. After this instruction, c is still in cache, and the next instruction accessing this memory location is instruction 6. However, even without preemptions, the memory access in 6 will be a miss: in instruction 5 the memory location c will be evicted from the cache to load d. Therefore, after instruction 4, c is still in cache, but it is no longer a UCB. Whether the task is preempted after instruction 4 or not, the memory access in instruction 6 will be a miss (intrinsic cache miss). In figure 3.5, the same evaluations are made for all the subsequent instructions.

The UCB analysis, as presented in the trivial example of figure 3.5, can lead to two overestimations:

• it does not consider the actual cache blocks evicted by the higher priority task,

• when considering two subsequent preemptions, an extrinsic cache miss may be counted twice.

In the first case, if the preempting tasks access memory locations that do not overlap in the cache with the useful cache blocks of the lower priority task, the preemption will not introduce any extrinsic cache miss. This overestimation can prevent taking advantage of any memory layout tweaking during the linking process aimed at reducing cache interference among tasks (such as the techniques proposed in [GA07]). Some techniques that also consider the memory accesses of the preempting tasks have been proposed, but when


the number of higher priority tasks is large enough to evict most of the cache, the additional analysis may be unnecessary. A good compromise could be to use the more complex analysis for the high priority tasks and the simpler analysis for the lower priority ones. The other source of overestimation occurs when some UCBs are not referenced between two preemption points. These UCBs are evicted by the first preempting task, so the second preempting task does not find an actual UCB in that location. When the preempted task resumes execution for the second time, only one extrinsic cache miss is caused, despite the task suffering two preemptions. When estimating the CRPD for both preemption points, the UCB is counted twice. This overestimation is usually considered negligible, mainly because it is highly improbable if the preemptions are far enough apart.

A possible approach to estimate the UCBs is to connect in a chain the accesses to the same memory location, colouring with a unique color the chains that refer to the same cache line (example in figure 3.6). Then, at each

Fig. 3.6: Chains of accesses to the same memory location, coloured by cache line.

point in the program, it is sufficient to count the number of differently coloured chains that a preemption would cut [RM06a]. Actually, the most commonly used approach is to build two sets for each program point: the Reaching Cache States (RCS) and the Live Cache States (LCS). The RCS represents all possible cache states when reaching this program point (from any possible path), while the LCS represents the memory locations actually referenced in later code (excluding those that yield a cache miss even without preemptions). These two sets can be built by navigating the CFG. The UCBs at each point of the task are the cache blocks found both in the RCS and in the LCS [LHM+96]. A more precise analysis can easily be obtained by maintaining cache states using cache vectors, not simply sets of cache blocks. Furthermore, the RCS set for the whole task (the RCS at the final instruction or basic block) can be used to represent the cache lines evicted by the whole task τ1, and to improve the analysis for the lower priority tasks by excluding from their CRPD the UCBs not evicted by task τ1 [NMR03]. As maintaining the RCS is usually necessary during the Processor Behaviour Analysis phase of the WCET estimation, implementing UCB analysis in existing tools usually requires little effort. For example the aiT tool from AbsInt was recently enhanced to provide UCB information in addition to the WCET estimation. At the moment of writing this thesis, only prototype versions include this feature, which is planned for the next commercial releases.
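Once the two sets are available, the UCB computation itself is a simple intersection. The C sketch below represents each set as a bit mask over the cache lines (assuming, for illustration only, a cache with at most 32 lines; all names are ours):

    #include <stdint.h>

    /* Number of useful cache blocks at one program point: the cache
     * lines that both reach the point (RCS) and are live after it (LCS). */
    static inline int ucb_count(uint32_t rcs, uint32_t lcs)
    {
        uint32_t ucb = rcs & lcs;   /* intersection of the two sets */
        int n = 0;
        while (ucb) {               /* population count */
            ucb &= ucb - 1;
            n++;
        }
        return n;
    }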


By utilizing the techniques presented in these two sections, timing analysis tools can provide for each task:

• a WCET estimation when executing non preemptively,

• an estimation of the CRPD it would experience if preempted after a given instruction, for every instruction in the code.

Still, to know the actual WCET including the total CRPD and to perform schedulability analysis, it is necessary to know the exact preemption points, or at least the number of preemptions.

3.4 Schedulability analysis with CRPD

To determine the total WCET of a task and use it for schedulability analysis, the information supplied by timing analysis still needs to be combined with the exact points where the preemptions will happen. This computation must consider the adopted scheduling policy and the other tasks in the system. In this section, we present modified schedulability analysis techniques which take the CRPD into account, using the information supplied by timing analysis.

In [BN94] a modified version of the rate monotonic analysis (first proposed by Liu and Layland in [LL73]) is presented. This analysis, referred to as CRMA (Cached Rate Monotonic Analysis) in later articles, proposes to include the CRPD in the computation time of each task and compute the utilization. However, no method was presented to actually estimate the CRPD, nor the actual number of preemptions. Recently, the RMA schedulability analysis has been extended in [MYS07] to consider the exact number of preemptions.

A different technique, called CRTA, is presented in [BMSO+96]. This approach is based on Response Time Analysis, improving on the conservative bound of CRMA. Furthermore, during the computation of the response time of a task, the number of preemptions is also calculated. With this information, the CRPD added to the execution time can be bounded, at least by considering the time to refill the whole cache for each preemption. Of course, knowing the UCB information, engineers can tighten the bound even more, by considering only the time to refill the maximum number of cache lines that may be useful at a given point in the task. The authors of this technique have also compared the utilization regions obtained by CRTA and by partitioning the cache. Depending on the available cache and on the number and footprint of the tasks, either of the two schemes can obtain better results. This clearly suggests using a hybrid approach, with some tasks on their private cache partitions and some others


in a common partition where cache interference is handled by static analysis techniques. This hybrid approach is also advocated and implemented in [BMSW97]. The CRTA technique has later been extended in [LLH+01] to tighten the bounds on the CRPD by considering the maximum number of UCBs and the cache lines that the higher priority tasks can access.
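To fix ideas, a response-time recurrence in the spirit of CRTA can be sketched as follows (this is our simplified rendering, not the exact formulation of the cited papers): the CRPD is charged once per preemption, as an extra term γ_j added to the interference of each higher priority task τ_j,

\[
R_i^{(m+1)} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^{(m)}}{T_j} \right\rceil \left( C_j + \gamma_j \right),
\]

iterated from R_i^(0) = C_i until a fixpoint is reached (or the deadline is exceeded). Here γ_j can be the time to refill the whole cache or, more tightly, the time to reload the maximum number of UCBs of the preempted tasks.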

Since the exact points where a task is preempted are not taken into consideration in the above methods, only the maximum number of UCBs can be considered. In [RM06b], the authors present a technique to tighten the bound on the number of preemptions by observing that when a task suffers a single preemption, many instances of different higher priority tasks can execute before the considered task resumes execution. This may happen if a task τ1 is preempted by τ2, and a third task τ3 activates before the best case execution time (BCET) of the preempting task τ2 has elapsed. In this case, tasks τ2 and τ3 will both be executed before τ1 resumes. Therefore, only one preemption has to be accounted for, despite two higher priority tasks activating. When estimating the CRPD, only the extrinsic cache misses caused by one preemption have to be considered. Furthermore, the authors propose a technique to actually build the worst case preemption scenario and compute the real WCET when considering the CRPD.

Finally, in [JCR07], an RTA considering the CRPD is presented for dynamic priority scheduling. The number of preemptions is calculated by using the concept of worst case response time (WCRT), and this information is used in conjunction with the maximum number of UCBs to analyse the schedulability of the whole task-set.

To summarize, the steps used to estimate the CRPD, and the common overestimations of each step, are:

1. analyze UCBs for each instruction in the task;

• overestimated if the actual memory locations used by the higher priority task are not considered (more precise approaches are available);

• overestimated because UCBs not used between two preemptions are charged twice (knowing the exact locations of the preemption points would be necessary, which is not possible with existing approaches);

2. estimate the number of preemptions considering the actual task set and scheduler;

• overestimated if multiple tasks execute during one preemption (if the BCET of the higher priority tasks is considered, it can be estimated more precisely);

3. find the worst case locations of the preemptions to estimate the CRPD;

51

Page 52: Effects of real-time scheduling on cache performance - ReTiS Lab

• when the BCET of the higher priority tasks is also considered, more precise approaches are available than simply considering the points with the maximum number of UCBs.


Chapter 4

Limited preemption

In this thesis we propose to avoid the unpredictability introduced in the WCET by preemptions, by limiting them to well-known and statically analyzable points. First we present the advantages of non preemptive scheduling, while recalling the limitations of this scheduling method. Then we present our technique to deal with task-sets that cannot be scheduled with a completely non preemptive scheduling policy.

4.1 Non-preemptive scheduling

Non preemptive scheduling has many advantages. Certainly a non preemptive system is easier to implement than a preemptive one. This is especially true with a one-shot task model, where periodic tasks are modelled as a function to be executed periodically, not with an internal infinite loop (the one-shot model is commonly adopted to lower the memory requirements of the system, and often to allow for stack sharing techniques). In this case not even context switch code would be necessary. Furthermore, if no other task can preempt the current one, it can access shared resources without synchronization techniques (such as a mutex or a semaphore). This way the problem of priority inversion is completely avoided and the code using a shared resource is simpler. Obviously, the main advantage is avoiding the overheads introduced by preemptions. As we have extensively shown in this thesis, these overheads are not only the raw context switch and scheduling code times, but also pipeline flushing and the indirect delays introduced by extrinsic cache misses and, in more complex architectures, bus contention. Modern timing analysis tools such as those presented in the previous chapter can estimate with high precision the WCET of a task in a non preemptive environment, even with complex processors with prefetching and cache memories. Meanwhile context switches are not needed at all, therefore the RTOS overhead is reduced to a minimum and can be exactly

53

Page 54: Effects of real-time scheduling on cache performance - ReTiS Lab

estimated.
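The difference between the two task models can be sketched as follows (all names are illustrative):

    /* One-shot model: the task body is a plain function that runs to
     * completion at every activation; no internal infinite loop, so no
     * per-job context has to be preserved and stacks can be shared. */
    void control_task(void)          /* invoked by the kernel every period */
    {
        /* read sensors, compute, write actuators */
    }

    /* Infinite-loop model, for contrast: the task never returns, and its
     * context (including the stack) must be kept across suspensions. */
    extern void do_work(void);
    extern void wait_for_next_period(void);

    void control_task_loop(void)
    {
        for (;;) {
            do_work();
            wait_for_next_period();
        }
    }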

Despite these important advantages, non preemptive scheduling is rarely adopted in real systems, because a task that cannot be preempted blocks higher priority tasks while it is running. Indeed non preemptive scheduling is equivalent to having the code of each task inside a critical section (for instance, waiting for a mutex before task execution and releasing the mutex only at the end of the task); with this technique, non preemptive scheduling can be implemented on an RTOS or simulator that does not support it. So preemptive scheduling has the disadvantage of high overheads, while the non preemptive one has high blocking times: both can jeopardize the schedulability of a task-set. To show how severe the blocking time problem can be, we can observe the utilization of the processor. The useful concept of utilization least upper bound, introduced by Liu and Layland in [LL73], is a limit on the system utilization below which the schedulability of the system is guaranteed, no matter the actual task-set composition. In their work Liu and Layland proved that for a fully preemptive Rate Monotonic (RM) scheduler it is around 69%, while for a fully preemptive Earliest Deadline First (EDF) scheduler it is 100%. Obviously, to use such guarantees for systems with cache memories, the WCET of each task has to include the worst-case CRPD, too. As a matter of fact, the least upper bound of the utilization of a non preemptive scheduler (U^NP_lub) is zero. We can show this with the trivial example in figure 4.1. This means that some task-sets, even if

[Figure: τ1 (period T1, WCET C1) misses a deadline while τ2 (period T2, WCET C2) executes non-preemptively.]

\[ C_2 \geq 2\,T_1 \;\Rightarrow\; \text{task-set not schedulable non-preemptively} \]

\[ U = \frac{C_1}{T_1} + \frac{C_2}{T_2} \ll 69\% \qquad\qquad U^{NP}_{lub} = \lim_{\substack{C_1 \to 0 \\ T_2 \to \infty}} \left( \frac{C_1}{T_1} + \frac{C_2}{T_2} \right) = 0 \]

Fig. 4.1: Example of task-set with low utilization but still not schedulable non-preemptively

making scarce use of the processing resources of the system, are not feasible with non-preemptive scheduling policies.


If idle time with pending activations is not allowed, EDF is still an optimal scheduling policy [JS91]. However, many task-sets are feasible without preemptions only with non-work conserving schedulers (which may leave the processor idle even if there are pending activations). A trivial example is presented in figure 4.2. With this example in mind, it is clear that no online algorithm

[Figure: two schedules of the same task-set. In the work conserving schedule (not feasible) τ1 misses a deadline; in the non-work conserving schedule (feasible) the processor is left idle until τ1 arrives.]

Fig. 4.2: Task-set that requires non-work conserving non-preemptive schedulers

can be optimal for scheduling non preemptive tasks [HV95]. The scheduler would need to know the future arrival of the second task to decide to leave the processor idle. Only clairvoyant (off-line) schedulers can be optimal (i.e. find a feasible schedule whenever the task-set is feasible with any other algorithm). However, the general problem of finding a feasible schedule in an idling non-preemptive context is known to be NP-complete [GJ79]. In particular, when inserting idle times is allowed and with non-concrete task sets (the first activation offset of each task is not known), the feasibility problem for any periodic task set is NP-hard in the strong sense [HV95].

To exploit the advantages of the two policies, a trade-off between preemptive and non-preemptive scheduling has been investigated in many research papers. With this “limited preemption” approach, the scheduler needs to balance the increased blocking caused by non-preemptive sections against the beneficial reduction of the overhead introduced by preemptions. Wang and Saksena [WS99] proposed a different approach for limiting preemptions in systems scheduled with FP: each task is assigned a regular priority and a preemption threshold, and a task is allowed to preempt only when its priority is higher than the threshold of the running task. Burns [Bur95] extended the response time analysis to verify the schedulability of fixed priority tasks with fixed preemption points, but did not address the problem of selecting


the best locations of the preemption points to improve the schedulability of the task set. His work was later improved by Bril et al. [BLV07a]. Baruah introduced limited preemption scheduling for EDF [Bar05], computing the maximum amount of time for which a task may execute non preemptively without missing any deadline. Yao et al. [YBB09] extended Baruah's work to fixed priority systems.

4.2 Maximum allowed non-preemptive execution time

Given the serious advantages of non-preemptive scheduling, we advocate the adoption of such a policy. With this approach the WCET of each task is highly predictable and, by using modern timing analysis techniques, the behaviour of cache memories can be tightly estimated. This allows system designers to exploit the full power of the hardware resources, without having to oversize processing units to deal with overestimated WCETs. A particular advantage of this technique is that not only are the WCET bounds tighter, but the average case execution times (ACET) are lower as well. Still, the blocking times introduced by this policy can easily jeopardize schedulability. This is even more likely to happen in realistic task-sets, where many high frequency control tasks have to live along with longer running computation ones (higher WCET and lower frequency), such as in a typical automotive power-train control application [GNL+03]. A promising technique to deal with such task-sets is a limited preemption model. In particular, the recent works presented in [Bar05] (for EDF) and [YBB09] (for FP) provide a good theoretical background for this approach, but they do not take into account any preemption overhead, nor the CRPD. In this section we will present the technique used in these two works, using a unified notation suitable for both the EDF and FP scheduling policies. Furthermore we will extend this technique to deal with the overhead introduced by preemptions. The final objective is to achieve a feasible schedule when the task set is feasible neither in non-preemptive mode (due to the high blocking times), nor in fully preemptive mode (due to the high overhead). The work presented in this section has been proposed in [MB09].

For the limited preemption model, we consider the code of each task divided into a sequence of Non-preemptive Regions (NPR). Non-preemptive regions are separated by Preemption Points (PP), where the scheduler can take control and schedule other tasks. Preemption points are decided statically by the programmer, therefore their location and number are controllable by the designers. The advantage of this model is that it is in line with the current practice adopted in critical software development, so that the derived results can be applied to real applications with little effort, as we will


see in the next chapter, when actually implementing this technique in the Erika RTOS. The problem designers have to deal with when adopting this approach is the size of each NPR. Creating a too long NPR can jeopardize the schedulability of the task-set by introducing unacceptable blocking times, while too frequent preemption points may yield higher overhead and degrade the performance to that of the fully preemptive scheduling policy. In [Bar05] and [YBB09], a mathematical bound is provided for the maximum NPR length. If designers respect this bound when sectioning each task into a sequence of NPRs, the schedulability of the system is guaranteed. This bound is calculated to exploit the available slack in the system, while still respecting the blocking tolerance of the higher priority tasks.

[Figure: three tasks with their maximum allowed NPR lengths and blocking tolerances:

Q_1 = ∞        β_1 = D_1 − C_1
Q_2 = β_1      β_2 = D_2 − (C_1 + C_2)
Q_3 = min{β_1, β_2}

In general: Q_i = min_{1≤j<i} {β_j}, with β_i depending on the scheduling policy.]

Fig. 4.3: Example of maximum NPR length calculation.

The trivial example in figure 4.3 clarifies the approach; in the next few paragraphs we present its mathematical formulation.

To compute the maximum NPR length and then test the schedulability under FP, we use the request bound function rbf_i(a) in an interval a, defined as

\[ \mathrm{rbf}_i(a) = \left\lceil \frac{a}{T_i} \right\rceil C_i . \]

Under EDF, the analysis is carried out with the demand bound function dbf_i(a) in an interval a, defined as

\[ \mathrm{dbf}_i(a) = \left( 1 + \left\lfloor \frac{a - D_i}{T_i} \right\rfloor \right) C_i . \]
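Both functions translate directly into code; the C sketch below assumes task parameters expressed in integer time units.

    /* Task parameters: WCET, period and relative deadline. */
    typedef struct { long C, T, D; } task_t;

    /* Request bound function, used in the FP analysis: ceil(a/T) * C. */
    long rbf(const task_t *t, long a)
    {
        return ((a + t->T - 1) / t->T) * t->C;
    }

    /* Demand bound function, used in the EDF analysis:
     * (1 + floor((a - D)/T)) * C, zero while no deadline falls in [0, a]. */
    long dbf(const task_t *t, long a)
    {
        if (a < t->D)
            return 0;
        return (1 + (a - t->D) / t->T) * t->C;
    }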

Moreover, we conventionally set D_{n+1} equal to the minimum between: (i) the least common multiple (lcm) of T_1, T_2, ..., T_n, and (ii) the following expression¹:

\[ \max\left( D_n,\; \frac{1}{1-U} \cdot \sum_{i=1}^{n} U_i \cdot \max\left(0,\, T_i - D_i\right) \right). \]

We use q_k^max to refer to the WCET of the largest NPR of task τ_k. The largest blocking B_i that a task τ_i might experience is given, under both FP and EDF, by the length of the largest non-preemptive chunk belonging to tasks with index higher than i:

\[ B_i = \max_{i < k \leq n+1} \{ q_k^{max} \} , \tag{4.1} \]

where q_{n+1}^max = 0 by definition. Using this notation, we can summarize the results presented in [YBB09, Bar05]. The next theorem derives a schedulability condition under limited preemptions, for FP and EDF.

Theorem 1. A task set τ is schedulable with limited preemption EDF or FP if, for all i | 1 ≤ i ≤ n,

\[ B_i \leq \beta_i , \tag{4.2} \]

where, under FP, β_i is given by

\[ \beta_i^{FP} \doteq \max_{a \in A \,|\, a \leq D_i} \left\{ a - \sum_{j \leq i} \mathrm{rbf}_j(a) \right\} \tag{4.3} \]

with

\[ A = \{ k\,T_j ,\; k \in \mathbb{N} ,\; 1 \leq j < n \} , \]

whereas, under EDF, β_i is given by

\[ \beta_i^{EDF} \doteq \min_{a \in A \,|\, D_i \leq a < D_{i+1}} \left\{ a - \sum_{\tau_j \in \tau} \mathrm{dbf}_j(a) \right\} , \tag{4.4} \]

with

\[ A = \{ k\,T_j + D_j ,\; k \in \mathbb{N} ,\; 1 \leq j \leq n \} . \]

An alternative formulation of this theorem expresses a different schedulability condition taking into account the definition of B_i. We can define a bound Q_k on the longest non-preemptive region q_k^max of each task τ_k, and check whether this bound is respected to verify the schedulability.

¹The expression may in general be exponential in the parameters of τ; however, it is pseudo-polynomial if the system utilization is a priori bounded from above by a constant less than one.


Theorem 2. A task set τ is schedulable with limited preemption EDF or FP if, for all k | 1 < k ≤ n + 1,

\[ q_k^{max} \leq Q_k \doteq \min_{1 \leq i < k} \{ \beta_i \} , \tag{4.5} \]

where β_i is given by Equation (4.3) in the FP case, and by Equation (4.4) in the EDF case.

Proof. A sufficient schedulability condition can be obtained combining Theorem 1 with Equation (4.1):

\[ \bigwedge_{1 \leq i \leq n} \left( \max_{i < k \leq n+1} \{ q_k^{max} \} \leq \beta_i \right) . \]

The inner inequality can be rewritten as a system of inequalities, as follows:

\[ \bigwedge_{1 \leq i \leq n} \; \bigwedge_{i < k \leq n+1} \left( q_k^{max} \leq \beta_i \right) . \]

Developing the system, it is possible to obtain

\[ \bigwedge_{1 < k \leq n+1} \; \bigwedge_{1 \leq i < k} \left( q_k^{max} \leq \beta_i \right) , \]

which is equivalent to

\[ \forall k \,|\, 1 < k \leq n+1 :\; q_k^{max} \leq \min_{1 \leq i < k} \{ \beta_i \} , \]

proving the theorem.

Note that the definition of Q_k can be rewritten in the following iterative form (starting with Q_1 = ∞), for all 1 < k ≤ n + 1:

\[ Q_k = \min\{ Q_{k-1},\, \beta_{k-1} \} . \tag{4.6} \]

We hereafter prove that the sufficient schedulability condition of Theorem 2 is also necessary under EDF. Suppose the test fails, and consider a q_k for which condition (4.5) evaluates to false, i.e.,

\[ q_k^{max} > \min_{i < k} \{ \beta_i \} = \min_{a \in A \,|\, D_1 \leq a < D_k} \left\{ a - \sum_{\tau_j \in \tau} \mathrm{dbf}_j(a) \right\} . \]

Consider the point a* ∈ A that minimizes the RHS (right hand side) of the above inequality. Then q_k^max > a* − Σ_{τ_j∈τ} dbf_j(a*), and

\[ q_k^{max} + \sum_{\tau_j \in \tau} \mathrm{dbf}_j(a^*) > a^* . \tag{4.7} \]

Consider a situation in which:


• all tasks with relative deadline ≤ a* (< D_k) start synchronously at t = 0;

• task τ_k enters its largest NPR of length q_k^max an arbitrarily small amount of time before t = 0. Since τ_k is the only task executing before t = 0, it is always possible to build such a situation.

In the above conditions, the total demand in [0, a*) is equal to the LHS (left hand side) of Equation (4.7): the total demand exceeds the length of the interval, leading to a deadline miss. Hence, if the test of Theorem 2 fails, the task set is not schedulable with limited preemption EDF.

Under FP, instead, the test is necessary and sufficient only when no information is available on the location of each non-preemptive region, as in the “floating” NPR model adopted in [YBB09]. When instead the position of the (last) NPR of each task is known, i.e., under the “fixed” NPR model, the theorem is only sufficient. An exact test could be derived, significantly complicating the analysis, by adopting the techniques described in [BLV07b].

However, directly applying the previous theoretical results is not so straightforward. Computing the maximum lengths of the non-preemptive regions requires the knowledge of the worst-case execution times, which in turn are significantly influenced by the number of preemptions. To deal with this circular dependency, we propose an iterative algorithm that considers both problems at the same time. It can be summarized with the following steps.

• The algorithm starts with no preemption points for each task, i.e., setting p_i = 1 and q_i^max = C_i^NP, ∀i, where C_i^NP is the worst-case execution time of τ_i when it executes non-preemptively. This value can be found using timing analysis tools, without needing to take preemptions into account.

• Then, β_i is computed by Equation (4.3) or (4.4) for increasing indexes, starting from β_1. Note that β_i depends on the C_j of the tasks with indexes j ≤ i. For these tasks the position and number of PPs has already been decided, therefore the WCET can be computed: assuming a fixed overhead ξ_j for each preemption, it is given by C_j^NP + (p_j − 1)ξ_j.

• Then, Q_{i+1} is computed from the β_{k≤i} using Theorem 2.

• If Q_{i+1} is smaller than the maximum non-preemptive region of τ_{i+1}, procedure PPlace(Q_{i+1}, i+1) is invoked to place preemption points in task τ_{i+1} in appropriate positions, so as to respect the bound Q_{i+1}.

• If PPlace(Q_{i+1}, i+1) returns false, the algorithm stops, declaring the task set infeasible. The failing condition of PPlace(Q_k, k) is (Q_k ≤ ξ_k): if Q_k ≤ ξ_k, the execution time available to τ_k would be entirely dedicated to the preemption overhead.

• When all Q_i values have been successfully checked, the algorithm returns, having guaranteed the schedulability of the task set.

The pseudo-code of the algorithm is summarized in Figure 4.4.

InsertPP(τ)
Tasks ordered by non-increasing relative deadline.
Initialize: {q_i^max ← C_i^NP} for i = 1..n, q_{n+1}^max ← 0,
{p_i ← 1} for i = 1..n, and Q_1 ← ∞.

1   for (i : 1 ≤ i ≤ n)
2       C_i ← C_i^NP + (p_i − 1)ξ_i
3       Compute β_i using Equation (4.3) or (4.4);
4       Q_{i+1} ← min{Q_i, β_i}
5       if (q_{i+1}^max > Q_{i+1})
6           if (PPlace(Q_{i+1}, i+1) = false)
7               return (Infeasible)
    endfor
8   return (Feasible)

Fig. 4.4: Algorithm to compute the maximum allowed NPR length for each task.

For the moment we assume a fixed bound on the cost of a preemption at any position in the code. In a scenario without bus contention delays (i.e. no other DMA-capable devices on the bus) and with a processor without timing anomalies, such a bound can be easily computed. For example, we can statically analyse the code of the RTOS that implements the context switch, compute the maximum delay a pipeline flush can introduce, and account as CRPD the total time to refill the whole cache (or the maximum number of UCBs, if this information is available). In this scenario a simple implementation of PPlace(Q_{i+1}, i+1) can be adopted. An example implementation inserting the least possible number of PPs, while still respecting the constraint on the maximum NPR, is presented in figure 4.5. This is achieved by placing a first PP after Q_{i+1} time-units of (non-preemptive) execution from the beginning of τ_{i+1}. To account for the preemption overhead, further PPs are placed after every (Q_{i+1} − ξ_{i+1}) time-units, until the end of the code is reached. Later in this chapter, we will consider smarter ways to insert the preemption points within the constraint imposed by the InsertPP(τ) algorithm.



PPlace($Q_k$, $k$)
Let $\xi_k$ be the preemption overhead of $\tau_k$, $1 \le k \le n$, and $\xi_{n+1} \leftarrow 0$.
1  if ($Q_k \le \xi_k$) return false
2  Place a PP in $\tau_k$ at $Q_k$ and after every $(Q_k - \xi_k)$.
3  $p_k \leftarrow \left\lceil \frac{C_k^{NP} - Q_k}{Q_k - \xi_k} \right\rceil + 1$
4  $q_k^{max} \leftarrow Q_k$
5  return (true)

Fig. 4.5: Simple implementation of PPlace.
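A minimal, self-contained C++ sketch of this simple placement strategy follows; it treats execution time as a continuous quantity with a fixed overhead ξ per PP, and all names and numeric values are ours, introduced only for illustration.

#include <cstdio>
#include <vector>

// Returns the offsets (in execution-time units from the start of the task)
// at which the simple PPlace of Figure 4.5 inserts preemption points, or an
// empty vector when Q <= xi (the failing condition of PPlace).
std::vector<double> simplePPlace(double Q, double xi, double cNP) {
    std::vector<double> pps;
    if (Q <= xi) return pps;
    // First PP after Q time-units of non-preemptive execution, then one
    // every (Q - xi) time-units until the end of the code is reached.
    for (double t = Q; t < cNP; t += Q - xi) pps.push_back(t);
    return pps;
}

int main() {
    // Hypothetical numbers: C^NP = 100, Q = 25, xi = 5; this yields
    // p = ceil((100 - 25) / (25 - 5)) + 1 = 5 NPRs, i.e. 4 preemption
    // points at offsets 25, 45, 65 and 85.
    for (double t : simplePPlace(25, 5, 100)) std::printf("%g ", t);
    std::printf("\n");
}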

We hereafter prove the correctness of procedure InsertPP(τ) in deriving a schedulability condition. Then, we will show some optimality properties of the adopted method.

Theorem 3. Procedure InsertPP(τ) is correct.

Proof. If the procedure succeeds, each $Q_k$ will be larger than or equal to the maximum non-preemptive region $q_k^{max}$ of each task $\tau_k$, $1 \le k \le n$, and $\beta_n \ge Q_{n+1} \ge 0$. Note that, both in the EDF and in the FP case, the $\beta_i$ value computed at line 3 of the algorithm depends only on the $C_j$ values with $j \le i$ (as well as on deadlines and periods, which cannot change). Since none of these values may change in the following iterations (because PPs are inserted only into the code of tasks $\tau_{k>i}$), all the $\beta_i$, and therefore $Q_{i+1}$, are correctly computed. By Theorems 1 and 2, the correctness of the procedure is assured.

Having proved the correctness of InsertPP(τ), we now show that this procedure is optimal under EDF scheduling if PPlace($Q_k$, $k$) is optimal, too. This means that if the algorithm fails, then any other possible way to place PPs in the task code leads to an unfeasible schedule. An optimal PPlace($Q_k$, $k$) should insert the preemption points in task $\tau_k$ in positions of the code that respect the $Q_k$ bound and produce the minimum total preemption cost. If the cost of each preemption is considered the same, the minimum preemption overhead is obtained by placing the least possible number of PPs, as the simple implementation presented in Figure 4.5 does.

Theorem 4. Procedure InsertPP(τ) is optimal under EDF if an optimal procedure PPlace($Q_k$, $k$) is used.

Proof. Suppose, by contradiction, that there is a feasible task set τ for which procedure InsertPP(τ) fails. Then, there is at least one task $\tau_k$ for which procedure PPlace($Q_k$, $k$) fails.


Let $\tau_k$ be the task with the smallest index for which the procedure fails, and let $\beta_i$ be the value that minimizes $Q_k$, i.e., $i = \arg\min_{j=1}^{k-1}\{\beta_j\}$. As previously mentioned, $\beta_i$ is a function of the worst-case execution times, deadlines and periods of all tasks $\tau_{j \le i}$. While the latter values ($D_j$ and $T_j$) are fixed, the execution times $C_j$ may vary for different placements of PPs in the code of each task $\tau_j$. Note that $\beta_i$ is a decreasing function of all $C_{j \le i}$. We now prove by induction that procedure InsertPP(τ) finds the smallest values $C_{j \le i}$ among the feasible PP allocation strategies. Therefore, if τ is feasible, the largest possible $\beta_i$ is found with InsertPP(τ). If such a value is too small (or even negative), so that no PP placement can be found for a task $\tau_{k>i}$ to satisfy $q_k^{max} \le \beta_i$, then this latter condition will be violated by any other possible strategy, since no other strategy can lead to a larger $\beta_i$. Therefore, τ is not feasible, reaching a contradiction.

Base case Independently of the number of PPs, task $\tau_1$ is always executed non-preemptively, both under FP and under EDF. Note that procedure InsertPP(τ) does not insert any PP in $\tau_1$, leading to the smallest possible value of $C_1$.

Inductive step Let $j \le i$. Assume InsertPP($\tau^{(j-1)}$) obtained the schedulability of the reduced task set $\tau^{(j-1)} := \{\tau_1, \ldots, \tau_{j-1}\}$, minimizing the worst-case execution times $C_1, \ldots, C_{j-1}$. We will prove that InsertPP($\tau^{(j)}$) also obtains the schedulability of the set $\tau^{(j)} = \tau^{(j-1)} \cup \{\tau_j\}$, minimizing $C_j$.

It is easy to see that the PP allocation strategy for tasks $\tau_1, \ldots, \tau_{j-1}$ is the same for InsertPP($\tau^{(j-1)}$) and InsertPP($\tau^{(j)}$), so that there is no change in any $C_k$ or $q_k^{max}$, for $1 \le k \le j-1$. Since $\tau^{(j-1)}$ was schedulable, the schedulability of $\tau^{(j)}$ can be obtained, by Theorem 2, if $q_j^{max} \le Q_j = \min\{\beta_{k<j}\}$. By the inductive hypothesis, the procedure obtains the largest possible values for each $\beta_{k<j}$ (since $C_1, \ldots, C_{j-1}$ are minimized). Hence, no other possible PP allocation for the tasks in $\tau^{(j-1)}$ can result in a larger $Q_j = \min\{\beta_{k<j}\}$. Since $Q_j$ is a tight bound on the maximum NPR length in the EDF case, and procedure PPlace($Q_j$, $j$) is optimal, the smallest possible $C_j$ is produced, proving the statement.

Note that the optimality of procedure InsertPP(τ) depends on (i) the tightness of the $Q_k$ bounds computed with Equation (4.5), and (ii) the assumption of an optimal PPlace($Q_j$, $j$) procedure. Regarding the first point, as explained previously, condition (4.5) is necessary and sufficient only in the EDF case. In the FP case, instead, condition (4.5) is tight only when the floating NPR model is adopted. Otherwise, a larger bound $Q_k$ could be derived by considering the exact location of the last NPR of $\tau_k$.


However, this would imply a much more complex analysis.²

Regarding point (ii), if we assume identical preemption overheads for all the PPs of each task, an optimal procedure is trivial. If the information provided by the timing analysis on the CRPD at the different PPs is considered, placing the preemptions in optimal positions in procedure PPlace($Q_j$, $j$) is more complex; we will deal with this problem later in this chapter. Notice that, if a different preemption cost is considered for each PP, the worst-case execution time of task $\tau_i$ computed at line 2 of procedure InsertPP(τ) has to be replaced by a tighter expression.

4.3 Timing analysis to verify manually inserted PPs

The algorithm described in the previous section provides a bound on the maximum NPR length for each task in sequence. This information may be used by the designers as a guideline to insert the preemption points in the code. Programmers may choose convenient points in the task for the insertion, using their knowledge of the code and their experience. Further information may be provided by static analysis tools, by means of the UCB analysis described in the previous chapter. The programmers can take advantage of this information to place the PPs in positions of the task code where the CRPD is low. Using such a manual implementation of the procedure PPlace($Q_{i+1}$, $i+1$), the algorithm can be used as-is on any real-world task set. The WCET of the task can then be estimated more precisely, using the number of UCBs at the chosen PPs.

Still, after manually inserting the preemption points in a task $\tau_k$, the actual maximum NPR length has to be estimated to verify the schedulability condition $q_k^{max} \le Q_k$. In this section we present a technique, recently proposed by Altmeyer, Burguière and Wilhelm in [ABW09], that can be used for this purpose. This work has been developed by a different group, within the scope of the same project. For this reason, a different terminology is used and the maximum NPR length is referred to as "maximum blocking time".

We already presented in the previous chapter the typical timing analysis work-flow for the WCET of a task. To implement this technique, only path analysis, the last step in the analysis work-flow, has to be modified. This means that many existing timing analysis tools can easily be modified to implement it.

² As explained in [BLV07b], when the exact length of the last NPR of a task $\tau_k$ is fixed and known a priori, the worst-case response time of $\tau_k$ is not necessarily given by the first instance of $\tau_k$ after a critical instant. A necessary and sufficient schedulability condition would then need to check a large number of possible arrival times for $\tau_k$, resulting in a much more complex schedulability condition. See [BLV07b] for further details.


Furthermore, future improvements in the earlier modules of the tools will automatically carry over to this analysis. This is particularly important for additions to the processor behaviour module to support new types of cache memories. The estimation of the maximum NPR length is based on light modifications to the IPET (implicit path enumeration technique) analysis.

Fig. 4.6: Example IPET analysis.

The basic IPET analysis consists in describing the CFG (control flow graph) of the task code as a set of linear constraints, using typical Operational Research techniques. The cost of the path from start to end is then used as the objective function to maximize, and the resulting ILP (integer linear program) is solved. In Figure 4.6 an example CFG is analysed with the IPET technique.
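For illustration, a generic textbook-style IPET encoding (not the exact constraint set of Figure 4.6, which is not reproduced here) has the following shape, where $x_B$ counts the executions of basic block B, $c_B$ is its WCET, $f_e$ counts the traversals of CFG edge e, and $n_B$ is a loop bound:

$$\max \sum_{B} c_B\, x_B \quad \text{subject to} \quad x_{start} = 1; \qquad \sum_{e \in \mathrm{in}(B)} f_e = x_B = \sum_{e \in \mathrm{out}(B)} f_e \;\; \forall B; \qquad x_B \le n_B \;\; \text{for blocks inside loops}$$

The optimum of this ILP is the WCET estimate of the whole task.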

The modified version of the IPET builds an ILP to search for the longest path from the program start or a preemption point to the program end or a preemption point. Obviously, for the paths starting from a preemption point, the CRPD of that point in the code is added to the path length. This technique can be implemented by disconnecting the CFG at the preemption points and adding an artificial node. An example modified CFG for this analysis is presented in Figure 4.7; in this example a PP is inserted after the code of basic block 4 and another before basic block 5. For further details on how to automatically modify the ILP to implement this analysis, refer to [ABW09].


Fig. 4.7: CFG modified for the maximum NPR length analysis.

4.4 Automatically choosing preemption points

Manually inserting preemption points allows adopting limited preemption scheduling with already available tools. The number of preemptions is drastically reduced compared to the fully preemptive scenario, and the positions of the possible preemptions are well known, so many overestimations of the worst case CRPD are avoided. However, this manual implementation of the PPlace($Q_{i+1}$, $i+1$) procedure is not guaranteed to place the preemption points optimally. The algorithm InsertPP(τ), proposed in the second section of this chapter, is optimal only under the assumption that the PPlace($Q_{i+1}$, $i+1$) procedure is optimal, too. This means that if a task set turns out not to be schedulable because a too small $Q_j$ is computed for a task $\tau_j$, the designers have no guarantee that the task set is infeasible under every other placement of the PPs. For instance, a feasible schedule may still be found by placing the PPs in better positions in the previous tasks $\tau_1 \ldots \tau_{j-1}$ (those with a higher priority or a lower relative deadline than $\tau_j$). Still, there is no clue as to which task needs different PPs. A better approach for implementing the PPlace($Q_{i+1}$, $i+1$) procedure is to use the information provided by timing analysis tools to automatically choose the best PPs, minimizing the WCET while still respecting the bound imposed by $Q_j$ on the maximum NPR length.

In this section we present an algorithm to automatically choose the locations of the PPs for sequential code. The algorithm can be used in its current form even on real-world code with branches and loops, by using sequential potential preemption points (PPPs): with this approach, branches and loops are enclosed in a macroblock of code between two PPPs. In future work the algorithm can be extended to a full CFG, with branches and loops.


Even in the sequential version, the problem is not easy to solve if the PPPs are not assumed to introduce the same delay in the execution. The naive approach presented in the first implementation of procedure PPlace($Q_{i+1}$, $i+1$) is not optimal in this case. Continuing the execution of task $\tau_{i+1}$ for the whole duration $Q_{i+1}$ could lead to inserting a preemption point in a position where the introduced CRPD is very high. In some cases, it may be convenient to insert more preemption points than the least possible number, to take advantage of points with a low CRPD. For example, in the scenario represented in Figure 4.8, inserting just one PP is possible. However, by inserting two PPs, we can exploit positions in the code with a low preemption overhead.

[Fig. 4.8: Example sequence of macroblocks of code with preemption cost at each PPP (costs 1, 2, 3, 3, 1), with Q = 8. Inserting the least possible number of PPs (just one) yields CRPD = 3; the optimal solution (inserting two PPs) yields CRPD = 2.]

In this rather simple example, by enumerating the two possible choices, we were able to identify the best solution; with more basic blocks, however, this is usually not possible.

In the task model adopted in this section, a task is composed of a set of sequential basic blocks, separated by potential preemption points. In this section we will use k as an index for the potential preemption points. For instance, if a task has N basic blocks (in the example in Figure 4.8, N = 6), the first PPP is identified by the index k = 1, and the last by k = N − 1 = 5. For convenience of notation, we include the start and the end of the program code in the set of potential preemption points, identifying them with k = 0 and k = N respectively. We will use $b_k$ for the WCET of each basic block. In particular, the WCET of the first basic block is $b_0$, and that of the last one is $b_{N-1}$. Notice that, with this notation, the k-th PPP is followed by a basic block with WCET $b_k$.


If we denote by $WCET^{NP}_k$ the WCET from the beginning of the program code to the k-th potential preemption point when no preemptions happen (a value that any timing analysis tool can estimate), we can express the WCET of each basic block as:

$$b_k = WCET^{NP}_{k+1} - WCET^{NP}_k$$

With this expression, pipelining effects are accounted for within $b_k$. Notice that the non-preemptive WCET of the task can be expressed in terms of the WCETs of the basic blocks:

$$C^{NP} = \sum_{k=0}^{N-1} b_k$$

Another piece of information necessary to fully represent a task within the adopted model is the estimated overhead introduced by a preemption at potential preemption point k, denoted C(k). The implementation of the PPlace(Q, τ) procedure we propose in this section requires as input the maximum allowed NPR length Q and the specification of the task in this model. An example task is presented in Figure 4.9 to clarify the terminology.

[Fig. 4.9: Example sequence of macroblocks of code with preemption cost at each PPP. Q = 8; block WCETs: $b_0 = 2$, $b_1 = 2$, $b_2 = 2$, $b_3 = 1$, $b_4 = 2$, $b_5 = 3$; preemption costs: C(0) = 0, C(1) = 1, C(2) = 2, C(3) = 3, C(4) = 3, C(5) = 1, C(6) = 0.]

Obviously, preempting before the start or after the end of the task does not introduce any overhead in the WCET, therefore C(0) = 0 and C(N) = 0. For the other PPPs, the cost function has to include all the costs of a preemption:

• pipeline flushing;

• RTOS code WCET (including context switch and scheduler);

• CRPD (that can be estimated with a UCB analysis);

• additional delay due to bus contention.

If we assume a processor without timing anomalies and, for the sake of simplicity, an architecture with no DMA-capable devices, all of the above costs can be estimated precisely.
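Under these assumptions, the cost at PPP k can be seen as the sum of the four contributions listed above. The following decomposition is only illustrative (the symbols are ours, not a formal definition used elsewhere in this thesis):

$$C(k) = \Delta_{pipe} + C_{RTOS} + CRPD(k) + \Delta_{bus}(k)$$

where $\Delta_{pipe}$ bounds the pipeline flush delay, $C_{RTOS}$ is the WCET of the context switch and scheduler code, $CRPD(k)$ is the cache refill delay at PPP k (e.g., derived from a UCB analysis), and $\Delta_{bus}(k)$ is the additional bus contention delay, which is zero in the simplified architecture assumed here.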


The goal of this implementation of the procedure PPlace(Q, τ) is to choose a set of potential preemption points to activate. The sequence of basic blocks between two consecutive activated PPPs i and j forms a non-preemptive region. The WCET of this NPR, q′, is composed of the WCETs of the basic blocks between i and j, plus the CRPD caused by a preemption at the starting PPP i. The length of each NPR can be calculated with Equation (4.8).

$$q' = C(i) + \sum_{k=i}^{j-1} b_k \qquad (4.8)$$

Notice that, in this scenario, if we know the WCET (including the cost of preemptions) up to PPP i, $WCET_i$, we can easily compute the WCET up to j using Equation (4.9).

$$WCET_j = WCET_i + q' = WCET_i + C(i) + \sum_{k=i}^{j-1} b_k \qquad (4.9)$$

Still, every non-preemptive region of a feasible PP placement for the task has to respect the bound $q' \le Q$. By applying this bound to Equation (4.8), we obtain:

$$C(i) + \sum_{k=i}^{j-1} b_k \le Q \qquad (4.10)$$

Assume we know the last PPP of an NPR, j. By using Equation (4.10), we can identify the set of PPPs eligible to be the start of the considered NPR. We can compute this set for each PPP in the task, and we can therefore define a function PrevQ(j) as the set of feasible previous preemption points when j is the end of an NPR. This function is defined in Equation (4.11).

$$PrevQ(j) := \left\{ i \;\middle|\; i < j \,\wedge\, C(i) + \sum_{k=i}^{j-1} b_k \le Q \right\} \qquad (4.11)$$

Assume we have a function WCET(i), which computes the worst-case execution time up to the i-th PPP. For any PPP of index j, we can then define a function PrevPP(j), which selects the PPP in the PrevQ(j) set that minimizes the WCET up to the j-th PPP, as expressed by Equation (4.12). For convenience we define PrevPP(0) := 0.

$$PrevPP(j) := \underset{i \in PrevQ(j)}{\arg\min} \left( WCET(i) + C(i) + \sum_{k=i}^{j-1} b_k \right) \qquad (4.12)$$

By using this function, we can develop Equation (4.9) and define the function WCET(i) we assumed to exist in the previous paragraph.


Again, we define $WCET(0) := 0$ for convenience.

$$WCET(j) := WCET(PrevPP(j)) + C(PrevPP(j)) + \sum_{k=PrevPP(j)}^{j-1} b_k \qquad (4.13)$$

Notice that the PrevPP(j) function only uses WCET(i) with i < j, and can therefore be calculated without knowing WCET(j).

The PrevPP(j) function can be used to find a feasible placement of the PPs in the task. In particular, if N is the end of the program, PrevPP(N) is the index of the last PPP that should be activated. The index of the penultimate PPP to activate is PrevPP(PrevPP(N)), and so on. When the result of this recursive lookup of PrevPP(j) is 0, the start of the program has been reached. Furthermore, the calculated WCET(N) is the worst-case execution time to reach the end of the program. This value can be used by the following step of the InsertPP(τ) algorithm to compute the Q bound for the next task. Using this technique, we can define a new algorithm for the PPlace(Q, τ) procedure, presented in Figure 4.10.

PPlace(Q, τ)
Initialize: PrevPP(0) ← 0, WCET(0) ← 0, PrevQ ← {0}
1   for (j : 1 ≤ j ≤ N)
2       Remove from PrevQ the infeasible PPPs (Equation (4.11))
3       if (PrevQ = ∅)
4           return (Infeasible)
5       Compute PrevPP(j) within PrevQ (Equation (4.12))
6       Compute WCET(j) using Equation (4.13)
7       PrevQ ← PrevQ ∪ {j}
    endfor
8   j ← PrevPP(N)
9   while (j > 0)
10      Activate the j-th potential preemption point
11      j ← PrevPP(j)
    endwhile
12  return (Feasible)

Fig. 4.10: Algorithm to automatically place the preemption points in the task.

A short implementation in C++ of this algorithm may be found in the appendix of this thesis.

Given the input data previously presented in Figure 4.9, the execution of the algorithm is shown in Figure 4.11.


[Task of Figure 4.9, Q = 8.]

j | PrevQ(j)     | PrevPP(j) | WCET(j)
0 |      —       |     0     |    0
1 | {0}          |     0     |    2
2 | {0, 1}       |     0     |    4
3 | {0, 1, 2}    |     0     |    6
4 | {0, 1, 2, 3} |     0     |    7
5 | {1, 2, 3, 4} |     1     |   10
6 | {4, 5}       |     5     |   14

Activate PrevPP(6) = 5 and PrevPP(5) = 1.

Fig. 4.11: Example execution of algorithm PPlace.

For the start of the program, j = 0, the values PrevPP(0) = 0 and WCET(0) = 0 are given by definition. For the first four potential preemption points, j = 1 . . . 4, the execution time without preemptions from the start of the program does not violate the bound Q = 8. Indeed, for these points the start of the program is in the feasible set, $0 \in PrevQ(j)$. Since C(0) = 0, the minimum WCET(j) is obtained when the previous preemption point is the start of the program, i = 0. This means that the program can run with preemptions disabled up to these points. Indeed, the results of the algorithm for these PPPs are PrevPP(j) = 0 and $WCET(j) = \sum_{i=0}^{j-1} b_i$, as shown in the figure.

For j = 5 the start of the program is no longer within the Q window ($0 \notin PrevQ(j)$), because the non-preemptive execution time from the start to the 5-th PPP is $\sum_{i=0}^{4} b_i = 9$. Notice that all the PPPs in PrevQ(j) can execute non-preemptively from the beginning. This means that, for any choice made while searching for the argmin, there will be only one preemption point. Within PrevQ(j), the PPP with the lowest CRPD is 1 (C(1) = 1), so this is the best start for an NPR ending at j = 5. The results are therefore PrevPP(5) = 1 and WCET(5) = 2 + 1 + 7 = 10.

For the end of the program, j = N = 6, the first three PPPs are no longer feasible solutions. Indeed, the first PPP is too distant even without considering preemption costs: $\sum_{i=1}^{5} b_i = 10 > Q = 8$. For i = 2 and i = 3, the bound on the NPR length is violated when the delay introduced by the preemption is considered: indeed $C(2) + \sum_{i=2}^{5} b_i = 2 + 8 = 10 > Q = 8$ and $C(3) + \sum_{i=3}^{5} b_i = 3 + 6 = 9 > Q = 8$. Between the two remaining points in PrevQ(6) = {4, 5}, we use Equation (4.12) to choose the solution:

$$WCET(4) + C(4) + \sum_{i=4}^{5} b_i = 7 + 3 + 5 = 15$$

$$WCET(5) + C(5) + \sum_{i=5}^{5} b_i = 10 + 1 + 3 = 14$$

$$PrevPP(6) = \underset{k \in \{4,5\}}{\arg\min} \left( WCET(k) + C(k) + \sum_{i=k}^{5} b_i \right) = 5$$

$$WCET(6) = WCET(5) + C(5) + \sum_{i=5}^{5} b_i = 14$$

Finally, by recursively looking up the results of the PrevPP(j) function, we can select the PPPs to activate. We start by activating the result of this function for the end of the program, PrevPP(6) = 5. Then we activate PrevPP(5) = 1. Since PrevPP(1) = 0, the start of the program has been reached.

We hereafter prove that the proposed algorithm is optimal. This means that if this algorithm cannot find a feasible PP placement, no other algorithm can find one. Furthermore, the WCET of the task (including preemption costs) produced by placing the PPs with this algorithm is the minimum possible, within the constraint on the maximum NPR length.

Theorem 5. If procedure PPlace(Q, τ) deems a task infeasible, then no feasible PP placement exists.

Proof. If procedure PPlace(Q, τ) returns infeasible, there is a PPP j for which PrevQ(j) = ∅. This means that no PPP satisfies the conditions of Equation (4.11), so for every PPP i with i < j it must hold that

$$C(i) + \sum_{k=i}^{j-1} b_k = q' > Q$$

This means that, whatever PPP is activated, any NPR ending at j will violate the bound imposed by Q, and will therefore lead to an infeasible PP placement. If j is not activated, the NPR that includes j still violates the bound. In this case, let z be the smallest index of an activated PPP such that j < z. Since j is not activated, there must be a PPP i < j that starts the NPR ending at z. The WCET of this NPR can be expressed as:

$$C(i) + \sum_{k=i}^{z-1} b_k = C(i) + \sum_{k=i}^{j-1} b_k + \sum_{k=j}^{z-1} b_k > C(i) + \sum_{k=i}^{j-1} b_k > Q$$

proving the PP placement to be infeasible.


Theorem 6. The WCET produced by procedure PPlace(Q, τ) is optimal.

Proof. We prove by induction that the function WCET(j) is optimal for all j = 0 . . . N.

Base case For j = 0, WCET(j) = 0 by definition. This is the minimum possible value of the WCET.

Inductive step Assume WCET(i) is optimal $\forall i < j$. We define a utility function

$$WCET_{util}(i, j) = WCET(i) + C(i) + \sum_{k=i}^{j-1} b_k$$

and reformulate the definitions of the two functions in Equations (4.12) and (4.13):

$$PrevPP(j) = \underset{i \in PrevQ(j)}{\arg\min}\; WCET_{util}(i, j)$$

$$WCET(j) = WCET_{util}(PrevPP(j), j)$$

With this formulation, it is clear that the PrevPP(j) function explicitly searches for the solution that minimizes WCET(j) within the set of feasible PPPs PrevQ(j):

$$WCET(j) = WCET_{util}\Big( \underset{i \in PrevQ(j)}{\arg\min}\; WCET_{util}(i, j),\; j \Big) = \min_{i \in PrevQ(j)} WCET_{util}(i, j)$$

So WCET(j) is optimal by construction, within the feasible set of solutions.

WCET(N) represents the WCET up to the end of the program. Since the inductive optimality argument holds for all j = 0 . . . N, the WCET of the whole task is optimal, too.

We hereafter present some considerations on the implementation of the algorithm. For each potential preemption point j, the implementation has to store the PrevQ(j) set in memory. In later steps, only information (such as the WCET and the cost of preemption) about the elements of this set is needed, so an implementation of this algorithm has to keep in memory information only about those elements. Since the size of this set depends on Q, the memory requirement of the algorithm is in the order of O(Q). If a trace of the PrevPP(j) function is maintained in memory for the final recursive lookup of the PPPs to activate, additional O(N) memory (where N is the number of sequential basic blocks) may be required. For long tasks with many potential preemption points, N may be high; in these cases, the trace of the function may be stored in a temporary file and the final recursive lookup may be implemented as random accesses to this file. A naive implementation would search, at each step, within the set PrevQ(j) for the element yielding the minimum value, to calculate PrevPP(j). This would require O(Q) time for each PPP, yielding a time complexity of O(N × Q). A smarter implementation maintains the PrevQ(j) set in an ordered queue, where the element generating the minimum WCET is always at the head. Maintaining the ordered queue then consists in:

• removing the infeasible elements from the head (elements that violate the Q bound);

• preparing the new element to add to the set;

• removing from the tail the elements that generate a higher WCET than the new element (elements that would never be selected by the minimum search);

• inserting the new element at the tail.

This technique yields an implementation with O(N) time complexity. The example implementation in the appendix maintains PrevQ(j) in an ordered queue as described above.
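For concreteness, the following self-contained C++ sketch implements the PPlace(Q, τ) procedure of Figure 4.10 with the ordered-queue optimization just described. All names are ours and the actual code in the appendix may differ; the input conventions follow the task model of Figure 4.9.

#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

struct PPlaceResult {
    bool feasible = false;
    long wcet = 0;               // WCET(N), preemption overheads included
    std::vector<int> activated;  // indexes of the activated PPPs
};

// b[k]: WCET of the basic block following PPP k (k = 0..N-1);
// C[k]: preemption cost at PPP k (k = 0..N), with C[0] = C[N] = 0.
PPlaceResult pplace(long Q, const std::vector<long>& b, const std::vector<long>& C) {
    const int N = static_cast<int>(b.size());
    std::vector<long> cumB(N + 1, 0);  // cumB[j] = b[0] + ... + b[j-1]
    for (int k = 0; k < N; ++k) cumB[k + 1] = cumB[k] + b[k];

    std::vector<long> wcet(N + 1, 0);  // WCET(j), Equation (4.13)
    std::vector<int> prev(N + 1, 0);   // PrevPP(j), Equation (4.12)

    // WCETutil(i, j) = WCET(i) + C(i) + cumB[j] - cumB[i]; the cumB[j] term
    // is common to all candidates i, so ordering by key(i) is enough.
    auto key = [&](int i) { return wcet[i] + C[i] - cumB[i]; };
    // PPP i may start the NPR ending at j iff C(i) + cumB[j] - cumB[i] <= Q.
    auto ok = [&](int i, int j) { return C[i] + cumB[j] - cumB[i] <= Q; };

    std::deque<int> prevQ = {0};       // PrevQ, keys increasing head to tail
    for (int j = 1; j <= N; ++j) {
        while (!prevQ.empty() && !ok(prevQ.front(), j)) prevQ.pop_front();
        if (prevQ.empty()) return {};  // Infeasible (Theorem 5)
        prev[j] = prevQ.front();       // the head attains the argmin of (4.12)
        wcet[j] = wcet[prev[j]] + C[prev[j]] + cumB[j] - cumB[prev[j]];
        // Dominated tail elements would never win the minimum search.
        while (!prevQ.empty() && key(prevQ.back()) >= key(j)) prevQ.pop_back();
        prevQ.push_back(j);
    }
    PPlaceResult r;
    r.feasible = true;
    r.wcet = wcet[N];
    for (int j = prev[N]; j > 0; j = prev[j]) r.activated.push_back(j);
    std::reverse(r.activated.begin(), r.activated.end());
    return r;
}

int main() {
    // Data of Figure 4.9: the expected output activates PPPs 1 and 5.
    PPlaceResult r = pplace(8, {2, 2, 2, 1, 2, 3}, {0, 1, 2, 3, 3, 1, 0});
    if (!r.feasible) { std::puts("Infeasible"); return 1; }
    std::printf("WCET(N) = %ld, activated PPPs:", r.wcet);
    for (int j : r.activated) std::printf(" %d", j);
    std::printf("\n");
}

On the data of Figure 4.9 this sketch reproduces the result of Figure 4.11 (it activates PPPs 1 and 5, with WCET(6) = 14). Each PPP enters and leaves the deque at most once, which is where the O(N) time complexity comes from.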


Chapter 5

Experiments with limited preemption

We have implemented the proposed limited preemption scheduling in the Erika RTOS and have measured the performance of this approach in our simulation environment. Once more, we want to stress that a major advantage of limited preemption is that it allows for a better estimation of the worst-case cache interference among tasks. Indeed, the number and the exact position of the points in the task where a preemption can happen do not have to be estimated: they are statically known, and therefore analyzable offline, without the need for severe overestimations to guarantee the safety of the system. Moreover, this technique considerably reduces the preemption overhead even in the average case, as we will show in our simulation results.

5.1 Implementing limited preemption in Erika

Erika is an OSEK/VDX-compliant RTOS. In this standard, as in many other real-time oriented OS standards, tasks may be declared non-preemptible. In the OSEK Implementation Language (OIL), each task specification accepts the SCHEDULE attribute. It may be set to FULL (meaning a fully preemptible task) or to NON, which makes a task non-preemptible unless the task itself explicitly calls the Schedule() primitive. By declaring all the tasks non-preemptive, we can implement each preemption point as a simple call to the Schedule() system call and thus implement the limited preemption model as presented in this thesis.
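For illustration, a task configured in this way might look as follows; the task name, the OIL fragment and the work functions are hypothetical, while SCHEDULE, TASK(), Schedule() and TerminateTask() are standard OSEK/VDX elements and "ee.h" is the Erika kernel header:

/* Hypothetical OIL fragment declaring the task non-preemptible:
 *
 *   TASK CoreMarkTask {
 *       SCHEDULE = NON;
 *   };
 */
#include "ee.h"

TASK(CoreMarkTask)
{
    work_block_0();    /* first non-preemptive region */
    Schedule();        /* explicit preemption point */
    work_block_1();    /* second non-preemptive region */
    Schedule();        /* explicit preemption point */
    work_block_2();    /* last non-preemptive region */
    TerminateTask();
}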

Another common approach in an OSEK/VDX-compliant RTOS to prevent a task from being preempted is to lock a special shared resource called RES_SCHEDULER. The resource management is used to coordinate concurrent accesses of several tasks with different priorities to shared resources, e.g. management entities, program sequences, memory or hardware areas, and is mandatory for all conformance classes of the OSEK/VDX standard. The RES_SCHEDULER resource is automatically created to represent the scheduler management entity; any task can therefore lock the scheduler and prevent it from preempting the task.

Both approaches are implemented in Erika in the same way. For resource sharing, Erika uses the Immediate Priority Ceiling protocol under FP scheduling and the Stack Resource Policy under EDF scheduling. Both protocols require each task to have a dual priority: the nominal priority and the preemption threshold priority. The nominal priority is used to schedule the tasks. When a task is running, it can be preempted only by tasks with a nominal priority higher than its threshold priority; if a task has the highest possible preemption threshold priority, it cannot be preempted by any task. To implement the limited preemption model, the preemption threshold priorities of all the tasks in the system are set to the maximum value, therefore the only way for the scheduler to preempt a task is an explicit call to the Schedule() system call.

With this implementation, interrupts may still occur during the execution of a task and have to be handled: blocking the scheduler only prevents other tasks from preempting the task, while interrupt service routines (ISRs) can still be executed. Another possible approach is to disable interrupts at the beginning of the task execution and re-enable them at the end. Without interrupts, no RTOS code can execute unless explicitly called by the task itself, therefore in this scenario manipulating the preemption threshold priorities is not necessary. Preemption points can be implemented by enabling the interrupts and immediately disabling them again (see the snippet after this paragraph). In the CPU cycle after the interrupts are enabled, the interrupts masked until that moment are handled by the CPU, so the ISR of each pending interrupt is executed. When the last ISR releases control to the RTOS, the RTOS runs the scheduler to decide which task to execute. If a new task is dispatched, it will disable interrupts at the beginning of its code. If a previously preempted task is resumed, it will disable interrupts as well because, as we said, preemption points are implemented by enabling and, immediately afterwards, disabling them: if the task was preempted, its execution must have been interrupted after enabling and before disabling the interrupts. This approach allows the estimates of the proposed algorithm to be precise, and the designers of the system do not have to oversize the processing resources to guarantee the safety of the system.
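A preemption point under this interrupt-based implementation then reduces to the following pair of calls; EnableAllInterrupts() and DisableAllInterrupts() are standard OSEK/VDX services, while the surrounding task code is hypothetical:

/* ...non-preemptive region of task code... */
EnableAllInterrupts();   /* pending IRQs are served here; the scheduler may
                            switch to another task before control returns */
DisableAllInterrupts();  /* the next non-preemptive region starts */
/* ...task code continues... */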


[Fig. 5.1: Execution time (CPU cycles) of the measured task vs. number of cache lines evicted by the higher priority task, comparing fully preemptive scheduling, LP (IRQ disabled) and LP (OSEK/VDX).]

In Figure 5.1 we show a detail of the simulation results presented later in this chapter. In this simulated scenario, both the fully preemptive and the limited preemption configurations suffer the same number of preemptions; the differences in the execution times are caused only by the different implementation overheads. Of course, to implement the limited preemption model we need to modify the task code, therefore the memory addresses are slightly different in the three scenarios. As we are using a direct-mapped cache in our experiments, minor differences can also be caused by the changed memory addresses.

As shown in Figure 5.2, implementing the limited preemption model by disabling the interrupts performs better than simply locking preemptions. Obviously, the interrupt handling routines evict some cache lines themselves, therefore some CRPD is still suffered by the task if the interrupts are not disabled. Still, the code of an ISR has to be short to guarantee a fast response to the interrupts, therefore the CRPD caused by interrupt handling is usually negligible with respect to the interference of the other tasks. Indeed, in Figure 5.2 we can observe that, when the number of cache lines evicted by the higher priority task is realistic, the difference between the two implementations is negligible with respect to the CRPD introduced by the higher priority task. The high interrupt handling latency suffered with the interrupt-disabling implementation is not counterbalanced by relevant improvements. Furthermore, the benefits of the lower number of preemptions and of the lower CRPD outweigh any implementation overhead, even in a simple scenario with only two tasks, when the cache interference is realistic.

For this reason we adopted the first implementation, and we propose, as a future development of this research, to include in the model the impact of some higher priority tasks running with a fully preemptive policy.


[Fig. 5.2: Execution time (CPU cycles) vs. number of cache lines evicted by the higher priority task, over a wider range of evictions, comparing fully preemptive scheduling, LP (IRQ disabled) and LP (OSEK/VDX).]

These higher priority tasks can be used to model interrupt handlers, but also to allow some tasks with little free slack time to be scheduled preemptively, so that they do not influence the Q bound on the NPR length of the other tasks. Cache partitioning could be a good technique to avoid any CRPD inflicted by these high priority tasks on those scheduled with limited preemption. This scenario is similar to the hybrid cache partitioning proposed in [BMSW97] and [VLX03], but with the shared partition scheduled under a limited preemption policy.

Another possible implementation of the limited preemption model we considered is not based on modifying the code to insert explicit preemption points. Indeed, the bound Q is expressed in time, therefore we can run a task unmodified and measure its execution time. To implement the model with this approach, we would have to modify the scheduler: after dispatching a task $\tau_i$, the scheduler should set a timer for the time limit Q and refrain from preempting the task; when the timer signals that the Q limit of the non-preemptive run has expired, the scheduler should run again, dispatch the appropriate task, and set the timer for the new task. For the moment we discarded this implementation because it is not in line with current industry practice. We could have implemented this scheduling technique because we are using an open source RTOS and we are not using it for commercial applications; with a commercial off-the-shelf RTOS this is not possible because of closed source code and licensing issues. Instead, the implementation we adopted is perfectly compliant with the OSEK/VDX standard, and can therefore be used as-is on any compliant RTOS.


Moreover, almost every RTOS and real-time OS standard includes similar primitives.

This alternative timer-based implementation does not need any modification to the code of the task, and therefore introduces less overhead than any approach based on explicit preemption points. On the other hand, while the number of preemptions is perfectly known, their location is not. This means that for each preemption the maximum possible CRPD has to be charged to the WCET of the task: as explained earlier in this thesis, this means charging the time required to refill the whole cache, or at least the maximum number of UCBs the task may have at any instruction. It is not possible to choose the best preemption points to further reduce the cache interference, as we proposed in the last section of the previous chapter. On the contrary, when the timer fires, the program may be in the middle of a loop, where the CRPD can be high.

The simulation results presented in the next sections of this chapter are obtained with the OSEK/VDX-compliant implementation and with interrupts enabled, therefore ISRs can execute without having to wait for the currently running task to complete its NPR.

5.2 Simulation results

While exploring the possible implementations in the previous section, we used the simplest simulation scenario: an iteration over an array. In this section we present the simulation results for the CoreMark scenario, using a more complex task set.

With only two tasks, limited preemption can bring only limited improvements to the execution time of the lower priority task. Indeed, both in the limited preemption and in the fully preemptive scenario, the higher priority task $\tau_1$ (or the one with the shorter relative deadline under EDF) is never preempted. The slack time left unused by $\tau_1$ after each execution is $T_1 - C_1$. In both scenarios, task $\tau_2$ can run unpreempted for at most this amount of time (otherwise the following instance of the higher priority task would suffer a deadline miss).

Therefore, in the fully preemptive scenario task $\tau_2$ can be preempted at most

$$\left\lceil \frac{C_2}{T_1 - C_1} \right\rceil$$

times. The limited preemption policy can fully exploit this slack, therefore the minimum number of required preemption points is

$$\left\lfloor \frac{C_2}{T_1 - C_1} \right\rfloor$$


Obviously, the actual execution time $C_2$ is influenced by the number of preemptions and by the useful cache blocks evicted at each preemption, therefore the WCET of the second task in the two scenarios is not the same. Taking this into account, we can express the number of preemptions saved by using limited preemption, $LP_{sav}$, as:

$$LP_{sav} = \left\lceil \frac{C_2^{FP}}{T_1 - C_1} \right\rceil - \left\lfloor \frac{C_2^{LP}}{T_1 - C_1} \right\rfloor$$

If the WCETs in the two scenarios are comparable, $C_2^{FP} \approx C_2^{LP}$, then the number of preemptions saved by the limited preemption policy cannot be more than one: $LP_{sav} \le 1$. However, even when it saves only one preemption or none at all, the limited preemption model offers the designers the ability to control the points where a preemption will happen. If correctly exploited, this can bring relevant advantages even with only two tasks.
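As a purely illustrative instance (the numbers are hypothetical), take $T_1 = 10$ and $C_1 = 4$, so that the slack is $T_1 - C_1 = 6$, and assume $C_2^{FP} = C_2^{LP} = 20$: fully preemptive scheduling allows up to $\lceil 20/6 \rceil = 4$ preemptions of $\tau_2$, while limited preemption needs only $\lfloor 20/6 \rfloor = 3$ preemption points, so $LP_{sav} = 1$.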

Real world applications require far more than two tasks. For example, a realistic automotive power-train control application is usually composed of around thirty tasks, with high rate tasks mixed with less frequent ones [GNL+03]. With more complex and realistic task sets, limited preemption considerably improves the execution time of the lower priority tasks compared to fully preemptive execution. In our simulation environment, we decided to use four tasks, as shown in Figure 5.3.

[Fig. 5.3: Task set used in the experiments: three periodic higher priority disturber tasks and the lower priority CoreMark task.]

The three higher priority tasks periodically evict a configurable number of data cache lines, while we measure the performance of the lower priority task. As mentioned earlier, this task performs an iteration of the CoreMark benchmark. This task set is clearly not schedulable with a completely non-preemptive scheduling policy: the execution time of the CoreMark task is far longer than the period of the disturber tasks, therefore whenever this task executed the higher priority tasks would certainly suffer deadline misses.


In our simulation environment, we assigned the CoreMark task a relative deadline large enough to make it schedulable with the fully preemptive policy.

To implement limited preemption, we inserted three preemption points in the code of CoreMark. An example schedule with limited preemption for this task set is presented in Figure 5.4.

[Fig. 5.4: Example schedule of the task set under limited preemption.]

As can be seen, the CoreMark task suffers far fewer preemptions, and its execution time is therefore reduced.

Unfortunately, given the Q bound imposed by the disturber tasks, we had to insert the preemption points of the CoreMark task in an inner loop of the code, where the number of UCBs is high. Despite this unfortunate placement of the preemption points, limited preemption scheduling considerably improved the performance of the task by reducing the number of preemptions it suffers. In Figure 5.5 we present the execution time of the CoreMark task with limited preemption and with full preemption, varying the number of cache lines evicted by the higher priority tasks.

The task scheduled with limited preemption performs better even when the higher priority tasks do not evict any cache lines, thanks to the lower overhead of each preemption. When the higher priority tasks evict more than 64 cache lines, the execution time of the measured task becomes long enough to include another activation of a disturber task. Under fully preemptive scheduling this increases the number of preemptions, therefore the execution time grows considerably. The limited preemption configuration still guarantees the same number of preemptions, therefore its execution time does not change much when it crosses the same threshold (at 128 evicted cache lines, in this case). Still, a slight increase in the execution time can be noticed even in the limited preemption scenario, because the number of timer interrupts handled during the execution of the CoreMark task increases. The same effect can be noticed later in the fully preemptive scenario, when the disturber tasks evict more than 164 cache lines.


[Plot: execution time (CPU cycles) vs. number of cache lines evicted, for fully preemptive and limited preemption scheduling.]

Fig. 5.5: CoreMark scenario comparing limited preemption with full preemption.

The limited preemption scheduling never exceeds this second execution-time threshold.

5.3 Simulations with bus contention

When repeating the simulations with other devices contending for the bus, limited preemption scheduling still performs better than the fully preemptive policy. In Figure 5.6 it can be seen that, under both scheduling policies, performance depends on the number of other devices contending for the bus. When adopting a TDMA arbiter, the time bandwidth assigned to each bus master is controlled by the system designers and does not depend on the number of other devices using the bus. In this scenario, again, limited preemption performs better than fully preemptive scheduling, as shown in Figure 5.7.

Obviously, the execution time of the task also depends on the bus usage of these devices. In Figure 5.8 we compare the performance of limited preemption and fully preemptive scheduling for different percentages of bus usage by the other DMA devices; in these simulations a round robin arbiter is used. Again, in every simulated scenario limited preemption performs much better than the fully preemptive policy. With a TDMA arbiter, instead, the execution time no longer depends on the bus usage of the contending DMA devices.


[Fig. 5.6: Execution time (CPU cycles) vs. number of DMA devices contending for the bus, for fully preemptive and limited preemption scheduling.]

[Fig. 5.7: Execution time (CPU cycles) vs. number of DMA devices with a TDMA bus arbiter, for fully preemptive and limited preemption scheduling.]


[Fig. 5.8: Execution time (CPU cycles) vs. percentage of bus usage of the DMA devices, with a round robin arbiter, for fully preemptive and limited preemption scheduling.]

Still, limited preemption outperforms fully preemptive scheduling, as shown in Figure 5.9.


[Fig. 5.9: Execution time (CPU cycles) vs. percentage of bus usage of the DMA devices, with a TDMA arbiter, for fully preemptive and limited preemption scheduling.]


Chapter 6

Conclusions

In this thesis we investigated the unpredictable behaviour of cache memories and the techniques used to deal with this uncertainty in hard real-time systems. We studied the influence of scheduling and preemptions on the cache performance of a task, in particular in relation to the estimation of its worst case execution time. Indeed, when a task is preempted because the scheduler decides to execute others, many of the cache contents of this task are evicted. When the task resumes execution, some memory accesses that would have been cache hits had the task not been preempted are actually misses (extrinsic cache misses). These additional delays, commonly referred to as cache related preemption delays (CRPD), cause longer execution times. Many previous research efforts have measured these effects, both in general purpose systems and in typical real-time hardware. After analysing and reporting in this thesis the main contributions of those previous works, we prepared a simulation environment where we could measure these effects ourselves. In particular, we used MPARM, a cycle accurate simulator of a typical real-time architecture. On this hardware simulator we ported ERIKA, a real RTOS (real-time operating system) compliant with the OSEK/VDX automotive standard. We prepared many benchmark task sets and observed the CRPD by executing these real programs in our simulation environment.

We investigated and reported in this thesis the current state-of-the-art techniques to deal with the unpredictable behaviour of cache memories in real-time systems. There are currently two kinds of approaches:

• using the cache in a limited and predictable way, for example by exploiting hardware features such as cache locking and scratchpad memories, or by modifying the memory layout of the software;

• using the cache without limitations, but statically analysing the code to predict its behaviour.


Using the techniques in the first group usually requires the system designers to manually tweak the software, and only heuristic guidelines exist to guide their choices. Static analysis, instead, is usually automatic, but having to consider the worst case whenever the state of the processor is uncertain may cause large overestimations of the execution time. This is particularly true for the CRPD, because the position and the number of the preemptions are unknown and the static analysis has to assume the worst possible case. In real world applications, techniques from both approaches can be mixed, and previous research has proposed some hybrid systems.

With non-preemptive scheduling we can avoid the CRPD altogether: when a task cannot be preempted, modern static analysis tools can predict its cache behaviour rather precisely. Unfortunately, many realistic task sets are not schedulable non-preemptively, because of the high blocking time that lower priority tasks impose on higher priority ones. In this thesis we proposed a technique, called limited preemption, that exploits the slack time of the higher priority tasks to schedule large regions of the tasks non-preemptively. With this approach, we can exploit the low overhead of non-preemptive scheduling while still maintaining feasible blocking times. We described how to compute the maximum allowed non-preemptive region length for each task in sequence. Preemption points then need to be inserted at appropriate points in the code of the task to respect this bound. When executed, the task can only be preempted at these points, therefore the number and the position of the preemptions can be exactly predicted (and actually controlled by the designers). Extrinsic cache misses can be precisely estimated and, by placing the preemption points in suitable positions, reduced. We considered two approaches for this purpose:

• the programmer inserts the preemption points manually, and a modified static analysis tool verifies that the bound on the maximum non-preemptive region length is respected;

• an automatic tool decides the best points in the code where to insert the preemption points.

The first approach is immediately implementable, while the automatic choice of the preemption points is not an easy task: an algorithm for this duty has to consider that, sometimes, a larger number of preemption points can exploit points in the code where the CRPD is low. In this thesis we considered only sequential instructions (or basic blocks of code) and proposed an optimal, linear complexity algorithm that chooses the best preemption points, minimizing the total WCET of the task while still respecting the bound on the non-preemptive region length.

Finally, we implemented the limited preemption policy for some benchmarks in the ERIKA RTOS. After considering different implementations, we


only used standard OSEK/VDX techniques, therefore the same technique can be used in any other compliant RTOS. We executed the tasks in our simulation environment and compared the performance of the tasks when scheduled with limited preemption and with a fully preemptive policy.

Future work on this topic will extend the algorithm used to automatically place the preemption points from sequential code to a full task control flow graph, with branches and loops. For this purpose, the existing sequential algorithm can be used as a building block to analyse the single paths of the CFG. Such a tool, integrated with static analysis, can provide an automatic environment for large scale industrial usage of limited preemption, and system designers would no longer have to consider the unpredictability of the cache in real-time applications.

Additional improvements to limited preemption scheduling can be obtained by extending the model to include higher priority tasks, including interrupt handling routines, that can still preempt the running task. This would allow interrupt handling to be considered in the analysis, and would exclude from the limited preemption policy some high frequency tasks that would otherwise constrain all the other tasks to have only short non-preemptive regions. The cache interference of these high priority, fully preemptive tasks can be handled with other techniques such as cache partitioning. This hybrid approach deserves further attention in future work.

Future research in this field will have to integrate the bounds on the cache behaviour that we can provide with limited preemption with bounds on the bus contention. Indeed, in this work we only measured the influence of bus contention on the execution time of the task and pointed to existing techniques to deal with it, such as adopting predictable bus arbiters or statically analysing the devices connected to the bus with a network calculus approach.

Finally, the computational resources saved with limited preemption allow for substantial energy savings. With a straightforward implementation we save energy by causing fewer cache misses, and therefore by using the bus less. Moreover, the processor idle time, in particular the time saved by using limited preemption, can be exploited to save energy with typical techniques such as frequency and voltage scaling. Having a well-known scheduling behaviour, limited preemption is particularly amenable to these techniques, and research efforts will be spent on integrating energy saving into this scheduling policy.


Bibliography

[ABW09] Sebastian Altmeyer, Claire Burguière, and Reinhard Wilhelm. Computing the maximum blocking time for scheduling with deferred preemption. In STFSSD '09: Proceedings of the 2009 Software Technologies for Future Dependable Distributed Systems, pages 200–204, Washington, DC, USA, 2009. IEEE Computer Society.

[AFMW96] Martin Alt, Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. In Science of Computer Programming, pages 52–66. Springer, 1996.

[AKBP+06] Iyad Al Khatib, Davide Bertozzi, Francesco Poletti, Luca Benini, Axel Jantsch, Mohamed Bechara, Hasan Khalifeh, Mazen Hajjar, Rustam Nabiev, and Sven Jonsson. MPSoC ECG biochip: a multiprocessor system-on-chip for real-time human heart monitoring and analysis. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 21–28, New York, NY, USA, 2006. ACM.

[AP06] Alexis Arnaud and Isabelle Puaut. Dynamic instruction cache locking in hard real-time systems. In Proc. of the 14th International Conference on Real-Time and Network Systems (RTNS), 2006.

[Bak02] Thomas G. Baker. Lessons learned integrating COTS into systems. In ICCBSS '02: Proceedings of the First International Conference on COTS-Based Software Systems, pages 21–30, London, UK, 2002. Springer-Verlag.

[Bar05] Sanjoy Baruah. The limited-preemption uniprocessor scheduling of sporadic task systems. In ECRTS '05: Proceedings of the 17th Euromicro Conference on Real-Time Systems, pages 137–144, Washington, DC, USA, 2005. IEEE Computer Society.


[BBB+05] Luca Benini, Davide Bertozzi, Alessandro Bogliolo, Francesco Menichelli, and Mauro Olivieri. MPARM: Exploring the multi-processor SoC design space with SystemC. Journal of VLSI Signal Processing Systems, 41(2):169–182, 2005.

[BCSM08] Bach Duy Bui, Marco Caccamo, Lui Sha, and Joseph Martinez. Impact of cache partitioning on multi-tasking real time embedded systems. In RTCSA, pages 101–110, 2008.

[BLV07a] Reinder J. Bril, Johan J. Lukkien, and Wim F. J. Verhaegh. Worst-case response time analysis of real-time tasks under fixed-priority scheduling with deferred preemption revisited. In ECRTS '07: Proceedings of the 19th Euromicro Conference on Real-Time Systems, pages 269–279, Washington, DC, USA, 2007. IEEE Computer Society.

[BLV07b] Reinder J. Bril, Johan J. Lukkien, and Wim F. J. Verhaegh. Worst-case response time analysis of real-time tasks under fixed-priority scheduling with deferred preemption revisited. In ECRTS '07: Proceedings of the 19th Euromicro Conference on Real-Time Systems, pages 269–279, Washington, DC, USA, 2007. IEEE Computer Society.

[BMSO+96] J. V. Busquets-Mataix, J. J. Serrano, R. Ors, P. Gil, and A. Wellings. Adding instruction cache effect to schedulability analysis of preemptive real-time systems. In RTAS '96: Proceedings of the 2nd IEEE Real-Time Technology and Applications Symposium, page 204, Washington, DC, USA, 1996. IEEE Computer Society.

[BMSW97] Jose V. Busquets-Mataix, Juan J. Serrano, and Andy Wellings. Hybrid instruction cache partitioning for preemptive real-time systems. In Proceedings of the Euromicro Workshop on Real-Time Systems, page 56, 1997.

[BN94] Swagato Basumallick and Kelvin Nilsen. Cache issues in real-time systems, 1994.

[BSL+02] Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan, and Peter Marwedel. Scratchpad memory: design alternative for cache on-chip memory in embedded systems. In CODES '02: Proceedings of the Tenth International Symposium on Hardware/Software Codesign, pages 73–78, New York, NY, USA, 2002. ACM.


[Bur95] Alan Burns. Preemptive priority-based scheduling: an appropriate engineering approach, pages 225–248. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.

[CAS03] Airborne Certification Authorities Software Team (CAST). Addressing cache in airborne systems and equipment. Position Paper CAST-20 Rev. 1, Airborne Certification Authorities, 2003.

[CPIM05] Antonio Martí Campoy, Isabelle Puaut, Ángel Perles Ivars, and José Vicente Busquets-Mataix. Cache contents selection for statically-locked instruction caches: An algorithm comparison. In Proceedings of the 17th Euromicro Conference on Real-Time Systems, pages 49–56, 2005.

[FK09] Heiko Falk and Jan C. Kleinsorge. Optimal static WCET-aware scratchpad allocation of program code. In DAC '09: Proceedings of the 46th Annual Design Automation Conference, pages 732–737, New York, NY, USA, 2009. ACM.

[FPT07] Heiko Falk, Sascha Plazar, and Henrik Theiling. Compile-time decided instruction cache locking using worst-case execution paths. In CODES+ISSS '07: Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis, pages 143–148, New York, NY, USA, 2007. ACM.

[GA07] Gernot Gebhard and Sebastian Altmeyer. Optimal task placement to improve cache performance. In EMSOFT '07: Proceedings of the 7th ACM & IEEE international conference on Embedded software, pages 259–268, New York, NY, USA, 2007. ACM.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.

[GNL+03] Paolo Gai, Marco Di Natale, Giuseppe Lipari, Alberto Ferrari, Claudio Gabellini, and Paolo Marceca. A comparison of MPCP and MSRP when sharing resources in the Janus multiple-processor on-a-chip platform. In RTAS '03: Proceedings of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium, page 189, Washington, DC, USA, 2003. IEEE Computer Society.

[GPAP97] Roberto Giorgi, Cosimo Antonio Prete, and Gianpaolo Prina. Cache memory design for embedded systems based on program locality analysis. In Proc. Int'l Conf. on Microelectronic Systems Education, 1997.

[HD93] Kenneth Hoyme and Kevin Driscoll. SafeBus. IEEE Aerospace and Electronic Systems Magazine, 8(3):34–39, March 1993.

[HRW08] J. Herter, J. Reineke, and R. Wilhelm. CAMA: Cache-aware memory allocation for WCET analysis. In Proceedings of the Work-in-Progress Session of the 20th Euromicro Conference on Real-Time Systems, 2008.

[HV95] Rodney R. Howell and Muralidhar K. Venkatrao. On non-preemptive scheduling of recurring tasks using inserted idle times. Inf. Comput., 117(1):50–62, 1995.

[Inc93] Aeronautical Radio Inc. ARINC 659 specification: backplane data bus, 1993.

[Jac55] J. R. Jackson. Scheduling a production line to minimize maximum tardiness. Research Report 43, Management Science Research Project, University of California, Los Angeles, 1955.

[Jac99] Bruce Jacob. Cache design for embedded real-time systems.In Proceedings of the Embedded Systems Conference, Summer,1999.

[JCR07] Lei Ju, Samarjit Chakraborty, and Abhik Roychoudhury. Accounting for cache-related preemption delay in dynamic priority schedulability analysis. In DATE '07: Proceedings of the conference on Design, automation and test in Europe, pages 1623–1628, San Jose, CA, USA, 2007. EDA Consortium.

[JS91] Kevin Jeffay and Donald F. Stanat. On non-preemptive scheduling of periodic and sporadic tasks. In Proceedings of the Twelfth Real-Time Systems Symposium, pages 129–139, 1991.

[Kir88] David B. Kirk. Process dependent static cache partitioning for real-time systems. In IEEE Real-Time Systems Symposium, pages 181–190, 1988.

[KSS91] D. B. Kirk, J. K. Strosnider, and J. E. Sasinowski. Allocating SMART cache segments for schedulability. In Proceedings of the Euromicro '91 Workshop on Real-Time Systems, pages 41–50, June 1991.

[LDS07] Chuanpeng Li, Chen Ding, and Kai Shen. Quantifying the cost of context switch. In ExpCS '07: Proceedings of the 2007 workshop on Experimental computer science, page 2, New York, NY, USA, 2007. ACM.

[LHM+96] Chang-Gun Lee, J. Hahn, Sang Lyul Min, R. Ha, Seongsoo Hong, Chang Yun Park, Minsuk Lee, and Chong Sang Kim. Analysis of cache-related preemption delay in fixed-priority preemptive scheduling. In Proceedings of the IEEE Real-Time Systems Symposium, page 264, 1996.

[Lim99] ARM Limited. AMBA specification (Rev 2.0), May 1999.

[Lip68] J. S. Liptay. Structural aspects of the System/360 Model 85, part II: The cache. IBM Systems Journal, 7(1):15–21, 1968.

[LJ09] Zhonghai Lu and Axel Jantsch. Trends of terascale computing chips in the next ten years. In Proceedings of IEEE ASICON 2009, October 2009.

[LL73] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM, 20(1):46–61, January 1973.

[LLH+01] Chang-Gun Lee, Kwangpo Lee, Joosun Hahn, Yang-Min Seo, Sang Lyul Min, Rhan Ha, Seongsoo Hong, Chang Yun Park, Minsuk Lee, and Chong Sang Kim. Bounding cache-related preemption delay for real-time systems. IEEE Transactions on Software Engineering, 27(9):805–826, 2001.

[LM95] Yau-Tsun Steven Li and Sharad Malik. Performance analysis of embedded software using implicit path enumeration. SIGPLAN Not., 30(11):88–98, 1995.

[MB91] Jeffrey C. Mogul and Anita Borg. The effect of context switches on cache performance. SIGPLAN Not., 26(4):75–84, 1991.

[MB09] Marko Bertogna, Giorgio Buttazzo, Mauro Marinoni, Gang Yao, Francesco Esposito, and Marco Caccamo. Cache-aware scheduling with limited preemptions, 2009.

[MIP] MIPS Technologies, Inc. MD00655 Rev 01.00 - Addressing Design Challenges in 32-bit Microcontrollers for Automotive and Industrial Applications.

[MYS07] Patrick Meumeu Yomsi and Yves Sorel. Extending rate monotonic analysis with exact cost of preemptions for hard real-time systems. In ECRTS '07: Proceedings of the 19th Euromicro Conference on Real-Time Systems, pages 280–290, Washington, DC, USA, 2007. IEEE Computer Society.

[NMR03] Hemendra Singh Negi, Tulika Mitra, and Abhik Roychoudhury. Accurate estimation of cache-related preemption delay. In CODES+ISSS '03: Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis, pages 201–206, New York, NY, USA, 2003. ACM.

[PB00] Peter P. Puschner and Alan Burns. Guest editorial: A review of worst-case execution-time analysis. Real-Time Systems, 18(2/3):115–128, 2000.

[PC08] Rodolfo Pellizzoni and Marco Caccamo. Hybrid hardware-software architecture for reconfigurable real-time systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 273–284, 2008.

[PC09] Rodolfo Pellizzoni and Marco Caccamo. Impact of peripheral-processor interference on WCET analysis of real-time embedded systems. IEEE Transactions on Computers, 2009.

[PHH88] Steven A. Przybylski, Mark Horowitz, and John L. Hennessy. Performance tradeoffs in cache design. In ISCA, pages 290–298, 1988.

[PP07] Isabelle Puaut and Christophe Pais. Scratchpad memories vs. locked caches in hard real-time systems: a quantitative comparison. In DATE '07: Proceedings of the conference on Design, automation and test in Europe, pages 1484–1489, San Jose, CA, USA, 2007. EDA Consortium.

[RDAH94] Robert D. Arnold, Frank Mueller, David B. Whalley, and Marion G. Harmon. Bounding worst-case instruction cache performance. In IEEE Real-Time Systems Symposium, pages 172–181, December 1994.

[RM06a] Harini Ramaprasad and Frank Mueller. Bounding preemption delay within data cache reference patterns for real-time tasks. In RTAS '06: Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, pages 71–80, Washington, DC, USA, 2006. IEEE Computer Society.

[RM06b] Harini Ramaprasad and Frank Mueller. Tightening the bounds on feasible preemption points. In RTSS '06: Proceedings of the 27th IEEE International Real-Time Systems Symposium, pages 212–224, Washington, DC, USA, 2006. IEEE Computer Society.

[Sch03] Sebastian Schönberg. Impact of PCI-bus load on applications in a PC architecture, 2003.

[Seb01] Filip Sebek. Cache memories in real-time systems. Technical Report 01/37, Mälardalen Real-Time Research Centre, Department of Computer Engineering, Mälardalen University, Sweden, October 2001.

[Seb02] Filip Sebek. The real cost of task pre-emptions — measuring real-time-related cache performance with a HW/SW hybrid technique (Paper B). Technical Report 02/58, Mälardalen Real-Time Research Centre, August 2002.

[SPH+07] Jean Souyris, Erwan Le Pavec, Guillaume Himbert, Guillaume Borios, Victor Jegu, and Reinhold Heckmann. Computing the worst case execution time of an avionics program by abstract interpretation. In Reinhard Wilhelm, editor, 5th Intl. Workshop on Worst-Case Execution Time (WCET) Analysis, Dagstuhl, Germany, 2007. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Germany.

[Spi01] Cary R. Spitzer, editor. The Avionics Handbook, chapter 32. CRC Press LLC, 2001.

[TL09] Tiantian Liu, Minming Li, and Chun Jason Xue. Instruction cache locking for WCET minimization on real-time embedded systems. Submitted as invited paper to the IEEE Industrial Electronics Society, 2009.

[VLX03] Xavier Vera, Björn Lisper, and Jingling Xue. Data caches in multitasking hard real-time systems. In RTSS '03: Proceedings of the 24th IEEE International Real-Time Systems Symposium, page 154, Washington, DC, USA, 2003. IEEE Computer Society.

[WA09] Jack Whitham and Neil Audsley. The scratchpad memory management unit for MicroBlaze: Implementation, testing, and case study. Technical Report YCS-2009-439, University of York, 2009.

[WEE+08] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut, Peter Puschner, Jan Staschulat, and Per Stenström. The worst-case execution-time problem — overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst., 7(3):1–53, 2008.

[Wol04] Wayne Wolf. The future of multiprocessor systems-on-chips. In DAC '04: Proceedings of the 41st Annual Design Automation Conference, pages 681–685, New York, NY, USA, 2004. ACM.

[WS99] Yun Wang and Manas Saksena. Scheduling fixed-priority tasks with preemption threshold. In RTCSA '99: Proceedings of the Sixth International Conference on Real-Time Computing Systems and Applications, page 328, Washington, DC, USA, 1999. IEEE Computer Society.

[YBB09] Gang Yao, Giorgio Buttazzo, and Marko Bertogna. Bounding the maximum length of non-preemptive regions under fixed priority scheduling. In RTCSA '09: Proceedings of the 2009 15th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pages 351–360, Washington, DC, USA, 2009. IEEE Computer Society.

Appendix A

C++ implementation for insertPP

The following C++ code is the actual implementation of the insertPP algorithm for sequential potential preemption points.

// Assumed context, declared elsewhere in the simulator sources:
//   task    - container of (q_i, C_i) pairs, one per non-preemptive chunk,
//             traversed through the iterator i
//   preqMin - deque of preqElem candidates, seeded before this loop runs
//   PP      - per-chunk table recording the selected preemption points
//   maxNonPreemChunkLength - the bound Q on non-preemptive region length
int t = 0;
long WCETnonpreemptive = 0;

for (i = task->begin(); i != task->end(); ++i, ++t) {
    // input data for each non-preemptive chunk
    long qi = i->first;
    long Ci = i->second;
    WCETnonpreemptive += qi;

    // remove from preqMin the first elements whose execution time
    // would violate the Q limit
    long nonpreemptiveExe = WCETnonpreemptive - preqMin.front().WCETnopreempt;
    while (!preqMin.empty() &&
           preqMin.front().Ci + nonpreemptiveExe > maxNonPreemChunkLength) {
        preqMin.pop_front();
        if (!preqMin.empty())
            nonpreemptiveExe = WCETnonpreemptive - preqMin.front().WCETnopreempt;
    }
    if (preqMin.empty()) return NULL; // not feasible

    // PP and WCET computation for each potential preemption
    // point (after this non-preemptive chunk)
    long PPi = preqMin.front().i;
    long WCETi = preqMin.front().WCETi + preqMin.front().Ci + nonpreemptiveExe;

    // update global PP function trace
    PP[t] = PPi;

    // compute preqMin for the next preemption point: drop candidates
    // from the back that are dominated by the new element
    while (!preqMin.empty() && preqMin.back().WCETi + preqMin.back().Ci +
           WCETnonpreemptive - preqMin.back().WCETnopreempt > WCETi + Ci) {
        preqMin.pop_back();
    }
    preqMin.push_back(preqElem(t, WCETi, WCETnonpreemptive, Ci));
}

// results: look them up recursively in PP[t], starting from the last element
for (t = PP[task->size() - 1]; t >= 0; t = PP[t]) {
    activatePreemptionPoint(t);
}
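The listing refers to several names that are declared elsewhere in the simulator sources. As a reading aid, the following is a minimal sketch of plausible supporting definitions; the field names are taken from the accesses made in the listing, but the exact types and layout are assumptions, not the thesis code.

#include <deque>
#include <vector>
#include <utility>

// Hypothetical record for one candidate preemption point; the four
// fields match the accesses made in the listing above.
struct preqElem {
    int  i;              // index of the candidate preemption point
    long WCETi;          // WCET up to point i, preemption overheads included
    long WCETnopreempt;  // cumulative non-preemptive execution up to point i
    long Ci;             // preemption cost charged if the task is preempted at i
    preqElem(int idx, long wcet, long wcetNp, long cost)
        : i(idx), WCETi(wcet), WCETnopreempt(wcetNp), Ci(cost) {}
};

// Hypothetical containers: one (q_i, C_i) pair per non-preemptive chunk,
// the monotone deque of candidates, and the preemption-point table.
typedef std::vector<std::pair<long, long> > Task;
std::deque<preqElem> preqMin;
std::vector<int> PP;
long maxNonPreemChunkLength; // the bound Q on any non-preemptive region

Note that each chunk is pushed into preqMin exactly once and removed at most once, from either the front or the back, so the deque manipulation costs amortized constant time per chunk and the whole selection pass runs in time linear in the number of potential preemption points.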
