

Page 1: Thread Coloring: A Scheduler Proposal from User to Hardware …hpc.ac.upc.edu/PDFs/dir23/file000255.pdf · 2009-07-10 · Thread Coloring: A Scheduler Proposal from User to Hardware

Thread Coloring: A Scheduler Proposal from User to Hardware Threads

Marisa Gil, Ruben Pinilla

Computer Architecture Department, Technical University of Catalonia c/Jordi Girona, 1-3, Campus Nord, 08034 Barcelona, Spain

e-mail: {marisa, rpinilla}@ac.upc.edu

Abstract

The appearance of simultaneous multithreading (SMT) processors, offering instruction-level parallelism and increasing hardware resource utilization, permits the execution of various instruction streams at the same time. A particular case is the Intel SMT implementation (Hyper-Threading Technology), where a processor can execute two instruction streams independently. Due to hardware resource sharing in a Hyper-Threading processor, both instruction streams can contend for the same execution resources depending on the utilization of the instruction execution units.

As other works in this area have demonstrated, we have observed that system throughput can be improved when the execution resources used by each thread are taken into account in their scheduling. Starting from the evaluation results, we propose a Thread Coloring philosophy that defines a way to manage information about resource utilization (hardware and software resources) for each thread. This policy permits the scheduler to select threads that will perform better, with less resource contention, besides including OS upper layers in the resource management. The proposed Thread Coloring scheme establishes the bases of processor and OS cooperation, sharing resource utilization information and providing a means of controlling thread assignment to logical processors that takes possible resource contention into account.

Keywords: SMT, Hyper-Threading, thread coloring, Linux kernel, multithreaded, scheduling.

1. INTRODUCTION

The appearance of simultaneous multithreading (SMT) processors, offering instruction-level parallelism and increasing hardware resource utilization, permits the execution of various instruction streams at the same time. A particular case is the Intel SMT implementation (Hyper-Threading Technology), where a processor can execute two instruction streams independently, sharing execution resources [1].

With this new processor architecture, operating systems have to be aware of the computational power that SMT processors can deliver. From the operating system point of view, each SMT hardware context is treated as a logical processor to which threads can be assigned for execution. The OS has to pay special attention to the way these logical processors are managed due to their inherent characteristics. Logical processors in an SMT chip make use of replicated resources as well as shared execution resources

* This work has been supported by the Ministry of Science and Technology of Spain and by the European Union (FEDER) under contract TIC2001-0995-C02-01 and the HiPEAC European Network of Excellence. All products or company names mentioned herein are the trademarks or registered trademarks of their respective owners.


to make progress on the executing OS threads. Hence, resource dependence can exist among all the threads being executed on the SMT processor, and it is a factor that may limit the amount of parallelism extracted from a given piece of code: parallel execution is possible only if hardware resources can be utilized at the same time by several operations. Hardware parallelism exists at several levels of a computer design and may be limited by different interactions; for example, we could have a mismatch between software parallelism and hardware parallelism.

One important issue to keep in mind is thread scheduling, an essential activity designed to maximize the use of the computing power of a hardware platform. There is a large body of research in the area of job (thread or process) scheduling, devising new ways of managing threads of execution to achieve better processor utilization, such as first-class user-level threads [2], scheduler activations [3] or process control [4].

In this paper we study the main characteristics of thread scheduling in a system with a Hyper-Threading processor running a Linux 2.6.7 kernel. The main goal of this work is to propose a scheduling policy design that manages information about the hardware resources needed by each thread of execution, and to establish a way to control thread assignment to logical processors considering possible resource contention.

In our approach, we focus on a thread coloring philosophy that takes into account which resources each thread needs when the hardware platform is based on SMT processors. The way threads are scheduled follows a coloring strategy that has been used in memory management to improve performance in cache-line and page placement.

The solution adopted in [5] is based on placing procedures in the cache in such a way that procedures that interact with each other frequently will not overlap in the cache address space. At the page level, page coloring controls the allocation of the physical page for a virtual address, ensuring that virtual pages of different colors do not contend for the same cache lines, thus reducing the miss rate. Similarly to page coloring, our idea is to mark threads with colors to avoid possible resource conflicts when they are running on an SMT processor system.

Another application of the coloring scheme is the one developed by Sutherland et al. [6], who show an interesting way of managing relationships among threads using a coloring algorithm. Their study offers a design model, using annotated code, able to manage relationships among threads, executable code, and shared state. This proposal is an example of a solution designed to provide a way to manage information about resource utilization by Java threads. We can treat code regions and shared state as resources shared among all Java threads; the way these resources are used by the Java threads helps to model, using colors, the relationships among threads, and enables a way to manage the resources so as to avoid possible incorrect concurrent interactions, as detected by its authors.

The rest of the document is structured as follows: section 2 includes work related to thread scheduling improvements on SMT architectures. Section 3 introduces the Hyper-Threading technology. Section 4 presents the main characteristics of the Linux kernel 2.6 and its Hyper-Threading aware scheduler. Section 5 shows some experimental results and section 6 describes our proposal for the thread coloring scheme to classify threads depending on the resources they use. Section 7 concludes the paper.


2. RELATED WORK

There is a considerable amount of work typifying different-weight processes in software and hardware. The most common case in software approaches is the classification between user and kernel threads, depending on the control management to be done and the hardware resources they need to perform their work. Some of them differentiate threads by the memory they need, for example, the local environment or the state between two activations, so that we can stack-off a thread when blocking it: this is the case of Mach continuations [7] or, also implemented in Mach, the more sophisticated scheduler activations [3].

Thread scheduling has two well-defined and totally independent decision levels: first, the OS level where the scheduler decides which threads run next and, second, the hardware level where the processor resources are assigned to each running thread.

In the hardware area, high performance systems have been built by integrating large numbers of processors to produce expensive, highly parallel supercomputer systems. An up-and-coming processor architecture that increases the possibilities of thread-based parallelism is the simultaneous multithreading processor, providing the execution of various instruction streams at the same time.

Some previous studies have evaluated SMT architectures, analyzing the contended use of shared resources among all active instruction streams as well as techniques to obtain better scheduling schemes to maximize processor throughput.

Parekh et al. [8] proposed, in a simulated environment, hardware cooperative scheduling by exposing hardware counters to the operating system to provide better information about hardware context resource occupation. Using the information provided by the hardware counters, the scheduling of threads can be more efficient and improve processor throughput.

Snavely and Tullsen [9] proposed a scheduling model, named symbiotic job scheduling, based on co-scheduling tasks after first using sampling to determine which jobs will run together with the best performance.

Cazorla et al. [10] studied a method to dynamically control resource allocation in SMT processors. The proposed policy monitors the resource utilization of each thread, providing a fair share of resources among all threads. In [11], a solution is proposed that applies the concept of network QoS to processor resource management, providing collaboration between the OS and the SMT processor that makes it possible to execute a thread in a workload at a given percentage of its full speed.

In the operating systems area, evaluations of operating systems running on top of SMT processors are trying to obtain better resource management taking into account the SMT architecture characteristics. The Linux operating system is mainly used to evaluate Intel Hyper-Threading systems, using different kernel versions where HT awareness varies.

Settle et al. [12] studied a technique to reduce cache access contention among different contexts in SMT processors, providing a mechanism to allow the operating system to know the cache access patterns of the currently executing threads. Knowing the cache access pattern of a thread makes it possible to schedule


threads that do not follow the same cache access pattern, obtaining a lower number of cache misses. They validated the simulation model in a modified Linux 2.6.0 kernel.

Bulpin and Pratt [13] corroborate the observations made by Tuck and Tullsen in [14] about the performance evaluation of the Pentium 4 Hyper-Threading processor. Bulpin and Pratt considered in their measurements the mutual fairness of simultaneously executing threads. They performed a multiprogramming performance evaluation of pairs of benchmarks running on a Hyper-Threading system compared to an SMP system. Tuck and Tullsen used Linux 2.4.18smp, while Bulpin and Pratt used Linux 2.4.19.

Nakajima et al. [15] centered their investigation on a Hyper-Threading process scheduling design based on Linux 2.4.18, where a user program assists the kernel scheduler by gathering information about processor resource utilization, enabling a way to predict how the processes and the system use the processor execution resources.

3. HYPER-THREADING TECHNOLOGY

Intel introduced Hyper-Threading (HT) technology [1], the Intel implementation of Simultaneous Multi-Threading (SMT), into the Pentium 4 processors [16]. Processors with this architecture enable the execution of various instruction streams, enhancing the utilization of processor execution resources. Focusing on Intel's HT, a physical processor (processor package) offers two logical processors that permit the execution of two different instruction streams (see Figure 1).


Figure 1. A processor package with two logical processors.

A Hyper-Threading enabled processor provides duplicated resources to both logical processors to allow each logical processor to manage its architecture state. The architecture state consists of registers, including the general-purpose registers, the control registers, the Advanced Programmable Interrupt Controller (APIC) registers, and some machine state registers. With duplicated resources to store the architecture state, each logical processor is assigned an instruction stream whose execution is independent of the other logical processor from the point of view of the operating system. There are some hardware resources (buffers) that are partitioned so that each logical processor can utilize at most half the entries. Some examples of partitioned buffers are the re-order buffer and the load/store buffers [1].

However, in a processor package there is a set of execution resources that are shared between the two logical processors. More precisely, the shared execution resources are the caches, execution units, branch predictors, control logic, and buses. Sharing the execution resources can produce situations where both instruction streams contend for the same resources (e.g., execution units or the same cache lines).


Focusing on the execution units, we include in Table 1 the details of the out-of-order execution core of a Pentium 4 Xeon, where ports and execution units are described.

Table 1. Execution units and ports in the out-of-order core (extracted from [17]).

Port    Execution Units         Instructions
Port 0  ALU 0 (double speed)    ADD/SUB, Logic, Store Data, Branches
        FP Move                 FP Move, FP Store Data, FXCH
Port 1  ALU 1 (double speed)    ADD/SUB
        Integer Operation       Shift/Rotate
        FP Execute              FP_ADD, FP_MUL, FP_DIV, FP_MISC,
                                MMX_SHFT, MMX_ALU, MMX_MISC
Port 2  Memory Load             All Loads, Prefetch
Port 3  Memory Store            Store Address

The execution core can execute up to 6 µops (Pentium 4 micro-operations) per cycle, where the µops are sent to execution through the four available execution unit ports. When two instruction streams are scheduled to execute, they can produce contention when both streams have µops that cannot be executed at the same time, due to port use restrictions. As an example, two FP_ADD µops cannot be executed in the same cycle, as there is only one port to the FP Execute execution unit (nor can two loads be executed in the same cycle). On the contrary, we can execute two integer adds using port 0 and port 1 to access ALU 0 and ALU 1 (in fact, ALU 0 and ALU 1 are double-speed execution units, so each one can execute 2 µops per cycle).

On the other hand, from the operating system point of view, a logical processor is treated as a traditional processor to which the scheduler can assign threads for execution. One differentiating factor between traditional superscalar processors and SMT processors is that there are hardware resources shared among all the hardware contexts (or logical processors). As the operating system scheduler decides which tasks each processor executes, the scheduler has to be aware of the tasks' instruction characteristics (that is, which execution units will be used more frequently) to reduce shared execution resource contention.

4. LINUX AND SMT

Linux introduced the possibility of SMT-conscious scheduling with patches in the development kernel branch 2.5, specifically by applying patch-2.5.31-BK-curr provided by Ingo Molnar [18]. This patch modifies the scheduler to be aware of SMT processors. Basically, it consists of sharing one runqueue per processor package (for a more detailed list of features, see [18]); that is, a runqueue is shared between the two logical processors (SMT logical processors are called siblings in Linux terminology).

In this section, we describe the main characteristics of Linux scheduling, detailing the different available scheduling policies as well as the main aspects of the kernel 2.6.7 scheduler, focusing on the optimizations for the Hyper-Threading architecture.

4.1. Linux scheduling policies

The scheduler is the part of the kernel that decides which task (or thread) will be scheduled next on each system processor. In the case of Linux, the scheduler defines three types of scheduling policies based on priorities. Specifically, the scheduler works with two kinds of priority: static priority and dynamic priority. The scheduling policy of each task determines the place where it will be inserted inside the task list with the same static priority and how the task will move through that list (the policy establishes the


order of the tasks with equal static priority). In the following table, we describe the different values that both types of priorities can take depending on the scheduling policy:

Table 2. Scheduling policies in Linux.

Policy        Static priority   Dynamic priority
SCHED_OTHER   0                 -20..19
SCHED_FIFO    1..99             Not applicable
SCHED_RR      1..99             Not applicable

The Linux scheduler is preemptive; that is, if a task with higher priority is ready to execute, the current task with lower priority will be preempted and placed in the waiting list. On the other hand, the use of real-time policies (SCHED_FIFO and SCHED_RR), as well as incrementing a task's priority under the SCHED_OTHER policy, is restricted to tasks with super-user privileges.

The default policy assigned to a task is SCHED_OTHER (or SCHED_NORMAL, in kernel 2.6), a time-sharing policy (with a quantum assigned per task). Tasks pertaining to this policy are only scheduled to execute when there are no real-time tasks ready to execute. SCHED_OTHER tasks have the value 0 as their static priority, and this is the only policy that uses dynamic priority, taking values between -20 and 19 (lower values mean higher priority).

SCHED_FIFO is a scheduling policy without a quantum assigned per task. When a SCHED_FIFO task is preempted by a higher-priority task, the preempted task is placed at the head of the list of equal priority. The preempted task will resume execution when no other higher-priority ready tasks exist. In the case that a SCHED_FIFO task changes from not runnable to runnable, it will be inserted at the end of the task list of equal priority (the same as when the current task executes sched_yield). A SCHED_FIFO task will run until:

- it gets blocked in an I/O operation, or
- it gets preempted by a higher-priority task, or
- it makes a call to sched_yield.

The SCHED_RR (Round-Robin) policy has the same properties as SCHED_FIFO, except that it limits the execution time by assigning a quantum. When the quantum of a SCHED_RR task expires, the task is placed at the bottom of the task list of equal priority. If a SCHED_RR task is preempted, it will resume execution with the remaining quantum when there is no higher-priority task to run.

4.2. Inside Linux 2.6 scheduler

One of the main responsibilities of the operating system scheduler is to assign tasks that are ready to execute to each system processor. In a multiprocessor system there exists the possibility that the task assignment is done in an imbalanced way, producing situations where some processors have assigned tasks waiting in the task list ready to run while other processors have no assigned tasks (in this case, the processor is executing the idle task).


One possible solution to this load imbalance among the processors is to use a single task queue holding all the tasks in the system. In this way, all the processors share the task queue, and each processor accesses it searching for a task to run. But this solution offers poor scalability and performance as the number of processors grows, due to the contention when all the processors access the task list in mutual exclusion. This design is implemented in Linux kernel 2.4, where there is a unique data structure (task_list) that manages all the system tasks.

The solution proposed in kernel 2.6 changes the criterion of having only one data structure in the system managing the ready tasks. In this kernel, each processor has a data structure (struct runqueue) that contains the tasks assigned to it for execution. This design offers better scalability and performance, since each processor accesses its runqueue directly without competing for exclusive access with the rest of the processors. On the other hand, this distribution of tasks among the processors increases the possibility of load imbalance among the system processors.

To solve the load imbalance problem, kernel 2.6 introduces the concept of scheduler domains, which represent the relationships among the processors that compose a multiprocessor system. Depending on the system architecture (SMP, Hyper-Threading, UMA, NUMA...), the scheduler domain will have a set of characteristics that establish the criteria guiding the load balancing process among the processors.

A scheduling domain is formed by one or more scheduling groups that are treated as a scheduling unit. In the next figure, we depict the data structures associated with a system with one Hyper-Threading enabled processor. Each logical processor has an assigned runqueue, composed of two priority arrays and references (pointers) to the current running task (curr) and the idle task (idle).

[Figure content omitted: for each sibling in the processor package, a runqueue holds the curr and idle pointers together with an active priority array and an expired priority array; each priority array is a set of task queues ordered from high to low priority.]

Figure 2. Runqueues disposition in a Hyper-Threading enabled processor.

In Figure 2 we include a graphical description of the priority array data structure, formed by an array of 140 elements (priority levels) where each element is a task queue. In a system configured as the one represented in the previous figure, the resulting scheduling domain is as follows:



Figure 3. Scheduler domain configuration for one package (extracted from [19]).

Each processor package has a scheduling domain composed of two scheduling groups, one per logical processor. At an upper level is the parent domain (called the physical domain) of the scheduling domains of each logical processor. The physical domain is formed by one scheduling group per package, with each scheduling group containing the logical processors of its package.

The implementation of the scheduling domain data structures in kernel 2.6.7 is such that each processor has a copy of each domain it belongs to. For instance, the data structures for the system represented in Figure 3 are:

Table 3. Scheduling domains data structures implementation.

static struct sched_group sched_group_cpus[2];
static struct sched_group sched_group_phys[2];
static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
static DEFINE_PER_CPU(struct sched_domain, phys_domains);

The initial value of a scheduling domain associated with a logical processor is defined by SD_SIBLING_INIT and, in the case of a physical processor (a processor without Hyper-Threading technology), the assigned value is SD_CPU_INIT.

Load balancing inside a scheduling domain is done among all the scheduling groups of that domain. When the workload of the processors that belong to a scheduling group is imbalanced, tasks are moved among the scheduling groups of the domain to balance the load among all the scheduling groups. On each processor, the procedure rebalance_tick is executed on every timer tick. That procedure examines each scheduling domain to detect load imbalance and, in case a scheduler domain is imbalanced, it executes load_balance. This procedure examines the load value inside each scheduling group of the domain. If the difference among the load values of the scheduling groups indicates imbalance, the scheduler tries to move tasks from the most loaded group to the least loaded group inside the same scheduling domain.


Introducing the scheduling domain abstraction inside the Linux scheduler permits the scheduler to control the load balancing among all the system processors. In addition to the scheduler domain technique, there is a group of optimizations that apply to Hyper-Threading processors, explained in [18].

Currently, the Linux kernel does not provide a way to track resource utilization per task, which can be a cause of execution resource contention on Hyper-Threading architectures. Studies such as [12] enable a way to add information on resource use to scheduling decisions, permitting an increase in system throughput. In this study, we focus on execution unit utilization to detect situations where applications may compete for execution resources.

5. EXPERIMENTS

We have performed a set of experiments to evaluate the decisions the scheduler has to consider when threads are assigned to each logical processor. We have developed a set of single-threaded applications based on the AIM Benchmark suite [20] to analyze the performance variation when two applications are executed at the same time. With the experiments, we want to identify scenarios where programs obtain better (and worse) performance due to resource utilization.

5.1. Methodology

Each application has been developed to make intensive use of particular execution units in order to study the performance variation (execution time) when two of these single-threaded programs are executed together. Basically, each arithmetic application (integer_app and float_app) consists of a loop that executes a sequence of arithmetic operations (mem_app implements a loop where memcpy is called). In Table 4, we describe the applications that we have used in our experiments:

Table 4. Applications used in experiments.

Application   Description
integer_app   Integer computations
float_app     Floating point computations
mem_app       Memory access operations to different memory addresses

In the experiments, we have used a system running Linux kernel 2.6.7 on top of an Intel Pentium 4 Xeon 2.8 GHz with a 512 Kbyte L2 cache and Hyper-Threading technology. All the experiments have been run in single user mode (that is, Linux runlevel S) with Hyper-Threading enabled. We have done a total of 9 tests and have measured the following parameters:

- Execution Time (ET): time to complete an application when running alone.
- Multiprogrammed Execution Time (MET): time to complete an application when running simultaneously with another application (we register both application times).

During the experiments, we have executed each test 10 times and then calculated the average value of each measurement. Next, we include the execution time for each application:


Table 5. Execution times with HT enabled.

Application   Execution Time (s)
integer_app   20.093764
float_app     19.578999
mem_app       18.479058

To be able to compare two tests, we calculate the ratio between the execution time (ET) and the multiprogrammed execution time (MET) for both applications in a test. The ratio ET/MET indicates, for an application, the fraction of its standalone performance it retains when executed together with another application, with respect to executing the same application alone. This ratio takes positive values up to 1, with values near 1 being better. Values near 0 indicate high execution resource contention, and values close to 1 indicate low contention for execution resources.

5.2. Evaluation

We have analyzed a total of 9 experiments in which two single-threaded applications were executed with Hyper-Threading enabled. The results are presented in Table 6, which lists the Multiprogrammed Execution Time (MET) and the ratio Execution Time / Multiprogrammed Execution Time (ET/MET). Each test executes one application in foreground (the application under observation) and the other in background. In Table 6, the first application listed in each test runs in foreground and the second in background.

Table 6. Experiment results with Hyper-Threading enabled. We present the Multiprogrammed Execution Time (MET) and the ratio Execution Time / Multiprogrammed Execution Time (ET/MET). In each test, the first application listed runs in foreground and the second in background.

Test  Applications         MET (s)     ET/MET
T1    integer_app (fg)     22.885367   0.8780
      integer_app          22.891471   0.8778
T2    integer_app (fg)     28.130907   0.7143
      float_app            26.032563   0.7521
T3    integer_app (fg)     20.571932   0.9767
      mem_app              25.549355   0.7233
T4    float_app (fg)       45.862525   0.4269
      float_app            45.855417   0.4269
T5    float_app (fg)       25.985763   0.7535
      integer_app          28.182266   0.7129
T6    float_app (fg)       26.167839   0.7482
      mem_app              36.595550   0.5049
T7    mem_app (fg)         29.304936   0.6306
      mem_app              29.379679   0.6289
T8    mem_app (fg)         25.561660   0.7229
      integer_app          20.570494   0.9768
T9    mem_app (fg)         36.439731   0.5071
      float_app            26.284316   0.7449

Reviewing the results of Table 6, we can see that running two instances of the same application (tests T4 and T7) behaves worse, as both threads use the same execution resources, except in the case of applications with integer operations (test T1), since there are two available ports (see Table 1) for integer instructions. Floating-point


operations obtain better results when they are executed together with other kinds of operations (for example, float_app performs better when it runs with mem_app or integer_app, tests T5 and T6). A detailed graphical view is depicted in Figure 4, where we have grouped the results listed in Table 6 by application and by test.


Figure 4. Graph representing the ET/MET ratio (y-axis) with Hyper-Threading enabled for each test (x-axis) from Table 6. ET stands for Execution Time and MET for Multiprogrammed Execution Time. The ratio ET/MET indicates the degree of improvement obtained when an application is executed at the same time as another, compared to executing it alone. We have grouped the tests by the foreground application (integer_app, float_app, and mem_app). In each group, the best (dark gray) and worst (light gray) results are highlighted. In each test, the first application listed runs in foreground and the second in background.

Examining the results from other works, we agree with the conclusions obtained in evaluations that execute multiprogrammed workloads to determine the degree of resource contention. As shown in [13][14][15], performance results may vary depending on the instruction streams that are executed on each logical processor, due to cache and execution resource conflicts. At this point, in the next section we introduce our proposal of thread coloring, where we explain the way resources have to be grouped (colored) to reduce conflicts.

6. THREAD COLORING

As we have seen in the evaluation, the way threads are scheduled by the operating system is important to obtain better processor performance. That is, the OS scheduler has to be aware that it is assigning threads to logical processors that are part of an SMT processor. Hence, making use of information about the utilization of execution resources by each running thread becomes necessary. A starting point for having an information flow from an SMT processor to the OS is that the processor performs resource utilization accounting. Works such as [8][9][10][11][12] provide designs that add hardware modifications to enable the processor to supply information about resource utilization to the OS.

We have to implement mechanisms to track resource utilization by each running thread not only at the hardware level and in the operating system, but also at upper abstraction levels such as user-level concurrency. In current systems, there are several abstraction levels that include management functionalities to increase system performance. Software architectures may vary, but in Figure 5 we depict the principal layers that are present in a computer system (with an SMT processor), where different abstraction levels can be defined.


Application
---------- System Calls ----------
Operating System
---------- Instruction Set Architecture ----------
SMT Processor

Figure 5. Principal layers of an SMT computer system.

Taking the application as a whole, we can find different thread profiles depending on their weight, on the job they have to perform, or on other requirements. For example, software parallelism, or program partitioning, can be exploited at various granularity levels: fine-grain parallelism, exploited by compilers at the basic unit of computation; medium-grain parallelism, bounded by function or procedure call/return, which often involves both the programmer and the compiler; and coarse-grain parallelism, based on independent programs and exploited by the runtime system (OS).

The thread model (i.e., the way user threads map to kernel threads) is not always 1-to-1 from the top to the bottom level, as depicted in Figure 6. In the thread execution model of a system, there can be various abstraction levels where thread mapping may differ. In the example described in Figure 6, we find a middleware level that separates the application and the operating system, offering an M-to-N thread model. From the application point of view, the middleware provides execution resources (for instance, user-level threads) to which the application can assign its tasks for execution. The application is unable to know how its tasks will run and whether the tasks will run with inadequate system resources. Indeed, there is no information from the middleware (nor from the Operating System or the SMT processor) to inform the application about possible performance problems due to resource contention.

[Figure 6 shows application tasks mapped onto user-level threads at the application level, user-level threads mapped M-to-N onto kernel-level threads by the middleware (across the system-call interface), and kernel-level threads mapped onto the hardware threads of the SMT processor (across the instruction set architecture).]

Figure 6. Example of a thread execution model on an SMT computer system.


Existing hardware and software proposals are based on characterizing different types of threads depending on the work they do (for instance, helper threads and transparent threads). That is an important issue to consider, because a scheduler that categorizes threads by their work can offer better global system performance.

At the hardware level, work can be done in parallel although the resource mapping is transparent to the OS, so applications can perform unpredictably depending on the resources they need [24]. Some approaches to take advantage of parallelism are transparent threads [25], helper threads [26] and mini-threads [27].

There are various proposals for offering better user-level concurrency to increase overall system performance [2][3][4], as well as works such as [21] that integrate a new abstraction layer to provide an execution environment independent of the execution platform. More recent works on software threads are the user-level thread package Capriccio, for Internet services [22], and the virtual multithreading approach for the Itanium 2 [23], which implements the idea of helper threads in software.

We can view the thread world as a heterogeneous virtual processor environment, where the thread model defines how kernel thread resources are used by the user threads. We can classify execution resources depending on the abstraction level and, in this way, classify the different levels of resource management that are present in a computer system. We propose a coloring philosophy to optimize resource allocation and improve processor performance in such a way that threads competing for a resource will not overlap in time. Our goal is to mark threads with different colors depending on the hardware or software resources they need to reach job completion. For this purpose, we have to characterize the amount of resources each thread needs in order to accomplish its work.

Because a program has different job levels, we can characterize different chunks of allocated resources and, in this way, obtain different thread personalities, by software or hardware (e.g., user threads need fewer resources than kernel threads). At each abstraction level, we have to introduce the elements needed to manage information about resource utilization. Furthermore, information about resource utilization at one abstraction level has to be available to the upper level.

We consider different levels of resource coloring, as different abstraction levels are present in a system. At the application level, we can proceed as in [6], annotating source code with information classifying regions of code by colors to differentiate resource needs. Another possibility is that the compiler, by means of profiling, generates information about resource utilization (e.g., the percentage of instructions per type: integer, floating-point, loads, stores, etc.). At the middleware level, information about processor hardware resource utilization (supplied via the Operating System layer) is important to make the user-level thread scheduler aware of possible resource contention when assigning user threads to kernel threads. Thus, user threads need to include color information to enable thread coloring scheduling at the middleware level. In the same way, at the Operating System level, kernel threads must include color information to enable thread color scheduling. At the SMT Processor level, new hardware elements are needed to perform resource utilization accounting. These new hardware counters are consulted by the OS, providing the base information needed by the OS thread coloring scheduler.

It is also important to note that a thread could change its color during its lifetime, so we should have the possibility of modifying this characterization. For example, a thread could have to wait for data and then perform some mathematical operations, at well-defined moments of its work. At this level, it seems that


the compiler has the possibility to mark the thread. An I/O operation, given the optimizations the OS can perform with buffers and mapped memory, may be the point where the thread has to change its personality.

7. CONCLUSIONS AND FUTURE WORK

In this paper we have described a base study that analyzes the problem of contended access to shared resources in a Hyper-Threading architecture. The results obtained in the experiments indicate that, depending on the characterization of the instruction stream of each thread, there are situations where the running threads make highly contended accesses to execution resources. In agreement with other works [13][14][15], performance results may vary depending on the instruction streams that are executed on each logical processor, due to cache and execution resource conflicts.

Our proposal of thread coloring establishes the main aspects to consider when dealing with contended resource access. We state that information about resource utilization has to be available at each abstraction level, beginning with hardware resource accounting. We propose a coloring philosophy to optimize resource allocation and improve processor performance in such a way that threads competing for a resource will not overlap in time. The objective is to mark threads with different colors depending on the hardware or software resources they need to reach job completion.

As future work, our research will continue studying possible mechanisms to allow cooperation among all the involved abstraction levels during the execution of an application. We need to make use of information about resource utilization to make better resource management feasible, reducing contended access to resources. By enabling a more efficient utilization of the resources needed by an application, overall application performance would improve.

We will also analyze how these results can be applied to develop a solution inside the Linux kernel 2.6.7 that performs better shared resource utilization. We will first establish the requirements of this new thread coloring scheme, and consider possible enhancements at the architecture level to offer better support for accounting resource utilization per available hardware context. In particular, we will study possible new architecture mechanisms that allow cooperation between the processor and the operating system, so that the operating system can be informed of shared execution resource utilization.

8. ACKNOWLEDGMENTS

We would like to thank Mateo Valero and the people of the GSOMK team at the Department of Computer Architecture for using their time to comment, discuss and argue about the topics of this paper.

9. REFERENCES

[1] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal, 6(1):4-16, February 2002.

[2] Brian D. Marsh, Michael L. Scott, Thomas J. LeBlanc, and Evangelos P. Markatos. First-Class User-Level Threads. Proceedings of the 13th ACM Symposium on Operating Systems Principles, 1991.

[3] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Transactions on Computer Systems, Vol. 10, No. 1, pp. 53-79, February 1992.

[4] A. Tucker and A. Gupta. Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors. Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 159-166, 1989.

[5] Hakan Aydin and David Kaeli. Using Cache Line Coloring to Perform Aggressive Procedure Inlining. ACM SIGARCH News, 28(1), March 2000.

[6] Dean F. Sutherland, Aaron Greenhouse, and William L. Scherlis. The Code of Many Colors: Relating Threads to Code and Shared State. PASTE'02, November 18-19, 2002.

[7] Richard P. Draves, Brian N. Bershad, Richard F. Rashid, and Randall W. Dean. Using Continuations to Implement Thread Management and Communication in Operating Systems. ACM Symposium on Operating Systems Principles, 1991.

[8] S. Parekh, S. Eggers, and H. Levy. Thread-Sensitive Scheduling for SMT Processors. Technical Report, University of Washington, Seattle, WA, May 2000.

[9] A. Snavely and D. M. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Processor. In Architectural Support for Programming Languages and Operating Systems, pages 234-244, 2000.

[10] Francisco J. Cazorla, Alex Ramirez, Mateo Valero, and Enrique Fernández. Dynamically Controlled Resource Allocation in SMT Processors. Proceedings of the 37th International Symposium on Microarchitecture (MICRO'04), pp. 171-182, December 2004.

[11] F. J. Cazorla, P. Knijnenburg, R. Sakellariou, E. Fernández, A. Ramírez, and M. Valero. QoS for High Performance SMT Processors in Embedded Systems. IEEE Micro, pp. 24-31, vol. 24, no. 4, July 2004.

[12] A. Settle, J. Kihm, A. Janiszewski, and D. Connors. Architectural Support for Enhanced SMT Job Scheduling. Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT'04), Antibes Juan-les-Pins, France, October 2004.

[13] James R. Bulpin and Ian A. Pratt. Multiprogramming Performance of the Pentium 4 with Hyper-Threading. In the Third Annual Workshop on Duplicating, Deconstruction and Debunking (at ISCA'04), pp. 53-62, June 2004.

[14] Nathan Tuck and Dean M. Tullsen. Initial Observations of the Simultaneous Multithreading Pentium 4 Processor. Proceedings of the 12th Intl. Conference on Parallel Architectures and Compilation Techniques, September 2003.

[15] J. Nakajima and V. Pallipadi. Enhancements for Hyper-Threading Technology in the Operating System: Seeking the Optimal Scheduling. Proceedings of the 2nd Workshop on Industrial Experiences with Systems Software, Boston, MA, USA, December 8, 2002.

[16] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Q1, 2001.

[17] Intel Corporation. IA-32 Intel® Architecture Optimization Reference Manual. Order number 248966-011, 2004. ftp://download.intel.com/design/Pentium4/manuals/24896611.pdf

[18] Kernel Trap. http://kerneltrap.org/node/view/391/972

[19] LWN.net. http://lwn.net/Articles/80911/

[20] AIM Benchmarks. http://www.caldera.com/developers/community/contrib/aim.html

[21] Ruben Pinilla and Marisa Gil. O%T: A Java Threads Model for Platform Independent Execution. ACM SIGOPS Operating Systems Review, 37(4):48-62, October 2003.

[22] Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, and Eric Brewer. Capriccio: Scalable Threads for Internet Services. 19th SOSP, October 19-22, 2003.

[23] Perry H. Wang, Jamison D. Collins, Hong Wang, Dongkeun Kim, Bill Greene, Kai-Ming Chan, Aamir B. Yunus, Terry Sych, Stephen F. Moore, and John P. Shen. Helper Threads via Virtual Multithreading on an Experimental Itanium 2 Processor-based Platform. ASPLOS XI, October 9-13, 2004.

[24] Vuk Marojevic and Marisa Gil. Evaluating Hyperthreading and VMware. Technical report UPC-DAC-2004-27, July 2004.

[25] Gautham K. Dorai and Donald Yeung. Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance. Proceedings of the 11th Intl. Conf. on Parallel Architectures and Compilation Techniques, 2002.

[26] Dongkeun Kim, Steve Shih-wei Liao, Perry H. Wang, Juan del Cuvillo, Xinmin Tian, Xiang Zou, Hong Wang, Donald Yeung, Milind Girkar, and John P. Shen. Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors. CGO 2004.

[27] Joshua Redstone, Susan Eggers, and Henry Levy. Mini-threads: Increasing TLP on Small-Scale SMT Processors. HPCA-9, February 8-12, 2003.

[28] Duc Vianney. Hyper-Threading Speeds Linux. IBM developerWorks, January 2003. http://www-106.ibm.com/developerworks/linux/library/l-htl/

[29] H. Wang, P. H. Wang, R. D. Weldon, S. M. Ettinger, H. Saito, M. Girkar, S. Shih-wei Liao, and J. P. Shen. Speculative Precomputation: Exploring the Use of Multithreading for Latency Tolerance. Intel Technology Journal, 6(1):22-35, February 2002.

[30] Yen-Kuang Chen, M. Holliman, E. Debes, S. Zheltov, A. Knyazev, S. Bratanov, R. Belenov, and I. Santos. Media Applications on Hyper-Threading Technology. Intel Technology Journal, 6(1):47-57, February 2002.

[31] Kent F. Milfeld, Chona S. Guiang, Avijit Purkayastha, and John R. Boisseau. Exploring the Effects of Hyper-Threading on Scientific Applications. 45th CUG Conference, May 12-16, 2003.
