
A Comparative Study of SMT and CMP Multiprocessors

Rajat A Dua
Bhushan Lokhande


1. Introduction

Advances in IC processing technology offer increasing integration density, allowing for more microprocessor design options. The increasing gate density and the cost of wires in advanced integrated circuit technologies call for new ways to use these capabilities effectively. Currently, processor designs dynamically extract parallelism from these transistors by executing many instructions within a single, sequential program in parallel. Examples are superscalar architectures, which involve out-of-order instruction execution and speculative execution of instructions using branch prediction with dynamic hardware prediction techniques. However, reliance on a single thread of control limits the parallelism available for many applications, and the cost of extracting parallelism from a single thread is becoming prohibitive. This cost manifests itself in numerous ways, including increased die area and longer design and verification times. In general, we see diminishing returns when trying to extract parallelism from a single thread.

Researchers have proposed two alternative microarchitectures that exploit multiple threads of control: simultaneous multithreading (SMT) and chip multiprocessors (CMP). Simultaneous multithreading is a technique that permits multiple independent threads to issue instructions to a superscalar’s functional units in a single cycle. This promotes much higher utilization of the processor’s execution resources and provides latency tolerance in case a thread stalls due to cache misses or data dependencies. When multiple threads are not available, however, the SMT simply looks like a conventional wide-issue processor. CMPs use relatively simple single-thread processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores. If an application cannot be effectively decomposed into threads, CMPs will be underutilized.

From an architectural point of view, the SMT processor’s flexibility makes it superior. However, the need to limit the effects of interconnect delays, which are becoming much slower than transistor gate delays, will be a driving force for billion-transistor chip design. Interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements, thereby favoring the CMP design. In this report we attempt to study the various issues related to these two technologies and try to evaluate the performance and tradeoffs of both.

The remainder of this report is organized as follows. Section 2 briefly outlines the issues involved in the three alternative approaches, namely superscalar, SMT, and CMP. Section 3 discusses a base SMT architecture and analyzes its performance. Section 4 discusses the schemes and design issues involved in SMT implementation, along with the working of a job scheduler for SMT; it also presents some implications SMT has for architects, compiler writers, and operating systems developers. Section 5 presents a thread-area analysis of the SMT processor. Section 6 discusses a commercial implementation of the SMT architecture: Intel’s Hyper-Threading Technology. Section 7 presents approaches to CMP architecture design considering the available hardware and software support for its implementation, and also discusses the implementation details of a well-known CMP architecture, the Stanford Hydra CMP. Section 8 discusses CMP organizations that maximize total chip performance (which is equivalent to throughput), considering factors like processor organization, cache hierarchy, off-chip bandwidth, and application characteristics. Finally, we conclude in Section 9 by discussing the problems researchers are trying to address in this field and suggesting areas and directions for further research in SMT and CMP architectures.


2. Alternative Architectures

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). To exploit ILP, most of the processors have special hardware that allows them to dynamically identify independent instructions that can be issued in the same cycle. Typically, this involves maintaining a pool of instructions in a large associative window, along with a register renaming mechanism that eliminates any false dependence between instructions. Multiprocessors exploit TLP by executing different threads in parallel.

Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. However, to exploit the available ILP, superscalar processors require centralized hardware structures that lengthen the critical path of the processor pipeline. Such structures include register renaming logic, instruction window wake-up and select mechanisms, and register bypass logic. Also, there is only a finite amount of ILP present in any particular sequence of instructions that the processor executes, because instructions from the same sequence are typically highly interdependent. As a result, processors that use this technique are seeing diminishing returns as they attempt to execute more instructions per clock cycle, even as the logic required to process multiple instructions per clock cycle grows quadratically.

Simultaneous multithreading (SMT) allows multiple threads to compete for and share the available processor resources every cycle. When executing parallel applications, SMT can use TLP and ILP interchangeably. When a program has only a single thread (i.e., it lacks TLP), all of the SMT processor’s resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. Thus SMT most effectively utilizes the available resources to achieve the goals of greater throughput and significant program speedups. To keep the processor’s execution units busy, SMT processors feature advanced branch prediction, register renaming, out-of-order instruction issue, and non-blocking caches, which means the processor also needs rename buffers, issue queues, and register files.

In the emerging billion-transistor CMOS implementation technology, the most crucial design issue is managing interconnect delays, since the delay of the longest wires on a chip is currently improving two to four times more slowly than gate delay. In addition, processor clock rates have been rising exponentially as circuit designers have increasingly optimized the critical paths. The inherent complexity of SMT architectures results in three major hardware design problems:

• Their area increases quadratically with the core’s complexity. The added complexity also increases the design time.


• They can require longer cycle times. Long, high-capacitance I/O wires span the large buffers, queues, and register files, and the extensive use of multiplexers and crossbars to interconnect these units adds more capacitance. Delays associated with these wires will probably dominate the delay along the CPU’s critical path. The complicated logic can instead be pipelined over more clock cycles to keep the clock period short, but making the pipeline deeper increases the branch misprediction penalty.

• The CPU cores are complicated and composed of many closely interconnected components. As a result, design and verification costs increase, since the cores must be designed and verified as single, large units.

In CMPs, TLP is obtained by running a completely separate sequence of instructions on each of the separate processors. As chip densities increase, single-chip multiprocessing becomes possible. Chip-multiprocessor (CMP) architectures are one promising approach in this direction and can better exploit the increasing transistor count on a chip. The CMP design has the following advantages:

• The CMP approach allows a fairly short cycle time to be targeted with relatively little design effort, since its hardware is naturally clustered: each of the small CPUs is already a very small, fast cluster of components. Since the operating system allocates a single software thread of control to each processor, the partitioning of work among the “clusters” is natural and requires no hardware to dynamically allocate instructions to different component clusters. This heavy reliance on software to direct instructions to clusters limits the amount of instruction-level parallelism that can be dynamically exploited by the entire CMP, but it allows the structures within each CPU to be small and fast.

• Since the CMP architecture uses a group of small, identical processors, the design and verification cost for a single CPU core is low, and this cost is amortized over a larger number of processor cores.

• The CMP architecture is much less sensitive to poor data layout and poor communication management, since its interprocessor communication latencies are lower and its bandwidths higher than in a conventional multichip multiprocessor.

However, given that the amount of thread- and instruction-level parallelism varies widely across applications, the traditional CMP approach of statically partitioning the chip resources between threads may lead to wasted resources when one of the threads stalls due to hazards or when the application lacks threads. In CMP architectures, however, speculation may be employed to execute applications that cannot be parallelized statically. Overall, the CMP can be augmented with just enough hardware and software support to do this, while still maintaining a reasonably generic CMP architecture. On the whole, SMT is a logical and relatively simple evolutionary improvement of modern processors, whereas chip multiprocessing (CMP) is a more radical approach in which several processor cores are placed on one die.


3. Simultaneous Multithreading

As explained in the previous section, simultaneous multithreading is a technique that combines the multiple-instruction-issue techniques employed in modern superscalar processors with the latency-hiding abilities of multithreaded architectures, in which the executing context is switched whenever a long-latency event occurs. It permits several independent threads to issue instructions to multiple functional units in the same clock cycle. In traditional multithreaded and superscalar processors, even though multiple instruction issue has the potential to increase performance, it is ultimately limited both by instruction dependencies and by long-latency operations within the single executing thread. As shown in Fig 3.1, there are two types of resource idling: horizontal waste and vertical waste.

Fig 3.1: Waste slots during execution

Horizontal waste is due to the lack of instruction-level parallelism among the instructions belonging to a single process: whenever there is a shortage of independent instructions, some issue slots go empty within the same issue cycle. Vertical waste, on the other hand, occurs whenever the processor waits on a long-latency event such as a remote memory fetch: no new instructions are issued for a number of cycles, which is effectively a processor stall. Traditional multithreading basically overcomes the deleterious effects of vertical waste, but it is unable to overcome horizontal waste, because it remains limited by the amount of parallelism available among the instructions of a single thread in a particular clock cycle.


The basic principle of simultaneous multithreading is that even within a single clock cycle the instructions that are issued need not come from the same thread; they are a mix from the different threads. So even if, in a particular clock cycle, one thread exposes very little instruction-level parallelism, instructions from other threads fill the remaining slots and the horizontal waste is greatly reduced. This reduces the dependence of performance on the inherent instruction-level parallelism of any one thread. At the same time, since instructions are fetched from different threads, thread-level parallelism is also exploited by executing multiple threads at the same time.
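To make the two kinds of waste concrete, the following is a minimal sketch, not taken from the report, that counts empty issue slots on a hypothetical 8-wide machine; the per-cycle "ready instruction" counts are invented. Filling slots from a single thread suffers both horizontal waste (too few independent instructions) and vertical waste (stall cycles), while drawing from several threads fills many of those slots.

ISSUE_WIDTH = 8

# ready[t][c] = independent instructions thread t could issue in cycle c
# (0 models a stall, e.g. a cache miss, i.e. vertical waste for that thread)
ready = [
    [3, 0, 2, 8, 1],   # thread 0
    [2, 4, 0, 1, 3],   # thread 1
    [5, 1, 3, 0, 2],   # thread 2
]

def wasted_slots(threads):
    """Issue slots left empty over all cycles when only `threads` may issue."""
    waste = 0
    for cycle in range(len(ready[0])):
        available = sum(ready[t][cycle] for t in threads)
        waste += max(0, ISSUE_WIDTH - available)
    return waste

print("single-thread waste:", wasted_slots([0]))           # horizontal + vertical waste
print("SMT (3 threads) waste:", wasted_slots([0, 1, 2]))   # slots shared across threads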

Here we explain an architecture given by Hirata et al [1], which has served as the base architecture for various SMT designs. As shown in Fig 3.2, the processor consists of several instruction queue unit and decode unit pairs called thread slots. The instruction fetch unit and the functional units are shared among the thread slots. The instruction fetch unit has a buffer sized so that, if instruction fetching takes C cycles and there are S threads, SxC instructions are fetched for a thread at one time in C cycles. This is done for each thread in turn, so that effectively each thread receives SxC instructions every SxC cycles, leaving an instruction present in the buffer for every thread in every cycle. The decode unit then decodes the instructions, and each instruction is checked for dependencies using a scoreboarding technique. An instruction is issued only if it has no outstanding dependency; otherwise its issue is interlocked.

The issued instructions are dynamically scheduled by the instruction schedule units. Use of the functional units is arbitrated by the schedule unit: if a functional unit is not claimed by another instruction of higher priority according to the arbitration logic, the instruction is immediately sent to that functional unit for execution. The arbitration logic in this architecture implements a rotating-priority mechanism. If an instruction cannot be sent to a functional unit immediately, it is stored in a standby station of depth one until it can be executed.


Fig 3.2: SMT Architecture

The register file is common to all the threads, but it is divided into banks, each of which is used as a full register set private to one thread. This ensures that each thread accesses only the registers bound to it and no others. Queue registers are used for communication between the threads at the register-transfer level.

The status of each thread, including its registers, is collectively called a context frame. The processor always has more context frames than active threads, so whenever required the context is rapidly switched by changing the logical binding between a thread and a context frame, without reference to memory. This results in very little context-switch overhead. The context frame also contains an access requirement buffer to handle loads and stores; outstanding memory accesses are saved as part of the context frame when a context switch is carried out, which lets the processor hide memory access latencies by switching contexts. Thus this architecture allows instructions from different threads to be used in the same clock cycle, reducing the effect of insufficient ILP in any one thread, and also allows switching between contexts.

Tullsen et al [14] have proposed several models of simultaneous multithreading spanning a range of hardware complexities. Their models assume 10 functional units and the ability to issue 8 instructions per cycle. The models include:

• Full Simultaneous Issue: All eight active threads can compete for each of the issue slots. In the extreme, all eight instructions issued in a cycle may come from a single thread.


• Single Issue, Dual Issue, Four Issue: Here each thread can issue at most one, two, or four instructions per cycle, so that eight, four, or two threads, respectively, are needed to fill the eight issue slots.

• Limited Connection: Here each context is directly connected to only one instance of each type of functional unit. There is thus less flexibility, in that a particular thread can compete for only one instance of a given functional unit.

Fig 3.3: Performance of various models

The hardware complexity of the full simultaneous issue model is very high, resulting in very high cost even though its performance is the best. The architecture described by Hirata et al is of the single-issue type. The performance of the various models on a workload drawn from the SPEC92 benchmark suite [21] is shown in Fig 3.3.

As can be seen, the performance of the four-issue model is almost the same as that of the full simultaneous issue model, and for 8 threads it is the same. Even for the two-issue model the performance is 94% of the full simultaneous issue model at 8 threads, while the hardware costs of these models are much lower. Thus the preferred choice for SMT processors is usually either the 4-issue or the 2-issue model. It has also been pointed out by Tullsen et al [21] that the performance curve flattens even before we reach 8 threads, so more than 8 threads are rarely employed.

One of the problems with SMT is that a separate register bank is needed for each supported thread. This, along with the additional registers needed for register renaming, results in a very large register file, which makes it slow and drives the cycle time up. To avoid this, the pipeline is allowed to take two cycles for register access. The two-cycle register access increases the branch misprediction penalty, and the extra cycle for write-back requires an extra level of bypass logic. Also, if a prediction mechanism is employed, the additional cycles cause mispredicted instructions to remain in the pipeline longer, resulting in additional delays.


4. Schemes and Design issues for SMT Architecture

Tullsen et al [4] observed that in a base architecture of the type proposed by Hirata et al [1], the instruction bandwidth is not the bottleneck. Instead, the bottleneck lies in the following three cases:

• IQ size: As many as 12-21% of all cycles were observed to be accounted for by the instruction queue (IQ) being full.

• Fetch throughput: This covers the conditions under which the fetched instructions turn out not to be useful instructions.

• Lack of parallelism: This may be because the base architecture of Hirata et al has only three threads; a larger number of threads may offset this problem.

An increase in fetch throughput can be obtained by partitioning the fetch unit so that it fetches instructions from multiple threads, and by making the fetch unit selective about which threads it fetches instructions from. The base architecture employed in [4] is assumed to fetch 8 instructions from a single thread per cycle. The following are the schemes suggested in [4].

4.1 Partition of the fetch unit


These schemes basically attempt to reduce the bottleneck that is caused due to the lack of enough instructions that can be fetched from a single thread by fetching from multiple threads. Some of the possible schemes are

i. RR.1.8 – This is simply the base scheme, in which one thread fetches up to eight instructions each cycle, with threads selected round-robin.

ii. RR.2.4, RR.4.2 – In these schemes multiple cache addresses are driven to each cache data bank, and a multiplexer selects the cache index every cycle; this index is then used by the address decoder. Thus RR.2.4 produces two cache outputs, each four instructions wide, while RR.4.2 produces four cache outputs of two instructions each. Both schemes, however, require additional hardware to implement.
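The following is a small, hedged sketch of the partitioned round-robin fetch idea (written above as RR.threads.width): each cycle the fetch unit takes width instructions from each of nthreads threads, rotating which threads get priority. The thread supply and instruction names are placeholders, not the hardware of [4].

def rr_fetch(total_threads, nthreads, width, cycles):
    """Yield the (thread, instruction) pairs fetched in each cycle."""
    start = 0
    for cycle in range(cycles):
        fetched = []
        for i in range(nthreads):
            t = (start + i) % total_threads
            # assume thread t always has `width` sequential instructions ready
            fetched.extend((t, f"i{cycle}_{k}") for k in range(width))
        start = (start + nthreads) % total_threads   # rotate the round-robin pointer
        yield cycle, fetched

# RR.2.4: two threads per cycle, four instructions from each
for cycle, insts in rr_fetch(total_threads=8, nthreads=2, width=4, cycles=3):
    print(cycle, insts)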

4.2 Selective Fetching

The aim of this scheme is to fetch instructions from the threads that will yield the best performance. The factors to consider when choosing between threads are the probability that a thread is following a wrong path because of an earlier branch misprediction, and the length of time its fetched instructions will sit in the queue waiting for dependencies to be resolved or data to be fetched. A thread with a lower wrong-path probability and a shorter expected time in the queue is the better one to fetch from. A thread whose instructions are likely to get blocked can clog the queue; it is best to delay fetching from such a thread until its dependencies have been resolved. Some of the fetch policies suggested in [4] are:

i. BRCOUNT – The highest fetch priority is given to the threads with the lowest probability of being on a wrong path. This is done by counting the number of unresolved branch instructions for each thread; the thread with the minimum count gets the highest priority.

ii. MISSCOUNT – The highest priority is given to the threads with the fewest outstanding cache misses. This policy thus accounts for a thread supplying instructions that may clog the queue.

iii. ICOUNT – This policy gives the highest priority to the thread with the fewest instructions in the pipeline stages before the functional units. It ensures that instructions are distributed across the threads and that priority goes to threads whose instructions are not clogging the queue.

iv. IQPOSN – This policy gives the lowest priority to the threads whose instructions are closest to the head of the queue, because threads that are prone to clogging will have instructions closest to the head of the queue and so will not be selected.

The best policy is usually ICOUNT, as it provides a good mix of instructions while fetching from the best threads. However, if cache misses are the major cause of queue clogging, another policy like MISSCOUNT may be better. The policy to employ thus also depends on the conditions and the workload.
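As an illustration of the ICOUNT idea, the short sketch below picks the threads with the fewest instructions in the pre-execute pipeline stages as the ones to fetch from next; the per-thread counts are hypothetical, and real hardware would keep them in dedicated per-thread counters rather than a dictionary.

def pick_threads_icount(front_end_counts, n=2):
    """Return the ids of the n threads with the fewest instructions in the
    decode/rename/queue stages (the ICOUNT heuristic)."""
    return sorted(front_end_counts, key=front_end_counts.get)[:n]

# hypothetical per-thread counts of instructions before the functional units
counts = {0: 14, 1: 3, 2: 9, 3: 5}
print(pick_threads_icount(counts))   # -> [1, 3]: fetch from these two threads this cycle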

Some other enhancements can also be made. One is to increase the instruction queue size; however, this increases the time it takes to search the queue. The fix is to enlarge the queue but carry out the issue search over only the first few entries, keeping the search time low. This scheme is called the BIGQ scheme. Another enhancement is to do the tag lookups in the instruction cache one cycle early and begin the cache fetch one cycle earlier, in order to work around cache misses. However, since the address must then also be calculated earlier, this adds a stage to the pipeline and increases the misprediction and misfetch penalties. This scheme is called the ITAG scheme.

Just as it is important to choose the best threads, it is also important to choose the correct mix of instructions. In an SMT processor it often happens that more instructions are ready to issue than can actually be issued. When choosing among them, the instructions with the lowest probability of being on a wrongly predicted path are preferred. This is done with schemes like OPT_LAST and SPEC_LAST, which issue optimistic and speculative instructions last. Optimistic instructions are load-dependent instructions that are issued a cycle in advance of the cache-hit data. Another approach is BRANCH_FIRST, which executes branches as early as possible in order to identify mispredicted branches quickly.
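A toy sketch of this issue-selection ordering follows; the instruction records and their classification into kinds are invented for illustration, and a real selection network would of course be hardware, not a software sort.

PRIORITY = {"branch": 0, "normal": 1, "optimistic": 2, "speculative": 3}

def select_for_issue(ready, width=8):
    """Pick up to `width` ready instructions: BRANCH_FIRST, then OPT_LAST / SPEC_LAST."""
    return sorted(ready, key=lambda ins: PRIORITY[ins["kind"]])[:width]

ready = [
    {"pc": 0x40, "kind": "speculative"},
    {"pc": 0x44, "kind": "branch"},
    {"pc": 0x48, "kind": "optimistic"},
    {"pc": 0x4c, "kind": "normal"},
]
print([hex(i["pc"]) for i in select_for_issue(ready, width=2)])  # branch issued first, then normal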

4.3 Pre-Execution

This is a technique in which single threads are accelerated by spawning helper threads that perform pre-execution in order to generate data addresses on behalf of the main thread. Pre-execution can be carried out in different ways, such as prefetching data and/or instructions, predicting branch directions and pre-computing their outcomes, and pre-computing general execution results. In an SMT implementation, latency is tolerated by running several pre-execution streams of the same thread simultaneously.

Pre-execution can be controlled either in software or in hardware. In hardware-based pre-execution, events like cache misses are used to trigger pre-execution. The hardware-based approach thus incurs no instruction overhead, but we cannot incorporate logic that analyzes the program and exploits its characteristics to carry out prefetching more effectively. On the other hand, software pre-execution relies on inserting pre-execution instructions supported by the hardware. These instructions can be inserted either by the compiler or by the programmer. In both cases, since there is some knowledge of the characteristics of the program, pre-execution can be carried out more effectively. However, since this is done in software, there is the additional overhead of the pre-execution instructions themselves, so the software approach is slower.


Here we explain software-controlled pre-execution based on studies by Luk et al [6]. As explained above, software-controlled pre-execution allows the programmer or compiler to insert instructions that activate helper threads, which run code ahead of the main thread for the parts of the code where a cache miss is suspected. Wherever a pre-execution instruction is inserted, it requests the hardware to spawn a thread that will pre-execute the code sequence starting at the given PC. If no thread is available, the pre-execution instruction is simply ignored. If such a thread can be spawned, its initial register state is copied from the spawning thread, and it starts pre-executing the code from the given PC in pre-execution mode. The characteristics of pre-execution mode are that any exceptions generated are ignored and stores are not committed to memory. Pre-execution stops after a fixed number of instructions have been executed or at a particular PC location.

In order for the hardware to support pre-execution, [6] suggests that the hardware provide an instruction that spawns a pre-execution thread at a particular PC and stops it after a certain number of instructions have been executed. There should also be instructions to terminate a thread in pre-execution mode, both the same thread and other threads. Also, whenever a pre-execution thread is generated, the entire register state has to be duplicated as fast as possible. This can again be done either in hardware or in software. For hardware-based duplication, a mechanism called the mapping synchronization bus (MBS) has been proposed: it copies the register map, instead of the register values, across the threads. This leads to the problem that a physical register can be shared by more than one thread, so if one thread frees the register, it may become unattached even while it is still being used by other threads. A good solution is to associate a counter with every register: every time a new pre-execution thread that uses the register is spawned, the counter is incremented, and every time a thread frees the register, the counter is decremented. The register is actually freed only when the last thread releases it. The main advantage of software-based register duplication is that no additional hardware has to be added to the processor.
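The register-sharing problem and its counter-based fix can be pictured with the minimal sketch below; the class, method names, and register numbers are illustrative only, not the mechanism of [6].

class PhysRegPool:
    def __init__(self, nregs):
        self.refcount = [0] * nregs

    def share(self, reg_map):
        """Called when a thread starts using the physical registers named in a register map."""
        for preg in reg_map.values():
            self.refcount[preg] += 1

    def release(self, preg):
        """Called when any thread frees a physical register."""
        self.refcount[preg] -= 1
        if self.refcount[preg] == 0:
            print(f"physical register {preg} actually freed")

pool = PhysRegPool(8)
parent_map = {"r1": 3, "r2": 5}
pool.share(parent_map)    # the parent thread's own references
pool.share(parent_map)    # pre-execution thread spawned, register map copied
pool.release(3)           # parent frees p3: still held by the helper thread
pool.release(3)           # helper frees p3: now actually freed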

In [6], Luk et al also discuss problem-specific pre-execution for SMT. For multiple pointer chains, the technique prefetches pointer chains that will be visited soon while the current chain is being traversed, overlapping their latency with that of accessing the current one. For multiple procedure calls, threads can be used to pre-execute at the procedure level itself. Similarly, for multiple control-flow paths, all the possible paths can be pre-executed; when execution reaches that part of the program, the thread on the correct path is adopted and all the other threads are squashed.

4.4 Loop Parallelization Techniques

Puppin and Tullsen [5] have proposed techniques implemented by the compiler specifically for an SMT processor. These include general loop restructuring, like loop fusion, loop peeling, and invariant code motion. In loops with independent iterations, we have vector computations that can be carried out independently for every element. The compiler can use iteration interleaving, which assigns iterations to threads in an interleaved manner. It was observed that the completion time was reduced by as much as 60% using this technique. Similarly, for loops with loop-carried dependences, in which interleaving is not possible, [5] suggests using algorithms like cyclic reduction, implemented specifically for the SMT by limiting the level of recursion to 2.

In loops where part of the loop carries out some independent computation followed by accumulation of the values, [5] suggests that the compiler parallelize the computation part of the loop using the interleaving technique, while the accumulation is either delayed until later or, if the accumulation operation is commutative and associative, each thread carries out its own local accumulation and a global accumulation is performed after the loop. There are also loops with smaller parts which, if the compiler can fuse them together, increase the opportunities for parallelism.
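A hedged sketch of iteration interleaving with per-thread local accumulation follows; the thread count, the data, and the use of Python threads are purely illustrative (a compiler would generate this structure for the SMT's hardware threads), and the per-element work is a made-up example.

import threading

def interleaved_sum(data, nthreads=4):
    partial = [0] * nthreads

    def worker(t):
        s = 0
        for i in range(t, len(data), nthreads):   # interleaved iteration assignment
            s += data[i] * data[i]                # independent per-element work
        partial[t] = s                            # local accumulation only

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(partial)                           # global accumulation after the loop

print(interleaved_sum(list(range(100))))          # same result as the sequential loop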

4.5 Compiler Optimizations

Whenever a compiler carries out optimizations, it does so taking into account the characteristics of the underlying architecture. Lo, Eggers et al [9] have targeted three optimizations performed by traditional compilers that need to be handled quite differently for an SMT processor, namely inter-thread data sharing, software speculative execution, and loop tiling.

Conventional parallelization techniques, which target multiprocessors, carry out optimizations on the premise that the threads are physically distributed on different processors. In order to minimize cache coherence problems and inter-processor communication, these compilers allocate a disjoint set of data to each processor. In an SMT, however, since multiple threads execute on the same processor, sharing data between threads is in fact beneficial, as it results in local memory accesses and the absence of coherence overhead. This suggests that the compiler's loop distribution policy should aggregate data for multiple threads rather than segregate it.

Traditional compilers also carry out optimizations like speculative execution and loop tiling. Loop tiling is a process by which a loop is broken down into a nest of loops and the order of accesses within an array is changed, thus changing the cache behavior. An SMT processor, however, has few empty issue slots, and latencies are hidden better because of its characteristics. In such a case, the overhead incurred by speculative-execution instructions and loop tiling may in fact cause performance to go down. Thus it is necessary that the compiler be aware of this characteristic while carrying out these optimizations.

4.6 Job Scheduling for SMT

Snavely and Tullsen [7] have suggested a job-scheduling scheme targeting SMT. Whenever there are more jobs in the system than the hardware can execute simultaneously, job scheduling happens at two levels: the job scheduler builds a running set of jobs which are co-scheduled and compete for hardware every cycle, while jobs move in and out of the running set under OS control. A job-scheduling scheme is evaluated by its effectiveness at scheduling multiple jobs together so as to improve performance. This is termed symbiosis.

The job scheduler suggested in [7] is called SOS. It begins by running jobs in groups equal to the multithreading level; after a certain amount of time the running jobs are replaced by other jobs in order to ensure that all jobs get executed. The scheduling process consists of two phases. In the first, called the sample phase, the scheduler gathers dynamic execution information about the jobs being run by reading hardware performance counters. After measuring the performance of many schedule permutations, SOS chooses one according to an algorithm that predicts which schedule will be the best. This schedule is then run in the symbiosis phase. One consideration is that, as the jobs execute, their profiles change with time; a schedule that works well at an earlier point may not work so well later. Thus the sample phase has to be repeated periodically.

Symbiosis is a function of three attributes of the workload: diversity, balance, and low conflicts. Diversity refers to the requirement that the instructions being scheduled together be as diverse as possible, which results in fewer conflicts for the functional units. Balance refers to the requirement that jobs not be scheduled so that the system is overly busy for a while and then badly underutilized; instead, system utilization should be balanced over time. Low conflicts refers to scheduling jobs so that they conflict as little as possible for shared resources.

During the sample phase, data is obtained about the various jobs, and predictions must be made about the performance of each candidate schedule during the symbiosis phase. The job scheduler then selects the schedule that is predicted to give the best performance. The prediction schemes suggested in [7] are:

i. IPC: A schedule with a high IPC during sampling is predicted to be highly symbiotic. This scheme does not always work, because high-IPC threads may monopolize system resources, degrading the execution of low-IPC threads during the sample phase.

ii. AllConf: This scheme predicts a schedule as highly symbiotic if it has a lower sum of conflicts for the various system resources, measured as the sum of cycles lost to conflicts. This scheme is, however, not always useful, because a high conflict count may simply reflect higher system utilization, which is exactly what we want to achieve.

iii. Dcache: A high data-cache hit rate is taken as an indication of symbiosis in this scheme. In practice this scheme fails, and so it is not used.

iv. Diversity: This scheme predicts a schedule with a diverse mix of instructions in all of its time slices to be highly symbiotic. In practice it is not found to be effective.

v. Balance: A schedule with little variation in IPC between time slices is predicted to be symbiotic. This scheme works quite well, because a good schedule keeps the system balanced, and this is exactly what the metric captures.

vi. Composite: A composite predictor has also been suggested which takes several factors into account. As suggested in [7], it selects the schedule with the highest score under a formula combining criteria for smoothness and low conflicts. This composite predictor has been observed to perform very well.

vii. Score: [7] also suggests a scheme in which the schedule voted best by a majority of predictors is chosen. However, this scheme has to run several predictors and is thus highly complex.

Another consideration for symbiotic job scheduling using SOS is the resampling time. Since job resource profiles change as execution proceeds, and jobs terminate and new jobs are created over time, it is important that the sample phases be repeated at some optimum interval. The sampling rate should not be too high, since we want to maximize the time spent in the symbiosis phase, but it should not be too low either, so that changes in the job execution profile and job mix are caught. An optimum system would adjust the sampling rate itself depending on the type of jobs being executed: if the job mix is such that its profile changes rapidly, the sampling rate has to be made high, and vice versa.
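To tie the phases together, the following is a simplified, purely illustrative sketch of the SOS flow: sample a few candidate co-schedules, score them with a simple predictor (here the balance idea of low IPC variance across time slices), and pick one to run in the symbiosis phase. The job names, the random stand-in for hardware counters, and the number of time slices are all invented.

import itertools, statistics, random

def sample_ipc(schedule):
    """Stand-in for reading hardware counters while co-running `schedule`."""
    random.seed(hash(schedule))
    return [random.uniform(1.0, 4.0) for _ in range(8)]   # IPC per time slice

def predict_balance(ipc_slices):
    return -statistics.pstdev(ipc_slices)      # higher score = less IPC variation

def sos(jobs, run_set_size=4):
    candidates = list(itertools.combinations(jobs, run_set_size))
    # sample phase: measure candidate co-schedules and score them
    scored = [(predict_balance(sample_ipc(c)), c) for c in candidates]
    best = max(scored)[1]
    return best                                 # run this set in the symbiosis phase

print(sos(("gcc", "mcf", "swim", "art", "gzip", "vortex")))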


5. Thread-Area analysis of the SMT Processor

Burns and Gaudiot [10] have carried out an analysis of the SMT processor at the transistor level and have come up with some interesting results. They observed that as the dispatch width of the processor (i.e., the width of fetch, decode, and rename, and the number of registers available for renaming) is increased linearly, its area increases quadratically. They analyzed how the area of the various components of an SMT processor increases with the number of threads; their observations are summarized in Table 5.1. The area is given in terms of the number of threads t, and all logarithms are to base 2.

From Table 5.1 we can observe that for the remapping tables, which are multiported RAMs, the area increases as t*log(t). This is because for each thread we not only add new registers but also store the physical-address bits, which account for the log term. The area of the remapping table also increases quadratically with the dispatch width. Thus the remapping tables are one of the key terms when considering the additional area an SMT requires compared to a conventional processor. Similarly, the INT and FP register files are multiported RAMs, so their area grows quadratically with the dispatch width; however, since the number of address bits is fixed, their area grows only linearly with the number of threads.


Functional Block                                               Area
Remapping tables (architectural to physical registers)        O(t*log(t))
INT and FP register files (more entries per thread)           O(t)
Fetch block (multiple PCs, ICOUNT fetch policy)                O(t)
Branch prediction block (separate return stack per thread)    O(t)
Per-thread instruction squash (out-of-order)                   O(t)
Per-thread instruction commit (out-of-order)                   O(t)
Instruction queue (out-of-order)                               O(log(t))
Dtag, Itag                                                      O(log(t))
Routing                                                         O(log(t))

Table 5.1: Area growth of the different blocks in an SMT processor with the number of threads t

The fetch block area increases linearly both with the dispatch width, since the fetch width grows linearly, and with the number of threads, due to the increase in the number of program counters and status registers. As the dispatch width increases, the accuracy of the branch prediction block must also increase, because the misprediction rate increases with the dispatch width; this block's area therefore increases linearly with the dispatch width. Moreover, with an increase in the number of threads we have to add a new return stack for each thread, which results in a linear increase in area with the number of threads.

Since every thread needs separate instruction-squash and commit logic for out-of-order (O-O-O) execution, the area of these blocks increases linearly, as O(t); with increasing dispatch width, the out-of-order logic grows quadratically. Similarly, with an increase in the number of threads we need a larger dependency tag, which grows as O(log(t)). However, blocks such as the functional units and the data cache, which are shared by all the threads, remain constant in area with the number of threads.
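The scaling terms in Table 5.1 can be played with numerically; the sketch below sums them under the assumption of unit coefficients per block, so the absolute values are meaningless and only the growth with the thread count t reflects the table.

import math

GROWTH = {
    "remap_tables":  lambda t: t * math.log2(t) if t > 1 else 1,   # t=1 treated as one unit
    "int_fp_regs":   lambda t: t,
    "fetch_block":   lambda t: t,
    "branch_pred":   lambda t: t,
    "squash_commit": lambda t: t,
    "iq_tags_route": lambda t: math.log2(t) if t > 1 else 1,
}

def relative_area(t, coeff=None):
    coeff = coeff or {k: 1.0 for k in GROWTH}       # placeholder unit coefficients
    return sum(coeff[k] * GROWTH[k](t) for k in GROWTH)

for t in (1, 2, 4, 8):
    print(t, "threads ->", round(relative_area(t), 1), "(arbitrary units)")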


6. A Commercial Implementation of SMT: Intel's Hyper-Threading Technology

Intel has incorporated an approach similar to SMT in the Xeon processor family. Here we summarize a few implementation details of an actually manufactured SMT processor. As explained by Marr et al in [13], Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, it means that instructions from both logical processors persist and execute simultaneously on shared execution resources.

As mentioned in [13], one goal of this technology was to minimize the die area cost of implementing Hyper-Threading Technology. A second goal was to ensure that when one logical processor is stalled, the other logical processor can continue to make forward progress. Independent forward progress is ensured by managing the buffering queues such that no logical processor can use all the entries when two active software threads are executing; this is accomplished by either partitioning the queues or limiting the number of active entries each thread can have. A third goal was to allow a processor running only one active software thread to run at the same speed on a processor with Hyper-Threading Technology as on a processor without this capability.

Fig 6.1: Processor with Hyper-Threading Technology

As shown in Fig 6.1, with two copies of the architectural state on each physical processor, the system appears to have four logical processors. This implementation of Hyper-Threading Technology adds less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that. Each logical processor maintains a complete set of the architecture state, which consists of the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors required to store the architecture state is an extremely small fraction of the total. The logical processors share nearly all other resources on the physical processor, such as the caches. Each logical processor has its own interrupt controller, or APIC; interrupts sent to a specific logical processor are handled only by that logical processor.

Most instructions in a program are fetched and executed from the Execution Trace Cache (TC). Two sets of next-instruction-pointers independently track the progress of the two software threads executing. The two logical processors arbitrate access to the TC every clock cycle. If both logical processors want access to the TC at the same time, access is granted to one then the other in alternating clock cycles. For example, if one cycle is used to fetch a line for one logical processor, the next cycle would be used to fetch a line for the other logical processor, provided that both logical processors requested access to the trace cache. If one logical processor is stalled or is unable to use the TC, the other logical processor can use the full bandwidth of the trace cache, every cycle.
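The alternating arbitration rule for the trace cache can be sketched as below; the request pattern is invented, and the single last-granted bit is just one simple way to model the strict alternation described in the text.

def tc_arbiter(requests):            # requests[cycle] = set of logical processors requesting the TC
    last = 1                         # logical processor granted most recently
    for cycle, req in enumerate(requests):
        if len(req) == 2:
            grant = 1 - last         # both request: alternate between LP0 and LP1
        elif len(req) == 1:
            grant = next(iter(req))  # lone requester gets the full bandwidth
        else:
            grant = None             # nobody requests this cycle
        if grant is not None:
            last = grant
        yield cycle, grant

pattern = [{0, 1}, {0, 1}, {0}, {0}, {0, 1}]
print(list(tc_arbiter(pattern)))     # -> alternating, then LP0 only, then alternating again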

When a complex instruction is encountered, the TC sends a microcode instruction pointer to the Microcode ROM; the Microcode ROM controller then fetches the operands needed and returns control to the TC. Two microcode instruction pointers are used to control the flows independently if both logical processors are executing complex microcode-based instructions. Both logical processors share the Microcode ROM entries, and access to the Microcode ROM alternates between logical processors just as in the TC.

The Instruction Translation Lookaside Buffers (ITLB) are duplicated. Each logical processor has its own ITLB and its own set of instruction pointers to track the progress of instruction fetch for the two logical processors. The branch prediction structures are either duplicated or shared. The return stack buffer, which predicts the target of return instructions, is duplicated.

The out-of-order execution engine has several buffers to perform its re-ordering, tracing, and sequencing operations. Some of these key buffers are partitioned such that each logical processor can use at most half the entries. Specifically, each logical processor can use up to a maximum of 63 re-order buffer entries, 24 load buffers, and 12 store buffer entries. If there are operands for both logical processors in the operand queue, the resource allocator will alternate selecting operands from the logical processors every clock cycle to assign resources. If a logical processor has used its limit of a needed resource, such as store buffer entries, the resource allocator will signal “stall” for that logical processor and continue to assign resources for the other logical processor. In addition, if the operand queue only contains operands for one logical processor, the resource allocator will try to assign resources for that logical processor every cycle to optimize allocation bandwidth, though the resource limits would still be enforced.
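A small sketch of the partitioned-allocation rule follows; the per-logical-processor limits are the figures quoted above, while the allocation requests and the class structure are invented for illustration.

LIMITS = {"rob": 63, "load_buf": 24, "store_buf": 12}   # entries per logical processor

class Allocator:
    def __init__(self):
        self.used = {lp: {r: 0 for r in LIMITS} for lp in (0, 1)}

    def allocate(self, lp, resource):
        if self.used[lp][resource] >= LIMITS[resource]:
            return "stall"                      # this LP stalls, the other keeps allocating
        self.used[lp][resource] += 1
        return "ok"

alloc = Allocator()
for _ in range(12):
    alloc.allocate(0, "store_buf")
print(alloc.allocate(0, "store_buf"))   # -> 'stall' (LP0 has used all 12 store-buffer entries)
print(alloc.allocate(1, "store_buf"))   # -> 'ok'    (LP1 is unaffected)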

The register rename logic uses a Register Alias Table (RAT) to track the latest version of each architectural register to tell the next instruction(s) where to get its input operands. There are two RATs, one for each logical processor. The register renaming process is done in parallel. Initial benchmark tests show up to a 65% performance increase on high-end server applications when compared to the previous-generation Pentium® III Xeon™ processor on 4-way server platforms. A significant portion of those gains can be attributed to Hyper-Threading Technology.


7. Single-Chip Multiprocessor (CMP)

As discussed in Section 2, the CMP architecture will be the easiest to implement in the emerging billion-transistor CMOS technology while still offering excellent performance. CMPs use relatively simple single-threaded processor cores to exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores. Because a CMP is partitioned into individual processing cores, it limits the effect of interconnect delays, which are becoming much slower than transistor gate delays. The design simplicity allows a faster clock in each processing unit and also eases the time-consuming design-validation phase. Finally, the CMP approach results in better utilization of the silicon area: by avoiding the extra logic devoted to a centralized architecture, a CMP allows the chip to have a higher overall issue bandwidth than a conventional superscalar implemented in the same die area.

7.1 Basic Design

The basic CMP architecture consists of multiple narrow-issue cores. The architecture shown in Fig 7.1 has eight small 2-issue superscalar processors. The cores are completely independent and are tightly integrated with their individual pairs of caches. This clustered design leads to a simple, high-frequency design for the primary cache system, and the small cache size and tight connection to these caches allow single-cycle access.

Fig 7.1: Basic CMP architecture

The rest of the memory system remains essentially unchanged, except that the secondary cache controller adds two extra cycles of secondary-cache latency to handle requests from multiple processors. To make a shared-memory multiprocessor, the data caches could be made write-through, or a MESI (modified, exclusive, shared, invalid) cache coherence protocol could be established between the primary data caches. In this way, designers can implement a small-scale multiprocessor with very low interprocessor communication latency. To provide enough off-chip memory bandwidth for these high-performance processors, main memory could be designed as multiple banks of Rambus DRAMs (RDRAMs), attached via multiple Rambus channels to each processor.

7.2 CMP Architecture with Speculative Multithreading

From a software perspective, the CMP is an ideal platform for a multiprogrammed workload or a multithreaded application. However, if the CMP is to be fully accepted, it must also give good performance when running sequential applications. Since parallelizing compilers are successful only for a restricted class of applications, typically numerical ones, the CMP approach would otherwise be unable to handle a large class of general-purpose sequential applications. To address this problem, speculation may be used.

In speculative multithreading, the speculative threads in the application need to be identified either at compile time or entirely at runtime with hardware support. The different threads are then executed in parallel speculatively, and added hardware or software support enables detection of, and recovery from, dependence violations.

7.2.1 General Approaches for Implementing Speculative Multithreading


There have been two major approaches to configuring multiple processing units on a chip.

i. Using hardware support

With full hardware support, the architecture is completely geared towards exploiting speculative parallelism. Typical examples are the Trace [22], Multiscalar [23], and dynamic speculative multithreaded processors [17]. These processors can potentially handle sequential binaries without recompilation of the source. As such, they have hardware features that allow communication both at the memory and at the register level. For instance, in the Trace processor [22], additional hardware in the form of a trace cache assists in identifying threads at runtime, while interthread register communication is performed with the help of a centralized global register file. Recovery from misspeculation is achieved by maintaining two copies of the registers, along with a set of register masks, in each processing unit. Overall, these processors have sufficient hardware support to tailor the architecture for speculative execution, which enables them to deliver high performance on existing sequential binaries without full recompilation of the source program. A direct consequence, however, is that a large amount of hardware remains unutilized when running a fully parallel application or a multiprogrammed workload.

ii. Using minimal hardware support

In this approach, the CMP is generic and has only minimal support for speculative execution [17]. Current proposals for these systems restrict communication between processors to occur only through memory. Such limited hardware may be sufficient when programs are compiled with a compiler that is aware of the speculation hardware [17]. However, the need to recompile the source is a handicap, especially when the source is unavailable.

7.2.2 Optimal Implementation/Architecture using Hardware/Software Support

In this subsection we explain the speculation-based CMP architecture of [17], which attempts to combine the best of the two approaches above. Rather than requiring source recompilation, the proposed architecture operates directly on sequential binaries and requires only modest hardware support. Overall, it adds little hardware to a generic CMP while being able to handle sequential binaries quite effectively.

7.2.2.1 Overview of the design

A software approach is used to identify threads, with the only difference that the compilation step operates on the sequential executable file. The threads are limited to loop iterations; however, the approach can easily be expanded to include other sections of code. In the analysis, entry and exit points for each loop are marked. During execution, when a loop entry point is reached, multiple threads are spawned to begin executing successive iterations speculatively. Only one iteration is nonspeculative, with all its successors having speculative status. When a nonspeculative thread completes, its immediate speculative successor acquires nonspeculative status. Only the nonspeculative thread is allowed to update the sequential state of the processor. To simplify the hardware support, sequential semantics are followed for thread termination: a thread must reach nonspeculative status before it can be retired and a new thread initiated on the same processor. When the last iteration of a loop completes, any iterations that were speculatively spawned after it are squashed.

A binary annotation tool is used to extract multiple threads from sequential binaries to execute on the CMP. The steps involved in the annotation process are illustrated in Fig. 7.2; the process is similar to the one used in Multiscalar [23], except that it operates on binaries instead of on the intermediate code.

First, we identify inner-loop iterations and annotate their initiation and termination points. Then, we need to identify the register-level dependences between these threads. This involves identifying looplive registers, which are those that are live at loop entry/exit and may also be redefined in the loop. We then identify the reaching definitions, at loop exits, of all the looplive registers. From these looplive reaching definitions, we identify safe definitions, which are definitions that may occur but whose value will never be overwritten later in the loop body. Similarly, we identify the release points for the remaining definitions, whose value may be overwritten by another definition. These points are identified by first performing a backward reaching-definition analysis. Memory disambiguation, as well as thread initiation and termination, is handled with hardware support, as described below.
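To make the analysis concrete, the following minimal C sketch marks, on an invented loop, the kinds of points the annotation tool would identify in the corresponding binary. The function and variable names are purely illustrative, and the annotations appear only as comments; this is not the tool's output format.

/*
 * A hypothetical illustration of what the annotation step marks in a loop.
 * The real tool operates on the sequential binary; here the annotation
 * points are shown as comments on equivalent C code.
 */
long sum_array(const long *a, long n)
{
    long sum = 0;                   /* "sum" and "i" are looplive: live at    */
    for (long i = 0; i < n; i++) {  /* loop entry/exit and redefined inside   */
        /* THREAD ENTRY: successive iterations may be spawned speculatively   */
        sum += a[i];                /* last write to "sum" in the iteration:  */
                                    /* a safe definition, so its value can be */
                                    /* released to the successor thread here  */
        /* THREAD EXIT: the iteration ends; the next speculative thread
           becomes nonspeculative once this one commits                       */
    }
    return sum;
}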

In closing, it must be noted that incorporating the annotations in a binary is quite simple and requires only minor extensions to the ISA. Additional instructions are needed only for identifying thread entry, exit, and register-value release points.

Fig. 7.2 Binary Annotation process

ii) Interthread Register/Memory Communication

To enable flexible interthread register communication, a Synchronizing Scoreboard (SS) [17], explained later, is used. Identifying memory dependences is difficult at the binary level; therefore, the hardware is fully responsible for identifying and enforcing them. This hardware must ensure that speculative and nonspeculative versions of data are not mixed up, and must allow speculative threads to acquire data from the appropriate producer thread. It must also identify dependence violations that occur when a speculative thread prematurely accesses a memory location; such a violation results in the squashing of the violating thread along with its successors.

The hardware involves a decentralized design, where each processor's primary cache is used to store the speculative data, along with enhancements to the cache-coherence protocol to maintain data consistency. An approach similar to a directory-based cache-coherence scheme is used, with the aid of a hardware structure that we call the Memory Disambiguation Table (MDT). The working of the SS and the MDT is briefly as follows; our discussion assumes a 4-processor CMP. For the hardware to work, each thread maintains its status in the form of a bit mask (called ThreadMask) in a special register. The status of a thread can be any of the four values shown in Table 1.

Table 1

iii) Register-Level Communication with the Synchronizing Scoreboard (SS)

The Synchronizing Scoreboard (SS) is a fully decentralized structure used by threads to synchronize and communicate register values [18]. It is a scoreboard augmented with additional bits. Each processor has its own copy. The SSs in the different processors are connected with a broadcast bus, on which register values are transferred.

As in a conventional scoreboard, each SS has one entry per register. Fig. 7.3 shows the different fields for one entry [17]. The fields are grouped into local and global. To avoid centralization, the global bits are replicated but easily kept coherent across the SSs in the different processors.

The global fields include the Sync (S) and StartSync (F) fields. For a given register, the Si bit, if set, implies that the thread running on processor i has not yet made the register available to successor threads. When a thread starts on processor i, it sets the Si bit for all the looplive registers that it may create. The Si bit for a register is cleared when the thread executes either a safe definition or the release instruction for that register. When this occurs, the thread also writes the register value on the bus, thereby allowing other processors to update their values if needed. At that point, the register is safe to be used by successor threads. The F bit is set to the value of the corresponding S bit when the thread is initiated. This is done with dedicated hardware that, when a thread starts on processor i, initializes the Fi and Si bits for all the registers in the SS of all processors. From then on, the F bit remains unchanged throughout the execution of the thread; thus, it always remains the same across all the SSs. We will see the use of the F bits later.

Each processor has an additional local Valid (V) bit for each register, as in a conventional scoreboard. This per-processor private bit tells whether the processor has a valid copy of the register. When a parallel section of the code is reached, the processors that were idle in the preceding serial section start with their V bits set to zero. The V bit for a register is set when the register value is generated by the local thread or is communicated from another processor. Within a given parallel section, a processor can reuse registers across threads. To understand how the SS works, we consider how registers are communicated between processors and the problem of the last copy.

Register Communication between Processors



Register communication between threads can be producer-initiated or consumer-initiated and is described below in brief. When a thread clears the S bit for a register, it writes the register on the SS bus. At that point, each of the successor threads checks its V bit for the register and also the F bits for the register of the threads between the producer (not inclusive) and itself (inclusive). If all these bits are zero, the successor thread loads the register and sets its V bit for the register to 1. At the same time, all processors clear the S bit corresponding to the producer thread in their SSs. The F bits, however, remain unchanged.
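The producer-initiated check can be illustrated with the small C sketch below. It assumes a 4-processor CMP and that successor order follows processor index modulo the processor count; the type and function names are ours, not from [17].

#include <stdbool.h>
#include <stdint.h>

#define NPROC 4                      /* 4-processor CMP assumed in the text */

/* One Synchronizing Scoreboard entry (one architectural register).
 * S and F are the replicated global bits, V is the per-processor local bit. */
typedef struct {
    uint8_t S[NPROC];   /* Sync: producer has not released the value yet     */
    uint8_t F[NPROC];   /* StartSync: snapshot of S when the thread started  */
    uint8_t V;          /* valid copy present in this processor's file       */
} ss_entry_t;

/* When processor `prod` clears its S bit and broadcasts the value, a
 * successor `cons` loads it only if it has no valid copy and no thread
 * between the producer (exclusive) and itself (inclusive) may still create
 * the register.  Successor order is assumed to be (i + 1) mod NPROC.        */
bool consumer_loads_register(const ss_entry_t *e, int prod, int cons)
{
    if (e->V)                        /* already has a valid local copy       */
        return false;
    for (int p = (prod + 1) % NPROC; ; p = (p + 1) % NPROC) {
        if (e->F[p])                 /* an intervening (or own) thread may   */
            return false;            /* still produce the register           */
        if (p == cons)
            break;
    }
    return true;                     /* safe: load the value and set V       */
}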

Fig. 7.3 Hardware support for register communication.

The Last Copy Problem

When the last speculative thread updates a register, it has no successors to which it can send the value. As a result, any future consumer threads will have to explicitly request the value from it. Also, when a new thread is initiated, it invalidates any local register that a predecessor may produce. Under these conditions, a situation may occur where all the copies of a given register on the chip are about to become invalid. We term this scenario the last-copy problem.

The last-copy problem will not occur if we use a communication mechanism that buffers live-out values and percolates them to the new threads, or if we use a centralized global register set that maintains live-out values. For instance, the Multiscalar processor [23] uses the first approach. The Trace processor [22] avoids the last-copy problem by keeping a centralized global register set that is visible to all processors.

iv) Hardware Support for Handling Memory-Level Dependences

When an application runs in speculative mode, it can potentially generate wrong data. Thus, we need to separate the speculative from the nonspeculative data.



Consequently, in this scheme, each processor in the CMP has a private L1-cache with special support. During speculative execution, speculative threads use the L1-cache in a restricted write-back mode: The L1 is write back, but it cannot displace dirty lines. When a dirty line is about to be displaced, the thread stalls.

Unlike the speculative threads, the nonspeculative thread is allowed to update the shared L2-cache since its store operations are safe. Consequently, when a speculative thread acquires nonspeculative status, dirty lines are allowed to be displaced from the L1-cache. Furthermore, the L1-cache starts working in write-through mode. After the iteration completes and before it can be committed, any remaining dirty lines in the cache are flushed to memory. This allows us to start a new thread with a clean cache. Under the caching environment described, we identify memory dependence violations with the help of a Memory Disambiguation Table (MDT).
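A minimal sketch of the two cache modes described above is given next; the type and function names are invented, and only the policy decision is modeled, not an actual cache.

#include <stdbool.h>

/* Hypothetical per-processor cache-controller state for this sketch.       */
typedef enum { SPECULATIVE, NONSPECULATIVE } thread_state_t;
typedef enum { WRITE_BACK_NO_DISPLACE, WRITE_THROUGH } l1_mode_t;

typedef struct {
    thread_state_t thread;
    l1_mode_t      mode;
    bool           stalled;
} l1_ctrl_t;

/* Restricted write-back mode: a speculative thread may not displace dirty
 * (speculatively written) lines, so the thread stalls instead.             */
void on_dirty_line_displacement(l1_ctrl_t *c)
{
    if (c->thread == SPECULATIVE)
        c->stalled = true;        /* hold the thread until it becomes
                                     nonspeculative or is squashed          */
    /* nonspeculative: the line may be written back to the shared L2        */
}

/* When the thread acquires nonspeculative status, its stores are safe:
 * dirty lines may drain to L2 and the L1 switches to write-through, so the
 * cache is clean when the iteration commits and a new thread starts.       */
void on_acquire_nonspeculative(l1_ctrl_t *c)
{
    c->thread  = NONSPECULATIVE;
    c->mode    = WRITE_THROUGH;
    c->stalled = false;
}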

To enable speculative execution, each word cached in the private L1-caches is augmented with the bits shown in Table 3. The Invalid (I) and Dirty (D) bits serve the same purpose as in a conventional cache.

Table 3

Table 4

The SafeWrite (SW) and SafeRead (SR) bits are used by the speculative threads. In theory, each processor, when performing a load or a store operation, may have to inform the MDT, which tracks memory dependence violations. The SW and SR bits allow the processor to perform load and store operations without informing the MDT. Specifically, the SW bit, if set, permits a thread to perform writes to a word without informing the MDT. When a thread performs a store for the first time, it sets the D, SW, and SR bits; it also sends a message to the MDT. Subsequent stores to the word can be done without any messages to the MDT, provided the SW bit is set. The SW bit is cleared when a new thread is initiated on the processor. It is also cleared when a successor thread loads the same word and the MDT forwards the request to the processor with the SW bit set. The SW bit is cleared in that case because, if the thread stores again to the same word, a message will be sent to the MDT; this enables the MDT to flag that a speculative thread has performed a premature load. The SR bit, if set, allows the thread to perform load operations from the word without informing the MDT. This bit is set when the thread loads from or stores to the word for the first time. It is cleared when a new thread is initiated on the processor.
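The sketch below models how the SW and SR bits filter messages to the MDT; the structure and function names are invented for illustration and the bit-clearing on thread initiation is simplified.

#include <stdbool.h>

/* Per-word speculation state kept in the L1-cache (names are ours).        */
typedef struct {
    bool D;    /* Dirty                                                     */
    bool SW;   /* SafeWrite: further stores need not notify the MDT         */
    bool SR;   /* SafeRead:  further loads  need not notify the MDT         */
} spec_word_t;

/* Each access returns true when a message must be sent to the MDT.         */
bool speculative_store(spec_word_t *w)
{
    bool notify_mdt = !w->SW;   /* first store, or SW was cleared meanwhile */
    w->D = w->SW = w->SR = true;
    return notify_mdt;
}

bool speculative_load(spec_word_t *w)
{
    bool notify_mdt = !w->SR;   /* first load from / store to this word     */
    w->SR = true;
    return notify_mdt;
}

/* SW is cleared when a successor's load is forwarded here by the MDT, so a
 * later store will again reach the MDT and expose the premature load.      */
void on_successor_load_forwarded(spec_word_t *w) { w->SW = false; }

/* On new-thread initiation SW and SR are cleared (dirty data was already
 * flushed before the previous thread committed).                           */
void on_new_thread(spec_word_t *w) { w->SW = w->SR = false; }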

The Flush (F) bit is used to invalidate stale words from the L1-cache before a new thread is initiated on the processor. When a thread stores to a word, any copies of that word have to be invalidated from the caches of all its successors. As for the predecessors, information needs to be maintained to denote that the word will become stale in their caches after they complete. The F bit identifies those cached words that need to be invalidated across thread initiations, while allowing the reuse of the remaining words. Finally, the Forward (FD) bit is used to identify data forwarded from other processors. Maintaining these bits at a cache-line granularity could result in false dependence detection and, consequently, lead to unnecessary squashing of threads, so the information is maintained at a word level.

v) Memory Disambiguation Table (MDT)

The MDT performs the disambiguation mechanism. It keeps entries on a per-memory-line basis and, like the L1-caches, maintains information on a per-word basis. For each word, it keeps a Load (L) and a Store (S) bit for each processor. When a new thread is initiated on a processor, all its L and S bits are cleared. As the thread executes, the MDT works like a directory that keeps track of which processors share which words. Table 4 shows an MDT for a 4-processor CMP; only the state for word 0 of the lines is shown. The MDT is placed between the per-processor L1-caches and the shared L2-cache and is distributed in a multibanked fashion. It receives all the requests that are not intercepted by the L1-caches.

vi) Action on thread initiation/squashing

At the point of thread initiation, all words whose F bits are set are invalidated, and the F, SW, and SR bits are cleared. In the MDT, the corresponding L and S bits are also cleared for the new thread. No special action is taken when threads are squashed, since new threads will eventually be initiated on those processors and all the actions described above will be performed then. When a thread is squashed, however, some additional words need to be invalidated from its processor's L1-cache: those that are dirty and those that were forwarded from a squashed predecessor thread.
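A small sketch of one MDT word entry and the check performed on a store is shown below. It assumes the speculation order of the processors is supplied externally in an `order` array; the names and that array are assumptions of the sketch, and the L bit is taken to record only exposed loads (loads not preceded by the thread's own store, thanks to the SR filtering above).

#include <stdbool.h>

#define NPROC 4                   /* 4-processor CMP, as in Table 4          */

/* MDT state for one word of a memory line: a Load and a Store bit per
 * processor, cleared when a new thread is initiated on that processor.      */
typedef struct {
    bool L[NPROC];
    bool S[NPROC];
} mdt_word_t;

/* On a store by thread `writer`, any successor thread that already performed
 * an exposed load of the word read a stale value: it and its successors must
 * be squashed.  `order[k]` lists processors from least to most speculative.  */
int check_store_violation(const mdt_word_t *w, const int order[NPROC], int writer)
{
    bool past_writer = false;
    for (int k = 0; k < NPROC; k++) {
        int p = order[k];
        if (p == writer) { past_writer = true; continue; }
        if (!past_writer) continue;          /* predecessors are unaffected   */
        if (w->L[p])
            return p;     /* first violating successor: squash from here on  */
    }
    return -1;            /* no dependence violation                          */
}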

7.23 Comparison with Other Schemes



Multiscalar [23] has a specialized module, the ARB, placed between the multiple processing units and the shared L1-cache. It provides storage for speculative data and also identifies dependence violations. In such a scheme, the L1-cache stores the sequential state of the program, and all memory accesses from all processing units are streamed through the ARB before they can reach the L1-cache. This is a very centralized scheme and would hurt performance due to the added latency of L1-cache accesses.

Oplinger et al. [17] have proposed a scheme that has extra buffers, rather than the L1-cache, to hold speculative data. Although this simplifies the protocol, the small buffers may become full and stall the speculative threads.

Steffan and Mowry [17] outline a scheme that allows the L1-cache to buffer the speculative state. However, their description does not mention a precise cache-coherence scheme for handling data dependences.

The Speculative Versioning Cache (SVC) scheme [17] is a detailed architectural proposal for handling dependences. Unlike our MDT scheme, which uses a directory-like approach, the SVC scheme uses snooping caches for detecting memory dependence violations. Finally, the Multi-Value cache [17] maintains multiple states of the same memory location and uses a broadcast mechanism for conveying information between threads.

7.24 Overall Support required for Speculative Multithreading

In this sub-section, we summarize the overall support that is required to run applications speculatively on the CMP.

First, we need to predict the starting address of the next thread. Since the grain size of the tasks used in this system is small, we need hardware support for thread initiation and termination. The hardware for thread initiation includes support for initializing the extra bits in the SS, L1-cache, and MDT entries, while the hardware for thread termination performs the corresponding termination operations on these same bits and table entries. In addition, the L1-caches need to work as write-back or write-through depending on the state of the thread and, for speculative threads, must cause a thread stall on a dirty-line displacement. Register-level communication requires a broadcast bus, along with one additional read port and one additional write port for each register file. In addition, the scoreboard needs to be augmented with 3n extra bits per register, where n is the number of processors on the chip. Finally, we need some additional logic, replicated for each register, to perform the various checks specified in the previous sections. Handling memory dependences requires an MDT, along with the related logic that checks for memory dependence violations. Overall, enhancing a CMP to execute applications speculatively requires a modest amount of hardware.

7.25 Implementation of CMP Architecture: Stanford Hydra CMP



The Hydra chip multiprocessor (CMP) [15] integrates four MIPS-based processors and their primary caches on a single chip, together with a shared secondary cache. The Hydra CMP supports thread-level speculation and memory renaming, which simplify parallel programming. The architecture of the Hydra CMP is shown below in Fig. 7.1. It has four MIPS-based processor cores, all of which share a single, large on-chip secondary cache, and each core has separate primary instruction and data caches.

The Hydra design uses thread-level speculation. Speculation allows parallelization of a program into threads even without prior knowledge of where true dependencies between threads may occur. To support thread-level speculation, there is a need for special hardware to monitor the data shared by the threads.

Fig. 7.1 Hydra CMP design

The basic requirements for the special coherency hardware that implements thread-level speculation are as follows:

• Forwarding data between parallel threads.
• Mechanism to track reads and writes to shared data memory.
• Mechanism to safely discard speculative state after violations.
• Retirement of speculative writes in correct order.
• Providing memory renaming.

Most of the additional hardware is contained in two major blocks. The first is a set of additional tag bits added to each primary-cache line to track whether any data in the line has been speculatively read or written. The second is a set of write buffers that hold speculative writes until they can be safely committed into the secondary cache, which is guaranteed to hold only nonspeculative data. Each speculative thread running on the Hydra CMP [15] is allocated a write buffer, so that the writes of different threads are kept separate. Only after a thread completes execution are the contents of its buffer actually written into the secondary cache and made permanent. The proposed speculative hardware enforces speculative coherence on the memory system, while software handles register-level coherence.
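A minimal sketch of one such per-thread write buffer follows, with an assumed capacity and invented names; the real Hydra buffers are hardware structures, so this only illustrates the store/commit/discard behavior.

#include <stddef.h>
#include <stdint.h>

#define WBUF_ENTRIES 64          /* capacity is an assumption of the sketch  */

/* One speculative write buffer: it collects a thread's stores and releases
 * them to the (nonspeculative) secondary cache only when the thread commits. */
typedef struct {
    uint64_t addr[WBUF_ENTRIES];
    uint64_t data[WBUF_ENTRIES];
    size_t   n;
} write_buffer_t;

void wbuf_store(write_buffer_t *b, uint64_t addr, uint64_t data)
{
    if (b->n < WBUF_ENTRIES) {
        b->addr[b->n] = addr;
        b->data[b->n] = data;
        b->n++;
    }
    /* a full buffer would stall the speculative thread (not modeled here)   */
}

/* Commit: the thread is now the oldest (nonspeculative) one, so its writes
 * drain, in order, into the shared L2, which holds only safe data.          */
void wbuf_commit(write_buffer_t *b, void (*l2_write)(uint64_t, uint64_t))
{
    for (size_t i = 0; i < b->n; i++)
        l2_write(b->addr[i], b->data[i]);
    b->n = 0;
}

/* Violation: the speculative state is simply discarded.                     */
void wbuf_discard(write_buffer_t *b) { b->n = 0; }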

The Hydra CMP will be a high-performance, economical alternative to large single-chip uniprocessors. A CMP of comparable die area can achieve performance similar to that of a uniprocessor on integer programs by using thread-level speculation, and with multiprogrammed workloads or highly parallel applications it can significantly outperform a uniprocessor of comparable cost.

8. Design Space for CMP organizations



This section studies the space of CMP organizations [19] to determine how many processing cores a chip should have, whether those cores should use in-order or out-of-order issue, and how big the per-processor on-chip caches should be. The CMP organization that maximizes total chip performance, which here is equivalent to job throughput, depends mainly on the following factors:

• Processor organization: whether powerful out-of-order issue processors or smaller, more numerous in-order processors provide superior throughput.
• Cache hierarchy: the amount of cache memory per processor that results in maximal throughput.
• Off-chip bandwidth: finite bandwidth limits the number of cores that can be placed on a chip, forcing more area to be devoted to on-chip caches to reduce bandwidth demands.
• Application characteristics: different applications display varying sensitivities to L2 cache capacity, resulting in widely varying bandwidth demands.

These constraints have complex interactions. More powerful processors place a heavier individual load on the off-chip memory channels, but smaller, more numerous processors may result in a heavier aggregate bandwidth load. Large caches reduce the number of off-chip accesses, permitting more processors to share a fixed bandwidth, but large caches also consume significant area, leaving room for fewer processing cores. To evaluate the CMP organization alternatives, we focus on throughput-oriented workloads with no sharing of data among tasks.

8.1 Processor Organization and cache size impact

For the processor models, let us consider in-order and out-of-order issue processors ranging from 2-way to 8-way issue width. Table 1 shows the harmonic means [19] of IPCs for a mix of processor-bound, cache-sensitive, and bandwidth-bound benchmarks, for each model with varying L2 cache size. The number of ALUs is scaled with the issue width.

Table 1 Harmonic mean of IPCs for 6 processor models
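As a reminder of how the table's summary metric would be computed, here is a small C example of the harmonic mean of per-benchmark IPCs; the sample values are placeholders, not data from [19].

#include <stdio.h>

/* Harmonic mean of per-benchmark IPCs: HM = n / (sum of 1/IPC_i). */
double harmonic_mean(const double *ipc, int n)
{
    double inv_sum = 0.0;
    for (int i = 0; i < n; i++)
        inv_sum += 1.0 / ipc[i];
    return n / inv_sum;
}

int main(void)
{
    /* placeholder IPCs, e.g. processor-, cache-, and bandwidth-bound jobs */
    double ipc[] = { 1.8, 0.9, 0.4 };
    printf("HM = %.2f\n", harmonic_mean(ipc, 3));
    return 0;
}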

For in-order cores, issue width has little impact on performance, but out-of-order cores show a significant performance improvement from 2-issue to 4-issue.

We now study uniprocessor performance as a function of processor organization and cache capacity. Henceforth, two processor models are used for the analysis, whose configurations are shown in Table 2. Pin is a simple 2-way in-order issue processor that is roughly comparable to the Alpha 21064. Pout is a 4-way issue out-of-order processor comparable to the Alpha 21264. The core area of Pout is approximately five times larger than that of Pin.

Table 2. Parameters of the processor models under evaluation

The effectiveness of increased cache capacity and out-of-order processors is limited by the bandwidth demands of the application. To display these characteristics more clearly, Fig. 8.1 shows the IPC, DRAM access frequency, and memory-channel utilization as a function of cache capacity. The applications considered for the analysis are mesa (processor bound), gcc (cache sensitive), and sphinx and art (bandwidth bound).

Fig. 8.1 Effect of varying L2 cache sizes

From the figure, we note the following points.

• The gap between the Pin and Pout configurations in column (a) depends on the memory demands of the benchmark; the largest gap is for the processor-bound benchmark (mesa), indicating that out-of-order cores will be more efficient for that category.
• For the cache-sensitive benchmark (gcc), the performance of the out-of-order and in-order cores converges as the cache size drops and more frequent requests are made to memory.
• The data in column (b) show that, for art, larger caches cause sharp reductions in L2 misses.
• The data in column (c) show that the out-of-order cores place a heavier demand on the memory channel, as they move the same quantity of data across the wires in a shorter time.

8.2 Impact of Channel sharing

Channel sharing arises when multiple processes execute simultaneously on different processors. Fig. 8.2 plots the aggregate IPC versus the number of processors sharing a single channel.

Fig.8.2 Performance scalability versus channel sharing.

The data show that the processor-bound job mesa exhibits good throughput scaling with increasing numbers of channel sharers, except in the experiments with the smallest (128 KB) caches. The bandwidth-bound jobs sphinx and art show no improvement as more jobs are added, since bandwidth is their critical resource and it is already saturated with one job.

Thus, once too many processors are sharing the channel, adding more processors no longer improves throughput; the area would be better spent increasing the sizes of the caches and reducing the load on the channel.

8.3 Evaluating CMP organization area alternatives

Results show that the number of processing cores can indeed grow large with future integration technologies. However, the useful number of processors will be limited by off-chip bandwidth, because the number of transistors is predicted to increase faster than the number of signaling pins. Thus, limited off-chip bandwidth will always constrain the maximum number of cores that can be placed on a chip. The factors to consider when designing the CMP configuration, depending on the type of application, are as follows:

• For applications that are not bandwidth bound, a design with powerful out-of-order cores and small L2 caches per core supports maximal chip throughput.
• At small feature sizes, as more processors are forced to share memory channels due to restricted pin counts, many applications begin to be limited by memory bandwidth, and larger per-processor caches are necessary to reduce the off-chip load.
• For cache-sensitive applications, the best configuration is out-of-order cores with large L2 caches, so that enough of the working set is contained in the L2 cache.

9. Conclusions and Future Work



One of the first things that we notice from our study is that SMTs will definitely require much more complex hardware scheduling logic. We need hardware for checking resource dependencies across a larger number of threads, logic to handle issued instructions when they require a functional unit that is currently in use, and logic for context switching when latencies occur, so instruction scheduling becomes quite complicated. Detecting ILP in a larger number of threads ultimately results in more complex logic. The more threads there are, the more registers are required, both for renaming and for the context state. This results in larger and larger register files, which will be slow, so the maximum frequency at which the processor can operate will be lower. The overall hardware complexity also leads to more interconnect delay, which will be the biggest hurdle for SMT designs as integration technology improves. To accommodate the complicated logic while keeping the clock period short, the depth of the pipeline is increased; however, a deeper pipeline increases the branch-misprediction penalty. Similarly, the amount of hardware logic needed makes the design process much more complex, which translates into higher development costs, and the large area requirements of SMT translate into higher silicon costs.

An SMT processor still has some potential for performance improvements. If we adopt a policy of selective fetching, we must incorporate additional logic in hardware to implement it, but this translates into greater performance returns. The various compiler optimizations and parallelization techniques can improve performance at little additional cost; however, these optimizations do not work well for every workload and are limited. Moreover, the SMT processor does not perform well when executing a single thread, because the additional pipeline stages introduced specifically for SMT increase the delay for a single thread and result in lower performance. This is not such a major concern, however, as we can always implement the pre-execution scheme, which yields high performance even for a single thread.

We noticed that the addition of even a single extra thread translates into large area costs. We also observed that increasing the number of threads beyond 8 yields only an extremely small performance increase, which does not justify the costs involved; moreover, SMT alone would not be able to effectively utilize the increasing chip resources available as integration technology improves. The various bottlenecks, such as issue bandwidth, fetch bandwidth, instruction-queue size, and register-file size, all limit the performance of the processor, and even if these bottlenecks are removed the gain in performance is only marginal. In fact, the size of the register file is a major bottleneck for the operating speed of the processor. Future improvements in memory and silicon technology will reduce this problem, but it will always remain a bottleneck in comparison with other technologies. Thus, as far as further performance increases through removal of bottlenecks are concerned, there is not much scope for improvement; progress in SMT processors will be limited by process technology improvements. Still, SMT has been shown to give extremely high performance compared to other technologies, so if very high performance is desired, SMT is the design of choice. Major chip companies like Intel, with its Hyper-Threading technology for the Xeon server processors, have just started to enter the area of commercial SMT processors. If these ventures succeed, there is a chance for some breakthrough in SMT processor technology.

On the other hand, CMPs use relatively simple single-thread processor cores, with each core executing instructions from a separate thread or process, to exploit only a moderate amount of parallelism within any one thread. This eliminates the complicated register renaming. Smaller, simpler processors can run at a higher clock rate, and the design time is shorter, as it allows a single, proven processor design to be replicated multiple times over a die. Also, a CMP is much less sensitive to poor data layout and poor communication management, since the interprocessor communication latencies are lower and bandwidths are higher.

However, as applications become bandwidth bound and global wire delays increase, an interesting scenario may arise. It is likely that, given a ceiling on cache size (monolithic caches cannot be grown past a certain point in 50 or 35 nm technologies, since wire delay will make them slow), off-chip bandwidth will limit the number of cores. Thus, there may be useless area on the chip that cannot be used for cache or processing logic. Improved packaging or signaling speeds may permit greater scaling and even larger numbers of processors. In the long term, a tremendous number of processors can be placed on future CMPs to enable scaling of throughput with technology. However, fixing the cache hierarchy and the number of cores a priori will result in poor performance across many application classes. Future CMPs would benefit from mechanisms that support adaptation to an application's available parallelism and resource needs; this application adaptivity is likely to be an important direction for research in future CMP designs.

Thus, we can get very high performance from SMT processors, but the costs and design complexity involved are quite high. On the other hand, CMPs give high operating speed and ease of design with good, though not SMT-level, performance. After studying both SMT processors and CMPs and evaluating their performance benefits and trade-offs, the next logical question is whether we can have a design configuration that gives us the benefits of both; if so, it would offer the best of both worlds. Krishnan and Torrellas have suggested just that in [12].

In the architectures elaborated in [12], in order to reduce the bypass delay problem and to reduce resource centralization, the SMT is partitioned into several clusters. One configuration is to have two clusters, where each cluster is an SMT processor that can support multiple threads; together, up to 8 instructions can be issued simultaneously. Each cluster has its own fetch unit, and each thread is capable of fetching up to 4 instructions per cycle. In this configuration there is no resource sharing across clusters, so a single executing thread can achieve at most 4 IPC; with two threads, each cluster handles a single thread. The benefit of this scheme is that, since both SMT blocks are relatively simpler than a centralized 8-issue SMT, we get much greater ease of design and also higher clock rates due to smaller register files. [12] estimated that the cycle time for an 8-issue processor would be twice as long as that for a 4-issue processor when using 0.18 µm technology. Another possible configuration is four clusters, each of which can issue two instructions. Each cluster would have its own private primary cache, and the secondary cache would be shared. [12] observed that the two-cluster configuration gave at least 13% better performance than any CMP.

From a purely architectural point of view, the SMT processor's flexibility in dynamically exploiting both ILP and TLP makes it superior. However, the need to limit the effects of interconnect delays, which are growing much larger relative to transistor gate delays, will also drive billion-transistor designs. Interconnect delays will force the microarchitecture to be partitioned into small, localized processing elements. For this reason, the CMP is much more promising, because it is already partitioned into individual processing cores which, being simple, are amenable to speed optimization and can be designed relatively easily. On the whole, SMT is a logical and straightforward evolutionary improvement of modern processors, whereas CMP is a more radical approach that places several processor cores on one die.



10. References

[1] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa, "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads," Proc. 19th Annual International Symposium on Computer Architecture, ACM & IEEE-CS, pp. 136-145, May 1992.

[2] J. P. Bradford, "PUMP: A New Architecture for Multithreaded Processors."

[3] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading," ACM Transactions on Computer Systems, Vol. 15, No. 3, pp. 322-354, August 1997.

[4] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996.

[5] D. Puppin and D. Tullsen, "Maximizing TLP with Loop-Parallelization on SMT."

[6] C. Luk, "Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors."

[7] A. Snavely and D. Tullsen, "Symbiotic Jobscheduling for a Simultaneous Multithreaded Processor."

[8] O. Zaki, M. McCormick, and J. Ledlie, "Adaptively Scheduling Processes on a Simultaneous Multithreading Processor."

[9] J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen, "Tuning Compiler Optimizations for Simultaneous Multithreading," Proc. Micro-30, Research Triangle Park, NC, Dec. 1-3, 1997.

[10] J. Burns and J.-L. Gaudiot, "Quantifying the SMT Layout Overhead - Does SMT Pull Its Weight?"

[11] J. Burns and J.-L. Gaudiot, "Area and System Clock Effects on SMT/CMP Processors."

[12] V. Krishnan and J. Torrellas, "A Clustered Approach to Multithreaded Processors."

[13] D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, "Hyper-Threading Technology Architecture and Microarchitecture."

[14] D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995.

[15] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP."

[16] L. Hammond, B. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor."

[17] V. Krishnan and J. Torrellas, "A Chip-Multiprocessor Architecture with Speculative Multithreading."

[18] V. Krishnan and J. Torrellas, "The Need for Fast Communication in Hardware-Based Speculative Chip Multiprocessors."

[19] J. Huh, D. Burger, and S. Keckler, "Exploring the Design Space of Future CMPs."

[20] V. Krishnan and J. Torrellas, "Hardware and Software Support for Speculative Execution of Sequential Binaries on a Chip-Multiprocessor."

[21] K. M. Dixit, "New CPU Benchmark Suites from SPEC," COMPCON, Spring 1992, pp. 305-310, 1992.

[22] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, "Trace Processors," Proc. 30th International Symposium on Microarchitecture (MICRO-30), pp. 138-148, Dec. 1997.

[23] G. Sohi, S. Breach, and T. N. Vijaykumar, "Multiscalar Processors," Proc. 22nd International Symposium on Computer Architecture, pp. 414-425, June 1995.

[24] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz, "APRIL: A Processor Architecture for Multiprocessing," Proc. 17th Annual International Symposium on Computer Architecture, pp. 104-114, May 1990.

[25] A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, pp. 525-539, September 1992.

[26] G. E. Daddis, Jr. and H. C. Torng, "The Concurrent Execution of Multiple Instruction Streams on Superscalar Processors," Proc. International Conference on Parallel Processing, pp. 76-83, Aug. 1991.

[27] R. Thekkath and S. J. Eggers, "The Effectiveness of Multiple Hardware Contexts," Proc. Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 328-337, Oct. 1994.