survey on explicit multithreading

A Survey of Processors with Explicit Multithreading

THEO UNGERER

University of Augsburg

BORUT ROBIC

University of Ljubljana

AND

JURIJ SILC

Jozef Stefan Institute

Hardware multithreading is becoming a generally applied technique in the nextgeneration of microprocessors. Several multithreaded processors are announced byindustry or already into production in the areas of high-performance microprocessors,media, and network processors.

A multithreaded processor is able to pursue two or more threads of control in parallelwithin the processor pipeline. The contexts of two or more threads of control are oftenstored in separate on-chip register sets. Unused instruction slots, which arise fromlatencies during the pipelined execution of single-threaded programs by a contemporarymicroprocessor, are filled by instructions of other threads within a multithreadedprocessor. The execution units are multiplexed between the thread contexts that areloaded in the register sets.

Underutilization of a superscalar processor due to missing instruction-levelparallelism can be overcome by simultaneous multithreading, where a processor canissue multiple instructions from multiple threads each cycle. Simultaneousmultithreaded processors combine the multithreading technique with a wide-issuesuperscalar processor to utilize a larger part of the issue bandwidth by issuinginstructions from different threads simultaneously.

Explicit multithreaded processors are multithreaded processors that apply processesor operating system threads in their hardware thread slots. These processors optimizethe throughput of multiprogramming workloads rather than single-threadperformance. We distinguish these processors from implicit multithreaded processors

Categories and Subject Descriptors: C.1 [Computer Systems Organization]:Processor Architectures; C.1.3 [Processor Architectures]: Other ArchitectureStyles—Pipeline processors

General Terms: Design, Performance

Additional Key Words and Phrases: Blocked multithreading, interleavedmultithreading, simultaneous multithreading

Authors’ addresses: T. Ungerer, Institute of Computer Science, University of Augsburg, Eichleitnerstrasse30, D-86135 Augsburg, Germany; email: [email protected]; B. Robic, Faculty of Com-puter and Information Science, University of Ljubljana, Trzaska 25, Sl-1000 Ljubljana, Slovenia; email:[email protected]; J. Silc, Computer Systems Department, Jozef Stefan Institute, Jamova 39, Sl-1000Ljubljana, Slovenia; email: [email protected] to make digital/hard copy of part or all of this work for personal or classroom use is grantedwithout fee provided that the copies are not made or distributed for profit or commercial advantage, thecopyright notice, the title of the publication, and its date appear, and notice is given that copying is bypermission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to listsrequires prior specific permission and/or a fee.c©2003 ACM 0360-0300/03/0300-0029 $5.00

ACM Computing Surveys, Vol. 35, No. 1, March 2003, pp. 29–63.

30 Ungerer et al.

that utilize thread-level speculation by speculatively executing compiler- ormachine-generated threads of control that are part of a single sequential program.

This survey paper explains and classifies the explicit multithreading techniques inresearch and in commercial microprocessors.

1. INTRODUCTION

A multithreaded processor is able to con-currently execute instructions of differ-ent threads of control within a singlepipeline. The minimal requirement forsuch a processor is the ability to pursuetwo or more threads of control in paral-lel within the processor pipeline, that is,the processor must provide two or moreindependent program counters, an inter-nal tagging mechanism to distinguish in-structions of different threads within thepipeline, and a mechanism that triggersa thread switch. Thread-switch overheadmust be very low, from zero to only a fewcycles. Multithreaded processor featuresoften, but not always, multiple registersets on the processor chip.

The current interest in hardware multi-threading stems from three objectives:

—Latency reduction is an important taskwhen designing a microprocessor. La-tencies arise from data dependenciesbetween instructions within a singlethread of control. Long latencies arecaused by memory accesses that miss inthe cache and by long running instruc-tions. Short latencies may be bridgedwithin a superscalar processor by ex-ecuting succeeding, nondependent in-structions of the same thread. Long la-tencies, however, stall the processor andlessen its performance.

—Shared-memory multiprocessors sufferfrom memory access latencies thatare several times longer than in asingle-processor system. When access-ing a nonlocal memory module in adistributed-shared memory system, thememory latency is enhanced by thetransfer time through the communica-tion network. Additional latencies arisein a shared memory multiprocessorfrom thread synchronizations, whichcause idle times for the waiting thread.One solution to fill these idle times is

to switch to another thread. However,a thread switch on a conventional pro-cessor causes saving of all registers,loading the new register values, andseveral more administrative tasks thatoften require too much time to prove thismethod as an efficient solution.

—Contemporary superscalar micropro-cessors [Silc et al. 1999] are ableto issue four to six instructions eachclock cycle from a conventional lin-ear instruction stream. VLSI technol-ogy will allow future microprocessorswith an issue bandwidth of 8–32 in-structions per cycle [Patt et al. 1997].As the issue rate of future micropro-cessors increases, the compiler or thehardware will have to extract moreinstruction-level parallelism (ILP) froma sequential program. However, ILPfound in a conventional instructionstream is limited. ILP studies which al-low single control flow branch specula-tion have reported parallelism around7 instructions per cycle (IPC) withinfinite resources [Wall 1991; Lam andWilson 1992] and around 4 IPC withlarge sets of resources (e.g., 8 to 16execution units) [Butler et al. 1991].Contemporary high-performance micro-processors therefore exploit speculativeparallelism by dynamic branch predic-tion and speculative execution of thepredicted branch path to increase sin-gle thread performance. Control specu-lation may be enhanced by data depen-dence and value speculation techniquesto increase performance of a single pro-gram thread [Lipasti et al. 1996; Lipastiand Shen 1997; Chrysos and Emer 1998;Rychlik et al. 1998].

Multithreading pursues a different setof solutions by utilizing coarse-grainedparallelism [Iannucci et al. 1994].

The notion of a thread in the contextof multithreaded processors differs from

ACM Computing Surveys, Vol. 35, No. 1, March 2003.

Survey of Processors with Explicit Multithreading 31

the notion of software threads in multi-threaded operating systems. In case ofa multithreaded processor a thread isalways viewed as a hardware-supportedthread which can be—depending on thespecific form of multithreaded processor—a full program (single-threaded UNIXprocess), an operating system thread(a light-weighted process, e.g., a POSIXthread), a compiler-generated thread(subordinate microthread, microthread,nanothread etc.), or even a hardware-generated thread.

1.1. Explicit and Implicit Multithreading

The solution surveyed in this paper is theutilization of coarser-grained parallelismby explicit multithreaded processors thatinterleave the execution of instructions ofdifferent user-defined threads (operatingsystem threads or processes) in the samepipeline. Multiple program counters areavailable in the fetch unit and the multi-ple contexts are often stored in differentregister sets on the chip. The executionunits are multiplexed between the threadcontexts that are loaded in the registersets. The latencies that arise in the com-putation of a single instruction stream arefilled by computations of another thread.Thread-switching is performed automati-cally by the processor due to a hardware-based thread-switching policy. This abilityis in contrast to conventional processors ortoday’s superscalar processors, which usebusy waiting or a time-consuming, operat-ing system-based thread switch.

Depending on the specific multi-threaded processor design, either a single-issue instruction pipeline (as in scalarprocessors) is used, or multiple instruc-tions from possibly different instructionstreams are issued simultaneously. Thelatter are called simultaneous multi-threaded (SMT) processors and combinethe multithreading technique with a wide-issue superscalar processor such that thefull-issue bandwidth is more often utilizedby potentially issuing instructions fromdifferent threads simultaneously.

A different approach increases theperformance of sequential programs by

applying thread-level speculation. Athread in such processor proposals refersto any contiguous region of the static ordynamic instruction sequence. We callsuch processors implicit multithreadedprocessors, which refers to any processorthat can dynamically generate threadsfrom single-threaded programs and exe-cute such speculative threads concurrentwith the lead thread. In case of misspecu-lation, all speculatively generated resultsmust be squashed. Threads generatedby implicit multithreaded processors arealways executed speculatively, in contrastto the threads in explicit multithreadedprocessors.

Examples of implicit multithreadedprocessor proposals are the multiscalarprocessor [Franklin 1993; Sohi et al.1995; Sohi 1997; Vijaykumar and Sohi1998], the trace processor [Rotenberget al. 1997; Smith and Vajapeyam1997; Vajapeyam and Mitra 1997], thesingle-program speculative multithread-ing architecture [Dubey et al. 1995], thesuperthreaded architecture [Tsai and Yew1996; Li et al. 1996], the dynamic multi-threading processor [Akkary and Driscoll1998], and the speculative multithreadedprocessor [Marcuello et al. 1998]. Someapproaches, in particular the multiscalarscheme, use compiler-support for threadgeneration. In these examples a mul-tithreaded processor may be character-ized by a single processing unit with asingle or multiple-issue pipeline able toprocess instructions of different threadsconcurrently. As a result, some of theseapproaches may rather be viewed asvery closely coupled chip multiprocessors,because multiple subordinate processingunits execute different threads under con-trol of a single sequencer unit.

1.2. Multithreading and Multiprocessors

The first multithreaded processor ap-proaches in the 1970s and 1980s ap-plied multithreading at user-thread-levelto solve the memory access latency prob-lem. This problem arises for each mem-ory access after a cache miss—in particu-lar, when a processor of a shared-memory


32 Ungerer et al.

multiprocessor accesses a shared-memoryvariable located in a remote-memory mod-ule. The latency becomes a problem if theprocessor spends a large fraction of itstime sitting idle and waiting for remoteaccesses to complete [Culler et al. 1998].Latencies that arise in a pipeline aredefined with a wider scope—for exam-ple, covering also long-latency opera-tions like div or latencies due to branchinterlocking.

Older multithreaded processor ap-proaches from the 1980s usually extendscalar RISC processors by a multithread-ing technique and focus on effectivelybridging very long remote memory accesslatencies. Such processors will only beuseful as processor nodes in distributedshared-memory multiprocessors. How-ever, developing a processor that is specif-ically designed for such multiprocessorsis commonly regarded as too expensive.Multiprocessors today comprise standardoff-the-shelf microprocessors and almostnever specifically designed processors(with the exception of the Cray MTA).Therefore, newer multithreaded processorapproaches also strive for tolerating alllatencies (even single-cycle latencies) thatarise from primary cache misses thathit in secondary cache, from long-latencyoperations, or even from unpredictablebranches.

1.3. Multithreading and Dataflow

Another root of multithreading comesfrom dataflow architectures. Viewed froma dataflow perspective a single-threadedarchitecture is characterized by the com-putation that conceptually moves forwardone step at a time through a sequenceof states, each step corresponding to theexecution of one enabled instruction. Ac-cording to Dennis and Gao [1994], amultithreaded architecture differs froma single-threaded architecture in thatthere may be several enabled instruc-tions from different threads which allare candidates for execution. A thread isviewed as a sequentially ordered blockof instructions with a grain-size greater

than 1 (to distinguish multithreaded ar-chitectures from fine-grained dataflowarchitectures). Blocking and nonblockingthreads are distinguished. A nonblockingthread is formed such that its evalua-tion proceeds without blocking the pro-cessor pipeline (for instance, by remotememory accesses, cache misses, or syn-chronization waits). Evaluation of a non-blocking thread starts as soon as allinput operands are available, which isusually detected by some kind of dataflowprinciple. Thread switching is controlledby the compiler harnessing the idea ofrescheduling, rather than blocking, whenwaiting for data. Access to remote data isorganized in a split-phase manner by onethread sending the access request to mem-ory and another thread activating whenits data is available. Thus a program iscompiled into many, very small threadsactivating each other when data becomeavailable. The same hardware mecha-nisms may also be used to synchronizeinterprocess communications to awaitingthreads, thereby alleviating operating sys-tems overhead. In contrast, a blockingthread might be blocked during executionby remote memory accesses, cache misses,or synchronization needs. The waitingtime, during which the pipeline is blocked,is lost when using a von Neumann pro-cessor, but can be efficiently bridged by afast context switch to another thread ina multithreaded processor. Switching toanother thread in a single-threaded pro-cessor usually exhibits too much contextswitching overhead to mask the latencyefficiently. The original thread can be re-sumed when the reason for blocking isremoved.

Use of nonblocking threads typicallyleads to many small threads that areappropriate for execution by a hybriddataflow computer [Silc et al. 1998] orby a multithreaded architecture that isclosely related to hybrid dataflow. Block-ing threads may just be the threads(e.g., P(OSIX)threads or Solaris threads)or whole UNIX processes of a multi-threaded UNIX-based operating system,but may also be even smaller microthreads



Explicit multithreading

Issuingfrom a single threadin a cycle

Issuingfrom multiple threadsin a cycle

Interleavedmultithreading(IMT)

Simultaneousmultithreading(SMT)

Blockedmultithreading(BMT)

Fig. 1 . Explicit multithreading.

generated by a compiler to utilize the po-tentials of a multithreaded processor.

Note that we exclude in this survey hy-brid dataflow architectures that are de-signed for the execution of nonblockingthreads. Although these architectures areoften called multithreaded, we have cat-egorized them in a previous paper [Silcet al. 1998] as threaded dataflow or large-grain dataflow because a dataflow prin-ciple is applied to start the execution ofnonblocking threads. Thus, multithreadedarchitectures (in the more narrow senseapplied here) stem from the modificationof scalar RISC, VLIW/EPIC, or super-scalar processors.

2. CLASSIFICATION OF EXPLICITMULTITHREADING TECHNIQUES

Explicit multithreaded processors fallinto two categories depending on whetherthey issue instructions from only a singlethread or from multiple threads in a givencycle.

If instructions can be issued only from asingle thread in a given cycle, the followingtwo principal techniques of explicit multi-threading are used (see Figure 1):

—Interleaved multithreading (IMT): Aninstruction of another thread is fetchedand fed into the execution pipeline ateach processor cycle (see Section 3).

—Blocked multithreading (BMT): The in-structions of a thread are executed suc-cessively until an event occurs that maycause latency. This event induces a con-text switch (see Section 4).

When instructions can be issued frommultiple threads in a given cycle, the fol-lowing technique can be used:

—Simultaneous multithreading (SMT):Instructions are simultaneously issuedfrom multiple threads to the executionunits of a superscalar processor. Thus,the wide superscalar instruction issue iscombined with the multiple-context ap-proach (see Section 6).

Before we present these techniquesin detail, we briefly review the mainarchitectural approaches that integrateinstruction-level parallelism and thread-level parallelism.

A way to look at latencies that arisein a pipelined execution is the opportu-nity cost in terms of the instructions thatmight be processed while the pipelineis interlocked, for example, waiting fora remote reference to return. The op-portunity cost of single-issue processorsis the number of cycles lost by laten-cies. Multiple-issue processors (e.g., su-perscalar, VLIW, etc.) potentially executemore than 1 IPC, and thus the oppor-tunity cost is the product of the latency


34 Ungerer et al.

Fig. 2 . Different approaches possible with single-issue (scalar) processors: (a) single-threaded scalar,(b) interleaved multithreading scalar, (c) blockedmultithreading scalar.

cycles and the issue bandwidth plus thenumber of issue slots not fully filled. Weexpect that future single-threaded proces-sors will continue to exploit further super-scalar or other multiple-issue techniques,and thus further increase the opportunitycost of remote-memory accesses.

Figure 2 demonstrates the opportunitycost of different approaches possible withscalar processors, while the different ap-proaches possible with multiple-issue pro-cessors are given in Figure 3. In allthese cases instructions from only a sin-gle thread are issued in a given cycle.

The opportunity cost in single-threadedsuperscalar can be easily determined asthe number of empty issue slots (seeFigure 3(a)). It consists of horizontallosses (the number of empty places in notfully filled issue slot) and the even moreharmful vertical losses (cycles where noinstructions can be issued). In VLIW pro-cessors, horizontal losses appear as no-opoperations (not shown in Figure 3). The op-portunity cost of single-threaded VLIW isabout the same as single-threaded super-scalar. An IMT superscalar (or IMT VLIW)processor is able to fill the vertical lossesof the single-threaded models by instruc-tions of other threads, but not the hori-zontal losses. Further design possibilities,such as BMT superscalar or BMT VLIWprocessors, would fill several succeedingcycles with instructions of the same threadbefore context switching. The switching

event is more difficult to implement and acontext-switching overhead of one to sev-eral cycles might arise.

Let us now examine processors that canissue instructions from multiple threadsin a given cycle. In addition to the SMTprocessors, one can include in this cat-egory (by the widest of definitions) alsothe chip multiprocessor (CMP) approach.While SMT uses a monolithic design withmost resources shared among threads,CMP proposes a distributed design thatuses a collection of independent processorswith less resource sharing. For example,Figure 4a shows a four-threaded eight-issue SMT processor, and Figure 4b showsa CMP with four two-issue processors. TheSMT processor exploits ILP by selectinginstructions from any thread (four in thiscase) that can potentially issue. If onethread has high ILP, it may fill all hori-zontal slots depending on the issue strat-egy of the processor. If multiple threadseach have low ILP, instructions of severalthreads can be issued and executed simul-taneously. In the CMP processor with sev-eral multiple-issue processors (four two-issue processors in this case) on a singlechip, each CPU is assigned a thread fromwhich it can issue multiple instructionseach cycle (up to two in this case). Thus,each CPU has the same opportunity costas in a two-issue superscalar model. TheCMP is not able to hide latencies by issu-ing instructions of other threads. However,because horizontal losses will be smallerfor two-issue than for high-bandwidth su-perscalars, a CMP of four two-issue pro-cessors will reach a higher utilization thanan eight-issue superscalar processor.

There are a number of tradeoffs to con-sider when choosing between SMT andCMP. The present report does not surveyCMP design so these tradeoffs are out-side its scope (but see Section 6 and theconclusions).

3. INTERLEAVED MULTITHREADING

3.1. Principles

The interleaved multithreading (IMT)technique (also called fine-grain mul-tithreading) means that the processor



Fig. 3 . Different approaches possible with multiple-issue processors: (a) single-threaded superscalar, (b) interleaved multithreading superscalar, (c) blockedmultithreading superscalar.

Fig. 4 . Issuing from multiple threads in a cycle: (a) simultaneous mul-tithreading, (b) chip multiprocessor.

switches to a different thread after eachinstruction fetch. In principle, an instruc-tion of a thread is fed into the pipeline af-ter the retirement of the previous instruc-tion of that thread. Since IMT eliminatescontrol and data dependences between in-structions in the pipeline, pipeline haz-ards cannot arise and the processorpipeline can be easily built without the ne-cessity of complex forwarding paths. Thisleads to a very simple and therefore poten-tially very fast pipeline—no hardware in-terlocking or data forwarding is necessary.Moreover, the context-switching overheadis zero cycles. Memory latency is toler-

ated by not scheduling a thread until thememory transaction has completed. Thismodel requires at least as many threads aspipeline stages in the processor. Interleav-ing the instructions from many threadslimits the processing power accessible toa single thread, thereby degrading thesingle-thread performance. Two possibil-ities to overcome this deficiency are thefollowing:

1. The dependence look-ahead techniqueadds several bits to each instructionformat in the ISA. The additional op-code bits allow the compiler to state the


36 Ungerer et al.

number of instructions directly follow-ing in program order that are not data-or control-dependent on the instructionbeing executed. This allows the instruc-tion scheduler in the IMT processor tofeed non-data- or control-dependent in-structions of the same thread succes-sively into the pipeline. The dependencelook-ahead technique may be applied tospeed up single-thread performance orto increase machine utilization in thecase where a workload does not com-prise enough threads.

2. The interleaving technique proposed byLaudon, Gupta, and Horowitz [1994],adds caching and full pipeline inter-locks to the IMT technique. Contextsmay be interleaved on a cycle-by-cyclebasis, yet a single-thread context is alsoefficiently supported.

The most well-known examples of IMTprocessors are used in the Heteroge-neous Element Processor (HEP), theHorizon, and the Cray Multi-ThreadedArchitecture (MTA) multiprocessors (seeSection 5.1). The HEP system has up to16 processors while the other two con-sist of up to 256 processors. Each of theseprocessors supports up to 128 threads.While HEP uses instruction look-aheadonly if there is no other work, the Horizonand Cray MTA employ the explicit depen-dence look-ahead technique. Further IMTprocessor proposals include the Multil-isp Architecture for Symbolic Applications(MASA), the MIT M-Machine, the Me-dia Processor of MicroUnity, and the pro-cessor SB-PRAM/HPP (all covered in thenext section). An example of an IMT net-work processor is the five-threaded LextraLX4580 (see Section 5.2).

In principle, the IMT technique can alsobe combined with a superscalar instruc-tion issue, but simulations confirm the in-tuition that SMT is the more efficient tech-nique [Eggers et al. 1997].

3.2. Examples of Past Commercial Machinesand of Research Prototypes

HEP. The Heterogeneous Element Pro-cessor system, designed by Burton Smith

and developed by Denelcor Inc., in Denver,CO, between 1978 and 1985, was a pio-neering example of a multithreaded ma-chine [Smith 1981]. The HEP system[Smith 1985] was designed to have up to16 processors (Figure 5) with up to 128threads per processor. The 128 threadswere supported by a large number of reg-isters dynamically allocated to threads.The processor pipeline had eight stages,matching the minimum number of proces-sor cycles necessary to fetch a data itemfrom memory to a register. Consequentlyup to eight threads were in execution con-currently within a single HEP processor.However, the pipeline did not allow morethan one memory, branch, or divide in-struction to be in the pipeline at the giventime. Allowing only a single memory in-struction in the pipeline at a time resultedin poor single-thread performance and re-quired a very large number of threads. Ifthread queues were all empty, the next in-struction from the last thread dispatchedwas examined for independence from pre-vious instructions and, if found to be so,the instruction was also issued.

In contrast to all other IMT processors,all threads within a HEP processor sharedthe same register set. Multiple processorsand data memories were interconnectedvia a pipelined switch and any register-memory or data-memory location could beused to synchronize two processes on aproducer–consumer basis by a full/emptybit synchronization on a data memoryword.

MASA. The Multilisp Architecturefor Symbolic Applications [Halstead andFujita 1988] was an IMT processor pro-posal for parallel symbolic computationwith various features intended for effec-tive Multilisp [Halstead 1985] programexecution. MASA featured a tagged archi-tecture, multiple contexts, fast trap han-dling, and a synchronization bit in everymemory word. Its principal novelty wasthe use of multiple contexts both to sup-port interleaved execution from separateinstruction streams and to speed up proce-dure calls and trap handling (in the samemanner as register windows).



Fig. 5 . The HEP processor.

M-Machine. The MIT M-Machine [Filloet al. 1995] supports both public andprivate registers for each thread, anduses the IMT technique. Each processorsupports four hardware resident user V-threads, and each V-thread supports fourresident H-threads. All the H-threads ina given V-thread share the same addressspace, and each H-thread instruction isa three-wide VLIW. Event and exceptionhandling are each performed by a separateV-thread. Swapping a processor-residentV-thread with one stored in memory re-quires about 150 cycles (1.5 µs). TheM-Machine (like HEP, Horizon, and CrayMTA) employs full-empty bits for effi-cient, low-level synchronization. Moreoverit supports message passing and guardedpointers with base and bounds for accesscontrol and memory protection.

MicroUnity MediaProcessor. MicroUnity[Hansen 1996] proposed its MediaProces-sor in 1996 as a processor specialized

for video streaming applications. A su-perscalar processor kernel is enhancedby group instructions dedicated to videostreaming applications and by five threadcontexts. Instructions of the differentthreads are executed in the IMT fash-ion issuing instructions with a proposed1-GHz clock frequency providing fiveindependent 200-MHz threads. A verylight-weight context switch upon syn-chronous exceptions and asynchronousevents permits rapid handling of vir-tual memory system exceptions and I/Oevents.

SB-PRAM and HPP. The SB-PRAM (SBstands for Saarbrucken and PRAM for par-allel random access machine) [Bach et al.1997] or high-performance PRAM (HPP)[Formella et al. 1996] is a multiple in-struction multiple data (MIMD) parallelcomputer with shared address space anduniform memory access time due to its mo-tivation: building a multiprocessor that is


38 Ungerer et al.

Fig. 6 . Blocked multithreading.

as close as possible to the theoretical ma-chine model CRCW-PRAM (CRCW standsfor concurrent read concurrent write). Pro-cessor and memory modules are con-nected by a pipelined combining butter-fly network. Network latency is hiddenby pipelining several so-called virtual pro-cessors on one physical processor nodein IMT mode. Instructions of 32 virtualprocessors are interleaved round-robin ina single SB-PRAM processor, which istherefore classified as 32-threaded IMTprocessor. The project is in progress atthe University of Saarland, Saarbrucken,Germany. A first prototype was runningwith four processors, and recently a 64-processor model which in total supportsup to 2048 threads has been completed.Hardware details and performance mea-surements are reported by Paul, Bach,Bosch, Fischer, Lichtenau, and Rohrig[2002].

4. BLOCKED MULTITHREADING

4.1. Principles

The blocked multithreading (BMT) tech-nique (sometimes also called coarse-grainmultithreading) means that a singlethread is executed until it reaches a situa-tion that triggers a context switch. Usuallysuch a situation arises when the instruc-tion execution reaches a long-latency oper-ation or a situation where a latency mayarise. Compared to the IMT technique, asmaller number of threads is needed and asingle thread can execute at full speed un-til the next context switch. If only a single

thread runs on a BMT processor, there areno context switches and the processor per-forms just like a standard processor with-out multithreading.

In the following we classify BMT pro-cessors by the event that triggers acontext switch (Figure 6) [Kreuzinger andUngerer 1999]:

1. Static models: A context switch occurseach time the same instruction is ex-ecuted in the instruction stream. Thecontext switch is encoded by the com-piler. The main advantage of this tech-nique is that context switching can betriggered already in the fetch stageof the pipeline. The context switchingoverhead is one cycle (if the fetchedinstruction triggers the context switchand is discarded), zero cycles (if thefetched instruction triggers a contextswitch but is still executed in thepipeline), and almost zero cycles (if acontext switch buffer is applied; see theRhamma processor below in this sec-tion). There are two main variations ofthe static BMT technique:—The explicit-switch static model,

where an explicit context switch in-struction exists. This model is simpleto implement. However, the addi-tional instruction causes one-cyclecontext switching overhead.

—The implicit-switch model, whereeach instruction belongs to a specificinstruction class and a context switchdecision depends on the instructionclass of the fetched instruction. In-struction classes that cause context



switch include load, store, and branchinstructions.The switch-on-load static model

switches after each load instruction tobridge memory access latency. How-ever, assuming an on-chip D-cache, thethread switch occurs more often thannecessary, which makes an extremelyfast context switch necessary, prefer-ably with zero-cycle context switchoverhead.

The switch-on-store static modelswitches after store instructions. Themodel may be used to support the im-plementation of sequential consistencyso that the next memory access instruc-tion can only be performed after thestore has completed in memory.

The switch-on-branch static modelswitches after branch instructions. Themodel can be applied to simplify pro-cessor design by renouncing branchprediction and speculative execution.The branch misspeculation penalty isavoided, but single-thread performanceis decreased. However, it may be effec-tive for programs with a high percent-age of branches that are hard to predictor even are unpredictable.

2. Dynamic models: The context switch istriggered by a dynamic event. In gen-eral, all the instructions between thefetch stage and the stage that triggersthe context switch are discarded, lead-ing to a higher context switch overheadthan static context switch models. Sev-eral dynamic models can be defined:—The switch-on-cache-miss dynamic

model switches the context if a load orstore misses in the cache. The idea isthat only those loads that miss in thecache and those stores that cannotbe buffered have long latencies andcause context switches. Such a con-text switch is detected in a late stageof the pipeline. A large number of sub-sequent instructions have already en-tered the pipeline and must be dis-carded. Thus context switch overheadis considerably increased.

—The switch-on-signal dynamic modelswitches context on the occurrence

of a specific signal, for example, sig-naling an interrupt, trap, or messagearrival.

—The switch-on-use dynamic modelswitches when an instruction tries touse the still missing value from a load(which, for example, missed in thecache). For example, when a compilerschedules instructions so that a loadfrom shared memory is issued severalcycles before the value is used, thecontext switch should not occur untilthe actual use of the value and willnot occur at all if the load completesbefore the register is referenced.

To implement the switch-on-usemodel, a valid bit is added to eachregister (by a simple form of score-board). The bit is cleared when a loadto the corresponding register is is-sued and set when the result returnsfrom the network. A thread switchescontext if it needs a value from a reg-ister whose valid bit is still cleared.This model can also be seen as alazy model that extends either theswitch-on-load static model (calledlazy-switch-on-load) or the switch-on-cache-miss dynamic model (calledlazy-switch-on-cache-miss).

—The conditional-switch dynamicmodel couples an explicit switchinstruction with a condition. The con-text is switched only when the condi-tion is fulfilled; otherwise the contextswitch is ignored. A conditional-switch instruction may be used, forexample, after a group of load/storeinstructions. The context switch isignored if all load instructions (inthe preceding group) hit the cache;otherwise, the context switch isperformed. Moreover, a conditional-switch instruction could also beadded between a group of loads andtheir subsequent use to realize alazy context switch (instead of imple-menting the switch-on-use model).

The explicit-switch, conditional-switch,and switch-on-signal techniques enhancethe ISA by additional instructions. Theimplicit switch technique may favor a


40 Ungerer et al.

specific ISA encoding to simplify instruc-tion class detection. All other techniquesare microarchitectural techniques withoutthe necessity of ISA changes.

A previous classification [Boothe andRanade 1992; Boothe 1993] concerns theBMT technique only in a shared-memorymultiprocessor environment and is re-stricted to only a few of the models of theBMT technique described above. In par-ticular, the switch-on-load in [Boothe andRanade 1992] switches only on instruc-tions that load data from remote memory,while storing data in remote memory doesnot cause context switching. Likewise, theswitch-on-miss model is defined so that thecontext is only switched if a load from re-mote memory misses in the cache.

Several well-known processors usethe BMT technique. The MIT Sparcle[Agarwal et al. 1993] and the MSparcprocessors [Mikschl and Damm 1996]use both switch-on-cache-miss andswitch-on-signal dynamic models. TheCHoPP 1 [Mankovic et al. 1987] usesthe switch-on-cache-miss and switch-on-use dynamic models, while Rhamma[Gruenewald and Ungerer 1996, 1997]applies several static and dynamic modelsof BMT. The EVENTS mechanism andthe Komodo microcontroller proposalapply multithreading for real-time eventhandling. These, as well as some otherprocessors using the BMT technique, aredescribed in the next section, while thecurrent commercial high-performanceprocessors—that is, the two-threadedIBM RS64 IV, the Sun MAJC—and thenetwork processors of the Intel IXP, IBMPower NP, Vitesse IQ2x00, and AMCC nPfamilies are covered in Section 5.

4.2. Examples of Past Commercial Machinesand of Research Prototypes

CHoPP 1. The Columbia HomogeneousParallel Processor 1 (CHoPP 1) [Mankovicet al. 1987] was designed by CHoPP Sul-livan Computer Corporation (ANTs since1999) in 1987. The system was a shared-memory MIMD with up to 16 powerfulprocessors. High sequential performanceis due to issuing multiple instructions

on each clock cycle, zero-delay branchinstructions, and fast execution of indi-vidual instructions. Each processor cansupport up to 64 threads and uses theswitch-on-cache-miss dynamic interleav-ing model.

MDP in J-Machine. The MIT JellybeanMachine (J-Machine) [Dally et al. 1992] isso-called because it is to be built entirely ofa large number of ”jellybean” components.The initial version uses an 8× 8× 16 cubenetwork, with possibilities of expanding to64K nodes. The ”jellybeans” are message-driven processor (MDP) chips, each ofwhich has a 36-bit processor, a 4K wordmemory, and a router with communica-tion ports for bidirectional transmissionsin three dimensions. External memory ofup to 1M words can be added per processor.The MDP creates a task for each arrivingmessage. In the prototype, each MDP chiphas four external memory chips that pro-vide 256K memory words. However, accessis through a 12-bit data bus, and with anerror correcting cycle, the access time isfour memory cycles per word. Each com-munication port has a 9-bit data channel.The routers provide support for automaticrouting from source to destination. Thelatency of a transfer is 2 µs across theprototype machine, assuming no blocking.When a message arrives, a task is createdautomatically to handle it in 1 µs. Thus, itis possible to implement a shared-memorymodel using message passing, in which amessage provides a fetch address and anautomatic task sends a reply with the de-sired data.

MIT Sparcle. The MIT Sparcle proces-sor [Agarwal et al. 1993] is derived froma SPARC RISC processor. The eight over-lapping register windows of a SPARC pro-cessor are organized as four independentnonoverlapping thread contexts, each us-ing two windows, one as a register set, theother as a context for trap and messagehandlers (Figure 7).

Context switches are used only tohide long memory latencies since smallpipeline delays are assumed to be hiddenby the proper ordering of instructions by



Fig. 7 . Sparcle register usage.

an optimizing compiler. The MIT Spar-cle processor switches to another contextin the case of a remote cache miss ora failed synchronization (switch-on-cache-miss and switch-on-signal strategies).Thread switching is triggered by externalhardware, that is, by the cache/directorycontroller. Reloading of the pipeline andthe software implementation of the con-text switch requires 14 processor cycles.

The MIT Alewife distributed shared-memory multiprocessor [Agarwal et al.1995] is based on the multithreaded MITSparcle processor. The multiprocessor hasbeen operational since May 1994. A nodein the Alewife multiprocessor comprisesa Sparcle processor, an external floating-point unit, cache, and a directory-basedcache controller that manages cache-coherence, a network router chip, and amemory module (Figure 8).

The Alewife multiprocessor uses alow-dimensional direct interconnectionnetwork. Despite its distributed-memoryarchitecture, Alewife allows efficientshared-memory programming througha multilayered approach to localitymanagement. Communication latencyand network bandwidth requirementsare reduced by a directory-based cache-coherence scheme referred to as Limit-LESS directories. Latencies still occuralthough communication locality is en-

Fig. 8 . An Alewife node.

hanced by run-time and compile-timepartitioning and placement of data andprocesses.

Rhamma. The Rhamma processor[Gruenewald and Ungerer 1996, 1997]was developed at the University of Karl-sruhe, Germany, as an experimentalmicroprocessor that bridges all kinds oflatencies by a very fast context switch.The execution unit (EU) and load/storeunit (LSU) are decoupled and work con-currently on different threads. A numberof register sets used by different threadsare accessed by the LSU as well as theEU. Both units are connected by FIFObuffers for continuations, each denotingthe thread tag and the instruction pointer


42 Ungerer et al.

of a thread. The Rhamma processor com-bines several static and dynamic modelsof the BMT technique.

The EU of the Rhamma processorswitches the thread whenever a load/storeor control-flow instruction is fetched.Therefore, if the instruction format is cho-sen appropriately, a context switch canbe recognized by predecoding the instruc-tions already during the instruction fetchstage.

In the case of a load/store, the continu-ation is stored in the FIFO buffer of theLSU, a new continuation is fetched fromthe EU’s FIFO buffer, and the context isswitched to the register set given by thenew thread tag and the new instructionpointer. The LSU loads and stores data ofa different thread in a register set concur-rently to the work of the EU. There is a lossof one processor cycle in the EU for each se-quence of load/store instructions. Only thefirst load/store instruction forces a contextswitch in the EU; succeeding load/store in-structions are executed in the LSU.

The loss of one processor cycle in theEU when fetching the first of a se-quence of load/store instructions can beavoided by a so-called context switch buffer(CSB) whenever the sequence is executedthe second time. The CSB is a hard-ware buffer, collecting the addresses ofload/store instructions that have causedcontext switches. Before fetching an in-struction from the I-cache, the instructionfetch stage checks whether the addresscan be found in the context switch buffer.In that case the thread is switched im-mediately and an instruction of anotherthread is fetched instead of the load/storeinstruction. No bubble occurs.

PL/PS-Machine. The Preload and Post-store Machine (or PL/PS-Machine) [Kaviet al. 1997] is most similar to the Rhammaprocessor. It also decouples memory ac-cesses from thread execution by pro-viding separate units. This decouplingeliminates thread stalls due to memoryaccesses and makes thread switches dueto cache misses unnecessary. Threads arecreated when all data is preloaded intothe register set holding the thread’s con-

text, and the results from an executionthread are poststored. Threads are non-blocking and each thread is enabled whenthe required inputs are available (i.e.,data-driven at a coarse grain). The sep-arate load/store/sync processor performspreloads and schedules ready threads onthe pipeline. The pipeline processor exe-cutes the threads which will require nomemory accesses. On completion the re-sults from the thread are poststored by theload/store/sync processor.

The DanSoft nanothreading approach.The nanothreading and the microthread-ing approaches use multithreadingbut spare the hardware complexity ofproviding multiple register sets.

Nanothreading [Gwennap 1997], pro-posed for the DanSoft processor, dismissesfull multithreading for a nanothread thatexecutes in the same register set as themain thread. The DanSoft nanothread re-quires only a 9-bit program counter andsome simple control logic, and it residesin the same page as the main thread.Whenever the processor stalls on the mainthread, it automatically begins fetchinginstructions from the nanothread. Onlyone register set is available, so the twothreads must share the register set. Typ-ically the nanothread will focus on a sim-ple task, such as prefetching data into abuffer, that can be done asynchronously tothe main thread.

In the DanSoft processor nanothreadingis used to implement a new branch strat-egy that fetches both sides of a branch. Astatic branch prediction scheme is used,where branch instructions include 3 bitsto direct the instruction stream. The bitsspecify eight levels of branch direction. Forthe middle four cases, denoting low confi-dence on the branch prediction, the pro-cessor fetches from both the branch targetand the fall-through path. If the branchis mispredicted in the main thread, thebackup path executed in the nanothreadgenerates a misprediction penalty of onlyone to two cycles.

The DanSoft processor proposal is adual-processor CMP, each processor fea-turing a VLIW instruction set and thenanothreading technique. Each processor



contains an integer processor, but the twoprocessor cores share a floating-point unitas well as the system interface.

However, the nanothread techniquemight also be used to fill the instructionissue slots of a wide superscalar approachas in SMT.

Microthreading. The microthreadingtechnique [Bolychevsky et al. 1996] issimilar to nanothreading. All threadsshare the same register set and the samerun-time stack. However, the number ofthreads is not restricted to two. Whena context switch arises, the programcounter is stored in a continuation queue.The PC represents the minimum possiblecontext information for a given thread.Microthreading is proposed for a modifiedRISC processor.

Both techniques—nanothreading aswell as microthreading—are proposedin the context of a BMT technique, butmight also be used to fill the instructionissue slots of a wide superscalar approachas in SMT.

The drawback to nanothreading andmicrothreading is that the compiler hasto schedule registers for all threads thatmay be active simultaneously, because allthreads execute in the same register set.

The solution to this problem has beenrecently described in Jesshope and Luo[2000] and Jesshope [2001]. These papersdescribe the dynamic allocation of regis-ters using vector instruction sets and alsothe ease with which the architecture canbe developed as a CMP.

MSparc and the EVENTS mechanism. Anapproach similar to the MIT Sparcle pro-cessor was taken at the University ofOldenburg, Germany, with the MSparcprocessor [Mikschl and Damm 1996].MSparc supports up to four contexts ona chip and is compatible with standardSPARC processors. Switching is supportedby hardware and can be achieved withinone processor cycle. However, a four-cycleoverhead is introduced due to pipelinerefill. The multithreading policy is BMTwith the switch-on-cache-miss policy as inthe MIT Sparcle processor.

The rapid context-switching ability ofthe multithreading technique leads to ap-

proaches that apply multithreading innew areas, in particular embedded real-time systems, which are proposed by theEVENTS approach and by the Komodoproject.

The EVENTS scheduler [Luth et al.1997; Metzner and Niehaus 2000] actsas processor-external thread scheduler bysupervising several MSparc processors.Contest switches are triggered due toreal-time scheduling techniques and thecomputation-intensive real-time threadsare assigned to the different MSparc pro-cessors. The EVENTS mechanism is im-plemented as a field of field programmablegate arrays (FPGAs).

Komodo microcontroller. The Komodomicrocontroller [Brinkschulte et al. 1999a;1999b] implements real-time schedul-ing algorithms on an instruction-by-instruction basis deeply embedded inthe multithreaded processor core, in con-trast to the processor-external EVENTSapproach. The Komodo microcontrolleris a multithreaded Java microcontrolleraimed at embedded real-time systemswith a hardware event handling mecha-nism that allows handling of simultaneousoverlapping events with hard real-timerequirements. The main purpose for theuse of multithreading within the Komodomicrocontroller is not latency utilization,but extremely fast reactions on real-timeevents.

Real-time Java threads are used asinterrupt service threads (ISTs)—a newhardware event handling mechanism thatreplaces the common interrupt service rou-tines (ISRs). The occurrence of an eventactivates an assigned IST instead of anISR. The Komodo microcontroller sup-ports the concurrent execution of multi-ple ISTs and zero-cycle overhead contextswitching, and triggers ISTs by a switch-on-signal context switching strategy.

As shown in Figure 9, the four-stagepipelined processor core consists of aninstruction-fetch unit, a decode unit, anoperand fetch unit, and an execution unit.Four stack register sets are provided onthe processor chip. A Java bytecode in-struction is decoded either to a single


44 Ungerer et al.

Fig. 9 . Block diagram of the Komodo microcontroller.

micro-op or a sequence of micro-ops, ora trap routine is called. Each micro-opis propagated through the pipeline withits thread id. Micro-ops from multiplethreads can be simultaneously present inthe different pipeline stages.

The instruction fetch unit holds fourprogram counters (PCs) with dedicatedstatus bits (e.g., thread active/suspended);each PC is assigned to a separate thread.Four byte portions are fetched over thememory interface and put in the appropri-ate instruction window (IW). Several in-structions may be contained in the fetchportion because of the average Java byte-code length of 1.8 bytes. Instructions arefetched depending on the status bits andfill levels of the IWs.

The instruction decode unit contains theIWs, dedicated status bits (e.g., priority),and counters for the implementation ofthe proportional share scheme. A prioritymanager decides from which IW the nextinstruction will be decoded.

The real-time scheduling algorithmsfixed priority preemptive, earliest dead-line first, least laxity first, and guaranteedpercentage (GP) scheduling are imple-mented in the priority manager for next

instruction selection [Kreuzinger et al.2000]. The GP scheduling scheme assignsportions of the processor execution powerto each hardware thread and allows astrict isolation of the different real-timethreads—an urgently demanded featureof real-time systems. GP scheduling is onlyefficient if implemented within a multi-threaded processor [Brinkschulte et al.2002].

The priority manager applies one ofthe implemented scheduling schemes forIW selection. However, latencies may re-sult from branches or memory accesses.To avoid pipeline stalls, instructions fromthreads of less priority can be fed into thepipeline. The decode unit predicts the la-tency after such an instruction, and pro-ceeds with instructions from other IWs.

External signals are delivered to the sig-nal unit from the peripheral componentsof the microcontroller core as, for example,timer, counter, or serial interface. By theoccurrence of such a signal, the corre-sponding IST will be activated. As soon asan IST activation ends, its assigned real-time thread is suspended and its statusis stored. An external signal may activatethe same thread again.



Further investigations within theKomodo project [Brinkschulte et al.2000] have covered real-time garbagecollection on a multithreaded microcon-troller [Fuhrmann et al. 2001], and adistributed real-time middleware calledOSA+ [Brinkschulte et al. 2000].

5. INTERLEAVED AND BLOCKEDMULTITHREADING IN CURRENT ANDPROPOSED MICROPROCESSORS

5.1. High-Performance Processors

Cray MTA. The Cray Multi-ThreadedArchitecture (MTA) computer [Alversonet al. 1990] features a powerful VLIW in-struction set and interleaved multithread-ing technique. It was designed by theTera Computer Company, Seattle, WA,now Cray Corporation.1

The MTA inherits much of the de-sign philosophy of the HEP as well asa (paper) architecture called Horizon[Thistle and Smith 1988]. The latter wasdesigned for up to 256 processors andup to 512 memory modules in a 16 ×16 × 6-node internal network. Horizon(like HEP) employed a global addressspace and memory-based synchronizationthrough the use of full/empty bits at eachlocation. Each processor supported 128independent instruction streams by 128register sets with context switches occur-ring at every clock cycle. Unlike HEP,Horizon allowed multiple memory opera-tions from a thread to be in the pipelinesimultaneously.

MTA systems are constructed from re-source modules, each of which contains upto six resources. A resource can be a com-putational processor (CP), an I/O proces-sor (IOP), an I/O cache (IOC) unit, andeither two or four memory units (MUs)(Figure 10). Each resource is individuallyconnected to a separate routing node inthe depopulated three-dimensional (3-D)torus interconnection network [Almasiand Gottlieb 1994]. Each routing node hasthree or four communication ports and a

1 Tera purchased Cray and adopted the name.

Fig. 10 . The Cray MTA computer system.

resource port. There are several routingnodes per CP, rather than the several CPsper routing node. In particular, the num-ber of routing nodes is at least p3/2, wherep is the number of CPs. This allows the bi-section bandwidth to scale linearly with p,while the network latency scales as p1/2.The communication link is capable of sup-porting data transfers to and from mem-ory on each clock tick in both directions,as are all of the links between the routingnodes themselves.

The Cray MTA custom chip CP(Figure 11) is a VLIW pipelined processorusing the IMT technique. Each thread isassociated with one 64-bit stream statusword, thirty-two 64-bit general registers,and eight 64-bit target registers. Theprocessor may switch context every cycle(3-ns cycle period) between as many as128 distinct threads (called streams bythe designers of the MTA), thereby hidingup to 128 cycles (384 ns) of memorylatency. Since the context switching is sofast, the processor has no time to swapthe processor state. Instead, it has 128of everything, that is, 128 stream statuswords, 4096 general registers, and 1024target registers. Dependences betweeninstructions are explicitly encoded bythe compiler using explicit dependencelook-ahead. Each instruction containsa 3-bit look-ahead field that explicitlyspecifies how many instructions from thisthread will be issued before encounter-ing an instruction that depends on thecurrent one. Since seven is the maximumpossible look-ahead value, at most eightinstructions (i.e., 24 operations) from each


46 Ungerer et al.

Fig. 11 . The MTA computational processor.

thread can be concurrently executing indifferent stages of a processor’s pipeline.In addition, each thread can issue asmany as eight memory references withoutwaiting for earlier ones to finish, furtheraugmenting the memory latency toleranceof the processor. The CP has a load/storearchitecture with three addressing modesand 32 general-purpose 64-bit registers.The three-wide VLIW instructions are64 bits. Three operations can be exe-cuted simultaneously per instruction: amemory reference operation (M-op), anarithmetic/logical operation (A-op), and abranch or simple arithmetic/logical opera-tion (C-op). If more than one operation inan instruction specifies the same registeror setting of condition codes, the priorityis M-opÂA-opÂC-op.

Every processor has a clock register thatis synchronized with its counterparts inthe other processors and counts up onceper cycle. In addition, the processor countsthe total number of unused instructionissue slots (measuring the degree of un-derutilization of the processor) and the

time-integral of the number of instructionstreams ready to issue (measuring the de-gree of overutilization of the processor).All three counters are user-readable in asingle unprivileged operation. Eight coun-ters are implemented in each of the pro-tection domains of the processor. All areuser-readable in a single unprivileged op-eration. Four of these counters accumu-late the number of instructions issued,the number of memory references, thetime-integral of the number of instruc-tion streams, and the time-integral of thenumber of messages in the network. Thesecounters are also used for job accounting.The other four counters are configurable toaccumulate events from any four of a largeset of additional sources within the proces-sor, including memory operations, jumps,traps, and so on.

Thus, the Cray MTA exploits paral-lelism at all levels, from fine-grainedILP within a single processor to parallelprogramming across processors, to multi-programming among several applicationssimultaneously. Consequently, processor



scheduling occurs at many levels, andmanaging these levels poses uniqueand challenging scheduling concerns[Alverson et al. 1995].

After many delays the MTA reached themarket in December 1997 when a single-processor system (clock speed 145 MHz)was delivered to the San Diego Supercom-puter Center. In May 1999, the systemwas upgraded to eight processors and anetwork with 64 routing nodes. Each pro-cessor runs at 260 MHz and is theoreti-cally capable of 780 MFLOPS. This ma-chine was effectively retired in September2001.

Early MTA systems had been built us-ing GaAs technology for all logic design.As a result of the semiconductor market’sfocus on CMOS technology for computersystems, Cray started a transition to us-ing CMOS technology in the MTA. Thelatest in this development is the MTA-2,which is configured with 28 processors and112 gigabytes (GB) of memory. The firstCray MTA-2 supercomputer system wasshipped in late December 2001.

HTMT Architecture Project. In 1997, theJet Propulsion Laboratory in collabo-ration with several other institutions(The California Institute of Technology,Princeton University, University of NotreDame, University of Delaware, The StateUniversity of New York at Stonybrook,Tera Computer Company (now Cray),Argonne National Laboratories) initiatedthe Hybrid Technology MultiThreaded(HTMT) Architecture Project whose aimis to design the first petaflops computerby 2005–2007. The computer will be basedon radically new hardware technologiessuch as superconductive processors, op-tical holographic storages, and opticalpacket switched networks [Sterling 1997].There will be 4096 superconductive pro-cessors, called SPELL. A SPELL processorconsists of 16 multistream units, whereeach multistream unit is a 64-bit, deeplypipelined integer processor capable of ex-ecuting up to eight parallel threads. As aresult, each SPELL is capable of runningin parallel up to 128 threads, arranged in16 groups of 8 threads [Dorojevets 2000].

Sun MAJC. Sun proposed in 1999 itsMAJC-5200 [Tremblay 1999] that can beclassified as a dual-processor chip withBMT processors. Special Java-directed in-structions are provided, motivating theacronym MAJC for Micro Architecturefor Java Computing. Instruction, data,thread, and process-level parallelism isexploited in the basic MAJC architectureby supporting explicit multithreading (so-called vertical multithreading), implicitmultithreading (called speculative mul-tithreading), and chip multiprocessors.Instruction-level parallelism is utilized byVLIW packets containing from one to fourinstructions and data-level parallelismthrough single-instruction, multiple-data(SIMD) instructions in particular for mul-timedia applications. Thread-level par-allelism is utilized through compiler-supported explicit multithreading. Thearchitecture supports combining severalmultithreaded processors on a chip to har-ness process-level parallelism.

Single-threaded-program executionmay be accelerated by speculative mul-tithreading with microthreads that aredependent on nonspeculative threads[Tremblay et al. 2000]. Virtual channelscommunicate shared register valuesbetween different microthreads withproduced-consumer synchronization. Incase of misspeculation, the speculativemicrothread is discarded.

IBM RS64 IV. IBM developed a mul-tithreaded PowerPC processor, which isused in the IBM iSeries and pSeries com-mercial servers. The processor—originallycode-named SStar—is called RS64 IV andbecame available for purchase in thefourth quarter of 2000. Because of itsoptimization for commercial server work-load (i.e., on-line transaction processing,enterprise resource planning, Web serv-ing, and collaborative groupware) withtypically large and function-rich applica-tions, a two-threaded block-interleavingapproach with a switch-on-cache-missmodel was chosen. A so-called thread-switch buffer, which holds up to eight in-structions from the background thread,reduces the cost of pipeline reload after


48 Ungerer et al.

a thread switch. These instructions maybe introduced into the pipeline immedi-ately after a thread switch. A significantthroughput increase by multithreadingwhile adding less than 5% to the chip areawas reported in Borkenhagen et al. [2000].

5.2. Network Processors

Currently multithreading proves success-ful in network processors. Network pro-cessors are expected to become the funda-mental building blocks for networks in thesame fashion that microprocessors are fortoday’s personal computers [Glaskowsky2002]. Network processors offer real-time processing of multiple data streams,enhanced security, and Internet proto-col (IP) packet handling and forwardingcapabilities.

Network processors apply multithread-ing to bridge latencies during memoryaccesses. Hard real-time events, that is,the deadline should never be missed,are requirements for network processorsthat need specific instruction schedulingduring multithreaded execution. Multi-threading is typically restricted to the in-ternal architecture of the parallel workingpicoprocessors or microengines that per-form the data traffic handling.

Intel IXP network processors. The In-tel IXP1200 network processor [IntelCorporation 2002]—the first in theIntel Internet Exchange Architecture(Intel IXA) network processor familythat uses multithreading—combines anIntel StrongARM processor core withsix programmable multithreaded micro-engines. The IXP2400 network processoris based on an Intel XScale core witheight multithreaded microengines, andthe IXP2800 contains even 16 multi-threaded microengines [Intel Corporation2002]. The microengines (four-threadedBMT) are used for packet forwarding andtraffic management, whereas the XScaleprocessor core is applied for control tasks.Data and event signals are shared amongthreads and microengines at virtual zerolatency while maintaining coherence.The IXP2800 is claimed to be capable of

enabling 10-Gb/s wire-speed processing(Gb = gigabits).

IBM PowerNP network processor. TheIBM PowerNP NP46GS3 network proces-sor [IBM Corporation 1999] integrates aswitching engine, search engine, frameprocessors, and multiplexed medium ac-cess controllers (MACs). The NP4GS3includes an embedded IBM PowerPC405 processor core, 16 programmablemultithreaded picoprocessors, 40 FastEthernet/4-Gb MACs, and hardware ac-celerators performing functions such astree searches, frame forwarding, frame fil-tering, and frame alteration.

The multithreaded picoprocessors areused for packet manipulation while theprocessor core handles control functions.Each picoprocessor is a 32-bit, 133-MHzRISC core that supports two threads inthe hardware and includes nine hardwiredfunction units for tasks such as stringcopying, bandwidth-policy checking, andcheck summing.

The aggregated bandwidth of theNP4GS3 is some 4 Gb/s, allowing it tomanage a single OC-48 optical channel orup to 40 Fast Ethernet ports [Glaskowsky2002].

Vitesse IQ2x00 network processors. TheVitesse IQ2x00 family consists of IQ2000and IQ2200 network processors, whose ar-chitecture is based on multiple packed pro-cessors. Each packed processor is a five-threaded BMT that switches on cache miss[Glaskowsky 2002].

Lexra network processors. The Lexra net-work processor employs packet processors,which are used in parallel to implement IPlayer processing in software. The currentpacked processor is the LX4580 [Gelinaset al. 2002]. It implements the MIPS32 ar-chitecture, including recent architecturalenhancements, along with instructions foroptimized packet processing. By contrastwith the Intel IXP and Vitesse IQ2x00, theLX4580 uses the IMT approach (with fivethreads).

AMCC nP network processors. AMCC’s nPnetwork processor family is based on the



MMC Networks’ nPcore multithreadedpacket processor. Each processor in thisfamily consists of one or several nPcoreprocessing units that are connected to anexternal search coprocessor and a hostCPU [Glaskowsky 2002]. In particular,the nP7250 network processor has twonPcores with 2.4 Gb/s aggregated band-width. The current top of AMCC’s line isthe nP7510, whose six nPcore processingunits offer 10 Gb/s.

Other network processors. A number ofother network processors, such as theMotorola C-5, and the Fast Pattern Pro-cessor (as part of Agere’s PayloadPlus chipset) seem to include multithreading fea-tures. Unfortunately, it is often hard tofind out, because multithreading is notthe main feature of network processorsand often hardly mentioned. The mar-ket for network processors is extremelycompetitive, so the above-mentioned pro-cessors will soon be replaced by theirsuccessors.

6. SIMULTANEOUS MULTITHREADING

6.1. Principles

IMT and BMT are techniques which aremost efficient when applied to scalar RISCor VLIW processors. Combining multi-threading with the superscalar techniquenaturally leads to a technique whereall hardware contexts are active simul-taneously, competing each cycle for allavailable resources. This technique, calledsimultaneous multithreading (SMT), in-herits from superscalars the ability to is-sue multiple instructions each cycle; andlike multithreaded processors it containshardware resources for multiple contexts.The result is a processor that can is-sue multiple instructions from multiplethreads each cycle. Therefore, not onlycan unused cycles in the case of laten-cies be filled by instructions of alterna-tive threads, but so can unused issue slotswithin one cycle.

Thread-level parallelism can come fromeither multithreaded, parallel programsor from multiple, independent programsin a multiprogramming workload, while

ILP is utilized from the individualthreads. Because an SMT processor si-multaneously exploits coarse- and fine-grain parallelism, it uses its resourcesmore efficiently and thus achieves betterthroughput and speedup than single-threaded superscalar processors for mul-tithreaded (or multiprogramming) work-loads. The tradeoff is a slightly morecomplex hardware organization.

If all (hardware-supported) threads ofan SMT processor always execute threadsof the same process, preferably in single-program multiple-data fashion, a uni-fied (primary) I-cache may prove useful,since the code can be shared between thethreads. Primary D-cache may be unifiedor separated between the threads depend-ing on the access mechanism used.

The SMT technique combines a widesuperscalar instruction issue with mul-tithreading by providing several registersets on the processor and issuing instruc-tions from several instruction queues si-multaneously. Therefore, the issue slots ofa wide-issue processor can be filled by op-erations of several threads. Latencies oc-curring in the execution of single threadsare bridged by issuing operations of the re-maining threads loaded on the processor.In principle, the full issue bandwidth canbe utilized.

The resources of SMT processors can beorganized in two ways:

1. Resource sharing: Instructions of dif-ferent threads share all resources likethe fetch buffer, the physical registersfor renaming registers of different reg-ister sets, the instruction window, andthe reorder buffer. Thus SMT addsminimal hardware complexity to con-ventional superscalars; hardware de-signers can focus on building a fastsingle-threaded superscalar and addmultithread capability on top. Thecomplexity added to superscalars bymultithreading includes the thread tagfor each internal instruction represen-tation, multiple register sets, and theabilities of the fetch and the retireunits to fetch/retire instructions of dif-ferent threads.


50 Ungerer et al.

2. Resource replication: The second orga-nizational form replicates all internalbuffers of a superscalar such that eachbuffer is bound to a specific thread. In-struction fetch, decode, rename, andretire units may be multiplexed be-tween the threads or be duplicatedthemselves. The issue unit is able toissue instructions of different instruc-tion windows simultaneously to theexecution units. This form of organi-zation adds more changes to the orga-nization of superscalar processors butleads to a natural partitioning of theinstruction window and simplifies theissue and retire stages.

The fetch unit of an SMT processor cantake advantage of the interthread com-petition for instruction bandwidth in twoways. First, it can partition this band-width among the threads and fetch fromseveral threads each cycle. In this way, itincreases the probability of fetching onlynonspeculative instructions. Second, thefetch unit can be selective about whichthreads it fetches. For example, it mayfetch those that will provide the most im-mediate performance benefit.

The main drawback to SMT may be thatit complicates the issue stage, which isalways central to the multiple threads.A functional partitioning as required bythe on-chip wire-delay of future micro-processors is not easily achieved with anSMT processor due to the centralized in-struction issue. A separation of the threadqueues as in the SMT approaches withresource replication is a possible solution,although it does not remove the central in-struction issue.

Closely related to SMT are architecturalproposals that only slightly enhance a su-perscalar processor by the ability to pur-sue two or more threads only for a shorttime. In principle, predication is the firststep in this direction. An enhanced formof predication is able to issue and executepredicated instruction even if the predi-cate is not yet solved. A further step is dy-namic predication [Klauser et al. 1998a]as applied for the Polypath architecture[Klauser et al. 1998b] that is a superscalar

enhanced to handle multiple threads in-ternally. Another step to multithreadingis simultaneous subordinate microthread-ing [Chappell et al. 1999], which is a mod-ification of superscalars to run threads atmicroprogram level concurrently. A newmicrothread is spawned either by event-driven by hardware or by an explicitspawn instruction. The subordinate mi-crothread could be used, for example, toimprove branch prediction of the primarythread or to preload data.

A competing approach to SMT is repre-sented by the chip multiprocessors (CMPs)[Ungerer et al. 2002]. A CMP integratestwo or more complete processors on a sin-gle chip. Every unit of a processer is du-plicated and used independently of itscopies on the chip. CMP is easier toimplement, but only SMT has the abil-ity to hide latencies. Examples of CMPare the Texas Instruments TMS320C8x[Texas Instruments 1994], the Hydra[Hammond and Olukotun 1998], the Com-paq Piranha [Barroso et al. 2000], and theIBM POWER4 chip [Tendler et al. 2002].

Projects simulating different configura-tions of SMT are discussed next, whileSMT approaches that can be found incurrent microprocessors are given inSection 7.

6.2. Examples of Prototypes and SimulatedSMT Processors

MARS-M. The MARS-M multi-threaded computer system [Dorozhevetsand Wolcott 1992] was developed andmanufactured within the Russian Next-Generation Computer Systems programduring 1982–1988. The MARS-M was thefirst system where the SMT techniquewas implemented in a real design. Thesystem has a decoupled multiprocessorarchitecture with execution, address,control, memory, and peripheral multi-threaded subsystems working in paralleland communicating via multiple registerFIFO-queues. The execution and addresssubsystems are multiple-unit VLIW pro-cessors with SMT. Within each of thesetwo subsystems up to four threads canrun in parallel on their hardware contexts



allocated by the control subsystem, whilesharing the subsystem’s set of pipelinedfunctional units and resolving resourceconflicts on a cycle-by-cycle basis. Thecontrol processor uses IMT with a zero-overhead context switch upon issuing amemory load operation. In total, up to12 threads can run simultaneously withinMARS-M with a peak instruction issuerate of 26 instructions per cycle. Medium-scale integration (MSI) and large-scaleintegration emitter-coupled logic (LSIECL) elements with a minimum delay of2.5 ns are used to implement processorlogic. There are 739 boards, each of whichcan contain up to 100 ICs mounted onboth sides. The MARS-M hardware iswater-cooled and occupies three cabinets.The initial clock period of 100 ns wasincreased to 108 ns at the debuggingstage.

Matsushita Media Research Laboratoryprocessor. The multithreaded processorof the Media Research Laboratory ofMatsushita Electric Ind. (Japan) wasanother pioneering approach to SMT[Hirata et al. 1992]. Instructions of dif-ferent threads are issued simultaneouslyto multiple execution units. Simulationresults on a parallel ray-tracing appli-cation showed that using eight threads,a speedup of 3.22 in the case of oneload/store unit, and of 5.79 in the case oftwo load/store units, can be achieved overa conventional RISC processor. However,caches or translation look-aside buffersare not simulated, nor is a branch predic-tion mechanism.

“Multistreamed superscalar” at the Univer-sity of Santa Barbara. Serrano et al. [1994]and Yamamoto and Nemirovsky [1995]at the University of Santa Barbara, CA,extended the interleaved multithreading(then called multistreaming) technique togeneral-purpose superscalar processor ar-chitecture and presented an analyticalmodel of multithreaded superscalar per-formance, backed up by simulation.

SMT processor at the University ofWashington. The SMT processor ar-chitecture, proposed by Tullsen, Eggers,

and Levy [1995] at the University ofWashington, Seattle, WA, surveys en-hancements of the Alpha 21164 processor.Simulations were conducted to evaluateprocessor configurations of an up to eight-threaded and eight-issue superscalar.This maximum configuration showed athroughput of 6.64 IPC due to multi-threading using the SPEC92 benchmarksuite and assuming a processor with 32execution units (among them multipleload/store units). Descriptions and the re-sults are based on the multiprogrammingmodel.

The next approach was based on a hy-pothetical out-of-order instruction issuesuperscalar microprocessor that resem-bles the MIPS R10000 and HP PA-8000[Tullsen et al. 1996; Eggers et al. 1997].Both approaches followed the resource-sharing organization aiming at an onlyslight hardware enhancement of a su-perscalar processor. The latter approachevaluated more realistic processor con-figurations, and presented implementa-tion issues and solutions to register fileaccess and instruction scheduling for aminimal change to superscalar processororganization.

In the simulations of the latter archi-tectural model (Figure 12), eight threadsand an eight-issue superscalar organiza-tion are assumed. Eight instructions aredecoded, renamed, and fed to either theinteger or floating-point instruction win-dow. Unified buffers are used. Whenoperands become available, up to eightinstructions are issued out of order percycle, executed, and retired. Each threadcan address 32 architectural integer (andfloating-point) registers. These registersare renamed to a large physical registerfile of 356 physical registers. The largerregister file requires a longer access time.To avoid increasing the processor cycletime, the pipeline is extended by twostages to allow two-cycle register readsand two-cycle writes. Renamed instruc-tions are placed into one of two instruc-tion windows. The 32-entry integer in-struction window handles integer and allload/store instructions, while the 32-entryfloating-point instruction window handles


52 Ungerer et al.

Fig. 12 . SMT processor architecture.

floating-point instructions. Three floating-point and six integer units are assumed.All execution units are fully pipelined,and four of the integer units also exe-cute load/store instructions. The I- and D-caches are multiported and multibanked,but common to all threads.

The multithreaded workload consists ofa program mix of SPEC92 benchmark pro-grams that are executed simultaneouslyas different threads. The simulations eval-uate different fetch and instruction issuestrategies.

An RR.2.8 fetching scheme to accessmultiported I-cache—that is, in each cycle,eight instructions are fetched in round-robin policy from each of two differentthreads—was superior to other schemeslike RR.1.8, RR.4.2, and RR.2.4 withless fetching capacity. As a fetch policy,the ICOUNT feedback technique, whichgives the highest fetch priority to thethreads with the fewest instructions inthe decode, renaming, and queue pipelinestages, proved superior to the BRCOUNTscheme, which gives the highest prior-ity to those threads that are least likelyto be on a wrong path, and the MISS-COUNT scheme, which gives priority tothe threads that have the fewest out-standing D-cache misses. The IQPOSNpolicy that gives the lowest priority to

the oldest instructions by penalizing thosethreads with instructions closest to thehead of either the integer or the floating-point queue is nearly as good as ICOUNTand better than BRCOUNT and MISS-COUNT, which are all better than round-robin fetching. The ICOUNT.2.8 fetchingstrategy reached an IPC of about 5.4(the RR.2.8 only reached about 4.2). Mostinteresting is that neither mispredictedbranches nor blocking due to cache misses,but a mix of both and perhaps some othereffects, proved to be the best fetchingstrategy.

In a single-threaded processor, choosinginstructions for issue that are least likelyto be on a wrong path is always achievedby selecting the oldest instructions, thosedeepest into the instruction window. Forthe SMT processor, several different issuestrategies have been evaluated, like oldestinstructions first, speculative instructionslast, and branches first. Issue bandwidthis not a bottleneck on these simulated ma-chines and all strategies seem to performequally well, so the simplest mechanism isto be preferred. Also, doubling the size ofinstruction windows (but not the numberof searchable instructions for issue) has nosignificant effect on the IPC. Even an in-finite number of execution units increasesthroughput by only 0.5%.



Further research looked at compilertechniques for SMT [Lo et al. 1997] andat extracting threads of a single pro-gram designed for multithreaded execu-tion on an SMT [Tullsen et al. 1999]. TheSMT processor has also been evaluatedwith database workloads [Lo et al. 1998]achieving roughly a three-fold increasein instruction throughput with an eight-threaded SMT over a single-threaded su-perscalar with similar resources.

Wallace et al. [1998] presented thethreaded multipath execution model,which exploits existing hardware on anSMT processor to execute simultaneouslyalternate paths of a conditional branch ina thread. Threaded Multiple Path Execu-tion employs eager execution of branchesin an SMT processor model. It extends theSMT processor by introducing additionalhardware to test for unused processorresources (unused hardware threads), aconfidence estimator, mechanisms for faststarting and finishing threads, prioritiesattached to the threads, support forspeculatively executed memory accessoperations, and an additional bus fordistributing the contents of the registermapping table (mapping synchronizationbus, MSB). If the hardware detects thata number of processor threads are notprocessing useful instructions, the confi-dence estimator is used to decide whetheronly one continuation of a conditionalbranch should be followed (high confi-dence) or both continuations should befollowed simultaneously (low confidence).The MSB is used to provide a threadthat starts execution of one continuationspeculatively with the valid register map.Such register mappings between differentregister sets incur an overhead of four toeight cycles. Such a speculative executionincreases single-program performance by14–23%, depending on the mispredictionpenalty, for programs with a high branchmisprediction rate. Wallace et al. [1999]explored instruction recycling on theproposed multipath SMT processor.

Irvine multithreaded superscalar. Thismultithreaded superscalar processorapproach, developed at the University

of California at Irvine, combines out-of-order execution within an instructionstream with the simultaneous executionof instructions of different instructionstreams [Gulati and Bagherzadeh 1996;Loikkanen and Bagherzadeh 1996]. Aparticular superscalar processor calledthe Superscalar Digital Signal Processor(SDSP) is enhanced to run multiplethreads. The enhancements are directedby the goal of minimal modification tothe superscalar base processor. Therefore,most resources on the chip are shared bythe threads, as for instance the registerfile, reorder buffer, instruction window,store buffer, and renaming hardware.Based on simulations, a performance gainof 20–55% due to multithreading wasachieved across a range of benchmarks.

Pontius and Bagherzadeh [1999] evalu-ated a multithreaded superscalar proces-sor model using several video decode, pic-ture processing, and signal filter programsas workloads. The programs were paral-lelized at the source code level by par-titioning the main loop and distributingthe loop iteration to several threads. Therelatively low speedup that was reportedfor multithreading resulted from algorith-mic restrictions and from the already highIPC in the single-threaded model. Thelatter was only possible because multi-media instructions were not used. Other-wise a large part of the IPC in the single-threaded model would be hidden by theSIMD parallelism within the multimediainstructions.

SMV processor. The Simultaneous Mul-tithreaded Vector (SMV) architecture[Espasa and Valero 1997], designed atthe Polytechnic University of Catalunya(Barcelona, Spain) combines simultane-ous multithreaded execution and out-of-order execution with an integrated vec-tor unit and vector instructions. Figure 13depicts the SMV architecture. The fetchengine selects one of eight threads andfetches four instructions on its behalf.The decoder renames the instructions, us-ing a per-thread rename table, and thensends all instructions to several commonexecution queues. Inside the queues, the


54 Ungerer et al.

Fig. 13 . Simultaneous Multithreaded Vector architecture. FPU = floating-point unit; ALU = arithmeticlogic unit; VFU = vector functional unit.

instructions of different threads are indis-tinguishable, and no thread informationis kept except in the reorder buffer andmemory queue. Register names preserveall dependences. Independent threads useindependent rename tables, which pre-vents false dependences and conflicts fromoccurring. The vector unit has 128 vectorregisters, each holding 128 64-bit regis-ters, and has four general-purpose inde-pendent execution units. The number ofregisters is the product of the number ofthreads and the number of physical regis-ters required to sustain good performanceon each thread.

Karlsruhe multithreaded superscalar pro-cessor. While the SMT processor pro-posed by Tullsen, Eggers, and Levy[1995] surveys enhancements of the Alpha21164 processor, the multithreaded super-scalar processor approach of Sigmund andUngerer [1996a, 1996b] is based on a sim-plified PowerPC 604 processor, combiningmultiple pipelines that were dedicated todifferent threads with a simultaneous is-sue unit (resource replication).

Using an instruction mix with 20% loadand store instructions, the performanceresults showed, for an eight-issue pro-cessor with four to eight threads and



a single load/store unit, that two in-struction fetch units, two decode units,four integer units, 16 rename registers,four register ports, and completion queueswith 12 slots are sufficient. The singleload/store unit proved the principal bot-tleneck. The multithreaded superscalarprocessor (eight-threaded, eight-issue) isable to hide completely latencies caused by4-2-2-2 burst cache refills.2 It reaches themaximum throughput of 4.2 IPC that ispossible with a single load/store unit.

SMT multimedia processor. SubsequentSMT research at the University ofKarlsruhe has explored resource repli-cation-based microarchitecture modelsfor an SMT processor with multime-dia enhancements [Oehring et al. 1999a,1999b]. Related research by Pontius andBagherzadeh [1999] and Wittenburg et al.[1999] did not use multimedia operations.Hand-coding was applied to transfer acommercial MPEG-2 video decoding algo-rithm to the SMT multimedia processormodel and to program the algorithm inmultithreaded fashion. The research pro-cessor model assumes a wide-issue super-scalar processor, and enhances it by theSMT technique, by multimedia units, andby an additional on-chip RAM storage.

The most surprising finding was thatsmaller reservation stations for the threadunit and the global and the local load/storeunits as well as the smaller reorderbuffers, increased the IPC value for themultithreaded models. Intuition suggestsbetter performance with larger buffers.However, large reservation stations (andlarge reorder buffers) draw too manyhighly speculative instructions into thepipeline. Smaller buffers limit the spec-ulation depth of fetched instructions andlead to the fact that only nonspeculativeinstructions or instructions with low spec-ulation depth are fetched, decoded, is-sued, and executed in an SMT processor.An abundance of speculative instructionsmay decrease the performance of an SMT

2 4-2-2-2 assumes that four times 64-bit portions aretransmitted over the memory bus, the first portionreaching the processor four cycles after the cachemiss indication, the next portion two cycles later, etc.

processor. Another reason for the negativeeffect of large reorder buffers and of largereservation stations for the load/store andthread control units lies in the fact thatthose instructions have a long latencyand typically have two to three integerinstructions as consumers. The effect isthat the consuming integer instructionseat up space in the integer reservation sta-tion, thus blocking instructions from otherthreads from entering it. This multiplica-tion effect is made even worse by a non-speculative execution of store and threadcontrol instructions.

Subsequent research by Sigmund et al.[2000] investigated a cost/benefit analy-sis of various SMT multimedia processormodels. Transistor count and chip spaceestimations (applying the tool describedby Steinhaus et al. [2001]) showed thatthe additional hardware cost of a four-threaded, eight-issue SMT processor overa single-threaded processor is a 2% in-crease in transistor count and a 9% in-crease in chip space for a 288 milliontransistor chip, which yields a threefoldspeedup. The small scaled four-threaded,eight-issue SMT models with only 24 mil-lion transistors require a 9% increase intransistor count and a 27% increase inchip space, resulting in a 1.5-fold speedupover the superscalar base model.

Similar results were reached by Burnsand Gaudiot [2002], who estimated thelayout area for SMT. They identified whichlayout blocks are affected by SMT, de-termined the scaling of chip space re-quirements, and compared SMT versussingle-threaded processor space require-ments by scaling a R10000-based layoutto 0.18-micron technology.

Simultaneous multithreading for signal pro-cessors. Wittenburg et al. [1999] lookedat the application of the SMT techniquefor signal processors using combining in-structions that are applied to registers ofseveral register sets simultaneously in-stead of multimedia operations. Simula-tions with a Hough transformation asworkload showed a speedup of up to sixcompared to a single-threaded processorwithout multimedia extensions.


56 Ungerer et al.

Simultaneous multithreading and powerconsumption. Simultaneous multithread-ing can also be applied for reduction ofpower consumption. In principle, the samepower reduction techniques are applica-ble for SMT processors as for superscalars.When an SMT processor is simultane-ously executing threads, the per-cycle useof the processor resources should notice-ably increase, offering less opportunitiesfor power reduction via traditional tech-niques such as clock gating. However, anSMT processor, specifically designed to ex-ecute multiple threads in a power-awaremanner, provides additional options forpower-aware design [Brooks et al. 2000].

Mispredictions cost energy because thespeculatively executed instructions mustbe squashed. Contemporary superscalarprocessors discard approximately 60% ofthe fetched and 30% of the executed in-structions due to misspeculations. In caseof a power-aware design, the issue slotsof an SMT processor could be preferablyfilled by less speculative instructions ofother threads. Harnessing the extra paral-lelism provided by multiple threads allowsthe processor to rely much less on specula-tion. Less resources are utilized by specu-lative instructions and more parallelism isexploited. Moreover, a simpler branch unitdesign could be used. This inherently im-plies a greater power efficiency per thread.

The simulations of Seng et al. [2000]showed that by using an appropriatescheduler up to 22% less power is con-sumed by an SMT processor than by a com-parable superscalar.

7. SIMULTANEOUS MULTITHREADING INCURRENT MICROPROCESSORS

Alpha 21464. Compaq unveiled itsAlpha EV8 21464 proposal in 1999 [Emer1999], a four-threaded eight-issue SMTprocessor that closely resembles theSMT processor proposed by Tullsen et al.[1999]. The processor proposal featuredout-of-order execution, a large on-chipsecondary cache, a direct RAMBUS in-terface, and an on-chip router for systeminterconnect of a directory based, cache-

coherent NUMA (nonuniform memoryaccess) multiprocessor. This 250 milliontransistor chip was planned for year 2003.In the meantime, the project has beenabandoned, as Compaq sold the Alphaprocessor technology to Intel.

Blue Gene. Simultaneous multithread-ing is mentioned as processor techniquefor the building block of the IBM BlueGene system—a 5-year effort to builda petaflops supercomputer started inDecember 1999 [Allen et al. 2001].

Sun UltraSPARC V. Also the Sun Ultra-SPARC V processor has been announcedto exploit thread-level parallelism by sym-metric multithreading—that is, SMT. TheultraSPARC V will be able to switch be-tween two different modes depending onthe type of work—one mode for heavy dutycalculations and the other for businesstransactions such as database operations[Lawson and Vance 2002].

Hyper-Threading Technology in the IntelXeon processor. Intel’s Hyper-ThreadingTechnology [Marr et al. 2002] proposesSMT for the Pentium 4-based Intel Xeonprocessor family to be used in dual andmultiprocessor servers. Hyper-ThreadingTechnology makes a single physical pro-cessor appear as two logical processors byapplying a two-threaded SMT approach.Each logical processor maintains a com-plete set of the architecture state, whichconsists of the general-purpose registers,the control registers, the advanced pro-grammable interrupt controller (APIC)registers, and some machine state regis-ters. Logical processors share nearly allother resources on the physical processor,such as caches, execution units, branchpredictors, control logic, and buses. Eachlogical processor has its own APIC. Inter-rupts sent to a specific logical processorare handled only by that logical processor.

When one logical processor is stalled,the other logical processor can continue tomake forward progress. A logical processormay be temporarily stalled for a variety ofreasons, including servicing cache misses,handling branch mispredictions, or wait-ing for the results of previous instructions.



Independent forward progress was en-sured by managing buffering queues suchthat no logical processor can use all theentries when two active software threadswere executing.

The buffering queues that separate ma-jor pipeline logic blocks are either parti-tioned or duplicated to ensure indepen-dent forward progress through each logicblock. If only one active software threadis active, the thread should run at thesame speed on a processor with Hyper-Threading Technology as on a processorwithout this capability. This means thatpartitioned resources should be recom-bined when only one software thread isactive, thus applying a flexible resource-sharing model.

In the Xeon, as in the Pentium 4, in-structions generally come from the execu-tion trace cache (TC), which replaces theprimary I-cache. Two sets of instructionpointers independently track the progressof the two executing software threads. TheTC entries are tagged with thread infor-mation. The two logical processors arbi-trate access to the TC every clock cycle.If both logical processors want access tothe TC at the same time, access is grantedto one then the other in alternating clockcycles. If one logical processor is stalled oris unable to use the TC, the other logicalprocessor can use the full bandwidth of theTC, every cycle.

If there is a TC miss, instruction bytesneed to be fetched from the secondarycache and decoded into µops (microopera-tions) to be placed in the TC. Each logicalprocessor has its own instruction transla-tion look-aside buffer and its own set of in-struction pointers to track the progress ofinstruction fetch for the two logical proces-sors. The instruction fetch logic in chargeof sending requests to the secondary cachearbitrates on a first-come first-served ba-sis, while always reserving at least onerequest slot for each logical processor. Inthis way, both logical processors can havefetches pending simultaneously. Each log-ical processor has its own set of two 64-byte streaming buffers to hold instructionbytes in preparation for the instruction de-code stage.

The branch prediction structures are ei-ther duplicated or shared. The branch his-tory buffer used to look up the global his-tory array is also tracked independentlyfor each logical processor. However, thelarge global history array is a sharedstructure with entries that are taggedwith a logical processor ID. The returnstack buffer, which predicts the target ofreturn instructions, is duplicated.

When both threads are decoding in-structions simultaneously, the streamingbuffers alternate between threads so thatboth threads share the same decoder logic.The decode logic has to keep two copies ofall the state needed to decode IA-32 in-structions for the two logical processorseven though it only decodes instructionsfor one logical processor at a time. In gen-eral, several instructions are decoded forone logical processor before switching tothe other logical processor.

The µop queue that decouples the front-end from the out-of-order execution engineis partitioned such that each logical pro-cessor has half the entries. The allocatorlogic takes µops from the µop queue andallocates many of the key machine buffersneeded to execute each µop, including the126 reorder buffer entries, 128 integer and128 floating-point physical registers, and48 load and 24 store buffer entries. If thereare µops for both logical processors in theµop queue, the allocator will alternate se-lectingµops from the logical processors ev-ery clock cycle to assign resources. Eachlogical processor can use up to a maxi-mum of 63 reorder buffer entries, 24 loadbuffers, and 12 store buffer entries.

Since each logical processor must main-tain and track its own complete architec-ture state, there are two register alias ta-bles for register renaming, one for eachlogical processor. The register renamingprocess is done in parallel to the alloca-tor logic described above, so the registerrename logic works on the same µops towhich the allocator is assigning resources.

Once µops have completed the alloca-tion and register rename processes, theyare placed into the memory instructionqueue or the general instruction queue,respectively. The two sets of queues are


58 Ungerer et al.

also partitioned such that µops from eachlogical processor can use at most half theentries. The memory instruction queueand general instruction queues send µopsto the five scheduler queues as fast asthey can, alternating betweenµops for thetwo logical processors every clock cycle, asneeded.

Each scheduler has its own schedulerqueue of 8 to 12 entries from which it se-lects µops to send to the execution units.The schedulers choose µops regardless ofwhether they belong to one logical proces-sor or the other. The schedulers are ef-fectively oblivious to logical processor dis-tinctions. The µops are simply evaluatedbased on dependent inputs and availabil-ity of execution resources. For example,the schedulers could dispatch two µopsfrom one logical processor and two µopsfrom the other logical processor in thesame clock cycle. To avoid deadlock and en-sure fairness, there is a limit on the num-ber of active entries that a logical proces-sor can have in each scheduler’s queue.

The execution core and memory hier-archy are also largely oblivious to logicalprocessors. After execution, the µops areplaced in the reorder buffer, which decou-ples the execution stage from the retire-ment stage. The reorder buffer is parti-tioned such that each logical processor canuse half the entries.

The retirement logic tracks when µopsfrom the two logical processors are readyto be retired, then retires the µops in pro-gram order for each logical processor byalternating between the two logical pro-cessors. Retirement logic will retire µopsfor one logical processor, then the other, al-ternating back and forth. If one logical pro-cessor is not ready to retire any µops thenall retirement bandwidth is dedicated tothe other logical processor.

The implementation of Hyper-Threading Technology in the Xeonprocessor adds less than 5% to the relativechip size and maximum power require-ments, but can provide performancebenefits much greater than that. Initialbenchmark tests show up to a 65% per-formance increase on high-end serverapplications when comparing the Xeon

processor to the previous-generationPentiumIII Xeon processor on four-wayserver platforms. A significant portionof those gains can be attributed toHyper-Threading Technology [Marr et al.2002].

8. CONCLUSIONS

Although the meaning of the term mul-tithreading is sometimes used to includeall kinds of architectures that are ableto concurrently execute multiple instruc-tion streams on a single chip [Sohi 2001],that is, chip multiprocessors, explicit andimplicit multithreaded architectures, weclearly distinguish these architectural so-lutions and focus this survey on explicitmultithreaded processors.

Explicit multithreaded processors inter-leave the execution of instructions of dif-ferent user-defined threads within thesame pipeline, in contrast to implicit mul-tithreaded processors that dynamicallygenerate threads from single-threadedprograms and execute such speculativethreads concurrently with the lead thread.

Superscalar and implicit multithreadedprocessors aim at a low execution timeof a single program, while explicit mul-tithreaded processors (and chip multipro-cessors) aim at a low execution time of amultithreaded workload.

Several explicit multithreaded proces-sors as well as chip multiprocessors havecurrently been announced by industry orare already into production in the ar-eas of high-performance microprocessors,media, and network processors. The ba-sic interleaved and blocked multithread-ing techniques are applied in VLIW pro-cessors, in the network processors, andeven in superscalar processors. In partic-ular the success of these techniques innetwork processors is stunning. In addi-tion, simultaneous multithreading proces-sors and chip multiprocessors have beenannounced or are already in productionby IBM, Intel, and Sun. The additionalchip space requirement by a two-threadedprocessor is reported to be about 5% com-pared to a single-threaded processor (IBMRS64 IV and Intel Xeon processor), while a



significant throughput increase is reachedby multithreading. We therefore expect ex-plicit multithreading techniques, in par-ticular the simultaneous multithread-ing techniques, to be commonly used inthe next generation of high-performancemicroprocessors.

In contrast, implicit multithreaded pro-cessors and the related helper thread ap-proach remain hot research topics andmust still prove their efficiency in real pro-cessors. We expect that future researchwill also focus on the deployment of multi-threading techniques in the fields of signalprocessors and microcontrollers, in par-ticular for real-time applications, and forpower management. Moreover, instruc-tion scheduling within a multithreadedpipeline as well as system softwarearchitectures for multithreaded proces-sors are still open research questions.

This survey should clarify the terminol-ogy and demonstrate the state-of-the-artas well as research results achieved for ex-plicit multithreaded processors.

ACKNOWLEDGMENTS

The authors would like to thank anonymous review-ers for many valuable comments.

REFERENCES

AGARWAL, A., BIANCHINI, R., CHAIKEN, D., JOHNSON,K. L., KRANZ, D., KUBIATOWICZ, J., LIM, B. H.,MACKENZIE, K., AND YEUNG, D. 1995. The MITAlewife machine: architecture and performance.In Proceedings of the 22nd Annual InternationalSymposium on Computer Architecture (SantaMargherita Ligure, Italy). 2–13.

AGARWAL, A., KUBIATOWICZ, J., KRANZ, D., LIM, B. H.,YEOUNG, D., D’SOUZA, G., AND PARKIN, M. 1993.Sparcle: an evolutionary processor design forlarge-scale multiprocessors. IEEE Micro 13, 3,48–61.

AKKARY, H. AND DRISCOLL, M. A. 1998. A dynamicmultithreading processor. In Proceedings ofthe 31st Annual International Symposium onMicroarchitecture (Dallas, TX). 226–236.

ALLEN, F., ALMASI, G., ANDREONI, W., BEECE, D., BERNE,B. J., BRIGHT, A., BRUNHEROTO, J., CASCAVAL, C.,CASTANOS, J., COTEUS, P., CRUMLEY, P., CURIONI, A.,DENNEAU, M., DONATH, W., ELEFTHERIOU, M.,FITCH, B., FLEISCHER, B., GEORGIOU, C. J.,GERMAIN, R., GIAMPAPA, M., GRESH, D., GUPTA, M.,HARING, R., HO, H., HOCHSCHILD, P.,HUMMEL, S., JONAS, T., LIEBER, D., MARTYNA, G.,

MATURU, K., MOREIRA, J., NEWNS, D., NEWTON, M.,PHILHOWER, R., PICUNKO, T., PITERA, J.,PITMAN, M., RAND, R., ROYYURU, A.,SALAPURA, V., SANOMIYA, A., SHAH, R., SHAM,Y., SINGH, S., SNIR, M., SUITS, F., SWETZ, R.,SWOPE, W. C., VISHNUMURTHY, N., WARD, T. C. J.,WARREN, H., AND ZHOU, R. 2001. Blue Gene:a vision for protein science using a petaflopssupercomputer. IBM Syst. J. 40, 2, 310–326.

ALMASI, G. S. AND GOTTLIEB, A. 1994. Highly Par-allel Computing, 2nd ed. Benjamin/Cummings,Menlo Park, CA.

ALVERSON, G., KAHAN, S., KORRY, R., MCCANN, C., AND

SMITH, B. J. 1995. Scheduling on the TeraMTA. In Lecture Notes in Computer Science,vol. 949. Springer-Verlag, Heidelberg, Germany.19–44.

ALVERSON, R., CALLAHAN, D., CUMMINGS, D., KOBLENZ,B., PORTERFIELD, A., AND SMITH, B. J. 1990.The Tera computer system. In Proceedings ofthe 4th International Conference on Supercom-puting (Amsterdam, The Netherlands). 1–6.

BACH, P., BRAUN, M., FORMELLA, A., FRIEDRICH, J., GRUN,T., AND LINCHTENAU, C. 1997. Building the 4processor SB-PRAM prototype. In Proceedingsof the 30th Hawaii International Conference onSystem Science (Maui, HI). 5:14–23.

BARROSO, L. A., GHARACHORLOO, K., MCNAMARA, R.,NOWATZYK, A., QADEER, S., SANO, B., SMITH, S.,STETS, R., AND VERGHESE, B. 2000. Piranha: ascalable architecture based on single-chip mul-tiprocessing. In Proceedings of the 27th AnnualInternational Symposium on Computer Architec-ture (Vancouver, B.C., Canada). 282–293.

BOLYCHEVSKY, A., JESSHOPE, C. R., AND MUCHNIK, V. B.1996. Dynamic scheduling in RISC architec-tures. IEE P. Comput. Dig. Tech. 143, 5, 309–317.

BOOTHE, R. F. 1993. Evaluation of multithreadingand caching in large shared memory parallelcomputers. Tech. Rep. UCB/CSD-93-766. Com-puter Science Division, University of California,Berkeley, Berkeley, CA.

BOOTHE, R. F. AND RANADE, A. 1992. Improved mul-tithreading techniques for hiding communica-tion latency in multiprocessors. In Proceedings ofthe 19th International Symposium on ComputerArchitecture (Gold Coast, Australia). 214–223.

BORKENHAGEN, J. M., EICKEMEYER, R. J., KALLA, R. N.,AND KUNKEL, S. R. 2000. A multithreadedPowerPC processor for commercial servers. IBMJ. Res. Dev. 44, 6, 885–898.

BRINKSCHULTE, U., BECHINA, A., PICIOROAGA, F.,SCHNEIDER, E., UNGERER, T., KREUZINGER, J., AND

PFEFFER, M. 2000. A microkernel middlewarearchitecture for distributed embedded real-timesystems. In Proceedings of the 20th IEEE Sym-posium on Reliable Distributed Systems (NewOrleans LA). 218–226.

BRINKSCHULTE, U., KRAKOWSKI, C., KREUZINGER, J.,AND UNGERER, T. 1999a. A multithreadedJava microcontroller for thread-oriented real-time event-handling. In Proceedings of the


60 Ungerer et al.

International Conference on Parallel Architec-tures and Compilation Techniques (NewportBeach, CA). 34–39.

BRINKSCHULTE, U., KRAKOWSKI, C., MARSTON, R.,KREUZINGER, J., AND UNGERER, T. 1999b. TheKomodo project: thread-based event handlingsupported by a multithreaded Java microcon-troller. In Proceedings of the 25th EuromicroConference (Milan, Italy). 122–128.

BRINKSCHULTE, U., KREUZINGER, J., PFEFFER, M.,AND UNGERER, T. 2002. A scheduling tech-nique providing a strict isolation of real-timethreads. In Proceedings of the 7th IEEE In-ternational Workshop on Object-oriented Real-time Dependable Systems (San Diego, CA). 169–172.

BROOKS, D. M., BOSE, P., SCHUSTER, S. E., JACOBSON, H.,KUDVA, P. N., BUYUKTOSUNOGLU, A., WELLMAN, J. D.,ZYUBAN, V., GUPTA, M., AND COOK, P. W. 2000.Power-aware microarchitecture: designing andmodeling challenges for next-generation micro-processors. IEEE Micro 20, 6, 26–44.

BURNS, J. AND GAUDIOT, J. L. 2002. SMT layoutoverhead and scalability. IEEE T. Parall. Distr.Syst. 13, 2, 142–155.

BUTLER, M., YEH, T. Y., PATT, Y. N., ALSUP, M., SCALES,H., AND SHEBANOW, M. 1991. Single instructionstream parallelism is greater than two. In Pro-ceedings of the 18th International Symposium onComputer Architecture (Toronto, Ont., Canada).276–286.

CHAPPELL, R. S., STARK, J., KIM, S. P., REINHARDT, S. K.,AND PATT, Y. N. 1999. Simultaneous subordi-nate microthreading (SSMT). In Proceedings ofthe 26th Annual International Symposium onComputer Architecture (Atlanta, GA). 186–195.

CHRYSOS, G. Z. AND EMER, J. S. 1998. Memory de-pendence prediction using store sets. In Pro-ceedings of the 25th Annual International Sym-posium on Computer Architecture (Barcelona,Spain). 142–153.

CULLER, D. E., SINGH, J. P., AND GUPTA, A.1998. Parallel Computer Architecture: AHardware/Software Approach. MorganKaufmann, San Francisco, CA.

DALLY, W. J., FISKE, J., KEEN, J., LETHIN, R., NOAKES, M.,NUTH, P., DAVISON, R., AND FYLER, G. 1992. Themessage-driven processor: a multicomputer pro-cessing node with efficient mechanisms. IEEEMicro 12, 2, 23–39.

DENNIS, J. B. AND GAO, G. R. 1994. Multithreadedarchitectures: principles, projects, and issues. InMultithreaded Computer Architecture: A Sum-mary of the State of the Art, R. A. Iannucci, G. R.Gao, R. Halstead, and B. J. Smith, Eds. KluwerBoston, MA, Dordrecht, The Netherlands,London, U.K. 1–74.

DOROJEVETS, M. 2000. COOL multithreading inHTMT SPELL-1 processors. Int. J. High SpeedElectron. Sys. 10, 1, 247–253.

DOROZHEVETS, M. N. AND WOLCOTT, P. 1992. TheEl’brus-3 and MARS-M: recent advances in Rus-

sian high-performance computing. J. Supercom-put. 6, 1, 5–48.

DUBEY, P. K., O’BRIEN, K., O’BRIEN, K. M., AND

BARTON, C. 1995. Single-program speculativemultithreading (SPSM) architecture: compiler-assisted fine-grain multithreading. Tech. Rep.RC 19928. IBM, Yorktown Heights, NY.

EGGERS, S. J., EMER, J. S., LEVY, H. M., LO, J. L., STAMM,R. M., AND TULLSEN, D. M. 1997. Simultaneousmultithreading: a platform for next-generationprocessors. IEEE Micro 17, 5, 12–19.

EMER, J. S. 1999. Simultaneous multithreading:multiplying Alpha’s performance. In Proceed-ings of the Microprocessor Forum (San Jose, CA).

ESPASA, R. AND VALERO, M. 1997. Exploitinginstruction- and data-level parallelism. IEEEMicro 17, 5, 20–27.

FILLO, M., KECKLER, S. W., DALLY, W. J., CARTER,N. P., CHANG, A., AND GUREVICH, Y. 1995. TheM-machine multicomputer. In Proceedings ofthe 28th Annual International Symposium onMicroarchitecture (Ann Arbor, MI). 146–156.

FORMELLA, A., KELLER, J., AND WALLE, T. 1996. HPP:A high performance PRAM. In Lecture Notesin Computer Science, vol. 1123. Springer-Verlag,Heidelberg, Germany. 425–434.

FRANKLIN, M. 1993. The multiscalar architec-ture. Tech. Rep. 1196. Department of Com-puter Science, University of Wisconsin-Madison,Madison, WI.

FUHRMANN, S., PFEFFER, M., KREUZINGER, J.,UNGERER, T., AND BRINKSCHULTE, U. 2001. Real-time garbage collection for a multithreaded Javamicrocontroller. In Proceedings of the 4th IEEEInternational Symposium on Object-OrientedReal-Time Distributed Computing (Magdeburg,Germany). 69–76.

GELINAS, B., HAYS, P., AND KATZMAN, S. 2002. Fine-grained hardware multi-threading: A CPU ar-chitecture for high-touch packed processing.Lexra Inc., Waltham, MA. White paper.

GLASKOWSKY, P. N. 2002. Network processors ma-ture in 2001. Microproc. Report. February 19,2002 (online journal).

GRUNEWALD, W. AND UNGERER, T. 1996. Towards ex-tremely fast context switching in a blockmulti-threaded processor. In Proceedings of the 22ndEuromicro Conference (Prague, Czech Republic).592–599.

GRUNEWALD, W. AND UNGERER, T. 1997. A mul-tithreaded processor designed for distributedshared memory systems. In Proceedings of theInternational Conference on Advances in Par-allel and Distributed Computing (Shanghai,China). 206–213.

GULATI, M. AND BAGHERZADEH, N. 1996. Perfor-mance study of a multithreaded superscalarmicroprocessor. In Proceedings of the 2nd In-ternational Symposium on High-PerformanceComputer Architecture (San Jose, CA). 291–301.



GWENNAP, L. 1997. DanSoft develops VLIW de-sign. Microproc. Report 11, 2 (Feb. 17), 18–22.

HALSTEAD, R. H. 1985. MULTILISP: a language forconcurrent symbolic computation. ACM Trans.Program. Lang. Syst. 7, 4, 501–538.

HALSTEAD, R. H. AND FUJITA, T. 1988. MASA: a mul-tithreaded processor architecture for parallelsymbolic computing. In Proceedings of the 15thInternational Symposium on Computer Architec-ture (Honolulu, HI). 443–451.

HAMMOND, L. AND OLUKOTUN, K. 1998. Considera-tions in the design of Hydra: a multiprocessor-on-chip microarchitecture. Tech. Rep. CSL-TR-98-749. Computer Systems Laboratory, StanfordUniversity, Stanford, CA.

HANSEN, C. 1996. MicroUnity’s MediaProcessorarchitecture. IEEE Micro 16, 4, 34–41.

HIRATA, H., KIMURA, K., NAGAMINE, S., MOCHIZUKI,Y., NISHIMURA, A., NAKASE, Y., AND NISHIZAWA,T. 1992. An elementary processor architec-ture with simultaneous instruction issuingfrom multiple threads. In Proceedings of the19th International Symposium on ComputerArchitecture (Gold Coast, Australia). 136–145.

IANNUCCI, R. A., GAO, G. R., HALSTEAD, R., AND

SMITH, B. J., Eds. 1994. Multithreaded Com-puter Architecture: A Summary of the Stateof the Art. Kluwer Boston, MA, Dordrecht,The Netherlands, London, U.K.

IBM CORPORATION. 1999. IBM network processor.Product overview. IBM, Yorktown Heights, NY.

INTEL CORPORATION. 2002. Intel Internet exchangearchitecture network processors: flexible, wire-speed processing from the customer premisesto the network core. White paper. Intel, SantaClara, CA.

JESSHOPE, C. R. 2001. Implementing an efficientvector instruction set in a chip multi-processorusing micro-threaded pipelines. Aust. Comput.Sci. Commun. 23, 4, 80–88.

JESSHOPE, C. R. AND LUO, B. 2000. Micro-threading:a new approach to future RISC. In Proceedingsof the Australasian Computer Architecture Con-ference (Canbera, Australia). 34–41.

KAVI, K. M., LEVINE, D. L., AND HURSON, A. R. 1997.A non-blocking multithreaded architecture. InProceedings of the 5th International Conferenceon Advanced Computing (Madras, India). 171–177.

KLAUSER, A., AUSTIN, T., GRUNWALD, D., AND CALDER,B. 1998a. Dynamic hammock predication fornon-predicated instruction sets. In Proceedingsof the International Conference on Parallel Ar-chitectures and Compilation Techniques (Paris,France). 278–285.

KLAUSER, A., PAITHANKAR, A., AND GRUNWALD, D.1998b. Selective eager execution on thePolyPath architecture. In Proceedings of the25th Annual International Symposium onComputer Architecture (Barcelona, Spain).250–259.

KREUZINGER, J., SCHULZ, A., PFEFFER, M., UNGERER,T., BRINKSCHULTE, U., AND KRAKOWSKI, C. 2000.Real-time scheduling on multithreaded proces-sors. In Proceedings of the 7th International Con-ference on Real-Time Computer Systems and Ap-plications (Cheju Island, South Korea). 155–159.

KREUZINGER, J. AND UNGERER, T. 1999. Context-switching techniques for decoupled multi-threaded processors. In Proceedings of the 25thEuromicro Conference (Milan, Italy). 1:248–251.

LAM, M. S. AND WILSON, R. P. 1992. Limits of con-trol flow on parallelism. In Proceedings of the18th International Symposium on Computer Ar-chitecture (Toronto, Ont., Canada). 46–57.

LAUDON, J., GUPTA, A., AND HOROWITZ, M. 1994. In-terleaving: a multithreading technique target-ing multiprocessors and workstations. In Pro-ceedings of the 6th International Conference onArchitectural Support for Programming Lan-guages and Operating Systems (San Jose, CA).308–318.

LAWSON, S. AND VANCE, A. 2002. Sun hints atUltraSparc V and beyond. Available online atPC World.com.

LI, Z., TSAI, J. Y., WANG, X., YEW, P. C., AND

ZHENG, B. 1996. Compiler techniques for con-current multithreading with hardware specu-lation support. In Lecture Notes in ComputerScience, vol. 1239. Springer-Verlag, Heidelberg,Germany. 175–191.

LIPASTI, M. H. AND SHEN, J. P. 1997. The per-formance potential of value and dependenceprediction. In Lecture Notes Computer Sci-ence, vol. 1300. Springer-Verlag, Heidelberg,Germany. 1043–1052.

LIPASTI, M. H., WILKERSON, C. B., AND SHEN, J. P.1996. Value locality and load value prediction.In Proceedings of the 7th International Confer-ence on Architectural Support for ProgrammingLanguages and Operating Systems (Cambridge,MA). 138–147.

LO, J. L., BARROSO, L. A., EGGERS, S. J., GHARACHORLOO,K., LEVY, H. M., AND PAREKH, S. S. 1998. Ananalysis of database workload performance onsimultaneous multithreaded processors. In Pro-ceedings of the 25th Annual International Sym-posium on Computer Architecture (Barcelona,Spain). 39–50.

LO, J. L., EGGERS, S. J., EMER, J. S., LEVY, H. M., STAMM,R. L., AND TULLSEN, D. M. 1997. Convertingthread-level parallelism to instruction-level par-allelism via simultaneous multithreading. ACMTrans. Comput. Syst. 15, 3, 322–354.

LOIKKANEN, M. AND BAGHERZADEH, N. 1996. A fine-grain multithreading superscalar architecture.In Proceedings of the International Conferenceon Parallel Architectures and Compilation Tech-niques (Boston, MA). 163–168.

LUTH, K., METZNER, A., PIEKENKAMP, T., AND RISU, J.1997. The events approach to rapid prototyp-ing for embedded control system. In Proceedings


62 Ungerer et al.

of the Workshop Zielarchitekturen eingebetteterSyststeme (Rostock, Germany). 45–54.

MANKOVIC, T. E., POPESCU, V., AND SULLIVAN, H. 1987.CHoPP priciples of operations. In Proceedings ofthe 2nd International Supercomputer Conference(Mannheim, Germany). 2–10.

MARCUELLO, P., GONZALES, A., AND TUBELLA, J. 1998.Speculative multithreaded processors. InProceedings of the 12th International Confer-ence on Supercomputing (Melbourne, Australia).77–84.

MARR, D. T., BINNS, F., HILL, D. L., HINTON, G.,KOUFATY, D. A., MILLER, J. A., AND UPTON,M. 2002. Hyper-threading technology archi-tecture and microarchitecture: a hypertext his-tory. Intel Technology J. 6, 1 (online journal).

METZNER, A. AND NIEHAUS, J. 2000. MSparc: multi-threading in real-time architectures. J. Univer-sal Comput. Sci. 6, 10, 1034–1051.

MIKSCHL, A. AND DAMM, W. 1996. Msparc: a multi-threaded Sparc. In Lecture Notes in ComputerScience, vol. 1123. Springer-Verlag, Heidelberg,Germany. 461–469.

OEHRING, H., SIGMUND, U., AND UNGERER, T. 1999a.MPEG-2 video decompression on simultaneousmultithreaded multimedia processors. In Pro-ceedings of the International Conference on Par-allel Architectures and Compilation Techniques(Newport Beach, CA). 11–16.

OEHRING, H., SIGMUND, U., AND UNGERER, T. 1999b.Simultaneous multithreading and multimedia.In Proceedings of the Workshop on MultithreadedExecution, Architecture and Compilation(Orlando, FL).

PATT, Y. N., PATEL, S. J., EVERS, M., FRIENDLY, D. H.,AND STARK, J. 1997. One billion transistors,one uniprocessor, one chip. Computer 30, 9, 51–57.

PAUL, W. J., BACH, P., BOSCH, M., FISCHER, J.,LICHTENAU, C., AND ROHRIG, J. 2002. RealPRAM programming. In Lecture Notes inComputer Science, vol. 2400. Springer-Verlag,Heidelberg, Germany. 522–531.

PONTIUS, N. AND BAGHERZADEH, N. 1999. Multi-threaded extensions enhance multimedia perfor-mance. In Proceedings of the Workshop on Mul-tithreaded Execution, Architecture and Compila-tion (Orlando, FL).

ROTENBERG, E., JACOBSON, Q., SAZEIDES, Y., AND SMITH,J. E. 1997. Trace processors. In Proceedingsof the 30th Annual International Symposium onMicroarchitecture (Research Triangle Park, NC).138–148.

RYCHLIK, B., FAISTL, J., KRUG, B., AND SHEN, J. P.1998. Efficiency and performance impact ofvalue prediction. In Proceedings of the Inter-national Conference on Parallel Architecturesand Compilation Techniques (Paris, France).148–154.

SENG, J. S., TULLSEN, D. M., AND CAI, G. Z. N. 2000.Power-sensitive multithreaded architecture. InProceedings of the IEEE International Confer-

ence on Computer design: VLSI in Computersand Processors (Austin, TX). 199–206.

SERRANO, M. J., YAMAMOTO, W., WOOD, R., AND

NEMIROVSKY, M. D. 1994. Performance estima-tion in a multistreamed superscalar proces-sor. In Lecture Notes in Computer Science,vol. 794. Springer-Verlag, Heidelberg, Germany.213–230.

SIGMUND, U., STEINHAUS, M., AND UNGERER, T. 2000.Transistor count and chip space assessmentof multimedia-enhanced simultaneous multi-threaded processors. In Proceedings of the 4thWorkshop on Multithreaded Execution, Architec-ture and Compilation (Monterrey, CA).

SIGMUND, U. AND UNGERER, T. 1996a. Evaluating amultithreaded superscalar microprocessor ver-sus a multiprocessor chip. In Proceedings of the4th PASA Workshop on Parallel Systems andAlgorithms (Julich, Germany). 147–159.

SIGMUND, U. AND UNGERER, T. 1996b. Identifyingbottlenecks in multithreaded superscalar multi-processors. In Lecture Notes in Computer Sci-ence, vol. 1123. Springer-Verlag, Heidelberg,Germany. 797–800.

SILC, J., ROBIC, B., AND UNGERER, T. 1998. Asyn-chrony in parallel computing: from dataflowto multithreading. Parall. Distr. Comput. Prac-tices 1, 1, 57–83.

SILC, J., ROBIC, B., AND UNGERER, T. 1999. ProcessorArchitecture: From Dataflow to Superscalar andBeyond. Springer-Verlag, Heidelberg and Berlin,Germany, and New York, NY.

SMITH, B. J. 1981. Architecture and applications ofthe HEP multiprocessor computer system. SPIEReal-Time Signal Processing IV 298, 241–248.

SMITH, B. J. 1985. The architecture of hep. In Par-allel MIMD Computation: HEP Supercomputerand Its Applications, J. S. Kowalik, Ed. MITPress, Cambridge, MA, 41–55.

SMITH, J. E. AND VAJAPEYAM, S. 1997. Trace proces-sors: moving to fourth-generation microarchitec-tures. Computer 30, 9, 68–74.

SOHI, G. S. 1997. Multiscalar: another fourth-generation processor. Computer 30, 9, 72.

SOHI, G. S. 2001. Microprocessors—10 years back,10 years ahead. In Lecture Notes in ComputerScience, vol. 2000. Heidelberg, Germany. 208–218.

SOHI, G. S., BREACH, S. E., AND VIJAYKUMAR, T. N.1995. Multiscalar processors. In Proceedingsof the 22nd Annual International Symposiumon Computer Architecture (Santa MargheritaLigure, Italy). 414–425.

STEINHAUS, M., KOLLA, R., LARRIBA-PEY, J. L., UNGERER,T., AND VALERO, M. 2001. Transistor count andchip space estimation of simple-scalar-based mi-croprocessor models. In Proceedings of the Work-shop on Complexity-Effective Design (Goteborg,Sweden).

STERLING, T. 1997. Beyond 100 teraflops throughsuperconductors, holographic storage, and the



data vortex. In Proceedings of the InternationalSymposium on Supercomputing (Tokyo, Japan).

TENDLER, J. M., DODSON, J. S., FIELDS, JR., J. S., LE,H., AND SINHAROY, B. 2002. POWER4 systemmicroarchitecture. IBM J. Res. Dev. 46, 1, 5–26.

TEXAS INSTRUMENTS. 1994. TMS320C80 Technicalbrief. Texas Instruments, Dallas, TX.

THISTLE, M. AND SMITH, B. J. 1988. A processor ar-chitecture for Horizon. In Proceedings of the Su-percomputing Conference (Orlando, FL). 35–41.

TREMBLAY, M. 1999. A VLIW convergent multipro-cessor system on a chip. In Proceedings of theMicroprocessor Forum (San Jose, CA).

TREMBLAY, M., CHAN, J., CHAUDHRY, S., CONIGLIARO,A. W., AND TSE, S. S. 2000. The MAJC archi-tecture: a synthesis of parallelism and scalabil-ity. IEEE Micro 20, 6, 12–25.

TSAI, J. Y. AND YEW, P. C. 1996. The superthreadedarchitecture: thread pipelining with run-timedata dependence checking and control specula-tion. In Proceedings of the International Confer-ence on Parallel Architectures and CompilationTechniques (Boston, MA). 35–46.

TULLSEN, D. M., EGGERS, S. J., EMER, J. S., LEVY, H. M.,LO, J. L., AND STAMM, R. L. 1996. Exploitingchoice: instruction fetch and issue on an imple-mentable simultaneous multithreading proces-sor. In Proceedings of the 23rd Annual Inter-national Symposium on Computer Architecture(Philadelphia, PA). 191–202.

TULLSEN, D. M., EGGERS, S. J., AND LEVY, H. M. 1995.Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22ndAnnual International Symposium on ComputerArchitecture (Santa Margherita Ligure, Italy).392–403.

TULLSEN, D. M., LO, J. L., EGGERS, S. J., AND LEVY,H. M. 1999. Supporting fine-grained synchro-nization on a simultaneous multithreading pro-cessor. In Proceedings of the 5th International

Symposium on High-Performance ComputerArchitecture (Orlando, FL). 54–58.

UNGERER, T., ROBIC, B., AND SILC, J. 2002. Multi-threaded processors. Computer J. 45, 3, 320–348.

VAJAPEYAM, S. AND MITRA, T. 1997. Improving su-perscalar instruction dispatch and issue by ex-ploiting dynamic code sequences. In Proceedingsof the 24th Annual International Symposium onComputer Architecture (Denver, CO). 1–12.

VIJAYKUMAR, T. N. AND SOHI, G. S. 1998. Task selec-tion for a multiscalar processor. In Proceedingsof the 31st Annual International Symposium onMicroarchitecture (Dallas, TX). 81–92.

WALL, D. W. 1991. Limits of instruction-level par-allelism. In Proceedings of the 4th InternationalConference on Architectural Support for Pro-gramming Languages and Operating Systems(Santa Clara, CA). 176–188.

WALLACE, S., CALDER, B., AND TULLSEN, D. M. 1998.Threaded multiple path execution. In Proceed-ings of the 25th Annual International Sym-posium on Computer Architecture (Barcelona,Spain). 238–249.

WALLACE, S., TULLSEN, D. M., AND CALDER, B. 1999.Instruction recycling on a multiple-path proces-sor. In Proceedings of the 5th International Sym-posium on High-Performance Computer Archi-tecture (Orlando, FL). 44–53.

WITTENBURG, J. P., MEYER, G., AND PIRSCH, P. 1999.Adapting and extending simultaneous multi-threading for high performance video signal pro-cessing applications. In Proceedings of the Work-shop on Multithreaded Execution, Architectureand Compilation (Orlando, FL).

YAMAMOTO, W. AND NEMIROVSKY, M. D. 1995. In-creasing superscalar performance through mul-tistreaming. In Proceedings of the Interna-tional Conference on Parallel Architectures andCompilation Techniques (Limassol, Cyprus).49–58.

Received June 2001; revised September 2002; accepted November 2002


survey on explicit multithreading

Documents