Multi Threading Unit 4 CA



    MULTITHREADING

The most important measure of performance for a processor is the rate at which it executes instructions. This can be expressed as

    MIPS rate = f x IPC

where f is the processor clock frequency, in MHz, and IPC (instructions per cycle) is the average number of instructions executed per cycle. Designers have therefore pursued increased performance on two fronts: increasing the clock frequency and increasing the number of instructions executed or, more properly, the number of instructions that complete during a processor cycle.
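As a brief worked illustration (the numbers are chosen for the example and do not come from the text), the relation can be applied directly:

\[
\text{MIPS rate} = f \times \text{IPC} = 2000~\text{MHz} \times 1.5 = 3000~\text{MIPS}
\]

Raising either factor raises the instruction rate, which is why designers have pushed on both fronts.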

IPC can be increased by using an instruction pipeline and then by using multiple parallel instruction pipelines in a superscalar architecture. With pipelined and multiple-pipeline designs, the principal problem is to maximize the utilization of each pipeline stage.

To improve throughput, designers have resorted to executing some instructions in a different order from the way they occur in the instruction stream and to beginning execution of instructions that may never be needed. This approach may be reaching a limit due to complexity and power consumption concerns.

An alternative approach, which allows for a high degree of instruction-level parallelism without increasing circuit complexity or power consumption, is called multithreading.

The instruction stream is divided into several smaller streams, known as threads, such that the threads can be executed in parallel.

Implicit and Explicit Multithreading

The concept of thread used in multithreaded processors may or may not be the same as the concept of software threads in a multiprogrammed operating system. It is useful to define the following terms briefly:

Process: An instance of a program running on a computer. A process embodies two key characteristics:

Resource ownership: A process includes a virtual address space to hold the process image; the process image is the collection of program, data, stack, and attributes that define the process. From time to time, a process may be allocated control or ownership of resources, such as main memory, I/O channels, I/O devices, and files.

Scheduling/execution: The execution of a process follows an execution path (trace) through one or more programs. This execution may be interleaved with that of other processes. Thus, a process has an execution state (Running, Ready, etc.) and a dispatching priority, and is the entity that is scheduled and dispatched by the operating system.

Process switch: An operation that switches the processor from one process to another, by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second.

Thread: A dispatchable unit of work within a process. It includes a processor context (which includes the program counter and stack pointer) and its own data area for a stack (to enable subroutine branching). A thread executes sequentially and is interruptible so that the processor can turn to another thread.

Thread switch: The act of switching processor control from one thread to another within the same process. Typically, this type of switch is much less costly than a process switch.

Thus, a thread is concerned with scheduling and execution, whereas a process is concerned with scheduling/execution and resource ownership.


The multiple threads within a process share the same resources. This is why a thread switch is much less time consuming than a process switch. Traditional operating systems, such as earlier versions of UNIX, did not support threads. Most modern operating systems, such as Linux, other versions of UNIX, and Windows, do support threads.
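The sharing can be seen directly in a small pthreads program (a minimal sketch assuming a POSIX system; the names and counts are invented for the illustration and are not from the text): both workers run inside one process and update the same global variable, while each has its own stack and register context.

#include <pthread.h>
#include <stdio.h>

/* Shared by every thread in the process: part of the process image. */
static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long local = 0;                 /* lives on this thread's private stack */
    for (int i = 0; i < 100000; i++) {
        local++;
        pthread_mutex_lock(&lock);  /* serialize access to the shared resource */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    printf("thread %ld: local=%ld\n", (long)(size_t)arg, local);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* Both threads updated the same address space: expect 200000. */
    printf("shared_counter=%ld\n", shared_counter);
    return 0;
}

Switching between the two workers requires saving and restoring only the per-thread processor context (program counter, stack pointer, registers) and changing stacks; the address space and other resources belong to the process and stay in place, which is why a thread switch is so much cheaper than a process switch.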

A distinction is made between user-level threads, which are visible to the application program, and kernel-level threads, which are visible only to the operating system. Both of these may be referred to as explicit threads, defined in software. All of the commercial processors and most of the experimental processors so far have used explicit multithreading. These systems concurrently execute instructions from different explicit threads, either by interleaving instructions from different threads on shared pipelines or by parallel execution on parallel pipelines.

Implicit multithreading refers to the concurrent execution of multiple threads extracted from a single sequential program. These implicit threads may be defined either statically by the compiler or dynamically by the hardware.
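A rough software-level analogue of this idea (hand-written here for illustration; real implicit multithreading is performed by the compiler or the hardware, not by the programmer, and the two-way split and all names below are assumptions of the sketch): a single sequential loop is divided into independent slices that can run as parallel threads.

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

struct range { int lo, hi; double partial; };

/* Each extracted "thread" covers an independent slice of the original loop. */
static void *sum_range(void *arg)
{
    struct range *r = arg;
    r->partial = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        r->partial += data[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    /* The sequential loop "for (i = 0; i < N; i++) sum += data[i];"
       is split into two threads that run in parallel. */
    struct range r[2] = { { 0, N / 2, 0.0 }, { N / 2, N, 0.0 } };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, sum_range, &r[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("sum = %f\n", r[0].partial + r[1].partial);  /* expect 1000000.0 */
    return 0;
}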

Approaches to Explicit Multithreading

At minimum, a multithreaded processor must provide a separate program counter for each thread of execution to be executed concurrently. The designs differ in the amount and type of additional hardware used to support concurrent thread execution.

In general, instruction fetching takes place on a thread basis. The processor treats each thread separately and may use a number of techniques for optimizing single-thread execution, including branch prediction, register renaming, and superscalar techniques. What is achieved is thread-level parallelism, which may provide for greatly improved performance when married to instruction-level parallelism.

Broadly speaking, there are four principal approaches to multithreading:

• Interleaved multithreading: This is also known as fine-grained multithreading. The processor deals with two or more thread contexts at a time, switching from one thread to another at each clock cycle. If a thread is blocked because of data dependencies or memory latencies, that thread is skipped and a ready thread is executed.

• Blocked multithreading: This is also known as coarse-grained multithreading. The instructions of a thread are executed successively until an event occurs that may cause delay, such as a cache miss. This event induces a switch to another thread. This approach is effective on an in-order processor that would stall the pipeline for a delay event such as a cache miss.

• Simultaneous multithreading (SMT): Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor. This combines the wide superscalar instruction issue capability with the use of multiple thread contexts.

• Chip multiprocessing: In this case, the entire processor is replicated on a single chip and each processor handles separate threads. The advantage of this approach is that the available logic area on a chip is used effectively without depending on ever-increasing complexity in pipeline design.

For the first two approaches, instructions from different threads are not executed simultaneously. Instead, the processor is able to rapidly switch from one thread to another, using a different set of registers and other context information. This results in a better utilization of the processor's execution resources and avoids a large penalty due to cache misses and other latency events.
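The difference between these two switching policies can be sketched as a toy scheduling loop (purely illustrative; the thread count, the every-fourth-instruction miss pattern, and the three-cycle penalty are invented and do not model any real pipeline): the interleaved policy rotates to another ready thread every cycle, while the blocked policy stays on one thread until it hits a simulated latency event.

#include <stdio.h>
#include <stdbool.h>

#define NTHREADS 3
#define CYCLES   12

/* A thread is blocked (e.g., by a cache miss) until cycle stalled_until[t]. */
static int stalled_until[NTHREADS];

/* Pretend every 4th instruction of a thread misses the cache for 3 cycles. */
static bool issue(int t, int cycle, int issued_so_far[])
{
    if (cycle < stalled_until[t])
        return false;                      /* thread not ready this cycle */
    issued_so_far[t]++;
    if (issued_so_far[t] % 4 == 0)
        stalled_until[t] = cycle + 3;      /* simulated latency event */
    return true;
}

static void run(const char *name, bool interleaved)
{
    int issued[NTHREADS] = {0};
    int current = 0, busy = 0;
    for (int t = 0; t < NTHREADS; t++)
        stalled_until[t] = 0;

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        if (interleaved)
            current = (current + 1) % NTHREADS;   /* rotate every cycle */
        else if (cycle < stalled_until[current])
            current = (current + 1) % NTHREADS;   /* switch only on a stall */

        /* If the chosen thread is still blocked, skip to a ready one
           (after a full rotation with no ready thread, the cycle is wasted). */
        for (int k = 0; k < NTHREADS && cycle < stalled_until[current]; k++)
            current = (current + 1) % NTHREADS;

        if (issue(current, cycle, issued))
            busy++;                               /* this cycle did useful work */
    }
    printf("%-12s %2d of %2d cycles did useful work\n", name, busy, CYCLES);
}

int main(void)
{
    run("interleaved:", true);
    run("blocked:", false);
    return 0;
}

In both runs the single issue slot stays busy whenever at least one thread is ready; what differs is only when the switch happens.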

The SMT approach involves true simultaneous execution of instructions from different threads, using replicated execution resources. Chip multiprocessing also enables simultaneous execution of instructions from different threads.

Figure 1 illustrates some of the possible pipeline architectures that involve multithreading and contrasts these with approaches that do not use multithreading. Each horizontal row represents the potential issue slot or slots for a single execution cycle; that is, the width of each row corresponds to the maximum number of instructions that can be issued in a single clock cycle. The vertical dimension represents the time sequence of clock cycles. An empty (shaded) slot represents an unused execution slot in one pipeline. A no-op is indicated by N.

The first three illustrations in Figure 1 show different approaches with a scalar (i.e., single-issue) processor:

• Single-threaded scalar: This is the simple pipeline found in traditional RISC and CISC machines, with no multithreading.

• Interleaved multithreaded scalar: This is the easiest multithreading approach to implement. By switching from one thread to another at each clock cycle, the pipeline stages can be kept fully occupied, or close to fully occupied. The hardware must be capable of switching from one thread context to another between cycles.


Figure 1: Approaches to Executing Multiple Threads

• Blocked multithreaded scalar: In this case, a single thread is executed until a latency event occurs that would stop the pipeline, at which time the processor switches to another thread.

Figure 1c shows a situation in which the time to perform a thread switch is one cycle, whereas Figure 1b shows that thread switching occurs in zero cycles.

In the case of interleaved multithreading, it is assumed that there are no control or data dependencies between threads, which simplifies the pipeline design and therefore should allow a thread switch with no delay. However, depending on the specific design and implementation, blocked multithreading may require a clock cycle to perform a thread switch, as illustrated in Figure 1c.

This is true if a fetched instruction triggers the thread switch and must be discarded from the pipeline.


Although interleaved multithreading appears to offer better processor utilization than blocked multithreading, it does so at the sacrifice of single-thread performance. The multiple threads compete for cache resources, which raises the probability of a cache miss for a given thread.

More opportunities for parallel execution are available if the processor can issue multiple instructions per cycle. Figures 1d through 1i illustrate a number of variations among processors that have hardware for issuing four instructions per cycle. In all these cases, only instructions from a single thread are issued in a single cycle.

The following alternatives are illustrated:

• Superscalar: This is the basic superscalar approach with no multithreading. Until relatively recently, this was the most powerful approach to providing parallelism within a processor. Note that during some cycles, not all of the available issue slots are used. During these cycles, less than the maximum number of instructions is issued; this is referred to as horizontal loss. During other instruction cycles, no issue slots are used; these are cycles when no instructions can be issued; this is referred to as vertical loss. (A small bookkeeping sketch of these two losses appears after this list.)

• Interleaved multithreading superscalar: During each cycle, as many instructions as possible are issued from a single thread. With this technique, potential delays due to thread switches are eliminated, as previously discussed. However, the number of instructions issued in any given cycle is still limited by dependencies that exist within any given thread.

• Blocked multithreaded superscalar: Again, instructions from only one thread may be issued during any cycle, and blocked multithreading is used.

• Very long instruction word (VLIW): A VLIW architecture, such as IA-64, places multiple instructions in a single word, so that several instructions are issued at a time.

• Simultaneous multithreading (SMT): Instructions are issued from multiple threads in the same cycle. If one thread has a high degree of instruction-level parallelism, it may on some cycles be able to fill all of the horizontal slots. On other cycles, instructions from two or more threads may be issued. If sufficient threads are active, it should usually be possible to issue the maximum number of instructions on each cycle, providing a high level of efficiency.

• Chip multiprocessor (multicore): Figure 1k shows a chip containing four processors, each of which has a two-issue superscalar pipeline. Each processor is assigned a thread, from which it can issue up to two instructions per cycle. Comparing Figures 1j and 1k, we see that a chip multiprocessor with the same instruction issue capability as an SMT processor cannot achieve the same degree of instruction-level parallelism. This is because the chip multiprocessor is not able to hide latencies by issuing instructions from other threads.


On the other hand, the chip multiprocessor should outperform a superscalar processor with the same instruction issue capability, because the horizontal losses will be greater for the superscalar processor. In addition, it is possible to use multithreading within each of the processors on a chip multiprocessor, and this is done on some contemporary machines.
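To make the horizontal-loss and vertical-loss bookkeeping from the superscalar discussion concrete, here is a small tally over an invented issue trace (the 4-wide issue width matches the figure description; the per-cycle counts are made-up sample data):

#include <stdio.h>

int main(void)
{
    /* Instructions actually issued in each cycle by a 4-issue superscalar
       (sample values for illustration only). */
    const int width = 4;
    const int issued_per_cycle[] = { 4, 2, 0, 3, 0, 1, 4, 2 };
    const int cycles = sizeof issued_per_cycle / sizeof issued_per_cycle[0];

    int issued = 0, horizontal_loss = 0, vertical_loss = 0;
    for (int c = 0; c < cycles; c++) {
        int n = issued_per_cycle[c];
        issued += n;
        if (n == 0)
            vertical_loss += width;          /* a whole cycle with no issues */
        else
            horizontal_loss += width - n;    /* unused slots in a busy cycle */
    }

    printf("slots used      : %d of %d\n", issued, width * cycles);
    printf("horizontal loss : %d slots\n", horizontal_loss);
    printf("vertical loss   : %d slots\n", vertical_loss);
    printf("utilization     : %.0f%%\n", 100.0 * issued / (width * cycles));
    return 0;
}

Switching threads on empty cycles attacks the vertical losses, while SMT can also attack the horizontal losses by filling leftover slots in the same cycle with instructions from another thread.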

EXAMPLE SYSTEMS

PENTIUM 4: More recent models of the Pentium 4 use a multithreading technique that Intel refers to as hyper-threading. The Pentium 4 approach is to use SMT with support for two threads. Thus, the single multithreaded processor is logically two processors.

IBM POWER5: The IBM Power5 chip, which is used in high-end PowerPC products, combines chip multiprocessing with SMT. The chip has two separate processors, each of which is a multithreaded processor capable of supporting two threads concurrently using SMT. Interestingly, the designers simulated various alternatives and found that having two two-way SMT processors on a single chip provided superior performance to a single four-way SMT processor. The simulations showed that additional multithreading beyond the support for two threads might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.

Figure 2 shows the IBM Power5's instruction flow diagram. Only a few of the elements in the processor need to be replicated, with separate elements dedicated to separate threads. Two program counters are used. The processor alternates fetching instructions, up to eight at a time, between the two threads. All the instructions are stored in a common instruction cache and share an instruction translation facility, which does a partial instruction decode. When a conditional branch is encountered, the branch prediction facility predicts the direction of the branch and, if possible, calculates the target address. For predicting the target of a subroutine return, the processor uses a return stack, one for each thread.

Instructions then move into two separate instruction buffers. Then, on the basis of thread priority, a group of instructions is selected and decoded in parallel. Next, instructions flow through a register-renaming facility in program order. Logical registers are mapped to physical registers. The Power5 has 120 physical general-purpose registers and 120 physical floating-point registers. The instructions are then moved into issue queues.


Figure 2: Power5 Instruction Data Flow

From the issue queues, instructions are issued using simultaneous multithreading. That is, the processor has a superscalar architecture and can issue instructions from one or both threads in parallel. At the end of the pipeline, separate thread resources are needed to commit the instructions.
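As a purely conceptual sketch of the alternating fetch described above (the two program counters and the fetch width of eight come from the text; the addresses, the 4-byte instruction size, and the loop itself are assumptions of the example and make no attempt to model the real POWER5 pipeline):

#include <stdio.h>

#define FETCH_WIDTH 8   /* "up to eight at a time" per the description */
#define NTHREADS    2   /* two threads, two program counters           */

struct thread_state {
    unsigned long pc;          /* separate program counter per thread    */
    unsigned long fetched;     /* instructions fetched so far (for demo) */
};

int main(void)
{
    struct thread_state ts[NTHREADS] = { { 0x1000, 0 }, { 0x8000, 0 } };

    /* The fetch stage alternates between the two threads each cycle;
       fetched instructions would then pass through the shared instruction
       cache path into per-thread instruction buffers further downstream. */
    for (int cycle = 0; cycle < 6; cycle++) {
        struct thread_state *t = &ts[cycle % NTHREADS];
        printf("cycle %d: fetch %d instructions for thread %d at pc=0x%lx\n",
               cycle, FETCH_WIDTH, cycle % NTHREADS, t->pc);
        t->pc      += FETCH_WIDTH * 4;   /* assume 4-byte instructions */
        t->fetched += FETCH_WIDTH;
    }

    printf("fetched: thread 0 = %lu, thread 1 = %lu\n",
           ts[0].fetched, ts[1].fetched);
    return 0;
}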