
Computation Gap
Instruction Level Parallelism

A.R. Hurson
Computer Science Department
Missouri University of Science & Technology
hurson@mst.edu

Computation Gap

Computation gap is defined as the difference between the computational power demanded by the application environments and the computational capability of the existing computers.

Today, one can find many applications which require orders of magnitude more computations than the capability of the so-called super-computers and super-systems.

Computation Gap

Some Applications
It is estimated that the so-called Problem Solving and Inference Systems require an environment with the computational power on the order of 100 MLIPS to 1 GLIPS (1 LIPS ≈ 100-1000 instructions).

Computation Gap

Some Applications
Experiences in Fluid Dynamics have shown that the conventional super-computers can calculate steady 2-dimensional flow in minutes.

However, conventional super-computers require up to 20 hours to handle time-dependent 2-dimensional flow or steady 3-dimensional flows on simple objects.

Computation Gap

Some Applications
Numerical Aerodynamics Simulator requires an environment with a sustained speed of 1 billion FLOPS.

Strategic Defense Initiative requires a distributed, fault-tolerant computing environment with a processing rate in the MOPS range.

Computation Gap

Some Applications
The U.S. Patent and Trademark Office has a database of size 25 terabytes subject to search and update.

An angiogram department of a mid-size hospital generates more than 64 * 10^11 bits of data a year.

NASA's Earth Observing System will generate more than 11,000 terabytes of data during the 15-year time period of the project.

Computation Gap

Performance
CDC STAR-100        25-100 MFLOPS
DAP                 100 MFLOPS
ILLIAC IV           160 MFLOPS
HEP                 160 MFLOPS
CRAY-1              25-80 MFLOPS
CRAY X-MP(1)        210 MFLOPS
CRAY X-MP(4)        840 MFLOPS
CRAY-2              250 MFLOPS
CDC CYBER 205       400 MFLOPS
Hitachi S-810(10)   315 MFLOPS
Hitachi S-810(20)   630 MFLOPS
Fujitsu FACOM VP-50   140 MFLOPS
Fujitsu FACOM VP-100  285 MFLOPS
Fujitsu FACOM VP-200  570 MFLOPS
Fujitsu FACOM VP-400  1,140 MFLOPS

Computation Gap

Performance

NEC SX-1            570 MFLOPS
NEC SX-2            1,300 MFLOPS
IBM RP3             1,000 MFLOPS
MPP 8-bit integer   1,545-6,553 MIPS
MPP 12-bit integer  795-4,428 MIPS
MPP 16-bit integer  484-3,343 MIPS
MPP 32-bit FL       165-470 MIPS
MPP 40-bit FL       126-383 MIPS

Computation Gap

Performance
NEC (Earth Simulator)  35 tera FLOPS
IBM Blue Gene          70 tera FLOPS

Computation Gap

Some New Applications
A recent estimate puts the amount of new information generated in 2002 at 5 exabytes (1 exabyte = 10^18 bytes, which is approximately equal to all words ever spoken by human beings), and 92% of this information is on hard disk. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate.
The TREC database holds around 800 million static pages having 6 trillion bytes of plain text, equal to the size of millions of books.
The Google system routinely accumulates millions of pages of new text information every week.

Computation Gap

[Figure: growth of data types over time (1970-2000) — binary, text, image, multimedia, and sensor data.]

Computation Gap

Problem
Suppose a machine capable of handling 10^6 characters per second is in hand. How long does it take to search 25 terabytes of data?

25 * 10^12 / 10^6 = 25 * 10^6 sec ≈ 4 * 10^5 min ≈ 7 * 10^3 hours ≈ 290 days

Even if the performance is improved by a factor of 1000, it takes about 8 hours to exhaustively search this database!
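The arithmetic above can be checked with a short sketch (numbers taken from the slide):

```python
# Sketch of the search-time estimate above.
data = 25 * 10**12            # characters in the database
rate = 10**6                  # characters scanned per second

seconds = data / rate         # 25 * 10^6 seconds
print(seconds / 60)           # ~4.2e5 minutes
print(seconds / 3600)         # ~6.9e3 hours
print(seconds / 86400)        # ~290 days

# Even a 1000x faster machine still needs hours:
print(seconds / 1000 / 3600)  # ~7 hours
```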

Computation Gap

Problem
NOT PRACTICAL!
WHAT ARE THE SOLUTIONS?

Computation Gap

Reduce the amount of needed computations (advances in software technology and algorithms).
Improve the speed of the computers:

Physical Speed (advances in hardware technology).
Logical Speed (advances in computer architecture/organization).

Computation Gap

Architectural Advances of the Uni-processor Organization

Organization of the conventional uni-processor systems can be modified in order to remove the existing bottlenecks. For example, the Access Gap is one of the problems in the von Neumann organization.

Computation Gap

Access Gap
Access gap is defined as the time difference between the CPU cycle time and the main memory cycle time.

The access gap problem was created by the advances in technology.

Computation Gap

Access Gap
In early computers such as the IBM 704, CPU and main memory cycle times were identical — i.e., 12 μsec.

The IBM 360/195 had a logic delay of 5 nsec per stage, a CPU cycle time of 54 nsec, and a main memory cycle time of .756 μsec, and the CDC 7600 had a CPU and main memory cycle time of 27.5 nsec and 275 nsec, respectively.

Computation Gap

System Architecture/Organization
To overcome the technological limitations, computer designers have long been attracted to techniques that are classified under the term "Concurrency".

Computation Gap

Concurrency
Concurrency is a generic term which defines the ability of the computer hardware to simultaneously execute many actions at any instant.
Within this general term are several well-recognized techniques such as Parallelism, Pipelining, and Multiprocessing.

Computation Gap

Concurrency
Although these techniques have the same origin and are often hard to distinguish, in practice they are different in their general approach.

Computation Gap

Concurrency
Parallelism achieves concurrency by replicating/duplicating the hardware structure many times.

Pipelining takes the approach of splitting the function to be performed into smaller pieces, allocating separate hardware to each piece, and overlapping the operations of each piece.

Computation Gap

Concurrent Systems
Classification

• Feng's Classification
• Flynn's Classification
• Handler's Classification

Computation Gap

Concurrent Systems
Feng's Classification

• In this classification, the concurrent space is identified as a two-dimensional space based on the bit and word multiplicities.

Computation Gap

Concurrent Systems
Feng's Classification

[Figure: Feng's two-dimensional space — word multiplicity vs. bit multiplicity. Sample points: A (1,1) serial machine; B (32,1) IBM 360; C(1,N) bit-slice machines such as Staran (1,256) and MPP (1,16384); D(M,N) word-parallel machines such as C.mmp (16,16).]

Computation Gap

Concurrent Systems
Feng's Classification

• Point A represents a pure sequential machine — i.e., a uni-processor with serial ALU.
• Point B represents a uni-processor with parallel ALU.
• Point C represents a parallel bit-slice organization.
• Point D represents a parallel word organization.

Computation Gap

Concurrent Systems
Flynn's Classification

• Flynn has classified the concurrent space according to the multiplicity of instruction and data streams:

I = { Single Instruction Stream (SI), Multiple Instruction Stream (MI) }
D = { Single Data Stream (SD), Multiple Data Stream (MD) }

Computation Gap

Concurrent Systems
Flynn's Classification

• The Cartesian product of these two sets will define four different classes:
− SISD
− SIMD
− MISD
− MIMD

Computation Gap

Concurrent Systems
Flynn's Classification — Revisited

• The MIMD class can be further divided based on:
− Memory structure — global or distributed.
− Communication/synchronization mechanism — shared variable or message passing.

Computation Gap

Concurrent Systems
Flynn's Classification — Revisited

• As a result we have four additional classes of computers:
− GMSV — Shared memory multiprocessors
− GMMP — ?
− DMSV — Distributed shared memory
− DMMP — Distributed memory (multi-computers)

Computation Gap

Concurrent Systems
Handler's Classification

• Handler has extended Feng's concurrent space by a third dimension, namely, the number of control units.

• Handler's space is defined as T = (k, d, w):
− k: number of control units,
− d: number of ALUs controlled by a control unit,
− w: number of bits handled by an ALU.

Computation Gap

Concurrent Systems
Handler's Classification

[Figure: Handler's three-dimensional space — k (number of control units), d (number of ALUs), w (word length). Sample points: IBM 360/91 (1,3,64), TI ASC (1,4,64), ILLIAC IV (4,64,64), C.mmp (16,1,16).]

Computation Gap

Concurrent Systems
Handler's Classification

• Point (1,1,1) represents a von Neumann machine with serial ALU.

• Point (1,1,M) represents a von Neumann machine with parallel ALU.

Computation Gap

Concurrent Systems
Handler's Classification

• To represent pipelining at different levels — i.e., macro pipeline, instruction pipeline, and arithmetic pipeline — as well as diversity, sequentiality, and flexibility/adaptability, the original Handler scheme has been extended by three variables (k΄, d΄, w΄) and three operators (+, *, v).

Computation Gap

Concurrent Systems
Handler's Classification

• k΄ represents macro pipeline
• d΄ represents instruction pipeline
• w΄ represents arithmetic pipeline
• + represents diversity (parallelism)
• * represents sequentiality (pipelining)
• v represents flexibility/adaptability

Computation Gap

Concurrent Systems
Handler's Classification

• According to the extended Handler's scheme:

ILLIAC IV: (1*1, 1*1, 48*1) * (1*1, 64*1, 64*1)
           Front-end (B 6700)   Array Processor

CDC 7600:  (1*1, 1*9, 60*1)
CDC Star:  (1*1, 2*1, 64*4)

Computation Gap

Concurrent Systems
Handler's Classification

• According to the extended Handler's scheme:

DAP: (1*1, 1*1, 32*1) * [ (1*1, 128*1, 32*1) v (1*1, 4096*1, 1*1) ]
     front-end           Array Processor

Computation Gap

Questions
What are the motivations behind the classification of the computer systems?
What are the shortcomings of the aforementioned classification schemes?
Can you propose a new classification scheme?

Computation Gap

Why classify computer architecture?
Generalize and identify the characteristics of different systems.
Group machines with common architectural features:

• To study systems more easily.
• To transfer solutions more easily.

Computation Gap

Why classify computer architecture?
Better estimate the weak and strong points of a system:

• To utilize a system more effectively.

Anticipate future trends and developments:

• Research directions.

Computation Gap

Goals of a classification scheme
Categorize all existing and foreseeable designs.
Differentiate different designs.
Assign an architecture to a unique class.

Computation Gap

Summary
Computation Gap
How to reduce the Computation Gap:
• Advances in Software and Algorithms
• Advances in Technology
• Advances in Computer Organization/Architecture

Concurrency
Classification
• Feng
• Flynn/Extended MIMD
• Handler

Computation Gap

Concurrent Systems

[Figure: evolution of concurrent systems — scalar sequential machines with lookahead and I/E overlap lead to functional parallelism (multiple functional units, pipelining); implicit and explicit vector machines (memory-to-memory, register-to-register); SIMD (associative processor, processor array); MIMD (multi-computer, multiprocessor); VLIW and SuperScalar designs.]

Computation Gap

Concurrent Systems
We group concurrent systems into two groups:

• Control Flow
• Data Flow

Computation Gap

Concurrent Systems

In the control flow model of computation, execution of an instruction activates the execution of the next instruction.
In the data flow model of computation, availability of the data activates the execution of the next instruction(s).

Computation Gap

Concurrent Systems

[Figure: taxonomy of concurrent systems — Control Flow: parallel systems (ensemble, SIMD array, associative), multiprocessor systems (loosely coupled, tightly coupled), pipelined systems (linear/feedback, unifunction/multifunction, static/dynamic); Data Flow: data driven systems (static, dynamic) and demand driven systems.]

Computation Gap

Concurrent Systems
Within the scope of the control flow systems, we distinguish three classes — namely:

• Parallel Systems
• Pipeline Systems
• Multiprocessors

Computation Gap

Concurrent Systems
This distinction is due to the exploitation of concurrency and the interrelationships among the control unit, processing elements, and memory modules in each group.

Computation Gap

Multiprocessor Systems
Multiprocessor systems can be grouped into two classes:

• Tightly Coupled (Central Memory — not scalable)
• Loosely Coupled (Distributed Memory — scalable)

Computation Gap

Multiprocessor Systems
Tightly Coupled (Central Memory — not Scalable): shared memory modules are separated from processors by an interconnection network or a multiport interface. All processors have equal access time to all memory words. Therefore, the memory access time (assuming no conflict) is independent of the module being accessed (C.mmp, HEP, Encore's Multimax, Cedar, and NYU Ultracomputer).

Computation Gap

Multiprocessor Systems
Loosely Coupled (Distributed Memory — Scalable): each processor has a local-public memory. Each processor can directly access its memory module, but all other accesses to non-local memory modules must be made through an interconnection network. The access time varies with the location of the memory module (Cm*, BBN Butterfly, and Dash).

Computation Gap

Multiprocessor Systems
Besides the higher throughput, multiprocessor systems offer more reliability since failure in any one of the redundant components can be tolerated through system reconfiguration.
Multiprocessor organization is a logical extension of the parallel system — i.e., the array of processors organization. However, the degree of freedom associated with the processors is much higher than it is in an array of processors.

Computation Gap

Multiprocessor Systems
The independence of the processors and the sharing of resources among the processors — both desirable features — are achieved at the expense of an increase in complexity at both the hardware and software levels.

Computation Gap

Multi-computer Systems
A multi-computer system is a collection of processors interconnected by a message passing network.
Each processor is an autonomous computer having its own local memory and communicating with the others through the message passing network (iPSC, nCUBE, and Mosaic).

Computation Gap

Pipeline Systems
The term pipelining refers to a design technique that introduces concurrency by taking a basic function to be invoked repeatedly in a process and partitioning it into several sub-functions with the following properties:

Computation Gap

Pipeline Systems
• Evaluation of the basic function is equivalent to some sequential evaluation of the sub-functions.
• Other than the exchange of inputs and outputs, there is no interrelationship between sub-functions.
• Hardware may be developed to execute each sub-function.
• The execution times of these hardware units are usually approximately equal.

Computation Gap

Pipeline Systems
Under the aforementioned conditions, the speed-up from pipelining equals the number of pipe stages.
However, stages are rarely balanced and, furthermore, pipelining does involve some overhead.

Computation Gap

Pipeline Systems
The concept of pipelining can be implemented at different levels. With regard to this issue, one can address:

• Arithmetic Pipelining
• Instruction Pipelining
• Processor Pipelining

Computation Gap

Pipeline Systems
Non-pipelined instruction cycle:

Inst. i    IF ID EX MEM WB
Inst. i+1                  IF ID EX MEM WB
Inst. i+2                                  IF ID EX ...

Computation Gap

Pipeline Systems
Pipelined instruction cycle:

Inst. i    IF ID EX MEM WB
Inst. i+1     IF ID EX MEM WB
Inst. i+2        IF ID EX MEM WB
Inst. i+3           IF ID EX MEM WB

A pipelined instruction cycle gives a peak performance of one instruction every step.

Computation Gap

Pipeline Systems — Example
Assume an unpipelined machine has a 10 ns clock cycle. It requires four clock cycles for the ALU and branch operations and five clock cycles for the memory reference operations. Calculate the average instruction execution time, if the relative frequencies of these operations are 40%, 20%, and 40%, respectively.

Ave. instr. exec. time = 10 * [ (40% + 20%) * 4 + 40% * 5 ] = 44 ns

Computation Gap

Pipeline Systems — Example
Now assume we have a pipelined version of the machine. Furthermore, due to the clock skew and setup, pipelining adds 1 ns overhead to the clock time. Ignoring the latency, now calculate the average instruction execution time.

Ave. instr. exec. time = 10 + 1 = 11 ns, and
Speed up = 44/11 = 4
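The two calculations above can be reproduced with a small sketch (values from the example):

```python
# Unpipelined: average instruction execution time.
clock = 10                        # ns
freqs = {"alu": 0.4, "branch": 0.2, "mem": 0.4}
cycles = {"alu": 4, "branch": 4, "mem": 5}

unpipelined = clock * sum(freqs[op] * cycles[op] for op in freqs)
print(unpipelined)                # ≈ 44 ns

# Pipelined: one (stretched) clock per instruction.
pipelined = clock + 1             # 1 ns skew/setup overhead
print(unpipelined / pipelined)    # speed-up ≈ 4
```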

Computation Gap

Pipeline Systems — Example
Assume that the times required for the five units in the instruction cycle are 10 ns, 8 ns, 10 ns, 10 ns, and 7 ns. Further, assume that pipelining adds 1 ns overhead. Find the speed-up factor:

Ave. instr. exec. time (unpipelined) = 10 + 8 + 10 + 10 + 7 = 45 ns
Ave. instr. exec. time (pipelined) = 10 + 1 = 11 ns
Speed up = 45/11 ≈ 4.1

Computation Gap

Pipeline Systems
Issues of concern:

• Overlapping operations should not over-commit resources — every pipe stage is active on every clock cycle.

• All operations in a pipe stage should complete in one clock, and any combination of operations should be able to occur at once.

Computation Gap

Pipeline Systems
Pipeline systems can be further classified as:

• Linear Pipe / Feedback Pipe
• Scalar Pipe / Vector Pipe
• Uni-function Pipe / Multifunction Pipe
• Statically/Dynamically Configured Pipe

Computation Gap

Pipeline Systems
Uni-function Pipeline: the pipeline is dedicated to a specific function — the CRAY-1 has 12 dedicated pipelines.
Multifunction Pipeline: the pipeline system can perform different functions either at different times or at the same time — the TI-ASC has multifunction pipelines reconfigurable for a variety of arithmetic operations.

Computation Gap

Pipeline Systems
Static Pipeline: the pipeline system assumes one configuration at a time.
Dynamic Pipeline: the pipeline system allows several functional configurations to exist simultaneously.

Computation Gap

Pipeline Systems — Example
In a multifunction pipe of 5 stages, calculate the speed-up factor for

Y = Σ (i = 1 to n) A(i) * B(i)

Computation Gap

Pipeline Systems — Example
Product terms will be generated in (n - 1) + 5 steps.
Additions will be performed in:
5 + (n/2 - 1) + 5 + (n/4 - 1) + ... + 5 + (1 - 1) ≈ (4 log2 n + n) steps.
Speed-up ratio:

S = 5(2n - 1) / (2n + 4 log2 n + 4) ≈ 5 for large n
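The limit can be checked numerically with a sketch of the formula above (the helper name is illustrative, not from the slides):

```python
import math

# Speed-up of the 5-stage pipe for Y = sum of A(i)*B(i),
# using S = 5(2n-1) / (2n + 4*log2(n) + 4).
def speedup(n):
    sequential = 5 * (2 * n - 1)   # 2n-1 operations, 5 steps each
    pipelined = 2 * n + 4 * math.log2(n) + 4
    return sequential / pipelined

print(speedup(1024))    # ≈ 4.89
print(speedup(10**6))   # ≈ 5.0 — approaches 5 for large n
```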

Computation Gap

Pipeline Systems
A concept known as hazard is a major concern in a pipeline organization.
A hazard prevents the pipeline from accepting data at the maximum rate that the staging clock might support.

Computation Gap

Pipeline Systems
A hazard can be of three types:

• Structural Hazard: arises from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution — two different pieces of data attempt to use the same stage at the same time.

• Data-Dependent Hazard: arises when an instruction depends on the result of a previous instruction — the pass through a stage is a function of the data value.

Computation Gap

Pipeline Systems
• Control Hazard: arises from the pipelining of instructions that affect the PC — Branch.

Computation Gap

Pipeline Systems
Structural Hazard (assume a single-memory pipeline system):

Inst. i    IF ID EX MEM WB
Inst. i+1     IF ID EX MEM WB
Inst. i+2        IF ID EX MEM WB
Inst. i+3           IF ID EX MEM WB
Inst. i+4              Stall IF ID EX MEM WB

Computation Gap

Pipeline Systems
Data Hazard

• A data hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand.

ADD R1, R2, R3
SUB R4, R1, R5

Computation Gap

Pipeline Systems
Data Hazard

• A data hazard can be resolved with a simple forwarding technique — if the forwarding hardware detects that the previous ALU operation has written to a source register of the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Computation Gap

Pipeline Systems
Data Hazard — Classification

• Assume i and j are two instructions and j is a successor of i; then one could expect three types of data hazard:
− Read after write (RAW)
− Write after write (WAW)
− Write after read (WAR)
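The three cases can be checked mechanically; the sketch below (a hypothetical helper, assuming a simple "op, dst, src, src" instruction format) reports which hazards hold between i and a later j:

```python
# Classify the data hazard(s) between instruction i and a later
# instruction j, each given as (op, dst, src1, src2).
def hazards(i, j):
    dst_i, srcs_i = i[1], set(i[2:])
    dst_j, srcs_j = j[1], set(j[2:])
    found = []
    if dst_i in srcs_j:
        found.append("RAW")   # j reads what i writes (flow)
    if dst_i == dst_j:
        found.append("WAW")   # j writes i's destination (output)
    if dst_j in srcs_i:
        found.append("WAR")   # j writes what i reads (anti)
    return found

add = ("ADD", "R1", "R2", "R3")
sub = ("SUB", "R4", "R1", "R5")
print(hazards(add, sub))      # ['RAW']
```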

Computation Gap

Pipeline Systems
Data Hazard — Classification

• Read after write (RAW) — j reads a source before i writes it (flow dependence).

• Write after write (WAW) — j writes into the same destination as i does (output dependence).

LW  R1, 0(R2)   IF ID EX MEM1 MEM2 WB
Add R1, R2, R3     IF ID EX WB

Computation Gap

Pipeline Systems
Data Hazard — Classification

• Write after read (WAR) — j writes into a source of i (anti dependence).

SW  0(R1), R2   IF ID EX MEM1 MEM2 WB
Add R2, R4, R3     IF ID EX WB

Computation Gap

Pipeline Systems
Data Hazard — Forwarding

• One can use the concept of data forwarding to overcome stall(s) due to data hazard.

Add R1, R2, R3     IF ID EX MEM WB
Sub R4, R1, R5        IF ID EX MEM WB
And R6, R1, R7           IF ID EX MEM WB
OR  R8, R1, R9              IF ID EX MEM WB
XOR R10, R1, R11               IF ID EX MEM ...

Computation Gap

Pipeline Systems
Data Hazard — Forwarding

• In some cases data forwarding does not work:

LW  R1, 0(R2)   IF ID EX MEM WB
Sub R4, R1, R5     IF ID EX MEM WB   (forwarding does not work)
Add R6, R1, R7        IF ID EX MEM WB   (forwarding works)
OR  R8, R1, R9           IF ID EX MEM WB

Computation Gap

Pipeline Systems
Data Hazard — Stalling

• In cases where data forwarding does not work, the pipe has to be stalled:

LW  R1, A        IF ID EX MEM WB
Add R4, R1, R7      IF ID EX MEM WB
Sub R5, R1, R8         IF ID EX MEM WB
And R6, R1, R7            IF ID EX MEM WB

Computation Gap

Pipeline Systems
Data Hazard — Stalling

LW  R1, A        IF ID EX MEM WB
Add R4, R1, R7      IF ID Stall EX MEM WB
Sub R5, R1, R8         IF Stall ID EX MEM WB
And R6, R1, R7            Stall IF ID EX MEM WB

Computation Gap

Pipeline Systems
Data Hazard — Stalling

• The pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.

• This delay cycle — bubble or pipeline stall — allows the load data to be generated at the time it is needed by the instruction.

Computation Gap

Pipeline Systems
Data Hazard — Stalling

• Let us look at A = B + C:

LW  R1, B        IF ID EX MEM WB
LW  R2, C           IF ID EX MEM WB
Add R3, R1, R2         IF ID Stall EX MEM WB   (stall needed to allow the load of C to complete; forwarding)
ST  A, R3                 IF Stall ID EX MEM WB

Computation Gap

Pipeline Systems
Data Hazard — Example

• Assume 30% of the instructions are loads and half the time the instruction following a load instruction depends on the result of the load. If the hazard creates a single-cycle delay, how much faster is the ideal pipelined machine?

CPI(ideal) = 1
CPI(new) = (.7 * 1 + .3 * 1.5) = 1.15
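The CPI calculation above, as a sketch:

```python
# Effective CPI with a one-cycle load-use delay.
load_frac = 0.3   # fraction of instructions that are loads
dep_frac = 0.5    # fraction of loads whose next instruction depends on them
stall = 1         # cycles lost per dependent load

cpi_new = (1 - load_frac) * 1 + load_frac * (1 + dep_frac * stall)
print(cpi_new)    # ≈ 1.15: the ideal (CPI = 1) machine is 1.15x faster
```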

Computation Gap

Pipeline Systems
Data Hazard — Pipeline Scheduling or Instruction Scheduling

• The compiler attempts to schedule the pipeline to avoid the stalls by rearranging the code sequence to eliminate the hazard — software support to avoid data hazard.

• Sometimes, if the compiler cannot schedule the interlocks, a no-op instruction may be inserted.

Computation Gap

Pipeline Systems
Data Hazard — Pipeline Scheduling or Instruction Scheduling

• Let us look at the following sequence of instructions:

a = b + c
d = e - f

Computation Gap

Pipeline Systems
Data Hazard — Pipeline Scheduling or Instruction Scheduling

Load  Rb, b       IF ID EX MEM WB
Load  Rc, c          IF ID EX MEM WB
ADD   Ra, Rb, Rc        IF ID EX MEM WB
Store a, Ra                IF ID EX MEM WB
Load  Re, e                   IF ID EX MEM WB
Load  Rf, f                      IF ID EX MEM WB
SUB   Rd, Re, Rf                    IF ID EX MEM WB
Store d, Rd                            IF ID EX MEM WB

Computation Gap

Pipeline Systems
Control Hazard

• If instruction i is a successful branch, then the PC is changed at the end of the MEM phase. This means stalling the next instructions for three clock cycles.

Computation Gap

Pipeline Systems
Control Hazard

Computation Gap

Pipeline Systems
Control Hazard — Observations

• Three clock cycles are wasted for every branch.
• However, the above sequence is not even possible, since we do not know the nature of the instruction until after instruction i + 1 is fetched.

Computation Gap

Pipeline Systems
Control Hazard — Solution

Computation Gap

Pipeline Systems
Control Hazard — Solution

• Still, the performance penalty is severe.
• What are the solution(s) to speed up the pipeline?

Computation Gap

Pipeline Systems
Control Hazard — Reducing pipeline branch penalties

• Detect, earlier in the pipeline, whether or not the branch is successful.

• For a successful branch, calculate the value of the PC earlier.

• It should be noted that these solutions come at the expense of extra hardware.

Computation Gap

Pipeline Systems
Control Hazard — Reducing pipeline branch penalties

• Freeze the pipeline — hold any instruction after the branch until the branch destination is known. Easy to enforce.

• Assume unsuccessful branch — continue to fetch instructions as if the branch were a normal instruction. If the branch is taken, then stop the pipeline and restart the fetch.

Computation Gap

Pipeline Systems
Control Hazard — Reducing pipeline branch penalties

• Assume the branch is successful — as soon as the target address is calculated, fetch and execute instructions at the target.

• Delayed Branch — software attempts to make the successor instruction valid and useful.

Computation Gap

Pipeline Systems
Control Hazard — Reducing pipeline branch penalties

• Assume branch is not successful:

Untaken branch Inst.  IF ID EX MEM WB
Inst. i + 1              IF ID EX MEM WB
Inst. i + 2                 IF ID EX MEM WB
Inst. i + 3                    IF ID EX MEM WB
Inst. i + 4                       IF ID EX MEM WB

Computation Gap

Pipeline Systems
Control Hazard — Reducing pipeline branch penalties

Computation Gap

Pipeline Systems
Structural Hazard

• For statically configured pipelines, one can predict precisely when a structural hazard might occur and hence it is possible to schedule the pipeline so that the collisions do not occur.

Computation Gap

Pipeline Systems
Structural Hazard

• Let Si (1 ≤ i ≤ n) denote a stage of a pipeline that performs a well-defined subtask with a delay time τi.

• Define latency as the minimum time elapsed between the initiation of two processes. Therefore, for a linear pipeline the latency is Max(τi), 1 ≤ i ≤ n.

Computation Gap

Pipeline Systems
Structural Hazard

• A Reservation Table represents the flow of data through the pipeline for one complete evaluation of a given function.

• It is a table which shows the activation of the pipeline stages at each moment of time.

Computation Gap

Pipeline Systems
Structural Hazard

• Assume the following pipeline:

[Figure: a pipeline of stages S0–S6 with feedback paths — data enters at S0 and makes a 2nd and a 3rd pass through earlier stages before leaving through S4 and S5. Stage delays: τi = t, 0 ≤ i ≤ 6 and i ≠ 3; τ3 = 2t.]

Computation Gap

Pipeline Systems
Structural Hazard

• The following is the reservation table of the aforementioned pipeline organization:

[Table: reservation table with stages S0–S6 as rows and time steps t0–t12 as columns; an x marks each time step at which a stage is in use.]

Computation Gap

Pipeline Systems
Structural Hazard

• For a given pipeline organization, one can always derive its unique reservation table. However, different pipeline organizations might have the same reservation table.

Computation Gap

Pipeline Systems
Structural Hazard

• A pipeline is statically configured if it assumes the same reservation table for each activation.

• A pipeline is multiply configured if the reservation table is one from a set of reservation tables.

• A pipeline is dynamically configured if each activation does not have a predetermined reservation table.

Computation Gap

Pipeline Systems
Structural Hazard

• A collision occurs if two or more activations attempt to use the same pipeline segment simultaneously.

• A collision will occur if reservation tables are offset by l time units and activation of the same pipeline segment overlaps.

• l is called a forbidden latency — two activations should not be initiated l time units apart.

Computation Gap

Pipeline Systems
Structural Hazard

• Given a pipeline system, one can define the forbidden list L as the set of forbidden latencies:

L = (l1, l2, ..., ln)

• For a given L we can define the collision vector C as:

C = (Cn, Cn-1, ..., C1), n = Max(lj), lj ∈ L, and
Ci = 1 if i ∈ L
Ci = 0 otherwise

Computation Gap

Pipeline Systems
Structural Hazard

• The collision vector can be interpreted as follows: an initiation is allowed at every time unit i such that Ci = 0. This allows us to build a finite state diagram of possible initiations.

• The initial state is the collision vector, and each state is represented by a combination of the collision vectors which have led to such a state.

Computation Gap

Pipeline Systems
Structural Hazard

• In case of our example we have:

L = (4, 8, 1, 3, 5)
C = 10011101

• The initial state 10011101 indicates that we can have a new initiation at time 2, 6, or 7. If we have a new initiation at time 2, then the finite state machine would be in state (00100111) ∨ (10011101) = 10111111. Following such a procedure, then we have:

Computation Gap

Pipeline Systems
Structural Hazard

[Figure: state diagram of possible initiations, starting from the initial state 10011101.]
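The collision-vector bookkeeping above can be sketched in a few lines (the helper names are illustrative, not from the slides):

```python
# Build the collision vector from the forbidden list and apply
# one transition of the initiation state machine.
L = {4, 8, 1, 3, 5}                 # forbidden latencies
n = max(L)

# C as a bit string C_n ... C_1, left to right.
C = "".join("1" if i in L else "0" for i in range(n, 0, -1))
print(C)                            # 10011101

def next_state(state, latency):
    """Shift the state right by `latency` bits, then OR in C."""
    shifted = ("0" * latency + state)[:n]
    return "".join("1" if a == "1" or b == "1" else "0"
                   for a, b in zip(shifted, C))

allowed = sorted(n - i for i, bit in enumerate(C) if bit == "0")
print(allowed)                      # [2, 6, 7]
print(next_state(C, 2))             # 10111111
```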

Computation Gap

Pipeline Systems
Structural Hazard

• From the state diagram, one can design a controller to regulate the initiation of the activations.

• In a state diagram:
− A simple cycle is a cycle in which each state appears only once.
− The average latency of a cycle is the sum of its latencies divided by the number of states in the cycle.
− A greedy cycle is a cycle that always minimizes the latency between the current initiation and the very next initiation.

Computation Gap

Pipeline Systems — Example
For a 5-stage pipeline characterized by the following reservation table:

[Table: reservation table with stages 1–5 as rows and time steps 1–8 as columns; an x marks each (stage, time) pair in use.]

a) Determine the forbidden list and the collision vector.
b) Draw the state diagram and determine the minimal average latency and the maximum throughput.

Computation Gap

Pipeline Systems
Multifunction Pipeline

• The scheduling method for static uni-function pipelines can be generalized for multifunction pipelines.

• A pipeline that can perform P distinct functions can be classified by P overlaid reservation tables.

Computation Gap

Pipeline Systems
Multifunction Pipeline

• Each task to be initiated can be associated with a function tag to identify the reservation table to be used.

• In this case, a collision may occur between two or more tasks with the same function tag or with distinct function tags.

Computation Gap

Pipeline Systems
Multifunction Pipeline

• For example, the following reservation table characterizes a 3-stage 2-function pipeline:

[Table: reservation table with stages 1–3 as rows and time steps 0–4 as columns; each entry is marked A, B, or AB according to which function(s) occupy the stage at that time.]

Computation Gap

Pipeline Systems
Multifunction Pipeline

• A forbidden set of latencies for a multifunction pipeline is the collection of collision-causing latencies.

• A cross-collision vector marks the forbidden latencies between the functions — i.e., vAB represents the forbidden latencies between A and B. Therefore, for a P-function pipeline one can define P² cross-collision vectors.

• The P² cross-collision vectors can be represented by collision matrices.

Computation Gap

Pipeline Systems
Multifunction Pipeline

• For our example we have:

Cross-Collision Vectors:
vAA = (0110)   vAB = (1011)
vBA = (1010)   vBB = (0110)

Collision Matrices:
[Figure: the collision matrices formed from the cross-collision vectors above, one matrix per function.]

Computation Gap

Pipeline Systems
Multifunction Pipeline

• Similar to the uni-function pipeline, one can use the collision matrices in order to construct a state diagram.

Computation Gap

Pipeline Systems
Multifunction Pipeline

[Figure: state diagram built from the collision matrices — states such as (0110, 1010) and (1011, 0111), with transitions labeled by initiations A1, A3, A4, A5+, B1, B3, B4, B5+.]

Computation Gap

Pipeline Systems
Vector Processors

• A vector processor is equipped with multiple vector pipelines that can be concurrently used under hardware or firmware control.

• There are two groups of pipeline processors:
− Memory-to-Memory Architecture
− Register-to-Register Architecture

Computation Gap

Pipeline Systems
Vector Processors

• Memory-to-Memory architecture supports the pipelined flow of vector operands directly from the memory to the pipelines and then back to the memory (Cyber 205).

• Register-to-Register architecture uses vector registers as operands for the functional pipelines (the Cray series, which uses fixed-size vector registers, and the Fujitsu VP-2000 series, which uses reconfigurable vector registers).

Computation Gap

Pipeline Systems
Vector Processors

• Vector machines allow efficient use of pipelining while reducing memory latency and pipeline scheduling penalties.

• Computations on vector elements are mainly independent from each other — lack of data hazards.

Computation Gap

Pipeline Systems
Vector Processors

• A vector instruction is equivalent to a loop, which implies:
− Smaller program size, hence reducing the instruction bandwidth requirement.
− Fewer control hazards.

• Vector instructions initiate a regular operand fetch pattern — allowing efficient use of memory interleaving and efficient address calculations.

Computation Gap

Pipeline Systems
Vector Processors — Vector Stride

• What if adjacent elements of a vector operand are not positioned in sequence in the memory?

• The distance separating elements that ought to be merged into a single vector is called the stride.

• Almost all vector machines allow access to vectors with any constant stride. Some constant strides may cause memory-bank conflicts.
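A sketch of strided gathering (the array sizes are illustrative):

```python
# Gather a strided vector from a flat, row-major memory image.
rows, cols = 100, 60
memory = list(range(rows * cols))

def gather(base, stride, n):
    """Collect n elements starting at `base`, spaced `stride` apart."""
    return [memory[base + i * stride] for i in range(n)]

print(gather(0, 1, 4))      # row prefix (stride 1):     [0, 1, 2, 3]
print(gather(0, cols, 4))   # column prefix (stride 60): [0, 60, 120, 180]
```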

Computation Gap

Pipeline Systems
Vector Processors — Chaining

• Chaining allows a vector operation to start as soon as the individual elements of its vector source operand become available.

• Results of the first functional unit (pipeline) in the chain are forwarded to the second functional unit.

Computation Gap

Pipeline Systems
Efficient Use of Vector Processors — Memory-to-Memory Organization

• Increase the vector size — if possible:
− Change the nesting order of the loop,
− Convert multidimensional arrays into one-dimensional arrays,
− Rearrange data into unconventional forms so that smaller vectors may be combined into a single large vector.

• Perform as many operations on an input vector as possible before storing the result vector back into the main memory.

Computation Gap

Pipeline Systems
Efficient Use of Vector Processors — Memory-to-Memory Organization

• Changing the nesting order of the loop:

Do I = 1, 100
  A(I, 1:60) = 0
End

becomes

Do J = 1, 60
  A(1:100, J) = 0
End

Computation Gap

Pipeline Systems
Efficient Use of Vector Processors — Memory-to-Memory Organization

• Converting multidimensional arrays into one-dimensional arrays:

Do I = 1, 100
  A(I, 1:60) = 0
End

becomes

A(1:6000) = 0

Computation Gap

Pipeline Systems
Efficient Use of Vector Processors — Register-to-Register Organization

• Values often used in a program should be kept in internal registers.

• Perform as many operations on an input vector as possible before storing the result vector back in the main memory.

• Organize vectors into sections of size equal to the length of the vector registers — strip mining.

• Convert multidimensional arrays into one-dimensional arrays.
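Strip mining from the list above can be sketched as follows (VLEN is an assumed register length):

```python
VLEN = 64   # assumed vector-register length

def strip_mine(n):
    """Yield (start, length) sections covering a vector of n elements,
    each section no longer than VLEN."""
    start = 0
    while start < n:
        length = min(VLEN, n - start)
        yield start, length
        start += length

print(list(strip_mine(150)))   # [(0, 64), (64, 64), (128, 22)]
```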

Computation Gap

Question
Compare and contrast Memory-to-Memory and Register-to-Register pipeline systems against each other.
