
CS 201 Computer Systems Programming

Chapter 4: "Computer Taxonomy"

Herbert G. Mayer, PSU CS
Status 6/30/2014


Syllabus

Introduction

Common Architecture Attributes

General Limitations

Data-Stream Instruction-Stream

Generic Architecture Model

Instruction Set Architecture (ISA)

Iron Law of Performance

Uniprocessor (UP) Architectures

Multiprocessor (MP) Architectures

Hybrid Architectures

References


Introduction: Uniprocessors

Single Accumulator Architectures, earliest in the 1940s; e.g. Atanasoff, Zuse, von Neumann

General-Purpose Register Architectures (GPR)

2-Address Architecture, i.e. GPR with one operand implied, e.g. IBM 360

3-Address Architecture, i.e. GPR with all operands of arithmetic operation explicit, e.g. VAX 11/70

Stack Machines, e.g. B5000, B6000, HP 3000

Pipelined Architecture, e.g. CDC 5000, Cyber 6000

Vector Architecture, e.g. Amdahl 470/6, competing with IBM's 360 in the 1970s; blurs the line to Multiprocessor


Introduction: Multiprocessors

Shared Memory Architecture; e.g. Illiac IV, BSP

Distributed Memory Architecture

Systolic Architecture; see Intel® iWarp and CMU's Warp architecture

Data Flow Machine; see Jack Dennis' work at MIT


Introduction: Hybrid Architectures

Superscalar Architecture; see Intel 80860, AKA i860

VLIW Architecture; see Multiflow computer, or systolic array architectures like Warp at CMU or iWarp at Intel in the 1990s

Pipelined Architecture; debatable if it is a hybrid architecture

EPIC Architecture; see HP and Intel® Itanium® architecture


Common Architecture Attributes

Main memory (main store), external from processor

Program instructions stored in main memory

Also, data stored in main memory; typical for von Neumann architecture

Data available in (distributed over) static memory, stack, heap, reserved OS space, free space, IO space

Instruction pointer (AKA instruction counter, program counter pc), other special registers

Von Neumann memory bottleneck: everything travels on the same, single bus


Common Architecture Attributes

Accumulator (register, 1 or many) holds result of arithmetic-logical operation

Memory Controller handles memory access requests from processor; moves bits to/from memory; is part of "chipset"

Current trend is to move some of the memory controller or IO controller onto the CPU chip; caveat: that does not mean the chipset IS part of the CPU!

Logical processor unit includes: FP unit, integer unit, control unit, register file, load-store unit, pathways

Physical processor unit includes: heat sensors, frequency control, voltage regulator, and more


General Limitations

Compute-Bound: type of application in which the vast majority of execution time is spent fetching and executing instructions; time to load and store data in/from memory is a small % of overall

Memory-Bound: application in which the majority of execution time is spent loading and storing data in memory; time executing instructions is a small % vs. time to access memory

IO-Bound: application in which the majority of execution time is spent accessing secondary storage; time executing instructions, even the time accessing memory, is a small % vs. time to access secondary storage

Backup-Bound (semi-serious only): like IO-Bound, but backup storage medium can be even slower than typical secondary storage devices
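These bound classes can be sketched with a toy classifier (an illustrative helper, not from the slides): whichever takes longer, compute time or data-movement time, names the bound.

```python
def classify(ops, bytes_moved, ops_per_s, bytes_per_s):
    """Toy classifier: compare time spent computing vs. moving data."""
    t_compute = ops / ops_per_s           # time fetching/executing instructions
    t_memory = bytes_moved / bytes_per_s  # time loading/storing data
    return "compute-bound" if t_compute > t_memory else "memory-bound"

# a loop doing 10^9 operations on only 10^6 bytes is compute-bound
print(classify(1e9, 1e6, 1e9, 1e8))  # -> compute-bound
print(classify(1e6, 1e9, 1e9, 1e8))  # -> memory-bound
```

The same comparison extends to IO-bound by adding a third term for secondary-storage time.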


Data-Stream Instruction-Stream

Classification developed by Michael J. Flynn, 1966:

1. Single-Instruction, Single-Data Stream (SISD) Architecture: PDP-11

2. Single-Instruction, Multiple-Data Stream (SIMD) Architecture: Array Processors, Solomon, Illiac IV, BSP, TMC

3. Multiple-Instruction, Single-Data Stream (MISD) Architecture: Pipelined architecture

4. Multiple-Instruction, Multiple-Data Stream (MIMD) Architecture: true multiprocessor
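Flynn's two axes (instruction streams, data streams) map directly to the four class names; a minimal sketch:

```python
def flynn(instruction_streams, data_streams):
    """Map stream counts to Flynn's four-letter class."""
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return i + "I" + d + "D"

print(flynn(1, 1))   # uniprocessor, PDP-11 style  -> SISD
print(flynn(1, 64))  # array processor             -> SIMD
print(flynn(4, 4))   # true multiprocessor         -> MIMD
```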


Generic Architecture Model


Instruction Set Architecture (ISA)

ISA is the boundary between Software and Hardware

Specifies logical machine visible to programmer & compiler

Is functional specification for processor designers

That boundary is sometimes a very low-level piece of system SW that handles exceptions, interrupts, and HW-specific services that could fall into the domain of the OS


Instruction Set Architecture (ISA)

Specified by ISA are:

Operations: what to perform and in which order

Active, temporary operand storage in CPU: accumulator, stack, registers

note that stack can be word-sized, even bit-sized (e.g. extreme design of successor for NCR’s Century architecture of the 1970s)

Number of operands per instruction; some implicit, others explicit

Operand location: where and how to locate/specify the operands: Register, literal, data in memory

Type and size of operands: bit, byte, word, double-word, . . .

Instruction Encoding in binary

Data types: int, float, double, decimal, char, bit


Instruction Set Architecture (ISA)


Iron Law of Performance

Clock rate doesn't count! Bus width doesn't count. Number of registers and operations executed in parallel doesn't count! What counts is how long it takes for my computational task to complete. That time is of the essence of computing!

If a MIPS-based solution running at 1 GHz completes a program X in 2 minutes, while an Intel Pentium® 4-based solution running at 3 GHz completes that same program X in 2.5 minutes, programmers are more interested in the MIPS solution.

If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture requires 1.1 * Y bytes, the Intel solution is generally more attractive.

Meaning of this: Wall-clock time (Time) is the time I have to wait for completion. Program Size is the overall complexity of the computational task.
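The Iron Law behind this is Time = instruction count x CPI / clock rate; only wall-clock time matters. The sketch below reproduces the slide's 1 GHz vs. 3 GHz scenario; the instruction counts and CPIs are hypothetical values chosen to yield those run times.

```python
def wall_clock_seconds(instr_count, cpi, clock_hz):
    """Iron Law of performance: time = IC * CPI / f."""
    return instr_count * cpi / clock_hz

# chosen so the 1 GHz machine finishes program X in 2 minutes
mips_time = wall_clock_seconds(120e9, 1.0, 1e9)      # 120 s
# and the 3 GHz machine finishes the same program X in 2.5 minutes
pentium_time = wall_clock_seconds(150e9, 3.0, 3e9)   # 150 s
print(mips_time, pentium_time)  # the slower-clocked machine wins
```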


Iron Law of Performance


Different Classes of Architectures


Uniprocessor (UP) Architectures

Single Accumulator Architecture (SAA)

Single register to hold operation results

Conventionally called accumulator

Accumulator used as destination of arithmetic operations, and as (one) source

SAA has central processing unit, memory unit, connecting memory bus; typical for von Neumann architecture

The pc points to next instruction in memory to be executed

Sample: ENIAC


Uniprocessor (UP) Architectures

General-Purpose Register (GPR) Architecture

Accumulates ALU results in n registers, typically 4, 8, 16, 64

Allows register-to-register operations, fast!

GPR is essentially a multi-register extension of SAA

Two-address architecture specifies one source operand explicitly, another implicitly, plus one destination

Three-address architecture specifies two source operands explicitly, plus an explicit destination

Variations allow additional index registers, base registers, multiple index registers, etc.


Uniprocessor (UP) Architectures

Stack Machine Architecture (SMA)

AKA zero-address architecture, since arithmetic operations require no explicit operand, hence no operand addresses; all are implied to be on the stack, except for push and pop

Wake-up call to students: what is the equivalent of push/pop on GPR?

Pure Stack Machine (SMA) has no registers

Hence performance is inherently poor, as all operations involve memory on a stack machine

However, one can design an SMA that implements the n top of stack elements as registers, i.e. as a Stack Cache: n = 4, 8, . . .

Sample architectures: Burroughs B5000, HP 3000

Implement impure stack operations that bypass tos operand addressing

Sample code sequence to compute:


Uniprocessor (UP) Architectures

Stack Machine Architecture (SMA)

res := a * ( 145 + b )   -- operand sizes are implied!

push a        -- destination implied: stack
pushlit 145   -- also destination implied
push b        -- ditto
add           -- 2 sources, and destination implied
mult          -- 2 sources, and destination implied
pop res       -- source implied: stack
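The zero-address sequence above can be traced with a tiny Python stack-machine interpreter; an illustrative sketch, with opcode names taken from the slide:

```python
def run(program, env):
    """Interpret the slide's zero-address opcodes on an explicit stack."""
    stack, result = [], {}
    for op, *args in program:
        if op == "push":          # destination implied: stack
            stack.append(env[args[0]])
        elif op == "pushlit":     # also destination implied
            stack.append(args[0])
        elif op == "add":         # 2 sources and destination implied
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mult":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pop":         # source implied: stack
            result[args[0]] = stack.pop()
    return result

prog = [("push", "a"), ("pushlit", 145), ("push", "b"),
        ("add",), ("mult",), ("pop", "res")]
print(run(prog, {"a": 2, "b": 5}))  # res := 2 * (145 + 5) -> {'res': 300}
```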


Uniprocessor (UP) Architectures

Pipelined Architecture (PA)

Arithmetic Logic Unit, ALU, split into separate, sequentially connected units in PA

Unit is referred to as a stage; more precisely, the time at which the action is done is the stage

Each of these stages/units can be initiated once per cycle

Yet each subunit is implemented in HW just once

Multiple subunits operate in parallel on different sub-ops, each executing a different stage; each stage is part of one instruction execution, many stages running in parallel

Non-unit time, differing # of cycles per operation cause different terminations

Operations abort in an intermediate stage, if some later instruction changes the flow of control; e.g. due to a branch, exception, return, conditional branch, call


Uniprocessor (UP) Architectures

Pipelined Architecture (PA)


Uniprocessor (UP) Architectures

Pipelined Architecture (PA)

Operation must stall in case of data or control dependence: stall, AKA interlock

Ideally each instruction can be partitioned into the same number of stages, i.e. sub-operations

Operations to be pipelined can sometimes be evenly partitioned into equal-length sub-operations

That equal-length time quantum might as well be a single sub-clock

In practice it is hard/impossible for the architect to achieve; compare for example integer add and floating point divide!


Uniprocessor (UP) Architectures

Pipelined Architecture (PA)

Ideally all operations have independent operands, i.e. one operand being computed is not needed as source of the next few operations

If they were needed (and often they are), then this would cause dependence, which causes a stall:

read after write (RAW)

write after read (WAR)

write after write, with use in between (WAW)

Also, ideally, all instructions just happen to be arranged sequentially one after another

In reality, there are branches, calls, returns etc.


Uniprocessor (UP) Architectures

Simplified Pipelined Resource Diagram

if:   fetch an instruction
de:   decode the instruction
op1:  fetch or generate the first operand, if any
op2:  fetch or generate the second operand, if any
exec: execute that stage of the overall operation
wb:   write result back to destination, if any; e.g. noop has no destination; halt has no destination
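Assuming one cycle per stage and no stalls, the cycle in which each instruction occupies each stage follows directly from these six stages; a sketch of that ideal schedule:

```python
STAGES = ["if", "de", "op1", "op2", "exec", "wb"]

def schedule(n_instructions):
    """Cycle number for each (instruction, stage) in an ideal, stall-free
    pipeline: instruction i enters 'if' at cycle i, one stage per cycle."""
    return {(i, s): i + j
            for i in range(n_instructions)
            for j, s in enumerate(STAGES)}

sched = schedule(3)
print(sched[(0, "wb")])    # first instruction writes back at cycle 5
print(sched[(2, "exec")])  # third instruction executes at cycle 6
```

A stall or a taken branch would shift every later instruction's cycles; this sketch shows only the ideal overlap.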


Uniprocessor (UP) Architectures

Superscalar Architecture; more detail also shown at: "Hybrid Architectures"

Identical to regular uniprocessor architecture

But some arithmetic or logical units are replicated

E.g. may have multiple floating point (FP) multipliers

Or FP multiplier and FP adder may work at the same time

The key is: on a superscalar architecture, sometimes more than one instruction can execute at one time!

Provided that there is no data dependence!

First superscalar machines included CDC 6600, Intel i960CA, and AMD 29000 series

Object code can look identical to code for a strict uniprocessor, yet the HW fetches more than just the next instruction, and performs data dependence analysis


Uniprocessor (UP) Architectures

Vector Architecture (VA)

Register implemented as HW array of identical registers, named vri

VA may also have scalar registers, named r0, r1, etc.

Scalar register can also be the first of the vector registers

Vector registers can load/store a block of contiguous data

Still in sequence, but overlapped; number of steps to complete load/store of a vector also depends on bus width

Vector machine can perform multiple operations of the same kind on whole contiguous blocks of operands

Still in sequence, but overlapped, and all operands are readily available

Otherwise operates like GPR architecture, but on vector operands; if vector size is 1, then VA is identical to UP


Uniprocessor (UP) Architectures

Vector Architecture (VA)


Uniprocessor (UP) Architectures

Sample Vector Architecture operation:

ldv  vr1, memi            -- loads 64 memory locs from [mem+i=0..63]
stv  vr2, memj            -- stores vr2 in 64 contiguous locs
vadd vr1, vr2, vr3        -- register-register vector add
cvaddf r0, vr1, vr2, vr3  -- has conditional meaning:

-- sequential equivalent:
for i = 0 to 63 do
  if bit i in r0 is 1 then
    vr1[i] = vr2[i] + vr3[i]   -- e.g. cvadd r0, r1, r2, r3
  else
    -- do not move corresponding bits
  end if
end for

-- parallel syntax equivalent:
forall i = 0 to 63 do parallel
  if bit i in r0 is 1 then
    vr1[i] = vr2[i] + vr3[i]
  end if
end parallel for
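The conditional form cvaddf can be modeled as a masked elementwise add; a Python sketch of the sequential semantics above (3 lanes instead of 64, for brevity):

```python
def cvadd(mask_bits, vr2, vr3, vr1):
    """Conditional vector add: where the mask bit is 1,
    vr1[i] = vr2[i] + vr3[i]; elsewhere vr1[i] is left unchanged."""
    return [b + c if m else old
            for m, b, c, old in zip(mask_bits, vr2, vr3, vr1)]

print(cvadd([1, 0, 1], [1, 2, 3], [10, 20, 30], [0, 0, 0]))  # [11, 0, 33]
```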


Multiprocessor (MP) Architectures

Shared Memory Architecture (SMA)

Equal access to memory for all n processors, p0 .. pn-1

Only one will succeed in accessing shared memory, when there are multiple, quasi-simultaneous accesses

Simultaneous memory access must be deterministic; needs an arbiter to ensure determinism

Von Neumann bottleneck tighter than in a conventional UP system

Generally there are twice as many loads as there are stores in typical object code

Occasionally, some processors are idle due to memory conflict

Typical number of processors n = 4, but n = 8 and greater possible, with large 2nd level cache, even larger 3rd level

Only limited commercial success and acceptance; programming burden frequently on programmer

Morphing in the 2000s into multi-core and hyper-threaded architectures, where the programming burden is on the multi-threading OS or the programmer
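The arbiter's job (only one processor wins each access, and the outcome stays deterministic) is the same job a lock does in software; an illustrative sketch with threads standing in for processors:

```python
import threading

counter = 0              # the shared-memory word
lock = threading.Lock()  # the "arbiter": one access wins at a time

def processor(n):
    global counter
    for _ in range(n):
        with lock:       # serialize quasi-simultaneous accesses
            counter += 1

threads = [threading.Thread(target=processor, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # deterministically 4000 with the lock; racy without it
```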


Multiprocessor (MP) Architectures

Shared Memory Architecture (SMA)


Multiprocessor (MP) Architectures

Distributed Memory Architecture (DMA)

Processors have private memories, AKA local memories

Yet programmer has to see a single, logical memory space, regardless of local distribution

Hence each processor pi always has access to its own memory Memi

Collection of all memories Memi, i = 0..n-1, is the logical data space

Thus, processors must access others' memories

Done via Message Passing or Virtual Shared Memory

Messages must be routed, and the route must be determined

Route may be long, i.e. require multiple, intermediate nodes

Blocking when a message is expected but hasn't arrived yet

Blocking when the destination cannot receive

Growing message buffer size increases illusion of asynchronicity of sending and receiving operations

Key parameter: time for 1 hop, and package overhead to send an empty message

Message may also be delayed because of network congestion
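Message passing with a bounded buffer behaves exactly as described: a send blocks when the destination's buffer is full, a receive blocks until the expected message arrives. A minimal single-process sketch using a queue as the mailbox (names are illustrative):

```python
import queue

mailbox = queue.Queue(maxsize=2)  # small buffer: sender blocks once full

def send(mbox, msg):
    mbox.put(msg)      # blocks when the destination cannot receive (buffer full)

def receive(mbox):
    return mbox.get()  # blocks while the expected message hasn't arrived yet

send(mailbox, ("p0", 42))        # node p0 sends one word to this mailbox
sender, payload = receive(mailbox)
print(sender, payload)
```

Growing maxsize increases the illusion of asynchronous send/receive, just as the slide notes for larger message buffers.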


Multiprocessor (MP) Architectures

Distributed Memory Architecture (DMA)


Multiprocessor (MP) Architectures

Systolic Array Architecture (SAA)

Very few designed: CMU and Intel for (then) ARPA

Each processor has private memory

Network is pre-defined by the Systolic Pathway (SP)

Each node is pre-connected via SP to some subset of other processors

Node connectivity: determined by network topology

Systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)

Typical network topologies: line, ring, torus, hex grid, mesh, etc.

Sample below is a ring; wrap-around along x and y dimensions not shown

Processor can write to x or y gate; sends word off on x or y SP

Processor can read from x or y gate; consumes word from x or y SP

Buffered SA can write to gate, even if receiver cannot read

Reading from gate when no message is available blocks

Automatic code generation for non-buffered SA is hard; compiler must keep track of interprocessor synchronization

Can view SP as an extension of memory with infinite capacity, but with sequential access


Multiprocessor (MP) Architectures

Systolic Array Architecture (SAA)


Multiprocessor (MP) Architectures

Systolic Array Architecture (SAA)

Note that each pathway, x or y, may be bi-directional

May have any number of pathways; nothing magic about 2, x and y; could be 3 or more

Possible to have I/O capability with each node

Typical application: large polynomials of the form:

y = k0 + k1*x1 + k2*x2 + .. + kn-1*xn-1 = Σ ki*xi

Next example shows a torus without displaying the wrap-around pathways across both dimensions
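The sum Σ ki*xi maps onto a systolic line or ring naturally: each node holds one coefficient ki, multiplies it into the arriving xi, adds the product to the partial sum flowing past, and forwards the result on the pathway. A sequential model of that dataflow (illustrative only; a real array runs the node steps concurrently):

```python
def systolic_sum(ks, xs):
    """Model one pass of a partial sum through a line of nodes:
    node i holds k_i, consumes x_i, forwards partial + k_i * x_i."""
    partial = 0
    for k, x in zip(ks, xs):      # each iteration = one node's step
        partial = partial + k * x
    return partial

print(systolic_sum([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```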


Multiprocessor (MP) Architectures

Systolic Array Architecture (SAA)


Hybrid Architectures

Superscalar Architecture (SA)

Replicates (duplicates) some operations in HW

Seems like scalar architecture w.r.t. object code; can compute some operations of UP in parallel, e.g. fadd and fmult

Is almost a parallel architecture, if it has multiple copies of some hardware units, say two fadd units

Is not an MP architecture: ALU is not replicated

Has multiple parts of an ALU, possibly multiple FPA units, or FPM units, and/or integer units

Arithmetic operations simultaneous with load and store operations; note data dependence!

Instruction fetch speculative, since number of parallel operations unknown; rule: fetch too much! But fetch no more than the longest possible superscalar pattern


Hybrid Architectures

Superscalar Architecture (SA)

Code sequence looks like sequence of instructions for scalar processor

Example: 80486® code executed on Pentium® processors

More famous and successful example: 80860® processor

Object code can be custom-tailored by compiler; i.e. compiler can have superscalar target processor in mind, bias code emission, knowing that some code sequences are better suited for superscalar execution

Fetch enough instruction bytes to support longest possible object sequence

Decoding is a bottleneck for CISC, way easier for RISC 32-bit units

Sample of superscalar: i80860 could run in parallel one FPA, one FPM, two integer ops, and a load or store in ++ or -- mode


Hybrid Architectures

Superscalar Architecture (SA)


Hybrid Architectures

Very Long Instruction Word Architecture (VLIW)

Very Long Instruction Word, typically 128 bits or more

VLIW machine also has scalar operations

VLIW code is no longer scalar, but explicitly parallel

Limitations like in superscalar: VLIW is not a general MP architecture; subinstructions do not have concurrent memory access; dependences must be resolved before code emission

But the VLIW opcode is designed to execute in parallel

VLIW suboperations can be defined as no-op; thus just the other suboperations run in parallel

Compiler/programmer explicitly packs parallelizable operations into VLIW instruction

Just like horizontal microcode compaction
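Packing parallelizable operations into fixed-slot instructions, with no-ops filling unused slots, can be sketched as follows. The slot names int1/int2/fpa/fpm are illustrative, and a real compiler would also check data dependences, not just slot conflicts:

```python
NOOP = ("noop",)
SLOTS = ("int1", "int2", "fpa", "fpm")

def pack(ops):
    """Greedy VLIW packing: fill slots of the current bundle; a slot
    conflict starts a new bundle, unused slots become no-ops."""
    bundles, current = [], {}
    for op in ops:
        unit = op[0]
        if unit in current:  # slot already taken -> emit current bundle
            bundles.append(tuple(current.get(s, NOOP) for s in SLOTS))
            current = {}
        current[unit] = op
    if current:
        bundles.append(tuple(current.get(s, NOOP) for s in SLOTS))
    return bundles

code = [("int1", "add"), ("fpa", "fadd"), ("int1", "sub")]
print(pack(code))  # two bundles; the second holds only the int1 sub
```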


Hybrid Architectures

VLIW

Sample: Compute instruction of CMU Warp® and Intel® iWarp®

Could be 1-bit (or few-bit) opcode for compute instruction, plus sub-opcodes for subinstructions

Data dependence example: result of FPA cannot be used as operand for FPM in the same VLIW instruction

But provided proper SW pipelining (not covered in CS 201), both subinstructions may refer to the same FP register

Result of int1 cannot be used as operand for int2, etc.; with SW pipelining both subinstructions may refer to same int register

Thus, need to software-pipeline


Hybrid Architectures

Itanium EPIC Architecture

Explicitly Parallel Instruction Computing

Group instructions into bundles

Straighten out the branches by associating a predicate with instructions; avoids branch and executes speculatively

Execute instructions in parallel, say the else clause and the then clause of an If Statement

Decide at run time which of the predicates is true, and (post) complete just that path from multiple choices; discard others

Use speculation to straighten branch tree

Use rotating register file

Has many registers, not just 64 GPRs


Hybrid Architectures

Itanium

Groups and bundles lump multiple compute steps into one that can be run in parallel

Parallel comparisons allow fast decisions

Predication associates a condition (the predicate) with 2 simultaneously executed instruction sequences, only 1 of which will be posted

Speculation fetches operands, not knowing for sure whether this results in use; a branch may invalidate an early fetch

Branch elimination straightens out code with jumps

Branch prediction

Large register file
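Predication in miniature: compute both the then and the else value speculatively, let the predicate select which result is posted, and no branch is taken. A branch-free sketch (the function and names are illustrative, not Itanium code):

```python
def predicated_abs(x):
    """EPIC-style predication in miniature: both paths computed, one posted."""
    p = x >= 0        # parallel compare sets the predicate
    then_val = x      # "then" path, computed speculatively
    else_val = -x     # "else" path, computed speculatively
    return then_val if p else else_val  # only the predicated result is posted

print(predicated_abs(-3), predicated_abs(5))
```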


Hybrid Architectures

Itanium

Numerous branch registers; speeds up execution by having some branch destinations in registers; fast to load into ip reg

Multiple CFM registers, Current Frame Marker regs; avoid slowness due to memory access

See separate lecture note


References

1. http://cs.illinois.edu/csillinois/history
2. http://www.arl.wustl.edu/~pcrowley/cse526/bsp2.pdf
3. http://dl.acm.org/citation.cfm?id=102450
4. http://csg.csail.mit.edu/Dataflow/talks/DennisTalk.pdf
5. http://en.wikipedia.org/wiki/Flynn's_taxonomy
6. http://www.ajwm.net/amayer/papers/B5000.html
7. http://www.robelle.com/smugbook/classic.html
8. http://en.wikipedia.org/wiki/ILLIAC_IV
9. http://www.intel.com/design/itanium/manuals.htm
10. http://www.csupomona.edu/~hnriley/www/VonN.html
11. http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf
12. VLIW Architecture: http://www.nxp.com/acrobat_download2/other/vliw-wp.pdf
13. ACM reference to Multiflow computer architecture: http://dl.acm.org/citation.cfm?id=110622&coll=portal&dl=ACM