TRANSCRIPT
IWKS 2300
Computer Architecture
(plus finishing up computer arithmetic)
Fall 2019
John K. Bennett
From Last Lecture: "Ripple" Carry Adder
Ripple carry makes addition time approximately equal to number of bits
times the propagation delay of a Full Adder
[Figure: three chained full adders; each takes A, B, and Cin and produces ∑ and Cout, with each stage's Cout feeding the next stage's Cin]
Full adder propagation delay = 3 gpd (gate propagation delays) on the carry output,
so a 16-bit adder would take 48 gpd to complete an add.
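To make that linear delay concrete, here is a minimal C sketch (an illustration, not from the lecture) modeling a 16-bit ripple-carry adder; the loop-carried carry variable mirrors the hardware chain that makes delay proportional to width:

#include <stdio.h>
#include <stdint.h>

/* Ripple-carry add: each bit's carry-out feeds the next bit's carry-in,
   so in hardware the delay grows linearly with the word width. */
uint16_t ripple_add(uint16_t a, uint16_t b) {
    uint16_t sum = 0;
    int carry = 0;
    for (int i = 0; i < 16; i++) {                 /* one full adder per bit */
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint16_t)((ai ^ bi ^ carry) << i); /* sum bit */
        carry = (ai & bi) | (carry & (ai ^ bi));   /* carry-out */
    }
    return sum;
}

int main(void) {
    printf("%u\n", ripple_add(12345, 6789));       /* prints 19134 */
    return 0;
}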
Eliminating Ripple Carry: Carry Look Ahead Basics
If we understand how carry works, we can compute carries in advance.
This is called "Carry Look-Ahead."
For any bit position, if A = 1 and B = 1, then Cout = 1, i.e., a carry will be
generated to the next bit position regardless of the value of Cin. This is
called "Carry Generate."
For any bit position, if one input is 1 and the other input is 0, Cout
will equal Cin (i.e., the value of Cin will be propagated to the next bit
position). This is called "Carry Propagate."
For any bit position, if A = 0 and B = 0, Cout will equal 0 regardless
of the value of Cin. This is called "Carry Stop."
Carry Generate, Propagate and Stop
Truth table for the full adder's carry behavior (Cin = x, i.e., don't care):

A  B  Cout  Classification
0  0  0     Carry Stop (CSi)
0  1  Cin   Carry Propagate (CPi)
1  0  Cin   Carry Propagate (CPi)
1  1  1     Carry Generate (CGi)

[Figure: a full adder augmented to emit its generate/propagate/stop signals]
No need for carry chain
Carry Look Ahead Basics
The equation to compute Cin at bit position i is as follows:
Cin_i = CG_{i-1}
      + CP_{i-1} ● CG_{i-2}
      + CP_{i-1} ● CP_{i-2} ● CG_{i-3}
      …
      + CP_{i-1} ● CP_{i-2} ● … ● CP_1 ● CG_0
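As a sanity check on these equations, the following C sketch (illustrative, with assumed 4-bit inputs) evaluates a 4-bit carry look-ahead: all generate and propagate signals are computed first, and each carry is then a flat AND/OR expression with no chaining through earlier sum bits:

#include <stdio.h>

/* 4-bit carry look-ahead: all generate (g) and propagate (p) signals are
   computed at once, then each carry-in is a two-level AND/OR expression
   of them, independent of the previous sum bits. */
int main(void) {
    unsigned a = 0xB, b = 0x6, cin0 = 0;         /* assumed example inputs */
    unsigned g[4], p[4], c[5];
    for (int i = 0; i < 4; i++) {
        g[i] = (a >> i) & (b >> i) & 1;          /* carry generate  */
        p[i] = ((a >> i) ^ (b >> i)) & 1;        /* carry propagate */
    }
    c[0] = cin0;
    c[1] = g[0] | (p[0] & c[0]);
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c[0]);
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    unsigned sum = 0;
    for (int i = 0; i < 4; i++)
        sum |= (p[i] ^ c[i]) << i;               /* sum bit = p XOR carry-in */
    printf("sum=0x%X carry-out=%u\n", sum, c[4]); /* 0xB + 0x6: sum=0x1, carry-out=1 */
    return 0;
}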
Practical Considerations
Cin_i = CG_{i-1}
      + CP_{i-1} ● CG_{i-2}
      + CP_{i-1} ● CP_{i-2} ● CG_{i-3}
      …
      + CP_{i-1} ● CP_{i-2} ● … ● CP_1 ● CG_0
Very wide (more than 8-input) gates are impractical, so we
would likely use a log(n)-depth tree of gates to implement the
wide ANDs and ORs. This is still faster than chained carry,
even for 16 bits (and is much faster for 32- or 64-bit adders).
What About Multiplication?
Classic Multiplication in Hardware/Software
Use add-and-shift, like the pen-and-paper method:
[Figure: 4×4 partial-product array, one shifted row of partial products per multiplier bit]
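A minimal C sketch of this method (an illustration, not lecture code): each set bit of the multiplier contributes one shifted copy of the multiplicand, exactly like a pen-and-paper partial product row:

#include <stdint.h>
#include <stdio.h>

/* Add-and-shift multiplication: for each set bit of the multiplier,
   add a correspondingly shifted copy of the multiplicand. */
uint32_t mul_shift_add(uint16_t multiplicand, uint16_t multiplier) {
    uint32_t product = 0;
    uint32_t m = multiplicand;           /* shifted left each iteration */
    while (multiplier != 0) {
        if (multiplier & 1)
            product += m;                /* add this partial product    */
        m <<= 1;                         /* shift multiplicand left     */
        multiplier >>= 1;                /* examine next multiplier bit */
    }
    return product;
}

int main(void) {
    printf("%u\n", mul_shift_add(13, 11));   /* prints 143 */
    return 0;
}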
Speeding Up Binary Multiplication
1. Retire more than one bit at a time:
• 2 bits at a time ("Booth's Algorithm")
• 3 bits at a time:

Bits  Operation
000   No operation
001   Add multiplicand
010   Add multiplicand
011   Add 2× multiplicand
100   Sub 2× multiplicand
101   Sub multiplicand
110   Sub multiplicand
111   No operation
2. Parallel Multiplier Using Carry Save Addition
Carry Save Addition
The idea is to perform several additions in sequence, keeping the carries
and the sum separate. This means that all of the columns can be added
in parallel without relying on the result of the previous column, creating a
two-output "adder" with a time delay that is independent of the size of its
inputs. The sum and carry can then be recombined using one normal
carry-aware addition (ripple or CLA) to form the correct result.
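A one-step illustration in C (operand values are assumed): three operands collapse into a sum word and a shifted carry word using only bitwise logic per column, and a single ordinary addition then produces the final result:

#include <stdio.h>
#include <stdint.h>

/* Carry-save addition: a row of full adders compresses three operands into
   a sum word and a carry word with constant delay; one carry-propagating
   add at the end recombines them. */
int main(void) {
    uint32_t a = 25, b = 17, c = 30;                      /* assumed operands */
    uint32_t sum   = a ^ b ^ c;                           /* bitwise sum, carries withheld */
    uint32_t carry = ((a & b) | (a & c) | (b & c)) << 1;  /* saved carries, shifted left */
    printf("%u\n", sum + carry);                          /* final carry-propagate add: 72 */
    return 0;
}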
CSA Uses Full Adders
"Wallace Tree" Addition
[Figure: a linear CSA chain has depth = 7 using 7 adders, plus a final add with carry; a Wallace tree has depth = 4 using the same 7 adders, plus a final add with carry]
[Figure: a 4-bit example, with carry propagating to the right (or handled by carry look-ahead)]
Example: An 8-bit Carry Save Array Multiplier
A parallel multiplier for unsigned operands. It is composed of 2-input AND
gates for producing the partial products, a series of carry-save adders for
adding them, and a ripple-carry adder for producing the final product.
[Figure: partial products generated by AND gates; each carry-save cell is a full adder (FA) with 3 inputs and 2 outputs]
What is Computer Architecture?
Machine Organization + Instruction Set Architecture
Decisions in each area are made for reasons of:
Cost
Performance
Compatibility with earlier designs
Computer Design is the art of balancing these
criteria
Classic Machine Organization (Von Neumann)
Input (mouse, keyboard, …)
Output (display, printer, …)
Memory: main (DRAM), cache (SRAM), secondary (disk, CD, DVD, …)
Processor (CPU): datapath + control
[Figure: input and output devices connect to the processor (control + datapath), which connects to memory; memory holds binary words representing both instructions and data]
Atanasoff–Berry Computer (Iowa State University)
(1937-42; vacuum tubes)
Zuse Z3 (Nazi Germany)
(1941-43; relays)
Von Neumann (Princeton) Machine (circa 1945)
[Figure: keyboard/input device and output device connect to the CPU (ALU, registers, control), which connects to a single memory holding data + instructions]
John von Neumann (and others) made it possible.
Gordon Moore, Andy Grove (and others) made it small and fast.
Harvard Mark 1 (circa 1940)
Howard Aiken
The ALU
Arithmetic (in order of implementation complexity):
Add
Subtract
Shift (Right and Left)
Rotate (Right and Left)
Multiply
Divide
Floating Point Operations
Logic (usually implemented with multiplexors)
And, Or, Not, Xor, Nand, Nor, Xnor, etc.
[Figure: ALU block with 32-bit inputs a and b, an operation select input, and a 32-bit result output]
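A behavioral C sketch of such an ALU (the operation set and names are hypothetical, chosen for illustration); the opcode plays the role of the multiplexor select lines:

#include <stdint.h>
#include <stdio.h>

/* A behavioral model of a small ALU: the operation code selects among
   the function units, like the select lines of a multiplexor. */
typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR, OP_NOT } AluOp;

uint32_t alu(AluOp op, uint32_t a, uint32_t b) {
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    case OP_XOR: return a ^ b;
    case OP_NOT: return ~a;      /* b ignored */
    }
    return 0;
}

int main(void) {
    printf("0x%X\n", alu(OP_XOR, 0xF0F0F0F0u, 0x0F0FFFFFu)); /* 0xFFFF0F0F */
    return 0;
}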
Registers
While there have been "memory-only" machines, even early
computers typically had at least one register (called the
"accumulator"), used to capture the output of the ALU for
the next instruction.
Since memory (RAM) is much slower than registers (which
are internal to the CPU), we would like a lot of them.
But registers are very expensive relative to RAM, and we
have to be able to address every register. This impacts both
instruction set design and word length (e.g., 8 bit, 16 bit, 32
bit, 64 bit).
This has led to unusual designs, e.g., the SPARC
architecture's "register windows."
Control
Early computers were hardwired to perform a single
program.
Later, the notion of a "stored program" was introduced.
Early programmers entered programs in binary directly into
memory using switches and buttons. Assemblers and
compilers made it possible for more human-readable
programs to be translated into binary.
Binary programs, however entered, are interpreted by the
hardware to generate control signals. This interpretation
can be done by "hardwired" logic, or by another computer using what is
known as "microprogramming."
Processing Logic: fetch-execute cycle
Executing the current instruction involves one or more of
the following tasks:
Have the ALU compute some function out = f (register values)
Write the ALU output to selected registers
As a side-effect of this computation,
determine what instruction to fetch and execute next.
What Do Instructions Look Like in Memory?
In a Von Neumann machine, both instructions and data are stored in
the same memory.
Data is just a set of bits in one or more words of memory.
Instructions contain operation codes ("Op Codes") and addresses
(of either registers or RAM).
Suppose "addr" was 4 bits and the word length was 16 bits. How
many registers could we have? How many operations?
oprn | addr1 | addr2 | addr3
oprn | addr1 | addr2
oprn | addr1
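One way to work the question above: a 4-bit addr field can name 2^4 = 16 registers. If every instruction uses the three-address format, the addresses consume 3 × 4 = 12 bits of the 16-bit word, leaving 4 opcode bits, i.e., 16 distinct operations. A two-address format would leave 8 opcode bits (256 operations), and a one-address format 12 bits (4096 operations).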
Architecture Families
Before the mid-60s, every machine had a different ISA;
programs from the previous generation could not run on a new
machine (this made replacement very expensive)
IBM System/360 introduced the concept of an
"architecture family" based on different detailed
implementations
single instruction set architecture
wide range of price and performance with same software:
o memory path width (1 byte to 8 bytes)
o faster, more complex CPU design
o greater I/O throughput and overlap
IBM 360 Architecture Family
Model  Shipped  Scientific   Commercial   CPU BW    Memory BW  Memory Size
                Perf. (KIPS) Perf. (KIPS) (MB/sec)  (MB/sec)   (KB)
30     Jun-65   10.2         29           1.3       0.7        8-64
40     Apr-65   40           75           3.2       0.8        16-256
50     Aug-65   133          169          8         2          64-512
20     Mar-66   2            2.6          –         –          4-32
91     Oct-67   1,900        1,800        133       164        1024-4096
65     Nov-65   563          567          40        21         128-1024
75     Jan-66   940          670          41        43         256-1024
67     May-66   –            –            40        21         512-2048
44     Sep-66   118          185          16        4          32-256
95     Feb-68   3,800 est.   3,600 est.   133       711        5220
25     Oct-68   9.7          25           1.1       2.2        16-48
85     Dec-69   3,245        3,418        100       67         512-4096
195    Mar-71   10,000 est.  10,000 est.  148       169        1024-4096
(– = not reported on the slide)
The Intel x86 Architecture History
[Table excerpt; most recent entry: Xeon Platinum 8276L (2019, $16,616): ~7B transistors, 4.0 GHz, 42-bit addressing (64-bit arch.), 384, 4.5 TB max memory, …]
The Intel x86 Instruction Set Architecture
Complexity
instructions from 1 to 17 bytes long
one operand must act as both a source and destination
one operand may come from memory
several complex addressing modes
Why has the x86 architecture survived this long?
Historically tied to MS Windows
The most frequently used instructions are relatively easy to implement and optimize
Compilers avoid the portions of the architecture that are slow (i.e., most compilers for x86 machines use only a fraction of the instruction set).
CISC vs. RISC
CISC = Complex Instruction Set Computer
RISC = Reduced Instruction Set Computer
Historically, machines tend to add features over time
Instruction opcodes
IBM 70X, 70X0 series went from 24 opcodes to 185 in 10 years
At the same time, performance increased 30 times
Addressing modes
Special purpose registers
CISC motivations were to:
improve efficiency, since complex instructions implemented in
hardware presumably execute faster
make life easier for compiler writers
support more complex higher-level languages
CISC vs. RISC
Examination of actual code demonstrated many of these features were
not used, largely because compiler code generation and optimization is
hard even with simple instruction sets.
RISC advocates (e.g., Dave Patterson of UC Berkeley) proposed
simple, limited (reduced) instruction set
large number of general purpose registers
instructions mostly operate only on registers
optimized instruction pipeline
Benefits of this approach included:
faster execution of instructions commonly used
faster design and implementation
Issues: things like floating point had to be implemented in SW
CISC vs. RISC
Some early RISC architectures compared to contemporaneous CISC
Machine         Year  Instructions  Instr. Size (bytes)  Addr. Modes  Registers
IBM 370/168     1973  208           2 - 6                4            16
VAX 11/780      1978  303           2 - 57               22           16
Intel 80486     1989  235           1 - 11               11           8
Motorola 88000  1988  51            4                    3            32
MIPS R4000      1991  94            4                    1            32
IBM RS/6000     1990  184           4                    2            32
CISC vs. RISC
Which approach is best?
In general, fewer simpler instructions allow for increased clock
speeds.
Typically, RISC processors take less than half the design time of a
CISC processor, sometimes much less.
RISC/CISC comparisons often neglect the increased time it takes to
do things like develop a software floating point library.
In addition, CISC designers have adopted RISC techniques
everywhere possible.
Instruction complexity is only one variable
A couple of design principles:
Make the common case fast.
Design for the expected workload, e.g., a GPU needs a very
different ISA than a Windows 10 CPU.
Some Typical Assembly Language Constructs
// In what follows R1, R2, R3 are registers, PC is the program counter,
// and addr is some address in memory. There is an implied PC++
// with every instruction.
ADD R1,R2,R3    // R1 ← R2 + R3
ADDI R1,R2,addr // R1 ← R2 + addr
AND R1,R1,R2    // R1 ← R1 and R2 (bit-wise)
JMP addr        // PC ← addr
JEQ R1,R2,addr  // IF R1 == R2 THEN PC ← addr ELSE PC++
LOAD R1, addr   // R1 ← RAM[addr]
STORE R1, addr  // RAM[addr] ← R1
NOP             // Do nothing
// Etc. – *many* variants
Three Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Load R1, A // R1 ← Mem[A]
Load R2, B // R2 ← Mem[B]
Sub R3, R1, R2 // R3 ← R1 – R2
Load R1, D // R1 ← Mem[D]
Load R2, E // R2 ← Mem[E]
Mpy R4, R1, R2 // R4 ← R1 * R2
Load R1, C // R1 ← Mem[C]
Add R2, R1, R4 // R2 ← R1 + R4
Div R1, R3, R2 // R1 ← R3 / R2
Store X, R1 // Mem[X] ← R1
This code: 10 instructions, 6 Memory references,
code is not compact.
There are typically a finite number of registers,
on the order of 16-32
Two Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Load R1, B // R1 ← Mem[B]
Load R2, A // R2 ← Mem[A]
Sub R2, R1 // R2 ← R2 – R1
Load R1, D // R1 ← Mem[D]
Load R3, E // R3 ← Mem[E]
Mpy R1, R3 // R1 ← R1 * R3
Load R4, C // R4 ← Mem[C]
Add R1, R4 // R1 ← R1 + R4
Div R2, R1 // R2 ← R2 / R1
Store X, R2 // Mem[X] ← R2
There are typically a finite number of registers,
on the order of 16-32
This code: 10 instructions, 6 Memory references,
code is a little more compact.
One Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Load A // Acc ← Mem[A]
Sub B // Acc ← Acc - Mem[B]
Store Temp1 // Mem[Temp1] ← Acc
Load D // Acc ← Mem[D]
Mpy E // Acc ← Acc * Mem[E]
Add C // Acc ← Acc + Mem[C]
Store Temp2 // Mem[Temp2] ← Acc
Load Temp1 // Acc ← Mem[Temp1]
Div Temp2 // Acc ← Acc / Mem[Temp2]
Store X // Mem[X] ← Acc
There is one register, called the Accumulator
This code: 10 instructions, 10 Memory references,
code is more compact.
Zero Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Push D // SP = SP + 1; Mem[SP] ← Mem[D];
Push E // SP = SP + 1; Mem[SP] ← Mem[E];
Mpy // Mem[SP-1] ← Mem[SP] * Mem[SP-1];
// SP = SP -1
Push C // SP = SP + 1; Mem[SP] ← Mem[C]
Add // Mem[SP-1] ← Mem[SP] + Mem[SP-1]
// SP = SP -1
Push B // SP = SP + 1; Mem[SP] ← Mem[B]
Push A // SP = SP + 1; Mem[SP] ← Mem[A]
Sub // Mem[SP-1] ← Mem[SP] - Mem[SP-1];
// SP = SP -1
Div // Mem[SP-1] ← Mem[SP] / Mem[SP-1]
Pop X // Mem[X] ← Mem[SP]; SP = SP - 1
10 instructions, 24 Memory references,
code is very compact.
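For comparison, here is the same expression evaluated by a tiny stack machine sketched in C (operand values are hypothetical; operands are pushed so that each operator finds its left operand on top, matching the Push B / Push A order above):

#include <stdio.h>

/* A minimal stack-machine rendering of X = (A-B) / (C+(D*E)). */
static double stk[16];
static int sp = -1;

static void push(double v) { stk[++sp] = v; }
static double pop(void)    { return stk[sp--]; }

int main(void) {
    double A = 20, B = 4, C = 2, D = 3, E = 2;   /* assumed values */
    push(D); push(E);
    push(pop() * pop());                         /* Mpy: D*E       */
    push(C);
    push(pop() + pop());                         /* Add: C + D*E   */
    push(B); push(A);
    double top = pop();
    push(top - pop());                           /* Sub: A - B     */
    double num = pop();
    double den = pop();
    printf("X = %g\n", num / den);               /* Div + Pop: X = 2 */
    return 0;
}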
What Does it Mean to Make the Common Case Fast?
There are a variety of techniques to speed up instruction execution.
Some of these include:
Increase clock rate (we are approaching some hard limits here)
Pipelining (execute more than one instruction at one time)
Caching (store data we will need near the processor)
Other methods of improving memory hierarchy access time
Note that we only need to employ these techniques for instructions that
actually get used.
How do we know what instructions get used?
There are decades of research exploring how different kinds of
compilers and programs use instructions and memory.
If we have a specialized workload, we can study its execution
ourselves.
Registers (very small and very fast) –
implemented as part of the processor.
Cache (small and fast storage for data and
instructions we expect to need). There may
be several layers of cache, e.g., a small and
very fast L1 cache, a larger and somewhat
slower L2 cache, and an even larger and not
quite as fast L3 cache.
Main Memory (RAM) – the bulk of the
volatile memory available to the CPU.
Usually implemented using dynamic RAM.
Disk – Relatively non-volatile storage for
large amounts of information. May use
rotating media, or (more recently) SSD (solid
state drive) technology.
Although less common today, there may be
additional layers in the memory hierarchy,
e.g., tape, on-line, DVD, etc.
The Memory Hierarchy
[Figure: the memory hierarchy between CPU and memory: registers → cache → RAM → disk; size grows from smallest to biggest, cost ($/bit) falls from highest to lowest, and speed falls from fastest to slowest moving away from the CPU]
Ideally, all the memory we want would be in the processor, but that is
cost-prohibitive (and certainly wasteful) by today’s standards.
We use the memory hierarchy to efficiently create the illusion that all
memory is the same.
The memory hierarchy must be inclusive, i.e., lower levels must include
everything present in higher levels of the memory hierarchy.
The performance of the memory hierarchy depends on hit rate, i.e., how
often we find what we need at higher levels.
Memory Hierarchy
[Figure: levels 1 through n of the memory hierarchy below the processor (CPU); distance from the CPU in access time and the size of the memory at each level both increase downward; data are transferred between adjacent levels in blocks (the unit of data copy)]
Program Locality
Caching in the memory hierarchy works because of
two kinds of program "locality."
temporal locality: an item (data or instruction) a
program has just accessed is likely to be
accessed again in the near future. Why?
spatial locality: items near an item a program has
just accessed are likely to be referenced soon.
Why?
Cache Terminology
block: minimum unit of data moved between levels
hit: data requested is found in the nearest upper level
miss: data requested is not found in nearest upper level
hit rate: fraction of memory accesses that are hits
miss rate: fraction of memory accesses that are not hits
miss rate = 1 – hit rate
hit time: time to determine if the access is a hit + time to deliver the data to the CPU
miss penalty: time to determine if the access is a miss + time to replace block at upper level with corresponding block at lower level + time to deliver the data to the CPU
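These terms combine in the standard average memory access time relation (a textbook formula, not stated on the slide): AMAT = hit time + miss rate × miss penalty. For example, a 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty give AMAT = 1 + 0.05 × 100 = 6 ns.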
Simple example:
assume block size = one word of data
Issues:
how do we know if a data item is in the cache?
if it is, how do we find it?
if not, what do we do, and what if the cache is full (or "dirty")?
Solution depends on cache addressing scheme.
How Do Caches Actually Work?
[Figure: (a) before the reference to Xn, the cache holds X1, X2, X3, X4, …, Xn-2, Xn-1 but not Xn; (b) the reference to Xn causes a miss, so Xn is fetched from memory and added to the cache]
Fully Associative - A cache where data from any address can be stored in any cache location. All tags are compared simultaneously (associatively) with the requested address and if one matches then its associated data is accessed.
Direct Mapped - A cache where the cache location for a given address is explicitly determined from the middle address bits. The remaining top address bits are stored as a "tag" along with the entry. In this scheme, there is only one place for any block to go.
Set Associative - A compromise between a direct mapped cache and a fully associative cache where each address is mapped to a certain "set" of cache locations.
A direct mapped cache could be referred to as "one-way set associative", i.e. one location in each set, whereas a fully associative cache is "N-way associative" (where N is the total number of blocks in the cache).
Cache Addressing Schemes
Addressing scheme in a direct-mapped cache:
cache block address = memory block address mod cache size (unique)
if cache size = 2^m blocks, cache address = lower m bits of the n-bit memory block address
remaining upper n-m bits are kept as tag bits at each cache block
also need a valid bit to recognize a valid entry
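A quick C sketch of this address split (the address value is arbitrary) for a direct-mapped cache with 1024 one-word blocks, matching the implementation example below:

#include <stdio.h>
#include <stdint.h>

/* Splitting a 32-bit byte address for a direct-mapped cache with
   1024 (2^10) one-word blocks: 2-bit byte offset, 10-bit index, 20-bit tag. */
int main(void) {
    uint32_t addr   = 0x12345678;            /* arbitrary example address */
    uint32_t offset = addr & 0x3;            /* bits 1:0   */
    uint32_t index  = (addr >> 2) & 0x3FF;   /* bits 11:2  */
    uint32_t tag    = addr >> 12;            /* bits 31:12 */
    printf("tag=0x%05X index=%u offset=%u\n", tag, index, offset);
    return 0;                                /* tag=0x12345 index=414 offset=0 */
}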
Direct Mapped Cache
[Figure: an eight-block direct-mapped cache (indices 000-111) and a larger memory; addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache index 001 via their low-order bits]
Direct Mapped Cache Implementation Example
[Figure: cache with 1024 one-word blocks. The 32-bit address (bits 31-0) is split into a 2-bit byte offset (ignored), a 10-bit index (bits 11:2) selecting one of entries 0-1023, and a 20-bit tag (bits 31:12) compared against the stored tag; a valid bit qualifies the comparison, producing Hit and 32 bits of Data]
Implementation of a Set-Associative Cache
[Figure: the address is split into a 22-bit tag and an 8-bit index; the index selects one of 256 sets (0-255), each holding four (V, Tag, Data) entries; four comparators and a 4-to-1 multiplexor produce Hit and Data]
4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor:
size of cache is 1K blocks = 256 sets * 4-block set size
Performance of Set-Associative Caches
[Plot: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for eight cache sizes from 1 KB to 128 KB]
Miss rates for each of eight cache sizes
with increasing associativity:
data generated from SPEC92 benchmarks
with a 32-byte block size for all caches
It is generally more effective to increase the number of
entries rather than associativity.
Cache Replacement Policy
Cache has finite size.
What if the cache is full?
Analogy:
• Desktop full? Move books to bookshelf to make room
• Bookshelf full? Move least-used to library, etc.
Caches follow this same idea: if "replacement" is necessary,
move old block to next level of cache (if "dirty").
How do we choose the "victim"? Many policies are possible, e.g.:
• FIFO (first-in-first-out)
• LRU (least recently used)
• NMRU (not most recently used)
• Random
• Random + NMRU (almost as good as LRU)
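A small C sketch of LRU victim selection for one 4-way set (the timestamps are hypothetical): each way records when it was last used, and the victim is the way with the oldest timestamp:

#include <stdio.h>

/* LRU victim selection for a single 4-way set: evict the way whose
   "last used" timestamp is oldest. */
#define WAYS 4

int main(void) {
    unsigned last_used[WAYS] = { 7, 2, 9, 5 };    /* hypothetical access times */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;                           /* oldest = least recently used */
    printf("evict way %d\n", victim);             /* prints: evict way 1 */
    return 0;
}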
Pipelining: The Basic Idea
Pipelining breaks instruction execution down into multiple stages
Put registers between stages to "buffer" data and control.
The idea is to start executing one instruction and, as it
moves to the second stage, start executing a second
instruction, and so on.
The speedup equals the number of stages, as long as the pipeline stays full.
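A worked example of that claim: with k pipeline stages and n instructions, execution takes roughly k + (n - 1) cycles instead of k × n. For k = 5 and n = 1000, that is 1004 cycles versus 5000, a speedup of about 4.98, approaching the stage count of 5 as n grows.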
Pipeline Hazards
Why the pipeline is not always full:
Structural hazards arise from resource conflicts, when the
hardware cannot support all possible combinations of instructions in
simultaneous overlapped execution (e.g., a floating point operation).
Data Hazards arise when an instruction depends on the result of a
previous instruction in a way that is exposed by the overlapping of
instructions in the pipeline (e.g., A = B**19; A = A**C; or a cache
miss)
Control Hazards arise from the pipelining of branches and other
instructions that change the PC (program counter) which points to
the next instruction to execute (e.g., an if statement).
Hazards in pipelines can make it necessary to stall the pipeline. When
this happens, some instructions in the pipeline are allowed to proceed
while others are delayed. When an instruction is stalled, all the
instructions issued later than the stalled instruction are also stalled.
Instructions issued earlier than the stalled instruction must continue,
since otherwise the hazard will never clear.
Some Final Thoughts
Processor speeds remain very fast relative to everything
else in the memory hierarchy. This isn't likely to change in
the near future.
New designs are making memory wider (and a little bit
faster).
Compilers are getting better at restructuring code to
increase locality and to reduce the number of pipeline
hazards. In the general case, this is really hard.
Processor designers are making the cache visible in the
instruction set architecture, making it possible for
programmers / compilers to use "pre-fetching" to manually
populate cache entries.