TRANSCRIPT
IWKS 2300
Computer Architecture
(plus finishing up computer arithmetic)
Fall 2019
John K. Bennett
From Last Lecture: "Ripple" Carry Adder
Ripple carry makes addition time approximately equal to number of bits
times the propagation delay of a Full Adder
[Figure: three chained full adders; each takes A, B, and Cin and produces ∑ and Cout, with each stage's Cout feeding the next stage's Cin]
Full adder propagation delay = 3 gpd (gate propagation delays) on the carry output,
so a 16-bit adder would take 48 gpd to complete an add.
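To make that linear delay concrete, here is a minimal C sketch (an illustration, not from the lecture) modeling a 16-bit ripple-carry adder; the loop-carried carry variable mirrors the hardware chain that makes delay proportional to width:

#include <stdio.h>
#include <stdint.h>

/* Ripple-carry add: each bit's carry-out feeds the next bit's carry-in,
   so in hardware the delay grows linearly with the word width. */
uint16_t ripple_add(uint16_t a, uint16_t b) {
    uint16_t sum = 0;
    int carry = 0;
    for (int i = 0; i < 16; i++) {                 /* one full adder per bit */
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum |= (uint16_t)((ai ^ bi ^ carry) << i); /* sum bit */
        carry = (ai & bi) | (carry & (ai ^ bi));   /* carry-out */
    }
    return sum;
}

int main(void) {
    printf("%u\n", ripple_add(12345, 6789));       /* prints 19134 */
    return 0;
}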
Eliminating Ripple Carry: Carry Look Ahead Basics
If we understand how carry works, we can compute carries in advance.
This is called "Carry Look-Ahead."
For any bit position, if A = 1 and B = 1, then Cout = 1, i.e., a carry will be
generated to the next bit position regardless of the value of Cin. This is
called "Carry Generate."
For any bit position, if one input is 1 and the other input is 0, Cout
will equal Cin (i.e., the value of Cin will be propagated to the next bit
position). This is called "Carry Propagate."
For any bit position, if A = 0 and B = 0, Cout will equal 0 regardless
of the value of Cin. This is called "Carry Stop."
Carry Generate, Propagate and Stop
Truth table for the full adder's carry behavior (Cin = x, i.e., don't care):

A  B  Cout  Classification
0  0  0     Carry Stop (CSi)
0  1  Cin   Carry Propagate (CPi)
1  0  Cin   Carry Propagate (CPi)
1  1  1     Carry Generate (CGi)

[Figure: a full adder augmented to emit its generate/propagate/stop signals]
No need for carry chain
Carry Look Ahead Basics
The equation to compute Cin at bit position i is as follows:
Cin_i = CG_{i-1}
      + CP_{i-1} ● CG_{i-2}
      + CP_{i-1} ● CP_{i-2} ● CG_{i-3}
      …
      + CP_{i-1} ● CP_{i-2} ● … ● CP_1 ● CG_0
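As a sanity check on these equations, the following C sketch (illustrative, with assumed 4-bit inputs) evaluates a 4-bit carry look-ahead: all generate and propagate signals are computed first, and each carry is then a flat AND/OR expression with no chaining through earlier sum bits:

#include <stdio.h>

/* 4-bit carry look-ahead: all generate (g) and propagate (p) signals are
   computed at once, then each carry-in is a two-level AND/OR expression
   of them, independent of the previous sum bits. */
int main(void) {
    unsigned a = 0xB, b = 0x6, cin0 = 0;         /* assumed example inputs */
    unsigned g[4], p[4], c[5];
    for (int i = 0; i < 4; i++) {
        g[i] = (a >> i) & (b >> i) & 1;          /* carry generate  */
        p[i] = ((a >> i) ^ (b >> i)) & 1;        /* carry propagate */
    }
    c[0] = cin0;
    c[1] = g[0] | (p[0] & c[0]);
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
                | (p[2] & p[1] & p[0] & c[0]);
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
                | (p[3] & p[2] & p[1] & g[0])
                | (p[3] & p[2] & p[1] & p[0] & c[0]);
    unsigned sum = 0;
    for (int i = 0; i < 4; i++)
        sum |= (p[i] ^ c[i]) << i;               /* sum bit = p XOR carry-in */
    printf("sum=0x%X carry-out=%u\n", sum, c[4]); /* 0xB + 0x6: sum=0x1, carry-out=1 */
    return 0;
}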
Practical Considerations
Cin_i = CG_{i-1}
      + CP_{i-1} ● CG_{i-2}
      + CP_{i-1} ● CP_{i-2} ● CG_{i-3}
      …
      + CP_{i-1} ● CP_{i-2} ● … ● CP_1 ● CG_0
Very wide (more than 8-input) gates are impractical, so we
would likely use a log(n)-depth tree of gates to implement the
wide ANDs and ORs. This is still faster than chained carry,
even for 16 bits (and is much faster for 32- or 64-bit adders).
What About Multiplication?
Classic Multiplication in Hardware/Software
Use add-and-shift, like the pen-and-paper method:
[Figure: 4×4 partial-product array, one shifted row of partial products per multiplier bit]
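A minimal C sketch of this method (an illustration, not lecture code): each set bit of the multiplier contributes one shifted copy of the multiplicand, exactly like a pen-and-paper partial product row:

#include <stdint.h>
#include <stdio.h>

/* Add-and-shift multiplication: for each set bit of the multiplier,
   add a correspondingly shifted copy of the multiplicand. */
uint32_t mul_shift_add(uint16_t multiplicand, uint16_t multiplier) {
    uint32_t product = 0;
    uint32_t m = multiplicand;           /* shifted left each iteration */
    while (multiplier != 0) {
        if (multiplier & 1)
            product += m;                /* add this partial product    */
        m <<= 1;                         /* shift multiplicand left     */
        multiplier >>= 1;                /* examine next multiplier bit */
    }
    return product;
}

int main(void) {
    printf("%u\n", mul_shift_add(13, 11));   /* prints 143 */
    return 0;
}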
Speeding Up Binary Multiplication
1. Retire more than one bit at a time:
• 2 bits at a time ("Booth's Algorithm")
• 3 bits at a time:

Bits  Operation
000   No operation
001   Add multiplicand
010   Add multiplicand
011   Add 2× multiplicand
100   Sub 2× multiplicand
101   Sub multiplicand
110   Sub multiplicand
111   No operation
2. Parallel Multiplier Using Carry Save Addition
Carry Save Addition
The idea is to perform several additions in sequence, keeping the carries
and the sum separate. This means that all of the columns can be added
in parallel without relying on the result of the previous column, creating a
two-output "adder" with a time delay that is independent of the size of its
inputs. The sum and carry can then be recombined using one normal
carry-aware addition (ripple or CLA) to form the correct result.
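A one-step illustration in C (operand values are assumed): three operands collapse into a sum word and a shifted carry word using only bitwise logic per column, and a single ordinary addition then produces the final result:

#include <stdio.h>
#include <stdint.h>

/* Carry-save addition: a row of full adders compresses three operands into
   a sum word and a carry word with constant delay; one carry-propagating
   add at the end recombines them. */
int main(void) {
    uint32_t a = 25, b = 17, c = 30;                      /* assumed operands */
    uint32_t sum   = a ^ b ^ c;                           /* bitwise sum, carries withheld */
    uint32_t carry = ((a & b) | (a & c) | (b & c)) << 1;  /* saved carries, shifted left */
    printf("%u\n", sum + carry);                          /* final carry-propagate add: 72 */
    return 0;
}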
CSA Uses Full Adders
"Wallace Tree" Addition
[Figure: a linear CSA chain has depth = 7 using 7 adders, plus a final add with carry; a Wallace tree has depth = 4 using the same 7 adders, plus a final add with carry]
[Figure: a 4-bit example, with carry propagating to the right (or handled by carry look-ahead)]
Example: An 8-bit Carry Save Array Multiplier
A parallel multiplier for unsigned operands. It is composed of 2-input AND
gates for producing the partial products, a series of carry-save adders for
adding them, and a ripple-carry adder for producing the final product.
[Figure: partial products generated by AND gates; each carry-save cell is a full adder (FA) with 3 inputs and 2 outputs]
What is Computer Architecture?
Machine Organization + Instruction Set Architecture
Decisions in each area are made for reasons of:
Cost
Performance
Compatibility with earlier designs
Computer Design is the art of balancing these
criteria
Classic Machine Organization (Von Neumann)
Input (mouse, keyboard, …)
Output (display, printer, …)
Memory: main (DRAM), cache (SRAM), secondary (disk, CD, DVD, …)
Processor (CPU): datapath + control
[Figure: input and output devices connect to the processor (control + datapath), which connects to memory; memory holds binary words representing both instructions and data]
Atanasoff–Berry Computer (Iowa State University)
(1937-42; vacuum tubes)
Zuse Z3 (Nazi Germany)
(1941-43; relays)
Von Neumann (Princeton) Machine (circa 1945)
[Figure: keyboard/input device and output device connect to the CPU (ALU, registers, control), which connects to a single memory holding data + instructions]
John von Neumann (and others) made it possible.
Gordon Moore, Andy Grove (and others) made it small and fast.
Harvard Mark 1 (circa 1940)
Howard Aiken
The ALU
Arithmetic (in order of implementation complexity):
Add
Subtract
Shift (Right and Left)
Rotate (Right and Left)
Multiply
Divide
Floating Point Operations
Logic (usually implemented with multiplexors)
And, Or, Not, Xor, Nand, Nor, Xnor, etc.
[Figure: ALU block with 32-bit inputs a and b, an operation select input, and a 32-bit result output]
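A behavioral C sketch of such an ALU (the operation set and names are hypothetical, chosen for illustration); the opcode plays the role of the multiplexor select lines:

#include <stdint.h>
#include <stdio.h>

/* A behavioral model of a small ALU: the operation code selects among
   the function units, like the select lines of a multiplexor. */
typedef enum { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR, OP_NOT } AluOp;

uint32_t alu(AluOp op, uint32_t a, uint32_t b) {
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    case OP_XOR: return a ^ b;
    case OP_NOT: return ~a;      /* b ignored */
    }
    return 0;
}

int main(void) {
    printf("0x%X\n", alu(OP_XOR, 0xF0F0F0F0u, 0x0F0FFFFFu)); /* 0xFFFF0F0F */
    return 0;
}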
Registers
While there have been "memory-only" machines, even early
computers typically had at least one register (called the
"accumulator"), used to capture the output of the ALU for
the next instruction.
Since memory (RAM) is much slower than registers (which
are internal to the CPU), we would like a lot of them.
But registers are very expensive relative to RAM, and we
have to be able to address every register. This impacts both
instruction set design and word length (e.g., 8 bit, 16 bit, 32
bit, 64 bit).
This has led to unusual designs, e.g., the SPARC
architecture's "register windows."
Control
Early computers were hardwired to perform a single
program.
Later, the notion of a "stored program" was introduced.
Early programmers entered programs in binary directly into
memory using switches and buttons. Assemblers and
compilers made it possible for more human-readable
programs to be translated into binary.
Binary programs, however entered, are interpreted by the
hardware to generate control signals. This interpretation
can be done by "hardwired" logic, or by another computer using what is
known as "microprogramming."
Processing Logic: fetch-execute cycle
Executing the current instruction involves one or more of
the following tasks:
Have the ALU compute some function out = f (register values)
Write the ALU output to selected registers
As a side-effect of this computation,
determine what instruction to fetch and execute next.
What Do Instructions Look Like in Memory?
In a Von Neumann machine, both instructions and data are stored in
the same memory.
Data is just a set of bits in one or more words of memory.
Instructions contain operation codes ("Op Codes") and addresses
(of either registers or RAM).
Suppose "addr" was 4 bits and the word length was 16 bits. How
many registers could we have? How many operations?
oprn | addr1 | addr2 | addr3
oprn | addr1 | addr2
oprn | addr1
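One way to work the question above: a 4-bit addr field can name 2^4 = 16 registers. If every instruction uses the three-address format, the addresses consume 3 × 4 = 12 bits of the 16-bit word, leaving 4 opcode bits, i.e., 16 distinct operations. A two-address format would leave 8 opcode bits (256 operations), and a one-address format 12 bits (4096 operations).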
Architecture Families
Before the mid-60s, every machine had a different ISA;
programs from the previous generation could not run on a new
machine (this made replacement very expensive)
IBM System/360 introduced the concept of an
"architecture family" based on different detailed
implementations
single instruction set architecture
wide range of price and performance with same software:
o memory path width (1 byte to 8 bytes)
o faster, more complex CPU design
o greater I/O throughput and overlap
IBM 360 Architecture Family
Model  Shipped  Scientific   Commercial   CPU BW    Memory BW  Memory Size
                Perf. (KIPS) Perf. (KIPS) (MB/sec)  (MB/sec)   (KB)
30     Jun-65   10.2         29           1.3       0.7        8-64
40     Apr-65   40           75           3.2       0.8        16-256
50     Aug-65   133          169          8         2          64-512
20     Mar-66   2            2.6          –         –          4-32
91     Oct-67   1,900        1,800        133       164        1024-4096
65     Nov-65   563          567          40        21         128-1024
75     Jan-66   940          670          41        43         256-1024
67     May-66   –            –            40        21         512-2048
44     Sep-66   118          185          16        4          32-256
95     Feb-68   3,800 est.   3,600 est.   133       711        5220
25     Oct-68   9.7          25           1.1       2.2        16-48
85     Dec-69   3,245        3,418        100       67         512-4096
195    Mar-71   10,000 est.  10,000 est.  148       169        1024-4096
(– = not reported on the slide)
The Intel x86 Architecture History
[Table excerpt; most recent entry: Xeon Platinum 8276L (2019, $16,616): ~7B transistors, 4.0 GHz, 42-bit addressing (64-bit arch.), 384, 4.5 TB max memory, …]
The Intel x86 Instruction Set Architecture
Complexity
instructions from 1 to 17 bytes long
one operand must act as both a source and destination
one operand may come from memory
several complex addressing modes
Why has the x86 architecture survived this long?
Historically tied to MS Windows
The most frequently used instructions are relatively easy to implement and optimize
Compilers avoid the portions of the architecture that are slow (i.e., most compilers for x86 machines use only a fraction of the instruction set).
CISC vs. RISC
CISC = Complex Instruction Set Computer
RISC = Reduced Instruction Set Computer
Historically, machines tend to add features over time
Instruction opcodes
IBM 70X, 70X0 series went from 24 opcodes to 185 in 10 years
At the same time, performance increased 30 times
Addressing modes
Special purpose registers
CISC motivations were to:
improve efficiency, since complex instructions implemented in
hardware presumably execute faster
make life easier for compiler writers
support more complex higher-level languages
CISC vs. RISC
Examination of actual code demonstrated many of these features were
not used, largely because compiler code generation and optimization is
hard even with simple instruction sets.
RISC advocates (e.g., Dave Patterson of UC Berkeley) proposed
simple, limited (reduced) instruction set
large number of general purpose registers
instructions mostly operate only on registers
optimized instruction pipeline
Benefits of this approach included:
faster execution of instructions commonly used
faster design and implementation
Issues: things like floating point had to be implemented in SW
CISC vs. RISC
Some early RISC architectures compared to contemporaneous CISC
Machine         Year  Instructions  Instr. Size (bytes)  Addr. Modes  Registers
IBM 370/168     1973  208           2 - 6                4            16
VAX 11/780      1978  303           2 - 57               22           16
Intel 80486     1989  235           1 - 11               11           8
Motorola 88000  1988  51            4                    3            32
MIPS R4000      1991  94            4                    1            32
IBM RS/6000     1990  184           4                    2            32
CISC vs. RISC
Which approach is best?
In general, fewer simpler instructions allow for increased clock
speeds.
Typically, RISC processors take less than half the design time of a
CISC processor, sometimes much less.
RISC/CISC comparisons often neglect the increased time it takes to
do things like develop a software floating point library.
In addition, CISC designers have adopted RISC techniques
everywhere possible.
Instruction complexity is only one variable
A couple of design principles:
Make the common case fast.
Design for the expected workload, e.g., a GPU needs a very
different ISA than a Windows 10 CPU.
Some Typical Assembly Language Constructs
// In what follows R1, R2, R3 are registers, PC is the program counter,
// and addr is some address in memory. There is an implied PC++
// with every instruction.
ADD R1,R2,R3    // R1 ← R2 + R3
ADDI R1,R2,addr // R1 ← R2 + addr
AND R1,R1,R2    // R1 ← R1 and R2 (bit-wise)
JMP addr        // PC ← addr
JEQ R1,R2,addr  // IF R1 == R2 THEN PC ← addr ELSE PC++
LOAD R1, addr   // R1 ← RAM[addr]
STORE R1, addr  // RAM[addr] ← R1
NOP             // Do nothing
// Etc. – *many* variants
Three Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Load R1, A // R1 ← Mem[A]
Load R2, B // R2 ← Mem[B]
Sub R3, R1, R2 // R3 ← R1 – R2
Load R1, D // R1 ← Mem[D]
Load R2, E // R2 ← Mem[E]
Mpy R4, R1, R2 // R4 ← R1 * R2
Load R1, C // R1 ← Mem[C]
Add R2, R1, R4 // R2 ← R1 + R4
Div R1, R3, R2 // R1 ← R3 / R2
Store X, R1 // Mem[X] ← R1
This code: 10 instructions, 6 Memory references,
code is not compact.
There are typically a finite number of registers,
on the order of 16-32
Two Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Load R1, B // R1 ← Mem[B]
Load R2, A // R2 ← Mem[A]
Sub R2, R1 // R2 ← R2 – R1
Load R1, D // R1 ← Mem[D]
Load R3, E // R3 ← Mem[E]
Mpy R1, R3 // R1 ← R1 * R3
Load R4, C // R4 ← Mem[C]
Add R1, R4 // R1 ← R1 + R4
Div R2, R1 // R2 ← R2 / R1
Store X, R2 // Mem[X] ← R2
There are typically a finite number of registers,
on the order of 16-32
This code: 10 instructions, 6 Memory references,
code is a little more compact.
One Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Load A // Acc ← Mem[A]
Sub B // Acc ← Acc - Mem[B]
Store Temp1 // Mem[Temp1] ← Acc
Load D // Acc ← Mem[D]
Mpy E // Acc ← Acc * Mem[E]
Add C // Acc ← Acc + Mem[C]
Store Temp2 // Mem[Temp2] ← Acc
Load Temp1 // Acc ← Mem[Temp1]
Div Temp2 // Acc ← Acc / Mem[Temp2]
Store X // Mem[X] ← Acc
There is one register, called the Accumulator
This code: 10 instructions, 10 Memory references,
code is more compact.
Zero Address Architecture
Consider the following code fragment: X = (A-B) / (C+(D*E))
Push D // SP = SP + 1; Mem[SP] ← Mem[D];
Push E // SP = SP + 1; Mem[SP] ← Mem[E];
Mpy // Mem[SP-1] ← Mem[SP] * Mem[SP-1];
// SP = SP -1
Push C // SP = SP + 1; Mem[SP] ← Mem[C]
Add // Mem[SP-1] ← Mem[SP] + Mem[SP-1]
// SP = SP -1
Push B // SP = SP + 1; Mem[SP] ← Mem[B]
Push A // SP = SP + 1; Mem[SP] ← Mem[A]
Sub // Mem[SP-1] ← Mem[SP] - Mem[SP-1];
// SP = SP -1
Div // Mem[SP-1] ← Mem[SP] / Mem[SP-1]
Pop X // Mem[X] ← Mem[SP]; SP = SP - 1
10 instructions, 24 Memory references,
code is very compact.
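For comparison, here is the same expression evaluated by a tiny stack machine sketched in C (operand values are hypothetical; operands are pushed so that each operator finds its left operand on top, matching the Push B / Push A order above):

#include <stdio.h>

/* A minimal stack-machine rendering of X = (A-B) / (C+(D*E)). */
static double stk[16];
static int sp = -1;

static void push(double v) { stk[++sp] = v; }
static double pop(void)    { return stk[sp--]; }

int main(void) {
    double A = 20, B = 4, C = 2, D = 3, E = 2;   /* assumed values */
    push(D); push(E);
    push(pop() * pop());                         /* Mpy: D*E       */
    push(C);
    push(pop() + pop());                         /* Add: C + D*E   */
    push(B); push(A);
    double top = pop();
    push(top - pop());                           /* Sub: A - B     */
    double num = pop();
    double den = pop();
    printf("X = %g\n", num / den);               /* Div + Pop: X = 2 */
    return 0;
}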
What Does it Mean to Make the Common Case Fast?
There are a variety of techniques to speed up instruction execution.
Some of these include:
Increase clock rate (we are approaching some hard limits here)
Pipelining (execute more than one instruction at one time)
Caching (store data we will need near the processor)
Other methods of improving memory hierarchy access time
Note that we only need to employ these techniques for instructions that
actually get used.
How do we know what instructions get used?
There are decades of research exploring how different kinds of
compilers and programs use instructions and memory.
If we have a specialized workload, we can study its execution
ourselves.
Registers (very small and very fast) –
implemented as part of the processor.
Cache (small and fast storage for data and
instructions we expect to need). There may
be several layers of cache, e.g., a small and
very fast L1 cache, a larger and somewhat
slower L2 cache, and an even larger and not
quite as fast L3 cache.
Main Memory (RAM) – the bulk of the
volatile memory available to the CPU.
Usually implemented using dynamic RAM.
Disk – Relatively non-volatile storage for
large amounts of information. May use
rotating media, or (more recently) SSD (solid
state drive) technology.
Although less common today, there may be
additional layers in the memory hierarchy,
e.g., tape, on-line, DVD, etc.
The Memory Hierarchy
[Figure: the memory hierarchy between CPU and memory: registers → cache → RAM → disk; size grows from smallest to biggest, cost ($/bit) falls from highest to lowest, and speed falls from fastest to slowest moving away from the CPU]
Ideally, all the memory we want would be in the processor, but that is
cost-prohibitive (and certainly wasteful) by today’s standards.
We use the memory hierarchy to efficiently create the illusion that all
memory is the same.
The memory hierarchy must be inclusive, i.e., lower levels must include
everything present in higher levels of the memory hierarchy.
The performance of the memory hierarchy depends on hit rate, i.e., how
often we find what we need at higher levels.
Memory Hierarchy
[Figure: levels 1 through n of the memory hierarchy below the processor (CPU); distance from the CPU in access time and the size of the memory at each level both increase downward; data are transferred between adjacent levels in blocks (the unit of data copy)]
Program Locality
Caching in the memory hierarchy works because of
two kinds of program "locality."
temporal locality: an item (data or instruction) a
program has just accessed is likely to be
accessed again in the near future. Why?
spatial locality: items near an item a program has
just accessed are likely to be referenced soon.
Why?
Cache Terminology
block: minimum unit of data moved between levels
hit: data requested is found in the nearest upper level
miss: data requested is not found in nearest upper level
hit rate: fraction of memory accesses that are hits
miss rate: fraction of memory accesses that are not hits
miss rate = 1 – hit rate
hit time: time to determine if the access is a hit + time to deliver the data to the CPU
miss penalty: time to determine if the access is a miss + time to replace block at upper level with corresponding block at lower level + time to deliver the data to the CPU
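These terms combine in the standard average memory access time relation (a textbook formula, not stated on the slide): AMAT = hit time + miss rate × miss penalty. For example, a 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty give AMAT = 1 + 0.05 × 100 = 6 ns.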
Simple example:
assume block size = one word of data
Issues:
how do we know if a data item is in the cache?
if it is, how do we find it?
if not, what do we do, and what if the cache is full (or "dirty")?
Solution depends on cache addressing scheme.
How Do Caches Actually Work?
[Figure: (a) before the reference to Xn, the cache holds X1, X2, X3, X4, …, Xn-2, Xn-1 but not Xn; (b) the reference to Xn causes a miss, so Xn is fetched from memory and added to the cache]
Fully Associative - A cache where data from any address can be stored in any cache location. All tags are compared simultaneously (associatively) with the requested address and if one matches then its associated data is accessed.
Direct Mapped - A cache where the cache location for a given address is explicitly determined from the middle address bits. The remaining top address bits are stored as a "tag" along with the entry. In this scheme, there is only one place for any block to go.
Set Associative - A compromise between a direct mapped cache and a fully associative cache where each address is mapped to a certain "set" of cache locations.
A direct mapped cache could be referred to as "one-way set associative", i.e. one location in each set, whereas a fully associative cache is "N-way associative" (where N is the total number of blocks in the cache).
Cache Addressing Schemes
Addressing scheme in a direct-mapped cache:
cache block address = memory block address mod cache size (unique)
if cache size = 2^m blocks, cache address = lower m bits of the n-bit memory block address
remaining upper n-m bits are kept as tag bits at each cache block
also need a valid bit to recognize a valid entry
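A quick C sketch of this address split (the address value is arbitrary) for a direct-mapped cache with 1024 one-word blocks, matching the implementation example below:

#include <stdio.h>
#include <stdint.h>

/* Splitting a 32-bit byte address for a direct-mapped cache with
   1024 (2^10) one-word blocks: 2-bit byte offset, 10-bit index, 20-bit tag. */
int main(void) {
    uint32_t addr   = 0x12345678;            /* arbitrary example address */
    uint32_t offset = addr & 0x3;            /* bits 1:0   */
    uint32_t index  = (addr >> 2) & 0x3FF;   /* bits 11:2  */
    uint32_t tag    = addr >> 12;            /* bits 31:12 */
    printf("tag=0x%05X index=%u offset=%u\n", tag, index, offset);
    return 0;                                /* tag=0x12345 index=414 offset=0 */
}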
Direct Mapped Cache
[Figure: an eight-block direct-mapped cache (indices 000-111) and a larger memory; addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache index 001 via their low-order bits]
Direct Mapped Cache Implementation Example
[Figure: cache with 1024 one-word blocks. The 32-bit address (bits 31-0) is split into a 2-bit byte offset (ignored), a 10-bit index (bits 11:2) selecting one of entries 0-1023, and a 20-bit tag (bits 31:12) compared against the stored tag; a valid bit qualifies the comparison, producing Hit and 32 bits of Data]
Implementation of a Set-Associative Cache
[Figure: the address is split into a 22-bit tag and an 8-bit index; the index selects one of 256 sets (0-255), each holding four (V, Tag, Data) entries; four comparators and a 4-to-1 multiplexor produce Hit and Data]
4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor:
size of cache is 1K blocks = 256 sets * 4-block set size
Performance of Set-Associative Caches
[Plot: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for eight cache sizes from 1 KB to 128 KB]
Miss rates for each of eight cache sizes
with increasing associativity:
data generated from SPEC92 benchmarks
with a 32-byte block size for all caches
It is generally more effective to increase the number of
entries rather than associativity.
Cache Replacement Policy
Cache has finite size.
What if the cache is full?
Analogy:
• Desktop full? Move books to bookshelf to make room
• Bookshelf full? Move least-used to library, etc.
Caches follow this same idea: if "replacement" is necessary,
move old block to next level of cache (if "dirty").
How do we choose the "victim"? Many policies are possible, e.g.:
• FIFO (first-in-first-out)
• LRU (least recently used)
• NMRU (not most recently used)
• Random
• Random + NMRU (almost as good as LRU)
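A small C sketch of LRU victim selection for one 4-way set (the timestamps are hypothetical): each way records when it was last used, and the victim is the way with the oldest timestamp:

#include <stdio.h>

/* LRU victim selection for a single 4-way set: evict the way whose
   "last used" timestamp is oldest. */
#define WAYS 4

int main(void) {
    unsigned last_used[WAYS] = { 7, 2, 9, 5 };    /* hypothetical access times */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;                           /* oldest = least recently used */
    printf("evict way %d\n", victim);             /* prints: evict way 1 */
    return 0;
}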
Pipelining: The Basic Idea
Pipelining breaks instruction execution down into multiple stages
Put registers between stages to "buffer" data and control.
The idea is to start executing one instruction and, as it
moves to the second stage, start executing a second
instruction, and so on.
The speedup equals the number of stages, as long as the pipeline stays full.
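A worked example of that claim: with k pipeline stages and n instructions, execution takes roughly k + (n - 1) cycles instead of k × n. For k = 5 and n = 1000, that is 1004 cycles versus 5000, a speedup of about 4.98, approaching the stage count of 5 as n grows.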
Pipeline Hazards
Why the pipeline is not always full:
Structural hazards arise from resource conflicts, when the
hardware cannot support all possible combinations of instructions in
simultaneous overlapped execution (e.g., a floating point operation).
Data Hazards arise when an instruction depends on the result of a
previous instruction in a way that is exposed by the overlapping of
instructions in the pipeline (e.g., A = B**19; A = A**C; or a cache
miss)
Control Hazards arise from the pipelining of branches and other
instructions that change the PC (program counter) which points to
the next instruction to execute (e.g., an if statement).
Hazards in pipelines can make it necessary to stall the pipeline. When
this happens, some instructions in the pipeline are allowed to proceed
while others are delayed. When an instruction is stalled, all the
instructions issued later than the stalled instruction are also stalled.
Instructions issued earlier than the stalled instruction must continue,
since otherwise the hazard will never clear.
Some Final Thoughts
Processor speeds remain very fast relative to everything
else in the memory hierarchy. This isn't likely to change in
the near future.
New designs are making memory wider (and a little bit
faster).
Compilers are getting better at restructuring code to
increase locality and to reduce the number of pipeline
hazards. In the general case, this is really hard.
Processor designers are making the cache visible in the
instruction set architecture, making it possible for
programmers / compilers to use "pre-fetching" to manually
populate cache entries.