DigiTech - H&H - Summary
Summary of the textbook D. Harris and S. Harris, "Digital Design and Computer Architecture (2nd
Edition)."
Author: Ruben W. G. Schenk, [email protected]
Date: 01.05.2020
Version: 0.6
Current scope: Chapters 1 - 8 (Full lecture)
I, Ruben Schenk, take no responsibility for the accuracy, currency, reliability and correctness of any
information included in the work provided. This work is not licensed under a Creative Commons License.
Sharing this work is prohibited without requesting permission to do so.
Chapter 1 - From Zero to One
1.3 The Digital Abstraction
Digital systems as we know them represent information with discrete variables. These are variables
with a finite number of distinct values. The amount of information $D$ in a discrete value with $N$
distinct states is measured in units of bits as $D = \log_2 (N) \ bits$.
1.4 Number Systems
1.4.1 Decimal Numbers
The decimal number representation is the one we learned at school. There are ten digits, ranging from 0
to 9. Decimal numbers are referred to as base $10$ .
1.4.2 Binary Numbers
Bits represent one of two values, $0$ or $1$, and are joined together to form binary numbers. Each
column of a binary number has twice the weight of the previous column, so binary numbers are base
$2$. If we want to convert decimal to binary numbers, we can work from right to left: we divide the
decimal number repeatedly by 2, and each remainder goes into the next binary position.
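The repeated-division method can be sketched in Python (an illustrative sketch, not code from the book):

```python
def dec_to_bin(n):
    """Convert a non-negative decimal integer to a binary string
    by repeated division by 2; remainders fill positions right to left."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))  # remainder becomes the next binary digit
        n //= 2
    return "".join(reversed(bits))  # remainders were produced lsb-first
```

For example, dec_to_bin(13) walks through 13, 6, 3, 1 and collects the remainders 1, 0, 1, 1.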
1.4.3 Hexadecimal Numbers
If we want to represent large numbers, it is sometimes more convenient to represent them in
hexadecimal numbers, since each column has 16 times the weight of the previous one, so hexadecimal
numbers are base $16$. They use the digits 0 to 9 along with the letters A to F. If we want to convert
hexadecimal to binary, we simply convert the hexadecimal digits one by one, since each hexadecimal
digit corresponds to a 4-bit binary number. If we want to convert decimal to hexadecimal, we start
from the left and look for the highest power of 16 that still fits in our decimal number. E.g. if the 2nd
power of 16 fits in our number once, then there is a $1$ in the 3rd position (counted from the right) of
our hexadecimal number. We repeat this with the remainder.
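The highest-power-of-16 method can likewise be sketched in Python (illustrative, not from the book):

```python
def dec_to_hex(n):
    """Convert a non-negative decimal integer to a hexadecimal string by
    finding the highest power of 16 that fits, then repeating on the remainder."""
    if n == 0:
        return "0"
    digits = "0123456789ABCDEF"
    p = 0
    while 16 ** (p + 1) <= n:  # highest power of 16 that still fits
        p += 1
    out = []
    for k in range(p, -1, -1):
        d, n = divmod(n, 16 ** k)  # how often does 16^k fit? rest is remainder
        out.append(digits[d])
    return "".join(out)
```

For example, dec_to_hex(5837) finds $16^3 = 4096$ fits once, then works through the remainder to produce 16CD.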
1.4.4 Bytes and Nibbles
We introduce some definitions of important terms:
Byte: A group of eight bits
Nibble: Half a byte, i.e. a group of four bits
Word: Data chunks a microprocessor handles, e.g. a group of 64 bits in a 64-bit architecture
Least significant bit (LSB): The right-most bit in a group of bits
Most significant bit (MSB): The left-most in a group of bits
Kilobyte (1KB): $2^{10} \approx 1'000$ ("a thousand")
Megabyte (1MB): $2^{20} \approx 1'000'000$ ("a million")
Gigabyte (1GB): $2^{30} \approx 1'000'000'000$ ("a billion")
1.4.6 Signed Binary Numbers
Until now we only looked at unsigned binary numbers that represent positive quantities. If we want to
represent negative numbers too, we need to use signed binary numbers. The two most common schemes
to represent signed binary numbers are:
Sign/Magnitude Numbers
Sign/Magnitude numbers are intuitively appealing, because they simply pair a binary magnitude with an
additional sign bit in the msb position. An $N$-bit sign/magnitude number therefore uses the msb as
the sign ($0$ indicating positive) and the remaining $N-1$ bits as the magnitude, i.e. the absolute
value. For example $+5 = 0101$ and $-5 = 1101$. The main problem with this representation is that it
doesn't allow for ordinary binary addition.
Two's Complement Numbers
This representation overcomes the problem of sign/magnitude numbers since it allows for ordinary
binary addition. In the two's complement representation, we obtain a negative number by taking the
complement of the positive counterpart and adding one to the lsb. For example $+5 = 0101$ and $-5 =
1010 + 1 = 1011$. Again, the msb represents the sign of the number ($0$ for positive, $1$ for negative).
One exception is the most negative number . It is sometimes called the weird number , since it doesn't
have a positive counterpart in the two's complement representation. Keep in mind that when adding
more bits to a two's complement number, we must perform sign extension . For example $0101
\rightarrow 00000101$ and $1011 \rightarrow 11111011$.
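These rules can be checked with a small Python sketch (illustrative, not from the book):

```python
def twos_complement(value, n):
    """N-bit two's complement encoding of an integer as a bit string.
    Negating a number = invert all bits, then add 1 to the lsb."""
    assert -(1 << (n - 1)) <= value < (1 << (n - 1)), "value out of range"
    return format(value & ((1 << n) - 1), f"0{n}b")

def sign_extend(bits, n):
    """Widen a two's complement bit string by replicating the msb (sign bit)."""
    return bits[0] * (n - len(bits)) + bits
```

Note that twos_complement(-8, 4) yields 1000, the "weird number" without a positive 4-bit counterpart.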
1.5 Logic Gates
Logic gates are simple digital circuits that take one or more binary inputs and produce a binary output.
The defined logic gates are:
1.6 Beneath The Digital Abstraction
1.6.1 Supply Voltage
In reality, binary signals are represented with a voltage on a wire. The lowest voltage in a system is
called the ground (GND) and the highest voltage in the system is often denoted with $V_{DD}$. If we
focus on two logic gates, the first one is called the driver and the second one the receiver. The driver
produces either a LOW(0) in the range from $0$ to $V_{OL}$ or a HIGH(1) in the range from $V_{OH}$ to
$V_{DD}$. Similarly, if the receiver gets an input between $0$ and $V_{IL}$, it will consider the input as
LOW(0), and if it's an input between $V_{IH}$ and $V_{DD}$, it will consider it as HIGH(1). Every signal
outside of those ranges is said to be in the forbidden zone.
1.6.3 Noise Margins
The noise margin is the amount of noise that could be added to a worst-case output such that the
signal can still be correctly interpreted by the receiver. The margins are computed with:
$NM_L = V_{IL} - V_{OL}$
$NM_H = V_{OH} - V_{IH}$
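A worked example in Python; the four voltage levels (in volts) are assumed illustrative values, not taken from the book:

```python
# Assumed logic levels (volts) for an illustrative 5 V logic family
V_OL, V_IL, V_IH, V_OH = 0.4, 1.35, 3.15, 4.4

NM_L = V_IL - V_OL  # noise a worst-case LOW output can tolerate
NM_H = V_OH - V_IH  # noise a worst-case HIGH output can tolerate
```

With these levels, $NM_L = 0.95 \ V$ and $NM_H = 1.25 \ V$.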
1.7 CMOS Transistors
Transistors are electrically controlled switches that turn $ON$ or $OFF$ when a voltage is applied to a
control terminal. The most common transistors are metal-oxide-semiconductor field effect transistors
(MOSFETs or MOS transistors).
1.7.3 Capacitors
A capacitor consists of two conductors separated by an insulator. When a voltage $V$ is applied to
one of the conductors, the conductor accumulates electric charge $Q$. The capacitance $C$ of the
capacitor is the ratio of charge to voltage: $C = \frac{Q}{V}$.
1.7.4 nMOS and pMOS Transistors
There are two flavors of MOSFETs: nMOS and pMOS. The n-type transistors, called nMOS, have regions
of n-type dopants (impurity atoms) adjacent to the gate, called the source and the drain, and are built
on a p-type semiconductor substrate. The pMOS transistors are just the opposite, consisting of p-type
source and drain regions in an n-type substrate.
When a positive voltage is applied to the top plate of a capacitor, it establishes an electric field that
attracts positive charge on the top plate and negative charge to the bottom plate. If the voltage is
sufficiently large, so much negative charge is attracted to the underside of the gate that the region
inverts from p-type to effectively become n-type. The inverted region is called the channel . Now the
transistor has a continuous path from the n-type source through the n-type channel to the n-type drain,
so electrons can flow from source to drain . The transistor is $ON$. pMOS transistors work in just the
opposite fashion.
In summary, CMOS processes give us two types of electrically controlled switches . The voltage at the
gate ($g$) regulates the flow of current between source ($s$) and drain ($d$). nMOS transistors are
$OFF$ when the gate is $0$ and $ON$ when the gate is $1$. pMOS transistors are $ON$ when the gate is
$0$ and $OFF$ when the gate is $1$.
Chapter 2 - Combinational Logic Design
2.1 Introduction
In digital electronics, a circuit is a network that processes discrete-valued variables. It has:
one or more input terminals
one or more output terminals
a functional specification
a timing specification
Digital circuits are classified as combinational or sequential. The difference is that a combinational
circuit combines the current input values to compute the output, whereas in a sequential circuit the
output depends on the input sequence, i.e. it depends on both current and previous values of the input.
Put another way, a combinational circuit is memoryless and a sequential circuit has memory.
We can therefore classify a circuit as combinational if it satisfies the following requirements:
Every circuit element is itself combinational
Every node (wire) of the circuit is either designated as an input or connects to exactly one output
terminal
The circuit contains no cyclic paths
2.2 Boolean Equations
2.2.1 Terminology
We define the following terms for a better understanding of the underlying topic:
Complement: The inverse of a variable, denoted $\bar A$
Literal: A variable in a boolean equation
Product (implicant): The $AND$ of one or more literals
Minterm: The product of all inputs (or their complement) of a function
Sum: The $OR$ of one or more literals
Maxterm: The sum of all inputs (or their complements) of a function
Precedence: The order of operations (NOT $\rightarrow$ AND $\rightarrow$ OR)
Sum-of-products canonical form: The sum of all minterms for which the output of a function is
$TRUE$
Prime implicant: An implicant that cannot be combined with other implicants to form a new
implicant with fewer literals
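The sum-of-products canonical form can be generated mechanically from a truth table; a Python sketch (not from the book; complements are written with a trailing ' instead of an overbar):

```python
from itertools import product

def sop_canonical(f, names):
    """Sum-of-products canonical form: the OR of all minterms (products of
    every input or its complement) for which the function f is TRUE."""
    terms = []
    for values in product([0, 1], repeat=len(names)):
        if f(*values):
            # build the minterm: complemented literal where the input is 0
            literals = [n if v else n + "'" for n, v in zip(names, values)]
            terms.append("".join(literals))
    return " + ".join(terms)
```

For example, XOR of $A$ and $B$ yields the two minterms A'B and AB'.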
2.4 From Logic to Gates
A schematic is a diagram of a digital circuit showing the elements and the wires that connect them
together. When drawing schematics, we will generally obey the following guidelines:
Inputs are on the left or top of a schematic
Outputs are on the right or bottom of a schematic
Whenever possible, gates should flow from left to right
Wires always connect at a T-junction
A dot where wires cross indicates a connection between the wires
Wires crossing without a dot make no connection
2.5 Multilevel Combinational Logic
2.5.2 Bubble Pushing
Bubble pushing is a helpful way to redraw combinational circuits so that the bubbles cancel out and
the function can be more easily determined. The guidelines for bubble pushing are as follows:
Begin at the output of the circuit and work toward the inputs
Push any bubbles on the final output back toward the inputs, so that you can read an equation in
terms of the inputs
Working backward, draw each gate in a form so that the bubbles cancel. If the current gate has an
input bubble, draw the preceding gate with an output bubble and vice versa.
2.6 X's, Z's, Oh My
2.6.1 Illegal Value: X
The symbol $X$ indicates that a circuit node has an unknown or illegal value. This commonly
happens if the node is being driven to both $0$ and $1$ at the same time. This situation is also called
contention. Digital designers also use the symbol $X$ to indicate "don't care" values in truth tables.
2.6.2 Floating Value: Z
The symbol $Z$ denotes that a node is being driven neither by $HIGH$ nor by $LOW$. The node is said
to be floating, high impedance or high Z . In reality, a floating node might be $0$, might be $1$, or
might be at some voltage in between.
2.7 Karnaugh Maps
Karnaugh maps (K-maps) are a graphical method for simplifying Boolean equations. They work well for
problems with up to four variables and, more importantly, they give insight into manipulating Boolean
equations.
The top row of the K-map gives the four possible values for the $A$ and $B$ inputs. The left column
gives the two (or four) possible values for input $C$ (and $D$). Each square in the K-map corresponds to
a row in the truth table and contains the value of output $Y$. The order of $AB$ and $CD$ inputs is in
Gray code . It differs from ordinary binary order ($00, 01, 10, 11$) in that adjacent entries differ only in a
single variable ($00, 01, 11, 10$).
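The Gray-code ordering can be generated with the standard $g = b \oplus (b \gg 1)$ trick (a Python sketch, not from the book):

```python
def gray_code(n):
    """n-bit Gray code sequence: adjacent entries differ in exactly one bit.
    Each binary index b maps to the Gray code b ^ (b >> 1)."""
    return [format(b ^ (b >> 1), f"0{n}b") for b in range(2 ** n)]
```

gray_code(2) produces the K-map column order 00, 01, 11, 10 described above.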
2.7.2 Logic Minimization with K-Maps
Simply put, we can minimize a Boolean equation by circling all the rectangular blocks of $1$'s in the
map, using the fewest possible number of circles. Each circle should be as large as possible. Then we
can read off the implicants that we circled. The rules for finding a minimized equation from a K-map
are as follows:
Use the fewest circles necessary to cover all the $1$'s
All the squares in each circle must contain only $1$'s
Each circle must span a rectangular block whose width and height are powers of 2 (i.e. 1, 2, or 4)
Each circle should be as large as possible
A circle may wrap around the edges of the K-map
A $1$ in a K-map may be circled multiple times if doing so allows fewer circles to be used
Example of how we can use a K-map to minimize a Boolean equation:
2.8 Combinational Building Blocks
2.8.1 Multiplexers
Multiplexers are among the most commonly used combinational circuits. They choose an output from
among several possible inputs based on the value of a select signal. A multiplexer is sometimes
affectionately called a mux. The combinational schematic of a multiplexer looks as follows:
2.8.2 Decoders
A decoder has $N$ inputs and $2^N$ outputs. It asserts exactly one of its outputs depending on the
input combination. The outputs of a decoder are called one-hot , because exactly one is hot ($HIGH$)
at a given time. The combinational schematic of a 2:4 decoder looks as follows:
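The one-hot behavior of a decoder can also be sketched in Python (illustrative, not from the book):

```python
def decoder(n, a):
    """N:2^N decoder: assert exactly one of the 2^N outputs (one-hot),
    selected by the N-bit input value a."""
    assert 0 <= a < 2 ** n
    y = [0] * (2 ** n)
    y[a] = 1  # exactly one output is hot
    return y
```

For a 2:4 decoder with input 11 (= 3), only output $Y_3$ is asserted.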
2.9 Timing
An output takes time to change in response to an input change. We can draw a timing diagram to
portray the transient response of a combinational circuit. The transition from $LOW$ to $HIGH$ is
furthermore called the rising edge and similarly, the transition from $HIGH$ to $LOW$ is called the
falling edge . The following example shows a timing diagram of a buffer:
2.9.1 Propagation and Contamination Delay
Combinational logic is characterized by its propagation delay $t_{pd}$, the maximum time from when
an input changes until the outputs reach their final value, and the contamination delay $t_{cd}$, the
minimum time from when an input changes until any output starts to change its value.
Since $t_{pd}$ and $t_{cd}$ are determined by signal paths in a combinational circuit, we introduce
the following two definitions:
Critical path: The longest and therefore slowest path in a circuit
Short path: The shortest and therefore fastest path through a circuit
We can now calculate $t_{pd}$ by adding up the propagation delays of each element along the
critical path, and similarly calculate $t_{cd}$ by adding up the contamination delays of each element
along the short path.
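A sketch of this calculation; the per-gate delays (in picoseconds) and the path lengths are assumed example values, not from the book:

```python
t_pd_gate = 100  # propagation delay of a single gate (assumed)
t_cd_gate = 60   # contamination delay of a single gate (assumed)

critical_path_gates = 3  # gates on the longest (critical) path
short_path_gates = 1     # gates on the shortest path

t_pd = critical_path_gates * t_pd_gate  # sum along the critical path
t_cd = short_path_gates * t_cd_gate     # sum along the short path
```

With these numbers, $t_{pd} = 300 \ ps$ and $t_{cd} = 60 \ ps$.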
Chapter 3 - Sequential Logic Design
3.1 Introduction
We will now analyze sequential logic. As mentioned before, the outputs of sequential logic depend
on both current and prior input values. Hence, sequential logic has memory.
3.2 Latches and Flip-Flops
The fundamental building block of memory is a bistable element, an element with two stable states.
Those states are either $0$ or $1$. Bistable elements also have a third possible state with the output
being approximately halfway between $0$ and $1$. This is called a metastable state and will be
discussed further on.
3.2.1 SR Latch
One of the simplest sequential circuits is the SR Latch , which is composed of two cross-coupled $NOR$
gates. The state of the SR Latch can be controlled through the $S$ and $R$ inputs, which set and
reset the output.
The SR Latch has the simple but powerful functionality, that if both inputs R and S are $0$, $Q$ will
keep the previous value $Q_{prev}$. The circuit has memory .
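This memory behavior can be simulated by iterating the two cross-coupled NOR gates until they settle (a Python sketch, not from the book):

```python
def sr_latch(s, r, q_prev):
    """Cross-coupled NOR SR latch, iterated to a stable state; returns Q.
    With S = R = 0 the latch holds q_prev: the circuit has memory."""
    q, qn = q_prev, 1 - q_prev
    for _ in range(4):                      # settle the feedback loop
        q_new = 0 if (r or qn) else 1       # Q  = NOR(R, Qn)
        qn_new = 0 if (s or q_new) else 1   # Qn = NOR(S, Q)
        if (q_new, qn_new) == (q, qn):
            break
        q, qn = q_new, qn_new
    return q
```

Setting (S=1) drives Q to 1, resetting (R=1) drives Q to 0, and S=R=0 keeps the previous value.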
3.2.2 D Latch
The limitation of the SR Latch is that an input on either $R$ or $S$ determines not only what the state
should be but also when the state should change. Designing circuits becomes easier once we can
separate those two actions. The D Latch solves this limitation.
It has two inputs. The data input $D$ controls what the next state should be and the clock input $CLK$
controls when the state should change. Observe that when $CLK = 0$, $Q$ remembers its old value
$Q_{prev}$ and if $CLK = 1$, $Q = D$. One also says that if $CLK = 1$, the latch is transparent and
opaque otherwise.
3.2.3 D Flip-Flop
A D Flip-Flop can be built from two back-to-back D Latches controlled by complementary clocks. The
first Latch L1 is called the master , the second latch L2 is called the slave .
The functionality of the D Flip-Flop is essential for many digital designers, therefore its description
should be known by heart:
A D flip-flop copies $D$ to $Q$ on the rising edge of the clock, and remembers its state at all other
times.
3.2.4 Register
An $N$-bit register is a bank of $N$ flip-flops that share a common $CLK$ input, so that all bits of the
register are updated at the same time. Registers are the key building block of most sequential circuits.
3.2.5 Enabled Flip-Flop
An enabled flip-flop adds another input called $EN$ or $ENABLE$ to determine whether data is loaded
on the clock edge. When $EN$ is $TRUE$, the enabled flip-flop behaves like an ordinary D flip-flop. When
$EN$ is $FALSE$, the enabled flip-flop ignores $D$ and retains its state.
3.2.6 Resettable Flip-Flop
A resettable flip-flop adds another input called $RESET$. When $RESET$ is $FALSE$, the resettable flip-
flop behaves like an ordinary D flip-flop. If $RESET$ is $TRUE$, the resettable flip-flop ignores $D$ and
sets the output to $0$.
We distinguish between synchronously and asynchronously resettable flip-flops. The difference is that
synchronously resettable flip-flops only reset themselves on the rising edge of $CLK$, whereas
asynchronously resettable flip-flops reset themselves as soon as $RESET$ becomes $TRUE$.
3.3 Synchronous Logic Design
In general, sequential circuits include all circuits that are not combinational, that is, those whose
outputs cannot be determined simply by looking at the current inputs.
3.3.2 Synchronous Sequential Circuits
In general, a sequential circuit has a finite set of discrete states $\{S_0, S_1,...,S_{k-1}\}$. A
synchronous sequential circuit has a clock input, whose rising edges indicate a sequence of times at
which state transitions occur. The rules of synchronous sequential circuit composition are:
Every circuit element is either a register or a combinational circuit
At least one circuit element is a register
All registers receive the same clock signal
Every cyclic path contains at least one register
Similarly, if a circuit is not synchronous, it's said to be asynchronous .
3.4 Finite State Machines
Finite state machines are synchronous sequential circuits with $k$ registers and a finite number
($2^k$) of unique states the machine can be in. An FSM consists of two blocks of combinational logic,
next state logic and output logic, and a register that stores the state.
There are two general classes of finite state machines. One is the Moore machine (a); in those, the
outputs depend only on the current state of the machine. In Mealy machines (b), the outputs
depend on both the current state of the machine and the current inputs.
3.4.1 FSM Design Example
Designing an FSM involves several different steps. Those steps usually follow this procedure (for
a Moore machine):
1. Identify the inputs and outputs
2. Sketch a state transition diagram
3. Write a state transition table
4. Write an output table
5. Select state encodings (this affects the hardware design)
6. Write Boolean equations for the next state and output logic
7. Sketch the circuit schematic
The above mentioned terms are defined and explained as follows:
State transition diagram: It indicates all the possible states of the system and the transitions between
these states. Example of a state transition diagram:
State transition table: The state transition diagram is rewritten as a state transition table. It indicates,
for each state and each input, what the next state should be. In the state transition table one often
makes use of the don't care symbol $X$.
State encoding: To implement the FSM in hardware it is necessary to encode each state. The two
most common encoding schemes are binary encoding and one-hot encoding. In the binary encoding
scheme, each state is represented as a binary number, therefore we only need $\lceil \log_2 (K) \rceil$
bits to represent a system with $K$ states. In a one-hot encoded system, each state is represented by
one separate bit. Therefore, in a system with one-hot encoding we use many more flip-flops than in a
binary encoded system, but we get much simpler output logic and fewer gates in general.
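The steps above can be illustrated with a small Moore machine in Python; the machine itself (output 1 once the input has been 1 for two consecutive cycles) is a hypothetical example, not from the book:

```python
# Next state logic as a lookup table: (state, input) -> next state
NEXT_STATE = {
    ("S0", 0): "S0", ("S0", 1): "S1",
    ("S1", 0): "S0", ("S1", 1): "S2",
    ("S2", 0): "S0", ("S2", 1): "S2",
}
# Moore machine: the output depends only on the current state
OUTPUT = {"S0": 0, "S1": 0, "S2": 1}

def run_fsm(inputs, state="S0"):
    """Apply one input per clock cycle and record the output after each edge."""
    outputs = []
    for a in inputs:
        state = NEXT_STATE[(state, a)]  # state register updates on the edge
        outputs.append(OUTPUT[state])
    return outputs
```

A Mealy version would instead look up the output from both the state and the current input.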
3.5 Timing of Sequential Logic
Recall that a flip-flop copies the input $D$ to the output $Q$ on the rising edge of the clock. This
process is called sampling $D$ on the clock edge. We define the aperture of a sequential element by a
setup time and a hold time, before and after the clock edge, respectively.
3.5.1 The Dynamic Discipline
When a clock rises, the output may start to change after the clock-to-Q contamination delay
$t_{ccq}$ and must definitely settle to its final value within the clock-to-Q propagation delay
$t_{pcq}$. For the circuit to sample its input correctly, the input must have stabilized at least some
setup time $t_{setup}$ before the rising edge of the clock and must remain stable for at least some
hold time $t_{hold}$ after the rising edge of the clock. The sum of the setup and hold time is called the
aperture time of the circuit, and during this time the input must be stable.
The dynamic discipline states that inputs of a synchronous sequential circuit must be stable during the
setup and hold aperture time around the clock edge.
3.5.2 System Timing
We introduce the following two terms that are of importance:
Clock period/Cycle time: $T_c$, the time between rising edges of a repetitive clock signal
Clock frequency: $f_c = \frac{1}{T_c}$
All those principles are shown in the following diagram:
3.5.4 Metastability
As noted earlier, it is not always possible to guarantee that the input to a sequential circuit is stable
during the aperture time. When a flip-flop samples an input that is changing during its aperture, the
output $Q$ may momentarily take on a voltage between $0$ and $V_{DD}$ that is in the forbidden
zone. This is called a metastable state.
3.6 Parallelism
The speed of a system is characterized by the latency and throughput of information moving through it.
We define a token to be a group of inputs that are processed to produce a group of outputs. The
latency of a system is the time required for one token to pass through the system from start to end. The
throughput is the number of tokens that can be produced per unit time.
The throughput can be improved by processing several tokens at the same time. This is called
parallelism, and it comes in two forms: spatial and temporal. With spatial parallelism, multiple copies
of the hardware are provided so that multiple tasks can be done at the same time. With temporal
parallelism, a task is broken into stages, like an assembly line. These tasks can be spread across stages,
and if spread correctly, a different task will be in each stage at any given time, so multiple tasks can
overlap. Temporal parallelism is commonly referred to as pipelining.
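A small numeric sketch of the two forms of parallelism; the latency and stage counts are assumed example values:

```python
L = 10.0  # latency: time for one token to pass through the system (assumed)

serial_throughput = 1 / L   # one token at a time
spatial_throughput = 2 / L  # spatial parallelism: two copies of the hardware

stages = 5                  # temporal parallelism: a 5-stage pipeline
temporal_throughput = 1 / (L / stages)  # one token completes per stage time
```

Note that pipelining improves throughput but not the latency of an individual token.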
Chapter 4 - Hardware Description Language
4.1 Introduction
!!!
The following summary of chapter four is based on the HDL SystemVerilog . Verilog is very similar to
SystemVerilog, though it doesn't allow all advanced syntax SystemVerilog does. Syntax for VHDL is
excluded from the summary.
!!!
4.1.1 Modules
A block of hardware with inputs and outputs is called a module . The two general styles for describing
module functionality are behavioral and structural . Behavioral models describe what a module does,
and structural models describe how a module is built from simpler pieces.
4.1.3 Simulation and Synthesis
The two major purposes of HDLs are logic simulation and synthesis. During simulation, inputs are
applied to a module, and the outputs are checked to verify that the module operates correctly. During
synthesis, the textual description of a module is transformed into logic gates.
4.2 Combinational Logic
4.2.1 Bitwise Operators
Bitwise operators act on single-bit signals or on multi-bit busses. A multi-bit bus is a wire with a width
$N>1$. a[3:0] represents a 4-bit wide bus. Here the bits are listed from most significant to least
significant, which is called little-endian order (i.e. a[3] is the msb, a[0] is the lsb). We can also
declare a multi-bit bus with a[0:3]; in this case the bits are listed in big-endian order, i.e. a[0]
is the msb and a[3] is the lsb.
Logic gates can either be declared with modules or with simple bitwise operators:

assign y0 = ~a;       // NOT
assign y1 = a & b;    // AND
assign y2 = a | b;    // OR
assign y3 = a ^ b;    // XOR
assign y4 = ~(a & b); // NAND
assign y5 = ~(a | b); // NOR

4.2.4 Conditional Assignment
Conditional assignments select the output from among alternatives based on an input called the
condition. The conditional operator ?:, also called the ternary operator, chooses, based on a first
expression, between a second (if the first expression is $TRUE$) and a third (if the first expression is
$FALSE$) expression. One can easily describe a 2:1 MUX with a conditional assignment:

assign y = s ? d1 : d0; // 2:1 MUX with s as the select signal
A 4:1 MUX can be described with two nested conditional assignments:

assign y = s1 ? (s0 ? d3 : d2) : (s0 ? d1 : d0); // 4:1 MUX with s0 and s1 as select signals

4.2.5 Internal Variables
Signals that are neither inputs nor outputs are called internal variables. In SystemVerilog, internal
signals are usually declared with logic. This differs from Verilog, where internal variables are usually
declared as wire or reg, but this will be explained later on.
4.2.6 Precedence
If we do not use parentheses, the default operation order (precedence) is defined by the language. The
precedence in SystemVerilog is defined as follows:
4.2.7 Numbers
In SystemVerilog, numbers can be specified in binary, octal, decimal, or hexadecimal. The format for
declaring constants is N'Bvalue, where N is the size in bits, B is the letter indicating the base, and
value gives the value. SystemVerilog supports 'b for binary, 'o for octal, 'd for decimal, and 'h for
hexadecimal. If the base is omitted, it defaults to decimal. In SystemVerilog we can further use the
idioms '1 and '0 for filling a bus with $1$s and $0$s, respectively.
4.2.9 Bit Swizzling
Often it is necessary to operate on a subset of a bus or to concatenate (join together) signals from
busses. These operations are collectively known as bit swizzling. This could look like this:

assign y = {c[2:1], {3{d[0]}}, c[0], 3'b101};

The operator { } is used to concatenate busses. {3{d[0]}} is the syntax for concatenating three
copies of d[0].
4.4 Sequential Modeling
4.4.1 Registers
In SystemVerilog always statements, signals keep their old value until an event in the sensitivity list
takes place that explicitly causes them to change. In general, a SystemVerilog always statement is
written in the form:

always @(sensitivity_list)
  statement;

The statement is executed only when the event(s) in the sensitivity list occur. SystemVerilog
furthermore introduces always_ff, always_latch, and always_comb to reduce the risk of common
errors. In Verilog, only the standard always @(...) block is valid syntax.
4.4.2 Resettable Registers
Generally, it is good practice to use resettable registers so that on powerup you can put the system in a
known state. The reset can occur either synchronously or asynchronously. The respective syntax looks as
follows:

// asynchronous reset
always_ff @(posedge clk, posedge reset)
  if (reset) q <= 4'b0;
  else       q <= d;

// synchronous reset
always_ff @(posedge clk)
  if (reset) q <= 4'b0;
  else       q <= d;

Remark: always_ff behaves the same way as always but is used exclusively to imply flip-flops.
4.4.3 Enabled Registers
Enabled registers respond to the clock only when the enable is asserted.

// asynchronously resettable enabled register
always_ff @(posedge clk, posedge reset)
  if (reset)       q <= 4'b0;
  else if (enable) q <= d;

4.4.5 Latches
Recall that a D latch is transparent when the clock is $HIGH$, allowing data to flow from input to
output. The latch becomes opaque when the clock is $LOW$, retaining its old state. An example for a D
latch in SystemVerilog is:

always_latch
  if (clk) q <= d;

Remark: always_latch is equivalent to always @(clk, d).
4.5 More Combinational Logic
SystemVerilog always statements are used to describe sequential circuits, because they remember the
old state when no new state is prescribed. However, always statements can also be used to describe
combinational logic behaviorally if the sensitivity list is written to respond to changes in all of the
inputs and the body prescribes the output value for every possible input combination. The following
code snippet is an example of a combinational circuit using an always statement:

// inverter using always_comb
always_comb
  y = ~a;

Remark: always_comb is equivalent to always @(*). It re-evaluates the statements in the always
block every time a signal on the right-hand side of any assignment changes.
HDLs support blocking (=) and non-blocking (<=) assignments in an always statement. A group of
blocking assignments is evaluated in the order in which they appear in the code. A group of non-
blocking assignments is evaluated concurrently: all of the statements are evaluated before any of the
signals on the left-hand sides are updated.
4.5.1 Case Statements
A case statement performs different actions depending on the value of its inputs. A case statement
implies combinational logic if all possible input combinations are defined; otherwise it implies
sequential logic, because the output will keep its old value in the undefined cases. A case statement
is of the following form:

// case statement
always_comb
  case (data)
    c1: variable = value1;
    c2: variable = value2;
    c3: variable = value3;
    default: variable = defaultValue;
  endcase

The default clause is a convenient way to describe the output for all cases not explicitly listed,
guaranteeing combinational logic.
With case statements it's also possible to use don't care variables. Since casez matches its arms in
order, the most specific patterns must come first. A possible implementation could look like this:

always_comb
  casez (data)
    3'b111: variable = value3;
    3'b11?: variable = value2;
    3'b1??: variable = value1;
  endcase

Remark: The casez statement acts like a case statement except that it also recognizes ? as don't
care.
4.5.4 Blocking and Nonblocking Assignments
We follow the upcoming guidelines to decide when to use blocking (=) and when non-blocking (<=)
assignments:
1. Use always_ff @(posedge clk) and non-blocking assignments to model synchronous sequential
logic.
2. Use continuous (blocking) assignments to model simple combinational logic.
3. Use always_comb and blocking assignments to model more complicated combinational logic
where the always statement is helpful.
4. Do not make assignments to the same signal in more than one always statement or continuous
assignment statement.
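The difference between the non-blocking assignments of guideline 1 and ordinary blocking assignments can be mimicked in Python (a conceptual sketch, not SystemVerilog): non-blocking evaluates every right-hand side before updating any left-hand side.

```python
def nonblocking_swap(a, b):
    """Mimic non-blocking (<=): all right-hand sides are evaluated
    before any left-hand side is updated, so the swap succeeds."""
    a_next, b_next = b, a
    return a_next, b_next

def blocking_swap(a, b):
    """Mimic blocking (=): a updates immediately, so the second
    statement reads the NEW value of a and the swap fails."""
    a = b
    b = a
    return a, b
```

This is why synchronous sequential logic is modeled with non-blocking assignments: every register samples its input before any register updates.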
4.7 Data Types
4.7.1 SystemVerilog
Prior to SystemVerilog, Verilog primarily used two types: reg and wire. Despite its name, a reg
signal might or might not be associated with a register. In Verilog, if a signal appears on the left-
hand side of <= or = in an always block, it must be declared as reg. Otherwise, it should be
declared as wire. Input and output ports default to the wire type unless their type is explicitly
defined as reg.
SystemVerilog introduces the logic type. logic is a synonym for reg, but avoids misleading users
about whether or not the signal is actually a flip-flop. Moreover, SystemVerilog also relaxes the rules
on assign statements and allows the use of logic outside of always blocks.
In SystemVerilog, signals are considered unsigned by default. Adding the signed modifier (e.g. logic
signed [3:0] a) causes the signal to be treated as signed.
4.8 Parameterized Modules
HDLs permit variable bit widths using parameterized modules . One such use could be a parameterized
2:1 MUX. Its implementation looks as follows:
4.9 Testbenches
A testbench is an HDL module that is used to test another module, called the device under test (DUT) or
unit under test (UUT). The testbench contains statements to apply inputs to the DUT and, ideally, to
check that the correct outputs are produced. The input and desired output patterns are called
testvectors.
// parameterized 2:1 mux
module muxTwo2One
  #(parameter width = 8)
  (input [width-1:0] d0, d1, input s, output [width-1:0] y);
  assign y = s ? d1 : d0;
endmodule

// instantiation with 12-bit wide busses
muxTwo2One #(12) mux1(d0, d1, s, y);
Testbenches are simulated like other HDL modules; however, they are not synthesizable. The
initial statement executes the statements in the body at the start of simulation. The assert
statement checks if a specified condition is true. If not, the else statement is executed. Testbenches
use the === and !== operators for comparisons of equality and inequality, because these operators
work correctly with operands like x and z .
$readmemb reads a file of binary numbers into the testvectors array. Similarly, $readmemh reads a
file of hexadecimal numbers. $display is a system task to print in the simulator window. For example,
$display("%b %b", y, yexpected) prints the two values y and yexpected , in binary. %d is used to print
values in decimal, and %h to print values in hexadecimal. $finish terminates a simulation.
Chapter 5 - Digital Building Blocks
5.1 Introduction
This chapter introduces more elaborate combinational and sequential building blocks used in digital
systems. These blocks include arithmetic circuits, counters, shifters, memory arrays, and logic arrays.
Each building block has a well-defined interface and can be treated as a black box when the
underlying implementation is unimportant.
5.2 Arithmetic Circuits
Arithmetic circuits are the central building blocks of computers. They perform operations such as
addition, subtraction, comparison, shifts, multiplication, and division.
5.2.1 Addition
Following are the most important building blocks for performing addition :
1-bit Half Adder
It has two inputs $A$ and $B$ and two outputs $S$ and $C_{out}$. It adds the two 1-bit inputs and
produces a sum $S$ and a carry-out $C_{out}$ that carries the overflow. The Boolean equations are:
$S = A \ xor \ B$
$C_{out} = A \ and \ B$
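These two equations can be checked exhaustively with a small behavioral model; the following Python sketch (the function name is mine, not from the book) mirrors the Boolean equations:

```python
def half_adder(a, b):
    """1-bit half adder: S = A xor B, Cout = A and B."""
    s = a ^ b        # sum bit
    cout = a & b     # carry-out bit
    return s, cout

# Exhaustive check: the pair (Cout, S) equals the 2-bit sum A + B
for a in (0, 1):
    for b in (0, 1):
        s, cout = half_adder(a, b)
        assert (cout << 1) | s == a + b
```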
1-bit Full Adder
It works exactly the same as the half adder does, but has an additional input $C_{in}$. This allows
connecting multiple 1-bit adders to form an $N$-bit adder. The Boolean equations are:
$S = A \ xor \ B \ xor \ C_{in}$
$C_{out} = (A \ and \ B) \ or \ (A \ and \ C_{in}) \ or \ (B \ and \ C_{in})$
Ripple-Carry Adder
The simplest way to build an $N$-bit carry-propagate adder (CPA) is to chain together $N$ full adders.
The $C_{out}$ of one stage acts as the $C_{in}$ of the next stage. The delay of the adder, $t_{ripple}$,
grows directly with the number of bits, where $t_{FA}$ is the delay of the full adder:
$t_{ripple} = N \cdot t_{FA}$
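The chaining of full adders can be modeled directly; below is a Python sketch (the names are mine) in which the carry-out of each stage feeds the carry-in of the next, just as in the ripple-carry structure:

```python
def full_adder(a, b, cin):
    """1-bit full adder: S = A xor B xor Cin, Cout = majority(A, B, Cin)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a, b, n, cin=0):
    """Chain n full adders; the Cout of stage i acts as the Cin of stage i+1."""
    s, carry = 0, cin
    for i in range(n):
        bit, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        s |= bit << i
    return s, carry  # n-bit sum and the final carry-out

assert ripple_carry_add(9, 5, 4) == (14, 0)
assert ripple_carry_add(12, 7, 4) == (3, 1)  # 19 wraps to 3 with carry-out 1
```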
Carry-Lookahead Adder
If we divide the $N$ bits to add into blocks of $k$-bit adders, we can shorten the delay of the carry-
propagate adder. An $N$-bit adder divided into $k$-bit blocks has a delay:
$t_{CLA} = t_{pg} + t_{pg-block} + \Big (\frac{N}{k} - 1 \Big) \cdot t_{AND-OR} + k \cdot t_{FA}$
5.2.2 Subtraction
Subtraction is just addition of a negated number. Therefore, subtraction is very easy to
implement. To compute $Y = A - B$, first create the two's complement of $B$: Invert the bits of $B$ to
obtain $\bar B$ and add $1$ to get $-B = \bar B + 1$. Add this quantity to $A$ to get $Y = A + \bar B + 1 =
A - B$. This operation can be performed with a single CPA and $C_{in} = 1$.
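As a sanity check, this procedure can be expressed in a few lines of Python (a sketch with names of my own, not the book's code):

```python
def cpa_subtract(a, b, n):
    """Compute Y = A - B as A + ~B + 1 on an n-bit carry-propagate adder."""
    mask = (1 << n) - 1
    b_inv = ~b & mask              # invert the bits of B
    return (a + b_inv + 1) & mask  # single addition with Cin = 1

assert cpa_subtract(9, 5, 8) == 4
assert cpa_subtract(3, 5, 8) == 0b11111110  # -2 in 8-bit two's complement
```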
5.2.3 Comparators
A comparator determines whether two binary numbers are equal or whether one is greater or less than
the other. There are two common types of comparators: An equality comparator produces a single
output indicating whether $A$ is equal to $B$ ($A == B$). A magnitude comparator produces one
or more outputs indicating the relative values of $A$ and $B$. Magnitude comparison of signed
numbers is usually done by computing $A - B$ and looking at the sign of the result. If the result is
negative (i.e. the sign bit is $1$), then $A$ is less than $B$; otherwise $A$ is greater than or equal to $B$.
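The sign-bit rule can be modeled as follows (a Python sketch with my own names; note that, like the simple hardware described above, it ignores overflow of the subtraction):

```python
def signed_less_than(a, b, n):
    """Magnitude comparison: compute A - B and inspect the sign bit."""
    mask = (1 << n) - 1
    diff = (a - b) & mask              # n-bit two's-complement difference
    return (diff >> (n - 1)) & 1 == 1  # sign bit 1 means A < B

assert signed_less_than(3, 7, 8) is True
assert signed_less_than(7, 3, 8) is False
```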
5.2.4 ALU
An Arithmetic/Logical Unit (ALU) combines a variety of mathematical and logic operations into a single
unit. For example, a typical ALU might perform addition, subtraction, AND, and OR operations. The ALU
forms the heart of most computer systems. The ALU takes multiple $N$-bit inputs on which it performs
the operations. It has an additional control input that defines which operation the ALU should
perform. Some ALUs also output, next to the result, additional flags that indicate edge cases (e.g. a
negative result, overflow, etc.). One implementation of an ALU could look like this:
5.2.5 Shifters and Rotators
Shifters and rotators move bits and multiply or divide by powers of 2. There are several kinds of
commonly used shifters:
Logical shifter: shifts the number to the left (LSL) or to the right (LSR) and fills empty spots with
$0$s.
Example: $11001 \ LSR \ 2 = 00110$
Arithmetic shifter: is the same as a logical shifter, but on a right shift fills the most significant
bits with copies of the old most significant bit.
Example: $11001 \ ASR \ 2 = 11110$
Rotator: rotates the number in a circle such that empty spots are filled with bits shifted off the
other end.
Example: $11001 \ ROR \ 2 = 01110$
A left shift is a special case of multiplication : a left shift by $N$ bits multiplies the number by $2^N$.
An arithmetic right shift is a special case of division: an arithmetic right shift by $N$ bits divides the
number by $2^N$.
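The three shifter types and the worked examples above can be modeled on 5-bit values (a Python sketch; the function names are mine):

```python
def lsl(x, k, n):
    return (x << k) & ((1 << n) - 1)   # left shift, drop bits beyond n

def lsr(x, k, n):
    return x >> k                      # fill with 0s from the left

def asr(x, k, n):
    msb = (x >> (n - 1)) & 1           # copy of the old most significant bit
    fill = (((1 << k) - 1) << (n - k)) if msb else 0
    return (x >> k) | fill

def ror(x, k, n):
    return ((x >> k) | (x << (n - k))) & ((1 << n) - 1)

x = 0b11001  # the 5-bit example value from the text
assert lsr(x, 2, 5) == 0b00110
assert asr(x, 2, 5) == 0b11110
assert ror(x, 2, 5) == 0b01110
assert lsl(0b00011, 2, 5) == 0b01100  # left shift by 2 multiplies by 4
```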
5.2.6 Multiplication
Multiplying binary numbers is very similar to multiplying decimal numbers. In both cases, partial
products are formed by multiplying a single digit of the multiplier with the entire multiplicand. The
shifted partial products are summed to form the result. In general, an $N \times N$ multiplier
multiplies two $N$-bit numbers and produces a $2N$-bit result. An implementation of a $4
\times 4$ multiplier could look as follows:
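The hardware figure is not reproduced here, but the shift-and-add scheme it implements can be sketched behaviorally in Python (names are mine):

```python
def shift_add_multiply(a, b, n):
    """N x N multiplier: sum the shifted partial products, 2N-bit result."""
    result = 0
    for i in range(n):
        if (b >> i) & 1:          # partial product: a times bit i of b ...
            result += a << i      # ... shifted left by the bit position
    return result & ((1 << (2 * n)) - 1)

assert shift_add_multiply(13, 11, 4) == 143  # 4x4 multiply, 8-bit result
```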
5.2.7 Division
Binary division can be performed by using the following algorithm for $N$-bit numbers in the range $[0,
2^{N-1}]$:
The result satisfies $\frac{A}{B} = Q + \frac{R}{B}$, i.e. the divider computes $\frac{A}{B}$ and produces
a quotient $Q$ and a remainder $R$. Specifically, each row calculates $D = R - B$. The signal $N$
indicates whether or not $D$ is negative. So a row's multiplexer select lines receive the most
significant bit of $D$, which is $1$ when the difference is negative. The quotient bit $Q_i$ is $0$ when
$D$ is negative and $1$ otherwise.
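The row-by-row behavior described above corresponds to restoring division; the following Python sketch follows the algorithm (the variable names match the text, the function name is mine):

```python
def restoring_divide(a, b, n):
    """Bit-serial division: quotient Q and remainder R with A/B = Q + R/B."""
    r, q = 0, 0
    for i in range(n - 1, -1, -1):
        r = (r << 1) | ((a >> i) & 1)  # R = {R' << 1, A_i}
        d = r - b                      # each row calculates D = R - B
        if d < 0:                      # D negative: Q_i = 0, keep old R
            q_bit = 0
        else:                          # otherwise: Q_i = 1, take difference
            q_bit = 1
            r = d
        q |= q_bit << i
    return q, r

assert restoring_divide(17, 5, 8) == (3, 2)  # 17 = 3*5 + 2
```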
5.3 Number Systems
Computers operate on both integers and fractions. This section introduces fixed- and floating-point
number systems that can represent rational numbers.
5.3.1 Fixed-Point Number Systems
The fixed-point notation has an implied binary point between integer and fraction bits, analogous to
the decimal point between the integer and fraction digits of an ordinary decimal number. The integer
bits are called the high word and the fraction bits are called the low word . The following shows the
representation of $-2.375$ in the fixed-point number system:
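The encoding with four integer and four fraction bits can also be computed in Python (a sketch; the helper name is mine):

```python
def to_fixed(x, int_bits, frac_bits):
    """Two's-complement fixed point with an implied binary point."""
    n = int_bits + frac_bits
    raw = round(x * (1 << frac_bits))  # scale by 2^frac_bits
    return raw & ((1 << n) - 1)        # wrap into n-bit two's complement

# -2.375 with 4 integer and 4 fraction bits (implied point: 1101.1010)
assert format(to_fixed(2.375, 4, 4), '08b') == '00100110'
assert format(to_fixed(-2.375, 4, 4), '08b') == '11011010'
```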
5.4 Sequential Building Blocks
5.4.1 Counters
An $N$-bit binary counter is a sequential arithmetic circuit with clock and reset inputs and an $N$-
bit output $Q$. Reset initializes the output to $0$. The counter then advances through all $2^N$ possible
outputs in binary order, incrementing on the rising edge of the clock.
5.4.2 Shift Registers
# Division algorithm (from Section 5.2.7)
R' = 0
for (i = N-1 to 0)
    R = {R' << 1, A_i}
    D = R - B
    if (D < 0)
        Q_i = 0
        R' = R
    else
        Q_i = 1
        R' = D
R = R'
0010.0110 // absolute value
1010.0110 // sign and magnitude
1101.1010 // two's complement
A shift register has a clock, a serial input $S_{in}$, a serial output $S_{out}$, and $N$ parallel outputs
$Q_{N-1:0}$. On each rising edge of the clock, a new bit is shifted in from $S_{in}$ and all the
subsequent contents are shifted forward. The last bit in the shift register is available at $S_{out}$.
Shift registers can be viewed as serial-to-parallel converters . After $N$ cycles, the past $N$ inputs are
available in parallel at $Q$.
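The serial-to-parallel behavior can be modeled with a small Python sketch (names are mine):

```python
from collections import deque

def shift_register(n, serial_bits):
    """After n clock edges, the past n serial inputs appear in parallel at Q."""
    q = deque([0] * n, maxlen=n)  # N parallel outputs, initially 0
    for s_in in serial_bits:
        q.appendleft(s_in)        # new bit shifts in; oldest falls out at S_out
    return list(q)                # parallel outputs Q[N-1:0]

assert shift_register(4, [1, 0, 1, 1]) == [1, 1, 0, 1]
```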
5.5 Memory Arrays
Digital systems also require memories to store the data used and generated by such circuits. This
section describes memory arrays that can efficiently store large amounts of data.
5.5.1 Overview
Memory is organized as a two-dimensional array of memory cells . The memory reads or writes the
contents of one of the rows of the array. This row is specified by an address . The value read or written is
called data . An array with $N$-bit addresses and $M$-bit data has $2^N$ rows and $M$ columns. Each
row of data is called a word . The depth of an array is the number of rows, the width is the number of
columns.
Bit Cells
Memory arrays are built as an array of bit cells , each of which stores $1$ bit of data. Each bit cell is
connected to a wordline and a bitline . When the wordline is $HIGH$, the stored bit transfers to or from
the bitline.
Memory Types
Memories are classified based on how they store bits in the bit cell. The broadest classification is
random access memory (RAM) versus read only memory (ROM) . RAM is volatile , meaning that it loses its
data when the power is turned off. ROM is nonvolatile , meaning that it retains its data indefinitely,
even without power. The two major types of RAMs are dynamic RAM (DRAM) and static RAM (SRAM) .
Dynamic RAM stores data as charge on a capacitor, whereas static RAM stores data using a pair of
cross-coupled inverters.
5.5.2 Dynamic Random Access Memory (DRAM)
Dynamic RAM (DRAM) stores a bit as the presence or absence of charge on a capacitor. The nMOS
transistor behaves as a switch that either connects or disconnects the capacitor from the bitline. When
the wordline is asserted, the nMOS transistor turns $ON$, and the stored bit value transfers to or from
the bitline.
Reading destroys the bit value stored on the capacitor, so the data word must be restored ( rewritten )
after each read. Even when DRAM is not read, the contents must be refreshed ( read and rewritten )
every few milliseconds, because the charge on the capacitor gradually leaks away.
5.5.3 Static Random Access Memory (SRAM)
Static RAM (SRAM) is static because stored bits do not need to be refreshed. The data bit is stored on
cross-coupled inverters. When the wordline is asserted, both nMOS transistors turn $ON$, and data
values are transferred to or from the bitlines. Unlike DRAM, if noise degrades the value of the stored bit,
the cross-coupled inverters restore the value themselves.
5.6 Logic Arrays
Like memory, gates can be organized into regular arrays. If the connections are made programmable,
these logic arrays can be configured to perform any function without the user having to connect wires
in specific ways. This section introduces two types of logic arrays: programmable logic arrays (PLAs) ,
and field programmable gate arrays (FPGAs) .
5.6.2 Programmable Logic Array
Programmable logic arrays (PLAs) implement two-level combinational logic in sum-of-products (SOP)
form. PLAs are built from an AND array followed by an OR array.
5.6.2 Field Programmable Gate Array
A field programmable gate array (FPGA) is an array of reconfigurable gates. Using software
programming tools, a user can implement designs on the FPGA using either an HDL or a schematic. FPGAs
are built as an array of configurable logic elements (LEs). Each LE can be configured to perform
combinational or sequential functions. The LEs are surrounded by input/output elements (IOEs) for
interfacing with the outside world.
Chapter 6 - Architecture
6.1 Introduction
In this chapter, we jump up a few levels of abstraction to define the architecture of a computer. The
architecture is the programmer's view of a computer. It is defined by the instruction set ( language )
and operand locations ( registers and memory ). Since computer hardware only understands $1$'s and
$0$'s, instructions are encoded as binary numbers in a format called machine language . The specific
arrangement of registers, memories, ALUs, and other building blocks is called the microarchitecture .
Throughout the following sub-chapters, we motivate the design of the MIPS architecture using four
principles articulated by Patterson and Hennessy :
1. Simplicity favors regularity
2. Make the common case fast
3. Smaller is faster
4. Good design demands good compromises
6.2 Assembly Language
6.2.1 Instructions
The most common operation computers perform is addition. The following example shows how addition
and subtraction is performed in MIPS:
The first part of the assembly instruction, add , is called the mnemonic and indicates what operation
to perform. The operation is performed on b and c , the source operands , and the result is written to
a , the destination operand .
In assembly language, only single-lined comments are used. They begin with # and continue until the
end of the line.
The MIPS instruction set makes the common case fast by including only simple, commonly used
instructions. MIPS is a reduced instruction set computer (RISC) architecture. Architectures with many
complex instructions, such as Intel's IA-32 architecture, are complex instruction set computer (CISC)
architectures.
6.2.2 Operands: Registers, Memory, and Constants
An instruction operates on operands . Operands can be stored in registers or memory, or they may be
constants stored in the instruction itself.
Registers
Instructions need to access operands quickly so that they can run fast. Therefore, most architectures
specify a small number of registers that hold commonly used operands. The MIPS architecture uses 32
registers, called the register set or register file .
add a, b, c # a = b + c
sub a, b, c # a = b - c
MIPS register names are preceded by the $ sign. MIPS generally stores variables in 18 of the 32
registers: $\$ s0 - \$ s7$, and $\$ t0 - \$ t9$. Register names beginning with $\$ s$ are called saved
registers. Register names beginning with $\$ t$ are called temporary registers and are used for storing
temporary variables. Addition and subtraction using registers look as follows:
The Register Set
The MIPS architecture defines 32 registers. Each register has a name and a number ranging from 0 to
31. The register file is built up as follows:
Memory
MIPS uses byte-addressable memory . That is, each byte in memory has a unique address. However, for
explanation purposes only, we first introduce word-addressable memory, and afterwards introduce the
MIPS byte-addressable memory. In word-addressable memory , each 32-bit data word has a unique 32-
bit address. Both the 32-bit data word and the 32-bit address are written in hexadecimal.
MIPS uses the load word instruction, $lw$, to read a data word from memory into a register. The $lw$
instruction specifies the effective address in memory as the sum of a base address and an offset .
Similarly, MIPS uses the store word instruction, $sw$, to write a data word from a register into memory.
The MIPS memory model, however, is byte-addressable, not word-addressable . The word-address is four
times the word number. The offset can be written either in decimal or hexadecimal. The MIPS
architecture also provides the $lb$ and $sb$ instructions that load and store single bytes in memory
rather than words. The following code example describes how to access byte-addressable memory:
# calculating a = b + c - d
sub $t0, $s2, $s3 # t = c - d
add $s0, $s1, $t0 # a = b + t
# The following code assumes (unlike MIPS) word-addressable memory
lw $s3, 1($0) # read memory word 1 into $s3
sw $s7, 5($0) # write $s7 to memory word 5
In the MIPS architecture, word addresses for $lw$ and $sw$ must be word aligned . That is, the address
must be divisible by 4. Thus, the instruction $lw \ \$ s0, \ 7(\$ 0)$ is an illegal instruction.
Constants / Immediates
Load word and store word, $lw$ and $sw$, also illustrate the use of constants in MIPS instructions.
These constants are called immediates , because their values do not require a register or memory
access. Add immediate , $addi$, adds the immediate specified in the instruction to a value in a register.
Subtraction is equivalent to adding a negative number, so, in the interest of simplicity, there is no
$subi$ instruction in the MIPS architecture. Following are examples of how we are using immediate
operands:
6.3 Machine Language
Digital circuits understand only $1$'s and $0$'s. Therefore, a program written in assembly language is
translated from mnemonics to a representation using only $1$'s and $0$'s, called machine language .
MIPS makes the compromise of defining three instruction formats: R-type, I-type, and J-type. R-type
instructions operate on three registers. I-type instructions operate on two registers and a 16-bit
immediate. J-type (jump) instructions operate on one 26-bit immediate.
6.3.1 R-Type Instructions
The name R-type is short for register-type . R-type instructions use three registers as operands: two as
sources, and one as a destination. The 32-bit instruction has six fields: op, rs, rt, rd, shamt and funct .
The operation to perform is encoded in the two fields op , also called the opcode , and funct , the
function. All R-type operations have an opcode of $0$. The operation is solely determined by the
funct code. The operands are encoded in the three fields rs, rt, and rd . The first two registers, rs and
rt , are the source registers; rd is the destination register. The fifth field, shamt , is used only in shift
operations, where it indicates the amount to shift.
6.3.2 I-Type Instructions
The name I-type is short for immediate-type . I-type instructions use two register operands and one
immediate operand. The 32-bit instruction has four fields: op, rs, rt, and imm . The first three fields are
like those of R-type instructions. The imm field holds the 16-bit immediate. The operation is solely
determined by the opcode .
lw $s0, 8($0) # read data word 2 into $s0
lw $s1, 0xC($0) # read data word 3 into $s1
sw $s3, 4($0) # write $s3 to data word 1
sw $s4, 0x20($0) # write $s4 to data word 9
addi $s0, $s0, 4 # a = a + 4
addi $s1, $s0, -12 # b = a - 12
6.3.3 J-Type Instructions
The name J-type is short for jump-type . This format is used only with jump instructions . Like other
formats, J-type instructions begin with a 6-bit opcode . The remaining 26-bits are used to specify an
address, addr .
6.4 Programming
6.4.1 Arithmetic / Logical Instructions
Logical Instructions
MIPS logical operations include and, or , xor and nor . These R-type instructions operate bit-by-bit on
two source registers and write the result to the destination register. The and instruction is useful for
masking bits, i.e. forcing unwanted bits to $0$. The or instruction is useful for combining bits from
two registers. MIPS does not provide a not instruction, but $A \ NOR \ \$ 0 = NOT \ A$, so the nor
instruction can substitute. Logical operations can also operate on immediates. These I-type instructions
are $andi, \ ori,$ and $xori$.
Shift Instructions
MIPS shift operations are $sll$ (shift left logical), $srl$ (shift right logical), and $sra$ (shift right
arithmetic). MIPS also has variable-shift operations (shift by a register): $sllv, srlv,$ and $srav$.
Multiplication and Division Instructions
The MIPS architecture has two special-purpose registers , $hi$ and $lo$, which are used to hold the
results of multiplication and division . $mult \ \$ s0, \ \$ s1$ multiplies the values in $\$ s0$ and $\$
s1$. The 32 most significant bits are placed in $hi$ and the 32 least significant bits are placed in $lo$.
Similarly, $div \ \$ s0, \ \$ s1$ computes $\frac{\$ s0}{\$ s1}$. The quotient is placed in $lo$ and the
remainder in $hi$.
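The hi/lo split can be illustrated with a Python model of the 32-bit case (function names are mine, not MIPS syntax):

```python
def mips_mult(a, b):
    """32x32 multiply: upper 32 bits of the product go to hi, lower 32 to lo."""
    p = (a * b) & 0xFFFFFFFFFFFFFFFF   # 64-bit product
    return p >> 32, p & 0xFFFFFFFF     # (hi, lo)

def mips_div(a, b):
    """Division: the remainder goes to hi, the quotient to lo."""
    q = int(a / b)                     # truncating division (small operands)
    return a - q * b, q                # (hi = remainder, lo = quotient)

assert mips_mult(0x10000, 0x10000) == (1, 0)  # 2^16 * 2^16 = 2^32
assert mips_div(17, 5) == (2, 3)              # 17/5: remainder 2, quotient 3
```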
6.4.2 Branching
Conditional Branches
The MIPS instruction set has two conditional branch instructions: branch if equal ($beq$) and branch
if not equal ($bne$). $beq$ branches when the values in two registers are equal, and $bne$ branches
when they are not equal. Note that branches are written as $beq \ \$ rs, \ \$ rt, \ imm$, where $\$ rs$ is
the first source register. If the branch condition is met, one says that the branch is taken and the
next instruction executed is the one at the label , called the target . This could look as follows:
Jump
A program can unconditionally branch , or jump, using the three types of jump instructions : jump
($j$), jump and link ($jal$), and jump register ($jr$). Jump ($j$) jumps directly to the instruction at
the specified label. Jump and link ($jal$) is similar to $j$ but is used by procedures to save a return
address, as will be discussed later on. Jump register ($jr$) jumps to the address held in a register.
Example:
6.4.3 Conditional Statements
If Statements
An if statement executes a block of code, the if block , only when a condition is met. The following
example shows how to translate an if statement into MIPS assembly code:
If/Else Statements
If/else statements execute one of two blocks of code depending on a condition. When the condition of
the if block is met, the if block is executed. Otherwise, the else block is executed. Following an
example if/else statement:
addi $s0, $0, 4 # $s0 = 0 + 4
addi $s1, $0, 2 # $s1 = 0 + 2
bne $s0, $s1, target # $s1 != $s0, so branch is taken
addi $s0, $s0, 1 # not executed
target:
add $s1, $s1, $s0 # $s1 = 2 + 4
addi $s0, $0, 4 # $s0 = 4
addi $s1, $0, 2 # $s1 = 2
j target # jump to target
addi $s1, $s1, 2 # not executed
target:
add $s1, $s1, $s0 # $s1 = 6
bne $s3, $s4, L1 # if i != j, skip if-block
add $s0, $s1, $s2 # if block: f = g + h
L1:
sub $s0, $s0, $s3 # f = f - i
6.4.4 Getting Loopy
While Loops
While loops repeatedly execute a block of code until a condition is no longer met. The while loop in
the following example code determines the value of $x$ such that $2^x = 128$:
For Loops
For loops , like while loops, repeatedly execute a block of code until a condition is no longer met.
However, for loops add support for a loop variable , which typically keeps track of the number of loop
executions. The following example code adds the numbers from 0 to 9 in a for loop:
6.4.5 Arrays
Arrays are useful for accessing large amounts of similar data. An array is organized as sequential data
addresses in memory. Each array element is identified by a number called its index .
Array Indexing
The base address gives the address of the first array element, array[0] . The first step in accessing an
array element is to load the base address of the array into a register. Recall that the load upper
immediate ( lui ) and or immediate ( ori ) instructions can be used to load a 32-bit constant into a
register. The offset can be used to access subsequent elements of the array. For example, array[1] is
bne $s3, $s4, else # if i != j, branch to else
add $s0, $s1, $s2 # if-block: f = g + h
j end # skip past the else block
else:
sub $s0, $s0, $s3 # else-block: f = f - i
end:
addi $s0, $0, 1 # pow = 1
addi $s1, $0, 0 # x = 0
addi $t0, $0, 128 # t0 = 128 for comparison
while:
beq $s0, $t0, done # if pow = 128, exit while loop
sll $s0, $s0, 1 # pow = pow * 2
addi $s1, $s1, 1 # x = x + 1
j while
done:
add $s1, $0, $0 # sum = 0
addi $s0, $0, 0 # i = 0
addi $t0, $0, 10 # n = 10
for:
beq $s0, $t0, done # if i == 10, branch to done
add $s1, $s1, $s0 # sum = sum + i
addi $s0, $s0, 1 # increment i by one
j for
done:
stored one word or 4 bytes after array[0] , so it is accessed at an offset of 4 past the base address.
Following is an example of how to access arrays in MIPS:
6.4.6 Procedure Calls
Procedures have inputs, called arguments , and an output, called the return value . When one procedure
calls another, the calling procedure, the caller , and the called procedure, the callee , must agree on
where to put the arguments and the return value. In MIPS, the caller conventionally places up to four
arguments in registers $\$ a0 - \$ a3$ before making the procedure call, and the callee places the
return value in registers $\$ v0 - \$ v1$ before finishing.
Procedure Calls and Returns
MIPS uses the jump and link ($jal$) instruction to call a procedure and the jump register ($jr$)
instruction to return from a procedure. jal performs two functions: it stores the address of the next
instruction (the instruction after jal ) in the return address register $\$ ra$, and it jumps to the target
instruction.
Input Arguments and Return Values
By MIPS convention, procedures use $\$ a0 - \$ a3$ for input arguments and $\$ v0 - \$ v1$ for return
values . A procedure that returns a 64-bit value, such as a double-precision floating point number, uses
both return registers.
The Stack
The Stack is memory that is used to save local variables within a procedure. The stack is a last-in-
first-out (LIFO) queue. The last item pushed onto the stack is the first one that can be pulled
( popped ) off. The MIPS stack grows down in memory . The stack expands to lower memory addresses
when a program needs more scratch space. The stack pointer $\$ sp$ is a special register that points
to the top of the stack.
Recall that a procedure should calculate a return value but have no other unintended side effects. To
obey this rule, a procedure saves registers on the stack before it modifies them, then restores them
from the stack before it returns. Specifically, it performs the following steps:
1. Makes space on the stack to store the values of one or more registers.
2. Stores the values of the registers on the stack.
3. Executes the procedure using the registers.
4. Restores the original values of the registers from the stack.
5. Deallocates space on the stack.
Preserved Registers
lui $s0, 0x1000 # $s0 = 0x10000000
ori $s0, $s0, 0x7000 # $s0 = 0x10007000
lw $t1, 0($s0) # $t1 = array[0]
sll $t1, $t1, 3 # $t1 = $t1 << 3 = $t1 * 8
sw $t1, 4($s0) # array[1] = $t1
MIPS divides registers into preserved and nonpreserved categories. The preserved registers include $\$
s0 - \$s7$ (hence their name saved ). The nonpreserved registers include $\$ t0 - \$ t9$ (hence their
name temporary ). A procedure must save and restore any of the preserved registers that it wishes to
use, but it can change the nonpreserved registers freely. Following is an example of a procedure saving
preserved registers on the stack:
6.6 Compiling, Assembling, and Loading
6.6.1 The Memory Map
With 32-bit addresses, the MIPS address space spans $2^{32}$ bytes = 4 gigabytes (GB). Word addresses
are divisible by 4 and range from 0 to 0xFFFFFFFC . The MIPS architecture divides the address space into
four parts or segments. The following sections describe each segment:
The Text Segment
The text segment stores the machine language program. It is large enough to accommodate almost
256 MB of code.
The Global Data Segment
The global data segment stores global variables that, in contrast to local variables, can be seen by all
procedures in a program. Global variables are defined at start-up , before the program begins
executing. The global data segment is large enough to store 64 KB of global variables.
The Dynamic Data Segment
The dynamic data segment holds the stack and the heap. This is the largest segment of memory used
by a program, spanning almost 2 GB of the address space.
The Reserved Segments
The reserved segments are used by the operating system and cannot directly be used by the
programmer. Part of the reserved memory is used for interrupts and for memory-mapped I/O.
6.6.2 Translating and Starting a Program
Following are the steps required to translate a program from a high-level language into machine
language and to start executing a program:
diffofsums:
addi $sp, $sp, -4 # make space on stack to store one register
sw $s0, 0($sp) # save $s0 on stack
add $t0, $a0, $a1 # $t0 = f + g
add $t1, $a2, $a3 # $t1 = h + i
sub $s0, $t0, $t1 # result = (f + g) - (h + i)
add $v0, $s0, $0 # put return value in $v0
lw $s0, 0($sp) # restore $s0 from stack
addi $sp, $sp, 4 # deallocate stack space
jr $ra # return to caller
Step 1 - Compilation: A compiler translates high-level code into assembly language.
Step 2 - Assembling: The assembler turns the assembly language code into an object file containing
machine language code. The assembler makes two passes through the assembly code. On the first pass,
the assembler assigns instruction addresses and finds all the symbols , such as labels and global
variable names. On the second pass through the code, the assembler produces the machine language
code.
Step 3 - Linking: Most large programs contain more than one file. The job of the linker is to combine all
of the object files into one machine language file called the executable .
Step 4 - Loading: The operating system loads the program by reading the text segment of the
executable file from a storage device into the text segment memory.
6.7 Odds and Ends
6.7.1 Pseudoinstructions
MIPS defines pseudoinstructions that are not actually part of the instruction set but are commonly
used by programmers and compilers. When converted to machine code, pseudoinstructions are
translated into one or more MIPS instructions. Following are some useful pseudoinstructions:
6.7.3 Signed and Unsigned Instructions
Addition and Subtraction
MIPS provides signed and unsigned versions of addition and subtraction. The signed versions are add,
addi, and sub . The unsigned versions are addu, addiu, and subu . The two versions are identical
except that the signed versions trigger an exception on overflow , whereas the unsigned versions do not.
Multiplication and Division
Multiplication and division come in both signed and unsigned flavors. mult and div treat the
operands as signed numbers. multu and divu treat the operands as unsigned numbers.
Set Less Than
Set less than instructions can compare either two registers ( slt ) or a register and an immediate ( slti ).
Set less than also comes in signed and unsigned ( sltu and sltiu ) versions.
Chapter 7 - Microarchitecture
7.1 Introduction
This chapter covers microarchitecture , which is the connection between logic and architecture.
Microarchitecture is the specific arrangement of registers, ALUs, finite state machines, memories, and
other logic building blocks needed to implement an architecture.
7.1.1 Architectural State and Instruction Set
Recall that a computer architecture is defined by its instruction set and architectural state . The
architectural state for the MIPS processor consists of the program counter and the 32 registers . To
keep the microarchitecture easy to understand, we handle the following instructions:
R-type arithmetic/logic instructions: $add, \ sub, \ and, \ or, \ slt$
Memory instructions: $lw, \ sw$
Branches: $beq$
7.1.2 Design Process
We will divide our microarchitecture into two interacting parts: the datapath and the control . The
datapath operates on words of data. It contains structures such as memories, registers, ALUs, and
multiplexers.
A good way to design a complex system is to start with hardware containing the state elements. These
elements include the memories and the architectural state (the program counter and registers).
Program Counter: The program counter is an ordinary 32-bit register. Its output, $PC$, points to the
current instruction. Its input, $PC'$, indicates the address of the next instruction.
Instruction Memory: The instruction memory has a single read port. It takes a 32-bit instruction
address input, $A$, and reads the 32-bit data (i.e. the instruction) from that address onto the read
data output, $RD$.
Register File: The 32-element $\times$ 32-bit register file has two read ports and one write port. The
read ports take 5-bit address inputs, $A1$ and $A2$, each specifying one of $2^5 = 32$ registers as
source operands. The write port takes a 5-bit address input, $A3$, a 32-bit write data input, $WD$, a
write enable input, $WE3$, and a clock. If write enable is 1, the register file writes the data into the
specified register on the rising edge of the clock.
Data Memory: The data memory has a single read/write port. If the write enable, $WE$, is 1, it
writes data $WD$ into address $A$ on the rising edge of the clock. If write enable is 0, it reads
address $A$ into $RD$.
The instruction memory, register file, and data memory are all read combinationally . In other words, if
the address changes, the new data appears at $RD$ after some propagation delay; no clock is involved.
Since the state elements change their state only on the rising edge of the clock, they are synchronous
sequential circuits.
7.1.3 MIPS Microarchitectures
In this chapter, we distinguish between three types of microarchitectures for the MIPS processor architecture:
The single-cycle microarchitecture executes an entire instruction in one cycle. The cycle time is
therefore limited by the slowest instruction.
The multicycle microarchitecture executes instructions in a series of shorter cycles. Simpler
instructions execute in fewer cycles than complicated ones. The multicycle processor executes only
one instruction at a time, but each instruction takes multiple clock cycles.
The pipelined microarchitecture applies pipelining to the single-cycle microarchitecture. It
therefore can execute several instructions simultaneously, improving the throughput significantly.
7.2 Performance Analysis
The only gimmick-free way to measure performance is by measuring the execution time of a program.
The execution time of a program, measured in seconds, is given by the following equation:
$\text{Execution Time} = (\#\text{instructions}) \left( \frac{\text{cycles}}{\text{instruction}} \right) \left( \frac{\text{seconds}}{\text{cycle}} \right)$
The number of cycles per instruction , often called CPI , is the number of clock cycles required to
execute an average instruction. The number of seconds per cycle is the clock period $T_C$. The clock
period is determined by the critical path through the logic on the processor.
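As a quick numeric illustration of the equation, the instruction count, CPI, and clock frequency below are made-up values:

```python
# Hypothetical program and processor parameters (for illustration only)
instructions = 100_000_000   # number of dynamic instructions executed
cpi = 1.25                   # average clock cycles per instruction
f_clk = 1e9                  # 1 GHz clock, so T_C = 1 ns

t_c = 1 / f_clk                            # seconds per cycle
execution_time = instructions * cpi * t_c  # seconds

assert abs(execution_time - 0.125) < 1e-9  # 0.125 s
```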
7.3 Single-Cycle Processor
We first design a MIPS microarchitecture that executes instructions in a single cycle. We begin by
constructing the datapath by connecting the state elements from Figure 7.1 with combinational logic
that can execute the various instructions.
7.3.1 Single-Cycle Datapath
As already mentioned, the program counter $PC$ register contains the address of the instruction to
execute. The first step is to read this instruction from the instruction memory. The instruction memory
reads out, or fetches , the 32-bit instruction, labeled $Instr$.
Load-Word Instructions
For a $lw$ instruction, the next step is to read the source register containing the base address. This
register is specified in the $rs$ field of the instruction, $Instr_{25:21}$. The $lw$ instruction also
requires an offset. The offset is stored in the immediate field of the instruction, $Instr_{15:0}$. Because
the 16-bit immediate might be either positive or negative, it must be sign-extended to a 32-bit value,
called $SignImm$.
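Sign extension simply copies the sign bit (bit 15) of the immediate into the upper 16 bits of the 32-bit result, preserving the two's-complement value. A small Python sketch of this step:

```python
def sign_extend_16_to_32(imm16):
    # Copy bit 15 into bits 31:16 (two's-complement sign extension).
    if imm16 & 0x8000:
        return imm16 | 0xFFFF0000
    return imm16

assert sign_extend_16_to_32(0x0004) == 0x00000004  # positive offset stays the same
assert sign_extend_16_to_32(0xFFFC) == 0xFFFFFFFC  # -4 stays -4 in 32 bits
```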
The processor must add the base address to the offset to find the address to read from memory. The
ALU receives two operands, $SrcA$ and $SrcB$. $SrcA$ comes from the register file, and $SrcB$ comes
from the sign-extended immediate. The 3-bit $ALUControl$ signal specifies the operation the ALU
should execute. The ALU generates a 32-bit $ALUResult$ and a $Zero$ flag. For the $lw$ instruction, the
ALU must add the base address and the offset, and the $ALUResult$ is sent to the data memory as the
address for the load instruction. The data is read from the data memory onto the $ReadData$ bus, then
written back to the destination register, which is specified in the $rt$ field, $Instr_{20:16}$, in the
register file at the end of the cycle.
A control signal called $RegWrite$ is connected to the port 3 write enable input, $WE3$, and is asserted
during a $lw$ instruction so that the data value is written into the register file. While the instruction is
being executed, the processor must compute the address of the next instruction, $PC'$. Because
instructions are 32 bits = 4 bytes, the next instruction is at $PC + 4 = PC'$.
Store-Word Instructions
Like the $lw$ instruction, the $sw$ instruction reads a base address from port 1 of the register file and
sign-extends an immediate. The $sw$ instruction also reads a second register from the register file and
writes it to the data memory. The register is specified in the $rt$ field, $Instr_{20:16}$. These bits of the
instruction are connected to the second register file read port, $A2$. The write enable port of the data
memory, $WE$, is controlled by $MemWrite$. For a $sw$ instruction, $MemWrite = 1$, to write data to
memory.
R-Type Instructions
We will now consider extending the datapath to handle the R-type instructions $add, \ sub, \ and, \ or,$
and $slt$. All of these instructions read two registers from the register file, perform some ALU operation
on them, and write the result back to a third register. Up to here, the ALU always received its $SrcB$
operand from the sign-extended immediate $SignImm$. Now, we add a multiplexer to choose $SrcB$
from either the register file $RD2$ port or $SignImm$. The multiplexer is controlled by a new signal,
$ALUSrc$. $ALUSrc$ is $0$ for R-type instructions to choose $SrcB$ from the register file and it is $1$ for
$lw$ and $sw$ instructions to choose $SignImm$. Also, until now, the register file always got its write
data from the data memory. However, R-type instructions write the $ALUResult$ to the register file.
Therefore, we add another multiplexer to choose between $ReadData$ and $ALUResult$. We call this
output $Result$. This multiplexer is controlled by another new signal, $MemtoReg$.
Similarly, the register to write was specified by the $rt$ field of the instruction, $Instr_{20:16}$.
However, for R-type instructions, the register is specified in the $rd$ field, $Instr_{15:11}$. Thus we add
a third multiplexer to choose $WriteReg$ from the appropriate field of the instruction. The multiplexer is
controlled by $RegDst$. $RegDst$ is $1$ for R-type instructions to choose $WriteReg$ from the $rd$ field.
Branch-If-Equal Instructions
The $beq$ instruction compares two registers. If they are equal, it takes the branch by adding the
branch offset to the program counter. Recall that the offset is a positive or negative number, stored in
the $imm$ field of the instruction, $Instr_{15:0}$. Hence the immediate must be sign-extended and
multiplied by 4 to get the new program counter value $PC' = PC + 4 + SignImm \times 4$. The next $PC$
value for a taken branch, $PCBranch$, is computed by shifting $SignImm$ left by $2$ bits, then adding
it to $PCPlus4$. The two registers are compared by computing $SrcA - SrcB$ using the ALU. If
$ALUResult$ is $0$, as indicated by the $Zero$ flag from the ALU, the registers are equal. We add a
multiplexer to choose $PC'$ from either $PCPlus4$ or $PCBranch$.
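The branch-target arithmetic above can be checked with a short sketch; the PC values and offsets below are hypothetical:

```python
def branch_target(pc, imm16):
    # PCBranch = PC + 4 + SignImm * 4; shifting left by 2 multiplies by 4.
    sign_imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16  # sign extension
    return (pc + 4 + (sign_imm << 2)) & 0xFFFFFFFF

assert branch_target(0x00400000, 3) == 0x00400010       # forward by 3 instructions
assert branch_target(0x00400010, 0xFFFB) == 0x00400000  # backward by -5 instructions
```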
7.3.2 Single-Cycle Control
The control unit computes the control signals based on the $opcode$ and $funct$ fields of the
instruction, $Instr_{31:26}$ and $Instr_{5:0}$. Most of the control information comes from the
$opcode$, but R-type instructions also use the $funct$ field to determine the ALU operation. Thus, we
will simplify our design by factoring the control unit into two blocks of combinational logic. The main
decoder computes most of the outputs from the $opcode$. It also determines a 2-bit $ALUOp$ signal.
The following truth table for the main decoder summarizes the control signals as a function of the
$opcode$:
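As a sketch, the main decoder can be modeled as a lookup from opcode to control word. The values below follow the standard single-cycle MIPS decoder for these four instruction classes (field order: RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemtoReg, ALUOp); don't-care fields are written as 0 and the whole table should be treated as illustrative:

```python
MAIN_DECODER = {
    # opcode: (RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemtoReg, ALUOp)
    0b000000: (1, 1, 0, 0, 0, 0, 0b10),  # R-type
    0b100011: (1, 0, 1, 0, 0, 1, 0b00),  # lw
    0b101011: (0, 0, 1, 0, 1, 0, 0b00),  # sw  (RegDst, MemtoReg are don't-cares)
    0b000100: (0, 0, 0, 1, 0, 0, 0b01),  # beq (RegDst, MemtoReg are don't-cares)
}

regwrite, regdst, alusrc, branch, memwrite, memtoreg, aluop = MAIN_DECODER[0b100011]
assert regwrite == 1 and alusrc == 1 and memtoreg == 1  # lw writes loaded data back
```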
7.3.3 More Instructions
Excluded from this summary.
7.3.4 Performance Analysis
Each instruction in the single-cycle processor takes one clock cycle, so the $CPI = 1$. The critical path in
our processor is the $lw$ instruction, hence the cycle time is:
In most implementation technologies, the ALU, memory, and register file accesses are substantially
slower than other operations. Therefore, the cycle time simplifies to:
7.4 Multicycle Processor
The single-cycle processor has three primary weaknesses:
It requires a clock cycle long enough to support the slowest instruction, even though most
instructions are faster.
It requires three adders (one in the ALU and two for the PC logic), which are relatively expensive
circuits, especially if they must be fast.
It has separate instruction and data memories, which may not be realistic.
The multi-cycle processor addresses these weaknesses by breaking an instruction into multiple shorter
steps. In each short step, the processor can read or write from the memory or register file or use the
ALU.
7.4.1 Multicycle Datapath
In the single-cycle design, we used separate instruction and data memories because we needed to read
the instruction memory and read or write the data memory all in one cycle. Now, we choose to use a
combined memory for both instructions and data. This is more realistic, and it is feasible because we
can read the instruction in one cycle, then read or write the data in a separate cycle.
The $PC$ contains the address of the instruction to execute. The instruction is read and stored in a new
nonarchitectural Instruction Register so that it is available for further cycles. The instruction register
receives an enable signal, called $IRWrite$, that is asserted when it should be updated with a new
instruction.
Load-Word Instructions
For a $lw$ instruction, the next step is to read the source register containing the base address,
specified in the $rs$ field of the instruction, $Instr_{25:21}$. The register file reads the value of the
register onto $RD1$, which is then stored in another nonarchitectural register $A$.
We again have to sign-extend the offset and add it to the base address to get the address of the load.
We use an ALU to compute this sum. $ALUControl$ should be set to $010$ to perform an addition.
$ALUResult$ is stored in another nonarchitectural register called $ALUOut$.
The next step is to load the data from the calculated address in the memory. We add a multiplexer in
front of the memory to choose the memory address, $Adr$, from either $PC$ or $ALUOut$. The
multiplexer select signal is called $IorD$, to indicate either an instruction or data address. The data
read from the memory is stored in another nonarchitectural register, called $Data$.
Finally, the data is written back to the register file. The destination register is specified in the $rt$ field
of the instruction, $Instr_{20:16}$. At the same time, the processor must also update the program
counter $PC$. We can use the existing ALU on one of the steps when it is not busy. To do so, we must
insert source multiplexers to choose the $PC$ and the constant $4$ as ALU inputs. A two-input
multiplexer controlled by $ALUSrcA$ chooses either the $PC$ or register $A$ as $SrcA$. A four-input
multiplexer controlled by $ALUSrcB$ chooses either $4$ or $SignImm$ as $SrcB$.
Store-Word Instructions
The only new feature for $sw$ is that we must read a second register from the register file and write it
into the memory. The register is specified in the $rt$ field of the instruction, $Instr_{20:16}$. When the
register is read, it is stored in a nonarchitectural register $B$. On the next step, it is sent to the write
data port $WD$ of the data memory to be written. The memory receives an additional $MemWrite$
control signal to indicate that the write should occur.
R-Type Instructions
For R-type instructions, the instruction is again fetched, and the two source registers are read from the
register file. The ALU performs the appropriate operation and stores the result in $ALUOut$. On the next
step, $ALUOut$ is written back to the register specified by the $rd$ field of the instruction,
$Instr_{15:11}$. This requires two new multiplexers. The $MemtoReg$ multiplexer selects whether $WD3$
comes from $ALUOut$ (for R-type instructions) or from $Data$ (for $lw$ instructions). The $RegDst$
multiplexer selects whether the destination register is specified in the $rt$ or $rd$ field of the
instruction.
Branch-If-Equal Instructions
For the $beq$ instruction, the instruction is again fetched, and the two source registers are read from
the register file. To determine whether or not the two registers are equal, the ALU subtracts the
registers and examines the $Zero$ flag. Meanwhile, the datapath must compute the next value of the
$PC$ if the branch is taken. In the single-cycle processor, yet another adder was needed to compute the
branch address. In the multi-cycle processor, the ALU can be reused again to save hardware. On one
step, the ALU computes $PC + 4$ and writes it back to the program counter, as we have done for other
instructions. On another step, the ALU uses this updated $PC$ value to compute $PC + SignImm \times
4$. A new multiplexer, controlled by $PCSrc$, chooses what signal should be sent to $PC'$. The program
counter should be written either when $PCWrite$ is asserted or when a branch is taken. A new control
signal, $Branch$, indicates that the $beq$ instruction is being executed.
7.4.2 Multicycle Control
As in the single-cycle processor, the control unit is partitioned into a main controller and an ALU
decoder. Now, however, the main controller is an FSM that applies the proper control signals on the
proper cycles or steps.
The main controller produces multiplexer select and register enable signals for the datapath. The
select signals are $MemtoReg, \ RegDst, \ IorD, \ PCSrc, \ ALUSrcB$, and $ALUSrcA$. The enable signals
are $IRWrite, \ MemWrite, \ PCWrite, \ Branch$, and $RegWrite$.
The first step for any instruction is to fetch the instruction from memory at the address held in the
$PC$. The FSM enters this state on reset. To read memory, $IorD = 0$. $IRWrite$ is asserted to write the
instruction into the instruction register. $ALUSrcA = 0$, so $SrcA$ comes from the $PC$. $ALUSrcB = 01$,
so $SrcB$ is constant 4. $ALUOp = 00$, so the ALU decoder produces $ALUControl = 010$ to make the
ALU add. To update the $PC$ with this new value, $PCSrc = 0$, and $PCWrite$ is asserted.
The next step is to read the register file and decode the instruction. No control signals are necessary to
decode the instruction, but the FSM must wait $1$ cycle for the reading and decoding to complete.
If the instruction is a memory load or store ($lw$ or $sw$), the multicycle processor computes the
address by adding the base address to the sign-extended immediate. This requires $ALUSrcA = 1$ to
select register $A$ and $ALUSrcB = 10$ to select $SignImm$. $ALUOp = 00$, so the ALU adds. If the
instruction is $lw$, the multicycle processor must next read data from memory and write it to the
register file.
To read from memory, $IorD = 1$ to select the memory address that was just computed and saved in
$ALUOut$. On the next step, S4 , $Data$ is written to the register file. $MemtoReg = 1$ to select $Data$,
and $RegDst = 0$ to pull the destination register from the $rt$ field of the instruction. $RegWrite$ is
asserted to perform the write, completing the $lw$ instruction.
From state S2, if the instruction is $sw$, the data read from the second port of the register file is
simply written to memory. $IorD = 1$ to select the address computed in S2 and saved in $ALUOut$.
$MemWrite$ is asserted to write the memory.
If the $opcode$ indicates an R-type instruction, the multicycle processor must calculate the result
using the ALU and write that result back to the register file. In S6 , the instruction is executed by
selecting the $A$ and $B$ registers ($ALUSrcA = 1, \ ALUSrcB = 00$) and performing the ALU operation
indicated by the $funct$ field of the instruction. $ALUOp = 10$ for all R-type instructions. The
$ALUResult$ is stored in $ALUOut$.
In S7, $ALUOut$ is written to the register file. $RegDst = 1$, because the destination register is specified
in the $rd$ field of the instruction. $MemtoReg = 0$ because the write data, $WD3$, comes from
$ALUOut$. $RegWrite$ is asserted to write the register file.
For a $beq$ instruction, the processor must calculate the destination address and compare the two
source registers to determine whether the branch should be taken. $ALUSrcA = 0$ to select the
incremented $PC$, $ALUSrcB = 11$ to select $SignImm \times 4$, and $ALUOp = 00$ to add. The
destination address is stored in $ALUOut$.
In S8 , the processor compares the two registers by subtracting them and checking to determine
whether the result is $0$. $ALUSrcA = 1$ to select register $A$, $ALUSrcB = 00$ to select register $B$,
$ALUOp = 01$ to subtract, $PCSrc = 1$ to take the destination address from $ALUOut$, and $Branch = 1$
to update the $PC$ with this address if the ALU result is $0$.
7.4.3 More Instructions
Remark: The exact explanations for the determination of the signals are left out in this summary and can
be found in the book.
7.4.4 Performance Analysis
The execution time of an instruction depends on both the number of cycles it uses and the cycle time.
The multicycle processor requires three cycles for $beq$ and $j$ instructions, four cycles for $sw, addi,$
and R-type instructions, and five cycles for $lw$ instructions. The CPI depends on the relative likelihood
that each instruction is used. Examining the datapath reveals two possible critical paths that would
limit the cycle time:
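The average CPI of the multicycle processor follows from a weighted sum over the instruction mix. The per-instruction cycle counts come from the text above; the fractions below are made up for illustration:

```python
# (cycles, fraction of dynamic instructions) -- fractions are hypothetical
mix = [
    (5, 0.25),  # lw
    (4, 0.10),  # sw
    (3, 0.13),  # beq and j
    (4, 0.52),  # R-type and addi
]

cpi = sum(cycles * frac for cycles, frac in mix)

assert abs(sum(f for _, f in mix) - 1.0) < 1e-9  # fractions must cover all instructions
assert abs(cpi - 4.12) < 1e-9                    # weighted average CPI
```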
7.5 Pipelined Processor
Pipelining is a powerful way to improve the throughput of a digital system. We design a pipelined
processor by sub-dividing the single-cycle processor into five pipeline stages. Thus, five instructions
can execute simultaneously, one in each stage. Specifically, we call the five stages Fetch, Decode,
Execute, Memory and Writeback :
$Fetch :$ The processor reads the instruction from instruction memory.
$Decode :$ The processor reads the source operands from the register file and decodes the
instruction to produce control signals.
$Execute :$ The processor performs a computation with the ALU.
$Memory :$ The processor reads or writes data memory.
$Writeback :$ The processor writes the result to the register file, when applicable.
In the following figure, an abstracted view of the pipeline in operation is shown in which each stage is
represented pictorially. Each pipeline stage is represented with its major component – Instruction
memory ( IM ), register file ( RF ) read, ALU execution, data memory ( DM ), and register file writeback –
to illustrate the flow of instructions through the pipeline.
A central challenge in pipelined systems is handling hazards that occur when the results of one
instruction are needed by a subsequent instruction before the former instruction has completed.
7.5.1 Pipelined Datapath
The pipelined datapath is formed by chopping the single-cycle datapath into five stages separated by
pipeline registers. Signals are given a suffix ( F, D, E, M or W ) to indicate the stage in which they reside.
The register file is peculiar because it is read in the Decode stage and written in the Writeback stage.
It is drawn in the Decode stage, but the write address and data come from the Writeback stage. One of
the subtle but critical issues in pipelining is that all signals associated with a particular instruction
must advance through the pipeline in unison.
7.5.2 Pipelined Control
The pipelined processor takes the same control signals as the single-cycle processor and therefore uses
the same control unit. These control signals must be pipelined along with the data so that they
remain synchronized with the instruction.
7.5.3 Hazards
In a pipelined system, multiple instructions are handled concurrently. When one instruction is
dependent on the result of another that has not yet completed, a hazard occurs.
The register file can be read and written in the same cycle. Let us assume that the write takes place
during the first half of the cycle and the read takes place during the second half of the cycle, so that
the register can be written and read back in the same cycle without introducing a hazard. The following
figure illustrates hazards that occur when one instruction writes a register ($\$s0$) and subsequent
instructions read this register. This is called a read after write (RAW) hazard .
In principle, we should be able to forward the result from one instruction to the next to resolve the
RAW hazard without slowing down the pipeline.
Hazards are classified as data hazards and control hazards. A data hazard occurs when an instruction
tries to read a register that has not yet been written back by a previous instruction. A control hazard
occurs when the decision of what instruction to fetch next has not been made by the time the fetch
takes place.
Solving Data Hazards with Forwarding
Some data hazards can be solved by forwarding (also called bypassing ) a result from the Memory or
Writeback stage to a dependent instruction in the Execute stage. This requires adding multiplexers in
front of the ALU to select the operand from either the register file or the Memory or Writeback stage.
Forwarding is necessary when an instruction in the Execute stage has a source register matching the
destination register of an instruction in the Memory or Writeback stage. We therefore add a hazard
detection unit and two forwarding multiplexers. The hazard detection unit receives the two source
registers from the instruction in the Execute stage and the destination registers from the instructions in
the Memory and Writeback stages. It also receives the $RegWrite$ signals from the Memory and
Writeback stages to know whether the destination register will actually be written.
In summary, the function of the forwarding logic for $SrcA$ ($ForwardAE$) is given below. The forwarding
logic for $SrcB$ ($ForwardBE$) is identical except that it checks $rt$ rather than $rs$:
if ((rsE != 0) AND (rsE == WriteRegM) AND (RegWriteM)) then
    ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND (RegWriteW)) then
    ForwardAE = 01
else
    ForwardAE = 00
Solving Data Hazards with Stalls
Unfortunately, the $lw$ instruction does not finish reading data until the end of the Memory stage, so
its result cannot be forwarded to the Execute stage of the next instruction. We say that $lw$ has a two-
cycle latency , because dependent instructions cannot use its result until two cycles later. The
alternative solution to forwarding is to stall the pipeline , holding up operation until the data is
available.
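As an illustration of the forwarding priority (a result in the Memory stage is more recent than one in the Writeback stage, so it wins), the $ForwardAE$ decision can be sketched in Python; signal names mirror the pseudocode and are not the book's HDL:

```python
def forward_ae(rs_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    # Forward from the Memory stage first (most recent result), then Writeback.
    # Register 0 is never forwarded because $0 is hardwired to zero in MIPS.
    if rs_e != 0 and rs_e == write_reg_m and reg_write_m:
        return 0b10  # take SrcA from the Memory-stage ALU result
    if rs_e != 0 and rs_e == write_reg_w and reg_write_w:
        return 0b01  # take SrcA from the Writeback-stage result
    return 0b00      # no hazard: take SrcA from the register file

assert forward_ae(8, 8, True, 8, True) == 0b10  # Memory stage takes priority
assert forward_ae(8, 9, True, 8, True) == 0b01  # only Writeback matches
assert forward_ae(0, 0, True, 0, True) == 0b00  # $0 is never forwarded
```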
In summary, stalling a stage is performed by disabling the pipeline register, so that the contents do not
change. When a stage is stalled, all previous stages must also be stalled, so that no subsequent
instructions are lost. The pipeline register directly after the stalled stage must be cleared to prevent
bogus information from propagating forward. Stalls degrade performance, so they should only be used
when necessary.
Stalls are supported by adding enable inputs ($EN$) to the Fetch and Decode pipeline registers and a
synchronous reset/clear ($CLR$) input to the Execute pipeline register.
The logic to compute the stalls and flushes is:
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
Solving Control Hazards
The $beq$ instruction presents a control hazard: the pipelined processor does not know what instruction
to fetch next, because the branch decision has not been made by the time the next instruction is
fetched. One mechanism for dealing with the control hazard is to stall the pipeline until the branch
decision is made (i.e. $PCSrc$ is computed). Because the decision is made in the Memory stage, the
pipeline would have to be stalled for three cycles at every branch. This would severely degrade the
system performance.
An alternative is to predict whether a branch will be taken and begin executing instructions based on
the prediction. Once the branch decision is available, the processor can throw out the instructions if
the prediction was wrong. These wasted instruction cycles are called the branch misprediction penalty.
We could reduce the branch misprediction penalty if the branch decision could be made earlier. Making
the decision simply requires comparing the values of two registers. Using a dedicated equality
comparator is much faster than performing a subtraction and zero detection. If the comparator is fast
enough, it could be moved back into the Decode stage, so that the operands are read from the register
file and compared to determine the next $PC$ by the end of the Decode stage.
Hazard Summary
In summary, RAW data hazards occur when an instruction depends on the result of another instruction
that has not yet been written into the register file. The data hazards can be resolved by forwarding if
the result is computed soon enough, otherwise they require stalling the pipeline until the result is
available. Control hazards occur when the decision of what instruction to fetch has not been made by
the time the next instruction must be fetched. Control hazards are solved by predicting which
instruction should be fetched and flushing the pipeline if the prediction is later determined to be
wrong.
7.6 HDL Representation
Remark: Chapter 7.6 is excluded from this summary. The exact implementation in Verilog HDL can be read
in the book.
7.7 Exceptions
Remark: Chapter 7.7 is excluded from the summary since it is not required by the course.
7.8 Advanced Microarchitecture
Recall that the time required to run a program is proportional to the period of the clock and to the
number of clock cycles per instruction (CPI). Thus, to increase performance we would like to speed up
the clock and/or reduce the CPI.
7.8.1 Deep Pipelines
The easiest way to speed up the clock is to chop the pipeline into more stages. Each stage contains
less logic, so it can run faster. This chapter considered a classic five-stage pipeline, but 10 to 20 stages
are now commonly used. The maximum number of pipeline stages is limited by pipeline hazards,
sequencing overhead, and cost. The pipeline registers between each stage have a sequencing overhead
from their setup time and clk-to-Q delay. This sequencing overhead makes adding more pipeline stages
give diminishing returns.
7.8.2 Branch Prediction
An ideal pipelined processor would have a CPI of $1$. The branch misprediction penalty is a major
reason for increased CPI. To address this problem, most pipelined processors use a branch predictor to
guess whether the branch should be taken.
The simplest form of branch prediction checks the direction of the branch and predicts that backward
branches should be taken. This is called static branch prediction , because it does not depend on the
history of the program. Forward branches are difficult to predict without knowing more about the
specific program. Therefore, most processors use dynamic branch predictors , which use the history of
the program execution to guess whether a branch should be taken. A one-bit dynamic branch predictor
remembers whether the branch was taken the last time and predicts that it will do the same thing the
next time. In summary, a $1$-bit branch predictor mispredicts the first and last branches of a loop.
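The loop-misprediction behavior of a 1-bit predictor can be demonstrated with a short simulation. This is a sketch of a single predictor bit; a real predictor indexes a table of such bits by the branch's PC:

```python
def mispredictions_1bit(outcomes, initial=False):
    """Count mispredictions of a single 1-bit predictor over a branch history."""
    state = initial  # last observed outcome; also the prediction for the next branch
    wrong = 0
    for taken in outcomes:
        if state != taken:
            wrong += 1
        state = taken  # remember what actually happened
    return wrong

# A loop branch taken 9 times, then not taken on exit, executed twice in a row:
history = ([True] * 9 + [False]) * 2
# Mispredicts the first branch (stale "not taken" state) and the last branch
# (loop exit) of each loop execution: 2 mispredictions per run, 4 in total.
assert mispredictions_1bit(history, initial=False) == 4
```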
7.8.3 Superscalar Processor
A superscalar processor contains multiple copies of the datapath hardware to execute multiple
instructions simultaneously. For example, a two-way superscalar processor has a datapath that
fetches two instructions at a time from the instruction memory. Unfortunately, executing many
instructions simultaneously is difficult because of dependencies.
7.8.4 Out-of-Order Processor
To cope with the problem of dependencies, an out-of-order processor looks ahead across many
instructions to issue , or begin executing, independent instructions as rapidly as possible. The
instructions can be issued in a different order than written by the programmer, as long as dependencies
are honored so that the program produces the intended result.
The dependence of $add$ on $lw$ by way of $\$t0$ is a read after write (RAW) hazard . This is the type
of dependency we are accustomed to handling in the pipelined processor. It inherently limits the speed
at which the program can run, even if infinitely many execution units are available.
The dependence between $sub$ and $add$ by way of $\$t0$ is called a write after read (WAR) hazard or
an antidependence. WAR hazards could not occur in the simple MIPS pipeline, but they may happen in
an out-of-order processor if the dependent instruction (in this case, $sub$) is moved too early.
A third type of hazard, not shown in the program, is called a write after write (WAW) hazard or an
output dependence. A WAW hazard occurs if an instruction attempts to write a register after a
subsequent instruction has already written it. The hazard would result in the wrong value being written
to the register.
The instruction level parallelism (ILP) is the number of instructions that can be executed
simultaneously for a particular program and microarchitecture. Theoretical studies have shown that
the ILP can be quite large for out-of-order microarchitectures with perfect branch predictors and
enormous numbers of execution units.
7.8.5 Register Renaming
Out-of-order processors use a technique called register renaming to eliminate WAR hazards. Register
renaming adds some nonarchitectural renaming registers to the processor. The programmer cannot use
these registers directly, because they are not part of the architecture. However, the processor is free to
use them to eliminate hazards.
7.8.6 Single Instruction Multiple Data
The term SIMD stands for single instruction multiple data , in which a single instruction acts on
multiple pieces of data in parallel. A common application of SIMD is to perform many short arithmetic
operations at once, especially for graphics processing. This is also called packed arithmetic. For
example, a 32-bit microprocessor might pack four 8-bit data elements into one 32-bit word. Packed
$add$ and $sub$ instructions operate on all four data elements within the word in parallel.
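Packed arithmetic can be illustrated by packing four 8-bit lanes into one 32-bit word. Real SIMD hardware adds all lanes in a single instruction; the Python sketch below emulates that lane by lane, with no carry propagating across lane boundaries:

```python
def pack4(a, b, c, d):
    # Pack four 8-bit values into one 32-bit word (a in the most significant byte).
    return (a << 24) | (b << 16) | (c << 8) | d

def packed_add8(x, y):
    # Add corresponding 8-bit lanes; each lane wraps around independently,
    # i.e. no carry crosses a lane boundary.
    result = 0
    for shift in (0, 8, 16, 24):
        lane = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)
        result |= (lane & 0xFF) << shift
    return result

x = pack4(1, 2, 3, 250)
y = pack4(4, 5, 6, 10)
assert packed_add8(x, y) == pack4(5, 7, 9, 4)  # last lane wraps: 250 + 10 = 260 mod 256
```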
7.8.7 Multithreading
Multithreading is a technique that helps keep a processor with many execution units busy even if the
ILP of a program is low or the program is stalled waiting for memory. A program running on a computer
is called a process . Computers can run multiple processes simultaneously. Each process consists of one
or more threads that also run simultaneously.
In a conventional processor, the threads only give the illusion of running simultaneously. The threads
actually take turns being executed on the processor under control of the OS. This procedure is called
context switching . As long as the processor switches through all the threads fast enough, the user
perceives all of the threads as running at the same time. A multithreaded processor contains more than
one copy of its architectural state, so that more than one thread can be active at a time.
Multithreading does not improve the performance of an individual thread, because it does not increase
the ILP. However, it does improve the overall throughput of the processor, because multiple threads can
use processor resources that would have been idle when executing a single thread.
7.8.8 Multiprocessors
A multiprocessor system consists of multiple processors and a method for communication between the
processors. A common form of multi-processing in computer systems is symmetric multiprocessing
(SMP) , in which two or more identical processors share a single main memory. The multiple processors
may be separate chips or multiple cores on the same chip.
Multiprocessors can be used to run more threads simultaneously or to run a particular thread faster.
Running more threads simultaneously is easy; the threads are simply divided up among the processors.
7.9 Real-World Perspective: IA-32 Microarchitecture
Remark: Chapter 7.9 is excluded in this summary. It can be read in the book.
Chapter 8 - Memory Systems
8.1 Introduction
Computer system performance depends on the memory system as well as the processor
microarchitecture. Chapter 7 assumed an ideal memory system that could be accessed in a single
clock cycle. However, this would only be true for a very small memory or a very slow processor.
The processor communicates with the memory system over a memory interface . The processor sends
an address over the $Address$ bus to the memory system. For a read, $MemWrite$ is $0$ and the
memory returns the data on the $ReadData$ bus. For a write, $MemWrite$ is $1$ and the processor sends
data to memory on the $WriteData$ bus.
The principles used to make memory fast are temporal locality and spatial locality. Temporal locality
means that if you have used a piece of data recently, you are likely to use it again soon. Spatial locality
means that if you have used a piece of data, you are likely to use nearby data soon. Furthermore, memory
systems use a hierarchy of storage to quickly access the most commonly used data while still having
the capacity to store large amounts of data.
Computer memory is generally built from DRAM chips. But since the speed of processors grows much
faster than the speed of DRAM, which results in loss of overall performance, computers store the most
commonly used instructions and data in a faster but smaller memory, called cache. The cache is
usually built out of SRAM on the same chip as the processor. The cache speed is comparable to the
processor speed, because SRAM is inherently faster than DRAM, and because the on-chip memory
eliminates lengthy delays caused by traveling to and from a separate chip.
If the processor requests data that is available in the cache, it is returned quickly. This is called a
cache hit. Otherwise, the processor retrieves the data from the main memory (DRAM). This is called a
cache miss. The third level in the memory hierarchy is the hard drive. In the same way that a library
uses the basement to store books that do not fit in the stacks, computer systems use the hard drive to
store data that does not fit in main memory.
The hard drive provides an illusion of more capacity than actually exists in the main memory. It is thus
called virtual memory. The main memory can therefore be seen as a cache for the most commonly used
data from the hard drive.
8.2 Memory System Performance Analysis
Memory system performance metrics are miss rate or hit rate and average memory access time. Miss
and hit rates are calculated as:

$Miss \ Rate = \frac{\text{Number of misses}}{\text{Number of total memory accesses}} = 1 - Hit \ Rate$

$Hit \ Rate = \frac{\text{Number of hits}}{\text{Number of total memory accesses}} = 1 - Miss \ Rate$

Average memory access time (AMAT) is the average time a processor must wait for memory per load or
store instruction. AMAT is calculated as:

$AMAT = t_{cache} + MR_{cache}(t_{MM} + MR_{MM} t_{VM})$
where $t_{cache}, t_{MM}$, and $t_{VM}$ are the access times of the cache, main memory, and virtual
memory, and $MR_{cache}$ and $MR_{MM}$ are the cache and main memory miss rates, respectively.
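As a small sketch, the AMAT formula can be evaluated in Python. The cycle counts and miss rates below are made-up illustrative values, not figures from the book:

```python
def amat(t_cache, t_mm, t_vm, mr_cache, mr_mm):
    """Average memory access time: AMAT = t_cache + MR_cache * (t_MM + MR_MM * t_VM).

    Times are in clock cycles; miss rates are fractions between 0 and 1.
    """
    return t_cache + mr_cache * (t_mm + mr_mm * t_vm)

# Assumed numbers: 1-cycle cache, 100-cycle main memory, 1,000,000-cycle
# hard drive, 10% cache miss rate, 1% main memory miss rate.
print(amat(t_cache=1, t_mm=100, t_vm=1_000_000, mr_cache=0.1, mr_mm=0.01))
```

Note how even a 1% main memory miss rate dominates the result, because $t_{VM}$ is orders of magnitude larger than the other access times.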
8.3 Caches
A cache holds commonly used memory data. The number of data words that it can hold is called the
capacity, $C$. As we explain in the following sections, caches are specified by their capacity ($C$),
number of sets ($S$), block size ($b$), number of blocks ($B$), and degree of associativity ($N$).
8.3.1 What Data is Held in the Cache?
The cache must guess what data will be needed based on the past pattern of memory accesses. In
particular, the cache exploits temporal and spatial locality to achieve a low miss rate. Recall that
temporal locality means that the processor is likely to access a piece of data again soon if it has
accessed that data recently. Spatial locality means that, when the processor accesses a piece of data,
it is also likely to access data in nearby memory locations. Therefore, when the cache fetches one word
from memory, it may also fetch several adjacent words. The number of words in the cache block, $b$, is
called the block size . A cache of capacity $C$ contains $B = \frac{C}{b}$ blocks.
8.3.2 How is Data Found?
A cache is organized into $S$ sets, each of which holds one or more blocks of data. The relationship
between the address of data in main memory and the location of that data in the cache is called the
mapping. Each memory address maps to exactly one set in the cache. We distinguish between the
following types of mappings:
Direct Mapped Cache
A direct mapped cache has one block in each set, so it is organized into $S = B$ sets. An address in
block 0 of main memory maps to set 0 of the cache, an address in block 1 of main memory maps to
set 1 of the cache, and so forth.
Because many addresses map to a single set, the cache must also keep track of the address of the
data actually contained in each set. The two least significant bits of the 32-bit address are called the
byte offset, because they indicate the byte within the word. The next three bits are called the set bits,
because they indicate the set to which the address maps (generally, the number of set bits is
$\log_2 S$). The remaining 27 tag bits indicate the memory address of the data stored in a given cache
set. The cache uses a valid bit for each set to indicate whether the set holds meaningful data. If the
valid bit is 0, the contents are meaningless.
The cache is constructed as an eight-entry SRAM. Each entry, or set, contains one line consisting of 32
bits of data, 27 bits of tag, and 1 valid bit.
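The address breakdown for this eight-set example can be sketched in Python. This is a minimal illustration; the function name is invented, and the field widths follow the 2 byte-offset bits, 3 set bits, and 27 tag bits described above:

```python
S = 8                  # number of sets
SET_BITS = 3           # log2(S)
BYTE_OFFSET_BITS = 2   # byte within a 32-bit word

def decompose(addr):
    """Split a 32-bit byte address into (tag, set index, byte offset)."""
    byte_offset = addr & ((1 << BYTE_OFFSET_BITS) - 1)
    set_index = (addr >> BYTE_OFFSET_BITS) & (S - 1)
    tag = addr >> (BYTE_OFFSET_BITS + SET_BITS)
    return tag, set_index, byte_offset

# Word addresses 0x00 and 0x20 are 8 words apart, so they map to the
# same set with different tags -- a conflict if both are in use.
print(decompose(0x00000000))  # (tag 0, set 0, offset 0)
print(decompose(0x00000020))  # (tag 1, set 0, offset 0)
```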
When two recently accessed addresses map to the same cache block, a conflict occurs, and the most
recently accessed address evicts the previous one from the block.
Multi-way Set Associative Cache
An $N$-way set associative cache reduces conflicts by providing $N$ blocks in each set where data
mapping to that set might be found. Each memory address still maps to a specific set, but it can map
to any one of the $N$ blocks in the set. $N$ is also called the degree of associativity of the cache. The
following figure shows the hardware for a $C=8$-word, $N=2$-way set associative cache. The cache
now has only $S= 4$ sets rather than 8. Thus, only $\log_2 4 = 2$ set bits rather than 3 are used to select
the set.
Set associative caches generally have lower miss rates than direct mapped caches of the same
capacity, because they have fewer conflicts. However, set associative caches are usually slower and
somewhat more expensive to build because of the output multiplexer and additional comparators.
Fully Associative Cache
A fully associative cache contains a single set with $B$ ways, where $B$ is the number of blocks. A
fully associative cache is another name for a $B$-way set associative cache with one set. The
following figure shows the SRAM array of a fully associative cache with eight blocks. Upon data
request, eight tag comparisons must be made, because the data could be in any block.
Fully associative caches tend to have the fewest conflict misses for a given cache capacity, but they
require more hardware for additional tag comparison.
Block Size
To exploit spatial locality, a cache uses larger blocks to hold several consecutive words. The
advantage of a block size greater than one is that when a miss occurs and the word is fetched into the
cache, the adjacent words in the block are also fetched. Therefore, subsequent accesses are more likely
to hit because of spatial locality. The following figure shows the hardware of a $C = 8$-word direct
mapped cache with a $b=4$-word block size. The cache now has only $B = \frac{C}{b} = 2$ blocks.
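A sketch of the field extraction for this $b = 4$-word, $B = 2$-block configuration (illustrative Python; the two block-offset bits sit between the byte offset and the single set bit):

```python
b = 4                   # words per block
B = 2                   # number of blocks (= sets, since the cache is direct mapped)
BYTE_OFFSET_BITS = 2
BLOCK_OFFSET_BITS = 2   # log2(b)
SET_BITS = 1            # log2(B)

def map_address(addr):
    """Split a byte address into (tag, set index, block offset)."""
    block_offset = (addr >> BYTE_OFFSET_BITS) & (b - 1)
    set_index = (addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS)) & (B - 1)
    tag = addr >> (BYTE_OFFSET_BITS + BLOCK_OFFSET_BITS + SET_BITS)
    return tag, set_index, block_offset

# The four consecutive words 0x00, 0x04, 0x08, 0x0C all land in set 0 of the
# same block, so one miss fetches all of them and the next three accesses hit.
print([map_address(a) for a in (0x00, 0x04, 0x08, 0x0C)])
```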
Putting it All Together
Caches are organized as two-dimensional arrays. The rows are called sets, and the columns are called
ways. Each entry in the array consists of a data block and its associated valid and tag bits.
8.3.3 What Data is Replaced?
If a set is full when new data must be loaded, the block in that set is replaced with the new data. In
set associative and fully associative caches, the cache must choose which block to evict when a set is
full. The principle of temporal locality suggests that the best choice is to evict the least recently used
block, because it is least likely to be used again soon. Hence, most associative caches have a least
recently used (LRU) replacement policy.
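The LRU policy for one $N$-way set can be sketched as follows (a minimal Python illustration; the class and method names are invented for this example):

```python
class LRUSet:
    """One N-way set with least recently used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = []              # tags ordered from LRU (front) to MRU (back)

    def access(self, tag):
        """Return True on a hit; on a miss, load the tag, evicting the LRU block if full."""
        if tag in self.blocks:
            self.blocks.remove(tag)   # hit: move tag to the MRU position
            self.blocks.append(tag)
            return True
        if len(self.blocks) == self.ways:
            self.blocks.pop(0)        # miss with full set: evict the least recently used
        self.blocks.append(tag)
        return False

s = LRUSet(ways=2)
s.access(0xA)   # miss
s.access(0xB)   # miss
s.access(0xA)   # hit: 0xB is now the least recently used block
s.access(0xC)   # miss in a full set: evicts 0xB
print(s.blocks) # 0xA and 0xC remain
```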
8.3.4 Advanced Cache Design
Multi Level Caches
Large caches are beneficial because they are more likely to hold data of interest and therefore have
lower miss rates. Modern systems often use at least two levels of caches, as shown in the following
figure. The first-level (L1) cache is small enough to provide a one- or two-cycle access time. The
second-level (L2) cache is also built from SRAM but is larger, and therefore slower, than the L1 cache.
Reducing Miss Rate
Cache misses can be reduced by changing capacity, block size, and/or associativity. We classify the
cache misses into the following categories:
Compulsory miss: The first request to a cache block is called a compulsory miss, because it must be
read from memory regardless of the cache design.
Capacity miss: Occurs when the cache is too small to hold all concurrently used data.
Conflict miss: Caused when several addresses map to the same set and evict blocks that are still
needed.
Changing cache parameters can affect one or more types of cache miss.
8.4 Virtual Memory
Most modern computer systems use a hard drive made of magnetic or solid state storage as the lowest
level in the memory hierarchy. A hard drive made of magnetic storage is also called a hard disk. As the
name implies, the hard disk contains one or more rigid disks or platters, each of which has a
read/write head on the end of a long triangular arm. The head moves to the correct location on the disk
and reads or writes data magnetically as the disk rotates beneath it.
The object of adding a hard drive to the memory hierarchy is to inexpensively give the illusion of a very
large memory while still providing the speed of faster memory for most accesses. A computer with only
128MB of DRAM, for example, could effectively provide 2GB of memory using the hard drive. This larger
2-GB memory is called virtual memory , and the smaller 128-MB main memory is called physical
memory.
Programs can access data anywhere in virtual memory, so they must use virtual addresses that specify
the location in virtual memory. Virtual memory is divided into virtual pages , typically 4KB in size.
Physical memory is likewise divided into physical pages of the same size. A virtual page may be
located in physical memory (DRAM) or on the hard drive. The process of determining the physical
address from the virtual address is called address translation. If the processor attempts to access a
virtual address that is not in physical memory, a page fault occurs, and the operating system loads the
page from the hard drive into physical memory.
8.4.1 Address Translation
In a system with virtual memory, programs use virtual addresses so that they can access a large
memory. The computer must translate these virtual addresses to either find the address in physical
memory or take a page fault and fetch the data from the hard drive. The most significant bits of the
virtual or physical address specify the virtual or physical page number . The least significant bits
specify the word within the page and are called the page offset.
8.4.2 The Page Table
The processor uses a page table to translate virtual addresses to physical addresses. The page table
contains an entry for each virtual page. This entry contains a physical page number and a valid bit. If
the valid bit is $1$, the virtual page maps to the physical page specified in the entry. Otherwise, the
virtual page is found on the hard drive. The page table can be stored anywhere in physical memory.
The processor typically uses a dedicated register, called the page table register , to store the base
address of the page table in physical memory.
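The translation walk can be sketched in Python for 4KB pages (the page table contents below are made-up illustration values, not from the book):

```python
PAGE_OFFSET_BITS = 12            # log2(4KB): low 12 bits are the page offset

page_table = {                   # virtual page number -> (valid bit, physical page number)
    0x00000: (1, 0x7),           # VPN 0 is resident in physical page 7
    0x00001: (0, None),          # valid bit 0: this page lives on the hard drive
}

def translate(vaddr):
    """Return the physical address for vaddr, or raise on a page fault."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    valid, ppn = page_table.get(vpn, (0, None))
    if not valid:
        raise LookupError("page fault: the OS must load the page from the hard drive")
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x00000123)))  # 0x7123: VPN 0 maps to PPN 0x7, offset kept
```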
8.4.3 The Translation Lookaside Buffer
Virtual memory would have a severe performance impact if it required a page table read on every load
or store, doubling the delay of loads and stores. Fortunately, page table accesses have great temporal
locality. In general, the processor can keep the last several page table entries in a small cache called
a translation lookaside buffer (TLB) . The processor "looks aside" to find the translation in the TLB before
having to access the page table in physical memory.
8.4.4 Memory Protection
As we already know, modern computers typically run several programs or processes at the same time.
In a well-designed computer system, the programs should be protected from each other. Specifically,
no program should be able to access another program's memory without permission. This is called
memory protection .
Virtual memory systems provide memory protection by giving each program its own virtual address
space . Each program can use its entire virtual address space without having to worry about where
other programs are physically located. However, a program can access only those physical pages that
are mapped in its page table. In this way, a program cannot accidentally or maliciously access
another program's physical pages.